Controlling Costs
When running Parlant in production, you'll want to balance cost efficiency with response quality and latency. Parlant provides an OptimizationPolicy that lets you fine-tune how the engine allocates resources.
The Cost-Latency Trade-off
Parlant's engine performs several LLM operations when generating a response: guideline matching, tool calling, and message generation. Each of these operations can process multiple items in a single LLM call (batching) or spread them across multiple calls.
Larger batch sizes mean fewer LLM calls, which reduces cost, but increases latency since more items must be processed together before any results are available.
Smaller batch sizes mean more LLM calls, which increases cost but reduces latency, since results come back incrementally. The extra calls may also hit provider rate limits more quickly.
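To make the trade-off concrete, here is a back-of-the-envelope sketch of how batch size affects the number of guideline-matching calls (the counts below are illustrative, not Parlant's defaults):

```python
import math

guideline_count = 24  # hypothetical number of guidelines to evaluate

# A cost-leaning batch size yields fewer, larger calls; a latency-leaning
# one yields more calls whose results can arrive incrementally.
for batch_size in (12, 4):
    calls = math.ceil(guideline_count / batch_size)
    print(f"batch_size={batch_size}: {calls} LLM calls")
# batch_size=12: 2 LLM calls
# batch_size=4:  6 LLM calls
```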
OptimizationPolicy
The OptimizationPolicy interface controls how Parlant optimizes its operations. The default implementation, BasicOptimizationPolicy, provides sensible defaults that balance cost and latency.
Customizing the Policy
To customize the optimization policy, create your own implementation and register it in the container:
```python
import parlant.sdk as p


class CustomOptimizationPolicy(p.BasicOptimizationPolicy):
    def get_guideline_matching_batch_size(
        self,
        guideline_count: int,
        hints: dict = {},
    ) -> int:
        # Use larger batches for cost savings when there are many guidelines.
        if guideline_count <= 20:
            return 5
        else:
            return 10


async def configure_container(container: p.Container) -> p.Container:
    # Register the custom policy so the engine uses it instead of the default.
    container[p.OptimizationPolicy] = CustomOptimizationPolicy()
    return container


async def main() -> None:
    async with p.Server(configure_container=configure_container) as server:
        ...
```
Guideline Matching Batch Size
The get_guideline_matching_batch_size method determines how many guidelines are evaluated in a single LLM call. The default behavior scales with the number of guidelines.
To optimize for cost: Increase batch sizes. This reduces the number of LLM calls but may increase per-request latency.
To optimize for latency: Decrease batch sizes. This allows results to stream back faster but increases the total number of LLM calls.
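For instance, a deployment that prioritizes responsiveness could cap the batch size at a small constant. The sketch below reuses the subclassing pattern shown above; LatencyOptimizedPolicy and the cap of 3 are illustrative choices, not Parlant defaults:

```python
import parlant.sdk as p


class LatencyOptimizedPolicy(p.BasicOptimizationPolicy):
    def get_guideline_matching_batch_size(
        self,
        guideline_count: int,
        hints: dict = {},
    ) -> int:
        # Small batches let the first matching results come back sooner,
        # at the cost of more LLM calls (and potentially earlier rate limits).
        return 3
```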
Embedding Cache
The use_embedding_cache method controls whether Parlant caches embeddings to disk. Caching is enabled by default and significantly reduces costs for repeated queries by avoiding redundant embedding API calls.
```python
import parlant.sdk as p


class NoCacheOptimizationPolicy(p.BasicOptimizationPolicy):
    def use_embedding_cache(self, hints: dict = {}) -> bool:
        return False  # Disable caching
```
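As with any custom policy, this one only takes effect once it is registered in the container. A minimal sketch reusing the configure_container pattern from the earlier example:

```python
async def configure_container(container: p.Container) -> p.Container:
    container[p.OptimizationPolicy] = NoCacheOptimizationPolicy()
    return container
```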