Controlling Costs
When running Parlant in production, you'll want to balance cost efficiency with response quality and latency. Parlant provides an OptimizationPolicy that lets you fine-tune how the engine allocates resources.
The Cost-Latency Trade-off
Parlant's engine performs several LLM operations when generating a response: guideline matching, tool calling, and message generation. Each of these operations can process multiple items in a single LLM call (batching) or spread them across multiple calls.
Larger batch sizes mean fewer LLM calls, which reduces cost, but increases latency since more items must be processed together before any results are available.
Smaller batch sizes mean more LLM calls, which increases cost but reduces latency, since results come back incrementally. The extra calls may also hit provider rate limits more quickly.
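To make the trade-off concrete, here is a back-of-the-envelope sketch of how batch size affects the number of guideline-matching calls (the counts below are illustrative, not Parlant's defaults):

```python
import math

guideline_count = 24  # hypothetical number of guidelines to evaluate

# A cost-leaning batch size yields fewer, larger calls; a latency-leaning
# one yields more calls whose results can arrive incrementally.
for batch_size in (12, 4):
    calls = math.ceil(guideline_count / batch_size)
    print(f"batch_size={batch_size}: {calls} LLM calls")
# batch_size=12: 2 LLM calls
# batch_size=4:  6 LLM calls
```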
OptimizationPolicy
The OptimizationPolicy interface controls how Parlant optimizes its operations. The default implementation, BasicOptimizationPolicy, provides sensible defaults that balance cost and latency.
Customizing the Policy
To customize the optimization policy, create your own implementation and register it in the container:
```python
import parlant.sdk as p


class CustomOptimizationPolicy(p.BasicOptimizationPolicy):
    def get_guideline_matching_batch_size(
        self,
        guideline_count: int,
        hints: dict = {},
    ) -> int:
        # Use larger batches for cost savings when there are many guidelines.
        if guideline_count <= 20:
            return 5
        else:
            return 10


async def configure_container(container: p.Container) -> p.Container:
    # Register the custom policy so the engine uses it instead of the default.
    container[p.OptimizationPolicy] = CustomOptimizationPolicy()
    return container


async def main() -> None:
    async with p.Server(configure_container=configure_container) as server:
        ...
```
Guideline Matching Batch Size
The get_guideline_matching_batch_size method determines how many guidelines are evaluated in a single LLM call. The default behavior scales with the number of guidelines.
To optimize for cost: Increase batch sizes. This reduces the number of LLM calls but may increase per-request latency.
To optimize for latency: Decrease batch sizes. This allows results to stream back faster but increases the total number of LLM calls.
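For instance, a deployment that prioritizes responsiveness could cap the batch size at a small constant. The sketch below reuses the subclassing pattern shown above; LatencyOptimizedPolicy and the cap of 3 are illustrative choices, not Parlant defaults:

```python
import parlant.sdk as p


class LatencyOptimizedPolicy(p.BasicOptimizationPolicy):
    def get_guideline_matching_batch_size(
        self,
        guideline_count: int,
        hints: dict = {},
    ) -> int:
        # Small batches let the first matching results come back sooner,
        # at the cost of more LLM calls (and potentially earlier rate limits).
        return 3
```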
Embedding Cache
The use_embedding_cache method controls whether Parlant caches embeddings to disk. Caching is enabled by default and significantly reduces costs for repeated queries by avoiding redundant embedding API calls.
```python
import parlant.sdk as p


class NoCacheOptimizationPolicy(p.BasicOptimizationPolicy):
    def use_embedding_cache(self, hints: dict = {}) -> bool:
        return False  # Disable caching
```
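As with any custom policy, this one only takes effect once it is registered in the container. A minimal sketch reusing the configure_container pattern from the earlier example:

```python
async def configure_container(container: p.Container) -> p.Container:
    container[p.OptimizationPolicy] = NoCacheOptimizationPolicy()
    return container
```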