CAI Technology
Menu ☰
iris · · 12 min read

Cost-aware LLM routing: how to cut 70% of the bill while keeping quality

Smart routing between premium model for design, mid model for work and local models for polling. Pseudocode, decision tree and measured numbers.

CAI Technology · Last reviewed: 4/30/2026
Cost-aware LLM routing: how to cut 70% of the bill while keeping quality

Cost-aware LLM routing: how we cut the LLM bill by ~70% without sacrificing quality

A common mistake in AI agent architecture is using a single premium model for all tasks. It works in pilot, scales poorly. This approach ignores that different phases of an agent workflow have radically different cognitive requirements. A premium model for designing a complex plan is a good investment; the same model for parsing the output of a status command is waste.

This article describes the cost-aware routing pattern we operate internally, with pseudocode, decision tree and real numbers measured over three months of operation.

TL;DR

Why routing is not a later optimization

There is a temptation to start with “a good model handles everything, we’ll optimize later”. The problem: a premium model has a per-token cost 1-2 orders of magnitude above a mid model, and 2-3 orders above a local model. At 1,000 invocations per day, the bill is visible; at 100,000, it becomes a strategic bottleneck.

Moreover, refactoring from mono-model to multi-model is an architectural change, not a local optimization. You change how a session looks, how you keep context, how you handle tools. Cost-aware design from the start is cheaper than later re-architecting.

The three task classes

All tasks an AI agent executes fit one of three classes, regardless of domain (ops, devsec, finance, customer support).

Class A — Design. Complex, open-ended task with reasoning needs. Example: the user requests “do a cost audit for last month and recommend 3 optimizations”. The model must understand the ambiguous request, query many sources, synthesize, prioritize. Frequency: tens of invocations per day. Acceptable latency: 5-30 seconds.

Class B — Work. Structured task, clear scope, well-defined output. Example: the agent executes step 3 of an already-approved plan — “create a DNS record with values X, Y, Z, validate propagation”. The logic is largely deterministic; the LLM parses outputs, recognizes errors, formats reports. Frequency: hundreds of invocations per hour. Acceptable latency: 1-3 seconds.

Class C — Polling/Triage. Repetitive, very limited task. Example: the agent reads the status of 100 servers, decides whether there is an anomaly worth alerting. 99% of the time, the answer is “nothing interesting”. Frequency: thousands of invocations per hour. Acceptable latency: 0.1-0.5 seconds.

Mapping to model classes

Class A → premium model. Largest context, best reasoning, highest per-token cost. Used only for design, under 10% of total volume. The bill per invocation is large, but small volume keeps the total under control.

Class B → mid model. Decent context, good tool-use capability, moderate cost. Used for actual work. Represents the bulk of invocation volume that matters.

Class C → small local model. Self-hosted, zero cost per invocation after fixed infrastructure cost. Capacity sufficient for triage and pattern recognition. Large volume does not push the bill.

Pseudocode: the router

def route(task: AgentTask) -> Model:
    # Explicit rules before any classifier
    if task.is_polling_or_status_check():
        return Models.LOCAL_SMALL

    if task.has_user_intent_freeform():
        # Asks for design, not deterministic steps
        return Models.PREMIUM

    if task.is_step_in_approved_plan():
        # Plan already approved → structured execution
        return Models.MEDIUM

    if task.context_tokens > 100_000:
        # Long context requires premium model anyway
        return Models.PREMIUM

    if task.requires_high_creativity():
        return Models.PREMIUM

    # Default: medium for work, premium if uncertain
    if task.confidence_in_classification < 0.7:
        return Models.PREMIUM

    return Models.MEDIUM

Notice: classification is on task metadata, not on input. If the task is “step 3 of an approved plan”, that is clearly work; if it is “the user wrote a new request”, it is clearly design.

Practical decision tree

                 Task received
                      |
        is "polling" or status check?
              |              |
            YES            NO
              |              |
        Local model     has freeform user intent?
                              |          |
                            YES         NO
                              |          |
                       Premium model   step in approved plan?
                                            |          |
                                          YES         NO
                                            |          |
                                       Mid model   context > 100k tokens?
                                                          |          |
                                                        YES         NO
                                                          |          |
                                                Premium model   Mid model

This tree has 3 decision nodes and covers over 95% of cases. For the rest, default is premium (safer, more expensive) or medium (cheaper, assuming a decent classifier).

Real numbers

In internal IRIS operation over three months:

The 69% reduction is not a projected estimate; it is measured. The difference up to the announced “70%” sits within monthly variance.

Common routing mistakes

Mistake 1: routing on input length. “Short input → cheap model” is a poor approximation. A short request can be extremely ambiguous and require premium. Length is not a proxy for complexity.

Mistake 2: routing on the user asking. The temptation to give a CEO the premium model and a junior the cheap one is easy and wrong. The task, not the user, determines the route.

Mistake 3: lack of fallback. If the cheap model fails (bad output, wrong parsing), no repeated retries on the cheap model. Automatic escalation to the premium model for a single retry. No infinite retry on cheap.

Mistake 4: premature optimization. Before implementing routing, measure. If your agent does 500 invocations per day and mono-model premium works economically, routing is over-engineering. Typical threshold from which it pays off is 10,000+ invocations per day.

Technical challenges

Context sharing between models. If the premium model started a conversation and the mid model must continue, the context must be transmitted. Two approaches: (a) all models receive the same context prompt; (b) a short summary is generated by the premium model and passed to the mid model. Variant (b) is cheaper but loses nuance; which you choose depends on the domain.

Prompt caching. Premium models now offer prompt caching with significant discount for repeated tokens. If the same system prompt appears 1000 times per day, caching reduces cost without routing. Combine: caching plus routing gives the best results.

Latency budget. A small local model can be slower in absolute terms than a small-context premium cloud model. For polling, latency matters little (it runs in background); for interactive design, latency matters a lot.

Conclusion

Cost-aware routing is not an esoteric technique. It is the simple recognition that different tasks have different requirements and that uniform treatment is waste. The propose-then-act pattern, described in a previous article, complements routing: phase separation makes obvious where the premium model’s cost is justified and where a smaller model is enough.

The bill reduction is not the main reason we recommend this pattern. The main reason is it lets us run an agent at scale without architecture being blocked by cost. We can accept large volumes, run continuous polling, run long procedures without checking the bill daily.

External sources

Next step

If your team operates an AI agent and wants to evaluate cost-aware routing potential on your own workload, we offer a 30-minute technical analysis at no cost.

We start with a 30-minute conversation.

Free AI-readiness audit for companies with 50+ employees. We reply within 24 hours.