Cost-aware LLM routing: how to cut 70% of the bill while keeping quality
Smart routing between premium model for design, mid model for work and local models for polling. Pseudocode, decision tree and measured numbers.
Cost-aware LLM routing: how we cut the LLM bill by ~70% without sacrificing quality
A common mistake in AI agent architecture is using a single premium model for all tasks. It works in pilot, scales poorly. This approach ignores that different phases of an agent workflow have radically different cognitive requirements. A premium model for designing a complex plan is a good investment; the same model for parsing the output of a status command is waste.
This article describes the cost-aware routing pattern we operate internally, with pseudocode, decision tree and real numbers measured over three months of operation.
TL;DR
- An AI agent does not have one type of task. It has at least three: design (rare, complex), work (frequent, semi-deterministic), polling (constant, trivial).
- Each task type maps to a different class of model: premium, mid, small local.
- Routing happens via a simple classifier (heuristics plus, optionally, a small dedicated model), not via the user picking manually.
- On our internal configuration, the 5% premium / 25% mid / 70% local distribution reduced aggregate cost to ~30% of baseline.
- Aggregate quality (success rate, internal satisfaction) is indistinguishable from mono-model premium.
Why routing is not a later optimization
There is a temptation to start with “a good model handles everything, we’ll optimize later”. The problem: a premium model has a per-token cost 1-2 orders of magnitude above a mid model, and 2-3 orders above a local model. At 1,000 invocations per day, the bill is visible; at 100,000, it becomes a strategic bottleneck.
Moreover, refactoring from mono-model to multi-model is an architectural change, not a local optimization. You change how a session looks, how you keep context, how you handle tools. Cost-aware design from the start is cheaper than later re-architecting.
The three task classes
All tasks an AI agent executes fit one of three classes, regardless of domain (ops, devsec, finance, customer support).
Class A — Design. Complex, open-ended task with reasoning needs. Example: the user requests “do a cost audit for last month and recommend 3 optimizations”. The model must understand the ambiguous request, query many sources, synthesize, prioritize. Frequency: tens of invocations per day. Acceptable latency: 5-30 seconds.
Class B — Work. Structured task, clear scope, well-defined output. Example: the agent executes step 3 of an already-approved plan — “create a DNS record with values X, Y, Z, validate propagation”. The logic is largely deterministic; the LLM parses outputs, recognizes errors, formats reports. Frequency: hundreds of invocations per hour. Acceptable latency: 1-3 seconds.
Class C — Polling/Triage. Repetitive, very limited task. Example: the agent reads the status of 100 servers, decides whether there is an anomaly worth alerting. 99% of the time, the answer is “nothing interesting”. Frequency: thousands of invocations per hour. Acceptable latency: 0.1-0.5 seconds.
Mapping to model classes
Class A → premium model. Largest context, best reasoning, highest per-token cost. Used only for design, under 10% of total volume. The bill per invocation is large, but small volume keeps the total under control.
Class B → mid model. Decent context, good tool-use capability, moderate cost. Used for actual work. Represents the bulk of invocation volume that matters.
Class C → small local model. Self-hosted, zero cost per invocation after fixed infrastructure cost. Capacity sufficient for triage and pattern recognition. Large volume does not push the bill.
Pseudocode: the router
def route(task: AgentTask) -> Model:
# Explicit rules before any classifier
if task.is_polling_or_status_check():
return Models.LOCAL_SMALL
if task.has_user_intent_freeform():
# Asks for design, not deterministic steps
return Models.PREMIUM
if task.is_step_in_approved_plan():
# Plan already approved → structured execution
return Models.MEDIUM
if task.context_tokens > 100_000:
# Long context requires premium model anyway
return Models.PREMIUM
if task.requires_high_creativity():
return Models.PREMIUM
# Default: medium for work, premium if uncertain
if task.confidence_in_classification < 0.7:
return Models.PREMIUM
return Models.MEDIUM
Notice: classification is on task metadata, not on input. If the task is “step 3 of an approved plan”, that is clearly work; if it is “the user wrote a new request”, it is clearly design.
Practical decision tree
Task received
|
is "polling" or status check?
| |
YES NO
| |
Local model has freeform user intent?
| |
YES NO
| |
Premium model step in approved plan?
| |
YES NO
| |
Mid model context > 100k tokens?
| |
YES NO
| |
Premium model Mid model
This tree has 3 decision nodes and covers over 95% of cases. For the rest, default is premium (safer, more expensive) or medium (cheaper, assuming a decent classifier).
Real numbers
In internal IRIS operation over three months:
- Total volume: about 1.4M LLM invocations
- Distribution by class: 4.8% premium, 23.7% medium, 71.5% local
- Normalized aggregate cost: 31% versus mono-model premium scenario
- Tracked errors at executor (when mid model got something wrong that premium would have caught): 0.4% of total. Automatic re-routing to premium for retry.
- Internal satisfaction (qualitatively, weekly evaluation): identical to mono-model premium
The 69% reduction is not a projected estimate; it is measured. The difference up to the announced “70%” sits within monthly variance.
Common routing mistakes
Mistake 1: routing on input length. “Short input → cheap model” is a poor approximation. A short request can be extremely ambiguous and require premium. Length is not a proxy for complexity.
Mistake 2: routing on the user asking. The temptation to give a CEO the premium model and a junior the cheap one is easy and wrong. The task, not the user, determines the route.
Mistake 3: lack of fallback. If the cheap model fails (bad output, wrong parsing), no repeated retries on the cheap model. Automatic escalation to the premium model for a single retry. No infinite retry on cheap.
Mistake 4: premature optimization. Before implementing routing, measure. If your agent does 500 invocations per day and mono-model premium works economically, routing is over-engineering. Typical threshold from which it pays off is 10,000+ invocations per day.
Technical challenges
Context sharing between models. If the premium model started a conversation and the mid model must continue, the context must be transmitted. Two approaches: (a) all models receive the same context prompt; (b) a short summary is generated by the premium model and passed to the mid model. Variant (b) is cheaper but loses nuance; which you choose depends on the domain.
Prompt caching. Premium models now offer prompt caching with significant discount for repeated tokens. If the same system prompt appears 1000 times per day, caching reduces cost without routing. Combine: caching plus routing gives the best results.
Latency budget. A small local model can be slower in absolute terms than a small-context premium cloud model. For polling, latency matters little (it runs in background); for interactive design, latency matters a lot.
Conclusion
Cost-aware routing is not an esoteric technique. It is the simple recognition that different tasks have different requirements and that uniform treatment is waste. The propose-then-act pattern, described in a previous article, complements routing: phase separation makes obvious where the premium model’s cost is justified and where a smaller model is enough.
The bill reduction is not the main reason we recommend this pattern. The main reason is it lets us run an agent at scale without architecture being blocked by cost. We can accept large volumes, run continuous polling, run long procedures without checking the bill daily.
Related articles
- The propose-then-act architecture for AI agents
- MCP server design patterns for AI agents
- Pillar IRIS — the CAI Technology orchestrator agent
- Pillar Consulting — AI assessment
External sources
- Anthropic Claude model overview — capabilities and pricing for premium and mid model classes
- OpenAI model selection guide — reference for task-based model selection
- Anthropic prompt caching — cost reduction for repeated system prompts
- “FrugalGPT” — Chen et al., arXiv 2305.05176 — seminal research on economical routing across LLMs
- Hugging Face Open LLM Leaderboard — reference for capability of open-source models used locally
Next step
If your team operates an AI agent and wants to evaluate cost-aware routing potential on your own workload, we offer a 30-minute technical analysis at no cost.