Observability of AI agents: what to monitor in production
AI agent dashboard: tokens consumed, latency p50/p95/p99, hallucination rate, tool-use success, audit-log completeness. A practical template.
Observability of AI agents: the minimum dashboard to operate an agent in production
An AI agent that works in pilot and an AI agent that operates daily in production look similar to the user. To the team that maintains it, they are fundamentally different. The difference is not in the code; it is in observability. Without a rigorous set of metrics, an agent in production becomes a black box that “sometimes works” and on which you cannot make informed decisions.
This article describes the metrics we track for every AI agent we operate internally and for clients, with details on how to compute them, what thresholds are reasonable, and how we interpret the values.
TL;DR
- Five families of metrics are mandatory: cost (tokens), latency, quality (hallucination, tool-use success), audit completeness, business impact.
- “Hallucination rate” is not measured automatically; it is measured by manual weekly sampling on 1-2% of responses.
- Latency must be tracked p50/p95/p99, not average. Average masks the long tails that frustrate users.
- Tool-use success is an early indicator of problems: a drop below 90% means something broke (the model was swapped, a schema changed, or upstream infrastructure is failing).
- The dashboard should read in 30 seconds. More than 8-10 widgets becomes noise.
The five metric families
1. Cost — tokens consumed
For an operational agent, cost is a primary metric. Components:
- Input tokens per day, per model, per agent session
- Output tokens per day, per model, per agent session
- Equivalent cost in EUR/USD per day (computed from the provider’s pricing)
- Distribution by model class (premium / mid / local) for cost-aware routing
Typical threshold: investigate any sudden increase of more than 30% over the previous-week baseline that has no business explanation (such as higher request volume). Frequent causes: the model was swapped to a more expensive default, the prompt grew (for example, a system prompt accidentally doubled), or a context loop does not close.
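A minimal sketch of the cost computation and the 30% check, assuming daily token counts are already aggregated per model class. The prices, the model class names, and the function names are hypothetical placeholders, not real provider pricing; the baseline is taken here as the mean daily cost of the previous week, which is one reasonable reading of "previous-week baseline".

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; replace with your provider's pricing.
PRICES_PER_MTOK = {
    "premium": {"input": 10.00, "output": 30.00},
    "mid": {"input": 1.00, "output": 3.00},
}


@dataclass
class DailyUsage:
    model_class: str
    input_tokens: int
    output_tokens: int


def daily_cost(usages):
    """Sum the cost of one day across all model classes."""
    total = 0.0
    for u in usages:
        price = PRICES_PER_MTOK[u.model_class]
        total += u.input_tokens / 1_000_000 * price["input"]
        total += u.output_tokens / 1_000_000 * price["output"]
    return total


def cost_spike(today_cost, previous_week_costs, threshold=0.30):
    """True if today's cost exceeds the previous-week daily mean by more than `threshold`."""
    baseline = sum(previous_week_costs) / len(previous_week_costs)
    return baseline > 0 and (today_cost - baseline) / baseline > threshold
```

A spike flagged this way is only a prompt to investigate: compare it against request volume before treating it as a problem.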
2. Latency — p50, p95, p99
Average latency is misleading. The distribution is skewed: roughly 90% of requests are fast, 5-10% are slow, and 1-2% are very slow. Use percentiles:
- p50 (median): how long half the requests take. If it grows, the baseline itself has degraded.
- p95: 5% of requests are slower. Our typical threshold: under 8 seconds for work, under 30 seconds for design.
- p99: 1% of requests are slower. These directly frustrate users. Typical threshold: under 60 seconds.
Component breakdown: time in LLM (network plus inference), time in tool execution, time in I/O (file system, DB). Identifying the dominant component directs optimization.
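To make the percentile argument concrete, here is a small sketch using the nearest-rank definition of a percentile. The function names and the component keys (`llm`, `tools`, `io`) are illustrative assumptions; in practice a metrics backend usually computes percentiles from histograms rather than from raw samples.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_ms):
    """Mean plus the three percentiles tracked on the dashboard."""
    return {
        "mean": sum(latencies_ms) / len(latencies_ms),
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
    }


def dominant_component(breakdown_ms):
    """Return the component with the largest share of the total time."""
    return max(breakdown_ms, key=breakdown_ms.get)
```

For a synthetic sample of 90 requests at 1 s, 8 at 12 s, and 2 at 60 s, the mean is about 3 s while p95 is 12 s and p99 is 60 s, which is the kind of long tail an average hides.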
3. Quality — hallucination and tool-use
Hallucination rate. The frequency with which the agent produces false statements or fabrications not in the context. Not measured automatically (one model cannot reliably check whether another hallucinates). Our procedure:
- Weekly sampling: random 1-2% of agent responses
- A human reviewer reads the response and the context
- Marks: correct / partial / hallucinated
- We compute the rate: hallucinated / (correct + partial + hallucinated)
Threshold: rate under 1% for factual tasks (lookup, status). Threshold under 3% for reasoning tasks (analysis, recommendation). Above this, the agent is not production-ready.
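The rate and the two thresholds above can be expressed in a few lines. This is a sketch, assuming reviewer verdicts are stored as the plain strings `correct`, `partial`, and `hallucinated`; the task type labels and function names are hypothetical.

```python
from collections import Counter

# Thresholds from the text: under 1% for factual tasks, under 3% for reasoning tasks.
THRESHOLDS = {"factual": 0.01, "reasoning": 0.03}


def hallucination_rate(verdicts):
    """hallucinated / (correct + partial + hallucinated) over one reviewed sample."""
    counts = Counter(verdicts)
    total = counts["correct"] + counts["partial"] + counts["hallucinated"]
    if total == 0:
        raise ValueError("no reviewed responses")
    return counts["hallucinated"] / total


def production_ready(verdicts, task_type):
    """True if the sampled rate is under the threshold for this task type."""
    return hallucination_rate(verdicts) < THRESHOLDS[task_type]
```

Keep the sample size in mind when reading the result: with 100 reviewed responses, a single hallucinated one already puts the rate at 1%.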
Tool-use success rate. The frequency with which the agent calls the right tool with valid parameters that produce a useful result. Automatic computation:
- Total tool calls
- Tool calls failed with structured error (validation, permission, upstream)
- Tool calls failed without recovery (the agent did not retry or did not switch to another tool)
- Successful tool calls
Typical threshold: above 92% success. Below 90%, something is broken.
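A sketch of the automatic computation, assuming each tool call is logged with a `status` field taking one of three hypothetical values: `ok`, `structured_error`, or `unrecovered`. The "watch" band between 90% and 92% is our own labeling for the gap between the two thresholds, not something the thresholds themselves define.

```python
from collections import Counter


def tool_use_report(calls):
    """Count tool calls by outcome and classify the success rate."""
    counts = Counter(call["status"] for call in calls)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no tool calls in this window")
    rate = counts["ok"] / total
    if rate > 0.92:
        health = "healthy"
    elif rate >= 0.90:
        health = "watch"  # assumption: band between the two stated thresholds
    else:
        health = "broken"
    return {
        "total": total,
        "ok": counts["ok"],
        "structured_error": counts["structured_error"],
        "unrecovered": counts["unrecovered"],
        "success_rate": rate,
        "health": health,
    }
```

Keeping the two failure counters separate is useful when the rate drops: a rise in structured errors points at schemas or permissions, while a rise in unrecovered failures points at the agent's retry behavior.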
4. Audit completeness
For agents performing impactful actions (especially with propose-then-act), audit is mandatory. Metrics:
- Percent actions with full plan logged (target: 100%)
- Percent actions with logged approval (target: 100% for actions requiring approval)
- Percent actions with post-execution result logged (target: 100%)
- Percent sessions with full trace (input → plan → approval → execution → result)
Below 100% means risk of litigation or non-compliance. The only acceptable tolerance is for actions with zero impact (status, info).
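The first three percentages can be computed directly from the action log. The sketch below assumes each action record is a dict with hypothetical keys `plan`, `approval`, `result`, and `requires_approval`; an empty denominator is reported as 100% because there is nothing missing.

```python
def pct(part, whole):
    """Percentage, treating an empty denominator as fully complete."""
    return 100.0 if whole == 0 else 100.0 * part / whole


def audit_completeness(actions):
    """Share of actions with a logged plan, approval (where required), and result."""
    needs_approval = [a for a in actions if a.get("requires_approval")]
    return {
        "plan_logged_pct": pct(sum(1 for a in actions if a.get("plan")), len(actions)),
        "approval_logged_pct": pct(
            sum(1 for a in needs_approval if a.get("approval")), len(needs_approval)
        ),
        "result_logged_pct": pct(sum(1 for a in actions if a.get("result")), len(actions)),
    }


def has_gap(report):
    """True if any audit metric is below the 100% target."""
    return any(value < 100.0 for value in report.values())
```

Zero-impact actions (status, info) would be filtered out before calling `audit_completeness`, since they are the only ones where a gap is tolerated.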
5. Business impact
Metrics showing whether the agent delivers value:
- Requests serviced per day
- Failed requests (user abandoned or redid manually)
- Time saved (estimated) per request
- User satisfaction (qualitative, weekly evaluation on a sample)
- Volumes of executed procedures (deploys, alerts handled, tickets resolved) — depends on the agent
These metrics are harder to standardize but are the ones that justify the agent to management.
Typical dashboard layout
Our dashboard for an agent has 8 widgets in this reading order:
[ Daily cost 7 days ] [ Current monthly cost ]
[ Latency p50/p95/p99 ] [ Tool-use success % ]
[ Impactful actions: plan / approve / execute ]
[ Top 5 errors in last 24h ]
[ Active sessions now ] [ Failed requests last hour ]
Reads in 30 seconds: cost trend OK, latency within limits, tool-use above threshold, all impactful actions audited, no top errors, normal active sessions, no failure spike.
Alerting
Automatic alerting must be conservative. An agent in production has natural noise; false alarms every hour are more damaging than no alerts at all.
Our typical thresholds:
- Latency p95 > 2x baseline for 15 minutes → alert
- Tool-use success < 85% for 30 minutes → alert
- Daily cost > 1.5x baseline → alert (may be a legitimate spike, but check)
- Audit completeness < 100% for one day → critical alert
- Hallucination rate > 5% in weekly sampling → critical alert (with human reviewer)
- Upstream errors (LLM provider) > 10% → alert
All alerts are routed to an ops channel with context. None resolves itself; all require human triage.
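Most of these rules share one shape: a condition that must hold continuously for a window before the alert fires. A minimal sketch of that check, assuming samples arrive as `(minute, value)` pairs sorted by time; the function name is hypothetical, and in a Prometheus-based setup the same idea is normally expressed in the alerting rule itself rather than in application code.

```python
def breached_for(samples, predicate, window_minutes):
    """True if `predicate` holds for every sample in the trailing window.

    Returns False when the history does not yet cover the whole window,
    so a single bad sample right after startup does not fire an alert.
    """
    if not samples:
        return False
    latest = samples[-1][0]
    start = latest - window_minutes
    if samples[0][0] > start:
        return False  # not enough history to cover the window
    window = [value for minute, value in samples if minute >= start]
    return all(predicate(value) for value in window)
```

For the latency rule, the predicate would be `lambda v: v > 2 * baseline_p95` with a 15-minute window; for tool-use success, `lambda v: v < 0.85` with a 30-minute window.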
Sampling for hallucination
A critical detail: hallucination rate without manual sampling is unmeasurable. Our procedure:
- Random selection of 50-100 sessions per week
- A human reviewer reads the agent’s response plus the context (user input, intermediate tool outputs)
- Marks verdict: correct / partial / hallucinated
- Notes the pattern (if hallucinated: what type — fabricated facts, wrong attributions, incorrect aggregation)
- Tracks the rate as a weekly trend
The investment: 2-4 reviewer hours per week. For an agent serving real operations, this is far less than the cost of a single incident caused by undetected hallucination.
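The selection and pattern-tracking steps can be sketched as follows. This is an illustration only, assuming session ids are available as a list and each review is a dict with a `verdict` and an optional `pattern`; the function names are hypothetical. A fixed seed makes a given week's sample reproducible for later audits.

```python
import random
from collections import Counter


def sample_sessions(session_ids, k=50, seed=None):
    """Pick up to k distinct sessions uniformly at random for human review."""
    rng = random.Random(seed)
    unique = sorted(set(session_ids))
    return rng.sample(unique, min(k, len(unique)))


def pattern_counts(reviews):
    """Count hallucination types among the reviews marked as hallucinated."""
    return Counter(
        review.get("pattern", "unspecified")
        for review in reviews
        if review["verdict"] == "hallucinated"
    )
```

Comparing `pattern_counts` week over week shows whether one type of failure, such as incorrect aggregation, is growing even when the overall rate looks stable.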
Common mistakes
Mistake 1: monitor only cost. Many teams build a tokens / EUR dashboard and consider observability done. Cost is easy to measure, quality is hard — which is exactly what makes it important.
Mistake 2: averages instead of percentiles. Average latency can be 2 seconds with p95 at 30 seconds. Users in p95 are the ones abandoning the product.
Mistake 3: alerts on everything. False alarms numb attention. Thresholds must be tuned empirically, not set to “round” values.
Mistake 4: lack of structured logs. Free-text logs are not queryable. Use structured JSON with fixed fields: timestamp, agent_id, session_id, action_type, model_used, tokens_in, tokens_out, latency_ms, status.
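A sketch of a log writer that enforces exactly those fields, using only the standard library. The function name and the example values are hypothetical; the timestamp is filled in as UTC ISO 8601 when the caller does not provide one.

```python
import json
from datetime import datetime, timezone

LOG_FIELDS = (
    "timestamp", "agent_id", "session_id", "action_type", "model_used",
    "tokens_in", "tokens_out", "latency_ms", "status",
)


def make_log_line(**fields):
    """Return one JSON log line, rejecting missing or unexpected fields."""
    record = {"timestamp": datetime.now(timezone.utc).isoformat(), **fields}
    missing = [name for name in LOG_FIELDS if name not in record]
    extra = [name for name in record if name not in LOG_FIELDS]
    if missing or extra:
        raise ValueError(f"missing fields: {missing}, unexpected fields: {extra}")
    return json.dumps(record, sort_keys=True)


# Example with made-up values:
line = make_log_line(
    agent_id="ops-agent", session_id="s-123", action_type="tool_call",
    model_used="mid", tokens_in=812, tokens_out=164, latency_ms=2350, status="ok",
)
```

Rejecting unexpected fields is a deliberate choice here: a fixed schema keeps queries stable, at the cost of requiring a schema change whenever a new field is needed.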
Mistake 5: lack of retention strategy. Logs grow fast. Define: 30-day fine-grained retention, 12-month aggregations, annual summary. Storage becomes a real cost otherwise.
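One way to express that policy is a function mapping the age of a log record to the tier it belongs in. This is a sketch under assumptions: the tier names are hypothetical, and 12 months is approximated as 365 days.

```python
def retention_tier(age_days):
    """Map the age of a log record to its retention tier."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days <= 30:
        return "raw"  # fine-grained, per-request records
    if age_days <= 365:
        return "aggregate"  # rolled-up aggregations, raw records deleted
    return "annual_summary"
```

A scheduled job would call this per record or per partition and roll up or delete accordingly.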
Tooling
Our internal stack is modular:
- Structured logs written as JSON to disk
- Aggregation with Loki or equivalent
- Numeric metrics through Prometheus
- Visualization in Grafana
- Hallucination sampling through a Python script that reads the logs and opens a review UI for the reviewer
- Alerting via a dedicated ops Telegram channel
The exact stack matters less than the metrics tracked. Start with a simple dashboard, iterate.
Conclusion
An AI agent without observability is an agent that will fail at something you will not notice until a user complains. The five metric families (cost, latency, quality, audit, business) cover almost every failure we have seen in practice. The investment is days at first setup and hours per week in operation. The benefit is that the agent becomes something you can entrust to your team — not a black box that “usually works”.