iris · April 30, 2026 · 12 min read

Observability of AI agents: what to monitor in production

AI agent dashboard: tokens consumed, latency p50/p95/p99, hallucination rate, tool-use success, audit-log completeness. A practical template.

CAI Technology · Last reviewed: 4/30/2026

Observability of AI agents: the minimum dashboard to operate an agent in production

An AI agent that works in a pilot and an AI agent that operates daily in production look similar to the user. To the team that maintains them, they are fundamentally different. The difference is not in the code; it is in observability. Without a rigorous set of metrics, an agent in production becomes a black box that “sometimes works” and about which you cannot make informed decisions.

This article describes the metrics we track for every AI agent we operate internally and for clients, with details on how to compute them, what thresholds are reasonable, and how we interpret the values.

TL;DR

Five families of metrics are mandatory: cost (tokens), latency, quality (hallucination, tool-use success), audit completeness, business impact.
“Hallucination rate” is not measured automatically; it is measured by manual weekly sampling on 1-2% of responses.
Latency must be tracked p50/p95/p99, not average. Average masks the long tails that frustrate users.
Tool-use success is an early indicator of problems: a drop below 90% means something broke (the model was swapped, a schema changed, or upstream infrastructure failed).
The dashboard should read in 30 seconds. More than 8-10 widgets becomes noise.

The five metric families

1. Cost — tokens consumed

For an operational agent, cost is a primary metric. Components:

Input tokens per day, per model, per agent session
Output tokens per day, per model, per agent session
Equivalent cost in EUR/USD per day (computed from the provider’s pricing)
Distribution by model class (premium / mid / local) for cost-aware routing

Typical threshold: if daily cost rises more than 30% above the previous-week baseline without a business explanation (such as higher request volume), investigate. Frequent causes: the model was swapped to a more expensive default, the prompt grew (for example, a system prompt accidentally doubled), or a context loop does not close.
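The cost components and the 30% check can be sketched as follows. This is a minimal illustration, not our production code: the prices in PRICES_PER_MTOK are placeholder numbers, the DailyUsage record is a hypothetical shape, and the baseline is assumed to be the mean daily cost of the previous week.

```python
from dataclasses import dataclass

# Placeholder prices in USD per million tokens (assumption: replace with
# your provider's actual price list).
PRICES_PER_MTOK = {
    "premium": {"input": 10.0, "output": 30.0},
    "mid": {"input": 1.0, "output": 3.0},
    "local": {"input": 0.0, "output": 0.0},
}


@dataclass
class DailyUsage:
    """Hypothetical per-day usage record for one model class."""
    model_class: str
    tokens_in: int
    tokens_out: int


def daily_cost(usages):
    """Sum the cost of one day's usage records across model classes."""
    total = 0.0
    for u in usages:
        price = PRICES_PER_MTOK[u.model_class]
        total += u.tokens_in / 1_000_000 * price["input"]
        total += u.tokens_out / 1_000_000 * price["output"]
    return total


def cost_spike(today_cost, previous_week_costs, threshold=0.30):
    """True if today's cost exceeds the previous-week mean by more than threshold."""
    baseline = sum(previous_week_costs) / len(previous_week_costs)
    if baseline == 0:
        return today_cost > 0
    return (today_cost - baseline) / baseline > threshold
```

A spike flagged this way still needs the business check described above: if request volume rose by the same proportion, the increase is expected.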

2. Latency — p50, p95, p99

Average latency is misleading. The distribution is skewed: most requests are fast, 5-10% are slow, and 1-2% are very slow. Use percentiles:

p50 (median): half of the requests finish within this time. If it grows, the baseline has degraded.
p95: 95% of requests finish within this time; the slowest 5% take longer. Our typical threshold: under 8 seconds for work, under 30 seconds for design.
p99: 99% of requests finish within this time; the slowest 1% take longer. These are the requests that directly frustrate users. Typical threshold: under 60 seconds.

Component breakdown: time in LLM (network plus inference), time in tool execution, time in I/O (file system, DB). Identifying the dominant component directs optimization.
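To make the gap between the average and the percentiles concrete, here is a small sketch using the nearest-rank percentile definition. The function names are illustrative, and other percentile definitions (with interpolation) give slightly different values on small samples.

```python
def percentile(sorted_values, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not sorted_values:
        raise ValueError("no samples")
    n = len(sorted_values)
    rank = max(1, (p * n + 99) // 100)  # integer ceiling of p * n / 100
    return sorted_values[rank - 1]


def latency_summary(latencies_ms):
    """Return the mean next to p50/p95/p99 so the difference is visible."""
    values = sorted(latencies_ms)
    return {
        "mean": sum(values) / len(values),
        "p50": percentile(values, 50),
        "p95": percentile(values, 95),
        "p99": percentile(values, 99),
    }


# 90 fast requests, 8 slow ones, 2 very slow ones (all in milliseconds).
sample = [1000] * 90 + [5000] * 8 + [40000] * 2
summary = latency_summary(sample)
```

For this sample the mean is 2100 ms, which looks healthy, while p95 is 5000 ms and p99 is 40000 ms. The same function can be run separately on LLM time, tool time, and I/O time to find the dominant component.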

3. Quality — hallucination and tool-use

Hallucination rate. The frequency with which the agent produces false statements or fabrications not in the context. Not measured automatically (one model cannot reliably check whether another hallucinates). Our procedure:

Weekly sampling: random 1-2% of agent responses
A human reviewer reads the response and the context
Marks: correct / partial / hallucinated
We compute the rate: hallucinated / (correct + partial + hallucinated)

Threshold: rate under 1% for factual tasks (lookup, status). Threshold under 3% for reasoning tasks (analysis, recommendation). Above this, the agent is not production-ready.
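The rate and the readiness check can be written down directly from the reviewer's verdicts. A minimal sketch, assuming verdicts are stored as the three strings used above; the task-type keys "factual" and "reasoning" are illustrative labels.

```python
from collections import Counter

# Thresholds taken from the text: under 1% for factual tasks,
# under 3% for reasoning tasks.
THRESHOLDS = {"factual": 0.01, "reasoning": 0.03}


def hallucination_rate(verdicts):
    """hallucinated / (correct + partial + hallucinated)."""
    counts = Counter(verdicts)
    total = counts["correct"] + counts["partial"] + counts["hallucinated"]
    if total == 0:
        raise ValueError("no reviewed responses")
    return counts["hallucinated"] / total


def production_ready(verdicts, task_type):
    """True if the sampled rate is under the threshold for this task type."""
    return hallucination_rate(verdicts) < THRESHOLDS[task_type]
```

Keep the sample size in mind when reading the result: with 100 reviewed responses, a single hallucinated verdict moves the rate by a full percentage point, so the weekly trend is more informative than any one week.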

Tool-use success rate. The frequency with which the agent calls the right tool with valid parameters that produce a useful result. Automatic computation:

Total tool calls
Tool calls failed with structured error (validation, permission, upstream)
Tool calls failed without recovery (the agent did not retry or did not switch to another tool)
Successful tool calls

Typical threshold: above 92% success. Below 90%, something is broken.
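A sketch of the automatic computation, assuming each tool call is logged with one of three status strings. The labels "success", "error_structured", and "error_unrecovered" are hypothetical, and the intermediate "watch" band between 90% and 92% is our reading of the two thresholds rather than a separate rule.

```python
from collections import Counter


def tool_use_summary(statuses):
    """Summarize tool-call outcomes from a list of status strings."""
    counts = Counter(statuses)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no tool calls")
    rate = counts["success"] / total
    if rate > 0.92:
        health = "ok"
    elif rate >= 0.90:
        health = "watch"   # between the two thresholds from the text
    else:
        health = "broken"  # below 90%: something is broken
    return {
        "total": total,
        "failed_structured": counts["error_structured"],
        "failed_unrecovered": counts["error_unrecovered"],
        "successful": counts["success"],
        "success_rate": rate,
        "health": health,
    }
```

Keeping the two failure counts separate matters in triage: structured errors usually point at a schema or permission change, while unrecovered failures point at the agent's own retry logic.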

4. Audit completeness

For agents performing impactful actions (especially with propose-then-act), audit is mandatory. Metrics:

Percentage of actions with the full plan logged (target: 100%)
Percentage of actions with a logged approval (target: 100% for actions requiring approval)
Percentage of actions with the post-execution result logged (target: 100%)
Percentage of sessions with a full trace (input → plan → approval → execution → result)

Below 100% means risk of litigation or non-compliance. The only acceptable tolerance is for actions with zero impact (status, info).
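The first three percentages can be computed from the action log along these lines. This is a sketch under assumptions: the record keys (impactful, requires_approval, plan, approval, result) are hypothetical, and zero-impact actions are excluded in line with the tolerance described above.

```python
def audit_completeness(actions):
    """Compute audit percentages over impactful actions.

    Each action is a dict with hypothetical keys: 'impactful',
    'requires_approval', 'plan', 'approval', 'result'.
    """
    impactful = [a for a in actions if a.get("impactful", True)]
    need_approval = [a for a in impactful if a.get("requires_approval")]

    def pct(hits, total):
        # An empty denominator means nothing was required, so nothing is missing.
        return 100.0 if total == 0 else 100.0 * hits / total

    return {
        "plan_logged_pct": pct(
            sum(1 for a in impactful if a.get("plan")), len(impactful)
        ),
        "approval_logged_pct": pct(
            sum(1 for a in need_approval if a.get("approval")), len(need_approval)
        ),
        "result_logged_pct": pct(
            sum(1 for a in impactful if a.get("result")), len(impactful)
        ),
    }
```

Any value under 100.0 in the returned dict should be treated as a finding to investigate, not as a score to improve gradually.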

5. Business impact

Metrics showing whether the agent delivers value:

Requests serviced per day
Failed requests (user abandoned or redid manually)
Time saved (estimated) per request
User satisfaction (qualitative, weekly evaluation on a sample)
Volumes of executed procedures (deploys, alerts handled, tickets resolved) — depends on the agent

These metrics are harder to standardize but are the ones that justify the agent to management.

Typical dashboard layout

Our dashboard for an agent has 8 widgets in this reading order:

[ Daily cost 7 days ]      [ Current monthly cost ]
[ Latency p50/p95/p99 ]    [ Tool-use success % ]
[ Impactful actions: plan / approve / execute ]
[ Top 5 errors in last 24h ]
[ Active sessions now ]    [ Failed requests last hour ]

Reads in 30 seconds: cost trend OK, latency within limits, tool-use above threshold, all impactful actions audited, no top errors, normal active sessions, no failure spike.

Alerting

Automatic alerting must be conservative. An agent in production has natural noise, and false alarms every hour do more damage than having no alerts at all.

Our typical thresholds:

Latency p95 > 2x baseline for 15 minutes → alert
Tool-use success < 85% for 30 minutes → alert
Daily cost > 1.5x baseline → alert (may be a legitimate spike, but check)
Audit completeness < 100% for one day → critical alert
Hallucination rate > 5% in weekly sampling → critical alert (raised by the human reviewer)
Upstream errors (LLM provider) > 10% → alert

All alerts routed to an ops channel with context. None resolves itself; all require human triage.
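As an illustration, the thresholds above can be expressed as one evaluation function over a metrics snapshot. The snapshot keys are hypothetical, and "for 15 minutes" is interpreted here as "every sample in the 15-minute window breaches the threshold", which is one of several reasonable readings.

```python
def sustained(values, predicate):
    """True if the window is non-empty and every sample satisfies the predicate."""
    return bool(values) and all(predicate(v) for v in values)


def evaluate_alerts(s):
    """Return a list of (severity, name) tuples for a snapshot dict `s`."""
    alerts = []
    if sustained(s["p95_last_15m"], lambda v: v > 2 * s["p95_baseline"]):
        alerts.append(("alert", "latency_p95"))
    if sustained(s["tool_success_last_30m"], lambda v: v < 0.85):
        alerts.append(("alert", "tool_use_success"))
    if s["daily_cost"] > 1.5 * s["daily_cost_baseline"]:
        alerts.append(("alert", "daily_cost"))
    if s["audit_completeness_day"] < 1.0:
        alerts.append(("critical", "audit_completeness"))
    if s["weekly_hallucination_rate"] > 0.05:
        alerts.append(("critical", "hallucination_rate"))
    if s["upstream_error_rate"] > 0.10:
        alerts.append(("alert", "upstream_errors"))
    return alerts
```

Requiring the whole window to breach is what keeps the latency and tool-use rules quiet during short blips; a single bad minute does not page anyone.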

Sampling for hallucination

A critical detail: hallucination rate without manual sampling is unmeasurable. Our procedure:

Random selection of 50-100 sessions per week
A human reviewer reads the agent’s response plus the context (user input, intermediate tool outputs)
Marks verdict: correct / partial / hallucinated
Notes the pattern (if hallucinated: what type — fabricated facts, wrong attributions, incorrect aggregation)
Tracks the rate as a weekly trend

The investment: 2-4 reviewer hours per week. For an agent serving real operations, this is far less than the cost of a single incident caused by undetected hallucination.
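The random selection step can be sketched as below. Combining the two figures given in this article (1-2% of responses, 50-100 sessions per week) into a single clamped sample size is an assumption made for the example, as are the parameter names.

```python
import random


def weekly_sample(session_ids, fraction=0.02, minimum=50, maximum=100, seed=None):
    """Pick a uniform random sample of sessions for human review.

    The target size is `fraction` of all sessions, clamped to
    [minimum, maximum] and never larger than the population.
    """
    ids = list(session_ids)
    target = round(len(ids) * fraction)
    size = min(len(ids), max(minimum, min(maximum, target)))
    # A dedicated Random instance keeps the draw reproducible when a seed is given.
    return random.Random(seed).sample(ids, size)
```

Recording the seed with each week's review makes the sample reproducible, which helps when a verdict is later disputed.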

Common mistakes

Mistake 1: monitoring only cost. Many teams build a tokens / EUR dashboard and consider observability done. Cost is easy to measure; quality is hard, which is exactly why it gets skipped.

Mistake 2: averages instead of percentiles. Average latency can be 2 seconds while p95 is 30 seconds. The users stuck in that slowest 5% are the ones who abandon the product.

Mistake 3: alerts on everything. False alarms numb attention. Thresholds must be tuned empirically, not set to “round” values.

Mistake 4: lack of structured logs. Free-text logs are not queryable. Use structured JSON with fixed fields: timestamp, agent_id, session_id, action_type, model_used, tokens_in, tokens_out, latency_ms, status.

Mistake 5: lack of retention strategy. Logs grow fast. Define: 30-day fine-grained retention, 12-month aggregations, annual summary. Storage becomes a real cost otherwise.
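For the structured logs in Mistake 4, a minimal sketch of a writer that enforces the fixed field set is shown here. The function name make_log_line is illustrative; the field names are the ones listed above, and the timestamp is assumed to be ISO 8601 in UTC.

```python
import json
from datetime import datetime, timezone

# Fixed field set from Mistake 4, in a stable order.
FIELDS = (
    "timestamp", "agent_id", "session_id", "action_type",
    "model_used", "tokens_in", "tokens_out", "latency_ms", "status",
)


def make_log_line(**values):
    """Serialize one event as a single JSON line with exactly the fixed fields."""
    values.setdefault("timestamp", datetime.now(timezone.utc).isoformat())
    missing = [f for f in FIELDS if f not in values]
    extra = [k for k in values if k not in FIELDS]
    if missing or extra:
        raise ValueError(f"missing={missing} extra={extra}")
    return json.dumps({f: values[f] for f in FIELDS})
```

Rejecting missing or unexpected fields at write time is a design choice: it keeps every line queryable with the same schema, at the price of a hard failure when a caller forgets a field.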

Tooling

Our internal stack is modular:

Structured logs written as JSON to disk
Aggregation with Loki or equivalent
Numeric metrics through Prometheus
Visualization in Grafana
Hallucination sampling through a Python script that reads the logs and opens a review UI for the reviewer
Alerting via dedicated ops Telegram channel

The exact stack matters less than the metrics tracked. Start with a simple dashboard, iterate.

Conclusion

An AI agent without observability is an agent that will fail at something you will not notice until a user complains. The five metric families (cost, latency, quality, audit, business) cover almost every failure we have seen in practice. The investment is days at first setup and hours per week in operation. The benefit is that the agent becomes something you can entrust to your team, not a black box that “usually works”.
