CAI Technology
iris · 12 min read

Observability of AI agents: what to monitor in production

AI agent dashboard: tokens consumed, latency p50/p95/p99, hallucination rate, tool-use success, audit-log completeness. A practical template.

CAI Technology · Last reviewed: 4/30/2026
Observability of AI agents: the minimum dashboard to operate an agent in production

An AI agent that works in pilot and an AI agent that operates daily in production look similar to the user. To the team that maintains it, they are fundamentally different. The difference is not in the code; it is in observability. Without a rigorous set of metrics, an agent in production becomes a black box that “sometimes works” and on which you cannot make informed decisions.

This article describes the metrics we track for every AI agent we operate internally and for clients, with details on how to compute them, what thresholds are reasonable, and how we interpret the values.

TL;DR

The five metric families

1. Cost — tokens consumed

For an operational agent, cost is a primary metric. Components:

Typical threshold: investigate any sudden increase of more than 30% over the previous-week baseline that has no business explanation such as higher request volume. Frequent causes: the model was swapped to a more expensive default, the prompt grew (for example, the system prompt was accidentally doubled), or a context loop never closes.
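The 30% rule above can be sketched as a small check. This is a minimal illustration, not the exact computation used in our dashboards: the `WeekUsage` type, the function name, and the 5% per-request tolerance used to decide whether volume explains the increase are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class WeekUsage:
    tokens: int    # total tokens (input + output) consumed in the week
    requests: int  # number of requests served in the week


def cost_spike(current: WeekUsage, baseline: WeekUsage,
               threshold: float = 0.30, tolerance: float = 0.05) -> dict:
    """Compare this week's token usage with the previous-week baseline.

    threshold is the 30% rule from the text; tolerance is an assumed
    value for how much tokens-per-request may drift before we stop
    attributing the increase to request volume alone.
    """
    total_change = current.tokens / baseline.tokens - 1.0
    per_request_now = current.tokens / current.requests
    per_request_base = baseline.tokens / baseline.requests
    per_request_change = per_request_now / per_request_base - 1.0
    return {
        "total_change": total_change,
        "per_request_change": per_request_change,
        # Flag only when the total jumped AND each request got more expensive.
        "investigate": total_change > threshold and per_request_change > tolerance,
    }


baseline = WeekUsage(tokens=1_000_000, requests=1000)
same_volume = cost_spike(WeekUsage(1_400_000, 1000), baseline)
more_volume = cost_spike(WeekUsage(1_400_000, 1400), baseline)
```

In this example both weeks show a 40% increase in tokens, but only `same_volume` is flagged: in `more_volume` the tokens per request are unchanged, so the growth is explained by traffic.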

2. Latency — p50, p95, p99

Average latency is misleading because the distribution is skewed: roughly 90% of requests are fast, 5-10% are slow, and 1-2% are very slow. Use percentiles (p50, p95, p99):

Component breakdown: time in LLM (network plus inference), time in tool execution, time in I/O (file system, DB). Identifying the dominant component directs optimization.
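A minimal sketch of the percentile summary and the component breakdown described above. It uses the nearest-rank definition of a percentile, which is one of several common definitions; the function names and the `llm` / `tool` / `io` keys are assumptions for illustration.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_ms):
    """Mean plus the three percentiles shown on the dashboard."""
    return {
        "mean": sum(latencies_ms) / len(latencies_ms),
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
    }


def dominant_component(breakdown_ms):
    """Return the component that accounts for the most time."""
    return max(breakdown_ms, key=breakdown_ms.get)


# Synthetic sample: 90 fast requests, 8 slow, 2 very slow.
samples = [1000] * 90 + [8000] * 8 + [30000] * 2
summary = latency_summary(samples)
slowest_part = dominant_component({"llm": 1200, "tool": 300, "io": 50})
```

On this synthetic sample the mean is 2140 ms, while p50 is 1000 ms, p95 is 8000 ms, and p99 is 30000 ms: the mean describes almost no real request.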

3. Quality — hallucination and tool-use

Hallucination rate. The frequency with which the agent produces false statements or fabrications not supported by the context. We do not measure it automatically, because one model cannot reliably check whether another hallucinates. Our procedure is manual sampling by a human reviewer, detailed under “Sampling for hallucination”.

Threshold: under 1% for factual tasks (lookup, status) and under 3% for reasoning tasks (analysis, recommendation). Above these values, the agent is not production-ready.

Tool-use success rate. The frequency with which the agent calls the right tool with valid parameters that produce a useful result. Automatic computation:

Typical threshold: above 92% success. Below 90%, something is broken.
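One way to compute the tool-use success rate from call records is sketched below. The record fields (`params_valid`, `executed`, `result_useful`) are hypothetical and stand for the three conditions in the definition above. The "watch" label for the band between 90% and 92% is also an assumption of this example.

```python
def tool_use_success_rate(calls):
    """Fraction of tool calls that had valid parameters, executed
    without error, and produced a result the agent could use."""
    if not calls:
        return None  # no tool calls in the window, rate is undefined
    ok = sum(
        1 for c in calls
        if c["params_valid"] and c["executed"] and c["result_useful"]
    )
    return ok / len(calls)


def classify(rate):
    """Map a rate to the thresholds from the text (92% and 90%)."""
    if rate > 0.92:
        return "healthy"
    if rate >= 0.90:
        return "watch"   # assumed label for the in-between band
    return "broken"


good = {"params_valid": True, "executed": True, "result_useful": True}
bad = {"params_valid": False, "executed": False, "result_useful": False}
rate = tool_use_success_rate([good] * 95 + [bad] * 5)
status = classify(rate)
```

Here 95 of 100 calls succeed, so `rate` is 0.95 and `status` is "healthy".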

4. Audit completeness

For agents performing impactful actions (especially with propose-then-act), audit is mandatory. Metrics:

Below 100% means risk of litigation or non-compliance. The only acceptable tolerance is for actions with zero impact (status, info).
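A minimal sketch of an audit-completeness check, assuming each executed impactful action should leave one log entry per stage of a plan / approve / execute flow. The field names (`action_id`, `impactful`, `stage`) are hypothetical, and zero-impact actions are excluded as the tolerance above allows.

```python
REQUIRED_STAGES = {"plan", "approve", "execute"}


def audit_completeness(actions, audit_log):
    """Return (completeness, ids_with_gaps) for executed impactful actions.

    actions:   list of dicts with "action_id" and "impactful"
    audit_log: list of dicts with "action_id" and "stage"
    """
    stages_by_action = {}
    for entry in audit_log:
        stages_by_action.setdefault(entry["action_id"], set()).add(entry["stage"])

    impactful = [a for a in actions if a["impactful"]]
    if not impactful:
        return 1.0, []

    # An action has a gap if any required stage is absent from its log.
    gaps = [
        a["action_id"] for a in impactful
        if not REQUIRED_STAGES <= stages_by_action.get(a["action_id"], set())
    ]
    return 1 - len(gaps) / len(impactful), gaps


actions = [
    {"action_id": "a1", "impactful": True},
    {"action_id": "a2", "impactful": True},
    {"action_id": "a3", "impactful": False},  # status lookup, exempt
]
audit_log = [
    {"action_id": "a1", "stage": "plan"},
    {"action_id": "a1", "stage": "approve"},
    {"action_id": "a1", "stage": "execute"},
    {"action_id": "a2", "stage": "plan"},
    {"action_id": "a2", "stage": "execute"},  # approval record missing
]
completeness, gaps = audit_completeness(actions, audit_log)
```

In this example `completeness` is 0.5 and `gaps` names `a2`, the action executed without a recorded approval. Returning the ids alongside the ratio makes the gap actionable instead of a bare percentage.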

5. Business impact

Metrics showing whether the agent delivers value:

These metrics are harder to standardize but are the ones that justify the agent to management.

Typical dashboard layout

Our dashboard for an agent has 8 widgets in this reading order:

[ Daily cost 7 days ]      [ Current monthly cost ]
[ Latency p50/p95/p99 ]    [ Tool-use success % ]
[ Impactful actions: plan / approve / execute ]
[ Top 5 errors in last 24h ]
[ Active sessions now ]    [ Failed requests last hour ]

The dashboard reads in 30 seconds: cost trend OK, latency within limits, tool-use above threshold, all impactful actions audited, no top errors, normal active sessions, no failure spike.

Alerting

Automatic alerting must be conservative. An agent in production has natural noise, and false alarms every hour do more damage than having no alerts at all.

Our typical thresholds:

All alerts routed to an ops channel with context. None resolves itself; all require human triage.

Sampling for hallucination

A critical detail: hallucination rate without manual sampling is unmeasurable. Our procedure:

  1. Random selection of 50-100 sessions per week
  2. A human reviewer reads the agent’s response plus the context (user input, intermediate tool outputs)
  3. Marks verdict: correct / partial / hallucinated
  4. Notes the pattern (if hallucinated: what type — fabricated facts, wrong attributions, incorrect aggregation)
  5. Tracks the hallucination rate as a weekly trend

The investment: 2-4 reviewer hours per week. For an agent serving real operations, this is far less than the cost of a single incident caused by undetected hallucination.
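The sampling and scoring steps above can be sketched as follows. The function names are assumptions, and so is the choice to count only the "hallucinated" verdict toward the rate (a "partial" verdict is tracked but not counted). The 1% and 3% limits are the thresholds given earlier for factual and reasoning tasks.

```python
import random
from collections import Counter

LIMITS = {"factual": 0.01, "reasoning": 0.03}


def sample_sessions(session_ids, k=50, seed=None):
    """Pick k distinct sessions uniformly at random for human review."""
    ids = list(session_ids)
    rng = random.Random(seed)  # a seed makes the weekly draw reproducible
    return rng.sample(ids, min(k, len(ids)))


def hallucination_rate(verdicts):
    """verdicts: reviewer labels, each 'correct', 'partial' or 'hallucinated'."""
    counts = Counter(verdicts)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reviewed sessions")
    return counts["hallucinated"] / total


def production_ready(rate, task_type):
    """The rate must be strictly under the limit for the task type."""
    return rate < LIMITS[task_type]


week_sample = sample_sessions(range(1000), k=50, seed=1)
week_rate = hallucination_rate(
    ["correct"] * 97 + ["partial"] * 2 + ["hallucinated"]
)
```

With 100 reviewed sessions, a single hallucinated verdict already yields a rate of 1%, so small weekly samples are better read as a trend over several weeks than as a precise point value.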

Common mistakes

Mistake 1: monitor only cost. Many teams build a tokens / EUR dashboard and consider observability done. Cost is easy to measure, quality is hard — which is exactly what makes it important.

Mistake 2: averages instead of percentiles. Average latency can be 2 seconds while p95 is 30 seconds. The users who hit those slowest requests are the ones abandoning the product.

Mistake 3: alerts on everything. False alarms numb attention. Thresholds must be tuned empirically, not set to “round” values.

Mistake 4: lack of structured logs. Free-text logs are not queryable. Use structured JSON with fixed fields: timestamp, agent_id, session_id, action_type, model_used, tokens_in, tokens_out, latency_ms, status.
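A minimal sketch of one structured log line with the fixed fields listed above. The helper name and the sample values are illustrative assumptions; only the field names come from the text.

```python
import json
from datetime import datetime, timezone

FIELDS = (
    "timestamp", "agent_id", "session_id", "action_type", "model_used",
    "tokens_in", "tokens_out", "latency_ms", "status",
)


def log_record(agent_id, session_id, action_type, model_used,
               tokens_in, tokens_out, latency_ms, status):
    """Serialize one agent event as a single-line JSON object."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "agent_id": agent_id,
        "session_id": session_id,
        "action_type": action_type,
        "model_used": model_used,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "latency_ms": latency_ms,
        "status": status,
    }
    return json.dumps(record)


line = log_record(
    agent_id="support-agent",      # hypothetical values throughout
    session_id="s-0001",
    action_type="tool_call",
    model_used="example-model",
    tokens_in=812,
    tokens_out=164,
    latency_ms=1430,
    status="ok",
)
```

Because every line carries the same keys, a query such as "p95 of `latency_ms` per `model_used`" becomes a simple aggregation rather than a text search.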

Mistake 5: lack of retention strategy. Logs grow fast. Define: 30-day fine-grained retention, 12-month aggregations, annual summary. Storage becomes a real cost otherwise.
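The retention tiers above can be expressed as a small rule that a cleanup job could apply. This is a sketch under assumptions: the tier names are made up for the example, and "12 months" is approximated as 365 days.

```python
from datetime import date


def retention_tier(record_date: date, today: date) -> str:
    """Decide at which granularity a log record should still be kept."""
    age_days = (today - record_date).days
    if age_days <= 30:
        return "fine_grained"    # keep the raw structured log lines
    if age_days <= 365:
        return "aggregated"      # keep only periodic aggregates
    return "annual_summary"      # keep only the yearly summary


today = date(2026, 4, 30)
recent = retention_tier(date(2026, 4, 1), today)
older = retention_tier(date(2025, 12, 1), today)
oldest = retention_tier(date(2024, 1, 1), today)
```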

Tooling

Our internal stack is modular:

The exact stack matters less than the metrics tracked. Start with a simple dashboard, iterate.

Conclusion

An AI agent without observability is an agent that will fail at something you will not notice until a user complains. The five metric families (cost, latency, quality, audit, business) cover almost every failure we have seen in practice. The investment is days at first setup and hours per week in operation. The benefit is that the agent becomes something you can entrust to your team — not a black box that “usually works”.

Next step

If your team operates an AI agent and wants to define the right metrics for your workflow together, we offer a 30-minute technical consultation at no cost.
