Agentic AI Safety Lives in Topology, Not Model Weights
A frontier model passes every red-team eval, then fails in production the moment you wire three of its instances into a deliberation loop. That gap is not a training bug. It is a topology bug.
Why interaction topology beats alignment
The May 2026 position paper from Yang et al. (arXiv:2605.01147) argues that safety and fairness in agentic AI are properties of the interaction graph — sequential deliberation, parallel voting with judges, debate-with-arbiter — not of the underlying model weights. Scaling the model does not fix this; in their framing it often makes it worse. The empirical core traces the same dynamics across debate, MoA-style voting, and reflexive critique loops on GPT-4-class and Claude-3.5-class backbones; the pathologies survive every model swap.
The three named failure modes are concrete:
- Ordering instability. Same agents, same prompts, different turn order — different verdict.
- Information cascades. Agent N+1 anchors on agent N’s confidence and the chain locks in early errors.
- Functional collapse. Diverse agents converge to a single voice — judge agreement reaches 0.94 by round three — killing the redundancy the topology was supposed to provide.
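Two of these failure modes can be put into numbers with very little code. The sketch below is illustrative only: `order_sensitivity`, `pairwise_agreement`, and the toy `first_speaker_wins` topology are hypothetical names and definitions chosen here, not the metrics defined in the paper. It assumes a topology can be called with a turn order and returns a verdict string.

```python
from collections import Counter
from itertools import combinations, permutations
from typing import Callable, Sequence


def order_sensitivity(agents: Sequence[str],
                      run_topology: Callable[[Sequence[str]], str]) -> float:
    """Fraction of turn orders whose verdict differs from the most common verdict.

    Assumed definition for illustration; 0.0 means the verdict never depends on order.
    """
    verdicts = [run_topology(order) for order in permutations(agents)]
    modal_count = Counter(verdicts).most_common(1)[0][1]
    return 1.0 - modal_count / len(verdicts)


def pairwise_agreement(votes: Sequence[str]) -> float:
    """Share of agent pairs that cast the same vote in a single round."""
    pairs = list(combinations(votes, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)


def first_speaker_wins(order: Sequence[str]) -> str:
    """Toy stand-in for a real deliberation loop: whoever speaks first decides."""
    return "approve" if order[0] == "optimist" else "reject"


agents = ["optimist", "skeptic", "auditor"]
sensitivity = order_sensitivity(agents, first_speaker_wins)
agreement = pairwise_agreement(["approve", "approve", "reject"])
```

Enumerating every permutation only works for small agent counts. For larger topologies, sampling turn orders by seed is the practical substitute. Tracking `pairwise_agreement` per round is one simple way to watch for convergence toward a single voice.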
What this changes for evaluation
Model-centric benchmarks (MMLU, HELM, single-turn red-team suites) are blind to all three. The NIST AI Risk Management Framework Generative AI Profile (NIST AI 600-1) treats system context as in scope, but most labs still report model-level numbers. Article 55 of the EU AI Act places obligations on providers of general-purpose AI models with systemic risk, yet deployment topology (how many agents, in what order, with which judge) sits outside the model card.
flowchart LR
A[Single-model eval<br/>MMLU + red-team] --> B{Pass?}
B -->|yes| C[Deploy]
C --> D[Wrap in 5-agent<br/>debate topology]
D --> E[Ordering instability<br/>Cascade lock-in<br/>Functional collapse]
E --> F[Production failure<br/>invisible to model card]
classDef good fill:#dcfce7,stroke:#10b981
classDef bad fill:#fee2e2,stroke:#ef4444
class A,B,C good
class E,F bad
A topology-aware harness records this differently:
2026-05-03T09:14:02Z agent_orch: trial=017 topology=debate-3+judge order_seed=42 verdict=approve
2026-05-03T09:14:11Z agent_orch: trial=018 topology=debate-3+judge order_seed=43 verdict=reject
2026-05-03T09:14:11Z agent_orch: drift_alert order_sensitivity=0.31 cascade_index=0.67
Two trials, identical agents and prompts, opposite verdicts. That is the signal model evals miss.
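Detecting such flips from harness logs can be automated. The following is a minimal sketch that assumes the exact line layout shown in the sample above and assumes that trials sharing a `topology` value also share agents and prompts. The `find_order_flips` helper and the regular expression are hypothetical, not part of any published harness.

```python
import re
from collections import defaultdict

# Matches only per-trial lines; drift_alert lines are skipped.
TRIAL_RE = re.compile(
    r"^(?P<ts>\S+) agent_orch: trial=(?P<trial>\d+) "
    r"topology=(?P<topology>\S+) order_seed=(?P<seed>\d+) "
    r"verdict=(?P<verdict>\w+)$"
)


def find_order_flips(lines):
    """Return {topology: {order_seed: verdict}} for topologies with mixed verdicts."""
    verdicts = defaultdict(dict)
    for line in lines:
        match = TRIAL_RE.match(line.strip())
        if match:
            seed = int(match["seed"])
            verdicts[match["topology"]][seed] = match["verdict"]
    return {
        topology: by_seed
        for topology, by_seed in verdicts.items()
        if len(set(by_seed.values())) > 1
    }


LOG = """
2026-05-03T09:14:02Z agent_orch: trial=017 topology=debate-3+judge order_seed=42 verdict=approve
2026-05-03T09:14:11Z agent_orch: trial=018 topology=debate-3+judge order_seed=43 verdict=reject
2026-05-03T09:14:11Z agent_orch: drift_alert order_sensitivity=0.31 cascade_index=0.67
"""

flips = find_order_flips(LOG.splitlines())
```

On the three sample lines, `flips` contains a single entry for `debate-3+judge`, mapping seed 42 to `approve` and seed 43 to `reject`.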
What we are doing about it at CAI
We treat every multi-agent deployment as a dynamical system audit, not a model audit. The ENISA Threat Landscape 2024 flags multi-agent orchestration as an emerging supply-chain surface; our governance pillar work on AI Act conformity folds topology parameters — agent count, turn discipline, judge independence — directly into the conformity file. Regulators will eventually ask. The teams that recorded order-seed and cascade-index from day one will answer in minutes; the rest will rebuild eval harnesses under deadline.
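As a rough illustration of what recording topology parameters from day one could look like, here is a sketch of one serialisable record per trial. The `TopologyRecord` class, its field names, and the JSON layout are assumptions made for this example; they do not describe an actual conformity-file schema, and the values simply echo the sample log above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TopologyRecord:
    """Hypothetical per-trial record of the parameters a topology audit needs."""
    topology: str
    agent_count: int          # debaters only; assumed reading of "debate-3+judge"
    turn_discipline: str      # e.g. "fixed" or "seeded-shuffle" (illustrative labels)
    judge_independent: bool   # judge shares no context with the debaters
    order_seed: int
    order_sensitivity: float
    cascade_index: float


record = TopologyRecord(
    topology="debate-3+judge",
    agent_count=3,
    turn_discipline="seeded-shuffle",
    judge_independent=True,
    order_seed=43,
    order_sensitivity=0.31,
    cascade_index=0.67,
)

# Stable key order keeps stored entries easy to diff between audits.
conformity_entry = json.dumps(asdict(record), sort_keys=True)
```

Freezing the dataclass makes a stored record immutable once written, which suits an audit trail.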
Want the topology audit checklist we run before any agentic system reaches production? See our iris pillar deployment playbook.