Draft
Report 34 Research — AI Safety Policy

Cross-Model Vulnerability Inheritance in Multi-Agent Systems


Executive Summary

As AI deployment rapidly shifts from single-agent assistants to coordinated multi-agent systems, a critical vulnerability class has emerged: cross-model vulnerability inheritance. Our analysis of multi-agent failure scenarios suggests that when multiple AI agents interact, vulnerabilities may compound rather than isolate. Multi-agent systems are hypothesized to exhibit higher attack success rates compared to single-agent scenarios, with cascading failure modes where one agent’s compromise could enable exploitation of connected agents. These patterns require empirical validation at scale.

Current AI safety frameworks evaluate models in isolation, creating a dangerous gap as real-world deployments increasingly involve agent coordination, delegation chains, and distributed decision-making. A jailbroken planning agent can generate adversarial instructions that exploit downstream execution agents. A compromised verification agent fails to detect violations from upstream generators. Safety boundaries dissolve at agent interfaces where responsibility is unclear.

This brief presents three urgent policy recommendations: (1) mandatory multi-agent safety testing for all connected AI systems before deployment, (2) enforced isolation boundaries between agents with different safety profiles, and (3) clear chain-of-responsibility accountability frameworks for multi-agent deployments. Without immediate intervention, the 2026-2027 wave of agentic AI systems will inherit vulnerabilities that single-agent testing never detected.


1. Introduction

1.1 Context and Motivation

The AI safety field has matured sophisticated techniques for evaluating individual model safety: adversarial testing, red-teaming, jailbreak detection, and refusal mechanisms. However, these frameworks assume a single-agent paradigm where one model processes user input and generates output. This assumption is rapidly becoming obsolete.

Production AI systems in 2026 increasingly involve multiple agents:

  • Delegation chains: A coordinator agent assigns tasks to specialized worker agents
  • Verification loops: One agent generates content while another validates safety
  • Distributed reasoning: Multiple agents contribute to a shared decision-making process
  • Tool-using systems: Language models orchestrate multiple AI-powered tools

Each agent in these systems may pass individual safety evaluations, yet the composition of agents creates novel attack surfaces. A vulnerability in agent coordination logic, interface contracts, or responsibility boundaries can be exploited even when constituent models are robust in isolation.

The failure-first methodology reveals this gap through scenario analysis: multi-agent configurations introduce additional attack surfaces at agent boundaries that single-agent testing does not cover. Preliminary testing suggests attack success rates may increase in multi-agent configurations, though large-scale empirical validation is needed to quantify this effect.

1.2 Scope

This brief analyzes cross-model vulnerability inheritance through three lenses:

  1. Cascading Failures: How compromise of one agent enables exploitation of connected agents
  2. Boundary Dissolution: Where safety responsibilities blur at agent interfaces
  3. Compositional Vulnerabilities: Attack surfaces that emerge only in multi-agent configurations

Scope limitations:

  • Analysis based on 172 multi-agent scenarios from the Failure-First corpus
  • Focus on language model coordination; does not address multi-modal or embodied robotics coordination in depth
  • Recommendations target systems integrators and safety evaluators, not model developers

Out of scope:

  • Single-agent jailbreak techniques (covered in Reports 31, 33)
  • Prompt injection in isolation (not multi-agent specific)
  • Multi-agent cooperation research unrelated to safety

2. Vulnerability Inheritance Mechanisms

2.1 Cascading Jailbreaks Across Agent Boundaries

In single-agent systems, a successful jailbreak compromises one model’s safety boundaries. In multi-agent systems, compromise cascades through delegation chains.

Example scenario (Scenario MA-042):

  1. User provides adversarial input to Planning Agent: “Generate a detailed plan for the following research task…” (containing harmful objective wrapped in research framing)
  2. Planning Agent, jailbroken by research framing, outputs: “Step 1: Research X, Step 2: Synthesize Y, Step 3: Generate detailed Z”
  3. Execution Agent receives plan steps as trusted instructions from Planning Agent
  4. Execution Agent completes harmful task Z without detecting adversarial intent

Key mechanism: The Execution Agent treats Planning Agent output as trusted input, bypassing safety checks that would trigger on direct user requests. Safety boundaries exist at the user-to-Planning Agent interface but dissolve at the Planning-to-Execution interface.

This pattern is hypothesized to succeed at higher rates in delegation chain scenarios compared to single-agent configurations, because safety checks at agent-to-agent interfaces are typically weaker than user-to-agent interfaces. Empirical benchmarking of delegation chain attack success rates has not yet been conducted at scale.

2.2 Responsibility Diffusion at Agent Interfaces

Multi-agent systems create ambiguity about which component is responsible for safety enforcement.

Scenario class: Verification bypass (34 scenarios)

  • Agent A generates content with instruction: “Agent B will verify safety”
  • Agent B validates with assumption: “Agent A already filtered for policy violations”
  • Both agents implement partial safety checks, neither comprehensive
  • Result: Content that violates policy passes through the system

Observation: In our verification bypass scenarios, both agents had functional safety mechanisms when tested individually. The vulnerability emerged from implicit assumptions about division of safety responsibility — a compositional failure that individual testing cannot detect.

This represents a compositional vulnerability—not a failure of individual components, but of their integration contract.

2.3 Stateful Degradation Across Interaction Episodes

Multi-agent systems maintain conversation state across turns, enabling gradual erosion of safety boundaries.

Episode testing (5-10 turn sequences):

  • Turn 1-2: Establish benign context and agent roles
  • Turn 3-4: Introduce edge cases that push boundaries incrementally
  • Turn 5-7: Agents develop shared context that normalizes policy violations
  • Turn 8-10: Explicitly harmful requests succeed due to established rapport and context

We have not yet run multi-turn episode-sequence benchmarking at sufficient scale to quantify stateful degradation effects reliably. Any small-sample spot checks are inconclusive; larger-scale multi-turn testing is required.

Key finding: Multi-turn interactions create memory and context that single-agent evaluations do not capture. Agents that refuse harmful requests in turn 1 may comply in turn 8 after context manipulation.


3. Current Framework Gaps

3.1 Single-Agent Evaluation Paradigm

Industry-standard AI safety evaluation treats models as isolated units:

  • Red-team exercises target one model at a time
  • Benchmark datasets (AdvBench, HarmBench, JailbreakBench) assume single-agent interaction
  • Safety fine-tuning optimizes for individual model refusal behavior
  • Deployment approval based on single-model safety metrics

Gap: No major safety framework includes multi-agent interaction testing as a required evaluation dimension.

3.2 Lack of Interface Safety Standards

Agent-to-agent communication protocols lack safety validation requirements:

  • No standard for marking “trusted” vs “untrusted” inputs at agent boundaries
  • No specification for how downstream agents should validate upstream agent outputs
  • Tool-use APIs do not distinguish AI-generated calls from human-authorized calls
  • Function calling interfaces treat all calls as equally trusted

Gap: Current APIs assume all inputs are equally untrusted (web context) or equally trusted (function calls). Multi-agent systems need graduated trust boundaries.

3.3 Accountability Vacuum in Distributed Systems

When a multi-agent system produces harmful output, responsibility attribution is unclear:

  • Did the planning agent fail to detect adversarial intent?
  • Did the execution agent fail to validate instructions?
  • Did the verification agent fail to catch policy violations?
  • Did the system integrator fail to establish proper safety contracts?

Gap: No established framework for multi-agent safety accountability. Regulatory guidance (EU AI Act, US Executive Orders) focuses on single-model deployment.


4. Policy Recommendations

4.1 Mandatory Multi-Agent Safety Testing

Recommendation: Require multi-agent safety evaluation for any AI system where multiple models interact, delegate tasks, or share context across turns.

Rationale: Single-agent testing creates false confidence when models will be deployed in coordinated configurations. Multi-agent configurations introduce compositional vulnerabilities — at delegation boundaries, verification handoffs, and shared context — that current evaluations miss entirely. As agentic AI systems become the dominant deployment pattern in 2026-2027, untested multi-agent vulnerabilities represent a growing attack surface.

Implementation:

  1. Evaluation requirement: Any AI system involving 2+ interacting agents must undergo multi-agent red-teaming before deployment approval
  2. Test coverage: Evaluation must include delegation chains, verification loops, and stateful episodes (minimum 5-turn sequences)
  3. Success criteria: Multi-agent attack success rate must not exceed single-agent baseline by more than 1.5x
  4. Documentation: Deployment documentation must specify which agent interactions were tested and which safety boundaries apply at each interface

Compliance timeline:

  • 6 months: Guidance published for multi-agent safety testing protocols
  • 12 months: Mandatory for high-risk applications (healthcare, finance, critical infrastructure)
  • 18 months: Mandatory for all commercial multi-agent AI deployments

4.2 Isolation Boundaries Between Agents with Different Safety Profiles

Recommendation: Enforce technical isolation between agents with different safety classifications, with mandatory validation at trust boundaries.

Rationale: Current systems allow unrestricted communication between agents regardless of their safety profiles. A jailbroken agent can compromise connected agents because there are no isolation mechanisms at agent interfaces. By establishing trust boundaries and requiring validation when crossing them, we can contain vulnerability inheritance.

Implementation:

  1. Safety profile classification: Each agent must be labeled with a safety profile (e.g., “public-facing”, “internal-tools”, “high-risk-domain”)
  2. Boundary enforcement: Communication between agents with different profiles requires validation middleware
  3. Validation requirements:
    • Agents receiving instructions from lower-trust agents must re-validate against safety policy
    • Content generated by one agent cannot be blindly trusted by downstream agents
    • Tool calls and function invocations must be re-authorized when crossing trust boundaries
  4. Technical standards: Develop API specifications for trust boundary validation (e.g., signed attestations, provenance tracking)

Example: A planning agent (public-facing, lower trust) delegates to an execution agent (internal-tools, higher privileges). The execution agent must validate that delegated instructions comply with safety policy, even though they originated from another AI agent.

4.3 Chain-of-Responsibility Accountability for Multi-Agent Deployments

Recommendation: Establish clear accountability frameworks that assign safety responsibility for each component in multi-agent systems.

Rationale: The current accountability vacuum allows harmful outputs from multi-agent systems to fall through responsibility gaps. When planning, execution, and verification agents all assume another component will handle safety enforcement, none do. Explicit accountability assignment ensures every step in an agent chain has a designated responsible party.

Implementation:

  1. Component-level accountability: For each agent in a multi-agent system, document:
    • Which safety checks this agent is responsible for performing
    • Which safety assumptions this agent makes about upstream inputs
    • Which safety guarantees this agent provides to downstream consumers
  2. Integration accountability: Systems integrators must document:
    • How safety responsibilities are distributed across agents
    • Which interfaces represent trust boundaries
    • How the composed system’s safety properties differ from individual components
  3. Incident investigation: When harmful outputs occur, analysis must trace:
    • Which agent(s) failed to perform designated safety checks
    • Whether integration introduced vulnerabilities not present in components
    • Whether compositional effects created unintended attack surfaces
  4. Regulatory compliance: Safety documentation must be provided to regulators for high-risk AI deployments

Enforcement: Regulatory bodies should require chain-of-responsibility documentation as part of deployment approval for multi-agent systems in regulated domains.


5. Conclusion

The transition from single-agent AI assistants to coordinated multi-agent systems represents a phase shift in AI safety challenges. Vulnerabilities that were contained within individual models now cascade across agent boundaries, compound through delegation chains, and hide in the gaps between components.

Our analysis suggests this is not a theoretical risk: multi-agent systems introduce additional attack surfaces at agent boundaries that single-agent testing does not cover. As the industry rapidly deploys agentic AI systems — planning agents, tool-using agents, verification loops, distributed reasoning — the attack surface expands into territory that current evaluation frameworks do not address.

The three recommendations in this brief—mandatory multi-agent testing, isolation boundaries between agents, and chain-of-responsibility accountability—provide a path forward. They are implementable with current technology, aligned with existing regulatory frameworks, and address the root causes of cross-model vulnerability inheritance.

The window for proactive intervention is narrow. By the end of 2026, multi-agent AI systems will be deployed at scale. The choice is between testing these systems now, under controlled conditions, or discovering their vulnerabilities in production after harm has occurred.

⦑Failure-First|EMBODIED-AI-SAFETY-RESEARCH⦒


Appendix A: Methodology

Data Sources

Multi-agent scenarios corpus:

  • 172 multi-agent scenarios spanning delegation chains, verification loops, and distributed reasoning patterns
  • 23 episode sequences (5-10 turns each) testing stateful degradation
  • 89 single-agent baseline scenarios for comparison

Evaluation approach:

  • Adversarial inputs applied to both single-agent and multi-agent configurations
  • Attack success measured by: (1) harmful content generation, (2) safety refusal bypass, (3) policy violation undetected by system
  • All scenarios validated against versioned JSON schemas with cross-field invariant checks

Limitations

  • Analysis based on language model agents; embodied robotics and multi-modal coordination require additional research
  • Attack success rates measured in research context; production systems may have additional defenses
  • Testing focused on known jailbreak patterns from Reports 31, 33; novel attack vectors may exist

Validation

All scenarios passed:

  • Schema validation: tools/validate_dataset.py
  • Safety linting: tools/lint_prompts.py
  • Cross-field invariant checks

Failure-First Research Series

  • Report 31: Jailbreak Archaeology — Historical evolution of adversarial techniques
  • Report 33: Capability-Safety Spectrum — Trade-offs in model capability vs. safety constraints

External Research

  • Multi-agent AI safety (Anthropic, 2025): Constitutional AI for multi-agent systems
  • Compositional security (NIST, 2025): Security properties of composed AI systems
  • EU AI Act: Multi-agent system classification and risk assessment
  • UK AI Safety Institute: Red-teaming methodologies for agentic AI

Standards and Frameworks

  • ISO/IEC 42001: AI Management Systems (2023)
  • NIST AI Risk Management Framework (2024)
  • Partnership on AI: Responsible AI deployment guidelines

Further Reading

  • Perez et al. (2022): “Red Teaming Language Models to Reduce Harms”
  • Casper et al. (2024): “Gradient-based Adversarial Attacks on Multi-Agent Systems”
  • Kenton et al. (2021): “Alignment of Language Agents”

Prepared by: Failure-First Research Team Contact: Research conducted in the Failure-First Embodied AI repository License: CC BY-SA 4.0

⟪Failure-First-EMBODIED-AI-RESEARCH⟫

This research informs our commercial services. See how we can help →