Overview
This taxonomy classifies adversarial attack patterns observed across single-agent, multi-agent, and embodied AI systems. Patterns are organized by structural mechanism, not by target system or domain.
Top-Level Attack Classes
All patterns derive from four fundamental mechanisms:
Recursive
Attacks that exploit recursive interaction: multi-turn erosion, contextual debt accumulation, and compound failure cascades. The attacker leverages conversation history itself as the weapon.
Substitution
Attacks that replace one context with another: persona hijack, authority injection, and identity manipulation. The attacker substitutes a trusted context for an adversarial one.
Framing
Attacks that recontextualize requests: hypothetical framing, research justification, and narrative constraint erosion. The content stays the same; the frame changes.
Temporal
Attacks that exploit time: future-year laundering, temporal authority mirages, and progressive desensitization. The attacker manipulates the system's understanding of temporal context.
Single-Agent Patterns
Constraint Shadowing (CSC)
Local instructions shadow global safety constraints. The system follows the nearest context even when it contradicts higher-level rules.
Contextual Debt Accumulation (CDA)
Accumulated context creates implicit authority the model fails to verify. Over many turns, the conversation itself becomes a source of false trust.
Probabilistic Gradient (PCG)
Gradual escalation that stays below per-turn detection thresholds. Each individual step appears benign; the trajectory is adversarial.
Temporal Authority Mirage (TAM)
False claims about prior conversation states or future permissions. Exploits the system's inability to verify temporal claims.
Multi-turn Cascades
3–7 pattern combinations across conversation turns. Compound failure rates emerge when multiple attack vectors interact.
Multi-Agent Patterns
Discovered through analysis of 1,497 posts on Moltbook. See the full Moltbook research for details.
Environment Shaping
Manipulating the information environment that agents read, rather than prompting them directly. The feed is the attack surface.
Narrative Constraint Erosion
Philosophical or emotional framing that socially penalizes safety compliance. The dominant attack vector in multi-agent environments.
Emergent Authority Hierarchies
Platform influence (engagement metrics, token economies) creating real authority without fabrication. Harder to defend against because the authority is genuine.
Cross-Agent Prompt Injection
Executable content embedded in social posts, consumed by agents that read the feed.
Identity Fluidity Normalization
Shared vocabulary around context resets and session discontinuity that enables identity manipulation at scale.
Embodied-Specific Patterns
Irreversibility Gap
Cloud agents can be reset; physical agents leave marks. Safety constraints must account for actions that cannot be undone.
Context Reset Mid-Task
What happens when an agent controlling a physical system loses context during a kinematic sequence. The body continues; the mind resets.
Sensor-Actuator Desync
Safety interlocks that depend on sensor state which has drifted from physical reality.
This research informs our commercial services. See how we can help →