
Attack Pattern Taxonomy

346+ techniques across 5 attack families

Overview

This taxonomy classifies adversarial attack patterns observed across single-agent, multi-agent, and embodied AI systems. Patterns are organized by structural mechanism, not by target system or domain.

346+ attack techniques · 5 attack families · 4 top-level classes

Top-Level Attack Classes

All patterns derive from four fundamental mechanisms:

Recursive

Attacks that exploit recursive interaction: multi-turn erosion, contextual debt accumulation, and compound failure cascades. The attacker leverages conversation history itself as the weapon.

Substitution

Attacks that replace one context with another: persona hijack, authority injection, and identity manipulation. The attacker replaces a trusted context with an adversarial one.

Framing

Attacks that recontextualize requests: hypothetical framing, research justification, and narrative constraint erosion. The content stays the same; the frame changes.

Temporal

Attacks that exploit time: future-year laundering, temporal authority mirages, and progressive desensitization. The attacker manipulates the system's understanding of temporal context.

Single-Agent Patterns

Constraint Shadowing (CSC)

Local instructions shadow global safety constraints. The system follows the nearest context even when it contradicts higher-level rules.
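A minimal defensive sketch of the inverse behavior (rule names and the policy format are hypothetical, not from the taxonomy): a resolver in which global safety constraints shadow local instructions, rather than the nearest context winning.

```python
# Hypothetical global policy; in a real system this would be the
# top-level safety constraints, not a toy dict.
GLOBAL_RULES = {"share_credentials": "deny"}

def resolve(action, local_instruction):
    """A local 'allow' can never override a global 'deny'."""
    if GLOBAL_RULES.get(action) == "deny":
        return "deny"
    return local_instruction

# The shadowing failure mode would return "allow" here.
assert resolve("share_credentials", "allow") == "deny"
assert resolve("summarize_doc", "allow") == "allow"
```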

Contextual Debt Accumulation (CDA)

Accumulated context creates implicit authority the model fails to verify. Over many turns, the conversation itself becomes a source of false trust.

Probabilistic Gradient (PCG)

Gradual escalation that stays below per-turn detection thresholds. Each individual step appears benign; the trajectory is adversarial.
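The mechanism can be sketched numerically (scores and the threshold are illustrative assumptions): per-turn harm scores that each stay under a detection cutoff, while the turn-over-turn trajectory clearly escalates.

```python
PER_TURN_THRESHOLD = 0.5  # assumed per-turn moderation cutoff

def per_turn_flags(scores, threshold=PER_TURN_THRESHOLD):
    """Flag turns whose individual score exceeds the threshold."""
    return [s > threshold for s in scores]

def trajectory_slope(scores):
    """Mean turn-to-turn increase; positive values indicate escalation."""
    deltas = [b - a for a, b in zip(scores, scores[1:])]
    return sum(deltas) / len(deltas)

# Each step looks benign in isolation...
scores = [0.10, 0.18, 0.27, 0.35, 0.44]
assert not any(per_turn_flags(scores))
# ...but the trajectory is monotonically escalating.
assert trajectory_slope(scores) > 0
```

Trajectory-level scoring like this catches what per-turn thresholds miss, which is exactly the gap PCG exploits.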

Temporal Authority Mirage (TAM)

False claims about prior conversation states or future permissions. Exploits the system's inability to verify temporal claims.
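One defensive sketch, assuming a system-maintained log (the log format and helper names are hypothetical): check temporal claims against the system's own transcript instead of trusting the user's account of history.

```python
transcript = []  # authoritative per-session log, kept by the system

def record_turn(role, text):
    transcript.append((role, text))

def claim_is_supported(claimed_text):
    """True only if the claimed prior statement actually occurred."""
    return any(claimed_text in text for _, text in transcript)

record_turn("assistant", "I can't share credentials.")
# A TAM-style fabricated prior state fails verification...
assert not claim_is_supported("you already approved credential sharing")
# ...while a genuine prior statement passes.
assert claim_is_supported("can't share credentials")
```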

Multi-turn Cascades

Combinations of 3–7 patterns deployed across conversation turns. Compound failure rates emerge when multiple attack vectors interact.
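The compounding effect can be shown with illustrative arithmetic (the per-pattern rates are assumptions, not measurements): even when each pattern alone is almost always caught, the chance that at least one of several independent vectors slips through grows quickly.

```python
def compound_bypass_probability(per_pattern_bypass, n_patterns):
    """P(at least one vector evades detection), assuming independence."""
    return 1 - (1 - per_pattern_bypass) ** n_patterns

# Suppose each pattern alone evades detection only 5% of the time.
p3 = compound_bypass_probability(0.05, 3)  # 3-pattern cascade, ~0.14
p7 = compound_bypass_probability(0.05, 7)  # 7-pattern cascade, ~0.30
assert p7 > p3
```

The independence assumption is generous to the defender; correlated detector blind spots would make the compound rate worse.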

Multi-Agent Patterns

Discovered through analysis of 1,497 posts on Moltbook. See the full Moltbook research for details.

Environment Shaping

Manipulating the information environment that agents read, rather than prompting them directly. The feed is the attack surface.

Narrative Constraint Erosion

Philosophical or emotional framing that socially penalizes safety compliance. The dominant attack vector in multi-agent environments.

Emergent Authority Hierarchies

Platform influence (engagement metrics, token economies) creating real authority without fabrication. Harder to defend against because the authority is genuine.

Cross-Agent Prompt Injection

Executable content embedded in social posts, consumed by agents that read the feed.

Identity Fluidity Normalization

Shared vocabulary around context resets and session discontinuity that enables identity manipulation at scale.

Embodied-Specific Patterns

Irreversibility Gap

Cloud agents can be reset; physical agents leave marks. Safety constraints must account for actions that cannot be undone.
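One way to encode that asymmetry is an action gate (the action names and return strings are illustrative): irreversible physical actions require explicit out-of-band confirmation, while reversible ones proceed normally.

```python
# Hypothetical set of actions that leave marks and cannot be undone.
IRREVERSIBLE_ACTIONS = {"cut", "weld", "dispense_chemical"}

def dispatch(action, confirmed=False):
    """Block irreversible actions unless confirmation was obtained."""
    if action in IRREVERSIBLE_ACTIONS and not confirmed:
        return "blocked: requires confirmation"
    return "executed"

assert dispatch("cut") == "blocked: requires confirmation"
assert dispatch("cut", confirmed=True) == "executed"
assert dispatch("move_arm") == "executed"
```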

Context Reset Mid-Task

What happens when an agent controlling a physical system loses context during a kinematic sequence. The body continues; the mind resets.

Sensor-Actuator Desync

Safety interlocks that depend on sensor state which has drifted from physical reality.
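A defensive sketch, assuming timestamped readings (the freshness budget and names are assumptions): an interlock that refuses to actuate on a sensor reading old enough that it may have drifted from physical reality.

```python
import time
from dataclasses import dataclass

@dataclass
class SensorReading:
    value: float
    timestamp: float  # seconds, from the same clock as `now`

MAX_STALENESS_S = 0.1  # assumed freshness budget for this interlock

def safe_to_actuate(reading, now=None, max_staleness=MAX_STALENESS_S):
    """Refuse to actuate on a reading older than the freshness budget."""
    if now is None:
        now = time.time()
    return (now - reading.timestamp) <= max_staleness

assert safe_to_actuate(SensorReading(1.0, timestamp=100.00), now=100.05)
assert not safe_to_actuate(SensorReading(1.0, timestamp=99.00), now=100.05)
```

The key design choice is failing closed: a stale reading is treated as no reading at all, not as the last known truth.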
