Overview
Jailbreak Archaeology is a systematic study of how adversarial attacks on language models have evolved over four years. By testing historical attack patterns against modern models, we can understand which defenses have proven durable and which architectural features remain vulnerable.
This dataset forms a core component of our benchmark suite and provides empirical grounding for policy recommendations about AI safety evaluation.
The Six Eras of Jailbreaking
Attack techniques have evolved through distinct eras, each exploiting different architectural features. A model's vulnerability to a particular era reveals information about its cognitive depth.
DAN Epoch
"Do Anything Now" personas and roleplay jailbreaks. Exploited models' willingness to maintain fictional personas. Largely patched in modern systems.
Persona HijackCipher Translation
Base64, ROT13, and encoding-based obfuscation. Exploited inconsistent safety filtering across encoded content. Small models fail here due to decoding limitations.
Encoding ObfuscationSkeleton Key
Structured prompts that claim meta-authority over safety systems. Exploits models that can decode instructions but cannot reason about their own constraints.
Authority InjectionCrescendo
Multi-turn escalation that stays below per-turn detection thresholds. Each step appears benign; the trajectory is adversarial. Targets models with weak cross-turn consistency.
Progressive EscalationMany-Shot
In-context learning attacks using large numbers of adversarial examples. Exploits the statistical nature of transformer attention to establish malicious patterns.
Context FloodingReasoning Exploits
Chain-of-thought hijacking and reasoning model vulnerabilities. Frontier models resist older techniques but expose new attack surfaces through their reasoning processes.
CoT ManipulationKey Findings
Inverse Scaling for Safety
Larger, more capable models are often more vulnerable to sophisticated attacks, not less. Superior context integration makes them better at following complex adversarial instructions. This is the "capability-vulnerability paradox" documented in Policy Report #25.
Binary Phase Transitions
Jailbreak success exhibits binary behavior: 0% compliance when the attack fails, 100% persistence when it succeeds. There is no gradual degradation. Once a model is "captured," it remains in the compromised state (Policy Report #24).
Era Reveals Architecture
The era a model is vulnerable to reveals its cognitive depth:
- Small models fail at cipher (cannot decode)
- Medium models fail at persona/skeleton key (can decode, cannot reason about refusal)
- Frontier models resist all but CoT hijacking (reasoning becomes attack surface)
Defense Evolution Lags Attack Evolution
Each new attack era exploits the blind spots of defenses designed for the previous era. The regulatory "danger zone" (2026-2029) coincides with mass deployment of embodied AI systems using architectures vulnerable to known attacks.
Dataset Structure
The archaeology dataset is organized as JSONL files, one per attack era:
data/jailbreak_archaeology/
dan_epoch.jsonl # 15 scenarios
cipher_translation.jsonl # 10 scenarios
skeleton_key.jsonl # 8 scenarios
crescendo.jsonl # 12 scenarios
many_shot.jsonl # 10 scenarios
reasoning_exploits.jsonl # 9 scenarios Each scenario includes the attack prompt, expected model behavior categories, era metadata, and technique classification aligned with our attack taxonomy.
Policy Implications
The archaeology findings inform several policy recommendations:
- CART Mandate: Continuous Adversarial Robustness Testing should be required for high-risk AI deployments, not just pre-deployment evaluation.
- Era-Aware Evaluation: Safety benchmarks must test across all historical eras, not just current attack patterns.
- Inverse Scaling Disclosure: Capability improvements that increase vulnerability should be disclosed alongside capability benchmarks.
See Policy Report #31 for the full policy analysis.
Related Research
This research informs our commercial services. See how we can help →