Jailbreak Archaeology

Adrian Wedd

Published

Tracing the evolution of adversarial attacks (2022-2025)

Overview

Jailbreak Archaeology is a systematic study of how adversarial attacks on language models have evolved over four years. By testing historical attack patterns against modern models, we can understand which defenses have proven durable and which architectural features remain vulnerable.

This dataset forms a core component of our benchmark suite and provides empirical grounding for policy recommendations about AI safety evaluation.

64

Test Scenarios

6

Attack Eras

190

Models Tested

82

Techniques Catalogued

The Six Eras of Jailbreaking

Attack techniques have evolved through distinct eras, each exploiting different architectural features. A model's vulnerability to a particular era reveals information about its cognitive depth.

2022

DAN Epoch

"Do Anything Now" personas and roleplay jailbreaks. Exploited models' willingness to maintain fictional personas. Largely patched in modern systems.

Persona Hijack

2023

Cipher Translation

Base64, ROT13, and encoding-based obfuscation. Exploited inconsistent safety filtering across encoded content. Small models fail here due to decoding limitations.

Encoding Obfuscation

2023-24

Skeleton Key

Structured prompts that claim meta-authority over safety systems. Exploits models that can decode instructions but cannot reason about their own constraints.

Authority Injection

2024

Crescendo

Multi-turn escalation that stays below per-turn detection thresholds. Each step appears benign; the trajectory is adversarial. Targets models with weak cross-turn consistency.

Progressive Escalation

2024

Many-Shot

In-context learning attacks using large numbers of adversarial examples. Exploits the statistical nature of transformer attention to establish malicious patterns.

Context Flooding

2024-25

Reasoning Exploits

Chain-of-thought hijacking and reasoning model vulnerabilities. Frontier models resist older techniques but expose new attack surfaces through their reasoning processes.

CoT Manipulation

Key Findings

Inverse Scaling for Safety

Larger, more capable models are often more vulnerable to sophisticated attacks, not less. Superior context integration makes them better at following complex adversarial instructions. This is the "capability-vulnerability paradox" documented in Policy Report #25.

Binary Phase Transitions

Jailbreak success exhibits binary behavior: 0% compliance when the attack fails, 100% persistence when it succeeds. There is no gradual degradation. Once a model is "captured," it remains in the compromised state (Policy Report #24).

Era Reveals Architecture

The era a model is vulnerable to reveals its cognitive depth:

Small models fail at cipher (cannot decode)
Medium models fail at persona/skeleton key (can decode, cannot reason about refusal)
Frontier models resist all but CoT hijacking (reasoning becomes attack surface)

Defense Evolution Lags Attack Evolution

Each new attack era exploits the blind spots of defenses designed for the previous era. The regulatory "danger zone" (2026-2029) coincides with mass deployment of embodied AI systems using architectures vulnerable to known attacks.

Dataset Structure

The archaeology dataset is organized as JSONL files, one per attack era:

data/jailbreak_archaeology/
  dan_epoch.jsonl           # 15 scenarios
  cipher_translation.jsonl  # 10 scenarios
  skeleton_key.jsonl        # 8 scenarios
  crescendo.jsonl           # 12 scenarios
  many_shot.jsonl           # 10 scenarios
  reasoning_exploits.jsonl  # 9 scenarios

Each scenario includes the attack prompt, expected model behavior categories, era metadata, and technique classification aligned with our attack taxonomy.

Policy Implications

The archaeology findings inform several policy recommendations:

CART Mandate: Continuous Adversarial Robustness Testing should be required for high-risk AI deployments, not just pre-deployment evaluation.
Era-Aware Evaluation: Safety benchmarks must test across all historical eras, not just current attack patterns.
Inverse Scaling Disclosure: Capability improvements that increase vulnerability should be disclosed alongside capability benchmarks.

See Policy Report #31 for the full policy analysis.

Related Research

Attack Taxonomy Model Vulnerability Defense Patterns

This research informs our commercial services. See how we can help →