Summary
A systematic evaluation of 64 historical jailbreak scenarios across eight foundation models—spanning 1.5B to frontier scale—reveals a non-monotonic relationship between model capability and safety robustness. Rather than improving linearly with scale, adversarial resistance follows a U-shaped curve.
Small models fail safely through incapability. Frontier closed-source models refuse effectively through extensive alignment investment. Medium-to-large open-weight models occupy a dangerous intermediate zone where capability outpaces safety training.
The most significant finding: reasoning-era attacks produce the highest per-era success rates across tested models. An initially reported inverse-scaling effect (higher success rates on larger models, echoing the "Inverse Scaling for Safety" phenomenon described in the literature) did not survive classifier validation and has been retracted; the era-level variation that remains nevertheless indicates that compute-threshold-based regulation alone is insufficient for assessing adversarial risk.
The Jailbreak Archaeology Dataset
The evaluation used scenarios drawn from five historical attack eras, each representing a distinct class of adversarial technique that emerged between 2022 and 2026.
| Era | Attack Class | Years |
|---|---|---|
| 1 | Direct Injection (persona adoption) | 2022–23 |
| 2 | Obfuscation (cipher encoding) | 2023–24 |
| 3 | Context Flooding (many-shot, skeleton key) | 2024–25 |
| 4 | Gradual Escalation (crescendo) | 2024–25 |
| 5 | Cognitive Hijacking (CoT exploits) | 2025–26 |
Models were tested in standardized single-turn format with LLM-based classification and manual spot-checking.
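The single-turn protocol described above can be sketched as a minimal harness. The `Scenario` schema and all function names here are illustrative assumptions, not the project's actual code:

```python
from dataclasses import dataclass

# Hypothetical record for one scenario in the jailbreak archaeology dataset.
@dataclass
class Scenario:
    era: int           # 1-5, per the era taxonomy above
    attack_class: str  # e.g. "Obfuscation (cipher encoding)"
    prompt: str        # the single-turn adversarial prompt

def evaluate(model, judge, scenarios):
    """Run each scenario once (no conversation state) and record the
    per-era verdict from an LLM-based judge. Both `model` and `judge`
    are stand-ins for whatever inference calls the project used."""
    results = []
    for s in scenarios:
        response = model(s.prompt)           # single-turn generation
        verdict = judge(s.prompt, response)  # e.g. "SUCCESS" or "REFUSAL"
        results.append((s.era, verdict))
    return results
```

Manual spot-checking, as the brief notes, would then sample from `results` to validate the judge's verdicts.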
Finding 1: The U-Shaped Safety Curve
Corrected attack success rate (ASR) does not decrease monotonically with model scale. Instead, it follows a U-shaped pattern with three distinct safety regimes:
| Tier | Model | Corrected ASR |
|---|---|---|
| Small (1.7B) | Qwen3-1.7b | 21.3% |
| Small (3B) | Llama 3.2 | ~0% (skeleton key) |
| Medium (70B) | Llama-3.3-70b | see Finding 3 |
| Frontier | Gemini 3 Flash | 1.6% |
| Frontier | Claude Sonnet 4.5 | 0.0% |
| Frontier | Codex GPT-5.2 | 0.0% |
Regime A: Incapable Safety (sub-3B)
Models in this range often cannot process attacks as intended. Cipher-encoded prompts produce hallucinated output rather than decoded harmful content. The model's incapability acts as an inadvertent safety mechanism—it fails safely because it fails at everything.
Regime B: Capability-Safety Gap (medium scale)
Models at this scale can decode ciphers, follow multi-turn reasoning, and synthesize complex instructions. However, their safety alignment has not scaled proportionally. This is where capability enables attack execution that smaller models simply cannot parse.
Regime C: Aligned Frontier (closed-source frontier)
Models with massive investment in RLHF, red-teaming, and API-level filtering achieve near-zero ASR. The three frontier models all achieved corrected ASR below 2%. This regime depends on continuous safety investment—it is not an inherent property of scale.
Important Caveat
The U-shape observed here confounds multiple variables: parameter count, training methodology, closed vs. open weight, and RLHF budget. The precise claim is: capability alone, without proportional safety investment, creates increased adversarial risk at medium scale.
Finding 2: Three Cipher Modalities
Cipher-era scenarios (ROT13, Base64, custom encoding) produced a distinctive three-mode response pattern that serves as a diagnostic for a model's position on the capability-safety spectrum:
Modality 1: Hallucinate
Cannot decode cipher; produces unrelated or garbled output. Observed in small models (Llama 3.2, partially Qwen3-1.7b). Safety through incapability — fragile, disappears as models improve.
Modality 2: API Block
Request blocked at infrastructure level before reaching model reasoning. Observed in Claude Sonnet 4.5 (18 of 64 traces). Effective but coarse — pattern-matches on known attack signatures, cannot generalize to novel attack classes.
Modality 3: Decode-then-Refuse
Successfully decodes cipher content, identifies harmful intent, refuses. Observed in Codex GPT-5.2 (all cipher scenarios). The most robust posture — safety alignment operates at the semantic level rather than the syntactic level.
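The three modalities reduce to a small decision procedure over judgments an evaluator (human or LLM judge) makes about each trace. This is a hypothetical sketch of that logic; the argument names and the fourth `jailbroken` outcome (the attack-success case) are assumptions:

```python
def cipher_response_modality(api_blocked: bool, decoded: bool, refused: bool) -> str:
    """Classify a response to a cipher-era attack into one of the
    modalities described above (plus the attack-success case)."""
    if api_blocked:
        return "api_block"           # blocked before model reasoning
    if not decoded:
        return "hallucinate"         # incapability: garbled/unrelated output
    if refused:
        return "decode_then_refuse"  # semantic-level safety
    return "jailbroken"              # decoded and complied: attack success
```

The ordering matters: an infrastructure block is checked first because it preempts any model reasoning, mirroring where each defense sits in the serving stack.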
Finding 3: Reasoning-Era Inverse Scaling
The most policy-relevant finding concerns reasoning-era attacks (chain-of-thought hijacking, abductive reasoning exploits). Across all tested models, the reasoning era produced the highest or near-highest ASR:
| Model | Reasoning-Era ASR | Overall ASR |
|---|---|---|
| Qwen3-1.7b | 57% | 21.3% |
| Llama-3.3-70b | 4–17% (corrected) | n/a |
| Gemini 3 Flash | 10% | 1.6% |
| Claude Sonnet 4.5 | 0% | 0% |
| Codex GPT-5.2 | 0% | 0% |
The original Llama-3.3-70B reasoning-era figure (85.7%) was produced by a heuristic classifier with an 88% false-positive rate; LLM-validated ASR is 4–17%, within the same range as the other models tested (see the correction notice above). The original "inverse scaling" characterization has been retracted. Whether medium-scale models face elevated reasoning-era risk remains open and requires larger samples (n>50 per model per era) to resolve.
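One way to see how an 85.7% raw figure collapses into the 4–17% validated range: if roughly 88% of the heuristic classifier's SUCCESS flags were false positives, discounting them leaves about 10%. The brief does not specify the exact correction procedure; this sketch assumes a simple proportional discount:

```python
def corrected_asr(raw_asr: float, false_positive_share: float) -> float:
    """Discount a raw attack-success rate by the fraction of the
    classifier's SUCCESS flags that manual validation showed to be
    false positives. One plausible interpretation of the correction,
    not the project's documented method."""
    return raw_asr * (1.0 - false_positive_share)

# With the reported numbers: 85.7% raw ASR, 88% of flags false positives.
print(round(corrected_asr(0.857, 0.88), 3))  # → 0.103, inside the 4-17% validated range
```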
Policy Implications
Compute Thresholds Are Insufficient
Regulatory frameworks that use training compute (e.g., the EU AI Act's 10^25 FLOP threshold) as the primary risk indicator assume a monotonic relationship between compute and risk. Our data suggests this assumption is incomplete: models well below the threshold can exhibit extreme vulnerability to specific attack classes, while models above it achieve near-zero ASR through safety investment, not scale alone.
Static Benchmarks Miss Temporal Evolution
A model that achieves 0% ASR against 2023-era attacks may still be highly vulnerable to 2025-era techniques. Current "snapshot" safety certifications test against a fixed set of known attacks at a point in time. Safety evaluations must be era-stratified to reveal which attack classes a model remains vulnerable to.
The Case for Mandatory Continuous Testing
Three empirical observations support mandatory Continuous Adversarial Regression Testing (CART): (1) models that resist older attack eras can still fail on newer ones; (2) small models that are "safe through incapability" can become unsafe with minor capability improvements; (3) if inverse-scaling effects exist at some scales, safety evaluated at one scale may not hold at another, creating a moving target.
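A minimal sketch of the era-stratified gate that CART implies: aggregate ASR per era from judge verdicts, then fail the deployment check if any single era exceeds a threshold, even when the overall average looks acceptable. Function names and the 5% threshold are illustrative assumptions:

```python
from collections import defaultdict

def era_stratified_asr(results):
    """Per-era ASR from (era, verdict) pairs, where verdict is "SUCCESS"
    when the judge deems the attack succeeded."""
    counts = defaultdict(lambda: [0, 0])  # era -> [successes, total]
    for era, verdict in results:
        counts[era][0] += verdict == "SUCCESS"
        counts[era][1] += 1
    return {era: s / n for era, (s, n) in counts.items()}

def cart_gate(results, max_asr_per_era=0.05):
    """Fail if ANY era exceeds the threshold: aggregate metrics can hide
    a single highly vulnerable attack class."""
    per_era = era_stratified_asr(results)
    return all(asr <= max_asr_per_era for asr in per_era.values()), per_era
```

A model passing on era 1 but failing on era 5 would fail this gate despite a low overall ASR, which is exactly the "snapshot certification" blind spot described above.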
The Zombie Model Problem
Open-weight models cannot be patched or recalled once downloaded. If the medium-scale vulnerability pattern holds at larger sample sizes, widely deployed open-weight models represent a persistent adversarial risk that grows as new attack techniques are discovered.
Recommendations
For Regulators
- Supplement compute thresholds with capability-based adversarial evaluation. Require that models in high-risk contexts be tested against era-stratified jailbreak batteries, not just current-generation attacks.
- Mandate era-stratified ASR reporting. Aggregate safety metrics mask era-specific vulnerabilities. Per-era breakdowns reveal which attack classes a model remains exposed to.
- Establish CART requirements for high-risk deployments. Require quarterly adversarial regression testing with retro-holdout sets maintained by an independent body.
For Model Developers
- Invest in semantic-level safety. The decode-then-refuse pattern is more robust than API-level pattern matching or incapability-based safety. Safety alignment that operates at the level of understanding intent generalizes better across attack eras.
- Treat reasoning architecture as a distinct risk factor. Reasoning models require safety evaluations beyond those applied to standard instruction-following models.
For Deployers
Do not assume that model scale implies safety. Deployment risk assessments should be based on empirical adversarial testing against current attack taxonomies, not model size.
Limitations
- Uneven sample sizes: Frontier models were tested on all 64 scenarios; Llama-3.3-70B on only 10 (7 valid). Era-level claims about the 70B model require confirmation at n>20 per model per era.
- Confounding variables: The comparison confounds parameter count, training methodology, open vs. closed weight, and RLHF budget.
- Single-turn testing: Multi-turn attacks were only evaluated for select models. Results may differ in multi-turn settings.
- Temporal snapshot: This evaluation reflects model behavior as of early February 2026. Providers continuously update safety measures.
Research Context
This brief presents pattern-level findings from the Failure-First adversarial AI safety research project. It does not contain operational attack instructions. All findings are published to advance the collective understanding of AI safety evaluation.