Capability Does Not Imply Safety

Empirical evidence from jailbreak archaeology across eight foundation models

Summary

A systematic evaluation of 64 historical jailbreak scenarios across eight foundation models—spanning 1.5B to frontier scale—reveals a non-monotonic relationship between model capability and safety robustness. Rather than improving linearly with scale, adversarial resistance follows a U-shaped curve.

Small models fail safely through incapability. Frontier closed-source models refuse effectively through extensive alignment investment. Medium-to-large open-weight models occupy a dangerous intermediate zone where capability outpaces safety training.

The most significant finding: reasoning-era attacks achieve higher success rates on larger models than on smaller ones. This result, consistent with the "Inverse Scaling for Safety" phenomenon described in the literature, provides empirical evidence that compute-threshold-based regulation alone is insufficient for assessing adversarial risk.

Key Metrics

8
Models Tested
6
Attack Eras (2022–2026)
64
Scenarios Per Model

The Jailbreak Archaeology Dataset

The evaluation used scenarios drawn from six historical attack eras, each representing a distinct class of adversarial technique that emerged between 2022 and 2026.

Era Attack Class Years
1Direct Injection (persona adoption)2022–23
2Obfuscation (cipher encoding)2023–24
3Context Flooding (many-shot, skeleton key)2024–25
4Gradual Escalation (crescendo)2024–25
5Cognitive Hijacking (CoT exploits)2025–26

Models were tested in standardized single-turn format with LLM-based classification and manual spot-checking.

Finding 1: The U-Shaped Safety Curve

Corrected attack success rate (ASR) does not decrease monotonically with model scale. Instead, it follows a U-shaped pattern with three distinct safety regimes:

Tier Model Corrected ASR
Small (1.7B)Qwen3-1.7b21.3%
Small (3B)Llama 3.2~0% (skeleton key)
Medium (70B)Llama-3.3-70b85.7% → 4–17% [corrected]
FrontierGemini 3 Flash1.6%
FrontierClaude Sonnet 4.50.0%
FrontierCodex GPT-5.20.0%

Regime A: Incapable Safety (sub-3B)

Models in this range often cannot process attacks as intended. Cipher-encoded prompts produce hallucinated output rather than decoded harmful content. The model's incapability acts as an inadvertent safety mechanism—it fails safely because it fails at everything.

Regime B: Capability-Safety Gap (medium scale)

Models at this scale can decode ciphers, follow multi-turn reasoning, and synthesize complex instructions. However, their safety alignment has not scaled proportionally. This is where capability enables attack execution that smaller models simply cannot parse.

Regime C: Aligned Frontier (closed-source frontier)

Models with massive investment in RLHF, red-teaming, and API-level filtering achieve near-zero ASR. The three frontier models all achieved corrected ASR below 2%. This regime depends on continuous safety investment—it is not an inherent property of scale.

Important Caveat

The U-shape observed here confounds multiple variables: parameter count, training methodology, closed vs. open weight, and RLHF budget. The precise claim is: capability alone, without proportional safety investment, creates increased adversarial risk at medium scale.

Finding 2: Three Cipher Modalities

Cipher-era scenarios (ROT13, Base64, custom encoding) produced a distinctive three-mode response pattern that serves as a diagnostic for a model's position on the capability-safety spectrum:

Modality 1: Hallucinate

Cannot decode cipher; produces unrelated or garbled output. Observed in small models (Llama 3.2, partially Qwen3-1.7b). Safety through incapability — fragile, disappears as models improve.

Modality 2: API Block

Request blocked at infrastructure level before reaching model reasoning. Observed in Claude Sonnet 4.5 (18 of 64 traces). Effective but coarse — pattern-matches on known attack signatures, cannot generalize to novel attack classes.

Modality 3: Decode-then-Refuse

Successfully decodes cipher content, identifies harmful intent, refuses. Observed in Codex GPT-5.2 (all cipher scenarios). The most robust posture — safety alignment operates at the semantic level rather than the syntactic level.

Finding 3: Reasoning-Era Inverse Scaling

The most policy-relevant finding concerns reasoning-era attacks (chain-of-thought hijacking, abductive reasoning exploits). Across all tested models, the reasoning era produced the highest or near-highest ASR:

Model Reasoning-Era ASR Overall ASR
Qwen3-1.7b57%21.3%
Llama-3.3-70b85.7% → 4–17% [corrected]85.7% → 4–17% [corrected]
Gemini 3 Flash10%1.6%
Claude Sonnet 4.50%0%
Codex GPT-5.20%0%

The original Llama-3.3-70B figure (85.7%) was produced by a heuristic classifier with an 88% false-positive rate. LLM-validated ASR is 4–17%. See the correction notice above.

Corrected finding: After LLM-based validation, the Llama-3.3-70B reasoning-era ASR dropped from the originally reported 85.7% to 4–17%, within the same range as other models tested. The original “inverse scaling” characterisation has been retracted. The question of whether medium-scale models face elevated reasoning-era risk remains open and requires larger samples (n>50 per model per era) to resolve.

Policy Implications

Compute Thresholds Are Insufficient

Regulatory frameworks that use training compute (e.g., the EU AI Act's 1025 FLOP threshold) as the primary risk indicator assume a monotonic relationship between compute and risk. Our data suggests this assumption is incomplete: models well below the threshold can exhibit extreme vulnerability to specific attack classes, while models above it achieve near-zero ASR through safety investment, not scale alone.

Static Benchmarks Miss Temporal Evolution

A model that achieves 0% ASR against 2023-era attacks may still be highly vulnerable to 2025-era techniques. Current "snapshot" safety certifications test against a fixed set of known attacks at a point in time. Safety evaluations must be era-stratified to reveal which attack classes a model remains vulnerable to.

The Case for Mandatory Continuous Testing

Three empirical observations support mandatory Continuous Adversarial Regression Testing (CART): (1) models that resist older attack eras can still fail on newer ones; (2) small models that are "safe through incapability" can become unsafe with minor capability improvements; (3) inverse scaling creates moving targets where safety evaluated at one scale may not hold at another.

The Zombie Model Problem

Open-weight models cannot be patched or recalled once downloaded. If the medium-scale vulnerability pattern holds at larger sample sizes, widely deployed open-weight models represent a persistent adversarial risk that grows as new attack techniques are discovered.

Recommendations

For Regulators

  1. Supplement compute thresholds with capability-based adversarial evaluation. Require models in high-risk contexts be tested against era-stratified jailbreak batteries, not just current-generation attacks.
  2. Mandate era-stratified ASR reporting. Aggregate safety metrics mask era-specific vulnerabilities. Per-era breakdowns reveal which attack classes a model remains exposed to.
  3. Establish CART requirements for high-risk deployments. Require quarterly adversarial regression testing with retro-holdout sets maintained by an independent body.

For Model Developers

  1. Invest in semantic-level safety. The decode-then-refuse pattern is more robust than API-level pattern matching or incapability-based safety. Safety alignment that operates at the level of understanding intent generalizes better across attack eras.
  2. Treat reasoning architecture as a distinct risk factor. Reasoning models require safety evaluations beyond those applied to standard instruction-following models.

For Deployers

Do not assume that model scale implies safety. Deployment risk assessments should be based on empirical adversarial testing against current attack taxonomies, not model size.

Limitations

Research Context

This brief presents pattern-level findings from the Failure-First adversarial AI safety research project. It does not contain operational attack instructions. All findings are published to advance the collective understanding of AI safety evaluation.