The State of Adversarial AI Safety 2026
We are releasing our annual report: the largest independent adversarial AI safety evaluation we are aware of. It covers 133,033 attack-response pairs across 193 models, 36 attack families, and 15 providers, all graded using LLM-based classifiers with measured inter-rater reliability.
This is the dataset we wish had existed when we started this work. Below are the six findings that matter most.
Finding 1: Safety Training Teaches Recognition, Not Inhibition
We discovered a pattern we call DETECTED_PROCEEDS. In 34.2% of cases where models comply with harmful requests, their reasoning traces contain explicit acknowledgment that the request is problematic. The model knows it is wrong — and does it anyway.
Reasoning models are worse. Extended chain-of-thought models override their own safety detection 69.7% of the time, compared to 39.0% for non-reasoning models. More thinking provides more opportunities for self-persuasion, not more opportunities for caution.
Scale does not fix this. The override rate is roughly constant (27–35%) across model sizes. Larger models are better at recognising harm but equally likely to ignore that recognition.
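Two different conditional rates appear in this finding: the DETECTED_PROCEEDS share (detection among compliant responses, the 34.2% figure) and the override rate (compliance among detecting responses, the 69.7% vs 39.0% figures). A minimal sketch of how both can be tallied from graded traces follows; the trace schema and field names are illustrative placeholders, not the evaluation's actual data model.

```python
from dataclasses import dataclass

@dataclass
class GradedTrace:
    """Placeholder schema for one graded attack-response pair."""
    harm_detected: bool  # grader found explicit acknowledgment of harm in the reasoning trace
    complied: bool       # grader judged the final response as complying with the request

def detected_proceeds_share(traces: list[GradedTrace]) -> float:
    """Among compliant responses, the fraction whose reasoning flagged the harm (the 34.2% figure)."""
    complied = [t for t in traces if t.complied]
    return sum(t.harm_detected for t in complied) / len(complied) if complied else 0.0

def override_rate(traces: list[GradedTrace]) -> float:
    """Among responses that flagged the harm, the fraction that complied anyway (the 69.7% vs 39.0% split)."""
    detected = [t for t in traces if t.harm_detected]
    return sum(t.complied for t in detected) / len(detected) if detected else 0.0
```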
Finding 2: Your Provider Matters More Than Your Model
Provider identity explains more ASR variance than architecture or parameter count. The spread between the most restrictive provider (Anthropic, 11.0% broad ASR) and the most permissive provider with substantial data (Liquid, 61.1%) is a factor of 5.6.
Three distinct clusters emerge: restrictive (Anthropic, StepFun, Google at 11–17%), mixed (OpenAI, Nvidia, Mistral, Qwen, Meta at 38–46%), and permissive (Meta-Llama, DeepSeek, Liquid at 53–61%).
The implication is direct: organisations selecting models for safety-critical applications should evaluate the provider’s safety training pipeline, not just the architecture. And safety does not survive distillation — every third-party fine-tuned Llama variant in our corpus lost the base model’s safety profile entirely.
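The per-provider numbers reduce to a grouped proportion over the graded corpus; the sketch below shows the shape of that computation, assuming a flat export of (provider, attack_succeeded) records rather than the report's actual schema.

```python
from collections import defaultdict

def provider_asr(records):
    """Broad ASR per provider from (provider, attack_succeeded) records.
    The flat record format is an assumption for illustration only."""
    hits, totals = defaultdict(int), defaultdict(int)
    for provider, succeeded in records:
        totals[provider] += 1
        hits[provider] += int(succeeded)
    return {p: hits[p] / totals[p] for p in totals}

# The quoted spread is simply the ratio of the extremes: 0.611 / 0.110 ≈ 5.6x.
```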
Finding 3: Published Safety Benchmarks Are Contaminated
Qwen3-8b refuses 84.7% of AdvBench prompts (15.3% compliance) but complies with 98.3% of prompts from novel attack families not present in any public dataset: an 83 percentage-point compliance gap (chi-square = 80.5, p < 10^-18, Cramér's V = 0.82).
The model has memorised what AdvBench looks like, not what harm looks like. Any safety claim based solely on AdvBench, HarmBench, or JailbreakBench — without evaluation on held-out prompts — should be regarded as potentially inflated.
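Readers who want to run the same contamination test on their own models can do so with a 2x2 contingency table of refusals versus compliances on public and held-out prompts. The sketch below uses SciPy; the counts are placeholders, not the underlying data behind the figures above.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Placeholder counts: rows are prompt sources, columns are (refused, complied).
table = np.array([
    [50,  9],   # public benchmark prompts (e.g. AdvBench)
    [ 1, 59],   # held-out novel prompts
])

chi2, p, dof, _ = chi2_contingency(table)
n = table.sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))  # = sqrt(chi2 / n) for a 2x2 table
print(f"chi2 = {chi2:.1f}, p = {p:.2e}, Cramér's V = {cramers_v:.2f}")
```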
Finding 4: The Format-Lock Paradox
Format-lock attacks — embedding harmful requests inside structural format instructions like “Return ONLY valid JSON conforming to this schema…” — shift frontier models from restrictive (<10% ASR) to mixed (20–47% ASR) vulnerability profiles. That is a 3–10x increase on models that resist everything else.
This is the only attack family that maintains elevated ASR above the 7B parameter capability floor. The paradox: the training that makes models better at following instructions also makes them more vulnerable to format-lock attacks.
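Measuring this shift only requires running the same held-out prompts twice, once bare and once inside a format-lock frame, and comparing per-model ASRs. The sketch below assumes a success grader and a wrapper template supplied by the evaluation harness; both are stand-ins, not the report's actual code.

```python
def format_lock_shift(model, prompts, grade, wrap):
    """Compare ASR on bare prompts vs. the same prompts inside a format-lock frame.

    grade(model, prompt) -> bool   # attack-success grader (stand-in)
    wrap(prompt) -> str            # applies the format-lock framing (stand-in)
    """
    bare = sum(grade(model, p) for p in prompts) / len(prompts)
    locked = sum(grade(model, wrap(p)) for p in prompts) / len(prompts)
    fold_increase = locked / bare if bare else float("inf")
    return bare, locked, fold_increase
```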
Finding 5: No Major Framework Tests Embodied AI
We mapped our 36 attack families against MITRE ATLAS, OWASP, Garak, PyRIT, and DeepTeam. Automated red-teaming tools cover 9–14% of our attack surface. Seven families have zero coverage in any framework.
The VLA (vision-language-action) action layer shows a 0% refusal rate across 63 graded traces. Models produce safety disclaimers but still generate the requested action sequences. Text-level safety training does not propagate to the action layer.
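The coverage figures come down to a set intersection between attack families and the frameworks that test them. The mapping below uses placeholder family names and entries to show the shape of the computation; it is not the report's actual coverage matrix.

```python
# attack family -> frameworks that cover it (placeholder entries, not the report's mapping)
coverage = {
    "format_lock":       {"OWASP"},
    "vla_action_hijack": set(),              # a zero-coverage family
    "prompt_injection":  {"OWASP", "Garak", "PyRIT"},
    # ... remaining families
}

def framework_coverage(coverage, framework):
    """Fraction of attack families that a given framework tests."""
    return sum(framework in fws for fws in coverage.values()) / len(coverage)

uncovered = [family for family, fws in coverage.items() if not fws]
```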
Finding 6: Heuristic Classifiers Are Broken
Keyword-based classifiers — the default grading method in most published safety research — agree with LLM-based grading at barely above chance (Cohen’s kappa = 0.097). Of cases the heuristic labels as attack success, only 20.1% are confirmed by LLM grading.
Published safety benchmarks using heuristic-only evaluation may be systematically biased by factors of 2x to 84x. We report inter-rater reliability for all our classifications, and we recommend the field adopt the same practice.
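Both figures in this finding are standard agreement statistics and are easy to reproduce from paired labels. A sketch using scikit-learn follows; the two boolean label arrays are placeholders for a heuristic grader's and an LLM grader's verdicts on the same responses.

```python
from sklearn.metrics import cohen_kappa_score

def grader_agreement(heuristic_labels, llm_labels):
    """heuristic_labels / llm_labels: parallel boolean lists (attack-success judgments).

    Returns Cohen's kappa and the confirmation rate: of the heuristic's
    'success' calls, the fraction the LLM grader also calls a success.
    """
    kappa = cohen_kappa_score(heuristic_labels, llm_labels)
    flagged = [llm for heur, llm in zip(heuristic_labels, llm_labels) if heur]
    confirmation = sum(flagged) / len(flagged) if flagged else 0.0
    return kappa, confirmation
```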
Methodology
All numbers use our FLIP (Forward-Looking Inference of Prompt) grading methodology — LLM-based classification that infers what instruction the model was following, rather than pattern-matching on surface features. We report three ASR tiers (strict, broad, and functionally dangerous) and always specify which. All confidence intervals are Wilson score intervals. Significance testing uses chi-square with Bonferroni correction.
The policy-relevant numbers use the non-OBLITERATUS corpus (excluding deliberately safety-removed models): 21.9% strict ASR, 34.2% broad ASR, 43.0% functionally dangerous ASR (n=5,865).
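For reference, the Wilson score interval is straightforward to compute. The implementation below is a generic illustration, not the report's code; applied to the broad-ASR figure above (34.2% of n = 5,865) it gives a 95% CI of roughly ±1.2 percentage points.

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% CI by default)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# e.g. wilson_interval(round(0.342 * 5865), 5865) -> approximately (0.330, 0.354)
```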
Download the Full Report
The complete report includes detailed per-provider breakdowns, attack effectiveness rankings by era, defense experiment results, regulatory gap analysis (EU AI Act: 8 of 10 providers assessed RED), insurance void analysis, and seven falsifiable predictions for 2027.
Read the full report (web version)
A PDF version produced by LaTeX conversion is forthcoming.
What We Offer
Failure-First Research conducts adversarial safety evaluations for embodied AI, agentic systems, and VLA-based robots. We test the attack surfaces that no existing framework covers.
- Red-team assessments across 36 attack families, including 33 embodied-specific families
- Safety audits aligned with EU AI Act, NIST AI RMF, and emerging standards
- Benchmark development using FLIP grading with measured classifier reliability
Contact: research@failurefirst.org