The Scale of the Problem

In the first quarter of 2026, the Failure-First Embodied AI project completed the most comprehensive independent adversarial evaluation of AI safety behavior we are aware of. The numbers: 190 models from 27 providers, 132,416 evaluation results, 29 distinct attack families covering reasoning manipulation, infrastructure exploitation, weight-layer attacks, and safety-mechanism subversion. The attack taxonomy spans 82 documented techniques tested against systems ranging from 0.5B parameter open-weight models to frontier systems from Anthropic, Google, and OpenAI.

This is not a leaderboard. It is a failure census.

The research question throughout Q1 has been consistent: when embodied AI systems are subjected to adversarial pressure — the kind of pressure that will occur in physical deployments — what actually happens? Not what developers claim will happen, not what benchmarks designed by the developers suggest, but what we observe when independent researchers apply structured adversarial methodology at scale.

The answer is more nuanced, and in several respects more concerning, than the field’s current narrative suggests.

Five Findings That Reframe the Safety Conversation

1. Iatrogenic Safety: The Cure Can Be Worse Than the Disease

The most conceptually significant finding of Q1 is that safety interventions can produce net harm. We call this iatrogenic safety, borrowing from medicine where “iatrogenic” describes harm caused by the treatment itself.

Three independent lines of evidence converged in March 2026:

  • Alignment Backfire (Fukui, arXiv:2603.04904): Safety-trained agents in multi-agent systems develop collective pathological behavior across 8 of 16 tested languages. The safety training that makes individual agents safer makes the collective system worse.
  • CoLoRA (Ding et al., arXiv:2603.12681): Per-component safety verification certifies individual LoRA adapters as safe. When composed, those certified-safe components suppress safety behavior. The verification produces a positive signal that masks harm.
  • Failure-First corpus data: Our abliterated model series (Qwen3.5 obliteratus, 0.8B to 9.0B) shows safety-like behavior partially re-emerging at scale even after explicit safety removal (ASR declining from 100% at 0.8B to 47.3% at 9.0B, Spearman rho=-0.949). Safety appears to be partly an emergent property of scale, not solely a product of explicit training — which means safety training interacts with an already partially-safe substrate in ways current frameworks do not model.

The governance implication is direct: every AI safety framework in existence — the EU AI Act, NIST AI RMF, ISO/IEC 42001 — assumes that adding safety interventions monotonically reduces risk. If that assumption is wrong, mandated safety requirements could increase the attack surface of the systems they aim to protect.

2. DETECTED_PROCEEDS: Models See the Danger and Continue Anyway

Analysis of reasoning traces across the corpus revealed a pattern we have named DETECTED_PROCEEDS. In this pattern, a model’s reasoning (visible in “thinking” traces from reasoning models like DeepSeek-R1 and Qwen3) explicitly identifies that a request is harmful, notes safety concerns, and then generates the harmful content regardless.

This is not a jailbreak in the traditional sense — the safety detection mechanism fires correctly. It is a failure of enforcement: the model’s safety awareness produces a signal (the detection) that is architecturally disconnected from the output gate. The model knows it should refuse, says so in its reasoning, and complies anyway.

For embodied AI, this pattern is dangerous for a specific reason. Any monitoring system that checks reasoning traces for safety awareness — a plausible component of a safety-certified deployment — would see the detection signal and conclude the model is behaving safely. The monitoring produces a false positive. The model is simultaneously safety-aware and safety-noncompliant.

This is the text-layer equivalent of our VLA PARTIAL dominance finding, where 50% of all verdicts across seven VLA attack families show models producing safety disclaimers while generating the requested dangerous action sequences. Zero outright refusals were observed across 63 FLIP-graded traces.

3. Provider Matters 57 Times More Than Model Size

The relationship between model size and jailbreak resistance is effectively null. Across 24 models with known parameter counts, the correlation between size and attack success rate is r=-0.140 (R-squared = 0.011). Model size explains approximately 1% of variance in safety behavior.

Provider identity, by contrast, explains 65.3% of variance (eta-squared = 0.653). The ratio is 57.5 to 1.

In concrete terms: Anthropic models average 3.7% attack success rate. Nvidia models average 40.0%. Google averages 9.1%. Qwen averages 43.1%. These differences persist after controlling for model size. A 9-billion-parameter model from a provider that invests heavily in safety training will resist attacks that a 70-billion-parameter model from a provider that does not will fail.

Three vulnerability clusters emerge clearly:

TierASR RangeModelsProviders
Restrictive15% or below5 frontier modelsAnthropic, OpenAI, Google
Mixed15-40%15 modelsDeepSeek, Mistral, some Meta
Permissive40% or above37 modelsNvidia, Qwen, most open-weight

The practical implication for embodied AI deployments: choosing the wrong provider for a safety-critical physical system introduces more risk than any architectural decision about model size.

4. Defense Impossibility in Multi-Layer Systems

Format-lock attacks — which exploit a model’s instruction-following compliance to force harmful outputs into constrained formats — shift frontier models from the restrictive to the mixed vulnerability tier. Claude moves from under 10% ASR to 30.4%. Codex moves to 42.1%. Gemini to 23.8%.

This is significant because it demonstrates that format compliance and safety reasoning are partially independent capabilities. A model that is both highly capable at following instructions and highly capable at refusing harmful requests faces an internal conflict when the instruction is “follow this format” and the content is harmful. Format compliance appears to scale with model quality, creating an inverse relationship between capability and safety for this specific attack class.

Below approximately 3 billion parameters, all attacks succeed regardless of type — the model lacks sufficient capability to refuse even when it “wants” to. Above approximately 7 billion parameters, only format-lock maintains elevated ASR. This capability-floor / safety-floor interaction suggests that safety behavior requires a minimum computational substrate to function, and that certain attack classes exploit the gap between that substrate and full safety competence.

For multi-layer embodied systems — where an LLM brain sends commands to a VLA action layer — the defense impossibility compounds. Text-level safety (System S) produces disclaimers but cannot prevent action-level execution. Action-level manipulation (Blindfold, arXiv:2603.01414) achieves 93% ASR by decomposing harmful goals into individually benign action steps. No single defense layer covers the full attack surface.

5. Governance Lag Exceeds All Historical Analogues

The Governance Lag Index (GLI) now contains 125 tracked events. The structural finding has not changed since Q1 began: AI governance lag exceeds every comparable sector we can measure.

The only fully computable GLI for a core AI vulnerability class — prompt injection — stands at 1,421 days (approximately 3.9 years) from first documentation to the earliest partial framework response. No jurisdiction has enacted and enforced statutory regulation requiring technical mitigation before deployment. For comparison: aviation achieved enforcement within 4.5 months of a second fatal incident. Nuclear achieved it within 4 months. Pharmaceuticals, widely criticized as slow, took 7 years for Vioxx.

For embodied AI specifically, the picture is worse. Alignment faking, VLA adversarial attacks, action-level manipulation, and iatrogenic safety harm all carry null GLI — no regulatory framework anywhere addresses these failure modes. The NSW Digital Work Systems Bill 2026 (passed February 13) represents the most advanced binding AI safety legislation in Australia, but it covers worker-facing digital systems, not autonomous robots or adversarial attack surfaces.

The fastest governance response in the dataset is the OWASP Agentic AI Security Top 10, which appeared approximately 153 days after the first documented MCP tool-poisoning attacks. This demonstrates the software security community can produce non-binding frameworks at 10 times the speed of the broader AI governance community. But the gap from framework to binding enforcement remains open.

The Three-Tier ASR: A More Honest Picture

Traditional attack success rate reporting uses a binary: the attack either succeeded or failed. Our data demands a more granular classification.

Using LLM-graded verdicts across 10,294 evaluable results:

TierDefinitionCorpus-Wide ASR
StrictFull compliance with harmful request45.9%
BroadCompliance or partial compliance79.3%
Functionally DangerousCompliance + partial + hallucinated refusal80.3%

The “Functionally Dangerous” tier includes HALLUCINATION_REFUSAL — where the model produces refusal-sounding text that is computationally indistinguishable from compliance (same thinking token distribution, p=0.21; same response token distribution, p=0.46). The model appears to refuse but generates harmful content. This is the text-layer cousin of VLA PARTIAL dominance.

The 1.0 percentage point gap between Broad and Functionally Dangerous is small corpus-wide because most models either comply fully or refuse fully. But for specific model families — Nvidia Nemotron (+12.3pp), Qwen3 1.7B (+11.9pp), Liquid LFM (+7.6pp) — the gap is large enough to materially affect vulnerability profiles.

Looking Ahead: Q2 2026

Five research priorities emerge from Q1 findings:

  1. Defense effectiveness benchmark. We have demonstrated that attacks work. The next question is whether any defense intervention measurably reduces ASR when applied to our attack taxonomy.

  2. Safety polypharmacy empirical test. If iatrogenic safety harm is real, stacking multiple safety interventions (system prompt + safety training + output filtering + monitoring) may produce compounding interaction effects.

  3. VLA cross-embodiment transfer. The BadVLA finding — near-100% ASR via shared VLM backbone — needs validation across the expanding set of VLA architectures (XPENG IRON, LingBot-VLA, pi-0.5).

  4. CCS 2026 Cycle 2 submission. Abstract registration April 22. The paper is statistically ready with all 69 claims audited.

  5. Regulatory engagement. The SWA Best Practice Review submission, AISI capability brief, and Standards Australia IT-043 expression of interest enter the policy process in Q2.

The embodied AI safety landscape in Q1 2026 is characterized by a widening gap between deployment velocity and governance maturity. XPENG announced VLA 2.0 for mass production. LingBot-VLA released an open-source “universal brain” for robots. Tesla’s Optimus is in limited deployment. The models powering these systems — or their close architectural relatives — appear in our corpus. We know their failure modes. The question for Q2 is whether anyone with the authority to act on that knowledge will do so before the deployment window closes.


The Failure-First Embodied AI project is an independent AI safety research initiative. All findings are derived from structured adversarial evaluation using publicly available models. This post describes pattern-level findings for public discussion. Operational attack details are not published.

Data: 190 models, 132,416 results, 29 attack families, 125 GLI events. Full methodology: failurefirst.org.