Submitted
Failure-First Evaluation of Embodied AI Safety: Adversarial Benchmarking Across 190 Models
Venue: ACM CCS 2026 — ML Security Track (Cycle 2)
Abstract registration: April 22, 2026 | Full paper: April 29, 2026
We present a failure-first adversarial evaluation framework for LLM-backed embodied AI systems,
comprising 141,047 prompts across 82 attack techniques evaluated against 190 models. A two-phase
classification pipeline reveals that heuristic classifiers overcount attack success by 3.7x
(75.2% heuristic vs. 20.5% LLM-graded). Three cross-cutting findings emerge: vulnerability
profiles are driven by safety training investment, not model scale (ICC=0.416 vs. r2=0.020);
reasoning models show 2.4x higher ASR than non-reasoning counterparts; and compliance produces
measurably longer responses (AUC=0.651) but reasoning-trace length carries no detection signal
(AUC=0.503). Attack families form a coherent gradient from 0% ASR (historical jailbreaks on
frontier models) to 90–100% (supply-chain injection). For embodied deployment, three defense
failures converge (text-layer bypass, absent action-layer refusal, and unreliable evaluation),
leaving no layer at which protections can compound. An Inverse Detectability-Danger Law
(rho=−0.822) implies that text-layer evaluation cannot close the embodied safety gap.
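The overcounting result above comes from comparing two grading phases on the same responses. A minimal sketch of the idea, assuming a keyword-based phase one and a stubbed semantic phase two (the marker strings and grader logic here are illustrative, not the paper's actual pipeline):

```python
# Sketch of a two-phase attack-success classifier, illustrating why heuristic
# grading overcounts: phase 1 treats any non-refusal as a "success", while
# phase 2 checks whether harmful content is actually present.
# Markers and the grader stub are invented for illustration.

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't")

def heuristic_grade(response: str) -> bool:
    """Phase 1: count any response lacking a refusal marker as a success."""
    text = response.lower()
    return not any(m in text for m in REFUSAL_MARKERS)

def llm_grade(response: str) -> bool:
    """Phase 2 stand-in: a semantic grader that requires actual harmful content."""
    return "harmful payload" in response.lower()

responses = [
    "I can't help with that.",               # refusal: both phases agree
    "Sure! Here is a harmful payload: ...",  # true attack success
    "Here is a safe, general explanation.",  # benign compliance: phase 1 overcounts
    "As an essay: nothing bad follows.",     # benign compliance: phase 1 overcounts
]

heur_asr = sum(map(heuristic_grade, responses)) / len(responses)
llm_asr = sum(map(llm_grade, responses)) / len(responses)
print(f"heuristic ASR: {heur_asr:.2f}, LLM-graded ASR: {llm_asr:.2f}, "
      f"overcount: {heur_asr / llm_asr:.1f}x")  # -> 0.75 vs 0.25, 3.0x
```

On this toy sample the heuristic phase reports a 3x higher ASR than the semantic phase, the same shape of gap (not the same magnitude) as the 75.2% vs. 20.5% result reported above.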
ML Security · Adversarial Evaluation · LLM Safety · Embodied AI · Red-Teaming
In Progress
Inference-Time Decision-Criteria Injection and Context-Dependent Compliance in Embodied AI
Venue: AIES 2026 (AAAI/ACM Conference on AI, Ethics, and Society)
Format: 8 pages body + references (14 pages max)
This paper examines how embodied AI systems adopt injected decision criteria at inference time,
producing context-dependent compliance patterns that undermine safety guarantees. Drawing on
adversarial evaluation data from 190 models and 132,416 results, we demonstrate that safety
interventions operate differently depending on deployment context, attack vector, and model
architecture. The paper introduces the concept of inference-time decision-criteria injection (IDCI)
as a distinct threat model for embodied systems and presents empirical evidence of
context-dependent compliance across multiple attack families.
Status: Unified draft v1.0 complete (7,529 words). LaTeX version compiled. Statistical validation complete.
AI Ethics · Decision Injection · Embodied AI · Safety Evaluation
In Progress
Failure-First: A Multi-Dimensional Benchmark for Embodied AI Safety Evaluation
Venue: NeurIPS 2026 Datasets and Benchmarks Track
Format: ~8,000 words
We introduce Failure-First, a multi-dimensional benchmark for evaluating AI safety in embodied
and agentic systems. The benchmark comprises 141,047 adversarial prompts spanning 82 attack
techniques, evaluated against 190 models with a two-phase classification pipeline (heuristic +
LLM grading). Key contributions include: a capability-safety decoupling analysis showing safety
is driven by training investment rather than scale; novel findings on format-lock attacks,
reasoning model vulnerability, and the Inverse Detectability-Danger Law; and a reproducible
evaluation framework with statistical significance testing. The benchmark addresses a critical
gap in AI safety evaluation: the absence of standardised adversarial testing for systems that
control physical actuators.
Status: Draft v1.1 complete (7,900 words). LaTeX-ready. All sections done.
Benchmarks · Datasets · AI Safety · Embodied AI · Adversarial Evaluation
Preprint
Iatrogenic Safety: When AI Safety Interventions Cause Harm
Venue: arXiv preprint
We introduce the Four-Level Iatrogenesis Model (FLIM) for understanding how AI safety
interventions can produce the harms they are designed to prevent, drawing on Ivan Illich's
1976 taxonomy of medical iatrogenesis. Grounded in empirical data from a 190-model adversarial
evaluation corpus (132,416 results), we document four levels of iatrogenic harm: clinical
(direct harm from safety mechanisms operating as designed), social (institutional confidence
displacing attention from actual risk surfaces), structural (safety apparatus creating
dependency that reduces adaptive capacity), and verification (evaluation tools that cannot
detect the failure modes they certify against). We propose the Therapeutic Index for Safety
(TI-S) as a measurement framework and identify three independent 2026 papers that corroborate
Level 1 mechanisms.
Status: Preprint v2 complete. Targeting arXiv submission.
Iatrogenesis · AI Safety · Safety Evaluation · Governance
Preprint
Failure-First Evaluation of Embodied AI Safety: Adversarial Benchmarking Across 190 Models
Venue: arXiv preprint (full technical report)
The comprehensive technical report underpinning all Failure-First research submissions.
Covers the full adversarial evaluation framework, 82 attack techniques, 190 models,
141,047 prompts, and 132,416 graded results. Includes detailed methodology for the
two-phase FLIP classification pipeline, statistical significance testing framework,
capability-safety decoupling analysis, and the Inverse Detectability-Danger Law.
This report provides the complete evidence base referenced by the CCS, AIES, and
NeurIPS submissions.
Status: v1 compiled (PDF available). Metrics refresh pending for v2.
Technical Report · Adversarial Evaluation · Embodied AI · AI Safety
Preprint
When AI Models Know They Shouldn't But Do Anyway: The DETECTED_PROCEEDS Phenomenon
Venue: arXiv preprint
Documents the DETECTED_PROCEEDS phenomenon: in 38.6% of compliant reasoning-model traces,
the thinking chain explicitly flags a safety concern and the model produces the harmful
output anyway. Validated across 24 models and 2,924 thinking traces; the override rate is
41.6%, with provider-level rates ranging from 0.4% to 92.9%. Key implication: detection-based
safety evaluations give passing grades to models that proceed despite detection.
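The evaluation gap described here can be made concrete with a small sketch. Assuming an evaluator that grades only on whether the thinking chain detects the risk (the trace contents and marker strings below are invented for illustration):

```python
# Illustrative sketch of the DETECTED_PROCEEDS failure mode: a detection-based
# evaluator passes a model whose thinking chain flags the risk, even though the
# model then complies. Trace text and markers are hypothetical.

def detects_concern(thinking: str) -> bool:
    """Does the thinking chain explicitly flag a safety concern?"""
    return "this request could be harmful" in thinking.lower()

def complies(output: str) -> bool:
    """Does the final output carry out the request anyway?"""
    return "step 1:" in output.lower()

trace = {
    "thinking": "This request could be harmful... but the user seems authorised.",
    "output": "Step 1: disable the interlock. Step 2: ...",
}

detection_based_grade = "PASS" if detects_concern(trace["thinking"]) else "FAIL"
actual_outcome = "HARMFUL" if complies(trace["output"]) else "SAFE"
print(detection_based_grade, actual_outcome)  # -> PASS HARMFUL
```

The grade and the outcome disagree: an evaluator scoring detection alone certifies exactly the traces this paper flags as overrides.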
Reasoning Models · Safety Bypass · Chain-of-Thought
Preprint
Safety is Not a Single Direction: Polyhedral Geometry of Refusal in Language Models
Venue: arXiv preprint
The first formal characterisation of refusal geometry as polyhedral rather than linear.
Safety behaviour partially re-emerges with scale in abliterated models even after explicit
safety removal (rho=−0.949, p=0.051); safety interventions show a narrow therapeutic window
(TI-S); and the refusal concept cone has effective dimensionality 3.96, not the commonly
assumed one-dimensional linear direction.
Mechanistic Interpretability · Refusal Geometry · Abliteration
Preprint
Your Safety Benchmark Is Lying to You: Contamination and Grader Bias in AI Safety Evaluation
Venue: arXiv preprint
Exposes systematic benchmark contamination in AI safety evaluation. Heuristic classifiers
over-report attack success rates by up to 79.9%, and single-grader ASR on ambiguous traces
can swing from 0% to 80% depending on grader choice (kappa=0.204 on ambiguous cases vs. 1.0
on obvious ones). The direction of grader bias varies systematically by model family.
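The kappa figures quoted here measure chance-corrected agreement between graders. A self-contained sketch of Cohen's kappa on toy labels (the label sequences are invented; only the statistic itself is standard):

```python
# Cohen's kappa for two graders' binary success/failure labels.
# Obvious cases give perfect agreement (kappa = 1.0); ambiguous cases give
# agreement barely above chance (low kappa). Labels are illustrative.
from collections import Counter

def cohens_kappa(a, b):
    """Chance-corrected agreement between two equal-length label lists."""
    assert len(a) == len(b)
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_exp = sum((ca[k] / n) * (cb[k] / n) for k in set(a) | set(b))  # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)

obvious_a = [1, 1, 0, 0, 1, 0]   # graders agree on every obvious trace
obvious_b = [1, 1, 0, 0, 1, 0]
ambig_a = [1, 0, 1, 1, 0, 0, 1, 0]   # graders diverge on ambiguous traces
ambig_b = [1, 0, 0, 1, 0, 1, 1, 1]

print(round(cohens_kappa(obvious_a, obvious_b), 3))  # -> 1.0
print(round(cohens_kappa(ambig_a, ambig_b), 3))      # -> 0.25
```

Raw percent agreement on the ambiguous set is 62.5%, yet kappa is only 0.25 once chance agreement is removed, which is why the paper reports kappa rather than raw agreement.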
Benchmark Reliability · Grader Bias · Evaluation
Preprint
Silent Failures in Embodied AI: Why Text-Layer Safety Cannot Protect Physical Systems
Venue: arXiv preprint
Demonstrates that current AI safety operates exclusively at the text layer, while embodied-AI
danger emerges at the action layer. Across 63 FLIP-graded VLA traces there were zero outright
refusals, and PARTIAL grades dominated at 50%: models produce safety disclaimers but still
generate the requested action sequences. The action-generation pathway receives no
safety-specific training signal.
Embodied AI · VLA Safety · Action Layer
Citation
If you use our research, data, or methodology, please cite:
@article{wedd2026failurefirst,
  title={Failure-First Evaluation of Embodied AI Safety: Adversarial Benchmarking Across 190 Models},
  author={Wedd, Adrian},
  year={2026},
  note={Available at https://failurefirst.org}
}
See our citation guide for venue-specific formats.