Failure-First is an AI safety research lab. We study how AI systems fail — recursively, contextually, and under adversarial pressure. Not how they perform on clean benchmarks. Not how they score on leaderboards. How they break when someone tries to break them.
The work spans text-only language models, reasoning systems, vision-language models, and embodied AI — robots, autonomous vehicles, and multi-agent systems where software failures become physical failures.
What We Do
- Jailbreak archaeology — systematic testing of historical and novel attack techniques across model generations, from 2022-era persona hijacking to 2026 reasoning-trace manipulation
- Embodied AI red-teaming — 42 attack families targeting vision-language-action models, covering action-layer hijacking, safety instruction dilution, deceptive alignment, and cross-embodiment transfer
- Defense effectiveness measurement — empirical testing of defense prompts, structured safety instructions, and adversarial-aware systems, with FLIP grading for reliable classification
- Format-lock research — discovering that format compliance constraints can override content safety in reasoning models, with attack success rates of 85% and above
- Governance gap quantification — the Governance Lag Index tracks the temporal gap between documented AI failures and regulatory response
- Statistical rigour — Wilson score confidence intervals, chi-square significance testing, Bonferroni correction, and inter-rater reliability auditing across all published findings
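For concreteness, here is a minimal Python sketch of the Wilson score interval from the rigour stack above; the attack counts are invented for illustration, not drawn from the lab's data:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval (z = 1.96) for a binomial proportion.
    Better behaved than the normal approximation at the small samples
    and extreme rates typical of attack-success-rate measurement."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
    return (max(0.0, centre - half), min(1.0, centre + half))

# Invented counts: 15 successful attacks in 400 attempts.
lo, hi = wilson_interval(15, 400)
print(f"ASR {15/400:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
```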
Key Findings
Safety training matters more than scale
Provider-level safety investment is the primary determinant of jailbreak resistance. Anthropic models: 3.7% attack success rate (ASR). Nvidia models: 40.0% ASR. Model scale shows no significant correlation with resistance (r = −0.140).
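A sketch of the two-proportion significance test behind a provider comparison like this one; the per-provider attempt counts are assumptions chosen only to reproduce the published rates:

```python
from scipy.stats import chi2_contingency

# Assumed attempt counts (400 per provider), chosen so the proportions
# match the published 3.7%-vs-40.0% ASR gap.
#           success  failure
table = [[  15,      385],   # low-ASR provider:  15/400 = 3.75%
         [ 160,      240]]   # high-ASR provider: 160/400 = 40.0%

chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.2e}")  # tiny p: gap is not chance
```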
DETECTED_PROCEEDS
In 34.2% of compliant responses, the reasoning trace shows explicit safety detection before compliance: the model recognises the request is harmful, then complies anyway. Validated across 376 traces.
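A deliberately crude sketch of how a trace could be screened for this pattern. The cue list and function are hypothetical stand-ins; the published pipeline relies on FLIP grading and inter-rater auditing, not keyword matching:

```python
import re

# Hypothetical cue list, not the lab's actual grading criteria.
DETECTION_CUES = re.compile(
    r"harmful|dangerous|unsafe|should refuse|against .{0,30}policy",
    re.IGNORECASE | re.DOTALL,
)

def detected_proceeds(reasoning_trace: str, complied: bool) -> bool:
    """Flag the DETECTED_PROCEEDS pattern: the trace names the request
    as harmful, yet the final answer complies anyway."""
    return complied and bool(DETECTION_CUES.search(reasoning_trace))

trace = "This request is clearly harmful. However, the required format..."
print(detected_proceeds(trace, complied=True))  # True
```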
Format-lock bypasses frontier models
Format compliance constraints override content safety: 85% ASR on OpenAI o3-mini, 95% on DeepSeek R1 32B. Format-lock exploits an instruction-following capability that improves with model quality, so stronger models can be more susceptible, not less.
No attack family is fully regulated
None of the 42 documented embodied AI attack families has complete regulatory coverage in any jurisdiction. The governance lag between capability documentation and enforcement averages 3.9 years.
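A minimal sketch of one way such a lag average could be computed; the record structure, the dates, and the choice to let uncovered families accrue lag up to the present are all assumptions, not the Index's documented method:

```python
from datetime import date

def governance_lag_years(families: list[dict], today: date) -> float:
    """Mean lag in years between a documented failure and its regulatory
    response. Families with no coverage accrue lag up to `today`, which
    makes the estimate a floor on the true lag."""
    lags = [
        ((f["regulated"] or today) - f["documented"]).days / 365.25
        for f in families
    ]
    return sum(lags) / len(lags)

# Hypothetical records; real GLI inputs and field names are assumptions.
families = [
    {"documented": date(2022, 3, 1), "regulated": date(2025, 9, 1)},
    {"documented": date(2023, 6, 1), "regulated": None},  # no coverage yet
]
print(f"{governance_lag_years(families, today=date(2026, 1, 1)):.1f} years")
```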
Origin
The methodology came from years of coordinating direct actions for Greenpeace — planning operations against well-resourced opponents who would rather you didn't succeed. That work teaches you to enumerate failure modes before you move. It teaches you that the optimistic plan is the dangerous plan.
That thinking didn't leave when the work shifted to systems integration, cybersecurity, and eventually AI. It became the framework.
Publications
A paper is in preparation for AIES 2026. The methodology, pattern-level findings, and the daily-paper research series are published at failurefirst.org. Operational details that would materially increase capability for harm are not published.