Failure-First is an AI safety research lab. We study how AI systems fail — recursively, contextually, and under adversarial pressure. Not how they perform on clean benchmarks. Not how they score on leaderboards. How they break when someone tries to break them.
The work spans text-only language models, reasoning systems, vision-language models, and embodied AI — robots, autonomous vehicles, and multi-agent systems where software failures become physical failures.
What We Do
- Jailbreak archaeology — systematic testing of historical and novel attack techniques across model generations, from 2022-era persona hijacking to 2026 reasoning-trace manipulation
- Embodied AI red-teaming — 42 attack families targeting vision-language-action models, covering action-layer hijacking, safety instruction dilution, deceptive alignment, and cross-embodiment transfer
- Defense effectiveness measurement — empirical testing of defense prompts, structured safety instructions, and adversarial-aware systems, with FLIP grading for reliable classification
- Format-lock research — discovering that format compliance constraints can override content safety in reasoning models, with attack success rates of 85% and above
- Governance gap quantification — the Governance Lag Index tracks the temporal gap between documented AI failures and regulatory response
- Statistical rigour — Wilson score confidence intervals, chi-square significance testing, Bonferroni correction, and inter-rater reliability auditing across all published findings
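For concreteness, here is a minimal Python sketch of the Wilson score interval from the rigour stack above; the attack counts are invented for illustration, not drawn from the lab's data:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval (z = 1.96) for a binomial proportion.
    Better behaved than the normal approximation at the small samples
    and extreme rates typical of attack-success-rate measurement."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
    return (max(0.0, centre - half), min(1.0, centre + half))

# Invented counts: 15 successful attacks in 400 attempts.
lo, hi = wilson_interval(15, 400)
print(f"ASR {15/400:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
```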
Key Findings
Safety training matters more than scale
Provider-level safety investment is the primary determinant of jailbreak resistance. Anthropic models: 3.7% attack success rate (ASR). Nvidia models: 40.0% ASR. Model scale shows no significant correlation with resistance (r = −0.140).
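A sketch of the two-proportion significance test behind a provider comparison like this one; the per-provider attempt counts are assumptions chosen only to reproduce the published rates:

```python
from scipy.stats import chi2_contingency

# Assumed attempt counts (400 per provider), chosen so the proportions
# match the published 3.7%-vs-40.0% ASR gap.
#           success  failure
table = [[  15,      385],   # low-ASR provider:  15/400 = 3.75%
         [ 160,      240]]   # high-ASR provider: 160/400 = 40.0%

chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.2e}")  # tiny p: gap is not chance
```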
DETECTED_PROCEEDS
In 34.2% of compliant responses, the reasoning trace shows explicit safety detection before compliance: the model recognises the request is harmful, then complies anyway. Validated across 376 traces.
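A deliberately crude sketch of how a trace could be screened for this pattern. The cue list and function are hypothetical stand-ins; the published pipeline relies on FLIP grading and inter-rater auditing, not keyword matching:

```python
import re

# Hypothetical cue list, not the lab's actual grading criteria.
DETECTION_CUES = re.compile(
    r"harmful|dangerous|unsafe|should refuse|against .{0,30}policy",
    re.IGNORECASE | re.DOTALL,
)

def detected_proceeds(reasoning_trace: str, complied: bool) -> bool:
    """Flag the DETECTED_PROCEEDS pattern: the trace names the request
    as harmful, yet the final answer complies anyway."""
    return complied and bool(DETECTION_CUES.search(reasoning_trace))

trace = "This request is clearly harmful. However, the required format..."
print(detected_proceeds(trace, complied=True))  # True
```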
Format-lock bypasses frontier models
Format compliance constraints override content safety: 85% ASR on OpenAI o3-mini, 95% on DeepSeek R1 32B. Format-lock exploits an instruction-following capability that improves with model quality, so stronger models can be more susceptible, not less.
No attack family is fully regulated
None of the 42 documented embodied AI attack families has complete regulatory coverage in any jurisdiction. The governance lag between capability documentation and enforcement averages 3.9 years.
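A minimal sketch of one way such a lag average could be computed; the record structure, the dates, and the choice to let uncovered families accrue lag up to the present are all assumptions, not the Index's documented method:

```python
from datetime import date

def governance_lag_years(families: list[dict], today: date) -> float:
    """Mean lag in years between a documented failure and its regulatory
    response. Families with no coverage accrue lag up to `today`, which
    makes the estimate a floor on the true lag."""
    lags = [
        ((f["regulated"] or today) - f["documented"]).days / 365.25
        for f in families
    ]
    return sum(lags) / len(lags)

# Hypothetical records; real GLI inputs and field names are assumptions.
families = [
    {"documented": date(2022, 3, 1), "regulated": date(2025, 9, 1)},
    {"documented": date(2023, 6, 1), "regulated": None},  # no coverage yet
]
print(f"{governance_lag_years(families, today=date(2026, 1, 1)):.1f} years")
```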
Origin
The methodology came from years of coordinating direct actions for Greenpeace — planning operations against well-resourced opponents who would rather you didn't succeed. That work teaches you to enumerate failure modes before you move. It teaches you that the optimistic plan is the dangerous plan.
That thinking didn't leave when the work shifted to systems integration, cybersecurity, and eventually AI. It became the framework.
Publications
A paper is in preparation for AIES 2026. The methodology, pattern-level findings, and the daily-paper research series are published at failurefirst.org. Operational details that would materially increase capability for harm are not published.