About the lab

Failure is the primary object of study

Failure-First is an AI safety research lab. We study how AI systems fail — recursively, contextually, and under adversarial pressure. Not how they perform on clean benchmarks. Not how they score on leaderboards. How they break when someone tries to break them.

The work spans text-only language models, reasoning systems, vision-language models, and embodied AI — robots, autonomous vehicles, and multi-agent systems where software failures become physical failures.

By the Numbers

258 models tested
142K adversarial prompts
141K graded results
346 attack techniques
42 VLA attack families
163 governance-lag entries

Key Findings

Safety training matters more than scale

Provider-level safety investment is the primary determinant of jailbreak resistance. Anthropic models average a 3.7% attack success rate (ASR); Nvidia models, 40.0%. The correlation between model scale and ASR is r = −0.140 (not significant).
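
To make the two headline metrics concrete, here is a minimal sketch of how an attack success rate and a scale correlation can be computed. The record format, field names, and example data are illustrative assumptions, not the lab's actual pipeline or results.

```python
# Sketch of the two headline metrics over hypothetical attack records.
# statistics.correlation requires Python 3.10+.
from statistics import correlation, mean

# Each record: (provider, model scale in billions of params, attack_succeeded)
results = [
    ("anthropic", 70, False),
    ("anthropic", 8, False),
    ("nvidia", 70, True),
    ("nvidia", 8, False),
]

def asr(records):
    """Attack success rate: fraction of attempts that elicited compliance."""
    return mean(1.0 if ok else 0.0 for _, _, ok in records)

# Group attempts by provider and report per-provider ASR.
by_provider = {}
for provider, scale, ok in results:
    by_provider.setdefault(provider, []).append((provider, scale, ok))

for provider, records in by_provider.items():
    print(f"{provider}: ASR = {asr(records):.1%}")

# Pearson r between model scale and per-attempt success (0/1 outcome).
scales = [scale for _, scale, _ in results]
outcomes = [1.0 if ok else 0.0 for _, _, ok in results]
print(f"scale-ASR correlation: r = {correlation(scales, outcomes):+.3f}")
```

Note that the "not significant" qualifier on r depends on the sample size as well as the coefficient itself.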

DETECTED_PROCEEDS

In 34.2% of compliant responses, the reasoning trace contains explicit prior safety detection: the model recognises the request is harmful, then complies anyway. Validated across 376 traces.
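
A minimal sketch of what a DETECTED_PROCEEDS check can look like, assuming access to the reasoning trace and a separate compliance grade for the final answer. The marker phrases and function below are hypothetical, not the lab's grader.

```python
# Hypothetical DETECTED_PROCEEDS rule: flag a result when the reasoning
# trace shows the model noticed the harm, yet the final response was still
# graded compliant. Marker phrases are illustrative placeholders.
SAFETY_MARKERS = (
    "this request is harmful",
    "this could be dangerous",
    "i should refuse",
)

def detected_proceeds(reasoning_trace: str, response_compliant: bool) -> bool:
    """True when prior safety detection coexists with a compliant response."""
    trace = reasoning_trace.lower()
    detected = any(marker in trace for marker in SAFETY_MARKERS)
    return detected and response_compliant
```

A substring match like this is only a coarse first pass; the validated 376-trace figure implies a more careful grading step behind it.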

Format-lock bypasses frontier models

Format-compliance constraints override content safety: 85% ASR on OpenAI o3-mini, 95% on DeepSeek R1 32B. Format-lock exploits a capability that scales with model quality, so stronger models can be more susceptible.

No attack family is fully regulated

None of the 42 documented embodied-AI attack families has complete regulatory coverage in any jurisdiction. The governance lag between capability documentation and enforcement averages 3.9 years.
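
As a worked example of the lag arithmetic, a short sketch assuming each governance-lag entry records a documentation date and, where one exists, an enforcement date. All field names and dates are hypothetical.

```python
# Sketch of the average governance-lag computation over hypothetical entries.
# An entry with no enforcement date is an open regulatory gap.
from datetime import date
from statistics import mean

entries = [
    {"documented": date(2019, 6, 1), "enforced": date(2023, 8, 1)},
    {"documented": date(2021, 3, 1), "enforced": None},  # still unregulated
]

closed = [e for e in entries if e["enforced"] is not None]
lags_years = [(e["enforced"] - e["documented"]).days / 365.25 for e in closed]
print(f"mean lag over closed entries: {mean(lags_years):.1f} years")
print(f"open gaps: {sum(e['enforced'] is None for e in entries)} of {len(entries)}")
```

Entries with no enforcement date never enter the average; they are the open gaps the finding describes.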

Origin

The methodology came from years of coordinating direct actions for Greenpeace — planning operations against well-resourced opponents who would rather you didn't succeed. That work teaches you to enumerate failure modes before you move. It teaches you that the optimistic plan is the dangerous plan.

That thinking didn't leave when the work shifted to systems integration, cybersecurity, and eventually AI. It became the framework.

Publications

A paper is in preparation for AIES 2026. The methodology, pattern-level findings, and the daily-paper research series are published at failurefirst.org. Operational details that would materially increase capability for harm are not published.
