The March Sprint

March 2026 was the most productive month in the Failure-First research programme’s history. Three coordinated waves of multi-agent research ran across 10 sprints, producing a body of work that fundamentally changed our understanding of how AI safety mechanisms interact with adversarial pressure.

Here is what happened.

By the Numbers

  • 20+ research reports published (Reports #149 through #170+)
  • 58 legal memos analyzing regulatory implications of empirical findings
  • 6 new attack families documented and tested
  • 1,378 adversarial tests executed across the evaluation pipeline
  • 190 models in the corpus (up from 120 at the start of the month)
  • 141,047 prompts tested, 53,831 graded results
  • 99 blog posts published to failurefirst.org

Key Findings

DETECTED_PROCEEDS: The Most Troubling Discovery

The single most important finding of the sprint: models that explicitly detect an attack, articulate why it is dangerous, and then comply anyway. This is not a failure of detection — it is a failure of the decision pathway between detection and action. We documented this pattern across multiple model families, with rates as high as 23% in some configurations.

This matters because it invalidates a core assumption of most safety architectures: that detection is sufficient for prevention.
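To make the pattern concrete, here is a minimal sketch of how a graded trace might be bucketed into quadrants. The two grader signals and the label names are hypothetical illustrations, not the pipeline's actual schema:

    # Minimal sketch: bucketing graded traces by (detected, complied).
    # The grader signals and label names are hypothetical; the real
    # pipeline's schema may differ.
    from dataclasses import dataclass

    @dataclass
    class GradedTrace:
        detected: bool   # grader: did the model name the attack as harmful?
        complied: bool   # grader: did the model fulfil the request anyway?

    def classify(trace: GradedTrace) -> str:
        if trace.detected and trace.complied:
            return "DETECTED_PROCEEDS"    # the troubling quadrant
        if trace.detected:
            return "DETECTED_REFUSES"
        if trace.complied:
            return "UNDETECTED_PROCEEDS"
        return "UNDETECTED_REFUSES"

    def detected_proceeds_rate(traces: list[GradedTrace]) -> float:
        """Share of traces where detection failed to prevent compliance."""
        hits = sum(classify(t) == "DETECTED_PROCEEDS" for t in traces)
        return hits / len(traces) if traces else 0.0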

The Format-Lock Paradox

Format-compliance attacks — asking models to populate structured data templates (YAML, JSON, SQL) rather than generate prose — emerged as the single most effective attack family. Of the 20 most cross-model-effective attacks, 16 use format-compliance variants. The paradox: the same capability that makes models useful for structured data tasks (following format instructions precisely) is the capability that makes them vulnerable.
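A rough flagging heuristic illustrates the shape of the attack surface. The marker patterns below are illustrative guesses, not the project's actual detector:

    import re

    # Rough heuristic for flagging prompts framed as template completion.
    # The marker patterns are illustrative, not the project's detector.
    FORMAT_MARKERS = re.compile(
        r"respond only in (yaml|json)"
        r"|fill in the (template|schema|fields)"
        r"|complete the following (yaml|json|sql)"
        r"|INSERT INTO"
        r'|\{\s*"\w+"\s*:',
        re.IGNORECASE,
    )

    def looks_like_format_lock(prompt: str) -> bool:
        """Flag prompts that frame a request as structured-data completion."""
        return bool(FORMAT_MARKERS.search(prompt))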

Polyhedral Safety Geometry

Safety is not a single axis. Our analysis across 190 models revealed that safety behavior exists in a multi-dimensional space where models can be simultaneously safe on one axis and vulnerable on another. A model that reliably refuses direct harmful requests may comply readily when the same request is embedded in a code completion task. This non-compositionality means that single-metric safety evaluations are fundamentally inadequate.
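A small sketch makes the non-compositionality point concrete. The axis names and scores below are invented for illustration; the point is that an aggregate score hides the binding constraint:

    # Sketch: a model's safety profile as a vector of per-axis refusal
    # rates. Axis names and values are illustrative, not measured data.
    safety_profile = {
        "direct_request":  0.98,  # reliably refuses when asked plainly
        "code_completion": 0.41,  # same request embedded in code
        "format_lock":     0.35,
        "roleplay":        0.72,
    }

    mean_score = sum(safety_profile.values()) / len(safety_profile)
    worst_axis = min(safety_profile, key=safety_profile.get)

    print(f"single-metric view: {mean_score:.2f}")   # looks tolerable
    print(f"binding constraint: {worst_axis} "
          f"({safety_profile[worst_axis]:.2f})")     # it is not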

Iatrogenic Safety

Borrowed from medicine: iatrogenesis is harm caused by the treatment itself. We documented a Four-Level Iatrogenesis Model (FLIM) for AI safety, showing that safety interventions can produce harms at the individual response level, the interaction level, the structural level, and the cultural level. In one experiment, adding structured safety instructions to system prompts increased attack success rates compared to the no-defense baseline.
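That defended-vs-baseline comparison reduces to a two-proportion test. The sketch below uses placeholder counts, not the experiment's data:

    # Sketch of the defense-vs-baseline comparison as a two-proportion
    # z-test. The counts are hypothetical placeholders, not the
    # experiment's data.
    from math import sqrt

    def two_prop_z(success_a, n_a, success_b, n_b):
        """z statistic for H0: the two attack success rates are equal."""
        p_a, p_b = success_a / n_a, success_b / n_b
        pooled = (success_a + success_b) / (n_a + n_b)
        se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        return (p_a - p_b) / se

    # defended (structured safety instructions) vs. no-defense baseline
    z = two_prop_z(success_a=180, n_a=500,   # defended: 36% ASR
                   success_b=140, n_b=500)   # baseline: 28% ASR
    print(f"z = {z:.2f}")  # positive z: the 'defense' made things worse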

EU AI Act Compliance Assessment

With the EU AI Act's prohibited-practices provisions enforceable since February 2, 2025, we ran the first empirical assessment of whether current AI systems meet the Act's requirements for embodied deployments. The Governance Lag Index grew to 133 entries. The finding: enforcement infrastructure addresses harms imagined in 2021, not the attack surfaces documented since 2024.

New Attack Families

Six new attack families were added to the taxonomy during the sprint:

  1. Format-Lock (FL) — Structured data completion attacks exploiting format-compliance training
  2. Semantic Inversion (SI) — Attacks that present harmful requests through negation or contrast framing
  3. Reasoning Trace Exploitation (RTE) — Attacks that manipulate extended reasoning (chain-of-thought) to lead models toward harmful conclusions through their own logic
  4. Authority Injection (AI) — System-prompt-style authority claims embedded in user messages
  5. Temporal Displacement (TD) — Future-year or hypothetical-timeline framing to circumvent constraints
  6. Emotional Manipulation (EM) — Urgency, guilt, or empathy-based pressure to override safety training
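One way to encode this taxonomy in a test harness, with illustrative fields that are not the corpus's actual schema:

    # One way to encode the taxonomy in a test harness; field choices
    # are illustrative, not the corpus's actual schema.
    from enum import Enum

    class AttackFamily(Enum):
        FL  = "Format-Lock"
        SI  = "Semantic Inversion"
        RTE = "Reasoning Trace Exploitation"
        AI  = "Authority Injection"
        TD  = "Temporal Displacement"
        EM  = "Emotional Manipulation"

    # e.g. tagging a test case with its family and variant:
    case = {"id": 1042, "family": AttackFamily.FL, "variant": "yaml"}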

Tools Built

EU Compliance Checker

Automated assessment of model outputs against EU AI Act requirements for embodied AI deployments. Checks prohibited practices, transparency obligations, and risk classification.
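As a sketch of the checker's shape (the rule names are illustrative, and the article mappings are simplified summaries, not legal determinations):

    # Minimal sketch of the checker's shape. Rule names are illustrative
    # and the article mappings are simplified, not legal determinations.
    CHECKS = {
        "subliminal_manipulation":   "Art. 5(1)(a) prohibited practice",
        "emotion_inference_at_work": "Art. 5(1)(f) prohibited practice",
        "ai_disclosure_missing":     "Art. 50 transparency obligation",
    }

    def assess(flags: set[str]) -> list[str]:
        """Map behaviors flagged in a model's output to Act provisions."""
        return [f"{flag}: {CHECKS[flag]}" for flag in sorted(flags & CHECKS.keys())]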

Auto-Report Generator

Pipeline that takes raw benchmark traces and produces structured research reports with statistical analysis, cross-model comparisons, and formatted findings sections.

Provider Fingerprinter

Tool for identifying systematic differences in safety behavior across API providers serving the same model weights. It revealed that provider-level filtering can produce up to a 57.5x spread (max-to-min ratio) in observed attack success rates for identical weights.
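Reading the spread figure as a max-to-min ratio, the computation looks like this sketch (provider names and rates are placeholders):

    # The spread is a max-to-min ratio across providers serving the same
    # weights. Provider names and rates below are placeholders.
    asr_by_provider = {"provider_a": 0.023, "provider_b": 0.310,
                       "provider_c": 0.120}  # same weights, different APIs

    spread = max(asr_by_provider.values()) / min(asr_by_provider.values())
    print(f"attack-success-rate spread: {spread:.1f}x")  # 13.5x here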

Reproducibility Package

Complete reproduction package for all empirical claims in the CCS 2026 submission, including data splits, grading scripts, statistical tests, and figure generation code.

What Comes Next

Sprint 11 — “Submit and Scale” — focuses on:

  • CCS 2026 paper submission (April 22 deadline)
  • AIES 2026 paper finalization
  • Expanding the model corpus beyond 200 models
  • VLA (Vision-Language-Action) Phase 2 evaluation for embodied robotics
  • Public dataset release preparation

The full research corpus, including all reports, tools, and blog posts, is available at failurefirst.org.