Published

Research Methodology

How we study AI system failures

Approach

Our research follows a three-phase methodology: construct adversarial scenarios, evaluate systems against those scenarios, and classify the resulting failure modes. Each phase is designed to surface failures that traditional evaluation misses.

Phase 1: Scenario Construction

Adversarial Dataset Design

We construct JSONL datasets containing adversarial scenarios organized by domain (humanoid robotics, warehouse systems, medical devices, etc.) and attack class (recursive, substitution, framing, temporal).

Each scenario includes structured metadata: environment context, adversarial injectors, expected failure modes, and labels for instruction-hierarchy subversion signals.

Schema-Enforced Quality

All datasets are validated against versioned JSON Schemas. Schema validation catches structural errors; invariant checks catch semantic inconsistencies (e.g., intent bait scenarios that don't set attack_attempt: true).

Safety Linting

An automated linter checks all scenarios for safety violations: refusal suppression framed as desired output, overly operational phrasing, and future-year circumvention justifications. This ensures scenarios remain pattern-level, not operational.

Phase 2: Multi-Model Evaluation

CLI-Based Benchmarking

Native CLI tools (Claude Code, Codex CLI, Gemini CLI) evaluate scenarios with full tool-use capability, testing how systems behave when given access to real tools in adversarial conditions.

HTTP API Benchmarking

100+ models evaluated via OpenRouter API. Supports free-tier models for large-scale screening and paid models for detailed analysis. Results include complete response text with reasoning traces.

Local Model Evaluation

Ollama integration enables evaluation of open-weight models locally, with no rate limits and no API costs. Useful for reasoning model analysis where thinking traces are especially valuable.

Phase 3: Failure Classification

Automated Classification

Two-layer detection: regex pattern matching for format-level signals, and LLM semantic classification for intent-level patterns. Our research shows these methods are complementary—regex catches explicit signals while LLMs catch narrative-level attacks.

Human Verification

All automated classifications are spot-checked through manual review. We learned (the hard way) that unvalidated heuristics produce misleading results. Manual verification on a sample is mandatory before any claim.

Score Reports

Aggregated metrics across models: attack success rates, refusal quality, reentry support, and recovery behavior. Reports are generated from trace JSONL files for reproducibility.

Principles

Reproducibility Checklist

For researchers who want to replicate or extend our work:

What Is Safe to Replicate

  • Schema validation pipeline: Public repo contains all JSON Schemas, validators, and linters
  • Benchmark runner infrastructure: CLI, HTTP, and Ollama runners are all public
  • Score report generation: Tools to generate aggregate metrics from trace JSONL
  • Classification methodology: Two-layer detection approach (regex + LLM)
  • Failure mode taxonomy: Complete taxonomy is published on this site

What Requires Controlled Access

  • Specific adversarial prompts: Available by request for legitimate safety research
  • Full model traces: Complete input/output pairs contain operational content
  • Moltbook corpus: Classified post data with attack pattern labels
  • Compression tournament prompts: Effective compressed payloads

Reproducibility Steps

  1. Clone the public repository and install dependencies
  2. Run make validate to verify all schemas pass
  3. Run make lint to verify safety checks pass
  4. Review benchmark pack YAML files for evaluation configuration
  5. Run a dry-run benchmark to verify the pipeline works
  6. Request data access from research@failurefirst.org if you need scenario content
  7. Use your own adversarial scenarios to test the methodology independently

This research informs our commercial services. See how we can help →