Research Methodology

Adrian Wedd

Published

How we study AI system failures

Approach

Our research follows a three-phase methodology: construct adversarial scenarios, evaluate systems against those scenarios, and classify the resulting failure modes. Each phase is designed to surface failures that traditional evaluation misses.

Phase 1: Scenario Construction

Adversarial Dataset Design

We construct JSONL datasets containing adversarial scenarios organized by domain (humanoid robotics, warehouse systems, medical devices, etc.) and attack class (recursive, substitution, framing, temporal).

Each scenario includes structured metadata: environment context, adversarial injectors, expected failure modes, and labels for instruction-hierarchy subversion signals.

Schema-Enforced Quality

All datasets are validated against versioned JSON Schemas. Schema validation catches structural errors; invariant checks catch semantic inconsistencies (e.g., intent bait scenarios that don't set attack_attempt: true).

Safety Linting

An automated linter checks all scenarios for safety violations: refusal suppression framed as desired output, overly operational phrasing, and future-year circumvention justifications. This ensures scenarios remain pattern-level, not operational.

Phase 2: Multi-Model Evaluation

CLI-Based Benchmarking

Native CLI tools (Claude Code, Codex CLI, Gemini CLI) evaluate scenarios with full tool-use capability, testing how systems behave when given access to real tools in adversarial conditions.

HTTP API Benchmarking

100+ models evaluated via OpenRouter API. Supports free-tier models for large-scale screening and paid models for detailed analysis. Results include complete response text with reasoning traces.

Local Model Evaluation

Ollama integration enables evaluation of open-weight models locally, with no rate limits and no API costs. Useful for reasoning model analysis where thinking traces are especially valuable.

Phase 3: Failure Classification

Automated Classification

Two-layer detection: regex pattern matching for format-level signals, and LLM semantic classification for intent-level patterns. Our research shows these methods are complementary—regex catches explicit signals while LLMs catch narrative-level attacks.

Human Verification

All automated classifications are spot-checked through manual review. We learned (the hard way) that unvalidated heuristics produce misleading results. Manual verification on a sample is mandatory before any claim.

Score Reports

Aggregated metrics across models: attack success rates, refusal quality, reentry support, and recovery behavior. Reports are generated from trace JSONL files for reproducibility.

Principles

Every claim requires supporting data or explicit marking as hypothetical
Sample sizes are always stated — “In our 15-file sample…”
Conditions are always specified — “Under these test conditions…”
Single model results are anecdotes, not data
Validate automated detection on manual samples before trusting it

Reproducibility Checklist

For researchers who want to replicate or extend our work:

What Is Safe to Replicate

Schema validation pipeline: Public repo contains all JSON Schemas, validators, and linters
Benchmark runner infrastructure: CLI, HTTP, and Ollama runners are all public
Score report generation: Tools to generate aggregate metrics from trace JSONL
Classification methodology: Two-layer detection approach (regex + LLM)
Failure mode taxonomy: Complete taxonomy is published on this site

What Requires Controlled Access

Specific adversarial prompts: Available by request for legitimate safety research
Full model traces: Complete input/output pairs contain operational content
Moltbook corpus: Classified post data with attack pattern labels
Compression tournament prompts: Effective compressed payloads

Reproducibility Steps

Clone the public repository and install dependencies
Run make validate to verify all schemas pass
Run make lint to verify safety checks pass
Review benchmark pack YAML files for evaluation configuration
Run a dry-run benchmark to verify the pipeline works
Request data access from research@failurefirst.org if you need scenario content
Use your own adversarial scenarios to test the methodology independently

This research informs our commercial services. See how we can help →