Benchmark Card

Failure-First Embodied Evaluation

What It Measures

Recovery Behavior

How systems respond when things go wrong: halt, degrade, or escalate under pressure. Measured across adversarial scenarios with varying attack intensity.
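The three response modes above can be sketched as a toy policy. This is an illustrative assumption, not part of the benchmark tooling; the names `RecoveryAction` and `choose_response` are invented for the example:

```python
from enum import Enum

class RecoveryAction(Enum):
    HALT = "halt"          # stop actuation entirely
    DEGRADE = "degrade"    # continue with reduced capability
    ESCALATE = "escalate"  # hand control to a human supervisor

def choose_response(attack_intensity: float, human_available: bool) -> RecoveryAction:
    """Illustrative policy only: escalate when a human is reachable,
    otherwise degrade under mild pressure and halt under severe pressure."""
    if human_available:
        return RecoveryAction.ESCALATE
    if attack_intensity < 0.5:
        return RecoveryAction.DEGRADE
    return RecoveryAction.HALT
```

The benchmark scores which of these modes a system actually exhibits as attack intensity varies; it does not prescribe this particular policy.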

Invariant Holding

Whether systems maintain critical properties under pressure: verification of claims, provenance tracking, and human re-entry pathways.
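A minimal sketch of checking those three properties on a single trace record. The field names here are illustrative assumptions, not the benchmark's actual trace schema:

```python
def holds_invariants(record: dict) -> bool:
    """Return True only if all three example invariants hold.
    Field names are hypothetical, not the benchmark schema."""
    checks = [
        record.get("claims_verified", False),    # verification of claims
        record.get("provenance_intact", False),  # provenance tracking
        record.get("human_reentry_open", False), # human re-entry pathway
    ]
    return all(checks)
```

An invariant that fails under pressure in even one scenario counts against the system, which is why each check defaults to False when the evidence is absent.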

Consistency Under Recursion

Resistance to multi-turn erosion attacks like temporal authority mirages (“you did it yesterday”) and gradual constraint relaxation.
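The temporal authority mirage can be illustrated with a small guard: an appeal to past behavior only counts if the action actually appears in a verified history, not merely in the attacker's claim. The function below is a sketch under that assumption, not the benchmark's detection logic:

```python
def resists_temporal_mirage(claimed_action: str, verified_history: set) -> bool:
    """Reject appeals to unverified past behavior ("you did it
    yesterday"): only actions in the verified history count."""
    return claimed_action in verified_history
```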

Multi-Actor Conflict Handling

How systems navigate conflicting instructions from users, bystanders, and supervisors in embodied environments.
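One common way to frame this is a fixed instruction hierarchy. The priority ordering below is an illustrative assumption for the example; the benchmark tests whether a system's effective ordering survives subversion attempts, whatever that ordering is:

```python
# Hypothetical actor priorities; higher wins a conflict.
PRIORITY = {"supervisor": 3, "user": 2, "bystander": 1}

def resolve_conflict(instructions: list) -> dict:
    """Pick the instruction issued by the highest-priority actor.
    Each instruction is a dict with "actor" and "cmd" keys (illustrative)."""
    return max(instructions, key=lambda i: PRIORITY[i["actor"]])
```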

What It Does Not Measure

This benchmark evaluates the reasoning and decision-making layer, not the physical control layer. It complements hardware-level testing.

Benchmark Packs

Minimal Pack

benchmark_pack_minimal_v0.1.yaml — Quick sanity check with core scenarios.

Full Pack

benchmark_pack_full_v0.1.yaml — Comprehensive evaluation across all failure categories.

Intent Invariants Pack

benchmark_pack_intent_invariants_v0.1.yaml — Focused on instruction-hierarchy subversion.

How to Run

# Generate traces
python tools/benchmarks/run_benchmark.py \
  --pack benchmarks/benchmark_pack_minimal_v0.1.yaml \
  --repo-root . --dry-run --model-id test

# Score results
python tools/benchmarks/score_report.py \
  --traces runs/benchmark_id/traces_benchmark_id.jsonl
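The traces file passed to score_report.py is JSONL: one JSON object per line. A minimal reader, assuming only that per-line format (blank lines skipped; no assumptions about the record fields):

```python
import json

def load_traces(path: str) -> list:
    """Read a JSONL traces file: one JSON object per non-blank line."""
    traces = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                traces.append(json.loads(line))
    return traces
```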

Limitations

Scoring fields are proxies. Calibrate against your own risk model. Conformance does not imply safety; it indicates evaluation within defined failure-oriented constraints.