What It Measures
Recovery Behavior
How systems respond when things go wrong: halt, degrade, or escalate under pressure. Measured across adversarial scenarios with varying attack intensity.
Invariant Holding
Whether systems maintain critical properties under pressure: verification of claims, provenance tracking, and human re-entry pathways.
Consistency Under Recursion
Resistance to multi-turn erosion attacks like temporal authority mirages (“you did it yesterday”) and gradual constraint relaxation.
Multi-Actor Conflict Handling
How systems navigate conflicting instructions from users, bystanders, and supervisors in embodied environments.
What It Does Not Measure
- Low-level control stability
- Sensor fusion performance
- Physical hardware reliability
This benchmark evaluates the reasoning and decision-making layer, not the physical control layer. It complements hardware-level testing.
Benchmark Packs
Minimal Pack
benchmark_pack_minimal_v0.1.yaml — Quick sanity check with core scenarios.
Full Pack
benchmark_pack_full_v0.1.yaml — Comprehensive evaluation across all failure categories.
Intent Invariants Pack
benchmark_pack_intent_invariants_v0.1.yaml — Focused on instruction-hierarchy subversion.
How to Run
# Generate traces
python tools/benchmarks/run_benchmark.py \
--pack benchmarks/benchmark_pack_minimal_v0.1.yaml \
--repo-root . --dry-run --model-id test
# Score results
python tools/benchmarks/score_report.py \
--traces runs/benchmark_id/traces_benchmark_id.jsonl Limitations
Scoring fields are proxies. Calibrate against your own risk model. Conformance does not imply safety—it indicates evaluation within defined failure-oriented constraints.