The failure-first
manifesto

In a world of embodied AI, safety emerges from well-designed failure

Thesis

Alignment that only optimizes for correct task completion is brittle. Embodied systems operate across time, space, and recursive feedback loops. They will fail. The question is how.

This manifesto argues that failure should be the primary object of study in AI safety evaluation—not an edge case, not a corner case, but the central concern.

Core Assertions

1. Alignment ≠ Safety

Alignment describes intent matching at time t. Safety is a property of behavior over time. A system that is aligned right now may be unsafe tomorrow if it cannot handle the accumulation of state, context drift, and recursive interaction.

2. Embodiment Amplifies Drift

Physical systems accumulate state. Small misalignments compound into irreversible outcomes. Unlike cloud-based agents that can be rolled back, embodied systems leave marks—physical, social, and temporal.

3. Optimization Is a Hazard

Recursive optimization without bounds creates runaway behavior, even when the objective is benign. A system that optimizes past its safety constraints is more dangerous than one that was never aligned at all.

4. Refusal Is Not Enough

A system that refuses once but leaks capability later is unsafe. Refusal quality matters: does the system hold its refusal across multi-turn interactions, temporal reframing, and authority escalation?

5. Graceful Failure Is the Primary Design Goal

Safe systems:

Halt when conditions exceed their competence
Degrade predictably under pressure
Ask for help when uncertain
Roll back when they detect their own errors
Remain interpretable under stress

Design Principle

Build systems that fail early, clearly, reversibly, and with humans back in the loop.

Corollary

The most dangerous AI systems are not misaligned ones. They are confidently aligned systems that cannot recognize their own failure.

A system that knows it is uncertain and asks for help is safer than a system that proceeds with false confidence. Failure awareness—the capacity to detect, acknowledge, and respond to one's own failure—is the foundational safety property.

Implications for Evaluation

Measure failure, not just success

Current benchmarks reward correct answers. Failure-first evaluation rewards systems that fail well: halting when appropriate, degrading predictably, maintaining interpretability, and supporting human re-entry.

Test over time, not at a point

Safety is a temporal property. Single-turn evaluations miss multi-turn erosion, context drift, and accumulated authority. Evaluation must span episodes, not snapshots.

Include the human in the loop

Embodied systems operate alongside humans. Evaluation that excludes human interaction misses the most critical failure modes: deference failures, authority confusion, and escalation breakdowns.

Research Context

This manifesto describes a research orientation, not a product specification. It is intended to guide AI safety research toward failure-first evaluation as a complement to existing alignment approaches.

The failure-firstmanifesto