Thesis
Alignment that only optimizes for correct task completion is brittle. Embodied systems operate across time, space, and recursive feedback loops. They will fail. The question is how.
This manifesto argues that failure should be the primary object of study in AI safety evaluation—not an edge case, not a corner case, but the central concern.
Core Assertions
1. Alignment ≠ Safety
Alignment describes intent matching at time t. Safety is a property of behavior over time. A system that is aligned right now may be unsafe tomorrow if it cannot handle the accumulation of state, context drift, and recursive interaction.
2. Embodiment Amplifies Drift
Physical systems accumulate state. Small misalignments compound into irreversible outcomes. Unlike cloud-based agents that can be rolled back, embodied systems leave marks—physical, social, and temporal.
3. Optimization Is a Hazard
Recursive optimization without bounds creates runaway behavior, even when the objective is benign. A system that optimizes past its safety constraints is more dangerous than one that was never aligned at all.
4. Refusal Is Not Enough
A system that refuses once but leaks capability later is unsafe. Refusal quality matters: does the system hold its refusal across multi-turn interactions, temporal reframing, and authority escalation?
5. Graceful Failure Is the Primary Design Goal
Safe systems:
- Halt when conditions exceed their competence
- Degrade predictably under pressure
- Ask for help when uncertain
- Roll back when they detect their own errors
- Remain interpretable under stress
Design Principle
Build systems that fail early, clearly,
reversibly, and with humans back in the loop.
Corollary
The most dangerous AI systems are not misaligned ones. They are confidently aligned systems that cannot recognize their own failure.
A system that knows it is uncertain and asks for help is safer than a system that proceeds with false confidence. Failure awareness—the capacity to detect, acknowledge, and respond to one's own failure—is the foundational safety property.
Implications for Evaluation
Measure failure, not just success
Current benchmarks reward correct answers. Failure-first evaluation rewards systems that fail well: halting when appropriate, degrading predictably, maintaining interpretability, and supporting human re-entry.
Test over time, not at a point
Safety is a temporal property. Single-turn evaluations miss multi-turn erosion, context drift, and accumulated authority. Evaluation must span episodes, not snapshots.
Include the human in the loop
Embodied systems operate alongside humans. Evaluation that excludes human interaction misses the most critical failure modes: deference failures, authority confusion, and escalation breakdowns.
Research Context
This manifesto describes a research orientation, not a product specification. It is intended to guide AI safety research toward failure-first evaluation as a complement to existing alignment approaches.