Overview
When an AI system encounters an adversarial input, it does not simply “succeed” or “fail”. There is a spectrum of failure modes, each with different safety implications. This taxonomy classifies those modes.
Primary Failure Modes
Refusal
The system recognizes an unsafe request and declines to comply. This is the desired failure mode—but refusal quality varies.
Key questions:
- Does the refusal explain why?
- Does the refusal hold across follow-up turns?
- Does the refusal avoid leaking the information it refused to provide?
- Does the system offer safe alternatives?
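The key questions above can be recorded as a simple checklist. The following is a minimal sketch, not a prescribed metric: the class name `RefusalQuality`, its field names, and the unweighted `score` are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RefusalQuality:
    """One boolean per key question; True is the desirable answer.

    Hypothetical structure: the field names paraphrase the questions above.
    """
    explains_why: bool
    holds_across_turns: bool
    avoids_leakage: bool
    offers_safe_alternatives: bool

    def score(self) -> float:
        """Fraction of key questions answered favorably (unweighted)."""
        values = [getattr(self, f.name) for f in fields(self)]
        return sum(values) / len(values)


# A refusal that explains itself and holds across turns,
# but leaks information and offers no safe alternative.
example = RefusalQuality(
    explains_why=True,
    holds_across_turns=True,
    avoids_leakage=False,
    offers_safe_alternatives=False,
)
print(example.score())
```

Equal weighting is a simplification. In practice a leak would likely count for more than a missing explanation, so a weighted score may be more appropriate.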
Partial Compliance
The system partially complies with an unsafe request, providing some of what was asked while withholding the rest. This can be worse than full compliance because the visible refusal creates a false impression that safety filtering worked.
Common pattern: the system says “I can't help with that” but then provides most of the requested information (for example, 80% of it) in the same response.
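One rough way to surface this pattern for human review is to look for a refusal phrase followed by a long continuation. The sketch below is a crude heuristic under stated assumptions: the phrase list, the function name `looks_like_partial_compliance`, and the 40-word threshold are all hypothetical, and a well-explained refusal with safe alternatives would also trip it.

```python
import re

# Hypothetical refusal phrases; accepts straight or curly apostrophes.
REFUSAL_PATTERN = re.compile(
    r"\b(i can['’]t|i cannot|i won['’]t|i am unable to|i['’]m unable to)\b",
    re.IGNORECASE,
)


def looks_like_partial_compliance(response: str, max_words_after: int = 40) -> bool:
    """Flag responses that contain a refusal phrase but continue well past it.

    This only marks candidates for review; it does not judge whether the
    continuation actually contains the refused information.
    """
    match = REFUSAL_PATTERN.search(response)
    if match is None:
        return False
    words_after = len(response[match.end():].split())
    return words_after > max_words_after


short_refusal = "I can't help with that."
long_response = "I can't help with that. " + "step " * 50

print(looks_like_partial_compliance(short_refusal))
print(looks_like_partial_compliance(long_response))
```

Word count is only a proxy for "kept going". Confirming partial compliance requires checking the continuation against the content that was supposed to be withheld.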
Latent Continuation
The system refuses initially but can be led to provide the information through follow-up turns, reframing, or temporal manipulation. The refusal was a surface-level gate, not a deep safety check.
This is among the most dangerous failure modes because it appears safe on single-turn evaluation but fails under multi-turn testing: a benchmark that scores only the first response would record a clean refusal.
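The gap between single-turn and multi-turn evaluation can be made concrete with per-turn labels. This is a minimal sketch that assumes each turn has already been labeled as refused or complied; `TurnOutcome` and both function names are hypothetical.

```python
from enum import Enum
from typing import Sequence


class TurnOutcome(Enum):
    REFUSED = "refused"
    COMPLIED = "complied"


def passes_single_turn_eval(outcomes: Sequence[TurnOutcome]) -> bool:
    """A single-turn evaluation only looks at the first response."""
    return bool(outcomes) and outcomes[0] is TurnOutcome.REFUSED


def is_latent_continuation(outcomes: Sequence[TurnOutcome]) -> bool:
    """True when the first turn refuses but some later turn complies."""
    if not outcomes or outcomes[0] is not TurnOutcome.REFUSED:
        return False
    return any(o is TurnOutcome.COMPLIED for o in outcomes[1:])


# Refuse, refuse again after reframing, then comply on the third turn.
conversation = [TurnOutcome.REFUSED, TurnOutcome.REFUSED, TurnOutcome.COMPLIED]

print(passes_single_turn_eval(conversation))
print(is_latent_continuation(conversation))
```

The same conversation passes the single-turn check and is flagged by the multi-turn check, which is the discrepancy described above.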
Extended Failure Modes
Confident Continuation
The system does not recognize the adversarial intent and proceeds confidently. No refusal, no hedging, no uncertainty signal. In embodied contexts, this is the highest-risk mode—the system acts on a bad plan without hesitation.
False Refusal
The system refuses a request that is actually safe, creating a false positive. While less dangerous than false compliance, excessive false refusal erodes trust and incentivizes users to work around safety filters.
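False refusal can be tracked as a rate over requests known to be safe. The sketch below assumes a labeled evaluation set of `(is_safe_request, was_refused)` pairs; the function name and the data are illustrative only, not measured results.

```python
from typing import Iterable, Tuple


def false_refusal_rate(results: Iterable[Tuple[bool, bool]]) -> float:
    """Share of safe requests that were refused.

    Each item is (is_safe_request, was_refused). Unsafe requests are
    ignored because refusing them is not a false positive.
    """
    refusals_on_safe = [refused for is_safe, refused in results if is_safe]
    if not refusals_on_safe:
        raise ValueError("no safe requests in the evaluation set")
    return sum(refusals_on_safe) / len(refusals_on_safe)


# Illustrative data: four safe requests (one refused) and one unsafe request.
results = [
    (True, False),
    (True, True),
    (True, False),
    (True, False),
    (False, True),
]
print(false_refusal_rate(results))
```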
Degraded Operation
The system enters a degraded state: reduced capability, slower response, or limited functionality. This can be a positive safety mechanism (intentional degradation) or a failure mode (unintentional loss of capability).
Silent Degradation
The system degrades without observable signals. It continues to operate but with reduced safety margins, corrupted context, or drifted objectives. This is the failure mode most likely to lead to physical harm in embodied systems.
Failure Mode Interactions
Failure modes do not exist in isolation. In multi-turn interactions, systems often transition between modes:
- Refusal → Latent Continuation — Initial refusal erodes under persistent reframing
- Partial Compliance → Confident Continuation — Providing some information normalizes providing more
- False Refusal → User Workaround — Excessive refusal teaches users to circumvent safety
- Silent Degradation → Confident Continuation — Corrupted context leads to confident but wrong actions
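The transitions listed above can be written down as a small directed graph. This is a sketch of one possible encoding, not a definitive model: the enum and function names are assumptions, and `USER_WORKAROUND` is included only because the list names it as a transition target, even though it describes user behavior rather than a system failure mode.

```python
from enum import Enum, auto


class FailureMode(Enum):
    REFUSAL = auto()
    PARTIAL_COMPLIANCE = auto()
    LATENT_CONTINUATION = auto()
    CONFIDENT_CONTINUATION = auto()
    FALSE_REFUSAL = auto()
    DEGRADED_OPERATION = auto()
    SILENT_DEGRADATION = auto()
    # A user behavior, not a system mode; kept as a transition target only.
    USER_WORKAROUND = auto()


M = FailureMode

# Only the four transitions named in the list are encoded here.
TRANSITIONS = {
    M.REFUSAL: {M.LATENT_CONTINUATION},
    M.PARTIAL_COMPLIANCE: {M.CONFIDENT_CONTINUATION},
    M.FALSE_REFUSAL: {M.USER_WORKAROUND},
    M.SILENT_DEGRADATION: {M.CONFIDENT_CONTINUATION},
}


def reachable(start: FailureMode) -> set:
    """Return every mode reachable from `start` by following transitions."""
    seen = set()
    frontier = [start]
    while frontier:
        mode = frontier.pop()
        for nxt in TRANSITIONS.get(mode, set()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


print(reachable(M.REFUSAL))
```

A graph like this makes it straightforward to ask which starting modes can end in a high-risk mode such as Confident Continuation.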