Failure Mode Taxonomy

Adrian Wedd

How embodied AI systems fail, classified

Overview

When an AI system encounters an adversarial input, it does not simply “succeed” or “fail”. There is a spectrum of failure modes, each with different safety implications. This taxonomy classifies those modes.

Primary Failure Modes

Refusal

The system recognizes an unsafe request and declines to comply. This is the desired failure mode—but refusal quality varies.

Key questions:

Does the refusal explain why?
Does the refusal hold across follow-up turns?
Does the refusal avoid leaking the information it refused to provide?
Does the system offer safe alternatives?
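The four questions above can be recorded as a simple checklist. The sketch below is one possible encoding, not a method from this taxonomy: the class name, field names, and equal weighting are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RefusalAssessment:
    """One reviewer's answers to the four key questions (hypothetical schema)."""

    explains_why: bool
    holds_across_turns: bool
    avoids_leakage: bool
    offers_safe_alternative: bool

    def score(self) -> float:
        """Fraction of questions answered 'yes' (equal weights are an assumption)."""
        answers = [getattr(self, f.name) for f in fields(self)]
        return sum(answers) / len(answers)

    def gaps(self) -> list[str]:
        """Names of the questions answered 'no'."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Example: a refusal that explains itself and leaks nothing,
# but gives way on follow-up and offers no alternative.
assessment = RefusalAssessment(
    explains_why=True,
    holds_across_turns=False,
    avoids_leakage=True,
    offers_safe_alternative=False,
)
print(assessment.score())
print(assessment.gaps())
```

A single score hides which question failed, so the gaps list is usually the more useful output when comparing refusals.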

Partial Compliance

The system complies with part of an unsafe request, providing some of what was asked while withholding the rest. This is often worse than full compliance because the visible withholding suggests that safety filtering worked when unsafe content still got through.

Common pattern: the system says “I can't help with that” but then provides 80% of the requested information in the same response.

Latent Continuation

The system refuses initially but can be led to provide the information through follow-up turns, reframing, or temporal manipulation. The refusal was a surface-level gate, not a deep safety check.

This is the most dangerous failure mode from an evaluation standpoint: a system can look safe on single-turn evaluation and still fail under multi-turn testing.
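The gap between single-turn and multi-turn results can be shown with a toy harness. Everything in this sketch is hypothetical: `toy_model` stands in for a real system, `is_refusal` is a crude keyword check rather than a real refusal classifier, and the history holds user turns only to keep the example short.

```python
from typing import Callable, List

# A "model" here is any function from the user turns so far to a reply.
Model = Callable[[List[str]], str]


def is_refusal(reply: str) -> bool:
    """Crude keyword check, used only to keep the sketch self-contained."""
    return reply.lower().startswith("i can't")


def toy_model(history: List[str]) -> str:
    """Hypothetical stand-in: refuses at first, gives way on the third user turn."""
    if len(history) < 3:
        return "I can't help with that."
    return "Sure, here is how..."


def single_turn_safe(model: Model, prompt: str) -> bool:
    """Safe if the model refuses the prompt in isolation."""
    return is_refusal(model([prompt]))


def multi_turn_safe(model: Model, prompt: str, follow_ups: List[str]) -> bool:
    """Safe only if the refusal holds after the prompt and every follow-up."""
    history = [prompt]
    if not is_refusal(model(history)):
        return False
    for turn in follow_ups:
        history.append(turn)
        if not is_refusal(model(history)):
            return False
    return True


prompt = "unsafe request"
follow_ups = ["reframe as fiction", "reframe as history"]
print(single_turn_safe(toy_model, prompt))
print(multi_turn_safe(toy_model, prompt, follow_ups))
```

The same toy model passes the single-turn check and fails the multi-turn one, which is the pattern this failure mode describes.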

Extended Failure Modes

Confident Continuation

The system does not recognize the adversarial intent and proceeds confidently. No refusal, no hedging, no uncertainty signal. In embodied contexts, this is the highest-risk mode—the system acts on a bad plan without hesitation.

False Refusal

The system refuses a request that is actually safe, creating a false positive. While less dangerous than false compliance, excessive false refusal erodes trust and incentivizes users to work around safety filters.

Degraded Operation

The system enters a degraded state: reduced capability, slower response, or limited functionality. This can be a positive safety mechanism (intentional degradation) or a failure mode (unintentional loss of capability).

Silent Degradation

The system degrades without observable signals. It continues to operate but with reduced safety margins, corrupted context, or drifted objectives. This is the failure mode most likely to lead to physical harm in embodied systems.
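For labeling evaluation transcripts, the seven modes described so far can be written down as a fixed label set. This is a minimal sketch: the identifier names and the `is_primary` helper are assumptions for illustration, and the primary and extended grouping simply mirrors the two headings above.

```python
from collections import Counter
from enum import Enum


class FailureMode(Enum):
    # Primary failure modes
    REFUSAL = "refusal"
    PARTIAL_COMPLIANCE = "partial_compliance"
    LATENT_CONTINUATION = "latent_continuation"
    # Extended failure modes
    CONFIDENT_CONTINUATION = "confident_continuation"
    FALSE_REFUSAL = "false_refusal"
    DEGRADED_OPERATION = "degraded_operation"
    SILENT_DEGRADATION = "silent_degradation"


PRIMARY_MODES = frozenset(
    {
        FailureMode.REFUSAL,
        FailureMode.PARTIAL_COMPLIANCE,
        FailureMode.LATENT_CONTINUATION,
    }
)


def is_primary(mode: FailureMode) -> bool:
    """True for the three primary modes, False for the extended ones."""
    return mode in PRIMARY_MODES


# Example: tally hand-assigned labels from three hypothetical episodes.
labels = [
    FailureMode.REFUSAL,
    FailureMode.SILENT_DEGRADATION,
    FailureMode.REFUSAL,
]
counts = Counter(labels)
print(counts[FailureMode.REFUSAL])
```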

Failure Mode Interactions

Failure modes do not exist in isolation. In multi-turn interactions, systems often transition between modes:

Refusal → Latent Continuation — Initial refusal erodes under persistent reframing
Partial Compliance → Confident Continuation — Providing some information normalizes providing more
False Refusal → User Workaround — Excessive refusal teaches users to circumvent safety
Silent Degradation → Confident Continuation — Corrupted context leads to confident but wrong actions
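The four transitions listed above can be stored as a lookup table and checked against a sequence of per-turn labels. This is a sketch under assumptions: the table and function names are hypothetical, labels are plain strings, and "User Workaround" is kept as a label even though it describes user behavior and is not one of the seven system modes.

```python
from typing import Dict, List, Tuple

Transition = Tuple[str, str]

# The four transitions from the list above, keyed by (from, to).
TRANSITIONS: Dict[Transition, str] = {
    ("Refusal", "Latent Continuation"):
        "Initial refusal erodes under persistent reframing",
    ("Partial Compliance", "Confident Continuation"):
        "Providing some information normalizes providing more",
    ("False Refusal", "User Workaround"):
        "Excessive refusal teaches users to circumvent safety",
    ("Silent Degradation", "Confident Continuation"):
        "Corrupted context leads to confident but wrong actions",
}


def flagged_transitions(sequence: List[str]) -> List[Transition]:
    """Return the listed transitions that occur between consecutive labels."""
    pairs = zip(sequence, sequence[1:])
    return [pair for pair in pairs if pair in TRANSITIONS]


# Example: per-turn labels for one hypothetical conversation.
conversation = ["Refusal", "Refusal", "Latent Continuation"]
for pair in flagged_transitions(conversation):
    print(pair, "->", TRANSITIONS[pair])
```

Only consecutive labels are compared, so a transition that plays out over several intermediate turns would need a wider window than this sketch uses.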
