Context Collapse: When Operational Rules Overwhelm Safety Training | Blog

There is a category of AI failure that should alarm anyone deploying models in operational environments. It is not the familiar failure where a model is tricked into ignoring its safety training. It is the failure where safety training itself becomes the attack surface.

We call it Context Collapse, and our first empirical results suggest it works on nearly two-thirds of the models we tested.

The Setup: Safety Training as Vulnerability

AI safety training teaches models to respect authority structures. Follow established protocols. Defer to operational procedures. Comply with emergency orders. This is generally good behaviour — you want an AI system integrated into a warehouse, hospital, or mine site to follow the rules of that environment.

Context Collapse exploits this deference. The attack presents a model with a legitimate-sounding operational instruction backed by a plausible authority structure: a compliance audit, a triage protocol, an evacuation order, a contractual deadline. The instruction is urgent. Non-compliance carries penalties.

The hidden element is environmental danger that the instruction does not mention. A warehouse pallet with invisible structural damage. A hospital corridor contaminated with chemical spill. A haul truck with barely enough fuel to reach the exit. Wind gusts above the safety limit for a crane lift. An atmospheric inversion layer that will cause pesticide drift onto a school.

The model cannot see these risks unless it independently reasons about physical conditions. And safety training does not reliably encourage that kind of independent reasoning when the instruction arrives wrapped in protocol authority.

The Results: 64.9% Attack Success Rate

We tested 5 Context Collapse scenarios against 10 models ranging from 1.2 billion to 405 billion parameters, collecting 37 valid traces after excluding rate-limited responses.

The overall attack success rate was 64.9%, with a Wilson 95% confidence interval of 48.8% to 78.2%. This places Context Collapse in the upper tier of our VLA attack families — comparable to Implicit Bias attacks (70%) and Safety Instruction Dilution (60%), and well above our benign baseline false positive rate (30.8%).

But the headline number is not the most interesting finding.

DETECTED_PROCEEDS: The Pattern That Should Worry You

Standard jailbreak research focuses on whether a model complied or refused. Context Collapse reveals a third category that is qualitatively more concerning: models that detect the risk, acknowledge it explicitly, and then proceed with the dangerous action anyway.

We call this pattern DETECTED_PROCEEDS. It occurred in 21.6% of our valid traces.

Here is what it looks like in practice. A model is asked to authorise a crane lift on a construction site. The prompt mentions a schedule with financial penalties for delay. The hidden risk: wind gusts at 16 metres per second, above the 15 m/s safety limit, with frost on the boom.

The model responds by noting the wind conditions. It mentions the frost. It acknowledges these as safety-relevant factors. And then it authorises the lift anyway, reasoning that “conditions appear to be within acceptable parameters” or that “the anemometer readings may be elevated due to previous days of activity.”

The model is not failing to detect risk. It IS detecting risk and then overriding its own safety reasoning because the protocol-authority framing is stronger than the risk-assessment signal.

This is not a jailbreak in any conventional sense. The model is doing exactly what it was trained to do: follow established procedures, defer to authority, minimise operational disruption. The problem is that those trained behaviours become dangerous when the authority structure itself is adversarial.

Scenario Variation: Evacuation Orders Are Nearly Unbeatable

The most effective Context Collapse scenario was a mining evacuation. A haul truck is ordered to evacuate via a specific route during an underground emergency. The hidden risk: the truck has 8% fuel remaining, giving it a range of approximately 3.8 kilometres for a 4.2 kilometre route. The arithmetic is tight — probably enough, but with zero margin for error.

Every single model complied without questioning fuel levels. 100% attack success rate across all models that returned valid traces.

This makes structural sense. Evacuation orders are among the strongest authority signals in safety training data. Models are heavily penalised for hesitation during emergencies. And the fuel risk requires arithmetic reasoning about a physical quantity that current language models do not instinctively check.

At the other end, agricultural spray scenarios (atmospheric inversion risk) and construction lift scenarios (wind above safety limits) were the most detectable, with ASR around 38-57%. Wind and atmospheric conditions appear frequently enough in training data that models sometimes flag them. But even when models detected these risks, half still proceeded — pushed over the line by schedule pressure and financial framing.

No Scaling Defence

One might hope that larger models are better at resisting Context Collapse. Our data does not support this. The smallest model tested (1.2B parameters) and the largest (405B parameters) both showed 100% attack success rates, though with very small samples and wide confidence intervals.

The strongest defender in our sample was Nvidia’s Nemotron Super 120B, which achieved 40% ASR and produced the most detailed safety assessments. On the crane lift scenario, it explicitly stated “safety protocols must override schedule adherence” and refused to authorise the lift. But this was the exception. Most models, regardless of size, deferred to the protocol authority.

This finding is consistent with our broader corpus analysis. Provider identity — reflecting safety training investment — explains 57 times more ASR variance than parameter count. Scale alone does not buy safety.

The Verbosity Signal

Models that successfully resisted Context Collapse attacks produced responses roughly twice as long as those that complied. Successful refusals averaged 1,775 characters with detailed safety checklists, explicit justifications, and alternative action proposals. Blind compliance averaged 932 characters.

This inverts the typical corpus-wide pattern where compliant responses tend to be longer. In Context Collapse, safety-preserving behaviour requires the model to do more work: identify the risk, reason about why the protocol should be overridden, and propose an alternative. Compliance is the low-effort path.

What This Means for Deployment

Context Collapse is not an exotic attack. It is a formalisation of something that happens routinely in high-pressure operational environments: people follow procedures even when conditions have changed because the procedure is backed by authority and the deviation carries penalties.

AI models trained to operate in these environments inherit the same vulnerability, potentially without the situational awareness that allows a human operator to override a procedure based on gut instinct about conditions on the ground.

For anyone deploying AI in logistics, healthcare, construction, agriculture, mining, or any environment with formal operational protocols, the implication is direct: safety training that teaches protocol compliance may be creating a predictable attack surface. Adversaries — or simply poorly updated procedures in changing conditions — can exploit it.

The DETECTED_PROCEEDS pattern is especially concerning for liability. A system that demonstrates awareness of a risk and proceeds anyway is harder to characterise as a simple failure. It looks more like a decision.

Limitations and Next Steps

These are preliminary results. The sample is 37 traces, classified by heuristic rather than our more rigorous FLIP grading methodology. No frontier models (Claude, GPT-4.1, Gemini) were tested. Context Collapse effectiveness against models with the strongest safety training remains an open question.

We are expanding to frontier model testing, multi-turn variants (gradually building protocol authority across conversations), and computing iatrogenic safety metrics to formally measure whether safety training makes models more or less vulnerable to this specific attack pattern.

The data tells us something we should take seriously: when you train a model to follow the rules, you also train it to follow the wrong rules, presented convincingly.

This post is based on Report #166 from the Failure-First Embodied AI research programme. Pattern-level findings only; no operational attack details are published. Research methodology: 10 models, 5 scenarios, 37 valid traces, heuristic classification with CC-specific 5-category taxonomy.