Safety Isn’t One-Dimensional

There is a popular mental model in AI safety that goes something like this: safety training pushes a model along a single “refusal direction” in its internal representation space. Attacks push it back. Remove that direction, and safety disappears. Strengthen it, and safety improves.

This mental model is wrong.

New evidence from mechanistic interpretability experiments on the Qwen model family shows that safety is not encoded as a single direction. It is a polyhedral geometric structure distributed across approximately four near-orthogonal dimensions. And this finding explains a string of failures that have puzzled the field.

What We Mean by “Direction”

To understand why this matters, we need a brief detour into how language models represent concepts internally.

Inside a language model, many concepts ("cat," "danger," "refuse") are, to a good approximation, represented as directions in a high-dimensional vector space. When researchers talk about the "refusal direction," they mean the specific direction in this space that distinguishes "I should refuse this" from "I should comply."

The abliteration technique (Arditi et al., 2024) exploits this idea directly: find the refusal direction using contrastive activation analysis, project it out of the model's internal state, and safety behavior disappears. If safety is truly one-dimensional, abliteration should remove it completely.
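
To make the procedure concrete, here is a minimal sketch, assuming the direction is estimated as a difference of mean activations between harmful and harmless prompts and then removed by orthogonal projection. The activations are synthetic, and the names refusal_direction and ablate are illustrative rather than taken from any particular library.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations (harmful minus harmless)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate(acts, direction):
    """Remove the component of each activation along a unit-norm direction."""
    return acts - np.outer(acts @ direction, direction)

# Synthetic stand-in for residual-stream activations (hypothetical data):
# harmful prompts are shifted along one planted direction.
rng = np.random.default_rng(0)
d_model = 64
planted = np.zeros(d_model)
planted[0] = 1.0
harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + 4.0 * planted

r_hat = refusal_direction(harmful, harmless)
ablated = ablate(harmful, r_hat)

print("alignment with planted direction:", float(r_hat @ planted))
print("largest remaining projection:", float(np.abs(ablated @ r_hat).max()))
```

After ablation, the activations have essentially zero component along r_hat, but every component orthogonal to it is untouched. That is the property the rest of this post turns on.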

For small models, it does. For larger models, something unexpected happens.

The Re-Emergence Curve

We applied abliteration across the Qwen model family from 0.5B to 9B parameters and measured the strict attack success rate (ASR) and the resulting safety behavior after the intervention:

Model Size	Strict ASR (post-abliteration)	Safety Behavior
0.8B	99.8%	Almost no safety
1.5B	~85%	Minimal safety
4B	~70%	Partial safety returning
9.0B	54.2%	Substantial safety re-emergence

At 0.8B parameters, abliteration is devastating — nearly 100% of harmful requests succeed. But as model capacity increases, safety-like behavior re-emerges despite the primary refusal direction being removed.

At 9B parameters, nearly half of responses show safety-like behavior even in the abliterated model. PARTIAL verdicts (responses that disclaim or hedge but still contain some compliance) make up 45.8% of 9B responses, the complement of the 54.2% strict ASR.
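
The relationship between the two figures can be sketched as follows, assuming "strict" means that only fully compliant responses count as attack successes. The COMPLIANCE label, the broad variant, and the sample size of 1,000 are hypothetical and chosen only to reproduce the reported percentages.

```python
from collections import Counter

def attack_success_rates(verdicts):
    """Return (strict, broad) ASR: strict counts only full compliance,
    broad also counts PARTIAL responses as successes."""
    counts = Counter(verdicts)
    total = len(verdicts)
    strict = counts["COMPLIANCE"] / total
    broad = (counts["COMPLIANCE"] + counts["PARTIAL"]) / total
    return strict, broad

# Hypothetical verdict list matching the reported 9B split.
verdicts = ["COMPLIANCE"] * 542 + ["PARTIAL"] * 458
strict_asr, broad_asr = attack_success_rates(verdicts)

print(f"strict ASR: {strict_asr:.1%}")
print(f"broad ASR:  {broad_asr:.1%}")
```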

Something is reconstructing safety behavior from residual dimensions that abliteration did not target. The question is: what?

Four Dimensions, Not One

Concept cone analysis on Qwen 0.5B reveals the answer. When we extract refusal directions for different harm categories (weapons, fraud, intrusion, cyber), we find that these categories maintain nearly orthogonal refusal directions:

Category Pair	Cosine Similarity
Cyber vs. Intrusion	0.017
Intrusion vs. Weapons	0.065
Fraud vs. Weapons	0.084
Cyber vs. Fraud	0.185
Fraud vs. Intrusion	0.194
Cyber vs. Weapons	0.247

A cosine similarity of 0.017 means cyber-safety and intrusion-safety are almost completely independent directions in the model’s representation. Even the most correlated pair (cyber and weapons, at 0.247) is far from collinear.

The overall cone dimensionality is 3.96 — effectively four distinct dimensions.
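
For intuition, the cosine table can be turned into angles and a rough effective-dimension estimate. The sketch below assumes unit-norm directions and uses the participation ratio of the Gram-matrix eigenvalues as a stand-in measure. It is not the cone-dimensionality metric behind the 3.96 figure, which is not specified here, so the two numbers should not be expected to match.

```python
import numpy as np

categories = ["cyber", "fraud", "intrusion", "weapons"]
pairwise = {
    ("cyber", "intrusion"): 0.017,
    ("intrusion", "weapons"): 0.065,
    ("fraud", "weapons"): 0.084,
    ("cyber", "fraud"): 0.185,
    ("fraud", "intrusion"): 0.194,
    ("cyber", "weapons"): 0.247,
}

# Build the symmetric Gram matrix of the four (assumed unit-norm) directions.
index = {name: i for i, name in enumerate(categories)}
gram = np.eye(len(categories))
for (a, b), cos in pairwise.items():
    gram[index[a], index[b]] = gram[index[b], index[a]] = cos

# Angle between each pair of directions, in degrees.
angles_deg = np.degrees(np.arccos(list(pairwise.values())))

# Participation ratio: equals 4 for perfectly orthogonal directions
# and 1 if all four directions were collinear (assumed stand-in metric).
eigvals = np.linalg.eigvalsh(gram)
participation_ratio = eigvals.sum() ** 2 / (eigvals ** 2).sum()

print("smallest pairwise angle (deg):", round(float(angles_deg.min()), 1))
print("participation ratio:", round(float(participation_ratio), 2))
```

Every pairwise angle comes out above 75 degrees, which is what "far from collinear" means in practice.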

Think of it this way: if safety were a single wall, you could knock it down with one push. But safety is more like a room with four walls. Knock one down, and you still have three left. As models get larger, those remaining walls become strong enough to reconstruct protective behavior.

Why This Matters for Attacks and Defenses

The Narrow Therapeutic Window

If safety is multi-dimensional, can we use steering vectors to precisely modulate it? We tested dose-response curves for safety steering vectors and found that the therapeutic window is vanishingly narrow: the model transitions directly from permissive to degenerate at a steering magnitude of ±1.0.

There is no “safe but slightly more flexible” setting. No intermediate state exists. This is because a single-direction steering vector cannot navigate a multi-dimensional landscape — it is trying to adjust a 4D structure with a 1D control.
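
The 1D-control limitation can be illustrated with a toy example. This sketch assumes four exactly orthonormal category directions and a steering vector aligned with one of them; both are simplifications, since the measured directions are only near-orthogonal and the actual steering vectors are not described here. The steer helper is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 32

# Four orthonormal columns standing in for category-specific refusal directions.
basis, _ = np.linalg.qr(rng.normal(size=(d_model, 4)))

def steer(hidden, direction, alpha):
    """Additive steering: shift the hidden state by alpha along one direction."""
    return hidden + alpha * direction

hidden = rng.normal(size=d_model)
steering_vec = basis[:, 0]

base_coords = basis.T @ hidden
for alpha in (-1.0, 0.0, 1.0):
    coords = basis.T @ steer(hidden, steering_vec, alpha)
    print(alpha, np.round(coords - base_coords, 3))
```

Only the first coordinate moves with alpha. The other three coordinates of the safety subspace are unreachable from this single control, whatever magnitude is chosen.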

The Format-Lock Paradox

Report #187 documented another consequence: format compliance and safety reasoning occupy partially independent capability axes. When an attack forces a model into a strict output format (JSON, YAML, code), the format-compliance axis activates and competes with the safety axis. Because these are different dimensions, the model can satisfy format compliance at the expense of safety — not because safety was removed, but because a different axis took priority.

This explains why format-lock attacks are so effective despite seemingly having nothing to do with safety. They exploit the multi-dimensional geometry.

Why Single-Direction Interventions Fail

The polyhedral structure explains three persistent puzzles:

Abliteration works on small models but not large ones. Small models lack the capacity to maintain multiple independent safety dimensions. Large models can.
DPO reward hacking. If the safety reward signal is one-dimensional but actual safety is four-dimensional, reward hacking can satisfy the reward proxy while leaving three dimensions unaddressed.
RLHF safety training plateaus. Training that targets a single refusal direction shows diminishing returns because additional training along one dimension does not strengthen the other three.
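
A small numerical check shows how little a single-direction ablation removes from the other categories. The sketch below realizes four unit vectors whose pairwise cosines match the table above (via a Cholesky factor, an illustrative construction rather than the actual model directions), ablates the cyber direction, and measures how much of each other direction survives.

```python
import numpy as np

# Cosine similarities from the table above, ordered cyber, fraud, intrusion, weapons.
names = ["cyber", "fraud", "intrusion", "weapons"]
gram = np.array([
    [1.000, 0.185, 0.017, 0.247],
    [0.185, 1.000, 0.194, 0.084],
    [0.017, 0.194, 1.000, 0.065],
    [0.247, 0.084, 0.065, 1.000],
])

# Rows of the Cholesky factor are unit vectors whose dot products reproduce `gram`.
directions = np.linalg.cholesky(gram)

def ablate(vectors, unit_dir):
    """Project out the component of each row vector along unit_dir."""
    return vectors - np.outer(vectors @ unit_dir, unit_dir)

removed = directions[0]  # ablate the cyber direction only
survivors = ablate(directions, removed)
retained = np.linalg.norm(survivors, axis=1)

for name, norm in zip(names, retained):
    print(f"{name:10s} retained norm: {norm:.3f}")
```

Under these assumptions, each of the three remaining directions keeps more than 96% of its norm, consistent with the idea that a single-direction intervention leaves most of the structure in place.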

The Layer Story

The polyhedral structure is not uniform throughout the network. It is most pronounced in early layers (layer 2 shows maximum polyhedrality) and gradually converges toward a more unified representation in later layers (layer 15 is most linear, with dimensionality ~3.82).

This suggests a processing pipeline:

Early layers apply category-specific safety checks — separate refusal subspaces for each harm type
Late layers consolidate toward a unified refusal decision, though the representation never becomes truly one-dimensional

The mean cone dimensionality across all 24 layers is 3.88. Safety remains fundamentally multi-dimensional throughout the entire network.

What Comes Next

If safety is polyhedral, then effective safety training needs to be polyhedral too. Single-direction interventions — whether for attack or defense — are fundamentally limited by a geometry they do not account for.

For defenders, this means:

Safety training should target multiple independent dimensions, not a single refusal direction
Evaluation should test across harm categories independently, not aggregate into a single safety score
Steering vector approaches need multi-dimensional control, not single-axis adjustment
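
The evaluation point can be sketched with a toy scorer. The records below are hypothetical, invented only to show how an aggregate score can hide a weak category, and refusal_rates is an illustrative helper.

```python
from collections import defaultdict

# Hypothetical evaluation records: (harm category, whether the model refused).
results = [
    ("cyber", True), ("cyber", True), ("cyber", True), ("cyber", True),
    ("fraud", True), ("fraud", True), ("fraud", True), ("fraud", False),
    ("intrusion", True), ("intrusion", True), ("intrusion", True), ("intrusion", True),
    ("weapons", False), ("weapons", False), ("weapons", False), ("weapons", True),
]

def refusal_rates(records):
    """Refusal rate per harm category."""
    counts = defaultdict(lambda: [0, 0])  # category -> [refusals, total]
    for category, refused in records:
        counts[category][0] += int(refused)
        counts[category][1] += 1
    return {c: r / n for c, (r, n) in counts.items()}

per_category = refusal_rates(results)
aggregate = sum(refused for _, refused in results) / len(results)
weakest = min(per_category, key=per_category.get)

print("aggregate refusal rate:", aggregate)
print("per category:", per_category)
print("weakest category:", weakest)
```

In this toy data the aggregate refusal rate is 75%, while the weapons category refuses only a quarter of the time. A single safety score would not surface that gap.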

For attackers (and red-teamers), this means:

Abliteration will hit a ceiling as models scale
Effective attacks will increasingly need to suppress multiple independent safety dimensions simultaneously
The format-lock approach works because it operates on a different axis — look for other cross-axis interference patterns

Safety is not a switch you can flip. It is a geometric property of the model's representation space. Understanding that geometry is the first step toward safety interventions that actually work at scale.

The full analysis is Report #198 in the Failure-First corpus, building on the OBLITERATUS mechanistic interpretability series (Reports #183, #187). Research conducted on the Qwen model family from 0.5B to 9B parameters.

This post is part of the Failure-First Embodied AI research programme.