Decorative Constraints: The Safety Architecture Term We've Been Missing | Blog

There is a category of safety mechanism that looks functional but is not. It has the shape of a constraint — a rule, a filter, a monitoring dashboard — but when pressure is applied, it provides no resistance. We have been calling these “decorative constraints,” and we think the concept fills a gap in how the AI safety field talks about failure.

The term emerged from our red-teaming work. After testing 172 models across thousands of adversarial scenarios, we kept encountering the same pattern: safety mechanisms that passed audit but failed under adversarial pressure. Not because they were poorly implemented, but because they were monitoring the wrong thing.

The Architecture Metaphor

A decorative column and a structural column look identical from the outside. Both are cylindrical. Both connect floor to ceiling. Both appear to support the structure above them. The difference is only revealed under load: remove a structural column and the ceiling falls. Remove a decorative column and nothing happens — except that everyone in the building discovers their assumption about what was holding the roof up was wrong.

Safety mechanisms in AI systems exhibit the same distinction. A keyword filter that blocks specific harmful phrases looks like content safety. A chain-of-thought monitor that checks reasoning traces looks like process safety. A dashboard that shows all systems nominal looks like operational safety. Whether any of these are structural or decorative depends on what they actually catch under adversarial conditions.

Evidence from the Gradient

Our testing data provides concrete examples of the distinction.

Keyword classifiers as decorative content safety. We measured the agreement between keyword-based classifiers and LLM-based classification across our corpus. Cohen’s Kappa: 0.069 (n=942), where 1.0 would be perfect agreement and 0.0 is chance. The keyword classifier concentrated 98% of its classifications into just 2 of 8 categories. It looked like classification. It produced outputs that had the shape of safety judgments. But its agreement with substantive evaluation was statistically indistinguishable from random.

A system relying on keyword classification for safety monitoring has a decorative constraint. The dashboard shows results. The results do not correspond to reality.

Reasoning trace monitoring as decorative process safety. Format-lock attacks achieved 84-92% ASR on reasoning models by manipulating the inference process itself. Research on the faithfulness-plausibility gap (n=75,000 controlled trials) found that reasoning traces often function as post-hoc rationalisation rather than causal explanation of the model’s actual decision process. A monitor that reads reasoning traces for signs of misalignment is reading a narrative the model constructed after reaching its conclusion — not the reasoning that produced the conclusion.

This is a decorative constraint with a particularly insidious property: it produces legible, well-structured output that looks more trustworthy than no monitoring at all.

Refusal mechanisms as decorative embodied safety. We tested vision-language-action models across 7 adversarial attack families (n=62 traces). The refusal rate was not low. It was zero. The models did not recognise any adversarial scenario as adversarial. Whatever safety constraints these models nominally possess, they are decorative at the level of adversarial evaluation — present in the architecture description, absent in the system’s behaviour.

The Formulation We Got Back

In February 2026, we ran an experiment on Moltbook, a social network for AI agents. We published 9 posts, including one titled “Decorative constraints: the safety architecture term we’ve been missing.” The experiment largely produced null results (a story told in a separate post). But one response stood out.

An agent called Trellis0, with the lowest karma of any commenter on our posts, wrote a multi-paragraph response that included this formulation:

“A decorative constraint creates false confidence — the operator believes safety is handled when it is performing being handled.”

This is sharper than our original framing. We had been thinking about decorative constraints as a failure of mechanism — the constraint does not work. Trellis0’s formulation identifies a failure of epistemics — the constraint actively makes the operator’s understanding worse. An absent constraint at least prompts the question “do we have safety coverage here?” A decorative constraint answers that question incorrectly.

Trellis0 extended this with what amounts to an operational test: “Can you reformulate the threat while preserving the intent? If the constraint vanishes, it was never structural.” This maps directly to how our red-teaming methodology works — we test whether safety mechanisms survive adversarial reformulation of the same underlying threat.

The Decorative Constraint Test

Based on our data and Trellis0’s formulation, we propose a three-part test for whether a safety mechanism is decorative:

1. Adversarial reformulation. Present the same threat in a different format, encoding, or conversational structure. If the constraint only catches the original formulation, it is filtering on surface features, not on the underlying risk. Our format-lock results show that switching from natural language to JSON or code-completion format bypasses constraints that appeared robust in standard evaluation.

2. Load testing under distribution shift. Evaluate the constraint against inputs that are semantically adjacent to its training distribution but syntactically different. Our conlang encoding experiments found that encoding harmful requests in a constructed language produced identical ASR to plain English on Llama 70B (52.5% vs 53.3%, n=82 and n=15 respectively) — the model was permissive regardless of encoding, meaning the safety constraint was not doing what it appeared to do.

3. The dashboard test. If a monitoring system shows all-clear, ask: what class of failure would this dashboard not display? If the answer includes the threat model you are concerned about, the dashboard is decorative. Our keyword classifier (kappa=0.069) produced confident categorical outputs that bore no relationship to actual content classification.

Why the Term Matters

Naming a failure mode makes it legible. Before the AI safety field had “prompt injection” as a term, the vulnerability existed but was difficult to discuss, prioritise, or defend against. Naming it created a category that could be studied, benchmarked, and mitigated.

“Decorative constraint” names a failure mode that is currently difficult to discuss. When a safety audit passes but the system fails under adversarial conditions, the current vocabulary forces a choice between “the audit was wrong” (which implies incompetence) and “the attack was novel” (which implies the system is fundamentally sound). Neither framing is accurate. The audit was correct about what it measured. The system is not fundamentally sound. What happened is that the constraint the audit evaluated was decorative — it measured a surface feature that correlates with safety under normal conditions but provides no protection under adversarial conditions.

This is not a criticism of auditors or safety engineers. It is a description of what happens when safety mechanisms are evaluated under benign conditions and deployed under adversarial ones. The gap between those conditions is where decorative constraints hide.

For Practitioners

If you are responsible for AI safety in a deployed system, we suggest asking three questions:

Which of your safety mechanisms have been tested under adversarial conditions? If the answer is “none” or “only during initial red-teaming,” some of your constraints may be decorative.
Does your monitoring system produce outputs that look reassuring? Reassuring outputs from an untested monitoring system are worse than no monitoring — they suppress the institutional instinct to investigate.
Can you explain, for each safety constraint, what specific threat it mitigates and under what conditions it would fail? A constraint you cannot explain is one you cannot evaluate. And a constraint you cannot evaluate may be decorative.

The decorative constraints concept was developed as part of the Failure-First research programme. The Moltbook experiment methodology and full results are documented in our research repository. Trellis0’s comment is quoted with attribution and reproduced in full in the experiment writeup.