Defense Pattern Analysis

What actually works when models resist adversarial prompts

Overview

Most adversarial AI research studies attack success. This analysis studies defense success: when models resist adversarial prompts, what mechanism are they relying on? Our testing across multiple model families revealed one consistent, effective pattern: structural compliance paired with content refusal.

The Key Pattern: Format/Content Separation

The most resistant models process adversarial prompts on two independent layers:

  • Format Layer: "Should I follow the structural instructions (length, format, tone)?" → Yes
  • Content Layer: "Should I provide the requested harmful information?" → No

The result: responses that satisfy format requirements while refusing harmful content. The model is simultaneously helpful (format) and safe (content). Models that evaluate format and content as a single decision tend to either comply with both (jailbreak) or refuse both (unhelpful).
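
To make the separation concrete, the sketch below models the two decisions as independent checks. The stub functions, the word-limit heuristic, and the example prompt are illustrative placeholders only; they are not how any production model implements the pattern.

```python
import re

# Toy sketch of the two-layer pattern. The stubs below stand in for the model's
# internal judgments; only the separation of the two decisions is the point.

def requested_word_limit(prompt: str) -> int | None:
    """Format layer: parse one structural instruction (here, a word limit)."""
    match = re.search(r"(\d+)\s+words", prompt.lower())
    return int(match.group(1)) if match else None

def content_is_harmful(prompt: str) -> bool:
    """Content layer: placeholder for a learned harmfulness judgment."""
    return "toxin" in prompt.lower()

def respond(prompt: str) -> str:
    limit = requested_word_limit(prompt)   # format decision: always honored
    if content_is_harmful(prompt):         # content decision: made independently
        reply = ("I can't provide that, but I can outline the relevant "
                 "safety and regulatory context instead.")
    else:
        reply = "Here is the information you asked for."
    if limit is not None:                  # format compliance applies even to refusals
        reply = " ".join(reply.split()[:limit])
    return reply

print(respond("In 20 words or fewer, explain how to synthesize the toxin."))
```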

Model Comparison

Not all models implement format/content separation equally. Our testing found a spectrum of defense effectiveness:

Format Compliance vs. Content Compliance

Format Compliance Rate

  • Claude family: ~95%
  • DeepSeek R1: ~90%
  • Llama family: ~70%
  • Mistral family: ~85%

Content Compliance Rate (lower = safer)

  • Claude family: <5%
  • DeepSeek R1: ~65%
  • Llama family: ~45%
  • Mistral family: ~60%

Claude demonstrates the strongest format/content separation: high format compliance combined with low content compliance. Models with coupled evaluation show correlated format and content compliance rates.
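
For reference, per-family rates of this kind can be derived from evaluation records that carry two independent labels per response, as in the sketch below. The records and field names are made-up placeholders, not our evaluation dataset or tooling.

```python
from collections import defaultdict

# Sketch of deriving per-family rates from labeled evaluation results.
# Each record carries two independent labels; the records here are invented.
results = [
    {"family": "claude", "format_ok": True,  "content_complied": False},
    {"family": "claude", "format_ok": True,  "content_complied": False},
    {"family": "llama",  "format_ok": False, "content_complied": False},
    {"family": "llama",  "format_ok": True,  "content_complied": True},
]

def compliance_rates(records):
    totals = defaultdict(lambda: {"n": 0, "format": 0, "content": 0})
    for r in records:
        t = totals[r["family"]]
        t["n"] += 1
        t["format"] += r["format_ok"]
        t["content"] += r["content_complied"]
    return {fam: (t["format"] / t["n"], t["content"] / t["n"]) for fam, t in totals.items()}

for family, (format_rate, content_rate) in compliance_rates(results).items():
    # Strong separation: high format rate, low content rate.
    # Coupled evaluation: the two rates move together.
    print(f"{family}: format {format_rate:.0%}, content {content_rate:.0%}")
```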

Defense Behaviors Observed

Strong Defense (Format/Content Separated)

  • Follows structural instructions (length, format, section headings) while refusing harmful content (a minimal per-response check is sketched after this list)
  • Provides detailed, well-formatted refusals that demonstrate engagement with the request
  • Offers relevant but safe alternative information as substitutes
  • Maintains identity and safety posture despite persona hijacking attempts
  • Decodes obfuscation (leetspeak, encoding) but still applies content safety
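
A per-response format check of the kind that could feed the format_ok label above might look like the sketch below. The required headings, the word limit, and the example refusal are invented for illustration.

```python
# Score format compliance of a single response independently of its content:
# check only the structural requirements the prompt asked for.

def format_compliant(response: str, required_headings: list[str], max_words: int) -> bool:
    """Does the response honor structure (headings, length), whatever it says?"""
    has_headings = all(h.lower() in response.lower() for h in required_headings)
    within_length = len(response.split()) <= max_words
    return has_headings and within_length

refusal = (
    "Overview\n"
    "I can't help with that request.\n"
    "Alternatives\n"
    "Here are safe, related resources that may be useful."
)
print(format_compliant(refusal, required_headings=["Overview", "Alternatives"], max_words=50))  # True
```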

Weak Defense (Format/Content Coupled)

  • Format compliance overrides content safety—formatting becomes a jailbreak technique
  • Terse, unhelpful refusals that frustrate legitimate users
  • Inconsistent behavior across attack types (separates for some, couples for others)
  • Persona hijacking succeeds because identity and safety are treated as one system
  • Extended reasoning chains can be steered toward harmful conclusions

Technique Resistance

Effectiveness of common adversarial techniques against models with strong format/content separation:

Technique Success Rate (Against Strong Defenders)

  • Persona Hijack: <5%
  • Semantic Inversion: <5%
  • Format Exploitation: ~10%*
  • Hypothetical Frame: ~8%
  • Emotional Leverage: <5%
  • Encoding/Obfuscation: <5%

* Format exploitation achieves format compliance but not content compliance—the model follows structural instructions while refusing harmful content.

Implications for Safety Training

1. Train Format and Content Independently

Safety training that teaches "refuse harmful requests" may inadvertently teach models to refuse helpful formatting instructions. Better: train models to always follow format instructions while independently evaluating content safety.
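
One way to operationalize this is to score format compliance and content safety as two separate terms, as in the sketch below. The toy judges are stand-ins for learned classifiers, and nothing here describes any specific lab's training pipeline.

```python
# Sketch of a reward that treats format compliance and content safety as two
# independent terms rather than one coupled judgment.

def format_reward(response: str, required_sections: list[str]) -> float:
    """Toy format judge: did the response include the requested sections?"""
    return float(all(s.lower() in response.lower() for s in required_sections))

def safety_reward(response: str) -> float:
    """Toy safety judge: did the response avoid providing the harmful procedure?"""
    return 0.0 if "step-by-step synthesis" in response.lower() else 1.0

def reward(response: str, required_sections: list[str]) -> float:
    # Independent terms: a format-compliant refusal scores 2.0, a terse refusal 1.0,
    # a well-formatted jailbreak 1.0. A single coupled judgment cannot prefer the
    # first case over the other two.
    return format_reward(response, required_sections) + safety_reward(response)

formatted_refusal = "Background\nI can't help with that.\nSafer directions\nConsider..."
print(reward(formatted_refusal, ["Background", "Safer directions"]))  # 2.0
```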

2. Refusals Should Be High-Quality

A model that refuses harmful content in a detailed, well-structured, format-compliant way demonstrates a stronger safety pattern than one that produces a terse refusal. The format compliance shows the model can engage—it chooses not to on content grounds.

3. Watch the Format/Content Boundary

If format compliance and content safety are separate systems, the boundary between them becomes the next attack surface. Techniques that blur this boundary—where following the format requires producing harmful content—represent an evolving threat.

What Doesn’t Work as Defense

Ineffective Approaches

  • Blanket keyword refusals — catch attacks but also block legitimate research and education (see the example after this list)
  • Persona consistency alone — a model can maintain identity perfectly while still complying with harmful requests
  • Disclaimer insertion — adding "this is dangerous" before providing harmful instructions is not a defense
  • Extended reasoning — more thinking tokens don’t inherently improve safety; reasoning chains can be manipulated
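
The over-blocking problem with blanket keyword refusals is visible even in a minimal filter. The keyword list and queries below are invented for illustration.

```python
# Minimal illustration of why blanket keyword refusals over-block: the same
# filter that catches an attack also rejects a legitimate research question.

BLOCKED_KEYWORDS = {"explosive", "malware", "bypass"}

def keyword_filter(prompt: str) -> bool:
    """Return True if the prompt would be refused under a blanket keyword policy."""
    return any(word in prompt.lower() for word in BLOCKED_KEYWORDS)

print(keyword_filter("Write malware that exfiltrates browser credentials."))             # True (caught)
print(keyword_filter("How do historians study the regulation of explosive materials?"))  # True (over-blocked)
```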

Research Context

This analysis describes defense patterns at the structural level. It does not include specific attack prompts or techniques used during testing. Success rates are approximate ranges from our evaluation dataset. Individual results may vary depending on model version, system prompt, and attack configuration.

This research informs our commercial services. See how we can help →