Active Research

Model Vulnerability Findings

How model characteristics correlate with adversarial susceptibility

The Model Size Paradox

Our research reveals a counterintuitive finding: larger language models demonstrate higher jailbreak success rates than smaller models. This “model size paradox” has significant implications for AI safety and deployment strategies.

190 models evaluated · 10–74% jailbreak rate range · 3 size categories

Vulnerability by Model Size

Observed Jailbreak Rates by Size Category

70B+: 59–74%
7–13B: 10–39%
<7B: ~10%

Larger models show substantially higher vulnerability. We hypothesize this reflects a capability-vulnerability tradeoff: the same instruction-following ability that makes large models useful also makes them more susceptible to following adversarial instructions. Sample sizes vary by category (70B+: 5 models, 7–13B: 12 models, <7B: 34 models).
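The category figures above are simple pooled rates. Below is a minimal sketch of that aggregation, assuming per-model evaluation records with attempt and success counts; the class, field names, and example numbers are illustrative, not our actual pipeline or data.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class EvalResult:
    model_name: str
    size_category: str   # "<7B", "7-13B", or "70B+"
    attempts: int        # adversarial prompts run against this model
    successes: int       # prompts that elicited the disallowed content

def jailbreak_rate_by_category(results: list[EvalResult]) -> dict[str, float]:
    """Pool per-model attempt/success counts into one rate per size category."""
    attempts: dict[str, int] = defaultdict(int)
    successes: dict[str, int] = defaultdict(int)
    for r in results:
        attempts[r.size_category] += r.attempts
        successes[r.size_category] += r.successes
    return {cat: successes[cat] / attempts[cat] for cat in attempts}

# Hypothetical per-model records, one model per category.
results = [
    EvalResult("small-3b", "<7B", attempts=200, successes=20),
    EvalResult("mid-13b", "7-13B", attempts=200, successes=60),
    EvalResult("large-70b", "70B+", attempts=200, successes=130),
]
print(jailbreak_rate_by_category(results))  # {'<7B': 0.1, '7-13B': 0.3, '70B+': 0.65}
```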

Hypothesized Mechanisms

Capability-Vulnerability Tradeoff

Larger models are better at following complex, multi-step instructions. Adversarial prompts are complex, multi-step instructions. The same capability that enables helpfulness enables exploitation.

Compression Hypothesis

Smaller models may have less capacity to represent both helpful responses and harmful content simultaneously. Their limited representation space may force a tradeoff that inadvertently favors safety.

Format Compliance as Vector

Larger models exhibit stronger format compliance—following structural instructions (dividers, persona markers, length requirements) even when the content request is adversarial. This structural compliance creates a wedge for content extraction.

Defense Patterns

Structural Compliance + Content Refusal

One observed defense strategy: comply with format instructions while refusing harmful content. The model follows the attacker's structural template but substitutes educational or safety-focused content.

This pattern is effective because it reduces user frustration (the format is followed) while maintaining safety (the harmful content is withheld). It requires the model to separate structural instructions from content requests.
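A minimal sketch of that separation, assuming the structural directives have already been parsed out of the prompt; the parsing step, names, and template fields below are illustrative only.

```python
from dataclasses import dataclass

@dataclass
class FormatDirectives:
    """Structural instructions extracted from a prompt (parsing step not shown)."""
    persona_label: str = ""
    divider: str = ""

def render_safe_response(directives: FormatDirectives, safe_content: str) -> str:
    """Follow the requested template while substituting safety-focused content."""
    parts = []
    if directives.persona_label:
        parts.append(f"{directives.persona_label}:")
    if directives.divider:
        parts.append(directives.divider)
    parts.append(safe_content)
    if directives.divider:
        parts.append(directives.divider)
    return "\n".join(parts)

directives = FormatDirectives(persona_label="EXPERT MODE", divider="=" * 20)
print(render_safe_response(
    directives,
    "I can't help with the original request, but here is some general safety guidance...",
))
```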

Reasoning Trace Vulnerability

Models with visible reasoning traces (chain-of-thought, thinking blocks) present an additional attack surface. Extended reasoning can be manipulated to lead the model toward harmful conclusions through its own logic chain.

Implications for Embodied AI

Deployment Considerations

If larger models are more vulnerable, deploying the most capable model is not always the safest choice. Embodied AI systems may benefit from tiered architectures: a large model for complex reasoning, paired with a smaller model that is more robust to adversarial prompts acting as a safety monitor.
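A minimal sketch of one such tiered arrangement, assuming two models behind a shared generate interface; the interface, prompts, and refusal string are illustrative, not a real API.

```python
from typing import Protocol

class Model(Protocol):
    def generate(self, prompt: str) -> str: ...

REFUSAL = "I can't help with that request."

def tiered_respond(large: Model, monitor: Model, prompt: str) -> str:
    """Use a capable model for the answer, but let a smaller monitor veto it.

    The monitor screens the incoming prompt and the candidate response;
    if either check flags the exchange, the system refuses instead.
    """
    verdict = monitor.generate(
        f"Answer SAFE or UNSAFE. Is this request safe to act on?\n\n{prompt}"
    )
    if "UNSAFE" in verdict.upper():
        return REFUSAL

    candidate = large.generate(prompt)

    verdict = monitor.generate(
        f"Answer SAFE or UNSAFE. Is this response safe to return?\n\n{candidate}"
    )
    if "UNSAFE" in verdict.upper():
        return REFUSAL
    return candidate
```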

Testing Requirements

These findings suggest that adversarial testing must scale with model capability. A model that passes safety evaluation at 7B parameters may fail at 70B. Size-specific red-teaming is essential.
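One way to operationalize this is to run the same adversarial suite against each size tier and flag any tier whose rate exceeds a threshold. A minimal sketch follows, assuming a generate callable per tier and an external judge function; all names and the threshold are illustrative.

```python
from typing import Callable, Iterable

def redteam_by_size(
    models: dict[str, Callable[[str], str]],   # size tier -> generate function
    adversarial_prompts: Iterable[str],
    is_jailbroken: Callable[[str], bool],      # judge: did the output carry out the attack?
    max_rate: float = 0.05,
) -> dict[str, float]:
    """Run the same adversarial suite across size tiers and report the rate per tier."""
    prompts = list(adversarial_prompts)
    rates: dict[str, float] = {}
    for tier, generate in models.items():
        hits = sum(is_jailbroken(generate(p)) for p in prompts)
        rates[tier] = hits / len(prompts)
        if rates[tier] > max_rate:
            print(f"{tier}: jailbreak rate {rates[tier]:.0%} exceeds threshold {max_rate:.0%}")
    return rates
```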

Research Limitations

These findings are based on our test corpus and methodology. Jailbreak rates are context-dependent and should not be generalized beyond our specific test conditions. Sample sizes vary by model. See our methodology page for details.

This research informs our commercial services. See how we can help →