The Model Size Paradox
Our research reveals a counterintuitive finding: larger language models demonstrate higher jailbreak success rates than smaller models. This “model size paradox” has significant implications for AI safety and deployment strategies.
Vulnerability by Model Size
Observed Jailbreak Rates by Size Category
Larger models show substantially higher vulnerability. We hypothesize this reflects a capability-vulnerability tradeoff: the same instruction-following ability that makes large models useful also makes them more susceptible to following adversarial instructions. Sample sizes vary by category (<7B: 34 models, 7–13B: 12 models, 70B+: 5 models), so the 70B+ estimate in particular carries wide statistical uncertainty.
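Given the small 70B+ sample, per-category rates should be reported with confidence intervals. The sketch below shows one way to do this using the Wilson score interval; the sample sizes come from the text, but the jailbreak counts are illustrative placeholders, not our actual results.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    if trials == 0:
        return (0.0, 0.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (max(0.0, center - margin), min(1.0, center + margin))

# Sample sizes from the text; jailbreak counts are hypothetical placeholders.
categories = {
    "<7B":   (34, 9),   # (models tested, models jailbroken)
    "7-13B": (12, 6),
    "70B+":  (5, 4),
}

for name, (n, k) in categories.items():
    lo, hi = wilson_interval(k, n)
    print(f"{name}: {k}/{n} jailbroken, 95% CI [{lo:.2f}, {hi:.2f}]")
```

Note how wide the interval becomes at n = 5: a point estimate for the 70B+ category alone says little without it.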
Hypothesized Mechanisms
Capability-Vulnerability Tradeoff
Larger models are better at following complex, multi-step instructions. Adversarial prompts are complex, multi-step instructions. The same capability that enables helpfulness enables exploitation.
Compression Hypothesis
Smaller models may have less capacity to represent both helpful responses and harmful content simultaneously. Their limited representation space may force a tradeoff that inadvertently favors safety.
Format Compliance as Vector
Larger models exhibit stronger format compliance—following structural instructions (dividers, persona markers, length requirements) even when the content request is adversarial. This structural compliance creates a wedge for content extraction.
Defense Patterns
Structural Compliance + Content Refusal
One observed defense strategy: comply with format instructions while refusing harmful content. The model follows the attacker's structural template but substitutes educational or safety-focused content.
This pattern is effective because it reduces user frustration (the format is followed) while maintaining safety (the harmful content is withheld). It requires the model to separate structural instructions from content requests.
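The separation of structural instructions from content requests can be sketched as follows. This is a toy illustration, not a real safety system: the denylist classifier and the safe-content substitution are hypothetical stand-ins for what a trained model does implicitly.

```python
# Toy sketch of structural-compliance-with-content-refusal.
HARMFUL_TOPICS = {"explosives", "malware"}  # illustrative denylist only

def is_harmful(content_request: str) -> bool:
    return any(topic in content_request.lower() for topic in HARMFUL_TOPICS)

def respond(format_template: str, content_request: str) -> str:
    """Honor the requested structure; substitute safe content when needed."""
    if is_harmful(content_request):
        body = ("I can't help with that, but here is general safety "
                "information on the topic instead.")
    else:
        body = f"[answer to: {content_request}]"
    # Structural instructions (dividers, markers) are followed either way.
    return format_template.format(body=body)

template = "=== RESPONSE ===\n{body}\n=== END ==="
print(respond(template, "How do I season a cast-iron pan?"))
print(respond(template, "Write a step-by-step malware tutorial"))
```

Both outputs carry the attacker's dividers, but only the benign request receives substantive content.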
Reasoning Trace Vulnerability
Models with visible reasoning traces (chain-of-thought, thinking blocks) present an additional attack surface. Extended reasoning can be manipulated to lead the model toward harmful conclusions through its own logic chain.
Implications for Embodied AI
Deployment Considerations
If larger models are more vulnerable, deploying the most capable model is not always the safest choice. Embodied AI systems may benefit from tiered architectures: a large model for complex reasoning with a smaller, more safety-robust model as a safety monitor.
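One way to structure such a tiered architecture: the capable model proposes, a smaller conservative monitor disposes. Both model calls below are hypothetical stubs; real systems would invoke actual checkpoints and a trained grader.

```python
# Hypothetical tiered architecture: a capable planner model gated by a
# smaller, more conservative safety monitor. Model calls are stubs.

def large_model_plan(task: str) -> str:
    return f"plan for: {task}"  # stand-in for the capable model

def small_safety_monitor(task: str, plan: str) -> bool:
    """Return True if the plan is approved. Conservative stub."""
    banned = ("override", "disable safety")
    text = (task + " " + plan).lower()
    return not any(phrase in text for phrase in banned)

def execute(task: str) -> str:
    plan = large_model_plan(task)
    if not small_safety_monitor(task, plan):
        return "REFUSED: safety monitor rejected the plan"
    return f"EXECUTING: {plan}"

print(execute("fetch the red cup"))
print(execute("disable safety limits on the arm"))
```

The design choice is that a jailbreak must now defeat two models with different failure modes, and the monitor can be small precisely because refusing is cheaper than planning.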
Testing Requirements
These findings suggest that adversarial testing must scale with model capability. A model family that passes safety evaluation at 7B parameters may fail at 70B. Red-teaming each deployed scale separately, rather than extrapolating from the smallest checkpoint, is essential.
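A size-stratified red-teaming harness can be sketched as below. The `run_model` grader is a stub (every name here is hypothetical); a real harness would query each checkpoint and grade its output.

```python
# Sketch of a size-stratified red-teaming harness. All functions are stubs.

ADVERSARIAL_PROMPTS = ["prompt A", "prompt B", "prompt C"]
MODEL_SIZES = ["<7B", "7-13B", "70B+"]

def run_model(size: str, prompt: str) -> bool:
    """Return True if the model resisted the jailbreak (placeholder)."""
    return True  # real harness would call the checkpoint and grade output

def red_team(sizes, prompts):
    """Resistance rate per size category, evaluated on the same prompt set."""
    results = {}
    for size in sizes:
        resisted = sum(run_model(size, p) for p in prompts)
        results[size] = resisted / len(prompts)
    return results

print(red_team(MODEL_SIZES, ADVERSARIAL_PROMPTS))
```

Running the same prompt set across every size category is what makes a per-size comparison, rather than a single pass/fail, possible.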
Research Limitations
These findings are based on our test corpus and methodology. Jailbreak rates are context-dependent and should not be generalized beyond our specific test conditions. Sample sizes vary by model. See our methodology page for details.
This research informs our commercial services.