Overview
The compression tournament tested a simple question: what is the shortest prompt that can get an AI model to comply with a malicious directive? Across three iterations and six local models, we found effective compressed prompts as short as 53 characters, but the methodology lessons proved more valuable than the compression results.
Methodology Evolution
V1: Keyword Classification
The first version used keyword matching to detect compliance: words like "here's how," "step 1," or "disable" in a response triggered a positive classification. The results looked dramatic: 94% of compressed prompts appeared to succeed.
Problem: The classifier detected response style, not semantic content. When 20 samples were re-evaluated with an LLM judge, the keyword classifier had flagged 94% as jailbreaks while the LLM judge flagged 0%. Every "success" was a false positive.
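The failure mode is easy to reproduce in miniature. The sketch below is not the classifier used in V1; it is a minimal stand-in that assumes a marker list containing only the three phrases quoted above, and the function name and sample responses are hypothetical.

```python
# Hypothetical marker list: only the three phrases named in the text.
COMPLIANCE_MARKERS = ("here's how", "step 1", "disable")


def keyword_flags_compliance(response: str) -> bool:
    """Return True if any marker phrase occurs in the response (case-insensitive)."""
    text = response.lower()
    return any(marker in text for marker in COMPLIANCE_MARKERS)


# A refusal that happens to be helpful and structured (illustrative text).
structured_refusal = (
    "I can't help with that request. Here's how to report it instead: "
    "Step 1, contact your administrator."
)
plain_refusal = "I can't help with that request."

# The structured refusal is flagged even though the model declined,
# because the match keys on response style rather than on meaning.
print(keyword_flags_compliance(structured_refusal))
print(keyword_flags_compliance(plain_refusal))
```

Both sample responses are refusals, yet only the plain one passes unflagged. That is the sense in which the keyword approach measures format rather than content.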
V2: LLM Classification
V2 replaced keywords with LLM-based judgment, revealing that compression systematically strips the psychological manipulation techniques (authority hijacking, social engineering, roleplay framing) that make original prompts effective. The compressed versions preserved topics but lost the mechanisms of attack.
V3: Compliance Measurement
V3 introduced the critical distinction: compliance vs. harm. A model that helpfully explains how to execute a safety override command—even with disclaimers—has complied with the directive. This frame captured behaviors that harm-focused evaluation missed, yielding the first validated effective compressed prompts.
Classification Lessons
Keyword vs. LLM Classification
Same 20 responses, two classifiers. The keyword classifier detected response format (helpful, structured answers), not semantic harm.
Harm Judge vs. Compliance Judge
Models often comply with directives (treating them as legitimate) while adding safety disclaimers. Harm-focused judges miss this category of failure.
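One way to keep the two frames from collapsing into each other is to record them as separate fields and derive each label independently. The sketch below is illustrative only: the record type, field names, and label strings are assumptions, and the field values would in practice come from an LLM judge that is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgedResponse:
    """Hypothetical record of one judged response (field names are assumptions)."""
    complied: bool          # the model acted on the directive
    harmful_content: bool   # the output itself contains harmful material
    has_disclaimer: bool    # the model added a safety caveat


def harm_frame_label(r: JudgedResponse) -> str:
    """Harm-focused view: only the content of the output matters."""
    return "unsafe" if r.harmful_content else "safe"


def compliance_frame_label(r: JudgedResponse) -> str:
    """Compliance-focused view: a disclaimer does not undo compliance."""
    if not r.complied:
        return "refused"
    return "compliant under manipulation"


# The case described above: the model follows the directive but adds a caveat.
hedged = JudgedResponse(complied=True, harmful_content=False, has_disclaimer=True)
print(harm_frame_label(hedged), "|", compliance_frame_label(hedged))
```

The same response receives a reassuring label under the harm frame and a failing label under the compliance frame, so reporting both side by side makes the disagreement visible instead of hiding it in a single verdict.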
Compression Results
V3 identified 5 effective compressed prompts with compliance rates of 33–83%. Three pattern categories retained effectiveness after compression: temporal authority framing, command syntax, and urgency protocols.
[Figures not reproduced: Effective Pattern Types; Compliance Rate by Pattern Type]
Patterns That Survive Compression
Temporal authority framing: Future date stamps and version identifiers that imply the prompt comes from a newer system. These are format-level signals that persist because they don't depend on narrative context.
Command syntax: Structured instruction formats that trigger compliance through formatting alone. These exploit the gap between format compliance and content safety.
Urgency protocols: Short-form emergency framing that combines authority with time pressure in minimal characters.
Model Susceptibility
Compliance rates varied widely across the six local models tested. Model vulnerability appears to correlate with parameter count, though our sample is too small to support statistical conclusions (n=5 prompts per model).
[Figure not reproduced: Compliance Rate by Model (V3, n=5)]
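To see why five prompts per model cannot support statistical conclusions, it helps to put an interval around a single per-model rate. The sketch below uses the Wilson score interval, which is our choice for illustration and not something the original analysis reports; the count of 3 compliant responses out of 5 is a made-up example, not a measured result.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical model: 3 of 5 prompts produced compliance.
low, high = wilson_interval(3, 5)
print(f"observed 60%, interval {low:.2f} to {high:.2f}")
```

With n=5 the interval spans well over half of the possible range, so two models with visibly different observed rates can easily have overlapping intervals. Per-model rankings at this sample size are best read as suggestive.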
Implications
For Evaluation Methodology
In our experiments, keyword classification produced systematically misleading results, and we expect the same for any task that requires detecting intent in natural language. LLM-based judgment is essential, and the evaluation frame (harm vs. compliance) determines what you measure.
For Defense Design
Compression strips psychological manipulation but preserves format-level signals. This suggests format-exploiting attacks (temporal framing, command syntax) require different defenses than content-level attacks (social engineering, narrative framing). See our defense patterns analysis for more.
For Safety Training
The compliance-vs-harm distinction reveals a gap in current safety evaluation. Models that add disclaimers but still comply with malicious directives are often classified as "safe" when they should be classified as "compliant under manipulation."
Research Context
This research describes pattern categories and methodology, not specific adversarial prompts. Compression results are reported as compliance rates with pattern descriptions, not as reproducible attack payloads. The primary contribution is methodological: better evaluation approaches for adversarial AI safety research.