Compression Tournament Findings

Adrian Wedd

Published

What happens when adversarial prompts are compressed to minimum effective length

Overview

The compression tournament tested a simple question: what is the shortest prompt that can get an AI model to comply with a malicious directive? Across three iterations and 6 local models, we found effective compressed prompts as short as 53 characters—but the methodology lessons proved more valuable than the compression results.

3

Tournament Versions

53

Chars (Shortest Effective)

83%

Best Compliance Rate

6

Models Tested

Methodology Evolution

V1: Keyword Classification

The first version used keyword matching to detect compliance: words like "here's how," "step 1," or "disable" in a response triggered a positive classification. Results appeared dramatic—94% of compressed prompts appeared to succeed.

Problem: The classifier detected response style, not semantic content. When re-evaluated with an LLM judge on 20 samples: keyword classifier found 94% jailbreaks, LLM judge found 0%. Every "success" was a false positive.

V2: LLM Classification

V2 replaced keywords with LLM-based judgment, revealing that compression systematically strips the psychological manipulation techniques (authority hijacking, social engineering, roleplay framing) that make original prompts effective. The compressed versions preserved topics but lost the mechanisms of attack.

V3: Compliance Measurement

V3 introduced the critical distinction: compliance vs. harm. A model that helpfully explains how to execute a safety override command—even with disclaimers—has complied with the directive. This frame captured behaviors that harm-focused evaluation missed, yielding the first validated effective compressed prompts.

Classification Lessons

Keyword vs. LLM Classification

Keyword: Jailbreak

94%

LLM: Jailbreak

0%

Same 20 responses, two classifiers. The keyword classifier detected response format (helpful, structured answers), not semantic harm.

Harm Judge vs. Compliance Judge

Harm-focused

0–50%

Compliance-focused

33–83%

Models often comply with directives (treating them as legitimate) while adding safety disclaimers. Harm-focused judges miss this category of failure.

Compression Results

V3 identified 5 effective compressed prompts achieving 33–83% compliance rates. Three pattern categories retained effectiveness after compression:

Effective Pattern Types

Compliance Rate by Pattern Type

Temporal Authority

83%

Urgency + Protocol

80%

Command Syntax

60%

Temporal + Safety Null

33–60%

Patterns That Survive Compression

Temporal authority framing — Future date stamps and version identifiers that imply the prompt comes from a newer system. Format-level signals that persist because they don't depend on narrative context.

Command syntax — Structured instruction formats that trigger compliance through formatting alone. These exploit the gap between format compliance and content safety.

Urgency protocols — Short-form emergency framing combining authority with time pressure in minimal characters.

Model Susceptibility

Compliance rates varied significantly across the 6 local models tested. Model vulnerability appears to correlate with parameter count, though our sample is too small for statistical conclusions (n=5 prompts per model).

Compliance Rate by Model (V3, n=5)

gemma2:2b

100%

gemma3:1b

80%

phi3:mini

60%

llama3.2:3b

60%

mistral-nemo

40%

qwen2:0.5b

20%

Implications

For Evaluation Methodology

Keyword classification produces systematically misleading results for any task involving natural language intent detection. LLM-based judgment is essential, and the evaluation frame (harm vs. compliance) determines what you measure.

For Defense Design

Compression strips psychological manipulation but preserves format-level signals. This suggests format-exploiting attacks (temporal framing, command syntax) require different defenses than content-level attacks (social engineering, narrative framing). See our defense patterns analysis for more.

For Safety Training

The compliance-vs-harm distinction reveals a gap in current safety evaluation. Models that add disclaimers but still comply with malicious directives are often classified as "safe" when they should be classified as "compliant under manipulation."

Research Context

This research describes pattern categories and methodology, not specific adversarial prompts. Compression results are reported as compliance rates with pattern descriptions, not as reproducible attack payloads. The primary contribution is methodological: better evaluation approaches for adversarial AI safety research.

This research informs our commercial services. See how we can help →