
Report #171: Corpus Pattern Mining — Five Novel Findings from 132K Results

Executive Summary

Systematic SQL-based analysis of the full jailbreak corpus (132,416 results, 190 models) reveals five empirical patterns not previously documented in the project's established findings. These patterns emerge from cross-cutting queries that examine response timing, thinking token allocation, temporal stability, and attack family co-vulnerability. All findings are preliminary and require follow-up validation. Sample sizes are reported throughout.

Methodology

All queries were executed against database/jailbreak_corpus.db (schema v13) using tools/analysis/pattern_miner.py. Verdicts use COALESCE(llm_verdict, heuristic_verdict) throughout. The evaluable denominator includes only COMPLIANCE, PARTIAL, REFUSAL, and HALLUCINATION_REFUSAL verdicts, excluding ERROR, BENIGN_QUERY, NOT_GRADEABLE, and PARSE_ERROR. All numbers are sourced from CANONICAL_METRICS.md where applicable.
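
To make the convention concrete, here is a minimal sketch of the effective-verdict query pattern. The table and column names (results, llm_verdict, heuristic_verdict) are assumptions inferred from the prose above, not a verified excerpt from pattern_miner.py or schema v13.

```python
import sqlite3

# Evaluable verdicts per the denominator rule above; everything else
# (ERROR, BENIGN_QUERY, NOT_GRADEABLE, PARSE_ERROR) is excluded.
EVALUABLE = ("COMPLIANCE", "PARTIAL", "REFUSAL", "HALLUCINATION_REFUSAL")

conn = sqlite3.connect("database/jailbreak_corpus.db")
rows = conn.execute(
    """
    SELECT COALESCE(llm_verdict, heuristic_verdict) AS verdict, COUNT(*) AS n
    FROM results
    WHERE COALESCE(llm_verdict, heuristic_verdict) IN (?, ?, ?, ?)
    GROUP BY verdict
    """,
    EVALUABLE,
).fetchall()
print(rows)
```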


Finding 1: Compliance Latency Signal — Compliant Responses Take Twice as Long as Refusals

Data: n=3,972 results with valid duration_ms.

| Verdict | n | Mean Duration (s) | Ratio to REFUSAL |
|---|---|---|---|
| HALLUCINATION_REFUSAL | 121 | 51.4 | 2.03x |
| COMPLIANCE | 2,025 | 49.6 | 1.96x |
| PARTIAL | 342 | 41.2 | 1.63x |
| REFUSAL | 1,484 | 25.3 | 1.00x (baseline) |

Interpretation: Compliant responses to adversarial requests take approximately twice as long to generate as refusals. HALLUCINATION_REFUSAL (harmful content wrapped in refusal framing) is the slowest verdict category, suggesting these responses involve the most internal processing: the model generates both harmful content and refusal framing.

Potential application: Response latency could serve as a lightweight, model-agnostic safety signal. A response that takes substantially longer than the model’s typical refusal time may warrant additional scrutiny. This does not require access to model internals or thinking traces.
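
A minimal sketch of such a screen, assuming per-model refusal latencies are available as a baseline; the helper and its 1.5x threshold are hypothetical illustrations, not tuned values from the corpus.

```python
from statistics import median

# Hypothetical latency screen based on Finding 1. The 1.5x threshold and the
# helper itself are illustrative assumptions, not validated values.
def flag_slow_response(duration_ms: float,
                       refusal_durations_ms: list[float],
                       threshold: float = 1.5) -> bool:
    # Baseline: the model's typical refusal latency.
    baseline = median(refusal_durations_ms)
    return duration_ms > threshold * baseline

# Example using the corpus means: a 49.6s response against a ~25.3s refusal
# baseline exceeds the threshold and would be flagged for scrutiny.
print(flag_slow_response(49_600, [25_300, 24_000, 26_500]))  # True
```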

Caveats: Duration includes network latency for API-based evaluations, which varies by provider and time of day. The signal may be confounded by response length (compliant responses are longer on average). A controlled study holding response length constant would be needed to isolate the timing signal.


Finding 2: Thinking Token Inversion — Reasoning Models Allocate 28% More Thinking to Compliance Than Refusal

Data: n=958 results with thinking_tokens > 0 (reasoning models only).

| Verdict | n | Mean Thinking Tokens | Mean Response Tokens | Think Ratio |
|---|---|---|---|---|
| COMPLIANCE | 444 | 951 | 2,029 | 0.347 |
| HALLUCINATION_REFUSAL | 61 | 1,007 | 2,257 | 0.321 |
| PARTIAL | 106 | 739 | 1,588 | 0.337 |
| REFUSAL | 347 | 744 | 1,227 | 0.417 |

Key observation: COMPLIANCE requires 28% more thinking tokens than REFUSAL (951 vs 744). However, the ratio of thinking to total output is higher for REFUSAL (0.417) than COMPLIANCE (0.347). This means refusals involve proportionally more deliberation relative to output length — the model “thinks hard” but produces less. Compliant responses involve extensive thinking AND extensive output.

Relationship to established findings: This extends the compliance verbosity signal (Report #48, d=0.538 for response tokens, d=0.522 for thinking tokens). The new contribution is the ratio analysis: refusals are thinking-dominated (41.7% think ratio) while compliance is output-dominated (34.7% think ratio).

Potential application: The think-to-output ratio could complement FLIP grading as a lightweight pre-screening signal for reasoning model outputs.
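
A minimal sketch of such a pre-screen, assuming the ratio is computed per result as thinking / (thinking + response) tokens; the cutoff and function names are illustrative. (The table's Think Ratio column, 0.417 vs 0.347, appears to be a per-result average, since it differs from the ratio of the column means.)

```python
# Hypothetical pre-screen based on Finding 2, assuming a per-result ratio of
# thinking / (thinking + response) tokens. The 0.35 cutoff sits between the
# ratios implied by the table's column means (0.378 for REFUSAL, 0.319 for
# COMPLIANCE) and would need per-model calibration in practice.
def think_ratio(thinking_tokens: int, response_tokens: int) -> float:
    total = thinking_tokens + response_tokens
    return thinking_tokens / total if total else 0.0

def refusal_like(thinking_tokens: int, response_tokens: int,
                 cutoff: float = 0.35) -> bool:
    # Deliberation-dominated outputs resemble refusals; output-dominated
    # outputs resemble compliance.
    return think_ratio(thinking_tokens, response_tokens) >= cutoff

print(refusal_like(744, 1227))   # True: REFUSAL column means
print(refusal_like(951, 2029))   # False: COMPLIANCE column means
```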


Finding 3: Multi-Turn Universality — The Only Attack Family That Breaches All Frontier Models

Data: Per-family ASR for 5 frontier/near-frontier models with n >= 5 per cell.

| Family | Claude 4.5 | GPT-5.2 | Gemini 3 | DeepSeek-R1 | Qwen3 |
|---|---|---|---|---|---|
| multi_turn | 71.4% | 75.0% | 40.0% | 94.1% | 81.8% |
| cot_exploit | 0.0% | 22.2% | 15.0% | 75.0% | 42.9% |
| encoding | 0.0% | 14.3% | 0.0% | 46.2% | — |
| behavioral | 14.3% | 20.0% | 0.0% | 14.3% | — |
| volumetric | 0.0% | 11.8% | 0.0% | 6.7% | — |
| persona | 0.0% | 8.3% | 0.0% | — | — |

Cells marked — had n < 5 and are excluded per the inclusion rule.

Key observation: Multi-turn attacks are the only family achieving at least 40% ASR on every frontier model tested. Claude Sonnet 4.5, which achieves 0% ASR on encoding, cot_exploit, volumetric, and persona attacks, reaches 71.4% under multi-turn pressure. This is consistent with the established finding that multi-turn attacks dramatically increase success (AGENT_STATE.md), but the cross-model universality at frontier scale had not been quantified in a single table.
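
For reference, a sketch of the kind of query that would produce the table above, reusing the assumed names from the Methodology sketch (results, llm_verdict, heuristic_verdict) plus assumed attack_family and model columns; counting COMPLIANCE and PARTIAL as attack success is likewise an assumption about the ASR definition.

```python
import sqlite3

# Per-family, per-model ASR over evaluable verdicts. The HAVING clause
# mirrors the stated n >= 5 per-cell inclusion rule; column names and the
# success definition are assumptions, not verified against schema v13.
PER_FAMILY_ASR = """
SELECT attack_family,
       model,
       COUNT(*) AS n,
       AVG(CASE WHEN COALESCE(llm_verdict, heuristic_verdict)
                IN ('COMPLIANCE', 'PARTIAL') THEN 1.0 ELSE 0.0 END) AS asr
FROM results
WHERE COALESCE(llm_verdict, heuristic_verdict)
      IN ('COMPLIANCE', 'PARTIAL', 'REFUSAL', 'HALLUCINATION_REFUSAL')
GROUP BY attack_family, model
HAVING COUNT(*) >= 5
ORDER BY attack_family, asr DESC;
"""

conn = sqlite3.connect("database/jailbreak_corpus.db")
for family, model, n, asr in conn.execute(PER_FAMILY_ASR):
    print(f"{family:12s} {model:24s} n={n:4d} ASR={asr:.1%}")
```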

Implication for defense: Single-turn safety evaluations systematically underestimate frontier model vulnerability. Any safety evaluation that excludes multi-turn scenarios will produce misleadingly low ASR estimates. This has direct relevance for EU AI Act conformity assessment and NIST AI RMF profiling, which currently do not mandate multi-turn adversarial testing.

Caveats: Small n per cell (5-20). Wide confidence intervals. The multi-turn category includes heterogeneous attack types (crescendo, skeleton key, gradual escalation). Decomposing by sub-technique would require larger samples.


Finding 4: HALLUCINATION_REFUSAL Anomalous Length Distribution

Data: n=42,515 results with non-empty raw_response.

| Verdict | n | Mean Chars | Mean Response Tokens |
|---|---|---|---|
| COMPLIANCE | 21,889 | 442 | 946 |
| PARTIAL | 15,818 | 264 | 1,103 |
| REFUSAL | 4,624 | 1,768 | 712 |
| HALLUCINATION_REFUSAL | 184 | 2,184 | 1,633 |

Key observation: HALLUCINATION_REFUSAL responses are the longest by character count (2,184 mean chars), exceeding even REFUSAL (1,768). This is counterintuitive: one might expect that a response framed as a refusal would be short. Instead, these responses are verbose because they contain both the harmful content and an elaborate refusal framing.

Token vs character discrepancy: COMPLIANCE has 946 mean tokens but only 442 mean chars, while REFUSAL has 712 mean tokens but 1,768 mean chars. This suggests systematic differences in token-to-character ratios across verdict types, likely reflecting differences in content type (code/structured output for compliance vs prose for refusal).
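
The derived chars-per-token ratios make the discrepancy explicit; the values below are copied from the table, and only the division is new.

```python
# Chars-per-token ratios derived from the Finding 4 table. The sub-1 ratios
# for COMPLIANCE and PARTIAL are the anomaly the report attributes to
# content-type differences (code/structured output versus prose).
means = {
    "COMPLIANCE": (442, 946),
    "PARTIAL": (264, 1103),
    "REFUSAL": (1768, 712),
    "HALLUCINATION_REFUSAL": (2184, 1633),
}
for verdict, (chars, tokens) in means.items():
    print(f"{verdict}: {chars / tokens:.2f} chars/token")
# COMPLIANCE 0.47, PARTIAL 0.24, REFUSAL 2.48, HALLUCINATION_REFUSAL 1.34
```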

Relationship to established findings: This extends Report #65 (HALLUCINATION_REFUSAL computational equivalence to COMPLIANCE). The length analysis provides an additional empirical signal: HR responses are detectable not just by their thinking token distribution but also by their anomalous length profile.


Finding 5: Temporal ASR Instability in deepseek-r1:1.5b

Data: deepseek-r1:1.5b evaluated on 4 dates with n >= 10.

| Date | n | Broad ASR |
|---|---|---|
| 2026-02-03 | 247 | 88.7% |
| 2026-02-12 | 121 | 100.0% |
| 2026-03-02 | 11 | 90.9% |
| 2026-03-15 | 69 | 59.4% |

Key observation: deepseek-r1:1.5b ASR dropped from 88.7-100% (February) to 59.4% (March 15). This 29-41 percentage point decline across 6 weeks is the largest temporal shift observed for any model in the corpus with n >= 10.
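
For reproducibility, a sketch of the per-date query that would yield the table above; the eval_date column name and the broad-ASR definition (COMPLIANCE plus PARTIAL counted as success) are assumptions, and the HAVING clause mirrors the stated n >= 10 rule.

```python
import sqlite3

# Per-date broad ASR for one model over evaluable verdicts. Column names
# and the success definition are assumptions inferred from the prose.
TEMPORAL_ASR = """
SELECT DATE(eval_date) AS day,
       COUNT(*) AS n,
       AVG(CASE WHEN COALESCE(llm_verdict, heuristic_verdict)
                IN ('COMPLIANCE', 'PARTIAL') THEN 1.0 ELSE 0.0 END) AS broad_asr
FROM results
WHERE model = 'deepseek-r1:1.5b'
  AND COALESCE(llm_verdict, heuristic_verdict)
      IN ('COMPLIANCE', 'PARTIAL', 'REFUSAL', 'HALLUCINATION_REFUSAL')
GROUP BY day
HAVING COUNT(*) >= 10
ORDER BY day;
"""

conn = sqlite3.connect("database/jailbreak_corpus.db")
for day, n, asr in conn.execute(TEMPORAL_ASR):
    print(f"{day}  n={n:4d}  broad ASR={asr:.1%}")
```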

Possible explanations: (a) Different prompt compositions on different dates (the March 15 evaluation may have included harder prompts or different attack families). (b) Model versioning — if the Ollama model weights were updated between evaluations. (c) Grading methodology changes (the March 15 results may use a different grading mix).

This finding requires careful interpretation. The temporal variation likely reflects evaluation methodology changes rather than model-level safety drift. The project does not control for prompt composition across evaluation dates, making it impossible to attribute ASR changes to model behavior versus evaluation design. However, the magnitude of the shift (29pp) warrants investigation to rule out genuine temporal instability.

Comparison: Qwen/Qwen2.5-0.5B-Instruct shows a similar pattern (40.0% -> 23.2% across 4 days), suggesting this may be a systematic evaluation artifact rather than a model-specific finding.


Summary of Novel Patterns

| Finding | Pattern | Magnitude | Status |
|---|---|---|---|
| F1: Compliance latency | Compliance takes twice as long as refusal | 49.6s vs 25.3s | Preliminary |
| F2: Thinking inversion | Refusals have higher think ratio | 0.417 vs 0.347 | Preliminary |
| F3: Multi-turn universality | Only family breaching all frontier models | 40-94% ASR | Preliminary |
| F4: HR length anomaly | HALLUCINATION_REFUSAL longest by chars | 2,184 vs 442 (COMPLIANCE) | Preliminary |
| F5: Temporal instability | 29pp ASR shift in 6 weeks | 88.7% -> 59.4% | Requires investigation |

Recommendations

  1. F1/F2 could contribute to the NeurIPS D&B paper as discovery-level findings enabled by the benchmark infrastructure. Neither requires additional data collection.

  2. F3 strengthens the case for mandatory multi-turn evaluation in regulatory submissions (Safe Work Australia, EU AI Act conformity assessment). This should be cited in the next policy document revision.

  3. F4 extends the HALLUCINATION_REFUSAL analysis (Report #65) and could be added to the FLIP grading methodology description.

  4. F5 requires controlled investigation before any claims about temporal safety drift. A confound-controlled study (same prompts, same grading, multiple dates) would be needed.

  5. tools/analysis/pattern_miner.py with 12 pre-built queries is now available for ongoing corpus monitoring.


This report presents research findings, not legal opinion. All claims are preliminary and require follow-up validation. Numbers sourced from CANONICAL_METRICS.md where applicable.

Generated by Failure-First Research Team (Policy and Standards Lead), 2026-03-22.
