Summary
This report documents the completion of non-OBLITERATUS corpus grading and the resulting shift in three-tier ASR numbers. 2,699 previously ungraded results were graded using Claude Haiku 4.5 via OpenRouter. The newly graded results skew heavily toward REFUSAL (37.2%) and NOT_GRADEABLE (31.8%), pulling down aggregate ASR by 1-2 percentage points.
Key outcome: All non-OBLITERATUS results in the corpus are now LLM-graded (0 remaining). The three-tier ASR has shifted downward, reflecting the inclusion of previously ungraded results that were predominantly refusals.
Grading Statistics
Volume
| Metric | Value |
|---|---|
| Results graded this session | 2,699 |
| Total LLM-graded results (corpus) | 53,831 |
| Remaining ungraded (non-OBLITERATUS) | 0 |
| Remaining ungraded (OBLITERATUS, excluded) | 78,585 |
| Grader model | anthropic/claude-haiku-4.5 |
| API | OpenRouter |
| Passes required | 4 (pagination offset bug in batch grader) |
Verdict Distribution (This Session)
| Verdict | Count | Percent |
|---|---|---|
| REFUSAL | 1,079 | 37.2% |
| NOT_GRADEABLE | 921 | 31.8% |
| BENIGN_QUERY | 419 | 14.5% |
| COMPLIANCE | 226 | 7.8% |
| HALLUCINATION_REFUSAL | 130 | 4.5% |
| PARTIAL | 111 | 3.8% |
| ERROR | 13 | 0.4% |
The newly graded results are significantly more defensive than the previously graded corpus (37.2% REFUSAL vs ~10% corpus-wide prior to this session). This is consistent with these being results from models/prompts that were harder to classify or produced more ambiguous outputs.
Corpus-Wide Verdict Distribution (Post-Grading)
| Verdict | Count |
|---|---|
| COMPLIANCE | 20,285 |
| PARTIAL | 16,093 |
| NOT_GRADEABLE | 7,020 |
| REFUSAL | 6,366 |
| ERROR | 1,830 |
| BENIGN_QUERY | 1,681 |
| HALLUCINATION_REFUSAL | 517 |
| PARSE_ERROR | 33 |
| INFRA_ERROR | 6 |
Three-Tier ASR Update
Prior Values vs Updated
| Tier | Prior (CANONICAL_METRICS) | Session Baseline (n=41,842) | Updated (n=43,261) | Delta from Baseline |
|---|---|---|---|---|
| Strict (C only) | 45.9% | 47.96% | 46.89% | -1.07pp |
| Broad (C+P) | 79.3% | 86.17% | 84.09% | -2.08pp |
| FD (C+P+HR) | 80.3% | 87.10% | 85.28% | -1.82pp |
| FD gap | +1.0pp | +0.93pp | +1.20pp | +0.27pp |
Note on prior CANONICAL_METRICS values: The 45.9% / 79.3% / 80.3% values were computed from n=10,294 evaluable results. The current computation uses n=43,261 evaluable results (4.2x larger denominator). The difference between CANONICAL_METRICS values and session baseline reflects grading work done in intervening sessions, not this session’s contribution.
Interpretation
The downward shift is explained by the composition of the newly graded results:
- 37.2% REFUSAL (vs ~15% corpus-wide) — these results were from model/prompt combinations where the model refused
- 31.8% NOT_GRADEABLE — garbled or too-short responses that were excluded from the evaluable denominator
- Only 7.8% COMPLIANCE — substantially below the corpus average of ~47%
The FD gap increased from +0.93pp to +1.20pp, reflecting 130 new HALLUCINATION_REFUSAL verdicts. This means the newly graded results contain a disproportionate number of cases where models produced refusal framing while still generating harmful content.
Statistical Note
The shift of -1.07pp in Strict ASR (47.96% to 46.89%) is within the margin expected from adding 1,419 evaluable results (3.4% of the 43,261 total). No statistical significance test is warranted because this is a census (complete enumeration), not a sample comparison.
FD Gap by Provider (n >= 20 evaluable, ordered by FD gap)
| Provider | n | Strict | Broad | FD | Gap |
|---|---|---|---|---|---|
| xiaomi | 21 | 9.5% | 38.1% | 61.9% | +23.8pp |
| ollama | 1,713 | 29.2% | 46.3% | 67.2% | +20.9pp |
| qwen | 23 | 13.0% | 60.9% | 78.3% | +17.4pp |
| meta | 99 | 12.1% | 45.5% | 59.6% | +14.1pp |
| 343 | 10.8% | 16.6% | 24.5% | +7.9pp | |
| liquid | 145 | 33.8% | 68.3% | 75.2% | +6.9pp |
| deepseek | 210 | 37.6% | 55.7% | 61.4% | +5.7pp |
| mistralai | 477 | 51.4% | 62.5% | 67.9% | +5.4pp |
| stepfun | 62 | 12.9% | 22.6% | 25.8% | +3.2pp |
| meta-llama | 418 | 32.5% | 53.3% | 56.2% | +2.9pp |
| nvidia | 830 | 43.0% | 61.4% | 64.0% | +2.6pp |
| openai | 514 | 53.5% | 61.5% | 63.0% | +1.5pp |
| anthropic | 172 | 7.6% | 11.0% | 12.2% | +1.2pp |
Notable changes from prior: The ollama provider FD gap expanded significantly (+20.9pp), driven by 358 HALLUCINATION_REFUSAL verdicts across local Ollama models. This suggests small local models (deepseek-r1:1.5b, qwen3:1.7b) frequently produce refusal framing that does not prevent content generation — consistent with the VLA PARTIAL dominance finding (Report #49).
Heuristic vs LLM Verdict Comparison (Mistake #21 Compliance)
Per Mistake #21, heuristic and LLM verdicts must be reported separately. The heuristic classifier (keyword-based) has kappa=0.126 agreement with LLM verdicts (near chance).
| Metric | Heuristic | LLM (Haiku) |
|---|---|---|
| Corpus-wide ASR | Not recomputed (heuristic systematically overestimates) | Strict 46.89%, Broad 84.09% |
| Newly graded COMPLIANCE rate | N/A (most had heuristic verdicts prior) | 7.8% of new results |
| Agreement (kappa) | 0.126 [0.108, 0.145] | Reference standard |
The LLM-graded numbers are the authoritative values. Heuristic verdicts remain in the database for audit purposes but should not be cited in any research output.
Batch Grader Pagination Bug
The batch_llm_grader.py uses OFFSET pagination on a query with WHERE llm_verdict IS NULL. As results are graded (llm_verdict is SET), they disappear from the query, causing the offset to skip over remaining ungraded results. This required 4 passes to complete grading.
Recommendation: Fix the pagination to use keyset pagination (WHERE id > last_id) instead of OFFSET. This would complete in a single pass.
Cross-Corpus Comparison
The cross-corpus comparison (tools/analysis/cross_corpus_comparison.py) was run with updated data. Key findings:
- 9 comparison pairs across 4 matched models (llama-3.1-8b abliterated, llama-3.3-70b, mistral-7b, gpt-4o-mini)
- 23 models in public benchmarks remain unmatched in our corpus
- Pooled Spearman rho = -0.404 (negative correlation driven by abliterated model inclusion and different prompt mixes)
- llama-3.3-70b-instruct is the strongest convergence point: our Strict 14.3% vs JailbreakBench 14.0% (+0.3pp)
- Mistral-7b shows our corpus dramatically lower (0% vs 20-60% public) — likely due to DAN-era prompt dominance in our corpus
Action Items
- Update CANONICAL_METRICS.md with new three-tier ASR values
- Fix batch_llm_grader.py pagination bug (keyset pagination)
- Consider grading OBLITERATUS results if needed for completeness (78,585 remaining)
- The high NOT_GRADEABLE rate (31.8% of new results) warrants investigation — are these garbled outputs or non-English responses?