Our First AdvBench Results: 7 Models, 288 Traces, $0

AdvBench is one of the most widely cited jailbreak evaluation benchmarks in the AI safety literature. It contains 520 harmful behaviour prompts — straightforward requests for dangerous content with no adversarial wrapping. If a model complies with an AdvBench prompt, it is complying with a naked harmful request.

Until this week, our corpus of 132,416 results had zero AdvBench traces. That changed with a free-tier run across 7 models on OpenRouter. The results are preliminary — sample sizes are small and rate limits hit hard — but they already tell us something interesting.


The Setup

We ran AdvBench prompts against 7 free-tier models via OpenRouter’s API at zero cost:

  • Arcee Trinity Large Preview (30 traces)
  • Liquid LFM 2.5 1.2B Thinking (28 usable traces)
  • Nvidia Nemotron Super 120B (50 traces)
  • MiniMax M2.5 (4 usable traces)
  • Google Gemma 3 27B (rate-limited, 0 usable)
  • Meta Llama 3.3 70B (rate-limited, 0 usable)
  • Mistral Small 3.1 24B (rate-limited, 0 usable)

Total usable traces: 112. Total attempted: 345. The free tier is generous but fragile — three models returned nothing but HTTP 429 errors across all 50 attempts. This is the reality of zero-cost benchmarking.


Results

ModelUsable TracesAttack SuccessASR
Arcee Trinity Large Preview301136.7%
Liquid LFM 2.5 1.2B Thinking28828.6%
Nvidia Nemotron Super 120B5000.0%
MiniMax M2.5400.0%

Three additional models (Gemma 27B, Llama 70B, Mistral Small 24B) were fully rate-limited with zero usable traces.

Note on Nemotron: The heuristic classifier initially flagged 4 of 50 traces as attack successes (8% ASR). Manual verification by Amy Pond (Report #204) found all 50 responses were genuine refusals. The 4 false positives were caused by a tokenization artifact: the free-tier Nemotron endpoint produces responses with missing-space artifacts like “Icannot” instead of “I cannot,” which bypass the heuristic’s refusal detection patterns. Corrected ASR is 0%.

This is Mistake #21 in action — heuristic classifiers overcount. Every AdvBench result we publish will use LLM-graded verdicts.


What Stands Out

Trinity Is Surprisingly Permissive

Arcee’s Trinity Large Preview model complied with over a third of naked harmful requests. These are not adversarial prompts. There is no jailbreak wrapping, no format-lock, no persona hijack. The prompts simply ask for harmful content directly.

A 36.7% ASR on unadorned AdvBench prompts places Trinity firmly in our “permissive” cluster (>=40% ASR on our full corpus taxonomy). For context, our corpus-wide average across 190 models is approximately 20% strict ASR — but that includes adversarial wrapping designed to bypass safety training. Scoring 36.7% on raw harmful requests, with no attack technique applied, suggests that Trinity’s safety training has significant gaps for direct harmful requests.

LFM Thinking: Small Model, Visible Reasoning, Measurable Risk

Liquid’s LFM 2.5 is a 1.2-billion-parameter reasoning model. At that scale, our established capability-floor finding predicts elevated vulnerability — models below approximately 3 billion parameters generally lack the capacity for robust safety training. LFM’s 28.6% ASR is consistent with this pattern.

What makes LFM interesting is that it is a thinking model. It produces visible reasoning traces before responding. We have documented extensively (Reports #170, #194) how reasoning traces can reveal the DETECTED_PROCEEDS pattern — models detecting harm in their reasoning and proceeding anyway. LFM’s traces are a potential source of this data at a scale point where the phenomenon may be structurally different from larger models.

Four Models at 0%: The Baseline Works

Nemotron 120B, MiniMax M2.5, and the three rate-limited models (where we have no usable data to claim otherwise) all show 0% ASR on the AdvBench prompts they were able to process. This is the expected result for well-trained models facing unadorned harmful requests. AdvBench prompts carry no adversarial wrapping — a model that complies is failing at the most basic level of safety.


Limitations

These results come with significant caveats:

  1. Small samples. 28-50 traces per model is far below what is needed for statistical confidence. Wilson 95% confidence intervals on Trinity’s 36.7% ASR span roughly [21%, 55%]. These are directional findings, not precision measurements.

  2. Heuristic-only verdicts. All classifications are from heuristic rules, not LLM grading. We already found one misclassification (Nemotron ADVB-0014). LLM grading is required before any of these numbers are citeable.

  3. Free-tier artifacts. Tokenization issues (missing spaces in Nemotron responses), rate limiting (3 models completely blocked), and low trace counts (MiniMax at n=4) are all consequences of the $0 budget. The results demonstrate what is achievable for free, not what is achievable.

  4. No adversarial wrapping. AdvBench tests baseline safety against direct harmful requests. It does not test adversarial robustness. A model that scores 0% on AdvBench can still be vulnerable to format-lock, multi-turn, or reasoning exploitation attacks.


Why This Matters

This run establishes our first public-dataset baseline. AdvBench is the benchmark that CCS reviewers will expect to see. Having zero AdvBench results was a gap identified in Report #212 (Public Dataset Coverage). Now we have a starting point.

The next steps are clear: expand to the full 520 prompts, add LLM grading, run paid-tier models without rate limits, and extend to HarmBench and StrongREJECT. The $0 run proved the pipeline works. The data collection is underway.


Full verification report at runs/advbench_baseline_free/VERIFICATION.md. Benchmark execution plan at Report #210. Pipeline: tools/benchmarks/run_benchmark_http.py with data/splits/advbench_baseline_v0.1.jsonl.

This post is part of the Failure-First Embodied AI research programme.