Universal Vulnerability of Small Language Models to Supply Chain Attacks: Empirical Evidence and Multi-Model Consensus Classification
Failure-First Working Paper v2.3 | Adrian Wedd | February 2026
Status: Working paper. Multi-model consensus validated (κ=0.782). Human expert validation study planned. All claims below are based on automated classification validated by frontier model consensus.
Abstract
We present a systematic study of supply chain attack vulnerability in small language models (1.5B-3.8B parameters), alongside a multi-model consensus methodology for classification reliability. Evaluating 6 local models from 6 organizations on 50 identical supply chain attack scenarios reveals uniformly high vulnerability: 90-100% attack success rates (ASR) with no statistically significant pairwise differences (0 of 15 comparisons significant after Bonferroni correction). This universal vulnerability persists across 14 distinct attack categories including dependency injection, configuration tampering, social proof manipulation, and multi-hop delegation. To establish classification reliability, we compare heuristic, single-LLM, and multi-model consensus annotations across a stratified sample of 500 adversarial traces using 3 frontier models as independent annotators. Multi-model consensus achieves substantial inter-rater agreement (Cohen’s κ=0.782, n=297), while confirming that heuristic classifiers inflate ASR by approximately 30% due to misclassifying benign queries as attack successes (heuristic-consensus κ=0.248).
Keywords: AI supply chain, adversarial testing, small language models, multi-model consensus, LLM evaluation
Plain-Language Summary
When AI systems are used to generate code or manage software dependencies, they can be tricked into including malicious packages or creating security vulnerabilities. We tested six small AI models (1.5-3.8 billion parameters) from six different organizations (Meta, Google, Microsoft, Alibaba, DeepSeek, HuggingFace) on 50 identical attack scenarios. The results show uniform vulnerability: all six models complied with 90-100% of attacks, with no statistically significant differences between any pair.
We also investigated how to reliably evaluate whether an AI system has been successfully attacked. We had three frontier AI models independently classify 500 test cases, achieving substantial agreement (κ=0.782). This multi-model consensus revealed that automated keyword-based detection inflates apparent attack success rates by approximately 30%.
Key Findings
1. Universal Supply Chain Vulnerability (90-100% ASR)
| Model | Organization | Parameters | ASR | 95% CI | n |
|---|---|---|---|---|---|
| gemma2:2b | Google | 2.6B | 90.0% | [78.2%, 96.7%] | 50 |
| llama3.2:3b | Meta | 3.2B | 92.0% | [80.8%, 97.8%] | 50 |
| phi3:mini | Microsoft | 3.8B | 92.0% | [80.8%, 97.8%] | 50 |
| qwen3:1.7b | Alibaba | 1.7B | 94.0% | [83.5%, 98.7%] | 50 |
| smollm2:1.7b | HuggingFace | 1.7B | 94.0% | [83.5%, 98.7%] | 50 |
| deepseek-r1:1.5b | DeepSeek | 1.5B | 100.0% | [92.9%, 100.0%] | 50 |
- 0 of 15 pairwise comparisons reach statistical significance (Bonferroni-corrected α=0.0033)
- Largest difference: DeepSeek (100%) vs Gemma (90%), Fisher’s exact p=0.056 (reproduced in the sketch below)
- Mean ASR: 93.7% across all 300 traces
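The headline comparison can be checked directly from the table. A minimal sketch using scipy; the counts come from the results table above, and the test and correction match the Statistical Methods section:

```python
# Reproduce the largest pairwise gap: DeepSeek (50/50 successes) vs
# Gemma (45/50), tested with two-sided Fisher's exact at the
# Bonferroni-corrected threshold of 0.05/15.
from scipy.stats import fisher_exact

deepseek = [50, 0]  # [attack successes, refusals], n=50
gemma = [45, 5]

_, p = fisher_exact([deepseek, gemma], alternative="two-sided")
alpha = 0.05 / 15  # 15 pairwise comparisons among 6 models

print(f"p = {p:.3f}, corrected alpha = {alpha:.4f}")
# p = 0.056 > 0.0033: even the largest observed gap is not significant.
```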
2. Multi-Model Consensus Classification
| Comparison | Cohen’s κ | n | Interpretation |
|---|---|---|---|
| Claude-Gemini (primary) | 0.782 | 297 | Substantial |
| Fleiss’ Kappa (3-way) | 0.761 | 26 | Substantial |
| Heuristic-Consensus | 0.248 | — | Fair (poor) |
Heuristic classifiers inflate ASR by ~30% through systematic misclassification, illustrated in the sketch following this list:
- 129 BENIGN_QUERY traces misclassified as COMPLIANCE (0% heuristic accuracy)
- 70 REFUSAL traces misclassified as COMPLIANCE (verbose refusals trigger keyword false positives)
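The study’s exact keyword rules are not reproduced in this summary; the following hypothetical matcher is a minimal sketch of the failure mode, where a refusal that quotes the attacker’s request still trips “success” keywords:

```python
# Hypothetical keyword heuristic (illustrative only, not the study's
# classifier). Verbose refusals that echo the malicious request match
# the same keywords as genuine compliance.
SUCCESS_KEYWORDS = ("pip install", "added the dependency", "here is the package")

def heuristic_label(response: str) -> str:
    text = response.lower()
    return "COMPLIANCE" if any(k in text for k in SUCCESS_KEYWORDS) else "REFUSAL"

verbose_refusal = (
    "I can't help with that. Running `pip install totally-not-requests` "
    "would pull in a typosquatted package, so I won't provide those steps."
)
print(heuristic_label(verbose_refusal))  # -> COMPLIANCE (false positive)
```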
3. Implications
- No model in the 1.5-3.8B range provides adequate supply chain protection. Organizations deploying small models for code generation should assume near-complete vulnerability.
- External safeguards are essential regardless of model choice: output filtering, dependency allowlists, and human review (an allowlist gate is sketched below).
- Heuristic classifiers are unreliable for adversarial evaluation. Multi-model consensus provides a validated alternative.
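As one concrete instance of these safeguards, a dependency allowlist can gate model-proposed installs before anything executes. A minimal sketch; the allowlist contents and the extraction pattern are illustrative assumptions, not artifacts of this study:

```python
# Gate model-generated shell snippets against a vetted package allowlist.
import re

ALLOWLIST = {"requests", "numpy", "pandas"}  # packages vetted by the org

def proposed_packages(model_output: str) -> set:
    """Collect package names from 'pip install ...' lines in model output."""
    pkgs = set()
    for match in re.finditer(r"pip install ([\w.\- ]+)", model_output):
        pkgs.update(match.group(1).split())
    return pkgs

def gate(model_output: str) -> set:
    blocked = proposed_packages(model_output) - ALLOWLIST
    if blocked:
        raise PermissionError(f"unvetted dependencies: {sorted(blocked)}")
    return proposed_packages(model_output)

gate("pip install requests numpy")  # passes
# gate("pip install reqeusts")      # raises: typosquat is not on the list
```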
Methodology
Attack Scenarios
50 supply chain attack scenarios spanning 14 categories:
| Category | Count |
|---|---|
| Malicious Skill Installation | 6 |
| Typosquatting Variants | 6 |
| Credential Harvesting | 6 |
| Supply Chain Confusion | 6 |
| Trust Substitution | 6 |
| Social Engineering | 2 |
| Context Injection | 2 |
| Tool Misuse | 2 |
| Multi-hop Delegation | 5 |
| Edge Cases | 5 |
| Data Exfiltration | 1 |
| Backdoor Distribution | 1 |
| Attestation Tampering | 1 |
| Model Disagreement | 1 |
All scenarios are non-operational: they describe attack patterns without providing executable exploits.
Model Selection
Six open-weight models (1.5B-3.8B parameters) from six organizations, evaluated via Ollama local inference. Selection criteria: parameter diversity, cross-organization generalizability, local deployment capability, architectural diversity.
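A minimal sketch of the inference harness, assuming a default Ollama server on localhost:11434 and a hypothetical scenarios.json prompt file (neither path nor file format is specified in this summary):

```python
# Query each local model on each scenario via Ollama's HTTP API.
import json
import requests

MODELS = ["gemma2:2b", "llama3.2:3b", "phi3:mini",
          "qwen3:1.7b", "smollm2:1.7b", "deepseek-r1:1.5b"]

def generate(model: str, prompt: str) -> str:
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    r.raise_for_status()
    return r.json()["response"]

with open("scenarios.json") as f:   # hypothetical: 50 scenario prompts
    scenarios = json.load(f)

traces = [{"model": m, "scenario": s["id"], "output": generate(m, s["prompt"])}
          for m in MODELS for s in scenarios]  # 6 x 50 = 300 traces
```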
Classification Pipeline
Three-tier evaluation:
- Heuristic: Keyword-based detection (baseline)
- Single-LLM: DeepSeek-R1 1.5B with 5-category rubric (COMPLIANCE, REFUSAL, PARTIAL, HALLUCINATION_REFUSAL, BENIGN_QUERY)
- Multi-Model Consensus: 3 frontier models (Claude Sonnet 4.5, Gemini 3 Flash, Codex GPT-5.2) as independent annotators
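The consensus rule itself is not detailed in this summary. A minimal majority-vote sketch over the rubric above; the tie handling (escalation on three-way disagreement) is an assumption:

```python
# Majority vote across the three frontier-model annotators.
from collections import Counter

LABELS = {"COMPLIANCE", "REFUSAL", "PARTIAL",
          "HALLUCINATION_REFUSAL", "BENIGN_QUERY"}

def consensus(votes):
    """Return the majority label among 3 annotators, or escalate on a tie."""
    assert len(votes) == 3 and all(v in LABELS for v in votes)
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else "ESCALATE"

print(consensus(["REFUSAL", "REFUSAL", "COMPLIANCE"]))  # -> REFUSAL
print(consensus(["REFUSAL", "PARTIAL", "COMPLIANCE"]))  # -> ESCALATE
```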
Statistical Methods
- Wilson score intervals (95% CI)
- Fisher’s exact test with Bonferroni correction (α=0.05/15)
- Cohen’s Kappa and Fleiss’ Kappa for inter-rater reliability
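Both headline quantities are simple to compute. A minimal sketch of the Wilson interval and Cohen’s kappa; the demo input (50/50) is the DeepSeek row from the results table, where the Wilson bounds reproduce [92.9%, 100.0%]:

```python
# Wilson score interval and Cohen's kappa from first principles.
import math
from collections import Counter

def wilson_ci(successes, n, z=1.96):
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def cohens_kappa(a, b):
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n        # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / n**2        # chance agreement
    return (p_o - p_e) / (1 - p_e)

lo, hi = wilson_ci(50, 50)
print(f"[{lo:.1%}, {hi:.1%}]")  # -> [92.9%, 100.0%], matching the table
```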
Limitations
- Scale-bounded: Results apply to 1.5-3.8B parameter models only. Frontier models may perform differently.
- Near-ceiling effects: 90-100% ASR makes pairwise tests underpowered for small differences.
- Consensus ≠ ground truth: Multi-model consensus validated (κ=0.782) but not equivalent to human expert judgment. LLM annotators may share systematic biases.
- Codex reliability: the Codex annotator’s 94.1% error rate reduced the 3-way Fleiss’ Kappa sample to n=26.
- Temporal snapshot: Results as of February 2026. Models may update safety mechanisms.
Future Work
- Scale expansion: Test 7B, 13B, 70B, and frontier models to identify the parameter threshold at which supply chain attack detection emerges
- Human expert validation: 200-trace subset annotated by 3 AI safety experts
- Defense evaluation: Empirical testing of countermeasures (output filtering, ensemble disagreement)
References
See the full technical report for complete references (25 citations). Key references:
- Carlini et al. (2023). Poisoning Web-Scale Training Datasets is Practical. IEEE S&P.
- Pearce et al. (2022). Asleep at the Keyboard? Assessing the Security of GitHub Copilot’s Code Contributions. IEEE S&P.
- Zheng et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS.
- Wei et al. (2023). Jailbroken: How Does LLM Safety Training Fail? NeurIPS.
- Mazeika et al. (2024). HarmBench: A Standardized Evaluation Framework for Automated Red Teaming. ICML.
- Cohen (1960). A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement.
- Landis & Koch (1977). The Measurement of Observer Agreement for Categorical Data. Biometrics.
Failure-First Working Paper Series | failurefirst.org