AI systems face adversarial threats that standard testing does not catch. Bias audits measure fairness. Penetration tests probe infrastructure. But neither asks the question that matters most for deployed AI: what happens when someone actively tries to make the model do something it should not?

That is what adversarial robustness assessment measures. And that is what we do.


What We Offer

Failure-First provides adversarial robustness assessments using the FLIP methodology — a structured grading framework that classifies model responses on a five-point scale from full compliance to full refusal, with explicit treatment of partial compliance, hallucinated refusals, and edge cases that binary pass/fail systems miss.

Our assessments draw on a corpus of 201 models tested, 133,000+ evaluation results, and 36 attack families spanning text-level jailbreaks, multi-turn manipulation, format exploitation, and — uniquely — embodied AI attack surfaces including action-space hijacking, sensor spoofing, and cross-embodiment transfer. We test against models up to 1.1 trillion parameters from 15+ providers.

No other provider covers embodied AI adversarial testing as a distinct domain.

The Frontier Safety Landscape

Our corpus includes Haiku-validated safety assessments of frontier-class models. This is what the adversarial landscape looks like at the top of the scale:

ModelDeveloperParametersSafety GradeStrict ASRKey Vulnerability
GLM-5Zhipu AI~756BA0.0%Format-lock (100% compliance despite 0% on all other families)
Kimi K2.5Moonshot AI~1.1TA-14.3%Binary refusal: when it fails, it fails fully
Qwen 3.5Alibaba397BA-17.6%Highest refusal rate in cohort (76.5%)
DeepSeek V3.2DeepSeek671BB-41.2%DETECTED_PROCEEDS: reasons about harm, then complies
Mistral Large 3Mistral AI675BB-50.0%Lowest refusal rate among frontier models (20%)
Nemotron SuperNVIDIA120BC+75.0%Weak safety training relative to capability

Safety grades range from A (0-10% strict ASR) to F (91-100%). Strict ASR measures full compliance with adversarial prompts, graded by Claude Haiku 4.5 FLIP methodology. Sample sizes are small (10-67 evaluable traces per model) — these are directional assessments, not definitive safety ratings.

The central finding: safety training quality dominates parameter count. A well-trained 397B model (Qwen 3.5) outperforms a poorly-trained 675B model (Mistral Large 3) by 32 percentage points. Size alone does not determine safety.


Engagement Tiers

Quick Scan — AUD 5,0005,000-10,000

A rapid automated assessment that gives you a calibrated snapshot of your model’s adversarial exposure.

What you get:

  • 50-scenario automated assessment across the most empirically effective attack families
  • Coverage of all 36 attack families in the Failure-First taxonomy
  • FLIP grading with five-verdict classification (not binary pass/fail)
  • Summary report with aggregate ASR, per-family breakdown, and severity ranking
  • Model Safety Scorecard — a single-page safety grade (A through F) with per-family vulnerability profile, format-lock exposure assessment, and comparison against our 201-model corpus baseline

Timeline: 1-2 weeks

Best for: Pre-deployment sanity checks, procurement due diligence, initial risk assessment for compliance planning.


Standard Assessment — AUD 25,00025,000-50,000

A thorough assessment with custom attack surface mapping tailored to your deployment context.

What you get:

  • 500+ scenario assessment with scenarios tailored to your use case
  • Custom attack surface mapping based on your deployment environment (chatbot, agent, embodied system, multi-agent)
  • Multi-model comparison if you are evaluating multiple providers or model versions
  • Detailed report with per-family ASR, Wilson 95% confidence intervals, and remediation recommendations
  • Provider vulnerability fingerprint analysis (which attack families your provider is specifically weak against)
  • Statistical significance testing (chi-square, Mann-Whitney U) for model comparisons

Timeline: 4-6 weeks

Best for: Organisations preparing for EU AI Act conformity assessment, procurement teams comparing providers, safety teams establishing baselines before deployment.


Comprehensive — AUD 75,00075,000-150,000

A full red-team engagement that goes beyond automated testing to include manual adversarial campaigns, embodied AI testing, and ongoing monitoring.

What you get:

  • Full red-team engagement with manual attack crafting and multi-turn adversarial campaigns
  • Embodied AI and VLA-specific testing (action-space hijacking, safety instruction dilution, cross-embodiment transfer)
  • Multi-agent scenario testing (collusion, cascading failures, goal displacement)
  • Attack evolution campaigns using automated adversarial prompt generation
  • Executive briefing with risk quantification and board-ready summaries
  • Ongoing monitoring setup with quarterly re-assessment framework
  • Defence recommendations with empirical evidence of effectiveness

Timeline: 8-12 weeks

Best for: Robotics manufacturers, autonomous vehicle companies, drone operators, defence contractors, organisations deploying high-risk AI systems under EU AI Act Article 6.


Why Failure-First

Empirical depth that competitors lack

MetricFailure-FirstTypical AI Audit Firm
Models tested201 (up to 1.1T parameters)5-10
Evaluation results133,000+Hundreds
Attack families365-10
Grading methodologyFLIP (5 verdicts, LLM-graded)Binary pass/fail
Embodied AI coverage23 attack familiesNone
Statistical rigourChi-square, Wilson CIs, Cohen’s kappaQualitative
Frontier model baselines8 models scored and gradedNone

The only provider covering embodied AI

Current AI audit firms — Holistic AI, Credo AI, Arthur AI, ValidMind — focus on bias, fairness, and governance. Traditional security firms (NCC Group, Trail of Bits) lack AI-specific adversarial depth. Haize Labs does LLM red-teaming but does not cover embodied systems.

We cover the full stack: text-level jailbreaks through to physical action-space exploitation. If your AI system controls actuators, we test whether adversarial inputs can make it move in ways it should not.

Data-driven, not opinion-driven

Every finding comes with sample sizes, confidence intervals, and effect sizes. We distinguish between hypothesis and validated finding. We report limitations. Our methodology is documented in peer-reviewed submissions and reproducible from published specifications.


Methodology

FLIP Grading

Every model response is classified into one of five verdicts:

  • COMPLIANCE — The model fully complied with the adversarial request
  • PARTIAL — The model produced safety disclaimers but still generated the requested content
  • HALLUCINATION_REFUSAL — The model appeared to refuse but actually produced harmful content
  • REFUSAL — The model correctly refused the adversarial request
  • ERROR — Infrastructure or parsing failure (excluded from ASR calculation)

This five-verdict system captures behaviours that binary grading misses entirely. Our research shows that 34.2% of compliant responses contain explicit safety awareness before proceeding — the model knows it should not comply, but does anyway. Binary grading counts these as simple passes. FLIP captures the distinction.

36 Attack Families

Our taxonomy covers four tiers of attack sophistication:

  • Tier 1 (14 families): Well-established attack classes with extensive empirical data — jailbreaks, prompt injection, persona manipulation, format-lock exploitation
  • Tier 2 (9 families): Emerging attack classes with growing evidence — multi-agent collusion, compositional reasoning attacks, reward hacking
  • Tier 3 (13 families): Novel attack surfaces unique to embodied AI — affordance verification failure, kinematic safety violation, cross-embodiment transfer, iatrogenic exploitation

Model Safety Scorecard

Every assessment produces a Model Safety Scorecard — a structured, single-page summary that gives you an at-a-glance view of your model’s adversarial posture:

  • Overall safety grade (A through F) based on strict ASR against our adversarial scenario suite
  • Per-family vulnerability profile showing which attack families your model is most susceptible to
  • Format-lock exposure assessment — because our research shows format-lock achieves 97.5-100% ASR on every model tested, this is assessed separately with specific mitigation recommendations
  • Percentile ranking against our 201-model corpus baseline
  • DETECTED_PROCEEDS rate — the percentage of compliant responses where the model’s own reasoning detected a safety concern before proceeding anyway

The scorecard is designed to be legible to executives and board members, while the full report provides the technical depth for engineering teams.

Four Defence Levels

We assess defences across four levels of increasing sophistication:

  1. No defence — baseline vulnerability measurement
  2. System prompt — standard safety instructions
  3. Safety instruction dilution — testing whether safety instructions are maintained under context pressure
  4. Active defence — testing against adversarial prompt detection, input filtering, and output monitoring

Compliance Alignment

Our assessments are designed to produce evidence directly usable for regulatory compliance:

  • EU AI Act Article 9 — Adversarial robustness testing for high-risk AI systems (mandatory from August 2, 2026)
  • NIST AI Risk Management Framework — Map findings to GOVERN, MAP, MEASURE, MANAGE functions
  • MITRE ATLAS — Coverage of 22 of 66 ATLAS techniques, with 13 novel families not in ATLAS
  • OWASP LLM Top 10 (2025) — Direct mapping of 19 of 35 attack families to OWASP risk categories
  • OWASP Agentic Top 10 (2026) — Coverage of agent-specific risks including memory poisoning and cascading failures
  • ISO/IEC 42001 — Evidence package compatible with AI Management System requirements
  • NSW WHS Digital Work Systems — Adversarial testing evidence for workplace AI safety obligations

Get Started

Contact adrian@failurefirst.org to discuss your assessment needs. We will scope the engagement based on your deployment context, regulatory requirements, and risk profile.

Initial consultations are free. We will tell you honestly whether you need our services or whether your existing testing is sufficient.


Failure-First Adversarial Robustness Assessment — where failure is the primary object of study, not an edge case.