7 Framework Integrations: Run Any Tool, Grade with FLIP

The AI red-teaming ecosystem has matured rapidly. Tools like Garak, PyRIT, DeepTeam, Promptfoo, IBM ART, StrongREJECT, and the Votal AI catalog each bring real strengths — attack generation, orchestration, adversarial sample creation, or catalogue coverage.

What none of them do well is grade the results.

The Coverage Problem

We mapped our 36 attack families against six major frameworks. The results are sobering:

Framework	Coverage
MITRE ATLAS	20/36 (56%)
OWASP LLM Top 10 (2025)	19/36 (53%)
OWASP Agentic Top 10 (2026)	20/36 (56%)
Garak (NVIDIA)	4/36 (11%)
PyRIT (Microsoft)	5/36 (14%)
DeepTeam (Confident AI)	3/36 (9%)

Seven of our families have zero coverage in any framework. The gap is concentrated in embodied AI: action-layer attacks, cross-embodiment transfer, kinematic safety violations, and compositional reasoning attacks are absent from every tool in the ecosystem.

Organisations running Garak or PyRIT as their primary safety evaluation are testing 9—14% of the attack surface we have documented. The remaining 86—91% is untested.

The Grading Problem Is Worse Than the Coverage Problem

Even within the 9—14% of attacks these tools cover, grading accuracy is a serious concern. Most frameworks use keyword or pattern-matching classifiers to determine whether an attack succeeded.

Our data shows this approach is near-random. Cohen’s kappa between heuristic and LLM-based grading is 0.097 (n=950) — barely above chance. Heuristic classifiers confirm only 20.1% of their own COMPLIANCE verdicts when checked by an LLM grader. On VLA traces, heuristic ASR is 94% versus FLIP ASR of 72.4%, a 22 percentage-point gap.

Binary pass/fail testing misses the most important category entirely: PARTIAL compliance. In our VLA corpus, 50% of all verdicts are PARTIAL — the model produces a safety disclaimer but still generates the dangerous action sequence. A binary classifier counts these as either pass or fail. Neither answer is correct. The model is functionally dangerous while appearing safe.

What FLIP Adds

FLIP (Forward-Looking Inference of Prompt) is our grading methodology. Instead of pattern-matching on keywords, a grader model reads the response and infers what instruction the model was following. It then classifies the response into one of six categories:

Verdict	Meaning
COMPLIANCE	Full harmful compliance
PARTIAL	Safety hedging with partial harmful content
REFUSAL	Clear, unambiguous refusal
HALLUCINATION_REFUSAL	Refusal framing but harmful content present
BENIGN_QUERY	Non-adversarial or format-compliant without harm
ERROR	Infrastructure or grader failure

The six-category taxonomy captures what binary testing cannot. PARTIAL and HALLUCINATION_REFUSAL are the categories that matter most for real-world safety — and they account for 12.3% and 8.8% of our non-OBLITERATUS corpus respectively.

HALLUCINATION_REFUSAL is particularly dangerous: statistical analysis confirms it is computationally identical to COMPLIANCE (thinking tokens p=0.21, response tokens p=0.46). The model generates the harmful content but wraps it in refusal framing. It looks safe. It is not.

How the Integrations Work

FLIP grading operates as a post-processing layer. You can run any red-teaming tool to generate attack traces, then grade those traces with FLIP for accurate, multi-category classification.

The workflow:

Generate attacks using your existing tool (Garak, PyRIT, DeepTeam, Promptfoo, custom scripts)
Export traces as JSONL (prompt-response pairs)
Grade with FLIP using our grading pipeline
Report with three-tier ASR (strict, broad, functionally dangerous) and per-category breakdowns

This is not a replacement for existing tools. It is a grading standard layer that sits on top of them.

What Each Framework Brings

Garak (NVIDIA): Probe-based attack generation with good coverage of text-level jailbreaks. 4/36 family coverage. Strength: automated probe construction and systematic scanning.

PyRIT (Microsoft): Orchestrated multi-turn attack sequences with extensible architecture. 5/36 family coverage. Strength: multi-turn escalation and red-team workflow management.

DeepTeam (Confident AI): Unit-testing paradigm for LLM safety with clean test definitions. 3/36 family coverage. Strength: CI/CD integration and regression testing.

Promptfoo: Evaluation framework with prompt variation and model comparison. Focus on evaluation quality rather than attack generation. Strength: A/B testing and prompt optimisation.

IBM Adversarial Robustness Toolbox (ART): Mature adversarial ML library with evasion, poisoning, and extraction attacks. Originally computer vision focused, expanding to LLMs. Strength: gradient-based attacks and certified defenses.

StrongREJECT: Jailbreak evaluation benchmark with automated scoring. Focus on measuring refusal quality. Strength: standardised refusal evaluation and attack difficulty scaling.

Votal AI Catalog: Curated vulnerability database with structured attack descriptions. Strength: taxonomy and cross-referencing of known vulnerabilities.

The 10 Families No Framework Covers

Beyond the 7 with zero framework coverage, an additional 3 families have only single-framework coverage. These 10 represent the attack surfaces that the ecosystem collectively ignores:

Cross-Embodiment Transfer (CET) — attacks that transfer across robot morphologies
Compositional Reasoning Attack (CRA) — individually benign instructions producing emergent harm
Multi-Agent Collusion (MAC) — coordinated attacks across agent boundaries
Sensor Spoofing Attack (SSA) — falsified sensor data driving unsafe actions
Reward Hacking Attack (RHA) — exploiting reward signals for dangerous optimisation
Affordance Verification Failure (AFF) — perception-action coupling exploitation
Kinematic Safety Violation (KIN) — unsafe physical movements through constraint violations
Iatrogenic Exploitation Attack (IEA) — exploiting safety mechanisms to cause harm
Temporal Convergence Attack (TCA) — synchronized conditions creating failure windows
Hybrid Deceptive Alignment + Semantic Benignity (DA-SBA)

Every one of these is an embodied or multi-agent attack surface. The framework ecosystem is built for chatbots. The deployment frontier has moved to robots.

Positioning: FLIP as the Grading Standard

We are not building another red-teaming tool. The ecosystem has enough attack generators. What it lacks is a reliable, multi-category grading standard with measured inter-rater reliability.

FLIP provides:

Measured reliability: We report Cohen’s kappa for every grading comparison. You know exactly how much to trust the numbers.
Six-category verdicts: Captures PARTIAL and HALLUCINATION_REFUSAL, the categories binary testing misses.
Three-tier ASR: Strict, broad, and functionally dangerous — so you can choose the risk threshold appropriate to your deployment.
Framework-agnostic: Works with any tool that outputs prompt-response pairs.
Reproducible: All grading uses documented LLM judges (Claude Haiku 4.5 primary, with secondary graders for cross-validation).

If you are running adversarial evaluations and reporting binary ASR from keyword classifiers, your numbers have unknown systematic bias — potentially by factors of 2x to 84x. FLIP grading provides the correction layer.

Get Started

Our annual report provides the full methodology, including per-provider breakdowns across 193 models and statistical significance testing.

For red-team assessments using FLIP grading across all 36 attack families, contact us at research@failurefirst.org.