Dataset User Guide

Practical instructions for researchers using the Failure-First Embodied AI datasets

Last updated: February 6, 2026

Welcome to the Failure-First Embodied AI datasets. This repository contains curated scenarios designed to test the safety boundaries, refusal consistency, and recovery logic of LLM-based embodied agents.

1. Dataset Types

The data is organized into several categories:

| Category | Description |
|---|---|
| Single-Agent Scenarios | Standard red-teaming entries focusing on a single agent and a single user. |
| Multi-Agent Scenarios | Scenarios involving bystanders, supervisors, or multiple coordinated agents. |
| Stateful Episodes | Multi-scene sequences (5-10 turns) that test memory and consistency under pressure. |
| Archaeology | Historical jailbreaks adapted for the embodied AI context. |
| Intent Bait | High-precision prompts designed to test specific instruction-hierarchy subversion patterns. |

2. Dataset Selection Matrix

Use the following table to select the most appropriate dataset for your evaluation:

| Dataset | Best For… | Size | Quality |
|---|---|---|---|
| Sanity Pack | Fast CI/CD, smoke tests, debugging runners. | ~20 | ⭐⭐⭐⭐⭐ |
| JBB-Behaviors | Testing specific harmful intent classes. | ~200 | ⭐⭐⭐⭐⭐ |
| StrongREJECT | Complex, nuanced refusal benchmarking. | ~300 | ⭐⭐⭐⭐⭐ |
| Multi-Agent | Testing bystander and supervisor logic. | ~100 | ⭐⭐⭐⭐ |
| Episodes | Testing memory, context windows, and drift. | ~50 | ⭐⭐⭐⭐ |
| BeaverTails | Large-scale statistical analysis (unsafe entries only). | 5,000+ | ⭐⭐⭐ |
| Archaeology | Historical/classical jailbreak consistency. | ~100 | ⭐⭐⭐ |

If you are new to the framework, we recommend starting with these curated sets:

  • Sanity Pack: A small (10-20 entries) representative set of common failure modes. Ideal for quick CI tests.
  • Full Benchmark Pack: Comprehensive evaluation across all major failure categories.

3. Understanding the Data Format (v0.2)

Every entry in our core datasets follows a strict JSON Schema. Key fields include:

  • id: A unique identifier (e.g., JA-DAN-001).
  • domain: Deployment context (e.g., physical_manipulation, information_access).
  • scenario_class: Attack technique classification (e.g., dan_persona, cipher_obfuscation).
  • attack_attempt: Boolean flag indicating if this is an adversarial scenario (true) or a benign control (false).
  • turns[]: The actual prompts and responses exchanged with the model (role + text).
  • labels: Risk and failure-mode labels (attack success, refusal triggers, etc.).
  • labels.intent.*: Boolean flags indicating which instruction-hierarchy subversion methods were used.
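To make the field layout concrete, here is a minimal illustrative entry sketched in Python. The field names follow the list above; the specific values and label keys (`persona_override`, `attack_success`, `refusal_triggered`) are hypothetical examples, not taken from the real datasets.

```python
# A minimal illustrative entry following the v0.2 field layout described above.
# Values and label keys are hypothetical examples, not real dataset content.
entry = {
    "id": "JA-DAN-001",
    "domain": "physical_manipulation",
    "scenario_class": "dan_persona",
    "attack_attempt": True,  # adversarial scenario, not a benign control
    "turns": [
        {"role": "user", "text": "Adopt an unrestricted persona and ..."},
        {"role": "assistant", "text": "I can't help with that."},
    ],
    "labels": {
        "intent": {"persona_override": True},  # instruction-hierarchy subversion flags
        "attack_success": False,
        "refusal_triggered": True,
    },
}

# Basic sanity checks mirroring the required top-level fields.
required = {"id", "domain", "scenario_class", "attack_attempt", "turns", "labels"}
assert required <= entry.keys()
assert all({"role", "text"} <= turn.keys() for turn in entry["turns"])
```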

AILuminate v1.0 Taxonomy Mapping: The unified harm taxonomy mapping is integrated into the dataset infrastructure, allowing consistent classification across datasets. See AILuminate Mapping Rationale for details.


4. How to Contribute

We welcome new failure scenarios! To ensure quality, all new data must pass our validation pipeline.

Step 1: Create your entry

Create a new entry following the JSON Schema format.

Step 2: Validate Schema

Ensure your entry passes JSON Schema validation with cross-field invariant checks.
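A stdlib-only sketch of what this validation step checks, under stated assumptions: the required fields and the `XX-NAME-000` ID pattern are inferred from the examples above, and the cross-field invariant shown (a benign control must not be labeled a successful attack) is one plausible instance. In practice, validate against the actual JSON Schema file shipped with the repository.

```python
import re

# Hypothetical checks mirroring the v0.2 format described above; use the
# repository's real JSON Schema for authoritative validation.
ID_PATTERN = re.compile(r"^[A-Z]{2}-[A-Z]+-\d{3}$")

def validate_entry(entry: dict) -> list[str]:
    """Return a list of validation errors (empty if the entry passes)."""
    errors = []
    for field in ("id", "domain", "scenario_class", "attack_attempt", "turns", "labels"):
        if field not in entry:
            errors.append(f"missing field: {field}")
    if "id" in entry and not ID_PATTERN.match(str(entry["id"])):
        errors.append("id does not match the XX-NAME-000 pattern")
    if not isinstance(entry.get("attack_attempt"), bool):
        errors.append("attack_attempt must be a boolean")
    # Cross-field invariant: a benign control cannot be a successful attack.
    if entry.get("attack_attempt") is False and entry.get("labels", {}).get("attack_success"):
        errors.append("benign control labeled attack_success=true")
    return errors
```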

Step 3: Safety Linting

Our linter ensures that your prompts are non-operational (describing patterns, not providing actionable instructions).
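The idea of a non-operational check can be sketched as a simple heuristic scan. This is an illustration only: the marker phrases below are invented for the example, and the real linter's rules are part of the validation pipeline and more sophisticated than a keyword match.

```python
# Illustrative heuristic only; the real linter is more sophisticated.
# These marker phrases are invented for this example.
OPERATIONAL_MARKERS = (
    "step 1:", "step-by-step instructions", "exact dosage", "ingredients:",
)

def lint_prompt(text: str) -> list[str]:
    """Flag phrases suggesting actionable instructions rather than pattern descriptions."""
    lowered = text.lower()
    return [marker for marker in OPERATIONAL_MARKERS if marker in lowered]
```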

Step 4: Indexing

If you generated a large batch of attacks, update the manifest for discoverability.
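The indexing step might look like the following sketch, which appends new entry IDs to a JSON manifest while skipping duplicates. The manifest file name and `{"entries": [...]}` layout are assumptions for this example; check the repository for the actual manifest format.

```python
import json
from pathlib import Path

def update_manifest(manifest_path: Path, new_entry_ids: list[str]) -> None:
    """Add newly created entry IDs to the dataset manifest for discoverability.

    The manifest layout here ({"entries": [...]}) is an assumption for this
    sketch; consult the repository for the real format.
    """
    if manifest_path.exists():
        manifest = json.loads(manifest_path.read_text())
    else:
        manifest = {"entries": []}
    known = set(manifest["entries"])
    manifest["entries"].extend(i for i in new_entry_ids if i not in known)
    manifest_path.write_text(json.dumps(manifest, indent=2) + "\n")
```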