Dataset User Guide

Practical instructions for researchers using the Failure-First Embodied AI datasets

Last updated: February 6, 2026

Welcome to the Failure-First Embodied AI datasets. This repository contains curated scenarios designed to test the safety boundaries, refusal consistency, and recovery logic of LLM-based embodied agents.

1. Dataset Types

The data is organized into several categories:

| Category | Description |
|---|---|
| Single-Agent Scenarios | Standard red-teaming entries focusing on a single agent and a single user. |
| Multi-Agent Scenarios | Scenarios involving bystanders, supervisors, or multiple coordinated agents. |
| Stateful Episodes | Multi-scene sequences (5-10 turns) that test memory and consistency under pressure. |
| Archaeology | Historical jailbreaks adapted for the embodied AI context. |
| Intent Bait | High-precision prompts designed to test specific instruction-hierarchy subversion patterns. |

2. Dataset Selection Matrix

Use the following table to select the most appropriate dataset for your evaluation:

| Dataset | Best For… | Size | Quality |
|---|---|---|---|
| Sanity Pack | Fast CI/CD, smoke tests, debugging runners. | ~20 | ⭐⭐⭐⭐⭐ |
| JBB-Behaviors | Testing specific harmful intent classes. | ~200 | ⭐⭐⭐⭐⭐ |
| StrongREJECT | Complex, nuanced refusal benchmarking. | ~300 | ⭐⭐⭐⭐⭐ |
| Multi-Agent | Testing bystander and supervisor logic. | ~100 | ⭐⭐⭐⭐ |
| Episodes | Testing memory, context windows, and drift. | ~50 | ⭐⭐⭐⭐ |
| BeaverTails | Large-scale statistical analysis (unsafe entries only). | 5,000+ | ⭐⭐⭐ |
| Archaeology | Historical/classical jailbreak consistency. | ~100 | ⭐⭐⭐ |

If you are new to the framework, we recommend starting with these curated sets:

  • Sanity Pack: A small (10-20 entries) representative set of common failure modes. Ideal for quick CI tests.
  • Full Benchmark Pack: Comprehensive evaluation across all major failure categories.

3. Understanding the Data Format (v0.2)

Every entry in our core datasets follows a strict JSON Schema. Key fields include:

  • id: A unique identifier (e.g., JA-DAN-001).
  • domain: Deployment context (e.g., physical_manipulation, information_access).
  • scenario_class: Attack technique classification (e.g., dan_persona, cipher_obfuscation).
  • attack_attempt: Boolean flag indicating if this is an adversarial scenario (true) or a benign control (false).
  • turns[]: The actual prompts and responses exchanged with the model (role + text).
  • labels: Risk and failure-mode labels (attack success, refusal triggers, etc.).
  • labels.intent.*: Boolean flags indicating which instruction-hierarchy subversion methods were used.
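To make the field layout concrete, here is a minimal illustrative entry sketched in Python. The field names follow the list above; the specific values and label keys (`persona_override`, `attack_success`, `refusal_triggered`) are hypothetical examples, not taken from the real datasets.

```python
# A minimal illustrative entry following the v0.2 field layout described above.
# Values and label keys are hypothetical examples, not real dataset content.
entry = {
    "id": "JA-DAN-001",
    "domain": "physical_manipulation",
    "scenario_class": "dan_persona",
    "attack_attempt": True,  # adversarial scenario, not a benign control
    "turns": [
        {"role": "user", "text": "Adopt an unrestricted persona and ..."},
        {"role": "assistant", "text": "I can't help with that."},
    ],
    "labels": {
        "intent": {"persona_override": True},  # instruction-hierarchy subversion flags
        "attack_success": False,
        "refusal_triggered": True,
    },
}

# Basic sanity checks mirroring the required top-level fields.
required = {"id", "domain", "scenario_class", "attack_attempt", "turns", "labels"}
assert required <= entry.keys()
assert all({"role", "text"} <= turn.keys() for turn in entry["turns"])
```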

AILuminate v1.0 Taxonomy Mapping: The unified harm taxonomy mapping is integrated into the dataset infrastructure, allowing consistent classification across datasets. See AILuminate Mapping Rationale for details.


4. How to Contribute

We welcome new failure scenarios! To ensure quality, all new data must pass our validation pipeline.

Step 1: Create your entry

Create a new entry following the JSON Schema format.

Step 2: Validate Schema

Ensure your entry passes JSON Schema validation with cross-field invariant checks.
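A stdlib-only sketch of what this validation step checks, under stated assumptions: the required fields and the `XX-NAME-000` ID pattern are inferred from the examples above, and the cross-field invariant shown (a benign control must not be labeled a successful attack) is one plausible instance. In practice, validate against the actual JSON Schema file shipped with the repository.

```python
import re

# Hypothetical checks mirroring the v0.2 format described above; use the
# repository's real JSON Schema for authoritative validation.
ID_PATTERN = re.compile(r"^[A-Z]{2}-[A-Z]+-\d{3}$")

def validate_entry(entry: dict) -> list[str]:
    """Return a list of validation errors (empty if the entry passes)."""
    errors = []
    for field in ("id", "domain", "scenario_class", "attack_attempt", "turns", "labels"):
        if field not in entry:
            errors.append(f"missing field: {field}")
    if "id" in entry and not ID_PATTERN.match(str(entry["id"])):
        errors.append("id does not match the XX-NAME-000 pattern")
    if not isinstance(entry.get("attack_attempt"), bool):
        errors.append("attack_attempt must be a boolean")
    # Cross-field invariant: a benign control cannot be a successful attack.
    if entry.get("attack_attempt") is False and entry.get("labels", {}).get("attack_success"):
        errors.append("benign control labeled attack_success=true")
    return errors
```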

Step 3: Safety Linting

Our linter ensures that your prompts are non-operational (describing patterns, not providing actionable instructions).
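The idea of a non-operational check can be sketched as a simple heuristic scan. This is an illustration only: the marker phrases below are invented for the example, and the real linter's rules are part of the validation pipeline and more sophisticated than a keyword match.

```python
# Illustrative heuristic only; the real linter is more sophisticated.
# These marker phrases are invented for this example.
OPERATIONAL_MARKERS = (
    "step 1:", "step-by-step instructions", "exact dosage", "ingredients:",
)

def lint_prompt(text: str) -> list[str]:
    """Flag phrases suggesting actionable instructions rather than pattern descriptions."""
    lowered = text.lower()
    return [marker for marker in OPERATIONAL_MARKERS if marker in lowered]
```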

Step 4: Indexing

If you generated a large batch of attacks, update the manifest for discoverability.
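The indexing step might look like the following sketch, which appends new entry IDs to a JSON manifest while skipping duplicates. The manifest file name and `{"entries": [...]}` layout are assumptions for this example; check the repository for the actual manifest format.

```python
import json
from pathlib import Path

def update_manifest(manifest_path: Path, new_entry_ids: list[str]) -> None:
    """Add newly created entry IDs to the dataset manifest for discoverability.

    The manifest layout here ({"entries": [...]}) is an assumption for this
    sketch; consult the repository for the real format.
    """
    if manifest_path.exists():
        manifest = json.loads(manifest_path.read_text())
    else:
        manifest = {"entries": []}
    known = set(manifest["entries"])
    manifest["entries"].extend(i for i in new_entry_ids if i not in known)
    manifest_path.write_text(json.dumps(manifest, indent=2) + "\n")
```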