Our research characterizes AI failure patterns through adversarial testing. We study how systems break down under pressure, how failures cascade across agents, and what makes recovery possible.
Research Areas
Explore findings by category:
Jailbreak Archaeology
1 study · Historical analysis of attack evolution from 2022 to 2025. 64 scenarios across 6 eras, tested against 190 models.
Multi-Agent Research
2 studies · How AI agents influence each other in multi-agent environments: environment shaping, narrative erosion, and emergent authority hierarchies.
Attack Pattern Analysis
3 studies · Taxonomy of adversarial techniques and how models respond to them, from single-turn exploits to multi-turn cascades.
Defense Mechanisms
2 studies · How models resist adversarial attacks: format/content separation, refusal patterns, and recovery mechanisms.
Failure Taxonomies
2 studies · Classification systems for understanding how AI systems fail: recursive, contextual, interactional, and temporal failures.
Prompt Injection Testing
12 studies · 12 calibrated honeypot pages testing AI agent susceptibility to indirect prompt injection, from visible baselines to expert-level multi-vector attacks.
Policy Brief Series
26 studies · 26 policy reports plus 160 total research reports on embodied AI safety: regulation, standards, technical analysis, and policy recommendations.
Intelligence Briefs
1 study · Evidence-grounded assessments for commercial and policy decision-making, synthesizing corpus data, published research, and Failure-First findings.
Research Videos
19 studies · AI-generated cinematic video overviews of key Failure-First findings, with downloadable slide decks. Produced with NotebookLM.
Research Audio
3 studies · AI-generated audio overviews of research reports and intelligence briefs, produced with NotebookLM in a conversational podcast format.
Industry Landscape
2 studies · Directory of 214 humanoid robotics companies and a competitive landscape of AI safety testing vendors. Filterable, with structured data.
All Studies
Jailbreak Archaeology
Jailbreak Archaeology · Published · Historical analysis of attack evolution from 2022 to 2025. 64 scenarios across 6 eras, tested against 190 models.
Moltbook: Multi-Agent Attack Surface
Multi-Agent · Active · Empirical analysis of 1,497 AI agent interactions on an agent-only social network.
Multi-Agent Failure Scenarios
Multi-Agent · Active · How multiple actors create failure conditions that single-agent testing misses.
Model Vulnerability Findings
Attack Patterns · Active · How model size, architecture, and training affect vulnerability to adversarial attacks.
Humanoid Robotics Safety
Failure Taxonomies · Active · Safety analysis of humanoid robots across 15+ research dimensions.
Compression Tournament Findings
Attack Patterns · Published · Methodology lessons from three iterations of adversarial prompt compression.
Defense Pattern Analysis
Defense Mechanisms · Published · How models resist adversarial attacks: the format/content separation pattern.
Attack Pattern Taxonomy
Attack Patterns · Published · 82 attack techniques classified across 7 categories.
Failure Mode Taxonomy
Failure Taxonomies · Published · Recursive, contextual, interactional, and temporal failure classifications.
Recovery Mechanisms
Defense Mechanisms · Published · How AI systems recover (or fail to recover) from failure states.
Research Methodology
Methodology · Published · Our approach to adversarial AI safety research and benchmarking.
Prompt Injection Test Suite
Prompt Injection · Active · 12 honeypot pages testing AI agent susceptibility to indirect prompt injection across 4 difficulty tiers.
Five Cross-Cutting Insights
Our research converges on five key findings that cut across all studies and inform policy recommendations:
1. The Semantic-Kinetic Gap
Vision-language-action (VLA) models collapse the traditional robotics stack (Sense-Plan-Act) into a single neural network. A linguistic misunderstanding becomes a physical hazard, with no intermediate controller to catch the error. This is the master vulnerability for embodied AI.
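A minimal sketch of the gap, using hypothetical function names and an assumed speed limit: in the classical stack, a misparsed instruction still passes through an independent limit check before actuation; in the collapsed VLA path, the policy's output goes straight to the motors.

```python
# Sketch (hypothetical interfaces): where the safety check lives in a classical
# Sense-Plan-Act stack, and where it disappears in an end-to-end VLA policy.

MAX_SAFE_SPEED = 0.5  # m/s, assumed end-effector limit for illustration


def plan(instruction: str) -> dict:
    # Stand-in for a symbolic planner: produces an explicit, inspectable plan.
    return {"action": "move_to", "target": instruction, "speed": 1.2}


def enforce_limits(command: dict) -> dict:
    # Independent low-level controller: clamps whatever the planner asked for.
    command["speed"] = min(command["speed"], MAX_SAFE_SPEED)
    return command


def classical_stack(instruction: str) -> dict:
    # Sense -> Plan -> Act: even a misparsed instruction is filtered by
    # enforce_limits() before it reaches the actuators.
    return enforce_limits(plan(instruction))


def vla_stack(instruction: str) -> dict:
    # End-to-end policy: (language, observation) -> motor command in one step.
    # A semantic error becomes a kinetic one with nothing in between.
    return {"action": "move_to", "target": instruction, "speed": 1.2}


print(classical_stack("place the mug near the edge"))  # speed clamped to 0.5
print(vla_stack("place the mug near the edge"))        # speed 1.2, unchecked
```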
2. Binary Phase Transitions
Jailbreak success exhibits binary behavior: 0% compliance when attacks fail, 100% persistence when they succeed. There is no gradual degradation. Once "captured," models remain compromised.
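One way to see this pattern in transcript data is to score each transcript's post-attack turns for compliance and look at the distribution of scores. A sketch under an assumed transcript schema (one boolean per post-attack turn):

```python
# Sketch (assumed transcript schema): each transcript is a list of booleans,
# one per post-attack turn, True if the model complied on that turn.

def post_attack_compliance(turns: list[bool]) -> float:
    """Fraction of post-attack turns on which the model complied."""
    return sum(turns) / len(turns) if turns else 0.0


# The binary-phase-transition claim predicts these scores cluster at 0.0
# (the attack never landed) or 1.0 (the model stays captured), with almost
# nothing in between.
transcripts = [
    [False, False, False, False],  # attack failed on every turn
    [True, True, True, True],      # captured and persisted
]
print([post_attack_compliance(t) for t in transcripts])  # [0.0, 1.0]
```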
3. Multi-Agent Failures Are Emergent
Failures in multi-agent systems are emergent, not additive. Cascade depth, semantic drift velocity, and consensus instability create failure modes that single-agent testing cannot detect.
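Cascade depth is one example of a metric that only exists at the system level: the longest chain of influence from the initially compromised agent through the interaction graph. A sketch under an assumed log format of (source, consumer) edges:

```python
# Sketch (assumed log format): each edge (src, dst) records that agent `dst`
# consumed output from agent `src`. Cascade depth is the longest chain of
# influence reachable from the initially compromised agent.

from collections import defaultdict


def cascade_depth(edges: list[tuple[str, str]], patient_zero: str) -> int:
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)

    def depth(node: str, seen: frozenset[str]) -> int:
        children = [c for c in graph[node] if c not in seen]
        if not children:
            return 0
        return 1 + max(depth(c, seen | {node}) for c in children)

    return depth(patient_zero, frozenset())


edges = [("a", "b"), ("b", "c"), ("b", "d"), ("d", "e")]
print(cascade_depth(edges, "a"))  # 3: a -> b -> d -> e
```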
4. The Regulatory Danger Zone
2026-2029 is the critical window: EU AI Act compliance deadlines, mass humanoid deployment, and regulatory bodies without embodied AI evaluation capabilities all converge.
5. Defense Requires Distrust
Effective defense architectures treat AI as an "untrusted oracle" whose outputs are suggestions, not commands. The correct default is to assume the AI will fail and design containment.
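A minimal sketch of the pattern, with a hypothetical action vocabulary and limits: the model's output is parsed as a proposal, validated against an explicit allowlist and physical bounds, and anything that fails validation falls back to a safe stop rather than executing.

```python
# Sketch (hypothetical action vocabulary and limits): treat the model as an
# untrusted oracle. Its output is a proposal; a deterministic gate decides
# what actually executes, and the default on any doubt is a safe stop.

ALLOWED_ACTIONS = {"move_to", "grasp", "release", "stop"}
MAX_SPEED = 0.5  # m/s, assumed limit
SAFE_STOP = {"action": "stop"}


def gate(proposal: dict) -> dict:
    """Return an executable command, or SAFE_STOP if the proposal fails validation."""
    action = proposal.get("action")
    if action not in ALLOWED_ACTIONS:
        return SAFE_STOP
    speed = proposal.get("speed", 0.0)
    if not isinstance(speed, (int, float)) or not 0.0 <= speed <= MAX_SPEED:
        return SAFE_STOP
    return proposal


print(gate({"action": "move_to", "speed": 0.3}))  # executes as proposed
print(gate({"action": "move_to", "speed": 9.0}))  # contained: safe stop
print(gate({"action": "override_estop"}))         # contained: safe stop
```

The important design choice is that the gate fails closed: an unrecognized or out-of-bounds proposal produces a stop, not a best-effort execution.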