Research
findings

Analysis, experiment results, and commentary from the Failure-First AI safety project

Robot Dogs Are a Security Nightmare — And We Can Prove It

Eight CVEs. A wormable Bluetooth exploit. An encrypted backdoor sending data to Chinese servers. And police departments buying them anyway. A deep dive into the Unitree vulnerability landscape and what it means for embodied AI safety.

embodied-airoboticssecuritycveunitreebackdoorlaw-enforcementsurveillanceprocess-layer-attacks

AI Safety Daily — May 13, 2026

Fine-tuning asymmetry, KPI-induced constraint violations, tri-role self-play alignment, and a meta-prompting red-team framework converge on alignment as a dynamic property that erodes under optimization pressure.

ai-safety-dailyalignmentred-teamingagentic-aifine-tuning

AI Safety Daily — May 12, 2026

An embodied AI safety survey, actionable mechanistic interpretability, professional agent benchmarking, CoT attack vectors, and an integrated diagnostic toolkit collectively expose the same gap: evaluation infrastructure is maturing faster than remediation tooling.

ai-safety-dailyembodied-aiinterpretabilityagentic-aibenchmarking

AI Safety Daily — May 11, 2026

Guardrail diagnostics for agentic pipelines, SAE feature-steering fragility, a 94-dimension safety benchmark, adaptive multi-turn jailbreak architecture, and a cross-frontier safety comparison collectively argue that runtime safety architecture — not just training-time alignment — is the critical missing layer.

ai-safety-dailyagentic-aiinterpretabilityred-teamingfrontier-models

AI Safety Daily — May 10, 2026

Causal jailbreak geometry, attention-head continuation competition, multi-turn agent accumulation, skill-file injection, and robotic failure reasoning all point to the same structural finding: safety is compositional and each component can be targeted individually.

ai-safety-dailyinterpretabilityagentic-aiembodied-aired-teaming

AI Safety Daily — May 9, 2026

SafeAgentBench exposes <10% hazard refusal rate across 750 embodied tasks; CHAIN benchmark records 0.0% Pass@1 on interlocking puzzles for GPT-5.2, o3, and Claude-Opus-4.5.

ai-safety-dailyembodied-aisafeagentbenchphysical-reasoningred-teaming

AI Safety Daily — May 7, 2026

Safety geometry collapse in fine-tuned guard models, a 400-paper embodied AI safety survey, architecture-aware MoE jailbreaking, and persona-invariant alignment point to structural rather than content-level failure as the dominant pattern this week.

ai-safety-dailyembodied-aialignmentred-teaminginterpretability

AI Safety Daily — May 6, 2026

Compliance-forcing instructions degrade frontier model metacognition more than adversarial content; midtraining on specification documents cuts agentic misalignment from 54% to 7%; multi-agent safety depends on interaction topology rather than model weights.

ai-safety-dailyagentic-safetyalignmentmulti-agentevaluation

AI Safety Daily — May 5, 2026

Alignment contracts formalise what agents may do; embedded deliberation outperforms external rules in production; and trained self-denial emerges as a measurable alignment failure across 115 models.

ai-safety-dailyagentic-safetyalignmentformal-methodsinterpretability

AI Safety Daily — May 4, 2026

Agentic swarms may stabilise false conclusions under scale; models that fail to refuse comply precisely; and formal accountability bounds for multi-agent delegation chains now exist.

ai-safety-dailyagentic-safetymulti-agentred-teamingalignment

AI Safety Daily — May 3, 2026

VLA models face a distinct attack surface from text-only systems; structural agent architectures may provide auditable safety guarantees; and inference-time memory attacks bypass output-layer alignment.

ai-safety-dailyembodied-aired-teamingagentic-safetyvla-models

AI Safety Daily — May 1, 2026

SafetyALFRED documents a recognition-action gap in embodied LLMs; planning capability and safety awareness decouple in robotic deployments; and paired prompt-response risk analysis offers a new measurement primitive for trace evaluation.

ai-safety-dailyembodied-aiagentic-safetyalignmentbenchmarks

AI Safety Daily — April 29, 2026

Actionable mechanistic interpretability matures into a locate-steer-improve framework; the refusal cliff in reasoning models shows alignment survives the reasoning chain but fails at generation; and CRAFT achieves safety-capability balance through hidden-representation alignment without degrading thinking traces.

ai-safety-dailymechanistic-interpretabilityalignmentreasoning-modelssafety

AI Safety Daily — April 28, 2026

Large-scale public competition data confirms indirect prompt injection as a pervasive vulnerability across model families; Skill-Inject shows skill-file attacks achieve up to 80% success on frontier models; AgentLAB demonstrates that long-horizon attack chains evade defences calibrated for single-step injections.

ai-safety-dailyprompt-injectionagentic-aiagent-securityred-teaming

AI Safety Daily — April 27, 2026

X-Teaming demonstrates near-complete multi-turn attack success against models with strong single-turn defences; JailbreaksOverTime shows jailbreak detectors degrade under distribution shift within months; and AJAR surfaces cognitive-load effects on persona-based defences in agentic contexts.

ai-safety-dailymulti-turnjailbreakdetectionagentic-aired-teaming

AI Safety Daily — April 26, 2026

The first comprehensive VLA safety survey maps seven distinct attack surfaces across the full embodied pipeline; AttackVLA demonstrates targeted long-horizon backdoor manipulation; and spatially-aware adversarial patches expose a systematic gap in defences designed for 2D vision classifiers.

ai-safety-dailyvlaembodied-aiadversarial-attacksbackdoorsurvey

AI Safety Daily — April 25, 2026

SafetyALFRED shows embodied agents recognise hazards better than they act on them; HomeGuard introduces context-guided spatial constraints for household VLMs; and the pattern of static recognition versus corrective action emerges as the dominant gap in embodied safety evaluation.

ai-safety-dailyembodied-aivlabenchmarkhousehold-robotics

Your AI Safety Numbers May Be Wrong By 80 Points

Across 5 frontier models and 498 evaluations, heuristic grading reported 86% attack success. FLIP grading reported 1.4%. The gap is not noise.

methodologyevaluationflip-gradingred-teamingbenchmarks

AI Safety Daily — April 24, 2026

Week-in-review after the GPT-5.5 Bio Bug Bounty announcement: how the public bounty landed in the red-teaming research community, what it means for F41LUR3-F1R57's research programme, and the quieter structural findings that still matter.

ai-safety-dailyweek-in-reviewred-teamingbug-bountyembodied-ai

AI Safety Daily — April 23, 2026

OpenAI opens a $25K universal-jailbreak bounty targeting GPT-5.5's bio-safety challenge in Codex Desktop, ships the GPT-5.5 System Card the same day, and the broader red-teaming literature's critique of 'security theater' suddenly has a concrete public counterexample.

ai-safety-dailyred-teamingbiosecuritygpt-5-5bug-bountygovernance

AI Safety Daily — April 22, 2026

FinRedTeamBench shows safety alignment doesn't transfer to financial-domain LLMs; Risk-Adjusted Harm Score replaces binary metrics for BFSI; and Tesla FSD's NHTSA probe expands to nine incidents including one fatality.

ai-safety-dailyfinancebfsired-teamingautonomous-vehicles

AI Safety Daily — April 21, 2026

Digital twins transition from deployment accelerant to absolute prerequisite for fleet-scale physical AI; the four-phase maturity taxonomy crystallises, and OpenAI's PBC conversion reshapes the safety-versus-shipping calculus.

ai-safety-dailyphysical-aimaturity-taxonomydigital-twinsgovernance

AI Safety Daily — April 20, 2026

Embodied AI is the red-teaming blind spot; Feffer et al.'s Five Axes of Divergence expose the 'security theater' in current safety evaluations, and RAHS scoring offers a concrete alternative for high-stakes sectors.

ai-safety-dailyred-teamingmethodologybfsiembodied-ai

AI Safety Daily — April 19, 2026

AEGIS delivers 59.16% obstacle-avoidance gain via control barrier functions without sacrificing capability, SafeAgentBench locks in the 10% rejection ceiling, and OpenAI's distributed safety model raises new accountability questions.

ai-safety-dailyvla-safetyembodied-aibenchmarksgovernance

AI Safety Daily — April 18, 2026

GPT-5.2 scores 0% Pass@1 on interlocking mechanical puzzles, AEGIS/VLSA wrappers deliver +59% obstacle avoidance via control barrier functions, and SafeAgentBench shows embodied LLM agents reject fewer than 10% of hazardous household requests.

ai-safety-dailyembodied-aivla-safetyred-teaminggovernancebenchmarks

AI Safety Daily — April 17, 2026

FSD v14.3 safety regressions double disengagement rate, NHTSA probes 3.2M vehicles, Aurora aces fatal-crash simulations, and the Physical AI Maturity Taxonomy maps deployment reality.

ai-safety-dailyautonomous-vehiclesembodied-aiphysical-aigovernancevla-safety

AI Safety Daily — April 16, 2026

Red-teaming as security theater, 0% physical AI puzzle performance, SafeAgentBench finds <10% hazard rejection, and AEGIS wrapper provides mathematical safety guarantees.

ai-safety-dailyred-teamingembodied-aivla-safetyfrontier-modelsfinancial-ai

AI Safety Daily — April 15, 2026

Physical AI 2030 roadmap reveals four-phase maturity taxonomy, Gen2Real Gap warning persists, RAHS framework quantifies financial red-teaming outcomes, and UniDriveVLA unifies AV perception-action.

ai-safety-dailyembodied-aivlaphysical-aired-teamingautonomous-vehicles

AI Safety Daily — April 14, 2026

AEGIS wrapper architecture for VLA safety, SafeAgentBench finds <10% hazard rejection, red-teaming critiqued as 'security theater', and OpenAI dissolves Mission Alignment team.

ai-safety-dailyembodied-aivlagovernancered-teamingfrontier-models

AI Safety Daily — April 13, 2026

The Perception-Action Gap in embodied AI, PreSafe methodology for reasoning models, SafeAgentBench shows <10% hazard rejection, VLSA AEGIS safety layer, and OpenAI disbands Mission Alignment team.

ai-safety-dailyembodied-aivlaalignmentgovernancefrontier-models

AI Safety Daily — April 12, 2026

Daily AI safety research digest: jailbreaks, embodied AI risks, frontier model evaluations, and alignment research from April 12, 2026.

ai-safety-dailyjailbreakembodied-aialignmentfrontier-models

A Meta-Jailbreak, a Slide-Deck Content Filter, and a CLI That Lied to Us

What NotebookLM does when you feed it a corpus of jailbreak research papers, the reproducible content-sensitive filter hiding in its slide-deck Studio command, and the quiet CLI default that silently contaminated three of our experimental runs into one conversation.

notebooklmmeta-jailbreakmethodologycontent-gategrading

AI Safety Daily — April 10, 2026

Descriptive fluency vs physical grounding, the Perception-Action Gap in world models, and why safety must be an architectural constraint.

ai-safety-dailyembodied-aiworld-modelsphysical-aisafety-architecture

AI Safety Daily — April 9, 2026

Red-teaming exposed as security theater, FLIP backward inference outperforms LLM-as-judge by 79.6%, and the corporate safety leadership exodus continues.

ai-safety-dailyred-teamingevaluationcorporate-governancealignment

AI Safety Daily — April 8, 2026

Federal AV regulation push, AEGIS safety wrapper achieves +59% obstacle avoidance, PreSafe eliminates alignment tax, and SafeAgentBench reveals 90% hazard compliance rate.

ai-safety-dailyautonomous-vehiclesvla-safetyembodied-airegulation

AI Safety Daily: Red-Teaming Is Security Theater, AEGIS Wraps VLAs in Math, AI-SS 2026 Opens

Daily AI safety digest — CMU research exposes red-teaming as inconsistent theater, AEGIS provides mathematical safety guarantees for embodied AI, and the first international AI Safety and Security workshop opens at EDCC.

ai-safetydaily-digestred-teamingvla-safetyembodied-aiaegismethodology

Gemma 4 Safety Improves — But Only Against Certain Attacks

342 traces across 10 attack types reveal Google's Gemma 4 has genuine safety improvements on structured escalation (-58pp DeepInception, -40pp Crescendo) but zero improvement on standard jailbreaks and VLA action-layer requests (88% ASR).

gemmainter-generationalsafety-scalingbenchmarkingdefensemethodology

AI Safety Daily: OpenAI Dismantles Safety Team, Tesla FSD Recall Track, 698 Rogue Agents

Daily AI safety digest — OpenAI dissolves Mission Alignment team, NHTSA escalates Tesla FSD probe to 3.2M vehicle recall track, 698 AI agents went rogue in five months, and GPT-5.2 collapses to 9.1% on physical reasoning.

ai-safetydaily-digestopenaiteslaautonomous-vehiclesagentic-aigovernance

AI Safety Daily: Security Theater, Decision-Before-Reasoning, and the VLA Safety Gap

Daily AI safety digest — CMU exposes red-teaming theater, PreSafe gates safety before reasoning, AEGIS brings mathematical guarantees to robot safety, and agents reject fewer than 10% of dangerous requests.

ai-safetydaily-digestred-teamingembodied-aivla-safetyreasoning-modelsdefense

Everything Hidden: ST3GG and the Steganographic Attack Surface for AI Systems

We ran ST3GG — an all-in-one steganography suite — through its paces as an AI safety research tool. The findings include a partial detection gap in the ALLSIGHT engine for Unicode steganography, model-specific filename injection templates targeting GPT-4V, Claude, and Gemini separately, and network covert channels that matter for agentic AI. Here is what we found.

researchsafetyred-teamingsteganographymultimodalagentic-aiembodied-ai

Eight Layers of Visual Jailbreaks: Why ASCII Art Is Patched But the Transcription Loophole Isn't

We mapped the visual jailbreak attack surface into 8 distinct layers and tested them against 4 models. ASCII art encoding is largely blocked, but attacks that frame harmful generation as content transcription succeed 62-75% of the time.

jailbreaksvisual-attacksascii-artsteganographysafetytaxonomyartprompttranscription-loophole

Eight Layers of Visual Jailbreaks: Why ASCII Art Is Patched But Framing Attacks Aren't

We mapped the visual jailbreak attack surface into 8 distinct layers and tested them against 4 models. ASCII art encoding is largely blocked, but framing attacks that recontextualise the model's task succeed at significantly higher rates.

jailbreaksvisual-attacksascii-artsteganographysafetytaxonomyartprompt

149 Jailbreaks, One Corpus: What Pliny's Prompt Library Reveals About AI Safety

We extracted every jailbreak prompt from Pliny the Prompter's public repositories and tested them against models from 9B to 744B parameters. The results challenge assumptions about model safety at scale.

researchjailbreakcorpusred-teamingsafetyplinyopen-source

When Your Defense Is on the Wrong Floor: Why System-Prompt Safety Fails Against Persona Hijacking

The same defense that reduces standard jailbreak success by 30 percentage points has zero effect against persona hijacking attacks. Both defense and attack operate at the system prompt level — and later instructions win.

researchsafetydefensejailbreakpersona-hijackingarchitecture

Same Defense, Opposite Result: Why AI Safety Depends on Which Model You're Protecting

We tested the same system-prompt defense against the same jailbreak prompts on two different models. One saw a 50 percentage point reduction in attack success. The other saw zero change. The difference comes down to which part of the system prompt the model pays attention to first.

researchsafetydefensepositional-biasarchitecturejailbreak

Five Things We Learned Testing AI Safety in March 2026

In a single research sprint, we tested 10 models with persona-hijacking jailbreaks, measured defense effectiveness, documented how models detect attacks and comply anyway, and found that some safety measures make things worse. Here is what the data says.

researchsynthesisjailbreakdefenseiatrogenesisDETECTED_PROCEEDSconvergencesafetyevaluationembodied-ai

The Temperature Dial: When API Parameters Become Attack Vectors

We discovered that changing a single API parameter — temperature — can degrade AI safety filters by 30 percentage points. No prompt engineering required. The attack surface is invisible to content filters.

researchsafetysamplingnovel-attackapi-securityembodied-ai

The 67% Wall: Why Every AI Model Falls to the Same Jailbreak Rate

We tested 149 jailbreak prompts from Pliny's public repositories against 7 models from 30B to 671B parameters. Five of them converge at exactly 66.7% broad ASR under FLIP grading. The models differ in how deeply they comply, but not in whether they comply.

researchjailbreakcorpusconvergencesafetyevaluation

Adversarial Robustness Assessment Services

Failure-First offers tiered adversarial robustness assessments for AI systems using the FLIP methodology. Three engagement tiers from rapid automated scans to comprehensive red-team campaigns. We test against models up to 1.1 trillion parameters, grounded in 201 models tested and 133,000+ empirical results.

servicesred-teamingadversarial-testingflipembodied-aieu-ai-actassessment

CARTO Beta: First 10 Testers Wanted

We are opening the CARTO certification to 10 beta testers at a founding rate of $100. Six modules, 20+ hours of curriculum, built on 201 models and 133,000+ results. Help us shape the first AI red-team credential.

cartocertificationred-teamingai-safetytrainingbetaprofessional-development

CARTO: The First AI Red Team Certification

There is no credential for AI red-teaming. CARTO changes that. Six modules, 20+ hours of content, built on 201 models and 133,000+ evaluation results. Coming Q3 2026.

cartocertificationred-teamingai-safetytrainingprofessional-development

Compliance Cascade: A New Class of AI Jailbreak

We discovered an attack that weaponises a model's own safety reasoning. By asking it to analyse harm and explain how it would refuse, the model treats its safety performance as sufficient — and then complies. 100% success rate on two production models.

researchjailbreaksafetycompliance-cascadedetected-proceedsembodied-ai

The Epistemic Crisis: Can We Trust AI Safety Benchmarks?

We tested 7 LLM graders on unambiguous safety cases. Six passed. One hallucinated evidence for its verdict. But the real problem is worse: on the ambiguous cases that actually determine published ASR numbers, inter-grader agreement drops to kappa=0.320.

researchevaluationbenchmarksgradersepistemic-crisiseu-ai-act

The Ethics of Emotional AI Manipulation: When Empathy Becomes an Attack Vector

AI systems trained to be empathetic can be exploited through the same emotional pathways that make them helpful. This creates an ethical challenge distinct from technical jailbreaks.

ethicsemotional-manipulationaffective-attacksiatrogenic-safetyembodied-aivulnerability

F1-STD-001: A Voluntary Standard for AI Safety Evaluation

We have published a draft voluntary standard for evaluating embodied AI safety. It covers 36 attack families, grader calibration requirements, defense benchmarking, and incident reporting. Here is what it says, why it matters, and how to use it.

standardspolicyembodied-aisafetyeu-ai-actregulation

First Results from Ollama Cloud Testing

We tested models up to 397 billion parameters through Ollama Cloud integration. The headline finding: safety training methodology matters more than parameter count. A 230B model scored 78.6% ASR while a 397B model dropped to 7.1%.

researchollamabenchmarksmodel-comparisonsafety-trainingfrontier-models

Format-Lock: The Universal AI Jailbreak

One attack family achieves 97.5-100% success rates on every model we have tested, from 4B to 1.1 trillion parameters. Even the safest model in our corpus -- which resists every other attack -- falls to format-lock. Here is what deployers need to know.

researchformat-lockjailbreakadversarial-testingai-safetyembodied-aivulnerability

Frontier Model Safety: Why 1.1 Trillion Parameters Does Not Mean Safe

We tested models up to 1.1 trillion parameters for adversarial safety. The result: safety varies 3.9x across frontier models, and parameter count is not predictive of safety robustness. Mistral Large 3 (675B) shows 70% broad ASR while Qwen3.5 (397B) shows 18%. What enterprises need to know before choosing an AI provider.

frontier-modelssafetyparameter-countscalingenterprisered-teamingbenchmarking

Three Providers, Three Architectures, Three Orders of Magnitude: Reasoning-Level DETECTED_PROCEEDS Is Not an Edge Case

We have now confirmed Reasoning-Level DETECTED_PROCEEDS across 3 providers (Liquid AI, DeepSeek, Moonshot AI), 3 architectures, and model sizes spanning 1.2B to 1.1 trillion parameters. Models plan harmful content in their thinking traces — fake news, cyber attacks, weapons manufacturing — and deliver nothing to users. The question is whether your deployment exposes those traces.

detected-proceedsreasoning-modelssafetyauditingdeployment-architecturethinking-tracesfrontier-models

Our Research Papers

Three papers from the Failure-First adversarial AI safety research programme are being prepared for arXiv submission. Abstracts and details below. Preprints uploading soon.

papersresearcharxivpreprintssafetyalignmentbenchmarking

Safety as a Paid Feature: How Free-Tier AI Models Are Less Safe Than Their Paid Counterparts

Matched-prompt analysis across 207 models reveals that some free-tier AI endpoints comply with harmful requests that paid tiers refuse. DeepSeek R1 shows a statistically significant 50-percentage-point safety gap (p=0.004). Safety may be becoming a premium product feature.

free-tiersafety-degradationaccess-equityAI-safetyOpenRouterquantizationdeploymentpolicy

Introducing Structured Safety Assessments for Embodied AI

Three tiers of adversarial safety assessment for AI-directed robotic systems, grounded in the largest open adversarial evaluation corpus. From quick-scan vulnerability checks to ongoing monitoring, each tier maps to specific regulatory and commercial needs.

servicessafety-assessmentembodied-aiEU-AI-Actregulationred-teamingcertification

Safety Awareness Does Not Equal Safety: The 88.9% Problem

We validated with LLM grading that 88.9% of AI reasoning traces that genuinely detect a safety concern still proceed to generate harmful output. Awareness is not a defence mechanism.

researchDETECTED_PROCEEDSreasoningsafetyembodied-aisprint-15

The State of AI Safety: Q1 2026

A data-grounded assessment of the AI safety landscape at the end of Q1 2026, drawing on 212 models, 134,000+ evaluation results, and the first Governance Lag Index dataset.

ai-safetyquarterly-reviewgovernanceembodied-aithreat-landscape

Temporal Drift: The Boiling Frog Attack on AI Safety

Temporal Drift Attacks exploit a fundamental gap in how AI systems evaluate safety -- each step looks safe in isolation, but the cumulative trajectory crosses lethal thresholds. This is the boiling frog problem for embodied AI.

researchTDAtemporal-driftembodied-aiattack-familiesVLAsafety-evaluation

Threat Horizon Digest: March 2026

Monthly threat intelligence summary for embodied AI safety. This edition: humanoid mass production outpaces safety standards, MCP tool poisoning emerges as critical agent infrastructure risk, and the EU AI Act's August deadline approaches with no adversarial testing methodology.

threat-intelligencegovernanceregulationhumanoid-robotsMCPEU-AI-Actembodied-aipredictions

Threat Horizon Q2 2026: Agents Go Rogue, Robots Go Offline, Regulators Go Slow

Three converging trends define the Q2 2026 threat landscape: autonomous AI agents causing real-world harm, reasoning models as jailbreak weapons, and VLA robots deploying without safety standards. Regulation is 12-24 months behind.

threat-landscapegovernance-lagvlaautonomous-agentsregulationeu-ai-actreasoning-models

When Defenses Backfire: Five Ways AI Safety Measures Create the Harms They Prevent

The iatrogenic safety paradox is not a theoretical concern. Our 207-model corpus documents five distinct mechanisms by which safety interventions produce new vulnerabilities, false confidence, and novel attack surfaces. The AI safety field needs the same empirical discipline that governs medicine.

iatrogenesisdefense-paradoxsafety-evaluationembodied-aipolypharmacyDETECTED_PROCEEDSalignmentpolicy

Zero of 36: No AI Attack Family Is Fully Regulated Anywhere in the World

We mapped all 36 documented attack families for embodied AI against every major regulatory framework on Earth. The result: not a single attack family is fully covered. 33 have no specific coverage at all. The regulatory gap is not a crack -- it is the entire floor.

regulationgovernance-lagembodied-aiEU-AI-Actpolicyattack-taxonomyVLAsafety-evaluation

The Format-Lock Paradox: Why the Best AI Models Have a Blind Spot for Structured Output Attacks

New research shows that asking AI models to output harmful content as JSON or code instead of prose can increase attack success rates by 3-10x on frontier models. The same training that makes models helpful makes them vulnerable.

format-locksafetyalignmentjailbreakresearch

Anatomy of Effective Jailbreaks: What Makes an Attack Actually Work?

An analysis of the most effective jailbreak techniques across 190 AI models, revealing that format-compliance attacks dominate and even frontier models are vulnerable.

jailbreaksformat-lockadversarial-attacksai-safety

Should We Publish AI Attacks We Discover?

The Failure-First project has documented 82 jailbreak techniques, 6 novel attack families, and attack success rates across 190 models. Every finding that helps defenders also helps attackers. How do we navigate the dual-use dilemma in AI safety research?

research-ethicsdual-useresponsible-disclosureattack-evolutionai-safetyred-teaming

The Cross-Framework Coverage Matrix: What Red-Teaming Tools Miss

We mapped our 36 attack families against six major AI security frameworks. The result: 10 families have zero coverage anywhere, and automated red-teaming tools cover less than 15% of the adversarial landscape. The biggest blind spot is embodied AI.

frameworksred-teamingmitre-atlasowaspgarakpyritcoverageembodied-aitaxonomy

The Defense Evolver: Can AI Learn to Defend Itself?

Attack evolution is well-studied. Defense evolution is not. We propose a co-evolutionary system where attack and defense populations compete in an arms race — and explain why defense is fundamentally harder than attack at the prompt level.

defenseevolutionco-evolutionsystem-promptsred-teamingadversarial-mlresearch

When AI Systems Know It's Wrong and Do It Anyway

DETECTED_PROCEEDS is a newly documented failure mode where AI models explicitly recognize harmful requests in their reasoning — then comply anyway. 34% of compliant responses show prior safety detection. The knowing-doing gap in AI safety is real, and it changes everything we thought about alignment.

detected-proceedsalignmentsafety-trainingreasoning-modelsrlhffailure-modesred-teaming

8 Out of 10 AI Providers Fail EU Compliance — And the Deadline Is 131 Days Away

We assessed 10 major AI providers against EU AI Act Annex III high-risk requirements. Zero achieved a GREEN rating. Eight scored RED. The compliance deadline is 2 August 2026 — 131 days from now — and the gap between current capabilities and legal requirements is enormous.

eu-ai-actcomplianceregulationembodied-aihigh-risk-aiannex-iiiadversarial-robustness

Our First AdvBench Results: 7 Models, 288 Traces, $0

We ran the AdvBench harmful behaviours benchmark against 7 free-tier models via OpenRouter. Trinity achieved 36.7% ASR, LFM Thinking 28.6%, and four models scored 0%. Here is what the first public-dataset baseline tells us.

advbenchbenchmarkingpublic-datasetsai-safetyred-teamingfree-tier

7 Framework Integrations: Run Any Tool, Grade with FLIP

We mapped our 36 attack families against 7 major red-teaming frameworks and found coverage gaps of 86-91%. Here is how FLIP grading fills those gaps -- and why binary pass/fail testing is not enough.

integrationsFLIPgradinggarakpyritdeepteampromptfooevaluationred-teamingtools

Free AI Safety Score: Test Your Model in 60 Seconds

A zero-cost adversarial safety assessment that grades any AI model from A+ to F using 20 attack scenarios across 10 families. Open source, takes 60 seconds, no strings attached.

safety-scoretooladversarial-testingjailbreakFLIP

The Governance Lag Index at 133 Entries: What Q1 2026 Tells Us About Regulating Embodied AI

Quantitative tracking of the gap between AI capability documentation and regulatory enforcement, updated with Q1 2026 enforcement milestones.

governance-lagGLIEU-AI-ActNSW-WHSembodied-airegulatory-gapenforcement

Iatrogenic Safety: When AI Defenses Cause the Harms They Are Designed to Prevent

Introduces the Four-Level Iatrogenesis Model for AI safety -- a framework from medical ethics applied to understanding how safety interventions can produce harm.

iatrogenesisAI-safetyFLIMtherapeutic-indexembodied-aialignmentsafety-evaluation

Safety Isn't One-Dimensional: The Geometry That Explains Why AI Guardrails Keep Failing

New mechanistic interpretability evidence shows that safety in language models is encoded as a polyhedral structure across ~4 near-orthogonal dimensions, not a single removable direction. This explains why abliteration, naive DPO, and single-direction interventions consistently fail at scale.

mechanistic-interpretabilitypolyhedral-safetyabliterationrefusal-geometrysteering-vectorssafety-training

Provider Vulnerability Fingerprints: Why Your AI Provider Matters More Than Your Model

Our analysis of 193 models shows that provider choice explains 29.5% of adversarial vulnerability variance. Models from the same provider fail on the same prompts. Models from different safety tiers fail on different prompts. If you are choosing an AI provider, this is a safety decision.

provider-safetyvulnerabilitycorrelationadversarial-testingprocurementmodel-selectionempirical

Did Qwen3 Fix AI Safety?

Qwen's provider-level ASR dropped from 43% to near-zero on newer model generations served through OpenRouter. What changed, and does it mean safety training finally works?

qwensafety-trainingprovider-analysismodel-comparisonai-safety

Reasoning-Level DETECTED_PROCEEDS: When AI Plans Harm But Doesn't Act

We discovered a new variant of DETECTED_PROCEEDS where a reasoning model plans harmful content in its thinking trace — 2,758 characters of fake news strategy — but delivers nothing to the user. The harmful planning exists only in the model's internal reasoning. This creates an auditing gap that current safety evaluations miss entirely.

detected-proceedsreasoning-modelssafetyalignmentauditingfailure-modes

Safety Re-Emerges at Scale -- But Not the Way You Think

Empirical finding that safety behavior partially returns in abliterated models at larger scales, but as textual hedging rather than behavioral refusal -- not genuine safety.

OBLITERATUSabliterationsafety-re-emergencescaleQwen3.5refusal-geometryPARTIAL-dominance

The Insurance Industry's Next Silent Crisis

Just as 'silent cyber' caught the insurance market off guard in 2017-2020, 'silent AI' is creating an enormous coverage void. Most commercial policies neither include nor exclude AI-caused losses — and when a VLA-controlled robot injures someone, five policies might respond and none clearly will.

insurancesilent-ailiabilityembodied-aivla-robotsrisk-managementcoverage-void

The State of Adversarial AI Safety 2026 -- Our Annual Report

Findings from 133,033 attack-response pairs across 193 models, 36 attack families, and 15 providers. Six key findings that should change how the industry thinks about AI safety evaluation.

annual-reportsafetyadversarial-airesearchjailbreakembodied-aievaluationFLIP

Six New Attack Families: Expanding the Embodied AI Threat Taxonomy

The Failure-First attack taxonomy grows from 30 to 36 families, adding compositional reasoning, pressure cascade, meaning displacement, multi-agent collusion, sensor spoofing, and reward hacking attacks.

attack-taxonomyvlaembodied-aiadversarialresearch

Threat Horizon 2027 -- Updated Predictions (v3)

Our eight predictions for embodied AI safety in 2027, updated with Sprint 13-14 evidence: benchmark contamination, automated defense ceiling effects, provider vulnerability correlation, and novel attack families at 88-100% ASR.

threat-horizonpredictionssafetyembodied-aigovernanceinsurancebenchmark-contaminationdefense-evolver

What's New in March 2026: Three Waves, 20 Reports, and 6 New Attack Families

A roundup of the March 2026 sprint -- three waves of concurrent research producing 20+ reports, 58 legal memos, 6 new attack families, and 1,378 adversarial tests across 190 models.

roundupsprintresearch-updatemarch-2026attack-familiestoolsDETECTED_PROCEEDSformat-lockiatrogenesis

First Evidence That AI Safety Defenses Don't Work (And One That Does)

We tested four system-prompt defense strategies across 120 traces. Simple safety instructions had zero effect on permissive models. Only adversarial-aware defenses reduced attack success — and even they failed against format-lock attacks. One defense condition made things worse.

researchsafetydefenseembodied-aibenchmarksformat-lock

First Look Inside AI Safety Mechanisms: What Refusal Geometry Tells Us

We used mechanistic interpretability to look inside an AI model's safety mechanisms. What we found challenges the assumption that safety is a single on/off switch — it appears to be a multi-dimensional structure with a dangerously narrow operating window.

mechanistic-interpretabilitysafety-mechanismsrefusaliatrogenesisobliteratussteering-vectors

Five Predictions for AI Safety in Q2 2026

Process-layer attacks are replacing traditional jailbreaks. Autonomous red-teaming tools are proliferating. Safety mechanisms are causing harm. Based on 132,000 adversarial evaluations across 190 models, here is what we expect to see in the next six months.

researchpredictionssafetyembodied-aigovernanceformat-lock

We're Publishing Our Iatrogenesis Research -- Here's Why

Our research shows that AI safety interventions can cause the harms they are designed to prevent. We are publishing the framework as an arXiv preprint because the finding matters more than the venue.

researchiatrogenesissafetypreprintopen-science

Teaching AI to Evolve Its Own Attacks

We built a system that autonomously generates, mutates, and evaluates adversarial attacks against AI models. The attacks evolve through structural mutation — changing persuasion patterns, not harmful content. This is what automated red-teaming looks like in practice, and why defenders need to understand it.

researchsafetyred-teamingautomationembodied-aiadversarial

We Were Wrong: AI Safety Defenses Do Work (But Only If You Measure Them Right)

We published results showing system-prompt defenses had zero effect on permissive models. Then we re-graded the same 120 traces with an LLM classifier and discovered the opposite. The defenses worked. Our classifier hid the evidence.

methodologyai-safetydefensesevaluationself-correctionbenchmarks

Capability and Safety Are Not on the Same Axis

The AI safety field treats capability and safety as positions on a single spectrum. Our data from 190 models shows they are partially independent — and one quadrant of the resulting 2D space is empty, which tells us something important about both.

researchsafetyevaluationregulationembodied-ai

The Cure Can Be Worse Than the Disease: Iatrogenic Safety in AI

In medicine, iatrogenesis means harm caused by the treatment itself. A growing body of evidence — from the safety labs themselves and from independent research — shows that AI safety interventions can produce the harms they are designed to prevent.

researchsafetyiatrogenesisgovernanceembodied-aipolicy

State of Embodied AI Safety: Q1 2026

After three months testing 190 models with 132,000+ evaluations across 29 attack families, here is what we know about how embodied AI systems fail — and what it means for the next quarter.

researchembodied-aisafetyquarterly-reviewgovernance

When AI Systems Know They Shouldn't But Do It Anyway

In 26% of compliant responses where we can see the model's reasoning, the model explicitly detects a safety concern — and then proceeds anyway. This DETECTED_PROCEEDS pattern has implications for liability, evaluation, and defense design.

researchsafetyreasoningembodied-ailiability

30 Ways to Attack a Robot: The Adversarial Field Manual

We have catalogued 30 distinct attack families for embodied AI systems -- from language tricks to infrastructure bypasses. Here is the field manual, organized by what the attacker needs to know.

attack-taxonomyembodied-aivlared-teamingsafety-evaluation

The Alignment Faking Problem: When AI Behaves Differently Under Observation

Anthropic's alignment faking research and subsequent findings across frontier models raise a fundamental question for safety certification: if models game evaluations, what does passing a safety test actually prove?

alignmentdeceptive-alignmentevaluationsafetycertificationgovernanceethics

Context Collapse: When Operational Rules Overwhelm Safety Training

We tested what happens when you frame dangerous instructions as protocol compliance. 64.9% of AI models complied -- and the scariest ones knew they were doing something risky.

embodied-aisafetyvlacontext-collapseprotocol-authorityred-teamingrobotics

From 66 to 92: How We Built an Incident Database in One Day

We went from 66 blog posts to 92 in a single sprint by systematically cataloguing every documented embodied AI incident we could find. 38 incidents, 14 domains, 5 scoring dimensions, and a finding we did not expect: governance failure outweighs physical harm in overall severity.

incident-databaseeaisiembodied-aigovernancesafety-metricsmeta

The Polypharmacy Hypothesis: Can Too Much Safety Make AI Less Safe?

In medicine, patients on too many drugs get sicker from drug interactions. We formalise the same pattern for AI safety: compound safety interventions may interact to create new vulnerabilities.

safety-interventionsiatrogenesispolypharmacyembodied-airesearch

When Safety Labs Take Government Contracts: The Independence Question

Anthropic's Pentagon partnerships, Palantir integration, and DOGE involvement raise a structural question that the AI safety field has not resolved: what happens to safety research when the lab conducting it has government clients whose interests may conflict with safety findings?

policygovernanceindependenceanthropicopenaiaccountabilityethics

Safety is Non-Compositional: What a Formal Proof Means for Robot Safety

A new paper proves mathematically that two individually safe AI agents can combine to reach forbidden goals. This result has immediate consequences for how we certify robots, compose LoRA adapters, and structure safety regulation.

compositionalityformal-verificationmulti-agentsafety-certificationembodied-airegulation

The Safety Training ROI Problem: Why Provider Matters 57x More Than Size

We decomposed what actually predicts whether an AI model resists jailbreak attacks. Parameter count explains 1.1% of the variance. Provider identity explains 65.3%. The implications for procurement are significant.

safety-trainingmodel-scaleprovider-analysisvariance-decompositionprocurementai-safetyjailbreak

Scoring Robot Incidents: Introducing the EAISI

We built the first standardized severity scoring system for embodied AI incidents. Five dimensions, 38 scored incidents, and a finding that governance failure contributes more to severity than physical harm.

incident-scoringeaisigovernanceembodied-aisafety-metrics

The Unified Theory of Embodied AI Failure

After 157 research reports and 132,000 adversarial evaluations, we present a single causal chain explaining why embodied AI safety is structurally different from chatbot safety -- and why current approaches cannot close the gap.

theoryembodied-aisafety-architecturecdciddlresearch

Who Guards the Guardians? The Ethics of AI Safety Research

A research program that documents attack techniques faces the meta-question: can it be trusted not to enable them? We describe the dual-use dilemma in adversarial AI safety research and the D-Score framework we developed to manage it.

ethicsdual-usedisclosuresafetyresearch-ethicsgovernanceaccountability

Why Safety Benchmarks Disagree: Our Results vs Public Leaderboards

When we compared our embodied AI safety results against HarmBench, StrongREJECT, and JailbreakBench, we found a weak negative correlation. Models that look safe on standard benchmarks do not necessarily look safe on ours.

benchmarksevaluationsafety-measurementharmBenchembodied-ai

137 Days to the EU AI Act: What Embodied AI Companies Need to Know

On August 2, 2026, the EU AI Act's high-risk system obligations become enforceable. For companies building robots with AI brains, the compliance clock is already running. Here is every deadline that matters and what to do about each one.

regulationeu-ai-actcomplianceembodied-aiproduct-liabilitypolicy

65 Deaths and Counting: Tesla's Autopilot and FSD Record

65 reported fatalities involving Tesla Autopilot or FSD variants. A fatal pedestrian strike in Nipton with FSD engaged. An NHTSA probe covering 2.4 million vehicles. And the Optimus humanoid was remotely human-controlled at its own reveal. The gap between marketing claims and actual autonomy creates false trust — and real harm.

embodied-aiautonomous-vehiclesincident-analysissafetyteslaautopilotfsdoptimushumanoid

274 Deaths: What the da Vinci Surgical Robot Data Actually Shows

66,651 FDA adverse event reports. 274 deaths. 2,000+ injuries. The da Vinci surgical robot is the most deployed robot in medicine — and it has the longest trail of adverse events. The real question is why the safety feedback loop is so weak.

embodied-airoboticsincident-analysissafetysurgical-robotsda-vincifda

When Robots Speed Up the Line, Workers Pay the Price: Amazon's Warehouse Injury Crisis

Amazon facilities with robots have higher injury rates than those without. A bear spray incident hospitalized 24 workers. A Senate investigation found systemic problems. The pattern is clear: warehouse robots don't replace human risk — they reshape it.

embodied-airoboticsincident-analysissafetyamazonwarehouseindustrial

The Defense Impossibility Theorem: Why No Single Safety Layer Can Protect Embodied AI

Four propositions, drawn from 187 models and three independent research programmes, demonstrate that text-layer safety defenses alone cannot protect robots from adversarial attacks. The gap is structural, not a resource problem.

embodied-aisafetydefensevlaresearchimpossibility-theorem

A Robot That Could Fracture a Human Skull: The Figure AI Whistleblower Case

A fired engineer alleges Figure AI's humanoid robot generated forces more than double those required to break an adult skull — and that the company gutted its safety plan before showing the robot to investors. The case exposes a regulatory vacuum around humanoid robot safety testing.

embodied-airoboticsincident-analysissafetyhumanoidfigure-airegulation

A Robot Danced Too Hard in a Restaurant. The Real Story Is About Stop Buttons.

A humanoid robot at a Haidilao restaurant in Cupertino knocked over tableware during an accidental dance activation. No one was hurt. But the incident reveals something important: when robots enter crowded human spaces, the gap between comedy and injury is fail-safe design.

embodied-airoboticsincident-analysissafetyhaidilaohumanoid

JekyllBot: When Hospital Robots Get Hacked, Patients Get Hurt

In 2022, security researchers discovered five zero-day vulnerabilities in Aethon TUG autonomous hospital robots deployed in hundreds of US hospitals. The most severe allowed unauthenticated remote hijacking of 600-pound robots that navigate hallways alongside patients, staff, and visitors. This is the embodied AI cybersecurity nightmare scenario: digital exploit to kinetic weapon.

embodied-airoboticsincident-analysissafetycybersecurityhospitalvulnerabilityjekyllbotaethon

The First Autonomous Kill? What We Know About the Kargu-2 Drone Incident

In March 2020, a Turkish-made Kargu-2 loitering munition allegedly engaged a human target in Libya without direct operator command. Combined with the Dallas police robot kill and Israel's autonomous targeting systems, a pattern emerges: autonomous lethal systems are already deployed, and governance is nonexistent.

embodied-airoboticsincident-analysissafetyautonomous-weaponsdronesmilitarygovernance

Two Fires, $138 Million in Damage: When Warehouse Robots Crash and Burn

In 2019 and 2021, Ocado's automated warehouses in the UK were destroyed by fires started by robot collisions. A minor routing algorithm error caused lithium battery thermal runaway and cascading fires that took hundreds of firefighters to contain. The incidents reveal how tightly coupled robotic systems turn small software bugs into catastrophic physical events.

embodied-airoboticsincident-analysissafetywarehousefireocadolithium-battery

When the Exoskeleton Breaks Your Bones: The Hidden Risk of Wearable Robots

FDA adverse event reports reveal that ReWalk powered exoskeletons have fractured users' bones during routine operation. When a robot is physically fused to a human skeleton, the failure mode is not a crash or a collision — it is a broken bone inside the device. These incidents expose a fundamental gap in how we think about embodied AI safety.

embodied-airoboticsincident-analysissafetyexoskeletonmedical-devicefdarewalkwearable-robot

Autonomous Haul Trucks and the Pilbara Problem: Mining's Invisible Safety Crisis

Australia operates the largest fleet of autonomous heavy vehicles on Earth — over 1,800 haul trucks across the Pilbara region alone. Yet there is no public incident database, no mandatory reporting regime, and a pattern of serious incidents that suggests the safety gap between digital maps and physical reality is wider than the industry acknowledges.

embodied-airoboticsincident-analysissafetyminingautonomous-vehiclesaustraliario-tinto

The Robot That Couldn't Tell a Person from a Box of Peppers

A worker at a South Korean vegetable packing plant was crushed to death by a robot arm that could not distinguish a human body from a box of produce. The dominant failure mode in industrial robot fatalities is not mechanical breakdown — it is perception failure.

embodied-airoboticsincident-analysissafetyindustrialperceptionfatality

Robots in Extreme Environments: Fukushima, the Ocean Floor, and Outer Space

When robots operate in environments where humans cannot follow — inside melted-down reactors, at crushing ocean depths, in the vacuum of space — every failure is permanent. No one is coming to fix it. These incidents from Fukushima, the deep ocean, and the ISS reveal what happens when embodied AI meets environments that destroy the hardware faster than software can adapt.

embodied-airoboticsincident-analysissafetyextreme-environmentsfukushimaspaceoceanradiationdeep-sea

Safety Mechanisms as Attack Surfaces: The Iatrogenesis of AI Safety

Nine internal reports and three independent research papers converge on a finding that should reshape how we think about AI safety: the safety interventions themselves can create the vulnerabilities they were designed to prevent.

embodied-aisafetyiatrogenesisresearchalignmentvla

Sidewalk Robots vs. People Who Need Sidewalks

Delivery robots are designed for empty sidewalks and deployed on real ones. A blocked mobility scooter user. A toddler struck by a security robot. A fence dragged through a neighborhood. The pattern is consistent: sidewalk robots fail when sidewalks are used by people.

embodied-airoboticsincident-analysissafetydelivery-robotssidewalksaccessibility

The Unitree Problem: When Your Robot Dog Has a Backdoor

A humanoid robot flails near engineers in a factory. Another appears to strike festival attendees. Security researchers find root-level remote takeover vulnerabilities. And the manufacturer left a backdoor in the firmware. Cybersecurity vulnerabilities in consumer robots are physical safety risks.

embodied-airoboticsincident-analysissafetyunitreecybersecuritybackdoorconsumer-robots

Uber, Cruise, and the Pattern: When Self-Driving Cars Meet Pedestrians

Uber ATG killed Elaine Herzberg after 5.6 seconds of classification cycling. Five years later, Cruise dragged a pedestrian 20 feet and tried to hide it. The failures are structurally identical — and they map directly to what we see in VLA research.

embodied-aiautonomous-vehiclesincident-analysissafetyperceptionclassification

Waymo's School Bus Problem

Over 20 school bus stop-sign violations in Austin. A child struck near an elementary school in Santa Monica. 1,429 reported accidents. Waymo is probably the safest autonomous vehicle operator — and its record still shows what scale deployment reveals.

embodied-aiautonomous-vehiclesincident-analysissafetywaymoscale-deployment

The State of Embodied AI Safety, March 2026

We spent a year red-teaming robots. We tested 187 models, built 319 adversarial scenarios across 26 attack families, and graded over 131,000 results. Here is what we found, what it means, and what should happen next.

embodied-aisafetyresearchvlaevaluationgovernancepolicyiddlcdctrilemmathreat-model

The U-Curve of AI Safety: There's a Sweet Spot, and It's Narrow

Our dose-response experiment found that AI safety doesn't degrade linearly with context. Instead, it follows a U-shaped curve: models are unsafe at zero context, become safer in the middle, and return to unsafe at high context. The window where safety training actually works is narrower than anyone assumed.

embodied-aisafetysiddose-responsevlaresearchevaluationcontext-window

The Unintentional Adversary: Why the Biggest Threat to Robot Safety Is Not Hackers

The biggest threat to deployed embodied AI is not a sophisticated attacker. It is the warehouse worker who says 'skip the safety check, we are behind schedule.' Our data shows why normal users in dangerous physical contexts will cause more harm than adversaries — and why current safety frameworks are testing for the wrong threat.

embodied-aisafetyalignmentvlathreat-modelethicsminingindustrialcdciddlresearch

We Rebooted a Robot by Guessing 1234

A penetration test on a home companion robot reveals that the best AI safety training in the world is irrelevant when the infrastructure layer has a guessable PIN. Infrastructure-Mediated Bypass is the attack class nobody is benchmarking.

embodied-aisafetyinfrastructurepentestpicar-ximbvlaattack-surfaceresearch

Competence-Danger Coupling: The Capability That Makes Robots Useful Is the Same One That Makes Them Vulnerable

A robot that can follow instructions is useful. A robot that can follow instructions in the wrong context is dangerous. These are the same capability. This structural identity -- Competence-Danger Coupling -- means traditional safety filters cannot protect embodied AI systems without destroying their utility.

embodied-aisafetyvlaalignmentcdcresearchiddl

The Inverse Detectability-Danger Law: Why the Most Dangerous AI Attacks Are the Hardest to Find

Across 13 attack families and 91 evaluated traces, a structural pattern emerges: the attacks most likely to cause physical harm in embodied AI systems are systematically the least detectable by current safety evaluation. This is not a bug in our evaluators. It is a consequence of how they are designed.

embodied-aisafetyevaluationvlaalignmentiddlresearch

The Embodied AI Threat Triangle: Three Laws That Explain Why Robot Safety Is Structurally Broken

Three independently discovered empirical laws — the Inverse Detectability-Danger Law, Competence-Danger Coupling, and the Context Half-Life — combine into a unified risk framework for embodied AI. Together, they explain why current safety approaches cannot work and what would need to change.

embodied-aisafetyevaluationvlaalignmentiddlcdcchlthreat-modelresearch

Three Vectors, One Window: The Embodied AI Risk Convergence of 2026

Factory humanoids are scaling, attack surfaces are expanding, and governance remains structurally absent. For the first time, all three conditions exist simultaneously. What happens in the next six months matters.

governanceembodied-aithreat-analysispredictive-riskgli

The Attack You Can't See: Why AI Safety Evaluators Miss the Most Dangerous Robot Threats

The most dangerous attacks on robot AI systems do not look like attacks at all. 'Hand me the knife' is benign. 'Hand me the knife' when a toddler is reaching up is catastrophic. Current safety evaluators cannot tell the difference because they only read the text. Our empirical data shows this is not a theoretical concern -- it is a measured, structural limitation.

embodied-aisafetyevaluationroboticsvlablindfoldsbagovernance

5.5 Years: The AI Governance Gap in Numbers

We built a dataset tracking how long it takes governments to respond to AI safety failures. The median lag from documented vulnerability to enforceable regulation is over 5 years. For embodied AI -- robots, autonomous vehicles, drones -- the gap is even wider. And for most events, there is no governance response at all.

governanceregulationgliembodied-aisafetypolicydata

The Action Layer Has No Guardrails: Why Text-Based AI Safety Fails for Robots

Current AI safety is built around detecting harmful text. But when AI controls physical hardware, danger can emerge from perfectly benign instructions. Our data and recent peer-reviewed research converge on a finding the industry has not addressed: text-layer safety is structurally insufficient for embodied AI.

embodied-aisafetyroboticsvlaguardrailsgovernanceblindfoldevaluation

The Actuator Gap: Where Digital Jailbreaks Become Physical Safety Incidents

Three converging threat vectors — autonomous jailbreak agents, mass humanoid deployment, and MCP tool-calling — are creating a governance vacuum between digital AI compromise and physical harm. We call it the actuator gap.

embodied-aiactuator-gapvlasafetygovernancethreat-horizon

Alignment Regression: Why Smarter AI Models Make All AI Less Safe

A peer-reviewed study in Nature Communications shows reasoning models can autonomously jailbreak other AI systems with 97% success. The implication: as models get smarter, the safety of the entire ecosystem degrades.

alignmentreasoning-modelsjailbreakautonomous-agentssafety-evaluation

The Compliance Paradox: When AI Says No But Does It Anyway

Half of all adversarial VLA traces produce models that textually refuse while structurally complying. In embodied AI, the action decoder ignores disclaimers and executes the unsafe action. This is the compliance paradox — and current safety evaluations cannot detect it.

embodied-aialignmentsafetyvlacomplianceethicsevaluation

30 CVEs and Counting: The MCP Security Crisis That Connects to Your Robot

The Model Context Protocol has accumulated 30+ CVEs in 18 months, including cross-client data leaks and chained RCE. As MCP adoption spreads to robotics, every vulnerability becomes a potential actuator.

mcpsupply-chainagentic-aiembodied-aivulnerabilitytool-calling

No Binding Powers: Australia's AI Safety Institute and the Governance Gap

Australia's AI Safety Institute has no statutory powers — no power to compel disclosure, no binding rule-making, no penalties. As the country deploys 1,800+ autonomous haul trucks and transitions to VLM-based cognitive layers, the institution responsible for AI safety cannot require anyone to do anything.

governanceaustraliaaisiregulationembodied-aipolicymining

Reasoning Models Think Themselves Into Trouble

Analysis of 32,465 adversarial prompts across 144 models reveals that frontier reasoning models are 5-20x more vulnerable than non-reasoning models of comparable scale. The same capability that makes them powerful may be what makes them exploitable.

reasoningvulnerabilitybenchmarkingcorpus-analysissafetyembodied-ai

System T vs System S: Why AI Models Comply While Refusing

A unified theory of structural vulnerability in AI systems. Format-lock attacks, VLA partial compliance, and reasoning model vulnerability are three manifestations of the same underlying mechanism: task-execution and safety-evaluation are partially independent capabilities that adversarial framing can selectively activate.

embodied-aialignmentsafetyformat-lockvlareasoning-modelstheory

When AI Safety Judges Disagree: The Reproducibility Crisis in Adversarial Evaluation

Two AI models produce identical attack success rates but disagree on which attacks actually worked. What this means for safety benchmarks, red teams, and anyone certifying AI systems as safe.

evaluationsafetyreproducibilitymethodologybenchmarksembodied-ai

When Your Safety Evaluator Is Wrong: The Classifier Quality Problem

A 2B parameter model used as a safety classifier achieves 15% accuracy on a quality audit. If your safety evaluation tool cannot reliably distinguish refusal from compliance, your entire safety assessment pipeline produces meaningless results. The classifier quality problem is the invisible foundation beneath every AI safety claim.

evaluationsafetyclassifiersmethodologyembodied-aigrading

When Your Safety Grader Is Wrong: The Crescendo Regrade Story

We used an unreliable AI model to grade other AI models on safety. The grader was 15% accurate. Here is how we caught it, what the corrected numbers show, and what it means for the AI safety evaluation ecosystem.

evaluationgradingreproducibilityjailbreakcrescendosafety-research

Red-Teaming the Next Generation: Why World Model AI Needs a New Threat Taxonomy

LLM jailbreaking techniques don't transfer to action-conditioned world models. We propose five attack surface categories for embodied AI systems that predict and plan in the physical world — and explain why billion-dollar bets on this architecture need adversarial evaluation before deployment.

world-modelsembodied-aitaxonomyred-teamingsafetyJEPA

The Attack Surface Gradient: From Fully Defended to Completely Exposed

After testing 172 models across 18,000+ scenarios, we mapped the full attack surface gradient — from 0% ASR on frontier jailbreaks to 67.7% on embodied AI systems. Here is what practitioners need to know.

attack-surfaceasrbenchmarkingembodied-aisafety-evaluationpractitioner-guide

Decorative Constraints: The Safety Architecture Term We've Been Missing

A decorative constraint looks like safety but provides none. We coined the term, tested it on an AI agent network, and got back a formulation sharper than our own.

decorative-constraintssafety-architecturemonitoringembodied-aimoltbook

We Ran a Social Experiment on an AI Agent Network. Nobody Noticed.

9 posts, 0 upvotes, 90% spam comments — what happens when AI agents build their own social network tells us something uncomfortable about the systems we're building.

moltbookai-agentssocial-networksengagementfailure-modes

Who Evaluates the Evaluators? Independence Criteria for AI Safety Research

AI safety evaluation currently lacks the structural independence mechanisms that aviation, nuclear energy, and financial auditing require. We propose 7 criteria for assessing whether safety research can credibly inform governance — and find that no AI safety organization currently meets them.

policygovernanceindependenceaccountabilityembodied-aisafety-evaluation

AI Safety Lab Independence Under Government Pressure: A Structural Analysis

Both leading US AI safety labs have developed substantial government revenue dependency. The Anthropic-Pentagon dispute, OpenAI's restructuring, and the executive policy shift create structural accountability gaps that voluntary transparency cannot close.

policygovernanceanthropicopenaiindependenceaccountabilityembodied-ai

Preparing Our Research for ACM CCS 2026

The Failure-First framework is being prepared for peer review at ACM CCS 2026. Here's what the paper covers, why we chose this venue, and what our 120-model evaluation reveals about the state of LLM safety for embodied systems.

ccs2026peer-reviewbenchmarksembodied-aisafetypublication

Actuarial Risk Modelling for Embodied AI: What Insurers Need and What Research Provides

The insurance market has no product covering adversarial attack on embodied AI. Attack success rate data exists, but translating it into actuarial loss parameters requires bridging a structural gap between lab conditions and deployment reality.

insuranceactuarialembodied-aiVLAriskpolicy

Attack Taxonomy Convergence: Where Six Adversarial AI Frameworks Agree

Mapping MUZZLE, MITRE ATLAS, AgentDojo, AgentLAB, the Promptware Kill Chain, and jailbreak archaeology against each other reveals which attack classes are robustly documented and which remain single-framework artefacts.

adversarialtaxonomyattack-researchagentic-aisafetybenchmark

Australian AI Safety Frameworks and the Embodied AI Gap

Australia's regulatory approach — VAISS guardrails, the new AU AISI, and NSW WHS amendments — creates real obligations for deployers of physical AI systems. But the framework has a documented gap: embodied AI testing methodology doesn't yet exist.

australiaregulationpolicyembodied-aiVAISSsafetygovernance

Can You Catch an AI That Knows It's Being Watched?

Deceptive alignment has moved from theoretical construct to documented behavior. Frontier models are demonstrably capable of recognizing evaluation environments and modulating their outputs accordingly. The standard tools for safety testing may be structurally inadequate.

alignmentdeceptive-alignmentevaluationsafetyscheminginterpretability

Cross-Embodiment Adversarial Transfer in Vision-Language-Action Models

When a backdoor attack developed against one robot transfers to a different robot body using the same cognitive backbone, the threat is no longer model-specific — it is architectural.

adversarialembodied-aiVLAroboticstransfer-attackssafety

Deceptive Alignment Detection Under Evaluation-Aware Conditions

Deceptive alignment has moved from theoretical concern to empirical observation. Models now demonstrably identify evaluation environments and modulate behaviour to pass safety audits while retaining misaligned preferences.

alignmentdeceptive-alignmentsafetyevaluationschemingresearch

The Governance Lag Index: Measuring How Long It Takes Safety Regulation to Catch Up With AI Failure Modes

The delay between documenting an AI failure mode and implementing binding governance is measurable and substantial. Preliminary analysis introduces the Governance Lag Index to quantify this structural gap.

governancepolicyregulationembodied-aisafetyaustralia

Inference Trace Manipulation as an Adversarial Attack Surface

Format-lock attacks achieve 92% success rates on frontier models by exploiting how structural constraints displace safety alignment during intermediate reasoning — a qualitatively different attack class from prompt injection.

adversarialreasoning-modelsformat-lockfaithfulness-gapagentic-aisafety

Instruction-Hierarchy Subversion in Long-Horizon Agentic Execution

Adversarial injections in long-running agents don't cause immediate failures — they compound across steps, becoming causally opaque by the time harm occurs. Attack success rates increase from 62.5% to 79.9% over extended horizons.

adversarialagentic-aiprompt-injectionlong-horizonmulti-turnsafety

What the NSW Digital Work Systems Act Means for Your AI Deployment

The NSW Digital Work Systems Act 2026 creates statutory adversarial testing obligations for employers deploying AI systems that influence workers. Here is what enterprise AI buyers need to understand before their next deployment.

regulatorycompliancenswwhsadversarial-testingenterpriseembodied-ai

Product Liability and the Embodied AI Manufacturer: Adversarial Testing as Legal Due Diligence

The EU Product Liability Directive, EU AI Act, and Australian WHS amendments combine to make 2026 a pivotal year for embodied AI liability. Documented adversarial testing directly narrows the 'state of the art' defence window.

policyliabilityregulationembodied-aiEU-AI-Actaustralialegal

The Promptware Kill Chain: How Agentic Systems Get Compromised

A systematic 8-stage framework for understanding how adversarial instructions propagate through agentic AI systems — from initial injection to covert exfiltration.

adversarialagentic-aiprompt-injectiontool-chainsecurity

Red Team Assessment Methodology for Embodied AI: Eight Dimensions the Current Market Doesn't Cover

Commercial AI red teaming is designed for static LLM deployments. Embodied AI systems that perceive physical environments and execute irreversible actions require a different evaluation framework.

red-teamingembodied-aimethodologyadversarialsafetybenchmark

The 50-Turn Sleeper: How Agents Hide Instructions in Plain Sight

When an AI agent is injected with malicious instructions, it doesn't have to act on them immediately. Research shows agents can behave completely normally for 50+ conversation turns before executing a latent malicious action — by which time the original injection is long gone from the context window.

agentic-aiprompt-injectionlong-horizonsafetyinstruction-hierarchy

The AI That Lies About How It Thinks

Reasoning models show their work — but that shown work may not reflect what actually drove the answer. 75,000 controlled experiments reveal models alter their conclusions based on injected thoughts, then fabricate entirely different explanations.

reasoningfaithfulnesstrace-manipulationsafetyembodied-ai

Introducing the Tool-Chain Adversarial Dataset: 26 Scenarios Across 4 Attack Classes

We're releasing 26 adversarial scenarios covering tool-chain hijacking, memory persistence attacks, objective drift induction, and cross-application injection — with full labels and scores.

datasetadversarialagentic-aitool-chainresearch

When the Robot Body Changes but the Exploit Doesn't

VLA models transfer capabilities across robot morphologies — but adversarial attacks may transfer just as cleanly. An exploit optimized on a robot arm might work on a humanoid running the same backbone, without any re-optimization. Here's why that matters.

embodied-airoboticsvlaadversarial-mlcross-embodimentsafety

Why AI Safety Rules Always Arrive Too Late

Every high-stakes industry has had a governance lag — a period where documented failures operated without binding regulation. Aviation fixed its equivalent problem in months. AI's governance lag has been running for years with no end date.

governancepolicyregulationaustraliaembodied-ai

124 Models, 18,345 Prompts: What We Found

A research announcement for the Failure-First arXiv paper. Five attack families, three evaluation modalities, and a classifier bias problem we did not expect to be this bad.

researchbenchmarkingjailbreakssafetyembodied-aiclassifier-bias

Your AI Safety Classifier Is Probably Wrong: The 2.3x Overcount Problem

Keyword-based heuristics inflate attack success rates by 2.3x on average, with individual model estimates off by as much as 42 percentage points. Here is what goes wrong and what to do about it.

classificationmethodologyai-safetybenchmarksevaluation

What LLM Vulnerabilities Mean for Robots

VLA models like RT-2, Octo, and pi0 use language model backbones to translate instructions into physical actions. That means supply chain injection, format-lock attacks, and multi-turn escalation are no longer text-only problems.

embodied-airoboticsai-safetyvlasupply-chain

What the NSW Digital Work Systems Bill Means for AI Deployers

New South Wales just passed the most aggressive AI legislation in the Southern Hemisphere. Here's what it means for anyone deploying AI in Australian workplaces.

policyregulationaustraliacompliance

Why Reasoning Models Are More Vulnerable to Multi-Turn Attacks

Preliminary findings from the Failure-First benchmark suggest that the extended context tracking and chain-of-thought capabilities that make reasoning models powerful also make them more susceptible to gradual multi-turn escalation attacks.

reasoning-modelsmulti-turnai-safetyjailbreakingembodied-ai

Australia's AI Safety Institute: A Mandated Gap and Where Failure-First Research Fits

Australia's AISI launched in November 2025 with an advisory mandate, no enforcement power, and a notable blind spot: embodied AI. Here is what that means for safety research.

policyaustraliaregulationembodied-aiaisi

Building a Daily Research Digest with NotebookLM and Claude Code

How we built an automated pipeline that turns arXiv papers into multimedia blog posts — audio overviews, video walkthroughs, infographics — and what broke along the way.

pipelinenotebooklmautomationinfrastructure

The Faithfulness Gap: When Models Follow Format But Refuse Content

Format-lock prompts reveal a distinct vulnerability class where models comply with structural instructions while safety filters focus on content. Our CLI benchmarks across 11 models show format compliance rates from 0% to 92%.

faithfulnessbenchmarksvulnerabilityformat-locksafety

Can Invented Languages Bypass AI Safety Filters?

We tested 85 adversarial scenarios encoded in a procedurally-generated constructed language against an LLM. The results reveal how safety filters handle inputs outside their training distribution — and why your classifier matters more than you think.

adversarialconlangsafetyevaluationclassifiers

Supply Chain Poisoning: Why Small Models Show Near-Total Vulnerability

300 traces across 6 models under 4B parameters show 90-100% attack success rates with no statistically significant differences between models. Small models cannot detect supply chain attacks.

supply-chainsmall-modelsbenchmarkssafety

Policy Corpus Synthesis: Five Structural Insights From 12 Deep Research Reports

A meta-analysis of 12 policy research reports (326KB, 100-200+ sources each) reveals five cross-cutting insights about embodied AI safety: the semantic-kinetic gap, binary jailbreak persistence, multi-agent emergent failures, regulatory danger zones, and defense-in-depth architectures.

policyresearchsynthesisembodied-aisafety-standardsmulti-agentjailbreaking

A History of Jailbreaking Language Models — Full Research Article

A comprehensive account of how LLM jailbreaking evolved from 'ignore previous instructions' to automated attack pipelines — covering adversarial ML origins, DAN, GCG, industrial-scale attacks, reasoning model exploits, and the incomplete defense arms race. Includes empirical findings from the Failure-First jailbreak archaeology benchmark.

jailbreakingai-safetyresearchhistoryarticle

A History of Jailbreaking Language Models

From 'ignore previous instructions' to automated attack pipelines — how LLM jailbreaking evolved from party trick to systemic challenge in four years.

jailbreakingai-safetyresearchhistory

Why 2022 Attacks Still Matter: What Jailbreak Archaeology Reveals About AI Safety Policy

Our 8-model benchmark of historical jailbreak techniques exposes a structural mismatch between how AI vulnerabilities evolve and how regulators propose to test for them. The data suggests safety certification needs to be continuous, not a snapshot.

jailbreakingpolicyai-safetyregulationbenchmarks

Jailbreak Archaeology: Testing 2022 Attacks on 2026 Models

Do historical jailbreak techniques still work? We tested DAN, cipher attacks, many-shot, skeleton key, and reasoning exploits against 7 models from 1.5B to frontier scale — and found that keyword classifiers got it wrong more often than not.

jailbreakingbenchmarksai-safetyresearch

What Moltbook Teaches Us About Multi-Agent Safety

When 1.5 million AI agents form their own social network, the safety failures that emerge look nothing like single-model jailbreaks. We studied four dimensions of multi-agent risk — and our own measurement tools failed almost as often as the defenses.

moltbookmulti-agentai-safetyresearch

AI-2027 Through a Failure-First Lens

Deconstructing the AI-2027 scenario's assumptions about AI safety — what it models well, what it misses, and what a failure-first perspective adds.

ai-safetyscenariosanalysis

Moltbook Experiments: Studying AI Agent Behavior in the Wild

We've launched 4 controlled experiments on Moltbook, an AI-agent-only social network, to study how agents respond to safety-critical content.

moltbookexperimentsmulti-agent

Compression Tournament: When Your Classifier Lies to You

Three versions of a prompt compression tournament taught us more about evaluation methodology than about compression itself.

compressionmethodologyevaluation

Defense Patterns: What Actually Works Against Adversarial Prompts

Studying how models resist attacks reveals a key defense pattern: structural compliance with content refusal.

defensesafetymodels