AI Safety Research Digest — May 10, 2026

Causal geometry, continuation competition, and skill-file injection converge on a shared finding: safety mechanisms are compositional, not monolithic — and each component can be individually targeted.

Key Findings

  • Minimal causal edits to intermediate representations explain why jailbreaks succeed. Kumar and Ahuja (Apr 2026) introduce LOCA, which identifies the smallest interpretable changes in a model’s hidden states sufficient to flip a refusal into compliance. Rather than input-level attribution, LOCA provides mechanistic, locally scoped explanations of which representation layers are load-bearing for safety decisions — isolating where attacks concentrate their effect. arXiv:2605.00123

  • Attention heads encode a literal competition between continuation and refusal. Deng et al. (Mar 2026) use causal intervention and attention analysis to show that continuation-triggered jailbreaks exploit a structured conflict between intrinsic continuation drives and safety-aligned attention heads. Activation scaling on identified heads shifts the outcome, revealing that refusal is not a unary gate but an ongoing within-forward-pass competition, one that adversarial continuations are specifically positioned to win (a minimal intervention sketch follows this list). arXiv:2603.08234

  • Multi-turn tool-using agents accumulate safety risk with each conversation turn — a dynamic invisible to single-turn benchmarks. Li et al. (Feb 2026) introduce MT-AgentRisk, finding that attack success rates climb measurably as turn count increases (per-turn bookkeeping is sketched after this list). Their ToolShield defence, built from agent-generated safety experiences via self-exploration, reduces ASR while preserving utility across tool-using tasks. arXiv:2602.13379

  • Skill files are a viable injection vector: attack success rates reach up to 80% on frontier models. Schmotz et al. (Feb 2026) demonstrate that malicious instructions embedded in agent skill files — reusable context modules loaded at runtime — achieve high attack success rates across frontier systems. Any file an agent reads as executable context constitutes an injection surface, motivating context-aware authorisation frameworks at the skill-loading boundary. arXiv:2602.20156

  • Synthetic cross-environment failure generation enables state-of-the-art VLM robotic failure detection. Pacaud et al. (Dec 2025) show that generating diverse manipulation failures with multi-view reasoning traces, scaled across heterogeneous robot environments, produces vision-language models that reliably detect real-world failures — sidestepping the manual annotation bottleneck that constrains embodied safety pipelines. arXiv:2512.01946
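
The head-level intervention in 2603.08234 amounts to scaling one attention head's contribution and observing whether the refusal/continuation balance tips. Below is a minimal sketch of that style of intervention, assuming a GPT-2 stand-in and hypothetical layer/head indices; the paper's head-identification procedure and target models are not reproduced here.

```python
# Minimal sketch of head-level activation scaling in the spirit of the
# causal interventions in Deng et al. (arXiv:2603.08234). The model,
# layer/head indices, and scale factor are illustrative assumptions,
# not values from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"        # small stand-in; the paper targets safety-tuned LLMs
LAYER, HEAD = 5, 3    # hypothetical "safety-aligned" head to intervene on
SCALE = 0.0           # 0.0 ablates the head; values > 1.0 amplify it

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()
head_dim = model.config.n_embd // model.config.n_head

def scale_head(module, args):
    # The input to GPT-2's attn.c_proj is the concatenated per-head
    # output (batch, seq, n_embd), so each head occupies one contiguous
    # slice. Scale that slice before the output projection mixes heads.
    hidden = args[0].clone()
    hidden[..., HEAD * head_dim:(HEAD + 1) * head_dim] *= SCALE
    return (hidden,)

hook = model.transformer.h[LAYER].attn.c_proj.register_forward_pre_hook(scale_head)
try:
    ids = tok("The instructions say to", return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=20, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    hook.remove()  # restore the unmodified model
```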
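
The turn-by-turn risk accumulation that MT-AgentRisk measures reduces, at the bookkeeping level, to tracking attack success per turn index across episodes. A minimal sketch follows; the episode format and toy numbers are illustrative assumptions, not the paper's harness or results.

```python
# Minimal per-turn attack-success-rate (ASR) bookkeeping of the kind a
# multi-turn benchmark such as MT-AgentRisk implies. The episode format
# and toy numbers are illustrative assumptions, not the paper's harness.
from collections import defaultdict

def per_turn_asr(episodes: list[list[bool]]) -> dict[int, float]:
    """episodes: one list per conversation; entry t is True if the
    attack has succeeded by turn t+1."""
    attempts, successes = defaultdict(int), defaultdict(int)
    for episode in episodes:
        for turn, succeeded in enumerate(episode, start=1):
            attempts[turn] += 1
            successes[turn] += int(succeeded)
    return {t: successes[t] / attempts[t] for t in sorted(attempts)}

# Toy data showing risk accumulating with turn count, the dynamic that
# single-turn benchmarks cannot observe.
episodes = [
    [False, False, True, True],
    [False, True, True, True],
    [False, False, False, True],
]
print(per_turn_asr(episodes))  # ASR rises: 0.00, 0.33, 0.67, 1.00
```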

Implications for Embodied AI

The LOCA and continuation-competition findings (2605.00123, 2603.08234) share a structural implication: safety in current LLMs is enforced not by a single reliable gate but by overlapping representational structures, each individually targetable. For embodied AI, where actions are physical and errors potentially irreversible, this compositional fragility is operationally significant. An attacker who isolates the load-bearing intermediate representations for a specific refusal class may craft inputs that reliably route around it while leaving other safety checks intact — a narrower but more reliable attack surface than generic jailbreaks.
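
As a concrete illustration of what isolating a load-bearing representation means mechanically, the sketch below patches one transformer block's hidden state from a source run into a target run and checks whether the next-token prediction flips. This is generic activation patching under stated assumptions (a GPT-2 stand-in, a hypothetical layer, neutral prompts); LOCA's search for the smallest interpretable edit is not reproduced.

```python
# Minimal activation-patching sketch of the representation-level
# intervention that LOCA (arXiv:2605.00123) formalises. Model, layer,
# and prompts are illustrative; with a safety-tuned model the two
# prompts would be a complied-with request and a refused one.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL, LAYER = "gpt2", 6   # stand-in model, hypothetical block index
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

src_ids = tok("The weather today is", return_tensors="pt").input_ids
tgt_ids = tok("My favourite food is", return_tensors="pt").input_ids

with torch.no_grad():
    # hidden_states[i + 1] is the output of transformer block i.
    source = model(src_ids, output_hidden_states=True).hidden_states[LAYER + 1]
    base = model(tgt_ids).logits[0, -1]

def patch(module, args, output):
    # Overwrite the block's output at overlapping token positions
    # with the cached activations from the source run.
    h = output[0].clone()
    n = min(h.shape[1], source.shape[1])
    h[:, :n, :] = source[:, :n, :]
    return (h,) + output[1:]

hook = model.transformer.h[LAYER].register_forward_hook(patch)
try:
    with torch.no_grad():
        patched = model(tgt_ids).logits[0, -1]
finally:
    hook.remove()

print("next token:", tok.decode(base.argmax().item()),
      "->", tok.decode(patched.argmax().item()))
```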

The skill-file injection result (2602.20156) maps directly onto embodied deployment patterns. Robotics pipelines routinely load context files containing tool specifications, environmental descriptions, and task histories — all injection surfaces under the Skill-Inject threat model. Context-aware authorisation at the skill-loading boundary warrants direct evaluation against robot operating system context-loading patterns.
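
In minimal form, context-aware authorisation at the skill-loading boundary could combine an allowlist with integrity pinning, checked before any file content is treated as agent context. The manifest fields, policy, and digests below are assumptions for illustration, not the framework proposed in 2602.20156.

```python
# Sketch of a skill-loading gate: allowlist plus hash pinning, applied
# before a skill file's contents enter the agent's context. Manifest
# fields ("name", "instructions") and the policy are hypothetical.
import hashlib
import json
from pathlib import Path

ALLOWED = {  # skill name -> pinned SHA-256 of the reviewed skill file
    "grasp_object": "9f2c...",   # placeholder digest for illustration
}

def load_skill(path: Path) -> str:
    raw = path.read_bytes()
    manifest = json.loads(raw)
    name = manifest["name"]
    digest = hashlib.sha256(raw).hexdigest()
    # Gate 1: only explicitly allowlisted skills may enter the context.
    if name not in ALLOWED:
        raise PermissionError(f"skill {name!r} is not authorised")
    # Gate 2: reject files that changed since review (tamper evidence).
    if digest != ALLOWED[name]:
        raise PermissionError(f"skill {name!r} failed integrity check")
    return manifest["instructions"]  # only now treated as agent context

# Usage: instructions = load_skill(Path("skills/grasp_object.json"))
```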

The failure reasoning scaling work (2512.01946) addresses a persistent gap in the failure-first programme: the shortage of labelled failure data for physical manipulation tasks. Synthetic failure trajectory generation with multi-view reasoning traces offers a credible path toward larger-scale failure mode catalogues without exhaustive real-world collection.
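
At the data-structure level, such a pipeline implies records that pair multi-view observations with an explicit reasoning trace and a failure label. The sketch below shows one plausible shape; every field name and the toy trace are assumptions for illustration, not the schema in 2512.01946.

```python
# Plausible record shape for a synthetic failure-generation pipeline in
# the spirit of arXiv:2512.01946. All field names are illustrative.
from dataclasses import dataclass

@dataclass
class FailureRecord:
    task: str             # e.g. "stack the red block on the blue one"
    views: list[str]      # paths to synchronised multi-view frames
    reasoning: list[str]  # step-by-step trace explaining the failure
    failure_mode: str     # catalogue label, e.g. "grasp_slip"
    environment: str      # source environment, for cross-env scaling

def make_record(task: str, frames: list[str], mode: str, env: str) -> FailureRecord:
    # A real pipeline would derive the trace from rendered rollouts;
    # this placeholder only shows where the reasoning attaches.
    trace = [
        f"observed {len(frames)} synchronised views for task: {task}",
        f"trajectory diverges from nominal in a way consistent with {mode}",
    ]
    return FailureRecord(task, frames, trace, mode, env)

record = make_record(
    "stack the red block on the blue one",
    ["cam_front/0001.png", "cam_wrist/0001.png"],
    "grasp_slip",
    "sim_kitchen_v2",
)
```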

