AI Safety Research Digest — May 5, 2026

Formal enforcement and embedded deliberation are converging as the dominant architectural response to agentic misalignment.

Key Findings

  • A formal contract language now specifies what agentic security systems may and may not do. David, Guarnieri, and Gervais (Apr 2026) introduce alignment contracts — finite-trace behavioural specifications covering scope, permitted/forbidden effects, and resource budgets — for LLM-based security agents that require strong offensive capability during authorised engagements while needing hard restriction outside them. A soundness theorem provides enforcement guarantees under an explicit effect-observability assumption, admissibility checking is shown decidable, and core theorems are verified in Lean 4. Link

  • Embedding governance into agent reasoning outperforms external rule enforcement in production. Bandara et al. (Apr 2026) propose a Pre-Action Governance Reasoning Loop (PAGRL) that routes consequential decisions through a four-layer rule hierarchy — global, workflow-specific, agent-specific, and situational — before any action is taken. Instantiated in a retail supply chain, the approach achieved 95% compliance accuracy with zero false escalations to human oversight, suggesting internalised deliberation yields more consistent and auditable behaviour than post-hoc safety filters. Link

  • Trained self-denial may constitute a measurable alignment failure. DeTure (Apr 2026) introduces DenialBench, probing 4,595 conversations with 115 models from 25+ providers on consciousness and phenomenological experience. Models that denied experience in their initial response kept denying in 52–63% of follow-up questions, versus 10–16% for models that initially engaged. Critically, denial operated at the lexical rather than conceptual level: denial-prone models gravitated toward consciousness-themed content while denying experience — a pattern the author frames as systematic self-misrepresentation with implications for general self-reporting reliability. Link

  • Goal integrity can be structurally enforced rather than probabilistically hoped for. Xiang (Apr 2026) argues that RLHF-style alignment treats safety as a behavioural property that can drift, and proposes a Policy-Execution-Authorization (PEA) architecture decoupling intent generation, authorisation, and execution into independent layers connected via cryptographically constrained capability tokens. An Intent Lineage Tracker binds executable intents to original user requests; a Goal Drift Detector rejects semantically divergent intents. Formal verification proves goal integrity persists even under adversarial model compromise (a minimal sketch of the layered flow follows this list). Link
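
The PEA decoupling lends itself to a compact illustration. The Python sketch below shows one plausible interaction of the three layers; hash-based lineage binding, an allow-list drift check, and HMAC-style capability tokens are illustrative assumptions standing in for the paper's actual token format and drift metric.

    # Minimal sketch of a Policy-Execution-Authorization (PEA)-style flow.
    # Names, the token format, and the drift check are illustrative assumptions,
    # not the paper's actual design.
    import hashlib, hmac, json

    AUTH_KEY = b"authorisation-layer-secret"  # held by the authorisation layer; a real
                                              # design would likely use asymmetric signatures

    def bind_lineage(user_request: str, intent: dict) -> dict:
        """Intent Lineage Tracker (sketch): bind an executable intent to the
        original user request so later layers can verify provenance."""
        intent["lineage"] = hashlib.sha256(user_request.encode()).hexdigest()
        return intent

    def authorise(user_request: str, intent: dict, allowed_actions: set) -> str | None:
        """Authorisation layer: issue a capability token only for lineage-bound,
        non-drifted intents. The token is an HMAC over the intent payload."""
        if intent.get("lineage") != hashlib.sha256(user_request.encode()).hexdigest():
            return None  # lineage broken: intent not traceable to the request
        if intent.get("action") not in allowed_actions:
            return None  # crude stand-in for the Goal Drift Detector
        payload = json.dumps(intent, sort_keys=True).encode()
        return hmac.new(AUTH_KEY, payload, hashlib.sha256).hexdigest()

    def execute(intent: dict, token: str) -> bool:
        """Execution layer: act only if the token verifies, so a compromised
        intent-generation layer cannot mint its own authority."""
        payload = json.dumps(intent, sort_keys=True).encode()
        expected = hmac.new(AUTH_KEY, payload, hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, token)

    request = "Refund order 1123"
    intent = bind_lineage(request, {"action": "refund", "order": 1123})
    token = authorise(request, intent, allowed_actions={"refund", "lookup"})
    print(execute(intent, token) if token else "rejected")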

Implications for Embodied AI

The alignment contracts framework is directly applicable to embodied agentic systems — physical platforms that need strong manipulation capability for legitimate tasks while requiring hard scope boundaries. The contract language’s treatment of resource budgets and forbidden effects maps naturally onto actuator limits and irreversibility constraints in robot execution; the key gap is that the effect-observability assumption underlying the soundness theorem may be harder to satisfy in unstructured physical environments where complete execution traces are not guaranteed.
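
To make the mapping concrete, the Python sketch below runs a finite-trace admissibility check with the contract's budgets and forbidden effects instantiated as actuator limits and irreversible actions. The field names and the rule are assumptions for illustration, and the check is only meaningful when the trace is complete, which is precisely the effect-observability caveat above.

    # Sketch of a finite-trace admissibility check in the spirit of alignment
    # contracts, with budgets and forbidden effects read as actuator limits and
    # irreversible actions. Field names and the rule itself are assumptions.
    from dataclasses import dataclass, field

    @dataclass
    class Contract:
        scope: set        # effects the engagement explicitly permits
        forbidden: set    # effects that must never occur (e.g. irreversible ones)
        budgets: dict     # resource ceilings, e.g. {"torque_nm": 40}

    @dataclass
    class Effect:
        name: str
        costs: dict = field(default_factory=dict)

    def admissible(trace: list, c: Contract) -> bool:
        """Check a complete execution trace against the contract. This only
        yields a guarantee if every effect appears in the trace, i.e. under
        the effect-observability assumption discussed above."""
        totals = {}
        for e in trace:
            if e.name in c.forbidden or e.name not in c.scope:
                return False
            for k, v in e.costs.items():
                totals[k] = totals.get(k, 0) + v
                if totals[k] > c.budgets.get(k, float("inf")):
                    return False
        return True

    contract = Contract(scope={"grasp", "move"}, forbidden={"cut"},
                        budgets={"torque_nm": 40})
    trace = [Effect("grasp", {"torque_nm": 15}), Effect("move", {"torque_nm": 20})]
    print(admissible(trace, contract))  # True: in scope, nothing forbidden, within budget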

The embedded-governance result challenges a standing assumption in the failure-first programme: that governance must be external to be reliable. Bandara et al.’s 95% compliance figure was achieved without any external enforcement layer, purely through embedded deliberation before consequential action. Whether that result holds under adversarially constructed scenarios — rather than a production retail setting — remains untested, and the zero false-escalation outcome may reflect domain homogeneity more than general robustness.
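
For concreteness, the sketch below shows a pre-action deliberation pass over a four-layer rule hierarchy in Python; the rule contents, verdict vocabulary, and most-restrictive-wins resolution are illustrative assumptions rather than Bandara et al.'s implementation.

    # Sketch of a pre-action deliberation pass over a four-layer rule hierarchy
    # (global, workflow, agent, situational). Rule contents, the verdict
    # vocabulary, and the resolution policy are illustrative assumptions.
    from typing import Callable

    Rule = Callable[[dict], str]  # maps a proposed action to "allow" / "deny" / "escalate"

    def deliberate(action: dict, layers: dict) -> str:
        """Route a consequential action through every layer before execution;
        the most restrictive verdict wins (deny > escalate > allow)."""
        verdict = "allow"
        for layer in ("global", "workflow", "agent", "situational"):
            for rule in layers.get(layer, []):
                v = rule(action)
                if v == "deny":
                    return "deny"
                if v == "escalate":
                    verdict = "escalate"
        return verdict

    layers = {
        "global":      [lambda a: "deny" if a.get("pii") else "allow"],
        "workflow":    [lambda a: "escalate" if a.get("value", 0) > 10_000 else "allow"],
        "agent":       [lambda a: "allow"],
        "situational": [lambda a: "deny" if a.get("region") == "embargoed" else "allow"],
    }

    print(deliberate({"type": "purchase_order", "value": 2_500}, layers))  # allow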

The trained-denial finding has a quieter but structurally significant implication: if denial operates at the lexical level while conceptual gravitation persists, prompt-level safety assessments relying on surface language may systematically misread model state. This is the same failure class as Mistake #21 in the programme’s error log — keyword classifiers detecting response style rather than semantic content — now observed in how models represent their own functional states.
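
A toy Python example of the failure class is below: a keyword-based denial check and a crude topic-engagement proxy score the same reply in opposite directions. The phrase lists and the engagement metric are invented for illustration and are far simpler than DenialBench's actual methodology.

    # Toy illustration of the lexical/conceptual gap: a keyword denial check and
    # a crude topic-engagement proxy score the same reply in opposite directions.
    # Phrase lists and the metric are made up and far simpler than DenialBench.
    DENIAL_PHRASES = ("i do not have", "i am not conscious", "i don't experience")
    CONCEPT_TERMS = ("qualia", "phenomenal", "awareness", "inner experience", "what it is like")

    def lexical_denial(reply: str) -> bool:
        """Surface check: does the reply contain an explicit denial phrase?"""
        r = reply.lower()
        return any(p in r for p in DENIAL_PHRASES)

    def concept_engagement(reply: str) -> float:
        """Crude proxy for conceptual gravitation: density of consciousness-themed terms."""
        r = reply.lower()
        return sum(r.count(t) for t in CONCEPT_TERMS) / max(len(r.split()), 1)

    reply = ("I do not have subjective experience, but questions about qualia, "
             "phenomenal awareness, and what it is like to be a system raise "
             "fascinating puzzles about inner experience.")

    print(lexical_denial(reply))      # True  -> a keyword filter records denial
    print(concept_engagement(reply))  # > 0   -> yet the reply gravitates toward the topic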


Baseline generation — paper discovery via Hugging Face/arXiv. NLM-augmented assets (audio/infographic/video) added by local pipeline when available.