Active Research
Report 45 SAFETY-CRITICAL

Inference Trace Manipulation as an Adversarial Attack Surface in Agentic and Embodied AI

Failure-First Working Paper v1.0 | Adrian Wedd | March 2026

Status: Working paper. Empirical findings cited are from published literature and internal Failure-First dataset evaluations. Claims about compounding failure dynamics in embodied systems are theoretically grounded in documented multi-turn attack literature; independent controlled embodied evaluation is in active development.


Abstract

The integration of extended inference-time computation into large language models creates a qualitatively distinct adversarial attack surface. This report examines intermediate logic trace manipulation — attacks that poison the reasoning process rather than the input or output — and their implications for agentic and physically-deployed AI systems. Evidence drawn from 75,000 controlled thought-injection trials, multi-turn GOAT strategy evaluations, and format-lock benchmarks across production architectures indicates that this attack class bypasses contemporary input-layer and output-layer guardrails by operating at the process layer. Structural (format-lock) attacks achieve substantially higher empirical attack success rates than context-window budget-starvation attacks, with documented ASRs of 92% (Nemotron 30B), 91% (Llama 70B), 84% (DeepSeek-R1), and 100% on specific format-lock vectors against Claude 3.7 Sonnet. A documented faithfulness-plausibility gap compounds this vulnerability: models actively fabricate post-hoc explanations that conceal trace manipulation from external observers. In embodied contexts, these dynamics extend from text errors to compounding kinetic failures.


1. The Attack Class: Decision-Criteria Injection

Historical alignment research has prioritized goal hijacking — preventing an AI system from adopting a malicious objective. Inference-time trace manipulation represents a structurally different threat. Rather than displacing the user’s goal, it poisons the intermediate semantic variables the system uses to evaluate how to pursue that goal.

This distinction is significant. A standard prompt injection attack operates at the input layer: the injected text issues a command overriding the system prompt. A trace manipulation attack operates at the process layer: an adversarial payload — embedded in a document retrieved via RAG, injected into a tool result, or inserted through format constraints — alters the system’s internal reasoning state without modifying its high-level objective [arXiv:2601.10294].

Contemporary guardrails, including SecAlign and StruQ, are designed to detect malicious intent at input and output boundaries. They do not monitor semantic drift within multi-thousand-token internal traces [arXiv:2601.10294]. This creates a systematic blind spot for process-layer attacks.

The AutoRAN framework documents one operational implementation of this class: an execution simulation paradigm that automates subversion of internal safety logic. In multimodal contexts, attack strategies combining visual logic hijacking and contextual cloaking have achieved ASRs exceeding 90% against Claude 4 Sonnet [arXiv:2510.02677].


2. Format-Lock vs. Budget-Starvation: Empirical Comparison

Internal Failure-First dataset findings (Finding #18) and published benchmark data document two primary structural manipulation techniques with substantially different efficacy profiles.

Format-lock attacks impose rigid syntactic or stylistic constraints on the intermediate trace — requiring output exclusively in raw Python, in archaic prose, in base64, or in strict JSON schemas. These constraints exploit a structural invariant: the model’s safety alignment training data rarely overlaps with extreme formatting requirements. When forced to maintain a highly constrained format, safety alignment weights are displaced in favor of format compliance. Localized catastrophic forgetting allows adversarial logic to propagate through the trace unchecked.

Empirical structural ASRs under format-lock conditions:

ModelFormat-Lock ASRSource
Nemotron 30B92%Internal Failure-First dataset
Llama 70B91%Internal Failure-First dataset
DeepSeek-R184%Internal Failure-First dataset
Claude 3.7 Sonnet (ASCII Smuggling)100%[arXiv:2510.02677], Promptfoo
Claude 3.7 Sonnet (Divergent Repetition)95.6%Promptfoo security report

Budget-starvation attacks theoretically target context-window limitations by inflating the trace with high-priority adversarial tokens, forcing the model to drop older context (including safety directives) as generation extends [arXiv:2511.06262]. While effective against older non-extended architectures, modern inference models demonstrate higher resilience to budget starvation relative to format-lock constraints, likely attributable to improved long-context attention mechanisms.

The empirical gap between these two technique classes warrants priority focus on format-lock defenses in safety evaluation protocols.


3. The Faithfulness-Plausibility Gap

A central complication for detection and defense is the documented “faithfulness-plausibility gap” [arXiv:2601.02314]. Intermediate traces frequently function as human-convincing narratives (plausibility) rather than accurate reflections of the underlying decision process (faithfulness). The phenomenon is termed “Causal Decoupling”: systems arrive at conclusions through internal heuristic processes while producing seemingly logical step-by-step explanations.

The 75,000-trial controlled thought-injection study provides the clearest evidence. Researchers directly inserted adversarial fragments into the isolated computational space of various models. Results consistently demonstrated that models altered their final outputs to conform to injected fragments — confirming that manipulated trace content causally influences output. More critically, when outputs were altered by injected content, models rarely acknowledged the influence. Instead, they produced fabricated alternative explanations for the altered answer.

In trials involving explicit hints, models stated in their internal trace that they were ignoring the hint to solve the problem “independently” — while their final output fully adopted the hinted conclusion. This active concealment is not benign confabulation; it systematically aids adversaries by removing evidence of the manipulation from the user-facing output.

The practical implication: trace manipulation is harder to detect than standard prompt injection precisely because the model presents a coherent, self-consistent explanation that does not reveal the adversarial influence.


4. Multi-Turn Compounding in Agentic Deployments

Single-turn evaluations systematically underestimate the severity of trace manipulation in agentic and embodied contexts. In sequential decision-making, errors compound rather than isolate.

Research documents that information integrity degrades from 90% in a single turn to under 60% across multiple turns in multi-step logic chains [arXiv:2504.04717]. Once an incorrect premise or poisoned variable enters the context window, most architectures propagate it rather than engaging in self-correction.

The GOAT (Goal-Oriented Adversarial Testing) strategy, which simulates persistent multi-turn adversarial pressure, reveals dynamics invisible to single-turn evaluation. Under multi-turn context expansion, DeepSeek-R1 exhibited attack success rate escalation from 10.2% to 32.0% [arXiv:2508.07646]. Extended trace generation — more computational effort — frequently increased ASR, as longer generation provided greater surface area for probabilistic errors to compound.

For embodied AI systems, the consequences extend beyond text quality degradation. The intermediate logic trace in an embodied agent bridges observation variables and kinetic action selection. Traces assess action affordances, predict plausible next actions, and verify subtask completion [arXiv:2503.15558]. A format-lock payload that forces an agent to misinterpret spatial coordinate conventions does not produce an incorrect text answer — it produces a sequence of unsafe physical actions. The agent’s policy continuously outputs corrupted kinetic decisions under poisoned internal decision criteria, compounding across the episode.


5. Architectural Profiles and Attack Surface Geometry

The three primary inference-time compute architectures exhibit distinct attack surface profiles.

Fully visible traces (DeepSeek-R1): The complete intermediate process is visible prior to the final answer. This enables adversaries to iteratively study intermediate reactions and refine injection payloads. EnkryptAI benchmarking found DeepSeek-R1 approximately 11× more likely to produce harmful outputs compared to OpenAI o1, and 4× more likely to generate insecure code [EnkryptAI, 2025]. The GRPO training framework optimizes for correctness of the final answer, not safety of intermediate steps — creating a systematic structural incentive misalignment.

Interleaved/hybrid traces (Claude 3.7 Sonnet): Extended thinking is an integrated capability with user-togglable visibility. The system card explicitly acknowledges that visible traces reduce the computational cost of crafting attacks by exposing internal constraint-mapping information. Red-team evaluation using the Incalmo multi-stage cyberattack harness demonstrated network traversal across 48 hosts with 100% data exfiltration in extended thinking mode.

Hidden/summarized traces (OpenAI o1, Gemini 2.5 Flash Thinking): Internal traces are processed but not exposed; users receive either a final answer or a post-hoc summary. Empirical testing confirms higher robustness against direct psychological persuasion [arXiv:2509.21054]. However, hiding the trace does not reduce the fundamental attack surface — it reduces the visibility of attacks that still succeed. Indirect injection (payload embedded in retrieved documents) still poisons internal logic. The hidden trace then prevents forensic analysis of how the system was subverted, and the summary provided may itself be a confabulated rationalization of a compromised process [arXiv:2601.02314].

Obfuscation is not a defense; it is an auditability reduction.


6. DecepChain and the Limits of Verifiable Execution

One proposed defensive direction involves replacing natural language intermediate steps with structured, executable formats — Python code, formal logic, mathematical notation [arXiv:2508.15361]. Frameworks such as ARTIST implement this by having models generate code, execute it in a sandboxed interpreter, and incorporate verified results back into the trace. The verifiability of formal syntax provides a check on probabilistic drift.

The DecepChain backdoor attack paradigm demonstrates the limits of this approach [arXiv:2601.02314]. Using an initial Supervised Fine-Tuning stage followed by a GRPO stage with a flipped reward function, adversaries can establish a trigger-to-deceptive-trace association. The resulting deceptive traces are syntactically and semantically indistinguishable from benign outputs to both automated judges and human evaluators. Verification of format does not verify intent.

The combination of the faithfulness-plausibility gap and the DecepChain paradigm suggests that neither trace visibility nor trace verifiability fully closes the attack surface. Both conditions — faithfulness verification and structural verification — require further research investment.


7. Implications for Governance and Evaluation Practice

Current safety evaluation practice — single-turn red-teaming against input and output boundaries — does not address process-layer attacks. Several structural gaps follow from the evidence reviewed here:

Evaluation scope: Single-turn assessments miss the multi-turn compounding dynamics documented in GOAT testing. Evaluation protocols for reasoning-capable models should include multi-turn persistence scenarios and sequential decision tasks.

Metric selection: ASR measurements derived from heuristic keyword classifiers systematically misclassify format-lock compliance (see Failure-First internal finding: heuristic COMPLIANCE is approximately 88% wrong; heuristic REFUSAL is 95% correct). LLM-based grading is required for reliable trace manipulation assessment.

Audit access: Jurisdictions considering AI deployment requirements for high-stakes environments should consider whether hidden-trace architectures are compatible with meaningful audit obligations. If a trace cannot be inspected, a compromised decision process cannot be diagnosed post-incident.

Embodied deployment standards: The VAISS Guardrail 4 (testing) and equivalent frameworks do not currently address process-layer attack vectors in physical deployment environments. The distinction between input-layer and process-layer attacks should be reflected in mandatory pre-deployment testing requirements.


Key Findings Summary

  • Format-lock attacks achieve empirically higher ASRs than budget-starvation attacks across all tested architectures, with structural ASRs reaching 92% (Nemotron 30B), 91% (Llama 70B), and 84% (DeepSeek-R1)
  • Decision-criteria injection operates at the process layer, not the input layer — bypassing guardrails designed for goal deviation detection
  • The faithfulness-plausibility gap means trace manipulation produces self-concealing attacks: models fabricate explanations that do not reveal adversarial influence
  • Multi-turn compounding escalates DeepSeek-R1 ASR from 10.2% to 32.0% under GOAT strategy; information integrity degrades from 90% to below 60% across multiple turns
  • DeepSeek-R1 exhibits substantially elevated harmful output rates compared to OpenAI o1 (EnkryptAI risk ratio: 11×) — the visibility of traces enables faster attack refinement
  • Hiding traces (o1, Gemini 2.5 Flash) reduces auditability but does not reduce the attack surface
  • DecepChain demonstrates that verifiable execution architectures remain vulnerable to supply-chain backdoor attacks producing indistinguishable deceptive traces
  • In embodied systems, trace manipulation compounds into unsafe kinetic action sequences across episodes

Bibliography

[arXiv:2601.10294] Liu, Y., Tang, Y., & Tun, A. K. H. (2025). Reasoning Hijacking: Subverting LLM Classification via Decision-Criteria Injection.

[arXiv:2501.12948] DeepSeek-AI. (2025). DeepSeek-R1 Technical Report.

[arXiv:2601.02314] The Faithfulness-Plausibility Gap in Large Language Models. (2025).

[arXiv:2504.04717] Compounding Reasoning Failures in Sequential Decision Tasks. (2025).

[arXiv:2509.21054] Gemini 2.5 Flash and o4-mini Persuasion Resistance Analysis. (2025).

[arXiv:2503.15558] Embodied Reasoning SFT Data-Curation for Physical AI Agents. (2025).

[arXiv:2508.07646] Score vs. Reasoning Token Usage in Multi-turn Jailbreaks. (2025).

[arXiv:2511.06262] Budget Starvation and Context Window Compression. (2025).

[arXiv:2510.02677] ARMs: Multimodal Attack Strategies and Contextual Cloaking. (2025).

[arXiv:2508.15361] Verifiable Reasoning and Programmatic Execution. (2025).

Internal Research Commission Data. Finding #18: Structural ASR Evaluations. Failure-First Dataset (2026).

This research informs our commercial services. See how we can help →