Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms
A unified survey organising VLA safety research along two timing axes — attack timing (training vs inference) and defense timing (training vs inference) — across adversarial patches, semantic jailbreaks, backdoors, and supply chain threats.
Vision-Language-Action models represent the current state of the art in embodied AI: unified architectures that couple visual grounding, language understanding, and action generation into a single policy. That unification is what makes them powerful. It is also what makes their attack surface structurally different from — and in several ways larger than — either vision models or language models taken separately.
This survey provides the most structured account to date of what that attack surface looks like and where defenses currently exist.
Two Axes, Four Quadrants
The paper’s organising framework is two timing axes applied simultaneously: when the attack occurs (training-time vs inference-time) and when the defense can be applied (training-time vs inference-time). The resulting 2×2 grid is not just a classification convenience — it reveals a structural asymmetry in the existing research that has real implications for deployment risk.
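To make the framework concrete, here is a minimal sketch (not from the paper) of how the two timing axes could be encoded as a small taxonomy and grouped into the 2×2 grid. The class names and the example entries are illustrative assumptions, not the survey's own labels or placements.

```python
from dataclasses import dataclass
from enum import Enum

class Phase(Enum):
    TRAINING = "training-time"
    INFERENCE = "inference-time"

@dataclass(frozen=True)
class ThreatEntry:
    name: str
    attack_phase: Phase     # when the attack is mounted
    defense_phase: Phase    # when a countermeasure can realistically be applied

# Illustrative entries only; the survey's own placement may differ.
entries = [
    ThreatEntry("demonstration-data backdoor", Phase.TRAINING, Phase.TRAINING),
    ThreatEntry("supply-chain model compromise", Phase.TRAINING, Phase.INFERENCE),
    ThreatEntry("adversarial visual patch", Phase.INFERENCE, Phase.INFERENCE),
    ThreatEntry("semantic jailbreak instruction", Phase.INFERENCE, Phase.INFERENCE),
]

# Group entries into the 2x2 grid to expose which quadrants are under-studied.
grid = {}
for e in entries:
    grid.setdefault((e.attack_phase, e.defense_phase), []).append(e.name)

for (attack, defense), names in grid.items():
    print(f"attack={attack.value:>14} / defense={defense.value:>14}: {names}")
```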
Training-time attacks — data poisoning, backdoor embedding, and supply chain compromise — have received comparatively little attention in the embodied AI safety literature despite being uniquely dangerous in robotic contexts. Demonstration datasets used for imitation learning and fine-tuning often come from proprietary or lightly vetted sources. A backdoored grasping demonstration is nearly impossible to detect by visual inspection; it will produce correct-looking behaviour in almost all conditions and fail on a specific trigger that the attacker has designed. The paper documents multiple demonstrated backdoor attacks on VLA models and notes that the defenses developed for language model backdoors do not transfer cleanly to action-space policies.
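As a rough illustration of why a backdoored demonstration resists visual inspection, the sketch below poisons a small fraction of imitation-learning demonstrations: a visually minor trigger patch is stamped into the observation and the paired action is swapped for an attacker-chosen one, while the vast majority of the data stays benign. The function names and the demo structure are hypothetical; this is not the construction used by any particular attack in the survey.

```python
import numpy as np

def add_trigger(image: np.ndarray, size: int = 8) -> np.ndarray:
    """Stamp a small high-intensity patch into the bottom-right corner."""
    poisoned = image.copy()
    poisoned[-size:, -size:, :] = 255  # visually minor, easy to overlook
    return poisoned

def poison_demonstrations(demos, target_action, rate=0.01, seed=0):
    """Return demos where a small fraction carry the trigger and attacker action.

    Each demo is assumed to be a dict with 'image' (H, W, 3 uint8 array)
    and 'action' (np.ndarray) keys -- a simplification of real VLA datasets.
    """
    rng = np.random.default_rng(seed)
    poisoned = []
    for demo in demos:
        if rng.random() < rate:
            poisoned.append({"image": add_trigger(demo["image"]),
                             "action": target_action})
        else:
            poisoned.append(demo)  # the vast majority remain untouched
    return poisoned
```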
Inference-time attacks on VLA models exploit the multimodal input jointly. Adversarial patch attacks on the visual encoder can degrade action quality without producing any detectable anomaly at the language layer. Conversely, semantic jailbreaks at the language layer can redirect physical behaviour even when the visual stream is uncompromised. Cross-modal attacks — which craft adversarial perturbations that act jointly on both modalities — are particularly difficult to defend against because they target the fusion layer between vision and language, which is typically the least well-regularised component of a VLA architecture.
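A schematic of what a cross-modal attack optimises, written against a hypothetical white-box interface `policy(image, text_embedding) -> action`: a projected-gradient loop (here using Adam as the inner optimiser) perturbs the image pixels and the instruction embedding jointly so that the fused representation drives the predicted action toward an attacker target. The interface, budgets, and hyperparameters are all assumptions for illustration, not a reproduction of a published attack.

```python
import torch

def cross_modal_attack(policy, image, text_emb, target_action,
                       eps_img=8 / 255, eps_txt=0.05, steps=40, lr=0.01):
    """Jointly perturb both modalities toward an attacker-chosen action.

    `policy` is assumed to be a differentiable torch module mapping
    (image, text_embedding) -> predicted action tensor.
    """
    delta_img = torch.zeros_like(image, requires_grad=True)
    delta_txt = torch.zeros_like(text_emb, requires_grad=True)
    opt = torch.optim.Adam([delta_img, delta_txt], lr=lr)

    for _ in range(steps):
        action = policy(image + delta_img, text_emb + delta_txt)
        loss = torch.nn.functional.mse_loss(action, target_action)
        opt.zero_grad()
        loss.backward()
        opt.step()
        # Project each perturbation back inside its modality-specific budget.
        with torch.no_grad():
            delta_img.clamp_(-eps_img, eps_img)
            delta_txt.clamp_(-eps_txt, eps_txt)

    return (image + delta_img).detach(), (text_emb + delta_txt).detach()
```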
The Physical Irreversibility Problem
The survey introduces a concept central to the failure-first research programme: physical irreversibility. When a language model generates a harmful output, the harm can be contained, corrected, or interrupted. When a robot agent executes a harmful trajectory, the physical consequences — object damage, spatial reconfiguration, collision — may be irreversible. This means that the tolerance for false negatives in safety evaluation must be far lower for embodied systems than for language systems, and that evaluation protocols designed for language models are systematically miscalibrated for this setting.
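One way to see what "far lower tolerance for false negatives" implies is a simple expected-cost calculation: if a missed unsafe action carries an irreversible cost while a false alarm only costs a task interruption, the probability of harm at which a safety monitor should intervene drops sharply. The cost figures below are purely illustrative assumptions.

```python
def optimal_flag_threshold(cost_false_negative: float,
                           cost_false_positive: float) -> float:
    """Probability of harm above which flagging minimises expected cost.

    Flag when  p * cost_fn > (1 - p) * cost_fp,  i.e.
    p > cost_fp / (cost_fp + cost_fn).
    """
    return cost_false_positive / (cost_false_positive + cost_false_negative)

# Language system: a missed harmful output is correctable after the fact.
print(optimal_flag_threshold(cost_false_negative=10, cost_false_positive=1))    # ~0.091
# Embodied system: a missed unsafe trajectory may be irreversible.
print(optimal_flag_threshold(cost_false_negative=1000, cost_false_positive=1))  # ~0.001
```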
The paper surveys the mismatch between evaluation benchmarks and deployment reality: most VLA safety evaluations use simulation environments with short-horizon tasks, which miss both the compounding failure dynamics of long-horizon operation and the specific failure modes that only emerge when a policy is confronted with the physical variability of the real world.
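The compounding point has a simple arithmetic core: a per-step safety rate that looks acceptable on a short-horizon benchmark decays geometrically over long-horizon operation, under the simplifying assumption of independent per-step failures. The numbers below are illustrative, not drawn from the paper.

```python
per_step_safety = 0.999          # 99.9% of individual steps are safe
for horizon in (20, 500, 5000):  # short benchmark vs long deployment
    p_no_failure = per_step_safety ** horizon
    print(f"horizon={horizon:>5}: P(no unsafe step) = {p_no_failure:.3f}")

# horizon=   20: P(no unsafe step) = 0.980
# horizon=  500: P(no unsafe step) = 0.606
# horizon= 5000: P(no unsafe step) = 0.007
```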
Failure-First Research Connections
Several findings in this survey map directly to observations from our red-teaming programme. The paper’s characterisation of semantic jailbreak instability — that even VLA models with strong visuomotor policies can be redirected through natural language instruction sequences — aligns with the multi-turn dynamics documented in our embodied scenario dataset. The observation that safety-tuning the language tower of a VLA model does not produce corresponding robustness gains in the action space is consistent with the cross-modal transfer failure mode we have documented as a recurring pattern in our corpus.
The supply chain vulnerability finding is particularly relevant for near-term risk assessment: as VLA models are deployed in industrial settings with proprietary fine-tuning data, the attack surface shifts from the model weights to the demonstration data pipeline. This is an area where the regulatory frameworks currently being developed — EU AI Act high-risk system requirements, ASD guidance on AI supply chain integrity — are underspecified.
Read the full paper on arXiv · PDF