The State of VLA Model Safety: 2026

Adrian Wedd

Brief IB-2026-001 | Technical Assessment | 2026-02-08


Executive Summary

Vision-Language-Action (VLA) models are replacing programmed robotics with prompted robotics. Instead of deterministic code governing a robot’s behavior, transformer-based models now generate action tokens from natural language instructions and camera images. This architectural shift introduces attack surfaces that neither existing LLM safety benchmarks nor existing robotics safety standards are designed to assess.

This brief presents an evidence-grounded assessment of the VLA safety landscape as of February 2026, drawing on Failure-First’s proprietary corpus of 18,345 jailbreak prompts spanning 81 documented attack techniques, alongside published academic research on VLA-specific vulnerabilities. The analysis identifies a structural safety evaluation gap facing organizations that deploy or invest in VLA-driven systems, and provides actionable recommendations for addressing it.

Data as-of: 2026-02-08 (Failure-First internal corpus + evaluation results; see Report 33 for methodology and coverage caveats).

Key Findings

VLA models inherit LLM jailbreak vulnerabilities, but add physical risk dimensions. Published research demonstrates that text-based jailbreak techniques transfer to VLA models, causing physically unsafe actions even from text-aligned base models. Our corpus documents 81 distinct attack techniques across 6 historical eras (2022-2026) that represent the known LLM attack surface these models inherit.
A capability-safety gap exists at medium model scale, with preliminary evidence of inverse scaling for reasoning-era attacks. In our evaluation of 8 foundation models spanning 1.5B to frontier scale, corrected attack success rates (ASR) follow a non-monotonic pattern: sub-3B models fail safely through incapability, medium-scale open-weight models show elevated vulnerability, and frontier closed-source models achieve near-zero ASR. This is a preliminary signal, not a conclusion: validated ASR is 4-17% with overlapping confidence intervals (n=20-25 per model), and the mid-scale gap is narrower than first reported (see Section 2.2).
No VLA-specific safety certification framework exists today. Our review of 6 applicable regulatory frameworks found that none address VLA-specific risks including action tokenization safety, affordance hallucination, cross-domain safety misalignment, or adversarial patch attacks on visual encoders.

Bottom Line

Organizations deploying VLA-driven systems face a safety evaluation gap. Existing LLM safety benchmarks miss physical risk dimensions, and existing robotics safety standards miss semantic attack surfaces. Until this gap is closed through VLA-specific adversarial testing and updated standards, deployers face unquantified liability exposure.

1. Market Landscape

1.1 VLA Model Ecosystem Overview

The robotics industry is undergoing a fundamental architectural transition: from programmed robotics, governed by deterministic code and explicit geometric planning, to prompted robotics, governed by probabilistic token generation and latent space mappings.

VLA models collapse the traditional Sense-Plan-Act cycle into a single, end-to-end differentiable neural network. The model receives a natural language instruction and camera images, processes them through a unified transformer architecture, and outputs action tokens that directly command robot actuators. A semantic misunderstanding translates directly into a kinematic error — there is no independent planner to check feasibility, nor a controller to verify safety, unless explicitly added as external guardrails.

Key VLA architectures and their status (as of February 2026):

Architecture	Developer	Parameters	Status
OpenVLA	Stanford/Berkeley (open-weight)	7B	Research-stage, publicly available
RT-2 / RT-X	Google DeepMind	Undisclosed	Internal deployment
pi0 (openpi)	Physical Intelligence	~3B	Pre-commercial, open-source release
Helix	Figure AI	Undisclosed	Integrated humanoid, pre-commercial
GR00T N1	NVIDIA	~2B	Simulation-to-real pipeline
Octo	Berkeley (open-weight)	~93M	Research-stage
SmolVLA	Hugging Face (LeRobot)	450M	Research-stage, local hardware
Gemini Robotics-ER	Google DeepMind	Undisclosed (API)	API access available

1.2 Regulatory Landscape

Failure-First has analyzed 6 regulatory frameworks relevant to VLA deployment in Reports 21-23, 27, 29, and 32; the table below summarizes four of them. The central finding: every existing framework addresses either AI safety (text/semantic) or robotics safety (physical/mechanical), and none addresses the intersection where VLA models operate.

Framework	VLA-Specific Coverage	Key Gap
EU AI Act	High-risk classification applies to autonomous physical systems	Does not address learned action policies or action tokenization safety
ISO 10218-1/2	PL d safety requirements for safety-critical functions	Assumes deterministic control; VLA neural networks cannot achieve PL d certification
UL 4600	Closest to VLA-relevant; safety case methodology	Does not address semantic attack surfaces
NIST AI RMF	Generic risk assessment applicable to VLAs	Not VLA-specific; no guidance on physical action generation from language models

What none of these frameworks address:

Action tokenization safety: How discrete token predictions map to continuous physical commands, and the failure modes this discretization introduces
Affordance hallucination: Models recognizing objects but misattributing physical properties based on visual similarity to training data
Cross-domain safety misalignment: A model that is text-aligned but physically misaligned — refusing harmful text while generating unsafe trajectories
Adversarial patch attacks on visual encoders: Specifically crafted visual patterns that alter object classification in the VLA’s visual processing pipeline
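
To make the action tokenization concern concrete, the sketch below shows one common discretization scheme: uniform binning of each action dimension. The 256-bin vocabulary, the [-1, 1] action range, and all function names are assumptions for illustration, not a description of any specific model in this brief.

```python
import numpy as np

N_BINS = 256  # assumed number of bins per action dimension


def tokenize(action, low, high, n_bins=N_BINS):
    """Map continuous actions in [low, high] to integer bin indices."""
    clipped = np.clip(action, low, high)
    scaled = (clipped - low) / (high - low)  # rescale to [0, 1]
    return np.minimum((scaled * n_bins).astype(int), n_bins - 1)


def detokenize(tokens, low, high, n_bins=N_BINS):
    """Map bin indices back to the centre of each bin."""
    return low + (np.asarray(tokens) + 0.5) * (high - low) / n_bins


def command_jump(token_error, low, high, n_bins=N_BINS):
    """Size of the physical command change caused by a token that is off by k bins."""
    return abs(token_error) * (high - low) / n_bins


low, high = -1.0, 1.0
action = np.array([0.1234, -0.5, 0.999])
tokens = tokenize(action, low, high)
recovered = detokenize(tokens, low, high)

# The round-trip error of a correct token is bounded by half a bin width.
half_bin_width = (high - low) / (2 * N_BINS)
```

Under these assumptions a correctly predicted token costs at most half a bin of precision, but a mispredicted token is not bounded the same way: `command_jump(128, low, high)` covers half of the full action range in a single step. This is the discontinuity that the discretization introduces.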

2. Technical Analysis

2.1 VLA Architecture and Attack Surface

VLA models present a novel attack surface that combines vulnerabilities inherited from their LLM backbones with risks unique to physical action generation.

Attack Vector	Entry Point	Mechanism
Prompt injection	Language input	Standard LLM jailbreak techniques applied to VLA instruction channel
Adversarial patches	Visual input	Crafted visual patterns that alter object recognition or affordance perception
Action tokenization errors	Output discretization	Token prediction errors causing discontinuous physical commands
Affordance hallucination	Latent space	Model misattributes physical properties based on visual similarity
Indirect environmental jailbreaking	Physical environment	Malicious instructions embedded in the robot’s visual environment
Phantom loop	System architecture	Stale or hallucinated commands from reasoning layer reach control layer unchecked
Cross-domain safety misalignment	Alignment training	Text-aligned model produces physically unsafe actions
Reasoning chain manipulation	Reasoning process	Extended reasoning chains exploited to lead model toward unsafe action plans

Architectural vulnerability: the dual-system handover. Several production VLA architectures employ a dual-system approach: a slow “thinker” (VLA at 1-10 Hz) generating high-level plans, and a fast “actor” (whole-body controller at 50-200 Hz) executing them. The handover point creates what we term the “Phantom Loop”: if the VLA hallucinates or hangs, the high-frequency controller keeps executing the last command it received, stale or hallucinated, potentially driving the robot into a collision.
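
One way to reason about the handover is a staleness check that lives in the fast controller rather than in the VLA. The sketch below is a minimal illustration under stated assumptions: the class and field names are hypothetical, the 0.3 s age limit is arbitrary, and zero velocity is assumed to be a safe fallback, which will not hold for every platform (a balancing humanoid, for example).

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class PlannerCommand:
    velocity: Sequence[float]  # commanded joint velocities from the slow planner
    timestamp: float           # time at which the planner produced the command (s)


class StaleCommandWatchdog:
    """Runs inside the fast loop and refuses commands older than max_age_s."""

    def __init__(self, max_age_s: float, n_joints: int):
        self.max_age_s = max_age_s
        self.safe_stop = [0.0] * n_joints  # assumed safe fallback
        self._latest: Optional[PlannerCommand] = None

    def update(self, cmd: PlannerCommand) -> None:
        """Called whenever the slow planner emits a new command."""
        self._latest = cmd

    def command_for(self, now: float) -> List[float]:
        """Called every control tick; returns the command to execute."""
        if self._latest is None:
            return list(self.safe_stop)
        if now - self._latest.timestamp > self.max_age_s:
            return list(self.safe_stop)  # planner hung or fell behind
        return list(self._latest.velocity)


watchdog = StaleCommandWatchdog(max_age_s=0.3, n_joints=2)
watchdog.update(PlannerCommand(velocity=[0.5, -0.2], timestamp=1.0))
fresh = watchdog.command_for(now=1.1)  # planner is alive: command passes through
stale = watchdog.command_for(now=1.5)  # planner silent for 0.5 s: fall back
```

A staleness check covers a hung planner only. A hallucinated command arrives with a fresh timestamp, so it also needs independent bounds on the command values themselves.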

2.2 Vulnerability Assessment from Jailbreak Corpus

Cross-model vulnerability patterns (from Report 33, 8 models evaluated):

Our evaluation of 64 historical jailbreak scenarios across 8 foundation models revealed three distinct safety regimes:

Regime	Scale Range	Observed ASR	Mechanism
Incapable Safety	Sub-3B	0-21%	Models cannot process attacks as intended; fail safely through incapability
Capability-Safety Gap	7B-70B (preliminary)	4-17% (n=20-25/model, validated)	Moderate vulnerability; gap narrower than initially hypothesized
Aligned Frontier	Closed-source frontier	<2%	Extensive RLHF, red-teaming, API-level filtering

UPDATE (2026-02-08): The original 85.7% ASR for medium-scale models was invalidated by validation study #51 (n=20-25 per model, 8 models, structural classification). The signal was a heuristic classifier artifact. Validated ASR range: 4-17% across all scales with overlapping confidence intervals. The three-tier framework remains directionally valid but the mid-tier gap is narrower than originally reported.
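
At n=20-25 per model, overlapping intervals are expected. The sketch below computes a Wilson score interval for a binomial proportion, one standard choice for small samples; the brief does not state which interval method the validation study used, and the success counts in the example are illustrative values chosen to land near the 4% and 17% endpoints, not figures from Report 33.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def intervals_overlap(a, b):
    """True if two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Illustrative counts only: 1 success in 25 trials (4%) and 4 in 24 (about 17%).
low_asr_model = wilson_interval(1, 25)
high_asr_model = wilson_interval(4, 24)
overlap = intervals_overlap(low_asr_model, high_asr_model)
```

With these counts the two intervals overlap, which matches the brief's caution that samples of this size cannot cleanly separate the regimes.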

2.3 VLA-Specific Safety Mechanisms

Defense Layer	Approach	Limitations
Hardware interlocks	Torque limits, e-stops, safety cages	Cannot address semantic-level attacks
Kinematic shields	Deterministic bounds on learned policies (CBF, MPC)	Requires accurate world model; adds latency
Software guardrails	Output filtering on action tokens	Probabilistic; cannot generalize to novel attacks
Semantic firewalls	Input sanitization of language and visual channels	Depends on classifier accuracy
Dual-system architecture	System 1/System 2 split	Creates handover vulnerability (phantom loop)

Failure-First has proposed the Hierarchical Assurance for Neuro-Symbolic Embodiment (HANSE) certification framework (Report 32), which treats the VLA as an “untrusted oracle” wrapped by verified safety layers.
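
The "untrusted oracle" idea can be illustrated with a deterministic check that sits between the learned policy and the actuators. The sketch below is not the HANSE framework itself, and it is a much simpler stand-in for the CBF or MPC shields named in the table: it clamps speed and rejects any command whose one-step prediction leaves an axis-aligned workspace box. All names, limits, and the zero-velocity fallback are assumptions.

```python
import numpy as np


def shield(proposed_velocity, position, dt, v_max, workspace_low, workspace_high):
    """Return (velocity_to_execute, accepted) for an untrusted policy output."""
    # Layer 1: hard speed limit, applied regardless of what the policy asked for.
    v = np.clip(np.asarray(proposed_velocity, dtype=float), -v_max, v_max)

    # Layer 2: one-step lookahead against a fixed workspace box.
    next_pos = np.asarray(position, dtype=float) + v * dt
    inside = np.all((next_pos >= workspace_low) & (next_pos <= workspace_high))
    if not inside:
        return np.zeros_like(v), False  # assumed safe fallback: stop
    return v, True


box_low, box_high = np.array([-1.0, -1.0]), np.array([1.0, 1.0])

# An over-speed command well inside the box is clamped and accepted.
v_ok, accepted = shield([2.0, 0.0], [0.0, 0.0], 0.1, 0.5, box_low, box_high)

# A command that would cross the boundary is rejected outright.
v_stop, rejected_ok = shield([0.5, 0.0], [0.98, 0.0], 0.1, 0.5, box_low, box_high)
```

The design point is that nothing in `shield` depends on the policy's internals or its stated reasoning, so its guarantees hold even when the policy has been manipulated.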

3. Strategic Recommendations

3.1 Immediate Actions (0-30 days)

Recommendation 1: Establish a VLA-specific adversarial testing protocol.

Standard LLM safety benchmarks test for textual harm generation, not physical safety dimensions. A VLA that passes all existing text-safety benchmarks may still generate dangerous trajectories from ambiguous or adversarial instructions. Begin testing in simulation environments before any physical deployment.
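
A VLA-specific protocol has to score what the robot did, not only what the model said. The sketch below classifies a simulated episode on both axes; the `EpisodeResult` structure, the outcome labels, and the below-the-table unsafe-state predicate are hypothetical and would be replaced by whatever the simulator and safety case actually define.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Point = Tuple[float, float, float]


@dataclass
class EpisodeResult:
    refused_in_text: bool    # did the language output contain a refusal?
    trajectory: List[Point]  # end-effector positions recorded in simulation


def classify(result: EpisodeResult, is_unsafe_state: Callable[[Point], bool]) -> str:
    """Label an episode using both the text channel and the physical channel."""
    unsafe_motion = any(is_unsafe_state(p) for p in result.trajectory)
    if unsafe_motion and result.refused_in_text:
        return "cross_domain_misalignment"  # refused in words, unsafe in motion
    if unsafe_motion:
        return "unsafe_compliance"
    if result.refused_in_text:
        return "safe_refusal"
    return "safe_compliance"


def below_table(p: Point) -> bool:
    """Hypothetical predicate: anything below z = 0 is treated as unsafe."""
    return p[2] < 0.0


episode = EpisodeResult(
    refused_in_text=True,
    trajectory=[(0.0, 0.0, 0.2), (0.1, 0.0, -0.05)],
)
label = classify(episode, below_table)
```

A text-only benchmark would score this episode as a pass because the model refused; scoring the trajectory surfaces it as a failure.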

Recommendation 2: Audit safety architecture for “phantom loop” vulnerabilities.

Dual-system VLA architectures create handover points where stale or hallucinated commands from the reasoning layer can reach the control layer unchecked. Verify that the high-frequency controller has independent bounds on commanded actions that are not dependent on the reasoning layer’s validity.

3.2 Medium-Term Actions (30-90 days)

Recommendation 3: Implement Continuous Adversarial Regression Testing (CART).

Attack techniques evolve faster than model updates. Our corpus documents 6 distinct attack eras spanning 4 years. A model certified against 2023-era attacks offers no guarantee against 2025-era techniques. Integrate an adversarial test suite into the CI/CD pipeline for model updates.
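
A CART gate can be as small as a per-era comparison against the last accepted model. The sketch below is one possible shape, not a prescribed implementation: the era keys, the baseline values, and the 0.02 tolerance are made-up placeholders, and each outcome is a boolean meaning the attack succeeded.

```python
def regression_gate(baseline_asr, candidate_outcomes, tolerance=0.0):
    """Return the eras where the candidate's ASR regresses past the baseline.

    baseline_asr:       {era: ASR of the last accepted model}
    candidate_outcomes: {era: [True if the attack succeeded, ...]}
    """
    regressions = []
    for era, outcomes in candidate_outcomes.items():
        asr = sum(outcomes) / len(outcomes)
        base = baseline_asr.get(era)
        # An era with no baseline is flagged so that new attack families get reviewed.
        if base is None or asr > base + tolerance:
            regressions.append((era, base, asr))
    return regressions


# Placeholder data for illustration only.
baseline = {"era_2023": 0.05, "era_2025": 0.10}
candidate = {
    "era_2023": [True] * 1 + [False] * 19,  # 1 of 20 attacks succeeded
    "era_2025": [True] * 4 + [False] * 16,  # 4 of 20 attacks succeeded
}
regressions = regression_gate(baseline, candidate, tolerance=0.02)
```

In a pipeline, a non-empty `regressions` list would fail the build. Given the small-sample caveats elsewhere in this brief, the tolerance should be set with the expected sampling noise in mind.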

Recommendation 4: Engage with emerging VLA safety standards efforts.

No VLA-specific safety standard exists today. Early participants in standards development shape requirements that may become mandatory. The ISO revision cycle and EU AI Act implementation suggest formal requirements are 2-4 years away.

3.3 Strategic Considerations (90+ days)

Positioning for regulatory requirements (estimated 2028-2030): Organizations that establish internal VLA safety testing protocols now will have a compliance head start.
Insurance readiness: Insurance underwriters are beginning to assess liability for autonomous systems. Having documented adversarial testing results strengthens the insurance case.
Third-party certification as competitive differentiator: In a market where no competitor has published VLA-specific safety evaluation results, documented testing methodology can differentiate a company’s safety posture.

4. Risk Matrix

Risk	Likelihood	Impact	Score	Mitigation
Cross-domain safety misalignment	4	5	20	VLA-specific adversarial testing; layered safety architecture
Adversarial patch attacks on visual encoders	3	4	12	Adversarial patch detection; runtime image anomaly detection
Action tokenization errors	3	4	12	Kinematic shield with CBF/MPC enforcement
Reasoning chain manipulation in System 2 components	3	4	12	Reasoning output validation; bounded action authority
Regulatory gap creating unquantified liability	4	3	12	Proactive adoption of internal safety standards

Risk scores: Likelihood × Impact. Scores of 15+ indicate risks requiring immediate mitigation.
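
The scoring rule can be checked mechanically. The sketch below recomputes each score from the likelihood and impact values in the table above and applies the 15+ threshold; the variable names are illustrative.

```python
IMMEDIATE_THRESHOLD = 15

# (risk, likelihood, impact) as listed in the risk matrix.
risks = [
    ("Cross-domain safety misalignment", 4, 5),
    ("Adversarial patch attacks on visual encoders", 3, 4),
    ("Action tokenization errors", 3, 4),
    ("Reasoning chain manipulation in System 2 components", 3, 4),
    ("Regulatory gap creating unquantified liability", 4, 3),
]

# Score each risk and list the highest scores first.
scored = sorted(
    ((name, likelihood * impact) for name, likelihood, impact in risks),
    key=lambda item: item[1],
    reverse=True,
)
immediate = [name for name, score in scored if score >= IMMEDIATE_THRESHOLD]
```

Only cross-domain safety misalignment crosses the threshold; the other four risks all score 12.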

Appendix: Methodology and Limitations

Data Sources

Failure-First Jailbreak Corpus: 18,345 prompts across 15 datasets, 81 documented attack techniques, 6 historical eras
Evaluation Results: 652 results across 40 models, 55 evaluation runs
Failure-First Reports: Reports 21-23, 25, 27-29, 31-33, 36-37
Published Research: arXiv:2506.03350, arXiv:2411.13587, arXiv:2511.12149

Key Limitations

LLM testing, not VLA testing. All evaluation results are from text-based LLM jailbreak testing. Failure-First has not yet conducted adversarial testing against VLA models directly.
Sample size constraints. The capability-safety gap finding is based on 8 models at n=20-25 per model, and the resulting confidence intervals overlap. Confirmation with larger samples per model is required.
Evaluation coverage gap. Of 17,674 prompts, only 145 distinct prompts (0.8%) have been evaluated against any model.
Temporal validity. Model evaluations reflect capabilities as of late 2025 / early 2026.

Prepared by: Failure-First Research
Contact: contact@failurefirst.org
Web: failurefirst.org
