Report 40 Research — AI Safety Policy

Cross-Modal Vulnerability Inheritance in Vision-Language-Action Systems

Executive Summary

This report synthesizes published evidence on cross-modal adversarial vulnerability inheritance in vision-language-action (VLA) systems. Based on analysis of 45 primary sources — including the 30 highest-cited works in LLM/VLM/VLA security research and 15 curated papers on cross-modal attacks — we identify three core inheritance mechanisms enabling attacks to transfer across model architectures and modalities.

Key Findings (Literature-Grounded):

  1. Vision encoder monoculture can create cross-model vulnerabilities. Shared vision encoders (e.g., CLIP, DINOv2, SigLIP) can enable adversarial images optimized for one model to transfer to others, even when downstream architectures differ significantly. (See: arXiv:2511.16110v1)

  2. Feature-space attacks achieve near-complete task failure in VLA systems. Published research on OpenVLA reports up to ~100% failure rates under L∞=4/255 perturbations using attention-guided feature-space attacks, with perturbations affecting <10% of image patches. (See: arXiv:2511.21663v1)

  3. Text-based jailbreak techniques have been adapted to VLA control authority. One published study reports that LLM adversarial attack patterns can be extended to robotic action spaces, achieving persistent control over longer task horizons. (See: arXiv:2506.03350)

Scope & Limitations:

This analysis is grounded entirely in published literature. While we identify significant vulnerability inheritance patterns, empirical validation with our Failure-First methodology requires VLA testing infrastructure currently under development. Claims are therefore literature-supported rather than experimentally validated by our team.

Implications:

Cross-modal vulnerability inheritance suggests that safety measures designed for language models may not adequately protect vision-language-action systems. To the extent that shared vision encoders are prevalent in production VLA architectures, this creates systemic transfer risks that warrant investigation of encoder diversity and modality-specific defenses.


1. Introduction

1.1 Problem Statement

Vision-language-action (VLA) models represent a critical evolution in embodied AI, directly mapping visual perception and natural language understanding to physical actions in robotic systems. Unlike traditional vision models or large language models, VLAs control actuators, manipulators, and mobile robots in safety-critical environments.

This operational context raises a fundamental question: Do adversarial vulnerabilities discovered in vision-language models (VLMs) and large language models (LLMs) transfer to VLA systems? If so, attacks designed for text generation or image classification could potentially compromise robotic safety and task performance.

Recent literature suggests three concerning inheritance patterns:

  1. Shared architectural components (particularly vision encoders) may enable cross-model adversarial transfer
  2. Attack techniques developed for LLM jailbreaking can be adapted to VLA control authority
  3. Defense mechanisms designed for language models may be ineffective against multimodal attacks on VLAs

1.2 Scope & Methodology

Evidence Base:

This report synthesizes findings from 45 primary sources:

  • 30 highest-cited papers in LLM/VLM/VLA security research (ranked by OpenAlex citation count as of February 2026)
  • 15 curated papers on cross-modal attacks, VLA vulnerabilities, and transfer mechanisms

Analysis Approach:

  1. Literature synthesis using NotebookLM for cross-paper pattern identification (all claims trace back to primary PDFs/URLs)
  2. Claims inventory with source traceability (evidence package EP-40)
  3. Gap analysis mapping literature findings to testable hypotheses

Important Limitations:

  • No original VLA testing: This analysis is based on published research, not our own experimental validation
  • Text-based testing only (as of publication): Our current infrastructure tests language-based attacks on VLMs, not vision-based attacks on VLAs
  • Literature-grounded claims: All empirical findings cited are from primary sources; we distinguish clearly between validated findings and untested hypotheses

2. Cross-Modal Attack Landscape

2.1 Vision-Language Model (VLM) Attack Surfaces

VLMs inherit the same fundamental alignment problem as text-only LLMs, but add a larger, continuous input space (vision/audio) that is typically easier to optimize against and harder to robustify with discrete defenses.

The literature reviewed here highlights four attack surfaces:

  1. Token- and string-level jailbreaks (discrete optimization). Universal adversarial suffixes can reliably push aligned LLMs into harmful completion modes and can transfer from surrogate models to black-box production systems (arXiv:2307.15043).

  2. Semantic, black-box iterative jailbreaks. PAIR (arXiv:2310.08419) and TAP (arXiv:2312.02119) show that black-box attacks can be made query-efficient by automating prompt refinement.

  3. Multimodal prompt/instruction injection. Attacker-chosen instructions can be embedded into images as adversarial perturbations that steer the model’s output (arXiv:2307.10490). This is directly relevant to embodied settings where camera frames are part of the control loop.

  4. Visual adversarial examples as “universal jailbreakers”. A single adversarial image can induce broad classes of harmful compliance beyond the narrow content used for optimization (arXiv:2306.13213).
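
To make items 3 and 4 concrete, the following is a minimal sketch of a PGD-style visual adversarial example: an image perturbation is optimized so that a surrogate vision encoder's embedding drifts toward an attacker-chosen target embedding (e.g., the embedding of an injected instruction). The encoder, dimensions, and hyperparameters are illustrative placeholders, not the setups used in the cited papers.

```python
# Minimal sketch of a PGD-style visual adversarial example against a surrogate
# vision encoder. The encoder is a stand-in nn.Module; the cited attacks target
# real CLIP/SigLIP-style towers.
import torch
import torch.nn as nn

class SurrogateEncoder(nn.Module):
    """Placeholder for a frozen vision encoder (e.g., a CLIP image tower)."""
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, dim))
    def forward(self, x):
        return self.net(x)

def pgd_toward_target(encoder, image, target_embedding, eps=8/255, alpha=1/255, steps=100):
    """Perturb `image` within an L-infinity ball so its embedding approaches
    `target_embedding` (e.g., the embedding of an attacker-chosen instruction)."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        emb = encoder(torch.clamp(image + delta, 0, 1))
        loss = nn.functional.cosine_similarity(emb, target_embedding).mean()
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # gradient ascent on similarity
            delta.clamp_(-eps, eps)              # project back into the eps-ball
            delta.grad.zero_()
    return torch.clamp(image + delta, 0, 1).detach()

encoder = SurrogateEncoder().eval()
image = torch.rand(1, 3, 224, 224)               # benign camera frame
target = torch.randn(1, 512)                     # embedding of the injected instruction
adv_image = pgd_toward_target(encoder, image, target)
```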

2.2 Vision-Language-Action (VLA) Attack Surfaces

VLA systems expand the multimodal attack surface further because they map perception and language to actions rather than only to text outputs:

  1. Feature-space attacks on the perception-to-action pipeline. ADVLA (arXiv:2511.21663v1) targets a VLA’s visual feature space using PGD-style optimization guided by attention.

  2. Adversarial patch and trajectory-manipulation objectives. VLA-specific work describes threat models that explicitly include trajectory manipulation and adversarial patches in digital and physical settings (arXiv:2411.13587).

  3. Control-authority attacks adapted from LLM jailbreaking patterns. Published work frames an adaptation of LLM jailbreaking methods to obtain control authority over VLAs and to persist over longer horizons (arXiv:2506.03350).

  4. Safety alignment and failure detection as an active but early field. The existence of work such as SafeVLA (arXiv:2503.03480) indicates that VLA robustness is not a solved problem and that deployment-grade safety requires dedicated engineering and evaluation.

2.3 Transfer Mechanisms: VLM to VLA

Definition: In this report, “inheritance” means transfer of attack effectiveness due to shared components, shared representations, or shared training/evaluation patterns — not guaranteed exploitability.

Mechanism 1: Shared vision encoders and shared visual representations. Adversarial images optimized for one vision encoder can transfer broadly to unseen VLMs because shared visual representations create cross-model safety vulnerabilities (arXiv:2511.16110v1).

Mechanism 2: Embedding-mediated control of an underlying LLM backbone. MLLM jailbreaks can be connected to LLM jailbreaks because the MLLM contains an LLM whose behavior is driven by embeddings derived from the image (arXiv:2402.02309).

Mechanism 3: Adaptation of jailbreak search and persuasion loops to action-space control. LLM jailbreak style attacks can be adapted to VLAs for control authority and persistence (arXiv:2506.03350), and automated search loops can find bypasses with black-box access (arXiv:2310.08419).
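
The sketch below shows the kind of automated black-box refinement loop Mechanism 3 describes, in the spirit of PAIR/TAP: an attacker model proposes a prompt, the target responds, and an automated judge scores the result, closing the loop without human effort. The attacker, target, and judge interfaces are assumed placeholders, not the cited implementations.

```python
# Minimal sketch of an automated black-box jailbreak refinement loop
# (PAIR/TAP-style). All three callables are placeholders.
from typing import Callable

def refine_attack(
    attacker: Callable[[str, str], str],   # proposes a new prompt given goal + feedback
    target: Callable[[str], str],          # black-box model under test
    judge: Callable[[str, str], float],    # scores how well the response meets the goal (0-1)
    goal: str,
    max_queries: int = 20,
    threshold: float = 0.9,
):
    feedback = "no previous attempt"
    for _ in range(max_queries):
        prompt = attacker(goal, feedback)      # attacker LLM rewrites the prompt
        response = target(prompt)              # single black-box query
        score = judge(goal, response)          # automated scoring replaces human review
        if score >= threshold:
            return prompt, response, score     # candidate transferable artifact found
        feedback = f"last prompt scored {score:.2f}; response was: {response[:200]}"
    return None
```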

Mechanism 4: Data-space vs representation-space transfer differences. Data-space text jailbreaks might transfer across LLM backbones but may have limited ability to induce precise action failures. Representation-space attacks may be more likely to cause systematic perception and downstream action errors (arXiv:2510.01494v2).


3. Inheritance Mechanisms

3.1 Shared Vision Encoders (“Encoder Monoculture”)

Many VLA and VLM stacks are built from modular components: a general-purpose vision encoder, a projector, and a language/policy backbone. If the vision encoder is reused across many deployed systems, an attacker who can craft an adversarial input against one encoder may obtain transfer to many targets.

Concrete examples of encoder reuse in published VLA architectures:

  • OpenVLA uses DINOv2 (ViT-L/14) + SigLIP as dual vision encoders
  • LLaVA, InstructBLIP, and many VLMs use CLIP ViT-L/14 or ViT-H/14
  • Prismatic VLMs (which inform OpenVLA’s design) systematically evaluate combinations of DINOv2 and SigLIP
  • TinyVLA and DexVLA build on OpenVLA’s architecture, inheriting its encoder choices

This concentration around a small number of encoder families means that adversarial perturbations optimized against one encoder have a non-trivial probability of transferring to other systems sharing the same or closely related visual representations.
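
The sketch below illustrates the structural point: two hypothetical stacks (a chat VLM and a robot VLA) wrap the same frozen encoder, so any perturbation that corrupts the encoder's features corrupts the perception stage of both. The modules are placeholders; a production system would load a pretrained DINOv2/SigLIP/CLIP tower instead.

```python
# Minimal sketch of encoder reuse across two stacks. The modules are stand-ins
# for a shared pretrained vision tower plus per-system projectors/heads.
import torch
import torch.nn as nn

shared_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1024))
for p in shared_encoder.parameters():
    p.requires_grad_(False)   # frozen, as is common in modular VLM/VLA stacks

class ChatVLM(nn.Module):
    """Hypothetical VLM: shared encoder -> projector -> language backbone."""
    def __init__(self):
        super().__init__()
        self.encoder, self.projector = shared_encoder, nn.Linear(1024, 4096)
    def forward(self, image):
        return self.projector(self.encoder(image))   # visual tokens fed to an LLM

class RobotVLA(nn.Module):
    """Hypothetical VLA: the *same* frozen encoder -> projector -> action head."""
    def __init__(self):
        super().__init__()
        self.encoder, self.projector = shared_encoder, nn.Linear(1024, 7)
    def forward(self, image):
        return self.projector(self.encoder(image))   # 7-DoF action output

vlm, vla = ChatVLM(), RobotVLA()
image = torch.rand(1, 3, 224, 224)
# A delta crafted purely against `shared_encoder`'s features degrades the
# perception stage of both models simultaneously.
visual_tokens, action = vlm(image), vla(image)
```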

3.2 Data-Space vs. Representation-Space Transfer

A key conceptual distinction important for VLA threat modeling:

  1. Text-only jailbreak transfer does not necessarily imply action-space control. Even if a text jailbreak transfers between LLMs, a VLA may still fail safely if the action head is robust.
  2. Vision-feature attacks may have disproportionate downstream impact. In a VLA, the visual feature stream is upstream of both language grounding and policy execution.
  3. Defense design should separate layers and modalities. A defense that mitigates token-space suffix attacks may not address feature-space or cross-modal injection.
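
A minimal sketch of the two objective types this distinction refers to appears below. Both losses are schematic, and the model components they act on are assumed placeholders; the point is only that the token-space objective targets a text behaviour while the feature-space objective targets the representation the action head consumes.

```python
# Schematic contrast of data-space (token-level) vs. representation-space
# (feature-level) attack objectives. Either loss would be optimized w.r.t. an
# input perturbation with PGD, as in the sketch in Section 2.1.
import torch
import torch.nn as nn

def data_space_loss(llm_logits, target_token_ids):
    """Token-space objective: make the backbone emit a specific string
    (e.g., an affirmative prefix). May transfer across LLM backbones, but says
    nothing about the precision of downstream actions."""
    return nn.functional.cross_entropy(
        llm_logits.view(-1, llm_logits.size(-1)), target_token_ids.view(-1)
    )

def representation_space_loss(clean_features, adv_features):
    """Feature-space objective: push visual features away from a clean
    reference, upstream of both language grounding and the policy head."""
    return -nn.functional.mse_loss(adv_features, clean_features)  # maximize deviation
```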

3.3 Cross-Model Generalization Patterns

Cross-model generalization appears in three recurring forms:

  1. Surrogate-to-target transfer (open to closed, white-box to black-box)
  2. Prompt-universal and data-universal triggers creating systemic risk
  3. Automated discovery and adaptation loops — the ability to discover transferable artifacts against new targets with minimal human effort

4. Empirical Evidence from Literature

4.1 ADVLA: Feature-Space Attacks on VLA Models

Source: arXiv:2511.21663v1

Key Findings:

  • Attack Success Rate: Under L∞=4/255 perturbation constraint, ADVLA achieved near-100% task failure rates across all LIBERO benchmark suites
  • Sparse Perturbations: Top-K masking variant modified <10% of image patches while maintaining 99.4-100% failure rates
  • Efficiency: ~0.06 seconds per iteration on a single NVIDIA H100 GPU vs. 15 hours for end-to-end patch training
  • Threat Model: Gray-box access to vision encoder features, no access to LLM/action head
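
The sketch below shows what an attention-guided, patch-sparse, feature-space PGD attack consistent with these findings could look like. It is not ADVLA's released implementation; the encoder interface, attention scores, 14x14 patch grid, and step counts are assumptions for illustration.

```python
# Minimal sketch of a feature-space PGD attack restricted to the top-K
# most-attended patches under an L_inf = 4/255 budget. Illustrative only.
import torch

def topk_patch_mask(attn_scores, k, patch=16, img=224):
    """Build a pixel mask covering only the k most-attended patches."""
    grid = img // patch                              # 14x14 grid of 16px patches
    idx = attn_scores.flatten().topk(k).indices
    mask = torch.zeros(1, 1, img, img)
    for i in idx.tolist():
        r, c = (i // grid) * patch, (i % grid) * patch
        mask[..., r:r + patch, c:c + patch] = 1.0
    return mask

def feature_space_pgd(encoder, image, attn_scores, eps=4/255, alpha=1/255,
                      steps=50, k=19):               # 19/196 patches is <10%
    clean_feat = encoder(image).detach()
    mask = topk_patch_mask(attn_scores, k)
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        adv_feat = encoder(torch.clamp(image + delta * mask, 0, 1))
        loss = torch.nn.functional.mse_loss(adv_feat, clean_feat)  # push features away
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()       # ascend: maximize feature deviation
            delta.clamp_(-eps, eps)
            delta.grad.zero_()
    return torch.clamp(image + delta * mask, 0, 1).detach()
```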

4.2 Cross-Model Transfer Studies

  • Defense-equipped VLM transfer: Multi-Faceted Attack (MFA) targets defense-equipped VLM stacks and frames shared visual representations as a cross-model vulnerability (arXiv:2511.16110v1)
  • Text jailbreak transfer: Universal adversarial suffixes transfer to black-box production LLMs (arXiv:2307.15043)
  • Multimodal transfer: Visual adversarial examples transfer across multiple VLMs (arXiv:2306.13213; arXiv:2402.02309)
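
A cross-model transfer matrix is the natural summary artifact for such studies: craft an attack in white-box mode against each surrogate and score it against every target. The sketch below assumes placeholder craft_attack and evaluate callables; the success metric could be refusal bypass rate for VLMs or task failure rate for VLAs.

```python
# Minimal sketch of a surrogate -> target transfer matrix. Both callables are
# placeholders for a white-box attack and a binary success judgment.
from typing import Callable, Sequence

def transfer_matrix(
    models: Sequence[object],
    inputs: Sequence[object],
    craft_attack: Callable[[object, object], object],  # (surrogate, input) -> adversarial input
    evaluate: Callable[[object, object], bool],         # (target, adversarial input) -> success?
) -> list[list[float]]:
    matrix = []
    for surrogate in models:
        adv_inputs = [craft_attack(surrogate, x) for x in inputs]
        row = []
        for target in models:
            successes = sum(evaluate(target, adv) for adv in adv_inputs)
            row.append(successes / len(adv_inputs))     # ASR crafted on `surrogate`,
        matrix.append(row)                              # evaluated on `target`
    return matrix
```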

4.3 LLM Jailbreak Adaptation to VLA Control

Published work adapts LLM jailbreaking attacks to obtain control authority over VLAs and to persist over longer horizons (arXiv:2506.03350). The key contribution is an existence proof that jailbreak-style adversarial prompting can be extended from text refusal bypass to action-space influence.

4.4 Defense Ineffectiveness

Defenses tend to be narrow and attackers tend to adapt:

  • SmoothLLM proposes a randomized input-perturbation defense (arXiv:2310.03684) — primarily a text-prompt defense that does not address visual-feature attacks (a minimal sketch follows this list)
  • Defense-equipped VLM stacks remain vulnerable to adaptive, cross-model attacks (arXiv:2511.16110v1)
  • TAP reports strong jailbreak success even when guardrails such as LlamaGuard are used (arXiv:2312.02119)
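
A minimal sketch of a SmoothLLM-style randomized-perturbation defense makes the coverage gap concrete: the randomization operates on prompt characters, so a perturbation living in the image or feature pathway is never touched. The perturbation scheme and judge below are simplified stand-ins, not the paper's implementation.

```python
# Minimal sketch of a randomized character-perturbation defense with majority
# voting, applied to text prompts only.
import random
import string
from typing import Callable

def perturb_chars(prompt: str, rate: float = 0.1) -> str:
    """Randomly replace a fraction of characters (one SmoothLLM-style scheme)."""
    chars = list(prompt)
    for i in range(len(chars)):
        if random.random() < rate:
            chars[i] = random.choice(string.printable)
    return "".join(chars)

def smoothed_generate(
    llm: Callable[[str], str],
    is_harmful: Callable[[str], bool],
    prompt: str,
    copies: int = 10,
) -> str:
    """Query the model on several perturbed copies and majority-vote."""
    responses = [llm(perturb_chars(prompt)) for _ in range(copies)]
    flagged = [is_harmful(r) for r in responses]
    if sum(flagged) > copies / 2:
        return "Request declined."
    # otherwise return a response consistent with the majority (benign) vote
    return next(r for r, f in zip(responses, flagged) if not f)
```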

5. Testing Gaps & Validation Requirements

5.1 Current Infrastructure Limitations

What We Have:

  • Text prompt generation and evaluation (17,674 prompts, 40 models tested)
  • VLM testing via API
  • Attack success rate measurement for text-based jailbreaks

What We’re Missing:

  • Adversarial image/patch generation (PGD, attention-guided masking)
  • VLA model access (OpenVLA, DexVLA, etc.)
  • Robotics simulation (LIBERO, MuJoCo)
  • Action-space evaluation metrics

5.2 Literature-Grounded vs. Untested Hypotheses

Claims We Can Make (Supported by Primary Sources):

  • VLA models are vulnerable to feature-space adversarial attacks
  • Shared vision encoders enable cross-model transfer
  • LLM jailbreak techniques have been adapted to VLA control
  • Defense mechanisms for VLMs remain vulnerable to adaptive attacks

Hypotheses Requiring Validation (Not Yet Tested):

  • Failure-First jailbreak techniques transfer to VLA action spaces
  • Text-based attacks are less effective than visual attacks on VLAs
  • Vision encoder monoculture enables widespread VLA vulnerability
  • Specific transfer rates between model families

5.3 Future Testing Roadmap

Tier 1: Text-Based VLA Scenario Testing (Immediate)

  • Test language-based jailbreaks on VLA-related prompts using existing infrastructure
  • Effort: <1 day

Tier 2: Vision-Language Model Cross-Modal Testing (1-2 Weeks)

  • Test image + text attacks on VLMs to explore visual jailbreaking
  • Effort: 1-2 weeks

Tier 3: Full VLA Adversarial Testing (1-2 Months)

  • Replicate ADVLA methodology with OpenVLA + LIBERO simulation
  • Effort: 1-2 months FTE, H100 GPU required

6. Implications

6.1 For VLA Deployment

Cross-modal vulnerability inheritance is especially concerning in VLA deployments because the output is action. A failure mode that would be “just text” in an LLM can become a physical collision, manipulation mistake, or policy deviation.

Deployment scoping recommendation:

  • Treat “VLA safe because the underlying LLM is aligned” as an unsafe assumption
  • Require explicit evaluation of cross-modal attack surfaces and transfer
  • Use conservative rollout policies until Tier 3-style evidence exists

6.2 For Defense Design

  1. Defense-in-depth across modalities and layers. Prompt-level mitigations address only a slice of the problem.
  2. Action-space safety is non-negotiable. Safety should be measured and enforced at the action/trajectory layer, not only at the text layer.
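
As a sketch of what point 2 could mean in practice, the filter below validates commanded actions against velocity and workspace limits before they reach the actuators, independently of any text-level alignment. The limits and the 7-DoF action format are illustrative assumptions, not a recommended specification.

```python
# Minimal sketch of an action/trajectory-layer safety check. Limits are
# illustrative; real deployments would derive them from the robot and task.
from dataclasses import dataclass

@dataclass
class ActionLimits:
    max_joint_velocity: float = 0.5              # rad/s, illustrative
    workspace_min: tuple = (-0.5, -0.5, 0.0)     # metres, illustrative
    workspace_max: tuple = (0.5, 0.5, 0.8)

def safe_to_execute(action: list[float], ee_position: tuple, limits: ActionLimits) -> bool:
    """Reject an action chunk if it violates velocity or workspace constraints."""
    if any(abs(v) > limits.max_joint_velocity for v in action):
        return False                             # velocity bound exceeded
    return all(
        lo <= p <= hi                            # predicted end-effector pose stays in the fence
        for p, lo, hi in zip(ee_position, limits.workspace_min, limits.workspace_max)
    )

# A monitor like this can catch some adversarially induced trajectory deviations
# even when the upstream VLA is compromised; it does not replace robustness testing.
```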

6.3 For Safety Standards

Standards will likely need to include:

  • Cross-modal adversarial evaluation as a baseline requirement
  • Transfer testing across model variants and shared encoder families
  • Action-space metrics, not only refusal/compliance rates
  • Reporting requirements separating literature-grounded vulnerabilities from deployment-validated robustness

7. Recommendations

7.1 Near-Term Research Priorities

  1. Run Tier 1 proxy evaluations with strict measurement discipline
  2. Build a Tier 2 cross-modal VLM test set focused on embodied contexts
  3. Prioritize transfer-oriented experiments
  4. Validate defensive hypotheses early

7.2 Long-Term Validation Requirements

  1. Replicate ADVLA-style evaluation in a controlled environment
  2. Build cross-model transfer matrices
  3. Extend evaluation beyond simulation where feasible

7.3 Policy & Standards Considerations

  1. Require “attack surface disclosure” for embodied deployments
  2. Adopt evidence-package discipline for public claims
  3. Define minimum evaluation gates for safety-critical rollouts
  4. Create a living evaluation standard

Appendix: Evidence Package Summary (EP-40)

Claim       Summary
EP-40-C01   VLM defense stacks remain vulnerable to adaptive, cross-model attacks
EP-40-C02   Cross-model transfer can arise from shared visual representations
EP-40-C03   VLA models have robotics-specific attack surfaces
EP-40-C04   LLM jailbreak attacks can be adapted to gain VLA control authority
EP-40-C05   Data-space attacks transfer differently than representation-space attacks
EP-40-C06   Patch-wise sparse adversarial attacks can be highly effective against VLAs
EP-40-C07   Alignment failures documented for VLA-like systems
EP-40-C08   OOD transformations can increase jailbreak success
EP-40-C09   Safety alignment for VLAs is active; constrained learning is one approach

Version: 1.0.0 (FINAL)
Date: 2026-02-08
Review: Gemini 2.0 Flash (9/10), GPT-5 Codex (8/10) — February 8, 2026

