A Survey of Spatial Representation and Action Generation Methods Based on Vision-Language-Action Models

Published: 2026-02-10 13:36:47

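Vision–Language–Action (VLA) models are increasingly becoming a key technical route toward general-purpose embodied intelligence. This survey systematically reviews progress in VLA research along two dimensions: the evolution of spatial representations from 2D to 3D, and action generation paradigms such as autoregressive modeling, diffusion, and reinforcement learning. It traces the field's shift from 2D perception to 3D spatial understanding, and analyzes the commonalities and differences among action modeling methods built on autoregressive, diffusion, and reinforcement learning paradigms in terms of temporal modeling capability, task suitability, and generalization. It then compares datasets, evaluation metrics, and system architectures across simulation platforms and real robot systems, and examines how these differences affect model generalization. Finally, it analyzes the technical challenges facing VLA models, including spatial understanding, action planning, data efficiency, and generalization to real-world scenes, and outlines future directions such as structured 3D representations, physically consistent action generation, efficient data utilization, and safety control mechanisms, with the aim of providing a reference for building efficient, reliable, and scalable general-purpose embodied intelligence systems.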

Keywords: vision-language-action models; embodied intelligence; 3D spatial enhancement; action generation; robot manipulation

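Authors: Wu Chengdong (吴成东), Huang Lu (黄路), Zhuang Yaoming (庄曜铭), Zhang Xin (张欣), Li Chang (李畅), Northeastern University (东北大学); Hao Wu, Western Sydney University, Australia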

Abstract: Vision–Language–Action (VLA) models have emerged as a promising foundation for general-purpose embodied intelligence. This survey provides a structured overview of recent advances in VLA research, focusing on two core aspects: the progression of spatial representations from 2D perception to 3D understanding, and the development of action generation paradigms, including autoregressive modeling, diffusion-based policies, and reinforcement learning. We examine how these paradigms differ in temporal modeling, task suitability, and generalization behavior across diverse embodied scenarios. Furthermore, we compare commonly used datasets, evaluation protocols, and system architectures in both simulation environments and real-world robotic platforms, and discuss how these factors influence model transfer and generalization. Finally, we summarize the key challenges faced by current VLA systems—such as spatial reasoning, long-horizon action planning, data efficiency, and real-world robustness—and outline future research directions, including structured 3D representations, physically grounded action generation, efficient data utilization, and safety-aware control. This survey aims to offer practical insights and guidance for the design of efficient, reliable, and scalable embodied intelligence systems.

Key words: Vision-Language-Action models; Embodied intelligence; 3D spatial enhancement; Action generation; Robot manipulation

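Excerpted from 《自动化博览》 (Automation Panorama), Issue 1, 2026, which is also the 2026 Embodied Intelligence Special Issue (《2026具身智能专刊》).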
