MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

1Department of Automation, BNRist, Tsinghua University, 2Dexmal,
3MEGVII Technology, 4Tianjin University, 5Harbin Institute of Technology, 6StepFun
shi-h23@mails.tsinghua.edu.cn, wtc@dexmal.com, gaohuang@tsinghua.edu.cn

Introduction. (a) In Push Buttons tasks, pre- and post-push states look nearly identical, calling for temporal modeling. (b) Humans handle manipulation tasks via a dual-memory system: working memory (neural activity) supports short-term control, while episodic memory (hippocampus) preserves long-term experience. (c) Inspired by this, MemoryVLA introduces a Perceptual-Cognitive Memory Bank that consolidates low-level perceptual details and high-level cognitive semantics for temporally aware decision making. (d) MemoryVLA outperforms state-of-the-art baselines.

Abstract

Temporal context is essential for robotic manipulation because such tasks are inherently non-Markovian, yet mainstream VLA models typically overlook it and struggle with long-horizon, temporally dependent tasks. Cognitive science suggests that humans rely on working memory to buffer short-lived representations for immediate control, while the hippocampal system preserves both verbatim episodic details and the semantic gist of past experience as long-term memory.

Inspired by these mechanisms, we propose MemoryVLA, a Cognition-Memory-Action framework for long-horizon robotic manipulation. A pretrained VLM encodes each observation into perceptual and cognitive tokens that form working memory, while a Perceptual-Cognitive Memory Bank stores low-level perceptual details and high-level cognitive semantics consolidated from working memory over time. Working memory retrieves decision-relevant entries from the bank, adaptively fuses them with the current tokens, and updates the bank by merging redundant entries. Conditioned on these memory-augmented tokens, a diffusion action expert produces temporally aware action sequences.

We evaluate MemoryVLA on 150+ simulation and real-world tasks across three robots. On the SimplerEnv-Bridge, Fractal, and LIBERO-5 suites, it achieves 71.9%, 72.7%, and 96.5% success rates, respectively, outperforming the state-of-the-art baselines CogACT and PI-0, with a notable +14.6 gain on Bridge. On 12 real-world tasks spanning general skills and long-horizon temporal dependencies, MemoryVLA achieves an 84.0% success rate, including a +26 gain over the state-of-the-art baseline on long-horizon tasks. Moreover, MemoryVLA exhibits strong robustness and generalization under various out-of-distribution conditions.

Framework

Overall architecture of MemoryVLA. The RGB observation and language instruction are encoded by a 7B VLM into perceptual and cognitive tokens, forming short-term working memory. The working memory queries the Perceptual-Cognitive Memory Bank (PCMB) to retrieve relevant historical context, including high-level semantics and low-level visual details, adaptively fuses it with the current tokens, and consolidates the PCMB by merging its most similar neighboring entries. The memory-augmented tokens then condition a diffusion transformer to predict a sequence of future actions.
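To make the Cognition-Memory-Action flow concrete, the following is a minimal sketch of the control loop described above, assuming a VLM encoder producing perceptual/cognitive tokens, a memory module with the PCMB interface, and a diffusion action head refining a chunk of future actions. All class names, dimensions, and the toy denoising schedule are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of the MemoryVLA-style inference loop (assumed names and shapes).
import torch
import torch.nn as nn

DIM, N_TOKENS, ACT_DIM, HORIZON = 512, 32, 7, 8


class VLMEncoder(nn.Module):
    """Stand-in for the 7B VLM: maps an image (plus instruction) to perceptual + cognitive tokens."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(3 * 224 * 224, N_TOKENS * DIM)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.proj(image.flatten(1)).view(-1, N_TOKENS, DIM)


class MemoryBank(nn.Module):
    """Placeholder for PCMB retrieval / gated fusion / consolidation (see the module sketch below)."""
    def forward(self, tokens: torch.Tensor, step: int) -> torch.Tensor:
        return tokens  # identity here; the real bank fuses retrieved historical tokens


class DiffusionActionExpert(nn.Module):
    """Toy denoiser: refines a noisy action chunk conditioned on memory-augmented tokens."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(HORIZON * ACT_DIM + DIM, 256), nn.GELU(),
                                 nn.Linear(256, HORIZON * ACT_DIM))

    def forward(self, noisy: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([noisy.flatten(1), cond.mean(dim=1)], dim=-1)).view_as(noisy)


@torch.no_grad()
def act(encoder, memory, expert, image, step, denoise_steps: int = 4) -> torch.Tensor:
    tokens = encoder(image)          # working memory: perceptual + cognitive tokens
    cond = memory(tokens, step)      # retrieve, fuse, and consolidate history
    actions = torch.randn(image.shape[0], HORIZON, ACT_DIM)
    for _ in range(denoise_steps):   # iterative denoising toward an action chunk
        actions = actions - 0.5 * expert(actions, cond)
    return actions


if __name__ == "__main__":
    out = act(VLMEncoder(), MemoryBank(), DiffusionActionExpert(),
              torch.randn(1, 3, 224, 224), step=0)
    print(out.shape)  # torch.Size([1, 8, 7])
```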

Module Details

Module details. (a) Retrieval: the current perceptual and cognitive tokens query the PCMB via cross-attention with timestep positional encoding to fetch relevant historical features. (b) Gate fusion: the current and retrieved tokens are adaptively fused via a gating mechanism. (c) Consolidation: the fused tokens are written back into the PCMB; when the PCMB reaches its capacity, we compute similarities between adjacent entries and merge the most similar pair to keep the bank compact. A minimal sketch of these three operations follows below.
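Below is a minimal PyTorch sketch of a PCMB-style module implementing the three operations above: retrieval via cross-attention with a sinusoidal timestep positional encoding, gated fusion of current and retrieved tokens, and consolidation that merges the most similar adjacent pair once capacity is reached. The class, method, and hyperparameter names (PCMB, capacity, etc.) are assumptions for illustration, not the authors' code.

```python
# Hypothetical PCMB sketch: retrieval, gated fusion, and consolidation.
import torch
import torch.nn as nn
import torch.nn.functional as F


def timestep_encoding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Sinusoidal positional encoding over memory timesteps; returns [len(t), dim]."""
    half = dim // 2
    freqs = torch.exp(-torch.arange(half, dtype=torch.float32)
                      * (torch.log(torch.tensor(10000.0)) / half))
    angles = t.float().unsqueeze(-1) * freqs
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)


class PCMB(nn.Module):
    def __init__(self, dim: int = 512, capacity: int = 16, num_heads: int = 8):
        super().__init__()
        self.capacity = capacity
        self.retrieve = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.entries: list[torch.Tensor] = []   # one [N, dim] token set per past step
        self.steps: list[int] = []               # timestep of each stored entry

    def query(self, current: torch.Tensor, step: int) -> torch.Tensor:
        """(a) Retrieval and (b) gate fusion for current tokens of shape [B, N, dim]."""
        if not self.entries:
            return current
        mem = torch.stack(self.entries, dim=0)                   # [T, N, dim]
        pe = timestep_encoding(torch.tensor(self.steps), mem.shape[-1])
        mem = (mem + pe[:, None, :]).flatten(0, 1).unsqueeze(0)  # [1, T*N, dim]
        mem = mem.expand(current.shape[0], -1, -1)
        retrieved, _ = self.retrieve(current, mem, mem)           # cross-attention over history
        g = self.gate(torch.cat([current, retrieved], dim=-1))    # per-token gate in (0, 1)
        return g * retrieved + (1.0 - g) * current

    @torch.no_grad()
    def consolidate(self, fused: torch.Tensor, step: int) -> None:
        """(c) Consolidation: write fused tokens back; merge the most similar adjacent pair at capacity."""
        self.entries.append(fused[0].detach())   # store a single (batch-size-1) token set
        self.steps.append(step)
        if len(self.entries) > self.capacity:
            sims = [F.cosine_similarity(self.entries[i].flatten(),
                                        self.entries[i + 1].flatten(), dim=0)
                    for i in range(len(self.entries) - 1)]
            i = int(torch.stack(sims).argmax())
            merged = 0.5 * (self.entries[i] + self.entries[i + 1])
            self.entries[i:i + 2] = [merged]
            self.steps[i:i + 2] = [self.steps[i]]
```

Merging only adjacent entries (rather than arbitrary pairs) preserves the temporal order of the bank while still removing the most redundant pair, which is one plausible reading of the consolidation step described in the caption.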

Experimental Setup

Experimental setup overview. Top: three simulation benchmarks, SimplerEnv-Bridge with WidowX, SimplerEnv-Fractal with the Google Robot, and LIBERO with Franka. Bottom: real-world evaluation on two suites, General and Long-horizon Temporal. In total, we evaluate three robots across 10 suites, spanning over 150 tasks and 500 variations.

Real Robots Setup

Real-world Evaluation

General Manipulation Tasks

Insert Circle

Put Egg in Pan

Put Egg in Oven

Stack Cups

Stack Blocks

Pick Diverse Fruits (apple)

Pick Diverse Fruits (banana)

Pick Diverse Fruits (carrot)

Pick Diverse Fruits (chili)

Pick Diverse Fruits (grape)

Long-horizon Temporal Tasks

Sequential Push Buttons

Change Food

Guess Where

Clean Table & Count

Pick Place Order

Clean Restaurant Table

SimplerEnv Simulation Evaluation

SimplerEnv-Bridge Tasks

Put Spoon on Towel

Put Carrot on Plate

Stack Cube

Put Eggplant in Basket

SimplerEnv-Fractal Tasks

Pick Coke Can

Move Near

Open Drawer

Close Drawer

Put in Drawer

LIBERO Tasks

Robustness and Generalization Evaluation

Pick Place Order

Base

Unseen Background

Unseen Distractors

Unseen Lighting

Unseen Object

Unseen Container

Unseen Occlusion

Clean Restaurant Table

Base

Unseen Background

Unseen Distractors

Unseen Lighting

Unseen Object

Unseen Container

Unseen Occlusion

Put in Drawer

Base

Unseen Background (bedroom)

Unseen Background (office)

Unseen Lighting (brighter)

Unseen Lighting (darker)

Unseen Texture

Quantitative Analysis of Real-world OOD Tests

OOD Real Results

Quantitative Analysis of Pick and Move OOD Tests in Simulation

OOD Pick and Move Results

Quantitative Analysis of Hinge Object Manipulation OOD Tests in Simulation

OOD Drawer Results