MemoryVLA++: Temporal Modeling via Memory and Imagination in Vision-Language-Action Models

Hao Shi¹,², Weiye Li¹, Bin Xie³, Yulin Wang¹, Renping Zhou¹, Tiancai Wang³, Xiangyu Zhang⁴, Ping Luo², Gao Huang¹

¹THU LeapLab, ²HKU MMLab, ³Dexmal, ⁴StepFun

shihao1895@gmail.com

Paper arXiv Code

Models ICLR Version

1 Main Idea

Comparison of the main ideas. MemoryVLA++ is the extended journal version of MemoryVLA. We compare three paradigms: (1) typical VLAs are reactive and rely only on the present observation; (2) MemoryVLA introduces a working-memory and episodic-memory mechanism to capture past temporal dependencies; and (3) MemoryVLA++ extends this with future imagination via a world model, covering past, present, and future for full temporal modeling.

2 Motivation

Motivation of MemoryVLA++. (a) Button pressing shows the need for memory: the scene looks nearly the same before and after the press, so the current observation alone cannot tell whether the button has already been pressed. Dynamic-conveyor grasping shows the need for imagination: predicting future object motion helps the robot grasp at the right time. (b) Humans rely on the hippocampal system to maintain working and episodic memory, and use internal models to imagine how future states will evolve. (c) Inspired by these observations, MemoryVLA++ enables full temporal modeling in VLA models by combining present perception, past memory, and future imagination.

3 Method

3.1 Framework

Overall architecture. The current RGB observation and language instruction are encoded by a 7B VLM into perceptual and cognitive tokens, which form the working memory. The working memory queries a Perceptual-Cognitive Memory Bank (PCMB) to retrieve relevant historical context carrying both high-level semantics and low-level details. The retrieved context is adaptively fused with the current tokens, and the PCMB is updated, merging its most similar adjacent entries when it reaches capacity. A world model imagines future states in a denoising latent space, and the imagined latents are integrated under memory guidance to form full temporal-aware tokens. These tokens condition a diffusion action expert that predicts temporally coherent action sequences.

3.2 Module Details

3.2.1 Memory Module

Details of the memory module. (a) Retrieval: the current perceptual and cognitive tokens query the PCMB via cross-attention with timestep positional encoding (PE) to fetch relevant historical context. (b) Gate fusion: the current tokens and the retrieved history are adaptively fused through a gating mechanism. (c) Consolidation: the current tokens are written back to the PCMB; when the PCMB reaches capacity, the most similar adjacent entries are merged to keep the memory compact.
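The three steps above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: it handles a single token stream (it would be applied separately to the perceptual and cognitive streams), uses no learned projections, and assumes a sinusoidal timestep PE, a scalar sigmoid gate, cosine similarity, and averaging as the merge rule. The function names `timestep_pe`, `retrieve`, `gate_fuse`, and `consolidate` are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0


def timestep_pe(t, dim):
    # Assumed sinusoidal encoding of the timestep at which an entry was stored.
    pe = []
    for i in range(dim):
        freq = 1.0 / (10000 ** (2 * (i // 2) / dim))
        pe.append(math.sin(t * freq) if i % 2 == 0 else math.cos(t * freq))
    return pe


def retrieve(query, bank):
    """(a) Retrieval: scaled dot-product attention of one query over the bank.

    bank is a list of (timestep, vector) pairs. Keys are vector + timestep PE,
    values are the raw vectors (an assumption of this sketch).
    """
    d = len(query)
    if not bank:
        return [0.0] * d
    scores = []
    for t, vec in bank:
        key = [v + p for v, p in zip(vec, timestep_pe(t, d))]
        scores.append(dot(query, key) / math.sqrt(d))
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w / z * vec[i] for w, (_, vec) in zip(weights, bank))
            for i in range(d)]


def gate_fuse(current, retrieved, gate_weights, gate_bias=0.0):
    """(b) Gate fusion: g = sigmoid(w . [current; retrieved] + b), then blend."""
    logit = dot(gate_weights, current + retrieved) + gate_bias
    g = 1.0 / (1.0 + math.exp(-logit))
    return [g * r + (1.0 - g) * c for c, r in zip(current, retrieved)]


def consolidate(bank, t, token, capacity):
    """(c) Consolidation: append the token, then merge the most similar
    adjacent pair if the bank exceeds its capacity."""
    bank = bank + [(t, list(token))]
    if len(bank) <= capacity:
        return bank
    best = max(range(len(bank) - 1),
               key=lambda i: cosine(bank[i][1], bank[i + 1][1]))
    (_, a), (t_late, b) = bank[best], bank[best + 1]
    # Assumed merge rule: average the vectors, keep the later timestep.
    merged = (t_late, [(x + y) / 2 for x, y in zip(a, b)])
    return bank[:best] + [merged] + bank[best + 2:]
```

A control step under these assumptions would call `retrieve` with the current token, pass the result to `gate_fuse`, and then call `consolidate` to write the token back. Restricting merges to adjacent entries keeps the bank in temporal order, which is what lets a timestep PE remain meaningful after merging.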

3.2.2 Imagination Module

Details of the imagination module. (a) Imagination generation: conditioned on the current observation and instruction, the world model denoises multi-scale future latent tokens, which then pass through spatial and temporal attention to capture decision-relevant future state evolution. (b) Memory-guided imagination integration: memory-aware tokens attend to the imagined latents and adaptively fuse future cues to form full temporal-aware tokens.

3.2.3 Action Expert

Details of the full temporal-aware action expert. During diffusion denoising, the noisy action tokens are concatenated with the cognitive token for cognition attention, while perception attention extracts fine-grained details from the perceptual tokens, yielding temporally consistent actions.
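The token flow inside one denoising step can be sketched as follows. This is only an illustration under stated assumptions, not the authors' architecture: cognition attention is taken to be self-attention over the sequence [cognitive token; noisy action tokens], perception attention is taken to be cross-attention from action tokens to perceptual tokens, both use residual connections, and there are no learned projections, normalization layers, or noise-level conditioning. The names `attend` and `action_expert_block` are hypothetical.

```python
import math


def attend(queries, keys_values):
    """Scaled dot-product attention where keys and values are the same
    vectors (no learned projections in this sketch)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w / z * k[j] for w, k in zip(weights, keys_values))
                    for j in range(d)])
    return out


def add(xs, ys):
    # Element-wise residual connection over two lists of token vectors.
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def action_expert_block(noisy_actions, cognitive_token, perceptual_tokens):
    # Cognition attention: the cognitive token is prepended to the noisy
    # action tokens and the whole sequence attends to itself.
    seq = [cognitive_token] + noisy_actions
    mixed = attend(seq, seq)
    actions = add(noisy_actions, mixed[1:])  # drop the cognitive position

    # Perception attention: action tokens query the perceptual tokens
    # to pick up fine-grained visual detail.
    detail = attend(actions, perceptual_tokens)
    return add(actions, detail)
```

In a full diffusion policy a block like this would be stacked several times and followed by a head that predicts the noise (or the clean actions) for the current denoising step; those parts are omitted here.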

4 Experiment

4.1 Experimental Overview

Experimental setup overview. Top: simulation evaluation, covering general manipulation (LIBERO and SimplerEnv), long-horizon temporal manipulation (MIKASA-Robo and CALVIN), and robustness & generalization evaluation (LIBERO-Plus). Bottom: real-robot evaluation on general tasks, long-horizon memory-dependent tasks, long-horizon imagination-dependent tasks, and robustness & generalization settings. Overall, our evaluation spans 3 robots, 5 simulation benchmarks, and 3 categories of real-robot tasks, covering nearly 200 tasks with extensive variations.