MemoryVLA++: Temporal Modeling via Memory and Imagination in Vision-Language-Action Models

¹THU LeapLab, ²HKU MMLab, ³Dexmal, ⁴StepFun
shihao1895@gmail.com

Main Idea

Comparison of the main ideas. MemoryVLA++ is the extended journal version of MemoryVLA. We compare three paradigms: (1) typical VLAs are reactive and rely only on the present observation, (2) MemoryVLA introduces a working memory-episodic memory mechanism to capture past temporal dependencies, and (3) MemoryVLA++ further extends this by incorporating future imagination via a world model for full temporal modeling.

1 Abstract

Temporal modeling is essential for robotic manipulation, as effective control requires both memory of past interactions and imagination of future states. However, most VLA models rely primarily on the current observation and therefore struggle with long-horizon, temporally dependent tasks. Cognitive science suggests that humans rely on working memory to buffer short-lived context, the hippocampal system to preserve episodic memory of past experience, and internal models to imagine possible future state evolution.

Inspired by these mechanisms, we propose MemoryVLA++, a full temporal modeling framework that equips VLA models with memory and imagination for robotic manipulation. A pretrained VLM encodes the current observation into perceptual and cognitive tokens, forming working memory. These tokens query a Perceptual-Cognitive Memory Bank to retrieve relevant historical context. This bank stores low-level details and high-level semantics from past interactions, and is updated through redundancy-aware consolidation. A world model imagines future states in a denoising latent space, and the imagined latents are integrated under memory guidance to form full temporal-aware tokens. The resulting tokens condition a diffusion action expert to predict temporally consistent action sequences.

We conduct extensive experiments on 5 simulation benchmarks and 3 categories of real-robot tasks across 3 robots, covering general manipulation, long-horizon temporal tasks, robustness, and generalization. Our method achieves strong performance across Libero, SimplerEnv, Mikasa-Robo, Calvin, Libero-Plus, and diverse real-robot tasks, validating the effectiveness of full temporal modeling with memory and imagination. For example, on real robots, it achieves +9% gains on general manipulation, +26% on long-horizon memory-dependent tasks, and +28% on long-horizon imagination-dependent tasks.

2 Motivation

Motivation of MemoryVLA++. (a) Button pressing shows the need for memory: the scene looks nearly identical before and after the press, so the current observation alone cannot reveal whether the button has already been pressed. Dynamic-conveyor grasping shows the need for imagination: predicting future object motion helps the robot grasp at the right time. (b) Humans leverage the hippocampal system to maintain working and episodic memory, and use internal models to imagine future state evolution. (c) Inspired by these mechanisms, MemoryVLA++ enables full temporal modeling in VLA models by combining present perception, past memory, and future imagination.

3 Method

3.1 Framework

Overall architecture. The current RGB observation and language instruction are encoded by a 7B VLM into perceptual and cognitive tokens, forming working memory. The working memory queries a Perceptual-Cognitive Memory Bank (PCMB) to retrieve relevant historical context with high-level semantics and low-level details. The retrieved context is adaptively fused with current tokens, while the PCMB is updated by merging the most similar neighbors. A world model imagines future states in a denoising latent space, and the imagined latents are integrated under memory guidance to form full temporal-aware tokens. These tokens condition a diffusion action expert to predict temporally coherent action sequences.

3.2 Module Details

3.2.1 Memory Module

Details of the memory module. (a) Retrieval: current perceptual and cognitive tokens query the PCMB via cross-attention with timestep positional encoding (PE) to fetch relevant historical context. (b) Gate fusion: current tokens and retrieved history are adaptively fused through a gating mechanism. (c) Consolidation: the current tokens are written back to the PCMB. When the PCMB reaches capacity, the most similar adjacent entries are merged to keep the memory compact.
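The three memory steps can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the released implementation: each PCMB entry is reduced to a single D-dimensional vector, attention has no learned projections, the timestep PE is sinusoidal, the gate is one linear layer (`w_gate`, hypothetical), and merging averages the two entries and keeps the earlier timestep. All function names are made up for this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def timestep_pe(step, dim):
    """Sinusoidal encoding of an integer timestep (dim must be even)."""
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)
    return np.concatenate([np.sin(step * freqs), np.cos(step * freqs)])

def retrieve(current, bank, steps):
    """(a) Current tokens (N, D) cross-attend to bank entries (M, D)."""
    keys = bank + np.stack([timestep_pe(s, bank.shape[1]) for s in steps])
    weights = softmax(current @ keys.T / np.sqrt(bank.shape[1]))
    return weights @ bank

def gate_fuse(current, retrieved, w_gate):
    """(b) A sigmoid gate mixes current tokens with retrieved history."""
    logits = np.concatenate([current, retrieved], axis=-1) @ w_gate  # (N, D)
    gate = 1.0 / (1.0 + np.exp(-logits))
    return gate * retrieved + (1.0 - gate) * current

def consolidate(bank, steps, entry, step, capacity):
    """(c) Write the entry; over capacity, average the most similar adjacent pair."""
    bank = np.vstack([bank, entry[None, :]])
    steps = list(steps) + [step]
    if len(bank) > capacity:
        unit = bank / (np.linalg.norm(bank, axis=1, keepdims=True) + 1e-8)
        sims = np.sum(unit[:-1] * unit[1:], axis=1)  # cosine similarity of neighbors
        i = int(np.argmax(sims))
        merged = 0.5 * (bank[i] + bank[i + 1])
        bank = np.vstack([bank[:i], merged[None, :], bank[i + 2:]])
        steps = steps[:i] + [steps[i]] + steps[i + 2:]  # assumption: keep the earlier timestep
    return bank, steps
```

Restricting the merge to adjacent entries preserves temporal order in the bank, so older history is compressed rather than discarded when capacity is reached.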

3.2.2 Imagination Module

Details of the imagination module. (a) Imagination generation: conditioned on the current observation and instruction, the world model denoises multi-scale future latent tokens, which then pass through spatial and temporal attention to capture decision-relevant future state evolution. (b) Memory-guided imagination integration: memory-aware tokens attend to the imagined latents and adaptively fuse future cues to form full temporal-aware tokens.
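A rough NumPy sketch of the attention structure described here, assuming the world model has already produced future latents of shape (T, S, D) for T imagined frames with S spatial tokens each. The denoising itself and the multi-scale structure are not modeled, attention is projection-free, and the scalar `gate_logit` stands in for whatever adaptive fusion the model learns. All names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Projection-free self-attention over the second-to-last axis of x (..., L, D)."""
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def spatial_temporal_attention(latents):
    """Factorized attention over future latents of shape (T, S, D)."""
    x = latents + self_attention(latents)  # spatial: tokens within one frame
    x = np.swapaxes(x, 0, 1)               # (S, T, D)
    x = x + self_attention(x)              # temporal: one location across frames
    return np.swapaxes(x, 0, 1)            # back to (T, S, D)

def integrate(memory_tokens, latents, gate_logit=0.0):
    """Memory-aware tokens (N, D) query the imagined latents and add gated future cues."""
    future = latents.reshape(-1, latents.shape[-1])  # (T*S, D)
    weights = softmax(memory_tokens @ future.T / np.sqrt(future.shape[-1]))
    cues = weights @ future
    gate = 1.0 / (1.0 + np.exp(-gate_logit))         # scalar gate in (0, 1)
    return memory_tokens + gate * cues

rng = np.random.default_rng(0)
latents = spatial_temporal_attention(rng.normal(size=(4, 6, 8)))
tokens = integrate(rng.normal(size=(3, 8)), latents)
print(latents.shape, tokens.shape)  # (4, 6, 8) (3, 8)
```

Because the memory-aware tokens act as queries, the past context decides which parts of the imagined future are read, which is one way to interpret "memory guidance" in this sketch.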

3.2.3 Action Expert

Details of the full temporal-aware action expert. During diffusion denoising, noisy action tokens are concatenated with the cognitive token for cognition attention, while perception attention captures fine-grained details from the perceptual tokens for temporally consistent action generation.
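One forward pass of such a denoiser might look like the sketch below. It is a toy under stated assumptions: a single cognitive token of width D, action tokens already embedded to the same width, projection-free attention with residual connections, and no diffusion-timestep embedding or output head. `predict_noise` and `attend` are hypothetical names; in a real sampler a function like this would be called once per denoising step.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, context):
    """Projection-free attention: each query row reads a weighted mix of context rows."""
    weights = softmax(query @ context.T / np.sqrt(query.shape[-1]))
    return weights @ context

def predict_noise(noisy_actions, cognitive, perceptual):
    """Cognition attention over [cognitive; actions], then perception attention."""
    seq = np.vstack([cognitive[None, :], noisy_actions])  # prepend the cognitive token
    seq = seq + attend(seq, seq)                          # cognition attention (self-attention)
    actions = seq[1:]                                     # drop the cognitive slot
    return actions + attend(actions, perceptual)          # perception attention (cross-attention)

rng = np.random.default_rng(0)
noisy = rng.normal(size=(16, 8))       # 16 noisy action tokens of width 8
cognitive = rng.normal(size=8)         # one cognitive token
perceptual = rng.normal(size=(32, 8))  # 32 perceptual tokens
print(predict_noise(noisy, cognitive, perceptual).shape)  # (16, 8)
```

Concatenation lets every action token read the high-level intent through ordinary self-attention, while the separate cross-attention keeps the much larger set of perceptual tokens out of the action sequence itself.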

4 Experiment

4.1 Experimental Overview

Experimental setup overview. Top: simulation evaluation, covering general manipulation (Libero and SimplerEnv), long-horizon temporal manipulation (Mikasa-Robo and Calvin), and robustness & generalization evaluation (Libero-Plus). Bottom: real-robot evaluation on general tasks, long-horizon memory-dependent tasks, long-horizon imagination-dependent tasks, and robustness & generalization settings. Overall, our evaluation spans 3 robots, 5 simulation benchmarks, and 3 categories of real-robot tasks, covering nearly 200 tasks with extensive variations.

4.2 Real Robot Setup

4.3 Real Robot Evaluation

Insert Circle

Put Egg in Pan

Put Egg in Oven

Stack Cups

Stack Blocks

Pick Diverse Fruits (apple)

Pick Diverse Fruits (banana)

Pick Diverse Fruits (carrot)

Pick Diverse Fruits (chili)

Pick Diverse Fruits (grape)

Sequential Push Buttons

Change Food

Guess Where

Clean Table & Count

Pick Place Order

Clean Restaurant Table

Conveyor Pick Low

Conveyor Pick Mid

Conveyor Pick High

Conveyor Scan Pick

Bag Pack and Zip

Pick Place Order

Base

Unseen Background

Unseen Distractors

Unseen Lighting

Unseen Object

Unseen Container

Unseen Occlusion

Clean Restaurant Table

Base

Unseen Background

Unseen Distractors

Unseen Lighting

Unseen Object

Unseen Container

Unseen Occlusion

4.4 Simulation Evaluation

Put Spoon on Towel

Put Carrot on Plate

Stack Cube

Put Eggplant in Basket

Libero-Spatial Tasks

Bowl: Cabinet → Plate

Bowl: Cookie Box → Plate

Bowl: Drawer → Plate

Bowl: Stove → Plate

Libero-Object Tasks

Alphabet Soup in Basket

Chocolate in Basket

Orange Juice in Basket

Tomato Sauce in Basket

Libero-Goal Tasks

Bowl in Top Drawer

Open Middle Drawer

Turn On Stove

Wine Bottle on Rack

Libero-Long Tasks

Alphabet Soup & Tomato Sauce in Basket

Bowl in Bottom Drawer & Close

Turn On Stove & Put Moka Pot

Yellow/White Mug in Microwave & Close

Intercept Medium

Remember Color 3

Remember Color 5

Remember Color 9

Shell Game Touch

Calvin Sequence 01

Calvin Sequence 02

Calvin Sequence 03

Calvin Sequence 04

Calvin Sequence 05

Calvin Sequence 06

Calvin Sequence 07

Calvin Sequence 08

Background Textures 1

Background Textures 2

Background Textures 3

Camera Viewpoints 1

Camera Viewpoints 2

Language Instructions 1

Language Instructions 2

Light Conditions 1

Light Conditions 2

Objects Layout 1

Objects Layout 2

Robot Initial States 1

Robot Initial States 2

Sensor Noise 1

Sensor Noise 2

Sensor Noise 3

BibTeX

@article{shi2025memoryvla,
  title={MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation},
  author={Shi, Hao and Xie, Bin and Liu, Yingfei and Sun, Lin and Liu, Fengrong and Wang, Tiancai and Zhou, Erjin and Fan, Haoqiang and Zhang, Xiangyu and Huang, Gao},
  journal={arXiv preprint arXiv:2508.19236},
  year={2025}
}