MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

1Department of Automation, BNRist, Tsinghua University, 2Dexmal,
3MEGVII Technology, 4Tianjin University, 5Harbin Institute of Technology, 6StepFun
shi-h23@mails.tsinghua.edu.cn, wtc@dexmal.com, gaohuang@tsinghua.edu.cn

Introduction. (a) In Push Buttons tasks, pre- and post-push states look nearly identical, calling for temporal modeling. (b) Humans handle manipulation tasks via a dual-memory system: working memory (neural activity) supports short-term control, while episodic memory (hippocampus) preserves long-term experience. (c) Inspired by this, MemoryVLA introduces a Perceptual-Cognitive Memory Bank that consolidates low-level perceptual details and high-level cognitive semantics for temporally aware decision making. (d) MemoryVLA outperforms state-of-the-art baselines.

Abstract

Temporal context is essential for robotic manipulation because such tasks are inherently non-Markovian, yet mainstream VLA models typically overlook it and struggle with long-horizon, temporally dependent tasks. Cognitive science suggests that humans rely on working memory to buffer short-lived representations for immediate control, while the hippocampal system preserves both verbatim episodic details and the semantic gist of past experience as long-term memory.

Inspired by these mechanisms, we propose MemoryVLA, a Cognition-Memory-Action framework for long-horizon robotic manipulation. A pretrained VLM encodes each observation into perceptual and cognitive tokens that form working memory, while a Perceptual-Cognitive Memory Bank stores low-level perceptual details and high-level cognitive semantics consolidated from working memory over time. Working memory retrieves decision-relevant entries from the bank, adaptively fuses them with the current tokens, and updates the bank by merging redundant entries. Conditioned on these memory-augmented tokens, a diffusion action expert produces temporally aware action sequences.

We evaluate MemoryVLA on 150+ simulation and real-world tasks across three robots. On the SimplerEnv-Bridge, Fractal, and LIBERO-5 suites, it achieves 71.9%, 72.7%, and 96.5% success rates, respectively, outperforming the state-of-the-art baselines CogACT and PI-0, with a notable +14.6 gain on Bridge. On 12 real-world tasks spanning general skills and long-horizon temporal dependencies, MemoryVLA achieves an 84.0% success rate, including a +26 gain over the state-of-the-art baseline on long-horizon tasks. Moreover, MemoryVLA exhibits strong robustness and generalization under various out-of-distribution conditions.

Framework

Overall architecture of MemoryVLA. The RGB observation and language instruction are encoded by a 7B VLM into perceptual and cognitive tokens, forming short-term working memory. The working memory queries the Perceptual-Cognitive Memory Bank (PCMB) to retrieve relevant historical context, including high-level semantics and low-level visual details, adaptively fuses it with the current tokens, and consolidates the PCMB by merging its most similar neighboring entries. The memory-augmented tokens then condition a diffusion transformer to predict a sequence of future actions.
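To make the Cognition-Memory-Action flow concrete, the following is a minimal sketch of the control loop described above, assuming a VLM encoder producing perceptual/cognitive tokens, a memory module with the PCMB interface, and a diffusion action head refining a chunk of future actions. All class names, dimensions, and the toy denoising schedule are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of the MemoryVLA-style inference loop (assumed names and shapes).
import torch
import torch.nn as nn

DIM, N_TOKENS, ACT_DIM, HORIZON = 512, 32, 7, 8


class VLMEncoder(nn.Module):
    """Stand-in for the 7B VLM: maps an image (plus instruction) to perceptual + cognitive tokens."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(3 * 224 * 224, N_TOKENS * DIM)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.proj(image.flatten(1)).view(-1, N_TOKENS, DIM)


class MemoryBank(nn.Module):
    """Placeholder for PCMB retrieval / gated fusion / consolidation (see the module sketch below)."""
    def forward(self, tokens: torch.Tensor, step: int) -> torch.Tensor:
        return tokens  # identity here; the real bank fuses retrieved historical tokens


class DiffusionActionExpert(nn.Module):
    """Toy denoiser: refines a noisy action chunk conditioned on memory-augmented tokens."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(HORIZON * ACT_DIM + DIM, 256), nn.GELU(),
                                 nn.Linear(256, HORIZON * ACT_DIM))

    def forward(self, noisy: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([noisy.flatten(1), cond.mean(dim=1)], dim=-1)).view_as(noisy)


@torch.no_grad()
def act(encoder, memory, expert, image, step, denoise_steps: int = 4) -> torch.Tensor:
    tokens = encoder(image)          # working memory: perceptual + cognitive tokens
    cond = memory(tokens, step)      # retrieve, fuse, and consolidate history
    actions = torch.randn(image.shape[0], HORIZON, ACT_DIM)
    for _ in range(denoise_steps):   # iterative denoising toward an action chunk
        actions = actions - 0.5 * expert(actions, cond)
    return actions


if __name__ == "__main__":
    out = act(VLMEncoder(), MemoryBank(), DiffusionActionExpert(),
              torch.randn(1, 3, 224, 224), step=0)
    print(out.shape)  # torch.Size([1, 8, 7])
```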

Module Details

Module details. (a) Retrieval: the current perceptual and cognitive tokens query the PCMB via cross-attention with timestep positional encoding to fetch relevant historical features. (b) Gate fusion: the current and retrieved tokens are adaptively fused via a gating mechanism. (c) Consolidation: the fused tokens are written back into the PCMB; when the PCMB reaches its capacity, we compute similarities between adjacent entries and merge the most similar pair to keep the bank compact. A minimal sketch of these three operations follows below.
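Below is a minimal PyTorch sketch of a PCMB-style module implementing the three operations above: retrieval via cross-attention with a sinusoidal timestep positional encoding, gated fusion of current and retrieved tokens, and consolidation that merges the most similar adjacent pair once capacity is reached. The class, method, and hyperparameter names (PCMB, capacity, etc.) are assumptions for illustration, not the authors' code.

```python
# Hypothetical PCMB sketch: retrieval, gated fusion, and consolidation.
import torch
import torch.nn as nn
import torch.nn.functional as F


def timestep_encoding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Sinusoidal positional encoding over memory timesteps; returns [len(t), dim]."""
    half = dim // 2
    freqs = torch.exp(-torch.arange(half, dtype=torch.float32)
                      * (torch.log(torch.tensor(10000.0)) / half))
    angles = t.float().unsqueeze(-1) * freqs
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)


class PCMB(nn.Module):
    def __init__(self, dim: int = 512, capacity: int = 16, num_heads: int = 8):
        super().__init__()
        self.capacity = capacity
        self.retrieve = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.entries: list[torch.Tensor] = []   # one [N, dim] token set per past step
        self.steps: list[int] = []               # timestep of each stored entry

    def query(self, current: torch.Tensor, step: int) -> torch.Tensor:
        """(a) Retrieval and (b) gate fusion for current tokens of shape [B, N, dim]."""
        if not self.entries:
            return current
        mem = torch.stack(self.entries, dim=0)                   # [T, N, dim]
        pe = timestep_encoding(torch.tensor(self.steps), mem.shape[-1])
        mem = (mem + pe[:, None, :]).flatten(0, 1).unsqueeze(0)  # [1, T*N, dim]
        mem = mem.expand(current.shape[0], -1, -1)
        retrieved, _ = self.retrieve(current, mem, mem)           # cross-attention over history
        g = self.gate(torch.cat([current, retrieved], dim=-1))    # per-token gate in (0, 1)
        return g * retrieved + (1.0 - g) * current

    @torch.no_grad()
    def consolidate(self, fused: torch.Tensor, step: int) -> None:
        """(c) Consolidation: write fused tokens back; merge the most similar adjacent pair at capacity."""
        self.entries.append(fused[0].detach())   # store a single (batch-size-1) token set
        self.steps.append(step)
        if len(self.entries) > self.capacity:
            sims = [F.cosine_similarity(self.entries[i].flatten(),
                                        self.entries[i + 1].flatten(), dim=0)
                    for i in range(len(self.entries) - 1)]
            i = int(torch.stack(sims).argmax())
            merged = 0.5 * (self.entries[i] + self.entries[i + 1])
            self.entries[i:i + 2] = [merged]
            self.steps[i:i + 2] = [self.steps[i]]
```

Merging only adjacent entries (rather than arbitrary pairs) preserves the temporal order of the bank while still removing the most redundant pair, which is one plausible reading of the consolidation step described in the caption.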

Experimental Setup

Experimental setup overview. Top: three simulation benchmarks, SimplerEnv-Bridge with WidowX, SimplerEnv-Fractal with the Google Robot, and LIBERO with Franka. Bottom: real-world evaluation on two suites, General and Long-horizon Temporal. In total, we evaluate three robots across 10 suites, spanning over 150 tasks and 500 variations.

Real Robots Setup

Real-world Evaluation

General Manipulation Tasks

Insert Circle

Put Egg in Pan

Put Egg in Oven

Stack Cups

Stack Blocks

Pick Diverse Fruits (apple)

Pick Diverse Fruits (banana)

Pick Diverse Fruits (carrot)

Pick Diverse Fruits (chili)

Pick Diverse Fruits (grape)

Long-horizon Temporal Tasks

Sequential Push Buttons

Change Food

Guess Where

Clean Table & Count

Pick Place Order

Clean Restaurant Table

SimplerEnv Simulation Evaluation

SimplerEnv-Bridge Tasks

Put Spoon on Towel

Put Carrot on Plate

Stack Cube

Put Eggplant in Basket

SimplerEnv-Fractal Tasks

Pick Coke Can

Move Near

Open Drawer

Close Drawer

Put in Drawer

LIBERO Tasks

Robustness and Generalization Evaluation

Pick Place Order

Base

Unseen Background

Unseen Distractors

Unseen Lighting

Unseen Object

Unseen Container

Unseen Occlusion

Clean Restaurant Table

Base

Unseen Background

Unseen Distractors

Unseen Lighting

Unseen Object

Unseen Container

Unseen Occlusion

Put in Drawer

Base

Unseen Background (bedroom)

Unseen Background (office)

Unseen Lighting (brighter)

Unseen Lighting (darker)

Unseen Texture

Quantitative Analysis of Real-world OOD Tests

OOD Real Results

Quantitative Analysis of Pick and Move OOD Tests in Simulation

OOD Pick and Move Results

Quantitative Analysis of Hinge Object Manipulation OOD Tests in Simulation

OOD Drawer Results