SpatialActor

SpatialActor: Exploring Disentangled Spatial Representations for Robust Robotic Manipulation

AAAI 2026 Oral

1Department of Automation, BNRist, Tsinghua University, 2Dexmal, 3MEGVII Technology, 4StepFun
shi-h23@mails.tsinghua.edu.cn, wtc@dexmal.com, gaohuang@tsinghua.edu.cn

Introduction. (a) Point-based methods suffer from sparse sampling, leading to the loss of fine-grained semantics. (b) Image-based methods typically entangle semantics and geometry, while inherent depth noise in the real world disrupts semantic understanding. (c) SpatialActor disentangles visual semantics, two complementary high-level geometric representations from noisy depth and expert priors, and low-level spatial cues. (d) Performance under varying degrees of noise, demonstrating robustness.

Abstract

Robotic manipulation requires precise spatial understanding to interact with objects in the real world. Point-based methods suffer from sparse sampling, leading to the loss of fine-grained semantics. Image-based methods typically feed RGB and depth into 2D backbones pre-trained on 3D auxiliary tasks, but their entangled semantics and geometry are sensitive to the inherent depth noise of real-world sensors, which disrupts semantic understanding. Moreover, these methods focus on high-level geometry while overlooking the low-level spatial cues essential for precise interaction.

We propose SpatialActor, a disentangled framework for robust robotic manipulation that explicitly decouples semantics and geometry. The Semantic-guided Geometric Module adaptively fuses two complementary geometric representations from noisy depth and semantic-guided expert priors. In addition, a Spatial Transformer leverages low-level spatial cues for accurate 2D-3D mapping and enables interaction among spatial features.

We evaluate SpatialActor on multiple simulation and real-world scenarios across 50+ tasks. It achieves state-of-the-art performance with 87.4% success on RLBench and improves by 13.9% to 19.4% under varying noise conditions, showing strong robustness. Moreover, it significantly enhances few-shot generalization to new tasks and maintains robustness under various spatial perturbations.

Framework

Overall architecture of SpatialActor. The architecture employs separate vision and depth encoders. The Semantic-guided Geometric Module (SGM) adaptively fuses robust yet coarse geometric priors from a pretrained depth expert with noisy depth features via gated fusion to yield high-level geometric representations. In the Spatial Transformer (SPT), low-level spatial cues are encoded as positional embeddings to drive spatial interactions. Finally, view-level interactions refine intra-view features, while scene-level interactions consolidate cross-modal information across views to support the subsequent action head.
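Conceptually, the gated fusion in the SGM can be sketched as follows (a minimal PyTorch-style illustration; the module name, feature dimensions, and per-token convex gate are our assumptions for clarity, not the exact released implementation):

import torch
import torch.nn as nn

class GatedGeometricFusion(nn.Module):
    """Fuse coarse-but-robust expert geometric priors with noisy raw-depth features."""

    def __init__(self, dim: int):
        super().__init__()
        # The gate is predicted from the concatenation of both geometric streams.
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.Sigmoid(),
        )

    def forward(self, expert_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        # expert_feat: priors from a pretrained depth expert, shape (B, N, dim)
        # depth_feat:  features encoded from the (noisy) raw depth, shape (B, N, dim)
        g = self.gate(torch.cat([expert_feat, depth_feat], dim=-1))
        # Per-token convex combination: the gate decides how much to trust each source.
        return g * expert_feat + (1.0 - g) * depth_feat

# Usage with dummy tensors.
fusion = GatedGeometricFusion(dim=256)
expert = torch.randn(2, 196, 256)
noisy = torch.randn(2, 196, 256)
fused = fusion(expert, noisy)  # (2, 196, 256) high-level geometric representation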

Module Details

Module details. (a) The Semantic-guided Geometric Module (SGM) adaptively combines two complementary geometric representations via a gating mechanism. (b) The Spatial Transformer (SPT) converts 3D points into spatial positional embeddings using RoPE to establish 2D–3D correspondences, followed by view-level and scene-level interactions for spatial token refinement.
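To make the SPT's use of RoPE concrete, the sketch below rotates spatial tokens by phases derived from their back-projected 3D coordinates (illustrative only; the per-axis channel split and frequency schedule are assumptions, not the paper's exact formulation):

import torch

def rope_3d(tokens: torch.Tensor, points: torch.Tensor, base: float = 100.0) -> torch.Tensor:
    """Apply rotary positional embeddings driven by 3D point coordinates.

    tokens: (B, N, C) spatial tokens, with C divisible by 6 (x/y/z, rotated in pairs).
    points: (B, N, 3) 3D points back-projected from depth.
    """
    B, N, C = tokens.shape
    d = C // 3  # channels allotted to each axis; must be even
    out = []
    for axis in range(3):
        feat = tokens[..., axis * d:(axis + 1) * d]
        coord = points[..., axis:axis + 1]                       # (B, N, 1)
        freqs = base ** (-torch.arange(0, d, 2, device=tokens.device) / d)
        angles = coord * freqs                                   # (B, N, d/2)
        cos, sin = angles.cos(), angles.sin()
        f1, f2 = feat[..., 0::2], feat[..., 1::2]
        # Standard 2D rotation applied to consecutive channel pairs.
        rotated = torch.stack((f1 * cos - f2 * sin,
                               f1 * sin + f2 * cos), dim=-1).flatten(-2)
        out.append(rotated)
    return torch.cat(out, dim=-1)

# Usage: tokens from one view together with their back-projected 3D points.
tokens = torch.randn(2, 196, 384)
points = torch.rand(2, 196, 3)
spatial_tokens = rope_3d(tokens, points)  # same shape, now carrying 2D-3D correspondence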

Experiments

Simulation Performance

Real Robot Setup

Real World Performance

Simulation

RLBench Demos

Close Jar

Insert Onto Square Peg

Light Bulb In

Meat Off Grill

Open Drawer

Place Cups

Place Shape In Shape Sorter

Place Wine At Rack Location

Push Buttons

Put Groceries In Cupboard

Put Item In Drawer

Put Money In Safe

Reach And Drag

Slide Block To Color Target

Stack Blocks

Stack Cups

Sweep To Dustpan Of Size

Turn Tap

Real-World

Insert Ring Onto Cone

Pick Glue To Box

Place Carrot To Box

Push Button

Slide Block

Stack Block

Stack Cup

Wipe Table

BibTeX


      TBD