I am Hao Shi (石昊), a third-year Master’s student in the Department of Automation at Tsinghua University, in a joint program with MEGVII Research. I am advised by Prof. Gao Huang and Xiangyu Zhang, and I also work closely with Tiancai Wang.

Previously, I received my Bachelor’s degree from the College of Intelligence and Computing at Tianjin University in 2023, advised by Prof. Di Lin.

My research interests lie primarily in Embodied AI, Robot Learning, Vision-Language-Action (VLA) models, and 3D Perception. I am dedicated to exploring foundation models for general robotic systems.

I am looking for Ph.D. opportunities in Embodied AI for Fall 2026. I would be truly grateful for any advice, recommendations, or opportunities. Here is my CV. Please feel free to reach me at shihao1895@gmail.com or via WeChat (u1s11024).

📖 Education

  • 2023.09 - 2026.06 (expected), M.S. in Artificial Intelligence, Department of Automation, Tsinghua University, Beijing.
  • 2020.06 - 2023.06, B.Eng. in Computer Science, College of Intelligence and Computing, Tianjin University.
    • Academic advisor: Prof. Di Lin
    • GPA: 3.81 / 4.0, 91.2 / 100
  • 2019.09 - 2020.06, B.Eng. in Materials Science, School of Materials Science and Engineering, Tianjin University.

📝 Research

*: equal contribution, ✉: corresponding author.

First (Co-first) Authors

Under Review 2025

MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, Gao Huang✉

  • MemoryVLA is a Cognition-Memory-Action framework for robotic manipulation inspired by human memory systems. It builds a hippocampal-like perceptual-cognitive memory to capture the temporal dependencies essential for current decision-making, enabling long-horizon, temporally aware action generation.

ICLR 2025

DenseGrounding: Improving Dense Language-Vision Semantics for Ego-centric 3D Visual Grounding

Henry Zheng*, Hao Shi*, Qihang Peng, Yong Xien Chng, Rui Huang, Yepeng Weng, Zhongchao Shi, Gao Huang✉

  • DenseGrounding is a framework for embodied 3D visual grounding. It tackles the loss of fine-grained visual semantics from sparse fusion of point clouds with multi-view images, as well as the limited textual semantic context from arbitrary instructions, enabling more accurate and context-aware grounding.

Under Review 2025

SpatialActor: Exploring Disentangled Spatial Representations for Robust Robotic Manipulation

Hao Shi, Bin Xie, Yingfei Liu, Yang Yue, Tiancai Wang, Haoqiang Fan, Xiangyu Zhang, Gao Huang✉

  • SpatialActor is a disentangled framework for robust robotic manipulation. It decouples perception into complementary components: high-level geometry, derived from fine-grained but noisy raw depth together with coarse but robust depth-expert priors, along with low-level spatial cues and appearance semantics.

CVPRW 2024

DenseG: Alleviating Vision-Language Feature Sparsity in Multi-View 3D Visual Grounding

Henry Zheng*, Hao Shi*, Yong Xien Chng, Rui Huang, Zanlin Ni, Tianyi Tan, Qihang Peng, Yepeng Weng, Zhongchao Shi, Gao Huang✉

  • 1st Place and Innovation Award in CVPR 2024 Autonomous Grand Challenge, Embodied 3D Grounding Track (1/64 teams, 1/154 submissions).
  • Oral Presentation at CVPR 2024 Workshop on Foundation Models for Autonomous Systems.

NeurIPS 2023

Open Compound Domain Adaptation with Object Style Compensation for Semantic Segmentation

Tingliang Feng*, Hao Shi*, Xueyang Liu, Wei Feng, Liang Wan, Yanlin Zhou, Di Lin✉

  • We propose object style compensation for open compound domain adaptation. It builds an object-level discrepancy memory bank to capture fine-grained source–target domain gaps and compensates target features to align with the source distribution, enabling cross-domain semantic segmentation.

Other Authors

Under Review 2025

GeoVLA: Empowering 3D Representations in Vision-Language-Action Models

Lin Sun*, Bin Xie*, Yingfei Liu, Hao Shi, Tiancai Wang, Jiale Cao✉

  • GeoVLA is a framework that bridges 2D semantics and 3D geometry for VLA. By encoding geometric embeddings with a dual-stream design and leveraging a Mixture-of-Experts 3D-Aware Action Expert, it achieves robustness across diverse camera views, object heights, and sizes.

Under Review 2025

Grounding Beyond Detection: Enhancing Contextual Understanding in Embodied 3D Grounding

Yani Zhang*, Dongming Wu*, Hao Shi, Yingfei Liu, Tiancai Wang, Haoqiang Fan, Xingping Dong✉

  • DEGround transfers general proposals from detection into grounding with shared queries, and mitigates vision–language misalignment through region activation and query-wise modulation, achieving 1st place on the EmbodiedScan benchmark.

🎖 Honors and Awards

  • 2024.12, Philobiblion Scholarship, Comprehensive Excellence 1st Prize, Tsinghua University. (Top 10%, ¥10000)
  • 2024.06, 1st Place and Innovation Award in CVPR 2024 Autonomous Grand Challenge, Embodied 3D Grounding Track. (1/154, $9000)
  • 2023.12, CXMT Scholarship, Comprehensive Excellence 1st Prize, Tsinghua University. (Top 10%, ¥10000)
  • 2023.06, Outstanding Bachelor’s Thesis Award, Tianjin University.
  • 2023.06, Excellent Graduate Award, Tianjin University.
  • 2021.12, Huawei Intelligent Base Scholarship, Ministry of Education-Huawei Intelligent Base Future Stars.

💻 Internship

💬 Invited Talks

🎓 Service

Reviewer / PC Member:

  • ICLR 2026, ICLR 2025
  • AAAI 2026
  • ICCV 2025