Climb Box1 (CB1) - real2sim
Climb Box1 (CB1) - sim2real
Climb Box1 (CB1): climb onto a 50 cm box, walk to the edge, and descend using single-hand support.
Jump Box1 (JB1) - real2sim
Jump Box1 (JB1) - sim2real
Jump Box1 (JB1): jump onto a 40 cm box.
Climb Box2 (CB2) - real2sim
Climb Box2 (CB2) - sim2real
Climb Box2 (CB2): side climb onto a 60 cm box.
Jump Box3 (JB3) - real2sim
Jump Box3 (JB3) - sim2real
Jump Box3 (JB3): jump onto a 40 cm box with both feet, then walk to the edge of the box, and finally jump down.
Jump Box2 (JB2) - real2sim
Jump Box2 (JB2) - sim2real
Jump Box2 (JB2): running single-leg jump onto a 40 cm box, then drop down.
Jump Climb Down1 (JCD1) - real2sim
Jump Climb Down1 (JCD1) - sim2real
Jump Climb Down1 (JCD1): jump onto a 20 cm box, climb onto a 60 cm box, and descend using single-hand support.
Safety Vault2 (SV2) - real2sim
Safety Vault2 (SV2) - sim2real
Safety Vault2 (SV2): double-hand safety vault over a 40 cm box.
Long Parkour Sequence 1
Long Parkour Sequence 2
Long Parkour Sequence 3
Long Parkour Sequence 4
Parkour Skill 1
Parkour Skill 2
Parkour Skill 4
Parkour Skill 5
Tic Tac
Humanoid motion control has witnessed significant breakthroughs in recent years, with deep reinforcement learning (RL) emerging as a primary catalyst for achieving complex, human-like behaviors. However, the high dimensionality and intricate dynamics of humanoid robots make manual motion design impractical, leading to a heavy reliance on motion capture (MoCap) data. Such datasets are not only costly to acquire but also frequently lack the geometric context of the surrounding physical environment. Consequently, existing motion synthesis frameworks often decouple motion from scene, producing physical inconsistencies such as contact slippage and mesh penetration in terrain-aware tasks.
In this work, we present MeshMimic, a framework that bridges 3D scene reconstruction and embodied intelligence, enabling humanoid robots to learn coupled "motion-terrain" interactions directly from video. Leveraging state-of-the-art 3D vision models, the framework segments and reconstructs both human trajectories and the underlying 3D geometry of terrains and objects. We introduce a kinematic-consistency optimization that extracts high-quality motion data from noisy visual reconstructions, alongside a contact-invariant retargeting method that transfers human-environment interaction features to the humanoid. Experiments demonstrate that MeshMimic achieves robust, highly dynamic performance across diverse and challenging terrains. These results show that a low-cost pipeline using only consumer-grade monocular sensors can support the training of complex physical interactions, offering a scalable path toward the autonomous evolution of humanoid robots in unstructured environments.
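To make the kinematic-consistency idea concrete, the Python sketch below denoises a toy reconstructed joint trajectory by jointly penalizing deviation from the visual reconstruction, bone-length drift, temporal acceleration, and foot lift-off from the terrain on assumed contact frames. The formulation, weights, skeleton, contact labels, and names such as kinematic_consistency_loss are illustrative assumptions for exposition, not MeshMimic's actual objective.

# Minimal sketch (illustrative, not MeshMimic's implementation): clean up a
# noisy reconstructed joint trajectory by enforcing kinematic consistency --
# constant bone lengths, temporal smoothness, and foot-terrain contact on
# assumed contact frames. Weights, skeleton, and contact labels are toy values.
import numpy as np
from scipy.optimize import minimize

T, J = 30, 4                                   # frames, joints (toy sizes)
rng = np.random.default_rng(0)

# Toy "noisy reconstruction": a drifting chain of joints plus Gaussian noise.
t = np.linspace(0.0, 1.0, T)[:, None, None]                      # (T, 1, 1)
base = np.stack([np.linspace(0.0, 1.0, J)] * 3, axis=-1)[None]   # (1, J, 3)
noisy = base + 0.5 * t + 0.02 * rng.standard_normal((T, J, 3))

bone_pairs = [(0, 1), (1, 2), (2, 3)]          # toy skeleton edges
parents = [a for a, _ in bone_pairs]
children = [b for _, b in bone_pairs]
ref_lengths = np.linalg.norm(base[0, parents] - base[0, children], axis=-1)

contact_frames = np.arange(0, T, 5)            # assumed foot-contact labels
terrain_height = 0.0                           # flat terrain for the toy case
foot_joint = 0                                 # joint assumed to touch ground

def kinematic_consistency_loss(x):
    q = x.reshape(T, J, 3)
    # Stay close to the (noisy) visual reconstruction.
    data = np.sum((q - noisy) ** 2)
    # Bone lengths must not stretch or shrink over time.
    lengths = np.linalg.norm(q[:, parents] - q[:, children], axis=-1)
    bones = np.sum((lengths - ref_lengths) ** 2)
    # Temporal smoothness: penalize finite-difference acceleration.
    accel = q[2:] - 2.0 * q[1:-1] + q[:-2]
    smooth = np.sum(accel ** 2)
    # Contact: foot height matches the terrain on labeled contact frames.
    contact = np.sum((q[contact_frames, foot_joint, 2] - terrain_height) ** 2)
    return data + 10.0 * bones + 5.0 * smooth + 20.0 * contact

res = minimize(kinematic_consistency_loss, noisy.ravel(), method="L-BFGS-B")
cleaned = res.x.reshape(T, J, 3)
print("loss before:", kinematic_consistency_loss(noisy.ravel()))
print("loss after :", res.fun)

In a full pipeline, a term like this would presumably be posed over the robot's joint-angle trajectory with its kinematic model, with contact penalties evaluated against the reconstructed terrain mesh rather than a constant height.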
@misc{zhang2026meshmimic,
title = {MeshMimic: Geometry-Aware Humanoid Motion Learning through 3D Scene Reconstruction},
author = {Zhang, Qiang and Ma, Jiahao and Liu, Peiran and Shi, Shuai and Su, Zeran and Wang, Zifan and Sun, Jingkai and Cui, Wei and Yu, Jialin and Han, Gang and Zhao, Wen and Sun, Pihai and Yin, Kangning and Wang, Jiaxu and Cao, Jiahang and Zhang, Lingfeng and Cheng, Hao and Hao, Xiaoshuai and Ji, Yiding and Liang, Junwei and Tang, Jian and Xu, Renjing and Guo, Yijie},
year = {2026},
eprint = {2602.15733},
archivePrefix = {arXiv},
primaryClass = {cs.RO},
url = {https://arxiv.org/abs/2602.15733}
}