OnlineHMR: Video-based Online World-Grounded Human Mesh Recovery

Yiwen Zhao1 Ce Zheng1,†,‡ Yufu Wang2 Hsueh-Han Daniel Yang1
Liting Wen1 László A. Jeni1,†
1 Carnegie Mellon University   2 University of Pennsylvania
† Corresponding authors   ‡ Project lead


Given a streaming video as input, OnlineHMR recovers the world-coordinate human mesh and camera trajectory in an online manner.

[Arxiv]      [Code]    

Interactive Gallery

Explore the 4D reconstruction results with varying numbers of individuals across diverse scenes.

Left Click Drag with left click to rotate view
Scroll Wheel Scroll to zoom in/out
Right Click Drag with right click to move view
W S Moving forward and backward
A D Moving left and right
Q E Moving upward and downward

(Point clouds are downsampled for efficient online rendering)

Abstract

Human mesh recovery (HMR) reconstructs the 3D human body from monocular videos, with recent works extending it to world-coordinate human trajectory and motion reconstruction. However, most existing methods remain offline, relying on future frames or global optimization, which limits their applicability in interactive feedback and perception-action loop scenarios such as AR/VR and telepresence. To address this, we propose OnlineHMR, a fully online framework that jointly satisfies four essential criteria of online processing: system-level causality, faithfulness, temporal consistency, and efficiency. Built upon a two-branch architecture, OnlineHMR enables streaming inference via a causal key–value cache design and a curated sliding-window learning strategy. Meanwhile, a human-centric incremental SLAM provides online world-grounded alignment under physically plausible trajectory correction. Experimental results show that our method achieves performance comparable to existing chunk-based approaches on the standard EMDB benchmark and on highly dynamic custom videos, while uniquely supporting online processing.

Causality

Leveraging information only from previous frames through a cache memory.
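The exact cache design is detailed in the paper; as a minimal sketch under stated assumptions, causal streaming attention with a key–value cache might look like the following (the `max_len` FIFO eviction policy is a hypothetical choice for illustration, not necessarily the paper's policy):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class CausalKVCache:
    """Minimal sketch of streaming attention with a key-value cache.

    Each new frame attends only to itself and previously cached frames,
    so inference never touches future frames. `max_len` bounds memory
    (hypothetical parameter; the paper's actual cache policy may differ).
    """
    def __init__(self, dim, max_len=64):
        self.dim = dim
        self.max_len = max_len
        self.keys = np.empty((0, dim))
        self.values = np.empty((0, dim))

    def step(self, q, k, v):
        # Append the current frame's key/value, evicting the oldest entry
        # once the cache exceeds max_len (simple FIFO policy).
        self.keys = np.vstack([self.keys, k])[-self.max_len:]
        self.values = np.vstack([self.values, v])[-self.max_len:]
        attn = softmax(q @ self.keys.T / np.sqrt(self.dim))
        return attn @ self.values

# Usage: per-frame streaming inference
cache = CausalKVCache(dim=4)
for t in range(3):
    feat = np.random.randn(1, 4)
    out = cache.step(feat, feat, feat)  # depends only on frames <= t
```

Because the cache holds only past frames, the output at each step is causal by construction: no future frame can influence it.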


Input: a streaming video. Output: world-coordinate streaming SMPL.
1) Camera coordinates: OnlineHMR estimates the parametric human model in camera space.
2) Human-Centric Incremental SLAM estimates the camera extrinsics.
3) Metric depth, together with SLAM depth, scales the SLAM outputs to the world frame.
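Step 3 can be sketched as a robust scale alignment between a metric depth prediction and the up-to-scale SLAM depth. This is a minimal illustration, not the paper's implementation; the median-ratio estimator and the `valid_mask` argument (e.g. to exclude dynamic human pixels) are assumptions:

```python
import numpy as np

def align_slam_scale(metric_depth, slam_depth, valid_mask=None):
    """Estimate a global scale aligning up-to-scale SLAM depth to metric depth.

    Sketch of step 3: the median per-pixel ratio between a metric depth map
    and the SLAM depth map gives a robust global scale factor. `valid_mask`
    (hypothetical) can exclude dynamic human pixels from the estimate.
    """
    if valid_mask is None:
        valid_mask = np.ones_like(slam_depth, dtype=bool)
    valid = valid_mask & (slam_depth > 1e-6)  # ignore invalid/zero depth
    return float(np.median(metric_depth[valid] / slam_depth[valid]))

# Usage: world-frame camera translation = scale * SLAM translation
```

The median is preferred over the mean here because it is robust to outlier pixels (e.g. depth errors on object boundaries).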

Faithfulness

Quantitatively, we compare our results with previous and concurrent offline/online methods on the camera-coordinate benchmarks 3DPW and EMDB1, and the world-coordinate benchmark EMDB2. Qualitatively, we test our method on custom art and sports videos.

0:00-1:20, custom videos. 1:20-2:58, EMDB2.

Temporal Consistency

To alleviate camera-coordinate human jittering caused by online adaptation, we curate a sliding-window learning pipeline integrated with velocity regularization. The large proportion of dynamic humans in SLAM is handled by a soft human mask, and camera extrinsics are enforced to be smooth through physical constraints. Together, these yield a more natural world-coordinate human motion.
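The exact form of the velocity regularizer is given in the paper; as one plausible sketch, a smoothness loss over a sliding window of 3D joints could penalize frame-to-frame velocity changes (i.e. accelerations). The `(T, J, 3)` layout and the squared-acceleration form are assumptions for illustration:

```python
import numpy as np

def velocity_regularization(joints):
    """Penalize frame-to-frame joint velocity changes inside a sliding window.

    `joints` is a (T, J, 3) array of 3D joints over a window of T frames.
    The loss is the mean squared acceleration (finite difference of
    velocities), which is zero for constant-velocity motion and grows
    with jitter, encouraging temporally smooth trajectories.
    """
    vel = joints[1:] - joints[:-1]   # (T-1, J, 3) per-frame velocities
    acc = vel[1:] - vel[:-1]         # (T-2, J, 3) per-frame accelerations
    return float(np.mean(acc ** 2))
```

Note that this penalizes acceleration rather than velocity itself, so smooth motion is not discouraged, only abrupt changes in it.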

Efficiency

We report two metrics: frames per second (FPS) and average delay time in seconds per frame (Avg. Delay). Results show that online methods have significantly lower delay than offline methods. Although our OnlineHMR has the highest Avg. Delay among the online methods, it achieves better accuracy in both camera-coordinate and world-coordinate results; a detailed analysis is provided in our paper.

Acknowledgements

We borrow this template from MonST3R. We thank Shubham Tulsiani and Michael Kaess for their wonderful lectures and insightful feedback, Aniket Agarwal for early-stage brainstorming, Taru Rustagi, Ananya Bal, Joel Julin, Kallol Saha, Hongwen Zhang, Zihan Wang, Jiatong Shi for helpful discussion, Nathalie Chang, Rena Ju for early-stage survey, and the anonymous reviewers for suggestions.