EgoForce

Forearm-Guided Camera-Space 3D Hand Pose from a Monocular Egocentric Camera

¹German Research Center for Artificial Intelligence (DFKI), ²Rhineland-Palatinate Technical University of Kaiserslautern-Landau (RPTU), ³Max Planck Institute for Informatics (MPII)

EgoForce estimates the camera-space hand–forearm mesh from a monocular front-facing head-mounted camera, making it well suited to smart glasses.

Demo Results of EgoForce with Project Aria Glasses

Video


Abstract

Reconstructing the absolute 3D pose and shape of the hands from the user’s viewpoint using a single head-mounted camera is crucial for practical egocentric interaction in AR/VR, telepresence, and hand-centric manipulation tasks, where sensing must remain compact and unobtrusive. While monocular RGB methods have made progress, they remain constrained by depth–scale ambiguity and struggle to generalize across the diverse optical configurations of head-mounted devices. As a result, models typically require extensive training on device-specific datasets, which are costly and laborious to acquire. This paper addresses these challenges by introducing EgoForce, a monocular 3D hand reconstruction framework that recovers robust, absolute 3D hand pose and position in the user’s (camera-space) viewpoint. EgoForce operates across fisheye, perspective, and distorted wide-FOV camera models using a single unified network. Our approach combines a differentiable forearm representation that stabilizes hand pose, a unified arm–hand transformer that predicts both hand and forearm geometry from a single egocentric view, mitigating depth–scale ambiguity, and a ray-space closed-form solver that enables absolute 3D pose recovery across diverse head-mounted camera models. Experiments on three egocentric benchmarks show that EgoForce achieves state-of-the-art 3D accuracy, reducing camera-space MPJPE by up to 28% on the HOT3D dataset compared to prior methods while maintaining consistent performance across camera configurations.
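The cross-device claim rests on reasoning in ray space: any calibrated camera model, fisheye or perspective, reduces a pixel to a viewing ray, so the downstream geometry can be shared across devices. Below is a minimal NumPy sketch of that unprojection step; the two camera models, the helper names, and the equidistant fisheye choice are illustrative assumptions, not the paper's implementation (real headsets often use richer models such as Kannala-Brandt).

import numpy as np

def rays_perspective(uv, K):
    # Unproject pixels to unit viewing rays under a pinhole model.
    # uv: (N, 2) pixel coordinates; K: (3, 3) intrinsics.
    ones = np.ones((uv.shape[0], 1))
    d = np.linalg.inv(K) @ np.concatenate([uv, ones], axis=1).T   # (3, N)
    return (d / np.linalg.norm(d, axis=0)).T                      # (N, 3)

def rays_equidistant_fisheye(uv, f, c):
    # Unproject pixels under an equidistant fisheye model (r = f * theta).
    # f: focal length in pixels; c: (2,) principal point.
    xy = (uv - c) / f
    r = np.linalg.norm(xy, axis=1, keepdims=True)   # radius equals theta here
    s = np.where(r > 1e-9, np.sin(r) / np.maximum(r, 1e-9), 1.0)
    return np.concatenate([xy * s, np.cos(r)], axis=1)            # unit rays

Whatever the device, both functions return the same representation, unit rays in the camera frame, which is exactly what a camera-agnostic solver needs.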

Architecture

Pipeline

EgoForce processes a monocular egocentric RGB frame by extracting hand and forearm crops, tokenizing them, and conditioning the features on crop intrinsics (CIT). A transformer jointly infers hand–arm features to predict 2D keypoints (with confidences) and root-relative 3D hand and arm poses, which are lifted to camera-space meshes via the ray-space solver. When the forearm is out of view, arm tokens are replaced with missing-arm tokens, and a hand-conditioned variational prior infers a plausible arm representation. We apply this workflow independently to the left and right hand–forearm crops.
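One plausible reading of the ray-space closed-form solver described above: each predicted 2D keypoint with confidence w_i fixes a viewing ray d_i, the transformer supplies a root-relative 3D joint X_i, and the camera-space root translation t is the weighted least-squares translation that places every X_i + t on its ray, i.e. it minimizes sum_i w_i * ||(I - d_i d_i^T)(X_i + t)||^2. That objective reduces to a single 3x3 linear system. The sketch below is an assumption consistent with this description, not the authors' code.

import numpy as np

def solve_root_translation(X_rel, rays, conf):
    # X_rel: (N, 3) root-relative 3D joints from the network.
    # rays:  (N, 3) unit viewing rays of the 2D keypoints (any camera model).
    # conf:  (N,)   keypoint confidences.
    # Solves min_t sum_i conf_i * ||(I - d_i d_i^T)(X_i + t)||^2 in closed form.
    I = np.eye(3)
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for d, x, w in zip(rays, X_rel, conf):
        P = I - np.outer(d, d)       # projector onto the plane normal to the ray
        A += w * P
        b -= w * (P @ x)
    return np.linalg.solve(A, b)     # camera-space root translation t

Because the rays already absorb the camera model, the same 3x3 solve covers fisheye, perspective, and distorted wide-FOV inputs, and low-confidence keypoints naturally contribute less to the system.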

Visualization Results

We compare our qualitative results with other approaches on the three egocentric benchmarks: HOT3D, ARCTIC, and H2O. Our method handles not only single-object cases (second and third rows) but also complex multi-object scenes (first row) with challenging occlusions.

Camera-Space Trajectory Comparison

Camera-space hand trajectories: HandDGP vs. Ours.

Ablation Study

BibTeX

@inproceedings{millerdurai2026egoforce,
  title     = {EgoForce: Forearm-Guided Camera-Space 3D Hand Pose from a Monocular Egocentric Camera},
  author    = {Millerdurai, Christen and Wang, Shaoxiang and Xie, Yaxu and Golyanik, Vladislav and Stricker, Didier and Pagani, Alain},
  booktitle = {Proceedings of the ACM SIGGRAPH (Conference Track)},
  year      = {2026}
}

Acknowledgements

This work was partially funded by the Horizon Europe programme under the projects dAIEDGE, Grant Agreement No. 101120726, and IRIS-XR, Grant Agreement No. 101298672. The authors thank the anonymous reviewers for their valuable feedback.