EgoForce

Forearm-Guided Camera-Space 3D Hand Pose from a Monocular Egocentric Camera

¹German Research Center for Artificial Intelligence (DFKI), ²Rhineland-Palatinate Technical University of Kaiserslautern-Landau (RPTU), ³Max Planck Institute for Informatics (MPII)

EgoForce estimates the camera-space hand–forearm mesh from a monocular front-facing head-mounted camera, making it well suited to smart glasses.

Demo Results of EgoForce with Project Aria Glasses

Video


Abstract

Reconstructing the absolute 3D pose and shape of the hands from the user’s viewpoint using a single head-mounted camera is crucial for practical egocentric interaction in AR/VR, telepresence, and hand-centric manipulation tasks, where sensing must remain compact and unobtrusive. While monocular RGB methods have made progress, they remain constrained by depth–scale ambiguity and struggle to generalize across the diverse optical configurations of head-mounted devices. As a result, models typically require extensive training on device-specific datasets, which are costly and laborious to acquire. This paper addresses these challenges by introducing EgoForce, a monocular 3D hand reconstruction framework that recovers robust, absolute 3D hand pose and position in the user’s (camera-space) viewpoint. EgoForce operates across fisheye, perspective, and distorted wide-FOV camera models using a single unified network. Our approach combines a differentiable forearm representation that stabilizes hand pose, a unified arm–hand transformer that predicts both hand and forearm geometry from a single egocentric view, mitigating depth–scale ambiguity, and a ray-space closed-form solver that enables absolute 3D pose recovery across diverse head-mounted camera models. Experiments on three egocentric benchmarks show that EgoForce achieves state-of-the-art 3D accuracy, reducing camera-space MPJPE by up to 28% on the HOT3D dataset compared to prior methods while maintaining consistent performance across camera configurations.
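The cross-device claim rests on reasoning in ray space: any calibrated camera model, fisheye or perspective, reduces a pixel to a viewing ray, so the downstream geometry can be shared across devices. Below is a minimal NumPy sketch of that unprojection step; the two camera models, the helper names, and the equidistant fisheye choice are illustrative assumptions, not the paper's implementation (real headsets often use richer models such as Kannala-Brandt).

import numpy as np

def rays_perspective(uv, K):
    # Unproject pixels to unit viewing rays under a pinhole model.
    # uv: (N, 2) pixel coordinates; K: (3, 3) intrinsics.
    ones = np.ones((uv.shape[0], 1))
    d = np.linalg.inv(K) @ np.concatenate([uv, ones], axis=1).T   # (3, N)
    return (d / np.linalg.norm(d, axis=0)).T                      # (N, 3)

def rays_equidistant_fisheye(uv, f, c):
    # Unproject pixels under an equidistant fisheye model (r = f * theta).
    # f: focal length in pixels; c: (2,) principal point.
    xy = (uv - c) / f
    r = np.linalg.norm(xy, axis=1, keepdims=True)   # radius equals theta here
    s = np.where(r > 1e-9, np.sin(r) / np.maximum(r, 1e-9), 1.0)
    return np.concatenate([xy * s, np.cos(r)], axis=1)            # unit rays

Whatever the device, both functions return the same representation, unit rays in the camera frame, which is exactly what a camera-agnostic solver needs.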

Architecture

Pipeline

EgoForce processes a monocular egocentric RGB frame by extracting hand and forearm crops, tokenizing them, and conditioning the features on crop intrinsics (CIT). A transformer jointly infers hand–arm features to predict 2D keypoints (with confidences) and root-relative 3D hand and arm poses, which are lifted to camera-space meshes via the ray-space solver. When the forearm is out of view, arm tokens are replaced with missing-arm tokens, and a hand-conditioned variational prior infers a plausible arm representation. We apply this workflow independently to the left and right hand–forearm crops.
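One plausible reading of the ray-space closed-form solver described above: each predicted 2D keypoint with confidence w_i fixes a viewing ray d_i, the transformer supplies a root-relative 3D joint X_i, and the camera-space root translation t is the weighted least-squares translation that places every X_i + t on its ray, i.e. it minimizes sum_i w_i * ||(I - d_i d_i^T)(X_i + t)||^2. That objective reduces to a single 3x3 linear system. The sketch below is an assumption consistent with this description, not the authors' code.

import numpy as np

def solve_root_translation(X_rel, rays, conf):
    # X_rel: (N, 3) root-relative 3D joints from the network.
    # rays:  (N, 3) unit viewing rays of the 2D keypoints (any camera model).
    # conf:  (N,)   keypoint confidences.
    # Solves min_t sum_i conf_i * ||(I - d_i d_i^T)(X_i + t)||^2 in closed form.
    I = np.eye(3)
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for d, x, w in zip(rays, X_rel, conf):
        P = I - np.outer(d, d)       # projector onto the plane normal to the ray
        A += w * P
        b -= w * (P @ x)
    return np.linalg.solve(A, b)     # camera-space root translation t

Because the rays already absorb the camera model, the same 3x3 solve covers fisheye, perspective, and distorted wide-FOV inputs, and low-confidence keypoints naturally contribute less to the system.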

Visualization Results

We compare our qualitative results with other approaches on the three egocentric benchmarks: HOT3D, ARCTIC, and H2O. Our method handles not only single-object cases (second and third rows) but also complex multi-object scenes (first row) with challenging occlusions.

Camera-Space Trajectory Comparison

Camera-space hand trajectories: HandDGP vs. Ours.

Ablation Study

BibTeX

@inproceedings{millerdurai2026egoforce,
  title     = {EgoForce: Forearm-Guided Camera-Space 3D Hand Pose from a Monocular Egocentric Camera},
  author    = {Millerdurai, Christen and Wang, Shaoxiang and Xie, Yaxu and Golyanik, Vladislav and Stricker, Didier and Pagani, Alain},
  booktitle = {Proceedings of the ACM SIGGRAPH (Conference Track)},
  year      = {2026}
}

Acknowledgements

This work was partially funded by the Horizon Europe programme under the projects dAIEDGE, Grant Agreement No. 101120726, and IRIS-XR, Grant Agreement No. 101298672. The authors thank the anonymous reviewers for their valuable feedback.