Collaborative Learning for Hand and Object Reconstruction with Attention-guided Graph Convolution

Tze Ho Elden Tse
Kwang In Kim
Ales Leonardis
Hyung Jin Chang

University of Birmingham, UNIST
CVPR 2022

[Paper]
[Poster]

We propose a collaborative learning framework that iteratively shares mesh information between the hand and object branches. Our model jointly reconstructs hand and object meshes from a monocular RGB image.

Abstract

Estimating the pose and shape of hands and objects under interaction has numerous applications, including augmented and virtual reality. Existing approaches to hand and object reconstruction require explicitly defined physical constraints and known objects, which limits their application domains. Our algorithm is agnostic to object models and instead learns the physical rules governing hand-object interaction. This requires automatically inferring the shapes and physical interaction of hands and (potentially unknown) objects. We approach this challenging problem with a collaborative learning strategy in which two branches of deep networks learn from each other. Specifically, we transfer hand mesh information to the object branch, and object mesh information to the hand branch. The resulting optimisation (training) problem can be unstable, and we address this via two strategies: (i) an attention-guided graph convolution, which helps identify and focus on mutual occlusion, and (ii) an unsupervised associative loss, which facilitates the transfer of information between the branches. Experiments on four widely used benchmarks show that our framework surpasses state-of-the-art accuracy in 3D pose estimation and also recovers dense 3D hand and object shapes. Ablation studies confirm that each technical component contributes meaningfully.
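To make the attention-guided graph convolution idea concrete, below is a minimal PyTorch sketch of one such layer: vertex features of one branch are aggregated over the mesh graph, while cross-attention to the other branch's vertices supplies the guidance toward likely contact and occlusion regions. The class name AttnGraphConv, the tensor shapes, and the residual fusion scheme are illustrative assumptions, not the authors' exact implementation.

# Sketch of an attention-guided graph convolution block (assumed design,
# not the paper's released code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnGraphConv(nn.Module):
    """Graph convolution over mesh vertices, modulated by cross-attention
    to vertex features from the other (hand or object) branch."""

    def __init__(self, in_dim, out_dim, ctx_dim):
        super().__init__()
        self.gcn_weight = nn.Linear(in_dim, out_dim)  # per-vertex transform
        self.query = nn.Linear(in_dim, out_dim)
        self.key = nn.Linear(ctx_dim, out_dim)
        self.value = nn.Linear(ctx_dim, out_dim)

    def forward(self, x, adj, ctx):
        # x:   (B, N, in_dim)  vertex features of this branch
        # adj: (N, N)          row-normalised mesh adjacency (with self-loops)
        # ctx: (B, M, ctx_dim) vertex features from the other branch
        h = adj @ self.gcn_weight(x)  # neighbourhood aggregation on the mesh

        # Cross-attention: each vertex attends to the other branch's vertices,
        # letting the layer focus on mutual occlusion / contact regions.
        q, k, v = self.query(x), self.key(ctx), self.value(ctx)
        attn = F.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        return F.relu(h + attn @ v)  # fuse graph message with attended context

if __name__ == "__main__":
    B, N_hand, N_obj, D = 2, 778, 1000, 64  # 778 = MANO hand vertex count
    adj = torch.eye(N_hand)                 # placeholder adjacency matrix
    layer = AttnGraphConv(D, D, D)
    out = layer(torch.randn(B, N_hand, D), adj, torch.randn(B, N_obj, D))
    print(out.shape)                        # torch.Size([2, 778, 64])

In the full framework, one such layer would sit in each branch, so that hand and object meshes exchange information in both directions at every refinement iteration.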


Video



Paper and Supplementary Material

Tze Ho Elden Tse, Kwang In Kim, Ales Leonardis, and Hyung Jin Chang
Collaborative Learning for Hand and Object Reconstruction with Attention-guided Graph Convolution
In CVPR, 2022.


[Paper]
[Supplementary]


Acknowledgements

This research was supported by the Ministry of Science and ICT, Korea, under the Information Technology Research Center (ITRC) support program (IITP-2022-2020-0-01789) supervised by the Institute for Information & Communications Technology Planning & Evaluation (IITP) and an IITP grant (2021-0-00537). The computations described in this research were performed using the Baskerville Tier 2 HPC service (https://www.baskerville.ac.uk/) that was funded by EPSRC Grant EP/T022221/1 and is operated by Advanced Research Computing at the University of Birmingham. KIK was supported by the National Research Foundation of Korea (NRF) grant (No. 2021R1A2C2012195) funded by the Korea government (MSIT). This template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project; the code can be found here.