Efficient hand tracking for Quest 2 VR headset
Mediapipe
MMPose
OpenCV
ARKit
Quest 2 has 4 spaced-out fish-eye cameras
And they have overlaps
All of this typical 2D training data is useless
Which leads them to creating their own dataset
The RGB-D camera estimates the GT 3D landmarks
The 3D landmarks are projected into the fish-eye cameras to get the fish-eye GT hand landmarks
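A minimal sketch of that projection step. The notes don't say which lens model or calibration the authors use, so the equidistant fish-eye model (image radius = focal length × angle from the optical axis) and the names `to_camera_frame`, `project_equidistant`, `focal_px`, `cx`, `cy` are all assumptions for illustration.

```python
import math

def to_camera_frame(point_world, rotation, translation):
    """Rigid transform p_cam = R * p_world + t, with R given as 3 rows."""
    return tuple(
        sum(rotation[i][j] * point_world[j] for j in range(3)) + translation[i]
        for i in range(3)
    )

def project_equidistant(point_cam, focal_px, cx, cy):
    """Project a 3D point (camera frame, z forward) to fish-eye pixel coordinates.

    Assumes the equidistant model; a real headset camera would also need
    calibrated distortion terms.
    """
    x, y, z = point_cam
    r_xy = math.hypot(x, y)
    if r_xy == 0.0:
        return (cx, cy)  # point on the optical axis lands on the principal point
    theta = math.atan2(r_xy, z)   # angle between the ray and the optical axis
    r_img = focal_px * theta      # equidistant: radius grows linearly with angle
    return (cx + r_img * x / r_xy, cy + r_img * y / r_xy)
```

Running the same 3D landmark through each camera's own rotation and translation gives one 2D label per fish-eye view.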
Fish-eye cameras
RGB-D Camera
This is for Hand Landmark estimation, when we already have a crop
But this is too specialized for Hand Detection
We need to annotate data for in-the-wild images
1. Someone annotates the BBox for the first frame
2. It's used to estimate landmarks in the first 2 frames
3. Estimate next frame's landmarks using the trajectory
4. Wrap a BBox around the new landmarks
5. Predict using NN with the BBox
...
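The loop above can be sketched roughly like this. The constant-velocity extrapolation, the 20% box margin, and the `predict_landmarks` callback (a stand-in for the keypoint network) are assumptions, not details from the source.

```python
def extrapolate_landmarks(prev, curr):
    """Constant-velocity guess for the next frame: next = curr + (curr - prev)."""
    return [(2 * cx - px, 2 * cy - py) for (px, py), (cx, cy) in zip(prev, curr)]

def bbox_from_landmarks(landmarks, margin=0.2):
    """Tight box around the landmarks, padded by a fraction of its size."""
    xs = [x for x, _ in landmarks]
    ys = [y for _, y in landmarks]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    pad_x = margin * (x1 - x0)
    pad_y = margin * (y1 - y0)
    return (x0 - pad_x, y0 - pad_y, x1 + pad_x, y1 + pad_y)

def propagate(first_two, predict_landmarks, num_frames):
    """Label a sequence starting from the landmarks of its first two frames.

    predict_landmarks(frame_index, bbox) is a hypothetical callback that runs
    the keypoint network on the crop given by bbox.
    """
    tracks = list(first_two)
    for t in range(2, num_frames):
        guess = extrapolate_landmarks(tracks[-2], tracks[-1])
        bbox = bbox_from_landmarks(guess)
        tracks.append(predict_landmarks(t, bbox))
    return tracks
```

Only the first frame needs a human; after that each box comes from the previous predictions.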
1. Hand detection
2. Keypoint estimation
3. Pose reconstruction
Every frame, very smooth
...
No need to run DetNet on every camera
Run once, get 3D keypoints, project them to all cameras
After the first BBox is detected, same idea as in annotation
No need to run DetNet each frame
The mesh is a rigged parametric model
Levenberg-Marquardt solver optimizes the pose for each frame
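To make the solver idea concrete, here is a toy Levenberg-Marquardt fit. It is not the authors' hand model: the planar two-bone "finger", the bone lengths in `BONES`, the finite-difference Jacobian, and the damping schedule are all assumptions chosen to keep the sketch small.

```python
import math

BONES = (1.0, 0.8)  # hypothetical bone lengths of a planar two-joint finger

def forward(angles):
    """Keypoints [x1, y1, x2, y2] of the two joints for angles (a, b)."""
    a, b = angles
    x1, y1 = BONES[0] * math.cos(a), BONES[0] * math.sin(a)
    return [x1, y1, x1 + BONES[1] * math.cos(a + b), y1 + BONES[1] * math.sin(a + b)]

def residuals(angles, target):
    return [p - t for p, t in zip(forward(angles), target)]

def jacobian(angles, target, eps=1e-6):
    """Forward-difference Jacobian; cols[k][i] = d residual_i / d angle_k."""
    base = residuals(angles, target)
    cols = []
    for k in range(2):
        shifted = list(angles)
        shifted[k] += eps
        cols.append([(r - b) / eps for r, b in zip(residuals(shifted, target), base)])
    return cols

def levenberg_marquardt(target, angles=(0.1, 0.1), damping=1e-2, iters=50):
    angles = list(angles)
    cost = sum(r * r for r in residuals(angles, target))
    for _ in range(iters):
        r = residuals(angles, target)
        J = jacobian(angles, target)
        a00 = sum(v * v for v in J[0])
        a11 = sum(v * v for v in J[1])
        a01 = sum(u * v for u, v in zip(J[0], J[1]))
        g0 = sum(u * v for u, v in zip(J[0], r))
        g1 = sum(u * v for u, v in zip(J[1], r))
        # Damped normal equations: (J^T J + damping * diag(J^T J)) step = -J^T r
        m00, m11 = a00 * (1 + damping), a11 * (1 + damping)
        det = m00 * m11 - a01 * a01
        if det == 0.0:
            break
        s0 = (-g0 * m11 + g1 * a01) / det
        s1 = (-g1 * m00 + g0 * a01) / det
        trial = [angles[0] + s0, angles[1] + s1]
        trial_cost = sum(v * v for v in residuals(trial, target))
        if trial_cost < cost:
            angles, cost = trial, trial_cost
            damping = max(damping / 10, 1e-9)  # good step: act more like Gauss-Newton
        else:
            damping *= 10                      # bad step: act more like gradient descent
    return angles, cost
```

In the real pipeline the residuals would compare the posed hand model against the predicted keypoints, with many more pose parameters than two.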
When the hand is visible from all cameras, a hand scale parameter is optimized
stereo region
This is a heavy operation
So the authors do this before hand tracking is even used
Keypoint estimator only predicts distance relative to the hand center
Actual hand distance is estimated after the model fitting
Hand size is fixed
So we can adjust the distance
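Why a fixed hand size pins down the distance, as a back-of-the-envelope sketch. It assumes a plain pinhole camera and a bone roughly parallel to the image plane, which is a simplification: the real system uses fish-eye cameras and gets the distance from model fitting. Function and parameter names are hypothetical.

```python
def depth_from_known_length(focal_px, real_length_m, pixel_length):
    """Similar triangles: an object of known size looks smaller the farther it is."""
    if pixel_length <= 0:
        raise ValueError("pixel_length must be positive")
    return focal_px * real_length_m / pixel_length

def absolute_depths(root_depth_m, relative_depths_m):
    """Turn per-keypoint depths relative to the hand center into absolute depths."""
    return [root_depth_m + d for d in relative_depths_m]
```

For example, a 0.1 m bone that spans 100 px under a 500 px focal length sits about 0.5 m away; the relative keypoint depths are then offset from that.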
They don't need a general-purpose hand detector
So, by design, they only output detections for left/right hand
Keypoints are estimated via regressing heatmaps
Then doing a weighted average of pixel positions
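A small sketch of the weighted-average readout. It assumes the heatmap is a list of rows of non-negative values and normalizes by their sum; the actual network's normalization is not specified in these notes.

```python
def keypoint_from_heatmap(heatmap):
    """Return the (x, y) location as the heatmap-weighted mean of pixel coordinates."""
    total = sum(sum(row) for row in heatmap)
    if total <= 0:
        raise ValueError("heatmap has no mass")
    x = sum(v * col for row in heatmap for col, v in enumerate(row)) / total
    y = sum(v * r for r, row in enumerate(heatmap) for v in row) / total
    return (x, y)
```

Unlike taking the argmax, this gives sub-pixel positions: two equally hot neighboring pixels produce a point halfway between them.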
The next-frame "extrapolation" is used as input to the model
No need to train for both hands lol
Just train for one, then reflect
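The reflection trick, sketched with images as lists of pixel rows. Which hand the network is actually trained on is an assumption here (right), and `right_hand_model` is a hypothetical callable returning `(x, y)` keypoints.

```python
def mirror_image(image):
    """Flip an image (list of pixel rows) left to right."""
    return [list(reversed(row)) for row in image]

def mirror_keypoints(keypoints, width):
    """Map x coordinates back across the vertical axis of a width-pixel image."""
    return [(width - 1 - x, y) for x, y in keypoints]

def predict_left_hand(image, right_hand_model):
    """Run a right-hand-only model on a left hand by mirroring in and out."""
    width = len(image[0])
    flipped = mirror_image(image)
    return mirror_keypoints(right_hand_model(flipped), width)
```

A mirrored left hand looks like a right hand, so one set of weights covers both.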
Smoothness isn't enforced in keypoint estimation
Only during mesh fitting
During training, the extrapolated keypoints are sometimes passed in with errors on purpose
Otherwise the network just copies the input
Lots and lots of augmentations
And keypoint dropouts
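One way the perturbation and dropout could look during training. The Gaussian noise, the 3 px scale, the 10% drop rate, and the use of `None` for a dropped keypoint are all assumptions; the notes give no values.

```python
import random

def corrupt_extrapolation(keypoints, rng, noise_px=3.0, drop_prob=0.1):
    """Perturb or drop extrapolated keypoints before feeding them to the network.

    Returns a list the same length as keypoints, with None for dropped entries.
    """
    corrupted = []
    for x, y in keypoints:
        if rng.random() < drop_prob:
            corrupted.append(None)  # keypoint dropout
        else:
            corrupted.append((x + rng.gauss(0.0, noise_px), y + rng.gauss(0.0, noise_px)))
    return corrupted
```

Passing a seeded `random.Random` keeps the corruption reproducible across runs.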
Nice.