Given a skeleton sequence S with T frames, under the global coordinate system, we denote the set of joints in the t-th frame as
Contrastive Learning Framework:
NTU-RGBD | NTU-RGBD-120 |
---|---|
56,880 action samples in 60 action classes performed by 40 distinct subjects | 114,480 action samples in 120 action classes performed by 106 distinct subjects |
Kinect V2 | Kinect V2 |
3 cameras at different horizontal angles: −45°, 0°, 45° | 32 setups, and every setup has a specific location and background |
Two protocols: 1) Cross-Subject (X-sub): training data comes from 20 subjects, and the remaining 20 subjects are used for validation. 2) Cross-View (X-view): training data comes from the cameras at 0° and 45°, and validation data comes from the camera at −45° | Two protocols: 1) Cross-Subject (X-sub): training data comes from 53 subjects, and the remaining 53 subjects are used for validation. 2) Cross-Setup (X-setup): all samples with even setup IDs are used for training, and the remaining samples with odd setup IDs are used for validation |
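
The evaluation protocols in the table can be sketched as a simple partition over per-sample metadata. The following is a minimal illustration, not the authors' data loader: the `Sample` record, its field names, and the `is_training` and `split` helpers are hypothetical, and the training subject IDs and camera IDs are passed in as parameters because the table does not list them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    # Hypothetical metadata record; field names are illustrative only.
    setup_id: int
    camera_id: int
    subject_id: int


def is_training(sample, protocol, train_subjects=frozenset(), train_cameras=frozenset()):
    """Return True if `sample` belongs to the training split of `protocol`."""
    if protocol == "xsub":
        # Cross-Subject: split by performer identity.
        return sample.subject_id in train_subjects
    if protocol == "xview":
        # Cross-View: split by camera (the caller supplies the training camera IDs).
        return sample.camera_id in train_cameras
    if protocol == "xsetup":
        # Cross-Setup: even setup IDs train, odd setup IDs validate.
        return sample.setup_id % 2 == 0
    raise ValueError(f"unknown protocol: {protocol}")


def split(samples, protocol, **kwargs):
    """Partition `samples` into (train, validation) lists, preserving order."""
    train = [s for s in samples if is_training(s, protocol, **kwargs)]
    val = [s for s in samples if not is_training(s, protocol, **kwargs)]
    return train, val
```

For example, `split(samples, "xsetup")` needs no extra arguments, whereas `split(samples, "xsub", train_subjects={...})` expects the caller to supply the training subject IDs defined by the dataset.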
1) Datasets:
2) Model Setting:
To understand the effect of each individual skeleton transformation and the importance of composing transformations, the experiment is repeated multiple times