Decoupling GCN with DropGraph Module for Skeleton-Based Action Recognition

The main contributions discussed in the paper are:

1. DC-GCN (Decoupling GCN):

It efficiently enhances the expressiveness of GCNs with zero extra computation cost.
It is inspired by the decoupling aggregation in CNNs.

2. ADG (Attention-Guided DropGraph):

Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition

The receptive field in Convolutional Neural Networks (CNN) is the region of the input space that affects a particular unit of the network.
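As a rough illustration of this definition, the sketch below computes the receptive field of a stack of convolution layers using the standard recurrence (each layer widens the field by `(kernel - 1)` times the accumulated stride). The function name and the layer list are hypothetical and only serve the example.

```python
def receptive_field(layers):
    """Receptive field (in input units) of stacked 1-D convolutions.

    `layers` is a list of (kernel_size, stride) pairs, first layer first.
    """
    field = 1  # a single input unit sees only itself
    jump = 1   # distance in input units between adjacent outputs
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
    return field

# Three stacked 3-wide, stride-1 convolutions
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
```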

DropBlock: A regularization method for convolutional networks

The main idea of DropGraph is that when we drop one node, we drop its neighbor node set together with it.
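The idea can be sketched in a few lines of plain Python. This is a minimal sketch, not the authors' implementation. It assumes the graph is given as an adjacency dictionary, that each node becomes a drop seed with probability `gamma`, and that a seed is expanded to its `k`-hop neighborhood. The names `k_hop_neighborhood`, `drop_graph_mask`, and the 5-joint chain are hypothetical.

```python
import random

def k_hop_neighborhood(adj, node, k):
    """Return `node` plus every node reachable from it in at most k hops."""
    visited = {node}
    frontier = {node}
    for _ in range(k):
        # Expand one hop, keeping only nodes not seen before
        frontier = {nbr for cur in frontier for nbr in adj[cur]} - visited
        visited |= frontier
    return visited

def drop_graph_mask(adj, gamma, k, rng=random):
    """Build a 0/1 keep-mask over nodes (assumed sampling scheme).

    Each node becomes a drop seed with probability `gamma`; every seed
    is dropped together with its k-hop neighborhood.
    """
    dropped = set()
    for node in adj:
        if rng.random() < gamma:
            dropped |= k_hop_neighborhood(adj, node, k)
    return {node: 0 if node in dropped else 1 for node in adj}

# Hypothetical 5-joint chain: 0 - 1 - 2 - 3 - 4
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(sorted(k_hop_neighborhood(chain, 2, 1)))  # [1, 2, 3]
```

Dropping a connected patch rather than isolated nodes mirrors DropBlock, where a contiguous region of a feature map is removed so that neighboring units cannot trivially fill in the missing information.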

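The expected number of nodes in the i-th order neighborhood of a randomly sampled node is given by

B_i = d_{avg} \times (d_{avg}-1)^{i-1}

where d_{avg} is the average degree of a graph with n nodes and e edges:

d_{avg} = 2e/n

For example, if i = 3 and the graph gives d_{avg} = 3:

B_3 = 3(3-1)^2 = 12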
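The estimate can be checked numerically with a small sketch. It assumes the general form B_i = d_avg (d_avg - 1)^(i-1), which reproduces the worked value of 12 for i = 3 and d_avg = 3. The function names are hypothetical.

```python
def average_degree(num_nodes, num_edges):
    """d_avg = 2e / n: every edge contributes to the degree of two nodes."""
    return 2 * num_edges / num_nodes

def expected_neighbors(d_avg, order):
    """Assumed estimate B_i = d_avg * (d_avg - 1) ** (i - 1).

    A node has d_avg first-order neighbors on average; each further hop
    fans out to (d_avg - 1) new nodes, since one edge leads back.
    """
    return d_avg * (d_avg - 1) ** (order - 1)

print(expected_neighbors(3, 3))  # 12
```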

The average expanded drop size is estimated as:

For conventional Dropout:

\gamma = 1-keep\_prob

For DropGraph:
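The two rates can be compared in a short sketch. It rests on two labeled assumptions: that the expanded drop size is the seed node plus its expected neighbors of order 1 to K, and that, by analogy with DropBlock, the DropGraph seed rate is the Dropout rate divided by that drop size. All function names are hypothetical.

```python
def expected_neighbors(d_avg, order):
    """Assumed estimate B_i = d_avg * (d_avg - 1) ** (i - 1)."""
    return d_avg * (d_avg - 1) ** (order - 1)

def drop_size(d_avg, k):
    """Assumption: the seed node itself plus expected neighbors of order 1..k."""
    return 1 + sum(expected_neighbors(d_avg, i) for i in range(1, k + 1))

def dropout_gamma(keep_prob):
    """Conventional Dropout: gamma = 1 - keep_prob."""
    return 1 - keep_prob

def dropgraph_gamma(keep_prob, d_avg, k):
    """Assumption: each seed removes drop_size nodes, so the seed rate is
    scaled down to keep the overall fraction of dropped nodes comparable."""
    return dropout_gamma(keep_prob) / drop_size(d_avg, k)

print(drop_size(3, 1))  # 4
print(dropgraph_gamma(0.9, 3, 1))
```

Under these assumptions, a larger K or a denser graph lowers the seed rate, because each seed then removes more nodes.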

Attention-guided drop mechanism:

Dataset	NTU-RGBD	NTU-RGBD-120	Northwestern-UCLA
Size	56,880 action samples in 60 action classes, performed by 40 distinct subjects	114,480 action samples in 120 action classes, performed by 106 distinct subjects	1,494 video clips covering 10 categories, performed by 10 different subjects
Sensor	Kinect V2	Kinect V2	Kinect
Camera setup	3 cameras at different horizontal angles: −45°, 0°, 45°	32 setups, each with a specific location and background	Captured by three Kinect cameras
Evaluation protocols	Two protocols: 1) Cross-Subject (X-sub): training data comes from 20 subjects, and the remaining 20 subjects are used for validation. 2) Cross-View (X-view): training data comes from the cameras at 0° and 45°, and validation data comes from the camera at −45°	Two protocols: 1) Cross-Subject (X-sub): training data comes from 53 subjects, and the remaining 53 subjects are used for validation. 2) Cross-Setup (X-setup): all samples with even setup IDs are used for training, and the remaining samples with odd setup IDs for validation	One protocol: training data comes from the first two cameras, and samples from the other camera are used for validation
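The Cross-Setup protocol for NTU-RGBD-120 is simple enough to sketch directly: even setup IDs go to training, odd setup IDs to validation. The record layout (a dictionary with a `setup_id` key) and the function name are hypothetical. Real loaders typically parse the setup ID from the sample file name.

```python
def split_cross_setup(samples):
    """Split samples by setup ID parity: even -> training, odd -> validation."""
    train = [s for s in samples if s["setup_id"] % 2 == 0]
    val = [s for s in samples if s["setup_id"] % 2 == 1]
    return train, val

# Hypothetical records, for illustration only
samples = [
    {"name": "a", "setup_id": 1},
    {"name": "b", "setup_id": 2},
    {"name": "c", "setup_id": 17},
    {"name": "d", "setup_id": 32},
]
train, val = split_cross_setup(samples)
print([s["name"] for s in train])  # ['b', 'd']
print([s["name"] for s in val])    # ['a', 'c']
```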