Arvin Liu
Distilling the Knowledge in a Neural Network (NIPS 2014)
Teacher model's outputs are too harsh (over-confident, nearly one-hot) ->
needs a temperature (an extra hyper-parameter) to soften them
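As a concrete reference, a minimal sketch of the temperature-softened distillation loss from the paper above; T and alpha are the hyper-parameters in question, and the default values here are only illustrative.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Soften both distributions with temperature T, then mix the soft-target
    # KL term with the usual hard-target cross-entropy.
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    log_soft_student = F.log_softmax(student_logits / T, dim=1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```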
Deep Mutual Learning
(CVPR 2018)
* trained from scratch
* needs to be trained iteratively (alternating updates)
[Figure: Step 1, update Net1. Both networks produce logits; Net1 loss = CE(y1, true label) + D_KL(y2 || y1).]
[Figure: Step 2, update Net2. Same setup; Net2 loss = CE(y2, true label) + D_KL(y1 || y2).]
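A minimal PyTorch sketch of the two alternating updates shown above; net1/net2, opt1/opt2, and the batch (x, y) are placeholder names.

```python
import torch
import torch.nn.functional as F

def dml_step(net_a, net_b, opt_a, x, y):
    # One DML update: CE against the true label plus KL toward the peer.
    logits_a = net_a(x)
    with torch.no_grad():                  # the peer only supplies targets in this step
        p_b = F.softmax(net_b(x), dim=1)
    log_p_a = F.log_softmax(logits_a, dim=1)
    loss = F.cross_entropy(logits_a, y) \
         + F.kl_div(log_p_a, p_b, reduction="batchmean")   # D_KL(peer || self)
    opt_a.zero_grad()
    loss.backward()
    opt_a.step()
    return loss.item()

# Step 1: dml_step(net1, net2, opt1, x, y)   # update Net1
# Step 2: dml_step(net2, net1, opt2, x, y)   # update Net2
```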
WRN (Wide Residual Networks)
In short, DML's accuracy curve is above the Independent baseline almost from the very start.
That is, the original hard-target term is turned off (its weight set to 0).
(Labelled data)
(All data)
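A minimal sketch of the "hard target off" setting above, under my reading that on data without labels only the mutual KL term remains; the function and argument names are illustrative.

```python
import torch.nn.functional as F

def mutual_loss(logits_a, logits_b, y=None):
    # KL toward the peer; the hard-target CE is added only when a label exists.
    kl = F.kl_div(F.log_softmax(logits_a, dim=1),
                  F.softmax(logits_b, dim=1), reduction="batchmean")
    if y is None:                          # unlabelled batch: hard-target weight = 0
        return kl
    return kl + F.cross_entropy(logits_a, y)
```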
Cross-Modal Knowledge Distillation for Action Recognition (10/10), ICIP 2019
Because Strategy 2 leads to lower entropy.
Single-model accuracy
Ensemble accuracy of all models
1. Generalization Problem
However, the validation accuracy is not.
Biasing Gradient Descent into Wide Valleys, ICLR 2017
[Figure: add Gaussian noise to Network1's weights (Net1 → Net1′) and see how much the final accuracy drops.]
No matter how strong the Gaussian noise is, DML's performance drops noticeably less.
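A minimal sketch of the perturbation test in the figure: add Gaussian noise of increasing strength to a trained network's weights and record the accuracy drop; evaluate is an assumed helper that returns validation accuracy.

```python
import copy
import torch

def accuracy_drop_under_noise(net, evaluate, sigmas=(0.01, 0.02, 0.05, 0.1)):
    # Perturb every parameter with N(0, sigma^2) noise and record the accuracy drop;
    # a flatter minimum should lose less accuracy for the same noise strength.
    base_acc = evaluate(net)
    drops = {}
    for sigma in sigmas:
        noisy = copy.deepcopy(net)
        with torch.no_grad():
            for p in noisy.parameters():
                p.add_(torch.randn_like(p) * sigma)
        drops[sigma] = base_acc - evaluate(noisy)
    return drops
```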
2. Entropy Regularization
ResNet-32 / CIFAR-100:
Mean entropy, Independent vs. DML: 1.7099 vs. 0.2602
Entropy regularization can find wider minima.
The detailed math is covered in the flatness paper (ICLR 2017, above).
Entropy: with entropy regularization applied.
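A minimal sketch of how the mean prediction entropy quoted above could be measured: average the Shannon entropy of the softmax outputs over a held-out set; loader and device are assumed.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_prediction_entropy(net, loader, device="cpu"):
    # Average Shannon entropy (in nats) of the predicted class posterior.
    total, count = 0.0, 0
    for x, _ in loader:
        p = F.softmax(net(x.to(device)), dim=1)
        entropy = -(p * p.clamp_min(1e-12).log()).sum(dim=1)
        total += entropy.sum().item()
        count += p.size(0)
    return total / count
```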
3. My viewpoint - Adaptive Temperature
We need temperature to increase entropy
Entropy keeps getting lower as training goes on ->
DML is a kind of knowledge distillation with an adaptive temperature, and a more meaningful one.
(viewed from the perspective of a single model doing the learning)
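To make the adaptive-temperature reading concrete, a small hedged illustration: raising T on a confident logit vector increases the entropy of its soft targets, much like an early-training peer naturally gives high-entropy predictions without any explicit temperature. The logit values are purely illustrative.

```python
import torch
import torch.nn.functional as F

def entropy(p):
    return -(p * p.clamp_min(1e-12).log()).sum(dim=-1)

# A confident (late-training) logit vector: higher T -> higher target entropy,
# which is the role an early-training, high-entropy peer plays by itself.
logits = torch.tensor([6.0, 2.0, 1.0, 0.5])
for T in (1.0, 2.0, 4.0, 8.0):
    h = entropy(F.softmax(logits / T, dim=-1)).item()
    print(f"T={T}: entropy={h:.3f}")
```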
From this, we know the models learned different features
-> cohort learning lets a single model learn more features.
DML + L2 makes accuracy lower. -> Does this prove the idea?
[Neuron switch problem]
[Figure: ResNet[::-1] → Flatten → Feature → W → Logits; a permuted Feature with a matching W′ gives the same Logits.]
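A minimal sketch of the neuron-switch problem in the figure: permuting the feature dimensions and applying the matching row permutation to W leaves the logits unchanged, so two networks can encode the same knowledge with differently ordered neurons. The dimensions here are arbitrary.

```python
import torch

torch.manual_seed(0)
feature = torch.randn(1, 8)                      # flattened feature
W = torch.randn(8, 10)                           # maps feature -> logits
perm = torch.randperm(8)

logits = feature @ W
logits_switched = feature[:, perm] @ W[perm, :]  # W' = row-permuted W
print(torch.allclose(logits, logits_switched))   # True: same logits
```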
But obviously, the features are different after t-SNE clustering.
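A minimal sketch of the comparison implied here, using scikit-learn: embed both networks' features in a single t-SNE run so their layouts are directly comparable; feats1 and feats2 are assumed arrays of penultimate-layer features.

```python
import numpy as np
from sklearn.manifold import TSNE

def joint_tsne(feats1, feats2, perplexity=30):
    # Embed both feature sets together so the two layouts share one space.
    stacked = np.concatenate([feats1, feats2], axis=0)
    emb = TSNE(n_components=2, perplexity=perplexity, init="pca").fit_transform(stacked)
    return emb[: len(feats1)], emb[len(feats1):]
```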
Independent < DML + L2 on mid-layer features < DML
However, I have some magic to disprove it.