Deep Mutual Learning

(CVPR 2018)

FitNet

(ICLR 2015)

y_{1}

y_{1}

x

x

Net1 up-half networks

Net2 up-half networks

Net2 down-half networks

Net1 down-half networks

Cross Networks

Net2 down-half networks

y_{12}

y_{12}

Net1 down-half networks

y_{21}

y_{21}

y_{2}

y_{2}

Net1

half neuron

Net2

half neuron

Why cross net? let neuron1 ~ neuron 2 but without hard constraint

must be close to predict y21

y_{1}

y_{1}

Net2 down-half networks

Net1 down-half networks

About Loss - DML

Net2 down-half networks

y_{12}

y_{12}

Net1 down-half networks

y_{21}

y_{21}

y_{2}

y_{2}

Mutual (update 1)

\text{CE}(y_{1},y) + \text{KL}(y_2||y_1) +

\text{CE}(y_{1},y) + \text{KL}(y_2||y_1) +

Net1 up-half networks

Net2 up-half networks

\sim y_2

\sim y_2

Teacher-Student's viewpoint

Teacher down-half networks

y_{12}

y_{12}

Student up-half networks

\sim y_2

\sim y_2

Fixed

Students up-half needs learn:
How to predict mid-neuron to fit teacher's content?

Student down-half networks

y_{12}

y_{12}

Teacher up-half networks

\sim y_2

\sim y_2

Fixed

Students down-half needs learn:
How to use the teachers mid-neuron to answer the final question (or to fit the teachers answer)?

Is fit neurons needed?

what's the fitting learning curve?

Net1

half neuron

Net2

half neuron

\alpha = 0

\alpha = 0

\alpha = 1

\alpha = 1

\alpha = 2

\alpha = 2

\alpha = -1

\alpha = -1

interpolate

2 * net2

- net 1

net2

net1

2 * net1

- net 2

Best

almot Best, but dis > a = 0

catestrophy

this phenomenon occurs in both net1&net2 down half network

Xross Mutual Learning

Xross Mutual Learning

More from Arvin Liu