Xross Mutual Learning
Deep Mutual Learning
(CVPR 2018)
Net1 Loss = D_KL(y2 ∥ y1)
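A minimal NumPy sketch of Net1's DML objective: cross-entropy on the labels plus the KL term above. The helper names are illustrative; only the loss structure follows DML.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def dml_net1_loss(logits1, logits2, labels):
    """DML loss for Net1 (sketch): cross-entropy on the true labels
    plus D_KL(y2 || y1), pulling Net1's prediction toward Net2's.
    Net2 gets the symmetric loss with the roles swapped."""
    p1 = softmax(logits1)
    p2 = softmax(logits2)          # Net2's prediction, a fixed target here
    n = len(labels)
    ce = -np.mean(np.log(p1[np.arange(n), labels] + 1e-12))
    kl = np.mean(np.sum(p2 * (np.log(p2 + 1e-12) - np.log(p1 + 1e-12)), axis=1))
    return ce + kl
```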
Algorithm
How do we fit the neurons/attributes in the middle?
FitNet
(ICLR 2015)
Algorithm
Teacher Net
Student Net
Logits
Logits
Intermediate layer to fit
L2 loss
+ BSKD loss
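The hint step (L2 between the layers to fit) can be sketched as below; `W_r` stands in for FitNet's learned regressor and is a plain matrix here for illustration.

```python
import numpy as np

def fitnet_hint_loss(student_feat, teacher_feat, W_r):
    """FitNet stage-1 hint loss (sketch): the regressor W_r maps the
    student's intermediate feature to the teacher's width, and the two
    are matched with an L2 loss. Stage 2 then applies the usual KD loss."""
    projected = student_feat @ W_r            # (batch, d_teacher)
    return 0.5 * np.mean(np.sum((projected - teacher_feat) ** 2, axis=1))
```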
Potential Hazard
Teacher Net
Student Net
Logits
Logits
Neurons to fit
1. The teacher net's neurons may contain a lot of redundancy → the L2-loss constraint is too strict
Teacher uphalf networks
Student uphalf networks
Student downhalf networks
Teacher downhalf networks
2. L2 loss is superficial
Why talk about FitNet?
In knowledge distillation, FitNet is almost the only method that fits features directly.
Other KD methods roughly propose:
fitting attention maps / fitting relations between features (e.g. FSP) / fitting relations between outputs (e.g. graph-based)
Xross Learning
Stage 1: Cross the networks
Net1 uphalf networks
Net2 uphalf networks
Net2 downhalf networks
Net1 downhalf networks
Cross Networks
Net2 downhalf networks
Net1 downhalf networks
Net1
half neuron
Net2
half neuron
Why cross the nets? It makes neuron1 ≈ neuron2, but without a hard constraint:
the neurons must be close for the crossed path to predict y21 correctly
Net2 downhalf networks
Net1 downhalf networks
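The crossing can be sketched with toy one-layer "halves" (the `make_half` helper is purely illustrative, not the slides' architecture):

```python
import numpy as np

def make_half(d_in, d_out, seed):
    """A toy 'half network': one linear map + ReLU, standing in for
    the real uphalf/downhalf sub-networks."""
    W = np.random.default_rng(seed).normal(scale=0.1, size=(d_in, d_out))
    return lambda x: np.maximum(x @ W, 0.0)

net1_up, net2_up = make_half(8, 4, 1), make_half(8, 4, 2)
net1_down, net2_down = make_half(4, 3, 3), make_half(4, 3, 4)

x = np.ones((2, 8))
h1, h2 = net1_up(x), net2_up(x)           # each net's mid neurons
y11, y22 = net1_down(h1), net2_down(h2)   # straight paths
y21, y12 = net2_down(h1), net1_down(h2)   # crossed paths: each downhalf must
                                          # also work from the other net's
                                          # neurons, softly pulling h1 ~ h2
```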
About Loss: DML
Net2 downhalf networks
Net1 downhalf networks
Mutual (update 1)
Net1 uphalf networks
Net2 uphalf networks
Net2 downhalf networks
Net1 downhalf networks
About Loss: XML (update 1)
Net2 downhalf networks
Net1 downhalf networks
Net1 uphalf networks
Net2 uphalf networks
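One plausible form of the XML update, sketched below. This is an assumption about the objective (supervising all four straight and crossed outputs with the labels), not a formula taken from the slides.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def ce(logits, labels):
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def xml_loss(y11, y12, y21, y22, labels):
    """Hypothetical XML objective: every path, straight (y11, y22) and
    crossed (y12, y21), is supervised by the ground-truth labels, so
    each downhalf must answer correctly from either uphalf's neurons."""
    return ce(y11, labels) + ce(y12, labels) + ce(y21, labels) + ce(y22, labels)
```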
Let's Look at the Teacher-Student Architecture
The Teacher-Student viewpoint
Teacher downhalf networks
Student uphalf networks
Fixed
The student's uphalf needs to learn:
how to produce mid neurons that fit the teacher's content
Student downhalf networks
Teacher uphalf networks
Fixed
The student's downhalf needs to learn:
how to use the teacher's mid neurons to answer the final question (or to fit the teacher's answer)
Result 1
ResNet18 & MobileNet V1
Parameters Distribution
Net1 uphalf networks
Net2 uphalf networks
Net2 downhalf networks
Net1 downhalf networks
2,775,104
135,040
3,180,388
8,445,028
ResNet18
MobileNet
V1
2conv block + FC
2conv block
3 residual block
1 residual block + FC
Validation Accuracy: MobileNet V1
Independent: 62.6, DML: 65.9 (+3.3), XML: 68.3 (+2.4 over DML)
(Net 1: ResNet18, Net 2: MobileNet V1); Net1's three results are all quite close.
Validation Accuracy: ResNet18
Independent: 74.0, DML: 74.3 (+0.3), XML: 73.6 (−0.7 vs DML)
XML Val Accuracy (2  best model)
Net2 downhalf networks
Net1 downhalf networks
Net2 downhalf networks
Net1 downhalf networks
Net1 uphalf networks
Net2 uphalf networks
0.7320
0.7021
0.7225
0.6889
XML Acc Brainstorming
Net1 uphalf networks
Net2 uphalf networks
0.7225 → 0.7320 (+0.0095)
0.6889 → 0.7021 (+0.0132)
Net1 downhalf networks
Net2 downhalf networks
0.7021 → 0.7320 (+0.0299)
0.6889 → 0.7225 (+0.0336)
The uphalf mimics well,
but the downhalf does not.
2,775,104
8,445,028
135,040
3,180,388
Result 2
ResNet18 & ResNet34
Parameters Distribution
Net1 uphalf networks
Net2 uphalf networks
Net2 downhalf networks
Net1 downhalf networks
675,392
1,340,224
19,988,068
10,544,740
ResNet18
ResNet34
2 residual block
2 residual block + FC
2 residual block
2 residual block + FC
Validation Accuracy: ResNet18
Independent: 73.9, DML: 75.7 (+1.8), XML: 76.6 (+0.9 over DML)
(Net 1: ResNet18, Net 2: ResNet34)
Validation Accuracy: ResNet34
Independent: 75.5, DML: 76.6 (+1.1), XML: 77.0 (+0.4 over DML)
(Net 1: ResNet18, Net 2: ResNet34); Net2's three results are all quite close.
XML Val Accuracy (2  best model)
Res34 downhalf networks
Res18 downhalf networks
Res34 downhalf networks
Res18 downhalf networks
Res18 uphalf networks
Res34 uphalf networks
0.7606
0.7625
0.7661
0.7661
What Else?
Dynamic Computation
Net1 part1 networks
Net2 part1 networks
Net3 part1
networks
Net1 part2 networks
Net2 part2 networks
Net3 part2
networks
Net1 part3 networks
Net2 part3 networks
Net3 part3
networks
Time Cost
Conclusion

Similar architectures may be why XML learns only about as well as DML?

Sometimes the downhalf learns better, sometimes the uphalf does, so it is not clear-cut.
Conclusion
Improving the front end
Try to make the small model fit the big model.
But in the original Xross Learning, the distance between the two models is already quite small (average difference: 0.004).
Improving the back end
But that is just the original KD/Mutual problem again.
Why Works?
Why works?
- A single upnet or downnet has to fit two outputs or inputs at once, forcing the model into two very similar multitask objectives. Because the tasks are so alike, the model's sharpness/sensitivity drops, which raises validation accuracy. (Generalization)
- Compared with FitNet, XrossNet's solution shows that feature mimicking does not need such a hard constraint. FitNet is also necessarily two-stage (fit first, then knowledge distillation).
- Possibly an effect similar to ensemble training?
Uphalf part experience
Is fitting the neurons necessary?
What does the fitting learning curve look like?
Net1
half neuron
Net2
half neuron
[Plot: accuracy when interpolating the mid neurons along
2*net2 − net1 … net2 … net1 … 2*net1 − net2.
At net1 / net2: best.
Near them: almost best, but dis > 0 when a = 0.
At the extrapolated ends: catastrophe.]
This phenomenon occurs in both Net1's and Net2's downhalf networks.
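The interpolation axis in the plot above can be sketched as a single helper; the parameter name `a` follows the slide's notation, and the helper itself is illustrative:

```python
import numpy as np

def interpolate_neurons(h1, h2, a):
    """Mid-neuron interpolation for the uphalf experiment (sketch):
    a = 0 gives net1's neurons, a = 1 gives net2's, a = -1 extrapolates
    to 2*net1 - net2, and a = 2 to 2*net2 - net1."""
    return (1.0 - a) * h1 + a * h2
```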
Dis1  connect down 1
Dis1  connect down 2
Dis2  connect down 1
Dis2  connect down 2
Experience 0
Add distance
Net1 uphalf networks
Net2 uphalf networks
Like FitNet + Mutual
Net1
half neuron
Net2
half neuron
L1/2 Loss
Validation Score
original XML: 68.3, XML + L2 loss: 68.5 (+0.2)
distance: 0.002 → 0.001
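Experience 0's add-on can be sketched as an explicit distance penalty on the mid neurons, added on top of the XML loss (roughly FitNet's hint inside the mutual setup). `weight` is a hypothetical balancing coefficient, not from the slides:

```python
import numpy as np

def mid_neuron_penalty(h1, h2, weight=1.0):
    """L1/L2-style penalty pulling the two nets' mid neurons together.
    Returns the weighted loss term and the raw distance (the quantity
    monitored above, e.g. 0.002 -> 0.001)."""
    dist = float(np.mean((h1 - h2) ** 2))
    return weight * dist, dist
```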
Experience 1
Downhalf net → Discriminator
Change DownHalf Network
Net1 downhalf networks
Net2 downhalf networks
Classification Score
Synthesis Score (1=like Net1)
Classification Score
Synthesis Score (1=like Net2)
Train Net 1
Net1 downhalf networks
Classification Score
Synthesis Score
(1=like Net1)
Net1 uphalf networks
Net2 downhalf networks
Classification Score
Synthesis Score
(1=like Net2)
Net2 uphalf networks
Only Down Half
Generator_1
Discriminator_1
Generator_2
Discriminator_2
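Experience 1's two-headed downhalf can be sketched as follows. `W_cls` and `w_syn` are hypothetical head weights; in the slides each downhalf plays discriminator for its own net while the uphalves act as generators:

```python
import numpy as np

def discriminator_downhalf(h, W_cls, w_syn):
    """Sketch of a downhalf with two heads: a classification head, and
    a synthesis head scoring whether the mid neurons came from its own
    uphalf (score near 1) or the other net's (near 0)."""
    class_logits = h @ W_cls                         # (batch, n_classes)
    syn_score = 1.0 / (1.0 + np.exp(-(h @ w_syn)))   # sigmoid in (0, 1)
    return class_logits, syn_score
```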
Experience 2
Independent Discriminator
Train Net 1
Net1 downhalf networks
Classification Score
Net1 uphalf networks
Net2 downhalf networks
Generator_1
Classification Score
Discriminator
Synthesis Score
Result
No improvement. Better than DML, but not better than XML.
The two generators did not move closer; instead, training became unstable.
Xross Mutual Learning
By Arvin Liu