Arvin Liu
Distilling the Knowledge in a Neural Network (NIPS 2014)
Teacher model's outputs are too harsh (over-confident, nearly one-hot) ->
needs a temperature (an extra hyper-parameter) to soften them
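As a concrete reference, a minimal sketch of the temperature-softened distillation loss from the paper above; T and alpha are the hyper-parameters in question, and the default values here are only illustrative.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Soften both distributions with temperature T, then mix the soft-target
    # KL term with the usual hard-target cross-entropy.
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    log_soft_student = F.log_softmax(student_logits / T, dim=1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```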
Deep Mutual Learning
(CVPR 2018)
* trained from scratch
* needs to be trained iteratively (alternating updates)
[Figure: Step 1, update Net1. Both networks produce logits; Net1 loss = CE(y1, true label) + D_KL(y2 || y1).]
[Figure: Step 2, update Net2. Same setup; Net2 loss = CE(y2, true label) + D_KL(y1 || y2).]
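A minimal PyTorch sketch of the two alternating updates shown above; net1/net2, opt1/opt2, and the batch (x, y) are placeholder names.

```python
import torch
import torch.nn.functional as F

def dml_step(net_a, net_b, opt_a, x, y):
    # One DML update: CE against the true label plus KL toward the peer.
    logits_a = net_a(x)
    with torch.no_grad():                  # the peer only supplies targets in this step
        p_b = F.softmax(net_b(x), dim=1)
    log_p_a = F.log_softmax(logits_a, dim=1)
    loss = F.cross_entropy(logits_a, y) \
         + F.kl_div(log_p_a, p_b, reduction="batchmean")   # D_KL(peer || self)
    opt_a.zero_grad()
    loss.backward()
    opt_a.step()
    return loss.item()

# Step 1: dml_step(net1, net2, opt1, x, y)   # update Net1
# Step 2: dml_step(net2, net1, opt2, x, y)   # update Net2
```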
WRN (Wide Residual Networks)
In short, DML's accuracy curve is above the Independent baseline almost from the very start.
That is, the original hard-target term is turned off (its weight set to 0).
(Labelled data)
(All data)
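A minimal sketch of the "hard target off" setting above, under my reading that on data without labels only the mutual KL term remains; the function and argument names are illustrative.

```python
import torch.nn.functional as F

def mutual_loss(logits_a, logits_b, y=None):
    # KL toward the peer; the hard-target CE is added only when a label exists.
    kl = F.kl_div(F.log_softmax(logits_a, dim=1),
                  F.softmax(logits_b, dim=1), reduction="batchmean")
    if y is None:                          # unlabelled batch: hard-target weight = 0
        return kl
    return kl + F.cross_entropy(logits_a, y)
```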
Cross-Modal Knowledge Distillation for Action Recognition (10/10), ICIP 2019
Because Strategy 2 leads to lower entropy.
Single-model accuracy
Ensemble accuracy of all models
1. Generalization Problem
However, the validation accuracy is not.
Biasing Gradient Descent into Wide Valleys, ICLR 2017
[Figure: add Gaussian noise to Network1's weights (Net1 → Net1′) and see how much the final accuracy drops.]
No matter how strong the Gaussian noise is, DML's performance drops noticeably less.
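A minimal sketch of the perturbation test in the figure: add Gaussian noise of increasing strength to a trained network's weights and record the accuracy drop; evaluate is an assumed helper that returns validation accuracy.

```python
import copy
import torch

def accuracy_drop_under_noise(net, evaluate, sigmas=(0.01, 0.02, 0.05, 0.1)):
    # Perturb every parameter with N(0, sigma^2) noise and record the accuracy drop;
    # a flatter minimum should lose less accuracy for the same noise strength.
    base_acc = evaluate(net)
    drops = {}
    for sigma in sigmas:
        noisy = copy.deepcopy(net)
        with torch.no_grad():
            for p in noisy.parameters():
                p.add_(torch.randn_like(p) * sigma)
        drops[sigma] = base_acc - evaluate(noisy)
    return drops
```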
2. Entropy Regularization
ResNet-32 / CIFAR-100:
Mean entropy, Independent vs. DML: 1.7099 vs. 0.2602
Entropy regularization can find wider minima.
The detailed math is covered in the flatness paper (ICLR 2017, above).
Entropy: with entropy regularization applied.
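A minimal sketch of how the mean prediction entropy quoted above could be measured: average the Shannon entropy of the softmax outputs over a held-out set; loader and device are assumed.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_prediction_entropy(net, loader, device="cpu"):
    # Average Shannon entropy (in nats) of the predicted class posterior.
    total, count = 0.0, 0
    for x, _ in loader:
        p = F.softmax(net(x.to(device)), dim=1)
        entropy = -(p * p.clamp_min(1e-12).log()).sum(dim=1)
        total += entropy.sum().item()
        count += p.size(0)
    return total / count
```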
3. My viewpoint - Adaptive Temperature
We need temperature to increase entropy
Entropy keeps getting lower as training goes on ->
DML is a kind of knowledge distillation with an adaptive temperature, and a more meaningful one.
(viewed from the perspective of a single model doing the learning)
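To make the adaptive-temperature reading concrete, a small hedged illustration: raising T on a confident logit vector increases the entropy of its soft targets, much like an early-training peer naturally gives high-entropy predictions without any explicit temperature. The logit values are purely illustrative.

```python
import torch
import torch.nn.functional as F

def entropy(p):
    return -(p * p.clamp_min(1e-12).log()).sum(dim=-1)

# A confident (late-training) logit vector: higher T -> higher target entropy,
# which is the role an early-training, high-entropy peer plays by itself.
logits = torch.tensor([6.0, 2.0, 1.0, 0.5])
for T in (1.0, 2.0, 4.0, 8.0):
    h = entropy(F.softmax(logits / T, dim=-1)).item()
    print(f"T={T}: entropy={h:.3f}")
```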
From this, we know the models learned different features
-> cohort learning lets a single model learn more features.
DML + L2 makes accuracy lower. -> Does this prove the idea?
[Neuron switch problem]
[Figure: ResNet[::-1] → Flatten → Feature → W → Logits; a permuted Feature with a matching W′ gives the same Logits.]
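A minimal sketch of the neuron-switch problem in the figure: permuting the feature dimensions and applying the matching row permutation to W leaves the logits unchanged, so two networks can encode the same knowledge with differently ordered neurons. The dimensions here are arbitrary.

```python
import torch

torch.manual_seed(0)
feature = torch.randn(1, 8)                      # flattened feature
W = torch.randn(8, 10)                           # maps feature -> logits
perm = torch.randperm(8)

logits = feature @ W
logits_switched = feature[:, perm] @ W[perm, :]  # W' = row-permuted W
print(torch.allclose(logits, logits_switched))   # True: same logits
```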
But obviously, the features are different after t-SNE clustering.
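A minimal sketch of the comparison implied here, using scikit-learn: embed both networks' features in a single t-SNE run so their layouts are directly comparable; feats1 and feats2 are assumed arrays of penultimate-layer features.

```python
import numpy as np
from sklearn.manifold import TSNE

def joint_tsne(feats1, feats2, perplexity=30):
    # Embed both feature sets together so the two layouts share one space.
    stacked = np.concatenate([feats1, feats2], axis=0)
    emb = TSNE(n_components=2, perplexity=perplexity, init="pca").fit_transform(stacked)
    return emb[: len(feats1)], emb[len(feats1):]
```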
Independent < DML + L2 on mid-layer features < DML
However, I have some magic to disprove it.