Arvin Liu - MiRA Training course
Knowledge Distillation
Please use the link: slides.com/arvinliu/kd
Missing info in data
Before KD:
Entropy Regularization
Label Refinery
What is Knowledge Distillation?
Baseline Knowledge Distillation
Does teacher model have to be larger?
Teacher-free KD
Other KD techniques
Where's the problem in the following image if you are training a top-1 image classifier?
The ground-truth (GT) label will give a large penalty to your model and make it overfit.
[Figure: what the data looks like: the GT label, what a human would say, what the model predicts, and the resulting loss]
And, to solve overfitting, the naive method is ....
Regularization
[Figure: a Class A data sample whose target is spread over Class A / Class B / Class C]
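The agenda lists Entropy Regularization as one trick used before KD. Below is a minimal sketch of that idea as a confidence penalty on the output distribution, assuming a PyTorch classifier; the function name and the beta weight are illustrative, not taken from the slides.

```python
import torch.nn.functional as F

def confidence_penalty_loss(logits, targets, beta=0.1):
    """Cross-entropy plus a bonus for keeping the output entropy high,
    which discourages over-confident, one-hot-like predictions."""
    ce = F.cross_entropy(logits, targets)
    log_probs = F.log_softmax(logits, dim=1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=1).mean()
    return ce - beta * entropy  # subtracting entropy == encouraging it
```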
Use the Ground-Truth labels to train your first model.
Use the first model's logits to train your second model.
Keep training new models this way until the accuracy converges (see the sketch below).
Like Noisy Student, huh?
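A minimal sketch of the Label Refinery loop described above, assuming PyTorch and two hypothetical helpers, make_model() and train_one_generation(); each new generation trains on the previous model's softened predictions instead of the GT labels.

```python
import torch
import torch.nn.functional as F

def label_refinery(make_model, train_one_generation, loader, generations=3):
    """Generation 0 trains on the ground-truth labels; every later
    generation trains on soft labels produced by its predecessor."""
    previous = None
    for _ in range(generations):
        model = make_model()
        if previous is None:
            train_one_generation(model, loader, soft_target_fn=None)  # plain GT training
        else:
            previous.eval()
            def soft_target_fn(x, refiner=previous):
                with torch.no_grad():
                    return F.softmax(refiner(x), dim=1)  # refined labels
            train_one_generation(model, loader, soft_target_fn=soft_target_fn)
        previous = model  # this model refines the labels for the next round
    return previous
```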
Make small models work like large models, i.e. with comparable accuracy.
Why? To deploy your models on resource-limited devices.
Common methods:
Network Pruning
Low-Rank Approximation
Quantization (e.g. float16 -> int8)
[Figure: Large Model -> (magic) -> Small Model. A Pre-trained Large Model guides an Untrained Little Model into a Well-trained Little Model.]
Let the pre-trained model guide the small model with some "knowledge". For example:
Taxonomy Dependency
Feature Transformation, ...
Recap Label Refinery:
[Diagram: GT -> Model 1 -> Model 2 -> Model 3]
What about Baseline KD?
[Diagram: GT and Teacher -> Student]
Why do we need temperature?
A well-trained model has very low output entropy -> its prediction is almost one-hot, so there is little extra information for the student; temperature softens the distribution.
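A minimal sketch of the baseline KD loss with temperature, assuming PyTorch; T and alpha are illustrative hyper-parameters. Dividing both logits by T before the softmax keeps the non-target classes visible to the student.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.5):
    """Hard-label CE plus KL between temperature-softened distributions;
    the T*T factor keeps gradient scales comparable across temperatures."""
    hard = F.cross_entropy(student_logits, targets)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * hard + (1 - alpha) * soft
```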
You may sense that distilling from the same model still improves accuracy...
Think about Label Refinery!
[Diagram: GT -> Model 1 -> Model 2 -> Model 3]
What if the teacher model is far smaller than the student? (Re-KD)
It still works! (Better than training the student independently.)
We know about label refinery...
[Diagram: GT -> Model 1 -> Model 2 -> Model 3]
What if we merge Label Refinery & KD (i.e. distill to a smaller model)?...
[Diagram: GT + Teacher 1 -> Student 1 -> Student 2]
Born-Again Neural Networks. (ICML 2018)
The teacher only teaches the first student; each later student is distilled from the previous one.
The students are usually ensembled after training.
[Diagram: GT + Teacher 1 -> Student 1 -> Student 2 -> ...]
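A minimal sketch of the born-again sequence and the final ensemble, assuming a hypothetical train_kd() helper that combines CE with the GT and distillation from the given teacher (e.g. the kd_loss above); only the first student sees the original teacher.

```python
import torch

def born_again(teacher, make_model, train_kd, n_students=3):
    """Each student is distilled only from the previous generation."""
    students, current_teacher = [], teacher
    for _ in range(n_students):
        student = make_model()
        train_kd(student, current_teacher)  # CE with GT + KD from current teacher
        students.append(student)
        current_teacher = student           # the next student learns from this one
    return students

def ensemble_logits(students, x):
    """The usual test-time ensemble: average the students' logits."""
    with torch.no_grad():
        return torch.stack([s(x) for s in students]).mean(dim=0)
```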
We know about label refinery...
[Diagram: GT -> Model 1 -> Model 2 -> Model 3]
What if we let the student guide the teacher?
[Diagram: GT, Teacher 1, Student 1, Student 2; the logits now also flow from the students back to the teacher]
Step 1: Update Net1. Network 1 is trained with a CE loss against the GT plus a loss toward Network 2's logits.
Step 2: Update Net2. Network 2 is trained with a CE loss against the GT plus a loss toward Network 1's logits.
The experimental results are very satisfying :).
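A minimal sketch of one mutual-learning update for two networks, assuming PyTorch; it follows Step 1 / Step 2 above, using a KL term toward the other network's current predictions as the "loss toward the other logits". The nets and optimizers are assumed to be ordinary torch modules and optimizers.

```python
import torch
import torch.nn.functional as F

def mutual_step(net1, net2, opt1, opt2, x, y):
    # Step 1: update Network 1 (CE with GT + KL toward Network 2's logits)
    opt1.zero_grad()
    logits1 = net1(x)
    with torch.no_grad():
        logits2 = net2(x)
    loss1 = F.cross_entropy(logits1, y) + F.kl_div(
        F.log_softmax(logits1, dim=1), F.softmax(logits2, dim=1),
        reduction="batchmean")
    loss1.backward()
    opt1.step()

    # Step 2: update Network 2 (CE with GT + KL toward Network 1's refreshed logits)
    opt2.zero_grad()
    logits2 = net2(x)
    with torch.no_grad():
        logits1 = net1(x)
    loss2 = F.cross_entropy(logits2, y) + F.kl_div(
        F.log_softmax(logits2, dim=1), F.softmax(logits1, dim=1),
        reduction="batchmean")
    loss2.backward()
    opt2.step()
```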
We know about label refinery...
[Diagram: GT -> Model 1 -> Model 2 -> Model 3]
Deep Mutual Learning: two models teach each other within one generation. (CVPR 2018)
[Diagram: GT with Teacher 1, Student 1, Student 2 teaching each other]
What if a model learns by itself within one generation? (???)
[Diagram: GT with Model A teaching Model A; GT with Model A1, Model B1, and Model A2]
[Diagram: Dataset -> Network = Block 1 -> Block 2 -> Block 3 -> ... Each Block i feeds a Bottleneck i that produces feature i, followed by an FC i that produces Logits i. The branches are tied together with an L2 loss on the features, a KD loss on the logits, and a CE loss.]
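A minimal sketch of the multi-branch loss in the diagram above, assuming PyTorch and a backbone whose bottleneck/FC heads already map every shallow feature to the deepest feature's shape; the weights and temperature are illustrative. The deepest branch plays the teacher for the shallow ones.

```python
import torch.nn.functional as F

def self_distill_loss(features, logits, targets, T=3.0,
                      w_ce=1.0, w_kd=0.5, w_feat=0.1):
    """features / logits are lists ordered from the shallowest branch to
    the deepest one; every branch gets a CE loss, and shallow branches
    additionally imitate the deepest logits (KD) and feature (L2)."""
    deep_feat = features[-1].detach()
    deep_logits = logits[-1].detach()
    loss = F.cross_entropy(logits[-1], targets)          # CE for the final exit
    for feat, lgt in zip(features[:-1], logits[:-1]):
        loss = loss + w_ce * F.cross_entropy(lgt, targets)
        loss = loss + w_kd * (T * T) * F.kl_div(
            F.log_softmax(lgt / T, dim=1),
            F.softmax(deep_logits / T, dim=1),
            reduction="batchmean")
        loss = loss + w_feat * F.mse_loss(feat, deep_feat)
    return loss
```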
When the gap between the two models is too large, the student net may learn badly: the model capacities are too different.
[Figure: a Large Model (e.g. ResNet101) cannot distill well into a Small Model.]
Use a mid-sized model to bridge the gap between the teacher & the student (sketched below).
Or, maybe we should try distilling from the feature space?
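A minimal sketch of the mid-model idea above (a teacher-assistant style bridge), reusing the same hypothetical train_kd() helper: the mid-sized model first absorbs the large teacher, then teaches the small student, so no single gap is too large.

```python
def bridge_distill(teacher, mid_model, student, train_kd):
    """Two plain KD runs chained through a mid-sized model."""
    train_kd(mid_model, teacher)   # large teacher -> mid-sized assistant
    train_kd(student, mid_model)   # mid-sized assistant -> small student
    return student
```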
Again: when the gap between the two models is too large, the student net may learn badly; the path between X & Y is too far.
[Figure: a Large Model (e.g. ResNet101) cannot distill well into a Small Model.]
[Figure: the Teacher's answer is 0 (next ans: 8); hints such as "only 1 loop" and "no end point" are what the Student Model could pick up from the feature space.]
[Diagram: the Dataset feeds both the Teacher Net (U) and the Student Net (U); S's feat is transformed by a Regressor and matched to T's feat with an L2 loss.]
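A minimal sketch of the regressor + L2 loss in the diagram, assuming PyTorch; a 1x1 convolution maps the student's channels to the teacher's, and spatial sizes are assumed to already match. Channel counts are illustrative.

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureRegressor(nn.Module):
    """Maps S's feat to the shape of T's feat (here: only the channels)."""
    def __init__(self, s_channels, t_channels):
        super().__init__()
        self.proj = nn.Conv2d(s_channels, t_channels, kernel_size=1)

    def forward(self, s_feat):
        return self.proj(s_feat)

def feature_l2_loss(regressor, s_feat, t_feat):
    """L2 loss between the transformed student feature and the frozen
    teacher feature."""
    return F.mse_loss(regressor(s_feat), t_feat.detach())
```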
[Diagram: the Dataset feeds the Upper Model; the teacher's feature T's feat of shape (H, W, C) goes through Knowledge Compression to shape (H, W, 1).]
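One plausible reading of the (H, W, C) -> (H, W, 1) box is an attention-transfer style compression: sum the squared activations over the channel dimension and match the resulting maps. A minimal sketch, assuming PyTorch's (N, C, H, W) layout; this is an assumption, not necessarily the exact compression in the slides.

```python
import torch.nn.functional as F

def compress_feature(feat):
    """(N, C, H, W) -> (N, 1, H, W): channel-wise sum of squared
    activations, L2-normalized per sample."""
    att = feat.pow(2).sum(dim=1, keepdim=True)
    return F.normalize(att.flatten(1), dim=1).view_as(att)

def compressed_feature_loss(s_feat, t_feat):
    """Match the student's and the teacher's compressed maps with L2."""
    return F.mse_loss(compress_feature(s_feat), compress_feature(t_feat.detach()))
```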
Can we learn something across a batch?
Relational KD: learn the structures from a batch of logits.
Individual KD: the student learns the teacher's individual outputs.
Relational KD: the student learns the relations among the teacher's representations.
Distance-wise KD (t: teacher's logits, s: student's logits): the pairwise distances between the student's logits should match those between the teacher's, i.e. d(s_i, s_j) ≈ d(t_i, t_j).
Angle-wise KD: the angles formed by triplets of logits should also match, i.e. angle(s_i, s_j, s_k) ≈ angle(t_i, t_j, t_k).
Merge the two types of loss.
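A minimal sketch of the two relational terms over a batch of embeddings/logits, assuming PyTorch; the normalization, the smooth-L1 choice, and the merge weights are illustrative.

```python
import torch
import torch.nn.functional as F

def pairwise_distances(e):
    """(N, D) -> (N, N) Euclidean distances, normalized by their mean."""
    d = torch.cdist(e, e, p=2)
    return d / (d.mean() + 1e-8)

def distance_wise_loss(s, t):
    """d(s_i, s_j) should match d(t_i, t_j)."""
    return F.smooth_l1_loss(pairwise_distances(s), pairwise_distances(t.detach()))

def angle_wise_loss(s, t):
    """Cosines of the angles formed by triplets (i, j, k) should match."""
    def triplet_cosines(e):
        diff = F.normalize(e.unsqueeze(0) - e.unsqueeze(1), dim=2)  # (N, N, D)
        return torch.bmm(diff, diff.transpose(1, 2))                # (N, N, N)
    return F.smooth_l1_loss(triplet_cosines(s), triplet_cosines(t.detach()))

def relational_kd_loss(s, t, w_dist=1.0, w_angle=2.0):
    """Merge the two types of loss."""
    return w_dist * distance_wise_loss(s, t) + w_angle * angle_wise_loss(s, t)
```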
Now you have learned that KD is very helpful -- even distilling from a barely informative "trash" teacher can help (?)
Cross-modal distillation can help!
The student shouldn't blindly follow the teacher, because the teacher may also make mistakes.
[Figure: cross-modal distillation on a Video: CV Models see images (popcorn, a cat), NLP Models get voices (*pop*). Distill what???]
[Diagram: Teacher Net vs. Student Net: the student's inheritance part SHOULD be similar to the teacher, while its exploration part SHOULD NOT be similar.]
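A heavily hedged sketch of the inheritance/exploration split shown above: half of the student's feature is pulled toward the teacher's feature (SHOULD be similar), the other half is pushed away from it (SHOULD NOT be similar). The even split, the cosine similarity, and the weights are all illustrative assumptions; the features are assumed to be flattened to (N, D) with matching dimensions.

```python
import torch.nn.functional as F

def inheritance_exploration_loss(student_feat, teacher_feat, w_explore=0.5):
    """student_feat: (N, 2*D); teacher_feat: (N, D), treated as fixed."""
    d = student_feat.shape[1] // 2
    inherit, explore = student_feat[:, :d], student_feat[:, d:]
    t = teacher_feat.detach()
    # inheritance part SHOULD be similar to the teacher's feature
    inherit_loss = 1.0 - F.cosine_similarity(inherit, t, dim=1).mean()
    # exploration part SHOULD NOT be similar to the teacher's feature
    explore_loss = F.cosine_similarity(explore, t, dim=1).abs().mean()
    return inherit_loss + w_explore * explore_loss
```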