Revisiting Knowledge Distillation:
An Inheritance and Exploration Framework (CVPR'21)
Zhen Huang, Xu Shen, Jun Xing, Tongliang Liu, Xinmei Tian, Houqiang Li, Bing Deng, Jianqiang Huang, Xian-Sheng Hua
University of Science and Technology of China, Alibaba Group, University of Southern California, University of Sydney
What is Knowledge Distillation?
[Diagram: a pretrained Teacher Net and a Student Net both take the Dataset as input; the student is trained with a hard loss against the ground truth (GT) and a soft loss against the teacher's result (T's result). The Student Net is usually smaller than or equal to the Teacher Net.]
- Baseline KD: mimic the logits of the teacher net (a minimal sketch follows this list).
- FitNet: mimic the intermediate features of the teacher net.
- AT: mimic the attention map of the teacher net.
- ...
- Older KD methods focus on mimicking different parts of the teacher net.
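As a rough illustration of the baseline (logit-mimicking) KD objective above, here is a minimal PyTorch-style sketch; the temperature `T` and weight `alpha` are illustrative choices, not values from the paper.

```python
import torch
import torch.nn.functional as F

def baseline_kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Soft loss: KL divergence between the softened teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits.detach() / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard loss: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```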
What's Bad with Old KD?
- Teacher: because the tail of the "cheetah" resembles a "crocodile", the model makes a wrong prediction.
- Student: its attention pattern is similar to the teacher's, so it also makes a wrong prediction.
- The student CANNOT follow the teacher blindly.
Layer-Wise Relevance Propagation (LRP): a visualization of the attention map on the input image.
Proposed: IE-KD
(Inheritance and Exploration KD Framework)
Basic Ideas
[Diagram: the Teacher Net's features vs. the Student Net's features — the inheritance part SHOULD be similar to the teacher's features, while the exploration part SHOULD NOT be similar.]
- Split the student net into two parts (sketched below).
- Inheritance loss: the inheritance part should generate features similar to the teacher's.
- Exploration loss: the exploration part should generate features different from the teacher's.
- Does the shape of the features have to be the same?
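Conceptually, the student's feature channels are split into an inheritance half and an exploration half. A rough PyTorch-style sketch of that split, assuming a generic `backbone` feature extractor and an illustrative channel count:

```python
import torch.nn as nn

class IEStudentFeatures(nn.Module):
    """Split the student's last feature map into inheritance / exploration parts."""
    def __init__(self, backbone, channels=512):
        super().__init__()
        self.backbone = backbone      # any student feature extractor (hypothetical)
        self.split = channels // 2    # half of the channels for each part

    def forward(self, x):
        feat = self.backbone(x)              # (B, C, H, W)
        inh_feat = feat[:, : self.split]     # SHOULD be similar to the teacher
        exp_feat = feat[:, self.split :]     # SHOULD NOT be similar to the teacher
        return inh_feat, exp_feat
```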
Related work: FitNet
- As mentioned above ("distill from intermediate features"), let's introduce FitNet.
- Stage 1: pretrain the upper part of the student net.
- Stage 2: act like the original (baseline) KD.
- Problem: the teacher's features may contain a lot of useless information.
- The IE-KD authors argue that the knowledge should be compacted first.
[Diagram: FitNet Stage 1 — the Dataset is fed to the Teacher Net (U) and the Student Net (U); a Regressor transforms S's feat to match T's feat under an L2 loss.]
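A minimal sketch of FitNet's Stage-1 hint training, assuming a 1x1-conv regressor that maps the student's features to the teacher's channel count (the channel sizes below are illustrative):

```python
import torch.nn as nn
import torch.nn.functional as F

# Regressor: adapts the student's feature shape to the teacher's (sizes are illustrative).
regressor = nn.Conv2d(in_channels=256, out_channels=512, kernel_size=1)

def fitnet_hint_loss(student_feat, teacher_feat):
    # Transform the student's feature, then match the teacher's feature with an L2 loss.
    return F.mse_loss(regressor(student_feat), teacher_feat.detach())
```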
Stage 1: Compact Knowledge Extraction
- Make the teacher net's feature output more "compact" for later use (an auto-encoder is used for this; see the sketch after the diagram).
- Reconstruction loss: L2 loss.
[Diagram: the Dataset is fed to the Teacher Net (U) to produce T's feat; Encoder T compresses it into a compact T's feat, and Decoder T reconstructs it for the L2 reconstruction loss.]
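A minimal sketch of Stage 1 under these assumptions: the teacher's feature map is compressed by an encoder and reconstructed by a decoder, trained only with the L2 reconstruction loss (the 1x1-conv layers and channel sizes are illustrative, not the paper's exact architecture).

```python
import torch.nn as nn
import torch.nn.functional as F

class TeacherAutoEncoder(nn.Module):
    """Compress the teacher's feature into a compact code and reconstruct it."""
    def __init__(self, channels=512, compact=128):
        super().__init__()
        self.encoder = nn.Conv2d(channels, compact, kernel_size=1)  # compact T's feat
        self.decoder = nn.Conv2d(compact, channels, kernel_size=1)  # reconstruct T's feat

    def forward(self, t_feat):
        compact = self.encoder(t_feat)
        recon = self.decoder(compact)
        rec_loss = F.mse_loss(recon, t_feat)  # reconstruction (L2) loss
        return compact, rec_loss
```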
Stage 2: IE-KD
- Calculate the goal (task) loss, the inheritance loss, and the exploration loss.
- Inheritance loss: the inheritance part should be similar to the teacher after the encoder.
- Exploration loss: the exploration part should be different from the teacher after the encoder.
- There are multiple choices for these losses: we can adopt previous KD works for the inheritance loss and choose the opposite function for the exploration loss (a sketch follows the table below).
Method Name | Inheritance loss | Exploration loss
IE-AT | |
IE-FT | |
IE-OD | |
* means attention map of features
(The loss formulas are shown as equations on the slide.)
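As one hedged reading of the IE-AT row (not the paper's exact formulas): the inheritance loss pulls the inheritance part's attention map toward the attention map of the teacher's (encoded) feature, and the exploration loss simply flips the sign to push the exploration part away. The features are assumed to already be in comparable, encoded form.

```python
import torch
import torch.nn.functional as F

def attention_map(feat):
    # Spatial attention map: channel-wise mean of squared activations, L2-normalized.
    att = feat.pow(2).mean(dim=1).flatten(1)      # (B, H*W)
    return F.normalize(att, dim=1)

def ie_at_losses(inh_feat, exp_feat, teacher_feat):
    a_t = attention_map(teacher_feat.detach())
    # Inheritance part SHOULD match the teacher's attention map (AT-style L2 distance).
    inh_loss = (attention_map(inh_feat) - a_t).pow(2).mean()
    # Exploration part SHOULD NOT match it: one simple "opposite function" is a sign flip.
    exp_loss = -(attention_map(exp_feat) - a_t).pow(2).mean()
    return inh_loss, exp_loss
```

In practice the exploration term would be weighted or bounded; the sign flip here is only meant to illustrate the "opposite function" idea.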
- All 3 encoders are different.
- Fun question: why do exploration features work? What if the encoder of the exploration part just output random features?
Experiments
Are the inh part and the exp part different?
- Inh part: the tail looks like a crocodile's.
- Exp part: the ears are also important.
- Inh part: the head looks like a seal's.
- Exp part: the turtle shell should also be considered.
Generalization Evidence
Add Gaussian noise at a middle layer and observe how the loss changes (a rough sketch follows below).
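A rough sketch of this probe, assuming a PyTorch forward hook that perturbs a middle layer's output with Gaussian noise (the layer name and sigma are hypothetical):

```python
import torch

def add_gaussian_noise(module, inputs, output, sigma=0.1):
    # Returning a tensor from a forward hook replaces the layer's output.
    return output + sigma * torch.randn_like(output)

# handle = model.layer2.register_forward_hook(add_gaussian_noise)  # hypothetical layer
# ... evaluate and record how much the loss increases, then: handle.remove()
```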
Are inh and exp really different?
CKA (Centered Kernel Alignment, ICML'19) similarity: a method to measure the similarity between feature representations.
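For reference, a minimal sketch of linear CKA as defined in Kornblith et al. (ICML'19), applied to two flattened feature matrices over the same batch of examples:

```python
import torch

def linear_cka(x, y):
    """x: (n, d1) and y: (n, d2) features for the same n examples."""
    x = x - x.mean(dim=0, keepdim=True)   # center each feature dimension
    y = y - y.mean(dim=0, keepdim=True)
    hsic = (y.t() @ x).norm() ** 2        # ||Y^T X||_F^2
    return hsic / ((x.t() @ x).norm() * (y.t() @ y).norm())
```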
SOTA Accuracy
Dataset: CIFAR-10, Criterion: Error rate
Dataset: ImageNet, ResNet34 -> ResNet18, Criterion: Error rate
Dataset: PASCAL VOC 2007 (object detection), ResNet50 -> ResNet18, Criterion: mAP
I & E are both important!
Proposed: IE-DML
(Inheritance and Exploration DML Framework)
What is DML?
[Diagram: Deep Mutual Learning — Student Net 1 and Student Net 2 both take the Dataset; S1's result and S2's result each receive a hard loss against the GT, plus a soft loss between the two students' results.]
- DML (Deep Mutual Learning, CVPR'18):
- Two student nets train each other iteratively (no pre-trained teacher needed); a sketch follows below.
- It beat all KD-based methods at the time.
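A minimal sketch of one DML step for two peer students, where each student is trained with its own hard loss plus a KL soft loss toward the other student's prediction (the other's distribution is treated as a fixed target within the step):

```python
import torch.nn.functional as F

def dml_losses(logits_1, logits_2, labels):
    # Hard losses against the ground truth.
    hard_1 = F.cross_entropy(logits_1, labels)
    hard_2 = F.cross_entropy(logits_2, labels)
    # Soft losses: each student mimics the other's predictive distribution.
    soft_1 = F.kl_div(F.log_softmax(logits_1, dim=1),
                      F.softmax(logits_2, dim=1).detach(), reduction="batchmean")
    soft_2 = F.kl_div(F.log_softmax(logits_2, dim=1),
                      F.softmax(logits_1, dim=1).detach(), reduction="batchmean")
    return hard_1 + soft_1, hard_2 + soft_2
```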
IE-DML
[Diagram: IE-DML with Student Network 1 and Student Network 2.]
- Loss in IE-KD: the reconstruction loss is handled separately in Stage 1.
- Loss in IE-DML: the auto-encoder is trained jointly with everything else (single stage), as sketched below.
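A rough sketch of how the IE-DML total loss for one student could be assembled under that single-stage assumption; the weights are illustrative, not the paper's values.

```python
def ie_dml_total_loss(hard_loss, soft_loss, inh_loss, exp_loss, rec_loss,
                      w_soft=1.0, w_inh=1.0, w_exp=1.0, w_rec=1.0):
    # Unlike IE-KD, the auto-encoder's reconstruction loss is optimized jointly here,
    # together with the DML terms and the inheritance / exploration terms.
    return (hard_loss + w_soft * soft_loss
            + w_inh * inh_loss + w_exp * exp_loss
            + w_rec * rec_loss)
```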
IE-DML Experiments
- Here we learn: most KD methods whose distillation target is an intermediate feature can adopt this framework.
- Other techniques related to KD (like DML) can also adopt this framework.
Q & A
Paper review: Revisiting Knowledge Distillation: An Inheritance and Exploration Framework (CVPR'21)
By Arvin Liu