Revisiting Knowledge Distillation:
An Inheritance and Exploration Framework (CVPR'21)
Zhen Huang, Xu Shen, Jun Xing, Tongliang Liu, Xinmei Tian, Houqiang Li, Bing Deng, Jianqiang Huang, Xian-Sheng Hua
University of Science and Technology of China, Alibaba Group, University of Southern California, University of Sydney
What is Knowledge Distillation?
What is KD?
Teacher Net
![](https://s3.amazonaws.com/media-p.slid.es/uploads/731397/images/7948472/pasted-from-clipboard.png)
Dataset
T's Result
Student Net
S's Result
Pretrained
GT
hard loss
soft loss
Usually smaller or equal to Teacher Net
- Baseline KD: mimic the logits of the teacher net.
- FitNet: mimic the intermediate features of the teacher net.
- AT: mimic the attention maps of the teacher net.
- Older KD methods all focus on mimicking different parts of the teacher net (a minimal sketch of the baseline follows below).
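A minimal sketch of the baseline KD objective, assuming PyTorch; the temperature `T` and weight `alpha` are typical values, not this paper's hyper-parameters:

```python
import torch.nn.functional as F

def baseline_kd_loss(student_logits, teacher_logits, target, T=4.0, alpha=0.9):
    # hard loss: cross-entropy against the ground-truth labels
    hard = F.cross_entropy(student_logits, target)
    # soft loss: KL divergence toward the teacher's softened prediction
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits.detach() / T, dim=1),
        reduction='batchmean',
    ) * (T * T)  # rescale so gradient magnitudes match the hard loss
    return alpha * soft + (1 - alpha) * hard
```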
What's Wrong with the Old KD?
- Teacher: the tail of the "cheetah" resembles a "crocodile", so the model makes a wrong prediction.
- Student: its attention pattern is similar to the teacher's, so it also makes a wrong prediction.
- The student should NOT follow the teacher blindly.
![](https://s3.amazonaws.com/media-p.slid.es/uploads/731397/images/8867493/pasted-from-clipboard.png)
Layer-Wise Relevance Propagation (LRP): a visualization of the model's attention over the input image.
Proposed: IE-KD
(Inheritance and Exploration KD Framework)
Basic Ideas
Teacher Net
Student Net
Student Net (inheritance part)
SHOULD be similar
- Split the student net's features into two parts.
- Inheritance loss: the inheritance part should generate features similar to the teacher's.
- Exploration loss: the exploration part should generate features different from the teacher's.
- Does the shape of the features have to be the same?
Student Net (exploration part)
SHOULD NOT be similar
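A minimal sketch of the two losses, assuming PyTorch and a plain L2 (dis)similarity; the paper instead plugs in losses from prior KD methods, and all inputs are assumed to already share one shape (e.g., after the encoders introduced later):

```python
import torch.nn.functional as F

def ie_losses(f_inh, f_exp, f_t):
    """Pull the inheritance part toward the teacher's feature and push
    the exploration part away from it. f_inh / f_exp are the two halves
    of the student's feature channels (e.g., student_feat.chunk(2, dim=1));
    all three tensors are assumed to have matching shapes here."""
    f_inh = F.normalize(f_inh.flatten(1), dim=1)
    f_exp = F.normalize(f_exp.flatten(1), dim=1)
    f_t = F.normalize(f_t.flatten(1), dim=1)
    loss_inh = (f_inh - f_t).pow(2).mean()    # inheritance: SHOULD be similar
    loss_exp = -(f_exp - f_t).pow(2).mean()   # exploration: SHOULD NOT be similar
    return loss_inh, loss_exp
```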
Related work: FitNet
- As mentioned above ("distill from intermediate features"), let's introduce FitNet.
- Stage 1: pretrain the first part of the student net with a hint loss from the teacher (see the sketch below).
- Stage 2: act like original KD (baseline KD).
- Problem: the teacher's features may contain a lot of useless information.
- The IE-KD authors argue that the knowledge should first be made compact.
Teacher Net (U)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/731397/images/7948472/pasted-from-clipboard.png)
Dataset
Student Net (U)
Regressor
![](https://s3.amazonaws.com/media-p.slid.es/uploads/731397/images/8867680/pasted-from-clipboard.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/731397/images/8867681/pasted-from-clipboard.png)
T's feat
S's feat
![](https://s3.amazonaws.com/media-p.slid.es/uploads/731397/images/8867680/pasted-from-clipboard.png)
S's feat (transformed)
L2 Loss
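A minimal sketch of FitNet's stage-1 hint training, assuming PyTorch; the channel sizes and 1x1-conv regressor are illustrative, not the paper's exact setup:

```python
import torch.nn as nn

# a regressor maps the student's intermediate feature to the teacher's
# shape so the two can be compared with an L2 loss (illustrative sizes)
regressor = nn.Conv2d(in_channels=64, out_channels=256, kernel_size=1)

def hint_loss(student_feat, teacher_feat):
    # pull the transformed student feature toward the teacher's feature
    return (regressor(student_feat) - teacher_feat.detach()).pow(2).mean()
```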
Stage 1: Compact Knowledge Extraction
- Make the feature output of the teacher net more "compact" for later use (an auto-encoder is used to achieve this).
- Reconstruction loss: L2-loss
![](https://s3.amazonaws.com/media-p.slid.es/uploads/731397/images/7948472/pasted-from-clipboard.png)
Dataset
Teacher Net (U)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/731397/images/8867680/pasted-from-clipboard.png)
T's feat
Encoder T
![](https://s3.amazonaws.com/media-p.slid.es/uploads/731397/images/8867681/pasted-from-clipboard.png)
Compact
T's feat
Decoder T
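A minimal sketch of stage 1, assuming PyTorch; the 1x1-conv encoder/decoder and channel sizes are illustrative stand-ins for the paper's auto-encoder:

```python
import torch.nn as nn

encoder_t = nn.Conv2d(256, 64, kernel_size=1)   # T's feat -> compact T's feat
decoder_t = nn.Conv2d(64, 256, kernel_size=1)   # compact feat -> reconstruction

def reconstruction_loss(teacher_feat):
    # L2 reconstruction loss: train the auto-encoder so the compact
    # (encoded) feature still carries the teacher's knowledge
    compact = encoder_t(teacher_feat)
    return (decoder_t(compact) - teacher_feat).pow(2).mean()
```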
- Calculate the goal (task) loss, inheritance loss & exploration loss.
- Inheritance loss: the inheritance part should be similar to the teacher's feature after the encoder.
- Exploration loss: the exploration part should be different from the teacher's feature after the encoder.
- There are multiple choices for the losses: we can adopt previous KD works for the inheritance loss and choose the opposite function for the exploration loss (see the table and sketch below).
![](https://s3.amazonaws.com/media-p.slid.es/uploads/731397/images/8867534/pasted-from-clipboard.png)
Stage 2: IE-KD
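A minimal sketch of the stage-2 objective, reusing `ie_losses` and `encoder_t` from the sketches above; in the paper the two student parts also pass through their own encoders, and the lambda weights here are illustrative:

```python
import torch.nn.functional as F

def ie_kd_total_loss(student_logits, target, s_feat, t_feat,
                     lam_inh=1.0, lam_exp=1.0):
    # goal (task) loss on the ground truth
    task = F.cross_entropy(student_logits, target)
    # split the student's channels into inheritance / exploration halves
    f_inh, f_exp = s_feat.chunk(2, dim=1)
    # compare against the teacher's compact feature (encoder frozen after stage 1)
    compact_t = encoder_t(t_feat).detach()
    loss_inh, loss_exp = ie_losses(f_inh, f_exp, compact_t)
    return task + lam_inh * loss_inh + lam_exp * loss_exp
```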
| Method Name | Inheritance loss | Exploration loss |
| --- | --- | --- |
| IE-AT | AT loss (on attention maps*) | opposite of the AT loss |
| IE-FT | FT loss | opposite of the FT loss |
| IE-OD | OD loss | opposite of the OD loss |

*means attention map of features
- All 3 encoders are different.
- Why do exploration features work? What if the exploration encoder just output random features?
![](https://s3.amazonaws.com/media-p.slid.es/uploads/731397/images/8867534/pasted-from-clipboard.png)
Fun Question
Experiments
Are the inh part and exp part different?
![](https://s3.amazonaws.com/media-p.slid.es/uploads/731397/images/8867525/pasted-from-clipboard.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/731397/images/8867528/pasted-from-clipboard.png)
Inh part: the tail looks like a crocodile's.
Exp part: the ears are also important.
Inh part: the head looks like a seal's.
Exp part: the turtle shell should also be considered.
Generalization Evidence
Add Gaussian noise at the middle layer and observe how the loss changes.
![](https://s3.amazonaws.com/media-p.slid.es/uploads/731397/images/8867673/pasted-from-clipboard.png)
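A minimal sketch of this probe, assuming the network is split into hypothetical `head` / `tail` halves around the chosen middle layer:

```python
import torch

def loss_under_feature_noise(head, tail, criterion, x, y, sigma=0.1):
    """Perturb the mid-layer features with Gaussian noise and report how
    much the loss degrades; a smaller increase suggests a flatter, more
    robust solution."""
    with torch.no_grad():
        feats = head(x)                              # middle-layer features
        noisy = feats + sigma * torch.randn_like(feats)
        clean_loss = criterion(tail(feats), y)
        noisy_loss = criterion(tail(noisy), y)
    return (noisy_loss - clean_loss).item()
```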
Are inh and exp really different?
![](https://s3.amazonaws.com/media-p.slid.es/uploads/731397/images/8867834/pasted-from-clipboard.png)
CKA (Centered Kernel Alignment, ICML'19) similarity: a method to measure the similarity between feature maps (see the sketch below).
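A minimal sketch of linear CKA (Kornblith et al., ICML'19), assuming two feature matrices of shape (n_samples, dim):

```python
import torch

def linear_cka(X, Y):
    # center each feature dimension
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    # CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = (Y.t() @ X).norm(p='fro') ** 2
    den = (X.t() @ X).norm(p='fro') * (Y.t() @ Y).norm(p='fro')
    return (num / den).item()
```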
SOTA Accuracy
![](https://s3.amazonaws.com/media-p.slid.es/uploads/731397/images/8867664/pasted-from-clipboard.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/731397/images/8867666/pasted-from-clipboard.png)
Dataset: CIFAR-10, Criterion: Error rate
![](https://s3.amazonaws.com/media-p.slid.es/uploads/731397/images/8867668/pasted-from-clipboard.png)
Dataset: ImageNet, ResNet34 -> ResNet18, Criterion: Error rate
Dataset: PASCAL VOC (2007, for object detection),
ResNet50 -> ResNet18, Criterion: mAP
I & E are both important!
![](https://s3.amazonaws.com/media-p.slid.es/uploads/731397/images/8867821/pasted-from-clipboard.png)
Proposed: IE-DML
(Inheritance and Exploration DML Framework)
What is DML?
Student Net 1
![](https://s3.amazonaws.com/media-p.slid.es/uploads/731397/images/7948472/pasted-from-clipboard.png)
Dataset
S1's Result
Student Net 2
GT
hard loss
soft loss
- DML (Deep Mutual Learning, CVPR'18):
- Two student nets train together, each mimicking the other (no pre-trained teacher needed).
- It beat all KD-based methods at the time (a minimal sketch follows after the diagram).
Deep Mutual Learning
S2's Result
GT
hard loss
hard loss
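A minimal sketch of one DML step, assuming PyTorch; in practice the two networks are updated alternately within each batch:

```python
import torch.nn.functional as F

def dml_losses(logits1, logits2, target):
    # each student: hard loss on the ground truth + KL mimicry loss
    # toward the other student's (detached) soft prediction
    p1 = F.softmax(logits1, dim=1)
    p2 = F.softmax(logits2, dim=1)
    loss1 = F.cross_entropy(logits1, target) + F.kl_div(
        F.log_softmax(logits1, dim=1), p2.detach(), reduction='batchmean')
    loss2 = F.cross_entropy(logits2, target) + F.kl_div(
        F.log_softmax(logits2, dim=1), p1.detach(), reduction='batchmean')
    return loss1, loss2
```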
IE-DML
![](https://s3.amazonaws.com/media-p.slid.es/uploads/731397/images/8867534/pasted-from-clipboard.png)
Student Network 1
Student Network 2
- Loss in IE-KD: the reconstruction loss is trained separately in stage 1.
- Loss in IE-DML: the auto-encoder is trained jointly with everything else (sketch below).
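A minimal sketch of one student's IE-DML loss, reusing `ie_losses` from above; the peer's encoder and the student's own auto-encoder are hypothetical 1x1-conv modules as in the stage-1 sketch, and loss weights are omitted:

```python
import torch.nn.functional as F

def ie_dml_loss(logits, target, own_feat, peer_feat,
                own_encoder, own_decoder, peer_encoder):
    hard = F.cross_entropy(logits, target)                 # hard loss on GT
    f_inh, f_exp = own_feat.chunk(2, dim=1)                # split own channels
    compact_peer = peer_encoder(peer_feat).detach()        # peer's compact feat
    loss_inh, loss_exp = ie_losses(f_inh, f_exp, compact_peer)
    # auto-encoder reconstruction loss, trained jointly (no separate stage 1)
    recon = (own_decoder(own_encoder(own_feat)) - own_feat).pow(2).mean()
    return hard + loss_inh + loss_exp + recon
```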
IE-DML Experiments
![](https://s3.amazonaws.com/media-p.slid.es/uploads/731397/images/8867817/pasted-from-clipboard.png)
- Takeaway: most KD methods whose distillation target is an intermediate feature can adopt this framework.
- Other techniques related to KD (like DML) can also adopt this framework.
Q & A
Paper review: Revisiting Knowledge Distillation: An Inheritance and Exploration Framework (CVPR'21)
By Arvin Liu