Revisiting Knowledge Distillation:
An Inheritance and Exploration Framework (CVPR'21)
Zhen Huang, Xu Shen, Jun Xing, Tongliang Liu, Xinmei Tian, Houqiang Li, Bing Deng, Jianqiang Huang, Xian-Sheng Hua
University of Science and Technology of China, Alibaba Group, University of Southern California, University of Sydney
What is Knowledge Distillation?
What is KD?
Usually smaller than or equal to the Teacher Net
- Baseline KD: mimic the (softened) logits of the teacher net (see the sketch after this list).
- FitNet: mimic the intermediate features of the teacher net.
- AT: mimic the attention maps of the teacher net.
- Older KD methods all focus on mimicking different parts of the teacher net.
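A minimal sketch of the baseline (logit-mimicking) KD loss, assuming a standard classification setup; the temperature and loss weight below are illustrative hyper-parameters, not values from the paper.

import torch.nn.functional as F

def baseline_kd_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.9):
    # Soften both distributions with the temperature and match them with KL divergence.
    soft_student = F.log_softmax(student_logits / temperature, dim=1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    # Keep the ordinary cross-entropy on the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce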
What's bad about old KD?
- Teacher: because the tail of the "cheetah" resembles a "crocodile", the model makes a wrong prediction.
- Student: its attention pattern is similar to the teacher's, so it also makes the wrong prediction.
- The student CANNOT follow the teacher blindly.
Layer-Wise Relevance Propagation (LRP): a visualization of the attention map over the input image.
(Inheritance and Exploration KD Framework)
- Split the student net's features into two parts (see the sketch after this list).
- Inheritance loss: the inheritance part should generate features similar to the teacher's.
- Exploration loss: the exploration part should generate features different from the teacher's.
- Does the shape of the features have to be the same?
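A minimal sketch of the split, assuming (purely for illustration) that the student's feature map is divided along the channel dimension into an inheritance half and an exploration half:

import torch

def split_student_features(feat: torch.Tensor):
    # feat: (N, C, H, W) student feature map; channel-wise split is an illustrative choice.
    c = feat.shape[1] // 2
    feat_inh = feat[:, :c]   # inheritance part: pulled towards the teacher
    feat_exp = feat[:, c:]   # exploration part: pushed away from the teacher
    return feat_inh, feat_exp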
Related work: FitNet
- As mentioned above ("distill from intermediate features"), let's introduce FitNet.
- Stage 1: pre-train the first part of the student net by matching the teacher's intermediate (hint) features.
- Stage 2: train like the original (baseline) KD.
- Problem: the teacher's features may contain too much redundant or noisy information.
- The IE-KD authors argue that the knowledge should be compacted first.
Teacher Net (U)
Student Net (U)
Student's feature transform
Stage 1: Compact Knowledge Extraction
- Make the teacher net's output features more "compact" for later use (an auto-encoder achieves this; see the sketch below).
- Reconstruction loss: L2-loss
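A minimal sketch of Stage 1, assuming a simple 1x1-convolution auto-encoder over the teacher's feature map; the channel sizes and architecture are illustrative, not the paper's exact design.

import torch
import torch.nn as nn

class FeatureAutoEncoder(nn.Module):
    def __init__(self, in_channels=512, code_channels=128):
        # Illustrative channel sizes.
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, code_channels, kernel_size=1),
            nn.ReLU(inplace=True),
        )
        self.decoder = nn.Conv2d(code_channels, in_channels, kernel_size=1)

    def forward(self, teacher_feat):
        code = self.encoder(teacher_feat)    # compact knowledge
        recon = self.decoder(code)           # reconstruction of the teacher feature
        return code, recon

def reconstruction_loss(teacher_feat, recon):
    # L2 reconstruction loss used to train the auto-encoder in Stage 1.
    return torch.mean((teacher_feat - recon) ** 2)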
Teacher Net (U)
- Calculate the task loss, the inheritance loss, and the exploration loss.
- Inheritance loss: the encoded inheritance features should be similar to the encoded teacher features.
- Exploration loss: the encoded exploration features should be different from the encoded teacher features.
- There are multiple choices for these losses: we can adopt a previous KD loss for inheritance and choose an opposite function for exploration (see the sketch after this list).
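A minimal sketch of the two losses, assuming an AT-style attention similarity for inheritance and a similarity penalty as its "opposite" for exploration; these exact loss forms are illustrative choices, not the paper's prescribed ones.

import torch.nn.functional as F

def attention_vector(feat):
    # Spatial attention of a (N, C, H, W) feature map, flattened and L2-normalized.
    att = feat.pow(2).mean(dim=1).flatten(1)
    return F.normalize(att, dim=1)

def inheritance_loss(student_inh_code, teacher_code):
    # Pull the encoded inheritance features towards the encoded teacher features.
    return (attention_vector(student_inh_code) - attention_vector(teacher_code)).pow(2).mean()

def exploration_loss(student_exp_code, teacher_code):
    # Push the encoded exploration features away from the teacher by penalizing
    # similarity (one hypothetical choice of "opposite" loss).
    sim = (attention_vector(student_exp_code) * attention_vector(teacher_code)).sum(dim=1)
    return sim.mean()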
Stage 2: IE-KD
* denotes the attention map of the features
- All 3 encoders are different.
- Why do exploration features work? What if the exploration encoder just outputs random features?
Are the inheritance (inh) and exploration (exp) parts different?
Inh part: the tail looks like a crocodile.
Exp part: the ears are also important.
Inh part: the head looks like a seal.
Exp part: the turtle's shell should also be considered.
Add Gaussian noise at the middle layer and observe how the losses change (a sketch follows).
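A minimal sketch of this probe; model_tail (the layers after the split) and the noise scale are hypothetical names and values for illustration.

import torch
import torch.nn.functional as F

def perturb_and_measure(model_tail, feat_inh, feat_exp, labels, sigma=0.1):
    # model_tail is a hypothetical module consuming the concatenated feature parts.
    with torch.no_grad():
        # Baseline task loss with clean features.
        clean = F.cross_entropy(model_tail(torch.cat([feat_inh, feat_exp], dim=1)), labels)
        # Perturb only the inheritance part.
        noisy_inh = feat_inh + sigma * torch.randn_like(feat_inh)
        loss_inh = F.cross_entropy(model_tail(torch.cat([noisy_inh, feat_exp], dim=1)), labels)
        # Perturb only the exploration part.
        noisy_exp = feat_exp + sigma * torch.randn_like(feat_exp)
        loss_exp = F.cross_entropy(model_tail(torch.cat([feat_inh, noisy_exp], dim=1)), labels)
    return (loss_inh - clean).item(), (loss_exp - clean).item()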
Are the inh and exp parts really different?
CKA (Centered Kernel Alignment, ICML'19) similarity: a method to measure the similarity between feature representations (a sketch follows).
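A minimal sketch of linear CKA (Kornblith et al., ICML'19); X and Y are (num_examples, num_features) activation matrices from the two parts being compared.

import torch

def linear_cka(X, Y):
    # Center each feature dimension.
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    # CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    dot = torch.norm(Y.t() @ X, p="fro") ** 2
    return (dot / (torch.norm(X.t() @ X, p="fro") * torch.norm(Y.t() @ Y, p="fro"))).item()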
Dataset: CIFAR-10, Criterion: Error rate
Dataset: ImageNet, ResNet-34 -> ResNet-18, Criterion: Error rate
Dataset: PASCAL VOC 2007 (object detection), ResNet-50 -> ResNet-18, Criterion: mAP
I & E are both important!
(Inheritance and Exploration DML Framework)
What is DML?
Student Net 1
Student Net 2
- DML (Deep Mutual Learning, CVPR'18):
- Let two student nets train each other iteratively (no pre-trained teacher needed); see the sketch after this list.
- It beat all KD-based methods at the time.
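A minimal sketch of the DML objective: each student minimizes its own cross-entropy plus a KL term towards the other student's current predictions (no pre-trained teacher involved).

import torch.nn.functional as F

def dml_losses(logits1, logits2, labels):
    p1 = F.softmax(logits1, dim=1)
    p2 = F.softmax(logits2, dim=1)
    # Each student mimics the other's (detached) prediction while fitting the labels.
    loss1 = F.cross_entropy(logits1, labels) + \
            F.kl_div(F.log_softmax(logits1, dim=1), p2.detach(), reduction="batchmean")
    loss2 = F.cross_entropy(logits2, labels) + \
            F.kl_div(F.log_softmax(logits2, dim=1), p1.detach(), reduction="batchmean")
    return loss1, loss2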
Deep Mutual Learning
Student Network 2
- Loss in IE-KD: the reconstruction loss is handled separately in Stage 1.
- Loss in IE-DML: the auto-encoder is trained jointly with the other losses (a sketch follows).
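A minimal sketch of how the IE-DML total loss might be assembled, with the reconstruction term optimized jointly instead of in a separate stage; the weights are illustrative hyper-parameters.

def ie_dml_total_loss(task_loss, inh_loss, exp_loss, recon_loss,
                      w_inh=1.0, w_exp=1.0, w_rec=1.0):
    # Unlike IE-KD (auto-encoder pre-trained in Stage 1), all terms are trained together.
    return task_loss + w_inh * inh_loss + w_exp * exp_loss + w_rec * recon_loss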
- Here we can see: most KD methods whose distillation target is an intermediate feature can adopt this framework.
- Other techniques related to KD (like DML) can also adopt this framework.
Q & A
Paper review: Revisiting Knowledge Distillation: An Inheritance and Exploration Framework (CVPR'21)
By Arvin Liu