Revisiting Knowledge Distillation:
An Inheritance and Exploration Framework (CVPR'21)
Zhen Huang, Xu Shen, Jun Xing, Tongliang Liu, Xinmei Tian, Houqiang Li, Bing Deng, Jianqiang Huang, Xian-Sheng Hua
University of Science and Technology of China, Alibaba Group, University of Southern California, University of Sydney
What is Knowledge Distillation (KD)?
[Diagram: the Dataset is fed to a pretrained Teacher Net and a Student Net (usually smaller than or equal to the teacher). The student is trained with a hard loss between S's result and the ground truth (GT), plus a soft loss between S's result and T's result.]
- Baseline KD: mimic the logits of the teacher net.
- FitNet: mimic the intermediate features of the teacher net.
- AT: mimic the attention maps of the teacher net.
- ... Older KD methods all focus on mimicking different parts of the teacher net (a minimal sketch of the baseline objective follows).
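A minimal sketch of the baseline KD objective, assuming PyTorch; the temperature `T` and the weight `alpha` are illustrative hyper-parameters, not values from the paper:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.9):
    """Baseline KD: hard CE loss on the ground truth + soft KL loss on teacher logits."""
    # Hard loss: standard cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, targets)
    # Soft loss: KL divergence between temperature-softened distributions.
    # T^2 keeps gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * soft + (1 - alpha) * hard
```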
What's wrong with old KD?
- Teacher: because the tail of the "cheetah" resembles a "crocodile", the model makes a wrong prediction.
- Student: its attention pattern is similar to the teacher's, so it also makes the wrong prediction.
- The student CANNOT just follow the teacher blindly.

Layer-Wise Relevance Propagation (LRP): a method for visualizing the model's attention on the input image.
Proposed: IE-KD
(Inheritance and Exploration KD Framework)
Basic Ideas
[Diagram: the Teacher Net distills into a Student Net whose features are split into an inheritance part, which SHOULD be similar to the teacher's, and an exploration part, which SHOULD NOT be similar.]
- Split the student net into two parts (a minimal split sketch follows this list).
- Inheritance loss: the inheritance part should generate features similar to the teacher's.
- Exploration loss: the exploration part should generate features different from the teacher's.
- Do the shapes of the features have to be the same? (FitNet, introduced next, handles this with a regressor.)
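A minimal sketch of the student split, assuming PyTorch and that the split is simply half of the feature channels (the actual split point is a design choice):

```python
import torch

def split_student_features(feat):
    """Split student feature maps (N, C, H, W) along the channel axis
    into an inheritance half and an exploration half."""
    c = feat.size(1)
    f_inh, f_exp = feat[:, : c // 2], feat[:, c // 2 :]
    return f_inh, f_exp
```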
Related works: FitNet
- As mentioned above ("distill from intermediate features"), let's introduce FitNet.
- Stage 1: pretrain the first part of the student net (up to the guided layer) to match the teacher's hint features through a regressor.
- Stage 2: act like the original KD (baseline KD).
- Problem: T's features may contain a lot of useless "trash" information.
- The IE-KD authors therefore think the knowledge should be compacted first.
[Diagram: FitNet stage 1 — the Dataset feeds both the Teacher Net and the Student Net; the student's features pass through a Regressor so that the transformed S's feat matches T's feat under an L2 loss.]
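A minimal sketch of the FitNet stage-1 hint loss, assuming PyTorch; the regressor's channel counts are made up for illustration, and the spatial sizes of the two feature maps are assumed to match:

```python
import torch
import torch.nn as nn

# A 1x1-conv regressor maps student features to the teacher's channel count,
# so the L2 hint loss can be computed even when feature shapes differ.
# The 64 -> 256 channel counts are illustrative.
regressor = nn.Conv2d(in_channels=64, out_channels=256, kernel_size=1)

def fitnet_hint_loss(student_feat, teacher_feat):
    """FitNet stage 1: L2 loss between regressed student features and teacher features."""
    return ((regressor(student_feat) - teacher_feat) ** 2).mean()
```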
Stage 1: Compact Knowledge Extraction
- Make the feature output of the teacher net more "compact" for later use (an auto-encoder achieves this goal; see the sketch below).
- Reconstruction loss: L2 loss.

[Diagram: the Dataset feeds the Teacher Net; T's feat goes through Encoder T to produce the compact T's feat, and Decoder T reconstructs the original features for the L2 reconstruction loss.]
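A minimal sketch of the stage-1 auto-encoder, assuming PyTorch; the 1x1-conv architecture and channel counts are illustrative, not the paper's exact design:

```python
import torch
import torch.nn as nn

class FeatureAutoEncoder(nn.Module):
    """Stage 1: compress the teacher's features into a compact code,
    then reconstruct them (channel counts are illustrative)."""
    def __init__(self, c_in=256, c_code=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(c_in, c_code, 1), nn.ReLU(inplace=True))
        self.decoder = nn.Conv2d(c_code, c_in, 1)

    def forward(self, t_feat):
        code = self.encoder(t_feat)    # compact T's feat
        recon = self.decoder(code)     # reconstruction of T's feat
        # L2 reconstruction loss drives the stage-1 training.
        recon_loss = ((recon - t_feat) ** 2).mean()
        return code, recon_loss
```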
Stage 2: IE-KD
- Calculate the task (goal) loss, the inheritance loss, and the exploration loss.
- Inheritance loss: the inheritance features should be similar to the teacher's compact features after the encoder.
- Exploration loss: the exploration features should be different from the teacher's compact features after the encoder.
- There are multiple choices for these losses: we can adopt previous KD objectives as the inheritance loss and choose the opposite function as the exploration loss (see the table and sketch below).
| Method Name | Inheritance loss | Exploration loss |
| IE-AT | AT loss (on attention maps*) | opposite of the AT loss |
| IE-FT | FT loss | opposite of the FT loss |
| IE-OD | OD loss | opposite of the OD loss |
* means the attention map of the features
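A minimal sketch of one inheritance/exploration pair in the IE-AT style, assuming PyTorch; the student-side encoders are omitted, and plain negation stands in for whatever exact "opposite" formulation the paper uses:

```python
import torch
import torch.nn.functional as F

def attention_map(feat):
    """AT-style attention: channel-wise mean of squared activations, L2-normalized.
    Collapsing channels means only spatial sizes need to match."""
    a = feat.pow(2).mean(dim=1).flatten(1)   # (N, H*W)
    return F.normalize(a, dim=1)

def ie_losses(f_inh, f_exp, t_code):
    """Inheritance pulls toward the compact teacher feature `t_code`;
    exploration pushes away by negating the same distance."""
    t_att = attention_map(t_code)
    l_inh = (attention_map(f_inh) - t_att).pow(2).mean()
    l_exp = -(attention_map(f_exp) - t_att).pow(2).mean()
    return l_inh, l_exp
```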
- All 3 encoders are different.
- Fun question: why do the exploration features work at all? What if the exploration encoder just output some random features?
Experiments
Are the inheritance part and the exploration part different?
[LRP visualizations:]
- Inh part: focuses on the tail that looks like a crocodile. Exp part: the ears are also important.
- Inh part: focuses on the head that looks like a seal. Exp part: the turtle shell should also be considered.
Generalization Evidence
- Add Gaussian noise to the middle-layer features and observe how the loss changes (a probing sketch follows).
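A minimal sketch of this probe, assuming PyTorch; `model.layer2` is a hypothetical middle layer and `sigma` an illustrative noise scale:

```python
import torch

def add_noise_hook(module, sigma=0.1):
    """Attach a forward hook that perturbs a middle layer's output with
    Gaussian noise, so we can observe how much the loss degrades."""
    def hook(mod, inputs, output):
        return output + sigma * torch.randn_like(output)
    return module.register_forward_hook(hook)

# handle = add_noise_hook(model.layer2)  # e.g. a ResNet stage; illustrative
# ... evaluate the loss on held-out data ...
# handle.remove()
```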

Are inh and exp really different?
CKA (Centered Kernel Alignment, ICML'19) similarity: a method to measure the similarity between feature representations (a minimal linear-CKA sketch follows).
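A minimal sketch of linear CKA (Kornblith et al., ICML'19), assuming PyTorch and features flattened to matrices of shape (n_samples, dim):

```python
import torch

def linear_cka(X, Y):
    """Linear CKA between two feature matrices of shape (n_samples, dim).
    Returns 1.0 for identical representations (up to rotation/scale)."""
    X = X - X.mean(dim=0, keepdim=True)   # center each feature dimension
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = (Y.t() @ X).norm(p="fro") ** 2
    return hsic / ((X.t() @ X).norm(p="fro") * (Y.t() @ Y).norm(p="fro"))
```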
SOTA Accuracy
[Results tables:]
- Dataset: CIFAR-10; criterion: error rate.
- Dataset: ImageNet, ResNet-34 → ResNet-18; criterion: error rate.
- Dataset: PASCAL VOC 2007 (object detection), ResNet-50 → ResNet-18; criterion: mAP.
I & E are both important!

Proposed: IE-DML
(Inheritance and Exploration DML Framework)
What is DML?
[Diagram: Deep Mutual Learning — the Dataset feeds Student Net 1 and Student Net 2; each net has a hard loss between its result and the GT, and a soft loss between S1's result and S2's result.]
- DML (Deep Mutual Learning, CVPR'18):
  - Two student nets train each other iteratively (no pre-trained teacher anymore).
  - It beat all KD-based methods at that time (a loss sketch follows).
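A minimal sketch of the DML losses, assuming PyTorch; detaching the peer's logits is a simplification of the original alternating updates:

```python
import torch.nn.functional as F

def dml_losses(logits1, logits2, targets):
    """Deep Mutual Learning: each student gets a hard CE loss on the GT
    plus a soft KL loss toward the other student's current prediction."""
    kl_1 = F.kl_div(F.log_softmax(logits1, dim=1),
                    F.softmax(logits2.detach(), dim=1), reduction="batchmean")
    kl_2 = F.kl_div(F.log_softmax(logits2, dim=1),
                    F.softmax(logits1.detach(), dim=1), reduction="batchmean")
    loss1 = F.cross_entropy(logits1, targets) + kl_1
    loss2 = F.cross_entropy(logits2, targets) + kl_2
    return loss1, loss2
```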
IE-DML
[Diagram: Student Network 1 and Student Network 2, each split into inheritance and exploration parts, distill into each other.]
- Loss in IE-KD: the reconstruction loss is trained separately in stage 1.
- Loss in IE-DML: the auto-encoders are trained jointly with the other losses.
IE-DML Experiments

- Takeaway: most KD methods whose distillation target is intermediate features can adopt this framework.
- Other KD-related techniques (like DML) can also adopt this framework.
Q & A
Paper review: Revisiting Knowledge Distillation: An Inheritance and Exploration Framework (CVPR'21)
By Arvin Liu