Revisiting Knowledge Distillation:
An Inheritance and Exploration Framework (CVPR'21)

Zhen Huang, Xu Shen, Jun Xing, Tongliang Liu, Xinmei Tian, Houqiang Li, Bing Deng, Jianqiang Huang, Xian-Sheng Hua

University of Science and Technology of China, Alibaba Group, University of Southern California, University of Sydney

What is Knowledge Distillation?

What is KD?

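[Figure: KD pipeline. The Dataset is fed to a pretrained Teacher Net and to a Student Net (usually smaller than or equal to the Teacher Net). The hard loss compares S's result with the GT, e.g. [0, 0, 1]; the soft loss compares S's result with T's result, e.g. [0, 0.3, 0.7].]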
  • Baseline KD: mimic the logits of the teacher net.
  • FitNet: mimic the intermediate value of the teacher net.
  • AT: mimic the attention map of the teacher net.
    ...
  • Older KD methods focus on mimicking different parts of the teacher net.
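
The hard loss and soft loss of baseline KD can be sketched as follows. This is a minimal, dependency-free illustration, not the paper's implementation: the weighting `alpha`, the `temperature` parameter, the use of cross-entropy for the soft loss, and the function names are all assumptions introduced here.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature gives softer outputs."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, pred, eps=1e-12):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, pred))

def kd_loss(student_logits, teacher_probs, gt, alpha=0.5, temperature=1.0):
    """Weighted sum of the hard loss (vs. GT) and the soft loss (vs. teacher).

    alpha and temperature are hypothetical hyperparameters for this sketch.
    """
    hard = cross_entropy(gt, softmax(student_logits))
    soft = cross_entropy(teacher_probs, softmax(student_logits, temperature))
    return (1.0 - alpha) * hard + alpha * soft

# Values from the slide: GT is one-hot, the teacher's result is soft.
gt = [0.0, 0.0, 1.0]
teacher_probs = [0.0, 0.3, 0.7]
loss = kd_loss([0.0, 0.0, 0.0], teacher_probs, gt)
```

With all-zero student logits the student predicts a uniform distribution, so both terms reduce to log 3 here.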

What's Wrong with Old KD?

  • Teacher: because the tail of the “cheetah” resembles a “crocodile”, the model makes a wrong prediction.
  • Student: its attention pattern is similar to the teacher's, so it also makes a wrong prediction.
  • The student should NOT follow the teacher blindly.

Layer-Wise Relevance Propagation (LRP): a visualization of the attention map on the input image.

Proposed: IE-KD

(Inheritance and Exploration KD Framework)

Basic Ideas

[Figure: Teacher Net and Student Net. The Student Net (inheritance part) SHOULD be similar to the Teacher Net.]

  • Split the student net into two parts.
    • Inheritance loss: the inheritance part should generate features that are similar to the teacher's.
    • Exploration loss: the exploration part should generate features that are different from the teacher's.
  • Do the shapes of the features have to be the same?

[Figure: the Student Net (exploration part) SHOULD NOT be similar to the Teacher Net.]

Related works: FitNet

  • Since we mentioned distilling from intermediate values, let's introduce FitNet.
    • Stage 1: pretrain the upper part of the student net.
    • Stage 2: train as in the original KD (Baseline KD).
  • Problem: T's features may contain a lot of useless information.
  • The authors of IE-KD argue that the knowledge should be made compact first.

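[Figure: FitNet stage 1. The Dataset is fed to Teacher Net (U) and Student Net (U), producing T's feat and S's feat. A Regressor transforms S's feat, and an L2 loss is computed between T's feat and the transformed S's feat.]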

Stage 1: Compact Knowledge Extraction

  • Make the teacher net's feature output more "compact" for later use, by training an auto-encoder on it.
  • Reconstruction loss: L2 loss.
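
One way to write this stage-1 objective is given below. The symbols are introduced here only for illustration: F_T is the teacher's feature, E_T and D_T are the encoder and decoder of the auto-encoder. Using the squared L2 norm is an assumption; the slide only says "L2 loss".

```latex
\[
L_{\text{reconstruction}} = \left\| F_T - D_T\big(E_T(F_T)\big) \right\|_2^2
\]
```

The compact teacher feature used in stage 2 would then be E_T(F_T).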

[Figure: Dataset -> Teacher Net (U) -> T's feat -> Encoder T -> compact T's feat -> Decoder T.]

Stage 2: IE-KD

  • Calculate the goal loss, the inheritance loss, and the exploration loss.
    • Inheritance loss: after the encoder, the inheritance features should be similar to the teacher's.
    • Exploration loss: after the encoder, the exploration features should be different from the teacher's.
  • There are multiple choices for the losses: we can adopt the loss of a previous KD work as the inheritance loss and use the opposite (negated) function as the exploration loss.

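Method name, inheritance loss, and exploration loss (A and B are the two features being compared):

  • IE-AT (A* and B* denote the attention maps of the features)
    • Inheritance loss: \left\| \frac{A^*}{\|A^*\|_2} - \frac{B^*}{\|B^*\|_2} \right\|_1
    • Exploration loss: -\left\| \frac{A^*}{\|A^*\|_2} - \frac{B^*}{\|B^*\|_2} \right\|_1
  • IE-FT
    • Inheritance loss: \left\| \frac{A}{\|A\|_2} - \frac{B}{\|B\|_2} \right\|_1
    • Exploration loss: -\left\| \frac{A}{\|A\|_2} - \frac{B}{\|B\|_2} \right\|_1
  • IE-OD
    • Inheritance loss: \| \max(A, 0) - \max(B, 0) \|_2
    • Exploration loss: -\| \max(A, 0) - \max(B, 0) \|_2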
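
The loss formulas above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: features are flattened into lists, the feature values are made up, and the attention map is assumed to be the sum of squared activations across channels (the slide does not define it). All function names are hypothetical.

```python
import math

def l2_norm(v):
    """Euclidean norm of a flat vector."""
    return math.sqrt(sum(x * x for x in v))

def normalized_l1(a, b):
    """|| a/||a||_2 - b/||b||_2 ||_1 for two flat, non-zero feature vectors."""
    na, nb = l2_norm(a), l2_norm(b)
    return sum(abs(x / na - y / nb) for x, y in zip(a, b))

def relu_l2(a, b):
    """|| max(a, 0) - max(b, 0) ||_2 for two flat feature vectors."""
    return l2_norm([max(x, 0.0) - max(y, 0.0) for x, y in zip(a, b)])

def attention_map(channels):
    """Assumed attention map: sum of squared activations across channels."""
    return [sum(c[i] ** 2 for c in channels) for i in range(len(channels[0]))]

def exploration(inheritance_loss, a, b):
    """Exploration loss as the negated inheritance loss."""
    return -inheritance_loss(a, b)

student = [3.0, 4.0]  # made-up student feature (A)
teacher = [4.0, 3.0]  # made-up teacher feature (B)
inh_loss = normalized_l1(student, teacher)
exp_loss = exploration(normalized_l1, student, teacher)
```

Minimizing `inh_loss` pulls the two normalized features together, while minimizing `exp_loss` pushes them apart. For the attention-based variant, `attention_map` would be applied to both features before calling `normalized_l1`.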

Fun Question

  • All 3 encoders are different.
    • Why do the exploration features work? What if the exploration encoder just outputs random features?

Experiments

Are the inh part and the exp part different?

Inh part: the tail looks like a crocodile.

Exp part: the ears are also important.

Inh part: the head looks like a seal.

Exp part: the turtle shell should also be considered.

Generalization Evidence

Add Gaussian noise at the middle layer
and observe how the loss changes.

Are inh and exp really different?

CKA (Centered Kernel Alignment, ICML'19) similarity: a method to measure the similarity between feature maps.
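
For reference, the linear variant of CKA can be sketched as below. This is a small dependency-free illustration, assuming each feature is given as a matrix with one row per example and one column per feature dimension. Whether the paper uses the linear or a kernel variant is not stated on the slide, and the function names are hypothetical.

```python
import math

def center_columns(x):
    """Subtract the per-column mean so every feature dimension has zero mean."""
    n = len(x)
    means = [sum(row[j] for row in x) / n for j in range(len(x[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in x]

def cross_frobenius_sq(x, y):
    """Squared Frobenius norm of Y^T X for two matrices with the same row count."""
    total = 0.0
    for i in range(len(y[0])):
        for j in range(len(x[0])):
            s = sum(y[k][i] * x[k][j] for k in range(len(x)))
            total += s * s
    return total

def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered features."""
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(xc, yc)
    denominator = math.sqrt(cross_frobenius_sq(xc, xc) * cross_frobenius_sq(yc, yc))
    return numerator / denominator

# Made-up features for 4 examples.
feat_a = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 3.0]]
same = linear_cka(feat_a, feat_a)
```

A value near 1 means the two feature sets are very similar, and a value near 0 means they are dissimilar, which is the property the inh/exp comparison relies on.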

SOTA Accuracy

Dataset: CIFAR-10, Criterion: Error rate

Dataset: ImageNet, ResNet34 -> ResNet18, Criterion: Error rate

Dataset: PASCAL VOC (2007, for object detection),
ResNet50 -> ResNet18, Criterion: mAP

I & E are both important!

Proposed: IE-DML

(Inheritance and Exploration DML Framework)

What is DML?

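[Figure: DML. The Dataset is fed to Student Net 1 and Student Net 2. Each student has a hard loss against the GT, e.g. [0, 0, 1], and a soft loss against the other student's result, e.g. S1's result [0, 0.3, 0.7].]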
  • DML (Deep Mutual Learning, CVPR'18):
    • Train two student nets iteratively, each learning from the other (no pretrained teacher is needed).
    • It beat all KD-based methods at that time.
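
The mutual-learning idea can be sketched as a pair of losses, one per student. This is a minimal illustration, assuming each student combines a hard loss against the GT with a KL-divergence term that pulls it toward its peer's prediction. The unweighted sum, the function names, and the example probabilities are assumptions of this sketch.

```python
import math

def cross_entropy(target, pred, eps=1e-12):
    """Hard loss: cross-entropy between the GT and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, pred))

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); terms where p is zero contribute nothing."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def dml_losses(gt, p1, p2):
    """Return (loss for student 1, loss for student 2).

    Each student is pulled toward the GT and toward its peer's prediction.
    """
    loss1 = cross_entropy(gt, p1) + kl_divergence(p2, p1)
    loss2 = cross_entropy(gt, p2) + kl_divergence(p1, p2)
    return loss1, loss2

gt = [0.0, 0.0, 1.0]
s1 = [0.05, 0.25, 0.70]  # made-up prediction of student 1
s2 = [0.10, 0.10, 0.80]  # prediction of student 2, as on the slide
loss1, loss2 = dml_losses(gt, s1, s2)
```

In training, the two losses would be minimized in alternating steps, so each student's target keeps changing as its peer improves.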

[Figure: Deep Mutual Learning. S2's result, e.g. [0.1, 0.1, 0.8], is compared with the GT, [0, 0, 1]; each of the two students has its own hard loss.]

IE-DML

[Figure: Student Network 1 and Student Network 2.]

  • Loss in IE-KD:
    • L = L_{\text{goal}} + \alpha L_{\text{inheritance}} + \beta L_{\text{exploration}}
    • Reconstruction loss is in stage 1.
  • Loss in IE-DML: Jointly train auto-encoder
    • L = L_{\text{goal}} + \alpha L_{\text{inheritance}} + \beta L_{\text{exploration}} + \gamma L_{\text{reconstruction}}

IE-DML Experiments

  • What we learn here: most KD methods whose distillation target is an intermediate value can adopt this framework.
  • Other techniques related to KD (like DML) can also adopt this framework.

Q & A