Revisiting Knowledge Distillation:
An Inheritance and Exploration Framework (CVPR'21)

Zhen Huang, Xu Shen, Jun Xing, Tongliang Liu, Xinmei Tian, Houqiang Li, Bing Deng, Jianqiang Huang, Xian-Sheng Hua

University of Science and Technology of China, Alibaba Group, University of Southern California, University of Sydney

What is Knowledge Distillation?

What is KD?

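[Diagram: Dataset -> Teacher Net (pretrained) -> T's result, e.g. [0, 0.3, 0.7]; Dataset -> Student Net (usually smaller than or equal to Teacher Net) -> S's result. Hard loss: S's result vs. the label, e.g. [0, 0, 1]. Soft loss: S's result vs. T's result.]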

Baseline KD: mimic the logits of the teacher net.
FitNet: mimic the intermediate features of the teacher net.
AT: mimic the attention map of the teacher net.
...
Earlier KD methods focus on mimicking different parts of the teacher net.
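
A minimal sketch of the baseline KD objective, assuming the common formulation: cross-entropy as the hard loss and a temperature-softened KL divergence as the soft loss. The names `kd_loss`, `temperature`, and `alpha` are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Row-wise softmax; a higher temperature gives a softer distribution."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.5):
    """Weighted sum of the hard loss (vs. labels) and the soft loss (vs. teacher)."""
    n = student_logits.shape[0]
    # Hard loss: cross-entropy between the student's prediction and the true label.
    p_student = softmax(student_logits)
    hard = -np.log(p_student[np.arange(n), labels] + 1e-12).mean()
    # Soft loss: KL(teacher || student) on temperature-softened distributions.
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    soft = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1).mean()
    # The T^2 factor is the usual rescaling of the soft term (an assumed convention).
    return (1 - alpha) * hard + alpha * temperature ** 2 * soft

teacher_logits = np.array([[0.0, 2.0, 3.0]])
student_logits = np.array([[0.5, 1.0, 2.5]])
labels = np.array([2])
loss = kd_loss(student_logits, teacher_logits, labels)
```

When the student's logits equal the teacher's, the soft term vanishes and only the hard loss remains.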

What's Wrong with Old KD?

Teacher: the tail of the “cheetah” resembles a “crocodile”, so the model makes a wrong prediction.
Student: its attention pattern is similar to the teacher's, so it also makes a wrong prediction.
The student CANNOT follow the teacher blindly.

Layer-Wise Relevance Propagation (LRP): a visualization of the attention map on the input image.

Proposed: IE-KD

(Inheritance and Exploration KD Framework)

Basic Ideas

[Diagram: Teacher Net vs. Student Net (inheritance part): features SHOULD be similar.]

Split the student net into two parts.
- Inheritance loss: the inheritance part should generate features that are similar to the teacher's.
- Exploration loss: the exploration part should generate features that are different from the teacher's.
Does the shape of the features have to be the same?

[Diagram: Student Net (exploration part): features SHOULD NOT be similar to the teacher's.]
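
A minimal sketch of the split, assuming the student's feature map is divided along the channel axis into two equal halves. The slides only say the student is split into two parts, so the axis, the equal sizes, and the name `split_student_features` are assumptions.

```python
import numpy as np

def split_student_features(feat):
    """Split a (batch, channels, height, width) feature map into two channel halves."""
    channels = feat.shape[1]
    if channels % 2 != 0:
        raise ValueError("this sketch assumes an even number of channels")
    inh, exp = np.split(feat, 2, axis=1)  # first half: inheritance, second half: exploration
    return inh, exp

student_feat = np.arange(2 * 8 * 4 * 4, dtype=np.float64).reshape(2, 8, 4, 4)
inh_feat, exp_feat = split_student_features(student_feat)
```

Each half has fewer channels than the full feature, so some transform would be needed before comparing it with a teacher feature of a different shape.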

Related works: FitNet

Since we mentioned distilling from intermediate features, let's introduce FitNet.
- Stage 1: pretrain the upper part (U) of the student net to match the teacher's features.
- Stage 2: train as in the original KD (Baseline KD).
Problem: the teacher's features may contain too much useless information.
The authors of IE-KD think the knowledge should be compact.

[Diagram: Dataset -> Teacher Net (U) -> T's feat; Dataset -> Student Net (U) -> S's feat -> Regressor -> S's feat transform; L2 loss between T's feat and S's feat transform.]
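
A minimal sketch of the stage-1 objective, assuming the regressor is a linear map over channels (like a 1x1 convolution). The shapes and the names `regressor` and `hint_loss` are illustrative; in practice the regressor weights are learned together with the student.

```python
import numpy as np

rng = np.random.default_rng(0)

def regressor(student_feat, weight):
    """Linear map over channels: (N, C_s, H, W) -> (N, C_t, H, W)."""
    return np.einsum("tc,nchw->nthw", weight, student_feat)

def hint_loss(teacher_feat, student_feat, weight):
    """Mean squared (L2) distance between T's feat and the transformed S's feat."""
    diff = teacher_feat - regressor(student_feat, weight)
    return float(np.mean(diff ** 2))

teacher_feat = rng.normal(size=(2, 16, 4, 4))  # T's feat: 16 channels
student_feat = rng.normal(size=(2, 8, 4, 4))   # S's feat: only 8 channels
weight = rng.normal(size=(16, 8)) * 0.1        # regressor parameters (random here)
loss = hint_loss(teacher_feat, student_feat, weight)
```

The regressor is what lets the two features differ in channel count: the loss is computed only after the student's feature has been mapped to the teacher's shape.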

Stage 1: Compact Knowledge Extraction

Make the teacher net's feature output more "compact" for later use (an auto-encoder is used to achieve this).
Reconstruction loss: L2 loss

[Diagram: Dataset -> Teacher Net (U) -> T's feat -> Encoder T -> Compact T's feat -> Decoder T.]
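
A minimal sketch of compact knowledge extraction, assuming a linear auto-encoder on flattened teacher features trained with plain gradient descent. The linear layers, the sizes, and the hyperparameters are assumptions for illustration, not the architecture used in the paper.

```python
import numpy as np

def reconstruction_loss(feat, enc_w, dec_w):
    """Mean over samples of the squared L2 reconstruction error."""
    compact = feat @ enc_w   # Encoder T: produces the compact T's feat
    recon = compact @ dec_w  # Decoder T: maps back to the original size
    return float(np.mean(np.sum((recon - feat) ** 2, axis=1)))

def train_autoencoder(feat, compact_dim, steps=300, lr=0.02, seed=0):
    """Fit encoder and decoder weights; also return the loss recorded at each step."""
    rng = np.random.default_rng(seed)
    n, dim = feat.shape
    enc_w = rng.normal(size=(dim, compact_dim)) * 0.1
    dec_w = rng.normal(size=(compact_dim, dim)) * 0.1
    history = []
    for _ in range(steps):
        history.append(reconstruction_loss(feat, enc_w, dec_w))
        compact = feat @ enc_w
        err = compact @ dec_w - feat                      # reconstruction error
        grad_dec = (2.0 / n) * compact.T @ err            # gradient w.r.t. decoder
        grad_enc = (2.0 / n) * feat.T @ (err @ dec_w.T)   # gradient w.r.t. encoder
        enc_w -= lr * grad_enc
        dec_w -= lr * grad_dec
    history.append(reconstruction_loss(feat, enc_w, dec_w))
    return enc_w, dec_w, history

teacher_feat = np.random.default_rng(1).normal(size=(64, 16))  # stand-in for T's feat
enc_w, dec_w, history = train_autoencoder(teacher_feat, compact_dim=4)
compact_teacher_feat = teacher_feat @ enc_w
```

After training, only the encoder output (`compact_teacher_feat`) would be kept as the distillation target.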

Stage 2: IE-KD

Calculate the goal loss, inheritance loss, and exploration loss.
- Inheritance loss: after the encoder, the feature should be similar to the teacher's.
- Exploration loss: after the encoder, the feature should be different from the teacher's.
There are multiple choices for the losses: we can adopt the loss of a previous KD work as the inheritance loss, and use its opposite (negated) function as the exploration loss.


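Method Name: IE-AT
- Inheritance loss: \left\| \frac{A^*}{\|A^*\|_2} - \frac{B^*}{\|B^*\|_2} \right\|_1
- Exploration loss: -\left\| \frac{A^*}{\|A^*\|_2} - \frac{B^*}{\|B^*\|_2} \right\|_1

Method Name: IE-FT
- Inheritance loss: \left\| \frac{A}{\|A\|_2} - \frac{B}{\|B\|_2} \right\|_1
- Exploration loss: -\left\| \frac{A}{\|A\|_2} - \frac{B}{\|B\|_2} \right\|_1

Method Name: IE-OD
- Inheritance loss: \left\| \max(A, 0) - \max(B, 0) \right\|_2
- Exploration loss: -\left\| \max(A, 0) - \max(B, 0) \right\|_2

* means the attention map of the features.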
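
A minimal sketch of the three loss pairs listed above. It assumes A and B are the two (encoded) features being compared, that norms are taken per sample and averaged over the batch, and that the attention map is the channel-wise mean of squared activations. The function names and these conventions are assumptions, not taken from the slides.

```python
import numpy as np

def normalize(x, eps=1e-12):
    """Flatten each sample and divide it by its L2 norm."""
    flat = x.reshape(x.shape[0], -1)
    return flat / (np.linalg.norm(flat, axis=1, keepdims=True) + eps)

def attention_map(feat):
    """Spatial attention map (assumed form): (N, C, H, W) -> (N, H, W)."""
    return np.mean(feat ** 2, axis=1)

def normalized_l1_loss(a, b):
    """|| A/||A||_2 - B/||B||_2 ||_1, averaged over the batch."""
    return float(np.abs(normalize(a) - normalize(b)).sum(axis=1).mean())

def relu_l2_loss(a, b):
    """|| max(A, 0) - max(B, 0) ||_2, averaged over the batch."""
    diff = (np.maximum(a, 0) - np.maximum(b, 0)).reshape(a.shape[0], -1)
    return float(np.linalg.norm(diff, axis=1).mean())

def attention_l1_loss(a, b):
    """The normalized L1 distance applied to the attention maps A*, B*."""
    return normalized_l1_loss(attention_map(a), attention_map(b))

def ie_losses(base_loss, inh_feat, exp_feat, teacher_feat):
    """Inheritance loss uses base_loss as is; exploration loss is its negation."""
    return base_loss(inh_feat, teacher_feat), -base_loss(exp_feat, teacher_feat)

rng = np.random.default_rng(0)
teacher_feat = rng.normal(size=(2, 8, 4, 4))
inh_feat = rng.normal(size=(2, 8, 4, 4))
exp_feat = rng.normal(size=(2, 8, 4, 4))
inh_loss, exp_loss = ie_losses(normalized_l1_loss, inh_feat, exp_feat, teacher_feat)
```

Minimizing the negated distance pushes the exploration feature away from the teacher's, while the inheritance feature is pulled toward it.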

Fun Question

All 3 encoders are different.
- Why do the exploration features work? What if the exploration encoder just outputs some random features?

Experiments

Are the inh part and the exp part different?

Inh part: the tail looks like a crocodile.

Exp part: the ears are also important.

Inh part: the head looks like a seal.

Exp part: the turtle shell should also be considered.

Generalization Evidence

Add Gaussian noise at the middle layer
and observe how the loss changes.
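
A minimal sketch of this kind of probe, assuming a toy two-layer ReLU net with a squared-error loss. The net, the data, and the names `forward_loss` and `loss_change_under_noise` are hypothetical stand-ins, not the models or settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_loss(x, y, w1, w2, noise_std=0.0):
    """Mean squared error with Gaussian noise added to the middle-layer activations."""
    hidden = np.maximum(x @ w1, 0.0)
    hidden = hidden + rng.normal(scale=noise_std, size=hidden.shape)
    pred = hidden @ w2
    return float(np.mean((pred - y) ** 2))

def loss_change_under_noise(x, y, w1, w2, noise_stds):
    """Loss increase relative to the noise-free loss, one value per noise level."""
    clean = forward_loss(x, y, w1, w2, noise_std=0.0)
    return [forward_loss(x, y, w1, w2, noise_std=s) - clean for s in noise_stds]

x = rng.normal(size=(32, 4))
y = rng.normal(size=(32, 1))
w1 = rng.normal(size=(4, 8))
w2 = rng.normal(size=(8, 1))
changes = loss_change_under_noise(x, y, w1, w2, noise_stds=[0.0, 0.1, 1.0])
```

A model whose loss changes less under the same noise level sits in a flatter region, which is the usual argument for better generalization.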

Are inh and exp really different?

CKA (Centered Kernel Alignment, ICML'19) similarity: a method to measure the similarity between feature maps.
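
A minimal sketch of CKA with a linear kernel, assuming features are arranged as (n_samples, n_features) matrices. Whether the paper uses the linear or another kernel is not stated in the slides, so the linear form is an assumption.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two feature matrices of shape (n_samples, n_features)."""
    x = x - x.mean(axis=0, keepdims=True)  # center each feature over the samples
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))

rng = np.random.default_rng(0)
feat_a = rng.normal(size=(50, 8))
feat_b = rng.normal(size=(50, 8))
rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # a random orthogonal matrix
same = linear_cka(feat_a, feat_a @ rotation)
different = linear_cka(feat_a, feat_b)
```

The score lies in [0, 1] and is unchanged by rotating or uniformly rescaling either feature, so a low score between the inh and exp features indicates they encode different information.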

SOTA Accuracy

Dataset: CIFAR-10, Criterion: Error rate

Dataset: ImageNet, ResNet-34 -> ResNet-18, Criterion: Error rate

Dataset: PASCAL VOC (2007, for object detection), ResNet-50 -> ResNet-18, Criterion: mAP

I & E are both important!

Proposed: IE-DML

(Inheritance and Exploration DML Framework)

What is DML?

[Diagram: Deep Mutual Learning. Dataset -> Student Net 1 -> S1's result, e.g. [0, 0.3, 0.7]; Dataset -> Student Net 2 -> S2's result, e.g. [0.1, 0.1, 0.8]. Each result has a hard loss against the label [0, 0, 1]; a soft loss links S1's result and S2's result.]

DML (Deep Mutual Learning, CVPR'18):
- Two student nets are trained iteratively, each learning from the other (no pretrained teacher is needed).
- It beat all KD-based methods at that time.
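
A minimal sketch of the two DML objectives, assuming each student minimizes its own cross-entropy (hard loss) plus a KL divergence toward the other student's prediction (soft loss). The function names are illustrative.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    """Hard loss: negative log-probability of the true class."""
    n = probs.shape[0]
    return float(-np.log(probs[np.arange(n), labels]).mean())

def kl_divergence(p, q):
    """KL(p || q), averaged over the batch."""
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean())

def dml_losses(logits1, logits2, labels):
    """Return (loss for student 1, loss for student 2)."""
    p1, p2 = softmax(logits1), softmax(logits2)
    loss1 = cross_entropy(p1, labels) + kl_divergence(p2, p1)  # S1 mimics S2
    loss2 = cross_entropy(p2, labels) + kl_divergence(p1, p2)  # S2 mimics S1
    return loss1, loss2

logits1 = np.array([[0.0, 1.0, 2.0]])
logits2 = np.array([[0.5, 0.5, 2.5]])
labels = np.array([2])
loss1, loss2 = dml_losses(logits1, logits2, labels)
```

In an iterative scheme, one student would be updated with `loss1` while the other is held fixed, and then the roles would swap.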

IE-DML

[Diagram label: Student Network 2]

Loss in IE-KD:
L = L_{\text{goal}} + \alpha L_{\text{inheritance}} + \beta L_{\text{exploration}}
- The reconstruction loss is used in stage 1.

Loss in IE-DML: the auto-encoder is trained jointly.
L = L_{\text{goal}} + \alpha L_{\text{inheritance}} + \beta L_{\text{exploration}} + \gamma L_{\text{reconstruction}}
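
A minimal sketch of how the two objectives above combine their terms. The default weights are placeholders, not values from the paper, and the individual loss values in the example are made up for illustration.

```python
def ie_kd_loss(goal, inheritance, exploration, alpha=1.0, beta=1.0):
    """IE-KD: goal loss plus weighted inheritance and exploration losses."""
    return goal + alpha * inheritance + beta * exploration

def ie_dml_loss(goal, inheritance, exploration, reconstruction,
                alpha=1.0, beta=1.0, gamma=1.0):
    """IE-DML: the IE-KD terms plus a weighted reconstruction loss."""
    return ie_kd_loss(goal, inheritance, exploration, alpha, beta) + gamma * reconstruction

# The exploration loss is a negated distance, so it enters as a negative number.
kd_total = ie_kd_loss(goal=1.0, inheritance=0.5, exploration=-0.25, alpha=2.0, beta=4.0)
dml_total = ie_dml_loss(goal=1.0, inheritance=0.5, exploration=-0.25,
                        reconstruction=0.5, alpha=2.0, beta=4.0, gamma=2.0)
```

The only structural difference is the extra reconstruction term, which appears because the auto-encoder is no longer trained in a separate stage.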

IE-DML Experiments

What we learn here: most KD methods whose distillation target is an intermediate feature can adopt this framework.
Other KD-related techniques (like DML) can also adopt this framework.

Q & A

Paper review: Revisiting Knowledge Distillation: An Inheritance and Exploration Framework (CVPR'21)

By Arvin Liu
