Paper review: Revisiting Knowledge Distillation: An Inheritance and Exploration Framework (CVPR'21)

Basic Ideas

[Figure: Teacher Net and Student Net (inheritance part). Their features SHOULD be similar.]

Split the student net into two parts.
- inheritance loss: the inheritance part should generate features that are similar to the teacher's.
- exploration loss: the exploration part should generate features that are different from the teacher's.
Does the shape of the features have to be the same?

[Figure: Student Net (exploration part). Its features SHOULD NOT be similar to the teacher's.]

Related works: FitNet

Since we mentioned distilling from intermediate values, let's introduce FitNet.
- Stage 1: pretrain the upper part of the student net.
- Stage 2: train like the original KD (baseline KD).
Problem: the teacher's features may contain too much useless information ("trash").
The authors of IE-KD argue that the knowledge should be compacted.
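
To make the FitNet stage-1 idea concrete, here is a minimal sketch of an L2 loss between the teacher's feature and the student's feature after a regressor. It is only an illustration in plain Python: the linear regressor, the function names `apply_regressor` and `l2_hint_loss`, and the toy numbers are all assumptions, not FitNet's actual implementation.

```python
# Hypothetical sketch of a FitNet-style stage-1 loss (pure Python, no framework).

def apply_regressor(weights, feat):
    """Assumed linear regressor: maps the student feature to the teacher's size."""
    return [sum(w * x for w, x in zip(row, feat)) for row in weights]


def l2_hint_loss(teacher_feat, student_feat, weights):
    """Squared L2 distance between T's feat and the transformed S's feat."""
    transformed = apply_regressor(weights, student_feat)
    return sum((t - s) ** 2 for t, s in zip(teacher_feat, transformed))


# Toy example: a 2-dim student feature regressed to a 3-dim teacher feature.
teacher_feat = [1.0, 0.0, 2.0]
student_feat = [1.0, 2.0]
weights = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]  # assumed regressor weights
loss = l2_hint_loss(teacher_feat, student_feat, weights)
```

In this toy case the regressor maps the student feature exactly onto the teacher feature, so the loss is zero; in training, the student and the regressor would be updated to reduce this loss.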

[Figure: FitNet stage 1. Dataset → Teacher Net (U) → T's feat; Dataset → Student Net (U) → S's feat → Regressor → S's feat transform; L2 Loss between T's feat and S's feat transform.]

Stage 1: Compact Knowledge Extraction

Make the feature output of the teacher net more "compact" for further use. (An auto-encoder is used to achieve this.)
Reconstruction loss: L2 loss
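
The auto-encoder step can be sketched as below. This is a toy illustration only: the linear encoder and decoder, the helper names `matvec` and `reconstruction_loss`, and the weights are assumptions chosen so the example is easy to follow, not the architecture used in the paper.

```python
# Hypothetical sketch of compact knowledge extraction with a linear auto-encoder.

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def reconstruction_loss(teacher_feat, enc_w, dec_w):
    """Encode T's feat to a compact feature, decode it, and return (compact, L2 loss)."""
    compact = matvec(enc_w, teacher_feat)  # plays the role of Encoder T
    recon = matvec(dec_w, compact)         # plays the role of Decoder T
    loss = sum((x - r) ** 2 for x, r in zip(teacher_feat, recon))
    return compact, loss


# Toy example: a redundant 4-dim teacher feature compressed to 2 dims.
teacher_feat = [3.0, 3.0, 1.0, 1.0]
enc_w = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]      # assumed encoder weights
dec_w = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # assumed decoder weights
compact_feat, loss = reconstruction_loss(teacher_feat, enc_w, dec_w)
```

Because the toy feature is redundant, the 2-dim compact feature reconstructs it without error; only the compact feature would be kept for the later distillation stage.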

[Figure: Dataset → Teacher Net (U) → T's feat → Encoder T → Compact T's feat → Decoder T.]

Stage 2: IE-KD

Calculate the goal loss, the inheritance loss, and the exploration loss.
- Inheritance loss: the feature should be similar to the teacher's after the encoder.
- Exploration loss: the feature should be different from the teacher's after the encoder.
There are multiple choices for the losses: we can adopt the loss from a previous KD work as the inheritance loss and use its opposite (negated) function as the exploration loss.
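
One way to picture how the three losses combine is the sketch below. Several things here are assumptions made only for illustration: the student feature is split into two equal halves, the encoders are omitted (the features are treated as already encoded), a plain L1 distance stands in for the KD loss, and the weights `w_inh` and `w_exp` are hypothetical.

```python
# Hypothetical sketch of a stage-2 objective: goal + inheritance + exploration.

def l1_distance(a, b):
    """Plain L1 distance, used here as a stand-in for any KD feature loss."""
    return sum(abs(x - y) for x, y in zip(a, b))


def ie_kd_loss(goal_loss, student_feat, teacher_compact,
               distance=l1_distance, w_inh=1.0, w_exp=1.0):
    """Return (total, inheritance loss, exploration loss)."""
    half = len(student_feat) // 2           # assumed: equal split of the feature
    inh_feat = student_feat[:half]          # inheritance part
    exp_feat = student_feat[half:]          # exploration part
    inh_loss = distance(inh_feat, teacher_compact)   # should be similar
    exp_loss = -distance(exp_feat, teacher_compact)  # should be different
    total = goal_loss + w_inh * inh_loss + w_exp * exp_loss
    return total, inh_loss, exp_loss


# Toy example: the inheritance half matches the teacher, the exploration half does not.
total, inh_loss, exp_loss = ie_kd_loss(
    goal_loss=1.0,
    student_feat=[1.0, 2.0, 4.0, 0.0],
    teacher_compact=[1.0, 2.0],
)
```

Minimizing the total pulls the inheritance half toward the teacher's compact feature while pushing the exploration half away from it; the goal loss (for example, a classification loss) is passed in as a number here.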


Inheritance and exploration losses for each method:
- IE-AT: inheritance loss ||\frac{A*}{||A*||_2}-\frac{B*}{||B*||_2}||_1; exploration loss -||\frac{A*}{||A*||_2}-\frac{B*}{||B*||_2}||_1
- IE-FT: inheritance loss ||\frac{A}{||A||_2}-\frac{B}{||B||_2}||_1; exploration loss -||\frac{A}{||A||_2}-\frac{B}{||B||_2}||_1
- IE-OD: inheritance loss ||max(A, 0) - max(B, 0)||_2; exploration loss -||max(A, 0) - max(B, 0)||_2
* means the attention map of the features.
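
The loss formulas can be written out in plain Python as a rough sketch. The assumptions here are loud ones: which formula belongs to which method name is inferred (attention maps for AT, the `max(·, 0)` form for OD, the plain normalized form for FT), the attention map is taken to be the channel-wise sum of squares, and all function names are hypothetical.

```python
import math

# Hypothetical sketch of the three distance choices and their negated counterparts.

def l2_normalize(v):
    """Divide a vector by its L2 norm (left unchanged if the norm is zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def ft_distance(a, b):
    """|| A/||A||_2 - B/||B||_2 ||_1 on flat feature vectors."""
    return sum(abs(x - y) for x, y in zip(l2_normalize(a), l2_normalize(b)))


def od_distance(a, b):
    """|| max(A, 0) - max(B, 0) ||_2 on flat feature vectors."""
    return math.sqrt(sum((max(x, 0.0) - max(y, 0.0)) ** 2 for x, y in zip(a, b)))


def attention_map(feat):
    """Assumed attention map: sum of squares over channels at each position.

    feat is a list of channels, each a flat list of spatial positions.
    """
    return [sum(channel[i] ** 2 for channel in feat) for i in range(len(feat[0]))]


def at_distance(a, b):
    """Same normalized L1 form as ft_distance, applied to attention maps A* and B*."""
    return ft_distance(attention_map(a), attention_map(b))


def inheritance_loss(distance, a, b):
    return distance(a, b)


def exploration_loss(distance, a, b):
    return -distance(a, b)  # the opposite of the inheritance loss
```

For instance, `inheritance_loss(ft_distance, [1.0, 0.0], [0.0, 1.0])` gives 2.0 and the matching exploration loss gives -2.0, so the same distance is minimized for one part and maximized for the other.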

All three encoders are different.
- Why do the exploration features work? What if the exploration encoder just outputs random features?

What is Knowledge Distillation?

What is KD?

What's wrong with the old KD?

Proposed: IE-KD

Basic Ideas

Related works: FitNet

Stage 1: Compact Knowledge Extraction

Stage 2: IE-KD

Stage 2: IE-KD

Fun Question

Experiments

Are the inh part and the exp part different?

Generalization Evidence

Are inh and exp really different?

SOTA Accuracy

I & E are both important!

Proposed: IE-DML

What is DML?

IE-DML

IE-DML Experiments

Q & A