4/30 by Arvin Liu
Method: delete the unimportant weights or neurons of the network, then retrain it.
Reason: a large NN has many redundant parameters, while a small NN is hard to train, so just carve a small NN out of a large one.
Application: anything, as long as it is a NN (?).
Method: use a large model that has already learned the task to teach a small model how to do it well.
Reason: it is too hard for the student to get the answers right on its own, so let it peek at how the teacher thinks about / solves the problems.
Application: usually only used for classification, and the student can only learn from scratch.
Method: achieve the effect of certain layers with fewer parameters.
Reason: some layers simply have redundant parameters; DNNs (fully-connected layers) are an obvious example.
Application: either plug in a new model directly, or use new layers to imitate the old layers.
Method: compress the numeric formats NNs normally compute with, float32/float64, into smaller units.
Reason: for a NN, the LSB may not be that important.
Application: apply it to an already-trained model, or nudge the model toward quantization while training.
* LSB: Least-Significant Bit; here it means the trailing decimal digits are largely redundant.
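As a rough illustration of the quantization idea above, here is a minimal sketch of symmetric per-tensor int8 weight quantization in PyTorch; the scale choice and the int8 target are illustrative assumptions, not a specific paper's scheme.

import torch

def quantize_int8(w: torch.Tensor):
    # Map float32 weights to int8 with a single symmetric scale.
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    # Recover an approximate float32 tensor for inference.
    return q.float() * scale

w = torch.randn(64, 64)
q, s = quantize_int8(w)
print((w - dequantize(q, s)).abs().max())   # the small error is the "LSB" we threw away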
[Figure: overall compression pipeline. ResNet101 (a huge, bloated model) is turned into MobileNet (a tiny model) via Knowledge Distillation / Architecture Design, then into MobileNet-pruned via Network Pruning, and finally goes through Quantization (Fine-tune or Finalize).]
You need to know the magic power of soft labels.
Hidden information lies in the relationships between categories.
[Figure: Label Refinery. A Model is trained, then its predictions on Crop (Augment) images serve as the labels for the next round; the diagram marks the labels' Incompleteness and Inconsistency.]
Every refinement pass reduces Incompleteness & Inconsistency, and the refined labels also capture the relationships between labels.
Use soft targets so that the small model can learn the relationships between classes.
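To make the soft-target idea concrete, here is a minimal sketch of the usual distillation loss: temperature-scaled soft targets plus hard-label cross-entropy. The temperature T=4 and weight alpha=0.5 are illustrative assumptions, not values taken from the papers above.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft part: match the teacher's softened class distribution.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    # Hard part: ordinary cross-entropy with the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard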
Train two networks at the same time and let them learn each other's logits.
[Figure: Deep Mutual Learning. Step 1: Update Net1 with the CE loss on the True Label plus a loss toward Networks2's logits. Step 2: Update Net2 symmetrically with the True Label and Networks1's logits.]
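A minimal sketch of one mutual-learning update, assuming two classifiers and one optimizer per network; the KL-based mimicry term follows the Deep Mutual Learning idea, but the equal loss weighting is an assumption.

import torch
import torch.nn.functional as F

def dml_step(net, peer, x, y, optimizer):
    # Update `net` with cross-entropy on the true label plus
    # KL divergence toward the peer's (detached) predictions.
    logits = net(x)
    with torch.no_grad():
        peer_logits = peer(x)
    loss = F.cross_entropy(logits, y) + \
           F.kl_div(F.log_softmax(logits, dim=1),
                    F.softmax(peer_logits, dim=1),
                    reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Step 1: dml_step(net1, net2, x, y, opt1)   # update Net1
# Step 2: dml_step(net2, net1, x, y, opt2)   # update Net2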
More Details: https://slides.com/arvinliu/kd_mutual
Similar to Label Refinery; the only differences are:
[Figure: the Teacher is a big shot, the Student is a complete beginner; the gap is so large that the Student cannot learn.]
Use a TA (Teacher Assistant), whose parameter count lies between the Teacher's and the Student's, as a middleman to help the Student learn; this avoids the situation where the gap between the models is too large to learn well.
[Figure: Teacher vs. Student example with the hints "Only one loop" and "No end point"; ans: 0, next ans: 8.]
First let the Student learn to produce the Teacher's intermediate features, then perform Baseline KD (a code sketch of the feature-fitting step follows the figure below).
[Figure: FitNets. Step 1: Fit feature: the Student Net fits the Teacher Net's intermediate feature (in 2-norm distance). Step 2: Fit logits: perform Baseline KD between the Teacher Net's and Student Net's logits.]
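A minimal sketch of the step-1 hint loss, assuming we already have the teacher's and student's intermediate feature maps plus a small learnable regressor to match channel counts; the class and function names are illustrative.

import torch.nn as nn
import torch.nn.functional as F

class HintRegressor(nn.Module):
    # 1x1 conv that maps the student's channels to the teacher's channel count.
    def __init__(self, c_student, c_teacher):
        super().__init__()
        self.conv = nn.Conv2d(c_student, c_teacher, kernel_size=1)

    def forward(self, f_student):
        return self.conv(f_student)

def hint_loss(regressor, f_student, f_teacher):
    # Step 1: bring the (projected) student feature close to the teacher feature in 2-norm.
    return F.mse_loss(regressor(f_student), f_teacher.detach())

# Step 2 then applies the usual baseline KD loss on the logits.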
[Figure: Attention Transfer. The Teacher Net's feature map of shape H x W x C is compressed ("Knowledge Compression") into an H x W x 1 attention map, and the Student Net distills this feature ("Distill Feature").]
Guide the Student by having it learn the Teacher's Attention Map.
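A minimal sketch of how such an attention map can be computed and matched: average the squared activations over channels and L2-normalize, in the spirit of the Attention Transfer paper. The exact pooling and the unweighted MSE are illustrative assumptions.

import torch.nn.functional as F

def attention_map(feature):                  # feature: (B, C, H, W)
    am = feature.pow(2).mean(dim=1)          # compress the C channels -> (B, H, W)
    am = am.flatten(1)                       # (B, H*W)
    return F.normalize(am, dim=1)            # L2-normalize per sample

def attention_transfer_loss(f_student, f_teacher):
    # The student imitates the teacher's spatial attention map.
    return (attention_map(f_student) - attention_map(f_teacher.detach())).pow(2).mean()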
Individual KD: distill knowledge sample by sample.
Relational KD: distill knowledge from the relationships between samples.
t: teacher's logits, s: student's logits
Individual KD: the Student learns the Teacher's output.
Relational KD: the Student learns the model's representation.
Distance-wise KD (t: teacher's logits, s: student's logits): for each pair of samples, the student's pairwise distance should match the teacher's, ||s_i - s_j|| ~= ||t_i - t_j||.
Angle-wise KD: for each triple of samples, the angle formed by the student's logits should match the teacher's, angle(s_i, s_j, s_k) ~= angle(t_i, t_j, t_k).
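A minimal sketch of the distance-wise term, assuming a batch of teacher and student logits; the mean-distance normalization and the smooth-L1 penalty follow the Relational KD recipe, but treat the details as illustrative.

import torch
import torch.nn.functional as F

def normalized_pdist(e):                      # e: (B, D) logits
    d = torch.cdist(e, e, p=2)                # (B, B) pairwise distances
    return d / d[d > 0].mean()                # normalize by the mean distance

def rkd_distance_loss(s_logits, t_logits):
    # Match the student's pairwise-distance structure to the teacher's.
    return F.smooth_l1_loss(normalized_pdist(s_logits),
                            normalized_pdist(t_logits.detach()))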
Of course you can: the relations can also be built on features rather than logits.
[Figure: a Mnist Model represents each digit by two features (circle, vertical line): 0 -> [1, 0], 9 -> [1, 1], 1 -> [0, 1].]
Cosine Similarity between these feature vectors:
      0     9     1
0  |  1  | 0.7 |  0  |
9  | 0.7 |  1  | 0.7 |
1  |  0  | 0.7 |  1  |
[Figure: the Mnist TeacherNet's relational information on features (circle, vertical line) gives the Teacher's Cosine Similarity Table:
      0     9     1
0  |  1  | 0.7 |  0  |
9  | 0.7 |  1  | 0.7 |
1  |  0  | 0.7 |  1  |
The Mnist StudentNet's own features are unknown (?), but its Student's Cosine Similarity Table is trained to imitate the Teacher's, so the student learns the relationships between images.]
Distill the pairwise activation similarity between samples.
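A minimal sketch of this similarity-preserving idea: build a batch-wise cosine-similarity table from each network's activations and make the student's table imitate the teacher's. The cosine form matches the toy tables above; the published method normalizes the Gram matrix slightly differently, so treat this as an illustration.

import torch.nn.functional as F

def cosine_similarity_table(activations):         # activations: (B, ...) any shape
    a = F.normalize(activations.flatten(1), dim=1)
    return a @ a.t()                               # (B, B) pairwise cosine similarities

def similarity_preserving_loss(act_student, act_teacher):
    g_s = cosine_similarity_table(act_student)
    g_t = cosine_similarity_table(act_teacher.detach())
    return (g_s - g_t).pow(2).mean()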
Neuron Pruning in DNN (a=4, b=3, c=2)
[Figure: pruning one neuron from the hidden layer. Before: Dense(4, 3) with a (4,3) weight matrix followed by Dense(3, 2) with a (3,2) matrix; after: Dense(4, 2) with a (4,2) matrix followed by Dense(2, 2) with a (2,2) matrix. The input Features 0~3 are unchanged and the surrounding edits are trivial.]
Param changes: (a+c) * b -> (a+c) * (b-1)
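A minimal sketch of this surgery in PyTorch, assuming two consecutive nn.Linear layers and a chosen hidden-neuron index to remove; the helper name and the way the neuron is chosen are illustrative.

import torch
import torch.nn as nn

def prune_neuron(fc1: nn.Linear, fc2: nn.Linear, idx: int):
    # Remove hidden neuron `idx` between fc1 (a -> b) and fc2 (b -> c).
    keep = [i for i in range(fc1.out_features) if i != idx]
    new_fc1 = nn.Linear(fc1.in_features, len(keep))
    new_fc2 = nn.Linear(len(keep), fc2.out_features)
    with torch.no_grad():
        # PyTorch stores Linear weights as (out_features, in_features).
        new_fc1.weight.copy_(fc1.weight[keep, :])
        new_fc1.bias.copy_(fc1.bias[keep])
        new_fc2.weight.copy_(fc2.weight[:, keep])
        new_fc2.bias.copy_(fc2.bias)
    return new_fc1, new_fc2

# Example: prune_neuron(nn.Linear(4, 3), nn.Linear(3, 2), idx=1) -> Linear(4, 2), Linear(2, 2)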
Neuron Pruning in CNN (a=4, b=3, c=2)
[Figure: pruning one channel. Before: Conv(4, 3, 3) with a (4, 3, 3, 3) weight tensor followed by Conv(3, 2, 3) with a (3, 2, 3, 3) tensor; after: Conv(4, 2, 3) with a (4, 2, 3, 3) tensor followed by Conv(2, 2, 3) with a (2, 2, 3, 3) tensor. The input Feature maps 0~3 are unchanged and the surrounding edits are trivial.]
Param changes: (a+c) * b * k * k -> (a+c) * (b-1) * k * k
[Figure: evaluating which channel to prune in the layer to be pruned. For the Conv over Feature maps 0~3 (Conv Weight: (3, 4, k, k)), calculate the sum of the L1-norm of each filter's weights and prune the filter(s) with the smallest values.]
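A minimal sketch of ranking the filters of an nn.Conv2d by L1-norm and keeping the strongest ones; the keep ratio and the way the pruned layer is rebuilt are illustrative assumptions.

import torch
import torch.nn as nn

def l1_filter_ranking(conv: nn.Conv2d):
    # PyTorch stores conv weights as (out_channels, in_channels, k, k);
    # score each output filter by the sum of the absolute values of its weights.
    scores = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    return torch.argsort(scores, descending=True)

def prune_filters(conv: nn.Conv2d, keep_ratio=0.75):
    keep = l1_filter_ranking(conv)[: int(conv.out_channels * keep_ratio)]
    new_conv = nn.Conv2d(conv.in_channels, len(keep),
                         kernel_size=conv.kernel_size, stride=conv.stride,
                         padding=conv.padding, bias=conv.bias is not None)
    with torch.no_grad():
        new_conv.weight.copy_(conv.weight[keep])
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias[keep])
    return new_conv, keep   # `keep` tells the next layer which input channels survive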
Filter Pruning via Geometric Median (FPGM)
Ideal case for pruning by L-norm: the distribution of the filter norms (for a Conv Weight of shape (3, 4, k, k)) looks like the distribution we hope for: 1. σ(V) must be large, and 2. there are V close to 0.
Hazard of pruning by L-norm: if σ(V) is small, it is difficult to find an appropriate threshold; if no V is close to 0, all filters are non-trivial.
Redundancy that pruning by L-norm cannot handle: maybe there are multiple filters with the same function. Pruning by Geometric Median can solve this problem.
Find the Geometric Median in a CNN
[Figure: for the Conv over Feature maps 0~3 (Conv Weight: (3, 4, k, k)), compute the geometric median of the filters, prune the filter(s) closest to it as redundant, and keep the rest to produce the pruned feature maps.]
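A minimal sketch of the geometric-median criterion. Following the FPGM paper, the filters whose total distance to all other filters is smallest (i.e. the ones closest to the in-set geometric median) are treated as the redundant ones to prune; treat this as an illustration rather than the slide's exact recipe.

import torch
import torch.nn as nn

def fpgm_redundant_filters(conv: nn.Conv2d, n_prune: int):
    # Flatten each output filter into a vector: (out_channels, in_channels * k * k).
    w = conv.weight.detach().flatten(1)
    # Sum of distances from each filter to all the others; the smallest sums
    # belong to the filters nearest the geometric median, i.e. the most replaceable.
    dist_sum = torch.cdist(w, w, p=2).sum(dim=1)
    return torch.argsort(dist_sum)[:n_prune]   # indices of filters to prune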
Batch Normalization (PyTorch's BN): each channel has a learnable scaling factor γ, which can serve as that channel's importance; Network Slimming adds a sparsity penalty g(γ) on these factors during training. (* g is the L1-norm.)
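A minimal sketch of this idea: add an L1 penalty on all BN γ's during training, then keep only the channels whose |γ| exceeds a threshold. The penalty weight and the thresholding rule are illustrative assumptions.

import torch
import torch.nn as nn

def bn_l1_penalty(model: nn.Module):
    # Sparsity term added to the training loss: sum of |gamma| over all BN layers.
    return sum(m.weight.abs().sum()
               for m in model.modules() if isinstance(m, nn.BatchNorm2d))

def channels_to_keep(bn: nn.BatchNorm2d, threshold: float):
    # After training, keep only the channels whose scale factor is large enough.
    return (bn.weight.detach().abs() > threshold).nonzero().flatten()

# Training step (sketch): loss = task_loss + 1e-4 * bn_l1_penalty(model)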
Average Percentage of Zeros (APoZ)
[Figure: Data -> Conv -> ReLU -> Feature maps 0~3; count how often each feature map is zero after the ReLU.]
In a CNN a neuron is not a single number but a feature map with shape (n, m), so sum the zero counts over the whole map; channels whose Average Percentage of Zeros is high are the candidates to prune.
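A minimal sketch of measuring APoZ per channel on post-ReLU activations, assuming we accumulate statistics over a data loader; the loader and any pruning threshold are illustrative.

import torch

@torch.no_grad()
def average_percentage_of_zeros(model_up_to_relu, data_loader):
    zeros, counts = 0.0, 0.0
    for x, _ in data_loader:
        act = model_up_to_relu(x)                        # (B, C, H, W), already after ReLU
        zeros += (act == 0).float().sum(dim=(0, 2, 3))   # zero count per channel
        counts += act.numel() / act.shape[1]             # element count per channel
    return zeros / counts    # APoZ per channel; channels with a large APoZ are pruned first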
Authors use the L1-norm to prune weights. (Not neurons.)
[Figure: L1-norm pruning with w_final used as the mask; weights are labelled Large or Small before and after training. Are the ones that stay Large easier to become the winning ticket? Are they more meaningful?]
Question: why do winning tickets achieve better accuracy?
Experiment:
Pruning algorithms don't learn the "network weights"; they learn the "network structure". (This conclusion is made in "Rethinking the Value of Network Pruning".)
The winning ticket must be measured by the "weights" after training. (This conclusion is made in the lottery-ticket line of work.)
The winning ticket helps unstructured pruning (e.g. weight pruning) but doesn't help structured pruning (e.g. filter/neuron pruning). (By the "Rethinking the Value of Network Pruning" authors.)
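A minimal sketch of the lottery-ticket procedure that the discussion above refers to: derive a mask from the magnitudes of the trained weights (w_final), then rewind the surviving weights to their initial values (w_init). Single-round pruning and a per-tensor 80% sparsity are simplifying assumptions.

import copy

def winning_ticket(model_init, model_trained, sparsity=0.8):
    # Keep the largest-magnitude weights *after training*, but reset the kept
    # weights to their *initial* values; everything else is zeroed out.
    ticket = copy.deepcopy(model_init)
    ticket_params = dict(ticket.named_parameters())
    for (name, p_init), (_, p_final) in zip(model_init.named_parameters(),
                                            model_trained.named_parameters()):
        flat = p_final.detach().abs().flatten()
        k = max(1, int(flat.numel() * (1 - sparsity)))   # number of weights to keep
        threshold = flat.topk(k).values.min()
        mask = (p_final.detach().abs() >= threshold).float()
        ticket_params[name].data = p_init.detach() * mask
    return ticket   # retrain this sparse sub-network from the original initialization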
Distilling the Knowledge in a Neural Network (NIPS 2014)
Deep Mutual Learning (CVPR 2018)
Born Again Neural Networks (ICML 2018)
Label Refinery: Improving ImageNet Classification through Label Progression (arXiv 2018)
Improved Knowledge Distillation via Teacher Assistant (AAAI 2020)
FitNets: Hints for Thin Deep Nets (ICLR 2015)
Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer (ICLR 2017)
Relational Knowledge Distillation (CVPR 2019)
Similarity-Preserving Knowledge Distillation (ICCV 2019)
Pruning Filters for Efficient ConvNets (ICLR 2017)
Learning Efficient Convolutional Networks Through Network Slimming (ICCV 2017)
Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask (NeurIPS 2019)