Outline/Content
Cited By 3831 in 2021 Aug 12
Since we want to prune the weights that are near zero,
adding a regularizer on the weights will push them toward zero:
\(L_{total} = L_{original} + \lambda \sum |w|^p\)
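A minimal sketch of this regularized loss in PyTorch (the `model`, `criterion`, `outputs`, and `targets` names are placeholders, not from the notes):

```python
import torch

def regularized_loss(model, criterion, outputs, targets, lam=1e-4, p=1):
    base = criterion(outputs, targets)                            # L_original
    reg = sum(w.abs().pow(p).sum() for w in model.parameters())   # sum |w|^p
    return base + lam * reg                                       # L_total = L_original + lambda * sum |w|^p
```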
To keep the effective dropout rate of each neuron the same before and after (B&A) pruning
(e.g. 30% before pruning, and 30% for the pruned model),
the dropout rate of the pruned model should be adjusted.
C : number of non-zero connections/weights (grows roughly quadratically with the number of neurons, so the adjusted dropout rate scales with \(\sqrt{C}\))
D : dropout rate
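A hedged reconstruction of the adjustment, using the C and D defined above (the original/pruned subscripts are mine): the dropout rate is rescaled by the square root of the ratio of remaining connections,

$$D_{pruned} = D_{original} \sqrt{\frac{C_{pruned}}{C_{original}}}$$

e.g. if pruning keeps half of the connections, a 30% dropout rate becomes roughly \(0.3 \times \sqrt{0.5} \approx 21\%\).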
After pruning, keep the surviving (old) weights and continue training.
Do not retrain the pruned structure from newly initialized weights.
Aggressive:
train -> prune 30% -> train
Iterative:
train -> prune 10% -> train-> prune 10% -> train -> prune 10% -> train
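A minimal sketch of the iterative schedule above using PyTorch's pruning utilities (`train_one_epoch` is a placeholder for your own fine-tuning loop): each round masks 10% of the remaining weights by magnitude, then continues training with the surviving old weights.

```python
import torch.nn.utils.prune as prune

def iterative_prune(model, train_one_epoch, rounds=3, amount=0.10, epochs_per_round=1):
    for _ in range(rounds):
        for module in model.modules():
            weight = getattr(module, "weight", None)
            if weight is not None and weight.dim() > 1:            # conv / linear layers
                prune.l1_unstructured(module, name="weight", amount=amount)
        for _ in range(epochs_per_round):
            train_one_epoch(model)                                 # fine-tune with the old, masked weights
```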
Neurons left with zero input/output connections can be safely pruned.
In implementation we do not need to handle this explicitly at every stage, since the regularization term will push the related weights to zero anyway.
Cited By 1904 in 2021 Aug 12
Notation:
i : layer index
j : channel index
h,w : height, width
x : feature map
n : number of channels
F : convolution kernel
Note:
The convolution parameter tensor of layer \(i\) has size \((n_i, n_{i+1}, k_h, k_w)\),
and each filter \(F_{i,j}\) has size \(n_i \times k_h \times k_w\).
If we prune filter \(j\) of conv layer \(i\),
then output feature map \(j\) becomes constant (all values the same),
so we can also remove the kernels in layer \(i+1\) that connect to this feature map.
Figure: (a) \(\frac{\text{norm}}{\text{max norm}}\) of the filters in layer \(i\); (b) accuracy when directly pruning the \(n\) smallest-\(L_1\)-norm filters; (c) accuracy of (b) after retraining.
Compute the L1 norm; two strategies for the next layer's pruning:
1. Independent: use green + yellow kernels (include kernels connected to already-pruned feature maps)
2. Greedy: only use green kernels (exclude them)
\(P(x_i) = conv_a(x_i) + conv_{c}(conv_{b}(x_{i}))\)
For a residual block, \(conv_{a}\) and \(conv_{b}\) can be pruned without restriction.
\(conv_{c}\) must follow \(conv_{a}\)'s output (channel) structure, since their outputs are summed.
Pruning filters with the smallest L1 norm
> random pruning
> pruning filters with the largest norm
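A minimal sketch of ranking the filters of one conv layer by L1 norm (PyTorch assumed); actually deleting the selected filters, plus the matching kernels in the next layer, is omitted for brevity:

```python
import torch

def smallest_l1_filters(conv, n):
    # conv.weight: (out_channels, in_channels, k_h, k_w); one L1 norm per output filter
    norms = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    return torch.argsort(norms)[:n]      # indices of the n smallest-norm filters
```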
Cited By 128 in 2021 Aug 12
Cited By 168 in 2021 Aug 12
Cited By 459 in 2021 Aug 12
8-bit : [0, 1, 2, ..., 255]
int8 × int8 => int16
\(\sum\) int16 => int32
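A small NumPy check of that datatype chain (my own toy example): int8 products fit in int16, and the running sum is kept in a wider int32 accumulator.

```python
import numpy as np

a = np.random.randint(-128, 128, size=256, dtype=np.int8)
b = np.random.randint(-128, 128, size=256, dtype=np.int8)

products = a.astype(np.int16) * b.astype(np.int16)   # int8 x int8 -> int16
acc = products.sum(dtype=np.int32)                    # sum of int16 -> int32 accumulator
```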
How to register the scale of bias/weights/activations?
Example for how to get scale
ReLU6
scale = \(\frac{6}{255}\)
Matrix
scale = \(\frac{\max_{i,j}(a_{i,j}) - \min_{i,j}(a_{i,j})}{255}\) (the value range spread over the 255 quantization steps, as in the ReLU6 example)
Linear Activation
Feed some calibration data (online or offline), then monitor the max/min values of the hidden-layer outputs to get the range.
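A minimal sketch of that calibration step (NumPy; `calibration_batches` and `run_hidden_layer` are hypothetical placeholders): track the min/max of the hidden-layer output over some data, then map that range onto the 256 levels of uint8.

```python
import numpy as np

def calibrate_scale(calibration_batches, run_hidden_layer):
    lo, hi = np.inf, -np.inf
    for x in calibration_batches:
        h = run_hidden_layer(x)                       # monitor the hidden-layer output
        lo, hi = min(lo, float(h.min())), max(hi, float(h.max()))
    scale = (hi - lo) / 255.0                         # e.g. ReLU6: (6 - 0) / 255
    zero_point = -lo / scale if scale > 0 else 0.0
    return scale, zero_point

def quantize(x, scale, zero_point):
    return np.clip(np.round(x / scale + zero_point), 0, 255).astype(np.uint8)
```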
For more detail
Cited By 1040 in 2021 Aug 12
$$a \times (q_i+b)$$
$$c \times (q_i+d)$$
Cited By 163 in 2021 Aug 12
Ristretto
"Aggressive" Quantization: INT4 and Lower
I skip this part, since it is too detailed. :(
Pseudo-quantize (fake-quantize) in the forward pass.
Do a normal backward pass.
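A minimal sketch of that forward/backward split in PyTorch (my own illustration, not the paper's code): the forward pass simulates 8-bit rounding, and the backward pass uses a straight-through estimator so gradients flow as if the rounding were not there.

```python
import torch

class FakeQuantize(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale, zero_point=0.0, qmin=0, qmax=255):
        q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax)
        return (q - zero_point) * scale              # pseudo-quantized values, still float

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None, None, None, None   # straight-through estimator

# usage: x_q = FakeQuantize.apply(x, 6.0 / 255)      # e.g. a ReLU6 activation
```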
Cited By 5574 in 2021 Aug 12
Huffman Coding
Encode the most/least frequent tokens with the shortest/longest binary codes for smaller storage.
E.g.
A : 100, B : 10, C : 1, D : 1 (occurrence counts)
Fixed 2 bits each: 2 × (100 + 10 + 1 + 1) = 224 bits
Huffman codes 0 / 10 / 110 / 111 (1 / 2 / 3 / 3 bits):
100 + 20 + 3 + 3 = 126 bits
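A minimal sketch of building those codes with a heap (standard-library Python, my own illustration of the example above):

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    tiebreak = count()                        # avoids comparing dicts on equal frequencies
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)       # merge the two least frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

freqs = {"A": 100, "B": 10, "C": 1, "D": 1}
codes = huffman_codes(freqs)
total_bits = sum(len(codes[s]) * f for s, f in freqs.items())   # 126, matching the example
```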
Cited By 6934 in 2021 Aug 12
Teacher's prob : p
Student's prob : q
Soft Probability
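A hedged restatement of the soft probabilities (the temperature \(T\) and logits \(z\) are the usual distillation symbols, not notation from these notes):

$$p_i = \frac{\exp(z^{(teacher)}_i / T)}{\sum_j \exp(z^{(teacher)}_j / T)}, \qquad q_i = \frac{\exp(z^{(student)}_i / T)}{\sum_j \exp(z^{(student)}_j / T)}$$

The student is trained so that \(q\) matches \(p\) (e.g. cross entropy / KL between the two soft distributions), typically combined with the normal hard-label loss.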
Cited By 876 in 2021 Aug 12
Cited By 225 in 2021 Aug 12
Cited By 1409 in 2021 Aug 12
CNN Model
Input : Cropped Face
Output : 5 (x, y) pairs (eyes, nose, mouth)
Multi-level input
Level 1 : Whole Face / Upper Face / Lower Face
Level 2 : Cropped Left Eye/Right Eye ...
Level 3 : Smaller Crop ...
Cited By 144 in 2021 Aug 12
Note :
\(\gamma\) : a classifier that decides whether to go on to the next stage or to just use the current feature to predict the result
Objective: minimize inference time,
subject to a constraint on the loss of the metric (e.g. accuracy, cross entropy).
Not easy to solve/understand
Now focus on the last layer of early exit
\(T\) : time cost of full model
\(T_4\) : time cost when exiting at conv\(_4\)
\(\sigma_4\) : output of conv\(_4\)
\(\tau(\gamma_4)\) : time cost of decision function \(\gamma_4(\sigma_4(x))\)
Formulate the exit decision as a classification problem with sample weights:
Oracle Label
Sample Weight
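A minimal sketch of the resulting inference loop (my own pseudocode; `blocks`, `exit_heads`, `gammas`, and `final_head` are hypothetical placeholders): after each stage, the decision function \(\gamma\) looks at the current feature and either predicts now or pays for the next stage.

```python
def early_exit_forward(x, blocks, exit_heads, gammas, final_head):
    for block, head, gamma in zip(blocks, exit_heads, gammas):
        x = block(x)                  # e.g. sigma_4 = output of conv_4
        if gamma(x):                  # cheap decision classifier on the current feature
            return head(x)            # exit early and skip the remaining blocks
    return final_head(x)              # fell through: use the full model's prediction
```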
Cited By 103 in 2021 Aug 12
Light model for mask generation
Heavy computation only on a few blocks
Cited By 531 in 2021 Aug 12
Not covered
Cited By 1919 in 2021 Aug 12
How to handle irregular pruning and very sparse matrices on an ASIC
(Application-Specific Integrated Circuit)
Cited By 277 in 2021 Aug 12
Born in 1989