Neural Network Compression

Outline/Content

Pruning

  • Learning both Weights and Connections for Efficient Neural Networks
  • Pruning Filters for Efficient ConvNets

Cited By 3831 in 2021 Aug 12

  1. Regularization
  2. Dropout Ratio Adjustment
  3. Local Pruning and Parameter Co-adaptation
  4. Iterative Pruning
  5. Pruning Neurons

Regularization

Since we want to prune the weights that are near zero, adding a regularizer on the weights will push them toward zero:

\(L_{total} = L_{original} + \lambda \sum |w|^p\)
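A minimal PyTorch-style sketch of this loss (assuming names like `model`, `criterion`, and `lambda_reg`; \(p=1\) gives L1, \(p=2\) gives L2 regularization):

```python
import torch

def total_loss(model, criterion, outputs, targets, lambda_reg=1e-4, p=1):
    # L_original: the task loss (e.g. cross entropy)
    l_original = criterion(outputs, targets)
    # lambda * sum |w|^p over all parameters, pushing weights toward zero
    reg = sum(w.abs().pow(p).sum() for w in model.parameters())
    return l_original + lambda_reg * reg
```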

Dropout Ratio Adjustment

Goal: keep dropout's effect on each neuron consistent before and after (B&A) pruning.

(naively, e.g. 30% before pruning and 30% for the pruned model)

Instead, the dropout rate of the pruned model should be adjusted, since pruning already reduced the capacity.

C : number of non-zero connections/weights (C ∝ N², so the neuron count N ∝ C^0.5)

D : dropout rate
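Putting the notation together, the adjustment should scale the dropout rate with the square root of the remaining connections:

\(D_{new} = D_{old} \sqrt{\frac{C_{new}}{C_{old}}}\)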

 Local Pruning and Parameter Co-adaptation

After pruning, keep the surviving (old) weights and continue training from them.

Do not retrain the pruned structure from newly initialized weights.

Iterative Pruning

Aggressive:

train -> prune 30% -> train

 

Iterative:

train -> prune 10% -> train-> prune 10% -> train -> prune 10% -> train
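A rough sketch of the iterative schedule with plain magnitude pruning (`train_one_round` and `model` are assumed names; the returned masks are meant to be re-applied during training so pruned weights stay at zero):

```python
import torch

def magnitude_prune(model, ratio):
    """Zero out the smallest-magnitude fraction `ratio` of each weight tensor."""
    masks = {}
    for name, w in model.named_parameters():
        if w.dim() < 2:                      # skip biases / norm parameters
            continue
        k = max(1, int(ratio * w.numel()))   # number of weights to zero
        threshold = w.abs().flatten().kthvalue(k).values
        masks[name] = (w.abs() > threshold).float()
        w.data.mul_(masks[name])             # apply the pruning mask in place
    return masks

# Iterative: train -> prune 10% -> train -> prune 10% -> train -> prune 10% -> train
# masks = None
# for _ in range(3):
#     train_one_round(model, masks)          # assumed helper; re-applies masks after updates
#     masks = magnitude_prune(model, 0.10)
# train_one_round(model, masks)
```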

Pruning Neurons

The neurons with zero input/output connections can be safely pruned.

 

In implementation, we do not need to handle this explicitly at every stage, since the regularization term will drive such weights to zero.

Results

Cited By 1904 in 2021 Aug 12

Notation:

i : layer index

j : channel index

h,w : height, width

x : feature map

n : number of channels

F : convolution kernel

Note:

Convolution parameters size is \((n_i, n_{i+1}, k_h, k_w)\)

and \(F_{i,j}\) size is \(n_i \times k_h \times k_w\)


If we prune filter \(j\) of conv layer \(i\),

then feature map \(j\) becomes constant (all its entries are the same), so it carries no information,

and we can remove the weights in layer \(i+1\) that connect to this feature map.

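A sketch of this in PyTorch tensor terms, using PyTorch's \((n_{i+1}, n_i, k_h, k_w)\) weight layout (the names `conv_weight`, `next_conv_weight`, and `num_prune` are assumptions):

```python
import torch

def prune_filters(conv_weight, next_conv_weight, num_prune):
    # One L1 norm per output filter j of layer i
    norms = conv_weight.abs().sum(dim=(1, 2, 3))
    # Keep the filters with the largest norms, preserving their original order
    keep = torch.sort(norms, descending=True).indices[:norms.numel() - num_prune]
    keep = torch.sort(keep).values
    pruned = conv_weight[keep]               # layer i loses `num_prune` filters
    pruned_next = next_conv_weight[:, keep]  # layer i+1 drops the matching input channels
    return pruned, pruned_next
```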

(a) \(\frac{\text{norm}}{\text{max norm}}\) of the filters in layer \(i\)

(b) Accuracy when directly pruning the \(n\) filters with the smallest \(L_1\) norm

(c) Retrained version of (b)

When computing the L1 norm to decide pruning for the next layer, there are two strategies:

1. Use green + yellow (independent: also count the kernels for feature maps already marked for pruning in the previous layer)

2. Only use green (greedy: ignore the kernels for already-pruned feature maps)

\(P(x_i) = conv_a(x_i) + conv_{c}(conv_{b}(x_{i}))\)

For residual block, \(conv_{a}\) & \(conv_{b}\) can be pruned without restrictions.

\(conv_{c}\) has to follow \(conv_{a}\)'s output structure, so the two branches can still be added.


Pruning the smallest L1-norm filters > random pruning > pruning the largest-norm filters

Regularization

  • DSD: Dense-Sparse-Dense Training for Deep Neural Networks
  • Exploring the Regularity of Sparse Structure in Convolutional Neural Networks
  • Structured pruning of deep convolutional neural networks

Cited By 128 in 2021 Aug 12

Results

Cited By 168 in 2021 Aug 12

Cited By 459 in 2021 Aug 12

Quantization

  • GEMMLOWP
    (general matrix multiplication low precision)
  • "Conservative" Quantization: INT8
    • Ristretto
  • Some Tricks
  • "Aggressive" Quantization: INT4 and Lower
  • Quantization-Aware Training
  • NOT IN DISTILLER
    • Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

8-bit : [0, 1, 2, ..., 255]

int8 × int8 => int16

\(\sum\) int16 => int32
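A small numpy sketch of that accumulation pattern (the shapes are arbitrary; int8 products fit in int16, and the sum is accumulated in int32):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(-128, 128, size=(4, 64), dtype=np.int8)
b = rng.integers(-128, 128, size=(64, 4), dtype=np.int8)

# int8 x int8 -> int16 products, summed into an int32 accumulator
products = a[:, :, None].astype(np.int16) * b[None, :, :].astype(np.int16)
acc = products.sum(axis=1, dtype=np.int32)

# same result as doing the matmul in int32 directly
assert np.array_equal(acc, a.astype(np.int32) @ b.astype(np.int32))
```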

"Conservative" Quantization: INT8

How do we determine the scale of the bias/weights/activations?

Examples of how to get the scale:

ReLU6

scale = \(\frac{6}{255}\)

Matrix

scale = \(\frac{\max_{i,j}(a_{i,j}) - \min_{i,j}(a_{i,j})}{255}\) (the range mapped onto 8 bits)

Linear Activation

Feed some data (online or offline), then monitor the max/min values of the hidden-layer outputs.
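A sketch of this calibration and the resulting affine uint8 quantization (the function names and the \((\max-\min)/255\) convention are my assumptions):

```python
import numpy as np

def calibrate(activation_batches):
    # Monitor min/max over some calibration data
    lo = min(float(a.min()) for a in activation_batches)
    hi = max(float(a.max()) for a in activation_batches)
    scale = (hi - lo) / 255.0                 # e.g. 6/255 for ReLU6
    zero_point = round(-lo / scale)
    return scale, zero_point

def quantize(x, scale, zero_point):
    return np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)
```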

For more detail

Cited By 1040 in 2021 Aug 12

$$a \times (q_i+b)$$

$$c \times (q_i+d)$$

Cited By 163 in 2021 Aug 12

Ristretto

Some Tricks

"Aggressive" Quantization: INT4 and Lower

  1. Training / Re-Training
  2. Replacing the activation function
  3. Modifying network structure
  4. First and last layer
  5. Mixed Weights and Activations Precision

I skip this part, since it is too detailed. :(

Quantization-Aware Training

Pseudo-quantize (fake quantize) in the forward pass.

Do the normal backward pass.
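A minimal sketch of how this is usually done, with a straight-through estimator so the rounding is ignored in the backward pass (scale/zero-point handling is simplified):

```python
import torch

def fake_quantize(x, scale, zero_point, qmin=0, qmax=255):
    # Forward: quantize then dequantize, so the network sees quantization error
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    x_q = (q - zero_point) * scale
    # Straight-through estimator: gradients flow through as if this were the identity
    return x + (x_q - x).detach()
```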

Cited By 5574 in 2021 Aug 12

Huffman Coding

Encode the most/least frequent tokens with the shortest/longest binary codes for smaller storage.

E.g. token counts A: 100, B: 10, C: 1, D: 1

Fixed 2 bits each: 2 × (100+10+1+1) = 224 bits

Huffman codes 0/10/110/111 (1/2/3/3 bits): 100 + 20 + 3 + 3 = 126 bits
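A short sketch that builds such codes with `heapq` (a standard Huffman construction, not taken from the paper; the exact bit patterns may differ, but the code lengths match):

```python
import heapq

def huffman_codes(freqs):
    # Heap entries: (frequency, tie-breaker, {symbol: code so far})
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)       # two least frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

freqs = {"A": 100, "B": 10, "C": 1, "D": 1}
codes = huffman_codes(freqs)
print(codes)                                              # code lengths 1/2/3/3
print(sum(freqs[s] * len(codes[s]) for s in freqs))       # 126 bits
```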

Knowledge Distillation

Cited By 6934 in 2021 Aug 12

Teacher's prob : p

Student's prob : q

Soft Probability
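A sketch of the usual distillation loss with softened probabilities (the temperature `T` and mixing weight `alpha` are assumptions, not values from the paper):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.9):
    # Soft probabilities: p from the teacher, q from the student, both at temperature T
    p = F.softmax(teacher_logits / T, dim=1)
    log_q = F.log_softmax(student_logits / T, dim=1)
    soft = F.kl_div(log_q, p, reduction="batchmean") * (T * T)  # T^2 keeps gradient scale
    hard = F.cross_entropy(student_logits, targets)             # usual hard-label loss
    return alpha * soft + (1 - alpha) * hard
```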

Cited By 876 in 2021 Aug 12

Cited By 225 in 2021 Aug 12

Conditional Computation

  • Deep Convolutional Network Cascade for Facial Point Detection
  • Adaptive Neural Networks for Efficient Inference
  • NOT IN DISTILLER
    • ​SBNet

Cited By 1409 in 2021 Aug 12

CNN Model

Input : Cropped Face

Output : 5 (x, y) pairs (eyes, nose, mouth corners)

Multi-level input

Level 1 : Whole Face / Upper Face / Lower Face

Level 2 : Cropped Left Eye/Right Eye ...

Level 3 : Smaller Crop ...

Cited By 144 in 2021 Aug 12

Note :

  1. The deeper the network, the better the performance in these cases
  2. The deeper the network, the more expensive the inference

\(\gamma\) : a classifier that decides whether to go on to the next stage or to use the current feature to predict the result

Objective: minimize inference time,

subject to a bound on the loss in the metric (such as accuracy, cross entropy, etc.)

Not easy to solve/understand directly.

Now focus on the decision at the last early-exit point.

\(T\) : time cost of full model

\(T_4\) : time cost while exit at conv\(_4\)

\(\sigma_4\) : output of conv\(_4\)

\(\tau(\gamma_4)\) : time cost of decision function \(\gamma_4(\sigma_4(x))\)

 

Training \(\gamma_4\) is formulated as a classification problem with sample weights:

Oracle Label

Sample Weight
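A rough sketch of early-exit inference with a single gate at conv\(_4\) (every module name here, `backbone_to_conv4`, `gamma_4`, `early_head`, `rest_of_model`, is hypothetical; single-sample case for clarity):

```python
import torch

def adaptive_forward(x, backbone_to_conv4, gamma_4, early_head, rest_of_model, threshold=0.5):
    sigma_4 = backbone_to_conv4(x)                  # intermediate feature at conv_4
    # gamma_4 decides: exit here (cost ~ T_4 + tau(gamma_4)) or run the full model (cost ~ T)
    if torch.sigmoid(gamma_4(sigma_4)).item() > threshold:
        return early_head(sigma_4)                  # predict from the current feature
    return rest_of_model(sigma_4)                   # continue to the next stage
```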

Cited By 103 in 2021 Aug 12

Motivation/Result

Light Model for Mask Generation

Heavy computation is applied only to the few selected blocks

Other Resources

  • A Survey of Model Compression and Acceleration for Deep Neural Networks
  • 張添烜, Model Compression and Acceleration
  • Efficient Inference Engine 
  • Once For All
  • Song Han

Cited By 531 in 2021 Aug 12

Not Covered

Cited By 1919 in 2021 Aug 12

How to handle irregular pruning and very sparse matrices in an ASIC (Application-Specific Integrated Circuit)

Cited By 277 in 2021 Aug 12

Born in 1989

Neural Network Compression

By sin_dar_soup
