Neural Network Compression

Outline/Content

Pruning

  • Learning both Weights and Connections for Efficient Neural Networks
  • Pruning Filters for Efficient ConvNets

Cited By 3831 in 2021 Aug 12

  1. Regularization
  2. Dropout Ratio Adjustment
  3. Local Pruning and Parameter Co-adaptation
  4. Iterative Pruning
  5. Pruning Neurons

Regularization

Since we want to prune the weights that are near zero, adding a regularizer on the weights will push them toward zero:

\(L_{total} = L_{original} + \lambda \sum |w|^p\)
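A minimal PyTorch-style sketch of this loss (assuming names like `model`, `criterion`, and `lambda_reg`; \(p=1\) gives L1, \(p=2\) gives L2 regularization):

```python
import torch

def total_loss(model, criterion, outputs, targets, lambda_reg=1e-4, p=1):
    # L_original: the task loss (e.g. cross entropy)
    l_original = criterion(outputs, targets)
    # lambda * sum |w|^p over all parameters, pushing weights toward zero
    reg = sum(w.abs().pow(p).sum() for w in model.parameters())
    return l_original + lambda_reg * reg
```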

Dropout Ratio Adjustment

Goal: keep dropout's effect on each neuron consistent before and after (B&A) pruning.

(naively, e.g. 30% before pruning and 30% for the pruned model)

Instead, the dropout rate of the pruned model should be adjusted, since pruning already reduced the capacity.

C : number of non-zero connections/weights (C ∝ N², so the neuron count N ∝ C^0.5)

D : dropout rate
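Putting the notation together, the adjustment should scale the dropout rate with the square root of the remaining connections:

\(D_{new} = D_{old} \sqrt{\frac{C_{new}}{C_{old}}}\)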

 Local Pruning and Parameter Co-adaptation

After pruning, keep the surviving (old) weights and continue training from them.

Do not retrain the pruned structure from newly initialized weights.

Iterative Pruning

Aggressive:

train -> prune 30% -> train

 

Iterative:

train -> prune 10% -> train-> prune 10% -> train -> prune 10% -> train
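A rough sketch of the iterative schedule with plain magnitude pruning (`train_one_round` and `model` are assumed names; the returned masks are meant to be re-applied during training so pruned weights stay at zero):

```python
import torch

def magnitude_prune(model, ratio):
    """Zero out the smallest-magnitude fraction `ratio` of each weight tensor."""
    masks = {}
    for name, w in model.named_parameters():
        if w.dim() < 2:                      # skip biases / norm parameters
            continue
        k = max(1, int(ratio * w.numel()))   # number of weights to zero
        threshold = w.abs().flatten().kthvalue(k).values
        masks[name] = (w.abs() > threshold).float()
        w.data.mul_(masks[name])             # apply the pruning mask in place
    return masks

# Iterative: train -> prune 10% -> train -> prune 10% -> train -> prune 10% -> train
# masks = None
# for _ in range(3):
#     train_one_round(model, masks)          # assumed helper; re-applies masks after updates
#     masks = magnitude_prune(model, 0.10)
# train_one_round(model, masks)
```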

Pruning Neurons

The neurons with zero input/output connections can be safely pruned.

 

In implementation, we do not need to handle this explicitly at every stage, since the regularization term will drive such weights to zero.

Results

Cited By 1904 in 2021 Aug 12

Notation:

i : layer index

j : channel index

h,w : height, width

x : feature map

n : number of channels

F : convolution kernel

Note:

Convolution parameters size is \((n_i, n_{i+1}, k_h, k_w)\)

and \(F_{i,j}\) size is \(n_i \times k_h \times k_w\)


If we prune filter \(j\) of conv layer \(i\),

then feature map \(j\) becomes constant (all its entries are the same), so it carries no information,

and we can remove the weights in layer \(i+1\) that connect to this feature map.

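A sketch of this in PyTorch tensor terms, using PyTorch's \((n_{i+1}, n_i, k_h, k_w)\) weight layout (the names `conv_weight`, `next_conv_weight`, and `num_prune` are assumptions):

```python
import torch

def prune_filters(conv_weight, next_conv_weight, num_prune):
    # One L1 norm per output filter j of layer i
    norms = conv_weight.abs().sum(dim=(1, 2, 3))
    # Keep the filters with the largest norms, preserving their original order
    keep = torch.sort(norms, descending=True).indices[:norms.numel() - num_prune]
    keep = torch.sort(keep).values
    pruned = conv_weight[keep]               # layer i loses `num_prune` filters
    pruned_next = next_conv_weight[:, keep]  # layer i+1 drops the matching input channels
    return pruned, pruned_next
```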

(a) \(\frac{\text{norm}}{\text{max norm}}\) of the filters in layer \(i\)

(b) Accuracy when directly pruning the \(n\) filters with the smallest \(L_1\) norm

(c) Retrained version of (b)

When computing the L1 norm to decide pruning for the next layer, there are two strategies:

1. Use green + yellow (independent: also count the kernels for feature maps already marked for pruning in the previous layer)

2. Only use green (greedy: ignore the kernels for already-pruned feature maps)

\(P(x_i) = conv_a(x_i) + conv_{c}(conv_{b}(x_{i}))\)

For residual block, \(conv_{a}\) & \(conv_{b}\) can be pruned without restrictions.

\(conv_{c}\) has to follow \(conv_{a}\)'s output structure, so the two branches can still be added.


Pruning the smallest L1-norm filters > random pruning > pruning the largest-norm filters

Regularization

  • DSD: Dense-Sparse-Dense Training for Deep Neural Networks
  • Exploring the Regularity of Sparse Structure in Convolutional Neural Networks
  • Structured pruning of deep convolutional neural networks

Cited By 128 in 2021 Aug 12

Results

Cited By 168 in 2021 Aug 12

Cited By 459 in 2021 Aug 12

Quantization

  • GEMMLOWP
    (general matrix multiplication low precision)
  • "Conservative" Quantization: INT8
    • Ristretto
  • Some Tricks
  • "Aggressive" Quantization: INT4 and Lower
  • Quantization-Aware Training
  • NOT IN DISTILLER
    • Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

8-bit : [0, 1, 2, ..., 255]

int8 × int8 => int16

\(\sum\) int16 => int32
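A small numpy sketch of that accumulation pattern (the shapes are arbitrary; int8 products fit in int16, and the sum is accumulated in int32):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(-128, 128, size=(4, 64), dtype=np.int8)
b = rng.integers(-128, 128, size=(64, 4), dtype=np.int8)

# int8 x int8 -> int16 products, summed into an int32 accumulator
products = a[:, :, None].astype(np.int16) * b[None, :, :].astype(np.int16)
acc = products.sum(axis=1, dtype=np.int32)

# same result as doing the matmul in int32 directly
assert np.array_equal(acc, a.astype(np.int32) @ b.astype(np.int32))
```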

"Conservative" Quantization: INT8

How do we determine the scale of the bias/weights/activations?

Examples of how to get the scale:

ReLU6

scale = \(\frac{6}{255}\)

Matrix

scale = \(\frac{\max_{i,j}(a_{i,j}) - \min_{i,j}(a_{i,j})}{255}\) (the range mapped onto 8 bits)

Linear Activation

Feed some data (online or offline), then monitor the max/min values of the hidden-layer outputs.
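A sketch of this calibration and the resulting affine uint8 quantization (the function names and the \((\max-\min)/255\) convention are my assumptions):

```python
import numpy as np

def calibrate(activation_batches):
    # Monitor min/max over some calibration data
    lo = min(float(a.min()) for a in activation_batches)
    hi = max(float(a.max()) for a in activation_batches)
    scale = (hi - lo) / 255.0                 # e.g. 6/255 for ReLU6
    zero_point = round(-lo / scale)
    return scale, zero_point

def quantize(x, scale, zero_point):
    return np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)
```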

For more detail

Cited By 1040 in 2021 Aug 12

$$a \times (q_i+b)$$

$$c \times (q_i+d)$$

Cited By 163 in 2021 Aug 12

Ristretto

Some Tricks

"Aggressive" Quantization: INT4 and Lower

  1. Training / Re-Training
  2. Replacing the activation function
  3. Modifying network structure
  4. First and last layer
  5. Mixed Weights and Activations Precision

I skip this part, since it is too detailed. :(

Quantization-Aware Training

Pseudo-quantize (fake quantize) in the forward pass.

Do the normal backward pass.
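A minimal sketch of how this is usually done, with a straight-through estimator so the rounding is ignored in the backward pass (scale/zero-point handling is simplified):

```python
import torch

def fake_quantize(x, scale, zero_point, qmin=0, qmax=255):
    # Forward: quantize then dequantize, so the network sees quantization error
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    x_q = (q - zero_point) * scale
    # Straight-through estimator: gradients flow through as if this were the identity
    return x + (x_q - x).detach()
```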

Cited By 5574 in 2021 Aug 12

Huffman Coding

Encode the most/least frequent tokens with the shortest/longest binary codes for smaller storage.

E.g. token counts A: 100, B: 10, C: 1, D: 1

Fixed 2 bits each: 2 × (100+10+1+1) = 224 bits

Huffman codes 0/10/110/111 (1/2/3/3 bits): 100 + 20 + 3 + 3 = 126 bits
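A short sketch that builds such codes with `heapq` (a standard Huffman construction, not taken from the paper; the exact bit patterns may differ, but the code lengths match):

```python
import heapq

def huffman_codes(freqs):
    # Heap entries: (frequency, tie-breaker, {symbol: code so far})
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)       # two least frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

freqs = {"A": 100, "B": 10, "C": 1, "D": 1}
codes = huffman_codes(freqs)
print(codes)                                              # code lengths 1/2/3/3
print(sum(freqs[s] * len(codes[s]) for s in freqs))       # 126 bits
```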

Knowledge Distillation

Cited By 6934 in 2021 Aug 12

Teacher's prob : p

Student's prob : q

Soft Probability
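A sketch of the usual distillation loss with softened probabilities (the temperature `T` and mixing weight `alpha` are assumptions, not values from the paper):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.9):
    # Soft probabilities: p from the teacher, q from the student, both at temperature T
    p = F.softmax(teacher_logits / T, dim=1)
    log_q = F.log_softmax(student_logits / T, dim=1)
    soft = F.kl_div(log_q, p, reduction="batchmean") * (T * T)  # T^2 keeps gradient scale
    hard = F.cross_entropy(student_logits, targets)             # usual hard-label loss
    return alpha * soft + (1 - alpha) * hard
```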

Cited By 876 in 2021 Aug 12

Cited By 225 in 2021 Aug 12

Conditional Computation

  • Deep Convolutional Network Cascade for Facial Point Detection
  • Adaptive Neural Networks for Efficient Inference
  • NOT IN DISTILLER
    • ​SBNet

Cited By 1409 in 2021 Aug 12

CNN Model

Input : Cropped Face

Output : 5 (x, y) pairs (eyes, nose, mouth corners)

Multi-level input

Level 1 : Whole Face / Upper Face / Lower Face

Level 2 : Cropped Left Eye/Right Eye ...

Level 3 : Smaller Crop ...

Cited By 144 in 2021 Aug 12

Note :

  1. The deeper the network, the better the performance in these cases
  2. The deeper the network, the more expensive the inference

\(\gamma\) : a classifier that decides whether to go on to the next stage or to use the current feature to predict the result

Objective: minimize inference time,

subject to a bound on the loss in the metric (such as accuracy, cross entropy, etc.)

Not easy to solve/understand directly.

Now focus on the decision at the last early-exit point.

\(T\) : time cost of full model

\(T_4\) : time cost while exit at conv\(_4\)

\(\sigma_4\) : output of conv\(_4\)

\(\tau(\gamma_4)\) : time cost of decision function \(\gamma_4(\sigma_4(x))\)

 

Training \(\gamma_4\) is formulated as a classification problem with sample weights:

Oracle Label

Sample Weight
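A rough sketch of early-exit inference with a single gate at conv\(_4\) (every module name here, `backbone_to_conv4`, `gamma_4`, `early_head`, `rest_of_model`, is hypothetical; single-sample case for clarity):

```python
import torch

def adaptive_forward(x, backbone_to_conv4, gamma_4, early_head, rest_of_model, threshold=0.5):
    sigma_4 = backbone_to_conv4(x)                  # intermediate feature at conv_4
    # gamma_4 decides: exit here (cost ~ T_4 + tau(gamma_4)) or run the full model (cost ~ T)
    if torch.sigmoid(gamma_4(sigma_4)).item() > threshold:
        return early_head(sigma_4)                  # predict from the current feature
    return rest_of_model(sigma_4)                   # continue to the next stage
```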

Cited By 103 in 2021 Aug 12

Motivation/Result

Light Model for Mask Generation

Heavy computation is applied only to the few selected blocks

Other Resources

  • A Survey of Model Compression and Acceleration for Deep Neural Networks
  • 張添烜, Model Compression and Acceleration
  • Efficient Inference Engine 
  • Once For All
  • Song Han

Cited By 531 in 2021 Aug 12

Not Covered

Cited By 1919 in 2021 Aug 12

How to handle irregular pruning and very sparse matrices in an ASIC (Application-Specific Integrated Circuit)

Cited By 277 in 2021 Aug 12

Born in 1989

Neural Network Compression

By sin_dar_soup
