Vinod Ganesan
Dept. of CSE, IIT Madras
Source: Jeff Dean ISSCC 2020 [1]
[1] Deep Learning Revolution and its Implications for Computer Architecture and Chip Design, ISSCC 2020
10x/year
Source: OpenAI Blog [1]
Moore's law
There is a huge demand gap
Addressing this problem is vital to sustain the growth trajectory of DL
[1] https://openai.com/blog/ai-and-compute/
Source: OpenAI Blog [1]
Efficient Deep Learning
Efficient Systems
SparseCache
Efficient Design Methodology
Generalizable cost models
Efficient Neural Networks
FuSeConv
DATE 2020
IISWC 2020
DATE 2021
Efficient Deep Learning
Efficient Systems
SparseCache
Efficient Design Methodology
Generalizable cost models
Efficient Neural Networks
FuSeConv
DATE 2020
Enabling DNNs in resource-constrained environments is challenging
Area: 0.1-2 \( mm^2 \)
Power: ~100mW
Eyeriss (2016): 12.25 \( mm^2 \)
SCNN (2017): 8 \( mm^2 \)
NVDLA (2017): 5 \( mm^2 \)
Area: ~0.4 \( mm^2 \)
Very high area requirements!
[1] Chen et al., Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks, ISCA 2016
[2] Parashar et al., SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks, ISCA 2017
[3] http://nvdla.org/
Area: ~0.4 \( mm^2 \)
Wu et al., HPCA 2019 [1]
Key Idea: Develop light-weight micro-architectural extensions for general purpose processors
[1] Wu et al., Machine Learning at Facebook: Understanding Inference at the Edge, HPCA 2019
Reducing this miss rate is crucial to improving performance
[Figure: DNN data structures in a 16 KB 4-way set-associative cache exhibit high sparsity (dynamic: 50.6%, static: 65.6%), reducing effective cache capacity and worsening miss rate and performance]
Zero values pollute the cache
Key Idea 2: Develop cache extensions to store the addresses of zero-valued cache lines compactly
Storing these zero cache-line addresses separately is not sufficient
Key Idea 3: Merge and store contiguous/strided zero-valued cache line addresses compactly
Proposal
Augment the cache architecture with a TCAM-based "null cache" to store zero-valued cache-line addresses
[Figure: merging 4-bit zero cache-line addresses in the null cache. A first-order merge combines two addresses that differ in a single bit into one TCAM entry with a don't-care (X) bit; a second-order merge combines two such first-order entries into an entry with two X bits.]
Need to capture such higher order merge candidates!
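A minimal Python sketch of this merging idea (illustrative only: the 4-bit entries, string encoding, and merge rule are assumptions for exposition, not the exact SparseCache hardware):

```python
# Illustrative model of ternary (TCAM-style) zero-address entries.
# Entries are strings over {'0', '1', 'X'}, where 'X' is a don't-care bit.

def try_merge(a, b):
    """Merge two entries that differ in exactly one bit position."""
    diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diff) != 1:
        return None                       # not a merge candidate
    i = diff[0]
    return a[:i] + 'X' + a[i + 1:]        # replace the differing bit with X

def matches(entry, addr):
    """A lookup hits if every non-X bit of the entry equals the address bit."""
    return all(e in ('X', b) for e, b in zip(entry, addr))

# First-order merge: 0100 and 0101 differ only in the last bit -> 010X
e1 = try_merge('0100', '0101')                    # '010X'
# Second-order merge: 010X and 011X differ only in one bit -> 01XX
e2 = try_merge(e1, try_merge('0110', '0111'))     # '01XX'
print(e1, e2, matches(e2, '0111'))                # 010X 01XX True
```

Each higher merge order doubles the number of zero-valued addresses covered by a single entry, which is why capturing such candidates matters.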
Features
SparseCache was integrated into a low-end Intel processor (configuration similar to Intel Galileo boards) and evaluated using an in-house x86 simulator built on Intel's Pin tool
Four pruned image-recognition DNNs used as benchmarks
0.1% area and 3.8% power overhead
[1] Yu et al., Scalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism, ISCA 2017
[2] Han et al., Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
Processor Config. | Intel Atom, in-order |
---|---|
Operating Freq. | 400 MHz |
L1 Data Cache | 16 KB, 4-way set-associative, 32 B/block, 1-cycle hit latency |
Null Cache | 1 KB, address only, TCAM, 1-cycle hit latency |
Main Memory Latency | 100 cycles |
Benchmarks | Pruning | Dataset | No. layers |
---|---|---|---|
LeNet | Scalpel [1] | MNIST | 5 |
AlexNet | DC [2] | ImageNet | 8 |
VGG-16 | DC [2] | ImageNet | 16 |
SqueezeNet | DC [2] | ImageNet | 26 |
Additionally, the SparseCache configuration reduces execution time by 13% and miss rate by 20% on AlexNet
Light-weight extensions to general-purpose processors (GPPs) are a viable alternative to custom hardware design for resource-constrained devices
Results
5-21% application execution time reduction across 4 benchmarks
Same or better performance than a 2X larger data-cache with less than half its size
Efficient Deep Learning
Efficient Systems
SparseCache
Efficient Design Methodology
Generalizable cost models
Efficient Neural Networks
FuSeConv
Efficient Deep Learning
Efficient Systems
SparseCache
Efficient Design Methodology
Generalizable cost models
Efficient Neural Networks
FuSeConv
IISWC 2020
Characterizing each DNN on each hardware device is infeasible
Source: Reddi et al., MLPerf Inference Benchmark, ISCA 2020
N times
Latency
DNN Representation
Trained with many (DNN, latency) pairs to generalize to unseen DNNs
Learn an ML model that can predict the latency of a network on a device
How do we generalize this to multiple hardware devices?
DNN Representation
HW Representation
Latency
Key idea: A unique hardware representation for the cost model
Challenges
Android App
Central Database
118 networks
105 mobile devices
A diverse set of DNNs ranging from 40 to 800 MFLOPs, spanning many operators and parameters
Significant diversity in devices, ranging from an 8-year-old Cortex-A53 to a 9-month-old Kryo 585
DNN Representation
Latency
Challenges
Can we use simple features such as core frequency, DRAM size, etc.?
2.5x latency variability for the same frequency and DRAM size
Poor predictive power for a cost model learnt with frequency and DRAM size
Using static hardware features (CPU Freq, DRAM size, etc.) as the representation
DNN Representation
Latency
Unique fingerprint for a hardware device: its latencies on the signature-set DNNs, e.g. (100.4, 350.55, 59.6)
Key Question: How to choose the DNNs for the Signature Set?
Represent a device by its latency on a set of DNNs - the signature set
We propose three methods to select the signature set from the set of 118 networks.
We are interested in DNNs that help discriminate the hardware devices.
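As an illustration of the fingerprint idea (a toy sketch: the variance-based selection below is a stand-in, not one of the three proposed methods), a device can be represented by its latency vector on a small signature set chosen to discriminate between devices:

```python
import numpy as np

# latency[d, n]: measured latency of network n on device d (toy data here;
# in the real setting this comes from the 105-device x 118-network corpus).
rng = np.random.default_rng(0)
latency = rng.uniform(20, 400, size=(8, 20))

def pick_signature_set(lat, k):
    """Toy heuristic: keep the k networks whose normalised latencies vary the
    most across devices, i.e. those that best discriminate the hardware."""
    spread = lat.std(axis=0) / lat.mean(axis=0)   # per-network variability
    return list(np.argsort(-spread)[:k])

signature = pick_signature_set(latency, k=3)
fingerprints = latency[:, signature]              # one latency vector per device
print(signature, fingerprints[0])
```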
XGBoost
\( R^2 \): Coefficient of Determination
70% of devices as the training set, 30% as the test set
Only training set participates in signature set selection
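A hedged sketch of this setup: concatenate the DNN representation with the device fingerprint, fit a gradient-boosted regressor, and report \( R^2 \) on held-out devices. The synthetic features and the specific xgboost/scikit-learn calls are illustrative assumptions:

```python
import numpy as np
from xgboost import XGBRegressor           # gradient-boosted trees, as in the slides
from sklearn.metrics import r2_score

# X = [DNN features | hardware fingerprint], y = measured latency (synthetic here).
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 16))
y = X @ rng.normal(size=16) + rng.normal(scale=0.1, size=1000)

# 70% of the rows stand in for training devices, 30% for unseen test devices.
split = int(0.7 * len(X))
model = XGBRegressor(n_estimators=200, max_depth=6)
model.fit(X[:split], y[:split])
print("R^2 on held-out set:", r2_score(y[split:], model.predict(X[split:])))
```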
RS: \( R^2 \) = 0.9125
SCCS: \( R^2 \) = 0.944
MIS: \( R^2 \) = 0.943
The cost model generalizes well to unseen mobile devices
Interesting Questions:
Mean \( R^2 \) = 0.927 (MIS, SCCS)
[Example: before learning from a new device, the model predicts 70 ms against an actual 25 ms; after learning, at inference it predicts 24.5 ms against the same 25 ms]
Even for 10% contribution from each device, the model shows superior \( R^2 \) values
11x
Collaborative \( R^2 \): 0.98
11x lower characterisation cost for collaborative model compared to the unique model
Extremely small contributions from many people \( \implies \) accurate generalizable cost models
Efficient Deep Learning
Efficient Systems
SparseCache
Efficient Design Methodology
Generalizable cost models
Efficient Neural Networks
FuSeConv
Efficient Deep Learning
Efficient Systems
SparseCache
Efficient Design Methodology
Generalizable cost models
Efficient Neural Networks
FuSeConv
DATE 2021
Systolic-arrays and depth-wise separable convolutions are ubiquitous for efficient inference
25-29x more performant than GPUs
12/12 efficient networks use depth-wise separable convolutions
Source: Han Cai et al., Once-for-All, ICLR 2019
Source: TPU, cloud.google.com
Comparing MobileNet-V2 with ResNet-50
Incommensurate scaling
> Why are Depth-Wise (DW) convolutions inefficient on systolic arrays?
> What can we do to make DW convolution faster on systolic arrays?
> How good is the inference performance of our proposed solution?
Standard Convolution
Depthwise Separable Convolution = Depthwise Conv + Pointwise Conv
\( (N \times M \times C \times K^2 \times C_{out}) \)
\( (N \times M \times C \times (K^2 + C_{out})) \)
Up to \( K^2 \) times reduction in computations
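A quick worked check of the two operation counts above, for an N x M output with C input channels, K x K kernels, and \( C_{out} \) output channels (the concrete values are arbitrary examples):

```python
# Multiply-accumulate counts for the two convolution types above.
N, M, C, K, C_out = 56, 56, 64, 3, 128

standard  = N * M * C * K**2 * C_out       # standard convolution
separable = N * M * C * (K**2 + C_out)     # depthwise + pointwise
print(standard / separable)                # ~8.4x here; approaches K^2 as C_out grows
```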
Fully Separable Convolution (FuSeConv): composed of 1D depthwise convolutions and a pointwise convolution
Depthwise Separable Convolution
Full Variant (D = 1)
Half Variant (D = 2)
Execute independent 1D convolutions
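One plausible PyTorch sketch of such a block (an illustration under assumptions, not the paper's exact implementation: here the input channels are simply split between a 1 x K and a K x 1 depthwise branch before the pointwise convolution; the full and half variants differ in how channels are allocated to the two branches):

```python
import torch
import torch.nn as nn

class FuSeBlock(nn.Module):
    """Illustrative fully separable block: 1xK and Kx1 depthwise convolutions
    on two channel groups, followed by a 1x1 pointwise convolution."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        half = c_in // 2
        self.row = nn.Conv2d(half, half, (1, k), padding=(0, k // 2), groups=half)
        self.col = nn.Conv2d(c_in - half, c_in - half, (k, 1),
                             padding=(k // 2, 0), groups=c_in - half)
        self.pw = nn.Conv2d(c_in, c_out, 1)          # pointwise convolution

    def forward(self, x):
        a, b = torch.split(x, [self.row.in_channels, self.col.in_channels], dim=1)
        return self.pw(torch.cat([self.row(a), self.col(b)], dim=1))

y = FuSeBlock(32, 64)(torch.randn(1, 32, 56, 56))    # -> shape (1, 64, 56, 56)
```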
On a 32 x 32 systolic array:
Area | 4.35% |
---|---|
Power | 2.25% |
All PEs are completely utilized due to the 1D convolutions of FuSeConv and modified dataflow
> Evaluate 4 FuSe variants and compare with baseline
> Latency simulator based on SCALE-SIM [1] from ARM and Georgia Tech
> MobileNets (V1, V2, V3-Small, V3-Large) and MnasNet-B1
Choose half of the layers greedily to maximize per-layer speedup (see the sketch below)
> Full (D = 1) FuSe
> Half (D = 2) FuSe
> 50% Full FuSe
> 50% Half FuSe
All layers are replaced with FuSeConv
[1] Samajdar et al., SCALE-Sim: Systolic CNN Accelerator Simulator, ISPASS 2020
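A small sketch of the greedy selection used by the 50% variants above, assuming per-layer speedup estimates are available from the latency simulator (layer names and the selection helper are illustrative):

```python
def pick_layers_to_fuse(layer_speedups):
    """Greedily replace the half of the layers with the largest estimated
    speedup from switching to FuSeConv."""
    ranked = sorted(layer_speedups, key=layer_speedups.get, reverse=True)
    return ranked[: len(ranked) // 2]

# e.g. speedups estimated with the SCALE-Sim-based latency simulator
speedups = {"block1.dw": 6.8, "block2.dw": 3.1, "block3.dw": 5.4, "block4.dw": 1.9}
print(pick_layers_to_fuse(speedups))     # ['block1.dw', 'block3.dw']
```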
Average accuracy drop of ~1% in Half variants, and ~0.3% in Full variants
Up to 7.23x improvement in execution time
Efficient Deep Learning
Efficient Systems
SparseCache
Efficient Design Methodology
Generalizable cost models
Efficient Neural Networks
FuSeConv
Efficient DNN Training
Efficient Operators
Efficient Transformers
Fast devices with 50 ms mean latency
Medium-fast devices with 115 ms mean latency
Slow devices with 235 ms mean latency
A mobile device is characterized by its latency distribution over the 118 networks
> A class of algorithms that runs efficiently on systolic architectures
> Computational loops are transformed into a Regular Iterative Algorithm (RIA)
> Only RIAs with constant offsets can be synthesized on systolic arrays
> The i, j indices map to the spatial dimensions of the array
> The k index maps to the temporal dimension
[Figure: systolic mapping — i and j map to the two spatial dimensions of the array (Dim1, Dim2); k maps to time]
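For instance, matrix multiplication written as an RIA has only constant index offsets, which is what makes it a systolic algorithm; i and j map to the two spatial dimensions and k to time (plain-Python illustration):

```python
# Matrix multiplication as a Regular Iterative Algorithm: every reference uses
# constant offsets, so it maps onto a systolic array with (i, j) spatial and k temporal.
def matmul_ria(A, B):
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for k in range(m):                 # temporal dimension
        for i in range(n):             # spatial dimension 1
            for j in range(p):         # spatial dimension 2
                # c(i, j, k) = c(i, j, k-1) + a(i, k) * b(k, j)
                C[i][j] += A[i][k] * B[k][j]
    return C

print(matmul_ria([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19.0, 22.0], [43.0, 50.0]]
```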
> Non-constant offset indices in the RIA
> 2D Convolution is not a Systolic Algorithm
After im2col transformation
Under-utilization
> More filters -> Data reuse -> High Utilization
> Extendable to convolutions with channels too
2D Convolution with Multiple Filters
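A minimal NumPy sketch of the im2col view (illustrative, single input channel): unrolling the K x K patches turns the 2D convolution into one matrix multiplication, and adding more filters widens that multiplication, which is what raises systolic-array utilization.

```python
import numpy as np

def im2col_conv2d(x, w):
    """x: (H, W) input, w: (F, K, K) filters; 'valid' convolution via im2col."""
    H, W = x.shape
    F, K, _ = w.shape
    oh, ow = H - K + 1, W - K + 1
    # Each column is one unrolled K*K patch -> a (K*K, oh*ow) matrix.
    cols = np.stack([x[i:i + K, j:j + K].ravel()
                     for i in range(oh) for j in range(ow)], axis=1)
    # A single matrix multiplication computes all F output maps at once.
    return (w.reshape(F, K * K) @ cols).reshape(F, oh, ow)

out = im2col_conv2d(np.arange(25.0).reshape(5, 5), np.ones((4, 3, 3)))
print(out.shape)                          # (4, 3, 3)
```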
> FuSeConv is composed of only 1D convolutions:
> 1 x K and K x 1 depthwise convolutions
> 1 x 1 Pointwise Convolution
Matrix Multiplication
1D Convolutions
> Constant Offset => Systolic Algorithm
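For reference, the 1D depthwise convolutions FuSeConv relies on are simple per-channel recurrences; the sketch below is a plain (non-systolic) implementation whose channels are fully independent, which is the parallelism the modified dataflow exploits:

```python
# Reference 1 x K depthwise convolution: one independent 1D convolution per channel.
# Per the slides, its RIA form has constant offsets, so each channel maps onto
# the systolic array, and different channels can run in parallel.
def depthwise_conv1d(x, w):
    C, N, K = len(x), len(x[0]), len(w[0])
    y = [[0.0] * (N - K + 1) for _ in range(C)]
    for c in range(C):                    # channels are independent
        for i in range(N - K + 1):
            for k in range(K):            # accumulate over the kernel (temporal)
                y[c][i] += w[c][k] * x[c][i + k]
    return y

print(depthwise_conv1d([[1, 2, 3, 4]], [[1, 0, -1]]))   # [[-2.0, -2.0]]
```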
Channel 0
Channel 1
> Channelwise 1D convolutions can be computed in parallel
> FuSeConv + our dataflow mapping exploits parallelism that is absent when depthwise convolutions are mapped natively onto systolic arrays
Multiple channels can execute in parallel along dimension 2
Operator-wise Latency Distribution of Full FuSe Variants
Layerwise Speedup of MobileNet-V2 Full FuSe Variant
More speedup from the initial layers
The dominant contributor to inference latency shifts from depthwise to pointwise convolutions