Hardware Software Co-Design for Efficient Deep Learning Systems

Vinod Ganesan

Dept. of CSE, IIT Madras

Deep Learning is revolutionizing our daily lives

Source: Jeff Dean ISSCC 2020 [1]

[1] Deep Learning Revolution and its Implications for Computer Architecture and Chip Design, ISSCC 2020

The cost of the revolution - Exploding Compute demands

Compute demand for training state-of-the-art DL models is growing at roughly 10x per year, far outpacing Moore's law (Source: OpenAI Blog [1]).

There is a huge demand gap. Addressing this problem is vital to sustain the growth trajectory of DL.

[1] https://openai.com/blog/ai-and-compute/

Our thesis

Efficient Deep Learning

  • Efficient Systems: SparseCache (DATE 2020)
  • Efficient Design Methodology: Generalizable cost models (IISWC 2020)
  • Efficient Neural Networks: FuSeConv (DATE 2021)


Motivation 

Enabling DNNs on resource-constrained environments is challenging

Area: 0.1-2 \( mm^2 \)

Power: ~100mW

Motivation: What about accelerators?

Eyeriss (2016): 12.25 \( mm^2 \)

SCNN (2017): 8 \( mm^2 \)

NVDLA (2017): 5 \( mm^2 \)

Area: ~0.4 \( mm^2 \)

Very high area requirements!

[1] Chen et al., Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks, ISCA 2016

[2] Parashar et al., SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks, ISCA 2017

[3] http://nvdla.org/

 

Motivation: Optimizing General Purpose Processors is Vital!

Area: ~0.4 \( mm^2 \)

Wu et al., HPCA 2019 [1]

Key Idea: Develop light-weight micro-architectural extensions for general purpose processors

[1] Wu et al., Machine Learning at Facebook: Understanding Inference at the Edge, HPCA 2019

Insight: Caches have high miss-rates running DNNs

Reducing this miss rate is crucial to improving performance

16KB 4-way Set Associative Cache

Insight: DNNs are highly sparse and pollute the cache

[Figure: DNN data structures exhibit both dynamic sparsity and static sparsity, measured at 50.6% and 65.6%.]

Zero values pollute the cache, reducing effective cache capacity, which increases the miss rate and degrades performance.

Key Idea 2: Develop Cache extensions to store zero-valued addresses compactly

Insight: Coarse-grained sparsity is preferred

Storing each zero cache-line address as a separate entry is not sufficient

Key Idea 3: Merge and store contiguous/strided zero-valued cache line addresses compactly

Key Ideas

  • Develop lightweight micro-architectural extensions for general-purpose processors to accelerate DL 
  • Develop cache extensions to store zero-valued addresses compactly 
  • Merge and store multiple zero-valued addresses in a single entry

Proposal

Augment the cache architecture with a TCAM-based "null cache" to store zero-valued addresses

Solution: TCAM as a compact storage

  • Ternary Content Addressable Memories (TCAMs) can encode don't-cares ("X") in addition to 0s and 1s
  • Each entry can store multiple contiguous or strided addresses with appropriate "X"s
  • Best fit for DNN workloads exhibiting coarse-grained sparsity

Challenges with TCAM

  1. Dynamically identify zero cache-blocks
  2. Low area and power overhead
  3. Dynamically identify and merge multiple entries in TCAM

[Figure: worked example on 4-bit zero cache-line addresses. A First Order Merge combines two stored addresses that differ in a single bit into one TCAM entry with an "X" in that position; a Second Order Merge combines two such single-"X" entries into one entry with two "X"s.]

Need to capture such higher order merge candidates!
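A minimal sketch of the merging idea, modelling TCAM entries as strings over '0', '1', 'X'; the example addresses and the merge policy shown here are illustrative rather than the exact SparseCache hardware logic:

```python
def try_merge(a, b):
    """Merge two TCAM entries (strings over '0', '1', 'X') if they agree
    everywhere except exactly one fully-specified bit position; that
    position becomes a don't-care 'X' in the merged entry."""
    assert len(a) == len(b)
    diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diff) != 1:
        return None                      # not mergeable into a single entry
    i = diff[0]
    if 'X' in (a[i], b[i]):
        return None                      # only merge on a specified 0/1 bit
    return a[:i] + 'X' + a[i + 1:]

# First Order Merge: two 4-bit zero cache-line addresses differing in one bit
e1 = try_merge('0100', '0101')                  # -> '010X'
# Second Order Merge: two first-order entries collapse into a two-X entry
e2 = try_merge(e1, try_merge('0110', '0111'))   # -> '01XX'
print(e1, e2)
```

A lookup then matches an incoming address against stored entries with 'X' matching either bit value, so one entry covers a whole run of contiguous (or strided, when the 'X' sits in a higher-order bit) zero cache lines.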

Solution: SparseCache

Features

  • A TCAM-based, write-through null-cache that stores zero cache-line addresses
  • A Zero Detector and Approximator (ZDA) to detect zero cache lines and approximate near-zero cache lines (Challenge 1)
  • An Address Merger (AM) to find and merge contiguous and strided cache-line addresses (Challenge 3)
  • An augmented replacement policy for the null-cache to preferentially retain zero cache-line entries
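A behavioural sketch of one plausible reading of how these components interact on a read; the class and method names are hypothetical, and the real design's write and coherence handling is not modelled here:

```python
ZERO_LINE = bytes(32)                    # a 32-byte cache line of zeros

class SparseCacheSketch:
    """Assumed flow: the Zero Detector marks all-zero lines on a fill, the
    null-cache records only their addresses (the Address Merger may fold
    them into an existing 'X' entry), and reads that hit the null-cache
    return a zero line without occupying the data cache."""

    def __init__(self, data_cache, null_cache, memory):
        self.data_cache = data_cache     # assumed API: .lookup(addr), .fill(addr, line)
        self.null_cache = null_cache     # assumed API: .lookup(addr) -> bool, .insert(addr)
        self.memory = memory             # assumed API: .read_line(addr) -> bytes

    def read(self, addr):
        if self.null_cache.lookup(addr): # zero-valued line: served from the null-cache
            return ZERO_LINE
        line = self.data_cache.lookup(addr)
        if line is not None:
            return line
        line = self.memory.read_line(addr)
        if line == ZERO_LINE:            # Zero Detector: keep the address, not the data
            self.null_cache.insert(addr)
        else:
            self.data_cache.fill(addr, line)
        return line
```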

Solution: SparseCache

Experimental Methodology

  • SparseCache was integrated into a low-end Intel processor configuration (similar to Intel Galileo boards) and evaluated using an in-house x86 simulator built on Intel's Pin tool

  • Four pruned image-recognition DNNs used as benchmarks

0.1% area and 3.8% power overhead

[1] Yu et al., Scalpel: Customizing DNN Pruning to the Underlying Parallelism, ISCA 2017

[2] Han et al., Deep Compression: Compressing DNNs with Pruning, Trained Quantization and Huffman Coding, ICLR 2016

 

Processor Config.    Intel Atom, in-order
Operating Freq.      400 MHz
L1 Data Cache        16 KB, 4-way set-associative, 32 B/block, 1-cycle hit latency
Null-Cache           1 KB, address-only, TCAM, 1-cycle hit latency
Main Memory Latency  100 cycles

Benchmark    Pruning      Dataset   No. of layers
LeNet        Scalpel [1]  MNIST     5
AlexNet      DC [2]       ImageNet  8
VGG-16       DC [2]       ImageNet  16
SqueezeNet   DC [2]       ImageNet  26

Results: Performance improvements

  • 5-21% reduction in application execution times across all benchmarks 
  • 5-28% reduction in cache miss-rates across all benchmarks 
  • Benefits proportional to number of zero-valued cache lines
  • Miss rate reductions will lead to system-wide energy benefits due to reductions in off-chip memory accesses

Results: Performance improvements of SparseCache (16 KB + 1 KB) over a bigger data-cache (32 KB)

  • A 16 KB data-cache plus a 1 KB null-cache is as performant as a 32 KB data-cache that consumes significantly more area and power
  • Additionally, the SparseCache configuration gives 13% and 20% improvements in execution time and miss rate, respectively, on AlexNet

Summary

  • We propose SparseCache, a set of lightweight cache extensions for general-purpose processors that speed up DNN execution by exploiting sparsity
  • SparseCache consists of a TCAM-based null-cache, a Zero Detector and Approximator (ZDA), and an Address Merger (AM) to detect and store zero-valued addresses compactly 
  • Lightweight extensions to GPPs are a viable alternative to custom hardware design for resource-constrained devices

  • Results

    • 5-21% application execution time reduction across 4 benchmarks 

    • Same or better performance than a 2x larger data-cache, at roughly half its size


Our thesis

Efficient Deep Learning

  • Efficient Systems: SparseCache
  • Efficient Design Methodology: Generalizable cost models (IISWC 2020)
  • Efficient Neural Networks: FuSeConv

Motivation: Exploding product space of DNNs and devices

Characterizing every DNN on every hardware device is infeasible

Source: Reddi et al., MLPerf Inference Benchmark, ISCA 2020

Insight: Build Cost Models to Estimate Latency

[Figure: a cost model takes a DNN representation and predicts latency, replacing repeated (N times) on-device measurement.]

Learn an ML model that can predict the latency of a network on a device, trained with many (DNN, latency) pairs so that it generalizes to unseen DNNs.

How do we generalize this to multiple hardware devices?

[Figure: the cost model now takes both a DNN representation and a HW representation as input and predicts latency.]

Key idea: a unique hardware representation for the cost model

Solution: Generalizable Cost Models

Challenges

  1. Collecting a large dataset across networks and devices
  2. Uniquely representing a hardware

Challenge: Collecting a large dataset

[Figure: data-collection pipeline in which an Android app runs 118 networks on 105 mobile devices and reports latencies to a central database.]

Visualizing the Data

A diverse set of DNNs ranging from 40-800 MFLOPs, spanning many operators and parameter counts

Significant diversity in devices, ranging from an 8-year-old Cortex-A53 to a 9-month-old Kryo 585

Challenge: Uniquely representing a hardware device


Do simple features work?

Can we use simple features such as core frequency, DRAM size, etc.?

Using static hardware features (CPU frequency, DRAM size, etc.) as the representation gives a cost model with poor predictive power: there is 2.5x latency variability across devices with the same frequency and DRAM size.

Unique fingerprint for a hardware device

[Figure: the cost model takes the DNN representation together with a hardware fingerprint, a vector of measured latencies (e.g., 100.4, 350.55, 59.6 ms) on a small set of networks, and predicts latency.]

Solution: Signature Set Representation

Represent a device by its latency on a set of DNNs, the signature set

Key Question: How do we choose the DNNs for the signature set?

We propose three methods to select the signature set from the set of 118 networks:

  1. Random Sampling (RS)
  2. Mutual Information based Sampling (MIS)
  3. Spearman Correlation Coefficient based Sampling (SCCS)

Solution: Selecting Signature Sets

We are interested in DNNs that help discriminate the hardware devices.
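A greedy sketch of one plausible Spearman-based selection (the exact SCCS criterion in the thesis may differ): pick networks whose device-wise latency rankings, measured on the training devices only, are least correlated with those already chosen. Here `latency` is a hypothetical (networks x training-devices) matrix:

```python
import numpy as np
from scipy.stats import spearmanr

def select_signature_set(latency, k=8):
    """latency: (n_networks, n_train_devices) measured latencies.
    Start from the network with the highest latency variance across devices,
    then repeatedly add the network whose device-wise latency ranking is
    least Spearman-correlated with the networks already selected."""
    chosen = [int(np.argmax(latency.var(axis=1)))]
    while len(chosen) < k:
        best, best_score = None, np.inf
        for i in range(latency.shape[0]):
            if i in chosen:
                continue
            # worst-case rank correlation with the already-chosen networks
            score = max(abs(spearmanr(latency[i], latency[j]).correlation)
                        for j in chosen)
            if score < best_score:
                best, best_score = i, score
        chosen.append(best)
    return chosen
```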

Experimental Methodology

  • Model: XGBoost regressor
  • Metric: \( R^2 \) (coefficient of determination)
  • 70% of devices form the training set, 30% the test set
  • Only the training set participates in signature-set selection
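A minimal sketch of this setup, assuming each training row concatenates per-network features with the device's signature-set latencies; the array names and hyperparameters are illustrative, not the thesis's exact configuration:

```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import r2_score

def train_cost_model(dnn_feats, sig_latency, latency, train_devs, test_devs):
    """dnn_feats:   (n_networks, d) per-network features (ops, FLOPs, params, ...)
    sig_latency:    (n_devices, s) each device's latencies on the signature set
    latency:        (n_devices, n_networks) measured latencies to predict"""
    n_networks = dnn_feats.shape[0]

    def rows(devices):
        X = [np.concatenate([dnn_feats[n], sig_latency[d]])
             for d in devices for n in range(n_networks)]
        y = [latency[d, n] for d in devices for n in range(n_networks)]
        return np.array(X), np.array(y)

    X_tr, y_tr = rows(train_devs)        # 70% of devices
    X_te, y_te = rows(test_devs)         # held-out 30% of devices
    model = xgb.XGBRegressor(n_estimators=500, max_depth=6, learning_rate=0.05)
    model.fit(X_tr, y_tr)
    return model, r2_score(y_te, model.predict(X_te))   # R^2 on unseen devices
```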

Results: Performance of Cost Models

RS: \( R^2 \) = 0.9125

SCCS: \( R^2 \) = 0.944

MIS: \( R^2 \) = 0.943

The cost model generalizes well to unseen mobile devices

Interesting Questions:

  1. Does Random Sampling always perform well? 
  2. What is the optimal number of networks in the signature set?

Results: Consistency of Random Sampling

  • Performs competitively well on average (mean \( R^2 \) = 0.927)
  • Outlier samples perform poorly

[Chart: distribution of RS \( R^2 \) values across samples, with MIS and SCCS shown for reference.]

Results: Optimal number of networks in Signature Set

  • Ideal size: 5-10 networks (4-8% of all networks)
  • A very small signature set suffices to describe a hardware device effectively

Application in practice: Collaborative Workload Characterisation

[Figure: collaborative learning and inference. Before a device contributes measurements, the model mispredicts (predicted 70 ms vs. actual 25 ms); after learning from small contributions, predictions are accurate (predicted 24.5 ms vs. actual 25 ms).]

Results: Simulation with 50 devices in collaboration

Even for 10% contribution from each device, the model shows superior \( R^2 \) values


Collaborative \( R^2 \): 0.98

11x lower characterisation cost for collaborative model compared to the unique model

Extremely small contributions from many people \( \implies \) accurate generalizable cost models 

Summary

  • Characterizing latency of DNNs on hardware devices is challenging due to network and hardware diversity
  • We need accurate cost models to effectively characterize networks on devices
  • We proposed a superior cost model that generalizes across both networks and mobile devices on a real-world dataset 
  • The model utilizes a novel and easy-to-obtain representation of devices using latencies on a signature set of DNNs
  • We showed a practical setting of building such a generalizable cost model in a collaborative manner
  • We believe such a cost model will significantly reduce the computational and environmental overhead in characterizing DNNs on devices


Our thesis

Efficient Deep Learning

  • Efficient Systems: SparseCache
  • Efficient Design Methodology: Generalizable cost models
  • Efficient Neural Networks: FuSeConv (DATE 2021)

Motivation: Accelerators and Efficient Operators

Systolic arrays and depthwise-separable convolutions are ubiquitous for efficient inference 

25-29x higher performance than GPUs 

12/12 efficient networks use depthwise-separable convolutions

Source: Han Cai et al., Once For All, ICLR 2019

Source: TPU, cloud.google.com

Motivation: The combination of both is inefficient

Comparing MobileNet-V2 with ResNet-50

Incommensurate scaling

> Why are depth-wise (DW) convolutions inefficient on systolic arrays?

> What can we do to make DW convolutions faster on systolic arrays?

> How good is the inference performance of our proposed solution?

Background

Standard Convolution

Depthwise Separable Convolution = Depthwise Conv + Pointwise Conv

\( (N \times M \times C \times K^2  \times C_{out}) \)

\( (N \times M \times C \times (K^2 + C_{out})) \)

Up to \( K^2 \) times reduction in computations
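For concreteness, the ratio of the two operation counts above, with a worked value (K = 3, \( C_{out} \) = 256 chosen purely for illustration):

\[
\frac{N\,M\,C\,K^2\,C_{out}}{N\,M\,C\,(K^2 + C_{out})} \;=\; \frac{K^2\,C_{out}}{K^2 + C_{out}} \;\approx\; K^2 \quad \text{when } C_{out} \gg K^2 .
\]

\[
\text{For } K = 3,\; C_{out} = 256:\quad \frac{9 \cdot 256}{9 + 256} = \frac{2304}{265} \approx 8.7\times \quad (\text{close to the } K^2 = 9 \text{ bound}).
\]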

Insight: Depthwise convolutions poorly utilize systolic arrays

Solution: FuSeConv

Fully Separable Convolution (FuSeConv): composed of 1D depthwise convolutions and a pointwise convolution (a PyTorch-style sketch follows below)

[Figure: a depthwise separable convolution compared with the Full (D = 1) and Half (D = 2) FuSeConv variants.]
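The exact full/half channel arrangement is defined in the paper; as a rough illustration only, here is a PyTorch-style sketch that applies both a 1 x K and a K x 1 depthwise convolution to every channel and fuses them with a pointwise convolution (module name and channel layout are assumptions):

```python
import torch
import torch.nn as nn

class FuSeConvBlock(nn.Module):
    """Sketch of a fully-separable convolution block: row-wise and
    column-wise 1D depthwise convolutions in parallel, concatenated and
    fused by a 1x1 pointwise convolution."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        # 1 x K depthwise convolution along rows (one filter per channel)
        self.dw_row = nn.Conv2d(in_ch, in_ch, kernel_size=(1, k),
                                padding=(0, k // 2), groups=in_ch, bias=False)
        # K x 1 depthwise convolution along columns
        self.dw_col = nn.Conv2d(in_ch, in_ch, kernel_size=(k, 1),
                                padding=(k // 2, 0), groups=in_ch, bias=False)
        # 1 x 1 pointwise convolution fuses the 2*in_ch feature maps
        self.pw = nn.Conv2d(2 * in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pw(torch.cat([self.dw_row(x), self.dw_col(x)], dim=1))

# Example: a drop-in block with 32 input and 64 output channels
y = FuSeConvBlock(32, 64, k=3)(torch.randn(1, 32, 56, 56))
print(y.shape)  # torch.Size([1, 64, 56, 56])
```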

Solution: FuSeConv

Execute independent 1D convolutions

On a 32 x 32 array:

Area   4.35%
Power  2.25%

Solution: FuSeConv Mapping

All PEs are fully utilized, thanks to FuSeConv's 1D convolutions and the modified dataflow

Experimental Methodology

> Evaluate 4 FuSe variants and compare with baseline

> Latency simulator based on SCALE-SIM [1] from ARM and Georgia Tech

> MobileNets (V1, V2, V3-Small, V3-Large) and MnasNet-B1 

  • Full (D = 1) FuSe: all layers replaced with FuSeConv
  • Half (D = 2) FuSe: all layers replaced with FuSeConv
  • 50% Full FuSe: half of the layers chosen greedily to maximize per-layer speedup
  • 50% Half FuSe: half of the layers chosen greedily to maximize per-layer speedup

[1] Samajdar et al., SCALE-Sim: Systolic CNN Accelerator Simulator, ISPASS 2020

Results

Average accuracy drop of ~1% in Half variants, and ~0.3% in Full variants

Up to 7.23x improvement in execution time

Summary

  • Widespread solutions for efficient inference are hardware accelerators and efficient DNN operators
  • However, depthwise-separable convolutions are inefficient on systolic arrays and lack the data reuse needed to exploit parallelism
  • We propose FuSeConv (fully separable 1D convolutions) as a drop-in replacement for depthwise-separable convolutions
  • To improve utilization, we also propose a modified dataflow for systolic arrays
  • Our solution is 3-7x faster than efficient mobile networks such as MobileNets and MnasNet, with negligible accuracy drop

Conclusion

Efficient Deep Learning

  • Efficient Systems: SparseCache
  • Efficient Design Methodology: Generalizable cost models
  • Efficient Neural Networks: FuSeConv

Future Work

  • Efficient DNN Training
  • Efficient Operators
  • Efficient Transformers

Thank You

 

Questions?

 

Backup Slides

Visualizing the Data

[Chart: per-device latency distributions over the 118 networks, clustering into fast devices (~50 ms mean latency), medium-fast devices (~115 ms), and slow devices (~235 ms).]

Systolic Algorithms

> A class of algorithms that run efficiently on systolic architectures

> Computational loops are transformed into a Regular Iterative Algorithm (RIA)

> Only RIAs with constant offsets can be synthesized on systolic arrays

Mapping Systolic Algorithms to Systolic Arrays

> The i, j indices map to the spatial dimensions of the array

> The k index maps to the temporal dimension

[Figure: a systolic array with spatial dimensions Dim1 and Dim2, and time as the third axis.]

What about 2D Convolutions? 

> Non-Constant Offset indices in RIA

> 2D Convolution is not a Systolic Algorithm

Then how are 2D Convolutions mapped onto Systolic Arrays?

After im2col transformation

Under-utilization

Then, how do we make them efficient on systolic arrays?

> More filters -> Data reuse -> High Utilization

> Extendable for convolution with channels too

2D Convolution with Multiple Filters
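To make the GEMM mapping concrete, a small NumPy sketch of im2col with multiple filters; the array names and sizes are illustrative, and this computes the cross-correlation form used by DNN frameworks:

```python
import numpy as np

def im2col(x, k):
    """Unfold a (C, H, W) input into a (C*k*k, H_out*W_out) matrix so that a
    2D convolution becomes a single matrix multiplication (GEMM), which is
    what a systolic array natively executes."""
    C, H, W = x.shape
    Ho, Wo = H - k + 1, W - k + 1
    cols = np.empty((C * k * k, Ho * Wo))
    for i in range(Ho):
        for j in range(Wo):
            cols[:, i * Wo + j] = x[:, i:i + k, j:j + k].ravel()
    return cols

# 2D convolution as GEMM: one weight-matrix row per filter,
# one im2col column per output pixel
x = np.random.randn(8, 16, 16)            # C = 8 channels
w = np.random.randn(4, 8, 3, 3)           # 4 filters of size 3x3
out = w.reshape(4, -1) @ im2col(x, 3)     # shape (4, 14*14)
print(out.shape)
```

With a single filter the weight matrix collapses to one row, so most of the multiplier array sits idle; adding filters adds rows and restores the data reuse and utilization the slide refers to.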

FuSeConv is a systolic algorithm!

> FuSeConv is composed of only 1D convolutions:

> 1 x K and K x 1 depthwise convolutions

> 1 x 1 pointwise convolution (a matrix multiplication)

> Both matrix multiplication and 1D convolutions have constant-offset RIAs => systolic algorithms
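A hedged sketch of the underlying recurrences, written in the usual textbook RIA form (which may differ cosmetically from the thesis's notation):

\[
\text{Matrix multiplication:}\quad a(i,j,k) = a(i,\,j-1,\,k),\quad b(i,j,k) = b(i-1,\,j,\,k),\quad c(i,j,k) = c(i,\,j,\,k-1) + a(i,j,k)\,b(i,j,k)
\]

\[
\text{1D convolution } y(i) = \sum_k w(k)\,x(i+k):\quad y(i,k) = y(i,\,k-1) + w(k)\,\tilde{x}(i,k),\qquad \tilde{x}(i,k) = \tilde{x}(i+1,\,k-1)
\]

All index offsets are constants, so both admit a direct systolic mapping. For a 2D convolution \( y(i,j) = \sum_{p,q} w(p,q)\,x(i+p,\,j+q) \), the input index couples two loop variables per dimension, giving the non-constant offsets noted above, which is why 2D convolutions are mapped via im2col instead.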

Efficiency of FuSeConv on our Proposed Hardware

[Figure: the 1D convolutions of different channels (Channel 0, Channel 1, ...) occupy different positions along dimension 2 of the array.]

> Channel-wise 1D convolutions can be computed in parallel

> FuSeConv plus our dataflow mapping exploits parallelism that is absent when depthwise convolutions are mapped natively onto systolic arrays

> Multiple channels can execute in parallel along dimension 2

Evaluation

Operator-wise Latency Distribution of Full FuSe Variants

Layer-wise Speedup of MobileNet-V2, Full FuSe Variant

More of the speedup comes from the initial layers

The dominant contributor to inference latency shifts from depthwise to pointwise convolutions
