FuSeConv: Fully Separable Convolutions for Fast Inference in Systolic Arrays

Surya Selvam\( ^{*\#} \), Vinod Ganesan\( ^{*} \), and Pratyush Kumar\( ^{*} \)

\( ^{*} \) Department of Computer Science and Engineering, IIT Madras

\( ^{\#} \) School of Electrical and Computer Engineering, Purdue University

Efficient DNN Inference on Hardware is still a challenge

  • DNNs achieve SOTA on various tasks

[Figure: Compute demand of SOTA DNNs grows ~10x/year, far outpacing Moore's law; there is a huge demand gap. Source: OpenAI Blog]

Solution

Domain-Specific Accelerators and Efficient DNN operators

Efficient Inference: Solutions

Domain-Specific Hardware Accelerators

Systolic Arrays in TPUs: 25-29x more performant than GPUs

Efficient DNN Operator

Depthwise Separable Convolutions
Computationally efficient and ubiquitous

Source: Once For All

Surprisingly, the composition of efficient solutions is inefficient

Comparing MobileNet-V2 with ResNet-50: incommensurate scaling

> Why are depthwise convolutions inefficient on systolic arrays?

> What can we do to make depthwise convolutions faster on systolic arrays?

> How good is the inference performance of our proposed solution?

In this work

> Why are depthwise convolutions inefficient on systolic arrays?

Formal analysis using systolic algorithms

> What can we do to make depthwise convolutions faster on systolic arrays?

FuSeConv: Fully Separable 1D Convolutions, our hardware/software co-design solution

> How good is the inference performance of our proposed solution?

FuSeConv is 3x-7x faster on 64x64 systolic arrays with minimal overhead

Terminology Recap

For an \( N \times M \) output with a \( K \times K \) kernel:

Standard Convolution

FLOPs \( = N \times M \times K \times K \times C_{in} \times C_{out} \)

Depthwise Separable Convolution = Depthwise Conv + Pointwise Conv

FLOPs \( = N \times M \times C_{in} \times K \times K \;+\; N \times M \times C_{in} \times C_{out} \)
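To make the comparison concrete, here is a minimal PyTorch sketch of the two operators; the layer sizes are illustrative, not taken from the slides:

```python
import torch
import torch.nn as nn

# Illustrative sizes: C_in/C_out channels, K x K kernel, N x N output
C_in, C_out, K, N = 32, 64, 3, 56

standard = nn.Conv2d(C_in, C_out, K, padding=K // 2)

depthwise_separable = nn.Sequential(
    # Depthwise: one K x K filter per input channel (groups=C_in)
    nn.Conv2d(C_in, C_in, K, padding=K // 2, groups=C_in),
    # Pointwise: 1 x 1 convolution that mixes channels
    nn.Conv2d(C_in, C_out, 1),
)

x = torch.randn(1, C_in, N, N)
assert standard(x).shape == depthwise_separable(x).shape

# FLOPs per the formulas above (counting multiply-accumulates)
flops_std = N * N * K * K * C_in * C_out
flops_sep = N * N * C_in * K * K + N * N * C_in * C_out
print(flops_std / flops_sep)  # ~7.9x fewer FLOPs for these sizes
```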

Systolic Algorithms

> A class of algorithms that run efficiently on systolic architectures

> Computational loops are transformed into a Regular Iterative Algorithm (RIA)

> Only RIAs with constant offsets can be synthesized onto systolic arrays
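For example, matrix multiplication \( C = AB \) can be written as an RIA with constant offsets (a standard textbook form; the notation here is ours):

\[
\begin{aligned}
a(i, j, k) &= a(i, j-1, k) \\
b(i, j, k) &= b(i-1, j, k) \\
c(i, j, k) &= c(i, j, k-1) + a(i, j, k)\, b(i, j, k)
\end{aligned}
\]

Every index on the right-hand side differs from the left by a constant offset, so matrix multiplication synthesizes directly onto a systolic array.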

Mapping Systolic Algorithms to Systolic Arrays

> The i, j indices map to the spatial dimensions of the array

> The k index maps to the temporal dimension


What about 2D Convolutions?

> The RIA for 2D convolution has non-constant offset indices

> Hence, 2D convolution is not a systolic algorithm
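To see the contrast, write a single-channel \( K \times K \) convolution as an RIA over a linearized filter index \( k \) (notation ours):

\[
y(i, j, k) = y(i, j, k-1) + x\big(i + \lfloor k / K \rfloor,\; j + k \bmod K\big)\, w(k)
\]

The offsets into \( x \) depend on the iteration index \( k \) itself, so they are not constant, and the computation cannot be synthesized directly onto the array.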

Then how are 2D convolutions mapped onto systolic arrays?

> Via the im2col transformation, which lowers convolution to matrix multiplication

> But this leads to under-utilization of the array
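A small NumPy sketch of the im2col idea (illustrative: single channel, stride 1, no padding):

```python
import numpy as np

def im2col(x, K):
    """Unroll every K x K patch of an H x W image into a column."""
    H, W = x.shape
    out_h, out_w = H - K + 1, W - K + 1
    cols = np.empty((K * K, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[i:i + K, j:j + K].ravel()
    return cols

x = np.random.randn(6, 6)
w = np.random.randn(3, 3)
# Convolution becomes a matrix multiply on the unrolled patches
y = (w.ravel() @ im2col(x, 3)).reshape(4, 4)
```

With a single filter, the weight operand is just one row vector, so only a sliver of the array does useful work per pass: the under-utilization noted above.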

Then, how to make them efficient on systolic arrays?

> More filters -> data reuse -> high utilization (see the snippet below)

> Extends to convolutions with multiple channels too

2D Convolution with Multiple Filters
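Continuing the im2col sketch above: with F filters the weights become an F-row matrix, so each unrolled patch is reused across all F filters in one pass:

```python
# F filters stack into an (F, K*K) matrix; each im2col column is now
# reused F times, filling more of the array per pass
F = 8
W = np.random.randn(F, 3 * 3)
Y = (W @ im2col(x, 3)).reshape(F, 4, 4)  # one output map per filter
```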

Let's look at Depthwise Convolutions

> Equivalent to \( C_{in} \) independent single-channel 2D convolutions

> No filter reuse is available in depthwise convolution

Problem: Poor Utilization on Systolic Arrays

Each channel-wise 2D convolution executes sequentially!

FuSeConv: Our HW/SW Co-Design Solution

Fully Separable Convolution (FuSeConv): composed of 1D depthwise convolutions and a pointwise convolution

Depthwise Separable Convolution

Full Variant (D = 1)

Half Variant (D = 2)

FuSeConv is a systolic algorithm!

> FuSeConv is composed only of 1D operations:

> 1 x K and K x 1 depthwise convolutions

> 1 x 1 pointwise convolution, which is a matrix multiplication

> The RIAs of 1D convolutions have constant offsets => systolic algorithms
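A hedged PyTorch sketch of the depthwise part of a FuSeConv block, under one plausible reading of the variants (D = 1 full, D = 2 half); the exact channel bookkeeping may differ from the authors' implementation:

```python
import torch
import torch.nn as nn

class FuSeBlock(nn.Module):
    """Depthwise part of FuSeConv: parallel 1 x K and K x 1 depthwise
    convolutions; a 1 x 1 pointwise conv follows, as in the slide.
    D = 1 -> full variant, D = 2 -> half variant (our reading)."""
    def __init__(self, channels, K=3, D=1):
        super().__init__()
        c = channels // D  # channels seen by each 1D branch
        pad = K // 2
        self.row = nn.Conv2d(c, c, (1, K), padding=(0, pad), groups=c)
        self.col = nn.Conv2d(c, c, (K, 1), padding=(pad, 0), groups=c)
        self.D = D

    def forward(self, x):
        if self.D == 1:  # full: both branches see all channels
            return torch.cat([self.row(x), self.col(x)], dim=1)
        a, b = x.chunk(2, dim=1)  # half: split channels across branches
        return torch.cat([self.row(a), self.col(b)], dim=1)

x = torch.randn(1, 32, 56, 56)
print(FuSeBlock(32, D=1)(x).shape)  # full: (1, 64, 56, 56)
print(FuSeBlock(32, D=2)(x).shape)  # half: (1, 32, 56, 56)
```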

Proposed Hardware Architecture

The array is augmented to execute independent 1D convolutions in parallel

Overheads on a 32 x 32 systolic array:

Area: 4.35%
Power: 2.25%

FuSeConv Mapping: Illustration

Efficiency of FuSeConv on our Proposed Hardware

> Channel-wise 1D convolutions are computed in parallel (illustrated with Channel 0 and Channel 1)

> Multiple channels execute in parallel along the second array dimension

> FuSeConv plus our dataflow mapping exploits parallelism that is absent when depthwise convolutions are mapped natively onto systolic arrays

Evaluation

> Evaluate 4 FuSe variants and compare with baseline

> Analytical latency model based on SCALE-SIM

> MobileNets (V1, V2, V3-Small, V3-Large) and MnasNet-B1 

> Full (D = 1) FuSe: all depthwise layers replaced with FuSeConv

> Half (D = 2) FuSe: all depthwise layers replaced with FuSeConv

> 50% Full FuSe: half of the layers replaced, chosen greedily to maximize per-layer speedup

> 50% Half FuSe: as above, with the half variant

> Latency = Load Values + Compute + Communicate PSums + Flush Outputs (a sketch of such a model follows below)
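Below is a hedged sketch of what a SCALE-SIM-style analytical cycle model can look like for the matrix-multiply pieces; the function and its fill/drain accounting are illustrative assumptions, not the authors' exact model:

```python
from math import ceil

def matmul_cycles(M, K, N, R=64, C=64):
    """Rough cycles for an (M x K) @ (K x N) matmul on a weight-stationary
    R x C systolic array, folding when K > R or N > C (illustrative)."""
    folds = ceil(K / R) * ceil(N / C)
    # Per fold: preload weights (R cycles), stream M activation rows,
    # then flush partial sums across the C columns
    return folds * (R + M + C - 1)

# Example: a 1 x 1 pointwise convolution lowered to matmul,
# M = 56*56 spatial positions, K = 32 in-channels, N = 64 out-channels
print(matmul_cycles(M=56 * 56, K=32, N=64))
```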

Evaluation

| Network | Accuracy (%) | FLOPs (M) | Params (M) | Speedup |
| --- | --- | --- | --- | --- |
| MobileNet-V2 Baseline | 72 | 315 | 3.5 | 1x |
| MobileNet-V2 Full FuSe | 72.49 | 430 | 4.46 | 5.1x |
| MobileNet-V2 Half FuSe | 70.8 | 300 | 3.46 | 7.23x |
| MobileNet-V2 50% Full FuSe | 72.11 | 361 | 3.61 | 2.0x |
| MobileNet-V2 50% Half FuSe | 71.98 | 305 | 3.49 | 2.1x |

| Network | Accuracy (%) | FLOPs (M) | Params (M) | Speedup |
| --- | --- | --- | --- | --- |
| MobileNet-V3 Small Baseline | 67.4 | 66 | 2.93 | 1x |
| MobileNet-V3 Small Full FuSe | 67.17 | 84 | 4.44 | 3.02x |
| MobileNet-V3 Small Half FuSe | 64.55 | 61 | 2.89 | 4.16x |
| MobileNet-V3 Small 50% Full FuSe | 67.91 | 73 | 3.18 | 1.6x |
| MobileNet-V3 Small 50% Half FuSe | 66.9 | 63 | 2.92 | 1.68x |

Average accuracy drop: ~1% for the Half variants, ~0.3% for the Full variants

Evaluation

Inference Latency and Speedup on 64 x 64 Systolic Array

Speedup of the Full FuSe variants over baseline, across systolic array sizes

Multiple trade-off solutions

Depthwise convolutions scale poorly!

Newer networks scale differently

Evaluation

Operator-wise Latency Distribution of Full FuSe Variants

Layer-wise Speedup of the MobileNet-V2 Full FuSe Variant

More speedup in the initial layers

Inference latency shifts from being dominated by depthwise to pointwise convolutions

Summary

> Efficient inference is still a crucial problem for DNN deployment. Currently proposed solutions are hardware accelerators and efficient DNN operators

> However, depthwise separable convolutions are inefficient on systolic arrays and lack the data reuse needed to exploit parallelism

> We propose FuSeConv (Fully Separable 1D Convolutions) as a drop-in replacement for depthwise separable convolutions

> Our co-design solution is at least 4x faster than efficient mobile networks such as MobileNets and MnasNet, with negligible accuracy drop

> Motivates hardware-aware Neural Operator Search (NOS)

> We also propose a 1D systolic dataflow for efficiently executing FuSeConv layers

Questions?

Reach me at selvams@purdue.edu
