FuSeConv: Fully Separable Convolutions for Fast Inference in Systolic Arrays

* Department of Computer Science and Engineering, IIT Madras

# School of Electrical and Computer Engineering, Purdue University

Surya Selvam\( ^{*\#} \), Vinod Ganesan \( ^* \) and Pratyush Kumar\( ^* \)

Efficient DNN Inference on Hardware is still a challenge

DNNs achieve SOTA on various tasks

10x/year

Source: OpenAI Blog

Moore's law

There is a huge demand gap

Solution

Domain-Specific Accelerators and Efficient DNN operators

Efficient Inference: Solutions

Systolic Arrays in TPUs

25-29x more performant than GPUs

Domain-Specific Hardware Accelerators

Efficient DNN Operator

Depthwise Separable Convolutions
Computationally Efficient and ubiquitous

Source: Once For All

Surprisingly, the composition of efficient solutions is inefficient

Comparing MobileNet-V2 with ResNet-50

> Why are depthwise convolutions inefficient on systolic arrays?

> What can we do to make DW convolution faster on systolic arrays?

> How good is the inference performance of our proposed solution?

Incommensurate

Scaling

In this work

> Why are depthwise convolutions inefficient on systolic arrays?

Formal analysis using Systolic-Algorithms

FuSeConv: Fully Separable 1D Convolutions, our hardware/software co-design solution

FuSeConv is 3x-7x more faster on 64x64 Systolic Arrays with minimum overhead

> What can we do to make DW convolution faster on systolic arrays?

> How good is the inference performance of our proposed solution?

Terminologies Recap

Standard Convolution

Depthwise Separable Convolution = Depthwise Conv + Pointwise Conv

FLOPS = N x M x K x K x C\( _{in}\) x C\( _{out}\)

FLOPS = N x M x C\( _{in}\) x K x K + N x M x C\( _{in}\) x C\( _{out}\)

Systolic Algorithms

> A class of algorithms that runs efficiently on systolic architectures

> Computational loops are transformed into Regular Iterative Algorithm (RIA)

> RIAs that have constant offsets only can be synthesized on systolic arrays

Mapping Systolic Algorithms to Systolic Arrays

> i, j indices maps to Spatial dimensions of array

> k index maps to Temporal dimension

Time

Dim1

Dim2

What about 2D Convolutions?

> Non-Constant Offset indices in RIA

> 2D Convolution is not a Systolic Algorithm

Then how are 2D Convolutions mapped onto Systolic Arrays?

After im2col transformation

Under Utilization

Then, How to make them efficient on Systolic?

> More filters -> Data reuse -> High Utilization

> Extendable for convolution with channels too

2D Convolution with Multiple Filters

Lets look at Depthwise Convolutions

> Equivalent to C\( _{in}\) 2D single channel Convolutions

> No Filter Reuse availble in Depthwise Convolution

Problem: Poor Utilization on Systolic Arrays

Each channel-wise 2D convolution is sequential!!

FuSeConv: Our HW/SW Co-Design Solution

Fully Separable Convolution (FuSeConv) : Composes of 1D depthwise convolutions and pointwise convolution

Depthwise Separable Convolution

Full Variant (D = 1)

Half Variant (D = 2)

FuSeConv is a systolic algorithm !

> FuSeConv composes of only 1D convolutions.

> 1 x K & K x 1depthwise convolutions

> 1 x 1 Pointwise Convolution

Matrix Multiplication

1D Convolutions

> Constant Offset => Systolic Algorithm

Proposed Hardware Architecture

Execute independent 1D convolutions

On 32 x 32 Systolics

Area	4.35%
Power	2.25%

FuSeConv Mapping: Illustration

Efficiency of FuSeConv on our Proposed Hardware

Channel 0

Channel 1

> Channelwise 1D convolutions can be computed in parallel

> FuSeConv + our dataflow mapping exploits parallelism which were absent in depthwise convolutions mapping natively on Systolic Arrays

Multiple Channels can execute in parallel along dimension 2

Evaluation

> Evaluate 4 FuSe variants and compare with baseline

> Analytical latency model based on SCALE-SIM

> MobileNets (V1, V2, V3-Small, V3-Large) and MnasNet-B1

Choose half of layers greedily to maximize speedup of layer

> Full (D = 1) FuSe

> Half (D = 2) FuSe

> 50% Full FuSe

> 50% Half FuSe

> Latency = Load Values + Compute + Communicate PSums + Flush Outputs

All layers are replaced with FuSeConv

Evaluation

Network	Accuracy	FLOPS (M)	Params (M)	Speedup
MobileNet-V2 Baseline	72	315	3.5	1x
MobileNet-V2 Full FuSe	72.49	430	4.46	5.1x
MobileNet-V2 Half FuSe	70.8	300	3.46	7.23x
MobileNet-V2 50% Full FuSe	72.11	361	3.61	2.0x
MobileNet-V2 50% Half FuSe	71.98	305	3.49	2.1x

Network	Accuracy	FLOPS (M)	Params (M)	Speedup
MobileNet-V3 Small Baseline	67.4	66	2.93	1
MobileNet-V3 Small Full FuSe	67.17	84	4.44	3.02x
MobileNet-V3 Small Half FuSe	64.55	61	2.89	4.16x
MobileNet-V3 Small 50% Full FuSe	67.91	73	3.18	1.6x
MobileNet-V3 Small 50% Half FuSe	66.9	63	2.92	1.68x

Average drop ~1% in Half Variants, drop ~0.3% in Full Variants

Evaluation

Inference Latency and Speedup on 64 x 64 Systolic Array

Scaling up of Speedup of Full FuSe variant wrt baseline

Multiple Trade-Offs solutions

DW Conv scales poorly !

Newer networks scales differently

Evaluation

Operator-wise Latency Distribution of Full FuSe Variants

Layerwise Speedup of MobileNet-V2 Full FuSe Variant

More Speedup from Initial layers

Inference latency dominated by Depthwise -> Point-wise

Summary

> Efficient Inference is still a crucial problem for DNN deployment. Currently proposed solutions are Hardware Accelerators and Efficient DNN Operator

> However, Depthwise Separable Convolutions are inefficient on Systolic Arrays and lacks data reuse to exploit parallelism

> We propose FuSeConv (Fully Separable 1D Convolutions) as a drop-in replacement for Depthwise Separable Convolutions

> Our Co-Design solution is atleast 4X faster than efficient mobile networks such as MobileNets and MnasNet with negligible accuracy-drop

> Motivates hardware-aware Neural Operator Search (NOS)

> We also proposed a 1D systolic dataflow for efficiently executing FuSeConv layers

Questions?

Reach me at selvams@purdue.edu

FuseConv_DATE_2021

By Vinod Ganesan

FuseConv_DATE_2021

Slides for Paper FuSeConv Fully Separable Convolutions for Fast Inference on Systolic Arrays

FuSeConv: Fully Separable Convolutions for Fast Inference in Systolic Arrays

Efficient DNN Inference on Hardware is still a challenge

Efficient Inference: Solutions

Surprisingly, the composition of efficient solutions is inefficient

In this work

Terminologies Recap

Systolic Algorithms

Mapping Systolic Algorithms to Systolic Arrays

What about 2D Convolutions?

Then how are 2D Convolutions mapped onto Systolic Arrays?

Then, How to make them efficient on Systolic?

Lets look at Depthwise Convolutions

Problem: Poor Utilization on Systolic Arrays

FuSeConv: Our HW/SW Co-Design Solution

FuSeConv is a systolic algorithm !

Proposed Hardware Architecture

FuSeConv Mapping: Illustration

Efficiency of FuSeConv on our Proposed Hardware

Evaluation

Evaluation

Evaluation

Evaluation

Summary

Questions?

FuseConv_DATE_2021

More from Vinod Ganesan