Gokulan R - CS15B033

Prof. Pratyush Kumar

30 May 2020

ShaktiMAAN\({}^1\)

An open-source DNN Accelerator

\({}^1\) Shakti Multiplier-Accumulate Accelerator Network

  1. Background
  2. Contributions
    1. ISA
    2. Microarchitecture
    3. Design Space Exploration
  3. Custom Compiler Flow
  4. Conclusion
 

Overview

[Diagram: a single neuron with inputs \(x_1, x_2, x_3\), weights \(w_1, w_2, w_3\), and bias \(b\)]

$$ y = \sigma\Big(\sum_{i} w_i \cdot x_i + b\Big) $$

  1. Inputs
  2. Weights
  3. Bias
  4. Activation Function
  5. Output
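A single neuron therefore reduces to one dot product. A minimal NumPy sketch (assuming a sigmoid for \(\sigma\); values are illustrative):

import numpy as np

def neuron(x, w, b):
    # y = sigma(sum_i w_i * x_i + b), with sigma assumed to be a sigmoid
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    return sigmoid(np.dot(w, x) + b)

x = np.array([1.0, -2.0, 0.5])   # inputs x_1..x_3
w = np.array([0.3, 0.1, -0.4])   # weights w_1..w_3
print(neuron(x, w, b=0.2))       # output y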

Building block - a neuron

[Diagram: a two-layered network with input layer \(x_1, x_2, x_3\), hidden layer \(h_1, \ldots, h_5\), and output layer \(y_1, y_2\); edge weights labelled \(w_{111}, w_{153}, w_{211}, w_{225}\)]

Two-Layered NN

$$ \begin{bmatrix} x_1 & \cdots & x_m & 1 \end{bmatrix} \begin{bmatrix} w_{11} & \cdots & w_{1n} \\ \vdots & \ddots & \vdots \\ w_{m1} & \cdots & w_{mn} \\ b_1 & \cdots & b_n \end{bmatrix} = \begin{bmatrix} y_1 & \cdots & y_n \end{bmatrix} $$

Input of size m, output of size n

Output computed as vector-matrix multiplication

Fully Connected layer (FC)

The input vector (transposed, with a trailing 1) is multiplied by the weight matrix (with the bias row appended) to produce the output vector (transposed).
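A NumPy sketch of this layout (illustrative sizes): appending a 1 to the input and the bias row to the weight matrix turns the whole layer into a single vector-matrix product:

import numpy as np

m, n = 4, 3                          # input size m, output size n
W = np.random.randn(m, n)            # weight matrix [w_ij]
b = np.random.randn(n)               # bias row [b_1 ... b_n]
x = np.random.randn(m)               # input vector

W_aug = np.vstack([W, b])            # append bias row to weights
x_aug = np.concatenate([x, [1.0]])   # append 1 to input
y = x_aug @ W_aug                    # one vector-matrix multiplication
assert np.allclose(y, x @ W + b)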

Convolution layer (CONV)

A simple way to subsample feature maps

$$ \text{Input: } \begin{bmatrix} 1 & 4 & 2 & -20 \\ -2 & 7 & 31 & 11 \\ 41 & -8 & -6 & 0 \\ 0 & 3 & -11 & -1 \end{bmatrix} $$

Max Pooling, \(2 \times 2\), stride 2:

$$ \begin{bmatrix} 7 & 31 \\ 41 & 0 \end{bmatrix} $$

Average Pooling, \(2 \times 2\), stride 2 (averages rounded down):

$$ \begin{bmatrix} 2 & 6 \\ 9 & -5 \end{bmatrix} $$
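Both operators on the 4x4 example above, as a NumPy sketch (illustrative; the slide's averages round down, hence np.floor):

import numpy as np

x = np.array([[ 1,   4,   2, -20],
              [-2,   7,  31,  11],
              [41,  -8,  -6,   0],
              [ 0,   3, -11,  -1]])

def pool2x2(x, reduce):                      # 2 x 2 window, stride 2
    h, w = x.shape
    out = np.empty((h // 2, w // 2))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            out[i // 2, j // 2] = reduce(x[i:i+2, j:j+2])
    return out

print(pool2x2(x, np.max))                    # [[ 7. 31.] [41.  0.]]
print(np.floor(pool2x2(x, np.mean)))         # [[ 2.  6.] [ 9. -5.]]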

Pooling layers

Deep Neural Networks

AlexNet (2012) - Breakthrough on the ImageNet dataset

Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks

DNN Workload

  • Workloads across domains (vision, NLP, statistical ML) share a common operation: matrix multiplication
  • Goal: Build a fast matrix multiplier!
  • General Matrix Multiplication (GEMM) algorithm
    • Embarrassingly parallel
    • Offers significant data reuse
    • Highly ordered computation - across parallel units, the smallest unit of computation is the same
  • Improves compute throughput without increasing memory bandwidth
  • Used in signal processing, polynomial operations
  • Smallest unit: Processing Element (PE)
  • Scalability achieved by replicating PEs across different dimensions
  • Reuse achieved by communication between adjacent PEs

H. T. Kung, Why Systolic Architectures?

Systolic Array

$$ C = A \times B$$

  • Columns of B are loaded into columns of systolic array
  • Rows of \( A^T\) are sent into rows of systolic array
  • Partial sums (rows of C) flow along the columns of systolic array

Image: Samajdar et al., SCALE-Sim: Systolic CNN Accelerator Simulator
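A behavioural NumPy sketch of this dataflow (illustrative, not the RTL): each PE holds one element of B, inputs stream past the resident weights, and partial sums accumulate down the columns. The skewed systolic timing changes when each product happens, not what is computed:

import numpy as np

def systolic_gemm(A, B):
    # C = A x B on a weight-stationary array: PE (i, j) holds B[i, j],
    # row i receives column i of A over time, column j accumulates C[:, j].
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i in range(k):          # array row (one input channel per row)
        for j in range(m):      # array column (one output per column)
            C[:, j] += A[:, i] * B[i, j]
    return C

A = np.random.randn(5, 4)
B = np.random.randn(4, 3)
assert np.allclose(systolic_gemm(A, B), A @ B)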

Systolic Array

  1. Background
  2. Contributions
    1. ISA
    2. Microarchitecture
    3. Design Space Exploration
  3. Custom Compiler Flow
  4. Conclusion
 

Overview

Three classes of design choices:

  • Paradigm - human-made choices
    • ISA-level choices
    • Mapping GEMM to the systolic array
    • Double buffering
    • Instruction fetch
  • Synthesis - exploration performed using a task-level simulator
    • Array dimension
    • Buffer sizes
    • Queue sizes
    • Buffer organisation
  • Dynamic - compile-time exploration by the compiler
    • Tiling configuration
    • Loop ordering
    • Instruction scheduling

Design Space Exploration

LOAD/STORE

  • Load/Store a 3D slice of an n-D matrix
  • LOAD reads from DRAM and writes to on-chip SRAM; STORE does the reverse
  • The slice can be discontinuous in DRAM, but is contiguous in SRAM
  • A LOAD of a \(4 \times 8 \times 8 \times 512\) tensor can be decomposed as:
    • 1 \(\times \) LOAD(4, 8, 8, 512)
    • 4 \(\times\) LOAD(8, 8, 512)
    • \(4 \times 8\) \(\times\) LOAD(8, 512)

LOAD/STORE

  • DRAM base address, SRAM base address
  • Z_SIZE, Y_SIZE, X_SIZE
  • Z_STRIDE, Y_STRIDE
  • RESET
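A sketch of the copy these fields describe (illustrative Python; we assume the innermost X dimension is dense in DRAM, so only the outer Z and Y dimensions carry strides):

import numpy as np

def load_3d(dram, sram, dram_base, sram_base,
            z_size, y_size, x_size, z_stride, y_stride):
    # Copy a Z_SIZE x Y_SIZE x X_SIZE slice that may be discontinuous
    # in DRAM into contiguous SRAM, one dense X-row at a time.
    dst = sram_base
    for z in range(z_size):
        for y in range(y_size):
            src = dram_base + z * z_stride + y * y_stride
            sram[dst:dst + x_size] = dram[src:src + x_size]
            dst += x_size

dram = np.arange(1000)
sram = np.zeros(2 * 4 * 8)
load_3d(dram, sram, dram_base=100, sram_base=0,
        z_size=2, y_size=4, x_size=8, z_stride=256, y_stride=32)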

GEMM

  • Weight load phase: load weights onto the PEs
  • Compute phase
    • Read inputs and send them along the rows
    • Read previous outputs (partial sums) and send them down the columns from the top
    • Collect outputs from the columns and store them back in the buffer

Image: Samajdar et al., SCALE-Sim: Systolic CNN Accelerator Simulator

GEMM

  • input \(1 \times 8 \times 8 \times 512\)
  • weight \(256 \times 3 \times 3 \times 512 \)
  • output \(1 \times 8 \times 8 \times 256 \)
  • systolic array: \(64 \times 64\)
  • 64 different filters across different columns
  • 64 different channels across different rows
  • \( \frac{256}{64} \times \frac{512}{64} \times 3 \times 3 \) GEMM instructions

Image: Samajdar et al., SCALE-Sim: Systolic CNN Accelerator Simulator
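Worked out, this tiling issues

$$ \frac{256}{64} \times \frac{512}{64} \times 3 \times 3 = 4 \times 8 \times 9 = 288 $$

GEMM instructions, each reusing a \(64 \times 64\) tile of weights resident in the array.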

GEMM

 
  • {input, weight, output} base address
  • output {height, width}
  • Stride {X, Y}, Padding {Top, Left, Right, Bottom}
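As a sketch, these fields could be grouped into a plain record (hypothetical names and types; the actual ISA fixes the encoding and bit widths):

from dataclasses import dataclass

@dataclass
class GemmInstruction:
    input_base: int     # SRAM base address of the input tile
    weight_base: int    # SRAM base address of the weight tile
    output_base: int    # SRAM base address of the output tile
    out_height: int
    out_width: int
    stride_x: int
    stride_y: int
    pad_top: int
    pad_left: int
    pad_right: int
    pad_bottom: int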

Tensor ALU

  • Performs a windowed reduction over output feature maps
  • Similar to convolution, but without weights
  • The operation is vectorised across channels
  • ReLU can be mapped using R = S = 1
  • A \( k \times m\) maxpool can be mapped using R = k and S = m
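A behavioural NumPy sketch of the windowed reduction (illustrative; R and S are the window height and width). A \(1 \times 1\) max window is the identity, so for ReLU we assume the opcode also compares against zero:

import numpy as np

def tensor_alu(fmap, R, S, stride, op=np.max):
    # Windowed reduction over an H x W x C map, vectorised across channels C.
    H, W, C = fmap.shape
    oh, ow = (H - R) // stride + 1, (W - S) // stride + 1
    out = np.empty((oh, ow, C))
    for i in range(oh):
        for j in range(ow):
            window = fmap[i*stride:i*stride + R, j*stride:j*stride + S, :]
            out[i, j, :] = op(window, axis=(0, 1))
    return out

fmap = np.random.randn(8, 8, 16)
pooled = tensor_alu(fmap, R=2, S=2, stride=2)               # 2x2 maxpool
relu = np.maximum(tensor_alu(fmap, R=1, S=1, stride=1), 0)  # ReLU via R = S = 1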

Tensor ALU

  • {input, output} base address
  • ALU opcode
  • {height, width} of {output, window}
  • stride {R, S, OW}

Microarchitecture

Dependency Resolution

$$ C = A \times B $$

  1. LOAD A
  2. LOAD B - push next
  3. GEMM: C = A*B - pop prev, push next
  4. STORE C - pop prev

  • The dependency module ensures that instructions are dispatched for execution only after their dependencies are met
  • Dependency flags (push next / pop prev) are inserted by the compiler at compile time
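The same discipline, sketched with Python queues (illustrative; in hardware the tokens travel through FIFOs between adjacent modules, and a stage blocks until the token it pops arrives):

import queue, threading

load_to_gemm = queue.Queue()    # "next" queue of the load module
gemm_to_store = queue.Queue()   # "next" queue of the GEMM module

def load_stage():
    print("LOAD A"); print("LOAD B")
    load_to_gemm.put("token")        # push next: operands are in SRAM

def gemm_stage():
    load_to_gemm.get()               # pop prev: wait for both LOADs
    print("GEMM: C = A*B")
    gemm_to_store.put("token")       # push next: C is in SRAM

def store_stage():
    gemm_to_store.get()              # pop prev: wait for the GEMM
    print("STORE C")

# Start the stages out of order; the tokens still serialise them correctly.
for stage in (store_stage, gemm_stage, load_stage):
    threading.Thread(target=stage).start()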

Compiling model to accelerator

  • Relay IR: maps models from multiple frameworks to a single IR
  • TVM IR: performs optimizations and schedule exploration; portable across hardware
  • JIT Compiler Runtime: generates an accelerator-specific binary from TVM IR

Chen et al., TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
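A hedged sketch of this flow with TVM's Python API, importing an ONNX model (the file name and input name are placeholders, and the stock "llvm" CPU target stands in for the accelerator's work-in-progress JIT backend):

import onnx
import tvm
from tvm import relay

# Relay IR: import a model from any supported framework (ONNX here).
model = onnx.load("model.onnx")                 # placeholder path
mod, params = relay.frontend.from_onnx(
    model, shape={"input": (1, 8, 8, 512)})     # placeholder input name

# TVM IR: optimize and lower; the real flow would emit the
# accelerator binary via its JIT runtime instead of "llvm".
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm", params=params)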

Custom Compiler - inspired by TVM

Model to Accelerator ISA

  • input \(1 \times 8 \times 8 \times 512\)
  • weight \(256 \times 3 \times 3 \times 512 \)
  • output \(1 \times 8 \times 8 \times 256 \)
  • systolic array: \(64 \times 64\)
for i = 1 to 256/64           # filter tiles
  for j = 1 to 512/64         # channel tiles
    LOAD(input, j)
    for l = 1 to 3, for m = 1 to 3
      LOAD(weight, i, j, l, m)
      GEMM(input', weight', output')
  ALU(output, i)
  STORE(output, i)
  • 64 filters across different columns
  • 64 channels across different rows
  • \( \frac{256}{64} \times \frac{512}{64} \times 3 \times 3 \) GEMM instructions

Task-level Simulator

 
  • Functional simulator, not cycle-accurate
  • Provides an estimate of execution time for a given instruction trace on a given accelerator configuration
  • Input: instruction trace; Output: execution summary
    • Execution time of the entire trace, instruction-level logs
    • Utilisation of components, module-level logs
  • Uses
    • Interfaces with the TVM compiler for schedule exploration
    • Analyses bottlenecks to refine the accelerator configuration
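A minimal sketch of such a trace-driven estimator (hypothetical cost model and throughput parameters, purely to illustrate the idea):

def simulate(trace, config):
    # trace: list of (opcode, elements) pairs; config: elements/cycle per module.
    busy = {"load": 0.0, "store": 0.0, "gemm": 0.0, "alu": 0.0}
    for opcode, elems in trace:
        busy[opcode] += elems / config[opcode]
    total = sum(busy.values())                  # serialised upper bound
    utilisation = {op: t / total for op, t in busy.items()}
    return total, utilisation

config = {"load": 16, "store": 16, "gemm": 64 * 64, "alu": 16}
trace = [("load", 8 * 8 * 512), ("gemm", 8 * 8 * 512 * 64),
         ("alu", 8 * 8 * 64), ("store", 8 * 8 * 64)]
print(simulate(trace, config))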

Status and Future Work

 
Module                Status
fetch-decode          Completed
dependency resolver   Completed
load                  Final stages
store                 Final stages
GEMM (16x16)          Completed
ALU (vec_size=16)     Completed
Custom compiler       Work-in-progress
Task-level simulator  Work-in-progress
  • Future directions
    • cycle-accurate simulator
    • explore big.LITTLE systolic arrays

FPGA Synthesis Results

 
Module                LUTs    FIFOs
fetch-decode          823     1317
dependency resolver   1427    858
load                  *       *
store                 *       *
GEMM (16x16)          90464   0
ALU (vec_size=16)     1280    0

*Work-in-progress

  • RTL in Bluespec System Verilog (BSV)
  • Synthesis using Xilinx Vivado v2018
  • Target FPGA: Xilinx Artix 7

Summary

 
  • ShaktiMAAN: an open-source accelerator for DNNs
  • Matrix multiplication is performed by a systolic array
  • Vector ALU performs activation and pooling functions
  • TVM-based compiler to execute models from multiple frameworks on the accelerator
  • Design space exploration of various hardware choices
  • Task-level simulator for
    • Interfacing with the TVM compiler
    • Optimizing the accelerator configuration

Acknowledgements - The Team

Vinod Ganesan

Neel Gala

Arjun Menon

Mohan Prasath

Rohan Kaulgekar

Sadhana

Sujay Pandit

Surya Selvam

Anand Uday Gokhale

Nidesh

Sundar Raman

Shilpa

Selvaraj

Rishabh Jain

Thank You!
