Gokulan R - CS15B033

Prof. Pratyush Kumar

30 May 2020

ShaktiMAAN\({}^1\)

An open-source DNN Accelerator

\({}^1\) Shakti Multiplier-Accumulate Accelerator Network

  1. Background
  2. Contributions
    1. ISA
    2. Microarchitecture
    3. Design Space Exploration
  3. Custom Compiler Flow
  4. Conclusion
 

Overview

[Figure: a single neuron with inputs \(x_1, x_2, x_3\), weights \(w_1, w_2, w_3\), and bias \(b\)]

$$y = \sigma\left(\sum_{i} w_i \cdot x_i + b\right)$$

  1. Inputs
  2. Weights
  3. Bias
  4. Activation Function
  5. Output

Building block - a neuron
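A minimal sketch of this computation in Python (numpy assumed; the logistic sigmoid is one possible choice of \(\sigma\)):

import numpy as np

def neuron(x, w, b):
    # y = sigma(sum_i w_i * x_i + b), with sigma taken to be the logistic sigmoid
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

y = neuron(np.array([1.0, 2.0, 3.0]), np.array([0.5, -0.25, 0.1]), 0.2)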

[Figure: fully connected network with input layer \(x_1, x_2, x_3\), hidden layer \(h_1, \dots, h_5\), and output layer \(y_1, y_2\); edge weights labelled \(w_{111}, w_{153}, w_{211}, w_{225}, \dots\)]

Two-Layered NN

$$\begin{bmatrix} x_1 & \cdots & x_m & 1 \end{bmatrix} \begin{bmatrix} w_{11} & \cdots & w_{1n} \\ \vdots & \ddots & \vdots \\ w_{m1} & \cdots & w_{mn} \\ b_1 & \cdots & b_n \end{bmatrix} = \begin{bmatrix} y_1 & \cdots & y_n \end{bmatrix}$$

Input of size m, output of size n

Output computed as vector-matrix multiplication

Fully Connected layer (FC)
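A short sketch of the FC computation in numpy (sizes are illustrative):

import numpy as np

def fc(x, W, b):
    # y = x^T W + b: input of size m, output of size n
    return x @ W + b

x = np.random.rand(8)       # input, m = 8
W = np.random.rand(8, 4)    # m x n weight matrix
b = np.random.rand(4)       # n biases
y = fc(x, W, b)             # output, n = 4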


Convolution layer (CONV)

Convolution layer (CONV)

Pooling: a simple way to subsample feature maps.

$$\text{Input} = \begin{bmatrix} 1 & 4 & 2 & -20 \\ -2 & 7 & 31 & 11 \\ 41 & -8 & -6 & 0 \\ 0 & 3 & -11 & -1 \end{bmatrix}$$

Max Pooling, \(2 \times 2\), stride 2:

$$\begin{bmatrix} 7 & 31 \\ 41 & 0 \end{bmatrix}$$

Average Pooling, \(2 \times 2\), stride 2 (rounded to integers):

$$\begin{bmatrix} 2 & 6 \\ 9 & -5 \end{bmatrix}$$

Pooling layers
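A runnable sketch of both pooling operations (numpy; note the averages above are shown rounded to integers):

import numpy as np

def pool2x2(x, op):
    # Apply op (np.max or np.mean) over 2 x 2 windows with stride 2
    h, w = x.shape
    return np.array([[op(x[i:i+2, j:j+2]) for j in range(0, w, 2)]
                     for i in range(0, h, 2)])

x = np.array([[1, 4, 2, -20], [-2, 7, 31, 11],
              [41, -8, -6, 0], [0, 3, -11, -1]])
print(pool2x2(x, np.max))   # [[ 7 31] [41  0]]
print(pool2x2(x, np.mean))  # [[ 2.5  6. ] [ 9.  -4.5]]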

Deep Neural Networks

AlexNet (2012) - breakthrough on the ImageNet dataset

Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks

DNN Workload

  • Workloads across domains (vision, NLP, statistical ML) share a common operation: matrix multiplication
  • Goal: Build a fast matrix-multiplier!
  • General Matrix Multiplication (GEMM) algorithm
    • Embarrassingly parallel
    • Can reuse data significantly
    • Highly ordered computation - across parallel units, the smallest unit of computation is the same
  • Improves compute throughput without increasing memory bandwidth requirements
  • Used in signal processing, polynomial operations
  • Smallest unit: Processing Element (PE)
  • Scalability achieved by replicating PEs across different dimensions
  • Reuse achieved by communication between adjacent PEs

H. T. Kung, Why Systolic Architectures?

Systolic Array

$$ C = A \times B$$

  • Columns of B are loaded into columns of systolic array
  • Rows of \( A^T\) are sent into rows of systolic array
  • Partial sums (rows of C) flow along the columns of systolic array
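A functional (not timed) emulation of this mapping, as a sketch in Python:

import numpy as np

def systolic_gemm(A, B):
    # Weight-stationary dataflow: PE (i, j) holds B[i][j]; rows of A^T
    # stream across the array while partial sums accumulate down column j.
    C = np.zeros((A.shape[0], B.shape[1]))
    for j in range(B.shape[1]):        # one column of PEs per column of B
        for i in range(B.shape[0]):    # partial sums flow down the column
            C[:, j] += A[:, i] * B[i, j]
    return C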

Image: Samajdar et al., SCALE-Sim: Systolic CNN Accelerator Simulator

Systolic Array

  1. Background
  2. Contributions
    1. ISA
    2. Microarchitecture
    3. Design Space Exploration
  3. Custom Compiler Flow
  4. Conclusion
 

Overview

  • Paradigm (human-made choices)
    • ISA-level choices
    • Mapping GEMM to the systolic array
    • Double buffering
    • Instruction fetch
  • Synthesis (exploration performed using the task-level simulator)
    • Array dimension
    • Buffer sizes
    • Queue sizes
    • Buffer organisation
  • Dynamic (compile-time exploration by the compiler)
    • Tiling configuration
    • Loop ordering
    • Instruction scheduling

Design Space Exploration

LOAD/STORE

  • Load/Store a 3D slice of an n-D matrix
  • Read from DRAM and write it to on-chip SRAM
  • The slice can be discontiguous in DRAM, but is contiguous in SRAM
  • LOAD \(4 \times 8 \times 8 \times 512\) can be decomposed as:
    • 1 \(\times\) LOAD(4, 8, 8, 512)
    • 4 \(\times\) LOAD(8, 8, 512)
    • \(4 \times 8\) \(\times\) LOAD(8, 512)

LOAD/STORE

  • DRAM base address, SRAM base address
  • Z_SIZE, Y_SIZE, X_SIZE
  • Z_STRIDE, Y_STRIDE
  • RESET
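A sketch of the addressing these fields imply (Python; DRAM and SRAM modelled as flat arrays, names illustrative):

def load_slice(dram, sram, dram_base, sram_base,
               z_size, y_size, x_size, z_stride, y_stride):
    # Copy a Z_SIZE x Y_SIZE x X_SIZE slice that is strided
    # (discontiguous) in DRAM into a contiguous region of SRAM.
    for z in range(z_size):
        for y in range(y_size):
            src = dram_base + z * z_stride + y * y_stride
            dst = sram_base + (z * y_size + y) * x_size
            sram[dst:dst + x_size] = dram[src:src + x_size]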

GEMM

  • Weight load phase: Load weights onto PEs
  • Compute phase
    • Read inputs and send them along the rows
    • Read previous partial sums and send them along the columns, from the top
    • Collect outputs at the bottom of the columns and store them back in the buffer

Image: Samajdar et al., SCALE-Sim: Systolic CNN Accelerator Simulator

GEMM

  • input \(1 \times 8 \times 8 \times 512\)
  • weight \(256 \times 3 \times 3 \times 512 \)
  • output \(1 \times 8 \times 8 \times 256 \)
  • systolic array: \(64 \times 64\)
  • 64 different filters across different columns
  • 64 different channels across different rows
  • \( \frac{256}{64} \times \frac{512}{64} \times 3 \times 3 \) GEMM instructions
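In total, \( \frac{256}{64} \times \frac{512}{64} \times 3 \times 3 = 4 \times 8 \times 9 = 288 \) GEMM instructions are issued for this layer.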

Image: Samajdar et al., SCALE-Sim: Systolic CNN Accelerator Simulator

GEMM

 
  • {input, weight, output} base address
  • output {height, width}
  • Stride {X, Y}, Padding {Top, Left, Right, Bottom}

Tensor ALU

  • Performs a windowed reduction over output feature maps
  • Similar to convolution, but without weights
  • Operation is vectorised across channels
  • ReLU can be mapped using R = S = 1
  • \(k \times m\) maxpool can be mapped using R = k and S = m

Tensor ALU

  • {input, output} base address
  • ALU opcode
  • {height, width} of {output, window}
  • stride {R, S, OW}
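A sketch of the windowed reduction (Python/numpy; op stands in for the ALU opcode):

import numpy as np

def tensor_alu(x, op, R, S, stride):
    # Reduce R x S windows over an H x W x C map, vectorised over channels C
    H, W, C = x.shape
    oh, ow = (H - R) // stride + 1, (W - S) // stride + 1
    y = np.empty((oh, ow, C))
    for i in range(oh):
        for j in range(ow):
            y[i, j] = op(x[i*stride:i*stride+R, j*stride:j*stride+S],
                         axis=(0, 1))
    return y

# k x m maxpool: tensor_alu(x, np.max, k, m, stride)
# ReLU (R = S = 1): an elementwise max against zero rather than a window reduction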

Microarchitecture

Dependency Resolution

$$ C = A \times B $$

LOAD A
LOAD B            (push next)
GEMM: C = A * B   (pop prev, push next)
STORE C           (pop prev)

  • Dependency module ensures that instructions are dispatched to execute only after dependencies are met
  • Dependency flags are inserted by compiler at compile time

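A minimal sketch of the push/pop token flow, assuming one token queue per pair of adjacent stages:

from queue import Queue

ld_to_ex, ex_to_st = Queue(), Queue()   # dependency token queues

def load_stage():
    # ... perform LOAD A, LOAD B ...
    ld_to_ex.put(1)      # push next: operands are in SRAM

def execute_stage():
    ld_to_ex.get()       # pop prev: blocks until the LOADs complete
    # ... perform GEMM: C = A * B ...
    ex_to_st.put(1)      # push next: result is in the output buffer

def store_stage():
    ex_to_st.get()       # pop prev: blocks until the GEMM completes
    # ... perform STORE C ...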

Compiling model to accelerator

  • Relay IR: Maps models from multiple frameworks to a single IR
  • TVM IR: Performs optimizations and schedule exploration; portable across hardware targets
  • JIT Compiler Runtime: Generates accelerator-specific binary from TVM IR

Chen et al., TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

Custom Compiler - inspired by TVM

Model to Accelerator ISA

  • input \(1 \times 8 \times 8 \times 512\)
  • weight \(256 \times 3 \times 3 \times 512 \)
  • output \(1 \times 8 \times 8 \times 256 \)
  • systolic array: \(64 \times 64\)
for i = 1 to 256/64
  for j = 1 to 512/64
    LOAD(input, j)
    for l = 1 to 3, m = 1 to 3
      LOAD(weight, i, j, l, m)
      GEMM(input', weight', output')
  ALU(output, i)
  STORE(output, i)
  • 64 filters across different columns
  • 64 channels across different rows
  • \( \frac{256}{64} \times \frac{512}{64} \times 3 \times 3 \) GEMM instructions

Task-level Simulator

 
  • Functional simulator, not cycle-accurate
  • Provides an estimate of execution time for a given instruction trace on a given accelerator configuration
  • Input: instruction trace; Output: execution summary
    • Execution time of the entire trace, instruction-level logs
    • Utilisation of components, module-level logs
  • Uses
    • Interface with the TVM compiler for schedule exploration
    • Analyse bottlenecks and refine the accelerator configuration
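A hypothetical sketch of the simulator loop (the real interface may differ; the latency hook and names are assumptions):

def simulate(trace, config):
    # Sum a per-instruction cost model and keep instruction-level logs
    total, logs = 0, []
    for insn in trace:
        cycles = config.latency(insn)   # assumed per-instruction cost model
        total += cycles
        logs.append((insn, cycles))
    return {"execution_time": total, "logs": logs}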

Status and Future Work

 
Module                 Status
fetch-decode           Completed
dependency resolver    Completed
load                   Final stages
store                  Final stages
GEMM (16x16)           Completed
ALU (vec_size=16)      Completed
Custom compiler        Work-in-progress
Task-level simulator   Work-in-progress
  • Future directions
    • cycle-accurate simulator
    • explore big.LITTLE systolic arrays

FPGA Synthesis Results

 
Module                 LUTs     FIFOs
fetch-decode           823      1317
dependency resolver    1427     858
load                   *        *
store                  *        *
GEMM (16x16)           90464    0
ALU (vec_size=16)      1280     0

*Work-in-progress

  • RTL in Bluespec System Verilog (BSV)
  • Synthesis using Xilinx Vivado v2018
  • Target FPGA: Xilinx Artix 7

Summary

 
  • ShaktiMAAN: an open-source accelerator for DNNs
  • Matrix multiplication is performed by a systolic array
  • Vector ALU performs activation and pooling functions
  • TVM-based compiler to execute models from multiple frameworks on the accelerator
  • Design space exploration of various hardware choices
  • Task-level simulator for
    • Interfacing with the TVM compiler
    • Optimizing the accelerator configuration

Acknowledgements - The Team

Vinod Ganesan

Neel Gala

Arjun Menon

Mohan Prasath

Rohan Kaulgekar

Sadhana

Sujay Pandit

Surya Selvam

Anand Uday Gokhale

Nidesh

Sundar Raman

Shilpa

Selvaraj

Rishabh Jain

Thank You!
