CS6886
Assignment 2
Pitfalls in Problem Statement
- Padding - constant time, doesn't affect result much
- Absence of scatter-gather instructions in AVX
Fallacies in the submissions
- Folder Structure (seriously?)
- Functions implemented without using AVX
- Constructing vector operand from individual operands
- Estimating operational intensity as a macro property
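One reading of "constructing vector operand from individual operands" is filling a register lane by lane (for example with `_mm256_set_ps`) instead of issuing a single contiguous load. The sketch below shows the contiguous-load alternative together with a horizontal reduction. The function name `sum8` is hypothetical, and the scalar fallback is an assumption added so the code still builds when AVX is not enabled at compile time.

```c
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Sum 8 contiguous floats starting at p.
   AVX path: one unaligned 256-bit load followed by a horizontal reduction.
   Fallback path: plain scalar loop (used when __AVX__ is not defined). */
float sum8(const float *p)
{
#if defined(__AVX__)
    __m256 v  = _mm256_loadu_ps(p);               /* one load fills all 8 lanes */
    __m128 lo = _mm256_castps256_ps128(v);        /* lanes 0..3 */
    __m128 hi = _mm256_extractf128_ps(v, 1);      /* lanes 4..7 */
    __m128 s  = _mm_add_ps(lo, hi);               /* 4 partial sums */
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));       /* 2 partial sums in lanes 0,1 */
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));   /* total in lane 0 */
    return _mm_cvtss_f32(s);
#else
    float total = 0.0f;
    for (size_t i = 0; i < 8; ++i)
        total += p[i];
    return total;
#endif
}
```

Built with `-mavx`, the AVX path is taken; without it, the scalar loop gives the same result for exactly representable inputs.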
2D convolution
for i = 1 to N
  for j = 1 to M
    for k = 1 to E
      for l = 1 to F
        out = vector_init(0)
        for c = 1 to C
          for r = 1 to R
            for s = 1 to S, s=s+8
              inp = vector_load(input[i][c][k+r][l+s], 8)
              wgt = vector_load(weight[j][c][r][s])
              inp = vector_mul(inp, wgt)
              out = vector_add(out, inp)
        output[i][j][k][l] = vector_reduce(out)
Memory accessed = (C * R * S * 2 + 8) x 4 bytes
Compute performed = C * R * S * 2 FLOPs
Operational Intensity = \(\frac{C \cdot R \cdot S \cdot 2}{(C \cdot R \cdot S \cdot 2 + 8) \cdot 4}\) FLOPs/byte
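The formula above can be evaluated directly. The helper below is a minimal sketch that uses the slide's counts as given, assuming 4-byte floats; the function name `conv_operational_intensity` is hypothetical.

```c
/* Operational intensity (FLOPs per byte) for one output element,
   using the slide's counts: C*R*S*2 FLOPs and (C*R*S*2 + 8) floats
   of memory traffic at 4 bytes each. */
double conv_operational_intensity(int C, int R, int S)
{
    double flops = 2.0 * C * R * S;
    double bytes = (2.0 * C * R * S + 8.0) * 4.0;
    return flops / bytes;
}
```

For example, C = 1, R = 1, S = 4 gives 8 / 64 = 0.125 FLOPs/byte. As C * R * S grows, the value approaches but never reaches 0.25 FLOPs/byte.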
Layerwise execution time


Input Stationary
for i = 1 to N
  for c = 1 to C
    for k = 1 to E
      for l = 1 to F
        inp = vector_load(input[i][c][k][l])
        for j = 1 to M
          out = vector_init(0)
          for r = 1 to R
            for s = 1 to S, s=s+8
              wgt = vector_load(weight[j][c][r][s])
              tmp = vector_mul(inp, wgt)
              out = vector_add(out, tmp)
          output[i][j][k][l] += vector_reduce(out)
Weight Stationary
for j = 1 to M
  for c = 1 to C
    for r = 1 to R
      for s = 1 to S, s=s+8
        wgt = vector_load(weight[j][c][r][s])
        for i = 1 to N
          for k = 1 to E
            for l = 1 to F
              inp = vector_load(input[i][c][k][l])
              out = vector_mul(inp, wgt)
              output[i][j][k][l] += vector_reduce(out)
Output Stationary
for i = 1 to N
  for j = 1 to M
    for k = 1 to E
      for l = 1 to F
        out = vector_init(0)
        for c = 1 to C
          for r = 1 to R
            for s = 1 to S, s=s+8
              inp = vector_load(input[i][c][k+r][l+s], 8)
              wgt = vector_load(weight[j][c][r][s])
              inp = vector_mul(inp, wgt)
              out = vector_add(out, inp)
        output[i][j][k][l] = vector_reduce(out)
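The three dataflows differ only in which operand stays fixed in the innermost loops, so they should produce the same output. The scalar reference below is a sketch for checking that equivalence before vectorizing; it is not an AVX implementation. The layer sizes, the 0-based indexing, and the stride-1, no-padding relation H = E + R - 1, W = F + S - 1 are assumptions, and all function names are hypothetical.

```c
#include <string.h>

/* Hypothetical small layer: stride 1, no padding. */
enum { N = 1, M = 2, C = 2, E = 3, F = 3, R = 2, S = 2,
       H = E + R - 1, W = F + S - 1 };

/* Output stationary: one accumulator per output element. */
void conv_os(float in[N][C][H][W], float wt[M][C][R][S], float out[N][M][E][F])
{
    for (int i = 0; i < N; ++i)
    for (int j = 0; j < M; ++j)
    for (int k = 0; k < E; ++k)
    for (int l = 0; l < F; ++l) {
        float acc = 0.0f;
        for (int c = 0; c < C; ++c)
        for (int r = 0; r < R; ++r)
        for (int s = 0; s < S; ++s)
            acc += in[i][c][k + r][l + s] * wt[j][c][r][s];
        out[i][j][k][l] = acc;
    }
}

/* Weight stationary: each weight is held while every output it touches is updated. */
void conv_ws(float in[N][C][H][W], float wt[M][C][R][S], float out[N][M][E][F])
{
    memset(out, 0, sizeof(float) * N * M * E * F);
    for (int j = 0; j < M; ++j)
    for (int c = 0; c < C; ++c)
    for (int r = 0; r < R; ++r)
    for (int s = 0; s < S; ++s) {
        float w = wt[j][c][r][s];
        for (int i = 0; i < N; ++i)
        for (int k = 0; k < E; ++k)
        for (int l = 0; l < F; ++l)
            out[i][j][k][l] += in[i][c][k + r][l + s] * w;
    }
}

/* Input stationary: each input is held while every output it contributes to is updated. */
void conv_is(float in[N][C][H][W], float wt[M][C][R][S], float out[N][M][E][F])
{
    memset(out, 0, sizeof(float) * N * M * E * F);
    for (int i = 0; i < N; ++i)
    for (int c = 0; c < C; ++c)
    for (int y = 0; y < H; ++y)
    for (int x = 0; x < W; ++x) {
        float v = in[i][c][y][x];
        for (int j = 0; j < M; ++j)
        for (int r = 0; r < R; ++r)
        for (int s = 0; s < S; ++s) {
            int k = y - r, l = x - s;          /* output position fed by (y, x) */
            if (k >= 0 && k < E && l >= 0 && l < F)
                out[i][j][k][l] += v * wt[j][c][r][s];
        }
    }
}

/* Returns 1 when all three loop orders give identical outputs on small integer data. */
int dataflows_agree(void)
{
    static float in[N][C][H][W], wt[M][C][R][S];
    static float a[N][M][E][F], b[N][M][E][F], d[N][M][E][F];
    float *p = &in[0][0][0][0], *q = &wt[0][0][0][0];
    for (int t = 0; t < N * C * H * W; ++t) p[t] = (float)(t % 5 - 2);
    for (int t = 0; t < M * C * R * S; ++t) q[t] = (float)(t % 3 - 1);
    conv_os(in, wt, a);
    conv_ws(in, wt, b);
    conv_is(in, wt, d);
    return memcmp(a, b, sizeof a) == 0 && memcmp(a, d, sizeof a) == 0;
}
```

Integer-valued test data keeps every partial sum exact, so the comparison does not depend on floating-point summation order.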
Choice between Dataflows
- Data Reuse pattern
  - IS \( \approx \) WS < OS
  - Depends more on the implementation
- Tiling for layer 2
  - 9 x 9 tiles
- Memory layout
  - NHWC
  - MRSC
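Both layouts named above put the channel index last, so consecutive channels of one pixel (or one filter tap) sit next to each other in memory and can be fetched with a single contiguous vector load. The index helpers below are a sketch of that addressing, assuming row-major storage and 0-based indices; the function names are hypothetical.

```c
#include <stddef.h>

/* Flat offset of input element (n, h, w, c) in an N x H x W x C tensor. */
size_t idx_nhwc(size_t n, size_t h, size_t w, size_t c,
                size_t H, size_t W, size_t C)
{
    return ((n * H + h) * W + w) * C + c;
}

/* Flat offset of weight element (m, r, s, c) in an M x R x S x C tensor. */
size_t idx_mrsc(size_t m, size_t r, size_t s, size_t c,
                size_t R, size_t S, size_t C)
{
    return ((m * R + r) * S + s) * C + c;
}
```

With these offsets, stepping `c` by one moves by exactly one float in both tensors, which is what lets the inner loop run over channels in steps of 8.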
ReLU
for i = 1 to N
  for c = 1 to C
    for k = 1 to H
      for l = 1 to W, l=l+8
        inp = vector_load(input[i][c][k][l])
        zero = vector_init(0)
        mask = vector_cmp(inp, zero, CMP_GT)
        inp = vector_and(inp, mask)
        vector_store(inp, input[i][c][k][l])
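The compare-and-mask pattern above maps onto AVX intrinsics fairly directly. The sketch below assumes the tensor is stored contiguously, so ReLU can treat it as a flat array of N * C * H * W floats; the function name `relu_inplace`, the scalar tail loop, and the non-AVX fallback are assumptions added for illustration.

```c
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* In-place ReLU over n contiguous floats. */
void relu_inplace(float *x, size_t n)
{
    size_t i = 0;
#if defined(__AVX__)
    const __m256 zero = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8) {
        __m256 v    = _mm256_loadu_ps(x + i);
        __m256 mask = _mm256_cmp_ps(v, zero, _CMP_GT_OQ); /* all ones where v > 0 */
        _mm256_storeu_ps(x + i, _mm256_and_ps(v, mask));  /* zero the other lanes */
    }
#endif
    /* Scalar tail, and the whole array when AVX is not enabled. */
    for (; i < n; ++i)
        x[i] = x[i] > 0.0f ? x[i] : 0.0f;
}
```

A single `_mm256_max_ps(v, zero)` would also work in place of the compare and AND; the version here follows the slide's structure.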
2D MaxPool
for i = 1 to N
  for c = 1 to C
    for k = 1 to E
      for l = 1 to F
        out = vector_init(FLOAT_MIN)
        for r = 1 to R
          for s = 1 to S, s=s+8
            inp = vector_load(input[i][c][k+r][l+s], 8)
            out = vector_max(out, inp)
        output[i][c][k][l] = vector_reduce_max(out)
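A possible C rendering of max pooling for a single channel plane is sketched below. It assumes stride 1 (matching the `k+r`, `l+s` indexing), a row-major H x W plane, and an output of size (H - R + 1) x (W - S + 1). Note that it vectorizes across 8 adjacent output columns rather than across the window as in the pseudocode, which avoids a horizontal reduction. The running maximum starts at `-FLT_MAX` because C's `FLT_MIN` is the smallest positive float, not the most negative one. The name `maxpool_plane` is hypothetical.

```c
#include <float.h>
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Stride-1 max pooling of one H x W plane with an R x S window.
   out must hold (H - R + 1) * (W - S + 1) floats. */
void maxpool_plane(const float *in, float *out,
                   size_t H, size_t W, size_t R, size_t S)
{
    size_t E = H - R + 1, F = W - S + 1;
    for (size_t k = 0; k < E; ++k) {
        size_t l = 0;
#if defined(__AVX__)
        /* 8 neighbouring outputs at once: lane t tracks output column l + t. */
        for (; l + 8 <= F; l += 8) {
            __m256 best = _mm256_set1_ps(-FLT_MAX);
            for (size_t r = 0; r < R; ++r)
                for (size_t s = 0; s < S; ++s)
                    best = _mm256_max_ps(
                        best, _mm256_loadu_ps(in + (k + r) * W + l + s));
            _mm256_storeu_ps(out + k * F + l, best);
        }
#endif
        /* Scalar tail, and every column when AVX is not enabled. */
        for (; l < F; ++l) {
            float best = -FLT_MAX;
            for (size_t r = 0; r < R; ++r)
                for (size_t s = 0; s < S; ++s) {
                    float v = in[(k + r) * W + l + s];
                    if (v > best)
                        best = v;
                }
            out[k * F + l] = best;
        }
    }
}
```

An all-negative input is a useful check here: with a positive initial value the result would be wrong for every window.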
sysdl_a2
By Gokulan Ravi