MLPerf Tiny (v0.5) performance benchmark on NVIDIA Jetson Nano

Daniel Morandini, 12286

August 2021

Introduction to Parallel Computing

Project Overview

The MLPerf Tiny v0.5 performance benchmark

Small-vocabulary keyword spotting use case

Google Speech Commands v2 dataset

12 output categories

Model: DS-CNN (depthwise-separable CNN)

Number of parameters      24,908
Input tensor              49x10
Output tensor             12x1
DS convolution layers     4
Convolution layers        1
Fully connected layers    1
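
For reference, a minimal Keras sketch of a DS-CNN with this shape. The 64-filter topology and kernel sizes below are assumptions borrowed from the standard DS-CNN keyword-spotting network, not taken from the benchmark code, but with these choices the model reproduces the 24,908-parameter count exactly:

  import tensorflow as tf

  def ds_cnn(num_filters=64, num_classes=12):
      # Input: 49 MFCC frames x 10 coefficients, one channel.
      inp = tf.keras.Input(shape=(49, 10, 1))
      # The single regular convolution layer.
      x = tf.keras.layers.Conv2D(num_filters, (10, 4), strides=(2, 2),
                                 padding="same")(inp)
      x = tf.keras.layers.BatchNormalization()(x)
      x = tf.keras.layers.ReLU()(x)
      # The 4 depthwise-separable convolution blocks
      # (depthwise 3x3 followed by pointwise 1x1).
      for _ in range(4):
          x = tf.keras.layers.DepthwiseConv2D((3, 3), padding="same")(x)
          x = tf.keras.layers.BatchNormalization()(x)
          x = tf.keras.layers.ReLU()(x)
          x = tf.keras.layers.Conv2D(num_filters, (1, 1))(x)
          x = tf.keras.layers.BatchNormalization()(x)
          x = tf.keras.layers.ReLU()(x)
      # Global pooling plus the single fully connected layer.
      x = tf.keras.layers.GlobalAveragePooling2D()(x)
      out = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
      return tf.keras.Model(inp, out)

  ds_cnn().summary()  # Total params: 24,908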

Specs

Jetson Nano Developer Kit

NVIDIA Tegra X1 series SoC

CPU Complex

ARM Cortex-A57 (quad-core, 64-bit)

48KB of L1 instruction cache (per core)

32KB of L1 data cache (per core)

2MB global L2 unified cache

maximum operating frequency 1.43GHz

GPU

NVIDIA Maxwell

512 GFLOPS (FP16)

128 cores

maximum operating frequency 921MHz

RAM

4GB LPDDR4, shared between CPU and GPU

theoretical peak bandwidth 25.6GB/s

operating frequency 1.6GHz
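
The peak figure follows from the memory configuration, assuming the Nano's 64-bit LPDDR4 interface with two transfers per clock cycle:

BW_{peak} = F_{mem} * 2 * (64bit / 8) = 1.6GHz * 2 * 8B = 25.6GB/s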

Actual bandwidth (measured)

CPU-GPU    5.6 GB/s
GPU-CPU   10.5 GB/s
GPU-GPU   16.7 GB/s

Computational Costs (per inference)

MACC total                   1,664,768
FLOPs total                  3,329,536
Theoretical execution time   0.0284ms
Max memory bandwidth         70.7MB/s

T_t = Ops_{algo} / (Ops_{cc} * N_{cores} * F)
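
Plugging in the Maxwell GPU figures from above, and assuming one floating-point operation per core per cycle (Ops_{cc} = 1):

T_t = 3,329,536 / (1 * 128 * 921MHz) ≈ 0.028ms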

Results & Conclusion

Throughput Test Results

Performance results for window 2:
  # Inferences :         2400
  Runtime      :       10.309 sec.
  Throughput   :      232.803 inf./sec.
Runtime requirements have been met.
Performance results for window 4:
  # Inferences :         2400
  Runtime      :       10.296 sec.
  Throughput   :      233.105 inf./sec.
Runtime requirements have been met.
Performance results for window 6:
  # Inferences :         2400
  Runtime      :       10.362 sec.
  Throughput   :      231.609 inf./sec.
Runtime requirements have been met.
Performance results for window 8:
  # Inferences :         2400
  Runtime      :       10.382 sec.
  Throughput   :      231.177 inf./sec.
Runtime requirements have been met.
Performance results for window 10:
  # Inferences :         2400
  Runtime      :       10.333 sec.
  Throughput   :      232.260 inf./sec.
Runtime requirements have been met.
---------------------------------------------------------
Median throughput is 232.260 inf./sec.
---------------------------------------------------------
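
A quick sanity check on the reported median, using the five window throughputs from the log above; the implied per-inference latency is in the same ballpark as the 4.2ms measured execution time used later:

  import statistics

  # Throughput of each of the five measurement windows, in inf./sec.
  windows = [232.803, 233.105, 231.609, 231.177, 232.260]

  med = statistics.median(windows)
  print(f"Median throughput: {med:.3f} inf./sec")   # 232.260 inf./sec
  print(f"Implied latency:   {1000 / med:.2f} ms")  # ~4.31 ms per inference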

Are these results good or bad?

Processor Utilisation

P_{util} = 100 * T_t / T_a

T_t = 0.0284ms (theoretical execution time)
T_a = 4.2ms (measured execution time per inference)

P_{util} = 0.67%
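
Substituting the two execution times:

P_{util} = 100 * 0.0284ms / 4.2ms ≈ 0.67%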

Is it memory bound?

The algorithm needs 70.7MB/s; the Jetson provides at least 5.6GB/s on its slowest measured path, roughly 80x headroom.

NO

Are we compute bound, then?

======== Profiling result:
Time(%)      Time       Avg  Name
 21.29%  89.681ms  22.397us  precomputed_convolve_sgemm
 17.52%  73.824ms  8.1940us  tf::FusedBatchNormInferenceMetaKernel
 15.25%  64.242ms  7.1300us  tf::BiasNCHWKernel
 12.29%  51.794ms  12.935us  tf::DepthwiseConv2dGPUKernelNCHWSmall
  8.25%  34.775ms  34.740us  explicit_convolve_sgemm
  7.57%  31.884ms  6.3700us  tf::ShuffleInTensor3Simple
  4.86%  20.468ms  20.447us  cudnn::im2col4d_kernel
  3.91%  16.455ms  16.438us  cudnn::pooling_fw_4d_kernel
  2.93%  12.327ms  3.0780us  cudnn::kern_precompute_indices

~0.395ms per inference is spent doing actual computation.

The other ~91% of the time is overhead rather than computation.
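
The 0.395ms figure can be recovered from the profile: the kernel times sum to roughly 395ms, and assuming the profiled run covered 1,000 inferences (an assumption, the run length is not shown in the log), that leaves about 0.395ms of GPU work per 4.2ms inference:

  # GPU kernel times from the nvprof output above, in milliseconds.
  kernel_ms = [89.681, 73.824, 64.242, 51.794, 34.775,
               31.884, 20.468, 16.455, 12.327]

  total_ms = sum(kernel_ms)            # ~395.45 ms of kernel time
  inferences = 1000                    # ASSUMPTION: profiled inference count
  compute_ms = total_ms / inferences   # ~0.395 ms of compute per inference
  overhead = 1 - compute_ms / 4.2      # share of the 4.2 ms spent not computing
  print(f"{compute_ms:.3f} ms compute/inference, {overhead:.0%} overhead")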

Fine.

MLPerf on NVIDIA Jetson

By Daniel Morandini
