A Domain-Specific Supercomputer for Training Deep Neural Networks

Recap: TPU v1

Motivation

  • 2013 - TPU v1 development starts
  • 15 months later, TPU v1 was deployed in datacenters
  • 2014 - What next?
    • Improve TPU v1 and build a better inference chip
    • Attack the harder training problem
  • SOTA training in 2014: clusters of CPU worker nodes plus parameter servers, updating the shared parameters asynchronously (sketched below)
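
A minimal, single-threaded NumPy sketch of that worker/parameter-server pattern (my own illustration, not code from the paper; names like ParameterServer and worker_gradient are made up): each worker pulls the current weights, computes a gradient on its own data shard, and pushes it back, and the server applies each gradient as it arrives rather than waiting at a global barrier.

```python
# Illustrative only: a single-threaded sketch of the worker + parameter-server
# pattern.  ParameterServer and worker_gradient are made-up names, not a real API.
import numpy as np

class ParameterServer:
    """Holds the shared weights and applies each worker's gradient as it arrives."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def pull(self):
        return self.w.copy()        # a worker fetches (possibly stale) weights

    def push(self, grad):
        self.w -= self.lr * grad    # update immediately, no barrier across workers

def worker_gradient(w, x, y):
    """Gradient of squared error for a linear model on one data shard."""
    return x.T @ (x @ w - y) / len(y)

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
server = ParameterServer(dim=2)

# Four "workers", each owning one shard of the training data.
shards = []
for _ in range(4):
    x = rng.normal(size=(32, 2))
    shards.append((x, x @ true_w))

for step in range(100):
    for x, y in shards:             # interleaved pushes stand in for asynchrony
        w_local = server.pull()
        server.push(worker_gradient(w_local, x, y))

print("learned weights:", server.w)  # converges toward [2, -1]
```

Aggregating all shard gradients before a single update would instead give synchronous SGD, which avoids stale gradients at the cost of waiting for the slowest worker.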

DSA supercomputer vs cluster

We chose to build a DSA supercomputer instead of clustering CPU hosts with DSA chips:

  • Training time is huge: a single TPUv2 chip would take two to 16 months to train one Google production application, so a typical application might want to use hundreds of chips.
  • DNN wisdom: bigger datasets plus bigger machines lead to bigger breakthroughs. Moreover, results like AutoML use 50x more computation.

Differences between training & inference on TPU

  • Harder parallelization: training needs synchronization to update weights
  • More computation: calculating gradients in addition to the forward pass
  • More memory: the intermediate values of the forward pass must be kept for backpropagation, roughly 10x more than inference (see the sketch after this list)
  • More programmability: must be able to adapt to newer training algorithms
  • Wider data: 8-bit integers suffice for inference; training needs higher-precision floating-point data
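
To make the memory and data-width bullets concrete, here is a tiny NumPy sketch (my own illustration, not from the paper) of one training step for a two-layer network: the forward-pass intermediates must be kept for the backward pass, and the weight update relies on floating-point arithmetic, whereas inference only needs the forward half and can be quantized to 8-bit integers.

```python
# Illustrative only: one training step of a tiny two-layer MLP in NumPy,
# showing why training stores intermediate activations and uses floating point.
import numpy as np

rng = np.random.default_rng(0)
x  = rng.normal(size=(8, 16))          # batch of inputs
y  = rng.normal(size=(8, 4))           # targets
W1 = rng.normal(size=(16, 32)) * 0.1
W2 = rng.normal(size=(32, 4)) * 0.1

# ---- forward pass: h_pre and h must be kept in memory for backprop ----
h_pre = x @ W1                         # pre-activation (saved)
h     = np.maximum(h_pre, 0.0)         # ReLU activation (saved)
out   = h @ W2
loss  = np.mean((out - y) ** 2)

# ---- backward pass: gradients reuse the saved intermediates ----
d_out = 2.0 * (out - y) / y.size
dW2   = h.T @ d_out                    # needs h from the forward pass
d_h   = d_out @ W2.T
d_pre = d_h * (h_pre > 0)              # needs h_pre from the forward pass
dW1   = x.T @ d_pre

# ---- weight update: small floating-point steps that int8 could not represent ----
lr = 0.1
W1 -= lr * dW1
W2 -= lr * dW2

# Inference, by contrast, is just the forward half and can run quantized to int8.
print("loss:", float(loss))
```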

Critical feature: Communication
