Quantization was introduced in TPUs so that large computations can be mapped to a smaller bit width. Let us now look at how a simple ReLU operation takes place.
Original ReLU operation
ReLU operation with the internal conversion
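As a rough sketch of what the second figure depicts (assuming TensorFlow 2.x and its tf.quantization.quantize/dequantize ops, which are introduced next; the range values here are only illustrative), the ReLU ends up sandwiched between a float-to-8-bit conversion on the way in and an 8-bit-to-float conversion on the way out:

```python
import tensorflow as tf

# Float activations flowing into the ReLU node.
x = tf.constant([-3.0, -0.5, 0.0, 1.5, 4.0], dtype=tf.float32)

# Internal conversion on the way in: float32 -> 8-bit codes (quint8).
q_out, q_min, q_max = tf.quantization.quantize(
    x, min_range=-3.0, max_range=4.0, T=tf.quint8)

# Internal conversion on the way out: 8-bit codes -> float32.
# (A real quantized graph applies ReLU directly to the 8-bit codes; here we
# dequantize first to keep the sketch simple.)
x_deq = tf.quantization.dequantize(q_out, q_min, q_max)
y = tf.nn.relu(x_deq)

print(y.numpy())  # negatives clamp to 0, positives pass through with small rounding error
```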
TensorFlow has recently released an API that converts 32-bit floating-point numbers into 8-bit integers.
Generally, all weight values in a neural network fall within a small range, say from -8.342343353543f to 23.35442342554f. During quantization in TensorFlow, the smallest value is mapped to 0 and the largest value to 255, and all values in between are scaled linearly into the 0 to 255 range.
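A minimal NumPy sketch of that linear mapping (the array contents and variable names are just illustrative):

```python
import numpy as np

w = np.array([-8.342343353543, -1.0, 0.0, 5.5, 23.35442342554], dtype=np.float32)

w_min, w_max = float(w.min()), float(w.max())   # the range covered by the weights
scale = (w_max - w_min) / 255.0                 # float value covered by one 8-bit step

# Smallest value maps to 0, largest to 255, everything else is scaled linearly.
q = np.round((w - w_min) / scale).astype(np.uint8)
print(q)            # 8-bit codes: the minimum becomes 0, the maximum becomes 255

# Mapping back to float recovers the values up to rounding error.
w_restored = q.astype(np.float32) * scale + w_min
print(w_restored)
```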
Based on the paper Ternary Neural Networks with Fine-Grained Quantization by Abhisek Kundu (Parallel Computing Lab, Intel Labs, Bangalore, India):
$$\alpha^{*}, W^{t*} = \underset{\alpha,\, W^{t}}{\arg\min}\; J(\alpha, W^{t}) = \lVert W - \alpha W^{t} \rVert_{2}^{2} \quad \text{s.t.} \quad \alpha \ge 0,\; W^{t}_{i} \in \{-1, 0, 1\},\; i = 1, 2, \ldots, n$$
Previous papers quantized only the weights, not the inputs (activations), during inference.
A threshold D > 0 is used to decide which weights are mapped to 0 and which are mapped to ±1.
It is important to remember that they do not perform any retraining.
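A rough NumPy sketch of threshold-based ternarization in that spirit (the function name, threshold value, and the closed-form choice of alpha as the mean magnitude of the surviving weights are this sketch's assumptions, not code from the paper):

```python
import numpy as np

def ternarize(w, delta):
    """Quantize a weight vector to {-1, 0, +1} with a scaling factor alpha.

    Weights with |w_i| <= delta are mapped to 0; the rest keep only their sign.
    For a fixed ternary pattern, ||w - alpha * w_t||^2 is minimized by
    alpha = mean(|w_i|) over the surviving (non-zero) positions.
    """
    w_t = np.where(np.abs(w) > delta, np.sign(w), 0.0)
    mask = w_t != 0
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
    return alpha, w_t

w = np.array([0.9, -0.05, 0.4, -1.2, 0.02], dtype=np.float32)
alpha, w_t = ternarize(w, delta=0.1)

print(alpha)          # scaling factor for the ternary weights
print(w_t)            # [ 1.  0.  1. -1.  0.]
print(alpha * w_t)    # the reconstruction that approximates w
```

For a fixed ternary pattern, the best alpha follows directly from minimizing the reconstruction error above, so the only real design choice left is where to place the threshold.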
Take self-driving cars, for example, where a difference of a few percentage points in accuracy could be the difference between a safe drive and a crash. So while quantization is powerful, since it can make computation much faster, its applications should be considered carefully.