Written | Shashank Prasanna

Translated | Hu Yanjun

As shown in the figure above, the CPU is the most versatile: it can run arbitrary code, and it can also perform tasks that dedicated chips handle, such as image processing, speech processing, and machine learning. However, its performance and energy efficiency are relatively low.

Post-Training Quantization (PTQ): Starting from a trained model that computes in FP32, a scale factor is determined and used to map FP32 values to INT8. TensorRT determines the scale factor by measuring the distribution of each layer's activation outputs, then finding the scale that minimizes the information loss (measured by KL divergence) between the reference distribution and the quantized distribution.
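The core of the FP32-to-INT8 mapping is a single scale factor per tensor. The sketch below is a minimal illustration of symmetric quantization with a max-calibrated scale, not TensorRT's actual calibration algorithm (which searches over candidate scales using KL divergence); the function names and the sample activation values are hypothetical.

```python
# Illustrative sketch of symmetric post-training quantization:
# map FP32 values to INT8 with a single per-tensor scale factor.

def quantize(values, scale):
    """Divide by the scale, round, and clamp to the INT8 range [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def dequantize(quantized, scale):
    """Map INT8 values back to approximate FP32 values."""
    return [q * scale for q in quantized]

# Hypothetical activation values collected during a calibration pass.
activations = [0.1, -0.5, 1.2, 3.4, -2.7]

# Simplest possible calibration: choose the scale so that the largest
# absolute value maps to 127. TensorRT instead searches for the scale
# that minimizes KL divergence between the FP32 and INT8 distributions.
scale = max(abs(v) for v in activations) / 127

q = quantize(activations, scale)
approx = dequantize(q, scale)
```

After the round trip, each value in `approx` differs from the original by at most half a quantization step (`scale / 2`), which is the information loss the calibration step tries to minimize.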

Quantization-Aware Training (QAT): Scale factors are calculated during training, so the model can adapt to quantization and minimize the information loss.
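A common way QAT exposes the model to quantization error is "fake quantization": in the forward pass, values are quantized to INT8 and immediately dequantized back to FP32, so training sees the rounding error while gradients still flow. The sketch below illustrates only this forward-pass operation; the function name, scale, and weight value are hypothetical, and it is not a specific framework's API.

```python
# Illustrative sketch of the "fake quantization" step used in
# quantization-aware training: quantize to INT8, then immediately
# dequantize, so the model trains against the quantization error.

def fake_quantize(x, scale):
    """Round x to the nearest INT8 level and map it back to FP32."""
    q = max(-127, min(127, round(x / scale)))
    return q * scale

# Hypothetical weight and scale; during QAT every forward pass routes
# weights/activations through this op, while the backward pass treats
# it as the identity (the straight-through estimator).
weight = 0.8531
scale = 1.0 / 127
simulated = fake_quantize(weight, scale)
```

Because the model repeatedly sees `simulated` rather than `weight` during training, it can adjust its parameters to compensate for the rounding, which is why QAT typically loses less accuracy than post-training quantization.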

The header image is from brookhaven on Pixabay.