2022 Episode 054

1. Background and challenges

2. Quantization scheme in practice

2.1 Re-parameterization optimizer

2.2 Partial quantization based on quantization sensitivity analysis

2.3 Quantization-aware training based on channel-wise distillation

3. Deployment-time optimization

3.1 Graph optimization

3.1.4 Performance Testing

3.2 Online Service Optimization

4. Summary

YOLOv6 version updated

5. References

6. Author of this article

YOLOv6 is an open-source 2D object detection model for industrial applications released by Meituan [1]. It is fast, accurate, and deployment-friendly, and it is widely used in Meituan's visual business scenarios. Speeding up inference through quantization is a routine step in industrial practice, but because the YOLOv6 series relies heavily on re-parameterization modules, quantizing YOLOv6 efficiently without losing accuracy has become a pressing problem.

This article addresses the quantization of YOLOv6. Taking the YOLOv6s model as an example, it explores a workable quantization scheme from two angles: Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT).

RepOpt [3] studied why re-parameterized structures are hard to quantize and found that the branch fusion step of a re-parameterized structure significantly enlarges the standard deviation of the weight distribution. The abnormal weight distribution in turn produces an excessively wide range of values in the network's activations, which causes a large quantization loss in those layers and therefore a severe drop in model accuracy.
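The effect of a widened weight distribution on quantization error can be illustrated with a small NumPy sketch. This is a toy under stated assumptions, not the RepOpt analysis itself: the weights are random, the quantizer is a symmetric per-tensor INT8 quantizer with max calibration, and branch fusion is imitated by adding an identity branch (1.0 at the kernel centre) to a narrow 3x3 kernel.

```python
import numpy as np

def fake_quantize(x, amax, num_bits=8):
    """Symmetric per-tensor quantize-dequantize over the range [-amax, amax]."""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for INT8
    scale = amax / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale

def quant_mse(w):
    """Mean squared error introduced by max-calibrated INT8 quantization."""
    return float(np.mean((w - fake_quantize(w, np.abs(w).max())) ** 2))

rng = np.random.default_rng(0)
# Hypothetical 3x3 branch weights with a narrow, roughly Gaussian distribution.
w_branch = rng.normal(0.0, 0.02, size=(64, 64, 3, 3))

# Toy stand-in for branch fusion (an assumption for illustration): the identity
# branch adds 1.0 at the kernel centre of each channel's own filter.
w_fused = w_branch.copy()
idx = np.arange(64)
w_fused[idx, idx, 1, 1] += 1.0

mse_branch = quant_mse(w_branch)
mse_fused = quant_mse(w_fused)
print(f"std: branch={w_branch.std():.4f} fused={w_fused.std():.4f}")
print(f"MSE: branch={mse_branch:.2e} fused={mse_fused:.2e}")
```

In this toy, the few large values stretch the quantization range, so the step size grows and the bulk of the small weights is represented much more coarsely.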

We implemented a RepOpt version of the YOLOv6s network (YOLOv6s_repopt). Its floating-point accuracy is 42.4% (300 epochs), on par with YOLOv6s_repvgg, and the two versions have the same network structure at deployment time. We first analyzed the data distribution characteristics of the YOLOv6s_repopt model.

As shown in Figure 4 (right), the convolution and other operators of the RepOpt model are trained with fake quantization (pseudo-quantization) to improve the accuracy of the quantized model, and the quantized model is then deployed directly, with no model fusion step. The specific QAT algorithm and the resulting accuracy gains are presented later.

YOLOv6s_repopt reaches 40.9% mAP after PTQ, a large improvement over the previous 35.0%, but the remaining 1.5% accuracy loss still does not meet business requirements. We therefore adopted Partial PTQ, which lets the quantization-sensitive layers of the network fall back to floating-point computation so that the accuracy of the quantized model recovers quickly. The first step is a quantization sensitivity analysis of every layer in the network.

We compared three commonly used sensitivity metrics on the YOLOv6s_repopt network: mean squared error (MSE), signal-to-noise ratio (SNR), and cosine similarity. Calibration uses 4 batches of data, the sensitivity computation uses 1 batch, and the batch size is 32.

During testing, only one layer is quantized at a time, and the sensitivity value computed from that layer's activation data represents the layer's quantization sensitivity. For comparison, we can also compute the network's mAP on the COCO val dataset directly and use the detection accuracy as the layer's quantization sensitivity: the higher the detection accuracy, the lower the sensitivity of the layer (hereinafter the mAP method).
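The three metrics can be sketched in NumPy as follows. The definitions are the common textbook ones and are assumptions here (in particular, SNR is taken as signal power over noise power in dB, while some toolkits use the inverse ratio); the activation tensor is random data standing in for one batch of 32 samples, and `fake_quantize` is a hypothetical helper.

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    """Symmetric per-tensor quantize-dequantize with max calibration."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(float(np.abs(x).max()), 1e-12) / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def mse(ref, test):
    """Mean squared error: lower means less sensitive."""
    return float(np.mean((ref - test) ** 2))

def snr_db(ref, test):
    """Signal-to-noise ratio in dB: higher means less sensitive."""
    noise = float(np.sum((ref - test) ** 2))
    signal = float(np.sum(ref ** 2))
    return float(10.0 * np.log10(signal / max(noise, 1e-12)))

def cosine_similarity(ref, test):
    """Cosine of the angle between flattened tensors: closer to 1 is better."""
    a, b = ref.ravel(), test.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

rng = np.random.default_rng(0)
act = rng.normal(size=(32, 64, 20, 20))   # one batch, batch size 32
metrics = {}
for bits in (8, 4):
    q = fake_quantize(act, num_bits=bits)
    metrics[bits] = (mse(act, q), snr_db(act, q), cosine_similarity(act, q))
    print(bits, metrics[bits])
```

The 4-bit run is included only to show that all three metrics move in the expected direction when quantization gets coarser.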
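The one-layer-at-a-time procedure, followed by the Partial PTQ choice of which layers stay in floating point, might look like the sketch below. It uses a toy four-layer network with random weights, quantizes weights only, and scores each layer by the MSE of its own output; the layer names, the outlier injected into `layer2`, and the helper functions are all hypothetical.

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    """Symmetric per-tensor quantize-dequantize with max calibration."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(float(np.abs(x).max()), 1e-12) / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def layer_sensitivities(weights, x):
    """Quantize one layer at a time and score it by the MSE of its output."""
    scores = {}
    for name, w in weights.items():
        ref = np.maximum(x @ w, 0.0)                  # float output of this layer
        out = np.maximum(x @ fake_quantize(w), 0.0)   # same layer, INT8 weights
        scores[name] = float(np.mean((ref - out) ** 2))
        x = ref                                       # later layers see float inputs
    return scores

def select_float_layers(scores, k):
    """Partial PTQ: keep the k most sensitive layers in floating point."""
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking[:k]

rng = np.random.default_rng(0)
weights = {f"layer{i}": rng.normal(0.0, 0.2, size=(32, 32)) for i in range(4)}
weights["layer2"][0, 0] = 20.0        # one outlier widens this layer's range
x = rng.normal(size=(32, 32))         # one batch of 32 samples

scores = layer_sensitivities(weights, x)
keep_fp = select_float_layers(scores, k=1)
print(scores)
print("keep in floating point:", keep_fp)
```

Replacing the MSE score with mAP measured on a validation set would give the mAP method described above, at a much higher evaluation cost per layer.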

At this point our optimized PTQ reaches 42.0% accuracy; improving it further requires quantization-aware training (QAT). QAT reduces the accuracy loss of PTQ by inserting fake-quantization operations into operators such as convolution during training (as shown in Figure 4), so that the network parameters adapt to the information loss caused by quantization and the post-quantization accuracy loss shrinks significantly.

When a traditional classification network is distilled, the logits output by the last layer are usually distilled. In detection networks, however, feature-map distillation is generally used: the feature map output by the student directly fits the feature map output by the teacher network (usually only some regions of interest rather than the entire feature map).
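A fake-quantization operation of this kind can be sketched without any deep learning framework. The class below is a hypothetical illustration, not the operator used in the article's training code: the forward pass quantizes and dequantizes, and the backward pass uses a straight-through estimator, which is the usual (assumed) way to let gradients pass through the rounding step.

```python
import numpy as np

class FakeQuant:
    """Quantize-dequantize forward, straight-through estimator backward."""

    def __init__(self, amax, num_bits=8):
        self.amax = float(amax)
        self.qmax = 2 ** (num_bits - 1) - 1     # 127 for INT8
        self.scale = self.amax / self.qmax
        self.mask = None

    def forward(self, x):
        # Remember which inputs fall inside the representable range.
        self.mask = np.abs(x) <= self.amax
        q = np.clip(np.round(x / self.scale), -self.qmax, self.qmax)
        return q * self.scale

    def backward(self, grad_out):
        # round() has zero gradient almost everywhere, so it is treated as
        # identity; clipped inputs receive no gradient.
        return grad_out * self.mask

fq = FakeQuant(amax=1.0)
x = np.array([-2.0, -0.5, 0.003, 0.5, 2.0])
y = fq.forward(x)                       # values snapped to the INT8 grid
g = fq.backward(np.ones_like(x))        # gradient w.r.t. x
print(y)
print(g)
```

Because the forward pass already contains the rounding and clipping error, the weights trained behind such an operator see the same information loss they will face after INT8 deployment.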
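A minimal sketch of feature-map distillation with an optional region mask is shown below. It assumes a plain squared-error fit between student and teacher feature maps of shape (N, C, H, W); the function name and the mask convention are hypothetical, and the loss used in the article's training code may differ.

```python
import numpy as np

def feature_distill_loss(student, teacher, mask=None):
    """Squared-error fit of student to teacher feature maps, shape (N, C, H, W).

    mask: optional (N, 1, H, W) array of 0/1 values marking the regions of
    interest; positions with mask 0 do not contribute to the loss.
    """
    sq_err = (student - teacher) ** 2
    if mask is None:
        return float(sq_err.mean())
    mask = np.broadcast_to(mask, sq_err.shape)
    return float((sq_err * mask).sum() / max(float(mask.sum()), 1.0))

rng = np.random.default_rng(0)
teacher = rng.normal(size=(2, 8, 16, 16))
student = teacher.copy()
student[:, :, 8:, :] += 0.5             # student deviates in the lower half only

mask = np.zeros((2, 1, 16, 16))
mask[:, :, :8, :] = 1.0                 # region of interest: upper half

full_loss = feature_distill_loss(student, teacher)
masked_loss = feature_distill_loss(student, teacher, mask)
print(full_loss, masked_loss)
```

Here the masked loss ignores the lower half entirely, which is the sense in which only selected regions of the feature map are fitted.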

For quantized deployment, you can use TensorRT's PTQ interface directly to generate a quantized engine, but this approach tends to lose a lot of accuracy. It is therefore generally necessary to run QAT first so that the accuracy of the quantized model meets business requirements, then export an ONNX model with “Quant” and “DeQuant” nodes, and finally build the quantized engine with TensorRT. We found that the two approaches produce different graph structures, which leads to a large difference in the actual runtime efficiency of the deployed model; the model generated by the QAT approach is generally less efficient.

We compared the quantized models on an NVIDIA T4 machine (see Table 5 below). Although the QPS of the QAT INT8 model is ~27% higher than that of FP16, it is still far below that of PTQ INT8. A detailed analysis showed that the “Quant” nodes introduced by QAT break TensorRT's original fusion strategy, so many operators cannot be fused, which hurts the performance of the final quantized engine. In this section we use YOLOv6s_repopt as an example to show a graph optimization method for locating the specific bottlenecks. Graph optimization is a very practical tool in quantization work, and the same recipe can be followed to improve a model's QPS.

First, we profiled the QAT INT8 model and the PTQ INT8 model with the nsys tool [5], as shown in the following table:

From this, we find that 10.8% of the kernels in QAT INT8 perform permutationKernelPLC3 operations, which correspond to quantize_scale_node nodes, as shown in Figure 8 below:

By analyzing the ONNX graph structure, we found why the output of neck.reduce_layer1.conv in the QAT INT8 engine is FP32 and why quantize_scale_node is kept on both branches. The input quantization scale of the neck.upsample1.upsample_transpose branch is 0.083, while that of the neck.Rep_n3.conv1 branch is 0.105. Because the two nodes have different input scales, neck.reduce_layer1.conv cannot output INT8 directly.

According to the above analysis, if the output of a node feeds several branch nodes whose quantization scales differ, the quantize_scale_node nodes cannot be fused away, which adds overhead. How can this be solved? We took a simple approach: force all branch nodes to use the same quantization scale (as a rule of thumb, scales of the same order of magnitude can be merged safely), i.e. directly modify the parameters of the Quantizer nodes in the QAT network.

We listed all the nodes in YOLOv6s_repopt whose scales need to be merged (as shown in Table 7). Since TensorRT's 8-bit quantization range is [-127, 127], we only need to set the activation amax of the branches that share an input to the same value, generally the maximum amax among those branches.
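The scale-merging rule can be sketched in plain Python. The two layer names and their scales (0.083 and 0.105) come from the analysis above; the dictionary layout and the helper `unify_branch_amax` are hypothetical and do not correspond to any quantization toolkit's API. The sketch relies only on the relation amax = scale × 127 for a symmetric INT8 range of [-127, 127].

```python
INT8_QMAX = 127

def unify_branch_amax(amax_by_quantizer, branch_groups):
    """Give every input quantizer in a group the largest amax of that group."""
    unified = dict(amax_by_quantizer)
    for group in branch_groups:
        shared = max(amax_by_quantizer[name] for name in group)
        for name in group:
            unified[name] = shared
    return unified

# Input quantizers of the two branches fed by neck.reduce_layer1.conv.
amax = {
    "neck.upsample1.upsample_transpose": 0.083 * INT8_QMAX,
    "neck.Rep_n3.conv1": 0.105 * INT8_QMAX,
}
groups = [["neck.upsample1.upsample_transpose", "neck.Rep_n3.conv1"]]

unified = unify_branch_amax(amax, groups)
scales = {name: value / INT8_QMAX for name, value in unified.items()}
print(scales)   # both branches now share the larger scale
```

Taking the maximum avoids clipping values that the wider branch could represent before, at the cost of a slightly coarser step for the branch whose scale was smaller.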

We ran end-to-end throughput tests on NVIDIA T4 servers. With “multi-instance” concurrent processing, YOLOv6s_repopt INT8 reaches 552 QPS, a ~40% improvement over FP16. Monitoring the server metrics showed that the T4 GPU was only 95% utilized at this point, leaving some headroom, while utilization of the 16-core CPU had exceeded 1500%, which is almost full capacity. We suspect that the bottleneck of the whole online service is the CPU, since image preprocessing consumes a lot of CPU resources.

In summary, based on YOLOv6 V1.0 and taking YOLOv6s as an example, this article discusses the quantization difficulties of a 2D detection model built on re-parameterized structures and the specific schemes for addressing them. With model accuracy essentially maintained, quantization acceleration improves QPS by about 40%. Preprocessing optimization at deployment brings an additional 214% improvement, greatly increasing throughput for industrial deployments. The following table lists the methods tried in this article and their cumulative effect.

YOLOv6 was recently updated to version 2.0. The update adds medium and large networks and comprehensively upgrades the performance of the lightweight and small networks, further improving overall performance. The quantization results have also improved substantially: the quantized YOLOv6-S model reaches 43.3 mAP and 869 FPS (TensorRT 8.4). For more details, please refer to the official technical report [7].
