First, the background of the incident
This incident is a typical example of a slow deep-pagination query encountered by our team. Through this article, you can follow us step by step as we reconstruct the scene of the incident, along with the adjustments and changes we made each time we ran into a new problem.
Second, the scene of the incident
16:00 Received feedback from colleagues that the availability rate of the fusion system's paging query interface had dropped.
16:05 Checked the UMP monitoring for the interface and found that its TP99 was abnormally high.
Opened the per-machine monitoring and found that TP999 was abnormally high on almost every machine, while the machines' CPU usage was not high.
16:10 Checked the database monitoring and found the database CPU abnormally high; we pinpointed this as a database problem, and a flood of slow-SQL alert emails arrived.
Having gotten this far in the diagnosis, we were fairly sure this was not a problem that could be solved in a few minutes, so we split the response into two steps.
Step 1: Turn on rate limiting to keep more slow-SQL requests from reaching the database.
Step 2: Analyze the slow SQL and release an optimized version online.
Looking at the slow SQL, most of it came from the fusion system's paging query interface. The upstream system's call traffic to this interface surged around 15:35, the database CPU surged, and the interface's TP999 skyrocketed at the same time, so we presumed the pressure that the surge of calls put on the database was driving up the interface latency. However, the call volume of this interface was not actually high, and looking at the slow SQL again, we found a large number of queries paginated hundreds of pages deep. We suspected a deep pagination problem.
16:15 The troubleshooting logs showed that most of the SQL referred to merchant xxxx, which a query revealed to have about 100,000 rows of data (one-tenth of the total). The MQ console showed a large number of retries, and the interface timeout turned out to be configured at 2s. Presumably the frequent retries triggered by the slow queries were dragging down the database's performance.
16:25 After reading the code, we confirmed it was a deep pagination problem and settled on an optimization plan. To avoid invasive changes to the interface, we first rewrote the SQL into a subquery form: first query only the IDs by pageNo and pageSize, then take the minimum and maximum of those IDs, and finally use a range query to fetch the full rows. Since the database was still under sustained pressure online, we asked the upstream to suspend MQ consumption first.
17:40 The optimized code went online and the upstream reopened MQ consumption, but because a large backlog of messages had accumulated, reopening it immediately put pressure on the fusion database again: the interface's TP99 soared once more, and the database CPU spiked back to 100%.
18:00 After review, we decided not to keep patching the old interface but to develop a new one that pages based on a rolling ID. This required pushing the upstream team to join the development and roll out together.
22:20 The new interface went online and MQ consumption was re-enabled. Even with the large backlog of messages upstream, the new interface performed smoothly, and the problem was “solved”.
Third, the cause of the problem and the solution
The problem SQL:
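The original statement is shown only as a screenshot, so this is a minimal sketch, assuming a sku table filtered by the merchant field org_code mentioned later (column names are illustrative):

```sql
SELECT *
FROM sku
WHERE org_code = 'xxxx'
LIMIT 1000, 100;   -- skip the first 1000 matching rows, return the next 100
```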
Taking the above SQL as an example, MySQL's LIMIT works by reading the first 1000 + 100 rows, discarding the first 1000, and returning the next 100. So the larger the offset, the more rows are read and thrown away, and the worse the performance.
1) Query the IDs + query based on the IDs
That is, first use the query condition to fetch only the IDs, and then query the full rows through those IDs. This is the method we used in the first round of optimization.
First, query out the IDs; using the above SQL as an example:
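A sketch of that first step, with pageNo/pageSize already converted to an offset (names illustrative, as above):

```sql
SELECT id
FROM sku
WHERE org_code = 'xxxx'
LIMIT 1000, 100;   -- fetch only the IDs of the requested page
```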
Then, with the IDs in hand, use an IN query on them. Because the IN query hits the primary key directly, it is quite efficient:
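For example (the ID values are illustrative):

```sql
SELECT *
FROM sku
WHERE id IN (1001, 1002, 1003 /* ... the 100 IDs returned by step 1 */);
```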
2) Optimizing the ID-based query
Since the first query has already fetched all the eligible IDs, you can use a range query instead of the IN query, which is more efficient (an IN query has to compare against every element in the collection, while a range query only compares against the minimum and maximum):
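A sketch, assuming the IDs returned by step 1 span 1001 to 1100 (values illustrative):

```sql
SELECT *
FROM sku
WHERE org_code = 'xxxx'
  AND id >= 1001   -- minimum ID from step 1
  AND id <= 1100;  -- maximum ID from step 1
```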
Use a subquery
Folding the two steps into one statement with a subquery reduces the I/O round trips to the database, and is a common way to solve deep paging:
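A sketch with the same illustrative names: the inner query walks only the index to locate the starting ID of the page, so the deep offset never touches full rows:

```sql
SELECT *
FROM sku
WHERE id >= (
    SELECT id
    FROM sku
    WHERE org_code = 'xxxx'
    ORDER BY id
    LIMIT 1000, 1   -- locate the first ID of the page using the index alone
)
  AND org_code = 'xxxx'
ORDER BY id
LIMIT 100;
```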
3) Use scrolling (cursor) queries
Each time, the interface returns the largest ID (the cursor) of the data it queried; the next call passes that cursor back, and the server only has to fetch the n rows whose IDs are greater than the cursor, where n is the page size.
Implemented this way, the server side stays simple and the performance is very good. The disadvantages are that it requires client-side changes, the IDs must be auto-incrementing and ordered, and the results must be sorted by ID.
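A sketch of the rolling query (the cursor value is illustrative; 0 is passed for the first page):

```sql
SELECT *
FROM sku
WHERE org_code = 'xxxx'
  AND id > 1100   -- the cursor: the largest ID returned by the previous page
ORDER BY id
LIMIT 100;        -- n, the page size
```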
The final decision was to use a rolling query.
After the final optimized SQL went online, performance was stable. In the second week, as inventory was brought in, we optimized the SQL again for non-multi-spec SKUs, as follows:
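The original statement is again only a screenshot, so this is a hedged reconstruction; the multi-spec flag column is an assumption:

```sql
SELECT *
FROM sku
WHERE org_code = 'xxxx'
  AND is_multi_spec = 0   -- hypothetical flag marking non-multi-spec SKUs
ORDER BY id
LIMIT 100;
```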
After testing turned up no problems, it went online, and the online monitoring stayed stable.
Just as we thought we could relax, a week later the database once again produced a large number of slow queries and database CPU alarms. Looking at the interface monitoring:
It showed that even though the call volume was not large, the interface latency reached 60s. We contacted the O&M colleagues to help troubleshoot, and they found a large number of slow SQL:
As you can see, this was exactly our optimized SQL. After the O&M colleagues ran EXPLAIN on it, we found that the SQL was using the primary key index, not the org_code index we expected it to take.
After a preliminary discussion with O&M, the conclusion was that in some cases the optimizer gives the primary key index a higher priority than an ordinary secondary index.
Fourth, the final solution
The query sorted by the primary key while selecting fields that live only in the leaf nodes of the clustered index, not in the secondary index tree. MySQL therefore judged the primary key index to be the better choice, since it can return rows already in order with no extra sort and no lookup back to the table; but taking the primary key index this way ultimately degenerated into a full table scan.
In the end, we query the IDs first (keeping the query covered by the index, touching no fields outside the index tree), and then fetch the corresponding rows through a join:
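A sketch of this final form (same illustrative names as above): the inner query selects only the ID, so it can be satisfied entirely from the org_code index, and the join then fetches full rows by primary key:

```sql
SELECT s.*
FROM sku s
INNER JOIN (
    SELECT id
    FROM sku
    WHERE org_code = 'xxxx'
    ORDER BY id
    LIMIT 100          -- only indexed columns are touched here
) t ON s.id = t.id;
```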
Running EXPLAIN again, we can see that the query now takes the index we built:
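For example (idx_org_code is an assumed index name; the original EXPLAIN output was a screenshot):

```sql
EXPLAIN
SELECT s.*
FROM sku s
INNER JOIN (
    SELECT id FROM sku WHERE org_code = 'xxxx' ORDER BY id LIMIT 100
) t ON s.id = t.id;
-- Expected: the derived table uses key = idx_org_code,
-- and the outer lookup is an eq_ref on PRIMARY.
```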
With this online, the problem was solved.
After the release stabilized, we analyzed the earlier problem SQL by executing the following two statements: the same SQL, run for different merchants, produces different execution plans in MySQL.
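A hedged reproduction of the comparison (merchant codes are illustrative):

```sql
-- A merchant with few SKUs: the optimizer keeps the org_code index.
EXPLAIN SELECT * FROM sku WHERE org_code = 'small_merchant' ORDER BY id LIMIT 100;

-- A merchant with many SKUs: the optimizer switches to the primary key index.
EXPLAIN SELECT * FROM sku WHERE org_code = 'big_merchant' ORDER BY id LIMIT 100;
```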
Consulting the documentation, we learned the following:
MySQL compares the LIMIT row count against the estimated number of rows matching the WHERE condition. If the LIMIT count is relatively small by comparison (for example, when a merchant has a large number of SKUs), the optimizer “optimizes” the query onto the primary key index, reasoning that it can skip one traversal of the secondary index tree and still find enough matching rows quickly. (Without a LIMIT, it will not switch to the primary key index.)
Therefore, for SQL of the form WHERE indexed-column ... ORDER BY primary-key LIMIT N, you must account for the case where MySQL “optimizes” it onto the primary key index.
Besides the optimized SQL that finally went online above, you can also force a specific index with FORCE INDEX:
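For example (idx_org_code is an assumed index name):

```sql
SELECT *
FROM sku FORCE INDEX (idx_org_code)
WHERE org_code = 'xxxx'
ORDER BY id
LIMIT 100;
```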
However, this hard-codes the index name into the SQL: if the index is renamed later, the statement will break, so it carries a hidden risk.
Fifth, a summary of the problem
1) B-side systems also need to protect themselves with access control, rate limiting, and the like, to prevent abnormal traffic or abnormal calls from bringing the system down. Fortunately, this time the upstream system called the fusion API through MQ, so consumption could be paused; had it been a direct API call with heavy traffic, the database would have stayed under high pressure and the overall stability of the fusion system would have suffered.
2) Never tolerate a known risk point. I had seen this paging SKU query interface before, but at the time it performed well with small data volumes and already had an index on the merchant dimension, so I let it go, planning to optimize it later. The result is the problem we have now.
3) Be cautious when optimizing SQL before going online: the same SQL needs to be repeatedly tested and tuned against different boundary conditions (such as the merchants with many SKUs in this case).