I. Introduction

When we implement paging requirements in daily development, we generally use LIMIT, but when the offset is particularly large, query efficiency drops sharply. This article covers four ways to optimize MySQL deep paging over millions of rows, and closes with a recent real-world case of optimizing a slow production SQL.

II. Why does LIMIT deep paging slow down?

First, look at the following table structure:
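The original table definition is not reproduced in this copy of the article. Based on the columns and index referenced later (an `id` primary key, an `update_time` column, and the secondary index `idx_update_time`), a plausible reconstruction might be:

```sql
-- Hypothetical reconstruction: the real DDL is not shown in this copy.
-- The id primary key and the idx_update_time secondary index are the
-- parts that matter for the discussion that follows.
CREATE TABLE account (
  id int(11) NOT NULL AUTO_INCREMENT COMMENT 'primary key',
  name varchar(255) DEFAULT NULL COMMENT 'account name',
  balance int(11) DEFAULT NULL COMMENT 'balance',
  create_time datetime NOT NULL COMMENT 'creation time',
  update_time datetime NOT NULL COMMENT 'update time',
  PRIMARY KEY (id),
  KEY idx_update_time (update_time)
) ENGINE = InnoDB;
```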

Suppose the deep-paging SQL is as follows:
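The concrete statement is also missing here. A query of the shape the rest of the article analyzes (the `update_time` filter and the `limit 100000,10` offset are both referenced below) would look like:

```sql
-- Assumed shape of the slow deep-paging query; column names are illustrative.
SELECT id, name, balance
FROM account
WHERE update_time > '2020-09-19'
LIMIT 100000, 10;
```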

The execution time of this SQL is as follows:

It takes 0.742 seconds to execute. Why does deep paging become slow? If you change it to LIMIT 0,10, it takes only 0.006 seconds.

Let’s first look at the execution process of this SQL:

Through the ordinary secondary index tree idx_update_time, filter on the update_time condition and find the record IDs that satisfy it.

With each ID, go back to the primary key index tree, find the matching row, and fetch the selected columns (the back-to-table lookup).

Repeat this for all 100,010 qualifying rows, then throw away the first 100,000 rows and return the last 10.

The execution process of the SQL

The execution plan is as follows:

The SQL is slow for two reasons:

The LIMIT clause first scans offset+n rows, then discards the first offset rows and returns the last n rows. That is, LIMIT 100000,10 scans 100,010 rows, while LIMIT 0,10 scans only 10 rows.

LIMIT 100000,10 scans more rows, which also means more back-to-table lookups.

III. The optimization plans

1. Optimize with a subquery

The SQL above goes back to the table 100,010 times, but we only need 10 rows, which means only 10 back-to-table lookups are actually necessary. We can therefore optimize by reducing the number of back-to-table lookups.

1) Review the B+ tree structure

So, how do we reduce the number of back-to-table lookups? Let's first review the B+ tree index structure.

In InnoDB, indexes are divided into primary key indexes (clustered indexes) and secondary indexes:

For the primary key index, the leaf nodes hold the entire row of data.

For a secondary index, the leaf nodes hold the value of the primary key.

2) Transfer the condition to the primary key index tree

If we move the query condition over to the primary key index tree, we can reduce the number of back-to-table lookups. But to query on the primary key index tree, the condition must be on the primary key id, so what do we do with the update_time condition from the original SQL? Pull it into a subquery.

How is it pulled into a subquery? Since the secondary index leaf nodes contain the primary key ID, we can look up primary key IDs directly by update_time, and we also move the LIMIT 100000 offset into the subquery. The complete SQL is as follows:
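The optimized statement is missing from this copy. Following the description (primary key IDs fetched by a subquery on update_time, with the offset moved into it), a sketch would be:

```sql
-- Sketch of the subquery optimization; names are assumed.
-- The subquery walks idx_update_time only (a covering index, so no
-- back-to-table lookups), skips the first 100,000 ids, and the outer
-- query then does just 10 primary-key lookups.
SELECT id, name, balance
FROM account
WHERE id >= (
    SELECT a.id
    FROM account a
    WHERE a.update_time > '2020-09-19'
    ORDER BY a.update_time
    LIMIT 100000, 1
)
LIMIT 10;
```

Note that the `id >=` form assumes rows ordered by update_time are also ordered by id (e.g. update_time is set at insert time and never changes); otherwise the deferred join in the next section is the safer variant.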

The query result is the same, and the execution time is only 0.038 seconds!

Let's look at the execution plan:

From the execution plan, we can see that the subquery (table a) uses the index idx_update_time. It first gets the primary key IDs from the secondary index, avoiding back-to-table lookups there, and the outer query then needs only 10 primary key lookups to fetch the final rows.

Therefore, this scheme works.

2. Deferred join with INNER JOIN

The idea behind the deferred join is the same as the subquery's: move the condition to the primary key index tree and reduce the back-to-table lookups. The difference is that the deferred join uses an INNER JOIN instead of a subquery.

The optimized SQL is as follows:
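The statement itself is missing from this copy; based on the description, a sketch of the deferred join would be:

```sql
-- Sketch of the deferred join; names are assumed.
-- The derived table acct2 pages through idx_update_time without
-- touching full rows; the join then fetches only the final 10 rows
-- by primary key.
SELECT acct1.id, acct1.name, acct1.balance
FROM account acct1
INNER JOIN (
    SELECT a.id
    FROM account a
    WHERE a.update_time > '2020-09-19'
    ORDER BY a.update_time
    LIMIT 100000, 10
) AS acct2 ON acct1.id = acct2.id;
```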

This query is also fast, taking only 0.034 seconds.

The execution plan is as follows:

The idea is to first find the qualifying primary key IDs through the idx_update_time secondary index tree, then join back to the original table on those IDs, so that the final lookups go straight through the primary key index and back-to-table work is reduced.

3. Label record method

The essential cause of the LIMIT deep paging problem is that the larger the offset, the more rows MySQL scans and then discards. This is what degrades query performance.

In fact, we can use the label record method: record which row was read last time, and on the next query, scan down from that row. It's like reading a book: you fold the page or leave a bookmark where you stopped, and next time you can turn straight to it.

Assuming the last batch ended at id 100000, the SQL can be modified to:
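The modified statement is not shown in this copy; a sketch of the label record query, reusing the assumed column names from earlier, would be:

```sql
-- Label record method: remember the last id returned (here 100000)
-- and seek past it instead of using an offset.
SELECT id, name, balance
FROM account
WHERE id > 100000
ORDER BY id
LIMIT 10;
```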

In this case, no matter how many pages you turn, performance stays good because the id index is hit. But this approach has a limitation: it requires a continuously increasing field, such as an auto-increment id.

4. Use between… and…

Many times, a LIMIT query can be converted into a query at a known position, so that MySQL can scan a range with between… and… to get the corresponding results.

If you know the boundary values are 100000 and 100010, you can optimize like this:
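The statement is missing from this copy; a sketch of the range form, with assumed column names, would be:

```sql
-- Range scan between known boundary ids; no offset rows are scanned
-- and discarded.
SELECT id, name, balance
FROM account
WHERE id BETWEEN 100000 AND 100010
ORDER BY id DESC;
```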

IV. A hands-on practical case

Let's look at a practical case. Suppose there is a table with the following structure, containing 2 million rows of data.
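The table definition for this case is also missing from this copy. From the requirement described next (type-A accounts filtered by a 2021 time range), it presumably contains at least an id primary key, a type column, and a create_time column; a hypothetical sketch:

```sql
-- Hypothetical shape of the case's account table; only the columns
-- the requirement mentions are sketched here.
CREATE TABLE account (
  id int(11) NOT NULL AUTO_INCREMENT,
  type varchar(8) NOT NULL COMMENT 'account type, e.g. A',
  create_time datetime NOT NULL,
  PRIMARY KEY (id),
  KEY idx_create_time (create_time)
) ENGINE = InnoDB;
```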

The business requirement is as follows: fetch the 2021 type-A account data and report it to the big data platform.

1. The usual implementation

Many developers, on receiving such a requirement, will implement it directly like this:
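The code is missing from this copy; the naive version presumably pages with a growing offset, along these lines (`:offset` and `:pageSize` are bind parameters supplied by the application loop, which stops when a batch comes back empty):

```sql
-- Naive approach (sketch): page with LIMIT offset, pageSize and
-- increase :offset by :pageSize each round. The offset grows without
-- bound, so later pages scan and discard more and more rows.
SELECT id, type, create_time
FROM account
WHERE type = 'A'
  AND create_time >= '2021-01-01'
  AND create_time < '2022-01-01'
LIMIT :offset, :pageSize;
```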

2. The optimized scheme

The implementation above suffers from the LIMIT deep paging problem, because the account table holds millions of rows. So how do we optimize it?

In fact, we can use the label record method. Some readers may wonder: the id primary key is not contiguous, so can the label record method really be used?

Of course. id does not need to be contiguous; it only needs to be monotonically ordered, which ORDER BY id guarantees. The optimization scheme is as follows:
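The optimized code is missing from this copy; a sketch of the label record version, with assumed names, would be:

```sql
-- Label record version (sketch): :lastId starts at 0 and is replaced
-- by the largest id of the previous batch on each round; each query
-- seeks directly to that position instead of scanning an offset.
SELECT id, type, create_time
FROM account
WHERE type = 'A'
  AND create_time >= '2021-01-01'
  AND create_time < '2022-01-01'
  AND id > :lastId
ORDER BY id
LIMIT :pageSize;
```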