Source | https://segmentfault.com/a/1190000018075241
Preface
This article documents a performance optimization effort: the problems encountered along the way and how they were solved. The aim is to offer one line of thinking. The first thing to state is that my approach is not the only one; any problem you hit on the road to performance optimization has more than one solution.
How to optimize
First of all, be clear that talking about optimization without concrete requirements is meaningless. Anyone who tells you that millions of concurrent connections were achieved on such-and-such a machine, with no further context, is mostly posturing; a raw concurrency number by itself tells you nothing. Second, set a target before you optimize: how far does performance need to go? Optimization without a clear goal is uncontrolled. Finally, figure out exactly where the performance bottlenecks are, rather than tinkering blindly.
Requirements description
This project is a module I was responsible for at my previous company. It was originally integrated into the main codebase, but as concurrency grew it was split out into a separate service so that problems in it would not drag down the main service, and I was responsible for the split. The requirements for the split-out module were: stress-test QPS of at least 30,000, database load no higher than 50%, server load no higher than 70%, single-request latency no more than 70 ms, and an error rate no higher than 5%.
The environment was configured as follows. Server: 4 cores, 8 GB of memory, CentOS 7, SSD disk. Database: MySQL 5.7, maximum of 800 connections. Cache: Redis, 1 GB capacity. All of the above were services purchased from Tencent Cloud.
Stress-testing tool: Locust, run in distributed mode on Tencent Cloud auto-scaling instances.
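For context, a minimal Locust test for this kind of endpoint might look like the sketch below. The /popup path and the uid parameter are illustrative assumptions, not the project's actual API.

import random
from locust import HttpUser, task, between

class PopupUser(HttpUser):
    # brief pause between requests so each simulated user behaves like a browser
    wait_time = between(0.1, 0.5)

    @task
    def visit_homepage(self):
        # hypothetical endpoint: ask whether a pop-up config exists for this user
        uid = random.randint(1, 1_000_000)
        self.client.get("/popup", params={"uid": uid}, name="/popup")

In recent Locust versions a distributed run starts one process with --master and the auto-scaled machines with --worker, which is roughly how a Tencent Cloud auto-scaling group would be wired up.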
The requirements are as follows: when a user visits the homepage, the service queries whether there is a pop-up configuration suitable for that user. If there is none, nothing is returned and the service waits for the next request; if there is a suitable configuration, it is returned to the front end. If the user clicks the pop-up, the click is recorded and the configuration is not returned again within the configured time window. If the user does not click, the same configuration is returned again after 24 hours. If the user clicks but there is no follow-up configuration, the service waits for the next one.
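To make that flow easier to follow, here is a rough sketch of the decision logic. All the names and the in-memory storage are illustrative stand-ins; the real project keeps this state in MySQL and Redis.

import time

POPUP_COOLDOWN = 24 * 3600  # seconds before an unclicked config may be shown again

# In-memory stand-ins for the real storage (illustrative only).
configs = {"user-123": {"id": "cfg-1", "title": "spring sale"}}
popup_state = {}  # (user_id, config_id) -> {"shown_at": ts, "clicked": bool}

def handle_homepage_request(user_id, now=None):
    """Return a pop-up config for this user, or None if nothing should be shown."""
    now = now or time.time()
    config = configs.get(user_id)
    if config is None:
        return None                          # no suitable config; wait for the next request
    state = popup_state.get((user_id, config["id"]))
    if state is None:
        popup_state[(user_id, config["id"])] = {"shown_at": now, "clicked": False}
        return config                        # first time: show the pop-up
    if state["clicked"]:
        return None                          # already clicked: wait for a newer config
    if now - state["shown_at"] >= POPUP_COOLDOWN:
        state["shown_at"] = now
        return config                        # not clicked: show again after 24 hours
    return None

def record_click(user_id, config_id):
    """Record that the user clicked the pop-up, so it is not returned again."""
    state = popup_state.setdefault((user_id, config_id), {"shown_at": time.time()})
    state["clicked"] = True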
Analyzing the key points
From the requirements we can identify several key points: 1. find the pop-up configuration suitable for the user; 2. record when the configuration may next be returned to the user, and store that in the database; 3. record what action the user took on the returned configuration, and store that in the database.
All three of these points involve database operations, and not just reads but writes as well. Without a cache, every request would hit the database directly, which would exhaust the total number of connections, produce access-denied errors, and, because the SQL would execute too slowly, prevent requests from returning in time. So the first things to do are to move the write operations out of the request path, improve the response time of each request, and optimize database connection usage. The architecture of the whole system looks like this:
The write operations are pushed into a first-in, first-out message queue; to reduce complexity, a Redis list is used as that queue.
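A minimal sketch of that queue, assuming the redis-py client and an illustrative key name write_queue (neither appears in the article):

import json
import redis

r = redis.Redis(host="127.0.0.1", port=6379)
QUEUE_KEY = "write_queue"   # illustrative name for the Redis list used as the queue

def enqueue_write(op):
    """Web process: push a pending database write onto the tail of the list."""
    r.rpush(QUEUE_KEY, json.dumps(op))

def drain_writes(handle):
    """Worker process: pop writes from the head of the list (FIFO) and apply them."""
    while True:
        _, raw = r.blpop(QUEUE_KEY)          # blocks until an item is available
        handle(json.loads(raw))              # e.g. execute the INSERT/UPDATE against MySQL

The web process only pushes onto Redis, so each request returns quickly; a separate worker pops items in order and applies them to MySQL at a pace the database can sustain.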
The stress test was then run, with the following results: at around 6,000 QPS the 502 error rate shot up to 30%, the server CPU bounced between 60% and 70%, the database connections were fully occupied, and there were roughly 6,000 TCP connections. Clearly the problem was still in the database. After going through the SQL, the cause turned out to be the query that looks up the user's pop-up configuration: it read the database on every request and used up all the connections. Since we only had 800 connections, too many concurrent requests inevitably made the database the bottleneck. With the problem located, we continued to optimize; the updated architecture is as follows:
We load all of the configurations into the cache; the database is only read when the configuration is not found in the cache.
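One way to express "read the database only on a cache miss" is the cache-aside sketch below; redis-py, the key format, and load_config_from_db are all illustrative assumptions.

import json
import redis

r = redis.Redis(host="127.0.0.1", port=6379)
CONFIG_TTL = 300  # seconds; illustrative expiry so stale configs eventually refresh

def load_config_from_db(user_id):
    """Placeholder for the real MySQL query; only reached on a cache miss."""
    return {"user_id": user_id, "popup": "example"}

def get_config(user_id):
    key = f"popup:config:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)            # cache hit: no database access
    config = load_config_from_db(user_id)    # cache miss: read the database once
    r.set(key, json.dumps(config), ex=CONFIG_TTL)
    return config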
We ran the stress test again, with the following results: QPS topped out at around 20,000 and would not go higher, the server CPU bounced between 60% and 80%, database connections held at about 300, and TCP connections were about 15,000 per second.
This problem bothered me for a long time. We were reaching 20,000 QPS, yet the TCP connection count never reached 20,000. My guess was that the number of TCP connections was causing the bottleneck, but for the moment I could not work out why.
My next guess was that if TCP connections could not be established, perhaps the server was limiting the number of socket connections. To check, I ran ulimit -n in the terminal; it reported 65535. So the socket limit did not look like what was constraining us, but to verify the guess I raised it to 100001 anyway.
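As a side note, the same limit can be checked, and the soft limit raised, from inside the service process using Python's resource module; this is just a verification aid, not something the article itself does.

import resource

# RLIMIT_NOFILE is the per-process cap on open file descriptors (sockets included).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft} hard={hard}")

# The soft limit can be raised up to the hard limit without extra privileges.
resource.setrlimit(resource.RLIMIT_NOFILE, (min(100001, hard), hard))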
The stress test was run again, with the following results: QPS topped out at around 22,000, the server CPU bounced between 60% and 80%, database connections stayed around 300, and TCP connections were about 17,000 per second.
That was a small improvement but no substantial change. Over the next few days I could not find a way forward, and those days were genuinely uncomfortable. When I went back over the problem again, I noticed that although the socket limit was high enough, not all of the sockets were actually in use. My guess was that after each request the TCP connection was not released immediately, so sockets could not be reused. After digging into the documentation, the cause became clear: after a connection is closed with the four-way handshake, the TCP connection is not released immediately; it stays in the TIME_WAIT state for a period of time, so that the final ACK can be retransmitted if it was lost and stray packets from the old connection die out before the port is reused.
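One quick way to confirm the diagnosis on the server is to count the sockets currently in TIME_WAIT. The article does not show this step; the sketch below simply scans /proc/net/tcp.

def count_time_wait():
    """Count sockets in TIME_WAIT by scanning /proc/net/tcp and /proc/net/tcp6."""
    count = 0
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        try:
            with open(path) as f:
                next(f)                           # skip the header line
                for line in f:
                    if line.split()[3] == "06":   # the 'st' column; 06 == TIME_WAIT
                        count += 1
        except FileNotFoundError:
            pass
    return count

print("sockets in TIME_WAIT:", count_time_wait())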
With the problem found, we continued to optimize. The first idea was to shorten the wait time after a TCP connection ends, but Linux does not expose that as a tunable kernel parameter; changing it would mean recompiling the kernel yourself. Fortunately there is another parameter, net.ipv4.tcp_max_tw_buckets, which caps the number of sockets in TIME_WAIT and defaults to 180000. We lowered it to 6000, then enabled fast recycling of TIME_WAIT sockets and enabled reuse. The complete parameter changes are as follows:
# Maximum number of sockets in TIME_WAIT; the default is 180000.
net.ipv4.tcp_max_tw_buckets = 6000
net.ipv4.ip_local_port_range = 1024 65000
# Enable fast recycling of TIME_WAIT sockets.
net.ipv4.tcp_tw_recycle = 1
# Enable reuse: allow TIME_WAIT sockets to be reused for new TCP connections.
net.ipv4.tcp_tw_reuse = 1
We ran the stress test one more time, and the results: 50,000 QPS, server CPU at 70%, database connections normal, TCP connections normal, average response time of 60 ms, and a 0% error rate.
This concludes the development, tuning, and stress testing of the whole service. Looking back on this round of tuning, I gained a great deal of experience. Most importantly, I came to appreciate that web development is not an isolated discipline; it is a combination of networking, databases, programming languages, operating systems, and other engineering practice, and it demands that web developers have a solid grasp of the fundamentals, otherwise when a problem appears they won't know how to analyze it or track it down.
end