Part of the "Flink from beginner to proficient" series of articles.
The chart below shows the CPU usage of the online machines. Starting on April 8, CPU usage climbed steadily over time until it finally reached 100%, making the online service unavailable; it only recovered after the machines were restarted.
1. Troubleshooting ideas
A quick analysis of the possible causes, split into 5 directions:
- A bug in the system's own code
- A fault in an internal downstream system, amplified by an avalanche effect
- A sudden burst in call volume from upstream systems
- Problems with third-party HTTP requests
- A problem with the machine itself
2. Start troubleshooting
Check the logs: no concentrated error logs were found, so an error in the code logic was preliminarily ruled out.
We then contacted the internal downstream system and looked at their monitoring: everything was normal, so the impact of a downstream fault on us could be ruled out.
Check the call volume of the provider interface: there was no sudden increase over the last 7 days, ruling out a burst in call volume from the business side.
Check the TCP monitoring: TCP status was normal, ruling out problems caused by third-party HTTP request timeouts.
Check the machine monitoring: the CPU was rising on all 6 machines, and the pattern was identical on every machine, ruling out a failure of an individual machine. In other words, none of the above directly located the problem.
3. Locating the problem
1. Restart the 5 most severely affected of the 6 machines to restore service first, and keep one machine alive as a specimen for analysing the problem.
2. Find the PID of the current Tomcat process.
3. Check the CPU usage of the threads under that PID:
top -Hp 384
4. We found that the threads with pids 4430, 4431, 4432 and 4433 each occupied roughly 40% CPU.
5. Convert these pids to hexadecimal; for example, 4430 becomes 114e (jstack reports thread ids in hex).
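As a quick sanity check of that decimal-to-hex conversion (jstack prints native thread ids as hexadecimal nid values), a throwaway snippet like the following, with a made-up class name, confirms the mapping:

public class HexCheck {
    public static void main(String[] args) {
        // jstack shows native thread ids in hex, e.g. nid=0x114e
        System.out.println(Integer.toHexString(4430)); // prints 114e
        System.out.println(Integer.toHexString(4433)); // prints 1151
    }
}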
6. Dump the current Java thread stacks:
sudo -u tomcat jstack -l 384>/1.txt
7. Look up the thread ids from step 5 in the stack dump (they appear as hex nid values, e.g. nid=0x114e); they all turn out to be GC threads, so the CPU is being consumed by garbage collection.
8. Dump the Java heap data:
sudo -u tomcat jmap -dump:live,format=b,file=/dump201612271310.dat 384
9. Load the heap file with MAT: javax.crypto.JceSecurity objects occupy 95% of the memory, so the problem is initially located.
MAT download address:
10. Looking at the reference tree of this class, we can see that it is holding a very large number of BouncyCastleProvider objects. In other words, the way our code handles this object is wrong; the problem is located.
4. Code analysis
A piece of our code was written like this (shown as an image in the original post; a sketch follows below). It is the encryption/decryption function: every time encryption or decryption runs, it creates a new BouncyCastleProvider object and passes it into Cipher.getInstance().
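Since the snippet is only shown as an image in the original post, here is a minimal sketch of the pattern, assuming the Bouncy Castle dependency is on the classpath; the class, method and transformation names are illustrative, not the actual production code. The key detail is that a fresh BouncyCastleProvider is constructed on every call and handed to Cipher.getInstance():

import java.security.Key;
import javax.crypto.Cipher;
import org.bouncycastle.jce.provider.BouncyCastleProvider;

public class CryptoUtil {
    // Illustrative sketch: a brand-new provider instance is created on EVERY call,
    // which is the root cause of the leak analysed below.
    public static byte[] encrypt(byte[] plain, Key key) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding", new BouncyCastleProvider());
        cipher.init(Cipher.ENCRYPT_MODE, key);
        return cipher.doFinal(plain);
    }
}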
Now look at the implementation of Cipher.getInstance(). Tracing through the JDK's underlying code leads to the JceSecurity class.
verifyingProviders removes each entry again after putting it, but verificationResults only ever puts entries and never removes them.
verificationResults is a static map belonging to the JceSecurity class. So every run of encryption/decryption puts another object into this map; because the map is held at class level, it is never reclaimed by GC. The result is a steadily growing pile of objects that can never be collected.
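To see why this leaks, here is a heavily simplified, from-memory sketch of the JDK 8-era JceSecurity caching pattern (approximate, not the real JDK source). The cache is keyed by the Provider instance, so every new BouncyCastleProvider gets its own entry, and nothing ever removes it:

import java.security.Provider;
import java.util.IdentityHashMap;
import java.util.Map;

final class JceSecuritySketch {
    // Class-level (static) cache: lives for the whole JVM lifetime and is never cleared.
    // Keyed by Provider *instance*, so each `new BouncyCastleProvider()` adds a new entry.
    private static final Map<Provider, Object> verificationResults = new IdentityHashMap<>();

    static synchronized Object getVerificationResult(Provider p) {
        Object cached = verificationResults.get(p);
        if (cached != null) {
            return cached;                    // hit only if the SAME provider instance is reused
        }
        Object result = verifyProvider(p);    // stands in for the expensive provider verification
        verificationResults.put(p, result);   // put with no matching remove -> grows forever
        return result;
    }

    private static Object verifyProvider(Provider p) {
        return Boolean.TRUE;                  // placeholder for the real JAR signature check
    }
}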
5. Code improvement
Make the problematic object a static field, so the class holds a single instance and does not create a new one on every call (see the sketch below).
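A minimal sketch of the improvement, using the same illustrative names as the snippet above: hold one provider instance in a static field (or register it once via Security.addProvider) so that Cipher.getInstance always receives the same Provider object and the JceSecurity cache stays at a single entry.

import java.security.Key;
import javax.crypto.Cipher;
import org.bouncycastle.jce.provider.BouncyCastleProvider;

public class CryptoUtil {
    // One shared provider instance for the whole class/JVM.
    private static final BouncyCastleProvider BC_PROVIDER = new BouncyCastleProvider();

    public static byte[] encrypt(byte[] plain, Key key) throws Exception {
        // The same Provider instance is passed every time, so JceSecurity
        // caches exactly one verification result instead of one per call.
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding", BC_PROVIDER);
        cipher.init(Cipher.ENCRYPT_MODE, key);
        return cipher.doFinal(plain);
    }
}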
6. Summary
When you run into an online problem, don't panic. First settle on a troubleshooting approach:
- Check the logs
- Check the CPU situation
- Check the TCP situation
- Use top -Hp to find the hot threads and jstack to view the Java thread stacks
- Use jmap to dump the Java heap
- Analyse the heap file with MAT and look for the source of objects that cannot be reclaimed
Reply with keywords such as Face, ClickHouse, ES, Flink, Spring, Java, Kafka or Monitoring in the public account (zhisheng) to see more articles on those topics.