

The chart below shows the CPU usage of the online machines. Starting on April 8, CPU usage grew steadily over time until it finally reached 100%, making the online service unavailable; the service only recovered after the machines were restarted.

1. Troubleshooting ideas

A quick analysis of the possible causes, divided into 5 directions:

  • Code problems in the system itself
  • Problems in internal downstream systems caused by an avalanche effect
  • A sudden burst in call volume from upstream systems
  • Problems with third-party HTTP requests
  • Problems with the machine itself

2. Start troubleshooting

  1. Check the logs: no concentrated error logs are found, so errors in the code logic are preliminarily ruled out.
  2. Contact the internal downstream systems and watch their monitoring: everything is normal, so the impact of downstream system faults on us can be ruled out.
  3. Check the call volume of the provider interface: there is no sudden increase over the last 7 days, so a surge in call volume from the business side is ruled out.
  4. Check the TCP monitoring: the TCP status is normal, so problems caused by third-party HTTP request timeouts can be ruled out.
  5. Check the machine monitoring: the CPUs of all 6 machines are rising, and the situation is the same on every machine, so a single machine failure is ruled out. In other words, none of the above steps directly located the problem.

3. Solution

1. Restart the 5 machines with the most serious problems among the 6 and restore the business first. Keep one machine as it is to preserve the scene for analysis.

2. Find the PID of the current Tomcat process (here 384).

3. Check the CPU usage of the threads under that PID: top -Hp 384

4. We find that threads 4430, 4431, 4432 and 4433 each occupy about 40% of the CPU.

5. Convert these thread IDs to hexadecimal: 114e, 114f, 1150 and 1151 respectively.

6. Dump the current Java thread stacks: sudo -u tomcat jstack -l 384 > /1.txt

7. Search the stack dump for the hexadecimal thread IDs from step 5 and find that they are all GC threads.

8. Dump the Java heap data:

sudo -u tomcat jmap -dump:live,format=b,file=/dump201612271310.dat 384

9. Load the heap file with MAT: the javax.crypto.JceSecurity object occupies 95% of the memory space, so the problem is initially located.

MAT download address:

http://www.eclipse.org/mat/

10. Looking at the reference tree of this class, we can see that it holds far too many BouncyCastleProvider objects. In other words, the way our code handles this object is wrong; the problem is now located.

4. Code analysis

One piece of our code was written like this:

It is the encryption/decryption function: every time encryption or decryption runs, it creates a new BouncyCastleProvider object and passes it into Cipher.getInstance().
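The original code appears as a screenshot in the source article; a minimal sketch of the problematic pattern might look like the following (the class name, algorithm string, and key handling are illustrative assumptions, not the original code):

```java
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

import org.bouncycastle.jce.provider.BouncyCastleProvider;

public class CryptoUtil {

    // PROBLEM: a brand-new provider instance is created on every call and
    // handed to Cipher.getInstance(), which triggers provider verification
    // inside the JDK's JceSecurity class.
    public static byte[] encrypt(byte[] key, byte[] plaintext) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding", new BouncyCastleProvider());
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"));
        return cipher.doFinal(plaintext);
    }

    public static byte[] decrypt(byte[] key, byte[] ciphertext) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding", new BouncyCastleProvider());
        cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"));
        return cipher.doFinal(ciphertext);
    }
}
```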

Consider the implementation of Cipher.getInstance(), which is low-level JDK code; tracing through it leads to the JceSecurity class.

In JceSecurity, verifyingProviders removes its entry after each put, but verificationResults only ever puts entries and never removes them.

Note that verificationResults is a static map belonging to the JceSecurity class. So every time encryption or decryption runs, another object is put into this map; since the map lives at the class level, its contents are never garbage-collected, and the large number of newly created objects is never reclaimed.
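For reference, the relevant JDK logic looks roughly like this. It is a simplified paraphrase of the JDK 8-era javax.crypto.JceSecurity source, not an exact copy, and newer JDK versions differ:

```java
// Simplified paraphrase of javax.crypto.JceSecurity (JDK 8 era).
// Both caches are static and keyed by the Provider *instance*.
private static final Map<Provider, Object> verificationResults = new IdentityHashMap<>();
private static final Map<Provider, Object> verifyingProviders = new IdentityHashMap<>();

static synchronized Exception getVerificationResult(Provider p) {
    Object o = verificationResults.get(p);
    if (o == PROVIDER_VERIFIED) {
        return null;                       // this exact instance already verified
    } else if (o != null) {
        return (Exception) o;              // verification failed earlier
    }
    try {
        verifyingProviders.put(p, Boolean.FALSE);
        // ... verify the signature of the provider's jar ...
        verificationResults.put(p, PROVIDER_VERIFIED); // put, never removed
        return null;
    } catch (Exception e) {
        verificationResults.put(p, e);                 // put, never removed
        return e;
    } finally {
        verifyingProviders.remove(p);                  // removed on every call
    }
}
```

Because verificationResults is an identity-keyed static map, every new BouncyCastleProvider instance becomes a distinct key that stays reachable forever, which is exactly the growth MAT showed.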

5. Code improvement

Make the problematic object a static field, so the class holds a single instance and does not create new ones over and over.
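A minimal sketch of the fix, using the same hypothetical class from the earlier sketch:

```java
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

import org.bouncycastle.jce.provider.BouncyCastleProvider;

public class CryptoUtil {

    // FIX: create the provider once; JceSecurity's static cache then only
    // ever holds this single instance instead of one entry per call.
    private static final BouncyCastleProvider PROVIDER = new BouncyCastleProvider();

    public static byte[] encrypt(byte[] key, byte[] plaintext) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding", PROVIDER);
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"));
        return cipher.doFinal(plaintext);
    }

    public static byte[] decrypt(byte[] key, byte[] ciphertext) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding", PROVIDER);
        cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"));
        return cipher.doFinal(ciphertext);
    }
}
```

An alternative with the same effect is to register the provider once at startup with Security.addProvider(new BouncyCastleProvider()) and then request the cipher by provider name, e.g. Cipher.getInstance("AES/ECB/PKCS5Padding", "BC").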

6. Summary

When you run into an online problem, don't panic. First settle on a troubleshooting approach:

  1. Check the logs
  2. Check the CPU situation
  3. Check the TCP situation
  4. View the Java threads with jstack
  5. View the Java heap with jmap
  6. Analyze the heap file with MAT and look for the source of the objects that cannot be reclaimed
https://urlify.cn/Q3Ar6z
