
After more than a month of hard work, FullGC frequency dropped from about 40 times a day to roughly once every 10 days, and YoungGC time was cut by more than half. An optimization this large deserves a write-up of the tuning process.

My knowledge of JVM garbage collection had always been theoretical: I knew how objects are promoted from the young generation to the old generation, which is just enough to get through an interview. Some time ago, FullGC on our production servers became very frequent, averaging more than 40 times a day, and the servers restarted automatically every few days, which showed that they were in a very abnormal state. Given such a good opportunity, I of course volunteered to do the tuning. The servers' GC data before tuning showed very frequent FullGC.

First of all, the servers have a modest configuration (2 cores, 4 GB), and the cluster has 4 servers in total. The number and duration of FullGCs is about the same on each server. The core JVM startup parameters are:

-Xms1000M -Xmx1800M -Xmn350M -Xss300K -XX:+DisableExplicitGC -XX:SurvivorRatio=4 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -XX:LargePageSizeInBytes=128M -XX:+UseFastAccessorMethods -XX:+UseCMSInitiatingOccupancyOnly -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC

  • -Xmx1800M: Sets the maximum memory available to the JVM to 1800M.
  • -Xms1000M: Sets the initial JVM memory to 1000M. This value can be set equal to -Xmx to avoid the JVM reallocating memory every time a garbage collection completes.
  • -Xmn350M: Sets the size of the young generation to 350M. Total JVM memory size = young generation size + old generation size + permanent generation size. The permanent generation is generally fixed at 64M, so when the young generation is increased, the old generation shrinks. This value has a great impact on system performance, and Sun officially recommends 3/8 of the entire heap.
  • -Xss300K: Sets the stack size of each thread. Since JDK 5.0 the default stack size per thread is 1M; before that it was 256K. Adjust it according to how much memory the application's threads need. With the same physical memory, decreasing this value allows more threads to be created. However, the operating system still limits the number of threads in a process, so they cannot be created indefinitely; in practice the limit is about 3000~5000.
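To check whether a running JVM was actually started with equal -Xms and -Xmx, the startup flags can be read back at runtime. The following is a minimal sketch, not part of the original tuning work; the class name `HeapFlagCheck` and its helper methods are made up for illustration. It relies only on `RuntimeMXBean.getInputArguments()` from the standard library and handles just the k/m/g size suffixes.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

class HeapFlagCheck {
    /** Parses a JVM size such as "1800M", "300k" or "2g" into bytes. */
    static long parseSize(String value) {
        char unit = Character.toLowerCase(value.charAt(value.length() - 1));
        if (Character.isDigit(unit)) {
            return Long.parseLong(value); // no suffix: plain bytes
        }
        long number = Long.parseLong(value.substring(0, value.length() - 1));
        switch (unit) {
            case 'k': return number << 10;
            case 'm': return number << 20;
            case 'g': return number << 30;
            default: throw new IllegalArgumentException("Unknown size unit: " + value);
        }
    }

    /** Returns the size given by the last flag starting with prefix, or -1 if the flag is absent. */
    static long flagSize(List<String> jvmArgs, String prefix) {
        long size = -1;
        for (String arg : jvmArgs) {
            if (arg.startsWith(prefix)) {
                size = parseSize(arg.substring(prefix.length()));
            }
        }
        return size;
    }

    /** True when both -Xms and -Xmx are present and equal. */
    static boolean initEqualsMax(List<String> jvmArgs) {
        long xms = flagSize(jvmArgs, "-Xms");
        long xmx = flagSize(jvmArgs, "-Xmx");
        return xms >= 0 && xms == xmx;
    }

    public static void main(String[] args) {
        // The flags the current JVM was started with, e.g. [-Xms1000M, -Xmx1800M, ...]
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("-Xms == -Xmx: " + initEqualsMax(jvmArgs));
    }
}
```

With the parameters listed above (-Xms1000M, -Xmx1800M), the check reports that the two values differ.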

First tuning

When I looked at the parameters, my immediate reaction was: why is the young generation so small? How can throughput improve when it is this small? It also causes YoungGC to trigger frequently; in the data above, young generation collection took 830s. The initial heap size also differed from the maximum heap size, and the sources I consulted all recommend setting the two to the same value, which prevents memory from being reallocated after each GC. Based on this knowledge, the first round of online tuning was carried out: the young generation was enlarged, and the initial heap size was set to the maximum heap size.

-Xmn350M -> -Xmn800M
-XX:SurvivorRatio=4 -> -XX:SurvivorRatio=8
-Xms1000M -> -Xms1800M

SurvivorRatio was changed to 8 with the intention of letting as much garbage as possible be collected in the young generation. The configuration was deployed to two production servers (prod1 and prod2; the other two were left unchanged for comparison). After 5 days of operation, the GC results showed that the number of YoungGCs had dropped by more than half and YoungGC time had dropped by 400s, but the average number of FullGCs had increased by 41. YoungGC was basically as expected, but the FullGC result was completely unacceptable.

Thus the first optimization failed.
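To make the effect of the first tuning's -Xmn and -XX:SurvivorRatio changes concrete, the arithmetic can be sketched as follows. This is a rough illustration only: it assumes the usual HotSpot meaning of SurvivorRatio (Eden size divided by the size of one Survivor space, with two Survivor spaces), and the real JVM rounds sizes to its own alignment. The class and method names are made up for the example.

```java
class YoungGenLayout {
    /** Eden size in MB: the young generation is survivorRatio parts Eden plus two Survivor parts. */
    static double edenMb(double youngMb, int survivorRatio) {
        return youngMb * survivorRatio / (survivorRatio + 2);
    }

    /** Size in MB of one of the two Survivor spaces. */
    static double survivorMb(double youngMb, int survivorRatio) {
        return youngMb / (survivorRatio + 2);
    }

    public static void main(String[] args) {
        // Before: -Xmn350M -XX:SurvivorRatio=4  -> Eden about 233M, each Survivor about 58M
        System.out.printf("before: eden=%.0fM survivor=%.0fM%n", edenMb(350, 4), survivorMb(350, 4));
        // After:  -Xmn800M -XX:SurvivorRatio=8  -> Eden 640M, each Survivor 80M
        System.out.printf("after:  eden=%.0fM survivor=%.0fM%n", edenMb(800, 8), survivorMb(800, 8));
    }
}
```

Under these assumptions Eden grows from roughly 233M to 640M, which is consistent with YoungGC running far less often after the change.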

Second optimization


During the optimization, our supervisor found an object type T with more than 10,000 instances in memory, occupying nearly 20M. Tracing how this bean is used led to the cause in the project: a reference held by an anonymous inner class. The pseudocode is as follows:

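public void doSmthing(T t) {
    redis.addListener(new Listener() {
        public void onTimeout() {
            // t is captured by the anonymous class, so it stays reachable while the listener is registered
            if (t.success()) {
                // perform operation
            }
        }
    });
}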

Because the listener is not released after the callback, and the callback is a timeout operation that only fires when an event exceeds the configured time (1 minute), object T can never be collected, which is why there were so many instances in memory. After discovering the memory leak through the example above, we first went through the program's error log files and fixed all the error events. After releasing again, however, GC behavior was basically unchanged: a small memory leak had been fixed, but the root cause evidently had not, and the servers continued to restart inexplicably.
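One common way to avoid this kind of listener leak is to have the listener unregister itself once the callback has run. The sketch below is an assumption for illustration, not the fix used in the project (the article does not show it): `Registry`, `removeListener`, and `addOneShot` are hypothetical stand-ins, and the real client may not offer a way to remove listeners.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class OneShotListenerDemo {
    interface Listener {
        void onTimeout();
    }

    /** Hypothetical stand-in for a client that keeps registered listeners reachable. */
    static class Registry {
        private final List<Listener> listeners = new CopyOnWriteArrayList<>();

        void addListener(Listener listener) { listeners.add(listener); }
        void removeListener(Listener listener) { listeners.remove(listener); }
        int size() { return listeners.size(); }

        void fireTimeout() {
            // CopyOnWriteArrayList iterates over a snapshot, so listeners may remove themselves here.
            for (Listener listener : listeners) {
                listener.onTimeout();
            }
        }
    }

    /** Registers a listener that unregisters itself after the callback has run once. */
    static void addOneShot(Registry registry, Runnable action) {
        registry.addListener(new Listener() {
            @Override
            public void onTimeout() {
                try {
                    action.run();
                } finally {
                    // Drop the registry's reference so everything captured by this listener can be collected.
                    registry.removeListener(this);
                }
            }
        });
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        addOneShot(registry, () -> System.out.println("timed out"));
        registry.fireTimeout();
        System.out.println("listeners still registered: " + registry.size());
    }
}
```

After the callback fires, the registry no longer holds the listener, so the objects it captured become eligible for collection.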


Memory leak investigation

The memory leak problem was found after the first tuning, so everyone began investigating memory leaks. We started by reviewing the code, but this was very inefficient and turned up basically nothing. So I kept taking memory dumps when production was not very busy, and finally caught a big object.

There were more than 40,000 of these objects, all of them ByteArrayRow objects, which confirms that the data was produced by database queries or inserts. So another round of code analysis began. During it, our operations colleagues found that at a certain time of day the ingress traffic increased several times over, reaching as high as 83MB/s. After some checking, we confirmed that the business had nowhere near that volume and that there is no file upload function. Alibaba Cloud customer service also said that the traffic was completely normal, which ruled out an attack.

While I was still investigating the ingress traffic, another colleague found the root cause. Under a certain condition, all unprocessed data of a specified kind in a table is queried, but because a condition was missing from the WHERE clause, the query returned more than 400,000 rows. The logged requests and data from that time show that this logic had indeed been executed. The dump contained only 40,000 objects because that was how many happened to have been read at the moment of the dump; the rest were still in transit. This also explains why the servers restarted automatically.

After this problem was fixed, the online servers ran completely normally. With the original, untuned parameters, FullGC ran only 5 times in about 3 days.
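Besides correcting the query condition, a general safeguard against loading hundreds of thousands of rows at once is to read the data in bounded pages. The following is a sketch under assumptions, not the project's code: `fetchPage(offset, limit)` is a hypothetical stand-in for a query of the form "... WHERE ... ORDER BY id LIMIT ? OFFSET ?", and it assumes the handler does not change which rows match while paging.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Consumer;

class PagedQuery {
    /**
     * Reads rows page by page so that at most pageSize rows are held at a time.
     * Returns the total number of rows handed to the handler.
     */
    static <T> long forEachPage(BiFunction<Integer, Integer, List<T>> fetchPage,
                                int pageSize,
                                Consumer<List<T>> handler) {
        long total = 0;
        int offset = 0;
        while (true) {
            List<T> page = fetchPage.apply(offset, pageSize);
            if (page.isEmpty()) {
                break; // nothing left to read
            }
            handler.accept(page);
            total += page.size();
            if (page.size() < pageSize) {
                break; // a short page is the last page
            }
            offset += pageSize;
        }
        return total;
    }

    public static void main(String[] args) {
        // Simulated table with 400,000 matching rows, the order of magnitude mentioned in the text.
        List<Integer> table = new ArrayList<>();
        for (int i = 0; i < 400_000; i++) {
            table.add(i);
        }
        long total = PagedQuery.<Integer>forEachPage(
            (offset, limit) -> table.subList(Math.min(offset, table.size()),
                                             Math.min(offset + limit, table.size())),
            1_000,
            page -> { /* process at most 1,000 rows at a time */ });
        System.out.println("processed " + total + " rows");
    }
}
```

The page size of 1,000 is an arbitrary choice for the example; the point is that memory use is bounded by the page size rather than by the size of the result set.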

Second tuning

With the memory leak solved, tuning could continue. The GC log showed that for the first three FullGCs, the old generation occupied less than 30% of its memory, yet FullGC still occurred. After researching various sources, I found a blog post that explained very clearly that metaspace can trigger FullGC. The server's default metaspace size was 21M, while the GC log showed metaspace peaking at about 200M, so the following tuning was carried out. Below are the modified parameters for prod1 and prod2, respectively; prod3 and prod4 remain unchanged.

prod1:
-Xmn350M -> -Xmn800M
-Xms1000M -> -Xms1800M
-XX:CMSInitiatingOccupancyFraction=75

prod2:
-Xmn350M -> -Xmn600M
-Xms1000M -> -Xms1800M
-XX:CMSInitiatingOccupancyFraction=75
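The metaspace observation in this section (a 21M default against a peak of about 200M) can also be checked from inside a running JVM instead of from the GC log. A minimal sketch, assuming a HotSpot JVM on JDK 8 or later, where the memory pool is named "Metaspace"; the class name `MetaspaceUsage` is made up, and on HotSpot the initial threshold itself is controlled by the -XX:MetaspaceSize flag.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.Optional;

class MetaspaceUsage {
    /** Returns the usage of the pool named "Metaspace", or empty if this JVM has no such pool. */
    static Optional<MemoryUsage> metaspace() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                return Optional.ofNullable(pool.getUsage());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<MemoryUsage> usage = metaspace();
        if (usage.isPresent()) {
            long mb = 1024 * 1024;
            System.out.println("metaspace used:      " + usage.get().getUsed() / mb + "M");
            System.out.println("metaspace committed: " + usage.get().getCommitted() / mb + "M");
        } else {
            System.out.println("no Metaspace pool reported by this JVM");
        }
    }
}
```

If the reported usage sits far above the configured initial metaspace size, metaspace growth is a plausible trigger for FullGCs that occur while the old generation is mostly empty.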

prod1 and prod2 differ only in the size of the young generation; everything else is the same. After about 10 days online, the GC data of prod1, prod2, prod3, and prod4 were compared.

By comparison, prod1 and prod2 had far fewer FullGCs than prod3 and prod4, and their YoungGC counts were also about half of those on prod3 and prod4. The improvement was most obvious on prod1: besides the reduction in YoungGCs, its throughput was higher than that of prod3 and prod4 even though those two had been running a day longer (judged by the number of threads started), which shows that prod1's throughput improved markedly. Judging by the number of GCs and the GC time, this optimization was declared a success. The configuration of prod1 was the better one: it greatly improved server throughput and cut GC time by more than half.

prod1 had only one FullGC, and its cause cannot be seen in the GC log. The old generation occupied only about 660M at CMS remark, which should not be enough to trigger FullGC. The earlier YoungGC data also rule out the promotion of large objects, and judging by its size, metaspace had not reached the GC threshold either. This needs further investigation; if you know the cause, please point it out. Thank you in advance.
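The reasoning that about 660M in the old generation should not have triggered a collection can be written out as arithmetic. This is a rough sketch under stated assumptions: the heap is fixed at 1800M (-Xms equal to -Xmx), the old generation is the heap minus -Xmn, and the JVM's own size alignment is ignored. The class and method names are made up for the example.

```java
class CmsThreshold {
    /** Old generation size in MB, assuming a fixed heap of heapMb and a young generation of youngMb. */
    static long oldGenMb(long heapMb, long youngMb) {
        return heapMb - youngMb;
    }

    /** Old generation occupancy in MB at which CMS starts a cycle for the given occupancy fraction. */
    static long thresholdMb(long oldGenMb, int fractionPercent) {
        return oldGenMb * fractionPercent / 100;
    }

    public static void main(String[] args) {
        // prod1: -Xmx1800M -Xmn800M -XX:CMSInitiatingOccupancyFraction=75
        long oldGen = oldGenMb(1800, 800);           // 1000M
        long threshold = thresholdMb(oldGen, 75);    // 750M
        long observed = 660;                         // occupancy at CMS remark, from the text
        System.out.println("old gen=" + oldGen + "M threshold=" + threshold
                + "M observed=" + observed + "M below threshold=" + (observed < threshold));
    }
}
```

Under these assumptions the observed 660M is below the 750M threshold, which matches the conclusion that occupancy alone does not explain this FullGC.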


After more than a month of tuning, the following points can be summarized:

  • FullGC more than once a day is definitely not normal.
  • When FullGC is found to be frequent, investigate memory leak problems first.
  • After the memory leak is solved, there is little room left for JVM tuning. It is fine as a learning exercise; otherwise, do not invest too much time.
  • If you find that the CPU stays high, after ruling out code problems you can ask operations to consult Alibaba Cloud customer service. During this investigation, 100% CPU usage turned out to be caused by a server problem, and everything was normal after the server was migrated.
  • Data queries also count as the server's ingress traffic. If the business traffic is not that large and there is no attack, investigate the database side.
  • It is necessary to check the server's GC from time to time, so that problems can be found early.
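For the last point, GC counts and accumulated GC time can be read periodically through the standard `GarbageCollectorMXBean` API and logged or alerted on. A minimal sketch; the class name `GcStats` is made up, and the collector names reported depend on which collectors the JVM is running.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcStats {
    /** Builds one line per collector: its name, collection count, and accumulated time in milliseconds. */
    static String snapshot() {
        StringBuilder sb = new StringBuilder();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append(gc.getName())
              .append(": count=").append(gc.getCollectionCount())
              .append(", timeMs=").append(gc.getCollectionTime())
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Call this on a schedule and compare successive snapshots to spot a rising GC rate.
        System.out.print(snapshot());
    }
}
```

Comparing two snapshots taken a day apart gives the per-day collection counts that this article uses as its main health indicator.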

The above is the process and summary of the JVM tuning over the past month or so. If there are any errors, please correct them.


