Background
After launching a new feature, keep a close watch on it. If any instability appears, identifying the root cause must be treated as a high priority to avoid a larger failure.
Some applications restarted
At 11:58, an alarm was received: pod1 had restarted within the last three minutes.
At 12:02, pod1 had raised no alarm for 5 minutes and the metric recovered. (There is a known memory-consuming function; when several people trigger it at the same time, a pod occasionally restarts.)
At 12:06, JVM monitoring alert for node 10.10.48.116: JVM_FullGC count averaged >= 5.0 over the last 2 minutes; current value 5.0000.
At 12:07, the JVM_FullGC alert for node 10.10.48.116 recovered.
At 12:15, an alarm was received: pod2 had restarted within the last three minutes.
At 12:18, pod2 and pod3 both raised the three-minute restart alarm.
At 12:20, pod2, pod3, and pod4 all raised the three-minute restart alarm.
At 12:24, checking the logs of the restarted pods showed the TableStore exception being logged repeatedly: 2022-09-01 12:24:41 [16-d] [I/O dispatcher 36] WARN c.a.o.tablestore.core.utils.LogUtil - TraceId:8-1 Failed RetriedCount:1 [ErrorCode]:OTSServerBusy, [Message]:Service is busy., [RequestId]:00-e1, [TraceId]:81db-cd, [HttpStatus]:503
At 12:25, JVM monitoring alert for node 10.10.48.117: JVM_FullGC count averaged >= 5.0 over the last 2 minutes; current value 5.5000.
At 12:25, an aggregated alert: starting from 2022-09-01 12:22:00, 3 JVM_FullGC alerts fired within 3 minutes: node 10.10.48.117, count averaged >= 5.0 over the last 2 minutes, current value 5.5000; node 10.10.48.116, count averaged >= 5.0 over the last 2 minutes, current value 5.0000.
At 12:28, other business lines had been checked and showed no obvious abnormality, although a few other services also hit the occasional 503 error when querying TableStore.
At 12:41, asked the operations team to add two more pods to increase available capacity.
At 12:50, after the pods were added, the restarts still did not stop.
At 13:07, checked for OOM kills; none were found.
At 13:22, the Alibaba TableStore team gave feedback: the 503s happened because the index used by the query was not a covering index, so every hit had to be read back from the primary table, and the resulting pressure on the primary table produced the 5xx errors.
At 13:23, the scheduled tasks that issue the affected queries were stopped.
At 13:30, the JVM_FullGC and pod restart alarms had still not stopped.
At 13:40, since these were full-GC alarms, the pod memory and JVM heap were each increased by 4 GB and the service was redeployed.
At 13:42, the redeployment completed; both the JVM_FullGC and pod restart alarms disappeared.
At 14:00, the scheduled tasks that had been stopped earlier were started again.
Heap memory and GC behavior while the service was unstable
Memory-heavy operations were running, which caused FullGC to fire frequently.
Did the TableStore exception cause the service exception?
No. Fewer than 12% of the query requests returned errors.
OTS (TableStore) errors during the incident
Why did the pods restart?
Full GC took too long, so the container runtime judged the pods unhealthy and restarted them.
During the exception, the pod liveness-detection rules worked as follows:
FullGC is stop-the-world (STW); while it runs, all requests, including health checks, are blocked.
If a FullGC lasts longer than 30s, the liveness check fails and the container runtime restarts the pod.
During the incident, individual FullGC pauses exceeded 120s, so by the configured rule the pods were restarted.
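For reference, cumulative full-GC counts and pause times can also be read from inside the JVM, which helps correlate restart alarms with GC behavior. A minimal sketch using the standard GarbageCollectorMXBean API (the actual monitoring in this incident is an external system; this is only an illustration):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStats {
    public static void main(String[] args) {
        // Print cumulative collection count and total pause time per collector.
        // A single full GC longer than the ~30s liveness threshold is enough to
        // make the health check time out and get the pod restarted.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: collections=%d, totalPauseMs=%d%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```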
Why was FullGC triggered?
A memory-consuming operation was running.
The data returned by the TableStore server occupied a large amount of heap memory.
The source was the newly added business thread that queries TableStore.
Was the memory usage of this operation reasonable?
No. From a business perspective, each query should match no more than 100 records, yet 131,262 records were returned.
Far too much data was being returned.
After reviewing the code, the query condition was wrong: the three conditions in the TableStore query should be combined with AND, but the code combined them with OR.
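The post does not show the actual query code, so the following is only a hypothetical sketch (the field names tenant, status, and owner are made up): joining three filters with OR instead of AND turns a narrow query into a very broad one.

```java
import java.util.List;
import java.util.function.Predicate;

public class FilterBugSketch {

    // Hypothetical row shape; the real table and fields are not shown in the post.
    record Row(String tenant, String status, String owner) {}

    // Intended semantics: a row matches only when ALL three conditions hold.
    static Predicate<Row> intendedFilter(String tenant, String status, String owner) {
        return r -> r.tenant().equals(tenant)
                && r.status().equals(status)
                && r.owner().equals(owner);
    }

    // What the copied code effectively did: ANY single condition was enough,
    // which is how a query expected to match <100 rows returned 131,262 rows.
    static Predicate<Row> buggyFilter(String tenant, String status, String owner) {
        return r -> r.tenant().equals(tenant)
                || r.status().equals(status)
                || r.owner().equals(owner);
    }

    static List<Row> query(List<Row> table, Predicate<Row> filter) {
        return table.stream().filter(filter).toList();
    }
}
```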
Why did such a serious logic error happen?
The faulty logic lives in old code that went live in 2020. The developers building the new feature copied it directly. Because it was assumed to be code that had already run in production for two years, code review did not treat it as a focus, and the engineers involved no longer remembered the details of the original change.
Given such a serious logic error, why did the previous service not show this problem?
The previous service had the same defect.
The old code is triggered by a scheduled task and queries TableStore serially. That still consumes memory, but as long as the pod running it has no other memory-heavy operation in progress, FullGC is not triggered.
This is probably also why the application occasionally restarted in the past.
The new business scenario receives an MQ message and, when certain conditions are met, triggers this old code. When n messages arrive at the same time, the memory footprint is multiplied by n, which easily triggers FullGC.
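A minimal sketch of why the trigger path matters (the executor, sizes, and message handling below are hypothetical, not the actual consumer code): a serial scheduled task holds at most one large result set on the heap at a time, while n concurrent MQ consumers can hold n of them simultaneously.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ConcurrencySketch {

    // Stand-in for the old logic: loads one (potentially large) result set.
    static List<byte[]> loadLargeResultSet() {
        return List.of(new byte[64 * 1024 * 1024]); // e.g. ~64 MB per invocation
    }

    public static void main(String[] args) {
        // Old path: a scheduled task runs the query serially, so at most one
        // result set is resident on the heap at any moment and is collectible
        // before the next iteration starts.
        for (int i = 0; i < 10; i++) {
            loadLargeResultSet();
        }

        // New path: each MQ message triggers the same logic on its own thread.
        // With n messages in flight, roughly n result sets live on the heap at
        // once, which is what pushed the service into repeated full GCs.
        ExecutorService consumers = Executors.newFixedThreadPool(10);
        for (int i = 0; i < 10; i++) {
            consumers.submit(ConcurrencySketch::loadLargeResultSet);
        }
        consumers.shutdown();
    }
}
```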
The new feature had been live for almost a week; why did the exception only trigger now?
The faulty old code runs only when the incoming data meets certain conditions. At the time of the incident, a large volume of business data happened to meet those conditions and hit the faulty logic.
Why was this not caught during testing?
The test cases did not cover all business scenarios.
When the three AND filter conditions are mistakenly written as OR, more data is returned, never less.
In this scenario the query result is used for data-permission control. Returning extra data means that people who should see the data can still see it, but people who should not see it can now see it as well.
If the only test case is "can the person who should see the data see it?", that test still passes.
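A hypothetical JUnit 5 sketch of the test gap (the record/user model and names are made up): the positive case passes even against the buggy OR filter, so only a negative case would have caught the bug before release.

```java
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.util.function.BiPredicate;
import org.junit.jupiter.api.Test;

class DataPermissionTest {

    // Hypothetical stand-in for the permission check backed by the TableStore
    // query: a user may see a record only when tenant AND owner both match.
    record Record(String tenant, String owner) {}
    record User(String tenant, String name) {}

    static final BiPredicate<User, Record> INTENDED =
            (u, r) -> u.tenant().equals(r.tenant()) && u.name().equals(r.owner());

    // The variant that shipped: conditions joined with OR.
    static final BiPredicate<User, Record> BUGGY =
            (u, r) -> u.tenant().equals(r.tenant()) || u.name().equals(r.owner());

    final Record record = new Record("tenant-a", "alice");

    @Test
    void authorizedUserCanSeeRecord() {
        // The only kind of case that was covered: it passes even against the
        // buggy OR filter, because OR only ever matches more rows, not fewer.
        assertTrue(BUGGY.test(new User("tenant-a", "alice"), record));
    }

    @Test
    void unauthorizedUserCannotSeeRecord() {
        User sameTenantOtherUser = new User("tenant-a", "bob");

        // The missing negative case: the intended AND filter rejects the user,
        assertFalse(INTENDED.test(sameTenantOtherUser, record));

        // while the shipped OR filter lets them through; asserting the opposite
        // here would have failed and surfaced the bug before release.
        assertTrue(BUGGY.test(sameTenantOtherUser, record));
    }
}
```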
Quadrant-one actions while the exception is occurring
If QPS has risen while pods are restarting, add pods first.
If pods are restarting and the cause is identified as FullGC taking too long, prioritize increasing memory to stop the bleeding.
While the exception is still happening, dump the JVM heap. If the root cause has not yet been found, the dump can be analyzed afterwards; waiting is risky because by then the JVM in some pods may have only just restarted and the abnormal operation may not have been triggered again.
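For reference, a heap dump can be taken with jmap (for example jmap -dump:live,format=b,file=heap.hprof <pid>) or programmatically. A minimal sketch using the HotSpot-specific HotSpotDiagnosticMXBean (the output path below is an arbitrary example):

```java
import com.sun.management.HotSpotDiagnosticMXBean;

import java.io.IOException;
import java.lang.management.ManagementFactory;

public class HeapDumper {

    /**
     * Writes a heap dump to the given path. With live=true only reachable
     * objects are dumped, which keeps the file smaller; pass false to also
     * capture unreachable objects.
     */
    public static void dumpHeap(String filePath, boolean live) throws IOException {
        HotSpotDiagnosticMXBean mxBean = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        mxBean.dumpHeap(filePath, live);
    }

    public static void main(String[] args) throws IOException {
        dumpHeap("/tmp/heap-" + System.currentTimeMillis() + ".hprof", true);
    }
}
```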
Quadrant-two actions after the exception
For code that is already in production, avoid copy-pasting it directly when building new features.
When code that is already in production is reused in a new feature, it still needs code review.
Code to be released must be tested thoroughly.
For features to be released, the test cases covering the critical path must be comprehensive.
Add an alarm on JVM heap usage: if it exceeds 80%, dump the heap for analysis (a minimal sketch follows this list).
Reserve time in every iteration for technical improvements. There is a flaw in the current process that needs to be fixed: requirement priority is decided by the product team, and if a technical requirement is deprioritized by product and that later causes a failure, product bears no responsibility. We need to push to change this so that authority and responsibility stay aligned: whoever makes a decision of a given weight bears the corresponding responsibility and enjoys the corresponding benefit.
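A minimal in-process illustration of the 80% heap check described above (real alerting would normally live in the monitoring system; the dump call would reuse a helper like the HeapDumper sketched earlier):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapUsageCheck {

    private static final double THRESHOLD = 0.80; // 80% heap usage, per the action item

    /**
     * Returns true when current heap usage exceeds the threshold. A scheduled
     * job or monitoring agent could call this periodically and, on a breach,
     * raise an alarm and trigger a heap dump for later analysis.
     */
    public static boolean heapUsageAboveThreshold() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        long max = heap.getMax() > 0 ? heap.getMax() : heap.getCommitted();
        return (double) heap.getUsed() / max > THRESHOLD;
    }

    public static void main(String[] args) {
        if (heapUsageAboveThreshold()) {
            System.out.println("Heap usage above 80%: raise alarm and dump the heap.");
            // e.g. HeapDumper.dumpHeap("/tmp/heap.hprof", true); // see earlier sketch
        }
    }
}
```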