Background

After launching a new feature, keep watching it closely. If instability appears, identifying the cause must be treated as a high priority to avoid a larger failure.

Some application pods restarted

At 11:58, an alarm was received: pod1 had restarted within the last three minutes.

At 12:02, pod1 had gone 5 minutes without a new alarm and its metrics returned to normal. (There is a known memory-consuming function; when several people trigger it at the same time, the pod occasionally restarts.)

At 12:06, JVM monitoring alarm for node 10.10.48.116: JVM_FullGC count averaged >= 5.0 over the last 2 minutes (current value 5.0).

At 12:07, the JVM_FullGC metric for node 10.10.48.116 returned to normal.

At 12:15, an alarm was received: pod2 had restarted within the last three minutes.

At 12:18, pod2 and pod3 both triggered the three-minute restart alarm.

At 12:20, pod2, pod3, and pod4 all triggered the three-minute restart alarm.

At 12:24, checked the logs of the restarted pods and found them flooded with TableStore exceptions: 2022-09-01 12:24:41 [16-d] [I/O dispatcher 36] WARN c.a.o.tablestore.core.utils.LogUtil – TraceId:8-1 Failed RetriedCount:1 [ErrorCode]:OTSServerBusy, [Message]:Service is busy., [RequestId]:00-e1, [TraceId]:81db-cd, [HttpStatus]:503

At 12:25, JVM monitoring alarm for node 10.10.48.117: JVM_FullGC count averaged >= 5.0 over the last 2 minutes (current value 5.5).

At 12:25, aggregated alarm: starting from 2022-09-01 12:22:00, three JVM_FullGC alarms fired within 3 minutes across the monitored nodes: 10.10.48.117 averaged >= 5.0 FullGCs over the last 2 minutes (current value 5.5), and 10.10.48.116 averaged >= 5.0 FullGCs over the last 2 minutes (current value 5.0).

At 12:28, other business lines were checked and showed no obvious abnormalities, apart from a few individual services also occasionally getting 503 errors when querying TableStore.

At 12:41, asked the operations team to add two more pods to increase available capacity.

At 12:50, after the pods were added, the restarts still did not stop.

At 13:07, checked for OOM kills; none were found.

At 13:22, received feedback from the Alibaba TableStore team: the index used by the query was not a covering index, so each query had to read back the main table, and the excessive pressure on the main table caused the 5xx errors.

At 13:23, stopped the scheduled tasks behind the queries involved.

At 13:30, the JVM_FullGC and pod restart alarms still had not stopped.

At 13:40, since the alarms were FullGC related, increased the pod memory and the JVM heap by 4 GB each and redeployed.

At 13:42, the redeployment completed, and both the JVM_FullGC and pod restart alarms cleared.

At 14:00, restarted the scheduled tasks that had been stopped earlier.

Heap memory and GC status while the service was unstable

Memory-intensive operations were occurring, which caused FullGC to fire frequently.

Did the TableStore exception cause the service failure?

No. Fewer than 12% of the related query requests hit errors.

TableStore (OTS) errors during the incident

Why did the pods restart?

FullGC took too long, causing the container to judge the pod abnormal and restart it.

During the incident, the pod liveness-check rules played out as follows:

FullGC is a stop-the-world (STW) pause; while it runs, all requests are blocked.

If FullGC blocks the process for more than 30s, the container restarts the pod. During the incident, FullGC took more than 120s, so under this configured rule the container restarted the pods.

Why did FullGC fire?

Memory-consuming operations occurred:

The data returned by the TableStore server occupied a large amount of heap memory.

The newly added business threads that query TableStore ran concurrently.

Was the memory usage of these operations reasonable?

No. From a business point of view, each query should match no more than 100 records, but 131,262 records were returned.

Why was so much data returned?

Checking the code showed the query condition was wrong: the three conditions for querying TableStore should be combined with AND, but were combined with OR.
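To make the mistake concrete, here is a minimal sketch of how an AND-of-three-conditions query can silently become an OR, assuming the query goes through the TableStore Java SDK's search-index BoolQuery (must vs. should); the table, index, and field names are invented for the example, and the real code may use a different query form.

```java
// Illustrative sketch only, not the incident code. Field, table, and index names are made up.
import com.alicloud.openservices.tablestore.SyncClient;
import com.alicloud.openservices.tablestore.model.ColumnValue;
import com.alicloud.openservices.tablestore.model.search.SearchQuery;
import com.alicloud.openservices.tablestore.model.search.SearchRequest;
import com.alicloud.openservices.tablestore.model.search.SearchResponse;
import com.alicloud.openservices.tablestore.model.search.query.BoolQuery;
import com.alicloud.openservices.tablestore.model.search.query.Query;
import com.alicloud.openservices.tablestore.model.search.query.TermQuery;
import java.util.Arrays;

public class PermissionQuery {

    private static Query term(String field, String value) {
        TermQuery q = new TermQuery();
        q.setFieldName(field);
        q.setTerm(ColumnValue.fromString(value));
        return q;
    }

    /** Correct: all three conditions must hold (AND), so at most ~100 rows match. */
    static BoolQuery andOfThreeConditions(String orgId, String bizType, String status) {
        BoolQuery bool = new BoolQuery();
        bool.setMustQueries(Arrays.asList(          // must* == AND semantics
                term("org_id", orgId),
                term("biz_type", bizType),
                term("status", status)));
        return bool;
    }

    /** Buggy: the same three conditions with OR semantics, so any row matching
     *  any single condition is returned (130k+ rows in the incident). */
    static BoolQuery orOfThreeConditions(String orgId, String bizType, String status) {
        BoolQuery bool = new BoolQuery();
        bool.setShouldQueries(Arrays.asList(        // should* == OR semantics
                term("org_id", orgId),
                term("biz_type", bizType),
                term("status", status)));
        bool.setMinimumShouldMatch(1);
        return bool;
    }

    static SearchResponse run(SyncClient client, BoolQuery query) {
        SearchQuery searchQuery = new SearchQuery();
        searchQuery.setQuery(query);
        searchQuery.setLimit(100);                  // defensive cap on each page
        SearchRequest request =
                new SearchRequest("biz_table", "biz_table_index", searchQuery);
        return client.search(request);
    }
}
```

With the AND form the result stays within the expected ~100 rows; the OR form is the shape of mistake that produced the 131,262-row result described above.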

Why did such a serious logic error happen?

The faulty logic is old code that went live in 2020, and the developers of the new feature copied it over directly. Because it had been running in production for two years, code review did not treat it as a focus, and the people involved no longer remember the details of how it was reviewed back then.

Given such a serious logic error, why didn't the previous service hit this problem?

The previous service was also problematic.

The old code is triggered by a scheduled task and queries TableStore serially. Although this consumes memory, as long as the pod running it has no other memory-heavy operations in flight, FullGC is not triggered.

This is probably also why the application occasionally restarted in the past.

The new business scenario receives an MQ message and, depending on conditions, triggers this old code. When n messages arrive at the same time, the memory usage is multiplied by n, which easily triggers FullGC.
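A minimal sketch of that mechanism, with invented class and method names: the same query path that used to run serially under the scheduled task now runs once per MQ message, so n concurrent messages keep n full result sets in the heap at the same time.

```java
// Hypothetical sketch of the failure mechanism, not the real consumer code.
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PermissionSyncConsumer {

    private final ExecutorService workers = Executors.newFixedThreadPool(16);

    /** Stand-in for the 2020-era code path: loads the whole result set into memory.
     *  With the AND conditions this is <= ~100 rows; with the OR bug it returned
     *  131,262 rows per call. */
    private List<String> queryTableStoreForPermissions(String bizId) {
        throw new UnsupportedOperationException("illustrative only");
    }

    /** Called once per MQ message. n concurrent messages => n result sets in the heap. */
    public void onMessage(String bizId) {
        workers.submit(() -> {
            List<String> rows = queryTableStoreForPermissions(bizId);
            applyDataPermissions(rows);   // hypothetical downstream step
        });
    }

    private void applyDataPermissions(List<String> rows) { /* ... */ }
}
```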

The new feature had been live for almost a week; why was the exception only triggered today?

The faulty old code is only triggered when certain query conditions are met. At the time of the incident, a large volume of business data happened to satisfy those conditions and hit the faulty logic.

Why wasn't this found during testing?

The test cases did not cover all business scenarios.

When the three AND filter conditions are mistakenly written as OR, more data is returned than expected.

In this scenario the query results are used for data-permission control, so the consequence of returning too much data is that people who should see the data can see it, and people who should not see it can see it as well.

If the only test case is "can the person who should see the data actually see it?", the test passes.
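For illustration, a sketch of what the missing negative case could look like with JUnit 5; DataPermissionService here is a hypothetical stand-in for the real permission check that consumes the query results.

```java
// Sketch of the existing and missing test cases. DataPermissionService is hypothetical.
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;

class DataPermissionServiceTest {

    /** Hypothetical stand-in for the real permission check built on the query results. */
    interface DataPermissionService {
        boolean canSee(String userId, String recordId);
    }

    // Stubbed so the sketch is self-contained; replace with the real implementation.
    private final DataPermissionService service = (userId, recordId) ->
            userId.equals("user-in-org") && recordId.equals("record-of-that-org");

    @Test
    void authorizedUserCanSeeTheirData() {          // the case that existed
        assertTrue(service.canSee("user-in-org", "record-of-that-org"));
    }

    @Test
    void unauthorizedUserCannotSeeOtherOrgsData() { // the missing case that would catch the OR bug
        assertFalse(service.canSee("user-in-other-org", "record-of-that-org"));
    }
}
```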

Quadrant-one actions while an exception is happening

If QPS has increased when pods are restarting, add pods first.

If the pod restarts are identified as FullGC taking too long, prioritize increasing memory to mitigate.

When an exception occurs, dump the JVM heap right away. If the cause has not been found, examine the dumped heap data, keeping in mind that the JVM in some pods may have only just started at dump time and may not yet have run the abnormal operation.

Quadrant-two actions after the exception

Try not to copy already-deployed code directly into new features.

If a new feature does involve already-deployed code, that code still needs code review.

The code to be launched should be tested thoroughly

For features about to launch, test cases covering the critical path should be comprehensive.

Add an alarm on JVM heap memory: if usage exceeds 80%, dump the heap for analysis (see the sketch after this list).

Reserve time in each iteration for technical improvements. There is a flaw in the current process that needs to be fixed: requirement priority is decided by the product team, and if a technical requirement is deprioritized by the product team and that later causes a failure, the product team is not held responsible. This needs to be changed so that rights and responsibilities stay aligned: whoever makes a decision of a given weight should bear the corresponding responsibility and enjoy the corresponding benefit.
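For the heap-memory alarm item above, a minimal sketch of what an in-process check could look like, using only standard JMX beans; the 80% threshold comes from the item itself, while the dump path and the choice to dump directly (rather than only raising an alarm) are assumptions.

```java
// Hedged sketch: watch JVM heap usage and trigger a heap dump once it crosses 80%,
// so there is something to analyze even if the pod is restarted afterwards.
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapUsageWatcher {

    private static final double THRESHOLD = 0.80;   // assumed 80% alarm line

    public static void checkAndDump() throws Exception {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        long max = heap.getMax();
        if (max <= 0) {
            return;                                  // heap max undefined; nothing to compare against
        }
        double ratio = (double) heap.getUsed() / max;

        if (ratio >= THRESHOLD) {
            // In the real setup this would first fire an alarm through the monitoring
            // platform; here we go straight to dumping live objects for later analysis.
            HotSpotDiagnosticMXBean diagnostic =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            String file = "/tmp/heap-" + System.currentTimeMillis() + ".hprof";
            diagnostic.dumpHeap(file, true /* live objects only */);
        }
    }
}
```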