

Looking at the subtasks' watermarks, we found they were delayed by about 10 minutes. We then checked for exceptions and for backpressure, and found that the backpressure status of all three subtasks of the source->watermarks->filter chain showed High.

The job was restarted multiple times, but the problem remained.
Locating the backpressure through the checkpoints, we found that the checkpoint duration of the normal (non-backpressured) tasks is very short.


Backpressured task



From the checkpoint timing it can be seen that, within the task, it is basically the downstream subtasks that take a long time, so the initial suspicion is that the downstream sink is consuming too slowly.
Analyze the Metrics of the upstream subtask



If a Subtask's send-side buffer usage is high, it indicates that it is being throttled by downstream backpressure; if a Subtask's receive-side buffer usage is high, it indicates that it is propagating backpressure upstream.
When outPoolUsage and inPoolUsage are both low or both high, the current Subtask is, respectively, normal or under downstream backpressure; there should not be much doubt about that. What is more interesting is when outPoolUsage and inPoolUsage behave differently: this may be an intermediate state of backpressure propagation, or it may indicate that this Subtask is the source of the backpressure.
If a Subtask's outPoolUsage is high, it is usually being affected by its downstream task, so it can largely be ruled out as the source of the backpressure. If a Subtask's outPoolUsage is low but its inPoolUsage is high, it may be the source of the backpressure. Because backpressure usually propagates upstream and causes the outPoolUsage of some upstream Subtasks to become high, we can use this to judge further. It is worth noting that backpressure is sometimes short-lived and has little impact, for example a brief network delay on a channel or a normal GC in the TaskManager; in such cases we do not need to deal with it.
From this analysis it can be concluded that the upstream subtask is being rate-limited by the downstream.

In general, a high floatingBuffersUsage indicates that backpressure is being propagated upstream, while exclusiveBuffersUsage indicates whether the backpressure is skewed (floatingBuffersUsage high with exclusiveBuffersUsage low means it is skewed, because a few channels occupy most of the Floating Buffers).
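As a quick way to read these buffer metrics without clicking through the WebUI, something like the following can query Flink's REST API. This is a minimal sketch: the JobManager address, job id, vertex id and the exact metric names are placeholders, and the metric names differ across Flink versions (e.g. buffers.inPoolUsage in older releases vs. Shuffle.Netty.Input.Buffers.inPoolUsage in newer ones).

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: read the buffer-usage metrics of one job vertex through the Flink REST API.
// Host, job id, vertex id and metric names are placeholders / version-dependent.
public class BufferUsageCheck {
    public static void main(String[] args) throws Exception {
        String jobManager = "http://localhost:8081";   // assumption: default REST port
        String jobId = "<job-id>";                     // look up via GET /jobs
        String vertexId = "<vertex-id>";               // look up via GET /jobs/<job-id>
        // Metric names as exposed by older Flink versions; adjust for your version.
        String metrics = "buffers.inPoolUsage,buffers.outPoolUsage,"
                + "buffers.inputFloatingBuffersUsage,buffers.inputExclusiveBuffersUsage";

        String url = jobManager + "/jobs/" + jobId + "/vertices/" + vertexId
                + "/subtasks/metrics?get=" + metrics;

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // Values close to 1.0 for inPoolUsage/outPoolUsage indicate backpressure.
        System.out.println(response.body());
    }
}
```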
Analyze the cause

It can be seen that the subtasks' data is not particularly skewed.
In addition, the most common cause may be the execution efficiency of user code (frequent blocking or performance problems). The most useful approach is to profile the TaskManager's CPU and check whether the Task Thread is saturating a CPU core: if so, analyze which functions the CPU time is mainly spent in; for example, our production environment occasionally encounters user functions stuck in regular expressions (ReDoS). If not, look at where the Task Thread is blocked: it may be a synchronous call inside the user function itself, or a temporary pause caused by system activity such as a checkpoint or GC. Of course, the profiling result may also be normal, and the job is simply backpressured because it requested too few resources, which usually requires increasing the parallelism. It is worth mentioning that future versions of Flink will provide a JVM CPU flame graph directly in the WebUI [5], which will greatly simplify the analysis of performance bottlenecks. In addition, TaskManager memory and GC problems can also cause backpressure, including frequent Full GCs or even lost TaskManagers caused by unreasonable sizing of the TaskManager JVM's memory areas. It is recommended to observe GC problems by enabling the G1 garbage collector for the TaskManager and adding -XX:+PrintGCDetails to print GC logs.
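Following that recommendation, a minimal flink-conf.yaml sketch for the GC side (the option key differs by Flink version: older releases only have the global env.java.opts; the GC log path is a placeholder):

```yaml
# Sketch: TaskManager JVM options for observing GC behaviour.
# Older Flink versions only support the global key `env.java.opts`.
env.java.opts.taskmanager: -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/tmp/taskmanager-gc.log
```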
Test: the sink was changed from Kafka to printing to the console, and the backpressure problem was still there, which rules out slow writes to the Kafka sink. (The sink originally wrote to ES, which had latency, so it had already been changed to Kafka; that possibility is therefore ruled out first.)
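A minimal sketch of that swap in the DataStream API (resultStream, the topic and the broker address are placeholders; the pre-KafkaSink FlinkKafkaProducer connector of that era is assumed):

```java
// Kafka sink used normally (placeholder topic/brokers).
Properties props = new Properties();
props.setProperty("bootstrap.servers", "kafka:9092");
resultStream
        .addSink(new FlinkKafkaProducer<>("alert-topic", new SimpleStringSchema(), props))
        .name("kafka-sink");

// Sink used for the test: print to the TaskManagers' stdout.
// Backpressure still appeared with this sink, so slow Kafka writes were ruled out.
resultStream.print().name("debug-print-sink");
```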
Reduced the CEP time window size from 3 minutes to 1 minute to 20s (see the sketch below).
The backpressure appeared later and later each time, so the problem can be roughly located in the CEP operator. Meanwhile the checkpoint duration kept increasing: although the state size stayed the same, the time doubled. The checkpoint-related configuration was therefore adjusted and testing continued, but the problem still existed.
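For reference, the CEP time window is configured with within(); a minimal sketch of the reduction tested here (the Event type and the condition are placeholders, legacy Time-based CEP API assumed):

```java
// Sketch: the CEP pattern's time window, reduced step by step for the test.
Pattern<Event, ?> pattern = Pattern.<Event>begin("start")
        .where(new SimpleCondition<Event>() {
            @Override
            public boolean filter(Event event) {
                return event.isCandidate();   // placeholder condition
            }
        })
        // .within(Time.minutes(3))   // original window
        // .within(Time.minutes(1))   // first reduction
        .within(Time.seconds(20));    // final value used in this test
```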
Analyzed the CPU proportion of the TaskManager threads and found that the CEP operator occupied 94% of the CPU. The operator parallelism was therefore increased from 3 to 6, which reduced the CPU usage of the threads, as follows.
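A sketch of that change (stream and type names are placeholders; a Flink version where PatternStream#select returns a SingleOutputStreamOperator is assumed):

```java
// Sketch: raise the parallelism of the CEP select operator so the hot thread's
// work is spread over more subtasks (was 3, now 6).
SingleOutputStreamOperator<Alert> alerts = CEP.pattern(filteredStream, pattern)
        .select(new PatternSelectFunction<Event, Alert>() {
            @Override
            public Alert select(Map<String, List<Event>> match) {
                return Alert.from(match.get("start"));   // placeholder mapping
            }
        });
alerts.setParallelism(6);
```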

After this change the backpressure no longer appeared. We will continue to follow up, and will try to increase the CEP time window and, following configuration best practices, increase the number of Kafka partitions.
It was also found that the data skew was serious: the Kafka topic has 3 partitions but the parallelism is 6, so the data across the 6 subtasks of the CEP operator was heavily skewed. rebalance() was therefore called after addSource to force round-robin distribution of the data, as sketched below.
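A minimal sketch of that change (connector class, topic and broker address are assumptions):

```java
// 3 Kafka partitions feed only 3 of the 6 source subtasks; with the default
// forward connection the other 3 CEP subtasks stay idle, hence the skew.
// rebalance() forces round-robin distribution to all downstream subtasks.
Properties props = new Properties();
props.setProperty("bootstrap.servers", "kafka:9092");
props.setProperty("group.id", "cep-job");

DataStream<String> source = env
        .addSource(new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), props))
        .name("kafka-source")
        .rebalance();
```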



You can see that the data here is much more uniform than before.
CEP configuration (see the consolidated sketch below):
- parallelism: double the number of Kafka partitions
- CEP time window: 30s
- sink: 2 sinks to Kafka
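Putting the final configuration together, a self-contained sketch of the resulting job (topic names, broker address and the pattern condition are placeholders; the pre-KafkaSource/KafkaSink connectors of that era are assumed):

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternSelectFunction;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

// Consolidated sketch of the final job shape described above.
public class CepBackpressureJobSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(6);   // 2x the 3 Kafka partitions

        Properties kafkaProps = new Properties();
        kafkaProps.setProperty("bootstrap.servers", "kafka:9092");   // placeholder
        kafkaProps.setProperty("group.id", "cep-job");

        // Source: 3-partition Kafka topic; rebalance() spreads records over all 6 subtasks.
        DataStream<String> source = env
                .addSource(new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), kafkaProps))
                .name("kafka-source")
                .rebalance();

        // CEP pattern with the 30s time window from the final configuration.
        Pattern<String, ?> pattern = Pattern.<String>begin("start")
                .where(new SimpleCondition<String>() {
                    @Override
                    public boolean filter(String value) {
                        return value.contains("ALERT");   // placeholder condition
                    }
                })
                .within(Time.seconds(30));

        DataStream<String> matches = CEP.pattern(source, pattern)
                .select(new PatternSelectFunction<String, String>() {
                    @Override
                    public String select(Map<String, List<String>> match) {
                        return match.get("start").get(0);
                    }
                });

        // Two Kafka sinks, as in the final configuration.
        matches.addSink(new FlinkKafkaProducer<>("alert-topic-1", new SimpleStringSchema(), kafkaProps))
                .name("kafka-sink-1");
        matches.addSink(new FlinkKafkaProducer<>("alert-topic-2", new SimpleStringSchema(), kafkaProps))
                .name("kafka-sink-2");

        env.execute("cep-backpressure-sketch");
    }
}
```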
by collabH
Original address: https://github.com/collabH/repository/blob/master/bigdata/flink/practice/%E8%AE%B0%E5%BD%95%E4%B8%80%E6%AC%A1Flink%E5%8F%8D%E5%8E%8B%E9%97%AE%E9%A2%98.md
end