This article in-depth study and solve the problem of sporadic latency of services on the Kubernetes platform, not long ago I also encountered a similar problem, it seems to be a metaphysical event, at first due to network link jitter, after a period of time still exist, although the impact is after P99.99 data, but disturbing the mind, and finally through multi-faceted positioning, solved the problem. Finally, it is found that it has nothing to do with business and network, but the infrastructure itself has a problem, as follows to give a specific troubleshooting plan, and to a certain extent, explain how containers, cgroups, and CPUs will affect network latency.

As our Kubernetes cluster continues to grow, our requirements for service latency are becoming more stringent. We’re starting to observe that some of the services running on our Kubernetes platform are facing occasional latency issues that aren’t due to performance issues with the app itself.

We found that latency issues generated by applications on Kubernetes clusters appear to be random, and the establishment of some network connections may exceed 100ms, causing downstream services to time out or retry. The response time of these services to process the business themselves can be well maintained within 100ms, and it takes more than 100ms to establish a connection. In addition, we also found that for some SQL queries that should be executed very fast (on the order of milliseconds), it actually exceeds 100ms from the application point of view, but it is completely normal from the perspective of MySQL database, and we have not found possible slow query problems.

Through troubleshooting, we narrowed down the problem to establishing a connection to the Kubernetes node, including requests inside the cluster or requests involving external resources and external visitors. The easiest way to reproduce this problem is to use Vegeta on any internal node to launch an HTTP stress test on a service exposed by NodePort, and we can observe that some high-latency requests are generated from time to time. In this article, we’ll talk about how we tracked down this issue.

Cut through the fog to find the key

to the problem

We want to reproduce the problem with a simple example, then we hope to narrow down the scope of the problem and remove unnecessary complexity. At first, the data flow between Vegeta and Kubernetes Pods involved too many components, and it was difficult to determine if this was a deeper network problem, so we needed to do a subtraction.

The Vegeta client makes a TCP request to one of the Kube nodes in the cluster. The Kubernetes cluster in our data center uses overlay networks (running on top of our existing data center network), which encapsulates the IP packets of the overlay network in the IP packets of the data center. When the request arrives at the first kube node, it performs NAT translation, which translates the IP and port of the kube node into the network address of the overlay, specifically the IP and port of the pod running the application. When a response is requested, the corresponding inverse transformation (SNAT/DNAT) occurs. This is a very complex system with a large number of variable states maintained that are constantly updated as services are deployed.

When we first used Vegeta for stress testing, we found a delay during the TCP handshake phase (between SYN and SYN-CK). To simplify the complexity of HTTP and Vegeta, we use hping3 to send SYN packets, observe whether there is a delay in the response packets, and then close the connection. We were able to filter out packets with latency greater than 100ms to simply reproduce Vegeta’s Layer 7 stress test or simulate a service exposed to a SYN attack. The following log shows the result of sending TCP SYN/SYN-ACK packets to port 30927 of kube-node at 10ms intervals and filtering out slow requests,

theojulienne@shell ~ $ sudo hping3 -s -p 30927 -i u10000 | egrep --line-buffered ' rtt=[0-9]{3}\.' len=46 ip= ttl=59 DF id=0 sport=30927 flags=SA seq=1485 win=29200 rtt=127.1 mslen=46 ip= ttl=59 DF id=0 sport=30927 flags=SA seq=1486 win=29200 rtt=117.0 mslen=46 ip= ttl=59 DF id=0 sport=30927 flags=SA seq=1487 win=29200 rtt=106.2 mslen=46 ip= ttl=59 DF id=0 sport=30927 flags=SA seq=1488 win=29200 rtt=104.1 mslen=46 ip= ttl=59 DF id=0 sport=30927 flags=SA seq=5024 win=29200 rtt=109.2 mslen=46 ip= ttl=59 DF id=0 sport=30927 flags=SA seq=5231 win=29200 rtt=109.2 ms

Based on the sequence numbers in the logs and the time, the first thing we observe is that this delay is not a single occurrence, but often accumulates, as if the backlog of requests has been processed for the last time.

Next, we want to pinpoint which component might have an exception. Are they the NAT rules of kube-proxy, after all, they have hundreds of lines? Or is the IPIP tunnel or similar network component performing poorly? One way to troubleshoot is to test each step in the system. IF WE TOMBSTONE THE NAT RULES AND FIREWALLS, WHAT HAPPENS WHEN WE JUST USE IPIP TUNNELING?

If you are also on a kube node, then Linux allows you to communicate directly with the pod, very simple:

theojulienne@kube-node-client ~ $ sudo hping3 -S -i u10000 | egrep --line-buffered 'rtt=[ 0-9]{3}\.' len=40 ip= ttl=64 DF id=0 sport=0 flags=RA seq=7346 win=0 rtt=127.3 mslen=40 ip= ttl=64 DF id=0 sport=0 flags=RA seq=7347 win=0 rtt=117.3 mslen=40 ip= ttl=64 DF id=0 sport=0 flags=RA seq=7348 win=0 rtt=107.2 ms

From our results, the problem is still there! This rules out iptables as well as NAT. Is there something wrong with TCP? Let’s see what happens if we request with ICMP.

theojulienne@kube-node-client ~ $ sudo hping3 --icmp -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}\.' len=28 ip= ttl=64 id=42594 icmp_seq=104 rtt=110.0 mslen=28 ip= ttl=64 id=49448 icmp_seq=4022 rtt=141.3 mslen=28 ip= ttl=64 id=49449 icmp_ seq=4023 rtt=131.3 mslen=28 ip= ttl=64 id=49450 icmp_seq=4024 rtt=121.2 mslen=28 ip= ttl=64 id=49451 icmp_seq=4025 rtt=111.2 mslen=28 ip= ttl=64 id= 49452 icmp_seq=4026 rtt=101.1 mslen=28 ip= ttl=64 id=50023 icmp_seq=4343 rtt=126.8 mslen=28 ip= ttl=64 id=50024 icmp_seq=4344 rtt=116.8 mslen=28 ip= ttl=64 id=50025 icmp_seq=4345 rtt=106.8 mslen=28 ip= ttl=64 id=59727 icmp_seq=9836 rtt=106.1 ms

The results show that ICMP is still able to reproduce the problem. IS THAT IPIP TUNNELING CAUSING THE PROBLEM? Let’s simplify the matter even further.

So is it possible that any communication between these nodes will cause this problem?

theojulienne@kube-node-client ~ $ sudo hping3 --icmp -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}\.' len=46 ip= ttl=61 id=41127 icmp_seq=12564 rtt=140.9 mslen=46 ip= ttl=61 id=41128 icmp_seq=12565 rtt=130.9 mslen=46 ip= ttl=61 id=41129 icmp_ seq=12566 rtt=120.8 mslen=46 ip= ttl=61 id=41130 icmp_seq=12567 rtt=110.8 mslen=46 ip= ttl=61 id=41131 icmp_seq=12568 rtt=100.7 mslen=46 ip= ttl= 61 id=9062 icmp_seq=31443 rtt=134.2 mslen=46 ip= ttl=61 id=9063 icmp_seq=31444 rtt=124.2 mslen=46 ip= ttl=61 id=9064 icmp_seq=31445 rtt=114.2 mslen=46 ip= ttl=61 id=9065 icmp_seq=31446 rtt=104.2 ms

Behind this complexity, it is simply any network communication between two kube nodes, including ICMP. If this target node is “exceptional” (some nodes will be worse than others, such as higher latency and more frequent problems), then we can still see similar delays when the problem occurs.

So the question now is, we obviously didn’t find this problem on all machines, why does this issue only appear on servers of those kube nodes? Does it appear when the kube node is the request sender or request receiver? Fortunately, we can easily narrow down the problem: we can use an out-of-cluster machine as the sender and the same “known failure” machine as the target of the request. We found that there is still a problem with the request in this direction.

theojulienne@shell ~ $ sudo hping3 -p 9876 -S -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}\.' len=46 ip= ttl=61 DF id=0 sport=9876 flags=RA seq=312 win=0 rtt=108.5 mslen=46 ip= ttl=61 DF id=0 sport=9876 flags=RA seq=5903 win=0 rtt=119.4 mslen=46 ip= ttl=61 DF id=0 sport=9876 flags=RA seq=6227 win=0 rtt=139.9 mslen=46 ip= ttl=61 DF id=0 sport=9876 flags=RA seq=7929 win=0 rtt=131.2 ms

and then repeat the above operation, This time we send a request from the kube node to the external node.

theojulienne@kube-node-client ~ $ sudo hping3 -p 9876 -S -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}\.' ^C--- hping statistic ---22352 packets transmitted, 22350 packets received, 1% packet lossround-trip min/avg/max = 0.2/7.6/1010.6 ms

By looking at the latency data in the capture, we get more information. Specifically, latency is observed from the sending end (below), whereas the server on the receiving end does not see latency (top) – note the Delta column in seconds:

In addition, by looking at the difference in the order of TCP and ICMP network packets on the receiving end (based on sequence ID), we found that ICMP packets always arrive at the receiving end in the order in which they were sent, but the delivery time is irregular, and the sequence ID of TCP packets is sometimes staggered, and some of them will stall. In particular, if you count the ports on which SYN packets are sent/received, these ports are not sequential on the receiving end, while they are ordered on the sending side.

The network cards currently used by our servers, such as the hardware we use in our own data centers, have some subtle differences in processing TCP and ICMP network packets. When a datagram arrives, the NIC hashes the packets passed on each connection and attempts to assign different connections to different receive queues, allocating (presumably) a CPU core to each queue. For TCP packets, this hash value contains both the source IP, the port, and the destination IP, and the port. In other words, the hash value of each connection is likely to be different. For ICMP packets, the hash value contains only the source and destination IPs because there is no port. This explains the finding above.

Another new discovery is that ICMP packets between two hosts are stalled for a period of time, but TCP packets are fine during the same time. This seems to tell us that the hash of the received NIC queue is “joking”, and we are almost certain that the pause occurred during the processing of the RX packet on the receiving end, not on the sending end.

This ruled out the transfer issue between kube nodes, so we now know that this is a pause at the stage of processing the packet and is some kube nodes that act as receivers.

To understand why the problem occurs on the receiving side of the kube node service, let’s take a look at how Linux handles network packets.

In the simplest original implementation, the network card receives a network packet and sends an interrupt to the Linux kernel to inform it that a network packet needs to be processed. The kernel stops other work it is currently doing, switches the context to the interrupt processor, processes network messages, and then switches back to the previous work task.

Context switching can be very slow, which may be fine for a 10Mbit network card in the 90s, but now many servers are 10 Gigabit network cards, and the maximum packet processing speed may be 15 million packets per second: on a small 8-core server, this means millions of interruptions per second.

Many years ago, Linux added a new NAPI, the Networking API was used to replace the traditional way of the past, and modern network card drivers can use this new API to significantly improve the performance of high-rate packet processing. At low rates, the kernel still accepts interrupts from the NIC as described previously. Once packets that exceed the threshold arrive, the kernel disables interrupts and starts polling the network card to crawl network packets in batches. This process is done in a “softirq,” or software interrupt context. This occurs in the final phase of the system call, when the program has run into kernel space instead of user space.

This way is

much faster than the traditional way, but it also brings another problem. If the number of packets is so large that we spend all of our CPU time processing packets received from the network card, then we cannot let the user-mode program actually process these queued network requests (such as fetching data from TCP connections, etc.). Eventually, the queue fills up and we start dropping packets. To trade-off the time it takes for user-mode and kernel-mode to run, the kernel limits the number of packets processed by a given software interrupt context, arranging a “budget.” Once this “budget” value is exceeded, it wakes up another thread, called “ksoftiqrd” (or you’ll see this thread in the ps command), and it continues to handle these software interrupt contexts outside of the normal system call path. This thread uses the standard process scheduler, which enables fair scheduling.

By sorting out the paths for the Linux kernel to process network packets, we found that there is a real possibility that this processing process may stall. If the interval between softirq processing calls becomes longer, it is possible for network packets to be in the RX queue of the NIC for some time. This could be due to a CPU core deadlock or some slower task blocking the kernel to process softirqs.

Narrowing the problem down to a core or approach

so far, we believe that this delay is indeed possible, and we also know that we seem to be observing some very similar signs. The next step is to confirm the theory and try to understand what is causing the problem.

Let’s look again at the network request where the problem occurred:

len=46 ip= ttl=61 id=29573 icmp_seq=1953 rtt=99.3 mslen=46 ip=

ttl=61 id=29574 icmp_seq=1954 rtt=89.3 mslen=46 ip= ttl=61 id=29575 icmp_seq=1955 rtt=79.2 mslen=46 ip= ttl=61 id=29576 icmp_seq=1956 rtt=69.1 mslen=46 ip= ttl=61 id=29577 icmp_seq=1957 rtt=59.1 mslen=46 ip= ttl=61 id=29790 icmp_seq=2070 rtt=75.7 mslen=46 ip= ttl=61 id=29791 icmp_seq=2071 rtt=65.6 mslen=46 ip= ttl=61 id=29792 icmp_seq=2072 rtt=55.5 ms

As we discussed earlier, these ICMP packets are hashed to a specific NIC RX queue and then processed by a CPU core. If we want to understand what the kernel is doing, then we first need to know which CPU core and how softirq and ksoftiqrd handle these packages, which can be very helpful in locating the problem.

Now we can use tools that help us track the running state of the Linux kernel in real time, so I can use bcc. bcc allows you to write a short C program and mount it to any function in the kernel, which can then cache the events and pass them to a user-mode Python program, which will do some summary analysis of these events and return the results to you. The above “arbitrary functions mounted to the kernel” is actually a challenge, but it has been used as safely as possible, because it is itself designed to track down such production environment problems, which are generally not easily reproduced in a test or development environment.

The idea is very simple: we know that the kernel is processing those IMCP ping packets, so let’s intercept the kernel’s icmp_echo method, which accepts an inbound ICMP “echo request” packet and initiates an ICMP reply “echo response”. We can identify these packages by the icmp_seq serial number displayed in HPING3.

The code for this bcc script may seem complicated, but breaking it down doesn’t sound that scary. The icmp_echo function passes a pointer to a struct sk_buff * skb, which is the packet containing the ICMP echo request. We can do some in-depth research, extract the echo.sequence (corresponding to the icmp_seq shown in hping3 above) and send it back to user space. At the same time, we can also easily obtain the current process name or process ID. When the kernel processes these packets, we see the following result:

TGID PID PROCESS NAME ICMP_SEQ0 0 swapper/11 7700 0 swapper/11 7710 0 swapper/11

7720 0 swapper/11      7730       0       swapper/11      77420041   20086   prometheus      7750       0       swapper/11      7760       0       swapper/11      7770       0       swapper/11      778

4512 4542 spokes-report-s 779

Note here about process names that in softirq, which occurs after a system call, you can see that the process that initiated the system call is displayed as “process”, even though this is the kernel processing it in the context of the kernel.

By running, we can now correlate the quiesced packets observed by HPING3 with the processes that process it. Perform a simple grep operation on the captured icmp_seq values, and these contexts allow you to see what happened before the packets were processed. Packets that match the icmp_seq value shown in hping3 above are flagged and also show the RTT value we observed (in parentheses we assume that RTT<50ms requests are not filtered

): TGID    PID     PROCESS NAME    ICMP_SEQ ** RTT--10137   10436   cadvisor        195110137   10436   cadvisor        195276      76      ksoftirqd/11    1953 ** 99ms76      76      ksoftirqd/11    1954 **  89ms76      76      ksoftirqd/11    1955 ** 79ms76      76      ksoftirqd/11    1956 ** 69ms76      76      ksoftirqd/11    1957 ** 59ms76      76      ksoftirqd/11    1958 ** (49ms)76      76      ksoftirqd/11    1959 ** (39ms)76      76      ksoftirqd/11    1960 ** (29ms)76      76      ksoftirqd/11    1961 ** (19ms)76      76      ksoftirqd/11    1962 ** (9ms)--10137   10436   cadvisor        206810137   10436   cadvisor        206976      76      ksoftirqd/11    2070 ** 75ms76      76      ksoftirqd/11    2071 ** 65ms76      76      ksoftirqd/11    2072 ** 55ms76      76      ksoftirqd/11    2073  ** (45ms)76      76      ksoftirqd/11    2074 ** (35ms)76      76      ksoftirqd/11    2075 ** (25ms)76      76      ksoftirqd/11    2076 ** (15ms)76      76      ksoftirqd/11    2077 ** (5ms)

The above results tell us something. First, these packets are processed by the ksoftirqd/11 process, which conveniently tells us that this pair of machines hashes their ICMP packets onto the receiver’s CPU core 11. We can also see that every time we see quiescing, we always see that some packets are processed in the context of the system call softirq in cadvisor, and then ksoftirqd takes over and processes the backlog, which corresponds exactly to those quiesced packets that we find.

The fact that cAdvisor always runs immediately before a stuttering request also suggests that this may be related to the issue we are troubleshooting. Ironically, as described on the cAdvisor homepage, we use cAdvisor precisely to “analyze the resource usage and performance characteristics of running containers,” but it raises this performance issue. As with many container-related things, these are relatively cutting-edge tools, and there are situations that lead to some unexpected performance degradation.

What did cAdvisor do to cause a pause?

Now that we understand how quiescing occurs, the processes that cause it, and which CPU cores are quiescing, we now have a good understanding of it. In order for the kernel to hard-block instead of scheduling ksoftirqd in advance, and we also see packets processed in the softirq context of cAdvisor, we think that cAdvisor calling syscall may be very slow, and the rest of the network packets can be processed normally after it


This is just a theory, so how do we verify that this is really happening? What we can do is track what is running on the CPU cores throughout the process, find out the point where the packets are out of “budget” and start waking up ksoftirqd processing, and then go back and look at what is running on the CPU cores. Think of it as an X-ray of the CPU every few milliseconds. It looks like this:

..., 4096) = 658 <0.179438>

[pid 10384] <... futex resumed> )       = 0 <0.104614>[pid 10436] <... futex resumed> )       = 0 <0.175936>

[pid 10436] <... read resumed> "cache 0\nrss 0\nrss_huge 0\nmapped_" ..., 4096) = 577 <0.228091>

[pid 10427] <... read resumed> "cache 0\nrss 0\nrss_huge 0\nmapped_"..., 4096) = 577 <0.207334>[pid 10411] <... epoll_ctl resumed> )   = 0 <0.118113>[pid 10382] <... pselect6 resumed> )    = 0 (Timeout) <0.117717>

[pid 10436] <... read resumed> "cache 154234880\nrss 507904\ nrss_h"..., 4096) = 660 <0.159891>

[pid 10417] <... futex resumed> )       = 0 <0.917495>[pid 10436] <... futex resumed> )       = 0 <0.208172>[pid 10417] <... futex resumed> )       = 0 <0.190763>

[pid 10417] <... read resumed> "cache 0\nrss 0\nrss_huge 0\nmapped_"..., 4096) = 576 <0.154442>

At this point, we are pretty sure that the read() system call is slow. In the context of what read reads and mem_cgroup, those read() calls are reading the memory.state file, which describes the system’s memory usage and cgroup limits. cAdvisor polls this file to get details of the resources used by the container. We can verify whether the kernel or cAdvisor problem is by manually calling this method:

theojulienne@kube-node-bad ~ $ time cat /sys/fs/cgroup/memory/memory.stat >/dev/nullreal 0m0.153suser 0m0.000ssys 0m0. 152stheojulienne@kube-node-bad ~$

Since we can reproduce the problem, it indicates that the kernel triggered a “pathological” method.

What is causing this reading to be so slow

then at this stage, we can easily find similar problems reported by others. As it turned out, this issue had long been reported to cAdvisor, which was found to be an issue of high CPU usage, just not noticing that latency also randomly affected the network stack. In fact, some in-house developers have noticed that cAdvisor is consuming more CPU than expected, but it doesn’t seem to be causing a problem because our server CPU performance is adequate, so we don’t investigate CPU usage.

Looking at this problem, it is mainly about the memory cgroup, which is responsible for managing and counting the memory usage within the namespace (container). When all processes in the cgroup exit, the memory cgroup is freed by Docker. However, “memory” is not only the memory of the process, and while the usage of process memory has disappeared, it turns out that the kernel also allocates memory for cache space, such as dentries and inodes (directory and file metadata), which are cached into memory cgroups. As can be seen from this problem

: “

zombie” cgroups: cgroups that are not running and deleted still hold some memory space (in our case, these cache objects are directory data, but it is also possible that page caching or tmpfs).

Instead of iterating through all cached pages when cgroup frees, which can also be slow, the kernel will lazily wait for these memories to be used before reclaiming them, and when all memory pages are cleaned up, the corresponding cgroup will finally be reclaimed. At the same time, these cgroups will still be counted in the statistics.

From a performance standpoint, they amortized the huge time spent directly recycling overall by reclaiming each page in stages, opting for a quick initial cleanup, but this way keeping some cache in memory. But that’s fine, when the kernel reclaims the last page of memory in the cache, cgroup is eventually cleaned up, so it’s not a “leak”. Unfortunately, the problem is with the way memory.stat performs the search, such as the kernel is still version 4.9 on some of our servers, which is problematic to implement, plus our servers generally have a lot of memory space, which means that the last memory cache is reclaimed and cleaned up the zombie cgroup can take a long time.

It turns out that our nodes have a large number of zombie cgroups, some of which have reads/quiescies for more than a second.


temporary workaround for this cAdvisor problem is to immediately release the system-wide directory/inode node cache, which immediately eliminates read latency, and the network latency is also addressed because the deletion of the cache includes cached pages occupied by “zombie” cgroups, which are also freed. This is not the final solution, but it verifies the cause of the problem.

Newer kernel versions (4.19+) have proven to improve the performance of memory.stat calls, so this is no longer an issue after updating to this version of the kernel. During this time, we used existing tools to detect problems with nodes in our Kubernetes cluster and gracefully remove and restart them: we use these tools to detect latency conditions, and when we find that the latency is high enough to trigger the problem, we then deal with it by restarting it gracefully. This gives us a respite during which we are able to upgrade the systems and kernels of the remaining servers.

Summary This

issue manifests itself as a few hundred milliseconds stall of the NIC RX queue, which causes high latency on short connections and delays in the middle of connections, such as between MySQL queries and response packets. Understanding and maintaining the performance of our most foundational systems, such as Kubernetes, is critical to the reliability and speed of all services built on top of it. When we make significant technical “investments” in this and improve performance, every system we run benefits from these improvements.

Original link:



public number (zhisheng ) reply to Face, ClickHouse, ES, Flink, Spring, Java, Kafka, Monitor keywords such as to view more articles corresponding to keywords.

like + Looking, less bugs 👇