This article is a loose (not verbatim) translation of https://learnk8s.io/kubernetes-network-packets, with some of my own understanding added.

Read this article to learn how packets are forwarded inside and outside a Kubernetes cluster, from the original web request all the way down to the container that hosts the application.

Before diving into the details of how packets flow through a Kubernetes cluster, let’s clarify Kubernetes’ requirements for the network.

The Kubernetes network model defines a basic set of rules: every Pod gets its own IP address; Pods on any node can communicate with all other Pods without NAT; and agents on a node (such as the kubelet) can reach all Pods on that node.

These requirements do not restrict the implementation to a single solution.

Instead, they describe the characteristics of a cluster network in general.

To satisfy these requirements, you must address the following challenges: container-to-container communication, Pod-to-Pod communication, Pod-to-Service communication, and traffic entering and leaving the cluster.

In this article, we’ll focus on the first three, starting with the network inside a pod: container-to-container communication.

Let’s look at a primary container that runs the app and another container that accompanies it.

In the example, there is a pod with nginx and busybox containers:
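As a rough sketch, such a pod could look like the following minimal manifest (the pod name and image tags are my own illustration, not necessarily those of the original article):

```
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: multi-container-pod
spec:
  containers:
    - name: nginx
      image: nginx
    - name: busybox
      image: busybox
      # keep busybox alive so the pod stays Running
      command: ["sleep", "3600"]
EOF
```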

When it is deployed, the following happens: the pod gets its own network namespace, an IP address is assigned to it, and both containers share that namespace and IP.

Network configuration is done quickly in the background.

However, let’s take a step back and try to understand why running these containers requires the steps above.

In Linux, a network namespace is a separate, isolated logical space.

You can think of a network namespace as one slice of a physical network interface after it has been divided into smaller pieces.

Each section can be configured individually and has its own network rules and resources.

These include firewall rules, interfaces (virtual or physical), routing, and everything related to the network.

But ultimately, the physical interface is required to process all the real packets, and all virtual interfaces are created based on the physical interface.

Network namespaces can be managed with ip-netns; for example, ip netns list lists the namespaces on a host.

Note that the created network namespace appears under /var/run/netns, but Docker does not follow this rule.

For example, here are some namespaces for Kubernetes nodes:
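For example, output along these lines (the namespace IDs are illustrative):

```
$ sudo ip netns list
cni-0f226515-e28b-df13-9f16-dd79456825ac (id: 3)
cni-4e4dfaac-89a6-2034-6098-dd8b2ee51dcd (id: 4)
```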

Note the cni- prefix; this means the namespace was created by a CNI plugin.

When you create a Pod and it is assigned to a node, the CNI will: assign an IP address to the pod, set up the veth interfaces, and attach the pod to the node’s network.

If a Pod contains multiple containers, they are all placed in the same namespace.

So, what happens when you list the namespaces of the containers on a node?

You can SSH connect to the Kubernetes node and view the namespace:
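For instance (output abridged; PIDs and namespace IDs are made up):

```
$ sudo lsns -t net
        NS TYPE NPROCS   PID USER   COMMAND
4026531992 net     171     1 root   /sbin/init
4026532286 net       5  4808 65535  /pause
4026532414 net       6  5153 65535  /pause
```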

lsns is a command that lists all available namespaces on the host.

Keep in mind that there are multiple namespace types in Linux.

Where are the Nginx containers?

What are those pause containers?

List all the namespaces on the node first to see if you can find the Nginx container:
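Something like this (again, PIDs are illustrative):

```
$ sudo lsns | grep nginx
4026532445 mnt       3  5372 root   nginx: master process nginx -g daemon off;
4026532446 uts       3  5372 root   nginx: master process nginx -g daemon off;
4026532448 pid       3  5372 root   nginx: master process nginx -g daemon off;
```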

Nginx containers are in the mount (mnt), Unix time-sharing (uts), and PID (pid) namespaces, but not in the network namespace (net).

Unfortunately, lsns only shows the lowest PID for each namespace, but you can filter further based on that process ID.

Use the following command to retrieve the Nginx container in all namespaces:
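With lsns -p you can list every namespace a given process belongs to; assuming 5372 is the nginx PID from the illustrative output above:

```
$ sudo lsns -p 5372
        NS TYPE   NPROCS   PID USER   COMMAND
4026531835 cgroup    178     1 root   /sbin/init
4026531837 user      178     1 root   /sbin/init
4026532286 net         5  4808 65535  /pause
4026532287 ipc         5  4808 65535  /pause
4026532445 mnt         3  5372 root   nginx: master process nginx -g daemon off;
4026532446 uts         3  5372 root   nginx: master process nginx -g daemon off;
4026532448 pid         3  5372 root   nginx: master process nginx -g daemon off;
```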

The pause process reappears, hijacking the network namespace.

What’s going on?

Each pod in the cluster has an additional hidden container running in the background, called a pause container.

List the containers running on the node and get the pause containers:
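On a node whose runtime is Docker, that can be as simple as the following (image tag and container names vary with the Kubernetes version and are illustrative here):

```
$ docker ps | grep pause
fa9666c1d9c6   registry.k8s.io/pause:3.9   "/pause"   10 minutes ago   Up 10 minutes   k8s_POD_multi-container-pod_default_...
44218e010aeb   registry.k8s.io/pause:3.9   "/pause"   12 minutes ago   Up 12 minutes   k8s_POD_coredns-...
```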

As you can see, each pod on the node will have a corresponding pause container.

This pause container is responsible for creating and maintaining network namespaces.

The creation of the network namespace is done by the underlying container runtime, usually by containerd or CRI-O.

The network namespace is created by the runtime before the pod’s containers are created.

The container runtime does this automatically; there is no need to run ip netns add manually.

Back to the pause container.

It contains very little code and goes to sleep immediately after deployment.

However, it is essential and plays a vital role in the Kubernetes ecosystem.

What is the use of a container that goes to sleep?

To understand what it does, let’s imagine a Pod with two containers, as in the previous example, but without pause containers.

Once the containers start, the CNI would have to: create the network namespace on one of the containers (say, nginx) and make the other container (busybox) join that namespace.

What if Nginx crashes?

The CNI will have to perform all the steps again, and the network of both containers will be disrupted.

Because a container that does nothing but sleep is very unlikely to fail, having it own the network namespace is a safer and more robust option.

I mentioned earlier that the Pod and both containers will have the same IP address.

How is that configured?

Let’s verify it.

First, find the IP address of the pod:
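Assuming the example pod from earlier (the IP and node name are illustrative):

```
$ kubectl get pod multi-container-pod -o wide
NAME                  READY   STATUS    RESTARTS   AGE   IP            NODE
multi-container-pod   2/2     Running   0          3m    10.244.4.12   worker-1
```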

Next, locate the relevant network namespace.

Because the network namespace lives on the node’s physical interface, you need to access the cluster node first.

If you are running minikube, use minikube ssh to access the node. If you are running on a cloud provider, there should be some way to access the nodes via SSH.

Once inside, find the newly created network namespace with ip netns list, as before.

In this example it is cni-0f226515-e28b-df13-9f16-dd79456825ac. You can then run commands inside that namespace with ip netns exec:
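A sketch of what that looks like (output abridged; the IP is the illustrative pod IP from earlier):

```
$ sudo ip netns exec cni-0f226515-e28b-df13-9f16-dd79456825ac ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 ...
    inet 127.0.0.1/8 scope host lo
3: eth0@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 ...
    inet 10.244.4.12/32 scope global eth0
```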

This IP is the Pod’s IP address! Find the peer network interface by looking for the 12 in @if12.

You can also verify that the Nginx container listens to HTTP traffic from within that namespace:
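For example, using netstat inside the same namespace (the PID is illustrative):

```
$ sudo ip netns exec cni-0f226515-e28b-df13-9f16-dd79456825ac netstat -lnpt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address    Foreign Address   State    PID/Program name
tcp        0      0 0.0.0.0:80       0.0.0.0:*         LISTEN   5372/nginx: master
```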

If you can’t SSH into the worker nodes of your cluster, you can use kubectl exec to get a shell in the busybox container and use the ip and netstat commands directly from there.

Now that we introduced communication between containers, let’s look at how to establish Pod-to-Pod communication.

There are two possible scenarios for Pod-to-Pod communication: the pods are on the same node, or they are on different nodes.

The entire workflow relies on virtual interface pairs and bridges, so let’s take a look at this part first.

The Pod is connected to the root namespace via a virtual Ethernet (veth) pair.

These virtual interface devices (the v in veth) connect the two namespaces and act as a tunnel between them.

With this veth device, you connect one end to the pod’s namespace and the other to the root namespace.

CNI can do these things for you, but you can also do them manually:
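A rough sketch of the manual steps, using a freshly created namespace to stand in for the pod’s namespace (all names here are made up):

```
# create a namespace to play the role of the pod's network namespace
sudo ip netns add pod-netns

# create a veth pair and move one end into that namespace;
# veth2 stays in the root namespace
sudo ip link add veth1 type veth peer name veth2
sudo ip link set veth1 netns pod-netns
```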

The namespace of the pod now has a tunnel that can access the root namespace.

On the node, a veth pair like this is set up for every new pod.

There are two steps: first, create the interface pair; second, assign an address to the Ethernet device and configure a default route.

Here’s how to set up the veth1 interface in the namespace of the pod:
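Continuing the sketch above (the address and gateway are illustrative):

```
# assign an address to veth1, bring it up, and point the default route at the gateway
sudo ip netns exec pod-netns ip addr add 10.244.4.12/24 dev veth1
sudo ip netns exec pod-netns ip link set lo up
sudo ip netns exec pod-netns ip link set veth1 up
sudo ip netns exec pod-netns ip route add default via 10.244.4.1

# bring up the root-namespace end as well
sudo ip link set veth2 up
```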

On the node, let’s create another veth2 pair:
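In the same sketch, a second pod on the node would get its own namespace and pair (names again made up):

```
sudo ip netns add pod-netns-2
sudo ip link add veth3 type veth peer name veth4
sudo ip link set veth3 netns pod-netns-2
```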

You can check the existing veth pair as before.

In the pod’s namespace, look at the suffix of the eth0 interface: it shows up as eth0@if12, and the number after @if is the index of the peer interface in the root namespace.

In this case, on the node you can find interface 12 with grep -A1 "^12:" (or simply scroll to it), as shown below.

You can also use ip -n cni-0f226515-e28b-df13-9f16-dd79456825ac link show type veth.
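On the node used throughout this article, that lookup would return output along these lines (the MAC address is illustrative):

```
$ ip link | grep -A1 "^12:"
12: cali97e50e215bd@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 ...
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-0f226515-e28b-df13-9f16-dd79456825ac
```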

Note the 3: eth0@if12 and 12: cali97e50e215bd@if3 notation on the interfaces.

From the Pod namespace, the eth0 interface is connected to the root namespace interface number 12 and is therefore @if12.

At the other end of the veth pair, the root namespace is connected to interface number 3 of the pod namespace.

Next is the bridge that connects the veth pairs to both ends.

The bridge aggregates every virtual interface located in the root namespace. It allows traffic to flow between the veth pairs and also through the common root namespace.

Here are a few more relevant details.

The Ethernet bridge is located at Layer 2 of the OSI networking model.

You can think of a bridge as a virtual switch that accepts connections from different namespaces and interfaces.

An Ethernet bridge can connect multiple available networks on a node.

Therefore, you can use a bridge to connect two interfaces: the veth of one pod’s namespace to the veth of another pod on the same node.

Next, let’s look at the purpose of bridges and veth pairs.

Suppose there are two Pods on the same node, and Pod-A sends a message to Pod-B.

In outline, based on the pieces described above: the packet leaves Pod-A through eth0, crosses the veth pair into the root namespace, and reaches the bridge; Pod-A resolves Pod-B’s MAC address with an ARP request that the bridge broadcasts, and the bridge then forwards the frame to the port attached to Pod-B’s veth, where it arrives on Pod-B’s eth0.

At this point, the communication between Pod-A and Pod-B has succeeded.

For communication between cross-node pods, there are additional communication hops.

ARP resolution does not occur because the source and destination IPs are not in the same network segment.

The check for whether two IPs are on the same network segment is done using a bitwise AND operation.

When the destination IP is not on the current network segment, packets are forwarded to the node’s default gateway.

To determine where a packet should be forwarded, the source node performs a bitwise AND of the IP address and the subnet mask.

This operation is also known as ANDing.

Let’s review the rules of the bitwise AND operation:

0 AND 0 = 0, 0 AND 1 = 0, 1 AND 0 = 0, 1 AND 1 = 1; everything except 1 AND 1 is false.

If the source node has the IP 192.168.1.1 with a /24 subnet mask, and the destination IP is 172.16.1.1/16, a bitwise AND shows that they are on different network segments.

This means that the destination IP is not on the same network as the source of the packet, and the packet is forwarded through the default gateway.

Math time.

We have to start with the binary 32-bit address for the AND operation.

First, find the source IP network and the destination IP segment.

After bitwise arithmetic, you need to compare the destination IP to the subnet of the source node of the packet.
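As a worked example with the addresses above:

```
source IP            192.168.1.1    11000000.10101000.00000001.00000001
source mask (/24)    255.255.255.0  11111111.11111111.11111111.00000000
AND                  192.168.1.0    11000000.10101000.00000001.00000000   <- source network

destination IP       172.16.1.1     10101100.00010000.00000001.00000001
source mask (/24)    255.255.255.0  11111111.11111111.11111111.00000000
AND                  172.16.1.0     10101100.00010000.00000001.00000000   <- not 192.168.1.0
```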

The result of the operation is 172.16.1.0, which is not equal to 192.168.1.0 (the network of the source node), indicating that the source and destination IP addresses are not on the same network.

If the destination IP is 192.168.1.2, that is, in the same subnet as the sending IP, the AND operation gets the node’s local network.

After the bitwise comparison, ARP looks up the MAC address of the default gateway in its lookup table.

If there is an entry, the packet is forwarded immediately.

Otherwise, broadcast first to find the MAC address of the gateway.

By this point, you should be familiar with how traffic flows between pods. Let’s take a moment to see how the CNI manages all of the above.

The Container Network Interface (CNI) is primarily concerned with the network in the current node.

CNIs can be thought of as a set of rules to follow to address Kubernetes networking needs.

There are many CNI implementations available, such as Calico, Cilium, Flannel, and Weave Net.

They all follow the same CNI standards.

Without a CNI, you would need to do the following manually: create the network namespace, create and attach the veth pairs, assign IP addresses, and configure routes and NAT rules.

And that’s not all: all of this must also be cleaned up or redone whenever a pod is deleted or restarted.

The CNI must support four different operations: ADD, DEL, CHECK, and VERSION.

Let’s take a look at how CNIs work.

When a Pod is assigned to a specific node, the Kubelet itself does not initialize the network.

Instead, Kubelet gives the task to CNI.

To do so, the kubelet specifies the configuration in JSON format and sends it to the CNI plugin.

You can go to the /etc/cni/net.d folder on the node and view the current CNI configuration file using the following command:
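For example, on a node running Calico it might look roughly like this (the file name and contents depend on the installed plugin; output abridged):

```
$ cat /etc/cni/net.d/10-calico.conflist
{
  "name": "k8s-pod-network",
  "cniVersion": "0.3.1",
  "plugins": [
    {
      "type": "calico",
      ...
    }
  ]
}
```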

Each CNI plug-in uses a different type of network configuration.

For example, Calico uses BGP-based Layer 3 networking to route packets between pods.

Cilium, on the other hand, uses an eBPF-based overlay network and operates from Layer 3 up to Layer 7.

Like Calico, Cilium supports throttling traffic by configuring network policies.

So which one should you use? There are two main types of CNIs.

In the first class, with a basic network setup (also known as flat networking), the CNI assigns pods IP addresses from the cluster’s IP pool.

This approach can quickly exhaust IP addresses and become a burden.

Instead, another class is to use overlay networks.

In simple terms, an overlay network is a rebuilt network on top of the main (bottom) network.

An overlay network works by encapsulating the packet in another packet on the underlying network when it is sent to a pod on another node.

A popular technique for overlay networks is VXLAN, which tunnels Layer 2 domains over a Layer 3 network.

So which is better?

There is no single answer, it depends on your needs.

Are you building a large cluster with tens of thousands of nodes?

Maybe overlay networks are better.

Do you care about simpler configuration and the ability to audit network traffic, and don’t want to lose that in a complex overlay network?

Flat networks are better for you.

Now that we’ve discussed CNI, let’s look at how Pod-to-service communication is connected.

Because pods are dynamic in Kubernetes, the IP addresses assigned to pods are not static.

The IP of a pod is ephemeral and changes every time a pod is created or deleted.

Services in Kubernetes solve this problem by providing a stable mechanism for reaching a set of pods.

By default, when a service is created in Kubernetes, a virtual IP is assigned.

In Service, you can use selectors to associate a Service with a target Pod.
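A minimal sketch of such a Service (the names and labels are illustrative; the target Pods would need a matching app: nginx label):

```
kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: nginx-service
spec:
  selector:
    app: nginx        # selects Pods labelled app=nginx
  ports:
    - port: 80
      targetPort: 80
EOF
```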

What happens when a pod is deleted or added?

The virtual IP of the service remains static.

And traffic can still reach the newly created pods without any intervention.

In other words, services in Kubernetes are similar to load balancers.

But how do they work?

A Service in Kubernetes is built on two components in the Linux kernel: Netfilter and iptables.

Netfilter is a framework inside the kernel for operations such as packet filtering, NAT, and port translation; among other things, it can block and reject unauthorized traffic.

iptables, on the other hand, is a user-oriented program that can be used to configure IP packet filtering rules for the Linux kernel firewall.

iptables is implemented as a set of different Netfilter modules.

You can use the iptables CLI to modify filtering rules on the fly; the rules are inserted at Netfilter’s hook points.

The filters are organized into different tables, which contain chains of rules for processing network traffic packets.

Different protocols use different kernel modules and programs.

When iptables is mentioned, it usually refers to IPv4; for IPv6, the command-line tool is ip6tables.

iptables has five chains, each of which maps directly to a Netfilter hook.

From the iptables side, they are: PREROUTING, INPUT, FORWARD, OUTPUT, and POSTROUTING.

They map to the corresponding Netfilter hooks: NF_IP_PRE_ROUTING, NF_IP_LOCAL_IN, NF_IP_FORWARD, NF_IP_LOCAL_OUT, and NF_IP_POST_ROUTING.

When a packet arrives, a Netfilter hook is “triggered” depending on the stage it is in. This hook executes specific iptables filtering rules.

Ah! It looks complicated!

Nothing to worry about though.

That’s why we use Kubernetes: all of this is abstracted away by a Service, and a simple YAML definition sets these rules automatically.

If you’re interested in viewing iptables rules, you can connect to the node and run:
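For example, assuming kube-proxy runs in its default iptables mode, something like:

```
# dump all rules and keep the Kubernetes-related ones
sudo iptables-save | grep KUBE | head

# or look directly at the chain that handles Service virtual IPs
sudo iptables -t nat -L KUBE-SERVICES -n | head
```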

There are also tools that can visualize the iptables chains on a node.

Here is an example diagram from a visual iptables chain on a GKE node:

Note that there may be hundreds of rules configured here; imagine having to configure them all by hand!

At this point, we have seen how pods on the same node communicate with pods on different nodes.

In Pod-Service communication, the first half of the link is the same.

When a request goes from Pod-A to Pod-B, and Pod-B sits “behind” a service, there are some differences in how it is transmitted.

The original request was made on the eth0 interface of the Pod-A namespace.

Next, the request goes through veth to the bridge in the root namespace.

As in the Pod-to-Pod section, the host makes a bitwise comparison. Because the service’s virtual IP is not part of the node’s CIDR, the packets are immediately forwarded through the default gateway as soon as they reach the bridge.

If the MAC address of the default gateway does not already appear in the lookup table, ARP resolution is performed to find out the MAC address of the default gateway.

Now something magical has happened.

Before the packet is routed by the node, Netfilter’s NF_IP_PRE_ROUTING hook is triggered and the iptables rules are executed. One of these rules performs DNAT and rewrites the destination IP address of Pod-A’s packet.

The service’s virtual IP address is rewritten to Pod-B’s IP address.
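Conceptually, the rewrite is equivalent to a DNAT rule like the one below; this is a hand-written simplification (kube-proxy actually generates chains such as KUBE-SERVICES and KUBE-SEP-* with load-balancing logic, and the IPs here are made up):

```
# "if the destination is the service VIP on port 80, rewrite it to Pod-B's IP"
iptables -t nat -A PREROUTING -p tcp -d 10.96.23.10 --dport 80 \
  -j DNAT --to-destination 10.244.4.13:80
```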

Next, the packet routing process is the same as the Pod-to-Pod communication.

After the packet is rewritten, the communication is pod-to-pod.

However, throughout all of this communication, a third kernel feature is involved.

This feature is called conntrack, or connection tracking.

When Pod-B sends back a response, conntrack associates the packet with an existing connection and traces it back to its source.

NAT relies heavily on conntrack.

Without connection tracking, it would not be known where to send the packet containing the response.

With conntrack, the return path of a packet can easily be given the same source or destination NAT changes.

The second half of the communication is the reverse of the path we just described.

Pod-B receives and processes the request, and now sends the data back to Pod-A.

What happens now?

Pod-B sends a response, setting its IP address as the source address and Pod-A’s IP address as the destination address.

When the packet arrives at the interface of the node where Pod-A resides, another NAT occurs.

At this point conntrack kicks in: the iptables rule performs SNAT, modifying the source IP address from Pod-B’s IP to the virtual IP of the original service.

To Pod-A, the response appears to come from the service rather than from Pod-B.

The rest is the same. Once SNAT is complete, the packet arrives at the bridge in the root namespace and is forwarded to Pod-A through the veth pair.

Let’s review the key points of this article: containers in a pod share a network namespace (and thus an IP address) that is created by the runtime and held by the pause container; veth pairs and a bridge connect pods on the same node; traffic between nodes leaves through the default gateway, determined by a bitwise AND of IP address and subnet mask; the CNI plugin automates all of this setup; and Services are implemented with Netfilter and iptables, using DNAT, SNAT, and conntrack to rewrite and track packets.

This article is reproduced from “Chen Shaowen’s Blog”; original: https://url.hi-linux.com/GQueR. Copyright belongs to the original author.
