Containers could have become a replacement for lightweight virtual machines. However, due to Docker/OCI standardization, the most widely used form of container runs just one service per container. This approach has many advantages: increased isolation, simplified horizontal scaling, higher reusability, etc. However, it also has a big drawback: under normal circumstances, virtual (or physical) machines rarely run only one service.

While Docker tried to provide some workarounds for creating multi-service containers, Kubernetes took a bolder step and chose a set of cohesive containers, called a pod, as the smallest deployable unit.

When I stumbled upon Kubernetes a few years ago, my previous virtual machine and bare-metal experience helped me pick up the idea of pods very quickly.

One of the first things you’ll learn when you’re new to Kubernetes is that each pod has a unique IP and hostname, and within the same pod, containers can communicate with each other via localhost. So, obviously, a pod is like a miniature server.

However, over time, you’ll find that each container in a pod has an isolated file system, and from inside one container, you can’t see processes running in other containers in the same pod. All right! Maybe a pod isn’t a miniature server, but just a set of containers with a shared networking stack.

But then you learn that containers in a pod can communicate through shared memory! So the network namespace is not the only thing that can be shared between containers…

Based on these findings, I decided to dive in and figure out:

  • what the actual difference between a pod and a container is
  • how pods are implemented under the hood
  • how to create pods using Docker

Along the way, I hope it will help me solidify my Linux, Docker, and Kubernetes skills.


 
1. Exploring containers

The OCI runtime specification does not limit container implementations to Linux containers, i.e. containers implemented using namespaces and cgroups. However, unless expressly stated otherwise, the term container in this article refers to this rather traditional form.

Before we look at the namespaces and cgroups that make up a container, let's quickly set up an experimental environment:

$ cat > Vagrantfile <<EOF
# -*- mode: ruby -*-
# vi: set ft=ruby :

Vagrant.configure("2") do |config|
  config.vm.box = "debian/buster64"
  config.vm.hostname = "docker-host"
  config.vm.define "docker-host"
  config.vagrant.plugins = ['vagrant-vbguest']

  config.vm.provider "virtualbox" do |vb|
    vb.cpus = 2
    vb.memory = "2048"
  end

  config.vm.provision "shell", inline: <<-SHELL
    apt-get update
    apt-get install -y curl vim
  SHELL

  config.vm.provision "docker"
end
EOF

$ vagrant up

$ vagrant ssh

Finally, let's start a container:

$ docker run --name foo --rm -d --memory='512MB' --cpus='0.5' nginx

Exploring the container's namespaces

Let's first look at which isolation primitives were created when the container started:

# Look up the container in the process tree.
$ ps auxf
USER       PID  ...  COMMAND
...
root      4707       /usr/bin/containerd-shim-runc-v2 -namespace moby -id cc9466b3e...
root      4727        \_ nginx: master process nginx -g daemon off;
systemd+  4781            \_ nginx: worker process
systemd+  4782            \_ nginx: worker process

# Find the namespaces used by the 4727 process.
$ sudo lsns
        NS TYPE   NPROCS   PID USER    COMMAND
...
4026532157 mnt         3  4727 root    nginx: master process nginx -g daemon off;
4026532158 uts         3  4727 root    nginx: master process nginx -g daemon off;
4026532159 ipc         3  4727 root    nginx: master process nginx -g daemon off;
4026532160 pid         3  4727 root    nginx: master process nginx -g daemon off;
4026532162 net         3  4727 root    nginx: master process nginx -g daemon off;

We can see that the following namespaces are used to isolate the container above:

  • mnt (mount): the container has an isolated mount table.

  • uts (UNIX Time-Sharing): the container has its own hostname and domain name.

  • ipc (interprocess communication): processes inside the container can communicate via system-level IPC only with other processes inside the same container.

  • pid (process ID): processes inside the container can only see other processes inside the same container (i.e., in the same pid namespace).

  • net (network): the container has its own networking stack.

Note that the user namespace is not on the list, even though the OCI runtime specification mentions support for it. While Docker can use the user namespace for its containers, it doesn't do so by default due to some inherent limitations. Therefore, the root user in the container is most likely the root user of the host system. Beware!
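A quick way to convince yourself; a sketch, assuming the foo container from above is still running:

# The container's root reports uid 0...
$ docker exec foo id -u
0
# ...and the container's main process, seen from the host, runs as root too.
$ ps -o user= -p $(docker inspect --format '{{.State.Pid}}' foo)
root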

Another namespace missing here is cgroup. It took me a while to understand that the cgroup namespace is not the same as the cgroups mechanism. The cgroup namespace only provides an isolated view of the container's cgroup hierarchy. Again, Docker supports putting containers into private cgroup namespaces, but doesn't do so by default.
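If you want to try the private cgroup namespace mode yourself, newer Docker versions (20.10+) expose a flag for it; a minimal sketch:

# Start a second container in a private cgroup namespace.
$ docker run --name bar --rm -d --cgroupns private nginx

# lsns should now show a dedicated cgroup namespace for bar's processes.
$ sudo lsns -t cgroup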

Exploring the container's cgroups

Linux namespaces can make the processes in the container think they are running on a dedicated machine. However, not seeing other processes doesn't mean being unaffected by them: some resource-hungry neighbors may unexpectedly consume too much of the host's shared resources.

That's where cgroups come to help!

You can view the cgroup limits for a given process by examining the corresponding subtree in the cgroup virtual file system. Cgroupfs is usually mounted at /sys/fs/cgroup, and the process-specific part can be viewed at /proc/<PID>/cgroup:

$ PID=$(docker inspect --format '{{.State.Pid}}' foo)

# Check the cgroupfs nodes for the container's main process (4727).
$ cat /proc/${PID}/cgroup
11:freezer:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
10:blkio:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
9:rdma:/
8:pids:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
7:devices:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
6:cpuset:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
5:cpu,cpuacct:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
4:memory:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
3:net_cls,net_prio:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
2:perf_event:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
1:name=systemd:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
0::/system.slice/containerd.service

It seems that Docker uses the /docker/<container-id> pattern for its cgroup hierarchy. Well, anyway:

$ ID=$(docker inspect --format '{{.Id}}' foo)

# Check the memory limit.
$ cat /sys/fs/cgroup/memory/docker/${ID}/memory.limit_in_bytes
536870912  # Yay! It's the 512MB we requested!

# See the CPU limits.
$ ls /sys/fs/cgroup/cpu/docker/${ID}
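For instance, the --cpus='0.5' flag from our docker run command materializes as a CFS quota/period pair; a quick sketch of what to expect (cgroup v1 file names):

$ cat /sys/fs/cgroup/cpu/docker/${ID}/cpu.cfs_period_us
100000
$ cat /sys/fs/cgroup/cpu/docker/${ID}/cpu.cfs_quota_us
50000   # 50000/100000 = the 0.5 CPU we requested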

Interestingly, starting a container without explicitly setting any resource limits still configures a cgroup. I haven't checked it in practice, but my guess is that CPU and RAM consumption is unlimited by default, while cgroups might still be used to restrict access to certain devices from inside the container.
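That guess is easy to test; a sketch (the exact "unlimited" value may vary by kernel):

# Start a container with no explicit limits and check its memory cgroup.
$ docker run --name baz --rm -d nginx
$ ID2=$(docker inspect --format '{{.Id}}' baz)
$ cat /sys/fs/cgroup/memory/docker/${ID2}/memory.limit_in_bytes
# Expect a huge number (effectively unlimited) instead of 536870912.

# The devices controller, in contrast, carries a real allowlist.
$ cat /sys/fs/cgroup/devices/docker/${ID2}/devices.list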

Here is the container as I pictured it after this investigation:

[Diagram: a container as a group of processes isolated by namespaces and restricted by cgroups]


 
2. Exploring pods

Now let's take a look at Kubernetes pods. Like containers, pod implementations can vary between CRI runtimes: for example, when Kata Containers are used as the runtime class, some pods can be real virtual machines! And as expected, VM-based pods differ from traditional Linux container implementations in both implementation and capabilities.

To keep the comparison between containers and pods fair, we'll explore a Kubernetes cluster that uses the containerd/runc runtime. This is also the mechanism Docker uses to run containers under the hood.

Setting up the playground

This time we'll use minikube with the VirtualBox driver and the containerd runtime to set up the experimental environment. To quickly install minikube and kubectl, we can use the arkade tool written by Alex Ellis:

# Install arkade.
$ curl -sLS https://get.arkade.dev | sh
$ arkade get kubectl minikube

$ minikube start --driver virtualbox --container-runtime containerd

The experiment's pod can be created as follows (the manifest below is reconstructed from the container names, images, commands, and memory limits referenced throughout this article):

$ kubectl --context=minikube apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: foo
spec:
  containers:
    - name: app
      image: kennethreitz/httpbin
      ports:
        - containerPort: 80
      resources:
        limits:
          memory: "256Mi"
    - name: sidecar
      image: curlimages/curl
      command: ["/bin/sleep", "3650d"]
      resources:
        limits:
          memory: "128Mi"
EOF

Exploring the pod's containers

The actual inspection of the pod should be done on the Kubernetes cluster node:

$ minikube ssh

Let's find the pod's processes there:

$ ps auxf
USER       PID  ...  COMMAND
...
root      4947         \_ containerd-shim -namespace k8s.io -workdir /mnt/sda1/var/lib/containerd/...
root      4966             \_ /pause
root      4981         \_ containerd-shim -namespace k8s.io -workdir /mnt/sda1/var/lib/containerd/...
root      5001             \_ /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
root      5016                 \_ /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
root      5018         \_ containerd-shim -namespace k8s.io -workdir /mnt/sda1/var/lib/containerd/...
100       5035             \_ /bin/sleep 3650d

Based on the similarity of their start times, the three process groups above were most likely created during pod startup. This is interesting because the manifest specifies only two containers, httpbin and sleep.
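One way to check this timing guess; a sketch using the PIDs from the listing above (assuming the node's ps supports these options):

$ ps -o pid,lstart,comm -p 4947,4966,4981,5001,5018,5035
# If the pod theory holds, all of these started within seconds of each other.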

The above findings can be cross-checked using containerd's command-line client, ctr:

$ sudo ctr --namespace=k8s.io containers ls
CONTAINER      IMAGE                                   RUNTIME
...
097d4fe8a7002  docker.io/curlimages/curl@sha256:1a220  io.containerd.runtime.v1.linux
...
dfb1cd29ab750  docker.io/kennethreitz/httpbin:latest   io.containerd.runtime.v1.linux
...
f0e87a9330466  k8s.gcr.io/pause:3.1                    io.containerd.runtime.v1.linux

Indeed, three containers were created. Meanwhile, crictl, another command-line tool that speaks CRI, finds only two containers:

$ sudo crictl ps
CONTAINER      IMAGE          CREATED            STATE    NAME     ATTEMPT  POD ID
097d4fe8a7002  bcb0c26a91c90  About an hour ago  Running  sidecar  0        f0e87a9330466
dfb1cd29ab750  b138b9264903f  About an hour ago  Running  app      0        f0e87a9330466

Note, however, that the POD ID field above is the same as the pause:3.1 container ID in the ctr output. So, it looks like this pause container is some sort of auxiliary container. What is it for?

I haven't noticed anything in the OCI runtime specification that corresponds to pods. So, whenever the information provided by the Kubernetes API specification doesn't satisfy me, I go directly to the Kubernetes Container Runtime Interface (CRI) protobuf file:

// kubelet expects any compatible container runtime
// to implement the following gRPC methods:

service RuntimeService {
    // ...
    rpc RunPodSandbox(RunPodSandboxRequest) returns (RunPodSandboxResponse) {}
    rpc StopPodSandbox(StopPodSandboxRequest) returns (StopPodSandboxResponse) {}
    rpc RemovePodSandbox(RemovePodSandboxRequest) returns (RemovePodSandboxResponse) {}
    rpc PodSandboxStatus(PodSandboxStatusRequest) returns (PodSandboxStatusResponse) {}
    rpc ListPodSandbox(ListPodSandboxRequest) returns (ListPodSandboxResponse) {}

    rpc CreateContainer(CreateContainerRequest) returns (CreateContainerResponse) {}
    rpc StartContainer(StartContainerRequest) returns (StartContainerResponse) {}
    rpc StopContainer(StopContainerRequest) returns (StopContainerResponse) {}
    rpc RemoveContainer(RemoveContainerRequest) returns (RemoveContainerResponse) {}
    rpc ListContainers(ListContainersRequest) returns (ListContainersResponse) {}
    rpc ContainerStatus(ContainerStatusRequest) returns (ContainerStatusResponse) {}
    rpc UpdateContainerResources(UpdateContainerResourcesRequest) returns (UpdateContainerResourcesResponse) {}
    rpc ReopenContainerLog(ReopenContainerLogRequest) returns (ReopenContainerLogResponse) {}
    // ...
}

message CreateContainerRequest {
    // ID of the PodSandbox in which the container should be created.
    string pod_sandbox_id = 1;
    // Config of the container.
    ContainerConfig config = 2;
    // Config of the PodSandbox. This is the same config that was passed
    // to RunPodSandboxRequest to create the PodSandbox. It is passed again
    // here just for easy reference. The PodSandboxConfig is immutable and
    // remains the same throughout the lifetime of the pod.
    PodSandboxConfig sandbox_config = 3;
}

So, a pod is actually made up of a sandbox plus the containers running inside that sandbox. The sandbox manages the resources common to all containers in the pod, and the pause container is started during the RunPodSandbox() call. A quick Internet search reveals that this container is just an idle process.
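We can glance at the pause process on the node to see there's not much to it; a minimal sketch reusing PID 4966 from the listing above:

# The pause process is a tiny binary whose only job is to sleep forever,
# keeping the pod's shared namespaces alive.
$ cat /proc/4966/comm
pause
$ sudo ls -l /proc/4966/ns   # the namespaces the other containers will join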

Exploring the pod's namespaces

Here is what the namespaces on the cluster node look like:

$ sudo lsns
        NS TYPE   NPROCS   PID USER    COMMAND
4026532614 net         4  4966 root    /pause
4026532715 mnt         1  4966 root    /pause
4026532716 uts         4  4966 root    /pause
4026532717 ipc         4  4966 root    /pause
4026532718 pid         1  4966 root    /pause
4026532719 mnt         2  5001 root    /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
4026532720 pid         2  5001 root    /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
4026532721 mnt         1  5035 100     /bin/sleep 3650d
4026532722 pid         1  5035 100     /bin/sleep 3650d

The first part looks much like a plain Docker container: the pause container has five namespaces: net, mnt, uts, ipc, and pid. But apparently the httpbin and sleep containers have only two namespaces each: mnt and pid. What's going on?

lsns turns out not to be the best tool for examining process namespaces. Instead, to check the namespaces used by a specific process, look at /proc/${pid}/ns:

# httpbin container
$ sudo ls -l /proc/5001/ns
...
lrwxrwxrwx 1 root root 0 Oct 24 14:05 ipc -> 'ipc:[4026532717]'
lrwxrwxrwx 1 root root 0 Oct 24 14:05 mnt -> 'mnt:[4026532719]'
lrwxrwxrwx 1 root root 0 Oct 24 14:05 net -> 'net:[4026532614]'
lrwxrwxrwx 1 root root 0 Oct 24 14:05 pid -> 'pid:[4026532720]'
lrwxrwxrwx 1 root root 0 Oct 24 14:05 uts -> 'uts:[4026532716]'

# sleep container
$ sudo ls -l /proc/5035/ns
...
lrwxrwxrwx 1 100 101 0 Oct 24 14:05 ipc -> 'ipc:[4026532717]'
lrwxrwxrwx 1 100 101 0 Oct 24 14:05 mnt -> 'mnt:[4026532721]'
lrwxrwxrwx 1 100 101 0 Oct 24 14:05 net -> 'net:[4026532614]'
lrwxrwxrwx 1 100 101 0 Oct 24 14:05 pid -> 'pid:[4026532722]'
lrwxrwxrwx 1 100 101 0 Oct 24 14:05 uts -> 'uts:[4026532716]'

It's not easy to notice, but the httpbin and sleep containers actually reuse the net, uts, and ipc namespaces of the pause container!

We can cross-check this with crictl:

# Inspect the httpbin container.
$ sudo crictl inspect dfb1cd29ab750
{
  ...
  "namespaces": [
    {
      "type": "pid"
    },
    {
      "type": "ipc",
      "path": "/proc/4966/ns/ipc"
    },
    {
      "type": "uts",
      "path": "/proc/4966/ns/uts"
    },
    {
      "type": "mount"
    },
    {
      "type": "network",
      "path": "/proc/4966/ns/net"
    }
  ],
  ...
}

# Inspect the sleep container.
$ sudo crictl inspect 097d4fe8a7002
...

I think the above findings perfectly explain the ability of containers in the same pod to communicate:

  • via localhost and/or
  • using IPC means (shared memory, message queues, etc.)
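The localhost path is easy to verify from outside the cluster; a sketch, assuming the pod and container names from the manifest above:

# From the sidecar container, the app container's server answers on localhost.
$ kubectl --context=minikube exec foo -c sidecar -- curl -s http://localhost:80/get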

However, after seeing how freely all these namespaces can be reused between containers, I started to suspect that the default boundaries can be loosened. Indeed, a deeper reading of the Pod API specification reveals that with the shareProcessNamespace flag set to true, a pod's containers share four namespaces instead of the default three. And there's an even more striking finding: the hostIPC, hostNetwork, and hostPID flags can make containers use the corresponding host namespaces.
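For illustration, here is a minimal sketch of a manifest exercising those flags (a hypothetical pod, not part of the experiment above):

$ kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: relaxed-boundaries  # hypothetical example
spec:
  shareProcessNamespace: true  # containers also share the pid namespace
  hostNetwork: false           # set to true to join the node's networking stack
  hostIPC: false               # set to true to join the node's ipc namespace
  containers:
    - name: app
      image: kennethreitz/httpbin
    - name: sidecar
      image: curlimages/curl
      command: ["/bin/sleep", "3650d"]
EOF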

Interestingly, the CRI API specification seems to be even more flexible. At least syntactically, it allows the net, pid, and ipc namespaces to be scoped to the container, the pod, or the node. So one could potentially build a pod whose containers cannot communicate with each other through localhost!

Exploring the pod's cgroups

What do the pod's cgroups look like? systemd-cgls can nicely visualize the cgroups hierarchy:

$ sudo systemd-cgls
Control group /:
-.slice
├─kubepods
│ ├─burstable
│ │ ├─pod4a8d5c3e-3821-4727-9d20-965febbccfbb
│ │ │ ├─f0e87a93304666766ab139d52f10ff2b8d4a1e6060fc18f74f28e2cb000da8b2
│ │ │ │ └─4966 /pause
│ │ │ ├─dfb1cd29ab750064ae89613cb28963353c3360c2df913995af582aebcc4e85d8
│ │ │ │ ├─5001 /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
│ │ │ │ └─5016 /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
│ │ │ └─097d4fe8a7002d69d6c78899dcf6731d313ce8067ae3f736f252f387582e55ad
│ │ │   └─5035 /bin/sleep 3650d
...

So the pod itself has a parent cgroup node, and every container's limits can be tuned independently. This is what I expected, since in the pod manifest resource limits can be set separately for each container in the pod.
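We can cross-check one of the limits from the manifest; a sketch using cgroup v1 paths and the IDs from the listing above:

# The app container's 256Mi limit lives under the pod's cgroup subtree.
$ cat /sys/fs/cgroup/memory/kubepods/burstable/pod4a8d5c3e-3821-4727-9d20-965febbccfbb/dfb1cd29ab750064ae89613cb28963353c3360c2df913995af582aebcc4e85d8/memory.limit_in_bytes
268435456   # 256Mi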

At this point, the pod in my head looks like this:

[Diagram: a pod as a group of containers sharing the net, uts, and ipc namespaces of a pause container, under a common cgroup parent]


 
3. Implementing pods with Docker

If the underlying implementation of a pod is a set of semi-fused containers with a common cgroup parent, can I reproduce a pod-like construct using plain Docker?

Recently I tried something similar to make multiple containers listen on the same socket, and I know Docker allows creating a container that reuses the network namespace of an existing container via the docker run --network container:<other-container> syntax. I also know that the OCI runtime specification only defines the create and start commands.

So when you use docker exec to execute a command in an existing container, you're actually running (i.e., creating and then starting) a brand-new container that happens to reuse all the namespaces of the target container (see proofs 1[1] and 2[2]). This gives me a lot of confidence that pods can be built with standard Docker commands.

We can use any machine with just Docker installed as the experimental environment. Here I'll also use an additional package to simplify working with cgroups:

$ sudo apt-get install cgroup-tools

First, let's create a parent cgroup entry. For brevity, I'll use only the cpu and memory controllers:

$ sudo cgcreate -g cpu,memory:/pod-foo

# Check if the corresponding folders were created:
$ ls -l /sys/fs/cgroup/cpu/pod-foo/
$ ls -l /sys/fs/cgroup/memory/pod-foo/

Then we create a sandbox container:

$ docker run -d --rm \
  --name foo_sandbox \
  --cgroup-parent /pod-foo \
  --ipc 'shareable' \
  alpine sleep infinity

Finally, let's start the actual containers that reuse the sandbox container's namespaces:

# app (httpbin)
$ docker run -d --rm \
  --name app \
  --cgroup-parent /pod-foo \
  --network container:foo_sandbox \
  --ipc container:foo_sandbox \
  kennethreitz/httpbin

# sidecar (sleep)
$ docker run -d --rm \
  --name sidecar \
  --cgroup-parent /pod-foo \
  --network container:foo_sandbox \
  --ipc container:foo_sandbox \
  curlimages/curl sleep 365d

Have you noticed which namespace I omitted? Right: the uts namespace can't be shared between containers; this doesn't seem to be possible with the docker run command at the moment. That's a bit of a pity. But apart from the uts namespace, it's a success!
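A quick check confirms the leftover isolation; a sketch:

# Each container kept its own uts namespace, so the hostnames differ.
$ docker exec foo_sandbox hostname
$ docker exec app hostname
$ docker exec sidecar hostname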

The resulting cgroups look a lot like the ones created by Kubernetes:

$ sudo systemd-cgls memory
Controller memory; Control group /:
├─pod-foo
│ ├─488d76cade5422b57ab59116f422d8483d435a8449ceda0c9a1888ea774acac7
│ │ ├─27865 /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
│ │ └─27880 /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
│ ├─9166a87f9a96a954b10ec012104366da9f1f6680387ef423ee197c61d37f39d7
│ │ └─27977 sleep 365d
│ └─c7b0ec46b16b52c5e1c447b77d67d44d16d78f9a3f93eaeb3a86aa95e08e28b6
│   └─27743 sleep infinity

The global namespace list also looks similar:

$ sudo lsns
        NS TYPE   NPROCS   PID USER    COMMAND
...
4026532157 mnt         1 27743 root    sleep infinity
4026532158 uts         1 27743 root    sleep infinity
4026532159 ipc         4 27743 root    sleep infinity
4026532160 pid         1 27743 root    sleep infinity
4026532162 net         4 27743 root    sleep infinity
4026532218 mnt         2 27865 root    /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
4026532219 uts         2 27865 root    /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
4026532220 pid         2 27865 root    /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
4026532221 mnt         1 27977 _apt    sleep 365d
4026532222 uts         1 27977 _apt    sleep 365d
4026532223 pid         1 27977 _apt    sleep 365d

The httpbin and sidecar containers appear to share the IPC and NET namespaces:

# app container
$ sudo ls -l /proc/27865/ns
...
lrwxrwxrwx 1 root root 0 Oct 28 07:56 ipc -> 'ipc:[4026532159]'
lrwxrwxrwx 1 root root 0 Oct 28 07:56 mnt -> 'mnt:[4026532218]'
lrwxrwxrwx 1 root root 0 Oct 28 07:56 net -> 'net:[4026532162]'
lrwxrwxrwx 1 root root 0 Oct 28 07:56 pid -> 'pid:[4026532220]'
lrwxrwxrwx 1 root root 0 Oct 28 07:56 uts -> 'uts:[4026532219]'

# sidecar container
$ sudo ls -l /proc/27977/ns
...
lrwxrwxrwx 1 _apt systemd-journal 0 Oct 28 07:56 ipc -> 'ipc:[4026532159]'
lrwxrwxrwx 1 _apt systemd-journal 0 Oct 28 07:56 mnt -> 'mnt:[4026532221]'
lrwxrwxrwx 1 _apt systemd-journal 0 Oct 28 07:56 net -> 'net:[4026532162]'
lrwxrwxrwx 1 _apt systemd-journal 0 Oct 28 07:56 pid -> 'pid:[4026532223]'
lrwxrwxrwx 1 _apt systemd-journal 0 Oct 28 07:56 uts -> 'uts:[4026532222]'
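And a final functional test; a sketch mirroring what containers in a real pod can do:

# Thanks to the shared network namespace, the sidecar reaches httpbin on
# localhost, just like containers in a Kubernetes pod.
$ docker exec sidecar curl -s http://localhost:80/get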


 
4. Summary

Containers and pods are alike. Under the hood, they rely heavily on Linux namespaces and cgroups. However, pods are more than just groups of containers. A pod is a self-sufficient, higher-level construct: all of a pod's containers run on the same machine (cluster node), their lifecycles are synchronized, and communication between them is simplified by relaxing the isolation. This brings pods closer to traditional VMs, bringing back familiar deployment patterns like sidecars or reverse proxies.

Related links:

  1. https://github.com/opencontainers/runtime-spec/issues/345
  2. https://github.com/opencontainers/runtime-spec/pull/388