etcd is a fast, distributed, and consistent key-value store that serves as the backing store for the persistent storage of Kubernetes object data such as pods, replication controllers, secrets, and services. In fact, etcd is the only place where Kubernetes stores cluster state and metadata. The only component that talks directly to etcd is the Kubernetes API Server. All other components read and write data in etcd indirectly, through the API Server.

etcd also implements a watch feature, which provides an event-based interface for asynchronously monitoring changes to keys. When a key changes, its watchers are notified. The API Server relies heavily on this to be notified of changes and to help move the current state of the cluster toward the desired state.
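The watch idea can be sketched with a small in-memory store. This is only an illustration under stated assumptions: `WatchableStore`, its methods, and the key path are hypothetical names made up for this example, not the etcd client API.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical callback signature: (key, previous value, new value).
Callback = Callable[[str, Optional[str], str], None]


class WatchableStore:
    """Toy in-memory key-value store that notifies watchers on every write."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._watchers: Dict[str, List[Callback]] = {}

    def watch(self, key: str, callback: Callback) -> None:
        """Register a callback that fires whenever `key` is written."""
        self._watchers.setdefault(key, []).append(callback)

    def put(self, key: str, value: str) -> None:
        """Store the value, then notify every watcher of that key."""
        previous = self._data.get(key)
        self._data[key] = value
        for callback in self._watchers.get(key, []):
            callback(key, previous, value)


events = []
store = WatchableStore()
key = "/registry/pods/default/web"  # illustrative key, assumed for this sketch
store.watch(key, lambda k, old, new: events.append((k, old, new)))
store.put(key, "Pending")
store.put(key, "Running")
store.put("/registry/pods/default/other", "Pending")  # nobody watches this key
print(events)
```

The watcher never polls the store; it only reacts when a write to its key arrives, which is the property the API Server relies on.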

Should the number of etcd instances be odd?

In an HA environment, you typically run 3, 5, or 7 etcd instances, but why? Because etcd is a distributed data store, it can be scaled horizontally, but the data must also stay consistent across all instances, and for that the instances need to agree on the state. etcd uses the Raft consensus algorithm for this [1].

The algorithm requires a majority (quorum) of the cluster to agree before it can move to the next state. With only 2 etcd instances, the failure of either one leaves the cluster without a majority, so it cannot transition to a new state. With 3 instances, one can fail and the remaining two still form a majority. An even number adds no fault tolerance: 4 instances need 3 for a majority, so they tolerate only one failure, the same as 3 instances.
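The majority arithmetic is easy to check with a few lines of Python. This is a minimal sketch; the function names `quorum` and `fault_tolerance` are chosen for this example only.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """Number of members that can fail while a majority is still reachable."""
    return members - quorum(members)


# Print the quorum size and tolerated failures for cluster sizes 1 through 7.
for n in range(1, 8):
    print(f"{n} members: quorum={quorum(n)}, tolerated failures={fault_tolerance(n)}")
```

Going from 3 to 4 members, or from 5 to 6, raises the quorum size without raising the number of failures the cluster can survive, which is why odd sizes are the usual choice.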



API Server

API Server is the only component in Kubernetes that interacts directly with etcd. All other Kubernetes components, as well as clients such as kubectl, must read and modify cluster state through the API Server. API Server provides the following features:

  • Provides a consistent way to store objects in etcd.
  • Validates these objects so that clients cannot store incorrectly configured objects, which could happen if they wrote directly to the etcd data store.
  • Provides RESTful APIs to create, read, update, or delete resources.
  • Provides optimistic concurrency locking, so that in the case of concurrent updates, changes to an object are never silently overwritten by other clients.
  • Authenticates and authorizes the requests sent by clients. It uses plugins to extract the client’s user name, user ID, and the groups the user belongs to, and determines whether the authenticated user can perform the requested action on the requested resource.
  • Performs admission control [2] if the request attempts to create, modify, or delete a resource. Examples: AlwaysPullImages, DefaultStorageClass, ResourceQuota, etc.
  • Implements a watch mechanism (similar to etcd’s) that clients use to observe changes. This allows components such as the scheduler and Controller Manager to interact with the API Server in a loosely coupled manner.
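The optimistic locking mentioned in the list can be illustrated with a version check on every update. The sketch below is a toy model, not the API Server’s implementation: `ObjectStore`, `StoredObject`, and `Conflict` are hypothetical names, and the integer `resource_version` only stands in for the version that Kubernetes keeps in an object’s metadata.

```python
from dataclasses import dataclass
from typing import Dict


class Conflict(Exception):
    """Raised when an update is based on a stale version of the object."""


@dataclass
class StoredObject:
    spec: dict
    resource_version: int


class ObjectStore:
    """Toy object store that rejects updates made against an old version."""

    def __init__(self) -> None:
        self._objects: Dict[str, StoredObject] = {}

    def create(self, name: str, spec: dict) -> int:
        self._objects[name] = StoredObject(dict(spec), 1)
        return 1

    def get(self, name: str) -> StoredObject:
        obj = self._objects[name]
        # Return a copy so that callers cannot modify the stored state directly.
        return StoredObject(dict(obj.spec), obj.resource_version)

    def update(self, name: str, spec: dict, expected_version: int) -> int:
        current = self._objects[name]
        if current.resource_version != expected_version:
            raise Conflict(f"{name}: expected version {expected_version}, "
                           f"found {current.resource_version}")
        current.spec = dict(spec)
        current.resource_version += 1
        return current.resource_version


store = ObjectStore()
store.create("web", {"replicas": 1})

# Two clients read the same version of the object.
client_a = store.get("web")
client_b = store.get("web")

# Client A writes first and succeeds.
store.update("web", {"replicas": 3}, client_a.resource_version)

# Client B still holds the old version, so its write is rejected.
try:
    store.update("web", {"replicas": 5}, client_b.resource_version)
    rejected = False
except Conflict:
    rejected = True
```

A client whose update is rejected re-reads the object, reapplies its change to the fresh copy, and retries, so no write is lost silently.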

Controller Manager

In Kubernetes, a controller is a control loop that watches the state of the cluster and then makes or requests changes as needed. Each controller attempts to move the current cluster state closer to the desired state. A controller tracks at least one Kubernetes resource type, and these objects have a spec field that represents the desired state.

Controller examples:


  • Replication Manager (controller for ReplicationController resources)
  • ReplicaSet, DaemonSet, and Job controllers
  • Deployment controller
  • StatefulSet controller
  • Node controller
  • Service controller
  • Endpoints controller
  • Namespace controller
  • PersistentVolume controller

Controllers use the watch mechanism to be notified of changes. They watch the API Server for changes to resources and act on each change, whether it is the creation of a new object or the update or deletion of an existing one. Most of the time, these actions include creating other resources or updating the watched resources themselves. Because watches do not guarantee that a controller will never miss an event, controllers also perform a periodic relist to make sure nothing has been missed.
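A control loop of this kind can be sketched as a pure function that compares desired and current state and returns the actions to take. Everything here is an assumption made for illustration: the function names, the `("create", ...)` and `("delete", ...)` action tuples, and the pod names are hypothetical and do not come from the Controller Manager code.

```python
from typing import Callable, Dict, List, Tuple

Action = Tuple[str, str]


def reconcile(owner: str, desired_replicas: int, running_pods: List[str]) -> List[Action]:
    """Return the actions that move the current state toward the desired state."""
    missing = desired_replicas - len(running_pods)
    if missing > 0:
        # Too few pods: request one creation per missing replica.
        return [("create", f"{owner}-pod") for _ in range(missing)]
    if missing < 0:
        # Too many pods: delete the surplus (-missing is the positive surplus count).
        surplus = sorted(running_pods)[:-missing]
        return [("delete", pod) for pod in surplus]
    return []


def resync(desired: Dict[str, int], list_pods: Callable[[str], List[str]]) -> List[Action]:
    """Periodic relist: reconcile every object even if no watch event arrived."""
    actions: List[Action] = []
    for owner, replicas in sorted(desired.items()):
        actions.extend(reconcile(owner, replicas, list_pods(owner)))
    return actions


current = {"web": ["web-a"], "db": ["db-a", "db-b", "db-c"]}
actions = resync({"web": 3, "db": 1}, lambda owner: current.get(owner, []))
print(actions)
```

Because `reconcile` looks only at the present state and not at the event that triggered it, running it again during a relist is harmless and recovers from any missed event.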

Controller Manager also performs lifecycle functions such as namespace creation and lifecycle management, event garbage collection, terminated-pod garbage collection, cascading-deletion garbage collection [3], node garbage collection, and so on.

The Scheduler

The scheduler is a control plane process that assigns pods to nodes. It watches for newly created pods that have no node assigned, and for each such pod it is responsible for finding the best node to run it on.

Nodes that meet a pod’s scheduling requirements are called feasible nodes. If no node is suitable, the pod remains unscheduled until the scheduler is able to place it. Once it has found the feasible nodes, the scheduler runs a set of functions to score them and selects the node with the highest score. It then notifies the API Server of the selected node, a process called binding.

The selection of nodes is a two-step process:

  1. Filtering: filter the list of all nodes to obtain the acceptable nodes that the pod can be scheduled to. (For example, the PodFitsResources filter checks whether a candidate node has enough available resources to satisfy the pod’s specific resource requests.)
  2. Scoring: score the nodes obtained in step 1 and rank them to select the best node. If several nodes share the highest score, one of them is picked in a way that spreads pods evenly across nodes (round-robin in older releases, at random in current ones).
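The two steps can be sketched as a filter followed by a score. This is a simplified illustration, not the kube-scheduler code: the `Node` and `Pod` classes, the resource fields, and the scoring formula are all assumptions made for the example, and ties simply go to the first node with the highest score.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    free_cpu_millis: int
    free_memory_mib: int


@dataclass
class Pod:
    name: str
    cpu_millis: int
    memory_mib: int


def fits_resources(pod: Pod, node: Node) -> bool:
    """Filter step: does the node have room for the pod's resource requests?"""
    return (node.free_cpu_millis >= pod.cpu_millis
            and node.free_memory_mib >= pod.memory_mib)


def score(pod: Pod, node: Node) -> int:
    """Score step (toy formula): prefer the node with the most resources left over."""
    cpu_left = node.free_cpu_millis - pod.cpu_millis
    memory_left = node.free_memory_mib - pod.memory_mib
    return cpu_left + memory_left


def select_node(pod: Pod, nodes: List[Node]) -> Optional[str]:
    """Return the name of the best feasible node, or None if the pod cannot be placed."""
    feasible = [node for node in nodes if fits_resources(pod, node)]
    if not feasible:
        return None  # the pod stays unscheduled
    best = max(feasible, key=lambda node: score(pod, node))
    return best.name


nodes = [
    Node("node-a", free_cpu_millis=500, free_memory_mib=1024),
    Node("node-b", free_cpu_millis=2000, free_memory_mib=4096),
    Node("node-c", free_cpu_millis=4000, free_memory_mib=512),
]
chosen = select_node(Pod("web", cpu_millis=1000, memory_mib=1024), nodes)
print(chosen)
```

In this example node-a is filtered out for lack of CPU and node-c for lack of memory, so only node-b reaches the scoring step.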

Factors to consider when making scheduling decisions include:

  • Can the node satisfy the pod’s requests for hardware/software resources?
  • Does the node report memory or disk pressure conditions?
  • Does the node have a label that matches the node selector in the pod specification?
  • If the pod requests to be bound to a specific host port, is that port already in use on the node?
  • Does the pod tolerate the node’s taints?
  • Does the pod specify node affinity or anti-affinity rules? And so on.


The scheduler does not instruct the selected node to run the pod. All the scheduler does is update the pod definition through the API Server. The API Server then notifies the kubelet, through the watch mechanism, that the pod has been scheduled. The kubelet service on the target node sees that the pod has been scheduled to its node, and it creates and runs the pod’s containers.

Kubelet

The kubelet is an agent that runs on each node in the cluster and is the component responsible for everything that runs on the worker nodes. It ensures that containers are running in pods.

The main functions of the kubelet service are:

  1. Registers the node it runs on by creating a Node resource in the API Server.
  2. Continuously watches the API Server for pods that have been scheduled to the node.
  3. Starts the pod’s containers using the configured container runtime.
  4. Continuously monitors running containers and reports their status, events, and resource consumption to the API Server.
  5. Runs container liveness probes and restarts a container when its probe fails, terminates containers when their pod is deleted from the API Server, and notifies the server that the pod has terminated.


Kube-proxy

kube-proxy runs on each node and makes sure that one pod can talk to another pod, one node to another node, one container to another container, and so on. It is responsible for watching the API Server for changes to Service and pod definitions in order to keep the whole network configuration up to date. When a Service is backed by more than one pod, the proxy load-balances across those pods.

kube-proxy got its name because it was originally an actual proxy server that accepted connections and proxied them to pods. The current implementation uses iptables or IPVS rules to redirect packets to a randomly selected backend pod without passing them through an actual proxy server.

  1. When you create a service, a virtual IP address is immediately assigned to it.
  2. The API Server notifies the kube-proxy agents running on the worker nodes that a new service has been created.
  3. Each kube-proxy makes the service addressable by setting up iptables rules, which ensure that each service IP/port pair is intercepted and that the destination address is rewritten to one of the pods backing the service.
  4. Each kube-proxy watches the API Server for changes to the service or its Endpoints objects.
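The effect of these rules can be modeled as a lookup table that rewrites a service address to one of its backends. The sketch below is only a toy model in Python: `ServiceTable` and its methods are hypothetical names, the IP addresses are made up, and no real iptables or IPVS rules are involved.

```python
import random
from typing import Dict, List, Optional, Tuple

Address = Tuple[str, int]  # (IP, port)


class ServiceTable:
    """Toy stand-in for the rules kube-proxy programs on a node."""

    def __init__(self, rng: Optional[random.Random] = None) -> None:
        self._backends: Dict[Address, List[Address]] = {}
        self._rng = rng or random.Random()

    def set_endpoints(self, service: Address, backends: List[Address]) -> None:
        """Called when the service or its endpoints change."""
        self._backends[service] = list(backends)

    def remove_service(self, service: Address) -> None:
        self._backends.pop(service, None)

    def rewrite_destination(self, destination: Address) -> Address:
        """Replace a service IP/port pair with a randomly selected backend pod."""
        backends = self._backends.get(destination)
        if not backends:
            # Not a known service address (or no backends in this toy model).
            return destination
        return self._rng.choice(backends)


service = ("10.96.0.10", 80)
pods = [("10.244.1.5", 8080), ("10.244.2.7", 8080)]

table = ServiceTable()
table.set_endpoints(service, pods)
target = table.rewrite_destination(service)

# An endpoint change replaces the backend list for the service.
table.set_endpoints(service, [("10.244.3.9", 8080)])
target_after_update = table.rewrite_destination(service)
```

Traffic to any address that is not a service IP/port pair passes through the table unchanged.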

Container Runtime

A container runtime that focuses on running containers and setting up their namespaces and cgroups is called a low-level container runtime. A container runtime that focuses on image formats, on unpacking, managing, and sharing images, and on providing APIs that meet developer needs is called a high-level container runtime (container engine).

The container runtime is responsible for:

  1. Pulls the container image required by the container from the image registry if it is not available locally.
  2. Extracts the image onto a copy-on-write file system, where all container layers are overlaid on each other to create a merged file system.


  7. Asks the kernel to assign some kind of isolation to the container, such as process, network, and file system isolation.
  8. Asks the kernel to apply resource limits, such as CPU or memory limits.
  9. Passes a system call (syscall) to the kernel to start the container.
  10. Ensures that the SELinux/AppArmor settings are correct.



[1] Raft consensus algorithm
[2] Admission control
[3] Cascading deletion garbage collection




