
Companies in the industry today seem to fall into two Kubernetes camps: those that already use it heavily for production workloads, and those that are migrating their workloads to it.

The problem with Kubernetes is that it’s not a monolithic system like Redis, RabbitMQ, or PostgreSQL. It is a combination of several control plane components (e.g. etcd, the API server) that run our workloads on the user (data) plane through a set of VMs. At first glance, the number of metrics from control plane components, virtual machines, and workloads can be overwhelming. Forming a comprehensive observability stack from these metrics requires good knowledge and experience in managing Kubernetes clusters.

So how do you deal with this huge number of metrics? Reading this article can be a good starting point 🙂


In this article, we’ll cover the most critical metrics based on K8s metadata, which form a good baseline for monitoring workloads and ensuring they are in a healthy state. To make these metrics available, you need to install kube-state-metrics and Prometheus to scrape the metrics it exposes and store them for later querying. We’re not going to cover the installation process here, but a good bootstrap is the Prometheus Helm Chart, which installs both with default settings.

The most critical Kubernetes metrics to monitor

For each metric listed, we’ll cover what the metric means, why you should care about it, and how to set an alert based on it.

CPU/Memory Requests vs. Actual Usage

Each container can define requests for CPU and memory. The Kubernetes scheduler uses these requests to select a node that can host the pod. It does this by considering the node’s unreserved resources: its capacity minus the requests of the pods currently scheduled on it.

Let’s look at a concrete example: say you have a node with 8 CPU cores, running 3 pods, each with a container requesting 1 CPU core. The node has 5 unreserved CPU cores for the scheduler to use when placing new pods.
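The scheduler’s bookkeeping in this example can be sketched in a few lines of Python (the function names are illustrative, not Kubernetes API calls):

```python
def unreserved_cpu(node_cores: float, pod_requests: list[float]) -> float:
    """CPU cores on the node not yet reserved by scheduled pods."""
    return node_cores - sum(pod_requests)

def fits(node_cores: float, pod_requests: list[float], new_request: float) -> bool:
    """Would a pod requesting `new_request` cores be schedulable on this node?"""
    return new_request <= unreserved_cpu(node_cores, pod_requests)

# The example from the text: an 8-core node running 3 pods of 1 core each.
print(unreserved_cpu(8, [1, 1, 1]))  # 5 cores remain unreserved
print(fits(8, [1, 1, 1], 6))         # False: a 6-core pod does not fit
```

Note that the real scheduler works against the node’s *allocatable* capacity and considers many other constraints (taints, affinity, etc.); this only models the request arithmetic described above.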

5 cores are available for other pods

Keep in mind that “available” does not refer to actual usage, but to CPU cores that have not been requested (reserved) by pods currently scheduled on the node. A pod that requests 6 CPU cores will not be scheduled on this node because there are not enough available CPU cores to host it.

The actual usage metric tracks how many resources a pod consumes while running. When we measure actual usage, it usually spans a set of pods (a deployment, statefulset, etc.), so we should refer to percentiles rather than the usage of individual pods. P90 is a good cut-off for this purpose. For example, a deployment that requests 1 CPU core per pod might actually use 0.7 cores at the P90 of its replicas.

It is important to keep requests consistent with actual usage. Requests that are higher than actual usage result in underutilization of resources. Think about what happens when a pod requests 4 cores but uses 1 core at the 90th percentile. K8s may schedule this pod to a node with 4 cores to spare, meaning that no other pod will be able to use the 3 reserved but unused cores. In the graph below, we can clearly see that each pod reserves 4 cores but actually uses one, which means we “waste” 3 cores per pod (6 cores on the node in total) that will remain unused.

Requests higher than actual usage result in underutilization

The same goes for memory – if we set the request higher than the usage, the difference is reserved but never used.

The opposite case is pods requesting less than their actual usage (overcommitment). In the case of CPU overcommitment, your application will run slower due to insufficient resources on the node. Imagine 3 pods that each request 1 core but actually use 3 cores. These 3 pods can be scheduled onto an 8-core machine (1 core × 3 = 3 < 8), but once running they will compete for CPU time because their combined actual usage (9 cores) exceeds the number of cores on the node.

The actual usage of pods exceeds the number of cores on one node

How to solve it? Let’s define pod requests as 100%. A reasonable range for actual usage (CPU or memory, it doesn’t matter) is 60%–80% at the 90th percentile. For example, if you have a pod that requests 10GB of memory, its P90 actual usage should be 6GB–8GB. If usage is below 6GB, you are underutilizing your memory and wasting money. If it’s above 8GB, you’re at risk of being OOMKilled due to insufficient memory. The same rule we apply to memory requests can also be applied to CPU requests.
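A minimal sketch of this check in Python, assuming you already have per-pod usage samples (e.g. queried from Prometheus); the nearest-rank percentile here is a simplification:

```python
import math

def p90(samples: list[float]) -> float:
    """Nearest-rank 90th percentile of the samples."""
    s = sorted(samples)
    return s[math.ceil(0.9 * len(s)) - 1]

def request_utilization_ok(request: float, usage_samples: list[float],
                           low: float = 0.60, high: float = 0.80) -> bool:
    """True if P90 usage falls inside the 60%-80% band of the request."""
    return low <= p90(usage_samples) / request <= high

# Pod requesting 10GB of memory with P90 usage around 7GB: inside the band.
print(request_utilization_ok(10, [5, 6, 6.5, 7, 7, 7, 7, 7, 7, 7]))  # True
```

In practice you would alert on both sides of the band: below 60% for wasted money, above 80% for OOM/throttling risk.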

CPU/Memory Limits vs. Actual Usage

While the scheduler uses resource requests to place workloads on nodes, resource limits let you define the boundaries of a workload’s resource usage at runtime.

It’s important to understand how CPU and memory limits are enforced so that you understand the impact of workloads crossing them. When a container reaches its CPU limit, it is throttled: it gets fewer CPU cycles from the operating system than it could have, which ultimately results in slower execution. It doesn’t matter if the node hosting the pod has idle CPU cycles to spare – containers are limited by the container runtime.

Being CPU-throttled without knowing it is very dangerous. Service call latency rises, and if a component in the system is throttled and you haven’t set up the required observability beforehand, it can be difficult to pinpoint the root cause. If the throttled service sits on a core path of the business, this situation can result in partial service degradation or complete unavailability.

Memory limits are enforced differently than CPU limits: when your container hits the memory limit, it is OOMKilled, which has the same effect as being OOMKilled due to insufficient memory on the node: the process drops in-flight requests, the service runs at reduced capacity until the container restarts, and then it goes through a cold-start period. If the process accumulates memory fast enough, it may enter the CrashLoop state, a state that indicates the process crashes on startup or shortly after starting, over and over again. Crashlooping pods often result in service unavailability.

How to solve it? Monitoring resource limits is similar to monitoring CPU/memory requests: your goal should be to keep P90 actual usage below 80% of the limit. For example, if our pod has a CPU limit of 2 cores and a memory limit of 2GB, the alerts should be set at 1.6 cores of CPU usage and 1.6GB of memory usage. Anything above these values risks your workload being throttled or restarted, depending on which threshold is breached.
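This threshold check is a straightforward ratio; a sketch in Python (the resource names are illustrative):

```python
def limit_alerts(limits: dict[str, float], usage: dict[str, float],
                 threshold: float = 0.80) -> list[str]:
    """Return the resources whose usage crosses `threshold` of their limit."""
    return [res for res, lim in limits.items()
            if usage.get(res, 0.0) / lim >= threshold]

# The example from the text: a 2-core CPU limit and a 2GB memory limit,
# so alerts fire at 1.6 cores / 1.6GB respectively.
print(limit_alerts({"cpu_cores": 2.0, "memory_gb": 2.0},
                   {"cpu_cores": 1.7, "memory_gb": 1.2}))  # ['cpu_cores']
```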

Percentage of unavailable pods in replicas

When you deploy an application, you set the number of desired replicas (pods) it should run. Sometimes, some pods may be unavailable for a number of reasons, for example:

  • Some pods may not fit on any running node in the cluster due to their resource requests – these pods transition to the Pending state until nodes free up resources to host them or new nodes that meet the requirements join the cluster.

  • Some pods may fail their liveness/readiness probes, which means they are either restarted or removed from the service endpoints.

  • Some pods may reach their resource limits and enter the Crashloop state.

  • Some pods may be hosted on a node that has failed for various reasons; if the node is unhealthy, the pods hosted on it are likely not functioning properly.

Unavailable pods are obviously not a healthy state for your system. The effect can range from minor service disruptions to complete unavailability, depending on the percentage of unavailable pods relative to the desired replica count and on the importance of the missing pods in the system’s core flow.

How to solve it? The metric we monitor here is the number of unavailable pods as a percentage of the desired number of pods. The exact percentage you should alert on depends on the importance of the service and of every pod in it. For some workloads, we may accept 5% of pods being unavailable for a period of time, as long as the system returns to a healthy state on its own with no customer impact. For other workloads, even 1 unavailable pod can be an issue. A good example is statefulsets, where each pod has its own unique identity and its unavailability may be unacceptable.
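The check itself is simple arithmetic; a sketch with an illustrative per-service budget:

```python
def unavailable_pct(desired: int, available: int) -> float:
    """Unavailable pods as a percentage of the desired replica count."""
    return 100.0 * (desired - available) / desired

def replica_alert(desired: int, available: int, max_pct: float) -> bool:
    """True if the unavailable percentage exceeds the tolerated budget."""
    return unavailable_pct(desired, available) > max_pct

print(replica_alert(20, 19, 5.0))  # False: 5% unavailable is within budget
print(replica_alert(3, 2, 0.0))    # True: a statefulset tolerating 0 missing pods
```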

HPA | Desired Replicas Reaching the Maximum Replicas

Horizontal Pod Autoscaler (HPA) is a k8s resource that lets you adjust the number of replicas your workload runs based on a target function that you define. A common use case is autoscaling on the average CPU usage of the deployment’s pods relative to their CPU requests.

When the number of deployed replicas reaches the maximum defined in the HPA, you might encounter situations where you need more pods but the HPA can’t scale out. Depending on the scaling metric you set, the results vary. Here are 2 examples to illustrate:

  • If the HPA scales on CPU usage, the CPU usage of the existing pods will climb up to the limit and be throttled. This ultimately results in a decrease in the system’s throughput.
  • If the HPA scales on a custom metric, such as the number of unprocessed messages in a queue, the queue may start filling up with unprocessed messages, introducing delays into the processing pipeline.

How to solve it? Monitoring this metric is simple – set an X% threshold on the current number of replicas divided by the maximum number of HPA replicas. A sane X might be 85%, leaving you time to make the required changes before reaching the maximum. Keep in mind that increasing the number of replicas can affect other parts of the system, so you may end up changing more than the HPA configuration to enable this scaling operation. A typical example: you increase the maximum number of replicas, more pods try to connect to the database, and the database hits its maximum connection limit. That’s why it makes sense to keep a large enough buffer as preparation time here.
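A sketch of the saturation check described above, with the 85% threshold as the default:

```python
def hpa_saturation_pct(current_replicas: int, max_replicas: int) -> float:
    """How close the HPA is to its replica ceiling, as a percentage."""
    return 100.0 * current_replicas / max_replicas

def hpa_alert(current_replicas: int, max_replicas: int,
              threshold_pct: float = 85.0) -> bool:
    """True once the HPA has consumed `threshold_pct` of its headroom."""
    return hpa_saturation_pct(current_replicas, max_replicas) >= threshold_pct

print(hpa_alert(18, 20))  # True: 90% of the maximum replicas are in use
print(hpa_alert(10, 20))  # False: plenty of headroom left
```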

Failed node condition checks

The kubelet is a k8s agent that runs on every node in the cluster. Among its responsibilities, the kubelet publishes metrics (called node conditions) that reflect the health of the node it runs on:

  • Ready – true if the node is healthy and ready to accept pods
  • DiskPressure – true if the node’s disks are low on free storage space
  • MemoryPressure – true if the node is low on memory
  • PIDPressure – true if there are too many processes running on the node
  • NetworkUnavailable – true if the node’s network is not configured correctly

A healthy node should report True for the Ready condition and False for the other four conditions.

You should be aware that if the Ready condition becomes False, or any other condition becomes True, some or all of the workloads running on that node may not be functioning properly.

For DiskPressure, MemoryPressure, and PIDPressure, the root cause is obvious – the node can’t keep up with the rate at which processes write to disk, allocate memory, or spawn processes.

The Ready and NetworkUnavailable conditions are a bit tricky and require further investigation to get to the source of the problem.

How to solve it? I expect exactly 0 nodes to be unhealthy, so an alert should be triggered whenever any node becomes unhealthy.
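The healthy-node rule above can be expressed as a small predicate over the five conditions (the dict-based representation here is illustrative; in practice these come from kube-state-metrics):

```python
def node_healthy(conditions: dict[str, bool]) -> bool:
    """A healthy node reports Ready=True and every other condition False."""
    others = ("DiskPressure", "MemoryPressure",
              "PIDPressure", "NetworkUnavailable")
    return (conditions.get("Ready", False)
            and not any(conditions.get(c, False) for c in others))

def unhealthy_nodes(nodes: dict[str, dict[str, bool]]) -> list[str]:
    """Names of nodes that should trigger the alert (expected: none)."""
    return [name for name, conds in nodes.items() if not node_healthy(conds)]

nodes = {
    "node-a": {"Ready": True},
    "node-b": {"Ready": True, "MemoryPressure": True},
}
print(unhealthy_nodes(nodes))  # ['node-b']
```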

Persistent Volume

A Persistent Volume (PV) is a k8s resource object that represents a block of storage that can be attached to and detached from pods in the system. The implementation of PVs is platform-specific; for example, if your Kubernetes deployment runs on AWS, PVs are backed by EBS volumes. Like any block storage, a PV has a fixed capacity and can fill up over time.

When a process runs out of free disk space, it tends to crash, and these failures can manifest in a million different ways – stack traces don’t always lead to the root cause. In addition to protecting you from such failures, observing this metric helps you plan for workloads that accumulate data over time. Prometheus is a great example of such a workload: as it writes data points to its time-series database, the amount of free disk space decreases. Because Prometheus writes data at a very consistent rate, it’s easy to use PV utilization metrics to predict when you will need to delete old data or buy more disk capacity.

How to solve it? The kubelet exposes PV usage and capacity, so a simple division between them gives you PV utilization. Suggesting a reasonable alert threshold is a bit difficult because it really depends on the trajectory of the utilization graph, but as a rule of thumb, PV exhaustion should be predicted two to three weeks in advance so you have time to act.


As you’ve probably discovered, dealing with Kubernetes clusters is not an easy task. There is a huge number of metrics available, and it takes a lot of expertise to select the important ones.

Having a dashboard that monitors key metrics of your cluster can serve both as a preventive measure to avoid problems and as a tool to troubleshoot problems in your system.