❝
This article is reprinted from ArthurChiao’s Blog, original text: https://arthurchiao.art/blog/k8s-reliability-list-data-zh/, copyright belongs to the original author.

For unstructured data storage systems, LIST operations are usually very heavyweight: they consume large amounts of disk IO, network bandwidth, and CPU, and they also affect other requests in the same period (especially latency-sensitive requests such as leader election). They are a major killer of cluster stability.
For example, for Ceph object storage, each LIST bucket request has to go to multiple disks to retrieve all of the bucket's data. Not only is it slow in itself, it also affects other normal read/write requests in the same period because the IO is shared, driving up response latency and even causing timeouts. If there are too many objects in the bucket (for example, when it is used as the storage backend for harbor/docker-registry), the LIST operation cannot even complete within a reasonable time (so a registry GC that depends on LIST bucket cannot run at all).
Another example is the KV store etcd. Compared with Ceph, an actual etcd cluster may store very little data (a few to a few dozen GB), small enough to fit entirely in memory. Unlike Ceph, however, the number of concurrent requests it faces can be orders of magnitude higher, for example the etcd behind a ~4000-node k8s cluster. A single LIST request may need to return anywhere from tens of MB to several GB of data; with high concurrency, etcd clearly cannot keep up, so it is best to put a layer of cache in front of it, which is exactly what apiserver does. Most of the LIST requests in K8s should be absorbed by apiserver and served from its local cache, but if used improperly they will bypass this cache and go straight to etcd, which is a major stability risk.
This article digs into the LIST handling logic and performance bottlenecks of the k8s apiserver and etcd, and gives some suggestions on LIST stress testing, deployment, and tuning of basic services, to improve the stability of large-scale K8s clusters.
kube-apiserver LIST request processing logic:

The code analysis below is based on v1.24.0, but the basic logic and code paths are the same across 1.19~1.24, so it can also serve as a reference for those versions.
1 Introduction

1.1 K8s architecture

From the perspective of architecture hierarchy and component dependencies, a K8s cluster and a Linux host can be compared as follows:

For a K8s cluster, the components and their roles, from the inside out:

1. apiserver: reads (ListWatch) the full data from etcd and caches it in memory; it is a stateless service and horizontally scalable;
2. Basic services (e.g. kubelet, *-agent, *-operator): connect to apiserver and get (List/ListWatch) the data they need;
3. Workloads within the cluster: created, managed, and reconciled by 1 and 2 under normal conditions; for example, kubelet creates pods, and cilium configures networking and security policies.

As the apiserver/etcd roles above show, there are two levels of List/ListWatch on the system path (but the data is the same):

- apiserver List/ListWatch etcd
- basic services List/ListWatch apiserver
Therefore, in its simplest form, apiserver is a proxy that stands in front of etcd:

```
+--------+       +---------------+       +------------+
| Client | ----> | Proxy (cache) | ----> | Data store |
+--------+       +---------------+       +------------+
infra services       apiserver                etcd
```
- In most cases, apiserver serves requests directly from its local cache (since it caches the full data of the cluster);
- In some special cases, apiserver can only forward the request to etcd. Note that a client LIST with improperly set parameters may also end up on this path:
  - the client explicitly requests to read from etcd (for the highest data accuracy), or
  - the apiserver local cache has not been built yet.
1.3 apiserver/etcd List overhead

1.3.1 Request examples

Consider the following LIST operations:
1. `LIST apis/cilium.io/v2/ciliumendpoints?limit=500&resourceVersion=0`

   Both parameters are passed here, but resourceVersion=0 causes apiserver to ignore limit=500, so the client gets the full ciliumendpoints data. The full data of a resource may be quite large, so consider whether you really need all of it. Quantitative measurement and analysis methods are introduced later.

2. `LIST api/v1/pods?fieldSelector=spec.nodeName%3Dnode1`

   This request gets all pods on node1 (`%3D` is the escape of `=`). Filtering by nodeName may sound like a small amount of data, but it is more complicated than it looks; this behavior should be avoided unless you have very high data accuracy requirements and deliberately want to bypass the apiserver cache.

   - First, resourceVersion=0 is not specified here, so apiserver skips its cache and goes directly to etcd to read the data;
   - Second, etcd is only a KV store and has no filtering by label/field (it only supports limit/continue),
   - so apiserver has to pull the full data from etcd and then filter it in memory, which is also very expensive; there is a code analysis of this later.

3. `LIST api/v1/pods?fieldSelector=spec.nodeName%3Dnode1&resourceVersion=0`

   This differs from 2 only by the added resourceVersion=0, so apiserver reads from its cache and performance improves by orders of magnitude. Note, however, that although the data actually returned to the client may only be a few hundred KB to a few hundred MB (depending on the number of pods on the node, the number of labels on each pod, etc.), the data apiserver has to process may be several GB. A quantitative analysis follows later.

As you can see, different LIST operations have very different impacts, and the data the client sees may be only a small part of what apiserver/etcd actually processed. If basic services are started or restarted at scale, such requests can easily overwhelm the control plane.
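To make the three examples above concrete, here is a minimal client-go sketch (my addition, not from the original article) that issues them programmatically; the kubeconfig path and the node name node1 are placeholders:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig path; adjust for your environment.
	config, err := clientcmd.BuildConfigFromFlags("", "/etc/kubernetes/admin.conf")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)
	ctx := context.TODO()

	// 1. resourceVersion=0: served from the apiserver cache; the limit is ignored,
	//    so the full data set comes back.
	fromCache, _ := client.CoreV1().Pods("").List(ctx,
		metav1.ListOptions{Limit: 500, ResourceVersion: "0"})

	// 2. No resourceVersion: apiserver bypasses its cache, pulls everything from etcd,
	//    and filters in its own memory.
	fromEtcd, _ := client.CoreV1().Pods("").List(ctx,
		metav1.ListOptions{FieldSelector: "spec.nodeName=node1"})

	// 3. resourceVersion=0 + field selector: filtered from the apiserver cache.
	filtered, _ := client.CoreV1().Pods("").List(ctx,
		metav1.ListOptions{FieldSelector: "spec.nodeName=node1", ResourceVersion: "0"})

	fmt.Println(len(fromCache.Items), len(fromEtcd.Items), len(filtered.Items))
}
```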
1.3.2 Processing overhead

LIST requests can be divided into two types:

1. LIST of the full data;
2. LIST with filter conditions (label/field selectors) specified.

The second case, where the list request carries filter conditions, deserves a specific explanation:
- In most cases, apiserver filters using its own cache, which is fast, so the time is mainly spent on data transfer;
- In cases where the request has to be forwarded to etcd, as mentioned earlier, etcd is only a KV store and does not understand label/field information, so it cannot do the filtering itself. The actual process is: apiserver pulls all the data from etcd, filters it in memory, and then returns the result to the client. So besides the data transfer overhead (network bandwidth), this case also consumes a lot of apiserver CPU and memory.
1.4 An example of potential problems when deploying at scale

As another example, the following line of code uses k8s client-go to filter pods by nodeName:

```go
podList, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{FieldSelector: "spec.nodeName=node1"})
```

It looks very simple, but let's look at the amount of data behind it. Taking a cluster of 4000 nodes and 100K pods as an example, the full pod data volume is:

- etcd: compact, unstructured KV storage, on the order of 1GB;
- apiserver cache: already structured golang objects, on the order of 2GB (TODO: needs further confirmation);
- apiserver response: the client usually receives the default JSON format, which is also structured data; the JSON of all pods is also in the 2GB range.

As you can see, some requests look simple and are just one line of code on the client side, but the amount of data behind them is staggering. A LIST filtered by nodeName may return only 500KB of data, but apiserver has to filter through 2GB of data to produce it; in the worst case, etcd also has to process 1GB of data (and the parameters above do hit the worst case, see the code analysis below).

When the cluster is small, this problem may not be visible (etcd only starts printing warning logs when the LIST response latency exceeds a certain threshold). Once the cluster is large, if there are many such requests, apiserver/etcd will definitely not be able to handle them.
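If you want to see these numbers for your own cluster, a small sketch like the following (my addition) re-serializes the response to JSON to approximate the size received by the client; the node name is a placeholder:

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listPodsOnNode lists the pods on one node from the apiserver cache and reports
// the approximate JSON size received by the client.
func listPodsOnNode(ctx context.Context, client kubernetes.Interface, node string) error {
	podList, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector:   "spec.nodeName=" + node, // e.g. "node1" (placeholder)
		ResourceVersion: "0",                     // serve from the apiserver cache, not etcd
	})
	if err != nil {
		return err
	}
	raw, err := json.Marshal(podList)
	if err != nil {
		return err
	}
	fmt.Printf("pods on %s: %d, JSON size: ~%d KB\n", node, len(podList.Items), len(raw)/1024)
	return nil
}
```

Note that this only approximates the wire size; the exact bytes transferred also depend on the negotiated content type.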
1.5 Purpose of this article

To examine the List/ListWatch implementation of k8s in depth, deepen the understanding of the related performance problems, and provide some references for the stability optimization of large-scale K8s clusters.
2 apiserver List() operation: source code analysis

With the above theoretical warm-up, we can now look at the code.
2.1 Call stack and flowchart

```
store.List
 |-store.ListPredicate
    |-if opt == nil
    |    opt = ListOptions{ResourceVersion: ""}
    |-Init SelectionPredicate.Limit/Continue field
    |-list := e.NewListFunc()                    // objects will be stored in this list
    |-storageOpts := storage.ListOptions{opt.ResourceVersion, opt.ResourceVersionMatch, Predicate: p}
    |
    |-if MatchesSingle ok                        // 1. when "metadata.name" is specified, get single obj
    |    Get single obj from cache or etcd
    |
    |-return e.Storage.List(KeyRootFunc(ctx), storageOpts) // 2. get all objs and perform filtering
       |-cacher.List()
          |                                      // case 1: list all from etcd and filter in apiserver
          |-if shouldDelegateList(opts)          // true if resourceVersion == ""
          |    return c.storage.List             // list from etcd
          |      |-fromRV *int64 = nil
          |      |-if len(storageOpts.ResourceVersion) > 0
          |      |    rv = ParseResourceVersion
          |      |    fromRV = &rv
          |      |
          |      |-for hasMore {
          |      |    objs := etcdclient.KV.Get()
          |      |    filter(objs)               // filter by labels or fields
          |      |  }
          |
          |                                      // case 2: apiserver cache not built yet, fall back to etcd
          |-if cache.notready()
          |    return c.storage.List             // get from etcd
          |
          |                                      // case 3: list & filter from apiserver local cache (memory)
          |-obj := watchCache.WaitUntilFreshAndGet
          |-for elem in obj.(*storeElement)
          |    listVal.Set()                     // append results to listObj
          |-return                               // results stored in listObj
```

corresponding flowchart:

2.2 Request processing entry: List()

```go
// https://github.com/kubernetes/kubernetes/blob/v1.24.0/staging/src/k8s.io/apiserver/pkg/registry/generic/registry/store.go#L361
// Returns a list of objects
func (e *Store) List(ctx, options *metainternalversion.ListOptions) (runtime.Object, error) {
	label := labels.Everything()
	if options != nil && options.LabelSelector != nil {
		label = options.LabelSelector // label filter, e.g. app=nginx
	}

	field := fields.Everything()
	if options != nil && options.FieldSelector != nil {
		field = options.FieldSelector // field filter, e.g. spec.nodeName=node1
	}

	out := e.ListPredicate(ctx, e.PredicateFunc(label, field), options) // pull (List) the data and filter (Predicate) it
	if e.Decorator != nil {
		e.Decorator(out)
	}
	return out, nil
}
```
2.3 ListPredicate()

```go
// https://github.com/kubernetes/kubernetes/blob/v1.24.0/staging/src/k8s.io/apiserver/pkg/registry/generic/registry/store.go#L411
func (e *Store) ListPredicate(ctx, p storage.SelectionPredicate, options *metainternalversion.ListOptions) (runtime.Object, error) {
	// Step 1: initialization
	if options == nil {
		options = &metainternalversion.ListOptions{ResourceVersion: ""}
	}
	p.Limit = options.Limit
	p.Continue = options.Continue
	list := e.NewListFunc() // the returned result will be stored in this list

	storageOpts := storage.ListOptions{ // convert the API-side ListOptions to the storage-side ListOptions; field differences are described below
		ResourceVersion:      options.ResourceVersion,
		ResourceVersionMatch: options.ResourceVersionMatch,
		Predicate:            p,
		Recursive:            true,
	}

	// Step 2: if the request specifies metadata.name, get a single object instead of filtering the full data
	if name, ok := p.MatchesSingle(); ok { // check whether the metadata.name field is set
		if key, err := e.KeyFunc(ctx, name); err == nil { // the key of this object in etcd (unique, or nonexistent)
			storageOpts.Recursive = false
			err := e.Storage.GetList(ctx, key, storageOpts, list)
			return list, err
		}
		// if we get here, the key could not be derived from the context; fall back to listing the full data below and filtering
	}

	// Step 3: list the full data and filter it
	err := e.Storage.GetList(ctx, e.KeyRootFunc(ctx), storageOpts, list) // KeyRootFunc() returns the root key (i.e. the prefix, without the trailing /) of this resource in etcd
	return list, err
}
```

In 1.24.0, cases 1 & 2 both call e.Storage.GetList(); this was a bit different in previous versions (e.Storage.GetToList() in case 1 and e.Storage.List() in case 2), but the basic flow is the same.

- Step 1 initializes defaults: if the client does not pass a ListOptions, ResourceVersion is set to an empty string, which makes apiserver pull the data from etcd and return it to the client instead of using its local cache (the behavior of an empty-string ResourceVersion is analyzed later). For example, if the client sets ListOptions{Limit: 500, ResourceVersion: "0"} when listing ciliumendpoints, the request sent is /apis/cilium.io/v2/ciliumendpoints?limit=500&resourceVersion=0.
- It then initializes the limit/continue fields of the filter (SelectionPredicate) from the fields in the ListOptions,
- initializes the returned result, list := e.NewListFunc(),
- and converts the API-side ListOptions into the storage-side ListOptions; the field differences are shown below.

metainternalversion.ListOptions is the API-side struct:

```go
// staging/src/k8s.io/apimachinery/pkg/apis/meta/internalversion/types.go

// ListOptions is the query options to a standard REST list call.
type ListOptions struct {
	metav1.TypeMeta

	LabelSelector        labels.Selector // label filter, e.g. app=nginx
	FieldSelector        fields.Selector // field filter, e.g. spec.nodeName=node1
	Watch                bool
	AllowWatchBookmarks  bool
	ResourceVersion      string
	ResourceVersionMatch metav1.ResourceVersionMatch
	TimeoutSeconds       *int64 // Timeout for the list/watch call.
	Limit                int64
	Continue             string // a token returned by the server; the server returns a 410 error if the token has expired.
}
```

storage.ListOptions is the struct passed to the underlying storage, with some differences in the fields:

```go
// staging/src/k8s.io/apiserver/pkg/storage/interfaces.go

// ListOptions provides the options that may be provided for storage list operations.
type ListOptions struct {
	ResourceVersion      string
	ResourceVersionMatch metav1.ResourceVersionMatch
	Predicate            SelectionPredicate // Predicate provides the selection rules for the list operation.
	Recursive            bool               // false: get a single object by exact key; true: get all objects whose keys have 'key' as a prefix
	ProgressNotify       bool               // storage-originated bookmark, ignored for non-watch requests.
}
```
2.4 The request specifies a resource name: get a single object

Depending on whether metadata.name is specified in the request, there are two cases:

- If it is specified, a single object is being queried (since the name is unique), so the code takes the single-object path;
- If it is not specified, the full data has to be obtained, filtered in apiserver memory according to the conditions in the SelectionPredicate, and the final result returned to the client.

The code is as follows:

```go
	// case 1: get a single object by metadata.name instead of filtering the full data
	if name, ok := p.MatchesSingle(); ok { // check whether the metadata.name field is set
		if key, err := e.KeyFunc(ctx, name); err == nil { // the key of this object in etcd
			err := e.Storage.GetList(ctx, key, storageOpts, list)
			return list, err
		}
		// if we get here, the key could not be derived from the context; fall back to listing the full data below and filtering
	}
```
e.Storage is an interface:

```go
// staging/src/k8s.io/apiserver/pkg/storage/interfaces.go

// Interface offers a common interface for object marshaling/unmarshaling operations and
// hides all the storage-related operations behind it.
type Interface interface {
	Create(ctx, key string, obj, out runtime.Object, ttl uint64) error
	Delete(ctx, key string, out runtime.Object, preconditions *Preconditions, ...) error
	Watch(ctx, key string, opts ListOptions) (watch.Interface, error)
	Get(ctx, key string, opts GetOptions, objPtr runtime.Object) error

	// GetList unmarshalls objects found at key into a *List api object (an object that satisfies runtime.IsList definition).
	// If 'opts.Recursive' is false, 'key' is used as an exact match; if it is true, 'key' is used as a prefix.
	// The returned contents may be delayed, but it is guaranteed that they will
	// match 'opts.ResourceVersion' according to 'opts.ResourceVersionMatch'.
	GetList(ctx, key string, opts ListOptions, listObj runtime.Object) error
	...
}
```
e.Storage.GetList() executes into the cacher code. Whether getting a single object or the full data, the process is similar:

- first try the apiserver local cache (the determining factors include ResourceVersion, etc.),
- and fall back to etcd when the cache cannot be used.

The logic for getting a single object is relatively simple, so we skip it here. Next, look at how the full data is listed and then filtered.
2.5 The request does not specify a resource name: get the full data and filter it

2.5.1 apiserver caching layer: GetList() processing logic

```go
// https://github.com/kubernetes/kubernetes/blob/v1.24.0/staging/src/k8s.io/apiserver/pkg/storage/cacher/cacher.go#L622

// GetList implements storage.Interface
func (c *Cacher) GetList(ctx, key string, opts storage.ListOptions, listObj runtime.Object) error {
	recursive := opts.Recursive
	resourceVersion := opts.ResourceVersion
	pred := opts.Predicate

	// Case 1: the ListOptions require reading from etcd
	if shouldDelegateList(opts) {
		return c.storage.GetList(ctx, key, opts, listObj) // c.storage points to etcd
	}

	// If a resourceVersion is specified, serve it from the cache
	listRV := c.versioner.ParseResourceVersion(resourceVersion)

	// Case 2: the apiserver cache has not been built yet, can only read from etcd
	if listRV == 0 && !c.ready.check() {
		return c.storage.GetList(ctx, key, opts, listObj)
	}

	// Case 3: the apiserver cache is healthy, read from the cache:
	// ensure that the returned object versions are not lower than 'listRV'
	listPtr := meta.GetItemsPtr(listObj)
	listVal := conversion.EnforcePtr(listPtr)
	filter := filterWithAttrsFunction(key, pred) // the final filter

	objs, readResourceVersion, indexUsed := c.listItems(listRV, key, pred, ...) // performance optimization
	for _, obj := range objs {
		elem := obj.(*storeElement)
		if filter(elem.Key, elem.Labels, elem.Fields) { // the real filtering
			listVal.Set(reflect.Append(listVal, reflect.ValueOf(elem)))
		}
	}

	// Update the last read ResourceVersion
	if c.versioner != nil {
		c.versioner.UpdateList(listObj, readResourceVersion, "", nil)
	}
	return nil
}
```
2.5.2 Deciding whether data must be read from etcd: shouldDelegateList()

```go
// https://github.com/kubernetes/kubernetes/blob/v1.24.0/staging/src/k8s.io/apiserver/pkg/storage/cacher/cacher.go#L591
func shouldDelegateList(opts storage.ListOptions) bool {
	resourceVersion := opts.ResourceVersion
	pred := opts.Predicate
	pagingEnabled := DefaultFeatureGate.Enabled(features.APIListChunking) // enabled by default
	hasContinuation := pagingEnabled && len(pred.Continue) > 0            // Continue is a token
	hasLimit := pagingEnabled && pred.Limit > 0 && resourceVersion != "0" // hasLimit can only be true if resourceVersion != "0"

	// 1. If resourceVersion is not specified, pull the data from the underlying storage (etcd);
	// 2. if there is a continuation, also pull the data from the underlying storage;
	// 3. only pass limit to the underlying storage (etcd) if resourceVersion != "0", because the watch cache does not support continuation.
	return resourceVersion == "" || hasContinuation || hasLimit || opts.ResourceVersionMatch == metav1.ResourceVersionMatchExact
}
```
This function is very important:

- Q: If the client does not set the ResourceVersion field in ListOptions{}, does that correspond to resourceVersion == "" here?
  A: Yes. So the example [1] in section 1 results in pulling the full data from etcd.
- Q: If the client sets limit=500&resourceVersion=0, will that lead to hasContinuation==true on the next request?
  A: No. resourceVersion=0 causes the limit to be ignored (see the hasLimit line above); that is, although limit=500 is specified, the request returns the full data.
- Q: What is ResourceVersionMatch used for?
  A: It tells the apiserver how to interpret the ResourceVersion. There is a rather complicated official table [2]; take a look if you are interested.
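To double-check the understanding of these rules, they can be restated as a tiny standalone sketch (simplified; not the real apiserver code) and applied to the example requests from section 1.3.1:

```go
package main

import "fmt"

type listOpts struct {
	ResourceVersion string
	Continue        string
	Limit           int64
	ExactRVMatch    bool // ResourceVersionMatch == "Exact"
}

// delegatesToEtcd mirrors the logic of shouldDelegateList() above, assuming the
// APIListChunking feature gate is enabled (the default).
func delegatesToEtcd(o listOpts) bool {
	hasContinuation := len(o.Continue) > 0
	hasLimit := o.Limit > 0 && o.ResourceVersion != "0"
	return o.ResourceVersion == "" || hasContinuation || hasLimit || o.ExactRVMatch
}

func main() {
	fmt.Println(delegatesToEtcd(listOpts{ResourceVersion: "0", Limit: 500})) // false: served from cache, limit ignored
	fmt.Println(delegatesToEtcd(listOpts{}))                                 // true: empty RV goes to etcd
	fmt.Println(delegatesToEtcd(listOpts{Limit: 500}))                       // true: limit without RV=0 goes to etcd
}
```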
Next, let's go back to the cacher's GetList() logic and look at the specific processing cases.

2.5.3 Case 1: the ListOptions require reading data from etcd

In this case, apiserver reads all objects directly from etcd, filters them, and returns them to the client. It is suitable for scenarios with extremely high data consistency requirements, but it is also easy to misuse, putting excessive pressure on etcd, as in the example in section 1 [3].
```go
// https://github.com/kubernetes/kubernetes/blob/v1.24.0/staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go#L563

// GetList implements storage.Interface.
func (s *store) GetList(ctx, key string, opts storage.ListOptions, listObj runtime.Object) error {
	listPtr := meta.GetItemsPtr(listObj)
	v := conversion.EnforcePtr(listPtr)
	key = path.Join(s.pathPrefix, key)
	keyPrefix := key // append '/' if needed
	newItemFunc := getNewItemFunc(listObj, v)

	var fromRV *uint64
	if len(resourceVersion) > 0 { // if RV is not empty (the default is an empty string when the client does not pass one)
		parsedRV := s.versioner.ParseResourceVersion(resourceVersion)
		fromRV = &parsedRV
	}

	// ResourceVersion / ResourceVersionMatch handling logic
	switch {
	case recursive && s.pagingEnabled && len(pred.Continue) > 0: ...
	case recursive && s.pagingEnabled && pred.Limit > 0:         ...
	default:                                                     ...
	}

	// loop until we have filled the requested limit from etcd or there are no more results
	for {
		getResp = s.client.KV.Get(ctx, key, options...) // pull data from etcd
		numFetched += len(getResp.Kvs)
		hasMore = getResp.More

		for i, kv := range getResp.Kvs {
			if limitOption != nil && int64(v.Len()) >= pred.Limit {
				hasMore = true
				break
			}
			lastKey = kv.Key
			data := s.transformer.TransformFromStorage(ctx, kv.Value, kv.Key)
			appendListItem(v, data, kv.ModRevision, pred, s.codec, s.versioner, newItemFunc) // filtering happens here
			numEvald++
		}
		key = string(lastKey) + "\x00"
	}

	// instruct the client to begin querying from immediately after the last key we returned
	if hasMore {
		// we want to start immediately after the last key
		next := encodeContinue(string(lastKey)+"\x00", keyPrefix, returnedRV)
		return s.versioner.UpdateList(listObj, uint64(returnedRV), next, remainingItemCount)
	}

	// no continuation
	return s.versioner.UpdateList(listObj, uint64(returnedRV), "", nil)
}
```
- client.KV.Get() enters the etcd client library; you can keep digging down if interested.
- appendListItem() filters the obtained data; this is the apiserver in-memory filtering we mentioned in section 1.
2.5.4 Case 2: the local cache has not been built yet

The actual execution path is the same as case 1: data can only be read from etcd.

2.5.5 Case 3: use the local cache
```go
// https://github.com/kubernetes/kubernetes/blob/v1.24.0/staging/src/k8s.io/apiserver/pkg/storage/cacher/cacher.go#L622

// GetList implements storage.Interface
func (c *Cacher) GetList(ctx, key string, opts storage.ListOptions, listObj runtime.Object) error {
	// Case 1: the ListOptions require reading from etcd
	...
	// Case 2: the apiserver cache has not been built yet, can only read from etcd
	...

	// Case 3: the apiserver cache is healthy, read from the cache:
	// ensure that the returned object versions are not lower than 'listRV'
	listPtr := meta.GetItemsPtr(listObj) // list elements with at least 'listRV' from cache
	listVal := conversion.EnforcePtr(listPtr)
	filter := filterWithAttrsFunction(key, pred) // the final filter

	objs, readResourceVersion, indexUsed := c.listItems(listRV, key, pred, ...) // performance optimization
	for _, obj := range objs {
		elem := obj.(*storeElement)
		if filter(elem.Key, elem.Labels, elem.Fields) { // the real filtering
			listVal.Set(reflect.Append(listVal, reflect.ValueOf(elem)))
		}
	}

	if c.versioner != nil {
		c.versioner.UpdateList(listObj, readResourceVersion, "", nil)
	}
	return nil
}
```
3 LIST testing

To avoid the client library (e.g. client-go) automatically setting parameters for us, we test with curl directly, only specifying the certificates:

```sh
$ cat curl-k8s-apiserver.sh
curl -s --cert /etc/kubernetes/pki/admin.crt --key /etc/kubernetes/pki/admin.key --cacert /etc/kubernetes/pki/ca.crt $@
```

Usage:

```sh
$ ./curl-k8s-apiserver.sh "https://localhost:6443/api/v1/pods?limit=2"
{
  "kind": "PodList",
  "metadata": {
    "resourceVersion": "2127852936",
    "continue": "eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJ...",
  },
  "items": [ {pod1 data}, {pod2 data} ]
}
```
3.1 Specify limit=2: the response returns paging information (continue)

3.1.1 curl test

```sh
$ ./curl-k8s-apiserver.sh "https://localhost:6443/api/v1/pods?limit=2"
{
  "kind": "PodList",
  "metadata": {
    "resourceVersion": "2127852936",
    "continue": "eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJ...",
  },
  "items": [ {pod1 data}, {pod2 data} ]
}
```

As you can see:

- it does return the information of two pods, in the items[] field;
- in addition, a continue field is returned in metadata; if the client passes this token on the next request, apiserver will keep returning the remaining content until it no longer returns a continue.
3.1.2 kubectl test

Increase the kubectl log level and you can see that it also uses continue under the hood to fetch the full pod list:

```sh
$ kubectl get pods --all-namespaces --v=10
## The following log output has been lightly edited for display
##
## curl -k -v -XGET -H "User-Agent: kubectl/v1.xx" -H "Accept: application/json;as=Table;v=v1;g=meta.k8s.io,application/json;as=Table;v=v1beta1;g=meta.k8s.io,application/json"
##   'http://localhost:8080/api/v1/pods?limit=500'
## GET http://localhost:8080/api/v1/pods?limit=500 200 OK in 202 milliseconds
## Response Body: {"kind":"Table","metadata":{"continue":"eyJ2Ijoib...","remainingItemCount":54},"columnDefinitions":[...],"rows":[...]}
##
## curl -k -v -XGET -H "Accept: application/json;as=Table;v=v1;g=meta.k8s.io,application/json;as=Table;v=v1beta1;g=meta.k8s.io,application/json" -H "User-Agent: kubectl/v1.xx"
##   'http://localhost:8080/api/v1/pods?continue=eyJ2Ijoib&limit=500'
## GET http://localhost:8080/api/v1/pods?continue=eyJ2Ijoib&limit=500 200 OK in 44 milliseconds
## Response Body: {"kind":"Table","metadata":{"resourceVersion":"2122644698"},"columnDefinitions":[],"rows":[...]}
```

The first request fetched 500 pods, and the second request carried the returned continue token: GET http://localhost:8080/api/v1/pods?continue=eyJ2Ijoib&limit=500. continue is a token and is a bit long, so it is truncated here for readability.
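In client-go, the same continue/limit pagination can be driven explicitly. The sketch below is my addition (it assumes a ready-made kubernetes.Interface) and keeps passing the returned Continue token until it is empty:

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listAllPodsPaged pages through all pods 500 at a time, the same way kubectl does.
func listAllPodsPaged(ctx context.Context, client kubernetes.Interface) (int, error) {
	opts := metav1.ListOptions{Limit: 500}
	total := 0
	for {
		podList, err := client.CoreV1().Pods("").List(ctx, opts)
		if err != nil {
			return total, err // a "410 Gone" here means the continue token has expired
		}
		total += len(podList.Items)
		if podList.Continue == "" { // no more pages
			return total, nil
		}
		opts.Continue = podList.Continue
	}
}
```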
3.2 Specify limit=2&resourceVersion=0: limit=2 is ignored and the full data is returned

```sh
$ ./curl-k8s-apiserver.sh "https://localhost:6443/api/v1/pods?limit=2&resourceVersion=0"
{
  "kind": "PodList",
  "metadata": {
    "resourceVersion": "2127852936",
    "continue": "eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJ...",
  },
  "items": [ {pod1 data}, {pod2 data}, ... ]
}
```

items[] here contains the full pod information.
3.3 Specify spec.nodeName=node1&resourceVersion=0 vs. spec.nodeName=node1

The results are the same:

```sh
$ ./curl-k8s-apiserver.sh "https://localhost:6443/api/v1/namespaces/default/pods?fieldSelector=spec.nodeName%3Dnode1" | jq '.items[].spec.nodeName'
"node1"
"node1"
"node1"
...

$ ./curl-k8s-apiserver.sh "https://localhost:6443/api/v1/namespaces/default/pods?fieldSelector=spec.nodeName%3Dnode1&resourceVersion=0" | jq '.items[].spec.nodeName'
"node1"
"node1"
"node1"
...
```

The results are identical, unless the apiserver cache and etcd happen to be inconsistent, which is extremely rare and not discussed here.
But the speed difference is huge. Use time to measure the two cases above, and for a larger cluster you will find the response times differ significantly:

```sh
$ time ./curl-k8s-apiserver.sh > result
```

For a cluster of 4K nodes and 100K pods, the following numbers are for reference:

- with resourceVersion=0 (read from the apiserver cache): takes 0.05s;
- without resourceVersion=0 (read from etcd and filter in apiserver): takes 10s.

A difference of 200x.

❝ The total size of the full pod data is about 2GB, averaging 20KB per pod.
4 Control-plane pressure from LIST requests: a quantitative analysis

This section uses cilium-agent as an example to quantitatively measure the pressure it puts on the control plane when it starts.

4.1 Collect the LIST requests

First, collect the k8s resources that the agent LISTs at startup. There are several ways to do this:

- filter the k8s access logs by ServiceAccount, verb, request URI, etc.;
- read the agent logs;
- go further with code analysis, and so on.
Suppose we collect the following LIST requests:

- api/v1/namespaces?resourceVersion=0
- api/v1/pods?fieldSelector=spec.nodeName%3Dnode1&resourceVersion=0
- api/v1/nodes?fieldSelector=metadata.name%3Dnode1&resourceVersion=0
- api/v1/services?labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name
- apis/discovery.k8s.io/v1beta1/endpointslices?resourceVersion=0
- apis/networking.k8s.io/networkpolicies?resourceVersion=0
- apis/cilium.io/v2/ciliumnodes?resourceVersion=0
- apis/cilium.io/v2/ciliumnetworkpolicies?resourceVersion=0
- apis/cilium.io/v2/ciliumclusterwidenetworkpolicies?resourceVersion=0
4.2 Measure the data volume and latency of each LIST request

With the list of LIST requests, you can execute them manually and collect the following data:

- how long each request takes;
- the amount of data involved, in two flavors:
  - the data processed by apiserver (the full data set); the performance impact on apiserver/etcd should be evaluated against this;
  - the final data received by the agent (after selector filtering).
Use the following script (run on a real k8s master) to perform the test:

```sh
$ cat benchmark-list-overheads.sh
apiserver_url="https://localhost:6443"

## List k8s core resources (e.g. pods, services)
## API: GET/LIST /api/v1/<resource>?<selectors>&resourceVersion=0
function benchmark_list_core_resource() {
    resource=$1
    selectors=$2

    echo "----------------------------------------------------"
    echo "Benchmarking list $2"

    listed_file="listed-$resource"
    url="$apiserver_url/api/v1/$resource?resourceVersion=0"

    ## first perform a request without selectors, this is the size apiserver really handles
    echo "curl $url"
    time ./curl-k8s-apiserver.sh "$url" > $listed_file

    ## perform another request if selectors are provided, this is the size the client receives
    listed_file2="$listed_file-filtered"
    if [ ! -z "$selectors" ]; then
        url="$url&$selectors"
        echo "curl $url"
        time ./curl-k8s-apiserver.sh "$url" > $listed_file2
    fi

    ls -ahl $listed_file $listed_file2 2>/dev/null
    echo "----------------------------------------------------"
    echo ""
}

## List k8s apiextension resources (e.g. CRDs, cilium resources)
## API: GET/LIST /apis/<api_group>/<resource>?<selectors>&resourceVersion=0
function benchmark_list_apiexternsion_resource() {
    api_group=$1
    resource=$2
    selectors=$3

    echo "----------------------------------------------------"
    echo "Benchmarking list $api_group/$resource"

    api_group_flatten_name=$(echo $api_group | sed 's/\//-/g')
    listed_file="listed-$api_group_flatten_name-$resource"
    url="$apiserver_url/apis/$api_group/$resource?resourceVersion=0"
    if [ ! -z "$selectors" ]; then
        url="$url&$selectors"
    fi

    echo "curl $url"
    time ./curl-k8s-apiserver.sh "$url" > $listed_file

    ls -ahl $listed_file
    echo "----------------------------------------------------"
    echo ""
}

benchmark_list_core_resource "namespaces" ""
benchmark_list_core_resource "pods"       "fieldSelector=spec.nodeName%3Dnode1"
benchmark_list_core_resource "nodes"      "fieldSelector=metadata.name%3Dnode1"
benchmark_list_core_resource "services"   "labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name"

benchmark_list_apiexternsion_resource "discovery.k8s.io/v1beta1" "endpointslices" ""
benchmark_list_apiexternsion_resource "apiextensions.k8s.io/v1"  "customresourcedefinitions" ""
benchmark_list_apiexternsion_resource "networking.k8s.io"        "networkpolicies" ""
benchmark_list_apiexternsion_resource "cilium.io/v2"             "ciliumnodes" ""
benchmark_list_apiexternsion_resource "cilium.io/v2"             "ciliumendpoints" ""
benchmark_list_apiexternsion_resource "cilium.io/v2"             "ciliumnetworkpolicies" ""
benchmark_list_apiexternsion_resource "cilium.io/v2"             "ciliumclusterwidenetworkpolicies" ""
```

The output looks like this:
```sh
$ benchmark-list-overheads.sh
----------------------------------------------------
Benchmarking list
curl https://localhost:6443/api/v1/namespaces?resourceVersion=0
real    0m0.090s
user    0m0.038s
sys     0m0.044s
-rw-r--r-- 1 root root 69K listed-namespaces
----------------------------------------------------
Benchmarking list fieldSelector=spec.nodeName%3Dnode1
curl https://localhost:6443/api/v1/pods?resourceVersion=0
real    0m18.332s
user    0m1.355s
sys     0m1.822s
curl https://localhost:6443/api/v1/pods?resourceVersion=0&fieldSelector=spec.nodeName%3Dnode1
real    0m0.242s
user    0m0.044s
sys     0m0.188s
-rw-r--r-- 1 root root 2.0G listed-pods
-rw-r--r-- 1 root root 526K listed-pods-filtered
----------------------------------------------------
...
```
Note: for any LIST with a selector, such as LIST pods filtered by spec.nodeName=node1, the script first executes the request without the selector, in order to measure the amount of data apiserver has to process. Taking the list pods case above:

- the agent actually executes pods?resourceVersion=0&fieldSelector=spec.nodeName%3Dnode1, so the request latency should be based on this one;
- the script additionally executes pods?resourceVersion=0, which measures how much data apiserver has to process for the request above.

❝ Note: an operation like listing all pods produces a 2GB file, so use this benchmark tool carefully. First understand what your script is actually testing, and in particular do not automate it or run it concurrently, which could overwhelm apiserver/etcd.
4.3 Analysis of the test results

The output above contains the following key information:

- the resource type being LISTed, e.g. pods/endpoints/services;
- how long the LIST takes;
- the amount of data involved in the LIST operation:
  - the data apiserver has to process (JSON format): for the list pods example above, this is the listed-pods file, 2GB in total;
  - the data the agent receives (since the agent may have specified a label/field filter): for the list pods example above, this is the listed-pods-filtered file, 526K in total.

Collect and sort all LIST requests in this way, and you know how much pressure a one-off agent start puts on apiserver/etcd.
```sh
$ ls -ahl listed-*
-rw-r--r-- 1 root root  222 listed-apiextensions.k8s.io-v1-customeresourcedefinitions
-rw-r--r-- 1 root root 5.8M listed-apiextensions.k8s.io-v1-customresourcedefinitions
-rw-r--r-- 1 root root 2.0M listed-cilium.io-v2-ciliumclusterwidenetworkpolicies
-rw-r--r-- 1 root root 193M listed-cilium.io-v2-ciliumendpoints
-rw-r--r-- 1 root root  185 listed-cilium.io-v2-ciliumnetworkpolicies
-rw-r--r-- 1 root root 6.6M listed-cilium.io-v2-ciliumnodes
-rw-r--r-- 1 root root  42M listed-discovery.k8s.io-v1beta1-endpointslices
-rw-r--r-- 1 root root  69K listed-namespaces
-rw-r--r-- 1 root root  222 listed-networking.k8s.io-networkpolicies
-rw-r--r-- 1 root root  70M listed-nodes            ## only used to evaluate the data apiserver has to process
-rw-r--r-- 1 root root  25K listed-nodes-filtered
-rw-r--r-- 1 root root 2.0G listed-pods             ## only used to evaluate the data apiserver has to process
-rw-r--r-- 1 root root 526K listed-pods-filtered
-rw-r--r-- 1 root root  23M listed-services         ## only used to evaluate the data apiserver has to process
-rw-r--r-- 1 root root  23M listed-services-filtered
```
Still using cilium as the example, the rough ranking (by the amount of data apiserver processes, in JSON format) looks like this:

| LIST resource type | Data processed by apiserver (JSON) | Time taken |
|---|---|---|
| CiliumEndpoints (full) | 193MB | 11s |
| Nodes (full) | 70MB | 0.5s |
| ... | ... | ... |
5 Large-scale basic services: deployment and tuning recommendations

5.1 Set ResourceVersion=0 on LIST requests by default

As already introduced, not setting this parameter causes apiserver to pull the full data from etcd and then filter it. So unless the data accuracy requirements are extremely high and the data really must come from etcd, LIST requests should set the ResourceVersion=0 parameter so that apiserver serves them from its cache.

If you are using the client-go ListWatch/informer interfaces, they already set ResourceVersion=0 by default.
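As a reference, here is a minimal informer sketch (my addition; the 30-minute resync period is an arbitrary example). The initial LIST issued by the reflector is served from the apiserver cache:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

func runPodInformer(ctx context.Context, client kubernetes.Interface) {
	factory := informers.NewSharedInformerFactory(client, 30*time.Minute)
	podInformer := factory.Core().V1().Pods().Informer()

	factory.Start(ctx.Done())
	// Block until the initial LIST (served from the apiserver cache) has been synced locally.
	if !cache.WaitForCacheSync(ctx.Done(), podInformer.HasSynced) {
		panic("timed out waiting for the pod informer to sync")
	}
	fmt.Println("pods in the local informer cache:", len(podInformer.GetStore().List()))
}
```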
5.2 Prefer the namespaced API

If you only need to list resources in one or a few namespaces, prefer the namespaced API, as in the sketch below:

- Namespaced API: /api/v1/namespaces/<namespace>/pods?query=xxx
- Un-namespaced API: /api/v1/pods?query=xxx
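A short client-go sketch of the difference (my addition; the namespace kube-system is only an example):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// countPods compares the cluster-wide LIST with a namespaced LIST.
func countPods(ctx context.Context, client kubernetes.Interface) {
	all, _ := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{ResourceVersion: "0"})           // /api/v1/pods
	ns, _ := client.CoreV1().Pods("kube-system").List(ctx, metav1.ListOptions{ResourceVersion: "0"}) // /api/v1/namespaces/kube-system/pods
	fmt.Printf("cluster-wide: %d pods, kube-system only: %d pods\n", len(all.Items), len(ns.Items))
}
```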
5.3 Restart backoff

Basic services deployed per node, such as kubelet, cilium-agent, and daemonsets in general, need an effective restart backoff to reduce control-plane pressure during large-scale restarts. For example, after they all go down at the same time, the number of agents restarted per minute should not exceed 10% of the cluster size (configurable or computed automatically).
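A minimal startup-jitter sketch (my addition; this is not how any particular agent implements it): each instance sleeps a random amount of time before issuing its initial LISTs, so that a mass restart does not hit the apiserver all at once:

```go
package main

import (
	"math/rand"
	"time"
)

// waitStartupJitter delays startup by a random duration in [0, maxJitter).
// maxJitter could be made proportional to cluster size by the deployment system.
func waitStartupJitter(maxJitter time.Duration) {
	time.Sleep(time.Duration(rand.Int63n(int64(maxJitter))))
}

func main() {
	waitStartupJitter(5 * time.Minute) // example upper bound
	// ... then build the client, start informers, etc.
}
```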
5.4 Prefer filtering on the server side with label/field selectors

If you need to cache certain resources and listen for changes, use the ListWatch mechanism: pull the data into a local cache, and let the business logic filter from that local cache as needed. This is client-go's ListWatch/informer mechanism.

But if it is just a one-off LIST with filter conditions, such as the earlier example of filtering pods by nodeName, then you should set a label or field selector and let apiserver do the filtering for you. LISTing 100K pods takes tens of seconds (most of the time spent on data transfer, while also consuming a lot of apiserver CPU/bandwidth/IO); if you only need the pods on the local node, then with spec.nodeName=node1 set, the LIST may return in just 0.05s. It is also very important not to forget resourceVersion=0 in the request.
5.4.1 Label selector

Filtered in apiserver memory.

5.4.2 Field selector

Filtered in apiserver memory.

5.4.3 Namespace selector

In etcd, the namespace is part of the key prefix, so filtering by namespace is much faster than filtering with selectors that are not part of the prefix.
5.5 Supporting infrastructure (monitoring, alerting, etc.)

As the analysis above shows, a single client request may return only a few hundred KB of data, while apiserver (and, worse, etcd) has to process several GB. We should therefore avoid large-scale restarts of basic services as much as possible, which in turn requires good monitoring and alerting.

5.5.1 Use a dedicated ServiceAccount per service

Each basic service (kubelet, cilium-agent, etc.), as well as every operator that issues many LIST requests to apiserver, should use its own ServiceAccount. This lets apiserver distinguish the source of each request, which is very useful for monitoring, troubleshooting, and server-side rate limiting.

5.5.2 Liveness monitoring and alerting

Basic services must be covered by liveness monitoring, and there must be a P1-level liveness alert so that large-scale outages are discovered immediately. Control-plane pressure is then reduced by the restart backoff.
5.5.3 Monitoring and tuning of etcd

etcd needs monitoring and alerting on the key performance-related indicators:

- memory usage
- network bandwidth
- number of LIST requests and their response times

For example, an etcd warning log for a slow LIST of pods looks like this:

```json
{
  "level": "warn",
  "msg": "apply request took too long",
  "took": "5357.87304ms",
  "expected-duration": "100ms",
  "prefix": "read-only range ",
  "request": "key:\"/registry/pods/\" range_end:\"/registry/pods0\" ",
  "response": "range_response_count:60077 size:602251227"
}
```

Deployment and configuration tuning:
6 Other

6.1 GET requests: GetOptions{}

The principle is the same as with ListOptions{}: not setting ResourceVersion=0 causes apiserver to go to etcd for the data, which should be avoided whenever possible.
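A short sketch of the GET counterpart (my addition; the node name is a placeholder):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// getNodeFromCache reads a single Node object from the apiserver cache rather than etcd.
func getNodeFromCache(ctx context.Context, client kubernetes.Interface, name string) {
	node, err := client.CoreV1().Nodes().Get(ctx, name, metav1.GetOptions{ResourceVersion: "0"})
	if err != nil {
		panic(err)
	}
	fmt.Println(node.Name, node.Status.NodeInfo.KubeletVersion)
}
```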
References

- Kubernetes API Concepts [4]
- [Paper] Raft consensus algorithm (and etcd/raft source code analysis) (USENIX, 2014) [5]

Cited links

[1] Example: https://arthurchiao.art/blog/k8s-reliability-list-data-zh/#client_code_empty_rv
[2] Table: https://kubernetes.io/docs/reference/using-api/api-concepts/#the-resourceversion-parameter
[3] Example: https://arthurchiao.art/blog/k8s-reliability-list-data-zh/#client_code_empty_rv
[4] Kubernetes API Concepts: https://kubernetes.io/docs/reference/using-api/api-concepts/
[5] [Paper] Raft consensus algorithm (and etcd/raft source code analysis) (USENIX, 2014): https://arthurchiao.art/blog/raft-paper-zh/