This article is reprinted from ArthurChiao’s Blog, original text: https://arthurchiao.art/blog/k8s-reliability-list-data-zh/, copyright belongs to the original author.

For unstructured data storage systems, LIST operations are usually very heavyweight. They consume a lot of disk IO, network bandwidth, and CPU, and they also affect other requests served in the same period (especially leader election requests, which are extremely sensitive to response latency), which makes them a major threat to cluster stability.

For example, for Ceph object storage, each LIST bucket request has to go to multiple disks to retrieve all the data of the bucket. Not only is the request itself slow, it also affects other normal read and write requests in the same period because the IO is shared, resulting in increased response latency and even timeouts. If the bucket holds a very large number of objects (for example, when it is the storage backend for harbor/docker-registry), the LIST operation cannot even finish in a reasonable time (so the registry GC, which relies on the LIST bucket operation, cannot run).

Another example is the KV store etcd. Compared to Ceph, a real etcd cluster may store only a small amount of data (a few GB to dozens of GB), small enough to be cached entirely in memory. But unlike Ceph, its number of concurrent requests can be orders of magnitude higher, for example when it is the etcd of a ~4000 node k8s cluster. A single LIST request may only need to return tens of MB to a few GB of traffic, but once there are many concurrent requests etcd obviously cannot carry the load, so it is best to have a cache layer in front of it, which is the role of apiserver. Most of K8s' LIST requests should be absorbed by apiserver and served from its local cache, but if LIST is used improperly, the request skips the cache and goes directly to etcd, which is a great stability risk.

This article delves into the LIST processing logic and performance bottlenecks of k8s apiserver/etcd, and provides some suggestions on LIST stress testing, deployment, and tuning of basic services, in order to improve the stability of large-scale K8s clusters.

kube-apiserver LIST request processing logic:

The code is based on v1.24.0, but the basic logic and code paths are the same across 1.19~1.24, so you can refer to it for those versions if necessary.

1 Introduction

1.1 K8s architecture

From the perspective of architecture hierarchy and component dependencies, a K8s cluster and a Linux host can be compared as follows:

Fig 1. Analogy: a Linux host and a Kubernetes cluster

For K8s clusters, the components and their functions, from the inside out, are:

  1. etcd: persistent KV storage, the single authoritative data (state) source for cluster resources (pods/services/networkpolicies/…);
  2. apiserver: reads (ListWatch) the full data from etcd and caches it in memory; a stateless service, horizontally scalable;
  3. Basic services (e.g. kubelet, *-agent, *-operator): connect to apiserver and get (List/ListWatch) the data they need;
  4. Workloads within the cluster: created, managed, and reconciled by 3 when 1 and 2 are working normally, such as kubelet creating pods and cilium configuring network and security policies.

1.2 apiserver/etcd role

As you can see above, there are two levels of List/ListWatch on the system path (but the data is the same):

  1. apiserver List/ListWatch etcd
  2. basic service List/ListWatch apiserver

Therefore, in its simplest form, apiserver is a proxy that stands in front of etcd:

               +--------+              +---------------+                 +------------+
               | Client | -----------> | Proxy (cache) | --------------> | Data store |
               +--------+              +---------------+                 +------------+
             infra services               apiserver                         etcd

  1. In most cases, apiserver serves requests directly from its local cache (because it caches the full data of the cluster);

  2. In some special cases, for example when

    1. the client explicitly requests to read data from etcd (for the highest data accuracy), or
    2. the apiserver local cache has not yet been built,

    apiserver can only forward requests to etcd. Pay special attention here: a client LIST request whose parameters are not set properly may also end up on this path.

1.3 apiserver/etcd list Overhead

1.3.1 Request example

Consider the following LIST operations:

  1. LIST apis/cilium.io/v2/ciliumendpoints?limit=500&resourceVersion=0

    Both parameters are passed here, but resourceVersion=0 causes apiserver to ignore limit=500, so the client gets the full ciliumendpoints data.

    The full data of a resource can be quite large, so consider whether you really need all of it. Quantitative measurement and analysis methods are introduced later.

  2. LIST api/v1/pods?fieldSelector=spec.nodeName%3Dnode1

    This request gets all pods on node1 (%3D is the escaped form of =).

    Filtering by nodename may feel like it involves only a small amount of data, but it is more complicated than it seems:

    • First, resourceVersion=0 is not specified here, so apiserver skips the cache and reads the data directly from etcd;
    • Second, etcd is only a KV store and cannot filter by label or field (it only handles limit/continue), so apiserver pulls the full data from etcd and then filters it in memory. This overhead is also very large; the code analysis comes later.

    Avoid this behavior unless you have very strict requirements on data accuracy and deliberately want to bypass the apiserver cache.

  3. LIST api/v1/pods?fieldSelector=spec.nodeName%3Dnode1&resourceVersion=0

    The difference from 2 is the added resourceVersion=0, so apiserver reads the data from its cache and performance improves by orders of magnitude.

    Note, however, that while the amount of data actually returned to the client may only be a few hundred KB to hundreds of MB (depending on the number of pods on the node, the number of labels on the pods, etc.), the amount of data that apiserver needs to process may be several GB. A quantitative analysis follows later.

As you can see above, different LIST operations have very different impact, and the data the client sees may be only a small part of the data processed by apiserver/etcd. If basic services are started or restarted on a large scale, they can easily overwhelm the control plane.
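The difference between requests 2 and 3 is a single query parameter, which is easy to get wrong when URLs are assembled by hand. The following is a minimal sketch, using only the Go standard library, of building both URLs; `buildListURL` and the base address are hypothetical names chosen for illustration, not part of any k8s client.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildListURL assembles a LIST URL for the given resource path.
// An empty fieldSelector or resourceVersion is left out of the query string.
// (Hypothetical helper for illustration only.)
func buildListURL(base, resourcePath, fieldSelector, resourceVersion string) string {
	q := url.Values{}
	if fieldSelector != "" {
		q.Set("fieldSelector", fieldSelector)
	}
	if resourceVersion != "" {
		q.Set("resourceVersion", resourceVersion)
	}
	u := base + "/" + resourcePath
	if len(q) > 0 {
		// Encode() sorts keys and escapes "=" inside values as %3D.
		u += "?" + q.Encode()
	}
	return u
}

func main() {
	base := "https://localhost:6443" // assumed apiserver address

	// Request 2: no resourceVersion, so apiserver goes to etcd.
	fmt.Println(buildListURL(base, "api/v1/pods", "spec.nodeName=node1", ""))

	// Request 3: resourceVersion=0, so apiserver can serve from its cache.
	fmt.Println(buildListURL(base, "api/v1/pods", "spec.nodeName=node1", "0"))
}
```

Keeping the resourceVersion as an explicit argument makes the cache-or-etcd decision visible at every call site instead of leaving it to a default.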

1.3.2 Processing overhead

List requests can be divided into two types:

  1. List full data: the overhead is mainly spent on data transmission;
  2. List with filtering by label or field: only the matching data is required.

What needs to be specifically explained here is the second case, where the list request has a filter condition.

  • In most cases, apiserver filters using its own cache, which is fast, so the time is mainly spent on data transmission;

  • In cases where the request needs to be forwarded to etcd, as mentioned earlier, etcd is only a KV store and does not understand label/field information, so it cannot handle filtering. The actual process is: apiserver pulls all the data from etcd, filters it in memory, and then returns the result to the client.

    Therefore, in addition to the data transfer overhead (network bandwidth), this case also consumes a lot of apiserver CPU and memory.
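The gap between what is scanned and what is returned can be illustrated with a toy model. This is only a sketch of the "pull everything, then filter in memory" pattern described above; the `pod` type, the 20KB object size, and the 1000-pod/40-node layout are made-up assumptions, not apiserver code or measured data.

```go
package main

import "fmt"

// pod is a toy object; SizeBytes stands in for its serialized size.
type pod struct {
	Name      string
	NodeName  string
	SizeBytes int
}

// filterByNode scans every pod and keeps those on nodeName.
// It also reports how many bytes were scanned and how many were returned.
func filterByNode(all []pod, nodeName string) (matched []pod, scanned, returned int) {
	for _, p := range all {
		scanned += p.SizeBytes // every object is touched, matching or not
		if p.NodeName == nodeName {
			matched = append(matched, p)
			returned += p.SizeBytes
		}
	}
	return matched, scanned, returned
}

func main() {
	// Assumed toy cluster: 1000 pods of 20KB each, spread over 40 nodes.
	var all []pod
	for i := 0; i < 1000; i++ {
		all = append(all, pod{
			Name:      fmt.Sprintf("pod-%d", i),
			NodeName:  fmt.Sprintf("node%d", i%40),
			SizeBytes: 20 * 1024,
		})
	}

	matched, scanned, returned := filterByNode(all, "node1")
	fmt.Printf("matched=%d scanned=%dKB returned=%dKB\n",
		len(matched), scanned/1024, returned/1024)
}
```

The client only sees the returned bytes, while the server-side cost is proportional to the scanned bytes.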

1.4 Potential problems when deploying at scale

To take another example, the following line of code uses k8s client-go to filter pods by nodename:

    podList, err := Client().CoreV1().Pods("").List(ctx(), ListOptions{FieldSelector: "spec.nodeName=node1"})

It seems very simple, but let's look at the amount of data behind it. Taking a cluster of 4000 nodes and 100K pods as an example, the full pod data volume is:

  1. etcd: compact unstructured KV storage, on the order of 1GB;
  2. apiserver cache: already structured golang objects, on the order of 2GB (TODO: further confirmation required);
  3. apiserver response: the client generally receives the default JSON format, which is also structured data. The JSON of the full pod list is also in the 2GB range.

As you can see, some requests look simple and are just one line of code on the client, but the amount of data behind them is staggering. Specifically, listing pods filtered by nodeName may only return 500KB of data, but apiserver needs to filter 2GB of data, and in the worst case etcd also has to process 1GB of data (the parameters above do hit the worst case; see the code analysis below).

When the cluster is relatively small, this problem may not be visible (etcd does not start printing warning logs until the LIST response latency exceeds a certain threshold). Once the cluster is large, if there are many such requests, apiserver/etcd will not be able to carry the load.

1.5 The purpose of this article

This article looks at the List/ListWatch implementation of k8s in depth, to deepen the understanding of its performance problems and to provide some references for the stability optimization of large-scale K8s clusters.

2 apiserver List() operation source code analysis

With the theoretical warm-up above, we can now look at the code implementation.

2.1 Call stack and flowchart

    store.List
    |-store.ListPredicate
       |-if opt == nil
       |   opt = ListOptions{ResourceVersion: ""}
       |-Init SelectionPredicate.Limit/Continue field
       |-list := e.NewListFunc()                               // objects will be stored in this list
       |-storageOpts := storage.ListOptions{opt.ResourceVersion, opt.ResourceVersionMatch, Predicate: p}
       |
       |-if MatchesSingle ok                                   // 1. when "metadata.name" is specified, get single obj
       |    Get single obj from cache or etcd
       |
       |-return e.Storage.List(KeyRootFunc(ctx), storageOpts)  // 2. get all objs and perform filtering
          |-cacher.List()
             | // case 1: list all from etcd and filter in apiserver
             |-if shouldDelegateList(opts)                     // true if resourceVersion == ""
             |    return c.storage.List                        // list from etcd
             |             |- fromRV *int64 = nil
             |             |- if len(storageOpts.ResourceVersion) > 0
             |             |     rv = ParseResourceVersion
             |             |     fromRV = &rv
             |             |
             |             |- for hasMore {
             |             |    objs := etcdclient.KV.Get()
             |             |    filter(objs)                   // filter by labels or fields
             |             | }
             |
             | // case 2: local cache not ready, list all from etcd and filter in apiserver
             |-if cache.notready()
             |   return c.storage.List                         // get from etcd
             |
             | // case 3: list & filter from apiserver local cache (memory)
             |-obj := watchCache.WaitUntilFreshAndGet
             |-for elem in obj.(*storeElement)
             |   listVal.Set()                                 // append results to listObj
             |-return                                          // results stored in listObj

The corresponding flowchart:

Fig 2-1. List operation processing in apiserver

2.2 Request processing entry: List()

    // https://github.com/kubernetes/kubernetes/blob/v1.24.0/staging/src/k8s.io/apiserver/pkg/registry/generic/registry/store.go#L361

    // List returns a list of objects
    func (e *Store) List(ctx, options *metainternalversion.ListOptions) (runtime.Object, error) {
        label := labels.Everything()
        if options != nil && options.LabelSelector != nil {
            label = options.LabelSelector // label filter, e.g. app=nginx
        }

        field := fields.Everything()
        if options != nil && options.FieldSelector != nil {
            field = options.FieldSelector // field filter, e.g. spec.nodeName=node1
        }

        out := e.ListPredicate(ctx, e.PredicateFunc(label, field), options) // pull (List) the data and filter it (Predicate)
        if e.Decorator != nil {
            e.Decorator(out)
        }

        return out, nil
    }

2.3 ListPredicate()

    // https://github.com/kubernetes/kubernetes/blob/v1.24.0/staging/src/k8s.io/apiserver/pkg/registry/generic/registry/store.go#L411

    func (e *Store) ListPredicate(ctx, p storage.SelectionPredicate, options *metainternalversion.ListOptions) (runtime.Object, error) {
        // Step 1: initialization
        if options == nil {
            options = &metainternalversion.ListOptions{ResourceVersion: ""}
        }
        p.Limit    = options.Limit
        p.Continue = options.Continue
        list := e.NewListFunc()             // the returned results will be stored here
        storageOpts := storage.ListOptions{ // convert the API-side ListOptions to the storage-side ListOptions; field differences are described below
            ResourceVersion:      options.ResourceVersion,
            ResourceVersionMatch: options.ResourceVersionMatch,
            Predicate:            p,
            Recursive:            true,
        }

        // Step 2: if the request specifies metadata.name, get the single object; no need to filter the full data
        if name, ok := p.MatchesSingle(); ok {                // check whether the metadata.name field is set
            if key, err := e.KeyFunc(ctx, name); err == nil { // get the key of this object in etcd (unique or nonexistent)
                storageOpts.Recursive = false
                e.Storage.GetList(ctx, key, storageOpts, list)
                return list
            }
            // else: reaching here means the key for filtering could not be obtained from the context,
            // so fall back to listing the full data below and filtering it
        }

        // Step 3: list the full data and filter it
        e.Storage.GetList(ctx, e.KeyRootFunc(), storageOpts, list) // KeyRootFunc() returns the root key (i.e. prefix, without the trailing /) of this resource in etcd
        return list
    }

In 1.24.0, both case 1 & 2 call e.Storage.GetList(), which was a bit different in previous versions:

  • e.Storage.GetToList in Case 1
  • e.Storage.List in Case 2

but the basic flow is the same.

The steps are:

  1. If the client does not pass ListOption, a default value is initialized in which ResourceVersion is set to the empty string. This makes apiserver pull the data from etcd and return it to the client without using the local cache (the same path that is taken when the local cache has not been built yet);

    For example, if the client sets ListOption{Limit: 500, ResourceVersion: 0} to list ciliumendpoints, the request sent will be /apis/cilium.io/v2/ciliumendpoints?limit=500&resourceVersion=0

    The behavior when ResourceVersion is an empty string is analyzed later.

  2. Initialize the limit/continue fields of the filter (SelectionPredicate) with the fields in the ListOptions;

  3. Initialize the returned result, list := e.NewListFunc();

  4. Convert the API-side ListOption to the ListOption of the underlying storage; the field differences are shown below.

    metainternalversion.ListOptions is the API-side struct:

        // staging/src/k8s.io/apimachinery/pkg/apis/meta/internalversion/types.go

        // ListOptions is the query options to a standard REST list call.
        type ListOptions struct {
            metav1.TypeMeta

            LabelSelector labels.Selector // label filter, e.g. app=nginx
            FieldSelector fields.Selector // field filter, e.g. spec.nodeName=node1

            Watch               bool
            AllowWatchBookmarks bool

            ResourceVersion      string
            ResourceVersionMatch metav1.ResourceVersionMatch

            TimeoutSeconds *int64         // Timeout for the list/watch call.
            Limit          int64
            Continue       string         // a token returned by the server. return a 410 error if the token has expired.
        }

    storage.ListOptions is the struct passed to the underlying storage, with some differences in fields:

        // staging/src/k8s.io/apiserver/pkg/storage/interfaces.go

        // ListOptions provides the options that may be provided for storage list operations.
        type ListOptions struct {
            ResourceVersion      string
            ResourceVersionMatch metav1.ResourceVersionMatch
            Predicate            SelectionPredicate // Predicate provides the selection rules for the list operation.
            Recursive            bool               // true: get the full data under the key prefix; false: get a single object by exact key
            ProgressNotify       bool               // storage-originated bookmark, ignored for non-watch requests.
        }

2.4 The request specifies a resource name: get a single object

Next, depending on whether meta.Name is specified in the request, there are two cases:

  1. If it is specified, a single object is being queried, because Name is unique, and the request is handed to the single-object query logic;
  2. If it is not specified, the full data has to be obtained and then filtered in apiserver memory according to the filter conditions in SelectionPredicate, and the final result is returned to the client.

The code is as follows:

    // Case 1: get a single object by metadata.name, without filtering the full data
    if name, ok := p.MatchesSingle(); ok {                // check whether the metadata.name field is set
        if key, err := e.KeyFunc(ctx, name); err == nil {
            e.Storage.GetList(ctx, key, storageOpts, list)
            return list
        }
        // else: reaching here means the key for filtering could not be obtained from the context,
        // so fall back to listing the full data below and filtering it
    }

e.Storage is an interface:

    // staging/src/k8s.io/apiserver/pkg/storage/interfaces.go

    // Interface offers a common interface for object marshaling/unmarshaling operations and
    // hides all the storage-related operations behind it.
    type Interface interface {
        Create(ctx, key string, obj, out runtime.Object, ttl uint64) error
        Delete(ctx, key string, out runtime.Object, preconditions *Preconditions,...)
        Watch(ctx, key string, opts ListOptions) (watch.Interface, error)
        Get(ctx, key string, opts GetOptions, objPtr runtime.Object) error

        // unmarshall objects found at key into a *List api object (an object that satisfies runtime.IsList definition).
        // If 'opts.Recursive' is false, 'key' is used as an exact match; if is true, 'key' is used as a prefix.
        // The returned contents may be delayed, but it is guaranteed that they will
        // match 'opts.ResourceVersion' according 'opts.ResourceVersionMatch'.
        GetList(ctx, key string, opts ListOptions, listObj runtime.Object) error
        // ...
    }

e.Storage.GetList() executes into the cacher code.

Whether you get a single object or the full data, the process is similar:

  1. first try the apiserver local cache (the determining factors include ResourceVersion, etc.);
  2. go to etcd only when it cannot be avoided.

The logic of getting a single object is relatively simple, so it is not covered here. Next, let's look at the logic of listing the full data and then filtering it.

2.5 The request does not specify a resource name: get the full data and filter it

2.5.1 apiserver caching layer: GetList() processing logic

    // https://github.com/kubernetes/kubernetes/blob/v1.24.0/staging/src/k8s.io/apiserver/pkg/storage/cacher/cacher.go#L622

    // GetList implements storage.Interface
    func (c *Cacher) GetList(ctx, key string, opts storage.ListOptions, listObj runtime.Object) error {
        recursive := opts.Recursive
        resourceVersion := opts.ResourceVersion
        pred := opts.Predicate

        // Case 1: ListOption requires that the data must be read from etcd
        if shouldDelegateList(opts) {
            return c.storage.GetList(ctx, key, opts, listObj) // c.storage points to etcd
        }

        // If resourceVersion is specified, serve it from cache
        listRV := c.versioner.ParseResourceVersion(resourceVersion)

        // Case 2: the apiserver cache is not built yet, so the data can only be read from etcd
        if listRV == 0 && !c.ready.check() {
            return c.storage.GetList(ctx, key, opts, listObj)
        }

        // Case 3: the apiserver cache is ready, read from cache: ensure that the returned object version is not lower than 'listRV'
        listPtr := meta.GetItemsPtr(listObj)
        listVal := conversion.EnforcePtr(listPtr)
        filter := filterWithAttrsFunction(key, pred) // the final filter

        objs, readResourceVersion, indexUsed := c.listItems(listRV, key, pred, ...) // performance optimization
        for _, obj := range objs {
            elem := obj.(*storeElement)
            if filter(elem.Key, elem.Labels, elem.Fields) { // the real filtering
                listVal.Set(reflect.Append(listVal, reflect.ValueOf(elem)))
            }
        }

        // Update the last read ResourceVersion
        if c.versioner != nil {
            c.versioner.UpdateList(listObj, readResourceVersion, "", nil)
        }
        return nil
    }

2.5.2 Determine whether data must be read from etcd: shouldDelegateList()

    // https://github.com/kubernetes/kubernetes/blob/v1.24.0/staging/src/k8s.io/apiserver/pkg/storage/cacher/cacher.go#L591

    func shouldDelegateList(opts storage.ListOptions) bool {
        resourceVersion := opts.ResourceVersion
        pred            := opts.Predicate
        pagingEnabled   := DefaultFeatureGate.Enabled(features.APIListChunking)      // enabled by default
        hasContinuation := pagingEnabled && len(pred.Continue) > 0                   // Continue is a token
        hasLimit        := pagingEnabled && pred.Limit > 0 && resourceVersion != "0" // hasLimit can only be true if resourceVersion != "0"

        // 1. If resourceVersion is not specified, the data is pulled from the underlying storage (etcd);
        // 2. If there is a continuation, also pull the data from the underlying storage;
        // 3. The limit is only passed to the underlying storage (etcd) if resourceVersion != "0", because the watch cache does not support continuation
        return resourceVersion == "" || hasContinuation || hasLimit || opts.ResourceVersionMatch == metav1.ResourceVersionMatchExact
    }

This function is very important:

  1. Q: If the client does not set the ResourceVersion field in ListOption{}, does that correspond to resourceVersion == "" here?

    A: Yes, so the example [1] in section 1 results in pulling the full data from etcd.

  2. Q: Does setting limit=500&resourceVersion=0 on the client cause hasContinuation==true next time?

    A: No. resourceVersion=0 causes the limit to be ignored (see the hasLimit line of code), that is, although limit=500 is specified, the request returns the full data.

  3. Q: What is ResourceVersionMatch used for?

    A: It tells apiserver how to interpret the ResourceVersion. The official documentation has a rather complicated table [2]; take a look if you are interested.
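To check your own request parameters against this rule, the decision can be reproduced in a few lines. The sketch below mirrors the boolean expression in the shouldDelegateList() excerpt under two assumptions: paging (APIListChunking) is enabled, and `listOpts` with its `MatchExact` flag is a simplified, hypothetical stand-in for storage.ListOptions. It is not the apiserver implementation.

```go
package main

import "fmt"

// listOpts is a simplified, hypothetical stand-in for storage.ListOptions.
type listOpts struct {
	ResourceVersion string
	Limit           int64
	Continue        string
	MatchExact      bool // stands in for ResourceVersionMatch == Exact
}

// goesToEtcd mirrors the return expression of shouldDelegateList(),
// assuming paging is enabled.
func goesToEtcd(o listOpts) bool {
	hasContinuation := len(o.Continue) > 0
	hasLimit := o.Limit > 0 && o.ResourceVersion != "0"
	return o.ResourceVersion == "" || hasContinuation || hasLimit || o.MatchExact
}

func main() {
	cases := []struct {
		name string
		opts listOpts
	}{
		{"(no parameters)", listOpts{}},
		{"resourceVersion=0", listOpts{ResourceVersion: "0"}},
		{"limit=500&resourceVersion=0", listOpts{ResourceVersion: "0", Limit: 500}},
		{"limit=500", listOpts{Limit: 500}},
		{"continue=<token>", listOpts{Continue: "token"}},
	}
	for _, c := range cases {
		fmt.Printf("%-30s -> etcd=%v\n", c.name, goesToEtcd(c.opts))
	}
}
```

Only the two cases with resourceVersion=0 avoid etcd in this model, which matches the answers to questions 1 and 2 above.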

Next, let's go back to the cacher's GetList() logic and look at each of the cases it handles.

2.5.3 Case 1: ListOption requires reading data from etcd

In this case, apiserver reads all objects directly from etcd, filters them, and returns them to the client. This is suitable for scenarios with extremely strict data consistency requirements. Of course, it is also easy to end up here by mistake and put excessive pressure on etcd, as in the example in the first section [3].

    // https://github.com/kubernetes/kubernetes/blob/v1.24.0/staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go#L563

    // GetList implements storage.Interface.
    func (s *store) GetList(ctx, key string, opts storage.ListOptions, listObj runtime.Object) error {
        listPtr   := meta.GetItemsPtr(listObj)
        v         := conversion.EnforcePtr(listPtr)
        key        = path.Join(s.pathPrefix, key)
        keyPrefix := key // append '/' if needed

        newItemFunc := getNewItemFunc(listObj, v)

        var fromRV *uint64
        if len(resourceVersion) > 0 { // if RV is not empty (it defaults to the empty string when the client does not pass it)
            parsedRV := s.versioner.ParseResourceVersion(resourceVersion)
            fromRV = &parsedRV
        }

        // ResourceVersion, ResourceVersionMatch and other processing logic
        switch {
        case recursive && s.pagingEnabled && len(pred.Continue) > 0: ...
        case recursive && s.pagingEnabled && pred.Limit > 0        : ...
        default                                                    : ...
        }

        // loop until we have filled the requested limit from etcd or there are no more results
        for {
            getResp = s.client.KV.Get(ctx, key, options...) // pull data from etcd
            numFetched += len(getResp.Kvs)
            hasMore = getResp.More

            for i, kv := range getResp.Kvs {
                if limitOption != nil && int64(v.Len()) >= pred.Limit {
                    hasMore = true
                    break
                }

                lastKey = kv.Key
                data := s.transformer.TransformFromStorage(ctx, kv.Value, kv.Key)
                appendListItem(v, data, kv.ModRevision, pred, s.codec, s.versioner, newItemFunc) // the filtering happens here
                numEvald++
            }

            key = string(lastKey) + "\x00"
        }

        // instruct the client to begin querying from immediately after the last key we returned
        if hasMore {
            // we want to start immediately after the last key
            next := encodeContinue(string(lastKey)+"\x00", keyPrefix, returnedRV)
            return s.versioner.UpdateList(listObj, uint64(returnedRV), next, remainingItemCount)
        }

        // no continuation
        return s.versioner.UpdateList(listObj, uint64(returnedRV), "", nil)
    }
  • client.KV.Get() enters the etcd client library; you can keep digging down if you are interested.
  • appendListItem() filters the fetched data; this is the apiserver in-memory filtering mentioned in the first section.
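The shape of that loop (fetch a page, filter it in memory, keep going until the limit is filled, resume after the last key plus "\x00") can be tried out on a toy in-memory store. The sketch below is an assumption-laden model, not the etcd3 store code: `kv`, `rangeGet`, and `listFiltered` are hypothetical names, the store is a sorted slice, and limit is assumed to be greater than 0.

```go
package main

import (
	"fmt"
	"sort"
)

// kv is a toy key/value pair; the store is a slice sorted by Key.
type kv struct{ Key, Value string }

// rangeGet mimics a ranged KV read: up to limit entries with Key >= start,
// plus a flag telling whether more entries exist beyond the page.
func rangeGet(store []kv, start string, limit int) (page []kv, more bool) {
	i := sort.Search(len(store), func(i int) bool { return store[i].Key >= start })
	end := i + limit
	if end >= len(store) {
		return store[i:], false
	}
	return store[i:end], true
}

// listFiltered pages through the store from start, keeping entries accepted
// by keep, until limit matches are collected or the store is exhausted.
// It returns the matches, the number of entries fetched, and the key to
// continue from ("" when there is nothing left). limit must be > 0.
func listFiltered(store []kv, start string, limit int, keep func(kv) bool) (out []kv, fetched int, next string) {
	key := start
	for {
		page, more := rangeGet(store, key, limit)
		fetched += len(page)

		var lastKey string
		for _, e := range page {
			if len(out) >= limit {
				more = true // entries remain that were fetched but not evaluated
				break
			}
			lastKey = e.Key
			if keep(e) { // in-memory filtering
				out = append(out, e)
			}
		}

		if !more {
			return out, fetched, ""
		}
		if len(out) >= limit {
			return out, fetched, lastKey + "\x00"
		}
		key = lastKey + "\x00" // resume immediately after the last evaluated key
	}
}

func main() {
	// Assumed toy data: 10 pods spread over 5 nodes.
	var store []kv
	for i := 0; i < 10; i++ {
		store = append(store, kv{
			Key:   fmt.Sprintf("/registry/pods/default/p%02d", i),
			Value: fmt.Sprintf("node%d", i%5),
		})
	}
	onNode1 := func(e kv) bool { return e.Value == "node1" }

	out, fetched, next := listFiltered(store, "", 2, onNode1)
	fmt.Printf("returned=%d fetched=%d next=%q\n", len(out), fetched, next)
}
```

In this toy run the store has to hand over several pages before two matching entries are found, which is the same amplification effect the section describes for selector-filtered LISTs that reach etcd.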

2.5.4 Case 2: the local cache has not been built, so data can only be read from etcd

The specific execution process is the same as that of case 1.

2.5.5 Case 3: use the local cache

    // https://github.com/kubernetes/kubernetes/blob/v1.24.0/staging/src/k8s.io/apiserver/pkg/storage/cacher/cacher.go#L622

    // GetList implements storage.Interface
    func (c *Cacher) GetList(ctx, key string, opts storage.ListOptions, listObj runtime.Object) error {
        // Case 1: ListOption requires that the data must be read from etcd
        ...

        // Case 2: the apiserver cache is not built yet, so the data can only be read from etcd
        ...

        // Case 3: the apiserver cache is ready, read from cache: ensure that the returned object version is not lower than 'listRV'
        listPtr := meta.GetItemsPtr(listObj) // List elements with at least 'listRV' from cache.
        listVal := conversion.EnforcePtr(listPtr)
        filter := filterWithAttrsFunction(key, pred) // the final filter

        objs, readResourceVersion, indexUsed := c.listItems(listRV, key, pred, ...) // performance optimization
        for _, obj := range objs {
            elem := obj.(*storeElement)
            if filter(elem.Key, elem.Labels, elem.Fields) { // the real filtering
                listVal.Set(reflect.Append(listVal, reflect.ValueOf(elem)))
            }
        }

        if c.versioner != nil {
            c.versioner.UpdateList(listObj, readResourceVersion, "", nil)
        }
        return nil
    }

3 LIST test

To keep the client library (such as client-go) from automatically setting some parameters for us, we test directly with curl; only the certificates need to be specified:

    $ cat curl-k8s-apiserver.sh
    curl -s --cert /etc/kubernetes/pki/admin.crt --key /etc/kubernetes/pki/admin.key --cacert /etc/kubernetes/pki/ca.crt $@

Usage:

    $ ./curl-k8s-apiserver.sh "https://localhost:6443/api/v1/pods?limit=2"
    {
      "kind": "PodList",
      "metadata": {
        "resourceVersion": "2127852936",
        "continue": "eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJ...",
      },
      "items": [ {pod1 data}, {pod2 data} ]
    }

3.1 Specify limit=2: the response returns paging information (continue)

3.1.1 curl test

    $ ./curl-k8s-apiserver.sh "https://localhost:6443/api/v1/pods?limit=2"
    {
      "kind": "PodList",
      "metadata": {
        "resourceVersion": "2127852936",
        "continue": "eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJ...",
      },
      "items": [ {pod1 data}, {pod2 data} ]
    }

It can be seen that:

  • two pods are indeed returned, in the items[] field;
  • in addition, a continue field is returned in the metadata. When the client sends this parameter with its next request, apiserver continues to return the rest of the content, until apiserver no longer returns continue.
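The client side of this protocol is a simple loop: request a page, pass the returned continue token back, and stop when no token comes back. The sketch below models that loop without any network access; `fetchPage` is a hypothetical stand-in for one `GET ...?limit=N&continue=TOKEN` call, and its token is just the decimal index of the next item, whereas real tokens are opaque strings that clients should not interpret.

```go
package main

import (
	"fmt"
	"strconv"
)

// fetchPage is a hypothetical stand-in for one paged LIST call.
// The token is the decimal index of the next item; an empty next token
// means there is nothing left to fetch.
func fetchPage(all []string, limit int, token string) (items []string, next string) {
	start := 0
	if token != "" {
		start, _ = strconv.Atoi(token)
	}
	end := start + limit
	if end >= len(all) {
		return all[start:], ""
	}
	return all[start:end], strconv.Itoa(end)
}

// listAll keeps requesting pages until the server stops returning a token.
func listAll(all []string, limit int) (items []string, requests int) {
	token := ""
	for {
		page, next := fetchPage(all, limit, token)
		requests++
		items = append(items, page...)
		if next == "" {
			return items, requests
		}
		token = next
	}
}

func main() {
	pods := []string{"pod1", "pod2", "pod3", "pod4", "pod5"}
	items, requests := listAll(pods, 2)
	fmt.Printf("got %d pods in %d requests\n", len(items), requests)
}
```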

3.1.2 kubectl test

If you increase the log level of kubectl, you can also see that it uses continue behind the scenes to get the full list of pods:

    $ kubectl get pods --all-namespaces --v=10
    # The following is the log output, with appropriate adjustments
    ## curl -k -v -XGET -H "User-Agent: kubectl/v1.xx" -H "Accept: application/json;as=Table;v=v1;g=meta.k8s.io,application/json;as=Table;v=v1beta1;g=meta.k8s.io,application/json"
    ##   'http://localhost:8080/api/v1/pods?limit=500'
    ## GET http://localhost:8080/api/v1/pods?limit=500 200 OK in 202 milliseconds
    ## Response Body: {"kind":"Table","metadata":{"continue":"eyJ2Ijoib...","remainingItemCount":54},"columnDefinitions":[...],"rows":[...]}
    ##
    ## curl -k -v -XGET -H "Accept: application/json;as=Table;v=v1;g=meta.k8s.io,application/json;as=Table;v=v1beta1;g=meta.k8s.io,application/json" -H "User-Agent: kubectl/v1.xx"
    ##   'http://localhost:8080/api/v1/pods?continue=eyJ2Ijoib&limit=500'
    ## GET http://localhost:8080/api/v1/pods?continue=eyJ2Ijoib&limit=500 200 OK in 44 milliseconds
    ## Response Body: {"kind":"Table","metadata":{"resourceVersion":"2122644698"},"columnDefinitions":[],"rows":[...]}

The first request got 500 pods, and the second request carried the returned continue token: GET http://localhost:8080/api/v1/pods?continue=eyJ2Ijoib&limit=500. The continue value is a token and is rather long, so it is truncated here for readability.

3.2 Specify limit=2&resourceVersion=0: limit=2 is ignored and the full data is returned

    $ ./curl-k8s-apiserver.sh "https://localhost:6443/api/v1/pods?limit=2&resourceVersion=0"
    {
      "kind": "PodList",
      "metadata": {
        "resourceVersion": "2127852936",
        "continue": "eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJ...",
      },
      "items": [ {pod1 data}, {pod2 data}, ... ]
    }

items[] contains the full pod information.

3.3 Specify spec.nodeName=node1&resourceVersion=0 vs. spec.nodeName=node1

Same result

    $ ./curl-k8s-apiserver.sh "https://localhost:6443/api/v1/namespaces/default/pods?fieldSelector=spec.nodeName%3Dnode1" | jq '.items[].spec.nodeName'
    "node1"
    "node1"
    "node1"
    ...

    $ ./curl-k8s-apiserver.sh "https://localhost:6443/api/v1/namespaces/default/pods?fieldSelector=spec.nodeName%3Dnode1&resourceVersion=0" | jq '.items[].spec.nodeName'
    "node1"
    "node1"
    "node1"
    ...

The result is the same, unless the apiserver cache and the etcd data are inconsistent, which is extremely unlikely and is not discussed here.

The speed difference is very large

Use time to measure the two cases above, and you will find that for larger clusters the response times of these two requests differ significantly.

    $ time ./curl-k8s-apiserver.sh <url> > result

For a cluster with 4K nodes and 100K pods, the following data is for reference:

  • Without resourceVersion=0 (read etcd and filter on apiserver): takes 10s
  • With resourceVersion=0 (read apiserver cache): takes 0.05s

That is a 200x difference.

The total size of the full pod list is estimated at 2GB, averaging 20KB per pod.
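The two figures above follow from simple arithmetic, sketched here with integer math so the result is exact. The helper names are hypothetical, and the calculation assumes decimal units (1GB = 1,000,000,000 bytes, 1KB = 1,000 bytes).

```go
package main

import "fmt"

// slowdown returns how many times longer the slow request takes,
// with both durations given in milliseconds.
func slowdown(slowMs, fastMs int64) int64 {
	return slowMs / fastMs
}

// avgObjectBytes returns the average object size in bytes.
func avgObjectBytes(totalBytes, objects int64) int64 {
	return totalBytes / objects
}

func main() {
	// 10s without resourceVersion=0 vs. 0.05s with it.
	fmt.Println(slowdown(10_000, 50)) // 200

	// 2GB of pod data spread over 100K pods.
	fmt.Println(avgObjectBytes(2_000_000_000, 100_000)) // 20000 bytes = 20KB
}
```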

    4 LIST request pressure on the control plane: quantitative analysis

    This section uses cilium-agent as an example to quantitatively measure the pressure that its startup puts on the control plane.

    4.1 Collect LIST requests

    First, find out which k8s resources the agent LISTs when it starts. There are several ways to collect them:

    1. Filter the k8s access log by ServiceAccount, verb, request_uri, etc.;
    2. Check the agent logs;
    3. Analyze the code further, and so on.
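
    The first approach can be sketched as follows, assuming the access log is the apiserver audit log written as JSON lines (one audit.k8s.io/v1 Event per line) and that jq is installed. The function name, the log path and the ServiceAccount name are hypothetical:

```shell
#!/bin/sh
# list_requests_by_user AUDIT_LOG USERNAME
# Prints "<count> <requestURI>" for each LIST request made by USERNAME,
# most frequent first.
# Assumption: AUDIT_LOG holds one audit.k8s.io/v1 Event JSON object per line.
list_requests_by_user() {
    jq -r --arg user "$2" '
        select(.stage == "ResponseComplete"
               and .verb == "list"
               and .user.username == $user)
        | .requestURI' "$1" |
        sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# Hypothetical usage (log path and ServiceAccount name are examples only):
# list_requests_by_user /var/log/kubernetes/audit.log \
#     system:serviceaccount:kube-system:cilium
```

    Only the ResponseComplete stage is counted, so a request that is logged at several stages is not counted more than once.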

    Suppose we collect the following LIST requests:

    1. api/v1/namespaces?resourceVersion=0
    2. api/v1/pods?fieldSelector=spec.nodeName%3Dnode1&resourceVersion=0
    3. api/v1/nodes?fieldSelector=metadata.name%3Dnode1&resourceVersion=0
    4. api/v1/services?labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name
    5. apis/discovery.k8s.io/v1beta1/endpointslices?resourceVersion=0
    6. apis/networking.k8s.io/networkpolicies?resourceVersion=0
    7. apis/cilium.io/v2/ciliumnodes?resourceVersion=0
    8. apis/cilium.io/v2/ciliumnetworkpolicies?resourceVersion=0
    9. apis/cilium.io/v2/ciliumclusterwidenetworkpolicies?resourceVersion=0

    4.2 Test the data volume and time consumption of LIST requests

    With the list of LIST requests in hand, we can execute them manually to get the following data:

    1. The time each request takes
    2. The amount of data each request involves, which falls into two kinds:
      1. The amount of data processed by the apiserver (full data); the performance impact on apiserver/etcd should be evaluated based on this
      2. The amount of data finally received by the agent (after filtering by selector).

    The following script (run it on a real k8s master) performs the test:

    $ cat benchmark-list-overheads.sh
    #!/bin/bash

    apiserver_url="https://localhost:6443"

    ## List k8s core resources (e.g. pods, services)
    ## API: GET/LIST /api/v1/<resource>?<selectors>&resourceVersion=0
    function benchmark_list_core_resource() {
        resource=$1
        selectors=$2

        echo "----------------------------------------------------"
        echo "Benchmarking list $2"
        listed_file="listed-$resource"
        url="$apiserver_url/api/v1/$resource?resourceVersion=0"

        ## first perform a request without selectors, this is the size apiserver really handles
        echo "curl $url"
        time ./curl-k8s-apiserver.sh "$url" > "$listed_file"

        ## perform another request if selectors are provided, this is the size client receives
        listed_file2="$listed_file-filtered"
        if [ -n "$selectors" ]; then
            url="$url&$selectors"
            echo "curl $url"
            time ./curl-k8s-apiserver.sh "$url" > "$listed_file2"
        fi

        ls -ahl "$listed_file" "$listed_file2" 2>/dev/null

        echo "----------------------------------------------------"
        echo ""
    }

    ## List resources served under /apis/<api_group> (e.g. endpointslices, ciliumnodes)
    ## API: GET/LIST /apis/<api_group>/<resource>?<selectors>&resourceVersion=0
    function benchmark_list_apiexternsion_resource() {
        api_group=$1
        resource=$2
        selectors=$3

        echo "----------------------------------------------------"
        echo "Benchmarking list $api_group/$resource"
        ## flatten "group/version" into "group-version" so it can be used in a file name
        api_group_flatten_name=$(echo "$api_group" | sed 's/\//-/g')
        listed_file="listed-$api_group_flatten_name-$resource"
        url="$apiserver_url/apis/$api_group/$resource?resourceVersion=0"
        if [ -n "$selectors" ]; then
            url="$url&$selectors"
        fi

        echo "curl $url"
        time ./curl-k8s-apiserver.sh "$url" > "$listed_file"
        ls -ahl "$listed_file"
        echo "----------------------------------------------------"
        echo ""
    }

    benchmark_list_core_resource "namespaces" ""
    benchmark_list_core_resource "pods"       "fieldSelector=spec.nodeName%3Dnode1"
    benchmark_list_core_resource "nodes"      "fieldSelector=metadata.name%3Dnode1"
    benchmark_list_core_resource "services"   "labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name"

    benchmark_list_apiexternsion_resource "discovery.k8s.io/v1beta1" "endpointslices"                   ""
    benchmark_list_apiexternsion_resource "apiextensions.k8s.io/v1"  "customresourcedefinitions"        ""
    benchmark_list_apiexternsion_resource "networking.k8s.io"        "networkpolicies"                  ""
    benchmark_list_apiexternsion_resource "cilium.io/v2"             "ciliumnodes"                      ""
    benchmark_list_apiexternsion_resource "cilium.io/v2"             "ciliumendpoints"                  ""
    benchmark_list_apiexternsion_resource "cilium.io/v2"             "ciliumnetworkpolicies"            ""
    benchmark_list_apiexternsion_resource "cilium.io/v2"             "ciliumclusterwidenetworkpolicies" ""

    The output looks like this:

    $ benchmark-list-overheads.sh
    ----------------------------------------------------
    Benchmarking list
    curl https://localhost:6443/api/v1/namespaces?resourceVersion=0
    real    0m0.090s
    user    0m0.038s
    sys     0m0.044s
    -rw-r--r-- 1 root root 69K  listed-namespaces
    ----------------------------------------------------
    Benchmarking list fieldSelector=spec.nodeName%3Dnode1
    curl https://localhost:6443/api/v1/pods?resourceVersion=0
    real    0m18.332s
    user    0m1.355s
    sys     0m1.822s
    curl https://localhost:6443/api/v1/pods?resourceVersion=0&fieldSelector=spec.nodeName%3Dnode1
    real    0m0.242s
    user    0m0.044s
    sys     0m0.188s
    -rw-r--r-- 1 root root 2.0G listed-pods
    -rw-r--r-- 1 root root 526K listed-pods-filtered
    ----------------------------------------------------
    ...

    Note: for any LIST with a selector, such as LIST pods?spec.nodeName=node1, this script first executes the request without the selector, in order to measure the amount of data that the apiserver needs to process. Taking the list pods above as an example:

    1. What the agent actually executes is pods?resourceVersion=0&fieldSelector=spec.nodeName%3Dnode1, so the request time should be based on this one;
    2. pods?resourceVersion=0 is executed additionally, to measure how much data the apiserver needs to process in order to serve request 1.

    Note: an operation such as listing all pods produces a 2GB file, so use this benchmark tool with care. First make sure you understand exactly what your script is testing, and in particular do not automate it or run it concurrently, which may overwhelm apiserver/etcd.

    4.3 Test result analysis

    The above output has the following key information:

    1. The resource type being LISTed, such as pods/endpoints/services
    2. The time the LIST operation takes
    3. The amount of data involved in the LIST operation
      1. The amount of data the apiserver needs to process (JSON format): taking the list pods above as an example, this is the listed-pods file, 2GB in total;
      2. The amount of data received by the agent (the agent may have specified label/field selectors): taking the list pods above as an example, this is the listed-pods-filtered file, 526K in total.

    Collect and sort all LIST requests in this way, and you will know how much pressure a single agent startup puts on apiserver/etcd.

    $ ls -ahl listed-*
    -rw-r--r-- 1 root root  222 listed-apiextensions.k8s.io-v1-customeresourcedefinitions
    -rw-r--r-- 1 root root 5.8M listed-apiextensions.k8s.io-v1-customresourcedefinitions
    -rw-r--r-- 1 root root 2.0M listed-cilium.io-v2-ciliumclusterwidenetworkpolicies
    -rw-r--r-- 1 root root 193M listed-cilium.io-v2-ciliumendpoints
    -rw-r--r-- 1 root root  185 listed-cilium.io-v2-ciliumnetworkpolicies
    -rw-r--r-- 1 root root 6.6M listed-cilium.io-v2-ciliumnodes
    -rw-r--r-- 1 root root  42M listed-discovery.k8s.io-v1beta1-endpointslices
    -rw-r--r-- 1 root root  69K listed-namespaces
    -rw-r--r-- 1 root root  222 listed-networking.k8s.io-networkpolicies
    -rw-r--r-- 1 root root  70M listed-nodes             ## Only used to evaluate the amount of data that apiserver needs to process
    -rw-r--r-- 1 root root  25K listed-nodes-filtered
    -rw-r--r-- 1 root root 2.0G listed-pods              ## Only used to evaluate the amount of data that apiserver needs to process
    -rw-r--r-- 1 root root 526K listed-pods-filtered
    -rw-r--r-- 1 root root  23M listed-services          ## Only used to evaluate the amount of data that apiserver needs to process
    -rw-r--r-- 1 root root  23M listed-services-filtered

    Again taking cilium as an example, the ranking is roughly as follows (amount of data processed by the apiserver, JSON format):

    LIST resource type        Amount of data processed by apiserver (JSON)    Time taken
    CiliumEndpoints (full)    193MB                                           11s
    CiliumNodes (full)        70MB                                            0.5s
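
    Such a ranking can be derived from the listed-* files that the benchmark script writes. A minimal sketch; rank_listed_files is a hypothetical helper name, and the *-filtered files are skipped because they reflect what the client receives rather than what the apiserver processes:

```shell
#!/bin/sh
# rank_listed_files [DIR]
# Prints "<bytes> <file>" for every unfiltered listed-* file in DIR
# (default: current directory), largest first.
rank_listed_files() (
    dir=${1:-.}
    for f in "$dir"/listed-*; do
        [ -f "$f" ] || continue
        # skip the per-client results, keep only the full LIST results
        case $f in *-filtered) continue ;; esac
        printf '%s %s\n' "$(wc -c < "$f" | tr -d ' ')" "${f##*/}"
    done | sort -rn
)

rank_listed_files .
```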

    5 Large-scale basic services: deployment and tuning recommendations

    5.1 Set ResourceVersion=0 by default for LIST requests

    As introduced earlier, not setting this parameter causes the apiserver to pull the full data from etcd and then filter it. So unless the accuracy requirement is extremely high and the data must come from etcd, set ResourceVersion=0 on LIST requests so that the apiserver serves them from its cache.

    If you are using the client-go ListWatch/informer interface, it already has ResourceVersion=0 set by default.
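
    Hand-written requests (curl, scripts) do not get this default, so the parameter has to be added explicitly. A minimal sketch; ensure_rv0 is a hypothetical helper name, not part of any Kubernetes tooling:

```shell
#!/bin/sh
# ensure_rv0 URL
# Prints URL with resourceVersion=0 appended, unless the URL already carries
# a resourceVersion parameter (an explicit value chosen by the caller is kept).
ensure_rv0() {
    case $1 in
        *"?resourceVersion="*|*"&resourceVersion="*) printf '%s\n' "$1" ;;
        *"?"*) printf '%s&resourceVersion=0\n' "$1" ;;
        *)     printf '%s?resourceVersion=0\n' "$1" ;;
    esac
}

ensure_rv0 "https://localhost:6443/api/v1/pods?fieldSelector=spec.nodeName%3Dnode1"
```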

    5.2 Prefer namespaced API

    If you only need to list resources in one or a few namespaces, consider using the namespaced API:

    • Namespaced API: /api/v1/namespaces/<ns>/pods?query=xxx
    • Un-namespaced API: /api/v1/pods?query=xxx
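
    The choice between the two forms can be made in one place. A minimal sketch for core-group resources only; core_list_path is a hypothetical helper name:

```shell
#!/bin/sh
# core_list_path RESOURCE [NAMESPACE]
# Prints the LIST path for a core-group resource: the namespaced form when a
# namespace is given, the cluster-wide form otherwise.
core_list_path() {
    if [ -n "${2:-}" ]; then
        printf '/api/v1/namespaces/%s/%s\n' "$2" "$1"
    else
        printf '/api/v1/%s\n' "$1"
    fi
}

core_list_path pods default   # namespaced
core_list_path pods           # cluster-wide
```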

    5.3 Restart backoff

    For basic services deployed per node, such as kubelet, cilium-agent, and daemonsets, an effective restart backoff is required to reduce the pressure on the control plane during large-scale restarts.

    For example, after all agents crash at the same time, the number of agents restarted per minute should not exceed 10% of the cluster size (configurable or automatically calculated).
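
    One way to approximate such a budget without any coordination is to give each node a start delay derived from a hash of its name, spread over a window long enough for the whole cluster. This is only a sketch under that assumption, not how kubelet or cilium-agent implement backoff; the function names are hypothetical, and the per-minute rate holds on average rather than as a hard limit:

```shell
#!/bin/sh
# restart_window_seconds CLUSTER_SIZE MAX_PERCENT_PER_MINUTE
# Prints the length of the window (in seconds) needed to restart the whole
# cluster without exceeding the per-minute budget.
restart_window_seconds() (
    size=$1
    pct=$2
    budget=$(( size * pct / 100 ))               # agents allowed per minute
    [ "$budget" -ge 1 ] || budget=1
    minutes=$(( (size + budget - 1) / budget ))  # ceil(size / budget)
    echo $(( minutes * 60 ))
)

# restart_delay_seconds NODE_NAME CLUSTER_SIZE MAX_PERCENT_PER_MINUTE
# Prints a deterministic per-node delay in [0, window), based on a CRC of the
# node name, so that nodes spread roughly evenly over the window.
restart_delay_seconds() (
    window=$(restart_window_seconds "$2" "$3")
    hash=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
    echo $(( hash % window ))
)

# Example: a 4000-node cluster with a 10% per-minute budget
echo "window: $(restart_window_seconds 4000 10)s"
echo "delay for node1: $(restart_delay_seconds node1 4000 10)s"
```

    An agent's startup wrapper could sleep for this delay before issuing its first LIST requests.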

    5.4 Prefer server-side filtering through label/field selectors

    If you need to cache certain resources and watch for changes, use the ListWatch mechanism: pull the data to the local side and let the business logic filter from the local cache as needed. This is the ListWatch/informer mechanism of client-go.

    But if it is just a one-time LIST with filter conditions, such as the earlier example of filtering pods by nodeName, then we should obviously set a label or field selector and let the apiserver filter the data for us. LISTing 100K pods takes tens of seconds (most of the time is spent on data transmission, which also consumes a lot of CPU/bandwidth/IO on the apiserver), whereas if you only need the pods on the local node, a LIST with nodeName=node1 may return in as little as 0.05s. It is also very important not to forget resourceVersion=0 in the request.
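
    Selectors must be percent-encoded when placed in the query string, which is where the %3D and %21 in the earlier URLs come from. A minimal sketch, assuming ASCII-only selectors; urlencode and pods_on_node_path are hypothetical helper names:

```shell
#!/bin/sh
# urlencode STRING
# Percent-encodes every character except the unreserved ones (A-Z a-z 0-9 . ~ _ -).
# Assumption: STRING is plain ASCII, as selectors normally are.
urlencode() (
    LC_ALL=C
    s=$1
    out=
    while [ -n "$s" ]; do
        rest=${s#?}          # everything after the first character
        c=${s%"$rest"}       # the first character
        case $c in
            [A-Za-z0-9.~_-]) out=$out$c ;;
            *) out=$out$(printf '%%%02X' "'$c") ;;
        esac
        s=$rest
    done
    printf '%s\n' "$out"
)

# pods_on_node_path NODE
# Prints the LIST path for the pods of one node, filtered on the server side
# and served from the apiserver cache.
pods_on_node_path() {
    printf '/api/v1/pods?fieldSelector=%s&resourceVersion=0\n' \
        "$(urlencode "spec.nodeName=$1")"
}

pods_on_node_path node1
```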

    5.4.1 Label selector

    Filtering is done in apiserver memory.

    5.4.2 Field selector

    Filtering is done in apiserver memory.

    5.4.3 Namespace selector

    In etcd, the namespace is part of the key prefix, so filtering resources by namespace is much faster than filtering by a selector that is not a prefix.
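
    A prefix read in etcd is a range read from the prefix itself up to the prefix with its last byte incremented, which is why the etcd log in section 5.5.3 shows key "/registry/pods/" with range_end "/registry/pods0". The sketch below computes that bound; it assumes the default key layout /registry/pods/<namespace>/<name>, and prefix_range_end is a hypothetical helper name:

```shell
#!/bin/sh
# prefix_range_end PREFIX
# Prints the exclusive upper bound of the key range that covers every key
# starting with PREFIX: PREFIX with its last byte incremented by one.
# Simplification: assumes ASCII and a last byte below 0x7f.
prefix_range_end() (
    prefix=$1
    head=${prefix%?}             # prefix without its last character
    last=${prefix#"$head"}       # the last character
    code=$(printf '%d' "'$last")
    next=$(printf "\\$(printf '%03o' $(( code + 1 )))")
    printf '%s%s\n' "$head" "$next"
)

# All pods vs. only the pods in the "default" namespace
prefix_range_end "/registry/pods/"
prefix_range_end "/registry/pods/default/"
```

    The second range covers only the keys of one namespace, so etcd reads far fewer entries than for a cluster-wide LIST.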

    5.5 Supporting infrastructure (monitoring, alerting, etc.)

    As the analysis above shows, a single request from a client may return only a few hundred kilobytes of data, while the apiserver (or worse, etcd) has to process gigabytes. Therefore we should try our best to avoid large-scale restarts of basic services, and to that end make monitoring and alerting as complete as possible.

    5.5.1 Use independent ServiceAccounts

    Each basic service (such as kubelet, cilium-agent, etc.), as well as every operator that issues a large number of LIST operations against the apiserver, should use its own ServiceAccount, so that the apiserver can distinguish the source of requests. This is very useful for monitoring, troubleshooting, and server-side throttling.

    5.5.2 Liveness monitoring and alerting

    Basic services must be covered by liveness monitoring.

    There must be P1-level liveness alerts, so that large-scale crashes are detected immediately. The pressure on the control plane is then reduced by restart backoff.

    5.5.3 Monitoring and tuning etcd

    Key performance-related metrics need to be monitored and alerted on:

    1. Memory
    2. Bandwidth
    3. Number of large LIST requests and their response time

    For example, etcd logs a warning like the following for a slow LIST of all pods:

    {
        "level":"warn",
        "msg":"apply request took too long",
        "took":"5357.87304ms",
        "expected-duration":"100ms",
        "prefix":"read-only range ",
        "request":"key:\"/registry/pods/\" range_end:\"/registry/pods0\" ",
        "response":"range_response_count:60077 size:602251227"
    }

    Deployment and configuration tuning:

    1. Split K8s events into a separate etcd cluster
    2. Other.
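
    Slow range warnings of the kind shown in this section can feed an alert on large LIST requests. A minimal sketch, assuming each etcd log entry is a single JSON line with its fields in the order shown; slow_range_summary is a hypothetical helper name, and lines that do not fit the expected shape are passed through unchanged:

```shell
#!/bin/sh
# slow_range_summary
# Reads etcd JSON log lines on stdin and prints
# "<took> <response bytes> <range key>" for each slow read-only range request.
slow_range_summary() {
    grep -F '"msg":"apply request took too long"' |
        grep -F '"prefix":"read-only range ' |
        sed -E 's/.*"took":"([^"]+)".*"request":"key:\\"([^\\]+)\\".*size:([0-9]+)".*/\1 \3 \2/'
}

# Hypothetical usage (the log path is an example only):
# slow_range_summary < /var/log/etcd/etcd.log | sort -k2,2nr | head
```

    Sorting by the second column puts the requests with the largest responses first.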

    6 Other

    6.1 Get requests: GetOptions{}

    The rationale is the same as for ListOptions{}: not setting ResourceVersion=0 causes the apiserver to fetch the data from etcd, which should be avoided as much as possible.

    References

    1. Kubernetes API Concepts[4]
    2. Raft consensus algorithm (and etcd/raft source code analysis) (USENIX, 2014)[5]

    Cited links

    [1] Example: https://arthurchiao.art/blog/k8s-reliability-list-data-zh/#client_code_empty_rv
    [2] Form: https://kubernetes.io/docs/reference/using-api/api-concepts/#the-resourceversion-parameter
    [3] Example: https://arthurchiao.art/blog/k8s-reliability-list-data-zh/#client_code_empty_rv
    [4] Kubernetes API Concepts: https://kubernetes.io/docs/reference/using-api/api-concepts/
    [5] [Paper] Raft consensus algorithm (and etcd/raft source code analysis) (USENIX, 2014): https://arthurchiao.art/blog/raft-paper-zh/
