

After trying federation and Remote Write, we finally chose Thanos as the companion component for our monitoring and use its global view to manage monitoring data from 300+ clusters across multiple regions. This article mainly covers how some of the Thanos components are used and what we learned along the way.

Prometheus officially offers several high availability solutions:

  • Basic HA: two Prometheus instances collect exactly the same data, with an external load balancer in front
  • HA + Remote Storage: in addition to the basic multi-replica Prometheus, data is also written to remote storage through Remote Write, which solves the problem of storage persistence
  • Federation: shards are partitioned by function, different shards collect different data, and global nodes store it uniformly, which solves the problem of monitoring data scale

    Even with the officially recommended multi-replica + federation setup you will still run into problems. The essential reason is that Prometheus' local storage has no data synchronization capability, so it is difficult to keep data consistent while guaranteeing availability, and a basic multi-replica proxy cannot meet the requirements. For example:


    • The Prometheus cluster backend has two instances, A and B, with no data synchronization between them. If A is down for a while and loses some data, then when the load balancer round-robins and a request hits A, the returned data will be abnormal.

    • If A and B have different start times and different clocks, the same data is collected with different timestamps, so the replicas' data differs. Even with remote storage, A and B cannot push to the same TSDB; if each pushes to its own TSDB, which side a query should go to becomes the problem.

    • The official recommendation is to shard the data and then achieve high availability through federation, but the edge nodes and the global nodes are still single points, and you have to decide whether each layer should use two nodes collecting the same data to stay alive. In other words, there is still a single-machine bottleneck.

    • In addition, sensitive alerts should avoid being triggered from the global node as far as possible. After all, the stability of the link from the shard nodes to the global node affects how quickly data arrives, which reduces the effectiveness of alerting.

    At present, most Prometheus clustering solutions ensure data consistency from two perspectives, storage and query:

    • Storage perspective: if you use Remote Write to remote storage, an adapter can be placed behind both A and B. The adapters perform leader election so that only one copy of the data is pushed to the TSDB. If one instance fails the other can still push successfully, so no data is lost, and there is only one remote store holding shared data. See the referenced article for this approach.

    • Storage perspective: Remote Write is still used for remote storage, but A and B write to two separate time series databases, TSDB1 and TSDB2, and a sync process keeps TSDB1 and TSDB2 synchronized so that each holds the full data.

    • Query perspective: the solutions above have to be implemented yourself, which is invasive and risky, so most open source solutions work at the query level, such as Thanos or VictoriaMetrics. There are still two copies of the data, but deduplication and joining are done at query time. The difference is that Thanos puts data into object storage through the sidecar, while VictoriaMetrics remote-writes data to its own server instances. The query-layer logic of Thanos Query and of Promxy on the VictoriaMetrics side is basically the same: both serve the need for a global view.
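To make the query-level approach concrete, here is a minimal sketch of joining two replicas' samples at read time. It is not the actual algorithm used by Thanos or Promxy; the helper name `merge_replicas` and the assumption that both replicas scrape at aligned timestamps are simplifications made for illustration.

```python
from typing import Dict, List, Tuple

Sample = Tuple[int, float]  # (unix timestamp in seconds, value)


def merge_replicas(primary: List[Sample], secondary: List[Sample]) -> List[Sample]:
    """Join two replicas' samples, keeping one value per timestamp.

    Hypothetical helper: when both replicas have a sample at the same
    timestamp the primary's value wins; gaps in one replica are filled
    from the other.
    """
    merged: Dict[int, float] = dict(secondary)
    merged.update(primary)  # primary overrides on shared timestamps
    return sorted(merged.items())


# Replica A was down around t=120; replica B started late and missed t=0.
replica_a = [(0, 1.0), (60, 2.0), (180, 4.0)]
replica_b = [(60, 2.0), (120, 3.0), (180, 4.0)]
merged = merge_replicas(replica_a, replica_b)
print(merged)
```

Neither replica is modified or synchronized; the gap-free view exists only in the query result, which is what keeps this approach non-invasive.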

    As our clusters grow, the types and volume of monitoring data grow too: Master/Node machine monitoring, process monitoring, performance monitoring of the 4 core components, POD resource monitoring, kube-state-metrics, K8S events monitoring, plugin monitoring, and so on. Besides solving the high availability problems above, we also want to build a global view on top of Prometheus. The main requirements are:

    • Long-term storage: data is kept for about 1 month and may grow by dozens of gigabytes every day. We want the storage maintenance cost to be low, with disaster recovery and migration available. InfluxDB was considered, but it has no ready-made clustering solution and needs manual maintenance. Ideally the data lives in a cloud TSDB, object storage, or file storage.

    • Unlimited expansion: we have 300+ clusters, thousands of nodes, and tens of thousands of services, which a single Prometheus instance cannot handle. For isolation it is best to shard by function, for example separating master component performance monitoring from business monitoring such as POD resources, and separating host monitoring from log monitoring. Alternatively, separate by tenant and service type (real-time service, offline service).

    • Global view: after separating by type the data is scattered, but the monitoring view needs to be unified. A single Grafana with n panels should show the monitoring data of all regions + clusters + pods, which is more convenient to operate and avoids switching between multiple Grafana instances or multiple datasources within Grafana.

    • Non-intrusive: do not modify the existing Prometheus too much. Prometheus is an open source project and its versions iterate rapidly. We started on 1.x, but less than a year passed between the 1.x and 2.x releases, the 2.x storage structure queries significantly faster, and 1.x has fallen out of use. We therefore need to follow the community and move to new versions promptly, so the code of Prometheus itself cannot be modified. It is better to wrap it and stay transparent to the end user.

    After studying a large number of open source solutions and commercial products (Cortex/Thanos/Victoria/StackDriver), we chose Thanos. To be precise, Thanos is just a monitoring suite that, combined with native Prometheus, meets the needs of long-term storage + unlimited expansion + global view + non-intrusiveness.



    The default mode of Thanos: sidecar mode

    In addition to this sidecar approach, Thanos has a less commonly used Receive mode, which is covered later.
    Thanos is a set of components. The official website lists:

    • Bucket
    • Check
    • Compactor
    • Query
    • Rule
    • Sidecar
    • Store

    Beyond those listed officially there are actually others, such as the component behind the Receive mode mentioned above.
    There seem to be many components, but only one binary is deployed, which is very convenient. Different arguments select different functions: the query component is ./thanos query and the sidecar component is ./thanos sidecar. Everything is all in one, there is a single codebase, and the binary is small.
    In fact, the core sidecar + query pair is already enough to run; the other components only add more functionality.

    Download the latest version of Thanos from the release page. Thanos is still fixing bugs and iterating on features, so do not use an old version if a newer one is available.


    Components and configuration

    The following shows how to combine the Thanos components to quickly make your Prometheus highly available. Because this is a quick introduction, parts of it overlap with the official quick start, but recommended configurations are given as well. This article reflects the version as of January 2020; there is no telling how Thanos will evolve in the future.

    Step 1: Confirm that you already have Prometheus

    Thanos is non-intrusive and only a suite layered on top, so you still need to deploy Prometheus yourself. We won't go into detail here and assume you already have a single Prometheus instance running. It can be a pod or a host deployment, depending on your environment; ours runs outside the k8s cluster, so it is a host deployment. This Prometheus collects monitoring data for Region A. Your Prometheus configuration can look like this.

    Startup configuration:

        ./prometheus \
          --config.file=prometheus.yml \
          --log.level=info \
          --storage.tsdb.path=data/prometheus \
          --web.listen-address='' \
          --storage.tsdb.max-block-duration=2h \
          --storage.tsdb.min-block-duration=2h \
          --storage.tsdb.wal-compression \
          --storage.tsdb.retention.time=2h \
          --web.enable-lifecycle

    web.enable-lifecycle must be enabled so that the configuration can be hot-reloaded. Retention is set to 2 hours: Prometheus produces a block every 2 hours by default, and Thanos uploads each block to object storage.
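As an illustration of the hot reload that --web.enable-lifecycle makes possible, the sketch below builds the POST request for Prometheus' /-/reload endpoint. The base URL http://localhost:8090 is an assumption borrowed from the sidecar example in this article, and the function names are hypothetical.

```python
from urllib.request import Request, urlopen


def build_reload_request(base_url: str) -> Request:
    """Build the POST request for Prometheus' /-/reload lifecycle endpoint."""
    return Request(base_url.rstrip("/") + "/-/reload", data=b"", method="POST")


def reload_prometheus(base_url: str, timeout: float = 5.0) -> int:
    """Send the reload request and return the HTTP status code."""
    with urlopen(build_reload_request(base_url), timeout=timeout) as response:
        return response.status


# Only the request is built here; call reload_prometheus(...) against a live server.
request = build_reload_request("http://localhost:8090/")
print(request.get_method(), request.full_url)
```

Without --web.enable-lifecycle the endpoint is disabled and the request is rejected, which is why the flag is required for this workflow.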

    Scrape configuration: prometheus.yml

        global:
          scrape_interval: 60s
          evaluation_interval: 60s
          external_labels:
            region: 'A'
            replica: 0

        rule_files:

        scrape_configs:
          - job_name: 'prometheus'
            static_configs:
              - targets: ['']
          - job_name: 'demo-scrape'
            metrics_path: '/metrics'
            params:
              ...

    Here you need to declare external_labels and mark your region. If you run multiple replicas, also declare the replica identity, such as 0, 1, 2. Three replicas collect exactly the same data; the other 2 Prometheus instances run at the same time, with only the replica value differing. The configuration here is similar to the official federation scheme.
    Requirements for Prometheus:

    • Version 2.2.1 or above

    • Declare your external_labels

    • Enable --web.enable-admin-api


    Step 2: Deploy the sidecar component

    This is the key step, and the core is the sidecar component. It follows the sidecar pattern from k8s and is deployed in the same pod as the Prometheus server. It does two things:

    1. It implements Thanos' Store API using Prometheus' Remote Read API. This allows the Query component, described later, to treat the Prometheus server as just another source of time series data without interacting with the Prometheus API directly (this is the sidecar's intercepting role).
    2. Optionally, when Prometheus produces a TSDB block every 2 hours, the sidecar uploads the block to an object storage bucket. This lets the Prometheus server run with a short retention time while historical data stays durable and queryable through object storage.

    Of course, this does not make Prometheus completely stateless: if it crashes and restarts, you lose up to 2 hours of metrics. If your Prometheus also runs with multiple replicas, the risk to this 2h of data is reduced.
    Sidecar configuration:

        ./thanos sidecar \
          --prometheus.url="http://localhost:8090" \
          --objstore.config-file=./conf/bos.yaml \
          --tsdb.path=/home/work/opdir/monitor/prometheus/data/prometheus/

    The configuration is simple: just declare prometheus.url and the data path. objstore.config-file is optional. If you want to store data in object storage (which is recommended), configure the account information of the object storage.
    Thanos supports Google Cloud/AWS by default. Taking Google Cloud as an example, the configuration is as follows:

        type: GCS
        config:
          bucket: ""
          service_account: ""

    Because Thanos does not yet support our cloud storage by default, we added the corresponding implementation to the Thanos code and submitted a PR upstream.


    A word of caution: don't forget to pair your other two replicas, Prometheus 1 and Prometheus 2, with a sidecar as well. If they run in pods, add a container and access Prometheus via 127.0.0.1; if it is a host deployment, specify the Prometheus port.

    In addition, the sidecar is stateless and can also run as multiple replicas. Several sidecars can read the same Prometheus data, which keeps the sidecar itself scalable. When running in a pod this is unnecessary, since the sidecar and Prometheus live and die together.

    The sidecar reads the meta.json file in each Prometheus block, extends it with metadata specific to Thanos, and then uploads the block to object storage. After uploading, it writes to thanos.shipper.json.

    Step 3: Deploy the query component

    Once the sidecars are deployed there are 3 replicas of the same data. If you now want to display the data directly, install the query component.
    The Query component (also known as the Querier) implements Prometheus' HTTP v1 API, so data in a Thanos cluster can be queried with PromQL, just like on Prometheus' graph page.
    In short, the sidecars expose the StoreAPI, and Query collects data from multiple StoreAPIs, evaluates the query, and returns the result. Query is completely stateless and can scale horizontally.

        ./thanos query \
          --http-address="" \
          --store=replica0:10901 \
          --store=replica1:10901 \
          --store=replica2:10901 \
          --store=

    The store parameters point at the sidecar components just started. With 3 replicas running you configure the three entries replica0, replica1, and replica2; 10901 is the sidecar's default port.
    http-address is the address of the query component itself. Because it is a web service, the page looks like this once started:

    It is almost the same as Prometheus. With this page you no longer need to go to the individual Prometheus instances; you can query everything here.

    Click Store to see which sidecars are connected.

    The query page has two check boxes:

    • deduplication: whether to deduplicate. It is checked by default, so identical data appears only once; otherwise the identical data from replicas 0, 1, and 2 shows up as 3 results.

    • partial response: whether to allow a partial response. It is allowed by default, which is a trade-off on consistency. For example, if one of replicas 0, 1, and 2 is down or times out, that replica gives no response when queried. Allowing partial responses returns the data of the remaining 2 replicas. The data is then not strictly consistent, but returning nothing at all because of one timeout would sacrifice availability, so partial response is allowed by default.
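The interplay of the two options can be sketched as follows. This is a simplified illustration, not Thanos' implementation: the store clients are plain callables, `StoreUnavailable` and `run_query` are hypothetical names, and deduplication is reduced to dropping the replica label and keeping the first value seen per timestamp.

```python
from typing import Callable, Dict, List, Tuple

Labels = Dict[str, str]
Sample = Tuple[int, float]
Series = Tuple[Labels, List[Sample]]


class StoreUnavailable(Exception):
    """Raised by a (hypothetical) store client that is down or timed out."""


def run_query(
    stores: Dict[str, Callable[[], List[Series]]],
    replica_label: str = "replica",
    deduplicate: bool = True,
    partial_response: bool = True,
):
    merged: Dict[Tuple[Tuple[str, str], ...], Dict[int, float]] = {}
    warnings: List[str] = []
    for name, fetch in stores.items():
        try:
            fetched = fetch()
        except StoreUnavailable as exc:
            if not partial_response:
                raise  # strict mode: one failing store fails the whole query
            warnings.append(f"{name}: {exc}")
            continue
        for labels, samples in fetched:
            if deduplicate:
                # Without the replica label, the replicas' series become identical.
                labels = {k: v for k, v in labels.items() if k != replica_label}
            key = tuple(sorted(labels.items()))
            points = merged.setdefault(key, {})
            for timestamp, value in samples:
                points.setdefault(timestamp, value)  # first replica seen wins
    result = {key: sorted(points.items()) for key, points in merged.items()}
    return result, warnings


def replica(n: str) -> Callable[[], List[Series]]:
    series = ({"__name__": "up", "region": "A", "replica": n}, [(0, 1.0), (60, 1.0)])
    return lambda: [series]


def broken() -> List[Series]:
    raise StoreUnavailable("timeout")


stores = {"replica0": replica("0"), "replica1": replica("1"), "replica2": broken}
deduped, warnings = run_query(stores)
raw, _ = run_query(stores, deduplicate=False)
print(len(deduped), len(raw), warnings)
```

With deduplication on, the two healthy replicas collapse into one series; with it off, each replica's series is returned separately. Passing `partial_response=False` makes the failing store abort the whole query instead of producing a warning.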

    Step 4: Deploy the Store Gateway component

    You may have noticed that in step 3 the ./thanos query command has one --store entry of the form xxx:19914 rather than one of the 3 replicas mentioned so far. This 19914 is the store gateway component discussed next.
    In the sidecar configuration in step 2, if you configure object storage via objstore.config-file, your data is uploaded to the bucket periodically and only 2 hours are kept locally. So what if you want to query data older than 2 hours? That data is no longer managed by Prometheus, so how can it be fetched from the bucket and served through exactly the same query interface?
    Store gateway component: the store gateway mainly interacts with object storage and fetches the persisted data from it. Like the sidecar, the store gateway implements the Store API, through which the Query component can query historical data.
    The configuration is as follows:
        ./thanos store \
          --data-dir=./thanos-store-gateway/tmp/store \
          --objstore.config-file=./thanos-store-gateway/conf/bos.yaml \
          --http-address= \
          --grpc-address= \
          --index-cache-size=250MB \
          --sync-block-duration=5m \
          --min-time=-2w \
          --max-time=-1h

    grpc-address is the address on which the Store API is exposed, that is, the xxx:19914 given to --store in the query configuration.
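One way to picture how the sidecar and the store gateway share the work: each StoreAPI covers a time range, and a query only needs the stores whose range overlaps it. The sketch below illustrates that idea under stated assumptions; `StoreInfo`, `select_stores`, and the example timestamps are hypothetical and not taken from the Thanos code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StoreInfo:
    name: str
    min_time: int  # oldest timestamp served (seconds)
    max_time: int  # newest timestamp served (seconds)


def select_stores(stores: List[StoreInfo], start: int, end: int) -> List[str]:
    """Return the names of stores whose time range overlaps [start, end]."""
    return [s.name for s in stores if s.min_time <= end and s.max_time >= start]


HOUR = 3600
now = 1_000 * HOUR  # arbitrary "current time" for the example
stores = [
    StoreInfo("sidecar", now - 2 * HOUR, now),  # last 2h, still in Prometheus
    StoreInfo("store-gateway", now - 14 * 24 * HOUR, now - 1 * HOUR),  # -2w .. -1h
]
recent = select_stores(stores, now - 30 * 60, now)
last_day = select_stores(stores, now - 24 * HOUR, now)
older = select_stores(stores, now - 48 * HOUR, now - 24 * HOUR)
print(recent, last_day, older)
```

A query over the last 30 minutes only touches the sidecar, a query over the last day needs both, and a query entirely in the past is answered from object storage alone.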

    Because the store gateway has to pull large amounts of historical data over the network and load it into memory, it consumes a lot of CPU and memory. This component was questioned when Thanos first came out, but its current performance is acceptable; some problems we ran into are described later.
    The store gateway can also be scaled out without limit, with each instance pulling data from the same bucket.
    Here is a schematic diagram: one Thanos instance with store components from multiple regions attached.

    Data statistics of one of the regions:

    Querying one month of historical data is reasonably fast. More importantly, data persistence brings no O&M pressure, capacity can be expanded at will, and the cost is low.

    At this point the basic use of Thanos is covered. Compaction (compact) and bucket verification are not core functions, compact is also particularly resource hungry, and we have not used the rule component, so they are not introduced here.

    Step 5: View the data

    With multi-region and multi-replica data available, you can combine it with Grafana for a global view, for example:

    View the performance metrics of ETCD by region and cluster

    View core component monitoring by region, cluster, and machine, such as various performance metrics on multi-replica master machines

    Once the data is aggregated, all views can be displayed in one place, such as these panels:

    • Machine monitoring: node-exporter, process-exporter

    • POD resource usage: cAdvisor

    • Docker, kube-proxy, kubelet monitoring

    • scheduler, controller-manager, etcd, apiserver monitoring

    • kube-state-metrics meta information

    • K8S Events

    • Log monitoring such as mtail

    Receive mode

    All the components mentioned earlier are configured based on sidecar mode, but Thanos also has a Receive mode, which is less commonly used and only appears in the Proposals.

    Due to some network restrictions we tried the Receive solution earlier. Here are the scenarios where Receive is useful:

    1. Sidecar mode has a drawback: data from the last 2 hours still has to be fetched through sidecar -> Prometheus, so it still depends on Prometheus and the data is not stored entirely externally. If your network only allows querying specific storage and cannot reach Prometheus inside the cluster, those 2 hours of data are lost. Receive mode uses remote write, so it has no such 2-hour block problem.
    2. In a multi-tenant environment or at a cloud vendor, the object storage (historical data) query components usually sit on the control plane, which makes permission checks and API encapsulation convenient, while the sidecar and Prometheus sit in the cluster, on the user side. The network between the control plane and the user side is sometimes restricted and unreachable, which prevents you from using the sidecar.
    3. Tenant and control plane isolation, similar to point 2: you want the data stored entirely on the control plane. I have always felt that Receive exists to serve cloud vendors.

    However, Receive is not the default solution after all. Unless you specifically need it, it is better to use the default sidecar.

    Some problems encountered

    Prometheus compaction

    Compaction: as the official documentation mentions, when using the sidecar you need to set Prometheus' --storage.tsdb.min-block-duration and --storage.tsdb.max-block-duration to 2h. Making the two parameters equal ensures that Prometheus' local compaction is turned off. These flags do not show up in the help output; the Prometheus author has explained that they are intended only for development and testing and that users are not advised to change them. Thanos nevertheless requires compaction to be turned off, because by default Prometheus compacts blocks with periods of 2, 25, and 25*5. If it is left on, Thanos may be about to upload a block just as that block gets compacted, and the upload fails.

    But you don't have to worry: when the sidecar starts it checks these two parameters, and if they are not suitable the sidecar fails to start.
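The kind of check described above can be sketched as a small validation function. This is an illustration only, not the sidecar's actual code: `compaction_problem` is a hypothetical helper, and it assumes the flag values arrive as a name-to-string mapping (for example as reported by Prometheus' flags status API) in the same textual form used on the command line.

```python
from typing import Dict, Optional

MIN_FLAG = "storage.tsdb.min-block-duration"
MAX_FLAG = "storage.tsdb.max-block-duration"


def compaction_problem(flags: Dict[str, str], expected: str = "2h") -> Optional[str]:
    """Return a description of the misconfiguration, or None if the flags look fine."""
    min_block = flags.get(MIN_FLAG)
    max_block = flags.get(MAX_FLAG)
    if min_block is None or max_block is None:
        return "block duration flags are not set explicitly"
    if min_block != max_block:
        # Different values mean Prometheus will merge blocks locally.
        return f"local compaction is enabled: min={min_block}, max={max_block}"
    if min_block != expected:
        return f"block duration is {min_block}, expected {expected}"
    return None


good = compaction_problem({MIN_FLAG: "2h", MAX_FLAG: "2h"})
bad = compaction_problem({MIN_FLAG: "2h", MAX_FLAG: "36h"})
print(good, "|", bad)
```

Running such a check before starting the sidecar gives a clearer error than discovering failed uploads later.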

    store-gateway

    store-gateway: the store component consumes the most resources. After all, it has to pull remote data and load it locally for querying. If you want to control how much history is served and the cache settings, you can adjust the corresponding flags, such as:

        --index-cache-size=250MB \
        --sync-block-duration=5m \
        --min-time=-2w \
        --max-time=-1h

    Here --min-time caps the maximum query range.

    store-gateway supports index caching by default to speed up TSDB block lookups, but startup could sometimes take up a lot of memory. This has been fixed after version 0.11.0; see the corresponding issue.

    Prometheus 2.0 optimized the storage layer, for example by placing data as contiguously as possible by time and metric name. The store gateway knows the layout of the storage files, so it can translate a request for metric data into a minimal set of object storage requests. A big query may fetch hundreds or thousands of chunks at a time.

    In the store's local storage only index data is cached. Chunk data could be cached too, but it is orders of magnitude larger. At present, fetching chunk data from object storage adds only a small delay, so there is little incentive to cache it; after all, that would take a lot of resources.

    Data in the store-gateway:

    Each folder actually holds an index file, index.cache.json

    compactor component

    As Prometheus data keeps growing, queries inevitably get slower. Thanos provides the compactor component to deal with this. It has two functions:

    • One is compaction, that is, continually merging old data.

    • The other is downsampling: over fixed time windows it computes and stores aggregate values such as the maximum and minimum. Depending on the query interval, the sampled data is returned instead of the raw points. When querying a particularly long time range you mainly look at the trend, so lower precision is acceptable.

    • Note that the Compactor does not reduce disk usage; it increases it (by storing the additional higher-level aggregates).
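The downsampling idea can be illustrated with a short sketch: raw samples are grouped into fixed windows, and per-window aggregates are kept alongside the raw data. This is a simplified illustration, not the compactor's actual on-disk format; `Aggregate`, `downsample`, and the 300-second window are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Aggregate:
    window_start: int  # start of the window (seconds)
    count: int
    total: float
    minimum: float
    maximum: float


def downsample(samples: List[Tuple[int, float]], resolution_s: int) -> List[Aggregate]:
    """Group (timestamp, value) samples into fixed windows and aggregate each window."""
    windows: Dict[int, List[float]] = {}
    for timestamp, value in samples:
        start = timestamp - timestamp % resolution_s  # align to window boundary
        windows.setdefault(start, []).append(value)
    return [
        Aggregate(start, len(values), sum(values), min(values), max(values))
        for start, values in sorted(windows.items())
    ]


raw = [(0, 1.0), (60, 3.0), (299, 2.0), (300, 10.0)]
aggregates = downsample(raw, resolution_s=300)
for aggregate in aggregates:
    print(aggregate)
```

Because the aggregates are stored in addition to the raw samples, total storage grows, which matches the note above that the compactor increases disk usage.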

    Optimizing the Thanos components is not a panacea, because business data keeps growing, so splitting by business may be worth considering. That requires some partitioning of the business, with the monitoring data of different businesses placed in different buckets (this needs modification or more sidecars). For example, with 5 buckets you prepare 5 store gateways to proxy queries, which reduces the problem of a single store holding too much data.

    The second approach is time slicing. As mentioned above, the store gateway can choose what time span of data it serves. Its configuration supports two kinds of expression: one based on relative time, for example 3d ago to 5d ago, and one based on absolute time, such as 3/1/19 to 5/1/19. For example, if you want to query 3 months of data and one store proxies one month of data, you need to deploy 3 stores working together.
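The time slicing above can be planned with a small helper that splits a retention period evenly across store gateways and prints relative --min-time/--max-time values in the same style as the -2w and -1h used earlier. The helper name, the even split, and the -1h cutoff for the newest window are assumptions for illustration.

```python
from typing import List, Tuple


def partition_store_windows(
    total_days: int, stores: int, newest: str = "-1h"
) -> List[Tuple[str, str]]:
    """Split the last `total_days` days into equal (min_time, max_time) windows."""
    if stores <= 0 or total_days <= 0:
        raise ValueError("total_days and stores must be positive")
    if total_days % stores != 0:
        raise ValueError("total_days must divide evenly across the stores")
    span = total_days // stores
    windows = []
    for index in range(stores):
        oldest_edge = total_days - index * span
        newest_edge = oldest_edge - span
        # The most recent window ends at the configured cutoff instead of "-0d".
        max_time = newest if newest_edge == 0 else f"-{newest_edge}d"
        windows.append((f"-{oldest_edge}d", max_time))
    return windows


windows = partition_store_windows(total_days=90, stores=3)
for min_time, max_time in windows:
    print(f"./thanos store --min-time={min_time} --max-time={max_time} ...")
```

Each printed line corresponds to one store gateway instance; the remaining flags (data directory, object storage config, addresses) are elided here and would be the same as in the store configuration shown earlier.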

    Query deduplication

    When the Query component starts, by default it deduplicates data according to the query.replica-label field; you can also toggle deduplication with the check box on the page. Query results are displayed according to the label you chose for query.replica-label. But if replicas 0, 1, and 2 all return data and the values differ, which one will Query choose?

    Thanos picks the more stable replica's data based on a scoring mechanism. The specific logic is:









