Background
After trying federation and Remote Write, we finally chose Thanos as the companion component for our monitoring stack, using its global view to manage monitoring data for 300+ clusters across multiple regions. This article mainly introduces how we use some of the Thanos components and the lessons we learned.
- HA: using the officially recommended multi-replica + federation still runs into problems. The root cause is that Prometheus' local storage has no data synchronization capability, so it is hard to keep data consistent while guaranteeing availability, and a basic multi-replica proxy cannot meet the requirement. For example:
  - The Prometheus cluster backend has two instances, A and B, with no data synchronization between them. If A is down for a while and loses some data, and the load balancer keeps polling normally, requests that hit A will return abnormal data.
  - If A and B have different start times and different clocks, the same samples are collected with different timestamps, so the data of the replicas differs. Even with remote storage, A and B cannot push to the same TSDB; and if each pushes to its own TSDB, which side a query should read from becomes a problem.
- The official recommendation is to shard the data and then achieve high availability through federation, but the edge nodes and the global node are still single points, and at each layer you need to decide whether to run two nodes scraping redundantly for liveness. In other words, single-machine bottlenecks remain.
- In addition, sensitive alerts should not be triggered by the global node if possible, because the stability of the transmission link from the shard nodes to the global node affects how quickly data arrives, which in turn reduces the effectiveness of alerting.
- Storage perspective: keep using Remote Write for remote storage, but have A and B write to two time-series databases, TSDB1 and TSDB2, respectively, and synchronize data between TSDB1 and TSDB2 so that each side holds the full data set.
- Query perspective: the solution above has to be implemented yourself, which is invasive and risky, so most open-source solutions work at the query level, such as Thanos or VictoriaMetrics. There are still two copies of the data, but deduplication and joining are done at query time. The difference is that Thanos ships the data to object storage through a sidecar, while VictoriaMetrics remote-writes the data to its own server instances; the logic of the query layer, Thanos Query and VictoriaMetrics' Promxy, is basically the same, and both serve the practical need for a global view.
Coming back to the focus of this article, our requirements were:
- Unlimited expansion: we have 300+ clusters, thousands of nodes and tens of thousands of services; a single Prometheus cannot cope. For isolation it is also best to shard by function, for example separating master component performance monitoring from business monitoring such as pod resources, and separating host monitoring from log monitoring as well; or shard by tenant and by service type (real-time services, offline services).
- Global view: after splitting by type the data is scattered, but the monitoring views need to be unified: n panels in a single Grafana should show the monitoring data of all regions + clusters + pods. That is much easier to operate than switching between multiple Grafana instances or multiple datasources inside Grafana.
- Non-intrusive: do not modify the existing Prometheus too much. Prometheus is an open-source project whose versions iterate rapidly; we started with 1.x, and less than a year later 2.x arrived with a storage engine that queries significantly faster, so we need to follow the community and adopt new versions promptly. Therefore the Prometheus code itself should not be modified; it is better to wrap around it and stay transparent to the users above.
- Long-term storage: keep data for about one month, which may add tens of gigabytes every day, with a maintenance cost as low as possible and support for disaster recovery and migration. InfluxDB was considered, but it has no ready-made clustering solution and would require manual maintenance; cloud-hosted TSDB, object storage, or file storage would be best.
Thanos architecture
The default mode of Thanos: sidecar mode
The components are:
- Sidecar
- Query
- Store
- Compactor
- Rule
- Check

Besides the components mentioned officially above, Thanos also has a Receive mode, which is covered near the end of this article. The latest Thanos version can be downloaded from the release page.
Components and configuration
Step 1: Confirm that you already have Prometheus
./prometheus \
--config.file=prometheus.yml \
--log.level=info \
--storage.tsdb.path=data/prometheus \
--web.listen-address='0.0.0.0:9090' \
--storage.tsdb.max-block-duration=2h \
--storage.tsdb.min-block-duration=2h \
--storage.tsdb.wal-compression \
--storage.tsdb.retention.time=2h \
--web.enable-lifecycle
--web.enable-lifecycle must be enabled so the configuration can be hot-reloaded. Retention is set to 2 hours: Prometheus generates a block every 2 hours by default, and Thanos uploads each block to object storage.
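For example, with --web.enable-lifecycle turned on, a configuration change can be applied without restarting the process; a minimal sketch against the listen address used above:

# Hot-reload prometheus.yml after editing it (only works because --web.enable-lifecycle is set)
curl -X POST http://127.0.0.1:9090/-/reload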
Scrape configuration: prometheus.yml
global:
  scrape_interval: 60s
  evaluation_interval: 60s
  external_labels:
    region: 'A'
    replica: 0
rule_files:
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['0.0.0.0:9090']
  - job_name: 'demo-scrape'
    metrics_path: '/metrics'
    params:
    ...
Requirements for Prometheus:
- Version 2.2.1 or above
- Declare your external_labels (see the sketch after this list)
- Enable --web.enable-admin-api
- Enable --web.enable-lifecycle
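For the three-replica setup used later, a reasonable convention (an assumption here, not taken from the original configs) is that every copy declares the same region but a different replica value in external_labels, for example for replica 1:

global:
  external_labels:
    region: 'A'
    replica: 1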
Step 2: Deploy the sidecar component
Optional configuration: when Prometheus generates a TSDB block every 2 hours, the sidecar uploads that block to the object storage bucket. This allows the Prometheus server to run with a short retention time while keeping historical data durable and queryable via object storage. Of course, this does not mean that Prometheus can be completely stateless: if it crashes and restarts, roughly the last 2 hours of metrics are lost, so Prometheus still needs a persistent disk.
./thanos sidecar \
--prometheus.url="http://localhost:9090" \
--objstore.config-file=./conf/bos.yaml \
--tsdb.path=/home/work/opdir/monitor/prometheus/data/prometheus/
The object storage configuration file (GCS shown as an example):
type: GCS
config:
  bucket: ""
  service_account: ""
Because Thanos does not yet support our cloud storage by default, we added the corresponding implementation to the Thanos code and submitted a PR upstream.
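If your object storage happens to be S3-compatible, the same --objstore.config-file mechanism works out of the box; a minimal sketch (file name, bucket, endpoint and keys below are placeholders):

# Write an S3-compatible object storage config usable by the sidecar, store and compact components
cat > ./conf/s3.yaml <<'EOF'
type: S3
config:
  bucket: "thanos-metrics"
  endpoint: "s3.example.com"
  access_key: "ACCESS_KEY"
  secret_key: "SECRET_KEY"
  insecure: false
EOF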
A word of caution: don't forget to pair your other two replicas, Prometheus 1 and Prometheus 2, with a sidecar as well. If Prometheus runs as a pod, you can add the sidecar as another container and reach it over 127.0.0.1; if it is a host deployment, point the sidecar at the Prometheus port.
In addition, the sidecar is stateless and can itself run with multiple replicas: several sidecars can read the same Prometheus data, which keeps the sidecar layer scalable. If everything runs in one pod this is unnecessary, since the sidecar lives and dies together with Prometheus.
The sidecar reads the meta.json in each Prometheus block, extends the JSON with Thanos-specific metadata, and uploads the block to object storage. After uploading, it records the progress in thanos.shipper.json.
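A rough illustration of the Thanos-specific section that ends up in a block's meta.json (the ULID and label values below are hypothetical; the field names follow Thanos' block metadata format):

# Inspect a block after the sidecar has extended its metadata
cat data/prometheus/01ABCDEF0123456789ABCDEFGH/meta.json
# Abridged expected shape:
# {
#   "ulid": "01ABCDEF0123456789ABCDEFGH",
#   "thanos": {
#     "labels": { "region": "A", "replica": "0" },
#     "downsample": { "resolution": 0 },
#     "source": "sidecar"
#   }
# }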
Step 3: Deploy the Query component
Once the sidecars are deployed, there are 3 copies of the same data. If you want to display the data directly at this point, you can install the Query component. The Query component (also called Querier) implements Prometheus' HTTP v1 API, so data in a Thanos cluster can be queried with PromQL just like on Prometheus' graph page. In short, the sidecars expose the Store API, and Query collects data from multiple Store APIs, runs the query, and returns the result. Query is completely stateless and can be scaled horizontally. Configuration:
./thanos query \
--http-address="0.0.0.0:8090" \
--store=replica0:10901 \
--store=replica1:10901 \
--store=replica2:10901 \
--store=127.0.0.1:19914
The --store parameters point to the sidecar components that were just started; after starting 3 replicas you can configure replica0, replica1 and replica2, where 10901 is the sidecar's default port. --http-address is the port of the query component itself, since it is a web service. After startup the page looks like this: it is almost the same as Prometheus, and with this page you no longer need to care about the individual Prometheus instances; you can simply query here.
Click Store to see which sidecars are connected.
The query page has two checkboxes, which mean:
- deduplication: whether to deduplicate. It is checked by default, so identical data appears only once; otherwise the exact same data from replicas 0, 1 and 2 would be returned 3 times.
- partial response: whether to allow a partial response. It is allowed by default, which is a compromise on consistency: if one of the three replicas 0, 1, 2 hangs or times out, the query would otherwise return nothing at all; allowing the remaining 2 replicas to answer returns data that may not be fully consistent, but refusing to answer because of a single timeout sacrifices availability, so partial responses are allowed by default.
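Both switches can also be set per request through Query's Prometheus-compatible HTTP API; a small sketch assuming the query component above listening on port 8090:

# Force deduplication on and disallow partial responses for this one request
curl 'http://127.0.0.1:8090/api/v1/query?query=up&dedup=true&partial_response=false'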
Step 4: Deploy the Store Gateway component
The ./thanos query command above has a --store entry of xxx:19914, which is not one of the 3 replicas mentioned so far; 19914 belongs to the Store Gateway component discussed next.
./thanos store \
--data-dir=./thanos-store-gateway/tmp/store \
--objstore.config-file=./thanos-store-gateway/conf/bos.yaml \
--http-address=0.0.0.0:19904 \
--grpc-address=0.0.0.0:19914 \
--index-cache-size=250MB \
--sync-block-duration=5m \
--min-time=-2w \
--max-time=-1h
--grpc-address is the port on which the Store API is exposed, i.e. the xxx:19914 configured as --store in the query component.
Data statistics of one of the regions:
Querying one month of historical data is reasonably fast; more importantly, data persistence has no O&M burden, capacity can be expanded at will, and the cost is low.
At this point the basic usage of Thanos is covered. Compaction and bucket verification are not core functions (Compact is also particularly resource-hungry), and we have not used the Rule component, so they are not introduced in detail here.
Step 5: View the data
With multi-region, multi-replica data in place, you can combine it with Grafana for a global view, for example:
View the performance indicators of etcd by region and cluster

View core component monitoring by region, cluster and machine, for example the performance of the multi-replica master machines

Once the data is aggregated, all views can be displayed in one place, such as these panels:
- Machine monitoring: node-exporter, process-exporter
- Pod resource usage: cAdvisor
- Docker, kube-proxy, kubelet monitoring
- scheduler, controller-manager, etcd, apiserver monitoring
- kube-state-metrics meta information
- Log monitoring, such as mtail
- K8S events monitoring

Receive mode
All the components mentioned so far are configured in sidecar mode, but Thanos also has a Receive mode, which is used less often and at the time had only appeared in proposals.
- If you are in a multi-tenant environment or are a cloud vendor, the object storage (historical data) and the query component generally live on the control plane, which makes permission checking and API service encapsulation easier, while the sidecar and Prometheus are inside the cluster, i.e. on the user side. The network between the control plane and the user side is sometimes restricted and not mutually reachable, and such restrictions can make the sidecar unusable.
- Tenant and control-plane isolation: similar to the point above, you may want the data to be stored entirely on the control plane. I have always felt that Receive mainly serves cloud vendors. Still, Receive is not the default solution; unless you particularly need it, it is better to stick with the default sidecar mode.
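For reference, a rough sketch of what Receive mode could look like (ports and the replica label are illustrative; instead of running a sidecar, Prometheus remote-writes to the Receive component, which then uploads blocks to object storage):

# Receive accepts remote-write traffic, stores it locally and ships blocks to object storage
./thanos receive \
  --grpc-address=0.0.0.0:10907 \
  --http-address=0.0.0.0:10909 \
  --remote-write.address=0.0.0.0:19291 \
  --tsdb.path=./thanos-receive/data \
  --objstore.config-file=./conf/bos.yaml \
  --label='receive_replica="0"'

# On the Prometheus side, prometheus.yml would point remote_write at Receive:
# remote_write:
#   - url: "http://<receive-host>:19291/api/v1/receive"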
Prometheus compression
Compaction: as mentioned in the official documentation, when using the sidecar you need to set Prometheus' --storage.tsdb.min-block-duration and --storage.tsdb.max-block-duration to 2h; the two parameters being equal ensures that Prometheus has local compaction turned off. These flags do not show up in the help output, and the Prometheus authors have explained that they are meant for development and testing only, so users are not advised to change them. Thanos requires compaction to be turned off because Prometheus by default compacts blocks with ranges of 2h, 2h*5 and 2h*5*5; if compaction is not disabled, Thanos may have just uploaded a block when that block is compacted away locally, causing the upload to fail.
But you don't have to worry: when the sidecar starts it checks these two flags, and if they are not set correctly the sidecar will refuse to start.
store-gateway
store-gateway: the Store component consumes the most resources, since it pulls remote data and loads it locally to serve queries. If you want to control how much history it serves and its cache behaviour, adjust the corresponding configuration, for example:
--index-cache-size=250MB \
--sync-block-duration=5m \
--min-time=-2w \
--max-time=-1h
Here --min-time=-2w means this store serves at most the last 2 weeks of data.
The store-gateway caches the index by default to speed up TSDB block lookups, but startup can sometimes consume a lot of memory; this was fixed after version 0.11.0, see the related issue.
Prometheus 2.0 optimized the storage layer: samples are laid out as contiguously as possible by time and by metric name. Since the store gateway knows the structure of the storage files, it can translate a metrics query into the minimal set of object storage requests; for a big query it can fetch hundreds or even thousands of chunks in one go.
In the store gateway's local cache only the index data is kept. Chunk data could be cached as well, but it is orders of magnitude larger, and fetching chunks from object storage currently adds only a small delay, so there is little incentive to cache them; doing so would also consume a lot of resources.
Data in Store-Gateway:

Each block folder actually contains just an index file, index.cache.json.
compactor component
Prometheus data keeps growing and queries will inevitably get slower, so Thanos provides the Compactor component to handle this. It has two functions (a minimal launch sketch follows the list below):
- One is compaction, i.e. continuously merging old blocks.
- The other is downsampling: it stores additional aggregated series, pre-computing values such as the maximum and minimum over fixed time ranges, and depending on the query interval it returns these sampled values instead of the raw points. When querying a particularly long time range you mainly care about the trend, so precision can be allowed to drop.
- Note that the Compactor does not reduce storage usage; it increases it (by adding higher-level aggregations).
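A minimal launch sketch for the Compactor, reusing the object storage config from above (the retention values are illustrative, not our production settings):

# Compact and downsample blocks in the bucket; --wait keeps it running as a long-lived service
./thanos compact \
  --data-dir=./thanos-compact/tmp \
  --objstore.config-file=./conf/bos.yaml \
  --http-address=0.0.0.0:19912 \
  --retention.resolution-raw=30d \
  --retention.resolution-5m=90d \
  --retention.resolution-1h=1y \
  --wait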
Tuning Thanos components is not a panacea, because business data keeps growing, so splitting by business should be considered. Divide the business to some degree and put the monitoring data of different businesses into different buckets (this requires some adaptation, or deploying more sidecars). For example, with 5 buckets, prepare 5 store gateways to proxy the queries, which reduces the amount of data any single store has to handle, as sketched below.
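A sketch of the bucket-splitting idea (config file names and ports here are made up): each business bucket gets its own store gateway, and Query simply fans out over all of them in addition to the sidecars:

# One store gateway per business bucket
./thanos store --objstore.config-file=./conf/bucket-master.yaml --grpc-address=0.0.0.0:19924 --data-dir=./store-master
./thanos store --objstore.config-file=./conf/bucket-pod.yaml    --grpc-address=0.0.0.0:19925 --data-dir=./store-pod

# Query gets one --store entry per gateway
./thanos query --http-address=0.0.0.0:8090 \
  --store=127.0.0.1:19924 \
  --store=127.0.0.1:19925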
The second approach is time slicing: as mentioned above, the store gateway can choose how far back the data it serves goes. The store gateway configuration supports two kinds of expressions, one based on relative time (for example, --max-time from 3d ago to 5d ago) and one based on absolute time (for example, 3/1/19 to 5/1/19). So if you want to query 3 months of data and one store proxies one month, you need to deploy 3 stores to cover it, as sketched below.
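A sketch of the time-slicing idea with two store gateways splitting recent and older history between them (the ranges are illustrative; --min-time/--max-time accept both relative durations and RFC3339 absolute times):

# Gateway for roughly the most recent month of history
./thanos store --objstore.config-file=./conf/bos.yaml \
  --grpc-address=0.0.0.0:19916 --data-dir=./store-recent \
  --min-time=-4w --max-time=-1h

# Gateway for everything older than a month
./thanos store --objstore.config-file=./conf/bos.yaml \
  --grpc-address=0.0.0.0:19917 --data-dir=./store-old \
  --max-time=-4w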
Query deduplication
When the Query component is started, duplicate data is deduplicated by default according to the query.replica-label field, and you can also toggle the deduplication checkbox on the page. Query results are then displayed according to the replica label you select via query.replica-label. But if the three replicas 0, 1 and 2 all return data with different values, which one does Query choose?
Thanos picks the data of the more stable replica based on a scoring (penalty) mechanism; the specific logic is in: https://github.com/thanos-io/thanos/blob/55cb8ca38b3539381dc6a781e637df15c694e50a/pkg/query/iter.go
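A minimal sketch of wiring this up, assuming the external_labels shown earlier (region/replica) and the replica hostnames used above: the flag value must match the name of the label that distinguishes the replicas.

# Tell Query which external label identifies replicas so it can deduplicate across them
./thanos query \
  --http-address=0.0.0.0:8090 \
  --query.replica-label=replica \
  --store=replica0:10901 \
  --store=replica1:10901 \
  --store=replica2:10901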
References
- https://www.percona.com/blog/2018/09/20/Prometheus-2-times-series-storage-performance-analyses/
- https://qianyongchao.blog/2019/01/03/Prometheus-thanos-design-%E4%BB%8B%E7%BB%8D/
- https://github.com/thanos-io/thanos/issues/405
- https://katacoda.com/bwplotka/courses/thanos
- https://medium.com/faun/comparing-thanos-to-victoriametrics-cluster-b193bea1683
- https://www.youtube.com/watch?v=qQN0N14HXPM
- https://thanos.io/
Source:
http://www.xuyasong.com/?p=1925