First, the business background

Second, preliminary introduction

1. Introduction to ClickHouse

ClickHouse is a columnar database management system (DBMS) for online analytical processing (OLAP). Columnar storage is better suited to OLAP workloads: by reading only the columns a query needs, most queries run at least 100 times faster than on row-oriented storage.

At the same time, because ClickHouse stores data by column, it also achieves high compression ratios: lz4 can reach about 1:4 and zstd about 1:10. This article mainly introduces our practice with ClickHouse; the characteristics of ClickHouse itself will not be repeated here.

2. ClickHouse storage policy

ClickHouse supports table-level data TTL policies at table creation time, which can delete expired data after compaction, or automatically merge it and move it to other disks or volumes. The log platform's multi-level storage uses this storage-policy mechanism, but after running into a TTL pitfall we ultimately gave up the table-level TTL setting and implemented the TTL policy with a scheduled task that moves table partitions, as described later.

1) Configure the storage policy


In ClickHouse's configuration file, find the <path> label that defines the default storage path, and add a <storage_configuration> label below it.

<storage_configuration>: fixed label that opens the storage configuration section.

<disks>: fixed label, under which each disk name and the absolute path to the disk are defined.

<hot>, <cold>: custom labels that name the disks and mark their paths; choosing meaningful names makes them easy to distinguish.

<policies>: fixed label that defines the storage policies.

<ttl_policy>: a custom label (the name is our choice) defining a specific storage policy for table-level TTL; a meaningful name makes it easy to distinguish.

<volumes>: fixed label that defines the volume groups.

<hot>, <cold>: volume group names; each volume group can contain one or more <disk> labels, whose values are disk names defined under the <disks> label.
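Assembled, a minimal sketch of such a storage configuration might look like the following (all paths and the policy name ttl_policy are illustrative):

```xml
<storage_configuration>
    <disks>
        <!-- disk names and absolute paths; names are our choice -->
        <hot><path>/data/clickhouse/hot/</path></hot>
        <cold><path>/data/clickhouse/cold/</path></cold>
    </disks>
    <policies>
        <!-- a named storage policy referenced by tables via SETTINGS storage_policy -->
        <ttl_policy>
            <volumes>
                <!-- each volume group wraps one or more disks defined above -->
                <hot><disk>hot</disk></hot>
                <cold><disk>cold</disk></cold>
            </volumes>
        </ttl_policy>
    </policies>
</storage_configuration>
```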

2) Create a table


To use table-level TTL for hierarchical hot/cold storage, you must specify a storage policy in order to use the disks. Why must a storage policy be specified? My understanding is that a TTL destination can be specified at the disk or volume level, but volumes can only be defined inside a storage policy.
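As a sketch (table and column names are illustrative, and the ttl_policy name is assumed to exist in the storage configuration), such a table with a table-level TTL might be declared like this:

```sql
CREATE TABLE logs_demo
(
    application String,
    environment String,
    log_time    DateTime,
    message     String
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(log_time)
ORDER BY (application, environment, log_time)
-- TTL destinations can be a disk or a volume; volumes only exist inside a policy
TTL log_time + INTERVAL 7 DAY TO DISK 'cold',
    log_time + INTERVAL 30 DAY TO VOLUME 'cold',
    log_time + INTERVAL 90 DAY DELETE
SETTINGS storage_policy = 'ttl_policy';
```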


3. Partition strategy

PARTITION BY: the partition key, an optional table-level parameter. In most cases no partition key is needed, and even when one is used, a partition key finer than by-month is not recommended; partitioning does not speed up queries (the ORDER BY expression does). You should never use overly fine-grained partitioning: do not partition data by client identifier or name, but instead make the client identifier or name the first column in the ORDER BY expression. (Keep this sentence from the official documentation in mind; generally we recommend partitioning only by a time field.)

The above is what the official documentation says, but in practice we may need to create partitions granular even to the hour level, depending on the specific situation. As always, it depends on the requirements; used badly, partitioning has side effects, as we will see later.
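For contrast, the month-level key the documentation recommends versus a finer day-level key might be sketched as follows (table and column names are illustrative):

```sql
-- month-level partitioning, as the official documentation recommends
CREATE TABLE logs_by_month (log_time DateTime, message String)
ENGINE = MergeTree
PARTITION BY toYYYYMM(log_time)
ORDER BY log_time;

-- day-level partitioning: finer-grained control, but many more parts to merge and manage
CREATE TABLE logs_by_day (log_time DateTime, message String)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(log_time)
ORDER BY log_time;
```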

Third, business needs

Beyond the requirements for high write throughput and high compression ratio uncovered in our requirements research, the infrastructure team pointed out that each business domain has its own log-retention range requirements (with ES as storage, R&D even needed day-level control of log retention), such as 7 days, 30 days, 90 days, or even longer. Because cold data is queried infrequently, it can live on cheaper storage media. In addition, some data is retained for a period without ever being queried, purely to meet compliance requirements, after which the old logs can be deleted. The infrastructure team hoped the DBAs could advise on data retention on ClickHouse's storage to minimize storage costs.

To sum up, the requirements are as follows:

How can I meet the day-level retention policies for each business domain as much as possible?

How can I store my data on different storage media based on date?

What kind of storage media is used in a multi-tier storage strategy to minimize storage costs?

Based on these requirements, the DBAs proposed the following options:

For the retention policy, we used a table-partitioning strategy and circumvented some pitfalls.

Multi-level storage uses a three-level strategy, namely Hot + Cold + Arch (we abbreviate Archive as Arch from here on).

The hot tier uses ESSD PL1 disks mounted on the host, with disk name hot; the cold tier uses ESSD PL0 disks mounted on the host, with disk name cold; for arch we eventually chose to mount the OSS file system directly (we also investigated JuiceFS along the way but ultimately gave it up), with disk name arch.

Let's look at each in turn.

Fourth, the solution design

For business needs, we need to solve three problems:

1. How to meet the day-level retention policies of each business domain as much as possible?

1) Scheme 1: Partition Policy PARTITION BY (application, environment, toYYYYMMDD(log_time))

Initially, we expected to use a combined partitioning strategy of application name + environment + day, that is, PARTITION BY (application, environment, toYYYYMMDD(log_time)). This way each application would have independent partitions per environment, service R&D could flexibly modify log retention times at will, and each partition could be moved independently to different disks according to the retention policy. Expectations were high, but reality fell short. Although this scheme fit the requirements very well, real write tests exposed problems: there are many applications online, the number is still rising, and each application may have multiple environments. Even with each application writing to only one day partition per day, in actual testing we found write performance seriously insufficient, writes were very slow, and several parameter limits were exceeded, causing errors.

max_partitions_per_insert_block, the default value is 100.

This limits the number of partitions involved in a single inserted data block. Since the logs are consumed from Kafka topics, one topic may carry hundreds of applications, each with multiple environments. Even writing only to day partitions, which are physically isolated into separate directories, a single insert gets split by ClickHouse into parts belonging to thousands of different partitions, so the number of partitions per insert explodes and the error "too many partitions for single insert blocks" appears; the real count is far beyond the setting, 100 is simply not enough, and even raised to 10,000 errors still occurred. Where write volume is not very large, the parameter can be raised appropriately, but the test-environment logs were too fragmented, and soon after raising the parameter we hit "too many parts" errors. "Too many parts" means ClickHouse merges cannot keep up with writes and it refuses new data outright; that is governed by the next parameter, max_parts_in_total, whose value was also insufficient.

max_parts_in_total, the default value is 100,000.

In the test environment, each application has less data, so batches are harder to accumulate; if batching in the write path is not handled well, many parts with few rows get written frequently. Combined with the large number of applications and environments, the active parts of one table easily exceeded 100,000 (you can check a table's active part count in the system.parts table), and we repeatedly hit the "too many parts in total" error. You can appropriately increase background_pool_size to speed up merges, but against a huge number of fragmented parts that is a drop in the bucket and does not solve the fundamental problem. Eventually we abandoned Scheme 1 and chose Scheme 2.
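To see how close a table is to these limits, one can count its active parts in the system.parts system table, for example:

```sql
-- count active data parts per table; a very large number signals merges falling behind writes
SELECT database, table, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY active_parts DESC
LIMIT 10;
```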

2) Scheme 2: Partition Policy PARTITION BY (toDate(log_time), log_save_time, oss_save_time)

The problem with Scheme 1 was mainly the choice of partition fields, and we decided to keep a certain log-retention capability while satisfying write throughput. In the end, the application name field was removed from the partition key (to preserve query speed, we made application the first field of ORDER BY).

Given the three-level retention policy, our first thought was to use multiple tables: define fixed options for the log retention range (7d, 30d, 90d, 180d), with R&D making some compromises and not modifying retention times arbitrarily, and create one table per option so that write performance is optimal. But there is a problem: when R&D files a ticket to change an application's retention time, the adjustment makes new logs fall into a different table, so with unchanged query statements the code cannot find historical data for the same period. The code could implement routing rules, but that is cumbersome and unfriendly, so we gave up this approach.

How to solve it? After discussion, we came up with adding two int-type fields to the table, log_save_time and oss_save_time, and adjusting the partition key to PARTITION BY (toDate(log_time), log_save_time, oss_save_time). log_save_time is how long each log is kept on the hot disk; beyond it, a task moves the partition to the cold disk. oss_save_time is how long each log is kept on the cold disk; beyond it, the partition is moved to the arch disk (the moving task is described below). Every day the task inspects the table and compares the three fields of the partition key: if the difference between toDate(log_time) and the current date exceeds the partition's log_save_time value, the partition is moved to the cold disk, and if it exceeds oss_save_time, to the arch disk. This way, when the log-retention metadata is modified, newly generated log rows carry the new retention values in the two fields and fall into different partitions. How, exactly? That brings us to question 2.
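The daily move described above could be sketched in SQL like this (the table name logs and the partition values are illustrative; the real task derives partitions and their ages from system.parts):

```sql
-- find partitions still sitting on the hot disk
SELECT DISTINCT partition
FROM system.parts
WHERE active AND table = 'logs' AND disk_name = 'hot';

-- a partition is identified by its key tuple (log date, log_save_time, oss_save_time);
-- once today - log date > log_save_time, move it to the cold disk
ALTER TABLE logs MOVE PARTITION ('2023-01-01', 7, 30) TO DISK 'cold';

-- and once today - log date > oss_save_time, on to the archive disk
ALTER TABLE logs MOVE PARTITION ('2023-01-01', 7, 30) TO DISK 'arch';
```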

2. How to meet the requirements of storage in different storage media according to the date?

Having solved the first problem, how do we make logs expire according to the partition settings?

1) Scheme 1: Use table-level TTL settings

Initially we thought of using table-level TTL settings in the table’s creation statement:
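A sketch of what that could look like (intervals, names, and the ttl_policy storage policy are illustrative):

```sql
-- table-level TTL declared in the creation statement
CREATE TABLE logs_ttl (log_time DateTime, message String)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(log_time)
ORDER BY log_time
TTL log_time + INTERVAL 7 DAY TO DISK 'cold',
    log_time + INTERVAL 90 DAY DELETE
SETTINGS storage_policy = 'ttl_policy';

-- modifying the TTL later is what triggers the part-reload pitfall described below
ALTER TABLE logs_ttl MODIFY TTL log_time + INTERVAL 30 DAY TO DISK 'cold';
```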

The advantage is that it uses ClickHouse's own capabilities to relocate data, migrating expired data to cold storage media.

However, it forces the creation of multiple corresponding tables based on the selectable time ranges, the issue mentioned earlier.

In addition, there is a pitfall: once you modify the retention time of a table-level TTL, ClickHouse reloads all part directories of the table, driving IO util to 100% so that the cluster cannot respond. This is a nasty trap with no good workaround, so we abandoned this scheme.

2) Scheme 2: Develop scheduling tasks and move partition data manually

Combined with question 1, we finally chose to create the table structure this way (only as a demo reference, not the real business table), as follows:
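A demo sketch consistent with the description above (column names other than the partition fields are illustrative, as is the ttl_policy storage policy):

```sql
CREATE TABLE logs
(
    application   String,
    environment   String,
    log_time      DateTime,
    message       String,
    log_save_time Int32,   -- days kept on the hot disk
    oss_save_time Int32    -- days kept on the cold disk before archiving
)
ENGINE = MergeTree
PARTITION BY (toDate(log_time), log_save_time, oss_save_time)
ORDER BY (application, environment, log_time)
SETTINGS storage_policy = 'ttl_policy';
```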

The logging platform runs a scheduling task and maintains the correspondence between each application and its (log_save_time, oss_save_time). Based on this information, the task relocates each application's log partitions every day.

There is a small problem here, though: when R&D changes the log retention time, for example increasing it, new data falls into new partitions while older partitions are deleted early by the matching rules. For example, if 7 days is adjusted to 30 days, the older partitions still carry the value 7, so on day 7 they match the deletion policy and are removed. Even though the policy is now 30 days, there will still be 7 days of logs that cannot be found; of course, the full 30 days of logs are queryable again once more time has passed. Conversely, if the retention time is reduced, say from 30 to 7, some logs are retained for up to 23 extra days and the space is not released promptly; it is freed only after those 23 days. These issues are tolerable, and compared with the multiple-table scheme the advantages outweigh the disadvantages, so we chose this solution.

3. What kind of media should be selected for mass storage of archived data?

With the expiration policy and table structure settled, the arch disk mentioned earlier stores data that is almost never queried, so low-cost media should be used to reduce costs. The first thing we thought of: can we use OSS? The answer is yes. We checked the pricing: for the same capacity, OSS costs only one third of ESSD PL0, which can undoubtedly cut storage costs significantly. But how best to use it required investigation and testing.

1) Scheme 1: ClickHouse + JuiceFS + OSS

The main function of JuiceFS is to expose S3-compatible object storage as a mounted file system, and it has production cases at many well-known Internet companies (a public technical-sharing article by an overseas e-commerce platform mentioned ClickHouse hot/cold-separated storage built on JuiceFS), so we began to investigate the feasibility of this solution.

An introduction to JuiceFS:

JuiceFS is a high-performance shared file system designed for cloud-native environments, released under the Apache 2.0 open-source license. It provides full POSIX compatibility, allows almost any object storage to be accessed locally as a massive local disk, and can be mounted simultaneously on different hosts across platforms and regions.

JuiceFS adopts an architecture that separates "data" from "metadata" to implement a distributed file system. When storing data with JuiceFS, the data itself is persisted in object storage (for example, Amazon S3), while the corresponding metadata can be persisted, as needed, in Redis, MySQL, TiKV, SQLite, and many other databases.

JuiceFS provides a wealth of APIs for the management, analysis, archiving, and backup of various forms of data, and can seamlessly connect with application platforms such as big data, machine learning, and artificial intelligence without modifying code, providing them with massive, elastic, and low-cost high-performance storage.


Another reason we considered JuiceFS at the time was its read-caching mechanism. When accessing files in JuiceFS, multiple cache levels serve frequently accessed data for better performance: a read request first tries the kernel page cache, then the JuiceFS process's read-ahead buffer, then the local disk cache; only when no cache level holds the data is it read from object storage, after which it is asynchronously written back to all cache levels to keep subsequent accesses fast.


As can be seen, JuiceFS supports mounting object storage on ECS, accessing it as a local disk through the file system, and accelerating file reads with its read cache. We tested it, and the read/write performance was indeed good.

Local read and write test (benchmark figures omitted):

We mounted the JuiceFS file system into the ClickHouse storage policy and ran the SSBM benchmark; from the ClickHouse perspective, query performance was comparable to, or even slightly better than, ESSD PL0 (which costs far more than OSS).

However, in the end we didn’t use JuiceFS for the following reasons:

The JuiceFS architecture introduces a third-party storage medium to hold file metadata.

If Redis is used as the metadata store, metadata consistency cannot be guaranteed; should Redis fail over and lose metadata, ClickHouse risks losing data.

If MySQL is used as the metadata store, it is doubtful whether its QPS and RT can keep up with ClickHouse's high-frequency metadata access to files.

If TiKV is used as the metadata store, it increases O&M complexity.

Given these concerns, we wanted to avoid the extra operational cost of introducing another technology stack. This is not to say JuiceFS has inherent drawbacks, but in our scenario we wanted to introduce as few new components as possible, and we happened to have an off-the-shelf object storage service. So we went on to test the native OSS mounting scheme (discussed next). The investigation nevertheless gave us a deep understanding of JuiceFS's architecture and strengths, laying a foundation for applying it in other scenarios later.

2) Scheme 2: ClickHouse + OSS (the solution we finally chose)

ClickHouse's native MergeTree supports directly mounting S3 as a data disk. We tested this feature and the results met expectations. Although the official demo uses S3, OSS also works because it supports the S3 protocol. Note that the endpoint must start with http here, otherwise the disk cannot be mounted.

The storage policy is configured as follows:
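A sketch of such a configuration (the bucket endpoint and keys are placeholders; note the endpoint starting with http):

```xml
<storage_configuration>
    <disks>
        <arch>
            <!-- OSS mounted through ClickHouse's S3-compatible disk type -->
            <type>s3</type>
            <endpoint>http://examplebucket.oss-cn-hangzhou.aliyuncs.com/clickhouse/arch/</endpoint>
            <access_key_id>your-access-key-id</access_key_id>
            <secret_access_key>your-secret-access-key</secret_access_key>
        </arch>
    </disks>
    <policies>
        <ttl_policy>
            <volumes>
                <arch><disk>arch</disk></arch>
            </volumes>
        </ttl_policy>
    </policies>
</storage_configuration>
```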

Test conditions: write tests moving data from the hot disk to OSS.

OSS write throughput reached up to 7 Gb/s.

Restarting the cluster was also fast.

This architecture is simple, reliable, and convenient to operate, so we finally chose to mount OSS directly. With OSS, archived data costs up to 66% less than on ESSD PL0. In addition, we provision a separate OSS bucket for each cluster, which reduces costs while providing sufficient write performance. The log platform has already begun migrating logs that meet the archiving policy, which should save a considerable amount of money.

Fifth, the storage architecture

Based on the above scenario, our log platform storage architecture is as follows:

The architecture takes full advantage of cloud infrastructure, for example:

Based on ESSD cloud disks, a single-replica storage architecture is used (the cloud disk itself already stores multiple replicas); if a ClickHouse node fails, the ESSD disk migrates with it, either via live migration or self-triggered migration;

Based on ESSD's in-place expansion capability, disk space can be scaled up vertically without data loss or service impact; for example, when early-stage storage needs are small, you can provision 50% of the budget first and expand as data volume grows;

Based on ECS's elastic scaling capability, nodes can be added quickly or down-specced to cut costs. Because ClickHouse itself cannot rebalance automatically, we can start with more lower-spec nodes and vertically upgrade the ECS instances as business volume grows;

Based on low-cost OSS object storage, archived data that is rarely or never accessed is stored cheaply. Self-building on HDDs would require the full investment up front, while OSS on the cloud is billed by usage, which is comparatively cost-effective.

Sixth, summary

This article described the DBA team's part in the recent transformation of the logging platform: table field and index design suggestions, development of the expiration-policy scheme, SQL writing and optimization advice, and cost-reduction solutions. Finally, through the efforts of the log platform's R&D students, we migrated the log platform's storage from ES to ClickHouse, gaining high-performance write capability while saving more than 50% of storage costs.