In 2018 I wrote an article about Clickhouse which is still popular on the internet and has even been translated several times. More than two years have passed, and Clickhouse’s development pace is as active as ever: 800 merged PRs last month! Doesn’t that surprise you? It might take an hour just to skim the changelogs and new feature descriptions, for example the 2020 one: https://clickhouse.tech/docs/en/whats-new/changelog/2020/
For a fair comparison, the ElasticSearch repository had a staggering 1076 merged PRs in the same month, and its feature pace is equally impressive!
We use Clickhouse for log storage and analysis in ApiRoad.net, an API marketplace where developers sell their APIs; the project is under active development, and so far we are happy with the results. As an API developer, observability and analyzability of the HTTP request/response cycle are important for maintaining quality of service and finding bugs quickly, especially for pure API services.

We also use the ELK (ElasticSearch, Logstash, filebeat, Kibana) stack on other projects for the same purpose: to get HTTP and mail logs, and to use Kibana for post-mortem analysis and search.
Of course, we also use MySQL everywhere!
This article focuses on why we chose Clickhouse over ElasticSearch (or MySQL) as the primary storage solution for the underlying data (service request logs); note that we still use MySQL for OLTP purposes.
First, SQL support, JSON and arrays as first-class citizens
SQL is the ideal language for data analysis. I love SQL as a query language, and SQL schemas are a perfect example of boring technology: I recommend using them in 99% of projects. A strict schema keeps the truth in your data: project code is never perfect, but if your data is stored under a structured schema, it can be refactored with relative ease. Conversely, if the database holds huge JSON blobs (NoSQL) and nobody fully grasps the structure of the data, refactoring becomes much more troublesome.
I see this especially in old projects that use MongoDB: every new analytics report and every refactoring that involves data migration is painful. Starting a new project like this is fun, because you don’t have to spend time designing the schema in detail up front and can just “see how it goes”, but maintaining it is very boring!
However, it is worth noting that this rule of thumb (“use a strict schema”) is not that critical for the log storage use case. That is part of why ElasticSearch is so successful: it delivers many benefits with a flexible schema.
Coming back to JSON: in terms of query syntax over JSON data, traditional relational databases are still catching up with NoSQL databases, and we have to admit that JSON is a very convenient format for dynamically structured data such as logs.
Clickhouse is a modern engine that was designed and built after JSON had already emerged (unlike MySQL and Postgres). Since Clickhouse does not have to carry the backwards-compatibility baggage and strict SQL standards of those popular RDBMSs, the Clickhouse team can move faster on features and improvements, and it does. Clickhouse developers have more room to find the optimal balance between strict schemas and JSON flexibility, and I think they have done a good job of that. Clickhouse tries to compete with Google BigQuery and other big analytics players, so it has made many improvements to “standard” SQL that turn its dialect into a killer feature and give it an advantage over traditional RDBMSs in many analytical and computational scenarios.
In MySQL you can extract JSON fields, but complex JSON processing is only available in the latest version (version 8, with the JSON_TABLE function). In PostgreSQL it is even worse: there was no direct JSON_TABLE alternative before PostgreSQL 12!
This is still far behind Clickhouse’s JSON and related array capabilities (see the array functions section of the Clickhouse documentation).
PostgreSQL’s generate_series() function is useful in many cases. A concrete example from ApiRoad: we need to plot the number of requests per day on a Chart.js timeline. A regular SELECT with GROUP BY day returns one row per day, but days with no requests simply disappear. We don’t want gaps, we want zeros for those days, right? This is exactly what generate_series() solves in PostgreSQL. In MySQL the usual recommendation is to create a calendar table of dates and join against it, which is not exactly elegant, is it?
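To make that concrete, here is a minimal PostgreSQL sketch of the gap-filling approach (it reuses the logs table and started_at column from the Clickhouse query below; in a real schema the names may differ):

-- generate_series() produces one row per day; the LEFT JOIN keeps days
-- with zero requests instead of dropping them
SELECT d::date AS day, count(l.started_at) AS requests
FROM generate_series(date '2021-01-01', date '2021-01-31', interval '1 day') AS d
LEFT JOIN logs AS l ON l.started_at::date = d::date
GROUP BY d::date
ORDER BY day;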
Here’s how to solve it in ElasticSearch:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-datehistogram-aggregation.html#_missing_value_2
As for the query language: I am still not happy with ElasticSearch’s Lucene syntax and with writing JSON against its HTTP API just to retrieve data. SQL remains my first choice.
Here is Clickhouse’s solution for filling the date gaps:
SELECT a.timePeriod AS t, b.count AS c FROM (
    WITH (SELECT toUInt32(dateDiff('day', [START_DATE], [END_DATE]))) AS diffInTimeUnits
    SELECT arrayJoin(arrayMap(x -> toDate(addDays([START_DATE], x)), range(0, diffInTimeUnits + 1))) AS timePeriod
) a
LEFT JOIN (
    SELECT count(*) AS count, toDate(toStartOfDay(started_at)) AS timePeriod
    FROM logs
    WHERE [CONDITIONS]
    GROUP BY toStartOfDay(started_at)
) b ON a.timePeriod = b.timePeriod
Here we generate a virtual table with a lambda function and range(), and then LEFT JOIN it with the log table grouped by day.
I think the arrayJoin + arrayMap + range pattern is more flexible than generate_series(). Alternatively, the WITH FILL clause gives a more concise syntax for the same task, as sketched below.
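For illustration, a sketch of the WITH FILL variant (assuming the same logs table and a date range known in advance; days with no rows get filled in with a zero count):

SELECT toDate(toStartOfDay(started_at)) AS day, count(*) AS c
FROM logs
WHERE started_at >= '2021-01-01' AND started_at < '2021-02-01'
GROUP BY day
ORDER BY day WITH FILL FROM toDate('2021-01-01') TO toDate('2021-02-01')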
Second, flexible schema – but also strict when needed
For log storage, where the data schema often changes over the life of a project, ElasticSearch lets you dump huge chunks of JSON into an index and figure out field types and indexing later. Clickhouse supports this approach too.
You can put data into a JSON field and filter on it reasonably quickly, although at the terabyte scale it will not be blazing fast. Then, once you find yourself querying a specific field often, you can add a materialized column to the log table that extracts the value from the JSON on the fly, and queries against it stay fast even on terabytes of data.
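A minimal sketch of that workflow (table and field names are hypothetical; the idea is simply a raw-JSON String column plus a materialized column added later):

-- raw JSON goes into a plain String column
CREATE TABLE logs_json
(
    started_at DateTime,
    message    String
)
ENGINE = MergeTree
ORDER BY started_at;

-- ad-hoc filtering straight from the JSON: flexible, but slower
SELECT count(*) FROM logs_json WHERE JSONExtractUInt(message, 'status') = 404;

-- once the field is queried often, promote it to a materialized column,
-- computed on insert and stored in columnar form
ALTER TABLE logs_json
    ADD COLUMN status UInt16 MATERIALIZED toUInt16(JSONExtractUInt(message, 'status'));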
I recommend Altinity’s featured video on JSON vs. tabular format in log storage:
https://youtu.be/pZkKsfr8n3M
Third, storage and query efficiency
Clickhouse is very fast when it comes to SELECT queries, as discussed in previous articles.
Interestingly, there are reports that Clickhouse can be 5–6 times more efficient than ElasticSearch, and literally an order of magnitude faster on queries. There are other examples as well.
I don’t believe there is a good head-to-head benchmark of the two, at least not yet, because Clickhouse and ElasticSearch differ so much in query syntax, cache implementation, and overall feature set.
In MySQL, a suboptimal query (for example, one that misses an index) on a table with just 100 million rows of log data can slow the whole server down and trigger swapping, so MySQL is not a good fit for querying large volumes of logs. As far as storage goes, though, compressed InnoDB tables are not that bad. Because of its row-based nature, MySQL compresses data much worse than Clickhouse (sorry, no link for that this time), but it still cuts costs without degrading performance too much. So in some cases we still use InnoDB tables for small volumes of logs.
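For reference, a sketch of what such a table might look like in MySQL (names are hypothetical; ROW_FORMAT=COMPRESSED enables InnoDB page compression):

CREATE TABLE request_logs (
    id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    started_at DATETIME NOT NULL,
    status     SMALLINT UNSIGNED NOT NULL,
    duration   INT UNSIGNED NOT NULL,
    message    JSON,
    KEY idx_started_at (started_at)
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;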
In Clickhouse, it’s easy to calculate the median and 99th quantile time of a 404 query:
SELECT count(*) AS cnt,
       quantileTiming(0.5)(duration) AS duration_median,
       quantileTiming(0.9)(duration) AS duration_90th,
       quantileTiming(0.99)(duration) AS duration_99th
FROM logs WHERE status = 404
Note the quantileTiming function here and how currying is used elegantly. Clickhouse also has a generic quantile function, but quantileTiming is optimized for sequences that describe distributions such as web page load times or backend response times.
Need a weighted arithmetic mean? Want to calculate a linear regression? It’s easy, just use the dedicated functions.
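For example, a sketch using the same logs table (response_size is a hypothetical column standing in for whatever weight or second variable you have):

SELECT
    avgWeighted(duration, response_size) AS weighted_avg_duration,
    -- returns the (k, b) pair of y = k*x + b
    simpleLinearRegression(response_size, duration) AS size_vs_duration
FROM logs
WHERE status = 200;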
Here is the full list of Clickhouse-related statistical functions:
https://clickhouse.tech/docs/en/sql-reference/aggregate-functions/reference/
Most of these are problematic in MySQL.
ElasticSearch is much better than MySQL in this regard; it has both quantiles and weighted medians, but no linear regression yet.
Fourth, MySQL and Clickhouse are closely integrated
MySQL and Clickhouse integrate with each other on several levels, which makes it very convenient to use them together with minimal data duplication; on the Clickhouse side this includes the MySQL database engine, the MySQL table engine, and the mysql() table function.
I can’t say yet how fast and stable the MySQL database and table engines are under joins; that definitely needs benchmarking. But the concept is very attractive: you get your MySQL tables fully available on the Clickhouse side without having to deal with cache invalidation and reindexing.
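As a sketch of what that integration looks like (host, credentials, the users table, and the user_id column are all hypothetical here):

-- expose a whole MySQL database inside Clickhouse via the MySQL database engine
CREATE DATABASE mysql_app
ENGINE = MySQL('mysql-host:3306', 'app', 'ch_reader', 'secret');

-- then join MySQL data with the Clickhouse logs table directly
SELECT u.email, count(*) AS requests
FROM logs AS l
INNER JOIN mysql_app.users AS u ON u.id = l.user_id
GROUP BY u.email
ORDER BY requests DESC
LIMIT 10;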
My (limited) experience of using MySQL together with ElasticSearch suggests the two technologies are too different; my impression is that they live side by side rather than really working together. What usually happens is that we serialize whatever ElasticSearch needs to index into JSON and ship it over, and then, after every MySQL data migration or mutating operation (UPDATE/REPLACE), we have to figure out which parts need to be reindexed on the ElasticSearch side. There is an article describing MySQL-to-ElasticSearch synchronization based on Logstash; I am not a big fan of Logstash because of its mediocre performance, high memory requirements, and the fact that it can become a destabilizing factor in the system. For simple MySQL-based projects, this data synchronization and indexing overhead is often exactly what stops us from using ElasticSearch.
Want to attach a CSV sitting in S3 storage and query it as a table in Clickhouse? It’s trivial.
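A sketch with a hypothetical bucket and schema, using the s3() table function (no import step at all):

SELECT status, count(*) AS hits
FROM s3(
    'https://my-bucket.s3.amazonaws.com/exports/logs.csv',
    'CSVWithNames',
    'started_at DateTime, status UInt16, duration UInt32'
)
GROUP BY status
ORDER BY hits DESC;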
Want to update or delete log rows to comply with data protection regulations? That’s easy now!
When I wrote my first post in 2018, Clickhouse had no convenient way to delete or update data, which was a real downside. Now this is no longer a problem: Clickhouse deletes rows with a dedicated ALTER syntax:
ALTER TABLE [db.]table [ON CLUSTER cluster] DELETE WHERE filter_expr
For obvious reasons, deletion is still a fairly expensive operation for Clickhouse (and other columnar databases), so it is best not to run it frequently in production.
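A concrete sketch of such mutations (user_id and user_agent are hypothetical columns here):

-- e.g. handling an erasure request
ALTER TABLE logs DELETE WHERE user_id = 123;
-- or redact a single field instead of deleting whole rows
ALTER TABLE logs UPDATE user_agent = '' WHERE user_id = 123;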
Clickhouse also has drawbacks compared to ElasticSearch. First of all, if you are building internal analytics on top of your log storage, you want good GUI tools. Kibana is arguably the better choice for that right now compared to Grafana (at least, that opinion is widespread, and the Grafana UI can feel less polished), while with Clickhouse you have to use Grafana or Redash. (Our favourite, Metabase, also supports Clickhouse!)
However, in our case we are building user-facing analytics, so we had to build the analytics GUI from scratch anyway (we use Laravel, Inertia.js, Vue.js, and Chart.js for the user interface).
Another ecosystem-related issue is that the choice of tools for collecting, processing, and shipping data to Clickhouse is more limited. For ElasticSearch there are Logstash and filebeat, tools native to the Elastic ecosystem and designed to work together perfectly. Fortunately, Logstash can also write to Clickhouse, which mitigates the problem. At ApiRoad we use our own custom Node.js log shipper, which aggregates logs and sends them to Clickhouse in batches (Clickhouse prefers large, infrequent batches over many small inserts).
Some Clickhouse function names that I find odd are a legacy of the fact that Clickhouse was built for Yandex.Metrika, a Google Analytics competitor. For example, visitParamHas() checks whether a particular key exists in JSON: generic in purpose, but not generic in name. There is also a set of functions with nicer names, such as JSONHas(). An interesting detail: as far as I know, they use a different JSON parsing engine, which is more standards-compliant but slightly slower.
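A small sketch of the two families side by side (it assumes the logs table stores raw JSON in a String column called message, which is hypothetical here):

SELECT
    visitParamHas(message, 'status')   AS has_status_legacy,  -- older, Metrika-flavoured name
    JSONHas(message, 'status')         AS has_status,         -- generic name, different parser
    JSONExtractUInt(message, 'status') AS status_code
FROM logs
LIMIT 5;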
ElasticSearch is a very powerful solution, but I think its greatest strength remains large multi-node setups of 10+ nodes, big full-text indexes and facets, complex indexing, and score calculation: that is where ElasticSearch shines. For time series and log storage there seem to be better options, and Clickhouse is one of them. The ElasticSearch API is very powerful, but in many cases it is hard to remember how to do something without copy-pasting a specific HTTP request from the documentation; it feels more “enterprise” and “Java-style.”
Both Clickhouse and ElasticSearch are memory-hungry programs, with Clickhouse requiring 4 GB of RAM while ElasticSearch requires 16 GB. I also think the Elastic team is heavily focused on their new machine learning capabilities; my humble opinion is that, as exciting as they sound, such huge feature sets are hard to support and improve on an ongoing basis, no matter how many developers and how much money you have. To me, ElasticSearch increasingly looks like it is trying to do everything at once. Maybe I am wrong.
Clickhouse is different. It is easy to set up, its SQL is simple, and the console client is great. With minimal configuration everything stays simple and efficient, yet the rich features, replication, and sharding are there when you need them on terabytes of data.
Written by Anton Sidashin (translator: Fivezh).
Source: https://pixeljets.com/blog/clickhouse-vs-elasticsearch/

