open
source analytics database ClickHouse is known for being fast, is that really the case? Let’s verify it with a comparative test.
First, use ClickHouse (CH) and Oracle Database (ORA) to do comparative tests in the same software and hardware environment. The test protocol uses the internationally recognized TPC-H to complete the calculation requirements (Q1 to Q22) defined by 22 SQL statements for 8 tables. The test adopts a single machine with 12 threads and a total data size of 100G. The SQL corresponding to TPC-H is relatively long, so it will not be listed in detail here.
Q1 is a simple single-table traversal calculation group summary, and the comparison test results are as follows:
CH
calculates Q1 better than ORA, indicating that CH’s columnar storage is doing well, and single-table traversal is fast. The main disadvantage of ORA is the use of row storage, which is significantly slower.
But if we increase the computational complexity, how does CH perform? Continue looking at Q2, Q3, Q7 of TPC-H, the test results are as follows:
As calculations become complex, CH performance degrades significantly. Q2 involves a small amount of data, column storage is not useful, and CH performance is almost the same as ORA. Q3 has a large amount of data, and CH surpasses ORA after taking advantage of the cheap column storage. Q7 data is also large, but the calculation is complex, and the CH performance is not as good as ORA.
Whether you do complex calculations quickly depends mainly on whether the performance optimization engine is doing well. CH’s column storage occupies a huge storage advantage, but it is caught up with row storage by ORA, which shows that CH’s algorithm optimization ability is far inferior to ORA.
TPC-H’s Q8 is a more complex calculation, there are multiple table connections in the subquery, CH ran for more than 2000 seconds and did not produce results, it should be stuck, ORA ran for 192 seconds. Q9 added like to the subquery in Q8, CH directly reported the out-of-memory error, and the ORA ran for 234 seconds. There are also some complex calculations that CH can’t run, so there is no way to make an overall comparison.
CH and ORA are based on the SQL language, but the statements that ORA can optimize, CH cannot run, which proves that CH’s optimization engine ability is relatively poor.
According to anecdotal legends, CH is only good at doing single-table traversal operations, and it can’t even outperform MySQL when there are correlation operations, which seems to be not nonsense. Students who want to use CH have to weigh it, how much can this scene be adapted?
open source esProc SPL is also a high-performance publicity point, So let’s compare again.
Q2, Q3, Q7 and other more complex operations, SPL is faster than CH and ORA. CH can’t run Q8, Q9, SPL ran 37 seconds and 68 seconds, respectively, which is also faster than ORA. The reason is that SPL can use better algorithms, its computational complexity is lower than that of SQL optimized by ORA, far lower than that of SQL executed by CH, plus column storage, and finally SPL developed in Java outperforms the CH and ORA implemented by C++.
It can probably be concluded that esProc SPL performs very well for both simple and complex calculations.
However, for the simple operation of Q1, CH is still slightly better than SPL. It seems to further prove the previous conclusion that CH is particularly good at simple traversal operations.
And slowly, SPL has a secret weapon.
The
columnar cursor mechanism is provided in the enterprise edition of SPL, let’s compare and test: under the volume of 800 million data, do the simplest grouping summary calculation, and compare the performance of SPL (using columnar cursors) and CH. (The machine configuration used is slightly lower than the previous TPC-H test, so the results are different, but the relative value is mainly looked at here.)
The
SQL statement corresponding to CH in a simple group is
:
SELECT mod( id, 100) AS Aid, max(amount) AS Amax
FROM test.t
GROUP BY mod(id, 100)
The result of this test looks like this:
After SPL uses the columnar cursor mechanism, the performance of simple traversal grouping calculations is the same as that of CH. If columnar cursors were also used in the Q1 test of TPC-H, SPL would achieve the same performance as CH.
During the test, it was found that 800 million pieces of data stored in text format occupied 15G of disk, 5.4G of CH, and 8G of SPL. It shows that CH and SPL both use compressed storage, and the compression ratio of CH is higher, which further proves that the CH storage engine is really good. However, SPL can also achieve the same performance as CH, which shows that the SPL storage engine and algorithm optimization are relatively good, and the high-performance computing power is more balanced.
The current version of SPL is written in Java, which is slow to generate objects for computation after Java readings, while CH developed in C++ does not have this problem. For complex operations, the proportion of reading time is not high, and the drag caused by the slow generation of Java objects is not obvious; For simple traversal operations, the reading time is very high, so the SPL will be slower than CH in the previous test. Columnar cursors optimize the reading scheme and no longer generate small objects, so that the number of object generation times is greatly reduced, and the gap can be pulled back. Purely from the storage itself, SPL has no obvious advantages or disadvantages compared to CH.
Next, looking at the comparison test of conventional TopN, the SQL of CH is
: SQL2:
SELECT * FROM test.t ORDER BY amount DESC LIMIT 100
The comparative test result is as follows:
Looking at CH’s SQL2 alone, the calculation method of conventional TopN is to take out the first N pieces of data after all sorting. When the amount of data is large, if you really do full sorting, the performance will be very poor. The SQL2 test results show that CH should be optimized like SPL, without full sorting, so both perform quickly and SPL is slightly faster.
In other words, whether simple or complex, esProc SPL is superior.
As mentioned earlier, CH and ORA use the SQL language and are based on relational models, so they both face the problem of SQL optimization. TPC-H tests have proved that CH cannot be optimized or even run out of results in some scenarios that ORA can optimize. Then, if faced with some calculations that ORA will not optimize, CH will not optimize. For example, if we change the simple group summary of SQL1 to two groups of summary results and then join, the SQL of CH is written as follows:
SELECT *
FROM (
SELECT mod(id, 100) AS Aid, max(amount) AS Amax
FROM test.t
GROUP BY mod(id, 100)
) A
JOIN (
SELECT floor(id / 200000) AS Bid, min(amount) AS Bmin
FROM test.t
GROUP BY floor(id / 200000)
) B
ON A.Aid = B.Bid
In this case, the result of the comparison test is that the calculation time of CH is doubled, and the SPL is unchanged:
This is because SPL not only uses columnar cursors, but also traversal multiplexing mechanisms, which can calculate multiple grouping results in a single traversal process, which can reduce many hard disk accesses. The SQL used by CH cannot write such operations, and can only rely on CH’s own optimization capabilities. The CH algorithm has poor optimization ability, and its optimization engine did not work in this test, and can only traverse twice, so the performance is doubled.
The code for SPL to implement traversal reuse is simple, roughly like this:
| A
| |
B
1 |
=file(“topn.ctx”).open().cursor@mv(id,amount) |
2 |
cursor A1 |
=A2.groups(id%100:Aid; max(amount):Amax) |
3 |
cursor |
=A3.groups(id\200000:Bid; min(amount):Bmin) |
4 |
=A2.join@i(Aid,A3:Bid,Bid,Bmin) |
Then adjust the SQL2 regular TopN calculation to find the TopN in the group after grouping. The corresponding SQL is: SQL4:
SELECT
gid,
groupArray(100)(amount) AS amount
FROM
(
SELECT
mod(id, 10) AS gid,
amount
FROM test.topn
ORDER BY
gid ASC,
amount DESC
) AS a
GROUP BY gid
The comparison result of this grouped TopN test is as follows:

CH does the grouping TopN calculation is 42 times slower than the regular TopN, indicating that CH is likely to do the sorting action in this case. That is, after complicating the situation, CH’s optimization engine does not work again. Unlike SQL, SPL regards TopN as an aggregation operation, which is the same calculation logic as sum, count, etc., and only needs to traverse the original data once. In this way, the group to find the TopN in the group is the same as the group sum and count, which can avoid sorting calculation. As a result, SPL computes packet TopN 22 times faster than CH.
Also, the SPL code for calculating the grouping TopN is not complicated:
| |
A
1 |
=file(“topn.ctx”).open().cursor@mv(id,amount) |
2 |
=A1.groups(id%10:gid; top(10; -amount)).news(#2; gid, ~.amount) |
Let’s take a look at the common funnel operations in e-commerce systems. The SPL code is still concise:
| A
| B
| |
1
=[“etype1″,”etype2″,”etype3”] |
=file(“event.ctx”).open() |
2 |
=B1.cursor(id,etime,etype; etime>=date(“2021-01-10”) && etime |
3 |
=A2.group(id).( ~.sort(etime)) |
=A3.new(~.select@1(etype==A1(1)):first,~:all).select(first) |
4 |
=B3.( A1. (t=if(#==1,t1=first.etime,if(t,all.select@1(etype==A1.~ && etime>t && etime |
5 |
=A4.groups(; count(~(1)):STEP1,count(~(2)):STEP2,count(~(3)):STEP3)CH’s |
SQL
cannot achieve such calculations, let’s take ORA as an example to see the SQL writing of the three-step funnel:
with e1 as (
select gid, 1 as step1,min(etime) as t1
from T
where etime>= to_date( '2021-01-10', 'yyyy-MM-dd') and etime <to_date('2021-01-25', 'yyyy-MM-dd')
and eventtype='eventtype1' and ...
group by 1
),
with e2 as (
select gid,1 as step2,min(e1.t1) as t1,min( e2.etime) as t2
from T as e2
inner join e1 on e2.gid = e1.gid
where e2.etime>= to_date('2021-01-10', 'yyyy-MM-dd') and e2.etime<to_date(' 2021-01-25', 'yyyy-MM-dd')
and e2.etime > t1
and e2.etime < t1 + 7
and eventtype='eventtype2' and ...
group by 1
),
with e3 as (
select gid,1 as step3,min(e2.t1) as t1,min( e3.etime) as t3
from T as e3
inner join e2 on e3.gid = e2.gid
where e3.etime>= to_date('2021-01-10', 'yyyy-MM-dd') and e3.etime<to_date(' 2021-01-25', 'yyyy-MM-dd')
and e3.etime > t2
and e3.etime < t1 + 7
and eventtype='eventtype3' and ...
group by 1
)
select
sum(step1) as step1,
sum(step2) as step2,
sum(step3) as step3
from
e1
left join e2 on e1.gid = e2.gid
left join e3 on e2.gid = e3.gid
ORA’s SQL takes more than thirty lines to write, which is quite difficult to understand. And this code is related to the number of steps in the funnel, and each additional step adds another subquery. In contrast, SPL is much simpler, and it is this code that handles any number of steps.
This kind of complex SQL is very laborious to write, and performance optimization is impossible to talk about.
CH’s SQL is far inferior to ORA, basically can not write such complex logic, can only write C++ code external. That is, only CH’s storage engine can be leveraged in this case. Although it is possible to achieve good performance in external computing in C++, it is very expensive to develop. There are many similar examples, and CH cannot be directly implemented.
To sum up: CH calculations for some simple scenarios (such as single-table traversal) are indeed fast, and the performance of SPL is similar. However, high-performance computing can not only look at simple situations but also weigh various scenarios. For complex operations, SPL not only performs much better than CH, but also makes code much simpler. SPL can cover all scenarios of high-performance data computing, which can be said to beat CH.
heavy! The open source SPL exchange group has established
a simple and easy-to-use SPL open source!
In order to provide a platform for interested technicians to communicate with each other
, a
communication group has been specially opened (the group is completely free, no advertising and no classes).
Friends who need to join the group, you can long press to scan the QR code below
Friends who are interested in this article, Please go to the original article to bookmark ^_^