ClickHouse vs Oracle


esProc SPL debuts

The **columnar cursor** mechanism is provided in the enterprise edition of SPL. Let's compare and test: on 800 million rows of data, run the simplest grouped aggregation and compare the performance of SPL (using columnar cursors) against CH. (The machine configuration used here is slightly lower than in the previous TPC-H tests, so the absolute results differ, but it is the relative values that matter.)

SQL1:

```sql
SELECT mod(id, 100) AS Aid, max(amount) AS Amax
FROM test.t
GROUP BY mod(id, 100)
```

SQL2:

```sql
SELECT * FROM test.t ORDER BY amount DESC LIMIT 100
```
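SQL2 is a plain top-100 query. A TopN query like this does not require a full sort: a bounded heap maintained during a single traversal is enough, which is exactly the optimization discussed next. A minimal Python sketch of the idea, using hypothetical in-memory data in place of `test.t`:

```python
import heapq
import random

# Hypothetical stand-in for test.t: (id, amount) rows.
rows = [(i, random.uniform(0, 10000)) for i in range(1_000_000)]

# Equivalent of SQL2: the 100 rows with the largest amount, without a
# full sort. heapq.nlargest streams the rows through a 100-element heap,
# so the cost is O(n log 100) instead of O(n log n).
top100 = heapq.nlargest(100, rows, key=lambda r: r[1])
```

The same single-pass, bounded-memory pattern is what makes TopN cheap even on hundreds of millions of rows.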

The comparative test result is as follows. Looking at CH's SQL2 alone: the conventional way to compute TopN is to sort all the data and then take the first N rows. When the data volume is large, a genuine full sort performs very poorly. The SQL2 results show that CH, like SPL, is optimized to skip the full sort, so both run fast, with SPL slightly faster. In other words, whether the query is simple or complex, esProc SPL comes out ahead.

The disparities do not stop there. As mentioned earlier, CH and ORA both use SQL and are based on the relational model, so both face the problem of SQL optimization. The TPC-H tests proved that in some scenarios that ORA can optimize, CH cannot, and sometimes cannot even finish. It follows that for calculations that even ORA will not optimize, CH will not optimize them either. For example, if we change the simple grouped aggregation of SQL1 into two grouped aggregations whose results are then joined, the CH SQL is written as follows:

SQL3:

```sql
SELECT *
FROM (
    SELECT mod(id, 100) AS Aid, max(amount) AS Amax
    FROM test.t
    GROUP BY mod(id, 100)
) A
JOIN (
    SELECT floor(id / 200000) AS Bid, min(amount) AS Bmin
    FROM test.t
    GROUP BY floor(id / 200000)
) B
ON A.Aid = B.Bid
```

In this case, the comparison test shows that CH's computation time doubles while SPL's stays unchanged:

This is because SPL uses not only columnar cursors but also a **traversal multiplexing** mechanism, which computes multiple grouping results in a single traversal and thus saves many hard disk accesses. SQL cannot express such an operation, so CH can only rely on its own optimizer; that optimizer did not work in this test, CH traverses the data twice, and its running time doubles. The SPL code that implements traversal reuse is simple, roughly like this:

| | A | B |
|---|---|---|
| 1 | =file("topn.ctx").open().cursor@mv(id,amount) | |
| 2 | cursor A1 | =A2.groups(id%100:Aid; max(amount):Amax) |
| 3 | cursor | =A3.groups(id\200000:Bid; min(amount):Bmin) |
| 4 | =A2.join@i(Aid,A3:Bid,Bid,Bmin) | |
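The idea behind traversal multiplexing can be illustrated outside SPL: accumulate both grouped aggregates during one pass over the data instead of scanning it twice. A minimal Python sketch with hypothetical in-memory data:

```python
import random

# Hypothetical stand-in for the t table: (id, amount) rows.
rows = [(i, random.uniform(0, 10000)) for i in range(200_000)]

# One traversal feeds both aggregations, mirroring SPL's traversal reuse:
#   A: max(amount) grouped by id % 100
#   B: min(amount) grouped by id // 200000
a_max, b_min = {}, {}
for id_, amount in rows:
    aid = id_ % 100
    bid = id_ // 200000
    a_max[aid] = max(a_max.get(aid, float("-inf")), amount)
    b_min[bid] = min(b_min.get(bid, float("inf")), amount)

# Join the two result sets on Aid == Bid (the equivalent of SQL3's JOIN).
joined = {k: (a_max[k], b_min[k]) for k in a_max.keys() & b_min.keys()}
```

The data is read once; only the small per-group accumulators live in memory, which is why the second aggregation adds almost no I/O cost.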

Now adjust the regular TopN of SQL2 into an in-group TopN: after grouping, find the top N within each group. The corresponding SQL is:

SQL4:

```sql
SELECT
    gid,
    groupArray(100)(amount) AS amount
FROM
(
    SELECT
        mod(id, 10) AS gid,
        amount
    FROM test.topn
    ORDER BY
        gid ASC,
        amount DESC
) AS a
GROUP BY gid
```

The comparison result of this grouped TopN test is as follows. The SPL code is:

| | A |
|---|---|
| 1 | =file("topn.ctx").open().cursor@mv(id,amount) |
| 2 | =A1.groups(id%10:gid; top(10; -amount)).news(#2; gid, ~.amount) |
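The grouped TopN that A2 expresses can be mimicked with one bounded heap per group, again in a single pass and with no global sort. A sketch with hypothetical data:

```python
import heapq
import random
from collections import defaultdict

# Hypothetical stand-in for the topn table: (id, amount) rows.
rows = [(i, random.uniform(0, 10000)) for i in range(100_000)]

# Top 10 amounts per group (gid = id % 10): each group keeps a 10-element
# min-heap, mirroring SPL's top(10; -amount) aggregate.
heaps = defaultdict(list)
for id_, amount in rows:
    h = heaps[id_ % 10]
    if len(h) < 10:
        heapq.heappush(h, amount)
    elif amount > h[0]:
        heapq.heapreplace(h, amount)  # evict the smallest of the kept 10

# Present each group's top 10 in descending order.
top_per_group = {gid: sorted(h, reverse=True) for gid, h in heaps.items()}
```

Treating TopN as an aggregate value per group, rather than as a sort, is what keeps this a single cheap traversal.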

SPL is not just fast. Take the three-step funnel analysis as an example, a typical multi-step, order-related computation; the SPL code is:

| | A | B |
|---|---|---|
| 1 | =["etype1","etype2","etype3"] | =file("event.ctx").open() |
| 2 | =B1.cursor(id,etime,etype; etime>=date("2021-01-10") && etime<date("2021-01-25") && A1.contain(etype)) | |
| 3 | =A2.group(id).(~.sort(etime)) | =A3.new(~.select@1(etype==A1(1)):first,~:all).select(first) |
| 4 | =B3.(A1.(t=if(#==1,t1=first.etime,if(t,all.select@1(etype==A1.~ && etime>t && etime<t1+7).etime,null)))) | |
| 5 | =A4.groups(; count(~(1)):STEP1,count(~(2)):STEP2,count(~(3)):STEP3) | |

CH's SQL cannot accomplish such a calculation, so let's take ORA as an example to see how the three-step funnel is written in SQL:

```sql
with e1 as (
    select gid, 1 as step1, min(etime) as t1
    from T
    where etime >= to_date('2021-01-10', 'yyyy-MM-dd')
      and etime < to_date('2021-01-25', 'yyyy-MM-dd')
      and eventtype = 'eventtype1' and ...
    group by gid
),
e2 as (
    select e2.gid, 1 as step2, min(e1.t1) as t1, min(e2.etime) as t2
    from T e2
    inner join e1 on e2.gid = e1.gid
    where e2.etime >= to_date('2021-01-10', 'yyyy-MM-dd')
      and e2.etime < to_date('2021-01-25', 'yyyy-MM-dd')
      and e2.etime > e1.t1
      and e2.etime < e1.t1 + 7
      and e2.eventtype = 'eventtype2' and ...
    group by e2.gid
),
e3 as (
    select e3.gid, 1 as step3, min(e2.t1) as t1, min(e3.etime) as t3
    from T e3
    inner join e2 on e3.gid = e2.gid
    where e3.etime >= to_date('2021-01-10', 'yyyy-MM-dd')
      and e3.etime < to_date('2021-01-25', 'yyyy-MM-dd')
      and e3.etime > e2.t2
      and e3.etime < e2.t1 + 7
      and e3.eventtype = 'eventtype3' and ...
    group by e3.gid
)
select
    sum(step1) as step1,
    sum(step2) as step2,
    sum(step3) as step3
from e1
left join e2 on e1.gid = e2.gid
left join e3 on e2.gid = e3.gid
```
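The funnel logic that both versions implement, per user: find the first step-1 event, then require each later step to occur strictly after the previous step and within 7 days of step 1. It can be sketched in ordinary Python; the sample events and field layout below are hypothetical:

```python
from datetime import datetime, timedelta

STEPS = ["etype1", "etype2", "etype3"]
WINDOW = timedelta(days=7)

# Hypothetical event log: (user id, event time, event type).
events = [
    (1, datetime(2021, 1, 11, 9), "etype1"),
    (1, datetime(2021, 1, 12, 9), "etype2"),
    (1, datetime(2021, 1, 13, 9), "etype3"),
    (2, datetime(2021, 1, 11, 9), "etype1"),
    (2, datetime(2021, 1, 20, 9), "etype2"),  # 9 days after step 1: outside the window
]

def funnel(events):
    # Group each user's events, already ordered by time within the user.
    by_user = {}
    for uid, etime, etype in sorted(events, key=lambda e: (e[0], e[1])):
        by_user.setdefault(uid, []).append((etime, etype))

    counts = [0] * len(STEPS)
    for seq in by_user.values():
        t1 = next((t for t, e in seq if e == STEPS[0]), None)
        if t1 is None:
            continue  # user never reached step 1
        counts[0] += 1
        prev = t1
        for i, step in enumerate(STEPS[1:], start=1):
            nxt = next((t for t, e in seq
                        if e == step and prev < t < t1 + WINDOW), None)
            if nxt is None:
                break  # funnel stops at the previous step
            counts[i] += 1
            prev = nxt
    return counts

print(funnel(events))  # → [2, 1, 1]: user 1 completes all steps, user 2 only step 1
```

The whole computation is one grouped, ordered scan per user, which is why it maps naturally onto SPL's cursor-and-group style and only awkwardly onto nested SQL self-joins.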

To sum up: for some simple scenarios (such as single-table traversal), CH is indeed fast, and SPL performs comparably. But high-performance computing cannot be judged on simple cases alone; it has to weigh all kinds of scenarios. For complex operations, SPL not only performs far better than CH, its code is also much simpler. SPL covers the full range of high-performance data computing scenarios and can be said to outperform CH across the board.
