This article will introduce the best practices of the open source big data OLAP engine in six parts.
Today’s open source data engines are diverse, and different kinds of engines meet our different needs. There are now three main types of ROLAP computing and storage data warehouses, namely StarRocks (DorisDB), ClickHouse and Apache Doris. The most widely used data query systems are mainly Druid, Kylin and HBase. The MPP engines are mainly Trino, PrestoDB and Impala. These engines have a wide range of applications in the industry.
1. Overview of open source OLAP 2. Open source data warehouse solution 3. ClickHouse introduction 4. StarRocks introduction 5. Trino introduction
6. Customer cases
Open source data warehouse solution
Next, let’s talk about open source big data and data warehouse solutions. The figure above shows the overall architecture of EMR, and at the cloud resource layer, there are mainly ECS. JindoFS provides an OSS-based Hadoop interface at the storage layer, which not only saves costs, but also improves overall scalability. The data lake format effectively solves the problem of unified data management. Secondly, in terms of computing engine, it has capabilities such as batch processing, streaming computing, machine learning, and engine acceleration.
At present, the most widely used offline data warehouse system is the Lambda architecture. The architecture is divided into two main parts.
In the first part, in real-time, we start with the CDC, ORTP data source, do behavioral data analysis, and then process it through Kafka, Flink. The data online system can directly call APIs to improve the efficiency of point checks. Second, when all aggregated numbers are imported into the Olap system, operations personnel can quickly use it to realize their own new ideas and improve work efficiency.
The second part is that when it comes to storing data for a long time offline, everyone uses hive. If there is no incremental database format, we generally do some data mart on detail by inserting overwrite. In addition, we realize real-time data correction of offline data warehouses through offline T+1 mode. Because real-time data generally gives approximate values, offline data gives accurate values.
The third part, the solution of real-time data lake, has a data volume at the petabyte+ level. We want to unify offline and real-time data warehouses and build our business with one set of code. The data of the data lake is stored in OSS/HDFS, and since some of our businesses have Upsert change requirements, we hope to build a data warehouse from minutes to hours. The hottest data can be imported into StarRocks/CK, and the query time of OLAP is guaranteed to be between 500 milliseconds and 2 seconds. At the same time, when we use Presto to query Hudi/Iceberg/Delta, the rate is guaranteed to be between 5 and 30 seconds.
The figure above is a more traditional real-time data warehouse solution. When the daily incremental data reaches 10 TB+, we hope to directly build a service foundation with a single software, so that the data can be stored in CK/StarRocks first, and the cold data can be transferred to OSS. There is no need to operate and maintain the huge system of Hadoop, which greatly simplifies operation and maintenance operations, which can be comparable to full hosting.
In the second real-time data warehouse solution, we use the micro-batch task scheduler to handle DWS, DWD and ODS. Its real-time performance is very strong, which greatly simplifies development efficiency and has the highest data consistency. In the future, we will launch a storage and computing separation solution, use OSS to store massive data, and use Cache to accelerate hot data.
Introduction to ClickHouse
ClickHouse is an open-source analytics engine for online analytical processing (OLAP). Originally developed by Yandex, Russia’s first search engine, it was open sourced in 2016 in C++. Due to its excellent query performance, petabyte-level data scale, and simple architecture, it is widely used in domestic and foreign companies.
It is a columnar database with complete DBMS functions, backup columnar storage and data compression. Its MPP architecture is easily scalable and easy to maintain. In addition, it supports vectorized queries, perfect SQL and real-time data updates, and the query speed can reach sub-second responses.
So why is ClickHouse so fast? It is similar to LSM tree, all data is arranged in an orderly manner, aggregated in advance, and then stored. And its data storage format comes with its own index.
Second, ClickHouse can create indexes based on multiple keys. Its secondary index uses the Data skipping index.
There are four main application scenarios of ClickHouse.
First, user behavior analysis. ClickHouse makes the user behavior analysis table into a large wide table, reduces the form of join, and realizes functions such as path analysis, funnel analysis, and path conversion. In addition to this, it supports advertising, marketing and AB experimentation.
Second, real-time BI reports. ClickHouse can produce timely output in real time according to business needs, query flexible BI reports, including order analysis, marketing effect analysis, promotion activity analysis and so on.
Third, monitoring. ClickHouse can write system and application monitoring metrics to ClickHouse in real time after cleaning and processing through the streaming computing engine Flink and Spark streaming. Visualize in conjunction with Grafna.
Fourth, user portraits. ClickHouse can process data on various user characteristics to make one or more user feature tables containing all users, provide flexible user portrait analysis, support advertising, circle people and other business needs.
Next, let’s talk about the EMR ClickHouse architecture. We’ve made certain enhancements on top of ClickHouse. First, we refactored the In Memory Part write module to support Flink single writes, Flink Exactly Once transaction writes, and Sharding Key writes. Successfully solved the pain points of writing distributed tables and improved the overall performance. Second, it also supports DiskOSS. Realize hot and cold tiered storage, saving costs. Finally, we have expanded replica capacity and sharding, making the expansion method more flexible.
Introduction to StarRocks
Next, let’s talk about StarRocks. StarRocks’ vectorized execution engine implements sub-second query latency. StarRocks’ 100M/s write speed per node allows it to process 10 billion rows of data per second. StarRocks’ comprehensive query speed is 10 to 100 times faster than other products. Data is updated in real time in seconds. Secondly, StarRocks supports simultaneous analysis by thousands of users, and some scenarios can support more than 10,000 QPS per second, and TP99 is controlled within 1 second. Finally, StarRocks is based on multiple data models to enable blazing-fast analysis and reduce business delivery time. Increased productivity for data engineers and analysts.
As shown in the figure above, StarRocks’ architecture is concise and straightforward, compatible with the MySQL protocol, and can use a variety of MySQL clients. And support FE, BE horizontal expansion, so as to achieve automatic balancing. It is very convenient to operate and use.
StarRocks’ blazing fast engine for full vectorized execution. It can be stored per column, calculated per column. With fewer virtual function calls, fewer branch judgments, better use of SIMD instructions and more CPU cache friendly. Secondly, the effect of StarRocks vector improvement is obvious. The effects of Vectorized Filter, Vectorized Aggregation, and Vectorized Shuffle Join all have geometric multiples improved.
StarRocks’ blazing speed engine with an all-new CBO. Based on the Orca paper, expressions are rewritten and reused. Use common predicate extraction and predicate derivation. Rewrite the subquery, adjust the join order, and let the join algorithm automatically select. Successfully convert SQL statements into an executable plan.
StarRocks’ blazing fast engine with multiple distributed joins. At present, this distributed join is a feature that ClickHouse lacks. The figure on the right is a more efficient join method, which makes the overall operation more efficient by completing the bucket classification in advance.
StarRocks provides four data models for all scenarios.
First, the detail model. Used to save and analyze raw detail data, with almost no updates after data is written. It is mainly used for logs, operation records, device status sampling, etc.
Second, aggregate models. Used to save, analyze, summarize data. You do not need to query the detail data. Data is aggregated in real time after import, and there are almost no updates after data is written. Applies to data aggregated by time, region, and organization.
Third, the primary key model. Supports primary key-based updates, delete and insert, and ensure high-performance queries when importing in large batches. Used to save and analyze data that needs to be updated.
Fourth, update the model. Supports primary key-based updates, Merge On Read, which are updated more frequently than the primary key model. Used to save and analyze data that needs to be updated. Both the primary key model and the update model are suitable for orders, device status, etc. whose status will change.
StarRocks also implements high concurrent queries in all scenarios. StarRocks’ partitioning mechanism can efficiently filter and improve query performance. StarRocks’ bucket splitting mechanism gives full play to the performance of the cluster and successfully avoids hot spots. However, StarRocks still has a certain gap compared to other OLAP engines and OLTP engines with row storage.
In the LakeHouse scenario, StarRocks’ federated query not only masks the details of the underlying data source, but also enables joint analysis of heterogeneous data based on source data, which is perfectly combined with the incremental data lake format. In order to improve query speed, StarRocks optimizes each data source. Enhanced vectorization parsing ORC, Parquet format, dictionary filtering, delayed materialization and other capabilities.
In addition to the ultimate engine performance and all-scenario optimization capabilities, StarRocks also implements auto scaling and supports online expansion, making O&M simple. In the face of traffic growth, users can not only scale on demand, but also save costs. StarRocks also supports the gradual expansion of small-scale initial clusters, greatly saving O&M costs.
As shown in the figure above, EMR’s data lake architecture uses OSS and HDFS as the storage layer of the data lake. On the basis of the storage layer, storage optimizers are carefully installed, mainly JindoFS and ALLUXIO series. In terms of storage formats, EMR’s data lake supports formats such as Hudi, Iceberg, and ORC. At the compute layer, it supports multiple computations such as Flink, SPARK, Trino, and Hive.
Next, let’s look at the characteristics of the EMR Trino. First of all, in terms of stabilization, EMR Trino supports the built-in Coordinator HA Hull Worker Label function. Because EMR Trino integrates EMR auto scaling capabilities and supports Trino on K8s product form, it greatly saves O&M costs. In terms of ecology, EMR Trino not only supports cloud ecosystems such as Iceberg, Hudi, and Delta Connector, but also supports optimized connectors such as ClickHouse and Hive. In terms of performance, EMR Trino is optimized for formats such as Parquet/Orc. And use the cache layer of JindoFS to accelerate data lake queries. Query efficiency is greatly improved.
Finally, let’s talk about a few customer cases. As shown above, this is an online education customer. It has billions of data pieces per day, and there are also needs for order data changes, feature crowd selection, and machine learning training. The original solution had untimely data processing, could not cope with the Upsert scenario, and the zipper table was clumsy and resource-intensive. After the transformation, it perfectly supports Upsert scenarios, Presto can query detailed data, CK’s wide table number can also be queried by Ad-hoc, and CK’s materialized view can be queried by BI system.
The figure above is the architecture diagram of the social domain customer. It has a data scale of 5TB per day, and needs to support real-time large screens, business system checks, and random queries by business personnel. Before the transformation, Hive was a minute-level digital warehouse, which faced the pain points of inexhaustibility, undetectability, and complex system operation and maintenance. We fall into CK and Ad-hoc queries and detail tables into StarRocks, realizing complex Ad-hoc queries, report analysis, and materialized view point lookup capabilities. Make data warehouse operations simple and efficient.
The picture above is a customer in the e-commerce field, which relies on OLTP systems for a large number of businesses, and has the need to upgrade GMV, orders, logistics, customer analysis, recommendation systems, etc. The original Hadoop data warehouse and offline T+1 analysis system make the operation and maintenance of the entire system complex and the cost remains high. We gradually transitioned the OLTP system to the OLAP system, replacing the original data warehouse structure, making the link extremely simplified, making Ad-hoc query flexible, and facilitating O&M personnel to analyze detailed data and check the system on the wiring. While simplifying the system, it improves the work efficiency of O&M personnel and greatly reduces O&M costs.
the tag is not clickable during preview