Founded in 2007, Leap Express Group Co., Ltd. currently has more than 3,000 service outlets covering more than 500 cities, and is a unicorn in China’s logistics service industry. The Group’s big data center is responsible for building and maintaining all of the Group’s data platform components, supporting more than 20 core business lines and serving more than 50,000 employees. To date, the big data center has built more than 10,000 data query interfaces, handling more than 10 million calls per day with a TP99 below 1 second. We use DorisDB as our general-purpose query engine, which effectively solves the problem that many queries in the original architecture took too long to return and fell short of performance expectations.

Author: Zhang Jie


Big data O&M architect at Leap Express Group, responsible for the maintenance and construction of the Group’s big data platform

Business background

Overall architecture

The overall architecture of our original offline data warehouse is shown in the figure below. Data from the databases of the various business lines (MySQL and others) is aggregated into the ETL cluster (a Hadoop cluster) through data integration tools. Batch engines such as Hive, Spark, and Presto then process the data through the warehouse layers, and the DW-layer and ADS-layer data is pushed to the various query engines.

On top of these query engines sits a unified query API gateway. The application layer, such as self-service analysis tools or the ERP system front end, calls this gateway to present data to users.

Business pain points

The biggest pain point of this architecture was query performance. The company measures the response latency of its big data query interfaces and expects 99% of query requests to return within 1 second. In the ERP system pages and the mobile reporting app, users check data at any time to adjust production processes, so slow query responses hurt the user experience and can even disrupt business operations. For complex SQL queries, the previous systems (Presto, Impala+Kudu, and ClickHouse) fell far short of expectations. Moreover, supporting the various complex analysis scenarios meant introducing many different components, which made maintenance and usage costs very high.

We therefore urgently needed a new engine that could unify querying, solve the query performance problems, and reduce usage and maintenance costs.

OLAP engine selection

In the first phase, starting in 2019, the Group’s big data center used Presto as its general-purpose query engine. At that stage the data warehouse layers were built almost entirely on Hive, and Presto’s ability to query Hive directly allowed us to expose query APIs without much transformation work. For performance, we also copied some of the warehouse data to an independent Presto cluster, isolating its resources from the warehouse ETL cluster. After running this architecture for more than a year, the performance of the Presto-based cluster deteriorated dramatically as business needs grew more complex and data volumes increased.

In the second phase, to address Presto’s insufficient performance, we began building a new general-purpose query engine on ClickHouse. In 2020 we used ClickHouse to build a large number of wide tables and gradually migrated queries that previously required multi-layer joins to the ClickHouse cluster. This genuinely solved the performance problems we had faced. At the same time, however, we had to build more and more wide tables, which was cumbersome and hard to operate, and this data model could not change quickly as business needs changed, so flexibility was poor.

In the third phase, in 2021, we started looking for other OLAP engines that could meet our needs, and that is when we discovered DorisDB. First, DorisDB’s performance on both single-table and multi-table join queries is excellent, which meets our query latency requirements. DorisDB also supports the MySQL protocol, so our developers can learn and use it with a very low barrier to entry. In addition, DorisDB supports updates by primary key, supports several types of external tables, is simple to deploy and operate, and offers rich data import methods. These were exactly what we needed.
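Because DorisDB speaks the MySQL protocol, developers can connect with any ordinary MySQL client (for example `mysql -h <fe_host> -P 9030 -u <user>`, 9030 being the FE’s default MySQL-protocol port) and issue standard SQL. A minimal sketch of the kind of multi-table join query involved, using hypothetical table and column names rather than our actual schema:

```sql
-- Hypothetical schema: fact_waybill (fact table) joined to dim_branch
-- (dimension table). Issued through any MySQL-compatible client or driver.
SELECT d.region,
       COUNT(*)              AS waybill_cnt,
       SUM(f.freight_amount) AS total_freight
FROM fact_waybill AS f
JOIN dim_branch   AS d ON f.branch_id = d.branch_id
WHERE f.dt = '2021-09-01'
GROUP BY d.region
ORDER BY total_freight DESC
LIMIT 10;
```

Since the protocol is MySQL-compatible, existing JDBC/ODBC-based BI tools can point at DorisDB without code changes.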

Therefore, we began to gradually migrate the previous analysis business to the DorisDB cluster, using DorisDB as the general-purpose query engine of the big data center.

How DorisDB is used at Leap Express Group

Online scenario applications

At present, our online data interfaces handle more than 10 million query requests per day. Before introducing DorisDB, we used eight or nine query engines to support the various online business scenarios: ElasticSearch for detail queries over large data volumes; MySQL for reporting scenarios with fixed dimensions that can be pre-computed; Presto for complex SQL with multi-table joins and nested subqueries; and the Impala + Kudu combination for real-time update scenarios.

After introducing DorisDB, the scenarios supported by Presto and Impala+Kudu have been replaced. ElasticSearch, MySQL, and ClickHouse may also be gradually replaced by DorisDB, depending on the actual business scenarios.

The following is a typical online case. As shown above, we had a wide-table aggregation query over 200 fields on the original Presto system; because of complex business requirements, the SQL statement ran to more than 600 lines. We had hoped to optimize the business logic, but that proved difficult, and we could not simply ask the business side to work around a limitation of the system. We replaced the original 10-node Presto cluster with a 10-node DorisDB cluster of the same configuration. With no business-logic changes, using DorisDB’s detail model and relying on DorisDB’s own performance, query latency dropped from 5.7 seconds to 1 second, nearly six times the performance of the original Presto cluster.

OLAP scenario application

The Group’s OLAP multi-dimensional analysis platform is a BI system we developed in-house. Users select fields and join conditions for their business scenario and generate tables or charts of the data by drag and drop. The first back-end engine supporting the platform was Presto, whose performance in such scenarios was far from satisfactory; because of those performance issues we could not promote the tool to more users. After we replaced the back-end query engine with DorisDB, the performance improvement was immediately noticeable. We rolled the platform out to the entire group, and it has been well received by a growing number of users.

OLAP multi-dimensional analysis is mainly offline. Taking the customer offline-analysis scenario as an example: after ETL processing, the corresponding DW-layer or ADS-layer data is generated and then imported into DorisDB daily through Broker Load. We use a star schema to model the customer subject area, creating both the customer master table and the dimension tables in DorisDB with the detail model. Users can then drag and drop the metrics and dimensions of the customer subject area on the front end to generate the corresponding tables and charts.
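A rough sketch of how such a setup looks in DorisDB (table names, columns, HDFS paths, and the broker name here are illustrative placeholders, not our production DDL): a detail-model table is declared with DUPLICATE KEY, and a daily Broker Load job pulls the ADS-layer output from HDFS.

```sql
-- Detail-model customer master table (illustrative columns).
CREATE TABLE dw_customer (
    customer_id BIGINT,
    dt          DATE,
    city        VARCHAR(64),
    order_cnt   INT,
    gmv         DECIMAL(18, 2)
)
DUPLICATE KEY(customer_id, dt)               -- detail model: rows kept as-is
DISTRIBUTED BY HASH(customer_id) BUCKETS 32
PROPERTIES ("replication_num" = "3");

-- Daily Broker Load from the HDFS output of the ETL job.
LOAD LABEL demo_db.customer_20210901
(
    DATA INFILE("hdfs://nameservice1/warehouse/ads_customer/dt=2021-09-01/*")
    INTO TABLE dw_customer
    COLUMNS TERMINATED BY "\t"
)
WITH BROKER "hdfs_broker"
(
    "username" = "hdfs_user",
    "password" = ""
)
PROPERTIES ("timeout" = "3600");
```

Each day’s load uses a new label, so a failed or duplicated job can be retried without importing the same partition twice.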

In the customer offline-analysis scenario, with no business-logic changes before and after DorisDB went live, TP99 dropped from 4.5 seconds to 1.7 seconds, nearly three times the original performance (we will try enabling the CBO optimizer next, which should bring a further improvement). Most queries now return within 1 second, which greatly improves the user experience.

Using DorisDB’s real-time analytics capabilities, we also built real-time OLAP multi-dimensional analysis. Take waybill real-time analysis as an example. Originally we ran Hive batches every two hours, computing fixed-dimension results and writing them to Presto for querying; the logic resembled the offline warehouse and could hardly be called real time. After introducing DorisDB, we adjusted the data pipeline: database changes are captured by listening to the Binlog and written to Kafka, and a Routine Load job consumes Kafka to write data into DorisDB in real time. We use the update model for the real-time waybill master table, with the waybill ID as the primary key, so every change to a waybill is reflected in the master table in real time. As in the offline scenario, the waybill subject area is modeled as a star schema.
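A minimal sketch of the DorisDB side of that pipeline, again with illustrative names: the waybill master table uses UNIQUE KEY (the update model), so a newly loaded row with the same waybill ID replaces the old one, and a Routine Load job continuously consumes the Binlog topic from Kafka.

```sql
-- Update-model waybill master table: a row with the same waybill_id
-- overwrites the previous version on load.
CREATE TABLE waybill_rt (
    waybill_id     BIGINT,
    waybill_status VARCHAR(32),
    branch_id      BIGINT,
    update_time    DATETIME
)
UNIQUE KEY(waybill_id)
DISTRIBUTED BY HASH(waybill_id) BUCKETS 32;

-- Routine Load job: continuously consume the Kafka topic fed by the Binlog.
CREATE ROUTINE LOAD demo_db.waybill_job ON waybill_rt
COLUMNS TERMINATED BY ",",
COLUMNS (waybill_id, waybill_status, branch_id, update_time)
PROPERTIES (
    "desired_concurrent_number" = "3",
    "max_error_number"          = "1000"
)
FROM KAFKA (
    "kafka_broker_list" = "kafka1:9092,kafka2:9092",
    "kafka_topic"       = "ods_waybill_binlog"
);
```

With this shape, queries against waybill_rt always see the latest version of each waybill without any batch merge step.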

With this adjustment, the waybill subject area, which used to refresh every two hours, is now updated within seconds and has become genuinely real-time analysis. Previously we also had to rely on pre-computation with fixed dimensions, which limited many analysis functions; beyond the real-time experience, the gain in analytical flexibility is therefore also significant. Real-time data and flexible analysis have become the biggest highlights of the OLAP multi-dimensional analysis platform in production.

Follow-up planning

1. To prevent slow queries from affecting overall cluster performance, we will build multiple DorisDB clusters and physically isolate resources by business scenario.

2. Internal testing shows that DorisDB querying Hive external tables outperforms Presto querying Hive, so the existing Presto-on-Hive scenarios will be seamlessly migrated to DorisDB.

3. We currently write a large amount of real-time data into DorisDB that needs aggregation and other processing; we are trying to use scheduling tools to run lightweight ETL jobs on DorisDB every 5 or 10 minutes.

4. Enable DorisDB’s CBO optimizer to further improve query performance.
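For the Hive migration in item 2, a Hive external table in DorisDB can be sketched roughly as follows (database, table, columns, and the metastore address are placeholders): once the external table is declared, the same SQL that previously ran on Presto can be run against DorisDB, and no data is copied.

```sql
-- External table mapped onto an existing Hive table; DorisDB reads the
-- Hive data in place through the metastore and HDFS.
CREATE EXTERNAL TABLE hive_dw_orders (
    order_id BIGINT,
    amount   DECIMAL(18, 2),
    dt       DATE
)
ENGINE = HIVE
PROPERTIES (
    "database"            = "dw",
    "table"               = "orders",
    "hive.metastore.uris" = "thrift://metastore-host:9083"
);

-- Queries then read Hive data directly through DorisDB:
SELECT dt, SUM(amount) FROM hive_dw_orders GROUP BY dt;
```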

Finally, I would like to thank Keystone for providing such a good product as DorisDB, which meets our requirements for a high-performance, full-featured query engine, and for the continued technical support that has helped us resolve the various problems we encountered along the way.
