Founded in 2007, Leap Express Group Co., Ltd. currently operates more than 3,000 service outlets covering more than 500 cities, and is a unicorn in China’s logistics service industry. The Group’s big data center is responsible for building and maintaining all of the Group’s data platform components, supporting more than 20 core business lines and serving more than 50,000 employees. To date, the big data center has built more than 10,000 data query interfaces, handling over 10 million calls per day with a TP99 under 1 second. We use DorisDB as a general-purpose query engine, which effectively solved the problem that, in the original architecture, a large number of queries took too long to return and performance fell short of expectations.
The overall architecture of our original offline data warehouse is shown in the figure below. Data from the databases of the various business lines (MySQL and others) is aggregated into the ETL cluster (a Hadoop cluster) through data integration tools; batch engines such as Hive, Spark, and Presto then process the data through the warehouse layers, and the DW-layer and ADS-layer data is pushed to the various query engines.
On top of these query engines sits a unified query API gateway. The self-service analysis tools in the application layer and the ERP system front end call this gateway to present data to users.
Business pain points
The biggest pain point of this architecture was query performance. The company sets response-latency targets for its big data query interfaces: 99% of query requests are expected to return within 1 second. Users of the ERP system pages and the mobile reporting app check data constantly and adjust production processes accordingly, so slow query responses hurt the user experience and can even affect business production. For complex SQL query scenarios, our previous engines, including Presto, Impala + Kudu, and ClickHouse, fell far short of expectations. In addition, introducing a different component for each complex data analysis scenario drove maintenance and usage costs very high.
We therefore urgently needed a new engine that could unify the query layer, solve the performance problems, and reduce usage and maintenance costs.
OLAP engine selection
In the first phase, starting in 2019, the Group’s big data center used Presto as its general-purpose query engine. At that stage the data warehouse layers were basically built on Hive, and Presto’s ability to query Hive directly allowed us to generate query APIs without much rework. For performance, we also copied part of the warehouse data to an independent Presto cluster, isolating its resources from the warehouse ETL cluster. After running this architecture for more than a year, the performance of the Presto-based cluster deteriorated dramatically as business needs became more complex and data volumes grew.
In the second phase, to address the insufficient performance of the Presto cluster, we began building a new general-purpose query engine on ClickHouse. In 2020 we built a large number of wide tables in ClickHouse and gradually migrated queries that previously required multi-level joins to the ClickHouse cluster. This genuinely solved the performance problems we had faced. But at the same time we had to build more and more wide tables, which was cumbersome and hard to operate, and this data model could not adapt quickly to changing business needs; its flexibility was poor.
In the third phase, starting in 2021, we looked for other OLAP engines that could meet our needs, and this is when we discovered DorisDB. First, DorisDB’s single-table and multi-table join query performance is excellent and meets our latency requirements. DorisDB also supports the MySQL protocol, giving our developers a very low barrier to learning and using it. In addition, DorisDB supports updates by primary key, supports multiple types of external tables, is simple to deploy and operate, and offers rich data import methods. These were exactly what we needed.
We therefore began gradually migrating our existing analysis workloads to a DorisDB cluster, making DorisDB the general-purpose query engine of the big data center.
Using DorisDB
Online scenarios
At present, our online data interfaces serve more than 10 million query requests every day. Before introducing DorisDB, we used eight or nine query engines to support the various online business scenarios: Elasticsearch for detail queries over large data volumes; MySQL for reporting scenarios with fixed dimensions that can be pre-computed in advance; Presto for complex SQL with multi-table joins and nested subqueries; and Impala + Kudu for scenarios with real-time updates.
After the introduction of DorisDB, the scenarios supported by Presto and Impala + Kudu have been replaced. Elasticsearch, MySQL, and ClickHouse may also be gradually replaced with DorisDB depending on the actual needs of each business scenario.
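To illustrate what this consolidation means at the gateway level, the scenario-to-engine routing before and after the migration can be sketched as follows. This is purely illustrative: the scenario names and the routing function are invented, not our actual gateway code.

```python
# Hypothetical sketch of the API gateway's scenario-to-engine routing.
# Scenario names are invented for illustration.
ENGINE_BY_SCENARIO = {
    "detail_search_large": "Elasticsearch",  # detail queries over large data volumes
    "fixed_dim_report": "MySQL",             # pre-computed, fixed-dimension reports
    "complex_sql": "Presto",                 # multi-table joins, nested subqueries
    "realtime_update": "Impala+Kudu",        # scenarios with real-time updates
}


def pick_engine(scenario: str) -> str:
    """Return the backend engine serving a query scenario.

    After the migration, the scenarios previously served by Presto and
    Impala+Kudu are routed to DorisDB instead.
    """
    engine = ENGINE_BY_SCENARIO.get(scenario, "Presto")
    if engine in ("Presto", "Impala+Kudu"):
        return "DorisDB"  # these backends were replaced by DorisDB
    return engine
```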
The following is a typical case from an actual online scenario. As shown in the figure above, we had an aggregation query over a 200-field wide table on the original Presto system. Because of complex business requirements, the SQL statement ran to more than 600 lines. We had hoped to optimize the business logic, but that was not easy, and we could not simply ask the business side to work around a limitation of the system. We replaced the original 10-node Presto cluster with a DorisDB cluster of 10 identically configured nodes. Without any change to the business logic, using DorisDB’s detail model and relying on DorisDB’s own high performance, query latency dropped from 5.7 seconds to 1 second, nearly 6x the performance of the original Presto cluster.
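For reference, a detail-model (duplicate key) table in DorisDB is declared roughly as below. This is a minimal sketch with invented table and column names; a real wide table of this kind carries around 200 columns. Since DorisDB speaks the MySQL protocol, the DDL can be submitted through any MySQL client.

```python
# Hypothetical detail-model (duplicate key) DDL; all names are illustrative.
DETAIL_TABLE_DDL = """
CREATE TABLE IF NOT EXISTS dwd_order_wide (
    dt          DATE           COMMENT 'partition date',
    order_id    BIGINT         COMMENT 'order id',
    outlet_code VARCHAR(32)    COMMENT 'service outlet code',
    amount      DECIMAL(18, 2) COMMENT 'order amount'
    -- ... a real wide table would continue to ~200 columns
)
DUPLICATE KEY(dt, order_id)        -- detail model: rows are stored as-is
PARTITION BY RANGE(dt) (
    PARTITION p202106 VALUES LESS THAN ("2021-07-01")
)
DISTRIBUTED BY HASH(order_id) BUCKETS 32
PROPERTIES ("replication_num" = "3");
"""
```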
OLAP scenarios
The Group-wide OLAP multidimensional analysis platform is a BI system we developed ourselves. Users select fields and join conditions for their business scenario and generate tables or charts from the data by drag and drop. The first back-end engine supporting this platform was Presto, whose performance in such scenarios was genuinely unsatisfactory; because of the performance problems, we could not promote the tool to more users. After we replaced the back-end query engine with DorisDB, the performance improvement was obvious. We rolled the platform out to the entire group, and it has been well received by a growing number of users.
OLAP multidimensional analysis is mainly offline. Taking the customer-domain offline analysis scenario as an example: after ETL produces the corresponding DW-layer or ADS-layer data, the data is imported into DorisDB daily via Broker Load. We model the customer subject area as a star schema, creating both the customer master table and the dimension tables in DorisDB with the detail model. Users can then drag and drop the metrics and dimensions of the customer subject area on the front end to generate the corresponding tables and charts.
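The daily import is a Broker Load statement submitted as SQL; a minimal sketch of its general shape is below, where the label, HDFS path, broker name, credentials, and table names are all placeholders.

```python
# Hypothetical Broker Load statement for the daily import; the label, HDFS
# path, broker name, credentials, and table names are placeholders.
BROKER_LOAD_SQL = """
LOAD LABEL dw.customer_20210601
(
    DATA INFILE("hdfs://nameservice1/warehouse/ads_customer/dt=2021-06-01/*")
    INTO TABLE dim_customer
    COLUMNS TERMINATED BY "\\t"
)
WITH BROKER "hdfs_broker"
(
    "username" = "hdfs_user",
    "password" = ""
)
PROPERTIES ("timeout" = "3600");
"""
```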
In the customer offline analysis scenario, with our business logic unchanged before and after the switch to DorisDB, TP99 dropped from 4.5 seconds to 1.7 seconds, roughly 3x the original performance (we plan to enable the CBO optimizer, which should bring further gains). Most queries now return within 1 second, greatly improving the user experience.
Using DorisDB’s real-time analytics capabilities, we also built real-time OLAP multidimensional analysis. Taking real-time waybill analysis as an example: we originally ran Hive batch jobs every two hours, computing fixed-dimension results and writing them to Presto for querying. The logic resembled the offline warehouse and could hardly be called real time. After introducing DorisDB, we adjusted the pipeline: Binlog changes are captured and written to Kafka, and a Routine Load job consumes Kafka to write the data into DorisDB in real time. We use the update model to build a real-time waybill master table with the waybill ID as the primary key, so every change to a waybill is reflected in the master table in real time. As in the offline scenario, the waybill subject area is modeled as a star schema.
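The update-model table and the Routine Load job described above can be sketched as follows; the Kafka brokers, topic, and column names are invented for illustration.

```python
# Hypothetical update-model table keyed by waybill ID: a newly consumed row
# replaces the existing row with the same key. All names are illustrative.
WAYBILL_TABLE_DDL = """
CREATE TABLE IF NOT EXISTS dwd_waybill_rt (
    waybill_id  BIGINT   COMMENT 'waybill id (primary key)',
    status      INT      COMMENT 'current waybill status',
    update_time DATETIME COMMENT 'last update time'
)
UNIQUE KEY(waybill_id)                 -- update model: upsert by primary key
DISTRIBUTED BY HASH(waybill_id) BUCKETS 32
PROPERTIES ("replication_num" = "3");
"""

# Hypothetical Routine Load job that consumes the Binlog topic from Kafka
# and writes into the table above in real time.
ROUTINE_LOAD_SQL = """
CREATE ROUTINE LOAD dw.waybill_rt_job ON dwd_waybill_rt
COLUMNS (waybill_id, status, update_time)
PROPERTIES ("desired_concurrent_number" = "3", "format" = "json")
FROM KAFKA (
    "kafka_broker_list" = "kafka01:9092,kafka02:9092",
    "kafka_topic" = "waybill_binlog"
);
"""
```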
With this adjustment, the waybill subject area, which used to refresh every two hours, now updates within seconds and has become truly real-time analysis. Previously we also depended on pre-computation with fixed dimensions, which limited many analysis functions; beyond the real-time experience, the gain in analytical flexibility is also significant. Real-time data and flexible analysis have become the biggest highlights of the OLAP multidimensional analysis platform in actual service.
Future plans
1. To prevent slow queries from degrading overall cluster performance, we will build multiple DorisDB clusters and isolate physical resources by business scenario.
2. Internal testing shows that DorisDB querying Hive external tables outperforms Presto querying Hive, so the existing Presto-on-Hive scenarios will be seamlessly migrated to DorisDB.
3. We currently write a large amount of real-time data into DorisDB that still needs aggregation and other processing; we are experimenting with scheduling tools to run lightweight ETL on DorisDB at 5- and 10-minute intervals.
4. Enable DorisDB’s CBO optimizer to further improve query performance.
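The lightweight ETL in item 3 boils down to periodically running aggregation SQL inside DorisDB. A minimal sketch follows, assuming invented table names and a stdlib loop standing in for whatever scheduling tool is ultimately chosen.

```python
import time

# Hypothetical in-DorisDB rollup: every few minutes, aggregate the real-time
# detail table into a summary table. All table and column names are invented.
LIGHT_ETL_SQL = """
INSERT INTO ads_waybill_5min
SELECT outlet_code,
       DATE_FORMAT(update_time, '%Y-%m-%d %H:%i') AS ts_min,
       COUNT(*) AS waybill_cnt
FROM dwd_waybill_rt
WHERE update_time >= NOW() - INTERVAL 5 MINUTE
GROUP BY outlet_code, ts_min;
"""


def run_light_etl(execute_sql, interval_sec: int = 300, iterations=None) -> None:
    """Run the rollup every `interval_sec` seconds.

    `execute_sql` is a caller-supplied function that submits a SQL string to
    DorisDB (e.g. via any MySQL-protocol client). `iterations=None` loops
    forever; a real deployment would use a scheduling tool instead.
    """
    n = 0
    while iterations is None or n < iterations:
        execute_sql(LIGHT_ETL_SQL)
        n += 1
        if iterations is None or n < iterations:
            time.sleep(interval_sec)
```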
Finally, we would like to thank Keystone for providing us with such a good product as DorisDB, which meets our requirements for a high-performance, full-featured query engine, and for the continued technical support that has helped us resolve the various problems encountered along the way.
If you want to know more details, please follow us!