a “technology-driven new residential service provider”, Shell House is committed to promoting the digitalization and intelligence of the residential service industry, and providing 300 million Chinese families with a full range of high-quality and efficient living services, including second-hand houses, new houses, leasing, decoration, etc., by helping high-quality service providers.
The Shell Big Data Platform Department has built and supported multiple scenario applications across the group, covering many business lines and high business complexity, so the requirements for the data analysis platform are also very high. The OLAP platform needs to support routine services such as indicator analysis, ad hoc exploratory analysis, and visual reports, as well as typical services such as user behavior analysis, risk control, and DMP. The OLAP platform needs to adapt to the analysis requirements of different types, workloads and scenarios, so there are already 6 or 7 different analysis engines on the platform that the big data platform department needs to operate and maintain at the same time.
From 2021, the integration of the company’s big data analysis engine began with the introduction of DorisDB as the main analysis engine. On the indicator platform and reporting platform, it is basically realized through one component (DorisDB) to adapt to various data analysis scenarios. DorisDB is used to build a one-stop all-scenario ultra-fast data analysis platform, which improves data analysis efficiency, reduces O&M complexity, and fully releases the value of data.
As a unified indicator management platform for the whole group in multiple scenarios, the indicator platform
provides the following functions:
indicators and unified management
real-time indicator queries
unified definition of API
Apache Kylin was used to support summary metric queries in the early stage. As the demand for detail queries increased, several components such as Druid, ClickHouse, and Apache Doris were introduced.
- tens of
thousands of level indicators apply
tens of millions of calls/day
TP99 query within 3 seconds
Visual reporting platform
Operations personnel can create self-service reports based on Hive tables or metrics on the visual reporting platform. When creating reports based on metrics, requests are converted into SQL statements through the metrics platform, and most queries are executed using Impala.
business pain points
Historical data update support is poor
because most business scenarios of Shell need to update data. If the offline indicator is processed in batches, the real-time indicator needs to update the historical data in real time.
For example, in the broker’s viewing scenario, if the risk control rules are triggered, some records will be judged to be invalid and the data status will change. Another example is the new housing transaction process, the status of the new house record needs to be directly transferred to each other in filing, viewing, signing, and closing. The entire business process requires an online update on the status of the new home.
As the main analysis engine in the original architecture, Druid does not support the Update function, and can only be used for metric analysis on offline data, and cannot support real-time metric calculation. Although ClickHouse provides two mutation operations, Update and Delete, the cost of modification is relatively large. Excessive mutations are often accumulated to fail to complete data updates, and this leads to multiple online ClickHouse cluster outages. In addition, since mutation is an asynchronous thread, there is no guarantee that the updated data will be visible in real time, so the real-time performance of the indicator cannot be guaranteed.
The support ability of
the multi-table join function is poor, and
the existing OLAP engine (Kylin, Druid, ClickHouse) has poor performance when multi-table join, and even does not support multi-table join. Previously, data models could only be built in wide table form. But Shell is an online and offline combination of industrial Internet companies, a typical scenario is that there are brokers often beating in the middle of the store. When calculating the latest performance, or calculating bonus indicators, it is necessary to pay attention to organizational changes. If you use the wide table model, as long as the dimension changes, you need to rebrush the entire wide table, resulting in some indicators taking too long to refresh and the data timeliness will deteriorate.
The existing engine, Druid, although capable of lookup tables, did not perform well after actual testing. Apache Kylin doesn’t actually support joins either, and multi-table joins need to be implemented by making wide tables at the bottom when the cube is built. ClickHouse only supports the local hash join mode, does not support distributed Shuffle join, in most cases the flexibility is limited, and the performance is poor.
Unable to support both detail and aggregation
in Shell indicators not only need to show managers summary indicators, if you find that there is a problem with the indicators, you also need to drill down to the details to see the specific reasons for the abnormal indicators. Then, according to the situation of the detailed data, a series of management actions are taken. In other words, the OLAP engine needs to have the ability to query detailed data and aggregate data at the same time. Since Apache Kylin and Druid cannot support detail data query, the aggregated data can only be stored in Apache Kylin and Druid, and the detail data is stored in Clickhouse. The aggregate data is not put into Clickhouse because the materialized view of Clickhouse is opaque, and it is necessary to switch to the corresponding schedule when querying the details for the upper-level application, and the operation is more cumbersome. Switching between the query engine and the table requires us to maintain additional query code logic. Moreover, it is not friendly enough for front-end data analysts, who need to understand the different storage locations and correspondence between detailed data and aggregated data at the same time, increasing the cost of learning and communication.
There are many OLAP engines, complex operation and maintenance, and high user learning costs
At present, six or seven different analysis engines (Impala, Presto, Kylin, Druid, ClickHouse, Hive) have been introduced into Shell’s data analysis platform. The team is only a dozen people, there are too many technology stacks, resulting in our lack of deep mastery of each engine, the operation and maintenance pressure is very large, and it is easy to hold on when something goes wrong.
Especially like the ClickHouse cluster, although the performance is very good, but the requirements for operation and maintenance are relatively high. The sharding and replica information of the ClickHouse cluster is configured through static configuration files. When the entire cluster needs to be scaled or down, it must be refreshed by modifying the configuration file, and O&M personnel are required to intervene in data balancing. In addition, ClickHouse uses zookeeper to do replica management, when the cluster scale becomes larger, too many replicas will cause the pressure of zookeeper to increase, and the stability of the cluster will deteriorate accordingly.
On the other hand, multiple engines are also very expensive for users to learn, and the SQL statements of different analysis systems are inconsistent, and each requires additional learning costs.
to other OLAP engines
Mainly because DorisDB has the following features:
MPP architecture + efficient columnar storage engine
High performance, high availability, high elasticity
Standard ANSI SQL
to support efficient batch import of data, and real-time import
supports real-time update of data
We thoroughly tested DorisDB against other OLAP engines, including ClickHouse, Duird, and Apache Doris. The test environment configuration information is as follows:
Query performance: DorisDB vs ClickHouse vs Apache Doris
The query performance comparison test uses the SSB test set, and the largest data volume of the table lineorder is about 6 billion (scale 1000). In the wide table mode, which ClickHouse is best at, the performance of DorisDB and Clickhouse is compared under the condition of limiting the number of threads not exceeding 8 and not limiting the number of threads.
With DorisDB and ClickHouse using no more than 8 threads per node, 9 out of 13 queries performed better than ClickHouse.
(Wide table mode, set ClickHouse max_threads=8)
Without limiting the number of ClickHouse
threads, 7 out of 13 queries perform better than ClickHouse.
(Wide table mode,
no restrictions max_threads)
In multi-table Join mode, the performance of DorisDB and Apache Doris is compared. Overall, DorisDB has a 5-10x performance advantage over Apache Doris.
There is no wide table performance process test for Apache Doris, because with 6 billion data volumes, DorisDB can directly use the insert into select statement to convert data to wide tables, and Apache Doris will report oom when executing the same statement. It can also be seen that DorisDB is much better than Apache Doris in terms of memory management and execution efficiency. At the same time, we also understand that DorisDB also has open source plans in the future, so we use DorisDB as the OLAP analysis engine in our applications.
High concurrency: DorisDB vs Druid
In the actual online environment, Druid and DorisDB were stress-tested with high concurrency in wide table mode. The QPS of the Druid cluster can reach about 600-700, with an average response time of about 100ms and a maximum response time of about 300ms. For the same size of DorisDB cluster, QPS can reach 1500-2000, with an average response time of about 50ms and a maximum response time of about 100ms.
(Druid concurrency under stress test)
addition, we additionally conducted a high concurrency stress test on the Join mode of DorisDB, and the QPS can reach 200-300, with an average response time of 470ms. It can be seen that even in the complex query scenario of the Join mode, the concurrency performance of DorisDB is still maintained at a good level.
are shown in the table below, and we have also compared indicators in other areas:
At present, Shell’s DorisDB cluster uses 35 physical machines (80core, 192GB memory, 3TB SSD) and deploys 35 BE, 3 FE. It supports multiple applications such as indicator platform, visual reporting platform, and typical business scenarios.
1, high QPS indicator query
DorisDB’s powerful concurrency capabilities support high-QPS scenarios that Druid could not meet in the past. For example, during the performance appraisal period of real estate agents, QPS will instantly soar from dozens to 3,000. In the past, there was no good solution to use Durid to deal with such instantaneous high-voltage scenarios, and the cluster would keep alarming or even going down. Using an indicator platform supported by DorisDB can solve this problem well.
materialized view that can be updated
automatically DorisDB has very good materialized view capabilities. Slow query metrics are aggregated by rollup, which can automatically hit the materialized view and automatically route them during querying to accelerate the entire query. At the same time, the materialized view supports automatic update, and when the schedule changes, the materialized view automatically refreshes the aggregated results.
3. Real-time large-screen indicators
The original real-time metrics were supported by ClickHouse, but a large number of views needed to be built. ClickHouse materialized view does not support automatic routing, and you need to specify the corresponding materialized view table name when querying. Moreover, ClickHouse’s support for Updates is very limited, and querying the latest records requires additional function support, which does not conform to standard SQL syntax. In general, using ClickHouse to calculate real-time metrics is a very complicated implementation process. DorisDB supports real-time indicator scenarios, which can automatically update indicators in real time, only need to create the corresponding materialized view, and can update indicators in real time without any additional operations.
4. More flexible data model
DorisDB also has a very strong single-table query capability and multi-table Join capability, and can support wide table mode and multi-table Join mode. In response to some flexible indicators, such as the broker organizational structure change scenario mentioned above, there is no need to build a wide table based on DorisDB. Using the online join method, when the dimension changes, update the dimension table to perform the association query again.
Odin visualization platform
We have done a large number of reports based on MySQL, such as market management dashboards. As the amount of data increases, MySQL is completely unsupportable when the amount of data reaches tens of millions. All of these visualization system reports have been migrated to DorisDB. Due to DorisDB’s support for the MySQL protocol, the entire migration process is relatively smooth and requires very little effort.
The original typical services of Typical Services, such as A/B test platform, trading platform, risk control platform, live broadcast middle office, etc., were previously built on ClickHouse and Apache Doris. We have now started to gradually migrate these business applications to DorisDB. In addition, we will also build new applications based on DorisDB, such as user behavior analysis.
The following figure shows the query efficiency comparison of the live broadcast platform after migrating from Apache Doris to DorisDB. It can be seen that the query efficiency has been doubled, especially in the case of large data volume (full scale), and the performance improvement is more than 7 times.
(After the live streaming platform uses DorisDB, the latency of all queries is significantly reduced.)
Write at the end
In the past six months of use, DorisDB is better than Apache Doris in terms of stability and query performance overall. The performance of the wide table is comparable to that of ClickHouse, and the multi-table join capability is better than that of ClickHouse. While maintaining or even exceeding the performance of ClickHouse, DorisDB has greatly reduced our O&M pressure and simplified the link of data development.
DorisDB’s support for hive appearances also gives us a lot of room for imagination, especially in some ad hoc query scenarios. Now we use Spark SQL for small queries, and hive or presto for large queries. DorisDB is used to share some hot query traffic, and the overall query efficiency can be further improved. Querying ElasticSearch tables using DorisDB is also in our next steps.
In the future, we will cover DorisDB to more business scenarios and use DorisDB to gradually replace other analysis engines such as Druid, Clickhouse, and Kylin to build our ultra-fast OLAP analysis platform for all scenarios.
The support from the DorisDB team is also very good, and I would like to thank you very much.
. Application of DorisDB in the Business Area of the Medium Mobile IoT PGW Real-Time Session