Sharing guests: Tang Huaidong, Weilai Automobile, data team leader
Editor’s Finish: Luo Qingxin IBM
Production platform: DataFunTalk
Introduction: The topic of this sharing is the application of Doris in Weilai Automobile, which mainly includes the following parts:
OLAP development in Weilai
Doris serves as a unified OLAP number warehouse
Doris’ practice on operational platforms
The first part introduces the development of OLAP in NIO.
1. Introduced Druid in 2017
At that time, there were not many OLAP storage and query engines available, and the more common ones were Druid and Kylin.
The reason we prioritized the introduction of Druid is that we are familiar with previous experience.
Kylin precomputing has extremely high query efficiency advantages, but:
The most suitable and optimal storage for Kylin’s underlying layer is HBase, which the company has not introduced as a basic selection, which will add additional O&M work.
Kylin precalculates various dimensions and metrics, and if there are many dimensions and dimensions, there will be dimensional explosions, which will put a lot of pressure on the storage.
2. TiDB was introduced in 2019
The scenario that started with the introduction of TiDB was distributed OLTP. With the introduction of the TiFlash analysis engine, its OLAP capabilities have gradually increased, and the combination of OLTP and OLAP has great advantages. TiDB does a very good job of MySQL compatibility, OLTP data comes in, directly OLAP analysis and query, relatively speaking, the path will be much shorter, so it can be used in some scenarios a lot.
3. Introducing Doris in 2021
This is due to the domestic infrastructure and basic technology efforts in recent years, whether it is Doris or some other storage, including the time series database TDengine we have studied, which is the better and better place for domestic development.
Here’s a closer look at why our company chose Doris and why we came from this path.
The advantages and disadvantages of the product, the cost of technology use, the protocol, the performance and the operation and maintenance cost determine our technology selection.
First, Druid’s advantages are obvious, supporting real-time and offline data access, columnar storage, high concurrency, and very high query efficiency. Its drawbacks are also obvious:
Standard protocols such as JDBC are not used, and the threshold for use is high
Join support is weak
Accurate deduplication is inefficient and performance degrades. The overall performance should be considered in different scenarios, which is why we choose other OLAPs later
High O&M costs, different components have different installation methods and different dependencies; Data import also considers integration with Hadoop and dependencies on JAR packages
Second, TiDB is a mature engine for OLTP+OLAP.
The advantage is that the update is very friendly, thanks to its essentially prioritizing the OLTP database. There is also the need to support details and aggregation, the scene is not only indicator calculation or dashbroad display, and sometimes detailed data query
The downside is that it is not a standalone OLAP, and TiFlash relies on OLTP, which increases storage. Its OLAP capabilities are slightly insufficient
Support standard SQL, low cost of use
Low O&M costs
Third, we focus on Doris, and its advantages fully meet our needs.
High concurrency, we pay more attention to this
Both real-time and offline support are supported
Supports details and aggregation
Some degree of updates are supported
The ability to materialize views can greatly speed up query efficiency
Compatible with the MySQL protocol, so the development and use costs are relatively low
Performance meets our requirements
O&M costs are relatively low
Finally, let’s mention Clickhouse, which we’ve researched before and tried to use. Its stand-alone performance is extremely high, but:
In the scenario we explicitly need, its multi-table join support is slightly worse
Concurrency is relatively low
The operation and maintenance cost is extremely high, and people who have operated and maintained it will understand why the operation and maintenance cost of this piece is relatively high
Doris serves as a unified OLAP number warehouse
This graph is basically from data sources to data access, data computing, data warehousing, data services, and applications.
1. Data source
In the NIO scenario, the data source not only refers to the data of the business system, but also the buried point data, equipment data, vehicle data and so on. The data is accessed to the big data platform through an access method.
2. Data access
Not all of our data sources are accessed in this way. For some business system data, CDC can be enabled to capture the changed data, and then convert it into a data stream and store it in Kafka, and then continue to stream the calculation.
For some data can only be directly entered into our distributed storage in batches.
3. Data calculation
We didn’t use stream-batch integration, we used the Lambda architecture instead of the Kappa architecture.
The Lambda architecture is actually offline and real-time divided into two paths, which is related to our own business scenarios:
Not all data is streamed.
Not all data can be stored in the data stream, and some historical data is not stored in Kafka.
In order to ensure the accuracy of the data, we will still have an offline scene or an offline pipeline, recalculate the entire data, and then re-brush, and then ensure that the data is ultimately accurate. So we still use the Lambda architecture.
4. Data Warehouse
Data calculation to the number of warehouses, these two lines we did not use Flink or Spark Doris Connector, Flink with Routine Load, and Spark with Broker Load is not written directly, which has certain use cases and reasons. Like Spark, the data we generate, in fact, will still leave a copy to Hive for other scenarios. In this way, once calculated, it can be used in multiple places at the same time.
Flink is similar.
5. Data Services
Behind Doris is a One Service, and the industry has done a lot in the area of data servitization. By registering a data source or flexibly configuring it, the API is automatically generated to control the flow of the API and control the permissions, so that its flexibility will be greatly improved. Some companies also have added orchestration capabilities to the K8s serverless solution, which makes the entire service very flexible and rich.
6. Data Application
The app can be a reporting app or it can provide some services directly.
We mainly have two types of use cases:
User-oriented, similar to the Internet, we have many user scenarios, including Kanban and indicators
For the car, some of our car data enters Doris in this way, generally not very detailed data, but through a certain aggregation, Doris data volume in the billions, but the overall performance can still meet our requirements.
Doris’ practice on operational platforms
1. Architecture of the CDP Operating Platform
Next, we introduce Doris’ practice on the operating platform, which is a real use case for us. Internet companies basically do their own CDP, which includes several modules:
Label, this is the cornerstone.
Circle people, is based on labels, how do we circle people.
Insight, for the circled population, to understand what kind of distribution it is? What does it look like? It has some characteristics, so there will usually be insight into this step.
Reach, that is, through different access paths, such as SMS, telephone, voice, APP notification, IM, etc. can reach the user, of course, there are some traffic control in the middle.
In order to close the loop, our operation plan also has an effect analysis, so that from the perspective of closed-loop integrity, there is action, effect and feedback.
Doris plays the most important role in this, including: tag storage, crowd storage, and performance analysis.
Tags are divided into basic labels and basic data on user behavior, on top of which we have the flexibility to customize other tags. For another dimension, labels are also divided into real-time labels and offline labels.
2. Considerations for CDP storage selection
We consider the selection of CDP storage from 5 dimensions.
(1) Offline and real-time unification
As mentioned above, the label has an offline label and a real-time label, and our real-time is actually a quasi-real-time scene. For some data, quasi-real-time is enough to meet our needs, a large number of tags or offline tags, the way to use is Doris’s Routine Load and Broker Load.
In addition, on the same table, the frequency of updates for different columns is also different. For example, the user’s basic tag, we may need real-time updates to the user’s identity, because the user’s identity is changing all the time, the frequency of its changes, we can not make T+1, otherwise the entire use of the scene will be greatly limited. But for some labels that can be taken offline, such as basic labels such as the user’s gender, age, etc., T+1 is sufficient. But why we expect to put this kind of atomic label of the basic user in a table, this is because a table, maintenance costs will be very low. In addition, when you customize the label later, the number of tables will be greatly reduced, which will greatly benefit the overall performance.
(2) Efficient circle selection
User operation has a label, the second step is to circle people, circle selection is to filter out all the people who meet the label conditions according to different combinations of labels, then there will be different combinations of label conditions of the query, this query in Doris after the introduction of vectorization has a more obvious improvement.
(3) Efficient polymerization
The previously mentioned user insights or group insights and effect analysis statistics, need to do statistical analysis of the data, not a single simple scenario of obtaining tags by user ID, the amount of data it reads and query efficiency have a great impact on the distribution of our labels, the distribution of groups, and the statistics of effect analysis, Doris’s functional characteristics:
The first is data sharding, we shard the data according to time, and the analysis statistics will greatly reduce the amount of data, which can greatly accelerate the efficiency of query and analysis.
The second is node aggregation, and then collection to do a unified aggregation.
The third is vectorization acceleration, and the vectorization engine has a very significant performance improvement. That’s why Doris has been making a big push in this area lately.
(3) Multi-table association
Our CDP may not be the same as the CDP of some scenes in the industry, because the CDP tags in some scenarios are calculated in advance, and there is no custom label scene. We only do atomic labels, or statistics on basic user behavior data. Leave the flexibility to the user who uses CDP, he can customize the label according to his business scenario, at this time will find that the underlying data is scattered in different database tables, if you do the construction of custom labels, it is bound to need to do table association.
One very, very important reason we chose Doris is the ability to relate multiple tables. We have passed performance tests and are currently able to meet our requirements, and this piece really provides users with very powerful capabilities. Because labels are dynamic and fit the business scenario.
(4) Federal inquiries
We will record the success of user reach to TiDB. Because we do user operations, if it is only a notification, it may only affect the user experience, if it involves money such as issuing points or coupons, the task execution must be done without repeating and not leaking, and this OLTP scenario is more suitable for TiDB.
Then we do the performance analysis, we need to know to what extent the operation plan is implemented, whether the goals are achieved, its distribution and so on. Therefore, it is necessary to combine the task execution and crowd selection to analyze, and the association between Doris and TiDB and the appearance association will be used for query.
At first, we imagined that the label volume was relatively small, and it might be more appropriate to save to es, and after our test this piece was not online, and the reason for it will be explained later.
The first is bitmap, which we haven’t reached that big enough to be efficient. If the volume reaches a certain level, there will be a good performance improvement with bitmap. To calculate the UV scene, the full set of Id is greater than 50 million, and bitmap aggregation can be considered.
The second is the ES appearance, and the single-table query pushdown efficiency is still very good. However, the efficiency of aggregation query is relatively low, ES is not suitable for aggregation query, and the ES details will be read to Doris, and then aggregated; ES is also not suitable for multi-table association, reading a large amount of data, which will cause ES frequent GC.
The third is to update columns in batches, in order to reduce the number of tables and improve the performance of join tables, design tables to be as compact as possible to aggregate, the same type of facts are put together. However, fields of the same type may be updated at different frequencies, some fields need to be updated at the day level, and some fields may require hourly updates, and updating a column alone is an obvious appeal. The solution for updating some columns separately for the Doris aggregation model is to use REPLACE_IF_NOT_NULL. Note: It is not possible to replace the original non-null value with null, and you can replace all nulls with meaningful default values, such as unknown.
The fourth online service, Doris with the same data, while serving online and offline scenarios, is easy to influence each other. Try to provide online services with separate technical solutions.
There is also a vision for the future, mainly to use Manager to better manage Doris, as well as resource isolation to avoid mutual influence.
Q1: Will there be conflicts when updating different columns of the same table in real time and offline at the same time? How did you solve this here?
A1: No conflict issues were found, we updated it using REPLACE_IF_NOT_NULL way
Q2: How are the labels stored on the operating platform?
A2: Tags are stored in Doris, including two categories, one is the basic tag, without any logical judgment of the atomic tag, the other is the analysis and statistics of user behavior, such as aggregating the user’s APP operation behavior, which is essentially factual data. Different behaviors are stored in different tables, and the higher-level custom labels are extracted, calculated and restored with scheduled tasks, and the table structure of the custom labels is defined in advance, the name is additionally mapped, and then the data can be directly updated.
Q3: The operation platform stores different labels such as user location and user gender, whether multiple tables are stored separately or with large wide tables, in addition, how do we choose a detailed model or an aggregate model to build a table, and what are the suggestions for storage and query?
A3: We store separately according to the scenario, the aforementioned user base or atomic label, like user information can be a table. And different behaviors will be stored in different places and will not affect each other. We attach great importance to Doris’s multi-table association ability, and custom labels require multi-table association generation. Custom labeled tables and columns are defined in advance, and names are mapped with additional tables.
Q4: How is the structure of the label table designed? How do I design custom properties for labels?
A4: We calculate the basic data first, how many behaviors, how many basic atomic labels will be defined in advance. Custom label we first build a large wide table, which itself defaults to no meaning, if you customize a label, occupy a column can be.
Q5: Is your company’s digital warehouse all based on Doris? Is it hierarchical?
A5: Not all the data warehouses use Doris, the volume of our company’s vehicle detail data is very large, the user’s use of the scene is very flexible, they often have to write their own udf query data, there is no way to cure in advance, so some offline scenarios are Hive. Doris applicable to the scene or real-time OLAP scene, the Web or APP dashboard to display indicators, directly query the number of positions is not appropriate. Our layering of different scenes is not the same, and there are few layers in real time, and there are basically no special requirements, including details plus the final aggregation. Offline is mainly based on the use of different scenarios or query requirements, we also stratify.
This concludes today’s sharing, thank you.
Share at the end of the article, like, watch, give a 3 combo ~
01/ Sharing guests
Data team leader
He graduated from Beijing University of Posts and Telecommunications with a bachelor’s degree and the Chinese Academy of Sciences with a master’s degree. He worked in Yahoo Beijing R&D Center, responsible for the development of news recommendation effect analysis index Kanban, and at Cheetah Mobile, responsible for news recommendation algorithms. At present, he is in charge of the construction of big data platform & middle office in Weilai Automobile.
02/ Free download materials
03/ Register to watch live PPT for free
DataFun: Focus on the sharing and exchange of big data and artificial intelligence technology applications. Founded in 2017, more than 100+ offline and 100+ online salons, forums and summits have been held in Beijing, Shanghai, Shenzhen, Hangzhou and other cities, and more than 2,000 experts and scholars have been invited to participate in sharing. Its public account DataFunTalk has produced 800+ original articles, millions + reads, and 150,000+ accurate fans.
🧐 Share, like, watch, give a 3 combo! 👇