1. Early stage of real-time computing

Although real-time computing has only become popular in recent years, some companies had real-time computing needs much earlier. The data volumes were relatively small, however, so no complete real-time system was ever formed: development was essentially ad hoc, analyzing each specific problem on its own and building one job per requirement, with little thought given to the relationships between jobs. The development model looked like this:

As shown in the figure above, after ingesting the data source, each job performs data cleaning, dimension expansion, and business logic processing in Flink, and finally outputs the business result directly. Taking this chain apart, each job repeatedly references the same data sources, and operations such as cleaning, filtering, and dimension expansion are duplicated across jobs; the only real difference is the business logic.
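The duplication described above can be made concrete with a small sketch (plain Python standing in for Flink jobs; all names and data are made up for illustration): two "chimney" jobs that each re-implement the same cleaning and dimension-expansion steps, differing only in their final business logic.

```python
# Illustrative sketch, not real Flink code: two siloed jobs that each
# duplicate the same cleaning and dimension-expansion steps.

RAW_EVENTS = [
    {"uid": 1, "city_id": 10, "amount": 25.0},
    {"uid": None, "city_id": 10, "amount": 5.0},   # dirty record, no uid
    {"uid": 2, "city_id": 20, "amount": 40.0},
]
CITY_DIM = {10: "Beijing", 20: "Shanghai"}  # hypothetical dimension table

def clean(events):
    # Duplicated in every job: drop records with missing keys.
    return [e for e in events if e["uid"] is not None]

def expand_dims(events):
    # Duplicated in every job: join in the city dimension.
    return [dict(e, city=CITY_DIM.get(e["city_id"], "unknown")) for e in events]

def job_total_gmv(events):
    # Business logic unique to job 1: total amount.
    return sum(e["amount"] for e in expand_dims(clean(events)))

def job_gmv_by_city(events):
    # Business logic unique to job 2: cleaning/expansion repeated yet again.
    result = {}
    for e in expand_dims(clean(events)):
        result[e["city"]] = result.get(e["city"], 0.0) + e["amount"]
    return result

print(job_total_gmv(RAW_EVENTS))    # 65.0
print(job_gmv_by_city(RAW_EVENTS))  # {'Beijing': 25.0, 'Shanghai': 40.0}
```

Every new requirement copies `clean` and `expand_dims` again, which is exactly the coupling and waste the layered architecture below is meant to remove.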

As products and business teams demand ever more real-time data, this development model exposes more and more problems:

  1. The number of data metrics keeps growing, and "chimney"-style (siloed) development leads to serious code coupling.
  2. Demands keep multiplying: some require detailed data, others require OLAP analysis. A single development model struggles to cope with such varied needs.
  3. Resources must be applied for separately for every demand, so resource costs expand rapidly and resources cannot be pooled and used efficiently.
  4. There is no sound monitoring system, so problems cannot be detected and fixed before they affect the business.

Looking at this evolution and its problems, the story is very similar to that of offline data warehouses: once data volumes grew, all kinds of problems appeared. How did offline data warehouses solve them at the time? They decoupled data through a layered architecture, so that multiple businesses could share data. Can a real-time data warehouse also use a layered architecture? Of course it can, but the details differ somewhat from offline layering, as discussed below.

2. Real-time warehouse construction

In terms of methodology, real-time and offline are very similar. In the early days of offline data warehouses, development was also ad hoc, analyzing each problem on its own; only when the data scale grew to a certain point did governance become a concern. Layering is a very effective means of data governance, so when considering how to manage a real-time data warehouse, the first thing to design is the layering logic.

The architecture of the real-time data warehouse is as follows:

Based on the figure above, let's analyze the role of each layer in detail:

  • Data source layer: at the data source level, offline and real-time sources are consistent. They are mainly divided into log data and business data; log data includes user logs, tracking (buried-point) logs, and server logs.
  • Real-time detail layer: to solve the problem of repeated construction, the detail layer is built once, in a unified way, following the offline data warehouse pattern of a basic detail layer managed by subject area. The purpose of the detail layer is to provide data that downstream consumers can use directly, so this base layer performs unified processing such as cleaning, filtering, and dimension expansion.
  • Aggregation layer: the aggregation layer computes results directly with Flink's concise operators and forms a pool of aggregated metrics. All metrics are processed in this layer and built by everyone according to unified specifications, forming reusable aggregated results.
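The key shift described by these layers is that cleaning, filtering, and dimension expansion happen once, and every downstream metric reuses the result. A minimal sketch of that idea (plain Python as a stand-in for the real pipeline; names and data are hypothetical):

```python
# Sketch of the layered approach: the detail layer does cleaning, filtering,
# and dimension expansion exactly once; aggregation-layer metrics reuse it.

CITY_DIM = {10: "Beijing", 20: "Shanghai"}  # hypothetical dimension table

def detail_layer(raw_events):
    """Unified detail layer: clean, filter, and expand dimensions once."""
    out = []
    for e in raw_events:
        if e.get("uid") is None:        # cleaning / filtering
            continue
        # dimension expansion
        out.append(dict(e, city=CITY_DIM.get(e["city_id"], "unknown")))
    return out

def agg_order_count(detail):
    # Aggregation-layer metric 1, built on the shared detail output.
    return len(detail)

def agg_gmv_by_city(detail):
    # Aggregation-layer metric 2, reusing the same detail output.
    result = {}
    for e in detail:
        result[e["city"]] = result.get(e["city"], 0.0) + e["amount"]
    return result

raw = [
    {"uid": 1, "city_id": 10, "amount": 25.0},
    {"uid": None, "city_id": 10, "amount": 5.0},   # dropped by cleaning
    {"uid": 2, "city_id": 20, "amount": 40.0},
]
detail = detail_layer(raw)          # built once, shared by all metrics
print(agg_order_count(detail))      # 2
print(agg_gmv_by_city(detail))      # {'Beijing': 25.0, 'Shanghai': 40.0}
```

Adding a new metric now means adding one aggregation function, not re-cleaning and re-joining the raw stream.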

We can see that the layers of a real-time data warehouse and an offline one are very similar: a data source layer, a detail layer, a summary layer, even an application layer, and the naming patterns may be the same. But it is not hard to find many differences between the two:

  • Compared with an offline data warehouse, a real-time data warehouse has fewer layers:
      • Judging from current experience in building offline data warehouses, the detail layer of an offline warehouse is very rich: besides processing detailed data, it generally also includes a light summary layer. In addition, in an offline warehouse the application layer sits inside the warehouse itself, whereas in a real-time warehouse the application-layer data has already landed in the application system's own storage, so that layer can be separated from the warehouse tables.
      • The benefit of building fewer application layers: in real-time processing, every additional layer inevitably adds some latency.
      • The benefit of building fewer summary layers: when computing summary statistics, a small artificial delay is often introduced to tolerate late data and ensure accuracy. For example, when computing cross-day order statistics, the job may wait until 00:00:05 or 00:00:10, confirm that all data before 00:00 has arrived, and only then start computing. If there are too many summary layers, each one compounds this artificial latency.
  • Compared with an offline data warehouse, the storage of a real-time data warehouse differs:
      • An offline data warehouse is built almost entirely on Hive tables. In a real-time warehouse, however, the same table may be stored in different ways: detailed or summarized data is commonly stored in Kafka, while dimension data such as cities and channels needs to live in a database such as HBase, MySQL, or another KV store.
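The split storage described in the last point implies a lookup-join pattern: fact records flow through the message queue while dimensions are fetched from the KV store per record. A minimal sketch, with a plain list standing in for a Kafka topic and a dict standing in for HBase/MySQL (all names are illustrative):

```python
# Sketch of a lookup join across storage systems: fact data from a message
# queue (list stand-in) enriched by point lookups into a KV store (dict
# stand-in for HBase/MySQL). Not a real Kafka or HBase client.

kafka_detail_topic = [
    {"order_id": "o1", "city_id": 10, "amount": 30.0},
    {"order_id": "o2", "city_id": 20, "amount": 12.5},
]

# Dimension table living in a KV store rather than in the stream.
kv_dim_city = {10: {"name": "Beijing"}, 20: {"name": "Shanghai"}}

def lookup_join(stream, dim_table, key_field):
    """Enrich each fact record with its dimension row via a point lookup."""
    for record in stream:
        dim = dim_table.get(record[key_field], {})
        yield {**record, **dim}

enriched = list(lookup_join(kafka_detail_topic, kv_dim_city, "city_id"))
print(enriched[0]["name"])  # Beijing
```

In a real deployment this cross-source access is exactly what makes joins harder than in a Hive-only offline warehouse, a limitation the Kappa discussion below returns to.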

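The cross-day example above (waiting until 00:00:05 or 00:00:10 before closing the day) can be sketched as a simple finalization rule: a window's result is only emitted once the clock passes the window end plus a tolerance. Times are epoch seconds and everything here is illustrative, not Flink's actual watermark API:

```python
# Sketch of the artificial-delay idea: a day's statistics are finalized
# only after 00:00 plus a small tolerance, so stragglers are still counted.

DAY = 86400       # seconds in a day
TOLERANCE = 10    # finalize at 00:00:10 of the next day

def finalize_day(event_times, day_start, now):
    """Return the day's event count, or None if it is too early to close."""
    day_end = day_start + DAY
    if now < day_end + TOLERANCE:
        return None  # keep waiting for late-arriving data
    return sum(1 for t in event_times if day_start <= t < day_end)

events = [100, 5000, DAY - 1, DAY + 3]  # the last event belongs to the next day
print(finalize_day(events, 0, DAY + 5))   # None, still inside the tolerance
print(finalize_day(events, 0, DAY + 10))  # 3
```

Each extra summary layer adds its own such tolerance, which is why the text recommends keeping summary layers few.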
3. Real-time data warehouse of Lambda architecture

The concepts of the Lambda and Kappa architectures were explained in a previous article; if you are not familiar with them, see: One article to understand big data real-time computing.

The following figure shows a concrete Lambda architecture based on Flink and Kafka. The upper half is real-time computing and the lower half is offline computing; horizontally the figure is divided by computing engine, and vertically by data warehouse layer:

The Lambda architecture is a fairly classic one. In the past there were few real-time scenarios and computing was mainly offline. When real-time scenarios were added, the different timeliness requirements led to different technology stacks, so the Lambda architecture effectively attaches a real-time production link alongside the offline one: the two links produce independently, and their results are merged at the application level. This is also a reasonable way to organize business applications.

Dual-link production brings its own problems: processing logic is written twice, development and operations effort is doubled, and resources split into two separate pools. The Kappa architecture evolved to address these problems.

4. Real-time data warehouse of Kappa architecture

The Kappa architecture is equivalent to the Lambda architecture with the offline computing part removed, as shown in the following figure:

The Kappa architecture is relatively simple in design: production is unified, with one set of logic serving both offline and real-time needs. In practical application scenarios, however, it is quite limited: because the same table of real-time data is stored in different ways, joins must cross data sources, and operations on the data face significant restrictions. As a result, there are few production deployments of the Kappa architecture in the industry, and the scenarios where it is used are relatively simple.

Regarding the Kappa architecture, readers familiar with real-time data warehouse production may have a question. Business changes constantly, and much business logic must be iterated; if the caliber (definition) of previously produced data changes, it must be recomputed, possibly even rewriting historical data. How does a real-time data warehouse handle data recomputation?

The Kappa architecture's answer is as follows. First, use a message queue that can retain historical data, such as Kafka, and that supports restarting consumption from a chosen historical offset. Then start a new job that consumes the Kafka data from an earlier point in time. Once the new job's progress catches up with the currently running job, switch the downstream over to the new job, stop the old job, and delete the old result table.
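The replay-and-switch flow just described can be sketched end to end. The queue and job objects below are simple stand-ins, not a real Kafka client; all names are illustrative:

```python
# Sketch of Kappa recomputation: replay the retained topic from offset 0
# into a new job with the changed logic, wait for it to catch up with the
# old job, then switch the downstream and retire the old job.

topic = [f"event-{i}" for i in range(100)]  # history retained in the queue

class Job:
    def __init__(self, name, start_offset, logic):
        self.name, self.offset, self.logic = name, start_offset, logic
        self.results = []
    def consume(self, upto):
        # Process messages from the current offset up to (not including) upto.
        while self.offset < upto:
            self.results.append(self.logic(topic[self.offset]))
            self.offset += 1

old_job = Job("v1", 0, lambda e: e.upper())
old_job.consume(upto=80)                 # v1 has processed 80 events so far

# Business caliber changed: start v2 from offset 0 to rewrite history.
new_job = Job("v2", 0, lambda e: e.lower())
new_job.consume(upto=len(topic))         # replay until fully caught up

if new_job.offset >= old_job.offset:     # caught up: safe to switch
    serving_table = new_job.results      # point the downstream at v2
    old_job = None                       # stop v1, drop its result table

print(serving_table[0])  # event-0
```

With a real Kafka deployment, "restart from a historical node" corresponds to resetting the new consumer group's offsets; the retention period of the topic bounds how far back recomputation can go.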

5. Stream-batch unified real-time data warehouse

With the development of real-time OLAP technology, open source OLAP engines such as Doris and Presto have greatly improved in performance and ease of use. Combined with the rapid development of data lake technology, unifying stream and batch has become simple.

The following figure shows a real-time data warehouse with unified stream and batch processing:

Data is collected uniformly from logs into the message queue and then into the real-time data warehouse, so the basic data flow is built once. Downstream, real-time stream computing serves log-based real-time features and real-time dashboards, while real-time OLAP batch processing serves Binlog-based business analysis.

We can see that, compared with the architectures above, the stream-batch unified approach changes the storage: Kafka is replaced by Iceberg. Iceberg is an intermediate layer between the upper computing engines and the underlying storage format; we can define it as a "data organization format", with the underlying storage still being HDFS. So why add this intermediate layer, and how does it help unify stream and batch? Iceberg's ACID capability simplifies the design of the entire pipeline and reduces its latency, and its update and delete capabilities effectively reduce overhead and improve efficiency. Iceberg can efficiently support both high-throughput batch scans and concurrent real-time stream processing at partition granularity.
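The ACID property the text attributes to Iceberg can be illustrated with a toy snapshot model: every commit produces a new immutable snapshot, so readers see either the old version or the new one, never a half-applied change, and old snapshots remain readable. This is an illustration of the idea only, not the Iceberg API:

```python
# Toy snapshot-based table: each commit is atomic (appended as one unit),
# updates and deletes are copy-on-write, and old snapshots stay readable
# ("time travel"). A sketch of the concept, not real Iceberg.

class ToyTable:
    def __init__(self):
        self.snapshots = [{}]            # list of committed snapshots

    def commit(self, upserts=None, deletes=None):
        new = dict(self.snapshots[-1])   # copy-on-write from latest snapshot
        for k, v in (upserts or {}).items():
            new[k] = v                   # update or insert
        for k in (deletes or []):
            new.pop(k, None)             # delete
        self.snapshots.append(new)       # the whole change lands atomically

    def read(self, snapshot_id=-1):
        # Readers pick a snapshot; by default, the latest committed one.
        return self.snapshots[snapshot_id]

t = ToyTable()
t.commit(upserts={"o1": 10, "o2": 20})
t.commit(upserts={"o2": 25}, deletes=["o1"])
print(t.read())    # {'o2': 25}
print(t.read(1))   # {'o1': 10, 'o2': 20}, time travel to the first commit
```

It is this snapshot isolation that lets a streaming writer and batch readers share one table safely, which is the core of the stream-batch unification the section describes.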