First of all, we need to clarify what a real-time data warehouse actually is. Neither Baidu Baike nor Wikipedia gives a concrete definition. Is a warehouse real-time because it ingests data through real-time streaming? Because it integrates stream and batch processing? Or only when collection and computation are fully real-time? Different companies answer this differently: some people consider providing real-time dashboards or real-time reports enough to call it a real-time data warehouse, while others feel that every piece of data the warehouse serves must be real-time to qualify. In fact there is no standard answer; different people, scenarios, and enterprises understand it differently. I remember a manager once describing a difference between management and technical roles: faced with a task or requirement, the technical person's answer is clear-cut, it either can be done or it cannot; the manager's answer sounds clear but leaves room for multiple interpretations. [A seasoned operator, haha]

So the definition of a real-time data warehouse depends on the angle you take. Some common definitions:

  • The warehouse has real-time data processing capability and can serve real-time data according to business needs, for example providing real-time business changes and real-time marketing-effect data to the operations side.

  • All data in the warehouse is real-time end to end, from data collection through processing to data serving.

  • Everything, from data construction, data quality, and data lineage to data governance, is done in a real-time manner.

Clearly, different understandings imply very different complexity when building a real-time data warehouse. In the end, though, the build is driven by the business, and input must be weighed against output.

Real-time data warehouse architecture design ideas

Data transfer and processing in a real-time data warehouse are basically the same as in an offline one, as shown in the figure below. Layering is a very effective means of data governance, so the first thing to consider when organizing a real-time data warehouse is the layered processing logic.

As the figure shows, the following points need to be considered when designing a real-time data warehouse solution. The goal is not to design the most impressive technical solution, but the one best suited to the business scenario and resource situation; a fancier technical solution often adds technical complexity and operational difficulty, which tests our ability to keep it under control. So we choose not the technically flashiest solution, but the one that fits our actual situation:

  • Whether data integration is stream-batch unified: do offline and real-time pipelines use one unified collection mode? For example, changes are captured in real time via CDC or OGG and pushed to Kafka, and both batch and stream jobs consume from Kafka and load into the detail layer.

  • Whether the storage layer is stream-batch unified: are offline and real-time data layered and stored uniformly? For example, after ETL processing both are persisted to the same data store under a unified layering (ODS, DWD, DWS).

  • Whether the ETL logic is stream-batch unified: do stream and batch use one SQL syntax or one set of ETL components, which the platform then adapts to the stream and batch compute engines underneath?

  • Whether the ETL compute engine is stream-batch unified: one engine runs both streams and batches, which fundamentally avoids maintaining two code bases for the same processing logic.
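The last two bullets, shared ETL logic over separate or unified engines, can be sketched in plain Python. This is only an illustration of the idea, not any real engine's API; all names and figures are made up.

```python
# One set of ETL logic reused by both execution modes (illustrative sketch).

def etl(record):
    """The shared transformation, written exactly once."""
    return {"order_id": record["order_id"], "amount": round(record["amount"] * 1.1, 2)}

def run_batch(records):
    """'Batch engine': processes a bounded collection in one go."""
    return [etl(r) for r in records]

def run_stream(source):
    """'Stream engine': processes an unbounded iterator record by record."""
    for record in source:
        yield etl(record)

orders = [{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": 20.0}]
batch_out = run_batch(orders)
stream_out = list(run_stream(iter(orders)))
assert batch_out == stream_out  # one code base, identical results in both modes
```

The point of engine-level unification is that `run_batch` and `run_stream` collapse into a single engine, so the equality above holds by construction rather than by developer discipline.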

Unifying stream and batch at the data-integration and storage layers mainly raises the following problems:

  • Stream processing is more prone to data loss. Stream data is generally stored in message middleware, and the storage characteristics of message middleware mean that once data is lost, tracing it is painful and reconciling the numbers is very difficult.

  • Schema synchronization and transformation have to be handled inside the stream to keep schemas consistent without disturbing the data; with business systems that change frequently, the maintenance burden tends to grow quickly.

Real-time data warehouse architecture

Depending on the answers to these four questions, “whether data integration is stream-batch unified”, “whether the storage layer is stream-batch unified”, “whether the ETL logic is stream-batch unified”, and “whether the ETL compute engine is stream-batch unified”, different combinations yield different real-time data warehouse architectures. The classic ones are Lambda and Kappa; there are also Meituan’s real-time data warehouse architecture (real-time data production + real-time analysis engine) and Alibaba’s stream-batch unified architecture (Lambda + Kappa). These architectures are summarized below.

Lambda data warehouse architecture

Lambda has a Batch Layer and a Speed Layer, and the results of batch and stream are stitched together at serving time. The Lambda architecture treats data as immutable, which avoids human-error problems, supports re-running data, and isolates the complex stream processing. The Batch Layer and the Speed Layer often use different components because their scenarios differ. And, as anyone who has written Storm knows, Storm code is painful to write (Trident improves this). So we have to maintain two sets of code: the same logic implemented once for batch and once for stream.
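A minimal sketch of the serving-side stitching, assuming a hypothetical daily-count metric: a query adds the precomputed batch view to the speed layer's delta for the same key. Pure illustration, not a real serving layer.

```python
# Lambda serving sketch: batch view + speed view merged per key (illustrative).

batch_view = {"2024-01-01": 120, "2024-01-02": 95}  # counts up to the last batch run
speed_view = {"2024-01-02": 7, "2024-01-03": 14}    # counts accumulated since then

def serve(day):
    """Query result = batch result + real-time delta for the same key."""
    return batch_view.get(day, 0) + speed_view.get(day, 0)

assert serve("2024-01-02") == 102  # 95 from the Batch Layer + 7 from the Speed Layer
assert serve("2024-01-03") == 14   # so far only the Speed Layer has seen this day
```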

Lambda architecture problems:

  • Two sets of architecture, each running independently

  • Two sets of code for the same piece of logic

  • Too many components

  • Data scattered across multiple systems and hard to access across them

Kreps proposed thinking along a different dimension: could we improve on this and build the big-data system around a stream-processing system alone? He argued it is entirely possible to build a data system with streaming at its core, with data re-runs achieved by replaying historical data.

This stream-processing-centric data system is what Kreps calls the “Kappa architecture”. Kappa and Lambda are both Greek letters. The architecture is much simpler than Lambda: the original batch processing is replaced by stream processing, and there is no separate Batch Layer, Speed Layer, and Serving Layer as in Lambda.
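A toy sketch of what “re-running by replaying history” means in Kappa, assuming an in-memory append-only log standing in for Kafka: a logic change is deployed as a new job that simply replays the same log. All names are illustrative.

```python
# Kappa reprocessing sketch: the log is the source of truth (illustrative).

event_log = [("orders", 10.0), ("orders", 20.0), ("orders", 5.0)]  # append-only

def stream_job(events, transform):
    """A stream job folds over the event log, one record at a time."""
    total = 0.0
    for _, amount in events:
        total += transform(amount)
    return total

v1 = stream_job(event_log, lambda a: a)        # original job
v2 = stream_job(event_log, lambda a: a * 1.1)  # changed logic: just replay the log
assert v1 == 35.0
assert abs(v2 - 38.5) < 1e-9
```

No separate batch layer is needed: correcting a result is re-consuming the log from the start, which is also where Kappa's backtracking cost comes from.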

Kappa architecture problems:

  • The high cost of backtracking over large data volumes

  • The migration burden of legacy offline data warehouses: moving thousands of ETL jobs onto a stream-processing system is an enormous effort, with huge cost and huge risk

  • Data loss: on a stream-processing platform data loss is more likely than in batch processing, and reconciling the numbers afterwards is very difficult

Real-time data + real-time analysis engine

The figure above shows Meituan’s real-time data warehouse architecture. The ETL path from log collection to message queue to data stream is unified as the basic data flow. After that, given the real-time nature of log-type data, real-time large-screen applications use real-time stream computing, while business analysis over Binlog-type data uses real-time OLAP processing. Meituan’s architecture lets the real-time OLAP engine handle some of the difficulties faced in real-time processing.

Real-time processing faces several difficulties:

  • Multiple business states: a business process keeps changing from start to finish, for example order -> payment -> delivery; the business database is updated in place, so Binlog produces many change records. Business analysis mostly cares about the final state, which creates a retraction problem: an order placed at 10:00 and canceled at 13:00 should be subtracted from the 10:00 figures, not the 13:00 figures.

  • Business integration: the data needed for analysis usually cannot come from a single subject; many tables must be joined to get the desired information. Converging and aligning real-time streams therefore often requires large caches and complex processing.

  • Analysis is batch, processing is streaming: a single record cannot form an analysis, so the object of analysis must be a batch, while the data itself is processed record by record.
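The retraction problem in the first bullet can be sketched as follows, assuming each change record carries the hour of the original order (a made-up structure, not a real Binlog format): a cancellation arriving at 13:00 is subtracted from the 10:00 bucket.

```python
# Retraction sketch: correct the bucket where the order originated (illustrative).
from collections import defaultdict

hourly_gmv = defaultdict(float)

def apply_change(event):
    """Each change record carries the hour of the ORIGINAL order."""
    if event["op"] == "insert":
        hourly_gmv[event["order_hour"]] += event["amount"]
    elif event["op"] == "cancel":
        hourly_gmv[event["order_hour"]] -= event["amount"]  # retract at origin hour

apply_change({"op": "insert", "order_hour": 10, "amount": 100.0})  # placed at 10:00
apply_change({"op": "cancel", "order_hour": 10, "amount": 100.0})  # arrives at 13:00
assert hourly_gmv[10] == 0.0  # the 10:00 figure is corrected, not the 13:00 one
```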

Lambda+Kappa

As the figure above shows, Alibaba’s real-time data warehouse architecture combines Lambda and Kappa. Data integration is not stream-batch unified: stream and batch data are collected through real-time capture and data synchronization respectively. The ETL logic is stream-batch unified, so users write only one set of code, which the platform automatically translates into a Flink Batch task and a Flink Stream task, both writing the same Holo (Hologres) table, unifying the expression at the compute layer. The storage layer keeps stream and batch data separately, but this split is transparent, and the query logic is completely consistent.
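The idea of “stored separately but transparent to queries” can be sketched as a query helper that unions a batch store and a stream store behind one function, so callers see a single logical table. All names and data are illustrative; this is not how Hologres works internally.

```python
# "Separate storage, one query logic" sketch (illustrative names and data).

batch_store = [{"order_id": 1, "amount": 10.0}]   # historical, batch-written data
stream_store = [{"order_id": 2, "amount": 20.0}]  # freshest, stream-written data

def query_orders(predicate):
    """One query path over both stores; the storage split is transparent."""
    return [r for r in batch_store + stream_store if predicate(r)]

rows = query_orders(lambda r: r["amount"] >= 10.0)
assert [r["order_id"] for r in rows] == [1, 2]
```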

Summary

Architecture design is not about producing the most impressive technical solution, but the one best suited to the business scenario and resource situation. Sometimes a fancier technical solution increases technical complexity and operational difficulty, and a higher cost must be invested to keep it under control. So we choose not the technically flashiest solution but the technical architecture that fits our actual situation. When designing a real-time data warehouse architecture, the main considerations are “whether data integration is stream-batch unified”, “whether the storage layer is stream-batch unified”, “whether the ETL logic is stream-batch unified”, and “whether the ETL compute engine is stream-batch unified”. Weighing the problems each kind of unification brings, we arrive at a real-time data warehouse architecture that fits the business scenario.

END

