At the beginning of the 21st century, with the advent of the Internet era, data volumes soared and the era of big data arrived. The Hadoop ecosystem and its derivative technologies gradually took the stage. Hadoop is a batch data processing infrastructure with HDFS as the core storage and MapReduce (MR) as the basic computing model. Around HDFS and MR, a series of components emerged that continuously improved the processing capabilities of the big data platform, such as HBase for key-value operations, Hive for SQL analysis, and Pig for workflows. Data storage and processing technology centered on Hadoop gradually became the mainstay of data processing; part of the technology stack is shown in the following figure:
During this period, as enterprise informatization progressed, with tools being upgraded and new tools applied, data volumes grew larger, data formats multiplied, and decision-making requirements became more demanding, so data warehouse technology came to be widely applied in big data scenarios. A data warehouse built on big data follows the classic data warehouse architecture, simply replacing the traditional tools with their big data counterparts; there is no fundamental difference in how the architecture is constructed. In the offline big data architecture, the structure of the offline data warehouse is as follows:
As data processing power and processing requirements continued to evolve, more and more users found that no matter how much batch processing improved in performance, it could not meet scenarios with high real-time requirements, and so streaming computing engines emerged, such as Storm, Spark Streaming, and Flink.
The offline big data architecture above cannot handle real-time business. In the early days, many companies used Storm to handle real-time business scenarios. As more and more applications went online, it became clear that batch processing and stream computing used together could satisfy most application needs. Users do not actually care what the underlying computing model is; they expect that, whether via batch processing or stream computing, results are returned based on a unified data model. Hence the Lambda architecture was proposed.
In the Lambda architecture, in order to compute real-time indicators, a real-time computing link is added alongside the original offline data warehouse, and the data source is streamed: messages are sent to a message queue (commonly Kafka in big data), the real-time computation consumes the data from the message queue, completes the real-time indicator calculation, and pushes the results to downstream data services, where the data service layer merges the offline and real-time results.
Data in the Lambda architecture starts from the underlying data sources, enters the big data platform in various formats, and is collected by components such as Kafka and Flume. It is then split into two lines of computation. One line enters a streaming computing platform (such as Storm, Flink, or Spark Streaming) to compute real-time indicators and guarantee data freshness; the other line enters an offline batch computing platform (such as MapReduce, Hive, or Spark SQL) to compute T+1 business indicators, which become visible the next day, guaranteeing the completeness and accuracy of the data.
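As a toy illustration of this split (all event data here is hypothetical; a real deployment would use Kafka feeding a streaming engine and a batch engine), the same event log can drive both an incremental streaming path and a full-recompute batch path:

```python
# Hypothetical order events, standing in for a Kafka topic.
event_log = [
    {"order_id": 1, "amount": 100},
    {"order_id": 2, "amount": 250},
    {"order_id": 3, "amount": 50},
]

# Streaming line: update the indicator incrementally, event by event.
running_total = 0
for event in event_log:  # a Flink/Storm job would consume these as they arrive
    running_total += event["amount"]
    print("real-time total so far:", running_total)

# Batch line: recompute the same indicator from the full day's data (T+1).
batch_total = sum(e["amount"] for e in event_log)
print("T+1 batch total:", batch_total)  # 400
```

Both lines compute the same indicator; the streaming line trades completeness for freshness, while the batch line sees the whole day at once.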
Depending on the complexity of the real-time business statistics, the Lambda architecture falls into the following two scenarios:
1. Offline data + real-time processing link (traditional real-time development)
When the real-time indicator calculations in the real-time link are not complicated, development follows a "chimney"-style design, and there is no need to build a real-time data warehouse; layering can simply be skipped. In this scenario, the Lambda architecture consists of an offline data warehouse plus the real-time business processing part. This real-time part does not yet reach the level of a real-time data warehouse and can only be called a real-time processing link. Its structure is as follows:
Note: in a business of a certain scale, there are usually many application systems, developed by different departments of the enterprise, in different historical periods, and for different business purposes. Because there is no unified specification of data formats, the systems are not connected to each other and their data is not integrated; each stands alone like a chimney, hence the term "chimney system". Similarly, when data processing programs cannot share unified data specifications, unified processing flows, or reused data, and remain independent of each other, this is called "chimney-style" development.
2. Offline data warehouse + real-time data warehouse
As real-time services in enterprises increase, more and more real-time indicators must be computed and their complexity grows. To better reuse data in the real-time link, a data layering design is added to the real-time link to build a real-time data warehouse. In this scenario, the Lambda architecture consists of two parts, an offline data warehouse and a real-time data warehouse, and its structure is as follows:
The difference between the traditional "real-time processing link" and the "real-time data warehouse" in the Lambda architecture above is that traditional "chimney-style" real-time development leads to serious code coupling. As requirements grow, sometimes detailed data is needed and sometimes OLAP analysis is needed; this mode struggles to cope with such requirements and lacks sound specifications. The real-time data warehouse, while still guaranteeing data freshness, manages data according to data warehouse principles, which is more unified and standardized, with stronger stability and service capability.
In the Lambda architecture, the indicators computed by stream processing are also computed by batch processing, and the batch result ultimately prevails: after each batch run, the stream processing result is overwritten. This is a compromise for the imperfections of stream processing, and it is handled by the data service layer, whose main function is to merge the offline and real-time computation results.
For example, when counting real-time trading orders, the real-time statistics may need to be displayed at the minute level for the current day, while the T+1 figure shows the total number of trading orders for yesterday; the latter is clearly the T+1 offline batch result. If some users cancel orders on the same day, the batch statistics may be inconsistent with the real-time data displayed during the day, so the data service must process and unify the data and decide which figures to use.
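The merge step in the data service layer can be sketched as follows (a minimal sketch with made-up day keys and counts; the rule, as described above, is that a finished batch result overwrites the streaming result, while the current day falls back to the real-time figure):

```python
# Hypothetical per-day order counts from the two links.
batch_results = {"2023-05-01": 98}    # T+1 batch result, cancellations removed
stream_results = {"2023-05-01": 100,  # cumulative real-time figures
                  "2023-05-02": 42}   # today: no batch result exists yet


def serve(day):
    """Prefer the batch result once it exists; otherwise fall back to real time."""
    if day in batch_results:
        return batch_results[day]
    return stream_results[day]


print(serve("2023-05-01"))  # 98 (batch overwrites the stream figure of 100)
print(serve("2023-05-02"))  # 42 (only the real-time figure is available)
```

The design choice here is exactly the Lambda compromise: accuracy for closed days, freshness for the current day.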
The Lambda architecture has become a must-have architecture for many companies' big data platforms, since it satisfies both the batch offline processing and the real-time data processing needs of a company. The core idea of the Lambda architecture is to combine stream and batch processing: as shown in the figure above, the entire data flow enters the platform from left to right; after entering the platform, it splits in two, one part taking the batch path and the other the streaming computing path.
Regardless of the computing mode, the final results are provided to applications through a unified service layer to guarantee consistent access; whether a result came from batch or stream is transparent to the user. After years of development, the Lambda architecture has proven stable: the cost of the real-time computing part is controllable, and batch jobs can run at night so that the real-time and offline computing peaks are staggered. However, it also has some serious disadvantages:
1) The same requirement requires developing two identical sets of code
This is the biggest problem of the Lambda architecture: for the same requirement, two sets of code must be developed, one implemented on the batch engine and one on the stream processing engine. After writing the code, test data must be constructed to verify that the two produce consistent results. Moreover, maintaining two codebases is very troublesome: once a requirement changes, both sets of code must be modified, and both must go live at the same time.
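The duplication problem can be made concrete with a toy example: the same requirement (total order amount per user) written once in batch style and once in stream style, plus the consistency test the text mentions. This is a plain-Python stand-in; in practice the two versions might live in, say, Hive SQL and Flink code:

```python
from collections import defaultdict


def batch_totals(records):
    """Batch version: one pass over the complete dataset."""
    totals = defaultdict(int)
    for r in records:
        totals[r["user"]] += r["amount"]
    return dict(totals)


class StreamTotals:
    """Stream version: the same logic, but maintained incrementally."""

    def __init__(self):
        self.totals = defaultdict(int)

    def on_event(self, r):
        self.totals[r["user"]] += r["amount"]


records = [{"user": "a", "amount": 10}, {"user": "b", "amount": 7},
           {"user": "a", "amount": 3}]

job = StreamTotals()
for r in records:
    job.on_event(r)

# The consistency test both implementations must pass before going live.
assert batch_totals(records) == dict(job.totals)  # {'a': 13, 'b': 7}
```

Even in this tiny case, any change to the requirement must be applied in two places and re-verified, which is the maintenance burden described above.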
2) Cluster resource usage increases
The same logic is computed twice, so overall resource consumption increases. Although the offline part runs in the early morning, there may be many tasks, causing a sharp spike in cluster resource usage at that time; the efficiency of report output may drop, and report delays will affect subsequent displays.
3) Offline results and real-time results are inconsistent
In this architecture, we often see that the next day's statistics are lower than the previous night's, because the two figures took different paths: the next day's figure is the more accurate batch processing result, while the previous night's figure came from the streaming link (cumulative real-time statistics), which sacrifices some accuracy. Within the Lambda architecture, this mismatch between batch and real-time results has no fundamental solution.
4) The T+1 batch computation may not finish in time
With the advent of the Internet of Things era, data volumes in some enterprises keep growing, and it is increasingly common that nightly batch jobs cannot finish processing the 20-plus hours of data accumulated during the day. Ensuring that the data is ready on time before work starts in the morning has become a headache for some big data teams.
5) Server storage requirements are large
Because both the batch and streaming links need to store data in the cluster, and a large amount of temporary intermediate data is generated, data expands rapidly and storage pressure on the servers increases.
With the continuous improvement of streaming engines such as Flink, the maturation of stream-processing-related technologies (e.g., Kafka, ClickHouse), and the burden of maintaining two sets of programs under the Lambda architecture, Jay Kreps of LinkedIn combined practical and personal experience to propose the Kappa architecture.
The core idea of the Kappa architecture is to solve full data processing by improving the stream computing system, so that real-time computing and batch processing use the same set of code. In addition, the Kappa architecture holds that historical data should be recomputed only when necessary; when recomputation is needed, many instances can be started under the Kappa architecture to recompute via upstream replay (pulling data from the data source and recalculating).
The Kappa architecture processes all data as streams. The naturally distributed nature of stream computing gives it good scalability: by increasing the parallelism of stream computing and enlarging the "time window" over the stream data, the batch and streaming computing modes are unified. Its architecture is as follows:
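The idea of enlarging the "time window" until it covers batch-style computation can be sketched with a simple tumbling-window aggregation (a pure-Python toy with made-up events; real engines such as Flink provide windowing natively):

```python
from collections import defaultdict

# Hypothetical (timestamp_in_seconds, amount) events from a stream.
events = [(1, 10), (4, 20), (61, 5), (65, 15), (130, 30)]


def tumbling_window_sums(stream, window_seconds):
    """Assign each event to a fixed-size window and sum per window.

    With a small window this behaves like real-time computation; with a
    window the size of a whole day it degenerates into a batch job, which
    is how Kappa unifies the two modes in one program."""
    sums = defaultdict(int)
    for ts, amount in stream:
        window_start = (ts // window_seconds) * window_seconds
        sums[window_start] += amount
    return dict(sums)


print(tumbling_window_sums(events, 60))     # minute-level "real-time" view
print(tumbling_window_sums(events, 86400))  # one day-sized window, i.e. a batch result
```

The same code serves both modes; only the window parameter changes, which is the point of the unification.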
The Kappa architecture builds what truly deserves to be called a real-time data warehouse. Its biggest problem is that the throughput of streaming reprocessing of history is lower than that of batch processing, but this can be compensated for by adding computing resources. Reprocessing data may seem cumbersome, but it is not complicated in the Kappa architecture; the steps are as follows:
1. Choose a message queue with replay capability that can retain historical data, and set the retention period of historical data according to requirements; for example, Kafka can be configured to retain all historical data.
2. When one or more indicators need to be reprocessed, write a new job with the new logic, re-consume the data from the beginning of the upstream message queue, and write the results to a new downstream result table.
3. When the new job catches up with the current progress, switch the data source so that applications read the result table generated by the new job.
4. Stop the old job and delete the old result table.
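The four reprocessing steps above can be simulated in plain Python (a toy replayable log and made-up "result tables" standing in for Kafka and downstream storage; in practice this would involve resetting consumer offsets and writing to a new physical table):

```python
# Step 1: a replayable log standing in for a Kafka topic with full retention.
log = [{"amount": 100}, {"amount": 250}, {"amount": 50}]

# The old job sums raw amounts into its result table.
old_result_table = {"total": sum(e["amount"] for e in log)}

# Step 2: a new job with corrected logic (say, amounts should exclude a
# hypothetical 10% fee) replays the log from the beginning into a NEW table.
new_result_table = {"total": sum(round(e["amount"] * 0.9) for e in log)}

# Step 3: once the new job has caught up, switch reads to the new table.
serving_table = new_result_table

# Step 4: stop the old job and drop its result table.
del old_result_table

print(serving_table)  # {'total': 360}
```

The key property making this safe is that the old job keeps serving reads until the new job has fully caught up, so the switchover is atomic from the application's point of view.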
In addition, the Kappa architecture does not mean that intermediate results never land. Many big data systems need to support machine learning (offline training), so real-time intermediate results must be landed in an appropriate storage engine for machine learning; and when detailed data must be queried, the real-time detail layer must also be written to a corresponding engine.
The Kappa architecture also has certain disadvantages. For example, because the collected data formats are not uniform, a different streaming program must be developed each time, leading to long development cycles. More issues with the Kappa architecture are discussed in "Real-Time Data Warehouse Trends".
The traditional offline big data architecture can no longer meet some companies' real-time business needs, because with the development of the Internet and the Internet of Things, more and more companies are involved to some degree in streaming business processing scenarios. From the Lambda architecture (offline data warehouse + real-time data warehouse) to the Kappa architecture (real-time data warehouse), both involve real-time data warehouse development. So in real business development, should we use the Lambda architecture or the Kappa architecture?
Let's first look at the differences between the three architectures above:
Judging from the comparison above, the results for the three are as follows:
From the architectural point of view, the three architectures differ clearly: a true real-time data warehouse is dominated by the Kappa architecture, the offline data warehouse by the traditional offline big data architecture, and the Lambda architecture can be considered an intermediate state between the two. Most of the "real-time data warehouses" currently discussed in the industry are actually Lambda architectures, which is determined by requirements.
From the perspective of construction methods, both real-time and offline data warehouses basically follow traditional data warehouse modeling theory, organized by subject and producing wide fact tables. Note, however, that joins on real-time streaming data carry implicit time semantics, which must be taken into account during construction.
From the perspective of data assurance, the real-time data warehouse is more sensitive to changes in data volume because it must guarantee timeliness; stress testing and primary/standby preparations must be done in advance for scenarios such as big sales promotions, which is a clear difference from the offline data warehouse.
At present, companies without real-time data processing scenarios mostly use the traditional offline big data architecture, and for these companies it is cost-effective and practical.
For companies with real-time business scenarios, which architecture to choose in actual work must be decided according to the specific business needs. In practice, the Lambda and Kappa architectures are often not applied in pure form; a mixture of the two is common, for example using the Kappa architecture for most real-time indicator statistics, while a small number of key indicators additionally use the Lambda architecture to recompute with batch processing, adding a verification step.
To cover a wider range of scenarios, most companies adopt such a hybrid architecture, in which both offline and real-time data links exist and the appropriate link is chosen for each business need. Note that this hybrid is not itself a Lambda architecture: for example, an enterprise may have multiple business modules, some of which run in the Lambda mode and some in the Kappa mode.