Author's introduction

Zhan Huaimin (alias: Xindu), big data engineer in the Industrial Brain team of Alibaba Cloud's Digital Industry Production and Research Department. His current work focuses on using big data and AI technology to build data middle platforms for industrial enterprise customers, supporting the digital transformation and intelligent manufacturing of industrial enterprises, and bringing big data technology to more Chinese manufacturing enterprises.


With the release of Industrial Brain 3.0 at the 2020 Yunqi Conference, the Industrial Brain has gone through several years of development. This article shares best practices for using DeltaLake in the construction of an industrial data middle platform, covering three topics:

1. Processing of heterogeneous stream messages from different locations

2. Data analysis with stream-batch fusion

3. Transaction handling and support for algorithms

Processing of heterogeneous stream messages

For industrial enterprises, data sources are often scattered across different locations, and group-level users usually want the data to converge on the data middle platform, as shown in the following figure.

Many big data components can handle this task, such as Flink and Flume, but the following reasons led us to choose DeltaLake:

  • Consumption of multiple Kafka topics via regular expressions

    Using subscribePattern, a single job can consume data from many topics at once with a regular expression, which is very convenient in campus scenarios where many topics need to be consumed

  • Support for HDFS, with encapsulated small-file merging

    DeltaLake writes natively to HDFS and encapsulates small-file compaction (Optimize), so many streams can land in HDFS without extra sinker code or cleanup jobs
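The subscribePattern option of Spark's Kafka source takes a regular expression that is matched against topic names. A plain-Python sketch (the topic names and the pattern are hypothetical) of which topics such a pattern would select:

```python
import re

# Hypothetical topic names for several workshops in one campus
topics = [
    "campus-a.workshop1.sensor",
    "campus-a.workshop2.sensor",
    "campus-a.workshop2.alarm",
    "campus-b.workshop1.sensor",
]

# The same kind of regex you would hand to the Kafka source, e.g.
#   .option("subscribePattern", r"campus-a\..*\.sensor")
pattern = re.compile(r"campus-a\..*\.sensor")

# Only the campus-a sensor topics match
matched = [t for t in topics if pattern.fullmatch(t)]
print(matched)
```

One regex therefore replaces an explicit, hand-maintained topic list, and newly created topics that fit the naming convention are picked up without a configuration change.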

When we encounter the scenario of writing Kafka data to HDFS in real time, DeltaLake is also convenient, for two main reasons:

  • Natural support for writing to HDFS, which saves the need to write an HDFS sinker (as with Flink) or the extra operation and maintenance burden of Flume clusters

  • Easier management of the rolling-write trade-off. Whether you use Flink, Flume or Spark Streaming, there is always a "rolling-write size (or record count) threshold" and a "rolling-write time threshold", and in practice you weigh the two according to the business requirements for data latency and performance. In scenarios with low latency tolerance, you can set the size or count threshold very small (even 1) so that new data rolls out quickly, but the side effect is frequent sinker IO: many small files accumulate in HDFS, hurting read/write performance and the DataNodes. In scenarios with high latency tolerance, delivery engineers usually raise the record-count and time thresholds, trading data latency for better IO performance. This case-by-case tuning works, but in real production, maintaining many different configurations across many streaming jobs is still costly. With DeltaLake it becomes much simpler: you can set the same (relatively small) rolling-write threshold for all streaming jobs so that they all get good data latency, and at the same time use DeltaLake's Optimize and Vacuum features in periodically scheduled tasks to merge or delete small files, keeping HDFS performant. This makes the whole data development effort simpler and easier to operate. (More information on the Optimize feature is available in the DeltaLake documentation.)
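The trade-off above can be made concrete with back-of-the-envelope arithmetic (all figures here are hypothetical): a short trigger interval leaves at least one file per topic per micro-batch, while periodic compaction collapses them.

```python
# Hypothetical figures: 50 topics, one output file per topic per
# 10-second micro-batch, versus hourly compaction down to one
# file per topic (the effect of a scheduled Optimize-style job).
topics = 50
seconds_per_day = 24 * 3600
trigger_interval_s = 10

files_per_day_uncompacted = topics * seconds_per_day // trigger_interval_s
files_per_day_with_hourly_compaction = topics * 24

print(files_per_day_uncompacted)             # hundreds of thousands of files
print(files_per_day_with_hourly_compaction)  # a few thousand at most
```

The point is that the small rolling threshold buys latency, and the scheduled compaction job pays the IO bill afterwards, so the two settings no longer need to be balanced per job.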

Data analysis with stream-batch fusion


In manufacturing, the stable operation of machinery and equipment is crucial to the quality of the finished product, and the most intuitive way to judge whether equipment is stable is to look at the historical trends of certain sensors over a long period. A typical architecture therefore processes large amounts of sensor time-series data and writes it to OLAP storage (such as Alibaba Cloud ADB, TSDB, or HBase) to support upper-layer data analysis applications that require high concurrency and low response times.

However, the actual situation is often much more complicated, because the level of informatization and digitalization of industrial enterprises is generally not high, and the degree of automation of production processes varies across industries. Much of the real-time equipment data is in fact not accurate, and can only be used after some time (minutes or hours), once it has been manually corrected or recalculated.

Therefore, in actual implementations a "rolling overwrite" mode is often adopted: the data in OLAP storage is continuously rewritten, and the store is divided into a "real-time incremental area" and a "periodic coverage area", as shown in the following figure.

In the figure above, all the data in the OLAP store is divided into an orange part and a blue part, and upper-layer data applications can query both regions without distinction. The only difference: the orange part holds the latest data, obtained from Kafka in real time by the stream computing job, then processed and written; the blue area is written by a periodic batch recalculation of historical data (with correction logic applied) that revises the real-time data from yesterday or earlier. The cycle then repeats: timeliness is guaranteed by the real-time path, while historical data is revised and overwritten to guarantee correctness.
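The rolling-overwrite pattern can be illustrated with a toy in-memory model (plain Python, not the real storage layer; all names are made up): the streaming path appends raw rows, and the periodic batch path overwrites the same keys with corrected values.

```python
from datetime import date

# Toy model of the store: rows keyed by (device, day).
store = {}

def stream_append(device, day, value):
    """Real-time path: write raw readings as they arrive (orange area)."""
    store[(device, day)] = {"value": value, "corrected": False}

def batch_overwrite(device, day, corrected_value):
    """Periodic path: recompute history with correction logic and
    overwrite the same keys (blue area)."""
    store[(device, day)] = {"value": corrected_value, "corrected": True}

# Day 1: a raw reading arrives in real time
stream_append("press-01", date(2020, 9, 1), 98.7)
# The next day's batch job revises yesterday's figure
batch_overwrite("press-01", date(2020, 9, 1), 101.2)
# Today's data is still served from the real-time area
stream_append("press-01", date(2020, 9, 2), 99.1)

print(store[("press-01", date(2020, 9, 1))])  # corrected history
print(store[("press-01", date(2020, 9, 2))])  # fresh real-time row
```

Because both paths write into the same keyed store, the query layer never needs to know which area a row came from, which is exactly the property the figure describes.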

In the past, a stream + batch Lambda architecture was often used, with two different compute engines handling streams and batches, as shown in the following figure.

The drawbacks of the Lambda architecture are evident here: maintaining two codebases on two different platforms, while ensuring that their computing logic stays exactly consistent, is laborious. After introducing DeltaLake, things became much simpler: Spark's native stream-batch unified design solves code reuse and cross-platform logic unification well, and combined with DeltaLake's features (ACID, OPTIMIZE, etc.), this can be done even more elegantly, as shown below:

It is also worth mentioning that stream-batch unification is not unique to Spark. Alibaba Cloud EMR encapsulates SQL on top of Spark SQL and Spark Streaming, enabling business staff to develop jobs with a Flink-SQL-like syntax at a lower threshold, making code reuse and operations work in stream-batch scenarios simpler, which matters a great deal for project delivery efficiency.
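The code-reuse point can be sketched without Spark: keep the business transformation in a single function and drive it from both a batch job and a micro-batch loop (a toy stand-in for Spark's unified DataFrame API; the records are invented).

```python
def clean(record):
    """Shared business logic used by both the stream and the batch path."""
    return {"device": record["device"].strip().lower(),
            "temp_c": round(record["temp_c"], 1)}

batch_input = [{"device": " Press-01 ", "temp_c": 98.66}]
micro_batches = [[{"device": "PRESS-01", "temp_c": 98.71}],
                 [{"device": "press-01 ", "temp_c": 98.68}]]

# Batch path: one pass over historical data
batch_out = [clean(r) for r in batch_input]

# Streaming path: the very same function applied per micro-batch
stream_out = []
for mb in micro_batches:
    stream_out.extend(clean(r) for r in mb)

print(batch_out[0])
print(stream_out)
```

Because `clean` is the single source of truth for the transformation, the Lambda-architecture risk of the two paths drifting apart disappears by construction.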

Transaction handling and support for algorithms

Traditional data warehouses rarely introduce transactions into the modeling process: because the warehouse needs to reflect changes in the data, it typically uses techniques such as slowly changing dimensions to record state changes, rather than using ACID to keep the warehouse aligned with the business systems.
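A minimal sketch of the slowly-changing-dimension style mentioned above (type-2 SCD, in plain Python with invented fields): instead of updating a row in place, the warehouse closes the old row and appends a new version, so history is preserved without transactions.

```python
from datetime import date

# Dimension table as a list of versioned rows (SCD type-2 style):
# valid_to is None for the row that is currently in effect.
dim = [{"device": "press-01", "line": "A",
        "valid_from": date(2020, 1, 1), "valid_to": None}]

def scd2_update(dim, device, new_line, change_day):
    """Close the current row for the device and append a new version."""
    for row in dim:
        if row["device"] == device and row["valid_to"] is None:
            row["valid_to"] = change_day
    dim.append({"device": device, "line": new_line,
                "valid_from": change_day, "valid_to": None})

# The device is moved from line A to line B
scd2_update(dim, "press-01", "B", date(2020, 6, 1))

print(len(dim))          # both the historical and the current version remain
print(dim[-1]["line"])
```

This is why classic warehouses got by without ACID: change is modeled as append-only history rather than in-place mutation.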

However, in industrial data middle platform projects, transactions do have their unique use scenarios, such as planning and scheduling, a major concern for every industrial enterprise. Planning usually happens at the group level: current production demand is decomposed and arranged according to customer orders, material inventory and factory capacity, so that production capacity is allocated sensibly. Scheduling tends to be more microscopic, at the factory level: production plans are adjusted dynamically in real time according to work orders, materials and actual production conditions, to maximize resource utilization. Both are optimization problems that require fusing many data sources, as shown in the following figure:

The raw data required by the scheduling algorithm often comes from multiple business systems: ERP provides order and planning data, WMS provides material data, and MES provides work-order and process data. These data must be fused together (physically and logically) to serve as effective input to the scheduling algorithm, so implementations often need a unified store to hold data from each system. The scheduling algorithm also has freshness requirements: the input data should stay as consistent as possible with each business system, truly reflecting the production situation at that moment, so that scheduling decisions are sound.
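The fusion step can be sketched as a join across per-system views keyed by order id (all system extracts and field names here are hypothetical):

```python
# Hypothetical extracts from the three business systems, keyed by order id
erp_orders = {"SO-1001": {"qty": 500, "due": "2020-10-01"}}
wms_stock = {"SO-1001": {"material": "steel-plate", "on_hand": 420}}
mes_workorders = {"SO-1001": {"work_order": "WO-77", "step": "cutting"}}

def fuse(order_id):
    """Physically and logically merge the per-system views into one
    record, the shape the scheduling algorithm consumes."""
    record = {"order_id": order_id}
    for system in (erp_orders, wms_stock, mes_workorders):
        record.update(system.get(order_id, {}))
    return record

fused = fuse("SO-1001")
print(fused["qty"], fused["on_hand"], fused["work_order"])
```

In the real pipeline this join runs in the unified store, which is also why the freshness of each system's extract matters: a stale WMS view would feed the solver an on-hand quantity that no longer exists.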

In the past, we handled this scenario like this:

  • Leverage the CDC capabilities of the individual business systems, or write separate programs to poll for near-real-time data changes, and write them into a relational database

  • Handle the data Merge logic along the way, so that the data in the relational database stays consistent with the business systems in near real time

  • When the scheduling engine is triggered, pull the data from the RDB into memory for computation

There are some obvious problems with this architecture, mainly:

  • Using an RDB instead of big data storage, and loading all the data into memory at computation time, becomes difficult when data volumes are relatively large

  • If the Hive engine replaces the middle RDB: although Hive 3.x supports ACID, its real-time capability and the MapReduce programming framework's support for algorithms (solvers) can hardly meet engineering requirements

We are currently trying to introduce DeltaLake, combined with Spark features, to optimize this architecture, as shown below.

The optimized architecture has the following advantages:

  • Use HDFS + Spark instead of an RDB as the middle platform storage, solving the storage problem when data volumes are large

  • Use Spark Streaming + DeltaLake to ingest the raw data: DeltaLake's ACID feature handles the Merge logic as data enters the middle platform storage, and Merge + Optimize are performed while streaming into the store, guaranteeing read and write performance

  • The scheduling engine no longer queries data from the middle platform into memory for computation; instead, the algorithm task is encapsulated as a Spark job and submitted to the computing platform, taking advantage of the Spark ML programming framework's good support for algorithms and Python, as well as Spark's own distributed computing capability, to distribute the planning algorithms that require many rounds of iteration

  • Use DeltaLake's Time Travel feature to manage or roll back data versions, which is very beneficial for debugging and evaluating algorithm models
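The two DeltaLake features the bullets lean on, MERGE-style upsert on ingest and Time-Travel reads of an earlier version, can be imitated with a toy in-memory table (plain Python, emphatically not the DeltaLake API; class and field names are invented):

```python
import copy

class ToyDeltaTable:
    """Tiny in-memory imitation of a Delta table: each upsert creates
    a new version, and old versions stay readable (time travel)."""

    def __init__(self):
        self.versions = [{}]  # version 0 is the empty table

    def merge(self, updates):
        """Upsert rows keyed by id, producing a new version."""
        snapshot = copy.deepcopy(self.versions[-1])
        snapshot.update(updates)
        self.versions.append(snapshot)

    def read(self, version_as_of=None):
        """Read the latest version, or an earlier one by number."""
        if version_as_of is None:
            return self.versions[-1]
        return self.versions[version_as_of]

t = ToyDeltaTable()
t.merge({"WO-77": {"step": "cutting"}})   # version 1: insert
t.merge({"WO-77": {"step": "welding"}})   # version 2: update same key

print(t.read()["WO-77"]["step"])                 # latest state
print(t.read(version_as_of=1)["WO-77"]["step"])  # earlier snapshot
```

In the real system, the upsert corresponds to a Delta MERGE during streaming ingest, and the versioned read corresponds to Time Travel, which is what lets an algorithm run be re-executed against exactly the inputs it saw.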

To summarize:

    1) DeltaLake's core capability, ACID, is very helpful for applications with high requirements on data freshness and accuracy, especially algorithm applications, which can then make more effective use of Spark's native support for ML

    2) Combining DeltaLake's Optimize + Vacuum with its Streaming capability gives better compatibility when ingesting large volumes of upstream Kafka data, and can effectively reduce O&M costs

    3) Using the Streaming SQL flow-job development packaged by the Alibaba Cloud EMR team, the development threshold and cost can be effectively reduced when implementing large-scale data middle platform projects

At present, the application of DeltaLake in the Industrial Brain is still at an exploratory stage. Scenarios such as streaming ingestion, the scheduling engine and stream-batch fusion are being applied in multiple Industrial Brain projects, and are gradually being consolidated into the Industrial Brain's standard products. Combined with the visual editing and replication capabilities of Industrial Brain 3.0's data + algorithm scenes, they can be quickly replicated to discrete manufacturing, automotive, steel and other industries, using AI capabilities to benefit Chinese industry.

Interested readers can refer to: