Abstract: This article is shared by Zheng Zhisheng, head of bilibili big data real-time platform, the core of this sharing explains the implementation of trillion-level transmission distribution architecture, and how the AI field can build a complete set of preprocessing real-time pipelines based on Flink. This sharing mainly focuses on the following four aspects:

  1. scheme of Flink On Yarn’s incremental pipeline in real time in the past and present life of station B

  2. Some

  3. engineering practices in the direction of Flink and AI for

  4. future

  5. development and thinking

Tips: Click at the end of the article Read the original article” to review the author’s original shared video~
GitHub address 
welcome everyone to like Flink and send star~

First, the past and present life of station B in real time

When it comes to the future of real-time computing, the key word is the effectiveness of data. First of all, from the ecology of the entire big data development, look at its core scene radiation: in the early stage of big data development, the core is the offline computing scene facing the sky as the granularity. At that time, most of the data effectiveness was measured in days, and it paid more attention to the balance of time and cost.

With the popularization and improvement of data applications, data analysis and data warehouses, more and more people have put forward higher requirements for the effectiveness of data. For example, when it is necessary to make real-time recommendations of some data, the effectiveness of the data will determine its value. In this case, the entire real-time computing scenario is universally born.

However, in

the actual operation process, there are also many scenarios, in fact, there is no very high real-time requirements for data, in this case there will inevitably be some new scenarios with data from milliseconds, seconds or days, and real-time scene data is more of a minute-granular incremental calculation scenario. For offline computing, it is more cost-focused; For real-time computing, it pays more attention to value effectiveness; For incremental computing, it pays more attention to balancing costs, as well as the combined value and time.

2. The timeliness

of station B

is in three dimensions, what is the division of station B? For station B, 75% of the data is currently supported by offline calculation, and another 20% of the scenarios are calculated in real time and 5% are calculated by increments.

    For real-time

  • computing scenarios, it is mainly applied in the whole real-time machine learning, real-time recommendation, advertising search, data application, real-time channel analysis and delivery, reporting, OLAP, monitoring, etc.

  • For offline computing, the data radiation is wide, mainly based on data warehouses;

  • For incremental computing, some new scenarios were only started this year, such as binlog’s incremental Upsert scenario.

3. Poor timeliness of ETL

For the problem of effectiveness, in fact, there are many pain points encountered in the early stage, and the core focuses on three aspects:

    > First, the transmission pipeline lacks computing power. In the early scheme, the data was basically to fall to the ODS according to the sky, and the DW layer was the day after the early morning to scan all the data of the previous day’s ODS layer, that is, the overall data could not be pre-cleaned;

    Second, the resource

  • concentration with a large number of activities breaks out after the early morning, and the pressure on the entire resource orchestration will be very large;

    Third, real-time and offline

  • gaps are more difficult to meet, because for most data, the cost of pure real-time is too high, and the practical effect of pure offline is too poor. At the same time, the storage time of MySQL data is not enough. For example, like the barrage data of station B, its volume is very exaggerated, and the synchronization of this business table often takes more than ten hours, and it is very unstable.


AI real-time engineering is complex

In addition to the problem of practicality, it also encountered the more complex problem of AI real-time engineering in the early days:

    > first, is the problem of the calculation efficiency of the entire feature engineering. The same real-time feature calculation scenario also needs to be traced back to the data in the offline scene, and the calculation logic will be repeatedly developed;

    >Second, the entire real-time link is relatively long. A complete real-time recommendation link, covering N real-time and M offline more than a dozen jobs, sometimes encountered troubleshooting, the operation and maintenance and control costs of the entire link are very high;

  • Third, with the increase of AI personnel and the input of algorithm personnel, it is difficult to scale horizontally in experimental iteration.


In the context of these key pain points, we focus on Flink’s ecological practice, which includes the application of the entire real-time data warehouse and the entire incremental ETL pipeline, as well as some scenarios of AI-oriented machine learning. This sharing session will focus more on the incremental pipeline and the direction of AI plus Flink. The following figure shows the overall scale, at present, the entire transmission and computing volume, in the trillion-level message scale has 30000+ computing cores, 1000+ jobs and more than 100 users.


1. Early architecture


Flink On Yarn’s incremental pipeline

Let’s take a look at the early architecture of the entire pipeline, as you can see from the figure below, the data is actually mainly consumed by Flume Kafka falling to HDFS. Flume uses its transaction mechanism to ensure the consistency of data from Source to Channel and then to Sink, and finally after the data falls to HDFS, the downstream Scheduler will determine whether the data is ready by scanning the directory for tmp files, so as to schedule and pull up downstream ETL offline jobs.

2. Pain


encountered many pain points in the early stage:

      > The first use was MemoryChannel, which would have data loss, and then I tried to use FileChannel mode, but the performance could not meet the requirements. In addition, in the case that HDFS is not stable, Flume’s transaction mechanism will cause the data to be rolled back to the channel, which will lead to continuous duplication of data to a certain extent. In extremely unstable HDFS, the probability of the highest repetition rate reaching the percentile;

    • Lzo rows are stored, and the entire transmission in the early days was in the form of delimiters, whose schema was weakly constrained and did not support nested formats.

  • the second point is the aging of the entire data, which cannot provide minute-level queries, because Flume does not have a checkpoint cutting mechanism like Flink, and more controls the closure of files through the idle mechanism;

  • the third point is the downstream ETL linkage. As mentioned earlier, we are more by scanning whether the tmp directory is ready, in this case, the scheduler will call the hadoop list API with NameNode a lot, which will cause more pressure on NameNode.

3. Stability-related pain points

also encounter many problems in stability:

    >First, Flume is stateless, and after the node is abnormal or restarting, tmp cannot be shut down normally;

    >Second, the early environment without relying on big data is a physical deployment mode, and resource scaling is difficult to control, and the cost will be relatively high;

  • Third, Flume and HDFS have problems with communication. For example, when writing HDFS blockage, the blockage of a certain node will be counter-pressure to the Channel, which will cause Source to not go to Kafka to consume data, stop pulling offset, to a certain extent, it will cause Kafka’s Rebalance, and finally lead to the global offset not moving forward, resulting in the accumulation of data.

4. Trillion-level incremental pipeline DAG view Under the above pain points, the core solution builds a set of

trillion-level incremental

pipelines based on Flink, and the following figure is the DAG view of the entire runtime.

First of all, under the Flink architecture, KafkaSource eliminates the avalanche problem of rebalance, even if there is a certain degree of concurrency in the entire DAG view that blocks data write HDFS, it will not cause congestion of all Kafka partitions globally. In addition, the essence of the whole scheme is to implement extensible nodes through Transform’s modules.

  • The first layer node is Parser, which mainly does parsing operations such as data decompression and deserialization;

    > the second layer is to introduce a customized ETL module provided to the user, which can realize customized cleaning of data in the pipeline;

  • The third layer is the Exporter module, which supports exporting data to different storage media. For example, when writing to HDFS, it will be exported as a parquet; Written to Kafka, it will be exported to pb format. At the same time, the module of ConfigBroadcast is introduced on the link of the entire DAG to solve the problem of real-time update and hot loading of pipeline metadata. In addition, throughout the chain, a checkpoint is performed every minute to append the actual data of the increment, so that minute-level queries can be provided.

5. Trillion-level incremental pipeline overall view

The overall architecture of Flink On Yarn can be seen that the entire pipeline view is actually divided into BU. Each Kafka topic represents a distribution of a certain type of data terminal, and the Flink job is dedicated to writing processing for various terminal types. In the view, you can also see that for the data of blinlog, the assembly of the entire pipeline is also realized, and the operation of the pipeline can be realized by multiple nodes.

6. Technical


Next, let’s take a look at some technical highlights at the core of the entire architecture solution, the first three are some features at the real-time functional level, and the last three are mainly some optimizations at some non-functional levels.

  • For the data model, it is mainly through parquet, using Protobuf to parquet mapping to achieve format convergence;


  • notification is mainly because a pipeline actually processes multiple streams, and the core solution is the notification mechanism of the partition ready of multiple stream data;

  • CDC pipeline is more about using binlog and HUDI to solve upsert problems;

  • small files are mainly used to solve the problem of file merging through DAG topology at runtime;

    >HDFS communication is actually optimized for many key problems at a trillion-scale scale;

■ 6.1 Data model

Business development is mainly to assemble strings to assemble data records and report them. The later stage is organized through the definition and management of the model, as well as its development, mainly through the entrance of the platform to provide users to record each stream, each table, its Schema, Schema will generate it to generate Protobuf files, users can download the HDFS model file corresponding to Protobuf on the platform, so that the development of the client side can be completely constrained from PB through a strong schema way.

First of all, Kafka’s Source will consume every RawEvent record that actually travels over, and there will be PBEvent objects in RawEvent, and PBEvent is actually a Protobuf record. The parser module where the data flows from Source will form PBEvent after parsing, and PBEvent will store the entire schema model entered by the user on the platform on the OSS object system, and the Exporter module will dynamically load the model changes. Then the generated specific event object is reflected through the PB file, and the event object can finally be mapped into the parquet format. A lot of optimization of cache reflection is mainly done here, so that the dynamic parsing performance of the entire PB is improved by six times. Finally, we will land the data into HDFS to form a parquet format.

■ 6.2 Partition notification optimization

As mentioned earlier, the pipeline will process hundreds of flows, and the early Flume architecture, in fact, each Flume node is difficult to sense the progress of its own processing. At the same time, Flume can’t handle the overall progress. But based on Flink, it can be solved by the mechanism of Watermark.

First, the Source will generate a watermark based on the eventime in the message, and the watermark will

be passed to the sink through each layer of processing, and finally the progress of all watermark messages will be summarized in a single-threaded way through the Commiter module. When it finds that the global watermark has advanced to the next hour partition, it will send a message to Hive MetStore or write to Kafka to notify the previous hour partition data is ready, so that the downstream scheduling can pull up the job faster through a message-driven way.

■ 6.3 Optimization on CDC pipelines

The right side of the figure below is actually the complete link of the entire CDC pipeline. To achieve a complete mapping of MySQL data to Hive data, you need to solve the problem of streaming and batching.

The first is to synchronize all MySQL data to HDFS at once through Datax. Then, through the job of spark, the data is initialized into the initial snapshot of HUDI, and then the data of Mysql’s binlog is dragged to the topic of Kafka through Canal, and then the initialized snapshot data is incrementally updated with incremental data through Flink’s job, and finally the HUDI table is formed.

The whole link is to solve the problem of data not being lost or heavy, the focus is on writing Kafka for Canal, and the transaction mechanism is opened to ensure that when the data falls on the Kafka topic, the data can be neither lost nor heavy during transmission. In addition, data may also be duplicated and lost at the upper layer of data transmission, which is more through a globally unique ID plus millisecond-level timestamps. In the entire streaming job, deduplication of data is done for the global id and data is sorted for millisecond-level time, so as to ensure that the data can be updated to the HUDI in an orderly manner.

Then, through the Trace’s system based on Clickhouse to store, to count the number of incoming and outgoing data of each node to achieve accurate data comparison.

■ 6.4 Stability – Merging of small files

As mentioned earlier, after transforming into Flink, we did checkpoints per minute, and the number of files was greatly enlarged. It is mainly to introduce the merge operizer in the entire DAG to achieve file merging, and the merge method is mainly based on the concurrency horizontal merge, and a writer will correspond to a merge. In this way, every five-minute checkpoint, 12 files in 1 hour, will be merged. In this way, the number of files can be greatly controlled within a reasonable range.

■ 6.5

In the actual operation process of HDFS communication, the problem of accumulation of the

whole operation is often encountered, and the actual analysis is actually mainly related to HDFS communication.

In fact, HDFS communication sorts out four key steps: initializing state, Invoke, Snapshot, and Notify Checkpoint complete.

The core problem mainly occurs during the Invoke phase, where Invoke reaches the scroll condition of the file, at which point flush and close are triggered. When close actually communicates with NameNode, there will often be blockages.

The Snapshot stage also encounters a problem, once hundreds of streams in a pipeline trigger Snapshot, serial execution of flush and close will also be very slow.

Core optimization focuses on three aspects:

    > first, reducing the frequency of file chopping, that is, close. In the Snapshot phase, the file is not closed closely, but more through the file continuation. In this way, during the initialization of state, you need to do the file Truncate to do the recovery recovery.

    > second, is the improvement of asynchronized close, it can be said that the close action will not block the processing of the entire total link, for the close of Invoke and Snapshot, the state will be managed into the state, and the file will be restored by initializing the state.

    > Third, for multiple streams, Snapshot also did parallelization, every 5 minutes of checkpoint, multiple streams are actually multiple buckets, will be serialized through loops, then through multi-threaded transformation, can be reduced Checkpoint timeout occurs.

■ 6.6 Some optimizations of partition fault tolerance

In the case of multiple streams in the pipeline, the data of some streams is not continuous every hour.

This situation will cause partitioning, and its watermark cannot advance normally, causing the problem of empty partitions. So we introduce the PartitionRecover module during the running of the pipeline, which will promote the notification of partitions according to the watermark. For some streams of Watermark, if ideltimeout has not been updated, the Recover module appends partitions. When the end of each partition is reached, it adds delay time to scan the watermark of all streams, and thus bottoms out.

During the transfer process, when the Flink job restarts, a wave of zombie files will be encountered, and we are cleaning up and deleting the zombie files before the entire partition notification through the commit node of the DAG to achieve the cleaning of the entire zombie file, which is some optimization at the non-functional level.

Some engineering practices

in the direction of Flink and AI

1. Architecture evolution timeline

The following figure shows a complete timeline of AI direction in real-time architecture.

  • As early as 2018, many algorithmists’ experimental development was workshop-style. Each algorithm person will choose different languages to develop different experimental projects according to their familiar language, such as Python, PHP or C++. It is very expensive to maintain and prone to failure;

    in the first half of

  • 2019, mainly based on Flink to provide the jar package mode to do some engineering support for the entire algorithm, it can be said that in the early part of the entire first half of the year, in fact, more around stability, versatility to do some support;

  • In the second half of 2019, the threshold for model training was greatly reduced through self-developed BSQL, and the real-time solution of labels and instances improved the efficiency of the entire experimental iteration;

  • In the first half of 2020, it is more about the calculation of the entire feature, the opening up of flow batch calculation, and the improvement of feature engineering efficiency to make some improvements;

  • By the second half of 2020, it will be more about the flow of the entire experiment and the introduction of AIFlow, which is convenient to do flow batch DAG.

2. AI engineering

architecture review review of the entire AI project, its early architecture

diagram actually reflects the entire AI architecture view in early 2019, its essence is through some single tasks, various mixed languages to form some computing nodes, to support the entire model training link pull-up. After an iteration in 2019, the entire near-line training was completely replaced with the BSQL pattern for development and iteration.

3. The

pain point of the current situation

At the end of 2019, some new problems were actually encountered, which were mainly concentrated in two dimensions: functional and non-functional.

    • first from label to generate instance stream, and to model training, to online prediction, and even the real experimental effect, the whole link is very long and complex;

    • Second, the

    • integration of the entire real-time feature, offline feature, and stream batch involves a lot of job composition, and the entire link is very complicated. At the same time, both the experiment and the online must do the calculation of features, and the inconsistent results will cause problems with the final effect. In addition, it is not easy to find the existence of the feature, and there is no way to trace it.

  • At the non-functional level, algorithm students often encounter that they don’t know what Checkpoint is, whether to open it, and what configuration they have. In addition, it is not easy to troubleshoot when there is a problem on the line, and the entire link is very long.

      so the

    • third point is that the complete experimental progress needs to involve a lot of resources, but for the algorithm it does not know what these resources are and how much they need. These problems actually cause great confusion about algorithms.

4. Pain points boil down to

three aspects:

    > The first is the issue of consistency. From data preprocessing, to model training, to prediction, all links are actually faulty. This includes data inconsistencies and calculation logic inconsistencies;

    > Second, the entire experimental iteration is very slow. A complete experimental link, in fact, for the algorithm student, he needs to master a lot of things. At the same time, the materials behind the experiment cannot be shared. For example, some features must be repeatedly developed behind each experiment;

The complete experimental link is actually composed of a real-time project and an offline engineering link, and it is difficult to troubleshoot online problems.

5. The prototype of real-time AI engineering Under such pain points, in the past 20 years, it has mainly focused on the direction of AI to create the prototype of real-time engineering.

The core is to make breakthroughs through the following three aspects.

  • the first is in some capabilities of BSQL, for algorithms, it is hoped to reduce engineering investment by developing them for SQL;

  • the second is feature engineering, which will solve some problems of feature calculation through the core to meet some support of features;

    the third is the collaboration of the entire experiment, the purpose of the algorithm is actually to experiment,

  • hoping to create a set of end-to-end experimental collaboration, and finally hope to achieve algorithm-oriented “one-click experiment”.

6. Feature Engineering – Difficulties

We have encountered some difficulties in feature engineering.

  • the first is in real-time feature calculation, because it needs to use the results to the entire online prediction service, so it has very high requirements for latency and stability;

    the second is that the entire real-time and offline calculation logic

  • is consistent, we often encounter a real-time feature, it needs to go back to the past 30 days to 60 days of offline data, how to achieve the real-time feature calculation logic can also be reused in the calculation of offline features;

  • the third is that it is difficult to get through the flow batch integration of the entire offline feature. The calculation logic of real-time features often has some streaming concepts such as window timing, but offline features do not have these semantics.

7. Real-time features Here look at how we do

real-time features

, the right side of the figure is the most typical scene. For example, I want to count the user’s last minute, 6 hours, 12 hours, and 24 hours in real time, and the number of times the main related video of each UP is played. For such a scenario, there are actually two points in it:

    > First, it needs to use a sliding window to calculate the entire user’s past history. In addition, in the sliding calculation process of data, it also needs to associate some basic information dimension tables of the UP master to obtain some videos of the UP master to count his playbacks. In the final analysis, I actually encountered two relatively large pains.

    • using Flink’s native sliding window, minute-level sliding will lead to more windows and greater performance loss.

    • At the same time, the fine-grained window will also lead to too many timers, and the cleaning efficiency is relatively poor.

  • the second is dimension table query, which encounters multiple keys to query multiple corresponding values of HBASE, which needs to support concurrent query of arrays.

Under the two pain points, for sliding windows, it is mainly transformed into the mode of Group By, plus the mode of agg’s UDF, and some window data of the entire hour, six hours, twelve hours, and twenty-four hours is stored in the entire Rocksdb. In this way, through the UDF mode, the entire data trigger mechanism can achieve record-level triggering based on Group By, and the entire semantics and timeliness will be greatly improved. At the same time, in the UDF function of the entire AGG, Rocksdb is used to do the state, and the data life cycle is maintained in the UDF. It also extends the entire SQL implementation of array-level dimensional table queries. The final effect can actually support various computing scenarios in the direction of real-time features through the mode of a large window.

8. Features – offline Next look at offline, the upper part of the left view is a complete


feature computing link, it can be seen that to solve the same SQL, in the offline calculation can also be reused, then you need to solve the corresponding some of the calculated IO can be reused. For example, data input is performed through Kafka on streaming, and HDFS is required offline. In terms of streaming, it is supported by some kv engines such as KFC or AVBase, and it needs to be solved by the hive engine offline, in the final analysis, it is actually necessary to solve three problems:

    > First, it is necessary to simulate the ability of the entire streaming consumption, which can support the consumption of HDFS data in offline scenarios;

    >Second, it is necessary to solve the problem of partitioning order in the consumption process of HDFS data, similar to Kafka’s partitioned consumption;

    > Third, it is necessary to simulate the dimensionalized consumption of the KV engine to achieve hive-based dimensional table consumption. It is also necessary to solve a problem, when each record pulled from HDFS, each record actually consumes the hive table when there is a corresponding snapshot, which is equivalent to the timestamp of each piece of data, to consume the partition of the corresponding data timestamp.

9. Optimization

9.1 Offline – Partition order

The partitioning order scheme is actually mainly based on the data when it falls into HDFS, and some transformations have been made in front. First of all, before the data falls into HDFS, it is the pipeline that transmits and consumes the data through Kafka. After Flink’s job pulls data from Kafka, the watermark of the data is extracted through Eventtime, and the concurrency of each Kafka Source will be reported to the GlobalWatermark module in the JobManager, and GlobalAgg will summarize the progress of each concurrency watermark to count Progress on GlobalWatermark. According to the progress of GlobalWatermark, the problem of calculating which concurrency is too fast in the watermark calculation, so as to send control information to Kafka Source through GlobalAgg, Kafka Source is too fast in concurrency, and its entire partition is slowed down. In this way, in the HDFS Sink module, the entire event time of the data record received on the same time slice is basically orderly, and finally falls to HDFS will also identify its corresponding partition and corresponding time slice range on the file name. Finally, under the HDFS partition directory, you can implement the ordered directory of the data partition.

■ 9.2 Offline-partition incremental consumption

After the data is ordered in HDFS increments, HDFStreamingSource is implemented, which will make Fecher partitions for files, and there are Fecher threads for each file, and each Fecher thread will count each file. It offset the progress of the cursor and updates the state to the State according to the Checkpoint process.

In this way, the orderly promotion of the entire file consumption can be achieved. When going back to historical data, offline jobs involve stopping the entire job. In fact, it is to introduce a partition end identifier in the entire FileFetcher module, and when each thread counts each partition, it senses the end of its partition, and the state after the partition is finally summarized to the cancellation manager, and further summarizes to the Job Manager to update the progress of the global partition, when all the global partitions are at the end of the cursor, the entire Flink job will be canceled Turn it off.

9.3 Offline – Snapshot tables

As mentioned earlier, the entire offline data, in fact, the data is on hive, the entire table field information of hive’s HDFS table data will be very much, but when actually doing offline features, the information required is actually very small, so you need to do offline field clipping in the process of hive, clean an ODS table into DW table, DW table will finally run Job through Flink, there will be a reload scheduler inside, It will periodically pull the table information corresponding to each partition in the hive according to the Watermark partition currently advanced by the data. By downloading some data in the hive directory of an HDFS, it will finally reload into a Rocksdb file in the entire memory, and Rocksdb is actually the last component used to provide dimension table KV queries.

The component will contain multiple Rocksdb build processes, mainly depending on the Eventtime in the

entire data flow process, if you find that the Eventtime has been advanced to the end of the hourly partition, it will actively reload through the lazy loading mode to build the next hour Rocksdb partition, in this way, to switch the entire Rocksdb read.



Based on the above three optimizations, namely partition ordered increment, Kafka-like partition fetch consumption, and dimension table snapshot, the experimental stream batch integration finally realizes real-time features and offline features, shares a set of SQL schemes, and opens up the stream batch calculation of features. Next, let’s take a look at the entire experiment, the complete flow batch integrated link, it can be seen from the figure that the top granularity is the complete calculation process of the entire offline. The second is the entire near-line process, and the semantics of the calculations used in the offline process are completely consistent with the semantics of real-time consumption in the near-line process, and they are all used to provide SQL calculations with Flink.

Let’s take a look at the near line, in fact, Label join uses a Kafka clickstream and display stream, and to the entire offline computing link, it uses an HDFS click directory and HDFS display directory. Feature data processing is the same, using Kafka’s playback data in real time, as well as some manuscript data from Hbase. For offline, Hive’s manuscript data and Hive’s playback data are used. In addition to the entire offline and nearline flow batch opening, the real-time data effect generated by the entire nearline is summarized to the OLAP engine, and the entire real-time indicator visualization is provided through superset. In fact, it can be seen from the figure that the complete complex flow batch integrated computing link, when the included computing nodes are very complex and voluminous.

11. Experimental Collaboration – Challenges

The next stage of the challenge is more in experimental collaboration, and the following figure is an abstraction that simplifies the entire link before. As can be seen from the figure, in the three dotted area boxes, they are offline links plus two real-time links, and the three complete links constitute the flow batch of the job, which is actually the most basic process of a workflow. It is necessary to complete the complete abstraction of the workflow, including the driving mechanism of stream batch events, and, for algorithms in the field of AI, more hope to use Python to define the complete flow, in addition to the entire input, output and its entire calculation tend to template, which can facilitate the cloning of the entire experiment.

12. The introduction of AIFlow The whole workflow is more about cooperation with the community in the second half of the year, and the whole set of


solutions is introduced.

On the right is actually the DAG view of the entire AIFlow complete link, you can see the entire node, in fact, the type it supports is not limited, it can be a streaming node or an offline node. In addition, the entire node-to-node communication edge can be data-driven as well as event-driven. The main benefit of introducing AIFlow is that AIFlow provides Python semantics to facilitate the definition of complete AIFlow workflows, as well as the scheduling of the progress of the entire workflow.

On the edge of the node, he also supports the entire event-driven mechanism compared to some native industry Flow solutions. The advantage is that it can help to send an event-driven message between two Flink jobs through the progress of the watermark processing data partition in Flink to pull up the next offline or real-time job.

In addition, it also supports some supporting services around the area, including some message module services for notifications, metadata services, and some model center services in the AI field.

13. Python defines Flow

Let’s take a look at how AIFlow is ultimately defined as a workflow for Python. The view on the right defines the complete workflow of an online project. The first is the definition of the entire Spark job, which describes the entire downstream dependency by configuring dependency, and it will send an event-driven message to pull up the following Flink streaming job. Streaming jobs can also pull up the following Spark jobs in a message-driven way. The definition of the whole semantics is very simple, only four steps, configure the information of each node’s confg, and define the behavior of each node’s operation, as well as its dependency dependencies, and finally run the topology view of the entire flow.

14. Event-driven flow batches

Next, take a look at the driving mechanism of the complete flow batch scheduling, and on the right side of the figure below is the driving view of the complete three worker nodes. The first is from Source to SQL to Sink. The yellow box introduced is the extended supervisor, who can collect global watermark progress. When the entire streaming job finds that the watermark can advance to the next hour of partitioning, it sends a message to the NotifyService. After the NotifyService gets this message, it will send it to the next job, and the next job will mainly introduce the flow operator in the entire Flink DAG, and the operator will block the entire job before receiving the previous job message. Until the message driver is received, it means that the upstream partition has actually been completed in the last hour, and the next flow node can drive up and run. Similarly, the next workflow node introduces a module of GlobalWatermark Collector to summarize the progress of collecting its processing. When the partition is completed in the previous hour, it will also send a message to the NotifyService, and the NotifyService will drive the module that calls AIScheduler, so as to pull up the spark offline job to do the finishing of the spark offline. As you can see, the entire link actually supports four scenarios: batch-to-batch, batch-to-flow, flow-to-flow, and flow-to-batch.

15. Based on the entire flow definition and scheduling of streams and batches, the

prototype of real-time AI full-link


initially constructed in 2020, and the core is experiment-oriented. Algorithm students can also develop node nodes based on SQL, Python is able to define complete DAG workflows. Monitoring, alerting, and O&M are integrated.

At the same time, it supports communication from offline to real-time, from data processing to model training, from model training to experimental effect, and end-to-end communication. On the right is the link for the entire nearline experiment. The following is a service that provides the material data produced by the entire experimental link to online predictive training. The whole will have three aspects of support:

    > one is some basic platform functions, including experiment management, model management, feature management and so on;

    > followed by some services under the entire AIFlow layer;

  • and then there are some platform-level metadata metadata services.

Fourth, some future prospects

In the coming year, we will focus more on some work in two aspects.


  • first is the direction of the data lake, which will focus on some incremental computing scenarios from ODS to DW layer, and breakthroughs in some scenarios from DW to ADS layer, and the core will be combined with Flink plus Iceberg and HUDI as the landing in this direction.

    on the real-time AI platform,

  • it will further provide a set of real-time AI collaboration platform for experiments, the core is to create an efficient engineering platform that can refine and simplify algorithms.