Editor’s note (Ali Mei): During this year’s Double 11, peak real-time computing traffic reached a record 4 billion records per second, and data volume peaked at 7 TB per second. This article gives the first in-depth account of how “stream-batch integration” was put into practice in Alibaba’s core data scenarios, and reviews how stream-batch integrated big data processing technology has developed.

When the clock struck 12 on November 11, the GMV figure for Double 11 in 2020 was frozen at 498.2 billion. That figure was computed by Flink’s real-time computing technology, and Alibaba’s Flink-based real-time computing platform once again passed its annual exam, smoothly supporting the real-time data tasks of the whole Alibaba economy during this year’s Double 11.
Beyond the GMV media dashboard, Flink also supports important services such as real-time machine learning for search and recommendation, real-time advertising anti-fraud, real-time tracking and feedback of Cainiao order status, real-time attack detection for cloud servers, and monitoring and alerting for a large amount of infrastructure. Real-time business and data volume grow significantly every year. This year’s peak reached 4 billion records per second and 7 TB of data per second, equivalent to reading 5 million Xinhua Dictionaries in one second.
To date we run more than 35,000 real-time computing jobs, and the total computing scale of the cluster exceeds 1.5 million cores, a leading level both in China and worldwide. Flink now supports all real-time computing needs of the Alibaba economy, makes the full data pipeline real-time, and delivers the value of data to consumers, merchants, and operators at the earliest possible moment.

However, the value brought by Flink’s technology evolution this year goes beyond that. Flink-based stream-batch integrated data applications have also begun to emerge in Alibaba’s core data business scenarios, and have withstood rigorous production tests of stability, performance, and efficiency.


“Stream-batch integration” has been implemented for the first time in Alibaba’s core data scenarios
In fact, Flink’s stream-batch integration technology has been used inside Alibaba for a long time. Flink’s development at Alibaba began with the search and recommendation scenario, so search engine index construction and machine learning feature engineering were already built on Flink’s stream-batch integrated architecture. On Double 11 this year, Flink went a step further: its unified stream and batch computing capability helped the data middle office achieve more accurate cross-analysis of real-time and offline data, and better business decisions.
Alibaba’s data reports are divided into real-time and offline reports. Real-time reports are particularly useful in scenarios such as Double 11: they give merchants, operations staff, and management real-time data in various dimensions, helping them make timely decisions and improving platform and business efficiency. In a typical real-time marketing analysis scenario, for example, operations staff and decision makers need to compare the results for a given time period on the day of the promotion with a historical period (such as turnover at 10 o’clock on the day of the promotion versus turnover at 10 o’clock yesterday), in order to judge the current marketing effect and decide whether and how to adjust strategy.

The marketing analysis scenario above actually requires two sets of results. One is an offline data report computed every night with batch technology; the other is a real-time report for the current day computed with stream processing technology. The real-time and historical data are then compared and analyzed, and decisions are made based on the comparison. When the offline and real-time reports are produced by two different computing engines, one batch and one stream, the separated architecture not only doubles the development cost but also raises the problem of aligning data logic and metric definitions, and it is difficult to guarantee that results developed on the two technology stacks are consistent. The ideal solution is therefore to use a single stream-batch integrated computing engine for data analysis, so that offline and real-time reports are naturally consistent.

Given the growing maturity of Flink’s stream-batch integrated computing technology and its earlier success in the search and recommendation scenario, the Double 11 data platform development team showed firm confidence and trust this year. Working side by side with the Flink real-time computing team, they jointly pushed forward the technical upgrade of the real-time computing platform, and for the first time Flink-based stream-batch integrated data processing was successfully deployed in the core data scenarios of Double 11.
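
The consistency argument can be illustrated with a minimal sketch in plain Python. This is not Flink code: the function names and the sample orders are hypothetical, chosen only for illustration. A single definition of the metric is applied both to a bounded input (yesterday’s complete order list) and to an input consumed record by record (today’s orders as they arrive), so the offline and real-time reports cannot diverge in their definition.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Order = Tuple[int, float]  # (hour of day, amount); hypothetical record shape


def turnover_by_hour(orders: Iterable[Order]) -> Iterator[Dict[int, float]]:
    """Single definition of the metric, shared by both reports.

    Yields a snapshot of the running totals after every order, so a
    streaming consumer can read intermediate results and a batch consumer
    can simply keep the last snapshot.
    """
    totals: Dict[int, float] = defaultdict(float)
    for hour, amount in orders:
        totals[hour] += amount
        yield dict(totals)


def batch_report(orders: Iterable[Order]) -> Dict[int, float]:
    """Offline report: consume the bounded input and keep the final result."""
    result: Dict[int, float] = {}
    for result in turnover_by_hour(orders):
        pass
    return result


# Hypothetical sample data.
yesterday = [(9, 120.0), (10, 80.0), (10, 40.0)]
today_so_far = iter([(9, 150.0), (10, 100.0), (10, 90.0)])

offline = batch_report(yesterday)

realtime: Dict[int, float] = {}
for realtime in turnover_by_hour(today_so_far):  # refreshed as each order arrives
    pass

# Compare turnover at 10 o'clock today against 10 o'clock yesterday.
ratio_at_10 = realtime[10] / offline[10]
```

Because both reports call the same `turnover_by_hour`, a change to the metric definition is made once and applies to both automatically.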

This year, the stream-batch integrated computing framework jointly promoted by the Flink team and the data platform team made a successful debut in the core data scenarios of Double 11. It was also recognized at the business level by Peng Xinyu, head of the Alibaba Data Middle Office: stream-batch integration lets a single set of code serve multiple computing modes. In terms of computing speed, it is 1 times faster than other frameworks and queries are 4 times faster, which speeds up report building for Xiaoer (Alibaba’s operations staff) by 4 to 10 times. At the same time, because one engine handles both modes, real-time and offline data are fully consistent.

Beyond the gains in development efficiency and computing performance, the stream-batch integrated architecture has also greatly improved cluster resource utilization. After rapid expansion in recent years, Alibaba’s Flink real-time cluster has reached the scale of a million CPU cores, with tens of thousands of Flink real-time computing jobs running on it. Real-time data services peak during the day, while at night, during off-peak hours, computing resources sit idle, which provides free capacity for offline batch jobs. With one engine for both batch and stream running on one resource pool, peak shaving, valley filling, and workload colocation come naturally. This saves not only development costs but also a great deal of operations and resource costs. On Double 11 this year, the Flink-based stream-batch integrated data services did not apply for any additional resources: batch mode fully reused the Flink real-time computing clusters, greatly improving cluster utilization, saving business teams substantial resource expenses, and providing fertile ground for more business innovation in the future.
“Stream-batch integration”: Flink has spent ten years sharpening one sword

Next, let’s look at how “stream-batch integrated” big data processing technology developed, from a technical point of view. The story starts with Hadoop, the originator of open source big data technology. Hadoop appeared more than 10 years ago as the first generation of open source big data technology: MapReduce, the first-generation batch technology, solved the problem of large-scale data processing, and Hive later allowed users to run computations over large-scale data with SQL. As big data business scenarios developed, however, many applications in areas such as social media, e-commerce transactions, and financial risk control came to demand increasingly real-time data. Against this background, Storm emerged as the first generation of big data stream processing technology. Storm’s architecture is completely different from that of Hadoop and Hive: it is a purely message-based streaming computing model that can process massive data concurrently with millisecond latency, so it made up for the poor timeliness of Hadoop MapReduce and Hive. Big data computing thus had a distinct mainstream engine in each of the batch and stream directions, and this clear division marked the first era of big data processing technology.
Big data processing technology then entered its second era, in which two computing engines appeared: Spark and Flink. Compared with Hadoop and Hive, Spark offers more complete batch expressiveness and better performance, which made the Spark community grow rapidly and gradually overtake Hadoop and Hive as the mainstream in batch processing. Spark did not stop at batch, however. It soon launched a stream computing solution, Spark Streaming, and kept improving it. Still, Spark’s core engine is built around the concept of “batch processing”. It is not a pure streaming engine, so it cannot deliver the best possible stream-batch integration experience on issues such as timeliness. Even so, Spark’s idea of implementing both stream and batch semantics on one core engine was very advanced, and another new engine, Flink, shares the same idea.

Flink was officially released slightly later than Spark, but its predecessor, Stratosphere, was a 2009 research project at the Technical University of Berlin in Germany, so it has about 10 years of history. Flink’s philosophy and goal are likewise to support both streaming and batch computing with a single engine, but it chose a different implementation route from Spark. Flink chose an engine architecture built for “stream processing” and treats a “batch” as a “finite stream”. Building stream-batch integration on a stream-centric engine is more natural and avoids architectural bottlenecks. In short, Flink chose a “batch on streaming” architecture, in contrast to Spark’s “streaming on batch” architecture.
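
The idea that a batch is a finite stream can be sketched in a few lines of plain Python. This is only an illustration of the concept, not Flink’s implementation; `running_sum` and `unbounded_source` are hypothetical names. The same operator is written once against a stream of events; feeding it a source that never ends gives streaming behavior, and feeding it a source that ends gives a batch result.

```python
import itertools
from typing import Iterable, Iterator


def running_sum(events: Iterable[int]) -> Iterator[int]:
    """A stream operator: emits an updated sum for every incoming event."""
    total = 0
    for value in events:
        total += value
        yield total


def unbounded_source() -> Iterator[int]:
    """Hypothetical endless stream: 1, 2, 3, ..."""
    return itertools.count(1)


bounded_source = [1, 2, 3, 4]  # a "batch" is just a stream that ends

# Stream mode: results are emitted continuously; here we look at the first four.
stream_outputs = list(itertools.islice(running_sum(unbounded_source()), 4))

# Batch mode: same operator; the last emitted value is the final result.
batch_outputs = list(running_sum(bounded_source))
batch_result = batch_outputs[-1]
```

The operator itself never needs to know whether its input is bounded. Only the consumer decides whether to read intermediate results or wait for the final one.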

Flink’s complete stream-batch integrated architecture was not built overnight. In early versions, Flink’s stream and batch processing were not yet fully unified in either the API or the runtime. Starting with version 1.9, however, Flink accelerated its work on stream-batch integration. Flink SQL, the API most widely used by users, was the first to implement unified stream and batch semantics, so users only need to learn one set of SQL to develop both stream and batch jobs, which greatly reduces development costs.

But SQL does not meet every user need. Highly customized jobs, such as those that manipulate state storage directly, still need the DataStream API. In common business scenarios, after writing a stream computing job, users usually also prepare an offline job to backfill historical data in batch. Although DataStream can meet the various needs of stream computing, it lacked efficient support for batch processing.
Therefore, after completing the stream-batch integration upgrade of SQL, the Flink community has invested heavily since version 1.11 in improving the stream-batch integration capability of DataStream. Batch semantics are being added to the DataStream API, and, combined with the design of stream-batch integrated connectors, the DataStream API can connect to different types of streaming and batch data sources, such as Kafka and HDFS, in combined stream-batch scenarios. Next, a stream-batch integrated iterative computing API will also be introduced into DataStream to unlock a series of machine learning scenarios.
In the current major version of Flink, stream-batch integration in both the SQL and DataStream APIs means combining stream and batch computing capabilities: the user writes the code once and then chooses whether to run it in stream mode or batch mode. Some business scenarios impose a higher requirement, namely mixed stream and batch processing with automatic switching between the two. In data integration and data lake ingestion, for example, users want to first synchronize the full data of a database to HDFS or cloud storage, then automatically synchronize the incremental data from the database in real time, and perform mixed stream-batch ETL processing during synchronization. Flink will continue to support more intelligent combined stream-batch scenarios of this kind in the future.
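
The “full synchronization first, then incremental synchronization” pattern described above can be sketched in plain Python. This is a conceptual illustration only, not a Flink connector: the change-record shape, the function names, and the sample data are all hypothetical. A bounded snapshot phase and a change-log phase are chained into one input, and a single ETL path processes both, so the switch from batch to stream needs no separate job.

```python
import itertools
from typing import Dict, Iterable, Iterator, Tuple

Change = Tuple[str, str, int]  # (op, key, value); hypothetical change-record shape


def snapshot_as_changes(table: Dict[str, int]) -> Iterator[Change]:
    """Phase 1 (bounded): read the full table as a series of upserts."""
    for key, value in table.items():
        yield ("upsert", key, value)


def apply_changes(changes: Iterable[Change], sink: Dict[str, int]) -> None:
    """One ETL path for both phases: normalize the key, then write to the sink."""
    for op, key, value in changes:
        key = key.strip().lower()  # sample cleansing step
        if op == "delete":
            sink.pop(key, None)
        else:
            sink[key] = value


# Hypothetical source table and the changes that arrive after the snapshot.
source_table = {"Apple ": 3, "pear": 5}
incremental = [("upsert", "pear", 6), ("delete", "apple", 0), ("upsert", "Kiwi", 1)]

sink: Dict[str, int] = {}
# chain() moves on to the change log once the bounded snapshot is exhausted.
apply_changes(itertools.chain(snapshot_as_changes(source_table), incremental), sink)
```

In a real deployment the incremental phase would be an unbounded change stream rather than a list, but the structure is the same: one pipeline, with the source deciding when the bounded phase ends.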
The development of Flink’s “stream-batch integration” technology at Alibaba
Alibaba was the first company in China to adopt Flink open source technology. In 2015, my search and recommendation team wanted to choose a new big data computing engine for the next 5 to 10 years, to process the massive product and user data behind search and recommendation. Because the e-commerce industry places very high demands on timeliness, we wanted the new engine to offer both large-scale batch processing and millisecond-level real-time processing, that is, a unified stream-batch engine. At the time, the Spark ecosystem was already mature and offered stream-batch integrated computing through Spark Streaming, while Flink had only become a top-level Apache project the year before and was still a rising star. After a period of research and discussion comparing Spark and Flink, the team agreed that although Flink’s ecosystem was not yet mature, its architecture, with stream processing at the core, was better suited to stream-batch integration. The team therefore quickly decided to improve and optimize open source Flink inside Alibaba and build a real-time computing platform for search and recommendation on it.
After a year of hard work by the team, the Flink-based real-time computing platform for search and recommendation successfully supported Double 11 in 2016, keeping the search and recommendation pipeline real-time. With Flink proven in one of Alibaba’s core business scenarios, the whole group came to know the Flink real-time computing engine and decided to migrate the group’s entire real-time data business to the Flink real-time computing platform. After another year of hard work, Flink lived up to expectations on Double 11 in 2017 and smoothly supported the real-time data business of the whole group, including the most critical data scenarios such as the GMV big screen.
In 2018, Flink began to move to the cloud, and Alibaba Cloud launched a Flink-based real-time computing product aimed at providing cloud computing services to small and medium-sized enterprises. Alibaba wants not only to use Flink to solve its own business problems, but also to help the Flink open source community develop faster and to contribute more to the open source technology community. For that reason, in early 2019 Alibaba acquired Ververica, the company founded by Flink’s creators, together with its team, and began to invest more resources in the Flink ecosystem and community. By 2020, almost all mainstream technology companies in China and abroad had chosen Flink as their real-time computing solution, and Flink had become the de facto standard for real-time computing in the big data industry.

Going forward, the Flink community will not stop innovating, and stream-batch integration has moved from theory to production in Alibaba’s business scenarios. On Double 11 in 2020, Flink’s stream-batch integration technology performed excellently in the core system for Tmall marketing decisions. Together with the stream-batch integrated index building and machine learning pipelines that were already running successfully in search and recommendation, this fully validates the bold choice of the Flink technology stack that we made 5 years ago. I believe we will see Flink’s stream-batch integration technology deployed at more companies in the future.
The technological innovation of “stream-batch integration” has driven the vigorous growth of the Flink open source community

Flink’s commitment to innovating in stream-batch integration has naturally driven the rapid growth of its open source community and the accelerated prosperity of its ecosystem. We are pleased to see that as more companies in China adopt Flink, the Chinese community is becoming increasingly influential and is gradually overtaking communities abroad to become the mainstream.
The first and most obvious sign is the growth in the number of users: since June of this year, Flink’s Chinese mailing list has begun to surpass the English mailing list. The influx of users into the Flink community has brought in more excellent code contributors, which has effectively sped up the development of the Flink engine.
Since version 1.8.0, the number of contributors to each Flink release has kept increasing, and most of them come from major Chinese companies. There is no doubt that developers and users from China have gradually become the backbone of Flink’s development.

With the Chinese community growing, Flink’s overall activity has continued unabated compared with 2019. In the Apache Software Foundation’s fiscal year 2020 report, Flink was named the most active project of the year (measured by activity on the user and dev mailing lists). Flink also ranked second in both code commits and GitHub homepage traffic. That is no easy feat among the Apache Software Foundation’s nearly 350 top-level projects.
Flink Forward Asia 2020: “stream-batch integration” technology revealed
Flink Forward is the Flink technology conference officially authorized by Apache. This year’s Flink Forward Asia (hereinafter FFA) is held as an online live broadcast, offering developers a free feast of open source big data technology. You can watch online as first-tier Internet companies in China and abroad, including Alibaba, Ant Technology, Tencent, ByteDance, Meituan, Xiaomi, Kuaishou, Bilibili, NetEase, Weibo, Intel, Dell EMC, and LinkedIn, share their technical practices and innovations with Flink.

The head of Tmall Data Technology will share how Flink’s stream-batch integration technology has been practiced and deployed at Alibaba, showing how it delivers business value in the most critical scenarios of Double 11. Flink PMC members and committers from Alibaba and ByteDance will give in-depth technical talks on Flink’s stream-batch integrated SQL and runtime, bringing you the latest technical progress of the Flink community. Game technology experts from Tencent will present how Flink is applied in the national hit game Honor of Kings; the head of real-time big data at Meituan will explain how Flink helps make local life service scenarios real-time; the head of big data at Kuaishou will walk through the past and present of Flink at Kuaishou; and machine learning experts from Weibo will show how to use Flink for information recommendation. In addition, Flink-related topics cover finance, banking, logistics, automobile manufacturing, travel, and other industries, presenting a thriving ecosystem in which a hundred flowers bloom. Developers who are passionate about open source big data technology are welcome to attend this year’s Flink Forward Asia conference to learn more about the latest developments and innovations in the Flink community. Official website of the conference:
http://flink-forward.org.cn

