In InfoQ's editorial outlook on technology trends at the beginning of 2021, we noted that the big data field would accelerate its embrace of a new direction of "convergence" (or "integration"). The essence of this trend is to reduce the technical complexity and cost of big data analytics while meeting higher requirements for performance and ease of use. Today, we see the popular stream processing engine Apache Flink (hereafter Flink) taking another step along this path.
On the morning of January 8, Flink Forward Asia 2021 kicked off online. This marks the fourth year that Flink Forward Asia (FFA) has been held in China and the seventh year that Flink has been a top-level project of the Apache Software Foundation. As the real-time wave has developed and deepened, Flink has gradually become the leading engine and de facto standard for stream processing. Looking back on its evolution, Flink has on the one hand kept optimizing its core stream computing capabilities, raising the industry's bar for stream processing, and on the other hand gradually driven architectural transformation and new application scenarios under the idea of stream-batch unification. But beyond these, Flink's long-term development needs a new breakthrough.
In his Flink Forward Asia 2021 keynote, Wang Feng (community alias Mo Wen), founder of the Apache Flink Chinese community and head of Alibaba's open source big data platform, focused on Flink's latest progress in evolving and implementing the stream-batch unified architecture, and proposed Flink's next development direction: the Streaming Warehouse (Streamhouse). As the keynote title "Flink Next, Beyond Stream Processing" suggests, Flink should move from stream processing toward the streaming warehouse, covering larger scenarios and helping developers solve more problems. Achieving the streaming warehouse means the Flink community must extend Flink with storage suited to stream-batch unification. This is Flink's technical innovation for the year: the related community work was launched in October and will be promoted as a key direction for the Flink community in the coming year.
So how should we understand the streaming data warehouse? What problems does it aim to solve that existing data architectures cannot? Why did Flink choose this direction? What will its implementation path look like? With these questions in mind, InfoQ conducted an exclusive interview with Mo Wen to learn more about the thinking behind the streaming data warehouse.
In recent years, Flink has repeatedly emphasized stream-batch unification: using one set of APIs and one development paradigm to implement both stream and batch computing over big data, ensuring consistency of both the processing and its results. Mo Wen said that stream-batch unification is more a technical concept and capability than a solution in itself; it does not solve any user problem on its own, and only when it lands in real business scenarios can it demonstrate its value in development and operational efficiency. The streaming data warehouse can be understood as the concrete landing solution under the general direction of stream-batch unification.
At last year's FFA, we saw the landing of Flink's stream-batch unification on Tmall's Double 11, the first time Alibaba had truly implemented stream-batch unification at scale for its core data business. A year later, Flink's stream-batch unification has made new progress in both technical architecture and production deployments.
On the technical side, the API and architecture transformation for Flink's stream-batch unification has been completed. On top of the already unified stream-batch SQL, the DataStream and DataSet APIs have been further consolidated into a complete, Java-semantics stream-batch unified API, so that architecturally a single set of code can connect to both stream storage and batch storage at the same time.
Flink 1.14, released in October this year, already supports mixing bounded and unbounded streams in the same application: Flink now supports checkpointing for applications that are partially running and partially finished, where some operators have reached the end of their bounded input streams. In addition, Flink triggers a final checkpoint when it reaches the end of a bounded stream, ensuring that all results are committed smoothly to the sink.
Batch execution mode now supports mixing the DataStream API and the SQL/Table API in the same application (previously, a program could use only one of the two).
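To make this concrete, here is a minimal sketch, assuming Flink 1.14 with the Table planner on the classpath (the class name and sample data are invented for illustration): the same DataStream pipeline is handed to the Table API inside one application, and a single setRuntimeMode call flips it between batch and streaming execution. For the mixed bounded/unbounded checkpointing described above, 1.14 also introduces the opt-in setting execution.checkpointing.checkpoints-after-tasks-finish.enabled.

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class MixedApiBatchJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();
        // One API, two execution modes: BATCH here because the input is
        // bounded; switch to STREAMING for unbounded sources with no
        // other code changes.
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // DataStream side of the pipeline (a stand-in bounded source)...
        DataStream<String> words =
            env.fromElements("flink", "stream", "batch", "flink");

        // ...handed to SQL/Table in the same application, which the 1.14
        // batch execution mode now permits.
        Table table = tEnv.fromDataStream(words).as("word");
        tEnv.createTemporaryView("words", table);
        tEnv.executeSql(
            "SELECT word, COUNT(*) AS cnt FROM words GROUP BY word").print();
    }
}
```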
In addition, Flink unified the Source and Sink APIs and began consolidating the connector ecosystem around them. The new Hybrid Source can transition across multiple storage systems, enabling operations such as reading historical data from Amazon S3 and then switching seamlessly to Apache Kafka.
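As a hedged illustration of that handover, the sketch below chains a bounded FileSource over an S3 path into an unbounded KafkaSource via HybridSource; the bucket, topic, and broker address are invented placeholders, and the file and Kafka connector dependencies are assumed to be on the classpath.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.source.hybrid.HybridSource;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineFormat;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class HybridSourceSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // Bounded history: line-oriented files previously archived to S3.
        FileSource<String> history = FileSource
            .forRecordStreamFormat(
                new TextLineFormat(), new Path("s3://my-bucket/archive/"))
            .build();

        // Unbounded tail: the live Kafka topic the archive was drained from.
        KafkaSource<String> live = KafkaSource.<String>builder()
            .setBootstrapServers("broker:9092")
            .setTopics("events")
            .setValueOnlyDeserializer(new SimpleStringSchema())
            .build();

        // HybridSource reads the file source to completion, then switches
        // to Kafka without the application noticing the seam.
        HybridSource<String> source = HybridSource.builder(history)
            .addSource(live)
            .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "s3-then-kafka")
           .print();
        env.execute("hybrid-source-sketch");
    }
}
```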
At the application level, two important new scenarios have also emerged.
The first is unified full and incremental data integration based on Flink CDC.
Data integration and synchronization between different data sources is a pressing need for many teams, but traditional solutions are often complex and lack timeliness. They typically use two separate technology stacks for offline and real-time integration, involving many synchronization tools such as Sqoop and DataX. These tools handle either full or incremental data only, so developers must orchestrate the switch from full to incremental themselves, which is complicated.
Based on Flink's stream-batch unification capability and Flink CDC, you only need to write one SQL statement: historical data is first synchronized in full, then Flink automatically switches over to incremental data with resumable transfer, achieving one-stop data integration. Flink switches between batch and stream automatically and guarantees data consistency, with no user judgment or intervention required.
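A sketch of that one-statement experience, assuming the flink-connector-mysql-cdc dependency is available; the table schema, hostname, and credentials below are invented placeholders, and the print connector stands in for a real downstream store.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class CdcOneStopSync {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
            EnvironmentSettings.newInstance().inStreamingMode().build());

        // Source: a hypothetical MySQL table. 'initial' takes a full
        // snapshot first, then follows the binlog, so the full-to-
        // incremental switch is automatic.
        tEnv.executeSql(
            "CREATE TABLE orders_src (" +
            "  order_id BIGINT," +
            "  amount   DECIMAL(10, 2)," +
            "  PRIMARY KEY (order_id) NOT ENFORCED" +
            ") WITH (" +
            "  'connector' = 'mysql-cdc'," +
            "  'hostname' = 'mysql-host', 'port' = '3306'," +
            "  'username' = 'flink', 'password' = '***'," +
            "  'database-name' = 'shop', 'table-name' = 'orders'," +
            "  'scan.startup.mode' = 'initial')");

        // Sink: print stands in for a real downstream system.
        tEnv.executeSql(
            "CREATE TABLE orders_dst (" +
            "  order_id BIGINT," +
            "  amount   DECIMAL(10, 2)," +
            "  PRIMARY KEY (order_id) NOT ENFORCED" +
            ") WITH ('connector' = 'print')");

        // One statement covers both phases; no manual cut-over needed.
        tEnv.executeSql("INSERT INTO orders_dst SELECT * FROM orders_src");
    }
}
```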
As an independent open source project, Flink CDC Connectors has maintained a rapid pace of development since it was open-sourced in July last year, averaging a release every two months. Flink CDC has been updated to version 2.1 and has completed adaptations for many mainstream databases, such as MySQL, PostgreSQL, MongoDB, and Oracle; adaptations for more databases such as TiDB and DB2 are in progress. More and more companies are using Flink CDC in their own business scenarios; XTransfer, which InfoQ recently interviewed, is one of them.
The second application scenario is the data warehouse, the most core scenario in the big data field.
At present, the mainstream architecture combining real-time and offline data warehouses usually looks like the figure below.
Most scenarios use Flink plus Kafka for real-time data stream processing (the real-time warehouse part) and write the final analysis results to an online serving layer for display or further analysis. Meanwhile, an offline data warehouse runs asynchronously in the background to supplement the real-time data, executing large-scale batch (or even full) analyses daily, or periodically correcting historical data.
However, this classic architecture has some obvious problems. First, the real-time and offline links use different technology stacks and require two sets of APIs, so two development processes are needed, which increases development costs. Second, because the stacks differ, the consistency of data semantics across the two links cannot be guaranteed. Third, the intermediate queue data in the real-time link is not amenable to analysis: if users want to analyze a detail layer in the real-time link, it is quite inconvenient. Many users currently export the detail-layer data first, for example into Hive for offline analysis, but that sacrifices timeliness; alternatively, to speed up queries, they import the data into another OLAP engine, but that increases system complexity and makes data consistency hard to guarantee.
Flink's concept of stream-batch unification can be fully applied in the above scenario. In Mo Wen's view, Flink can take the industry's current mainstream data warehouse architecture to the next level and achieve truly end-to-end, full-link real-time analysis: when data changes at the source, the change is captured and analyzed layer by layer, so all data flows in real time and all flowing data can be queried in real time. With Flink's complete stream-batch unification capabilities, the same set of APIs can also support flexible offline analysis. In this way, real-time analysis, offline analysis, and interactive short-query analysis can be unified into one complete solution: the ideal "Streaming Warehouse".
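The practical payoff is that one set of statements serves both links. The runnable sketch below uses the built-in datagen connector as a stand-in for a detail-layer table and executes identical DDL and a query twice, once in batch mode and once in streaming mode; only the EnvironmentSettings differ.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class SameQueryBothModes {
    // Identical DDL and query; only the execution mode passed in differs.
    static void run(EnvironmentSettings settings) {
        TableEnvironment tEnv = TableEnvironment.create(settings);
        tEnv.executeSql(
            "CREATE TABLE orders (region STRING, amount DOUBLE) WITH (" +
            "  'connector' = 'datagen'," +
            "  'number-of-rows' = '1000'," +      // bounded stand-in data
            "  'fields.region.length' = '1')");   // few distinct groups
        tEnv.executeSql(
            "SELECT region, SUM(amount) AS total FROM orders GROUP BY region")
            .print();
    }

    public static void main(String[] args) {
        // Offline link: bounded, one-shot result table.
        run(EnvironmentSettings.newInstance().inBatchMode().build());
        // Real-time link: the same statements emit a continuously
        // updating changelog (+I/-U/+U rows) instead of a final table.
        run(EnvironmentSettings.newInstance().inStreamingMode().build());
    }
}
```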

Streaming Warehouse means, more precisely, "making the data warehouse streaming": letting the data of the entire warehouse flow in real time, in pure streaming mode rather than mini-batch. The goal is an end-to-end, real-time pure Streaming Service in which a single set of APIs analyzes all flowing data. When source data changes, for example when a log from an online service or a database binlog is captured, it is analyzed according to predefined query or processing logic; the results land in one layer of the warehouse and then flow on to the next, until every layer of the warehouse is flowing and the data finally reaches an online system, at which point users see the fully real-time flow of the whole warehouse. In this process the data is active and the query is passive, with analysis driven by changes in the data. At the same time, in the vertical direction, users can issue active queries against any detail layer and obtain results in real time. Offline analysis scenarios remain compatible, and the API stays the same, achieving true unification.
At present, the industry has no mature end-to-end, fully streaming solution of this kind. There are pure streaming solutions and pure interactive query solutions, but users must combine the two themselves, which inevitably increases system complexity. What the streaming data warehouse must do is deliver high timeliness without further increasing system complexity, keeping the whole architecture simple for both developers and operators.
Of course, the streaming data warehouse is the end state, and reaching it requires Flink to have matching stream-batch unified storage. Flink itself ships with an embedded RocksDB as state storage, but that only solves the problem of storing the state of streaming data inside a single task. The streaming warehouse needs a table storage service that sits between computing tasks: one task writes data into it, a second task can read it back in real time, and a third can run user queries against it. Therefore, Flink needs to extend itself with storage that matches its own concept, stepping beyond state storage. To this end, the Flink community has proposed a new Dynamic Table Storage, a storage solution with stream-table duality.
Stream-batch unified storage: Flink Dynamic Table
Flink Dynamic Table (see FLIP-188 for the community discussion) can be understood as stream-batch unified storage seamlessly integrated with Flink SQL. Previously, Flink could only read and write external tables such as Kafka and HBase; now you can create a Dynamic Table with the same Flink SQL syntax used to create source and sink tables. All the layered data of the streaming warehouse can live in Flink Dynamic Tables, with the warehouse's layers connected in real time through Flink SQL: you can query and analyze the data of different detail layers in Dynamic Tables in real time, and also run batch ETL between layers.
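As a purely hypothetical sketch of the FLIP-188 direction (the DDL was still under community discussion at the time of writing, so the exact syntax and options may differ), creating a Dynamic Table looks like creating any other table, except that no 'connector' option is given, meaning the table is managed by Flink's own storage. The layer and column names below are invented.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class DynamicTableSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
            EnvironmentSettings.newInstance().inStreamingMode().build());

        // Hypothetical: per FLIP-188, a table declared WITHOUT a
        // 'connector' option is managed by Flink's own dynamic table
        // storage instead of mapping onto an external system such as
        // Kafka or HBase.
        tEnv.executeSql(
            "CREATE TABLE dwd_orders (" +
            "  order_id BIGINT," +
            "  region   STRING," +
            "  amount   DECIMAL(10, 2)," +
            "  PRIMARY KEY (order_id) NOT ENFORCED)");

        tEnv.executeSql(
            "CREATE TABLE dws_region_stats (" +
            "  region STRING," +
            "  total  DECIMAL(20, 2)," +
            "  PRIMARY KEY (region) NOT ENFORCED)");

        // Layer-to-layer ETL in plain Flink SQL: in streaming mode this
        // subscribes to dwd_orders' incremental changes and keeps the
        // next layer continuously up to date; in batch mode the same
        // statement would run as a one-shot full read.
        tEnv.executeSql(
            "INSERT INTO dws_region_stats " +
            "SELECT region, SUM(amount) FROM dwd_orders GROUP BY region");
    }
}
```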
In terms of data structure, a Dynamic Table has two core storage components: the File Store and the Log Store. As the names suggest, the File Store stores the table as files, adopting the classic LSM architecture and supporting streaming updates, deletes, and inserts; it uses an open columnar storage format to support optimizations such as compression, and it corresponds to Flink SQL's batch mode, supporting full batch reads. The Log Store stores the table's operation records as an immutable sequence; it corresponds to Flink SQL's streaming mode, letting Flink SQL subscribe to a Dynamic Table's incremental changes for real-time analysis, and it currently supports pluggable implementations.
Writes to the File Store are encapsulated in a built-in sink that hides the complexity of writing. Flink's checkpoint mechanism and exactly-once semantics guarantee data consistency.
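From the user's side, this consistency rides on ordinary checkpoint configuration; below is a minimal sketch of how an application typically enables it (the interval is an arbitrary example).

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();
        // Take a checkpoint every 60 seconds with exactly-once semantics.
        // In the dynamic table design described above, the built-in sink
        // commits its writes on checkpoint completion, so downstream
        // readers never observe partial results.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);
    }
}
```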
The first phase of Dynamic Table has now been implemented, and the community is discussing further steps in this direction. According to the community's plan, the end state is to turn Dynamic Table into a service: a genuine Dynamic Table Service realizing fully real-time stream-batch unified storage. The Flink community is also discussing operating Dynamic Table as an independent Flink sub-project, and it may eventually become a fully independent, general-purpose stream-batch storage project. Finally, Flink CDC, Flink SQL, and Flink Dynamic Table can be combined to build a complete streaming data warehouse with a unified real-time and offline experience. The whole process and its effect are shown in the demo video below.
Although the whole pipeline now works end to end, making the fully real-time link stable enough still requires the community to gradually harden the implementation, including optimizing Flink SQL for interactive OLAP scenarios, improving dynamic table storage performance and consistency, and building out Dynamic Table as a service. The streaming data warehouse direction has only just started, with preliminary attempts; in Mo Wen's view the design is sound, but a series of engineering problems must be solved next. It is like designing an advanced-process chip or an ARM architecture: many people can produce the design, but actually manufacturing the chip with an acceptable yield is hard. The streaming data warehouse will be Flink's most important direction in big data analytics, and the community will invest heavily in it.
Under the general trend of big data going real-time, Flink need not stop at one thing. It can do more.
The industry originally positioned Flink as a stream processor or stream computing engine, but that is not the whole picture. Mo Wen said Flink is not just computing: you may think of Flink narrowly as computing, but in a broad sense Flink already has storage. "Flink was able to break out of the pack with stream computing thanks to stateful storage, which was its bigger advantage over Storm."
Now Flink wants to go a step further and offer a solution covering a wider range of real-time problems, and its original storage is no longer enough. External storage systems and other engine architectures are not fully aligned with Flink's goals and characteristics and cannot integrate with it well. For example, Flink has integrated with data lakes including Hudi and Iceberg, supporting real-time ingestion into the lake and incremental analysis on the lake, but these scenarios still do not fully exploit Flink's real-time advantage, because data lake storage formats are essentially mini-batch, and Flink degrades to mini-batch mode with them. This is not the architecture Flink most wants or is best suited for, so Flink naturally needs to extend itself with a storage system that matches its stream-batch unified concept.
In Mo Wen's view, no big data computing and analysis engine can deliver an ultimate-experience data analysis solution without the support of a storage technology system that matches its concept. This is similar to how any good algorithm needs a corresponding data structure to solve the problem with the best efficiency.
Why is Flink better suited to the streaming data warehouse? This is determined by Flink's philosophy: Flink's core idea is streaming-first in solving data processing problems, and streaming is essential to making an entire data warehouse's data flow in real time. Once the data is flowing, the stream-table duality of the data and Flink's stream-batch analysis capability can analyze the data at any point in the flow, whether a second-level short query or offline ETL; Flink has the corresponding capability for each. Mo Wen said the biggest limitation of Flink's stream-batch unification has been the lack of matching storage and data structures in the middle, which makes the scenarios hard to land; once the storage and data structures are filled in, many chemical reactions of stream-batch unification will naturally emerge.
So will Flink building its own data storage have an impact on existing storage projects in the big data ecosystem? Mo Wen explained that the Flink community launched the new stream-batch unified storage technology to better meet the needs of its own stream-batch unified computing. The storage will maintain open protocols, open APIs, and SDKs, and there are plans to develop the project independently in the future. In addition, Flink will continue to integrate actively with mainstream storage projects in the industry, maintaining compatibility and openness toward the external ecosystem.
The boundaries between different components of the big data ecosystem are becoming increasingly blurred, and Mo Wen believes the current trend is moving from single-component capabilities toward integrated solutions. "Everyone is actually following this trend. For example, many database projects that were originally OLTP later added OLAP, and the result was called HTAP: a fusion of row and column storage that supports both serving and analytics, all to give users a complete data analysis experience." Mo Wen further added: "Many systems have begun to expand their boundaries, from real-time to offline or from offline to real-time, penetrating each other. Otherwise, users have to manually combine various technical components and face all the resulting complexity, and the threshold keeps getting higher. So the trend toward integration is very clear. In the end there is no right or wrong about who integrates whom; the key is whether you can deliver the best user experience through good integration. Whoever does that wins the user. For a community to stay vital and sustainable, it is not enough to do only what it is best at; it must keep innovating and breaking boundaries according to users' needs and scenarios. Most user needs do not necessarily lie in the gap between 95 and 100 points on a single capability."
According to Mo Wen's estimate, it will take about a year for a relatively mature streaming data warehouse solution to take shape. For users who have already adopted Flink as their real-time computing engine, trying the new streaming data warehouse is a natural fit, and the user interface is fully compatible with Flink SQL. He revealed that a first preview version will be released in the upcoming Flink 1.15, and users already on Flink can try it first. Mo Wen said that the Flink-based streaming data warehouse has only just launched; the technical solution needs further iteration and will take time to polish to maturity, and he hopes more enterprises and developers will join the effort with their own needs, which is the value of an open source community.
The big data open source ecosystem's problems of too many components and high architectural complexity have been criticized for years, and the industry now seems to have reached a certain consensus: to push data architectures toward simplification through convergence and integration, although different enterprises phrase it and pursue it differently.
In Mo Wen's view, it is normal for the open source ecosystem to flourish, and every technical community has its own areas of expertise; but truly solving business problems requires a one-stop solution that gives users a simple, easy-to-use experience. He therefore agrees that the general trend is toward convergence and integration, but the path is not unique: in the future there may be one system responsible for integrating all the components, or each system may gradually evolve into an integrated one. Which possibility is the endgame, perhaps only time will tell.