Abstract: Alibaba technical expert Hu Zheng gave this talk at a Meetup in Shanghai on April 17. The article explains how Flink and Iceberg can be used to tackle the challenges of getting data into the lake, so that business teams can focus more efficiently on their own business problems. Topics include:
- Core challenges of data entering the lake
- Introduction to Apache Iceberg
- How Flink and Iceberg solve the problem
- Community roadmap


Core challenges of data entering the lake
1. Case #1: A program bug causes data transfer to be interrupted
- First, when data is transmitted from the source to the data lake (data warehouse) through the data pipeline, a bug in the job may leave only part of the data transmitted, which affects the business.
- Second, when this happens, how do we restart the job and make sure the data is neither duplicated nor missing, and is completely synchronized to the data lake (data warehouse)?
2. Case #2: Data changes are too painful
When the data changes, it puts great pressure on the entire link. In the figure below, a table originally has two fields, ID and NAME. The business team then asks to add an ADDRESS field so that more value can be mined from the users.
First, the ADDRESS column has to be added to the source table, then the Kafka topics in the middle of the link have to be updated, and the jobs modified and restarted. The change then propagates all the way down the link: new columns are added, jobs are modified and restarted, and finally all of the stock data in the data lake (data warehouse) has to be updated to include the new column. This process is not only time-consuming, it also raises the question of how to isolate the change so that analysis jobs reading the data are not affected while it is in progress.
As shown in the figure below, the table in the data warehouse is partitioned by month, and now we want to partition it by day. This may require updating the data in many systems and then repartitioning it with the new policy, which is very time-consuming.
3. Case #3: Increasingly slow near real-time reports?
- The first pressure is that starting analysis jobs becomes slower and slower, and Hive Metastore faces scaling challenges, as shown in the figure below.
  - As small files accumulate, the centralized Metastore becomes an increasingly serious bottleneck and job startup gets slower and slower, because the metadata of all the small files has to be scanned when a job starts.
  - Also, because the Metastore is a centralized system, it easily runs into scaling problems; Hive, for example, has to find ways to scale the MySQL instance behind it, which leads to high maintenance cost and overhead.
- The second pressure is that once the analysis job is up, scanning becomes slower and slower. In essence, the large number of small files forces the scan tasks to switch frequently between many DataNodes.
4. Case #4: Analyzing CDC data in real time is difficult
- First, on the source side, take synchronizing MySQL data to the data lake for analysis as an example. MySQL already contains stock data, and incremental data keeps being generated, so how do we synchronize the full and the incremental data to the data lake seamlessly, with nothing missing and nothing duplicated?
- In addition, even assuming the full-to-incremental switch on the source side is solved, if an exception occurs during synchronization, such as an upstream schema change that interrupts the job, how do we ensure that the CDC data still reaches the downstream without losing or duplicating rows?
- Building the entire link involves the full/incremental switch on the source side, connecting the intermediate data streams, and writing into the data lake (data warehouse). A lot of code has to be written, so the development threshold is high.
- The last and most critical issue is that, in the open-source ecosystem, it is hard to find a storage system that supports efficient, high-concurrency analysis of data that changes as frequently as CDC data does.
5. The core challenges faced by data entering the lake
- The data synchronization task may be interrupted:
  - the impact of writing on analysis cannot be effectively isolated;
  - the synchronization task does not guarantee exactly-once semantics.
- DDL complicates full-link update and upgrade:
  - it is difficult to modify the stock data already in the lake/warehouse.
- Near real-time reports become slower and slower:
  - frequent writes produce a large number of small files;
  - the metadata system is under pressure and slow to start;
  - the large number of small files makes data scanning slow.
- CDC data cannot be analyzed in near real time:
  - it is difficult to complete the switch from full to incremental synchronization;
  - end-to-end code development is involved, so the threshold is high;
  - the open-source community lacks an efficient storage system.
Introduction to Apache Iceberg
1. Netflix: summary of Hive's pain points in the cloud
■ Pain point 1: Data change and backtracking are difficult
■ Pain point 2: Replacing HDFS with S3 is difficult
■ Pain point 3: Too many details
2. Apache Iceberg core features
- Generalized standard design
  - Perfectly decoupled from compute engines
  - Standardized schema
  - Open data formats
  - Support for Java and Python
- Complete Table semantics
  - Schema definition and change
  - Flexible partition policies
  - ACID semantics
  - Snapshot semantics (see the sketch after this list)
- Rich data management
  - Scalable META design
  - Batch updates and CDC
  - File encryption support
- Cost performance
  - Computation pushdown design
  - Low-cost metadata management
  - Vectorized computing
  - Lightweight indexes
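To make the ACID and snapshot items above more concrete, here is a minimal sketch using Iceberg's generic Java reader: it lists a table's snapshots and then reads the data as of the previous snapshot (time travel). The table is assumed to be already loaded from a catalog and to contain at least one commit; the names and the exact API shape assume a recent Iceberg release and are not taken from the talk.

```java
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;
import org.apache.iceberg.data.IcebergGenerics;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.io.CloseableIterable;

public class SnapshotReadSketch {

    // 'table' is assumed to be loaded elsewhere, e.g. via a Hive or Hadoop catalog.
    static void timeTravel(Table table) throws Exception {
        // Every successful commit produces an immutable snapshot; list them all.
        for (Snapshot snap : table.snapshots()) {
            System.out.printf("snapshot=%d committed-at=%d%n",
                    snap.snapshotId(), snap.timestampMillis());
        }

        // Read the table as it looked one commit ago (time travel).
        Long previousSnapshotId = table.currentSnapshot().parentId();
        if (previousSnapshotId != null) {
            try (CloseableIterable<Record> rows =
                         IcebergGenerics.read(table).useSnapshot(previousSnapshotId).build()) {
                rows.forEach(System.out::println);
            }
        }
    }
}
```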
3. Apache Iceberg File Layout
4. Apache Iceberg Snapshot View
5. Companies that choose Apache Iceberg
- Adobe's daily incremental data is several terabytes, and its total data size is on the order of tens of petabytes.
- AWS uses Iceberg as the foundation of its data lake.
- Cloudera builds its entire public cloud platform on Iceberg. As on-premises HDFS/Hadoop deployments decline and migration to the cloud rises, Iceberg plays a key role in the cloud stage of Cloudera's data architecture.
- Apple has two teams using it; one of them is Siri, the artificial-intelligence voice service, which builds its entire data ecosystem on Flink and Iceberg.
- Netflix now has hundreds of petabytes of data on Apache Iceberg, and Flink writes hundreds of terabytes of data per day.
How Flink and Iceberg solve the problem
Returning to the key question, let's look at how Flink and Iceberg solve the series of problems described in the first section.
1. Case #1: A program bug causes data transfer to be interrupted
2. Case #2: Data changes are too painful
If you want to change to partitioning by "day", you can simply update the partition policy on the Iceberg table: data written afterwards uses the new daily partitioning, while the existing data keeps its original layout, so the stock data does not have to be rewritten.
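As a hedged illustration of the point above (not code from the talk), the following sketch uses Iceberg's Java API to add an Address column and to switch an assumed user_events table from monthly to daily partitioning on a ts column. Both are metadata-only commits, so existing data files are not rewritten. The Hive Metastore URI and all table and column names are placeholders.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hive.HiveCatalog;
import org.apache.iceberg.types.Types;

import static org.apache.iceberg.expressions.Expressions.day;
import static org.apache.iceberg.expressions.Expressions.month;

public class EvolveTableSketch {
    public static void main(String[] args) {
        // Load the (hypothetical) table from a Hive catalog; the URI is a placeholder.
        HiveCatalog catalog = new HiveCatalog();
        catalog.setConf(new Configuration());
        Map<String, String> props = new HashMap<>();
        props.put("uri", "thrift://metastore-host:9083");
        catalog.initialize("hive", props);
        Table table = catalog.loadTable(TableIdentifier.of("db", "user_events"));

        // Add the new Address column: a metadata-only change, no data files are rewritten.
        table.updateSchema()
             .addColumn("address", Types.StringType.get())
             .commit();

        // Evolve the partition spec from month(ts) to day(ts). Files written before this
        // commit keep the monthly layout; new writes use daily partitioning.
        table.updateSpec()
             .removeField(month("ts"))
             .addField(day("ts"))
             .commit();
    }
}
```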
3. Case #3: Increasingly slow near real-time reports?
- The first solution is to mitigate the small-file problem at write time: data is shuffled by bucket before it is written, so each bucket is handled by a single writer and the number of small files written is naturally reduced.
- The second solution is to have batch jobs merge small files on a regular basis (see the sketch after this list).
- The third solution is smarter: small files are merged automatically and incrementally.
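For the second approach, the periodic merging mentioned in the list above, Iceberg ships a rewrite action in its Flink module that compacts small data files into larger ones. A minimal sketch, assuming a recent iceberg-flink runtime and a purely illustrative Hadoop-table path:

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.actions.RewriteDataFilesActionResult;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.actions.Actions;

public class CompactSmallFilesSketch {
    public static void main(String[] args) {
        // Load the Iceberg table; the warehouse path is a placeholder.
        TableLoader loader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/db/user_events");
        loader.open();
        Table table = loader.loadTable();

        // Rewrite (compact) small data files into larger ones and commit a new snapshot.
        RewriteDataFilesActionResult result = Actions.forTable(table)
                .rewriteDataFiles()
                .execute();

        System.out.println("added data files: " + result.addedDataFiles().size()
                + ", deleted data files: " + result.deletedDataFiles().size());
    }
}
```

Such a job would typically be scheduled periodically (for example once an hour), independently of the streaming write job.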
4. Case #4: Analyzing CDC data in real time is difficult
- The first problem is synchronizing full and incremental data. The community already has the Flink CDC Connector, which can automatically handle the seamless switch from full to incremental synchronization.
- The second problem is how to ensure that the binlog rows reach the lake without loss during synchronization, even if an exception is encountered along the way. For this, Flink can recognize the different event types at the engine level, and with the help of Flink's exactly-once semantics it can automatically recover and continue processing after a failure.
- The third problem is that building the entire link used to require a lot of code development, making the threshold too high. With the Flink plus data lake solution, you only need to define a source table and a sink table and write one INSERT INTO, and the whole link is in place without writing any business code (see the sketch after this list).
- The last point is how the storage layer supports near real-time analysis of CDC data.
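To illustrate how short the link becomes, here is a minimal Flink SQL sketch wrapped in Java's Table API; it is not the speaker's exact setup. It assumes the MySQL CDC connector and the Iceberg Flink runtime are on the classpath, and a reasonably recent Iceberg version for format-version 2 tables and upsert writes; all hostnames, credentials, and database/table/column names are placeholders.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class MysqlCdcToIcebergSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000);  // Iceberg commits data on Flink checkpoints
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Source table: reads the MySQL snapshot first, then the binlog, as one stream.
        tEnv.executeSql(
            "CREATE TABLE users_cdc (" +
            "  id BIGINT, name STRING, address STRING, PRIMARY KEY (id) NOT ENFORCED" +
            ") WITH (" +
            "  'connector' = 'mysql-cdc'," +
            "  'hostname' = 'mysql-host', 'port' = '3306'," +
            "  'username' = 'user', 'password' = 'pass'," +
            "  'database-name' = 'app', 'table-name' = 'users')");

        // Iceberg catalog backed by the Hive Metastore (URI and warehouse are placeholders).
        tEnv.executeSql(
            "CREATE CATALOG iceberg_catalog WITH (" +
            "  'type' = 'iceberg', 'catalog-type' = 'hive'," +
            "  'uri' = 'thrift://metastore-host:9083'," +
            "  'warehouse' = 'hdfs://nn:8020/warehouse')");
        tEnv.executeSql("CREATE DATABASE IF NOT EXISTS iceberg_catalog.db");

        // Sink table: a v2 Iceberg table so row-level CDC updates/deletes can be applied.
        tEnv.executeSql(
            "CREATE TABLE IF NOT EXISTS iceberg_catalog.db.users_lake (" +
            "  id BIGINT, name STRING, address STRING, PRIMARY KEY (id) NOT ENFORCED" +
            ") WITH ('format-version' = '2', 'write.upsert.enabled' = 'true')");

        // The whole link is a single statement; no business code is required.
        tEnv.executeSql("INSERT INTO iceberg_catalog.db.users_lake SELECT * FROM users_cdc");
    }
}
```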
Community roadmap
- The first stage is establishing the connection between Flink and Iceberg.
- The second stage is Iceberg replacing Hive scenarios; many companies have already started to bring their own use cases online here.
- The third stage is solving more complex technical problems with Flink and Iceberg.
- The fourth stage is turning this from a purely technical solution into a more complete product solution.