Abstract: This article is based on a talk given by Alibaba technical expert Hu Zheng at the Meetup in Shanghai on April 17. It explains how Flink and Iceberg can be used to tackle the challenges of data ingestion into the data lake, helping business teams focus more efficiently on their own business problems. Topics include:

  1. Core challenges of data entering the lake

  2. Introduction to Apache Iceberg

  3. How Flink and Iceberg solve the problem

  4. Community roadmap

GitHub address: https://github.com/apache/flink
Everyone is welcome to give Flink a like and a star~

Core challenges of data entering the lake

Real-time data ingestion into the lake can be divided into three parts: the data source, the data pipeline, and the data lake (data warehouse). The rest of this article is organized around these three parts.

1. Case #1: A program bug causes data transfer to be interrupted

  • First, when data is transmitted from the data source to the data lake (data warehouse) through the data pipeline, a bug in the job is likely to leave only half of the data transmitted, which affects the business.

  • The second problem is how to restart the job in this situation and guarantee that the data is neither duplicated nor missing and is completely synchronized to the data lake (data warehouse).

2. Case #2: Data changes are too painful

When data changes occur, they put great pressure on the entire link and pose real challenges. In the figure below, a table originally defines two fields, ID and NAME. Later, the business team says it needs to add an ADDRESS field in order to better mine the value of users.

First, we need to add the ADDRESS column to the source table, then propagate the change along the link to Kafka, then modify the job and restart it. The change then ripples through the entire link: new columns are added, jobs are modified and restarted, and finally all the data in the data lake (data warehouse) is updated to include the new column. This process is not only time-consuming, but it also raises the question of how to guarantee data isolation so that analysis jobs are not affected while the change is in progress.

As shown in the figure below, the table in the data warehouse is partitioned by “month”, and now you want to change it to be partitioned by “day”. This may require updating all the data in many systems and then repartitioning it under the new policy, which is very time-consuming.

3. Case #3: Increasingly slow near real-time reports?

When the business needs more real-time reports, it is necessary to change the import cycle of the data from “days” to “hours” or even “minutes”, which may cause a series of problems.

As shown in the figure above, the first problem is that the number of files grows at a rate visible to the naked eye, which puts more and more pressure on external systems. The pressure is mainly reflected in two aspects:

  • The first pressure is that starting analysis jobs becomes slower and slower, and the Hive Metastore faces scaling challenges, as shown in the figure below.

    • As there are more and more small files, the bottleneck of the centralized Metastore becomes more and more serious, which makes starting analysis jobs slower and slower, because starting a job has to scan the metadata of all the small files.
    • Second, because the Metastore is a centralized system, it easily runs into scaling problems. For example, Hive may have to find a way to scale out the MySQL database behind it, which brings large maintenance costs and overhead.

As the number of small files increases, once the analysis job is up, you will find that the scanning process itself gets slower and slower. The essential reason is that the large number of small files forces the scan job to switch frequently between many DataNodes.

4. Case #4: Analyzing CDC data in real time is difficult

We investigated various systems in the Hadoop ecosystem and found that the entire link needs to run fast, well, and stably, and to support good concurrency, which is not easy.

  • First, on the source side: for example, when synchronizing MySQL data to the data lake for analysis, you face the problem that MySQL contains existing full data while incremental data is constantly being generated. How do you synchronize the full and incremental data to the data lake seamlessly, so that there is neither more nor less data than there should be?

  • In addition, assuming the switch from full to incremental synchronization at the source is solved, if an exception is encountered during synchronization, such as an upstream schema change that interrupts the job, how do you ensure that the CDC data is still synchronized downstream row by row?

  • Building the entire link involves the switching and synchronization of the source, the wiring of the intermediate data streams, and the process of writing to the data lake (data warehouse). Building it requires writing a lot of code, and the development threshold is high.

  • The last and most critical issue is that in open source ecosystems and systems it is difficult to find efficient, highly concurrent storage that can handle the ever-changing nature of CDC data.

5. The core challenges faced by data entering the lake

  • Data synchronization tasks get interrupted

    • The impact of writes on analysis cannot be effectively isolated;
    • The synchronization task does not guarantee exactly-once semantics.

  • Data changes are painful end to end

    • DDL complicates full-link updates and upgrades;
    • It is difficult to modify the stock data already in the lake/warehouse.

  • Increasingly slow near real-time reports

    • Frequent writes produce a large number of small files;
    • The metadata system is under pressure and slow to start;
    • A large number of small files cause slow data scanning.

  • Near real-time analysis of CDC data is impossible

    • It is difficult to complete the switch from full to incremental synchronization;
    • End-to-end code development is involved, and the threshold is high;
    • The open source community lacks efficient storage systems.

Introduction to Apache Iceberg

1. Netflix: a summary of Hive's cloud-migration pain points

The most critical reason Netflix built Iceberg was to solve the pain points of moving Hive to the cloud. These pain points fall mainly into the following three areas:

■ Pain point 1: Data change and backtracking are difficult


  • Hive does not provide ACID semantics. When data changes occur, it is difficult to isolate the impact on analysis tasks. Typical operations include INSERT OVERWRITE, modifying data partitions, and modifying the schema;
  • Multiple concurrent data changes cannot be handled, which leads to conflicts;
  • Historical versions cannot be effectively traced back.

■ Pain point 2: Replacing HDFS with S3 is difficult

  • Data access depends directly on the HDFS API;
  • It relies on the atomicity of the RENAME interface, and the same semantics are difficult to achieve on object storage such as S3;
  • It relies heavily on the list interface of file directories, which is inefficient on object storage systems.

■ Pain point 3: Too many messy details

  • When the schema is changed, different file formats behave inconsistently, and different FileFormats do not even support data types consistently;
  • The Metastore only maintains partition-level statistics, which results in considerable task-planning overhead, and the Hive Metastore is difficult to scale;
  • Non-partition fields cannot be used for partition pruning.

2. Apache Iceberg core features

  • Generalized, standardized design

    • Perfectly decoupled from compute engines;
    • Standardized schema;
    • Open data formats;
    • Java and Python support.

  • Complete table semantics

    • Schema definition and evolution;
    • Flexible partitioning policies;
    • ACID semantics;
    • Snapshot semantics.

  • Rich data management

    • Unified streaming and batch storage;
    • Scalable META design;
    • Support for batch updates and CDC;
    • Support for file encryption.

  • Cost performance

    • Computation pushdown design;
    • Low-cost metadata management;
    • Vectorized computation;
    • Lightweight indexes.

3. Apache Iceberg File Layout

The figure above shows the standard Iceberg TableFormat structure. Its core is divided into two parts, data and metadata, and both parts are maintained on top of S3 or HDFS.

4. Apache Iceberg Snapshot View

The figure above shows, roughly, Iceberg's write and read flow.

You can see that there are three layers: snapshots at the top, manifests in the middle, and data files at the bottom.

Each write produces a batch of data files, one or more manifests, and a snapshot.

For example, the first write forms snapshot Snap-0, the second write forms snapshot Snap-1, and so on. When maintaining the metadata, however, Iceberg appends to it incrementally, step by step.

This allows users to run batch analysis over a unified storage, and also to run incremental analysis between snapshots on that same storage, which is why Iceberg can support both streaming and batch reads and writes.
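
To make the incremental-analysis idea concrete, here is a minimal sketch (not from the talk) using the Iceberg Java API. The table path is hypothetical, and it assumes the table already has at least two snapshots; it plans a scan over only the files appended between the previous snapshot and the current one.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.io.CloseableIterable;

public class IncrementalScan {
    public static void main(String[] args) throws Exception {
        // Hypothetical table location on HDFS (could equally be an S3 path).
        Table table = new HadoopTables(new Configuration())
                .load("hdfs://namenode:8020/warehouse/db/logs");

        // Assumes the current snapshot has a parent, i.e. at least two snapshots exist.
        long current = table.currentSnapshot().snapshotId();
        long parent = table.currentSnapshot().parentId();

        // Plan only the files appended after the parent snapshot,
        // up to and including the current one (e.g. Snap-0 -> Snap-1).
        try (CloseableIterable<FileScanTask> tasks =
                     table.newScan().appendsBetween(parent, current).planFiles()) {
            tasks.forEach(task -> System.out.println(task.file().path()));
        }
    }
}
```

A full batch analysis would simply use table.newScan() without the snapshot range, reading the latest snapshot as a whole.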

5. Companies that have chosen Apache Iceberg

The picture above shows some of the companies currently using Apache Iceberg. The domestic examples are already familiar to everyone, so here is a brief overview of how the foreign companies use it.

  • Netflix now has hundreds of petabytes of data on Apache Iceberg, and its daily incremental data written through Flink is in the hundreds of terabytes.

  • Adobe's daily incremental data is several terabytes, and its total data size is around tens of petabytes.

  • AWS uses Iceberg as the foundation of its data lake.

  • Cloudera builds its entire public cloud platform on Iceberg. The trend of on-premises HDFS/Hadoop deployments is weakening while cloud migration is gradually rising, and Iceberg plays a key role in moving Cloudera's data architecture to the cloud.

  • Apple has two teams using it:

    • First, the entire iCloud data platform is built on Iceberg;
    • Second, the Siri artificial-intelligence voice service also builds its entire data ecosystem on Flink and Iceberg.

How Flink and Iceberg solve the problem

Let us now return to the most critical part: how Flink and Iceberg solve the series of problems encountered in Part 1.

1. Case #1: A program bug causes data transfer to be interrupted

First, Flink is used on the synchronization link. It guarantees exactly-once semantics, and when a job fails it can recover strictly, ensuring data consistency.

The second part is Iceberg, which provides rigorous ACID semantics that help users easily isolate the adverse effects of writes on analysis tasks.
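
As a concrete illustration (a minimal sketch, not code from the talk), the exactly-once behavior on the Flink side comes from checkpointing: the Iceberg sink only commits data files as a new snapshot when a checkpoint completes, so a restarted job neither duplicates nor loses data. The job name, interval, and placeholder pipeline below are arbitrary.

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExactlyOnceSyncJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take an exactly-once checkpoint every 60 seconds; the Iceberg sink
        // only commits a snapshot when a checkpoint completes successfully.
        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);

        // If the job crashes (for example because of a bug) and is restarted,
        // it resumes from the last completed checkpoint, so downstream data
        // is neither duplicated nor lost.

        // Placeholder pipeline; in the real job this would be
        // source -> transform -> Iceberg FlinkSink.
        env.fromSequence(0, 100).print();

        env.execute("exactly-once-sync-to-iceberg");
    }
}
```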

2. Case #2: Data changes are too painful

As shown above, when data changes occur, Flink and Iceberg can solve this problem together.

Flink captures the upstream schema-change event and synchronizes it downstream. After that, the downstream Flink job simply forwards the data on to storage, and Iceberg can change its schema instantly.

When a schema DDL is applied, Iceberg simply maintains multiple versions of the schema: the old data files remain completely unchanged, while new data is written with the new schema, achieving one-click schema isolation.
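
To show how cheap such a change is on the Iceberg side, the sketch below uses the Iceberg Java API to add the ADDRESS column from the earlier example; the table path is hypothetical. The commit rewrites only metadata, not the existing data files.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.types.Types;

public class AddAddressColumn {
    public static void main(String[] args) {
        // Hypothetical table location.
        Table table = new HadoopTables(new Configuration())
                .load("hdfs://namenode:8020/warehouse/db/users");

        // Add the new optional column; old files keep the old schema version,
        // new writes pick up the new one.
        table.updateSchema()
                .addColumn("address", Types.StringType.get())
                .commit();
    }
}
```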

Another example is the problem of partition changes, as shown in the image above.

If you want to switch to partitioning by “day”, you can change the partition spec with one click: the existing data remains unchanged, all new data is partitioned by “day”, and ACID semantics keep the two isolated.
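
A minimal sketch of that one-click change with the Iceberg Java API is shown below; it assumes an Iceberg version that exposes the UpdatePartitionSpec API (0.11 or later), a timestamp column named ts, and a hypothetical table path.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.hadoop.HadoopTables;

public class RepartitionByDay {
    public static void main(String[] args) {
        // Hypothetical table location.
        Table table = new HadoopTables(new Configuration())
                .load("hdfs://namenode:8020/warehouse/db/logs");

        // Switch the partition policy from month(ts) to day(ts).
        // This is a metadata-only, ACID commit: old files stay under the old
        // spec, new data is written under the new one.
        table.updateSpec()
                .removeField(Expressions.month("ts"))
                .addField(Expressions.day("ts"))
                .commit();
    }
}
```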

3. Case #3: Increasingly slow near real-time reports?

The third problem is the strain that small files put on the Metastore.
For the Metastore, Iceberg keeps the metadata in the file system and maintains it in the form of metadata files. This design removes the centralized Metastore entirely and relies only on the file system to scale out, so scalability is much better.
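
As a hedged illustration of the "metadata in the file system" point, the sketch below registers an Iceberg catalog of type hadoop in Flink SQL, so tables and their metadata live entirely under a warehouse path on HDFS (the path is hypothetical) and no Hive Metastore service is involved.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class FileSystemCatalog {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // A Hadoop catalog keeps all table metadata as files under 'warehouse';
        // no centralized Metastore service is required.
        tEnv.executeSql(
                "CREATE CATALOG hadoop_catalog WITH ("
                        + " 'type' = 'iceberg',"
                        + " 'catalog-type' = 'hadoop',"
                        + " 'warehouse' = 'hdfs://namenode:8020/warehouse'"
                        + ")");
        tEnv.executeSql("USE CATALOG hadoop_catalog");
    }
}
```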

Another problem is that more and more small files make data scanning slower and slower. Flink and Iceberg offer a series of solutions to this problem:

  • The first solution is to optimize the small-file problem at write time by shuffling data by bucket before writing, so that data for the same bucket is written by a single task and the number of small files is naturally reduced.

  • The second solution is to run a batch job that merges small files periodically (see the sketch after this list).

  • The third solution is smarter still: small files are merged incrementally and automatically.
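
For the second solution, a minimal sketch of such a periodic merge job is shown below. It assumes the rewrite action shipped in the iceberg-flink module (Iceberg 0.11 or later) and a hypothetical table path; the 128 MB target size is just an example value.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.flink.actions.Actions;
import org.apache.iceberg.hadoop.HadoopTables;

public class CompactSmallFiles {
    public static void main(String[] args) {
        // Hypothetical table location.
        Table table = new HadoopTables(new Configuration())
                .load("hdfs://namenode:8020/warehouse/db/logs");

        // Rewrite (merge) small data files into larger ones, aiming at ~128 MB.
        Actions.forTable(table)
                .rewriteDataFiles()
                .targetSizeInBytes(128 * 1024 * 1024L)
                .execute();
    }
}
```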

4. Case #4: Analyzing CDC data in real time is difficult

  • The first problem is synchronizing full and incremental data. The community already has the Flink CDC Connector solution, and this connector can automatically handle the seamless switch from full to incremental synchronization.

  • The second problem is how to ensure that, even if an exception is encountered in the middle of synchronization, the binlog records are still synchronized into the lake row by row.

For this problem, Flink can identify the different types of events well at the engine level, and with the help of Flink's exactly-once semantics it can automatically recover and continue processing even when it encounters a failure.

  • The third problem is that building the entire link requires a lot of code development, and the threshold is too high.

With the Flink and data lake solution, you only need to write a source table and a sink table, plus one INSERT INTO statement, and the entire link is up without writing any business code (see the sketch after this list).

  • Finally, there is the question of how the storage layer supports near real-time analysis of CDC data.
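
The sketch below illustrates that "one INSERT INTO" link. All connection parameters and table names are hypothetical, and it assumes the flink-cdc-connectors mysql-cdc connector and the Iceberg Flink runtime are on the classpath.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class MySqlToIcebergLink {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Source table: full + incremental changes captured from the MySQL binlog.
        tEnv.executeSql(
                "CREATE TABLE users_source ("
                        + "  id BIGINT, name STRING, address STRING,"
                        + "  PRIMARY KEY (id) NOT ENFORCED"
                        + ") WITH ("
                        + "  'connector' = 'mysql-cdc',"
                        + "  'hostname' = 'mysql-host', 'port' = '3306',"
                        + "  'username' = 'user', 'password' = 'secret',"
                        + "  'database-name' = 'db', 'table-name' = 'users'"
                        + ")");

        // Sink table: an Iceberg table managed through a Hadoop catalog.
        tEnv.executeSql(
                "CREATE TABLE users_sink ("
                        + "  id BIGINT, name STRING, address STRING,"
                        + "  PRIMARY KEY (id) NOT ENFORCED"
                        + ") WITH ("
                        + "  'connector' = 'iceberg',"
                        + "  'catalog-name' = 'hadoop_catalog',"
                        + "  'catalog-type' = 'hadoop',"
                        + "  'warehouse' = 'hdfs://namenode:8020/warehouse'"
                        + ")");

        // The whole link: one statement, no business code.
        tEnv.executeSql("INSERT INTO users_sink SELECT * FROM users_source");
    }
}
```

Depending on the Iceberg version, additional table properties (for example a v2 table format for row-level updates) may be required; the point here is only the shape of the link.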

Community roadmap

The picture above is Iceberg's roadmap. You can see that Iceberg released only one version in 2019 but released three versions in 2020, and it became a top-level project with version 0.9.0.

The picture above shows Flink and Iceberg's roadmap, which can be divided into 4 stages.

  • The first stage is establishing the connection between Flink and Iceberg.

  • The second stage is replacing Hive scenarios with Iceberg. In this scenario, many companies have already started to go live and land it in their own environments.

  • The third stage is solving more complex technical problems with Flink and Iceberg.

  • The fourth stage is turning this from a purely technical solution into a more complete product solution.