The Digital Stack's Exploration and Practice of the Integrated Lakehouse

Introduction:

In the development of big data technology, following the data warehouse and the data lake, another big data platform innovation, the lakehouse, has begun to attract industry attention in recent years. The data management needs generated by market development have always been the driving force behind data technology innovation. For example, how does a data warehouse store data with different structures? How can a data lake avoid descending into data clutter for lack of governance? Today's article talks specifically about how our Digital Stack solves these problems.

The main contents are as follows 👇👇👇

▫ A brief overview of the lakehouse concept

▫ The pain points of the Digital Stack in building the lakehouse

▫ How we solve these problems in a targeted way

Author / Potato, Knife

Editor / Toward the Mountain

Background

As the 21st century enters its third decade, big data technology has gradually moved from a period of exploration and growth to a period of widespread adoption. More and more enterprises now use big data technology to support decision analysis. Data warehouses have evolved for roughly three decades since Bill Inmon, the father of data warehousing, defined the concept in 1990, and cloud data warehouses such as AWS Redshift, Google BigQuery, and Snowflake have been launched.

However, as enterprises modernize, realities such as diverse data structures, ever higher real-time requirements, and rapidly changing data models mean that data warehouses can no longer keep up with enterprises' growing needs, and data lakes represented by Iceberg and Hudi have emerged. Open file storage, open file formats, open metadata services, and real-time read and write capabilities have made them warmly sought after, and the major cloud vendors have proposed their own data lake solutions, leading some to say that the data lake is the next generation of the big data platform.

Every new technology has two sides. On one hand, the data warehouse cannot accommodate data in diverse formats; on the other, a data lake that lacks structure and governance quickly turns into a “data swamp”. Both technologies face serious limitations. Against this backdrop, a new architectural pattern that combines the advantages of the data warehouse and the data lake, the “lakehouse”, was proposed.

What is the lakehouse

In short, the “lakehouse” is a new architectural pattern that fully combines the advantages of the data warehouse and the data lake: data is kept on the data lake's low-cost storage, retains the data lake's flexibility in data formats, and inherits the data warehouse's data governance capabilities.

The evolution of the Digital Stack toward the lakehouse

As our customers' businesses keep developing, the Digital Stack, as a data middle-platform suite, has also encountered more and more challenges. While overcoming these challenges, we have come to realize that we still have many shortcomings.

The Digital Stack's offline data warehouse

As shown in the figure, the user's business data is imported into the Hive data warehouse through FlinkX, the business logic is processed by the Spark engine, and the results are finally written back to the user's data sources through FlinkX.
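
To make the offline link concrete, here is a minimal sketch of what the Spark batch step might look like once FlinkX has landed the raw business data in an ODS-layer Hive table; the database, table, and column names are purely illustrative, not the platform's actual configuration.

```sql
-- Hypothetical Spark SQL batch step of the offline link: aggregate the raw
-- ODS data that FlinkX has synchronized into Hive, producing a result table
-- that FlinkX later writes back to the business database.
INSERT OVERWRITE TABLE ads.order_daily_summary PARTITION (dt = '2021-06-01')
SELECT
    shop_id,
    COUNT(1)          AS order_cnt,
    SUM(order_amount) AS order_amount
FROM ods.orders
WHERE dt = '2021-06-01'
GROUP BY shop_id;
```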

The Digital Stack's real-time data warehouse

As shown in the figure, the real-time data warehouse has two links. One is the real-time link: the collected CDC data is written to a message queue, computed in real time through FlinkStreamSQL, and finally written to data sources with efficient reads and writes such as Kudu and HBase. The other is a quasi-real-time link: the collected CDC data is written to Hive tables and computed through Spark SQL.
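
FlinkStreamSQL uses its own DDL dialect, but the real-time link can be sketched in standard Flink SQL terms roughly as follows; the connector options, topic, and table names are illustrative assumptions rather than the platform's actual configuration.

```sql
-- Real-time link, sketched with open-source Flink SQL connectors:
-- CDC data arrives in Kafka, is aggregated continuously, and the result is
-- upserted into an HBase table (a Kudu sink would play the same role).
CREATE TABLE order_cdc (
    order_id     BIGINT,
    shop_id      BIGINT,
    order_amount DOUBLE
) WITH (
    'connector' = 'kafka',
    'topic' = 'order_cdc',
    'properties.bootstrap.servers' = 'kafka:9092',
    'scan.startup.mode' = 'earliest-offset',
    'format' = 'json'
);

CREATE TABLE shop_order_stats (
    shop_id STRING,
    cf ROW<order_cnt BIGINT, order_amount DOUBLE>,
    PRIMARY KEY (shop_id) NOT ENFORCED
) WITH (
    'connector' = 'hbase-2.2',
    'table-name' = 'shop_order_stats',
    'zookeeper.quorum' = 'zk:2181'
);

INSERT INTO shop_order_stats
SELECT shop_id, ROW(order_cnt, order_amount)
FROM (
    SELECT CAST(shop_id AS STRING) AS shop_id,
           COUNT(1)                AS order_cnt,
           SUM(order_amount)       AS order_amount
    FROM order_cdc
    GROUP BY shop_id
) AS t;
```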

Introducing the data lake

Since the Digital Stack's stream computing engine is Flink, after investigating the two open-source data lake projects Iceberg and Hudi, we found that Iceberg integrates with Flink more conveniently and has a friendlier ecosystem than Hudi, so we decided to adopt Iceberg as our first data lake product; other data lakes such as Hudi will be supported one by one. We will not dwell on Iceberg's features here. The following shows the data warehouse links after the data lake was introduced:

Structured, semi-structured, and unstructured data is ETL-processed through FlinkX and written to the Iceberg data lake or back to the message queue. The data then flows and is computed continuously between the message queue and the data lake through the Flink and Spark engines, and is finally written to data sources with efficient reads and writes such as Kudu and HBase.
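
As a hedged sketch of the ingestion step, the open-source Flink-Iceberg integration lets you register an Iceberg catalog in Flink SQL and stream cleaned data straight into a lake table; the metastore URI, warehouse path, and table names below are examples, not the Digital Stack's actual settings.

```sql
-- Register an Iceberg catalog backed by the Hive metastore.
CREATE CATALOG lake WITH (
    'type' = 'iceberg',
    'catalog-type' = 'hive',
    'uri' = 'thrift://hive-metastore:9083',
    'warehouse' = 'hdfs://ns1/warehouse/iceberg'
);

CREATE DATABASE IF NOT EXISTS lake.ods;

CREATE TABLE IF NOT EXISTS lake.ods.orders (
    order_id     BIGINT,
    shop_id      BIGINT,
    order_amount DOUBLE
);

-- Continuously write the CDC stream (e.g. the Kafka source from the earlier
-- sketch) into the Iceberg table; streaming writes commit on Flink checkpoints.
INSERT INTO lake.ods.orders
SELECT order_id, shop_id, order_amount
FROM order_cdc;
```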

Pain points of the Digital Stack in building the lakehouse

Batch and streaming are separated, and O&M costs money and labor

At present, the offline data warehouse practice is to use FlinkX to collect data into Hive tables, compute through Hive SQL or Spark SQL, and finally write back to Hive; the real-time data warehouse practice is to read data from the source tables' Kafka topics, compute through FlinkStreamSQL, and finally write to Kudu or HBase.

In these two links, developers first have to maintain two self-developed frameworks, FlinkX and FlinkStreamSQL; operations staff need some understanding of Hive SQL, Spark SQL, and Flink SQL tasks; and data developers have to be familiar with the syntax and parameter configuration of Hive SQL, Spark SQL, and Flink SQL. The cost of developing, using, operating, and maintaining such a complete set of data warehouses is by no means small.

Duplicate code and wasted resources

FlinkX and FlinkStreamSQL were created with a clear division of labor: one for synchronization and one for computation. But as the business keeps growing, the two have in fact become more and more alike. FlinkX also needs to do a certain amount of computation during synchronization, writing data to the target table only after cleaning it; and if FlinkStreamSQL performs no computation and simply writes to a database, it is effectively doing synchronization.

Therefore, whenever a new data source type is added, FlinkX and FlinkStreamSQL each need a similar connector, and 80% of the code in these connectors is the same. When a data-source-related bug appears, both FlinkX and FlinkStreamSQL have to be fixed. The two frameworks bring twice the labor cost.

Without governance, the lakehouse becomes a swamp

After the Iceberg data lake was introduced, most data was written into it without any processing. Because there was no catalog-level metadata management, finding the business data you want among a large amount of raw data is like panning for gold in sand. Different business users did not organize the data after using it, which led to data clutter and a large number of small files. Masses of small files seriously drag down the efficiency of a Hadoop cluster and reduce the data lake to a data swamp.

The Digital Stack moves toward the lakehouse

Solutions to the pain points

In order to solve the above pain points, the Digital Stack made the following changes:

1. Adopt Flink as the main compute engine

Flink unified the streaming and batch Source & Sink APIs in version 1.12, and the community keeps moving toward full streaming-batch unification, so we chose Flink as the main compute engine. With this, whether for the offline data warehouse, the real-time data warehouse, or the data lake, only one set of Flink SQL tasks is needed to handle the business logic. Building on Flink's industry-leading data processing capabilities and its streaming-batch unification, using Flink as the lakehouse's main compute engine solves the problems of high O&M costs and difficult operations in one stroke.
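
For illustration, with Flink's unified runtime the same SQL statement can be executed as a bounded batch job or a continuous streaming job simply by switching the runtime mode; the table names below are placeholders and the SET syntax follows recent Flink SQL clients.

```sql
-- One piece of business logic, two execution modes: switch the runtime mode
-- instead of maintaining separate Hive/Spark batch SQL and Flink streaming SQL.
SET 'execution.runtime-mode' = 'streaming';  -- use 'batch' for the offline run

INSERT INTO dwd.orders_clean
SELECT order_id, shop_id, order_amount
FROM ods.orders
WHERE order_amount > 0;
```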

2. Merge the two plugin sets that duplicate code

As mentioned above, FlinkX and FlinkStreamSQL duplicate 80% of their code at the plugin layer, so there is no need to maintain two duplicate sets of plugins. We combined the strengths of the two frameworks and wrote a brand-new FlinkX. The fused FlinkX inherits the original JSON-based data synchronization capability and can also use the powerful SQL language. Whether the data is offline or real-time, and whether the task is loading into the warehouse, loading into the lake, or computation, the new FlinkX handles it with ease.
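
In SQL terms, the difference between pure synchronization and synchronization-plus-compute collapses to the shape of a single INSERT, which is why one framework can cover both; the source and sink tables below are hypothetical.

```sql
-- Pure synchronization: copy the source into the target as-is.
INSERT INTO sink_orders
SELECT * FROM source_orders;

-- Synchronization with cleaning/compute: the same kind of job, plus a
-- transformation and a filter on the way through.
INSERT INTO sink_orders_clean
SELECT order_id, shop_id, LOWER(TRIM(order_status)) AS order_status
FROM source_orders
WHERE order_amount > 0;
```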

3. A unified lakehouse data source center

We introduced a data source center to manage the data sources used across the middle platform in a unified way, which makes it easy for middle-platform administrators to manage data sources and control the permissions to use them. At the same time, the metadata modules previously scattered across projects were unified into the data source center, making it easy for users to see how a data source is being used. For Iceberg data lake sources we set up fine-grained catalog management to keep the lake from sinking into a data swamp. For data sources whose underlying data sits on HDFS, such as Hive and Iceberg, we added a small-file merging function that merges small files manually or on an automatic schedule, solving the small-file problem once and for all.
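
As a reference point for the small-file problem, the open-source Iceberg distribution ships Spark SQL stored procedures that perform this kind of merging; the Digital Stack's built-in merge feature is its own implementation, so the calls below only illustrate the underlying idea (catalog and table names are examples).

```sql
-- Compact small data files of an Iceberg table into larger ones
-- (requires Iceberg's Spark SQL extensions to be enabled).
CALL lake.system.rewrite_data_files(table => 'ods.orders');

-- Optionally expire old snapshots afterwards so the replaced files can be removed.
CALL lake.system.expire_snapshots(table => 'ods.orders');
```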

The Digital Stack's integrated lakehouse architecture

Based on the above, let's look at how the integrated lakehouse architecture is realized on the Digital Stack after data is loaded into the lake (Iceberg) and into Hive through FlinkX:

After introducing Iceberg, we can uniformly handle data storage in a variety of formats, including structured and semi-structured data, with underlying storage support for OSS, S3, HDFS, and other storage systems. Iceberg's features also give us ACID guarantees, table schema evolution, snapshot-based reads of historical data, and more. Meanwhile, the Digital Stack unifies the metadata center at the upper layer: with unified metadata storage, it can manage not only the data lake's storage but also the original data warehouse, providing a single entry point at the table schema layer, so that upper-layer computation sees global table information rather than isolated table information from multiple sources.
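
For example, the snapshot-based reads of historical data mentioned above look roughly like this through the Flink-Iceberg connector's read options; the snapshot id and table name are placeholders.

```sql
-- Read the table as of a specific historical snapshot
-- (the OPTIONS hint requires table.dynamic-table-options.enabled = true).
SELECT * FROM lake.ods.orders /*+ OPTIONS('snapshot-id' = '1234567890123456789') */;

-- Or incrementally consume newly committed snapshots as a stream.
SELECT * FROM lake.ods.orders /*+ OPTIONS('streaming' = 'true', 'monitor-interval' = '10s') */;
```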

After unifying the metadata, we need a tool that can compute over the data lake and the data warehouse on top of the metadata that has been built. In the Hadoop ecosystem there are many such compute tools, including Trino, Flink, and Spark. Within this architecture we can choose according to the customer's business scenario: if the customer already has a data warehouse and wants to use the data lake to build the upper business layer, Flink and Trino, which support cross-source queries, are suitable choices; and if the customer also has requirements for interactive query performance, the horizontal scalability provided by Trino's MPP architecture makes it a good fit.
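
A cross-source interactive query in this setup might look like the following Trino SQL, joining an Iceberg lake table with a table from an existing MySQL business database; the catalog names simply reflect however the Trino catalogs are configured and are illustrative.

```sql
-- Federated query across the Iceberg lake and an existing MySQL store,
-- executed by Trino's MPP engine for interactive latency.
SELECT o.shop_id,
       s.shop_name,
       SUM(o.order_amount) AS order_amount
FROM iceberg.ods.orders AS o
JOIN mysql.crm.shops     AS s ON o.shop_id = s.shop_id
GROUP BY o.shop_id, s.shop_name;
```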

The Digital Stack's outlook for the future

The Digital Stack currently unifies real-time and offline data integration, compute, and storage capabilities by introducing Iceberg and transforming FlinkX, realizing a basic lakehouse on the Digital Stack. In the future, we hope the Digital Stack gains cross-source capabilities: not only building a lakehouse within a single Hadoop ecosystem, but also building on existing traditional stores such as MySQL and Oracle warehouses (without having to pull data out of MySQL, Oracle, and the like into a unified data center), registering different catalogs through the unified metadata center for isolation, and combining them with the newly built data lake so that the Flink compute engine at the upper layer realizes the lakehouse.
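
A rough sketch of what such cross-source compute could look like in Flink SQL: the dimension table stays in MySQL and is exposed through the JDBC connector, while the fact table lives in the Iceberg lake, so no data has to be copied into a central store first; all connection options and names are assumptions for illustration.

```sql
-- Dimension table left in the MySQL business database, exposed via JDBC
-- (illustrative URL and credentials).
CREATE TABLE dim_shops (
    shop_id   BIGINT,
    shop_name STRING,
    PRIMARY KEY (shop_id) NOT ENFORCED
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:mysql://mysql:3306/crm',
    'table-name' = 'shops',
    'username' = 'app',
    'password' = '******'
);

-- Join it directly with the fact table in the Iceberg lake catalog.
SELECT o.shop_id, d.shop_name, SUM(o.order_amount) AS order_amount
FROM lake.ods.orders AS o
JOIN dim_shops AS d ON o.shop_id = d.shop_id
GROUP BY o.shop_id, d.shop_name;
```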

At the storage layer, we want to support not only the current HDFS and S3 but also local and cloud storage; on top of storage, we want to automate data management, including regular merging of small files, acceleration of remote file access, data indexing, and unified metadata management. Our goal is to deliver the storage layer of the data warehouse as a service.

To achieve the above plan, we still have many functions to optimize and integrate. Going forward, we will closely follow and participate in the Iceberg, Hudi, and Flink communities, keep an eye on their plans and development, and continue to iterate in combination with our current unified data development platform, so as to reach DaaS-level capability and let enterprises and users amplify the value of data under the integrated lakehouse architecture.