Before digging into lakehouse integration, let's start with an interesting story about data warehousing.
Walmart operates one of the world's largest data warehouse systems. By applying data mining to its transaction data, it found that the item most frequently purchased together with diapers was beer. Follow-up field research explained why: in the United States, young fathers are often told by their wives to pick up diapers for the baby on the way home from work, and 30%–40% of them grab some beer for themselves while they are at it.
This is the "beer and diapers" story so often told in the field of big data!
As you can see, big data has long been part of our daily lives. With that in mind, let's go through the basic concepts behind lakehouse integration.
01 What are data warehouses, data marts, and data lakes?
1. Data warehouse
Early systems used databases to store and manage data, but databases lack flexible, powerful processing capabilities. With the rise of big data technology, we wanted to discover the possible relationships hidden in data, so a new storage and management system was designed: all data is stored centrally and then processed in a unified way. This system is called a data warehouse.
In computing, data warehouses are systems used for reporting and data analysis and are considered a core component of business intelligence. A data warehouse is a central repository of integrated data from one or more different sources. Data warehouses store current and historical data together to facilitate various analytical methods such as online analytical processing (OLAP) and data mining, helping decision makers quickly analyze valuable information from large amounts of data and help build business intelligence (BI).
Although data warehouses are great for structured data, many modern enterprises must also deal with unstructured and semi-structured data, and with data of high variety, velocity, and volume. Data warehouses are not suited to many of these scenarios, nor are they cost-effective for them.
2. Data mart
Each department also has its own needs for processing and analyzing business data, does not care about other departments' data, and does not want to operate directly on the full data warehouse (queries there are slow and may interfere with others' work). So a new storage system is established, and the relevant data is copied from the warehouse into it; it is essentially a subset of the data warehouse. This system is called a data mart.
For example, a department in a company that wants to analyze investor-service data sets up a data mart for that data, extracting it from the data warehouse.
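As a minimal sketch of this idea, the snippet below builds an in-memory "warehouse" table with sqlite3 and carves an investor-service mart out of it as a filtered, department-scoped copy. The table and column names are invented purely for illustration.

```python
import sqlite3

# In-memory stand-in for the warehouse; schema is illustrative only.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE service_records (id INTEGER, department TEXT, topic TEXT, handled INTEGER)"
)
warehouse.executemany(
    "INSERT INTO service_records VALUES (?, ?, ?, ?)",
    [
        (1, "investor_services", "account_opening", 1),
        (2, "investor_services", "dividend_query", 0),
        (3, "logistics", "shipment_delay", 1),
    ],
)

# The "mart" is simply a filtered copy of warehouse data for one department.
warehouse.execute(
    "CREATE TABLE investor_mart AS "
    "SELECT id, topic, handled FROM service_records "
    "WHERE department = 'investor_services'"
)
rows = warehouse.execute("SELECT COUNT(*) FROM investor_mart").fetchone()[0]
print(rows)  # 2 - only the investor-service subset made it into the mart
```

In practice the extraction is an ETL job between separate systems rather than a `CREATE TABLE AS SELECT` in one database, but the subset relationship is the same.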
3. Data lake
With the development of information technology and the spread of electronic devices, huge volumes of photos, videos, documents, and other unstructured data are generated, and people also want to discover relationships among these data with big data technology. So a system broader than the data warehouse was designed, one that can store unstructured and structured data together and do some processing on both. This system is called a data lake.
Data warehouses offer better growth, while data lakes offer more flexibility.
A data warehouse supports a relatively uniform set of data structures, while a data lake can hold rich, almost arbitrary data types. Data warehouses are better suited to analyzing mature, well-understood data; data lakes are better suited to mining value from heterogeneous data.
Data lakes, while good at storing data, lack some key capabilities: they do not support transactions, do not guarantee data quality, and lack consistency and isolation, which makes it nearly impossible to mix appends with reads, or batch jobs with streaming jobs. For these reasons, many of the promised capabilities of data lakes have never materialized, and in many cases their benefits have been lost.
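To make the missing-transactions point concrete, here is a minimal sketch of the atomic-commit trick (write a new snapshot, then atomically rename it into place) that table formats such as Delta Lake and Iceberg build on to add transactional behavior to plain lake files. The single-JSON-file "snapshot" layout is a deliberate simplification for illustration, not how any real format stores data.

```python
import json
import os
import tempfile

def atomic_append(table_dir: str, new_rows: list) -> None:
    """Rewrite the table snapshot so a reader sees either the old version or
    the new one, never a half-written file."""
    snapshot = os.path.join(table_dir, "snapshot.json")
    rows = []
    if os.path.exists(snapshot):
        with open(snapshot) as f:
            rows = json.load(f)
    rows.extend(new_rows)
    # Write the new snapshot to a temp file, then atomically swap it in.
    fd, tmp = tempfile.mkstemp(dir=table_dir)
    with os.fdopen(fd, "w") as f:
        json.dump(rows, f)
    os.replace(tmp, snapshot)  # atomic rename: no reader sees a partial file

lake_table = tempfile.mkdtemp()  # stand-in for a table directory on lake storage
atomic_append(lake_table, [{"id": 1}])
atomic_append(lake_table, [{"id": 2}])
with open(os.path.join(lake_table, "snapshot.json")) as f:
    n = len(json.load(f))
print(n)  # 2
```

Without this kind of commit protocol, a raw data lake leaves readers exposed to half-written files whenever appends and reads overlap, which is exactly the gap described above.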
02 Data lake + data warehouse = lakehouse integration?
Before the advent of lakehouses, data warehouses and data lakes were the most discussed topics.
Before getting into the topic proper, let's cover one concept: the workflow of big data. Two relatively unfamiliar terms are involved here: the degree of structure of data, and the density of data. The former describes how well-formed the data itself is; the latter describes how much information is contained per unit of storage.
Generally speaking, most raw data that people obtain is unstructured, and its information density is relatively low. Through cleaning, analysis, mining, and other operations, useless data can be eliminated and correlations in the data discovered; in this process, both the degree of structure and the information density of the data rise. The final step is to put the refined data to use, turning it into a real means of production.
In short, big data processing is a process of raising the degree of structure and the information density of data. Along the way, the characteristics of the data keep changing, and different data suits different storage media, which is what fueled the once-heated dispute between data warehouses and data lakes.
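The structure-and-density idea can be illustrated with a toy cleaning step: raw, partly useless log lines go in, and only records matching a schema come out. The line format and field names here are invented for illustration.

```python
import re

# Raw, low-density input: free-form lines, some of them pure noise.
raw = [
    "2024-05-01 user=alice action=view item=diapers",
    "### heartbeat ###",
    "2024-05-01 user=alice action=buy item=beer",
    "corrupted line !!",
]

pattern = re.compile(
    r"(?P<date>\S+) user=(?P<user>\w+) action=(?P<action>\w+) item=(?P<item>\w+)"
)

# Cleaning: keep only lines that fit the schema. The surviving records are
# both more structured and denser in information per byte stored.
records = [m.groupdict() for line in raw if (m := pattern.fullmatch(line))]
print(len(records), records[1]["item"])  # 2 beer
```

Each later stage (aggregation, mining) would raise the density further, e.g. collapsing millions of such records into a single "diapers correlate with beer" finding.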
A data warehouse is a subject-oriented, integrated, relatively stable collection of data that records historical change, used primarily to support management decisions and the global sharing of information. Put simply, a data warehouse is like a large library: the material in it must be shelved according to rules, and you find what you want by category.
The mainstream definition today is that a data warehouse is a large-capacity repository sitting on top of multiple databases, whose role is to store large amounts of structured data and provide unified data support for management analysis and business decision-making. Although access is relatively cumbersome and data types are restricted, in that era the functionality of the data warehouse was sufficient, so around 2011 the market still belonged to data warehouses.
In the Internet era, data volumes exploded and data types became heterogeneous. Constrained by data scale and data type, traditional data warehouses could not support business intelligence at Internet scale. As Hadoop and object-storage technology matured, the concept of the data lake was born, proposed by James Dixon in 2011.
Compared with a data warehouse, a data lake is an evolving, scalable infrastructure for big data storage, processing, and analysis. Like a vast warehouse that accepts raw data in any form (structured or unstructured) and any format (text, audio, video, images), data lakes are typically larger and cheaper to run. But their problem is just as obvious: data lakes lack structure, and once governance slips, they turn into data swamps.
In terms of product form, a data warehouse is generally an independent, standardized product, while a data lake is more like an architectural blueprint that needs a series of surrounding tools to meet business needs. In other words, the flexibility of the data lake is friendly to early development and deployment, while the standardization of the data warehouse is friendly to long-term big data operations and the company's long-term growth. So, is there a new architecture that can combine the advantages of both?
Thus, the lakehouse was born.
According to Databricks' definition, the lakehouse is a new paradigm that combines the advantages of data lakes and data warehouses, implementing data structures and data-management functions similar to those of a data warehouse on top of the low-cost storage of a data lake. It is also a more open architecture. Some people use a metaphor: it is like building many small houses by a lake, where some are responsible for data analysis, some run machine learning, some retrieve audio and video, and so on, while all of them can easily draw their source data streams from the lake.
As far as its development trajectory goes, early lakehouse integration was more of a processing idea: connect the data lake and the data warehouse so they open up to each other. Today's lakehouse integration, though still in an early stage, is no longer a purely technical concept; it has been given more meaning and value at the vendor product level.
It should be noted that "lakehouse integration" is not the same as "data lake" plus "data warehouse"; this is a common misunderstanding. Nowadays many companies build both storage architectures at once, with one large data warehouse dragging along several small data lakes. That does not mean the company has lakehouse-integration capability: integration is by no means the mere coexistence of a lake and a warehouse, and in such a setup data is heavily redundant across the two kinds of storage.
03 Why was lakehouse integration born?
1. Opening up data storage and compute
Companies need flexible, high-performance systems for a wide range of data applications, including SQL analytics, real-time monitoring, data science, and machine learning. Most recent advances in AI come from models that better handle unstructured data (text, images, video, audio); the two-dimensional relational tables of a pure data warehouse can no longer carry the processing of semi-structured and unstructured data, and AI engines cannot run on warehouse models alone.
A common solution is to combine the benefits of the two and establish a lakehouse, which removes the limitations of the data lake by implementing data structures and data-management functions similar to those of a data warehouse directly on the lake's low-cost storage.
Weibo, for example, previously built a data warehouse platform for its big data needs and a separate data lake platform for its AI needs. The two platforms were completely separated at the cluster level, and data and compute could not flow freely between them. Lakehouse integration enables seamless flow between the lake and the warehouse, opening up the different layers of data storage and compute.
2. Flexibility and growth
Flexibility and growth matter differently to enterprises at different stages. When an enterprise is just starting out, data needs a phase of innovation and exploration from generation to consumption before patterns settle; for the big data system supporting that kind of business, flexibility matters more, and the data lake architecture fits better. As the enterprise matures and a series of data-processing pipelines has solidified, the problem shifts to ever-growing data volumes, rising processing costs, and more people and departments involved in the data process; then the growth capacity of the big data system determines how far the business can go, and the data warehouse architecture fits better.
From this comparison of data lakes and data warehouses, it is clear that one is friendly to start-ups while the other grows better. Must lakes and warehouses be an either/or choice for enterprises? Is there an option that combines the flexibility of a data lake with the growth of a cloud data warehouse, at a lower total cost of ownership? Lakehouse integration is that answer!
04 What is lakehouse integration?
Given current trends in big data technology, enterprises are no longer satisfied with a standalone data lake or data warehouse. More and more of them are converging the two platforms: they want not only the functions of a data warehouse but also other kinds of data processing, data science, and advanced functions for discovering new patterns. Lakehouse integration is a new open architecture that fully combines the advantages of both. It is built on the low-cost storage of the data lake and inherits the data-processing and management functions of the data warehouse, connecting the two systems so that data and compute flow freely between lake and warehouse. As a new generation of big data architecture, it will gradually replace standalone data lake and data warehouse architectures.
05 Introduction to the Data Lakehouse
The Data Lakehouse is a new data architecture that absorbs the advantages of both data warehouses and data lakes, letting data analysts and data scientists work on data in the same store while also making data governance easier for the company. So what exactly is a Data Lakehouse, and what are its characteristics?
We have long used two storage architectures for data:
Data warehouse: a storage architecture that mainly holds structured data organized in relational databases. Data is transformed, consolidated, and cleansed before being imported into target tables. In a warehouse, the stored structure of the data strictly matches its defined schema.
Data lake: a storage architecture that can hold any type of data, including unstructured data such as pictures and documents. Data lakes are typically larger and cheaper to run. The stored data need not satisfy any particular schema, and the lake does not try to impose one; instead, the owner of the data applies a schema at read time (schema-on-read) and transforms the data when it is processed.
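A minimal sketch of schema-on-read, with invented field names: the "lake" keeps raw JSON lines of varying shape, and the reader imposes whatever schema it needs only at read time.

```python
import json

# The "lake" keeps raw records as-is; fields vary per record. A
# schema-on-write system would have rejected the second line at load time.
lake = [
    '{"user": "alice", "age": 30}',
    '{"user": "bob", "note": "no age recorded"}',
]

def read_with_schema(raw_lines, schema):
    """Schema-on-read: the reader decides the shape, filling gaps with None."""
    for line in raw_lines:
        rec = json.loads(line)
        yield {field: rec.get(field) for field in schema}

rows = list(read_with_schema(lake, ["user", "age"]))
print(rows[1])  # {'user': 'bob', 'age': None}
```

The same raw lines could be read with a different schema (say, `["user", "note"]`) by another consumer, which is exactly the flexibility, and the governance risk, that distinguishes lakes from warehouses.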
Nowadays, many companies build both architectures at once: one large data warehouse plus multiple small data lakes. As a result, data is stored redundantly in both.
The Data Lakehouse attempts to bridge this gap between warehouses and lakes by building the warehouse on top of the lake, making storage cheaper and more elastic while improving data quality and reducing redundancy. ETL plays a central role in building a lakehouse: it turns unstructured data in the lake layer into structured data in the warehouse layer.
The Data Lakehouse concept was proposed by Databricks, which lists the following features:
- Transaction support: a lakehouse can handle multiple concurrent data pipelines, supporting concurrent read and write transactions without compromising data integrity.
- Schema enforcement: data warehouses apply a schema to everything they store, while data lakes do not. A lakehouse can standardize the vast majority of its data by applying a schema driven by the needs of the application.
- Support for reporting and analytics applications: both kinds of application can use this storage architecture. Data held in a lakehouse is cleansed and integrated, speeding up analysis; compared with a warehouse, it can hold more data with better timeliness, which noticeably improves report quality.
- Broader data types: data warehouses support only structured data, while a lakehouse can also hold files, video, audio, and system logs.
- End-to-end streaming support: a lakehouse can support streaming analytics to meet the need for real-time reporting, which is becoming important in more and more enterprises.
- Compute-storage separation: data lakes are often implemented on low-cost hardware and clustered architectures that provide very cheap, decoupled storage. Since a lakehouse is built on a data lake, it naturally adopts a storage-compute-separated architecture: data lives in one cluster and is processed in another.
- Openness: a lakehouse is usually built from components such as Iceberg, Hudi, and Delta Lake. These components are open source, and they use open, widely compatible storage formats such as Parquet and ORC as the underlying data format, so different engines and different languages can all operate on the lakehouse.
The Lakehouse concept was first proposed by Databricks; other similar products include Azure Synapse Analytics. Lakehouse technology is still evolving, so the features described above will keep being revised and improved.
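The schema-enforcement feature listed above can be sketched as a write-time check; the field names are invented, and a real lakehouse table format does this far more elaborately (type coercion, schema evolution, and so on).

```python
# Declared table schema; illustrative field names only.
SCHEMA = {"id": int, "item": str}

def enforce(record: dict) -> dict:
    """Reject records that do not match the declared schema, as a lakehouse
    table would at write time; keep only the declared fields."""
    for field, ftype in SCHEMA.items():
        if not isinstance(record.get(field), ftype):
            raise ValueError(f"schema violation on field {field!r}")
    return {f: record[f] for f in SCHEMA}

# A conforming record is written (undeclared fields dropped)...
table = [enforce({"id": 1, "item": "beer", "source": "lake"})]
# ...while a malformed one is rejected instead of silently polluting the table.
try:
    enforce({"id": "not-an-int", "item": "diapers"})
    rejected = False
except ValueError:
    rejected = True
print(table[0], rejected)  # {'id': 1, 'item': 'beer'} True
```

This is the key behavioral difference from a raw lake, which would have accepted both records unchecked.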
06 What are the benefits of lakehouse integration?
Lakehouse integration brings out both the flexibility and rich ecosystem of the data lake and the growth and enterprise-grade capabilities of the data warehouse. It helps enterprises build data assets, monetize data, drive business-wide intelligence, and support large-scale business intelligence in the future. The main problems it addresses are the following:
Data duplication: an organization that maintains a data lake and several data warehouses at once inevitably introduces redundancy. At best this only makes data processing inefficient; at worst it leads to inconsistent data. Lakehouse integration can eliminate the duplication and achieve a single source of truth.
High storage costs: both data warehouses and data lakes aim to reduce the cost of storing data. Warehouses cut costs by reducing redundancy and consolidating heterogeneous sources; lakes cut costs by using big data file systems and Spark to store and compute on inexpensive hardware. The lakehouse architecture aims to combine these techniques to minimize cost.
The gap between reporting and analytics teams: data scientists tend to work on the data lake, applying varied analytical techniques to raw data, while report analysts prefer consolidated data such as warehouses or marts. The two teams often have little overlap inside an organization, yet their work contains real duplication and contradiction. With a lakehouse architecture, both teams work on the same data architecture and avoid unnecessary duplication.
Data staleness: staleness is one of the most serious problems of a data lake; left ungoverned, the lake quickly becomes a data swamp. It is easy to throw data into the lake, but without effective governance its freshness becomes ever harder to trace over time. Lakehouse integration brings governance to massive data and helps keep analysis data timely.
Risk of incompatibility: data analytics is still a young field, with new tools and techniques appearing every year. Some may be compatible only with data lakes, others only with data warehouses. A lakehouse architecture means being prepared for both.
07 Lakehouse integrated landing path and cost
Q: Now that most enterprises already have their own big data architecture, how do they build a lakehouse on top of it? What are the possible paths to adoption, and where are the costs likely to come from?
A: Some enterprises already have a big data architecture in place. These companies tend to be older, and most chose the Hadoop ecosystem, either self-built or as a cloud-hosted Hadoop service. They have plenty of options: they can choose a plan like Databricks' or one like MaxCompute's.
Both paths are feasible, so how to choose? It usually depends on how much the company wants to invest in its big data stack. If the enterprise would rather not pour resources into infrastructure and prefers to spend them on the business, a more fully managed lakehouse solution offers better value. Conversely, if the enterprise has many engineers and wants the underlying infrastructure to be flexible and controllable, it can choose to build the warehouse on the lake itself.
There are also newer companies, say those founded in the past three years, many of them in a high-growth stage. These enterprises were born on the cloud; often the big data architecture they started with is already a cloud data warehouse, and evolving forward from there is relatively simple. By using cloud infrastructure as much as possible, they can form a lakehouse architecture just by enabling a few cloud services, a simple and direct path.
So where does the cost come from? If the enterprise chooses a fully managed lakehouse solution, the cost is mainly a one-time expense for the existing data, such as warehouse migration and data cleanup; once that work is done, data governance enters a positive cycle and overall cost stays low. If the enterprise chooses to maintain its own lakehouse architecture, the cost is mainly the ongoing labor and hardware of maintaining and tuning the whole infrastructure.
Q: In your view, what are the main problems and challenges enterprises meet when trying to land a lakehouse? Is now a good time to adopt one?
A: At present, most enterprises have not adopted the new lakehouse architecture; they have chosen either a data lake or a data warehouse solution. As an emerging architecture, it is still in an early, exploratory stage for many. Some enterprises, after landing data in a lake, find governance and management on the lake hard to do well; adopting the lakehouse model then lets them layer a warehouse tier and a governance tier on top of data that is flexible but under-managed, and govern it properly. For data warehouse users, if the warehouse system they use supports the lakehouse architecture, they can simply mount a data lake directly.
Enterprises meet several main problems and challenges when trying to land a lakehouse. First, if the team lacks solid experience in data governance or data management, the challenge is greater. That is why almost all solutions are moving toward fully managed or full-service SaaS models, hoping to lower the threshold.
Second, enterprises that build their own lakehouse face its high complexity, especially the coordination between lake and warehouse: connecting the two storage systems, keeping metadata consistent, cross-referencing data across the different engines on lake and warehouse, plus bandwidth and security issues. Moreover, since the bottom of the lakehouse architecture is a two-part system, should users be able to see both systems? If they can, how are the two distinguished and users guided between them? If they cannot, what kind of wrapping must the developers build? These are all problems a self-built lakehouse will run into.
In short, if the enterprise does not need to invest heavily in infrastructure, adopting the fully managed lakehouse architecture directly is much simpler.
Finally, lakehouse integration is still an emerging direction, and many questions, such as which data belongs in the warehouse and which in the lake, are still being explored. It suits enterprises with some willingness to explore and innovate.