The database industry is heading for a watershed.
The global database industry has grown rapidly over the past few years. In 2020, Gartner redefined its Magic Quadrant for the first time as the Magic Quadrant for Cloud Database Management Systems, making cloud databases the sole direction of evaluation. In 2021, two key changes took place in the Magic Quadrant:
1. Snowflake and Databricks, two cloud data platform vendors, entered the Leaders quadrant;
2. The revenue threshold for the Magic Quadrant was relaxed, and newer database players such as SingleStore, Exasol, MariaDB, and Couchbase entered the list for the first time.
To some extent, this change signals that the global database industry has entered a golden age of development, and that emerging forces are rising at an accelerating pace. The most telling example is the ongoing sparring between Snowflake and Databricks. The former, the representative player in cloud data warehousing, more than doubled its business again last year; the latter saw its valuation soar to $36 billion on the strength of its "lakehouse" launch. The dispute between the two is, at its heart, a dispute between old and new database architectures.
Figure: the evolution from data lake and data warehouse to the lakehouse (source: Databricks)
As enterprise digitalization enters the deep-water zone, data usage scenarios are diversifying, and data that enterprises once easily overlooked is moving from backstage to center stage. Choosing a suitable database product for so many scenarios has become a required course for many CIOs and managers.
However, one thing is certain: yesterday's databases can no longer match today's ever-growing data complexity. Judged by scalability and availability, distributed architectures have broken through the limits of the standalone, shared-storage, and cluster architectures, and have developed rapidly in recent years.
To this end, this article will mainly analyze:
1. What are the data warehouse, the data lake, and the lakehouse?
2. How did the architecture evolve, and why does the lakehouse represent the future?
3. Is now a good time to adopt the lakehouse?
01: Data lake + data warehouse ≠ lakehouse
Before the advent of lakehouses, data warehouses and data lakes were the most discussed topics.
Before getting into the topic proper, let's first introduce a concept: what is the workflow of big data? Two relatively unfamiliar terms are involved here: the degree of structure of data and the information density of data. The former describes how normalized the data itself is; the latter describes how much information is contained per unit of storage.
Generally speaking, most raw data is unstructured and its information density is relatively low. Through cleaning, analysis, mining, and other operations, useless data can be eliminated and correlations in the data uncovered; in this process, both the degree of structure and the information density of the data improve. The final step is to optimize the data for use, turning it into a real means of production.
In short, big data processing is essentially a process of raising the degree of structure and the information density of data. Along the way, the characteristics of the data keep changing, and different data suits different storage media, which is what sparked the once-heated dispute between data warehouses and data lakes.
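The workflow described above can be sketched in a few lines of code. This is a toy illustration only: the log format, field names, and cleaning rule below are made-up assumptions, not anything from a real pipeline. It shows raw, unstructured lines being cleaned into structured records, which is exactly the rise in "degree of structure" the text describes.

```python
import re

# Raw, unstructured input: low information density, includes junk.
RAW_LOGS = [
    "2024-01-05 12:03:11 user=alice action=login",
    "corrupted line ###",  # useless data, to be eliminated
    "2024-01-05 12:04:02 user=bob action=purchase",
]

# An assumed log layout, expressed as a pattern with named fields.
LINE_RE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>[\d:]+) "
    r"user=(?P<user>\w+) action=(?P<action>\w+)"
)

def clean(lines):
    """Data cleaning: drop unparseable lines, emit structured records."""
    records = []
    for line in lines:
        m = LINE_RE.match(line)
        if m:  # only well-formed lines survive
            records.append(m.groupdict())
    return records

records = clean(RAW_LOGS)
print(len(records))         # 2 structured records remain
print(records[0]["user"])   # fields are now individually addressable
```

After cleaning, every surviving byte carries usable information, so both the structure and the information density of the data have increased.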
Let's start with the data warehouse. Born in 1990, it is a subject-oriented, integrated, relatively stable collection of data that reflects historical change, used mainly to support management decisions and the global sharing of information. Put simply, a data warehouse is like a large library: the data inside must be shelved according to rules, and you find what you want by category.
The mainstream definition today is a high-capacity repository that sits above multiple databases, storing large amounts of structured data and providing unified data support for management analysis and business decision-making. Access is relatively cumbersome and data types are restricted, but in that era the functionality of the data warehouse was enough; until around 2011, the market still belonged to data warehousing.
In the Internet era, data volumes exploded and data types became heterogeneous. Constrained by data scale and data type, traditional data warehouses could not support Internet-era business intelligence. With the maturing of Hadoop and object storage technology, the concept of the data lake, proposed by James Dixon in 2011, was born.
Compared with a data warehouse, a data lake is an evolving, scalable infrastructure for big data storage, processing, and analysis. Like a vast warehouse, it can hold raw data of any structure (structured or unstructured) and any format (text, audio, video, images), and it is typically larger and cheaper to store. Its problem is equally obvious: data lakes lack structure, and without proper governance they degenerate into data swamps.
In product form, a data warehouse is generally a standalone, standardized product, while a data lake is more an architectural guideline that needs a constellation of peripheral tools to meet business needs. In other words, the flexibility of the data lake is friendly to early development and deployment, while the standardization of the data warehouse favors later big data operations and the long-term growth of the company. Is there, then, a new architecture that can combine the advantages of both?
Thus, the lakehouse was born.
According to Databricks' definition, the lakehouse is a new paradigm that combines the advantages of data lakes and data warehouses, implementing data structures and data management features similar to those of a data warehouse on top of the low-cost storage of a data lake. It is a more open architecture. One metaphor: many small houses are built beside a lake, one doing data analysis, one running machine learning, one retrieving audio and video, and so on, while all of them draw their source data directly from the lake.
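The core idea, warehouse-style data management layered over cheap lake storage, can be sketched in miniature. This is not a real lakehouse engine; every name here (the object store, the `catalog`, the `query` helper) is invented purely for illustration. The point is that one shared copy of the data serves schema-aware queries through a thin metadata layer, with no second warehouse copy.

```python
# A single low-cost store holds raw objects of any format,
# structured CSV alongside unstructured binary blobs.
object_store = {
    "sales/2024-01.csv": "sku,qty\nA1,3\nB2,5",
    "audio/greeting.wav": b"\x52\x49\x46\x46",  # raw bytes, kept as-is
}

# Warehouse-style management lives in metadata, not in a data copy:
# the catalog records schema and location over the same objects.
catalog = {
    "sales": {"path": "sales/2024-01.csv", "schema": ["sku", "qty"]},
}

def query(table, column):
    """Schema-aware read executed directly over the shared store."""
    meta = catalog[table]
    header, *rows = object_store[meta["path"]].splitlines()
    idx = meta["schema"].index(column)
    return [row.split(",")[idx] for row in rows]

print(query("sales", "qty"))  # structured access, one copy of the data
```

The contrast with "lake + warehouse" is that here the analytical query and the raw audio file live over the same storage; nothing is duplicated into a separate warehouse.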
As for the development trajectory of the lakehouse: early on it was more of a processing idea, a way of opening up the data lake and the data warehouse to each other. Today's lakehouse, although still at an early stage of development, is no longer just a technical concept; it has been given additional meaning and value at the level of vendors' products.
It should be noted that "lakehouse" is not equivalent to "data lake" + "data warehouse"; that is a great misunderstanding. Many companies today build both storage architectures at once, with one large data warehouse dragging along several small data lakes, but that does not mean the company has lakehouse capability. A lakehouse is by no means a data lake and a data warehouse simply wired together; in that setup, the data is heavily duplicated across the two kinds of storage.
02: Why is the lakehouse the future?
Back to the core question posed at the beginning: why can the lakehouse represent the future?
We can reframe the question: in the era of data intelligence, will the lakehouse become a must-have when enterprises build their big data stacks?
In terms of both the technology and application trends, the answer is almost certainly yes. For high-growth enterprises, replacing the traditional standalone warehouse and standalone lake with a lakehouse architecture has become an irreversible trend.
A convincing sign is that, at this stage, major cloud vendors at home and abroad have successively launched their own lakehouse solutions, such as Amazon Web Services' Redshift Spectrum, Microsoft's Azure Databricks, Huawei Cloud's FusionInsight, and Dipu Technology's FastData. These players are established leaders in cloud computing or new forces in the field of data intelligence.
In fact, the evolution of architecture is driven directly by the business: when the business side raises its performance requirements, the database architecture must be technically upgraded as part of the big data build-out.
Take Dipu Technology, the fastest-growing unicorn in China's digital enterprise services field, as an example. Relying on FastData, a new-generation lakehouse and unified stream-batch data analysis platform, and on deep insight into industries such as advanced manufacturing, biomedicine, and consumer distribution, Dipu Technology starts from real scenarios and provides customers with one-stop digital solutions.
Dipu believes that "in the field of data analysis, the lakehouse is the future. It better answers the needs of data analysis in the AI era, and is ahead of past analytical databases in storage format, compute engine, data processing and analysis, openness, and evolution toward AI." Take AI applications as an example: the lakehouse architecture is naturally suited to AI analysis (including storage of unstructured audio and video data, compatibility with AI computing frameworks, and platform capabilities for full-lifecycle model development and machine learning), and it is also better suited to the era of large-scale machine learning.
This coincides with the trend.
Not long ago, Gartner released its prediction of future lakehouse application scenarios: the architecture needs to support three types of real-time scenarios. The first is real-time continuous intelligence; the second is real-time on-demand intelligence; the third is offline on-demand intelligence. These will be delivered to data consumers through snapshot views, real-time views, and combined real-time-batch views, which is also the direction in which the lakehouse architecture will need to keep evolving.
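The three views Gartner describes can be illustrated with a minimal sketch, assuming a common split between a periodically rebuilt batch snapshot and a live delta of events that arrived since the snapshot. The data, field names, and function names are all invented for the example; the point is only how the three views relate to one another.

```python
# Assumed state: a batch snapshot rebuilt offline, plus streaming
# events that have arrived since the snapshot was taken.
batch_snapshot = {"orders": 1000}                 # result of the last batch job
realtime_delta = [{"orders": 3}, {"orders": 2}]   # events since that job

def snapshot_view():
    """Offline on-demand intelligence: serve the last batch result only."""
    return batch_snapshot["orders"]

def realtime_view():
    """Real-time on-demand intelligence: aggregate only the live events."""
    return sum(event["orders"] for event in realtime_delta)

def realtime_batch_view():
    """Real-time continuous intelligence: merge snapshot and live delta."""
    return snapshot_view() + realtime_view()

print(snapshot_view())        # stale but cheap
print(realtime_view())        # fresh but partial
print(realtime_batch_view())  # complete and current
```

Serving all three views from one storage layer, rather than from a separate warehouse and a separate stream store, is precisely what the lakehouse architecture is expected to evolve toward.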
03: Is it a good time to adopt the lakehouse?
From the perspective of market development, the lakehouse architecture is the inevitable next step in the process of technological evolution.
However, because this new open architecture is still at an early stage of development, differences in digital maturity and market awareness between domestic and foreign enterprises have produced very different solutions. As industry investors see it, "although the US enterprise-service market is far more mature than ours and offers many paths to reference, the Chinese market has many characteristics of its own. Take Dipu Technology, which benchmarks itself against Databricks: in the US market vendors often sell products, but China's large customers need solutions that are more deeply integrated with their business scenarios, solutions that balance generality and customization."
In an earlier collaboration with Dipu Technology, Belle International completed the construction of a unified data warehouse, consolidating data from multiple business lines and building out its various business domains. While keeping front-end data running normally and "hot-swapping" the underlying applications, the two companies worked closely to merge multiple data warehouses into one unified warehouse in just a few months, effectively unifying business definitions, greatly reducing development and operations workload, and closing the loop of the entire business value chain.
This is also the capability value of the lakehouse: as data structures diversify (3D drawings, live-stream video, conference video, audio, and other material), Belle International can, relying on a leading lakehouse architecture, first land its massive multi-modal data in the lake and then, when computing power allows and deeper business analysis scenarios have been identified, pull that data from the lake for analysis.
A simple example: a designer who wants to design a shoe would normally dig through historical data for useful references. With the lakehouse, a single product photo could be enough: as easily as browsing a movie, the designer can see the product's full lifecycle sales performance over the years, its brand story, competitor analysis, and other data, empowering production and business decisions and maximizing the value of the data.
In general, large enterprises that want to sustain growth rely on large-scale, effective data output to achieve intelligent decision-making. Many enterprises are held back by the limits of their IT capabilities and simply cannot do many of the things they would like to. Through the lakehouse architecture, previously constrained data value can be fully realized; if an enterprise learns to value its data and deliberately preserve it, it has completed one of the important tasks of digital transformation.
We also have reason to believe that, as enterprise digital transformation accelerates, the lakehouse architecture will enjoy even broader room for development.