
Recently, the concept of the data lake has become very hot, and many colleagues on the front line are discussing how to build one. Is there a mature data lake solution? Are there data lake cases from the major vendors? How should we understand data lakes? What is the difference between a data lake and a big data platform? With these questions in mind, we tried to write this article, hoping to spark some thinking and resonance.
This article has the following 7 chapters:

- What is a data lake
- Basic characteristics of a data lake
- Basic data lake architecture
- Data lake solutions from various vendors
- Typical data lake application scenarios
- Basic process of data lake construction
- Summary

1. What is a data lake
Data lakes are a hot concept at the moment, and many enterprises are building or planning to build their own. However, before planning to build a data lake, it is crucial to figure out what a data lake is, clarify the basic components of a data lake project, and then design the basic architecture of the data lake. There are the following definitions of what a data lake is.
Wikipedia defines a data lake as a system or repository that stores data in its natural/raw format, usually as object blobs or files. A data lake is typically a single store for all the data in an enterprise, including raw copies of data produced by source systems and transformed data used for tasks such as reporting, visualization, advanced analytics, and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (e.g., CSV, logs, XML, JSON), unstructured data (e.g., emails, documents, PDFs), and binary data (e.g., images, audio, video). A data swamp is a degraded, unmanaged data lake that is either inaccessible to its intended users or provides little value.
AWS's definition is relatively succinct: a data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is (without having to structure it first) and run different types of analytics, from dashboards and visualizations to big data processing, real-time analytics, and machine learning, to guide better decisions.
Microsoft's definition is vaguer still; rather than saying explicitly what a data lake is, it cleverly defines what a data lake does: Azure's data lake includes all the capabilities that make it easy for developers, data scientists, and analysts to store and process data. It lets users store data of any scale, any type, and arriving at any speed, and do all types of analysis and processing across platforms and languages. A data lake helps users accelerate the application of data while removing the complexity of data collection and storage, and it supports batch processing, streaming computing, interactive analytics, and more. A data lake works alongside existing IT investments in data management and governance to keep data consistent, manageable, and secure, and it integrates seamlessly with existing operational databases and data warehouses to extend existing data applications. Azure Data Lake draws on the experience of a large number of enterprise users and supports large-scale processing and analytics scenarios inside several Microsoft businesses, including Office 365, Xbox Live, Azure, Windows, Bing, and Skype. As a service, Azure addresses many efficiency and scalability challenges, enabling users to maximize the value of their data assets to meet current and future needs.
There are many definitions of data lakes, but they basically revolve around the following characteristics.
- The data lake needs to provide sufficient storage capacity to hold all the data of an enterprise/organization.
- Data lakes can store massive amounts of data of any type, including structured, semi-structured, and unstructured data.
- The data in the data lake is raw data, a complete copy of the business data; it remains exactly as it is in the business systems.
- A data lake needs sound data management capabilities (complete metadata) and must be able to manage the various data-related elements, including data sources, data formats, connection information, data schemas, and permissions.
- Data lakes require diverse analytics capabilities, including but not limited to batch processing, streaming computing, interactive analytics, and machine learning; they also need to provide certain task scheduling and management capabilities.
- Data lakes require robust data lifecycle management. They must store not only the raw data but also the intermediate results of the various analysis and processing steps, and record the processing history completely, so that users can trace in detail how any piece of data was generated.
- Data lakes need sound data acquisition and data publishing capabilities. A data lake must support a variety of data sources, obtain full/incremental data from them, and store it in a normalized way; it must also be able to push analysis and processing results to storage engines appropriate for the different application access needs.
- Support for big data is required, including hyperscale storage and scalable large-scale data processing capabilities.
In summary, I personally believe that a data lake should be an evolving and scalable infrastructure for big data storage, processing, and analysis. It is data-oriented, realizing full acquisition, full storage, multi-mode processing, and full lifecycle management of data of any source, any speed, any scale, and any type; and, through interactive integration with various external heterogeneous data sources, it supports all kinds of enterprise-level applications.
Figure 1. Basic capabilities of a data lake

Two more points need to be made here:

- "Evolving and scalable" means that the data lake must not only scale its storage and compute as data volume grows, but also continuously evolve new processing capabilities as business needs change.
- "Data-oriented" means that the data lake should be simple and easy to use for its users, helping them free themselves from complex IT infrastructure operation and maintenance and focus on the business, models, algorithms, and data. Data lakes are aimed at data scientists and analysts. At present, cloud native is arguably the more ideal way to build a data lake; this view is discussed in detail later in the "Basic data lake architecture" section.
2. Basic characteristics of a data lake

After gaining a basic understanding of the concept of a data lake, we need to further clarify which basic characteristics a data lake should have, especially in comparison with big data platforms or traditional data warehouses. Before the detailed analysis, let's first look at a comparison table from AWS.
The table above compares the differences between data lakes and traditional data warehouses. Personally, I think we can further analyze the characteristics a data lake should have at both the data level and the compute level. In terms of data:

- "Flexibility": one row in the table contrasts "schema on write" with "schema on read", which is essentially a question of where schema design happens. Schema design is essential for any data application; even databases such as MongoDB that emphasize being "schemaless" still recommend that records use the same or similar structure wherever possible. The logic behind "schema on write" is that the schema is determined, according to the business's access patterns, before the data is written, and data is imported according to that established schema; the benefit is that the data adapts well to the business, but it also means a relatively high upfront cost, especially when the business model is unclear and still being explored. The logic behind the "schema on read" emphasized by data lakes is that business uncertainty is the norm: we cannot anticipate how the business will change, so we keep some flexibility, defer schema design, and let the infrastructure fit the data to the business "on demand". Therefore, I personally believe "fidelity" and "flexibility" follow the same line of thought: since there is no way to predict business changes, simply keep the data in its most original state and process it on demand when needed (a minimal schema-on-read sketch follows this list). Data lakes are thus better suited to innovative enterprises and businesses that change rapidly; correspondingly, data lake users need stronger skills, and data scientists and business analysts (equipped with certain visualization tools) are the target users.
- "Manageability": a data lake should provide sound data management capabilities. Since data requires "fidelity" and "flexibility", there are at least two types of data in a data lake, raw data and processed data, and the data in the lake keeps accumulating and evolving. This demands a lot from data management, which should cover at least the following elements: data sources, data connections, data formats, and data schemas (database/table/column/row). At the same time, a data lake is the unified data storage place within an enterprise/organization, so it also needs certain permission management capabilities.
- "Traceability": a data lake stores all the data of an organization/enterprise and needs to manage the full lifecycle of that data, covering the whole process of data definition, access, storage, processing, analysis, and application. A strong data lake implementation needs to make the access, storage, processing, and consumption of any piece of data traceable, so that the complete generation process and flow of the data can be clearly reproduced.
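To make "schema on read" concrete, here is a minimal PySpark sketch (the bucket, path, and field names are hypothetical): raw JSON events are kept on object storage exactly as produced, and a schema is declared only at query time, by the analysis that needs it.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Raw events were landed on object storage exactly as produced (fidelity);
# nothing about their structure was decided at write time.
raw_path = "s3a://example-data-lake/raw/events/"  # hypothetical bucket/path

# The schema is declared only now, when a concrete analysis needs it.
event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("ts", TimestampType()),
])

events = spark.read.schema(event_schema).json(raw_path)

# A different team could read the same files tomorrow with a different schema.
daily_clicks = (events
                .filter(events.event_type == "click")
                .groupBy(events.ts.cast("date").alias("day"))
                .count())
daily_clicks.show()
```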
In terms of compute, I personally believe the requirements a data lake places on computing power are actually very broad and depend entirely on what the business needs.

- Rich compute engines. From batch processing to stream computing, interactive analytics, and machine learning, all kinds of compute engines fall within the scope of a data lake. In general, batch engines are used for data loading, transformation, and processing; a streaming engine is used where real-time computation is required; and an interactive analysis engine may be introduced for exploratory analysis scenarios. As big data technology and AI technology grow closer, various machine learning/deep learning frameworks keep being introduced; for example, TensorFlow and PyTorch can already read sample data from HDFS/S3/OSS for training (see the sketch after this list). Therefore, for a qualified data lake project, the scalability/pluggability of compute engines should be a basic capability.
- Multi-modal storage engines. In theory, a data lake should have a built-in multi-modal storage engine to meet the access needs of different applications (taking into account factors such as response time, concurrency, access frequency, and cost). In practice, however, the data in a data lake is usually not accessed frequently and the related applications are mostly exploratory, so to achieve an acceptable cost-performance ratio, data lakes are usually built on relatively inexpensive storage engines (such as S3/OSS/HDFS/OBS) and work with external storage engines when needed to meet diverse application needs.
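As a small illustration of frameworks pulling samples straight from the lake's inexpensive object storage, here is a minimal sketch; it assumes the samples were written as Parquet, that pyarrow and s3fs are installed, and that the bucket and column names are hypothetical.

```python
import pandas as pd

# Hypothetical location of feature samples that an earlier batch job wrote to
# the data lake's object storage tier.
samples_uri = "s3://example-data-lake/curated/training_samples/2020-06/"

# pandas (via pyarrow + s3fs) can read Parquet straight from object storage,
# so a training job pulls samples without copying data into a separate system.
df = pd.read_parquet(samples_uri)

features = df.drop(columns=["label"])
labels = df["label"]
print(features.shape)
print(labels.value_counts())
```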
3. Basic data lake architecture

A data lake can be thought of as a new generation of big data infrastructure. To better understand the basic architecture of a data lake, let's first look at how big data infrastructure has evolved.
1) Phase 1: offline data processing infrastructure, represented by Hadoop. As shown in the figure below, Hadoop is a batch data processing infrastructure with HDFS as its core storage and MapReduce (MR) as its basic computing model. Around HDFS and MR, a series of components emerged to continuously improve the platform's data processing capabilities, such as HBase for online KV operations, Hive for SQL, and Pig for workflows. Meanwhile, as the performance requirements for batch processing kept growing, new computing models were proposed, producing engines such as Tez, Spark, and Presto, and the MR model gradually evolved into the DAG model.
On the one hand, the DAG model improves the computing model's abstraction and concurrency: each computation is decomposed, the job is logically split into stages according to the aggregation points in the computation, and each stage consists of one or more tasks that can execute concurrently, which improves the parallelism of the whole computation. On the other hand, to reduce the writing of intermediate result files during processing, engines such as Spark and Presto try to cache data in the memory of the compute nodes, improving the efficiency of the whole data pipeline and the throughput of the system.
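As a small illustration of the caching point above, the following PySpark sketch (the path and column names are hypothetical) caches an intermediate result so that two downstream aggregations reuse it instead of recomputing the upstream part of the DAG from storage.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dag-cache-demo").getOrCreate()

orders = spark.read.parquet("s3a://example-data-lake/raw/orders/")  # hypothetical path

# One shared upstream stage: filter and project once...
valid = (orders
         .filter(F.col("status") == "PAID")
         .select("customer_id", "amount", "region"))
valid.cache()  # ...and keep the intermediate result in executor memory

# Two downstream jobs branch off the cached result instead of re-reading storage.
by_region = valid.groupBy("region").agg(F.sum("amount").alias("revenue"))
by_customer = valid.groupBy("customer_id").agg(F.count("*").alias("orders"))

by_region.show()
by_customer.show()
```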
Figure 2. Hadoop architecture diagram
2) Phase 2: the Lambda architecture. As data processing capacity and requirements kept changing, more and more users found that however much batch performance improved, it could not satisfy scenarios with high real-time requirements, and streaming compute engines such as Storm, Spark Streaming, and Flink emerged.
However, as more and more applications went online, it became clear that most application requirements are met by using batch processing and stream computing together. Users do not actually care what the underlying computing model is; they want both batch and stream computation to return results based on a unified data model. Hence the Lambda architecture was proposed, as shown in the figure below.
Figure 3. Lambda architecture schematic

The core concept of the Lambda architecture is "stream-batch unification". As shown in the figure above, the entire data stream flows into the platform from left to right; after entering the platform, it splits in two, one part going to the batch path and the other to the stream computing path. Whichever compute mode is used, the final processing results are delivered to applications through a serving layer, ensuring consistent access.
3) Phase 3: the Kappa architecture. The Lambda architecture solves the consistency problem for applications reading data, but the "separate stream and batch" processing links increase R&D complexity. So the question arose: could a single system solve all the problems? The currently popular practice is to build on stream computing. Stream computing is naturally distributed and therefore destined to scale well; by increasing the concurrency of stream computing and enlarging the "time window" of streaming data, the batch and streaming compute modes are unified.
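The "one engine, both modes" idea can be illustrated with Spark Structured Streaming (used here purely as an example, not as the only way to realize a Kappa-style design): the same transformation function runs over a bounded batch read or an unbounded stream. The paths and the event schema are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kappa-style-demo").getOrCreate()

event_schema = "user_id STRING, event_type STRING, ts TIMESTAMP"

def hourly_clicks(events):
    # The same transformation serves both the batch and the streaming path.
    return (events
            .filter(F.col("event_type") == "click")
            .groupBy(F.window("ts", "1 hour"))
            .count())

# Batch mode: process what is already in the lake.
batch_events = spark.read.schema(event_schema).json("s3a://example-data-lake/raw/events/")
hourly_clicks(batch_events).show()

# Streaming mode: the identical logic over newly arriving files.
stream_events = spark.readStream.schema(event_schema).json("s3a://example-data-lake/raw/events/")
query = (hourly_clicks(stream_events)
         .writeStream
         .outputMode("complete")
         .format("console")
         .start())
```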
In summary, in the evolution from the traditional Hadoop architecture to the Lambda architecture, and from Lambda to Kappa, the big data infrastructure has gradually come to include all the data processing capabilities applications require, and the big data platform has gradually evolved into an enterprise/organization-wide full-data processing platform. In current enterprise practice, apart from the relational databases that various independent business systems rely on, almost all other data is considered for unified processing on the big data platform. However, today's big data platform infrastructure locks its perspective onto storage and compute and neglects the asset management of data, and that is precisely one of the key directions of the data lake as a new generation of big data infrastructure.
I once read a very interesting article that raised the following question: Why is a data lake called a data lake and not a data river or a data ocean? An interesting answer is:
- The reason it is not called a "sea" is that a sea is boundless, whereas a "lake" has boundaries, and that boundary is the business boundary of the enterprise/organization; as a result, a data lake requires corresponding data management and permission management capabilities.
- Another important reason for the name "lake" is that a data lake needs fine-grained governance; a data lake lacking management and governance will eventually degrade into a "data swamp", in which applications cannot effectively access the data and the stored data loses its value.
The evolution of big data infrastructure actually reflects one point: within an enterprise/organization, it has become a consensus that data is an important asset. To make better use of data, an enterprise/organization needs to manage its data assets effectively and govern them centrally:

- preserve data assets as-is for the long term;
- provide multi-mode computing power to meet processing requirements;
- be business-oriented, providing unified data views, data models, and data processing results.
In addition to the various basic capabilities of big data platforms, data lakes emphasize data management, governance and asset capabilities. In terms of specific implementation, the data lake needs to include a series of data management components, including:
- data access
- data migration
- data governance
- quality management
- asset catalog
- access control
- task management
- task orchestration
- metadata management, etc.
A reference architecture for a data lake system is shown in the following figure. A typical data lake, like a big data platform, has the storage and compute capacity needed to handle hyperscale data and can provide multi-mode data processing capabilities; the enhancement is that the data lake provides far more complete data management capabilities.
Figure 5. Data Lake Component Reference Architecture
It should also be pointed out that the "centralized storage" in the diagram above is centralized more in a business-concept sense; the essential hope is that the data of an enterprise/organization is consolidated in a clear, unified place. In practice, the storage of a data lake should be a distributed file system that can scale on demand, and most data lake practices recommend using distributed systems such as S3/OSS/OBS/HDFS as the unified storage of the data lake.
We can then switch to the data dimension and look at how the data lake handles data from the perspective of the data lifecycle; the entire lifecycle of data in a data lake is shown in Figure 6. In theory, a well-managed data lake retains raw data permanently, while processed data is continuously improved and evolved to meet business needs.

Figure 6. Data lifecycle in a data lake
4. Data lake solutions from various vendors

Data lakes being a hot trend at the moment, the major cloud vendors have launched their own data lake solutions and related products. This section analyzes the data lake solutions of the mainstream vendors and maps them onto the data lake reference architecture to help you understand the strengths and weaknesses of each solution.
4.1 AWS Data Lake Solutions
Figure 7 shows the data lake solution recommended by AWS. The entire solution is based on AWS Lake Formation, which is essentially a management component that works with other AWS services to deliver an enterprise-level data lake. Read from left to right, the figure reflects four steps: data ingress, data precipitation (storage), data computation, and data application. Let's take a closer look at the key points:
1) Data ingress. Data ingress is the starting point of the entire data lake construction, covering both metadata ingress and the inflow of business data. Metadata ingress includes two steps, data source creation and metadata capture, which finally form a data resource catalog and generate the corresponding security settings and access control policies. The solution provides dedicated components that obtain metadata about external data sources, connect to them, detect data formats and schemas, and create the corresponding metadata in the data lake's data resource catalog. The inflow of the business data itself is done through ETL.

As concrete products, AWS abstracts metadata crawling, ETL, and data preparation into a separate product called AWS Glue. AWS Glue shares the same data resource catalog with AWS Lake Formation; the AWS Glue documentation states it clearly: "Each AWS account has one AWS Glue Data Catalog per AWS Region."

Heterogeneous data sources are supported. The AWS data lake solution supports S3, AWS relational databases, and AWS NoSQL databases, and AWS uses components such as Glue, EMR, and Athena to support the free flow of data.
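Metadata capture in this step is typically done with a Glue crawler. Here is a minimal boto3 sketch of the idea; the crawler name, role ARN, database, and S3 path are hypothetical.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Crawl a raw-data prefix in S3 and register the discovered tables in the
# Glue Data Catalog that is shared with Lake Formation.
glue.create_crawler(
    Name="raw-events-crawler",
    Role="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    DatabaseName="datalake_raw",
    Targets={"S3Targets": [{"Path": "s3://example-data-lake/raw/events/"}]},
)
glue.start_crawler(Name="raw-events-crawler")

# Once the crawl finishes, the inferred table schemas can be listed from the catalog.
tables = glue.get_tables(DatabaseName="datalake_raw")
for table in tables["TableList"]:
    columns = [c["Name"] for c in table["StorageDescriptor"]["Columns"]]
    print(table["Name"], columns)
```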
2) Data precipitation (storage). Amazon S3 is used as the centralized storage of the entire data lake, scaling on demand and billed pay-as-you-go.
3) Data computation. The entire solution uses AWS Glue for basic data processing. Glue's basic computation form is ETL tasks in various batch modes, and tasks can be started in three ways: manual trigger, scheduled trigger, and event trigger. It must be said that AWS's services integrate very well within their ecosystem: in event-trigger mode, AWS Lambda can be used for extended development to trigger one or more tasks at the same time, which greatly improves how flexibly task triggering can be customized; meanwhile, all ETL tasks can be monitored well through CloudWatch.
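As a sketch of the event-trigger pattern described above, a Lambda function can start a Glue job when, for example, a new object lands in S3; the Glue job name and its argument key below are hypothetical.

```python
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    """Triggered by an S3 event; starts a Glue ETL job for each new object."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        glue.start_job_run(
            JobName="clean-raw-events",                        # hypothetical Glue job
            Arguments={"--input_path": f"s3://{bucket}/{key}"},
        )
    return {"status": "ok"}
```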
4) Data application. Besides the basic batch computing mode, AWS provides rich compute-mode support through various external compute engines: for example, Athena/Redshift provide SQL-based interactive batch processing capabilities, while EMR provides various Spark-based capabilities, including the streaming computing and machine learning capabilities that Spark can offer.
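On the interactive SQL side, a query against lake tables registered in the catalog can be submitted to Athena roughly as follows; the database, table, and results bucket are hypothetical.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString=(
        "SELECT event_type, COUNT(*) AS cnt "
        "FROM events "
        "GROUP BY event_type"
    ),
    QueryExecutionContext={"Database": "datalake_raw"},             # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print("query id:", response["QueryExecutionId"])
```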
5) Permission management. AWS's data lake solution provides relatively complete permission management through Lake Formation, down to "database-table-column" granularity. There is one exception, though: when Glue accesses Lake Formation, the granularity is only "database-table". From another angle this shows that Glue and Lake Formation are integrated more tightly, and Glue has broader access to the data in Lake Formation.

Lake Formation permissions can be further divided into data catalog access permissions and underlying data access permissions, corresponding to metadata and the actual stored data respectively. Permissions on the actual stored data are in turn divided into data access permissions and data storage (location) permissions. Data access permissions are similar to database permissions on tables, while data storage permissions further refine access to specific directories in S3 (in two forms, explicit and implicit). As shown in Figure 8, if user A has only data access permissions, they cannot create a table in the specified S3 bucket.
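A table-level grant of the kind discussed above can be expressed through the Lake Formation API; below is a minimal boto3 sketch in which the principal ARN, database, and table names are hypothetical.

```python
import boto3

lakeformation = boto3.client("lakeformation", region_name="us-east-1")

# Grant an analyst role SELECT on one catalog table, without granting any
# data-storage (location) permission on the underlying S3 path.
lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"
    },
    Resource={"Table": {"DatabaseName": "datalake_raw", "Name": "events"}},
    Permissions=["SELECT"],
)
```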
Personally, I think this further reflects that data lakes need to support a variety of storage engines. In the future, a data lake may not have only S3/OSS/OBS/HDFS as core storage; depending on the access needs of applications, it may incorporate more types of storage engines, for example S3 for raw data, NoSQL storage for processed data accessed in "key-value" mode, and an OLAP engine for data behind real-time reports or ad hoc queries. While much of the current material emphasizes the difference between data lakes and data warehouses, in essence the data lake should be a concrete realization of a unified approach to data management, and "lakehouse integration" is likely to be a development trend in the future.
Figure 8. Lake Formation data access permissions vs. data storage permissions
In summary, the AWS data lake solution is highly mature, especially in metadata management and permission management; it links heterogeneous data sources and the various compute engines upstream and downstream so that data can "move" freely. In stream computing and machine learning, AWS's solution is also fairly complete. For stream computing, AWS offers the dedicated Kinesis family: the Kinesis Data Firehose service creates a fully managed data delivery service, and data processed in real time by Kinesis Data Streams can easily be written to S3 via Firehose, with format conversion supported, such as converting JSON to Parquet.

What is best about the overall AWS solution is that Kinesis can also access the metadata in Glue, which fully reflects the ecosystem completeness of the AWS data lake solution. Similarly, for machine learning AWS offers the SageMaker service, which can read training data from S3 and write trained models back to S3. Note, however, that in AWS's data lake solution, stream computing and machine learning are not fixed parts of the bundle; they are compute-capability extensions that can be integrated easily.
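To give a feel for the Kinesis side, records can be pushed to a Firehose delivery stream that has been configured (via the console or API) to convert incoming JSON to Parquet and land it in S3; the stream name and payload below are hypothetical.

```python
import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

click_event = {"user_id": "u-001", "event_type": "click", "ts": "2020-06-01T12:00:00Z"}

# The delivery stream (hypothetical name) is assumed to already be configured
# with record format conversion to Parquet and S3 as its destination.
firehose.put_record(
    DeliveryStreamName="events-to-datalake",
    Record={"Data": (json.dumps(click_event) + "\n").encode("utf-8")},
)
```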
Finally, let's go back to the data lake component reference architecture in Figure 5 and look at the component coverage of AWS's data lake solution, shown in Figure 9.
Figure 9. The AWS data lake solution mapped onto the reference architecture
In summary, AWS's data lake solution covers all functions except quality management and data governance. In fact, quality management and data governance are strongly tied to an enterprise's organizational structure and business type and require a lot of custom development, so it is understandable that a general-purpose solution does not include them. There are good open-source projects supporting this area, such as Apache Griffin; if you have strong demands for quality management and data governance, you can customize and develop on that basis.
4.2 Huawei Data Lake Solution
Figure 10. Huawei data lake solution
Information about Huawei's data lake solution comes from Huawei's official website. The related products currently visible there include Data Lake Insight (DLI) and the intelligent data lake operation platform DAYU. DLI is roughly equivalent to the combination of AWS's Lake Formation, Glue, Athena, and EMR (Flink & Spark). I did not find an overall architecture diagram of DLI on the official website, so I drew one according to my own understanding, mainly to allow a comparison with the AWS solution and keeping the form as consistent as possible; if anyone knows Huawei DLI well, please do not hesitate to correct me.

Huawei's data lake solution is complete: DLI carries all the core functions of data lake construction, data processing, data management, and data application. DLI's biggest feature is the completeness of its analysis engines, which include SQL-based interactive analysis and a Spark+Flink-based unified stream/batch processing engine. For the core storage engine, DLI is still backed by the built-in OBS, which is basically on par with AWS S3. Compared with AWS, Huawei's data lake solution is relatively complete in its upstream and downstream ecosystem, supporting almost all of the data source services currently available on HUAWEI CLOUD as external data sources.
DLI can be connected with Huawei's CDM (Cloud Data Migration) and DIS (Data Ingestion Service):

- With DIS, DLI can define various data ingestion points, which can be used in Flink jobs as sources or sinks;
- With CDM, DLI can even access data from IDCs and third-party cloud services.
HUAWEI CLOUD provides the DAYU platform to better support advanced data lake functions such as data integration, data development, data governance, and quality management. The DAYU platform is the implementation of Huawei’s data lake governance and operation methodology. DAYU covers the core processes of the entire data lake governance and provides corresponding tools to support them; Even in Huawei’s official documentation, suggestions for building a data governance organization are given. Figure 11 shows the implementation of DAYU’s data governance methodology (from the official website of HUAWEI CLOUD).
Figure 11. DAYU data governance methodology and process

It can be seen that, in essence, DAYU's data governance methodology is an extension of traditional data warehouse governance methodology onto the data lake infrastructure: from the data-model perspective it still includes a source layer, a multi-source integration layer, and a detail data layer, exactly as in a data warehouse. Quality rules and transformation models are generated from the data model and the metric model, and DAYU connects with DLI, directly calling the data processing services DLI provides to complete data governance.
HUAWEI CLOUD's overall data lake solution covers the data processing lifecycle, explicitly supports data governance, and provides governance process tools based on models and metrics; it is gradually evolving towards "lakehouse integration".
4.3 Alibaba Cloud Data Lake Solution
Alibaba Cloud has many data-related products. Since I currently work in the database BU, this solution focuses on how to build a data lake using database BU products, and other cloud products are only touched on briefly. Alibaba Cloud's database-based data lake solution is more focused, centering on two scenarios: data lake analytics and federated analytics. Figure 12 shows the Alibaba Cloud data lake solution.
Figure 12. Alibaba Cloud data lake solution

The entire Alibaba Cloud data lake solution still uses OSS as the centralized storage of the data lake. In terms of data source support, all Alibaba Cloud databases are currently supported, including OLTP, OLAP, and NoSQL databases. The core points are as follows:
- Data resource catalog. DLA provides a Meta data catalog component for unified management of the data assets in the data lake, whether the data is "in the lake" or "outside the lake". The Meta data catalog is also the unified metadata entry point for federated analytics.
- In terms of built-in compute engines, DLA provides two, a SQL engine and a Spark engine. Both are deeply integrated with the Meta data catalog and can easily obtain metadata. Based on Spark's capabilities, the DLA solution supports compute modes such as batch processing, stream computing, and machine learning.
- In terms of the surrounding ecosystem, besides supporting access to and aggregation of various heterogeneous data sources, DLA is deeply integrated with the cloud-native data warehouse (formerly ADB) for external access. On the one hand, DLA's processing results can be pushed to ADB in time to serve real-time, interactive, and ad hoc complex queries; on the other hand, data in ADB can easily flow back to OSS via the external-table capability. Based on DLA, the various heterogeneous data sources on Alibaba Cloud can be fully connected and data can flow freely.
- In terms of data integration and development, Alibaba Cloud's data lake solution offers two options, DataWorks or DMS; either provides visual process orchestration, task scheduling, and task management. For data lifecycle management, DataWorks' data map capability is relatively more mature.
- In terms of data management and data security, DMS provides strong capabilities. DMS manages data at "database-table-column-row" granularity, fully supporting enterprise-grade data security and control requirements. Beyond permission management, DMS goes further by extending the original database-oriented DevOps concept to the data lake, making data lake operations and development more fine-grained.
The data application architecture of the entire data lake scenario can be further refined, as shown in the following figure.

Figure 13. Data application architecture of the data lake scenario

Looking at the data flow from left to right, data producers generate various types of data (on-premises, on the cloud, or on other clouds) and upload them to various general/standard data sources, including OSS/HDFS/DB, etc. For these data sources, DLA provides complete "into the lake" operations through data discovery, data access, and data migration.

For data that has entered the lake, DLA provides SQL- and Spark-based data processing capabilities, plus visual data integration and data development capabilities via DataWorks/DMS. For serving external applications, DLA provides a standardized JDBC interface that can be connected directly to reporting tools and dashboards. A characteristic of Alibaba Cloud DLA is that it is backed by the entire Alibaba Cloud database ecosystem, including OLTP, OLAP, and NoSQL databases, and provides SQL-based data processing, so the migration cost is relatively low and the learning curve is relatively smooth for traditional, database-centered enterprise development stacks.
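Because DLA is accessed through a standard JDBC interface, an application or reporting tool can query lake data with an ordinary SQL client. The sketch below uses pymysql and assumes, hypothetically, a MySQL-protocol-compatible DLA endpoint, a schema named "adlog" mapped onto OSS files, and a "click_events" table; adjust to real connection details.

```python
import pymysql

# Hypothetical DLA endpoint and credentials; the schema "adlog" is assumed to
# be a DLA schema whose tables are mapped onto files stored in OSS.
conn = pymysql.connect(
    host="example.datalakeanalytics.aliyuncs.com",
    port=10000,
    user="dla_user",
    password="********",
    database="adlog",
)

try:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT channel, COUNT(*) AS clicks "
            "FROM click_events "
            "WHERE dt = '2020-06-01' "
            "GROUP BY channel"
        )
        for channel, clicks in cur.fetchall():
            print(channel, clicks)
finally:
    conn.close()
```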
Another feature of Alibaba Cloud's DLA solution is "cloud-native lakehouse integration". The traditional enterprise data warehouse remains irreplaceable for reporting applications in the big data era, but it cannot meet the more flexible analysis and processing requirements of this era.

Therefore, we recommend that the data warehouse exist as an upper-layer application of the data lake: the data lake is the only official storage location for an enterprise/organization's original business data; driven by the needs of business applications, the data lake processes the raw data into reusable intermediate results; and when an intermediate result's schema is relatively fixed, DLA can push it into the data warehouse, on which the enterprise/organization runs its warehouse-based business applications. Alibaba Cloud provides not only DLA but also the cloud-native data warehouse (formerly ADB), and DLA is deeply integrated with it in the following two respects:

- Both have built-in access support for OSS. OSS serves directly as DLA's native storage, and ADB can easily access structured data on OSS through external tables. With external tables, data can flow freely between DLA and ADB, achieving a true lakehouse integration.
- Both use the same SQL parsing engine. DLA's SQL is fully syntax-compatible with ADB's SQL, which means developers can use a single technology stack to develop data lake applications and data warehouse applications at the same time.
The combination of DLA and ADB truly achieves cloud-native lakehouse integration (what exactly "cloud native" is falls outside the scope of this article). Essentially, DLA can be seen as a data warehouse staging (source) layer with expanded capabilities. Compared with the staging layer of a traditional data warehouse, DLA:

- can store all kinds of structured, semi-structured, and unstructured data;
- can connect with various heterogeneous data sources;
- has metadata discovery, management, and synchronization capabilities;
- has built-in SQL/Spark compute engines with stronger data processing capabilities, meeting diversified data processing needs;
- has full lifecycle management capabilities over the full data set.

The lakehouse integration solution based on DLA+ADB covers the processing capacity of "big data platform + data warehouse" at the same time.
Another important capability of DLA is building a "well-connected" data circulation system and exposing it with a database-like experience, whether the data is on or off the cloud, inside or outside the organization. With the data lake, barriers between systems disappear and data can flow in and out freely; what's more, this flow is regulated, and the data lake keeps a complete record of how data flows.
4.4 Azure Data Lake Solution
Azure's data lake solution includes data lake storage, an interface layer, resource scheduling, and a compute engine layer, as shown in Figure 15 (from the Azure official website). The storage layer is built on Azure object storage and still supports structured, semi-structured, and unstructured data.

The interface layer is WebHDFS; notably, the HDFS interface is implemented on Azure object storage, which Azure calls "multi-protocol access on Data Lake Storage". Resource scheduling is based on YARN. For compute engines, Azure provides multiple processing engines such as U-SQL, Hadoop, and Spark.
Figure 15. Azure Data Lake analytics architecture
What is special about Azure is the development support it offers customers through Visual Studio:

- Development tool support, deeply integrated with Visual Studio. Azure recommends U-SQL as the development language for data lake analytics applications. Visual Studio provides a complete development environment for U-SQL; and, to reduce the complexity of developing a distributed data lake system, Visual Studio organizes work at the project level: for U-SQL development you can create a "U-SQL database project", in which Visual Studio makes coding and debugging easy and also provides wizards to publish the developed U-SQL scripts to production. U-SQL supports Python and R extensions to meet custom development needs.
- Adaptation to multiple compute engines: SQL, Apache Hadoop, and Apache Spark. Hadoop here includes Azure's HDInsight (Azure-hosted Hadoop service), and Spark includes Azure Databricks.
- Automatic conversion between tasks on different engines. Microsoft recommends U-SQL as the default development tool for the data lake and provides conversion tools that support conversion between U-SQL scripts and Hive, Spark (HDInsight & Databricks), and Azure Data Factory data flows.
4.5 Summary
This section discusses data lake solutions as a whole and does not drill into any single product from any cloud vendor. We have made a brief summary, along the lines of the following table, from the aspects of data access, data storage, data computing, data management, and the application ecosystem.

For reasons of space, we did not cover Google and Tencent, which also have well-known data lake offerings; judging from their official websites, their data lake material is relatively simple and mostly conceptual, with the recommended landing solution being "oss + hadoop (EMR)".
In fact, a data lake should not be viewed simply as a technology platform. There are various ways to implement a data lake; the key to evaluating the maturity of a data lake solution is the data management capabilities it provides, including but not limited to metadata, the data asset catalog, data sources, data processing tasks, data lifecycle, data governance, and permission management, as well as its ability to connect with the surrounding ecosystem.
5. Typical data lake application scenarios

5.1 Advertising Data Analytics
In recent years, the cost of acquiring traffic has risen steadily, and the exponential increase in the cost of acquiring customers through online channels poses severe challenges across industries. Against the backdrop of rising Internet advertising costs, the main strategy of simply spending money to buy traffic and acquire new users is bound to become unworkable. Optimizing the front end of the traffic funnel has largely run its course; using data tools to improve conversion after traffic arrives on site, and operating every aspect of advertising placement in a refined way, is a more direct and effective way to change the status quo. Ultimately, improving the conversion rate of advertising traffic depends on big data analytics.
To provide more support for decision-making, it is necessary to collect and analyze more tracking ("buried point") data, including but not limited to channels, delivery time, and target audiences, and to use click-through rate as the analysis metric, so as to give better and faster solutions and suggestions and achieve high efficiency and high output. Therefore, facing the requirements of collecting, storing, and analyzing multi-dimensional, multimedia, multi-placement structured, semi-structured, and unstructured data in the advertising field and turning it into decision suggestions, data lake analytics solutions have been warmly favored by advertisers and publishers when selecting a new generation of technology.
DG is a world-leading international intelligent marketing service provider that, based on advanced advertising technology, big data, and operational capabilities, provides customers with global high-quality user acquisition and traffic monetization services. DG decided from the very beginning to build its IT infrastructure on the public cloud, initially choosing the AWS cloud platform, storing its advertising data in S3 as a data lake and doing interactive analysis through Athena. However, with the rapid development of Internet advertising, the industry faces several major challenges, and a mobile advertising publishing and tracking system must solve several key problems:
- Concurrency and peaks. In the advertising industry, traffic peaks occur frequently, and clicks in an instant may reach tens of thousands or even hundreds of thousands, which requires the system to scale very well so it can respond and process quickly.
- Real-time analysis of massive data for every click. To monitor advertising effectiveness, the system needs to analyze every click and activation event of users in real time and transmit the relevant data to downstream media.
- Dramatic growth of the platform's data volume. Daily business log data is continuously generated and uploaded, and exposure, click, and push data are continuously processed, with roughly 10-50 TB of new data added every day, which places high demands on the entire data processing system: how to efficiently complete offline/near-real-time statistics on advertising data and aggregate and analyze it according to the dimensions advertisers require.
Facing the three business challenges above, and with the customer's daily incremental data growing ever larger (the daily data scan volume now reaches 100+ TB), DG kept running into the bandwidth bottleneck of Athena reading S3 data on the AWS platform: the lag of data analysis grew longer and longer, and the investment needed to cope with data and analysis growth rose sharply. In the end, DG decided to move from the AWS cloud platform to the Alibaba Cloud platform. The new architecture is shown below:
Figure 16. The revamped advertising data lake solution architecture

After the move from AWS to Alibaba Cloud, we designed the "Data Lake Analytics + OSS" architecture for the customer to provide extreme analysis capability across business peaks and valleys. On the one hand, it easily handles ad hoc analysis from brand customers; on the other hand, using the powerful compute of Data Lake Analytics, it analyzes monthly and quarterly ad placement, accurately calculating how many campaigns a brand runs and breaking down each campaign's effect by media, market, channel, and DMP, further enhancing the sales conversion that the Jiahe intelligent traffic platform brings to brand marketing.
In terms of the total cost of ownership of advertising and analysis, the serverless elastic service provided by Data Lake Analytics is charged on demand, without the need to purchase fixed resources, which fully meets the resource fluctuations caused by business tides, meets the elastic analysis needs, and greatly reduces the operation and maintenance costs and usage costs.
Figure 17. Data lake deployment

In general, after switching from AWS to Alibaba Cloud, DG greatly reduced hardware, labor, and development costs. Because it uses the DLA serverless cloud service, DG does not need to invest heavily up front in servers, storage, and other hardware, nor purchase large amounts of cloud services at once; the scale of its infrastructure is entirely on demand, scaling up when demand is high and down when demand decreases, improving capital efficiency.
The second significant benefit of using the Alibaba Cloud platform is improved performance. During DG's rapid business growth and the subsequent onboarding of multiple business lines, traffic in its mobile advertising system often grew explosively, and the original AWS solution hit a huge data-read bandwidth bottleneck when Athena read data from S3, with analysis taking longer and longer. The Alibaba Cloud DLA and OSS teams carried out extensive optimization and adaptation; at the same time, DLA's analysis compute engine (which shares its compute engine with AnalyticDB, a leader on the TPC-DS benchmark) is dozens of times more powerful than the native Presto compute engine, which also greatly improved analysis performance for DG.
5.2 Game Operations Analytics
A data lake is a type of big data infrastructure with excellent TCO. For many fast-growing game companies, a hit game often grows rapidly in a short period; at the same time, the technology stack of the company's R&D staff is hard to scale to match the data volume and growth rate in the short term, making it difficult to use the explosively growing data effectively. Data lakes are a technology of choice for solving this kind of problem.
YJ is a fast-growing game company that hopes to rely on relevant user behavior data for in-depth analysis to guide game development and operation. The core logic behind data analysis is that with the expansion of market competition in the game industry, players have higher and higher requirements for quality, and the life cycle of game projects is getting shorter and shorter, which directly affects the input-output ratio of projects.
With the rising cost of traffic, building an economical and efficient refined data operation system to better support business development has become increasingly important. A data operation system needs supporting infrastructure, and how to choose that infrastructure is a question the company's technical decision-makers need to think through. The starting points include:

- It must be elastic enough. Games often explode in the short term, with data volume surging; whether the infrastructure can adapt to this explosive growth and meet elastic demand is a key consideration, and both compute and storage need to be elastic.
- It must be cost-effective enough. User behavior data often needs to be analyzed and compared over long periods, for example retention; in many cases 90-day or even 180-day retention must be considered. How to store massive data long-term in the most cost-effective way is therefore a key question.
- It must have analytical capabilities and be extensible. In many cases user behavior is reflected in tracking data, which in turn needs to be joined with structured data such as user registration, login, and billing information. Therefore, on the analysis side it needs at least big-data ETL capabilities, access to heterogeneous data sources, and modeling capabilities for complex analysis.
- It should match the company's existing technology stack and make future recruitment easier. For YJ, an important factor in technology selection is the technology stack of its technical staff: most of YJ's team is only familiar with traditional database development, i.e., MySQL; moreover, staffing is short, with only one engineer doing data operation analysis, and there is no capacity to independently build a big data analytics infrastructure in the short term. From YJ's point of view, it is best if the vast majority of analysis can be done via SQL, and in the job market SQL developers far outnumber big data development engineers.

Based on the customer's situation, we helped the customer transform its existing solution.
Figure 18. Architecture before the transformation

Before the transformation, all of the customer's structured data sat in a high-spec MySQL instance, while player behavior data was collected into Log Service (SLS) through LogTail and then delivered from Log Service to OSS and ES. The problems with this architecture were:
- Behavioral data and structured data are completely separated and cannot be analyzed together;
- The behavioral data only has retrieval available; deep mining and analysis cannot be done on it;
- OSS is used merely as a data storage resource, and the value of the data in it is not sufficiently tapped.
In fact, our analysis showed that the customer's existing architecture already had the prototype of a data lake: all the data was already stored in OSS, and what was needed was to further add the ability to analyze the data in OSS. Moreover, the SQL-based data processing mode of the data lake matched the customer's development technology stack. In summary, we made the following adjustments to the customer's architecture to help them build a data lake.
Figure 19. Data lake solution after the transformation

In general, the data lake transformation did not change the customer's data link flow; rather, we added the DLA component on top of OSS to do secondary processing of the OSS data. DLA provides a standard SQL compute engine and supports access to various heterogeneous data sources; data processed by DLA on top of OSS becomes directly usable by the business. However, DLA cannot support interactive analysis scenarios with low latency requirements; to solve this, we introduced the cloud-native data warehouse ADB to address the latency of interactive analysis. At the same time, QuickBI was introduced at the very front as the customer's visual analysis tool. The YJ solution is a classic implementation of the lakehouse integration solution for the game industry shown in Figure 14.
5.3 SaaS Data Intelligence Services

YM is a data intelligence service provider offering a range of data analysis and operation services to all kinds of small and medium-sized merchants. The technical logic of the concrete implementation is shown in the figure below.
Figure 20. The YM data intelligence service SaaS model

The platform provides a multi-terminal SDK for users (merchants offer multiple access forms such as web pages, apps, and mini programs) to collect all kinds of tracking data, and the platform provides unified data access services and data analysis services in SaaS form. Merchants can use the various data analysis services to do finer-grained tracking-data analysis and complete basic analysis functions such as behavior statistics, customer profiling, customer selection, and advertising monitoring. However, this SaaS model has certain problems:
- Because merchant types and needs are diverse, it is difficult for the platform's SaaS analysis functions to cover all types of merchants, and they cannot meet merchants' customized needs; for example, some merchants focus on sales, some on customer operations, and some on cost optimization, and it is hard to satisfy all of these needs.
- Some advanced analysis functions, such as customer segmentation and customer-defined extensions that rely on custom labels, cannot be satisfied by the unified data analysis service; in particular, some custom labels depend on merchant-defined algorithms, so customers' advanced analysis needs cannot be met.
- Merchants have asset-management requirements for their data. In the big data era, it has become a consensus that data is an enterprise/organization asset; how to let the data that belongs to merchants accumulate reasonably and over the long term is also something a SaaS service needs to consider.
In summary, we introduced the data lake mode on top of the basic model in the figure above, so that the data lake becomes the merchants' supporting infrastructure for accumulating data, outputting models, and analyzing operations. The SaaS data intelligence service model after introducing the data lake is as follows.
Figure 21. Data intelligence service based on the data lake

As shown in Figure 21, the platform provides each user with a one-click lake-building service, which merchants use to build their own data lakes; in addition, all tracking data belonging to a merchant is fully synchronized into its data lake, and daily incremental data is archived into the lake on a "T+1" basis. On top of the traditional data analysis services, the data-lake-based service model gives users three capabilities:
- Analytical modeling capability. The data lake contains not only raw data but also the schema of the tracking data. Besides the raw data as an asset, the data lake also outputs a data model; with the help of the tracking data model, merchants can gain a deeper understanding of the user behavior logic behind the data, better understand customer behavior, and capture user needs.
- Service customization capability. Using the data integration and data development capabilities provided by the data lake, and based on their understanding of the tracking data model, merchants can customize data processing flows, iteratively process the raw data, extract valuable information from it, and ultimately obtain value beyond the original data analysis services.
Personally, I believe the data lake is a more complete big data processing infrastructure than the traditional big data platform: the technologies refined in the data lake bring it closer to the customer's business. All the capabilities a data lake has beyond a big data platform, such as metadata, the data asset catalog, permission management, data lifecycle management, data integration and development, data governance, and quality management, exist to get closer to the business and serve customers better. Likewise, the basic technical characteristics the data lake emphasizes, such as elasticity, independent scaling of storage and compute, a unified storage engine, and multi-mode compute engines, exist to meet business needs and provide the business with the best TCO.
6. Basic process of data lake construction

The process of building a data lake should be closely integrated with the business; however, it should differ from building a traditional data warehouse, or even the currently popular data middle platform. The difference is that a data lake should be built in a more agile way, building and governing at the same time. To better understand the agility of data lake construction, let's first look at how traditional data warehouses are built. For traditional data warehouse construction, the industry has proposed two models, "top-down" and "bottom-up", proposed by Inmon and Kimball respectively. The specific processes will not be described in detail here, otherwise hundreds of pages could be written; only the basic ideas are briefly explained.
- Inmon proposes a top-down (EDW-DM) construction model: data from operational or transactional source systems is extracted, transformed, and loaded through ETL into the ODS layer of the data warehouse; the ODS data is then processed according to a pre-designed EDW (Enterprise Data Warehouse) schema and loaded into the EDW. The EDW is generally a normalized, enterprise-wide data model, which is not convenient for upper-layer applications to analyze directly, so each business unit further derives its own data mart layer (DM) from the EDW according to its needs.
Advantages: easy to maintain, highly integrated. Disadvantages: once the structure is fixed it is not flexible, and the deployment cycle needed to adapt to the business is long. A data warehouse built this way suits mature, stable businesses such as finance.
- Kimball proposes a bottom-up (DM-DW) architecture: data from operational or transactional source systems is extracted or loaded into the ODS layer; then, based on the ODS data, dimensional modeling is used to build multi-dimensional, subject-oriented data marts (DM). The individual DMs are linked through conformed dimensions and together ultimately form the enterprise/organization-wide data warehouse.
Advantages: fast to build, quickest to show ROI, agile and flexible. Disadvantages: as an enterprise resource it is hard to maintain, the structure becomes complex, and integrating the data marts is difficult. This approach is common in small and medium-sized enterprises and in the Internet industry (a minimal sketch of such a dimensional mart follows this list).
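As an illustration of the dimensional (Kimball-style) approach, here is a minimal sketch of a star-schema data mart expressed as Spark SQL DDL; the database, table, and column names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dm_star_schema").getOrCreate()

spark.sql("CREATE DATABASE IF NOT EXISTS dm")

# Conformed dimension that several marts can share.
spark.sql("""
CREATE TABLE IF NOT EXISTS dm.dim_customer (
  customer_id BIGINT,
  customer_name STRING,
  region STRING
) USING parquet
""")

# Fact table of the order mart, keyed by the conformed dimension.
spark.sql("""
CREATE TABLE IF NOT EXISTS dm.fact_order (
  order_id BIGINT,
  customer_id BIGINT,
  order_date DATE,
  amount DECIMAL(18, 2)
) USING parquet
""")

# A typical mart query: revenue by region and day.
spark.sql("""
SELECT d.region, f.order_date, SUM(f.amount) AS revenue
FROM dm.fact_order f
JOIN dm.dim_customer d ON f.customer_id = d.customer_id
GROUP BY d.region, f.order_date
""").show()
```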
In fact, whether the EDW or the DMs are built first, neither approach can do without surveying the data and designing the data model before the warehouse is built; even the currently popular "data middle platform" cannot escape the basic construction process shown in the figure below.
Figure 22: Basic construction process of a data warehouse / data middle platform
- Data prospecting. For an enterprise/organization, the initial task is a comprehensive survey and investigation of the data within the enterprise/organization, including data sources, data types, data forms, data schemas, total data volume, and data increments. An implicit but important task at this stage is to use this survey to further sort out the enterprise's organizational structure and clarify the relationship between data and the organization, laying the foundation for later defining user roles, permission design, and service modes.
- Model abstraction. According to the business characteristics of the enterprise/organization, sort and classify the data, divide it into subject domains, form the metadata used for data management, and build a unified enterprise data model on top of that metadata.
- Data access. Based on the results of the first step, determine the data sources to be accessed; based on the data sources, determine the required data access capabilities and complete the technology selection. The accessed data must include at least the data source metadata, the raw data metadata, and the raw data itself. All data is classified and stored according to the results of the second step.
- Converged governance. Simply put, use the available computing engines to process the data, produce intermediate and result data, and manage and store them properly. The platform should have complete data development, task management, and task scheduling capabilities, and record the processing lineage in detail. During governance, further data models and metric models will be developed.
- Business support. On top of the unified models, each business unit customizes its own detailed data models, data usage processes, and data access services.
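To make the data prospecting step concrete, here is a minimal sketch that inventories tables, row counts, and on-disk sizes from a MySQL source via its information_schema; the connection details are hypothetical, and pymysql is just one possible client.

```python
import pymysql

# Hypothetical connection to one of the operational source databases.
conn = pymysql.connect(host="mysql.internal", user="survey", password="***",
                       database="information_schema")

with conn.cursor() as cur:
    # Approximate row counts and sizes per table: a quick first pass at
    # "total data volume" during data prospecting.
    cur.execute("""
        SELECT table_schema, table_name, table_rows,
               ROUND((data_length + index_length) / 1024 / 1024, 1) AS size_mb
        FROM tables
        WHERE table_schema NOT IN ('mysql', 'sys', 'performance_schema', 'information_schema')
        ORDER BY size_mb DESC
    """)
    for schema, table, rows, size_mb in cur.fetchall():
        print(f"{schema}.{table}: ~{rows} rows, {size_mb} MB")

conn.close()
```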
For a fast-growing Internet enterprise, the above process is too heavy and often cannot be implemented. The most practical problem is the second step, model abstraction: in many cases the business is still in trial-and-error and exploration mode, with no clear sense of where it is headed, so it is impossible to abstract a general data model; and without the data model, none of the subsequent steps can proceed. This is one of the important reasons why many fast-growing enterprises feel that a data warehouse / data middle platform cannot be implemented and cannot meet their needs.
A data lake should be built in a more "agile" way; we recommend the following steps.
Figure 23: Basic process of data lake construction
Compared with Figure 22, there are still five steps, but they are thoroughly simplified and made "implementable".
- Data mapping. It is still necessary to understand the basic facts about the data, including data sources, data types, data forms, data schemas, total data volume, and data increments, but that is all that is needed at this stage. A data lake preserves raw data in full, so there is no need for deep upfront design.
- Technology selection. Based on the data mapping results, determine the technology choices for building the data lake. In practice this step is also fairly simple, because there are well-established industry practices; the three basic principles are "separation of compute and storage", "elasticity", and "independent scaling". For storage, a distributed object storage system (such as S3/OSS/OBS) is recommended. For the compute engine, focus on batch processing and SQL capabilities, because in practice these two are the key to data processing (more on this later). For both compute and storage, serverless forms are recommended first; the architecture can evolve gradually as applications grow, and only when you genuinely need an independent resource pool should you consider building a dedicated cluster. (A minimal sketch of reading raw data directly from object storage follows this list.)
- Data access. Determine the data sources to be accessed, and complete the full extraction and incremental ingestion of the data.
- Application governance. This step is the key to the data lake; I have personally changed "converged governance" to "application governance" here. From the data lake's point of view, data application and data governance should be integrated and inseparable. Start from the data applications, clarify the requirements within the applications, and gradually produce business-usable data through the ETL process; at the same time, form the data models, metric systems, and corresponding quality standards. A data lake emphasizes storing raw data and exploratory analysis and application of that data, but this by no means implies that a data lake needs no data models; on the contrary, understanding and abstracting the business greatly promotes the development and application of the data lake, while data lake technology keeps data processing and modeling highly agile and able to adapt quickly to business change.
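As referenced in the technology selection step above, here is a minimal sketch of the compute-storage-separation pattern: an ephemeral Spark job reads raw files directly from object storage and runs SQL over them. The bucket paths and S3A endpoint configuration are assumptions, not from the article.

```python
from pyspark.sql import SparkSession

# Compute is just a short-lived Spark application; all state lives in object storage.
spark = (SparkSession.builder
         .appName("lake_sql_over_object_storage")
         # Hypothetical endpoint of an S3-compatible object store.
         .config("spark.hadoop.fs.s3a.endpoint", "https://s3.example.com")
         .getOrCreate())

# Read raw event files straight from the lake's raw zone.
raw_events = spark.read.parquet("s3a://merchant-lake/raw/events/")
raw_events.createOrReplaceTempView("raw_events")

# Batch SQL: daily active users per day, written back to a curated zone.
dau = spark.sql("""
    SELECT event_date, COUNT(DISTINCT user_id) AS dau
    FROM raw_events
    GROUP BY event_date
""")
dau.write.mode("overwrite").parquet("s3a://merchant-lake/curated/dau/")
```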
From a technical point of view, a data lake differs from a big data platform in that, to support full lifecycle management and application of data, it needs relatively complete capabilities for data management, category management, process orchestration, task scheduling, data lineage, data governance, quality management, permission management, and so on. In terms of computing, mainstream data lake solutions support both SQL and programmable batch processing (for machine learning, the built-in capabilities of Spark or Flink can be used); in terms of processing paradigm, almost all of them adopt workflows based on directed acyclic graphs (DAGs) and provide corresponding integrated development environments.
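As an illustration of the DAG-based workflow paradigm, here is a minimal sketch using Apache Airflow as the orchestrator; the article does not prescribe any particular scheduler, and the task names and scripts are hypothetical.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# A three-stage daily pipeline expressed as a directed acyclic graph.
with DAG(
    dag_id="lake_daily_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = BashOperator(task_id="ingest_raw",
                          bash_command="spark-submit ingest_raw.py")
    cleanse = BashOperator(task_id="cleanse",
                           bash_command="spark-submit cleanse.py")
    aggregate = BashOperator(task_id="aggregate",
                             bash_command="spark-submit aggregate.py")

    # Dependencies form the DAG: ingest -> cleanse -> aggregate.
    ingest >> cleanse >> aggregate
```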
When it comes to supporting stream computing, data lake solutions take different approaches. Before discussing the specific methods, let's first classify stream computing into two modes:
- Mode 1: real-time mode. This mode processes data record by record or in micro-batches, and is common in online business such as risk control, recommendation, and alerting.
- Mode 2: stream-like mode. This mode needs to read the data that changed after a specified point in time, read a specific version of the data, or read the latest data, and is therefore only stream-like; it is common in data exploration applications, such as analyzing daily active users, retention, and conversion within a given period.
The essential difference between the two is that in Mode 1 the data being processed is usually not yet stored in the data lake but only flowing through the network or memory, whereas in Mode 2 the data is already stored in the data lake. To sum up, I personally recommend the pattern shown in the following figure.
Figure 24: Data flow diagram of the data lake
As shown in Figure 24, when the data lake needs Mode 1 processing capabilities, Kafka-like middleware should be introduced as the infrastructure for data forwarding. A complete data lake solution should be able to deliver raw source data into Kafka, and the streaming engine should be able to read data from Kafka-like components. After processing, the streaming engine can write results to OSS/RDBMS/NoSQL/DW as needed and expose them for access. In a sense, the Mode 1 streaming engine does not have to be an integral part of the data lake; it only needs to be easy to introduce when an application requires it (a minimal sketch follows the two notes below). However, two points should be noted:
- Streaming engine tasks also need to be brought into the data lake's unified task management;
- Streaming tasks still need to be included in unified permission management.
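As referenced above, here is a minimal Mode 1 sketch using Spark Structured Streaming to read from Kafka and write results into the lake. The broker address, topic, and paths are hypothetical, the spark-sql-kafka connector is assumed to be on the classpath, and Structured Streaming is only one possible engine (a Flink job would serve equally well).

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("mode1_stream_ingest").getOrCreate()

# Read the raw event stream from a Kafka-like component.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka.internal:9092")  # hypothetical broker
          .option("subscribe", "merchant_events")                    # hypothetical topic
          .load())

# Kafka delivers key/value as binary; keep the value as a string payload here.
payload = events.select(F.col("value").cast("string").alias("raw_json"),
                        F.col("timestamp"))

# Write the processed stream into the lake (could equally target RDBMS/NoSQL/DW).
query = (payload.writeStream
         .format("parquet")
         .option("path", "s3a://merchant-lake/streaming/events/")
         .option("checkpointLocation", "s3a://merchant-lake/checkpoints/events/")
         .outputMode("append")
         .start())

query.awaitTermination()
```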
Mode 2 is essentially closer to batch processing. Many classic big data components now provide support for it, such as Hudi, Iceberg, and Delta Lake, all of which work with Spark, Presto, and other classic computing engines. Taking Hudi as an example, it supports special table types (Copy-on-Write and Merge-on-Read) that provide access to snapshot data (a specified version), incremental data, and near-real-time data. At present, AWS, Tencent, and others have integrated Hudi into their EMR services, and Alibaba Cloud's DLA is also planning to launch DLA-on-Hudi capabilities.
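Here is a minimal sketch of Mode 2 access with PySpark reading a Hudi table, using Hudi's documented snapshot and incremental query options; the Hudi Spark bundle is assumed to be on the classpath, and the table path and instant time are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi_incremental_read").getOrCreate()

hudi_table_path = "s3a://merchant-lake/hudi/user_events"  # hypothetical table path

# Snapshot query: read the latest committed view of the table.
snapshot_df = spark.read.format("hudi").load(hudi_table_path)

# Incremental query: only read records committed after a given instant time.
incremental_df = (spark.read.format("hudi")
                  .option("hoodie.datasource.query.type", "incremental")
                  .option("hoodie.datasource.read.begin.instanttime", "20240101000000")
                  .load(hudi_table_path))

incremental_df.createOrReplaceTempView("events_increment")
spark.sql("SELECT COUNT(*) AS new_rows FROM events_increment").show()
```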
Going back to the first chapter of this article: we said that the main users of data lakes are data scientists and data analysts, for whom exploratory analytics and machine learning are the common operations. Stream computing in real-time mode is mostly used for online business and, strictly speaking, is not a hard requirement of the data lake's target users. However, real-time stream computing is an important part of the online business of most Internet companies, so the data lake, as the centralized data repository within an enterprise/organization, needs to keep enough architectural extensibility to be easily extended and integrated with stream computing capabilities.
The last step is business support. Although most data lake solutions provide standard access interfaces such as JDBC, and popular BI reporting and dashboard tools can access data in the lake directly, in practice we still recommend pushing the data processed by the data lake into the dedicated data engines that serve online business, so that applications get a better experience.
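A minimal sketch of this "push to an online engine" pattern: after the lake-side batch job finishes, write the result table to an online MySQL instance through Spark's JDBC sink. The connection details and table names are hypothetical, and the MySQL JDBC driver jar is assumed to be available to Spark.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("push_to_serving_db").getOrCreate()

# Result data produced by the lake-side processing (curated zone).
dau = spark.read.parquet("s3a://merchant-lake/curated/dau/")

# Push the result into the online database that backs the application,
# so end-user queries never hit the lake directly.
(dau.write
    .format("jdbc")
    .option("url", "jdbc:mysql://serving-db.internal:3306/analytics")
    .option("dbtable", "daily_active_users")
    .option("user", "etl_writer")
    .option("password", "***")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .mode("overwrite")
    .save())
```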
As the infrastructure of a new generation of big data analysis and processing, data lakes need to go beyond traditional big data platforms. Personally, I believe the following aspects are likely future directions for data lake solutions.
1. Cloud-native architecture. There are many opinions about what constitutes a cloud-native architecture and no single agreed definition; for the data lake scenario, however, I personally believe it comes down to the following three characteristics:
2. Sufficient data management capabilities. Data lakes need to provide more powerful data management capabilities, including but not limited to data source management, data category management, process orchestration, task scheduling, data lineage, data governance, quality management, and permission management.
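As a small illustration of the metadata and category management side, here is a sketch that walks the Spark catalog to produce a rudimentary data inventory; a real data lake would rely on a dedicated catalog service, so this only shows the idea.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mini_catalog_inventory").getOrCreate()

# Walk every database and table registered in the catalog and dump a
# rudimentary inventory: a starting point for a data asset catalog.
for db in spark.catalog.listDatabases():
    for table in spark.catalog.listTables(db.name):
        cols = spark.catalog.listColumns(table.name, db.name)
        col_desc = ", ".join(f"{c.name}:{c.dataType}" for c in cols)
        print(f"{db.name}.{table.name} ({table.tableType}) -> {col_desc}")
```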
(Source: Alibaba Cloud Database.)