The concept of a data lake was first proposed in 2011 by Dan Woods, CTO and author at CITO Research. The analogy: if data is nature’s water, then water from countless rivers and streams flows, unprocessed, into the data lake. The industry has long held broad and varied understandings and definitions of the data lake.

“A data lake is a platform that centrally stores massive amounts of data of many types and from many sources, and can quickly process and analyze that data; it is essentially an advanced enterprise data architecture.”

The core value of the data lake is to give enterprises a data platform operating mechanism. With the advent of the DT (data technology) era, enterprises urgently need to transform: they need the sharp tools of informatization, digitalization, and new technology to form a platform system that empowers their people and business and responds quickly to challenges. The data foundation for all of this is exactly what a data lake provides.

The following is a set of comics to explain the concept of data lakes more intuitively.

In the past, when there was little data, people could simply keep it in their heads, or at most tie knots in a rope to keep records:

Later, to keep records and work more efficiently, databases appeared. The core of a database is fast insertion, deletion, modification, and query, to handle online transactions.

For example, when you pay with a bank card, the backend database must quickly record the transaction and update your card balance.
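To make that transactional side concrete, here is a minimal sketch of such an OLTP write, using Python’s built-in sqlite3 as a stand-in for the bank’s backend database (the table, card number, and amounts are hypothetical):

```python
import sqlite3

# Stand-in for the bank's backend OLTP database.
conn = sqlite3.connect("bank.db")
conn.execute("CREATE TABLE IF NOT EXISTS accounts (card_no TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT OR IGNORE INTO accounts VALUES ('card-0001', 500.0)")

def spend(card_no: str, amount: float) -> None:
    """Record the purchase and update the balance in one atomic transaction."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE accounts SET balance = balance - ? "
            "WHERE card_no = ? AND balance >= ?",
            (amount, card_no, amount),
        )
        if cur.rowcount == 0:
            raise ValueError("unknown card or insufficient balance")

spend("card-0001", 38.5)
print(conn.execute("SELECT balance FROM accounts").fetchone())  # (461.5,)
```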

Over time, people found more and more data piling up in these databases, and wanted not only to support online business but also to analyze the data for its value. Traditional databases, however, are not suited to this analytical workload, which is characterized by reading large volumes of data rather than frequent, fast, small reads and writes.

So people began to process the data sitting in existing databases. This process is called ETL: Extract, Transform, Load.

After these three steps, the data warehouse is populated. The “warehouse” mainly serves data analysis purposes, such as BI, reporting, business analysis, and so on.
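A toy ETL pass over that kind of source might look like the following Python sketch (the source table, cleaning rules, and output file are hypothetical):

```python
import csv
import sqlite3

# Extract: pull raw rows out of the operational database.
src = sqlite3.connect("bank.db")
rows = src.execute("SELECT card_no, amount_cents, ts FROM transactions").fetchall()

# Transform: clean and reshape for analysis (drop refunds, normalize units,
# derive the reporting day from the ISO timestamp).
cleaned = [
    {"card_no": c, "amount_yuan": round(cents / 100, 2), "day": ts[:10]}
    for (c, cents, ts) in rows
    if cents > 0
]

# Load: write into the warehouse's fact table (a CSV file stands in here).
with open("fact_spend.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["card_no", "amount_yuan", "day"])
    writer.writeheader()
    writer.writerows(cleaned)
```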

To summarize briefly: a database serves online transactions, usually high-frequency reads and writes of small amounts of data.

Raw data from sources such as databases is processed by ETL and loaded into the data warehouse. Data warehouses mainly serve online analytical workloads, usually reads over large volumes of data.

Although their application scenarios differ, both deal with structured data.

For quite some time, the two worked hand in hand to serve the enterprise’s real-time “transactional” business and online “analytical” business.

As times changed, the types of data multiplied and people’s needs around data grew ever more complex.

Enterprises increasingly recognize the value of this “big data” and want to store and use it well.

This data is varied, voluminous, and messy. How do you store it all?

Just dig a big hole!

This is the prototype of a data lake. Put bluntly, a data lake is like a “big puddle”: an architecture that centrally stores all kinds of heterogeneous data.

Why not Data River?

Because data needs to be retained, not flow away like “a river of spring water rolling east.”

Why not Data Pool?

Because a pool is not big enough; big data will not fit.

Why not Data Sea?

Because enterprise data must have boundaries. It can circulate and be exchanged, but privacy and security come first; a boundless open sea will not do.

So, a data lake it is: just right.

However, good as the concept is, using this “puddle” well is not easy.

Data lake characteristics

The data lake itself has the following characteristics:

1. Raw data

Massive raw data is stored centrally without processing. A data lake is typically a single store for all of an organization’s data, including raw copies of source system data, as well as transformed data for tasks such as reporting, visualization, analytics, and machine learning. Data lakes can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs), and binary data (images, audio, video). That is, data lakes bring together different kinds of data.

2. On-demand computing

Users process data on demand, without having to move it elsewhere to compute. Data lakes typically provide a variety of computation engines to choose from; common ones include batch, interactive query, streaming, and machine learning.
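As an illustration, a batch engine such as Spark can be pointed directly at files in the lake, so the computation comes to the data rather than the data being copied out (the paths and column names below are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lake-batch").getOrCreate()

# Query Parquet files where they sit in the lake; no copy into another system.
orders = spark.read.parquet("s3a://corp-lake/raw/orders/")

daily_revenue = (
    orders.groupBy(F.to_date("order_ts").alias("day"))
          .agg(F.sum("amount").alias("revenue"))
          .orderBy("day")
)
daily_revenue.show()
```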

3. Lazy binding

Data lakes provide flexible, task-oriented data binding, with no need to define a data model in advance.

Data lake advantages and disadvantages

There are two sides to everything, and data lakes have their disadvantages too.

1. Advantages

• The data in a data lake is closest to its native form. This is a great convenience for data exploration requirements: the original data can be obtained directly.

• Data lakes unify data from the enterprise’s various business systems and solve the problem of information islands, making data applications that span multiple systems possible.

• Data lakes provide a global, unified, enterprise-level view of data, which is essential for data quality and data security, and beneficial all the way up to overall data governance and even managing data as an asset.

• Data lakes change the way we work, encouraging everyone to understand and analyze data themselves rather than being “spoon-fed” by a dedicated data team. This improves the efficiency of data operations, improves customer interaction, and encourages data innovation.

2. Disadvantages

• The data conspicuously lacks aggregation and processing, which makes it too much like “raw material” for users who want to consume it directly, and highly redundant. This can be addressed through “data access + data processing + data modeling.”

• The data lake places high demands on the performance of its base layer, and data processing must run on high-performance servers. This stems mainly from the sheer volume of data, its heterogeneity and diversity, and the lazy-binding pattern.

• Data processing skills are demanding. This, too, stems mainly from the data being so raw.

Data lakes and related concepts

1. Data lake vs data warehouse

The idea behind data lake construction essentially subverts the traditional data warehouse methodology, which emphasizes integration, subject orientation, layering, and similar ideas. The two are not equivalent concepts; the relationship is one of containment: the data warehouse exists as one kind of “data application” on top of the data lake.

The two can be compared along the following dimensions:

1) Type of data stored

A data warehouse stores cleaned and processed, trustworthy, well-structured data; a data lake stores large amounts of raw data, including structured, semi-structured, and unstructured data. In the real world, data is mostly raw, messy, and unstructured.

As this “messy data” keeps growing, so does the interest in understanding it better, deriving value from it, and making decisions based on it. That calls for a flexible, agile, cost-effective, and relatively simple solution, and these are not the strong points of data warehouses. When new requirements arise, a traditional data warehouse is hard to change quickly.

2) Data processing mode

To load data into a data warehouse, we must first define it; this is called the schema-on-write pattern. With a data lake, you simply load the raw data as-is and give it a definition only when you are ready to use it; this is called the schema-on-read pattern.

These are two very different approaches to data processing. Because a data lake defines the model structure at the moment the data is used, it makes data model definition far more flexible and can satisfy the efficient analysis needs of many different upper-layer services.
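The contrast can be sketched in a few lines of PySpark (paths and fields are hypothetical): schema-on-write fixes the model before loading, while schema-on-read lands raw data untouched and projects a structure onto it at query time.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-demo").getOrCreate()

# Schema-on-write: the model is fixed up front; loads must conform to it.
warehouse_schema = StructType([
    StructField("user_id", StringType(), False),
    StructField("amount", DoubleType(), False),
])
conforming = spark.read.schema(warehouse_schema).json("s3a://corp-lake/staging/payments/")

# Schema-on-read: land the raw JSON as-is, then give it a structure only
# when a particular analysis needs one, at read time.
raw = spark.read.json("s3a://corp-lake/raw/events/")  # schema inferred on read
typed = raw.selectExpr("user_id", "CAST(amount AS DOUBLE) AS amount")
```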

3) Collaboration model

The traditional data warehouse workflow is centralized: business staff bring requirements to the data team, the data team processes the data and develops dimension tables accordingly, and business teams then query them through BI reporting tools.

Data lakes are more open and self-service: the data is open for everyone to use, and the data team mainly provides the tools and environments each business team needs (centralized dimension-table construction is still required), while the business teams do their own development and analysis.

2. Data lake vs big data

In technical implementation, data lakes are closely integrated with big data technology.

Hadoop’s low-cost storage holds the massive raw data, local data, transformed data, and so on. With everything stored in one place, there is a foundation for subsequent management, reprocessing, and analysis.

Low-cost processing engines such as Hive and Spark (compared with an RDBMS) let the big data platform take over data processing. Special computation modes such as streaming can also be supported through Storm and Flink.
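That division of labor might look like the following PySpark sketch over HDFS (the paths, database, and table names are hypothetical):

```python
from pyspark.sql import SparkSession

# HDFS supplies the low-cost storage; Spark supplies the low-cost processing.
spark = (
    SparkSession.builder.appName("lake-on-hadoop")
    .enableHiveSupport()  # reuse Hive's table metadata
    .getOrCreate()
)

# Raw data stays on HDFS; Spark processes it where it lives.
logs = spark.read.text("hdfs:///lake/raw/app-logs/2024/")
errors = logs.filter(logs.value.contains("ERROR"))

# Persist the derived view back into the lake as a Hive table.
errors.write.mode("overwrite").saveAsTable("lake_derived.error_logs")
```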

Thanks to Hadoop’s scalability, full-volume data storage is easy to implement, and combined with data lifecycle management, data can be managed and controlled over its entire time span.

3. Data lake vs cloud computing

Cloud computing uses virtualization, multi-tenancy, and related technologies to maximize the utilization of basic resources such as servers, networks, and storage, lowering enterprises’ IT infrastructure costs and yielding substantial savings. At the same time, cloud computing makes it fast to provision and use hosts, storage, and other resources, which also simplifies management. Cloud computing technology can play a big role in building a data lake’s infrastructure; AWS, Microsoft, EMC, and others all offer data lake services in the cloud.

4. Data lake vs artificial intelligence

In recent years artificial intelligence has again advanced rapidly. Training and inference must process very large datasets, sometimes several at once, and these datasets are usually unstructured data such as video, images, and text, drawn from many industries, organizations, and projects. Collecting, storing, cleaning, transforming, and extracting features from this data is a long chain of complex engineering. A data lake must give AI programs a platform to collect, govern, and analyze data quickly, along with very high bandwidth, massive small-file access, multi-protocol interoperability, and data sharing capabilities, which can greatly accelerate data mining, deep learning, and similar workloads.

5. Data lake vs data governance

Traditionally, data governance has centered on the data warehouse. After building an enterprise-level data lake, the need for governance is actually even stronger: unlike the “model-first” data warehouse, the data in a lake is more dispersed, disordered, and unstandardized, and it must be made “usable” through governance, or the lake is likely to “rot” into a data swamp and waste a great deal of IT resources. Whether a platform-style data lake architecture can drive the business forward hinges on data governance. This is also one of the biggest challenges in building a data lake.

6. Data lake vs data security

Data lakes hold large amounts of raw and processed data, and unregulated access to it is a real risk, so the necessary data security and privacy protections must be considered; providing them is part of what a data lake must do. From another angle, though, centralizing data in a lake is actually good for security: it beats having data scattered all over the enterprise.

Data lake architecture

A data lake is, in essence, a storage architecture. Building on cloud services, an enterprise can quickly “dig” a suitable lake that collects, stores, processes, and governs data; provides data integration and sharing services, high-performance computing capability, and big data analysis algorithms and models; and supports data analysis applications across business management, empowering data applications at scale.

The data lake technical architecture involves ten aspects: data ingestion (movement), data storage, data computing, data application, data governance, metadata, data quality, data resource catalog, data security, and data audit.

1. Data ingestion (movement)

Ingestion uses connectors to pull data from different sources and load it into the data lake. It should support all types of structured, semi-structured, and unstructured data, and multiple modes: batch, real-time, one-time loads, and more. On the access side, it must provide adapters for heterogeneous, multi-source data resources, giving the enterprise data lake a channel for extracting and aggregating data.
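A minimal batch-ingestion sketch, assuming a relational source reachable over JDBC and object storage under the lake (the connection URL, credentials, and table names are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ingest").getOrCreate()

# Extract from one of many possible sources; JDBC here, but the same
# pattern applies to Kafka topics, CSV drops, log shippers, and so on.
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://erp-db:5432/erp")  # hypothetical source
    .option("dbtable", "public.orders")
    .option("user", "ingest")
    .option("password", "***")
    .load()
)

# Land the rows in the raw zone untouched, partitioned by ingestion date.
(
    orders.withColumn("ingest_date", F.current_date())
    .write.mode("append")
    .partitionBy("ingest_date")
    .parquet("s3a://corp-lake/raw/erp/orders/")
)
```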

2. Data storage

Storage should be scalable and cost-effective, allow quick access for data exploration, and support a variety of data formats.
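Open columnar formats such as Parquet are a common choice here because they compress well and let readers scan only the columns they need; a small pyarrow sketch (file and column names are illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A tiny table standing in for one landed batch of lake data.
batch = pa.table({
    "user_id": ["u1", "u2", "u3"],
    "amount":  [12.5, 8.0, 33.1],
})

# Columnar and compressed: cheap to store, fast to scan selectively.
pq.write_table(batch, "events.parquet", compression="zstd")

# An analysis can pull back only the columns it needs.
amounts = pq.read_table("events.parquet", columns=["amount"])
print(amounts)
```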

3. Data computing

A data lake must provide a variety of analysis engines to meet its computing requirements, covering scenarios such as batch, interactive, and streaming. It also needs to serve massive data under highly concurrent reads to make real-time analysis efficient, and it must be compatible with a range of open-source data formats, accessing data stored in those formats directly.
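For the streaming scenario, a Structured Streaming job reading from a message bus and writing results back into the lake might be sketched like this (the broker, topic, and paths are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lake-streaming").getOrCreate()

# A streaming engine running over the same lake storage the batch engines use.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "clickstream")
    .load()
)

# Continuous per-minute counts; the watermark bounds late data so results
# can be appended incrementally.
counts = (
    events.withWatermark("timestamp", "2 minutes")
    .groupBy(F.window("timestamp", "1 minute"))
    .count()
)

query = (
    counts.writeStream.outputMode("append")
    .format("parquet")
    .option("path", "s3a://corp-lake/derived/clicks_per_minute/")
    .option("checkpointLocation", "s3a://corp-lake/_checkpoints/clicks/")
    .start()
)
```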

4. Data governance

Data governance is the process of managing the availability, security, and integrity of the data used in the data lake. It is an ongoing effort that guides and oversees all other data management functions by articulating strategy, establishing frameworks, setting guidelines, and enabling data sharing.

5. Metadata

Metadata management is foundational work spanning the entire data lifecycle of the lake, and enterprises need to manage the lifecycle of the metadata itself. Metadata management is not an end in itself; it is a means for organizations to get more value from their data. To become data-driven, an organization must first be metadata-driven.

6. Data resource catalog

Building a data resource catalog typically starts with scanning large amounts of data to collect metadata. The catalog’s scope may cover every data asset in the lake identified as valuable and shareable. The catalog uses algorithms and machine learning to automate finding and scanning datasets, extracting metadata to support dataset discovery, exposing data conflicts, inferring semantics and business terms, tagging data to support search, and identifying sensitive data for privacy, security, and compliance.

7. Privacy and security

Data security is the planning, development, and execution of security policies and procedures to provide authentication, authorization, access control, and auditing of data and information assets. Security must be implemented in every layer of the data lake, from storage through processing to consumption, with the basic requirement of stopping unauthorized access. Authentication, auditing, authorization, and data protection are among the important features of data lake security.

8. Data quality

Data quality is an important part of a data lake architecture. Data is used to create business value, and insights extracted from poor-quality data will themselves be poor. Data quality work centers on defining requirements, then inspecting, analyzing, and improving: identifying, measuring, monitoring, and alerting on the quality problems that can arise at every stage of the data lifecycle, from planning, acquisition, and storage through sharing, maintenance, application, and retirement, and raising quality further by improving the organization’s management practices.
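The identify-measure-alert loop can be as simple as rule checks run after each load; a hedged Python sketch (the rules, file, and threshold are illustrative):

```python
import pyarrow.parquet as pq

cols = pq.read_table("events.parquet").to_pydict()

def null_rate(col: str) -> float:
    """Fraction of rows where the column is missing."""
    vals = cols[col]
    return sum(v is None for v in vals) / len(vals)

def negative_rate(col: str) -> float:
    """Fraction of rows with a negative value."""
    vals = cols[col]
    return sum(v is not None and v < 0 for v in vals) / len(vals)

checks = {
    "user_id null rate": null_rate("user_id"),
    "amount negative rate": negative_rate("amount"),
}

THRESHOLD = 0.01  # agreed acceptable violation rate
for name, rate in checks.items():
    status = "ALERT" if rate > THRESHOLD else "ok"
    print(f"{status}: {name} = {rate:.2%}")
```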

9. Data audit

The two main data audit tasks are tracking changes to key datasets: tracking changes to important dataset elements, and capturing how, when, and by whom those elements were changed. Data audits help assess risk and compliance.
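One simple way to capture that how/when/who trail is an append-only audit log kept alongside each key dataset; a minimal sketch (the event fields and path are illustrative):

```python
import getpass
import json
from datetime import datetime, timezone

AUDIT_LOG = "orders_audit.log"  # append-only, one JSON event per line

def audit(dataset: str, action: str, detail: str) -> None:
    """Record how, when, and by whom a key dataset was changed."""
    event = {
        "dataset": dataset,
        "action": action,                              # how
        "detail": detail,
        "at": datetime.now(timezone.utc).isoformat(),  # when
        "by": getpass.getuser(),                       # who
    }
    with open(AUDIT_LOG, "a") as f:
        f.write(json.dumps(event) + "\n")

audit("raw/erp/orders", "overwrite-partition", "ingest_date=2024-06-01")
```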

10. Data application

Data application means the unified management, processing, and use of the data in the lake: internally it supports business operations, process optimization, marketing, risk management, channel integration, and similar activities; externally it supports open data sharing, data services, and more. This strengthens data’s supporting role in how the organization runs and is managed, while realizing the value of the data. On top of its basic computing capabilities, the data lake should offer upper-layer applications such as batch reporting, ad-hoc query, interactive analysis, data warehousing, and machine learning, along with self-service data exploration.

How to realize the business value of a data lake through data governance

The data lake plays a crucial role in an enterprise’s digital transformation and sustainable development. The aim is an open, flexible, scalable, enterprise-level unified data management and analysis platform that connects internal and external data on demand and breaks down the system boundaries around data.

1. Use data lake technologies such as intelligent analysis and data visualization to share data, generate routine reports automatically, and analyze quickly and intelligently, meeting the data analysis and application needs of every level of the enterprise.

2. Mine the value of data deeply and help the enterprise achieve digital transformation: manage data catalogs, models, standards, responsibilities, security, visualization, and sharing; centralize data storage, processing, classification, and management; automate report generation, make data analysis agile and data mining visual; and put data quality assessment and its management processes into practice.

Data lake challenges

A data lake is itself centralized storage, able to hold structured and unstructured data at any scale. Its advantage is that data can first be preserved as an asset; the question is how the business then uses that data. Once a data lake is deployed, data governance issues follow, such as how to channel data into the lake and how to organize the data once it is there.

Data in a data warehouse is organized, clear, and easy to understand. A data lake’s premise is to pile data up directly, without processing, so the lake can turn into a “data swamp” in which finding anything grows ever harder. Faced with incorrect definitions, incomplete information, stale data, or data that cannot be found at all, far more metadata is needed to understand the assets stored in the lake: a business-level understanding of the content, a data asset map, data sensitivity, user preferences, data quality, context (data without context cannot be analyzed), and data value. Moreover, these systems and applications are built by technical staff, and the gap in thinking and “language” between technical and business staff makes it still harder for business users to get at the data.

1. Avoid data swamps

How do you keep the water of your data lake clear so it does not become a data swamp? Data in a lake becomes a dumping ground if it is not used effectively. As the Chinese proverb has it: “Flowing water does not rot; a door hinge is never worm-eaten.” Only when data flows does it avoid becoming a swamp; the lake is merely a staging basin for rivers of data. Data flow means that every piece of data has its producers and, ultimately, its users. For data to flow effectively, an effective “data river” must be established. In its data lake experiments, the industry has generally underrated the importance of data governance, which is dangerous; the resulting data swamps are one reason enterprises keep watching data lakes from the sidelines.

2. Intelligent data governance is the necessary path to realizing a data lake’s value

As noted above, governance needs are even stronger in a lake than in a “model-first” warehouse: the data is more scattered, disordered, and unstandardized, and must be made “usable” through governance, or the lake will rot into a swamp and waste IT resources. Without data lake governance, an enterprise may lose meaningful business intelligence; whether a platform-style lake architecture can drive the business depends on it. This remains one of the biggest challenges in building a data lake.

Governance of the data lake should be comprehensive, covering who ingests the data, who is responsible for it, and how it is defined, so that data is properly labeled and used, and so that the content of enterprise data resources is optimized, transformed, and effectively controlled.

Data lake concepts and technologies continue to evolve, and different solution providers keep adding new features and capabilities, including architecture standardization and interoperability, data governance requirements, data security, and more.

As a cloud service that can meet the analysis, processing, and storage needs of different data at any time, the data lake’s scalability gives users more real-time analysis. Data lakes built on enterprise big data are evolving to support more kinds of real-time intelligent services, which will bring great change to enterprises’ existing data-driven decision-making.

As data lakes have developed, they have become the foundation of the enterprise data landscape: databases, data warehouses, big data processing, machine learning, and other data services can all be “gathered into one lake.” In this era of “using data on the cloud to empower intelligence,” many enterprises have completed the first step, moving to the cloud; the next step is working out how to “use the data” and “empower intelligence.”