A data warehouse is a subject-oriented, integrated, time-variant, and relatively non-volatile collection of data that supports the management decision-making process.
Data warehouses are built mainly around the demands of data users and data developers. Therefore, when starting to plan a data warehouse, the first step is to analyze the needs and pain points of all parties, and then design solutions and determine the construction content from those demands. When using data, there are three main types of problems:
- Can't find the data: users don't know whether the data exists or where it lives.
- Can't understand the data: many business users are not from the technical R&D team and cannot tell what the data means, how to join tables for a query, or which business system the data comes from.
- Can't use the data: users don't know how to write SQL, or which products can query the metrics they want.
Therefore, for data users, data warehouse construction must ensure that data can be found, understood, and used. Data development engineers, meanwhile, care more about the convenience and efficiency of data development and about rapidly localizing problems, so their main demands are:
- Data reuse: chimney-style (siloed) development of data requirements produces large amounts of duplicated logic. Reusing data shortens requirement delivery, improves development efficiency, and meets the business's agile demands for data.
- Problem tracing: the source of a problem can be quickly located during data processing and data quality analysis.
- Impact analysis: the impact of modifying a data rule, or of bringing data online or taking it offline, can be analyzed quickly and efficiently.
Data warehouse construction content
Once the data warehouse construction goals are clear, warehouse capability is built mainly from the following aspects:
- Layered architecture: layering simplifies the data cleaning process and provides a foundation for data and model reuse.
- Master data management: master data connects all business chains, unifies the data language and data standards, and enables data sharing.
- Indicator system: organizes indicators within a specific framework so that indicator names and calibers are uniformly defined, the composition relationships between indicators are clear, and duplicate construction is avoided.
- Word-root management: word roots can be used to standardize table names, field names, subject domain names, and so on.
- Data lineage: enables data traceability and the assessment of data value and quality.
Layered architecture
Layered data management makes the data in a warehouse easier to organize, manage, and maintain, and it simplifies data development work: the processing logic of each layer is relatively simple and easy to understand, which makes it easier to guarantee the correctness of each step and thus simplifies data cleaning. Layering trades space for time: heavy preprocessing improves the efficiency (and user experience) of downstream applications, so a data warehouse contains a large amount of redundant data. Without layering, a change in the business rules of a source system would ripple through the entire cleaning pipeline, which means a huge amount of rework. Layering also provides the foundation for data and model reuse; many data quality problems arise precisely because data and models cannot be reused, so business calibers and technical calibers fail to stay unified, and new requirements get recomputed from raw data. A data warehouse is typically layered as follows:
- ODS layer: a staging area that loads source data from business systems. ODS is system-oriented and mirrors the source: it does not change data structure or granularity, but dirty data is cleaned.
- DWD layer: the unique, integrated, accurate version of the data. Data is organized by subject domain, the structure is remodeled around entities and relationships (E-R modeling), and the finest data granularity is retained.
- DWS layer: business-oriented, using dimensional modeling. Data is organized by business process, structures are rebuilt as fact and dimension tables, and data is summarized on demand at business-level granularity.
- ADS layer: provides data services, using tools suited to each application scenario to improve storage and query efficiency.
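The four layers above can be sketched in plain Python. This is a minimal illustration, not a real pipeline; the table and field names (`ods_orders`, `dwd_order`, and so on) are illustrative assumptions:

```python
from collections import defaultdict

# ODS: load source data as-is; only obviously dirty rows are cleaned later.
ods_orders = [
    {"order_id": "1", "user_id": "u1", "amount": "30.0", "dt": "2023-01-01"},
    {"order_id": "1", "user_id": "u1", "amount": "30.0", "dt": "2023-01-01"},  # duplicate
    {"order_id": "2", "user_id": "u2", "amount": "bad",  "dt": "2023-01-01"},  # dirty row
    {"order_id": "3", "user_id": "u1", "amount": "12.5", "dt": "2023-01-02"},
]

def to_dwd(rows):
    """DWD: one unique, correctly typed record per order (finest granularity kept)."""
    seen, out = set(), []
    for r in rows:
        try:
            amount = float(r["amount"])   # enforce types
        except ValueError:
            continue                      # drop dirty rows
        if r["order_id"] in seen:
            continue                      # deduplicate
        seen.add(r["order_id"])
        out.append({**r, "amount": amount})
    return out

def to_dws(dwd_rows):
    """DWS: summarize by business process (daily order amount per user)."""
    agg = defaultdict(float)
    for r in dwd_rows:
        agg[(r["user_id"], r["dt"])] += r["amount"]
    return [{"user_id": u, "dt": d, "total_amount": v} for (u, d), v in agg.items()]

dwd_order = to_dwd(ods_orders)
dws_user_order_1d = to_dws(dwd_order)

# ADS: serve an application-facing metric, e.g. total revenue per day.
ads_revenue = defaultdict(float)
for r in dws_user_order_1d:
    ads_revenue[r["dt"]] += r["total_amount"]
```

Note how each layer's logic stays small and testable on its own, which is exactly the "each step is easy to verify" benefit layering is meant to provide.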
Master data management
Master data is data that meets the needs of cross-department collaboration, reflects the state attributes of an enterprise's (organization's) core business entities, is relatively stable, demands high accuracy, and is uniquely identified; its management is called MDM (Master Data Management). This is the definition given in the Master Data Management Practices White Paper.
Master data consists of the key facts describing the core business, such as customers, products, employees, and regions, as well as the data relationships between these facts. Master data management mainly delivers the following value:
- Eliminating data redundancy: different systems and departments acquire data according to their own rules and requirements, which easily leads to duplicate storage and data redundancy. Master data connects the business chains, unifies the data language and data standards, enables data sharing, and eliminates redundancy to the greatest extent.
- Improving data processing efficiency: each system and department defines data differently, versions drift out of sync, and a single core subject can end up with multiple versions of its information, which costs significant manpower and time to reconcile. Master data management allows data to be dynamically organized, replicated, distributed, and shared.
- Improving the company's strategic synergy: once unified across departments, data becomes the "common language" of internal business analysis and decision support, helping break down departmental and system barriers, integrate and share information, and improve the company's overall strategic synergy.
The following figure is an example of a master data asset inventory. Master data management is implemented mainly from the following aspects:
- Go deep into the business: the key entities of each line of business both differ and overlap; only a deep understanding of the business can keep master data consistent, accurate, complete, and controllable.
- Subject-domain-oriented management: manage master data in a three-level catalog of line of business, subject domain, and business process. Organizing master data hierarchically improves management efficiency.
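A core mechanic of MDM is merging per-system records of the same entity into one "golden record". The sketch below shows one possible survivorship rule; the system names (`crm`, `erp`) and the most-recent-wins rule are illustrative assumptions, not a standard:

```python
def merge_master_record(records):
    """Merge per-system customer records into one master record.

    Survivorship rule (assumed for illustration): for each attribute,
    the most recently updated non-empty value wins.
    """
    golden = {}
    for rec in sorted(records, key=lambda r: r["updated_at"]):
        for field, value in rec.items():
            if field in ("source", "updated_at"):
                continue            # bookkeeping fields, not master attributes
            if value:               # later non-empty values overwrite earlier ones
                golden[field] = value
    return golden

crm = {"source": "crm", "updated_at": "2023-05-01",
       "customer_id": "C001", "name": "Acme Corp", "phone": ""}
erp = {"source": "erp", "updated_at": "2023-04-01",
       "customer_id": "C001", "name": "ACME", "phone": "555-0100"}

master = merge_master_record([crm, erp])
# The golden record keeps the newer name from CRM but the phone
# number that only ERP knows, eliminating the redundant copies.
```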
Indicator system
An indicator is a value that quantifies some target thing, sometimes called a measure; DNU and retention rate, for example, are indicators. An indicator system organizes indicators within a specific framework so that indicator names and caliber definitions are unified, the composition relationships between indicators are clear, and duplicate construction is avoided. The figure below is an example of an indicator system.
- Business process: an indivisible behavioral event in enterprise activity.
- Dimensions/attributes: dimensions are the perspectives from which business processes are observed and analyzed; attributes are the information describing a dimension.
- Atomic indicators: measures of a specific business process or counts on a specific dimension/attribute; they have clear business meaning and cannot be split further at the logical level.
- Modifiers: words that qualify an atomic indicator, each corresponding to a clear business scenario and business rule, used to delimit the statistical scope of the atomic indicator.
- Derived indicators: an atomic indicator combined with one or more modifiers.
- Composite indicators: the direct result of a formula that combines atomic and derived indicators.
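The composition of atomic indicator + modifier = derived indicator can be sketched as follows. The event schema and the names (`new_users`, `dnu_ios`) are illustrative assumptions:

```python
events = [
    {"user_id": "u1", "event": "first_login", "platform": "ios",     "dt": "2023-01-01"},
    {"user_id": "u2", "event": "first_login", "platform": "android", "dt": "2023-01-01"},
    {"user_id": "u3", "event": "first_login", "platform": "ios",     "dt": "2023-01-02"},
]

# Atomic indicator: a count on the "first_login" business process.
def new_users(rows):
    return sum(1 for r in rows if r["event"] == "first_login")

# Modifier: delimits the statistical scope (here: iOS, on a given day).
def is_ios_on(dt):
    return lambda r: r["platform"] == "ios" and r["dt"] == dt

# Derived indicator: the atomic indicator restricted by modifiers,
# e.g. DNU on iOS for one day.
def dnu_ios(rows, dt):
    return new_users([r for r in rows if is_ios_on(dt)(r)])
```

Defining the atomic indicator and the modifiers once, and deriving everything else from them, is exactly what keeps indicator names and calibers consistent across teams.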
Word-root management
Word roots are the finest-grained business terms of an enterprise and the basis for managing dimensions and indicators; they can be used to unify table names, field names, and subject domain names. Establish and maintain a convergent root library: business domains and subject domains can be clearly enumerated with roots and continuously refined, and so can granularities, chiefly time granularities such as day, week, month, and year. Roots define standard abbreviations, and the field names developed in the data warehouse can be composed from them. Roots are divided into ordinary roots and proprietary roots:
- Ordinary roots: the smallest units describing things; for example, the business term "trading" maps to the root trade.
- Proprietary roots: conventional or industry-specific descriptors; for example, "US dollar" maps to the root usd.
An example of a root word is as follows:
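Root-based naming can be sketched as a small lookup-and-compose step. The root dictionary and the `layer_subject_process_granularity` naming convention below are illustrative assumptions, not a standard:

```python
# A small root library mapping business terms to standard abbreviations.
roots = {
    "trading": "trd",    # ordinary root
    "user": "usr",
    "amount": "amt",
    "daily": "1d",       # time-granularity root
    "us dollar": "usd",  # proprietary root
}

def field_name(*terms):
    """Compose a standard field name from business terms via the root library."""
    return "_".join(roots[t] for t in terms)

def table_name(layer, *terms):
    """Compose a standard table name, prefixed with the warehouse layer."""
    return f"{layer}_{field_name(*terms)}"

# e.g. a DWS-layer daily summary of user trading amounts
summary_table = table_name("dws", "user", "trading", "amount", "daily")
```

Because every name is derived from the same dictionary, two teams naming the same concept will produce the same string, which is the point of maintaining a convergent root library.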
Data lineage
From the data source to the final generated data, every link in the process can introduce data quality problems. For example, if the quality of the source data is poor and subsequent processing links do not detect and handle it, that low-quality data flows all the way to the target table. It is also possible that improper processing at one link degrades the quality of the data at every link after it. Therefore, along the data lineage, every link must pay attention to detecting and handling data quality, so that downstream data inherits good genes, that is, high quality.
The role of data lineage
- Data traceability: lineage reflects the ins and outs of data, helping us trace its source and track its processing.
- Evaluating data value: the value of data matters greatly in the field of data trading, and lineage provides a basis for valuation in terms of data audience, update magnitude, and update frequency.
- Assessing data quality: with clear data sources and processing methods, the quality of the data at each node can be determined, and the data cleaning criteria can be read directly off the lineage diagram.
- Managing the data lifecycle: lineage helps judge the lifecycle of data and serves as a reference for archiving and destruction operations.
An example of data lineage is as follows:
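Traceability and impact analysis both reduce to traversing the same dependency graph in opposite directions. A minimal sketch, with illustrative table names:

```python
# downstream[t] = tables built directly from t
downstream = {
    "ods_orders": ["dwd_order"],
    "dwd_order": ["dws_user_order_1d"],
    "dws_user_order_1d": ["ads_revenue_daily"],
    "ads_revenue_daily": [],
}

# upstream view, derived from the same edges
upstream = {t: [] for t in downstream}
for src, outs in downstream.items():
    for dst in outs:
        upstream[dst].append(src)

def trace(table, graph):
    """All tables reachable from `table` in `graph`.

    With `upstream` this answers "where does this data come from?"
    (traceability); with `downstream`, "what does a change here
    affect?" (impact analysis).
    """
    seen, queue = set(), [table]
    while queue:
        for nxt in graph[queue.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

For example, `trace("dwd_order", downstream)` lists every table a DWD rule change can affect, which is precisely the impact-analysis demand raised by data developers earlier in this article.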