in the process of construction, the organization and management of data should not only be divided vertically according to the business, but also require horizontal hierarchical specifications of digital warehouses. this article analyzes the stratification of several warehouses in enterprises, and hopes to help you.

people engaged in the work related to the number of warehouses know that one of the primary tasks of the model design is to carry out, which shows the importance of model stratification in the model design process, and indeed excellent hierarchical design is the core element of whether a digital warehouse project can be successfully built, so that data is easy to understand and highly reusable is the core goal of layering. 模型分层

WHAT IS A NUMBER OF DW’S WAREHOUSES

Data warehouse (can be shortened to DW or DWH) data warehouse, is a large number of databases already exist, it is a complete set of theoretical systems including etl, scheduling, modeling and so on. The purpose of the data warehouse solution is to use front-end query and analysis as a basis, mainly applied to OLAP (on-line Analytical Processing), support complex analysis operations, focus on decision support, and provide intuitive and easy-to-understand query results. At present, the more popular in the industry are: AWS Redshift, Greenplum, Hive and so on. The data warehouse is not the final destination of the data, but the preparation for the final destination of the data, which includes: cleaning, escaping, classifying, reorganizing, merging, splitting, statistics, etc

main features

  • theme-orientedoperational database organizations are oriented toward transactional tasks, while data in a data warehouse is organized according to a certain subject domain.a topic refers to a key area of concern when users use the data warehouse to make decisions, and a topic is related to multiple operational information systems.
  • integrationsource data needs to be processed and integrated, unified and synthesizedinconsistencies in the source data must be eliminated during processing to ensure consistent global information about the entire enterprise when the information within the data warehouse is maintained. (association)
  • not modifiableTHE DATA IN DW IS NOT UP-TO-DATE, BUT COMES FROM OTHER DATA SOURCESdata warehousing is primarily about providing data for decision analytics, and the operations involved are primarily queries of data
  • time-dependentdata in the data warehouse that is in the context of decision-making needs to be marked with time attributes

comparison with databases

  • DW: SPECIFICALLY DESIGNED FOR DATA ANALYSIS, IT INVOLVES READING LARGE AMOUNTS OF DATA TO UNDERSTAND RELATIONSHIPS AND TRENDS BETWEEN DATA
  • database: used to capture and store data

why tier

issues involved in the data warehouse:

  1. why do you want to do a data warehouse?
  2. why data quality management?
  3. why metadata management?
  4. what is the role of each layer in the number of bin layers?

in practice, we all want our data to flow sequentially, so that designers and users can clearly know the entire declaration cycle of the data, such as the following left figure.

however, in practice, the data situation we are facing is likely to be highly complex and hierarchically chaotic, and we may make a set of table dependency structures that are chaotic and have circular dependencies, such as the right figure below.

in order to solve the problems we may face, we need a set of effective data organization, management and processing methods to make our data system more orderly, which is data stratification. benefits of data tiering:

  • clear data structure: let each data layer have its own role and responsibilities, and it can be more convenient and understood when using and maintaining
  • complex problem simplification: break down a complex task into multiple steps to complete it in steps, with each layer solving only a specific problem
  • unified data caliber: through data layering, provide unified data export and unified output caliber
  • reduce duplicate development: standardize data layering and develop a common middle layer, which can greatly reduce the work of double calculation

data tiering

EACH COMPANY’S BUSINESS CAN BE TIERED AT DIFFERENT LEVELS ACCORDING TO ITS OWN BUSINESS NEEDS; AT PRESENT, THE MORE MATURE DATA TIERS ARE: DATA OPERATION LAYER ODS, DATA WAREHOUSE LAYER DW, AND DATA SERVICE LAYER ADS (APP).

DATA OPERATIONS LAYER ODS

Data Operations Layer: The Operation Data Store data preparation area, also known as the feed layer. The data in the data source enters this layer after being extracted, washed, and transmitted, that is, the ETL process. The main features of this layer:

  • ODS IS THE PREPARATION AREA BEHIND THE DATA WAREHOUSE LAYER
  • PROVIDE RAW DATA FOR THE DWD LAYER
  • reduce impact on business systems

WHEN THE SOURCE DATA IS LOADED INTO THIS LAYER, A SERIES OF OPERATIONS SUCH AS DENOISING (FOR EXAMPLE, IF THERE IS A DATA IN WHICH THE AGE OF THE PERSON IS 300 YEARS OLD, WHICH IS ABNORMAL DATA, IT NEEDS TO BE PROCESSED IN ADVANCE), DEDUPLICATION (FOR EXAMPLE, IN THE PROFILE TABLE, THE SAME ID HAS TWO DUPLICATE DATA, AND ONE STEP NEEDS TO BE DEDUPLICATION WHEN ACCESSING), AND THE FIELD NAMING CONVENTION. HOWEVER, IN ORDER TO CONSIDER THE SUBSEQUENT DATA TRACEABILITY PROBLEM, IT IS NOT RECOMMENDED TO DO TOO MUCH DATA CLEANING WORK FOR THIS LAYER, AND IT IS ALSO POSSIBLE TO ACCESS THE ORIGINAL DATA UNCHANGED, ACCORDING TO THE NEEDS OF THE SPECIFIC LAYERING OF THE SERVICE.

this layer of data is the source of subsequent data warehouse processing data. ways of data source:

  • business librarysqoop is often used to extract, for example, once a day at regular intervals.in terms of real-time, you can consider using canal to listen to mysql binlog, and real-time access can be done.
  • bury the logslogs are generally saved as files, and you can choose to synchronize them regularly with flumeYou can use spark streaming or Flink for real-time accessKafka is also OK
  • Message Queuing: That is, data from ActiveMQ, Kafka, etc.

data warehouse layer

the data warehouse layer can be divided into 3 layers from top to bottom: 、、。数据细节层DWD数据中间层DWM数据服务层DWS

DATA DETAIL LAYER DWD

Data warehouse details, DWD (data cleaning/DWI)

THIS LAYER IS THE ISOLATION LAYER OF THE BUSINESS LAYER AND THE DATA WAREHOUSE, MAINTAINING THE SAME DATA GRANULARITY AS THE ODS LAYER; MAINLY CLEANING AND NORMALIZING THE DATA OF THE ODS DATA LAYER, SUCH AS REMOVING EMPTY DATA, DIRTY DATA, AND OUTLIERS.

in order to improve the ease of use of the data detail layer, the layer usually adopts some dimension degradation methods, which degrade the dimensions to the fact table and reduce the association of the fact table and the dimension table.

DATA MIDDLE-TIER DWM

Data middle tier: Data Warehouse Middle, DWM

THIS LAYER IS BASED ON THE DATA OF THE DWD LAYER, DOES SOME SLIGHT AGGREGATION OPERATIONS ON THE DATA, GENERATES SOME INTERMEDIATE RESULT TABLES OF COLUMNS, IMPROVES THE REUSE OF COMMON INDICATORS, AND REDUCES THE REPETITIVE PROCESSING WORK.

in short, the aggregation operation of the common core dimensions is carried out to calculate the corresponding statistical indicators

DATA SERVICES LAYER DWS

Data service layer: Data Warehouse Service, DWS (wide table – user behavior, light aggregation)

THIS LAYER IS BASED ON THE BASIC DATA ON THE DWM, INTEGRATED AND SUMMARIZED INTO A DATA SERVICE LAYER THAT ANALYZES A CERTAIN SUBJECT AREA, GENERALLY A WIDE TABLE, WHICH IS USED TO PROVIDE SUBSEQUENT BUSINESS QUERIES, OLAP ANALYSIS, DATA DISTRIBUTION, ETC.

in general, the data tables for this layer will be relatively small; a table will cover more business content, and because of its large number of fields, it is generally called the table of this layer as a wide table.

  • USER BEHAVIOR, MILD AGGREGATE TO DWD
  • MAINLY DO SOME LIGHT SUMMARIZATION OF ODS/DWD LAYER DATA.

DATA APPLICATION LAYER ADS

Data application layer: Application Data Service, ADS (APP/DAL/DF) – Report results

This layer is mainly provided to data products and data analysis data, generally stored in ES, Redis, PostgreSql and other systems for online systems to use; may also be stored in hive or Druid for data analysis and data mining, such as commonly used data reports are stored here.

Fact Table

a fact table is a table that stores fact records, such as system logs, sales records, and so on. the record of the fact table is constantly growing, such as the e-commerce order table, which is a similar situation, so the size of the fact table is usually much larger than other tables.

Dimensional Surface Dimension (DIM)

Dimension table (Dimension Table) or dimension table, sometimes called lookup table (Lookup Table), is a kind of table corresponding to the fact table; it holds the attribute values of the dimension, which can be associated with the fact table, which is equivalent to extracting and standardizing the properties that often occur repeatedly on the fact table and managing it with a table. Dimension tables mainly contain two parts:

  • high cardinal dimension data: generally a user profile table, a product data sheet similar to the data sheet, the amount of data may be tens of millions or hundreds of millions
  • low cardinality dimension data: generally a configuration table, such as the meaning of the chinese corresponding to the enumeration field, or the date dimension table, etc.; the amount of data may be single digits or tens of thousands.

TEMPORARY TABLE TMP

EACH LAYER OF COMPUTATION WILL HAVE MANY TEMPORARY TABLES, WITH A DEDICATED DWTMP LAYER TO STORE THE TEMPORARY TABLES OF OUR DATA WAREHOUSE

data mart

Ads layer in the narrow sense; broadly refers to data that hadoop synchronizes from DWD DWS ADS to RDS

Data Mart (Data Mart), also known as data market, data mart is to meet the needs of specific departments or users, in accordance with the multi-dimensional way of storage, including defining dimensions, indicators that need to be calculated, levels of dimensions, etc., to generate a data cube for decision-making analysis needs.

in terms of scope, data is extracted from enterprise-wide databases, data warehouses, or more specialized data warehouses. the focus of the data center is that it caters to the specific needs of the professional user community, in terms of analysis, content, performance, and ease of use. users in data centers want data to be represented by familiar terminology.

data warehousing structure with a data mart

differentiate between data warehouses

a data mart is a subset of an enterprise data warehouse that is primarily geared toward departmental business and is geared toward a specific topic. to address the tension between flexibility and performance, a data mart is a small departmental or workgroup-level data warehouse that is added to a data warehouse architecture. the data mart stores pre-calculated data for a specific user to meet the user’s performance needs. a data mart can alleviate the bottleneck of accessing a data warehouse to some extent.

theoretically, there should be a concept of a total data warehouse before there is a data mart. when the data mart is actually built, it is rarely done in china. in china, we will generally start with the data mart, do the data mart first on a specific topic (such as the customer information of the enterprise), and then build a data warehouse. the order in which data warehouses and data marts are established is closely related to the design approach. as an engineering discipline, data warehousing is not right or wrong.

in terms of data structures, a data warehouse is a collection of topic-oriented, integrated data. while a data mart is usually defined as a star structure or a snowflake data structure, a data mart is generally composed of a fact table and several dimensional tables.