table of contents of this article:

1. Guiding ideology
2. Data research
3. Architecture design
4. Indicator system construction
5. Model design
6. Dimension design
7. Fact table design
8. Other specifications

OneData is Alibaba’s internal methodology and toolset for data integration and management.

1. Guiding ideology

First, conduct sufficient business research and requirement analysis.

Second, carry out the overall data architecture design: divide the data into data domains, then, following dimensional modeling theory, abstract the business processes and dimensions and build the bus matrix.

Third, abstract an indicator system from the report requirements and use the OneData tool to complete the specification definition and model design of the indicators. Finally, carry out code development and O&M.

The implementation process is thus mainly divided into: data research, architecture design, specification definition, and model design.

2. Data research

1. Business research

Business research confirms which business areas are to be planned into the data warehouse, and which functional modules each business area contains. Taking Alibaba’s business as an example, a matrix like the following can be planned:

2. Requirement research

Requirement research clarifies what the demand side cares about: Which indicators? Which dimensions and measures are required? Should the data be materialized at the summary layer? And so on.

3. Architecture design

1. Division of data domains

A data domain is an abstract collection of business processes or dimensions. Data domains generally correspond to application systems (functional modules), so the business processes of the same functional module or system can be grouped into one data domain:

2. Build the bus matrix

After sufficient business research and requirement research, it is time to build the bus matrix. Two things need to be done:

  1. Clarify which business processes fall under each data domain.
  2. Clarify which dimensions each business process is related to.

The bus matrix defines the business processes and dimensions under each data domain:
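As a minimal sketch of the idea (the domain, process, and dimension names below are invented for illustration), a bus matrix can be held as a nested mapping from data domain to business process to related dimensions:

```python
# A toy bus matrix: data domain -> business process -> related dimensions.
# All domain, process, and dimension names here are hypothetical examples.
bus_matrix = {
    "transaction": {
        "place_order": {"buyer", "seller", "product", "date"},
        "payment":     {"buyer", "seller", "product", "date", "pay_channel"},
    },
    "log": {
        "page_view":   {"visitor", "product", "date", "terminal"},
    },
}

def processes_in_domain(domain):
    """List the business processes under one data domain (point 1 above)."""
    return sorted(bus_matrix[domain])

def dimensions_of(domain, process):
    """List the dimensions a business process is related to (point 2 above)."""
    return sorted(bus_matrix[domain][process])
```

Reading a domain's entry answers the first question; reading a process's entry answers the second.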

4. Indicator system construction

1. Basic concepts

Data domain: the collection of business processes or dimensions abstracted for business analysis.

Business process: refers to the events in the business activities of the enterprise.

Time period: clarifies the time range or time point of the statistics, such as “last 30 days” or “up to now”.

Modifier type: an abstract grouping of modifiers.

Modifier: a business-scenario qualifier on an indicator, apart from the statistical dimensions. A modifier belongs to a modifier type; for example, PC, Android, and iOS belong to the access-terminal type.

Measure / atomic indicator: a business term with a clear meaning, such as payment amount.

Dimension: the environment in which measurement takes place. A dimension reflects a class of attributes of the business; the collection of such attributes is called a dimension, also called an entity object, e.g. the geography dimension or the time dimension.

Dimension attribute: a descriptive attribute belonging to a dimension, such as country and province in the geography dimension.

Derived indicator: atomic indicator + multiple modifiers (optional) + time period.

Be sure to clarify the definitions of atomic indicators, modifiers, time periods, and derived indicators.
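The composition rule can be sketched as follows (the indicator, modifier, and time-period names are hypothetical):

```python
def derived_indicator(atomic, time_period, modifiers=()):
    """Compose a derived indicator name from its parts:
    derived = atomic indicator + optional modifiers + time period."""
    parts = [time_period, *modifiers, atomic]
    return "_".join(parts)

# "payment amount on Android over the last 30 days" as a derived indicator:
name = derived_indicator("payment_amount", "last_30d", ["android"])
```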

2. Rules for deriving indicators

Derived indicators are derived from three types of indicators: transactional indicators, stock indicators, and composite indicators.

Transactional indicator: a measure of business activity, such as order count.

Stock indicator: statistics on certain states of entity objects, such as account balance.

Composite indicator: compounded on the basis of the two types above.

5. Model design

1. Data layering

Industry views on data warehouse layering are similar: generally three layers, an access layer, a middle layer, and an application layer, though understandings of the middle layer differ somewhat.

2. Access layer (ODS)

Business data is generally synchronized to the data warehouse at a fixed frequency by DataX or Sqoop to build the ODS layer; log data is synchronized through Flume or Kafka.

The access layer generally does no processing or cleaning of the source data, which makes later traceback easier.

3. Detail layer (DWD)

In theory, the detail layer cleans and processes the ODS-layer data to improve its usability. Opinions differ on whether DWD tables may reference other tables in the same layer:

  1. In general, same-layer references are not recommended in the DWD layer; avoiding them reduces dependencies between detail-layer tasks and the depth of the node graph.
  2. However, in scenarios where the processing logic from the ODS layer to the DWD layer is complex and computationally expensive, appropriate reuse of DWD tables to build new DWD tables can be considered.

4. Summary layer (DWS)

The summary layer relies on the indicator system to aggregate the DWD-layer data along various dimensions.

5. Data mart layer (DWM)

Aggregate statistics that cross business domains are placed at this layer.

6. Application layer (APP)

The application layer mainly combines relevant indicators from the summary layer and generates reports.
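A hedged sketch of how the layering shows up in practice, assuming a common prefix-based table-naming scheme (the table names below are made up):

```python
# Hypothetical table-name prefixes, one per layer described above.
LAYERS = ["ods", "dwd", "dws", "dwm", "app"]

def layer_of(table_name):
    """Infer a table's warehouse layer from its name prefix."""
    prefix = table_name.split("_", 1)[0]
    if prefix not in LAYERS:
        raise ValueError(f"unknown layer prefix: {prefix}")
    return prefix

# e.g. an order's journey through the layers (names are invented):
flow = ["ods_trade_order", "dwd_trade_order_di",
        "dws_trade_order_1d", "app_trade_report"]
```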

6. Dimension design

In dimensional modeling, the measurements are called facts, and dimensions provide the diverse environments needed to analyze the facts. Dimensions typically serve for query constraints, subtotaling, and sorting.

From the constraints of the reports, as well as earlier data research and communication with the business side, we can obtain the dimensions.

Dimensions are associated with fact tables through primary keys. Dimension-table primary keys come in two kinds: surrogate keys and natural keys. Surrogate keys have no business meaning and are generally used to handle slowly changing dimensions; natural keys have business meaning.
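A minimal sketch of the two key types (table and column names are invented): the natural key comes from the business system, while the warehouse assigns a fresh surrogate key per row version, which is what makes slowly changing dimensions workable:

```python
import itertools

# Natural key: business-meaningful id from the source system (e.g. user_id).
# Surrogate key: a meaningless integer assigned by the warehouse, so one
# natural key can map to several rows as the dimension slowly changes.
_next_sk = itertools.count(1)

dim_user = []  # dimension rows

def add_user_version(user_id, city):
    """Insert a new version of a user's row with a fresh surrogate key."""
    row = {"user_sk": next(_next_sk), "user_id": user_id, "city": city}
    dim_user.append(row)
    return row["user_sk"]

sk1 = add_user_version("u001", "Hangzhou")  # first version
sk2 = add_user_version("u001", "Beijing")   # user moved: same natural key, new surrogate key
```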

1. Basic method of dimension design

  1. Select or create the dimension. Through the bus matrix built earlier, we know the dimensions in the current data warehouse architecture.
  2. Determine the main dimension table. The main dimension table is generally an ODS table synchronized directly from the business system.
  3. Identify the related dimension tables. A data warehouse integrates data from business source systems, and there are correlations between different business systems, or between tables within one business system. Based on business analysis, confirm which tables are related to the main dimension table and select some of them to generate dimension attributes.
  4. Determine the dimension attributes, in two stages: first select or generate dimension attributes from the main dimension table; then select or generate dimension attributes from the related dimension tables.

2. Normalization and denormalization

When there are multi-level dimension attributes, normalizing them according to the third normal form into a series of dimension tables, rather than a single dimension table, is called the snowflake schema.

Merging a dimension’s hierarchy of attributes into a single dimension table is called denormalization.
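A small sketch of denormalization, assuming a toy product/category hierarchy (all names invented):

```python
# Snowflake (normalized): the product dim references a separate category dim.
dim_category = {1: {"category_id": 1, "category_name": "electronics"}}
dim_product_snowflake = {10: {"product_id": 10, "name": "phone", "category_id": 1}}

def denormalize(products, categories):
    """Collapse the category level into the product dimension itself."""
    flat = {}
    for pid, p in products.items():
        cat = categories[p["category_id"]]
        flat[pid] = {**p, "category_name": cat["category_name"]}
    return flat

dim_product_flat = denormalize(dim_product_snowflake, dim_category)
```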

3. Conformed dimensions and cross-domain probing

We often need to observe business processes in different data domains, or different business processes in the same data domain, together. For example, in the log data domain, count the PV and UV of the commodity dimension over the last day; in the transaction data domain, count the GMV of the commodity dimension over the last day.

Combining facts from different data domains for exploration like this is called cross-domain probing.

The premise of cross-domain probing is that the different data domains share conformed dimensions.
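As an illustration (data and keys are made up), cross-domain probing is essentially a merge on the conformed dimension’s key:

```python
# Daily per-product stats from two data domains, keyed by the SAME
# conformed product dimension (product_id). Numbers are invented.
log_stats = {"p1": {"pv": 100, "uv": 40}, "p2": {"pv": 30, "uv": 10}}
trade_stats = {"p1": {"gmv": 999.0}}

def cross_probe(log, trade):
    """Merge facts from the log and trade domains on the shared product key."""
    merged = {}
    for pid in set(log) | set(trade):
        merged[pid] = {**log.get(pid, {}), **trade.get(pid, {})}
    return merged

report = cross_probe(log_stats, trade_stats)
```

If the two domains used different product keys, this merge would be impossible, which is exactly why the dimension must be conformed first.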

4. Dimension integration

Since the data sources of the warehouse come from different, mutually independent application systems, the same information may be described and stored differently. These differences need to be integrated when the data enters the warehouse:

  1. Unify naming conventions: table names, field names, and so on.
  2. Unify field types: the same or similar fields use one type.
  3. Unify common codes and code values.
  4. Unify tables with the same business meaning: following the principle of high cohesion and low coupling, integrate tables that are closely related in business and little affected by source-system differences.

There are two main forms of table-level integration:

Vertical integration: different source tables contain the same data set but store different information about it, and can be integrated into the same dimensional model.

Horizontal integration: different source tables contain different data sets, which may overlap partially or not at all. If they overlap, deduplicate; if they do not, check whether the natural keys of the different subsets conflict. If there is no conflict, each subset's natural key can serve as the integrated natural key; alternatively, process each natural key into a super-natural key.

5. Zipper table

The zipper table is also known as extreme storage technology. Suppose a table stores full user information. The usual approach is for each partition to store a snapshot of the full data for that day; the problem is that if we want to keep every historical state of users, we may need to keep every historical partition permanently.

With a zipper table, each partition saves each user's historical states as of that day, and historical partitions can still be cleaned up.

Although a single partition stores more data this way, after the data of some historical partitions is cleaned, the total data stored in the table shrinks, because the many snapshots of unchanged user records have been removed.
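A simplified sketch of zipper-table maintenance (the column names and the `9999-12-31` open-end convention are illustrative): each record carries a validity interval, and a new row is appended only when a user's state actually changes.

```python
OPEN_END = "9999-12-31"   # conventional "still valid" end date

def apply_day(zipper, today_snapshot, ds):
    """Merge one day's full snapshot into a zipper table.

    zipper rows: {"user_id", "state", "start_date", "end_date"}.
    A row stays open (end_date = OPEN_END) until the state changes.
    """
    current = {r["user_id"]: r for r in zipper if r["end_date"] == OPEN_END}
    for user_id, state in today_snapshot.items():
        row = current.get(user_id)
        if row is None:                 # brand-new user
            zipper.append({"user_id": user_id, "state": state,
                           "start_date": ds, "end_date": OPEN_END})
        elif row["state"] != state:     # state changed: close old row, open new one
            row["end_date"] = ds
            zipper.append({"user_id": user_id, "state": state,
                           "start_date": ds, "end_date": OPEN_END})
    return zipper

z = []
apply_day(z, {"u1": "active"}, "2024-01-01")
apply_day(z, {"u1": "active"}, "2024-01-02")   # unchanged: no new row
apply_day(z, {"u1": "frozen"}, "2024-01-03")   # changed: close + open
```

Note how the unchanged day adds no row at all, which is where the storage saving comes from.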

6. Mini-dimensions

A mini-dimension is created by splitting relatively unstable attributes out of the relatively stable main dimension and placing them in a new table with its own surrogate key.

7. Recursive hierarchies

A recursive hierarchy refers to a hierarchical relationship among the instance values of a dimension table. Recursive hierarchies divide into balanced hierarchies, with a fixed number of levels, and unbalanced hierarchies, without a fixed number of levels.

Since data warehouses generally do not support recursive SQL to handle such hierarchies, other methods are needed:

  1. Hierarchy flattening, suitable for balanced hierarchies.
  2. Hierarchy bridge tables, suitable for unbalanced hierarchies.
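A sketch of the first approach, hierarchy flattening, for a balanced three-level area hierarchy (the place names are invented examples):

```python
# Parent-child ("recursive") representation of a balanced 3-level
# hierarchy: city -> province -> country.
parent = {"Hangzhou": "Zhejiang", "Zhejiang": "China", "China": None}

def flatten(node, levels=3):
    """Flatten the recursive hierarchy into fixed level columns."""
    chain = []
    while node is not None:            # walk leaf -> root
        chain.append(node)
        node = parent[node]
    chain += [None] * (levels - len(chain))   # pad to the fixed depth
    # emit root-to-leaf level columns
    return {f"level_{i + 1}": v for i, v in enumerate(reversed(chain))}

row = flatten("Hangzhou")
```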

8. Multi-valued dimensions

A multi-valued dimension means that one record in the fact table corresponds to multiple records in a dimension table.

There are three common treatments for multi-valued dimensions:

  1. Lower the granularity of the fact table.
  2. Expand into multiple columns.
  3. The more general approach: use a bridge table.
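A sketch of the bridge-table approach (the entities and weights are illustrative): the bridge links one fact row to several dimension members, and an optional weight lets measures be allocated without double counting.

```python
# One order (fact) relates to several sales reps (dimension) -- a
# multi-valued dimension. A bridge table links them, with a weight.
fact_order = {"order_id": "o1", "amount": 90.0}
bridge = [
    {"order_id": "o1", "rep_id": "r1", "weight": 2 / 3},
    {"order_id": "o1", "rep_id": "r2", "weight": 1 / 3},
]

def allocate(fact, bridge_rows):
    """Split a fact's measure across dimension members by bridge weight."""
    return {b["rep_id"]: fact["amount"] * b["weight"]
            for b in bridge_rows if b["order_id"] == fact["order_id"]}

shares = allocate(fact_order, bridge)
```

Because the weights sum to 1, the allocated shares sum back to the original measure.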

9. Junk dimensions

A junk dimension is a combination of indicator or flag fields from the operational system, generally not included among the conformed dimensions.

If these fields stay in the fact table as facts, the fact table takes up more space; if each gets its own dimension table, there are many fragmentary small dimension tables.

The usual solution is to build a junk dimension: combine these fields into one dimension table and keep only a single foreign key in the fact table. A junk dimension can be understood as a scheme that merges many small dimension tables into one larger dimension table.
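A sketch of building a junk dimension from two hypothetical flag fields, enumerating the cross product of their values once and resolving each fact row to a single foreign key:

```python
import itertools

# Junk dimension: the cross product of a few low-cardinality flags from
# the operational system, stored once and referenced via one foreign key.
flags = {
    "is_gift":  [0, 1],
    "pay_type": ["credit", "debit"],
}

junk_dim = {}   # surrogate key -> flag combination
lookup = {}     # flag combination -> surrogate key
for sk, combo in enumerate(itertools.product(*flags.values()), start=1):
    row = dict(zip(flags.keys(), combo))
    junk_dim[sk] = row
    lookup[tuple(combo)] = sk

def junk_key(is_gift, pay_type):
    """Resolve a fact row's flag values to the single junk-dimension FK."""
    return lookup[(is_gift, pay_type)]
```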

10. Degenerate dimensions

A degenerate dimension is one whose dimension attributes are stored directly in the fact table.

7. Fact table design

The degree of business detail expressed by one record in the fact table is called the granularity.

1. Types of facts

As the measures of a business process, facts come in three types: additive, semi-additive, and non-additive.

Additive facts can be summed along any dimension associated with the fact table.

Semi-additive facts can only be summed along specific dimensions, not all of them.

Non-additive facts cannot be summed at all, such as ratio facts. For non-additive facts, consider decomposing them into additive components so that they can be aggregated.
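A small illustration of why the decomposition matters, using a hypothetical conversion rate (orders / visits, numbers invented): averaging daily rates gives the wrong answer, while summing the additive components first gives the right one.

```python
# A conversion rate is non-additive: averaging daily rates is wrong.
# Store the additive components (visits, orders) and derive the ratio
# only after aggregation.
daily = [
    {"ds": "2024-01-01", "visits": 100, "orders": 10},   # rate 0.10
    {"ds": "2024-01-02", "visits": 300, "orders": 60},   # rate 0.20
]

naive_avg = sum(d["orders"] / d["visits"] for d in daily) / len(daily)

def conversion_rate(rows):
    """Aggregate the additive parts first, then take the ratio."""
    visits = sum(r["visits"] for r in rows)
    orders = sum(r["orders"] for r in rows)
    return orders / visits

correct = conversion_rate(daily)   # 70 / 400 = 0.175, not 0.15
```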

2. Fact Table Types

The most common fact tables are of three types: transaction fact tables, periodic snapshot fact tables, and cumulative snapshot fact tables.

A transaction fact table describes a business process; it represents a measurement event at a point in time and space and holds the most atomic data, so it is also called an atomic fact table. In practice it generally serves as the detail layer, e.g. order details and payment details.

A periodic snapshot fact table records facts at regular intervals, one row per period, such as a daily inventory snapshot table or a daily user balance snapshot table.

A cumulative snapshot fact table represents the key step events between the beginning and the end of a process, covering the process's entire life cycle. It usually carries multiple date fields to record the key points in time, and its records are modified as the process moves through its life cycle. Taking the order from the transaction fact table example, you can build a whole-life-cycle order snapshot table covering order placement, order dispatch, order grabbing, payment, and the other steps.
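A toy sketch of how a cumulative-snapshot row is updated in place as milestones occur (the milestone names are invented for illustration):

```python
# A cumulative-snapshot row for one order: one date column per milestone,
# filled in as the order moves through its life cycle.
MILESTONES = ["placed_date", "dispatched_date", "paid_date"]

def new_order_row(order_id, ds):
    """Open a life-cycle row when the order is placed."""
    row = {"order_id": order_id, **{m: None for m in MILESTONES}}
    row["placed_date"] = ds
    return row

def record_milestone(row, milestone, ds):
    """Update (not append to) the existing row when a later step happens."""
    assert milestone in MILESTONES
    row[milestone] = ds
    return row

order = new_order_row("o1", "2024-01-01")
record_milestone(order, "paid_date", "2024-01-02")
```

Unlike a transaction fact table, the row is modified over time rather than accumulating one immutable row per event.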

In addition, there is the factless fact table, which simply records that some action occurred; the quantification of its events is non-numeric. A typical example is the access or click log.

3. Fact table design principles

  • Include, as far as possible, all facts related to the business process.
  • Select only facts relevant to the business process.
  • Decompose non-additive facts into additive components.
  • Declare the granularity before selecting dimensions and facts.
  • Do not mix facts of different granularities in the same fact table.
  • Keep the units of facts consistent.
  • Handle null values of facts; filling them with 0 is recommended.
  • Use degenerate dimensions to improve the usability of fact tables.

4. Fact table design method

  1. Select the business process and confirm the fact table type.
  2. Declare the granularity.
  3. Determine the dimensions.
  4. Determine the facts.
  5. Add redundant dimensions as needed.

5. Single-transaction and multi-transaction fact tables

A single-transaction fact table designs one fact table per business process, which facilitates analyzing each business process independently.

A multi-transaction fact table puts different facts into the same fact table; that is, the same fact table contains different business processes.

There are two ways to handle facts in a multi-transaction fact table:

  1. Facts of different business processes are stored in different fact fields; measures that do not belong to the current business process are filled with 0.
  2. Facts of different business processes are stored in the same fact field, and a column is added as a business-process label, for example recording whether the transaction was completed that day.
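The two layouts can be illustrated side by side (the field names and amounts are made up):

```python
# The same two events under the two multi-transaction layouts described above.
events = [("o1", "place_order", 100.0), ("o1", "payment", 100.0)]

# Layout 1: one fact field per business process, 0 for the others.
wide = [{"order_id": o,
         "order_amount": amt if p == "place_order" else 0.0,
         "pay_amount":   amt if p == "payment" else 0.0}
        for o, p, amt in events]

# Layout 2: a shared fact field plus a business-process label column.
tall = [{"order_id": o, "process": p, "amount": amt}
        for o, p, amt in events]
```

The wide layout produces zero-filled measure fields; the tall layout produces multiple rows per order in the same period, which matches the trade-offs discussed next.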

On choosing between the two ways:

In practice, when the measures of the business processes are similar and the differences are small, adopt the second design, using the same field to represent the measures of different processes. The drawback of this approach is that the same entity will have multiple records in the same period.

When the measures of different business processes differ greatly, choose the first design, storing the measures of different processes redundantly as different fields and setting non-current-process measures to 0. The drawback here is that the measure fields will contain many 0 values.

Whether to use a single-transaction or a multi-transaction fact table should be analyzed from the following points:

To decide whether multiple business processes belong in the same fact table, first analyze the similarity between the business processes and between their source systems. For example, the three business processes of a Taobao transaction are similar and all come from one application system, the transaction system, so they are suitable for the same transaction fact table.

A key consideration is granularity and dimensions. After determining the business processes, determine the granularity and dimensions of each. When different business processes share the same granularity and have similar dimensions, consider a multi-transaction fact table; if the granularities differ, the processes must be stored in different transaction fact tables.

If a single business process has many facts, and the facts of different business processes differ, a single-transaction fact table handles this more clearly; a multi-transaction fact table would end up with many zero-value or null fields.

A single-transaction fact table is also easier for downstream users to understand: for whichever business process they care about, they use the corresponding fact table. A multi-transaction fact table contains multiple business processes, and users are often confused when using it.

6. Periodic snapshot fact table

A transaction fact table tracks events well and supports measure analysis. But when status measures are needed, such as account balance, commodity inventory, or a seller's cumulative transaction value, the transactions related to them would have to be aggregated before those measures could be computed; this is where the periodic snapshot fact table comes in.

A periodic snapshot fact table samples the measures of an entity at determined intervals, so those measures can be studied without aggregating a long transaction history.

7. Cumulative snapshot fact table

For requirements that study the time intervals between events, transaction fact tables require complex processing logic and perform poorly; the cumulative snapshot fact table solves this well.

The status measures collected in snapshot fact tables are semi-additive: meaningful totals cannot be obtained along the time dimension.

In dimensional modeling, transaction fact tables and snapshot fact tables are often designed in pairs to complement each other and meet more downstream statistical analysis needs; in particular, snapshot fact tables can be built on top of transaction fact tables.

8. Other specifications

1. Layer invocation conventions

  1. The application layer should preferentially call public-layer data.
  2. Where middle-layer data already exists, the application layer must not bypass the middle layer and reprocess data from the ODS layer.
  3. The middle-layer team should actively understand the application layer's construction needs, sink common data into the public layer, and provide data services for other teams.
  4. Application teams should actively cooperate with the middle-layer team in transforming and migrating toward the common data layer.
  5. Excessive references to the ODS layer and unreasonable data replication and subset redundancy must be avoided.

2. Naming conventions

Table naming convention: <layer><business domain name><data domain name><business process name | custom table name><refresh period + storage policy>.

3. Development specifications

  1. In principle, do not depend on nodes outside the data team.
  2. Do not modify other people's nodes without the node owner's permission.
  3. A node's owner cannot be changed arbitrarily; the recipient must be notified and must consent.