https://zhuanlan.zhihu.com/p/59184600

ETL is the abbreviation of Extract, Transform, Load: data is extracted from the OLTP systems, data from different data sources is transformed and integrated to obtain consistent data, and the result is then loaded into the data warehouse. In short, ETL is the architecture that carries data from the OLTP systems to the OLAP system (the data warehouse).

A data warehouse (Data Warehouse, DW) is a relational database built on top of OLTP data sources according to a specific schema, designed to make multi-dimensional analysis and multi-angle presentation of the data convenient. Unlike a multidimensional (cube) database, the data in a data warehouse is detailed and integrated; the warehouse is subject-oriented and serves the OLAP system for analysis. Two common schemas are the star schema and the snowflake schema. In a star schema, a fact table sits in the middle with dimension tables around it, like a star, and only a single table per dimension joins directly to the fact table. In a snowflake schema, the fact table is also in the middle, but the dimension tables around it can have their own related sub-tables, so one dimension may span multiple tables, which the star schema does not allow. In terms of efficiency, the star schema aggregates quickly and is efficient, while the snowflake schema has a clearer structure and is easier to keep in step with the OLTP system. In real projects, a combination of star and snowflake architecture is usually used.
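As an illustration, here is a minimal star-schema sketch in SQL; the table and column names (fact_sales, dim_date, dim_product, dim_region) are hypothetical and not from the original article.

```sql
-- Hypothetical star schema: one fact table surrounded by dimension tables.
CREATE TABLE dim_date (
    date_key     INT PRIMARY KEY,      -- e.g. 20240131
    full_date    DATE,
    month_no     INT,
    year_no      INT
);

CREATE TABLE dim_product (
    product_key  INT PRIMARY KEY,      -- surrogate key
    product_id   VARCHAR(20),          -- business key from the OLTP system
    product_name VARCHAR(100)
);

CREATE TABLE dim_region (
    region_key   INT PRIMARY KEY,
    province     VARCHAR(50),
    city         VARCHAR(50),
    county       VARCHAR(50)
);

-- Fact table: measures plus foreign keys to the dimensions.
CREATE TABLE fact_sales (
    date_key     INT REFERENCES dim_date(date_key),
    product_key  INT REFERENCES dim_product(product_key),
    region_key   INT REFERENCES dim_region(region_key),
    sales_amount DECIMAL(18,2),
    sales_qty    INT
);
```

In the snowflake variant, dim_region would itself be split into linked province, city, and county tables.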

ETL’s five steps for building an enterprise-level data warehouse

1. Identify the subject

That is, determine the subject of the data analysis or front-end presentation. For example, "the sales of beer in a certain region in a certain month of a certain year" is one subject. A subject should reflect the analysis angles (dimensions) and the uniformly typed numeric data (measures) of one aspect of the business, and should be considered comprehensively when it is defined. A subject in the data warehouse corresponds to a data mart: a data mart reflects one aspect of the business, and multiple data marts together make up the data warehouse.

2. Determine the measures

After determining the subject, consider the technical indicators to be analyzed, such as annual sales. These are generally numeric values obtained by summing the data, counting occurrences, counting distinct occurrences, or taking maximum and minimum values; such data are called measures.

Measures are the indicators to be aggregated. They must be chosen appropriately in advance, and complex key performance indicators (KPIs) can then be calculated from the different measures.

3. Determine the granularity of the facts

After determining the measures, consider how they will be summarized and aggregated across the different dimensions. Because measures can be aggregated to different degrees, we apply the "minimum granularity principle": set the granularity of the measures to the finest level available. For example, suppose sales are summarized by time and the source records go down to the day, i.e. daily transaction amounts are recorded in the database. Then the ETL must not pre-summarize the data by month or year; it must keep the daily grain so that day-level analysis remains possible later. There is no need to worry about data volume or the lack of pre-summarized data, because the data will be aggregated in advance when the CUBE is built afterwards.
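For instance, with a day-grain fact table, monthly or yearly totals are produced at query (or cube-build) time rather than pre-summarized by the ETL. A minimal sketch, using the hypothetical fact_sales and dim_date tables from the star-schema example above:

```sql
-- The fact table keeps the finest grain (daily rows); the month total is derived
-- at query time instead of being pre-aggregated by the ETL.
SELECT d.year_no, d.month_no, SUM(f.sales_amount) AS monthly_sales
FROM   fact_sales f
JOIN   dim_date   d ON d.date_key = f.date_key
GROUP BY d.year_no, d.month_no;
```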

4. Determine the dimensions

Dimensions are the angles from which the data is analyzed. For example, if we want to analyze by time, by region, or by product, then time, region, and product are the corresponding dimensions. Based on different dimensions we can see the aggregation of each measure, and we can cross-analyze across all dimensions. First determine each dimension's hierarchy and levels (Figure 4: pic4.jpg). The hierarchy of a dimension is the set of all its levels, including the attributes of each level; the levels are the members under the dimension. For example, when building a region dimension, we treat the region dimension as one hierarchy whose levels are province, city, and county. Because a dimension table should contain as much information as possible, dimensions should follow the "short and fat" principle: make the dimension table as wide as possible and include all descriptive information, but not statistical data.

There is also a common situation: the parent-child dimension, which is used when non-leaf nodes also carry member data. Take an employee dimension as an example. When summing employee salaries, the department supervisor's salary cannot simply be replaced by the sum of the subordinates' salaries; the supervisor's own salary must be counted separately, and the department total equals the subordinates' salaries plus the department head's salary. In this case the employee dimension must be built as a parent-child dimension, so that the supervisor's salary is added automatically during aggregation, avoiding the situation where only leaf nodes carry data.

In addition, when building dimension tables, make full use of surrogate keys. A surrogate key is a numeric ID that uniquely identifies each dimension member, which makes members easy to distinguish. More importantly, because joins match on numeric values, aggregation joins are efficient. Surrogate keys are even more significant for slowly changing dimensions: they distinguish historical data from new data, so even when the original business primary key is the same, the surrogate key still identifies which record is new and which is historical.
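A minimal sketch of a surrogate-keyed dimension, re-stating the hypothetical product dimension from the star-schema example with the extra columns a slowly changing dimension needs (all names are illustrative):

```sql
-- The surrogate key product_key is generated inside the warehouse and is
-- independent of the business key, so the same product_id can appear in several
-- historical rows that are still uniquely identified.
CREATE TABLE dim_product (
    product_key  INT IDENTITY(1,1) PRIMARY KEY,   -- surrogate key
    product_id   VARCHAR(20) NOT NULL,            -- business key from the OLTP system
    product_name VARCHAR(100),
    category     VARCHAR(50),
    start_dt     DATE,                            -- effective period, used by the
    end_dt       DATE                             -- slowly changing dimension below
);
```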

Sometimes we also encounter slowly changing dimensions: a new product is added, a product's ID number is modified, or a product gains a new attribute. In these cases the members of a dimension change as new data arrives, so we must decide how to handle the slowly changing dimension. There are three cases (see the SQL sketch after this list):

1. Type 1 slowly changing dimension: historical data is overwritten. The new data overwrites the historical data, so we use UPDATE. For example, a product's ID number is 123; later we find the ID is wrong and it must be corrected to 456. When the corrected data is loaded, the original ID number in the dimension table is simply changed to 456. Type 1 is applied when the dimension is loaded, and the old value is completely replaced.

2. Type 2 slowly changing dimension: historical data is retained and new data is also retained. Here we must expire the original record and insert the new one, i.e. use UPDATE plus INSERT. For example, an employee was in department A in 2005 and transferred to department B in 2006. When analyzing 2005 data the employee should fall under department A; when analyzing 2006 data he should fall under department B, and newly inserted data is processed against the new department (B). One approach is to add a status column to the dimension and mark historical rows as "expired" and the current row as "current". Another is to timestamp the dimension, i.e. store the effective period of each historical row as attributes; when the dimension is matched against the source data to build the fact table, the join is made on the time period. The advantage is that the validity period of each dimension member is explicit.

3. Type 3 slowly changing dimension: a new attribute is added for dimension members. For example, a dimension member gains a new column; historical data cannot be browsed by it, but current and future data can. In this case we change the dimension table structure, i.e. add a new column, typically using a stored procedure or program to populate the new attribute, and subsequent data is then viewed by the new attribute.
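The three cases correspond to simple SQL patterns. Below is a minimal sketch, assuming the hypothetical dim_product dimension from the surrogate-key sketch above (business key product_id, start_dt/end_dt effective-date columns, open records marked with end_dt = '9999-12-31') and a hypothetical staging table stg_product; none of these names come from the original article.

```sql
-- Type 1: overwrite the historical value in place (history is lost).
UPDATE dim_product
SET    product_id = '456'
WHERE  product_id = '123';

-- Type 2: expire the current row, then insert the new version as the open row.
UPDATE dim_product
SET    end_dt = CAST(GETDATE() AS DATE)            -- close the old record
WHERE  product_id = '123'
  AND  end_dt = '9999-12-31';                      -- only the current (open) record

INSERT INTO dim_product (product_id, product_name, category, start_dt, end_dt)
SELECT product_id, product_name, category, CAST(GETDATE() AS DATE), '9999-12-31'
FROM   stg_product                                  -- hypothetical staging table
WHERE  product_id = '123';

-- Type 3: add a new attribute column; only current and future loads populate it.
ALTER TABLE dim_product ADD package_type VARCHAR(30) NULL;
GO  -- run the UPDATE in a separate batch so the new column is visible

UPDATE d
SET    d.package_type = s.package_type
FROM   dim_product d
JOIN   stg_product s ON s.product_id = d.product_id
WHERE  d.end_dt = '9999-12-31';
```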

5. Use of advanced techniques in ETL

1. Use of the preparation area

When building the data warehouse, if the data source sits on one server and the data warehouse on another, and the source server is frequently accessed, holds a large volume of data, and is constantly being updated, a preparation-area (staging) database can be set up (Figure 7: pic7.jpg). Data is first extracted into the preparation area, and all further processing is based on the data there, which prevents frequent access, heavy data operations, or sorting on the original OLTP system. For example, data can be extracted into the preparation area once a day and then transformed and integrated there, so that data from different sources is processed consistently. The data preparation area holds the raw extraction tables, the intermediate and temporary tables used during transformation, and the ETL log tables.

2. Use of timestamps

The time dimension is important for any fact subject, because different times carry different statistics, so time-stamped information plays an important role. In ETL, timestamps have a special role: in the slowly changing dimensions mentioned above, timestamps can identify dimension members; and when recording operations on the database and the data warehouse, timestamps also identify the information. For example, during data extraction we extract from the OLTP system by timestamp, such as taking the previous day's data at midnight: using the timestamp column in the OLTP system, we take the rows from GETDATE minus one day up to GETDATE, which yields the previous day's data.
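A minimal sketch of that midnight pull, assuming the source table carries a last-update timestamp column and a hypothetical staging target (table and column names are illustrative):

```sql
-- Run shortly after midnight: pull everything stamped during the previous day.
DECLARE @today     DATE = CAST(GETDATE() AS DATE);
DECLARE @yesterday DATE = DATEADD(DAY, -1, @today);

INSERT INTO stg.sales_raw (order_id, product_id, sales_amount, update_ts)
SELECT order_id, product_id, sales_amount, update_ts
FROM   src.sales
WHERE  update_ts >= @yesterday
  AND  update_ts <  @today;
```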

3. Use of log tables

When processing data, errors and error messages are inevitable, so how do we capture them and correct the data in time? The method is to use one or more log tables that record the error information. In the log table we record, for each extraction, the number of rows extracted, the number processed successfully, the number that failed, the failed data itself, the processing time, and so on. When a data error occurs, we can then easily locate the problem and correct or reprocess the erroneous data.
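A minimal sketch of such a log table, with illustrative column names:

```sql
CREATE TABLE etl_log (
    log_id         INT IDENTITY(1,1) PRIMARY KEY,
    batch_date     DATE,              -- business date of the load
    source_table   VARCHAR(100),
    rows_extracted INT,
    rows_loaded    INT,
    rows_failed    INT,
    error_detail   VARCHAR(4000),     -- description or sample of the failed data
    started_at     DATETIME,
    finished_at    DATETIME
);
```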

4. Use of scheduling

To update the data warehouse incrementally, scheduling must be used (Figure 8: pic8.jpg); that is, the fact tables are updated incrementally. Before setting up the schedule, consider the volume of the fact data and how often it needs to be updated: if the data is to be viewed by day, it is best to extract by day; if the data volume is not large, the data can be updated monthly or half-yearly. If there are slowly changing dimensions, the schedule must also take the dimension tables into account and update them before updating the fact tables.

Scheduling is a key link in the data warehouse and must be considered carefully. Once the ETL process is built, it has to run regularly, so scheduling is the key step that executes the ETL process. Besides writing data-processing information to the log table, each scheduled run should also send email or alarm messages, which helps the technicians keep track of the ETL process and improves security and data-processing accuracy.

Building a data warehouse with ETL takes just these five steps. Once we master them we can build a powerful data warehouse, yet each step deserves deep study, and in a real project everything must be considered together. For example, if the data source contains a lot of dirty data, the data must be cleaned before the warehouse is built, to eliminate unnecessary information and dirty data.

In short, ETL is the core of the data warehouse, and mastering the five ETL steps means mastering the fundamental method of building one. But we cannot be dogmatic: different projects call for their own analysis, for instance in the use of parent-child dimensions and slowly changing dimensions. In data warehouse construction, ETL determines the data quality of the whole project, so it must be given an important position and a solid ETL foundation must be laid.

The difference and connection between ETL and SQL

Comparing ETL and SQL purely on execution, SQL is certainly much more efficient, but each side has its own advantages. Start with ETL: ETL is mainly used to build data warehouses. It leans toward data cleansing, integrating data from multiple sources, capturing increments, and transforming data for loading into the warehouse. For example, suppose I have two data sources, one a database table and the other Excel data, and I need to merge them; this is usually hard to do with SQL statements alone, but ETL tools have plenty of ready-made components and drivers, and a few components are enough. There are also cases such as cross-server data sources where no connection can be established between the servers. For instance, our company's system is split into phase I and phase II with different databases and different data structures, and the databases cannot connect to each other; in such cases ETL is especially important and prominent, since fixed extraction, transformation, and loading into the data warehouse can be implemented easily.

What about SQL? SQL is in essence just a fixed scripting language, but it executes efficiently and fast. However, it is not very flexible, and it is difficult to consolidate data across servers. So SQL is better suited to a wide range of queries and data changes inside a fixed database. Because the script can be written freely, the functionality achievable inside that fixed database is quite powerful, unlike ETL, where functionality is limited by the components: whatever the components can do is what can be achieved.

So it is clear when to use ETL and when to use SQL: when multiple data sources must be integrated to build a data warehouse for data analysis, use ETL; when working within the fixed data layers of a single database, use SQL. Of course, ETL itself is also inseparable from SQL.

Introduction to ETL Algorithms and Tools:

1. There are three main mainstream ETL tools: Ascential’s DataStage, Informatica’s PowerCenter, and NCR Teradata’s ETL Automation. There are also open-source tools such as PDI (Kettle).

2. ETL is the foundation of the DW system

The DW system is built on data generated by actual business events; it produces very little data of its own.

An enterprise often contains multiple business systems, all of which may become DW data sources.

The quality of business system data is uneven, and we must learn to discard the false and keep the true.

Business system data is complex and needs to be integrated into the data model.

The relationships among source data are also complicated; when source data is processed into the DW system, some of it must follow a certain sequence.

3. Classification of source data

Flow (event) table: this type of source table records transactions and other actions as they occur. In the source system rows are inserted; most are never modified or deleted, although a small number of such tables do see deletions. Example: a time-deposit register.

General status table: this type of source table records the status of data. In the source system rows are inserted and modified, and deletions may also occur. Example: a customer information table.

Code/parameter table: this type of source table records the codes and parameters used in the source system.

4. Types of data files

Most data files are loaded from the source system into the data warehouse on a fixed cycle, usually one day. Data files come as increments, full extracts, and increments with deletions.

Incremental (delta) data file: the file contains the incremental information of a table, i.e. the records inserted into or modified in the table.

Full data file: the file contains the full contents of a table, i.e. all of its records.

Increment with deletions: the file contains the incremental information of a table, including the records inserted, modified, and deleted; deleted records are usually flagged with the field DEL_IND=’D’.

5. ETL standard algorithms can be divided into: the history zipper algorithm, the append algorithm (event tables), the upsert algorithm (master tables), and the full-delete-full-insert algorithm (parameter tables).

6. ETL standard algorithm selection

History zipper: business analysis requires that data changes be recorded and that a continuous, date-based historical trajectory be kept.

Append (event table): business analysis requires that data changes be recorded, but no continuous, date-based historical trajectory is needed.

Upsert (master table): business analysis does not require data changes to be recorded, and current data has an impact on historical data.

Full-delete-full-insert (parameter table): business analysis does not require data changes to be recorded, and current data has no impact on historical data.

7. The "zipper" in the history zipper method means keeping history: all the change information of a record, from its beginning to its current state, is recorded (each record gains an extra start date and end date).

8. The append algorithm is generally used for event tables; the events are relatively independent of one another, and historical information is never updated.

9. The upsert algorithm is a combination of update and insert. It is generally used for tables whose history of changes does not need to be tracked and retained, that only keep their latest state, and whose data volume is of a certain scale, such as customer information tables.

10. The full-delete-full-insert algorithm is generally used for parameter tables with a small amount of data: all historical data is deleted, and the table is then reloaded in full.
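A minimal sketch of the full-delete-full-insert pattern for a small parameter table (schema and table names are hypothetical):

```sql
-- Small parameter table: wipe it and reload today's full extract.
DELETE FROM dw.dim_param_code;                     -- or TRUNCATE TABLE

INSERT INTO dw.dim_param_code (code, code_name, code_type)
SELECT code, code_name, code_type
FROM   stg.param_code_full;                        -- today's full data file
```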

11. Processing complexity, from high to low: history zipper, upsert, append, full-delete-full-insert. Loading performance, from high to low: full-delete-full-insert, append, upsert, history zipper.

12. The main algorithms of the near-source model layer are the APPEND algorithm, the conventional zipper algorithm, and the full-data zipper algorithm with deletions.

13. The main algorithms of the integrated model layer are the APPEND algorithm, the MERGE algorithm, the conventional zipper algorithm, the delete zipper algorithm based on incremental data, the delete zipper algorithm based on full data, the economical conventional zipper algorithm, the economical delete zipper algorithm based on incremental data, the economical delete zipper algorithm based on full data, the PK_NOT_IN_APPEND algorithm, and the source-date-field self-zipper algorithm.

14. Data flow algorithm from the technical buffer layer to the near-source model layer – APPEND algorithm

This algorithm is usually used for flow (event) tables. Source tables suited to it are never updated or deleted in the source system; rows are only ever inserted. So each day we only need to take the latest data whose transaction date is the current day and append it directly to the target table. In the near-source model layer the fields of such tables are essentially identical to those in the technical buffer layer and the source system table; no extra physical processing fields are added, and the table is queried in the same way as the source system table.
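A minimal sketch of the append step, assuming a transaction-date column txn_dt on a hypothetical buffer table (all names are illustrative):

```sql
-- Append today's transactions from the technical buffer to the near-source table.
INSERT INTO nsrc.deposit_register (acct_no, txn_dt, txn_amt, txn_type)
SELECT acct_no, txn_dt, txn_amt, txn_type
FROM   buf.deposit_register
WHERE  txn_dt = CAST(GETDATE() AS DATE);           -- only the current day's rows
```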

15. Data flow algorithm from the technical buffer layer to the near-source model layer – conventional zipper algorithm

This algorithm is usually used for general status tables with no delete operations. Source tables suited to it are inserted into and modified, but never deleted, in the source system. So each day we take the latest end-of-day data (an incremental or a full extract), first identify the true incremental data (new and modified records), use it to close the chain on the open-chain records (valid records) in the target table whose attributes have changed (i.e. set END_DT to the current business date), and then insert the latest incremental data into the target table as open-chain records.

Compared with the technical buffer layer and the corresponding source system table, such tables in the near-source model layer add two extra physical processing fields, START_DT (start date) and END_DT (end date). When querying, first choose a perspective (as-of) date and constrain on START_DT and END_DT, i.e. START_DT <= 'perspective date' AND END_DT > 'perspective date'.
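A minimal sketch of the two zipper steps (close the chain, then insert the open chain), assuming an open-chain END_DT of '9999-12-31' and a hypothetical staging table holding the day's true increments (all names are illustrative):

```sql
DECLARE @biz_dt DATE = CAST(GETDATE() AS DATE);    -- current business date

-- Step 1: close the chain on open records whose attributes have changed.
UPDATE t
SET    t.END_DT = @biz_dt
FROM   nsrc.customer_status t
JOIN   stg.customer_inc     i ON i.cust_id = t.cust_id
WHERE  t.END_DT = '9999-12-31';                    -- open-chain (valid) records only

-- Step 2: insert the day's true increments as new open-chain records.
INSERT INTO nsrc.customer_status (cust_id, cust_name, status, START_DT, END_DT)
SELECT i.cust_id, i.cust_name, i.status, @biz_dt, '9999-12-31'
FROM   stg.customer_inc i;
```

An as-of query then reads the snapshot with START_DT <= 'perspective date' AND END_DT > 'perspective date', as described above.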

16. Data flow algorithm from the technical buffer layer to the near-source model layer – full-data zipper algorithm with deletions

This algorithm is usually used for general status tables that do have delete operations, and it requires a full data file in order to derive the deletion increment by comparison. Source tables suited to it are inserted into, modified, and deleted in the source system. Each day we take the latest end-of-day full data, work out the true incremental data (new and modified records) and the deletion increment separately, use them to close the chain on the open-chain records (valid records) in the target table whose attributes have changed (i.e. set END_DT to the current business date), and then insert the true increments and the deleted records from the latest data into the target table as open-chain records, with the deletion flag DEL_IND set to 'D' on deleted records.

Compared with the technical buffer layer, such tables in the near-source model layer add three extra physical processing fields: START_DT (start date), END_DT (end date), and DEL_IND (deletion flag). They are used in two ways. For ordinary queries, first choose a perspective date and constrain on START_DT and END_DT, i.e. START_DT <= 'perspective date' AND END_DT > 'perspective date', and add the condition DEL_IND <> 'D'. To download or obtain the daily incremental data, only the condition START_DT <= 'perspective date' AND END_DT > 'perspective date' is needed, without the DEL_IND <> 'D' condition.

17. Data flow algorithm from the near-source model layer to the integrated model layer – APPEND algorithm

This algorithm is usually used for flow (event) tables. Source tables suited to it are never updated or deleted in the source system; rows are only ever inserted, so each day we only need to take the latest data whose transaction date is the current day and append it directly to the target table.

Usually a temporary table named VT_NEW_<number> is created; the latest data of each source group for the day is transformed and loaded into VT_NEW_<number> and then appended to the final target table.

18. Data flow algorithm from the near-source model layer to the integrated model layer – MERGE INTO algorithm

This algorithm is usually used for general status tables with no delete operations that do not need to keep history and only keep the current latest state. Source tables suited to it are inserted into and modified, but never deleted, in the source system, so each day we take the latest end-of-day data (incremental or full) and MERGE INTO or UPSERT it into the target table. For efficiency, and to identify the true increment, the true incremental data (new and modified records) is usually identified first, and only those records are merged into the target table.

Usually two temporary tables are created. One, named VT_NEW_<number>, receives the transformed latest data of each source group for the day; the other, named VT_INC_<number>, is produced by comparing VT_NEW_<number> against yesterday's data in the target table to find the true increments (new and modified records), and VT_INC_<number> is then MERGEd INTO or UPSERTed into the final target table.
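A minimal sketch of the final merge step, assuming the VT_INC temporary table already contains only the true increments (table and column names are illustrative):

```sql
-- Merge the day's true increments into the latest-state target table.
MERGE INTO itg.customer AS tgt
USING VT_INC_001        AS src
    ON tgt.cust_id = src.cust_id
WHEN MATCHED THEN
    UPDATE SET tgt.cust_name = src.cust_name,
               tgt.status    = src.status
WHEN NOT MATCHED THEN
    INSERT (cust_id, cust_name, status)
    VALUES (src.cust_id, src.cust_name, src.status);
```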

19. Data flow algorithm from the near-source model layer to the integrated model layer – conventional zipper algorithm

This algorithm is usually used for general status tables with no delete operations. Source tables suited to it are inserted into and modified, but never deleted, in the source system, so each day we take the latest end-of-day data (incremental or full), first find the true incremental data (new and modified records), use it to close the chain on the open-chain records (valid records) in the target table whose attributes have changed (i.e. set END_DT to the current business date), and then insert the latest incremental data into the target table as open-chain records.

Usually two temporary tables are created. One, named VT_NEW_<number>, receives the transformed latest data of each source group for the day; the other, named VT_INC_<number>, is produced by comparing VT_NEW_<number> against yesterday's data in the target table to find the true increments (new and modified records). Open-chain records in the final target table whose PK appears in VT_INC_<number> are then chain-closed, and all records in VT_INC_<number> are inserted into the final target table as open-chain records.

20. Data flow algorithm from the near-source model layer to the integrated model layer – delete zipper algorithm based on incremental data

This algorithm is typically used for general status tables with delete operations, and requires that deleted data be provided as a deletion increment flagged with DEL_IND='D'. Each day, besides taking the latest end-of-day data (incremental or full), we also need the data deleted that day. Using the true incremental data (new and modified records) and the deletion increment, we close the chain on the open-chain records (valid records) in the target table whose attributes have changed (i.e. set END_DT to the current business date), and then insert the increment (excluding deleted data) into the target table as open-chain records.

Usually three temporary tables are created. The first, named VT_NEW_<number>, receives the transformed latest data of each source group for the day (excluding deleted data). The second, named VT_INC_<number>, is produced by comparing VT_NEW_<number> against yesterday's data in the target table to find the true increments. The third, named VT_DEL_<number>, receives the transformed deletion increment. Finally, open-chain records in the final target table whose PK appears in VT_INC_<number> or VT_DEL_<number> are chain-closed, and all records in VT_INC_<number> are inserted into the final target table as open-chain records.

21. Data flow algorithm from the near-source model layer to the integrated model layer – delete zipper algorithm based on full data

This algorithm is usually used for general status tables with delete operations, and requires full data in order to derive the deletion increment by comparison. Source tables suited to it are inserted into, modified, and deleted in the source system. Each day we take the latest end-of-day full data, work out the true incremental data (new and modified records) and the deletion increment separately, use them to close the chain on the open-chain records (valid records) in the target table whose attributes have changed (i.e. set END_DT to the current business date), and then insert the true increments from the latest data (excluding deleted data) into the target table as open-chain records.

Usually two temporary tables are created. One, named VT_NEW_<number>, receives the transformed latest full data of each source group for the day; the other, named VT_INC_<number>, is produced by comparing VT_NEW_<number> against yesterday's data in the target table to find the true increments (new and modified records) and the deletion increment. Note that END_DT of the deletion-increment records is set to the minimum date (a borrowed marker). Finally, open-chain records in the final target table whose PK appears in VT_INC_<number> are chain-closed, and all records in VT_INC_<number> whose END_DT is not equal to the minimum date (i.e. the non-deleted records) are inserted into the final target table as open-chain records.

22. Data flow algorithm from the near-source model layer to the integrated model layer – economical conventional zipper algorithm

This algorithm is basically the same as the conventional zipper algorithm, except that in the last step only records with non-empty, non-zero attributes are inserted into the target table as open-chain records.

23. Data flow algorithm from the near-source model layer to the integrated model layer – economical delete zipper algorithm based on incremental data

This algorithm is basically the same as the delete zipper algorithm based on incremental data, except that in the last step only records with non-empty, non-zero attributes are inserted into the target table as open-chain records.

24. Data flow algorithm from the near-source model layer to the integrated model layer – economical delete zipper algorithm based on full data

This algorithm is basically the same as the delete zipper algorithm based on full data, except that in the last step only records with non-empty, non-zero attributes are inserted into the target table as open-chain records.

25. Data flow algorithm from the near-source model layer to the integrated model layer – PK_NOT_IN_APPEND algorithm

For each source group, this algorithm inserts into the VT_NEW_<number> table only the records whose PK does not already appear in VT_NEW_<number>, and finally inserts into the target table only the records whose PK does not appear in the target table, ensuring that only data whose PK has never been loaded is entered.
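A minimal sketch of the final anti-join insert, with illustrative names:

```sql
-- Insert only records whose primary key has never been loaded into the target.
INSERT INTO itg.customer_base (cust_id, cust_name, open_dt)
SELECT v.cust_id, v.cust_name, v.open_dt
FROM   VT_NEW_001 v
WHERE  NOT EXISTS (
           SELECT 1
           FROM   itg.customer_base t
           WHERE  t.cust_id = v.cust_id
       );
```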

26. Data flow algorithm from the near-source model layer to the integrated model layer – source-date-field self-zipper algorithm

This algorithm uses a date field in the source table to mark the effective date of each record. By sorting the records that share a primary key by this effective date and joining adjacent rows, a natural zipper (start-date/end-date chain) is formed.