5. The use of advanced techniques in ETL
1. Use of the preparation area
When building a data warehouse, the data source often sits on one server while the data warehouse sits on another. Because the source server is accessed frequently, holds a large volume of data, and is constantly being updated, it is worth establishing a separate preparation-area (staging) database (Figure 7). Data is first extracted into the preparation area and all further processing is done against it, which prevents frequent access, data manipulation, and sorting from burdening the original OLTP system. For example, we can extract data into the preparation area once a day and then, working on the preparation area, transform and integrate it so that data from different sources is handled consistently. The preparation area typically holds the raw extraction tables, some intermediate and temporary tables used during transformation, and the ETL log tables.
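As a minimal sketch of what the preparation area might contain (the schema, table, and column names here are hypothetical, not from the original text), a raw extraction table and a transformation intermediate table could be defined like this; the ETL log table is covered in section 3 below:

```sql
-- Raw extract table: rows pulled from the OLTP source, untouched apart from a load timestamp.
CREATE TABLE stg.Orders_Extract (
    OrderID     INT           NOT NULL,
    CustomerID  INT           NOT NULL,
    Amount      DECIMAL(18,2) NULL,
    OrderTime   DATETIME      NOT NULL,
    ExtractTime DATETIME      NOT NULL DEFAULT (GETDATE())
);

-- Intermediate table: the same data after cleansing and standardizing, ready to be
-- integrated consistently with the other data sources.
CREATE TABLE stg.Orders_Clean (
    OrderID     INT           NOT NULL,
    CustomerID  INT           NOT NULL,
    Amount      DECIMAL(18,2) NOT NULL,
    OrderDate   DATE          NOT NULL
);
```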
2. Use of timestamps
The time dimension is important for any fact subject: different points in time yield different statistics, so the time recorded with each row plays an important role. Timestamps have a special role in ETL. In the slowly changing dimensions mentioned earlier, timestamps can be used to version dimension members; we also use timestamps when recording operations on the database and the data warehouse. For example, when extracting data from the OLTP system we filter on a timestamp: a job that runs at midnight takes the previous day's data, i.e. the rows whose timestamp lies between GETDATE() minus one day and GETDATE().
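For illustration, a job that runs at midnight and extracts the previous day's rows by timestamp might look roughly like this (the table and column names are assumptions, not part of the original text):

```sql
-- Runs at 00:00: take every source row whose timestamp falls within the previous
-- calendar day, i.e. from GETDATE() minus one day up to (but not including) today.
DECLARE @today     DATE = CAST(GETDATE() AS DATE);
DECLARE @yesterday DATE = DATEADD(DAY, -1, @today);

INSERT INTO stg.Orders_Extract (OrderID, CustomerID, Amount, OrderTime, ExtractTime)
SELECT OrderID, CustomerID, Amount, OrderTime, GETDATE()
FROM   dbo.Orders                 -- OLTP source table (hypothetical)
WHERE  OrderTime >= @yesterday
AND    OrderTime <  @today;
```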
3. Use of log tables
When processing data it is inevitable that errors will occur and error messages will be produced, so how do we capture them and correct the data in time? The method is to use one or more log tables in which the error information is recorded. For each extraction the log table records the number of rows extracted, the number processed successfully, the number that failed, the failing data itself, the processing time, and so on. When an error occurs we can then easily locate the problem and correct or reprocess the faulty data.
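A minimal log table and the row written after one extraction run might look like this (names and columns are illustrative only):

```sql
-- One row per ETL run: rows extracted, rows loaded, rows failed, plus an error
-- message and the processing time, so faulty data can be found and reprocessed.
CREATE TABLE etl.Etl_Log (
    LogID        INT IDENTITY(1,1) PRIMARY KEY,
    PackageName  VARCHAR(100)  NOT NULL,
    ExtractCount INT           NOT NULL,
    SuccessCount INT           NOT NULL,
    FailCount    INT           NOT NULL,
    ErrorMessage VARCHAR(4000) NULL,
    ProcessTime  DATETIME      NOT NULL DEFAULT (GETDATE())
);

INSERT INTO etl.Etl_Log (PackageName, ExtractCount, SuccessCount, FailCount, ErrorMessage)
VALUES ('Load_FactSales', 10000, 9990, 10, '10 rows rejected: CustomerID missing in DimCustomer');
```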
4. Use of scheduling
To update the data warehouse incrementally you must use scheduling (Figure 8); that is, the fact table is updated incrementally. Before setting up the schedule, consider the volume of the fact data and how often it needs updating. If the data is to be viewed by day, it is best to extract by day; if the volume is small, updating once a month or once every half year may be enough. If slowly changing dimensions are involved, the schedule must also take the dimension tables into account and update them before updating the fact table.
Scheduling is a key link in the data warehouse and must be considered carefully: once the ETL process is built it has to run regularly, so scheduling is the step that executes it. Besides writing the data-processing information to the log table, each scheduled run should also send an e-mail or raise an alert, which helps the technical staff keep track of the ETL process and improves both safety and data-processing accuracy (a sketch follows).
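As a rough, hypothetical sketch of such a scheduled run (the procedure names are invented; the log table is the one from section 3), the job refreshes the dimensions first, then the fact table, and logs or alerts either way:

```sql
-- Nightly job body: dimensions before facts, with the outcome written to the log.
BEGIN TRY
    EXEC etl.Load_DimCustomer;    -- refresh slowly changing dimensions first
    EXEC etl.Load_DimProduct;
    EXEC etl.Load_FactSales;      -- then append the incremental fact rows

    INSERT INTO etl.Etl_Log (PackageName, ExtractCount, SuccessCount, FailCount, ErrorMessage)
    VALUES ('Nightly_Load', 0, 0, 0, NULL);   -- real counts would be returned by the procedures
END TRY
BEGIN CATCH
    INSERT INTO etl.Etl_Log (PackageName, ExtractCount, SuccessCount, FailCount, ErrorMessage)
    VALUES ('Nightly_Load', 0, 0, 0, ERROR_MESSAGE());
    -- an e-mail or alert would be raised here, e.g. via msdb.dbo.sp_send_dbmail
END CATCH;
```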
Building a data warehouse with ETL takes just five simple steps, and mastering these five steps lets us build a powerful data warehouse. Each step, however, deserves deep study, especially in a real project, where everything must be weighed together. For example, if the data source contains a lot of dirty data, we must first clean it and remove the unnecessary information and dirty data before building the data warehouse.
In short, ETL is the core of the data warehouse, and mastering the five ETL steps for building one means mastering the fundamental method of building a data warehouse. However, we cannot be dogmatic: different projects call for specific analysis, for example in the use of parent-child dimensions and slowly changing dimensions. In data warehouse construction, ETL determines the data quality of the entire project, so it must be given an important place and a solid ETL foundation must be laid.
ETL and SQL
Some people say, "If ETL and SQL are compared, SQL must be much more efficient." But each has its own advantages. Let's start with ETL. ETL is mainly used for building a data warehouse; it leans toward data cleansing, integrating data from multiple sources, obtaining increments, and transforming data before loading it into the warehouse. For example, suppose I have two data sources, one a database table and the other an Excel file, and I need to merge them. This is usually difficult to do with SQL statements alone, but ETL tools have plenty of ready-made components and drivers, and a few components are enough. There are also cross-server data sources where no connection can be established between the servers. Our company's system, for instance, is split into phase I and phase II, with different databases and different data structures, and no link can be set up between them. In such cases ETL is particularly important and prominent: with fixed extraction, transformation, and loading it can easily bring the data into the warehouse.
What about SQL? SQL is really just a fixed scripting language, but it is efficient and executes quickly. It is not very flexible, though, and consolidating data across servers with it is difficult. SQL is therefore better suited to large-scale queries and data changes within a single, fixed database: because the script can be written freely, the functionality achievable inside that database is quite powerful. ETL, by contrast, is limited by its components; what the components can do determines what can be achieved.
So it is obvious when to use ETL and when to use SQL: when we need to integrate multiple data sources to build a data warehouse and perform data analysis, we use ETL; when we are working within the fixed data of a single database, we use SQL. Of course, ETL is also inseparable from SQL.
1. There are three main mainstream ETL tools, namely Ascential's DataStage, Informatica's PowerCenter, and NCR Teradata's ETL tool. There are also open-source tools such as PDI (Kettle).
2. ETL is the basis of the DW system
The DW system is built on data from actual business events; it generates little data of its own.
An enterprise often contains multiple business systems, all of which may become DW data sources.
The quality of business system data is uneven, so we must learn to discard the false and keep the true.
Business system data is complex and needs to be integrated into the data model.
The relationships between source data are also complicated, and when source data is processed into the DW system, some of it must be loaded in a certain order.
3. Classification of source data
Flow (event) table: this type of source table records transactions and other actions as they occur. In the source system rows are only added; most are never modified or deleted, though a small number of such tables do see deletes. Example: a time deposit register;
General status table: this type of source table records the status of data entities. In the source system rows may be added and modified, and deletes may also occur. Example: a customer information table;
Code parameter table: This type of source table is used to record the data codes and parameters used in the source system;
4. Types of data files
Most data files are loaded from the source system into the data warehouse on a fixed cycle, usually one day. Data files come in three kinds: delta, full, and delta with deletions.
Delta data file: the content of the file is the incremental information of the data table, i.e. the records added or modified in the table.
Full data file: the content of the file is the full information of the data table, i.e. all the data in the table.
Delta with deletions: the content of the file is the incremental information of the data table, including the records added, modified, and deleted in the table; deleted records are usually identified by the field DEL_IND = 'D'.
5. ETL standard algorithms
These can be divided into: the historical zipper algorithm, the append algorithm (event tables), the upsert algorithm (main tables), and the full-delete and full-add algorithm (parameter tables);
6. ETL standard algorithm selection
History zipper: business analysis requires that data changes be recorded and that a continuous, date-based historical trajectory be kept;
Append (event table): business analysis requires that data changes be recorded, but no continuous, date-based historical trajectory is needed;
Upsert (main table): business analysis does not require data changes to be recorded, but the current data has an impact on historical data;
Full-delete and full-add (parameter table): business analysis does not require data changes to be recorded, and the current data has no impact on historical data;
7. The "zipper" in the historical zipper method means recording history: all the change information of a record, from its first appearance to its current state, is kept, with a start date and an end date added to each row (see the sketch below);
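A minimal sketch of a zippered table and the "as of a date" query it supports (the table, columns, and far-future open-chain date are assumed conventions, not from the original):

```sql
-- Each change closes the current row (END_DT filled in) and opens a new one,
-- so the complete history is preserved. Open rows carry a far-future END_DT.
CREATE TABLE dw.Customer_His (
    CustomerID INT          NOT NULL,
    CustName   VARCHAR(100) NOT NULL,
    CustStatus VARCHAR(20)  NOT NULL,
    START_DT   DATE         NOT NULL,   -- date this version became effective
    END_DT     DATE         NOT NULL    -- date superseded ('3000-12-31' = still open)
);

-- State of every customer as of a chosen viewing date:
SELECT *
FROM   dw.Customer_His
WHERE  START_DT <= '2024-06-30'
AND    END_DT   >  '2024-06-30';
```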
8. The append algorithm is generally used for event tables; the events are relatively independent of each other and historical information is never updated;
9. The upsert algorithm is a combination of update and insert. It is generally used for tables that do not need to track and retain the history of changes, only their latest state, and whose data volume is of a certain scale, such as customer information tables;
10. The full-delete and full-add algorithm is generally used for parameter tables with a small amount of data: all the existing data is deleted and the table is then reloaded in full (see the sketch below);
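A minimal sketch of the full-delete and full-add pattern for a small parameter table (names are hypothetical):

```sql
-- Wipe the target completely, then reload the full current snapshot.
TRUNCATE TABLE dw.Dim_ProductCode;

INSERT INTO dw.Dim_ProductCode (ProductCode, ProductDesc)
SELECT ProductCode, ProductDesc
FROM   stg.ProductCode_Extract;   -- today's full parameter extract (hypothetical)
```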
11. Processing complexity, from high to low: history zipper, upsert, append, full-delete and full-add. Loading performance, from high to low: full-delete and full-add, append, upsert, history zipper;
12. The main algorithms of the near-source model layer are the APPEND algorithm, the conventional zipper algorithm, and the delete zipper algorithm based on full data;
13. The integration model layer algorithms are the APPEND algorithm, the MERGE INTO algorithm, the conventional zipper algorithm, the delete zipper algorithm based on incremental data, the delete zipper algorithm based on full data, the economical conventional zipper algorithm, the economical delete zipper algorithm based on incremental data, the economical delete zipper algorithm based on full data, the PK_NOT_IN_APPEND algorithm, and the source date field self-zipper algorithm;
14. Data flow algorithm from the technical buffer layer to the near-source model layer – APPEND algorithm
This algorithm is usually used for flow (event) tables. Source tables suited to it are never updated or deleted in the source system; rows are only inserted. Therefore, each day we only need to take the latest data whose transaction date is the current day and append it directly to the target table (as sketched below). In the near-source model layer the fields of such tables are essentially identical to those in the technical buffer layer and the source system table; no additional physical processing fields are added, and the table is queried the same way as the source system table;
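A minimal sketch of this APPEND step, with hypothetical buffer-layer and near-source table names:

```sql
DECLARE @load_date DATE = CAST(GETDATE() AS DATE);   -- current business date

-- Append only the rows whose transaction date is the load date;
-- existing rows in the target are never updated or deleted.
INSERT INTO nsrc.Deposit_Register (AcctNo, TxnAmt, TxnDate)
SELECT AcctNo, TxnAmt, TxnDate
FROM   buf.Deposit_Register                          -- technical buffer layer table
WHERE  TxnDate = @load_date;
```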
15. Data flow algorithm from the technical buffer layer to the near-source model layer – conventional zipper algorithm
This algorithm is usually used for general status tables without delete operations. Source tables suited to it see inserts and modifications, but no deletes, in the source system. Each day we take the latest end-of-day data (incremental or full), first identify the true incremental data (new and modified records), use it to close the chain on the open-chain rows (valid rows) in the target table whose attributes have changed (that is, set their END_DT to the current business date), and then insert the latest incremental data into the target table as open-chain rows (a sketch follows).
Compared with the technical buffer layer and the corresponding source system table, such tables in the near-source model layer carry two additional physical processing fields: START_DT (start date) and END_DT (end date). When querying, first choose a viewing date and filter on START_DT and END_DT, i.e. START_DT <= 'viewing date' AND END_DT > 'viewing date';
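A sketch of the two chain operations, assuming a hypothetical customer table and that buf.Customer_Inc already holds today's true increment (new and modified rows); the far-future open-chain date is an assumed convention:

```sql
DECLARE @biz_date DATE = CAST(GETDATE() AS DATE);    -- current business date
DECLARE @max_date DATE = '3000-12-31';               -- marks an open-chain (valid) row

-- Step 1: close the chain on open rows whose key appears in today's increment.
UPDATE tgt
SET    tgt.END_DT = @biz_date
FROM   nsrc.Customer_His AS tgt
JOIN   buf.Customer_Inc  AS inc
       ON tgt.CustomerID = inc.CustomerID
WHERE  tgt.END_DT = @max_date;

-- Step 2: insert today's increment as new open-chain rows.
INSERT INTO nsrc.Customer_His (CustomerID, CustName, CustStatus, START_DT, END_DT)
SELECT CustomerID, CustName, CustStatus, @biz_date, @max_date
FROM   buf.Customer_Inc;
```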
16. Data flow algorithm from the technical buffer layer to the near-source model layer – delete zipper algorithm based on full data
This algorithm is usually used for general status tables with delete operations, and it requires a full data file so that deletions can be detected by comparison. Source tables suited to it see inserts, modifications, and deletes in the source system. Each day we take the latest end-of-day full data, find the true incremental data (new and modified records) and the deleted records separately, use them to close the chain on the open-chain rows (valid rows) in the target table whose attributes have changed (that is, set their END_DT to the current business date), and then insert the true increments and the deleted records from the latest data into the target table as open-chain rows; note that deleted records carry the deletion flag DEL_IND = 'D'.
Compared with the technical buffer layer, such tables in the near-source model layer carry three additional physical processing fields: START_DT (start date), END_DT (end date), and DEL_IND (deletion flag). They are used in two ways. For ordinary queries, first choose a viewing date and filter on START_DT and END_DT, i.e. START_DT <= 'viewing date' AND END_DT > 'viewing date', and add the condition DEL_IND <> 'D'. To download or obtain the daily incremental data, apply only the START_DT <= 'viewing date' AND END_DT > 'viewing date' condition, without the DEL_IND <> 'D' condition. A sketch of the deletion handling follows.
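The deletion side of this algorithm differs from the previous sketch mainly in how deleted keys are found; a minimal illustration, reusing the same hypothetical table (now with a DEL_IND column) and comparing against today's full extract:

```sql
DECLARE @biz_date DATE = CAST(GETDATE() AS DATE);
DECLARE @max_date DATE = '3000-12-31';

-- Keys that are open in the target but absent from today's full extract were
-- deleted in the source: close their current chain...
UPDATE tgt
SET    tgt.END_DT = @biz_date
FROM   nsrc.Customer_His AS tgt
WHERE  tgt.END_DT = @max_date
AND    NOT EXISTS (SELECT 1
                   FROM   buf.Customer_Full AS f    -- today's full extract (hypothetical)
                   WHERE  f.CustomerID = tgt.CustomerID);

-- ...then record the deletion as a new open-chain row flagged DEL_IND = 'D'.
INSERT INTO nsrc.Customer_His (CustomerID, CustName, CustStatus, START_DT, END_DT, DEL_IND)
SELECT h.CustomerID, h.CustName, h.CustStatus, @biz_date, @max_date, 'D'
FROM   nsrc.Customer_His AS h
WHERE  h.END_DT = @biz_date                         -- rows closed just above
AND    NOT EXISTS (SELECT 1
                   FROM   buf.Customer_Full AS f
                   WHERE  f.CustomerID = h.CustomerID);
```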
17. Data flow algorithm from the near-source model layer to the integration model layer – APPEND algorithm
This algorithm is usually used for flow (event) tables. Source tables suited to it are never updated or deleted in the source system; rows are only inserted. Therefore, each day you only need to take the latest data whose transaction date is the current day and append it directly to the target table.
Usually a temporary table named VT_NEW_<number> is created; the latest data of each group for the day is transformed and loaded into VT_NEW_<number> and then appended from there to the final target table;
18. Data flow algorithm from the near-source model layer to the integration model layer – MERGE INTO algorithm
This algorithm is usually used for general status tables without delete operations that do not need to keep history, only the current latest state. Source tables suited to it see inserts and modifications, but no deletes, in the source system, so each day you obtain the latest end-of-day data (incremental or full) and MERGE INTO or UPSERT it into the target table. For efficiency, and to identify the true increment, the true incremental data (new and modified records) is usually found first, and only that data is merged into the target table.
Usually two temporary tables are created. One, named VT_NEW_<number>, receives the transformed latest data of each group for the day; the other, named VT_INC_<number>, receives the true incremental data (new and modified records) found by comparing VT_NEW_<number> with yesterday's data in the target table. VT_INC_<number> is then MERGEd INTO or UPSERTed into the final target table (a sketch follows).
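A sketch of the final MERGE step, assuming VT_INC_001 (the VT_INC_<number> of the text) already holds the true increment, and using hypothetical target and column names:

```sql
-- Upsert the true increment into the current-state target: update matched keys,
-- insert new ones. History is not kept by this algorithm.
MERGE INTO int_lyr.Customer AS tgt
USING VT_INC_001 AS inc
      ON tgt.CustomerID = inc.CustomerID
WHEN MATCHED THEN
    UPDATE SET tgt.CustName   = inc.CustName,
               tgt.CustStatus = inc.CustStatus
WHEN NOT MATCHED THEN
    INSERT (CustomerID, CustName, CustStatus)
    VALUES (inc.CustomerID, inc.CustName, inc.CustStatus);
```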
19. Data flow algorithm from the near-source model layer to the integration model layer – conventional zipper algorithm
This algorithm is usually used for general status tables without delete operations. Source tables suited to it see inserts and modifications, but no deletes, in the source system, so each day you take the latest end-of-day data (incremental or full), first find the true incremental data (new and modified records), use it to close the chain on the open-chain rows (valid rows) in the target table whose attributes have changed (that is, set their END_DT to the current business date), and then insert the latest incremental data into the target table as open-chain rows;
Usually two temporary tables are created. One, named VT_NEW_<number>, receives the transformed latest data of each group for the day; the other, named VT_INC_<number>, receives the true incremental data (new and modified records) found by comparing VT_NEW_<number> with yesterday's data in the target table. Open-chain rows in the final target table whose PK appears in VT_INC_<number> are then chain-closed, and all rows in VT_INC_<number> are inserted into the final target table as open-chain rows.
20. Data flow algorithm from the near-source model layer to the integration model layer – delete zipper algorithm based on incremental data
This algorithm is typically used for general status tables with delete operations and requires that deleted data be supplied as a delete increment flagged with DEL_IND = 'D'. In addition to obtaining the latest end-of-day data (incremental or full) each day, you must also obtain the data deleted that day. Using the true incremental data (new and modified records) and the delete increment, close the chain on the open-chain rows (valid rows) in the target table whose attributes have changed (that is, set their END_DT to the current business date), and then insert the increment (excluding the deleted data) into the target table as open-chain rows;
Usually three temporary tables are created. The first, named VT_NEW_<number>, receives the transformed latest data of each group for the day (excluding deleted data); the second, named VT_INC_<number>, receives the true incremental data found by comparing VT_NEW_<number> with yesterday's data in the target table; the third, named VT_DEL_<number>, receives the transformed delete increment. Finally, open-chain rows in the final target table whose PK appears in VT_INC_<number> or VT_DEL_<number> are chain-closed, and all rows in VT_INC_<number> are inserted into the final target table as open-chain rows.
21. Data flow algorithm from the near-source model layer to the integration model layer – delete zipper algorithm based on full data
This algorithm is usually used for general status tables with delete operations and requires full data so that deletions can be detected by comparison. Source tables suited to it see inserts, modifications, and deletes in the source system. Each day you take the latest end-of-day full data, find the true incremental data (new and modified records) and the deleted records separately, use them to close the chain on the open-chain rows (valid rows) in the target table whose attributes have changed (that is, set their END_DT to the current business date), and then insert the true incremental data (excluding the deleted data) from the latest data into the target table as open-chain rows;
Usually two temporary tables are created. One, named VT_NEW_<number>, receives the transformed latest full data of each group for the day; the other, named VT_INC_<number>, receives the true incremental data (new and modified records) and the deleted records found by comparing VT_NEW_<number> with yesterday's data in the target table, with the END_DT of the deleted records set to the minimum date (a borrowed convention). Finally, open-chain rows in the final target table whose PK appears in VT_INC_<number> are chain-closed, and all rows in VT_INC_<number> whose END_DT does not equal the minimum date (i.e. the non-deleted rows) are inserted into the final target table as open-chain rows.
22. Data flow algorithm from the near-source model layer to the integration model layer – economical conventional zipper algorithm
This algorithm is basically equivalent to the conventional zipper algorithm, except that in the last step only records whose attributes are non-empty and non-zero are inserted into the target table as open-chain rows;
23. Data flow algorithm from the near-source model layer to the integration model layer – economical delete zipper algorithm based on incremental data
This algorithm is basically equivalent to the delete zipper algorithm based on incremental data, except that in the last step only records whose attributes are non-empty and non-zero are inserted into the target table as open-chain rows;
24. Data flow algorithm from the near-source model layer to the integration model layer – economical delete zipper algorithm based on full data
This algorithm is basically equivalent to the delete zipper algorithm based on full data, except that in the last step only records whose attributes are non-empty and non-zero are inserted into the target table as open-chain rows;
25. Data flow algorithm from the near-source model layer to the integration model layer – PK_NOT_IN_APPEND algorithm
For each group, this algorithm inserts into the VT_NEW_<number> table only those rows whose PK does not already appear in the current VT_NEW_<number> table; finally, only the rows whose PK does not appear in the target table are inserted into the target table, ensuring that data for a given PK is loaded only once (a minimal sketch follows).
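A minimal sketch of the final load step (the deduplication inside VT_NEW_<number> is omitted; table and column names are hypothetical):

```sql
-- Insert only those keys that have never been loaded into the target before.
INSERT INTO int_lyr.Customer_Base (CustomerID, CustName, OpenDate)
SELECT n.CustomerID, n.CustName, n.OpenDate
FROM   VT_NEW_001 AS n
WHERE  NOT EXISTS (SELECT 1
                   FROM   int_lyr.Customer_Base AS t
                   WHERE  t.CustomerID = n.CustomerID);
```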
26. Data flow algorithm from the near-source model layer to the integration model layer – source date field self-zipper algorithm
This algorithm uses a date field in the source table to identify the effective date of each record. Records with the same primary key are sorted by this effective date and adjacent rows are linked end to end, which naturally forms a zipper (a sketch follows).
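A sketch of how the self-zipper can be formed with a window function, assuming a hypothetical rate-change source whose EffectiveDate field plays the role of the source date field; the far-future END_DT for the latest row is an assumed convention:

```sql
-- For each key, order rows by their own effective date and let each row's END_DT
-- be the next row's effective date; the most recent row stays open.
INSERT INTO int_lyr.Rate_His (RateCode, RateValue, START_DT, END_DT)
SELECT RateCode,
       RateValue,
       EffectiveDate                                           AS START_DT,
       ISNULL(LEAD(EffectiveDate) OVER (PARTITION BY RateCode
                                        ORDER BY EffectiveDate),
              '3000-12-31')                                     AS END_DT
FROM   nsrc.Rate_Change;
```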