- The generation of the data middle office: the pain points of data work, the generation of the data middle office, and the essence of the middle office
- Understanding the iQIYI data middle office: the data middle office, its output and positioning
- The development process of the iQIYI data middle office: middle office construction, pingback system, data warehouse system, data warehouse platform, offline data warehouse architecture, big data platform, data platform architecture
- Application scenarios of the data middle office: unification, personalization, and customization
1. The generation of the data middle office
The pain points of data work:
High threshold for use: data work is highly specialized and places relatively high demands on personnel.
Inconsistent metric definitions (caliber): when data is used, inconsistent definitions are a particularly common problem. They lead to discrepancies in data use and analysis and reduce the efficiency of business data analysis.
Low data reliability: the data production chain is very long and introduces many data quality problems along the way, which can eventually have a serious impact on business decisions.
Cross-business difficulty: without a unified data construction plan, standards, and specifications, it is hard to hold each business, or every link of the production chain, to a standardized production and processing flow. This makes it difficult to integrate data from multiple businesses and realize greater data value.
High access costs: when a new business is onboarded or a new scenario needs data, many tasks have to be handled manually. Applying for resources and permissions, finding data, and connecting the whole chain of data collection, production, computation, synchronization, and display is time-consuming, inefficient, and ultimately error-prone.
Low delivery quality: data work is inseparable from delivery (pingback), the series of data records used to capture user behavior. If the delivery process is not standardized or controlled, delivery quality will be poor.
Difficult to obtain data: from production to final use, data may pass through a long time span or across many teams. Users may not be able to find the data they want quickly, or the data produced by the data team may never actually reach the business and realize its value.
Blurred data assets: this is somewhat related to the difficulty of obtaining data, but it is more about the need to manage the company’s data assets as a whole. Without that overall management, it is unclear what data assets exist and at what level they sit. In the end, it is hard to bring the advantages of data into play: a lot of computing, human, and storage resources are consumed without bringing corresponding value, which leads to extremely low resource efficiency.
The essence of the middle office
The data middle office is more like an enterprise architecture: one that combines Internet technology with industry characteristics to find certainty in the uncertainty of enterprise development, that keeps accumulating and abstracting the enterprise's core capabilities, and that ultimately supports fast, efficient, and low-cost business innovation and improvement.
2. Definition of the iQIYI data middle office
Understand the data middle office
Data back office: the big data clusters people usually work with, such as Hadoop, Spark, and Flink, together with OLAP tools. These are only back-office capabilities; they have not been turned into standardized, generalized, relatively low-threshold middle-office capabilities.
Data middle office: the data middle office is essentially a data-as-a-service product concept. It includes data services, data platforms, the data the middle office produces, and the standards and specifications generated across all data work. Together these constitute what we call the data middle office.
Data front office:
The data front office is where actual products land. It mainly covers several general directions:
Data applications, such as ad-hoc queries and visual query tools;
Data products, such as user portraits and recommendation services, which are built from data and serve users directly;
Analysis systems, such as user analysis, content analysis, and business reports.
Therefore, the data middle office can be abstracted as “platform + service + data + standardization”: it encapsulates the production, collection, processing, storage, and serving of data, and offers different forms of service to users at different levels. Through data standardization, the data middle office prevents duplicated data construction, avoids inconsistent metric definitions, and improves the efficiency of data use.
Positioning of the data middle office
As for positioning, the data middle office needs a clear division from the front office and the back office. It provides abstract, general-purpose capabilities on which front-office teams can customize. General capabilities are reused while the personalized needs of fast-growing businesses are still met, which moves the whole toward a global optimum.
3. Construction of the iQIYI data middle office
The middle office mainly outputs capabilities from five perspectives: service, data, platform, delivery, and standards/specifications. In implementing iQIYI’s data middle office, the work was divided into three general directions:
Data production, which is what we call the delivery system;
Data, that is, the unified data warehouse system, which is the core of the data;
Big data platform capabilities, including development, governance, and services.
Delivery system:
This part outputs the delivery specification. Relevant employees of the company also need to be trained on the specification, so that everyone understands what delivery is for and how it supports in-depth analysis of user behavior.
Big data platform:
It covers offline development and its operation and maintenance (O&M) management, real-time development and its O&M management, as well as data governance, the data graph, data services, and ad-hoc query. Ad-hoc query is a sub-item of data services, but because it is so widely used it is listed separately.
Unified data warehouse
The unified data warehouse provides offline and real-time data warehouse capabilities to downstream users. To support scenarios that mix offline and real-time data, standardization is required: the fields, definitions, metric definitions, and formats of real-time data should be as consistent as possible with the offline output, that is, real-time data is aligned with offline data.
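The alignment requirement can be illustrated with a small check that compares a real-time table's fields against the offline table it should match. This is only a sketch under assumptions: the `Field` structure, the table contents, and the function name are hypothetical and are not taken from iQIYI's platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    dtype: str        # e.g. "string", "bigint"
    definition: str   # the agreed business definition of the field


def alignment_issues(offline: list[Field], realtime: list[Field]) -> list[str]:
    """List the ways a real-time schema deviates from the offline schema."""
    issues = []
    offline_by_name = {f.name: f for f in offline}
    for rt in realtime:
        off = offline_by_name.get(rt.name)
        if off is None:
            issues.append(f"{rt.name}: missing from offline table")
        elif off.dtype != rt.dtype:
            issues.append(f"{rt.name}: type {rt.dtype} != offline {off.dtype}")
        elif off.definition != rt.definition:
            issues.append(f"{rt.name}: definition differs from offline")
    return issues


# Hypothetical schemas for a play-event table.
OFFLINE = [
    Field("user_id", "string", "logged-in account id"),
    Field("play_seconds", "bigint", "seconds of playback, excluding ads"),
]
REALTIME = [
    Field("user_id", "string", "logged-in account id"),
    Field("play_seconds", "double", "seconds of playback, excluding ads"),
    Field("device", "string", "client device type"),
]

for issue in alignment_issues(OFFLINE, REALTIME):
    print(issue)
```

The offline schema is treated as the reference here because the text describes real-time data as being aligned with offline data.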
Besides providing the data itself, the data warehouse also maintains the company-level metric system and unified dimensions, so that all data systems and platforms connect to one unified system of dimensions and metrics. In addition, to support data modeling and metric management during warehouse construction, a corresponding data warehouse platform has been built. The platform itself follows the data specifications, so that users who build the warehouse through it do so in accordance with those specifications.
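One way to picture a company-level metric system is a registry that holds exactly one agreed definition per metric name and rejects conflicting ones. The sketch below is illustrative only; `MetricRegistry`, the `dau` metric, and both definitions are assumptions made for the example.

```python
class MetricRegistry:
    """Keeps one agreed definition per metric name."""

    def __init__(self):
        self._definitions = {}

    def register(self, name: str, definition: str) -> None:
        existing = self._definitions.get(name)
        if existing is not None and existing != definition:
            raise ValueError(f"metric {name!r} already defined as {existing!r}")
        self._definitions[name] = definition

    def definition(self, name: str) -> str:
        # Raises KeyError for metrics that were never registered.
        return self._definitions[name]


registry = MetricRegistry()
registry.register("dau", "distinct user_id with at least one session per day")

try:
    # A second team tries to reuse the name with a different definition.
    registry.register("dau", "distinct device_id with at least one play per day")
except ValueError as err:
    conflict = str(err)
    print(conflict)
```

Rejecting the second registration forces the two teams to either share the existing definition or pick a distinct metric name, which is the kind of inconsistency a unified metric system is meant to prevent.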
Pingback system
The pingback system is the delivery system. So why build it specifically?
The main problems faced by the delivery work are as follows:
Data warehouse system
Several pain points need to be solved in the data warehouse system:
Data warehouse platform
Features of the data warehouse platform:
Descriptiveness of data information: when tables are created under pressure to meet business needs quickly, descriptive information is rarely added, so the data lacks description. The platform therefore requires users to describe data in sufficient detail at creation time, which makes later use easier;
Integrity of the data modeling system: modeling follows a three-step process. Business modeling comes first, then the corresponding data modeling, and then physical modeling in its different forms for that data model. Making this a process-driven flow prevents users from skipping steps to meet business needs quickly, which would ultimately lead to poorly extensible models;
Systematic management of dimensions and metrics: a unified dimension and metric management system acts as the center and outputs unified metrics and dimensions, so that everyone works with the same standardized, centrally managed metadata;
Traceability of data relationships: the process of building and modeling the data warehouse ensures that the relationships between tables and fields are recorded and can be queried afterwards. This is what we call data lineage.
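Two of the features above, mandatory descriptions and queryable lineage, can be sketched as a tiny catalog. Everything here is hypothetical: the `TableSpec` and `Catalog` classes and the table names are invented for illustration and do not describe the actual platform.

```python
from dataclasses import dataclass, field


@dataclass
class TableSpec:
    name: str
    description: str
    columns: dict[str, str]                       # column name -> description
    upstream: list[str] = field(default_factory=list)


class Catalog:
    def __init__(self):
        self._tables: dict[str, TableSpec] = {}

    def register(self, spec: TableSpec) -> None:
        # Descriptiveness: refuse tables or columns without a description.
        if not spec.description.strip():
            raise ValueError(f"{spec.name}: table description is required")
        missing = [c for c, d in spec.columns.items() if not d.strip()]
        if missing:
            raise ValueError(f"{spec.name}: columns lack descriptions: {missing}")
        self._tables[spec.name] = spec

    def lineage(self, name: str) -> list[str]:
        """Ancestors of a table, nearest first (breadth-first walk)."""
        seen, queue = [], list(self._tables[name].upstream)
        while queue:
            parent = queue.pop(0)
            if parent in seen:
                continue
            seen.append(parent)
            if parent in self._tables:
                queue.extend(self._tables[parent].upstream)
        return seen


catalog = Catalog()
catalog.register(TableSpec("ods_play_log", "Raw play pingback log",
                           {"user_id": "account id"}))
catalog.register(TableSpec("dwd_play", "Cleaned play events",
                           {"user_id": "account id"}, upstream=["ods_play_log"]))
catalog.register(TableSpec("dws_user_play_daily", "Daily play totals per user",
                           {"user_id": "account id",
                            "play_seconds": "total seconds played"},
                           upstream=["dwd_play"]))

print(catalog.lineage("dws_user_play_daily"))
```

Recording `upstream` at registration time is what makes lineage available later without parsing job code, at the cost of relying on users to declare it correctly.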
Offline data warehouse architecture
The following is a simplified architecture of the data warehouse, which mainly reflects the offline data warehouse part. The colored part is the unified data warehouse; the lighter-colored parts are data applications, including data marts and theme data warehouses.
Big data platform
iQIYI’s big data platform has gone through five stages:
Development: the first stage delivered platform-based, visual data development, which lowered the development threshold and improved development standardization.
O&M: once development was in place, task management and O&M capabilities needed to improve. The O&M management module lets users manage tasks more conveniently and effectively monitors the stability of task output and the timeliness of data output.
Quality: with development and management capabilities available, the quality of data output has to be verified as well. Otherwise produced data gets used directly without attention to its quality, and data problems spread quickly.
Usage: data usage is also a process of data discovery. When a lot of data is produced, how do users see it and apply it to their business needs? To address this pain point, the data graph module was released. It collects, processes, and manages metadata of all kinds and presents complete data information in a friendlier form, helping everyone discover data quickly, understand its metadata, and use it quickly and accurately.
Governance: this is the last link in the data ecosystem and an important part of building a healthy closed loop. Some companies give governance higher priority, but in some scenarios, such as rapid business growth, governance often cannot keep up with business needs. iQIYI’s approach is therefore to wait until the business has developed to a certain extent and then add data governance capabilities, governing the existing stock and controlling the increment. Governance work mainly consists of daily audits of data and tasks, followed by evaluating data redundancy through data lineage and usage and optimizing accordingly to reduce wasted resources and manpower.
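The idea of evaluating redundancy from lineage and usage can be sketched as follows. This is a minimal illustration, not the platform's actual rule: the function name, the input shapes, and the 90-day idle threshold are all assumptions.

```python
from datetime import date, timedelta


def redundancy_candidates(downstream, last_read, today, idle_days=90):
    """Tables with no downstream consumers and no recent reads.

    downstream: table name -> list of tables built from it (from lineage)
    last_read:  table name -> date of the most recent query, or None
    idle_days:  hypothetical threshold; a real policy would set its own
    """
    cutoff = today - timedelta(days=idle_days)
    candidates = []
    for table, consumers in downstream.items():
        read = last_read.get(table)
        unused = read is None or read < cutoff
        if not consumers and unused:
            candidates.append(table)
    return sorted(candidates)


# Hypothetical lineage and usage records.
DOWNSTREAM = {
    "dwd_play": ["dws_user_play_daily"],
    "dws_user_play_daily": [],
    "tmp_play_backup": [],
}
LAST_READ = {
    "dwd_play": None,
    "dws_user_play_daily": date(2024, 5, 30),
    "tmp_play_backup": date(2023, 11, 1),
}

print(redundancy_candidates(DOWNSTREAM, LAST_READ, today=date(2024, 6, 1)))
```

A table that nobody queries directly is still kept if other tables are built from it, which is why lineage and usage are checked together rather than separately.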
Data platform architecture
The computing layer is mostly big data cluster services and also includes some task scheduling capabilities.
The platform layer includes offline and streaming task development and management, the machine learning platform, and the data warehouse platform. Below these sit platform-based ETL processing for all data and a module that synchronizes external data, called data integration. Alongside these development and management capabilities, the layer also builds out delivery management, data security, data quality, and the data graph, and carries out data governance across the whole data system.
The service layer provides downstream service capabilities in the form of ad-hoc query, real-time analysis, data services, and metadata services.
The lowest layer is the data layer, such as delivery server logs, business data, and other data sources, which pass through the collection layer and the transport layer to reach the computing layer.
4. Application scenarios of the data middle office
The data middle office provides different access methods for businesses at different stages:
The first stage is the unified form. There is one set of generic templates, with obvious advantages and disadvantages: access is simple, but the result is not personalized or customized and supports only general data capabilities. It therefore suits the early stage of a business, which can onboard quickly and have data processing and serving completed automatically;
The second stage is personalization. With the overall process fixed, the business can customize certain links in it and extend the capabilities of existing data modules to meet personalized needs, so this form suits the growth stage of a business;
The third stage is customization. It targets particularly mature businesses whose data needs are broad and deep, so that the general and personalized architectures are no longer enough. Customization provides data capabilities as modules; the business selects the modules it needs, then assembles and extends them to meet its own customized needs.
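The difference between the three access forms can be sketched with a small pipeline built from named steps. This is an illustrative model only; the step functions, the template, and the sample rows are hypothetical and not part of the platform described here.

```python
from typing import Callable

Step = Callable[[list[dict]], list[dict]]


def drop_empty_users(rows):
    return [r for r in rows if r.get("user_id")]


def only_play_events(rows):
    return [r for r in rows if r.get("user_id") and r.get("event") == "play"]


def count_by_user(rows):
    counts = {}
    for r in rows:
        counts[r["user_id"]] = counts.get(r["user_id"], 0) + 1
    return [{"user_id": u, "events": n} for u, n in sorted(counts.items())]


def run(steps: dict[str, Step], rows: list[dict]) -> list[dict]:
    for step in steps.values():   # dicts preserve insertion order
        rows = step(rows)
    return rows


# Stage 1, unified: one fixed, generic template for every business.
DEFAULT_TEMPLATE: dict[str, Step] = {
    "clean": drop_empty_users,
    "aggregate": count_by_user,
}


def personalized(overrides: dict[str, Step]) -> dict[str, Step]:
    """Stage 2: keep the template's flow, replace selected steps."""
    unknown = set(overrides) - set(DEFAULT_TEMPLATE)
    if unknown:
        raise KeyError(f"not part of the template: {sorted(unknown)}")
    return {**DEFAULT_TEMPLATE, **overrides}


# Stage 3, customized: the business assembles its own flow from modules.
CUSTOM = {
    "clean": only_play_events,
    "aggregate": count_by_user,
    "top": lambda rows: sorted(rows, key=lambda r: -r["events"])[:1],
}

ROWS = [
    {"user_id": "u1", "event": "play"},
    {"user_id": "u1", "event": "pause"},
    {"user_id": "u2", "event": "play"},
    {"user_id": "", "event": "play"},
]

print(run(DEFAULT_TEMPLATE, ROWS))
print(run(personalized({"clean": only_play_events}), ROWS))
print(run(CUSTOM, ROWS))
```

In this model, personalization may only swap steps that already exist in the template, while customization is free to add or reorder steps, which mirrors the increasing flexibility described for the three stages.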
By Ma Jintao. Original: dbaplus