Speaker: Guo Yi, NetEase Shufan
Editor: Cao Wenwu, Zhongke Yungu
Production platform: DataFunTalk
Introduction: As big data has matured, the NetEase Shufan big data team put forward the concept of data productivity. Guided by the vision of “everyone uses data, all the time”, the team built the technology stack behind NetEase’s data middle platform and supported data middle platform projects for NetEase Cloud Music, Yanxuan, Media, Youdao, and Mailbox. Without solid data governance, a data middle platform is a castle in the air and problems of every kind follow, so data governance is critical to building one. This article shares NetEase Shufan’s practical experience in data governance, covering both the data middle platform and data analysis, and focuses on the following four points:
NetEase Shufan big data
Why data governance projects often fail
NetEase Shufan Data Governance 2.0
Data governance practice cases at NetEase Shufan
First, some background on NetEase Shufan big data.
1. Development history of NetEase Shufan big data
NetEase Shufan is a ToB (enterprise-facing) commercial brand incubated by NetEase Hangzhou Research Institute. It provides enterprises with the technology and services they need for digital transformation. NetEase Hangzhou Research Institute was founded in 2006 as the shared technology department for NetEase’s Internet businesses. In its early days we mainly built three distributed systems: a distributed database, a distributed file system, and a distributed search engine. This troika supported a series of NetEase products in the Internet 2.0 era that followed, including the familiar NetEase blogs and photo albums.
In 2009, NetEase took the lead in running data analysis and operations on Hadoop. NetEase’s technical system is very open, and we believe in the momentum that an open source community brings to the sustained development of foundational software. In 2014, NetEase’s big data platform (better known as NetEase Mammoth) and NetEase Youshu BI were launched. They drove large-scale adoption of data analysis inside NetEase: NetEase Koala, Yanxuan, Music, News, and Youdao all built their data analysis systems on this platform.
In 2017, NetEase’s big data business was officially commercialized. By 2018, as the scale of data analysis inside NetEase grew rapidly, we ran into many problems and challenges, mainly in the efficiency, quality, cost, and security of data use. Under heavy pressure from the business, we began using the data middle platform to reshape the entire data architecture, and we proposed and released the “full-link data middle platform” solution. In 2020, NetEase Shufan put forward the concept of “data productivity”, which emphasizes building a matrix of data products for business scenarios on top of the data middle platform, and further refined the methodology of “data productization”, one of the three core methodologies of data productivity. Large-scale use of data made the need for data governance solutions more urgent, so in 2022 we proposed the concept of “integrated data development and data governance”, which is the core idea of NetEase Shufan’s “Data Governance 2.0”.
2. NetEase Shufan big data product matrix
The figure above shows the technical architecture of Shufan’s big data products, which has four layers:
(1) Infrastructure layer
This layer covers big data compute and storage engines and includes several currently popular technologies, such as storage-compute separation, real-time data lakes, and hybrid scheduling of offline and online workloads. At NetEase News, offline data analysis tasks and online transaction processing services are scheduled uniformly with k8s: during off-peak periods, some offline workloads are scheduled onto the online service servers, which has significantly improved resource utilization. For NetEase Cloud Music’s overseas business, we worked with AWS and took the lead in adopting storage-compute separation, replacing HDFS with cloud object storage and building a cloud-native data platform architecture. Also at Cloud Music, we use Arctic, NetEase’s open source real-time data lake solution, to give the data lake minute-level real-time update capability.
(2) DataOps-based data development platform covering the full life cycle
A complete DataOps toolchain covers data integration, development, testing, release, and operations, and enables efficient testing and seamless release across the DEV/SIT/UAT/PRD environments.
(3) Data governance technology platform
NetEase Shufan’s data governance system includes the three familiar pillars of traditional data governance (data quality, metadata management, and data standards) as well as related systems from the data middle platform, such as the metric system, the model design center, and data services. We integrate all of them into NetEase Shufan Data Governance 2.0.
(4) Data product layer
BI is the most important window for data analysis. This layer includes a one-stop data portal, self-service data retrieval, large-screen data dashboards, and some general-purpose data products such as CDP. We also place the machine learning platform in the data product layer, because it sits on top of the data and lets intelligent algorithms be plugged in to improve the accuracy of decisions.
3. Commercial positioning of NetEase Shufan big data
Through long-term practice inside NetEase Group’s own businesses, we have developed a leading methodology and accumulated many industry deployment cases, and we have also clarified the commercial positioning of NetEase Shufan big data:
We are a foundational software provider, not a cloud vendor;
We must support a cross-cloud strategy;
We believe that a healthy big data software market must be layered.
4. Customer case wall
Why data governance projects often fail
Next, let’s look at why data governance is needed.
1. Why should we do data governance?
We divide an enterprise’s digital transformation into two stages. The first stage is going online: information systems replace offline processes, and many business systems are created along the way. The second stage, which we define as digital intelligence, uses data and algorithms in place of gut-feeling decisions. Digital intelligence requires data productivity, which we define as using data to increase an organization’s productivity. After observing many enterprises, we found that those that truly achieve data productivity share one characteristic: everyone in the enterprise uses data, and uses it all the time. We therefore take this as the vision of data productivity. To achieve this vision, NetEase Shufan proposes relying on three major methodologies:
DataOps (data R&D): an R&D system covering the full data life cycle
DataFusion (data governance): Data Governance 2.0
DataProduct (data productization): data is turned into products so that users can use it easily
2. NetEase Shufan data productivity architecture
The overall data productivity architecture has three roles: business systems, the data middle platform, and data products. Business systems are mainly responsible for managing processes. Because different business systems create data silos, analyzing data across a whole business process requires bringing the data from these systems into one unified data platform, which forms the enterprise’s public data foundation. The most important responsibility of the data middle platform is to build the enterprise’s public data layer, produce high-quality metrics with consistent definitions, and present them through data products. Data products are mainly responsible for turning data into business decisions and making business processes run more intelligently. So across the whole architecture, data comes from the business, is converted into decisions, and then flows back into the business. We call this cycle the digital intelligence cycle.
So what does this have to do with data governance, today’s topic? What role does data governance play here? The answer starts with the problems we ran into. As mentioned earlier, we want business people to be able to use data effectively. But can they really do so? What goes wrong when they use data?
We boil the problems down to four: users can’t find the data, can’t understand it, can’t trust it, and can’t control it. Behind these lie low efficiency and poor quality across the whole data production process.
3. Traditional Data Governance 1.0
Traditional data governance, whose core components we call the Big Three, includes metadata management, data quality, and data standards. The usual process starts with data standards; formulating them is called standard setting. Once the standards are set, they have to be applied to the actual data models, a step called standard landing, which relies on metadata collection, metadata registration, and metadata approval and release. After standard landing has linked the data models to the data standards, the data element constraints defined in the standards can be used to audit data quality, catch data that does not meet the standards, and drive rectification. This is a very standard data governance process. It significantly improves existing data but ignores the long-term governance of incremental data, so enterprises have to keep launching data governance projects to maintain the effect.
Therefore, NetEase Shufan believes that long-term governance requires solving problems at the point where data is produced, so that the data produced meets the standards in the first place.
The problems with traditional Data Governance 1.0 are summarized as follows:
(1) Data development is disconnected from data governance
Data quality is disconnected from data development: people often ask how to make sure data quality audit rules are complete. We found that only 10% of core output tables had audit rules, and that different developers set inconsistent audit rules for the same data item.
Data standards are disconnected from data modeling: even though teams share one set of data, 37% of the tables at NetEase had non-standard names, and the same field appeared under more than 8 different names.
Data standards are disconnected from data security: data security policies are inconsistent with data standards.
Data development is disconnected from data standards: dictionary mappings are inconsistent with ETL.
Metadata is disconnected from task operations and development: tasks cannot be managed effectively according to asset registrations.
After-the-fact data governance is very expensive: once a table has been built and a task has gone live, pushing the owners to change them costs a lot. If we design the model before the table or analysis task goes online, defining the data standards first and then modeling against them, the table meets the standards from the start and the cost is lowest. That is why we emphasize integrating data development with governance.
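The idea of checking a model against the standards before the table exists can be sketched roughly as follows. This is only an illustration, not NetEase Shufan’s implementation: the dictionary `STANDARD_FIELDS`, the layer-prefix pattern, and the function `check_model` are all hypothetical names chosen for the example.

```python
import re

# Hypothetical data standard: canonical field name -> non-standard aliases seen in practice.
STANDARD_FIELDS = {
    "user_id": {"uid", "userid", "user_no"},
    "order_id": {"oid", "orderid"},
}

# Assumed naming convention: a warehouse-layer prefix followed by snake_case.
TABLE_NAME_PATTERN = re.compile(r"^(ods|dwd|dws|ads)_[a-z][a-z0-9_]*$")


def check_model(table_name, fields):
    """Return a list of violations; an empty list means the table may be created."""
    violations = []
    if not TABLE_NAME_PATTERN.match(table_name):
        violations.append(f"table '{table_name}' does not follow the naming convention")
    # Invert the standard so that each known alias points at its canonical name.
    alias_to_standard = {
        alias: standard
        for standard, aliases in STANDARD_FIELDS.items()
        for alias in aliases
    }
    for name in fields:
        if name in alias_to_standard:
            violations.append(f"field '{name}' should be '{alias_to_standard[name]}'")
    return violations
```

Running such a check at design time means a non-conforming table is rejected before any task depends on it, which is where the text above argues the cost of correction is lowest.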
(2) Lack of unified management of different platforms
Different computing and storage engines increase the cost of users finding data, understanding data, and using data.
(3) Ignoring the efficiency and quality problems in the data development process
The figure above shows two real cases. They show that data governance should be built into the data production process rather than applied after launch.
(4) No solution to chimney-style (siloed) data development
Chimney-style data development leads to inconsistent metric definitions, wasted effort from developing the same data repeatedly, and wasted resources from computing the same data repeatedly.
(5) Insufficient assessment of data value and cost
(6) The process of data governance lacks quantitative means
There should be some quantitative means of monitoring the entire governance process.
(7) The process of data governance lacks a closed loop of continuous feedback
Metadata lacks a closed loop for continuous improvement
Data quality lacks a closed loop of continuous improvement
Fine-grained resource management lacks a closed loop of continuous feedback
NetEase Shufan Data Governance 2.0
1. What exactly is data governance?
The industry authority DAMA defines 11 functional areas for data governance, but it lacks concrete implementation methods and experience.
DCMM is China’s first national standard in the field of data governance. It provides an objective assessment method but still lacks concrete courses of action.
2. NetEase Shufan’s understanding of data governance
NetEase divides data governance into two parts according to its purpose:
Data governance for business systems: ensures the consistency, authority, and correctness of the enterprise’s core data across businesses, systems, and processes.
Data governance for data analysis: addresses efficiency, quality, security, cost, standards, and value in the data analysis process.
3. NetEase Shufan’s DataFusion methodology for data governance
NetEase’s data governance methodology builds traditional data governance into the whole data development life cycle. It rests on a data development base that follows the full DataOps life cycle, adopts the data middle platform as its data architecture, and adds NetEase’s distinctive ROI-based data asset practice. We call it Data Governance 2.0:
Integration of development and governance
DataFabric-based logical data lake
DataOps-based data development base
Data middle platform architecture to solve chimney-style data development
ROI-based data asset accumulation
(1) Integration of data development and governance
Generate value-domain constraints through data profiling
Data standards bind audit rules to data objects and metamodels
Data modeling references the data elements and metamodels in the data standards
Audit rules associated with the data standards bound to a table are automatically added to that table’s audit monitoring
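The four steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the product’s actual data model: `DataElement`, `Column`, `profile_domain`, `derive_audit_rules`, and `audit` are hypothetical names, and only two rule types (not-null and value domain) are modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataElement:
    """A data element in the standard, carrying its own constraints (hypothetical shape)."""
    name: str
    not_null: bool = True
    allowed_values: Optional[frozenset] = None  # value-domain constraint


@dataclass
class Column:
    """A table column that references a data element from the standard."""
    name: str
    element: DataElement


def profile_domain(sample_values):
    """Step 1: derive a value-domain constraint from profiled sample data."""
    return frozenset(v for v in sample_values if v is not None)


def derive_audit_rules(table, columns):
    """Steps 2-4: rules bound to the referenced elements become the table's audit rules."""
    rules = []
    for col in columns:
        if col.element.not_null:
            rules.append((table, col.name, "not_null"))
        if col.element.allowed_values is not None:
            rules.append((table, col.name, "in_domain"))
    return rules


def audit(rows, columns):
    """Return (row_index, column, rule) for every violation found in rows."""
    violations = []
    for i, row in enumerate(rows):
        for col in columns:
            value = row.get(col.name)
            if value is None:
                if col.element.not_null:
                    violations.append((i, col.name, "not_null"))
            elif (col.element.allowed_values is not None
                  and value not in col.element.allowed_values):
                violations.append((i, col.name, "in_domain"))
    return violations
```

Because the rules live on the data element rather than on each table, every table that references the element gets the same checks without a developer having to remember to add them.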
(2) DataFabric-based logical data lake
The core idea of the DataFabric-based logical data lake is to build a unified data mart across platforms. Hive, MySQL, and Greenplum are combined into one unified aggregation layer that is delivered directly to BI. Rotating data sets and materialized views give the business an out-of-the-box experience and hide the implementation differences between the underlying data sources.
(3) DataOps-based data development base
The DataOps-based data development base applies the CI/CD methodology from software engineering to data development, covering continuous integration, continuous delivery, and continuous deployment. Specifically, it includes six stages: coding, orchestration, testing, code review, release review, and deployment.
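The six stages can be pictured as an ordered series of gates. The sketch below is an assumption about how such a pipeline might be driven, not a description of the platform itself: `run_pipeline` and the shape of `checks` are hypothetical, and a stage with no registered check is treated as a failure so that the pipeline fails closed.

```python
# Stage names follow the six stages listed in the text.
STAGES = ["coding", "orchestration", "testing", "code_review", "release_review", "deployment"]


def run_pipeline(checks):
    """Run the stages in order and stop at the first one that does not pass.

    checks maps a stage name to a zero-argument callable returning True or False.
    Returns (completed_stages, failed_stage); failed_stage is None on success.
    """
    completed = []
    for stage in STAGES:
        check = checks.get(stage)
        # A missing check counts as a failure, so nothing is released unverified.
        if check is None or not check():
            return completed, stage
        completed.append(stage)
    return completed, None
```

For example, if the `testing` check returns False, the task stops after `orchestration` and never reaches code review or deployment.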
(4) The architecture of the data middle office
The data middle platform has three core parts: a unified metric management system, a highly reusable and standardized public-layer model, and data services.
(5) ROI-based data asset accumulation
With ROI-based data asset accumulation, a visual analysis page gives a fine-grained management view of each task, so that business staff can continuously manage their data and take useless data offline.
Calculate the compute and storage resources consumed by each task, query, and table, convert them into money, and allocate the cost down to each data report and data service API;
“Peel the onion” data offlining: starting from data applications that are no longer used, archive the upstream tasks and take the upstream data offline layer by layer to release resources;
Estimate the cost of tasks and queries, and require approval for high-consumption tasks and queries.
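A rough sketch of these three practices follows. The lineage graph, the cost figures, the even-split allocation rule, and the approval threshold are all assumptions made for illustration; the source does not say how costs are actually apportioned.

```python
from collections import defaultdict

# Hypothetical lineage: producer -> direct consumers. Applications have no consumers.
DOWNSTREAM = {
    "ods_orders": ["dwd_orders"],
    "dwd_orders": ["dws_sales", "dws_legacy"],
    "dws_sales": ["report_sales"],
    "dws_legacy": ["report_old"],
    "report_sales": [],
    "report_old": [],
}
# Hypothetical monetary cost of producing and storing each table.
COST = {"ods_orders": 40.0, "dwd_orders": 60.0, "dws_sales": 20.0, "dws_legacy": 10.0}


def allocate_costs(downstream, cost):
    """Push each table's cost down to the applications at the end of its lineage.

    Assumption: a table's cost is split evenly among its direct consumers.
    """
    allocated = defaultdict(float)

    def push(node, amount):
        consumers = downstream.get(node, [])
        if not consumers:  # reached an application (or an orphan table)
            allocated[node] += amount
            return
        share = amount / len(consumers)
        for consumer in consumers:
            push(consumer, share)

    for node, amount in cost.items():
        push(node, amount)
    return dict(allocated)


def peel_onion(downstream, unused_apps):
    """Return what can go offline, layer by layer, starting from unused applications."""
    offline = list(unused_apps)
    removed = set(unused_apps)
    changed = True
    while changed:
        changed = False
        for node, consumers in downstream.items():
            if node in removed or not consumers:
                continue  # already peeled, or an application that is still in use
            if all(c in removed for c in consumers):
                removed.add(node)
                offline.append(node)
                changed = True
    return offline


def needs_approval(estimated_cost, threshold=100.0):
    """High-consumption tasks and queries go through approval (threshold is assumed)."""
    return estimated_cost > threshold
```

With this sample lineage, retiring `report_old` also frees `dws_legacy`, while `dwd_orders` stays online because `dws_sales` still depends on it.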
4. Quantitative indicator monitoring and analysis
The dashboard monitors a data governance health score, from which points are deducted along different dimensions. Based on this score we publish red and black lists (best and worst performers) across business units, which also serves as a performance management tool.
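One simple way to compute such a score is sketched below. The base score of 100, the deduction amounts, and the business-unit names are assumptions for illustration only; the source does not give the actual scoring formula.

```python
def health_score(deductions, base=100):
    """Subtract per-dimension deductions from a base score, never going below zero."""
    return max(0, base - sum(deductions.values()))


def red_black_lists(scores, top_n=1):
    """Rank business units by score: the best go on the red list, the worst on the black list."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    red = [name for name, _ in ranked[:top_n]]
    black = [name for name, _ in ranked[-top_n:]]
    return red, black


# Hypothetical business units with deductions by governance dimension.
scores = {
    "biz_a": health_score({"quality": 5, "cost": 10}),
    "biz_b": health_score({"standards": 30, "security": 20}),
    "biz_c": health_score({}),
}
red, black = red_black_lists(scores)
```

Keeping the deductions per dimension, rather than only the total, lets a business unit see which dimension cost it the most points.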
5. Ongoing Operations – Metadata Quality Discovery and Feedback
During ongoing operations, when data asset consumers find a data quality problem, they can file a data governance request. The data management department can then assign a ticket requiring the business department to fix the problem within the specified time.
6. Enterprise data culture construction
Data analysis, data governance, and data visualization competitions
Certification for data development engineers and data visualization analysis engineers
A data governance department that acts as the operating department for data governance
Data governance specialists staffed in business units
Data governance scores published as red and black lists to get business units’ attention
Integration with the company’s internal process engine to run data governance processes through tools
7. Data productivity organization
8. System construction oriented to governance
Technology is the foundation of data governance, but technology alone is not enough. It must be complemented by the organization, processes, assessment, and policies described above. Only a complete system can finally achieve the vision of everyone using data, all the time.
9. Data Strategy
10. Enterprise Data Asset Portal – One-stop data consumption platform
Through the one-stop data consumption platform and portal, business staff can see what data the enterprise has, which core reports exist, and which core data governance applications are available.
Data governance practice cases at NetEase Shufan
1. A large operator
Problems faced before introducing NetEase Shufan’s one-stop tool platform:
Data standards, data quality, and data development were seriously disconnected. Standards stayed at the dictionary level, could not be built into the data production process, and could not be effectively implemented or supervised.
Tools from different vendors were badly fragmented: data quality audit rules could not be linked to the domain constraints of data objects in the data standards, data objects in the data standards could not be linked to data modeling tools, and data security levels in metadata management could not be linked to data masking in the security center.
Ultimately, data governance was done over and over without fundamentally solving the problem.
2. Integration of data development and governance
After NetEase Shufan was introduced, the data middle platform provided unified, integrated capabilities (data collection, modeling, development, scheduling, governance, and more) for the data warehouse, business analysis, and network clusters. In production, operations such as taking programs online and offline and creating tables became online, process-based operations, which on the one hand reduces manual work and on the other hand strengthens control over the data process.
The focus is to build the entire data governance process into every link of data development: define data standards before design, then do data modeling, and handle data quality, data security, and data assets around the data standards, so that integrated development and governance is realized across data governance scenarios.
3. Results at a glance
The chart above illustrates the results of our data governance. It also helps surface problems in terms of quality, value, security, cost, standards, and efficiency.
Q1: How to coordinate business-oriented data governance and analytics-oriented data governance?
A1: That is a very good question. Should we do business-oriented data governance, analysis-oriented data governance, or both, and which should come first? In fact the two are strongly connected, because data that comes from the business systems eventually returns to the business systems. When we do business-oriented data governance, the business system side has its own data standards, and those standards come with corresponding data quality rules and data asset levels.
Does doing data governance for the business mean we no longer need governance for data analysis? No. A very important point is that analysis systems and business systems are modeled differently: business systems use entity-relationship modeling, while analysis systems use dimensional modeling. The two are linked, and they can be connected through business entities. If you do data governance in the business systems, it can be applied directly to the analysis systems: we can synchronize the standards and the data quality rules bound to them. In the analysis system these rules become different data quality audit tasks, but reusing the standards that have already been set greatly reduces the complexity and difficulty of the work, so the two are collaborative. In other words, the data quality rules, data standards, and data models produced by business-system governance can be synchronized to analysis-oriented governance, both are managed on the same platform, and analysis and business are related on that platform through business entities. This collaboration is ultimately realized through tools, technology, and products.
Q2: How do you carry out data testing, and what approach do you use to make sure it is actually enforced?
A2: Data testing is a very important part of our whole CI/CD process. To make sure it is actually carried out, we set up checkpoints, that is, gates that block a release until the required step has been done. What are the checkpoints based on? Gating all data is not realistic. So you design first and then develop: during design you classify and grade the data assets and define the security level of the data. We can then define the approval process according to the scope of impact and the level of the data. For example, core data must have a data test report before it goes online. The data tests cover business rules and technical rules, such as whether the primary key is unique and whether there are null values. When a task is submitted for release, the platform automatically attaches the corresponding data quality report to the submission. The release process then automatically triggers an approval flow based on the task’s downstream impact and the data asset level, and the approver checks whether the data test report matches the code and whether the test results exist. If the results meet expectations, the task can go online. In this way we can enforce that all core data is tested.
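The release gate described in this answer might look roughly like the sketch below. It is an illustration under assumptions: the function names, the `"core"` asset level label, and the downstream-impact threshold of 10 are hypothetical, and only the two technical rules named in the answer (primary key uniqueness and null checks) are implemented.

```python
def run_data_tests(rows, primary_key, not_null_columns):
    """Produce a small data test report for a task's output rows."""
    keys = [row[primary_key] for row in rows]
    return {
        "primary_key_unique": len(keys) == len(set(keys)),
        "no_nulls": all(
            row.get(col) is not None for row in rows for col in not_null_columns
        ),
    }


def release_decision(asset_level, downstream_count, report):
    """Decide what happens when a task is submitted for release.

    asset_level: grade assigned at design time ("core" is an assumed label).
    downstream_count: number of downstream tasks affected (threshold is assumed).
    report: output of run_data_tests, or None if no tests were run.
    """
    is_core = asset_level == "core"
    if is_core and report is None:
        return "rejected: core data requires a data test report"
    if report is not None and not all(report.values()):
        return "rejected: data tests failed"
    if is_core or downstream_count >= 10:
        return "pending approval"
    return "auto-approved"
```

Gating on asset level and downstream impact, rather than on every task, reflects the point made in the answer that blocking all data is not realistic.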
Q3: What do you think is the most successful case of data governance 2.0 application in financial scenarios?
A3: Realistically speaking, I have seen many cases, including in securities, bank wealth management, and asset management, and in all of them the integration of data development and governance is still at an early, exploratory stage. Some time ago we talked with many CIOs and data governance leads in the securities industry. They very much want data governance to be put into practice, but many problems come up along the way. For example, the development tool platform may have been built many years ago while the data governance platform is a separate system, so many problems arise between the platforms, costs become very high, and governance ends up not being implemented, just like the operator case I shared earlier. Overall, though, I think everyone recognizes the trend and direction: carry out the whole governance process during data production, rather than repeating governance after the fact. One tip to share: new data tends to have greater business value, while the value of old data may be relatively limited, so we should pay more attention to how new data is produced and governed.
This concludes today’s sharing, thank you.
01/ Speaker
Guo Yi, technical lead for big data products at NetEase Shufan
Graduated from Tianjin University and joined NetEase after graduation, with more than 10 years of data development and management experience at NetEase. Helped NetEase Cloud Music, Yanxuan, News, Youdao, and other businesses build data middle platforms. Columnist for Geek Time’s “Data Middle Office Practical Lessons”, which has more than 19,000 subscriptions, and a long-time invited speaker at industry summits such as QCon, DTCC, ArchSummit, SACC, and GIAC, sharing NetEase’s latest practices in data development and data management.