This article draws on the Cloud Music data development team's recent practice in data warehouse construction and data governance, and shares some of our thinking on data governance.
Cloud Music is now a mass-market product. Users come every day to listen to songs, read comments, and visit the community, and in the process they generate a large amount of user data. The platform collects and processes on the order of 100 billion user log records per day, and the total data stored and processed across the cluster has reached 200 PB. For a data developer, solving storage, processing, and usage problems at this scale is exciting. On the other hand, more data means more hardware spend on computing and storage. How to bring out the value of the data, show that the money is well spent, and at the same time reduce cost and improve efficiency to maximize the ROI of data use is a problem we keep thinking about and trying to solve.
Judging from the experience of many companies at a similar stage of development, advancing data governance at this point is a proven path that can deliver significant value.
So what is data governance? Its scope is quite broad. According to Google's definition, data governance includes all the principled methods of managing data throughout its lifecycle, from acquisition through use to disposal. It covers everything you do to keep data secure, private, accurate, available, and easy to use, including the actions that must be taken, the processes that must be followed, and the technologies that support them throughout the lifecycle.
By that definition, much of what we do counts as data governance. But with so many possible directions, where should we start? Below, I share some of the Cloud Music data team's earlier work on data construction, our recent ideas and progress on data governance, and finally a summary of what a data governance system looks like to us at this stage.
Some work upfront
Over the past few years, the Cloud Music data team went through several stages of data warehouse construction:
1.1 Improve the modeling of the common layer and establish the corresponding design, research and development specifications.
Through the “Ren Dou Plan”, the team focused on a few things: compiling the “Cloud Music Data Warehouse Modeling Specification” and, on the basis of that specification, building the “easyDesign” data warehouse model design system together with the Hangzhou Research Institute. At the same time, we systematically sorted out and published the relationships between Cloud Music's data domains and subject domains, and completed a bus matrix covering the whole business. This work gave the accumulation of Cloud Music's data assets a clear direction, and over the following six months we completed a large number of common-layer models for entities such as people, items, and scenes.
1.2 Data link governance
Besides improving our own development and design specifications, we collected the problems reported by users and put effort into governing the upstream and downstream data links. The most prominent problem was the quality of event tracking (“buried points”). In early 2020 we started our first tracking governance project. Its main goal was to standardize the design, development, and testing of tracking points through a standard process and matching platform tools, with management before, during, and after release. After much debate and joint work, the music teams and the Hangzhou Research Institute agreed on a tracking process, standardized it, and turned it into a system: the tracking management platform “EasyTracker” went live. On that basis, the data warehouse team spent about half a year migrating several thousand legacy tracking points, which largely standardized both the tracking format and the tracking process.
1.3 Promote self-service data extraction and leverage data productivity
Director Wang of the Hangzhou Research Institute once proposed that everyone should use data and look at data every day. We focused on solving the “last mile” of data. On the tooling side, Youshu quickly delivered a prototype of EasyFetch (self-service data extraction), and for some time afterwards the two teams worked side by side to fix problems and improve the experience. After several iterations, EasyFetch met the needs of the business in both functionality and ease of use. Our own focus was on maximizing the value of the tool. The core of that is still the data: there has to be a good data model suited to self-service analysis. We built a large number of key application-layer data models close to each line of business, and also did the following:
Clarified the definition (caliber) of each metric and its intended use;
Added a data homepage that introduces the usage scenarios of each data module;
Ran 30+ online and offline training sessions for the operations and planning staff of each business line, set up self-service support groups on POPO, and even solved problems one-on-one.
As a result, self-service data extraction now covers essentially all of the business. In 2020, more than 150,000 self-service extractions were completed through EasyFetch, with more than 400 internal users and a peak of more than 100 daily active users.
Three things about data governance
The work above solved the problems of that stage. In 2021 Cloud Music started its IPO, and throughout the year the analyst team and the data development team spent a lot of time producing the data, metrics, and reports that investors cared about. At the same time, driven by larger business goals, the business teams' operations became more and more fine-grained, which brought a large volume of analysis and data extraction requests, more data models to build, and tighter response times.
In December 2021 Cloud Music was successfully listed, and the team delivered the IPO data project well. How do we do better in the next stage? Data construction and data governance ultimately serve business goals. The team's core goal for the next stage is to support the business in finding new growth, operating its existing user base, and running more fine-grained operations. On the data production side, we therefore set the goal of building a lean data production system that provides the business with a unified, easy-to-use, accurate, and stable data warehouse.
Achieving this on top of the existing data foundation requires more effective and systematic data governance. We made quality governance, assetization, and cost reduction and efficiency improvement the key goals for 2022. Much of the work draws on existing data governance methodologies such as DAMA and on hands-on experience from other companies in the industry.
2.1 Quality governance
Many people will ask: after building data warehouses for so long, why are you still working on quality? Didn't standardizing the design and development of data models solve that? My answer is that the devil is in the details. Every automobile manufacturer has a largely standardized production process, yet Toyota's quality is better than that of Ford, which invented the standardized assembly line. The difference lies in a finer breakdown and more effective management of the production process. Quality management is therefore never finished. For the goals of the current stage, we focus on defining quality standards, strengthening the enforcement of specifications, and optimizing platform tools.
The above figure shows how we break down the work required for quality and stability. Each item can be refined into one or more dedicated projects.
Take the metadata center as an example. As the data that defines data, metadata helps us understand and describe the data itself and the exact relationships between datasets, which is the foundation of data governance. On the other hand, metadata is itself data, and it too can be missing or inaccurate and needs to be governed. In Cloud Music's data governance work, we gave priority to sorting out the metadata and making it available.
The above table lists the classification details and some of the problems we pushed to solve. Working with Youshu in bi-weekly iterations, it took one quarter to make the metadata obtainable, with completeness and accuracy that meet our governance requirements, which laid a solid foundation for the subsequent governance work.
Take task operation and maintenance on the execution side as another example. The stability and security of the production environment are top priorities for every data team. One statistic says that most safety problems in industrial production are caused by people, and DuPont's own figure is 96%. As engineers, we always expect to build better tools to avoid risk, and we often overlook the human factor. This does not mean that quality and safety should simply be turned into performance assessments of individuals. When a person does not know what to do to reach a goal, that person cannot take responsibility for it. Early on, the team spent a lot of time writing an on-call operations manual that turns information synchronization, alerting, cause analysis, escalation, and recording and review into an SOP, and set two rules above all others: (1) “Nothing in production is a small matter”: every production problem must be taken seriously, however small. (2) “Go back to where the problem came from”: every problem has a beginning and an end, and the final conclusion must be reported back to the place where the problem was first raised. In the first half of this year, the overall task baseline breach rate dropped by 60%, and another important metric, online repair time, dropped by 80%, which means data problems are being handled much faster. Space does not permit covering the other sub-items here; they would need to be written up and shared separately.
2.2 Data assetization
The premise of data assetization is that data becomes a factor of production; the core is that it is used in production and brings value. For data warehouse construction, this means solving availability and ease of use, and also linking data to business value. Whether these problems are solved, and how well, needs to be measured against clear criteria. Although we deal with all kinds of metrics every day, we have not thought enough about a quantitative evaluation system for our own data construction. How do we answer “How is your data warehouse doing?” accurately and quickly? This kind of question often troubles us. If we answer it like a proof, we have to argue from several angles: first introduce the overall design specification of the data warehouse, from business processes to the standard bus matrix to the principles and normal forms followed in model design. Then perhaps show a large diagram illustrating the basic layering, the division by line of business, and so on. Or add how many tables we have built, how many fact tables, how many dimension tables. Or argue from the value side, using a few showcases to demonstrate the power of data.
In reality, the person asking may not have the patience to listen for an hour and then draw their own conclusions. Are design specifications and production processes really what end users care about most? Apparently not. So we recently reflected on what had not worked in the past, absorbed good practices from the industry, and proposed a new, user-centered standard for measuring the state of data warehouse construction: the “three-degree model”. It sets quantitative goals for construction progress, asset health, and business value to define the level of construction, and the overall plan for data assetization is derived from those goals.
While working toward these goals, the team also gradually arrived at a working methodology (as shown in the figure above): define the standards, improve governance capability through processes and tools, and set up channels with the relevant parties for continuous operation. With these three steps in place, results become predictable. After a period of continuous communication, evangelizing, and discussion, users in the business teams gradually reached a consensus on this set of standards.
We now spend less time on lengthy arguments. Instead, we regularly publish the agreed metrics and analyze the results, which lowers communication costs. For users, delivery becomes more traceable and predictable, and the data team's effort and results are easier to see in the numbers.
Take the asset reuse rate as an example. After more than half a year of model refactoring and the gradual retirement of old models (24,000 tables taken offline in total), our data asset reuse rate rose from 30% to 55%, which means the efficiency of data use has nearly doubled. Compared with figures from the industry, the overall reuse rate of Cloud Music's data assets is at a relatively healthy level, and we are confident that we can raise it further in the next six months.
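The article does not define the asset reuse rate, so the following is only a sketch under an assumed definition: the share of tables that are read by at least two distinct downstream tasks. The function name, the `min_consumers` threshold, and the table and task names are all hypothetical.

```python
from collections import defaultdict

def reuse_rate(lineage_edges, min_consumers=2):
    """Share of tables read by at least `min_consumers` distinct tasks.

    `lineage_edges` is an iterable of (table, downstream_task) pairs;
    a task of None registers a table that has no reader at all.
    This definition is an assumption, not the team's actual metric.
    """
    consumers = defaultdict(set)
    tables = set()
    for table, task in lineage_edges:
        tables.add(table)
        if task is not None:
            consumers[table].add(task)
    if not tables:
        return 0.0
    reused = sum(1 for t in tables if len(consumers[t]) >= min_consumers)
    return reused / len(tables)

# Hypothetical lineage: one of three tables has two or more readers.
edges = [
    ("dwd_play", "report_a"),
    ("dwd_play", "report_b"),
    ("dwd_like", "report_a"),
    ("tmp_x", None),
]
rate = reuse_rate(edges)
```

Under any such definition, retiring unused tables shrinks the denominator and refactoring toward shared models grows the numerator, which is consistent with the two levers described above.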
2.3 Cost reduction and efficiency increase
The recent environment has made many companies feel the chill. With the business under pressure from several directions, “cost reduction and efficiency increase” matters more than ever: it improves a company's ability to survive, builds up potential for faster growth later, and has become a strategic measure. The Cloud Music data team's work in this area falls roughly into three parts:
First, we have to work out the overall bill. Here we owe thanks to our Youshu colleagues: although the monthly bill is startling, the overall charges, the spend of each service category, and the corresponding change log are all listed very clearly. That let us put our energy into taking stock of the utilization level of each service. With total and per-category spend plus the utilization level of each service, we have a clear picture of the overall bill, as shown in the following table (partial inventory of the music offline cluster):
With the cost map in hand, the next step is to drill costs down to business line, team, and individual. In doing so we ran into problems such as unclear task ownership and missing lineage data, which again shows how important metadata is to data governance. Fortunately, we and our Youshu colleagues had spent a quarter getting the metadata into shape, so that it is available and can be visualized.
Here is a brief outline of our overall plan, with two key points: 1. As described earlier, get the metadata right first, so that governance results can be trusted and the governance process can be controlled precisely. 2. Prioritize optimizing storage and computing, which account for most of the cost.
Beyond the directions and project goals set in the outline, we approach the actual rollout from two sides. The forward push is positional warfare: governance targets that can be found by scanning metadata are relatively easy to advance, can be handled iteratively, and produce results step by step. The reverse push is the hard fight: many resource-hungry pipelines look normal and feed normal downstream reports and features, yet after years of iteration much of that data and functionality is no longer used or has better alternatives. These have to be inventoried starting from final usage and then pushed to resolution. This process has to rely on data and has to stay focused on priorities. In communication, it even takes a bit of fighting spirit.
As a team of programmers, we know that campaign-style cleanup can deliver results only in the short term and will not work in the long run; technical optimization is still the primary productive force. For example, one of the more significant results in the music business came from the Spark3+Z-order+Zstd upgrade carried out together with Youshu. Since the release of Spark 3, the team had been following important features such as AQE, hoping to improve the performance and stability of Spark jobs across the cluster by optimizing the execution plans of a large number of SQL jobs. Besides AQE, it was also recommended that we introduce Z-order to improve the file compression ratio and thereby the overall efficiency of storage resources. In the first half of the year, after several rounds of iteration, testing, and task migration, we completed:
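The drill-down described above can be sketched as a roll-up over task-level costs. This is a minimal illustration, not the team's tooling: the input shapes, the `attribute_costs` name, and the `UNATTRIBUTED` bucket are assumptions. The bucket makes missing ownership metadata visible instead of silently dropping its cost.

```python
from collections import defaultdict

UNKNOWN = ("UNATTRIBUTED", "UNATTRIBUTED", "UNATTRIBUTED")

def attribute_costs(task_costs, task_owners):
    """Roll task costs up to business line, team, and person.

    task_costs:  {task_id: cost}
    task_owners: {task_id: (business_line, team, person)}
    Returns totals keyed by a prefix of the owner tuple, for example
    ("music",), ("music", "dw"), ("music", "dw", "alice").
    """
    totals = defaultdict(float)
    for task_id, cost in task_costs.items():
        owner = task_owners.get(task_id) or UNKNOWN
        for level in range(1, 4):
            totals[owner[:level]] += cost
    return dict(totals)

# Hypothetical tasks; t3 has no owner recorded in the metadata.
costs = {"t1": 10.0, "t2": 5.0, "t3": 2.0}
owners = {
    "t1": ("music", "dw", "alice"),
    "t2": ("music", "dw", "bob"),
}
totals = attribute_costs(costs, owners)
```

The size of the `UNATTRIBUTED` total is itself a useful governance signal: it shows how much spend cannot yet be assigned to anyone because ownership or lineage metadata is incomplete.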
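As an illustration of the forward push, a metadata scan might list tables that nobody has read for some time, largest first. The field names (`name`, `size_tb`, `last_read`) and the 90-day threshold are assumptions made for this sketch, not the team's actual rules.

```python
from datetime import date, timedelta

def find_idle_tables(table_meta, today, idle_days=90):
    """Return tables with no read in `idle_days`, largest first.

    table_meta: list of dicts with hypothetical keys
    "name", "size_tb", and "last_read" (a date, or None if never read).
    """
    cutoff = today - timedelta(days=idle_days)
    idle = [
        m for m in table_meta
        if m["last_read"] is None or m["last_read"] < cutoff
    ]
    return sorted(idle, key=lambda m: m["size_tb"], reverse=True)

# Hypothetical metadata snapshot.
meta = [
    {"name": "ads_old_report", "size_tb": 12.0, "last_read": date(2022, 1, 5)},
    {"name": "dwd_play", "size_tb": 300.0, "last_read": date(2022, 6, 29)},
    {"name": "tmp_backfill", "size_tb": 40.0, "last_read": None},
]
idle = find_idle_tables(meta, today=date(2022, 6, 30))
idle_names = [m["name"] for m in idle]
```

A scan like this only finds candidates. The reverse push still needs a person to confirm with downstream users that a table or report is truly unused before it is taken offline.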
Hive to Spark 3.1: upgraded 266 Hive tasks, which account for 95% of resource consumption; after optimization, execution time fell by more than 60% and computing resource cost fell by 60%;
Spark 2 to Spark 3: upgraded 631 Spark 2 tasks, which account for 90% of resource consumption; after optimization, resource usage fell by 28.71% and performance improved by 52.07%;
Spark 3.1 + Z-order + gzip dedicated governance: upgraded and optimized 170 tasks, saving 68% of storage, or 55T of storage per day on average, equivalent to a storage cost of 798.3W per year.
In terms of computing resources, after this series of upgrades the cluster remained stable over a long period, which provided a better guarantee for the subsequent “baseline 530” project (moving core output forward to 5:30 each day).
In terms of storage, daily growth has slowed from 170T per day to 50T per day (part of which comes from lifecycle management).
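To make the Z-order idea concrete, the sketch below interleaves the bits of two integer column values into one sort key (a Morton code). It illustrates the general technique only; it is not how Spark or the team's platform implements Z-order, and the function name and 16-bit width are choices made for this example.

```python
def z_value(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one integer.

    Bit i of x goes to position 2*i and bit i of y to position 2*i + 1,
    so rows that are close in both columns get nearby sort keys.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

# Sort a small 4 x 4 grid of (x, y) rows by their Z-value.
rows = [(x, y) for x in range(4) for y in range(4)]
ordered = sorted(rows, key=lambda r: z_value(r[0], r[1]))
first_block = ordered[:4]
```

Sorting by a single column clusters only that column. Sorting by the interleaved key keeps similar values of both columns together in the same files, which is what helps columnar compression and lets a query skip files on either column.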
Some systematic thinking
We sat down and summarized the data governance work of the recent period, which can be captured in one diagram:
(1) Methodology: Practice without theoretical guidance is blind. Data warehouse modeling and data governance have accumulated a fairly mature body of theory, from the well-known DAMA data management body of knowledge to the summaries that experts on various industry teams have written about data governance, and these give us very good theoretical and methodological guidance.
On the other hand, these methods are essentially a toolbox. Different teams are at different stages of development and have different business requirements, so their data governance directions and goals differ too, and they have to pick different tools to solve their problems.
(2) Standards: In the data lifecycle governance we are doing at Cloud Music at this stage, standardization runs through the whole process: before, during, and after the work. Beforehand, establish consensus and clear, quantifiable goals, ideally signed off by all parties. Examples are our quality and stability SLA standards, the three-degree evaluation standards for data assets, and the quantitative evaluation standards for resource utilization. During the work, break each step of execution down into standard actions, including the process itself and the SOP at each node (such as our operations rules and development red lines), so that the people doing the work have rules to follow. Afterwards, measure the quality and progress of what was done against the goals and the evaluation criteria, and use that to start the next cycle.
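A quantifiable standard can be as simple as a deadline plus a rate. The sketch below assumes, hypothetically, that a core task breaches its baseline when it finishes after a daily deadline (05:30 is borrowed from the “baseline 530” project mentioned earlier) and computes the share of runs that breached. The article does not give the team's actual SLA definition.

```python
from datetime import time

def breach_rate(finish_times, deadline=time(5, 30)):
    """Fraction of runs that finished after `deadline`.

    finish_times: list of datetime.time values, one per daily run.
    Treating "later than the deadline" as a breach is an assumption.
    """
    if not finish_times:
        return 0.0
    late = sum(1 for t in finish_times if t > deadline)
    return late / len(finish_times)

# Hypothetical finish times for four daily runs of a core task.
runs = [time(5, 10), time(5, 45), time(4, 0), time(6, 0)]
rate = breach_rate(runs)
```

Publishing a number like this on a fixed schedule is what lets a separate team review it, rather than relying on the data team's own account of how stable things are.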
(3) Organization: Data governance must serve the vision and strategy of the business. It is not a one-size-fits-all task detached from its environment, and the problems it is meant to solve need to be clearly defined. At Cloud Music, the business has entered a mature phase. The demands of fine-grained operations, such as the quality and ease of data extraction, together with the overall push for cost reduction and efficiency, led us to set quality, assetization, and cost reduction and efficiency increase as our goals.
On execution and the division of roles, we believe data governance should be built into the production process with everyone taking part. We do not recommend driving it by adding new standalone positions or setting up committees, but by clarifying the division of responsibilities and making the responsible parties accountable. In addition, to keep the data team from being both referee and player, and to build checks and balances into the system so that it runs stably, we carry out many governance processes together with partner teams. For example, for the quality and stability goal, the SLA standard as a whole is led by the data team, but assessment, supervision, reporting, and review are done by the QA team.
(4) Technology: For platform tools, our view is that the specific problem comes first, then the corresponding process and specification, and only then the tool. Although we have a platform development team, we do not aim for big and complete; we aim for practical, and we avoid building wheels that nobody needs.
The above can be seen as the α returns we can get from data governance. But the innovation and evolution of new technology can sometimes exceed our imagination, so capturing the β returns of new technology requires us to keep tracking new developments and to hold regular technical exchanges with the Hangzhou Research Institute's public technology team. This is something we must keep doing. Past work of this kind has brought us many benefits, and we thank the teams involved for their support.
Zhu Yifei is the head of data development at NetEase Cloud Music, mainly responsible for data asset construction, platform tooling, and productization.