vivo Internet Product Team – Wang Xiao

As recommendation scenarios such as advertising and content expand, algorithm models keep evolving and iterating. As the business continues to grow, model training and output urgently need to be managed by a platform. The main business scenarios of the vivo Internet machine learning platform include game distribution, the app store, the shopping mall, and content distribution. This article introduces the thinking and optimization ideas behind the construction and practice of vivo’s internal machine learning platform, from two perspectives: business scenarios and platform functionality.

I. Preface

With the rapid development of the Internet, the exponential growth of data volume, and the continuous improvement of computing power, the industry is investing heavily in AI technology to empower business. Algorithm teams tend to focus on models and tuning, while engineering is a relatively weak link. Building a powerful distributed platform that integrates resource pools and provides a unified machine learning framework can greatly speed up training, improve efficiency, open up more possibilities, and raise resource utilization. We hope this article gives beginners some understanding of the complexity of machine learning platforms and production environments.

II. Business background

As of August 2022, vivo had 280 million online users, and the app store had 70 million daily active users. AI application scenarios are abundant, ranging from speech recognition and image algorithm optimization to common Internet scenarios. Advertising and recommendation demands around business scenarios such as the app store, browser, and game center continue to rise.

How to make model iteration in the recommendation system more efficient, improve the user experience, and improve results in each business scenario is a major challenge for the machine learning platform, as is striking a balance among cost, efficiency, and experience.

As the following figure shows, model processing and application in each scenario form a serial, closed loop: user feedback must be incorporated promptly so that model quality keeps improving. Optimizing efficiency along this chain is what makes a common, efficient platform the key.

III. Design ideas of the vivo machine learning platform

3.1 Functional modules

Based on the chain of business scenarios in the figure above, we can classify the work by function. A general algorithm platform can be divided into three steps: data processing (corresponding to a general feature platform, which provides data support for features and samples), model training (corresponding to a general machine learning platform, which produces trained models), and model serving (corresponding to general model service deployment, which provides online model predictions). Each of the three steps can be self-contained and become an independent platform.

This article focuses on the model training part: the challenges encountered while building the vivo machine learning platform, and the optimization ideas.

1. Data processing: data-related work, including collection, processing, labeling, and storage.

Collection, processing, and storage overlap with the scenarios of a big data platform, while labeling is unique to the algorithm platform.

Data collection: acquiring data from external systems. We use Bees (vivo’s data collection platform) to collect data.

Data processing: importing and exporting data between different data sources, and aggregating and cleaning the data.

Data labeling: attaching human knowledge to the data to generate sample data, so that the trained model can make inferences and predictions on new data.

Data storage: choosing an appropriate storage method according to access characteristics.

2. Model training: the process of creating a model, including feature engineering, experimentation, training, and evaluation.

Feature engineering: algorithm engineers use their knowledge to mine more features from the data, and the transformed data serves as the input to the model.

Experimentation: trying various algorithms, network structures, and hyperparameters to find the model that best solves the current problem.

Model training: mainly the platform’s computing process. The platform should use computing resources effectively, improve productivity, and save costs.

3. Model deployment: deploying the model to the production environment for inference, where it delivers its real value.

Through continuous iteration, we solve the new problems that arise and maintain a high level of service.

4. General platform requirements, such as scalability, O&M support, ease of use, and security.

Because machine learning is in a rapidly changing stage between research and production, flexible scalability of frameworks, hardware, and services is very important. Every team needs some amount of O&M work, and strong O&M capabilities help teams manage service quality effectively and improve production efficiency.

Ease of use is valuable both for small teams getting started and for newcomers learning in large teams, and a good user interface also helps users understand the meaning of the data.

Security is a top priority for any software product, and security problems should be avoided as far as possible during development.

3.2 Model training

Model training consists of two main parts. In the first, algorithm engineers run experiments to find the best model and parameters for a given scenario; we call this “model experimentation”. In the second, the computer trains the model, which mainly depends on the capabilities the platform provides; we call this “model training”.

Modeling is one of the core jobs of algorithm engineers. The modeling process involves a lot of data work, called feature engineering, which mainly adjusts and transforms data. Its main goal is to maximize the value of the data and meet business requirements.

3.2.1 Model experiments

Feature work and hyperparameter tuning are the core tasks of the modeling process. Feature work mainly preprocesses the data so that the model’s input expresses the data better, which improves the quality of the model’s output.

Data and feature engineering determine the upper limit of model quality; algorithms and hyperparameters can only approach that limit.

Hyperparameter tuning includes selecting an algorithm and deciding on the network structure and initial parameters. It relies on the extensive experience of the algorithm engineer and requires the platform to support experiments that test the results.

Feature engineering and hyperparameter tuning are complementary processes. After features are processed, their effect needs to be verified with combinations of hyperparameters. When the effect is not ideal, both feature engineering and hyperparameters need to be reconsidered and improved; the desired effect is reached only after repeated iteration.

3.2.2 Model training

Standardizing the data interface speeds up rapid experimentation and makes experiment results comparable. The underlying layer uses Docker, an operating-system-level virtualization solution, which deploys quickly and allows models to be deployed directly online. Users do not need to customize the training setup further. Batch task submission saves users time, and the platform can compare the experiments for a set of parameter combinations, providing a more user-friendly interface.

In addition, because there are many training directions, the platform needs to plan the allocation of tasks to nodes automatically, and even make reasonable use of idle resources according to load.

IV. vivo machine learning platform in practice

Earlier we introduced the background and development direction of the machine learning platform. Now we describe some of the problems users face and the ideas the platform uses to solve them.

4.1 Platform Capability Matrix

The main goal of the machine learning platform is to focus deeply on model training, helping users make model decisions and deploy models faster.

With this goal, work is split into two directions: optimizing the training framework to support distributed computing for large-scale models, and optimizing scheduling to support batch execution of models.

In terms of scheduling, the platform started with native k8s scheduling, where scheduling one training job at a time was inefficient. It was then upgraded to kube-batch batch scheduling, with fine-grained hybrid-cloud orchestration as the goal. At present it mainly takes the form of flexible scheduling strategies.
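One reason batch schedulers suit distributed training is all-or-nothing ("gang") admission: a job starts only when all of its workers can be placed, so a half-started job never holds resources while waiting. The sketch below illustrates that general idea only. It is not kube-batch's implementation or vivo's scheduler, and the function name, job names, and slot counts are hypothetical.

```python
def gang_schedule(jobs, free_slots):
    """All-or-nothing admission for queued training jobs.

    jobs: list of (job_name, min_workers) tuples in queue order.
    free_slots: number of worker slots currently available.
    Returns (started_job_names, remaining_free_slots).
    """
    started = []
    for name, min_workers in jobs:
        if min_workers <= free_slots:
            # Every worker of this job fits, so the whole job starts.
            free_slots -= min_workers
            started.append(name)
        # Otherwise the job keeps waiting and holds no slots at all,
        # so it cannot block another job while partially started.
    return started, free_slots


# Hypothetical queue: job "b" needs more slots than remain, so it waits.
started, remaining = gang_schedule([("a", 4), ("b", 6), ("c", 2)], free_slots=8)
print(started, remaining)
```

A real scheduler would also consider priorities and per-node capacity; this sketch treats the cluster as a single pool of identical slots.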

In terms of the training framework, the platform started with native TensorFlow models. As features and sample sizes grew, we developed vlps, a self-developed ultra-large-scale training framework. The platform currently uses a new combined framework of TensorFlow + vlps.

4.2 Introduction to Platform Capabilities

Platform capability building mainly revolves around model experimentation and model training. The key to our practice is solving the pain points and difficulties users encounter along the way. The training framework is also an embodiment of the platform’s key capabilities, and it is continuously optimized as business complexity grows.

The platform now covers the model debugging work of the company’s internal algorithm engineers, and has reached a scale on the order of 100 million samples and tens of billions of features.

4.2.1 Resource Management

Pain point:

Machine learning platforms are compute-intensive.

Business scenarios differ, so should resources be fully partitioned by business group?

If the resource pools are divided too finely, resource utilization will be low, and a surge in demand from one business cannot be met.

When resources are insufficient for business needs, tasks queue up and models cannot be updated in time.

How to manage computing power well, balancing efficiency gains against cost reduction, is a core issue of platform resource management.

Solution Ideas:

The basic idea of resource management is to centralize all computing resources and allocate them on demand, so that resource utilization is as close to 100% as possible. Resources of any size are valuable.

For example, when a user has only one compute node and several compute tasks, resource management can use a queue to reduce the idle time between tasks, which is much more efficient than starting each task manually. With multiple compute nodes, resource management can automatically plan tasks and node allocation so that nodes stay in use as much as possible, without anyone planning resources and starting tasks by hand. With multiple users, resource management can make reasonable use of the idle resources of other users or groups according to load. As the number of nodes grows, supporting more business with limited computing power is the only way forward.
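The queueing idea above can be sketched in a few lines. This is a minimal illustration, not the platform's scheduler: each queued task starts on whichever node becomes free first, so no node sits idle while work is waiting. The function name and the task durations are hypothetical.

```python
import heapq


def schedule(task_durations, num_nodes):
    """Assign queued tasks to nodes; each task starts on the node that frees up first.

    Returns (assignments, makespan), where assignments is a list of
    (task_index, node_id, start_time) tuples in queue order.
    """
    # Min-heap of (time_the_node_becomes_free, node_id); all nodes are free at t=0.
    free_at = [(0, node_id) for node_id in range(num_nodes)]
    heapq.heapify(free_at)
    assignments = []
    makespan = 0
    for task_index, duration in enumerate(task_durations):
        start, node_id = heapq.heappop(free_at)
        end = start + duration
        assignments.append((task_index, node_id, start))
        heapq.heappush(free_at, (end, node_id))
        makespan = max(makespan, end)
    return assignments, makespan


queue = [4, 2, 3, 1]  # hypothetical task durations, in hours
_, one_node_makespan = schedule(queue, num_nodes=1)
two_node_plan, two_node_makespan = schedule(queue, num_nodes=2)
print(one_node_makespan, two_node_makespan)
```

With one node the queue simply removes the gaps between tasks; with several nodes the same loop also decides placement, which is the "automatic planning" described above.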

1. Limit resource abuse with quotas:

Add quota groups and individual quotas to reduce interference between businesses and meet each group’s resource needs as far as possible. Quota groups support temporary expansion and sharing to handle occasional surges in demand. Once quotas are in place, users can only work within limited resources, which encourages them to prioritize their own high-priority training.
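A minimal sketch of quota groups with sharing might look like the following. It assumes resources are counted in abstract units and that a group may temporarily borrow idle units from groups that agree to lend; the class, its fields, and the group names are hypothetical, and returning borrowed units is left out for brevity.

```python
class QuotaGroup:
    """A group's quota, counted in abstract resource units (for example CPU cores)."""

    def __init__(self, name, quota):
        self.name = name
        self.quota = quota    # guaranteed units for this group
        self.adjustment = 0   # temporary expansion (+) or units lent out (-)
        self.used = 0

    def limit(self):
        return self.quota + self.adjustment

    def idle(self):
        return max(self.limit() - self.used, 0)


def try_allocate(group, amount, lenders=()):
    """Admit a request if it fits the group's limit, borrowing idle units if needed."""
    shortfall = group.used + amount - group.limit()
    if shortfall > 0:
        if sum(lender.idle() for lender in lenders) < shortfall:
            return False  # over quota and nothing to borrow: the request must wait
        for lender in lenders:
            lend = min(lender.idle(), shortfall)
            lender.adjustment -= lend
            group.adjustment += lend
            shortfall -= lend
            if shortfall == 0:
                break
    group.used += amount
    return True


ads = QuotaGroup("ads", quota=10)
games = QuotaGroup("games", quota=10)
first = try_allocate(ads, 8)                   # fits within the ads quota
surge = try_allocate(ads, 5, lenders=[games])  # borrows 3 idle units from games
blocked = try_allocate(games, 8)               # games now has only 7 units left
print(first, surge, blocked)
```

The last call shows the trade-off of sharing: lending helps absorb another group's surge, but the lender's own headroom shrinks until the units are returned.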

2. Promote resource optimization with scheduling:

Add a new production environment: once a model is confirmed to be iterating normally with reasonable resource utilization, it is switched to this high-performance environment, which provides a higher-performance resource pool. The platform also provides a scheduling scoring mechanism that looks at dimensions such as resource granularity and configuration rationality, so that reasonably configured training tasks are started sooner and scheduling is less likely to get stuck.

After the multi-dimensional scheduling scoring mechanism went live, the number of unreasonably configured training tasks on the platform dropped significantly, and resource efficiency improved.

The scoring covers, but is not limited to, the following dimensions: maximum runtime, queuing time, CPU, memory, and GPU granularity, and total demand.
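To make the scoring idea concrete, here is a small sketch that combines several of the dimensions listed above into one number used to order pending tasks. The weights, thresholds, and field names are all hypothetical; the article does not give the platform's actual formula, and GPU granularity is omitted to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class TrainingTask:
    name: str
    max_runtime_hours: float
    queued_minutes: float
    cpu_per_worker: int
    memory_gb_per_worker: int
    workers: int


def schedule_score(task):
    """Higher score means the task is scheduled earlier. All numbers are illustrative."""
    score = 0.0
    # Reward time spent waiting (capped at 60 minutes) so that no task starves.
    score += min(task.queued_minutes, 60) / 60 * 30
    # Prefer tasks that declare a short maximum runtime.
    score += 30 if task.max_runtime_hours <= 4 else 10
    # Penalize coarse-grained workers, which are hard to fit onto a node.
    if task.cpu_per_worker > 16 or task.memory_gb_per_worker > 64:
        score -= 20
    # Penalize very large total demand.
    if task.cpu_per_worker * task.workers > 200:
        score -= 20
    return score


small = TrainingTask("small", 2, 30, 4, 16, 10)
oversized = TrainingTask("oversized", 24, 30, 32, 128, 20)
pending = sorted([oversized, small], key=schedule_score, reverse=True)
print([task.name for task in pending])
```

Because the score is visible to users, a scheme like this also nudges them toward reasonable configurations rather than only reordering the queue.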

4.2.2 Self-development of the framework

Pain point:

As sample and feature scale grows, the framework’s performance bottlenecks become prominent, and the efficiency of inference computation needs to improve.

Development Path:

Each stage of development is driven mainly by business volume, seeking the best training framework for it. Each framework upgrade is packaged as an image to support more model training.

Current effect:

4.2.3 Training Management

Pain point:

How can the platform support a variety of distributed training frameworks to meet algorithm engineers’ business requirements, so that users do not need to care about underlying machine scheduling and O&M? How can algorithm engineers quickly create training tasks, run them, and view training status? These are the keys to training management.

Solution Ideas:

Users upload code to the platform’s file server or to git, where the platform can read it, and can quickly launch a distributed training task by filling in the appropriate parameters on the platform. The platform also supports OpenAPI, so developers can complete machine learning work without the console.

The configuration of a training task is divided into basic settings, resource settings, scheduling dependency settings, alarm settings, and advanced settings. When experimenting with hyperparameters, it is often necessary to test a set of parameter combinations.

Batch task submission saves users time. The platform can also compare this set of results directly, providing a friendlier interface. Training scripts are read from the file server or git, so training can start quickly.
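Batch submission can be thought of as expanding one base configuration into a task per parameter combination, then comparing the results side by side. The sketch below shows that expansion and a simple comparison. The configuration fields, the `auc` metric, and the function names are assumptions for illustration; they are not the platform's OpenAPI.

```python
from itertools import product


def expand_batch(base_config, param_grid):
    """Return one task config per combination of the values in param_grid."""
    names = list(param_grid)
    tasks = []
    for values in product(*(param_grid[name] for name in names)):
        config = dict(base_config)        # copy, so the base config is not mutated
        config.update(zip(names, values))
        tasks.append(config)
    return tasks


def best_run(results, metric="auc"):
    """Pick the result dict with the highest value of `metric`."""
    return max(results, key=lambda result: result[metric])


base = {"script": "train.py", "workers": 4}  # hypothetical base settings
grid = {"learning_rate": [0.1, 0.01], "batch_size": [256, 512]}
tasks = expand_batch(base, grid)
print(len(tasks), tasks[0])

# Hypothetical results reported back by two finished runs.
winner = best_run([{"task": 0, "auc": 0.71}, {"task": 1, "auc": 0.74}])
print(winner)
```

Generating the tasks from a grid keeps every run identical except for the parameters under test, which is what makes the comparison meaningful.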

1. Create training visually and efficiently

2. Modify scripts accurately and quickly

3. Monitor training changes in real time

4.2.4 Interactive Development

Pain point:

Debugging scripts is costly for algorithm engineers. Both algorithm engineers and big data engineers need to debug scripts online, running code directly in the browser with the results displayed below each code block.

Solution Ideas:

Experimenting and developing in interactive tools such as Jupyter notebooks provides a WYSIWYG interactive experience that is very convenient for debugging code.

Interactive experimentation requires exclusive use of computing resources, so the machine learning platform needs to be able to reserve computing resources for users. If computing resources are limited, the platform can cap the total resources each user may request and set a timeout.

For example, if a user does not use a reserved resource for a week, it is reclaimed. After the resource is reclaimed, the user’s data can still be retained, and when the user requests a resource again, the previous work can be restored. In a small team it may be more convenient for each person to reserve a machine and decide how to use it, but unified management through a machine learning platform yields higher resource utilization. Teams can then focus on solving business problems without having to deal with matters unrelated to the business, such as the computer’s operating system and hardware.
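The two controls described above, a per-user cap and an idle timeout, can be sketched as follows. The one-week timeout comes from the example in the text; the cap value, the function names, and the user names are hypothetical.

```python
from datetime import datetime, timedelta

IDLE_TIMEOUT = timedelta(days=7)  # the one-week example from the text
MAX_CORES_PER_USER = 16           # hypothetical per-user cap


def can_reserve(currently_reserved, requested, cap=MAX_CORES_PER_USER):
    """Allow a reservation only if the user's total stays within the cap."""
    return currently_reserved + requested <= cap


def find_reclaimable(last_used, now, timeout=IDLE_TIMEOUT):
    """Return the users whose reservations have been idle for longer than timeout.

    last_used: dict mapping user name -> datetime of the user's last activity.
    """
    return sorted(user for user, ts in last_used.items() if now - ts > timeout)


now = datetime(2022, 8, 15)
last_used = {"alice": datetime(2022, 8, 1), "bob": datetime(2022, 8, 14)}
reclaimable = find_reclaimable(last_used, now)
print(reclaimable)
```

Only the compute reservation is reclaimed in this sketch; as the text notes, the user's data would be kept so the work can be restored after a new request.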

V. Summary

At present, the vivo machine learning platform supports offline training of algorithms for vivo’s Internet business, allowing algorithm engineers to focus on iterating and optimizing model strategies and thereby empower the business. In the future, we will continue to explore the following areas:

1. Connect platform capabilities end to end

Features, samples, and models are currently read through HDFS, and alarms and log information on the platform are not yet correlated. These platform capabilities can be connected in the future;

The platform’s data and models also have room for standardization, which would reduce learning costs and improve the efficiency of model development.

2. Strengthen pre-research at the framework level

Study the impact of different distributed training frameworks on model performance and adapt to different business scenarios;

Provide preset parameters to decouple algorithms, engineering, and the platform, and lower the barrier to entry for users.
