This article introduces Meituan's approach to managing large-scale clusters and designing a sound cluster scheduling system, and discusses the issues, challenges, and corresponding strategies that Meituan focused on when implementing cloud-native technologies such as Kubernetes. It also covers the special support built for Meituan's business scenarios. We hope this article helps or inspires readers interested in the cloud-native field.

  • Introduction to the cluster scheduling system

  • The challenges of large-scale cluster management

  • Trade-offs in designing cluster scheduling systems

  • The evolution of Meituan's cluster scheduling system

  • Multi-cluster unified scheduling: improving data center resource utilization

  • Scheduling engine service: enabling the cloud-native implementation of PaaS services

  • Future prospects: building a cloud-native operating system

Cluster scheduling systems play an important role in enterprise data centers, and as the number of clusters and applications continues to grow, the complexity developers face in solving business problems has increased significantly. How can one manage large-scale clusters and design a sound cluster scheduling system that ensures stability, reduces costs, and improves efficiency? This article answers these questions one by one.
| Note: This article was first published in the developer column of “New Programmer 003”, whose theme is the cloud-native era.

A cluster scheduling system, also known as a data center resource scheduling system, is commonly used to solve the resource management and task scheduling problems of a data center. Its goals are to use data center resources effectively, improve resource utilization, and provide automated operation and maintenance (O&M) capabilities for business teams, reducing the cost of service O&M. Well-known cluster scheduling systems in the industry include the open-source OpenStack, YARN, Mesos, and Kubernetes, as well as Google's Borg, Microsoft's Apollo, Baidu's Matrix, and Alibaba's Fuxi and ASI.
As the core IaaS infrastructure of Internet companies, cluster scheduling systems have undergone many architectural evolutions over the past decade. As services evolved from monolithic architectures to SOA (service-oriented architecture) and then to microservices, the underlying IaaS facilities gradually moved from the era of bare-metal physical machines into the era of containers. While the core issues we deal with have not changed over the course of this evolution, the complexity of the problems has grown exponentially with the dramatic expansion of cluster sizes and application counts. This article explains the challenges of large-scale cluster management and the design ideas behind cluster scheduling systems, and uses the implementation of Meituan's cluster scheduling system as an example to describe a series of cloud-native practices: continuously improving resource utilization by building a unified multi-cluster scheduling service, providing Kubernetes engine services to empower PaaS components, and delivering a better computing service experience for the business.
As we all know, rapid business growth leads to a surge in server fleets and data centers. For developers, two problems must be solved in large-scale cluster scheduling scenarios:
  1. How to manage large-scale cluster deployment and scheduling in data centers, especially across data centers: how to provide resource elasticity and scheduling capabilities, improve resource utilization as much as possible while ensuring application service quality, and substantially reduce data center costs.
  2. How to transform the underlying infrastructure and build a cloud-native operating system for business teams: improving the computing service experience, realizing automatic disaster recovery and automated application deployment and upgrades, reducing the mental burden of managing underlying resources, and letting business teams focus on the business itself.
To solve these two problems in a real production environment, they can be broken down into the following four challenges of operating and managing large-scale clusters:
  1. How to meet users' diverse needs and respond quickly. As a platform service, a cluster scheduling system must deliver features quickly and meet business needs in a timely manner; on the other hand, the platform must also be generic enough, abstracting personalized business requirements into general capabilities that can be implemented on the platform and iterated over the long term. This is a great test of the platform team's planning for technical evolution: without care, the team gets stuck in endless business feature development, meeting business needs at the cost of duplicated, low-level work.
  2. How to improve the resource utilization of online-application data centers while ensuring application service quality. Resource scheduling has long been a recognized hard problem in the industry, and as the cloud computing market grows rapidly and vendors keep increasing their investment in data centers, the low resource utilization of data centers makes the problem worse. Gartner research found that the CPU utilization of servers in global data centers is only 6%-12%, and even Amazon Elastic Compute Cloud (EC2) achieves only 7%-17% resource utilization, which shows how serious the waste is. The root cause is that online applications are very sensitive to resource contention, so the industry has to reserve extra resources to guarantee the quality of service (QoS) of important applications. When workloads are co-located, the cluster scheduling system needs to eliminate interference between applications and isolate their resources from one another.
  3. How to provide automated instance exception handling for applications, especially stateful applications, shield differences between data centers, and reduce users' exposure to the underlying layers. As service scale keeps growing and the cloud computing market matures, distributed applications are often deployed in data centers in different regions, or even across different cloud environments, in multi-cloud or hybrid-cloud deployments. The cluster scheduling system needs to provide unified infrastructure for business teams, implement a hybrid multi-cloud architecture, and shield the underlying heterogeneous environments, while reducing the complexity of application O&M, improving automation, and providing a better O&M experience for services.
  4. How to resolve the performance and stability risks of cluster management caused by a single oversized cluster or by too many clusters. The complexity of managing the lifecycle of clusters grows with their size and number. Taking Meituan as an example, our two-region, multi-center, multi-cluster solution avoided the hidden dangers of oversized clusters to some extent and solved problems such as service isolation and regional latency. With the emergence of edge-cluster scenarios and cloud-migration requirements for PaaS components such as databases, the number of small clusters will clearly trend upward. This brings a significant increase in cluster management complexity, monitoring configuration costs, and O&M costs, so the cluster scheduling system needs to provide effective operating norms and ensure operational safety, alarm self-healing, and change efficiency.
To meet these challenges, a good cluster scheduler plays a key role. But no system is ever perfect, so when designing a cluster scheduling system, we need to make trade-offs along several axes of tension according to the actual scenario:
  1. System throughput versus scheduling quality. System throughput is an important criterion for evaluating a system, but in a cluster scheduling system for online services, scheduling quality matters more. Each scheduling decision has a long-term impact (days, weeks, or even months) because placements are not adjusted unless an anomaly occurs, so a poor scheduling result directly degrades service latency. Higher scheduling quality means more constraints to compute, which hurts scheduling performance and lowers system throughput.
  2. Architectural complexity versus scalability. The more functions and configurations the system exposes to upper-layer PaaS users, and the more features it supports to improve the user experience (such as application resource preemption and reclamation, or self-healing of abnormal application instances), the higher the system's complexity and the more likely its subsystems are to conflict with one another.
  3. Reliability versus single-cluster scale. The larger a single cluster, the larger its schedulable scope, but also the greater the reliability challenge, because the blast radius grows and failures have a bigger impact. With smaller single clusters, scheduling concurrency improves, but the scheduling scope shrinks, scheduling failures become more likely, and cluster management complexity grows.
At present, cluster scheduling systems in the industry can be divided into five architectures: the monolithic scheduler, the two-level scheduler, the shared-state scheduler, the distributed scheduler, and the hybrid scheduler (see Figure 1 below). Each has made different choices according to its own scenario requirements, and there is no absolute better or worse.

Figure 1 Cluster scheduling system architecture classification (from Malte Schwarzkopf – The evolution of cluster scheduler architectures)
  • The monolithic scheduler uses sophisticated scheduling algorithms combined with the cluster's global information to compute high-quality placements, but at the cost of high latency. Representative systems are Google's Borg and the open-source Kubernetes.
  • The two-level scheduler addresses the limitations of the monolithic scheduler by separating resource scheduling from job scheduling. It allows different job-scheduling logic per application while preserving the sharing of cluster resources between jobs, but it cannot support preemption by high-priority applications. Representative systems are Apache Mesos and Hadoop YARN.
  • The shared-state scheduler addresses the limitations of the two-level scheduler in a semi-distributed way: each scheduler holds a copy of the cluster state and updates it independently. Once a local state replica changes, the state information of the entire cluster is updated, but continuous resource contention can degrade scheduler performance. Representative systems are Google's Omega and Microsoft's Apollo.
  • The distributed scheduler uses relatively simple scheduling algorithms to achieve large-scale, high-throughput, low-latency parallel task placement, but because the algorithms are simple and lack a global view of resource usage, it is hard to achieve high-quality placements. A representative system is UC Berkeley's Sparrow.
  • The hybrid scheduler spreads workloads across centralized and distributed components, using complex algorithms for long-running tasks and relying on distributed placement for short-running tasks. Microsoft's Mercury takes this approach.
Therefore, how good a scheduling system is mainly depends on the actual scheduling scenario. Take YARN and Kubernetes, the two systems most widely used in the industry: although both are general-purpose resource schedulers, YARN focuses on short-running offline batch tasks, while Kubernetes focuses on long-running online services. Beyond their architectural and functional differences (Kubernetes is a monolithic scheduler, YARN a two-level scheduler), their design philosophies and perspectives also differ. YARN focuses on tasks and on resource reuse, avoiding multiple remote copies of data, with the goal of executing tasks at lower cost and higher speed; Kubernetes focuses on service state, on peak shifting, service profiling, and resource isolation, with the goal of guaranteeing service quality.
In the process of containerization, Meituan switched the core engine of its cluster scheduling system from OpenStack to Kubernetes to match its business scenarios, and by the end of 2019 reached its stated goal of over 98% containerization coverage for online services. However, we still faced problems such as low resource utilization and high O&M costs:
  • The overall resource utilization of the clusters was not high. For example, average CPU utilization was only at the industry average, a large gap from other first-tier Internet companies.
  • The containerization rate of stateful services was insufficient. In particular, MySQL, Elasticsearch, and other products did not run in containers, leaving large room to optimize service O&M costs and resource costs.
  • From the perspective of business needs, VM products will exist for a long time, and VM scheduling and container scheduling were two separate environments, making the O&M cost of the team's virtualization products high.
Therefore, we decided to start a cloud-native transformation of the cluster scheduling system: build a large-scale, highly available scheduling system with multi-cluster management and automated O&M capabilities, support scheduling-policy recommendation and self-service configuration, provide cloud-native underlying extensibility, and improve resource utilization while ensuring application service quality. The core work revolves around three directions: ensuring stability, reducing costs, and improving efficiency.
  • Ensuring stability: improve the robustness and observability of the scheduling system; reduce coupling between system modules and lower complexity; improve the automated O&M capabilities of the multi-cluster management platform; optimize the performance of the system's core components; and ensure availability at large cluster scale.
  • Reducing costs: deeply optimize the scheduling model and connect the cluster-scheduling and single-machine-scheduling links; move from static to dynamic resource scheduling; and introduce offline service containers to combine free competition with strong control, improving resource utilization and reducing IT costs while guaranteeing service quality for high-priority business applications.
  • Improving efficiency: support users in adjusting scheduling policies themselves to meet personalized business requirements; actively embrace the cloud-native field; and provide core capabilities for PaaS components, including orchestration, scheduling, cross-cluster support, and high availability, to improve O&M efficiency.

Figure 2 Meituan cluster scheduling system architecture
The Meituan cluster scheduling system is divided into three layers by domain (see Figure 2 above): the scheduling platform layer, the scheduling policy layer, and the scheduling engine layer:
  • The platform layer is responsible for service access, integrating with Meituan's infrastructure, encapsulating native interfaces and logic, and providing container management interfaces (scale-out, update, restart, and scale-in; a hypothetical sketch of this interface follows the list).
  • The policy layer provides unified scheduling capability across multiple clusters, continuously optimizes scheduling algorithms and policies, and improves CPU utilization and allocation rates through scheduling strategies based on service tiers and sensitive resources.
  • The engine layer provides Kubernetes services to ensure the stability of cloud-native clusters of multiple PaaS components, and sinks general-purpose capabilities to the orchestration engine to reduce the access cost of cloud-native services.
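As a rough illustration of the platform layer's container management interface named above, here is a minimal sketch. The method names and signatures are invented for this article, not Meituan's real API:

```python
from typing import Protocol

class ContainerOps(Protocol):
    """Hypothetical shape of the platform layer's container management
    interface (the four operations named above); not Meituan's real API."""

    def scale_out(self, service: str, replicas: int) -> None: ...
    def update(self, service: str, image: str) -> None: ...
    def restart(self, service: str, instance_id: str) -> None: ...
    def scale_in(self, service: str, replicas: int) -> None: ...
```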
Through refined operations and product polishing, we now manage nearly one million container/VM instances at Meituan, have raised resource utilization from the industry average to a first-class level, and also support the containerization and cloud-native adoption of PaaS components.
Resource utilization is one of the most important metrics for evaluating a cluster scheduling system. Although we completed containerization in 2019, containerization is not the end but the means. Our goal is to bring more benefits to users by switching from the VM technology stack to the container technology stack, such as reducing users' computing costs across the board.
After a scale-out, a service container may be scheduled onto a hotspot host, causing fluctuations in service performance metrics such as TP95 latency, so that, like others in the industry, we could only guarantee service quality by adding resource redundancy. The root cause is that the Kubernetes scheduling engine allocates only by Request/Limit quotas (Kubernetes uses a request value and a limit value per container as the resource quota the user applies for), which is static resource allocation. As a result, even when different hosts have the same amount of resources allocated, their real resource utilization varies widely because the services they run differ.
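To make the static-allocation pitfall concrete, here is a minimal sketch with invented numbers (not Meituan data): the scheduler sees two identically allocated hosts, while their real load diverges widely.

```python
# Hypothetical illustration: Kubernetes places pods by requested cores
# (Request/Limit), so two hosts can look identical to the scheduler
# even though their real utilization differs greatly.
HOST_CORES = 48  # assumed host size

hosts = {
    "host-a": {"requested": 36, "used": 30.2},  # busy, latency-sensitive services
    "host-b": {"requested": 36, "used": 5.8},   # over-provisioned services
}

for name, h in hosts.items():
    print(f"{name}: allocated {h['requested']}/{HOST_CORES} cores, "
          f"real utilization {h['used'] / HOST_CORES:.0%}")
```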
In academia and industry, two approaches are commonly used to resolve the tension between resource efficiency and application service quality: the first solves the problem from a global perspective with an efficient task scheduler; the second strengthens resource isolation between applications through single-machine resource management. Either way, we need a full grasp of cluster state, so we did three things:
  • Systematically established the association between cluster state, host state, and service state, and, combined with a scheduling simulation platform, considered both peak utilization and average utilization, implementing predictive scheduling based on the historical load of hosts and the real-time load of services (see the sketch after this list).
  • Through a self-developed dynamic load regulation system and a cross-cluster rescheduling system, linked the cluster-scheduling and single-machine-scheduling chains, and applied different service-quality assurance strategies to different resource pools according to service tier.
  • After three iterations, built our own cluster federation service, which better solves the problems of resource pre-occupation and state-data synchronization, improves cross-cluster scheduling concurrency, and realizes compute separation, cluster mapping, load balancing, and cross-cluster orchestration control (see Figure 3 below).
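As a rough sketch of the first point, a load-aware host scorer might look like the following. All field names and weights here are invented for illustration and do not come from Meituan's system:

```python
from dataclasses import dataclass

@dataclass
class Host:
    free_cores: float
    peak_util: float  # historical peak utilization of the host (0..1)
    now_util: float   # real-time utilization of the host (0..1)

@dataclass
class Service:
    request_cores: float
    load: float       # estimated utilization share the service adds (0..1)

def score(host: Host, svc: Service, w_peak: float = 0.6, w_now: float = 0.4) -> float:
    # Combine predicted peak and current utilization after placement;
    # lower predicted load yields a higher score, so hotspot hosts rank last.
    return 1.0 - (w_peak * (host.peak_util + svc.load)
                  + w_now * (host.now_util + svc.load))

def pick_host(hosts: list[Host], svc: Service) -> Host:
    # Filter by static capacity first, then rank by the load-aware score.
    feasible = [h for h in hosts if h.free_cores >= svc.request_cores]
    return max(feasible, key=lambda h: score(h, svc))
```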

Figure 3 Cluster Federation V3 Version Architecture
The third version of the cluster federation service is split by module into a Proxy layer and a Worker layer, which are deployed independently:
  • The Proxy layer selects an appropriate cluster for scheduling based on weighted factors of cluster state, and then selects an appropriate Worker to distribute the request to. The Proxy module uses etcd for service registration, leader election, and discovery; the leader node is responsible for pre-occupation scheduling tasks, while all nodes can serve query tasks.
  • The Worker layer handles the query requests of a subset of clusters; when a cluster's tasks are blocked, a corresponding Worker instance can be quickly scaled out to alleviate the problem. When a single cluster is large, it maps to multiple Worker instances, and the Proxy distributes scheduling requests among them, improving scheduling concurrency and reducing each Worker's load (a simplified sketch of this dispatch flow follows).
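Here is a minimal sketch of the Proxy-layer dispatch flow described above. The weighting factor and field names are invented, and the etcd-based registration and leader election are omitted:

```python
import random

# Hypothetical snapshot of per-cluster state the Proxy layer might weigh.
clusters = [
    {"name": "cluster-1", "free_cores": 1200, "pending": 3,
     "workers": ["worker-1a", "worker-1b"]},  # large cluster, several workers
    {"name": "cluster-2", "free_cores": 400, "pending": 18,
     "workers": ["worker-2a"]},
]

def choose_cluster(candidates):
    # Trade free capacity off against queue backlog; the weight is invented.
    return max(candidates, key=lambda c: c["free_cores"] - 50 * c["pending"])

def dispatch(request):
    cluster = choose_cluster(clusters)
    # Spread requests across the cluster's Worker instances.
    worker = random.choice(cluster["workers"])
    return cluster["name"], worker

print(dispatch({"service": "demo", "replicas": 10}))
```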
Finally, through unified scheduling across multiple clusters, we moved from a static to a dynamic resource scheduling model, reducing the proportion of hotspot hosts and of resource fragments while guaranteeing service quality for high-priority business applications, and raised the average CPU utilization of servers in online service clusters by 10 percentage points. Average cluster resource utilization is computed as Sum(cores currently used by each node) / Sum(total cores of each node), sampled once per minute and averaged over all samples in a day.
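Expressed as a short script (assuming per-minute samples of used and total cores per node), the metric above is:

```python
def daily_avg_cpu_utilization(samples):
    """samples: one snapshot per minute; each snapshot is a list of
    (used_cores, total_cores) tuples, one tuple per node."""
    per_minute = [
        sum(used for used, _ in snap) / sum(total for _, total in snap)
        for snap in samples
    ]
    # Average all per-minute points collected over the day.
    return sum(per_minute) / len(per_minute)

# Example with two nodes and two one-minute samples:
print(daily_avg_cpu_utilization(
    [[(10, 48), (30, 48)], [(20, 48), (25, 48)]]
))
```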
Beyond resource scheduling, a cluster scheduling system also has to solve the problem of how services use computing resources. As the book Software Engineering at Google notes, a cluster scheduling system, as a key component of Compute as a Service, must solve resource scheduling (from physical-machine dimensions down to CPU/memory) and resource contention (the “noisy neighbor” problem), as well as application management (automated instance deployment, environment monitoring, exception handling, maintaining the number of service instances, determining how many resources a service needs, handling different service types, and so on). To some extent, application management is even more important than resource scheduling, because it directly affects a business's DevOps efficiency and disaster recovery; after all, at an Internet company, the cost of people is higher than the cost of machines.
Containerizing complex stateful applications has always been a challenge, because these distributed systems, across their different scenarios, usually maintain their own state machines. When an application scales out or upgrades, ensuring the availability of existing instances and the connectivity between them is a far more complicated problem than for stateless applications. Although we had containerized stateless services, we had not yet realized the full value of a good cluster scheduling system. To manage computing resources well, one must manage the state of services, decouple resources from services, and improve service resilience, which is exactly what the Kubernetes engine is good at.
Based on Meituan's optimized and customized Kubernetes version, we built Meituan's Kubernetes engine service, MKE:
  • Strengthened cluster O&M capabilities: improved automated cluster O&M, including cluster self-healing, an alarm system, and Event log analysis, continuously improving cluster observability.
  • Established key business benchmarks: cooperated deeply with several important PaaS components and quickly optimized user pain points such as Sidecar upgrade management, Operator grayscale iteration, and alarm separation, meeting users' demands.
  • Continuously improved the product experience: kept optimizing the Kubernetes engine, and beyond supporting user-defined Operators, also provides a general scheduling and orchestration framework (see Figure 4 and the sketch below it), helping users access MKE at lower cost and reap the technical dividends.

Figure 4 Meituan Kubernetes Engine Service Scheduling and Orchestration Framework
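To give a flavor of what such an orchestration framework exposes, here is a hedged sketch of a reconcile-style control loop in the spirit of Kubernetes Operators. The interface below is invented for illustration and is not MKE's actual API:

```python
from abc import ABC, abstractmethod

class Reconciler(ABC):
    """Sketch of the Operator pattern: repeatedly converge the observed
    state of a stateful component toward its declared desired state."""

    @abstractmethod
    def desired_state(self, obj): ...

    @abstractmethod
    def observed_state(self, obj): ...

    @abstractmethod
    def converge(self, obj, desired, observed):
        """Issue the actions (create/update/delete instances) that move
        the observed state toward the desired state."""

    def reconcile(self, obj):
        desired = self.desired_state(obj)
        observed = self.observed_state(obj)
        if desired != observed:
            self.converge(obj, desired, observed)
```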
In the process of promoting cloud-native adoption, a widely asked question is: what is the difference between managing stateful applications in the Kubernetes cloud-native way and building one's own management platform, as before?
This question needs to be considered from the root of the problem: operability and maintainability.
  • Building on Kubernetes means the system achieves a closed loop, with no need to worry about the data inconsistencies that often arise between two separate systems.
  • Exception response can happen in milliseconds, reducing the system's RTO (Recovery Time Objective, i.e., the maximum tolerable business downtime: the shortest period required from the occurrence of a failure to the recovery of the business system's service functions).
  • The complexity of system O&M is also reduced, and services gain automatic disaster recovery. Beyond the service itself, the configuration and state data the service depends on can be recovered along with it.
  • Compared with the previous siloed, “chimney-style” management platforms of individual PaaS components, general capabilities can now be sunk into the engine service, reducing development and maintenance costs; and by relying on the engine service, the underlying heterogeneous environments can be shielded, enabling service management across data centers and multi-cloud environments.
We believe that cluster management in the cloud-native era will fully shift from managing hardware and resources to an application-centric cloud-native operating system. Toward this goal, Meituan's cluster scheduling system needs to work on the following fronts:
  • Application-chain delivery management. As service scale and link complexity grow, the O&M complexity of the PaaS components and underlying infrastructure that services depend on has long exceeded common understanding, let alone for newcomers who have just taken over a project. We therefore need to let services be delivered through declarative configuration and operate themselves, providing a better O&M experience, improving application availability and observability, and reducing the burden of managing underlying resources.
  • Edge computing solutions. As Meituan's business scenarios keep expanding, demand for edge computing nodes is growing much faster than expected. We will draw on industry best practices to form an edge solution suited to Meituan, deliver edge-node management capabilities to services that need them as soon as possible, and achieve cloud-edge-device collaboration.
  • Capability building for online-offline colocation. According to the 2019 data-center cluster data disclosed in Google's paper “Borg: the Next Generation”, the resource utilization of online tasks is only about 30% once offline tasks are removed, which also shows that pushing utilization further carries greater risk and a low input-output ratio. Going forward, Meituan's cluster scheduling system will keep exploring online-offline colocation, but because Meituan's offline data centers are relatively independent, our path will differ from the common industry solution: we will start by colocating online services with near-real-time tasks, complete the underlying capability building, and then explore colocating online tasks with offline tasks.
In its overall design, Meituan's cluster scheduling system follows the principle of fitness for purpose: first meet the basic needs of the business and ensure system stability, then gradually improve the architecture, performance, and feature set. Therefore, we chose:
  • Between system throughput and scheduling quality, to prioritize the business's throughput requirements rather than over-pursuing the quality of any single scheduling decision, improving placements through rescheduling instead.
  • Between architectural complexity and scalability, to reduce coupling between system modules, lower system complexity, and require that extended functions be degradable.
  • Between reliability and single-cluster scale, to control single-cluster size through unified multi-cluster scheduling, ensuring system reliability and reducing the blast radius.
In the future, we will continue to optimize and iterate on Meituan's cluster scheduling system along the same lines, transforming it fully into an application-centric cloud-native operating system.
Tan Lin, from Meituan’s basic R&D platform/basic technology department.