This article introduces Meituan's practice in managing large-scale clusters and designing a sound cluster scheduling system, and discusses the issues, challenges, and corresponding strategies Meituan cares about most when implementing cloud-native technologies represented by Kubernetes. It also introduces some special support built for Meituan's business scenarios, in the hope that this article can help or inspire readers interested in the cloud-native field.
Contents:

- Introduction to the Cluster Scheduling System
- The Challenges of Large-Scale Cluster Management
  - Challenges of Operating Large-Scale Clusters
  - Trade-offs in Designing Cluster Scheduling Systems
- The Evolution of Meituan's Cluster Scheduling System
  - Multi-Cluster Unified Scheduling: Improving Data Center Resource Utilization
  - Scheduling Engine Service: Enabling the Cloud-Native Implementation of PaaS Services
- Future Prospects: Building a Cloud-Native Operating System
Introduction to the Cluster Scheduling System

A cluster scheduling system, also known as a data center resource scheduling system, is generally built to solve the following problems:

- How to manage large-scale cluster deployment and scheduling in data centers, especially in cross-data-center scenarios: how to provide resource elasticity and scheduling capabilities, improving resource utilization as much as possible while guaranteeing application service quality, and keeping data center costs down.
- How to transform the underlying infrastructure and build a cloud-native operating system for the business side: improving the computing service experience, automating disaster recovery as well as application deployment and upgrades, and reducing the business side's mental burden of managing underlying resources so that teams can focus on the business itself.

The Challenges of Large-Scale Cluster Management

Challenges of Operating Large-Scale Clusters
- How to improve the resource utilization of data centers running online applications while ensuring application service quality? Resource scheduling has long been a recognized hard problem in the industry, and as the cloud computing market grows rapidly and cloud vendors keep increasing their investment in data centers, the low resource utilization of those data centers makes the problem worse. Gartner research found that the CPU utilization of servers in global data centers is only 6%~12%, and even Amazon Elastic Compute Cloud (EC2) reaches only 7%~17% resource utilization, which shows how serious the waste of resources is. The reason is that online applications are very sensitive to resource pressure, so the industry has to reserve extra resources to guarantee the quality of service (QoS) of important applications. When workloads are colocated, the cluster scheduling system needs to eliminate interference between applications and achieve resource isolation between them.
- How to provide automatic instance exception handling for applications, especially stateful applications, shield differences between data centers, and reduce users' exposure to the underlying layers? As business applications keep growing in scale and the cloud computing market matures, distributed applications are often deployed in data centers in different regions, or even across different cloud environments, achieving multi-cloud or hybrid-cloud deployment. The cluster scheduling system needs to provide unified infrastructure for the business side, implement a hybrid multi-cloud architecture, and shield the underlying heterogeneous environments, while also reducing the complexity of application O&M management, improving application automation, and providing a better O&M experience for services.
- How to resolve the performance and stability risks of cluster management caused by a single oversized cluster or by too many clusters? The complexity of managing a cluster's own lifecycle increases with cluster size and count. Taking Meituan as an example, the two-region, multi-center, multi-cluster solution we adopted avoided the hidden dangers of excessive cluster scale to a certain extent and solved problems such as service isolation and regional latency. With the emergence of requirements such as edge-cluster scenarios and moving PaaS components like databases onto the cloud, it is foreseeable that the number of small clusters will rise markedly. This brings a significant increase in cluster management complexity, monitoring configuration costs, and O&M costs, so the cluster scheduling system needs to provide more effective operating rules while ensuring operational safety, alarm self-healing, and change efficiency.
Trade-offs in Designing Cluster Scheduling Systems

When designing a cluster scheduling system, trade-offs have to be made between the following pairs of goals:

- System throughput versus scheduling quality. System throughput is an important criterion for evaluating a system, but in a cluster scheduling system serving online applications, scheduling quality matters even more: each scheduling decision has a long-lasting impact (days, weeks, or even months), and instances are not rescheduled unless an anomaly occurs, so a poor scheduling decision directly increases service latency. Higher scheduling quality means more computational constraints to consider, which hurts scheduling performance and lowers system throughput.
- Architectural complexity versus scalability. The more functions and configurations the system opens up to upper-layer PaaS users, and the more features it supports to improve the user experience (such as application resource preemption and reclamation, and self-healing of abnormal application instances), the higher the system's complexity and the more likely its subsystems are to conflict with one another.
- Reliability versus single-cluster scale. The larger a single cluster, the larger the schedulable range, but also the greater the reliability challenge, because the blast radius grows and failures have a bigger impact. With smaller single clusters, scheduling concurrency improves, but the scheduling range shrinks, scheduling failures become more likely, and cluster management complexity rises.
By architecture, industry cluster scheduling systems fall into five types: monolithic, two-level, shared-state, distributed, and hybrid schedulers.

- The monolithic scheduler uses a complex scheduling algorithm together with global cluster information to compute high-quality placements, but at relatively high latency. Representative systems are Google's Borg and the open-source Kubernetes.
- The two-level scheduler addresses the limitations of the monolithic scheduler by separating resource scheduling from job scheduling. It allows different job-scheduling logic per application while preserving the ability to share cluster resources between jobs, but it cannot implement preemption for high-priority applications. Representative systems are Apache Mesos and Hadoop YARN.
- The shared-state scheduler addresses the limitations of the two-level scheduler in a semi-distributed way: each scheduler holds a replica of the cluster state and updates it independently; once a local replica changes, the state information of the entire cluster is updated. However, sustained resource contention degrades scheduler performance. Representative systems are Google's Omega and Microsoft's Apollo.
- The distributed scheduler uses relatively simple scheduling algorithms to achieve large-scale, high-throughput, low-latency parallel task placement, but because the algorithms are simple and lack a global view of resource usage, it is hard to achieve high-quality placements. A representative system is UC Berkeley's Sparrow.
- The hybrid scheduler spreads workloads across centralized and distributed components, using complex algorithms for long-running tasks and relying on distributed placement for short-running tasks. Microsoft's Mercury takes this approach.
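To make the shared-state approach above concrete, here is a minimal Go sketch of the optimistic-concurrency idea behind Omega-style schedulers; the types and numbers are our own illustration, not code from any of the systems named. Each scheduler plans on a private snapshot of cluster state and commits its placement with a version check, retrying on conflict:

```go
package main

import (
	"fmt"
	"sync"
)

// ClusterState is a shared, versioned view of free resources per node.
// Every successful commit bumps the version, so a scheduler can detect
// that another scheduler committed since its snapshot was taken.
type ClusterState struct {
	mu      sync.Mutex
	version int64
	freeCPU map[string]int // node -> free CPU cores
}

// Snapshot returns a private copy of the state for one scheduler to plan on.
func (s *ClusterState) Snapshot() (int64, map[string]int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	copyCPU := make(map[string]int, len(s.freeCPU))
	for n, c := range s.freeCPU {
		copyCPU[n] = c
	}
	return s.version, copyCPU
}

// Commit applies a placement only if no other scheduler committed in the
// meantime (optimistic concurrency); on conflict the caller re-snapshots
// and retries, which is where sustained contention hurts performance.
func (s *ClusterState) Commit(snapVersion int64, node string, cpu int) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.version != snapVersion || s.freeCPU[node] < cpu {
		return false
	}
	s.freeCPU[node] -= cpu
	s.version++
	return true
}

func main() {
	state := &ClusterState{freeCPU: map[string]int{"node-a": 8, "node-b": 16}}
	// One scheduler plans on its snapshot, then tries to commit.
	v, snap := state.Snapshot()
	for node, free := range snap {
		if free >= 4 {
			fmt.Println("place 4 cores on", node, "committed:", state.Commit(v, node, 4))
			break
		}
	}
}
```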
The Evolution of Meituan's Cluster Scheduling System

Reviewing our own practice, we found the following problems:
- The overall resource utilization of our clusters is not high. For example, average CPU utilization is only at the industry-average level, leaving a large gap compared with other first-tier Internet companies.
- The containerization rate of stateful services is insufficient; in particular, MySQL, Elasticsearch, and other products do not run in containers, leaving much room to optimize service O&M costs and resource costs.
- From the perspective of business needs, VM products will exist for a long time, and VM scheduling and container scheduling are two separate environments, resulting in high O&M costs for the team's virtualization products.
To solve these problems, Meituan's cluster scheduling system evolved around three directions: ensuring stability, reducing cost, and improving efficiency.

- Ensure stability: improve the robustness and observability of the scheduling system; reduce coupling between system modules to lower complexity; improve the automated O&M capabilities of the multi-cluster management platform; optimize the performance of the system's core components; and ensure availability at large cluster scale.
- Reduce cost: deeply optimize the scheduling model and link cluster-level scheduling with single-node scheduling; move from static resource scheduling to dynamic resource scheduling; and introduce containers for offline workloads, combining free competition with strong control, so as to improve resource utilization and reduce IT costs while guaranteeing the service quality of high-priority business applications.
- Improve efficiency: let users adjust scheduling policies themselves to meet personalized business requirements; actively embrace the cloud-native ecosystem; and provide core capabilities for PaaS components, including orchestration, scheduling, cross-cluster management, and high availability, to improve O&M efficiency.
The system eventually evolved into a three-layer architecture:

- The platform layer is responsible for business access, integrating with Meituan's infrastructure, encapsulating native interfaces and logic, and providing container management interfaces (scale-out, update, restart, and scale-in).
- The policy layer provides unified scheduling capabilities across multiple clusters, continuously optimizes scheduling algorithms and policies, and improves CPU utilization and allocation rates through differentiated strategies based on business tiers and sensitive resources.
- The engine layer provides the Kubernetes service, guarantees the stability of the cloud-native clusters of multiple PaaS components, and sinks general-purpose capabilities into the orchestration engine to reduce the cost for services of going cloud-native.
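As a rough illustration of the container management interface that the platform layer exposes, the Go interface below is a hypothetical sketch: the method names and signatures are our own, not Meituan's actual API, but they cover the four operations named above.

```go
package platform

import "context"

// ContainerManager is a hypothetical sketch of the container management
// interface the platform layer could expose to business users.
type ContainerManager interface {
	// ScaleOut adds instances of an application up to the desired replica count.
	ScaleOut(ctx context.Context, app string, replicas int) error
	// Update rolls the application to a new version or image.
	Update(ctx context.Context, app string, version string) error
	// Restart restarts a specific instance of the application.
	Restart(ctx context.Context, app string, instanceID string) error
	// ScaleIn removes instances down to the desired replica count.
	ScaleIn(ctx context.Context, app string, replicas int) error
}
```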
Multi-Cluster Unified Scheduling: Improving Data Center Resource Utilization

- Through a self-developed dynamic load regulation system and a cross-cluster rescheduling system, we linked cluster-level scheduling with single-node scheduling and implemented service quality assurance strategies for different resource pools according to service tier.
- After three iterations, we built our own cluster federation service, which better solves the problems of resource pre-occupation and state data synchronization, improves cross-cluster scheduling concurrency, and implements compute separation, cluster mapping, load balancing, and cross-cluster orchestration control (see Figure 3 below).
The cluster federation service is split into a Proxy layer and a Worker layer, each deployed independently:

- The Proxy layer selects the appropriate cluster for scheduling based on weighted factors of cluster state, and selects the appropriate Worker to distribute requests to. The Proxy module uses etcd for service registration, leader election, and discovery; the leader node is responsible for scheduling tasks that pre-occupy resources, while all nodes can serve query tasks.
- The Worker layer handles the query requests of a subset of clusters; when a cluster's tasks become blocked, a corresponding Worker instance can be rapidly scaled out to relieve the pressure. When a single cluster is large, it maps to multiple Worker instances, and the Proxy distributes scheduling requests across them to increase scheduling concurrency and reduce the load on each Worker.
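The Proxy layer's leader election could look roughly like the Go sketch below, written against the etcd clientv3 concurrency API. The endpoint, election key, instance name, and TTL are all assumptions for illustration; the article does not show Meituan's actual configuration. Exactly one Proxy instance wins the campaign and handles resource pre-occupying scheduling tasks, while every instance serves queries regardless of leadership:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // assumed endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// The session holds a lease; if this Proxy instance dies, the lease
	// expires and the remaining instances re-elect a leader.
	sess, err := concurrency.NewSession(cli, concurrency.WithTTL(10))
	if err != nil {
		log.Fatal(err)
	}
	defer sess.Close()

	// All Proxy instances campaign on the same election key; Campaign
	// blocks until this instance becomes the leader.
	election := concurrency.NewElection(sess, "/scheduler/proxy/leader")
	if err := election.Campaign(context.Background(), "proxy-1"); err != nil {
		log.Fatal(err)
	}
	log.Println("proxy-1 is leader: handling resource pre-occupying tasks")
	// Query traffic, by contrast, is served by every instance.
}
```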
Scheduling Engine Service: Enabling the Cloud-Native Implementation of PaaS Services

Based on the Kubernetes engine, we advanced MKE (Meituan Kubernetes Engine) in three directions:

- Strengthen cluster O&M capabilities: build up automated cluster O&M, including cluster self-healing, the alarm system, and Event log analysis, and continuously improve cluster observability.
- Establish key business benchmarks: cooperate deeply with several important PaaS components, and quickly optimize user pain points such as Sidecar upgrade management, Operator grayscale iteration, and alarm separation to meet user demands.
- Continuously improve the product experience: keep optimizing the Kubernetes engine; in addition to supporting users' custom Operators, provide a general scheduling and orchestration framework (see Figure 4) to help users access MKE at lower cost and reap the technology dividends.
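Custom Operators built on such an orchestration framework all follow the same control-loop pattern. The dependency-free Go sketch below (with hypothetical types of our own, not MKE's actual framework) shows the level-triggered reconcile loop that drives observed state toward declared state:

```go
package main

import (
	"fmt"
	"time"
)

// Spec is the declared (desired) state; Status is the observed state.
type Spec struct{ Replicas int }
type Status struct{ ReadyReplicas int }

// reconcile compares desired and observed state and returns the action
// needed to converge them; a real Operator performs these actions through
// the Kubernetes API instead of returning strings.
func reconcile(spec Spec, status Status) string {
	switch {
	case status.ReadyReplicas < spec.Replicas:
		return fmt.Sprintf("scale out by %d", spec.Replicas-status.ReadyReplicas)
	case status.ReadyReplicas > spec.Replicas:
		return fmt.Sprintf("scale in by %d", status.ReadyReplicas-spec.Replicas)
	default:
		return "in sync, nothing to do"
	}
}

func main() {
	desired := Spec{Replicas: 3}
	observed := Status{ReadyReplicas: 1}
	// Level-triggered loop: re-evaluate the full state each tick rather
	// than reacting only to individual events, so missed events are harmless.
	for i := 0; i < 3; i++ {
		fmt.Println("reconcile:", reconcile(desired, observed))
		observed.ReadyReplicas++ // pretend one replica becomes ready per tick
		time.Sleep(10 * time.Millisecond)
	}
}
```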
After PaaS services go cloud-native on MKE, they gain clear benefits:

- Building the service on Kubernetes means the system forms a closed loop, with no need to worry about the data inconsistencies that often arise between two separate systems.
- Exceptions can be handled in milliseconds, reducing the system's RTO (Recovery Time Objective, i.e., the maximum tolerable business downtime: the shortest period required to restore service functions after a disaster occurs).
- The complexity of system O&M also drops, and services gain automatic disaster recovery; besides the service itself, the configuration and state data it depends on can be recovered along with it.
- Compared with the previous "chimney-style" management platforms built separately for each PaaS component, general-purpose capabilities are sunk into the engine service, reducing development and maintenance costs; and by relying on the engine service, the underlying heterogeneous environments are shielded, enabling service management across data centers and multi-cloud environments.
Future Prospects: Building a Cloud-Native Operating System

Going forward, we will continue to invest in the following areas:

- Application link delivery management. As service scale and link complexity grow, the O&M complexity of the PaaS components and underlying infrastructure that services depend on has long exceeded common understanding, and it is even harder for newcomers who have just taken over a project. We therefore need to let services be delivered through declarative configuration and operated in a self-service way, providing a better O&M experience, improving application availability and observability, and reducing the burden of managing underlying resources (a sketch of such a declarative spec follows this list).
- Edge computing solutions. As Meituan's business scenarios keep diversifying, demand for edge computing nodes is growing much faster than expected. We will draw on industry best practices to form an edge solution suited to Meituan, provide edge-node management capabilities to services that need them as soon as possible, and achieve cloud-edge-device collaboration.
- Capacity building for online-offline colocation. According to the 2019 data center cluster data disclosed in Google's paper "Borg: the Next Generation", resource utilization is only about 30% once offline tasks are removed, which shows that pushing utilization further up carries larger risks and a lower input-output ratio. Meituan's cluster scheduling system will keep exploring online-offline colocation, but because Meituan's offline data centers are relatively independent, our implementation path will differ from the industry's common approach: we will start by colocating online services with near-real-time tasks, complete the construction of the underlying capabilities, and then explore colocating online tasks with offline tasks.
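For the declarative delivery direction (the first item in the list above), the Go types below sketch what a declarative application delivery spec might capture; every field name here is a hypothetical assumption of ours, not Meituan's actual schema. The point is that users declare the service, the PaaS components it depends on, and its recovery policy, and the platform reconciles that intent instead of users running manual O&M steps:

```go
package delivery

// AppDelivery is a hypothetical declarative spec for delivering an
// application together with the middleware it depends on.
type AppDelivery struct {
	Name      string       // application name
	Replicas  int          // desired instance count
	Image     string       // container image to run
	DependsOn []Dependency // PaaS components the app needs provisioned
	Recovery  RecoveryPolicy
}

// Dependency declares a required PaaS component (e.g. a MySQL or
// Elasticsearch instance) rather than scripting its setup by hand.
type Dependency struct {
	Kind string // e.g. "mysql", "elasticsearch"
	Plan string // e.g. a capacity tier
}

// RecoveryPolicy declares disaster-recovery intent so that failover,
// including dependent configuration and state data, is automatic.
type RecoveryPolicy struct {
	MultiDataCenter bool // spread instances across data centers
	AutoFailover    bool // recover config and state data with the app
}
```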
Returning to the design trade-offs discussed earlier, we made the following choices:

- Between system throughput and scheduling quality, we chose to satisfy the business's throughput requirements first rather than over-pursue the quality of any single scheduling decision, improving placements afterwards through rescheduling.
- Between architectural complexity and scalability, we chose to reduce the coupling between system modules, lower the system's complexity, and require that every extended function be degradable.
- Between reliability and single-cluster scale, we chose to cap single-cluster scale through multi-cluster unified scheduling, ensuring system reliability and reducing the blast radius.