Task scheduling background

Design of 100-million-level task scheduling framework for observability platform

The task scheduling framework is applied at scale in Log Service

Generic scheduling

Operating system: On a single machine, the Linux kernel controls how long each process runs on a processor through time slices, and a process's priority is tied to its time slice; in short, the scheduler governs which process runs on each CPU. Kubernetes, often called the operating system of the distributed era, works similarly: after a Pod is created, the scheduler in the K8s control plane scores and ranks candidate nodes, then binds the Pod to the most suitable Node.

Big data analysis systems: From the earliest MapReduce, which used a fair scheduler to support job prioritization and preemption, to modern engines: the SQL engine Presto assigns the tasks in an execution plan to suitable workers through the Coordinator's scheduler, while Spark splits a job into Stages through the DAGScheduler, and the TaskScheduler dispatches each Stage's TaskSet to appropriate workers for execution.

Task scheduling frameworks: ETL jobs and timed jobs are common in data processing. These tasks are multi-modal: some run on a schedule, some run continuously, and some run only once. During execution, the framework must handle task orchestration and keep task state consistent.

Task scheduling

Scheduled tasks

Task scheduling application: The adventure of a log

Besides logs, the system also carries Trace and Metric data; together these are the three pillars of observability. The same journey applies to an observability service platform, so let's walk through the process that makes up a typical one.

Data ingestion: An observability platform first needs broad data-source coverage. Sources may include various logs, the Kafka message queue, OSS object storage, cloud monitoring data, and many kinds of databases. Ingesting this rich set of sources gives a full view of the system.

Data processing: After data is ingested into the platform, it needs to be cleansed and transformed; we call this stage data processing, which covers various transformations and enrichment of the data. Aggregation processing supports scheduled roll-ups, such as computing a summary of the previous day's data once a day, producing data with higher information density.

Data monitoring: Observable data reflects the system's running state. Each component exposes specific metrics that reveal its health, and intelligent inspection algorithms can detect anomalies in those metrics, such as a sharp rise or drop in QPS or latency. When an anomaly occurs, alerts notify the relevant operations staff. On top of the metrics, various operations or business dashboards can be built; sending a dashboard digest to a group chat on a regular schedule is also a common requirement.

Data export: The value of observable data tends to decay over time, so long-term log data can be exported to other platforms for archiving.

Complex business, many task types: Data ingestion alone, a single stage, may involve dozens or hundreds of data sources.

Many users, many tasks: As a cloud service, each customer creates a large number of tasks.

High SLA requirements: Service availability requirements are strict; upgrading or migrating the background services must not affect users' running tasks.

Multi-tenancy: Customers on the cloud must not directly affect one another.

Observable platform task scheduling design goals

Support heterogeneous tasks: Alerting, dashboard subscription, data processing, and aggregation processing each behave differently. For example, alerting is a scheduled task, data processing is a resident task, and a dashboard subscription preview is a one-time task.

Massive task scheduling: A single alert task that runs once a minute is dispatched 1,440 times a day; multiply that by the number of users and then by the number of tasks per user, and the schedule volume becomes massive. The goal is that growth in task count never overwhelms any single machine; in particular, the system must scale horizontally, so that growth in tasks or dispatches only requires adding machines linearly.
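One common way to get this kind of linear horizontal scaling is to map tasks onto a fixed set of partitions and shard partitions across Workers. The sketch below is illustrative only: the partition count, hashing scheme, and function names are assumptions, not the platform's actual implementation.

```python
import hashlib

NUM_PARTITIONS = 256  # hypothetical fixed partition count


def partition_for(task_id: str) -> int:
    """Map a task to a stable partition, so that adding Workers only
    moves whole partitions around, never individual task mappings."""
    digest = hashlib.md5(task_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS


def partitions_for_worker(worker_index: int, num_workers: int) -> list:
    """Evenly shard the partition space across the current set of Workers."""
    return [p for p in range(NUM_PARTITIONS) if p % num_workers == worker_index]
```

With this indirection, doubling the Worker count halves each Worker's partition share without touching any task-to-partition assignment.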

High availability: As a cloud service, upgrading, restarting, or even a crash of the background service must not affect users' running tasks, and task execution must be observable at both the user level and the background-service level.

Simple and efficient O&M: The backend services need a visual O&M dashboard that surfaces problems intuitively. Service alerting must also be configured, and the upgrade and release process should be as unattended as possible.

Multi-tenancy: The cloud is inherently multi-tenant. Resources must be strictly isolated between tenants, with no resource or performance dependencies between them.

Extensibility: To meet new customer needs, more task types must be supported over time. For example, MySQL and SQL Server import tasks already exist, and imports from other databases will be needed; in that case we should not have to modify the task scheduling framework itself, only add a plug-in.

API support: Beyond the above, tasks must be manageable through APIs. Many cloud users, especially overseas customers, manage cloud resources with APIs and Terraform, so task management must also be exposed through APIs.

An overview of the observable platform task scheduling framework

Storage layer: Mainly the task metadata store, plus the state and snapshot storage used while tasks run. Task metadata includes the task type, task configuration, and scheduling information, and is stored in a relational database; task running state and snapshots are stored in a distributed file system.
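The metadata record described above can be pictured roughly as follows. The field names and the split between a relational row and a file-system path are illustrative assumptions, not the platform's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TaskMeta:
    """Task metadata as it might be stored in the relational database."""
    task_id: str
    task_type: str                 # e.g. "alert", "etl", "ingest"
    config: dict = field(default_factory=dict)   # task configuration
    schedule: Optional[str] = None  # Cron expression for scheduled tasks,
                                    # None for resident / one-shot tasks
    # Running state and snapshots live elsewhere, in the distributed
    # file system; the metadata row would only reference them by path.
    snapshot_path: Optional[str] = None
```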

Service layer: Provides the core scheduling functionality, chiefly task scheduling and task execution, corresponding to the task orchestration and task execution modules mentioned above. Task scheduling targets three task types: resident tasks, scheduled tasks, and on-demand tasks. Task execution supports multiple engines, including Presto, a RESTful interface, the K8s engine, and the in-house ETL 2.0 system.

Business layer: Includes the functions users operate directly in the console: alert monitoring, data processing, index rebuilding, dashboard subscription, aggregation processing, data source import, intelligent inspection tasks, and log shipping.

Access layer: Uses Nginx and CGI to serve external requests, with high availability and cross-region deployment.

API/SDK/Terraform/Console: On the user side, tasks can be managed through the console, which provides customized interfaces and monitoring for each task type; tasks can also be created, deleted, and modified through APIs, SDKs, and Terraform.

Task scheduling framework design essentials

Heterogeneous task model abstraction

Scheduling service framework

Large-scale task support

Service high availability design

Task model abstraction

Tasks that must run on a schedule, such as alert monitoring, dashboard subscription, and aggregation processing, are abstracted as scheduled tasks, supporting fixed-interval and Cron expression settings.

Tasks that must run continuously, such as data processing, index rebuilding, and data import, are abstracted as resident tasks. They typically need to be started only once, and may or may not have a terminal state.
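The two abstractions above (plus the on-demand type mentioned earlier) can be sketched as a small task model. The class and field names here are hypothetical, chosen only to make the abstraction concrete.

```python
from abc import ABC, abstractmethod
from enum import Enum


class TaskMode(Enum):
    SCHEDULED = "scheduled"   # fires on an interval or Cron expression
    RESIDENT = "resident"     # started once, runs continuously
    ON_DEMAND = "on_demand"   # one-shot, e.g. a dashboard subscription preview


class Task(ABC):
    """Common shape shared by heterogeneous task types."""

    def __init__(self, task_id: str, mode: TaskMode):
        self.task_id = task_id
        self.mode = mode

    @abstractmethod
    def run(self) -> None:
        ...


class AlertTask(Task):
    """Example of a scheduled task: evaluated on every cron tick."""

    def __init__(self, task_id: str, cron: str):
        super().__init__(task_id, TaskMode.SCHEDULED)
        self.cron = cron

    def run(self) -> None:
        pass  # evaluate the alert condition here
```

Keeping the mode on the common base class lets the scheduler branch on how to drive a task without knowing its business semantics.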

Scheduling service framework

Task scheduling is implemented mainly in the Worker layer. Each Worker is responsible for pulling the tasks belonging to its Partitions and loading them through the JobLoader; note that only the task list for the current Worker's Partitions is loaded. The Scheduler then schedules those tasks, covering resident, scheduled, and on-demand tasks, and hands the orchestrated tasks to the JobExecutor for execution. During execution, the JobExecutor persists task state to a RedoLog in real time; when a Worker is upgraded or restarted, it reloads task state from the RedoLog, which keeps task state accurate across restarts.
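The pull-load-schedule-execute-persist loop described above can be sketched as follows. This is a minimal illustration under assumed interfaces: `job_store` and `redo_log` are stand-ins for the real metadata store and RedoLog, and the state values are invented.

```python
class Worker:
    """Simplified sketch of the Worker flow: pull the jobs for the owned
    Partitions, recover their state from the redo log, then execute."""

    def __init__(self, owned_partitions, job_store, redo_log):
        self.owned_partitions = owned_partitions
        self.job_store = job_store  # hypothetical: partition -> job list
        self.redo_log = redo_log    # hypothetical: job -> persisted state

    def load_jobs(self):
        # JobLoader: load only the tasks of this Worker's Partitions.
        return [job
                for p in self.owned_partitions
                for job in self.job_store.get(p, [])]

    def start(self):
        for job in self.load_jobs():
            # On upgrade/restart, recover state from the redo log
            # instead of starting from scratch.
            state = self.redo_log.get(job, "INIT")
            self.execute(job, state)

    def execute(self, job, state):
        self.redo_log[job] = "RUNNING"  # persist state while executing
        # ... the JobExecutor would actually run the task here ...
        self.redo_log[job] = "DONE"
```

The key property is that every state transition is written to the redo log before the Worker moves on, so a restarted Worker can resume from the last persisted state.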

Large-scale task support

The mapping between Workers and Partitions is not static but dynamic: when a Worker restarts or its load is high, its Partitions migrate to other Workers, so every Worker must implement moving Partitions in and out.

When the number of tasks grows, the Partition layer in the middle means only the number of Workers needs to increase to absorb the growth, achieving horizontal scaling: newly added Workers simply take over a share of the Partitions.
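A minimal sketch of such a rebalance, assuming a simple round-robin policy (the real system's assignment strategy is not described in the text):

```python
def rebalance(partitions, workers):
    """Round-robin partitions across the live Workers. When a Worker
    joins or leaves, only the moved partitions migrate; the tasks inside
    each partition stay together."""
    assignment = {w: [] for w in workers}
    for i, p in enumerate(partitions):
        assignment[workers[i % len(workers)]].append(p)
    return assignment
```

For example, adding a third Worker to a four-partition cluster moves at most a couple of partitions, leaving the rest untouched.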

Service high availability design

Release Process:

From compilation to release, the whole flow is a templated, fully GUI-based ("white-screen") operation on the web side, and every version can be tracked and rolled back.

Supports canary (grayscale) control at both cluster granularity and task-type granularity, so a release can be verified on a small scope before being rolled out in full.

O&M process:

Provides internal O&M APIs and web-side operations to repair and handle abnormal jobs, reducing manual O&M intervention.


Inside the service, we built an internal inspection function to find abnormal tasks; for example, tasks that take too long to start or have been stopped for too long emit exception logs, which can then be tracked and monitored.

Based on those exception logs, Log Service alerting can monitor the service and promptly notify O&M staff when a problem occurs.

Task monitoring:

User side: In the console we provide monitoring dashboards and built-in alarm configurations for various tasks.

Service side: In the background, you can see the running status of cluster granular tasks, which is convenient for background O&M personnel to monitor services.

At the same time, each task's execution status and history are stored in a dedicated logstore so that problems can be traced back and diagnosed.

Typical task: aggregation processing

Automatic retry

If an instance fails to execute (for example, due to insufficient permissions, a missing source or target database, or invalid SQL syntax), the system supports automatic retries.
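An automatic retry of the kind described is commonly implemented with capped exponential backoff. This is a generic sketch, not the platform's implementation; which errors count as retryable versus fatal (e.g. invalid SQL) is a policy decision left as an assumption here.

```python
import time


def run_with_retry(fn, max_attempts=3, base_delay=1.0,
                   retryable=(RuntimeError,)):
    """Run fn, retrying transient failures with exponential backoff.

    Fatal errors (anything not in `retryable`) propagate immediately;
    the last transient error propagates once attempts are exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts:
                raise
            # back off 1x, 2x, 4x, ... of base_delay between attempts
            time.sleep(base_delay * 2 ** (attempt - 1))
```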

Retry manually

Dynamic task types: Add support for dynamic task types, such as more complex scheduling with inter-task dependencies.

Multi-tenant optimization: Tasks currently use simple quota limits; multi-tenant QoS will be refined further to support richer quota settings.

API optimization and improvement: Task types are iterating quickly, and the task APIs still lag behind. The task APIs need to be enhanced so that adding a new task type requires no API changes, or only small ones.

