author | Andrei Ushakov
This article explains Netflix’s system monitoring practices, centered on the in-house Telltale system, which now monitors the health of more than 100 Netflix production applications.
Many operations engineers have probably had this experience: a metric in the monitoring system crosses its threshold and triggers an alert, and you are paged in the middle of the night.
Half awake and full of doubt, you wonder: “Is there really a problem with the system, or does the alert just need tuning? When was the last time someone adjusted our alert thresholds? Could the problem be in an upstream or downstream service?”
Since this is a critical application, you get out of bed, open your laptop, and start going through the monitoring dashboards to track down the source of the problem.
After a long while, you still have not confirmed whether the alert reflects a real problem in your system, and you realize that time is slipping away while you search a massive amount of data for clues. You need to locate the cause of the alert as quickly as possible, and you hope the system keeps running smoothly.
A robust Netflix service is essential to our users. When you sit down to watch Tiger King, you want it to play smoothly.
Over the years, we’ve learned about the pain points of application monitoring from engineers who are often paged late at night:
- Too many alerts
- Too many dashboards to scroll through
- Too much configuration
- Too much maintenance
Our streaming team needed a whole new monitoring system that lets team members quickly diagnose and fix problems, because when an alert fires, every second counts.
Our Node team needed a system that would require only a small number of people to operate large clusters. So, we built Telltale.
The features of Telltale are as follows:
Pooling monitoring data sources to create an overall monitoring view: Telltale brings together various monitoring data sources to create a holistic view of application health.
Multi-dimensional judgment of application health: Telltale can determine the health of an application through multiple dimensions without frequently adjusting the alarm threshold based on a single metric.
Timely alerts: Because we know when an application is normal, we can notify the owner of an application when there is an abnormal trend in the application.
Display key data: Metrics are key to understanding the health status of your application. But a lot of times, you have too many metrics, too many charts, and too many monitoring dashboards. Telltale, on the other hand, displays only relevant data that is useful in your application and data from its upstream and downstream services.
Color-coded severity of issues: We use different colors to indicate the severity of the problem (in addition to choosing a color, we can also have Telltale display different numbers) so that operators can judge the health of the application at a glance.
Highlight events: We also highlight relevant events, such as regional network traffic evacuations and nearby service deployments, which are critical to getting a complete picture of the health of the service, especially during a real system failure.
This is Telltale. It is now running successfully and monitors the health of more than 100 Netflix production applications.
Application Health Assessment Model
Microservices do not exist or run in isolation. A microservice has dependencies, exchanges data with other services, and may run in different AWS Regions.
The call graph above is a relatively simple one involving many services; the actual call chain can be deeper and more complex.
An application is part of an ecosystem. Its behavior may be subtly affected by a change in a related property, or fundamentally altered by an event at regional scale.
The launch of the canary may have some impact on the application. To a certain extent, the deployment of upstream or downstream services can also have an impact.
Telltale monitors application health by building a continuously self-optimizing model from multi-dimensional data sources:
Regional network traffic evacuation
Mantis real-time streaming data
Infrastructure change events
Canary deployment and usage
Upstream and downstream services
Metrics that characterize QoE
Alerts issued by the alerting platform
Different data sources affect application health to different degrees. For example, an increase in response time has a much smaller impact on the application than an increase in error rates.
There are many error codes, but some specific error codes have a greater impact than others. A canary deployed downstream of a service may not affect it as much as one deployed upstream.
A regional network traffic evacuation means that network traffic in one region drops to zero and network traffic in another region doubles.
You can imagine the effect this has on metrics. What a metric actually means determines how we should interpret it for monitoring.
Telltale considers all of these factors when building the application health view. The application health assessment model is at the heart of Telltale.
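The idea that signals carry different weight can be illustrated with a minimal sketch. This is not Telltale’s actual model, which the article does not describe in detail; the `Signal` type, the weights, and the severity values are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    name: str
    severity: float  # 0.0 (normal) to 1.0 (clearly abnormal)
    weight: float    # hypothetical importance of this data source


def health_score(signals):
    """Return a score in [0, 1], where 1.0 means fully healthy."""
    total_weight = sum(s.weight for s in signals)
    if total_weight == 0:
        return 1.0
    penalty = sum(s.severity * s.weight for s in signals) / total_weight
    return 1.0 - penalty


def scenario(error_severity, latency_severity):
    # Assumed weights: error rate counts four times as much as response time.
    return health_score([
        Signal("error_rate", error_severity, weight=4.0),
        Signal("response_time", latency_severity, weight=1.0),
    ])


latency_only = scenario(error_severity=0.0, latency_severity=0.8)
errors_only = scenario(error_severity=0.8, latency_severity=0.0)
print(latency_only, errors_only)
```

With these assumed weights, the same deviation lowers the score far more when it appears in the error rate than when it appears in response time.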
Every service operator knows how difficult it is to tune alert thresholds. Set the threshold too low and you’ll get a lot of false alarms.
If you overcompensate and relax the alarm threshold, you miss important exception warnings. The end result is a lack of trust in the alert. Telltale can help you avoid the tedious task of constantly tweaking the relevant configuration.
By providing accurate and tightly managed data sources, we make the setup and configuration process easier for application owners.
These data sources are combined and applied to an application’s configuration to cover the most common service types.
Telltale can automatically track dependencies between services to build topologies in application health assessment models.
With data source management and topology monitoring, you can keep your configuration up to date without much effort. Scenarios that need hands-on control still support manual configuration and tuning.
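As a rough illustration of dependency tracking, the sketch below derives upstream and downstream neighbors from observed caller/callee pairs. How Telltale actually discovers its topology is not stated in the article; the function names and service names here are hypothetical.

```python
from collections import defaultdict


def build_topology(calls):
    """Build upstream/downstream maps from (caller, callee) pairs."""
    downstream = defaultdict(set)  # service -> services it calls
    upstream = defaultdict(set)    # service -> services that call it
    for caller, callee in calls:
        downstream[caller].add(callee)
        upstream[callee].add(caller)
    return upstream, downstream


def neighbors(service, upstream, downstream):
    """Return the direct upstream and downstream services of `service`."""
    return {
        "upstream": sorted(upstream.get(service, ())),
        "downstream": sorted(downstream.get(service, ())),
    }


# Hypothetical call records; duplicates are harmless because sets are used.
observed_calls = [
    ("gateway", "playback"),
    ("playback", "licensing"),
    ("playback", "catalog"),
    ("gateway", "playback"),
]
upstream, downstream = build_topology(observed_calls)
playback_neighbors = neighbors("playback", upstream, downstream)
print(playback_neighbors)
```

Because the maps are rebuilt from observed calls, a new dependency shows up in the topology without anyone editing configuration by hand.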
No single algorithm can be applied to all of our monitoring scenarios. Therefore, we employ a hybrid algorithm, including statistical algorithms, rule-based algorithms, and machine learning algorithms.
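To make the idea of mixing algorithms concrete, here is a small sketch that combines one statistical check (a z-score against recent history) with one rule-based check (a fixed limit). Telltale’s real algorithms are not given in this article, so the functions, thresholds, and sample data are assumptions; the machine learning part is left out entirely.

```python
from statistics import mean, stdev


def zscore_anomaly(history, value, threshold=3.0):
    """Statistical check: is `value` more than `threshold` deviations from the mean?"""
    if len(history) < 2:
        return False
    sigma = stdev(history)
    if sigma == 0:
        return value != history[0]
    return abs(value - mean(history)) / sigma > threshold


def rule_anomaly(error_rate, hard_limit=0.05):
    """Rule-based check: an assumed fixed upper limit on the error rate."""
    return error_rate > hard_limit


def is_anomalous(history, value):
    # Either detector alone is enough to flag the value.
    return zscore_anomaly(history, value) or rule_anomaly(value)


stable_history = [0.010, 0.012, 0.011, 0.009, 0.010, 0.011]
noisy_history = [0.02, 0.06, 0.03, 0.07, 0.04]

print(is_anomalous(stable_history, 0.011))  # within normal variation
print(is_anomalous(stable_history, 0.020))  # far outside recent history
print(is_anomalous(noisy_history, 0.060))   # statistically ordinary, but over the fixed limit
```

The third case shows why a single technique falls short: a noisy history hides the value from the statistical check, and only the rule catches it.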
Soon, we’ll be publishing an article on the Netflix Tech Blog about our monitoring algorithm.
Telltale also has analyzers that can be used for trend detection or memory leak monitoring. Intelligent monitoring means that our users can rely on our monitoring results.
It also means that users can locate and resolve system anomalies more quickly when failures occur.
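One simple way to detect a trend such as a memory leak is to fit a least-squares line to recent samples and look at its slope. The sketch below does that; it is an illustration under assumed sample data and an assumed slope limit, not a description of Telltale’s analyzers.

```python
def slope(values):
    """Least-squares slope of `values` against their sample index."""
    n = len(values)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    numerator = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


def looks_like_leak(memory_mb, min_slope=1.0):
    """Flag steady growth above `min_slope` MB per sample (an assumed limit)."""
    return slope(memory_mb) > min_slope


steady = [512, 515, 510, 514, 511, 513]
leaking = [512, 530, 551, 569, 590, 612]

print(looks_like_leak(steady), looks_like_leak(leaking))
```

A slope-based check reacts to sustained growth rather than to a single high reading, which is why it suits slow problems like leaks.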
Intelligent Alerting
Intelligent monitoring inevitably leads to intelligent alerting. When Telltale detects a malfunction in an application, it creates an exception event.
Teams can choose to alert via Slack, email, or PagerDuty, all powered by our internal alerting system.
If the exception is caused by an upstream or downstream system, Telltale’s context-aware routing alerts the team that maintains that service.
Smart alerting also means that a team receives only one notification for a particular exception, so alert storms are a thing of the past.
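The two behaviors described here, routing to the owning team and sending one notification per exception, can be sketched as follows. The `Notifier` class, the owner table, and the event identifiers are hypothetical; the article does not show Telltale’s internal alerting interfaces.

```python
# Hypothetical mapping from service name to owning team.
OWNERS = {"playback": "playback-team", "licensing": "licensing-team"}


class Notifier:
    def __init__(self, owners):
        self.owners = owners
        self.sent = []      # (team, event_id) pairs actually delivered
        self._seen = set()  # used to suppress repeats

    def notify(self, event_id, detected_in, root_cause_service=None):
        """Route to the owner of the root-cause service, once per event."""
        # Prefer the service believed to cause the problem, if it has an owner.
        target = root_cause_service if root_cause_service in self.owners else detected_in
        team = self.owners.get(target, "unowned")
        key = (team, event_id)
        if key in self._seen:
            return False  # already notified; no alert storm
        self._seen.add(key)
        self.sent.append(key)
        return True


notifier = Notifier(OWNERS)
first = notifier.notify("evt-1", detected_in="playback", root_cause_service="licensing")
repeat = notifier.notify("evt-1", detected_in="playback", root_cause_service="licensing")
print(first, repeat, notifier.sent)
```

Keying the suppression on the pair of team and event means a second detection of the same exception reaches nobody twice, while a different event still gets through.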
Example of a Telltale notification in Slack
When there is a problem with the system, it is critical to have accurate information. Our Slack alerter also starts a thread with contextual information about the event, providing information about the exception identified by Telltale and the cause of the problem.
The right context allows us to understand the current state of the application so that the on-call O&M engineer can target and fix the problem.
Exception events evolve and have their own lifecycle, so it’s important to keep the event status up to date. Is the anomaly getting better or worse? Are there new signals or events to consider?
Telltale updates the Slack thread when the current event changes. Once the system returns to a healthy state, the thread is marked as resolved, so the user knows at a glance which exception events are being processed and which are successfully fixed.
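The lifecycle described above can be modeled with a very small sketch. `EventThread` is a hypothetical stand-in for a Slack thread tied to one exception event; it does not call any Slack API, and the state names are assumptions.

```python
class EventThread:
    """Hypothetical stand-in for one thread tied to one exception event."""

    def __init__(self, event_id):
        self.event_id = event_id
        self.updates = []      # (state, note) pairs, oldest first
        self.resolved = False

    def post_update(self, state, note):
        # Record every change so readers of the thread see the full history.
        self.updates.append((state, note))
        # The thread counts as resolved only while the latest state is healthy.
        self.resolved = (state == "healthy")


thread = EventThread("evt-42")
thread.post_update("degraded", "error rate rising in one region")
thread.post_update("worsening", "second region affected")
thread.post_update("healthy", "error rate back to baseline")
print(thread.resolved, len(thread.updates))
```

Keeping the history in one place is what lets a reader tell at a glance whether an event is still in progress or already fixed.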
These Slack threads aren’t just for Telltale. Teams also use them to share additional data, observations, theories, and discussion about an event.
With the exception data and the discussion concentrated in one thread, it is easier to reach a shared understanding of the current exception, which speeds up both problem resolution and post-event analysis.
We are committed to improving the quality of Telltale alerts. One way is to learn from our users. That’s why we’ve provided a feedback button in our Slack messages.
Users can tell us which situations should not trigger an alert in the future, or explain why a particular alert was unwarranted. Smart alerting means users can rely on our alerts.
An example of exception details in a Telltale notification in Slack
Why Is My Application Unhealthy?
Various types of monitoring data, application-related knowledge, and correlation of data across multiple services help Telltale detect and analyze the causes of reduced application health.
These causes include instance exceptions, monitoring and deployment exceptions for related dependencies, database anomalies, or network traffic spikes. Highlighting these possible causes can help operators save valuable time.
Exception event management
The cluster view groups similar exception events
Telltale’s application health assessment model and its intelligent monitoring capabilities have proven so powerful that we also apply them to safe deployments. We started testing with the open source delivery platform Spinnaker.
As Spinnaker gradually rolls out a new version, we use Telltale to continuously monitor the health of the instances running the new version.
Continuous monitoring means that a new deployment can stop itself and roll back when problems arise. As a result, a bad deployment has a smaller blast radius and a shorter duration.
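A health-gated rollout can be sketched in a few lines. This is only an illustration: `staged_rollout` and its `is_healthy` callback are hypothetical, the callback stands in for a health verdict such as the one Telltale provides, and nothing here reflects Spinnaker’s actual API.

```python
def staged_rollout(stages, is_healthy):
    """Roll out in steps and roll back on the first unhealthy check.

    stages: increasing traffic percentages, for example [5, 25, 50, 100]
    is_healthy: callable that takes the current percentage and returns a bool
    Returns (outcome, highest percentage reached).
    """
    reached = 0
    for percent in stages:
        reached = percent
        if not is_healthy(percent):
            # Stop immediately so the problem never reaches the later stages.
            return "rolled_back", reached
    return "completed", reached


stages = [5, 25, 50, 100]
good = staged_rollout(stages, lambda percent: True)
bad = staged_rollout(stages, lambda percent: percent < 50)
print(good, bad)
```

Checking health after every step is what keeps the impact small: in the failing case above, the rollout stops at 50 percent instead of reaching all traffic.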
In complex systems, running microservices can be challenging. Telltale’s intelligent monitoring and alerting capabilities help our operations engineers improve system availability, reduce operational toil, and get woken up in the middle of the night less often.
We’re excited about the improvements Telltale has brought. But we are far from done: we are still exploring new algorithms to improve the accuracy of alerts.
We’ll detail our work in a future Netflix Tech Blog article. We are still working on further evaluation and refinement of the application health assessment model.
We believe that service logs and trace data contain more valuable signals, which would let us collect more useful metrics. We’re looking forward to working with other teams on the platform to develop these new features.
Onboarding a new application to Telltale currently comes with hands-on support, which does not scale well, so we can definitely improve the self-service user interface.
We are convinced that there are better heuristics that can help users identify some of the factors that affect the health of the service. Telltale simplifies application monitoring.