
When water is full it overflows; when the moon is full it wanes. Nothing can grow without limit, and the service capacity of our systems is no exception.
When traffic keeps growing and reaches or exceeds what a service can bear, building a self-protection mechanism into the system becomes essential.
This article aims to use plain explanations and concrete examples to walk you through what rate limiting, degradation, and circuit breaking are.
Part 1 Rate Limiting – Knowing Yourself and Knowing Others
From the perspective of the current system, rate limiting concerns two things: the system's own carrying capacity, and the service capacity of the parties it depends on.
1.1 Knowing yourself: passive rate limiting
I only have so much capacity, so I can only serve so many customers!
The system needs a clear understanding of its own carrying capacity, and it should reject the extra calls that exceed that capacity.
How to measure the system's carrying capacity is the question.
In general there are two common approaches: one is to define thresholds and rules, the other is an adaptive rate-limiting strategy.
Thresholds and rules are set by the system owner from experience, based on knowledge of the business and of the system's own storage and connection limits. Such a strategy usually causes no big problems, but it reacts slowly to changes in traffic and tends to leave machine resources underutilized.
The adaptive strategy, by contrast, is dynamic: it adjusts the limiting threshold according to the system's current operating state, seeking a balance between machine resources and traffic handling.
For example, Alibaba's open-source rate limiter Sentinel supports the following adaptive system-protection rules [1]:
- Load adaptive: uses the system's load1 as the heuristic indicator for adaptive system protection. Protection is triggered when load1 exceeds the configured heuristic value and the current number of concurrent threads exceeds the estimated system capacity.
- CPU usage: protection is triggered when the system CPU usage exceeds the threshold (value range 0.0 to 1.0); this is the more sensitive indicator.
- Average RT: protection is triggered when the average RT of all ingress traffic on a single machine reaches the threshold, in milliseconds.
- Concurrent threads: protection is triggered when the number of concurrent threads for all ingress traffic on a single machine reaches the threshold.
- Ingress QPS: protection is triggered when the QPS of all ingress traffic on a single machine reaches the threshold.
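To make this concrete, here is a minimal sketch of loading such rules through Sentinel's system-protection API. The class and setter names follow Sentinel's public SystemRule / SystemRuleManager API, but the threshold values and the resource name are made-up examples, not recommendations.

```java
import java.util.Collections;

import com.alibaba.csp.sentinel.Entry;
import com.alibaba.csp.sentinel.EntryType;
import com.alibaba.csp.sentinel.SphU;
import com.alibaba.csp.sentinel.slots.block.BlockException;
import com.alibaba.csp.sentinel.slots.system.SystemRule;
import com.alibaba.csp.sentinel.slots.system.SystemRuleManager;

public class SystemProtectionDemo {

    public static void main(String[] args) {
        SystemRule rule = new SystemRule();
        rule.setHighestSystemLoad(8.0);   // load1 heuristic value
        rule.setHighestCpuUsage(0.8);     // CPU usage threshold (0.0 - 1.0)
        rule.setAvgRt(500);               // average ingress RT threshold, ms
        rule.setMaxThread(200);           // max concurrent ingress threads
        rule.setQps(1000);                // ingress QPS threshold
        SystemRuleManager.loadRules(Collections.singletonList(rule));

        // System rules only protect resources marked as ingress traffic (EntryType.IN).
        Entry entry = null;
        try {
            entry = SphU.entry("createOrder", EntryType.IN);
            // normal business logic here
        } catch (BlockException e) {
            // rejected by system protection: fail fast or return a fallback
        } finally {
            if (entry != null) {
                entry.exit();
            }
        }
    }
}
```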
1.2 Knowing others: active rate limiting
My partner only has so much capacity, so I can only ask for so much!
You also need an accurate judgment of the service capacity of the downstream systems you depend on, and for a downstream system with weaker capacity you have to cut back your calls accordingly. That takes some insight into others.
Most business systems do not exist in isolation; they depend on many other systems, and the capacity of those dependencies, like the shortest stave of a wooden barrel, limits the processing capacity of the current system. At that point, the current system's calls to a downstream dependency have to be considered as a whole.
Therefore, cluster rate limiting and single-machine rate limiting need to be used together. Especially when the downstream service's instance count and capacity differ greatly from the current system's, cluster rate limiting is indispensable.
One solution: collect the request logs of each service node, aggregate the request volume, compare it with the configured limit, and control the throttling logic on each node accordingly:
[Figure: cluster rate limiting by aggregating node request logs and feeding the result back to each node]
I call this post-hoc rate limiting: the request volume of each node is collected and compared with the established threshold, and when the threshold is exceeded the result is fed back to every node, which then relies on single-machine rate limiting to throttle proportionally.
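The sketch below illustrates this feedback loop under stated assumptions; the class and method names (FeedbackThrottle, onFeedback, tryAcquire) are illustrative and not from any particular framework. Each node reports its local count to an aggregation job, which compares the cluster total with the limit and pushes back an allowed ratio.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;

public class FeedbackThrottle {

    // Ratio of requests each node is allowed to pass, pushed by the aggregation job.
    private volatile double allowRatio = 1.0;

    // Counts local requests so the aggregation job can sum them across the cluster.
    private final AtomicLong localCounter = new AtomicLong();

    /** Called with the aggregation result: cluster-wide QPS versus the configured limit. */
    public void onFeedback(long clusterQps, long clusterLimit) {
        allowRatio = clusterQps <= clusterLimit
                ? 1.0
                : (double) clusterLimit / clusterQps;
    }

    /** Each node throttles proportionally based on the latest feedback. */
    public boolean tryAcquire() {
        localCounter.incrementAndGet();
        return ThreadLocalRandom.current().nextDouble() < allowRatio;
    }

    /** Periodically drained and reported to the aggregation job. */
    public long drainLocalCount() {
        return localCounter.getAndSet(0);
    }
}
```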
Another solution: a centralized rate-limit control service produces tokens according to the configuration; each node consumes tokens and may continue its business only after obtaining a token:
[Figure: cluster rate limiting through a central token-issuing service]
I call this pre-emptive rate limiting: the available tokens are allocated up front, which removes the aggregate-and-feed-back machinery. By comparison, this form of control is more precise and more elegant.
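One common way to implement the shared token pool is a counter in Redis. The sketch below assumes a fixed per-second quota; the key naming scheme and the use of the Jedis client are illustrative choices, not part of the original design.

```java
import redis.clients.jedis.Jedis;

public class CentralTokenLimiter {

    private final Jedis jedis;
    private final long tokensPerSecond;

    public CentralTokenLimiter(Jedis jedis, long tokensPerSecond) {
        this.jedis = jedis;
        this.tokensPerSecond = tokensPerSecond;
    }

    /** Every node calls this before invoking the downstream service. */
    public boolean tryAcquire(String resource) {
        long second = System.currentTimeMillis() / 1000;
        String key = "rate:" + resource + ":" + second;
        long used = jedis.incr(key);     // consume one token from the shared per-second pool
        if (used == 1) {
            jedis.expire(key, 2);        // let the window key expire on its own
        }
        return used <= tokensPerSecond;  // over quota: the caller should reject or queue
    }
}
```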
1.3 Turning synchronous into asynchronous
My partner's capacity is limited, but its attitude is good and it will work overtime to get things done; and our customers are kind enough to accept getting the result a little later!
A classic example is the repayment business of third-party payment platforms. Anyone who has used one will have noticed that the write-off notification usually arrives a while after the payment is completed.
What is the underlying logic of this delay?
Generally, the service interfaces of financial institutions, because of their requirements on data consistency and system stability, may not match the throughput of an Internet company's systems.
So during the repayment peaks at the beginning and end of each month, if the write-off requests of every successfully paying user were pushed straight to the institution, the consequences are easy to imagine.
From the user's point of view, however, the whole process can be split: the user only needs to complete the payment operation, and it is acceptable for the final result to be notified a little later.
Therefore, financial gateways basically handle institutional write-offs asynchronously: the write-off requests from each business line are first persisted, and the pending documents are then polled asynchronously at a limited rate and sent to the institution.
In fact, this is not limited to the financial field. As long as the two sides process at different speeds and the flow can be split, this architectural idea can be used to relieve system pressure and keep the business available.
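Below is a minimal sketch of this "persist first, push later at a limited rate" pattern. The in-memory queue stands in for the persisted table of pending write-offs, and Guava's RateLimiter paces the calls to the institution; the rate of 50 calls per second is an illustrative assumption.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import com.google.common.util.concurrent.RateLimiter;

public class AsyncWriteOffWorker {

    private final BlockingQueue<String> pendingWriteOffs = new LinkedBlockingQueue<>();
    private final RateLimiter institutionRate = RateLimiter.create(50); // illustrative: 50 calls/s

    /** Called synchronously in the payment flow: just persist the request and return. */
    public void submit(String writeOffRequest) {
        pendingWriteOffs.offer(writeOffRequest);
    }

    /** Background worker: drains pending requests no faster than the institution can bear. */
    public void runWorker() throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            String request = pendingWriteOffs.take();
            institutionRate.acquire();   // throttle calls to the institution
            callInstitution(request);    // then notify the user of the final result
        }
    }

    private void callInstitution(String request) {
        // invoke the institution's write-off interface here
    }
}
```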
Part 2 Degradation – Giving Up the Chariot to Protect the General
A sudden incident, limited capacity: I can only keep serving a few of my most important customers!
So, when is degradation needed, and which links can be degraded?
When the whole service is in a peak period or an activity burst, and its load is high and approaching the bearing threshold, you can consider degrading the service to keep the main functions available.
What gets degraded must be a non-core link. Take points deduction in an online shopping scenario: if the points-deduction link is degraded, most payments are unaffected.
So, what degradation schemes do we generally use in a system?
1. Page degradation: start from the page the user operates on, and directly restrict or cut off the entry to a function:
[Figure: page degradation controlled by a degradation switch on the rendering link]
As shown in the figure above, in this business scenario, whether points can be used is decided during the page-rendering stage and returned to the front end for page assembly.
When we need to degrade, we flip the degradation switch on the control platform. Once the system reads that the switch is on, it returns a flag indicating that points are degraded, and the front end no longer shows the points-deduction entry. In other words, the points link is cut off right at the entrance, which achieves the degradation.
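Here is a minimal sketch of such a degradation switch on the rendering link; the switch source (a config center, a DB flag, and so on) and the field names are illustrative assumptions.

```java
public class PointsRenderService {

    // Toggled from the control platform, e.g. via a config-center listener.
    private volatile boolean pointsDegraded = false;

    public void onSwitchChanged(boolean degraded) {
        this.pointsDegraded = degraded;
    }

    /** Called while rendering the order page. */
    public PageModel render(long userId) {
        PageModel model = new PageModel();
        if (pointsDegraded) {
            model.showPointsEntry = false;   // front end hides the points-deduction entry
        } else {
            model.showPointsEntry = true;
            model.availablePoints = queryPoints(userId);
        }
        return model;
    }

    private long queryPoints(long userId) {
        return 0L; // query the points service here
    }

    static class PageModel {
        boolean showPointsEntry;
        long availablePoints;
    }
}
```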
2. Storage degradation: use a cache to take the load off heavily operated storage:
[Figure: flash-sale architecture that replaces DB operations with cache operations and the synchronous interface with asynchronous MQ]
For the flash-sale business, a scenario with many writes and few reads, the pressure on the DB is very high. We generally adopt the caching architecture shown in the figure above: cache operations replace DB operations, and asynchronous MQ replaces the synchronous interface. This, too, is a form of storage degradation.
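The sketch below illustrates that idea under stated assumptions: stock is decremented in Redis instead of the DB, and the order is handed to a message queue for asynchronous persistence. The MessageProducer interface and the key/topic names are stand-ins, not a real MQ client API.

```java
import redis.clients.jedis.Jedis;

public class FlashSaleService {

    /** Stand-in for an MQ client; not a real producer API. */
    interface MessageProducer {
        void send(String topic, String payload);
    }

    private final Jedis jedis;
    private final MessageProducer producer;

    public FlashSaleService(Jedis jedis, MessageProducer producer) {
        this.jedis = jedis;
        this.producer = producer;
    }

    public boolean placeOrder(long userId, String skuId) {
        // 1. Deduct stock in the cache instead of hitting the DB directly.
        long remaining = jedis.decr("stock:" + skuId);
        if (remaining < 0) {
            jedis.incr("stock:" + skuId);  // roll back the over-sold deduction
            return false;                   // sold out
        }
        // 2. Persist asynchronously: a downstream consumer writes the order to the DB.
        producer.send("flash_sale_order", userId + ":" + skuId);
        return true;
    }
}
```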
3. Read degradation: directly disable read requests for non-core information.
Take WeChat's red-packet-grabbing scenario: displaying the red-packet list is a non-core link of grabbing red packets, so when business pressure is high, the reading of avatars and other information in the list can simply be disabled.
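A minimal sketch of this kind of read degradation is shown below: when the pressure switch is on, the non-core avatar lookup is skipped and a default placeholder is returned instead. The switch and the default URL are illustrative assumptions.

```java
public class RedPacketListService {

    private static final String DEFAULT_AVATAR = "https://example.com/default-avatar.png";

    // Toggled under high business pressure.
    private volatile boolean avatarReadDegraded = false;

    public void onSwitchChanged(boolean degraded) {
        this.avatarReadDegraded = degraded;
    }

    public String loadAvatar(long userId) {
        if (avatarReadDegraded) {
            return DEFAULT_AVATAR;       // non-core read disabled under pressure
        }
        return queryAvatarFromStore(userId);
    }

    private String queryAvatarFromStore(long userId) {
        return ""; // real avatar lookup here
    }
}
```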
4. Write degradation: directly prohibit the write operations involved in the relevant service requests.
To sum up the core of degradation in one sentence: give up the chariot to protect the general. At the cost of losing part of the experience, we buy the stability and continued availability of the whole service link.
Part 3 Circuit Breaking – Seeing the Big Picture
My partner is in trouble. I can't push it into a corner for my own sake, but I can't let it drag me down either! And out of humanitarian concern, I still have to ask from time to time: are you OK?
The circuit breaker earns its "big picture" nickname because the problems it solves are cascading failures and service avalanches!
[Figure: an exception in service C cascading into timeouts and resource exhaustion in service B]
In a distributed environment, exceptions are the norm. As shown in the figure above, when service C fails, service B sees a large number of request timeouts and call delays.
These calls still occupy system resources; once a large number of requests piles up, service B's thread pools and other resources are exhausted as well, eventually leading to an avalanche across the whole service link.
Therefore, when service C becomes abnormal, pausing calls to C appropriately while continuously probing whether its interface has recovered is essential for the health of the whole link. This treatment of C is exactly what a circuit breaker does.
[Figure: the three key points of circuit breaker operation]
As can be seen from the figure above, a circuit breaker has three key points:
- The circuit-breaking algorithm, i.e. the conditions under which the breaker is judged to need to open.
- Post-breaking handling: while the breaker is open, the current system makes no remote calls, but the call result still needs fallback logic.
- Circuit-breaker recovery: an appropriate probing mechanism for ending the break and resuming normal service calls.
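To tie the three points together, here is a minimal hand-rolled sketch: an error-count trip condition, a fallback while open, and a half-open probe for recovery. The thresholds are illustrative, and in production you would normally reach for a library such as Sentinel or Hystrix rather than rolling your own.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class SimpleCircuitBreaker {

    private enum State { CLOSED, OPEN, HALF_OPEN }

    private static final int FAILURE_THRESHOLD = 5;       // trip after 5 straight failures
    private static final long OPEN_WINDOW_MS = 10_000L;   // allow a probe call after 10s

    private volatile State state = State.CLOSED;
    private volatile long openedAt = 0L;
    private final AtomicInteger consecutiveFailures = new AtomicInteger();

    public <T> T call(Supplier<T> remoteCall, T fallback) {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt < OPEN_WINDOW_MS) {
                return fallback;                 // breaker open: skip the remote call
            }
            state = State.HALF_OPEN;             // window elapsed: let one probe call through
        }
        try {
            T result = remoteCall.get();
            consecutiveFailures.set(0);
            state = State.CLOSED;                // probe (or normal call) succeeded: recover
            return result;
        } catch (RuntimeException e) {
            if (state == State.HALF_OPEN
                    || consecutiveFailures.incrementAndGet() >= FAILURE_THRESHOLD) {
                state = State.OPEN;              // trip, or stay open after a failed probe
                openedAt = System.currentTimeMillis();
            }
            return fallback;                     // post-breaking handling: fallback logic
        }
    }
}
```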
As mentioned in an earlier article on implementing flexible transaction degradation in scenarios where the dependent storage is not fully trusted, our distributed transactions rely on the underlying storage for metadata storage and consistency verification.
However, the stability of that underlying storage is somewhat lacking, which is where circuit-breaker handling comes in:
- When keyword monitoring detects that abnormal operations on the underlying storage have reached a certain threshold, a script triggers a switch operation.
- The effect of this switch is to abandon the underlying storage and go directly to the message queue, so that most requests can still be processed normally.
- While the breaker is open, a probing thread tests whether the underlying storage has recovered; once it is detected to be back to normal, the switch is flipped back and the normal link is restored.
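A minimal sketch of that recovery probe is shown below: a scheduled task periodically checks the underlying storage and flips the degrade switch back once it looks healthy. The health check, the probe interval, and the switch itself are illustrative stand-ins for the real monitoring script and configuration switch.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class StorageRecoveryProbe {

    // true = storage abandoned, requests go straight to the message queue
    private final AtomicBoolean storageDegraded = new AtomicBoolean(false);
    private final ScheduledExecutorService probe = Executors.newSingleThreadScheduledExecutor();

    public StorageRecoveryProbe() {
        // Probe every 5 seconds; it only does work while the breaker is open.
        probe.scheduleWithFixedDelay(this::checkAndRecover, 5, 5, TimeUnit.SECONDS);
    }

    /** Called by the monitoring script when storage errors exceed the threshold. */
    public void openBreaker() {
        storageDegraded.set(true);
    }

    public boolean isDegraded() {
        return storageDegraded.get();
    }

    private void checkAndRecover() {
        if (storageDegraded.get() && storageIsHealthy()) {
            storageDegraded.set(false);   // flip the switch back to the normal link
        }
    }

    private boolean storageIsHealthy() {
        // e.g. run a lightweight read/write against the storage and check the result
        return false;
    }
}
```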
References:
[1] Sentinel wiki, System Adaptive Protection: https://github.com/alibaba/Sentinel/wiki/
[2] Hystrix wiki: https://github.com/Netflix/Hystrix/wiki