If the scenic spot can accommodate 10,000 people, and now there are 30,000 people in it, it is bound to rub shoulders one after another, and there will be accidents if it is not good, and the result is that everyone’s experience is not good.

If there is an accident, the scenic spot may have to be closed, resulting in unusable access, and the consequence is that everyone feels that the experience is terrible.

The idea of limiting the flow is to increase the number of people entering as much as possible while ensuring availability, and the rest of the people wait in line outside, ensuring that the 10,000 people inside can play normally.

Back to the Internet, the same is true, for example, a certain star announced a romance, the visit increased from the usual 500,000 to 5 million, the system can support up to 2 million visits, Double 11 flash sale activities, 12306 ticket grabs, etc.

Then the throttling rules are enforced to ensure that it is a usable state and does not cause all requests to be unavailable due to a server crash.


current limiting idea

limits the flow of system services, and generally has the following modes:

(1) The fusing

system takes the fusing measures into account at the beginning of the design. When there is a problem with the system, if it cannot be repaired in a short time, the system should automatically make a judgment, turn on the circuit switch, deny traffic access, and avoid overloading requests on the backend with large traffic.

The system should also be able to dynamically monitor the repair of the back-end program, and when the program has returned to stability, the fuse switch can be turned off to restore normal service.

Common fuse components are Hystrix and Ali’s Sentinel, which have advantages and disadvantages and can be selected according to the actual situation of the business.

(2) Service degradation

will be a

grading of all functional services of the system, when the system has a problem and needs to be urgently throttled, the less important functions can be downgraded and the service can be stopped, which can release more resources for the core functions to use.

For example, in an e-commerce platform, if there is a sudden surge in traffic, non-core functions such as product reviews and points can be temporarily downgraded, these services can be stopped, and resources such as machines and CPUs can be released to ensure that users place orders normally.

These downgraded functional services can wait for the entire system to return to normal before starting to make up orders/compensation.

In addition to functional degradation, you can also use all read cache and write cache as a temporary degradation solution that does not directly manipulate the database.

(3) Delayed processing

This mode requires setting up a traffic buffer pool at the front end of the system, buffering all requests into this pool and not processing immediately.

Then the real business handler on the backend takes out the request from this pool and processes it in turn, which can be commonly implemented in the queue pattern.

This is equivalent to using asynchronous methods to reduce the processing pressure of the backend, but when the traffic is large, the processing power of the backend is limited, and the requests in the buffer pool may not be processed in a timely manner, and there will be a certain degree of delay. The specific leakage bucket algorithm and the token bucket algorithm are this idea.

(4) The privilege processing

mode needs to classify users, through the preset classification, let the system give priority to user groups that need high protection, and the requests of other user groups will be delayed or directly not processed.

The difference between caching, degradation, and throttling is as follows:

    > cache is used to increase system throughput. Improve access speed and provide high concurrency.

  • Downgrade is to temporarily block the service that has problems when some service components of the system are unavailable, traffic surges, resource exhaustion, etc., continue to provide downgrade services, give users as friendly tips as possible, return the bottom data, will not affect the overall business process, and then re-launch the service after the problem is solved

  • Throttling refers to scenarios where caching and degradation are invalid. For example, when the threshold is reached, limit the interface call frequency, access times, and inventory, and downgrade the service in advance before the service is unavailable. Only serve a subset of users.

There are many

algorithm current limiting

algorithms, and

there are three common categories, namely counter algorithms, leakage bucket algorithms, and token bucket algorithms, which are explained one by one below.

(1) The

counter algorithm is simple and rude, such as specifying the thread pool size, specifying the database connection pool size, the number of Nnginx connections, etc., which belong to the counter algorithm


The counter algorithm is the simplest and easiest algorithm to implement among the current limiting algorithms. For example, for example, we stipulate that for the A interface, we cannot have more than 100 visits per minute.

So we can

do this: in the beginning, we can set a counter counter, and whenever a request comes over, the counter is incremented by 1.

If the value of counter is greater than 100 and the interval between the request and the first request is less than 1 minute, the number of requests is too large and access is denied.

If the interval between the request and the first request is greater than 1 minute, and the value of counter is still within the throttling range, then resetting the counter is as simple and rude as that.

(2) Leakage barrel algorithm The idea of the leakage barrel algorithm is very simple, water (request) first enters the leakage barrel, the leakage barrel comes out of the water at a certain speed, when the water inflow speed is too large to exceed the acceptable capacity of the barrel directly overflow, it can be seen that the leakage barrel algorithm

can forcibly limit the transmission rate of data.

Peak shaving: When a large amount of traffic comes in, an overflow occurs, throttling the availability of the protection service.

Buffering: It will not be directly requested to the server, buffering pressure, and the consumption speed is fixed, because the computing performance is fixed.

(3) Token bucket algorithm Token


is similar to leaking bucket, except that some tokens are placed in the token bucket, and after the service request arrives, the service will only be obtained after the token is obtained.

For example, we usually go to the canteen to

eat, we are queuing up in front of the window inside the canteen, which is like the barrel leakage algorithm, a large number of people gather outside the window in the canteen to enjoy the service at a certain speed.

If there are too many people pouring in, the canteen can’t fit, there may be some people standing outside the canteen, which does not enjoy the service of the canteen, called overflow, overflow can continue to request, that is, continue to queue, so what’s the problem?

If there are special circumstances at this time, such as some volunteers in a hurry or the college entrance examination in the third year of high school, this situation is an emergency.

If you also use the leakage bucket algorithm, you have to queue slowly, which does not solve our needs, for many application scenarios, in addition to the requirement to be able to limit the average transmission rate of data, it is also required to allow some degree of burst transmission.

At this time, the leakage bucket algorithm may not be suitable, and the token bucket algorithm is more suitable.

As shown in the figure, the principle of the token bucket algorithm is that the system will put tokens into the bucket at a constant speed, and if the request needs to be processed, it needs to obtain a token from the bucket first, and when there is no token to take in the bucket, it will deny service.

Concurrent current limiting


simply to set the total number of QPS of the system threshold, which is also quite common, in the case of Tomcat, many parameters are based on this consideration.

For example, the configured acceptCount sets the number of response connections, maxConnections sets the instantaneous maximum number of connections, and maxThreads sets the maximum number of threads.

In each framework or component, concurrency limiting fluids now has the following aspects:

  • limiting the total number of concurrency (such as database connection pools, thread pools).

  • Limit instantaneous concurrency (nginx’s limit_conn module, which limits the number of instantaneous concurrent connections)

  • limits the average rate

  • within a time window (e.g., Guava’s RateLimiter, nginx’s limit_req module, which limits the average rate per second

  • ). Others limit the remote interface call rate and limit the MQ consumption rate.

  • You can also throttle based on the number of network connections, network traffic, CPU or memory load, and so on.

With concurrency

throttling, it means that there is an additional protection mechanism when dealing with high concurrency, and there is no need to worry about instantaneous traffic causing the system to hang up or avalanche, and finally achieve damage to service rather than no service.

However, the throttling needs to be evaluated well and cannot be used indiscriminately, otherwise some normal traffic will have some strange problems and cause poor user experience and cause user loss.

Interface current limiting interface current limiting

is divided into two parts, one is to limit the number of interface calls in a period of time, referring to the counter algorithm of the previous current limiting algorithm, and the other is to set the sliding time window algorithm.

(1) The

total number of interfaces controls the total number of

interfaces called in a period of time, which can refer to the previous counter algorithm and will not be repeated.


) The problem of the interface time window fixed time window

algorithm (that is, the

previously mentioned counter algorithm) is that the statistical interval is too large and the current limit is not accurate enough. Moreover, the relationship and impact with the previous statistical interval are not considered in the second statistical interval (the second half of the first interval + the first half of the second interval is also one minute).

In order to solve the critical problem we mentioned above, we try to divide each statistical interval into smaller statistical intervals and more accurate statistical counts.

In the example above, suppose QPS can accept 100 queries/second, with access low in the first 40 seconds of the first minute and burst in the next 20 seconds, and this lasts for a while until the 40th second of the second minute begins to drop.

According to the previous counting method, the QPS of the previous second is

94, and the QPS of the next second is 92, then the set parameters are not exceeded.

But in the middle region, the QPS reaches 142, which is significantly more than we allowed to request service, so the fixed window counter is less reliable and requires sliding the window counter.

The counter algorithm is actually a fixed window algorithm, but it does not further divide the time window, so there is only 1 grid.

It can be seen that when the grid of the sliding window is

divided more, that is, the seconds are accurate to milliseconds or nanoseconds, the smoother the scrolling of the sliding window will be, and the more accurate the statistics of the current limit. It should be noted that the more space is consumed.

This part of the


limit implementation

is the specific implementation of the current limit, to put it simply, after all, no one wants to read the long code.

(1) guava implementation

introduction package:

    <groupId> com.google.guavagroupId>
    <version >28.1-jreversion>
Core code:
LoadingCache<Long, AtomicLong> counter = CacheBuilder.newBuilder().
  expireAfterWrite(2, TimeUnit.SECONDS)
  .build(new CacheLoader<Long, AtomicLong>() {


   public AtomicLong load(Long secend) throws Exception {
    // TODO Auto-generated method stub
    return new AtomicLong(0);   }  });

counter. get(1l).incrementAndGet();

(2) Token bucket to achieve

stable mode (SmoothBursty: Constant token generation speed):
public static void main(String[] args{
 // RateLimiter.create(2) Number of tokens generated per second RateLimiter
limiter = RateLimiter.create(2);
limiter.acquire() blocks the way to acquire the token
System out.println(limiter.acquire());;
 try {
 } catch (InterruptedException e) {
  // TODO Auto-generated catch block  e.printStackTrace(); }

 System. out.println(limiter.acquire());;

 System. out.println(limiter.acquire());;
 System. out.println(limiter.acquire());;
 System. out.println(limiter.acquire());;

 System. out.println(limiter.acquire());;

 System. out.println(limiter.acquire());; }

RateLimiter.create(2) Capacity and burst volume, the token bucket algorithm allows tokens that have not been consumed for a period of time to be staged into the token bucket for burst consumption.

Progressive mode (SmoothWarmingUp: Token generation speed increases slowly until it is maintained at a stable value):
 Smooth throttling, the time interval from the cold start rate (full) to the average consumption rate 
RateLimiter limiter = RateLimiter.create(2,1000l,TimeUnit.MILLISECONDS);
System. out.println(limiter.acquire());;
try {
catch (InterruptedException e) {
 // TODO Auto-generated catch block e.printStackTrace(); }

System. out.println(limiter.acquire());

System. out.println(limiter.acquire());
System. out.println(limiter.acquire());
System. out.println(limiter.acquire());

System. out.println(limiter.acquire());

System. out.println(limiter.acquire());
boolean tryAcquire = limiter.tryAcquire( Duration.ofMillis(11)); 

Whether a token can be obtained within the timeout time, executed asynchronously.

(3) Distributed system current limiting (Nginx+Lua implementation)

can use resty.lock to maintain atomic properties, and there is no lock reentrancy between requests:
using lua_shared_ dict store data

: local locks =

require "resty.lock"local function acquire()

local lock = locks:new("locks")

) local
elapsed, err =lock:lock("limit_key") -- mutex guarantees atomic properties
local limit_counter = ngx.shared.limit_counter -- counter

local key = "ip:" : os.time()

limit = 5 -- current
size local current = limit_counter:get( key) if current

~= nil and current + 1> limit then -- if the current limit size is exceeded

       return 0
    if current == nil then
       limit_ counter:set(key, 1, 1) -- the first time you need to set the expiration time, set the value of the key to 1,
-- the expiration time is 1 second
limit_ counter:incr(key, 1) -- add 1 to the second start
return 1

by Can’t wait for the harmonica

Editor: Tao Jialong

Source: https://urlify.cn/fuAB7j




public number (zhisheng ) reply to Face, ClickHouse, ES, Flink, Spring, Java, Kafka, Monitor keywords such as to view more articles corresponding to keywords.

like + Looking, less bugs