Network transmission is fundamentally about the sharing and reuse of network resources, which makes congestion control one of the core problems in network engineering. With the explosive growth of Internet and data center traffic, the field has seen many innovations in algorithms and mechanisms. This series is the Chinese edition of the free e-book TCP Congestion Control: A Systems Approach[1], which comprehensively introduces the concepts, principles, algorithms, and implementations of congestion control.

With the architectural foundations of TCP/IP in place, we are ready to explore the design space for addressing congestion. But first it is worth stepping back to consider the bigger picture. The Internet is a complex combination of compute, storage, and communication resources shared by millions of users, and the central question is how to allocate those resources (in particular switching capacity, buffer space, and link bandwidth) to end-to-end traffic.

Because the Internet originally adopted a best-effort service model, in which users (or, more precisely, the TCP implementations acting on their behalf) were free to send as many packets into the network as they could, the Internet was bound to eventually suffer the tragedy of the commons. Once users began to experience congestion collapse, the natural reaction was to add more control. The term congestion control can thus be thought of as an implicit mechanism for allocating resources: when the control mechanism detects that resources are becoming scarce, it reacts to relieve the congestion.

An obvious alternative in the network service model is to allocate resources to flows explicitly, for example by having an application request resources before sending traffic. Because IP assumes best-effort delivery, no such approach was readily available when congestion first became a serious problem. Efforts followed to retrofit more explicit resource allocation mechanisms onto the Internet's best-effort model, including mechanisms for providing QoS guarantees. It is helpful to view the Internet's approach to congestion from this perspective. We first explore the set of design decisions that form the basis of congestion control mechanisms, and then define criteria for quantitatively evaluating and comparing different mechanisms.

Let's start by introducing the four implementation choices facing any congestion control mechanism and the design rationale behind the decisions made for TCP/IP. Some of these decisions may seem "obvious" given the environment, but because that environment keeps changing as the Internet grows, each deserves careful consideration.

The first design decision is, in principle, whether the network's resource allocation is centralized or distributed. In practice, the scale of the Internet and the autonomy of the organizations that connect to it dictate a distributed approach. Indeed, as Dave Clark has described, distributed resource management was an explicit goal of the Internet's design. Acknowledging this default decision is nevertheless important, for two reasons.

Further reading: D. Clark, The Design Philosophy of the DARPA Internet Protocols[2]. ACM SIGCOMM, 1988.

First, while the Internet's approach to congestion control is distributed across millions of hosts and routers, they can be viewed as cooperating to achieve a globally optimal outcome. From this point of view there is a shared objective function, and all the components run a distributed algorithm to optimize it. The various mechanisms described in this book simply define different objective functions, and one long-standing challenge is how to reconcile competing objective functions when multiple mechanisms are deployed.

Second, while a centralized approach is impractical for the Internet as a whole, it can work within limited domains. For example, a logically centralized controller can collect information about the state of the network's links and switches, compute a globally optimal resource allocation, and then advise (or even police) the capacity each end host may use. Such an approach is limited by the timescale on which the centralized controller can respond to changes in the network, but it has been successfully applied to coarse-grained allocation decisions by traffic engineering systems such as B4 and SWAN. There is no bright line between coarse-grained traffic engineering decisions and fine-grained congestion control decisions, so it is best to keep an open mind about the full range of options.

Further reading: S. Jain, et al. B4: Experience with a Globally-Deployed Software Defined WAN[3]. ACM SIGCOMM, August 2013.

Centralized control is also used effectively in data centers, which are an interesting environment for congestion control. First, RTTs within a data center are very low (considering only server-to-server traffic, not traffic entering or leaving the data center). Second, a data center can in many cases be treated as a greenfield, opening the possibility of trying new approaches that need not coexist on equal terms with existing algorithms. Fastpass, developed jointly by researchers at MIT and Facebook, is a good example of such a centralized approach.

Further reading: J. Perry, et al. Fastpass: A Centralized “Zero-Queue” Datacenter Network[4]. ACM SIGCOMM, August 2014.

Given a distributed approach to resource allocation, the next question is whether the mechanism should be implemented inside the network (that is, in routers and switches) or at the edge of the network (that is, as part of the transport protocol running in hosts). This is not strictly an either/or choice; in practice both are involved, and the real question is which component bears the primary responsibility. Routers are always responsible for deciding which packets to forward and which to drop, but there is a wide range of options for how much a router participates in, and communicates with, the decisions made by end hosts.

At one end of the spectrum, routers allow hosts to reserve capacity and then ensure that each flow's packets are delivered accordingly. A router might implement a signaling protocol along with mechanisms such as fair queuing, admit new flows only when sufficient capacity remains, and police hosts to ensure their traffic stays within its reservation. This is the reservation-based approach to QoS guarantees, the details of which are beyond the scope of this book.

At the other end is the host-centric approach. Routers make no guarantees about available capacity and offer no explicit feedback (packets are silently dropped when buffers overflow); it is the host's responsibility to observe network conditions (for example, how many of its packets are getting through) and adjust its behavior accordingly.

Between these two extremes, routers can take more proactive action to assist end hosts without reserving buffer space, for example by sending feedback to end hosts when their buffers start to fill. We will cover several forms of active queue management (AQM) in Chapter 6, but the host-centric mechanisms described in the next two chapters assume that routers simply drop packets silently when their buffers are full.

Historically, the host-centric approach has been implemented in the transport layer, usually by TCP or by another transport protocol that mimics TCP's algorithm, such as DCCP (Datagram Congestion Control Protocol) or QUIC (a relatively new transport protocol designed for HTTP-based applications). But congestion control can also be implemented in the application itself. DASH (Dynamic Adaptive Streaming over HTTP) is one example, although strictly speaking it combines transport-layer congestion control (since it runs over TCP) with application-layer control. Based on measured network performance, the server streaming video to a client switches among a set of encodings of different quality, changing the rate at which it writes data into the HTTP stream. In effect, TCP finds a sustainable bandwidth for the stream, and the application then adjusts its sending rate to make full use of that bandwidth without sending more data than current network conditions can support. The primary responsibility for congestion control falls on TCP; the application's goal is to keep the pipe full while preserving a good user experience.
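To make the DASH idea concrete, here is a minimal sketch of the rate-selection step in Python. The bitrate ladder, safety factor, and function names are illustrative assumptions, not the behavior of any particular player (real players such as dash.js use considerably more elaborate heuristics).

```python
# A minimal sketch of DASH-style bitrate adaptation. All names and
# thresholds are hypothetical, chosen only to illustrate the idea.

BITRATES_BPS = [500_000, 1_000_000, 2_500_000, 5_000_000]  # available encodings

def pick_bitrate(measured_throughput_bps: float, safety_factor: float = 0.8) -> int:
    """Pick the highest encoding that fits under the measured throughput.

    The safety factor leaves headroom so the application stays below the
    sustainable bandwidth TCP has found, rather than fully saturating it.
    """
    budget = measured_throughput_bps * safety_factor
    candidates = [r for r in BITRATES_BPS if r <= budget]
    return max(candidates) if candidates else min(BITRATES_BPS)

# Example: TCP is currently delivering ~2.2 Mbps to this client.
print(pick_bitrate(2_200_000))  # -> 1000000: the 1 Mbps encoding fits with headroom
```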

Having settled on a host-centric approach, the next implementation choice is between a window-based and a rate-based mechanism. TCP uses a window-based mechanism for flow control, so this design decision for TCP congestion control seems obvious. Indeed, the congestion control mechanisms described in Chapter 4 revolve around algorithms that compute a congestion window, with the sender limited by the smaller of the flow-control window and the computed congestion window.

However, it is also possible to compute the rate of packet transmission that the network can support and pace transmissions accordingly. An observed rate is simply the number of bytes delivered over some time interval, such as a measured RTT. We call attention to this duality between rates and windows because a rate-based approach is a better fit for multimedia applications, which generate data at some average rate and need at least some minimum throughput to be useful. For example, a video codec might generate video at an average rate of 1 Mbps with a peak rate of 2 Mbps.
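The duality is easy to state in code. The sketch below (with illustrative names of our own) shows the window side, where the sender is bound by the smaller of the two windows, and the conversion from a window to its equivalent rate, since one window's worth of data is delivered per RTT.

```python
def effective_window(rwnd_bytes: int, cwnd_bytes: int) -> int:
    """A sender may have at most min(rwnd, cwnd) bytes in flight:
    rwnd is the receiver's advertised flow-control window and cwnd is
    the locally computed congestion window."""
    return min(rwnd_bytes, cwnd_bytes)

def window_to_rate(window_bytes: int, rtt_seconds: float) -> float:
    """The rate/window duality: one window's worth of data is delivered
    per RTT, so the equivalent sending rate is window / RTT (bytes/s)."""
    return window_bytes / rtt_seconds

# Example: a 64 KB window over a 100 ms RTT sustains roughly 5.2 Mbps.
w = effective_window(rwnd_bytes=128 * 1024, cwnd_bytes=64 * 1024)
print(window_to_rate(w, 0.1) * 8 / 1e6, "Mbps")  # ~5.24 Mbps
```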

A rate-based approach is the natural choice in reservation-based systems that support different QoS levels, but even in a best-effort network like the Internet it is possible to implement an adaptive rate-based congestion control mechanism that notifies applications when they need to adjust their transmission rate, for example by adjusting codec parameters. This is the core idea of TCP-Friendly Rate Control (TFRC), which extends the concepts of TCP congestion avoidance to applications that more naturally send packets at a specific rate (for example, the bit rate a video codec produces at a given quality level). TFRC is typically used with RTP, a transport protocol designed for real-time applications; we will see examples of such mechanisms in Chapter 7.

Finally, one of the latest advances in TCP congestion control is BBR (Bottleneck Bandwidth and RTT), which uses a combination of window-based and rate-based control to limit the accumulation of network queues as much as possible. We will cover this approach in detail in Chapter 5.

Finally, we should note a subtlety in this last implementation choice. The challenge for an end host is to compute, from feedback and observation, how much capacity is available in the network, and to adjust its sending rate accordingly. There are two common strategies: an aggressive approach that deliberately sends fast enough to induce packet loss and then reacts to it, and a conservative approach that tries to detect the point at which queues begin to build and reduce the rate before they overflow. We call mechanisms of the first type control-based and mechanisms of the second type avoidance-based.
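The contrast between the two strategies can be caricatured as a pair of window-update rules. The sketch below is a deliberate simplification: the constants are illustrative, and no deployed algorithm is this naive.

```python
def control_based_update(cwnd: float, loss_detected: bool) -> float:
    # Control-based (reactive): probe for bandwidth until packets are
    # dropped, then back off sharply. AIMD-style constants are used here
    # purely to illustrate the style, not any specific algorithm.
    return max(cwnd / 2, 1.0) if loss_detected else cwnd + 1.0

def avoidance_based_update(cwnd: float, rtt_gradient: float) -> float:
    # Avoidance-based (conservative): treat a rising RTT as a sign that
    # a queue is starting to build, and back off before any loss occurs.
    return max(cwnd - 1.0, 1.0) if rtt_gradient > 0 else cwnd + 1.0
```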

Further reading: R. Jain and K. Ramakrishnan. Congestion Avoidance in Computer Networks with a Connectionless Network Layer: Concepts, Goals and Methodology[5]. Computer Networking Symposium, April 1988.

Raj Jain and K.K. Ramakrishnan first proposed this distinction in 1988, but it is often overlooked; people commonly use the single term "congestion control" to refer to both approaches. We think the distinction is important, so we will point it out where relevant; when the difference does not matter to the discussion, we will simply use the generic term "congestion control."

It should also be noted that the approaches we call control-based and avoidance-based are sometimes called loss-based and delay-based, respectively, because the two use different signals to adjust the congestion window: the former adjusts the window when it detects packet loss, the latter when it detects a change in the delay gradient. From this perspective, each of the algorithms presented over the next four chapters in some way improves the fidelity of one of these signals.

Having identified the design decisions that go into building a congestion control mechanism, the next question is how to tell whether a given solution is good or bad. Recall that in Chapter 1 we raised the question of how effectively and fairly a network allocates its resources, suggesting at least two broad metrics by which a resource allocation scheme can be evaluated. We consider each in turn.

Throughput and latency, the two principal metrics of any network, are a good starting point for evaluating the effectiveness of a congestion control mechanism. We obviously want the highest possible throughput and the lowest possible latency, but unfortunately these goals can conflict. The way to increase throughput is to let as many packets as possible into the network, driving the utilization of every link toward 100% and avoiding idle links that would reduce throughput. The problem with this strategy is that more packets in the network means longer queues at every router, which means higher latency; worse, excess packets may be dropped. Dropped packets hurt not only latency but also throughput, since upstream link bandwidth is wasted on a packet that never reaches its destination [1].

[1] We sometimes use the term goodput instead of throughput to emphasize that we care about data successfully delivered across the network to the receiver, not merely data sent by the sender.

The ratio of throughput to latency is a common metric for evaluating the effectiveness of a resource allocation scheme. This ratio is sometimes referred to as the power of the network:

Power = Throughput / Delay

Our goal is to maximize this ratio, which is a function of how much load we place on the system; the load, in turn, is set by the resource allocation mechanism. Figure 11 shows a typical power curve. Ideally, the resource allocation mechanism would operate at the peak of this curve. To the left of the peak the mechanism is too conservative, not allowing enough packets in to keep the links busy. To the right of the peak, so many packets have entered the network that either (a) queuing has increased latency (the denominator) or (b) packet loss has actually begun to reduce throughput (the numerator).
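As a toy illustration of operating at the peak of the power curve, the sketch below takes (load, goodput, delay) samples and picks the load that maximizes power. All numbers here are invented for illustration; in practice these would come from measurement.

```python
samples = [
    # (offered load, goodput in Mbps, average delay in ms) -- made-up data
    (0.2, 2.0, 10.0),
    (0.5, 5.0, 12.0),
    (0.8, 7.5, 20.0),
    (0.95, 8.0, 60.0),   # queues building: delay grows faster than goodput
    (1.1, 6.0, 200.0),   # overload: loss reduces goodput, delay explodes
]

# Power = throughput / delay; the mechanism should operate near the peak.
best = max(samples, key=lambda s: s[1] / s[2])
print(f"power peaks at load {best[0]}: {best[1] / best[2]:.3f} Mbps/ms")
```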

We also need to care about what happens when the system operates under heavy load, at the right end of the curve in Figure 11. We want to avoid the situation where system throughput approaches zero; the goal is a mechanism that is stable, so that packets keep flowing through the network even when it is heavily loaded. If a mechanism is unstable under overload, the network suffers congestion collapse.

Note that while both persistent queues and congestion collapse are to be avoided, there is no precise definition of the threshold at which a network suffers from either; both are qualitative judgments about how an algorithm behaves. In short, latency and throughput are the two key performance metrics.

Effective use of network resources is not the only criterion for judging a resource allocation scheme; we must also consider fairness. However, the picture quickly becomes murky when we try to define exactly what constitutes a fair allocation. For example, a reservation-based scheme provides an explicit way to create controlled unfairness: with reservations, we can ensure that a video stream receives 1 Mbps across certain links while a file transfer on the same link receives only 10 Kbps.

In the absence of explicit reservations, on the other hand, when several flows share a particular link we would like each flow to receive an equal share of the bandwidth. This definition presumes that a fair share of bandwidth means an equal share of bandwidth. But even without reservations, equal does not necessarily mean fair. Should we also take path length into account? As Figure 12 illustrates, what is fair when one four-hop flow competes with three one-hop flows?

Assuming that fairness means all flows receive an equal share of bandwidth, network researcher Raj Jain proposed a metric that can be used to quantify the fairness of a congestion control mechanism. Jain's fairness index is defined as follows. Given a set of flow throughputs

(x₁, x₂, …, xₙ)

(measured in consistent units, such as bps), the following function assigns a fairness index to the flows:

f(x₁, x₂, …, xₙ) = (Σᵢ₌₁ⁿ xᵢ)² / (n · Σᵢ₌₁ⁿ xᵢ²)

The fairness index always lies between 0 and 1, with 1 representing maximum fairness. To see the intuition behind this metric, consider the case where all n flows receive a throughput of 1 unit per second. The fairness index is then

n² / (n · n) = 1

Now suppose one flow receives a throughput of 1 + Δ. The fairness index becomes

(n + Δ)² / (n · (n − 1 + (1 + Δ)²))

Expanding, the denominator exceeds the numerator by (n − 1)Δ², so the fairness index drops below 1 whether the odd flow out gets more or less than the other flows (Δ positive or negative). Another simple case to consider is one where only k of the n flows receive equal throughput and the remaining n − k flows receive none; the fairness index then drops to k/n.
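The index is straightforward to compute. The following sketch implements the formula above and reproduces the special cases just discussed.

```python
def jain_fairness_index(throughputs: list[float]) -> float:
    """Jain's fairness index: (sum x_i)^2 / (n * sum x_i^2).
    Returns 1.0 when all flows get equal throughput, and k/n when only
    k of n flows get equal throughput and the rest get none."""
    n = len(throughputs)
    total = sum(throughputs)
    return total * total / (n * sum(x * x for x in throughputs))

print(jain_fairness_index([1, 1, 1, 1]))  # 1.0: perfectly fair
print(jain_fairness_index([1, 1, 1, 2]))  # ~0.893: one flow gets 1 + delta
print(jain_fairness_index([1, 1, 0, 0]))  # 0.5: the k/n case with k=2, n=4
```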

Further reading: R. Jain, D. Chiu, and W. Hawe. A Quantitative Measure of Fairness and Discrimination for Resource Allocation in Shared Computer Systems[6]. DEC Research Report TR-301, 1984.

In the next section we will revisit how the concept of fairness applies to new congestion control algorithms; as noted above, it is not as clear-cut as it first appears.

The concept of fairness also appears in TCP-Friendly Rate Control (TFRC). TFRC uses the TCP throughput equation (discussed in Section 1.3) to estimate the share of a congested link's bandwidth that a flow implementing TCP's congestion control algorithm would obtain, and makes that share the target rate at which an application sends data. The application can then make decisions that help it reach the target rate. For example, a video streaming application might switch among a set of encoding quality levels to try to maintain an average rate at the "fair" level determined by TFRC.
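As a rough illustration of how an equation-based target rate works, the sketch below uses the simple "square root of p" throughput model rather than TFRC's full equation (RFC 5348 additionally models retransmission timeouts), so treat it as an approximation of the idea, not TFRC itself.

```python
from math import sqrt

def tcp_friendly_rate(mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    """Simplified TCP throughput model: rate ~ MSS / (RTT * sqrt(2p/3)),
    in bytes per second. TFRC's full equation (RFC 5348) also accounts
    for retransmission timeouts; this is the bare square-root model."""
    return mss_bytes / (rtt_s * sqrt(2.0 * loss_rate / 3.0))

# Example: 1460-byte segments, 100 ms RTT, 1% loss -> a target of ~1.4 Mbps,
# which an application could meet by picking an appropriate encoding level.
rate = tcp_friendly_rate(1460, 0.1, 0.01)
print(f"{rate * 8 / 1e6:.2f} Mbps")
```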

The first step in evaluating any congestion control mechanism is to measure its performance in isolation, including:

The second step is to compare two or more mechanisms. Given the decentralized nature of the Internet, there is no way to ensure uniform adoption of any one mechanism. Quantitative metrics such as throughput are easy to compare; the harder question is how to evaluate multiple mechanisms that coexist and compete with each other for network resources.

The question is not just whether a given mechanism treats all of its own flows fairly, but whether mechanism A treats flows managed by mechanism B fairly. If A achieves better throughput than B only by being more aggressive, stealing bandwidth from B's flows, then A's apparent improvement must be discounted. The Internet's highly decentralized approach to congestion control works only because the vast majority of flows respond to congestion cooperatively, and this opens the door for more aggressive flows to improve their own performance at the expense of flows running less aggressive algorithms.

Further reading: R. Ware, et al. Beyond Jain’s Fairness Index: Setting the Bar for the Deployment of Congestion Control Algorithms[7]. ACM SIGCOMM HotNets, November 2019.

Debates of this kind have played out repeatedly over the past 30 years, and they set a high bar for deploying new algorithms: even if globally deploying a new algorithm would have a net positive effect, an incremental deployment, the only practical option, can still hurt flows using existing algorithms, leading to a reluctance to deploy new approaches at all. But as Ranysha Ware and colleagues point out, this analysis has three problems:

Rather than simply computing a Jain fairness index, Ware advocates a harm-based threshold, where harm is measured as reduced throughput or increased latency or jitter. Intuitively, if the harm that flows using a new mechanism B inflict on flows using an existing mechanism A falls within the range of harm that A-managed flows already inflict on one another, we can conclude that deploying B alongside A does no additional damage. Ware goes on to propose concrete measures of acceptable harm, which turn out to be more complicated than they might first appear: even under a single congestion control algorithm, the harm one flow does to another depends on factors such as RTT, start time, and flow duration. A harm metric therefore has to account for how flows affect one another under the existing mechanism, and the new algorithm must strive to do no worse.
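As one plausible, and much simplified, rendering of this idea, the sketch below measures harm as relative throughput degradation. Ware's actual proposal also considers latency and jitter, so the function name and form here are assumptions for illustration only.

```python
def harm(solo_throughput: float, competing_throughput: float) -> float:
    """Relative throughput harm: how much of its solo performance a flow
    loses when competing. A simplified stand-in for the harm metric of
    Ware et al., which also covers latency and jitter dimensions."""
    return max(0.0, (solo_throughput - competing_throughput) / solo_throughput)

# B is deployable alongside A (by this intuition) if the harm A-flows
# suffer when competing with B-flows is no worse than the harm they
# already suffer when competing with other A-flows.
harm_a_vs_a = harm(solo_throughput=10.0, competing_throughput=5.0)  # 0.5
harm_a_vs_b = harm(solo_throughput=10.0, competing_throughput=4.0)  # 0.6
print("acceptable" if harm_a_vs_b <= harm_a_vs_a else "B causes extra harm")
```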

One way to evaluate congestion control mechanisms is to measure the performance of real systems; as we noted in Chapter 1, the Linux implementations serve as the de facto specifications of the individual mechanisms. We now describe one concrete way to apply the metrics discussed above, using Netesto (Network Test Toolkit), a collection of software tools available on GitHub. An alternative is simulation, for which NS-3 is the most popular open source tool.

Further reading: Netesto[8]; NS-3 Network Simulator[9]

Note that while the experiments presented in this section measure real congestion control algorithms (which we have not yet described in detail), the purpose is to outline how algorithms are evaluated, not to draw conclusions about specific mechanisms.

We run real TCP senders and receivers on Linux hosts, use kernel facilities such as the netem and tbf queuing disciplines (qdiscs) to produce a range of network behaviors, and collect performance data with tcpdump. The network connecting the end hosts consists of real switches plus emulated components that can reproduce scenarios such as wide-area latencies and low-bandwidth links.
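For concreteness, here is a minimal sketch of how such emulation can be configured from Python by shelling out to tc(8). The interface names are placeholders, the commands require root, and the parameter values simply mirror the wide-area scenario used later (40 ms RTT, 10 Mbps bottleneck).

```python
import subprocess

def tc(cmd: str) -> None:
    """Run a tc(8) command; requires root. Interface names are placeholders."""
    subprocess.run(["tc"] + cmd.split(), check=True)

# On the receiver: emulate a wide-area path by delaying all egress packets
# (and hence the ACKs sent back) by 40 ms with netem.
tc("qdisc add dev eth0 root netem delay 40ms")

# On the server acting as router: create a 10 Mbps bottleneck with a token
# bucket filter (tbf). Burst and latency values here are illustrative.
# tc("qdisc add dev eth1 root tbf rate 10mbit burst 32kbit latency 400ms")
```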

The experiments vary along two orthogonal dimensions. One is the network topology: link bandwidths, RTTs, buffer sizes, and so on. The other is the traffic load we run over the network: the number and duration of flows, as well as the characteristics of each flow (for example, streaming versus RPC).

For the network topology, we evaluate the algorithm in three specific configurations:

Figure 13 shows the topology for the first two scenarios, in which senders and receivers are connected through a single switch. In the second scenario, delay is introduced with netem at the receiver, affecting only the ACKs sent back.

Figure 14 shows the topology of the third scenario, in which the router is implemented by a server acting as a forwarder, using the tbf qdisc to limit the egress link bandwidth.

For traffic loads, we evaluate the dynamics and fairness of the algorithm through the following tests:

These tests make the following possible:

Other tests combine streaming flows with 10 KB and 1 MB RPCs. These tests let us see whether the smaller RPC flows suffer and, if so, by how much.

Some example results are shown below, chosen to illustrate the evaluation process. We start with a simple two-flow experiment in which both flows are managed by the same congestion control algorithm. Figure 15 shows the resulting goodput. As one would expect, once the second flow (red) starts at 20 seconds, the two flows converge to roughly equal shares of the available bandwidth. Convergence is not immediate (the two curves cross about 10 seconds after the second flow starts), and some algorithms try to improve this behavior (for example, by using explicit feedback from routers). On the positive side, once the second flow ends, the first flow quickly takes up the freed bandwidth.

We can also examine the two flows more closely, for example by tracking each flow's congestion window; the corresponding plot is shown in Figure 16. Unsurprisingly, different algorithms produce different congestion-window "patterns" over time, as we will see in more detail in the next chapter.

We can repeat these experiments with one of the two senders switched to a different algorithm, to see how the two algorithms interact. If both are fair, we would expect results similar to Figure 15. If not, we might see results like Figure 17, where the second flow (algorithm B) aggressively takes bandwidth away from the first flow (algorithm A).

The experiment could be repeated with three concurrent flows, but instead let's evaluate how various algorithms handle different workloads. In particular, we are interested in size fairness: how consecutive 10 KB or 1 MB RPC calls fare when a given algorithm must compete with an ongoing streaming flow. Figure 18 (1 MB RPCs) and Figure 19 (10 KB RPCs) show some example results, plotting the performance of five different algorithms (in different colors) against 1, 2, 4, 8, and 16 concurrent streaming flows.

The 1 MB results hold no surprises: none of the five algorithms is a significant outlier, and average goodput declines as the RPCs compete with more and more streaming flows. Although not shown in Figure 18, the fourth algorithm (green) performs best when all flows are streaming flows, avoiding the large number of retransmissions that occur when the available bandwidth is shared with RPC calls.

The 10 KB results do contain a significant outlier: the third algorithm (yellow) performs markedly better, with nearly a 4x advantage. If we plot latency instead of bandwidth (a more relevant metric for small RPC messages), we find that the third algorithm also achieves the lowest latency, with its P99 and P999 latencies telling the same story.

Finally, all of the above experiments can be repeated on a network topology that includes wide-area RTTs. Fairness between flows and fairness across flow sizes remain concerns, but queuing delay now also becomes an issue. For example, Figure 20 shows the P99 latency of four different algorithms on a topology with a 10 Mbps bottleneck link and a 40 ms RTT. The important takeaway is that the second algorithm (red) performs poorly when the buffering available at the bottleneck router is less than one bandwidth-delay product, drawing attention to yet another variable that can affect results.
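The one-bandwidth-delay-product threshold mentioned above is simple arithmetic; for the topology in this experiment it works out to 50 KB, as the sketch below shows.

```python
def bandwidth_delay_product_bytes(rate_bps: float, rtt_s: float) -> float:
    """One BDP is the amount of data 'in flight' on a fully utilized path."""
    return rate_bps * rtt_s / 8.0

# The topology used in Figure 20: a 10 Mbps bottleneck and a 40 ms RTT.
bdp = bandwidth_delay_product_bytes(10e6, 0.040)
print(f"{bdp / 1000:.0f} KB")  # 50 KB: per the result above, bottleneck
                               # buffers smaller than this hurt one algorithm
```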

To conclude the discussion of experimental methodology: across a set of algorithms and a range of topology and traffic scenarios, no single algorithm is superior to all others under all conditions. These examples show just how many factors must be considered, which helps explain why congestion control has remained a topic of enduring interest to both network researchers and practitioners.

Hello, I am Yu Fan. I previously did R&D at Motorola and now do technical work at Mavenir. I maintain a strong interest in communications, networking, back-end architecture, cloud native, DevOps, CI/CD, blockchain, AI, and related technologies. I enjoy reading and thinking, and believe in continuous learning and lifelong growth. You are welcome to exchange ideas and learn together. WeChat official account: DeepNoMind

[1] TCP Congestion Control: A Systems Approach: https://tcpcc.systemsapproach.org/index.html

[2] The Design Philosophy of the DARPA Internet Protocols: https://dl.acm.org/doi/10.1145/52324.52336

[3] B4: Experience with a Globally-Deployed Software Defined WAN: https://cseweb.ucsd.edu/~vahdat/papers/b4-sigcomm13.pdf

[4] Fastpass: A Centralized “Zero-Queue” Datacenter Network: http://fastpass.mit.edu/Fastpass-SIGCOMM14-Perry.pdf

[5] Congestion Avoidance in Computer Networks with a Connectionless Network Layer: Concepts, Goals and Methodology: https://arxiv.org/pdf/cs/9809095.pdf

[6] A Quantitative Measure of Fairness and Discrimination for Resource Allocation in Shared Computer Systems: https://www.cse.wustl.edu/~jain/papers/ftp/fairness.pdf

[7] Beyond Jain’s Fairness Index: Setting the Bar for the Deployment of Congestion Control Algorithms: https://www.cs.cmu.edu/~rware/assets/pdf/ware-hotnets19.pdf

[8] Netesto: https://github.com/facebook/fbkutils/tree/master/netesto

[9] NS-3 Network Simulator: https://www.nsnam.org