Network transmission is essentially the sharing and reuse of network resources, which makes congestion control one of the core problems in network engineering. With the explosive growth of Internet and data center traffic, there have been many innovations in the related algorithms and mechanisms. This series is the Chinese translation of the free e-book TCP Congestion Control: A Systems Approach[1], which covers the concepts, principles, algorithms, and implementations of congestion control.

TCP congestion control details:

1. Overview

2. Background

3. Design space

4. Control algorithms

5. Avoidance Algorithm (this article)

6. Active queue management

7. Beyond TCP

A review of the academic literature on TCP congestion control shows a clear break between the TCP Tahoe and Reno mechanisms, introduced in 1988 and 1990 respectively, and the research activity that began in 1994. The main marker of that break was the introduction of an alternative called TCP Vegas, which triggered a stream of comparative studies and alternative designs that has lasted for more than 25 years.

L. Brakmo, S. O’Malley and L. Peterson TCP Vegas: New Technique for Congestion Detection and Avoidance[2]. ACM SIGCOMM ‘94 Symposium. August 1994. (Reprinted in IEEE/ACM Transactions on Networking, October 1995).

Although the methods described so far treat packet loss as the congestion signal and try to control congestion after it has occurred, TCP Vegas takes an avoidance-based approach: it watches for changes in throughput and adjusts the send rate before congestion becomes severe enough to cause packet loss. This chapter covers the general "Vegas strategy" and three different examples introduced over time. The culmination of this line of research is the BBR algorithm that Google now advocates.

The basic idea behind TCP Vegas is to adjust the send rate based on a comparison of the measured throughput with the expected throughput. This can be seen in the TCP Reno trace given in Figure 29. The top graph shows the connection's congestion window, giving the same information as the traces in the previous chapter. The middle and bottom graphs show new information: the middle graph shows the average send rate measured at the sender, while the bottom graph shows the average queue length measured at the bottleneck router; the three graphs are time-synchronized. Between 4.5 and 6.0 seconds (the shaded region), the congestion window increases (top graph). We would expect the observed throughput to increase as well, but it stays flat (middle graph), because throughput cannot exceed the available bandwidth. Beyond that point, any increase in window size only consumes more buffer space at the bottleneck router (bottom graph).

An interesting metaphor for the phenomenon shown in Figure 29 is driving on ice. The speedometer (congestion window) may say you are going 30 miles per hour, but by looking out the car window and seeing people walk past you (measured throughput), you know you are moving no faster than 5 miles per hour. In this analogy, the engine's useless spinning is like sending extra packets that just sit pointlessly in router buffers.

TCP Vegas uses this idea to measure and control the amount of extra data a connection keeps in the network, where "extra data" means data that would not have been transmitted if the sender could exactly match the bandwidth available to it. The goal of TCP Vegas is to maintain the "right" amount of extra data in the network. Obviously, if a sender puts in too much extra data, it causes long delays and can lead to congestion. Less obviously, if a connection keeps too little extra data in the network, it cannot respond quickly enough to transient increases in available bandwidth. TCP Vegas bases its congestion-avoidance actions on changes in the estimated amount of extra data in the network, not only on dropped packets. We now describe the algorithm in detail.

First, the BaseRTT of a given flow is defined as the RTT of a packet when the flow is not congested. In practice, TCP Vegas sets BaseRTT to the minimum of all measured round-trip times; it is usually the RTT of the first packet the connection sends, before router queues grow because of this flow's traffic. Assuming the connection is not overflowing the path, the expected throughput is

ExpectedRate = CongestionWindow / BaseRTT

where CongestionWindow is the TCP congestion window, which we assume (for the purposes of this discussion) to be equal to the number of bytes in transit.

Second, TCP Vegas calculates the current send rate, ActualRate. It does this by recording the send time of a distinguished packet, recording how many bytes are transmitted between the time that packet is sent and the time its ACK is received, computing the sample RTT for the distinguished packet when its ACK arrives, and finally dividing the number of bytes transmitted by the sample RTT. This calculation is done once per round trip.
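The once-per-RTT rate sample just described can be sketched in a few lines. This is an illustrative helper, not code from any real TCP stack; the function name and parameters are hypothetical.

```python
def actual_rate_sample(send_time, ack_time, bytes_sent):
    """One Vegas-style ActualRate sample per round trip.

    send_time: when the distinguished packet was sent (seconds)
    ack_time:  when its ACK arrived (seconds)
    bytes_sent: bytes transmitted between those two events
    """
    sample_rtt = ack_time - send_time   # RTT of the distinguished packet
    return bytes_sent / sample_rtt      # ActualRate in bytes per second

# e.g. 5000 bytes sent over a 250 ms round trip -> 20000 bytes/sec
rate = actual_rate_sample(0.0, 0.25, 5000)
```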

Third, TCP Vegas compares ActualRate with ExpectedRate and adjusts the window accordingly. Let Diff = ExpectedRate − ActualRate. Note that Diff is positive or zero by definition, since ActualRate > ExpectedRate can happen only when the measured sample RTT is smaller than BaseRTT; if that occurs, BaseRTT is updated to the latest sampled RTT. We also define two thresholds, α < β, corresponding respectively to too little and too much extra data in the network. When Diff < α, TCP Vegas increases the congestion window linearly during the next RTT; when Diff > β, TCP Vegas decreases the congestion window linearly during the next RTT; and when α < Diff < β, TCP Vegas leaves the congestion window unchanged.
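The three-way comparison above can be sketched as follows. This is a minimal illustration of the rule, under the simplifying assumption that ActualRate can be approximated as CongestionWindow divided by the sample RTT; all names are ours, not from a real implementation.

```python
def vegas_adjust(cwnd, base_rtt, sample_rtt, alpha, beta, mss):
    """Return the congestion window for the next RTT.

    cwnd in bytes, RTTs in seconds, alpha/beta in bytes/sec, mss in bytes.
    """
    expected_rate = cwnd / base_rtt      # best case: no queuing along the path
    actual_rate = cwnd / sample_rtt      # approximation of the measured rate
    diff = expected_rate - actual_rate   # >= 0 once base_rtt is the true minimum
    if diff < alpha:
        return cwnd + mss                # too little extra data: increase linearly
    if diff > beta:
        return cwnd - mss                # too much extra data: decrease linearly
    return cwnd                          # within [alpha, beta]: leave unchanged

# Sample RTT has doubled over BaseRTT, so Diff exceeds beta and cwnd shrinks:
new_cwnd = vegas_adjust(60000, 0.25, 0.5, 30000, 60000, 1460)
```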

Intuitively, the farther the actual throughput falls below the expected throughput, the more congestion there is in the network, which means the send rate should be reduced; the β threshold triggers this decrease. Conversely, when the actual throughput is very close to the expected throughput, the connection risks failing to use available bandwidth; the α threshold triggers an increase in the send rate. The overall goal is to keep the amount of extra data in the network between α and β.

Figure 30 shows the TCP Vegas congestion-avoidance algorithm in action. The top graph shows the congestion window, as in the other traces given in this chapter. The bottom graph shows the expected and actual throughput, which determine how the congestion window is set, and it best illustrates how the algorithm works. The colored line shows the ExpectedRate, while the black line shows the ActualRate. The shaded band marks the region between the two thresholds: its top is α KBps below ExpectedRate and its bottom is β KBps below ExpectedRate. The goal is to keep the ActualRate inside this band. Whenever the ActualRate falls below the band (i.e., gets too far from the ExpectedRate), TCP Vegas decreases the congestion window, because it fears that too many packets are being buffered in the network. Likewise, whenever the ActualRate rises above the band (i.e., gets too close to the ExpectedRate), TCP Vegas increases the congestion window, because it fears that the network is being underutilized.

Because the algorithm just described compares the difference between actual and expected throughput against the α and β thresholds, these two thresholds are defined in KBps. However, it may be more intuitive to think in terms of how many extra packet buffers the connection occupies in the network. For example, on a connection with a BaseRTT of 100 ms and a packet size of 1 KB, if α = 30 KBps and β = 60 KBps, then α says the connection needs to occupy at least 3 extra buffers in the network, and β says it should occupy no more than 6. This setting of α and β worked well in the real-world environments where Vegas was first deployed, but as described below, these parameters continue to be adjusted as the environment changes.
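The conversion in the example above is simple arithmetic: a rate difference sustained for one BaseRTT corresponds to that many extra bytes sitting in queues. A quick sketch (function name is ours):

```python
def extra_buffers(threshold_bps, base_rtt, packet_size):
    """Convert a rate threshold into extra packet buffers held in the network.

    threshold_bps in bytes/sec, base_rtt in seconds, packet_size in bytes.
    """
    # extra bytes queued = rate difference * BaseRTT; divide by packet size
    return threshold_bps * base_rtt / packet_size

# The example from the text: BaseRTT = 100 ms, 1 KB packets
alpha_buffers = extra_buffers(30_000, 0.1, 1_000)   # alpha = 30 KBps -> ~3 buffers
beta_buffers = extra_buffers(60_000, 0.1, 1_000)    # beta = 60 KBps -> ~6 buffers
```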

Finally, note that TCP Vegas decreases the congestion window linearly, which seems to conflict with the rule that multiplicative decrease is needed to ensure stability. The explanation is that TCP Vegas does use multiplicative decrease when a timeout occurs; the linear decrease just described is an early reduction of the congestion window, one that happens before congestion sets in and packets start being dropped.

As assumptions about networks have changed, TCP Vegas and Vegas-like avoidance methods have been adapted over time. Vegas was never as widely deployed as Reno, so these modifications were often driven more by laboratory research than by extensive real-world experience, but the refinements have contributed to our understanding of avoidance-based algorithms. We summarize some of those insights here, and in Chapter 7 we continue with the general topic of customizing congestion control algorithms for specific use cases.

The first Vegas-inspired mechanism was FAST TCP, which refined Vegas to make it more effective on high-speed networks with large bandwidth-delay products. The main idea is to increase the congestion window more aggressively during the stage when the algorithm is trying to find the available "in-flight" bandwidth (before packets start being buffered in the network), and then to behave conservatively once the algorithm begins competing with other flows for buffers at the bottleneck router. FAST also recommends setting the α value to about 30 packets.

Keeping the pipe full in networks with large bandwidth-delay products is a substantial challenge in its own right, and there are two more noteworthy things about FAST. First, whereas TCP Reno and TCP Vegas were based on intuition and a great deal of trial and error, FAST is grounded in optimization theory (which was later also used to explain why Vegas works). Second, unlike every other congestion control algorithm we know of, the FAST implementation was available only as a proprietary solution.

S. Low, L. Peterson, and L. Wang. Understanding TCP Vegas: A Duality Model[3]. Journal of the ACM, Volume 49, Issue 2, March 2002.

While Vegas's motivation was to detect and avoid congestion before packet loss occurs, TCP Westwood (TCPW) was motivated primarily by the realization that packet loss is not always a reliable indicator of congestion. This is especially evident on wireless links, which were new when Vegas was proposed but had become common by the time TCPW came along. Wireless links often lose packets due to uncorrected errors on the radio channel, which have nothing to do with congestion, so another method of detecting congestion is needed. Interestingly, the end result is somewhat similar to Vegas: TCPW also tries to determine the bottleneck bandwidth, by looking at the rate at which ACKs report successfully delivered packets.

When packet loss occurs, because it is not yet known whether the loss is due to congestion or to link errors, TCPW does not immediately cut the congestion window in half. Instead it estimates the rate at which traffic was flowing just before the loss, a gentler form of backoff than TCP Reno's. If the loss was congestion-related, the rate TCPW was achieving just before the loss should be close to a rate the network can sustain. If the loss was caused by a wireless error, TCPW does not reduce its send rate and soon begins increasing it again to make full use of the network. The result is a protocol that behaves much like Reno over wired links but gets a substantial performance boost on lossy links.
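The Westwood-style backoff can be sketched as follows: instead of halving the window on loss, fall back to the estimated bandwidth-delay product. This is an illustrative simplification under our own naming; a real implementation derives the bandwidth estimate by filtering the ACK arrival rate, which we take here as a given parameter.

```python
MSS = 1460  # bytes, a typical maximum segment size

def westwood_on_loss(cwnd, bw_estimate, min_rtt):
    """Window after a loss, Westwood-style.

    cwnd in bytes, bw_estimate in bytes/sec, min_rtt in seconds.
    """
    bdp = bw_estimate * min_rtt              # bytes the path itself can hold
    # Fall back to the BDP rather than cwnd/2, but never below two segments,
    # and never *increase* the window on a loss.
    return min(cwnd, max(bdp, 2 * MSS))

# Congestion loss: window well above the BDP falls back to it.
after_congestion = westwood_on_loss(200_000, 500_000, 0.25)
# Wireless loss: window already at or below the BDP is left alone.
after_radio_error = westwood_on_loss(100_000, 500_000, 0.25)
```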

Optimizing congestion control over wireless links remains a challenging problem; to complicate matters further, WiFi and mobile cellular networks have different characteristics. We will revisit this issue in Chapter 7.

The final example is New Vegas (NV), which adapts Vegas's delay-based approach to data center networks with link bandwidths of 10 Gbps or more and RTTs typically in the tens of microseconds. This is an important use case that we will cover in Chapter 7; for now the goal is to build some intuition.

To understand the basic idea of NV, suppose we plot Rate versus CongestionWindow for each packet whose ACK is received. For simplicity, Rate is just the ratio of CongestionWindow (in bytes) to the RTT of the ACKed packet (in seconds). Note that we use CongestionWindow in this discussion for simplicity; NV actually uses bytes in transit (sent but not yet ACKed). As Figure 31 shows, we draw vertical bars rather than points because transient congestion or measurement noise produces a range of Rate values for each CongestionWindow value.

The tops of the bars trace out the best performance achieved in the past. In a well-tuned system, the tops of the bars are bounded by a straight line through the origin: as long as the network is not congested, doubling the amount of data sent per RTT doubles the rate.

New measurements of Rate and CongestionWindow can fall near this boundary line (the black diamond in the figure) or below it (the blue diamond in the figure). A measurement above the line causes NV to update the line automatically, increasing its slope so that the measurement falls on the new line. If a new measurement is close to the line, NV increases the CongestionWindow. If a measurement falls below the line, it means performance is no better than what a lower congestion window achieved; in the example shown in Figure 31, we see performance similar to CongestionWindow = 12, so the CongestionWindow is reduced. Because new measurements can be noisy, this reduction is gradual rather than immediate, and to filter out bad measurements NV collects many of them and uses the best before making a congestion decision.
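The boundary-line test above can be sketched as a slope comparison: the line through the origin is fully described by its slope, so a measurement is classified by comparing rate/cwnd against the best slope seen so far. The function, its return values, and the 10% "close to the line" band are all illustrative assumptions, not NV's actual parameters.

```python
def nv_update(best_slope, cwnd, rate, band=0.9):
    """Classify one (cwnd, rate) measurement against the boundary line.

    Returns (new_best_slope, action), where action is one of
    'raise-line', 'increase', or 'decrease'.
    """
    slope = rate / cwnd
    if slope > best_slope:
        return slope, "raise-line"      # above the line: steepen the line
    if slope >= band * best_slope:
        return best_slope, "increase"   # near the line: grow the window
    return best_slope, "decrease"       # well below: a smaller window did as well

# A point well under the line (like the blue diamond) triggers a decrease:
slope, action = nv_update(1.0, 20, 12.0)
```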

BBR (Bottleneck Bandwidth and RTT) is a newer TCP congestion control algorithm developed by Google researchers. Like Vegas, BBR is delay-based, meaning it tries to detect buffer growth so as to avoid congestion and packet loss. Both BBR and Vegas use the minimum RTT and the observed bottleneck bandwidth, computed over some time interval, as their main control signals.

Figure 32 conveys the basic idea behind BBR. Suppose a network has some amount of available bandwidth and queue capacity at the bottleneck link. As the congestion window opens, more data is sent; initially, because the bottleneck is not yet full, latency does not increase and throughput rises (bottom graph). Then, once the send rate reaches the bottleneck bandwidth, the queue begins to fill. At this point the RTT rises but no further throughput gain is observed; this is the onset of congestion. The figure is in effect a simplified version of what Figure 29 shows in the 4.5-to-6.0-second interval.

Like Vegas, BBR aims to pinpoint the exact moment the queue starts to fill, rather than continuing until the buffer overflows and packets are lost, as Reno does. Much of the work on BBR revolves around improving the sensitivity of the mechanism that locates this optimal point, and there are many challenges: bandwidth and delay measurements are noisy; network conditions are dynamic; and achieving fairness when BBR and non-BBR flows compete for bandwidth is a long-term pursuit.

A notable feature of BBR compared to other approaches is that it does not rely solely on CongestionWindow to determine how much data to put in flight. In particular, BBR also tries to smooth the rate at which the sender feeds data into the network, to avoid the bursts that lead to excess queuing. Ideally, we want to send data at exactly the bottleneck rate, achieving the highest possible throughput without building up a queue. Whereas most TCP variants use ACK arrivals to clock the sending of data, keeping the amount of unacknowledged data in flight constant, BBR builds an estimate of the bottleneck bandwidth and uses a local pacing algorithm to send at that rate. ACKs still play an important role in updating knowledge of the network's state, but they are not used directly to clock transmissions, so a delayed ACK does not cause a sudden burst of sending. Of course, CongestionWindow is still used to ensure that enough data is sent to keep the pipe full, and that the amount of data in flight does not greatly exceed the bandwidth-delay product, thus avoiding queue overflow.
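The two controls described above, a paced send rate tied to the bandwidth estimate and a window cap tied to the bandwidth-delay product, can be sketched as follows. The gain values here are illustrative placeholders, not BBR's actual parameters.

```python
def bbr_controls(btl_bw, min_rtt, pacing_gain=1.0, cwnd_gain=2.0):
    """Derive BBR-style pacing rate and window cap from the two estimates.

    btl_bw in bytes/sec (bottleneck bandwidth estimate), min_rtt in seconds.
    """
    bdp = btl_bw * min_rtt               # bytes the pipe itself can hold
    pacing_rate = pacing_gain * btl_bw   # rate at which data is fed in
    cwnd = cwnd_gain * bdp               # cap on unacknowledged data in flight
    return pacing_rate, cwnd

# 1 MB/s bottleneck, 250 ms minimum RTT -> BDP of 250 KB
pacing_rate, cwnd = bbr_controls(1_000_000, 0.25)
```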

To maintain an up-to-date view of the current RTT and bottleneck bandwidth, it is necessary to probe continuously above and below the current bottleneck-bandwidth estimate. More bandwidth may become available because competing traffic decreases, because link properties change (as on wireless links), or because routing changes. If the path changes, the RTT may change too. Detecting a drop in RTT requires sending less traffic, so that the queue drains; detecting an increase in available bandwidth requires sending more. BBR therefore probes both above and below its current bottleneck-bandwidth estimate, updates the estimate when warranted, and adjusts the send rate and CongestionWindow accordingly.

The state diagram of Figure 33 shows this process of sequentially probing for available bandwidth and for the minimum RTT. After an aggressive startup phase that tries to establish the available bandwidth on the path, the send rate is reduced to drain the queue, and the algorithm then enters the inner loop of the diagram, where it periodically checks whether a lower send rate yields lower latency or a higher send rate yields higher throughput. On a relatively long timescale (several seconds), the algorithm enters the ProbeRTT state, cutting its send rate by a factor of two to drain the queue completely and test for a lower RTT.
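The probing cycle can be caricatured as a tiny state machine. The state names follow common descriptions of BBR, but the transition logic and the 10-second ProbeRTT interval here are simplified assumptions, not the actual algorithm.

```python
def next_state(state, elapsed_in_state, probe_rtt_interval=10.0):
    """Simplified BBR-style state transitions; times are in seconds."""
    if state == "STARTUP":
        return "DRAIN"            # bandwidth found: drain the queue built up
    if state == "DRAIN":
        return "PROBE_BW"         # queue empty: enter the steady probing loop
    if state == "PROBE_BW" and elapsed_in_state >= probe_rtt_interval:
        return "PROBE_RTT"        # periodically slow way down to re-measure RTT
    if state == "PROBE_RTT":
        return "PROBE_BW"         # RTT sampled: resume bandwidth probing
    return state                  # otherwise stay in PROBE_BW

# After startup and drain, the flow cycles between PROBE_BW and PROBE_RTT.
state = next_state(next_state("STARTUP", 0), 0)
```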

What is interesting about this approach is that when a large flow sharply reduces its send rate in the ProbeRTT state, its contribution to queuing delay drops, so other flows simultaneously see a new, lower RTT and update their estimates. The flows thus tend to synchronize their RTT estimates at moments when the queue is actually empty or nearly empty, which improves the accuracy of those estimates.

BBR is under active development and evolving rapidly; version 2 is the latest at the time of writing. One major focus is fairness: for example, some early experiments showed that CUBIC flows obtained 100 times less bandwidth when competing with BBR flows, and other experiments showed unfairness among BBR flows themselves. BBR version 1 was also insensitive to packet loss, which could lead to high loss rates, especially when the path has relatively little buffering. Several implementations of BBR are now being trialed in different environments, including Google's internal backbone and the wider Internet, and the experience gathered is being used to further refine the design. The IETF's Congestion Control Working Group is hosting discussions of the ongoing design and experiments.

Further reading: N. Cardwell, Y. Cheng, C. S. Gunn, S. Yeganeh, V. Jacobson. BBR: Congestion-based Congestion Control[4]. Communications of the ACM, Volume 60, Issue 2, February 2017.

Hello, I am Yu Fan. I previously did R&D at Motorola and now do technical work at Mavenir. I have always had a strong interest in communications, networks, back-end architectures, cloud native, DevOps, CI/CD, blockchain, AI, and related technologies. I enjoy reading and thinking, and I believe in continuous learning and lifelong growth. You are welcome to get in touch and learn together. WeChat public account: DeepNoMind

TCP Congestion Control: A Systems Approach:

TCP Vegas: New Technique for Congestion Detection and Avoidance:

Understanding TCP Vegas: A Duality Model:

BBR: Congestion-based Congestion Control: