Network transmission is fundamentally about sharing and reusing network resources, which makes congestion control one of the core problems in network engineering. With the explosive growth of Internet and data-center traffic, the related algorithms and mechanisms have seen many innovations. This series is the Chinese translation of the free e-book TCP Congestion Control: A Systems Approach[1], which covers the concepts, principles, algorithms, and implementations of congestion control.

TCP congestion control details:

1. Overview

2. Background (this article)

3. Design space

4. Control algorithms

5. Avoidance algorithms

6. Active queue management

7. Beyond TCP

To understand how congestion is dealt with on the Internet, it helps to first discuss the assumptions and design decisions built into the Internet architecture; that is the subject of this chapter. Along the way we provide enough detail about the TCP/IP protocol stack to make sense of the congestion control mechanisms described in later chapters. For a more complete introduction to the TCP/IP stack, we recommend the following resource.

Computer Networks: A Systems Approach[2], 2020.

The Internet supports a connectionless, best-effort packet delivery service model, defined by IP and implemented by switches and routers. Connectionless means that each IP packet carries enough information for the network to forward it to the correct destination; there is no setup mechanism to tell the network what to do when a packet arrives. Best-effort means that if something goes wrong along the way and packets are lost, corrupted, or delivered to the wrong destination, the network does not attempt to recover from the failure; recovery is the responsibility of the higher-level protocols running on the end hosts. The network is deliberately designed this way to keep routers as simple as possible, which is generally considered consistent with the end-to-end argument articulated by Saltzer, Reed, and Clark.

J. Saltzer, D. Reed, and D. Clark. End-to-End Arguments in System Design[3]. ACM Transactions on Computer Systems, Nov. 1984.

A consequence of this design is that a given source may have enough capacity to send traffic into the network at some rate, but somewhere in the middle of the network many different sources may need to share the same link, so packets can run into a bottleneck. Figure 3 shows a typical example: two high-speed links feed into a router, which in turn sends outgoing traffic onto a low-speed link. A router can buffer packets for a while, but if the overload persists, the queue grows and eventually (because buffer space is finite) overflows, resulting in packet loss. This condition, where the offered load exceeds the capacity of a link, is the definition of congestion.

It is important to note that avoiding congestion is not a problem routing alone can solve. While a routing protocol can indeed assign a large "cost" to a congested link and steer traffic away from it, this does not solve the overall problem. To see why, we need only look at the simple network of Figure 3, where all traffic must pass through the same router to reach its destination. Although this is an extreme example, there is usually at least one link that cannot be routed around. That link, and the router that feeds packets into it, can become congested, and there is nothing the routing mechanism can do about it.

Because the Internet uses a connectionless model, any connection-oriented service is implemented by an end-to-end transport protocol, such as TCP, running on the end hosts. The network itself has no connection setup mechanism (in contrast to virtual-circuit networks), so routers have no means of preallocating buffer space or link bandwidth to active connections.

The lack of an explicit connection setup mechanism does not mean routers are completely unaware of end-to-end connections. IP packets are switched independently, but it is typical for a given pair of hosts to exchange many packets in succession, for example when a client downloads a large video file from a server. Moreover, the traffic between a given pair of hosts is usually forwarded through a consistent set of routers. This observation introduces the important abstraction of a flow: a sequence of packets sent between a source/destination pair that follows the same route through the network. We will use this abstraction several times in later chapters.

The power of the flow abstraction is that flows can be defined at different granularities. For example, a flow can be host-to-host (i.e., the same source/destination IP addresses) or process-to-process (i.e., the same source/destination host/port pairs). Figure 4 illustrates several flows passing through a series of routers.
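
To make the granularity point concrete, here is a minimal sketch (not from the book; the Packet type and field names are assumptions chosen for illustration) showing how a router might compute a flow key at either granularity and keep per-flow state in a table:

```python
from dataclasses import dataclass

@dataclass
class Packet:
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    payload: bytes = b""

def host_to_host_flow(pkt: Packet):
    # Coarse granularity: all traffic between the same pair of hosts
    # belongs to a single flow.
    return (pkt.src_ip, pkt.dst_ip)

def process_to_process_flow(pkt: Packet):
    # Finer granularity: one flow per source/destination host/port pair,
    # i.e., per transport-level connection.
    return (pkt.src_ip, pkt.src_port, pkt.dst_ip, pkt.dst_port)

# Per-flow (soft) state can then be kept in a table keyed by the flow key.
flow_state = {}
pkt = Packet("10.0.0.1", "10.0.0.2", 40000, 80)
state = flow_state.setdefault(process_to_process_flow(pkt), {"packets_seen": 0})
state["packets_seen"] += 1
```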

Because many flows pass through each router, it is sometimes useful for a router to maintain per-flow state that can be used when making resource allocation decisions about the packets of that flow. This state is called soft state; the main difference between soft state and hard state is that soft state is not explicitly created and deleted by signaling. Soft state represents a middle ground between a purely connectionless network that keeps no state at the routers and a purely connection-oriented network that keeps hard state at the routers. In general, the correct operation of the network does not depend on soft state being present (each packet is still routed correctly), but a packet can be handled better when it happens to belong to a flow for which the router currently maintains soft state.

Quality-of-Service

With best-effort service, all packets are treated essentially alike, and end hosts have no opportunity to ask the network to give certain packets or flows a quality guarantee or preferential service. Defining a service model that supports some kind of preferential or guaranteed service (for example, guaranteeing the bandwidth a video stream needs) leads to an architecture that supports multiple qualities of service (QoS).

There is actually a spectrum of possibilities between a purely best-effort service model and one in which every flow can obtain QoS guarantees. Some extensions to the Internet service model define additional service classes, but (1) they are not widely deployed across the Internet, and (2) even where they are deployed, they still allow best-effort traffic, which relies on the congestion control algorithms described in this book.

For completeness, Figure 5 shows the IPv4 packet format; the field relevant to our discussion is the 8-bit TOS (Type of Service) field. Over the years this field has been interpreted in different ways, but its basic purpose is to allow packets to be treated differently according to application needs. In later chapters we will see how various congestion control mechanisms have applied different meanings to the TOS field over time.

Each router implements some queuing discipline that governs how packets are buffered while waiting to be transmitted. The queuing algorithm can be thought of as allocating both bandwidth (which packets get transmitted) and buffer space (which packets get dropped), and it also directly affects packet latency by determining how long packets wait to be transmitted.

The most common queuing algorithm is First-In/First-Out (FIFO): the first packet to arrive at the router is the first packet to be transmitted. Figure 6(a) shows a FIFO queue with slots for up to 8 packets. Packets are added at the tail as they arrive and transmitted from the head, so FIFO ordering is preserved.

Given that the router's buffer space is finite, if a packet arrives and the queue (buffer space) is full, the router drops the packet, as shown in Figure 6(b). This is done without regard to which flow the packet belongs to or how important it is. Because packets arriving at the tail of the FIFO are dropped when the queue is full, this is sometimes called tail drop.

Note that tail drop and FIFO are two separable mechanisms. FIFO is a scheduling discipline that determines the order in which packets are transmitted; tail drop is a drop policy that determines which packets get dropped. Because FIFO and tail drop are the simplest instances of a scheduling discipline and a drop policy, respectively, they are sometimes viewed as the default packet queue implementation. Chapter 6 discusses drop policies that are more sophisticated than simply asking "Is there a free buffer?", which can be used with FIFO, as well as more sophisticated scheduling disciplines.
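
As a toy illustration of how the two mechanisms compose, the following sketch (a deliberately simplified assumption, not a real router implementation) keeps the FIFO scheduling discipline and the tail-drop policy as separate decisions:

```python
from collections import deque

class FifoTailDropQueue:
    def __init__(self, capacity=8):          # 8 slots, as in Figure 6(a)
        self.capacity = capacity
        self.queue = deque()

    def enqueue(self, packet) -> bool:
        # Drop policy: tail drop. If the buffer is full, the arriving packet
        # is discarded regardless of which flow it belongs to.
        if len(self.queue) >= self.capacity:
            return False                      # packet dropped
        self.queue.append(packet)             # added at the tail
        return True

    def dequeue(self):
        # Scheduling discipline: FIFO. Transmit from the head,
        # preserving arrival order.
        return self.queue.popleft() if self.queue else None
```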

Fair Queuing

Fair Queuing (FQ) is an alternative to FIFO queuing and is often used to implement QoS guarantees. The idea of FQ is to maintain a separate queue for each flow the router is currently handling (at some chosen flow granularity). The router then services these queues in round-robin order (in the simplest version of FQ). If the router is congested because of traffic from multiple flows, FQ ensures that no single flow can monopolize the outgoing link; each flow gets its share. In this way, a given source cannot arbitrarily increase its share of the network's capacity at the expense of other flows.
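
The simplest round-robin version of FQ described above might look like the sketch below (a per-packet round robin; real FQ must also account for variable packet sizes, which this toy version ignores):

```python
from collections import deque

class RoundRobinFQ:
    """Simplest form of fair queuing: one queue per flow, served round-robin."""
    def __init__(self):
        self.queues = {}          # flow key -> deque of packets
        self.order = []           # round-robin visit order
        self.next_idx = 0         # index of the next flow to serve

    def enqueue(self, flow_key, packet):
        if flow_key not in self.queues:
            self.queues[flow_key] = deque()
            self.order.append(flow_key)
        self.queues[flow_key].append(packet)

    def dequeue(self):
        # Starting from the flow after the one served last time, transmit one
        # packet from the first non-empty queue; no single flow can monopolize
        # the link while others have packets waiting.
        for i in range(len(self.order)):
            idx = (self.next_idx + i) % len(self.order)
            q = self.queues[self.order[idx]]
            if q:
                self.next_idx = (idx + 1) % len(self.order)
                return q.popleft()
        return None
```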

FQ can be used in conjunction with end-to-end congestion control mechanisms. It simply segregates traffic so that ill-behaved sources do not interfere with sources that faithfully implement the end-to-end algorithm. FQ also enforces fairness among collections of flows that are managed by well-behaved congestion control algorithms.

TCP provides a reliable byte stream (running between a pair of processes on end hosts) on top of the best-effort service model provided by IP. This section describes TCP in enough detail to understand the congestion control mechanisms described in the following sections.

At the heart of TCP is the sliding window algorithm, which, in addition to the familiar acknowledgement/timeout/retransmission mechanism, must also solve the following complex problems.

First, because TCP supports logical connections between processes running on any two computers on the Internet, it needs an explicit connection establishment mechanism, during which the two parties agree to exchange data with each other. One of the things that happens during connection setup is that the two parties establish some shared state needed to start the sliding window algorithm. When it is time to disconnect, each host knows this state can be released.

Second, the round-trip time (RTT) of TCP connections varies enormously. For example, a TCP connection between San Francisco and Boston, thousands of kilometers apart, might have an RTT of 100 ms, while a connection between two hosts in the same room might have an RTT of only 1 ms, and the same TCP protocol must support both. To make matters worse, the San Francisco-to-Boston connection might see an RTT of 100 ms at 3 a.m. but 500 ms at 3 p.m.; the RTT can even vary within a single TCP connection that lasts only a few minutes. This means that the timeout mechanism that triggers retransmissions in the sliding window algorithm must be adaptive.
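
One well-known way to make the retransmission timeout adaptive is the Jacobson/Karels-style calculation sketched below; the gains and the factor of 4 are the commonly used values, not something specified in this chapter:

```python
class RttEstimator:
    """Adaptive retransmission timeout based on smoothed RTT and deviation."""
    def __init__(self, alpha=0.125, beta=0.25):
        self.alpha = alpha            # gain for the RTT average
        self.beta = beta              # gain for the RTT deviation
        self.srtt = None              # smoothed RTT estimate (seconds)
        self.rttvar = 0.0             # smoothed deviation

    def on_rtt_sample(self, sample):
        if self.srtt is None:
            self.srtt, self.rttvar = sample, sample / 2
        else:
            self.rttvar = (1 - self.beta) * self.rttvar + self.beta * abs(sample - self.srtt)
            self.srtt = (1 - self.alpha) * self.srtt + self.alpha * sample

    def timeout(self):
        # Retransmission timeout: the estimate plus a multiple of the deviation,
        # so both a 1 ms LAN connection and a 500 ms cross-country connection
        # get a sensible value.
        return self.srtt + 4 * self.rttvar
```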

Third, because of the best-effort nature of the Internet, packets may be reordered in transit. Since the sliding window algorithm can reorder packets by sequence number, slightly out-of-order packets cause no problems. The real question is how far out of order packets can get, or put another way, how late a packet can arrive at its destination. In the worst case, packets can be delayed almost arbitrarily on the Internet: each time a router forwards an IP packet, it decrements the TTL (Time to Live) field, and when the TTL reaches zero the packet is dropped (so there is no danger of an infinitely delayed arrival). Because TTL is really a hop count rather than a time, and is therefore misleadingly named, it has been renamed Hop Count in IPv6. Knowing that IP drops packets once the TTL expires, TCP assumes each packet has a maximum lifetime, called the Maximum Segment Lifetime (MSL); this is an engineering choice, and the currently recommended value is 120 seconds. Keep in mind that IP does not directly enforce a 120-second limit; it is simply a conservative estimate by TCP of how long a packet can survive on the Internet. The implication is that TCP must be prepared for very old packets to suddenly show up at the receiver, where they could confuse the sliding window algorithm.

Fourth, because almost any kind of computer can connect to the Internet, and because any one host may support hundreds of TCP connections at the same time, the amount of resources dedicated to any given TCP connection is highly variable. This means TCP must include a mechanism by which each side can learn how many resources the other side can apply to the connection (for example, how much buffer space); this is the flow control problem.

Fifth, the sender of a TCP connection does not know which links its packets will traverse to reach the destination. For example, the sender might be attached directly to a relatively fast Ethernet capable of sending data at 10 Gbps, but somewhere in the middle of the network the packets must cross a 1.5 Mbps link. Worse, data from many different sources might try to traverse that same slow link; if enough traffic converges, even a high-speed link can become congested. This is the root cause of congestion, which we discuss in a later section.

TCP is a byte-oriented protocol, meaning that the sender writes bytes into the TCP connection and the receiver reads bytes out of it. Although "byte stream" describes the service TCP offers to application processes, TCP itself does not transmit individual bytes over the Internet. Instead, TCP on the source host collects enough bytes from the sending process to fill a reasonably sized packet and then sends it to its peer on the destination host. TCP on the destination host then places the contents of the packet into a receive buffer, from which the receiving process reads at its leisure. This is illustrated in Figure 7, which for simplicity shows data flowing in only one direction.

The packets exchanged between TCP peers in Figure 7 are called segments, since each one carries a segment of the byte stream. Each TCP segment contains the header shown in Figure 8. The fields relevant to our discussion are described below.

The SrcPort and DstPort fields identify the source and destination ports, respectively. These two fields, together with the source and destination IP addresses, uniquely identify each TCP connection. All of the state needed to manage a TCP connection, including the congestion-related state described in later sections, is bound to the 4-tuple (SrcPort, SrcIPAddr, DstPort, DstIPAddr).

TCP's sliding window algorithm involves the Acknowledgment, SequenceNum, and AdvertisedWindow fields. Because TCP is a byte-oriented protocol, each byte of data has a sequence number. The SequenceNum field carries the sequence number of the first byte of data in the segment, while the Acknowledgment and AdvertisedWindow fields carry information about the flow of data in the other direction. To simplify the discussion, we ignore the fact that data can flow in both directions and concentrate on data carrying a particular SequenceNum flowing in one direction, with Acknowledgment and AdvertisedWindow flowing in the opposite direction, as shown in Figure 9.

The 6-bit Flags field carries control information between TCP peers, including the SYN and FIN flags, used when establishing and terminating a connection, and the ACK flag, which is set whenever the Acknowledgment field is valid (implying that the receiver should pay attention to it).

Finally, the TCP header is of variable length (options can be attached after the mandatory fields), so it includes a HdrLen field that gives the length of the header in 32-bit words. This field matters when TCP extensions are appended to the end of the header (several of which support congestion control, for example). The significance of adding these extensions as options, rather than changing the core TCP header, is that hosts can still communicate using TCP even if they do not implement the options, while hosts that do implement them can negotiate their use during TCP's connection establishment phase.

TCP's variant of the sliding window algorithm serves two main purposes: (1) it guarantees reliable, in-order delivery of data, and (2) it enforces flow control between sender and receiver. For flow control, the receiver chooses a sliding window size and advertises it to the sender via the AdvertisedWindow field in the TCP header. The sender is then limited to having no more than AdvertisedWindow bytes of unacknowledged data outstanding at any given time. The receiver selects a suitable value for AdvertisedWindow based on the amount of memory allocated to the connection for buffering data; the goal is to keep the sender from overrunning the receiver's buffer.

To understand how the TCP sliding window works, consider the situation shown in Figure 10. TCP maintains a send buffer on the sending side that holds data that has been sent but not yet acknowledged, as well as data that has been written by the sending application but not yet transmitted. On the receiving side, TCP maintains a receive buffer that holds data that has arrived out of order, as well as data that is in order (i.e., no bytes are missing in between) but that the application process has not yet had a chance to read.

To simplify the following discussion, we ignore the fact that both the buffers and the sequence numbers are of finite size and therefore eventually wrap around. Likewise, we do not distinguish between a pointer into the buffer where a particular byte is stored and the sequence number of that byte.

First, looking at the sending side, the send buffer maintains three pointers, each with an obvious meaning: LastByteAcked, LastByteSent, and LastByteWritten. Because the receiver cannot acknowledge a byte that has not yet been sent, and TCP cannot send a byte that the application process has not yet written, the following clearly holds:

LastByteAcked <= LastByteSent <= LastByteWritten

The receiving side maintains a similar set of pointers (sequence numbers): LastByteRead, NextByteExpected, and LastByteRcvd. The inequalities are a little less intuitive, however, because of out-of-order delivery. A byte cannot be read by the application until it, and all the bytes before it, have been received, so in this case:

LastByteRead < NextByteExpected <= LastByteRcvd + 1

If data arrives in order, NextByteExpected points to the byte just after LastByteRcvd, whereas if data arrives out of order, NextByteExpected points to the start of the first gap in the data, as shown in Figure 10.

So far the discussion has assumed that the receiver can keep pace with the sender, but this is not always the case. Both the sender and the receiver have buffers of finite size, so the receiver needs some way to slow the sender down; this is the essence of flow control.

Although we have pointed out that flow control and congestion control are different problems, it is important to understand how flow control works first, because the window mechanism used to implement flow control also plays a central role in congestion control. The window gives the sender a clear indication of how much data can be "in flight" (sent but not yet acknowledged), which is crucial to both problems.

Let us now reconsider the fact that both buffers are of some finite size, denoted SendBufferSize and RcvBufferSize, respectively. The receiver throttles the sender by advertising a window no larger than the amount of data it can buffer. Observe that the receiving TCP must ensure that

LastByteRcvd - LastByteRead <= RcvBufferSize

to avoid overflowing its buffer. It therefore advertises a window of size

AdvertisedWindow = RcvBufferSize - ((NextByteExpected - 1) - LastByteRead)

which represents the amount of free space remaining in its buffer. As data arrives, the receiver acknowledges it as long as all the preceding bytes have also arrived. In addition, LastByteRcvd moves to the right (is incremented), which means the advertised window potentially shrinks; whether it shrinks depends on how fast the local application process consumes data. If the local process reads data just as fast as it arrives (causing LastByteRead to be incremented at the same rate as LastByteRcvd), the advertised window stays fully open (that is, AdvertisedWindow = RcvBufferSize). If, however, the receiving process falls behind (perhaps because it performs a very expensive operation on each byte of data it reads), the advertised window grows smaller with every arriving segment, until it eventually reaches 0.

The sending TCP must then adhere to the advertised window it receives; that is, at any given time it must ensure that

LastByteSent - LastByteAcked <= AdvertisedWindow

In other words, the sender computes an effective window that limits how much data it can send:

EffectiveWindow = AdvertisedWindow - (LastByteSent - LastByteAcked)

Clearly, EffectiveWindow must be greater than 0 before the source can send more data. It is therefore possible for a segment to arrive acknowledging x bytes, allowing the sender to increment LastByteAcked by x, only to find that because the receiving process has not been reading any data, the advertised window is now x bytes smaller than before. In such a situation, the sender can free buffer space but cannot send any more data.

Throughout this process, the sender must also make sure that the local application process does not overflow the send buffer; that is,

LastByteWritten - LastByteAcked <= SendBufferSize

If the sending process tries to write b bytes to TCP, but

(LastByteWritten - LastByteAcked) + b > SendBufferSize

then TCP blocks the sending process and does not allow it to generate more data.

It should now be clear how a slow receiving process ultimately stalls a fast sending process. First, the receive buffer fills up, which means the advertised window shrinks to 0. An advertised window of 0 means the sender cannot transmit any data, even though the data it sent earlier has been successfully acknowledged. Finally, not being able to transmit any data means the send buffer fills up, which ultimately causes TCP to block the sending process. As soon as the receiving process starts reading data again, the receiving TCP can open its window, allowing the sending TCP to transmit data out of its buffer. When that data is eventually acknowledged, LastByteAcked is incremented, buffer space becomes free, and the sending process is unblocked and allowed to proceed.
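
The pointer arithmetic above can be collected into one small sketch (variable names mirror the text; sequence number wrap-around and many real TCP details are deliberately ignored):

```python
class ReceiverState:
    def __init__(self, rcv_buffer_size):
        self.rcv_buffer_size = rcv_buffer_size
        self.last_byte_read = 0
        self.next_byte_expected = 1
        self.last_byte_rcvd = 0

    def advertised_window(self):
        # Free space left in the receive buffer:
        # RcvBufferSize - ((NextByteExpected - 1) - LastByteRead)
        return self.rcv_buffer_size - ((self.next_byte_expected - 1) - self.last_byte_read)


class SenderState:
    def __init__(self, send_buffer_size):
        self.send_buffer_size = send_buffer_size
        self.last_byte_acked = 0
        self.last_byte_sent = 0
        self.last_byte_written = 0
        self.advertised_window = 0    # latest value heard from the receiver

    def effective_window(self):
        # How much new data the sender may put in flight right now.
        return self.advertised_window - (self.last_byte_sent - self.last_byte_acked)

    def can_write(self, b):
        # The application is blocked if writing b more bytes would
        # overflow the send buffer.
        return (self.last_byte_written - self.last_byte_acked) + b <= self.send_buffer_size
```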

One remaining detail must be resolved: how does the sender learn that the advertised window is no longer 0? TCP always sends a segment in response to a received data segment, and this response contains the latest values of the Acknowledgment and AdvertisedWindow fields, even if those values have not changed since the last time they were sent. The problem is this: once the receiver has advertised a window of 0, the sender is not permitted to send any more data, which means it has no way of discovering that the advertised window is no longer 0 at some time in the future. TCP on the receiving side does not spontaneously send non-data segments; it sends them only in response to arriving data.

TCP deals with this situation as follows. Whenever the other side advertises a window size of 0, the sender persists in sending a segment with 1 byte of data every so often. It knows this data will probably not be accepted, but it tries anyway, because each of these 1-byte segments triggers a response that contains the current advertised window, which will eventually be nonzero. These 1-byte messages are called Zero Window Probes, and in practice they are sent every 5 to 60 seconds.
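
A minimal sketch of this persist behaviour, with the send and window-query operations passed in as placeholder callables (hypothetical names, for illustration only):

```python
import time

def zero_window_persist(get_advertised_window, send_one_byte_probe, probe_interval=5.0):
    """While the peer advertises a zero window, periodically send a 1-byte
    Zero Window Probe; each probe elicits an ACK carrying the peer's current
    advertised window, so the sender eventually sees it reopen."""
    while get_advertised_window() == 0:
        send_one_byte_probe()
        time.sleep(probe_interval)   # real TCPs probe roughly every 5 to 60 seconds
```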

Next, we consider how TCP decides when to transmit a segment, which turns out to be a surprisingly subtle issue. If we ignore flow control and assume the window is wide open, TCP has three mechanisms that trigger the transmission of a segment: (1) TCP maintains a variable, typically called the maximum segment size (MSS), and sends a segment as soon as it has collected MSS bytes from the sending process; (2) the sending process explicitly asks TCP to send (TCP supports a push operation that flushes the buffer of unsent bytes); (3) a timer fires, and the resulting segment contains as many bytes as are currently buffered for transmission.

Of course, we cannot ignore flow control. If the sender has MSS bytes of data to send and the window is open at least that wide, the sender transmits a full segment. Suppose, however, that the sender is accumulating bytes to send but the window is currently closed. Now suppose an ACK arrives that opens the window just enough for the sender to transmit, say, MSS/2 bytes. Should the sender send a half-full segment, or wait for the window to open to a full MSS?

The original specification was silent on this point, and early implementations of TCP chose to transmit the half-full segment. But it turns out that this strategy of aggressively using any available window leads to a situation now known as silly window syndrome, in which small segments, once introduced, cannot be coalesced back into full-sized segments. This led to a more subtle decision process known as the Nagle algorithm, which also became a central part of the strategy used by the congestion control mechanisms described in later chapters.

The central question Nagle answers is: how long should the sender wait when the effective window is smaller than the MSS? If it waits too long, it hurts interactive applications. If it does not wait long enough, it risks sending a stream of tiny packets and falling into silly window syndrome.

Although TCP could use a clock-based timer (for example, one that fires every 100 ms), Nagle introduced an elegant self-clocking solution. The idea is that as long as TCP has any data in flight, the sender will eventually receive an ACK, and this ACK can be treated as a timer firing, triggering the transmission of more data. The Nagle algorithm provides a simple, uniform rule for deciding when to transmit: when the application produces data to send, if both the available data and the window are at least as large as the MSS, send a full segment; otherwise, if there is unacknowledged data in flight, buffer the new data until an ACK arrives; otherwise, send all the new data now.

In other words, it is always OK to send a full segment if the window allows. It is also OK to immediately send a small amount of data if there are currently no segments in transit, but if there is anything in flight, the sender must wait for an ACK before transmitting the next segment. Thus, an interactive application that continuously writes one byte at a time will send data at a rate of at most one segment per RTT. Some segments will contain a single byte, while others will contain as many bytes as the user was able to type in one round-trip time. Because some applications cannot afford this delay on every write to a TCP connection, the socket interface allows an application to set the TCP_NODELAY option, meaning data is transmitted as soon as possible.
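
Putting the rule into code, a sketch of the per-write decision might look like this (the parameters are simplifications; a real implementation tracks this state inside the connection):

```python
def nagle_should_send_now(pending_bytes, effective_window, mss,
                          bytes_in_flight, nodelay=False):
    """Return True if TCP should transmit a segment now, per the Nagle rule."""
    if nodelay:
        # TCP_NODELAY: transmit as soon as possible, bypassing the rule.
        return pending_bytes > 0
    if pending_bytes >= mss and effective_window >= mss:
        # A full segment can always be sent if the window allows.
        return True
    if bytes_in_flight == 0:
        # Nothing in flight: send whatever small amount we have now.
        return pending_bytes > 0
    # Otherwise wait for the next ACK (the ACK acts as the "timer").
    return False
```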

TCP was first deployed in the early 1980s, when backbone link bandwidths were measured in tens of kilobits per second. It is no surprise that tuning TCP to keep up with growing network speeds has attracted a lot of attention. In principle, these TCP changes are independent of the changes to the congestion control mechanisms described in later chapters, but because they were deployed together, the two issues have unfortunately become conflated. Further blurring the line between adapting to high-speed networks and addressing congestion, the extensions to the TCP header play an important role in both. Finally, note that a larger bandwidth-delay product does have implications for congestion control, and some of the approaches discussed in later sections address exactly that.

This section focuses primarily on the challenges that high-speed networks pose; we postpone the details of the TCP extensions used to address them until Chapter 4, where we can also take the related congestion control mechanisms into account. For now, we concentrate on the limits of the SequenceNum and AdvertisedWindow fields and their implications for TCP's correctness and performance.

The issue with the 32-bit sequence number space is that the sequence numbers used on a given connection might wrap around: a byte with sequence number S could be sent at one time, and then at a later time a second byte with the same sequence number S might need to be sent. Once again, we assume that packets cannot survive on the Internet longer than the recommended MSL; thus we need to make sure the sequence number does not wrap around within 120 seconds. Whether this happens depends on how fast data can be transmitted over the Internet, that is, how fast the 32-bit sequence number space can be consumed. (This discussion assumes we are trying to consume the sequence number space as fast as possible, which of course we are if we are keeping the pipe full.) Table 1 shows how long it takes the sequence number space to wrap around on networks of various bandwidths.

Table 1. Time until the 32-bit sequence number space wraps around.
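
The wrap-around times in Table 1 follow from a one-line calculation: the 32-bit space covers 2^32 bytes, so the time to consume it is 2^32 bytes divided by the sending rate. A quick sketch (the link speeds here are chosen for illustration):

```python
def wrap_time_seconds(bandwidth_bps):
    """Time to send 2**32 bytes at the given link bandwidth (bits per second)."""
    return (2 ** 32) * 8 / bandwidth_bps

for name, bps in [("T1 (1.5 Mbps)", 1.5e6),
                  ("Fast Ethernet (100 Mbps)", 100e6),
                  ("OC-192 / 10GigE (10 Gbps)", 10e9)]:
    # At 10 Gbps the sequence space wraps in a few seconds,
    # far below the 120-second MSL.
    print(f"{name}: {wrap_time_seconds(bps):,.0f} s")
```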

The 32-bit sequence number space is adequate at moderate bandwidths, but given that OC-192 links are now common in the Internet backbone and that most servers now have 10 Gbps Ethernet interfaces, we are well past the point where 32 bits is enough. A TCP extension effectively extends the sequence number space so that the SequenceNum field does not wrap within the MSL, and this extension does double duty in congestion control, which we cover in Chapter 4.

The issue with the 16-bit AdvertisedWindow field is that it must be big enough to allow the sender to keep the pipe full. Obviously, the receiver is free not to open the window as wide as the AdvertisedWindow field allows; what interests us here is whether the receiver has enough buffer space to handle as much data as the largest possible AdvertisedWindow allows.

In this case, it is not just the network bandwidth but the bandwidth-delay product that dictates how big the AdvertisedWindow field needs to be; the window must be opened wide enough to allow a full bandwidth-delay product's worth of data to be in transit. Assuming an RTT of 100 ms (a typical number for a cross-country connection in the United States), Table 2 gives the bandwidth-delay product for several network technologies. Note that for the OC-n links, the bandwidths we use have the SONET overhead removed.

Table 2. Required window size for a 100 ms RTT.

In other words, TCP's AdvertisedWindow field is in even worse shape than its SequenceNum field: a 16-bit field allows a window of only 64 KB to be advertised, which is not big enough to handle even a T3 connection across the continental United States.

An extension to TCP fixes this by allowing the receiver to advertise a larger window, thereby allowing the sender to fill larger bandwidth-delay products and making high-speed networks practical. The extension involves an option that defines a scaling factor for the advertised window. That is, rather than interpreting the number in the AdvertisedWindow field as the number of bytes the sender may have unacknowledged, the two sides agree that the field counts larger chunks (for example, how many 16-byte units of data the sender may have unacknowledged). In other words, the window scale option specifies how many bits the value in the AdvertisedWindow field should be shifted left when computing the effective window.
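
To see why a scaling factor is needed, the sketch below computes the bandwidth-delay product for a 100 ms RTT and the smallest left shift that lets the 16-bit AdvertisedWindow field cover it (the numbers are illustrative; real TCP caps the scale option at 14):

```python
def required_window_bytes(bandwidth_bps, rtt_seconds):
    # Bandwidth-delay product: bytes that must be in flight to keep the pipe full.
    return bandwidth_bps * rtt_seconds / 8

def min_window_scale(window_bytes, field_max=2 ** 16 - 1):
    # Smallest shift such that (AdvertisedWindow << shift) can express the window.
    shift = 0
    while (field_max << shift) < window_bytes:
        shift += 1
    return shift

bdp = required_window_bytes(10e9, 0.100)             # 10 Gbps link, 100 ms RTT
print(f"{bdp:,.0f} bytes must be in flight")         # 125,000,000 bytes (~125 MB)
print(f"window scale needed: {min_window_scale(bdp)}")   # 11
```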


[1] TCP Congestion Control: A Systems Approach: https://tcpcc.systemsapproach.org/index.html

[2] Computer Networks: A Systems Approach: https://book.systemsapproach.org

[3] J. Saltzer, D. Reed, and D. Clark. End-to-End Arguments in System Design. ACM Transactions on Computer Systems, Nov. 1984: https://web.mit.edu/Saltzer/www/publications/endtoend/endtoend.pdf