Introduction | In scenarios that demand high performance and low resource usage, such as massive numbers of connections and high concurrency, we found that Go starts to struggle: memory overhead is large, goroutine scheduling becomes frequent, GC time keeps growing, and the system may even hang. In such scenarios we can consider using Go to build a classic Reactor network model.

I. Common server-side network programming models

Before diving into the implementation of the Reactor network library, let's quickly review the common server-side network programming models.

Server-side network programming mainly has to solve two problems: how the server manages connections, especially massive numbers of connections and highly concurrent connections (the classic C10K/C100K problem), and how the server handles requests, i.e. responding normally under high concurrency.

There are three common solutions to these two problems, corresponding to three models:

Traditional blocking I/O model.

Reactor model.

Proactor model.

The following two figures show the traditional blocking I/O model and the Reactor model. The traditional blocking I/O model is characterized by each connection being managed by a separate thread/process, with business logic (CRUD) and data handling (reads and writes on the network connection) both done in that thread/process. The disadvantages are obvious: under high concurrency a large number of threads/processes must be created, at a large cost in system resources; and after a connection is established, if there is temporarily no data to read, the thread/process blocks on the Read call and wastes system resources.

The Reactor model improves on the traditional blocking I/O model. A Reactor runs as a separate thread/process that listens for and dispatches events, distributing them to other EventHandlers that handle data reads, writes, and business logic. Unlike the traditional blocking I/O model, connections in the Reactor model all go first to the EventDispatcher, the core event dispatcher, and the Reactor uses I/O multiplexing to handle multiple connections on the event dispatcher in a non-blocking way.

The EventDispatcher and the EventHandlers behind it can run in the same thread/process or in separate ones, a distinction we will draw below. Overall, Reactor is an event dispatching mechanism, which is why it is also known as the event-driven model. In short, Reactor = I/O multiplexing + non-blocking I/O.

Depending on the number of Reactors and how the business threads are organized, there are three typical implementations:

Single Reactor multithreading

Single Reactor multithreading with thread pool

Master-slave Reactor multithreading (with thread pool)

Let's first look at the two single-Reactor variants:

A single Reactor takes over all event dispatching. If the event is a connection-establishment event, it is handed to the Acceptor, which then creates a corresponding Handler to handle subsequent read and write events on that connection. If it is not a connection-establishment event, the event handler corresponding to the connection is called to respond. The difference between single Reactor 1 and 2 is that 2 adds a thread pool, which frees the Event Handler thread to some extent and lets the Handler focus on reading and writing data, especially when the business logic is cumbersome and time-consuming.

Now let's look at the master-slave multi-Reactor, the protagonist of this article (the third section covers how to implement it). Its characteristic is that multiple Reactors run in multiple separate threads/processes. The Main Reactor is responsible for handling connection-establishment events, handing them to its Acceptor; once a connection is accepted, it is assigned to a Sub Reactor. The Sub Reactor handles subsequent read and write events on that connection and calls the EventHandlers itself to do the work.

This implementation has a clear separation of responsibilities, can easily make full use of CPU resources by increasing the number of Sub Reactors, and is also the current mainstream server-side network programming model.

Although the protagonist of this article is the master-slave multi-Reactor, if Proactor were able to take the leading role, Reactor would hardly stand a chance.

The essential difference between the Proactor model and the Reactor model is the difference between asynchronous I/O and synchronous I/O, that is, the underlying I/O implementation.

As the two diagrams above show, the synchronous I/O that the Reactor model relies on has to keep checking whether events have occurred and then copy the data for processing, whereas with the asynchronous I/O used by the Proactor model, it only needs to wait for the system notification and can directly process the data already copied by the kernel.

The Proactor model based on asynchronous I/O is implemented as follows:

So why has Proactor, with such an obvious protagonist's aura, not become the mainstream server-side network programming model?

The reason is that the AIO API under Linux, io_uring, cannot yet cover and support as many scenarios as synchronous I/O; in other words, it is not yet mature enough to be widely used.

II. The Go native network model

There are already many articles online about the implementation of Go's native network model, so it will not be expanded on much here; readers can trace the whole code flow with the help of the following figure:

In short, all of Go's network operations revolve around the network descriptor netFD. A netFD is bound to an underlying pollDesc structure; when a read or write on a netFD hits an EAGAIN error, the current goroutine is stored in the bound pollDesc and parked until the data on that netFD is ready, at which point the goroutine is woken up to complete the read or write.

To sum up, Go's native network model is a single-Reactor, multi-coroutine model.

III. Implementing an asynchronous network library from 0 to 1

Having reviewed the common server-side network programming models, we also know that the way Go handles connections is to assign a coroutine to each connection, the goroutine-per-conn pattern.

This section gets to our point: how to implement an asynchronous network library. (In the Reactor model, after the main thread accepts a connection it is generally handed to other threads/processes to asynchronously handle the subsequent business logic and reads/writes, so Reactor-style network libraries are usually called asynchronous network libraries, even though they do not use asynchronous I/O APIs.)

Before the specific implementation, the author will first introduce the background of the requirements.

Go's coroutines are very lightweight, and in most scenarios applications built on Go's native network library will not hit a performance bottleneck, while resource usage stays acceptable.

The gateway we use today is a self-developed C++ gateway. We want to unify the technology stack and replace it with Go. Our peak load is around a million connections, served by dozens of machines, with a single machine stably supporting hundreds of thousands of connections. If we switched to Go, we kept wondering: how many connections could a single machine of a Go-based gateway hold, what would memory and CPU look like, and could we save some machines?

So the author ran a round of stress tests on Go for this massive-connection scenario, and the conclusion was clear: as the number of connections grows, the number of Go coroutines grows linearly with it, memory overhead rises, and the proportion of time spent in GC increases. Once the number of connections reaches a certain level, Go's forced GC hangs the process and the service becomes unavailable. (Comparative stress-test data for the network libraries appears below.)

The author then looked through solutions for the same scenario on both the intranet and the internet; they are basically all built on the classic Reactor model. For example, in the earliest one, A Million WebSockets and Go, the author Sergey Kamardin used epoll instead of the goroutine-per-conn pattern, so that a scenario with millions of connections needs only a small number of goroutines instead of millions of them.

A Million WebSockets and Go:

https://www.freecodecamp.org/news/million-websockets-and-go-cc58418460bb/

A summary of Sergey Kamardin's optimizations:

Let’s structure the optimizations I told you about.

A read goroutine with a buffer inside is expensive. Solution: netpoll (epoll, kqueue); reuse the buffers.

A write goroutine with a buffer inside is expensive. Solution: start the goroutine when necessary; reuse the buffers.

With a storm of connections, netpoll won’t work. Solution: reuse the goroutines with the limit on their number.

net/http is not the fastest way to handle Upgrade to WebSocket. Solution: use the zero-copy upgrade on bare TCP connection.

Another example: ByteDance built its RPC framework Kitex on top of the Reactor-style network library netpoll to handle high-concurrency scenarios.

The author implemented a simple gateway in Go and ran another round of stress tests with these Reactor network libraries. The results matched expectations: with large numbers of connections, the Go gateway was indeed more stable than before, and memory usage also looked good. In the end, however, none of these open source Reactor libraries was chosen: they are not usable out of the box, and none of them implements common protocols such as HTTP/1.x and TLS; their API designs are not flexible enough and the scenarios they focus on do not fit a gateway, for example netpoll currently focuses mainly on RPC scenarios (ByteDance open-sourced its HTTP framework Hertz just last week); and the overall adaptation cost is high, making them hard to fit into our Go gateway.

netpoll's own description of its target scenario:

On the other hand, the open source community currently lacks Go network libraries focused on RPC solutions. Similar projects such as evio, gnet, etc., are oriented to scenarios such as Redis and HAProxy.

Finally we come to the implementation part. Let's first look at the overall layered design of a Reactor library, which is divided into three layers: the application layer, the connection layer, and the base layer.

The application layer is the familiar EchoServer, HTTPServer, TLSServer, GRPCServer, and so on. It is mainly responsible for protocol parsing and executing business logic, and corresponds to the EventHandler in the Reactor model.

In the Reactor model, the application layer implements an event-handling interface and waits for the connection layer to call it.

For example, when a connection is established, the connection layer calls the OnOpen function so the application can do some initialization; when new data arrives on the connection, it calls the OnData function to do the actual protocol parsing and business logic.
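As a rough sketch of that contract, the interface below uses the OnOpen/OnData naming from the text; the EventHandler and Conn types here are illustrative assumptions, not the API of gnet, netpoll, or the author's library:

```go
package reactor

// Conn is a placeholder for the connection type owned by the
// connection layer; a fuller sketch appears in the connection-layer
// section below.
type Conn struct {
	fd int
}

// EventHandler is the interface the application layer implements and
// the connection layer calls back into. The OnOpen/OnData/OnClose
// names follow the convention described above and are illustrative.
type EventHandler interface {
	// OnOpen runs once after the connection is established, e.g. to
	// set up per-connection state.
	OnOpen(c *Conn)
	// OnData runs when new data arrives on the connection; it parses
	// the protocol, executes business logic, and returns any bytes
	// that should be written back.
	OnData(c *Conn, in []byte) (out []byte)
	// OnClose runs after the connection has been closed.
	OnClose(c *Conn, err error)
}

// EchoHandler is a minimal usage example: an EchoServer simply
// returns whatever it received.
type EchoHandler struct{}

func (EchoHandler) OnOpen(c *Conn)                   {}
func (EchoHandler) OnData(c *Conn, in []byte) []byte { return in }
func (EchoHandler) OnClose(c *Conn, err error)       {}
```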

The connection layer is the core of the whole Reactor model. Following the master-slave multi-Reactor model above, the connection layer has two kinds of Reactors, a Main Reactor and one or more Sub Reactors; you can also run multiple masters with multiple slaves.

The Main Reactor is mainly responsible for listening for and accepting connections and then assigning them. It runs a for loop that keeps accepting new connections; this method can be called acceptorLoop. A Sub Reactor takes the connections assigned by the Main Reactor; it also runs a for loop, waiting for read and write events to arrive and then doing the work, i.e. calling back into the application layer to execute the concrete business logic; its method can be called readWriteLoop.

From the way the connection layer works, we can see that we need the following three data structures (a sketch follows the list):

EventLoop: the event loop, i.e. a Reactor. It uses isMain to distinguish master from slave; in the case of a Sub Reactor, many Conns hang off each one.

Poller: the readWriteLoop in a Sub Reactor has to keep handling read and write events, which are monitored and reported by different I/O APIs on different systems: the classic epoll trio on Linux, and kqueue on BSD-derived systems such as macOS.

Conn: the connection created after the Main Reactor's listener accepts, bound to a file descriptor (fd).
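Continuing the same illustrative sketch, these three structures might be laid out roughly as follows; the placeholder Conn above is fleshed out here, and all field names are assumptions rather than the API of a specific library:

```go
package reactor

import (
	"net"
	"sync"
)

// Poller wraps the platform's I/O multiplexer (epoll on Linux,
// kqueue on BSD/macOS); a Linux sketch of it appears in the base
// layer section below.
type Poller struct {
	epfd int // epoll (or kqueue) descriptor
}

// Conn is a connection accepted by the Main Reactor and then owned
// by one Sub Reactor; it is bound to a file descriptor.
type Conn struct {
	fd     int
	mu     sync.Mutex // guards fd against the close/reuse race below
	closed bool
	loop   *EventLoop // the Sub Reactor this connection belongs to
}

// EventLoop is a Reactor. isMain distinguishes the Main Reactor
// (which accepts connections) from Sub Reactors (which read and
// write on them).
//
// acceptorLoop would be the Main Reactor's for loop: keep accepting
// connections, wrap each raw fd into a Conn, and hand it to a Sub
// Reactor chosen by a load-balancing policy.
//
// readWriteLoop would be a Sub Reactor's for loop: wait on the
// poller for read/write events and call back into the application
// layer (handler.OnData and friends) for each ready Conn.
type EventLoop struct {
	isMain   bool
	poller   *Poller
	listener net.Listener  // used only by the Main Reactor
	conns    map[int]*Conn // fd -> Conn, used only by Sub Reactors
	handler  EventHandler
}
```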

Each connection is bound to an fd, and when a connection is closed its fd is released for new connections to bind to; this is known as fd reuse.

Usually the application layer executes its business logic in a coroutine pool, while a Sub Reactor in the connection layer handles the read and write events on the same connection.

Suppose the application layer side closes the connection just as the Sub Reactor is about to read data on it, i.e. to operate on the fd.

If the Sub Reactor has not read yet, but the fd released by the application layer has already been handed to a new connection, then when the Sub Reactor goes on to read from this fd it will read the new connection's data.

Therefore we need to guard operations on the fd with a lock: take the connection's lock before closing the connection and before reading or writing on it, release the lock after closing, and check whether the connection has been closed before reading or writing, so as to avoid dirty data.
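A minimal sketch of that guard, continuing the illustrative Conn type above and assuming golang.org/x/sys/unix for the raw syscalls: both Close and read take the connection's lock and check a closed flag, so the Sub Reactor can never touch an fd that has already been released and possibly reused.

```go
package reactor

import (
	"net"

	"golang.org/x/sys/unix"
)

// Close releases the fd exactly once. Holding mu while closing
// prevents a concurrent read from racing with the release (and
// possible reuse) of the fd.
func (c *Conn) Close() error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.closed {
		return nil
	}
	c.closed = true
	// After this call the kernel may hand the same fd number to a
	// brand-new connection.
	return unix.Close(c.fd)
}

// read is called by the Sub Reactor when the poller reports the fd
// readable. Checking closed under the lock ensures we never read a
// reused fd and pick up another connection's data; the fd is
// non-blocking, so holding the lock across the syscall is cheap.
func (c *Conn) read(buf []byte) (int, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.closed {
		return 0, net.ErrClosed
	}
	return unix.Read(c.fd, buf)
}
```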

Besides the race caused by fd reuse, there is another step that must not be overlooked: load balancing when the Main Reactor assigns connections to Sub Reactors.

To avoid overloading any single Sub Reactor, we can borrow Nginx's load-balancing strategies, roughly the following three (a sketch follows the list):

Round-robin: cycle through the Sub Reactors, assigning connections one by one.

fd hash: c.fd % len(s.workLoops), i.e. hash the fd value over the number of Sub Reactors.

Least connections: prefer the Sub Reactor that currently has the fewest connections.
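A minimal sketch of the three strategies; the subReactorPicker type and its fields (workLoops, connCount) are illustrative assumptions:

```go
package reactor

import "sync/atomic"

// subReactorPicker chooses which Sub Reactor a newly accepted
// connection is assigned to.
type subReactorPicker struct {
	workLoops []*EventLoop
	connCount []int64 // live connection count per Sub Reactor
	next      uint64  // round-robin counter
}

// roundRobin cycles through the Sub Reactors, assigning connections
// one by one.
func (p *subReactorPicker) roundRobin() *EventLoop {
	n := atomic.AddUint64(&p.next, 1)
	return p.workLoops[n%uint64(len(p.workLoops))]
}

// fdHash mirrors the c.fd % len(s.workLoops) expression above: the
// fd value is hashed over the number of Sub Reactors.
func (p *subReactorPicker) fdHash(fd int) *EventLoop {
	return p.workLoops[fd%len(p.workLoops)]
}

// leastConnections prefers the Sub Reactor that currently holds the
// fewest connections; connCount is maintained as connections are
// registered and closed.
func (p *subReactorPicker) leastConnections() *EventLoop {
	best := 0
	for i := range p.connCount {
		if atomic.LoadInt64(&p.connCount[i]) < atomic.LoadInt64(&p.connCount[best]) {
			best = i
		}
	}
	return p.workLoops[best]
}
```

Round-robin is usually enough when connections behave alike; least-connections helps more when connection lifetimes vary widely.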

The core of the Reactor is done at the connection layer; the role of the base layer is to provide support for the low-level system calls and to manage memory well.

The system calls are the familiar listen/accept/read/write/epoll_create/epoll_ctl/epoll_wait and so on, which will not be expanded on much here beyond the short sketch below.
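For reference, here is a minimal Linux-only sketch of the Poller's wait loop built on golang.org/x/sys/unix (error handling trimmed to the essentials; kqueue on BSD/macOS would need its own variant):

```go
//go:build linux

package reactor

import "golang.org/x/sys/unix"

// newPoller creates the epoll instance behind a Reactor.
func newPoller() (*Poller, error) {
	epfd, err := unix.EpollCreate1(unix.EPOLL_CLOEXEC)
	if err != nil {
		return nil, err
	}
	return &Poller{epfd: epfd}, nil
}

// add registers an fd for readability notifications.
func (p *Poller) add(fd int) error {
	ev := unix.EpollEvent{Events: unix.EPOLLIN, Fd: int32(fd)}
	return unix.EpollCtl(p.epfd, unix.EPOLL_CTL_ADD, fd, &ev)
}

// wait blocks until events arrive and hands each ready fd to onReady;
// this is the heart of a Sub Reactor's readWriteLoop.
func (p *Poller) wait(onReady func(fd int, events uint32)) error {
	events := make([]unix.EpollEvent, 128)
	for {
		n, err := unix.EpollWait(p.epfd, events, -1)
		if err != nil {
			if err == unix.EINTR {
				continue // interrupted by a signal; just retry
			}
			return err
		}
		for i := 0; i < n; i++ {
			onReady(int(events[i].Fd), events[i].Events)
		}
	}
}
```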

How memory is managed, however, can greatly affect a network library's performance. When the author first handled the connection read event, a dynamic memory pool provided a temporary buffer for each read, which meant borrowing a buffer and returning it every time; compared with that, using a fixed buffer gained roughly 120k QPS in a simple Echo stress test, a startling difference.
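To make the comparison concrete, here is a hedged sketch of the two approaches, using sync.Pool for the pooled variant and a per-connection fixed buffer for the other; the snippet is illustrative and does not reproduce the author's benchmark:

```go
package reactor

import "sync"

// Approach 1: borrow a temporary buffer from a pool for every read
// and return it afterwards. The per-read Get/Put round trip is the
// cost the benchmark above exposed.
var readBufPool = sync.Pool{
	New: func() any { return make([]byte, 4096) },
}

func (c *Conn) readWithPool(handle func([]byte)) error {
	buf := readBufPool.Get().([]byte)
	defer readBufPool.Put(buf) // returned once handling is done
	n, err := c.read(buf)
	if err != nil {
		return err
	}
	handle(buf[:n]) // handle must not retain buf after it returns
	return nil
}

// Approach 2: give each connection one fixed buffer that lives as
// long as the connection does, so the read path allocates nothing.
type connBuffer struct {
	buf []byte // e.g. 4 KB, sized for the expected packet size
}

func (c *Conn) readWithFixed(b *connBuffer, handle func([]byte)) error {
	n, err := c.read(b.buf)
	if err != nil {
		return err
	}
	handle(b.buf[:n]) // valid until the next read on this connection
	return nil
}
```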

The following compares the advantages and disadvantages of common memory management schemes for handling reads and writes on a connection:

Fixed array

Each read requests a fixed-size buffer. The advantage is that it is simple to implement; the disadvantage is that it accumulates temporary objects.

RingBuffer

Read and write are split, saving memory, but frequent capacity expansion costs performance (old data has to be moved into the new RingBuffer when it grows).

LinkBuffer

Read and write are split, saving memory. Pooled block nodes make growing and shrinking easy without a performance penalty, and a NoCopy API can be implemented on top of it to further improve performance.

The ideal choice here is the third memory management scheme, which is what ByteDance's netpoll implements.

Referring to that project's implementation: NoCopy means that the data read by the connection layer is not copied up to the application layer; instead, the application layer references the LinkBuffer and uses it directly.

First, the zero-copy read interface. The read operation is split into two steps, "reference read" and "release". A "reference read" takes a byte slice of a given length out of the Linked Buffer in the form of a pointer; after the user has finished with the data, they actively call "release" to tell the Linked Buffer that the space just "reference read" will no longer be used and can be freed. Once "released", that data can no longer be read or modified.

The zero-copy write interface constructs the byte slices passed in by the user into nodes, each containing a pointer to its byte slice, and then links these nodes into the Linked Buffer; everything is done through pointers to the byte slices, without any copying.
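A sketch of what such a zero-copy interface could look like, loosely modeled on the "reference read"/"release" and node-append semantics described above; the names NocopyReader/NocopyWriter and their methods are illustrative assumptions, not netpoll's exact API:

```go
package reactor

// NocopyReader exposes "reference read" plus "release": Next returns
// a slice that points directly into the LinkBuffer's underlying
// blocks, and Release tells the buffer that those bytes will no
// longer be read or modified, so the blocks can be recycled.
type NocopyReader interface {
	// Next references the next n bytes without copying them.
	// The returned slice is only valid until Release is called.
	Next(n int) ([]byte, error)
	// Release frees every byte handed out by Next so far.
	Release() error
}

// NocopyWriter builds output by linking the caller's byte slices
// into the LinkBuffer as nodes, so no copy happens on the write
// path either.
type NocopyWriter interface {
	// Append adds b as a new node; the caller must not modify b
	// until Flush has completed.
	Append(b []byte) error
	// Flush hands the linked nodes to the kernel (e.g. via writev).
	Flush() error
}
```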

The three subsections above describe the framework and design of a Reactor network library; the process itself is not complicated. What the author believes really tests you is implementing the common HTTP/1.x, TLS, and even HTTP/2.0 protocols on top of such a library. When implementing HTTP/1.x the author tried many open source parsers, and the performance of many of them was unsatisfactory. After trying to use Go's official TLS implementation directly, it turned out that the four flights of the TLS handshake do not arrive as consecutive packets; at the third flight, the data sent by the client may only arrive after a while... Most of these problems are tricky, which is probably why many open source libraries do not implement these protocols.

After developing the Reactor network library and implementing the common application-layer protocols on top of it, we need a round of stress tests to measure the library's performance.

Unlike most open source libraries, which only run simple Echo benchmarks, the author built two stress-test scenarios:

Echo scenario: the EchoServer does no protocol parsing and no business logic; the goal is a horizontal comparison with Reactor libraries of the same kind.

HTTP scenario: the HTTPServer parses the HTTP/1.x protocol and adds a 100k-iteration counting loop to simulate business logic, in order to scale up to 100k connections and compare with Go net.

The final results are the four graphs below. You can ignore the data for ByteDance's netpoll: these two scenarios are probably not netpoll's target scenario (RPC), so our benchmark setup for it was most likely wrong.

In the Echo scenario the EchoServer runs on a 4-core machine; in the HTTP scenario the HTTPServer runs on 8 cores.

Figure 1: Echo scenario, packet size fixed at 1 KB, number of connections increasing.

Figure 2: Echo scenario, number of connections fixed at 1K, packet size increasing.

Figures 3 and 4: HTTP scenario, packet size fixed at 1 KB, showing QPS and memory usage as the number of connections increases.

The stress-test results show that in most of the tests the Go native network library's performance does not fall behind; only as the number of connections climbs, or the packets to be processed keep getting larger, does the Go native network library gradually trend downward. In particular, when connections reach 300k to 500k, the memory overhead of the Go native network library grows and GC time grows with it, and at 500k connections a forced GC takes the service down.

Here are the details of the Go native network library at 500k connections, when a forced GC brings it down:

And here are the GC details of the Reactor network library (wnet), which still holds up at the same number of connections:

On the whole, then, the Go native network library can satisfy most application scenarios. Compared with a Reactor network library, the Go native network library can be seen as trading space (memory, runtime) for time (high throughput and low latency). When space is tight, i.e. once the number of connections climbs, the huge memory overhead and the accompanying GC will make the service unavailable, and this massive-connection scenario is exactly where a Reactor network library shines. Examples include event-driven traffic such as e-commerce promotions, where traffic peaks are expected and there are massive numbers of connections and requests at peak; and long-lived-connection scenarios such as live-stream bullet comments (danmaku) and message push, which also carry huge numbers of long connections.

IV. Afterword

The final implementation described in this article is not open source. Readers can follow the process above while reading similar open source implementations, such as gnet and gev, to understand the design of a Reactor network library, and then rework those projects based on the design in section three; I believe readers will build even better network libraries.

 About the author

Liu Xiangyu

Tencent backend development engineer

Tencent backend engineer, currently responsible for developing services related to e-sports events.

Recommended reading

An analysis of the Go network library gnet