Go is unique when it comes to profiling. Its runtime contains powerful, self-contained profilers. Other programming languages like Ruby, Python, or Node.js include profilers, or APIs for writing profilers, but their scope is limited compared to what Go provides out of the box. If you want to learn more about the observability tools that Go offers, I highly recommend Felix Geisendörfer's The Busy Developer's Guide to Go Profiling, Tracing and Observability.

As a curious engineer, I love to delve into underlying implementation details, and I've always wanted to learn how the Go CPU profiler is implemented. This blog post is a record of that learning process. Without exception, I find and learn something new every time I read the Go runtime.

There are two types of profilers:

Tracing: a measurement is made whenever a pre-determined event occurs, e.g., a function call or function exit.

Sampling: a measurement is made at regular intervals.

The Go CPU profiler is a sampling profiler. Go also has a tracing profiler, the Go execution tracer, which tracks specific events such as acquiring locks, GC-related events, and so on.

A sampling profiler typically consists of two main parts:

Sampler: a callback triggered at intervals; a stack trace is typically collected as the profiling data. Different profilers use different strategies to trigger these callbacks.

Data collection: this is where the profiler collects its data. It could be memory usage or call counts, basically anything associated with a stack trace.

Linux perf uses PMU (Performance Monitoring Unit) counters for sampling. You instruct the PMU to produce an interrupt after N occurrences of a certain event. An example might be sampling every 1000 CPU clock cycles. Denis Bakhvalov wrote a detailed article explaining how tools like perf and VTune use PMU counters to implement this. Once the data collection callback is triggered periodically, all that remains is to collect the stack traces and aggregate them appropriately. For completeness, Linux perf uses perf_event_open(PERF_SAMPLE_STACK_USER,…) to obtain stack trace information. The captured stack traces are written to user space through an mmap'd ring buffer.

pyspy and rbspy are well-known sampling profilers for Python and Ruby. They both run as external processes and periodically read the target application's memory to capture the stack trace of the running thread. On Linux, they use process_vm_readv, and if I remember correctly, this API pauses the target program for a few milliseconds while reading its memory. They then follow pointers in the memory they read to locate the currently running thread's structure and stack trace information. As one might guess, this is an error-prone and complicated approach, but it works surprisingly well. IIRC, pyflame is implemented using a similar approach.

Recent profilers such as Parca (and others) use eBPF. eBPF is a recently emerged technology that allows user-space code to run inside a kernel virtual machine. It is an outstanding technology used in many areas such as security, networking, and observability. I highly recommend reading up on eBPF; it's a huge topic that goes far beyond the scope of this blog post.

The Go CPU profiler is a sampling profiler. On Linux, the Go runtime uses the setitimer/timer_create/timer_settime APIs to set up a SIGPROF signal handler. This handler is triggered at the interval controlled by runtime.SetCPUProfileRate, which defaults to 100 Hz (10 ms). By the way: surprisingly, before Go 1.18 there were some serious issues with the Go CPU profiler! You can see the gory details of these issues here. IIUC, the setitimer API is the recommended way to trigger time-based signals per thread on Linux, but it doesn't work as advertised. Feel free to correct me if this statement is wrong. (Felix Geisendörfer corrects this statement: technically it works the way you would expect from a process-directed signal mechanism, but it's not a good mechanism for multicore profiling.)

Let’s see how you can turn on the Go CPU profiler:

Once pprof.StartCPUProfile is called, the Go runtime generates SIGPROF signals at the specified interval. The kernel sends a SIGPROF signal to one of the running threads in the application. Because Go uses non-blocking I/O, goroutines waiting on I/O are not counted as running, and the Go CPU profiler does not capture them. By the way: this is the basic reason fgprof was implemented. fgprof uses runtime.GoroutineProfile to get profile data for both waiting and non-waiting goroutines.
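To see the difference, a small sketch using runtime.GoroutineProfile directly, the stdlib API that fgprof builds on. Unlike the SIGPROF-based profiler, it returns stack records for all goroutines, running or waiting:

```go
package main

import (
	"fmt"
	"runtime"
)

// snapshotGoroutines captures stack records for ALL goroutines,
// including those blocked on I/O or sleeping.
func snapshotGoroutines() []runtime.StackRecord {
	// First call with an empty slice just returns the record count.
	n, _ := runtime.GoroutineProfile(nil)
	// Allocate with some slack in case goroutines are created meanwhile.
	records := make([]runtime.StackRecord, n+10)
	n, ok := runtime.GoroutineProfile(records)
	if !ok {
		return nil
	}
	return records[:n]
}

func main() {
	recs := snapshotGoroutines()
	fmt.Println("goroutines captured:", len(recs))
}
```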

Here’s how Go runtime handles SIGPROF signals:

Once a randomly chosen running goroutine receives a SIGPROF signal, it is interrupted and the signal handler runs. The stack trace of the interrupted goroutine is retrieved in the context of this signal handler and then saved into a lock-free log structure along with the current profiler labels (each captured stack trace can be associated with custom labels that you can filter on later). This special lock-free structure is named profBuf; it is defined in runtime/profbuf.go, which explains in detail how it works. It is a single-writer, single-reader lock-free ring buffer, similar to the data structure published here. The writer is the profiler's signal handler, and the reader is a goroutine (profileWriter) that periodically reads the data from this buffer and aggregates the results into the final hashmap. This final hashmap structure is named profMap and is defined in runtime/pprof/map.go.
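To make the single-writer/single-reader idea concrete, here is a heavily simplified illustrative sketch of such a ring buffer. This is not the real profBuf (which also packs timestamps, tags, and handles overflow and sleeping readers); it only shows the core trick: one atomic index per side, no locks, and a writer that drops rather than blocks:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// ring is a minimal single-writer, single-reader lock-free ring buffer.
type ring struct {
	buf  [1024]uint64
	head atomic.Uint64 // advanced only by the reader
	tail atomic.Uint64 // advanced only by the writer (signal-handler side)
}

// write appends one word without locking or allocating; this is the kind
// of operation that is safe from a signal handler. Returns false when full:
// the sample is dropped, the writer never blocks.
func (r *ring) write(v uint64) bool {
	tail := r.tail.Load()
	if tail-r.head.Load() == uint64(len(r.buf)) {
		return false
	}
	r.buf[tail%uint64(len(r.buf))] = v
	r.tail.Store(tail + 1) // publish after the data is in place
	return true
}

// read drains whatever the writer has published so far.
func (r *ring) read() []uint64 {
	head, tail := r.head.Load(), r.tail.Load()
	out := make([]uint64, 0, tail-head)
	for ; head < tail; head++ {
		out = append(out, r.buf[head%uint64(len(r.buf))])
	}
	r.head.Store(head)
	return out
}

func main() {
	var r ring
	r.write(1)
	r.write(2)
	fmt.Println(r.read()) // [1 2]
}
```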

Here’s a simple diagram of how it all fits together:

As you can see, the final structure is quite similar to an ordinary pprof.Profile object: it is a hashmap whose key is stack trace + labels and whose value is the number of times that call stack has been observed in the application. When pprof.StopCPUProfile() is called, profiling stops and the profileWriter goroutine calls the build() function, which is implemented here. This function is responsible for writing the data from the profMap structure to the io.Writer originally provided when pprof.StartCPUProfile was called. In effect, this is where the final pprof.Profile object is generated.
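A toy sketch of that aggregation shape (illustrative only; the real profMap in runtime/pprof/map.go keys on raw PC slices and tag pointers, not strings):

```go
package main

import "fmt"

// sampleKey models the hashmap key: stack trace + labels.
type sampleKey struct {
	stack  string // e.g. "main.work;main.main"
	labels string // serialized profiler labels
}

// profile maps each (stack, labels) pair to its observation count.
type profile map[sampleKey]int64

func (p profile) add(stack, labels string) {
	p[sampleKey{stack, labels}]++
}

func main() {
	p := profile{}
	p.add("main.work;main.main", "")
	p.add("main.work;main.main", "")
	p.add("main.idle;main.main", "")
	fmt.Println(p[sampleKey{"main.work;main.main", ""}]) // 2
}
```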

Pseudocode for profileWriter may be helpful:
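A simplified sketch, with details elided, of the loop in runtime/pprof/pprof.go:

```go
func profileWriter(w io.Writer) {
	b := newProfileBuilder(w)
	for {
		time.Sleep(100 * time.Millisecond)
		data, tags, eof := readProfile() // drain profBuf
		if err := b.addCPUData(data, tags); err != nil {
			// ...
		}
		if eof {
			break
		}
	}
	b.build() // write the final profile to w
}
```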

Once I had a high-level understanding of the design, I asked myself the following question:

Why does Go go to such great lengths to implement a unique lock-free structure just to hold temporary profiling data? Why not write everything to the hashmap periodically?

The answer lies in the design itself.

The first thing the SIGPROF handler does is disable memory allocation. In addition, the profiler code path involves no locks, and even the maximum depth of the stack trace is hard-coded: as of Go 1.19 it is 64. All of these details give the profiler a more efficient and predictable overhead. Low and predictable overhead is key to a production-ready profiler.

Based on this design, will the profiler's overhead be constant? Actually, it depends on the case. Let me explain: in a single profiler interrupt, the following happens:

a randomly chosen running goroutine is context-switched to run the SIGPROF handler code,

a stack walk occurs, and the Go runtime saves the stack trace into the lock-free ring buffer,

the goroutine is resumed.

In theory, since no allocation and no locking occur, all of the above should run in constant time. Although I can't find the reference, I recall that this is true: all of the above takes about ~10 nanoseconds to 1 microsecond (on a typical CPU) (corrected by Felix Geisendörfer). But in practice it gets worse. If you search for "Go CPU profiler overhead", you will see numbers ranging from 0.5% to 1%-5% or even 10% (there is no public mention of these numbers backed by empirical evidence). The reason behind this is mostly related to how CPUs work. Modern CPUs are complex beasts: they cache aggressively. A typical single CPU core has three layers of cache: L1, L2, and L3. These caches are heavily utilized when a CPU-intensive code path is running. Heavy cache utilization is especially true for applications that read lots of small (data that fits in the caches) and sequential data.

Matrix multiplication is a good example: during matrix multiplication, the CPU accesses lots of contiguous, individual memory cells. Such cache-friendly applications may incur the worst overhead from a sampling profiler, because each interrupt runs code that can evict the application's hot cache lines. While I'd love to use perf to run some benchmarks to validate this claim, that is beyond the scope of this blog post.
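To make "cache-friendly" concrete, a naive matrix multiplication whose inner loop walks memory sequentially (illustrative only; measuring actual cache behavior would need perf counters):

```go
package main

import "fmt"

// matmul computes c = a * b for square matrices. The loop order (i, k, j)
// keeps the innermost accesses to b[k][j] and c[i][j] sequential in memory,
// which is exactly the access pattern that keeps CPU caches hot.
func matmul(a, b [][]float64) [][]float64 {
	n := len(a)
	c := make([][]float64, n)
	for i := range c {
		c[i] = make([]float64, n)
		for k := 0; k < n; k++ {
			aik := a[i][k]
			for j := 0; j < n; j++ {
				c[i][j] += aik * b[k][j]
			}
		}
	}
	return c
}

func main() {
	a := [][]float64{{1, 2}, {3, 4}}
	fmt.Println(matmul(a, a)) // [[7 10] [15 22]]
}
```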

In summary, the Go runtime does a great job of keeping profiler overhead predictable and as low as possible. If you don't believe me (and you shouldn't), maybe the following quotes can convince you:

At Google, we continuously profile Go production services, and it is safe to do so.

The above is a quote from a Google thread.

There is also a commit from DataDog's continuous profiler implementation that enables the profiler by default:

After testing on many high-volume internal workloads, we have determined that this default is safe for production.

Finally, based on the theory above, we can make the following observation:

For a typical I/O-bound application, the profiler's overhead will be minimal.

This is because when there are many sleeping/idle goroutines, CPU cache trashing does not make much of a difference. We observed this repeatedly in the Go CPU profiler's benchmarks: on a typical I/O-bound application, the overhead is simply zero (or statistically insignificant). Again, though, providing empirical evidence is beyond the scope of this blog post; it could be done by observing throughput in a load test of a Go web application with the profiler on and off.

One of the reasons I like this design is that it proves how well you can optimize code depending on how well you understand the access patterns of the underlying data structures. In this case, the Go runtime uses a lock-free structure even though it is complete overkill in most situations. The Go runtime is full of clever optimizations like this, and I highly recommend taking the time to dig into the parts you find interesting.

I hope you enjoyed it!






