Go is one of the few languages that ships with its own profilers: its runtime contains powerful, self-contained profilers. Other languages, such as Ruby, Python, and Node.js, also ship profilers or APIs for writing profilers, but they offer limited functionality compared to what Go provides out of the box. If you want to learn more about the observability tools Go offers, I highly recommend reading Felix Geisendörfer’s The Busy Developer’s Guide to Go Profiling, Tracing and Observability.[1]
As a curious engineer, I like to dig into how things work under the hood, and I have always wanted to learn how the Go CPU profiler works internally. This article is the result of that exploration. Whenever I read Go runtime code I learn something new, and this time was no exception.
Broadly speaking, there are two types of profilers in use today: sampling profilers, which periodically capture a snapshot (typically a stack trace) of the running application, and tracing profilers, which instrument the application to record specific events as they occur.
The Go CPU profiler is a sampling profiler. Go also has an execution tracer, which is a tracing profiler that records certain events, such as lock acquisition, GC-related events, and so on.
A sampling profiler typically consists of two basic parts: a sampler, a callback that is triggered at periodic intervals (usually via an interrupt or a signal), and data collection, the code inside that callback that captures a stack trace of the interrupted application.
Linux perf uses PMU (Performance Monitoring Unit) counters for sampling. You can instruct the PMU to generate an interrupt after N occurrences of a certain event, for example, every 1000 CPU clock cycles. A detailed article by Denis Bakhvalov[2] explains how tools like perf and VTune use PMU counters to achieve this. Once the data collection callback is triggered periodically, all that remains is to collect the stack traces and aggregate them correctly. For completeness, Linux perf uses perf_event_open(PERF_SAMPLE_STACK_USER,…) to collect stack information, and the captured stack traces are written to user space through an mmap'd ring buffer.
py-spy[3] and rbspy[4] are well-known sampling profilers for Python and Ruby. They both run as external processes and periodically read the target application's memory to capture stack traces of the running threads. On Linux, they use process_vm_readv to read that memory, and if I remember correctly, this call pauses the target application for a few milliseconds. They then chase pointers through the memory they read to find the currently running thread structures and stack trace information. As one might guess, this is an error-prone and complex approach, but it works very well. If I remember correctly, pyflame[5] uses a similar approach.
More recent profilers such as Parca[6] (and several others) use eBPF[7]. eBPF is a state-of-the-art technology that allows user-space code to run inside a VM in the kernel. It is an excellent technology for many areas such as security, networking, and observability. I highly recommend reading up on eBPF; it is a very big topic that goes far beyond the scope of this post.
The Go CPU profiler is a sampling profiler. On Linux, the Go runtime uses setitimer/timer_create/timer_settime to set up the SIGPROF signal handler. This handler is triggered at the interval set by runtime.SetCPUProfileRate, which is 100 Hz (a sample every 10 ms) by default. By the way, prior to Go 1.18 there were some serious problems with the Go CPU profiler's sampler; you can read the details here[8]. If I remember correctly, setitimer was the recommended way to trigger per-thread, time-based signals in Linux, but technically it behaves like a process-wide signaling mechanism, which makes it a poor fit for profiling multi-core programs.
Let’s see how to manually enable the Go CPU profiler:
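A minimal sketch using the standard runtime/pprof API (the output file name here is arbitrary):

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	f, err := os.Create("cpu.pprof") // arbitrary output file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Start sampling: the runtime begins generating SIGPROF at ~100 Hz.
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	// ... code to be profiled ...
}
```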
Once pprof.StartCPUProfile is called, the Go runtime generates SIGPROF signals at the specified interval. The Linux kernel sends each SIGPROF signal to one of the running threads in the application. Since Go uses non-blocking I/O, goroutines waiting on I/O are not counted as running, and the Go CPU profiler does not capture data for them. By the way, this is the main reason fgprof[9] was implemented: fgprof uses runtime.GoroutineProfile to collect data for both on-CPU and off-CPU time.
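For reference, wiring fgprof into an application looks roughly like this (a sketch based on fgprof's documented Handler API; the port and path are arbitrary):

```go
package main

import (
	"log"
	"net/http"

	"github.com/felixge/fgprof"
)

func main() {
	// Expose fgprof's wall-clock profile (on-CPU + off-CPU) over HTTP.
	http.Handle("/debug/fgprof", fgprof.Handler())
	go func() {
		log.Println(http.ListenAndServe(":6060", nil))
	}()

	// ... application code ...
	select {} // keep the sketch running
}
```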
A picture is worth a thousand words. Here’s how the Go runtime handles SIGPROF signals:
A random running goroutine receives the SIGPROF signal: it is interrupted and the signal handler runs. The stack trace of the interrupted goroutine is collected in the context of this signal handler and then saved, together with the current profiler tags[10] (each captured stack trace can be associated with custom tags that you can filter on later), into a lock-free[11] log structure. This particular lock-free structure, named profBuf, is defined in runtime/profbuf.go[12], which also explains in detail how it works. It is a single-writer, single-reader lock-free ring buffer, similar to the data structure described here[13]. The writer is the profiler signal handler, and the reader is the profileWriter goroutine, which periodically reads this buffer and aggregates the samples into the final hashmap. This final structure is named profMap and is defined in runtime/pprof/map.go[14].
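To get a feel for what a single-writer, single-reader lock-free ring buffer looks like, here is a heavily simplified sketch. It is illustrative only and is not the actual profBuf implementation, which additionally handles variable-length records, tags, overflow counting, and reader wake-ups:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// ringBuffer is a minimal single-writer/single-reader lock-free ring buffer.
type ringBuffer struct {
	data [1024]uint64
	wr   atomic.Uint64 // advanced only by the single writer
	rd   atomic.Uint64 // advanced only by the single reader
}

// write appends one value. Safe only when called from a single writer
// (in the profiler's case, the SIGPROF handler).
func (b *ringBuffer) write(v uint64) bool {
	w, r := b.wr.Load(), b.rd.Load()
	if w-r == uint64(len(b.data)) {
		return false // full: the real profBuf counts this as a lost sample
	}
	b.data[w%uint64(len(b.data))] = v
	b.wr.Store(w + 1) // publish the new element to the reader
	return true
}

// read drains all currently available values. Safe only from a single reader
// (in the profiler's case, the profileWriter goroutine).
func (b *ringBuffer) read() []uint64 {
	w, r := b.wr.Load(), b.rd.Load()
	out := make([]uint64, 0, w-r)
	for ; r < w; r++ {
		out = append(out, b.data[r%uint64(len(b.data))])
	}
	b.rd.Store(r) // tell the writer these slots are free again
	return out
}

func main() {
	var b ringBuffer
	b.write(42)
	b.write(7)
	fmt.Println(b.read()) // [42 7]
}
```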
Here’s a visualization of what we have covered so far:
As can be seen, this final structure is very similar to a regular pprof.Profile object: it is a hashmap where the key is a stack trace plus a tag and the value is the number of times this call stack was observed in the application. When pprof.StopCPUProfile() is called, profiling stops and the profileWriter goroutine calls build()[15]. This function is responsible for writing the profMap structure to the io.Writer that was provided in the initial pprof.StartCPUProfile call. This is essentially how the final pprof.Profile[16] is generated.
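Conceptually (and ignoring the hand-rolled, allocation-conscious hash table of the real profMap), the aggregation boils down to something like this hypothetical sketch:

```go
package main

import "fmt"

// sampleKey is a hypothetical, simplified stand-in for profMap's key:
// a stack trace plus the profiler tags attached to it.
type sampleKey struct {
	stack string // e.g. "main.handler;main.work" (encoded stack trace)
	tags  string // e.g. "endpoint=/users" (encoded profiler tags)
}

func main() {
	counts := make(map[sampleKey]int64)

	// Every sample drained from the ring buffer bumps the counter for its key.
	samples := []sampleKey{
		{stack: "main.handler;main.work", tags: "endpoint=/users"},
		{stack: "main.handler;main.work", tags: "endpoint=/users"},
		{stack: "main.handler;encoding/json.Marshal", tags: "endpoint=/users"},
	}
	for _, s := range samples {
		counts[s]++
	}
	fmt.Println(counts)
}
```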
Some pseudocode for profileWriter may make this process easier to follow:
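The following is a paraphrase of the loop, not the actual runtime code; the helper names only loosely mirror the real ones:

```go
// profileWriter drains the lock-free buffer periodically and, when profiling
// stops, serializes the aggregated samples into the final profile.
func profileWriter(w io.Writer) {
	b := newProfileBuilder(w) // aggregates samples into profMap
	for {
		time.Sleep(100 * time.Millisecond)

		// Drain whatever the SIGPROF handler pushed into profBuf.
		data, tags, eof := readProfile()
		b.addCPUData(data, tags) // aggregate into the profMap hashmap

		if eof {
			break // pprof.StopCPUProfile was called
		}
	}
	b.build() // write the final pprof.Profile to w
}
```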
Once I had a high-level understanding of the design, I asked myself the following question: why does Go go to so much trouble to implement a custom lock-free structure just to store temporary profiling data? Why not write everything into the hashmap periodically?
The answer lies in the design itself.
The first thing to notice is that the SIGPROF handler runs with memory allocation disabled. In addition, the profiler code contains no locks, and even the maximum depth of the stack is hard-coded: as of Go 1.19, it is 64[17]. All of these details give the profiler a more efficient and predictable overhead. Low and predictable overhead is key to a production-ready profiler.
Is the profiler's overhead constant by design? Well, it depends. Let me explain why. In a single profiler interrupt, the following happens: a random running goroutine is interrupted to run the SIGPROF handler, its stack is walked, and the Go runtime saves the resulting stack trace into the lock-free ring buffer before the goroutine resumes.
In theory, all of the above should run in constant time, since no memory allocation or lock contention takes place. And as far as I can remember, this is true: all of the above takes about 1 microsecond (on a typical CPU). In practice, however, things get worse. If you search for "Go CPU profiler overhead", you will see numbers ranging from 0.5% to 1-5% (I could find no public mention of this number backed by solid evidence). The reason behind this is mostly related to how CPUs work. Modern CPUs are complex beasts and make aggressive use of caching. A typical single CPU core has three layers of cache: L1, L2, and L3. These caches are heavily utilized when specific CPU-intensive code is running, and cache utilization is especially high for applications that read lots of small (cache-fitting), sequentially stored data.
A good example is matrix multiplication: during matrix multiplication, the CPU accesses a large number of individual cells that are stored sequentially in memory. Such cache-friendly applications can incur the worst overhead for a sampling profiler. While it would be easy to run some benchmarks with perf to verify this claim, it is beyond the scope of this article.
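As a rough illustration of the kind of cache-friendly workload being described (and only that; this is not a rigorous benchmark of profiler overhead), here is a naive matrix multiplication one could time with and without pprof.StartCPUProfile enabled:

```go
package main

import (
	"fmt"
	"time"
)

const n = 512

// matmul is a naive matrix multiplication: a CPU-bound, cache-friendly
// workload that streams through sequentially stored data.
func matmul(a, b, c [][]float64) {
	for i := 0; i < n; i++ {
		for k := 0; k < n; k++ {
			aik := a[i][k]
			for j := 0; j < n; j++ {
				c[i][j] += aik * b[k][j]
			}
		}
	}
}

func newMatrix() [][]float64 {
	m := make([][]float64, n)
	for i := range m {
		m[i] = make([]float64, n)
		for j := range m[i] {
			m[i][j] = float64(i + j)
		}
	}
	return m
}

func main() {
	a, b, c := newMatrix(), newMatrix(), newMatrix()

	// Run this once as-is and once wrapped in pprof.StartCPUProfile /
	// pprof.StopCPUProfile, then compare the wall-clock times.
	start := time.Now()
	matmul(a, b, c)
	fmt.Println("matmul took", time.Since(start))
}
```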
To sum up, the Go runtime does a good job of keeping the profiler's overhead as predictable and low as possible. If you don't believe me and still have doubts, perhaps the discussion[18] quoted below will convince you:
At Google, we continuously profile Go production services, and it is safe to do so.
A code commit from Datadog's continuous profiler[19] also confirms this:
After testing this default on many high-volume internal workloads (with the profiler enabled by default), we have determined that this default is safe for production.
Finally, based on the theory above, we can make the following observation:
On typical I/O-bound applications, the profiler's overhead will be minimal.
This is because when there are many sleeping or idle goroutines, CPU cache misses don't make much of a difference. We observed this over and over during Go CPU profiler benchmarks: on a typical I/O-bound application, the overhead is practically zero (or statistically insignificant). Again, providing empirical evidence is beyond the scope of this article, but one way to see it is to observe the throughput of a Go web application under load with the profiler turned on and off.
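For example, one could load-test the handler in the sketch below twice, once as-is and once while a CPU profile is being collected, and compare the measured throughput (a minimal sketch using the standard net/http/pprof package; the port and handler are arbitrary):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on DefaultServeMux
)

func main() {
	http.HandleFunc("/work", func(w http.ResponseWriter, r *http.Request) {
		// Placeholder for a typical I/O-bound handler (DB call, RPC, etc.).
		w.Write([]byte("ok"))
	})

	// While load-testing /work, collect a 30-second CPU profile with:
	//   go tool pprof http://localhost:8080/debug/pprof/profile
	// and compare throughput against a run where no profile is collected.
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```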
One of the things I like about this design is that it shows how well you can optimize your code once you understand the access patterns of the underlying data structures. In this case, the Go runtime uses a lock-free structure even though, in most situations, that would be complete overkill. The Go runtime is full of clever optimizations like this, and I highly recommend taking the time to dig into the parts you find interesting.
I hope you enjoyed it!
Original article: Inside the Go CPU profiler[20], by Sümer Cip. I guessed that Felix Geisendörfer would comment on it, and sure enough, he did.
Translator: Bird’s Nest.
The Busy Developer’s Guide to Go Profiling, Tracing and Observability: https://github.com/DataDog/go-profiler-notes/blob/main/guide/README.md
Detailed article: https://easyperf.net/blog/2018/06/01/PMU-counters-and-profiling-basics
py-spy: https://github.com/benfred/py-spy
rbspy: https://github.com/rbspy/rbspy
pyflame: https://github.com/uber-archive/pyflame
Parca: https://parca.dev/
eBPF: https://ebpf.io/
Here: https://www.datadoghq.com/blog/engineering/profiling-improvements-in-go-1-18/
fgprof: https://github.com/felixge/fgprof
profiler tags: https://rakyll.org/profiler-labels/
lock-free: https://preshing.com/20120612/an-introduction-to-lock-free-programming/
runtime/profbuf.go: https://github.com/golang/go/blob/master/src/runtime/profbuf.go
Introduction: http://www.cse.cuhk.edu.hk/~pclee/www/pubs/ancs09poster.pdf
runtime/pprof/map.go: https://github.com/golang/go/blob/master/src/runtime/pprof/map.go
build(): https://github.com/golang/go/blob/aa4299735b78189eeac1e2c4edafb9d014cc62d7/src/runtime/pprof/proto.go#L348
pprof.Profile: https://pkg.go.dev/runtime/pprof#Profile
64: https://github.com/golang/go/blob/54cf1b107d24e135990314b56b02264dba8620fc/src/runtime/cpuprof.go#L22
Discussion: https://groups.google.com/g/golang-nuts/c/e6lB8ENbIw8/m/azeTCGj7AgAJ
Code commit: https://github.com/DataDog/dd-trace-go/commit/54604a13335b9c5e4ac18a898e4d5971b6b6fc8c
Inside the Go CPU profiler: https://sumercip.com/posts/inside-the-go-cpu-profiler/