BPF iterators enable high-performance in-core data retrieval and aggregation. In this blog post, we talk about the motivation behind developing the BPF Iterator tool and show how to use it to flexibly and efficiently traverse kernel data from user space.
There are a few existing methods that can copy kernel data to user space. The most popular is to print all tcp6 or netlink socket information in the system via the /proc system, for example by using the “cat /proc/net/tcp6” or “cat /proc/net/netlink” commands. However, this way the output format is often fixed, and if the user wants to get more information about these sockets, it must be implemented by patching the kernel, which will involve upstream and release, often taking a long time. The same is true for popular tools like ss, any additional information that needs to be modified to the kernel commit patch.
The drgn[1] tool solves this problem by printing kernel data without modifying the kernel. However, the main drawback of drgn is performance, and pointer tracing is also not possible within the kernel. In addition, if the pointer becomes invalid within the kernel, drgn may produce incorrect results.
BPF iterators can be used to solve these related problems, providing the flexibility to make one-time modifications to specific data structures in the kernel and allowing pointer tracing within the kernel. Flexibility is achieved through the use of BPF programs, and correctness is achieved by implementing pointer tracking within the kernel and ensuring that pointer tracking is valid through proper reference count[2] or lock protection. In its current state, iterators change only a small part of the data structures in the kernel.
The BPF selftests[3] directory in kernel code provides a good example of using BPF iterators in user space. Typically, you need to implement a BPF program first.
Here are a few examples of BPF programs in selftest:
Here, let’s use the bpf_iter_task_file.c file as an example to iterate over information about files opened in system tasks:
In the example above, the SEC (“iter/task_file”) field indicates that the program is a BPF iterator program that can be used to iterate over all files for all tasks. The context of the program is bpf_iter__task_file. You can find the definition of the bpf_iter__task_file struct in vmlinux.h:
In the code above, the field variable name meta represents the metadata and is the same for all BPF iterator programs. The remaining fields depend on the different iterators. For example, for task_file iterators, the kernel layer provides task, fd, and file-related fields. Task and file are based on the application count [7], so they do not disappear when the BPF program runs.
After writing the BPF iterator program, we also need to write a user-space part of the code that triggers the BPF program to run and collect data. bpf_iter.c[8] in the selftest directory provides an example of writing a corresponding userspace part. The following illustrates a typical order:
BPF iterators use the kernel’s seq_file to pass data to user space. The data can be a formatted string or raw data. In the case of formatting strings, you can use the bpftool iter[9] subcommand to create and fix a BPF iterator to the path of the BPF file system (bpffs)[10] by bpf_link. You can then use cat
For example, you can use the following command to output a BPF program from a bpf_iter_ipv6_route.o object file to the file /sys/fs/bpf/my_route.
Then print out the result with the following command:
In order to implement a BPF iterator in the kernel, developers must fill in the following key data structures defined in the bpf.h[11] file.
After the data structure field is set, then call bpf_iter_reg_target() to register the iterator with the main BPF iterator subsystem.
Here’s an explanation of each field in Structure bpf_iter_reg:
Click here[12] to see the implementation of task_vma BPF iterator in the kernel.
The following is a list of the latest BPF iterators available in the upstream kernel, grouped by BPF program part name:
The test program for the iterator is described in the bpf_iter.c[13] file.
The table has been tweaked, instructions and code implementations have been added, bpf_link and ksym iterators have been added.
Note: All iterations of the implementation call the bpf_iter_reg_target function registration, which can be quickly found by searching the source code for the function. As of kernel version 5.15, 13 iterators are implemented, of which bpf_prog/bpf_map are BPF pre-implemented and loaded iterators (see the file kernel/bpf/preload/iterators/iterators.bpf.c), followed by iter/bpf_link (Linux 5.19)[52], iter/ksym (Linux 6.0)[53], and so on.
In Meta, we use BPF task_file iterators based on the bpftool tool to display process numbers that reference a specific BPF program/map/link.
The output shown by sudo bpftool prog is as follows:
We also developed tools based on bpf_sk_storage and task_iter iterators fbflow and dyno, respectively. The task_iter in dyno, implemented based on task_iter iterators, has a significant improvement in performance compared to the old way of netlink-based taskstats[54] for all tasks.
There is an upstream discussion of implementing a BPF iterator for bpf_links [Note: The 5.19 kernel has been incorporated into bpf: Add bpf_link iterator[55]]. We’ve also seen someone implement a BPF iterator for mounts (not yet upstream). As more use cases are discovered, we expect more users to implement BPF iterators in the kernel.
Original address: https://developers.secure.facebook.com/blog/post/2022/03/31/bpf-iterator-retrieving-kernel-data-with-flexibility-and-efficiency/
Author:Yonghong Song[56]
Time: March 31, 2022
drgn: https://developers.facebook.com/blog/post/2021/12/09/drgn-how-linux-kernel-team-meta-debugs-kernel-scale/
Reference count: https://facebookmicrosites.github.io/bpf/blog/2018/08/31/object-lifetime.html#bpffs
selftests: https://www.kernel.org/doc/html/latest/dev-tools/kselftest.html
bpf_iter_tcp4.c: https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/progs/bpf_iter_tcp4.c
bpf_iter_task_vma.c: https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/progs/bpf_iter_task_vma.c
bpf_iter_task_file.c: https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/progs/bpf_iter_task_file.c
App count-based: https://facebookmicrosites.github.io/bpf/blog/2018/08/31/object-lifetime.html#bpffs
bpf_iter.c: https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
bpftool iter: https://www.mankier.com/8/bpftool-iter
BPF File System (bpffs): https://facebookmicrosites.github.io/bpf/blog/2018/08/31/object-lifetime.html#bpffs
bpf.h: https://github.com/torvalds/linux/blob/master/include/linux/bpf.h
Click here: https://lore.kernel.org/bpf/20210212183107.50963-2-songliubraving@fb.com/
bpf_iter.c: https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
kernel/bpf/prog_iter.c: https://github.com/torvalds/linux/blob/master/kernel/bpf/prog_iter.c
iterators.bpf.c: https://github.com/torvalds/linux/blob/master/kernel/bpf/preload/iterators/iterators.bpf.c
kernel/bpf/map_iter.c: https://github.com/torvalds/linux/blob/master/kernel/bpf/map_iter.c
bpf_iter_bpf_map.c: https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/progs/bpf_iter_bpf_map.c
kernel/bpf/map_iter.c: https://github.com/torvalds/linux/blob/master/kernel/bpf/map_iter.c
bpf_iter_bpf_hash_map.c: https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/progs/bpf_iter_bpf_hash_map.c
bpf_iter_bpf_array_map.c: https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/progs/bpf_iter_bpf_array_map.c
bpf_iter_bpf_percpu_hash_map.c: https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/progs/bpf_iter_bpf_percpu_hash_map.c
See [Submit: https://lwn.net/ml/linux-fsdevel/20211201042333.2035153-4-memxor@gmail.com/
net/core/bpf_sk_storage.c: https://github.com/torvalds/linux/blob/master/net/core/bpf_sk_storage.c
bpf_iter_bpf_sk_storage_map.c: https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/progs/bpf_iter_bpf_sk_storage_map.c
kernel/bpf/task_iter.c: https://github.com/torvalds/linux/blob/master/kernel/bpf/task_iter.c
bpf_iter_task.c: https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/progs/bpf_iter_task.c
bpf_iter_task_stack.c: https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/progs/bpf_iter_task_stack.c
kernel/bpf/task_iter.c: https://github.com/torvalds/linux/blob/master/kernel/bpf/task_iter.c
bpf_iter_task_file.c: https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/progs/bpf_iter_task_file.c
kernel/bpf/task_iter.c: https://github.com/torvalds/linux/blob/master/kernel/bpf/task_iter.c
bpf_iter_task_vma.c: https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/progs/bpf_iter_task_vma.c
net/ipv4/tcp_ipv4.c: https://github.com/torvalds/linux/blob/master/net/ipv4/tcp_ipv4.c
progs/bpf_iter_tcp4.c: https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/progs/bpf_iter_tcp4.c
progs/bpf_iter_tcp6.c: https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/progs/bpf_iter_tcp6.c
net/ipv4/udp.c: https://github.com/torvalds/linux/blob/master/net/ipv4/udp.c
bpf_iter_udp4.c: https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/progs/bpf_iter_udp4.c
bpf_iter_udp6.c: https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/progs/bpf_iter_udp6.c
net/unix/af_unix.c: https://github.com/torvalds/linux/blob/master/net/unix/af_unix.c
bpf_iter_unix.c: https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/progs/bpf_iter_unix.c
net/netlink/af_netlink.c: https://github.com/torvalds/linux/blob/master/net/netlink/af_netlink.c
bpf_iter_netlink.c: https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/progs/bpf_iter_netlink.c
net/ipv6/route.c: https://github.com/torvalds/linux/blob/master/net/ipv6/route.c
bpf_iter_ipv6_route.c: https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/progs/bpf_iter_ipv6_route.c
net/core/sock_map.c: https://github.com/torvalds/linux/blob/master/net/core/sock_map.c
bpf_iter_sockmap.c: https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/progs/bpf_iter_sockmap.c
kernel/bpf/link_iter.c: https://github.com/torvalds/linux/blob/master/kernel/bpf/link_iter.c
commit: https://github.com/torvalds/linux/commit/9f88361273082825d9f0d13a543d49f9fa0d44a8
bpf_iter_bpf_link.c: https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/progs/bpf_iter_bpf_link.c
kernel/kallsyms.c: https://github.com/torvalds/linux/blob/3bc1bc0b59d04e997db25b84babf459ca1cd80b7/kernel/kallsyms.c
commit: https://github.com/torvalds/linux/commit/647cafa22349026a8435030e9157074ab7fe5710#diff-9538b26d3e082f233e6adac664cd2c14cbf2d510d5d7f286eef329c58de87ead
bpf_iter_ksym.c: https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/progs/bpf_iter_ksym.c
iter/bpf_link(Linux 5.19): https://github.com/torvalds/linux/commit/9f88361273082825d9f0d13a543d49f9fa0d44a8
iter/ksym(Linux 6.0): https://github.com/torvalds/linux/commit/647cafa22349026a8435030e9157074ab7fe5710
Netlink-based taskstats: https://www.kernel.org/doc/Documentation/accounting/taskstats.txt
bpf: Add bpf_link iterator: https://github.com/torvalds/linux/commit/9f88361273082825d9f0d13a543d49f9fa0d44a8
Yonghong Song: https://www.facebook.com/yonghong.song.583