Gaining an in-depth understanding of the Page Cache

On long-running Linux servers, free memory usually shrinks over time, giving the impression that Linux is especially good at "eating" memory; someone has even built a website to explain this phenomenon. In fact, the Linux kernel caches the contents of accessed files as aggressively as possible to bridge the huge latency gap between disk and memory. The memory that caches file contents is the Page Cache.

Google's Jeffrey Dean compiled the well-known "Latency numbers every programmer should know", which notes that reading 1 MB sequentially from disk takes about 80 times longer than reading it from memory, and even an SSD still has about 4 times the latency of memory.

I ran an experiment on my machine to see the Page Cache in action. First, generate a 1 GB file:

# dd if=/dev/zero of=/root/dd.out bs=4096 count=262144


Then drop the page cache:

# sync && echo 3 > /proc/sys/vm/drop_caches


Time the first read of the file:

# time cat /root/dd.out &> /dev/null
real    0m2.097s
user    0m0.010s
sys     0m0.638s


Reading the same file again is much faster, because the system has already put the file contents into the Page Cache:

# time cat /root/dd.out &> /dev/null
real    0m0.186s
user    0m0.004s
sys     0m0.182s


The Page Cache not only accelerates access to file contents; the kernel also builds a Page Cache for shared libraries, which can then be shared among multiple processes, avoiding each process loading its own copy and wasting precious memory.

What is a Page Cache

The Page Cache is kernel-managed memory that sits between the VFS (Virtual File System) layer and the concrete file system implementations (e.g. ext4, ext3). An application process uses file operations such as read and write, entering the VFS layer through system calls. Depending on the O_DIRECT flag, it either uses the Page Cache as a cache of file contents or skips it, bypassing the caching functionality provided by the kernel.

In addition, applications can use mmap to map file contents into the process's virtual address space and then read and write the on-disk file directly, as if reading and writing memory. In that case, the process's virtual memory is mapped directly onto the Page Cache.
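To make this concrete, here is a minimal userspace sketch (the helper name map_and_check is mine, and the path used is illustrative): it writes a file, maps it with mmap, and reads the contents directly through the mapping, which is backed by Page Cache pages.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Write `data` to `path`, then read it back through an mmap'ed view.
 * The pages backing the mapping are the file's Page Cache pages.
 * Returns 1 if the mapped contents match what was written. */
static int map_and_check(const char *path, const char *data)
{
    size_t len = strlen(data);

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return 0;
    if (write(fd, data, len) != (ssize_t)len) { close(fd); return 0; }

    char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                         /* mapping survives the close */
    if (p == MAP_FAILED) return 0;

    int ok = (memcmp(p, data, len) == 0);  /* file contents as memory */
    munmap(p, len);
    return ok;
}
```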

To understand how the kernel manages page caches, let’s first look at a few core objects of VFS:

  • file stores information about an opened file and is the interface through which a process accesses the file;
  • dentry organizes files into the directory tree structure;
  • inode uniquely identifies a file in the file system; for the same file, there is only one inode structure in the kernel.

For each process, every open file has a file descriptor. The process data structure task_struct in the kernel has a files field of type files_struct that holds all files opened by the process. The fd_array field of files_struct is the file array: the array index is the file descriptor, and each entry points to a file structure representing a file opened by the process. A file is associated with the process that opened it; if multiple processes open the same file, each process has its own file, but all of these files point to the same inode.
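The file/inode relationship can be observed from userspace: opening the same path twice yields two distinct file descriptors (two independent struct file instances in the kernel), yet fstat reports the same inode number for both. A sketch (the helper name is mine):

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Open `path` twice and compare the inode numbers from fstat.
 * Two opens create two independent kernel `struct file`s, but
 * both point at the same inode. Returns 1 if inodes match. */
static int opens_share_inode(const char *path)
{
    int fd1 = open(path, O_RDONLY);
    int fd2 = open(path, O_RDONLY);
    if (fd1 < 0 || fd2 < 0) return 0;

    struct stat st1, st2;
    int ok = fstat(fd1, &st1) == 0 && fstat(fd2, &st2) == 0 &&
             st1.st_ino == st2.st_ino && fd1 != fd2;

    close(fd1);
    close(fd2);
    return ok;
}
```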


The inode tracks which parts of a file have been loaded into memory, known as the Page Cache, through its address_space. The i_pages field of address_space points to an xarray tree on which the Page Cache pages belonging to this file are hung. When we access file contents, given the file and the corresponding page offset, we can quickly determine through the xarray tree whether the page is already in the Page Cache. If the page exists, the file contents have already been read into memory, that is, into the Page Cache; if not, the contents are not in the Page Cache and must be read from disk.

Since file and inode correspond one-to-one, we can think of the inode as the host of the Page Cache, and the kernel manages and maintains the Page Cache through the tree pointed to by inode->i_mapping->i_pages.

How is a Page Cache generated and released, and how is it associated with a process? We need to understand the process virtual address space first.

The process virtual address space

Linux is a multitasking system that supports concurrent execution of multiple processes. The operating system and the CPU work together to create the illusion that each process has its own contiguous virtual memory space, with the address spaces of processes completely isolated from one another, so that processes are unaware of each other's existence. From a process's point of view, it considers itself the only process in the system.

A process sees only virtual memory addresses and cannot access physical addresses directly. When a process accesses a virtual address, the kernel translates it into a physical memory address, completing the mapping from virtual address to physical address. This way, even if different processes access the same virtual address while running, the kernel maps them to different physical addresses, so there is no conflict.

Processes are described by task_struct in the Linux kernel. task_struct is probably the first data structure you become familiar with when learning the kernel, because it is so important. It describes everything related to a process: process state, runtime statistics, process affinity, scheduling information, signal handling, memory management, files opened by the process, and so on. The process virtual memory space we care about here is described by the mm_struct pointed to by the mm field of task_struct, which summarizes a process's memory at runtime.

A process's virtual address space is linear and is described using struct vm_area_struct. The kernel manages each contiguous range of memory with the same properties as one vm_area_struct; the ranges do not overlap. Inside mm_struct there is a singly-linked list (mmap) that strings the vm_area_structs together, and a red-black tree (mm_rb) on which vm_area_structs are keyed by their starting address. The red-black tree makes it fast to find a memory area by address.

A vm_area_struct can map directly to physical memory or can be associated with a file. If a vm_area_struct is a file mapping, its vm_file member points to the corresponding file. A vm_area_struct with no associated file is anonymous memory.
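Each vm_area_struct is visible from userspace as one line of /proc/self/maps: file-backed regions show the mapped path, while anonymous regions show no path (or a label such as [heap] or [stack]). A small sketch (the helper name maps_contains is mine) that scans those lines:

```c
#include <stdio.h>
#include <string.h>

/* Scan /proc/self/maps (one line per vm_area_struct) and report
 * whether any mapping line contains `needle`. Returns 1 if found. */
static int maps_contains(const char *needle)
{
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f) return 0;

    char line[512];
    int found = 0;
    while (fgets(line, sizeof(line), f))
        if (strstr(line, needle)) { found = 1; break; }

    fclose(f);
    return found;
}
```

Every Linux process has a stack region, so looking for the "[stack]" label should always succeed.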

When developers allocate memory with glibc library functions such as malloc, physical memory is not allocated immediately. Instead, a piece of virtual memory is reserved in the process's virtual memory space: the corresponding vm_area_struct is created, inserted into the mmap linked list of mm_struct, and hung on the mm_rb red-black tree, and the work is done; no physical memory allocation is involved at all. Only when this virtual memory is first read or written does the kernel discover that the region is not mapped to physical memory; that triggers a page fault, and the kernel then fills in the page table to complete the mapping from virtual to physical memory.

When developers use mmap for file mapping, the kernel loads the file contents from disk into physical memory, that is, into the Page Cache, according to the file mapping relationship recorded in the vm_area_struct (its vm_file member), and finally establishes the mapping from this virtual address range to physical addresses.

In addition, pages that are contiguous in virtual memory need not be contiguous in physical memory; as long as the mapping from virtual pages to physical pages is maintained, memory can be used correctly. Because each process has a separate address space, each process must have its own page table to map virtual addresses to physical addresses. In a real process, virtual memory usually occupies a few contiguous regions of the address space rather than completely scattered random addresses. Based on this property, the kernel uses multi-level page tables to store the mapping, which greatly reduces the space occupied by the page tables themselves. The top-level page table is saved in the pgd field of mm_struct.

Well, we have a basic understanding of the process virtual address space, let’s take a look at the generation and release of the Page Cache, and how it relates to the process space.

Page Cache generation and release

Page Cache pages are produced in two different ways:

  • Buffered I/O
  • Memory-Mapped file

When using these two methods to access files on disk, the kernel will determine whether the file content is already in the Page Cache based on the specified file and the corresponding page offset, and if the content does not exist, it needs to read from disk and create a Page Cache page.

The difference between the two approaches is that with Buffered I/O, the kernel must first copy data from the Page Cache into a user buffer before the application can read it. With a Memory-Mapped file, the Page Cache page is mapped directly into the process's virtual address space, and the application reads and writes the Page Cache contents directly. Since one copy is eliminated, a Memory-Mapped file is more efficient than Buffered I/O.
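That both paths go through the same Page Cache page can be demonstrated: a write through a MAP_SHARED mapping dirties the Page Cache page itself, so a subsequent read() of the file, which copies from that same page, sees the change immediately with no msync needed for visibility. A sketch (helper name and path are mine):

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create a file, modify it through a MAP_SHARED mapping, and check
 * that read() immediately observes the change: the mapping and the
 * read() path both go through the same Page Cache page. */
static int mmap_write_visible_via_read(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return 0;
    if (write(fd, "aaaa", 4) != 4) { close(fd); return 0; }

    char *p = mmap(NULL, 4, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { close(fd); return 0; }
    p[0] = 'Z';                       /* dirty the Page Cache page */

    char buf[4] = {0};
    int ok = pread(fd, buf, 4, 0) == 4 && buf[0] == 'Z';

    munmap(p, 4);
    close(fd);
    return ok;
}
```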

As server uptime increases, free memory in the system shrinks, with much of it consumed by the Page Cache. If every accessed file stayed cached forever, memory would eventually be exhausted, so when is the Page Cache reclaimed? The kernel treats the Page Cache as reclaimable memory: when an application requests memory and there is not enough free memory, the kernel first reclaims Page Cache pages and then retries the allocation. There are two main reclaim paths: direct reclaim and background reclaim.

With Buffered I/O, the Page Cache is not directly associated with the process's virtual memory space; data transits through a user buffer. The Memory-Mapped file approach is more efficient and looks simpler, but the implementation behind it is somewhat involved. Let's look at how the kernel implements it.

Memory file mapping

As we mentioned earlier, the inode is the host of the Page Cache, and the kernel manages and maintains the Page Cache through the tree pointed to by inode->i_mapping->i_pages. So how does the kernel complete a memory file mapping, directly mapping the Page Cache pages that hold the file contents into the process's virtual memory space?

We know that the mm field in a process's task_struct points to the descriptor of the process's virtual address space, mm_struct; that a piece of virtual memory is described by struct vm_area_struct; and that the mmap linked list stringing the vm_area_structs together represents the virtual memory that has already been allocated.

If it is a memory file mapping, the vm_file member of the vm_area_struct that maps the file points to the mapped file structure file. file represents a file opened by a process; its f_mapping member points to the address_space that manages the file's Page Cache.

When the virtual memory area of a file mapping is accessed for the first time and is not yet mapped to physical memory, a page fault is triggered. While handling the page fault, the kernel finds that the vm_area_struct representing this virtual memory has an associated file, that is, its vm_file field points to a file structure. The kernel takes the file's address_space and searches the xarray tree pointed to by address_space->i_pages based on the page offset of the content being accessed. If nothing is found, the file content has not yet been loaded into memory: a memory page is allocated, the file content is loaded into it, and the page is hung on the xarray tree. The next time the same page offset is accessed, the contents are already on the tree and can be returned directly. The tree pointed to by address_space->i_pages is the kernel-managed Page Cache.

After loading the file contents into the Page Cache, the kernel can fill in the process's page table entries, mapping the virtual address area of the file mapping directly onto the Page Cache pages, which completes the handling of the page fault.

When memory is tight and requires the Page Cache to be reclaimed, the kernel needs to know which processes the Page Cache pages are mapped to so that it can modify the process’s page tables and unmap virtual and physical memory. We know that the same file can be mapped to multiple process spaces, so you need to save the reverse mapping relationship, that is, find the process according to the Page Cache page.

The reverse mapping of Page Cache pages is saved in another tree maintained by address_space, pointed to by its i_mmap field. address_space->i_mmap is a Priority Search Tree on which hang the vm_area_structs associated with the Page Cache pages of this file; each of these vm_area_structs points to its own process space descriptor mm_struct, thus establishing the connection from a Page Cache page back to the processes.

When a Page Cache page needs to be unmapped, the kernel uses the tree pointed to by address_space->i_mmap to find which vm_area_structs of which processes map the page, and thereby determines which process page table entries must be modified.

To summarize briefly, the address_space corresponding to a file mainly manages two trees: the xarray tree pointed to by i_pages, which holds all of the file's Page Cache pages; and the Priority Search Tree pointed to by i_mmap, which holds the vm_area_structs of the virtual memory areas formed by mapping the file and is used, when a Page Cache page is released, to find the processes that mapped it. If the file is not mapped into any process space, the i_mmap tree is empty.

Observing the Page Cache

You can view various Page Cache-related metrics in /proc/meminfo.

/proc is a pseudo file system; Linux uses it to make system and kernel information available in user space. The memory information shown by commands such as free and vmstat actually comes from /proc/meminfo.

let’s look at an example:

$ cat /proc/meminfo
MemTotal:        8052564 kB
MemFree:          129804 kB
MemAvailable:    4956164 kB
Buffers:          175932 kB
Cached:          4896824 kB
SwapCached:           40 kB
Active:          2748728 kB  <- Active(anon) + Active(file)
Inactive:        4676540 kB  <- Inactive(anon) + Inactive(file)
Active(anon):       3432 kB
Inactive(anon):  2513172 kB
Active(file):    2745296 kB
Inactive(file):  2163368 kB
Unevictable:       65496 kB
Mlocked:               0 kB
SwapTotal:       2097148 kB
SwapFree:        2095868 kB
Dirty:                12 kB
Writeback:             0 kB
AnonPages:       2411440 kB
Mapped:           761076 kB
Shmem:            170868 kB
...


For a detailed explanation of each field, see the Linux kernel documentation, "The /proc Filesystem". Let's focus on the /proc/meminfo fields related to the Page Cache.

The current system Page Cache is equal to the sum of Buffers + Cached:

Buffers + Cached = 5072756 kB


As discussed earlier, if a vm_area_struct is associated with a file, that memory area is File-backed memory, while a vm_area_struct without an associated file is anonymous memory. Can we then assume that the total File-backed memory associated with disk files should equal the Page Cache?

Active(file) + Inactive(file) = 4908664 kB


Not quite: the sum falls a bit short, and the missing part is shared memory (Shmem).

Linux implements the "shared memory" feature, where multiple processes share the contents of the same memory, with the help of a virtual file system. A virtual file is not a file that really exists on disk; it is simply simulated by the kernel. But a virtual file still has its own inode and address_space structures. When the kernel creates a shared anonymous mapping, it creates a virtual file and associates this file with the vm_area_struct, so that the vm_area_structs of multiple processes are associated with the same virtual file and ultimately with the same physical memory pages, realizing the shared memory function. This is how shared memory (Shmem) is implemented.

Because Shmem has no file associated with the disk, it is not part of File-backed memory but is instead recorded under anonymous memory (Active(anon) or Inactive(anon)). But because Shmem has its own inode, the Page Cache pages it maintains are hung on the xarray tree pointed to by its inode's address_space->i_pages, so Shmem memory should also be counted in the Page Cache.
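The accounting discussed here can be checked programmatically. The sketch below (the helper name meminfo_kb is mine) parses /proc/meminfo and returns a named field in kB, which lets you compute Buffers + Cached and compare it against Active(file) + Inactive(file) + Shmem yourself:

```c
#include <stdio.h>
#include <string.h>

/* Return the value (in kB) of a /proc/meminfo field such as
 * "Cached" or "Shmem", or -1 if the field is not found. */
static long meminfo_kb(const char *field)
{
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) return -1;

    char name[64];
    long kb, result = -1;
    while (fscanf(f, "%63s %ld kB", name, &kb) == 2) {
        size_t n = strlen(name);          /* name has a trailing ':' */
        if (n && name[n - 1] == ':') name[n - 1] = '\0';
        if (strcmp(name, field) == 0) { result = kb; break; }
    }
    fclose(f);
    return result;
}
```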

In addition, File-backed memory is split into Active and Inactive. Memory holding recently used data is considered Active, while memory holding data that has not been used for a long time is considered Inactive. When physical memory is insufficient and in-use memory has to be freed, Inactive memory is freed first.

The relationship between the Page Cache, anonymous memory, and File-backed memory can be summed up as: Buffers + Cached is roughly equal to Active(file) + Inactive(file) + Shmem. Although small discrepancies are inevitable, this relation generally holds.

Notably, AnonPages != Active(anon) + Inactive(anon). Active(anon) and Inactive(anon) represent memory that is not reclaimable but can be swapped out to a swap partition, while AnonPages refers to memory without a corresponding file; the two look at memory from different angles. Although Shmem is counted under Active(anon) or Inactive(anon), Shmem has a corresponding virtual file in memory, so it does not belong to AnonPages.

In summary, the Page Cache always has an associated file, whether a real disk file or a virtual file. AnonPages has no associated file. Shmem is associated with a virtual file: it is counted under Active(anon) or Inactive(anon), and it is also counted in the Page Cache.

If we want to know how much of a specific file is cached in the Page Cache, we can use the fincore command. For example:

$ fincore /usr/lib/x86_64-linux-gnu/
 RES PAGES SIZE FILE
2.1M   542 2.1M /usr/lib/x86_64-linux-gnu/


RES is the amount of physical memory occupied by the file contents that have been loaded into memory; PAGES is the number of memory pages consumed by the file contents. In the example above, the entire file has been loaded into the Page Cache.

Combined with the lsof command, we can see how much Page Cache is consumed by the files opened by a particular process:

$ sudo lsof -p 1270 | grep REG | awk '{print $9}' | xargs sudo fincore
  RES PAGES   SIZE FILE
64.8M 16580  89.9M /usr/bin/dockerd
  32K     8    32K /var/lib/docker/buildkit/cache.db
  16K     4    16K /var/lib/docker/buildkit/metadata_v2.db
  16K     4    16K /var/lib/docker/buildkit/snapshots.db
  16K     4    16K /var/lib/docker/buildkit/containerdmeta.db
 284K    71 282.4K /usr/lib/x86_64-linux-gnu/
 244K    61 594.7K /usr/lib/x86_64-linux-gnu/
 156K    39 154.2K /usr/lib/x86_64-linux-gnu/
  24K     6  20.6K /usr/lib/x86_64-linux-gnu/
 908K   227 906.5K /usr/lib/x86_64-linux-gnu/
...


In addition, the cache hit ratio is a very important metric for any kind of cache. We can use bcc's built-in tool cachestat to trace the Page Cache hit ratio of the entire system:

$ sudo cachestat-bpfcc
   HITS   MISSES  DIRTIES HITRATIO   BUFFERS_MB  CACHED_MB
   2059        0       32  100.00%           74       1492
    522        0        0  100.00%           74       1492
     32        0        7  100.00%           74       1492
    135        0       69  100.00%           74       1492
     97        1        3   98.98%           74       1492
    512        0       82  100.00%           74       1492
    303        0       86  100.00%           74       1492
   2474        7     1028   99.72%           74       1494
    815        0      964  100.00%           74       1497
   2786        0        1  100.00%           74       1497
   1051        0        0  100.00%           74       1497
^C
    502        0        0  100.00%           74       1497
Detaching...


Use cachetop to trace Page Cache hit ratios per process:

$ sudo cachetop-bpfcc
14:20:41 Buffers MB: 86 / Cached MB: 2834 / Sort: HITS / Order: descending
PID      UID      CMD              HITS     MISSES   DIRTIES  READ_HIT%  WRITE_HIT%
14237    mazhen   java             12823    4594     3653     52.6%      13.2%
14370    mazhen   ldd              869      0        0        100.0%     0.0%
14371    mazhen   grep             596      0        0        100.0%     0.0%
14376    mazhen   ldd              536      0        0        100.0%     0.0%
14369    mazhen   env              468      0        0        100.0%     0.0%
14377    mazhen   ldd              467      0        0        100.0%     0.0%
14551    mazhen   grpc-default-ex  466      0        0        100.0%     0.0%
14375    mazhen   ldd              435      0        0        100.0%     0.0%
14479    mazhen   ldconfig         421      0        0        100.0%     0.0%
14475    mazhen   BookieJournal-3  417      58       132      60.0%      6.1%
...


The mmap system call

The mmap system call is the most important memory management interface. mmap can create file mappings, which produce Page Cache; it can also be used to request heap memory. The malloc provided by glibc uses the mmap system call internally. Since allocating memory through the mmap system call is relatively expensive, malloc first requests a relatively large chunk of memory from the operating system and then maximizes allocation efficiency through various optimizations.

Depending on its parameters, mmap mappings can be combined along two dimensions: whether the mapping is backed by a file, and whether it is private or shared:

  • private anonymous mapping

When mmap(MAP_ANON | MAP_PRIVATE) is called, the kernel only needs to allocate a range in the process's virtual memory space and create the corresponding vm_area_struct, and the call returns. When this virtual memory is later accessed, a page fault occurs because it is not yet mapped to physical memory. The vm_area_struct's associated-file attribute is empty, so this is an anonymous mapping: the kernel allocates physical memory and then establishes the virtual-to-physical mapping in the page table.

  • private file mapping

Processes request memory this way with mmap(MAP_FILE | MAP_PRIVATE), for example to map shared libraries and the text segments of executable files into their own address space.


The read-only pages of a private file mapping are shared among multiple processes; writable pages get a separate copy per process, and the copy is created on write (copy-on-write).

  • shared file mapping

Processes request memory this way with mmap(MAP_FILE | MAP_SHARED). Building on private file mapping, shared file mapping is simple: writable pages are simply not copied on write. As a result, whether reading or writing, multiple processes always access the same physical pages of the same file.

  • shared anonymous mapping

Processes request memory this way with mmap(MAP_ANON | MAP_SHARED). Through the virtual file system, the vm_area_structs of multiple processes are associated with the same virtual file and ultimately mapped to the same physical memory pages, enabling memory sharing between processes.

The relationship between the four mmap mapping types and the /proc/meminfo memory metrics described above:

Private anonymous mappings belong to AnonPages, while shared mappings all belong to the Page Cache. As discussed earlier, a shared anonymous mapping (Shmem) has no real disk file associated with it, but it is associated with a virtual file, so it too belongs to the Page Cache.

For a private file mapping, if the file is only read, the memory belongs to the Page Cache. If the process writes to it, because the region's properties are private, the kernel performs copy-on-write, creating a separate copy for the writing process, and that copy belongs to AnonPages.
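This copy-on-write behavior is easy to observe from userspace (helper name and path below are mine): writing through a MAP_PRIVATE mapping lands in the process's private anonymous copy, while the file, and its Page Cache page, remains unchanged.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Write through a MAP_PRIVATE file mapping, then verify via read()
 * that the underlying file is unchanged: the write went to a private
 * copy-on-write page, not to the shared Page Cache page. */
static int private_write_does_not_touch_file(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return 0;
    if (write(fd, "orig", 4) != 4) { close(fd); return 0; }

    char *p = mmap(NULL, 4, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { close(fd); return 0; }
    p[0] = 'X';                 /* triggers copy-on-write */

    char buf[4] = {0};
    int ok = pread(fd, buf, 4, 0) == 4 &&
             buf[0] == 'o' &&   /* file still has original contents */
             p[0] == 'X';       /* our private copy has the change   */

    munmap(p, 4);
    close(fd);
    return ok;
}
```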

Final thoughts

The Page Cache mechanism involves multiple kernel subsystems, including the process address space, the file system, and memory management; the Page Cache is like a thread stringing these parts together. A deep understanding of the Page Cache mechanism is therefore a great help in learning the kernel.