Let's Talk About One of the Reasons Why Kafka Is So Fast
A probing question about Kafka: why is it so fast? In other words, how does it achieve such high throughput and such low latency?
Many articles have already answered this question, but this one focuses on a single aspect: the use of the page cache. Let's first take a quick look at the page cache in Linux (and, along the way, the buffer cache).
Page Cache & Buffer Cache
If you execute the free command on a Linux machine, you will notice two columns named buffers and cached, as well as a line named "-/+ buffers/cache":
```
~ free -m
             total       used       free     shared    buffers     cached
Mem:        128956      96440      32515          0       5368      39900
-/+ buffers/cache:      51172      77784
Swap:        16002          0      16001
```
Here the cached column shows the current page cache usage, and the buffers column shows the current buffer cache usage. To sum it up in one sentence: **the page cache caches page data for files, while the buffer cache caches block data for block devices such as disks.** Pages are a logical concept, so the page cache sits at the same level as the file system; blocks are a physical concept, so the buffer cache sits at the same level as the block device driver.
The common purpose of the page cache and the buffer cache is to accelerate data I/O. When writing data, the kernel first writes to the cache and marks the written pages as dirty; the dirty pages are later flushed to external storage. This is the write-back cache policy (the alternative, write-through, is not used by Linux). When reading data, the cache is checked first: on a hit the data is served from memory, and on a miss it is read from external storage, with the newly read data also added to the cache. The operating system always aggressively uses all free memory as page cache and buffer cache, and when memory runs low it evicts cached pages using algorithms such as LRU.
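To see the read-side acceleration for yourself, here is a minimal Java sketch (my own illustration, not from the original article; the file path is supplied by the caller). It reads the same file twice: the first pass may have to go to disk, while the second usually hits the page cache, provided the file fits in free memory and has not been evicted in between:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class PageCacheDemo {
    // Read the whole file sequentially and return the elapsed time in ms.
    static long timedRead(Path file) throws IOException {
        long start = System.nanoTime();
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(1 << 20); // 1 MB chunks
            while (ch.read(buf) != -1) {
                buf.clear();
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws IOException {
        Path file = Path.of(args[0]); // ideally a file of a few hundred MB
        System.out.println("first read:  " + timedRead(file) + " ms");
        // The second read is typically much faster because the pages
        // are now resident in the page cache.
        System.out.println("second read: " + timedRead(file) + " ms");
    }
}
```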
Prior to kernel version 2.4, the page cache and the buffer cache were completely separate. However, block devices are mostly disks, and most data on disks is organized through a file system, which caused a lot of data to be cached twice and wasted memory. So from the 2.4 kernel onward, the two caches were largely unified: if a file's page is loaded into the page cache, the buffer cache only needs to maintain block pointers into that page. Only blocks that have no file representation, or that are manipulated directly while bypassing the file system (e.g. via the dd command), are actually placed in the buffer cache. Therefore, when we mention the page cache today, we basically mean both caches at once; this article will no longer distinguish between them and will refer to both collectively as the page cache.
The following figure shows an approximate page cache structure on a 32-bit Linux system, with a block size of 1 KB and a page size of 4 KB.
[figure: approximate page cache structure]

In the page cache, each file is organized as a radix tree (essentially a multi-way search tree), and each node of the tree is a page. Given an offset within the file, the page containing it can be located quickly, as shown in the figure below. The principle of radix trees can be found on the English Wikipedia and will not be detailed here.
[figure: locating a page in the radix tree by file offset]
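To make the lookup concrete, here is a toy Java sketch (my own illustration, not the kernel's actual code; the real kernel radix tree is considerably more sophisticated). Assuming 4 KB pages, the page index of a file offset is obtained by dropping the low 12 bits, and the tree consumes 6 bits of that index per level:

```java
// Toy model of a page-cache radix tree lookup (illustration only).
public class RadixTreeSketch {
    static final int BITS = 6;                 // 6 bits consumed per level
    static final int FANOUT = 1 << BITS;       // 64 slots per node

    static class Node {
        final Object[] slots = new Object[FANOUT]; // child Node or cached page
    }

    final Node root = new Node();
    final int height = 3;                      // covers 18 bits of page index

    // Walk from the most significant index chunk down to the leaf level.
    // (Insertion is omitted; a real tree also grows in height on demand.)
    Object lookup(long fileOffset) {
        long pageIndex = fileOffset >>> 12;    // 4 KB pages: drop low 12 bits
        Node node = root;
        for (int level = height - 1; level > 0; level--) {
            int slot = (int) ((pageIndex >>> (level * BITS)) & (FANOUT - 1));
            Object child = node.slots[slot];
            if (child == null) return null;    // page not in the cache
            node = (Node) child;
        }
        return node.slots[(int) (pageIndex & (FANOUT - 1))]; // page or null
    }
}
```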
With that background, we can bring Kafka into the picture.

Why does Kafka rely on the page cache instead of managing a cache itself? There are three reasons:
- Everything in the JVM is an object, and storing data as objects brings the so-called object overhead, wasting space;
- If the cache were managed by the JVM, it would be affected by GC, and an overly large heap would also drag down GC efficiency and reduce throughput;
- Once the program crashes, all cached data managed by the process itself is lost.
The relationship between Kafka's three major components (broker, producer, consumer) and the page cache can be represented by the following diagram.
[figure: Kafka components and the page cache]

When a producer sends messages, the broker uses the pwrite() system call [corresponding to the FileChannel.write() API in Java NIO] to write data at an offset, and the data is written to the page cache first. When a consumer fetches messages, the broker uses the sendfile() system call [corresponding to the FileChannel.transferTo() API] to transfer data from the page cache to the broker's socket buffer with zero copy, and from there out over the network.
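A minimal Java NIO sketch of these two calls (a hypothetical illustration, not Kafka's actual log code; the file name and port are made up, and whether transferTo() actually issues sendfile() is up to the JVM on the given platform):

```java
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class BrokerIoSketch {
    // Produce path: append the message at the current log end offset.
    // FileChannel.write(buf, pos) maps to pwrite(); bytes land in the page cache.
    static long append(FileChannel log, ByteBuffer msg, long endOffset) throws Exception {
        return endOffset + log.write(msg, endOffset);
    }

    // Consume path: FileChannel.transferTo() maps to sendfile(), copying
    // from the page cache to the socket without passing through user space.
    static long serve(FileChannel log, long offset, long count, SocketChannel socket)
            throws Exception {
        return log.transferTo(offset, count, socket);
    }

    public static void main(String[] args) throws Exception {
        try (FileChannel log = FileChannel.open(Path.of("00000000.log"),
                StandardOpenOption.CREATE, StandardOpenOption.READ,
                StandardOpenOption.WRITE);
             // Assumes some "consumer" is listening on this port.
             SocketChannel socket = SocketChannel.open(
                     new InetSocketAddress("localhost", 9999))) {
            long end = append(log, ByteBuffer.wrap("msg-0".getBytes()), 0);
            serve(log, 0, end, socket);   // zero-copy send to the consumer
        }
    }
}
```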
What the figure also does not show is the synchronization between leader and follower, which works the same way as the consumer side: as long as the follower is in the ISR, it too can receive data through the zero-copy mechanism, from the page cache of the broker hosting the leader to the broker hosting the follower.
Meanwhile, the data in the page cache is written back to disk as the kernel's flusher threads are scheduled and as sync()/fsync() are called, so even if the process crashes there is no need to worry about data loss. In addition, if a message the consumer wants is not in the page cache, it is read from disk, and some adjacent blocks are read ahead into the page cache along the way to speed up subsequent reads.
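On the Java side, FileChannel.force() is the counterpart of fsync(): it blocks until the file's dirty pages have actually reached the disk. A tiny sketch (the file name is made up):

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FsyncDemo {
    public static void main(String[] args) throws Exception {
        try (FileChannel ch = FileChannel.open(Path.of("demo.log"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ch.write(ByteBuffer.wrap("hello".getBytes()));
            // At this point the data normally sits only in the page cache;
            // the kernel flusher would write it back later on its own.
            ch.force(true);  // like fsync(): block until data and metadata
                             // actually reach the disk
        }
    }
}
```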
From this we can draw an important conclusion: if the production rate of the Kafka producers roughly matches the consumption rate of the consumers, then the entire produce-consume pipeline can be completed almost entirely through reads and writes against the broker's page cache, with very little disk access. This is popularly described as the "in-air relay" between reads and writes. In addition, when Kafka persists messages to the partition files of each topic, it only ever appends sequentially, taking full advantage of the high efficiency of sequential disk access.
[figure: sequential append writes to partition files]
For Kafka's disk storage mechanism, see the Meituan technical team's excellent write-up: https://tech.meituan.com/2015/01/13/kafka-fs-design-theory.html.
For clusters dedicated to Kafka, the first thing to note is to set a suitable (i.e. not too large) JVM heap size for Kafka. From the analysis above, Kafka's performance has little to do with heap memory, while its demand for page cache is huge. As a rule of thumb, allocating 6~8 GB of heap memory to Kafka is enough, and the rest of the system memory should be left as page cache space to maximize I/O efficiency.
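One common way to apply such a heap setting (assuming a standard Kafka distribution, whose startup scripts honor the KAFKA_HEAP_OPTS environment variable):

```bash
# Pin the heap to 6 GB and leave the rest of RAM to the page cache.
export KAFKA_HEAP_OPTS="-Xms6g -Xmx6g"
bin/kafka-server-start.sh config/server.properties
```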
Another issue that deserves special attention is lagging consumers, i.e. consumers that consume slowly and fall far behind. The data they want to read is much more likely to have left the broker's page cache, which adds a lot of otherwise unnecessary disk reads. Worse still, the "cold" data read back for a lagging consumer also enters the page cache, polluting the "hot" data that the majority of up-to-date consumers want, so their performance deteriorates as well. This issue is especially important in production environments.
As mentioned earlier, data in the page cache is written back to disk as the kernel's flusher threads are scheduled. Four related kernel parameters are listed below; they can be adjusted if necessary.
1. /proc/sys/vm/dirty_writeback_centisecs: the period of the flusher's check, in units of 0.01 s. The default value is 500, i.e. 5 seconds. Each check is handled according to the logic controlled by the following three parameters.
2. /proc/sys/vm/dirty_expire_centisecs: if a page in the page cache has been marked dirty for longer than this value, it is flushed straight to disk. The unit is 0.01 s; the default value is 3000, i.e. half a minute.
3. /proc/sys/vm/dirty_background_ratio: if the total size of dirty pages as a percentage of free memory exceeds this value, a flusher thread is dispatched to write back to disk asynchronously in the background, without blocking the current write() operation. The default value is 10 (percent).
4. /proc/sys/vm/dirty_ratio: if the total size of dirty pages as a percentage of total memory exceeds this value, the write() operations of all processes are blocked, and every process is forced to write its own files back to disk. The default value is 20 (percent).
As you can see, parameters 2 and 3 offer the most flexible room for adjustment; try not to hit the threshold of parameter 4, as blocking all writes is far too expensive.
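If you do decide to tune parameters 2 and 3, sysctl is the usual interface to these /proc/sys/vm files; the values below are purely illustrative, not recommendations:

```bash
# Illustrative values only; tune for your workload.
sysctl -w vm.dirty_expire_centisecs=1000   # flush pages that stay dirty >10 s (parameter 2)
sysctl -w vm.dirty_background_ratio=5      # start background write-back earlier (parameter 3)
# To persist across reboots, put the settings in /etc/sysctl.conf and run `sysctl -p`.
```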