No matter what kind of code you write, it is ultimately executed by the CPU, so if you want to write high-performance code, the techniques in this article are worth learning carefully. And don't dismiss this knowledge as useless: more than ten years ago it was exactly this knowledge that helped me greatly in performance tuning and set me apart from many of my peers.

First of all, we all know that today's multi-core CPUs have several levels of cache. Older CPUs have two levels (L1 and L2), and newer CPUs have three levels (L1, L2, L3), as shown in the following figure:


  • The L1 cache is divided into two parts: an instruction cache and a data cache. The L2 and L3 caches do not distinguish between instructions and data.
  • L1 and L2 are private to each CPU core, while L3 is shared by all CPU cores.
  • The closer a cache is to the CPU, the smaller and faster it is; the farther away, the larger and slower.

Further out is main memory, and beyond memory is the hard disk. Let's look at some of their speeds:


  • L1 access speed: 4 CPU clock cycles
  • L2 access speed: 11 CPU clock cycles
  • L3 access speed: 39 CPU clock cycles
  • RAM access speed: 107 CPU clock cycles

We can see that L1 is about 27 times faster than RAM, but the size of L1/L2 is basically KB-level, while L3 is MB-level. For example, the Intel Core i7-8700K is a 6-core CPU; each core's L1 is 64KB (32KB each for data and instructions), L2 is 256KB, and L3 is 2MB (my Apple computer has an Intel Core i9-8950HK, with the same cache sizes as the Core i7-8700K).

Our data travels upward from memory: first to L3, then to L2, then to L1, and finally into the registers for CPU computation. Why design three levels? There are several considerations here:

  • One aspect is physical speed. A larger capacity requires more transistors, which not only enlarges the chip but, more importantly, slows it down: access speed is inversely related to the distance of the transistor being accessed, so as signal paths grow longer, communication gets slower. This part is a physics problem.
  • The other issue is that with multi-core technology, the state of the data needs to be synchronized across multiple CPUs. Since the speed gap between cache and RAM is so large, multiple levels of cache of different sizes help improve overall performance.

The world is always balanced: however glamorous one side becomes, the other side becomes equally dark. Building so many levels of cache inevitably introduces other problems. Here are two of the more important ones:

  • One is the relatively simple problem of the cache hit ratio.
  • The other is the more complex problem of cache update consistency.

Especially the second: under multi-core technology this looks very much like a distributed system, where multiple copies have to be updated.

Cache hits

Before explaining these two issues, we need to understand a term: Cache Line. To cache is basically to load data from farther away to somewhere closer. The CPU does not load data byte by byte, because that would be very inefficient; instead it loads data chunk by chunk. Such a chunk is the unit called a Cache Line.

Generally speaking, the cache line of a mainstream CPU is 64 bytes (some CPUs use 32 or 128 bytes). 64 bytes is 16 32-bit integers, and this is the smallest unit in which the CPU fetches data from memory.

For example, with the Cache Line as the smallest unit (64 bytes), the cache is first divided into multiple Cache Lines. If L1 is 32KB, then 32KB / 64B = 512 Cache Lines.

On the one hand, the cache needs to place data from memory into itself; in English this is called CPU Associativity. The cache's placement policy determines where in the CPU cache a block of memory data will be copied. Because the cache is much smaller than memory, there needs to be an address-association algorithm that maps memory data into the cache. This is somewhat like the way memory addresses are mapped from logical to physical addresses, but not exactly the same.

Basically, we will have the following methods.

  • One method is that data at any memory address can be cached in any cache line. This is the most flexible, but if we want to know whether a memory address is in the cache, we need an O(n) traversal of the cache, which is very inefficient.
  • Another approach, to reduce the search cost, uses a data structure like a hash table. The simplest hash is a modulo operation. For example, if our L1 Cache has 512 Cache Lines, the formula (memory address mod 512) * 64 directly gives the cache offset. However, this approach requires the program's memory accesses to be spread very evenly, otherwise conflicts are severe. That is a very idealized situation.
  • To avoid the problems of both schemes, we tolerate a certain amount of hash conflict, which gives us the N-Way approach: group N consecutive cache lines into a set, first find the relevant set, then find the relevant cache line within that set. This is called Set Associativity, as shown in the following figure.

N-Way set association may be a little hard to understand, so here is an example with more detail. Most Intel processors have a 32KB, 8-way set associative L1 cache with 64-byte Cache Lines. This means:

  • 32KB can be divided into 32KB / 64B = 512 Cache Lines.
  • Since there are 8 ways, each way has 512 / 8 = 64 Cache Lines.
  • So each way covers 64 x 64 = 4096 bytes of cache.
To index memory addresses conveniently:

  • Tag: each Cache Line carries an independently allocated 24 bits to store the tag, which is the top 24 bits of the memory address.
  • Index: the next 6 bits of the memory address are the Cache Line index within the way; 2^6 = 64 is exactly enough to index 64 Cache Lines.
  • Offset: the final 6 bits represent the offset within the Cache Line.

This is shown in the following figure (image from "Cache: a place for concealment and safekeeping"):

When you get a memory address, first take the middle 6 bits to find which set it belongs to.

Then, among the 8 Cache Lines in that set, perform an O(n), n = 8 traversal to match the top 24-bit tags. If one matches, it is a cache hit; if none matches, it is a cache miss, and for a read operation the next cache level must be accessed.

L2/L3 use the same kind of algorithm. There are two eviction algorithms: one is random, the other is LRU. Nowadays LRU-based eviction is the norm (implemented by adding an access counter).


This also means:

  • the L1 Cache can map 36 bits of memory address, a total of 2^36 = 64GB of memory;
  • when the CPU wants to access an address, it locates the set via the middle 6 bits, then locates the Cache Line via the top 24-bit tag;
  • just like a hash table data structure, there is first an O(1) index, followed by a conflict search;
  • because the middle 6 bits determine the set, addresses 4096 bytes apart in a contiguous region of memory land in the same set, causing cache conflicts.

In addition, when data misses the cache, the CPU fetches from memory in units at least as large as a cache line. Of course, the CPU does not necessarily load only 64 bytes: since accessing main memory is so slow, it generally loads more. A good CPU has predictive techniques; if it detects a pattern, it preloads more memory, including instructions. This is called prefetching (see Cache Prefetching on Wikipedia and Memory Prefetching from the State University of New York). For example, if you access a contiguous array in a for-loop with a fixed step size, the memory can be prefetched. Knowing these details helps us understand under what circumstances the cache becomes ineffective.

    Cache consistency

For mainstream CPUs, there are basically two write policies for the cache:

  • one is Write Back: a write operation only goes to the cache, which is flushed to memory later;
  • the other is Write Through: a write goes to both the cache and memory.

To improve write performance, mainstream CPUs (such as the Intel Core i7/i9) generally use the Write Back policy, because writing directly to memory is too slow.

Well, now the problem is: if a piece of data x is updated in the cache of CPU core 0, then the value of x in the caches of the other CPU cores must also be updated. This is the cache coherency problem. (Of course, our upper-level programs don't have to care how the caches of multiple cores are synchronized; it is transparent to upper-level code.)

    In general, on CPU hardware, there will be two ways to solve this problem.

  • Directory protocol. A typical implementation of this approach is a centralized controller that is part of the main memory controller. A directory stored in main memory contains global state information about the contents of the various local caches. When a single CPU cache issues a read or write request, this centralized controller checks and issues the necessary commands to synchronize and transfer data between main memory and the CPU caches, or between the CPU caches themselves.
  • Snoopy protocol. This protocol is more like a bus-based data-notification technology. Through it, a CPU cache can identify the state of data on other caches. If data is shared, the status of the shared data can be broadcast to the other CPU caches. The protocol requires every CPU cache to be able to snoop on notifications of data events and react accordingly. As shown in the following figure, there is a Snoopy Bus for this.

Because the Directory protocol is centralized, it has performance bottlenecks and increases the complexity of the overall design. The Snoopy protocol is more like microservices plus message passing, so nowadays bus-based Snoopy designs are basically the norm.

Here I want to go into more detail, because this kind of microscopic mechanism naturally invites comparison with distributed systems, where we generally use distributed consensus algorithms such as Paxos/Raft. In the CPU's microcosm, such algorithms are unnecessary, because the hardware of a multi-core CPU does not have to consider network partitions or latency. Therefore, the core of synchronizing the CPU's multi-core caches is managing the state of the data.

Here are a few state protocols. Starting with the simplest, the MESI protocol (which has nothing to do with the famous football player Messi), the cached data has four states: Modified, Exclusive, Shared, and Invalid.

The state machine for these states is as follows (a bit complicated; you can ignore it, as the diagram is only meant to show how complex the state control is):

Here's an example (if you want to see an animation, there is a web page (MESI Interactive Animations) you can interact with; that animation uses the Write Through algorithm):

| Current operation      | CPU0    | CPU1    | Memory | Note                                       |
--------------------------------------------------------------------------------------------------
| 1) CPU0 read(x)        | x=1 (E) |         | x=1    | Only one CPU has the x variable, so the state is Exclusive |
| 2) CPU1 read(x)        | x=1 (S) | x=1 (S) | x=1    | Both CPUs read the x variable, so the state becomes Shared |
| 3) CPU0 write(x,9)     | x=9 (M) | x=1 (I) | x=1    | The variable changes: the state becomes Modified in CPU0 and Invalid in CPU1 |
| 4) x written back to memory | x=9 (M) | x=1 (I) | x=9 | The current states do not change           |
| 5) CPU1 read(x)        | x=9 (S) | x=9 (S) | x=9    | The variable is synchronized to all caches, and the state returns to Shared |

After data is updated, the MESI protocol marks the copies in the other CPUs' caches as Invalid. When those CPUs read the data again, a cache miss occurs and the data must be fetched from memory. Updating from memory means a 20x drop in speed.

Could we instead update directly from the neighboring CPU's cache? Yes, and that speeds things up a lot, but state control becomes more cumbersome. One more state is needed: Owned, marking "I am the source of the updated data". This gives the MOESI protocol. I won't post the MOESI state machine and demonstration example here (if you are interested, you can look at the related Berkeley courseware); we only need to understand that MOESI allows CPU caches to synchronize data among themselves, which reduces operations on memory and improves performance considerably, although the control logic is also very complicated.

By the way, a protocol similar to MOESI is MESIF, where F stands for Forward, which likewise forwards updated data to other CPU caches. However, the Owned state in MOESI and the Forward state in MESIF differ greatly: data in the Owned state is dirty and has not yet been written back to memory, while data in the Forward state is clean and can be discarded without notice. It should be noted that AMD uses MOESI and Intel uses MESIF, and the F state is mainly designed for the CPU's L3 Cache (as we said earlier, L3 is shared by all CPU cores). (For a related comparison, see the answer to this question on StackOverflow.)


Having understood all of the above, let's take a look at its impact on program performance.

Example 1

Suppose we have an array of 64M ints. Consider the following two loops:

     const int LEN = 64*1024*1024;
     int *arr = new int[LEN];

     for (int i = 0; i < LEN; i += 2) arr[i] *= i;


     for (int i = 0; i < LEN; i += 8) arr[i] *= i;

Intuitively, the second loop does one quarter of the work of the first, so it should be about 4 times faster. But on my machine, the first loop takes 127 milliseconds and the second takes 121 milliseconds, which is not much of a difference at all.

The main reason is the Cache Line: the CPU loads at minimum one 64-byte Cache Line, i.e. 16 32-bit integers, so whether your step size is 2 or 8, you touch roughly the same set of cache lines. The multiplications themselves cost comparatively little CPU time.

Example 2

Let's look at some code related to the cache hit ratio, where we access a contiguous array with a certain increment:

     for (int i = 0; i < 10000000; i++) {
         for (int j = 0; j < size; j += increment) {
             memory[j] += j;
         }
     }

Let's test it. In the following table, the header row is the step size (how many integers are skipped each time), and the vertical axis is how many steps the array holds (you can think of it as how many Cache Lines), so each cell in the table corresponds to a particular array size and step size.

For example, with the column 512 and the row 4, the array has 4 * 512 = 2048 elements and is accessed with a step of 512, i.e. these four elements are touched: [0, 512, 1024, 1536].

Each cell in the table is the time, in microseconds, to run the outer loop 10 million times (divide by 1000 for milliseconds).

| count |   1    |   16  |  512  | 1024  |
------------------------------------------
|     1 |  17539 | 16726 | 15143 | 14477 |
|     2 |  15420 | 14648 | 13552 | 13343 |
|     3 |  14716 | 14463 | 15086 | 17509 |
|     4 |  18976 | 18829 | 18961 | 21645 |
|     5 |  23693 | 23436 | 74349 | 29796 |
|     6 |  23264 | 23707 | 27005 | 44103 |
|     7 |  28574 | 28979 | 33169 | 58759 |
|     8 |  33155 | 34405 | 39339 | 65182 |
|     9 |  37088 | 37788 | 49863 |156745 |
|    10 |  41543 | 42103 | 58533 |215278 |
|    11 |  47638 | 50329 | 66620 |335603 |
|    12 |  49759 | 51228 | 75087 |305075 |
|    13 |  53938 | 53924 | 77790 |366879 |
|    14 |  58422 | 59565 | 90501 |466368 |
|    15 |  62161 | 64129 | 90814 |525780 |
|    16 |  67061 | 66663 | 98734 |440558 |
|    17 |  71132 | 69753 |171203 |506631 |
|    18 |  74102 | 73130 |293947 |550920 |

We can see that the time rises sharply from [9, 1024] on, and [17, 512] and [18, 512] also rise significantly. This is because my machine's L1 cache is 32KB and 8-way: as mentioned earlier, there are 64 sets of 8 Cache Lines each. When the for-loop's step size exceeds 1024 integers (exactly 4096 bytes), consecutive accesses differ only in the high 24 bits of the address while the low 12 bits barely change; in particular the middle 6 bits do not change at all, so every access lands in the same set, causing massive cache conflicts, degraded performance, and longer times. The same holds for [16, 512] and beyond, where the stride starts causing conflict misses in the L1 Cache.

Example 3

Next, let's look at another example. Below are two ways to traverse a two-dimensional array: one row by row, the other column by column. In theory both do the same amount of addressing and computation, so their execution times should be the same:

     const int row = 1024;
     const int col = 512;
     int matrix[row][col];

     // Traverse row by row
     int sum_row = 0;
     for (int _r = 0; _r < row; _r++)
         for (int _c = 0; _c < col; _c++)
             sum_row += matrix[_r][_c];

     // Traverse column by column
     int sum_col = 0;
     for (int _c = 0; _c < col; _c++)
         for (int _r = 0; _r < row; _r++)
             sum_col += matrix[_r][_c];


But no: on my machine, the results differ by more than a factor of ten in execution time. The reason is that column-by-column traversal is unfriendly to the way the CPU cache works, so it pays a huge price.

Example 4

Next, let's look at a performance problem under multiple cores. In the following code, two threads manipulate two different elements of an array (without locking), each looping 10 million times doing an addition. The key line is the p2 pointer: either p[1] or p[30]. In theory the execution time should be the same no matter which two array elements are accessed:

     void fn(int* data) {
         for (int i = 0; i < 10*1024*1024; ++i)
             *data += rand();
     }

     int p[32];

     int *p1 = &p[0];
     int *p2 = &p[1];   // or: int *p2 = &p[30];

     thread t1(fn, p1);
     thread t2(fn, p2);

However, no. On my machine the results are:

  • for p[0] and p[1]: 560ms
  • for p[0] and p[30]: 104ms

This is because p[0] and p[1] are on the same cache line, while p[0] and p[30] cannot be on the same cache line. The CPU's smallest cache update unit is the cache line, so although the two threads write different data, the two values sit on the same cache line, forcing that line to be constantly synchronized between the L1/L2 of the two CPU cores. Hence the 5x time difference.

Example 5

Next, let's look at another piece of code: we want to count the odd numbers in an array, but the array is too large, so we use multithreading to speed up the count. In the code below, each thread is passed an id and uses it to handle the statistics for its own segment of the array, which should speed up the overall processing:

     int total_size = 16 * 1024 * 1024;    // array length
     int* test_data = new int[total_size]; // the array
     int nthread = 6;                      // number of threads (my machine is 6-core)
     int result[nthread];                  // array collecting the results

     void thread_func(int id) {
         result[id] = 0;
         int chunk_size = total_size / nthread + 1;
         int start = id * chunk_size;
         int end = min(start + chunk_size, total_size);

         for (int i = start; i < end; ++i) {
             if (test_data[i] % 2 != 0) ++result[id];
         }
     }

However, when you run it, you will find that 6 threads cannot outperform 1 thread. From the previous example you know why: the elements of result[] all sit in one Cache Line, so every thread writes to that Cache Line, forcing all threads to constantly resynchronize the Cache Line holding result[]. That is why 6 threads cannot beat one. This is called False Sharing.

The optimization is also simple: use a variable local to the thread.

     void thread_func(int id) {
         result[id] = 0;
         int chunk_size = total_size / nthread + 1;
         int start = id * chunk_size;
         int end = min(start + chunk_size, total_size);

         int c = 0; // use a local variable: no cache-line synchronization

         for (int i = start; i < end; ++i) {
             if (test_data[i] % 2 != 0) ++c;
         }
         result[id] = c;
     }

Let's run the two programs on 1 to 32 threads. The results are plotted below (the horizontal axis is the number of threads, the vertical axis is the time to complete the task, in microseconds):

In the figure above, the gray curve is the first method and the orange one is the second (using local variables). With one thread the two methods are equal, with basically no difference; but as threads are added, the second method's performance improves very quickly, stabilizing only once it reaches 6 threads (as mentioned earlier, my CPU has 6 cores). And no matter how many threads the first method adds, it can never catch up with the second, because the first method is not CPU-cache friendly. In other words, the second method scales linearly as long as there are enough CPU cores to keep every core busy, while the first cannot.

    From: Programmer CXUAN