I recently read an excellent paper, “Processor Microarchitecture: An Implementation Perspective”, which explains caching, instruction fetch, decode, register renaming, instruction issue, execution, commit, and so on in a microarchitecture, from the perspective of hardware implementation.

This article is my reading notes on that paper, covering Chapter 2 of the original, “Caches”. I highly recommend that interested readers go through the original text.

Although the title says “An Implementation Perspective”, there is no hardware code in the paper. Instead it covers the role of each component in the microarchitecture, the ideas behind the implementations, and the principles, advantages, and disadvantages of each implementation scheme.

The paper’s explanations are short and clear, with plenty of illustrations; after reading the second chapter I was hooked. Although it is a 2011 Lecture, the authors condensed the established knowledge of the field at the time, so it is not very outdated today (at least many of the mechanisms described in the Caches chapter are almost the same as in today’s commercial architectures).

The paper is very suitable as supplementary reading for “Computer Organization and Design: The Hardware/Software Interface” and “Computer Architecture: A Quantitative Approach”. I strongly recommend that readers interested in architecture and high-performance computing read it in depth, to gain a more intuitive understanding of some of the concepts in those books and a concrete sense of how those concepts are implemented.

Starting here with Chapter 2, Caches, I will summarize the important content of the Lecture chapter by chapter and add brief comments, hoping they help you in your work and study.

My commentary is in italics, to distinguish what the authors describe from what I add. If there is an error in a summary or comment, you are welcome to point it out in the comments section.

In the rest of the Lecture, the authors focus on the L1 cache.

If virtual tags are used, special attention must be paid to the virtual-aliasing problem. For example, if multiple processes access the same piece of physical memory (shared memory, or the code pages of a dynamic library), two different virtual pages will point to the same physical page. This kind of problem can be difficult to handle correctly.
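To make aliasing concrete, here is a small sketch (the geometry is my own illustrative choice, not from the Lecture): when the cache’s index+offset bits exceed the page-offset bits, a virtually indexed cache can place two aliases of the same physical byte into different sets.

```python
# Hypothetical geometry: 4 KiB pages, 64-byte lines, 256 sets.
# index+offset spans 14 bits, 2 bits more than the 12-bit page offset,
# so the set index depends on virtual bits that differ between aliases.
PAGE_BITS = 12
LINE_BITS = 6          # 64-byte cache line
SETS = 256             # 8 bits of index

def cache_set(vaddr: int) -> int:
    """Set index taken from the *virtual* address (virtually indexed)."""
    return (vaddr >> LINE_BITS) % SETS

# Two virtual mappings of the same physical byte (e.g. shared memory):
va1 = 0x10000 + 0x80   # process A's mapping
va2 = 0x25000 + 0x80   # process B's mapping, same page offset

print(cache_set(va1), cache_set(va2))  # different sets: two copies may coexist
```

With 14 index+offset bits against a 12-bit page offset, the two aliases land in different sets, so the cache could hold two (possibly stale) copies of the same physical data.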

A TLB (translation lookaside buffer) generally has tens to hundreds of entries (on high-end architectures the TLB is also multi-level: an L1 TLB typically has tens of entries, while an L2 TLB may have hundreds). Because the TLB has a large impact on performance, it is typically accessed in a single cycle.

There are two schemes for managing TLB contents:

Pure software management: the TLB’s contents are fully managed by the operating system, and the architecture provides special instructions to add/remove TLB entries. This allows the system’s paging organization to be set flexibly, and software can control which pages are cached in the TLB

Hardware management: the TLB’s contents are managed by hardware. The operating system is responsible for creating page tables in a format the hardware understands, and the TLB behaves outside the operating system’s control
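A minimal model of the software-managed case may help (all structures and sizes here are illustrative, not any real ISA’s format): on a miss, a trap handler standing in for the OS walks a page table and explicitly refills the TLB, including choosing the victim.

```python
# Toy software-managed TLB: the "OS" owns both the page table and the
# refill/eviction policy; hardware only consults the TLB dictionary.
PAGE_BITS = 12
TLB_SIZE = 4

page_table = {0x1: 0x40, 0x2: 0x41}   # VPN -> PFN, OS-owned structure
tlb = {}                              # VPN -> PFN, hardware-visible

def translate(vaddr: int) -> int:
    vpn, offset = vaddr >> PAGE_BITS, vaddr & ((1 << PAGE_BITS) - 1)
    if vpn not in tlb:                # TLB miss -> software handler runs
        if len(tlb) >= TLB_SIZE:      # eviction policy chosen by the OS
            tlb.pop(next(iter(tlb)))
        tlb[vpn] = page_table[vpn]    # OS inserts the entry explicitly
    return (tlb[vpn] << PAGE_BITS) | offset

print(hex(translate(0x1234)))  # translated via a software refill
```

In the hardware-managed case, the `if vpn not in tlb` branch would instead be a hardware page-table walker, invisible to the OS.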

A cache consists mainly of two blocks: the tag array and the data array.

The data array consists of multiple sets (some textbooks translate “set” into Chinese as “group”), and each set consists of multiple blocks (a block is equivalent to a cache line). The number of blocks in a set is called the associativity; an N-way set-associative cache has associativity N.

The tag array has the same number of sets and the same associativity as the data array. In addition to tags, the tag array also stores some control bits, such as valid and dirty bits.

In fact, Figure 2.1 also describes a single-cycle cache implementation. The index is extracted and used to select the corresponding set. For the data array, all blocks in the set are read out directly to the way mux. For the tag array, all N ways of tags in the set are read out and compared with the tag of the (virtual) address; if none matches, a miss signal is generated and the data is fetched from the lower-level cache. If way i matches, that comparison result selects the corresponding block from the data array. Finally, the desired word is extracted based on the offset (this step is called data alignment).
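The lookup flow above can be sketched in software for a toy 2-way cache (the geometry and field widths are my own illustrative choices, not Figure 2.1’s):

```python
# Parallel tag/data lookup: both arrays are indexed with the same bits,
# the tag comparators drive the way mux, and the offset drives the aligner.
OFFSET_BITS, INDEX_BITS, WAYS = 2, 2, 2   # 4-byte lines, 4 sets, 2 ways

tags = [[None] * WAYS for _ in range(1 << INDEX_BITS)]
data = [[None] * WAYS for _ in range(1 << INDEX_BITS)]

def lookup(addr: int):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    # Both arrays are read with the same index, "in parallel":
    set_tags, set_blocks = tags[index], data[index]
    for way in range(WAYS):                  # tag comparators
        if set_tags[way] == tag:
            block = set_blocks[way]          # way mux: select way i
            return block[offset]             # aligner: pick the word
    return None                              # miss -> go to next level

tags[1][0] = 0x5                             # install a line: tag 5, set 1
data[1][0] = [10, 20, 30, 40]
print(lookup((0x5 << 4) | (1 << 2) | 2))     # hit in way 0, word 2
```

In hardware the loop body runs as N comparators in parallel, and the data-array read happens at the same time as the tag read rather than after it.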

In the above procedure, the tag access and the data access happen at the same time.

Because the L1 cache has a significant impact on performance, the above process needs to be pipelined for high-frequency designs.

The following diagram shows the Pipelined implementation of Figure 2.1:

There are two critical paths in the figure (marked with red arrows): one reads the N tags from the selected set, compares each tag, and sends the control bits to the way mux; the other reads the N blocks from the data array, feeds them into the way mux, and selects the result based on the control bits produced by the tag comparison.

With parallel tag & data access, tag-array indexing and matching, data-array indexing and read-out, stabilizing the way-mux control signals, selecting data through the way mux, and stabilizing the output must all complete within a single cycle. Packing all of these operations into one cycle strains timing and can limit the cache’s frequency.

Another implementation accesses the tag first and then the data, with the two working serially. The workflow is as follows:

After the index is decoded to select the set, this cache first accesses and compares the tag array; if the cache hits, it then reads the matching way from the data array and sends the data to the aligner to extract the required word.

Compared to the previous parallel implementation, the advantages of the serial implementation are:

The way mux is no longer needed, nor does a whole set of the data array have to be read out. Instead, a way is first selected according to the tag match, and then the data is read from that way according to the index. All the ways of the data array can be wired directly to the aligner, with no way mux in between

Since there is no need to read out all the data in a set, power consumption is effectively reduced

The pipelined scheme for serial tag & data access is as follows:

As you can see, the advantage of the Figure 2.4 scheme over Figure 2.2 is that the way mux is no longer required: the tag array’s critical path (critical path 1 in Figure 2.2) is shortened, and the data array’s critical path is gone. Timing is no longer as tight as in the parallel scheme, so the cache’s frequency and throughput can be increased. In addition, serial access consumes less power.

However, serial access also has a disadvantage: because it must match the tag first and then read the data, the pipeline has one more stage and a higher latency than parallel access.

Whether in the parallel or the serial scheme, the lower the associativity, the better the timing of tag matching and data selection, and thus the higher the achievable frequency. But lower associativity also means more conflicts, which increases the cache miss rate. So the associativity needs to be weighed carefully.
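The conflict side of this tradeoff is easy to demonstrate with a toy simulation (cache geometry and the access pattern are my own illustrative choices): two lines that map to the same set ping-pong in a direct-mapped cache but coexist in a 2-way one.

```python
# Miss counting for an LRU set-associative cache of equal total size,
# comparing direct-mapped (ways=1) against 2-way on a conflicting trace.
from collections import OrderedDict

def misses(trace, sets, ways, line=64):
    cache = [OrderedDict() for _ in range(sets)]  # per-set LRU order
    count = 0
    for addr in trace:
        s, tag = (addr // line) % sets, addr // (line * sets)
        if tag in cache[s]:
            cache[s].move_to_end(tag)             # refresh LRU position
        else:
            count += 1
            if len(cache[s]) >= ways:
                cache[s].popitem(last=False)      # evict the LRU way
            cache[s][tag] = True
    return count

# Two lines exactly one cache-size apart alternate 16 times:
trace = [0x0000, 0x4000] * 8
print(misses(trace, sets=64, ways=1))  # direct-mapped: every access misses
print(misses(trace, sets=32, ways=2))  # 2-way, same capacity: 2 cold misses
```

Both configurations are 4 KiB; only the associativity differs, which is exactly the miss-rate side of the frequency-vs-conflicts tradeoff described above.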

Non-blocking (lockup-free) cache: when a miss occurs in a pipelined cache access, the pipeline can continue executing other instructions that have no dependence on the missing access (including other memory accesses). Non-blocking caches are used together with some tracking mechanism (a scoreboard or Tomasulo’s algorithm). In today’s out-of-order cores, caches are non-blocking.

To implement this mechanism, there must be space to hold the missing request while waiting for the data to return from the lower-level memory system, after which the data is written to the cache and returned to the pipeline. The structures that hold this state are called MSHRs (miss status/information holding registers). In addition, data returned from the lower-level memory system is buffered in an input stack (or fill buffer) before being written to the data array.

For non-blocking caches, misses fall into three categories:

Primary miss: the first miss to a cache line. After a primary miss, the cache stores the access information in the MSHRs and sends the request to the lower level

Secondary miss: a miss to a cache line that has already had a primary miss and is still waiting for data from the lower-level memory system. This is also known as a miss on a miss

Structural-stall miss: a miss caused by a structural hazard (e.g., the MSHRs are full), which usually results in a pipeline stall

MSHRs can be implemented in three ways:

Implicitly Addressed MSHRs

Explicitly Addressed MSHRs

In-Cache MSHRs

The structure of implicitly addressed MSHRs is shown in the figure above. To be clear, everything in the diagram is the contents of a single MSHR.

An MSHR contains a valid bit, a block address (cache-line address), and a field array of length N, where N is the number of words in a cache line; that is, each entry in the array corresponds to one word of the line. Each entry stores the access information for that word: which register the data should be written to when it returns from the lower level, the access width, whether to zero-extend or sign-extend, and so on.

When a primary miss occurs, an empty MSHR is found, the valid bit and block address are set, and the access information is written to the corresponding entry. When a secondary miss occurs later, the MSHR is found by its block address and further access information is recorded. When the data returns from the lower-level memory system, it is delivered to the registers according to the access information recorded in the entries.
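The allocation logic can be sketched as follows (field names, MSHR count, and line width are illustrative, not from the Lecture):

```python
# Implicitly addressed MSHRs: one slot per word of the line, so the
# target word is implied by the slot's position.
WORDS_PER_LINE = 4

class ImplicitMSHR:
    def __init__(self):
        self.valid = False
        self.block_addr = None
        self.slots = [None] * WORDS_PER_LINE   # one slot per word

mshrs = [ImplicitMSHR() for _ in range(4)]

def record_miss(block_addr, word, dest_reg):
    """Returns True if the miss was recorded, False on a structural stall."""
    m = next((m for m in mshrs
              if m.valid and m.block_addr == block_addr), None)
    if m is None:                               # primary miss
        m = next((m for m in mshrs if not m.valid), None)
        if m is None:
            return False                        # all MSHRs busy -> stall
        m.valid, m.block_addr = True, block_addr
    elif m.slots[word] is not None:             # secondary miss, same word
        return False                            # slot already taken -> stall
    m.slots[word] = dest_reg                    # record dest for this word
    return True
```

The final `elif` branch is exactly the limitation discussed next: a second miss to the same word has nowhere to go.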

While simple to implement, the implicitly addressed MSHRs described above have a problem: when a secondary miss targets the same word as the primary miss, because entries and words correspond one-to-one, the secondary miss triggers a structural hazard and stalls the pipeline.

Explicitly addressed MSHRs abandon the one-to-one correspondence and instead store the block offset explicitly in each access-information entry, as shown in the figure below. This allows even two misses to the same word to occupy different entries, avoiding the problem of implicitly addressed MSHRs.
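The explicit variant changes only the slot layout (M and the field names are illustrative; the Lecture does not fix a value of M):

```python
# Explicitly addressed MSHR: each of M slots stores (offset, dest), so
# two misses to the same word can both be recorded.
M_SLOTS = 4

class ExplicitMSHR:
    def __init__(self, block_addr):
        self.block_addr = block_addr
        self.slots = []                     # up to M (offset, dest) pairs

    def record(self, word_offset, dest_reg):
        if len(self.slots) >= M_SLOTS:
            return False                    # slots exhausted -> stall
        self.slots.append((word_offset, dest_reg))
        return True

m = ExplicitMSHR(0x100)
assert m.record(0, "r1") and m.record(0, "r3")  # same word: both accepted
```

Compare the implicit scheme: there the second `record(0, …)` would have stalled; here the cost is that stalls appear once the M slots run out, which is the tradeoff discussed in the commentary below.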

(The authors do not give rules for choosing M. My understanding is that there is a tradeoff: the larger M is, the more misses can be buffered, but the larger the area and the tighter the timing, which limits the number of MSHRs; conversely, with a smaller M the number of MSHRs can increase, but a program that frequently misses within the same cache line will still cause stalls.)

Whether MSHRs are implicitly or explicitly addressed, they occupy extra area, so their number is limited when area is constrained. One solution is to store the MSHR information directly in the cache: the tag array still holds the tag, the data array holds the MSHR access information until the data returns, and an additional transient bit marks that the line is waiting on data from the lower level. Such in-cache MSHRs can be either implicitly or explicitly addressed.

The advantage of this approach is that many MSHRs are available; the disadvantage is that the implementation is more complicated.

There are several ways to implement a multiported cache:

True Multiported Cache

Array Replication

Virtual Multiporting

Multibanking


This design replicates all of the cache’s control and data paths, including address decoders, multiplexors, tag comparators, aligners, and more; the tag and data arrays themselves are not copied (their storage cells are multiported instead). While this guarantees two truly independent accesses per cycle, it dramatically increases cache access time, which in turn prevents the frequency from increasing. Therefore commercial chips rarely use this solution.

This scheme builds on the true multiported cache by also copying the data array and tag array. It reduces access complexity and increases bandwidth without adding extra access-time overhead. However, it wastes area and adds complexity, because the two copies of the arrays must be kept synchronized. Some designs copy only the data array.

This approach uses time-division multiplexing to process two memory instructions within a single cycle, implementing two logical ports on one physical port. It has two drawbacks: it is difficult to scale to more ports, and it is difficult for the data array to produce multiple results per cycle at high frequency, which limits frequency growth.

The core idea of this approach is to divide the cache into multiple banks, each with its own port (address decoder, tag comparators, aligners, etc.) that handles that bank’s accesses. Unlike a true multiported cache, no bank needs multiple ports, so access time does not increase; unlike array replication, the tag and data arrays are not duplicated, so area does not increase significantly. A schematic of this scheme is as follows:

Multibanking is the most commonly used dual-ported cache implementation today. However, bank conflicts require attention: if two memory instructions access the same bank, they cannot be processed in the same cycle. To mitigate this, on one hand hardware can map addresses so that consecutive lines interleave across banks; on the other hand, software can try to avoid simultaneous accesses to the same bank.

Another point worth emphasizing is address alignment. If an access is aligned within a cache line, it only needs to touch one bank, so as long as the other memory instruction has no bank conflict, an additional access can be issued. If an access is not aligned, it is split into two aligned accesses sent to two banks; the latency stays the same, but the bandwidth is halved. (Recent designs seem to have overcome this problem; I don’t know how.)
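Bank selection by interleaving, and the unaligned-access split, can both be sketched briefly (bank count and line size are illustrative):

```python
# Interleaved bank mapping: consecutive cache lines go to consecutive
# banks; an access straddling a line boundary must touch two banks.
BANKS, LINE = 4, 64

def bank_of(addr: int) -> int:
    return (addr // LINE) % BANKS

def split_access(addr: int, size: int):
    """Banks touched by an access; two banks means the split case above,
    which halves effective bandwidth."""
    first, last = bank_of(addr), bank_of(addr + size - 1)
    return [first] if first == last else [first, last]

assert split_access(0x00, 8) == [0]        # aligned: one bank
assert split_access(0x3C, 8) == [0, 1]     # straddles a line: two banks
```

With this mapping, a stream of sequential line-sized accesses naturally spreads across banks, which is why interleaving is the usual hardware-side mitigation for bank conflicts.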

Since instruction fetch is generally sequential, reading one cache line per cycle is basically enough to feed instruction issue, so a single port is sufficient.

The ICache uses a blocking cache. The advantage of a non-blocking cache is that after a miss, subsequent instructions can still execute. But on an ICache miss there are no subsequent instructions to run at all (the ICache is where instructions come from), and the pipeline must wait until the miss returns, so a non-blocking cache is pointless here.

The ICache is better suited to parallel tag & data access than serial access, because parallel access restarts the pipeline more quickly after a branch misprediction. However, parallel access puts more pressure on timing and power, so it requires careful trade-offs.

As for associativity, the ICache does not need very high associativity because fetches are generally fairly sequential, but this too should be weighed carefully against the actual workload.