“Processor Microarchitecture: An Implementation Perspective” describes the caches, instruction fetch, decode, register renaming (allocation), instruction issue, instruction execution, and instruction commit stages of a microarchitecture from the perspective of hardware implementation.
This article contains my reading notes on the book, corresponding to Chapter 4 of the original text, “Decode”. It mainly covers the composition and working principles of the decode unit; readers who are interested are strongly encouraged to read the author’s original text.
These notes explain the role of each component in the microarchitecture, the ideas behind its implementation, and the principles, advantages, and disadvantages of each implementation scheme.
In what follows, my own brief commentary is set in italics, to help you distinguish the book’s descriptions from my remarks. I hope these notes are helpful for your work and study.
The decoding unit is mainly responsible for determining:

- The type of each instruction, such as control instructions, memory-access instructions, computational instructions, etc.
- What the instruction specifically does; for computational instructions, for example, the decoding unit translates which ALU operation to perform
- Which registers the instruction needs to read and which register it writes back to, again taking computational instructions as an example
In general, instructions arrive at the decoding unit as a raw binary byte stream. The decoding unit first segments the byte stream into individual instructions, and then translates each segmented instruction to generate the control signals for the subsequent units.
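As a rough illustration of these two steps, here is a minimal sketch in C using a made-up toy encoding (not any real ISA): insn_length() performs the segmentation and translate() produces the per-instruction control information.

```c
#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

/* Toy, hypothetical ISA used only to illustrate the two decode steps:
 * (1) segment the raw byte stream into instructions,
 * (2) translate each instruction into control information. */

enum { OP_NOP = 0, OP_ADD = 1, OP_LOAD = 2, OP_BRANCH = 3 };

typedef struct {
    int op;        /* what the instruction does         */
    int dst, src;  /* which registers it writes / reads */
} decoded_t;

/* Step 1: instruction length. In this toy encoding the low two bits of
 * the first byte give the opcode, and OP_BRANCH carries one extra
 * immediate byte; everything else is a single byte. */
static size_t insn_length(const uint8_t *p) {
    return ((*p & 0x3) == OP_BRANCH) ? 2 : 1;
}

/* Step 2: translation into control signals. */
static decoded_t translate(const uint8_t *p) {
    decoded_t d;
    d.op  = *p & 0x3;          /* instruction type        */
    d.dst = (*p >> 2) & 0x7;   /* destination register id */
    d.src = (*p >> 5) & 0x7;   /* source register id      */
    return d;
}

int main(void) {
    const uint8_t raw[] = { 0x01, 0x4B, 0x02, 0xE5, 0x03, 0x10 };
    size_t pos = 0;
    while (pos < sizeof raw) {
        size_t len = insn_length(raw + pos);   /* segmentation */
        decoded_t d = translate(raw + pos);    /* translation  */
        printf("op=%d dst=r%d src=r%d len=%zu\n", d.op, d.dst, d.src, len);
        pos += len;
    }
    return 0;
}
```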
The complexity of the decoding unit has a lot to do with the design of the ISA and the number of instructions in the ISA.
The instruction encoding of RISC architectures is generally simple and easy to decode:

- Most RISC architectures have a fixed instruction length, so segmenting the byte stream is straightforward
- The encoding format of RISC instruction sets is also fairly regular: the opcode, operand, and other fields sit in largely the same positions across different instructions
As a result, many high-performance RISC architectures can decode in a single cycle. This reflects one of RISC’s design intentions: decoding should be easy and control signals simple to generate, making high-performance implementations achievable.
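As a concrete illustration (my own example, not one from the book), the sketch below decodes a RISC-V R-type instruction: every instruction is exactly 4 bytes and each field sits at a fixed bit position, so decoding amounts to a few shifts and masks.

```c
#include <stdint.h>
#include <stdio.h>

/* Fixed-length decoding, using the RISC-V R-type format: segmentation is
 * trivial (always 4 bytes) and field extraction is a handful of shifts
 * and masks that can easily finish in one cycle. */

typedef struct {
    uint32_t opcode, rd, funct3, rs1, rs2, funct7;
} rtype_t;

static rtype_t decode_rtype(uint32_t insn) {
    rtype_t d;
    d.opcode = insn         & 0x7F;  /* bits  6:0  */
    d.rd     = (insn >> 7)  & 0x1F;  /* bits 11:7  */
    d.funct3 = (insn >> 12) & 0x07;  /* bits 14:12 */
    d.rs1    = (insn >> 15) & 0x1F;  /* bits 19:15 */
    d.rs2    = (insn >> 20) & 0x1F;  /* bits 24:20 */
    d.funct7 = (insn >> 25) & 0x7F;  /* bits 31:25 */
    return d;
}

int main(void) {
    uint32_t insn = 0x002081B3;      /* add x3, x1, x2 */
    rtype_t d = decode_rtype(insn);
    printf("opcode=0x%02X rd=x%u rs1=x%u rs2=x%u\n",
           (unsigned)d.opcode, (unsigned)d.rd,
           (unsigned)d.rs1, (unsigned)d.rs2);
    return 0;
}
```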
x86, by contrast, uses variable-length instructions. Since an instruction’s length is not known in advance, being able to compute it quickly is critical to decode performance. To calculate the length of an x86 instruction, the opcode must be parsed first, so that the type of the current instruction is known before the subsequent fields can be interpreted. However, the x86 ISA fixes neither which byte the opcode field starts at (an instruction may carry up to 4 bytes of prefixes) nor the size of the opcode field itself (1 to 3 bytes), so the process is complicated.
In addition, because an x86 operand can be either a register or a memory address, each with its own encoding format, determining the operands during decoding is equally complex.
Therefore, decoding the x86 instruction set is quite complex. In modern x86 processors, decoding often takes several cycles and introduces a great deal of complexity into the chip design.
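To make the serial nature of this concrete, here is a deliberately tiny sketch of x86 length calculation. It only recognizes a handful of one-byte opcodes and ignores ModRM/SIB, displacements, and escape opcodes, but it shows the dependency chain: prefixes must be scanned before the opcode can be found, and the opcode (together with any size-changing prefix) must be known before the remaining bytes can be counted.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <stdbool.h>

static bool is_legacy_prefix(uint8_t b) {
    switch (b) {
    case 0xF0: case 0xF2: case 0xF3:            /* LOCK / REP        */
    case 0x2E: case 0x36: case 0x3E: case 0x26:
    case 0x64: case 0x65:                       /* segment overrides */
    case 0x66: case 0x67:                       /* size overrides    */
        return true;
    default:
        return false;
    }
}

/* Returns total length in bytes, or 0 if this sketch cannot decode it. */
static size_t x86_length(const uint8_t *p, size_t n) {
    size_t i = 0;
    bool op_size_16 = false;

    while (i < n && is_legacy_prefix(p[i])) {   /* step 1: prefixes */
        if (p[i] == 0x66) op_size_16 = true;    /* changes imm size */
        i++;
    }
    if (i >= n) return 0;

    uint8_t opc = p[i++];                       /* step 2: opcode   */
    if (opc == 0x90)                return i;   /* NOP              */
    if (opc >= 0x50 && opc <= 0x5F) return i;   /* PUSH/POP reg     */
    if (opc >= 0xB8 && opc <= 0xBF)             /* MOV reg, imm     */
        return i + (op_size_16 ? 2 : 4);
    return 0;  /* ModRM/SIB/escape opcodes are out of scope here */
}

int main(void) {
    const uint8_t mov32[] = { 0xB8, 0x78, 0x56, 0x34, 0x12 }; /* mov eax, 0x12345678 */
    const uint8_t mov16[] = { 0x66, 0xB8, 0x34, 0x12 };       /* mov ax, 0x1234      */
    printf("%zu %zu\n", x86_length(mov32, sizeof mov32),
                        x86_length(mov16, sizeof mov16));      /* prints: 5 4 */
    return 0;
}
```

Note how the 0x66 operand-size prefix changes the size of the immediate that follows the opcode: this kind of coupling between fields is exactly what forces length calculation to be sequential.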
Consider a CISC instruction such as x86’s add [mem], reg, which reads a value from memory, adds a register to it, and writes the result back to memory, all as a single instruction. For an out-of-order execution engine, executing instructions of such coarse granularity requires very complex control logic to guarantee correct execution. In addition, such coarse-grained instructions expose little instruction-level parallelism, which is not conducive to overall performance. Therefore, CISC-style instructions are not efficient for out-of-order execution processors.
For a RISC instruction set, the same operation is expressed at compile time as three simple instructions: a load, an add, and a store. These simple instructions can be handled easily by out-of-order execution units without particularly complex control logic. At the same time, they can execute in parallel with other independent instructions, which benefits performance.
So x86 processors retain CISC-style instructions at the ISA level, but during decoding they dynamically translate them into RISC-style internal operations. Intel calls these RISC-style operations micro-operations (μops).
The earliest x86 μops appeared in the AMD K5 and Intel P6. In P6, μops have a fixed length of 118 bits, which is longer than the instructions of many RISC ISAs. This is because a μop is not meant to simply mirror a RISC instruction: on top of a RISC-like operation, it also carries some already-decoded information, such as pipeline control signals. In addition, P6’s μops no longer use memory addresses as operands but follow a load/store model (my understanding is that, apart from load/store μops, all other μops take only registers as operands).
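The sketch below illustrates, with a made-up μop format (not Intel’s actual one), how a memory-operand add could be cracked into load/add/store μops that obey a load/store model: only the load and store μops touch memory, and the ALU μop works purely on registers (here, a temporary register t0 assumed to be provided by the microarchitecture).

```c
#include <stdio.h>

typedef enum { UOP_LOAD, UOP_ADD, UOP_STORE } uop_kind_t;

typedef struct {
    uop_kind_t kind;
    const char *dst, *src1, *src2;   /* register or address operands */
} uop_t;

/* Crack "add [addr], reg" into three RISC-style μops. */
static int crack_add_mem_reg(const char *addr, const char *reg, uop_t out[3]) {
    out[0] = (uop_t){ UOP_LOAD,  "t0", addr, NULL };  /* t0     <- [addr]   */
    out[1] = (uop_t){ UOP_ADD,   "t0", "t0", reg  };  /* t0     <- t0 + reg */
    out[2] = (uop_t){ UOP_STORE, addr, "t0", NULL };  /* [addr] <- t0       */
    return 3;                                         /* number of μops     */
}

int main(void) {
    uop_t uops[3];
    int n = crack_add_mem_reg("[rbx]", "eax", uops);
    for (int i = 0; i < n; i++)
        printf("uop %d: kind=%d dst=%s src1=%s src2=%s\n",
               i, uops[i].kind, uops[i].dst, uops[i].src1,
               uops[i].src2 ? uops[i].src2 : "-");
    return 0;
}
```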
The instruction length decoder (ILD) is not only responsible for calculating instruction lengths; it also appends some extra information to help the subsequent decode steps. Since calculating instruction lengths is itself a serial task, this process must be made as fast as possible to sustain IPC.
For most instructions, the ILD can compute the instruction length within a single cycle, but for some complex instructions it may take longer.
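The sketch below (using a toy one-or-two-byte length rule rather than real x86 rules) shows why this work is inherently serial: to mark which bytes of a fetched block begin an instruction, the length of every earlier instruction in the block must already be known.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy stand-in for the length calculation; not real x86. */
static int toy_length(uint8_t first_byte) {
    return (first_byte & 0x80) ? 2 : 1;
}

int main(void) {
    uint8_t block[16] = { 0x01, 0x82, 0x00, 0x03, 0x90, 0x81, 0x00, 0x02,
                          0x04, 0x85, 0x00, 0x06, 0x07, 0x83, 0x00, 0x08 };
    bool starts[16] = { false };

    int pos = 0;
    while (pos < 16) {
        starts[pos] = true;              /* this byte begins an instruction      */
        pos += toy_length(block[pos]);   /* next boundary needs this length first */
    }

    for (int i = 0; i < 16; i++)
        printf("%d", starts[i] ? 1 : 0); /* boundary marks consumed by decoders  */
    printf("\n");
    return 0;
}
```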
The Intel Nehalem architecture implements three simple decoders and one complex decoder: the former translate instructions that map to a single μop, while the latter translates instructions of up to 4 μops. This approach saves power and reduces design complexity while leaving the decode bandwidth largely unaffected.
Some very complex instructions need to be translated into sequences of more than 4 μops. These instructions are sent to the complex decoder, which stops the current normal decoding flow, and the microsequencer (MSROM) unit takes control of the subsequent decoding. The MSROM consists of a sequencer circuit and a ROM array, and it outputs a microcode program that emulates the complex instruction; this microcode program is essentially a pre-programmed sequence of ordinary μops.
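Here is a hypothetical sketch of that idea: each complex instruction maps to an entry point in a ROM of pre-programmed μops, and a sequencer streams μops from that entry point until an end-of-sequence marker. The routine contents and names are purely illustrative, not Intel’s actual microcode.

```c
#include <stdio.h>
#include <stdbool.h>

typedef struct {
    const char *text;   /* the μop itself (as a string, for illustration) */
    bool        last;   /* end-of-sequence marker                         */
} ms_uop_t;

/* ROM array: two toy microcode routines laid out back to back. */
static const ms_uop_t msrom[] = {
    /* entry 0: toy routine for a string-copy style instruction */
    { "load  t0, [rsi]",    false },
    { "store [rdi], t0",    false },
    { "add   rsi, rsi, 1",  false },
    { "add   rdi, rdi, 1",  true  },
    /* entry 4: toy routine for another complex instruction */
    { "load  t0, [rsp]",    false },
    { "add   rsp, rsp, 8",  true  },
};

/* Sequencer: starting at an entry point, emit μops until 'last' is set. */
static void msrom_emit(int entry) {
    for (int pc = entry; ; pc++) {
        printf("  uop: %s\n", msrom[pc].text);
        if (msrom[pc].last) break;
    }
}

int main(void) {
    printf("complex instruction A:\n");
    msrom_emit(0);
    printf("complex instruction B:\n");
    msrom_emit(4);
    return 0;
}
```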