Diffstat (limited to 'src/rv32.tex')
| -rw-r--r-- | src/rv32.tex | 1370 |
1 files changed, 0 insertions, 1370 deletions
diff --git a/src/rv32.tex b/src/rv32.tex deleted file mode 100644 index ced9826..0000000 --- a/src/rv32.tex +++ /dev/null @@ -1,1370 +0,0 @@ -\chapter{RV32I Base Integer Instruction Set, Version 2.1} -\label{rv32} - -This chapter describes version 2.0 of the RV32I base integer -instruction set. - -\begin{commentary} -RV32I was designed to be sufficient to form a compiler target and to -support modern operating system environments. The ISA was also -designed to reduce the hardware required in a minimal implementation. -RV32I contains 40 unique instructions, though a simple implementation -might cover the ECALL/EBREAK instructions with a single SYSTEM -hardware instruction that always traps and might be able to implement -the FENCE instruction as a NOP, reducing base instruction count to 38 -total. RV32I can emulate almost any other ISA extension (except the A -extension, which requires additional hardware support for atomicity). - -In practice, a hardware implementation including the machine-mode -privileged architecture will also require the 6 CSR instructions. - -Subsets of the base integer ISA might be useful for pedagogical -purposes, but the base has been defined such that there should be -little incentive to subset a real hardware implementation beyond -omitting support for misaligned memory accesses and treating all -SYSTEM and FENCE instructions as a single trap. -\end{commentary} - -\begin{commentary} -Most of the commentary for RV32I also applies to the RV64I base. -\end{commentary} - -\section{Programmers' Model for Base Integer ISA} - -Figure~\ref{gprs} shows the unprivileged state for the base integer -ISA. There are 31 general-purpose registers {\tt x1}--{\tt x31}, -which hold integer values. Register {\tt x0} is hardwired to the -constant 0. There is no hardwired subroutine return address link -register, but the standard software calling convention uses register -{\tt x1} to hold the return address on a call. 
For RV32I, the {\tt x} -registers are 32 bits wide, i.e., XLEN=32. - -There is one additional unprivileged register: the program counter {\tt pc} -holds the address of the current instruction. - -\begin{commentary} -The number of available architectural registers can have large impacts -on code size, performance, and energy consumption. Although 16 -registers would arguably be sufficient for an integer ISA running -compiled code, it is impossible to encode a complete ISA with 16 -registers in 16-bit instructions using a 3-address format. Although a -2-address format would be possible, it would increase instruction -count and lower efficiency. We wanted to avoid intermediate -instruction sizes (such as Xtensa's 24-bit instructions) to simplify -base hardware implementations, and once a 32-bit instruction size was -adopted, it was straightforward to support 32 integer registers. A -larger number of integer registers also helps performance on -high-performance code, where there can be extensive use of loop -unrolling, software pipelining, and cache tiling. - -For these reasons, we chose a conventional size of 32 integer -registers for the base ISA. Dynamic register usage tends to be -dominated by a few frequently accessed registers, and regfile -implementations can be optimized to reduce access energy for the -frequently accessed registers~\cite{jtseng:sbbci}. The optional -compressed 16-bit instruction format mostly only accesses 8 registers -and hence can provide a dense instruction encoding, while additional -instruction-set extensions could support a much larger register space -(either flat or hierarchical) if desired. - -For resource-constrained embedded applications, we have defined the -RV32E subset, which only has 16 registers (Chapter~\ref{rv32e}). 
-\end{commentary} - -\begin{figure}[H] -{\footnotesize -\begin{center} -\begin{tabular}{p{2in}} -\instbitrange{XLEN-1}{0} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ \ \ \ x0 / zero}} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ \ x1\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ \ x2\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ \ x3\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ \ x4\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ \ x5\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ \ x6\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ \ x7\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ \ x8\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ \ x9\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ x10\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ x11\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ x12\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ x13\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ x14\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ x15\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ x16\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ x17\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ x18\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ x19\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ x20\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ x21\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ x22\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ x23\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ x24\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ x25\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ x26\ \ \ \ \ }} \\ 
\cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ x27\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ x28\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ x29\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ x30\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ x31\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{c}{XLEN} \\ - -\instbitrange{XLEN-1}{0} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{pc}} \\ \cline{1-1} -\multicolumn{1}{c}{XLEN} \\ -\end{tabular} -\end{center} -} -\caption{RISC-V base unprivileged integer register state.} -\label{gprs} -\end{figure} - -\newpage - -\section{Base Instruction Formats} - -In the base RV32I ISA, there are four core instruction formats -(R/I/S/U), as shown in Figure~\ref{fig:baseinstformats}. All are a -fixed 32 bits in length and must be aligned on a four-byte boundary in -memory. An instruction-address-misaligned exception is generated on a -taken branch or unconditional jump if the target address is not -four-byte aligned. This exception is reported on the branch or jump -instruction, not on the target instruction. No -instruction-address-misaligned exception is generated for a -conditional branch that is not taken. - -\begin{commentary} -The alignment constraint for base ISA instructions is relaxed to a -two-byte boundary when instruction extensions with 16-bit lengths or -other odd multiples of 16-bit lengths are added (i.e., IALIGN=16). - -Instruction-address-misaligned exceptions are reported on the branch -or jump that would cause instruction misalignment to help debugging, -and to simplify hardware design for systems with IALIGN=32, where these -are the only places where misalignment can occur. 
-\end{commentary} - -\vspace{-0.2in} -\begin{figure}[h] -\begin{center} -\setlength{\tabcolsep}{4pt} -\begin{tabular}{p{1.2in}@{}p{0.8in}@{}p{0.8in}@{}p{0.6in}@{}p{0.8in}@{}p{1in}l} -\\ -\instbitrange{31}{25} & -\instbitrange{24}{20} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\cline{1-6} -\multicolumn{1}{|c|}{funct7} & -\multicolumn{1}{c|}{rs2} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{funct3} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} & -R-type \\ -\cline{1-6} -\\ -\cline{1-6} -\multicolumn{2}{|c|}{imm[11:0]} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{funct3} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} & -I-type \\ -\cline{1-6} -\\ -\cline{1-6} -\multicolumn{1}{|c|}{imm[11:5]} & -\multicolumn{1}{c|}{rs2} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{funct3} & -\multicolumn{1}{c|}{imm[4:0]} & -\multicolumn{1}{c|}{opcode} & -S-type \\ -\cline{1-6} -\\ -\cline{1-6} -\multicolumn{4}{|c|}{imm[31:12]} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} & -U-type \\ -\cline{1-6} -\end{tabular} -\end{center} -\caption{RISC-V base instruction formats. Each immediate subfield is - labeled with the bit position (imm[{\em x}\,]) in the immediate - value being produced, rather than the bit position within the - instruction's immediate field as is usually done. } -\label{fig:baseinstformats} -\end{figure} - -The RISC-V ISA keeps the source ({\em rs1} and {\em rs2}) and -destination ({\em rd}) registers at the same position in all formats -to simplify decoding. Except for the 5-bit immediates used in CSR -instructions (Chapter~\ref{csrinsts}), immediates are always -sign-extended, and are generally packed towards the leftmost available -bits in the instruction and have been allocated to reduce hardware -complexity. In particular, the sign bit for all immediates is always -in bit 31 of the instruction to speed sign-extension circuitry. 
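
As an illustrative sketch (not part of the specification), the fixed field positions described above can be modeled in a few lines of Python. The helper names \verb!fields!, \verb!sign_extend!, and \verb!i_immediate! are hypothetical and chosen only for this example; the bit positions come from Figure~\ref{fig:baseinstformats}.

```python
def fields(inst):
    """Extract the fixed-position fields of a 32-bit instruction word.

    Not every field is meaningful for every format (for example, an
    I-type instruction has no rs2), but the positions never move.
    """
    return {
        "opcode": inst & 0x7F,          # inst[6:0]
        "rd":     (inst >> 7) & 0x1F,   # inst[11:7]
        "funct3": (inst >> 12) & 0x7,   # inst[14:12]
        "rs1":    (inst >> 15) & 0x1F,  # inst[19:15]
        "rs2":    (inst >> 20) & 0x1F,  # inst[24:20]
        "funct7": (inst >> 25) & 0x7F,  # inst[31:25]
    }


def sign_extend(value, width):
    """Interpret the low `width` bits of value as a two's-complement number."""
    sign = 1 << (width - 1)
    return (value & (sign - 1)) - (value & sign)


def i_immediate(inst):
    """I-type immediate: inst[31:20], with the sign taken from inst[31]."""
    return sign_extend(inst >> 20, 12)


# 0xFFF10093 is the encoding of ADDI x1, x2, -1.
word = 0xFFF10093
decoded = fields(word)
print(decoded["opcode"], decoded["rd"], decoded["rs1"], i_immediate(word))
```

Because the sign always lives in inst[31], \verb!sign_extend! needs no knowledge of the instruction format beyond the immediate width.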
- -\begin{commentary} -Decoding register specifiers is usually on the critical paths in -implementations, and so the instruction format was chosen to keep all -register specifiers at the same position in all formats at the expense -of having to move immediate bits across formats (a property shared -with RISC-IV aka. SPUR~\cite{spur-jsscc1989}). - -In practice, most immediates are either small or require all XLEN -bits. We chose an asymmetric immediate split (12 bits in regular -instructions plus a special load-upper-immediate instruction with 20 -bits) to increase the opcode space available for regular instructions. - -Immediates are sign-extended because we did not observe a benefit to -using zero-extension for some immediates as in the MIPS ISA and wanted -to keep the ISA as simple as possible. -\end{commentary} - -\section{Immediate Encoding Variants} - -There are a further two variants of the instruction formats (B/J) -based on the handling of immediates, as shown in -Figure~\ref{fig:baseinstformatsimm}. 
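
The B and J immediate layouts of Figure~\ref{fig:baseinstformatsimm} can be decoded as in the following sketch. It is only an illustration under our own naming (\verb!bits!, \verb!b_immediate!, and \verb!j_immediate! are hypothetical helpers), not a reference decoder.

```python
def bits(inst, hi, lo):
    """Return inst[hi:lo] as an unsigned integer."""
    return (inst >> lo) & ((1 << (hi - lo + 1)) - 1)


def sign_extend(value, width):
    """Interpret the low `width` bits of value as a two's-complement number."""
    sign = 1 << (width - 1)
    return (value & (sign - 1)) - (value & sign)


def b_immediate(inst):
    """B-type offset: imm[12|10:5] in inst[31:25], imm[4:1|11] in inst[11:7]."""
    imm = (
        (bits(inst, 31, 31) << 12)
        | (bits(inst, 7, 7) << 11)
        | (bits(inst, 30, 25) << 5)
        | (bits(inst, 11, 8) << 1)
    )  # imm[0] is always zero
    return sign_extend(imm, 13)


def j_immediate(inst):
    """J-type offset: imm[20|10:1|11|19:12] in inst[31:12]."""
    imm = (
        (bits(inst, 31, 31) << 20)
        | (bits(inst, 19, 12) << 12)
        | (bits(inst, 20, 20) << 11)
        | (bits(inst, 30, 21) << 1)
    )  # imm[0] is always zero
    return sign_extend(imm, 21)


# 0xFE000EE3 encodes BEQ x0, x0, -4; 0x008000EF encodes JAL x1, +8.
print(b_immediate(0xFE000EE3), j_immediate(0x008000EF))
```

In both decoders the sign bit comes from inst[31] and the low bit of the offset is implicitly zero, so the offsets are always multiples of 2.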
- -\begin{figure}[h] -\begin{small} -\begin{center} -\setlength{\tabcolsep}{4pt} -\begin{tabular}{p{0.3in}@{}p{0.8in}@{}p{0.6in}@{}p{0.18in}@{}p{0.7in}@{}p{0.6in}@{}p{0.6in}@{}p{0.3in}@{}p{0.5in}l} -\\ -\multicolumn{1}{c}{\instbit{31}} & -\instbitrange{30}{25} & -\instbitrange{24}{21} & -\multicolumn{1}{c}{\instbit{20}} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{8} & -\multicolumn{1}{c}{\instbit{7}} & -\instbitrange{6}{0} \\ -\cline{1-9} -\multicolumn{2}{|c|}{funct7} & -\multicolumn{2}{c|}{rs2} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{funct3} & -\multicolumn{2}{c|}{rd} & -\multicolumn{1}{c|}{opcode} & -R-type \\ -\cline{1-9} -\\ -\cline{1-9} -\multicolumn{4}{|c|}{imm[11:0]} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{funct3} & -\multicolumn{2}{c|}{rd} & -\multicolumn{1}{c|}{opcode} & -I-type \\ -\cline{1-9} -\\ -\cline{1-9} -\multicolumn{2}{|c|}{imm[11:5]} & -\multicolumn{2}{c|}{rs2} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{funct3} & -\multicolumn{2}{c|}{imm[4:0]} & -\multicolumn{1}{c|}{opcode} & -S-type \\ -\cline{1-9} -\\ -\cline{1-9} -\multicolumn{1}{|c|}{imm[12]} & -\multicolumn{1}{c|}{imm[10:5]} & -\multicolumn{2}{c|}{rs2} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{funct3} & -\multicolumn{1}{c|}{imm[4:1]} & -\multicolumn{1}{c|}{imm[11]} & -\multicolumn{1}{c|}{opcode} & -B-type \\ -\cline{1-9} -\\ -\cline{1-9} -\multicolumn{6}{|c|}{imm[31:12]} & -\multicolumn{2}{c|}{rd} & -\multicolumn{1}{c|}{opcode} & -U-type \\ -\cline{1-9} -\\ -\cline{1-9} -\multicolumn{1}{|c|}{imm[20]} & -\multicolumn{2}{c|}{imm[10:1]} & -\multicolumn{1}{c|}{imm[11]} & -\multicolumn{2}{c|}{imm[19:12]} & -\multicolumn{2}{c|}{rd} & -\multicolumn{1}{c|}{opcode} & -J-type \\ -\cline{1-9} -\end{tabular} -\end{center} -\end{small} -\caption{RISC-V base instruction formats showing immediate variants.} -\label{fig:baseinstformatsimm} -\end{figure} - -The only difference between the S and B formats is that the 12-bit -immediate field is 
used to encode branch offsets in multiples of 2 in -the B format. Instead of shifting all bits in the -instruction-encoded immediate left by one in hardware as is -conventionally done, the middle bits (imm[10:1]) and sign bit stay in -fixed positions, while the lowest bit in S format (inst[7]) encodes a -high-order bit in B format. - -Similarly, the only difference between the U and J formats is -that the 20-bit immediate is shifted left by 12 bits to form U -immediates and by 1 bit to form J immediates. The location of -instruction bits in the U and J format immediates is chosen to -maximize overlap with the other formats and with each other. - -Figure~\ref{fig:immtypes} shows the immediates produced by each of the -base instruction formats, and is labeled to show which instruction -bit (inst[{\em y}\,]) produces each bit of the immediate value. - -\begin{figure}[h] -\begin{center} -\setlength{\tabcolsep}{4pt} -\begin{tabular}{p{0.2in}@{}p{1.2in}@{}p{1.0in}@{}p{0.2in}@{}p{0.7in}@{}p{0.7in}@{}p{0.2in}l} -\\ -\multicolumn{1}{c}{\instbit{31}} & -\instbitrange{30}{20} & -\instbitrange{19}{12} & -\multicolumn{1}{c}{\instbit{11}} & -\instbitrange{10}{5} & -\instbitrange{4}{1} & -\multicolumn{1}{c}{\instbit{0}} & -\\ -\cline{1-7} -\multicolumn{4}{|c|}{--- inst[31] ---} & -\multicolumn{1}{c|}{inst[30:25]} & -\multicolumn{1}{c|}{inst[24:21]} & -\multicolumn{1}{c|}{inst[20]} & -I-immediate \\ -\cline{1-7} -\\ -\cline{1-7} -\multicolumn{4}{|c|}{--- inst[31] ---} & -\multicolumn{1}{c|}{inst[30:25]} & -\multicolumn{1}{c|}{inst[11:8]} & -\multicolumn{1}{c|}{inst[7]} & -S-immediate \\ -\cline{1-7} -\\ -\cline{1-7} -\multicolumn{3}{|c|}{--- inst[31] ---} & -\multicolumn{1}{c|}{inst[7]} & -\multicolumn{1}{c|}{inst[30:25]} & -\multicolumn{1}{c|}{inst[11:8]} & -\multicolumn{1}{c|}{0} & -B-immediate \\ -\cline{1-7} -\\ -\cline{1-7} -\multicolumn{1}{|c|}{inst[31]} & -\multicolumn{1}{c|}{inst[30:20]} & -\multicolumn{1}{c|}{inst[19:12]} & -\multicolumn{4}{c|}{--- 0 ---} & -U-immediate 
\\ -\cline{1-7} -\\ -\cline{1-7} -\multicolumn{2}{|c|}{--- inst[31] ---} & -\multicolumn{1}{c|}{inst[19:12]} & -\multicolumn{1}{c|}{inst[20]} & -\multicolumn{1}{c|}{inst[30:25]} & -\multicolumn{1}{c|}{inst[24:21]} & -\multicolumn{1}{c|}{0} & -J-immediate \\ -\cline{1-7} -\end{tabular} -\end{center} -\caption{Types of immediate produced by RISC-V instructions. The fields are labeled with the - instruction bits used to construct their value. Sign extension - always uses inst[31].} -\label{fig:immtypes} -\end{figure} - -\begin{commentary} -Sign-extension is one of the most critical operations on immediates -(particularly for XLEN>32), and in RISC-V the sign bit for all immediates -is always held in bit 31 of the instruction to allow sign-extension to -proceed in parallel with instruction decoding. - -Although more complex implementations might have separate adders for -branch and jump calculations and so would not benefit from keeping the -location of immediate bits constant across types of instruction, we -wanted to reduce the hardware cost of the simplest implementations. -By rotating bits in the instruction encoding of B and J immediates -instead of using dynamic hardware muxes to multiply the immediate by -2, we reduce instruction signal fanout and immediate mux costs by -around a factor of 2. The scrambled immediate encoding will add -negligible time to static or ahead-of-time compilation. For dynamic -generation of instructions, there is some small additional -overhead, but the most common short forward branches have -straightforward immediate encodings. -\end{commentary} - -\section{Integer Computational Instructions} - -Most integer computational instructions operate on XLEN bits of values -held in the integer register file. Integer computational instructions -are either encoded as register-immediate operations using the I-type -format or as register-register operations using the R-type format. 
-The destination is register {\em rd} for both register-immediate and -register-register instructions. No integer computational instructions -cause arithmetic exceptions. - -\begin{commentary} -We did not include special instruction-set support for overflow checks -on integer arithmetic operations in the base instruction set, as many -overflow checks can be cheaply implemented using RISC-V branches. -Overflow checking for unsigned addition requires only a single -additional branch instruction after the addition: -\verb! add t0, t1, t2; bltu t0, t1, overflow!. - -For signed addition, if one operand's sign is known, overflow checking -requires only a single branch after the addition: -\verb! addi t0, t1, +imm; blt t0, t1, overflow!. This covers the -common case of addition with an immediate operand. - -For general signed addition, three additional instructions after the -addition are required, leveraging the observation that the sum should -be less than one of the operands if and only if the other operand is -negative. -\begin{verbatim} - add t0, t1, t2 - slti t3, t2, 0 - slt t4, t0, t1 - bne t3, t4, overflow -\end{verbatim} -In RV64I, checks of 32-bit signed additions can be optimized further by -comparing the results of ADD and ADDW on the operands. -\end{commentary} - -\subsubsection*{Integer Register-Immediate Instructions} -\vspace{-0.4in} -\begin{center} -\begin{tabular}{M@{}R@{}S@{}R@{}O} -\\ -\instbitrange{31}{20} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{imm[11:0]} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{funct3} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} \\ -\hline -12 & 5 & 3 & 5 & 7 \\ -I-immediate[11:0] & src & ADDI/SLTI[U] & dest & OP-IMM \\ -I-immediate[11:0] & src & ANDI/ORI/XORI & dest & OP-IMM \\ -\end{tabular} -\end{center} -ADDI adds the sign-extended 12-bit immediate to register {\em rs1}. 
-Arithmetic overflow is ignored and the result is simply the low -XLEN bits of the result. ADDI {\em rd, rs1, 0} is used to implement the -MV {\em rd, rs1} assembler pseudoinstruction. - -SLTI (set less than immediate) places the value 1 in register {\em rd} -if register {\em rs1} is less than the sign-extended immediate when -both are treated as signed numbers, else 0 is written to {\em rd}. -SLTIU is similar but compares the values as unsigned numbers (i.e., -the immediate is first sign-extended to XLEN bits then treated as an -unsigned number). Note, SLTIU {\em rd, rs1, 1} sets {\em rd} -to 1 if {\em rs1} equals zero, otherwise sets {\em rd} to 0 (assembler -pseudoinstruction SEQZ {\em rd, rs}). - -ANDI, ORI, XORI are logical operations that perform bitwise AND, OR, -and XOR on register {\em rs1} and the sign-extended 12-bit immediate -and place the result in {\em rd}. Note, XORI {\em rd, rs1, -1} -performs a bitwise logical inversion of register {\em rs1} (assembler -pseudoinstruction NOT {\em rd, rs}). - -\vspace{-0.2in} -\begin{center} -\begin{tabular}{S@{}R@{}R@{}S@{}R@{}O} -\\ -\instbitrange{31}{25} & -\instbitrange{24}{20} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{imm[11:5]} & -\multicolumn{1}{c|}{imm[4:0]} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{funct3} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} \\ -\hline -7 & 5 & 5 & 3 & 5 & 7 \\ -0000000 & shamt[4:0] & src & SLLI & dest & OP-IMM \\ -0000000 & shamt[4:0] & src & SRLI & dest & OP-IMM \\ -0100000 & shamt[4:0] & src & SRAI & dest & OP-IMM \\ -\end{tabular} -\end{center} - -Shifts by a constant are encoded as a specialization of the -I-type format. The operand to be shifted is in {\em rs1}, and the -shift amount is encoded in the lower 5 bits of the I-immediate field. -The right shift type is encoded in bit 30. 
-SLLI is a logical left shift (zeros are shifted into the lower bits); -SRLI is a logical right shift (zeros are shifted into the upper bits); -and SRAI is an arithmetic right shift (the original sign bit is copied -into the vacated upper bits). - -\vspace{-0.2in} -\begin{center} -\begin{tabular}{U@{}R@{}O} -\\ -\instbitrange{31}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{imm[31:12]} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} \\ -\hline -20 & 5 & 7 \\ -U-immediate[31:12] & dest & LUI \\ -U-immediate[31:12] & dest & AUIPC -\end{tabular} -\end{center} - -LUI (load upper immediate) is used to build 32-bit constants and uses -the U-type format. LUI places the U-immediate value in the top 20 -bits of the destination register {\em rd}, filling in the lowest 12 -bits with zeros. - -AUIPC (add upper immediate to {\tt pc}) is used to build {\tt pc}-relative -addresses and uses the U-type format. AUIPC forms a 32-bit offset from the -20-bit U-immediate, filling in the lowest 12 bits with zeros, adds this offset -to the {\tt pc} of the AUIPC instruction, then places the result in register {\em rd}. - -\begin{commentary} -The AUIPC instruction supports two-instruction sequences to access -arbitrary offsets from the PC for both control-flow transfers and data -accesses. The combination of an AUIPC and the 12-bit immediate in a -JALR can transfer control to any 32-bit PC-relative address, while an -AUIPC plus the 12-bit immediate offset in regular load or store -instructions can access any 32-bit PC-relative data address. - -The current PC can be obtained by setting the U-immediate to 0. -Although a JAL +4 instruction could also be used to obtain the local -PC (of the instruction following the JAL), it might cause pipeline -breaks in simpler microarchitectures or pollute BTB structures in more -complex microarchitectures. 
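
The two-instruction AUIPC sequences mentioned above require splitting a 32-bit {\tt pc}-relative offset into a 20-bit upper part and a signed 12-bit lower part. Because the 12-bit immediate is sign-extended, the upper part has to be rounded rather than truncated. The sketch below shows one way to do the split; \verb!split_offset! and \verb!auipc_then_add! are hypothetical names used only for illustration.

```python
MASK32 = 0xFFFFFFFF


def split_offset(offset):
    """Split a 32-bit offset into (hi20, lo12) for AUIPC plus a 12-bit immediate.

    lo12 is a signed value in [-2048, 2047]; hi20 is the raw 20-bit
    U-immediate field. Adding 0x800 before shifting compensates for the
    sign extension of lo12.
    """
    offset &= MASK32
    hi20 = ((offset + 0x800) >> 12) & 0xFFFFF
    lo12 = offset & 0xFFF
    if lo12 >= 0x800:
        lo12 -= 0x1000
    return hi20, lo12


def auipc_then_add(pc, hi20, lo12):
    """Model AUIPC followed by an instruction that adds a signed 12-bit immediate."""
    upper = (pc + (hi20 << 12)) & MASK32  # AUIPC result
    return (upper + lo12) & MASK32        # JALR target or load/store address


pc = 0x00010000
hi20, lo12 = split_offset(0x12345FFC)
print(hex(hi20), lo12, hex(auipc_then_add(pc, hi20, lo12)))
```

The same split applies to LUI when forming absolute addresses, with the {\tt pc} term taken as zero.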
-\end{commentary} - -\subsubsection*{Integer Register-Register Operations} - -RV32I defines several arithmetic R-type operations. All operations -read the {\em rs1} and {\em rs2} registers as source operands and -write the result into register {\em rd}. The {\em funct7} and {\em - funct3} fields select the type of operation. - -\vspace{-0.2in} -\begin{center} -\begin{tabular}{S@{}R@{}R@{}S@{}R@{}O} -\\ -\instbitrange{31}{25} & -\instbitrange{24}{20} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{funct7} & -\multicolumn{1}{c|}{rs2} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{funct3} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} \\ -\hline -7 & 5 & 5 & 3 & 5 & 7 \\ -0000000 & src2 & src1 & ADD/SLT/SLTU & dest & OP \\ -0000000 & src2 & src1 & AND/OR/XOR & dest & OP \\ -0000000 & src2 & src1 & SLL/SRL & dest & OP \\ -0100000 & src2 & src1 & SUB/SRA & dest & OP \\ -\end{tabular} -\end{center} - -ADD performs the addition of {\em rs1} and {\em rs2}. SUB performs the -subtraction of {\em rs2} from {\em rs1}. Overflows are ignored and the low XLEN -bits of results are written to the destination {\em rd}. -SLT and SLTU perform signed and unsigned compares -respectively, writing 1 to {\em rd} if $\mbox{\em rs1} < \mbox{\em - rs2}$, 0 otherwise. Note, SLTU {\em rd}, {\em x0}, {\em rs2} sets -{\em rd} to 1 if {\em rs2} is not equal to zero, otherwise sets {\em - rd} to zero (assembler pseudoinstruction SNEZ {\em rd, rs}). AND, OR, and -XOR perform bitwise logical operations. - -SLL, SRL, and SRA perform logical left, logical right, and arithmetic -right shifts on the value in register {\em rs1} by the shift amount -held in the lower 5 bits of register {\em rs2}. 
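
The register-register semantics above can be summarized by a small executable model. This is a sketch for XLEN=32 that represents register values as unsigned Python integers; the function names simply mirror the instruction mnemonics and are not defined by the specification.

```python
XLEN = 32
MASK = (1 << XLEN) - 1


def to_signed(x):
    """Reinterpret an XLEN-bit pattern as a two's-complement integer."""
    x &= MASK
    return x - (1 << XLEN) if x >> (XLEN - 1) else x


def add(a, b):
    return (a + b) & MASK      # overflow ignored, low XLEN bits kept


def sub(a, b):
    return (a - b) & MASK


def slt(a, b):
    return int(to_signed(a) < to_signed(b))


def sltu(a, b):
    return int((a & MASK) < (b & MASK))


def sll(a, b):
    return (a << (b & 0x1F)) & MASK   # shift amount is the low 5 bits of rs2


def srl(a, b):
    return (a & MASK) >> (b & 0x1F)   # zeros shifted into the upper bits


def sra(a, b):
    return (to_signed(a) >> (b & 0x1F)) & MASK  # sign bit copied into upper bits


# SLTU rd, x0, rs2 behaves as SNEZ: the result is 1 exactly when rs2 != 0.
print(sltu(0, 5), sltu(0, 0), hex(sra(0x80000000, 31)))
```

Note that only the low 5 bits of the second operand matter for the shifts, so a shift amount of 33 behaves like a shift by 1.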
- -\subsubsection*{NOP Instruction} -\vspace{-0.4in} -\begin{center} -\begin{tabular}{M@{}R@{}S@{}R@{}O} -\\ -\instbitrange{31}{20} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{imm[11:0]} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{funct3} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} \\ -\hline -12 & 5 & 3 & 5 & 7 \\ -0 & 0 & ADDI & 0 & OP-IMM \\ -\end{tabular} -\end{center} - -The NOP instruction does not change any architecturally visible state, except for -advancing the {\tt pc} and incrementing any applicable performance -counters. NOP is encoded as ADDI {\em x0, x0, 0}. - -\begin{commentary} -NOPs can be used to align code segments to microarchitecturally -significant address boundaries, or to leave space for inline code -modifications. Although there are many possible ways to encode a NOP, -we define a canonical NOP encoding to allow microarchitectural -optimizations as well as for more readable disassembly output. The -other NOP encodings are made available for HINT instructions -(Section~\ref{sec:rv32i-hints}). - -ADDI was chosen for the NOP encoding as this is most likely to take -fewest resources to execute across a range of systems (if not -optimized away in decode). In particular, the instruction only reads -one register. Also, an ADDI functional unit is more likely to be -available in a superscalar design as adds are the most common -operation. In particular, address-generation functional units can -execute ADDI using the same hardware needed for base+offset address -calculations, while register-register ADD or logical/shift operations -require additional hardware. -\end{commentary} - -\section{Control Transfer Instructions} - -RV32I provides two types of control transfer instructions: -unconditional jumps and conditional branches. Control transfer -instructions in RV32I do {\em not} have architecturally visible delay -slots. 
- -\subsubsection*{Unconditional Jumps} - -\vspace{-0.1in} The jump and link (JAL) instruction uses the J-type -format, where the J-immediate encodes a signed offset in multiples of -2 bytes. The offset is sign-extended and added to the {\tt pc} -to form the jump target address. Jumps can therefore target a -$\pm$\wunits{1}{MiB} range. JAL stores the address of the instruction -following the jump ({\tt pc}+4) into register {\em rd}. The standard -software calling convention uses {\tt x1} as the return address -register and {\tt x5} as an alternate link register. - -\begin{commentary} -The alternate link register supports calling millicode routines (e.g., -those to save and restore registers in compressed code) while -preserving the regular return address register. The register {\tt x5} -was chosen as the alternate link register as it maps to a temporary in -the standard calling convention, and has an encoding that is only one -bit different than the regular link register. -\end{commentary} - -Plain unconditional jumps (assembler pseudoinstruction J) are encoded as a JAL -with {\em rd}={\tt x0}. - -\vspace{-0.2in} -\begin{center} -\begin{tabular}{W@{}E@{}W@{}R@{}R@{}O} -\\ -\multicolumn{1}{c}{\instbit{31}} & -\instbitrange{30}{21} & -\multicolumn{1}{c}{\instbit{20}} & -\instbitrange{19}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{imm[20]} & -\multicolumn{1}{c|}{imm[10:1]} & -\multicolumn{1}{c|}{imm[11]} & -\multicolumn{1}{c|}{imm[19:12]} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} \\ -\hline -1 & 10 & \multicolumn{1}{c}{1} & 8 & 5 & 7 \\ -\multicolumn{4}{c}{offset[20:1]} & dest & JAL \\ -\end{tabular} -\end{center} - -The indirect jump instruction JALR (jump and link register) uses the -I-type encoding. The target address is obtained by adding the sign-extended -12-bit I-immediate to the register {\em rs1}, then setting the -least-significant bit of the result to zero. 
The address of -the instruction following the jump ({\tt pc}+4) is written to register -{\em rd}. Register {\tt x0} can be used as the destination if the -result is not required. -\vspace{-0.4in} -\begin{center} -\begin{tabular}{M@{}R@{}F@{}R@{}O} -\\ -\instbitrange{31}{20} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{imm[11:0]} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{funct3} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} \\ -\hline -12 & 5 & 3 & 5 & 7 \\ -offset[11:0] & base & 0 & dest & JALR \\ -\end{tabular} -\end{center} - -\begin{commentary} -The unconditional jump instructions all use PC-relative addressing to -help support position-independent code. The JALR instruction was -defined to enable a two-instruction sequence to jump anywhere in a -32-bit absolute address range. A LUI instruction can first load {\em - rs1} with the upper 20 bits of a target address, then JALR can add -in the lower bits. Similarly, AUIPC then JALR can jump -anywhere in a 32-bit {\tt pc}-relative address range. - -Note that the JALR instruction does not treat the 12-bit immediate as -multiples of 2 bytes, unlike the conditional branch instructions. -This avoids one more immediate format in hardware. In -practice, most uses of JALR will have either a zero immediate or be -paired with a LUI or AUIPC, so the slight reduction in range is not -significant. - -Clearing the least-significant bit when calculating the JALR target -address both simplifies the hardware slightly and allows the -low bit of function pointers to be used to store auxiliary -information. Although there is potentially a slight loss of error -checking in this case, in practice jumps to an incorrect instruction -address will usually quickly raise an exception. 
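
As a sketch of the JALR target calculation for RV32 (with a hypothetical helper name \verb!jalr!), the sign-extended immediate is added to the base register, the least-significant bit of the sum is cleared, and {\tt pc}+4 is produced as the link value:

```python
MASK32 = 0xFFFFFFFF


def jalr(pc, rs1_value, imm12):
    """Return (target, link) for JALR, with imm12 given as a signed integer."""
    if not -2048 <= imm12 <= 2047:
        raise ValueError("immediate does not fit in 12 signed bits")
    target = (rs1_value + imm12) & MASK32 & ~1  # clear the least-significant bit
    link = (pc + 4) & MASK32                    # value written to rd
    return target, link


# An odd base address still yields an even target address.
print([hex(v) for v in jalr(0x00000100, 0x00001001, 0)])
```

With a base of zero (as when {\em rs1}$=${\tt x0}), a negative immediate wraps around to the top of the address space, for example \verb!jalr(pc, 0, -4)! targets \verb!0xFFFFFFFC!.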
- -When used with a base {\em rs1}$=${\tt x0}, JALR can be used to implement -a single instruction subroutine call to the lowest \wunits{2}{KiB} or highest -\wunits{2}{KiB} address region from anywhere in the address space, which could -be used to implement fast calls to a small runtime library. -\end{commentary} - -The JAL and JALR instructions will generate an -instruction-address-misaligned exception if the target address is not -aligned to a four-byte boundary. - -\begin{commentary} -Instruction-address-misaligned exceptions are not possible on machines -that support extensions with 16-bit aligned instructions, such as the -compressed instruction-set extension, C. -\end{commentary} - -Return-address prediction stacks are a common feature of -high-performance instruction-fetch units, but require accurate -detection of instructions used for procedure calls and returns to be -effective. For RISC-V, hints as to the instructions' usage are encoded -implicitly via the register numbers used. A JAL instruction should -push the return address onto a return-address stack (RAS) only when -{\em rd}$=${\tt x1}/{\tt x5}. JALR instructions should push/pop a -RAS as shown in the Table~\ref{rashints}. -\begin{table}[hbt] -\centering -\begin{tabular}{|c|c|c|l|} - \hline - \em rd & \em rs1 & {\em rs1}$=${\em rd} & RAS action \\ - \hline - !{\em link} & !{\em link} & - & none \\ - !{\em link} & {\em link} & - & pop \\ - {\em link} & !{\em link} & - & push \\ - {\em link} & {\em link} & 0 & pop, then push \\ - {\em link} & {\em link} & 1 & push \\ - \hline -\end{tabular} -\caption{Return-address stack prediction hints encoded in register - specifiers used in the instruction. In the above, {\em link} is - true when the register is either {\tt x1} or {\tt x5}.} -\label{rashints} -\end{table} - -\begin{commentary} -Some other ISAs added explicit hint bits to their indirect-jump instructions -to guide return-address stack manipulation. 
We use implicit hinting tied to -register numbers and the calling convention to reduce the encoding space used -for these hints. - -When two different link registers ({\tt x1} and {\tt x5}) are given as -{\em rs1} and {\em rd}, then the RAS is both popped and pushed to -support coroutines. If {\em rs1} and {\em rd} are the same link -register (either {\tt x1} or {\tt x5}), the RAS is only pushed to -enable macro-op fusion of the sequences:\linebreak -{\tt lui ra, imm20; jalr ra, imm12(ra)} \ and \ -{\tt auipc ra, imm20; jalr ra, imm12(ra)} -\end{commentary} - -\subsubsection*{Conditional Branches} - -All branch instructions use the B-type instruction format. The -12-bit B-immediate encodes signed offsets in multiples of 2, and is -added to the current {\tt pc} to give the target address. The -conditional branch range is $\pm$\wunits{4}{KiB}. - -\vspace{-0.2in} -\begin{center} -\begin{tabular}{W@{}R@{}F@{}F@{}R@{}R@{}F@{}S} -\\ -\multicolumn{1}{c}{\instbit{31}} & -\instbitrange{30}{25} & -\instbitrange{24}{20} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{8} & -\multicolumn{1}{c}{\instbit{7}} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{imm[12]} & -\multicolumn{1}{c|}{imm[10:5]} & -\multicolumn{1}{c|}{rs2} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{funct3} & -\multicolumn{1}{c|}{imm[4:1]} & -\multicolumn{1}{c|}{imm[11]} & -\multicolumn{1}{c|}{opcode} \\ -\hline -1 & 6 & 5 & 5 & 3 & 4 & 1 & 7 \\ -\multicolumn{2}{c}{offset[12,10:5]} & src2 & src1 & BEQ/BNE & \multicolumn{2}{c}{offset[11,4:1]} & BRANCH \\ -\multicolumn{2}{c}{offset[12,10:5]} & src2 & src1 & BLT[U] & \multicolumn{2}{c}{offset[11,4:1]} & BRANCH \\ -\multicolumn{2}{c}{offset[12,10:5]} & src2 & src1 & BGE[U] & \multicolumn{2}{c}{offset[11,4:1]} & BRANCH \\ -\end{tabular} -\end{center} - -Branch instructions compare two registers. BEQ and BNE take the -branch if registers {\em rs1} and {\em rs2} are equal or unequal -respectively. 
BLT and BLTU take the branch if {\em rs1} is less than -{\em rs2}, using signed and unsigned comparison respectively. BGE and -BGEU take the branch if {\em rs1} is greater than or equal to {\em rs2}, -using signed and unsigned comparison respectively. Note, BGT, BGTU, -BLE, and BLEU can be synthesized by reversing the operands to BLT, -BLTU, BGE, and BGEU, respectively. - -\begin{commentary} -Signed array bounds may be checked with a single BLTU instruction, since -any negative index will compare greater than any nonnegative bound. -\end{commentary} - -Software should be optimized such that the sequential code path is the -most common path, with less-frequently taken code paths placed out of -line. Software should also assume that backward branches will be -predicted taken and forward branches as not taken, at least the -first time they are encountered. Dynamic predictors should quickly -learn any predictable branch behavior. - -Unlike some other architectures, the RISC-V jump (JAL with {\em - rd}={\tt x0}) instruction should always be used for unconditional -branches instead of a conditional branch instruction with an -always-true condition. RISC-V jumps are also PC-relative and support -a much wider offset range than branches, and will not pollute -conditional-branch prediction tables. - -\begin{commentary} -The conditional branches were designed to include arithmetic -comparison operations between two registers (as also done in PA-RISC -and Xtensa ISA), rather than use condition codes (x86, ARM, SPARC, -PowerPC), or to only compare one register against zero (Alpha, MIPS), -or two registers only for equality (MIPS). This design was motivated -by the observation that a combined compare-and-branch instruction fits -into a regular pipeline, avoids additional condition code state or use -of a temporary register, and reduces static code size and dynamic -instruction fetch traffic. 
Another point is that comparisons against -zero require non-trivial circuit delay (especially after the move to -static logic in advanced processes) and so are almost as expensive as -arithmetic magnitude compares. Another advantage of a fused -compare-and-branch instruction is that branches are observed earlier -in the front-end instruction stream, and so can be predicted earlier. -There is perhaps an advantage to a design with condition codes in the -case where multiple branches can be taken based on the same condition -codes, but we believe this case to be relatively rare. - -We considered but did not include static branch hints in the -instruction encoding. These can reduce the pressure on dynamic -predictors, but require more instruction encoding space and -software profiling for best results, and can result in poor -performance if production runs do not match profiling runs. - -We considered but did not include conditional moves or predicated -instructions, which can effectively replace unpredictable short -forward branches. Conditional moves are the simpler of the two, but -are difficult to use with conditional code that might cause exceptions -(memory accesses and floating-point operations). Predication adds -additional flag state to a system, additional instructions to set and -clear flags, and additional encoding overhead on every instruction. -Both conditional move and predicated instructions add complexity to -out-of-order microarchitectures, adding an implicit third source -operand due to the need to copy the original value of the destination -architectural register into the renamed destination physical register -if the predicate is false. Also, static compile-time decisions to use -predication instead of branches can result in lower performance on -inputs not included in the compiler training set, especially given -that unpredictable branches are rare, and becoming rarer as branch -prediction techniques improve. 
- -We note that various microarchitectural techniques exist to -dynamically convert unpredictable short forward branches into -internally predicated code to avoid the cost of flushing pipelines on -a branch mispredict~\cite{heil-tr1996,Klauser-1998,Kim-micro2005} and -have been implemented in commercial processors~\cite{ibmpower7}. -The simplest techniques just reduce the penalty of recovering from a -mispredicted short forward branch by only flushing instructions in the -branch shadow instead of the entire fetch pipeline, or by fetching -instructions from both sides using wide instruction fetch or idle -instruction fetch slots. More complex techniques for out-of-order -cores add internal predicates on instructions in the branch shadow, -with the internal predicate value written by the branch instruction, -allowing the branch and following instructions to be executed -speculatively and out-of-order with respect to other code~\cite{ibmpower7}. -\end{commentary} - -\section{Load and Store Instructions} - -RV32I is a load-store architecture, where only load and store -instructions access memory and arithmetic instructions only operate on -CPU registers. RV32I provides a 32-bit address space that is -byte-addressed and little-endian. The EEI will -define what portions of the address space are legal to access with -which instructions (e.g., some addresses might be read-only, or -support word access only). Loads with a destination of {\tt x0} must -still raise any exceptions and cause any other side effects even -though the load value is discarded. 
- -\vspace{-0.4in} -\begin{center} -\begin{tabular}{M@{}R@{}F@{}R@{}O} -\\ -\instbitrange{31}{20} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{imm[11:0]} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{funct3} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} \\ -\hline -12 & 5 & 3 & 5 & 7 \\ -offset[11:0] & base & width & dest & LOAD \\ -\end{tabular} -\end{center} - -\vspace{-0.2in} -\begin{center} -\begin{tabular}{O@{}R@{}R@{}F@{}R@{}O} -\\ -\instbitrange{31}{25} & -\instbitrange{24}{20} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{imm[11:5]} & -\multicolumn{1}{c|}{rs2} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{funct3} & -\multicolumn{1}{c|}{imm[4:0]} & -\multicolumn{1}{c|}{opcode} \\ -\hline -7 & 5 & 5 & 3 & 5 & 7 \\ -offset[11:5] & src & base & width & offset[4:0] & STORE \\ -\end{tabular} -\end{center} - -Load and store instructions transfer a value between the registers and -memory. Loads are encoded in the I-type format and stores are -S-type. The effective byte address is obtained by adding register -{\em rs1} to the sign-extended 12-bit offset. Loads copy a value -from memory to register {\em rd}. Stores copy the value in register -{\em rs2} to memory. - -The LW instruction loads a 32-bit value from memory into {\em rd}. LH -loads a 16-bit value from memory, then sign-extends to 32 bits before -storing in {\em rd}. LHU loads a 16-bit value from memory but then -zero-extends to 32 bits before storing in {\em rd}. LB and LBU are -defined analogously for 8-bit values. The SW, SH, and SB instructions -store 32-bit, 16-bit, and 8-bit values from the low bits of register -{\em rs2} to memory. - -Regardless of EEI, loads and stores whose effective addresses are -naturally aligned shall not raise an address-misaligned exception. 
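
The address computation and the extension rules above can be sketched in Python as follows. The helper names ({\tt sign\_extend}, {\tt effective\_address}, {\tt load}) and the use of a {\tt bytes} object as memory are assumptions made purely for illustration. Wrapping the address to 32 bits is likewise an assumption of this sketch, motivated by the 32-bit address space. Alignment checks and exceptions are deliberately left out.

```python
MASK32 = 0xFFFFFFFF

def sign_extend(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit

def effective_address(rs1_value, imm12):
    """rs1 plus the sign-extended 12-bit offset, wrapped to 32 bits (assumed)."""
    return (rs1_value + sign_extend(imm12, 12)) & MASK32

def load(memory, addr, width, signed):
    """Model LB/LH/LW (signed=True) or LBU/LHU (signed=False); width in bytes."""
    # RV32I memory is little-endian, so the lowest address holds the low byte.
    raw = int.from_bytes(memory[addr:addr + width], "little")
    if signed:
        raw = sign_extend(raw, 8 * width)
    return raw & MASK32  # value as it would appear in a 32-bit register
```

For example, with {\tt memory = bytes([0x80, 0xFF, 0x34, 0x12])}, a signed byte load at address 0 gives {\tt 0xFFFFFF80} while the unsigned form gives {\tt 0x00000080}, which is the only difference between LB and LBU.
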
-Loads and stores where the effective address is not naturally aligned -to the referenced datatype (i.e., on a four-byte boundary for 32-bit -accesses, and a two-byte boundary for 16-bit accesses) have behavior -dependent on the EEI. - -An EEI may guarantee that misaligned loads and stores are fully -supported, and so the software running inside the execution -environment will never experience a contained or fatal -address-misaligned trap. In this case, the misaligned loads and -stores can be handled in hardware, or via an invisible trap into the -execution environment implementation, or possibly a combination of -hardware and invisible trap depending on address. - -An EEI may not guarantee misaligned loads and stores are handled -invisibly. In this case, loads and stores that are not naturally -aligned may either complete execution successfully or raise an -exception. The exception raised can be either an address-misaligned -exception or an access exception. For a memory access that would -otherwise be able to complete except for the misalignment, an access -exception can be raised instead of an address-misaligned exception if -the misaligned access should not be emulated, e.g., if accesses to the -memory region have side effects. When an EEI does not guarantee -misaligned loads and stores are handled invisibly, the EEI must define -if exceptions caused by address misalignment result in a contained -trap (allowing software running inside the execution environment to -handle the trap) or a fatal trap (terminating execution). - -\begin{commentary} -Misaligned accesses are occasionally required when porting legacy -code, and help performance on applications when using any form of -packed-SIMD extension or handling externally packed data structures. -Our rationale for allowing EEIs to choose to support misaligned -accesses via the regular load and store instructions is to simplify -the addition of misaligned hardware support. 
One option would have -been to disallow misaligned accesses in the base ISA and then provide -some separate ISA support for misaligned accesses, either special -instructions to help software handle misaligned accesses or a new -hardware addressing mode for misaligned accesses. Special -instructions are difficult to use, complicate the ISA, and often add -new processor state (e.g., SPARC VIS align address offset register) or -complicate access to existing processor state (e.g., MIPS LWL/LWR -partial register writes). In addition, for loop-oriented packed-SIMD -code, the extra overhead when operands are misaligned motivates -software to provide multiple forms of loop depending on operand -alignment, which complicates code generation and adds to loop startup -overhead. New misaligned hardware addressing modes take considerable -space in the instruction encoding or require very simplified -addressing modes (e.g., register indirect only). -\end{commentary} - -Even when misaligned loads and stores complete successfully, these -accesses might run extremely slowly depending on the implementation -(e.g., when implemented via an invisible trap). Furthermore, whereas -naturally aligned loads and stores are guaranteed to execute -atomically, misaligned loads and stores might not, and hence -require additional synchronization to ensure atomicity. - -\begin{commentary} -We do not mandate atomicity for misaligned accesses so execution -environment implementations can use an invisible machine trap and -a software handler to handle some or all misaligned accesses. If -hardware misaligned support is provided, software can exploit this by -simply using regular load and store instructions. Hardware can then -automatically optimize accesses depending on whether runtime addresses -are aligned. 
-\end{commentary} - -\pagebreak - -\section{Memory Ordering Instructions} -\label{sec:fence} - -\vspace{-0.2in} -\begin{center} -\begin{tabular}{F@{}IIIIIIIIF@{}F@{}F@{}S} -\\ -\instbitrange{31}{28} & -\multicolumn{1}{c}{\instbit{27}} & -\multicolumn{1}{c}{\instbit{26}} & -\multicolumn{1}{c}{\instbit{25}} & -\multicolumn{1}{c}{\instbit{24}} & -\multicolumn{1}{c}{\instbit{23}} & -\multicolumn{1}{c}{\instbit{22}} & -\multicolumn{1}{c}{\instbit{21}} & -\multicolumn{1}{c}{\instbit{20}} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{fm} & -\multicolumn{1}{c|}{PI} & -\multicolumn{1}{c|}{PO} & -\multicolumn{1}{c|}{PR} & -\multicolumn{1}{c|}{PW} & -\multicolumn{1}{|c|}{SI} & -\multicolumn{1}{c|}{SO} & -\multicolumn{1}{c|}{SR} & -\multicolumn{1}{c|}{SW} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{funct3} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} \\ -\hline -4 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 5 & 3 & 5 & 7 \\ -FM & \multicolumn{4}{c}{predecessor} & \multicolumn{4}{c}{successor} & 0 & FENCE & 0 & MISC-MEM \\ -\end{tabular} -\end{center} - -The FENCE instruction is used to order device I/O and -memory accesses as viewed by other RISC-V harts and external devices -or coprocessors. Any combination of device input (I), device output -(O), memory reads (R), and memory writes (W) may be ordered with -respect to any combination of the same. Informally, no other RISC-V -hart or external device can observe any operation in the {\em - successor} set following a FENCE before any operation in the {\em - predecessor} set preceding the FENCE. -Chapter~\ref{ch:memorymodel} provides a precise description of the -RISC-V memory consistency model. 
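
To make the field layout above concrete, here is a minimal Python sketch that assembles a FENCE instruction word from predecessor and successor sets. The names {\tt encode\_fence} and {\tt access\_set} are hypothetical. The MISC-MEM opcode value {\tt 0001111} and a {\em funct3} of zero for FENCE are not stated in this section and are assumed here from the RISC-V opcode map; {\em rs1} and {\em rd} are left as zero.

```python
MISC_MEM = 0b0001111  # assumed opcode value for MISC-MEM (not given in this section)

# Position of each access type within a 4-bit predecessor or successor set,
# matching the I, O, R, W order of instruction bits 27:24 and 23:20.
ACCESS_BITS = {"i": 0b1000, "o": 0b0100, "r": 0b0010, "w": 0b0001}

def access_set(letters):
    """Turn a string such as 'rw' or 'iorw' into a 4-bit set mask."""
    mask = 0
    for letter in letters:
        mask |= ACCESS_BITS[letter]
    return mask

def encode_fence(pred, succ, fm=0):
    """Assemble a FENCE word with rs1 = rd = 0 and funct3 = 0 (assumed)."""
    return (fm << 28) | (access_set(pred) << 24) | (access_set(succ) << 20) | MISC_MEM
```

Under these assumptions, {\tt encode\_fence("rw", "rw")} produces {\tt 0x0330000F}, and a fence over all four access types on both sides produces {\tt 0x0FF0000F}.
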
- -The EEI will define what I/O operations are possible, and in -particular, which memory addresses when accessed by load and store instructions will be treated and -ordered as device input and device output operations respectively -rather than memory reads and writes. For example, memory-mapped I/O -devices will typically be accessed with uncached loads and stores that -are ordered using the I and O bits rather than the R and W bits. -Instruction-set extensions might also describe new coprocessor I/O -instructions that will also be ordered using the I and O bits in a -FENCE. - -\begin{table}[htp] -\begin{small} -\begin{center} -\begin{tabular}{|c|c|l|} -\hline -{\em fm} field & Mnemonic & Meaning \\ -\hline -0000 & \em none & Normal Fence \\ -\hline -\multirow{2}{*}{1000} & \multirow{2}{*}{TSO} & With FENCE RW,RW: exclude write-to-read ordering \\ - & & Otherwise: \em Reserved for future use. \\ -\hline -\multicolumn{2}{|c|}{\em other} & \em Reserved for future use. \\ -\hline -\end{tabular} -\end{center} -\end{small} -\caption{Fence mode encoding.} -\label{fm} -\end{table} - -The fence mode field {\em fm} defines the semantics of the FENCE. A -FENCE with {\em fm}=0000 orders all memory operations in its -predecessor set before all memory operations in its successor set. - -The optional FENCE.TSO instruction with {\em fm}=1000 orders all load -operations in its predecessor set before all memory operations in its -successor set, and all store operations in its predecessor set before -all store operations in its successor set. This leaves non-AMO store -operations in the FENCE.TSO's predecessor set unordered with non-AMO -loads in its successor set. - -\begin{commentary} - The FENCE.TSO encoding was added as an optional extension to the - original base FENCE instruction encoding. The base definition - requires that implementations ignore any set bits and treat the - FENCE as global, and so this is a backwards-compatible extension. 
-\end{commentary} - -The unused fields in the FENCE instructions---{\em rs1} and {\em rd}---are -reserved for finer-grain fences in future extensions. For forward -compatibility, base implementations shall ignore these fields, and standard -software shall zero these fields. Likewise, many {\em fm} and -predecessor/successor set settings in Table~\ref{fm} are also reserved -for future use. Base implementations shall treat all such reserved -configurations as normal fences with {\em fm}=0000, and standard -software shall use only non-reserved configurations. - -\begin{commentary} -We chose a relaxed memory model to allow high performance from simple -machine implementations and from likely future -coprocessor or accelerator extensions. We separate out I/O ordering -from memory R/W ordering to avoid unnecessary serialization within a -device-driver hart and also to support alternative non-memory paths -to control added coprocessors or I/O devices. Simple implementations -may additionally ignore the {\em predecessor} and {\em successor} -fields and always execute a conservative fence on all operations. -\end{commentary} - -\section{Environment Call and Breakpoints} - -SYSTEM instructions are used to access system functionality that might -require privileged access and are encoded using the I-type instruction -format. These can be divided into two main classes: those that -atomically read-modify-write control and status registers (CSRs), and -all other potentially privileged instructions. CSR instructions are -described in Chapter~\ref{csrinsts}, and the base unprivileged instructions -are described in the following section. - -\begin{commentary} -The SYSTEM instructions are defined to allow simpler implementations -to always trap to a single software trap handler. More sophisticated -implementations might execute more of each system instruction in -hardware. 
-\end{commentary} - -\vspace{-0.2in} -\begin{center} -\begin{tabular}{M@{}R@{}F@{}R@{}S} -\\ -\instbitrange{31}{20} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{funct12} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{funct3} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} \\ -\hline -12 & 5 & 3 & 5 & 7 \\ -ECALL & 0 & PRIV & 0 & SYSTEM \\ -EBREAK & 0 & PRIV & 0 & SYSTEM \\ -\end{tabular} -\end{center} - -These two instructions cause a precise requested trap to the -supporting execution environment. - -The ECALL instruction is used to make a service request to the -execution environment. The EEI will define how parameters for the -service request are passed, but usually these will be in defined -locations in the integer register file. - -The EBREAK instruction is used to return control to a debugging -environment. - -\begin{commentary} -ECALL and EBREAK were previously named SCALL and SBREAK. The -instructions have the same functionality and encoding, but were -renamed to reflect that they can be used more generally than to call a -supervisor-level operating system or debugger. -\end{commentary} - -\begin{commentary} - EBREAK was primarily designed to be used by a debugger to cause - execution to stop and fall back into the debugger. EBREAK is also - used by the standard gcc compiler to mark code paths that should not - be executed. - - Another use of EBREAK is to support ``semihosting'', where the - execution environment includes a debugger that can provide services - over an alternate system call interface built around the EBREAK - instruction. Because the RISC-V base ISA does not provide more than - one EBREAK instruction, RISC-V semihosting uses a special sequence of - instructions to distinguish a semihosting EBREAK from a - debugger-inserted EBREAK. 
-\begin{verbatim}
-    slli x0, x0, 0x1f   # Entry NOP
-    ebreak              # Break to debugger
-    srai x0, x0, 7      # NOP encoding the semihosting call number 7
-\end{verbatim}
- Note that these three instructions must be 32-bit-wide instructions, - i.e., they must not be among the compressed 16-bit instructions - described in Chapter~\ref{compressed}. - - The shift NOP instructions are still considered available for use as - HINTs. - - Semihosting is a form of service call and would be more naturally - encoded as an ECALL using an existing ABI, but this would require - the debugger to be able to intercept ECALLs, which is a newer - addition to the debug standard. We intend to move over to using - ECALLs with a standard ABI, in which case, semihosting can share a - service ABI with an existing standard. - - We note that ARM processors have also moved to using SVC instead of - BKPT for semihosting calls in newer designs. -\end{commentary} - -\section{HINT Instructions} -\label{sec:rv32i-hints} - -RV32I reserves a large encoding space for HINT instructions, which are -usually used to communicate performance hints to the -microarchitecture. HINTs are encoded as integer computational -instructions with {\em rd}={\tt x0}. Hence, like the NOP instruction, -HINTs do not change any architecturally visible state, except for -advancing the {\tt pc} and any applicable performance counters. -Implementations are always allowed to ignore the encoded hints. - -\begin{commentary} -This HINT encoding has been chosen so that simple implementations can ignore -HINTs altogether, and instead execute a HINT as a regular computational -instruction that happens not to mutate the architectural state. For example, ADD is -a HINT if the destination register is {\tt x0}; the five-bit {\em rs1} and {\em -rs2} fields encode arguments to the HINT. However, a simple implementation can -simply execute the HINT as an ADD of {\em rs1} and {\em rs2} that writes {\tt -x0}, which has no architecturally visible effect. 
-\end{commentary} - -Table~\ref{tab:rv32i-hints} lists all RV32I HINT code points. 91\% of the HINT -space is reserved for standard HINTs, but none are presently defined. The -remainder of the HINT space is reserved for custom HINTs: no standard HINTs -will ever be defined in this subspace. - -\begin{commentary} -No standard hints are presently defined. We anticipate -standard hints to eventually include memory-system spatial and -temporal locality hints, branch prediction hints, thread-scheduling -hints, security tags, and instrumentation flags for -simulation/emulation. -\end{commentary} - -\begin{table}[hbt] -\centering -\begin{tabular}{|l|l|c|l|} - \hline - Instruction & Constraints & Code Points & Purpose \\ \hline \hline - LUI & {\em rd}={\tt x0} & $2^{20}$ & \multirow{15}{*}{\em Reserved for future standard use} \\ \cline{1-3} - AUIPC & {\em rd}={\tt x0} & $2^{20}$ & \\ \cline{1-3} - \multirow{2}{*}{ADDI} & {\em rd}={\tt x0}, and either & \multirow{2}{*}{$2^{17}-1$} & \\ - & {\em rs1}$\neq${\tt x0} or {\em imm}$\neq$0 & & \\ \cline{1-3} - ANDI & {\em rd}={\tt x0} & $2^{17}$ & \\ \cline{1-3} - ORI & {\em rd}={\tt x0} & $2^{17}$ & \\ \cline{1-3} - XORI & {\em rd}={\tt x0} & $2^{17}$ & \\ \cline{1-3} - ADD & {\em rd}={\tt x0} & $2^{10}$ & \\ \cline{1-3} - SUB & {\em rd}={\tt x0} & $2^{10}$ & \\ \cline{1-3} - AND & {\em rd}={\tt x0} & $2^{10}$ & \\ \cline{1-3} - OR & {\em rd}={\tt x0} & $2^{10}$ & \\ \cline{1-3} - XOR & {\em rd}={\tt x0} & $2^{10}$ & \\ \cline{1-3} - SLL & {\em rd}={\tt x0} & $2^{10}$ & \\ \cline{1-3} - SRL & {\em rd}={\tt x0} & $2^{10}$ & \\ \cline{1-3} - SRA & {\em rd}={\tt x0} & $2^{10}$ & \\ \cline{1-3} - FENCE & {\em pred}=0 or {\em succ}=0 & $2^{5}-1$ & \\ \hline \hline - SLTI & {\em rd}={\tt x0} & $2^{17}$ & \multirow{7}{*}{\em Reserved for custom use} \\ \cline{1-3} - SLTIU & {\em rd}={\tt x0} & $2^{17}$ & \\ \cline{1-3} - SLLI & {\em rd}={\tt x0} & $2^{10}$ & \\ \cline{1-3} - SRLI & {\em rd}={\tt x0} & $2^{10}$ & \\ \cline{1-3} - SRAI & 
{\em rd}={\tt x0} & $2^{10}$ & \\ \cline{1-3} - SLT & {\em rd}={\tt x0} & $2^{10}$ & \\ \cline{1-3} - SLTU & {\em rd}={\tt x0} & $2^{10}$ & \\ \hline -\end{tabular} -\caption{RV32I HINT instructions.} -\label{tab:rv32i-hints} -\end{table} -
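
The 91\% figure quoted for the standard HINT space can be checked by summing the code-point column of Table~\ref{tab:rv32i-hints}. The short Python sketch below does this arithmetic; the dictionary names are hypothetical and the counts are copied from the table rows.

```python
# Code points per instruction, copied from the HINT table.
STANDARD = {
    "LUI": 2**20, "AUIPC": 2**20,
    "ADDI": 2**17 - 1, "ANDI": 2**17, "ORI": 2**17, "XORI": 2**17,
    "ADD": 2**10, "SUB": 2**10, "AND": 2**10, "OR": 2**10,
    "XOR": 2**10, "SLL": 2**10, "SRL": 2**10, "SRA": 2**10,
    "FENCE": 2**5 - 1,
}
CUSTOM = {
    "SLTI": 2**17, "SLTIU": 2**17,
    "SLLI": 2**10, "SRLI": 2**10, "SRAI": 2**10,
    "SLT": 2**10, "SLTU": 2**10,
}

standard_total = sum(STANDARD.values())
custom_total = sum(CUSTOM.values())
standard_share = standard_total / (standard_total + custom_total)
```

With these counts the standard share comes out at roughly 0.908, which rounds to the 91\% stated in the text, leaving about 9\% of the HINT space for custom use.
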
