Diffstat (limited to 'src/rv32.tex')
-rw-r--r-- src/rv32.tex 1370
1 files changed, 0 insertions, 1370 deletions
diff --git a/src/rv32.tex b/src/rv32.tex
deleted file mode 100644
index ced9826..0000000
--- a/src/rv32.tex
+++ /dev/null
@@ -1,1370 +0,0 @@
-\chapter{RV32I Base Integer Instruction Set, Version 2.1}
-\label{rv32}
-
-This chapter describes version 2.1 of the RV32I base integer
-instruction set.
-
-\begin{commentary}
-RV32I was designed to be sufficient to form a compiler target and to
-support modern operating system environments. The ISA was also
-designed to reduce the hardware required in a minimal implementation.
-RV32I contains 40 unique instructions, though a simple implementation
-might cover the ECALL/EBREAK instructions with a single SYSTEM
-hardware instruction that always traps and might be able to implement
-the FENCE instruction as a NOP, reducing base instruction count to 38
-total. RV32I can emulate almost any other ISA extension (except the A
-extension, which requires additional hardware support for atomicity).
-
-In practice, a hardware implementation including the machine-mode
-privileged architecture will also require the 6 CSR instructions.
-
-Subsets of the base integer ISA might be useful for pedagogical
-purposes, but the base has been defined such that there should be
-little incentive to subset a real hardware implementation beyond
-omitting support for misaligned memory accesses and treating all
-SYSTEM and FENCE instructions as a single trap.
-\end{commentary}
-
-\begin{commentary}
-Most of the commentary for RV32I also applies to the RV64I base.
-\end{commentary}
-
-\section{Programmers' Model for Base Integer ISA}
-
-Figure~\ref{gprs} shows the unprivileged state for the base integer
-ISA. There are 31 general-purpose registers {\tt x1}--{\tt x31},
-which hold integer values. Register {\tt x0} is hardwired to the
-constant 0. There is no hardwired subroutine return address link
-register, but the standard software calling convention uses register
-{\tt x1} to hold the return address on a call. For RV32I, the {\tt x}
-registers are 32 bits wide, i.e., XLEN=32.
-
-There is one additional unprivileged register: the program counter {\tt pc}
-holds the address of the current instruction.
-
-\begin{commentary}
-The number of available architectural registers can have large impacts
-on code size, performance, and energy consumption. Although 16
-registers would arguably be sufficient for an integer ISA running
-compiled code, it is impossible to encode a complete ISA with 16
-registers in 16-bit instructions using a 3-address format. Although a
-2-address format would be possible, it would increase instruction
-count and lower efficiency. We wanted to avoid intermediate
-instruction sizes (such as Xtensa's 24-bit instructions) to simplify
-base hardware implementations, and once a 32-bit instruction size was
-adopted, it was straightforward to support 32 integer registers. A
-larger number of integer registers also helps performance on
-high-performance code, where there can be extensive use of loop
-unrolling, software pipelining, and cache tiling.
-
-For these reasons, we chose a conventional size of 32 integer
-registers for the base ISA. Dynamic register usage tends to be
-dominated by a few frequently accessed registers, and regfile
-implementations can be optimized to reduce access energy for the
-frequently accessed registers~\cite{jtseng:sbbci}. The optional
-compressed 16-bit instruction format mostly accesses only 8 registers
-and hence can provide a dense instruction encoding, while additional
-instruction-set extensions could support a much larger register space
-(either flat or hierarchical) if desired.
-
-For resource-constrained embedded applications, we have defined the
-RV32E subset, which only has 16 registers (Chapter~\ref{rv32e}).
-\end{commentary}
-
-\begin{figure}[H]
-{\footnotesize
-\begin{center}
-\begin{tabular}{p{2in}}
-\instbitrange{XLEN-1}{0} \\ \cline{1-1}
-\multicolumn{1}{|c|}{\reglabel{\ \ \ \ \ \ x0 / zero}} \\ \cline{1-1}
-\multicolumn{1}{|c|}{\reglabel{\ \ \ \ x1\ \ \ \ \ }} \\ \cline{1-1}
-\multicolumn{1}{|c|}{\reglabel{\ \ \ \ x2\ \ \ \ \ }} \\ \cline{1-1}
-\multicolumn{1}{|c|}{\reglabel{\ \ \ \ x3\ \ \ \ \ }} \\ \cline{1-1}
-\multicolumn{1}{|c|}{\reglabel{\ \ \ \ x4\ \ \ \ \ }} \\ \cline{1-1}
-\multicolumn{1}{|c|}{\reglabel{\ \ \ \ x5\ \ \ \ \ }} \\ \cline{1-1}
-\multicolumn{1}{|c|}{\reglabel{\ \ \ \ x6\ \ \ \ \ }} \\ \cline{1-1}
-\multicolumn{1}{|c|}{\reglabel{\ \ \ \ x7\ \ \ \ \ }} \\ \cline{1-1}
-\multicolumn{1}{|c|}{\reglabel{\ \ \ \ x8\ \ \ \ \ }} \\ \cline{1-1}
-\multicolumn{1}{|c|}{\reglabel{\ \ \ \ x9\ \ \ \ \ }} \\ \cline{1-1}
-\multicolumn{1}{|c|}{\reglabel{\ \ \ x10\ \ \ \ \ }} \\ \cline{1-1}
-\multicolumn{1}{|c|}{\reglabel{\ \ \ x11\ \ \ \ \ }} \\ \cline{1-1}
-\multicolumn{1}{|c|}{\reglabel{\ \ \ x12\ \ \ \ \ }} \\ \cline{1-1}
-\multicolumn{1}{|c|}{\reglabel{\ \ \ x13\ \ \ \ \ }} \\ \cline{1-1}
-\multicolumn{1}{|c|}{\reglabel{\ \ \ x14\ \ \ \ \ }} \\ \cline{1-1}
-\multicolumn{1}{|c|}{\reglabel{\ \ \ x15\ \ \ \ \ }} \\ \cline{1-1}
-\multicolumn{1}{|c|}{\reglabel{\ \ \ x16\ \ \ \ \ }} \\ \cline{1-1}
-\multicolumn{1}{|c|}{\reglabel{\ \ \ x17\ \ \ \ \ }} \\ \cline{1-1}
-\multicolumn{1}{|c|}{\reglabel{\ \ \ x18\ \ \ \ \ }} \\ \cline{1-1}
-\multicolumn{1}{|c|}{\reglabel{\ \ \ x19\ \ \ \ \ }} \\ \cline{1-1}
-\multicolumn{1}{|c|}{\reglabel{\ \ \ x20\ \ \ \ \ }} \\ \cline{1-1}
-\multicolumn{1}{|c|}{\reglabel{\ \ \ x21\ \ \ \ \ }} \\ \cline{1-1}
-\multicolumn{1}{|c|}{\reglabel{\ \ \ x22\ \ \ \ \ }} \\ \cline{1-1}
-\multicolumn{1}{|c|}{\reglabel{\ \ \ x23\ \ \ \ \ }} \\ \cline{1-1}
-\multicolumn{1}{|c|}{\reglabel{\ \ \ x24\ \ \ \ \ }} \\ \cline{1-1}
-\multicolumn{1}{|c|}{\reglabel{\ \ \ x25\ \ \ \ \ }} \\ \cline{1-1}
-\multicolumn{1}{|c|}{\reglabel{\ \ \ x26\ \ \ \ \ }} \\ \cline{1-1}
-\multicolumn{1}{|c|}{\reglabel{\ \ \ x27\ \ \ \ \ }} \\ \cline{1-1}
-\multicolumn{1}{|c|}{\reglabel{\ \ \ x28\ \ \ \ \ }} \\ \cline{1-1}
-\multicolumn{1}{|c|}{\reglabel{\ \ \ x29\ \ \ \ \ }} \\ \cline{1-1}
-\multicolumn{1}{|c|}{\reglabel{\ \ \ x30\ \ \ \ \ }} \\ \cline{1-1}
-\multicolumn{1}{|c|}{\reglabel{\ \ \ x31\ \ \ \ \ }} \\ \cline{1-1}
-\multicolumn{1}{c}{XLEN} \\
-
-\instbitrange{XLEN-1}{0} \\ \cline{1-1}
-\multicolumn{1}{|c|}{\reglabel{pc}} \\ \cline{1-1}
-\multicolumn{1}{c}{XLEN} \\
-\end{tabular}
-\end{center}
-}
-\caption{RISC-V base unprivileged integer register state.}
-\label{gprs}
-\end{figure}
-
-\newpage
-
-\section{Base Instruction Formats}
-
-In the base RV32I ISA, there are four core instruction formats
-(R/I/S/U), as shown in Figure~\ref{fig:baseinstformats}. All are a
-fixed 32 bits in length and must be aligned on a four-byte boundary in
-memory. An instruction-address-misaligned exception is generated on a
-taken branch or unconditional jump if the target address is not
-four-byte aligned. This exception is reported on the branch or jump
-instruction, not on the target instruction. No
-instruction-address-misaligned exception is generated for a
-conditional branch that is not taken.
-
-\begin{commentary}
-The alignment constraint for base ISA instructions is relaxed to a
-two-byte boundary when instruction extensions with 16-bit lengths or
-other odd multiples of 16-bit lengths are added (i.e., IALIGN=16).
-
-Instruction-address-misaligned exceptions are reported on the branch
-or jump that would cause instruction misalignment to help debugging,
-and to simplify hardware design for systems with IALIGN=32, where these
-are the only places where misalignment can occur.
-\end{commentary}
-
-\vspace{-0.2in}
-\begin{figure}[h]
-\begin{center}
-\setlength{\tabcolsep}{4pt}
-\begin{tabular}{p{1.2in}@{}p{0.8in}@{}p{0.8in}@{}p{0.6in}@{}p{0.8in}@{}p{1in}l}
-\\
-\instbitrange{31}{25} &
-\instbitrange{24}{20} &
-\instbitrange{19}{15} &
-\instbitrange{14}{12} &
-\instbitrange{11}{7} &
-\instbitrange{6}{0} \\
-\cline{1-6}
-\multicolumn{1}{|c|}{funct7} &
-\multicolumn{1}{c|}{rs2} &
-\multicolumn{1}{c|}{rs1} &
-\multicolumn{1}{c|}{funct3} &
-\multicolumn{1}{c|}{rd} &
-\multicolumn{1}{c|}{opcode} &
-R-type \\
-\cline{1-6}
-\\
-\cline{1-6}
-\multicolumn{2}{|c|}{imm[11:0]} &
-\multicolumn{1}{c|}{rs1} &
-\multicolumn{1}{c|}{funct3} &
-\multicolumn{1}{c|}{rd} &
-\multicolumn{1}{c|}{opcode} &
-I-type \\
-\cline{1-6}
-\\
-\cline{1-6}
-\multicolumn{1}{|c|}{imm[11:5]} &
-\multicolumn{1}{c|}{rs2} &
-\multicolumn{1}{c|}{rs1} &
-\multicolumn{1}{c|}{funct3} &
-\multicolumn{1}{c|}{imm[4:0]} &
-\multicolumn{1}{c|}{opcode} &
-S-type \\
-\cline{1-6}
-\\
-\cline{1-6}
-\multicolumn{4}{|c|}{imm[31:12]} &
-\multicolumn{1}{c|}{rd} &
-\multicolumn{1}{c|}{opcode} &
-U-type \\
-\cline{1-6}
-\end{tabular}
-\end{center}
-\caption{RISC-V base instruction formats. Each immediate subfield is
- labeled with the bit position (imm[{\em x}\,]) in the immediate
- value being produced, rather than the bit position within the
- instruction's immediate field as is usually done. }
-\label{fig:baseinstformats}
-\end{figure}
-
-The RISC-V ISA keeps the source ({\em rs1} and {\em rs2}) and
-destination ({\em rd}) registers at the same position in all formats
-to simplify decoding. Except for the 5-bit immediates used in CSR
-instructions (Chapter~\ref{csrinsts}), immediates are always
-sign-extended, and are generally packed towards the leftmost available
-bits in the instruction and have been allocated to reduce hardware
-complexity. In particular, the sign bit for all immediates is always
-in bit 31 of the instruction to speed sign-extension circuitry.
-
-\begin{commentary}
-Decoding register specifiers is usually on the critical paths in
-implementations, and so the instruction format was chosen to keep all
-register specifiers at the same position in all formats at the expense
-of having to move immediate bits across formats (a property shared
-with RISC-IV, a.k.a.\ SPUR~\cite{spur-jsscc1989}).
-
-In practice, most immediates are either small or require all XLEN
-bits. We chose an asymmetric immediate split (12 bits in regular
-instructions plus a special load-upper-immediate instruction with 20
-bits) to increase the opcode space available for regular instructions.
-
-Immediates are sign-extended because we did not observe a benefit to
-using zero-extension for some immediates as in the MIPS ISA and wanted
-to keep the ISA as simple as possible.
-\end{commentary}
-
-\section{Immediate Encoding Variants}
-
-There are a further two variants of the instruction formats (B/J)
-based on the handling of immediates, as shown in
-Figure~\ref{fig:baseinstformatsimm}.
-
-\begin{figure}[h]
-\begin{small}
-\begin{center}
-\setlength{\tabcolsep}{4pt}
-\begin{tabular}{p{0.3in}@{}p{0.8in}@{}p{0.6in}@{}p{0.18in}@{}p{0.7in}@{}p{0.6in}@{}p{0.6in}@{}p{0.3in}@{}p{0.5in}l}
-\\
-\multicolumn{1}{c}{\instbit{31}} &
-\instbitrange{30}{25} &
-\instbitrange{24}{21} &
-\multicolumn{1}{c}{\instbit{20}} &
-\instbitrange{19}{15} &
-\instbitrange{14}{12} &
-\instbitrange{11}{8} &
-\multicolumn{1}{c}{\instbit{7}} &
-\instbitrange{6}{0} \\
-\cline{1-9}
-\multicolumn{2}{|c|}{funct7} &
-\multicolumn{2}{c|}{rs2} &
-\multicolumn{1}{c|}{rs1} &
-\multicolumn{1}{c|}{funct3} &
-\multicolumn{2}{c|}{rd} &
-\multicolumn{1}{c|}{opcode} &
-R-type \\
-\cline{1-9}
-\\
-\cline{1-9}
-\multicolumn{4}{|c|}{imm[11:0]} &
-\multicolumn{1}{c|}{rs1} &
-\multicolumn{1}{c|}{funct3} &
-\multicolumn{2}{c|}{rd} &
-\multicolumn{1}{c|}{opcode} &
-I-type \\
-\cline{1-9}
-\\
-\cline{1-9}
-\multicolumn{2}{|c|}{imm[11:5]} &
-\multicolumn{2}{c|}{rs2} &
-\multicolumn{1}{c|}{rs1} &
-\multicolumn{1}{c|}{funct3} &
-\multicolumn{2}{c|}{imm[4:0]} &
-\multicolumn{1}{c|}{opcode} &
-S-type \\
-\cline{1-9}
-\\
-\cline{1-9}
-\multicolumn{1}{|c|}{imm[12]} &
-\multicolumn{1}{c|}{imm[10:5]} &
-\multicolumn{2}{c|}{rs2} &
-\multicolumn{1}{c|}{rs1} &
-\multicolumn{1}{c|}{funct3} &
-\multicolumn{1}{c|}{imm[4:1]} &
-\multicolumn{1}{c|}{imm[11]} &
-\multicolumn{1}{c|}{opcode} &
-B-type \\
-\cline{1-9}
-\\
-\cline{1-9}
-\multicolumn{6}{|c|}{imm[31:12]} &
-\multicolumn{2}{c|}{rd} &
-\multicolumn{1}{c|}{opcode} &
-U-type \\
-\cline{1-9}
-\\
-\cline{1-9}
-\multicolumn{1}{|c|}{imm[20]} &
-\multicolumn{2}{c|}{imm[10:1]} &
-\multicolumn{1}{c|}{imm[11]} &
-\multicolumn{2}{c|}{imm[19:12]} &
-\multicolumn{2}{c|}{rd} &
-\multicolumn{1}{c|}{opcode} &
-J-type \\
-\cline{1-9}
-\end{tabular}
-\end{center}
-\end{small}
-\caption{RISC-V base instruction formats showing immediate variants.}
-\label{fig:baseinstformatsimm}
-\end{figure}
-
-The only difference between the S and B formats is that the 12-bit
-immediate field is used to encode branch offsets in multiples of 2 in
-the B format. Instead of shifting all bits in the
-instruction-encoded immediate left by one in hardware as is
-conventionally done, the middle bits (imm[10:1]) and sign bit stay in
-fixed positions, while the lowest bit in S format (inst[7]) encodes a
-high-order bit in B format.
-
-Similarly, the only difference between the U and J formats is
-that the 20-bit immediate is shifted left by 12 bits to form U
-immediates and by 1 bit to form J immediates. The location of
-instruction bits in the U and J format immediates is chosen to
-maximize overlap with the other formats and with each other.
-
-Figure~\ref{fig:immtypes} shows the immediates produced by each of the
-base instruction formats, and is labeled to show which instruction
-bit (inst[{\em y}\,]) produces each bit of the immediate value.
-
-\begin{figure}[h]
-\begin{center}
-\setlength{\tabcolsep}{4pt}
-\begin{tabular}{p{0.2in}@{}p{1.2in}@{}p{1.0in}@{}p{0.2in}@{}p{0.7in}@{}p{0.7in}@{}p{0.2in}l}
-\\
-\multicolumn{1}{c}{\instbit{31}} &
-\instbitrange{30}{20} &
-\instbitrange{19}{12} &
-\multicolumn{1}{c}{\instbit{11}} &
-\instbitrange{10}{5} &
-\instbitrange{4}{1} &
-\multicolumn{1}{c}{\instbit{0}} &
-\\
-\cline{1-7}
-\multicolumn{4}{|c|}{--- inst[31] ---} &
-\multicolumn{1}{c|}{inst[30:25]} &
-\multicolumn{1}{c|}{inst[24:21]} &
-\multicolumn{1}{c|}{inst[20]} &
-I-immediate \\
-\cline{1-7}
-\\
-\cline{1-7}
-\multicolumn{4}{|c|}{--- inst[31] ---} &
-\multicolumn{1}{c|}{inst[30:25]} &
-\multicolumn{1}{c|}{inst[11:8]} &
-\multicolumn{1}{c|}{inst[7]} &
-S-immediate \\
-\cline{1-7}
-\\
-\cline{1-7}
-\multicolumn{3}{|c|}{--- inst[31] ---} &
-\multicolumn{1}{c|}{inst[7]} &
-\multicolumn{1}{c|}{inst[30:25]} &
-\multicolumn{1}{c|}{inst[11:8]} &
-\multicolumn{1}{c|}{0} &
-B-immediate \\
-\cline{1-7}
-\\
-\cline{1-7}
-\multicolumn{1}{|c|}{inst[31]} &
-\multicolumn{1}{c|}{inst[30:20]} &
-\multicolumn{1}{c|}{inst[19:12]} &
-\multicolumn{4}{c|}{--- 0 ---} &
-U-immediate \\
-\cline{1-7}
-\\
-\cline{1-7}
-\multicolumn{2}{|c|}{--- inst[31] ---} &
-\multicolumn{1}{c|}{inst[19:12]} &
-\multicolumn{1}{c|}{inst[20]} &
-\multicolumn{1}{c|}{inst[30:25]} &
-\multicolumn{1}{c|}{inst[24:21]} &
-\multicolumn{1}{c|}{0} &
-J-immediate \\
-\cline{1-7}
-\end{tabular}
-\end{center}
-\caption{Types of immediate produced by RISC-V instructions. The fields are labeled with the
- instruction bits used to construct their value. Sign extension
- always uses inst[31].}
-\label{fig:immtypes}
-\end{figure}
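-
-\begin{commentary}
-As an illustrative worked example (a sketch, not part of the normative
-text), consider encoding {\tt beq x1, x2, +8}, assuming the standard
-BRANCH major opcode 1100011 and {\em funct3}=000 for BEQ. The byte
-offset 8 is 0000000001000 as a 13-bit B-immediate, so imm[12]=0,
-imm[11]=0, imm[10:5]=000000, and imm[4:1]=0100. Placing each subfield
-as in Figure~\ref{fig:baseinstformatsimm} gives:
-\begin{verbatim}
- inst[31]     imm[12]     0
- inst[30:25]  imm[10:5]   000000
- inst[24:20]  rs2 = x2    00010
- inst[19:15]  rs1 = x1    00001
- inst[14:12]  funct3      000
- inst[11:8]   imm[4:1]    0100
- inst[7]      imm[11]     0
- inst[6:0]    opcode      1100011
-\end{verbatim}
-Concatenating these fields yields the instruction word {\tt 0x00208463}.
-\end{commentary}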
-
-\begin{commentary}
-Sign-extension is one of the most critical operations on immediates
-(particularly for XLEN>32), and in RISC-V the sign bit for all immediates
-is always held in bit 31 of the instruction to allow sign-extension to
-proceed in parallel with instruction decoding.
-
-Although more complex implementations might have separate adders for
-branch and jump calculations and so would not benefit from keeping the
-location of immediate bits constant across types of instruction, we
-wanted to reduce the hardware cost of the simplest implementations.
-By rotating bits in the instruction encoding of B and J immediates
-instead of using dynamic hardware muxes to multiply the immediate by
-2, we reduce instruction signal fanout and immediate mux costs by
-around a factor of 2. The scrambled immediate encoding will add
-negligible time to static or ahead-of-time compilation. For dynamic
-generation of instructions, there is some small additional
-overhead, but the most common short forward branches have
-straightforward immediate encodings.
-\end{commentary}
-
-\section{Integer Computational Instructions}
-
-Most integer computational instructions operate on XLEN bits of values
-held in the integer register file. Integer computational instructions
-are either encoded as register-immediate operations using the I-type
-format or as register-register operations using the R-type format.
-The destination is register {\em rd} for both register-immediate and
-register-register instructions. No integer computational instructions
-cause arithmetic exceptions.
-
-\begin{commentary}
-We did not include special instruction-set support for overflow checks
-on integer arithmetic operations in the base instruction set, as many
-overflow checks can be cheaply implemented using RISC-V branches.
-Overflow checking for unsigned addition requires only a single
-additional branch instruction after the addition:
-\verb! add t0, t1, t2; bltu t0, t1, overflow!.
-
-For signed addition, if one operand's sign is known, overflow checking
-requires only a single branch after the addition:
-\verb! addi t0, t1, +imm; blt t0, t1, overflow!. This covers the
-common case of addition with an immediate operand.
-
-For general signed addition, three additional instructions after the
-addition are required, leveraging the observation that the sum should
-be less than one of the operands if and only if the other operand is
-negative.
-\begin{verbatim}
- add t0, t1, t2
- slti t3, t2, 0
- slt t4, t0, t1
- bne t3, t4, overflow
-\end{verbatim}
-In RV64I, checks of 32-bit signed additions can be optimized further by
-comparing the results of ADD and ADDW on the operands.
-\end{commentary}
-
-\subsubsection*{Integer Register-Immediate Instructions}
-\vspace{-0.4in}
-\begin{center}
-\begin{tabular}{M@{}R@{}S@{}R@{}O}
-\\
-\instbitrange{31}{20} &
-\instbitrange{19}{15} &
-\instbitrange{14}{12} &
-\instbitrange{11}{7} &
-\instbitrange{6}{0} \\
-\hline
-\multicolumn{1}{|c|}{imm[11:0]} &
-\multicolumn{1}{c|}{rs1} &
-\multicolumn{1}{c|}{funct3} &
-\multicolumn{1}{c|}{rd} &
-\multicolumn{1}{c|}{opcode} \\
-\hline
-12 & 5 & 3 & 5 & 7 \\
-I-immediate[11:0] & src & ADDI/SLTI[U] & dest & OP-IMM \\
-I-immediate[11:0] & src & ANDI/ORI/XORI & dest & OP-IMM \\
-\end{tabular}
-\end{center}
-ADDI adds the sign-extended 12-bit immediate to register {\em rs1}.
-Arithmetic overflow is ignored and the result is simply the low
-XLEN bits of the result. ADDI {\em rd, rs1, 0} is used to implement the
-MV {\em rd, rs1} assembler pseudoinstruction.
-
-SLTI (set less than immediate) places the value 1 in register {\em rd}
-if register {\em rs1} is less than the sign-extended immediate when
-both are treated as signed numbers, else 0 is written to {\em rd}.
-SLTIU is similar but compares the values as unsigned numbers (i.e.,
-the immediate is first sign-extended to XLEN bits then treated as an
-unsigned number). Note, SLTIU {\em rd, rs1, 1} sets {\em rd}
-to 1 if {\em rs1} equals zero, otherwise sets {\em rd} to 0 (assembler
-pseudoinstruction SEQZ {\em rd, rs}).
-
-ANDI, ORI, XORI are logical operations that perform bitwise AND, OR,
-and XOR on register {\em rs1} and the sign-extended 12-bit immediate
-and place the result in {\em rd}. Note, XORI {\em rd, rs1, -1}
-performs a bitwise logical inversion of register {\em rs1} (assembler
-pseudoinstruction NOT {\em rd, rs}).
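-
-\begin{commentary}
-The pseudoinstruction expansions mentioned above can be summarized in
-a short sketch; the register choices {\tt t0} and {\tt t1} are
-arbitrary and used only for illustration.
-\begin{verbatim}
-    addi  t0, t1, 0     # mv   t0, t1 : copy t1 into t0
-    sltiu t0, t1, 1     # seqz t0, t1 : t0 = (t1 == 0) ? 1 : 0
-    xori  t0, t1, -1    # not  t0, t1 : t0 = ~t1
-\end{verbatim}
-SLTIU works here because zero is the only unsigned value less than 1,
-and XORI works because the immediate $-1$ sign-extends to all ones.
-\end{commentary}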
-
-\vspace{-0.2in}
-\begin{center}
-\begin{tabular}{S@{}R@{}R@{}S@{}R@{}O}
-\\
-\instbitrange{31}{25} &
-\instbitrange{24}{20} &
-\instbitrange{19}{15} &
-\instbitrange{14}{12} &
-\instbitrange{11}{7} &
-\instbitrange{6}{0} \\
-\hline
-\multicolumn{1}{|c|}{imm[11:5]} &
-\multicolumn{1}{c|}{imm[4:0]} &
-\multicolumn{1}{c|}{rs1} &
-\multicolumn{1}{c|}{funct3} &
-\multicolumn{1}{c|}{rd} &
-\multicolumn{1}{c|}{opcode} \\
-\hline
-7 & 5 & 5 & 3 & 5 & 7 \\
-0000000 & shamt[4:0] & src & SLLI & dest & OP-IMM \\
-0000000 & shamt[4:0] & src & SRLI & dest & OP-IMM \\
-0100000 & shamt[4:0] & src & SRAI & dest & OP-IMM \\
-\end{tabular}
-\end{center}
-
-Shifts by a constant are encoded as a specialization of the
-I-type format. The operand to be shifted is in {\em rs1}, and the
-shift amount is encoded in the lower 5 bits of the I-immediate field.
-The right shift type is encoded in bit 30.
-SLLI is a logical left shift (zeros are shifted into the lower bits);
-SRLI is a logical right shift (zeros are shifted into the upper bits);
-and SRAI is an arithmetic right shift (the original sign bit is copied
-into the vacated upper bits).
-
-\vspace{-0.2in}
-\begin{center}
-\begin{tabular}{U@{}R@{}O}
-\\
-\instbitrange{31}{12} &
-\instbitrange{11}{7} &
-\instbitrange{6}{0} \\
-\hline
-\multicolumn{1}{|c|}{imm[31:12]} &
-\multicolumn{1}{c|}{rd} &
-\multicolumn{1}{c|}{opcode} \\
-\hline
-20 & 5 & 7 \\
-U-immediate[31:12] & dest & LUI \\
-U-immediate[31:12] & dest & AUIPC
-\end{tabular}
-\end{center}
-
-LUI (load upper immediate) is used to build 32-bit constants and uses
-the U-type format. LUI places the U-immediate value in the top 20
-bits of the destination register {\em rd}, filling in the lowest 12
-bits with zeros.
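-
-\begin{commentary}
-As a sketch of how LUI can be paired with ADDI to build an arbitrary
-32-bit constant (the value and register are chosen only for
-illustration): because ADDI sign-extends its 12-bit immediate, the
-upper 20 bits must be incremented by one whenever bit 11 of the
-constant is set. For {\tt 0xDEADBEEF}, the low 12 bits {\tt 0xEEF}
-sign-extend to $-273$, so LUI loads {\tt 0xDEADC} rather than
-{\tt 0xDEADB}:
-\begin{verbatim}
-    lui   t0, 0xDEADC       # t0 = 0xDEADC000
-    addi  t0, t0, -273      # t0 = 0xDEADC000 - 0x111 = 0xDEADBEEF
-\end{verbatim}
-\end{commentary}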
-
-AUIPC (add upper immediate to {\tt pc}) is used to build {\tt pc}-relative
-addresses and uses the U-type format. AUIPC forms a 32-bit offset from the
-20-bit U-immediate, filling in the lowest 12 bits with zeros, adds this offset
-to the {\tt pc} of the AUIPC instruction, then places the result in register {\em rd}.
-
-\begin{commentary}
-The AUIPC instruction supports two-instruction sequences to access
-arbitrary offsets from the PC for both control-flow transfers and data
-accesses. The combination of an AUIPC and the 12-bit immediate in a
-JALR can transfer control to any 32-bit PC-relative address, while an
-AUIPC plus the 12-bit immediate offset in regular load or store
-instructions can access any 32-bit PC-relative data address.
-
-The current PC can be obtained by setting the U-immediate to 0.
-Although a JAL +4 instruction could also be used to obtain the local
-PC (of the instruction following the JAL), it might cause pipeline
-breaks in simpler microarchitectures or pollute BTB structures in more
-complex microarchitectures.
-\end{commentary}
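-
-\begin{commentary}
-The two-instruction sequences described above might look as follows.
-This is a sketch only: the symbols {\tt var} and {\tt func} are
-hypothetical, and the \verb!%pcrel_hi!/\verb!%pcrel_lo! relocation
-operators are GNU-assembler syntax assumed here, not something
-defined by this chapter.
-\begin{verbatim}
-1:  auipc t0, %pcrel_hi(var)     # t0 = pc + high part of (var - pc)
-    lw    t1, %pcrel_lo(1b)(t0)  # load from var using the low 12 bits
-
-2:  auipc t2, %pcrel_hi(func)    # t2 = pc + high part of (func - pc)
-    jalr  ra, %pcrel_lo(2b)(t2)  # call func, return address in ra
-
-    auipc t3, 0                  # t3 = address of this auipc
-\end{verbatim}
-\end{commentary}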
-
-\subsubsection*{Integer Register-Register Operations}
-
-RV32I defines several arithmetic R-type operations. All operations
-read the {\em rs1} and {\em rs2} registers as source operands and
-write the result into register {\em rd}. The {\em funct7} and {\em
- funct3} fields select the type of operation.
-
-\vspace{-0.2in}
-\begin{center}
-\begin{tabular}{S@{}R@{}R@{}S@{}R@{}O}
-\\
-\instbitrange{31}{25} &
-\instbitrange{24}{20} &
-\instbitrange{19}{15} &
-\instbitrange{14}{12} &
-\instbitrange{11}{7} &
-\instbitrange{6}{0} \\
-\hline
-\multicolumn{1}{|c|}{funct7} &
-\multicolumn{1}{c|}{rs2} &
-\multicolumn{1}{c|}{rs1} &
-\multicolumn{1}{c|}{funct3} &
-\multicolumn{1}{c|}{rd} &
-\multicolumn{1}{c|}{opcode} \\
-\hline
-7 & 5 & 5 & 3 & 5 & 7 \\
-0000000 & src2 & src1 & ADD/SLT/SLTU & dest & OP \\
-0000000 & src2 & src1 & AND/OR/XOR & dest & OP \\
-0000000 & src2 & src1 & SLL/SRL & dest & OP \\
-0100000 & src2 & src1 & SUB/SRA & dest & OP \\
-\end{tabular}
-\end{center}
-
-ADD performs the addition of {\em rs1} and {\em rs2}. SUB performs the
-subtraction of {\em rs2} from {\em rs1}. Overflows are ignored and the low XLEN
-bits of results are written to the destination {\em rd}.
-SLT and SLTU perform signed and unsigned compares
-respectively, writing 1 to {\em rd} if $\mbox{\em rs1} < \mbox{\em
- rs2}$, 0 otherwise. Note, SLTU {\em rd}, {\em x0}, {\em rs2} sets
-{\em rd} to 1 if {\em rs2} is not equal to zero, otherwise sets {\em
- rd} to zero (assembler pseudoinstruction SNEZ {\em rd, rs}). AND, OR, and
-XOR perform bitwise logical operations.
-
-SLL, SRL, and SRA perform logical left, logical right, and arithmetic
-right shifts on the value in register {\em rs1} by the shift amount
-held in the lower 5 bits of register {\em rs2}.
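-
-\begin{commentary}
-Because only the low 5 bits of {\em rs2} are used, shift amounts of 32
-or more wrap around rather than producing zero. A minimal sketch,
-with arbitrary register choices:
-\begin{verbatim}
-    addi  t2, x0, 33     # t2 = 33 = 0b100001
-    addi  t1, x0, 1      # t1 = 1
-    sll   t0, t1, t2     # shift amount is 33 & 31 = 1, so t0 = 2
-\end{verbatim}
-\end{commentary}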
-
-\subsubsection*{NOP Instruction}
-\vspace{-0.4in}
-\begin{center}
-\begin{tabular}{M@{}R@{}S@{}R@{}O}
-\\
-\instbitrange{31}{20} &
-\instbitrange{19}{15} &
-\instbitrange{14}{12} &
-\instbitrange{11}{7} &
-\instbitrange{6}{0} \\
-\hline
-\multicolumn{1}{|c|}{imm[11:0]} &
-\multicolumn{1}{c|}{rs1} &
-\multicolumn{1}{c|}{funct3} &
-\multicolumn{1}{c|}{rd} &
-\multicolumn{1}{c|}{opcode} \\
-\hline
-12 & 5 & 3 & 5 & 7 \\
-0 & 0 & ADDI & 0 & OP-IMM \\
-\end{tabular}
-\end{center}
-
-The NOP instruction does not change any architecturally visible state, except for
-advancing the {\tt pc} and incrementing any applicable performance
-counters. NOP is encoded as ADDI {\em x0, x0, 0}.
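-
-\begin{commentary}
-As a worked illustration, assuming the standard OP-IMM major opcode
-0010011 and {\em funct3}=000 for ADDI, every other field of the
-canonical NOP is zero, so the instruction word is {\tt 0x00000013}.
-\end{commentary}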
-
-\begin{commentary}
-NOPs can be used to align code segments to microarchitecturally
-significant address boundaries, or to leave space for inline code
-modifications. Although there are many possible ways to encode a NOP,
-we define a canonical NOP encoding to allow microarchitectural
-optimizations as well as for more readable disassembly output. The
-other NOP encodings are made available for HINT instructions
-(Section~\ref{sec:rv32i-hints}).
-
-ADDI was chosen for the NOP encoding as this is most likely to take
-fewest resources to execute across a range of systems (if not
-optimized away in decode). In particular, the instruction only reads
-one register. Also, an ADDI functional unit is more likely to be
-available in a superscalar design as adds are the most common
-operation. In particular, address-generation functional units can
-execute ADDI using the same hardware needed for base+offset address
-calculations, while register-register ADD or logical/shift operations
-require additional hardware.
-\end{commentary}
-
-\section{Control Transfer Instructions}
-
-RV32I provides two types of control transfer instructions:
-unconditional jumps and conditional branches. Control transfer
-instructions in RV32I do {\em not} have architecturally visible delay
-slots.
-
-\subsubsection*{Unconditional Jumps}
-
-\vspace{-0.1in} The jump and link (JAL) instruction uses the J-type
-format, where the J-immediate encodes a signed offset in multiples of
-2 bytes. The offset is sign-extended and added to the {\tt pc}
-to form the jump target address. Jumps can therefore target a
-$\pm$\wunits{1}{MiB} range. JAL stores the address of the instruction
-following the jump ({\tt pc}+4) into register {\em rd}. The standard
-software calling convention uses {\tt x1} as the return address
-register and {\tt x5} as an alternate link register.
-
-\begin{commentary}
-The alternate link register supports calling millicode routines (e.g.,
-those to save and restore registers in compressed code) while
-preserving the regular return address register. The register {\tt x5}
-was chosen as the alternate link register as it maps to a temporary in
-the standard calling convention, and has an encoding that differs from
-that of the regular link register by only one bit.
-\end{commentary}
-
-Plain unconditional jumps (assembler pseudoinstruction J) are encoded as a JAL
-with {\em rd}={\tt x0}.
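-
-\begin{commentary}
-A brief sketch of the two common uses of JAL; the labels {\tt func}
-and {\tt loop} are hypothetical.
-\begin{verbatim}
-    jal   ra, func      # call: ra = pc + 4, then jump to func
-    jal   x0, loop      # j loop: plain jump, link value discarded
-\end{verbatim}
-The stated range can be checked from the encoding: the 20-bit field
-holds offset[20:1], so with bit 0 implicitly zero the byte offset is a
-21-bit signed value,
-\[
-\mbox{offset} \in \{-2^{20},\, -2^{20}+2,\, \ldots,\, 2^{20}-2\},
-\]
-that is, from $-1048576$ to $+1048574$ bytes relative to the JAL.
-\end{commentary}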
-
-\vspace{-0.2in}
-\begin{center}
-\begin{tabular}{W@{}E@{}W@{}R@{}R@{}O}
-\\
-\multicolumn{1}{c}{\instbit{31}} &
-\instbitrange{30}{21} &
-\multicolumn{1}{c}{\instbit{20}} &
-\instbitrange{19}{12} &
-\instbitrange{11}{7} &
-\instbitrange{6}{0} \\
-\hline
-\multicolumn{1}{|c|}{imm[20]} &
-\multicolumn{1}{c|}{imm[10:1]} &
-\multicolumn{1}{c|}{imm[11]} &
-\multicolumn{1}{c|}{imm[19:12]} &
-\multicolumn{1}{c|}{rd} &
-\multicolumn{1}{c|}{opcode} \\
-\hline
-1 & 10 & \multicolumn{1}{c}{1} & 8 & 5 & 7 \\
-\multicolumn{4}{c}{offset[20:1]} & dest & JAL \\
-\end{tabular}
-\end{center}
-
-The indirect jump instruction JALR (jump and link register) uses the
-I-type encoding. The target address is obtained by adding the sign-extended
-12-bit I-immediate to the register {\em rs1}, then setting the
-least-significant bit of the result to zero. The address of
-the instruction following the jump ({\tt pc}+4) is written to register
-{\em rd}. Register {\tt x0} can be used as the destination if the
-result is not required.
-\vspace{-0.4in}
-\begin{center}
-\begin{tabular}{M@{}R@{}F@{}R@{}O}
-\\
-\instbitrange{31}{20} &
-\instbitrange{19}{15} &
-\instbitrange{14}{12} &
-\instbitrange{11}{7} &
-\instbitrange{6}{0} \\
-\hline
-\multicolumn{1}{|c|}{imm[11:0]} &
-\multicolumn{1}{c|}{rs1} &
-\multicolumn{1}{c|}{funct3} &
-\multicolumn{1}{c|}{rd} &
-\multicolumn{1}{c|}{opcode} \\
-\hline
-12 & 5 & 3 & 5 & 7 \\
-offset[11:0] & base & 0 & dest & JALR \\
-\end{tabular}
-\end{center}
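-
-\begin{commentary}
-The following sketch shows typical JALR forms, using the standard
-assembler pseudoinstruction names; the choice of {\tt t1} as the
-register holding the target is arbitrary. In each case the target is
-the sum of {\em rs1} and the immediate with the least-significant bit
-cleared.
-\begin{verbatim}
-    jalr  x0, 0(ra)     # ret     : return to the address in ra (x1)
-    jalr  x0, 0(t1)     # jr t1   : indirect jump, no link
-    jalr  ra, 0(t1)     # jalr t1 : indirect call through t1
-\end{verbatim}
-\end{commentary}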
-
-\begin{commentary}
-The unconditional jump instructions all use PC-relative addressing to
-help support position-independent code. The JALR instruction was
-defined to enable a two-instruction sequence to jump anywhere in a
-32-bit absolute address range. A LUI instruction can first load {\em
- rs1} with the upper 20 bits of a target address, then JALR can add
-in the lower bits. Similarly, AUIPC then JALR can jump
-anywhere in a 32-bit {\tt pc}-relative address range.
-
-Note that the JALR instruction does not treat the 12-bit immediate as
-multiples of 2 bytes, unlike the conditional branch instructions.
-This avoids one more immediate format in hardware. In
-practice, most uses of JALR will have either a zero immediate or be
-paired with a LUI or AUIPC, so the slight reduction in range is not
-significant.
-
-Clearing the least-significant bit when calculating the JALR target
-address both simplifies the hardware slightly and allows the
-low bit of function pointers to be used to store auxiliary
-information. Although there is potentially a slight loss of error
-checking in this case, in practice jumps to an incorrect instruction
-address will usually quickly raise an exception.
-
-When used with a base {\em rs1}$=${\tt x0}, JALR can be used to implement
-a single instruction subroutine call to the lowest \wunits{2}{KiB} or highest
-\wunits{2}{KiB} address region from anywhere in the address space, which could
-be used to implement fast calls to a small runtime library.
-\end{commentary}
-
-The JAL and JALR instructions will generate an
-instruction-address-misaligned exception if the target address is not
-aligned to a four-byte boundary.
-
-\begin{commentary}
-Instruction-address-misaligned exceptions are not possible on machines
-that support extensions with 16-bit aligned instructions, such as the
-compressed instruction-set extension, C.
-\end{commentary}
-
-Return-address prediction stacks are a common feature of
-high-performance instruction-fetch units, but require accurate
-detection of instructions used for procedure calls and returns to be
-effective. For RISC-V, hints as to the instructions' usage are encoded
-implicitly via the register numbers used. A JAL instruction should
-push the return address onto a return-address stack (RAS) only when
-{\em rd}$=${\tt x1}/{\tt x5}. JALR instructions should push/pop a
-RAS as shown in Table~\ref{rashints}.
-\begin{table}[hbt]
-\centering
-\begin{tabular}{|c|c|c|l|}
- \hline
- \em rd & \em rs1 & {\em rs1}$=${\em rd} & RAS action \\
- \hline
- !{\em link} & !{\em link} & - & none \\
- !{\em link} & {\em link} & - & pop \\
- {\em link} & !{\em link} & - & push \\
- {\em link} & {\em link} & 0 & pop, then push \\
- {\em link} & {\em link} & 1 & push \\
- \hline
-\end{tabular}
-\caption{Return-address stack prediction hints encoded in register
- specifiers used in the instruction. In the above, {\em link} is
- true when the register is either {\tt x1} or {\tt x5}.}
-\label{rashints}
-\end{table}
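-
-\begin{commentary}
-To make the table concrete, the sketch below annotates a few
-instructions with the RAS action the hints imply. It assumes the
-standard ABI register names ({\tt ra}={\tt x1}, {\tt t0}={\tt x5},
-{\tt t1}={\tt x6}); the label {\tt func} is hypothetical.
-\begin{verbatim}
-    jal   ra, func      # rd is link                    -> push
-    jalr  x0, 0(ra)     # rd not link, rs1 link         -> pop
-    jalr  ra, 0(t1)     # rd link, rs1 not link         -> push
-    jalr  t0, 0(ra)     # rd=x5, rs1=x1, rd != rs1      -> pop, then push
-    jalr  ra, 0(ra)     # rd = rs1 = x1                 -> push
-    jalr  x0, 0(t1)     # neither is a link register    -> none
-\end{verbatim}
-\end{commentary}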
-
-\begin{commentary}
-Some other ISAs added explicit hint bits to their indirect-jump instructions
-to guide return-address stack manipulation. We use implicit hinting tied to
-register numbers and the calling convention to reduce the encoding space used
-for these hints.
-
-When two different link registers ({\tt x1} and {\tt x5}) are given as
-{\em rs1} and {\em rd}, then the RAS is both popped and pushed to
-support coroutines. If {\em rs1} and {\em rd} are the same link
-register (either {\tt x1} or {\tt x5}), the RAS is only pushed to
-enable macro-op fusion of the sequences:\linebreak
-{\tt lui ra, imm20; jalr ra, imm12(ra)} \ and \
-{\tt auipc ra, imm20; jalr ra, imm12(ra)}
-\end{commentary}
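-
-\begin{commentary}
-As a non-normative illustration of Table~\ref{rashints}, the sequence
-below annotates each jump with the RAS action its register specifiers
-imply.  The label {\tt func} and the choice of {\tt t1} ({\tt x6}) as a
-non-link register are assumptions made for this sketch; {\tt ra} and
-{\tt t0} are the ABI names for {\tt x1} and {\tt x5}.
-\begin{verbatim}
-  jal  ra, func      # rd=link                       -> push
-  jal  x0, func      # rd=x0 (plain jump)            -> none
-  jalr x0, 0(ra)     # rd=!link, rs1=link (return)   -> pop
-  jalr ra, 0(t1)     # rd=link,  rs1=!link           -> push
-  jalr t0, 0(ra)     # rd=link,  rs1=link, rs1!=rd   -> pop, then push
-  jalr ra, 0(ra)     # rd=link,  rs1=link, rs1=rd    -> push
-\end{verbatim}
-\end{commentary}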
-
-\subsubsection*{Conditional Branches}
-
-All branch instructions use the B-type instruction format. The
-12-bit B-immediate encodes signed offsets in multiples of 2, and is
-added to the current {\tt pc} to give the target address. The
-conditional branch range is $\pm$\wunits{4}{KiB}.
-
-\vspace{-0.2in}
-\begin{center}
-\begin{tabular}{W@{}R@{}F@{}F@{}R@{}R@{}F@{}S}
-\\
-\multicolumn{1}{c}{\instbit{31}} &
-\instbitrange{30}{25} &
-\instbitrange{24}{20} &
-\instbitrange{19}{15} &
-\instbitrange{14}{12} &
-\instbitrange{11}{8} &
-\multicolumn{1}{c}{\instbit{7}} &
-\instbitrange{6}{0} \\
-\hline
-\multicolumn{1}{|c|}{imm[12]} &
-\multicolumn{1}{c|}{imm[10:5]} &
-\multicolumn{1}{c|}{rs2} &
-\multicolumn{1}{c|}{rs1} &
-\multicolumn{1}{c|}{funct3} &
-\multicolumn{1}{c|}{imm[4:1]} &
-\multicolumn{1}{c|}{imm[11]} &
-\multicolumn{1}{c|}{opcode} \\
-\hline
-1 & 6 & 5 & 5 & 3 & 4 & 1 & 7 \\
-\multicolumn{2}{c}{offset[12,10:5]} & src2 & src1 & BEQ/BNE & \multicolumn{2}{c}{offset[11,4:1]} & BRANCH \\
-\multicolumn{2}{c}{offset[12,10:5]} & src2 & src1 & BLT[U] & \multicolumn{2}{c}{offset[11,4:1]} & BRANCH \\
-\multicolumn{2}{c}{offset[12,10:5]} & src2 & src1 & BGE[U] & \multicolumn{2}{c}{offset[11,4:1]} & BRANCH \\
-\end{tabular}
-\end{center}
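-
-\begin{commentary}
-As a worked check of the stated branch range (a sketch; the symbol
-$\mathit{offset}$ is introduced here only for illustration), the
-B-immediate supplies bits 12 down to 1 of a signed offset whose bit 0
-is always zero, so
-\[
-  -2^{12} \;\le\; \mathit{offset} \;\le\; 2^{12}-2 ,
-  \qquad \mathit{offset} \equiv 0 \pmod{2},
-\]
-that is, $-4096$ to $+4094$ bytes relative to the {\tt pc} of the branch.
-\end{commentary}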
-
-Branch instructions compare two registers. BEQ and BNE take the
-branch if registers {\em rs1} and {\em rs2} are equal or unequal
-respectively. BLT and BLTU take the branch if {\em rs1} is less than
-{\em rs2}, using signed and unsigned comparison respectively. BGE and
-BGEU take the branch if {\em rs1} is greater than or equal to {\em rs2},
-using signed and unsigned comparison respectively. Note, BGT, BGTU,
-BLE, and BLEU can be synthesized by reversing the operands to BLT,
-BLTU, BGE, and BGEU, respectively.
-
-\begin{commentary}
-Signed array bounds may be checked with a single BLTU instruction, since
-any negative index will compare greater than any nonnegative bound.
-\end{commentary}
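-
-\begin{commentary}
-A sketch of the operand-reversal and bounds-check idioms follows.  The
-register assignments ({\tt a0} holds the index or left operand, {\tt a1}
-the bound or right operand) and the labels are assumptions made for
-illustration only.
-\begin{verbatim}
-  blt  a1, a0, greater     # "bgt a0, a1": taken if a0 > a1 (signed)
-  bgeu a1, a0, below_eq    # "bleu a0, a1": taken if a0 <= a1 (unsigned)
-  bltu a0, a1, in_bounds   # taken only if 0 <= a0 < a1, for nonnegative a1
-\end{verbatim}
-\end{commentary}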
-
-Software should be optimized such that the sequential code path is the
-most common path, with less-frequently taken code paths placed out of
-line. Software should also assume that backward branches will be
-predicted taken and forward branches as not taken, at least the
-first time they are encountered. Dynamic predictors should quickly
-learn any predictable branch behavior.
-
-Unlike some other architectures, the RISC-V jump (JAL with {\em
- rd}={\tt x0}) instruction should always be used for unconditional
-branches instead of a conditional branch instruction with an
-always-true condition. RISC-V jumps are also PC-relative and support
-a much wider offset range than branches, and will not pollute
-conditional-branch prediction tables.
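-
-\begin{commentary}
-For example (the label {\tt target} is hypothetical), the first form
-below is preferred over the second:
-\begin{verbatim}
-  jal  x0, target        # unconditional jump, return address discarded
-  beq  x0, x0, target    # always-taken branch: shorter range, avoid
-\end{verbatim}
-\end{commentary}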
-
-\begin{commentary}
-The conditional branches were designed to include arithmetic
-comparison operations between two registers (as also done in PA-RISC
-and Xtensa ISA), rather than use condition codes (x86, ARM, SPARC,
-PowerPC), or to only compare one register against zero (Alpha, MIPS),
-or two registers only for equality (MIPS). This design was motivated
-by the observation that a combined compare-and-branch instruction fits
-into a regular pipeline, avoids additional condition code state or use
-of a temporary register, and reduces static code size and dynamic
-instruction fetch traffic. Another point is that comparisons against
-zero require non-trivial circuit delay (especially after the move to
-static logic in advanced processes) and so are almost as expensive as
-arithmetic magnitude compares. Another advantage of a fused
-compare-and-branch instruction is that branches are observed earlier
-in the front-end instruction stream, and so can be predicted earlier.
-There is perhaps an advantage to a design with condition codes in the
-case where multiple branches can be taken based on the same condition
-codes, but we believe this case to be relatively rare.
-
-We considered but did not include static branch hints in the
-instruction encoding. These can reduce the pressure on dynamic
-predictors, but require more instruction encoding space and
-software profiling for best results, and can result in poor
-performance if production runs do not match profiling runs.
-
-We considered but did not include conditional moves or predicated
-instructions, which can effectively replace unpredictable short
-forward branches. Conditional moves are the simpler of the two, but
-are difficult to use with conditional code that might cause exceptions
-(memory accesses and floating-point operations). Predication adds
-additional flag state to a system, additional instructions to set and
-clear flags, and additional encoding overhead on every instruction.
-Both conditional move and predicated instructions add complexity to
-out-of-order microarchitectures, adding an implicit third source
-operand due to the need to copy the original value of the destination
-architectural register into the renamed destination physical register
-if the predicate is false. Also, static compile-time decisions to use
-predication instead of branches can result in lower performance on
-inputs not included in the compiler training set, especially given
-that unpredictable branches are rare, and becoming rarer as branch
-prediction techniques improve.
-
-We note that various microarchitectural techniques exist to
-dynamically convert unpredictable short forward branches into
-internally predicated code to avoid the cost of flushing pipelines on
-a branch mispredict~\cite{heil-tr1996,Klauser-1998,Kim-micro2005} and
-have been implemented in commercial processors~\cite{ibmpower7}.
-The simplest techniques just reduce the penalty of recovering from a
-mispredicted short forward branch by only flushing instructions in the
-branch shadow instead of the entire fetch pipeline, or by fetching
-instructions from both sides using wide instruction fetch or idle
-instruction fetch slots. More complex techniques for out-of-order
-cores add internal predicates on instructions in the branch shadow,
-with the internal predicate value written by the branch instruction,
-allowing the branch and following instructions to be executed
-speculatively and out-of-order with respect to other code~\cite{ibmpower7}.
-\end{commentary}
-
-\section{Load and Store Instructions}
-
-RV32I is a load-store architecture, where only load and store
-instructions access memory and arithmetic instructions only operate on
-CPU registers. RV32I provides a 32-bit address space that is
-byte-addressed and little-endian. The EEI will
-define what portions of the address space are legal to access with
-which instructions (e.g., some addresses might be read only, or
-support word access only). Loads with a destination of {\tt x0} must
-still raise any exceptions and cause any other side effects even
-though the load value is discarded.
-
-\vspace{-0.4in}
-\begin{center}
-\begin{tabular}{M@{}R@{}F@{}R@{}O}
-\\
-\instbitrange{31}{20} &
-\instbitrange{19}{15} &
-\instbitrange{14}{12} &
-\instbitrange{11}{7} &
-\instbitrange{6}{0} \\
-\hline
-\multicolumn{1}{|c|}{imm[11:0]} &
-\multicolumn{1}{c|}{rs1} &
-\multicolumn{1}{c|}{funct3} &
-\multicolumn{1}{c|}{rd} &
-\multicolumn{1}{c|}{opcode} \\
-\hline
-12 & 5 & 3 & 5 & 7 \\
-offset[11:0] & base & width & dest & LOAD \\
-\end{tabular}
-\end{center}
-
-\vspace{-0.2in}
-\begin{center}
-\begin{tabular}{O@{}R@{}R@{}F@{}R@{}O}
-\\
-\instbitrange{31}{25} &
-\instbitrange{24}{20} &
-\instbitrange{19}{15} &
-\instbitrange{14}{12} &
-\instbitrange{11}{7} &
-\instbitrange{6}{0} \\
-\hline
-\multicolumn{1}{|c|}{imm[11:5]} &
-\multicolumn{1}{c|}{rs2} &
-\multicolumn{1}{c|}{rs1} &
-\multicolumn{1}{c|}{funct3} &
-\multicolumn{1}{c|}{imm[4:0]} &
-\multicolumn{1}{c|}{opcode} \\
-\hline
-7 & 5 & 5 & 3 & 5 & 7 \\
-offset[11:5] & src & base & width & offset[4:0] & STORE \\
-\end{tabular}
-\end{center}
-
-Load and store instructions transfer a value between the registers and
-memory. Loads are encoded in the I-type format and stores are
-S-type. The effective byte address is obtained by adding register
-{\em rs1} to the sign-extended 12-bit offset. Loads copy a value
-from memory to register {\em rd}. Stores copy the value in register
-{\em rs2} to memory.
-
-The LW instruction loads a 32-bit value from memory into {\em rd}. LH
-loads a 16-bit value from memory, then sign-extends to 32 bits before
-storing in {\em rd}. LHU loads a 16-bit value from memory but then
-zero-extends to 32 bits before storing in {\em rd}. LB and LBU are
-defined analogously for 8-bit values. The SW, SH, and SB instructions
-store 32-bit, 16-bit, and 8-bit values from the low bits of register
-{\em rs2} to memory.
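-
-\begin{commentary}
-As an illustration, assume the byte at the address in {\tt a1} holds
-{\tt 0x80} (the registers and the memory contents are assumptions made
-for this sketch):
-\begin{verbatim}
-  lb   a0, 0(a1)     # a0 = 0xFFFFFF80 (sign-extended)
-  lbu  a0, 0(a1)     # a0 = 0x00000080 (zero-extended)
-  sb   a0, 1(a1)     # stores the low 8 bits of a0 (0x80) at a1+1
-  lw   a2, -4(sp)    # offset is a sign-extended 12-bit value, -2048..2047
-\end{verbatim}
-\end{commentary}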
-
-Regardless of EEI, loads and stores whose effective addresses are
-naturally aligned shall not raise an address-misaligned exception.
-Loads and stores where the effective address is not naturally aligned
-to the referenced datatype (i.e., on a four-byte boundary for 32-bit
-accesses, and a two-byte boundary for 16-bit accesses) have behavior
-dependent on the EEI.
-
-An EEI may guarantee that misaligned loads and stores are fully
-supported, and so the software running inside the execution
-environment will never experience a contained or fatal
-address-misaligned trap. In this case, the misaligned loads and
-stores can be handled in hardware, or via an invisible trap into the
-execution environment implementation, or possibly a combination of
-hardware and invisible trap depending on address.
-
-Alternatively, an EEI might not guarantee that misaligned loads and
-stores are handled invisibly.  In this case, loads and stores that are
-not naturally
-aligned may either complete execution successfully or raise an
-exception. The exception raised can be either an address-misaligned
-exception or an access exception. For a memory access that would
-otherwise be able to complete except for the misalignment, an access
-exception can be raised instead of an address-misaligned exception if
-the misaligned access should not be emulated, e.g., if accesses to the
-memory region have side effects. When an EEI does not guarantee
-misaligned loads and stores are handled invisibly, the EEI must define
-if exceptions caused by address misalignment result in a contained
-trap (allowing software running inside the execution environment to
-handle the trap) or a fatal trap (terminating execution).
-
-\begin{commentary}
-Misaligned accesses are occasionally required when porting legacy
-code, and help performance on applications when using any form of
-packed-SIMD extension or handling externally packed data structures.
-Our rationale for allowing EEIs to choose to support misaligned
-accesses via the regular load and store instructions is to simplify
-the addition of misaligned hardware support. One option would have
-been to disallow misaligned accesses in the base ISA and then provide
-some separate ISA support for misaligned accesses, either special
-instructions to help software handle misaligned accesses or a new
-hardware addressing mode for misaligned accesses. Special
-instructions are difficult to use, complicate the ISA, and often add
-new processor state (e.g., SPARC VIS align address offset register) or
-complicate access to existing processor state (e.g., MIPS LWL/LWR
-partial register writes). In addition, for loop-oriented packed-SIMD
-code, the extra overhead when operands are misaligned motivates
-software to provide multiple forms of loop depending on operand
-alignment, which complicates code generation and adds to loop startup
-overhead. New misaligned hardware addressing modes take considerable
-space in the instruction encoding or require very simplified
-addressing modes (e.g., register indirect only).
-\end{commentary}
-
-Even when misaligned loads and stores complete successfully, these
-accesses might run extremely slowly depending on the implementation
-(e.g., when implemented via an invisible trap). Furthermore, whereas
-naturally aligned loads and stores are guaranteed to execute
-atomically, misaligned loads and stores might not, and hence
-require additional synchronization to ensure atomicity.
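-
-\begin{commentary}
-Software that cannot rely on the EEI to support misaligned accesses can
-assemble a value from byte accesses, which are always naturally
-aligned.  The following is a sketch only, assuming the address is in
-{\tt a1}, the result is wanted in {\tt a0}, and {\tt t0} is free; the
-four byte loads are separate accesses and together are not atomic.
-\begin{verbatim}
-  lbu  a0, 0(a1)     # bits  7:0  (little-endian: lowest address first)
-  lbu  t0, 1(a1)
-  slli t0, t0, 8
-  or   a0, a0, t0    # bits 15:8
-  lbu  t0, 2(a1)
-  slli t0, t0, 16
-  or   a0, a0, t0    # bits 23:16
-  lbu  t0, 3(a1)
-  slli t0, t0, 24
-  or   a0, a0, t0    # bits 31:24
-\end{verbatim}
-\end{commentary}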
-
-\begin{commentary}
-We do not mandate atomicity for misaligned accesses so execution
-environment implementations can use an invisible machine trap and
-a software handler to handle some or all misaligned accesses. If
-hardware misaligned support is provided, software can exploit this by
-simply using regular load and store instructions. Hardware can then
-automatically optimize accesses depending on whether runtime addresses
-are aligned.
-\end{commentary}
-
-\pagebreak
-
-\section{Memory Ordering Instructions}
-\label{sec:fence}
-
-\vspace{-0.2in}
-\begin{center}
-\begin{tabular}{F@{}IIIIIIIIF@{}F@{}F@{}S}
-\\
-\instbitrange{31}{28} &
-\multicolumn{1}{c}{\instbit{27}} &
-\multicolumn{1}{c}{\instbit{26}} &
-\multicolumn{1}{c}{\instbit{25}} &
-\multicolumn{1}{c}{\instbit{24}} &
-\multicolumn{1}{c}{\instbit{23}} &
-\multicolumn{1}{c}{\instbit{22}} &
-\multicolumn{1}{c}{\instbit{21}} &
-\multicolumn{1}{c}{\instbit{20}} &
-\instbitrange{19}{15} &
-\instbitrange{14}{12} &
-\instbitrange{11}{7} &
-\instbitrange{6}{0} \\
-\hline
-\multicolumn{1}{|c|}{fm} &
-\multicolumn{1}{c|}{PI} &
-\multicolumn{1}{c|}{PO} &
-\multicolumn{1}{c|}{PR} &
-\multicolumn{1}{c|}{PW} &
-\multicolumn{1}{|c|}{SI} &
-\multicolumn{1}{c|}{SO} &
-\multicolumn{1}{c|}{SR} &
-\multicolumn{1}{c|}{SW} &
-\multicolumn{1}{c|}{rs1} &
-\multicolumn{1}{c|}{funct3} &
-\multicolumn{1}{c|}{rd} &
-\multicolumn{1}{c|}{opcode} \\
-\hline
-4 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 5 & 3 & 5 & 7 \\
-FM & \multicolumn{4}{c}{predecessor} & \multicolumn{4}{c}{successor} & 0 & FENCE & 0 & MISC-MEM \\
-\end{tabular}
-\end{center}
-
-The FENCE instruction is used to order device I/O and
-memory accesses as viewed by other RISC-V harts and external devices
-or coprocessors. Any combination of device input (I), device output
-(O), memory reads (R), and memory writes (W) may be ordered with
-respect to any combination of the same. Informally, no other RISC-V
-hart or external device can observe any operation in the {\em
- successor} set following a FENCE before any operation in the {\em
- predecessor} set preceding the FENCE.
-Chapter~\ref{ch:memorymodel} provides a precise description of the
-RISC-V memory consistency model.
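-
-\begin{commentary}
-A sketch of a common use follows: a producer publishes data and then
-sets a flag, and a consumer on another hart polls the flag before
-reading the data.  The registers, labels, and memory layout ({\tt a0}
-points to the data word, {\tt a1} to the flag) are assumptions made for
-illustration.
-\begin{verbatim}
-# producer hart
-  sw    t0, 0(a0)    # write data
-  fence w, w         # order the data write before the flag write
-  sw    t1, 0(a1)    # write flag (t1 is nonzero)
-
-# consumer hart
-wait:
-  lw    t2, 0(a1)    # read flag
-  beq   t2, x0, wait
-  fence r, r         # order the flag read before the data read
-  lw    t3, 0(a0)    # read data
-\end{verbatim}
-\end{commentary}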
-
-The EEI will define what I/O operations are possible, and in
-particular, which memory addresses, when accessed by load and store
-instructions, will be treated and ordered as device input and device
-output operations respectively
-rather than memory reads and writes. For example, memory-mapped I/O
-devices will typically be accessed with uncached loads and stores that
-are ordered using the I and O bits rather than the R and W bits.
-Instruction-set extensions might also describe new coprocessor I/O
-instructions that will also be ordered using the I and O bits in a
-FENCE.
-
-\begin{table}[htp]
-\begin{small}
-\begin{center}
-\begin{tabular}{|c|c|l|}
-\hline
-{\em fm} field & Mnemonic & Meaning \\
-\hline
-0000 & \em none & Normal Fence \\
-\hline
-\multirow{2}{*}{1000} & \multirow{2}{*}{TSO} & With FENCE RW,RW: exclude write-to-read ordering \\
- & & Otherwise: \em Reserved for future use. \\
-\hline
-\multicolumn{2}{|c|}{\em other} & \em Reserved for future use. \\
-\hline
-\end{tabular}
-\end{center}
-\end{small}
-\caption{Fence mode encoding.}
-\label{fm}
-\end{table}
-
-The fence mode field {\em fm} defines the semantics of the FENCE. A
-FENCE with {\em fm}=0000 orders all memory operations in its
-predecessor set before all memory operations in its successor set.
-
-The optional FENCE.TSO instruction with {\em fm}=1000 orders all load
-operations in its predecessor set before all memory operations in its
-successor set, and all store operations in its predecessor set before
-all store operations in its successor set. This leaves non-AMO store
-operations in the FENCE.TSO's predecessor set unordered with non-AMO
-loads in its successor set.
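-
-\begin{commentary}
-Working through the field layout given earlier in this section, three
-common fences encode as shown below.  This is a sketch: the MISC-MEM
-opcode value {\tt 0001111} and the FENCE {\em funct3} value {\tt 000}
-are taken from the base opcode map and are not stated in this section.
-\begin{verbatim}
-                      fm   pred succ rs1   f3  rd    opcode
-  fence rw, rw        0000 0011 0011 00000 000 00000 0001111  = 0x0330000F
-  fence iorw, iorw    0000 1111 1111 00000 000 00000 0001111  = 0x0FF0000F
-  fence.tso           1000 0011 0011 00000 000 00000 0001111  = 0x8330000F
-\end{verbatim}
-\end{commentary}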
-
-\begin{commentary}
- The FENCE.TSO encoding was added as an optional extension to the
- original base FENCE instruction encoding. The base definition
- requires that implementations ignore any set bits and treat the
- FENCE as global, and so this is a backwards-compatible extension.
-\end{commentary}
-
-The unused fields in the FENCE instructions---{\em rs1} and {\em rd}---are
-reserved for finer-grain fences in future extensions. For forward
-compatibility, base implementations shall ignore these fields, and standard
-software shall zero these fields. Likewise, many {\em fm} and
-predecessor/successor set settings in Table~\ref{fm} are also reserved
-for future use. Base implementations shall treat all such reserved
-configurations as normal fences with {\em fm}=0000, and standard
-software shall use only non-reserved configurations.
-
-\begin{commentary}
-We chose a relaxed memory model to allow high performance from simple
-machine implementations and from likely future
-coprocessor or accelerator extensions. We separate out I/O ordering
-from memory R/W ordering to avoid unnecessary serialization within a
-device-driver hart and also to support alternative non-memory paths
-to control added coprocessors or I/O devices. Simple implementations
-may additionally ignore the {\em predecessor} and {\em successor}
-fields and always execute a conservative fence on all operations.
-\end{commentary}
-
-\section{Environment Call and Breakpoints}
-
-SYSTEM instructions are used to access system functionality that might
-require privileged access and are encoded using the I-type instruction
-format. These can be divided into two main classes: those that
-atomically read-modify-write control and status registers (CSRs), and
-all other potentially privileged instructions. CSR instructions are
-described in Chapter~\ref{csrinsts}, and the base unprivileged instructions
-are described in the following section.
-
-\begin{commentary}
-The SYSTEM instructions are defined to allow simpler implementations
-to always trap to a single software trap handler. More sophisticated
-implementations might execute more of each system instruction in
-hardware.
-\end{commentary}
-
-\vspace{-0.2in}
-\begin{center}
-\begin{tabular}{M@{}R@{}F@{}R@{}S}
-\\
-\instbitrange{31}{20} &
-\instbitrange{19}{15} &
-\instbitrange{14}{12} &
-\instbitrange{11}{7} &
-\instbitrange{6}{0} \\
-\hline
-\multicolumn{1}{|c|}{funct12} &
-\multicolumn{1}{c|}{rs1} &
-\multicolumn{1}{c|}{funct3} &
-\multicolumn{1}{c|}{rd} &
-\multicolumn{1}{c|}{opcode} \\
-\hline
-12 & 5 & 3 & 5 & 7 \\
-ECALL & 0 & PRIV & 0 & SYSTEM \\
-EBREAK & 0 & PRIV & 0 & SYSTEM \\
-\end{tabular}
-\end{center}
-
-These two instructions cause a precise requested trap to the
-supporting execution environment.
-
-The ECALL instruction is used to make a service request to the
-execution environment. The EEI will define how parameters for the
-service request are passed, but usually these will be in defined
-locations in the integer register file.
-
-The EBREAK instruction is used to return control to a debugging
-environment.
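-
-\begin{commentary}
-As one concrete illustration of an EEI-defined convention, the RISC-V
-Linux system-call ABI passes the call number in {\tt a7} and arguments
-in {\tt a0} onward.  The fragment below is a sketch under that
-assumption (call number 93 is {\tt exit} in that ABI); other execution
-environments define their own conventions.
-\begin{verbatim}
-  li    a0, 0        # argument: exit status 0
-  li    a7, 93       # call number (exit, Linux ABI assumed)
-  ecall              # trap to the execution environment
-\end{verbatim}
-\end{commentary}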
-
-\begin{commentary}
-ECALL and EBREAK were previously named SCALL and SBREAK. The
-instructions have the same functionality and encoding, but were
-renamed to reflect that they can be used more generally than to call a
-supervisor-level operating system or debugger.
-\end{commentary}
-
-\begin{commentary}
- EBREAK was primarily designed to be used by a debugger to cause
- execution to stop and fall back into the debugger. EBREAK is also
- used by the standard gcc compiler to mark code paths that should not
- be executed.
-
- Another use of EBREAK is to support ``semihosting'', where the
- execution environment includes a debugger that can provide services
- over an alternate system call interface built around the EBREAK
- instruction. Because the RISC-V base ISA does not provide more than
- one EBREAK instruction, RISC-V semihosting uses a special sequence of
-  instructions to distinguish a semihosting EBREAK from a
-  debugger-inserted EBREAK.
-\begin{verbatim}
- slli x0, x0, 0x1f # Entry NOP
- ebreak # Break to debugger
- srai x0, x0, 7 # NOP encoding the semihosting call number 7
-\end{verbatim}
- Note that these three instructions must be 32-bit-wide instructions,
-  i.e., they must not be among the compressed 16-bit instructions
- described in Chapter~\ref{compressed}.
-
-  The shift NOP instructions are still considered available for use as
-  HINTs.
-
- Semihosting is a form of service call and would be more naturally
- encoded as an ECALL using an existing ABI, but this would require
- the debugger to be able to intercept ECALLs, which is a newer
- addition to the debug standard. We intend to move over to using
- ECALLs with a standard ABI, in which case, semihosting can share a
- service ABI with an existing standard.
-
- We note that ARM processors have also moved to using SVC instead of
- BKPT for semihosting calls in newer designs.
-\end{commentary}
-
-\section{HINT Instructions}
-\label{sec:rv32i-hints}
-
-RV32I reserves a large encoding space for HINT instructions, which are
-usually used to communicate performance hints to the
-microarchitecture. HINTs are encoded as integer computational
-instructions with {\em rd}={\tt x0}. Hence, like the NOP instruction,
-HINTs do not change any architecturally visible state, except for
-advancing the {\tt pc} and any applicable performance counters.
-Implementations are always allowed to ignore the encoded hints.
-
-\begin{commentary}
-This HINT encoding has been chosen so that simple implementations can ignore
-HINTs altogether, and instead execute a HINT as a regular computational
-instruction that happens not to mutate the architectural state. For example, ADD is
-a HINT if the destination register is {\tt x0}; the five-bit {\em rs1} and {\em
-rs2} fields encode arguments to the HINT. However, a simple implementation can
-simply execute the HINT as an ADD of {\em rs1} and {\em rs2} that writes {\tt
-x0}, which has no architecturally visible effect.
-\end{commentary}
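-
-\begin{commentary}
-For instance, the following encodings illustrate the scheme; the
-operand choices are arbitrary and carry no defined meaning.
-\begin{verbatim}
-  add  x0, a0, a1    # HINT: rs1 and rs2 give 2^5 * 2^5 = 2^10 code points
-  lui  x0, 0x12345   # HINT: the 20-bit immediate gives 2^20 code points
-  addi x0, x0, 0     # not a HINT: this is the canonical NOP
-\end{verbatim}
-\end{commentary}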
-
-Table~\ref{tab:rv32i-hints} lists all RV32I HINT code points. 91\% of the HINT
-space is reserved for standard HINTs, but none are presently defined. The
-remainder of the HINT space is reserved for custom HINTs: no standard HINTs
-will ever be defined in this subspace.
-
-\begin{commentary}
-No standard hints are presently defined. We anticipate
-standard hints to eventually include memory-system spatial and
-temporal locality hints, branch prediction hints, thread-scheduling
-hints, security tags, and instrumentation flags for
-simulation/emulation.
-\end{commentary}
-
-\begin{table}[hbt]
-\centering
-\begin{tabular}{|l|l|c|l|}
- \hline
- Instruction & Constraints & Code Points & Purpose \\ \hline \hline
- LUI & {\em rd}={\tt x0} & $2^{20}$ & \multirow{15}{*}{\em Reserved for future standard use} \\ \cline{1-3}
- AUIPC & {\em rd}={\tt x0} & $2^{20}$ & \\ \cline{1-3}
- \multirow{2}{*}{ADDI} & {\em rd}={\tt x0}, and either & \multirow{2}{*}{$2^{17}-1$} & \\
- & {\em rs1}$\neq${\tt x0} or {\em imm}$\neq$0 & & \\ \cline{1-3}
- ANDI & {\em rd}={\tt x0} & $2^{17}$ & \\ \cline{1-3}
- ORI & {\em rd}={\tt x0} & $2^{17}$ & \\ \cline{1-3}
- XORI & {\em rd}={\tt x0} & $2^{17}$ & \\ \cline{1-3}
- ADD & {\em rd}={\tt x0} & $2^{10}$ & \\ \cline{1-3}
- SUB & {\em rd}={\tt x0} & $2^{10}$ & \\ \cline{1-3}
- AND & {\em rd}={\tt x0} & $2^{10}$ & \\ \cline{1-3}
- OR & {\em rd}={\tt x0} & $2^{10}$ & \\ \cline{1-3}
- XOR & {\em rd}={\tt x0} & $2^{10}$ & \\ \cline{1-3}
- SLL & {\em rd}={\tt x0} & $2^{10}$ & \\ \cline{1-3}
- SRL & {\em rd}={\tt x0} & $2^{10}$ & \\ \cline{1-3}
- SRA & {\em rd}={\tt x0} & $2^{10}$ & \\ \cline{1-3}
- FENCE & {\em pred}=0 or {\em succ}=0 & $2^{5}-1$ & \\ \hline \hline
- SLTI & {\em rd}={\tt x0} & $2^{17}$ & \multirow{7}{*}{\em Reserved for custom use} \\ \cline{1-3}
- SLTIU & {\em rd}={\tt x0} & $2^{17}$ & \\ \cline{1-3}
- SLLI & {\em rd}={\tt x0} & $2^{10}$ & \\ \cline{1-3}
- SRLI & {\em rd}={\tt x0} & $2^{10}$ & \\ \cline{1-3}
- SRAI & {\em rd}={\tt x0} & $2^{10}$ & \\ \cline{1-3}
- SLT & {\em rd}={\tt x0} & $2^{10}$ & \\ \cline{1-3}
- SLTU & {\em rd}={\tt x0} & $2^{10}$ & \\ \hline
-\end{tabular}
-\caption{RV32I HINT instructions.}
-\label{tab:rv32i-hints}
-\end{table}
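-
-\begin{commentary}
-The 91\% figure can be checked from the code-point counts in
-Table~\ref{tab:rv32i-hints} (a worked sum added for illustration):
-\[
-  \begin{array}{rcl}
-  \mbox{standard} &=& 2\cdot 2^{20} + (2^{17}-1) + 3\cdot 2^{17}
-                      + 8\cdot 2^{10} + (2^{5}-1) = 2{,}629{,}662, \\
-  \mbox{custom}   &=& 2\cdot 2^{17} + 5\cdot 2^{10} = 267{,}264,
-  \end{array}
-\]
-so the standard share is $2{,}629{,}662 / 2{,}896{,}926 \approx 0.908$.
-\end{commentary}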
-