From ab6f8c9bd7bc85361fcf35667d1fddfaf367a53f Mon Sep 17 00:00:00 2001
From: Andrew Waterman <andrew@sifive.com>
Date: Wed, 1 Feb 2017 20:41:47 -0800
Subject: Reorganize directory structure

---
 src/rv32.tex | 1359 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 1359 insertions(+)
 create mode 100644 src/rv32.tex

diff --git a/src/rv32.tex b/src/rv32.tex
new file mode 100644
index 0000000..77824e5
--- /dev/null
+++ b/src/rv32.tex
@@ -0,0 +1,1359 @@
+\chapter{RV32I Base Integer Instruction Set, Version 2.0}
+\label{rv32}
+
+This chapter describes version 2.0 of the RV32I base integer
+instruction set.  Much of the commentary also applies to the RV64I
+variant.
+
+\begin{commentary}
+RV32I was designed to be sufficient to form a compiler target and to
+support modern operating system environments.  The ISA was also
+designed to reduce the hardware required in a minimal implementation.
+RV32I contains 47 unique instructions.  A simple implementation might
+cover the eight SCALL/SBREAK/CSRR* instructions with a single SYSTEM
+hardware instruction that always traps, and might implement the FENCE
+and FENCE.I instructions as NOPs, reducing the hardware instruction
+count to 38 ($47 - 8 + 1 - 2$).  RV32I can emulate almost any
+other ISA extension (except the A extension, which requires additional
+hardware support for atomicity).
+\end{commentary}
+
+\section{Programmers' Model for Base Integer Subset}
+
+Figure~\ref{gprs} shows the user-visible state for the base integer
+subset.  There are 31 general-purpose registers {\tt x1}--{\tt x31},
+which hold integer values.  Register {\tt x0} is hardwired to the
+constant 0.  There is no hardwired subroutine return address link
+register, but the standard software calling convention uses register
+{\tt x1} to hold the return address on a call.  For RV32, the {\tt x}
+registers are 32 bits wide, and for RV64, they are 64 bits wide.  This
+document uses the term XLEN to refer to the current width of an {\tt
+  x} register in bits (either 32 or 64).
+
+There is one additional user-visible register: the program counter {\tt pc}
+holds the address of the current instruction.
+
+\begin{commentary}
+The number of available architectural registers can have large impacts
+on code size, performance, and energy consumption.  Although 16
+registers would arguably be sufficient for an integer ISA running
+compiled code, it is impossible to encode a complete ISA with 16
+registers in 16-bit instructions using a 3-address format.  Although a
+2-address format would be possible, it would increase instruction
+count and lower efficiency.  We wanted to avoid intermediate
+instruction sizes (such as Xtensa's 24-bit instructions) to simplify
+base hardware implementations, and once a 32-bit instruction size was
+adopted, it was straightforward to support 32 integer registers.  A
+larger number of integer registers also helps performance on
+high-performance code, where there can be extensive use of loop
+unrolling, software pipelining, and cache tiling.
+
+For these reasons, we chose a conventional size of 32 integer
+registers for the base ISA.  Dynamic register usage tends to be
+dominated by a few frequently accessed registers, and register-file
+implementations can be optimized to reduce access energy for the
+frequently accessed registers~\cite{jtseng:sbbci}.  The optional
+compressed 16-bit instruction format mostly accesses only 8 registers
+and hence can provide a dense instruction encoding, while additional
+instruction-set extensions could support a much larger register space
+(either flat or hierarchical) if desired.
+
+For resource-constrained embedded applications, we have defined the
+RV32E subset, which only has 16 registers (Chapter~\ref{rv32e}).
+\end{commentary}
+
+\begin{figure}[H]
+{\footnotesize
+\begin{center}
+\begin{tabular}{p{2in}}
+\instbitrange{XLEN-1}{0}                                  \\ \cline{1-1}
+\multicolumn{1}{|c|}{\reglabel{\ \ \ \ \ \ x0 / zero}}      \\ \cline{1-1}
+\multicolumn{1}{|c|}{\reglabel{\ \ \ \ x1\ \ \ \ \ }}            \\ \cline{1-1}
+\multicolumn{1}{|c|}{\reglabel{\ \ \ \ x2\ \ \ \ \ }}       \\ \cline{1-1}
+\multicolumn{1}{|c|}{\reglabel{\ \ \ \ x3\ \ \ \ \ }}       \\ \cline{1-1}
+\multicolumn{1}{|c|}{\reglabel{\ \ \ \ x4\ \ \ \ \ }}       \\ \cline{1-1}
+\multicolumn{1}{|c|}{\reglabel{\ \ \ \ x5\ \ \ \ \ }}       \\ \cline{1-1}
+\multicolumn{1}{|c|}{\reglabel{\ \ \ \ x6\ \ \ \ \ }}       \\ \cline{1-1}
+\multicolumn{1}{|c|}{\reglabel{\ \ \ \ x7\ \ \ \ \ }}       \\ \cline{1-1}
+\multicolumn{1}{|c|}{\reglabel{\ \ \ \ x8\ \ \ \ \ }}       \\ \cline{1-1}
+\multicolumn{1}{|c|}{\reglabel{\ \ \ \ x9\ \ \ \ \ }}       \\ \cline{1-1}
+\multicolumn{1}{|c|}{\reglabel{\ \ \ x10\ \ \ \ \ }}        \\ \cline{1-1}
+\multicolumn{1}{|c|}{\reglabel{\ \ \ x11\ \ \ \ \ }}        \\ \cline{1-1}
+\multicolumn{1}{|c|}{\reglabel{\ \ \ x12\ \ \ \ \ }}        \\ \cline{1-1}
+\multicolumn{1}{|c|}{\reglabel{\ \ \ x13\ \ \ \ \ }}        \\ \cline{1-1}
+\multicolumn{1}{|c|}{\reglabel{\ \ \ x14\ \ \ \ \ }}        \\ \cline{1-1}
+\multicolumn{1}{|c|}{\reglabel{\ \ \ x15\ \ \ \ \ }}        \\ \cline{1-1}
+\multicolumn{1}{|c|}{\reglabel{\ \ \ x16\ \ \ \ \ }}        \\ \cline{1-1}
+\multicolumn{1}{|c|}{\reglabel{\ \ \ x17\ \ \ \ \ }}        \\ \cline{1-1}
+\multicolumn{1}{|c|}{\reglabel{\ \ \ x18\ \ \ \ \ }}        \\ \cline{1-1}
+\multicolumn{1}{|c|}{\reglabel{\ \ \ x19\ \ \ \ \ }}        \\ \cline{1-1}
+\multicolumn{1}{|c|}{\reglabel{\ \ \ x20\ \ \ \ \ }}        \\ \cline{1-1}
+\multicolumn{1}{|c|}{\reglabel{\ \ \ x21\ \ \ \ \ }}        \\ \cline{1-1}
+\multicolumn{1}{|c|}{\reglabel{\ \ \ x22\ \ \ \ \ }}        \\ \cline{1-1}
+\multicolumn{1}{|c|}{\reglabel{\ \ \ x23\ \ \ \ \ }}        \\ \cline{1-1}
+\multicolumn{1}{|c|}{\reglabel{\ \ \ x24\ \ \ \ \ }}        \\ \cline{1-1}
+\multicolumn{1}{|c|}{\reglabel{\ \ \ x25\ \ \ \ \ }}        \\ \cline{1-1}
+\multicolumn{1}{|c|}{\reglabel{\ \ \ x26\ \ \ \ \ }}        \\ \cline{1-1}
+\multicolumn{1}{|c|}{\reglabel{\ \ \ x27\ \ \ \ \ }}        \\ \cline{1-1}
+\multicolumn{1}{|c|}{\reglabel{\ \ \ x28\ \ \ \ \ }}        \\ \cline{1-1}
+\multicolumn{1}{|c|}{\reglabel{\ \ \ x29\ \ \ \ \ }}        \\ \cline{1-1}
+\multicolumn{1}{|c|}{\reglabel{\ \ \ x30\ \ \ \ \ }}        \\ \cline{1-1}
+\multicolumn{1}{|c|}{\reglabel{\ \ \ x31\ \ \ \ \ }}        \\ \cline{1-1}
+\multicolumn{1}{c}{XLEN}                                  \\
+
+\instbitrange{XLEN-1}{0}                                  \\ \cline{1-1}
+\multicolumn{1}{|c|}{\reglabel{pc}}                         \\ \cline{1-1}
+\multicolumn{1}{c}{XLEN}                                  \\
+\end{tabular}
+\end{center}
+}
+\caption{RISC-V user-level base integer register state.}
+\label{gprs}
+\end{figure}
+
+\newpage
+
+\section{Base Instruction Formats}
+
+In the base ISA, there are four core instruction formats (R/I/S/U), as
+shown in Figure~\ref{fig:baseinstformats}.  All are a fixed 32 bits in
+length and must be aligned on a four-byte boundary in memory.  An
+instruction address misaligned exception is generated on a taken
+branch or unconditional jump if the target address is not four-byte
+aligned.  No instruction address misaligned exception is generated for a
+conditional branch that is not taken.
+
+\vspace{-0.2in}
+\begin{figure}[h]
+\begin{center}
+\setlength{\tabcolsep}{4pt}
+\begin{tabular}{p{1.2in}@{}p{0.8in}@{}p{0.8in}@{}p{0.6in}@{}p{0.8in}@{}p{1in}l}
+\\
+\instbitrange{31}{25} &
+\instbitrange{24}{20} &
+\instbitrange{19}{15} &
+\instbitrange{14}{12} &
+\instbitrange{11}{7} &
+\instbitrange{6}{0} \\
+\cline{1-6}
+\multicolumn{1}{|c|}{funct7} &
+\multicolumn{1}{c|}{rs2} &
+\multicolumn{1}{c|}{rs1} &
+\multicolumn{1}{c|}{funct3} &
+\multicolumn{1}{c|}{rd} &
+\multicolumn{1}{c|}{opcode} &
+R-type \\
+\cline{1-6}
+\\
+\cline{1-6}
+\multicolumn{2}{|c|}{imm[11:0]} &
+\multicolumn{1}{c|}{rs1} &
+\multicolumn{1}{c|}{funct3} &
+\multicolumn{1}{c|}{rd} &
+\multicolumn{1}{c|}{opcode} &
+I-type \\
+\cline{1-6}
+\\
+\cline{1-6}
+\multicolumn{1}{|c|}{imm[11:5]} &
+\multicolumn{1}{c|}{rs2} &
+\multicolumn{1}{c|}{rs1} &
+\multicolumn{1}{c|}{funct3} &
+\multicolumn{1}{c|}{imm[4:0]} &
+\multicolumn{1}{c|}{opcode} &
+S-type \\
+\cline{1-6}
+\\
+\cline{1-6}
+\multicolumn{4}{|c|}{imm[31:12]} &
+\multicolumn{1}{c|}{rd} &
+\multicolumn{1}{c|}{opcode} &
+U-type \\
+\cline{1-6}
+\end{tabular}
+\end{center}
+\caption{RISC-V base instruction formats.}
+\label{fig:baseinstformats}
+\end{figure}
+
+The RISC-V ISA keeps the source ({\em rs1} and {\em rs2}) and
+destination ({\em rd}) registers at the same position in all formats
+to simplify decoding.  Immediates are packed towards the leftmost
+available bits in the instruction and have been allocated to reduce
+hardware complexity.  In particular, the sign bit for all immediates
+is always in bit 31 of the instruction to speed sign-extension
+circuitry.  
+
+\begin{commentary}
+Decoding register specifiers is usually on the critical paths in
+implementations, and so the instruction format was chosen to keep all
+register specifiers at the same position in all formats at the expense
+of having to move immediate bits across formats (a property shared
+with RISC-IV, a.k.a.\ SPUR~\cite{spur-jsscc1989}).
+
+In practice, most immediates are either small or require all XLEN
+bits.  We chose an asymmetric immediate split (12 bits in regular
+instructions plus a special load upper immediate instruction with 20
+bits) to increase the opcode space available for regular instructions.
+In addition, these immediates are all sign-extended.  We did not
+observe a benefit to using zero-extension for some immediates and
+wanted to keep the ISA as simple as possible.
+\end{commentary}
+
+\section{Immediate Encoding Variants}
+
+There are two further variants of the instruction formats (SB/UJ)
+based on the handling of immediates, as shown in
+Figure~\ref{fig:baseinstformatsimm}.
+
+\begin{figure}[h]
+\begin{small}
+\begin{center}
+\setlength{\tabcolsep}{4pt}
+\begin{tabular}{p{0.3in}@{}p{0.8in}@{}p{0.6in}@{}p{0.18in}@{}p{0.7in}@{}p{0.6in}@{}p{0.6in}@{}p{0.3in}@{}p{0.5in}l}
+\\
+\multicolumn{1}{c}{\instbit{31}} &
+\instbitrange{30}{25} &
+\instbitrange{24}{21} &
+\multicolumn{1}{c}{\instbit{20}} &
+\instbitrange{19}{15} &
+\instbitrange{14}{12} &
+\instbitrange{11}{8} &
+\multicolumn{1}{c}{\instbit{7}} &
+\instbitrange{6}{0} \\
+\cline{1-9}
+\multicolumn{2}{|c|}{funct7} &
+\multicolumn{2}{c|}{rs2} &
+\multicolumn{1}{c|}{rs1} &
+\multicolumn{1}{c|}{funct3} &
+\multicolumn{2}{c|}{rd} &
+\multicolumn{1}{c|}{opcode} &
+R-type \\
+\cline{1-9}
+\\
+\cline{1-9}
+\multicolumn{4}{|c|}{imm[11:0]} &
+\multicolumn{1}{c|}{rs1} &
+\multicolumn{1}{c|}{funct3} &
+\multicolumn{2}{c|}{rd} &
+\multicolumn{1}{c|}{opcode} &
+I-type \\
+\cline{1-9}
+\\
+\cline{1-9}
+\multicolumn{2}{|c|}{imm[11:5]} &
+\multicolumn{2}{c|}{rs2} &
+\multicolumn{1}{c|}{rs1} &
+\multicolumn{1}{c|}{funct3} &
+\multicolumn{2}{c|}{imm[4:0]} &
+\multicolumn{1}{c|}{opcode} &
+S-type \\
+\cline{1-9}
+\\
+\cline{1-9}
+\multicolumn{1}{|c|}{imm[12]} &
+\multicolumn{1}{c|}{imm[10:5]} &
+\multicolumn{2}{c|}{rs2} &
+\multicolumn{1}{c|}{rs1} &
+\multicolumn{1}{c|}{funct3} &
+\multicolumn{1}{c|}{imm[4:1]} &
+\multicolumn{1}{c|}{imm[11]} &
+\multicolumn{1}{c|}{opcode} &
+SB-type \\
+\cline{1-9}
+\\
+\cline{1-9}
+\multicolumn{6}{|c|}{imm[31:12]} &
+\multicolumn{2}{c|}{rd} &
+\multicolumn{1}{c|}{opcode} &
+U-type \\
+\cline{1-9}
+\\
+\cline{1-9}
+\multicolumn{1}{|c|}{imm[20]} &
+\multicolumn{2}{c|}{imm[10:1]} &
+\multicolumn{1}{c|}{imm[11]} &
+\multicolumn{2}{c|}{imm[19:12]} &
+\multicolumn{2}{c|}{rd} &
+\multicolumn{1}{c|}{opcode} &
+UJ-type \\
+\cline{1-9}
+\end{tabular}
+\end{center}
+\end{small}
+\caption{RISC-V base instruction formats showing immediate variants.}
+\label{fig:baseinstformatsimm}
+\end{figure}
+
+In Figure~\ref{fig:baseinstformatsimm} each immediate
+subfield is labeled with the bit position (imm[{\em x}\,]) in the
+immediate value being produced, rather than the bit position within
+the instruction's immediate field as is usually done.
+Figure~\ref{fig:immtypes} shows the immediates produced by each of the
+base instruction formats, and is labeled to show which instruction
+bit (inst[{\em y}\,]) produces each bit of the immediate value.
+
+\begin{figure}[h]
+\begin{center}
+\setlength{\tabcolsep}{4pt}
+\begin{tabular}{p{0.2in}@{}p{1.2in}@{}p{1.0in}@{}p{0.2in}@{}p{0.7in}@{}p{0.7in}@{}p{0.2in}l}
+\\
+\multicolumn{1}{c}{\instbit{31}} &
+\instbitrange{30}{20} &
+\instbitrange{19}{12} &
+\multicolumn{1}{c}{\instbit{11}} &
+\instbitrange{10}{5} &
+\instbitrange{4}{1} &
+\multicolumn{1}{c}{\instbit{0}} &
+\\
+\cline{1-7}
+\multicolumn{4}{|c|}{--- inst[31] ---} &
+\multicolumn{1}{c|}{inst[30:25]} &
+\multicolumn{1}{c|}{inst[24:21]} &
+\multicolumn{1}{c|}{inst[20]} &
+I-immediate \\
+\cline{1-7}
+\\
+\cline{1-7}
+\multicolumn{4}{|c|}{--- inst[31] ---} &
+\multicolumn{1}{c|}{inst[30:25]} &
+\multicolumn{1}{c|}{inst[11:8]} &
+\multicolumn{1}{c|}{inst[7]} &
+S-immediate \\
+\cline{1-7}
+\\
+\cline{1-7}
+\multicolumn{3}{|c|}{--- inst[31] ---} &
+\multicolumn{1}{c|}{inst[7]} &
+\multicolumn{1}{c|}{inst[30:25]} &
+\multicolumn{1}{c|}{inst[11:8]} &
+\multicolumn{1}{c|}{0} &
+B-immediate \\
+\cline{1-7}
+\\
+\cline{1-7}
+\multicolumn{1}{|c|}{inst[31]} &
+\multicolumn{1}{c|}{inst[30:20]} &
+\multicolumn{1}{c|}{inst[19:12]} &
+\multicolumn{4}{c|}{--- 0 ---} &
+U-immediate \\
+\cline{1-7}
+\\
+\cline{1-7}
+\multicolumn{2}{|c|}{--- inst[31] ---} &
+\multicolumn{1}{c|}{inst[19:12]} &
+\multicolumn{1}{c|}{inst[20]} &
+\multicolumn{1}{c|}{inst[30:25]} &
+\multicolumn{1}{c|}{inst[24:21]} &
+\multicolumn{1}{c|}{0} &
+J-immediate \\
+\cline{1-7}
+\end{tabular}
+\end{center}
+\caption{Types of immediate produced by RISC-V instructions.  The fields are labeled with the
+  instruction bits used to construct their value.  Sign extension
+  always uses inst[31].}
+\label{fig:immtypes}
+\end{figure}
+
+The only difference between the S and SB formats is that the 12-bit
+immediate field is used to encode branch offsets in multiples of 2 in
+the SB format.  Instead of shifting all bits in the
+instruction-encoded immediate left by one in hardware as is
+conventionally done, the middle bits (imm[10:1]) and sign bit stay in
+fixed positions, while the lowest bit in S format (inst[7]) encodes a
+high-order bit in SB format.
+
+Similarly, the only difference between the U and UJ formats is
+that the 20-bit immediate is shifted left by 12 bits to form U
+immediates and by 1 bit to form J immediates.  The location of
+instruction bits in the U and UJ format immediates is chosen to
+maximize overlap with the other formats and with each other.
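+
+\begin{commentary}
+As an illustrative restatement of Figure~\ref{fig:immtypes}, and not
+an additional requirement, the B and J immediates can be written as
+bit concatenations.  The notation is an assumption of this sketch:
+$\Vert$ denotes concatenation with the most-significant field first,
+and $\mathrm{sext}$ denotes sign-extension to XLEN bits.
+\begin{eqnarray*}
+\mbox{B-imm} & = & \mathrm{sext}(\mbox{inst}[31] \,\Vert\, \mbox{inst}[7] \,\Vert\, \mbox{inst}[30{:}25] \,\Vert\, \mbox{inst}[11{:}8] \,\Vert\, 0) \\
+\mbox{J-imm} & = & \mathrm{sext}(\mbox{inst}[31] \,\Vert\, \mbox{inst}[19{:}12] \,\Vert\, \mbox{inst}[20] \,\Vert\, \mbox{inst}[30{:}21] \,\Vert\, 0)
+\end{eqnarray*}
+The B-immediate is therefore 13 bits wide before sign-extension and
+the J-immediate 21 bits wide, each with a constant zero in bit 0.
+\end{commentary}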
+
+\begin{commentary}
+Sign-extension is one of the most critical operations on immediates
+(particularly in RV64I), and in RISC-V the sign bit for all immediates
+is always held in bit 31 of the instruction to allow sign-extension to
+proceed in parallel with instruction decoding.
+
+Although more complex implementations might have separate adders for
+branch and jump calculations and so would not benefit from keeping the
+location of immediate bits constant across types of instruction, we
+wanted to reduce the hardware cost of the simplest implementations.
+By rotating bits in the instruction encoding of B and J immediates
+instead of using dynamic hardware muxes to multiply the immediate by
+2, we reduce instruction signal fanout and immediate mux costs by
+around a factor of 2.  The scrambled immediate encoding will add
+negligible time to static or ahead-of-time compilation.  For dynamic
+generation of instructions, there is some small additional
+overhead, but the most common short forward branches have
+straightforward immediate encodings.
+\end{commentary}
+
+\section{Integer Computational Instructions}
+
+Most integer computational instructions operate on XLEN bits of values
+held in the integer register file.  Integer computational instructions
+are either encoded as register-immediate operations using the I-type
+format or as register-register operations using the R-type format.
+The destination is register {\em rd} for both register-immediate and
+register-register instructions.  No integer computational instructions
+cause arithmetic exceptions.
+
+\begin{commentary}
+We did not include special instruction set support for overflow checks
+on integer arithmetic operations, as many overflow checks can be
+cheaply implemented using RISC-V branches.  Overflow checking for
+unsigned addition requires only a single additional branch instruction
+after the addition.  Similarly, signed array bounds checking requires
+only a single branch instruction.  Overflow checks for signed addition
+require several instructions depending on whether the addend is an
+immediate or a variable.  We considered adding branches that test if
+the sum of their signed register operands would overflow, but
+ultimately chose to omit these from the base ISA.
+\end{commentary}
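+
+\begin{commentary}
+The overflow checks mentioned above might be sketched as follows.
+The register choices, the immediate value, and the label
+{\tt overflow} are hypothetical and chosen only for illustration.
+\begin{verbatim}
+    # Unsigned addition: the sum wrapped if it is below an operand.
+    add   x5, x6, x7
+    bltu  x5, x6, overflow
+
+    # Signed addition of a positive immediate: the sum must not
+    # be less than the original operand.
+    addi  x5, x6, 1
+    blt   x5, x6, overflow
+
+    # Signed addition, general case: overflow if and only if
+    # (x7 < 0) differs from (sum < x6).
+    add   x5, x6, x7
+    slti  x28, x7, 0
+    slt   x29, x5, x6
+    bne   x28, x29, overflow
+\end{verbatim}
+\end{commentary}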
+
+\subsubsection*{Integer Register-Immediate Instructions}
+\vspace{-0.4in}
+\begin{center}
+\begin{tabular}{M@{}R@{}S@{}R@{}O}
+\\
+\instbitrange{31}{20} &
+\instbitrange{19}{15} &
+\instbitrange{14}{12} &
+\instbitrange{11}{7} &
+\instbitrange{6}{0} \\
+\hline
+\multicolumn{1}{|c|}{imm[11:0]} &
+\multicolumn{1}{c|}{rs1} &
+\multicolumn{1}{c|}{funct3} &
+\multicolumn{1}{c|}{rd} &
+\multicolumn{1}{c|}{opcode} \\
+\hline
+12 & 5 & 3 & 5 & 7 \\
+I-immediate[11:0] & src & ADDI/SLTI[U]  & dest & OP-IMM \\
+I-immediate[11:0] & src & ANDI/ORI/XORI & dest & OP-IMM \\
+\end{tabular}
+\end{center}
+ADDI adds the sign-extended 12-bit immediate to register {\em rs1}.
+Arithmetic overflow is ignored and the result is simply the low
+XLEN bits of the sum.  ADDI {\em rd, rs1, 0} is used to implement the
+MV {\em rd, rs1} assembler pseudo-instruction.
+
+SLTI (set less than immediate) places the value 1 in register {\em rd}
+if register {\em rs1} is less than the sign-extended immediate when
+both are treated as signed numbers, else 0 is written to {\em rd}.
+SLTIU is similar but compares the values as unsigned numbers (i.e.,
+the immediate is first sign-extended to XLEN bits then treated as an
+unsigned number).  Note, SLTIU {\em rd}, {\em rs1}, 1 sets {\em rd}
+to 1 if {\em rs1} equals zero, otherwise sets {\em rd} to 0 (assembler
+pseudo-op SEQZ {\em rd, rs}).
+
+ANDI, ORI, XORI are logical operations that perform bitwise AND, OR,
+and XOR on register {\em rs1} and the sign-extended 12-bit immediate
+and place the result in {\em rd}.  Note, XORI {\em rd, rs1, -1}
+performs a bitwise logical inversion of register {\em rs1} (assembler
+pseudo-instruction NOT {\em rd, rs}).
+
+\vspace{-0.2in}
+\begin{center}
+\begin{tabular}{S@{}R@{}R@{}S@{}R@{}O}
+\\
+\instbitrange{31}{25} &
+\instbitrange{24}{20} &
+\instbitrange{19}{15} &
+\instbitrange{14}{12} &
+\instbitrange{11}{7} &
+\instbitrange{6}{0} \\
+\hline
+\multicolumn{1}{|c|}{imm[11:5]} &
+\multicolumn{1}{c|}{imm[4:0]} &
+\multicolumn{1}{c|}{rs1} &
+\multicolumn{1}{c|}{funct3} &
+\multicolumn{1}{c|}{rd} &
+\multicolumn{1}{c|}{opcode} \\
+\hline
+7 & 5 & 5 & 3 & 5 & 7 \\
+0000000 & shamt[4:0]  & src & SLLI & dest & OP-IMM \\
+0000000 & shamt[4:0]  & src & SRLI & dest & OP-IMM \\
+0100000 & shamt[4:0]  & src & SRAI & dest & OP-IMM \\
+\end{tabular}
+\end{center}
+
+Shifts by a constant are encoded as a specialization of the
+I-type format.  The operand to be shifted is in {\em rs1}, and the
+shift amount is encoded in the lower 5 bits of the I-immediate field.
+The right shift type is encoded in bit 30 of the instruction (imm[10]).
+SLLI is a logical left shift (zeros are shifted into the lower bits);
+SRLI is a logical right shift (zeros are shifted into the upper bits);
+and SRAI is an arithmetic right shift (the original sign bit is copied
+into the vacated upper bits).
+
+\vspace{-0.2in}
+\begin{center}
+\begin{tabular}{U@{}R@{}O}
+\\
+\instbitrange{31}{12} &
+\instbitrange{11}{7} &
+\instbitrange{6}{0} \\
+\hline
+\multicolumn{1}{|c|}{imm[31:12]} &
+\multicolumn{1}{c|}{rd} &
+\multicolumn{1}{c|}{opcode} \\
+\hline
+20 & 5 & 7 \\
+U-immediate[31:12] & dest & LUI \\
+U-immediate[31:12] & dest & AUIPC
+\end{tabular}
+\end{center}
+
+LUI (load upper immediate) is used to build 32-bit constants and uses
+the U-type format.  LUI places the U-immediate value in the top 20
+bits of the destination register {\em rd}, filling in the lowest 12
+bits with zeros.
+
+AUIPC (add upper immediate to {\tt pc}) is used to build {\tt pc}-relative
+addresses and uses the U-type format.  AUIPC forms a 32-bit offset from the
+20-bit U-immediate, filling in the lowest 12 bits with zeros, adds this offset
+to the {\tt pc}, then places the result in register {\em rd}.
+
+\begin{commentary}
+The AUIPC instruction supports two-instruction sequences to access
+arbitrary offsets from the PC for both control-flow transfers and data
+accesses.  The combination of an AUIPC and the 12-bit immediate in a
+JALR can transfer control to any 32-bit PC-relative address, while an
+AUIPC plus the 12-bit immediate offset in regular load or store
+instructions can access any 32-bit PC-relative data address.
+
+The current PC can be obtained by setting the U-immediate to 0.  Although
+a JAL +4 instruction could also be used to obtain the PC, it might cause
+pipeline breaks in simpler microarchitectures or pollute the BTB structures in
+more complex microarchitectures.
+\end{commentary}
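+
+\begin{commentary}
+As a worked example (a sketch, not part of the specification), a
+32-bit constant $c$ can be split into a 20-bit upper part for LUI or
+AUIPC and a 12-bit lower part for a following I-type or S-type
+instruction.  Because the 12-bit immediate is sign-extended, the
+upper part must be rounded rather than truncated.  The names hi and
+lo are introduced here for illustration only.
+\begin{eqnarray*}
+\mbox{lo} & = & ((c + \mbox{\tt 0x800}) \bmod 2^{12}) - \mbox{\tt 0x800} \\
+\mbox{hi} & = & \lfloor (c + \mbox{\tt 0x800}) / 2^{12} \rfloor \bmod 2^{20}
+\end{eqnarray*}
+For $c = \mbox{\tt 0x12345FFF}$ this gives $\mbox{lo} = -1$ and
+$\mbox{hi} = \mbox{\tt 0x12346}$: LUI produces {\tt 0x12346000}, and
+adding $-1$ yields {\tt 0x12345FFF}.
+\end{commentary}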
+
+\subsubsection*{Integer Register-Register Operations}
+
+RV32I defines several arithmetic R-type operations.  All operations
+read the {\em rs1} and {\em rs2} registers as source operands and
+write the result into register {\em rd}.  The {\em funct7} and {\em
+  funct3} fields select the type of operation.
+
+\vspace{-0.2in}
+\begin{center}
+\begin{tabular}{S@{}R@{}R@{}S@{}R@{}O}
+\\
+\instbitrange{31}{25} &
+\instbitrange{24}{20} &
+\instbitrange{19}{15} &
+\instbitrange{14}{12} &
+\instbitrange{11}{7} &
+\instbitrange{6}{0} \\
+\hline
+\multicolumn{1}{|c|}{funct7} &
+\multicolumn{1}{c|}{rs2} &
+\multicolumn{1}{c|}{rs1} &
+\multicolumn{1}{c|}{funct3} &
+\multicolumn{1}{c|}{rd} &
+\multicolumn{1}{c|}{opcode} \\
+\hline
+7 & 5 & 5 & 3 & 5 & 7 \\
+0000000 & src2 & src1 & ADD/SLT/SLTU & dest & OP    \\
+0000000 & src2 & src1 & AND/OR/XOR  & dest & OP    \\
+0000000 & src2 & src1 & SLL/SRL     & dest & OP    \\
+0100000 & src2 & src1 & SUB/SRA     & dest & OP    \\
+\end{tabular}
+\end{center}
+
+ADD and SUB perform addition and subtraction respectively.  Overflows
+are ignored and the low XLEN bits of results are written to the
+destination.  SLT and SLTU perform signed and unsigned compares
+respectively, writing 1 to {\em rd} if $\mbox{\em rs1} < \mbox{\em
+  rs2}$, 0 otherwise.  Note, SLTU {\em rd}, {\em x0}, {\em rs2} sets
+{\em rd} to 1 if {\em rs2} is not equal to zero, otherwise sets {\em
+  rd} to zero (assembler pseudo-op SNEZ {\em rd, rs}).  AND, OR, and
+XOR perform bitwise logical operations.
+
+SLL, SRL, and SRA perform logical left, logical right, and arithmetic
+right shifts on the value in register {\em rs1} by the shift amount
+held in the lower 5 bits of register {\em rs2}.
+
+\subsubsection*{NOP Instruction}
+\vspace{-0.4in}
+\begin{center}
+\begin{tabular}{M@{}R@{}S@{}R@{}O}
+\\
+\instbitrange{31}{20} &
+\instbitrange{19}{15} &
+\instbitrange{14}{12} &
+\instbitrange{11}{7} &
+\instbitrange{6}{0} \\
+\hline
+\multicolumn{1}{|c|}{imm[11:0]} &
+\multicolumn{1}{c|}{rs1} &
+\multicolumn{1}{c|}{funct3} &
+\multicolumn{1}{c|}{rd} &
+\multicolumn{1}{c|}{opcode} \\
+\hline
+12 & 5 & 3 & 5 & 7 \\
+0 & 0 & ADDI & 0 & OP-IMM \\
+\end{tabular}
+\end{center}
+
+The NOP instruction does not change any user-visible state, except
+for advancing the {\tt pc}.  NOP is encoded as ADDI {\em x0, x0, 0}.
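+
+\begin{commentary}
+As a worked example, and assuming the OP-IMM major opcode value
+{\tt 0010011} and the ADDI {\em funct3} value {\tt 000} given in the
+instruction listings elsewhere in this manual, every other field of
+the canonical NOP is zero, so the full 32-bit encoding is
+\[
+\mbox{inst} = \mbox{\tt 0x00000013}.
+\]
+\end{commentary}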
+
+\begin{commentary}
+NOPs can be used to align code segments to microarchitecturally
+significant address boundaries, or to leave space for inline code
+modifications.  Although there are many possible ways to encode a NOP,
+we define a canonical NOP encoding to allow microarchitectural
+optimizations as well as for more readable disassembly output.
+\end{commentary}
+
+\section{Control Transfer Instructions}
+
+RV32I provides two types of control transfer instructions:
+unconditional jumps and conditional branches.  Control transfer
+instructions in RV32I do {\em not} have architecturally visible delay
+slots.
+
+\subsubsection*{Unconditional Jumps}
+
+\vspace{-0.1in} The jump and link (JAL) instruction uses the UJ-type
+format, where the J-immediate encodes a signed offset in multiples of
+2 bytes.  The offset is sign-extended and added to the {\tt pc}
+to form the jump target address.  Jumps can therefore target a
+$\pm$\wunits{1}{MiB} range. JAL stores the address of the instruction
+following the jump ({\tt pc}+4) into register {\em rd}.  The standard
+software calling convention uses {\tt x1} as the return address
+register and {\tt x5} as an alternate link register.
+
+Plain unconditional jumps (assembler pseudo-op J) are encoded as a JAL
+with {\em rd}={\tt x0}.
+
+\vspace{-0.2in}
+\begin{center}
+\begin{tabular}{W@{}E@{}W@{}R@{}R@{}O}
+\\
+\multicolumn{1}{c}{\instbit{31}} &
+\instbitrange{30}{21} &
+\multicolumn{1}{c}{\instbit{20}} &
+\instbitrange{19}{12} &
+\instbitrange{11}{7} &
+\instbitrange{6}{0} \\
+\hline
+\multicolumn{1}{|c|}{imm[20]} &
+\multicolumn{1}{c|}{imm[10:1]} &
+\multicolumn{1}{c|}{imm[11]} &
+\multicolumn{1}{c|}{imm[19:12]} &
+\multicolumn{1}{c|}{rd} &
+\multicolumn{1}{c|}{opcode} \\
+\hline
+1 & 10 & \multicolumn{1}{c}{1} & 8 & 5 & 7 \\
+\multicolumn{4}{c}{offset[20:1]} & dest & JAL \\
+\end{tabular}
+\end{center}
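+
+\begin{commentary}
+The quoted jump range follows directly from the encoding; the
+derivation below is an illustrative restatement.  The 20 encoded bits
+form offset[20:1] and offset[0] is always zero, giving a 21-bit signed
+byte offset:
+\[
+-2^{20} \le \mbox{offset} \le 2^{20} - 2,
+\qquad 2^{20}\ \mbox{bytes} = 1\ \mbox{MiB}.
+\]
+\end{commentary}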
+
+The indirect jump instruction JALR (jump and link register) uses the
+I-type encoding.  The target address is obtained by adding the 12-bit
+signed I-immediate to the register {\em rs1}, then setting the
+least-significant bit of the result to zero.  The address of
+the instruction following the jump ({\tt pc}+4) is written to register
+{\em rd}.  Register {\tt x0} can be used as the destination if the
+result is not required.
+\vspace{-0.4in}
+\begin{center}
+\begin{tabular}{M@{}R@{}F@{}R@{}O}
+\\
+\instbitrange{31}{20} &
+\instbitrange{19}{15} &
+\instbitrange{14}{12} &
+\instbitrange{11}{7} &
+\instbitrange{6}{0} \\
+\hline
+\multicolumn{1}{|c|}{imm[11:0]} &
+\multicolumn{1}{c|}{rs1} &
+\multicolumn{1}{c|}{funct3} &
+\multicolumn{1}{c|}{rd} &
+\multicolumn{1}{c|}{opcode} \\
+\hline
+12 & 5 & 3 & 5 & 7 \\
+offset[11:0] & base & 0 & dest & JALR \\
+\end{tabular}
+\end{center}
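+
+\begin{commentary}
+The JALR target computation described above can be written out as
+follows.  This is a sketch: the symbol $t$ and the function
+$\mathrm{sext}$ (sign-extension to XLEN bits) are notation introduced
+here, and {\em rs1} stands for the value held in that register.
+\begin{eqnarray*}
+t & = & (\mbox{\em rs1} + \mathrm{sext}(\mbox{imm}[11{:}0])) \bmod 2^{\mbox{\scriptsize XLEN}} \\
+\mbox{target} & = & t - (t \bmod 2)
+\end{eqnarray*}
+For example, with {\em rs1} holding {\tt 0x1000} and an immediate of
+7, $t = \mbox{\tt 0x1007}$ and the jump target is {\tt 0x1006}.
+\end{commentary}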
+
+The JAL and JALR instructions can generate an instruction address
+misaligned exception if the target address is not aligned to a
+four-byte boundary.
+
+\begin{commentary}
+The unconditional jump instructions all use PC-relative addressing to
+help support position-independent code.  The JALR instruction was
+defined to enable a two-instruction sequence to jump anywhere in a
+32-bit absolute address range.  A LUI instruction can first load {\em
+  rs1} with the upper 20 bits of a target address, then JALR can add
+in the lower bits. Similarly, AUIPC then JALR can jump
+anywhere in a 32-bit {\tt pc}-relative address range.
+
+Note that the JALR instruction does not treat the 12-bit immediate as
+multiples of 2 bytes, unlike the conditional branch instructions.
+This avoids one more immediate format in hardware.  In
+practice, most uses of JALR will have either a zero immediate or be
+paired with a LUI or AUIPC, so the slight reduction in range is not
+significant.
+
+The JALR instruction ignores the lowest bit of the calculated target
+address.  This both simplifies the hardware slightly and allows the
+low bit of function pointers to be used to store auxiliary
+information.  Although there is potentially a slight loss of error
+checking in this case, in practice jumps to an incorrect instruction
+address will usually quickly raise an exception.
+
+Instruction address misaligned exceptions are not possible on machines
+that support extensions with 16-bit aligned instructions, such as the
+compressed instruction set extension, C.
+
+Return-address prediction stacks are a common feature of high-performance
+instruction-fetch units.  We note that {\em rd} and {\em rs1} can be used to
+guide an implementation's instruction-fetch prediction logic, indicating
+whether JALR instructions should push ({\em rd}$=${\tt x1}/{\tt x5}), pop
+({\em rs1}$=${\tt x1}/{\tt x5}), or not touch (otherwise)
+a return-address stack.  Similarly, a JAL instruction should push the return
+address onto the return-address stack only when {\em rd}$=${\tt x1}/{\tt x5}.
+
+When used with a base {\em rs1}$=${\tt x0}, JALR can be used to implement
+a single instruction subroutine call to the lowest \wunits{2}{KiB} or highest
+\wunits{2}{KiB} address region from anywhere in the address space, which could
+be used to implement fast calls to a small runtime library.
+\end{commentary}
+
+\subsubsection*{Conditional Branches}
+
+All branch instructions use the SB-type instruction format.  The
+12-bit B-immediate encodes signed offsets in multiples of 2, and is
+added to the current {\tt pc} to give the target address.  The
+conditional branch range is $\pm$\wunits{4}{KiB}.
+
+\vspace{-0.2in}
+\begin{center}
+\begin{tabular}{W@{}R@{}F@{}F@{}R@{}R@{}F@{}S}
+\\
+\multicolumn{1}{c}{\instbit{31}} &
+\instbitrange{30}{25} &
+\instbitrange{24}{20} &
+\instbitrange{19}{15} &
+\instbitrange{14}{12} &
+\instbitrange{11}{8} &
+\multicolumn{1}{c}{\instbit{7}} &
+\instbitrange{6}{0} \\
+\hline
+\multicolumn{1}{|c|}{imm[12]} &
+\multicolumn{1}{c|}{imm[10:5]} &
+\multicolumn{1}{c|}{rs2} &
+\multicolumn{1}{c|}{rs1} &
+\multicolumn{1}{c|}{funct3} &
+\multicolumn{1}{c|}{imm[4:1]} &
+\multicolumn{1}{c|}{imm[11]} &
+\multicolumn{1}{c|}{opcode} \\
+\hline
+1 & 6 & 5 & 5 & 3 & 4 & 1 & 7 \\
+\multicolumn{2}{c}{offset[12,10:5]} & src2 & src1 & BEQ/BNE & \multicolumn{2}{c}{offset[11,4:1]} & BRANCH \\
+\multicolumn{2}{c}{offset[12,10:5]} & src2 & src1 & BLT[U] & \multicolumn{2}{c}{offset[11,4:1]} & BRANCH \\
+\multicolumn{2}{c}{offset[12,10:5]} & src2 & src1 & BGE[U]  & \multicolumn{2}{c}{offset[11,4:1]} & BRANCH \\
+\end{tabular}
+\end{center}
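+
+\begin{commentary}
+The quoted branch range follows from the encoding in the same way as
+for jumps; the derivation below is an illustrative restatement.  The
+12 encoded bits form offset[12:1] and offset[0] is always zero, giving
+a 13-bit signed byte offset:
+\[
+-2^{12} \le \mbox{offset} \le 2^{12} - 2,
+\qquad 2^{12}\ \mbox{bytes} = 4\ \mbox{KiB}.
+\]
+\end{commentary}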
+
+Branch instructions compare two registers.  BEQ and BNE take the
+branch if registers {\em rs1} and {\em rs2} are equal or unequal
+respectively.  BLT and BLTU take the branch if {\em rs1} is less than
+{\em rs2}, using signed and unsigned comparison respectively.  BGE and
+BGEU take the branch if {\em rs1} is greater than or equal to {\em rs2},
+using signed and unsigned comparison respectively. Note, BGT, BGTU,
+BLE, and BLEU can be synthesized by reversing the operands to BLT,
+BLTU, BGE, and BGEU, respectively.
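+
+\begin{commentary}
+Written out, the operand reversals are as follows, where {\em a} and
+{\em b} are placeholder register names and {\em target} is a
+placeholder label, all introduced here for illustration:
+\begin{center}
+\begin{tabular}{lcl}
+BGT  {\em a, b, target} & $\equiv$ & BLT  {\em b, a, target} \\
+BGTU {\em a, b, target} & $\equiv$ & BLTU {\em b, a, target} \\
+BLE  {\em a, b, target} & $\equiv$ & BGE  {\em b, a, target} \\
+BLEU {\em a, b, target} & $\equiv$ & BGEU {\em b, a, target} \\
+\end{tabular}
+\end{center}
+\end{commentary}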
+
+Software should be optimized such that the sequential code path is the
+most common path, with less-frequently taken code paths placed out of
+line.  Software should also assume that backward branches will be
+predicted taken and forward branches as not taken, at least the
+first time they are encountered.  Dynamic predictors should quickly
+learn any predictable branch behavior.
+
+Unlike some other architectures, the RISC-V jump (JAL with {\em
+  rd}={\tt x0}) instruction should always be used for unconditional
+branches instead of a conditional branch instruction with an always-true
+condition.  RISC-V jumps are also PC-relative and support a much
+wider offset range than branches, and will not pressure conditional
+branch prediction tables.
+
+\begin{commentary}
+The conditional branches were designed to include arithmetic
+comparison operations between two registers (as also done in PA-RISC
+and Xtensa ISA), rather than use condition codes (x86, ARM, SPARC,
+PowerPC), or to only compare one register against zero (Alpha, MIPS),
+or two registers only for equality (MIPS).  This design was motivated
+by the observation that a combined compare-and-branch instruction fits
+into a regular pipeline, avoids additional condition code state or use
+of a temporary register, and reduces static code size and dynamic
+instruction fetch traffic.  Another point is that comparisons against
+zero require non-trivial circuit delay (especially after the move to
+static logic in advanced processes) and so are almost as expensive as
+arithmetic magnitude compares.  Another advantage of a fused
+compare-and-branch instruction is that branches are observed earlier
+in the front-end instruction stream, and so can be predicted earlier.
+There is perhaps an advantage to a design with condition codes in the
+case where multiple branches can be taken based on the same condition
+codes, but we believe this case to be relatively rare.
+
+We considered but did not include static branch hints in the
+instruction encoding.  These can reduce the pressure on dynamic
+predictors, but require more instruction encoding space and
+software profiling for best results, and can result in poor
+performance if production runs do not match profiling runs.
+
+We considered but did not include conditional moves or predicated
+instructions, which can effectively replace unpredictable short
+forward branches.  Conditional moves are the simpler of the two, but
+are difficult to use with conditional code that might cause exceptions
+(memory accesses and floating-point operations).  Predication adds
+additional flag state to a system, additional instructions to set and
+clear flags, and additional encoding overhead on every instruction.
+Both conditional move and predicated instructions add complexity to
+out-of-order microarchitectures, adding an implicit third source
+operand due to the need to copy the original value of the destination
+architectural register into the renamed destination physical register
+if the predicate is false.  Also, static compile-time decisions to use
+predication instead of branches can result in lower performance on
+inputs not included in the compiler training set, especially given
+that unpredictable branches are rare, and becoming rarer as branch
+prediction techniques improve.
+
+We note that various microarchitectural techniques exist to
+dynamically convert unpredictable short forward branches into
+internally predicated code to avoid the cost of flushing pipelines on
+a branch mispredict~\cite{heil-tr1996,Klauser-1998,Kim-micro2005} and
+have been implemented in commercial processors~\cite{ibmpower7}.
+The simplest techniques just reduce the penalty of recovering from a
+mispredicted short forward branch by only flushing instructions in the
+branch shadow instead of the entire fetch pipeline, or by fetching
+instructions from both sides using wide instruction fetch or idle
+instruction fetch slots.  More complex techniques for out-of-order
+cores add internal predicates on instructions in the branch shadow,
+with the internal predicate value written by the branch instruction,
+allowing the branch and following instructions to be executed
+speculatively and out-of-order with respect to other code~\cite{ibmpower7}.
+\end{commentary}
+
+\section{Load and Store Instructions}
+
+RV32I is a load-store architecture, where only load and store
+instructions access memory and arithmetic instructions only operate on
+CPU registers.  RV32I provides a 32-bit user address space that is
+byte-addressed and little-endian.  The execution environment will
+define what portions of the address space are legal to access.  Loads
+with a destination of {\tt x0} must still raise any exceptions and
+cause any other side effects even though the load value is discarded.
+
+\vspace{-0.4in}
+\begin{center}
+\begin{tabular}{M@{}R@{}F@{}R@{}O}
+\\
+\instbitrange{31}{20} &
+\instbitrange{19}{15} &
+\instbitrange{14}{12} &
+\instbitrange{11}{7} &
+\instbitrange{6}{0} \\
+\hline
+\multicolumn{1}{|c|}{imm[11:0]} &
+\multicolumn{1}{c|}{rs1} &
+\multicolumn{1}{c|}{funct3} &
+\multicolumn{1}{c|}{rd} &
+\multicolumn{1}{c|}{opcode} \\
+\hline
+12 & 5 & 3 & 5 & 7 \\
+offset[11:0] & base & width & dest & LOAD \\
+\end{tabular}
+\end{center}
+
+\vspace{-0.2in}
+\begin{center}
+\begin{tabular}{O@{}R@{}R@{}F@{}R@{}O}
+\\
+\instbitrange{31}{25} &
+\instbitrange{24}{20} &
+\instbitrange{19}{15} &
+\instbitrange{14}{12} &
+\instbitrange{11}{7} &
+\instbitrange{6}{0} \\
+\hline
+\multicolumn{1}{|c|}{imm[11:5]} &
+\multicolumn{1}{c|}{rs2} &
+\multicolumn{1}{c|}{rs1} &
+\multicolumn{1}{c|}{funct3} &
+\multicolumn{1}{c|}{imm[4:0]} &
+\multicolumn{1}{c|}{opcode} \\
+\hline
+7 & 5 & 5 & 3 & 5 & 7 \\
+offset[11:5] & src & base & width & offset[4:0] & STORE \\
+\end{tabular}
+\end{center}
+
+Load and store instructions transfer a value between the registers and
+memory.  Loads are encoded in the I-type format and stores are
+S-type.  The effective byte address is obtained by adding register
+{\em rs1} to the sign-extended 12-bit offset.  Loads copy a value
+from memory to register {\em rd}.  Stores copy the value in register
+{\em rs2} to memory.
+
+The LW instruction loads a 32-bit value from memory into {\em rd}.  LH
+loads a 16-bit value from memory, then sign-extends it to 32 bits before
+storing it in {\em rd}.  LHU loads a 16-bit value from memory but then
+zero-extends it to 32 bits before storing it in {\em rd}.  LB and LBU are
+defined analogously for 8-bit values.  The SW, SH, and SB instructions
+store 32-bit, 16-bit, and 8-bit values from the low bits of register
+{\em rs2} to memory.
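+
+The following sketch illustrates the difference between the
+sign-extending and zero-extending byte loads, and the use of the
+signed offset.  It assumes, purely for illustration, that {\tt x6}
+holds a legal address and that the byte stored there is {\tt 0x80};
+the register choices are arbitrary.
+
+\begin{verbatim}
+    lb    x5, 0(x6)    # x5 = 0xffffff80 (byte sign-extended to 32 bits)
+    lbu   x7, 0(x6)    # x7 = 0x00000080 (byte zero-extended to 32 bits)
+    sb    x5, 1(x6)    # stores 0x80, the low 8 bits of x5, at address x6+1
+    lw    x8, -4(x6)   # effective address is x6 plus the sign-extended -4
+\end{verbatim}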
+
+For best performance, the effective address for all loads and stores
+should be naturally aligned for each data type (i.e., on a four-byte
+boundary for 32-bit accesses, and a two-byte boundary for 16-bit
+accesses).  The base ISA supports misaligned accesses, but these might
+run extremely slowly depending on the implementation.  Furthermore,
+naturally aligned loads and stores are guaranteed to execute
+atomically, whereas misaligned loads and stores might not, and hence
+require additional synchronization to ensure atomicity.
+
+\begin{commentary}
+Misaligned accesses are occasionally required when porting legacy
+code, and are essential for good performance on many applications when
+using any form of packed-SIMD extension.  Our rationale for supporting
+misaligned accesses via the regular load and store instructions is to
+simplify the addition of misaligned hardware support.  One option
+would have been to disallow misaligned accesses in the base ISA and
+then provide some separate ISA support for misaligned accesses, either
+special instructions to help software handle misaligned accesses or a
+new hardware addressing mode for misaligned accesses.  Special
+instructions are difficult to use, complicate the ISA, and often add
+new processor state (e.g., SPARC VIS align address offset register) or
+complicate access to existing processor state (e.g., MIPS LWL/LWR
+partial register writes).  In addition, for loop-oriented packed-SIMD
+code, the extra overhead when operands are misaligned motivates
+software to provide multiple forms of loop depending on operand
+alignment, which complicates code generation and adds to loop startup
+overhead.  New misaligned hardware addressing modes take considerable
+space in the instruction encoding or require very simplified
+addressing modes (e.g., register indirect only).
+
+We do not mandate atomicity for misaligned accesses so simple
+implementations can just use a machine trap and software handler to
+handle some or all misaligned accesses.  If hardware misaligned support is
+provided, software can exploit this by simply using regular load and
+store instructions.  Hardware can then automatically optimize accesses
+depending on whether runtime addresses are aligned.
+\end{commentary}
+
+\section{Memory Model}
+
+The base RISC-V ISA supports multiple concurrent threads of execution
+within a single user address space.  Each RISC-V thread has its own
+user register state and program counter, and executes an independent
+sequential instruction stream.  The execution environment will define
+how RISC-V threads are created and managed.  RISC-V threads can
+communicate and synchronize with other threads either via calls to the
+execution environment, which are documented separately in the
+specification for each execution environment, or directly via the
+shared memory system.  RISC-V threads can also interact with I/O
+devices, and indirectly with each other, via loads and stores to
+portions of the address space assigned to I/O.
+
+In the base RISC-V ISA, each RISC-V thread observes its own memory
+operations as if they executed sequentially in program order.  RISC-V
+has a relaxed memory model between threads, requiring an explicit
+FENCE instruction to guarantee any specific ordering between memory
+operations from different RISC-V threads.  Chapter~\ref{atomics}
+describes the optional atomic memory instruction extensions ``A'',
+which provide additional synchronization operations.
+
+\vspace{-0.2in}
+\begin{center}
+\begin{tabular}{F@{}IIIIIIIIF@{}F@{}F@{}S}
+\\
+\instbitrange{31}{28} &
+\multicolumn{1}{c}{\instbit{27}} &
+\multicolumn{1}{c}{\instbit{26}} &
+\multicolumn{1}{c}{\instbit{25}} &
+\multicolumn{1}{c}{\instbit{24}} &
+\multicolumn{1}{c}{\instbit{23}} &
+\multicolumn{1}{c}{\instbit{22}} &
+\multicolumn{1}{c}{\instbit{21}} &
+\multicolumn{1}{c}{\instbit{20}} &
+\instbitrange{19}{15} &
+\instbitrange{14}{12} &
+\instbitrange{11}{7} &
+\instbitrange{6}{0} \\
+\hline
+\multicolumn{1}{|c|}{0} &
+\multicolumn{1}{c|}{PI} &
+\multicolumn{1}{c|}{PO} &
+\multicolumn{1}{c|}{PR} &
+\multicolumn{1}{c|}{PW} &
+\multicolumn{1}{|c|}{SI} &
+\multicolumn{1}{c|}{SO} &
+\multicolumn{1}{c|}{SR} &
+\multicolumn{1}{c|}{SW} &
+\multicolumn{1}{c|}{rs1} &
+\multicolumn{1}{c|}{funct3} &
+\multicolumn{1}{c|}{rd} &
+\multicolumn{1}{c|}{opcode} \\
+\hline
+4 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 5 & 3 & 5 & 7 \\
+0 & \multicolumn{4}{c}{predecessor} & \multicolumn{4}{c}{successor} & 0 & FENCE & 0 & MISC-MEM \\
+\end{tabular}
+\end{center}
+
+The FENCE instruction is used to order device I/O and
+memory accesses as viewed by other RISC-V threads and external devices
+or coprocessors.  Any combination of device input (I), device output
+(O), memory reads (R), and memory writes (W) may be ordered with
+respect to any combination of the same.  Informally, no other RISC-V
+thread or external device can observe any operation in the {\em
+  successor} set following a FENCE before any operation in the {\em
+  predecessor} set preceding the FENCE.  The execution environment
+will define what I/O operations are possible, and in particular, which
+load and store instructions might be treated and ordered as device
+input and device output operations respectively rather than memory
+reads and writes.  For example, memory-mapped I/O devices will
+typically be accessed with uncached loads and stores that are ordered
+using the I and O bits rather than the R and W bits.  Instruction-set
+extensions might also describe new coprocessor I/O instructions that
+will also be ordered using the I and O bits in a FENCE.
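+
+As an informal sketch of how FENCE might be used, consider one thread
+publishing a value through ordinary memory and another consuming it.
+The register assignments ({\tt x10} holds the address of the data
+word, {\tt x11} the address of a flag word that is initially zero,
+{\tt x5} the data, and {\tt x6} the constant 1) and the {\tt fence}
+operand syntax are assumptions about the surrounding code and the
+assembler, not requirements of this chapter.
+
+\begin{verbatim}
+    # Producer thread
+        sw     x5, 0(x10)     # write the data
+        fence  w, w           # order the earlier write before the later write
+        sw     x6, 0(x11)     # set the flag
+
+    # Consumer thread
+    wait:
+        lw     x7, 0(x11)     # read the flag
+        beq    x7, x0, wait   # spin until the flag is set
+        fence  r, r           # order the earlier read before the later read
+        lw     x8, 0(x10)     # read the data
+\end{verbatim}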
+
+The unused fields in the FENCE instruction, {\em imm[11:8]}, {\em rs1}, and
+{\em rd}, are reserved for finer-grain fences in future extensions.  For
+forward compatibility, base implementations shall ignore these fields, and
+standard software shall zero these fields.
+
+\begin{commentary}
+We chose a relaxed memory model to allow high performance from simple machine
+implementations.  A relaxed memory model is also most compatible with likely
+future coprocessor or accelerator extensions.  We separate out I/O ordering
+from memory R/W ordering to avoid unnecessary serialization within
+a device-driver thread and also to support alternative non-memory paths to
+control added coprocessors or I/O devices.  Simple implementations may
+additionally ignore the {\em predecessor} and {\em successor} fields and
+always execute a conservative fence on all operations.
+\end{commentary}
+
+\vspace{-0.4in}
+\begin{center}
+\begin{tabular}{M@{}R@{}S@{}R@{}O}
+\\
+\instbitrange{31}{20} &
+\instbitrange{19}{15} &
+\instbitrange{14}{12} &
+\instbitrange{11}{7} &
+\instbitrange{6}{0} \\
+\hline
+\multicolumn{1}{|c|}{imm[11:0]} &
+\multicolumn{1}{c|}{rs1} &
+\multicolumn{1}{c|}{funct3} &
+\multicolumn{1}{c|}{rd} &
+\multicolumn{1}{c|}{opcode} \\
+\hline
+12 & 5 & 3 & 5 & 7 \\
+0 & 0 & FENCE.I & 0 & MISC-MEM \\
+\end{tabular}
+\end{center}
+
+The FENCE.I instruction is used to synchronize the instruction and
+data streams.  RISC-V does not guarantee that stores to instruction
+memory will be made visible to instruction fetches on the same RISC-V
+thread until a FENCE.I instruction is executed.  A FENCE.I instruction
+only ensures that a subsequent instruction fetch on a RISC-V thread
+will see any previous data stores already visible to the same RISC-V
+thread.  FENCE.I does {\em not} ensure that other RISC-V threads'
+instruction fetches will observe the local thread's stores in a
+multiprocessor system. To make a store to instruction memory visible
+to all RISC-V threads, the writing thread has to execute a data FENCE
+before requesting that all remote RISC-V threads execute a FENCE.I.
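+
+A minimal single-thread sketch follows, assuming {\tt x6} holds the
+address of a location that is both writable and executable, and
+{\tt x5} holds a 32-bit instruction encoding produced by, for example,
+a JIT compiler.  The register choices are arbitrary, and the operand
+syntax may differ between assemblers.
+
+\begin{verbatim}
+    sw       x5, 0(x6)     # store the new instruction as data
+    fence.i                # synchronize local instruction fetch with the store
+    jalr     x0, 0(x6)     # jump to the newly written instruction
+\end{verbatim}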
+
+The unused fields in the FENCE.I instruction, {\em imm[11:0]}, {\em rs1}, and
+{\em rd}, are reserved for finer-grain fences in future extensions.  For
+forward compatibility, base implementations shall ignore these fields, and
+standard software shall zero these fields.
+
+\begin{commentary}
+The FENCE.I instruction was designed to support a wide variety of
+implementations.  A simple implementation can flush the local
+instruction cache and the instruction pipeline when the FENCE.I is
+executed.  A more complex implementation might snoop the instruction
+(data) cache on every data (instruction) cache miss, or use an
+inclusive unified private L2 cache to invalidate lines from the
+primary instruction cache when they are being written by a local store
+instruction.  If instruction and data caches are kept coherent in this
+way, then only the pipeline needs to be flushed at a FENCE.I.
+
+We considered but did not include a ``store instruction word''
+instruction (as in MAJC~\cite{majc}).  JIT compilers may generate a
+large trace of instructions before a single FENCE.I, and amortize any
+instruction cache snooping/invalidation overhead by writing translated
+instructions to memory regions that are known not to reside in the
+I-cache.
+\end{commentary}
+
+\section{Control and Status Register Instructions}
+
+SYSTEM instructions are used to access system functionality that might
+require privileged access and are encoded using the I-type instruction
+format.  These can be divided into two main classes: those that
+atomically read-modify-write control and status registers (CSRs), and
+all other potentially privileged instructions. CSR instructions are
+described in this section, with the two other user-level SYSTEM
+instructions described in the following section.
+
+\begin{commentary}
+The SYSTEM instructions are defined to allow simpler implementations
+to always trap to a single software trap handler.  More sophisticated
+implementations might execute more of each system instruction in
+hardware.
+\end{commentary}
+
+\subsubsection*{CSR Instructions}
+
+We define the full set of CSR instructions here, although in the standard
+user-level base ISA, only a handful of read-only counter CSRs are accessible.
+
+\vspace{-0.2in}
+\begin{center}
+\begin{tabular}{M@{}R@{}F@{}R@{}S}
+\\
+\instbitrange{31}{20} &
+\instbitrange{19}{15} &
+\instbitrange{14}{12} &
+\instbitrange{11}{7} &
+\instbitrange{6}{0} \\
+\hline
+\multicolumn{1}{|c|}{csr} &
+\multicolumn{1}{c|}{rs1} &
+\multicolumn{1}{c|}{funct3} &
+\multicolumn{1}{c|}{rd} &
+\multicolumn{1}{c|}{opcode} \\
+\hline
+12 & 5 & 3 & 5 & 7 \\
+source/dest  & source & CSRRW  & dest & SYSTEM \\
+source/dest  & source & CSRRS  & dest & SYSTEM \\
+source/dest  & source & CSRRC  & dest & SYSTEM \\
+source/dest  & zimm[4:0]   & CSRRWI & dest & SYSTEM \\
+source/dest  & zimm[4:0]   & CSRRSI & dest & SYSTEM \\
+source/dest  & zimm[4:0]   & CSRRCI & dest & SYSTEM \\
+\end{tabular}
+\end{center}
+
+The CSRRW (Atomic Read/Write CSR) instruction atomically swaps values
+in the CSRs and integer registers. CSRRW reads the old value of the
+CSR, zero-extends the value to XLEN bits, then writes it to integer
+register {\em rd}.  The initial value in {\em rs1} is written to the
+CSR.  If {\em rd}={\tt x0}, then the instruction shall not read the CSR
+and shall not cause any of the side effects that might occur on a CSR
+read.
+
+The CSRRS (Atomic Read and Set Bits in CSR) instruction reads the
+value of the CSR, zero-extends the value to XLEN bits, and writes it
+to integer register {\em rd}.  The initial value in integer register
+{\em rs1} is treated as a bit mask that specifies bit positions to be
+set in the CSR.  Any bit that is high in {\em rs1} will cause the
+corresponding bit to be set in the CSR, if that CSR bit is writable.
+Other bits in the CSR are unaffected (though CSRs might have side
+effects when written).
+
+The CSRRC (Atomic Read and Clear Bits in CSR) instruction reads the
+value of the CSR, zero-extends the value to XLEN bits, and writes it
+to integer register {\em rd}.  The initial value in integer register
+{\em rs1} is treated as a bit mask that specifies bit positions to be
+cleared in the CSR.  Any bit that is high in {\em rs1} will cause the
+corresponding bit to be cleared in the CSR, if that CSR bit is
+writable.  Other bits in the CSR are unaffected.
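+
+As a worked illustration, assume every bit of the CSR is writable, the
+old CSR value is {\tt 0b1010}, and {\em rs1} holds the mask
+{\tt 0b0110} (values chosen arbitrarily).  Both instructions write the
+old value {\tt 0b1010} to {\em rd}, and the new CSR values are:
+
+\begin{verbatim}
+    CSRRS:  new CSR = 0b1010 |  0b0110 = 0b1110
+    CSRRC:  new CSR = 0b1010 & ~0b0110 = 0b1000
+\end{verbatim}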
+
+For both CSRRS and CSRRC, if {\em rs1}={\tt x0}, then the instruction
+will not write to the CSR at all, and so shall not cause any of the
+side effects that might otherwise occur on a CSR write, such as
+raising illegal instruction exceptions on accesses to read-only CSRs.
+Note that if {\em rs1} specifies a register holding a zero value other
+than {\tt x0}, the instruction will still attempt to write the
+unmodified value back to the CSR and will cause any attendant side effects.
+
+The CSRRWI, CSRRSI, and CSRRCI variants are similar to CSRRW, CSRRS,
+and CSRRC respectively, except they update the CSR using an XLEN-bit
+value obtained by zero-extending a 5-bit immediate (zimm[4:0]) field
+encoded in the {\em rs1} field instead of a value from an integer
+register.  For CSRRSI and CSRRCI, if the zimm[4:0] field is zero, then
+these instructions will not write to the CSR, and shall not cause any
+of the side effects that might otherwise occur on a CSR write.  For
+CSRRWI, if {\em rd}={\tt x0}, then the instruction shall not read the
+CSR and shall not cause any of the side effects that might occur on a
+CSR read.
+
+Some CSRs, such as the instructions retired counter, {\tt instret}, may be
+modified as side effects of instruction execution.  In these cases, if a CSR
+access instruction reads a CSR, it reads the value prior to the execution of
+the instruction.  If a CSR access instruction writes a CSR, the update occurs
+after the execution of the instruction.  In particular, a value written to
+{\tt instret} by one instruction will be the value read by the following
+instruction (i.e., the increment of {\tt instret} caused by the first
+instruction retiring happens before the write of the new value).
+
+The assembler pseudo-instruction to read a CSR, CSRR {\em rd, csr}, is
+encoded as CSRRS {\em rd, csr, x0}.  The assembler pseudo-instruction
+to write a CSR, CSRW {\em csr, rs1}, is encoded as CSRRW {\em x0, csr,
+  rs1}, while CSRWI {\em csr, zimm}, is encoded as CSRRWI {\em x0,
+  csr, zimm}.
+
+Further assembler pseudo-instructions are defined to set and clear
+bits in the CSR when the old value is not required: CSRS/CSRC {\em
+  csr, rs1}; CSRSI/CSRCI {\em csr, zimm}.
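+
+The expansions can be summarized as follows, with each
+pseudo-instruction on the left and its encoding in the comment.  The
+CSR address {\tt 0x123} is a made-up placeholder, and the register and
+immediate operands are arbitrary.
+
+\begin{verbatim}
+    csrr   x5, cycle       # csrrs  x5, cycle, x0
+    csrw   0x123, x6       # csrrw  x0, 0x123, x6
+    csrwi  0x123, 3        # csrrwi x0, 0x123, 3
+    csrs   0x123, x6       # csrrs  x0, 0x123, x6
+    csrc   0x123, x6       # csrrc  x0, 0x123, x6
+    csrsi  0x123, 3        # csrrsi x0, 0x123, 3
+    csrci  0x123, 3        # csrrci x0, 0x123, 3
+\end{verbatim}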
+
+\subsubsection*{Timers and Counters}
+
+\vspace{-0.2in}
+\begin{center}
+\begin{tabular}{M@{}R@{}F@{}R@{}S}
+\\
+\instbitrange{31}{20} &
+\instbitrange{19}{15} &
+\instbitrange{14}{12} &
+\instbitrange{11}{7} &
+\instbitrange{6}{0} \\
+\hline
+\multicolumn{1}{|c|}{csr} &
+\multicolumn{1}{c|}{rs1} &
+\multicolumn{1}{c|}{funct3} &
+\multicolumn{1}{c|}{rd} &
+\multicolumn{1}{c|}{opcode} \\
+\hline
+12 & 5 & 3 & 5 & 7 \\
+RDCYCLE[H]   & 0 & CSRRS  & dest & SYSTEM \\
+RDTIME[H]    & 0 & CSRRS  & dest & SYSTEM \\
+RDINSTRET[H] & 0 & CSRRS  & dest & SYSTEM \\
+\end{tabular}
+\end{center}
+
+RV32I provides a number of 64-bit read-only user-level counters, which
+are mapped into the 12-bit CSR address space and accessed in 32-bit
+pieces using CSRRS instructions.
+
+The RDCYCLE pseudo-instruction reads the low XLEN bits of the {\tt
+  cycle} CSR, which holds a count of the number of clock cycles
+executed by the processor on which the hardware thread is running from
+an arbitrary start time in the past.  RDCYCLEH is
+an RV32I-only instruction that reads bits 63--32 of the same cycle
+counter.  The underlying 64-bit counter should never overflow in
+practice.  The rate at which the cycle counter advances will depend on
+the implementation and operating environment.  The execution
+environment should provide a means to determine the current rate
+(cycles/second) at which the cycle counter is incrementing.
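+
+One plausible use is measuring the cycles spent in a short code
+region.  The sketch below uses arbitrary registers and assumes that
+the region runs for fewer than $2^{32}$ cycles and that the thread
+stays on the same processor, so that the difference of the low 32 bits
+is meaningful on RV32.
+
+\begin{verbatim}
+    rdcycle  x5            # low XLEN bits of cycle before the region
+    # ... code being measured ...
+    rdcycle  x6            # low XLEN bits of cycle after the region
+    sub      x7, x6, x5    # elapsed cycles, modulo 2^XLEN
+\end{verbatim}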
+
+The RDTIME pseudo-instruction reads the low XLEN bits of the {\tt
+  time} CSR, which counts wall-clock real time that has passed from an
+arbitrary start time in the past.  RDTIMEH is an RV32I-only instruction
+that reads bits 63--32 of the same real-time counter.  The underlying 64-bit
+counter should never overflow in practice.  The execution environment
+should provide a means of determining the period of the real-time
+counter (seconds/tick).  The period must be constant.  The
+real-time clocks of all hardware threads in a single user application
+should be synchronized to within one tick of the real-time clock.  The
+environment should provide a means to determine the accuracy of the
+clock.
+
+The RDINSTRET pseudo-instruction reads the low XLEN bits of the {\tt
+  instret} CSR, which counts the number of instructions retired by
+this hardware thread from some arbitrary start point in the past.
+RDINSTRETH is an RV32I-only instruction that reads bits 63--32 of the
+same instruction counter.  The underlying 64-bit counter should
+never overflow in practice.
+
+The following code sequence will read a valid 64-bit cycle counter value into
+{\tt x3}:{\tt x2}, even if the low half of the counter overflows into the
+high half between the reads of the upper and lower halves.
+
+\begin{figure}[h!]
+\begin{center}
+\begin{verbatim}
+    again:
+        rdcycleh     x3
+        rdcycle      x2
+        rdcycleh     x4
+        bne          x3, x4, again
+\end{verbatim}
+\end{center}
+\caption{Sample code for reading the 64-bit cycle counter in RV32.}
+\label{critical}
+\end{figure}
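+
+The same retry pattern should apply to the other 64-bit counters.  As
+a sketch, with arbitrary register and label choices, the real-time
+counter could be read into {\tt x3}:{\tt x2} on RV32 as follows:
+
+\begin{verbatim}
+    again_time:
+        rdtimeh      x3
+        rdtime       x2
+        rdtimeh      x4
+        bne          x3, x4, again_time
+\end{verbatim}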
+
+\begin{commentary}
+We mandate these basic counters be provided in all implementations as
+they are essential for basic performance analysis, adaptive and
+dynamic optimization, and to allow an application to work with
+real-time streams.  Additional counters should be provided to help
+diagnose performance problems and these should be made accessible from
+user-level application code with low overhead.
+
+We required the counters be 64 bits wide, even on RV32, as otherwise
+it is very difficult for software to determine if values have
+overflowed.  For a low-end implementation, the upper 32 bits of each
+counter can be implemented using software counters incremented by a
+trap handler triggered by overflow of the lower 32 bits.  The sample
+code described above shows how the full 64-bit width value can be
+safely read using the individual 32-bit instructions.
+
+In some applications, it is important to be able to read multiple
+counters at the same instant in time.  When run under a multitasking
+environment, a user thread can suffer a context switch while
+attempting to read the counters.  One solution is for the user thread
+to read the real-time counter before and after reading the other
+counters to determine if a context switch occurred in the middle of the
+sequence, in which case the reads can be retried.  We considered
+adding output latches to allow a user thread to snapshot the counter
+values atomically, but this would increase the size of the user
+context, especially for implementations with a richer set of counters.
+\end{commentary}
+
+
+\section{Environment Call and Breakpoints}
+
+\vspace{-0.2in}
+\begin{center}
+\begin{tabular}{M@{}R@{}F@{}R@{}S}
+\\
+\instbitrange{31}{20} &
+\instbitrange{19}{15} &
+\instbitrange{14}{12} &
+\instbitrange{11}{7} &
+\instbitrange{6}{0} \\
+\hline
+\multicolumn{1}{|c|}{funct12} &
+\multicolumn{1}{c|}{rs1} &
+\multicolumn{1}{c|}{funct3} &
+\multicolumn{1}{c|}{rd} &
+\multicolumn{1}{c|}{opcode} \\
+\hline
+12 & 5 & 3 & 5 & 7 \\
+ECALL   & 0 & PRIV & 0 & SYSTEM \\
+EBREAK  & 0 & PRIV & 0 & SYSTEM \\
+\end{tabular}
+\end{center}
+
+The ECALL instruction is used to make a request to the supporting
+execution environment, which is usually an operating system.  The ABI
+for the system will define how parameters for the environment request
+are passed, but usually these will be in defined locations in the
+integer register file.
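+
+As a purely hypothetical illustration, suppose an ABI passes a request
+number in {\tt x17} and one argument in {\tt x10}, and returns a
+result in {\tt x10}.  Neither these conventions nor the request
+number 42 is defined by this chapter.
+
+\begin{verbatim}
+    addi   x17, x0, 42     # hypothetical request number
+    addi   x10, x0, 1      # hypothetical argument
+    ecall                  # transfer control to the execution environment
+    # on return, x10 is assumed to hold the result
+\end{verbatim}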
+
+The EBREAK instruction is used by debuggers to cause control to be
+transferred back to a debugging environment.
+
+\begin{commentary}
+ECALL and EBREAK were previously named SCALL and SBREAK.  The
+instructions have the same functionality and encoding, but were
+renamed to reflect that they can be used more generally than to call a
+supervisor-level operating system or debugger.
+\end{commentary}
+
-- 
cgit v1.1