author | elisa <elisa@riscv.org> | 2021-09-10 14:07:18 -0700 |
---|---|---|
committer | elisa <elisa@riscv.org> | 2021-09-10 14:07:18 -0700 |
commit | 64353b3717c9d387759c61d16d3984760a028046 (patch) | |
tree | 7dc4b3a1aff37dfd38ff921fa21113a64989a5aa /src/rv32.adoc | |
parent | e1a563505224b47129b0cbb5c46ee22f5e0acddb (diff) | |
adding converted adoc files
Diffstat (limited to 'src/rv32.adoc')
-rw-r--r-- | src/rv32.adoc | 988 |
1 files changed, 988 insertions, 0 deletions
diff --git a/src/rv32.adoc b/src/rv32.adoc new file mode 100644 index 0000000..b035fe3 --- /dev/null +++ b/src/rv32.adoc @@ -0,0 +1,988 @@ +[[rv32]] +== RV32I Base Integer Instruction Set, Version 2.1 + +This chapter describes the RV32I base integer instruction set. + +[TIP] +==== +RV32I was designed to be sufficient to form a compiler target and to +support modern operating system environments. The ISA was also designed +to reduce the hardware required in a minimal implementation. RV32I +contains 40 unique instructions, though a simple implementation might +cover the ECALL/EBREAK instructions with a single SYSTEM hardware +instruction that always traps and might be able to implement the FENCE +instruction as a NOP, reducing base instruction count to 38 total. RV32I +can emulate almost any other ISA extension (except the A extension, +which requires additional hardware support for atomicity). + +In practice, a hardware implementation including the machine-mode +privileged architecture will also require the 6 CSR instructions. + +Subsets of the base integer ISA might be useful for pedagogical +purposes, but the base has been defined such that there should be little +incentive to subset a real hardware implementation beyond omitting +support for misaligned memory accesses and treating all SYSTEM +instructions as a single trap. +==== + +[NOTE] +==== +The standard RISC-V assembly language syntax is documented in the +Assembly Programmer’s Manual cite:[riscv-asm-manual]. + +Most of the commentary for RV32I also applies to the RV64I base. +==== + +=== Programmers’ Model for Base Integer ISA + +<<img-gprs>> shows the unprivileged state for the base +integer ISA. For RV32I, the 32 `x` registers are each 32 bits wide, +i.e., XLEN=32. Register `x0` is hardwired with all bits equal to 0. 
+General purpose registers `x1`–`x31` hold values that various +instructions interpret as a collection of Boolean values, or as two’s +complement signed binary integers or unsigned binary integers. + +There is one additional unprivileged register: the program counter `pc` +holds the address of the current instruction. + +[[img-gprs]] +.RISC-V base unprivileged integer register state. +image::base-unpriv-reg-state.png[base,180,1000,align="center"] + +There is no dedicated stack pointer or subroutine return address link +register in the Base Integer ISA; the instruction encoding allows any +`x` register to be used for these purposes. However, the standard +software calling convention uses register `x1` to hold the return +address for a call, with register `x5` available as an alternate link +register. The standard calling convention uses register `x2` as the +stack pointer. + +Hardware might choose to accelerate function calls and returns that use +`x1` or `x5`. See the descriptions of the JAL and JALR instructions. + +The optional compressed 16-bit instruction format is designed around the +assumption that `x1` is the return address register and ` x2` is the +stack pointer. Software using other conventions will operate correctly +but may have greater code size. + +The number of available architectural registers can have large impacts +on code size, performance, and energy consumption. Although 16 registers +would arguably be sufficient for an integer ISA running compiled code, +it is impossible to encode a complete ISA with 16 registers in 16-bit +instructions using a 3-address format. Although a 2-address format would +be possible, it would increase instruction count and lower efficiency. +We wanted to avoid intermediate instruction sizes (such as Xtensa’s +24-bit instructions) to simplify base hardware implementations, and once +a 32-bit instruction size was adopted, it was straightforward to support +32 integer registers. 
A larger number of integer registers also helps +performance on high-performance code, where there can be extensive use +of loop unrolling, software pipelining, and cache tiling. + +For these reasons, we chose a conventional size of 32 integer registers +for RV32I. Dynamic register usage tends to be dominated by a few +frequently accessed registers, and regfile implementations can be +optimized to reduce access energy for the frequently accessed +registers . The optional compressed 16-bit instruction format mostly +only accesses 8 registers and hence can provide a dense instruction +encoding, while additional instruction-set extensions could support a +much larger register space (either flat or hierarchical) if desired. + +For resource-constrained embedded applications, we have defined the +RV32E subset, which only has 16 registers +(<<rv32e>>). + +=== Base Instruction Formats + +In the base RV32I ISA, there are four core instruction formats +(R/I/S/U), as shown in <<base_instr>>. All are a fixed 32 +bits in length and must be aligned on a four-byte boundary in memory. An +instruction-address-misaligned exception is generated on a taken branch +or unconditional jump if the target address is not four-byte aligned. +This exception is reported on the branch or jump instruction, not on the +target instruction. No instruction-address-misaligned exception is +generated for a conditional branch that is not taken. + +The alignment constraint for base ISA instructions is relaxed to a +two-byte boundary when instruction extensions with 16-bit lengths or +other odd multiples of 16-bit lengths are added (i.e., IALIGN=16). + +Instruction-address-misaligned exceptions are reported on the branch or +jump that would cause instruction misalignment to help debugging, and to +simplify hardware design for systems with IALIGN=32, where these are the +only places where misalignment can occur. + +The behavior upon decoding a reserved instruction is UNSPECIFIED. 
+ +Some platforms may require that opcodes reserved for standard use raise +an illegal-instruction exception. Other platforms may permit reserved +opcode space be used for non-conforming extensions. + +include::images/wavedrom/instruction_formats.adoc[] +[[base_instr]] +.Instruction formats +image::image_placeholder.png[] + +[NOTE] +==== +Each immediate subfield in <<base_instr>> above is labeled with the bit position (imm[x ]) in the immediate value being produced, rather than the bit position within the instruction’s immediate field as is usually done. +==== + +The RISC-V ISA keeps the source (_rs1_ and _rs2_) and destination (_rd_) +registers at the same position in all formats to simplify decoding. +Except for the 5-bit immediates used in CSR instructions +(<<csrinsts>>), immediates are always +sign-extended, and are generally packed towards the leftmost available +bits in the instruction and have been allocated to reduce hardware +complexity. In particular, the sign bit for all immediates is always in +bit 31 of the instruction to speed sign-extension circuitry. + +[NOTE] +==== +Decoding register specifiers is usually on the critical paths in +implementations, and so the instruction format was chosen to keep all +register specifiers at the same position in all formats at the expense +of having to move immediate bits across formats (a property shared with +RISC-IV aka. cite:[spur-jsscc1989]). + +In practice, most immediates are either small or require all XLEN bits. +We chose an asymmetric immediate split (12 bits in regular instructions +plus a special load-upper-immediate instruction with 20 bits) to +increase the opcode space available for regular instructions. + +Immediates are sign-extended because we did not observe a benefit to +using zero-extension for some immediates as in the MIPS ISA and wanted +to keep the ISA as simple as possible. 
+==== + +=== Immediate Encoding Variants + +There are a further two variants of the instruction formats (B/J) based +on the handling of immediates, as shown in <<baseinstformatsimm>>. + +include::images/wavedrom/immediate_variants.adoc[] +[[baseinstformatsimm]] +.RISC-V base instruction formats. +image::image_placeholder.png[] + +[NOTE] +==== +Each immediate subfield is labeled with the bit +position (imm[x ]) in the immediate value being produced, rather than the bit position within the +instruction’s immediate field as is usually done. +==== + +The only difference between the S and B formats is that the 12-bit +immediate field is used to encode branch offsets in multiples of 2 in +the B format. Instead of shifting all bits in the instruction-encoded +immediate left by one in hardware as is conventionally done, the middle +bits (imm[10:1]) and sign bit stay in fixed positions, while the lowest +bit in S format (inst[7]) encodes a high-order bit in B format. + +Similarly, the only difference between the U and J formats is that the +20-bit immediate is shifted left by 12 bits to form U immediates and by +1 bit to form J immediates. The location of instruction bits in the U +and J format immediates is chosen to maximize overlap with the other +formats and with each other. + +<<immtypes>> shows the immediates produced by +each of the base instruction formats, and is labeled to show which +instruction bit `(inst[_y_])` produces each bit of the immediate value. + +include::images/wavedrom/immediate.adoc[] +[[immtypes]] +.Immediate variants for I, S, B, U, and J +image::image_placeholder.png[] + + +Sign-extension is one of the most critical operations on immediates +(particularly for XLEN latexmath:[$>$]32), and in RISC-V the sign bit for +all immediates is always held in bit 31 of the instruction to allow +sign-extension to proceed in parallel with instruction decoding. 
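The immediate layouts described above can be sketched in Python as follows. This is a minimal illustration, not text from the specification: the helper names (`bits`, `sign_extend`, `imm_i`, and so on) are assumptions chosen for this example, and the bit positions follow the standard RV32I encoding in which the sign bit of every immediate comes from `inst[31]`.

```python
def bits(inst, hi, lo):
    """Return inst[hi:lo] as an unsigned integer (hypothetical helper)."""
    return (inst >> lo) & ((1 << (hi - lo + 1)) - 1)

def sign_extend(value, width):
    """Interpret the low `width` bits of value as a two's complement number."""
    sign = 1 << (width - 1)
    return (value & (sign - 1)) - (value & sign)

def imm_i(inst):
    # I-type: imm[11:0] = inst[31:20]
    return sign_extend(bits(inst, 31, 20), 12)

def imm_s(inst):
    # S-type: imm[11:5] = inst[31:25], imm[4:0] = inst[11:7]
    return sign_extend((bits(inst, 31, 25) << 5) | bits(inst, 11, 7), 12)

def imm_b(inst):
    # B-type: like S, but inst[7] supplies imm[11] and imm[0] is always 0
    raw = ((bits(inst, 31, 31) << 12) | (bits(inst, 7, 7) << 11)
           | (bits(inst, 30, 25) << 5) | (bits(inst, 11, 8) << 1))
    return sign_extend(raw, 13)

def imm_u(inst):
    # U-type: imm[31:12] = inst[31:12], low 12 bits are zero
    return sign_extend(bits(inst, 31, 12) << 12, 32)

def imm_j(inst):
    # J-type: imm[20|10:1|11|19:12] = inst[31|30:21|20|19:12], imm[0] is 0
    raw = ((bits(inst, 31, 31) << 20) | (bits(inst, 19, 12) << 12)
           | (bits(inst, 20, 20) << 11) | (bits(inst, 30, 21) << 1))
    return sign_extend(raw, 21)
```

For instance, `imm_i(0xFFF00093)` decodes the immediate of `addi x1, x0, -1`, and `imm_b(0xFE000EE3)` decodes the offset of `beq x0, x0, -4`. Note that in every format the sign bit is taken from the same instruction bit, which is the property the text attributes to fast sign-extension.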
+ +Although more complex implementations might have separate adders for +branch and jump calculations and so would not benefit from keeping the +location of immediate bits constant across types of instruction, we +wanted to reduce the hardware cost of the simplest implementations. By +rotating bits in the instruction encoding of B and J immediates instead +of using dynamic hardware muxes to multiply the immediate by 2, we +reduce instruction signal fanout and immediate mux costs by around a +factor of 2. The scrambled immediate encoding will add negligible time +to static or ahead-of-time compilation. For dynamic generation of +instructions, there is some small additional overhead, but the most +common short forward branches have straightforward immediate encodings. + +=== Integer Computational Instructions + +Most integer computational instructions operate on XLEN bits of values +held in the integer register file. Integer computational instructions +are either encoded as register-immediate operations using the I-type +format or as register-register operations using the R-type format. The +destination is register _rd_ for both register-immediate and +register-register instructions. No integer computational instructions +cause arithmetic exceptions. + +We did not include special instruction-set support for overflow checks +on integer arithmetic operations in the base instruction set, as many +overflow checks can be cheaply implemented using RISC-V branches. +Overflow checking for unsigned addition requires only a single +additional branch instruction after the addition: +`add t0, t1, t2; bltu t0, t1, overflow`. + +For signed addition, if one operand’s sign is known, overflow checking +requires only a single branch after the addition: +`addi t0, t1, +imm; blt t0, t1, overflow`. This covers the common case +of addition with an immediate operand. 
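The two single-branch overflow checks above can be modeled in Python to see why they work. This is a sketch under stated assumptions: registers are modeled as 32-bit patterns (XLEN=32), and the function names and the `to_signed` helper are hypothetical names introduced only for this illustration.

```python
MASK = 0xFFFFFFFF  # XLEN = 32; registers hold 32-bit patterns

def to_signed(x):
    """Reinterpret a 32-bit pattern as a two's complement signed integer."""
    return x - (1 << 32) if x & 0x80000000 else x

def add_unsigned_overflows(t1, t2):
    """Model of: add t0, t1, t2; bltu t0, t1, overflow"""
    t0 = (t1 + t2) & MASK      # ADD keeps only the low XLEN bits
    return t0 < t1             # a wrapped sum is smaller than either operand

def addi_positive_overflows(t1, imm):
    """Model of: addi t0, t1, +imm; blt t0, t1, overflow
    Assumes imm is a non-negative immediate (0..2047)."""
    t0 = (t1 + imm) & MASK
    # Adding a non-negative value can only make the signed result smaller
    # if the addition wrapped past the most positive value.
    return to_signed(t0) < to_signed(t1)
```

For example, `add_unsigned_overflows(0xFFFFFFFF, 1)` reports overflow because the sum wraps to 0, whereas `addi_positive_overflows(0xFFFFFFFF, 1)` does not, since the same bit pattern read as signed is -1 + 1 = 0.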
+ +For general signed addition, three additional instructions after the +addition are required, leveraging the observation that the sum should be +less than one of the operands if and only if the other operand is +negative. + +[source,txt] +.... + add t0, t1, t2 + slti t3, t2, 0 + slt t4, t0, t1 + bne t3, t4, overflow +.... + +In RV64I, checks of 32-bit signed additions can be optimized further by +comparing the results of ADD and ADDW on the operands. + +==== Integer Register-Immediate Instructions + +include::images/wavedrom/integer_computational.adoc[] +.Integer Computational Instructions +image::image_placeholder.png[] + +ADDI adds the sign-extended 12-bit immediate to register _rs1_. +Arithmetic overflow is ignored and the result is simply the low XLEN +bits of the result. ADDI _rd, rs1, 0_ is used to implement the MV _rd, +rs1_ assembler pseudoinstruction. + +SLTI (set less than immediate) places the value 1 in register _rd_ if +register _rs1_ is less than the sign-extended immediate when both are +treated as signed numbers, else 0 is written to _rd_. SLTIU is similar +but compares the values as unsigned numbers (i.e., the immediate is +first sign-extended to XLEN bits then treated as an unsigned number). +Note, SLTIU _rd, rs1, 1_ sets _rd_ to 1 if _rs1_ equals zero, otherwise +sets _rd_ to 0 (assembler pseudoinstruction SEQZ _rd, rs_). + +ANDI, ORI, XORI are logical operations that perform bitwise AND, OR, and +XOR on register _rs1_ and the sign-extended 12-bit immediate and place +the result in _rd_. Note, XORI _rd, rs1, -1_ performs a bitwise logical +inversion of register _rs1_ (assembler pseudoinstruction NOT _rd, rs_). + +include::images/wavedrom/int-comp-slli-srli-srai.adoc[] +[[int-comp-slli-srli-srai]] +.Integer register-immediate, SLLI, SRLI, SRAI +image::image_placeholder.png[] + + +Shifts by a constant are encoded as a specialization of the I-type +format. 
The operand to be shifted is in _rs1_, and the shift amount is +encoded in the lower 5 bits of the I-immediate field. The right shift +type is encoded in bit 30. SLLI is a logical left shift (zeros are +shifted into the lower bits); SRLI is a logical right shift (zeros are +shifted into the upper bits); and SRAI is an arithmetic right shift (the +original sign bit is copied into the vacated upper bits). + + +include::images/wavedrom/int-comp-lui-aiupc.adoc[] +[[int-comp-lui-aiupc]] +.Integer register-immediate, U-immediate +image::image_placeholder.png[] + + +LUI (load upper immediate) is used to build 32-bit constants and uses +the U-type format. LUI places the 32-bit U-immediate value into the +destination register _rd_, filling in the lowest 12 bits with zeros. + +AUIPC (add upper immediate to `pc`) is used to build `pc`-relative +addresses and uses the U-type format. AUIPC forms a 32-bit offset from +the U-immediate, filling in the lowest 12 bits with zeros, adds this +offset to the address of the AUIPC instruction, then places the result +in register _rd_. + +The assembly syntax for `lui` and `auipc` does not represent the lower +12 bits of the U-immediate, which are always zero. + +The AUIPC instruction supports two-instruction sequences to access +arbitrary offsets from the PC for both control-flow transfers and data +accesses. The combination of an AUIPC and the 12-bit immediate in a JALR +can transfer control to any 32-bit PC-relative address, while an AUIPC +plus the 12-bit immediate offset in regular load or store instructions +can access any 32-bit PC-relative data address. + +The current PC can be obtained by setting the U-immediate to 0. Although +a JAL +4 instruction could also be used to obtain the local PC (of the +instruction following the JAL), it might cause pipeline breaks in +simpler microarchitectures or pollute BTB structures in more complex +microarchitectures. 
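One detail of the two-instruction sequences above is worth making concrete: because the 12-bit immediate is sign-extended, the upper 20 bits must be rounded rather than truncated when an offset is split. The following Python sketch shows one way to compute the split and to check that it reconstructs the address. The function names `split_offset` and `auipc_plus_imm` are assumptions made for this example and do not appear in the specification.

```python
MASK = 0xFFFFFFFF  # 32-bit address arithmetic

def split_offset(offset):
    """Split a 32-bit PC-relative offset into (hi20, lo12).

    hi20 is the U-immediate field for AUIPC; lo12 is the signed 12-bit
    immediate for the following JALR, load, or store.
    """
    offset &= MASK
    # Add 0x800 before shifting so that a negative lo12 is compensated
    # by a hi20 that is one larger.
    hi20 = ((offset + 0x800) >> 12) & 0xFFFFF
    lo12 = offset & 0xFFF
    if lo12 >= 0x800:
        lo12 -= 0x1000         # sign-extend the 12-bit field
    return hi20, lo12

def auipc_plus_imm(pc, hi20, lo12):
    """Address formed by AUIPC followed by a 12-bit immediate offset."""
    upper = (pc + (hi20 << 12)) & MASK   # auipc rd, hi20
    return (upper + lo12) & MASK         # immediate added by the second instruction
```

With `pc = 0x1000` and an offset of `0x12345FFF`, the split yields `hi20 = 0x12346` and `lo12 = -1`, and the two steps add back up to `0x12346FFF`. For a JALR, the least-significant bit of the final address is additionally cleared, which this sketch does not model.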
+ +==== Integer Register-Register Operations + +RV32I defines several arithmetic R-type operations. All operations read +the _rs1_ and _rs2_ registers as source operands and write the result +into register _rd_. The _funct7_ and _funct3_ fields select the type of +operation. + +include::images/wavedrom/int_reg-reg.adoc[] +[[int-reg-reg]] +.Integer register-register +image::image_placeholder.png[] + +ADD performs the addition of _rs1_ and _rs2_. SUB performs the +subtraction of _rs2_ from _rs1_. Overflows are ignored and the low XLEN +bits of results are written to the destination _rd_. SLT and SLTU +perform signed and unsigned compares respectively, writing 1 to _rd_ if +latexmath:[$\mbox{\em rs1} < \mbox{\em + rs2}$], 0 otherwise. Note, SLTU _rd_, _x0_, _rs2_ sets _rd_ to 1 if +_rs2_ is not equal to zero, otherwise sets _rd_ to zero (assembler +pseudoinstruction SNEZ _rd, rs_). AND, OR, and XOR perform bitwise +logical operations. + +SLL, SRL, and SRA perform logical left, logical right, and arithmetic +right shifts on the value in register _rs1_ by the shift amount held in +the lower 5 bits of register _rs2_. + +==== NOP Instruction + +include::images/wavedrom/nop.adoc[] +[[nop]] +.NOP instructions +image::image_placeholder.png[] + +The NOP instruction does not change any architecturally visible state, +except for advancing the `pc` and incrementing any applicable +performance counters. NOP is encoded as ADDI _x0, x0, 0_. + +NOPs can be used to align code segments to microarchitecturally +significant address boundaries, or to leave space for inline code +modifications. Although there are many possible ways to encode a NOP, we +define a canonical NOP encoding to allow microarchitectural +optimizations as well as for more readable disassembly output. The other +NOP encodings are made available for HINT instructions +(Section <<rv32i-hints>>). 
+ +ADDI was chosen for the NOP encoding as this is most likely to take +fewest resources to execute across a range of systems (if not optimized +away in decode). In particular, the instruction only reads one register. +Also, an ADDI functional unit is more likely to be available in a +superscalar design as adds are the most common operation. In particular, +address-generation functional units can execute ADDI using the same +hardware needed for base+offset address calculations, while +register-register ADD or logical/shift operations require additional +hardware. + +=== Control Transfer Instructions + +RV32I provides two types of control transfer instructions: unconditional +jumps and conditional branches. Control transfer instructions in RV32I +do _not_ have architecturally visible delay slots. + +If an instruction access-fault or instruction page-fault exception +occurs on the target of a jump or taken branch, the exception is +reported on the target instruction, not on the jump or branch +instruction. + +==== Unconditional Jumps + +The jump and link (JAL) instruction uses the J-type format, where the +J-immediate encodes a signed offset in multiples of 2 bytes. The offset +is sign-extended and added to the address of the jump instruction to +form the jump target address. Jumps can therefore target a +latexmath:[$\pm$]1 MiB range. JAL stores the address of the instruction +following the jump (`pc`+4) into register _rd_. The standard software +calling convention uses `x1` as the return address register and `x5` as +an alternate link register. + +The alternate link register supports calling millicode routines (e.g., +those to save and restore registers in compressed code) while preserving +the regular return address register. The register `x5` was chosen as the +alternate link register as it maps to a temporary in the standard +calling convention, and has an encoding that is only one bit different +than the regular link register. 
+ +Plain unconditional jumps (assembler pseudoinstruction J) are encoded as +a JAL with _rd_=`x0`. + +include::images/wavedrom/ct-unconditional.adoc[] +[[ct-unconditional]] +.Plain unconditional jumps +image::image_placeholder.png[] + +The indirect jump instruction JALR (jump and link register) uses the +I-type encoding. The target address is obtained by adding the +sign-extended 12-bit I-immediate to the register _rs1_, then setting the +least-significant bit of the result to zero. The address of the +instruction following the jump (`pc`+4) is written to register _rd_. +Register `x0` can be used as the destination if the result is not +required. + +include::images/wavedrom/ct-unconditional-2.adoc[] +[[ct-unconditional-2]] +.Indirect unconditional jump +image::image_placeholder.png[] + +The unconditional jump instructions all use PC-relative addressing to +help support position-independent code. The JALR instruction was defined +to enable a two-instruction sequence to jump anywhere in a 32-bit +absolute address range. A LUI instruction can first load _rs1_ with the +upper 20 bits of a target address, then JALR can add in the lower bits. +Similarly, AUIPC then JALR can jump anywhere in a 32-bit `pc`-relative +address range. + +Note that the JALR instruction does not treat the 12-bit immediate as +multiples of 2 bytes, unlike the conditional branch instructions. This +avoids one more immediate format in hardware. In practice, most uses of +JALR will have either a zero immediate or be paired with a LUI or AUIPC, +so the slight reduction in range is not significant. + +Clearing the least-significant bit when calculating the JALR target +address both simplifies the hardware slightly and allows the low bit of +function pointers to be used to store auxiliary information. Although +there is potentially a slight loss of error checking in this case, in +practice jumps to an incorrect instruction address will usually quickly +raise an exception. 
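The JALR address calculation described above can be sketched as follows. This is an illustrative model only: the function name `jalr` and its argument names are assumptions for this example, registers are modeled as 32-bit patterns, and `imm12` is taken to be the already sign-extended immediate in the range -2048..2047.

```python
MASK = 0xFFFFFFFF  # 32-bit address arithmetic

def jalr(pc, rs1_value, imm12):
    """Return (target, link) for a JALR executed at address pc.

    target: rs1 + sign-extended immediate, with the least-significant bit cleared
    link:   address of the following instruction, written to rd
    """
    target = (rs1_value + imm12) & MASK & 0xFFFFFFFE
    link = (pc + 4) & MASK
    return target, link
```

For example, `jalr(0x1000, 0x2001, 4)` yields a target of `0x2004` even though the raw sum is odd, which illustrates how the low bit of a function pointer can carry auxiliary information without affecting the jump.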
+ +When used with a base _rs1_latexmath:[$=$]`x0`, JALR can be used to +implement a single instruction subroutine call to the lowest or highest +address region from anywhere in the address space, which could be used +to implement fast calls to a small runtime library. Alternatively, an +ABI could dedicate a general-purpose register to point to a library +elsewhere in the address space. + +The JAL and JALR instructions will generate an +instruction-address-misaligned exception if the target address is not +aligned to a four-byte boundary. + +Instruction-address-misaligned exceptions are not possible on machines +that support extensions with 16-bit aligned instructions, such as the +compressed instruction-set extension, C. + +Return-address prediction stacks are a common feature of +high-performance instruction-fetch units, but require accurate detection +of instructions used for procedure calls and returns to be effective. +For RISC-V, hints as to the instructions’ usage are encoded implicitly +via the register numbers used. A JAL instruction should push the return +address onto a return-address stack (RAS) only when _rd_ is `x1` or +`x5`. JALR instructions should push/pop a RAS as shown in <<rashints>>. + +[[rashints]] +.Return-address stack prediction hints encoded in the register operands +of a JALR instruction. +[cols="^,^,^,<",options="header",] +|=== +|_rd_ is `x1`/`x5` |_rs1_ is `x1`/`x5` |__rd__latexmath:[$=$]_rs1_ |RAS +action +|No |No |– |None + +|No |Yes |– |Pop + +|Yes |No |– |Push + +|Yes |Yes |No |Pop, then push + +|Yes |Yes |Yes |Push +|=== + +Some other ISAs added explicit hint bits to their indirect-jump +instructions to guide return-address stack manipulation. We use implicit +hinting tied to register numbers and the calling convention to reduce +the encoding space used for these hints. + +When two different link registers (`x1` and `x5`) are given as _rs1_ and +_rd_, then the RAS is both popped and pushed to support coroutines. 
If +_rs1_ and _rd_ are the same link register (either `x1` or `x5`), the RAS +is only pushed to enable macro-op fusion of the sequences: +`lui ra, imm20; jalr ra, imm12(ra)` and + `auipc ra, imm20; jalr ra, imm12(ra)` + +==== Conditional Branches + +All branch instructions use the B-type instruction format. The 12-bit +B-immediate encodes signed offsets in multiples of 2 bytes. The offset +is sign-extended and added to the address of the branch instruction to +give the target address. The conditional branch range is +latexmath:[$\pm$]4 KiB. + +include::images/wavedrom/ct-conditional.adoc[] +[[ct-conditional]] +.Conditional branches +image::image_placeholder.png[] + +Branch instructions compare two registers. BEQ and BNE take the branch +if registers _rs1_ and _rs2_ are equal or unequal respectively. BLT and +BLTU take the branch if _rs1_ is less than _rs2_, using signed and +unsigned comparison respectively. BGE and BGEU take the branch if _rs1_ +is greater than or equal to _rs2_, using signed and unsigned comparison +respectively. Note, BGT, BGTU, BLE, and BLEU can be synthesized by +reversing the operands to BLT, BLTU, BGE, and BGEU, respectively. + +Signed array bounds may be checked with a single BLTU instruction, since +any negative index will compare greater than any nonnegative bound. + +Software should be optimized such that the sequential code path is the +most common path, with less-frequently taken code paths placed out of +line. Software should also assume that backward branches will be +predicted taken and forward branches as not taken, at least the first +time they are encountered. Dynamic predictors should quickly learn any +predictable branch behavior. + +Unlike some other architectures, the RISC-V jump (JAL with _rd_=`x0`) +instruction should always be used for unconditional branches instead of +a conditional branch instruction with an always-true condition. 
RISC-V +jumps are also PC-relative and support a much wider offset range than +branches, and will not pollute conditional-branch prediction tables. + +The conditional branches were designed to include arithmetic comparison +operations between two registers (as also done in PA-RISC, Xtensa, and +MIPS R6), rather than use condition codes (x86, ARM, SPARC, PowerPC), or +to only compare one register against zero (Alpha, MIPS), or two +registers only for equality (MIPS). This design was motivated by the +observation that a combined compare-and-branch instruction fits into a +regular pipeline, avoids additional condition code state or use of a +temporary register, and reduces static code size and dynamic instruction +fetch traffic. Another point is that comparisons against zero require +non-trivial circuit delay (especially after the move to static logic in +advanced processes) and so are almost as expensive as arithmetic +magnitude compares. Another advantage of a fused compare-and-branch +instruction is that branches are observed earlier in the front-end +instruction stream, and so can be predicted earlier. There is perhaps an +advantage to a design with condition codes in the case where multiple +branches can be taken based on the same condition codes, but we believe +this case to be relatively rare. + +We considered but did not include static branch hints in the instruction +encoding. These can reduce the pressure on dynamic predictors, but +require more instruction encoding space and software profiling for best +results, and can result in poor performance if production runs do not +match profiling runs. + +We considered but did not include conditional moves or predicated +instructions, which can effectively replace unpredictable short forward +branches. Conditional moves are the simpler of the two, but are +difficult to use with conditional code that might cause exceptions +(memory accesses and floating-point operations). 
Predication adds +additional flag state to a system, additional instructions to set and +clear flags, and additional encoding overhead on every instruction. Both +conditional move and predicated instructions add complexity to +out-of-order microarchitectures, adding an implicit third source operand +due to the need to copy the original value of the destination +architectural register into the renamed destination physical register if +the predicate is false. Also, static compile-time decisions to use +predication instead of branches can result in lower performance on +inputs not included in the compiler training set, especially given that +unpredictable branches are rare, and becoming rarer as branch prediction +techniques improve. + +We note that various microarchitectural techniques exist to dynamically +convert unpredictable short forward branches into internally predicated +code to avoid the cost of flushing pipelines on a branch mispredict cite:[heil-tr1996], cite:[Klauser-1998], cite:[Kim-micro2005] and +have been implemented in commercial processors cite:[ibmpower7]. The simplest techniques +just reduce the penalty of recovering from a mispredicted short forward +branch by only flushing instructions in the branch shadow instead of the +entire fetch pipeline, or by fetching instructions from both sides using +wide instruction fetch or idle instruction fetch slots. More complex +techniques for out-of-order cores add internal predicates on +instructions in the branch shadow, with the internal predicate value +written by the branch instruction, allowing the branch and following +instructions to be executed speculatively and out-of-order with respect +to other code . + +The conditional branch instructions will generate an +instruction-address-misaligned exception if the target address is not +aligned to a four-byte boundary and the branch condition evaluates to +true. 
If the branch condition evaluates to false, the +instruction-address-misaligned exception will not be raised. + +Instruction-address-misaligned exceptions are not possible on machines +that support extensions with 16-bit aligned instructions, such as the +compressed instruction-set extension, C. + +[[ldst]] +=== Load and Store Instructions + +RV32I is a load-store architecture, where only load and store +instructions access memory and arithmetic instructions only operate on +CPU registers. RV32I provides a 32-bit address space that is +byte-addressed. The EEI will define what portions of the address space +are legal to access with which instructions (e.g., some addresses might +be read only, or support word access only). Loads with a destination of +`x0` must still raise any exceptions and cause any other side effects +even though the load value is discarded. + +The EEI will define whether the memory system is little-endian or +big-endian. In RISC-V, endianness is byte-address invariant. + +In a system for which endianness is byte-address invariant, the +following property holds: if a byte is stored to memory at some address +in some endianness, then a byte-sized load from that address in any +endianness returns the stored value. + +In a little-endian configuration, multibyte stores write the +least-significant register byte at the lowest memory byte address, +followed by the other register bytes in ascending order of their +significance. Loads similarly transfer the contents of the lesser memory +byte addresses to the less-significant register bytes. + +In a big-endian configuration, multibyte stores write the +most-significant register byte at the lowest memory byte address, +followed by the other register bytes in descending order of their +significance. Loads similarly transfer the contents of the greater +memory byte addresses to the less-significant register bytes. 
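The byte ordering rules above can be illustrated with a small Python model. This is a sketch, not part of the specification: memory is modeled as a `bytearray`, and the names `store_word`, `load_word`, and `load_byte` are assumptions introduced for this example. Python's built-in `int.to_bytes` and `int.from_bytes` stand in for the hardware byte lanes.

```python
def store_word(memory, addr, value, byteorder):
    """Model a 32-bit store; byteorder is 'little' or 'big'."""
    memory[addr:addr + 4] = (value & 0xFFFFFFFF).to_bytes(4, byteorder)

def load_word(memory, addr, byteorder):
    """Model a 32-bit load; byteorder is 'little' or 'big'."""
    return int.from_bytes(memory[addr:addr + 4], byteorder)

def load_byte(memory, addr):
    """A byte-sized load returns the same value in either endianness."""
    return memory[addr]

mem = bytearray(8)
store_word(mem, 0, 0x11223344, "little")  # least-significant byte at lowest address
store_word(mem, 4, 0x11223344, "big")     # most-significant byte at lowest address
```

After these two stores, `mem[0]` holds `0x44` and `mem[4]` holds `0x11`. A single byte written at some address reads back unchanged with `load_byte` regardless of the configured endianness, which is the byte-address invariance property stated above.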
+ +include::images/wavedrom/load_store.adoc[] +[[load-store,load and store]] +.Load and store instructions +image::image_placeholder.png[] + +Load and store instructions transfer a value between the registers and +memory. Loads are encoded in the I-type format and stores are S-type. +The effective address is obtained by adding register _rs1_ to the +sign-extended 12-bit offset. Loads copy a value from memory to register +_rd_. Stores copy the value in register _rs2_ to memory. + +The LW instruction loads a 32-bit value from memory into _rd_. LH loads +a 16-bit value from memory, then sign-extends to 32-bits before storing +in _rd_. LHU loads a 16-bit value from memory but then zero extends to +32-bits before storing in _rd_. LB and LBU are defined analogously for +8-bit values. The SW, SH, and SB instructions store 32-bit, 16-bit, and +8-bit values from the low bits of register _rs2_ to memory. + +Regardless of EEI, loads and stores whose effective addresses are +naturally aligned shall not raise an address-misaligned exception. Loads +and stores whose effective address is not naturally aligned to the +referenced datatype (i.e., the effective address is not divisible by the +size of the access in bytes) have behavior dependent on the EEI. + +An EEI may guarantee that misaligned loads and stores are fully +supported, and so the software running inside the execution environment +will never experience a contained or fatal address-misaligned trap. In +this case, the misaligned loads and stores can be handled in hardware, +or via an invisible trap into the execution environment implementation, +or possibly a combination of hardware and invisible trap depending on +address. + +An EEI may not guarantee misaligned loads and stores are handled +invisibly. In this case, loads and stores that are not naturally aligned +may either complete execution successfully or raise an exception. 
The +exception raised can be either an address-misaligned exception or an +access-fault exception. For a memory access that would otherwise be able +to complete except for the misalignment, an access-fault exception can +be raised instead of an address-misaligned exception if the misaligned +access should not be emulated, e.g., if accesses to the memory region +have side effects. When an EEI does not guarantee misaligned loads and +stores are handled invisibly, the EEI must define if exceptions caused +by address misalignment result in a contained trap (allowing software +running inside the execution environment to handle the trap) or a fatal +trap (terminating execution). + +Misaligned accesses are occasionally required when porting legacy code, +and help performance on applications when using any form of packed-SIMD +extension or handling externally packed data structures. Our rationale +for allowing EEIs to choose to support misaligned accesses via the +regular load and store instructions is to simplify the addition of +misaligned hardware support. One option would have been to disallow +misaligned accesses in the base ISAs and then provide some separate ISA +support for misaligned accesses, either special instructions to help +software handle misaligned accesses or a new hardware addressing mode +for misaligned accesses. Special instructions are difficult to use, +complicate the ISA, and often add new processor state (e.g., SPARC VIS +align address offset register) or complicate access to existing +processor state (e.g., MIPS LWL/LWR partial register writes). In +addition, for loop-oriented packed-SIMD code, the extra overhead when +operands are misaligned motivates software to provide multiple forms of +loop depending on operand alignment, which complicates code generation +and adds to loop startup overhead. 
New misaligned hardware addressing +modes take considerable space in the instruction encoding or require +very simplified addressing modes (e.g., register indirect only). + +Even when misaligned loads and stores complete successfully, these +accesses might run extremely slowly depending on the implementation +(e.g., when implemented via an invisible trap). Furthermore, whereas +naturally aligned loads and stores are guaranteed to execute atomically, +misaligned loads and stores might not, and hence require additional +synchronization to ensure atomicity. + +We do not mandate atomicity for misaligned accesses so execution +environment implementations can use an invisible machine trap and a +software handler to handle some or all misaligned accesses. If hardware +misaligned support is provided, software can exploit this by simply +using regular load and store instructions. Hardware can then +automatically optimize accesses depending on whether runtime addresses +are aligned. + +[[fence]] +=== Memory Ordering Instructions + +include::images/wavedrom/mem_order.adoc[] +[[mem-order]] +.Memory ordering instructions +image::image_placeholder.png[] + +The FENCE instruction is used to order device I/O and memory accesses as +viewed by other RISC-V harts and external devices or coprocessors. Any +combination of device input (I), device output (O), memory reads \(R), +and memory writes (W) may be ordered with respect to any combination of +the same. Informally, no other RISC-V hart or external device can +observe any operation in the _successor_ set following a FENCE before +any operation in the _predecessor_ set preceding the FENCE. +<<memorymodeL>> provides a precise description +of the RISC-V memory consistency model. + +The FENCE instruction also orders memory reads and writes made by the +hart as observed by memory reads and writes made by an external device. 
+However, FENCE does not order observations of events made by an external +device using any other signaling mechanism. + +A device might observe an access to a memory location via some external +communication mechanism, e.g., a memory-mapped control register that +drives an interrupt signal to an interrupt controller. This +communication is outside the scope of the FENCE ordering mechanism and +hence the FENCE instruction can provide no guarantee on when a change in +the interrupt signal is visible to the interrupt controller. Specific +devices might provide additional ordering guarantees to reduce software +overhead but those are outside the scope of the RISC-V memory model. + +The EEI will define what I/O operations are possible, and in particular, +which memory addresses when accessed by load and store instructions will +be treated and ordered as device input and device output operations +respectively rather than memory reads and writes. For example, +memory-mapped I/O devices will typically be accessed with uncached loads +and stores that are ordered using the I and O bits rather than the R and +W bits. Instruction-set extensions might also describe new I/O +instructions that will also be ordered using the I and O bits in a +FENCE. + +[[fm]] +.Fence mode encoding +|=== +|_fm_ field |Mnemonic |Meaning +|0000 |_none_ |Normal Fence +|1000 |TSO |With FENCE RW,RW: exclude write-to-read ordering; otherwise: _Reserved for future use._ +|_other_ | |_Reserved for future use._ +|=== + +The fence mode field _fm_ defines the semantics of the FENCE. A FENCE +with _fm_=0000 orders all memory operations in its predecessor set +before all memory operations in its successor set. + +The optional FENCE.TSO instruction is encoded as a FENCE instruction +with _fm_=1000, _predecessor_=RW, and _successor_=RW. 
FENCE.TSO orders +all load operations in its predecessor set before all memory operations +in its successor set, and all store operations in its predecessor set +before all store operations in its successor set. This leaves non-AMO +store operations in the FENCE.TSO’s predecessor set unordered with +non-AMO loads in its successor set. + +The FENCE.TSO encoding was added as an optional extension to the +original base FENCE instruction encoding. The base definition requires +that implementations ignore any set bits and treat the FENCE as global, +and so this is a backwards-compatible extension. + +The unused fields in the FENCE instructions--_rs1_ and _rd_--are reserved +for finer-grain fences in future extensions. For forward compatibility, +base implementations shall ignore these fields, and standard software +shall zero these fields. Likewise, many _fm_ and predecessor/successor +set settings in <<fm>> are also reserved for future use. +Base implementations shall treat all such reserved configurations as +normal fences with _fm_=0000, and standard software shall use only +non-reserved configurations. + +We chose a relaxed memory model to allow high performance from simple +machine implementations and from likely future coprocessor or +accelerator extensions. We separate out I/O ordering from memory R/W +ordering to avoid unnecessary serialization within a device-driver hart +and also to support alternative non-memory paths to control added +coprocessors or I/O devices. Simple implementations may additionally +ignore the _predecessor_ and _successor_ fields and always execute a +conservative fence on all operations. + +=== Environment Call and Breakpoints + +SYSTEM instructions are used to access system functionality that might +require privileged access and are encoded using the I-type instruction +format. 
These can be divided into two main classes: those that +atomically read-modify-write control and status registers (CSRs), and +all other potentially privileged instructions. CSR instructions are +described in <<csrinsts>>, and the base +unprivileged instructions are described in the following section. + + +[TIP] +==== +The SYSTEM instructions are defined to allow simpler implementations to +always trap to a single software trap handler. More sophisticated +implementations might execute more of each system instruction in +hardware. +==== + +include::images/wavedrom/env_call-breakpoint.adoc[] +[[env-call]] +.Environment call and breakpoint instructions +image::image_placeholder.png[] + +These two instructions cause a precise requested trap to the supporting +execution environment. + +The ECALL instruction is used to make a service request to the execution +environment. The EEI will define how parameters for the service request +are passed, but usually these will be in defined locations in the +integer register file. + +The EBREAK instruction is used to return control to a debugging +environment. + +ECALL and EBREAK were previously named SCALL and SBREAK. The +instructions have the same functionality and encoding, but were renamed +to reflect that they can be used more generally than to call a +supervisor-level operating system or debugger. + +EBREAK was primarily designed to be used by a debugger to cause +execution to stop and fall back into the debugger. EBREAK is also used +by the standard gcc compiler to mark code paths that should not be +executed. + +Another use of EBREAK is to support _semihosting_, where the execution +environment includes a debugger that can provide services over an +alternate system call interface built around the EBREAK instruction. +Because the RISC-V base ISAs do not provide more than one EBREAK +instruction, RISC-V semihosting uses a special sequence of instructions +to distinguish a semihosting EBREAK from a debugger-inserted EBREAK.
+ +.... + slli x0, x0, 0x1f # Entry NOP + ebreak # Break to debugger + srai x0, x0, 7 # NOP encoding the semihosting call number 7 +.... + +Note that these three instructions must be 32-bit-wide instructions, +i.e., they mustn’t be among the compressed 16-bit instructions described +in <<compressed>>. + +The shift NOP instructions are still considered available for use as +HINTs. + +Semihosting is a form of service call and would be more naturally +encoded as an ECALL using an existing ABI, but this would require the +debugger to be able to intercept ECALLs, which is a newer addition to +the debug standard. We intend to move over to using ECALLs with a +standard ABI, in which case, semihosting can share a service ABI with an +existing standard. + +We note that ARM processors have also moved to using SVC instead of BKPT +for semihosting calls in newer designs. + +=== HINT Instructions + +RV32I reserves a large encoding space for HINT instructions, which are +usually used to communicate performance hints to the microarchitecture. +Like the NOP instruction, HINTs do not change any architecturally +visible state, except for advancing the `pc` and any applicable +performance counters. Implementations are always allowed to ignore the +encoded hints. + +Most RV32I HINTs are encoded as integer computational instructions with +_rd_=`x0`. The other RV32I HINTs are encoded as FENCE instructions with +a null predecessor or successor set and with _fm_=0. + +These HINT encodings have been chosen so that simple implementations can +ignore HINTs altogether, and instead execute a HINT as a regular +instruction that happens not to mutate the architectural state. For +example, ADD is a HINT if the destination register is `x0`; the five-bit +_rs1_ and _rs2_ fields encode arguments to the HINT. However, a simple +implementation can simply execute the HINT as an ADD of _rs1_ and _rs2_ +that writes `x0`, which has no architecturally visible effect.
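The ADD case can be sketched in Python as follows. The field positions are those of the R-type format defined earlier in this chapter; the names `encode_add` and `decode_add_hint` are hypothetical and exist only for this example.

```python
OPCODE_OP = 0b0110011  # major opcode shared by register-register integer ops


def encode_add(rd, rs1, rs2):
    """Encode ADD rd, rs1, rs2 in the R-type format (funct7=0, funct3=0)."""
    return (rs2 << 20) | (rs1 << 15) | (rd << 7) | OPCODE_OP


def decode_add_hint(word):
    """Return the HINT arguments (rs1, rs2) if `word` is ADD with rd=x0, else None."""
    opcode = word & 0x7F
    rd = (word >> 7) & 0x1F
    funct3 = (word >> 12) & 0x7
    rs1 = (word >> 15) & 0x1F
    rs2 = (word >> 20) & 0x1F
    funct7 = (word >> 25) & 0x7F
    if opcode == OPCODE_OP and funct3 == 0 and funct7 == 0 and rd == 0:
        return rs1, rs2
    return None


hint_word = encode_add(0, 5, 6)      # add x0, x5, x6
plain_word = encode_add(1, 5, 6)     # add x1, x5, x6 writes x1, so not a HINT
```

A simple implementation needs no such decoder: executing `hint_word` as an ordinary ADD discards the sum because the destination is `x0`.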
+ +As another example, a FENCE instruction with a zero _pred_ field and a +zero _fm_ field is a HINT; the _succ_, _rs1_, and _rd_ fields encode the +arguments to the HINT. A simple implementation can simply execute the +HINT as a FENCE that orders the null set of prior memory accesses before +whichever subsequent memory accesses are encoded in the _succ_ field. +Since the intersection of the predecessor and successor sets is null, +the instruction imposes no memory orderings, and so it has no +architecturally visible effect. + +<<t-rv32i-hints>> lists all RV32I HINT code points. 91% of the +HINT space is reserved for standard HINTs. The remainder of the HINT +space is designated for custom HINTs: no standard HINTs will ever be +defined in this subspace. + +[TIP] +==== +We anticipate standard hints to eventually include memory-system spatial +and temporal locality hints, branch prediction hints, thread-scheduling +hints, security tags, and instrumentation flags for +simulation/emulation. +==== + +// this table isn't quite right and needs to be fixed--some rows might not have landed properly. It needs to be checked cell-by cell. + +[[t-rv32i-hints]] +.RV32I HINT instructions. 
+[cols="<,<,^,<",options="header"]
+|===
+|Instruction |Constraints |Code Points |Purpose
+
+|LUI |_rd_=`x0` |latexmath:[$2^{20}$] .18+<.>m|_Reserved for future standard use_
+
+|AUIPC |_rd_=`x0` |latexmath:[$2^{20}$]
+
+|ADDI |_rd_=`x0`, and either _rs1_latexmath:[$\neq$]`x0` or _imm_latexmath:[$\neq$]0 |latexmath:[$2^{17}-1$]
+
+|ANDI |_rd_=`x0` |latexmath:[$2^{17}$]
+
+|ORI |_rd_=`x0` |latexmath:[$2^{17}$]
+
+|XORI |_rd_=`x0` |latexmath:[$2^{17}$]
+
+|ADD |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SUB |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|AND |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|OR |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|XOR |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SLL |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SRL |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SRA |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|FENCE |_rd_=`x0`, _rs1_latexmath:[$\neq$]`x0`, _fm_=0, and either _pred_=0 or _succ_=0 |latexmath:[$2^{10}-63$]
+
+|FENCE |_rd_latexmath:[$\neq$]`x0`, _rs1_=`x0`, _fm_=0, and either _pred_=0 or _succ_=0 |latexmath:[$2^{10}-63$]
+
+|FENCE |_rd_=_rs1_=`x0`, _fm_=0, _pred_=0, _succ_latexmath:[$\neq$]0 |15
+
+|FENCE |_rd_=_rs1_=`x0`, _fm_=0, _pred_latexmath:[$\neq$]W, _succ_=0 |15
+
+|FENCE |_rd_=_rs1_=`x0`, _fm_=0, _pred_=W, _succ_=0 |1 |PAUSE
+
+|SLTI |_rd_=`x0` |latexmath:[$2^{17}$] .7+<.>m|_Designated for custom use_
+
+|SLTIU|_rd_=`x0` |latexmath:[$2^{17}$]
+
+|SLLI |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SRLI |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SRAI |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SLT |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SLTU |_rd_=`x0` |latexmath:[$2^{10}$]
+|=== |
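The code-point counts for the FENCE rows, and the 91% figure quoted before the table, can be cross-checked by brute force. The sketch below assumes 5-bit _rd_ and _rs1_ fields, 4-bit _pred_ and _succ_ fields with W as the least-significant bit, and _fm_=0 throughout; the function name `count_fence_hints` is hypothetical.

```python
from itertools import product

W = 0b0001  # assumption: W is the least-significant bit of the 4-bit pred/succ sets


def count_fence_hints():
    """Enumerate rd, rs1, pred, succ (with fm=0) and count FENCE HINT code points per table row."""
    counts = {"rd=x0, rs1!=x0": 0, "rd!=x0, rs1=x0": 0,
              "pred=0, succ!=0": 0, "pred!=W, succ=0": 0, "PAUSE": 0}
    for rd, rs1, pred, succ in product(range(32), range(32), range(16), range(16)):
        if pred != 0 and succ != 0:
            continue  # both sets non-empty: not one of the HINT rows
        if rd == 0 and rs1 != 0:
            counts["rd=x0, rs1!=x0"] += 1
        elif rd != 0 and rs1 == 0:
            counts["rd!=x0, rs1=x0"] += 1
        elif rd == 0 and rs1 == 0:
            if succ != 0:
                counts["pred=0, succ!=0"] += 1
            elif pred == W:
                counts["PAUSE"] += 1
            else:
                counts["pred!=W, succ=0"] += 1
    return counts


fence = count_fence_hints()
# Standard space: LUI, AUIPC, ADDI, ANDI/ORI/XORI, eight R-type ops, all FENCE rows.
standard = 2 * 2**20 + (2**17 - 1) + 3 * 2**17 + 8 * 2**10 + sum(fence.values())
# Custom space: SLTI, SLTIU, then SLLI, SRLI, SRAI, SLT, SLTU.
custom = 2 * 2**17 + 5 * 2**10
standard_share = 100 * standard / (standard + custom)
```

Under these assumptions the enumeration gives 2^10 - 63 = 961 for each of the first two FENCE rows, 15 for each of the next two, and 1 for PAUSE, and `standard_share` comes out at roughly 90.8, which rounds to the quoted 91%.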