From 64353b3717c9d387759c61d16d3984760a028046 Mon Sep 17 00:00:00 2001 From: elisa Date: Fri, 10 Sep 2021 14:07:18 -0700 Subject: adding converted adoc files --- src/a-st-ext.adoc | 452 +++++++++ src/b-st-ext.adoc | 17 + src/bibliography.adoc | 4 + src/c-st-ext.adoc | 1123 ++++++++++++++++++++++ src/colophon.adoc | 347 +++++++ src/counters-f.adoc | 167 ++++ src/counters.adoc | 189 ++++ src/d-st-ext.adoc | 255 +++++ src/extending.adoc | 364 +++++++ src/f-st-ext.adoc | 541 +++++++++++ src/history.adoc | 362 +++++++ src/index.adoc | 2 + src/intro-old.adoc | 733 ++++++++++++++ src/intro.adoc | 731 ++++++++++++++ src/j-st-ext.adoc | 10 + src/m-st-ext.adoc | 159 +++ src/mm-alloy.adoc | 239 +++++ src/mm-eplan.adoc | 1705 +++++++++++++++++++++++++++++++++ src/mm-formal.adoc | 1412 +++++++++++++++++++++++++++ src/mm-herd.adoc | 155 +++ src/naming.adoc | 184 ++++ src/p-st-ext.adoc | 10 + src/q-st-ext.adoc | 148 +++ src/riscv-isa-unpriv-conv-review.adoc | 139 +++ src/riscv-isa-unpriv.adoc | 135 +++ src/rv-32-64g.adoc | 491 ++++++++++ src/rv128.adoc | 79 ++ src/rv32.adoc | 988 +++++++++++++++++++ src/rv32e.adoc | 59 ++ src/rv64.adoc | 252 +++++ src/rvwmo.adoc | 831 ++++++++++++++++ src/test.adoc | 47 + src/v-st-ext.adoc | 11 + src/zam-st-ext.adoc | 52 + src/zicsr.adoc | 242 +++++ src/zifencei.adoc | 97 ++ src/zihintpause.adoc | 61 ++ src/ztso-st-ext.adoc | 38 + 38 files changed, 12831 insertions(+) create mode 100644 src/a-st-ext.adoc create mode 100644 src/b-st-ext.adoc create mode 100644 src/bibliography.adoc create mode 100644 src/c-st-ext.adoc create mode 100644 src/colophon.adoc create mode 100644 src/counters-f.adoc create mode 100644 src/counters.adoc create mode 100644 src/d-st-ext.adoc create mode 100644 src/extending.adoc create mode 100644 src/f-st-ext.adoc create mode 100644 src/history.adoc create mode 100644 src/index.adoc create mode 100644 src/intro-old.adoc create mode 100644 src/intro.adoc create mode 100644 src/j-st-ext.adoc create mode 100644 
src/m-st-ext.adoc create mode 100644 src/mm-alloy.adoc create mode 100644 src/mm-eplan.adoc create mode 100644 src/mm-formal.adoc create mode 100644 src/mm-herd.adoc create mode 100644 src/naming.adoc create mode 100644 src/p-st-ext.adoc create mode 100644 src/q-st-ext.adoc create mode 100644 src/riscv-isa-unpriv-conv-review.adoc create mode 100644 src/riscv-isa-unpriv.adoc create mode 100644 src/rv-32-64g.adoc create mode 100644 src/rv128.adoc create mode 100644 src/rv32.adoc create mode 100644 src/rv32e.adoc create mode 100644 src/rv64.adoc create mode 100644 src/rvwmo.adoc create mode 100644 src/test.adoc create mode 100644 src/v-st-ext.adoc create mode 100644 src/zam-st-ext.adoc create mode 100644 src/zicsr.adoc create mode 100644 src/zifencei.adoc create mode 100644 src/zihintpause.adoc create mode 100644 src/ztso-st-ext.adoc diff --git a/src/a-st-ext.adoc b/src/a-st-ext.adoc new file mode 100644 index 0000000..7468009 --- /dev/null +++ b/src/a-st-ext.adoc @@ -0,0 +1,452 @@ +[[atomics]] +== `A` Standard Extension for Atomic Instructions, Version 2.1 + +The standard atomic-instruction extension, named `A`, contains +instructions that atomically read-modify-write memory to support +synchronization between multiple RISC-V harts running in the same memory +space. The two forms of atomic instruction provided are +load-reserved/store-conditional instructions and atomic fetch-and-op +memory instructions. Both types of atomic instruction support various +memory consistency orderings including unordered, acquire, release, and +sequentially consistent semantics. These instructions allow RISC-V to +support the RCsc memory consistency model cite:[Gharachorloo90memoryconsistency]. + +[TIP] +==== +After much debate, the language community and architecture community +appear to have finally settled on release consistency as the standard +memory consistency model and so the RISC-V atomic support is built +around this model. 
+==== + +=== Specifying Ordering of Atomic Instructions + +The base RISC-V ISA has a relaxed memory model, with the FENCE +instruction used to impose additional ordering constraints. The address +space is divided by the execution environment into memory and I/O +domains, and the FENCE instruction provides options to order accesses to +one or both of these two address domains. + +To provide more efficient support for release consistency cite:[Gharachorloo90memoryconsistency], each atomic +instruction has two bits, _aq_ and _rl_, used to specify additional +memory ordering constraints as viewed by other RISC-V harts. The bits +order accesses to one of the two address domains, memory or I/O, +depending on which address domain the atomic instruction is accessing. +No ordering constraint is implied to accesses to the other domain, and a +FENCE instruction should be used to order across both domains. + +If both bits are clear, no additional ordering constraints are imposed +on the atomic memory operation. If only the _aq_ bit is set, the atomic +memory operation is treated as an _acquire_ access, i.e., no following +memory operations on this RISC-V hart can be observed to take place +before the acquire memory operation. If only the _rl_ bit is set, the +atomic memory operation is treated as a _release_ access, i.e., the +release memory operation cannot be observed to take place before any +earlier memory operations on this RISC-V hart. If both the _aq_ and _rl_ +bits are set, the atomic memory operation is _sequentially consistent_ +and cannot be observed to happen before any earlier memory operations or +after any later memory operations in the same RISC-V hart and to the +same address domain. 
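The four combinations of the _aq_ and _rl_ bits described above can be summarized in a short sketch. Python is used purely as illustration, and the bit positions are an assumption relative to this excerpt: they follow the published AMO encoding, in which _aq_ and _rl_ occupy the low two bits of the funct7 field (bits 26 and 25 of the instruction word).

```python
# Sketch of the aq/rl ordering bits described in the text.
# Assumption: aq and rl are the low two bits of funct7 (bits 26 and 25
# of the instruction word), per the published AMO encoding.

def ordering(aq: bool, rl: bool) -> str:
    """Map the aq/rl bits to the ordering strength the text describes."""
    if aq and rl:
        return "sequentially consistent"
    if aq:
        return "acquire"
    if rl:
        return "release"
    return "unordered"

def amo_funct7(funct5: int, aq: bool, rl: bool) -> int:
    """Pack a 5-bit AMO opcode plus the aq/rl bits into funct7."""
    return (funct5 << 2) | (int(aq) << 1) | int(rl)

assert ordering(True, False) == "acquire"
assert ordering(True, True) == "sequentially consistent"
# funct5 = 0b00001 is AMOSWAP in the published encoding (an assumption here).
assert amo_funct7(0b00001, aq=True, rl=False) == 0b0000110
```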
+ +[[lrsc]] +=== Load-Reserved/Store-Conditional Instructions + +include::images/wavedrom/load-reserve-st-conditional.adoc[] +[[load-res-st]] +.Load-Reserved/Store-Conditional Instructions +image::image_placeholder.png[] + +Complex atomic memory operations on a single memory word or doubleword +are performed with the load-reserved (LR) and store-conditional (SC) +instructions. LR.W loads a word from the address in _rs1_, places the +sign-extended value in _rd_, and registers a _reservation set_—a set of +bytes that subsumes the bytes in the addressed word. SC.W conditionally +writes a word in _rs2_ to the address in _rs1_: the SC.W succeeds only +if the reservation is still valid and the reservation set contains the +bytes being written. If the SC.W succeeds, the instruction writes the +word in _rs2_ to memory, and it writes zero to _rd_. If the SC.W fails, +the instruction does not write to memory, and it writes a nonzero value +to _rd_. Regardless of success or failure, executing an SC.W instruction +invalidates any reservation held by this hart. LR.D and SC.D act +analogously on doublewords and are only available on RV64. For RV64, +LR.W and SC.W sign-extend the value placed in _rd_. + +[TIP] +==== +Both compare-and-swap (CAS) and LR/SC can be used to build lock-free +data structures. After extensive discussion, we opted for LR/SC for +several reasons: 1) CAS suffers from the ABA problem, which LR/SC avoids +because it monitors all writes to the address rather than only checking +for changes in the data value; 2) CAS would also require a new integer +instruction format to support three source operands (address, compare +value, swap value) as well as a different memory system message format, +which would complicate microarchitectures; 3) Furthermore, to avoid the +ABA problem, other systems provide a double-wide CAS (DW-CAS) to allow a +counter to be tested and incremented along with a data word. 
This +requires reading five registers and writing two in one instruction, and +also a new larger memory system message type, further complicating +implementations; 4) LR/SC provides a more efficient implementation of +many primitives as it only requires one load as opposed to two with CAS +(one load before the CAS instruction to obtain a value for speculative +computation, then a second load as part of the CAS instruction to check +if value is unchanged before updating). + +The main disadvantage of LR/SC over CAS is livelock, which we avoid, +under certain circumstances, with an architected guarantee of eventual +forward progress as described below. Another concern is whether the +influence of the current x86 architecture, with its DW-CAS, will +complicate porting of synchronization libraries and other software that +assumes DW-CAS is the basic machine primitive. A possible mitigating +factor is the recent addition of transactional memory instructions to +x86, which might cause a move away from DW-CAS. + +More generally, a multi-word atomic primitive is desirable, but there is +still considerable debate about what form this should take, and +guaranteeing forward progress adds complexity to a system. +==== + +The failure code with value 1 is reserved to encode an unspecified +failure. Other failure codes are reserved at this time, and portable +software should only assume the failure code will be non-zero. + +[NOTE] +==== +We reserve a failure code of 1 to mean `unspecified` so that simple +implementations may return this value using the existing mux required +for the SLT/SLTU instructions. More specific failure codes might be +defined in future versions or extensions to the ISA. +==== + +For LR and SC, the A extension requires that the address held in _rs1_ +be naturally aligned to the size of the operand (i.e., eight-byte +aligned for 64-bit words and four-byte aligned for 32-bit words). 
If the +address is not naturally aligned, an address-misaligned exception or an +access-fault exception will be generated. The access-fault exception can +be generated for a memory access that would otherwise be able to +complete except for the misalignment, if the misaligned access should +not be emulated. +(((alignment, misaligned))) + +[NOTE] +==== +Emulating misaligned LR/SC sequences is impractical in most systems. + +Misaligned LR/SC sequences also raise the possibility of accessing +multiple reservation sets at once, which present definitions do not +provide for. +==== + +An implementation can register an arbitrarily large reservation set on +each LR, provided the reservation set includes all bytes of the +addressed data word or doubleword. An SC can only pair with the most +recent LR in program order. An SC may succeed only if no store from +another hart to the reservation set can be observed to have occurred +between the LR and the SC, and if there is no other SC between the LR +and itself in program order. An SC may succeed only if no write from a +device other than a hart to the bytes accessed by the LR instruction can +be observed to have occurred between the LR and SC. Note this LR might +have had a different effective address and data size, but reserved the +SC’s address as part of the reservation set. +(((alignment, reservation set))) + +[NOTE] +==== +Following this model, in systems with memory translation, an SC is +allowed to succeed if the earlier LR reserved the same location using an +alias with a different virtual address, but is also allowed to fail if +the virtual address is different. + + +To accommodate legacy devices and buses, writes from devices other than +RISC-V harts are only required to invalidate reservations when they +overlap the bytes accessed by the LR. These writes are not required to +invalidate the reservation when they access other bytes in the +reservation set. 
+==== + +The SC must fail if the address is not within the reservation set of the +most recent LR in program order. The SC must fail if a store to the +reservation set from another hart can be observed to occur between the +LR and SC. The SC must fail if a write from some other device to the +bytes accessed by the LR can be observed to occur between the LR and SC. +(If such a device writes the reservation set but does not write the +bytes accessed by the LR, the SC may or may not fail.) An SC must fail +if there is another SC (to any address) between the LR and the SC in +program order. The precise statement of the atomicity requirements for +successful LR/SC sequences is defined by the Atomicity Axiom in +<>. + + +[NOTE] +==== +The platform should provide a means to determine the size and shape of +the reservation set. + +A platform specification may constrain the size and shape of the +reservation set. For example, the Unix platform is expected to require +of main memory that the reservation set be of fixed size, contiguous, +naturally aligned, and no greater than the virtual memory page size. + +A store-conditional instruction to a scratch word of memory should be +used to forcibly invalidate any existing load reservation: + +* during a preemptive context switch, and +* if necessary when changing virtual to physical address mappings, such +as when migrating pages that might contain an active reservation. +==== + +The invalidation of a hart’s reservation when it executes an LR or SC +implies that a hart can only hold one reservation at a time, and that an +SC can only pair with the most recent LR, and LR with the next following +SC, in program order. This is a restriction to the Atomicity Axiom in +<> that ensures software runs correctly on +expected common implementations that operate in this manner. + +An SC instruction can never be observed by another RISC-V hart before +the LR instruction that established the reservation. 
The LR/SC sequence +can be given acquire semantics by setting the _aq_ bit on the LR +instruction. The LR/SC sequence can be given release semantics by +setting the _rl_ bit on the SC instruction. Setting the _aq_ bit on the +LR instruction, and setting both the _aq_ and the _rl_ bit on the SC +instruction makes the LR/SC sequence sequentially consistent, meaning +that it cannot be reordered with earlier or later memory operations from +the same hart. + +If neither bit is set on both LR and SC, the LR/SC sequence can be +observed to occur before or after surrounding memory operations from the +same RISC-V hart. This can be appropriate when the LR/SC sequence is +used to implement a parallel reduction operation. + +Software should not set the _rl_ bit on an LR instruction unless the +_aq_ bit is also set, nor should software set the _aq_ bit on an SC +instruction unless the _rl_ bit is also set. LR._rl_ and SC._aq_ +instructions are not guaranteed to provide any stronger ordering than +those with both bits clear, but may result in lower performance. + +.... + # a0 holds address of memory location + # a1 holds expected value + # a2 holds desired value + # a0 holds return value, 0 if successful, !0 otherwise + cas: + lr.w t0, (a0) # Load original value. + bne t0, a1, fail # Doesn't match, so fail. + sc.w t0, a2, (a0) # Try to update. + bnez t0, cas # Retry if store-conditional failed. + li a0, 0 # Set return to success. + jr ra # Return. + fail: + li a0, 1 # Set return to failure. + jr ra # Return. +.... + +LR/SC can be used to construct lock-free data structures. An example +using LR/SC to implement a compare-and-swap function is shown in +<>. If inlined, compare-and-swap functionality need +only take four instructions. 
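The contract of the example above (zero in _rd_ on success, nonzero on failure, every SC invalidating the reservation) can be mirrored in a toy single-hart model. All names below are illustrative, not part of the specification, and the model ignores other harts and ordering entirely.

```python
# A hypothetical single-hart model of the LR/SC rules above: LR registers
# a reservation, SC succeeds (returning 0) only while the reservation is
# valid, and any SC -- successful or not -- invalidates it.

class Hart:
    def __init__(self, mem):
        self.mem = mem          # dict: address -> word
        self.reserved = None    # address of the current reservation, if any

    def lr_w(self, addr):
        self.reserved = addr    # register a reservation and load the word
        return self.mem[addr]

    def sc_w(self, addr, value):
        ok = self.reserved == addr
        self.reserved = None    # every SC invalidates the reservation
        if ok:
            self.mem[addr] = value
            return 0            # success
        return 1                # nonzero failure code

def cas(hart, addr, expected, desired):
    """The compare-and-swap loop from the example, in functional form."""
    while True:
        if hart.lr_w(addr) != expected:
            return 1            # value didn't match: fail
        if hart.sc_w(addr, desired) == 0:
            return 0            # store-conditional succeeded

mem = {0x100: 5}
h = Hart(mem)
assert cas(h, 0x100, 5, 7) == 0 and mem[0x100] == 7  # matched, swapped
assert cas(h, 0x100, 5, 9) == 1 and mem[0x100] == 7  # mismatch, unchanged
```

In this toy model an SC can only fail through the pairing rules; a real implementation may also fail for microarchitectural reasons, which is why the retry loop exists.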
+ +[[lrscseq]] +=== Eventual Success of Store-Conditional Instructions + +The standard A extension defines _constrained LR/SC loops_, which have +the following properties: + +* The loop comprises only an LR/SC sequence and code to retry the +sequence in the case of failure, and must comprise at most 16 +instructions placed sequentially in memory. +* An LR/SC sequence begins with an LR instruction and ends with an SC +instruction. The dynamic code executed between the LR and SC +instructions can only contain instructions from the base `I` +instruction set, excluding loads, stores, backward jumps, taken backward +branches, JALR, FENCE, and SYSTEM instructions. If the `C` extension +is supported, then compressed forms of the aforementioned `I` +instructions are also permitted. +* The code to retry a failing LR/SC sequence can contain backwards jumps +and/or branches to repeat the LR/SC sequence, but otherwise has the same +constraint as the code between the LR and SC. +* The LR and SC addresses must lie within a memory region with the +_LR/SC eventuality_ property. The execution environment is responsible +for communicating which regions have this property. +* The SC must be to the same effective address and of the same data size +as the latest LR executed by the same hart. + +LR/SC sequences that do not lie within constrained LR/SC loops are +_unconstrained_. Unconstrained LR/SC sequences might succeed on some +attempts on some implementations, but might never succeed on other +implementations. + +[TIP] +==== +We restricted the length of LR/SC loops to fit within 64 contiguous +instruction bytes in the base ISA to avoid undue restrictions on +instruction cache and TLB size and associativity. Similarly, we +disallowed other loads and stores within the loops to avoid restrictions +on data-cache associativity in simple implementations that track the +reservation within a private cache. 
The restrictions on branches and +jumps limit the time that can be spent in the sequence. Floating-point +operations and integer multiply/divide were disallowed to simplify the +operating system’s emulation of these instructions on implementations +lacking appropriate hardware support. + +Software is not forbidden from using unconstrained LR/SC sequences, but +portable software must detect the case that the sequence repeatedly +fails, then fall back to an alternate code sequence that does not rely +on an unconstrained LR/SC sequence. Implementations are permitted to +unconditionally fail any unconstrained LR/SC sequence. +==== + +If a hart _H_ enters a constrained LR/SC loop, the execution environment +must guarantee that one of the following events eventually occurs: + +* _H_ or some other hart executes a successful SC to the reservation set +of the LR instruction in _H_’s constrained LR/SC loop. +* Some other hart executes an unconditional store or AMO instruction to +the reservation set of the LR instruction in _H_’s constrained LR/SC +loop, or some other device in the system writes to that reservation set. +* _H_ executes a branch or jump that exits the constrained LR/SC loop. +* _H_ traps. + +[NOTE] +==== +Note that these definitions permit an implementation to fail an SC +instruction occasionally for any reason, provided the aforementioned +guarantee is not violated. + +As a consequence of the eventuality guarantee, if some harts in an +execution environment are executing constrained LR/SC loops, and no +other harts or devices in the execution environment execute an +unconditional store or AMO to that reservation set, then at least one +hart will eventually exit its constrained LR/SC loop. By contrast, if +other harts or devices continue to write to that reservation set, it is +not guaranteed that any hart will exit its LR/SC loop. + +Loads and load-reserved instructions do not by themselves impede the +progress of other harts’ LR/SC sequences. 
We note this constraint +implies, among other things, that loads and load-reserved instructions +executed by other harts (possibly within the same core) cannot impede +LR/SC progress indefinitely. For example, cache evictions caused by +another hart sharing the cache cannot impede LR/SC progress +indefinitely. Typically, this implies reservations are tracked +independently of evictions from any shared cache. Similarly, cache +misses caused by speculative execution within a hart cannot impede LR/SC +progress indefinitely. + +These definitions admit the possibility that SC instructions may +spuriously fail for implementation reasons, provided progress is +eventually made. + +One advantage of CAS is that it guarantees that some hart eventually +makes progress, whereas an LR/SC atomic sequence could livelock +indefinitely on some systems. To avoid this concern, we added an +architectural guarantee of livelock freedom for certain LR/SC sequences. + +Earlier versions of this specification imposed a stronger +starvation-freedom guarantee. However, the weaker livelock-freedom +guarantee is sufficient to implement the C11 and C++11 languages, and is +substantially easier to provide in some microarchitectural styles. +==== + +[[amo]] +=== Atomic Memory Operations + +include::images/wavedrom/atomic-mem.adoc[] +[[atomic-mem]] +.Atomic memory operations +image::image_placeholder.png[] + +The atomic memory operation (AMO) instructions perform read-modify-write +operations for multiprocessor synchronization and are encoded with an +R-type instruction format. These AMO instructions atomically load a data +value from the address in _rs1_, place the value into register _rd_, +apply a binary operator to the loaded value and the original value in +_rs2_, then store the result back to the original address in _rs1_. AMOs +can either operate on 64-bit (RV64 only) or 32-bit words in memory. 
For +RV64, 32-bit AMOs always sign-extend the value placed in _rd_, and +ignore the upper 32 bits of the original value of _rs2_. + +For AMOs, the A extension requires that the address held in _rs1_ be +naturally aligned to the size of the operand (i.e., eight-byte aligned +for 64-bit words and four-byte aligned for 32-bit words). If the address +is not naturally aligned, an address-misaligned exception or an +access-fault exception will be generated. The access-fault exception can +be generated for a memory access that would otherwise be able to +complete except for the misalignment, if the misaligned access should +not be emulated. The ``Zam`` extension, described in +<>, relaxes this requirement and specifies the +semantics of misaligned AMOs. + +The operations supported are swap, integer add, bitwise AND, bitwise OR, +bitwise XOR, and signed and unsigned integer maximum and minimum. +Without ordering constraints, these AMOs can be used to implement +parallel reduction operations, where typically the return value would be +discarded by writing to `x0`. + +[NOTE] +==== +We provided fetch-and-op style atomic primitives as they scale to highly +parallel systems better than LR/SC or CAS. A simple microarchitecture +can implement AMOs using the LR/SC primitives, provided the +implementation can guarantee the AMO eventually completes. More complex +implementations might also implement AMOs at memory controllers, and can +optimize away fetching the original value when the destination is `x0`. + +The set of AMOs was chosen to support the C11/C++11 atomic memory +operations efficiently, and also to support parallel reductions in +memory. Another use of AMOs is to provide atomic updates to +memory-mapped device registers (e.g., setting, clearing, or toggling +bits) in the I/O space. +==== + +To help implement multiprocessor synchronization, the AMOs optionally +provide release consistency semantics. 
If the _aq_ bit is set, then no +later memory operations in this RISC-V hart can be observed to take +place before the AMO. Conversely, if the _rl_ bit is set, then other +RISC-V harts will not observe the AMO before memory accesses preceding +the AMO in this RISC-V hart. Setting both the _aq_ and the _rl_ bit on +an AMO makes the sequence sequentially consistent, meaning that it +cannot be reordered with earlier or later memory operations from the +same hart. + +[NOTE] +==== +The AMOs were designed to implement the C11 and C++11 memory models +efficiently. Although the FENCE R, RW instruction suffices to implement +the _acquire_ operation and FENCE RW, W suffices to implement _release_, +both imply additional unnecessary ordering as compared to AMOs with the +corresponding _aq_ or _rl_ bit set. +==== + +An example code sequence for a critical section guarded by a +test-and-test-and-set spinlock is shown in +<>. Note the first AMO is marked _aq_ to +order the lock acquisition before the critical section, and the second +AMO is marked _rl_ to order the critical section before the lock +relinquishment. + +[[critical]] +.... + li t0, 1 # Initialize swap value. + again: + lw t1, (a0) # Check if lock is held. + bnez t1, again # Retry if held. + amoswap.w.aq t1, t0, (a0) # Attempt to acquire lock. + bnez t1, again # Retry if held. + # ... + # Critical section. + # ... + amoswap.w.rl x0, x0, (a0) # Release lock by storing 0. +.... + +[NOTE] +==== +We recommend the use of the AMO Swap idiom shown above for both lock +acquire and release to simplify the implementation of speculative lock +elision cite:[Rajwar:2001:SLE]. +==== + +The instructions in the `A` extension can also be used to provide +sequentially consistent loads and stores. A sequentially consistent load +can be implemented as an LR with both _aq_ and _rl_ set. A sequentially +consistent store can be implemented as an AMOSWAP that writes the old +value to x0 and has both _aq_ and _rl_ set. 
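As a concrete reading of the spinlock idiom above, here is a sequential toy sketch, not the architecture: amoswap is reduced to "return the old value, store the new one", the aq/rl ordering bits have no visible effect in a single-threaded model and are omitted, and the function names are invented for illustration.

```python
# Illustrative sequential model of the test-and-test-and-set spinlock.

def amoswap_w(mem, addr, value):
    """AMO swap reduced to its sequential meaning: return old, store new."""
    old = mem[addr]
    mem[addr] = value
    return old

def try_acquire(mem, lock):
    if mem[lock] != 0:          # test: an ordinary load sees the lock held
        return False            # a real hart would loop back and retry here
    # test-and-set, corresponding to amoswap.w.aq in the example
    return amoswap_w(mem, lock, 1) == 0

def release(mem, lock):
    # corresponds to amoswap.w.rl with the result discarded (written to x0)
    amoswap_w(mem, lock, 0)

mem = {0x200: 0}
assert try_acquire(mem, 0x200)      # lock was free: acquired
assert not try_acquire(mem, 0x200)  # second attempt sees it held
release(mem, 0x200)
assert mem[0x200] == 0              # lock released
```

The initial plain-load test mirrors the `lw`/`bnez` pair in the assembly: it avoids performing an AMO (and the associated coherence traffic) while the lock is visibly held.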
+ diff --git a/src/b-st-ext.adoc b/src/b-st-ext.adoc new file mode 100644 index 0000000..d160e05 --- /dev/null +++ b/src/b-st-ext.adoc @@ -0,0 +1,17 @@ +[[bits]] +== `B` Standard Extension for Bit Manipulation, Version 0.0 + +This chapter is a placeholder for a future standard extension to provide +bit manipulation instructions, including instructions to insert, +extract, and test bit fields, and for rotations, funnel shifts, and bit +and byte permutations. + +Although bit manipulation instructions are very effective in some +application domains, particularly when dealing with externally packed +data structures, we excluded them from the base ISAs as they are not +useful in all domains and can add additional complexity or instruction +formats to supply all needed operands. + +We anticipate the B extension will be a brownfield encoding within the +base 30-bit instruction space. + diff --git a/src/bibliography.adoc b/src/bibliography.adoc new file mode 100644 index 0000000..4cc3eb7 --- /dev/null +++ b/src/bibliography.adoc @@ -0,0 +1,4 @@ +[bibliography] +== Bibliography + +bibliography::[] diff --git a/src/c-st-ext.adoc b/src/c-st-ext.adoc new file mode 100644 index 0000000..6675a5a --- /dev/null +++ b/src/c-st-ext.adoc @@ -0,0 +1,1123 @@ +[[compressed]] +== `C` Standard Extension for Compressed Instructions, Version 2.0 + +This chapter describes the current proposal for the RISC-V standard +compressed instruction-set extension, named `C`, which reduces static +and dynamic code size by adding short 16-bit instruction encodings for +common operations. The C extension can be added to any of the base ISAs +(RV32, RV64, RV128), and we use the generic term `RVC` to cover any of +these. Typically, 50%–60% of the RISC-V instructions in a program can be +replaced with RVC instructions, resulting in a 25%–30% code-size +reduction. 
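The figures just quoted are consistent with simple arithmetic: replacing a fraction f of a program's 32-bit instructions with 16-bit encodings shrinks the program by f/2. A one-line check (the helper name is invented for illustration):

```python
# Code-size arithmetic behind the 25%-30% figure quoted above.

def size_reduction(f):
    """Fractional code-size reduction when a fraction f of 32-bit
    instructions is replaced with 16-bit encodings."""
    return f * 16 / 32          # each replaced instruction halves in size

assert size_reduction(0.50) == 0.25          # 50% replaced -> 25% smaller
assert abs(size_reduction(0.60) - 0.30) < 1e-12  # 60% replaced -> 30% smaller
```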
+ +=== Overview + +RVC uses a simple compression scheme that offers shorter 16-bit versions +of common 32-bit RISC-V instructions when: + +* the immediate or address offset is small, or +* one of the registers is the zero register (`x0`), the ABI link register +(`x1`), or the ABI stack pointer (`x2`), or +* the destination register and the first source register are identical, or +* the registers used are the 8 most popular ones. + +The C extension is compatible with all other standard instruction +extensions. The C extension allows 16-bit instructions to be freely +intermixed with 32-bit instructions, with the latter now able to start +on any 16-bit boundary, i.e., IALIGN=16. With the addition of the C +extension, no instructions can raise instruction-address-misaligned +exceptions. + +Removing the 32-bit alignment constraint on the original 32-bit +instructions allows significantly greater code density. + +The compressed instruction encodings are mostly common across RV32C, +RV64C, and RV128C, but as shown in +Table <>, a few opcodes are used for +different purposes depending on base ISA. For example, the wider +address-space RV64C and RV128C variants require additional opcodes to +compress loads and stores of 64-bit integer values, while RV32C uses the +same opcodes to compress loads and stores of single-precision +floating-point values. Similarly, RV128C requires additional opcodes to +capture loads and stores of 128-bit integer values, while these same +opcodes are used for loads and stores of double-precision floating-point +values in RV32C and RV64C. If the C extension is implemented, the +appropriate compressed floating-point load and store instructions must +be provided whenever the relevant standard floating-point extension (F +and/or D) is also implemented. In addition, RV32C includes a compressed +jump and link instruction to compress short-range subroutine calls, +where the same opcode is used to compress ADDIW for RV64C and RV128C. 
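The opcode reuse just described can be captured as a small table. The specific instruction names (C.FLW/C.LD, C.FLD/C.LQ, C.JAL/C.ADDIW) come from the published C extension encoding tables and are an assumption relative to this excerpt, which describes the reuse only in prose.

```python
# One 16-bit opcode decodes differently depending on the base ISA.
# Keys name the opcode by its RV32C meaning; values give the meaning
# per base ISA (names assumed from the published encoding tables).
RVC_OPCODE_REUSE = {
    "C.FLW": {"RV32C": "C.FLW", "RV64C": "C.LD",    "RV128C": "C.LD"},
    "C.FLD": {"RV32C": "C.FLD", "RV64C": "C.FLD",   "RV128C": "C.LQ"},
    "C.JAL": {"RV32C": "C.JAL", "RV64C": "C.ADDIW", "RV128C": "C.ADDIW"},
}

def rvc_meaning(rv32c_name: str, base: str) -> str:
    """Look up what an opcode (named by its RV32C meaning) denotes on a
    given base ISA."""
    return RVC_OPCODE_REUSE[rv32c_name][base]

assert rvc_meaning("C.FLW", "RV64C") == "C.LD"    # 64-bit integer load
assert rvc_meaning("C.FLD", "RV128C") == "C.LQ"   # 128-bit integer load
assert rvc_meaning("C.JAL", "RV64C") == "C.ADDIW"
```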
+ +Double-precision loads and stores are a significant fraction of static +and dynamic instructions, hence the motivation to include them in the +RV32C and RV64C encoding. + +Although single-precision loads and stores are not a significant source +of static or dynamic compression for benchmarks compiled for the +currently supported ABIs, for microcontrollers that only provide +hardware single-precision floating-point units and have an ABI that only +supports single-precision floating-point numbers, the single-precision +loads and stores will be used at least as frequently as double-precision +loads and stores in the measured benchmarks. Hence, the motivation to +provide compressed support for these in RV32C. + +Short-range subroutine calls are more likely in small binaries for +microcontrollers, hence the motivation to include these in RV32C. + +Although reusing opcodes for different purposes for different base ISAs +adds some complexity to documentation, the impact on implementation +complexity is small even for designs that support multiple base ISAs. +The compressed floating-point load and store variants use the same +instruction format with the same register specifiers as the wider +integer loads and stores. + +RVC was designed under the constraint that each RVC instruction expands +into a single 32-bit instruction in either the base ISA (RV32I/E, RV64I, +or RV128I) or the F and D standard extensions where present. Adopting +this constraint has two main benefits: + +* Hardware designs can simply expand RVC instructions during decode, +simplifying verification and minimizing modifications to existing +microarchitectures. +* Compilers can be unaware of the RVC extension and leave code compression +to the assembler and linker, although a compression-aware compiler will +generally be able to produce better results. 
+ +We felt the multiple complexity reductions of a simple one-one mapping +between C and base IFD instructions far outweighed the potential gains +of a slightly denser encoding that added additional instructions only +supported in the C extension, or that allowed encoding of multiple IFD +instructions in one C instruction. + +It is important to note that the C extension is not designed to be a +stand-alone ISA, and is meant to be used alongside a base ISA. + +Variable-length instruction sets have long been used to improve code +density. For example, the IBM Stretch cite:[stretch], developed in the late 1950s, had +an ISA with 32-bit and 64-bit instructions, where some of the 32-bit +instructions were compressed versions of the full 64-bit instructions. +Stretch also employed the concept of limiting the set of registers that +were addressable in some of the shorter instruction formats, with short +branch instructions that could only refer to one of the index registers. +The later IBM 360 architecture cite:[ibm360] supported a simple variable-length +instruction encoding with 16-bit, 32-bit, or 48-bit instruction formats. + +In 1963, CDC introduced the Cray-designed CDC 6600 cite:[cdc6600], a precursor to RISC +architectures, that introduced a register-rich load-store architecture +with instructions of two lengths, 15-bits and 30-bits. The later Cray-1 +design used a very similar instruction format, with 16-bit and 32-bit +instruction lengths. + +The initial RISC ISAs from the 1980s all picked performance over code +size, which was reasonable for a workstation environment, but not for +embedded systems. Hence, both ARM and MIPS subsequently made versions of +the ISAs that offered smaller code size by offering an alternative +16-bit wide instruction set instead of the standard 32-bit wide +instructions. The compressed RISC ISAs reduced code size relative to +their starting points by about 25–30%, yielding code that was +significantly _smaller_ than 80x86. 
This result surprised some, as their
intuition was that the variable-length CISC ISA should be smaller than
RISC ISAs that offered only 16-bit and 32-bit formats.

Since the original RISC ISAs did not leave sufficient opcode space free
to include these unplanned compressed instructions, they were instead
developed as complete new ISAs. This meant compilers needed different
code generators for the separate compressed ISAs. The first compressed
RISC ISA extensions (e.g., ARM Thumb and MIPS16) used only a fixed
16-bit instruction size, which gave good reductions in static code size
but caused an increase in dynamic instruction count, which led to lower
performance compared to the original fixed-width 32-bit instruction
size. This led to the development of a second generation of compressed
RISC ISA designs with mixed 16-bit and 32-bit instruction lengths (e.g.,
ARM Thumb2, microMIPS, PowerPC VLE), so that performance was similar to
pure 32-bit instructions but with significant code size savings.
Unfortunately, these different generations of compressed ISAs are
incompatible with each other and with the original uncompressed ISA,
leading to significant complexity in documentation, implementations, and
software tools support.

Of the commonly used 64-bit ISAs, only PowerPC and microMIPS currently
support a compressed instruction format. It is surprising that the most
popular 64-bit ISA for mobile platforms (ARM v8) does not include a
compressed instruction format, given that static code size and dynamic
instruction fetch bandwidth are important metrics. Although static code
size is not a major concern in larger systems, instruction fetch
bandwidth can be a major bottleneck in servers running commercial
workloads, which often have a large instruction working set.
Benefiting from 25 years of hindsight, RISC-V was designed to support
compressed instructions from the outset, leaving enough opcode space for
RVC to be added as a simple extension on top of the base ISA (along with
many other extensions). The philosophy of RVC is to reduce code size for
embedded applications _and_ to improve performance and energy-efficiency
for all applications due to fewer misses in the instruction cache.
Waterman shows that RVC fetches 25%-30% fewer instruction bits, which
reduces instruction cache misses by 20%-25%, or roughly the same
performance impact as doubling the instruction cache size.

=== Compressed Instruction Formats
(((compressed, formats)))

Table <<rvc-form>> shows the nine compressed instruction
formats. CR, CI, and CSS can use any of the 32 RVI registers, but CIW,
CL, CS, CA, and CB are limited to just 8 of them.
Table <<registers>> lists these popular registers, which
correspond to registers `x8` to `x15`. Note that there is a separate
version of load and store instructions that use the stack pointer as the
base address register, since saving to and restoring from the stack are
so prevalent, and that they use the CI and CSS formats to allow access
to all 32 data registers. CIW supplies an 8-bit immediate for the
ADDI4SPN instruction.

The RISC-V ABI was changed to make the frequently used registers map to
registers `x8`–`x15`. This simplifies the decompression decoder by
having a contiguous naturally aligned set of register numbers, and is
also compatible with the RV32E base ISA, which only has 16 integer
registers.
(((compressed, loads and stores)))

Compressed register-based floating-point loads and stores also use the
CL and CS formats respectively, with the eight registers mapping to `f8`
to `f15`.
(((calling convention, standard)))

The standard RISC-V calling convention maps the most frequently used
floating-point registers to registers `f8` to `f15`, which allows the
same register decompression decoding as for integer register numbers.
(((register source specifiers, c-ext)))

The formats were designed to keep bits for the two register source
specifiers in the same place in all instructions, while the destination
register field can move. When the full 5-bit destination register
specifier is present, it is in the same place as in the 32-bit RISC-V
encoding. Where immediates are sign-extended, the sign-extension is
always from bit 12. Immediate fields have been scrambled, as in the base
specification, to reduce the number of immediate muxes required.

The immediate fields are scrambled in the instruction formats instead of
in sequential order so that as many bits as possible are in the same
position in every instruction, thereby simplifying implementations.

For many RVC instructions, zero-valued immediates are disallowed and
`x0` is not a valid 5-bit register specifier. These restrictions free up
encoding space for other instructions requiring fewer operand bits.

include::images/wavedrom/cr-register.adoc[]
(((compressed, 16-bit)))

[[rvc-form]]
.Compressed 16-bit RVC instruction formats
[cols="<2,<3,<10",options="header"]
|===
|Format |Meaning |Fields (bits 15 ... 0)

|CR |Register |funct4 (15:12), rd/rs1 (11:7), rs2 (6:2), op (1:0)
|CI |Immediate |funct3 (15:13), imm (12), rd/rs1 (11:7), imm (6:2), op (1:0)
|CSS |Stack-relative Store |funct3 (15:13), imm (12:7), rs2 (6:2), op (1:0)
|CIW |Wide Immediate |funct3 (15:13), imm (12:5), rd' (4:2), op (1:0)
|CL |Load |funct3 (15:13), imm (12:10), rs1' (9:7), imm (6:5), rd' (4:2), op (1:0)
|CS |Store |funct3 (15:13), imm (12:10), rs1' (9:7), imm (6:5), rs2' (4:2), op (1:0)
|CA |Arithmetic |funct6 (15:10), rd'/rs1' (9:7), funct2 (6:5), rs2' (4:2), op (1:0)
|CB |Branch/Arithmetic |funct3 (15:13), offset (12:10), rd'/rs1' (9:7), offset (6:2), op (1:0)
|CJ |Jump |funct3 (15:13), jump target (12:2), op (1:0)
|===

[[registers]]
.Registers specified by the three-bit _rs1'_, _rs2'_, and _rd'_ fields
of the CIW, CL, CS, CA, and CB formats.
[cols="<,^,^,^,^,^,^,^,^",]
|===
|RVC Register Number |000 |001 |010 |011 |100 |101 |110 |111

|Integer Register Number |`x8` |`x9` |`x10` |`x11` |`x12` |`x13` |`x14` |`x15`

|Integer Register ABI Name |`s0` |`s1` |`a0` |`a1` |`a2` |`a3` |`a4` |`a5`

|Floating-Point Register Number |`f8` |`f9` |`f10` |`f11` |`f12` |`f13` |`f14` |`f15`

|Floating-Point Register ABI Name |`fs0` |`fs1` |`fa0` |`fa1` |`fa2` |`fa3` |`fa4` |`fa5`
|===


=== Load and Store Instructions

To increase the reach of 16-bit instructions, data-transfer instructions
use zero-extended immediates that are scaled by the size of the data in
bytes: latexmath:[$\times$]4 for words, latexmath:[$\times$]8 for double
words, and latexmath:[$\times$]16 for quad words.

RVC provides two variants of loads and stores. One uses the ABI stack
pointer, `x2`, as the base address and can target any data register. The
other can reference one of 8 base address registers and one of 8 data
registers.
==== Stack-Pointer-Based Loads and Stores

[[c-sp-load-store]]
.Stack-Pointer-Based Loads and Stores
include::images/wavedrom/c-sp-load-store.adoc[]

These instructions use the CI format.

C.LWSP loads a 32-bit value from memory into register _rd_. It computes
an effective address by adding the _zero_-extended offset, scaled by 4,
to the stack pointer, `x2`. It expands to `lw rd, offset(x2)`. C.LWSP is
only valid when latexmath:[$\textit{rd}{\neq}\texttt{x0}$]; the code
points with latexmath:[$\textit{rd}{=}\texttt{x0}$] are reserved.
(((operations, memory)))

C.LDSP is an RV64C/RV128C-only instruction that loads a 64-bit value
from memory into register _rd_. It computes its effective address by
adding the zero-extended offset, scaled by 8, to the stack pointer,
`x2`. It expands to `ld rd, offset(x2)`. C.LDSP is only valid when
latexmath:[$\textit{rd}{\neq}\texttt{x0}$]; the code points with
latexmath:[$\textit{rd}{=}\texttt{x0}$] are reserved.

C.LQSP is an RV128C-only instruction that loads a 128-bit value from
memory into register _rd_. It computes its effective address by adding
the zero-extended offset, scaled by 16, to the stack pointer, `x2`. It
expands to `lq rd, offset(x2)`. C.LQSP is only valid when
latexmath:[$\textit{rd}{\neq}\texttt{x0}$]; the code points with
latexmath:[$\textit{rd}{=}\texttt{x0}$] are reserved.

C.FLWSP is an RV32FC-only instruction that loads a single-precision
floating-point value from memory into floating-point register _rd_. It
computes its effective address by adding the _zero_-extended offset,
scaled by 4, to the stack pointer, `x2`. It expands to
`flw rd, offset(x2)`.

C.FLDSP is an RV32DC/RV64DC-only instruction that loads a
double-precision floating-point value from memory into floating-point
register _rd_. It computes its effective address by adding the
_zero_-extended offset, scaled by 8, to the stack pointer, `x2`.
It
expands to `fld rd, offset(x2)`.

[[sp-base-ls-2]]
.Stack-Pointer-Based Loads and Stores, CSS format
include::images/wavedrom/sp-base-ls-2.adoc[]

These instructions use the CSS format.

C.SWSP stores a 32-bit value in register _rs2_ to memory. It computes an
effective address by adding the _zero_-extended offset, scaled by 4, to
the stack pointer, `x2`. It expands to `sw rs2, offset(x2)`.

C.SDSP is an RV64C/RV128C-only instruction that stores a 64-bit value in
register _rs2_ to memory. It computes an effective address by adding the
_zero_-extended offset, scaled by 8, to the stack pointer, `x2`. It
expands to `sd rs2, offset(x2)`.

C.SQSP is an RV128C-only instruction that stores a 128-bit value in
register _rs2_ to memory. It computes an effective address by adding the
_zero_-extended offset, scaled by 16, to the stack pointer, `x2`. It
expands to `sq rs2, offset(x2)`.

C.FSWSP is an RV32FC-only instruction that stores a single-precision
floating-point value in floating-point register _rs2_ to memory. It
computes an effective address by adding the _zero_-extended offset,
scaled by 4, to the stack pointer, `x2`. It expands to
`fsw rs2, offset(x2)`.

C.FSDSP is an RV32DC/RV64DC-only instruction that stores a
double-precision floating-point value in floating-point register _rs2_
to memory. It computes an effective address by adding the
_zero_-extended offset, scaled by 8, to the stack pointer, `x2`. It
expands to `fsd rs2, offset(x2)`.

Register save/restore code at function entry/exit represents a
significant portion of static code size. The stack-pointer-based
compressed loads and stores in RVC are effective at reducing the
save/restore static code size by a factor of 2 while improving
performance by reducing dynamic instruction bandwidth.

A common mechanism used in other ISAs to further reduce save/restore
code size is load-multiple and store-multiple instructions.
We
considered adopting these for RISC-V but noted the following drawbacks
to these instructions:

* These instructions complicate processor implementations.
* For virtual memory systems, some data accesses could be resident in
physical memory and some could not, which requires a new restart
mechanism for partially executed instructions.
* Unlike the rest of the RVC instructions, there is no IFD equivalent to
Load Multiple and Store Multiple.
* Unlike the rest of the RVC instructions, the compiler would have to be
aware of these instructions to both generate the instructions and to
allocate registers in an order to maximize the chances of them being
saved and restored, since they would be saved and restored in sequential
order.
* Simple microarchitectural implementations will constrain how other
instructions can be scheduled around the load and store multiple
instructions, leading to a potential performance loss.
* The desire for sequential register allocation might conflict with the
featured registers selected for the CIW, CL, CS, CA, and CB formats.

Furthermore, much of the gains can be realized in software by replacing
prologue and epilogue code with subroutine calls to common prologue and
epilogue code, a technique described in Section 5.6 of .

While reasonable architects might come to different conclusions, we
decided to omit load and store multiple and instead use the
software-only approach of calling save/restore millicode routines to
attain the greatest code size reduction.
==== Register-Based Loads and Stores

[cols="^,^,^,^,^,^",options="header"]
|===
|15:13 |12:10 |9:7 |6:5 |4:2 |1:0

|C.LW |offset[5:3] |base |offset[2\|6] |dest |C0
|C.LD |offset[5:3] |base |offset[7:6] |dest |C0
|C.LQ |offset[5\|4\|8] |base |offset[7:6] |dest |C0
|C.FLW |offset[5:3] |base |offset[2\|6] |dest |C0
|C.FLD |offset[5:3] |base |offset[7:6] |dest |C0
|===

These instructions use the CL format.

C.LW loads a 32-bit value from memory into register
_rd latexmath:[$'$]_. It computes an effective address by adding the
_zero_-extended offset, scaled by 4, to the base address in register
_rs1 latexmath:[$'$]_. It expands to `lw rd', offset(rs1')`.

C.LD is an RV64C/RV128C-only instruction that loads a 64-bit value from
memory into register _rd latexmath:[$'$]_. It computes an effective
address by adding the _zero_-extended offset, scaled by 8, to the base
address in register _rs1 latexmath:[$'$]_. It expands to
`ld rd', offset(rs1')`.

C.LQ is an RV128C-only instruction that loads a 128-bit value from
memory into register _rd latexmath:[$'$]_. It computes an effective
address by adding the _zero_-extended offset, scaled by 16, to the base
address in register _rs1 latexmath:[$'$]_. It expands to
`lq rd', offset(rs1')`.

C.FLW is an RV32FC-only instruction that loads a single-precision
floating-point value from memory into floating-point register
_rd latexmath:[$'$]_. It computes an effective address by adding the
_zero_-extended offset, scaled by 4, to the base address in register
_rs1 latexmath:[$'$]_. It expands to `flw rd', offset(rs1')`.

C.FLD is an RV32DC/RV64DC-only instruction that loads a double-precision
floating-point value from memory into floating-point register
_rd latexmath:[$'$]_. It computes an effective address by adding the
_zero_-extended offset, scaled by 8, to the base address in register
_rs1 latexmath:[$'$]_.
It expands to `fld rd', offset(rs1')`.

[cols="^,^,^,^,^,^",options="header"]
|===
|15:13 |12:10 |9:7 |6:5 |4:2 |1:0

|C.SW |offset[5:3] |base |offset[2\|6] |src |C0
|C.SD |offset[5:3] |base |offset[7:6] |src |C0
|C.SQ |offset[5\|4\|8] |base |offset[7:6] |src |C0
|C.FSW |offset[5:3] |base |offset[2\|6] |src |C0
|C.FSD |offset[5:3] |base |offset[7:6] |src |C0
|===

These instructions use the CS format.

C.SW stores a 32-bit value in register _rs2 latexmath:[$'$]_ to memory.
It computes an effective address by adding the _zero_-extended offset,
scaled by 4, to the base address in register _rs1 latexmath:[$'$]_. It
expands to `sw rs2', offset(rs1')`.

C.SD is an RV64C/RV128C-only instruction that stores a 64-bit value in
register _rs2 latexmath:[$'$]_ to memory. It computes an effective
address by adding the _zero_-extended offset, scaled by 8, to the base
address in register _rs1 latexmath:[$'$]_. It expands to
`sd rs2', offset(rs1')`.

C.SQ is an RV128C-only instruction that stores a 128-bit value in
register _rs2 latexmath:[$'$]_ to memory. It computes an effective
address by adding the _zero_-extended offset, scaled by 16, to the base
address in register _rs1 latexmath:[$'$]_. It expands to
`sq rs2', offset(rs1')`.

C.FSW is an RV32FC-only instruction that stores a single-precision
floating-point value in floating-point register _rs2 latexmath:[$'$]_ to
memory. It computes an effective address by adding the _zero_-extended
offset, scaled by 4, to the base address in register
_rs1 latexmath:[$'$]_. It expands to `fsw rs2', offset(rs1')`.

C.FSD is an RV32DC/RV64DC-only instruction that stores a
double-precision floating-point value in floating-point register
_rs2 latexmath:[$'$]_ to memory. It computes an effective address by
adding the _zero_-extended offset, scaled by 8, to the base address in
register _rs1 latexmath:[$'$]_.
It expands to `fsd rs2', offset(rs1')`.

=== Control Transfer Instructions

RVC provides unconditional jump instructions and conditional branch
instructions. As with base RVI instructions, the offsets of all RVC
control transfer instructions are in multiples of 2 bytes.

[cols="^,^,^",options="header"]
|===
|15:13 |12:2 |1:0

|C.J |offset[11\|4\|9:8\|10\|6\|7\|3:1\|5] |C1
|C.JAL |offset[11\|4\|9:8\|10\|6\|7\|3:1\|5] |C1
|===

These instructions use the CJ format.

C.J performs an unconditional control transfer. The offset is
sign-extended and added to the `pc` to form the jump target address. C.J
can therefore target a latexmath:[$\pm$]2 KiB range. C.J expands to
`jal x0, offset`.

C.JAL is an RV32C-only instruction that performs the same operation as
C.J, but additionally writes the address of the instruction following
the jump (`pc`+2) to the link register, `x1`. C.JAL expands to
`jal x1, offset`.

[cols="^,^,^,^",options="header"]
|===
|15:12 |11:7 |6:2 |1:0

|C.JR |srclatexmath:[$\neq$]0 |0 |C2
|C.JALR |srclatexmath:[$\neq$]0 |0 |C2
|===

These instructions use the CR format.

C.JR (jump register) performs an unconditional control transfer to the
address in register _rs1_. C.JR expands to `jalr x0, 0(rs1)`. C.JR is
only valid when latexmath:[$\textit{rs1}{\neq}\texttt{x0}$]; the code
point with latexmath:[$\textit{rs1}{=}\texttt{x0}$] is reserved.

C.JALR (jump and link register) performs the same operation as C.JR, but
additionally writes the address of the instruction following the jump
(`pc`+2) to the link register, `x1`. C.JALR expands to
`jalr x1, 0(rs1)`.
C.JALR is only valid when
latexmath:[$\textit{rs1}{\neq}\texttt{x0}$]; the code point with
latexmath:[$\textit{rs1}{=}\texttt{x0}$] corresponds to the C.EBREAK
instruction.

Strictly speaking, C.JALR does not expand exactly to a base RVI
instruction as the value added to the PC to form the link address is 2
rather than 4 as in the base ISA, but supporting both offsets of 2 and 4
bytes is only a very minor change to the base microarchitecture.

[cols="^,^,^,^,^",options="header"]
|===
|15:13 |12:10 |9:7 |6:2 |1:0

|C.BEQZ |offset[8\|4:3] |src |offset[7:6\|2:1\|5] |C1
|C.BNEZ |offset[8\|4:3] |src |offset[7:6\|2:1\|5] |C1
|===

These instructions use the CB format.

C.BEQZ performs conditional control transfers. The offset is
sign-extended and added to the `pc` to form the branch target address.
It can therefore target a latexmath:[$\pm$]256 B range. C.BEQZ takes the
branch if the value in register _rs1 latexmath:[$'$]_ is zero. It
expands to `beq rs1', x0, offset`.

C.BNEZ is defined analogously, but it takes the branch if
_rs1 latexmath:[$'$]_ contains a nonzero value. It expands to
`bne rs1', x0, offset`.

=== Integer Computational Instructions

RVC provides several instructions for integer arithmetic and constant
generation.

==== Integer Constant-Generation Instructions

The two constant-generation instructions both use the CI instruction
format and can target any integer register.

[cols="^,^,^,^,^",options="header"]
|===
|15:13 |12 |11:7 |6:2 |1:0

|C.LI |imm[5] |destlatexmath:[$\neq$]0 |imm[4:0] |C1
|C.LUI |nzimm[17] |latexmath:[$\textrm{dest}{\neq}{\left\{0,2\right\}}$] |nzimm[16:12] |C1
|===

C.LI loads the sign-extended 6-bit immediate, _imm_, into register _rd_.
C.LI expands into `addi rd, x0, imm`. C.LI is only valid when
_rd_latexmath:[$\neq$]`x0`; the code points with _rd_=`x0` encode HINTs.
C.LUI loads the non-zero 6-bit immediate field into bits 17–12 of the
destination register, clears the bottom 12 bits, and sign-extends bit 17
into all higher bits of the destination. C.LUI expands into
`lui rd, nzimm`. C.LUI is only valid when
latexmath:[$\textit{rd}{\neq}{\left\{\texttt{x0},\texttt{x2}\right\}}$],
and when the immediate is not equal to zero. The code points with
_nzimm_=0 are reserved; the remaining code points with _rd_=`x0` are
HINTs; and the remaining code points with _rd_=`x2` correspond to the
C.ADDI16SP instruction.

==== Integer Register-Immediate Operations

These integer register-immediate operations are encoded in the CI format
and perform operations on an integer register and a 6-bit immediate.

[cols="^,^,^,^,^",options="header"]
|===
|15:13 |12 |11:7 |6:2 |1:0

|C.ADDI |nzimm[5] |destlatexmath:[$\neq$]0 |nzimm[4:0] |C1
|C.ADDIW |imm[5] |destlatexmath:[$\neq$]0 |imm[4:0] |C1
|C.ADDI16SP |nzimm[9] |2 |nzimm[4\|6\|8:7\|5] |C1
|===

C.ADDI adds the non-zero sign-extended 6-bit immediate to the value in
register _rd_ then writes the result to _rd_. C.ADDI expands into
`addi rd, rd, nzimm`. C.ADDI is only valid when
_rd_latexmath:[$\neq$]`x0` and _nzimm_latexmath:[$\neq$]0. The code
points with _rd_=`x0` encode the C.NOP instruction; the remaining code
points with _nzimm_=0 encode HINTs.

C.ADDIW is an RV64C/RV128C-only instruction that performs the same
computation but produces a 32-bit result, then sign-extends the result
to 64 bits. C.ADDIW expands into `addiw rd, rd, imm`. The immediate can
be zero for C.ADDIW, where this corresponds to `sext.w rd`. C.ADDIW is
only valid when _rd_latexmath:[$\neq$]`x0`; the code points with
_rd_=`x0` are reserved.

C.ADDI16SP shares the opcode with C.LUI, but has a destination field of
`x2`.
C.ADDI16SP adds the non-zero sign-extended 6-bit immediate to the
value in the stack pointer (`sp`=`x2`), where the immediate is scaled to
represent multiples of 16 in the range (-512,496). C.ADDI16SP is used to
adjust the stack pointer in procedure prologues and epilogues. It
expands into `addi x2, x2, nzimm`. C.ADDI16SP is only valid when
_nzimm_latexmath:[$\neq$]0; the code point with _nzimm_=0 is reserved.

In the standard RISC-V calling convention, the stack pointer `sp` is
always 16-byte aligned.

[cols="^,^,^,^",options="header"]
|===
|15:13 |12:5 |4:2 |1:0

|C.ADDI4SPN |nzuimm[5:4\|9:6\|2\|3] |dest |C0
|===

C.ADDI4SPN is a CIW-format instruction that adds a _zero_-extended
non-zero immediate, scaled by 4, to the stack pointer, `x2`, and writes
the result to _rd latexmath:[$'$]_. This instruction is used to generate
pointers to stack-allocated variables, and expands to
`addi rd', x2, nzuimm`. C.ADDI4SPN is only valid when
_nzuimm_latexmath:[$\neq$]0; the code points with _nzuimm_=0 are
reserved.

[cols="^,^,^,^,^",options="header"]
|===
|15:13 |12 |11:7 |6:2 |1:0

|C.SLLI |shamt[5] |destlatexmath:[$\neq$]0 |shamt[4:0] |C2
|===

C.SLLI is a CI-format instruction that performs a logical left shift of
the value in register _rd_ then writes the result to _rd_. The shift
amount is encoded in the _shamt_ field. For RV128C, a shift amount of
zero is used to encode a shift of 64. C.SLLI expands into
`slli rd, rd, shamt`, except for RV128C with `shamt=0`, which expands to
`slli rd, rd, 64`.

For RV32C, _shamt[5]_ must be zero; the code points with _shamt[5]_=1
are designated for custom extensions. For RV32C and RV64C, the shift
amount must be non-zero; the code points with _shamt_=0 are HINTs. For
all base ISAs, the code points with _rd_=`x0` are HINTs, except those
with _shamt[5]_=1 in RV32C.
[cols="^,^,^,^,^,^",options="header"]
|===
|15:13 |12 |11:10 |9:7 |6:2 |1:0

|C.SRLI |shamt[5] |C.SRLI |dest |shamt[4:0] |C1
|C.SRAI |shamt[5] |C.SRAI |dest |shamt[4:0] |C1
|===

C.SRLI is a CB-format instruction that performs a logical right shift of
the value in register _rd latexmath:[$'$]_ then writes the result to
_rd latexmath:[$'$]_. The shift amount is encoded in the _shamt_ field.
For RV128C, a shift amount of zero is used to encode a shift of 64.
Furthermore, the shift amount is sign-extended for RV128C, and so the
legal shift amounts are 1–31, 64, and 96–127. C.SRLI expands into
`srli rd', rd', shamt`, except for RV128C with `shamt=0`, which
expands to `srli rd', rd', 64`.

For RV32C, _shamt[5]_ must be zero; the code points with _shamt[5]_=1
are designated for custom extensions. For RV32C and RV64C, the shift
amount must be non-zero; the code points with _shamt_=0 are HINTs.

C.SRAI is defined analogously to C.SRLI, but instead performs an
arithmetic right shift. C.SRAI expands to `srai rd', rd', shamt`.

Left shifts are usually more frequent than right shifts, as left shifts
are frequently used to scale address values. Right shifts have therefore
been granted less encoding space and are placed in an encoding quadrant
where all other immediates are sign-extended. For RV128, the decision
was made to have the 6-bit shift-amount immediate also be sign-extended.
Apart from reducing the decode complexity, we believe right-shift
amounts of 96–127 will be more useful than 64–95, to allow extraction of
tags located in the high portions of 128-bit address pointers. We note
that RV128C will not be frozen at the same point as RV32C and RV64C, to
allow evaluation of typical usage of 128-bit address-space codes.
[cols="^,^,^,^,^,^",options="header"]
|===
|15:13 |12 |11:10 |9:7 |6:2 |1:0

|C.ANDI |imm[5] |C.ANDI |dest |imm[4:0] |C1
|===

C.ANDI is a CB-format instruction that computes the bitwise AND of the
value in register _rd latexmath:[$'$]_ and the sign-extended 6-bit
immediate, then writes the result to _rd latexmath:[$'$]_. C.ANDI
expands to `andi rd', rd', imm`.

==== Integer Register-Register Operations

[cols="^,^,^,^",options="header"]
|===
|15:12 |11:7 |6:2 |1:0

|C.MV |destlatexmath:[$\neq$]0 |srclatexmath:[$\neq$]0 |C2
|C.ADD |destlatexmath:[$\neq$]0 |srclatexmath:[$\neq$]0 |C2
|===

These instructions use the CR format.

C.MV copies the value in register _rs2_ into register _rd_. C.MV expands
into `add rd, x0, rs2`. C.MV is only valid when
latexmath:[$\textit{rs2}{\neq}\texttt{x0}$]; the code points with
latexmath:[$\textit{rs2}{=}\texttt{x0}$] correspond to the C.JR
instruction. The code points with
latexmath:[$\textit{rs2}{\neq}\texttt{x0}$] and
latexmath:[$\textit{rd}{=}\texttt{x0}$] are HINTs.

C.MV expands to a different instruction than the canonical MV
pseudoinstruction, which instead uses ADDI. Implementations that handle
MV specially, e.g. using register-renaming hardware, may find it more
convenient to expand C.MV to MV instead of ADD, at slight additional
hardware cost.

C.ADD adds the values in registers _rd_ and _rs2_ and writes the result
to register _rd_. C.ADD expands into `add rd, rd, rs2`. C.ADD is only
valid when latexmath:[$\textit{rs2}{\neq}\texttt{x0}$]; the code points
with latexmath:[$\textit{rs2}{=}\texttt{x0}$] correspond to the C.JALR
and C.EBREAK instructions. The code points with
latexmath:[$\textit{rs2}{\neq}\texttt{x0}$] and
latexmath:[$\textit{rd}{=}\texttt{x0}$] are HINTs.
[cols="^,^,^,^,^",options="header"]
|===
|15:10 |9:7 |6:5 |4:2 |1:0

|C.AND |dest |C.AND |src |C1
|C.OR |dest |C.OR |src |C1
|C.XOR |dest |C.XOR |src |C1
|C.SUB |dest |C.SUB |src |C1
|C.ADDW |dest |C.ADDW |src |C1
|C.SUBW |dest |C.SUBW |src |C1
|===

These instructions use the CA format.

C.AND computes the bitwise AND of the values in registers
_rd latexmath:[$'$]_ and _rs2 latexmath:[$'$]_, then writes the result
to register _rd latexmath:[$'$]_. C.AND expands into
`and rd', rd', rs2'`.

C.OR computes the bitwise OR of the values in registers
_rd latexmath:[$'$]_ and _rs2 latexmath:[$'$]_, then writes the result
to register _rd latexmath:[$'$]_. C.OR expands into
`or rd', rd', rs2'`.

C.XOR computes the bitwise XOR of the values in registers
_rd latexmath:[$'$]_ and _rs2 latexmath:[$'$]_, then writes the result
to register _rd latexmath:[$'$]_. C.XOR expands into
`xor rd', rd', rs2'`.

C.SUB subtracts the value in register _rs2 latexmath:[$'$]_ from the
value in register _rd latexmath:[$'$]_, then writes the result to
register _rd latexmath:[$'$]_. C.SUB expands into
`sub rd', rd', rs2'`.

C.ADDW is an RV64C/RV128C-only instruction that adds the values in
registers _rd latexmath:[$'$]_ and _rs2 latexmath:[$'$]_, then
sign-extends the lower 32 bits of the sum before writing the result to
register _rd latexmath:[$'$]_. C.ADDW expands into
`addw rd', rd', rs2'`.

C.SUBW is an RV64C/RV128C-only instruction that subtracts the value in
register _rs2 latexmath:[$'$]_ from the value in register
_rd latexmath:[$'$]_, then sign-extends the lower 32 bits of the
difference before writing the result to register _rd latexmath:[$'$]_.
C.SUBW expands into `subw rd', rd', rs2'`.

These six instructions do not provide large savings individually, but
they do not occupy much encoding space and are straightforward to
implement, and as a group they provide a worthwhile improvement in
static and dynamic compression.
==== Defined Illegal Instruction

[cols="^,^,^,^,^",options="header"]
|===
|15:13 |12 |11:7 |6:2 |1:0

|0 |0 |0 |0 |0
|===

A 16-bit instruction with all bits zero is permanently reserved as an
illegal instruction.

We reserve all-zero instructions to be illegal instructions to help trap
attempts to execute zeroed or non-existent portions of the memory
space. The all-zero value should not be redefined in any non-standard
extension. Similarly, we reserve instructions with all bits set to 1
(corresponding to very long instructions in the RISC-V variable-length
encoding scheme) as illegal to capture another common value seen in
non-existent memory regions.

==== NOP Instruction

[cols="^,^,^,^,^",options="header"]
|===
|15:13 |12 |11:7 |6:2 |1:0

|C.NOP |0 |0 |0 |C1
|===

C.NOP is a CI-format instruction that does not change any user-visible
state, except for advancing the `pc` and incrementing any applicable
performance counters. C.NOP expands to `nop`. C.NOP is only valid when
_imm_=0; the code points with _imm_latexmath:[$\neq$]0 encode HINTs.

==== Breakpoint Instruction

[cols="^,^,^",options="header"]
|===
|15:12 |11:2 |1:0

|C.EBREAK |0 |C2
|===

Debuggers can use the C.EBREAK instruction, which expands to `ebreak`,
to cause control to be transferred back to the debugging environment.
C.EBREAK shares the opcode with the C.ADD instruction, but with _rd_ and
_rs2_ both zero, and thus can also use the CR format.

=== Usage of C Instructions in LR/SC Sequences

On implementations that support the C extension, compressed forms of the
I instructions permitted inside constrained LR/SC sequences, as
described in <<atomics>>, are also permitted inside constrained LR/SC
sequences.

The implication is that any implementation that claims to support both
the A and C extensions must ensure that LR/SC sequences containing valid
C instructions will eventually complete.

[[rvc-hints]]
=== HINT Instructions

A portion of the RVC encoding space is reserved for microarchitectural
HINTs.
Like the HINTs in the RV32I base ISA (see
<>), these instructions do not
modify any architectural state, except for advancing the `pc` and any
applicable performance counters. HINTs are executed as no-ops on
implementations that ignore them.

RVC HINTs are encoded as computational instructions that do not modify
the architectural state, either because _rd_=`x0` (e.g.
C.ADD _x0_, _t0_), or because _rd_ is overwritten with a copy of itself
(e.g. C.ADDI _t0_, 0).

This HINT encoding has been chosen so that simple implementations can
ignore HINTs altogether, and instead execute a HINT as a regular
computational instruction that happens not to mutate the architectural
state.

RVC HINTs do not necessarily expand to their RVI HINT counterparts. For
example, C.ADD _x0_, _t0_ might not encode the same HINT as
ADD _x0_, _x0_, _t0_.

The primary reason to not require an RVC HINT to expand to an RVI HINT
is that HINTs are unlikely to be compressible in the same manner as the
underlying computational instruction. Also, decoupling the RVC and RVI
HINT mappings allows the scarce RVC HINT space to be allocated to the
most popular HINTs, and in particular, to HINTs that are amenable to
macro-op fusion.

<<rvc-t-hints>> lists all RVC HINT code points. For RV32C, 78% of the
HINT space is reserved for standard HINTs, but none are presently
defined. The remainder of the HINT space is designated for custom HINTs;
no standard HINTs will ever be defined in this subspace.

[[rvc-t-hints]]
.RVC HINT instructions
+//[cols="<,<,>,<",options="header",]
+|===
+|Instruction |Constraints |Code Points |Purpose
+|C.NOP |_nzimm_latexmath:[$\neq$]0 |63 |_Reserved for future standard
+use_
+
+|C.ADDI |_rd_latexmath:[$\neq$]`x0`, _nzimm_=0 |31 |
+
+|C.LI |_rd_=`x0` |64 |
+
+|C.LUI |_rd_=`x0`, _nzimm_latexmath:[$\neq$]0 |63 |
+
+|C.MV |_rd_=`x0`, _rs2_latexmath:[$\neq$]`x0` |31 |
+
+|C.ADD |_rd_=`x0`, _rs2_latexmath:[$\neq$]`x0` |31 |
+
+|C.SLLI |_rd_=`x0`, _nzimm_latexmath:[$\neq$]0 |31 (RV32) |_Designated
+for custom use_
+
+| | |63 (RV64/128) |
+
+|C.SLLI64 |_rd_=`x0` |1 |
+
+|C.SLLI64 |_rd_latexmath:[$\neq$]`x0`, RV32 and RV64 only |31 |
+
+|C.SRLI64 |RV32 and RV64 only |8 |
+
+|C.SRAI64 |RV32 and RV64 only |8 |
+|===
+
+=== RVC Instruction Set Listings
+
+<> shows a map of the major
+opcodes for RVC. Each row of the table corresponds to one quadrant of
+the encoding space. The last quadrant, which has the two
+least-significant bits set, corresponds to instructions wider than 16
+bits, including those in the base ISAs. Several instructions are only
+valid for certain operands; when invalid, they are marked either _RES_
+to indicate that the opcode is reserved for future standard extensions;
+_Custom_ to indicate that the opcode is designated for custom
+extensions; or _HINT_ to indicate that the opcode is reserved for
+microarchitectural hints (see <<rvc-hints>>).
+
+//[cols=">,^,^,^,^,^,^,^,^,<",]
+|===
+|inst[1:0] / inst[15:13] |000 |001 |010 |011 |100 |101 |110 |111 |
+|00 |ADDI4SPN |FLD |LW |FLW |_Reserved_ |FSD |SW |FSW |RV32
+| | |FLD | |LD | |FSD | |SD |RV64
+| | |LQ | |LD | |SQ | |SD |RV128
+|01 |ADDI |JAL |LI |LUI/ADDI16SP |MISC-ALU |J |BEQZ |BNEZ |RV32
+| | |ADDIW | | | | | | |RV64
+| | |ADDIW | | | | | | |RV128
+|10 |SLLI |FLDSP |LWSP |FLWSP |J[AL]R/MV/ADD |FSDSP |SWSP |FSWSP |RV32
+| | |FLDSP | |LDSP | |FSDSP | |SDSP |RV64
+| | |LQSP | |LDSP | |SQSP | |SDSP |RV128
+|11 |latexmath:[$>$]16b | | | | | | | |
+|===
+
+<<rvc-instr-table0>>, <<rvc-instr-table1>>, and <<rvc-instr-table2>> list the
+RVC instructions.
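As a quick consistency check on the major-opcode map above, the quadrant and funct3 fields of a candidate encoding can be extracted mechanically. The sketch below is ours, not part of the specification, and covers only the RV32 row labels; the table of names is our own shorthand for the map:

```python
# Minimal RV32 sketch of the RVC major-opcode map (shorthand names are
# ours, not normative). inst[1:0] selects the quadrant and inst[15:13]
# selects the entry; quadrant 0b11 means the instruction is wider than
# 16 bits and is not an RVC encoding.
RVC_MAP_RV32 = {
    0: ["ADDI4SPN", "FLD", "LW", "FLW", "Reserved", "FSD", "SW", "FSW"],
    1: ["ADDI", "JAL", "LI", "LUI/ADDI16SP", "MISC-ALU", "J", "BEQZ", "BNEZ"],
    2: ["SLLI", "FLDSP", "LWSP", "FLWSP", "J[AL]R/MV/ADD", "FSDSP", "SWSP", "FSWSP"],
}

def classify_rv32(inst: int) -> str:
    quadrant = inst & 0b11           # inst[1:0]
    if quadrant == 0b11:
        return ">16b"                # not a compressed instruction
    funct3 = (inst >> 13) & 0b111    # inst[15:13]
    return RVC_MAP_RV32[quadrant][funct3]

# C.NOP is 0x0001: quadrant 01, funct3 000 -> the ADDI row entry.
print(classify_rv32(0x0001))   # ADDI
# C.EBREAK is 0x9002: quadrant 10, funct3 100 -> J[AL]R/MV/ADD row entry.
print(classify_rv32(0x9002))   # J[AL]R/MV/ADD
```

Decoding of the remaining fields (register specifiers, immediates) then depends on the per-quadrant formats in the listings that follow.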
+ +[[rvc-instr-table0]] +.Instruction listing for RVC, Quadrant 0 +//[cols="<,<,<,<,<,<,<,<,<,<,<,<,<,<,<,<,<,<",] +|=== + +|000| |0 | | | | | | | |0 | | |00 | |_Illegal instruction_ + +|000| +|nzuimm[5:4latexmath:[$\vert$]9:6latexmath:[$\vert$]2latexmath:[$\vert$]3] +| | | | | | | |rd latexmath:[$'$] | | |00 | |C.ADDI4SPN _(RES, +nzuimm=0)_ + +|001| |uimm[5:3] | | |rs1 latexmath:[$'$] | | |uimm[7:6] | +|rd latexmath:[$'$] | | |00 | |C.FLD _(RV32/64)_ + +|001| |uimm[5:4latexmath:[$\vert$]8] | | |rs1 latexmath:[$'$] | | +|uimm[7:6] | |rd latexmath:[$'$] | | |00 | |C.LQ _(RV128)_ + +|010| |uimm[5:3] | | |rs1 latexmath:[$'$] | | +|uimm[2latexmath:[$\vert$]6] | |rd latexmath:[$'$] | | |00 | |C.LW + +|011| |uimm[5:3] | | |rs1 latexmath:[$'$] | | +|uimm[2latexmath:[$\vert$]6] | |rd latexmath:[$'$] | | |00 | |C.FLW +_(RV32)_ + +|011| |uimm[5:3] | | |rs1 latexmath:[$'$] | | |uimm[7:6] | +|rd latexmath:[$'$] | | |00 | |C.LD _(RV64/128)_ + +|100| |— | | | | | | | | | | |00 | |_Reserved_ + +|101| |uimm[5:3] | | |rs1 latexmath:[$'$] | | |uimm[7:6] | +|rs2 latexmath:[$'$] | | |00 | |C.FSD _(RV32/64)_ + +|101 | | |uimm[5:4latexmath:[$\vert$]8] | | |rs1 latexmath:[$'$] | | +|uimm[7:6] | |rs2 latexmath:[$'$] | | |00 | |C.SQ _(RV128)_ + +|110| |uimm[5:3] | | |rs1 latexmath:[$'$] | | +|uimm[2latexmath:[$\vert$]6] | |rs2 latexmath:[$'$] | | |00 | |C.SW + +|111| |uimm[5:3] | | |rs1 latexmath:[$'$] | | +|uimm[2latexmath:[$\vert$]6] | |rs2 latexmath:[$'$] | | |00 | |C.FSW +_(RV32)_ + +|111| |uimm[5:3] | | |rs1 latexmath:[$'$] | | |uimm[7:6] | +|rs2 latexmath:[$'$] | | |00 | |C.SD _(RV64/128)_ + +|=== + +[[rvc-instr-table1]] +.Instruction listing for RVC, Quadrant 1 +//[cols="<,<,<,<,<,<,<,<,<,<,<,<,<,<,<,<,<,<",] +|=== + +|000 |nzimm[5] |0 | | | | |nzimm[4:0] | | | | |01 | |C.NOP _(HINT, +nzimmlatexmath:[$\neq$]0)_ + +|000 |nzimm[5] |rs1/rdlatexmath:[$\neq$]0 | | | | |nzimm[4:0] | | +| | |01 | |C.ADDI _(HINT, nzimm=0)_ + +|001 | | 
+|imm[11latexmath:[$\vert$]4latexmath:[$\vert$]9:8latexmath:[$\vert$]10latexmath:[$\vert$]6latexmath:[$\vert$]7latexmath:[$\vert$]3:1latexmath:[$\vert$]5] +| | | | | | | | | | |01 | |C.JAL _(RV32)_ + +|001 |imm[5] |rs1/rdlatexmath:[$\neq$]0 | | | | |imm[4:0] | | | | +|01 | |C.ADDIW _(RV64/128; RES, rd=0)_ + +|010 |imm[5] |rdlatexmath:[$\neq$]0 | | | | |imm[4:0] | | | | |01 +| |C.LI _(HINT, rd=0)_ + +|011 |nzimm[9] |2 | | | | +|nzimm[4latexmath:[$\vert$]6latexmath:[$\vert$]8:7latexmath:[$\vert$]5] +| | | | |01 | |C.ADDI16SP _(RES, nzimm=0)_ + +|011 |nzimm[17] |rdlatexmath:[$\neq$]latexmath:[$\{0,2\}$] | | | | +|nzimm[16:12] | | | | |01 | |C.LUI _(RES, nzimm=0; HINT, rd=0)_ + +|100 |nzuimm[5] |00 | |rs1 latexmath:[$'$]/rd latexmath:[$'$] | | +|nzuimm[4:0] | | | | |01 | |C.SRLI _(RV32 Custom, nzuimm[5]=1)_ + +|100 |0 |00 | |rs1 latexmath:[$'$]/rd latexmath:[$'$] | | |0 | | | +| |01 | |C.SRLI64 _(RV128; RV32/64 HINT)_ + +|100 |nzuimm[5] |01 | |rs1 latexmath:[$'$]/rd latexmath:[$'$] | | +|nzuimm[4:0] | | | | |01 | |C.SRAI _(RV32 Custom, nzuimm[5]=1)_ + +|100 |0 |01 | |rs1 latexmath:[$'$]/rd latexmath:[$'$] | | |0 | | | +| |01 | |C.SRAI64 _(RV128; RV32/64 HINT)_ + +|100 |imm[5] |10 | |rs1 latexmath:[$'$]/rd latexmath:[$'$] | | +|imm[4:0] | | | | |01 | |C.ANDI + +|100 |0 |11 | |rs1 latexmath:[$'$]/rd latexmath:[$'$] | | |00 | +|rs2 latexmath:[$'$] | | |01 | |C.SUB + +|100 |0 |11 | |rs1 latexmath:[$'$]/rd latexmath:[$'$] | | |01 | +|rs2 latexmath:[$'$] | | |01 | |C.XOR + +|100 |0 |11 | |rs1 latexmath:[$'$]/rd latexmath:[$'$] | | |10 | +|rs2 latexmath:[$'$] | | |01 | |C.OR + +|100 |0 |11 | |rs1 latexmath:[$'$]/rd latexmath:[$'$] | | |11 | +|rs2 latexmath:[$'$] | | |01 | |C.AND + +|100 |1 |11 | |rs1 latexmath:[$'$]/rd latexmath:[$'$] | | |00 | +|rs2 latexmath:[$'$] | | |01 | |C.SUBW _(RV64/128; RV32 RES)_ + +|100 |1 |11 | |rs1 latexmath:[$'$]/rd latexmath:[$'$] | | |01 | +|rs2 latexmath:[$'$] | | |01 | |C.ADDW _(RV64/128; RV32 RES)_ + +|100 |1 |11 | |— | | |10 | |— | | |01 | 
|_Reserved_ + +|100 |1 |11 | |— | | |11 | |— | | |01 | |_Reserved_ + +|101 +|imm[11latexmath:[$\vert$]4latexmath:[$\vert$]9:8latexmath:[$\vert$]10latexmath:[$\vert$]6latexmath:[$\vert$]7latexmath:[$\vert$]3:1latexmath:[$\vert$]5] +| | | | | | | | | | |01 | |C.J + +|110 |imm[8latexmath:[$\vert$]4:3] | | |rs1 latexmath:[$'$] | | +|imm[7:6latexmath:[$\vert$]2:1latexmath:[$\vert$]5] | | | | |01 | +|C.BEQZ + +|111 |imm[8latexmath:[$\vert$]4:3] | | |rs1 latexmath:[$'$] | | +|imm[7:6latexmath:[$\vert$]2:1latexmath:[$\vert$]5] | | | | |01 | +|C.BNEZ + +|=== + +[[rvc-instr-table2]] +.Instruction listing for RVC, Quadrant 2 +//[cols="<,<,<,<,<,<,<,<,<,<,<,<,<,<,<,<,<,<",] +|=== + +|000 |nzuimm[5] |rs1/rdlatexmath:[$\neq$]0 | | | | |nzuimm[4:0] | +| | | |10 | |C.SLLI _(HINT, rd=0; RV32 Custom, nzuimm[5]=1)_ + +|000 |0 |rs1/rdlatexmath:[$\neq$]0 | | | | |0 | | | | |10 | +|C.SLLI64 _(RV128; RV32/64 HINT; HINT, rd=0)_ + +|001 |uimm[5] |rd | | | | |uimm[4:3latexmath:[$\vert$]8:6] | | | | +|10 | |C.FLDSP _(RV32/64)_ + +|001 |uimm[5] |rdlatexmath:[$\neq$]0 | | | | +|uimm[4latexmath:[$\vert$]9:6] | | | | |10 | |C.LQSP _(RV128; RES, +rd=0)_ + +|010 |uimm[5] |rdlatexmath:[$\neq$]0 | | | | +|uimm[4:2latexmath:[$\vert$]7:6] | | | | |10 | |C.LWSP _(RES, rd=0)_ + +|011 |uimm[5] |rd | | | | |uimm[4:2latexmath:[$\vert$]7:6] | | | | +|10 | |C.FLWSP _(RV32)_ + +|011 |uimm[5] |rdlatexmath:[$\neq$]0 | | | | +|uimm[4:3latexmath:[$\vert$]8:6] | | | | |10 | |C.LDSP _(RV64/128; RES, +rd=0)_ + +|100 |0 |rs1latexmath:[$\neq$]0 | | | | |0 | | | | |10 | |C.JR +_(RES, rs1=0)_ + +|100 |0 |rdlatexmath:[$\neq$]0 | | | | |rs2latexmath:[$\neq$]0 | | +| | |10 | |C.MV _(HINT, rd=0)_ + +|100 |1 |0 | | | | |0 | | | | |10 | |C.EBREAK + +|100 |1 |rs1latexmath:[$\neq$]0 | | | | |0 | | | | |10 | |C.JALR + +|100 | |1 |rs1/rdlatexmath:[$\neq$]0 | | | | |rs2latexmath:[$\neq$]0 +| | | | |10 | |C.ADD _(HINT, rd=0)_ + +|101 |uimm[5:3latexmath:[$\vert$]8:6] | | | | | |rs2 | | | | |10 | +|C.FSDSP _(RV32/64)_ + +|101 
|uimm[5:4latexmath:[$\vert$]9:6] | | | | | |rs2 | | | | |10 | +|C.SQSP _(RV128)_ + +|110 |uimm[5:2latexmath:[$\vert$]7:6] | | | | | |rs2 | | | | |10 | +|C.SWSP + +|111 |uimm[5:2latexmath:[$\vert$]7:6] | | | | | |rs2 | | | | |10 | +|C.FSWSP _(RV32)_ + +|111 |uimm[5:3latexmath:[$\vert$]8:6] | | | | | |rs2 | | | | |10 | +|C.SDSP _(RV64/128)_ + +|=== diff --git a/src/colophon.adoc b/src/colophon.adoc new file mode 100644 index 0000000..97f13cd --- /dev/null +++ b/src/colophon.adoc @@ -0,0 +1,347 @@ +[colophon] += Preface + +This document describes the RISC-V unprivileged architecture. + +The ISA modules marked Ratified have been ratified at this time. The +modules marked _Frozen_ are not expected to change significantly before +being put up for ratification. The modules marked _Draft_ are expected +to change before ratification. + +The document contains the following versions of the RISC-V ISA modules: + +[cols="^,<,^",options="header",] +|=== +|Base |Version |Status +|RVWMO |2.0 |*Ratified* +|*RV32I* |*2.1* |*Ratified* +|*RV64I* |*2.1* |*Ratified* +|_RV32E_ |_1.9_ |_Draft_ +|_RV128I_ |_1.7_ |_Draft_ +|Extension |Version |Status +|*M* |*2.0* |*Ratified* +|*A* |*2.1* |*Ratified* +|*F* |*2.2* |*Ratified* +|*D* |*2.2* |*Ratified* +|*Q* |*2.2* |*Ratified* +|*C* |*2.0* |*Ratified* +|_Counters_ |_2.0_ |_Draft_ +|_L_ |_0.0_ |_Draft_ +|_B_ |_0.0_ |_Draft_ +|_J_ |_0.0_ |_Draft_ +|_T_ |_0.0_ |_Draft_ +|_P_ |_0.2_ |_Draft_ +|_V_ |_0.7_ |_Draft_ +|*Zicsr* |*2.0* |*Ratified* +|*Zifencei* |*2.0* |*Ratified* +|_Zam_ |_0.1_ |_Draft_ +|_Ztso_ |_0.1_ |_Frozen_ +|=== + +_Preface to Document Version 20191213-Base-Ratified_ + +This document describes the RISC-V unprivileged architecture. + +The ISA modules marked Ratified have been ratified at this time. The +modules marked _Frozen_ are not expected to change significantly before +being put up for ratification. The modules marked _Draft_ are expected +to change before ratification. 
+
+The document contains the following versions of the RISC-V ISA modules:
+
+[cols="^,<,^",options="header",]
+|===
+|Base |Version |Status
+|RVWMO |2.0 |*Ratified*
+|*RV32I* |*2.1* |*Ratified*
+|*RV64I* |*2.1* |*Ratified*
+|_RV32E_ |_1.9_ |_Draft_
+|_RV128I_ |_1.7_ |_Draft_
+|Extension |Version |Status
+|*M* |*2.0* |*Ratified*
+|*A* |*2.1* |*Ratified*
+|*F* |*2.2* |*Ratified*
+|*D* |*2.2* |*Ratified*
+|*Q* |*2.2* |*Ratified*
+|*C* |*2.0* |*Ratified*
+|_Counters_ |_2.0_ |_Draft_
+|_L_ |_0.0_ |_Draft_
+|_B_ |_0.0_ |_Draft_
+|_J_ |_0.0_ |_Draft_
+|_T_ |_0.0_ |_Draft_
+|_P_ |_0.2_ |_Draft_
+|_V_ |_0.7_ |_Draft_
+|*Zicsr* |*2.0* |*Ratified*
+|*Zifencei* |*2.0* |*Ratified*
+|_Zam_ |_0.1_ |_Draft_
+|_Ztso_ |_0.1_ |_Frozen_
+|===
+
+The changes in this version of the document include:
+
+* The A extension, now version 2.1, was ratified by the board in
+December 2019.
+* Defined big-endian ISA variant.
+* Moved N extension for user-mode interrupts into Volume II.
+* Defined PAUSE hint instruction.
+
+_Preface to Document Version 20190608-Base-Ratified_
+
+This document describes the RISC-V unprivileged architecture.
+
+The RVWMO memory model has been ratified at this time. The ISA modules
+marked Ratified have been ratified at this time. The modules marked
+_Frozen_ are not expected to change significantly before being put up
+for ratification. The modules marked _Draft_ are expected to change
+before ratification.
+
+The document contains the following versions of the RISC-V ISA modules:
+
+[cols="^,<,^",options="header",]
+|===
+|Base |Version |Status
+|RVWMO |2.0 |*Ratified*
+|*RV32I* |*2.1* |*Ratified*
+|*RV64I* |*2.1* |*Ratified*
+|_RV32E_ |_1.9_ |_Draft_
+|_RV128I_ |_1.7_ |_Draft_
+|Extension |Version |Status
+|*Zifencei* |*2.0* |*Ratified*
+|*Zicsr* |*2.0* |*Ratified*
+|*M* |*2.0* |*Ratified*
+|_A_ |_2.0_ |_Frozen_
+|*F* |*2.2* |*Ratified*
+|*D* |*2.2* |*Ratified*
+|*Q* |*2.2* |*Ratified*
+|*C* |*2.0* |*Ratified*
+|_Ztso_ |_0.1_ |_Frozen_
+|_Counters_ |_2.0_ |_Draft_
+|_L_ |_0.0_ |_Draft_
+|_B_ |_0.0_ |_Draft_
+|_J_ |_0.0_ |_Draft_
+|_T_ |_0.0_ |_Draft_
+|_P_ |_0.2_ |_Draft_
+|_V_ |_0.7_ |_Draft_
+|_N_ |_1.1_ |_Draft_
+|_Zam_ |_0.1_ |_Draft_
+|===
+
+The changes in this version of the document include:
+
+* Moved description to *Ratified* for the ISA modules ratified by the
+board in early 2019.
+* Removed the A extension from ratification.
+* Changed document version scheme to avoid confusion with versions of
+the ISA modules.
+* Incremented the version numbers of the base integer ISA to 2.1,
+reflecting the presence of the ratified RVWMO memory model and exclusion
+of FENCE.I, counters, and CSR instructions that were in previous base
+ISA.
+* Incremented the version numbers of the F and D extensions to 2.2,
+reflecting that version 2.1 changed the canonical NaN, and version 2.2
+defined the NaN-boxing scheme and changed the definition of the FMIN and
+FMAX instructions.
+* Changed name of document to refer to ``unprivileged'' instructions as
+part of move to separate ISA specifications from platform profile
+mandates.
+* Added clearer and more precise definitions of execution environments,
+harts, traps, and memory accesses.
+* Defined instruction-set categories: _standard_, _reserved_, _custom_,
+_non-standard_, and _non-conforming_.
+* Removed text implying operation under alternate endianness, as
+alternate-endianness operation has not yet been defined for RISC-V.
+* Changed description of misaligned load and store behavior. The +specification now allows visible misaligned address traps in execution +environment interfaces, rather than just mandating invisible handling of +misaligned loads and stores in user mode. Also, now allows access-fault +exceptions to be reported for misaligned accesses (including atomics) +that should not be emulated. +* Moved FENCE.I out of the mandatory base and into a separate extension, +with Zifencei ISA name. FENCE.I was removed from the Linux user ABI and +is problematic in implementations with large incoherent instruction and +data caches. However, it remains the only standard instruction-fetch +coherence mechanism. +* Removed prohibitions on using RV32E with other extensions. +* Removed platform-specific mandates that certain encodings produce +illegal instruction exceptions in RV32E and RV64I chapters. +* Counter/timer instructions are now not considered part of the +mandatory base ISA, and so CSR instructions were moved into separate +chapter and marked as version 2.0, with the unprivileged counters moved +into another separate chapter. The counters are not ready for +ratification as there are outstanding issues, including counter +inaccuracies. +* A CSR-access ordering model has been added. +* Explicitly defined the 16-bit half-precision floating-point format for +floating-point instructions in the 2-bit _fmt field._ +* Defined the signed-zero behavior of FMIN._fmt_ and FMAX._fmt_, and +changed their behavior on signaling-NaN inputs to conform to the +minimumNumber and maximumNumber operations in the proposed IEEE 754-201x +specification. +* The memory consistency model, RVWMO, has been defined. +* The ``Zam`` extension, which permits misaligned AMOs and specifies +their semantics, has been defined. +* The ``Ztso`` extension, which enforces a stricter memory consistency +model than RVWMO, has been defined. +* Improvements to the description and commentary. 
+* Defined the term IALIGN as shorthand to describe the +instruction-address alignment constraint. +* Removed text of P extension chapter as now superseded by active task +group documents. +* Removed text of V extension chapter as now superseded by separate +vector extension draft document. + +_Preface to Document Version 2.2_ + +This is version 2.2 of the document describing the RISC-V user-level +architecture. The document contains the following versions of the RISC-V +ISA modules: + +[cols="^,<,^",options="header",] +|=== +|Base |_Version_ |_Draft Frozen?_ +|RV32I |2.0 |Y +|RV32E |1.9 |N +|RV64I |2.0 |Y +|RV128I |1.7 |N +|Extension |Version |Frozen? +|M |2.0 |Y +|A |2.0 |Y +|F |2.0 |Y +|D |2.0 |Y +|Q |2.0 |Y +|L |0.0 |N +|C |2.0 |Y +|B |0.0 |N +|J |0.0 |N +|T |0.0 |N +|P |0.1 |N +|V |0.7 |N +|N |1.1 |N +|=== + +To date, no parts of the standard have been officially ratified by the +RISC-V Foundation, but the components labeled ``frozen`` above are not +expected to change during the ratification process beyond resolving +ambiguities and holes in the specification. + +The major changes in this version of the document include: + +* The previous version of this document was released under a Creative +Commons Attribution 4.0 International License by the original authors, +and this and future versions of this document will be released under the +same license. +* Rearranged chapters to put all extensions first in canonical order. +* Improvements to the description and commentary. +* Modified implicit hinting suggestion on JALR to support more efficient +macro-op fusion of LUI/JALR and AUIPC/JALR pairs. +* Clarification of constraints on load-reserved/store-conditional +sequences. +* A new table of control and status register (CSR) mappings. +* Clarified purpose and behavior of high-order bits of `fcsr`. +* Corrected the description of the FNMADD._fmt_ and FNMSUB._fmt_ +instructions, which had suggested the incorrect sign of a zero result. 
+* Instructions FMV.S.X and FMV.X.S were renamed to FMV.W.X and FMV.X.W +respectively to be more consistent with their semantics, which did not +change. The old names will continue to be supported in the tools. +* Specified behavior of narrower (latexmath:[$<$]FLEN) floating-point +values held in wider `f` registers using NaN-boxing model. +* Defined the exception behavior of FMA(latexmath:[$\infty$], 0, qNaN). +* Added note indicating that the P extension might be reworked into an +integer packed-SIMD proposal for fixed-point operations using the +integer registers. +* A draft proposal of the V vector instruction-set extension. +* An early draft proposal of the N user-level traps extension. +* An expanded pseudoinstruction listing. +* Removal of the calling convention chapter, which has been superseded +by the RISC-V ELF psABI Specification cite:[riscv-elf-psabi]. +* The C extension has been frozen and renumbered version 2.0. + +_Preface to Document Version 2.1_ + +This is version 2.1 of the document describing the RISC-V user-level +architecture. Note the frozen user-level ISA base and extensions IMAFDQ +version 2.0 have not changed from the previous version of this +document cite:[riscvtr2], but some specification holes have been fixed and the +documentation has been improved. Some changes have been made to the +software conventions. + +* Numerous additions and improvements to the commentary sections. +* Separate version numbers for each chapter. +* Modification to long instruction encodings latexmath:[$>$]64 bits to +avoid moving the _rd_ specifier in very long instruction formats. +* CSR instructions are now described in the base integer format where +the counter registers are introduced, as opposed to only being +introduced later in the floating-point section (and the companion +privileged architecture manual). +* The SCALL and SBREAK instructions have been renamed to ECALL and +EBREAK, respectively. Their encoding and functionality are unchanged. 
+* Clarification of floating-point NaN handling, and a new canonical NaN +value. +* Clarification of values returned by floating-point to integer +conversions that overflow. +* Clarification of LR/SC allowed successes and required failures, +including use of compressed instructions in the sequence. +* A new RV32E base ISA proposal for reduced integer register counts, +supports MAC extensions. +* A revised calling convention. +* Relaxed stack alignment for soft-float calling convention, and +description of the RV32E calling convention. +* A revised proposal for the C compressed extension, version 1.9 . + +_Preface to Version 2.0_ + +This is the second release of the user ISA specification, and we intend +the specification of the base user ISA plus general extensions (i.e., +IMAFD) to remain fixed for future development. The following changes +have been made since Version 1.0 cite:[openriscarch] of this ISA specification. + +* The ISA has been divided into an integer base with several standard +extensions. +* The instruction formats have been rearranged to make immediate +encoding more efficient. +* The base ISA has been defined to have a little-endian memory system, +with big-endian or bi-endian as non-standard variants. +* Load-Reserved/Store-Conditional (LR/SC) instructions have been added +in the atomic instruction extension. +* AMOs and LR/SC can support the release consistency model. +* The FENCE instruction provides finer-grain memory and I/O orderings. +* An AMO for fetch-and-XOR (AMOXOR) has been added, and the encoding for +AMOSWAP has been changed to make room. +* The AUIPC instruction, which adds a 20-bit upper immediate to the PC, +replaces the RDNPC instruction, which only read the current PC value. +This results in significant savings for position-independent code. +* The JAL instruction has now moved to the U-Type format with an +explicit destination register, and the J instruction has been dropped +being replaced by JAL with _rd_=`x0`. 
This removes the only instruction +with an implicit destination register and removes the J-Type instruction +format from the base ISA. There is an accompanying reduction in JAL +reach, but a significant reduction in base ISA complexity. +* The static hints on the JALR instruction have been dropped. The hints +are redundant with the _rd_ and _rs1_ register specifiers for code +compliant with the standard calling convention. +* The JALR instruction now clears the lowest bit of the calculated +target address, to simplify hardware and to allow auxiliary information +to be stored in function pointers. +* The MFTX.S and MFTX.D instructions have been renamed to FMV.X.S and +FMV.X.D, respectively. Similarly, MXTF.S and MXTF.D instructions have +been renamed to FMV.S.X and FMV.D.X, respectively. +* The MFFSR and MTFSR instructions have been renamed to FRCSR and FSCSR, +respectively. FRRM, FSRM, FRFLAGS, and FSFLAGS instructions have been +added to individually access the rounding mode and exception flags +subfields of the `fcsr`. +* The FMV.X.S and FMV.X.D instructions now source their operands from +_rs1_, instead of _rs2_. This change simplifies datapath design. +* FCLASS.S and FCLASS.D floating-point classify instructions have been +added. +* A simpler NaN generation and propagation scheme has been adopted. +* For RV32I, the system performance counters have been extended to +64-bits wide, with separate read access to the upper and lower 32 bits. +* Canonical NOP and MV encodings have been defined. +* Standard instruction-length encodings have been defined for 48-bit, +64-bit, and latexmath:[$>$]64-bit instructions. +* Description of a 128-bit address space variant, RV128, has been added. +* Major opcodes in the 32-bit base instruction format have been +allocated for user-defined custom extensions. +* A typographical error that suggested that stores source their data +from _rd_ has been corrected to refer to _rs2_. 
+
+
diff --git a/src/counters-f.adoc b/src/counters-f.adoc
new file mode 100644
index 0000000..4678d78
--- /dev/null
+++ b/src/counters-f.adoc
@@ -0,0 +1,167 @@
+== Counters
+
+RISC-V ISAs provide a set of up to 32latexmath:[$\times$]64-bit
+performance counters and timers that are accessible via unprivileged
+XLEN read-only CSR registers `0xC00`–`0xC1F` (with the upper 32 bits
+accessed via CSR registers `0xC80`–`0xC9F` on RV32). The first three of
+these (CYCLE, TIME, and INSTRET) have dedicated functions (cycle count,
+real-time clock, and instructions-retired respectively), while the
+remaining counters, if implemented, provide programmable event counting.
+
+=== Base Counters and Timers
+
+//(encoding diagram omitted: RDCYCLE[H], RDTIME[H], and RDINSTRET[H] are CSRRS forms of the SYSTEM opcode, with csr=counter number, rs1=0, rd=dest; field widths 12|5|3|5|7)
+
+RV32I provides a number of 64-bit read-only user-level counters, which
+are mapped into the 12-bit CSR address space and accessed in 32-bit
+pieces using CSRRS instructions. In RV64I, the CSR instructions can
+manipulate 64-bit CSRs. In particular, the RDCYCLE, RDTIME, and
+RDINSTRET pseudoinstructions read the full 64 bits of the `cycle`,
+`time`, and `instret` counters. Hence, the RDCYCLEH, RDTIMEH, and
+RDINSTRETH instructions are RV32I-only.
+
+Some execution environments might prohibit access to counters to impede
+timing side-channel attacks.
+
+The RDCYCLE pseudoinstruction reads the low XLEN bits of the `cycle`
+CSR which holds a count of the number of clock cycles executed by the
+processor core on which the hart is running from an arbitrary start time
+in the past. RDCYCLEH is an RV32I-only instruction that reads bits 63–32
+of the same cycle counter. The underlying 64-bit counter should never
+overflow in practice. The rate at which the cycle counter advances will
+depend on the implementation and operating environment.
The execution +environment should provide a means to determine the current rate +(cycles/second) at which the cycle counter is incrementing. + +RDCYCLE is intended to return the number of cycles executed by the +processor core, not the hart. Precisely defining what is a ``core'' is +difficult given some implementation choices (e.g., AMD Bulldozer). +Precisely defining what is a ``clock cycle'' is also difficult given the +range of implementations (including software emulations), but the intent +is that RDCYCLE is used for performance monitoring along with the other +performance counters. In particular, where there is one hart/core, one +would expect cycle-count/instructions-retired to measure CPI for a hart. + +Cores don’t have to be exposed to software at all, and an implementor +might choose to pretend multiple harts on one physical core are running +on separate cores with one hart/core, and provide separate cycle +counters for each hart. This might make sense in a simple barrel +processor (e.g., CDC 6600 peripheral processors) where inter-hart timing +interactions are non-existent or minimal. + +Where there is more than one hart/core and dynamic multithreading, it is +not generally possible to separate out cycles per hart (especially with +SMT). It might be possible to define a separate performance counter that +tried to capture the number of cycles a particular hart was running, but +this definition would have to be very fuzzy to cover all the possible +threading implementations. For example, should we only count cycles for +which any instruction was issued to execution for this hart, and/or +cycles any instruction retired, or include cycles this hart was +occupying machine resources but couldn’t execute due to stalls while +other harts went into execution? Likely, ``all of the above'' would be +needed to have understandable performance stats. 
This complexity of
+defining a per-hart cycle count, and also the need in any case for a
+total per-core cycle count when tuning multithreaded code led to just
+standardizing the per-core cycle counter, which also happens to work
+well for the common single hart/core case.
+
+Standardizing what happens during ``sleep'' is not practical given that
+what ``sleep'' means is not standardized across execution environments,
+but if the entire core is paused (entirely clock-gated or powered-down
+in deep sleep), then it is not executing clock cycles, and the cycle
+count shouldn’t be increasing per the spec. There are many details,
+e.g., whether clock cycles required to reset a processor after waking up
+from a power-down event should be counted, and these are considered
+execution-environment-specific details.
+
+Even though there is no precise definition that works for all platforms,
+this is still a useful facility for most platforms, and an imprecise,
+common, ``usually correct'' standard here is better than no standard.
+The intent of RDCYCLE was primarily performance monitoring/tuning, and
+the specification was written with that goal in mind.
+
+The RDTIME pseudoinstruction reads the low XLEN bits of the `time` CSR,
+which counts wall-clock real time that has passed from an arbitrary
+start time in the past. RDTIMEH is an RV32I-only instruction that reads
+bits 63–32 of the same real-time counter. The underlying 64-bit counter
+should never overflow in practice. The execution environment should
+provide a means of determining the period of the real-time counter
+(seconds/tick). The period must be constant. The real-time clocks of all
+harts in a single user application should be synchronized to within one
+tick of the real-time clock. The environment should provide a means to
+determine the accuracy of the clock.
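To make the constant-period requirement concrete, the sketch below converts a pair of `time` CSR samples into elapsed seconds. This is our illustration, not part of the specification; the helper name and the 10 MHz timebase are hypothetical, and the only assumption taken from the text is that the environment reports a fixed seconds-per-tick period:

```python
# Hypothetical helper (ours, not from the spec): convert two 64-bit
# `time` CSR samples into elapsed seconds, given the constant tick
# period reported by the execution environment. The subtraction is
# performed modulo 2^64, the width of the underlying counter.
def elapsed_seconds(t_start: int, t_end: int, period_s: float) -> float:
    ticks = (t_end - t_start) & 0xFFFF_FFFF_FFFF_FFFF
    return ticks * period_s

# A hypothetical 10 MHz timebase has a period of 100 ns per tick, so
# 500,000 ticks correspond to roughly 0.05 s of wall-clock time.
dt = elapsed_seconds(1_000_000, 1_500_000, 100e-9)
```

Because the period must be constant, this conversion is valid regardless of any frequency scaling the core itself undergoes.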
+
+On some simple platforms, cycle count might represent a valid
+implementation of RDTIME, but in this case, platforms should implement
+the RDTIME instruction as an alias for RDCYCLE to make code more
+portable, rather than using RDCYCLE to measure wall-clock time.
+
+The RDINSTRET pseudoinstruction reads the low XLEN bits of the
+`instret` CSR, which counts the number of instructions retired by this
+hart from some arbitrary start point in the past. RDINSTRETH is an
+RV32I-only instruction that reads bits 63–32 of the same instruction
+counter. The underlying 64-bit counter should never overflow in
+practice.
+
+The following code sequence will read a valid 64-bit cycle counter value
+into `x3`:`x2`, even if the counter overflows its lower half between
+reading its upper and lower halves.
+
+....
+    again:
+    rdcycleh x3
+    rdcycle x2
+    rdcycleh x4
+    bne x3, x4, again
+....
+
+We recommend provision of these basic counters in implementations as
+they are essential for basic performance analysis, adaptive and dynamic
+optimization, and to allow an application to work with real-time
+streams. Additional counters should be provided to help diagnose
+performance problems and these should be made accessible from user-level
+application code with low overhead.
+
+We required the counters be 64 bits wide, even on RV32, as otherwise it
+is very difficult for software to determine if values have overflowed.
+For a low-end implementation, the upper 32 bits of each counter can be
+implemented using software counters incremented by a trap handler
+triggered by overflow of the lower 32 bits. The sample code described
+above shows how the full 64-bit width value can be safely read using the
+individual 32-bit instructions.
+
+In some applications, it is important to be able to read multiple
+counters at the same instant in time. When run under a multitasking
+environment, a user thread can suffer a context switch while attempting
+to read the counters.
One solution is for the user thread to read the
+real-time counter before and after reading the other counters to
+determine if a context switch occurred in the middle of the sequence, in
+which case the reads can be retried. We considered adding output latches
+to allow a user thread to snapshot the counter values atomically, but
+this would increase the size of the user context, especially for
+implementations with a richer set of counters.
+
+=== Hardware Performance Counters
+
+There is CSR space allocated for 29 additional unprivileged 64-bit
+hardware performance counters, `hpmcounter3`–`hpmcounter31`. For RV32,
+the upper 32 bits of these performance counters are accessible via
+additional CSRs `hpmcounter3h`–`hpmcounter31h`. These counters count
+platform-specific events and are configured via additional privileged
+registers. The number and width of these additional counters, and the
+set of events they count, is platform-specific.
+
+The privileged architecture manual describes the privileged CSRs
+controlling access to these counters and the selection of the events to
+be counted.
+
+It would be useful to eventually standardize event settings to count
+ISA-level metrics, such as the number of floating-point instructions
+executed for example, and possibly a few common microarchitectural
+metrics, such as ``L1 instruction cache misses''.
diff --git a/src/counters.adoc b/src/counters.adoc
new file mode 100644
index 0000000..670ed6e
--- /dev/null
+++ b/src/counters.adoc
@@ -0,0 +1,189 @@
+[[perf-counters]]
+== Counters
+
+RISC-V ISAs provide a set of up to 32latexmath:[$\times$]64-bit
+performance counters and timers that are accessible via unprivileged
+XLEN read-only CSR registers `0xC00`–`0xC1F` (with the upper 32 bits
+accessed via CSR registers `0xC80`–`0xC9F` on RV32).
The first three of
+these (CYCLE, TIME, and INSTRET) have dedicated functions (cycle count,
+real-time clock, and instructions retired, respectively), while the
+remaining counters, if implemented, provide programmable event counting.
+
+=== Base Counters and Timers
+
+include::images/wavedrom/counters-diag.adoc[]
+[[counter]]
+.Base counters and timers
+image::image_placeholder.png[]
+(((counters, read-only)))
+(((counters, user-level)))
+
+RV32I provides a number of 64-bit read-only user-level counters, which
+are mapped into the 12-bit CSR address space and accessed in 32-bit
+pieces using CSRRS instructions. In RV64I, the CSR instructions can
+manipulate 64-bit CSRs. In particular, the RDCYCLE, RDTIME, and
+RDINSTRET pseudoinstructions read the full 64 bits of the `cycle`,
+`time`, and `instret` counters. Hence, the RDCYCLEH, RDTIMEH, and
+RDINSTRETH instructions are RV32I-only.
+
+[NOTE]
+====
+Some execution environments might prohibit access to counters to impede
+timing side-channel attacks.
+====
+(((counters, pseudoinstruction)))
+
+The RDCYCLE pseudoinstruction reads the low XLEN bits of the `cycle`
+CSR, which holds a count of the number of clock cycles executed by the
+processor core on which the hart is running from an arbitrary start time
+in the past. RDCYCLEH is an RV32I-only instruction that reads bits 63–32
+of the same cycle counter. The underlying 64-bit counter should never
+overflow in practice. The rate at which the cycle counter advances will
+depend on the implementation and operating environment. The execution
+environment should provide a means to determine the current rate
+(cycles/second) at which the cycle counter is incrementing.
+
+[TIP]
+====
+RDCYCLE is intended to return the number of cycles executed by the
+processor core, not the hart. Precisely defining what is a `core` is
+difficult given some implementation choices (e.g., AMD Bulldozer).
+Precisely defining what is a `clock cycle` is also difficult given the
+range of implementations (including software emulations), but the intent
+is that RDCYCLE is used for performance monitoring along with the other
+performance counters. In particular, where there is one hart/core, one
+would expect cycle-count/instructions-retired to measure CPI for a hart.
+
+Cores don’t have to be exposed to software at all, and an implementor
+might choose to pretend multiple harts on one physical core are running
+on separate cores with one hart/core, and provide separate cycle
+counters for each hart. This might make sense in a simple barrel
+processor (e.g., CDC 6600 peripheral processors) where inter-hart timing
+interactions are non-existent or minimal.
+(((counters, handling multithreading)))
+
+Where there is more than one hart/core and dynamic multithreading, it is
+not generally possible to separate out cycles per hart (especially with
+SMT). It might be possible to define a separate performance counter that
+tried to capture the number of cycles a particular hart was running, but
+this definition would have to be very fuzzy to cover all the possible
+threading implementations. For example, should we only count cycles for
+which any instruction was issued to execution for this hart, and/or
+cycles any instruction retired, or include cycles this hart was
+occupying machine resources but couldn’t execute due to stalls while
+other harts went into execution? Likely, `all of the above` would be
+needed to have understandable performance stats. This complexity of
+defining a per-hart cycle count, and also the need in any case for a
+total per-core cycle count when tuning multithreaded code, led to just
+standardizing the per-core cycle counter, which also happens to work
+well for the common single hart/core case.
+(((counters, handling sleep cycles)))
+
+Standardizing what happens during `sleep` is not practical given that
+what `sleep` means is not standardized across execution environments,
+but if the entire core is paused (entirely clock-gated or powered-down
+in deep sleep), then it is not executing clock cycles, and the cycle
+count shouldn’t be increasing per the spec. There are many details,
+e.g., whether clock cycles required to reset a processor after waking up
+from a power-down event should be counted, and these are considered
+execution-environment-specific details.
+
+Even though there is no precise definition that works for all platforms,
+this is still a useful facility for most platforms, and an imprecise,
+common, `usually correct` standard here is better than no standard.
+The intent of RDCYCLE was primarily performance monitoring/tuning, and
+the specification was written with that goal in mind.
+====
+
+The RDTIME pseudoinstruction reads the low XLEN bits of the `time` CSR,
+which counts wall-clock real time that has passed from an arbitrary
+start time in the past. RDTIMEH is an RV32I-only instruction that reads
+bits 63–32 of the same real-time counter. The underlying 64-bit counter
+should never overflow in practice. The execution environment should
+provide a means of determining the period of the real-time counter
+(seconds/tick). The period must be constant. The real-time clocks of all
+harts in a single user application should be synchronized to within one
+tick of the real-time clock. The environment should provide a means to
+determine the accuracy of the clock.
+
+[NOTE]
+====
+On some simple platforms, cycle count might represent a valid
+implementation of RDTIME, but in this case, platforms should implement
+the RDTIME instruction as an alias for RDCYCLE to make code more
+portable, rather than using RDCYCLE to measure wall-clock time.
+====
+(((counters, pseudoinstructions)))
+
+The RDINSTRET pseudoinstruction reads the low XLEN bits of the
+`instret` CSR, which counts the number of instructions retired by this
+hart from some arbitrary start point in the past. RDINSTRETH is an
+RV32I-only instruction that reads bits 63–32 of the same instruction
+counter. The underlying 64-bit counter should never overflow in
+practice.
+
+The following code sequence will read a valid 64-bit cycle counter value
+into `x3`:`x2`, even if the counter overflows its lower half between
+reading its upper and lower halves.
+
+.Sample code for reading the 64-bit cycle counter in RV32.
+....
+    again:
+        rdcycleh x3
+        rdcycle  x2
+        rdcycleh x4
+        bne      x3, x4, again
+....
+
+[TIP]
+====
+We recommend provision of these basic counters in implementations as
+they are essential for basic performance analysis, adaptive and dynamic
+optimization, and to allow an application to work with real-time
+streams. Additional counters should be provided to help diagnose
+performance problems, and these should be made accessible from user-level
+application code with low overhead.
+
+We required the counters be 64 bits wide, even on RV32, as otherwise it
+is very difficult for software to determine if values have overflowed.
+For a low-end implementation, the upper 32 bits of each counter can be
+implemented using software counters incremented by a trap handler
+triggered by overflow of the lower 32 bits. The sample code described
+above shows how the full 64-bit value can be safely read using the
+individual 32-bit instructions.
+
+In some applications, it is important to be able to read multiple
+counters at the same instant in time. When run under a multitasking
+environment, a user thread can suffer a context switch while attempting
+to read the counters.
One solution is for the user thread to read the
+real-time counter before and after reading the other counters to
+determine if a context switch occurred in the middle of the sequence, in
+which case the reads can be retried. We considered adding output latches
+to allow a user thread to snapshot the counter values atomically, but
+this would increase the size of the user context, especially for
+implementations with a richer set of counters.
+====
+
+=== Hardware Performance Counters
+(((counters, performance)))
+
+There is CSR space allocated for 29 additional unprivileged 64-bit
+hardware performance counters, `hpmcounter3`–`hpmcounter31`. For RV32,
+the upper 32 bits of these performance counters are accessible via
+additional CSRs `hpmcounter3h`–`hpmcounter31h`. These counters count
+platform-specific events and are configured via additional privileged
+registers. The number and width of these additional counters, and the
+set of events they count, are platform-specific.
+
+[NOTE]
+====
+The privileged architecture manual describes the privileged CSRs
+controlling access to these counters and for setting the events to be
+counted.
+
+It would be useful to eventually standardize event settings to count
+ISA-level metrics, such as the number of floating-point instructions
+executed, and possibly a few common microarchitectural metrics, such as
+`L1 instruction cache misses`.
+====
+
diff --git a/src/d-st-ext.adoc b/src/d-st-ext.adoc
new file mode 100644
index 0000000..b2ea945
--- /dev/null
+++ b/src/d-st-ext.adoc
@@ -0,0 +1,255 @@
+== `D` Standard Extension for Double-Precision Floating-Point, Version 2.2
+
+This chapter describes the standard double-precision floating-point
+instruction-set extension, which is named `D` and adds
+double-precision floating-point computational instructions compliant
+with the IEEE 754-2008 arithmetic standard. The D extension depends on
+the base single-precision instruction subset F.
+((double-precision floating point))
+
+=== D Register State
+
+The D extension widens the 32 floating-point registers, `f0`–`f31`, to
+64 bits (FLEN=64 in <>). The `f` registers can
+now hold either 32-bit or 64-bit floating-point values as described
+below in <>.
+
+FLEN can be 32, 64, or 128 depending on which of the F, D, and Q
+extensions are supported. There can be up to four different
+floating-point precisions supported, including H, F, D, and Q.
+(((floating-point, supported precisions)))
+
+[[nanboxing]]
+=== NaN Boxing of Narrower Values
+
+When multiple floating-point precisions are supported, then valid values
+of narrower latexmath:[$n$]-bit types, latexmath:[$n<$]FLEN, are
+represented in the lower latexmath:[$n$] bits of an FLEN-bit NaN value,
+in a process termed NaN-boxing. The upper bits of a valid NaN-boxed
+value must be all 1s. Valid NaN-boxed latexmath:[$n$]-bit values
+therefore appear as negative quiet NaNs (qNaNs) when viewed as any wider
+latexmath:[$m$]-bit value, latexmath:[$n < m \leq$]FLEN. Any operation
+that writes a narrower result to an `f` register must write all 1s to
+the uppermost FLENlatexmath:[$-n$] bits to yield a legal NaN-boxed
+value.
+(((floating-point, requirements)))
+
+
+Software might not know the current type of data stored in a
+floating-point register but has to be able to save and restore the
+register values, hence the result of using wider operations to transfer
+narrower values has to be defined. A common case is for callee-saved
+registers, but a standard convention is also desirable for features
+including varargs, user-level threading libraries, virtual machine
+migration, and debugging.
+
+Floating-point latexmath:[$n$]-bit transfer operations move external
+values held in IEEE standard formats into and out of the `f` registers,
+and comprise floating-point loads and stores
+(FLlatexmath:[$n$]/FSlatexmath:[$n$]) and floating-point move
+instructions (FMV.latexmath:[$n$].X/FMV.X.latexmath:[$n$]).
A narrower
+latexmath:[$n$]-bit transfer, latexmath:[$n<$]FLEN, into the `f`
+registers will create a valid NaN-boxed value. A narrower
+latexmath:[$n$]-bit transfer out of the floating-point registers will
+transfer the lower latexmath:[$n$] bits of the register ignoring the
+upper FLENlatexmath:[$-n$] bits.
+
+Apart from transfer operations described in the previous paragraph, all
+other floating-point operations on narrower latexmath:[$n$]-bit
+operands, latexmath:[$n<$]FLEN, check if the input operands are
+correctly NaN-boxed, i.e., all upper FLENlatexmath:[$-n$] bits are 1. If
+so, the latexmath:[$n$] least-significant bits of the input are used as
+the input value, otherwise the input value is treated as an
+latexmath:[$n$]-bit canonical NaN.
+
+Earlier versions of this document did not define the behavior of feeding
+the results of narrower or wider operands into an operation, except to
+require that wider saves and restores would preserve the value of a
+narrower operand. The new definition removes this
+implementation-specific behavior, while still accommodating both
+non-recoded and recoded implementations of the floating-point unit. The
+new definition also helps catch software errors by propagating NaNs if
+values are used incorrectly.
+
+Non-recoded implementations unpack and pack the operands to IEEE
+standard format on the input and output of every floating-point
+operation. The NaN-boxing cost to a non-recoded implementation is
+primarily in checking if the upper bits of a narrower operand
+represent a legal NaN-boxed value, and in writing all 1s to the upper
+bits of a result.
+
+Recoded implementations use a more convenient internal format to
+represent floating-point values, with an added exponent bit to allow all
+values to be held normalized.
The cost to the recoded implementation is
+primarily the extra tagging needed to track the internal types and sign
+bits, but this can be done without adding new state bits by recoding
+NaNs internally in the exponent field. Small modifications are needed to
+the pipelines used to transfer values in and out of the recoded format,
+but the datapath and latency costs are minimal. The recoding process has
+to handle shifting of input subnormal values for wide operands in any
+case, and extracting the NaN-boxed value is a similar process to
+normalization except for skipping over leading-1 bits instead of
+skipping over leading-0 bits, allowing the datapath muxing to be shared.
+
+[[fld_fsd]]
+=== Double-Precision Load and Store Instructions
+
+The FLD instruction loads a double-precision floating-point value from
+memory into floating-point register _rd_. FSD stores a double-precision
+value from the floating-point registers to memory.
+
+The double-precision value may be a NaN-boxed single-precision value.
+
+[[fprs-d]]
+[cols="^3,^1,^1,^1,^2",options="header"]
+|===
+|imm[11:0] |rs1 |width |rd |opcode
+|12 |5 |3 |5 |7
+|offset[11:0] |base |D |dest |LOAD-FP
+|===
+
+[cols="^2,^1,^1,^1,^1,^2",options="header"]
+|===
+|imm[11:5] |rs2 |rs1 |width |imm[4:0] |opcode
+|7 |5 |5 |3 |5 |7
+|offset[11:5] |src |base |D |offset[4:0] |STORE-FP
+|===
+
+FLD and FSD are only guaranteed to execute atomically if the effective
+address is naturally aligned and XLENlatexmath:[$\geq$]64.
+
+FLD and FSD do not modify the bits being transferred; in particular, the
+payloads of non-canonical NaNs are preserved.
+
+=== Double-Precision Floating-Point Computational Instructions
+
+The double-precision floating-point computational instructions are
+defined analogously to their single-precision counterparts, but operate
+on double-precision operands and produce double-precision results.
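The NaN-boxing rules in this chapter lend themselves to a small executable illustration. The following Python sketch (illustrative only; `nan_box` and `is_valid_nan_box` are hypothetical helper names, not part of the specification) boxes a single-precision value into a 64-bit register image and checks the properties stated above: the upper 32 bits are all 1s, and the boxed value reads as a negative quiet NaN when viewed as a double.

```python
import struct

FLEN = 64
N = 32  # width of the narrower (single-precision) value

def nan_box(bits32):
    """NaN-box a 32-bit value: set the upper FLEN-n bits to all 1s."""
    return (((1 << (FLEN - N)) - 1) << N) | bits32

def is_valid_nan_box(bits64):
    """A narrower operand is valid only if the upper bits are all 1s."""
    return bits64 >> N == (1 << (FLEN - N)) - 1

# Bit pattern of single-precision 1.0 (0x3F800000).
one_single = struct.unpack('<I', struct.pack('<f', 1.0))[0]
boxed = nan_box(one_single)

# Viewed as a 64-bit double, the boxed value is a negative quiet NaN.
as_double = struct.unpack('<d', struct.pack('<Q', boxed))[0]
assert as_double != as_double            # NaN compares unequal to itself
assert boxed >> 63 == 1                  # sign bit set: a "negative" qNaN
assert is_valid_nan_box(boxed)
assert not is_valid_nan_box(one_single)  # upper bits are 0s, not 1s
```

A narrower operand whose upper bits are not all 1s would, per the text above, be treated as the canonical NaN rather than as its low-order bits.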
+
+[cols="^2,^1,^1,^1,^1,^1,^2",options="header"]
+|===
+|funct5 |fmt |rs2 |rs1 |rm |rd |opcode
+|5 |2 |5 |5 |3 |5 |7
+|FADD/FSUB |D |src2 |src1 |RM |dest |OP-FP
+|FMUL/FDIV |D |src2 |src1 |RM |dest |OP-FP
+|FMIN-MAX |D |src2 |src1 |MIN/MAX |dest |OP-FP
+|FSQRT |D |0 |src |RM |dest |OP-FP
+|===
+
+[cols="^2,^1,^1,^1,^1,^1,^2",options="header"]
+|===
+|rs3 |fmt |rs2 |rs1 |rm |rd |opcode
+|5 |2 |5 |5 |3 |5 |7
+|src3 |D |src2 |src1 |RM |dest |F[N]MADD/F[N]MSUB
+|===
+
+=== Double-Precision Floating-Point Conversion and Move Instructions
+
+Floating-point-to-integer and integer-to-floating-point conversion
+instructions are encoded in the OP-FP major opcode space. FCVT.W.D or
+FCVT.L.D converts a double-precision floating-point number in
+floating-point register _rs1_ to a signed 32-bit or 64-bit integer,
+respectively, in integer register _rd_. FCVT.D.W or FCVT.D.L converts a
+32-bit or 64-bit signed integer, respectively, in integer register _rs1_
+into a double-precision floating-point number in floating-point register
+_rd_. FCVT.WU.D, FCVT.LU.D, FCVT.D.WU, and FCVT.D.LU variants convert to
+or from unsigned integer values. For RV64, FCVT.W[U].D sign-extends the
+32-bit result. FCVT.L[U].D and FCVT.D.L[U] are RV64-only instructions.
+The range of valid inputs for FCVT._int_.D and the behavior for invalid
+inputs are the same as for FCVT._int_.S.
+
+All floating-point to integer and integer to floating-point conversion
+instructions round according to the _rm_ field. Note FCVT.D.W[U] always
+produces an exact result and is unaffected by rounding mode.
+
+[cols="^2,^1,^1,^1,^1,^1,^2",options="header"]
+|===
+|funct5 |fmt |rs2 |rs1 |rm |rd |opcode
+|5 |2 |5 |5 |3 |5 |7
+|FCVT._int_.D |D |W[U]/L[U] |src |RM |dest |OP-FP
+|FCVT.D._int_ |D |W[U]/L[U] |src |RM |dest |OP-FP
+|===
+
+The double-precision to single-precision and single-precision to
+double-precision conversion instructions, FCVT.S.D and FCVT.D.S, are
+encoded in the OP-FP major opcode space and both the source and
+destination are floating-point registers.
The _rs2_ field encodes the
+datatype of the source, and the _fmt_ field encodes the datatype of the
+destination. FCVT.S.D rounds according to the RM field; FCVT.D.S will
+never round.
+
+[cols="^2,^1,^1,^1,^1,^1,^2",options="header"]
+|===
+|funct5 |fmt |rs2 |rs1 |rm |rd |opcode
+|5 |2 |5 |5 |3 |5 |7
+|FCVT.S.D |S |D |src |RM |dest |OP-FP
+|FCVT.D.S |D |S |src |RM |dest |OP-FP
+|===
+
+Floating-point to floating-point sign-injection instructions, FSGNJ.D,
+FSGNJN.D, and FSGNJX.D are defined analogously to the single-precision
+sign-injection instruction.
+
+[cols="^2,^1,^1,^1,^1,^1,^2",options="header"]
+|===
+|funct5 |fmt |rs2 |rs1 |rm |rd |opcode
+|5 |2 |5 |5 |3 |5 |7
+|FSGNJ |D |src2 |src1 |J[N]/JX |dest |OP-FP
+|===
+
+For XLENlatexmath:[$\geq$]64 only, instructions are provided to move bit
+patterns between the floating-point and integer registers. FMV.X.D moves
+the double-precision value in floating-point register _rs1_ to a
+representation in IEEE 754-2008 standard encoding in integer register
+_rd_. FMV.D.X moves the double-precision value encoded in IEEE 754-2008
+standard encoding from the integer register _rs1_ to the floating-point
+register _rd_.
+
+FMV.X.D and FMV.D.X do not modify the bits being transferred; in
+particular, the payloads of non-canonical NaNs are preserved.
+
+[cols="^2,^1,^1,^1,^1,^1,^2",options="header"]
+|===
+|funct5 |fmt |rs2 |rs1 |rm |rd |opcode
+|5 |2 |5 |5 |3 |5 |7
+|FMV.X.D |D |0 |src |000 |dest |OP-FP
+|FMV.D.X |D |0 |src |000 |dest |OP-FP
+|===
+
+Early versions of the RISC-V ISA had additional instructions to allow
+RV32 systems to transfer between the upper and lower portions of a
+64-bit floating-point register and an integer register. However, these
+would be the only instructions with partial register writes and would
+add complexity in implementations with recoded floating-point or
+register renaming, requiring a pipeline read-modify-write sequence.
+Scaling up to handling quad-precision for RV32 and RV64 would also
+require additional instructions if they were to follow this pattern.
The
+ISA was defined to reduce the number of explicit int-float register
+moves, by having conversions and comparisons write results to the
+appropriate register file, so we expect the benefit of these
+instructions to be lower than for other ISAs.
+
+We note that for systems that implement a 64-bit floating-point unit
+including fused multiply-add support and 64-bit floating-point loads and
+stores, the marginal hardware cost of moving from a 32-bit to a 64-bit
+integer datapath is low, and a software ABI supporting 32-bit wide
+address-space and pointers can be used to avoid growth of static data
+and dynamic memory traffic.
+
+=== Double-Precision Floating-Point Compare Instructions
+
+The double-precision floating-point compare instructions are defined
+analogously to their single-precision counterparts, but operate on
+double-precision operands.
+
+[cols="^2,^1,^1,^1,^1,^1,^2",options="header"]
+|===
+|funct5 |fmt |rs2 |rs1 |rm |rd |opcode
+|5 |2 |5 |5 |3 |5 |7
+|FCMP |D |src2 |src1 |EQ/LT/LE |dest |OP-FP
+|===
+
+=== Double-Precision Floating-Point Classify Instruction
+
+The double-precision floating-point classify instruction, FCLASS.D, is
+defined analogously to its single-precision counterpart, but operates on
+double-precision operands.
+
+[cols="^2,^1,^1,^1,^1,^1,^2",options="header"]
+|===
+|funct5 |fmt |rs2 |rs1 |rm |rd |opcode
+|5 |2 |5 |5 |3 |5 |7
+|FCLASS |D |0 |src |001 |dest |OP-FP
+|===
+
diff --git a/src/extending.adoc b/src/extending.adoc
new file mode 100644
index 0000000..5ddef11
--- /dev/null
+++ b/src/extending.adoc
@@ -0,0 +1,364 @@
+[[extending]]
+== Extending RISC-V
+
+In addition to supporting standard general-purpose software development,
+another goal of RISC-V is to provide a basis for more specialized
+instruction-set extensions or more customized accelerators. The
+instruction encoding spaces and optional variable-length instruction
+encoding are designed to make it easier to leverage software development
+effort for the standard ISA toolchain when building more customized
+processors.
For example, the intent is to continue to provide full
+software support for implementations that only use the standard I base,
+perhaps together with many non-standard instruction-set extensions.
+
+This chapter describes various ways in which the base RISC-V ISA can be
+extended, together with the scheme for managing instruction-set
+extensions developed by independent groups. This volume only deals with
+the unprivileged ISA, although the same approach and terminology is used
+for supervisor-level extensions described in the second volume.
+
+=== Extension Terminology
+
+This section defines some standard terminology for describing RISC-V
+extensions.
+
+==== Standard versus Non-Standard Extension
+
+Any RISC-V processor implementation must support a base integer ISA
+(RV32I, RV32E, RV64I, or RV128I). In addition, an implementation may
+support one or more extensions. We divide extensions into two broad
+categories: _standard_ versus _non-standard_.
+
+* A standard extension is one that is generally useful and that is
+designed to not conflict with any other standard extension. Currently,
+`MAFDQLCBTPV`, described in other chapters of this manual, are either
+complete or planned standard extensions.
+* A non-standard extension may be highly specialized and may conflict
+with other standard or non-standard extensions. We anticipate a wide
+variety of non-standard extensions will be developed over time, with
+some eventually being promoted to standard extensions.
+
+==== Instruction Encoding Spaces and Prefixes
+
+An instruction encoding space is some number of instruction bits within
+which a base ISA or ISA extension is encoded. RISC-V supports varying
+instruction lengths, but even within a single instruction length, there
+are various sizes of encoding space available. For example, the base
+ISAs are defined within a 30-bit encoding space (bits 31–2 of the 32-bit
+instruction), while the atomic extension `A` fits within a 25-bit
+encoding space (bits 31–7).
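These encoding spaces can be picked out of an instruction word by simple masking. The Python sketch below is our illustration, not normative text: the helper names and the example word are made up, but the two-bit `11` marker for 32-bit instructions (with bits 4:2 not all ones, per the standard variable-length scheme) and the AMO major opcode value `0101111` follow the standard encoding.

```python
def is_32bit_format(word):
    # 32-bit instructions carry the two-bit prefix 0b11 in bits 1:0,
    # and bits 4:2 must not all be ones (that pattern selects longer
    # instruction formats in the standard length encoding).
    return word & 0b11 == 0b11 and (word >> 2) & 0b111 != 0b111

def major_opcode(word):
    # The seven-bit field in bits 6:0 selects a 25-bit encoding space
    # (a major opcode) such as AMO for the A extension.
    return word & 0b111_1111

AMO = 0b0101111
word = 0xDEADB000 | AMO   # arbitrary upper 25 bits, for illustration only
assert is_32bit_format(word)
assert major_opcode(word) == AMO
```

Under this reading, the base ISA's 30-bit encoding space is everything above the 2-bit prefix, and the A extension's 25-bit space is everything above the 7-bit AMO major opcode.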
+ +We use the term _prefix_ to refer to the bits to the _right_ of an +instruction encoding space (since instruction fetch in RISC-V is +little-endian, the bits to the right are stored at earlier memory +addresses, hence form a prefix in instruction-fetch order). The prefix +for the standard base ISA encoding is the two-bit `11` field held in +bits 1–0 of the 32-bit word, while the prefix for the standard atomic +extension `A` is the seven-bit `0101111` field held in bits 6–0 of +the 32-bit word representing the AMO major opcode. A quirk of the +encoding format is that the 3-bit funct3 field used to encode a minor +opcode is not contiguous with the major opcode bits in the 32-bit +instruction format, but is considered part of the prefix for 22-bit +instruction spaces. + +Although an instruction encoding space could be of any size, adopting a +smaller set of common sizes simplifies packing independently developed +extensions into a single global encoding. +<> gives the suggested sizes for RISC-V. + +[[encodingspaces]] +.Suggested standard RISC-V instruction encoding space sizes. 
+[cols="^,<,>,>,>,>",options="header"]
+|===
+|Size |Usage |# in 16-bit |# in 32-bit |# in 48-bit |# in 64-bit
+
+|14-bit |Quadrant of compressed 16-bit encoding |3 | | |
+
+|22-bit |Minor opcode in base 32-bit encoding | |latexmath:[$2^{8}$]
+|latexmath:[$2^{20}$] |latexmath:[$2^{35}$]
+
+|25-bit |Major opcode in base 32-bit encoding | |32
+|latexmath:[$2^{17}$] |latexmath:[$2^{32}$]
+
+|30-bit |Quadrant of base 32-bit encoding | |1 |latexmath:[$2^{12}$]
+|latexmath:[$2^{27}$]
+
+|32-bit |Minor opcode in 48-bit encoding | | |latexmath:[$2^{10}$]
+|latexmath:[$2^{25}$]
+
+|37-bit |Major opcode in 48-bit encoding | | |32 |latexmath:[$2^{20}$]
+
+|40-bit |Quadrant of 48-bit encoding | | |4 |latexmath:[$2^{17}$]
+
+|45-bit |Sub-minor opcode in 64-bit encoding | | | |latexmath:[$2^{12}$]
+
+|48-bit |Minor opcode in 64-bit encoding | | | |latexmath:[$2^{9}$]
+
+|52-bit |Major opcode in 64-bit encoding | | | |32
+|===
+
+==== Greenfield versus Brownfield Extensions
+
+We use the term _greenfield extension_ to describe an extension that
+begins populating a new instruction encoding space, and hence can only
+cause encoding conflicts at the prefix level. We use the term
+_brownfield extension_ to describe an extension that fits around
+existing encodings in a previously defined instruction space. A
+brownfield extension is necessarily tied to a particular greenfield
+parent encoding, and there may be multiple brownfield extensions to the
+same greenfield parent encoding. For example, the base ISAs are
+greenfield encodings of a 30-bit instruction space, while the FDQ
+floating-point extensions are all brownfield extensions adding to the
+parent base ISA 30-bit encoding space.
+ +Note that we consider the standard A extension to have a greenfield +encoding as it defines a new previously empty 25-bit encoding space in +the leftmost bits of the full 32-bit base instruction encoding, even +though its standard prefix locates it within the 30-bit encoding space +of its parent base ISA. Changing only its single 7-bit prefix could move +the A extension to a different 30-bit encoding space while only worrying +about conflicts at the prefix level, not within the encoding space +itself. + +[[exttax]] +.Two-dimensional characterization of standard instruction-set +extensions. +[cols=">,^,^",options="header",] +|=== +| |Adds state |No new state +|Greenfield |RV32I(30), RV64I(30) |A(25) +|Brownfield |F(I), D(F), Q(D) |M(I) +|=== + +<> shows the bases and standard extensions placed +in a simple two-dimensional taxonomy. One axis is whether the extension +is greenfield or brownfield, while the other axis is whether the +extension adds architectural state. For greenfield extensions, the size +of the instruction encoding space is given in parentheses. For +brownfield extensions, the name of the extension (greenfield or +brownfield) it builds upon is given in parentheses. Additional +user-level architectural state usually implies changes to the +supervisor-level system or possibly to the standard calling convention. + +Note that RV64I is not considered an extension of RV32I, but a different +complete base encoding. + +==== Standard-Compatible Global Encodings + +A complete or _global_ encoding of an ISA for an actual RISC-V +implementation must allocate a unique non-conflicting prefix for every +included instruction encoding space. The bases and every standard +extension have each had a standard prefix allocated to ensure they can +all coexist in a global encoding. + +A _standard-compatible_ global encoding is one where the base and every +included standard extension have their standard prefixes. 
A +standard-compatible global encoding can include non-standard extensions +that do not conflict with the included standard extensions. A +standard-compatible global encoding can also use standard prefixes for +non-standard extensions if the associated standard extensions are not +included in the global encoding. In other words, a standard extension +must use its standard prefix if included in a standard-compatible global +encoding, but otherwise its prefix is free to be reallocated. These +constraints allow a common toolchain to target the standard subset of +any RISC-V standard-compatible global encoding. + +==== Guaranteed Non-Standard Encoding Space + +To support development of proprietary custom extensions, portions of the +encoding space are guaranteed to never be used by standard extensions. + +=== RISC-V Extension Design Philosophy + +We intend to support a large number of independently developed +extensions by encouraging extension developers to operate within +instruction encoding spaces, and by providing tools to pack these into a +standard-compatible global encoding by allocating unique prefixes. Some +extensions are more naturally implemented as brownfield augmentations of +existing extensions, and will share whatever prefix is allocated to +their parent greenfield extension. The standard extension prefixes avoid +spurious incompatibilities in the encoding of core functionality, while +allowing custom packing of more esoteric extensions. + +This capability of repacking RISC-V extensions into different +standard-compatible global encodings can be used in a number of ways. + +One use-case is developing highly specialized custom accelerators, +designed to run kernels from important application domains. These might +want to drop all but the base integer ISA and add in only the extensions +that are required for the task in hand. 
The base ISAs have been designed
+to place minimal requirements on a hardware implementation, and have
+been encoded to use only a small fraction of a 32-bit instruction
+encoding space.
+
+Another use-case is to build a research prototype for a new type of
+instruction-set extension. The researchers might not want to expend the
+effort to implement a variable-length instruction-fetch unit, and so
+would like to prototype their extension using a simple 32-bit
+fixed-width instruction encoding. However, this new extension might be
+too large to coexist with standard extensions in the 32-bit space. If
+the research experiments do not need all of the standard extensions, a
+standard-compatible global encoding might drop the unused standard
+extensions and reuse their prefixes to place the proposed extension in a
+non-standard location to simplify engineering of the research prototype.
+Standard tools will still be able to target the base and any standard
+extensions that are present to reduce development time. Once the
+instruction-set extension has been evaluated and refined, it could then
+be made available for packing into a larger variable-length encoding
+space to avoid conflicts with all standard extensions.
+
+The following sections describe increasingly sophisticated strategies
+for developing implementations with new instruction-set extensions.
+These are mostly intended for use in highly customized, educational, or
+experimental architectures rather than for the main line of RISC-V ISA
+development.
+
+[[fix32b]]
+=== Extensions within fixed-width 32-bit instruction format
+
+In this section, we discuss adding extensions to implementations that
+only support the base fixed-width 32-bit instruction format.
+
+We anticipate the simplest fixed-width 32-bit encoding will be popular
+for many restricted accelerators and research prototypes.
+
+==== Available 30-bit instruction encoding spaces
+
+In the standard encoding, three of the available 30-bit instruction
+encoding spaces (those with 2-bit prefixes 00, 01, and 10) are used to
+enable the optional compressed instruction extension. However, if the
+compressed instruction-set extension is not required, then these three
+further 30-bit encoding spaces become available. This quadruples the
+available encoding space within the 32-bit format.
+
+==== Available 25-bit instruction encoding spaces
+
+A 25-bit instruction encoding space corresponds to a major opcode in the
+base and standard extension encodings.
+
+There are four major opcodes expressly designated for custom extensions
+<>, each of which represents a 25-bit
+encoding space. Two of these are reserved for eventual use in the RV128
+base encoding (these will eventually be OP-IMM-64 and OP-64), but can be
+used for non-standard extensions for RV32 and RV64.
+
+The two major opcodes reserved for RV64 (OP-IMM-32 and OP-32) can also
+be used for non-standard extensions to RV32 only.
+
+If an implementation does not require floating-point, then the seven
+major opcodes reserved for standard floating-point extensions (LOAD-FP,
+STORE-FP, MADD, MSUB, NMSUB, NMADD, OP-FP) can be reused for
+non-standard extensions. Similarly, the AMO major opcode can be reused
+if the standard atomic extensions are not required.
+
+If an implementation does not require instructions longer than 32 bits,
+then an additional four major opcodes are available (those marked in
+gray in <>).
+
+The base RV32I encoding uses only 11 major opcodes plus 3 reserved
+opcodes, leaving up to 18 available for extensions. The base RV64I
+encoding uses only 13 major opcodes plus 3 reserved opcodes, leaving up
+to 16 available for extensions.
+
+==== Available 22-bit instruction encoding spaces
+
+A 22-bit encoding space corresponds to a funct3 minor opcode space in
+the base and standard extension encodings.
Several major opcodes have a +funct3 field minor opcode that is not completely occupied, leaving +available several 22-bit encoding spaces. + +Usually a major opcode selects the format used to encode operands in the +remaining bits of the instruction, and ideally, an extension should +follow the operand format of the major opcode to simplify hardware +decoding. + +==== Other spaces + +Smaller spaces are available under certain major opcodes, and not all +minor opcodes are entirely filled. + +=== Adding aligned 64-bit instruction extensions + +The simplest approach to provide space for extensions that are too large +for the base 32-bit fixed-width instruction format is to add naturally +aligned 64-bit instructions. The implementation must still support the +32-bit base instruction format, but can require that 64-bit instructions +are aligned on 64-bit boundaries to simplify instruction fetch, with a +32-bit NOP instruction used as alignment padding where necessary. + +To simplify use of standard tools, the 64-bit instructions should be +encoded as described in <>. +However, an implementation might choose a non-standard +instruction-length encoding for 64-bit instructions, while retaining the +standard encoding for 32-bit instructions. For example, if compressed +instructions are not required, then a 64-bit instruction could be +encoded using one or more zero bits in the first two bits of an +instruction. + +We anticipate processor generators that produce instruction-fetch units +capable of automatically handling any combination of supported +variable-length instruction encodings. + +=== Supporting VLIW encodings + +Although RISC-V was not designed as a base for a pure VLIW machine, VLIW +encodings can be added as extensions using several alternative +approaches. In all cases, the base 32-bit encoding has to be supported +to allow use of any standard software tools. 
+ +==== Fixed-size instruction group + +The simplest approach is to define a single large naturally aligned +instruction format (e.g., 128 bits) within which VLIW operations are +encoded. In a conventional VLIW, this approach would tend to waste +instruction memory to hold NOPs, but a RISC-V-compatible implementation +would have to also support the base 32-bit instructions, confining the +VLIW code size expansion to VLIW-accelerated functions. + +==== Encoded-Length Groups + +Another approach is to use the standard length encoding from +<> to encode parallel +instruction groups, allowing NOPs to be compressed out of the VLIW +instruction. For example, a 64-bit instruction could hold two 28-bit +operations, while a 96-bit instruction could hold three 28-bit +operations, and so on. Alternatively, a 48-bit instruction could hold +one 42-bit operation, while a 96-bit instruction could hold two 42-bit +operations, and so on. + +This approach has the advantage of retaining the base ISA encoding for +instructions holding a single operation, but has the disadvantage of +requiring a new 28-bit or 42-bit encoding for operations within the VLIW +instructions, and misaligned instruction fetch for larger groups. One +simplification is to not allow VLIW instructions to straddle certain +microarchitecturally significant boundaries (e.g., cache lines or +virtual memory pages). + +==== Fixed-Size Instruction Bundles + +Another approach, similar to Itanium, is to use a larger naturally +aligned fixed instruction bundle size (e.g., 128 bits) across which +parallel operation groups are encoded. This simplifies instruction +fetch, but shifts the complexity to the group execution engine. To +remain RISC-V compatible, the base 32-bit instruction would still have +to be supported. + +==== End-of-Group bits in Prefix + +None of the above approaches retains the RISC-V encoding for the +individual operations within a VLIW instruction. 
Yet another approach is +to repurpose the two prefix bits in the fixed-width 32-bit encoding. One +prefix bit can be used to signal `end-of-group` if set, while the +second bit could indicate execution under a predicate if clear. Standard +RISC-V 32-bit instructions generated by tools unaware of the VLIW +extension would have both prefix bits set (11) and thus have the correct +semantics, with each instruction at the end of a group and not +predicated. + +The main disadvantage of this approach is that the base ISAs lack the +complex predication support usually required in an aggressive VLIW +system, and it is difficult to add space to specify more predicate +registers in the standard 30-bit encoding space. + diff --git a/src/f-st-ext.adoc b/src/f-st-ext.adoc new file mode 100644 index 0000000..90239be --- /dev/null +++ b/src/f-st-ext.adoc @@ -0,0 +1,541 @@ +[[single-float]] +== `F` Standard Extension for Single-Precision Floating-Point, Version 2.2 + +This chapter describes the standard instruction-set extension for +single-precision floating-point, which is named `F` and adds +single-precision floating-point computational instructions compliant +with the IEEE 754-2008 arithmetic standard cite:[ieee754-2008]. The F extension depends on +the `Zicsr` extension for control and status register access. + +=== F Register State + +The F extension adds 32 floating-point registers, `f0`–`f31`, each 32 +bits wide, and a floating-point control and status register `fcsr`, +which contains the operating mode and exception status of the +floating-point unit. This additional state is shown in +<>. We use the term FLEN to describe the width of +the floating-point registers in the RISC-V ISA, and FLEN=32 for the F +single-precision floating-point extension. Most floating-point +instructions operate on values in the floating-point register file. +Floating-point load and store instructions transfer floating-point +values between registers and memory. 
Instructions to transfer values to +and from the integer register file are also provided. + +[TIP] +==== +We considered a unified register file for both integer and +floating-point values as this simplifies software register allocation +and calling conventions, and reduces total user state. However, a split +organization increases the total number of registers accessible with a +given instruction width, simplifies provision of enough regfile ports +for wide superscalar issue, supports decoupled floating-point-unit +architectures, and simplifies use of internal floating-point encoding +techniques. Compiler support and calling conventions for split register +file architectures are well understood, and using dirty bits on +floating-point register file state can reduce context-switch overhead. +==== + +[[fprs]] +.RISC-V standard F extension single-precision floating-point state +image::f-standard.png[base,180,1000,align="center"] + +=== Floating-Point Control and Status Register + +The floating-point control and status register, `fcsr`, is a RISC-V +control and status register (CSR). It is a 32-bit read/write register +that selects the dynamic rounding mode for floating-point arithmetic +operations and holds the accrued exception flags, as shown in <>. + + +include::images/wavedrom/float-csr.adoc[] +[[fcsr]] +.Floating-point control and status register +image::image_placeholder.png[] + +The `fcsr` register can be read and written with the FRCSR and FSCSR +instructions, which are assembler pseudoinstructions built on the +underlying CSR access instructions. FRCSR reads `fcsr` by copying it +into integer register _rd_. FSCSR swaps the value in `fcsr` by copying +the original value into integer register _rd_, and then writing a new +value obtained from integer register _rs1_ into `fcsr`. + +The fields within the `fcsr` can also be accessed individually through +different CSR addresses, and separate assembler pseudoinstructions are +defined for these accesses.
The FRRM instruction reads the Rounding Mode +field `frm` and copies it into the least-significant three bits of +integer register _rd_, with zero in all other bits. FSRM swaps the value +in `frm` by copying the original value into integer register _rd_, and +then writing a new value obtained from the three least-significant bits +of integer register _rs1_ into `frm`. FRFLAGS and FSFLAGS are defined +analogously for the Accrued Exception Flags field `fflags`. + + +Bits 31–8 of the `fcsr` are reserved for other standard extensions. If +these extensions are not present, implementations shall ignore writes to +these bits and supply a zero value when read. Standard software should +preserve the contents of these bits. + +Floating-point operations use either a static rounding mode encoded in +the instruction, or a dynamic rounding mode held in `frm`. Rounding +modes are encoded as shown in <>. A value of 111 in the +instruction’s _rm_ field selects the dynamic rounding mode held in +`frm`. The behavior of floating-point instructions that depend on +rounding mode when executed with a reserved rounding mode is _reserved_, +including both static reserved rounding modes (101–110) and dynamic +reserved rounding modes (101–111). Some instructions, including widening +conversions, have the _rm_ field but are nevertheless mathematically +unaffected by the rounding mode; software should set their _rm_ field to +RNE (000) but implementations must treat the _rm_ field as usual (in +particular, with regard to decoding legal vs. reserved encodings). + +[[rm]] +.Rounding mode encoding. 
+[cols="^,^,<",options="header",] +|=== +|Rounding Mode |Mnemonic |Meaning +|000 |RNE |Round to Nearest, ties to Even +|001 |RTZ |Round towards Zero +|010 |RDN |Round Down (towards latexmath:[$-\infty$]) +|011 |RUP |Round Up (towards latexmath:[$+\infty$]) +|100 |RMM |Round to Nearest, ties to Max Magnitude +|101 | |_Reserved for future use._ +|110 | |_Reserved for future use._ +|111 |DYN |In instruction’s _rm_ field, selects dynamic rounding mode; +| | |In Rounding Mode register, _reserved_. +|=== + + +[NOTE] +==== +The C99 language standard effectively mandates the provision of a +dynamic rounding mode register. In typical implementations, writes to +the dynamic rounding mode CSR state will serialize the pipeline. Static +rounding modes are used to implement specialized arithmetic operations +that often have to switch frequently between different rounding modes. + +The ratified version of the F spec mandated that an illegal instruction +exception was raised when an instruction was executed with a reserved +dynamic rounding mode. This has been weakened to reserved, which matches +the behavior of static rounding-mode instructions. Raising an illegal +instruction exception is still valid behavior when encountering a +reserved encoding, so implementations compatible with the ratified spec +are compatible with the weakened spec. +==== + + +The accrued exception flags indicate the exception conditions that have +arisen on any floating-point arithmetic instruction since the field was +last reset by software, as shown in <>. The base +RISC-V ISA does not support generating a trap on the setting of a +floating-point exception flag. +(((floating-point, exception flag))) + +[[bitdef]] +.Accrued exception flag encoding.
+[cols="^,<",options="header",] +|=== +|Flag Mnemonic |Flag Meaning +|NV |Invalid Operation +|DZ |Divide by Zero +|OF |Overflow +|UF |Underflow +|NX |Inexact +|=== + +[NOTE] +==== +As allowed by the standard, we do not support traps on floating-point +exceptions in the F extension, but instead require explicit checks of +the flags in software. We considered adding branches controlled directly +by the contents of the floating-point accrued exception flags, but +ultimately chose to omit these instructions to keep the ISA simple. +==== + +=== NaN Generation and Propagation +(((NaN, generation))) +(((NaN, propagation))) + +Except when otherwise stated, if the result of a floating-point +operation is NaN, it is the canonical NaN. The canonical NaN has a +positive sign and all significand bits clear except the MSB, a.k.a. the +quiet bit. For single-precision floating-point, this corresponds to the +pattern `0x7fc00000`. + +[TIP] +==== +We considered propagating NaN payloads, as is recommended by the +standard, but this decision would have increased hardware cost. +Moreover, since this feature is optional in the standard, it cannot be +used in portable code. + +Implementors are free to provide a NaN payload propagation scheme as a +nonstandard extension enabled by a nonstandard operating mode. However, +the canonical NaN scheme described above must always be supported and +should be the default mode. +==== + +[NOTE] +==== +We require implementations to return the standard-mandated default +values in the case of exceptional conditions, without any further +intervention on the part of user-level software (unlike the Alpha ISA +floating-point trap barriers). We believe full hardware handling of +exceptional cases will become more common, and so wish to avoid +complicating the user-level ISA to optimize other approaches. +Implementations can always trap to machine-mode software handlers to +provide exceptional default values. 
+==== + +=== Subnormal Arithmetic +(((operations, subnormal))) + +Operations on subnormal numbers are handled in accordance with the IEEE +754-2008 standard. + +In the parlance of the IEEE standard, tininess is detected after +rounding. +(((tininess, handling))) + +[NOTE] +==== +Detecting tininess after rounding results in fewer spurious underflow +signals. +==== + +=== Single-Precision Load and Store Instructions + +Floating-point loads and stores use the same base+offset addressing mode +as the integer base ISAs, with a base address in register _rs1_ and a +12-bit signed byte offset. The FLW instruction loads a single-precision +floating-point value from memory into floating-point register _rd_. FSW +stores a single-precision value from floating-point register _rs2_ to +memory. + +include::images/wavedrom/sp-load-store.adoc[] +[[sp-ldst]] +.SP load and store +image::image_placeholder.png[] + +FLW and FSW are only guaranteed to execute atomically if the effective +address is naturally aligned. + +FLW and FSW do not modify the bits being transferred; in particular, the +payloads of non-canonical NaNs are preserved. + +As described in <>, the execution +environment defines whether misaligned floating-point loads and stores +are handled invisibly or raise a contained or fatal trap. + +[[single-float-compute]] +=== Single-Precision Floating-Point Computational Instructions + +Floating-point arithmetic instructions with one or two source operands +use the R-type format with the OP-FP major opcode. FADD.S and FMUL.S +perform single-precision floating-point addition and multiplication +respectively, between _rs1_ and _rs2_. FSUB.S performs the +single-precision floating-point subtraction of _rs2_ from _rs1_. FDIV.S +performs the single-precision floating-point division of _rs1_ by _rs2_. +FSQRT.S computes the square root of _rs1_. In each case, the result is +written to _rd_. + +The 2-bit floating-point format field _fmt_ is encoded as shown in +<>. 
It is set to _S_ (00) for all instructions in the F +extension. + +[[fmt]] +.Format field encoding +[cols="^,^,<",options="header",] +|=== +|_fmt_ field |Mnemonic |Meaning +|00 |S |32-bit single-precision +|01 |D |64-bit double-precision +|10 |H |16-bit half-precision +|11 |Q |128-bit quad-precision +|=== + +All floating-point operations that perform rounding can select the +rounding mode using the _rm_ field with the encoding shown in +<>. + +Floating-point minimum-number and maximum-number instructions FMIN.S and +FMAX.S write, respectively, the smaller or larger of _rs1_ and _rs2_ to +_rd_. For the purposes of these instructions only, the value +latexmath:[$-0.0$] is considered to be less than the value +latexmath:[$+0.0$]. If both inputs are NaNs, the result is the canonical +NaN. If only one operand is a NaN, the result is the non-NaN operand. +Signaling NaN inputs set the invalid operation exception flag, even when +the result is not NaN. + +[NOTE] +==== +Note that in version 2.2 of the F extension, the FMIN.S and FMAX.S +instructions were amended to implement the proposed IEEE 754-201x +minimumNumber and maximumNumber operations, rather than the IEEE +754-2008 minNum and maxNum operations. These operations differ in their +handling of signaling NaNs. +==== + +include::images/wavedrom/spfloat.adoc[] +[[spfloat]] +.Single-Precision Floating-Point Computational Instructions +image::image_placeholder.png[] +(((floating point, fused multiply-add))) + +Floating-point fused multiply-add instructions require a new standard +instruction format. R4-type instructions specify three source registers +(_rs1_, _rs2_, and _rs3_) and a destination register (_rd_). This format +is only used by the floating-point fused multiply-add instructions. + +FMADD.S multiplies the values in _rs1_ and _rs2_, adds the value in +_rs3_, and writes the final result to _rd_. FMADD.S computes +_(rs1latexmath:[$\times$]rs2)+rs3_. 
+ +FMSUB.S multiplies the values in _rs1_ and _rs2_, subtracts the value in +_rs3_, and writes the final result to _rd_. FMSUB.S computes +_(rs1latexmath:[$\times$]rs2)-rs3_. + +FNMSUB.S multiplies the values in _rs1_ and _rs2_, negates the product, +adds the value in _rs3_, and writes the final result to _rd_. FNMSUB.S +computes _-(rs1latexmath:[$\times$]rs2)+rs3_. + +FNMADD.S multiplies the values in _rs1_ and _rs2_, negates the product, +subtracts the value in _rs3_, and writes the final result to _rd_. +FNMADD.S computes _-(rs1latexmath:[$\times$]rs2)-rs3_. + +[NOTE] +==== +The FNMSUB and FNMADD instructions are counterintuitively named, owing +to the naming of the corresponding instructions in MIPS-IV. The MIPS +instructions were defined to negate the sum, rather than negating the +product as the RISC-V instructions do, so the naming scheme was more +rational at the time. The two definitions differ with respect to +signed-zero results. The RISC-V definition matches the behavior of the +x86 and ARM fused multiply-add instructions, but unfortunately the +RISC-V FNMSUB and FNMADD instruction names are swapped compared to x86 +and ARM. +==== + +include::images/wavedrom/fnmaddsub.adoc[] +[[fnmaddsub]] +.F[N]MADD/F[N]MSUB instructions +image::image_placeholder.png[] + +[NOTE] +==== +The fused multiply-add (FMA) instructions consume a large part of the +32-bit instruction encoding space. Some alternatives considered were to +restrict FMA to only use dynamic rounding modes, but static rounding +modes are useful in code that exploits the lack of product rounding. +Another alternative would have been to use rd to provide rs3, but this +would require additional move instructions in some common sequences. The +current design still leaves a large portion of the 32-bit encoding space +open while avoiding having FMA be non-orthogonal. 
+==== + +The fused multiply-add instructions must set the invalid operation +exception flag when the multiplicands are latexmath:[$\infty$] and zero, +even when the addend is a quiet NaN. + +[NOTE] +==== +The IEEE 754-2008 standard permits, but does not require, raising the +invalid exception for the operation +latexmath:[$\infty\times 0\,+\,$]qNaN. +==== + +=== Single-Precision Floating-Point Conversion and Move Instructions + +Floating-point-to-integer and integer-to-floating-point conversion +instructions are encoded in the OP-FP major opcode space. FCVT.W.S or +FCVT.L.S converts a floating-point number in floating-point register +_rs1_ to a signed 32-bit or 64-bit integer, respectively, in integer +register _rd_. FCVT.S.W or FCVT.S.L converts a 32-bit or 64-bit signed +integer, respectively, in integer register _rs1_ into a floating-point +number in floating-point register _rd_. FCVT.WU.S, FCVT.LU.S, FCVT.S.WU, +and FCVT.S.LU variants convert to or from unsigned integer values. For +XLENlatexmath:[$>32$], FCVT.W[U].S sign-extends the 32-bit result to the +destination register width. FCVT.L[U].S and FCVT.S.L[U] are RV64-only +instructions. If the rounded result is not representable in the +destination format, it is clipped to the nearest value and the invalid +flag is set. <> gives the range of valid inputs +for FCVT._int_.S and the behavior for invalid inputs.
+(((floating-point, conversion))) + +[[int_conv]] +.Domains of float-to-integer conversions and behavior for invalid inputs +[cols="<,>,>,>,>",options="header",] +|=== +| |FCVT.W.S |FCVT.WU.S |FCVT.L.S |FCVT.LU.S +|Minimum valid input (after rounding) |latexmath:[$-2^{31}$] |0 +|latexmath:[$-2^{63}$] |0 + +|Maximum valid input (after rounding) |latexmath:[$2^{31}-1$] +|latexmath:[$2^{32}-1$] |latexmath:[$2^{63}-1$] |latexmath:[$2^{64}-1$] + +|Output for out-of-range negative input |latexmath:[$-2^{31}$] |0 +|latexmath:[$-2^{63}$] |0 + +|Output for latexmath:[$-\infty$] |latexmath:[$-2^{31}$] |0 +|latexmath:[$-2^{63}$] |0 + +|Output for out-of-range positive input |latexmath:[$2^{31}-1$] +|latexmath:[$2^{32}-1$] |latexmath:[$2^{63}-1$] |latexmath:[$2^{64}-1$] + +|Output for latexmath:[$+\infty$] or NaN |latexmath:[$2^{31}-1$] +|latexmath:[$2^{32}-1$] |latexmath:[$2^{63}-1$] |latexmath:[$2^{64}-1$] +|=== + +All floating-point to integer and integer to floating-point conversion +instructions round according to the _rm_ field. A floating-point +register can be initialized to floating-point positive zero using +FCVT.S.W _rd_, `x0`, which will never set any exception flags. + +All floating-point conversion instructions set the Inexact exception +flag if the rounded result differs from the operand value and the +Invalid exception flag is not set. + +include::images/wavedrom/spfloat.adoc[] +[[fcvt]] +.SP float convert and move +image::image_placeholder.png[] + +Floating-point to floating-point sign-injection instructions, FSGNJ.S, +FSGNJN.S, and FSGNJX.S, produce a result that takes all bits except the +sign bit from _rs1_. For FSGNJ, the result’s sign bit is _rs2_’s sign +bit; for FSGNJN, the result’s sign bit is the opposite of _rs2_’s sign +bit; and for FSGNJX, the sign bit is the XOR of the sign bits of _rs1_ +and _rs2_. Sign-injection instructions do not set floating-point +exception flags, nor do they canonicalize NaNs. 
Note, FSGNJ.S _rx, ry, +ry_ moves _ry_ to _rx_ (assembler pseudoinstruction FMV.S _rx, ry_); +FSGNJN.S _rx, ry, ry_ moves the negation of _ry_ to _rx_ (assembler +pseudoinstruction FNEG.S _rx, ry_); and FSGNJX.S _rx, ry, ry_ moves the +absolute value of _ry_ to _rx_ (assembler pseudoinstruction FABS.S _rx, +ry_). + +include::images/wavedrom/spfloat-cn-cmp.adoc[] +[[spfloat-cn-cmp]] +.SP floating point convert and compare +image::image_placeholder.png[] + +[NOTE] +==== +The sign-injection instructions provide floating-point MV, ABS, and NEG, +as well as supporting a few other operations, including the IEEE +copySign operation and sign manipulation in transcendental math function +libraries. Although MV, ABS, and NEG only need a single register +operand, whereas FSGNJ instructions need two, it is unlikely most +microarchitectures would add optimizations to benefit from the reduced +number of register reads for these relatively infrequent instructions. +Even in this case, a microarchitecture can simply detect when both +source registers are the same for FSGNJ instructions and only read a +single copy. +==== + +Instructions are provided to move bit patterns between the +floating-point and integer registers. FMV.X.W moves the single-precision +value in floating-point register _rs1_ represented in IEEE 754-2008 +encoding to the lower 32 bits of integer register _rd_. The bits are not +modified in the transfer, and in particular, the payloads of +non-canonical NaNs are preserved. For RV64, the higher 32 bits of the +destination register are filled with copies of the floating-point +number’s sign bit. + +FMV.W.X moves the single-precision value encoded in IEEE 754-2008 +standard encoding from the lower 32 bits of integer register _rs1_ to +the floating-point register _rd_. The bits are not modified in the +transfer, and in particular, the payloads of non-canonical NaNs are +preserved. 
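The bit-preserving, sign-extending behavior of the move instructions can be modeled with a short Python sketch. This is an illustrative model, not specification pseudocode; it operates on raw 32-bit encodings passed as integers.

```python
def fmv_x_w(f_bits32, xlen=64):
    # Model of FMV.X.W: the 32 floating-point register bits pass through
    # unchanged (NaN payloads preserved); for RV64 (XLEN=64) the upper
    # 32 bits of rd are copies of the floating-point sign bit.
    sign = (f_bits32 >> 31) & 1
    if xlen == 64 and sign:
        return (0xFFFFFFFF << 32) | f_bits32
    return f_bits32

def fmv_w_x(x_bits):
    # Model of FMV.W.X: the low 32 bits of rs1 are taken unchanged.
    return x_bits & 0xFFFFFFFF

# -1.0f encodes as 0xBF800000; moved to a 64-bit integer register it is
# sign-extended, while +1.0f (0x3F800000, sign bit clear) is not.
print(hex(fmv_x_w(0xBF800000)))  # 0xffffffffbf800000
print(hex(fmv_x_w(0x3F800000)))  # 0x3f800000
```

Note that neither direction inspects the value as a floating-point number; only the sign bit position matters, which is why non-canonical NaN payloads survive the transfer.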
+ +[NOTE] +==== +The FMV.W.X and FMV.X.W instructions were previously called FMV.S.X and +FMV.X.S. The use of W is more consistent with their semantics as an +instruction that moves 32 bits without interpreting them. This became +clearer after defining NaN-boxing. To avoid disturbing existing code, +both the W and S versions will be supported by tools. +==== + +include::images/wavedrom/spfloat-mv.adoc[] +[[spfloat-mv]] +.SP floating point move +image::image_placeholder.png[] + + +[TIP] +==== +The base floating-point ISA was defined so as to allow implementations +to employ an internal recoding of the floating-point format in registers +to simplify handling of subnormal values and possibly to reduce +functional unit latency. To this end, the F extension avoids +representing integer values in the floating-point registers by defining +conversion and comparison operations that read and write the integer +register file directly. This also removes many of the common cases where +explicit moves between integer and floating-point registers are +required, reducing instruction count and critical paths for common +mixed-format code sequences. +==== + +=== Single-Precision Floating-Point Compare Instructions + +Floating-point compare instructions (FEQ.S, FLT.S, FLE.S) perform the +specified comparison between floating-point registers +(latexmath:[$\mbox{\em rs1} += \mbox{\em rs2}$], latexmath:[$\mbox{\em rs1} < \mbox{\em rs2}$], +latexmath:[$\mbox{\em rs1} \leq +\mbox{\em rs2}$]) writing 1 to the integer register _rd_ if the +condition holds, and 0 otherwise. + +FLT.S and FLE.S perform what the IEEE 754-2008 standard refers to as +_signaling_ comparisons: that is, they set the invalid operation +exception flag if either input is NaN. FEQ.S performs a _quiet_ +comparison: it only sets the invalid operation exception flag if either +input is a signaling NaN. For all three instructions, the result is 0 if +either operand is NaN. 
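The quiet-versus-signaling distinction above can be sketched in Python. This is an illustrative model only: Python floats cannot represent signaling NaNs, so in this sketch FEQ.S never sets the invalid-operation flag, while FLT.S and FLE.S set it for any NaN input, matching the behavior for quiet-NaN operands.

```python
import math

# Model of the NV ("invalid operation") accrued flag; reset by software.
nv_flag = False

def feq_s(a, b):
    # Quiet comparison: NaN operands give 0; only a signaling NaN would
    # set NV (not representable with Python floats, so never set here).
    if math.isnan(a) or math.isnan(b):
        return 0
    return int(a == b)

def flt_s(a, b):
    # Signaling comparison: any NaN operand sets NV and gives 0.
    global nv_flag
    if math.isnan(a) or math.isnan(b):
        nv_flag = True
        return 0
    return int(a < b)

def fle_s(a, b):
    global nv_flag
    if math.isnan(a) or math.isnan(b):
        nv_flag = True
        return 0
    return int(a <= b)

print(feq_s(1.5, 1.5), flt_s(1.0, 2.0), fle_s(2.0, 2.0))  # 1 1 1
print(flt_s(float("nan"), 0.0), nv_flag)                   # 0 True
```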
+ +include::images/wavedrom/spfloat-comp.adoc[] +[[spfloat-comp]] +.SP floating point compare +image::image_placeholder.png[] + +[NOTE] +==== +The F extension provides a latexmath:[$\leq$] comparison, whereas the +base ISAs provide a latexmath:[$\geq$] branch comparison. Because +latexmath:[$\leq$] can be synthesized from latexmath:[$\geq$] and +vice-versa, there is no performance implication to this inconsistency, +but it is nevertheless an unfortunate incongruity in the ISA. +==== + +=== Single-Precision Floating-Point Classify Instruction + +The FCLASS.S instruction examines the value in floating-point register +_rs1_ and writes to integer register _rd_ a 10-bit mask that indicates +the class of the floating-point number. The format of the mask is +described in <>. The corresponding bit in _rd_ will +be set if the property is true and clear otherwise. All other bits in +_rd_ are cleared. Note that exactly one bit in _rd_ will be set. +FCLASS.S does not set the floating-point exception flags. +(((floating-point, classification))) + +include::images/wavedrom/spfloat-classify.adoc[] +[[spfloat-classify]] +.SP floating point classify +image::image_placeholder.png[] + +[[fclass]] +.Format of result of FCLASS instruction. +[cols="^,<",options="header",] +|=== +|_rd_ bit |Meaning +|0 |_rs1_ is latexmath:[$-\infty$]. +|1 |_rs1_ is a negative normal number. +|2 |_rs1_ is a negative subnormal number. +|3 |_rs1_ is latexmath:[$-0$]. +|4 |_rs1_ is latexmath:[$+0$]. +|5 |_rs1_ is a positive subnormal number. +|6 |_rs1_ is a positive normal number. +|7 |_rs1_ is latexmath:[$+\infty$]. +|8 |_rs1_ is a signaling NaN. +|9 |_rs1_ is a quiet NaN. 
+|=== + +diff --git a/src/history.adoc b/src/history.adoc new file mode 100644 index 0000000..8ec9fbd --- /dev/null +++ b/src/history.adoc @@ -0,0 +1,362 @@ +[[history]] +== History and Acknowledgments + +=== "Why Develop a new ISA?" Rationale from Berkeley Group + +We developed RISC-V to support our own needs in research and education, +where our group is particularly interested in actual hardware +implementations of research ideas (we have completed eleven different +silicon fabrications of RISC-V since the first edition of this +specification), and in providing real implementations for students to +explore in classes (RISC-V processor RTL designs have been used in +multiple undergraduate and graduate classes at Berkeley). In our current +research, we are especially interested in the move towards specialized +and heterogeneous accelerators, driven by the power constraints imposed +by the end of conventional transistor scaling. We wanted a highly +flexible and extensible base ISA around which to build our research +effort. + +A question we have been repeatedly asked is "Why develop a new ISA?" +The biggest obvious benefit of using an existing commercial ISA is the +large and widely supported software ecosystem, both development tools +and ported applications, which can be leveraged in research and +teaching. Other benefits include the existence of large amounts of +documentation and tutorial examples. However, our experience of using +commercial instruction sets for research and teaching is that these +benefits are smaller in practice, and do not outweigh the disadvantages: + +* *Commercial ISAs are proprietary.* Except for SPARC V8, which is an +open IEEE standard cite:[sparcieee1994], most owners of commercial ISAs carefully guard +their intellectual property and do not welcome freely available +competitive implementations.
This is much less of an issue for academic +research and teaching using only software simulators, but has been a +major concern for groups wishing to share actual RTL implementations. It +is also a major concern for entities who do not want to trust the few +sources of commercial ISA implementations, but who are prohibited from +creating their own clean room implementations. We cannot guarantee that +all RISC-V implementations will be free of third-party patent +infringements, but we can guarantee we will not attempt to sue a RISC-V +implementor. +* *Commercial ISAs are only popular in certain market domains.* The most +obvious examples at time of writing are that the ARM architecture is not +well supported in the server space, and the Intel x86 architecture (or +for that matter, almost every other architecture) is not well supported +in the mobile space, though both Intel and ARM are attempting to enter +each other’s market segments. Another example is ARC and Tensilica, +which provide extensible cores but are focused on the embedded space. +This market segmentation dilutes the benefit of supporting a particular +commercial ISA as in practice the software ecosystem only exists for +certain domains, and has to be built for others. +* *Commercial ISAs come and go.* Previous research infrastructures have +been built around commercial ISAs that are no longer popular (SPARC, +MIPS) or even no longer in production (Alpha). These lose the benefit of +an active software ecosystem, and the lingering intellectual property +issues around the ISA and supporting tools interfere with the ability of +interested third parties to continue supporting the ISA. An open ISA +might also lose popularity, but any interested party can continue using +and developing the ecosystem. +* *Popular commercial ISAs are complex.* The dominant commercial ISAs +(x86 and ARM) are both very complex to implement in hardware to the +level of supporting common software stacks and operating systems. 
Worse, +nearly all the complexity is due to bad, or at least outdated, ISA +design decisions rather than features that truly improve efficiency. +* *Commercial ISAs alone are not enough to bring up applications.* Even +if we expend the effort to implement a commercial ISA, this is not +enough to run existing applications for that ISA. Most applications need +a complete ABI (application binary interface) to run, not just the +user-level ISA. Most ABIs rely on libraries, which in turn rely on +operating system support. To run an existing operating system requires +implementing the supervisor-level ISA and device interfaces expected by +the OS. These are usually much less well-specified and considerably more +complex to implement than the user-level ISA. +* *Popular commercial ISAs were not designed for extensibility.* The +dominant commercial ISAs were not particularly designed for +extensibility, and as a consequence have added considerable instruction +encoding complexity as their instruction sets have grown. Companies such +as Tensilica (acquired by Cadence) and ARC (acquired by Synopsys) have +built ISAs and toolchains around extensibility, but have focused on +embedded applications rather than general-purpose computing systems. +* *A modified commercial ISA is a new ISA.* One of our main goals is to +support architecture research, including major ISA extensions. Even +small extensions diminish the benefit of using a standard ISA, as +compilers have to be modified and applications rebuilt from source code +to use the extension. Larger extensions that introduce new architectural +state also require modifications to the operating system. Ultimately, +the modified commercial ISA becomes a new ISA, but carries along all the +legacy baggage of the base ISA. + +Our position is that the ISA is perhaps the most important interface in +a computing system, and there is no reason that such an important +interface should be proprietary. 
The dominant commercial ISAs are based +on instruction-set concepts that were already well known over 30 years +ago. Software developers should be able to target an open standard +hardware target, and commercial processor designers should compete on +implementation quality. + +We are far from the first to contemplate an open ISA design suitable for +hardware implementation. We also considered other existing open ISA +designs, of which the closest to our goals was the OpenRISC +architecture cite:[openriscarch]. We decided against adopting the OpenRISC ISA for several +technical reasons: + +* OpenRISC has condition codes and branch delay slots, which complicate +higher performance implementations. +* OpenRISC uses a fixed 32-bit encoding and 16-bit immediates, which +precludes a denser instruction encoding and limits space for later +expansion of the ISA. +* OpenRISC does not support the 2008 revision to the IEEE 754 +floating-point standard. +* The OpenRISC 64-bit design had not been completed when we began. + +By starting from a clean slate, we could design an ISA that met all of +our goals, though of course, this took far more effort than we had +planned at the outset. We have now invested considerable effort in +building up the RISC-V ISA infrastructure, including documentation, +compiler tool chains, operating system ports, reference ISA simulators, +FPGA implementations, efficient ASIC implementations, architecture test +suites, and teaching materials. Since the last edition of this manual, +there has been considerable uptake of the RISC-V ISA in both academia +and industry, and we have created the non-profit RISC-V Foundation to +protect and promote the standard. The RISC-V Foundation website at +https://riscv.org contains the latest information on the Foundation +membership and various open-source projects using RISC-V. + +=== History from Revision 1.0 of ISA manual + +The RISC-V ISA and instruction-set manual builds upon several earlier +projects. 
Several aspects of the supervisor-level machine and the +overall format of the manual date back to the T0 (Torrent-0) vector +microprocessor project at UC Berkeley and ICSI, begun in 1992. T0 was a +vector processor based on the MIPS-II ISA, with Krste Asanović as main +architect and RTL designer, and Brian Kingsbury and Bertrand Irrisou as +principal VLSI implementors. David Johnson at ICSI was a major +contributor to the T0 ISA design, particularly supervisor mode, and to +the manual text. John Hauser also provided considerable feedback on the +T0 ISA design. + +The Scale (Software-Controlled Architecture for Low Energy) project at +MIT, begun in 2000, built upon the T0 project infrastructure, refined +the supervisor-level interface, and moved away from the MIPS scalar ISA +by dropping the branch delay slot. Ronny Krashinsky and Christopher +Batten were the principal architects of the Scale Vector-Thread +processor at MIT, while Mark Hampton ported the GCC-based compiler +infrastructure and tools for Scale. + +A lightly edited version of the T0 MIPS scalar processor specification +(MIPS-6371) was used in teaching a new version of the MIT 6.371 +Introduction to VLSI Systems class in the Fall 2002 semester, with Chris +Terman and Krste Asanović as lecturers. Chris Terman contributed most of +the lab material for the class (there was no TA!). The 6.371 class +evolved into the trial 6.884 Complex Digital Design class at MIT, taught +by Arvind and Krste Asanović in Spring 2005, which became a regular +Spring class 6.375. A reduced version of the Scale MIPS-based scalar +ISA, named SMIPS, was used in 6.884/6.375. Christopher Batten was the TA +for the early offerings of these classes and developed a considerable +amount of documentation and lab material based around the SMIPS ISA. 
+This same SMIPS lab material was adapted and enhanced by TA Yunsup Lee +for the UC Berkeley Fall 2009 CS250 VLSI Systems Design class taught by +John Wawrzynek, Krste Asanović, and John Lazzaro. + +The Maven (Malleable Array of Vector-thread ENgines) project was a +second-generation vector-thread architecture. Its design was led by +Christopher Batten when he was an Exchange Scholar at UC Berkeley +starting in summer 2007. Hidetaka Aoki, a visiting industrial fellow +from Hitachi, gave considerable feedback on the early Maven ISA and +microarchitecture design. The Maven infrastructure was based on the +Scale infrastructure but the Maven ISA moved further away from the MIPS +ISA variant defined in Scale, with a unified floating-point and integer +register file. Maven was designed to support experimentation with +alternative data-parallel accelerators. Yunsup Lee was the main +implementor of the various Maven vector units, while Rimas Avižienis was +the main implementor of the various Maven scalar units. Yunsup Lee and +Christopher Batten ported GCC to work with the new Maven ISA. +Christopher Celio provided the initial definition of a traditional +vector instruction set (`Flood`) variant of Maven. + +Based on experience with all these previous projects, the RISC-V ISA +definition was begun in Summer 2010, with Andrew Waterman, Yunsup Lee, +Krste Asanović, and David Patterson as principal designers. An initial +version of the RISC-V 32-bit instruction subset was used in the UC +Berkeley Fall 2010 CS250 VLSI Systems Design class, with Yunsup Lee as +TA. RISC-V is a clean break from the earlier MIPS-inspired designs. John +Hauser contributed to the floating-point ISA definition, including the +sign-injection instructions and a register encoding scheme that permits +internal recoding of floating-point values. 
+
+=== History from Revision 2.0 of ISA manual
+
+Multiple implementations of RISC-V processors have been completed,
+including several silicon fabrications, as shown in
+<<silicon>>.
+
+[[silicon]]
+[cols="<,>,<,<",options="header",]
+|===
+|Name |Tapeout Date |Process |ISA
+|Raven-1 |May 29, 2011 |ST 28nm FDSOI |RV64G1_Xhwacha1
+|EOS14 |April 1, 2012 |IBM 45nm SOI |RV64G1p1_Xhwacha2
+|EOS16 |August 17, 2012 |IBM 45nm SOI |RV64G1p1_Xhwacha2
+|Raven-2 |August 22, 2012 |ST 28nm FDSOI |RV64G1p1_Xhwacha2
+|EOS18 |February 6, 2013 |IBM 45nm SOI |RV64G1p1_Xhwacha2
+|EOS20 |July 3, 2013 |IBM 45nm SOI |RV64G1p99_Xhwacha2
+|Raven-3 |September 26, 2013 |ST 28nm SOI |RV64G1p99_Xhwacha2
+|EOS22 |March 7, 2014 |IBM 45nm SOI |RV64G1p9999_Xhwacha3
+|===
+
+The first RISC-V processors to be fabricated were written in Verilog and
+manufactured in a pre-production FDSOI technology from ST as the Raven-1
+testchip in 2011. Two cores were developed by Yunsup Lee and Andrew
+Waterman, advised by Krste Asanović, and fabricated together: 1) an RV64
+scalar core with error-detecting flip-flops, and 2) an RV64 core with an
+attached 64-bit floating-point vector unit. The first microarchitecture
+was informally known as `TrainWreck`, due to the short time available
+to complete the design with immature design libraries.
+
+Subsequently, a clean microarchitecture for an in-order decoupled RV64
+core was developed by Andrew Waterman, Rimas Avižienis, and Yunsup Lee,
+advised by Krste Asanović, and, continuing the railway theme, was
+codenamed `Rocket` after George Stephenson’s successful steam
+locomotive design. Rocket was written in Chisel, a new hardware design
+language developed at UC Berkeley. The IEEE floating-point units used in
+Rocket were developed by John Hauser, Andrew Waterman, and Brian
+Richards.
Rocket has since been refined and developed further, and has +been fabricated two more times in FDSOI (Raven-2, Raven-3), and five +times in IBM SOI technology (EOS14, EOS16, EOS18, EOS20, EOS22) for a +photonics project. Work is ongoing to make the Rocket design available +as a parameterized RISC-V processor generator. + +EOS14–EOS22 chips include early versions of Hwacha, a 64-bit IEEE +floating-point vector unit, developed by Yunsup Lee, Andrew Waterman, +Huy Vo, Albert Ou, Quan Nguyen, and Stephen Twigg, advised by Krste +Asanović. EOS16–EOS22 chips include dual cores with a cache-coherence +protocol developed by Henry Cook and Andrew Waterman, advised by Krste +Asanović. EOS14 silicon has successfully run at 1.25 GHz. EOS16 silicon suffered +from a bug in the IBM pad libraries. EOS18 and EOS20 have successfully +run at 1.35 GHz. + +Contributors to the Raven testchips include Yunsup Lee, Andrew Waterman, +Rimas Avižienis, Brian Zimmer, Jaehwa Kwak, Ruzica Jevtić, Milovan +Blagojević, Alberto Puggelli, Steven Bailey, Ben Keller, Pi-Feng Chiu, +Brian Richards, Borivoje Nikolić, and Krste Asanović. + +Contributors to the EOS testchips include Yunsup Lee, Rimas Avižienis, +Andrew Waterman, Henry Cook, Huy Vo, Daiwei Li, Chen Sun, Albert Ou, +Quan Nguyen, Stephen Twigg, Vladimir Stojanović, and Krste Asanović. + +Andrew Waterman and Yunsup Lee developed the C++ ISA simulator +`Spike`, used as a golden model in development and named after the +golden spike used to celebrate completion of the US transcontinental +railway. Spike has been made available as a BSD open-source project. + +Andrew Waterman completed a Master’s thesis with a preliminary design of +the RISC-V compressed instruction set cite:[waterman-ms]. + +Various FPGA implementations of the RISC-V have been completed, +primarily as part of integrated demos for the Par Lab project research +retreats. The largest FPGA design has 3 cache-coherent RV64IMA +processors running a research operating system. 
Contributors to the FPGA +implementations include Andrew Waterman, Yunsup Lee, Rimas Avižienis, +and Krste Asanović. + +RISC-V processors have been used in several classes at UC Berkeley. +Rocket was used in the Fall 2011 offering of CS250 as a basis for class +projects, with Brian Zimmer as TA. For the undergraduate CS152 class in +Spring 2012, Christopher Celio used Chisel to write a suite of +educational RV32 processors, named `Sodor` after the island on which +`Thomas the Tank Engine` and friends live. The suite includes a +microcoded core, an unpipelined core, and 2, 3, and 5-stage pipelined +cores, and is publicly available under a BSD license. The suite was +subsequently updated and used again in CS152 in Spring 2013, with Yunsup +Lee as TA, and in Spring 2014, with Eric Love as TA. Christopher Celio +also developed an out-of-order RV64 design known as BOOM (Berkeley +Out-of-Order Machine), with accompanying pipeline visualizations, that +was used in the CS152 classes. The CS152 classes also used +cache-coherent versions of the Rocket core developed by Andrew Waterman +and Henry Cook. + +Over the summer of 2013, the RoCC (Rocket Custom Coprocessor) interface +was defined to simplify adding custom accelerators to the Rocket core. +Rocket and the RoCC interface were used extensively in the Fall 2013 +CS250 VLSI class taught by Jonathan Bachrach, with several student +accelerator projects built to the RoCC interface. The Hwacha vector unit +has been rewritten as a RoCC coprocessor. + +Two Berkeley undergraduates, Quan Nguyen and Albert Ou, have +successfully ported Linux to run on RISC-V in Spring 2013. + +Colin Schmidt successfully completed an LLVM backend for RISC-V 2.0 in +January 2014. + +Darius Rad at Bluespec contributed soft-float ABI support to the GCC +port in March 2014. + +John Hauser contributed the definition of the floating-point +classification instructions. 
+ +We are aware of several other RISC-V core implementations, including one +in Verilog by Tommy Thorn, and one in Bluespec by Rishiyur Nikhil. + +=== Acknowledgments + +Thanks to Christopher F. Batten, Preston Briggs, Christopher Celio, +David Chisnall, Stefan Freudenberger, John Hauser, Ben Keller, Rishiyur +Nikhil, Michael Taylor, Tommy Thorn, and Robert Watson for comments on +the draft ISA version 2.0 specification. + +=== History from Revision 2.1 + +Uptake of the RISC-V ISA has been very rapid since the introduction of +the frozen version 2.0 in May 2014, with too much activity to record in +a short history section such as this. Perhaps the most important single +event was the formation of the non-profit RISC-V Foundation in August +2015. The Foundation will now take over stewardship of the official +RISC-V ISA standard, and the official website `riscv.org` is the best +place to obtain news and updates on the RISC-V standard. + +=== Acknowledgments + +Thanks to Scott Beamer, Allen J. Baum, Christopher Celio, David +Chisnall, Paul Clayton, Palmer Dabbelt, Jan Gray, Michael Hamburg, and +John Hauser for comments on the version 2.0 specification. + +=== History from Revision 2.2 + +=== Acknowledgments + +Thanks to Jacob Bachmeyer, Alex Bradbury, David Horner, Stefan O’Rear, +and Joseph Myers for comments on the version 2.1 specification. + +=== History for Revision 2.3 + +Uptake of RISC-V continues at breakneck pace. + +John Hauser and Andrew Waterman contributed a hypervisor ISA extension +based upon a proposal from Paolo Bonzini. + +Daniel Lustig, Arvind, Krste Asanović, Shaked Flur, Paul Loewenstein, +Yatin Manerkar, Luc Maranget, Margaret Martonosi, Vijayanand Nagarajan, +Rishiyur Nikhil, Jonas Oberhauser, Christopher Pulte, Jose Renau, Peter +Sewell, Susmit Sarkar, Caroline Trippel, Muralidaran Vijayaraghavan, +Andrew Waterman, Derek Williams, Andrew Wright, and Sizhuo Zhang +contributed the memory consistency model. 
+ +=== Funding + +Development of the RISC-V architecture and implementations has been +partially funded by the following sponsors. + +* *Par Lab:* Research supported by Microsoft (Award # 024263) and Intel +(Award # 024894) funding and by matching funding by U.C. Discovery (Award +# DIG07-10227). Additional support came from Par Lab affiliates Nokia, +NVIDIA, Oracle, and Samsung. +* *Project Isis:* DoE Award DE-SC0003624. +* *ASPIRE Lab*: DARPA PERFECT program, Award HR0011-12-2-0016. DARPA +POEM program Award HR0011-11-C-0100. The Center for Future Architectures +Research (C-FAR), a STARnet center funded by the Semiconductor Research +Corporation. Additional support from ASPIRE industrial sponsor, Intel, +and ASPIRE affiliates, Google, Hewlett Packard Enterprise, Huawei, +Nokia, NVIDIA, Oracle, and Samsung. + +The content of this paper does not necessarily reflect the position or +the policy of the US government and no official endorsement should be +inferred. diff --git a/src/index.adoc b/src/index.adoc new file mode 100644 index 0000000..4abaca2 --- /dev/null +++ b/src/index.adoc @@ -0,0 +1,2 @@ +[index] +== Index diff --git a/src/intro-old.adoc b/src/intro-old.adoc new file mode 100644 index 0000000..a1d24f7 --- /dev/null +++ b/src/intro-old.adoc @@ -0,0 +1,733 @@ +[[introduction]] +== Introduction + +RISC-V (pronounced `risk-five`) is a new instruction-set architecture +(ISA) that was originally designed to support computer architecture +research and education, but which we now hope will also become a +standard free and open architecture for industry implementations. Our +goals in defining RISC-V include: + +* A completely _open_ ISA that is freely available to academia and +industry. +* A _real_ ISA suitable for direct native hardware implementation, not +just simulation or binary translation. 
+* An ISA that avoids `over-architecting` for a particular
+microarchitecture style (e.g., microcoded, in-order, decoupled,
+out-of-order) or implementation technology (e.g., full-custom, ASIC,
+FPGA), but which allows efficient implementation in any of these.
+* An ISA separated into a _small_ base integer ISA, usable by itself as
+a base for customized accelerators or for educational purposes, and
+optional standard extensions, to support general-purpose software
+development.
+* Support for the revised 2008 IEEE-754 floating-point standard.
+* An ISA supporting extensive ISA extensions and specialized variants.
+* Both 32-bit and 64-bit address space variants for applications,
+operating system kernels, and hardware implementations.
+* An ISA with support for highly parallel multicore or manycore
+implementations, including heterogeneous multiprocessors.
+* Optional _variable-length instructions_ to both expand available
+instruction encoding space and to support an optional _dense instruction
+encoding_ for improved performance, static code size, and energy
+efficiency.
+* A fully virtualizable ISA to ease hypervisor development.
+* An ISA that simplifies experiments with new privileged architecture
+designs.
+
+[TIP]
+====
+Commentary on our design decisions is formatted as in this paragraph.
+This non-normative text can be skipped if the reader is only interested
+in the specification itself.
+====
+
+[NOTE]
+====
+The name RISC-V was chosen to represent the fifth major RISC ISA design
+from UC Berkeley (RISC-I cite:[riscI-isca1981], RISC-II cite:[Katevenis:1983], SOAR cite:[Ungar:1984], and SPUR cite:[spur-jsscc1989] were the first
+four). We also pun on the use of the Roman numeral `V` to signify
+`variations` and `vectors`, as support for a range of architecture
+research, including various data-parallel accelerators, is an explicit
+goal of the ISA design.
+==== +(((ISA, definition))) + +The RISC-V ISA is defined avoiding implementation details as much as +possible (although commentary is included on implementation-driven +decisions) and should be read as the software-visible interface to a +wide variety of implementations rather than as the design of a +particular hardware artifact. The RISC-V manual is structured in two +volumes. This volume covers the design of the base _unprivileged_ +instructions, including optional unprivileged ISA extensions. +Unprivileged instructions are those that are generally usable in all +privilege modes in all privileged architectures, though behavior might +vary depending on privilege mode and privilege architecture. The second +volume provides the design of the first (`classic`) privileged +architecture. The manuals use IEC 80000-13:2008 conventions, with a byte +of 8 bits. + +[TIP] +==== +In the unprivileged ISA design, we tried to remove any dependence on +particular microarchitectural features, such as cache line size, or on +privileged architecture details, such as page translation. This is both +for simplicity and to allow maximum flexibility for alternative +microarchitectures or alternative privileged architectures. +==== + +=== RISC-V Hardware Platform Terminology + +A RISC-V hardware platform can contain one or more RISC-V-compatible +processing cores together with other non-RISC-V-compatible cores, +fixed-function accelerators, various physical memory structures, I/O +devices, and an interconnect structure to allow the components to +communicate. +(((core, component))) + +A component is termed a _core_ if it contains an independent instruction +fetch unit. A RISC-V-compatible core might support multiple +RISC-V-compatible hardware threads, or _harts_, through multithreading. +(((core, extensions, coprocessor))) + +A RISC-V core might have additional specialized instruction-set +extensions or an added _coprocessor_. 
We use the term _coprocessor_ to +refer to a unit that is attached to a RISC-V core and is mostly +sequenced by a RISC-V instruction stream, but which contains additional +architectural state and instruction-set extensions, and possibly some +limited autonomy relative to the primary RISC-V instruction stream. +(((core, accelerator))) + +We use the term _accelerator_ to refer to either a non-programmable +fixed-function unit or a core that can operate autonomously but is +specialized for certain tasks. In RISC-V systems, we expect many +programmable accelerators will be RISC-V-based cores with specialized +instruction-set extensions and/or customized coprocessors. An important +class of RISC-V accelerators are I/O accelerators, which offload I/O +processing tasks from the main application cores. +(((core, cluster, multiprocessors))) + +The system-level organization of a RISC-V hardware platform can range +from a single-core microcontroller to a many-thousand-node cluster of +shared-memory manycore server nodes. Even small systems-on-a-chip might +be structured as a hierarchy of multicomputers and/or multiprocessors to +modularize development effort or to provide secure isolation between +subsystems. + +=== RISC-V Software Execution Environments and Harts + +The behavior of a RISC-V program depends on the execution environment in +which it runs. A RISC-V execution environment interface (EEI) defines +the initial state of the program, the number and type of harts in the +environment including the privilege modes supported by the harts, the +accessibility and attributes of memory and I/O regions, the behavior of +all legal instructions executed on each hart (i.e., the ISA is one +component of the EEI), and the handling of any interrupts or exceptions +raised during execution including environment calls. Examples of EEIs +include the Linux application binary interface (ABI), or the RISC-V +supervisor binary interface (SBI). 
The implementation of a RISC-V
+execution environment can be pure hardware, pure software, or a
+combination of hardware and software. For example, opcode traps and
+software emulation can be used to implement functionality not provided
+in hardware. Examples of execution environment implementations include:
+
+* `Bare metal` hardware platforms where harts are directly implemented
+by physical processor threads and instructions have full access to the
+physical address space. The hardware platform defines an execution
+environment that begins at power-on reset.
+* RISC-V operating systems that provide multiple user-level execution
+environments by multiplexing user-level harts onto available physical
+processor threads and by controlling access to memory via virtual
+memory.
+* RISC-V hypervisors that provide multiple supervisor-level execution
+environments for guest operating systems.
+* RISC-V emulators, such as Spike, QEMU, or rv8, which emulate RISC-V
+harts on an underlying x86 system, and which can provide either a
+user-level or a supervisor-level execution environment.
+
+[TIP]
+====
+A bare hardware platform can be considered to define an EEI, where the
+accessible harts, memory, and other devices populate the environment,
+and the initial state is that at power-on reset. Generally, most
+software is designed to use a more abstract interface to the hardware,
+as more abstract EEIs provide greater portability across different
+hardware platforms. Often EEIs are layered on top of one another, where
+one higher-level EEI uses another lower-level EEI.
+====
+(((hart, execution environment)))
+
+From the perspective of software running in a given execution
+environment, a hart is a resource that autonomously fetches and executes
+RISC-V instructions within that execution environment. In this respect,
+a hart behaves like a hardware thread resource even if time-multiplexed
+onto real hardware by the execution environment.
Some EEIs support the
+creation and destruction of additional harts, for example, via
+environment calls to fork new harts.
+
+The execution environment is responsible for ensuring the eventual
+forward progress of each of its harts. For a given hart, that
+responsibility is suspended while the hart is exercising a mechanism
+that explicitly waits for an event, such as the wait-for-interrupt
+instruction defined in Volume II of this specification; and that
+responsibility ends if the hart is terminated. The following events
+constitute forward progress:
+
+* The retirement of an instruction.
+* A trap, as defined in <>.
+* Any other event defined by an extension to constitute forward
+progress.
+
+
+[TIP]
+====
+The term hart was introduced in the work on Lithe in cite:[lithe-pan-hotpar09] and cite:[lithe-pan-pldi10] to provide a term to
+represent an abstract execution resource as opposed to a software thread
+programming abstraction.
+
+The important distinction between a hardware thread (hart) and a
+software thread context is that the software running inside an execution
+environment is not responsible for causing progress of each of its
+harts; that is the responsibility of the outer execution environment. So
+the environment’s harts operate like hardware threads from the
+perspective of the software inside the execution environment.
+
+An execution environment implementation might time-multiplex a set of
+guest harts onto fewer host harts provided by its own execution
+environment but must do so in a way that guest harts operate like
+independent hardware threads. In particular, if there are more guest
+harts than host harts then the execution environment must be able to
+preempt the guest harts and must not wait indefinitely for guest
+software on a guest hart to "yield" control of the guest hart.
+====
+
+=== RISC-V ISA Overview
+
+A RISC-V ISA is defined as a base integer ISA, which must be present in
+any implementation, plus optional extensions to the base ISA. The base
+integer ISAs are very similar to those of the early RISC processors
+except with no branch delay slots and with support for optional
+variable-length instruction encodings. A base is carefully restricted to
+a minimal set of instructions sufficient to provide a reasonable target
+for compilers, assemblers, linkers, and operating systems (with
+additional privileged operations), and so provides a convenient ISA and
+software toolchain `skeleton` around which more customized processor
+ISAs can be built.
+
+Although it is convenient to speak of _the_ RISC-V ISA, RISC-V is
+actually a family of related ISAs, of which there are currently four
+base ISAs. Each base integer instruction set is characterized by the
+width of the integer registers and the corresponding size of the address
+space and by the number of integer registers. There are two primary base
+integer variants, RV32I and RV64I, described in
+<> and <>, which provide 32-bit
+or 64-bit address spaces, respectively. We use the term XLEN to refer to
+the width of an integer register in bits (either 32 or 64).
+Chapter <> describes the RV32E subset variant of the
+RV32I base instruction set, which has been added to support small
+microcontrollers, and which has half the number of integer registers.
+Chapter <> sketches a future RV128I variant of the
+base integer instruction set supporting a flat 128-bit address space
+(XLEN=128). The base integer instruction sets use a two’s-complement
+representation for signed integer values.
+
+Although 64-bit address spaces are a requirement for larger systems, we
+believe 32-bit address spaces will remain adequate for many embedded and
+client devices for decades to come and will be desirable to lower memory
+traffic and energy consumption.
In addition, 32-bit address spaces are
+sufficient for educational purposes. A larger flat 128-bit address space
+might eventually be required, so we ensured this could be accommodated
+within the RISC-V ISA framework.
+
+The four base ISAs in RISC-V are treated as distinct base ISAs. A common
+question is: why is there not a single ISA, and in particular, why is
+RV32I not a strict subset of RV64I? Some earlier ISA designs (SPARC,
+MIPS) adopted a strict superset policy when increasing address space
+size to support running existing 32-bit binaries on new 64-bit hardware.
+
+The main advantage of explicitly separating base ISAs is that each base
+ISA can be optimized for its needs without having to support all the
+operations needed for other base ISAs. For example, RV64I can omit
+instructions and CSRs that are only needed to cope with the narrower
+registers in RV32I. The RV32I variants can use encoding space otherwise
+reserved for instructions only required by wider address-space variants.
+
+The main disadvantage of not treating the design as a single ISA is that
+it complicates the hardware needed to emulate one base ISA on another
+(e.g., RV32I on RV64I). However, differences in addressing and illegal
+instruction traps generally mean some mode switch would be required in
+hardware in any case even with full superset instruction encodings, and
+the different RISC-V base ISAs are similar enough that supporting
+multiple versions is relatively low cost. Although some have proposed
+that the strict superset design would allow legacy 32-bit libraries to
+be linked with 64-bit code, this is impractical, even with
+compatible encodings, due to the differences in software calling
+conventions and system-call interfaces.
+
+The RISC-V privileged architecture provides fields in `misa` to control
+the unprivileged ISA at each level to support emulating different base
+ISAs on the same hardware.
We note that newer SPARC and MIPS ISA +revisions have deprecated support for running 32-bit code unchanged on +64-bit systems. + +A related question is why there is a different encoding for 32-bit adds +in RV32I (ADD) and RV64I (ADDW)? The ADDW opcode could be used for +32-bit adds in RV32I and ADDD for 64-bit adds in RV64I, instead of the +existing design which uses the same opcode ADD for 32-bit adds in RV32I +and 64-bit adds in RV64I with a different opcode ADDW for 32-bit adds in +RV64I. This would also be more consistent with the use of the same LW +opcode for 32-bit load in both RV32I and RV64I. The very first versions +of RISC-V ISA did have a variant of this alternate design, but the +RISC-V design was changed to the current choice in January 2011. Our +focus was on supporting 32-bit integers in the 64-bit ISA not on +providing compatibility with the 32-bit ISA, and the motivation was to +remove the asymmetry that arose from having not all opcodes in RV32I +have a *W suffix (e.g., ADDW, but AND not ANDW). In hindsight, this was +perhaps not well-justified and a consequence of designing both ISAs at +the same time as opposed to adding one later to sit on top of another, +and also from a belief we had to fold platform requirements into the ISA +spec which would imply that all the RV32I instructions would have been +required in RV64I. It is too late to change the encoding now, but this +is also of little practical consequence for the reasons stated above. + +It has been noted we could enable the *W* variants as an extension to +RV32I systems to provide a common encoding across RV64I and a future +RV32 variant. + +RISC-V has been designed to support extensive customization and +specialization. Each base integer ISA can be extended with one or more +optional instruction-set extensions. An extension may be categorized as +either standard, custom, or non-conforming. 
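To make the ADD/ADDW asymmetry discussed above concrete, the following Python sketch models the RV64I semantics: ADD performs a full 64-bit addition, while ADDW adds the lower 32 bits of its operands and sign-extends the 32-bit result to 64 bits. The helper names are illustrative only, not part of any RISC-V toolchain.

```python
MASK64 = (1 << 64) - 1

def sign_extend(value, bits):
    # Interpret the low `bits` bits of value as a two's-complement signed number.
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)

def add(rs1, rs2):
    # RV64I ADD: 64-bit addition; overflow is ignored (wraps modulo 2**64).
    return sign_extend(rs1 + rs2, 64)

def addw(rs1, rs2):
    # RV64I ADDW: add the lower 32 bits, then sign-extend the 32-bit result.
    return sign_extend(rs1 + rs2, 32)

# 0x7FFFFFFF + 1 overflows a signed 32-bit value: ADDW wraps to a negative
# 32-bit result (sign-extended), while ADD returns the plain 64-bit sum.
print(hex(add(0x7FFFFFFF, 1)))            # 0x80000000
print(hex(addw(0x7FFFFFFF, 1) & MASK64))  # 0xffffffff80000000
```

On RV32I the same ADD opcode performs the 32-bit addition directly, which is exactly the encoding asymmetry the text describes.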
For this purpose, we divide +each RISC-V instruction-set encoding space (and related encoding spaces +such as the CSRs) into three disjoint categories: _standard_, +_reserved_, and _custom_. Standard extensions and encodings are defined +by the Foundation; any extensions not defined by the Foundation are +_non-standard_. Each base ISA and its standard extensions use only +standard encodings, and shall not conflict with each other in their uses +of these encodings. Reserved encodings are currently not defined but are +saved for future standard extensions; once thus used, they become +standard encodings. Custom encodings shall never be used for standard +extensions and are made available for vendor-specific non-standard +extensions. Non-standard extensions are either custom extensions, that +use only custom encodings, or _non-conforming_ extensions, that use any +standard or reserved encoding. Instruction-set extensions are generally +shared but may provide slightly different functionality depending on the +base ISA. Chapter <> describes various ways +of extending the RISC-V ISA. We have also developed a naming convention +for RISC-V base instructions and instruction-set extensions, described +in detail in Chapter <>. + +To support more general software development, a set of standard +extensions are defined to provide integer multiply/divide, atomic +operations, and single and double-precision floating-point arithmetic. +The base integer ISA is named `I` (prefixed by RV32 or RV64 depending +on integer register width), and contains integer computational +instructions, integer loads, integer stores, and control-flow +instructions. The standard integer multiplication and division extension +is named `M`, and adds instructions to multiply and divide values held +in the integer registers. The standard atomic instruction extension, +denoted by `A`, adds instructions that atomically read, modify, and +write memory for inter-processor synchronization. 
The standard
+single-precision floating-point extension, denoted by `F`, adds
+floating-point registers, single-precision computational instructions,
+and single-precision loads and stores. The standard double-precision
+floating-point extension, denoted by `D`, expands the floating-point
+registers, and adds double-precision computational instructions, loads,
+and stores. The standard `C` compressed instruction extension provides
+narrower 16-bit forms of common instructions.
+
+Beyond the base integer ISA and the standard GC extensions, we believe
+it is rare that a new instruction will provide a significant benefit for
+all applications, although it may be very beneficial for a certain
+domain. As energy efficiency concerns are forcing greater
+specialization, we believe it is important to simplify the required
+portion of an ISA specification. Whereas other architectures usually
+treat their ISA as a single entity, which changes to a new version as
+instructions are added over time, RISC-V will endeavor to keep the base
+and each standard extension constant over time, and instead layer new
+instructions as further optional extensions. For example, the base
+integer ISAs will continue as fully supported standalone ISAs,
+regardless of any subsequent extensions.
+
+=== Memory
+
+A RISC-V hart has a single byte-addressable address space of
+latexmath:[$2^{\text{XLEN}}$] bytes for all memory accesses. A _word_ of
+memory is defined as 32 bits (4 bytes). Correspondingly, a _halfword_ is
+16 bits (2 bytes), a _doubleword_ is 64 bits (8 bytes), and a _quadword_
+is 128 bits (16 bytes). The memory address space is
+circular, so that the byte at address latexmath:[$2^{\text{XLEN}}-1$] is
+adjacent to the byte at address zero. Accordingly, memory address
+computations done by the hardware ignore overflow and instead wrap
+around modulo latexmath:[$2^{\text{XLEN}}$].
+
+The execution environment determines the mapping of hardware resources
+into a hart’s address space.
Different address ranges of a hart’s +address space may (1) be vacant, or (2) contain _main memory_, or +(3) contain one or more _I/O devices_. Reads and writes of I/O devices +may have visible side effects, but accesses to main memory cannot. +Although it is possible for the execution environment to call everything +in a hart’s address space an I/O device, it is usually expected that +some portion will be specified as main memory. + +When a RISC-V platform has multiple harts, the address spaces of any two +harts may be entirely the same, or entirely different, or may be partly +different but sharing some subset of resources, mapped into the same or +different address ranges. + +For a purely `bare metal` environment, all harts may see an identical +address space, accessed entirely by physical addresses. However, when +the execution environment includes an operating system employing address +translation, it is common for each hart to be given a virtual address +space that is largely or entirely its own. + +(((memory access, implicit and explicit))) +Executing each RISC-V machine instruction entails one or more memory +accesses, subdivided into _implicit_ and _explicit_ accesses. For each +instruction executed, an _implicit_ memory read (instruction fetch) is +done to obtain the encoded instruction to execute. Many RISC-V +instructions perform no further memory accesses beyond instruction +fetch. Specific load and store instructions perform an _explicit_ read +or write of memory at an address determined by the instruction. The +execution environment may dictate that instruction execution performs +other _implicit_ memory accesses (such as to implement address +translation) beyond those documented for the unprivileged ISA. + +The execution environment determines what portions of the non-vacant +address space are accessible for each kind of memory access. 
For
+example, the set of locations that can be implicitly read for
+instruction fetch may or may not have any overlap with the set of
+locations that can be explicitly read by a load instruction; and the set
+of locations that can be explicitly written by a store instruction may
+be only a subset of locations that can be read. Ordinarily, if an
+instruction attempts to access memory at an inaccessible address, an
+exception is raised for the instruction. Vacant locations in the address
+space are never accessible.
+
+Except when specified otherwise, implicit reads that do not raise an
+exception and that have no side effects may occur arbitrarily early and
+speculatively, even before the machine could possibly prove that the
+read will be needed. For instance, a valid implementation could attempt
+to read all of main memory at the earliest opportunity, cache as many
+fetchable (executable) bytes as possible for later instruction fetches,
+and avoid reading main memory for instruction fetches ever again. To
+ensure that certain implicit reads are ordered only after writes to the
+same memory locations, software must execute specific fence or
+cache-control instructions defined for this purpose (such as the FENCE.I
+instruction defined in Chapter <>).
+
+(((memory access, implicit and explicit)))
+The memory accesses (implicit or explicit) made by a hart may appear to
+occur in a different order as perceived by another hart or by any other
+agent that can access the same memory. This perceived reordering of
+memory accesses is always constrained, however, by the applicable memory
+consistency model. The default memory consistency model for RISC-V is
+the RISC-V Weak Memory Ordering (RVWMO), defined in
+Chapter <> and in appendices. Optionally,
+an implementation may adopt the stronger model of Total Store Ordering,
+as defined in Chapter <>. The execution environment
+may also add constraints that further limit the perceived reordering of
+memory accesses.
Since the RVWMO model is the weakest model allowed for
+any RISC-V implementation, software written for this model is compatible
+with the actual memory consistency rules of all RISC-V implementations.
+As with implicit reads, software must execute fence or cache-control
+instructions to ensure specific ordering of memory accesses beyond the
+requirements of the assumed memory consistency model and execution
+environment.
+
+=== Base Instruction-Length Encoding
+
+The base RISC-V ISA has fixed-length 32-bit instructions that must be
+naturally aligned on 32-bit boundaries. However, the standard RISC-V
+encoding scheme is designed to support ISA extensions with
+variable-length instructions, where each instruction can be any number
+of 16-bit instruction _parcels_ in length and parcels are naturally
+aligned on 16-bit boundaries. The standard compressed ISA extension
+described in Chapter <> reduces code size by
+providing compressed 16-bit instructions and relaxes the alignment
+constraints to allow all instructions (16-bit and 32-bit) to be aligned
+on any 16-bit boundary to improve code density.
+
+We use the term IALIGN (measured in bits) to refer to the
+instruction-address alignment constraint the implementation enforces.
+IALIGN is 32 bits in the base ISA, but some ISA extensions, including
+the compressed ISA extension, relax IALIGN to 16 bits. IALIGN may not
+take on any value other than 16 or 32.
+
+(((ILEN)))
+We use the term ILEN (measured in bits) to refer to the maximum
+instruction length supported by an implementation, which is always a
+multiple of IALIGN. For implementations supporting only a base
+instruction set, ILEN is 32 bits. Implementations supporting longer
+instructions have larger values of ILEN.
+
+<> illustrates the standard
+RISC-V instruction-length encoding convention. All the 32-bit
+instructions in the base ISA have their lowest two bits set to `11`.
The
+optional compressed 16-bit instruction-set extensions have their lowest
+two bits equal to `00`, `01`, or `10`.
+
+==== Expanded Instruction-Length Encoding
+
+(((instruction length encoding)))
+A portion of the 32-bit instruction-encoding space has been tentatively
+allocated for instructions longer than 32 bits. The entirety of this
+space is reserved at this time, and the following proposal for encoding
+instructions longer than 32 bits is not considered frozen.
+
+Standard instruction-set extensions encoded with more than 32 bits have
+additional low-order bits set to `1`, with the conventions for 48-bit
+and 64-bit lengths shown in
+<>. Instruction lengths
+between 80 bits and 176 bits are encoded using a 3-bit field in bits
+[14:12] giving the number of 16-bit words in addition to the first
+5latexmath:[$\times$]16-bit words. The encoding with bits [14:12] set to
+`111` is reserved for future longer instruction encodings.
+
+//table needs to be fixed
+[[instlengthcode]]
+.Instruction length code
+[cols="^,^,^,^,<",]
+|===
+| | | |`xxxxxxxxxxxxxxaa` |16-bit (`aa` latexmath:[$\neq$] `11`)
+
+| | | | |
+
+| | |`xxxxxxxxxxxxxxxx` |`xxxxxxxxxxxbbb11` |32-bit (`bbb`
+latexmath:[$\neq$] `111`)
+
+| | | | |
+
+| |latexmath:[$\cdot\cdot\cdot$]`xxxx` |`xxxxxxxxxxxxxxxx`
+|`xxxxxxxxxx011111` |48-bit
+
+| | | | |
+
+| |latexmath:[$\cdot\cdot\cdot$]`xxxx` |`xxxxxxxxxxxxxxxx`
+|`xxxxxxxxx0111111` |64-bit
+
+| | | | |
+
+| |latexmath:[$\cdot\cdot\cdot$]`xxxx` |`xxxxxxxxxxxxxxxx`
+|`xnnnxxxxx1111111` |(80+16*`nnn`)-bit, `nnn`latexmath:[$\neq$]`111`
+
+| | | | |
+
+| |latexmath:[$\cdot\cdot\cdot$]`xxxx` |`xxxxxxxxxxxxxxxx`
+|`x111xxxxx1111111` |Reserved for latexmath:[$\geq$]192-bits
+
+| | | | |
+
+|Byte Address: |base+4 |base+2 |base |
+|===
+
+Given the code size and energy savings of a compressed format, we wanted
+to build in support for a compressed format to the ISA encoding scheme
+rather than adding this as an afterthought, but to allow simpler
+implementations
we didn’t want to make the compressed format mandatory.
+We also wanted to optionally allow longer instructions to support
+experimentation and larger instruction-set extensions. Although our
+encoding convention required a tighter encoding of the core RISC-V ISA,
+this has several beneficial effects.
+(((IMAFD)))
+
+An implementation of the standard IMAFD ISA need only hold the
+most-significant 30 bits in instruction caches (a 6.25% saving). On
+instruction cache refills, any instructions encountered with either low
+bit clear should be recoded into illegal 30-bit instructions before
+storing in the cache to preserve illegal instruction exception behavior.
+
+Perhaps more importantly, by condensing our base ISA into a subset of
+the 32-bit instruction word, we leave more space available for
+non-standard and custom extensions. In particular, the base RV32I ISA
+uses less than 1/8 of the encoding space in the 32-bit instruction word.
+As described in Chapter <>, an
+implementation that does not require support for the standard compressed
+instruction extension can map 3 additional non-conforming 30-bit
+instruction spaces into the 32-bit fixed-width format, while preserving
+support for standard latexmath:[$\geq$]32-bit instruction-set
+extensions. Further, if the implementation also does not need
+instructions latexmath:[$>$]32 bits in length, it can recover a further
+four major opcodes for non-conforming extensions.
+
+Encodings with bits [15:0] all zeros are defined as illegal
+instructions. These instructions are considered to be of minimal length:
+16 bits if any 16-bit instruction-set extension is present, otherwise 32
+bits. The encoding with bits [ILEN-1:0] all ones is also illegal; this
+instruction is considered to be ILEN bits long.
+
+We consider it a feature that any length of instruction containing all
+zero bits is not legal, as this quickly traps erroneous jumps into
+zeroed memory regions.
Similarly, we also reserve the instruction
+encoding containing all ones to be an illegal instruction, to catch the
+other common pattern observed with unprogrammed non-volatile memory
+devices, disconnected memory buses, or broken memory devices.
+
+Software can rely on a naturally aligned 32-bit word containing zero to
+act as an illegal instruction on all RISC-V implementations, to be used
+by software where an illegal instruction is explicitly desired. Defining
+a corresponding known illegal value for all ones is more difficult due
+to the variable-length encoding. Software cannot generally use the
+illegal value of ILEN bits of all 1s, as software might not know ILEN
+for the eventual target machine (e.g., if software is compiled into a
+standard binary library used by many different machines). Defining a
+32-bit word of all ones as illegal was also considered, as all machines
+must support a 32-bit instruction size, but this requires the
+instruction-fetch unit on machines with ILENlatexmath:[$>$]32 to report an
+illegal instruction exception rather than an access-fault exception when
+such an instruction borders a protection boundary, complicating
+variable-instruction-length fetch and decode.
+(((endian, little and big)))
+
+RISC-V base ISAs have either little-endian or big-endian memory systems,
+with the privileged architecture further defining bi-endian operation.
+Instructions are stored in memory as a sequence of 16-bit little-endian
+parcels, regardless of memory system endianness. Parcels forming one
+instruction are stored at increasing halfword addresses, with the
+lowest-addressed parcel holding the lowest-numbered bits in the
+instruction specification.
+
+We originally chose little-endian byte ordering for the RISC-V memory
+system because little-endian systems are currently dominant commercially
+(all x86 systems; iOS, Android, and Windows for ARM).
A minor point is +that we have also found little-endian memory systems to be more natural +for hardware designers. However, certain application areas, such as IP +networking, operate on big-endian data structures, and certain legacy +code bases have been built assuming big-endian processors, so we have +defined big-endian and bi-endian variants of RISC-V. + +We have to fix the order in which instruction parcels are stored in +memory, independent of memory system endianness, to ensure that the +length-encoding bits always appear first in halfword address order. This +allows the length of a variable-length instruction to be quickly +determined by an instruction-fetch unit by examining only the first few +bits of the first 16-bit instruction parcel. + +We further make the instruction parcels themselves little-endian to +decouple the instruction encoding from the memory system endianness +altogether. This design benefits both software tooling and bi-endian +hardware. Otherwise, for instance, a RISC-V assembler or disassembler +would always need to know the intended active endianness, despite that +in bi-endian systems, the endianness mode might change dynamically +during execution. In contrast, by giving instructions a fixed +endianness, it is sometimes possible for carefully written software to +be endianness-agnostic even in binary form, much like +position-independent code. + +The choice to have instructions be only little-endian does have +consequences, however, for RISC-V software that encodes or decodes +machine instructions. Big-endian JIT compilers, for example, must swap +the byte order when storing to instruction memory. + +Once we had decided to fix on a little-endian instruction encoding, this +naturally led to placing the length-encoding bits in the LSB positions +of the instruction format to avoid breaking up opcode fields. 
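The length-determination scheme described above (low bits of the first little-endian parcel encode the total length) can be sketched in a few lines. This is a non-normative Python sketch, following the instruction-length table above; the function names are our own, and the reserved encoding for instructions of 192 bits or longer is treated as an error.

```python
def first_parcel(mem: bytes) -> int:
    # Instruction parcels are 16-bit little-endian, regardless of the
    # memory system's data endianness, so the length-encoding bits are
    # always in the lowest-addressed bytes.
    return mem[0] | (mem[1] << 8)

def instr_length_bits(parcel: int) -> int:
    # Length is determined entirely by the low bits of the first parcel.
    if parcel & 0b11 != 0b11:
        return 16                       # aa != 11: compressed 16-bit
    if (parcel >> 2) & 0b111 != 0b111:
        return 32                       # bbb != 111: 32-bit
    if (parcel >> 5) & 0b1 == 0:
        return 48                       # low six bits are 011111
    if (parcel >> 6) & 0b1 == 0:
        return 64                       # low seven bits are 0111111
    nnn = (parcel >> 12) & 0b111        # low seven bits are 1111111
    if nnn != 0b111:
        return 80 + 16 * nnn            # (80 + 16*nnn)-bit
    raise ValueError("reserved encoding for instructions >= 192 bits")
```

For example, a parcel whose low byte is `0x33` (the major opcode of ADD) decodes as 32 bits. Note that the sketch reports an all-zeros parcel as 16 bits; per the text above, that illegal encoding is instead considered 32 bits long on implementations without any 16-bit extension, a case this sketch does not model.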
+ +[[trap-defn]] +=== Exceptions, Traps, and Interrupts +(((exceptions))) +(((traps))) +(((interrupts))) + +We use the term _exception_ to refer to an unusual condition occurring +at run time associated with an instruction in the current RISC-V hart. +We use the term _interrupt_ to refer to an external asynchronous event +that may cause a RISC-V hart to experience an unexpected transfer of +control. We use the term _trap_ to refer to the transfer of control to a +trap handler caused by either an exception or an interrupt. + +The instruction descriptions in following chapters describe conditions +that can raise an exception during execution. The general behavior of +most RISC-V EEIs is that a trap to some handler occurs when an exception +is signaled on an instruction (except for floating-point exceptions, +which, in the standard floating-point extensions, do not cause traps). +The manner in which interrupts are generated, routed to, and enabled by +a hart depends on the EEI. + +Our use of `exception` and `trap` is compatible with that in the +IEEE-754 floating-point standard. + +How traps are handled and made visible to software running on the hart +depends on the enclosing execution environment. From the perspective of +software running inside an execution environment, traps encountered by a +hart at runtime can have four different effects: + +Contained Trap::: + The trap is visible to, and handled by, software running inside the + execution environment. For example, in an EEI providing both + supervisor and user mode on harts, an ECALL by a user-mode hart will + generally result in a transfer of control to a supervisor-mode handler + running on the same hart. Similarly, in the same environment, when a + hart is interrupted, an interrupt handler will be run in supervisor + mode on the hart. 
+Requested Trap::: + The trap is a synchronous exception that is an explicit call to the + execution environment requesting an action on behalf of software + inside the execution environment. An example is a system call. In this + case, execution may or may not resume on the hart after the requested + action is taken by the execution environment. For example, a system + call could remove the hart or cause an orderly termination of the + entire execution environment. +Invisible Trap::: + The trap is handled transparently by the execution environment and + execution resumes normally after the trap is handled. Examples include + emulating missing instructions, handling non-resident page faults in a + demand-paged virtual-memory system, or handling device interrupts for + a different job in a multiprogrammed machine. In these cases, the + software running inside the execution environment is not aware of the + trap (we ignore timing effects in these definitions). +Fatal Trap::: + The trap represents a fatal failure and causes the execution + environment to terminate execution. Examples include failing a + virtual-memory page-protection check or allowing a watchdog timer to + expire. Each EEI should define how execution is terminated and + reported to an external environment. + +<> shows the characteristics of each +kind of trap. _Notes: 1) Termination may be requested. 2) +Imprecise fatal traps might be observable by software._ + +[[trapcharacteristics]] +.Characteristics of traps +[cols="<,^,^,^,^",options="header",] +|=== +| |Contained |Requested |Invisible |Fatal +|Execution terminates |No |Nolatexmath:[$^{1}$] |No |Yes +|Software is oblivious |No |No |Yes |Yeslatexmath:[$^{2}$] +|Handled by environment |No |Yes |Yes |Yes +|=== + +The EEI defines for each trap whether it is handled precisely, though +the recommendation is to maintain preciseness where possible. 
Contained
+and requested traps can be observed to be imprecise by software inside
+the execution environment. Invisible traps, by definition, cannot be
+observed to be precise or imprecise by software running inside the
+execution environment. Fatal traps can be observed to be imprecise by
+software running inside the execution environment, if known-errorful
+instructions do not cause immediate termination.
+
+Because this document describes unprivileged instructions, traps are
+rarely mentioned. Architectural means to handle contained traps are
+defined in the privileged architecture manual, along with other features
+to support richer EEIs. Unprivileged instructions that are defined
+solely to cause requested traps are documented here. Invisible traps
+are, by their nature, out of scope for this document. Instruction
+encodings that are not defined here and not defined by some other means
+may cause a fatal trap.
+
+=== UNSPECIFIED Behaviors and Values
+(((unspecified, behaviors)))
+(((unspecified, values)))
+
+The architecture fully describes what implementations must do and any
+constraints on what they may do. In cases where the architecture
+intentionally does not constrain implementations, the term _unspecified_ is
+explicitly used.
+
+The term _unspecified_ refers to a behavior or value that is intentionally
+unconstrained. The definition of these behaviors or values is open to
+extensions, platform standards, or implementations. Extensions, platform
+standards, or implementation documentation may provide normative content
+to further constrain cases that the base architecture defines as
+_unspecified_.
+
+Like the base architecture, extensions should fully describe allowable
+behavior and values and use the term _unspecified_ for cases that are intentionally
+unconstrained. These cases may be constrained or defined by other
+extensions, platform standards, or implementations.
diff --git a/src/intro.adoc b/src/intro.adoc
new file mode 100644
index 0000000..d57a12a
--- /dev/null
+++ b/src/intro.adoc
@@ -0,0 +1,731 @@
+== Introduction
+
+RISC-V (pronounced `risk-five`) is a new instruction-set architecture
+(ISA) that was originally designed to support computer architecture
+research and education, but which we now hope will also become a
+standard free and open architecture for industry implementations. Our
+goals in defining RISC-V include:
+
+* A completely _open_ ISA that is freely available to academia and
+industry.
+* A _real_ ISA suitable for direct native hardware implementation, not
+just simulation or binary translation.
+* An ISA that avoids `over-architecting` for a particular
+microarchitecture style (e.g., microcoded, in-order, decoupled,
+out-of-order) or implementation technology (e.g., full-custom, ASIC,
+FPGA), but which allows efficient implementation in any of these.
+* An ISA separated into a _small_ base integer ISA, usable by itself as
+a base for customized accelerators or for educational purposes, and
+optional standard extensions, to support general-purpose software
+development.
+* Support for the revised 2008 IEEE-754 floating-point standard.
+* An ISA supporting extensive ISA extensions and specialized variants.
+* Both 32-bit and 64-bit address space variants for applications,
+operating system kernels, and hardware implementations.
+* An ISA with support for highly parallel multicore or manycore
+implementations, including heterogeneous multiprocessors.
+* Optional _variable-length instructions_ to both expand available
+instruction encoding space and to support an optional _dense instruction
+encoding_ for improved performance, static code size, and energy
+efficiency.
+* A fully virtualizable ISA to ease hypervisor development.
+* An ISA that simplifies experiments with new privileged architecture
+designs.
+
+[TIP]
+====
+Commentary on our design decisions is formatted as in this paragraph.
+This non-normative text can be skipped if the reader is only interested +in the specification itself. +==== + +[NOTE] +==== +The name RISC-V was chosen to represent the fifth major RISC ISA design +from UC Berkeley (RISC-I cite:[riscI-isca1981], RISC-II cite:[Katevenis:1983], SOAR cite:[Ungar:1984], and SPUR cite:[spur-jsscc1989] were the first +four). We also pun on the use of the Roman numeral `V` to signify +`variations` and `vectors`, as support for a range of architecture +research, including various data-parallel accelerators, is an explicit +goal of the ISA design. +==== +(((ISA, definition))) + +The RISC-V ISA is defined avoiding implementation details as much as +possible (although commentary is included on implementation-driven +decisions) and should be read as the software-visible interface to a +wide variety of implementations rather than as the design of a +particular hardware artifact. The RISC-V manual is structured in two +volumes. This volume covers the design of the base _unprivileged_ +instructions, including optional unprivileged ISA extensions. +Unprivileged instructions are those that are generally usable in all +privilege modes in all privileged architectures, though behavior might +vary depending on privilege mode and privilege architecture. The second +volume provides the design of the first (`classic`) privileged +architecture. The manuals use IEC 80000-13:2008 conventions, with a byte +of 8 bits. + +[TIP] +==== +In the unprivileged ISA design, we tried to remove any dependence on +particular microarchitectural features, such as cache line size, or on +privileged architecture details, such as page translation. This is both +for simplicity and to allow maximum flexibility for alternative +microarchitectures or alternative privileged architectures. 
+==== + +=== RISC-V Hardware Platform Terminology + +A RISC-V hardware platform can contain one or more RISC-V-compatible +processing cores together with other non-RISC-V-compatible cores, +fixed-function accelerators, various physical memory structures, I/O +devices, and an interconnect structure to allow the components to +communicate. +(((core, component))) + +A component is termed a _core_ if it contains an independent instruction +fetch unit. A RISC-V-compatible core might support multiple +RISC-V-compatible hardware threads, or _harts_, through multithreading. +(((core, extensions, coprocessor))) + +A RISC-V core might have additional specialized instruction-set +extensions or an added _coprocessor_. We use the term _coprocessor_ to +refer to a unit that is attached to a RISC-V core and is mostly +sequenced by a RISC-V instruction stream, but which contains additional +architectural state and instruction-set extensions, and possibly some +limited autonomy relative to the primary RISC-V instruction stream. + +We use the term _accelerator_ to refer to either a non-programmable +fixed-function unit or a core that can operate autonomously but is +specialized for certain tasks. In RISC-V systems, we expect many +programmable accelerators will be RISC-V-based cores with specialized +instruction-set extensions and/or customized coprocessors. An important +class of RISC-V accelerators are I/O accelerators, which offload I/O +processing tasks from the main application cores. +(((core, accelerator))) + +The system-level organization of a RISC-V hardware platform can range +from a single-core microcontroller to a many-thousand-node cluster of +shared-memory manycore server nodes. Even small systems-on-a-chip might +be structured as a hierarchy of multicomputers and/or multiprocessors to +modularize development effort or to provide secure isolation between +subsystems. 
+(((core, cluster, multiprocessors))) + +=== RISC-V Software Execution Environments and Harts + +The behavior of a RISC-V program depends on the execution environment in +which it runs. A RISC-V execution environment interface (EEI) defines +the initial state of the program, the number and type of harts in the +environment including the privilege modes supported by the harts, the +accessibility and attributes of memory and I/O regions, the behavior of +all legal instructions executed on each hart (i.e., the ISA is one +component of the EEI), and the handling of any interrupts or exceptions +raised during execution including environment calls. Examples of EEIs +include the Linux application binary interface (ABI), or the RISC-V +supervisor binary interface (SBI). The implementation of a RISC-V +execution environment can be pure hardware, pure software, or a +combination of hardware and software. For example, opcode traps and +software emulation can be used to implement functionality not provided +in hardware. Examples of execution environment implementations include: + +* `Bare metal` hardware platforms where harts are directly implemented +by physical processor threads and instructions have full access to the +physical address space. The hardware platform defines an execution +environment that begins at power-on reset. +* RISC-V operating systems that provide multiple user-level execution +environments by multiplexing user-level harts onto available physical +processor threads and by controlling access to memory via virtual +memory. +* RISC-V hypervisors that provide multiple supervisor-level execution +environments for guest operating systems. +* RISC-V emulators, such as Spike, QEMU or rv8, which emulate RISC-V +harts on an underlying x86 system, and which can provide either a +user-level or a supervisor-level execution environment. 
[TIP]
+====
+A bare hardware platform can be considered to define an EEI, where the
+accessible harts, memory, and other devices populate the environment,
+and the initial state is that at power-on reset. Generally, most
+software is designed to use a more abstract interface to the hardware,
+as more abstract EEIs provide greater portability across different
+hardware platforms. Often EEIs are layered on top of one another, where
+one higher-level EEI uses another lower-level EEI.
+====
+(((hart, execution environment)))
+
+From the perspective of software running in a given execution
+environment, a hart is a resource that autonomously fetches and executes
+RISC-V instructions within that execution environment. In this respect,
+a hart behaves like a hardware thread resource even if time-multiplexed
+onto real hardware by the execution environment. Some EEIs support the
+creation and destruction of additional harts, for example, via
+environment calls to fork new harts.
+
+The execution environment is responsible for ensuring the eventual
+forward progress of each of its harts. For a given hart, that
+responsibility is suspended while the hart is exercising a mechanism
+that explicitly waits for an event, such as the wait-for-interrupt
+instruction defined in Volume II of this specification; and that
+responsibility ends if the hart is terminated. The following events
+constitute forward progress:
+
+* The retirement of an instruction.
+* A trap, as defined in <>.
+* Any other event defined by an extension to constitute forward
+progress.
+
+[TIP]
+====
+The term hart was introduced in the work on Lithe cite:[lithe-pan-hotpar09] and cite:[lithe-pan-pldi10] to provide a term to
+represent an abstract execution resource as opposed to a software thread
+programming abstraction.
The important distinction between a hardware thread (hart) and a
+software thread context is that the software running inside an execution
+environment is not responsible for causing progress of each of its
+harts; that is the responsibility of the outer execution environment. So
+the environment’s harts operate like hardware threads from the
+perspective of the software inside the execution environment.
+
+An execution environment implementation might time-multiplex a set of
+guest harts onto fewer host harts provided by its own execution
+environment but must do so in a way that guest harts operate like
+independent hardware threads. In particular, if there are more guest
+harts than host harts, then the execution environment must be able to
+preempt the guest harts and must not wait indefinitely for guest
+software on a guest hart to `yield` control of the guest hart.
+====
+
+=== RISC-V ISA Overview
+
+A RISC-V ISA is defined as a base integer ISA, which must be present in
+any implementation, plus optional extensions to the base ISA. The base
+integer ISAs are very similar to those of the early RISC processors
+except with no branch delay slots and with support for optional
+variable-length instruction encodings. A base is carefully restricted to
+a minimal set of instructions sufficient to provide a reasonable target
+for compilers, assemblers, linkers, and operating systems (with
+additional privileged operations), and so provides a convenient ISA and
+software toolchain `skeleton` around which more customized processor
+ISAs can be built.
+
+Although it is convenient to speak of _the_ RISC-V ISA, RISC-V is
+actually a family of related ISAs, of which there are currently four
+base ISAs. Each base integer instruction set is characterized by the
+width of the integer registers and the corresponding size of the address
+space and by the number of integer registers.
There are two primary base
+integer variants, RV32I and RV64I, described in
+<> and <>, which provide 32-bit
+or 64-bit address spaces respectively. We use the term XLEN to refer to
+the width of an integer register in bits (either 32 or 64).
+<> describes the RV32E subset variant of the
+RV32I base instruction set, which has been added to support small
+microcontrollers, and which has half the number of integer registers.
+<> sketches a future RV128I variant of the
+base integer instruction set supporting a flat 128-bit address space
+(XLEN=128). The base integer instruction sets use a two’s-complement
+representation for signed integer values.
+
+Although 64-bit address spaces are a requirement for larger systems, we
+believe 32-bit address spaces will remain adequate for many embedded and
+client devices for decades to come and will be desirable to lower memory
+traffic and energy consumption. In addition, 32-bit address spaces are
+sufficient for educational purposes. A larger flat 128-bit address space
+might eventually be required, so we ensured this could be accommodated
+within the RISC-V ISA framework.
+
+The four base ISAs in RISC-V are treated as distinct base ISAs. A common
+question is why there is not a single ISA, and in particular, why is
+RV32I not a strict subset of RV64I? Some earlier ISA designs (SPARC,
+MIPS) adopted a strict superset policy when increasing address space
+size to support running existing 32-bit binaries on new 64-bit hardware.
+
+The main advantage of explicitly separating base ISAs is that each base
+ISA can be optimized for its needs without having to support all the
+operations needed for other base ISAs. For example, RV64I can omit
+instructions and CSRs that are only needed to cope with the narrower
+registers in RV32I. The RV32I variants can use encoding space otherwise
+reserved for instructions only required by wider address-space variants.
+
+The main disadvantage of not treating the design as a single ISA is that
+it complicates the hardware needed to emulate one base ISA on another
+(e.g., RV32I on RV64I). However, differences in addressing and illegal
+instruction traps generally mean some mode switch would be required in
+hardware in any case even with full superset instruction encodings, and
+the different RISC-V base ISAs are similar enough that supporting
+multiple versions is relatively low cost. Although some have proposed
+that the strict superset design would allow legacy 32-bit libraries to
+be linked with 64-bit code, this is impractical in practice, even with
+compatible encodings, due to the differences in software calling
+conventions and system-call interfaces.
+
+The RISC-V privileged architecture provides fields in `misa` to control
+the unprivileged ISA at each level to support emulating different base
+ISAs on the same hardware. We note that newer SPARC and MIPS ISA
+revisions have deprecated support for running 32-bit code unchanged on
+64-bit systems.
+
+A related question is why there is a different encoding for 32-bit adds
+in RV32I (ADD) and RV64I (ADDW)? The ADDW opcode could be used for
+32-bit adds in RV32I and ADDD for 64-bit adds in RV64I, instead of the
+existing design which uses the same opcode ADD for 32-bit adds in RV32I
+and 64-bit adds in RV64I with a different opcode ADDW for 32-bit adds in
+RV64I. This would also be more consistent with the use of the same LW
+opcode for 32-bit load in both RV32I and RV64I. The very first versions
+of the RISC-V ISA did have a variant of this alternate design, but the
+RISC-V design was changed to the current choice in January 2011. Our
+focus was on supporting 32-bit integers in the 64-bit ISA, not on
+providing compatibility with the 32-bit ISA, and the motivation was to
+remove the asymmetry that arose from having not all opcodes in RV32I
+have a *W suffix (e.g., ADDW, but AND not ANDW).
In hindsight, this was +perhaps not well-justified and a consequence of designing both ISAs at +the same time as opposed to adding one later to sit on top of another, +and also from a belief we had to fold platform requirements into the ISA +spec which would imply that all the RV32I instructions would have been +required in RV64I. It is too late to change the encoding now, but this +is also of little practical consequence for the reasons stated above. + +It has been noted we could enable the *W variants as an extension to +RV32I systems to provide a common encoding across RV64I and a future +RV32 variant. + +RISC-V has been designed to support extensive customization and +specialization. Each base integer ISA can be extended with one or more +optional instruction-set extensions. An extension may be categorized as +either standard, custom, or non-conforming. For this purpose, we divide +each RISC-V instruction-set encoding space (and related encoding spaces +such as the CSRs) into three disjoint categories: _standard_, +_reserved_, and _custom_. Standard extensions and encodings are defined +by the Foundation; any extensions not defined by the Foundation are +_non-standard_. Each base ISA and its standard extensions use only +standard encodings, and shall not conflict with each other in their uses +of these encodings. Reserved encodings are currently not defined but are +saved for future standard extensions; once thus used, they become +standard encodings. Custom encodings shall never be used for standard +extensions and are made available for vendor-specific non-standard +extensions. Non-standard extensions are either custom extensions, that +use only custom encodings, or _non-conforming_ extensions, that use any +standard or reserved encoding. Instruction-set extensions are generally +shared but may provide slightly different functionality depending on the +base ISA. <> describes various ways +of extending the RISC-V ISA. 
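The W-suffix behavior discussed above can be sketched in C. This is a non-normative illustration (the helper names `rv64_add` and `rv64_addw` are made up for this sketch): RV64 ADD operates on the full 64-bit registers, while ADDW adds the lower 32 bits and sign-extends the 32-bit sum to 64 bits.

```c
#include <stdint.h>

/* Non-normative sketch: ADD wraps modulo 2^64. */
static uint64_t rv64_add(uint64_t a, uint64_t b)
{
    return a + b;
}

/* Non-normative sketch: ADDW keeps the low 32 bits of the sum and
 * sign-extends them to 64 bits. */
static uint64_t rv64_addw(uint64_t a, uint64_t b)
{
    uint32_t w = (uint32_t)(a + b);          /* low 32 bits of the sum */
    return (w & 0x80000000u) ? ((uint64_t)w | 0xFFFFFFFF00000000ull)
                             : (uint64_t)w;  /* sign-extend */
}
```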
We have also developed a naming convention +for RISC-V base instructions and instruction-set extensions, described +in detail in <>. + +To support more general software development, a set of standard +extensions are defined to provide integer multiply/divide, atomic +operations, and single and double-precision floating-point arithmetic. +The base integer ISA is named `I` (prefixed by RV32 or RV64 depending +on integer register width), and contains integer computational +instructions, integer loads, integer stores, and control-flow +instructions. The standard integer multiplication and division extension +is named `M`, and adds instructions to multiply and divide values held +in the integer registers. The standard atomic instruction extension, +denoted by `A`, adds instructions that atomically read, modify, and +write memory for inter-processor synchronization. The standard +single-precision floating-point extension, denoted by `F`, adds +floating-point registers, single-precision computational instructions, +and single-precision loads and stores. The standard double-precision +floating-point extension, denoted by `D`, expands the floating-point +registers, and adds double-precision computational instructions, loads, +and stores. The standard `C` compressed instruction extension provides +narrower 16-bit forms of common instructions. + +Beyond the base integer ISA and the standard GC extensions, we believe +it is rare that a new instruction will provide a significant benefit for +all applications, although it may be very beneficial for a certain +domain. As energy efficiency concerns are forcing greater +specialization, we believe it is important to simplify the required +portion of an ISA specification. 
Whereas other architectures usually
+treat their ISA as a single entity, which changes to a new version as
+instructions are added over time, RISC-V will endeavor to keep the base
+and each standard extension constant over time, and instead layer new
+instructions as further optional extensions. For example, the base
+integer ISAs will continue as fully supported standalone ISAs,
+regardless of any subsequent extensions.
+
+=== Memory
+
+A RISC-V hart has a single byte-addressable address space of
+latexmath:[$2^{\text{XLEN}}$] bytes for all memory accesses. A _word_ of
+memory is defined as 32 bits (4 bytes). Correspondingly, a _halfword_ is
+16 bits (2 bytes), a _doubleword_ is 64 bits (8 bytes), and a _quadword_
+is 128 bits (16 bytes). The memory address space is
+circular, so that the byte at address latexmath:[$2^{\text{XLEN}}-1$] is
+adjacent to the byte at address zero. Accordingly, memory address
+computations done by the hardware ignore overflow and instead wrap
+around modulo latexmath:[$2^{\text{XLEN}}$].
+
+The execution environment determines the mapping of hardware resources
+into a hart’s address space. Different address ranges of a hart’s
+address space may (1) be vacant, or (2) contain _main memory_, or
+(3) contain one or more _I/O devices_. Reads and writes of I/O devices
+may have visible side effects, but accesses to main memory cannot.
+Although it is possible for the execution environment to call everything
+in a hart’s address space an I/O device, it is usually expected that
+some portion will be specified as main memory.
+
+When a RISC-V platform has multiple harts, the address spaces of any two
+harts may be entirely the same, or entirely different, or may be partly
+different but sharing some subset of resources, mapped into the same or
+different address ranges.
+
+For a purely "bare metal" environment, all harts may see an identical
+address space, accessed entirely by physical addresses.
However, when +the execution environment includes an operating system employing address +translation, it is common for each hart to be given a virtual address +space that is largely or entirely its own. +(((memory access, implicit and explicit))) + +Executing each RISC-V machine instruction entails one or more memory +accesses, subdivided into _implicit_ and _explicit_ accesses. For each +instruction executed, an _implicit_ memory read (instruction fetch) is +done to obtain the encoded instruction to execute. Many RISC-V +instructions perform no further memory accesses beyond instruction +fetch. Specific load and store instructions perform an _explicit_ read +or write of memory at an address determined by the instruction. The +execution environment may dictate that instruction execution performs +other _implicit_ memory accesses (such as to implement address +translation) beyond those documented for the unprivileged ISA. + +The execution environment determines what portions of the non-vacant +address space are accessible for each kind of memory access. For +example, the set of locations that can be implicitly read for +instruction fetch may or may not have any overlap with the set of +locations that can be explicitly read by a load instruction; and the set +of locations that can be explicitly written by a store instruction may +be only a subset of locations that can be read. Ordinarily, if an +instruction attempts to access memory at an inaccessible address, an +exception is raised for the instruction. Vacant locations in the address +space are never accessible. + +Except when specified otherwise, implicit reads that do not raise an +exception and that have no side effects may occur arbitrarily early and +speculatively, even before the machine could possibly prove that the +read will be needed. 
For instance, a valid implementation could attempt
+to read all of main memory at the earliest opportunity, cache as many
+fetchable (executable) bytes as possible for later instruction fetches,
+and avoid reading main memory for instruction fetches ever again. To
+ensure that certain implicit reads are ordered only after writes to the
+same memory locations, software must execute specific fence or
+cache-control instructions defined for this purpose (such as the FENCE.I
+instruction defined in <>).
+(((memory access, implicit and explicit)))
+
+The memory accesses (implicit or explicit) made by a hart may appear to
+occur in a different order as perceived by another hart or by any other
+agent that can access the same memory. This perceived reordering of
+memory accesses is always constrained, however, by the applicable memory
+consistency model. The default memory consistency model for RISC-V is
+the RISC-V Weak Memory Ordering (RVWMO), defined in
+<> and in appendices. Optionally,
+an implementation may adopt the stronger model of Total Store Ordering,
+as defined in <>. The execution environment
+may also add constraints that further limit the perceived reordering of
+memory accesses. Since the RVWMO model is the weakest model allowed for
+any RISC-V implementation, software written for this model is compatible
+with the actual memory consistency rules of all RISC-V implementations.
+As with implicit reads, software must execute fence or cache-control
+instructions to ensure specific ordering of memory accesses beyond the
+requirements of the assumed memory consistency model and execution
+environment.
+
+=== Base Instruction-Length Encoding
+
+The base RISC-V ISA has fixed-length 32-bit instructions that must be
+naturally aligned on 32-bit boundaries.
However, the standard RISC-V
+encoding scheme is designed to support ISA extensions with
+variable-length instructions, where each instruction can be any number
+of 16-bit instruction _parcels_ in length and parcels are naturally
+aligned on 16-bit boundaries. The standard compressed ISA extension
+described in <> reduces code size by
+providing compressed 16-bit instructions and relaxes the alignment
+constraints to allow all instructions (16 bit and 32 bit) to be aligned
+on any 16-bit boundary to improve code density.
+
+We use the term IALIGN (measured in bits) to refer to the
+instruction-address alignment constraint the implementation enforces.
+IALIGN is 32 bits in the base ISA, but some ISA extensions, including
+the compressed ISA extension, relax IALIGN to 16 bits. IALIGN may not
+take on any value other than 16 or 32.
+(((ILEN)))
+
+We use the term ILEN (measured in bits) to refer to the maximum
+instruction length supported by an implementation, which is always a
+multiple of IALIGN. For implementations supporting only a base
+instruction set, ILEN is 32 bits. Implementations supporting longer
+instructions have larger values of ILEN.
+
+<> illustrates the standard
+RISC-V instruction-length encoding convention. All the 32-bit
+instructions in the base ISA have their lowest two bits set to `11`. The
+optional compressed 16-bit instruction-set extensions have their lowest
+two bits equal to `00`, `01`, or `10`.
+
+==== Expanded Instruction-Length Encoding
+(((instruction length encoding)))
+
+A portion of the 32-bit instruction-encoding space has been tentatively
+allocated for instructions longer than 32 bits. The entirety of this
+space is reserved at this time, and the following proposal for encoding
+instructions longer than 32 bits is not considered frozen.
+
+Standard instruction-set extensions encoded with more than 32 bits have
+additional low-order bits set to `1`, with the conventions for 48-bit
+and 64-bit lengths shown in
+<>.
Instruction lengths
+between 80 bits and 176 bits are encoded using a 3-bit field in bits
+[14:12] giving the number of 16-bit words in addition to the first
+5 latexmath:[$\times$] 16-bit words. The encoding with bits [14:12] set to
+`111` is reserved for future longer instruction encodings.
+
+[[instlengthcode]]
+[cols="^,^,^,^,<",]
+|===
+| | | |`xxxxxxxxxxxxxxaa` |16-bit (`aa` latexmath:[$\neq$] `11`)
+
+| | | | |
+
+| | |`xxxxxxxxxxxxxxxx` |`xxxxxxxxxxxbbb11` |32-bit (`bbb`
+latexmath:[$\neq$] `111`)
+
+| | | | |
+
+| |latexmath:[$\cdot\cdot\cdot$]`xxxx` |`xxxxxxxxxxxxxxxx`
+|`xxxxxxxxxx011111` |48-bit
+
+| | | | |
+
+| |latexmath:[$\cdot\cdot\cdot$]`xxxx` |`xxxxxxxxxxxxxxxx`
+|`xxxxxxxxx0111111` |64-bit
+
+| | | | |
+
+| |latexmath:[$\cdot\cdot\cdot$]`xxxx` |`xxxxxxxxxxxxxxxx`
+|`xnnnxxxxx1111111` |(80+16*`nnn`)-bit, `nnn` latexmath:[$\neq$] `111`
+
+| | | | |
+
+| |latexmath:[$\cdot\cdot\cdot$]`xxxx` |`xxxxxxxxxxxxxxxx`
+|`x111xxxxx1111111` |Reserved for latexmath:[$\geq$]192-bits
+
+| | | | |
+
+|Byte Address: |base+4 |base+2 |base |
+|===
+
+Given the code size and energy savings of a compressed format, we wanted
+to build in support for a compressed format to the ISA encoding scheme
+rather than adding this as an afterthought, but to allow simpler
+implementations we didn’t want to make the compressed format mandatory.
+We also wanted to optionally allow longer instructions to support
+experimentation and larger instruction-set extensions. Although our
+encoding convention required a tighter encoding of the core RISC-V ISA,
+this has several beneficial effects.
+(((IMAFD)))
+
+An implementation of the standard IMAFD ISA need only hold the
+most-significant 30 bits in instruction caches (a 6.25% saving). On
+instruction cache refills, any instructions encountered with either low
+bit clear should be recoded into illegal 30-bit instructions before
+storing in the cache to preserve illegal instruction exception behavior.
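Under this encoding convention, the total instruction length is determined by the first 16-bit parcel alone. As a non-normative C sketch (the function name `insn_length_bytes` is illustrative), a fetch unit could decode the length as follows:

```c
#include <stdint.h>

/* Non-normative sketch: length in bytes of a RISC-V instruction,
 * determined solely from its first (lowest-addressed) 16-bit parcel,
 * following the standard length-encoding convention.
 * Returns 0 for the reserved >=192-bit encodings. */
static unsigned insn_length_bytes(uint16_t parcel)
{
    if ((parcel & 0x03) != 0x03)        /* aa != 11 */
        return 2;                       /* 16-bit */
    if ((parcel & 0x1c) != 0x1c)        /* bbb != 111 in bits [4:2] */
        return 4;                       /* 32-bit */
    if ((parcel & 0x3f) == 0x1f)        /* bits [5:0] == 011111 */
        return 6;                       /* 48-bit */
    if ((parcel & 0x7f) == 0x3f)        /* bits [6:0] == 0111111 */
        return 8;                       /* 64-bit */
    /* bits [6:0] == 1111111: (80 + 16*nnn)-bit, nnn in bits [14:12] */
    unsigned nnn = (parcel >> 12) & 0x7;
    return (nnn == 7) ? 0 : (10 + 2 * nnn);
}
```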
+
+Perhaps more importantly, by condensing our base ISA into a subset of
+the 32-bit instruction word, we leave more space available for
+non-standard and custom extensions. In particular, the base RV32I ISA
+uses less than 1/8 of the encoding space in the 32-bit instruction word.
+As described in <<extensions>>, an
+implementation that does not require support for the standard compressed
+instruction extension can map three additional non-conforming 30-bit
+instruction spaces into the 32-bit fixed-width format, while preserving
+support for standard latexmath:[$\geq$]32-bit instruction-set
+extensions. Further, if the implementation also does not need
+instructions latexmath:[$>$]32-bits in length, it can recover a further
+four major opcodes for non-conforming extensions.
+
+Encodings with bits [15:0] all zeros are defined as illegal
+instructions. These instructions are considered to be of minimal length:
+16 bits if any 16-bit instruction-set extension is present, otherwise 32
+bits. The encoding with bits [ILEN-1:0] all ones is also illegal; this
+instruction is considered to be ILEN bits long.
+
+We consider it a feature that any length of instruction containing all
+zero bits is not legal, as this quickly traps erroneous jumps into
+zeroed memory regions. Similarly, we also reserve the instruction
+encoding containing all ones to be an illegal instruction, to catch the
+other common pattern observed with unprogrammed non-volatile memory
+devices, disconnected memory buses, or broken memory devices.
+
+Software can rely on a naturally aligned 32-bit word containing zero to
+act as an illegal instruction on all RISC-V implementations, to be used
+by software where an illegal instruction is explicitly desired. Defining
+a corresponding known illegal value for all ones is more difficult due
+to the variable-length encoding.
Software cannot generally use the
+illegal value of ILEN bits of all 1s, as software might not know ILEN
+for the eventual target machine (e.g., if software is compiled into a
+standard binary library used by many different machines). Defining a
+32-bit word of all ones as illegal was also considered, as all machines
+must support a 32-bit instruction size, but this requires the
+instruction-fetch unit on machines with ILEN latexmath:[$>$]32 to report
+an illegal instruction exception rather than an access-fault exception
+when such an instruction borders a protection boundary, complicating
+variable-instruction-length fetch and decode.
+(((endian, little and big)))
+
+RISC-V base ISAs have either little-endian or big-endian memory systems,
+with the privileged architecture further defining bi-endian operation.
+Instructions are stored in memory as a sequence of 16-bit little-endian
+parcels, regardless of memory system endianness. Parcels forming one
+instruction are stored at increasing halfword addresses, with the
+lowest-addressed parcel holding the lowest-numbered bits in the
+instruction specification.
+(((bi-endian)))
+(((endian, bi-)))
+
+We originally chose little-endian byte ordering for the RISC-V memory
+system because little-endian systems are currently dominant commercially
+(all x86 systems; iOS, Android, and Windows for ARM). A minor point is
+that we have also found little-endian memory systems to be more natural
+for hardware designers. However, certain application areas, such as IP
+networking, operate on big-endian data structures, and certain legacy
+code bases have been built assuming big-endian processors, so we have
+defined big-endian and bi-endian variants of RISC-V.
+
+We have to fix the order in which instruction parcels are stored in
+memory, independent of memory system endianness, to ensure that the
+length-encoding bits always appear first in halfword address order.
This +allows the length of a variable-length instruction to be quickly +determined by an instruction-fetch unit by examining only the first few +bits of the first 16-bit instruction parcel. + +We further make the instruction parcels themselves little-endian to +decouple the instruction encoding from the memory system endianness +altogether. This design benefits both software tooling and bi-endian +hardware. Otherwise, for instance, a RISC-V assembler or disassembler +would always need to know the intended active endianness, despite that +in bi-endian systems, the endianness mode might change dynamically +during execution. In contrast, by giving instructions a fixed +endianness, it is sometimes possible for carefully written software to +be endianness-agnostic even in binary form, much like +position-independent code. + +The choice to have instructions be only little-endian does have +consequences, however, for RISC-V software that encodes or decodes +machine instructions. Big-endian JIT compilers, for example, must swap +the byte order when storing to instruction memory. + +Once we had decided to fix on a little-endian instruction encoding, this +naturally led to placing the length-encoding bits in the LSB positions +of the instruction format to avoid breaking up opcode fields. + +[[trap-defn]] +=== Exceptions, Traps, and Interrupts +(((exceptions))) +(((traps))) +(((interrupts))) + +We use the term _exception_ to refer to an unusual condition occurring +at run time associated with an instruction in the current RISC-V hart. +We use the term _interrupt_ to refer to an external asynchronous event +that may cause a RISC-V hart to experience an unexpected transfer of +control. We use the term _trap_ to refer to the transfer of control to a +trap handler caused by either an exception or an interrupt. + +The instruction descriptions in following chapters describe conditions +that can raise an exception during execution. 
The general behavior of +most RISC-V EEIs is that a trap to some handler occurs when an exception +is signaled on an instruction (except for floating-point exceptions, +which, in the standard floating-point extensions, do not cause traps). +The manner in which interrupts are generated, routed to, and enabled by +a hart depends on the EEI. + +Our use of `exception` and `trap` is compatible with that in the +IEEE-754 floating-point standard. + +How traps are handled and made visible to software running on the hart +depends on the enclosing execution environment. From the perspective of +software running inside an execution environment, traps encountered by a +hart at runtime can have four different effects: + +Contained Trap::: + The trap is visible to, and handled by, software running inside the + execution environment. For example, in an EEI providing both + supervisor and user mode on harts, an ECALL by a user-mode hart will + generally result in a transfer of control to a supervisor-mode handler + running on the same hart. Similarly, in the same environment, when a + hart is interrupted, an interrupt handler will be run in supervisor + mode on the hart. +Requested Trap::: + The trap is a synchronous exception that is an explicit call to the + execution environment requesting an action on behalf of software + inside the execution environment. An example is a system call. In this + case, execution may or may not resume on the hart after the requested + action is taken by the execution environment. For example, a system + call could remove the hart or cause an orderly termination of the + entire execution environment. +Invisible Trap::: + The trap is handled transparently by the execution environment and + execution resumes normally after the trap is handled. Examples include + emulating missing instructions, handling non-resident page faults in a + demand-paged virtual-memory system, or handling device interrupts for + a different job in a multiprogrammed machine. 
In these cases, the + software running inside the execution environment is not aware of the + trap (we ignore timing effects in these definitions). +Fatal Trap::: + The trap represents a fatal failure and causes the execution + environment to terminate execution. Examples include failing a + virtual-memory page-protection check or allowing a watchdog timer to + expire. Each EEI should define how execution is terminated and + reported to an external environment. + +<> shows the characteristics of each +kind of trap. + +[[trapcharacteristics]] +.Characteristics of traps; 1) Termination may be requested. 2) +Imprecise fatal traps might be observable by software. +[cols="<,^,^,^,^",options="header",] +|=== +| |Contained |Requested |Invisible |Fatal +|Execution terminates |No |Nolatexmath:[$^{1}$] |No |Yes +|Software is oblivious |No |No |Yes |Yeslatexmath:[$^{2}$] +|Handled by environment |No |Yes |Yes |Yes +|=== + +The EEI defines for each trap whether it is handled precisely, though +the recommendation is to maintain preciseness where possible. Contained +and requested traps can be observed to be imprecise by software inside +the execution environment. Invisible traps, by definition, cannot be +observed to be precise or imprecise by software running inside the +execution environment. Fatal traps can be observed to be imprecise by +software running inside the execution environment, if known-errorful +instructions do not cause immediate termination. + +Because this document describes unprivileged instructions, traps are +rarely mentioned. Architectural means to handle contained traps are +defined in the privileged architecture manual, along with other features +to support richer EEIs. Unprivileged instructions that are defined +solely to cause requested traps are documented here. Invisible traps +are, by their nature, out of scope for this document. Instruction +encodings that are not defined here and not defined by some other means +may cause a fatal trap. 
+
+=== UNSPECIFIED Behaviors and Values
+(((unspecified, behaviors)))
+(((unspecified, values)))
+
+The architecture fully describes what implementations must do and any
+constraints on what they may do. In cases where the architecture
+intentionally does not constrain implementations, the term _unspecified_
+is explicitly used.
+
+The term _unspecified_ refers to a behavior or value that is intentionally
+unconstrained. The definition of these behaviors or values is open to
+extensions, platform standards, or implementations. Extensions, platform
+standards, or implementation documentation may provide normative content
+to further constrain cases that the base architecture defines as
+_unspecified_.
+
+Like the base architecture, extensions should fully describe allowable
+behavior and values and use the term _unspecified_ for cases that are
+intentionally unconstrained. These cases may be constrained or defined
+by other extensions, platform standards, or implementations.
diff --git a/src/j-st-ext.adoc b/src/j-st-ext.adoc
new file mode 100644
index 0000000..0b5358c
--- /dev/null
+++ b/src/j-st-ext.adoc
@@ -0,0 +1,10 @@
+[[j-extendj]]
+== `J` Standard Extension for Dynamically Translated Languages, Version 0.0
+
+This chapter is a placeholder for a future standard extension to support
+dynamically translated languages.
+
+Many popular languages are usually implemented via dynamic translation,
+including Java and JavaScript. These languages can benefit from
+additional ISA support for dynamic checks and garbage collection.
+
diff --git a/src/m-st-ext.adoc b/src/m-st-ext.adoc
new file mode 100644
index 0000000..6b2593e
--- /dev/null
+++ b/src/m-st-ext.adoc
@@ -0,0 +1,159 @@
+[[mstandard]]
+== `M` Standard Extension for Integer Multiplication and Division, Version 2.0
+
+This chapter describes the standard integer multiplication and division
+instruction extension, which is named `M` and contains instructions
+that multiply or divide values held in two integer registers.
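As a non-normative preview of the multiply semantics defined below, the RV32 (XLEN=32) results can be modeled in C. The helper names (`rv_mul`, `rv_mulh`, `rv_mulhu`, `rv_mulhsu`) are invented for this sketch:

```c
#include <stdint.h>

/* Non-normative sketch of the RV32 multiply results: MUL gives the low
 * 32 bits of the product; MULH, MULHU, and MULHSU give the upper 32
 * bits for the three signedness combinations. */
static uint32_t rv_mul(uint32_t a, uint32_t b)
{
    return a * b;                               /* low 32 bits */
}

static uint32_t rv_mulh(int32_t a, int32_t b)
{
    int64_t p = (int64_t)a * (int64_t)b;        /* exact 64-bit product */
    return (uint32_t)((uint64_t)p >> 32);
}

static uint32_t rv_mulhu(uint32_t a, uint32_t b)
{
    uint64_t p = (uint64_t)a * (uint64_t)b;
    return (uint32_t)(p >> 32);
}

static uint32_t rv_mulhsu(int32_t a, uint32_t b)
{
    int64_t p = (int64_t)a * (int64_t)b;        /* b < 2^32 fits in int64 */
    return (uint32_t)((uint64_t)p >> 32);
}
```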
+
+[TIP]
+====
+We separate integer multiply and divide out from the base to simplify
+low-end implementations, or for applications where integer multiply and
+divide operations are either infrequent or better handled in attached
+accelerators.
+====
+
+=== Multiplication Operations
+
+include::images/wavedrom/m-st-ext-for-int-mult.adoc[]
+[[m-st-ext-for-int-mult]]
+.Multiplication operation instructions
+image::image_placeholder.png[]
+(((MUL, MULH)))
+(((MUL, MULHU)))
+(((MUL, MULHSU)))
+
+MUL performs an XLEN-bit latexmath:[$\times$] XLEN-bit multiplication of
+_rs1_ by _rs2_ and places the lower XLEN bits in the destination
+register. MULH, MULHU, and MULHSU perform the same multiplication but
+return the upper XLEN bits of the full 2 latexmath:[$\times$] XLEN-bit
+product, for signed latexmath:[$\times$] signed,
+unsigned latexmath:[$\times$] unsigned, and signed _rs1_
+latexmath:[$\times$] unsigned _rs2_ multiplication, respectively. If
+both the high and low bits of the same
+product are required, then the recommended code sequence is: MULH[[S]U]
+_rdh, rs1, rs2_; MUL _rdl, rs1, rs2_ (source register specifiers must be
+in same order and _rdh_ cannot be the same as _rs1_ or _rs2_).
+Microarchitectures can then fuse these into a single multiply operation
+instead of performing two separate multiplies.
+
+[NOTE]
+====
+MULHSU is used in multi-word signed multiplication to multiply the
+most-significant word of the multiplicand (which contains the sign bit)
+with the less-significant words of the multiplier (which are unsigned).
+====
+
+MULW is an RV64 instruction that multiplies the lower 32 bits of the
+source registers, placing the sign-extension of the lower 32 bits of the
+result into the destination register.
+
+[NOTE]
+====
+In RV64, MUL can be used to obtain the upper 32 bits of the 64-bit
+product, but signed arguments must be proper 32-bit signed values,
+whereas unsigned arguments must have their upper 32 bits clear.
If the +arguments are not known to be sign- or zero-extended, an alternative is +to shift both arguments left by 32 bits, then use MULH[[S]U]. +==== + +=== Division Operations + +include::images/wavedrom/division-op.adoc[] +[[division-op]] +.Division operation instructions +image::image_placeholder.png[] +(((MUL, DIV))) +(((MUL, DIVU))) + +DIV and DIVU perform an XLEN bits by XLEN bits signed and unsigned +integer division of _rs1_ by _rs2_, rounding towards zero. REM and REMU +provide the remainder of the corresponding division operation. For REM, +the sign of the result equals the sign of the dividend. + +[NOTE] +==== +For both signed and unsigned division, it holds that +latexmath:[$\textrm{dividend} = \textrm{divisor} \times \textrm{quotient} + \textrm{remainder}$]. +==== + +If both the quotient and remainder are required from the same division, +the recommended code sequence is: DIV[U] _rdq, rs1, rs2_; REM[U] _rdr, +rs1, rs2_ (_rdq_ cannot be the same as _rs1_ or _rs2_). +Microarchitectures can then fuse these into a single divide operation +instead of performing two separate divides. + +DIVW and DIVUW are RV64 instructions that divide the lower 32 bits of +_rs1_ by the lower 32 bits of _rs2_, treating them as signed and +unsigned integers respectively, placing the 32-bit quotient in _rd_, +sign-extended to 64 bits. REMW and REMUW are RV64 instructions that +provide the corresponding signed and unsigned remainder operations +respectively. Both REMW and REMUW always sign-extend the 32-bit result +to 64 bits, including on a divide by zero. +(((MUL, div by zero))) + +The semantics for division by zero and division overflow are summarized +in <>. The quotient of division by zero has all bits +set, and the remainder of division by zero equals the dividend. Signed +division overflow occurs only when the most-negative integer is divided +by latexmath:[$-1$]. The quotient of a signed division with overflow is +equal to the dividend, and the remainder is zero. 
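The divide and remainder results above, together with the divide-by-zero and signed-overflow cases specified in this section, can be modeled in a non-normative C sketch (the helper names `rv_div` and `rv_rem` are invented for this illustration):

```c
#include <stdint.h>

/* Non-normative sketch of RV32 DIV/REM results, including the special
 * cases: division by zero yields an all-ones quotient and the dividend
 * as remainder; signed overflow (-2^31 / -1) yields the dividend as
 * quotient and 0 as remainder. */
static int32_t rv_div(int32_t a, int32_t b)
{
    if (b == 0)
        return -1;                       /* quotient has all bits set */
    if (a == INT32_MIN && b == -1)
        return a;                        /* signed overflow */
    return a / b;                        /* C99 '/' rounds toward zero */
}

static int32_t rv_rem(int32_t a, int32_t b)
{
    if (b == 0)
        return a;                        /* remainder equals dividend */
    if (a == INT32_MIN && b == -1)
        return 0;
    return a % b;                        /* sign follows the dividend */
}
```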
Unsigned division
+overflow cannot occur.
+
+[[divby0]]
+.Semantics for division by zero and division overflow
+[cols="<,^,^,^,^,^,^",options="header",]
+|===
+|Condition |Dividend |Divisor |DIVU[W] |REMU[W] |DIV[W] |REM[W]
+|Division by zero |latexmath:[$x$] |0 |latexmath:[$2^{L}-1$]
+|latexmath:[$x$] |latexmath:[$-1$] |latexmath:[$x$]
+
+|Overflow (signed only) |latexmath:[$-2^{L-1}$] |latexmath:[$-1$] |– |–
+|latexmath:[$-2^{L-1}$] |0
+|===
+
+In <<divby0>>, L is the width of the operation in bits: XLEN for DIV[U]
+and REM[U], or 32 for DIV[U]W and REM[U]W.
+
+[TIP]
+====
+We considered raising exceptions on integer divide by zero, with these
+exceptions causing a trap in most execution environments. However, this
+would be the only arithmetic trap in the standard ISA (floating-point
+exceptions set flags and write default values, but do not cause traps)
+and would require language implementors to interact with the execution
+environment’s trap handlers for this case. Further, where language
+standards mandate that a divide-by-zero exception must cause an
+immediate control flow change, only a single branch instruction needs to
+be added to each divide operation, and this branch instruction can be
+inserted after the divide and should normally be very predictably not
+taken, adding little runtime overhead.
+
+The value of all bits set is returned for both unsigned and signed
+divide by zero to simplify the divider circuitry. The value of all 1s is
+both the natural value to return for unsigned divide, representing the
+largest unsigned number, and also the natural result for simple unsigned
+divider implementations. Signed division is often implemented using an
+unsigned division circuit and specifying the same overflow result
+simplifies the hardware.
+====
+
+=== Zmmul Extension, Version 0.1
+
+The Zmmul extension implements the multiplication subset of the M
+extension.
It adds all of the instructions defined in
+the Multiplication Operations section above, namely: MUL, MULH, MULHU,
+MULHSU, and (for RV64 only) MULW. The encodings are identical to those
+of the corresponding M-extension instructions.
+(((MUL, Zmmul)))
+
+[NOTE]
+====
+The Zmmul extension enables low-cost implementations that require
+multiplication operations but not division. For many microcontroller
+applications, division operations are too infrequent to justify the cost
+of divider hardware. By contrast, multiplication operations are more
+frequent, making the cost of multiplier hardware more justifiable.
+Simple FPGA soft cores particularly benefit from eliminating division
+but retaining multiplication, since many FPGAs provide hardwired
+multipliers but require dividers be implemented in soft logic.
+====
+
diff --git a/src/mm-alloy.adoc b/src/mm-alloy.adoc
new file mode 100644
index 0000000..93c0c41
--- /dev/null
+++ b/src/mm-alloy.adoc
@@ -0,0 +1,239 @@
+[[sec:alloy]]
+== Formal Axiomatic Specification in Alloy
+
+We present a formal specification of the RVWMO memory model in Alloy
+(http://alloy.mit.edu). This model is available online at
+https://github.com/daniellustig/riscv-memory-model.
+
+The online material also contains some litmus tests and some examples of
+how Alloy can be used to model check some of the mappings in
+Section <<sec:memory:porting>>.
+
+` `
+
+....
+//////////////////////////////////////////////////////////////////////////////// +// =RVWMO PPO= + +// Preserved Program Order +fun ppo : Event->Event { + // same-address ordering + po_loc :> Store + + rdw + + (AMO + StoreConditional) <: rfi + + // explicit synchronization + + ppo_fence + + Acquire <: ^po :> MemoryEvent + + MemoryEvent <: ^po :> Release + + RCsc <: ^po :> RCsc + + pair + + // syntactic dependencies + + addrdep + + datadep + + ctrldep :> Store + + // pipeline dependencies + + (addrdep+datadep).rfi + + addrdep.^po :> Store +} + +// the global memory order respects preserved program order +fact { ppo in ^gmo } +.... + +` ` + +.... +//////////////////////////////////////////////////////////////////////////////// +// =RVWMO axioms= + +// Load Value Axiom +fun candidates[r: MemoryEvent] : set MemoryEvent { + (r.~^gmo & Store & same_addr[r]) // writes preceding r in gmo + + (r.^~po & Store & same_addr[r]) // writes preceding r in po +} + +fun latest_among[s: set Event] : Event { s - s.~^gmo } + +pred LoadValue { + all w: Store | all r: Load | + w->r in rf <=> w = latest_among[candidates[r]] +} + +// Atomicity Axiom +pred Atomicity { + all r: Store.~pair | // starting from the lr, + no x: Store & same_addr[r] | // there is no store x to the same addr + x not in same_hart[r] // such that x is from a different hart, + and x in r.~rf.^gmo // x follows (the store r reads from) in gmo, + and r.pair in x.^gmo // and r follows x in gmo +} + +// Progress Axiom implicit: Alloy only considers finite executions + +pred RISCV_mm { LoadValue and Atomicity /* and Progress */ } +.... + +` ` + +.... 
+//////////////////////////////////////////////////////////////////////////////// +// Basic model of memory + +sig Hart { // hardware thread + start : one Event +} +sig Address {} +abstract sig Event { + po: lone Event // program order +} + +abstract sig MemoryEvent extends Event { + address: one Address, + acquireRCpc: lone MemoryEvent, + acquireRCsc: lone MemoryEvent, + releaseRCpc: lone MemoryEvent, + releaseRCsc: lone MemoryEvent, + addrdep: set MemoryEvent, + ctrldep: set Event, + datadep: set MemoryEvent, + gmo: set MemoryEvent, // global memory order + rf: set MemoryEvent +} +sig LoadNormal extends MemoryEvent {} // l{b|h|w|d} +sig LoadReserve extends MemoryEvent { // lr + pair: lone StoreConditional +} +sig StoreNormal extends MemoryEvent {} // s{b|h|w|d} +// all StoreConditionals in the model are assumed to be successful +sig StoreConditional extends MemoryEvent {} // sc +sig AMO extends MemoryEvent {} // amo +sig NOP extends Event {} + +fun Load : Event { LoadNormal + LoadReserve + AMO } +fun Store : Event { StoreNormal + StoreConditional + AMO } + +sig Fence extends Event { + pr: lone Fence, // opcode bit + pw: lone Fence, // opcode bit + sr: lone Fence, // opcode bit + sw: lone Fence // opcode bit +} +sig FenceTSO extends Fence {} + +/* Alloy encoding detail: opcode bits are either set (encoded, e.g., + * as f.pr in iden) or unset (f.pr not in iden). The bits cannot be used for + * anything else */ +fact { pr + pw + sr + sw in iden } +// likewise for ordering annotations +fact { acquireRCpc + acquireRCsc + releaseRCpc + releaseRCsc in iden } +// don't try to encode FenceTSO via pr/pw/sr/sw; just use it as-is +fact { no FenceTSO.(pr + pw + sr + sw) } +.... + +` ` + +.... 
+//////////////////////////////////////////////////////////////////////////////// +// =Basic model rules= + +// Ordering annotation groups +fun Acquire : MemoryEvent { MemoryEvent.acquireRCpc + MemoryEvent.acquireRCsc } +fun Release : MemoryEvent { MemoryEvent.releaseRCpc + MemoryEvent.releaseRCsc } +fun RCpc : MemoryEvent { MemoryEvent.acquireRCpc + MemoryEvent.releaseRCpc } +fun RCsc : MemoryEvent { MemoryEvent.acquireRCsc + MemoryEvent.releaseRCsc } + +// There is no such thing as store-acquire or load-release, unless it's both +fact { Load & Release in Acquire } +fact { Store & Acquire in Release } + +// FENCE PPO +fun FencePRSR : Fence { Fence.(pr & sr) } +fun FencePRSW : Fence { Fence.(pr & sw) } +fun FencePWSR : Fence { Fence.(pw & sr) } +fun FencePWSW : Fence { Fence.(pw & sw) } + +fun ppo_fence : MemoryEvent->MemoryEvent { + (Load <: ^po :> FencePRSR).(^po :> Load) + + (Load <: ^po :> FencePRSW).(^po :> Store) + + (Store <: ^po :> FencePWSR).(^po :> Load) + + (Store <: ^po :> FencePWSW).(^po :> Store) + + (Load <: ^po :> FenceTSO) .(^po :> MemoryEvent) + + (Store <: ^po :> FenceTSO) .(^po :> Store) +} + +// auxiliary definitions +fun po_loc : Event->Event { ^po & address.~address } +fun same_hart[e: Event] : set Event { e + e.^~po + e.^po } +fun same_addr[e: Event] : set Event { e.address.~address } + +// initial stores +fun NonInit : set Event { Hart.start.*po } +fun Init : set Event { Event - NonInit } +fact { Init in StoreNormal } +fact { Init->(MemoryEvent & NonInit) in ^gmo } +fact { all e: NonInit | one e.*~po.~start } // each event is in exactly one hart +fact { all a: Address | one Init & a.~address } // one init store per address +fact { no Init <: po and no po :> Init } +.... + +` ` + +.... 
+// po +fact { acyclic[po] } + +// gmo +fact { total[^gmo, MemoryEvent] } // gmo is a total order over all MemoryEvents + +//rf +fact { rf.~rf in iden } // each read returns the value of only one write +fact { rf in Store <: address.~address :> Load } +fun rfi : MemoryEvent->MemoryEvent { rf & (*po + *~po) } + +//dep +fact { no StoreNormal <: (addrdep + ctrldep + datadep) } +fact { addrdep + ctrldep + datadep + pair in ^po } +fact { datadep in datadep :> Store } +fact { ctrldep.*po in ctrldep } +fact { no pair & (^po :> (LoadReserve + StoreConditional)).^po } +fact { StoreConditional in LoadReserve.pair } // assume all SCs succeed + +// rdw +fun rdw : Event->Event { + (Load <: po_loc :> Load) // start with all same_address load-load pairs, + - (~rf.rf) // subtract pairs that read from the same store, + - (po_loc.rfi) // and subtract out "fri-rfi" patterns +} + +// filter out redundant instances and/or visualizations +fact { no gmo & gmo.gmo } // keep the visualization uncluttered +fact { all a: Address | some a.~address } + +//////////////////////////////////////////////////////////////////////////////// +// =Optional: opcode encoding restrictions= + +// the list of blessed fences +fact { Fence in + Fence.pr.sr + + Fence.pw.sw + + Fence.pr.pw.sw + + Fence.pr.sr.sw + + FenceTSO + + Fence.pr.pw.sr.sw +} + +pred restrict_to_current_encodings { + no (LoadNormal + StoreNormal) & (Acquire + Release) +} + +//////////////////////////////////////////////////////////////////////////////// +// =Alloy shortcuts= +pred acyclic[rel: Event->Event] { no iden & ^rel } +pred total[rel: Event->Event, bag: Event] { + all disj e, e': bag | e->e' in rel + ~rel + acyclic[rel] +} +.... 
diff --git a/src/mm-eplan.adoc b/src/mm-eplan.adoc
new file mode 100644
index 0000000..a58d306
--- /dev/null
+++ b/src/mm-eplan.adoc
@@ -0,0 +1,1705 @@
+[appendix]
+== RVWMO Explanatory Material, Version 0.1
+[[mm-explain]]
+
+This section provides more explanation for the RVWMO memory model,
+using more informal language and concrete examples. These are intended
+to clarify the meaning and intent of the axioms and preserved program
+order rules. This appendix should be treated as commentary; all
+normative material is provided in the RVWMO chapter and in the rest of
+the main body of the ISA specification. All currently known
+discrepancies are listed in the section on known discrepancies. Any
+other discrepancies are unintentional.
+
+[[whyrvwmo]]
+=== Why RVWMO?
+
+Memory consistency models fall along a loose spectrum from weak to
+strong. Weak memory models allow more hardware implementation
+flexibility and deliver arguably better performance, performance per
+watt, power, scalability, and hardware verification overheads than
+strong models, at the expense of a more complex programming model.
+Strong models provide simpler programming models, but at the cost of
+imposing more restrictions on the kinds of (non-speculative) hardware
+optimizations that can be performed in the pipeline and in the memory
+system, and in turn imposing some cost in terms of power, area overhead,
+and verification burden.
+
+RISC-V has chosen the RVWMO memory model, a variant of release
+consistency. This places it in between the two extremes of the memory
+model spectrum. The RVWMO memory model enables architects to build
+simple implementations, aggressive implementations, implementations
+embedded deeply inside a much larger system and subject to complex
+memory system interactions, or any number of other possibilities, all
+while simultaneously being strong enough to support programming language
+memory models at high performance.
+
+To facilitate the porting of code from other architectures, some
+hardware implementations may choose to implement the Ztso extension,
+which provides stricter RVTSO ordering semantics by default. Code
+written for RVWMO is automatically and inherently compatible with RVTSO,
+but code written assuming RVTSO is not guaranteed to run correctly on
+RVWMO implementations. In fact, most RVWMO implementations will (and
+should) simply refuse to run RVTSO-only binaries. Each implementation
+must therefore choose whether to prioritize compatibility with RVTSO
+code (e.g., to facilitate porting from x86) or whether to instead
+prioritize compatibility with other RISC-V cores implementing RVWMO.
+
+Some fences and/or memory ordering annotations in code written for RVWMO
+may become redundant under RVTSO; the cost that the default of RVWMO
+imposes on Ztso implementations is the incremental overhead of fetching
+those fences (e.g., FENCE R,RW and FENCE RW,W) which become no-ops on
+that implementation. However, these fences must remain present in the
+code if compatibility with non-Ztso implementations is desired.
+
+[[litmustests]]
+=== Litmus Tests
+
+The explanations in this chapter make use of _litmus tests_, or small
+programs designed to test or highlight one particular aspect of a memory
+model. <<litmus-sample>> shows an example of a litmus test with two
+harts. As a convention for this figure and for all figures that follow
+in this chapter, we assume that `s0`–`s2` are pre-set to the same value
+in all harts and that `s0` holds the address labeled `x`, `s1` holds
+`y`, and `s2` holds `z`, where `x`, `y`, and `z` are disjoint memory
+locations aligned to 8-byte boundaries. Each figure shows the litmus
+test code on the left, and a visualization of one particular valid or
+invalid execution on the right.
+
+include::images/graphviz/litmus_sample.adoc[]
+[[litmus-sample]]
+.A sample litmus test and one forbidden execution (`a0=1`)
+image::image_placeholder.png[float="right"]
+
+.Key for sample litmus test
+[cols="^,<,^,<",options="header",]
+|===
+|Hart 0 | |Hart 1 |
+| |latexmath:[$\vdots$] | |latexmath:[$\vdots$]
+| |li t1,1 | |li t4,4
+|(a) |sw t1,0(s0) |(e) |sw t4,0(s0)
+| |latexmath:[$\vdots$] | |latexmath:[$\vdots$]
+| |li t2,2 | |
+|(b) |sw t2,0(s0) | |
+| |latexmath:[$\vdots$] | |latexmath:[$\vdots$]
+|(c) |lw a0,0(s0) | |
+| |latexmath:[$\vdots$] | |latexmath:[$\vdots$]
+| |li t3,3 | |li t5,5
+|(d) |sw t3,0(s0) |(f) |sw t5,0(s0)
+| |latexmath:[$\vdots$] | |latexmath:[$\vdots$]
+|===
+
+Litmus tests are used to understand the implications of the memory model
+in specific concrete situations. For example, in the litmus test of
+<<litmus-sample>>, the final value of `a0` in the first hart can be
+either 2, 4, or 5, depending on the dynamic interleaving of the
+instruction stream from each hart at runtime. However, in this example,
+the final value of `a0` in Hart 0 will never be 1 or 3; intuitively, the
+value 1 will no longer be visible at the time the load executes, and the
+value 3 will not yet be visible by the time the load executes. We
+analyze this test and many others below.
+
+[[litmus-key]]
+.A key for the litmus test diagrams drawn in this appendix
+[cols="^,<",options="header",]
+|===
+|Edge |Full Name (and explanation)
+|rf |Reads From (from each store to the loads that return a value
+written by that store)
+
+|co |Coherence (a total order on the stores to each address)
+
+|fr |From-Reads (from each load to co-successors of the store from which
+the load returned a value)
+
+|ppo |Preserved Program Order
+
+|fence |Orderings enforced by a FENCE instruction
+
+|addr |Address Dependency
+
+|ctrl |Control Dependency
+
+|data |Data Dependency
+|===
+
+The diagram shown to the right of each litmus test shows a visual
+representation of the particular execution candidate being considered.
+These diagrams use a notation that is common in the memory model
+literature for constraining the set of possible global memory orders
+that could produce the execution in question. It is also the basis for
+the [.sans-serif]#herd# models presented in the formal memory model
+appendices. This notation is explained in <<litmus-key>>. Of the listed
+relations, rf edges between harts, co edges, fr edges, and ppo edges
+directly constrain the global memory order (as do fence, addr, data,
+and some ctrl edges, via ppo). Other edges (such as intra-hart rf
+edges) are informative but do not constrain the global memory order.
+
+For example, in <<litmus-sample>>, `a0=1` could occur only if one of
+the following were true:
+
+* {empty}(b) appears before (a) in global memory order (and in the
+coherence order co). However, this violates RVWMO PPO
+rule `ppo:->st`. The co edge from (b) to (a) highlights this
+contradiction.
+* {empty}(a) appears before (b) in global memory order (and in the
+coherence order co). However, in this case, the Load Value Axiom would
+be violated, because (a) is not the latest matching store prior to (c)
+in program order. The fr edge from (c) to (b) highlights this
+contradiction.
+
+Since neither of these scenarios satisfies the RVWMO axioms, the outcome
+`a0=1` is forbidden.
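Because all six memory accesses in the sample litmus test target the single location `x`, and RVWMO enforces coherence (sequential-consistency-like behavior per location), the permitted outcomes above can also be cross-checked by brute force: enumerate every interleaving of the two harts' program orders against a single memory cell. The following Python sketch is purely illustrative (it is not part of the specification, and the event encoding is invented for this example):

```python
def interleavings(a, b):
    """Yield every merge of sequences a and b that preserves each
    sequence's internal (program) order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for tail in interleavings(a[1:], b):
        yield [a[0]] + tail
    for tail in interleavings(a, b[1:]):
        yield [b[0]] + tail

# Hart 0: (a) sw 1   (b) sw 2   (c) lw a0   (d) sw 3   (all to x)
# Hart 1: (e) sw 4   (f) sw 5                          (all to x)
hart0 = [("w", 1), ("w", 2), ("r", None), ("w", 3)]
hart1 = [("w", 4), ("w", 5)]

def run(trace):
    mem, a0 = 0, None
    for op, val in trace:
        if op == "w":
            mem = val
        else:
            a0 = mem        # the load returns the most recent store to x
    return a0

outcomes = {run(t) for t in interleavings(hart0, hart1)}
print(sorted(outcomes))     # [2, 4, 5] -- a0 is never 1 or 3
```

The value 1 is always overwritten by (b) before (c) can read it, and (d) always follows (c) in program order, matching the intuition given in the text.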
+ +Beyond what is described in this appendix, a suite of more than seven +thousand litmus tests is available at +https://github.com/litmus-tests/litmus-tests-riscv. + +The litmus tests repository also provides instructions on how to run the +litmus tests on RISC-V hardware and how to compare the results with the +operational and axiomatic models. + +In the future, we expect to adapt these memory model litmus tests for +use as part of the RISC-V compliance test suite as well. + +=== Explaining the RVWMO Rules + +In this section, we provide explanation and examples for all of the +RVWMO rules and axioms. + +==== Preserved Program Order and Global Memory Order + +Preserved program order represents the subset of program order that must +be respected within the global memory order. Conceptually, events from +the same hart that are ordered by preserved program order must appear in +that order from the perspective of other harts and/or observers. Events +from the same hart that are not ordered by preserved program order, on +the other hand, may appear reordered from the perspective of other harts +and/or observers. + +Informally, the global memory order represents the order in which loads +and stores perform. The formal memory model literature has moved away +from specifications built around the concept of performing, but the idea +is still useful for building up informal intuition. A load is said to +have performed when its return value is determined. A store is said to +have performed not when it has executed inside the pipeline, but rather +only when its value has been propagated to globally visible memory. In +this sense, the global memory order also represents the contribution of +the coherence protocol and/or the rest of the memory system to +interleave the (possibly reordered) memory accesses being issued by each +hart into a single total order agreed upon by all harts. 
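One way to picture the relationship just described: a candidate global memory order is simply a permutation of all memory events, and it is admissible only if every preserved-program-order edge runs forward within it. A toy check (the event names and encoding here are invented for illustration; this is not normative):

```python
def respects_ppo(gmo, ppo_edges):
    """gmo: a candidate global memory order, as a list of events.
    ppo_edges: pairs (a, b) meaning a precedes b in preserved program
    order.  The candidate is admissible only if every edge runs
    forward in the list."""
    pos = {event: i for i, event in enumerate(gmo)}
    return all(pos[a] < pos[b] for a, b in ppo_edges)

# Two harts; suppose ppo orders a before b (hart 0) and c before d (hart 1).
ppo = [("a", "b"), ("c", "d")]
print(respects_ppo(["a", "c", "b", "d"], ppo))  # True: a legal interleaving
print(respects_ppo(["b", "a", "c", "d"], ppo))  # False: b overtakes a
```

Events of the same hart that are *not* related by ppo may land in either order, which is exactly the reordering freedom the text describes.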
+
+The order in which loads perform does not always directly correspond to
+the relative age of the values those loads return. In particular, a
+load latexmath:[$b$] may perform before another load latexmath:[$a$] to
+the same address (i.e., latexmath:[$b$] may execute before
+latexmath:[$a$], and latexmath:[$b$] may appear before latexmath:[$a$]
+in the global memory order), but latexmath:[$a$] may nevertheless return
+an older value than latexmath:[$b$]. This discrepancy captures (among
+other things) the reordering effects of buffering placed between the
+core and memory. For example, latexmath:[$b$] may have returned a value
+from a store in the store buffer, while latexmath:[$a$] may have ignored
+that younger store and read an older value from memory instead. To
+account for this, at the time each load performs, the value it returns
+is determined by the load value axiom, not just strictly by determining
+the most recent store to the same address in the global memory order, as
+described below.
+
+[[loadvalueaxiom]]
+==== Load value axiom
+
+[NOTE]
+====
+Load Value Axiom: Each byte of each load _i_ returns the value written
+to that byte by the store that is the latest in global memory order
+among the following stores:
+
+. Stores that write that byte and that precede _i_ in the global memory
+order
+. Stores that write that byte and that precede _i_ in program order
+====
+
+Preserved program order is _not_ required to respect the ordering of a
+store followed by a load to an overlapping address. This complexity
+arises due to the ubiquity of store buffers in nearly all
+implementations. Informally, the load may perform (return a value) by
+forwarding from the store while the store is still in the store buffer,
+and hence before the store itself performs (writes back to globally
+visible memory). Any other hart will therefore observe the load as
+performing before the store.
+
+include::images/graphviz/storebuffer.adoc[float="right"]
+[[storebuffer]]
+.A store buffer forwarding litmus test (outcome permitted)
+image::image_placeholder.png[]
+
+.Code for the store buffer forwarding litmus test
+[cols="^,<,^,<",options="header",]
+|===
+|Hart 0 | |Hart 1 |
+| |li t1, 1 | |li t1, 1
+|(a) |sw t1,0(s0) |(e) |sw t1,0(s1)
+|(b) |lw a0,0(s0) |(f) |lw a2,0(s1)
+|(c) |fence r,r |(g) |fence r,r
+|(d) |lw a1,0(s1) |(h) |lw a3,0(s0)
+|Outcome: `a0=1`, `a1=0`, `a2=1`, `a3=0` | | |
+|===
+
+Consider the litmus test of <<storebuffer>>. When running this program
+on an implementation with store buffers, it is possible to arrive at the
+final outcome `a0=1`, `a1=0`, `a2=1`, `a3=0` as follows:
+
+* {empty}(a) executes and enters the first hart’s private store buffer
+* {empty}(b) executes and forwards its return value 1 from (a) in the
+store buffer
+* {empty}(c) executes since all previous loads (i.e., (b)) have
+completed
+* {empty}(d) executes and reads the value 0 from memory
+* {empty}(e) executes and enters the second hart’s private store buffer
+* {empty}(f) executes and forwards its return value 1 from (e) in the
+store buffer
+* {empty}(g) executes since all previous loads (i.e., (f)) have
+completed
+* {empty}(h) executes and reads the value 0 from memory
+* {empty}(a) drains from the first hart’s store buffer to memory
+* {empty}(e) drains from the second hart’s store buffer to memory
+
+Therefore, the memory model must be able to account for this behavior.
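The sequence of events above can be replayed with a tiny Python model of per-hart store buffers. This is a sketch for intuition only: the `Hart` class is invented for this example, and the FENCE R,R instructions are omitted because they merely order the two loads, which already execute in order here.

```python
class Hart:
    """Toy model: each hart has a private FIFO store buffer; loads
    forward from the youngest matching buffered store, otherwise they
    read shared memory."""
    def __init__(self, mem):
        self.mem, self.buf = mem, []
    def store(self, addr, val):
        self.buf.append((addr, val))      # enters the store buffer
    def load(self, addr):
        for a, v in reversed(self.buf):   # forward from own buffer
            if a == addr:
                return v
        return self.mem.get(addr, 0)      # else read global memory
    def drain(self):
        for a, v in self.buf:
            self.mem[a] = v
        self.buf.clear()

mem = {}
h0, h1 = Hart(mem), Hart(mem)
h0.store("x", 1)        # (a) enters hart 0's store buffer
a0 = h0.load("x")       # (b) forwards 1 from the buffer
a1 = h0.load("y")       # (d) reads 0 from memory
h1.store("y", 1)        # (e) enters hart 1's store buffer
a2 = h1.load("y")       # (f) forwards 1 from the buffer
a3 = h1.load("x")       # (h) reads 0 from memory
h0.drain(); h1.drain()  # (a) and (e) finally reach global memory
print(a0, a1, a2, a3)   # 1 0 1 0 -- the permitted outcome
```

Each load performed before the store it reads from reached memory, which is exactly why the load value axiom needs its second clause.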
+
+To put it another way, suppose the definition of preserved program order
+did include the following hypothetical rule: memory access
+latexmath:[$a$] precedes memory access latexmath:[$b$] in preserved
+program order (and hence also in the global memory order) if
+latexmath:[$a$] precedes latexmath:[$b$] in program order,
+latexmath:[$a$] and latexmath:[$b$] are accesses to the same memory
+location, latexmath:[$a$] is a write, and latexmath:[$b$] is a read.
+Call this `Rule X`. Then we get the following:
+
+* {empty}(a) precedes (b): by `Rule X`
+* {empty}(b) precedes (d): by rule `ppo:fence`
+* {empty}(d) precedes (e): by the load value axiom. Otherwise, if (e)
+preceded (d), then (d) would be required to return the value 1. (This is
+a perfectly legal execution; it’s just not the one in question.)
+* {empty}(e) precedes (f): by `Rule X`
+* {empty}(f) precedes (h): by rule `ppo:fence`
+* {empty}(h) precedes (a): by the load value axiom, as above.
+
+The global memory order must be a total order and cannot be cyclic,
+because a cycle would imply that every event in the cycle happens before
+itself, which is impossible. Therefore, the execution proposed above
+would be forbidden, and hence the addition of `Rule X` would forbid
+implementations with store buffer forwarding, which would clearly be
+undesirable.
+
+Nevertheless, even if (b) precedes (a) and/or (f) precedes (e) in the
+global memory order, the only sensible possibility in this example is
+for (b) to return the value written by (a), and likewise for (f) and
+(e). This combination of circumstances is what leads to the second
+option in the definition of the load value axiom. Even though (b)
+precedes (a) in the global memory order, (a) will still be visible to
+(b) by virtue of sitting in the store buffer at the time (b) executes.
+Therefore, even if (b) precedes (a) in the global memory order, (b)
+should return the value written by (a) because (a) precedes (b) in
+program order.
Likewise for (e) and (f).
+
+include::images/graphviz/ppoca.adoc[]
+[[ppoca]]
+.A test that highlights the behavior of store buffers
+image::image_placeholder.png[]
+
+.Key for test that highlights the behavior of store buffers
+[cols="^,<,^,<",options="header",]
+|===
+|Hart 0 | |Hart 1 |
+| |li t1, 1 | |li t1, 1
+|(a) |sw t1,0(s0) | |LOOP:
+|(b) |fence w,w |(d) |lw a0,0(s1)
+|(c) |sw t1,0(s1) | |beqz a0, LOOP
+| | |(e) |sw t1,0(s2)
+| | |(f) |lw a1,0(s2)
+| | | |xor a2,a1,a1
+| | | |add s0,s0,a2
+| | |(g) |lw a2,0(s0)
+|Outcome: `a0=1`, `a1=1`, `a2=0` | | |
+|===
+
+Another test that highlights the behavior of store buffers is shown in
+<<ppoca>>. In this example, (d) is ordered before (e) because of the
+control dependency, and (f) is ordered before (g) because of the address
+dependency. However, (e) is _not_ necessarily ordered before (f), even
+though (f) returns the value written by (e). This could correspond to
+the following sequence of events:
+
+* {empty}(e) executes speculatively and enters the second hart’s private
+store buffer (but does not drain to memory)
+* {empty}(f) executes speculatively and forwards its return value 1 from
+(e) in the store buffer
+* {empty}(g) executes speculatively and reads the value 0 from memory
+* {empty}(a) executes, enters the first hart’s private store buffer, and
+drains to memory
+* {empty}(b) executes and retires
+* {empty}(c) executes, enters the first hart’s private store buffer, and
+drains to memory
+* {empty}(d) executes and reads the value 1 from memory
+* {empty}(e), (f), and (g) commit, since the speculation turned out to
+be correct
+* {empty}(e) drains from the store buffer to memory
+
+[[atomicityaxiom]]
+==== Atomicity axiom
+
+[NOTE]
+====
+Atomicity Axiom (for Aligned Atomics): If latexmath:[$r$] and
+latexmath:[$w$] are paired load and store operations generated by
+aligned LR and SC instructions in a hart latexmath:[$h$],
+latexmath:[$s$] is a store to byte latexmath:[$x$], and latexmath:[$r$]
+returns a value written by latexmath:[$s$], then latexmath:[$s$] must
+precede latexmath:[$w$] in the global memory order, and there can be no
+store from a hart other than latexmath:[$h$] to byte latexmath:[$x$]
+following latexmath:[$s$] and preceding latexmath:[$w$] in the global
+memory order.
+====
+
+The RISC-V architecture decouples the notion of atomicity from the
+notion of ordering. Unlike architectures such as TSO, RISC-V atomics
+under RVWMO do not impose any ordering requirements by default.
Ordering
+semantics are only guaranteed by the PPO rules that otherwise apply.
+
+RISC-V contains two types of atomics: AMOs and LR/SC pairs. These
+conceptually behave differently, in the following way. LR/SC behave as
+if the old value is brought up to the core, modified, and written back
+to memory, all while a reservation is held on that memory location. AMOs
+on the other hand conceptually behave as if they are performed directly
+in memory. AMOs are therefore inherently atomic, while LR/SC pairs are
+atomic in the slightly different sense that the memory location in
+question will not be modified by another hart during the time the
+original hart holds the reservation.
+
+[[lrsdsc]]
+.Litmus test snippets in which the paired store-conditional may (but need not) succeed
+....
+(a) lr.d a0, 0(s0)    (a) lr.d a0, 0(s0)    (a) lr.w a0, 0(s0)    (a) lr.w a0, 0(s0)
+(b) sd   t1, 0(s0)    (b) sw   t1, 4(s0)    (b) sw   t1, 4(s0)    (b) sw   t1, 4(s0)
+(c) sc.d t2, 0(s0)    (c) sc.d t2, 0(s0)    (c) sc.w t2, 0(s0)    (c) sc.w t2, 8(s0)
+....
+
+The atomicity axiom forbids stores from other harts from being
+interleaved in global memory order between an LR and the SC paired with
+that LR. The atomicity axiom does not forbid loads from being
+interleaved between the paired operations in program order or in the
+global memory order, nor does it forbid stores from the same hart or
+stores to non-overlapping locations from appearing between the paired
+operations in either program order or in the global memory order. For
+example, the SC instructions in <<lrsdsc>> may (but are not guaranteed
+to) succeed. None of those successes would violate the atomicity axiom,
+because the intervening non-conditional stores are from the same hart as
+the paired load-reserved and store-conditional instructions.
This way, a memory system that tracks memory accesses at
+cache line granularity (and which therefore will see the four snippets
+above as identical) will not be forced to fail a store-conditional
+instruction that happens to (falsely) share another portion of the same
+cache line as the memory location being held by the reservation.
+
+The atomicity axiom also technically supports cases in which the LR and
+SC touch different addresses and/or use different access sizes; however,
+use cases for such behaviors are expected to be rare in practice.
+Likewise, scenarios in which stores from the same hart between an LR/SC
+pair actually overlap the memory location(s) referenced by the LR or SC
+are expected to be rare compared to scenarios where the intervening
+store may simply fall onto the same cache line.
+
+[[mm-progress]]
+==== Progress
+
+[NOTE]
+====
+Progress Axiom: No memory operation may be preceded in the global
+memory order by an infinite sequence of other memory operations.
+====
+
+The progress axiom ensures a minimal forward progress guarantee. It
+ensures that stores from one hart will eventually be made visible to
+other harts in the system in a finite amount of time, and that loads
+from other harts will eventually be able to read those values (or
+successors thereof). Without this rule, it would be legal, for example,
+for a spinlock to spin infinitely on a value, even with a store from
+another hart waiting to unlock the spinlock.
+
+The progress axiom is intended not to impose any other notion of
+fairness, latency, or quality of service onto the harts in a RISC-V
+implementation. Any stronger notions of fairness are up to the rest of
+the ISA and/or up to the platform and/or device to define and implement.
+
+The forward progress axiom will in almost all cases be naturally
+satisfied by any standard cache coherence protocol. Implementations with
+non-coherent caches may have to provide some other mechanism to ensure
+the eventual visibility of all stores (or successors thereof) to all
+harts.
+
+[[mm-overlap]]
+==== Overlapping-Address Orderings (Rules `ppo:->st`, `ppo:rdw`, and `ppo:amoforward`)
+
+[NOTE]
+====
+Rule `ppo:->st`: latexmath:[$b$] is a store, and latexmath:[$a$] and
+latexmath:[$b$] access overlapping memory addresses.
+
+Rule `ppo:rdw`: latexmath:[$a$] and latexmath:[$b$] are loads,
+latexmath:[$x$] is a byte read by both latexmath:[$a$] and
+latexmath:[$b$], there is no store to latexmath:[$x$] between
+latexmath:[$a$] and latexmath:[$b$] in program order, and
+latexmath:[$a$] and latexmath:[$b$] return values for latexmath:[$x$]
+written by different memory operations.
+
+Rule `ppo:amoforward`: latexmath:[$a$] is generated by an AMO or SC
+instruction, latexmath:[$b$] is a load, and latexmath:[$b$] returns a
+value written by latexmath:[$a$].
+====
+
+Same-address orderings where the latter is a store are straightforward:
+a load or store can never be reordered with a later store to an
+overlapping memory location. From a microarchitecture perspective,
+generally speaking, it is difficult or impossible to undo a
+speculatively reordered store if the speculation turns out to be
+invalid, so such behavior is simply disallowed by the model.
+Same-address orderings from a store to a later load, on the other hand,
+do not need to be enforced. As discussed in <<loadvalueaxiom>>, this
+reflects the observable behavior of implementations that forward values
+from buffered stores to later loads.
+
+Same-address load-load ordering requirements are far more subtle. The
+basic requirement is that a younger load must not return a value that is
+older than a value returned by an older load in the same hart to the
+same address. This is often known as ``CoRR`` (Coherence for Read-Read
+pairs), or as part of a broader ``coherence`` or ``sequential
+consistency per location`` requirement. Some architectures in the past
+have relaxed same-address load-load ordering, but in hindsight this is
+generally considered to complicate the programming model too much, and
+so RVWMO requires CoRR ordering to be enforced. However, because the
+global memory order corresponds to the order in which loads perform
+rather than the ordering of the values being returned, capturing CoRR
+requirements in terms of the global memory order requires a bit of
+indirection.
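The CoRR requirement itself can be phrased operationally: walking one hart's same-address loads in program order, the coherence-order positions of the values returned must never decrease. A small checker sketch (the encoding is made up for illustration and is not normative):

```python
def corr_ok(returned_values, coherence_order):
    """returned_values: the values returned by one hart's loads of a
    single address, in program order.  coherence_order: the total order
    of store values to that address (index 0 = the initial value).
    CoRR: a younger load may never return an older value."""
    positions = [coherence_order.index(v) for v in returned_values]
    return all(p2 >= p1 for p1, p2 in zip(positions, positions[1:]))

co_x = [0, 1, 2]                 # stores to x in coherence order
print(corr_ok([1, 1, 2], co_x))  # True: values may repeat or move forward
print(corr_ok([2, 1], co_x))     # False: younger load returned an older value
```

Note that the check is on values returned, not on the order the loads perform, which is exactly the indirection discussed above.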

[[fig:litmus:frirfi]]
[cols="^,<,^,<",options="header",]
|===
|Hart 0 | |Hart 1 |
| |li t1, 1 | |li  t2, 2
|(a) |sw t1,0(s0) |(d) |lw  a0,0(s1)
|(b) |fence w, w |(e) |sw  t2,0(s1)
|(c) |sw t1,0(s1) |(f) |lw  a1,0(s1)
| | |(g) |xor t3,a1,a1
| | |(h) |add s0,s0,t3
| | |(i) |lw  a2,0(s0)
|Outcome: `a0=1`, `a1=2`, `a2=0` | | |
|===

Consider the litmus test of
<<fig:litmus:frirfi>>, which is one particular
instance of the more general "fri-rfi" pattern. The term "fri-rfi"
refers to the sequence (d), (e), (f): (d) "from-reads" (i.e., reads
from an earlier write than) (e) which is in the same hart, and (f) reads
from (e) which is in the same hart.

From a microarchitectural perspective, outcome `a0=1`, `a1=2`, `a2=0` is
legal (as are various other less subtle outcomes). Intuitively, the
following would produce the outcome in question:

* (d) stalls (for whatever reason; perhaps it’s stalled waiting
for some other preceding instruction)
* (e) executes and enters the store buffer (but does not yet
drain to memory)
* (f) executes and forwards from (e) in the store buffer
* (g), (h), and (i) execute
* (a) executes and drains to memory, (b) executes, and (c)
executes and drains to memory
* (d) unstalls and executes
* (e) drains from the store buffer to memory

This corresponds to a global memory order of (f), (i), (a), (c), (d),
(e). Note that even though (f) performs before (d), the value returned
by (f) is newer than the value returned by (d). Therefore, this
execution is legal and does not violate the CoRR requirements.

Likewise, if two back-to-back loads return the values written by the
same store, then they may also appear out-of-order in the global memory
order without violating CoRR. Note that this is not the same as saying
that the two loads return the same value, since two different stores may
write the same value.
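
The per-hart, per-address monotonicity that CoRR demands can be captured
in a small checker sketch. The encoding here is invented for
illustration: each load records the global-memory-order position of the
store it reads from.

```python
# Hypothetical CoRR checker: for two program-ordered loads of the same
# address in the same hart, the younger load must not read from a store
# that is older in the global memory order than the store read by the
# older load.

def respects_corr(loads):
    """loads: (hart, addr, store_pos) tuples in program order, where
    store_pos is the global-memory-order position of the store read."""
    last = {}
    for hart, addr, pos in loads:
        if pos < last.get((hart, addr), pos):
            return False
        last[(hart, addr)] = pos
    return True

# In the fri-rfi outcome above, (d) reads the older store and (f) the
# newer one -- positions increase in program order, so CoRR holds:
assert respects_corr([(1, "0(s1)", 2), (1, "0(s1)", 3)])
# Returning the values in the opposite order would violate CoRR:
assert not respects_corr([(1, "0(s1)", 3), (1, "0(s1)", 2)])
# Loads in different harts are unconstrained by CoRR:
assert respects_corr([(0, "0(s1)", 3), (1, "0(s1)", 2)])
```

Note that the checker constrains the stores being read, not the loads'
positions in the global memory order, mirroring the indirection
discussed above.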

[[fig:litmus:rsw]]
[cols="^,<,^,<",options="header",]
|===
|Hart 0 | |Hart 1 |
| |li t1, 1 |(d) |lw  a0,0(s1)
|(a) |sw t1,0(s0) |(e) |xor t2,a0,a0
|(b) |fence w, w |(f) |add s4,s2,t2
|(c) |sw t1,0(s1) |(g) |lw  a1,0(s4)
| | |(h) |lw  a2,0(s2)
| | |(i) |xor t3,a2,a2
| | |(j) |add s0,s0,t3
| | |(k) |lw  a3,0(s0)
|Outcome: `a0=1`, `a1=v`, `a2=v`, `a3=0` | | |
|===

Consider the litmus test of <<fig:litmus:rsw>>.
The outcome `a0=1`, `a1=v`, `a2=v`, `a3=0` (where latexmath:[$v$] is
some value written by another hart) can be observed by allowing (g) and
(h) to be reordered. This might be done speculatively, and the
speculation can be justified by the microarchitecture (e.g., by snooping
for cache invalidations and finding none) because replaying (h) after
(g) would return the value written by the same store anyway. Hence,
assuming `a1` and `a2` would end up with the same value written by the
same store anyway, (g) and (h) can be legally reordered. The global
memory order corresponding to this execution would be
(h),(k),(a),(c),(d),(g).

Executions of the test in <<fig:litmus:rsw>> in
which `a1` does not equal `a2` do in fact require that (g) appears
before (h) in the global memory order. Allowing (h) to appear before (g)
in the global memory order would in that case result in a violation of
CoRR, because then (h) would return an older value than that returned by
(g). Therefore, PPO rule <<ppo:rdw>> forbids this CoRR violation
from occurring. As such, PPO rule <<ppo:rdw>> strikes a careful
balance between enforcing CoRR in all cases while simultaneously being
weak enough to permit "RSW" and "fri-rfi" patterns that commonly
appear in real microarchitectures.

There is one more overlapping-address rule: PPO
rule <<ppo:amoforward>> simply states that a value cannot
be returned from an AMO or SC to a subsequent load until the AMO or SC
has (in the case of the SC, successfully) performed globally. This
follows somewhat naturally from the conceptual view that both AMOs and
SC instructions are meant to be performed atomically in memory. However,
notably, PPO rule <<ppo:amoforward>> states that hardware
may not even non-speculatively forward the value being stored by an
AMOSWAP to a subsequent load, even though for AMOSWAP that store value
is not actually semantically dependent on the previous value in memory,
as is the case for the other AMOs. The same holds true even when
forwarding from SC store values that are not semantically dependent on
the value returned by the paired LR.

The three PPO rules above also apply when the memory accesses in
question only overlap partially. This can occur, for example, when
accesses of different sizes are used to access the same object. Note
also that the base addresses of two overlapping memory operations need
not necessarily be the same for two memory accesses to overlap. When
misaligned memory accesses are being used, the overlapping-address PPO
rules apply to each of the component memory accesses independently.

[[mm-fence]]
==== Fences (Rule ppo:fence)

By default, the FENCE instruction ensures that all memory accesses from
instructions preceding the fence in program order (the "predecessor
set") appear earlier in the global memory order than memory accesses
from instructions appearing after the fence in program order (the
"successor set"). However, fences can optionally further restrict the
predecessor set and/or the successor set to a smaller set of memory
accesses in order to provide some speedup.
Specifically, fences have PR,
PW, SR, and SW bits which restrict the predecessor and/or successor
sets. The predecessor set includes loads (resp. stores) if and only if
PR (resp. PW) is set. Similarly, the successor set includes loads
(resp. stores) if and only if SR (resp. SW) is set.

The FENCE encoding currently has nine non-trivial combinations of the
four bits PR, PW, SR, and SW, plus one extra encoding FENCE.TSO which
facilitates mapping of "acquire+release" or RVTSO semantics. The
remaining seven combinations have empty predecessor and/or successor
sets and hence are no-ops. Of the ten non-trivial options, only six are
commonly used in practice:

* FENCE RW,RW
* FENCE.TSO
* FENCE RW,W
* FENCE R,RW
* FENCE R,R
* FENCE W,W

FENCE instructions using any other combination of PR, PW, SR, and SW are
reserved. We strongly recommend that programmers stick to these six.
Other combinations may have unknown or unexpected interactions with the
memory model.

Finally, we note that since RISC-V uses a multi-copy atomic memory
model, programmers can reason about fence bits in a thread-local
manner. There is no complex notion of "fence cumulativity" as found in
memory models that are not multi-copy atomic.

[[sec:memory:acqrel]]
==== Explicit Synchronization (Rules <<ppo:acquire>>–<<ppo:pair>>)

An _acquire_ operation, as would be used at the start of a critical
section, requires all memory operations following the acquire in program
order to also follow the acquire in the global memory order. This
ensures, for example, that all loads and stores inside the critical
section are up to date with respect to the synchronization variable
being used to protect it.
Acquire ordering can be enforced in one of two
ways: with an acquire annotation, which enforces ordering with respect
to just the synchronization variable itself, or with a FENCE R,RW, which
enforces ordering with respect to all previous loads.

[[fig:litmus:spinlock_atomics]]
....
    sd x1, (a1)               # Arbitrary unrelated store
    ld x2, (a2)               # Arbitrary unrelated load
    li t0, 1                  # Initialize swap value.
  again:
    amoswap.w.aq t0, t0, (a0) # Attempt to acquire lock.
    bnez t0, again            # Retry if held.
    # ...
    # Critical section.
    # ...
    amoswap.w.rl x0, x0, (a0) # Release lock by storing 0.
    sd x3, (a3)               # Arbitrary unrelated store
    ld x4, (a4)               # Arbitrary unrelated load
....

Consider <<fig:litmus:spinlock_atomics>>.
Because this example uses _aq_, the loads and stores in the critical
section are guaranteed to appear in the global memory order after the
AMOSWAP used to acquire the lock. However, assuming `a0`, `a1`, and `a2`
point to different memory locations, the loads and stores in the
critical section may or may not appear after the "Arbitrary unrelated
load" at the beginning of the example in the global memory order.

[[fig:litmus:spinlock_fences]]
....
    sd x1, (a1)               # Arbitrary unrelated store
    ld x2, (a2)               # Arbitrary unrelated load
    li t0, 1                  # Initialize swap value.
  again:
    amoswap.w t0, t0, (a0)    # Attempt to acquire lock.
    fence r, rw               # Enforce "acquire" memory ordering
    bnez t0, again            # Retry if held.
    # ...
    # Critical section.
    # ...
    fence rw, w               # Enforce "release" memory ordering
    amoswap.w x0, x0, (a0)    # Release lock by storing 0.
    sd x3, (a3)               # Arbitrary unrelated store
    ld x4, (a4)               # Arbitrary unrelated load
....

Now, consider the alternative in
<<fig:litmus:spinlock_fences>>. In
this case, even though the AMOSWAP does not enforce ordering with an
_aq_ bit, the fence nevertheless enforces that the acquire AMOSWAP
appears earlier in the global memory order than all loads and stores in
the critical section.
Note, however, that in this case, the fence also
enforces additional orderings: it also requires that the "Arbitrary
unrelated load" at the start of the program appears earlier in the
global memory order than the loads and stores of the critical section.
(This particular fence does not, however, enforce any ordering with
respect to the "Arbitrary unrelated store" at the start of the
snippet.) In this way, fence-enforced orderings are slightly coarser
than orderings enforced by _.aq_.

Release orderings work exactly the same as acquire orderings, just in
the opposite direction. Release semantics require all loads and stores
preceding the release operation in program order to also precede the
release operation in the global memory order. This ensures, for example,
that memory accesses in a critical section appear before the
lock-releasing store in the global memory order. Just as for acquire
semantics, release semantics can be enforced using release annotations
or with a FENCE RW,W operation. Using the same examples, the ordering
between the loads and stores in the critical section and the "Arbitrary
unrelated store" at the end of the code snippet is enforced only by the
FENCE RW,W in
<<fig:litmus:spinlock_fences>>, not by
the _rl_ in
<<fig:litmus:spinlock_atomics>>.

With RCpc annotations alone, store-release-to-load-acquire ordering is
not enforced. This facilitates the porting of code written under the TSO
and/or RCpc memory models. To enforce store-release-to-load-acquire
ordering, the code must use store-release-RCsc and load-acquire-RCsc
operations so that PPO rule <<ppo:rcsc>> applies. RCpc alone is
sufficient for many use cases in C/C++ but is insufficient for many
other use cases in C/C++, Java, and Linux, to name just a few examples;
see Section <<sec:memory:porting>> for details.
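
The difference in scope between the two acquire-enforcement options
discussed above can be sketched in a few lines of Python. This is a toy
model; the names and encoding are illustrative, not part of the
specification:

```python
# Sketch: an .aq annotation orders only the annotated AMO before the
# critical section, while FENCE R,RW additionally orders every earlier
# load (but not earlier stores) before it.

def ordered_before_critical_section(mode, earlier):
    """earlier: 'amo' (the lock acquire), 'prior_load', or 'prior_store'."""
    if mode == "aq":            # amoswap.w.aq
        return earlier == "amo"
    if mode == "fence r,rw":    # plain amoswap followed by fence r,rw
        return earlier in ("amo", "prior_load")
    raise ValueError(mode)

assert ordered_before_critical_section("aq", "amo")
assert not ordered_before_critical_section("aq", "prior_load")
# The fence also drags the "arbitrary unrelated load" into the ordering...
assert ordered_before_critical_section("fence r,rw", "prior_load")
# ...but, like the .aq form, says nothing about the unrelated store:
assert not ordered_before_critical_section("fence r,rw", "prior_store")
```

This is exactly the sense in which fence-enforced orderings are coarser
than _.aq_-enforced ones.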

PPO rule <<ppo:pair>> indicates that an SC must appear after
its paired LR in the global memory order. This will follow naturally
from the common use of LR/SC to perform an atomic read-modify-write
operation due to the inherent data dependency. However, PPO
rule <<ppo:pair>> also applies even when the value being stored
does not syntactically depend on the value returned by the paired LR.

Lastly, we note that just as with fences, programmers need not worry
about "cumulativity" when analyzing ordering annotations.

[[sec:memory:dependencies]]
==== Syntactic Dependencies (Rules <<ppo:addr>>–<<ppo:ctrl>>)

Dependencies from a load to a later memory operation in the same hart
are respected by the RVWMO memory model. The Alpha memory model was
notable for choosing _not_ to enforce the ordering of such dependencies,
but most modern hardware and software memory models consider allowing
dependent instructions to be reordered too confusing and
counterintuitive. Furthermore, modern code sometimes intentionally uses
such dependencies as a particularly lightweight ordering enforcement
mechanism.

The dependency terms used in these rules
work as follows. Instructions are said to carry dependencies from their
source register(s) to their destination register(s) whenever the value
written into each destination register is a function of the source
register(s). For most instructions, this means that the destination
register(s) carry a dependency from all source register(s). However,
there are a few notable exceptions. In the case of memory instructions,
the value written into the destination register ultimately comes from
the memory system rather than from the source register(s) directly, and
so this breaks the chain of dependencies carried from the source
register(s).
In the case of unconditional jumps, the value written into
the destination register comes from the current `pc` (which is never
considered a source register by the memory model), and so likewise, JALR
(the only jump with a source register) does not carry a dependency from
_rs1_ to _rd_.

[[fig:litmus:fflags]]
....
(a) fadd f3,f1,f2
(b) fadd f6,f4,f5
(c) csrrs a0,fflags,x0
....

The notion of accumulating into a destination register rather than
writing into it reflects the behavior of CSRs such as `fflags`. In
particular, an accumulation into a register does not clobber any
previous writes or accumulations into the same register. For example, in
<<fig:litmus:fflags>>, (c) has a syntactic
dependency on both (a) and (b).

Like other modern memory models, the RVWMO memory model uses syntactic
rather than semantic dependencies. In other words, this definition
depends on the identities of the registers being accessed by different
instructions, not the actual contents of those registers. This means
that an address, control, or data dependency must be enforced even if
the calculation could seemingly be "optimized away". This choice
ensures that RVWMO remains compatible with code that uses these false
syntactic dependencies as a lightweight ordering mechanism.

[[fig:litmus:address]]
....
ld  a1,0(s0)
xor a2,a1,a1
add s1,s1,a2
ld  a5,0(s1)
....

For example, there is a syntactic address dependency from the memory
operation generated by the first instruction to the memory operation
generated by the last instruction in
<<fig:litmus:address>>, even though `a1` XOR
`a1` is zero and hence has no effect on the address accessed by the
second load.

The benefit of using dependencies as a lightweight synchronization
mechanism is that the ordering enforcement requirement is limited only
to the specific two instructions in question. Other non-dependent
instructions may be freely reordered by aggressive implementations.
One
alternative would be to use a load-acquire, but this would enforce
ordering for the first load with respect to _all_ subsequent
instructions. Another would be to use a FENCE R,R, but this would
include all previous and all subsequent loads, making this option more
expensive.

[[fig:litmus:control1]]
....
    lw  x1,0(x2)
    bne x1,x0,next
    sw  x3,0(x4)
next:
    sw  x5,0(x6)
....

Control dependencies behave differently from address and data
dependencies in the sense that a control dependency always extends to
all instructions following the original target in program order.
Consider <<fig:litmus:control1>>: the
instruction at `next` will always execute, but the memory operation
generated by that last instruction nevertheless still has a control
dependency from the memory operation generated by the first instruction.

[[fig:litmus:control2]]
....
    lw  x1,0(x2)
    bne x1,x0,next
next:
    sw  x3,0(x4)
....

Likewise, consider <<fig:litmus:control2>>.
Even though both branch outcomes have the same target, there is still a
control dependency from the memory operation generated by the first
instruction in this snippet to the memory operation generated by the
last instruction. This definition of control dependency is subtly
stronger than what might be seen in other contexts (e.g., C++), but it
conforms with standard definitions of control dependencies in the
literature.

Notably, PPO rules <<ppo:addr>>–<<ppo:ctrl>> are also
intentionally designed to respect dependencies that originate from the
output of a successful store-conditional instruction. Typically, an SC
instruction will be followed by a conditional branch checking whether
the outcome was successful; this implies that there will be a control
dependency from the store operation generated by the SC instruction to
any memory operations following the branch.
PPO
rule <<ppo:ctrl>> in turn implies that any subsequent store
operations will appear later in the global memory order than the store
operation generated by the SC. However, since control, address, and data
dependencies are defined over memory operations, and since an
unsuccessful SC does not generate a memory operation, no order is
enforced between an unsuccessful SC and its dependent instructions.
Moreover, since SC is defined to carry dependencies from its source
registers to _rd_ only when the SC is successful, an unsuccessful SC has
no effect on the global memory order.

[[fig:litmus:successdeps]]
[cols="^,<,^,<",]
|===
|Initial values: 0(s0)=1; 0(s1)=1 | | |
| | | |
|Hart 0 | |Hart 1 |
|(a) |ld a0,0(s0) |(e) |ld a3,0(s2)
|(b) |lr a1,0(s1) |(f) |sd a3,0(s0)
|(c) |sc a2,a0,0(s1) | |
|(d) |sd a2,0(s2) | |
|Outcome: `a0=0`, `a3=0` | | |
|===

In addition, the choice to respect dependencies originating at
store-conditional instructions ensures that certain out-of-thin-air-like
behaviors will be prevented. Consider
<<fig:litmus:successdeps>>. Suppose a
hypothetical implementation could occasionally make some early guarantee
that a store-conditional operation will succeed. In this case, (c) could
return 0 to `a2` early (before actually executing), allowing the
sequence (d), (e), (f), (a), and then (b) to execute, and then (c) might
execute (successfully) only at that point. This would imply that (c)
writes its own success value to `0(s1)`! Fortunately, this situation and
others like it are prevented by the fact that RVWMO respects
dependencies originating at the stores generated by successful SC
instructions.

We also note that syntactic dependencies between instructions only have
any force when they take the form of a syntactic address, control,
and/or data dependency.
For example, a syntactic dependency between two
`F` instructions via one of the "accumulating CSRs" in
Section <<sec:source-dest-regs>> does _not_ imply
that the two `F` instructions must be executed in order. Such a
dependency would only serve to set up a dependency from
both `F` instructions to a later CSR instruction accessing the CSR
flag in question.

[[sec:memory:ppopipeline]]
==== Pipeline Dependencies (Rules <<ppo:addrdatarfi>>–<<ppo:addrpo>>)

[[fig:litmus:addrdatarfi]]
[cols="^,<,^,<",options="header",]
|===
|Hart 0 | |Hart 1 |
| |li t1, 1 |(d) |lw a0, 0(s1)
|(a) |sw t1,0(s0) |(e) |sw a0, 0(s2)
|(b) |fence w, w |(f) |lw a1, 0(s2)
|(c) |sw t1,0(s1) | |xor a2,a1,a1
| | | |add s0,s0,a2
| | |(g) |lw a3,0(s0)
|Outcome: `a0=1`, `a3=0` | | |
|===

PPO rules <<ppo:addrdatarfi>> and
<<ppo:addrpo>> reflect behaviors of almost all real processor
pipeline implementations. Rule <<ppo:addrdatarfi>>
states that a load cannot forward from a store until the address and
data for that store are known. Consider
<<fig:litmus:addrdatarfi>>: (f) cannot be
executed until the data for (e) has been resolved, because (f) must
return the value written by (e) (or by something even later in the
global memory order), and the old value must not be clobbered by the
writeback of (e) before (d) has had a chance to perform. Therefore, (f)
will never perform before (d) has performed.
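
The resolution requirement just described can be phrased as a small
predicate. This is a sketch; the store-state encoding is invented for
illustration:

```python
# Sketch of the rule above: a load may forward from a program-earlier
# buffered store only once that store's address and data are both
# resolved and the addresses actually overlap.

def can_forward(store, load_addr):
    return (store["addr_resolved"] and store["data_resolved"]
            and store["addr"] == load_addr)

pending = {"addr": "0(s2)", "addr_resolved": True, "data_resolved": False}
assert not can_forward(pending, "0(s2)")  # data not yet resolved: no forwarding
pending["data_resolved"] = True
assert can_forward(pending, "0(s2)")      # now the later load may forward
```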

[[fig:litmus:addrdatarfi_no]]
[cols="^,<,^,<",options="header",]
|===
|Hart 0 | |Hart 1 |
| |li t1, 1 | |li t1, 1
|(a) |sw t1,0(s0) |(d) |lw a0, 0(s1)
|(b) |fence w, w |(e) |sw a0, 0(s2)
|(c) |sw t1,0(s1) |(f) |sw t1, 0(s2)
| | |(g) |lw a1, 0(s2)
| | | |xor a2,a1,a1
| | | |add s0,s0,a2
| | |(h) |lw a3,0(s0)
|Outcome: `a0=1`, `a3=0` | | |
|===

If there were another store to the same address in between (e) and (f),
as in <<fig:litmus:addrdatarfi_no>>,
then (f) would no longer be dependent on the data of (e) being resolved,
and hence the dependency of (f) on (d), which produces the data for (e),
would be broken.

Rule <<ppo:addrpo>> makes a similar observation to the
previous rule: a store cannot be performed at memory until all previous
loads that might access the same address have themselves been performed.
Such a load must appear to execute before the store, but it cannot do so
if the store were to overwrite the value in memory before the load had a
chance to read the old value. Likewise, a store generally cannot be
performed until it is known that preceding instructions will not cause
an exception due to failed address resolution, and in this sense,
rule <<ppo:addrpo>> can be seen as somewhat of a special case
of rule <<ppo:ctrl>>.

[[fig:litmus:addrpo]]
[cols="^,<,^,<",options="header",]
|===
|Hart 0 | |Hart 1 |
| | | |li t1, 1
|(a) |lw a0,0(s0) |(d) |lw a1, 0(s1)
|(b) |fence rw,rw |(e) |lw a2, 0(a1)
|(c) |sw s2,0(s1) |(f) |sw t1, 0(s0)
|Outcome: `a0=1`, `a1=t` | | |
|===

Consider <<fig:litmus:addrpo>>: (f) cannot be
executed until the address for (e) is resolved, because it may turn out
that the addresses match; i.e., that `a1=s0`. Therefore, (f) cannot be
sent to memory before (d) has executed and confirmed whether the
addresses do indeed overlap.
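
This second pipeline rule can likewise be sketched as a predicate over
the older loads still in flight (again a toy model, not specification
text):

```python
# Sketch: a store may not perform at memory while any program-earlier
# load's address is still unresolved, since such a load might turn out
# to access an overlapping location. The encoding is illustrative.

def store_may_perform(older_loads):
    return all(ld["addr_resolved"] for ld in older_loads)

older = [{"addr_resolved": True}, {"addr_resolved": False}]  # (e) unresolved
assert not store_may_perform(older)   # (f) must wait
older[1]["addr_resolved"] = True      # (e)'s address is now known
assert store_may_perform(older)
```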

=== Beyond Main Memory

RVWMO does not currently attempt to formally describe how FENCE.I,
SFENCE.VMA, I/O fences, and PMAs behave. All of these behaviors will be
described by future formalizations. In the meantime, the behavior of
FENCE.I is described in Chapter <<chap:zifencei>>, the
behavior of SFENCE.VMA is described in the RISC-V Instruction Set
Privileged Architecture Manual, and the behavior of I/O fences and the
effects of PMAs are described below.

==== Coherence and Cacheability

The RISC-V Privileged ISA defines Physical Memory Attributes (PMAs)
which specify, among other things, whether portions of the address space
are coherent and/or cacheable. See the RISC-V Privileged ISA
Specification for the complete details. Here, we simply discuss how the
various details in each PMA relate to the memory model:

* Main memory vs. I/O, and I/O memory ordering PMAs: the memory model as
defined applies to main memory regions. I/O ordering is discussed below.
* Supported access types and atomicity PMAs: the memory model is simply
applied on top of whatever primitives each region supports.
* Cacheability PMAs: the cacheability PMAs in general do not affect the
memory model. Non-cacheable regions may have more restrictive behavior
than cacheable regions, but the set of allowed behaviors does not change
regardless. However, some platform-specific and/or device-specific
cacheability settings may differ.
* Coherence PMAs: The memory consistency model for memory regions marked
as non-coherent in PMAs is currently platform-specific and/or
device-specific: the load-value axiom, the atomicity axiom, and the
progress axiom all may be violated with non-coherent memory. Note
however that coherent memory does not require a hardware cache coherence
protocol.
The RISC-V Privileged ISA Specification suggests that
hardware-incoherent regions of main memory are discouraged, but the
memory model is compatible with hardware coherence, software coherence,
implicit coherence due to read-only memory, implicit coherence due to
only one agent having access, or otherwise.
* Idempotency PMAs: Idempotency PMAs are used to specify memory regions
for which loads and/or stores may have side effects, and this in turn is
used by the microarchitecture to determine, e.g., whether prefetches are
legal. This distinction does not affect the memory model.

==== I/O Ordering

For I/O, the load value axiom and atomicity axiom in general do not
apply, as both reads and writes might have device-specific side effects
and may return values other than the value "written" by the most
recent store to the same address. Nevertheless, the following preserved
program order rules still generally apply for accesses to I/O memory:
memory access latexmath:[$a$] precedes memory access latexmath:[$b$] in
global memory order if latexmath:[$a$] precedes latexmath:[$b$] in
program order and one or more of the following holds:

. latexmath:[$a$] precedes latexmath:[$b$] in preserved program order as
defined in Chapter <<ch:memorymodel>>, with the exception
that acquire and release ordering annotations apply only from one memory
operation to another memory operation and from one I/O operation to
another I/O operation, but not from a memory operation to an I/O nor
vice versa
. latexmath:[$a$] and latexmath:[$b$] are accesses to overlapping
addresses in an I/O region
. latexmath:[$a$] and latexmath:[$b$] are accesses to the same strongly
ordered I/O region
. latexmath:[$a$] and latexmath:[$b$] are accesses to I/O regions, and
the channel associated with the I/O region accessed by either
latexmath:[$a$] or latexmath:[$b$] is channel 1
. latexmath:[$a$] and latexmath:[$b$] are accesses to I/O regions
associated with the same channel (except for channel 0)

Note that the FENCE instruction distinguishes between main memory
operations and I/O operations in its predecessor and successor sets. To
enforce ordering between I/O operations and main memory operations, code
must use a FENCE with PI, PO, SI, and/or SO, plus PR, PW, SR, and/or SW.
For example, to enforce ordering between a write to main memory and an
I/O write to a device register, a FENCE W,O or stronger is needed.

[[fig:litmus:wo]]
....
sd t0, 0(a0)
fence w,o
sd a0, 0(a1)
....

When a fence is in fact used, implementations must assume that the
device may attempt to access memory immediately after receiving the MMIO
signal, and subsequent memory accesses from that device to memory must
observe the effects of all accesses ordered prior to that MMIO
operation. In other words, in <<fig:litmus:wo>>,
suppose `0(a0)` is in main memory and `0(a1)` is the address of a device
register in I/O memory. If the device accesses `0(a0)` upon receiving
the MMIO write, then that load must conceptually appear after the first
store to `0(a0)` according to the rules of the RVWMO memory model. In
some implementations, the only way to ensure this will be to require
that the first store does in fact complete before the MMIO write is
issued. Other implementations may find ways to be more aggressive, while
others still may not need to do anything different at all for I/O and
main memory accesses. Nevertheless, the RVWMO memory model does not
distinguish between these options; it simply provides an
implementation-agnostic mechanism to specify the orderings that must be
enforced.

Many architectures include separate notions of "ordering" and
"completion" fences, especially as they relate to I/O (as opposed to
regular main memory).
Ordering fences simply ensure that memory +operations stay in order, while completion fences ensure that +predecessor accesses have all completed before any successors are made +visible. RISC-V does not explicitly distinguish between ordering and +completion fences. Instead, this distinction is simply inferred from +different uses of the FENCE bits. + +For implementations that conform to the RISC-V Unix Platform +Specification, I/O devices and DMA operations are required to access +memory coherently and via strongly ordered I/O channels. Therefore, +accesses to regular main memory regions that are concurrently accessed +by external devices can also use the standard synchronization +mechanisms. Implementations that do not conform to the Unix Platform +Specification and/or in which devices do not access memory coherently +will need to use mechanisms (which are currently platform-specific or +device-specific) to enforce coherency. + +I/O regions in the address space should be considered non-cacheable +regions in the PMAs for those regions. Such regions can be considered +coherent by the PMA if they are not cached by any agent. + +The ordering guarantees in this section may not apply beyond a +platform-specific boundary between the RISC-V cores and the device. In +particular, I/O accesses sent across an external bus (e.g., PCIe) may be +reordered before they reach their ultimate destination. Ordering must be +enforced in such situations according to the platform-specific rules of +those external devices and buses. 
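
Summing up the last two sections, the way a FENCE's bits, including the
I/O bits, carve out its predecessor and successor sets can be sketched
as follows. The helper and its single-character encoding are
illustrative only:

```python
# Illustrative sketch of FENCE predecessor/successor sets, including
# the I/O bits (PI/PO/SI/SO alongside PR/PW/SR/SW). A fence orders
# access `a` before access `b` iff `a` matches the predecessor
# specifier and `b` matches the successor specifier.

def fence_orders(spec, a, b):
    """spec: e.g. "w,o" or "rw,rw"; accesses are "r"/"w" for main
    memory loads/stores and "i"/"o" for I/O reads/writes."""
    pred, succ = spec.lower().split(",")
    return a in pred and b in succ

# FENCE W,O orders a main-memory write before a later I/O write...
assert fence_orders("w,o", "w", "o")
# ...but says nothing about two main-memory writes:
assert not fence_orders("w,o", "w", "w")
# FENCE IORW,IORW orders everything before everything:
assert fence_orders("iorw,iorw", "i", "w")
```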

[[sec:memory:porting]]
=== Code Porting and Mapping Guidelines

[[tsomappings]]
.Mappings from TSO operations to RISC-V operations
[cols="<,<",options="header",]
|===
|x86/TSO Operation |RVWMO Mapping
|Load |`l{b|h|w|d}; fence r,rw`
|Store |`fence rw,w; s{b|h|w|d}`
|Atomic RMW |`amo<op>.{w|d}.aqrl` OR
| |`loop:lr.{w|d}.aq; <op>; sc.{w|d}.aqrl; bnez loop`
|Fence |`fence rw,rw`
|===

<<tsomappings>> provides a mapping from TSO memory
operations onto RISC-V memory instructions. Normal x86 loads and stores
are all inherently acquire-RCpc and release-RCpc operations: TSO
enforces all load-load, load-store, and store-store ordering by default.
Therefore, under RVWMO, all TSO loads must be mapped onto a load
followed by FENCE R,RW, and all TSO stores must be mapped onto
FENCE RW,W followed by a store. TSO atomic read-modify-writes and x86
instructions using the LOCK prefix are fully ordered and can be
implemented either via an AMO with both _aq_ and _rl_ set, or via an LR
with _aq_ set, the arithmetic operation in question, an SC with both
_aq_ and _rl_ set, and a conditional branch checking the success
condition. In the latter case, the _rl_ annotation on the LR turns out
(for non-obvious reasons) to be redundant and can be omitted.

Alternatives to <<tsomappings>> are also possible. A TSO
store can be mapped onto AMOSWAP with _rl_ set. However, since RVWMO PPO
Rule <<ppo:amoforward>> forbids forwarding of values from
AMOs to subsequent loads, the use of AMOSWAP for stores may negatively
affect performance. A TSO load can be mapped using LR with _aq_ set: all
such LR instructions will be unpaired, but that fact in and of itself
does not preclude the use of LR for loads. However, again, this mapping
may also negatively affect performance if it puts more pressure on the
reservation mechanism than was originally intended.
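
A hypothetical helper that assembles the table's recommended sequences
might look like the following; the dictionary simply restates the
mapping above, and nothing here is normative:

```python
# Illustrative restatement of the TSO-to-RVWMO mapping table as data.
# The instruction mnemonics are taken from the table; the helper and
# its names are invented for this sketch.

TSO_TO_RVWMO = {
    "load":  ["l{b|h|w|d}", "fence r,rw"],
    "store": ["fence rw,w", "s{b|h|w|d}"],
    "fence": ["fence rw,rw"],
}

def map_tso(op):
    """Return the RVWMO instruction sequence for a TSO operation."""
    return "; ".join(TSO_TO_RVWMO[op])

assert map_tso("load") == "l{b|h|w|d}; fence r,rw"
assert map_tso("store") == "fence rw,w; s{b|h|w|d}"
```

Note how the fence lands *after* a TSO load but *before* a TSO store,
reflecting the acquire-RCpc/release-RCpc character of x86 accesses.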

[[powermappings]]
.Mappings from Power operations to RISC-V operations
[cols="<,<",options="header",]
|===
|Power Operation |RVWMO Mapping
|Load |`l{b|h|w|d}`
|Load-Reserve |`lr.{w|d}`
|Store |`s{b|h|w|d}`
|Store-Conditional |`sc.{w|d}`
|`lwsync` |`fence.tso`
|`sync` |`fence rw,rw`
|`isync` |`fence.i; fence r,r`
|===

<<powermappings>> provides a mapping from Power memory
operations onto RISC-V memory instructions. Power ISYNC maps on RISC-V
to a FENCE.I followed by a FENCE R,R; the latter fence is needed because
ISYNC is used to define a `control+control fence` dependency that is
not present in RVWMO.

[[armmappings]]
.Mappings from ARM operations to RISC-V operations
[cols="<,<",options="header",]
|===
|ARM Operation |RVWMO Mapping
|Load |`l{b|h|w|d}`
|Load-Acquire |`fence rw, rw; l{b|h|w|d}; fence r,rw`
|Load-Exclusive |`lr.{w|d}`
|Load-Acquire-Exclusive |`lr.{w|d}.aqrl`
|Store |`s{b|h|w|d}`
|Store-Release |`fence rw,w; s{b|h|w|d}`
|Store-Exclusive |`sc.{w|d}`
|Store-Release-Exclusive |`sc.{w|d}.rl`
|`dmb` |`fence rw,rw`
|`dmb.ld` |`fence r,rw`
|`dmb.st` |`fence w,w`
|`isb` |`fence.i; fence r,r`
|===

<<armmappings>> provides a mapping from ARM memory
operations onto RISC-V memory instructions. Since RISC-V does not
currently have plain load and store opcodes with _aq_ or _rl_
annotations, ARM load-acquire and store-release operations should be
mapped using fences instead. Furthermore, in order to enforce
store-release-to-load-acquire ordering, there must be a FENCE RW,RW
between the store-release and load-acquire; <<armmappings>>
enforces this by always placing the fence in front of each acquire
operation. ARM load-exclusive and store-exclusive instructions can
likewise map onto their RISC-V LR and SC equivalents, but instead of
placing a FENCE RW,RW in front of an LR with _aq_ set, we simply also
set _rl_ instead.
ARM ISB maps on RISC-V to FENCE.I followed by
+FENCE R,R, similarly to how ISYNC maps for Power.
+
+[[linuxmappings]]
+.Mappings from Linux memory primitives to RISC-V primitives
+[cols="<,<",options="header",]
+|===
+|Linux Operation |RVWMO Mapping
+|`smp_mb()` |`fence rw,rw`
+
+|`smp_rmb()` |`fence r,r`
+
+|`smp_wmb()` |`fence w,w`
+
+|`dma_rmb()` |`fence r,r`
+
+|`dma_wmb()` |`fence w,w`
+
+|`mb()` |`fence iorw,iorw`
+
+|`rmb()` |`fence ri,ri`
+
+|`wmb()` |`fence wo,wo`
+
+|`smp_load_acquire()` |`l{b|h|w|d}; fence r,rw`
+
+|`smp_store_release()` |`fence.tso; s{b|h|w|d}`
+
+|Linux Construct |RVWMO AMO Mapping
+
+|`atomic_<op>_relaxed` |`amo<op>.{w|d}`
+
+|`atomic_<op>_acquire` |`amo<op>.{w|d}.aq`
+
+|`atomic_<op>_release` |`amo<op>.{w|d}.rl`
+
+|`atomic_<op>` |`amo<op>.{w|d}.aqrl`
+
+|Linux Construct |RVWMO LR/SC Mapping
+
+|`atomic_<op>_relaxed` |`loop:lr.{w|d}; <op>; sc.{w|d}; bnez loop`
+
+|`atomic_<op>_acquire` |`loop:lr.{w|d}.aq; <op>; sc.{w|d}; bnez loop`
+
+|`atomic_<op>_release`
+|`loop:lr.{w|d}; <op>; sc.{w|d}.aqrl^*; bnez loop` OR
+
+| |`fence.tso; loop:lr.{w|d}; <op>; sc.{w|d}^*; bnez loop`
+
+|`atomic_<op>` |`loop:lr.{w|d}.aq; <op>; sc.{w|d}.aqrl; bnez loop`
+|===
+
+With regard to <<linuxmappings>>, other
+constructs (such as spinlocks) should follow accordingly. Platforms or
+devices with non-coherent DMA may need additional synchronization (such
+as cache flush or invalidate mechanisms); currently any such extra
+synchronization will be device-specific.
+
+<<linuxmappings>> provides a mapping of Linux memory
+ordering macros onto RISC-V memory instructions. The Linux fences
+`dma_rmb()` and `dma_wmb()` map onto FENCE R,R and FENCE W,W,
+respectively, since the RISC-V Unix Platform requires coherent DMA, but
+would be mapped onto FENCE RI,RI and FENCE WO,WO, respectively, on a
+platform with non-coherent DMA. Platforms with non-coherent DMA may also
+require a mechanism by which cache lines can be flushed and/or
+invalidated.
Such mechanisms will be device-specific and/or standardized
+in a future extension to the ISA.
+
+The Linux mappings for release operations may seem stronger than
+necessary, but these mappings are needed to cover some cases in which
+Linux requires stronger orderings than the more intuitive mappings would
+provide. In particular, as of the time this text is being written, Linux
+is actively debating whether to require load-load, load-store, and
+store-store orderings between accesses in one critical section and
+accesses in a subsequent critical section in the same hart and protected
+by the same synchronization object. Not all combinations of
+FENCE RW,W/FENCE R,RW mappings with _aq_/_rl_ mappings combine to
+provide such orderings. There are a few ways around this problem,
+including:
+
+. Always use FENCE RW,W/FENCE R,RW, and never use _aq_/_rl_. This
+suffices but is undesirable, as it defeats the purpose of the _aq_/_rl_
+modifiers.
+. Always use _aq_/_rl_, and never use FENCE RW,W/FENCE R,RW. This does
+not currently work due to the lack of load and store opcodes with _aq_
+and _rl_ modifiers.
+. Strengthen the mappings of release operations such that they would
+enforce sufficient orderings in the presence of either type of acquire
+mapping. This is the currently recommended solution, and the one shown
+in <<linuxmappings>>.
+
+[[fig:litmus:lkmm_ll]]
+....
+RVWMO Mapping:
+(a) lw a0, 0(s0)
+(b) fence.tso      // vs. fence rw,w
+(c) sd x0, 0(s1)
+    ...
+loop:
+(d) amoswap.d.aq a1, t1, 0(s1)
+    bnez a1, loop
+(e) lw a2, 0(s2)
+....
+
+For example, the critical section ordering rule currently being debated
+by the Linux community would require (a) to be ordered before (e) in
+<<fig:litmus:lkmm_ll>>. If that will indeed be
+required, then it would be insufficient for (b) to map as FENCE RW,W.
+That said, these mappings are subject to change as the Linux Kernel
+Memory Model evolves.
+
+[[tab:c11mappings]]
+.Mappings from C/C++ primitives to RISC-V primitives
+[cols="<,<",options="header",]
+|===
+|C/C++ Construct |RVWMO Mapping
+|Non-atomic load |`l{b|h|w|d}`
+
+|`atomic_load(memory_order_relaxed)` |`l{b|h|w|d}`
+
+|`atomic_load(memory_order_acquire)` |`l{b|h|w|d}; fence r,rw`
+
+|`atomic_load(memory_order_seq_cst)`
+|`fence rw,rw; l{b|h|w|d}; fence r,rw`
+
+|Non-atomic store |`s{b|h|w|d}`
+
+|`atomic_store(memory_order_relaxed)` |`s{b|h|w|d}`
+
+|`atomic_store(memory_order_release)` |`fence rw,w; s{b|h|w|d}`
+
+|`atomic_store(memory_order_seq_cst)` |`fence rw,w; s{b|h|w|d}`
+
+|`atomic_thread_fence(memory_order_acquire)` |`fence r,rw`
+
+|`atomic_thread_fence(memory_order_release)` |`fence rw,w`
+
+|`atomic_thread_fence(memory_order_acq_rel)` |`fence.tso`
+
+|`atomic_thread_fence(memory_order_seq_cst)` |`fence rw,rw`
+
+|C/C++ Construct |RVWMO AMO Mapping
+
+|`atomic_<op>(memory_order_relaxed)` |`amo<op>.{w|d}`
+
+|`atomic_<op>(memory_order_acquire)` |`amo<op>.{w|d}.aq`
+
+|`atomic_<op>(memory_order_release)` |`amo<op>.{w|d}.rl`
+
+|`atomic_<op>(memory_order_acq_rel)` |`amo<op>.{w|d}.aqrl`
+
+|`atomic_<op>(memory_order_seq_cst)` |`amo<op>.{w|d}.aqrl`
+
+|C/C++ Construct |RVWMO LR/SC Mapping
+
+|`atomic_<op>(memory_order_relaxed)` |`loop:lr.{w|d}; <op>; sc.{w|d};`
+
+| |`bnez loop`
+
+|`atomic_<op>(memory_order_acquire)`
+|`loop:lr.{w|d}.aq; <op>; sc.{w|d};`
+
+| |`bnez loop`
+
+|`atomic_<op>(memory_order_release)`
+|`loop:lr.{w|d}; <op>; sc.{w|d}.rl;`
+
+| |`bnez loop`
+
+|`atomic_<op>(memory_order_acq_rel)`
+|`loop:lr.{w|d}.aq; <op>; sc.{w|d}.rl;`
+
+| |`bnez loop`
+
+|`atomic_<op>(memory_order_seq_cst)` |`loop:lr.{w|d}.aqrl; <op>;`
+
+| |`sc.{w|d}.rl; bnez loop`
+|===
+
+[[tab:c11mappings_hypothetical]]
+.Hypothetical mappings from C/C++ primitives to RISC-V primitives, if
+native load-acquire and store-release opcodes are introduced
+[cols="<,<",options="header",]
+|===
+|C/C++ Construct |RVWMO Mapping
+|Non-atomic load |`l{b|h|w|d}`
+
+|`atomic_load(memory_order_relaxed)` |`l{b|h|w|d}`
+
+|`atomic_load(memory_order_acquire)` |`l{b|h|w|d}.aq`
+
+|`atomic_load(memory_order_seq_cst)` |`l{b|h|w|d}.aq`
+
+|Non-atomic store |`s{b|h|w|d}`
+
+|`atomic_store(memory_order_relaxed)` |`s{b|h|w|d}`
+
+|`atomic_store(memory_order_release)` |`s{b|h|w|d}.rl`
+
+|`atomic_store(memory_order_seq_cst)` |`s{b|h|w|d}.rl`
+
+|`atomic_thread_fence(memory_order_acquire)` |`fence r,rw`
+
+|`atomic_thread_fence(memory_order_release)` |`fence rw,w`
+
+|`atomic_thread_fence(memory_order_acq_rel)` |`fence.tso`
+
+|`atomic_thread_fence(memory_order_seq_cst)` |`fence rw,rw`
+
+|C/C++ Construct |RVWMO AMO Mapping
+
+|`atomic_<op>(memory_order_relaxed)` |`amo<op>.{w|d}`
+
+|`atomic_<op>(memory_order_acquire)` |`amo<op>.{w|d}.aq`
+
+|`atomic_<op>(memory_order_release)` |`amo<op>.{w|d}.rl`
+
+|`atomic_<op>(memory_order_acq_rel)` |`amo<op>.{w|d}.aqrl`
+
+|`atomic_<op>(memory_order_seq_cst)` |`amo<op>.{w|d}.aqrl`
+
+|C/C++ Construct |RVWMO LR/SC Mapping
+
+|`atomic_<op>(memory_order_relaxed)` |`lr.{w|d}; <op>; sc.{w|d}`
+
+|`atomic_<op>(memory_order_acquire)` |`lr.{w|d}.aq; <op>; sc.{w|d}`
+
+|`atomic_<op>(memory_order_release)` |`lr.{w|d}; <op>; sc.{w|d}.rl`
+
+|`atomic_<op>(memory_order_acq_rel)` |`lr.{w|d}.aq; <op>; sc.{w|d}.rl`
+
+|`atomic_<op>(memory_order_seq_cst)`
+|`lr.{w|d}.aq^*; <op>; sc.{w|d}.rl`
+
+|^*^ must be `lr.{w|d}.aqrl` in order to interoperate with
+code mapped per <<tab:c11mappings>> |
+|===
+
+<<tab:c11mappings>> provides a mapping of C11/C++11 atomic
+operations onto RISC-V memory instructions. If load and store opcodes
+with _aq_ and _rl_ modifiers are introduced, then the mappings in
+<<tab:c11mappings_hypothetical>> will suffice. Note however that
+the two mappings only interoperate correctly if
+`atomic_<op>(memory_order_seq_cst)` is mapped using an LR that has both
+_aq_ and _rl_ set.
+
+Any AMO can be emulated by an LR/SC pair, but care must be taken to
+ensure that any PPO orderings that originate from the LR are also made
+to originate from the SC, and that any PPO orderings that terminate at
+the SC are also made to terminate at the LR. For example, the LR must
+also be made to respect any data dependencies that the AMO has, given
+that load operations do not otherwise have any notion of a data
+dependency. Likewise, the effect of a FENCE R,R elsewhere in the same
+hart must also be made to apply to the SC, which would not otherwise
+respect that fence. The emulator may achieve this effect by simply
+mapping AMOs onto `lr.aq; <op>; sc.aqrl`, matching the mapping used
+elsewhere for fully ordered atomics.
+
+These C11/C++11 mappings require the platform to provide the following
+Physical Memory Attributes (as defined in the RISC-V Privileged ISA) for
+all memory:
+
+* main memory
+* coherent
+* AMOArithmetic
+* RsrvEventual
+
+Platforms with different attributes may require different mappings, or
+require platform-specific software support (e.g., for memory-mapped I/O).
+
+=== Implementation Guidelines
+
+The RVWMO and RVTSO memory models by no means preclude
+microarchitectures from employing sophisticated speculation techniques
+or other forms of optimization in order to deliver higher performance.
+The models also do not impose any requirement to use any one particular
+cache hierarchy, nor even to use a cache coherence protocol at all.
+Instead, these models only specify the behaviors that can be exposed to
+software. Microarchitectures are free to use any pipeline design, any
+coherent or non-coherent cache hierarchy, any on-chip interconnect,
+etc., as long as the design only admits executions that satisfy the
+memory model rules. That said, to help people understand the actual
+implementations of the memory model, in this section we provide some
+guidelines on how architects and programmers should interpret the
+models' rules.
+
+Both RVWMO and RVTSO are multi-copy atomic (or
+"other-multi-copy-atomic"): any store value that is visible to a hart
+other than the one that originally issued it must also be conceptually
+visible to all other harts in the system. In other words, harts may
+forward from their own previous stores before those stores have become
+globally visible to all harts, but no early inter-hart forwarding is
+permitted. Multi-copy atomicity may be enforced in a number of ways. It
+might hold inherently due to the physical design of the caches and store
+buffers, it may be enforced via a single-writer/multiple-reader cache
+coherence protocol, or it might hold due to some other mechanism.
+
+Although multi-copy atomicity does impose some restrictions on the
+microarchitecture, it is one of the key properties keeping the memory
+model from becoming extremely complicated. For example, a hart may not
+legally forward a value from a neighbor hart's private store buffer
+(unless of course it is done in such a way that no new illegal behaviors
+become architecturally visible). Nor may a cache coherence protocol
+forward a value from one hart to another until the coherence protocol
+has invalidated all older copies from other caches. Of course,
+microarchitectures may (and high-performance implementations likely
+will) violate these rules under the covers through speculation or other
+optimizations, as long as any non-compliant behaviors are not exposed to
+the programmer.
+
+As a rough guideline for interpreting the PPO rules in RVWMO, we expect
+the following from the software perspective:
+
+* programmers will use PPO rules <<ppo:->st>> and
+<<ppo:fence>>–<<ppo:pair>> regularly and actively.
+* expert programmers will use PPO rules
+<<ppo:addr>>–<<ppo:ctrl>> to speed up critical paths
+of important data structures.
+* even expert programmers will rarely if ever use PPO rules
+<<ppo:rdw>>–<<ppo:amoforward>> and
+<<ppo:addrdatarfi>>–<<ppo:addrpo>> directly.
+These are included to facilitate common microarchitectural optimizations
+(rule <<ppo:rdw>>) and the operational formal modeling approach
+(rules <<ppo:amoforward>> and
+<<ppo:addrdatarfi>>–<<ppo:addrpo>>) described
+in <<sec:operational>>. They also facilitate the
+process of porting code from other architectures that have similar
+rules.
+
+We also expect the following from the hardware perspective:
+
+* PPO rules <<ppo:->st>> and
+<<ppo:amoforward>>–<<ppo:release>> reflect
+well-understood rules that should pose few surprises to architects.
+* PPO rule <<ppo:rdw>> reflects a natural and common hardware
+optimization, but one that is very subtle and hence is worth double
+checking carefully.
+* PPO rule <<ppo:rcsc>> may not be immediately obvious to
+architects, but it is a standard memory model requirement.
+* The load value axiom, the atomicity axiom, and PPO rules
+<<ppo:pair>>–<<ppo:addrpo>> reflect rules that most
+hardware implementations will enforce naturally, unless they contain
+extreme optimizations. Of course, implementations should make sure to
+double check these rules nevertheless. Hardware must also ensure that
+syntactic dependencies are not "optimized away".
+
+Architectures are free to implement any of the memory model rules as
+conservatively as they choose.
For example, a hardware implementation
+may choose to do any or all of the following:
+
+* interpret all fences as if they were FENCE RW,RW (or FENCE IORW,IORW,
+if I/O is involved), regardless of the bits actually set
+* implement all fences with PW and SR as if they were FENCE RW,RW (or
+FENCE IORW,IORW, if I/O is involved), as PW with SR is the most
+expensive of the four possible main memory ordering components anyway
+* emulate _aq_ and _rl_ as described in <<sec:memory:porting>>
+* enforce all same-address load-load ordering, even in the presence of
+patterns such as "fri-rfi" and "RSW"
+* forbid any forwarding of a value from a store in the store buffer to a
+subsequent AMO or LR to the same address
+* forbid any forwarding of a value from an AMO or SC in the store buffer
+to a subsequent load to the same address
+* implement TSO on all memory accesses, and ignore any main memory
+fences that do not include PW and SR ordering (e.g., as Ztso
+implementations will do)
+* implement all atomics to be RCsc or even fully ordered, regardless of
+annotation
+
+Architectures that implement RVTSO can safely do the following:
+
+* Ignore all fences that do not have both PW and SR (unless the fence
+also orders I/O)
+* Ignore all PPO rules except for rules <<ppo:fence>> through
+<<ppo:rcsc>>, since the rest are redundant with other PPO rules
+under RVTSO assumptions
+
+Other general notes:
+
+* Silent stores (i.e., stores that write the same value that already
+exists at a memory location) behave like any other store from a memory
+model point of view. Likewise, AMOs which do not actually change the
+value in memory (e.g., an AMOMAX for which the value in _rs2_ is smaller
+than the value currently in memory) are still semantically considered
+store operations.
Microarchitectures that attempt to implement silent
+stores must take care to ensure that the memory model is still obeyed,
+particularly in cases such as RSW (<<sec:memory:overlap>>)
+which tend to be incompatible with silent stores.
+* Writes may be merged (i.e., two consecutive writes to the same address
+may be merged) or subsumed (i.e., the earlier of two back-to-back writes
+to the same address may be elided) as long as the resulting behavior
+does not otherwise violate the memory model semantics.
+
+The question of write subsumption can be understood from the following
+example:
+
+[cols="^,<,^,<",options="header",]
+|===
+|Hart 0 | |Hart 1 |
+| |li t1, 3 | |li t3, 2
+| |li t2, 1 | |
+|(a) |sw t1,0(s0) |(d) |lw a0,0(s1)
+|(b) |fence w, w |(e) |sw a0,0(s0)
+|(c) |sw t2,0(s1) |(f) |sw t3,0(s0)
+|===
+
+As written, if the load (d) reads value latexmath:[$1$], then (a) must
+precede (f) in the global memory order:
+
+* (a) precedes (c) in the global memory order because of rule 2
+* (c) precedes (d) in the global memory order because of the Load
+Value axiom
+* (d) precedes (e) in the global memory order because of rule 7
+* (e) precedes (f) in the global memory order because of rule 1
+
+In other words the final value of the memory location whose address is
+in `s0` must be latexmath:[$2$] (the value written by the store (f)) and
+cannot be latexmath:[$3$] (the value written by the store (a)).
+
+A very aggressive microarchitecture might erroneously decide to discard
+(e), as (f) supersedes it, and this may in turn lead the
+microarchitecture to break the now-eliminated dependency between (d) and
+(f) (and hence also between (a) and (f)). This would violate the memory
+model rules, and hence it is forbidden. Write subsumption may in other
+cases be legal, if for example there were no data dependency between (d)
+and (e).
+
+==== Possible Future Extensions
+
+We expect that any or all of the following possible future extensions
+would be compatible with the RVWMO memory model:
+
+* 'V' vector ISA extensions
+* 'J' JIT extension
+* Native encodings for load and store opcodes with _aq_ and _rl_ set
+* Fences limited to certain addresses
+* Cache writeback/flush/invalidate/etc. instructions
+
+[[discrepancies]]
+=== Known Issues
+
+[[mixedrsw]]
+==== Mixed-size RSW
+
+[[fig:litmus:discrepancy:rsw1]]
+[cols="^,<,^,<",options="header",]
+|===
+|Hart 0 | |Hart 1 |
+| |li t1, 1 | |li t1, 1
+|(a) |lw a0,0(s0) |(d) |lw a1,0(s1)
+|(b) |fence rw,rw |(e) |amoswap.w.rl a2,t1,0(s2)
+|(c) |sw t1,0(s1) |(f) |ld a3,0(s2)
+| | |(g) |lw a4,4(s2)
+| | | |xor a5,a4,a4
+| | | |add s0,s0,a5
+| | |(h) |sw a2,0(s0)
+|Outcome: `a0=1`, `a1=1`, `a2=0`, `a3=1`, `a4=0` | | |
+|===
+
+[[fig:litmus:discrepancy:rsw2]]
+[cols="^,<,^,<",options="header",]
+|===
+|Hart 0 | |Hart 1 |
+| |li t1, 1 | |li t1, 1
+|(a) |lw a0,0(s0) |(d) |ld a1,0(s1)
+|(b) |fence rw,rw |(e) |lw a2,4(s1)
+|(c) |sw t1,0(s1) | |xor a3,a2,a2
+| | | |add s0,s0,a3
+| | |(f) |sw a2,0(s0)
+|Outcome: `a0=0`, `a1=1`, `a2=0` | | |
+|===
+
+[[fig:litmus:discrepancy:rsw3]]
+[cols="^,<,^,<",options="header",]
+|===
+|Hart 0 | |Hart 1 |
+| |li t1, 1 | |li t1, 1
+|(a) |lw a0,0(s0) |(d) |sw t1,4(s1)
+|(b) |fence rw,rw |(e) |ld a1,0(s1)
+|(c) |sw t1,0(s1) |(f) |lw a2,4(s1)
+| | | |xor a3,a2,a2
+| | | |add s0,s0,a3
+| | |(g) |sw a2,0(s0)
+|Outcome: `a0=1`, `a1=0x100000001`, `a2=1` | | |
+|===
+
+There is a known discrepancy between the operational and axiomatic
+specifications within the family of mixed-size RSW variants shown in
+Figures <<fig:litmus:discrepancy:rsw1>>–<<fig:litmus:discrepancy:rsw3>>.
+To address this, we may choose to add something like the following new
+PPO rule: Memory operation latexmath:[$a$] precedes memory operation
+latexmath:[$b$] in preserved program order (and hence also in the global
+memory order) if latexmath:[$a$] precedes latexmath:[$b$] in program
+order, latexmath:[$a$] and latexmath:[$b$] both access regular main
+memory (rather than I/O regions), latexmath:[$a$] is a load,
+latexmath:[$b$] is a store, there is a load latexmath:[$m$] between
+latexmath:[$a$] and latexmath:[$b$], there is a byte latexmath:[$x$]
+that both latexmath:[$a$] and latexmath:[$m$] read, there is no store
+between latexmath:[$a$] and latexmath:[$m$] that writes to
+latexmath:[$x$], and latexmath:[$m$] precedes latexmath:[$b$] in PPO. In
+other words, in herd syntax, we may choose to add
+`(po-loc & rsw);ppo;[W]` to PPO. Many implementations will already
+enforce this ordering naturally. As such, even though this rule is not
+official, we recommend that implementers enforce it nevertheless in
+order to ensure forwards compatibility with the possible future addition
+of this rule to RVWMO.
diff --git a/src/mm-formal.adoc b/src/mm-formal.adoc
new file mode 100644
index 0000000..feea627
--- /dev/null
+++ b/src/mm-formal.adoc
@@ -0,0 +1,1412 @@
+[[mm-formal]]
+[appendix]
+== Formal Memory Model Specifications, Version 0.1
+
+To facilitate formal analysis of RVWMO, this chapter presents a set of
+formalizations using different tools and modeling approaches. Any
+discrepancies are unintended; the expectation is that the models
+describe exactly the same sets of legal behaviors.
+
+This appendix should be treated as commentary; all normative material is
+provided in <<ch:memorymodel>> and in the rest of
+the main body of the ISA specification. All currently known
+discrepancies are listed in
+<<discrepancies>>. Any other
+discrepancies are unintentional.
+
+[[alloy]]
+=== Formal Axiomatic Specification in Alloy
+
+We present a formal specification of the RVWMO memory model in Alloy
+(http://alloy.mit.edu). This model is available online at
+https://github.com/daniellustig/riscv-memory-model.
+
+The online material also contains some litmus tests and some examples of
+how Alloy can be used to model check some of the mappings in
+<<sec:memory:porting>>.
+
+....
+////////////////////////////////////////////////////////////////////////////////
+// =RVWMO PPO=
+
+// Preserved Program Order
+fun ppo : Event->Event {
+  // same-address ordering
+  po_loc :> Store
+  + rdw
+  + (AMO + StoreConditional) <: rfi
+
+  // explicit synchronization
+  + ppo_fence
+  + Acquire <: ^po :> MemoryEvent
+  + MemoryEvent <: ^po :> Release
+  + RCsc <: ^po :> RCsc
+  + pair
+
+  // syntactic dependencies
+  + addrdep
+  + datadep
+  + ctrldep :> Store
+
+  // pipeline dependencies
+  + (addrdep+datadep).rfi
+  + addrdep.^po :> Store
+}
+
+// the global memory order respects preserved program order
+fact { ppo in ^gmo }
+....
+
+....
+//////////////////////////////////////////////////////////////////////////////// +// =RVWMO axioms= + +// Load Value Axiom +fun candidates[r: MemoryEvent] : set MemoryEvent { + (r.~^gmo & Store & same_addr[r]) // writes preceding r in gmo + + (r.^~po & Store & same_addr[r]) // writes preceding r in po +} + +fun latest_among[s: set Event] : Event { s - s.~^gmo } + +pred LoadValue { + all w: Store | all r: Load | + w->r in rf <=> w = latest_among[candidates[r]] +} + +// Atomicity Axiom +pred Atomicity { + all r: Store.~pair | // starting from the lr, + no x: Store & same_addr[r] | // there is no store x to the same addr + x not in same_hart[r] // such that x is from a different hart, + and x in r.~rf.^gmo // x follows (the store r reads from) in gmo, + and r.pair in x.^gmo // and r follows x in gmo +} + +// Progress Axiom implicit: Alloy only considers finite executions + +pred RISCV_mm { LoadValue and Atomicity /* and Progress */ } +.... + +` ` + +.... +//////////////////////////////////////////////////////////////////////////////// +// Basic model of memory + +sig Hart { // hardware thread + start : one Event +} +sig Address {} +abstract sig Event { + po: lone Event // program order +} + +abstract sig MemoryEvent extends Event { + address: one Address, + acquireRCpc: lone MemoryEvent, + acquireRCsc: lone MemoryEvent, + releaseRCpc: lone MemoryEvent, + releaseRCsc: lone MemoryEvent, + addrdep: set MemoryEvent, + ctrldep: set Event, + datadep: set MemoryEvent, + gmo: set MemoryEvent, // global memory order + rf: set MemoryEvent +} +sig LoadNormal extends MemoryEvent {} // l{b|h|w|d} +sig LoadReserve extends MemoryEvent { // lr + pair: lone StoreConditional +} +sig StoreNormal extends MemoryEvent {} // s{b|h|w|d} +// all StoreConditionals in the model are assumed to be successful +sig StoreConditional extends MemoryEvent {} // sc +sig AMO extends MemoryEvent {} // amo +sig NOP extends Event {} + +fun Load : Event { LoadNormal + LoadReserve + AMO } +fun Store : Event 
{ StoreNormal + StoreConditional + AMO } + +sig Fence extends Event { + pr: lone Fence, // opcode bit + pw: lone Fence, // opcode bit + sr: lone Fence, // opcode bit + sw: lone Fence // opcode bit +} +sig FenceTSO extends Fence {} + +/* Alloy encoding detail: opcode bits are either set (encoded, e.g., + * as f.pr in iden) or unset (f.pr not in iden). The bits cannot be used for + * anything else */ +fact { pr + pw + sr + sw in iden } +// likewise for ordering annotations +fact { acquireRCpc + acquireRCsc + releaseRCpc + releaseRCsc in iden } +// don't try to encode FenceTSO via pr/pw/sr/sw; just use it as-is +fact { no FenceTSO.(pr + pw + sr + sw) } +.... + +` ` + +.... +//////////////////////////////////////////////////////////////////////////////// +// =Basic model rules= + +// Ordering annotation groups +fun Acquire : MemoryEvent { MemoryEvent.acquireRCpc + MemoryEvent.acquireRCsc } +fun Release : MemoryEvent { MemoryEvent.releaseRCpc + MemoryEvent.releaseRCsc } +fun RCpc : MemoryEvent { MemoryEvent.acquireRCpc + MemoryEvent.releaseRCpc } +fun RCsc : MemoryEvent { MemoryEvent.acquireRCsc + MemoryEvent.releaseRCsc } + +// There is no such thing as store-acquire or load-release, unless it's both +fact { Load & Release in Acquire } +fact { Store & Acquire in Release } + +// FENCE PPO +fun FencePRSR : Fence { Fence.(pr & sr) } +fun FencePRSW : Fence { Fence.(pr & sw) } +fun FencePWSR : Fence { Fence.(pw & sr) } +fun FencePWSW : Fence { Fence.(pw & sw) } + +fun ppo_fence : MemoryEvent->MemoryEvent { + (Load <: ^po :> FencePRSR).(^po :> Load) + + (Load <: ^po :> FencePRSW).(^po :> Store) + + (Store <: ^po :> FencePWSR).(^po :> Load) + + (Store <: ^po :> FencePWSW).(^po :> Store) + + (Load <: ^po :> FenceTSO) .(^po :> MemoryEvent) + + (Store <: ^po :> FenceTSO) .(^po :> Store) +} + +// auxiliary definitions +fun po_loc : Event->Event { ^po & address.~address } +fun same_hart[e: Event] : set Event { e + e.^~po + e.^po } +fun same_addr[e: Event] : set Event { 
e.address.~address } + +// initial stores +fun NonInit : set Event { Hart.start.*po } +fun Init : set Event { Event - NonInit } +fact { Init in StoreNormal } +fact { Init->(MemoryEvent & NonInit) in ^gmo } +fact { all e: NonInit | one e.*~po.~start } // each event is in exactly one hart +fact { all a: Address | one Init & a.~address } // one init store per address +fact { no Init <: po and no po :> Init } +.... + +` ` + +.... +// po +fact { acyclic[po] } + +// gmo +fact { total[^gmo, MemoryEvent] } // gmo is a total order over all MemoryEvents + +//rf +fact { rf.~rf in iden } // each read returns the value of only one write +fact { rf in Store <: address.~address :> Load } +fun rfi : MemoryEvent->MemoryEvent { rf & (*po + *~po) } + +//dep +fact { no StoreNormal <: (addrdep + ctrldep + datadep) } +fact { addrdep + ctrldep + datadep + pair in ^po } +fact { datadep in datadep :> Store } +fact { ctrldep.*po in ctrldep } +fact { no pair & (^po :> (LoadReserve + StoreConditional)).^po } +fact { StoreConditional in LoadReserve.pair } // assume all SCs succeed + +// rdw +fun rdw : Event->Event { + (Load <: po_loc :> Load) // start with all same_address load-load pairs, + - (~rf.rf) // subtract pairs that read from the same store, + - (po_loc.rfi) // and subtract out "fri-rfi" patterns +} + +// filter out redundant instances and/or visualizations +fact { no gmo & gmo.gmo } // keep the visualization uncluttered +fact { all a: Address | some a.~address } + +//////////////////////////////////////////////////////////////////////////////// +// =Optional: opcode encoding restrictions= + +// the list of blessed fences +fact { Fence in + Fence.pr.sr + + Fence.pw.sw + + Fence.pr.pw.sw + + Fence.pr.sr.sw + + FenceTSO + + Fence.pr.pw.sr.sw +} + +pred restrict_to_current_encodings { + no (LoadNormal + StoreNormal) & (Acquire + Release) +} + +//////////////////////////////////////////////////////////////////////////////// +// =Alloy shortcuts= +pred acyclic[rel: Event->Event] { no iden 
& ^rel }
+pred total[rel: Event->Event, bag: Event] {
+  all disj e, e': bag | e->e' in rel + ~rel
+  acyclic[rel]
+}
+....
+
+[[sec:herd]]
+=== Formal Axiomatic Specification in Herd
+
+The tool [.sans-serif]#herd# takes a memory model and a litmus test as
+input and simulates the execution of the test on top of the memory
+model. Memory models are written in the domain-specific language Cat.
+This section provides two Cat memory models of RVWMO. The first model,
+<<fig:herd2>>, follows the _global memory order_,
+<<ch:memorymodel>>, definition of RVWMO, as much
+as is possible for a Cat model. The second model,
+<<fig:herd3>>, is an equivalent, more efficient,
+partial order based RVWMO model.
+
+The simulator [.sans-serif]#herd# is part of the [.sans-serif]#diy# tool
+suite — see http://diy.inria.fr for software and documentation. The
+models and more are available online
+at http://diy.inria.fr/cats7/riscv/.
+
+....
+(*************)
+(* Utilities *)
+(*************)
+
+(* All fence relations *)
+let fence.r.r = [R];fencerel(Fence.r.r);[R]
+let fence.r.w = [R];fencerel(Fence.r.w);[W]
+let fence.r.rw = [R];fencerel(Fence.r.rw);[M]
+let fence.w.r = [W];fencerel(Fence.w.r);[R]
+let fence.w.w = [W];fencerel(Fence.w.w);[W]
+let fence.w.rw = [W];fencerel(Fence.w.rw);[M]
+let fence.rw.r = [M];fencerel(Fence.rw.r);[R]
+let fence.rw.w = [M];fencerel(Fence.rw.w);[W]
+let fence.rw.rw = [M];fencerel(Fence.rw.rw);[M]
+let fence.tso =
+  let f = fencerel(Fence.tso) in
+  ([W];f;[W]) | ([R];f;[M])
+
+let fence =
+  fence.r.r | fence.r.w | fence.r.rw |
+  fence.w.r | fence.w.w | fence.w.rw |
+  fence.rw.r | fence.rw.w | fence.rw.rw |
+  fence.tso
+
+(* Same address, no W to the same address in-between *)
+let po-loc-no-w = po-loc \ (po-loc?;[W];po-loc)
+(* Read same write *)
+let rsw = rf^-1;rf
+(* Acquire, or stronger *)
+let AQ = Acq|AcqRel
+(* Release or stronger *)
+and RL = Rel|AcqRel
+(* All RCsc *)
+let RCsc =
Acq|Rel|AcqRel +(* Amo events are both R and W, relation rmw relates paired lr/sc *) +let AMO = R & W +let StCond = range(rmw) + +(*************) +(* ppo rules *) +(*************) + +(* Overlapping-Address Orderings *) +let r1 = [M];po-loc;[W] +and r2 = ([R];po-loc-no-w;[R]) \ rsw +and r3 = [AMO|StCond];rfi;[R] +(* Explicit Synchronization *) +and r4 = fence +and r5 = [AQ];po;[M] +and r6 = [M];po;[RL] +and r7 = [RCsc];po;[RCsc] +and r8 = rmw +(* Syntactic Dependencies *) +and r9 = [M];addr;[M] +and r10 = [M];data;[W] +and r11 = [M];ctrl;[W] +(* Pipeline Dependencies *) +and r12 = [R];(addr|data);[W];rfi;[R] +and r13 = [R];addr;[M];po;[W] + +let ppo = r1 | r2 | r3 | r4 | r5 | r6 | r7 | r8 | r9 | r10 | r11 | r12 | r13 +.... + +` ` + +.... +Total + +(* Notice that herd has defined its own rf relation *) + +(* Define ppo *) +include "riscv-defs.cat" + +(********************************) +(* Generate global memory order *) +(********************************) + +let gmo0 = (* precursor: ie build gmo as an total order that include gmo0 *) + loc & (W\FW) * FW | # Final write after any write to the same location + ppo | # ppo compatible + rfe # includes herd external rf (optimization) + +(* Walk over all linear extensions of gmo0 *) +with gmo from linearizations(M\IW,gmo0) + +(* Add initial writes upfront -- convenient for computing rfGMO *) +let gmo = gmo | loc & IW * (M\IW) + +(**********) +(* Axioms *) +(**********) + +(* Compute rf according to the load value axiom, aka rfGMO *) +let WR = loc & ([W];(gmo|po);[R]) +let rfGMO = WR \ (loc&([W];gmo);WR) + +(* Check equality of herd rf and of rfGMO *) +empty (rf\rfGMO)|(rfGMO\rf) as RfCons + +(* Atomicity axiom *) +let infloc = (gmo & loc)^-1 +let inflocext = infloc & ext +let winside = (infloc;rmw;inflocext) & (infloc;rf;rmw;inflocext) & [W] +empty winside as Atomic +.... + +` ` + +.... 
+Partial
+
+(***************)
+(* Definitions *)
+(***************)
+
+(* Define ppo *)
+include "riscv-defs.cat"
+
+(* Compute coherence relation *)
+include "cos-opt.cat"
+
+(**********)
+(* Axioms *)
+(**********)
+
+(* Sc per location *)
+acyclic co|rf|fr|po-loc as Coherence
+
+(* Main model axiom *)
+acyclic co|rfe|fr|ppo as Model
+
+(* Atomicity axiom *)
+empty rmw & (fre;coe) as Atomic
+....
+
+[[sec:operational]]
+=== An Operational Memory Model
+
+This is an alternative presentation of the RVWMO memory model in
+operational style. It aims to admit exactly the same extensional
+behavior as the axiomatic presentation: for any given program, it
+admits an execution if and only if the axiomatic presentation allows
+it.
+
+The axiomatic presentation is defined as a predicate on complete
+candidate executions. In contrast, this operational presentation has an
+abstract microarchitectural flavor: it is expressed as a state machine,
+with states that are an abstract representation of hardware machine
+states, and with explicit out-of-order and speculative execution (but
+abstracting from more implementation-specific microarchitectural details
+such as register renaming, store buffers, cache hierarchies, cache
+protocols, etc.). As such, it can provide useful intuition. It can also
+construct executions incrementally, making it possible to interactively
+and randomly explore the behavior of larger examples, while the
+axiomatic model requires complete candidate executions over which the
+axioms can be checked.
+
+The operational presentation covers mixed-size execution, with
+potentially overlapping memory accesses of different power-of-two byte
+sizes. Misaligned accesses are broken up into single-byte accesses.
+
+The operational model, together with a fragment of the RISC-V ISA
+semantics (RV64I and A), is integrated into the `rmem` exploration tool
+(https://github.com/rems-project/rmem).
`rmem` can explore litmus tests
+(see <<sec:litmustests>>) and small ELF binaries
+exhaustively, pseudo-randomly, and interactively. In `rmem`, the ISA
+semantics is expressed explicitly in Sail (see
+https://github.com/rems-project/sail for the Sail language, and
+https://github.com/rems-project/sail-riscv for the RISC-V ISA model),
+and the concurrency semantics is expressed in Lem (see
+https://github.com/rems-project/lem for the Lem language).
+
+`rmem` has a command-line interface and a web interface. The
+web interface runs entirely on the client side, and is provided online
+together with a library of litmus tests:
+http://www.cl.cam.ac.uk/~pes20/rmem. The command-line interface is
+faster than the web interface, especially in exhaustive mode.
+
+Below is an informal introduction to the model states and transitions.
+The description of the formal model starts in the next subsection.
+
+Terminology: In contrast to the axiomatic presentation, here every
+memory operation is either a load or a store. Hence, AMOs give rise to
+two distinct memory operations, a load and a store. When used in
+conjunction with "instruction", the terms "load" and "store" refer
+to instructions that give rise to such memory operations. As such, both
+include AMO instructions. The term "acquire" refers to an instruction
+(or its memory operation) with the acquire-RCpc or acquire-RCsc
+annotation. The term "release" refers to an instruction (or its memory
+operation) with the release-RCpc or release-RCsc annotation.
+
+==== Model states
+
+A model state consists of a shared memory and a tuple of hart states.
+
+[cols="^,^,^",]
+|===
+|Hart 0 |*…* |Hart latexmath:[$n$]
+
+|latexmath:[$\big\uparrow$] latexmath:[$\big\downarrow$] |
+|latexmath:[$\big\uparrow$] latexmath:[$\big\downarrow$]
+
+3+|Shared Memory
+|===
+
+The shared memory state records all the memory store operations that
+have propagated so far, in the order they propagated (this can be made
+more efficient, but for simplicity of the presentation we keep it this
+way).
+
+Each hart state consists principally of a tree of instruction instances,
+some of which have been _finished_, and some of which have not.
+Non-finished instruction instances can be subject to _restart_, e.g. if
+they depend on an out-of-order or speculative load that turns out to be
+unsound.
+
+Conditional branch and indirect jump instructions may have multiple
+successors in the instruction tree. When such an instruction is
+finished, any un-taken alternative paths are discarded.
+
+Each instruction instance in the instruction tree has a state that
+includes an execution state of the intra-instruction semantics (the ISA
+pseudocode for this instruction). The model uses a formalization of the
+intra-instruction semantics in Sail. One can think of the execution
+state of an instruction as a representation of the pseudocode control
+state, pseudocode call stack, and local variable values. An instruction
+instance state also includes information about the instance’s memory and
+register footprints, its register reads and writes, its memory
+operations, whether it is finished, etc.
Each transition arises
+from a single instruction instance; it will change the state of that
+instance, and it may depend on or change the rest of its hart state and
+the shared memory state, but it does not depend on other hart states,
+and it will not change them. The transitions are introduced below and
+defined in <<sec:omm:transitions>>, with a precondition and
+a construction of the post-transition model state for each.
+
+Transitions for all instructions:
+
+* _Fetch instruction_: This transition represents a fetch and decode of
+a new instruction instance, as a program order successor of a previously
+fetched instruction instance (or the initial fetch address).
++
+The model assumes the instruction memory is fixed; it does not describe
+the behavior of self-modifying code. In particular, the transition does
+not generate memory load operations, and the shared memory is not
+involved in the transition. Instead, the model depends on an external
+oracle that provides an opcode when given a memory location.
+* _Register write_: This is a write of a register value.
+* _Register read_: This is a read of a register value from the most
+recent program-order-predecessor instruction instance that writes to
+that register.
+* _Pseudocode internal step_: This covers pseudocode internal
+computation: arithmetic, function calls, etc.
+* _Finish instruction_: At this point the instruction pseudocode is
+done, the instruction cannot be restarted, memory accesses cannot be
+discarded, and all memory effects have taken place. For conditional
+branch and indirect jump instructions, any program order successors that
+were fetched from an address that is not the one that was written to the
+_pc_ register are discarded, together with the sub-tree of instruction
+instances below them.
+
+Transitions specific to load instructions:
+
+* _Initiate memory load operations_: At this point the memory footprint
+of the load instruction is provisionally known (it could change if
+earlier instructions are restarted) and its individual memory load
+operations can start being satisfied.
+* _Satisfy memory load operation by forwarding from unpropagated
+stores_: This partially or entirely satisfies a single memory load
+operation by forwarding, from program-order-previous memory store
+operations.
+* _Satisfy memory load operation from memory_: This entirely satisfies
+the outstanding slices of a single memory load operation, from memory.
+* _Complete load operations_: At this point all the memory load
+operations of the instruction have been entirely satisfied and the
+instruction pseudocode can continue executing. A load instruction can be
+subject to being restarted until the _Finish instruction_ transition.
+But, under some conditions, the model might treat a load instruction as
+non-restartable even before it is finished (e.g. see _Propagate store
+operation_).
+
+Transitions specific to store instructions:
+
+* _Initiate memory store operation footprints_: At this point the memory
+footprint of the store is provisionally known.
+* _Instantiate memory store operation values_: At this point the memory
+store operations have their values and program-order-successor memory
+load operations can be satisfied by forwarding from them.
+* _Commit store instruction_: At this point the store operations are
+guaranteed to happen (the instruction can no longer be restarted or
+discarded), and they can start being propagated to memory.
+* _Propagate store operation_: This propagates a single memory store
+operation to memory.
+* _Complete store operations_: At this point all the memory store
+operations of the instruction have been propagated to memory, and the
+instruction pseudocode can continue executing.
+
+Transitions specific to `sc` instructions:
+
+* _Early `sc` fail_: This causes the `sc` to fail, either a spontaneous
+fail or because it is not paired with a program-order-previous `lr`.
+* _Paired `sc`_: This transition indicates the `sc` is paired with an
+`lr` and might succeed.
+* _Commit and propagate store operation of an `sc`_: This is an atomic
+execution of the transitions _Commit store instruction_ and _Propagate
+store operation_; it is enabled only if the stores from which the `lr`
+read have not been overwritten.
+* _Late `sc` fail_: This causes the `sc` to fail, either a spontaneous
+fail or because the stores from which the `lr` read have been
+overwritten.
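As an illustration only, the typical order in which a non-`sc` store instruction takes the store transitions listed above can be sketched as a checked state machine in Python. This is a hypothetical sketch, not part of the model: the step names are invented, and the real model interleaves these steps with transitions of other instructions and repeats the propagate step once per memory store operation.

```python
# Hypothetical sketch: typical lifecycle of a non-sc store instruction,
# as a linear sequence of named steps (names invented for this sketch).
STORE_ORDER = [
    "fetch",        # fetch and decode the instruction
    "store_ea",     # footprint announced (memory store operation footprints)
    "store_memv",   # store operations get their values
    "commit",       # store guaranteed to happen; no more restart/discard
    "propagate",    # store operation(s) pushed to shared memory
    "complete",     # all store operations propagated; pseudocode continues
    "finish",       # instruction finished
]

class StoreInstance:
    """Tracks which lifecycle steps have been taken, enforcing the order."""
    def __init__(self):
        self.taken = []

    def take(self, step):
        expected = STORE_ORDER[len(self.taken)]
        if step != expected:
            raise ValueError(f"cannot take {step!r}; next allowed step is {expected!r}")
        self.taken.append(step)

s = StoreInstance()
for step in STORE_ORDER:
    s.take(step)
assert s.taken == STORE_ORDER
```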
+
+Transitions specific to AMO instructions:
+
+* _Satisfy, commit and propagate operations of an AMO_: This is an
+atomic execution of all the transitions needed to satisfy the load
+operation, do the required arithmetic, and propagate the store
+operation.
+
+Transitions specific to fence instructions:
+
+* _Commit fence_
+
+The transitions labeled latexmath:[$\circ$] can always be taken eagerly,
+as soon as their precondition is satisfied, without excluding other
+behavior; those labeled latexmath:[$\bullet$] cannot. Although _Fetch
+instruction_ is marked with a latexmath:[$\bullet$], it can be taken
+eagerly as long as it is not taken infinitely many times.
+
+An instance of a non-AMO load instruction, after being fetched, will
+typically experience the following transitions in this order:
+
+. _Fetch instruction_
+. _Initiate memory load operations_
+. _Satisfy memory load operation by forwarding from unpropagated
+stores_ and/or _Satisfy memory load operation from memory_ (as many as
+needed to satisfy all the load operations of the instance)
+. _Complete load operations_
+. _Register write_
+. _Finish instruction_
+
+Before, between and after the transitions above, any number of
+_Register read_ and _Pseudocode internal step_ transitions may appear.
+In addition, a _Fetch instruction_ transition for fetching the
+instruction in the next program location will be available until it is
+taken.
+
+This concludes the informal description of the operational model. The
+following sections describe the formal operational model.
+
+[[sec:omm:pseudocode_exec]]
+==== Intra-instruction Pseudocode Execution
+
+The intra-instruction semantics for each instruction instance is
+expressed as a state machine, essentially running the instruction
+pseudocode. Given a pseudocode execution state, it computes the next
+state. Most states identify a pending memory or register operation,
+requested by the pseudocode, which the memory model has to perform.
The
+states are (this is a tagged union; tags in small-caps):
+
+[cols="<,<",]
+|===
+|Load_mem(_kind_, _address_, _size_, _load_continuation_) |memory load
+operation
+
+|Early_sc_fail(_res_continuation_) |allow `sc` to fail early
+
+|Store_ea(_kind_, _address_, _size_, _next_state_) |memory store
+effective address
+
+|Store_memv(_mem_value_, _store_continuation_) |memory store value
+
+|Fence(_kind_, _next_state_) |fence
+
+|Read_reg(_reg_name_, _read_continuation_) |register read
+
+|Write_reg(_reg_name_, _reg_value_, _next_state_) |register write
+
+|Internal(_next_state_) |pseudocode internal step
+
+|Done |end of pseudocode
+|===
+
+Here:
+
+* _mem_value_ and _reg_value_ are lists of bytes;
+* _address_ is an integer of XLEN bits;
+* for load/store, _kind_ identifies whether it is `lr/sc`,
+acquire-RCpc/release-RCpc, acquire-RCsc/release-RCsc, or
+acquire-release-RCsc;
+* for fence, _kind_ identifies whether it is a normal or TSO fence, and
+(for normal fences) the predecessor and successor ordering bits;
+* _reg_name_ identifies a register and a slice thereof (start and end
+bit indices); and
+* the continuations describe how the instruction instance will continue
+for each value that might be provided by the surrounding memory model
+(the _load_continuation_ and _read_continuation_ take the value loaded
+from memory and read from the previous register write, the
+_store_continuation_ takes _false_ for an `sc` that failed and _true_ in
+all other cases, and _res_continuation_ takes _false_ if the `sc` fails
+and _true_ otherwise).
+
+For example, given the load instruction `lw x1,0(x2)`, an execution will
+typically go as follows. The initial execution state will be computed
+from the pseudocode for the given opcode. This can be expected to be
+Read_reg(`x2`, _read_continuation_).
Feeding the most recently written
+value of register `x2` (the instruction semantics will be blocked if
+necessary until the register value is available), say `0x4000`, to
+_read_continuation_ returns Load_mem(`plain_load`, `0x4000`, `4`,
+_load_continuation_). Feeding the 4-byte value loaded from memory
+location `0x4000`, say `0x42`, to _load_continuation_ returns
+Write_reg(`x1`, `0x42`, Done). Many Internal(_next_state_) states may
+appear before and between the states above.
+
+Notice that writing to memory is split into two steps, Store_ea and
+Store_memv: the first one makes the memory footprint of the store
+provisionally known, and the second one adds the value to be stored. We
+ensure these are paired in the pseudocode (Store_ea followed by
+Store_memv), but there may be other steps between them.
+
+It is observable that the Store_ea can occur before the value to be
+stored is determined. For example, for the litmus test
+LB+fence.r.rw+data-po to be allowed by the operational model (as it is
+by RVWMO), the first store in Hart 1 has to take the Store_ea step
+before its value is determined, so that the second store can see it is
+to a non-overlapping memory footprint, allowing the second store to be
+committed out of order without violating coherence.
+
+The pseudocode of each instruction performs at most one store or one
+load, except for AMOs that perform exactly one load and one store. Those
+memory accesses are then split apart into the architecturally atomic
+units by the hart semantics (see _Initiate memory load operations_ and
+_Initiate memory store operation footprints_ below).
+
+Informally, each bit of a register read should be satisfied from a
+register write by the most recent (in program order) instruction
+instance that can write that bit (or from the hart’s initial register
+state if there is no such write). Hence, it is essential to know the
+register write footprint of each instruction instance, which we
+calculate when the instruction instance is created (see the action of
+_Fetch instruction_ below).
We ensure in the pseudocode that each instruction does at most
+one register write to each register bit, and also that it does not try
+to read a register value it just wrote.
+
+Data-flow dependencies (address and data) in the model emerge from the
+fact that each register read has to wait for the appropriate register
+write to be executed (as described above).
+
+[[sec:omm:inst_state]]
+==== Instruction Instance State
+
+Each instruction instance latexmath:[$i$] has a state comprising:
+
+* _program_loc_, the memory address from which the instruction was
+fetched;
+* _instruction_kind_, identifying whether this is a load, store, AMO,
+fence, branch/jump or a "simple" instruction (this also includes a
+_kind_ similar to the one described for the pseudocode execution
+states);
+* _src_regs_, the set of source _reg_name_s (including system
+registers), as statically determined from the pseudocode of the
+instruction;
+* _dst_regs_, the destination _reg_name_s (including system registers),
+as statically determined from the pseudocode of the instruction;
+* _pseudocode_state_ (or sometimes just "state" for short), one of (this
+is a tagged union; tags in small-caps):
++
+[cols="<,<",]
+|===
+|Plain(_isa_state_) |ready to make a pseudocode transition
+
+|Pending_mem_loads(_load_continuation_) |requesting memory load
+operation(s)
+
+|Pending_mem_stores(_store_continuation_) |requesting memory store
+operation(s)
+|===
+* _reg_reads_, the register reads the instance has performed, including,
+for each one, the register write slices it read from;
+* _reg_writes_, the register writes the instance has performed;
+* _mem_loads_, a set of memory load operations, and for each one the
+as-yet-unsatisfied slices (the byte indices that have not been satisfied
+yet), and, for the satisfied slices, the store slices (each consisting
+of a memory store operation and a subset of its byte indices) that
+satisfied it;
+* _mem_stores_, a set of memory store operations, and for each one a
+flag that indicates whether it has been propagated (passed to the shared
+memory) or not; and
+* information recording whether the instance is committed, finished,
+etc.
+
+Each memory load operation includes a memory footprint (address and
+size). Each memory store operation includes a memory footprint, and,
+when available, a value.
+
+A load instruction instance with a non-empty _mem_loads_, for which all
+the load operations are satisfied (i.e. there are no unsatisfied load
+slices), is said to be _entirely satisfied_.
+
+Informally, an instruction instance is said to have _fully determined
+data_ if the load (and `sc`) instructions feeding its source registers
+are finished. Similarly, it is said to have a _fully determined memory
+footprint_ if the load (and `sc`) instructions feeding its memory
+operation address register are finished. Formally, we first define the
+notion of a _fully determined register write_: a register write
+latexmath:[$w$] from _reg_writes_ of instruction instance
+latexmath:[$i$] is said to be _fully determined_ if one of the following
+conditions holds:
+
+. latexmath:[$i$] is finished; or
+. the value written by latexmath:[$w$] is not affected by a memory
+operation that latexmath:[$i$] has made (i.e. a value loaded from memory
+or the result of an `sc`), and, for every register read that
+latexmath:[$i$] has made that affects latexmath:[$w$], the register
+write from which latexmath:[$i$] read is fully determined (or
+latexmath:[$i$] read from the initial register state).
+
+Now, an instruction instance latexmath:[$i$] is said to have _fully
+determined data_ if for every register read latexmath:[$r$] from
+_reg_reads_, the register writes that latexmath:[$r$] reads from are
+fully determined.
An instruction instance latexmath:[$i$] is said to
+have a _fully determined memory footprint_ if for every register read
+latexmath:[$r$] from _reg_reads_ that feeds into latexmath:[$i$]’s
+memory operation address, the register writes that latexmath:[$r$] reads
+from are fully determined.
+
+The `rmem` tool records, for every register write, the set of register
+writes from other instructions that have been read by this instruction
+at the point of performing the write. By carefully arranging the
+pseudocode of the instructions covered by the tool, we were able to make
+it so that this is exactly the set of register writes on which the write
+depends.
+
+==== Hart State
+
+The model state of a single hart comprises:
+
+* _hart_id_, a unique identifier of the hart;
+* _initial_register_state_, the initial register value for each
+register;
+* _initial_fetch_address_, the initial instruction fetch address;
+* _instruction_tree_, a tree of the instruction instances that have been
+fetched (and not discarded), in program order.
+
+==== Shared Memory State
+
+The model state of the shared memory comprises a list of memory store
+operations, in the order they propagated to the shared memory.
+
+When a store operation is propagated to the shared memory it is simply
+added to the end of the list. When a load operation is satisfied from
+memory, for each byte of the load operation, the most recent
+corresponding store slice is returned.
+
+For most purposes, it is simpler to think of the shared memory as an
+array, i.e., a map from memory locations to memory store operation
+slices, where each memory location is mapped to a one-byte slice of the
+most recent memory store operation to that location. However, this
+abstraction is not detailed enough to properly handle the `sc`
+instruction. RVWMO allows store operations from the same hart as the
+`sc` to intervene between the store operation of the `sc` and the store
+operations the paired `lr` read from.
To allow such store operations to
+intervene, and forbid others, the array abstraction must be extended to
+record more information. Here, we use a list as it is very simple, but a
+more efficient and scalable implementation should probably use
+something better.
+
+[[sec:omm:transitions]]
+==== Transitions
+
+Each of the paragraphs below describes a single kind of system
+transition. The description starts with a condition over the current
+system state. The transition can be taken in the current state only if
+the condition is satisfied. The condition is followed by an action that
+is applied to that state when the transition is taken, in order to
+generate the new system state.
+
+[[omm:fetch]]
+===== Fetch instruction
+
+A possible program-order-successor of instruction instance
+latexmath:[$i$] can be fetched from address _loc_ if:
+
+. it has not already been fetched, i.e., none of the immediate
+successors of latexmath:[$i$] in the hart’s _instruction_tree_ are from
+_loc_; and
+. if latexmath:[$i$]’s pseudocode has already written an address to
+_pc_, then _loc_ must be that address, otherwise _loc_ is:
+* for a conditional branch, the successor address or the branch target
+address;
+* for a (direct) jump and link instruction (`jal`), the target address;
+* for an indirect jump instruction (`jalr`), any address; and
+* for any other instruction, latexmath:[$i.\textit{program\_loc}+4$].
+
+Action: construct a freshly initialized instruction instance
+latexmath:[$i'$] for the instruction in the program memory at _loc_,
+with state Plain(_isa_state_), computed from the instruction pseudocode,
+including the static information available from the pseudocode such as
+its _instruction_kind_, _src_regs_, and _dst_regs_, and add
+latexmath:[$i'$] to the hart’s _instruction_tree_ as a successor of
+latexmath:[$i$].
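The fetch condition above determines a set of possible successor addresses. A minimal Python sketch of that set, assuming fixed 4-byte instructions (no compressed extension) and with all names invented for illustration:

```python
def possible_fetch_addresses(kind, program_loc, pc_written=None, branch_target=None):
    """Possible addresses for a program-order successor of an instruction.
    Once the pseudocode has written pc, only that address is allowed;
    otherwise the statically known candidates apply ("any" for jalr)."""
    if pc_written is not None:
        return {pc_written}
    if kind == "branch":          # conditional branch: fall-through or taken target
        return {program_loc + 4, branch_target}
    if kind == "jal":             # direct jump: target only
        return {branch_target}
    if kind == "jalr":            # indirect jump: in principle, any address
        return "any"
    return {program_loc + 4}      # any other instruction: next sequential address

assert possible_fetch_addresses("branch", 0x1000, branch_target=0x2000) == {0x1004, 0x2000}
assert possible_fetch_addresses("jalr", 0x1000, pc_written=0x8000) == {0x8000}
```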
+
+The possible next fetch addresses (_loc_) are available immediately
+after fetching latexmath:[$i$], and the model does not need to wait for
+the pseudocode to write to _pc_; this allows out-of-order execution, and
+speculation past conditional branches and jumps. For most instructions
+these addresses are easily obtained from the instruction pseudocode. The
+only exception is the indirect jump instruction (`jalr`), where the
+address depends on the value held in a register. In principle the
+mathematical model should allow speculation to arbitrary addresses here.
+The exhaustive search in the `rmem` tool handles this by running the
+search multiple times with a growing set of possible next fetch
+addresses for each indirect jump. The initial search uses empty sets,
+hence there is no fetch after an indirect jump instruction until the
+pseudocode of the instruction writes to _pc_, and then we use that value
+for fetching the next instruction. Before starting the next iteration of
+the exhaustive search, we collect for each indirect jump (grouped by
+code location) the set of values it wrote to _pc_ in all the executions
+of the previous search iteration, and use that as the set of possible
+next fetch addresses of the instruction. This process terminates when no
+new fetch addresses are detected.
+
+[[omm:initiate_load]]
+===== Initiate memory load operations
+
+An instruction instance latexmath:[$i$] in state Plain(Load_mem(_kind_,
+_address_, _size_, _load_continuation_)) can always initiate the
+corresponding memory load operations. Action:
+
+. Construct the appropriate memory load operations latexmath:[$mlos$]:
+* if _address_ is aligned to _size_ then latexmath:[$mlos$] is a single
+memory load operation of _size_ bytes from _address_;
+* otherwise, latexmath:[$mlos$] is a set of _size_ memory load
+operations, each of one byte, from the addresses
+latexmath:[$\textit{address}\ldots\textit{address}+\textit{size}-1$].
+. set _mem_loads_ of latexmath:[$i$] to latexmath:[$mlos$]; and
+. update the state of latexmath:[$i$] to
+Pending_mem_loads(_load_continuation_).
+
+In <<sec:rvwmo:primitives>> it is said that
+misaligned memory accesses may be decomposed at any granularity. Here we
+decompose them into one-byte accesses as this granularity subsumes all
+others.
+
+[[omm:sat_by_forwarding]]
+===== Satisfy memory load operation by forwarding from unpropagated stores
+
+For a non-AMO load instruction instance latexmath:[$i$] in state
+Pending_mem_loads(_load_continuation_), and a memory load operation
+latexmath:[$mlo$] in latexmath:[$i.\textit{mem\_loads}$] that has
+unsatisfied slices, the memory load operation can be partially or
+entirely satisfied by forwarding from unpropagated memory store
+operations by store instruction instances that are program-order-before
+latexmath:[$i$] if:
+
+. all program-order-previous `fence` instructions with `.sr` and `.pw`
+set are finished;
+. for every program-order-previous `fence` instruction, latexmath:[$f$],
+with `.sr` and `.pr` set, and `.pw` not set, if latexmath:[$f$] is not
+finished then all load instructions that are program-order-before
+latexmath:[$f$] are entirely satisfied;
+. for every program-order-previous `fence.tso` instruction,
+latexmath:[$f$], that is not finished, all load instructions that are
+program-order-before latexmath:[$f$] are entirely satisfied;
+. if latexmath:[$i$] is a load-acquire-RCsc, all program-order-previous
+store-releases-RCsc are finished;
+. if latexmath:[$i$] is a load-acquire-release, all
+program-order-previous instructions are finished;
+. all non-finished program-order-previous load-acquire instructions are
+entirely satisfied; and
+. all program-order-previous store-acquire-release instructions are
+finished.
+
+Let latexmath:[$msoss$] be the set of all unpropagated memory store
+operation slices from non-`sc` store instruction instances that are
+program-order-before latexmath:[$i$] and have already calculated the
+value to be stored, that overlap with the unsatisfied slices of
+latexmath:[$mlo$], and which are not superseded by intervening store
+operations or store operations that are read from by an intervening
+load. The last condition requires, for each memory store operation slice
+latexmath:[$msos$] in latexmath:[$msoss$] from instruction
+latexmath:[$i'$]:
+
+* that there is no store instruction program-order-between
+latexmath:[$i$] and latexmath:[$i'$] with a memory store operation
+overlapping latexmath:[$msos$]; and
+* that there is no load instruction program-order-between
+latexmath:[$i$] and latexmath:[$i'$] that was satisfied from an
+overlapping memory store operation slice from a different hart.
+
+Action:
+
+. update latexmath:[$i.\textit{mem\_loads}$] to indicate that
+latexmath:[$mlo$] was satisfied by latexmath:[$msoss$]; and
+. restart any speculative instructions which have violated coherence as
+a result of this, i.e., for every non-finished instruction
+latexmath:[$i'$] that is a program-order-successor of latexmath:[$i$],
+and every memory load operation latexmath:[$mlo'$] of latexmath:[$i'$]
+that was satisfied from latexmath:[$msoss'$], if there exists a memory
+store operation slice latexmath:[$msos'$] in latexmath:[$msoss'$], and
+an overlapping memory store operation slice from a different memory
+store operation in latexmath:[$msoss$], and latexmath:[$msos'$] is not
+from an instruction that is a program-order-successor of
+latexmath:[$i$], restart latexmath:[$i'$] and its _restart-dependents_.
+
+Where the _restart-dependents_ of instruction latexmath:[$j$] are:
+
+* program-order-successors of latexmath:[$j$] that have a data-flow
+dependency on a register write of latexmath:[$j$];
+* program-order-successors of latexmath:[$j$] that have a memory load
+operation that reads from a memory store operation of latexmath:[$j$]
+(by forwarding);
+* if latexmath:[$j$] is a load-acquire, all the program-order-successors
+of latexmath:[$j$];
+* if latexmath:[$j$] is a load, for every `fence`, latexmath:[$f$], with
+`.sr` and `.pr` set, and `.pw` not set, that is a
+program-order-successor of latexmath:[$j$], all the load instructions
+that are program-order-successors of latexmath:[$f$];
+* if latexmath:[$j$] is a load, for every `fence.tso`, latexmath:[$f$],
+that is a program-order-successor of latexmath:[$j$], all the load
+instructions that are program-order-successors of latexmath:[$f$]; and
+* (recursively) all the restart-dependents of all the instruction
+instances above.
+
+Forwarding memory store operations to a memory load might satisfy only
+some slices of the load, leaving other slices unsatisfied.
+
+A program-order-previous store operation that was not available when
+taking the transition above might make latexmath:[$msoss$] provisionally
+unsound (violating coherence) when it becomes available. That store will
+prevent the load from being finished (see _Finish instruction_), and
+will cause it to restart when that store operation is propagated (see
+_Propagate store operation_).
+
+A consequence of the transition condition above is that
+store-release-RCsc memory store operations cannot be forwarded to
+load-acquire-RCsc instructions: latexmath:[$msoss$] does not include
+memory store operations from finished stores (as those must be
+propagated memory store operations), and the condition above requires
+all program-order-previous store-releases-RCsc to be finished when the
+load is acquire-RCsc.
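Setting the fence and acquire/release conditions aside entirely, the byte-level bookkeeping behind latexmath:[$msoss$] can be sketched in Python: each byte of the load footprint is forwarded from the latest program-order-previous unpropagated store that wrote it and already has a value, and a byte overlapped by a program-order-later store whose value is not yet known cannot be forwarded. All names are invented for this sketch.

```python
def forward_slices(load_addr, load_size, older_stores):
    """older_stores: program-order list (oldest first) of unpropagated stores,
    each {"addr": int, "size": int, "value_known": bool}. Returns the bytes
    that can be forwarded (byte address -> index of the forwarding store) and
    the bytes that remain unsatisfied. Greatly simplified: fence/acquire
    conditions and reads by intervening loads are ignored."""
    footprint = range(load_addr, load_addr + load_size)
    satisfied = {}
    for idx, st in enumerate(older_stores):
        for b in range(st["addr"], st["addr"] + st["size"]):
            if b not in footprint:
                continue
            if st["value_known"]:
                satisfied[b] = idx        # a later store supersedes earlier ones
            else:
                satisfied.pop(b, None)    # valueless intervening store blocks forwarding
    unsatisfied = [b for b in footprint if b not in satisfied]
    return satisfied, unsatisfied
```

For example, a 4-byte load at address 0 behind a 2-byte store to 0 and a 1-byte store to 1 can forward bytes 0 and 1 but leaves bytes 2 and 3 unsatisfied.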
+
+[[omm:sat_from_mem]]
+===== Satisfy memory load operation from memory
+
+For an instruction instance latexmath:[$i$] of a non-AMO load
+instruction, or of an AMO instruction in the context of the _Satisfy,
+commit and propagate operations of an AMO_ transition, any memory load
+operation latexmath:[$mlo$] in latexmath:[$i.\textit{mem\_loads}$] that
+has unsatisfied slices can be satisfied from memory if all the
+conditions of _Satisfy memory load operation by forwarding from
+unpropagated stores_ are satisfied. Action: let latexmath:[$msoss$] be
+the memory store operation slices from memory covering the unsatisfied
+slices of latexmath:[$mlo$], and apply the action of _Satisfy memory
+load operation by forwarding from unpropagated stores_.
+
+Note that _Satisfy memory load operation by forwarding from unpropagated
+stores_ might leave some slices of the memory load operation
+unsatisfied; those will have to be satisfied by taking the transition
+again, or by taking _Satisfy memory load operation from memory_.
+_Satisfy memory load operation from memory_, on the other hand, will
+always satisfy all the unsatisfied slices of the memory load operation.
+
+[[omm:complete_loads]]
+===== Complete load operations
+
+A load instruction instance latexmath:[$i$] in state
+Pending_mem_loads(_load_continuation_) can be completed (not to be
+confused with finished) if all the memory load operations
+latexmath:[$i.\textit{mem\_loads}$] are entirely satisfied (i.e. there
+are no unsatisfied slices). Action: update the state of latexmath:[$i$]
+to Plain(_load_continuation(mem_value)_), where _mem_value_ is assembled
+from all the memory store operation slices that satisfied
+latexmath:[$i.\textit{mem\_loads}$].
+
+[[omm:early_sc_fail]]
+===== Early `sc` fail
+
+An `sc` instruction instance latexmath:[$i$] in state
+Plain(Early_sc_fail(_res_continuation_)) can always be made to fail.
+Action: update the state of latexmath:[$i$] to
+Plain(_res_continuation(false)_).
+
+[[omm:paired_sc]]
+===== Paired `sc`
+
+An `sc` instruction instance latexmath:[$i$] in state
+Plain(Early_sc_fail(_res_continuation_)) can continue its (potentially
+successful) execution if latexmath:[$i$] is paired with an `lr`. Action:
+update the state of latexmath:[$i$] to Plain(_res_continuation(true)_).
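The satisfy-from-memory rule reads each byte of the load footprint from the most recently propagated store slice covering it. A hypothetical sketch against the list-of-stores memory representation (names invented):

```python
def satisfy_from_memory(memory, load_addr, load_size):
    """memory: list of (addr, data_bytes) store operations, in propagation
    order. Returns, per byte of the load footprint, (store_index, byte_value)
    from the most recently propagated store writing that byte, or None if no
    propagated store has written it."""
    result = {b: None for b in range(load_addr, load_addr + load_size)}
    for idx, (addr, data) in enumerate(memory):
        for off, byte in enumerate(data):
            b = addr + off
            if b in result:
                result[b] = (idx, byte)   # later propagations overwrite earlier ones
    return result
```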
+ +[[omm:initiate_store_footprint]] +===== Initiate memory store operation footprints + +An instruction instance latexmath:[$i$] in state Plain(Store_ea(_kind_, +_address_, _size_, _next_state_)) can always announce its pending memory +store operation footprint. Action: + +. construct the appropriate memory store operations latexmath:[$msos$] +(without the store value): +* if _address_ is aligned to _size_ then latexmath:[$msos$] is a single +memory store operation of _size_ bytes to _address_; +* otherwise, latexmath:[$msos$] is a set of _size_ memory store +operations, each of one-byte size, to the addresses +latexmath:[$\textit{address}\ldots\textit{address}+\textit{size}-1$]. +. set latexmath:[$i.\textit{mem\_stores}$] to latexmath:[$msos$]; and +. update the state of latexmath:[$i$] to Plain(_next_state_). + +Note that after taking the transition above the memory store operations +do not yet have their values. The importance of splitting this +transition from the transition below is that it allows other +program-order-successor store instructions to observe the memory +footprint of this instruction, and if they don’t overlap, propagate out +of order as early as possible (i.e. before the data register value +becomes available). + +[[omm:instantiate_store_value]] +===== Instantiate memory store operation values + +An instruction instance latexmath:[$i$] in state +Plain(Store_memv(_mem_value_, _store_continuation_)) can always +instantiate the values of the memory store operations +latexmath:[$i.\textit{mem\_stores}$]. Action: + +. split _mem_value_ between the memory store operations +latexmath:[$i.\textit{mem\_stores}$]; and +. update the state of latexmath:[$i$] to +Pending_mem_stores(_store_continuation_). 
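The footprint construction used by both the load and store initiation transitions (a single operation when the address is size-aligned, otherwise one single-byte operation per address) can be sketched as a small Python helper; the name is invented for this sketch.

```python
def footprint_ops(address, size):
    """Split a memory access into operations: one operation of `size` bytes
    when the address is size-aligned, else `size` one-byte operations (the
    finest granularity, which subsumes all others)."""
    if address % size == 0:
        return [(address, size)]
    return [(a, 1) for a in range(address, address + size)]

assert footprint_ops(0x1000, 4) == [(0x1000, 4)]
assert footprint_ops(0x1001, 2) == [(0x1001, 1), (0x1002, 1)]
```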
+
+[[omm:commit_stores]]
+===== Commit store instruction
+
+An uncommitted instruction instance latexmath:[$i$] of a non-`sc` store
+instruction, or of an `sc` instruction in the context of the _Commit and
+propagate store operation of an `sc`_ transition, in state
+Pending_mem_stores(_store_continuation_), can be committed (not to be
+confused with propagated) if:
+
+. latexmath:[$i$] has fully determined data;
+. all program-order-previous conditional branch and indirect jump
+instructions are finished;
+. all program-order-previous `fence` instructions with `.sw` set are
+finished;
+. all program-order-previous `fence.tso` instructions are finished;
+. all program-order-previous load-acquire instructions are finished;
+. all program-order-previous store-acquire-release instructions are
+finished;
+. if latexmath:[$i$] is a store-release, all program-order-previous
+instructions are finished;
+. [[omm:commit_store:prev_addrs]]all
+program-order-previous memory access instructions have a fully
+determined memory footprint;
+. [[omm:commit_store:prev_stores]]all
+program-order-previous store instructions, except for `sc` that failed,
+have initiated and so have non-empty _mem_stores_; and
+. [[omm:commit_store:prev_loads]]all
+program-order-previous load instructions have initiated and so have
+non-empty _mem_loads_.
+
+Action: record that latexmath:[$i$] is committed.
+
+Notice that if condition <<omm:commit_store:prev_addrs>> is satisfied,
+the conditions <<omm:commit_store:prev_stores>> and
+<<omm:commit_store:prev_loads>> are also satisfied, or will be satisfied
+after taking some eager transitions. Hence, requiring them does not
+strengthen the model.
By requiring them,
+we guarantee that previous memory access instructions have taken enough
+transitions to make their memory operations visible for the condition
+check of <<omm:prop_store>>, which is the next transition the
+instruction will take, making that condition simpler.
+
+[[omm:prop_store]]
+===== Propagate store operation
+
+For a committed instruction instance latexmath:[$i$] in state
+Pending_mem_stores(_store_continuation_), and an unpropagated memory
+store operation latexmath:[$mso$] in
+latexmath:[$i.\textit{mem\_stores}$], latexmath:[$mso$] can be
+propagated if:
+
+. all memory store operations of program-order-previous store
+instructions that overlap with latexmath:[$mso$] have already
+propagated;
+. all memory load operations of program-order-previous load instructions
+that overlap with latexmath:[$mso$] have already been satisfied, and
+(the load instructions) are _non-restartable_ (see definition below);
+and
+. all memory load operations that were satisfied by forwarding
+latexmath:[$mso$] are entirely satisfied.
+
+Where a non-finished instruction instance latexmath:[$j$] is
+_non-restartable_ if:
+
+. there does not exist a store instruction latexmath:[$s$] and an
+unpropagated memory store operation latexmath:[$mso$] of latexmath:[$s$]
+such that applying the action of the <<omm:prop_store>> transition to
+latexmath:[$mso$] will result in the restart of latexmath:[$j$]; and
+. there does not exist a non-finished load instruction latexmath:[$l$]
+and a memory load operation latexmath:[$mlo$] of latexmath:[$l$] such
+that applying the action of a satisfy-memory-load transition (by
+forwarding from unpropagated stores or from memory, even if
+latexmath:[$mlo$] is already satisfied) to latexmath:[$mlo$] will result
+in the restart of latexmath:[$j$].
+
+Action:
+
+. update the shared memory state with latexmath:[$mso$];
+. update latexmath:[$i.\textit{mem\_stores}$] to indicate that
+latexmath:[$mso$] was propagated; and
+.
restart any speculative instructions which have violated coherence as
+a result of this, i.e., for every non-finished instruction
+latexmath:[$i'$] program-order-after latexmath:[$i$] and every memory
+load operation latexmath:[$mlo'$] of latexmath:[$i'$] that was satisfied
+from latexmath:[$msoss'$], if there exists a memory store operation
+slice latexmath:[$msos'$] in latexmath:[$msoss'$] that overlaps with
+latexmath:[$mso$] and is not from latexmath:[$mso$], and
+latexmath:[$msos'$] is not from a program-order-successor of
+latexmath:[$i$], restart latexmath:[$i'$] and its _restart-dependents_.
+
+[[omm:commit_sc]]
+===== Commit and propagate store operation of an `sc`
+
+An uncommitted `sc` instruction instance latexmath:[$i$], from hart
+latexmath:[$h$], in state Pending_mem_stores(_store_continuation_), with
+a paired `lr` latexmath:[$i'$] that has been satisfied by some store
+slices latexmath:[$msoss$], can be committed and propagated at the same
+time if:
+
+. latexmath:[$i'$] is finished;
+. every memory store operation that has been forwarded to
+latexmath:[$i'$] is propagated;
+. the conditions of <<omm:commit_stores>> are satisfied;
+. the conditions of <<omm:prop_store>> are satisfied (notice that an
+`sc` instruction can only have one memory store operation); and
+. for every store slice latexmath:[$msos$] from latexmath:[$msoss$],
+latexmath:[$msos$] has not been overwritten, in the shared memory, by a
+store that is from a hart that is not latexmath:[$h$], at any point
+since latexmath:[$msos$] was propagated to memory.
+
+Action:
+
+. apply the actions of <<omm:commit_stores>>; and
+. apply the action of <<omm:prop_store>>.
+
+[[omm:late_sc_fail]]
+===== Late `sc` fail
+
+An `sc` instruction instance latexmath:[$i$] in state
+Pending_mem_stores(_store_continuation_), that has not propagated its
+memory store operation, can always be made to fail. Action:
+
+. clear latexmath:[$i.\textit{mem\_stores}$]; and
+. update the state of latexmath:[$i$] to
+Plain(_store_continuation(false)_).
+
+For efficiency, the `rmem` tool allows this transition only when it is
+not possible to take the <<omm:commit_sc>> transition. This does not
+affect the set of allowed final states, but when explored interactively,
+if the `sc` should fail one should use the early `sc` fail transition
+instead of waiting for this transition.
+
+[[omm:complete_stores]]
+===== Complete store operations
+
+A store instruction instance latexmath:[$i$] in state
+Pending_mem_stores(_store_continuation_), for which all the memory store
+operations in latexmath:[$i.\textit{mem\_stores}$] have been propagated,
+can always be completed (not to be confused with finished). Action:
+update the state of latexmath:[$i$] to
+Plain(_store_continuation(true)_).
+
+[[omm:do_amo]]
+===== Satisfy, commit and propagate operations of an AMO
+
+An AMO instruction instance latexmath:[$i$] in state
+Pending_mem_loads(_load_continuation_) can perform its memory access if
+it is possible to perform the following sequence of transitions with no
+intervening transitions:
+
+. {blank}
+. {blank}
+. (zero or more times)
+. {blank}
+. {blank}
+. {blank}
+. {blank}
+
+and in addition, the condition of <<omm:finish>>, with the exception of
+not requiring latexmath:[$i$] to be in state Plain(Done), holds after
+those transitions. Action: perform the above sequence of transitions
+(this does not include <<omm:finish>>), one after the other, with no
+intervening transitions.
+
+Notice that program-order-previous stores cannot be forwarded to the
+load of an AMO. This is simply because the sequence of transitions above
+does not include the forwarding transition. But even if it did include
+it, the sequence will fail when trying to satisfy the load from memory,
+as that transition requires all program-order-previous memory store
+operations to overlapping memory footprints to be propagated, and
+forwarding requires the store operation to be unpropagated.
+
+In addition, the store of an AMO cannot be forwarded to a
+program-order-successor load.
Before taking the transition above, the +store operation of the AMO does not have its value and therefore cannot +be forwarded; after taking the transition above the store operation is +propagated and therefore cannot be forwarded. + +[[omm:commit_fence]] +===== Commit fence + +A fence instruction instance latexmath:[$i$] in state +Plain(Fence(_kind_, _next_state_)) can be committed if: + +. if latexmath:[$i$] is a normal fence and it has `.pr` set, all +program-order-previous load instructions are finished; +. if latexmath:[$i$] is a normal fence and it has `.pw` set, all +program-order-previous store instructions are finished; and +. if latexmath:[$i$] is a `fence.tso`, all program-order-previous load +and store instructions are finished. + +Action: + +. record that latexmath:[$i$] is committed; and +. update the state of latexmath:[$i$] to Plain(_next_state_). + +[[omm:reg_read]] +===== Register read + +An instruction instance latexmath:[$i$] in state +Plain(Read_reg(_reg_name_, _read_cont_)) can do a register read of +_reg_name_ if every instruction instance that it needs to read from has +already performed the expected _reg_name_ register write. + +Let _read_sources_ include, for each bit of _reg_name_, the write to +that bit by the most recent (in program order) instruction instance that +can write to that bit, if any. If there is no such instruction, the +source is the initial register value from _initial_register_state_. Let +_reg_value_ be the value assembled from _read_sources_. Action: + +. add _reg_name_ to latexmath:[$i.\textit{reg\_reads}$] with +_read_sources_ and _reg_value_; and +. update the state of latexmath:[$i$] to Plain(_read_cont(reg_value)_). + +[[omm:reg_write]] +===== Register write + +An instruction instance latexmath:[$i$] in state +Plain(Write_reg(_reg_name_, _reg_value_, _next_state_)) can always do a +_reg_name_ register write. Action: + +. 
add _reg_name_ to latexmath:[$i.\textit{reg\_writes}$] with +latexmath:[$deps$] and _reg_value_; and +. update the state of latexmath:[$i$] to Plain(_next_state_). + +where latexmath:[$deps$] is a pair of the set of all _read_sources_ from +latexmath:[$i.\textit{reg\_reads}$], and a flag that is true iff +latexmath:[$i$] is a load instruction instance that has already been +entirely satisfied. + +[[omm:sail_interp]] +===== Pseudocode internal step + +An instruction instance latexmath:[$i$] in state +Plain(Internal(_next_state_)) can always do that pseudocode-internal +step. Action: update the state of latexmath:[$i$] to +Plain(_next_state_). + +[[omm:finish]] +===== Finish instruction + +A non-finished instruction instance latexmath:[$i$] in state Plain(Done) +can be finished if: + +. if latexmath:[$i$] is a load instruction: +.. all program-order-previous load-acquire instructions are finished; +.. all program-order-previous `fence` instructions with `.sr` set are +finished; +.. for every program-order-previous `fence.tso` instruction, +latexmath:[$f$], that is not finished, all load instructions that are +program-order-before latexmath:[$f$] are finished; and +.. it is guaranteed that the values read by the memory load operations +of latexmath:[$i$] will not cause coherence violations, i.e., for any +program-order-previous instruction instance latexmath:[$i'$], let +latexmath:[$\textit{cfp}$] be the combined footprint of propagated +memory store operations from store instructions program-order-between +latexmath:[$i$] and latexmath:[$i'$], and _fixed memory store +operations_ that were forwarded to latexmath:[$i$] from store +instructions program-order-between latexmath:[$i$] and latexmath:[$i'$] +including latexmath:[$i'$], and let +latexmath:[$\overline{\textit{cfp}}$] be the complement of +latexmath:[$\textit{cfp}$] in the memory footprint of latexmath:[$i$]. +If latexmath:[$\overline{\textit{cfp}}$] is not empty: +... 
latexmath:[$i'$] has a fully determined memory footprint;
+... latexmath:[$i'$] has no unpropagated memory store operations that
+overlap with latexmath:[$\overline{\textit{cfp}}$]; and
+... if latexmath:[$i'$] is a load with a memory footprint that overlaps
+with latexmath:[$\overline{\textit{cfp}}$], then all the memory load
+operations of latexmath:[$i'$] that overlap with
+latexmath:[$\overline{\textit{cfp}}$] are satisfied and latexmath:[$i'$]
+is _non-restartable_ (see the <<omm:prop_store>> transition for how to
+determine whether an instruction is non-restartable).
++
+Here, a memory store operation is called fixed if the store instruction
+has fully determined data.
+. latexmath:[$i$] has fully determined data; and
+. if latexmath:[$i$] is not a fence, all program-order-previous
+conditional branch and indirect jump instructions are finished.
+
+Action:
+
+. if latexmath:[$i$] is a conditional branch or indirect jump
+instruction, discard any untaken paths of execution, i.e., remove all
+instruction instances that are not reachable by the branch/jump taken in
+_instruction_tree_; and
+. record the instruction as finished, i.e., set _finished_ to _true_.
+
+[[sec:omm:limitations]]
+==== Limitations
+
+* The model covers user-level RV64I and RV64A. In particular, it does
+not support the misaligned atomics extension `Zam` or the total store
+ordering extension `Ztso`. It should be trivial to adapt the model to
+RV32I/A and to the G, Q and C extensions, but we have never tried it.
+This will involve, mostly, writing Sail code for the instructions, with
+minimal, if any, changes to the concurrency model.
+* The model covers only normal memory accesses (it does not handle I/O
+accesses).
+* The model does not cover TLB-related effects.
+* The model assumes the instruction memory is fixed. In particular, the
+fetch transition does not generate memory load operations, and the
+shared memory is not involved in fetching.
Instead, the model depends on
+an external oracle that provides an opcode when given a memory location.
+* The model does not cover exceptions, traps and interrupts.
diff --git a/src/mm-herd.adoc
new file mode 100644
index 0000000..f1c0fd8
--- /dev/null
+++ b/src/mm-herd.adoc
@@ -0,0 +1,155 @@
+[[sec:herd]]
+== Formal Axiomatic Specification in Herd
+
+The tool [.sans-serif]#herd# takes a memory model and a litmus test as
+input and simulates the execution of the test on top of the memory
+model. Memory models are written in the domain-specific language Cat.
+This section provides two Cat memory models of RVWMO. The first model,
+shown first below, follows the _global memory order_ definition of
+RVWMO from the memory consistency model chapter, as much as is possible
+for a Cat model. The second model, shown last below, is an equivalent,
+more efficient, partial-order-based RVWMO model.
+
+The simulator [.sans-serif]#herd# is part of the [.sans-serif]#diy# tool
+suite — see http://diy.inria.fr for software and documentation. The
+models and more are available online
+at http://diy.inria.fr/cats7/riscv/.
+
+....
+(*************) +(* Utilities *) +(*************) + +(* All fence relations *) +let fence.r.r = [R];fencerel(Fence.r.r);[R] +let fence.r.w = [R];fencerel(Fence.r.w);[W] +let fence.r.rw = [R];fencerel(Fence.r.rw);[M] +let fence.w.r = [W];fencerel(Fence.w.r);[R] +let fence.w.w = [W];fencerel(Fence.w.w);[W] +let fence.w.rw = [W];fencerel(Fence.w.rw);[M] +let fence.rw.r = [M];fencerel(Fence.rw.r);[R] +let fence.rw.w = [M];fencerel(Fence.rw.w);[W] +let fence.rw.rw = [M];fencerel(Fence.rw.rw);[M] +let fence.tso = + let f = fencerel(Fence.tso) in + ([W];f;[W]) | ([R];f;[M]) + +let fence = + fence.r.r | fence.r.w | fence.r.rw | + fence.w.r | fence.w.w | fence.w.rw | + fence.rw.r | fence.rw.w | fence.rw.rw | + fence.tso + +(* Same address, no W to the same address in-between *) +let po-loc-no-w = po-loc \ (po-loc?;[W];po-loc) +(* Read same write *) +let rsw = rf^-1;rf +(* Acquire, or stronger *) +let AQ = Acq|AcqRel +(* Release or stronger *) +and RL = RelAcqRel +(* All RCsc *) +let RCsc = Acq|Rel|AcqRel +(* Amo events are both R and W, relation rmw relates paired lr/sc *) +let AMO = R & W +let StCond = range(rmw) + +(*************) +(* ppo rules *) +(*************) + +(* Overlapping-Address Orderings *) +let r1 = [M];po-loc;[W] +and r2 = ([R];po-loc-no-w;[R]) \ rsw +and r3 = [AMO|StCond];rfi;[R] +(* Explicit Synchronization *) +and r4 = fence +and r5 = [AQ];po;[M] +and r6 = [M];po;[RL] +and r7 = [RCsc];po;[RCsc] +and r8 = rmw +(* Syntactic Dependencies *) +and r9 = [M];addr;[M] +and r10 = [M];data;[W] +and r11 = [M];ctrl;[W] +(* Pipeline Dependencies *) +and r12 = [R];(addr|data);[W];rfi;[R] +and r13 = [R];addr;[M];po;[W] + +let ppo = r1 | r2 | r3 | r4 | r5 | r6 | r7 | r8 | r9 | r10 | r11 | r12 | r13 +.... + +` ` + +.... 
+Total + +(* Notice that herd has defined its own rf relation *) + +(* Define ppo *) +include "riscv-defs.cat" + +(********************************) +(* Generate global memory order *) +(********************************) + +let gmo0 = (* precursor: ie build gmo as an total order that include gmo0 *) + loc & (W\FW) * FW | # Final write after any write to the same location + ppo | # ppo compatible + rfe # includes herd external rf (optimization) + +(* Walk over all linear extensions of gmo0 *) +with gmo from linearizations(M\IW,gmo0) + +(* Add initial writes upfront -- convenient for computing rfGMO *) +let gmo = gmo | loc & IW * (M\IW) + +(**********) +(* Axioms *) +(**********) + +(* Compute rf according to the load value axiom, aka rfGMO *) +let WR = loc & ([W];(gmo|po);[R]) +let rfGMO = WR \ (loc&([W];gmo);WR) + +(* Check equality of herd rf and of rfGMO *) +empty (rf\rfGMO)|(rfGMO\rf) as RfCons + +(* Atomicity axiom *) +let infloc = (gmo & loc)^-1 +let inflocext = infloc & ext +let winside = (infloc;rmw;inflocext) & (infloc;rf;rmw;inflocext) & [W] +empty winside as Atomic +.... + +` ` + +.... +Partial + +(***************) +(* Definitions *) +(***************) + +(* Define ppo *) +include "riscv-defs.cat" + +(* Compute coherence relation *) +include "cos-opt.cat" + +(**********) +(* Axioms *) +(**********) + +(* Sc per location *) +acyclic co|rf|fr|po-loc as Coherence + +(* Main model axiom *) +acyclic co|rfe|fr|ppo as Model + +(* Atomicity axiom *) +empty rmw & (fre;coe) as Atomic +.... diff --git a/src/naming.adoc b/src/naming.adoc new file mode 100644 index 0000000..2ac4465 --- /dev/null +++ b/src/naming.adoc @@ -0,0 +1,184 @@ +[[naming]] +== ISA Extension Naming Conventions + +This chapter describes the RISC-V ISA extension naming scheme that is +used to concisely describe the set of instructions present in a hardware +implementation, or the set of instructions used by an application binary +interface (ABI). 
+
+The RISC-V ISA is designed to support a wide variety of implementations
+with various experimental instruction-set extensions. We have found that
+an organized naming scheme simplifies software tools and documentation.
+
+=== Case Sensitivity
+
+The ISA naming strings are case insensitive.
+
+=== Base Integer ISA
+
+RISC-V ISA strings begin with either RV32I, RV32E, RV64I, or RV128I,
+indicating the supported address space size in bits for the base integer
+ISA.
+
+=== Instruction-Set Extension Names
+
+Standard ISA extensions are given a name consisting of a single letter.
+For example, the first four standard extensions to the integer bases
+are: `M` for integer multiplication and division, `A` for atomic
+memory instructions, `F` for single-precision floating-point
+instructions, and `D` for double-precision floating-point
+instructions. Any RISC-V instruction-set variant can be succinctly
+described by concatenating the base integer prefix with the names of the
+included extensions, e.g., `RV64IMAFD`.
+
+We have also defined an abbreviation `G` to represent the
+`IMAFDZicsr_Zifencei` base and extensions, as this is intended to
+represent our standard general-purpose ISA.
+
+Standard extensions to the RISC-V ISA are given other reserved letters,
+e.g., `Q` for quad-precision floating-point, or `C` for the 16-bit
+compressed instruction format.
+
+Some ISA extensions depend on the presence of other extensions, e.g.,
+`D` depends on `F` and `F` depends on `Zicsr`. These dependences
+may be implicit in the ISA name: for example, RV32IF is equivalent to
+RV32IFZicsr, and RV32ID is equivalent to RV32IFD and RV32IFDZicsr.
+
+=== Version Numbers
+
+Recognizing that instruction sets may expand or alter over time, we
+encode extension version numbers following the extension name. Version
+numbers are divided into major and minor version numbers, separated by a
+`p`. If the minor version is `0`, then `p0` can be omitted from
+the version string.
Changes in major version numbers imply a loss of
+backwards compatibility, whereas changes in only the minor version
+number must be backwards-compatible. For example, the original 64-bit
+standard ISA defined in release 1.0 of this manual can be written in
+full as `RV64I1p0M1p0A1p0F1p0D1p0`, or more concisely as
+`RV64I1M1A1F1D1`.
+
+We introduced the version numbering scheme with the second release.
+Hence, we define the default version of a standard extension to be the
+version present at that time, e.g., `RV32I` is equivalent to
+`RV32I2`.
+
+=== Underscores
+
+Underscores `_` may be used to separate ISA extensions to improve
+readability and to provide disambiguation, e.g., `RV32I2_M2_A2`.
+
+Because the `P` extension for Packed SIMD can be confused for the
+decimal point in a version number, it must be preceded by an underscore
+if it follows a number. For example, `rv32i2p2` means version 2.2 of
+RV32I, whereas `rv32i2_p2` means version 2.0 of RV32I with version 2.0
+of the P extension.
+
+=== Additional Standard Extension Names
+
+Standard extensions can also be named using a single `Z` followed by
+an alphabetical name and an optional version number. For example,
+`Zifencei` names the instruction-fetch fence extension described in
+<>; `Zifencei2` and
+`Zifencei2p0` name version 2.0 of same.
+
+The first letter following the `Z` conventionally indicates the most
+closely related alphabetical extension category, IMAFDQLCBKJTPV. For the
+`Zam` extension for misaligned atomics, for example, the letter `a`
+indicates the extension is related to the `A` standard extension. If
+multiple `Z` extensions are named, they should be ordered first by
+category, then alphabetically within a category—for example,
+`Zicsr_Zifencei_Zam`.
+
+Extensions with the `Z` prefix must be separated from other
+multi-letter extensions by an underscore, e.g.,
+`RV32IMACZicsr_Zifencei`.
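The naming rules above are mechanical enough to parse. The sketch below is illustrative only — `parse_isa` is a hypothetical helper, not part of any RISC-V toolchain — and applies the rules as described: names are case insensitive, versions are `<major>p<minor>` with `p0` implied, underscores separate extensions, and Z/S/H/X introduce multi-letter extension names:

```python
import re

def parse_isa(isa):
    """Return (extension, major, minor) triples; None means unversioned."""
    s = isa.lower()
    # The base letter (i or e) is left in place so it parses as the first
    # extension of the string.
    base = re.match(r"rv(?:32|64|128)(?=[ie])", s)
    if not base:
        raise ValueError("ISA string must begin with RV32I/RV32E/RV64I/RV128I")
    exts = []
    for tok in s[base.end():].split("_"):
        while tok:
            # Try multi-letter (z/s/h/x-prefixed) names first, then a single
            # letter; an optional version <major>[p<minor>] may follow.
            m = re.match(r"([zshx][a-z]+|[a-y])(\d+)?(?:p(\d+))?", tok)
            name, major, minor = m.groups()
            exts.append((name,
                         int(major) if major else None,
                         int(minor) if minor else (0 if major else None)))
            tok = tok[m.end():]
    return exts
```

For example, `parse_isa("rv32i2_p2")` separates version 2.0 of RV32I from version 2.0 of P, while `parse_isa("rv32i2p2")` yields a single entry for version 2.2 of RV32I, matching the underscore rule above.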
+ +=== Supervisor-level Instruction-Set Extensions + +Standard supervisor-level instruction-set extensions are defined in +Volume II, but are named using `S` as a prefix, followed by an +alphabetical name and an optional version number. Supervisor-level +extensions must be separated from other multi-letter extensions by an +underscore. + +Standard supervisor-level extensions should be listed after standard +unprivileged extensions. If multiple supervisor-level extensions are +listed, they should be ordered alphabetically. + +=== Hypervisor-level Instruction-Set Extensions + +Standard hypervisor-level instruction-set extensions are named like +supervisor-level extensions, but beginning with the letter `H` instead +of the letter `S`. + +Standard hypervisor-level extensions should be listed after standard +lesser-privileged extensions. If multiple hypervisor-level extensions +are listed, they should be ordered alphabetically. + +=== Machine-level Instruction-Set Extensions + +Standard machine-level instruction-set extensions are prefixed with the +three letters `Zxm`. + +Standard machine-level extensions should be listed after standard +lesser-privileged extensions. If multiple machine-level extensions are +listed, they should be ordered alphabetically. + +=== Non-Standard Extension Names + +Non-standard extensions are named using a single `X` followed by an +alphabetical name and an optional version number. For example, +`Xhwacha` names the Hwacha vector-fetch ISA extension; `Xhwacha2` +and `Xhwacha2p0` name version 2.0 of same. + +Non-standard extensions must be listed after all standard extensions. +They must be separated from other multi-letter extensions by an +underscore. For example, an ISA with non-standard extensions Argle and +Bargle may be named `RV64IZifencei_Xargle_Xbargle`. + +If multiple non-standard extensions are listed, they should be ordered +alphabetically. + +=== Subset Naming Convention + +<> summarizes the standardized extension +names.   
+
+
+[[isanametable]]
+.Standard ISA extension names. The table also defines the canonical
+order in which extension names must appear in the name string, with
+top-to-bottom in table indicating first-to-last in the name string,
+e.g., RV32IMACV is legal, whereas RV32IMAVC is not.
+[cols="<,^,^",options="header",]
+|===
+|Subset |Name |Implies
+|Base ISA | |
+|Integer |I |
+|Reduced Integer |E |
+|Standard Unprivileged Extensions | |
+|Integer Multiplication and Division |M |
+|Atomics |A |
+|Single-Precision Floating-Point |F |Zicsr
+|Double-Precision Floating-Point |D |F
+|General |G |IMAFDZicsr_Zifencei
+|Quad-Precision Floating-Point |Q |D
+|16-bit Compressed Instructions |C |
+|Bit Manipulation |B |
+|Cryptography Extensions |K |
+|Dynamic Languages |J |
+|Packed-SIMD Extensions |P |
+|Vector Extensions |V |
+|Control and Status Register Access |Zicsr |
+|Instruction-Fetch Fence |Zifencei |
+|Misaligned Atomics |Zam |A
+|Total Store Ordering |Ztso |
+|Standard Supervisor-Level Extensions | |
+|Supervisor-level extension `def` |Sdef |
+|Standard Hypervisor-Level Extensions | |
+|Hypervisor-level extension `ghi` |Hghi |
+|Standard Machine-Level Extensions | |
+|Machine-level extension `jkl` |Zxmjkl |
+|Non-Standard Extensions | |
+|Non-standard extension `mno` |Xmno |
+|===
+
diff --git a/src/p-st-ext.adoc
new file mode 100644
index 0000000..79a070a
--- /dev/null
+++ b/src/p-st-ext.adoc
@@ -0,0 +1,10 @@
+[[packedsimd]]
+== `P` Standard Extension for Packed-SIMD Instructions, Version 0.2
+
+Discussions at the 5th RISC-V workshop indicated a desire to drop this
+packed-SIMD proposal for floating-point registers in favor of
+standardizing on the V extension for large floating-point SIMD
+operations. However, there was interest in packed-SIMD fixed-point
+operations for use in the integer registers of small RISC-V
+implementations. A task group is working to define the new P extension.
+
+
diff --git a/src/q-st-ext.adoc
new file mode 100644
index 0000000..02122b7
--- /dev/null
+++ b/src/q-st-ext.adoc
@@ -0,0 +1,148 @@
+== `Q` Standard Extension for Quad-Precision Floating-Point, Version 2.2
+
+This chapter describes the Q standard extension for 128-bit
+quad-precision binary floating-point instructions compliant with the
+IEEE 754-2008 arithmetic standard. The quad-precision binary
+floating-point instruction-set extension is named `Q`; it depends on
+the double-precision floating-point extension D. The floating-point
+registers are now extended to hold either a single, double, or
+quad-precision floating-point value (FLEN=128). The NaN-boxing scheme
+described in <> is now extended
+recursively to allow a single-precision value to be NaN-boxed inside a
+double-precision value which is itself NaN-boxed inside a quad-precision
+value.
+
+=== Quad-Precision Load and Store Instructions
+
+New 128-bit variants of LOAD-FP and STORE-FP instructions are added,
+encoded with a new value for the funct3 width field.
+
+[cols="^,^,^,^,^",options="header",]
+|===
+|imm[11:0] |rs1 |width |rd |opcode
+|12 |5 |3 |5 |7
+|offset[11:0] |base |Q |dest |LOAD-FP
+|===
+
+[cols="^,^,^,^,^,^",options="header",]
+|===
+|imm[11:5] |rs2 |rs1 |width |imm[4:0] |opcode
+|7 |5 |5 |3 |5 |7
+|offset[11:5] |src |base |Q |offset[4:0] |STORE-FP
+|===
+
+FLQ and FSQ are only guaranteed to execute atomically if the effective
+address is naturally aligned and XLEN=128.
+
+FLQ and FSQ do not modify the bits being transferred; in particular, the
+payloads of non-canonical NaNs are preserved.
+
+=== Quad-Precision Computational Instructions
+
+A new supported format is added to the format field of most
+instructions, as shown in <<fpextfmt>>.
+
+[[fpextfmt]]
+.Format field encoding.
+[cols="^,^,<",options="header",]
+|===
+|_fmt_ field |Mnemonic |Meaning
+|00 |S |32-bit single-precision
+|01 |D |64-bit double-precision
+|10 |H |16-bit half-precision
+|11 |Q |128-bit quad-precision
+|===
+
+The quad-precision floating-point computational instructions are defined
+analogously to their double-precision counterparts, but operate on
+quad-precision operands and produce quad-precision results.
+
+[cols="^,^,^,^,^,^,^",options="header",]
+|===
+|funct5 |fmt |rs2 |rs1 |rm |rd |opcode
+|5 |2 |5 |5 |3 |5 |7
+|FADD/FSUB |Q |src2 |src1 |RM |dest |OP-FP
+|FMUL/FDIV |Q |src2 |src1 |RM |dest |OP-FP
+|FMIN-MAX |Q |src2 |src1 |MIN/MAX |dest |OP-FP
+|FSQRT |Q |0 |src |RM |dest |OP-FP
+|===
+
+[cols="^,^,^,^,^,^,^",options="header",]
+|===
+|rs3 |fmt |rs2 |rs1 |rm |rd |opcode
+|5 |2 |5 |5 |3 |5 |7
+|src3 |Q |src2 |src1 |RM |dest |F[N]MADD/F[N]MSUB
+|===
+
+=== Quad-Precision Convert and Move Instructions
+
+New floating-point-to-integer and integer-to-floating-point conversion
+instructions are added. These instructions are defined analogously to
+the double-precision-to-integer and integer-to-double-precision
+conversion instructions. FCVT.W.Q or FCVT.L.Q converts a quad-precision
+floating-point number to a signed 32-bit or 64-bit integer,
+respectively. FCVT.Q.W or FCVT.Q.L converts a 32-bit or 64-bit signed
+integer, respectively, into a quad-precision floating-point number.
+FCVT.WU.Q, FCVT.LU.Q, FCVT.Q.WU, and FCVT.Q.LU variants convert to or
+from unsigned integer values. FCVT.L[U].Q and FCVT.Q.L[U] are RV64-only
+instructions.
+
+[cols="^,^,^,^,^,^,^",options="header",]
+|===
+|funct5 |fmt |rs2 |rs1 |rm |rd |opcode
+|5 |2 |5 |5 |3 |5 |7
+|FCVT._int_.Q |Q |W[U]/L[U] |src |RM |dest |OP-FP
+|FCVT.Q._int_ |Q |W[U]/L[U] |src |RM |dest |OP-FP
+|===
+
+New floating-point-to-floating-point conversion instructions are added.
+These instructions are defined analogously to the double-precision
+floating-point-to-floating-point conversion instructions.
FCVT.S.Q or
+FCVT.Q.S converts a quad-precision floating-point number to a
+single-precision floating-point number, or vice-versa, respectively.
+FCVT.D.Q or FCVT.Q.D converts a quad-precision floating-point number to
+a double-precision floating-point number, or vice-versa, respectively.
+
+[cols="^,^,^,^,^,^,^",options="header",]
+|===
+|funct5 |fmt |rs2 |rs1 |rm |rd |opcode
+|5 |2 |5 |5 |3 |5 |7
+|FCVT.S.Q |S |Q |src |RM |dest |OP-FP
+|FCVT.Q.S |Q |S |src |RM |dest |OP-FP
+|FCVT.D.Q |D |Q |src |RM |dest |OP-FP
+|FCVT.Q.D |Q |D |src |RM |dest |OP-FP
+|===
+
+Floating-point to floating-point sign-injection instructions, FSGNJ.Q,
+FSGNJN.Q, and FSGNJX.Q, are defined analogously to the double-precision
+sign-injection instruction.
+
+[cols="^,^,^,^,^,^,^",options="header",]
+|===
+|funct5 |fmt |rs2 |rs1 |rm |rd |opcode
+|5 |2 |5 |5 |3 |5 |7
+|FSGNJ |Q |src2 |src1 |J[N]/JX |dest |OP-FP
+|===
+
+FMV.X.Q and FMV.Q.X instructions are not provided in RV32 or RV64, so
+quad-precision bit patterns must be moved to the integer registers via
+memory.
+
+RV128 will support FMV.X.Q and FMV.Q.X in the Q extension.
+
+=== Quad-Precision Floating-Point Compare Instructions
+
+The quad-precision floating-point compare instructions are defined
+analogously to their double-precision counterparts, but operate on
+quad-precision operands.
+
+[cols="^,^,^,^,^,^,^",options="header",]
+|===
+|funct5 |fmt |rs2 |rs1 |funct3 |rd |opcode
+|5 |2 |5 |5 |3 |5 |7
+|FCMP |Q |src2 |src1 |EQ/LT/LE |dest |OP-FP
+|===
+
+=== Quad-Precision Floating-Point Classify Instruction
+
+The quad-precision floating-point classify instruction, FCLASS.Q, is
+defined analogously to its double-precision counterpart, but operates on
+quad-precision operands.
+
+[cols="^,^,^,^,^,^,^",options="header",]
+|===
+|funct5 |fmt |rs2 |rs1 |funct3 |rd |opcode
+|5 |2 |5 |5 |3 |5 |7
+|FCLASS |Q |0 |src |001 |dest |OP-FP
+|===
+
diff --git a/src/riscv-isa-unpriv-conv-review.adoc
new file mode 100644
index 0000000..2f3c349
--- /dev/null
+++ b/src/riscv-isa-unpriv-conv-review.adoc
@@ -0,0 +1,139 @@
+[[risc-v-isa]]
+= The RISC-V Instruction Set Manual
+:description: Volume I: Unprivileged ISA
+:company: RISC-V.org
+//:authors: Editors: Andrew Waterman, Krste Asanovic, SiFive, Inc., CS Division, EECS Department, University of California, Berkeley
+:revdate: 09/2021
+:revnumber: Convert pre
+:revremark: Pre-release version
+//development: assume everything can change
+//stable: assume everything could change
+//frozen: if you implement this version you assume the risk that something might change because of the public review cycle but we expect little to no change.
+//ratified: you can implement this and be assured nothing will change. if something needs to change due to an errata or enhancement, it will come out in a new extension. we do not revise extensions.
+:url-riscv: http://riscv.org
+:doctype: book
+//:doctype: report
+:colophon:
+:preface-title: Preamble
+:appendix-caption: Appendix
+:imagesdir: images
+:title-logo-image: image:risc-v_logo.png[pdfwidth=3.25in,align=center]
+:page-background-image: image:draft.png[opacity=20%]
+//:title-page-background-image: none
+:back-cover-image: image:backpage.png[opacity=25%]
+// Settings:
+:experimental:
+:reproducible:
+:imagesoutdir: images
+:bibtex-file: resources/riscv-spec.bib
+:bibtex-order: alphabetical
+:bibtex-style: apa
+:icons: font
+:lang: en
+:listing-caption: Listing
+:sectnums:
+:toc: left
+:toclevels: 4
+:source-highlighter: pygments
+ifdef::backend-pdf[]
+:source-highlighter: coderay
+endif::[]
+:table-caption: Table
+:figure-caption: Figure
+:xrefstyle: full
+:chapter-refsig: Chapter
+:section-refsig: Section
+:appendix-refsig: Appendix
+:data-uri:
+:hide-uri-scheme:
+:stem: latexmath
+:footnote:
+
+_Contributors to all versions of the spec in alphabetical order (please contact editors to suggest
+corrections): Arvind, Krste Asanović, Rimas Avižienis, Jacob Bachmeyer, Christopher F. Batten,
+Allen J.
Baum, Alex Bradbury, Scott Beamer, Preston Briggs, Christopher Celio, Chuanhua
+Chang, David Chisnall, Paul Clayton, Palmer Dabbelt, Ken Dockser, Roger Espasa, Greg Favor,
+Shaked Flur, Stefan Freudenberger, Marc Gauthier, Andy Glew, Jan Gray, Michael Hamburg, John
+Hauser, David Horner, Bruce Hoult, Bill Huffman, Alexandre Joannou, Olof Johansson, Ben Keller,
+David Kruckemyer, Yunsup Lee, Paul Loewenstein, Daniel Lustig, Yatin Manerkar, Luc Maranget,
+Margaret Martonosi, Joseph Myers, Vijayanand Nagarajan, Rishiyur Nikhil, Jonas Oberhauser,
+Stefan O’Rear, Albert Ou, John Ousterhout, David Patterson, Christopher Pulte, Jose Renau,
+Josh Scheid, Colin Schmidt, Peter Sewell, Susmit Sarkar, Michael Taylor, Wesley Terpstra, Matt
+Thomas, Tommy Thorn, Caroline Trippel, Ray VanDeWalker, Muralidaran Vijayaraghavan, Megan
+Wachs, Andrew Waterman, Robert Watson, Derek Williams, Andrew Wright, Reinoud Zandijk,
+and Sizhuo Zhang._
+
+_This document is released under a Creative Commons Attribution 4.0 International License._
+
+_This document is a derivative of “The RISC-V Instruction Set Manual, Volume I: User-Level ISA
+Version 2.1” released under the following license: ©2010–2017 Andrew Waterman, Yunsup Lee,
+David Patterson, Krste Asanović. Creative Commons Attribution 4.0 International License.
+Please cite as: “The RISC-V Instruction Set Manual, Volume I: User-Level ISA, Document
+Version 20191214-draft”, Editors Andrew Waterman and Krste Asanović, RISC-V Foundation,
+December 2019._
+
+//the colophon allows for a section after the preamble that is part of the frontmatter and therefore not assigned a page number.
+include::colophon.adoc[]
+//preface.tex
+include::intro.adoc[]
+//intro.tex
+include::rv32.adoc[]
+//rv32.tex
+include::zifencei.adoc[]
+//zfencei.tex
+include::zihintpause.adoc[]
+//zihintpause.tex
+include::rv32e.adoc[]
+//rv32e.tex
+include::rv64.adoc[]
+//rv64.tex
+include::rv128.adoc[]
+//rv128.tex
+include::m-st-ext.adoc[]
+//m.tex
+include::a-st-ext.adoc[]
+//a.tex
+include::zicsr.adoc[]
+//csr.tex
+include::counters.adoc[]
+//counters.tex
+include::f-st-ext.adoc[]
+//f.tex
+include::d-st-ext.adoc[]
+//d.tex
+include::q-st-ext.adoc[]
+//q.tex
+include::rvwmo.adoc[]
+//rvwmo.tex
+include::c-st-ext.adoc[]
+//c.tex
+include::b-st-ext.adoc[]
+//b.tex
+include::j-st-ext.adoc[]
+//j.tex
+include::p-st-ext.adoc[]
+//p.tex
+include::v-st-ext.adoc[]
+//v.tex
+include::zam-st-ext.adoc[]
+//zam.tex
+include::ztso-st-ext.adoc[]
+//ztso.tex
+include::rv-32-64g.adoc[]
+//gmaps.tex
+include::extending.adoc[]
+//extensions.tex
+include::naming.adoc[]
+//naming.tex
+include::history.adoc[]
+//history.tex
+//include::test.adoc[]
+//include::mm-eplan.adoc[]
+//memory.tex
+//include::mm-formal.adoc[]
+//end of memory.tex, memory-model-alloy.tex, memory-model-herd.tex
+include::index.adoc[]
+// this is generated from index markers.
+include::bibliography.adoc[]
+// this references the riscv-spec.bib file that has been copied into the resources directory
+
 diff --git a/src/riscv-isa-unpriv.adoc b/src/riscv-isa-unpriv.adoc new file mode 100644 index 0000000..459eb6f --- /dev/null +++ b/src/riscv-isa-unpriv.adoc @@ -0,0 +1,135 @@
+[[risc-v-isa]]
+= The RISC-V Instruction Set Manual
+:description: Volume I: Unprivileged ISA
+:company: RISC-V.org
+//:authors: Editors: Andrew Waterman, Krste Asanovic, SiFive, Inc., CS Division, EECS Department, University of California, Berkeley
+:revdate: 07/2021
+:revnumber: Version 20191214-draft
+:revremark: Pre-release version
+//development: assume everything can change
+//stable: assume everything could change
+//frozen: if you implement this version you assume the risk that something might change because of the public review cycle but we expect little to no change.
+//ratified: you can implement this and be assured nothing will change. if something needs to change due to an errata or enhancement, it will come out in a new extension. we do not revise extensions.
+:url-riscv: http://riscv.org
+:doctype: book
+//:doctype: report
+:preface-title: Preamble
+:colophon:
+:appendix-caption: Appendix
+:imagesdir: images
+:title-logo-image: image:risc-v_logo.png[pdfwidth=3.25in,align=center]
+//:page-background-image: image:draft.svg[opacity=20%]
+//:title-page-background-image: none
+:back-cover-image: image:backpage.png[opacity=25%]
+// Settings:
+:experimental:
+:reproducible:
+// needs to be changed?
bug discussion started
+//:WaveDromEditorApp: app/wavedrom-editor.app
+:imagesoutdir: images
+:bibtex-file: resources/riscv-spec.bib
+:bibtex-order: alphabetical
+:bibtex-style: ieee
+:icons: font
+:lang: en
+:listing-caption: Listing
+:sectnums:
+:toc: left
+:toclevels: 4
+:source-highlighter: pygments
+ifdef::backend-pdf[]
+:source-highlighter: coderay
+endif::[]
+:data-uri:
+:hide-uri-scheme:
+:stem: latexmath
+:footnote:
+:xrefstyle: short
+
+Contributors to all versions of the spec in alphabetical order (please contact editors to suggest
+corrections): Arvind, Krste Asanović, Rimas Avižienis, Jacob Bachmeyer, Christopher F. Batten,
+Allen J. Baum, Alex Bradbury, Scott Beamer, Preston Briggs, Christopher Celio, Chuanhua
+Chang, David Chisnall, Paul Clayton, Palmer Dabbelt, Ken Dockser, Roger Espasa, Greg Favor,
+Shaked Flur, Stefan Freudenberger, Marc Gauthier, Andy Glew, Jan Gray, Michael Hamburg, John
+Hauser, David Horner, Bruce Hoult, Bill Huffman, Alexandre Joannou, Olof Johansson, Ben Keller,
+David Kruckemyer, Yunsup Lee, Paul Loewenstein, Daniel Lustig, Yatin Manerkar, Luc Maranget,
+Margaret Martonosi, Joseph Myers, Vijayanand Nagarajan, Rishiyur Nikhil, Jonas Oberhauser,
+Stefan O’Rear, Albert Ou, John Ousterhout, David Patterson, Christopher Pulte, Jose Renau,
+Josh Scheid, Colin Schmidt, Peter Sewell, Susmit Sarkar, Michael Taylor, Wesley Terpstra, Matt
+Thomas, Tommy Thorn, Caroline Trippel, Ray VanDeWalker, Muralidaran Vijayaraghavan, Megan
+Wachs, Andrew Waterman, Robert Watson, Derek Williams, Andrew Wright, Reinoud Zandijk,
+and Sizhuo Zhang.
+
+This document is released under a Creative Commons Attribution 4.0 International License.
+
+This document is a derivative of “The RISC-V Instruction Set Manual, Volume I: User-Level ISA
+Version 2.1” released under the following license: ©2010–2017 Andrew Waterman, Yunsup Lee,
+David Patterson, Krste Asanović. Creative Commons Attribution 4.0 International License.
+Please cite as: “The RISC-V Instruction Set Manual, Volume I: User-Level ISA, Document
+Version 20191214-draft”, Editors Andrew Waterman and Krste Asanović, RISC-V Foundation,
+December 2019.
+
+//the colophon allows for a section after the preamble that is part of the frontmatter and therefore not assigned a page number.
+include::colophon.adoc[]
+//preface.tex
+//While some documents need several levels of introductory material, other documents only need a brief introduction.
+//include::overview.adoc[]
+include::introduction.adoc[]
+//intro.tex
+include::rv32.adoc[]
+//rv32.tex
+include::zifencei.adoc[]
+//zfencei.tex
+include::zihintpause.adoc[]
+//zihintpause.tex
+include::rv32e.adoc[]
+//rv32e.tex
+include::rv64.adoc[]
+//rv64.tex
+include::rv128.adoc[]
+//rv128.tex
+include::m-st-ext.adoc[]
+//m.tex
+include::a-st-ext.adoc[]
+//a.tex
+include::zicsr.adoc[]
+//csr.tex
+include::f-st-ext.adoc[]
+//f.tex
+include::d-st-ext.adoc[]
+//d.tex
+include::q-st-ext.adoc[]
+//q.tex
+include::rvwmo.adoc[]
+//rvwmo.tex
+include::c-st-ext.adoc[]
+//c.tex
+include::b-st-ext.adoc[]
+//b.tex
+include::j-st-ext.adoc[]
+//j.tex
+include::p-st-ext.adoc[]
+//p.tex
+include::v-st-ext.adoc[]
+//v.tex
+include::zam-st-ext.adoc[]
+//zam.tex
+include::ztso-st-ext.adoc[]
+//ztso.tex
+include::rv-32-64g.adoc[]
+//gmaps.tex
+include::extending.adoc[]
+//extensions.tex
+include::naming.adoc[]
+//naming.tex
+include::history.adoc[]
+//history.tex
+//include::mm-explain.adoc[]
+//memory.tex
+include::mm-formal.adoc[]
+//end of memory.tex, memory-model-alloy.tex, memory-model-herd.tex
+//include::index.adoc[]
+// this is generated from index markers.
+include::bibliography.adoc[]
+// this references the riscv-spec.bib file that has been copied into the resources directory
+
 diff --git a/src/rv-32-64g.adoc b/src/rv-32-64g.adoc new file mode 100644 index 0000000..4355fae --- /dev/null +++ b/src/rv-32-64g.adoc @@ -0,0 +1,491 @@
+[[rv32-64g]]
+== RV32/64G Instruction Set Listings
+
+One goal of the RISC-V project is that it be used as a stable software
+development target. For this purpose, we define a combination of a base
+ISA (RV32I or RV64I) plus selected standard extensions (IMAFD, Zicsr,
+Zifencei) as a ``general-purpose`` ISA, and we use the abbreviation G
+for the IMAFDZicsr_Zifencei combination of instruction-set extensions.
+This chapter presents opcode maps and instruction-set listings for RV32G
+and RV64G.
+
+[cols=">,^,^,^,^,^,^,^,^",]
+|===
+|inst[4:2] |000 |001 |010 |011 |100 |101 |110 |111
+
+|inst[6:5] | | | | | | | |(latexmath:[$>32b$])
+
+|00 |LOAD |LOAD-FP |_custom-0_ |MISC-MEM |OP-IMM |AUIPC |OP-IMM-32
+|latexmath:[$48b$]
+
+|01 |STORE |STORE-FP |_custom-1_ |AMO |OP |LUI |OP-32 |latexmath:[$64b$]
+
+|10 |MADD |MSUB |NMSUB |NMADD |OP-FP |_reserved_ |_custom-2/rv128_
+|latexmath:[$48b$]
+
+|11 |BRANCH |JALR |_reserved_ |JAL |SYSTEM |_reserved_ |_custom-3/rv128_
+|latexmath:[$\geq80b$]
+|===
+
+<> shows a map of the major opcodes for
+RVG. Major opcodes with 3 or more lower bits set are reserved for
+instruction lengths greater than 32 bits. Opcodes marked as _reserved_
+should be avoided for custom instruction-set extensions as they might be
+used by future standard extensions. Major opcodes marked as _custom-0_
+and _custom-1_ will be avoided by future standard extensions and are
+recommended for use by custom instruction-set extensions within the base
+32-bit instruction format.
The opcodes marked _custom-2/rv128_ and +_custom-3/rv128_ are reserved for future use by RV128, but will +otherwise be avoided for standard extensions and so can also be used for +custom instruction-set extensions in RV32 and RV64. + +We believe RV32G and RV64G provide simple but complete instruction sets +for a broad range of general-purpose computing. The optional compressed +instruction set described in <> can +be added (forming RV32GC and RV64GC) to improve performance, code size, +and energy efficiency, though with some additional hardware complexity. + +As we move beyond IMAFDC into further instruction-set extensions, the +added instructions tend to be more domain-specific and only provide +benefits to a restricted class of applications, e.g., for multimedia or +security. Unlike most commercial ISAs, the RISC-V ISA design clearly +separates the base ISA and broadly applicable standard extensions from +these more specialized additions. <> +has a more extensive discussion of ways to add extensions to the RISC-V +ISA. 
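The opcode map above can be read mechanically from bits [6:2] of the instruction word. The following non-normative sketch transcribes the table into a lookup; the function and dictionary names are invented for this illustration and are not part of the specification.

```python
# Major-opcode lookup transcribed from the RVG opcode map,
# keyed by (inst[6:5], inst[4:2]).
MAJOR_OPCODE = {
    (0b00, 0b000): "LOAD",      (0b00, 0b001): "LOAD-FP",
    (0b00, 0b010): "custom-0",  (0b00, 0b011): "MISC-MEM",
    (0b00, 0b100): "OP-IMM",    (0b00, 0b101): "AUIPC",
    (0b00, 0b110): "OP-IMM-32",
    (0b01, 0b000): "STORE",     (0b01, 0b001): "STORE-FP",
    (0b01, 0b010): "custom-1",  (0b01, 0b011): "AMO",
    (0b01, 0b100): "OP",        (0b01, 0b101): "LUI",
    (0b01, 0b110): "OP-32",
    (0b10, 0b000): "MADD",      (0b10, 0b001): "MSUB",
    (0b10, 0b010): "NMSUB",     (0b10, 0b011): "NMADD",
    (0b10, 0b100): "OP-FP",     (0b10, 0b101): "reserved",
    (0b10, 0b110): "custom-2/rv128",
    (0b11, 0b000): "BRANCH",    (0b11, 0b001): "JALR",
    (0b11, 0b010): "reserved",  (0b11, 0b011): "JAL",
    (0b11, 0b100): "SYSTEM",    (0b11, 0b101): "reserved",
    (0b11, 0b110): "custom-3/rv128",
}

def major_opcode(inst):
    # All 32-bit instructions have inst[1:0] == 11; inst[4:2] == 111
    # marks a longer-than-32-bit encoding in every row of the map.
    if inst & 0b11 != 0b11:
        return "compressed (16-bit)"
    if (inst >> 2) & 0b111 == 0b111:
        return "wider-than-32-bit encoding"
    return MAJOR_OPCODE[((inst >> 5) & 0b11, (inst >> 2) & 0b111)]
```

For example, an instruction word whose low seven bits are `0110011` falls in row `01`, column `100` of the map, i.e. the OP major opcode.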
+ +[cols="<,<,<,<,<,<,<,<,<,<,<,<",] +|=== +| | | | | | | | | | | | + +| | | | | | | | | | | | + +| |funct7 | | | |rs2 | |rs1 |funct3 |rd |opcode |R-type + +| |imm[11:0] | | | | | |rs1 |funct3 |rd |opcode |I-type + +| |imm[11:5] | | | |rs2 | |rs1 |funct3 |imm[4:0] |opcode |S-type + +| |imm[12latexmath:[$\vert$]10:5] | | | |rs2 | |rs1 |funct3 +|imm[4:1latexmath:[$\vert$]11] |opcode |B-type + +| |imm[31:12] | | | | | | | |rd |opcode |U-type + +| +|imm[20latexmath:[$\vert$]10:1latexmath:[$\vert$]11latexmath:[$\vert$]19:12] +| | | | | | | |rd |opcode |J-type + +| | | | | | | | | | | | + +| |*RV32I Base Instruction Set* | | | | | | | | | | + +| |imm[31:12] | | | | | | | |rd |0110111 |LUI + +| |imm[31:12] | | | | | | | |rd |0010111 |AUIPC + +| +|imm[20latexmath:[$\vert$]10:1latexmath:[$\vert$]11latexmath:[$\vert$]19:12] +| | | | | | | |rd |1101111 |JAL + +| |imm[11:0] | | | | | |rs1 |000 |rd |1100111 |JALR + +| |imm[12latexmath:[$\vert$]10:5] | | | |rs2 | |rs1 |000 +|imm[4:1latexmath:[$\vert$]11] |1100011 |BEQ + +| |imm[12latexmath:[$\vert$]10:5] | | | |rs2 | |rs1 |001 +|imm[4:1latexmath:[$\vert$]11] |1100011 |BNE + +| |imm[12latexmath:[$\vert$]10:5] | | | |rs2 | |rs1 |100 +|imm[4:1latexmath:[$\vert$]11] |1100011 |BLT + +| |imm[12latexmath:[$\vert$]10:5] | | | |rs2 | |rs1 |101 +|imm[4:1latexmath:[$\vert$]11] |1100011 |BGE + +| |imm[12latexmath:[$\vert$]10:5] | | | |rs2 | |rs1 |110 +|imm[4:1latexmath:[$\vert$]11] |1100011 |BLTU + +| |imm[12latexmath:[$\vert$]10:5] | | | |rs2 | |rs1 |111 +|imm[4:1latexmath:[$\vert$]11] |1100011 |BGEU + +| |imm[11:0] | | | | | |rs1 |000 |rd |0000011 |LB + +| |imm[11:0] | | | | | |rs1 |001 |rd |0000011 |LH + +| |imm[11:0] | | | | | |rs1 |010 |rd |0000011 |LW + +| |imm[11:0] | | | | | |rs1 |100 |rd |0000011 |LBU + +| |imm[11:0] | | | | | |rs1 |101 |rd |0000011 |LHU + +| |imm[11:5] | | | |rs2 | |rs1 |000 |imm[4:0] |0100011 |SB + +| |imm[11:5] | | | |rs2 | |rs1 |001 |imm[4:0] |0100011 |SH + +| |imm[11:5] | | | |rs2 | |rs1 |010 |imm[4:0] 
|0100011 |SW + +| |imm[11:0] | | | | | |rs1 |000 |rd |0010011 |ADDI + +| |imm[11:0] | | | | | |rs1 |010 |rd |0010011 |SLTI + +| |imm[11:0] | | | | | |rs1 |011 |rd |0010011 |SLTIU + +| |imm[11:0] | | | | | |rs1 |100 |rd |0010011 |XORI + +| |imm[11:0] | | | | | |rs1 |110 |rd |0010011 |ORI + +| |imm[11:0] | | | | | |rs1 |111 |rd |0010011 |ANDI + +| |0000000 | | | |shamt | |rs1 |001 |rd |0010011 |SLLI + +| |0000000 | | | |shamt | |rs1 |101 |rd |0010011 |SRLI + +| |0100000 | | | |shamt | |rs1 |101 |rd |0010011 |SRAI + +| |0000000 | | | |rs2 | |rs1 |000 |rd |0110011 |ADD + +| |0100000 | | | |rs2 | |rs1 |000 |rd |0110011 |SUB + +| |0000000 | | | |rs2 | |rs1 |001 |rd |0110011 |SLL + +| |0000000 | | | |rs2 | |rs1 |010 |rd |0110011 |SLT + +| |0000000 | | | |rs2 | |rs1 |011 |rd |0110011 |SLTU + +| |0000000 | | | |rs2 | |rs1 |100 |rd |0110011 |XOR + +| |0000000 | | | |rs2 | |rs1 |101 |rd |0110011 |SRL + +| |0100000 | | | |rs2 | |rs1 |101 |rd |0110011 |SRA + +| |0000000 | | | |rs2 | |rs1 |110 |rd |0110011 |OR + +| |0000000 | | | |rs2 | |rs1 |111 |rd |0110011 |AND + +| |fm | |pred | | |succ |rs1 |000 |rd |0001111 |FENCE + +| |1000 | |0011 | | |0011 |00000 |000 |00000 |0001111 |FENCE.TSO + +| |0000 | |0001 | | |0000 |00000 |000 |00000 |0001111 |PAUSE + +| |000000000000 | | | | | |00000 |000 |00000 |1110011 |ECALL + +| |000000000001 | | | | | |00000 |000 |00000 |1110011 |EBREAK + +| | | | | | | | | | | | +|=== + +[cols="<,<,<,<,<,<,<,<,<,<,<,<",] +|=== +| | | | | | | | | | | | + +| | | | | | | | | | | | + +| |funct7 | | | |rs2 | |rs1 |funct3 |rd |opcode |R-type + +| |imm[11:0] | | | | | |rs1 |funct3 |rd |opcode |I-type + +| |imm[11:5] | | | |rs2 | |rs1 |funct3 |imm[4:0] |opcode |S-type + +| | | | | | | | | | | | + +| |*RV64I Base Instruction Set (in addition to RV32I)* | | | | | | | | | +| + +| |imm[11:0] | | | | | |rs1 |110 |rd |0000011 |LWU + +| |imm[11:0] | | | | | |rs1 |011 |rd |0000011 |LD + +| |imm[11:5] | | | |rs2 | |rs1 |011 |imm[4:0] |0100011 |SD + +| |000000 | | |shamt | 
| |rs1 |001 |rd |0010011 |SLLI + +| |000000 | | |shamt | | |rs1 |101 |rd |0010011 |SRLI + +| |010000 | | |shamt | | |rs1 |101 |rd |0010011 |SRAI + +| |imm[11:0] | | | | | |rs1 |000 |rd |0011011 |ADDIW + +| |0000000 | | | |shamt | |rs1 |001 |rd |0011011 |SLLIW + +| |0000000 | | | |shamt | |rs1 |101 |rd |0011011 |SRLIW + +| |0100000 | | | |shamt | |rs1 |101 |rd |0011011 |SRAIW + +| |0000000 | | | |rs2 | |rs1 |000 |rd |0111011 |ADDW + +| |0100000 | | | |rs2 | |rs1 |000 |rd |0111011 |SUBW + +| |0000000 | | | |rs2 | |rs1 |001 |rd |0111011 |SLLW + +| |0000000 | | | |rs2 | |rs1 |101 |rd |0111011 |SRLW + +| |0100000 | | | |rs2 | |rs1 |101 |rd |0111011 |SRAW + +| | | | | | | | | | | | + +| |*RV32/RV64 _Zifencei_ Standard Extension* | | | | | | | | | | + +| |imm[11:0] | | | | | |rs1 |001 |rd |0001111 |FENCE.I + +| | | | | | | | | | | | + +| |*RV32/RV64 _Zicsr_ Standard Extension* | | | | | | | | | | + +| |csr | | | | | |rs1 |001 |rd |1110011 |CSRRW + +| |csr | | | | | |rs1 |010 |rd |1110011 |CSRRS + +| |csr | | | | | |rs1 |011 |rd |1110011 |CSRRC + +| |csr | | | | | |uimm |101 |rd |1110011 |CSRRWI + +| |csr | | | | | |uimm |110 |rd |1110011 |CSRRSI + +| |csr | | | | | |uimm |111 |rd |1110011 |CSRRCI + +| | | | | | | | | | | | + +| |*RV32M Standard Extension* | | | | | | | | | | + +| |0000001 | | | |rs2 | |rs1 |000 |rd |0110011 |MUL + +| |0000001 | | | |rs2 | |rs1 |001 |rd |0110011 |MULH + +| |0000001 | | | |rs2 | |rs1 |010 |rd |0110011 |MULHSU + +| |0000001 | | | |rs2 | |rs1 |011 |rd |0110011 |MULHU + +| |0000001 | | | |rs2 | |rs1 |100 |rd |0110011 |DIV + +| |0000001 | | | |rs2 | |rs1 |101 |rd |0110011 |DIVU + +| |0000001 | | | |rs2 | |rs1 |110 |rd |0110011 |REM + +| |0000001 | | | |rs2 | |rs1 |111 |rd |0110011 |REMU + +| | | | | | | | | | | | + +| |*RV64M Standard Extension (in addition to RV32M)* | | | | | | | | | | + +| |0000001 | | | |rs2 | |rs1 |000 |rd |0111011 |MULW + +| |0000001 | | | |rs2 | |rs1 |100 |rd |0111011 |DIVW + +| |0000001 | | | |rs2 | |rs1 |101 |rd 
|0111011 |DIVUW + +| |0000001 | | | |rs2 | |rs1 |110 |rd |0111011 |REMW + +| |0000001 | | | |rs2 | |rs1 |111 |rd |0111011 |REMUW + +| | | | | | | | | | | | +|=== + +[cols="<,<,<,<,<,<,<,<,<,<,<,<",] +|=== +| | | | | | | | | | | | +| | | | | | | | | | | | +| |funct7 | | | |rs2 | |rs1 |funct3 |rd |opcode |R-type +| | | | | | | | | | | | +| |*RV32A Standard Extension* | | | | | | | | | | +| |00010 | |aq |rl |00000 | |rs1 |010 |rd |0101111 |LR.W +| |00011 | |aq |rl |rs2 | |rs1 |010 |rd |0101111 |SC.W +| |00001 | |aq |rl |rs2 | |rs1 |010 |rd |0101111 |AMOSWAP.W +| |00000 | |aq |rl |rs2 | |rs1 |010 |rd |0101111 |AMOADD.W +| |00100 | |aq |rl |rs2 | |rs1 |010 |rd |0101111 |AMOXOR.W +| |01100 | |aq |rl |rs2 | |rs1 |010 |rd |0101111 |AMOAND.W +| |01000 | |aq |rl |rs2 | |rs1 |010 |rd |0101111 |AMOOR.W +| |10000 | |aq |rl |rs2 | |rs1 |010 |rd |0101111 |AMOMIN.W +| |10100 | |aq |rl |rs2 | |rs1 |010 |rd |0101111 |AMOMAX.W +| |11000 | |aq |rl |rs2 | |rs1 |010 |rd |0101111 |AMOMINU.W +| |11100 | |aq |rl |rs2 | |rs1 |010 |rd |0101111 |AMOMAXU.W +| | | | | | | | | | | | +| |*RV64A Standard Extension (in addition to RV32A)* | | | | | | | | | | +| |00010 | |aq |rl |00000 | |rs1 |011 |rd |0101111 |LR.D +| |00011 | |aq |rl |rs2 | |rs1 |011 |rd |0101111 |SC.D +| |00001 | |aq |rl |rs2 | |rs1 |011 |rd |0101111 |AMOSWAP.D +| |00000 | |aq |rl |rs2 | |rs1 |011 |rd |0101111 |AMOADD.D +| |00100 | |aq |rl |rs2 | |rs1 |011 |rd |0101111 |AMOXOR.D +| |01100 | |aq |rl |rs2 | |rs1 |011 |rd |0101111 |AMOAND.D +| |01000 | |aq |rl |rs2 | |rs1 |011 |rd |0101111 |AMOOR.D +| |10000 | |aq |rl |rs2 | |rs1 |011 |rd |0101111 |AMOMIN.D +| |10100 | |aq |rl |rs2 | |rs1 |011 |rd |0101111 |AMOMAX.D +| |11000 | |aq |rl |rs2 | |rs1 |011 |rd |0101111 |AMOMINU.D +| |11100 | |aq |rl |rs2 | |rs1 |011 |rd |0101111 |AMOMAXU.D +| | | | | | | | | | | | +|=== + +[cols="<,<,<,<,<,<,<,<,<,<,<,<",] +|=== +| | | | | | | | | | | | +| | | | | | | | | | | | +| |funct7 | | | |rs2 | |rs1 |funct3 |rd |opcode |R-type +| |rs3 | |funct2 | 
|rs2 | |rs1 |funct3 |rd |opcode |R4-type +| |imm[11:0] | | | | | |rs1 |funct3 |rd |opcode |I-type +| |imm[11:5] | | | |rs2 | |rs1 |funct3 |imm[4:0] |opcode |S-type +| | | | | | | | | | | | +| |*RV32F Standard Extension* | | | | | | | | | | +| |imm[11:0] | | | | | |rs1 |010 |rd |0000111 |FLW +| |imm[11:5] | | | |rs2 | |rs1 |010 |imm[4:0] |0100111 |FSW +| |rs3 | |00 | |rs2 | |rs1 |rm |rd |1000011 |FMADD.S +| |rs3 | |00 | |rs2 | |rs1 |rm |rd |1000111 |FMSUB.S +| |rs3 | |00 | |rs2 | |rs1 |rm |rd |1001011 |FNMSUB.S +| |rs3 | |00 | |rs2 | |rs1 |rm |rd |1001111 |FNMADD.S +| |0000000 | | | |rs2 | |rs1 |rm |rd |1010011 |FADD.S +| |0000100 | | | |rs2 | |rs1 |rm |rd |1010011 |FSUB.S +| |0001000 | | | |rs2 | |rs1 |rm |rd |1010011 |FMUL.S +| |0001100 | | | |rs2 | |rs1 |rm |rd |1010011 |FDIV.S +| |0101100 | | | |00000 | |rs1 |rm |rd |1010011 |FSQRT.S +| |0010000 | | | |rs2 | |rs1 |000 |rd |1010011 |FSGNJ.S +| |0010000 | | | |rs2 | |rs1 |001 |rd |1010011 |FSGNJN.S +| |0010000 | | | |rs2 | |rs1 |010 |rd |1010011 |FSGNJX.S +| |0010100 | | | |rs2 | |rs1 |000 |rd |1010011 |FMIN.S +| |0010100 | | | |rs2 | |rs1 |001 |rd |1010011 |FMAX.S +| |1100000 | | | |00000 | |rs1 |rm |rd |1010011 |FCVT.W.S +| |1100000 | | | |00001 | |rs1 |rm |rd |1010011 |FCVT.WU.S +| |1110000 | | | |00000 | |rs1 |000 |rd |1010011 |FMV.X.W +| |1010000 | | | |rs2 | |rs1 |010 |rd |1010011 |FEQ.S +| |1010000 | | | |rs2 | |rs1 |001 |rd |1010011 |FLT.S +| |1010000 | | | |rs2 | |rs1 |000 |rd |1010011 |FLE.S +| |1110000 | | | |00000 | |rs1 |001 |rd |1010011 |FCLASS.S +| |1101000 | | | |00000 | |rs1 |rm |rd |1010011 |FCVT.S.W +| |1101000 | | | |00001 | |rs1 |rm |rd |1010011 |FCVT.S.WU +| |1111000 | | | |00000 | |rs1 |000 |rd |1010011 |FMV.W.X +| | | | | | | | | | | | +| |*RV64F Standard Extension (in addition to RV32F)* | | | | | | | | | | +| |1100000 | | | |00010 | |rs1 |rm |rd |1010011 |FCVT.L.S +| |1100000 | | | |00011 | |rs1 |rm |rd |1010011 |FCVT.LU.S +| |1101000 | | | |00010 | |rs1 |rm |rd |1010011 |FCVT.S.L +| 
|1101000 | | | |00011 | |rs1 |rm |rd |1010011 |FCVT.S.LU +| | | | | | | | | | | | +|=== + +[cols="<,<,<,<,<,<,<,<,<,<,<,<",] +|=== +| | | | | | | | | | | | +| | | | | | | | | | | | +| |funct7 | | | |rs2 | |rs1 |funct3 |rd |opcode |R-type +| |rs3 | |funct2 | |rs2 | |rs1 |funct3 |rd |opcode |R4-type +| |imm[11:0] | | | | | |rs1 |funct3 |rd |opcode |I-type +| |imm[11:5] | | | |rs2 | |rs1 |funct3 |imm[4:0] |opcode |S-type +| | | | | | | | | | | | +| |*RV32D Standard Extension* | | | | | | | | | | +| |imm[11:0] | | | | | |rs1 |011 |rd |0000111 |FLD +| |imm[11:5] | | | |rs2 | |rs1 |011 |imm[4:0] |0100111 |FSD +| |rs3 | |01 | |rs2 | |rs1 |rm |rd |1000011 |FMADD.D +| |rs3 | |01 | |rs2 | |rs1 |rm |rd |1000111 |FMSUB.D +| |rs3 | |01 | |rs2 | |rs1 |rm |rd |1001011 |FNMSUB.D +| |rs3 | |01 | |rs2 | |rs1 |rm |rd |1001111 |FNMADD.D +| |0000001 | | | |rs2 | |rs1 |rm |rd |1010011 |FADD.D +| |0000101 | | | |rs2 | |rs1 |rm |rd |1010011 |FSUB.D +| |0001001 | | | |rs2 | |rs1 |rm |rd |1010011 |FMUL.D +| |0001101 | | | |rs2 | |rs1 |rm |rd |1010011 |FDIV.D +| |0101101 | | | |00000 | |rs1 |rm |rd |1010011 |FSQRT.D +| |0010001 | | | |rs2 | |rs1 |000 |rd |1010011 |FSGNJ.D +| |0010001 | | | |rs2 | |rs1 |001 |rd |1010011 |FSGNJN.D +| |0010001 | | | |rs2 | |rs1 |010 |rd |1010011 |FSGNJX.D +| |0010101 | | | |rs2 | |rs1 |000 |rd |1010011 |FMIN.D +| |0010101 | | | |rs2 | |rs1 |001 |rd |1010011 |FMAX.D +| |0100000 | | | |00001 | |rs1 |rm |rd |1010011 |FCVT.S.D +| |0100001 | | | |00000 | |rs1 |rm |rd |1010011 |FCVT.D.S +| |1010001 | | | |rs2 | |rs1 |010 |rd |1010011 |FEQ.D +| |1010001 | | | |rs2 | |rs1 |001 |rd |1010011 |FLT.D +| |1010001 | | | |rs2 | |rs1 |000 |rd |1010011 |FLE.D +| |1110001 | | | |00000 | |rs1 |001 |rd |1010011 |FCLASS.D +| |1100001 | | | |00000 | |rs1 |rm |rd |1010011 |FCVT.W.D +| |1100001 | | | |00001 | |rs1 |rm |rd |1010011 |FCVT.WU.D +| |1101001 | | | |00000 | |rs1 |rm |rd |1010011 |FCVT.D.W +| |1101001 | | | |00001 | |rs1 |rm |rd |1010011 |FCVT.D.WU +| | | | | | | | | | | | 
+| |*RV64D Standard Extension (in addition to RV32D)* | | | | | | | | | | +| |1100001 | | | |00010 | |rs1 |rm |rd |1010011 |FCVT.L.D +| |1100001 | | | |00011 | |rs1 |rm |rd |1010011 |FCVT.LU.D +| |1110001 | | | |00000 | |rs1 |000 |rd |1010011 |FMV.X.D +| |1101001 | | | |00010 | |rs1 |rm |rd |1010011 |FCVT.D.L +| |1101001 | | | |00011 | |rs1 |rm |rd |1010011 |FCVT.D.LU +| |1111001 | | | |00000 | |rs1 |000 |rd |1010011 |FMV.D.X +| | | | | | | | | | | | +|=== + +.Instruction listing for RISC-V +[cols="<,<,<,<,<,<,<,<,<,<,<,<",] +|=== +| | | | | | | | | | | | +| | | | | | | | | | | | +| |funct7 | | | |rs2 | |rs1 |funct3 |rd |opcode |R-type +| |rs3 | |funct2 | |rs2 | |rs1 |funct3 |rd |opcode |R4-type +| |imm[11:0] | | | | | |rs1 |funct3 |rd |opcode |I-type +| |imm[11:5] | | | |rs2 | |rs1 |funct3 |imm[4:0] |opcode |S-type +| | | | | | | | | | | | +| |*RV32Q Standard Extension* | | | | | | | | | | +| |imm[11:0] | | | | | |rs1 |100 |rd |0000111 |FLQ +| |imm[11:5] | | | |rs2 | |rs1 |100 |imm[4:0] |0100111 |FSQ +| |rs3 | |11 | |rs2 | |rs1 |rm |rd |1000011 |FMADD.Q +| |rs3 | |11 | |rs2 | |rs1 |rm |rd |1000111 |FMSUB.Q +| |rs3 | |11 | |rs2 | |rs1 |rm |rd |1001011 |FNMSUB.Q +| |rs3 | |11 | |rs2 | |rs1 |rm |rd |1001111 |FNMADD.Q +| |0000011 | | | |rs2 | |rs1 |rm |rd |1010011 |FADD.Q +| |0000111 | | | |rs2 | |rs1 |rm |rd |1010011 |FSUB.Q +| |0001011 | | | |rs2 | |rs1 |rm |rd |1010011 |FMUL.Q +| |0001111 | | | |rs2 | |rs1 |rm |rd |1010011 |FDIV.Q +| |0101111 | | | |00000 | |rs1 |rm |rd |1010011 |FSQRT.Q +| |0010011 | | | |rs2 | |rs1 |000 |rd |1010011 |FSGNJ.Q +| |0010011 | | | |rs2 | |rs1 |001 |rd |1010011 |FSGNJN.Q +| |0010011 | | | |rs2 | |rs1 |010 |rd |1010011 |FSGNJX.Q +| |0010111 | | | |rs2 | |rs1 |000 |rd |1010011 |FMIN.Q +| |0010111 | | | |rs2 | |rs1 |001 |rd |1010011 |FMAX.Q +| |0100000 | | | |00011 | |rs1 |rm |rd |1010011 |FCVT.S.Q +| |0100011 | | | |00000 | |rs1 |rm |rd |1010011 |FCVT.Q.S +| |0100001 | | | |00011 | |rs1 |rm |rd |1010011 |FCVT.D.Q +| |0100011 | | | |00001 
| |rs1 |rm |rd |1010011 |FCVT.Q.D +| |1010011 | | | |rs2 | |rs1 |010 |rd |1010011 |FEQ.Q +| |1010011 | | | |rs2 | |rs1 |001 |rd |1010011 |FLT.Q +| |1010011 | | | |rs2 | |rs1 |000 |rd |1010011 |FLE.Q +| |1110011 | | | |00000 | |rs1 |001 |rd |1010011 |FCLASS.Q +| |1100011 | | | |00000 | |rs1 |rm |rd |1010011 |FCVT.W.Q +| |1100011 | | | |00001 | |rs1 |rm |rd |1010011 |FCVT.WU.Q +| |1101011 | | | |00000 | |rs1 |rm |rd |1010011 |FCVT.Q.W +| |1101011 | | | |00001 | |rs1 |rm |rd |1010011 |FCVT.Q.WU +| | | | | | | | | | | | +| |*RV64Q Standard Extension (in addition to RV32Q)* | | | | | | | | | | +| |1100011 | | | |00010 | |rs1 |rm |rd |1010011 |FCVT.L.Q +| |1100011 | | | |00011 | |rs1 |rm |rd |1010011 |FCVT.LU.Q +| |1101011 | | | |00010 | |rs1 |rm |rd |1010011 |FCVT.Q.L +| |1101011 | | | |00011 | |rs1 |rm |rd |1010011 |FCVT.Q.LU +| | | | | | | | | | | | +|=== + +<> lists the CSRs that have currently been +allocated CSR addresses. The timers, counters, and floating-point CSRs +are the only CSRs defined in this specification. + +[[rvgcsrnames]] +.RISC-V control and status register (CSR) address map. +[cols="<,<,<,<",options="header",] +|=== +|Number |Privilege |Name |Description +|Floating-Point Control and Status Registers | | | + +|`0x001 ` |Read/write |`fflags ` |Floating-Point Accrued Exceptions. + +|`0x002 ` |Read/write |`frm ` |Floating-Point Dynamic Rounding Mode. + +|`0x003 ` |Read/write |`fcsr ` |Floating-Point Control and Status +Register (`frm` + `fflags`). + +|Counters and Timers | | | + +|`0xC00 ` |Read-only |`cycle ` |Cycle counter for RDCYCLE instruction. + +|`0xC01 ` |Read-only |`time ` |Timer for RDTIME instruction. + +|`0xC02 ` |Read-only |`instret ` |Instructions-retired counter for +RDINSTRET instruction. + +|`0xC80 ` |Read-only |`cycleh ` |Upper 32 bits of `cycle`, RV32I only. + +|`0xC81 ` |Read-only |`timeh ` |Upper 32 bits of `time`, RV32I only. + +|`0xC82 ` |Read-only |`instreth ` |Upper 32 bits of `instret`, RV32I +only. 
+|===
+
 diff --git a/src/rv128.adoc b/src/rv128.adoc new file mode 100644 index 0000000..1513bf9 --- /dev/null +++ b/src/rv128.adoc @@ -0,0 +1,79 @@
+[[rv128]]
+== RV128I Base Integer Instruction Set, Version 1.7
+
+====
+_"There is only one mistake that can be made in computer design that is
+difficult to recover from—not having enough address bits for memory
+addressing and memory management."_ Bell and Strecker, ISCA-3, 1976.
+====
+
+This chapter describes RV128I, a variant of the RISC-V ISA supporting a
+flat 128-bit address space. The variant is a straightforward
+extrapolation of the existing RV32I and RV64I designs.
+(((RV128, design)))
+
+[TIP]
+====
+The primary reason to extend integer register width is to support larger
+address spaces. It is not clear when a flat address space larger than 64
+bits will be required. At the time of writing, the fastest supercomputer
+in the world as measured by the Top500 benchmark had over a petabyte of
+DRAM, and would require over 50 bits of address space if all the DRAM
+resided in a single address space. Some warehouse-scale computers
+already contain even larger quantities of DRAM, and new dense
+solid-state non-volatile memories and fast interconnect technologies
+might drive a demand for even larger memory spaces. Exascale systems
+research is targeting memory systems that occupy 57 bits of address
+space. At historic rates of growth, it is possible that greater than 64
+bits of address space might be required before 2030.
+====
+
+History suggests that whenever it becomes clear that more than 64 bits
+of address space is needed, architects will repeat intensive debates
+about alternatives to extending the address space, including
+segmentation, 96-bit address spaces, and software workarounds, until,
+finally, flat 128-bit address spaces will be adopted as the simplest and
+best solution.
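The address-space sizing figures quoted above can be checked with a line of arithmetic; this throwaway helper (not from the spec) just counts the bits needed to address a memory of a given byte size.

```python
import math

def address_bits(nbytes):
    # Number of address bits needed to give every byte a distinct address.
    return math.ceil(math.log2(nbytes))

# 2**50 bytes is one pebibyte, so a memory larger than ~1 PB
# needs more than 50 address bits, as the text states.
assert address_bits(2**50) == 50
# A 100 PB memory system occupies 57 bits of address space.
assert address_bits(100 * 10**15) == 57
```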
+(((RV128, evolution)))
+
+[TIP]
+====
+We have not frozen the RV128 spec at this time, as there might be need
+to evolve the design based on actual usage of 128-bit address spaces.
+====
+(((RV128I, as relates to RV64I)))
+
+RV128I builds upon RV64I in the same way RV64I builds upon RV32I, with
+integer registers extended to 128 bits (i.e., XLEN=128). Most integer
+computational instructions are unchanged as they are defined to operate
+on XLEN bits. The RV64I `\*W` integer instructions that operate on
+32-bit values in the low bits of a register are retained but now sign
+extend their results from bit 31 to bit 127. A new set of `\*D` integer
+instructions are added that operate on 64-bit values held in the low
+bits of the 128-bit integer registers and sign extend their results from
+bit 63 to bit 127. The `\*D` instructions consume two major opcodes
+(OP-IMM-64 and OP-64) in the standard 32-bit encoding.
+(((RV128I, compatibility with RV64)))
+
+To improve compatibility with RV64, in a reverse of how RV32 to RV64 was
+handled, we might change the decoding around to rename RV64I ADD as a
+64-bit ADDD, and add a 128-bit ADDQ in what was previously the OP-64
+major opcode (now renamed the OP-128 major opcode).
+
+
+Shifts by an immediate (SLLI/SRLI/SRAI) are now encoded using the low 7
+bits of the I-immediate, and variable shifts (SLL/SRL/SRA) use the low 7
+bits of the shift amount source register.
+(((RV128I, LDU)))
+
+A LDU (load double unsigned) instruction is added using the existing
+LOAD major opcode, along with new LQ and SQ instructions to load and
+store quadword values. SQ is added to the STORE major opcode, while LQ
+is added to the MISC-MEM major opcode.
+
+
+The floating-point instruction set is unchanged, although the 128-bit Q
+floating-point extension can now support FMV.X.Q and FMV.Q.X
+instructions, together with additional FCVT instructions to and from the
+T (128-bit) integer format.
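The sign-extension rule for the narrow-width instructions can be made concrete with a small model. This is an illustrative sketch only, with made-up helper names; it shows how a `*W` or `*D` add would deposit its result into a 128-bit register under the rule stated above (result sign-extended from bit 31 or bit 63 to bit 127).

```python
XLEN = 128  # modeling RV128I

def sext(value, width, xlen=XLEN):
    """Sign-extend the low `width` bits of value to an xlen-bit
    two's-complement register value."""
    value &= (1 << width) - 1
    if value >> (width - 1):       # sign bit of the narrow result set
        value -= 1 << width
    return value & ((1 << xlen) - 1)

def addw(rs1, rs2):
    # 32-bit add; result sign-extended from bit 31 to bit 127.
    return sext(rs1 + rs2, 32)

def addd(rs1, rs2):
    # 64-bit add; result sign-extended from bit 63 to bit 127.
    return sext(rs1 + rs2, 64)
```

So `addw(0x7FFFFFFF, 1)` overflows into the 32-bit sign bit, and the negative result fills all 96 upper bits of the 128-bit register with ones.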
+ diff --git a/src/rv32.adoc b/src/rv32.adoc new file mode 100644 index 0000000..b035fe3 --- /dev/null +++ b/src/rv32.adoc @@ -0,0 +1,988 @@ +[[rv32]] +== RV32I Base Integer Instruction Set, Version 2.1 + +This chapter describes the RV32I base integer instruction set. + +[TIP] +==== +RV32I was designed to be sufficient to form a compiler target and to +support modern operating system environments. The ISA was also designed +to reduce the hardware required in a minimal implementation. RV32I +contains 40 unique instructions, though a simple implementation might +cover the ECALL/EBREAK instructions with a single SYSTEM hardware +instruction that always traps and might be able to implement the FENCE +instruction as a NOP, reducing base instruction count to 38 total. RV32I +can emulate almost any other ISA extension (except the A extension, +which requires additional hardware support for atomicity). + +In practice, a hardware implementation including the machine-mode +privileged architecture will also require the 6 CSR instructions. + +Subsets of the base integer ISA might be useful for pedagogical +purposes, but the base has been defined such that there should be little +incentive to subset a real hardware implementation beyond omitting +support for misaligned memory accesses and treating all SYSTEM +instructions as a single trap. +==== + +[NOTE] +==== +The standard RISC-V assembly language syntax is documented in the +Assembly Programmer’s Manual cite:[riscv-asm-manual]. + +Most of the commentary for RV32I also applies to the RV64I base. +==== + +=== Programmers’ Model for Base Integer ISA + +<> shows the unprivileged state for the base +integer ISA. For RV32I, the 32 `x` registers are each 32 bits wide, +i.e., XLEN=32. Register `x0` is hardwired with all bits equal to 0. +General purpose registers `x1`–`x31` hold values that various +instructions interpret as a collection of Boolean values, or as two’s +complement signed binary integers or unsigned binary integers. 
+
+There is one additional unprivileged register: the program counter `pc`
+holds the address of the current instruction.
+
+[[img-gprs]]
+.RISC-V base unprivileged integer register state.
+image::base-unpriv-reg-state.png[base,180,1000,align="center"]
+
+There is no dedicated stack pointer or subroutine return address link
+register in the Base Integer ISA; the instruction encoding allows any
+`x` register to be used for these purposes. However, the standard
+software calling convention uses register `x1` to hold the return
+address for a call, with register `x5` available as an alternate link
+register. The standard calling convention uses register `x2` as the
+stack pointer.
+
+Hardware might choose to accelerate function calls and returns that use
+`x1` or `x5`. See the descriptions of the JAL and JALR instructions.
+
+The optional compressed 16-bit instruction format is designed around the
+assumption that `x1` is the return address register and `x2` is the
+stack pointer. Software using other conventions will operate correctly
+but may have greater code size.
+
+The number of available architectural registers can have large impacts
+on code size, performance, and energy consumption. Although 16 registers
+would arguably be sufficient for an integer ISA running compiled code,
+it is impossible to encode a complete ISA with 16 registers in 16-bit
+instructions using a 3-address format. Although a 2-address format would
+be possible, it would increase instruction count and lower efficiency.
+We wanted to avoid intermediate instruction sizes (such as Xtensa’s
+24-bit instructions) to simplify base hardware implementations, and once
+a 32-bit instruction size was adopted, it was straightforward to support
+32 integer registers. A larger number of integer registers also helps
+performance on high-performance code, where there can be extensive use
+of loop unrolling, software pipelining, and cache tiling.
+
+For these reasons, we chose a conventional size of 32 integer registers
+for RV32I. Dynamic register usage tends to be dominated by a few
+frequently accessed registers, and regfile implementations can be
+optimized to reduce access energy for the frequently accessed
+registers. The optional compressed 16-bit instruction format mostly
+only accesses 8 registers and hence can provide a dense instruction
+encoding, while additional instruction-set extensions could support a
+much larger register space (either flat or hierarchical) if desired.
+
+For resource-constrained embedded applications, we have defined the
+RV32E subset, which only has 16 registers
+(<>).
+
+=== Base Instruction Formats
+
+In the base RV32I ISA, there are four core instruction formats
+(R/I/S/U), as shown in <>. All are a fixed 32
+bits in length and must be aligned on a four-byte boundary in memory. An
+instruction-address-misaligned exception is generated on a taken branch
+or unconditional jump if the target address is not four-byte aligned.
+This exception is reported on the branch or jump instruction, not on the
+target instruction. No instruction-address-misaligned exception is
+generated for a conditional branch that is not taken.
+
+The alignment constraint for base ISA instructions is relaxed to a
+two-byte boundary when instruction extensions with 16-bit lengths or
+other odd multiples of 16-bit lengths are added (i.e., IALIGN=16).
+
+Instruction-address-misaligned exceptions are reported on the branch or
+jump that would cause instruction misalignment to help debugging, and to
+simplify hardware design for systems with IALIGN=32, where these are the
+only places where misalignment can occur.
+
+The behavior upon decoding a reserved instruction is UNSPECIFIED.
+
+Some platforms may require that opcodes reserved for standard use raise
+an illegal-instruction exception. Other platforms may permit reserved
+opcode space to be used for non-conforming extensions.
+
+include::images/wavedrom/instruction_formats.adoc[]
+[[base_instr]]
+.Instruction formats
+image::image_placeholder.png[]
+
+[NOTE]
+====
+Each immediate subfield in <> above is labeled with the bit position (imm[x]) in the immediate value being produced, rather than the bit position within the instruction’s immediate field as is usually done.
+====
+
+The RISC-V ISA keeps the source (_rs1_ and _rs2_) and destination (_rd_)
+registers at the same position in all formats to simplify decoding.
+Except for the 5-bit immediates used in CSR instructions
+(<>), immediates are always
+sign-extended, and are generally packed towards the leftmost available
+bits in the instruction and have been allocated to reduce hardware
+complexity. In particular, the sign bit for all immediates is always in
+bit 31 of the instruction to speed sign-extension circuitry.
+
+[NOTE]
+====
+Decoding register specifiers is usually on the critical paths in
+implementations, and so the instruction format was chosen to keep all
+register specifiers at the same position in all formats at the expense
+of having to move immediate bits across formats (a property shared with
+RISC-IV, aka SPUR cite:[spur-jsscc1989]).
+
+In practice, most immediates are either small or require all XLEN bits.
+We chose an asymmetric immediate split (12 bits in regular instructions
+plus a special load-upper-immediate instruction with 20 bits) to
+increase the opcode space available for regular instructions.
+
+Immediates are sign-extended because we did not observe a benefit to
+using zero-extension for some immediates as in the MIPS ISA and wanted
+to keep the ISA as simple as possible.
+====
+
+=== Immediate Encoding Variants
+
+There are a further two variants of the instruction formats (B/J) based
+on the handling of immediates, as shown in <>.
+
+include::images/wavedrom/immediate_variants.adoc[]
+[[baseinstformatsimm]]
+.RISC-V base instruction formats showing immediate variants.
+image::image_placeholder.png[]
+
+[NOTE]
+====
+Each immediate subfield is labeled with the bit
+position (imm[x]) in the immediate value being produced, rather than the bit position within the
+instruction’s immediate field as is usually done.
+====
+
+The only difference between the S and B formats is that the 12-bit
+immediate field is used to encode branch offsets in multiples of 2 in
+the B format. Instead of shifting all bits in the instruction-encoded
+immediate left by one in hardware as is conventionally done, the middle
+bits (imm[10:1]) and sign bit stay in fixed positions, while the lowest
+bit in S format (inst[7]) encodes a high-order bit in B format.
+
+Similarly, the only difference between the U and J formats is that the
+20-bit immediate is shifted left by 12 bits to form U immediates and by
+1 bit to form J immediates. The location of instruction bits in the U
+and J format immediates is chosen to maximize overlap with the other
+formats and with each other.
+
+<> shows the immediates produced by
+each of the base instruction formats, and is labeled to show which
+instruction bit `(inst[_y_])` produces each bit of the immediate value.
+
+include::images/wavedrom/immediate.adoc[]
+[[immtypes]]
+.Immediate variants for I, S, B, U, and J
+image::image_placeholder.png[]
+
+
+Sign-extension is one of the most critical operations on immediates
+(particularly for XLEN latexmath:[$>$]32), and in RISC-V the sign bit for
+all immediates is always held in bit 31 of the instruction to allow
+sign-extension to proceed in parallel with instruction decoding.
+
+Although more complex implementations might have separate adders for
+branch and jump calculations and so would not benefit from keeping the
+location of immediate bits constant across types of instruction, we
+wanted to reduce the hardware cost of the simplest implementations.
By +rotating bits in the instruction encoding of B and J immediates instead +of using dynamic hardware muxes to multiply the immediate by 2, we +reduce instruction signal fanout and immediate mux costs by around a +factor of 2. The scrambled immediate encoding will add negligible time +to static or ahead-of-time compilation. For dynamic generation of +instructions, there is some small additional overhead, but the most +common short forward branches have straightforward immediate encodings. + +=== Integer Computational Instructions + +Most integer computational instructions operate on XLEN bits of values +held in the integer register file. Integer computational instructions +are either encoded as register-immediate operations using the I-type +format or as register-register operations using the R-type format. The +destination is register _rd_ for both register-immediate and +register-register instructions. No integer computational instructions +cause arithmetic exceptions. + +We did not include special instruction-set support for overflow checks +on integer arithmetic operations in the base instruction set, as many +overflow checks can be cheaply implemented using RISC-V branches. +Overflow checking for unsigned addition requires only a single +additional branch instruction after the addition: +`add t0, t1, t2; bltu t0, t1, overflow`. + +For signed addition, if one operand’s sign is known, overflow checking +requires only a single branch after the addition: +`addi t0, t1, +imm; blt t0, t1, overflow`. This covers the common case +of addition with an immediate operand. + +For general signed addition, three additional instructions after the +addition are required, leveraging the observation that the sum should be +less than one of the operands if and only if the other operand is +negative. + +[source,txt] +.... + add t0, t1, t2 + slti t3, t2, 0 + slt t4, t0, t1 + bne t3, t4, overflow +.... 
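The three-instruction overflow check above can be validated against a software model. This Python sketch is our own illustration, assuming 32-bit wrap-around arithmetic; it mirrors the `add`/`slti`/`slt`/`bne` sequence:

```python
def to_signed32(x):
    """Interpret the low 32 bits of x as a two's-complement integer."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x & 0x80000000 else x

def add_overflows(t1, t2):
    """Software mirror of: add t0,t1,t2; slti t3,t2,0; slt t4,t0,t1;
    bne t3,t4,overflow. Overflow iff (t2 < 0) != (t0 < t1), signed."""
    t0 = to_signed32(t1 + t2)   # wrapped 32-bit sum, as ADD computes it
    t3 = 1 if t2 < 0 else 0     # slti t3, t2, 0
    t4 = 1 if t0 < t1 else 0    # slt  t4, t0, t1
    return t3 != t4

print(add_overflows(0x7FFFFFFF, 1))    # True: INT32_MAX + 1 overflows
print(add_overflows(-5, 3))            # False
```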
+ +In RV64I, checks of 32-bit signed additions can be optimized further by +comparing the results of ADD and ADDW on the operands. + +==== Integer Register-Immediate Instructions + +include::images/wavedrom/integer_computational.adoc[] +.Integer Computational Instructions +image::image_placeholder.png[] + +ADDI adds the sign-extended 12-bit immediate to register _rs1_. +Arithmetic overflow is ignored and the result is simply the low XLEN +bits of the result. ADDI _rd, rs1, 0_ is used to implement the MV _rd, +rs1_ assembler pseudoinstruction. + +SLTI (set less than immediate) places the value 1 in register _rd_ if +register _rs1_ is less than the sign-extended immediate when both are +treated as signed numbers, else 0 is written to _rd_. SLTIU is similar +but compares the values as unsigned numbers (i.e., the immediate is +first sign-extended to XLEN bits then treated as an unsigned number). +Note, SLTIU _rd, rs1, 1_ sets _rd_ to 1 if _rs1_ equals zero, otherwise +sets _rd_ to 0 (assembler pseudoinstruction SEQZ _rd, rs_). + +ANDI, ORI, XORI are logical operations that perform bitwise AND, OR, and +XOR on register _rs1_ and the sign-extended 12-bit immediate and place +the result in _rd_. Note, XORI _rd, rs1, -1_ performs a bitwise logical +inversion of register _rs1_ (assembler pseudoinstruction NOT _rd, rs_). + +include::images/wavedrom/int-comp-slli-srli-srai.adoc[] +[[int-comp-slli-srli-srai]] +.Integer register-immediate, SLLI, SRLI, SRAI +image::image_placeholder.png[] + + +Shifts by a constant are encoded as a specialization of the I-type +format. The operand to be shifted is in _rs1_, and the shift amount is +encoded in the lower 5 bits of the I-immediate field. The right shift +type is encoded in bit 30. SLLI is a logical left shift (zeros are +shifted into the lower bits); SRLI is a logical right shift (zeros are +shifted into the upper bits); and SRAI is an arithmetic right shift (the +original sign bit is copied into the vacated upper bits). 
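The three shift behaviors just described can be sketched in software. This Python model is illustrative (the helper names are ours) and assumes a 32-bit XLEN:

```python
MASK32 = 0xFFFFFFFF

def slli(rs1, shamt):
    """Logical left shift: zeros enter the low bits; keep only 32 bits."""
    return (rs1 << shamt) & MASK32

def srli(rs1, shamt):
    """Logical right shift: zeros enter the high bits."""
    return (rs1 & MASK32) >> shamt

def srai(rs1, shamt):
    """Arithmetic right shift: the original sign bit fills the high bits."""
    x = rs1 & MASK32
    if x & 0x80000000:
        x |= 0xFFFFFFFF00000000  # pre-extend the sign so >> replicates it
    return (x >> shamt) & MASK32

print(hex(srai(0x80000000, 4)))  # 0xf8000000
print(hex(srli(0x80000000, 4)))  # 0x8000000
```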
+ + +include::images/wavedrom/int-comp-lui-aiupc.adoc[] +[[int-comp-lui-aiupc]] +.Integer register-immediate, U-immediate +image::image_placeholder.png[] + + +LUI (load upper immediate) is used to build 32-bit constants and uses +the U-type format. LUI places the 32-bit U-immediate value into the +destination register _rd_, filling in the lowest 12 bits with zeros. + +AUIPC (add upper immediate to `pc`) is used to build `pc`-relative +addresses and uses the U-type format. AUIPC forms a 32-bit offset from +the U-immediate, filling in the lowest 12 bits with zeros, adds this +offset to the address of the AUIPC instruction, then places the result +in register _rd_. + +The assembly syntax for `lui` and `auipc` does not represent the lower +12 bits of the U-immediate, which are always zero. + +The AUIPC instruction supports two-instruction sequences to access +arbitrary offsets from the PC for both control-flow transfers and data +accesses. The combination of an AUIPC and the 12-bit immediate in a JALR +can transfer control to any 32-bit PC-relative address, while an AUIPC +plus the 12-bit immediate offset in regular load or store instructions +can access any 32-bit PC-relative data address. + +The current PC can be obtained by setting the U-immediate to 0. Although +a JAL +4 instruction could also be used to obtain the local PC (of the +instruction following the JAL), it might cause pipeline breaks in +simpler microarchitectures or pollute BTB structures in more complex +microarchitectures. + +==== Integer Register-Register Operations + +RV32I defines several arithmetic R-type operations. All operations read +the _rs1_ and _rs2_ registers as source operands and write the result +into register _rd_. The _funct7_ and _funct3_ fields select the type of +operation. + +include::images/wavedrom/int_reg-reg.adoc[] +[[int-reg-reg]] +.Integer register-register +image::image_placeholder.png[] + +ADD performs the addition of _rs1_ and _rs2_. 
SUB performs the
+subtraction of _rs2_ from _rs1_. Overflows are ignored and the low XLEN
+bits of results are written to the destination _rd_. SLT and SLTU
+perform signed and unsigned compares respectively, writing 1 to _rd_ if
+latexmath:[$\mbox{\em rs1} < \mbox{\em rs2}$], 0 otherwise. Note, SLTU _rd_, _x0_, _rs2_ sets _rd_ to 1 if
+_rs2_ is not equal to zero, otherwise sets _rd_ to zero (assembler
+pseudoinstruction SNEZ _rd, rs_). AND, OR, and XOR perform bitwise
+logical operations.
+
+SLL, SRL, and SRA perform logical left, logical right, and arithmetic
+right shifts on the value in register _rs1_ by the shift amount held in
+the lower 5 bits of register _rs2_.
+
+==== NOP Instruction
+
+include::images/wavedrom/nop.adoc[]
+[[nop]]
+.NOP instruction
+image::image_placeholder.png[]
+
+The NOP instruction does not change any architecturally visible state,
+except for advancing the `pc` and incrementing any applicable
+performance counters. NOP is encoded as ADDI _x0, x0, 0_.
+
+NOPs can be used to align code segments to microarchitecturally
+significant address boundaries, or to leave space for inline code
+modifications. Although there are many possible ways to encode a NOP, we
+define a canonical NOP encoding to allow microarchitectural
+optimizations as well as for more readable disassembly output. The other
+NOP encodings are made available for HINT instructions
+(Section <>).
+
+ADDI was chosen for the NOP encoding as this is most likely to take
+fewest resources to execute across a range of systems (if not optimized
+away in decode). In particular, the instruction only reads one register.
+Also, an ADDI functional unit is more likely to be available in a
+superscalar design as adds are the most common operation. In particular,
+address-generation functional units can execute ADDI using the same
+hardware needed for base+offset address calculations, while
+register-register ADD or logical/shift operations require additional
+hardware.
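Because NOP is defined as ADDI _x0, x0, 0_, its canonical 32-bit encoding follows directly from the I-type layout. A hypothetical Python encoder (the function name is ours) makes this concrete:

```python
def encode_addi(rd, rs1, imm):
    """I-type layout: imm[11:0] in inst[31:20], rs1 in inst[19:15],
    funct3=000 for ADDI, rd in inst[11:7], opcode OP-IMM (0010011)."""
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (rd << 7) | 0b0010011

print(hex(encode_addi(0, 0, 0)))  # 0x13: the canonical NOP, ADDI x0, x0, 0
```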
+
+=== Control Transfer Instructions
+
+RV32I provides two types of control transfer instructions: unconditional
+jumps and conditional branches. Control transfer instructions in RV32I
+do _not_ have architecturally visible delay slots.
+
+If an instruction access-fault or instruction page-fault exception
+occurs on the target of a jump or taken branch, the exception is
+reported on the target instruction, not on the jump or branch
+instruction.
+
+==== Unconditional Jumps
+
+The jump and link (JAL) instruction uses the J-type format, where the
+J-immediate encodes a signed offset in multiples of 2 bytes. The offset
+is sign-extended and added to the address of the jump instruction to
+form the jump target address. Jumps can therefore target a
+latexmath:[$\pm$]1 MiB range. JAL stores the address of the instruction
+following the jump (`pc`+4) into register _rd_. The standard software
+calling convention uses `x1` as the return address register and `x5` as
+an alternate link register.
+
+The alternate link register supports calling millicode routines (e.g.,
+those to save and restore registers in compressed code) while preserving
+the regular return address register. The register `x5` was chosen as the
+alternate link register as it maps to a temporary in the standard
+calling convention, and has an encoding that is only one bit different
+than the regular link register.
+
+Plain unconditional jumps (assembler pseudoinstruction J) are encoded as
+a JAL with _rd_=`x0`.
+
+include::images/wavedrom/ct-unconditional.adoc[]
+[[ct-unconditional]]
+.Plain unconditional jumps
+image::image_placeholder.png[]
+
+The indirect jump instruction JALR (jump and link register) uses the
+I-type encoding. The target address is obtained by adding the
+sign-extended 12-bit I-immediate to the register _rs1_, then setting the
+least-significant bit of the result to zero. The address of the
+instruction following the jump (`pc`+4) is written to register _rd_.
+Register `x0` can be used as the destination if the result is not +required. + +include::images/wavedrom/ct-unconditional-2.adoc[] +[[ct-unconditional-2]] +.Indirect unconditional jump +image::image_placeholder.png[] + +The unconditional jump instructions all use PC-relative addressing to +help support position-independent code. The JALR instruction was defined +to enable a two-instruction sequence to jump anywhere in a 32-bit +absolute address range. A LUI instruction can first load _rs1_ with the +upper 20 bits of a target address, then JALR can add in the lower bits. +Similarly, AUIPC then JALR can jump anywhere in a 32-bit `pc`-relative +address range. + +Note that the JALR instruction does not treat the 12-bit immediate as +multiples of 2 bytes, unlike the conditional branch instructions. This +avoids one more immediate format in hardware. In practice, most uses of +JALR will have either a zero immediate or be paired with a LUI or AUIPC, +so the slight reduction in range is not significant. + +Clearing the least-significant bit when calculating the JALR target +address both simplifies the hardware slightly and allows the low bit of +function pointers to be used to store auxiliary information. Although +there is potentially a slight loss of error checking in this case, in +practice jumps to an incorrect instruction address will usually quickly +raise an exception. + +When used with a base _rs1_latexmath:[$=$]`x0`, JALR can be used to +implement a single instruction subroutine call to the lowest or highest +address region from anywhere in the address space, which could be used +to implement fast calls to a small runtime library. Alternatively, an +ABI could dedicate a general-purpose register to point to a library +elsewhere in the address space. + +The JAL and JALR instructions will generate an +instruction-address-misaligned exception if the target address is not +aligned to a four-byte boundary. 
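The JALR target calculation described above (add the sign-extended immediate to _rs1_, then clear the least-significant bit) can be sketched as follows; this Python helper is illustrative, not normative:

```python
def sign_extend(value, bits):
    """Two's-complement sign extension of a bits-wide field."""
    mask = 1 << (bits - 1)
    return (value ^ mask) - mask

def jalr_target(rs1_val, imm12):
    """JALR: rs1 plus the sign-extended 12-bit immediate, with bit 0
    cleared; the mask also truncates the sum to 32 bits."""
    return (rs1_val + sign_extend(imm12 & 0xFFF, 12)) & 0xFFFFFFFE

print(hex(jalr_target(0x1000, 1)))      # 0x1000: odd result has bit 0 cleared
print(hex(jalr_target(0x1000, 0xFFF)))  # 0xffe: immediate 0xFFF encodes -1
```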
+
+Instruction-address-misaligned exceptions are not possible on machines
+that support extensions with 16-bit aligned instructions, such as the
+compressed instruction-set extension, C.
+
+Return-address prediction stacks are a common feature of
+high-performance instruction-fetch units, but require accurate detection
+of instructions used for procedure calls and returns to be effective.
+For RISC-V, hints as to the instructions’ usage are encoded implicitly
+via the register numbers used. A JAL instruction should push the return
+address onto a return-address stack (RAS) only when _rd_ is `x1` or
+`x5`. JALR instructions should push/pop a RAS as shown in <>.
+
+[[rashints]]
+.Return-address stack prediction hints encoded in the register operands
+of a JALR instruction.
+[cols="^,^,^,<",options="header",]
+|===
+|_rd_ is `x1`/`x5` |_rs1_ is `x1`/`x5` |__rd__latexmath:[$=$]_rs1_ |RAS
+action
+|No |No |– |None
+
+|No |Yes |– |Pop
+
+|Yes |No |– |Push
+
+|Yes |Yes |No |Pop, then push
+
+|Yes |Yes |Yes |Push
+|===
+
+Some other ISAs added explicit hint bits to their indirect-jump
+instructions to guide return-address stack manipulation. We use implicit
+hinting tied to register numbers and the calling convention to reduce
+the encoding space used for these hints.
+
+When two different link registers (`x1` and `x5`) are given as _rs1_ and
+_rd_, then the RAS is both popped and pushed to support coroutines. If
+_rs1_ and _rd_ are the same link register (either `x1` or `x5`), the RAS
+is only pushed to enable macro-op fusion of the sequences:
+`lui ra, imm20; jalr ra, imm12(ra)`  and
+ `auipc ra, imm20; jalr ra, imm12(ra)`
+
+==== Conditional Branches
+
+All branch instructions use the B-type instruction format. The 12-bit
+B-immediate encodes signed offsets in multiples of 2 bytes. The offset
+is sign-extended and added to the address of the branch instruction to
+give the target address. The conditional branch range is
+latexmath:[$\pm$]4 KiB.
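The scrambled B- and J-immediate placements can be exercised with a small decoder. This Python sketch (helper names are ours) reassembles the offsets from a 32-bit instruction word according to the bit positions given earlier:

```python
def sign_extend(value, bits):
    mask = 1 << (bits - 1)
    return (value ^ mask) - mask

def b_imm(inst):
    """B-type: inst[31]=imm[12], inst[7]=imm[11], inst[30:25]=imm[10:5],
    inst[11:8]=imm[4:1]; imm[0] is always zero."""
    imm  = ((inst >> 31) & 0x1) << 12
    imm |= ((inst >> 7)  & 0x1) << 11
    imm |= ((inst >> 25) & 0x3F) << 5
    imm |= ((inst >> 8)  & 0xF) << 1
    return sign_extend(imm, 13)

def j_imm(inst):
    """J-type: inst[31]=imm[20], inst[19:12]=imm[19:12], inst[20]=imm[11],
    inst[30:21]=imm[10:1]; imm[0] is always zero."""
    imm  = ((inst >> 31) & 0x1) << 20
    imm |= ((inst >> 12) & 0xFF) << 12
    imm |= ((inst >> 20) & 0x1) << 11
    imm |= ((inst >> 21) & 0x3FF) << 1
    return sign_extend(imm, 21)

print(b_imm(0x00000463))  # 8: offset of `beq x0, x0, 8`
print(j_imm(0xFFDFF06F))  # -4: offset of `jal x0, -4`
```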
+ +include::images/wavedrom/ct-conditional.adoc[] +[[ct-conditional]] +.Conditional branches +image::image_placeholder.png[] + +Branch instructions compare two registers. BEQ and BNE take the branch +if registers _rs1_ and _rs2_ are equal or unequal respectively. BLT and +BLTU take the branch if _rs1_ is less than _rs2_, using signed and +unsigned comparison respectively. BGE and BGEU take the branch if _rs1_ +is greater than or equal to _rs2_, using signed and unsigned comparison +respectively. Note, BGT, BGTU, BLE, and BLEU can be synthesized by +reversing the operands to BLT, BLTU, BGE, and BGEU, respectively. + +Signed array bounds may be checked with a single BLTU instruction, since +any negative index will compare greater than any nonnegative bound. + +Software should be optimized such that the sequential code path is the +most common path, with less-frequently taken code paths placed out of +line. Software should also assume that backward branches will be +predicted taken and forward branches as not taken, at least the first +time they are encountered. Dynamic predictors should quickly learn any +predictable branch behavior. + +Unlike some other architectures, the RISC-V jump (JAL with _rd_=`x0`) +instruction should always be used for unconditional branches instead of +a conditional branch instruction with an always-true condition. RISC-V +jumps are also PC-relative and support a much wider offset range than +branches, and will not pollute conditional-branch prediction tables. + +The conditional branches were designed to include arithmetic comparison +operations between two registers (as also done in PA-RISC, Xtensa, and +MIPS R6), rather than use condition codes (x86, ARM, SPARC, PowerPC), or +to only compare one register against zero (Alpha, MIPS), or two +registers only for equality (MIPS). 
This design was motivated by the +observation that a combined compare-and-branch instruction fits into a +regular pipeline, avoids additional condition code state or use of a +temporary register, and reduces static code size and dynamic instruction +fetch traffic. Another point is that comparisons against zero require +non-trivial circuit delay (especially after the move to static logic in +advanced processes) and so are almost as expensive as arithmetic +magnitude compares. Another advantage of a fused compare-and-branch +instruction is that branches are observed earlier in the front-end +instruction stream, and so can be predicted earlier. There is perhaps an +advantage to a design with condition codes in the case where multiple +branches can be taken based on the same condition codes, but we believe +this case to be relatively rare. + +We considered but did not include static branch hints in the instruction +encoding. These can reduce the pressure on dynamic predictors, but +require more instruction encoding space and software profiling for best +results, and can result in poor performance if production runs do not +match profiling runs. + +We considered but did not include conditional moves or predicated +instructions, which can effectively replace unpredictable short forward +branches. Conditional moves are the simpler of the two, but are +difficult to use with conditional code that might cause exceptions +(memory accesses and floating-point operations). Predication adds +additional flag state to a system, additional instructions to set and +clear flags, and additional encoding overhead on every instruction. Both +conditional move and predicated instructions add complexity to +out-of-order microarchitectures, adding an implicit third source operand +due to the need to copy the original value of the destination +architectural register into the renamed destination physical register if +the predicate is false. 
Also, static compile-time decisions to use
+predication instead of branches can result in lower performance on
+inputs not included in the compiler training set, especially given that
+unpredictable branches are rare, and becoming rarer as branch prediction
+techniques improve.
+
+We note that various microarchitectural techniques exist to dynamically
+convert unpredictable short forward branches into internally predicated
+code to avoid the cost of flushing pipelines on a branch mispredict cite:[heil-tr1996], cite:[Klauser-1998], cite:[Kim-micro2005] and
+have been implemented in commercial processors cite:[ibmpower7]. The simplest techniques
+just reduce the penalty of recovering from a mispredicted short forward
+branch by only flushing instructions in the branch shadow instead of the
+entire fetch pipeline, or by fetching instructions from both sides using
+wide instruction fetch or idle instruction fetch slots. More complex
+techniques for out-of-order cores add internal predicates on
+instructions in the branch shadow, with the internal predicate value
+written by the branch instruction, allowing the branch and following
+instructions to be executed speculatively and out-of-order with respect
+to other code.
+
+The conditional branch instructions will generate an
+instruction-address-misaligned exception if the target address is not
+aligned to a four-byte boundary and the branch condition evaluates to
+true. If the branch condition evaluates to false, the
+instruction-address-misaligned exception will not be raised.
+
+Instruction-address-misaligned exceptions are not possible on machines
+that support extensions with 16-bit aligned instructions, such as the
+compressed instruction-set extension, C.
+
+[[ldst]]
+=== Load and Store Instructions
+
+RV32I is a load-store architecture, where only load and store
+instructions access memory and arithmetic instructions only operate on
+CPU registers. RV32I provides a 32-bit address space that is
+byte-addressed.
The EEI will define what portions of the address space +are legal to access with which instructions (e.g., some addresses might +be read only, or support word access only). Loads with a destination of +`x0` must still raise any exceptions and cause any other side effects +even though the load value is discarded. + +The EEI will define whether the memory system is little-endian or +big-endian. In RISC-V, endianness is byte-address invariant. + +In a system for which endianness is byte-address invariant, the +following property holds: if a byte is stored to memory at some address +in some endianness, then a byte-sized load from that address in any +endianness returns the stored value. + +In a little-endian configuration, multibyte stores write the +least-significant register byte at the lowest memory byte address, +followed by the other register bytes in ascending order of their +significance. Loads similarly transfer the contents of the lesser memory +byte addresses to the less-significant register bytes. + +In a big-endian configuration, multibyte stores write the +most-significant register byte at the lowest memory byte address, +followed by the other register bytes in descending order of their +significance. Loads similarly transfer the contents of the greater +memory byte addresses to the less-significant register bytes. + +include::images/wavedrom/load_store.adoc[] +[[load-store,load and store]] +.Load and store instructions +image::image_placeholder.png[] + +Load and store instructions transfer a value between the registers and +memory. Loads are encoded in the I-type format and stores are S-type. +The effective address is obtained by adding register _rs1_ to the +sign-extended 12-bit offset. Loads copy a value from memory to register +_rd_. Stores copy the value in register _rs2_ to memory. + +The LW instruction loads a 32-bit value from memory into _rd_. LH loads +a 16-bit value from memory, then sign-extends to 32-bits before storing +in _rd_. 
LHU loads a 16-bit value from memory but then zero extends to +32-bits before storing in _rd_. LB and LBU are defined analogously for +8-bit values. The SW, SH, and SB instructions store 32-bit, 16-bit, and +8-bit values from the low bits of register _rs2_ to memory. + +Regardless of EEI, loads and stores whose effective addresses are +naturally aligned shall not raise an address-misaligned exception. Loads +and stores whose effective address is not naturally aligned to the +referenced datatype (i.e., the effective address is not divisible by the +size of the access in bytes) have behavior dependent on the EEI. + +An EEI may guarantee that misaligned loads and stores are fully +supported, and so the software running inside the execution environment +will never experience a contained or fatal address-misaligned trap. In +this case, the misaligned loads and stores can be handled in hardware, +or via an invisible trap into the execution environment implementation, +or possibly a combination of hardware and invisible trap depending on +address. + +An EEI may not guarantee misaligned loads and stores are handled +invisibly. In this case, loads and stores that are not naturally aligned +may either complete execution successfully or raise an exception. The +exception raised can be either an address-misaligned exception or an +access-fault exception. For a memory access that would otherwise be able +to complete except for the misalignment, an access-fault exception can +be raised instead of an address-misaligned exception if the misaligned +access should not be emulated, e.g., if accesses to the memory region +have side effects. When an EEI does not guarantee misaligned loads and +stores are handled invisibly, the EEI must define if exceptions caused +by address misalignment result in a contained trap (allowing software +running inside the execution environment to handle the trap) or a fatal +trap (terminating execution). 
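The sign- and zero-extension behavior of LH and LHU can be modeled directly. This Python sketch is illustrative only and assumes a little-endian EEI:

```python
def sign_extend(value, bits):
    mask = 1 << (bits - 1)
    return (value ^ mask) - mask

def lh(mem, addr):
    """LH: 16-bit little-endian load, sign-extended to a signed integer."""
    return sign_extend(mem[addr] | (mem[addr + 1] << 8), 16)

def lhu(mem, addr):
    """LHU: 16-bit little-endian load, zero-extended."""
    return mem[addr] | (mem[addr + 1] << 8)

mem = bytearray([0xFE, 0xFF, 0x34, 0x12])
print(lh(mem, 0), lhu(mem, 0))  # -2 65534: same bytes, different extension
print(hex(lh(mem, 2)))          # 0x1234: positive values are unchanged
```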
+ +Misaligned accesses are occasionally required when porting legacy code, +and help performance on applications when using any form of packed-SIMD +extension or handling externally packed data structures. Our rationale +for allowing EEIs to choose to support misaligned accesses via the +regular load and store instructions is to simplify the addition of +misaligned hardware support. One option would have been to disallow +misaligned accesses in the base ISAs and then provide some separate ISA +support for misaligned accesses, either special instructions to help +software handle misaligned accesses or a new hardware addressing mode +for misaligned accesses. Special instructions are difficult to use, +complicate the ISA, and often add new processor state (e.g., SPARC VIS +align address offset register) or complicate access to existing +processor state (e.g., MIPS LWL/LWR partial register writes). In +addition, for loop-oriented packed-SIMD code, the extra overhead when +operands are misaligned motivates software to provide multiple forms of +loop depending on operand alignment, which complicates code generation +and adds to loop startup overhead. New misaligned hardware addressing +modes take considerable space in the instruction encoding or require +very simplified addressing modes (e.g., register indirect only). + +Even when misaligned loads and stores complete successfully, these +accesses might run extremely slowly depending on the implementation +(e.g., when implemented via an invisible trap). Furthermore, whereas +naturally aligned loads and stores are guaranteed to execute atomically, +misaligned loads and stores might not, and hence require additional +synchronization to ensure atomicity. + +We do not mandate atomicity for misaligned accesses so execution +environment implementations can use an invisible machine trap and a +software handler to handle some or all misaligned accesses. 
If hardware +misaligned support is provided, software can exploit this by simply +using regular load and store instructions. Hardware can then +automatically optimize accesses depending on whether runtime addresses +are aligned. + +[[fence]] +=== Memory Ordering Instructions + +include::images/wavedrom/mem_order.adoc[] +[[mem-order]] +.Memory ordering instructions +image::image_placeholder.png[] + +The FENCE instruction is used to order device I/O and memory accesses as +viewed by other RISC-V harts and external devices or coprocessors. Any +combination of device input (I), device output (O), memory reads \(R), +and memory writes (W) may be ordered with respect to any combination of +the same. Informally, no other RISC-V hart or external device can +observe any operation in the _successor_ set following a FENCE before +any operation in the _predecessor_ set preceding the FENCE. +<> provides a precise description +of the RISC-V memory consistency model. + +The FENCE instruction also orders memory reads and writes made by the +hart as observed by memory reads and writes made by an external device. +However, FENCE does not order observations of events made by an external +device using any other signaling mechanism. + +A device might observe an access to a memory location via some external +communication mechanism, e.g., a memory-mapped control register that +drives an interrupt signal to an interrupt controller. This +communication is outside the scope of the FENCE ordering mechanism and +hence the FENCE instruction can provide no guarantee on when a change in +the interrupt signal is visible to the interrupt controller. Specific +devices might provide additional ordering guarantees to reduce software +overhead but those are outside the scope of the RISC-V memory model. 
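In the FENCE encoding, the predecessor and successor sets occupy inst[27:24] and inst[23:20] (I, O, R, W from most- to least-significant bit), with _fm_ in inst[31:28]. A hypothetical Python decoder (names are ours) illustrates the layout:

```python
BITS = "iorw"  # device input, device output, memory reads, memory writes

def decode_fence(inst):
    """Split a FENCE word: fm in inst[31:28], predecessor set (I,O,R,W)
    in inst[27:24], successor set in inst[23:20]."""
    fm   = (inst >> 28) & 0xF
    pred = (inst >> 24) & 0xF
    succ = (inst >> 20) & 0xF
    def names(s):
        return "".join(c for i, c in enumerate(BITS) if s & (8 >> i))
    return fm, names(pred), names(succ)

print(decode_fence(0x0FF0000F))  # (0, 'iorw', 'iorw'): fence iorw,iorw
print(decode_fence(0x8330000F))  # (8, 'rw', 'rw'): FENCE.TSO
```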
+ +The EEI will define what I/O operations are possible, and in particular, +which memory addresses when accessed by load and store instructions will +be treated and ordered as device input and device output operations +respectively rather than memory reads and writes. For example, +memory-mapped I/O devices will typically be accessed with uncached loads +and stores that are ordered using the I and O bits rather than the R and +W bits. Instruction-set extensions might also describe new I/O +instructions that will also be ordered using the I and O bits in a +FENCE. + +[[fm]] +.Fence mode encoding +|=== +|_fm_ field |Mnemonic |Meaning +|0000 |_none_ |Normal Fence +|1000 |TSO |With FENCE RW,RW: exclude write-to-read ordering; otherwise: _Reserved for future use._ +|_other_ | |_Reserved for future use._ +|=== + +The fence mode field _fm_ defines the semantics of the FENCE. A FENCE +with _fm_=0000 orders all memory operations in its predecessor set +before all memory operations in its successor set. + +The optional FENCE.TSO instruction is encoded as a FENCE instruction +with _fm_=1000, _predecessor_=RW, and _successor_=RW. FENCE.TSO orders +all load operations in its predecessor set before all memory operations +in its successor set, and all store operations in its predecessor set +before all store operations in its successor set. This leaves non-AMO +store operations in the FENCE.TSO’s predecessor set unordered with +non-AMO loads in its successor set. + +The FENCE.TSO encoding was added as an optional extension to the +original base FENCE instruction encoding. The base definition requires +that implementations ignore any set bits and treat the FENCE as global, +and so this is a backwards-compatible extension. + +The unused fields in the FENCE instructions--_rs1_ and _rd_--are reserved +for finer-grain fences in future extensions. For forward compatibility, +base implementations shall ignore these fields, and standard software +shall zero these fields. 
Likewise, many _fm_ and predecessor/successor
+set settings in <> are also reserved for future use.
+Base implementations shall treat all such reserved configurations as
+normal fences with _fm_=0000, and standard software shall use only
+non-reserved configurations.
+
+We chose a relaxed memory model to allow high performance from simple
+machine implementations and from likely future coprocessor or
+accelerator extensions. We separate out I/O ordering from memory R/W
+ordering to avoid unnecessary serialization within a device-driver hart
+and also to support alternative non-memory paths to control added
+coprocessors or I/O devices. Simple implementations may additionally
+ignore the _predecessor_ and _successor_ fields and always execute a
+conservative fence on all operations.
+
+=== Environment Call and Breakpoints
+
+SYSTEM instructions are used to access system functionality that might
+require privileged access and are encoded using the I-type instruction
+format. These can be divided into two main classes: those that
+atomically read-modify-write control and status registers (CSRs), and
+all other potentially privileged instructions. CSR instructions are
+described in <>, and the base
+unprivileged instructions are described in the following section.
+
+
+[TIP]
+====
+The SYSTEM instructions are defined to allow simpler implementations to
+always trap to a single software trap handler. More sophisticated
+implementations might execute more of each system instruction in
+hardware.
+====
+
+include::images/wavedrom/env_call-breakpoint.adoc[]
+[[env-call]]
+.Environment call and breakpoint instructions
+image::image_placeholder.png[]
+
+These two instructions cause a precise requested trap to the supporting
+execution environment.
+
+The ECALL instruction is used to make a service request to the execution
+environment.
The EEI will define how parameters for the service request
+are passed, but usually these will be in defined locations in the
+integer register file.
+
+The EBREAK instruction is used to return control to a debugging
+environment.
+
+ECALL and EBREAK were previously named SCALL and SBREAK. The
+instructions have the same functionality and encoding, but were renamed
+to reflect that they can be used more generally than to call a
+supervisor-level operating system or debugger.
+
+EBREAK was primarily designed to be used by a debugger to cause
+execution to stop and fall back into the debugger. EBREAK is also used
+by the standard gcc compiler to mark code paths that should not be
+executed.
+
+Another use of EBREAK is to support _semihosting_, where the execution
+environment includes a debugger that can provide services over an
+alternate system call interface built around the EBREAK instruction.
+Because the RISC-V base ISAs do not provide more than one EBREAK
+instruction, RISC-V semihosting uses a special sequence of instructions
+to distinguish a semihosting EBREAK from a debugger-inserted EBREAK.
+
+....
+    slli x0, x0, 0x1f # Entry NOP
+    ebreak # Break to debugger
+    srai x0, x0, 7 # NOP encoding the semihosting call number 7
+....
+
+Note that these three instructions must be 32-bit-wide instructions,
+i.e., they must not be among the compressed 16-bit instructions
+described in <>.
+
+The shift NOP instructions are still considered available for use as
+HINTs.
+
+Semihosting is a form of service call and would be more naturally
+encoded as an ECALL using an existing ABI, but this would require the
+debugger to be able to intercept ECALLs, which is a newer addition to
+the debug standard. We intend to move over to using ECALLs with a
+standard ABI, in which case, semihosting can share a service ABI with an
+existing standard.
+
+We note that ARM processors have also moved to using SVC instead of BKPT
+for semihosting calls in newer designs.
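A debugger that wants to recognize this sequence can compare the three 32-bit instruction words against their fixed encodings, which follow directly from the base I-type format. The following C sketch is illustrative; the helper name is not part of any specification.

```c
#include <assert.h>
#include <stdint.h>

/* Fixed 32-bit encodings of the semihosting trap sequence:
   slli x0, x0, 0x1f ; ebreak ; srai x0, x0, 7 */
#define SLLI_X0_X0_31 0x01F01013u
#define EBREAK_INSN   0x00100073u
#define SRAI_X0_X0_7  0x40705013u

/* Return 1 if the three consecutive instruction words at p form the
   RISC-V semihosting sequence, 0 otherwise. */
static int is_semihosting_ebreak(const uint32_t p[3])
{
    return p[0] == SLLI_X0_X0_31 && p[1] == EBREAK_INSN && p[2] == SRAI_X0_X0_7;
}
```

A bare EBREAK not bracketed by the two shift NOPs is treated as an ordinary debugger breakpoint.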
+
+=== HINT Instructions
+
+RV32I reserves a large encoding space for HINT instructions, which are
+usually used to communicate performance hints to the microarchitecture.
+Like the NOP instruction, HINTs do not change any architecturally
+visible state, except for advancing the `pc` and any applicable
+performance counters. Implementations are always allowed to ignore the
+encoded hints.
+
+Most RV32I HINTs are encoded as integer computational instructions with
+_rd_=x0. The other RV32I HINTs are encoded as FENCE instructions with
+a null predecessor or successor set and with _fm_=0.
+
+These HINT encodings have been chosen so that simple implementations can
+ignore HINTs altogether, and instead execute a HINT as a regular
+instruction that happens not to mutate the architectural state. For
+example, ADD is a HINT if the destination register is `x0`; the five-bit
+_rs1_ and _rs2_ fields encode arguments to the HINT. However, a simple
+implementation can simply execute the HINT as an ADD of _rs1_ and _rs2_
+that writes `x0`, which has no architecturally visible effect.
+
+As another example, a FENCE instruction with a zero _pred_ field and a
+zero _fm_ field is a HINT; the _succ_, _rs1_, and _rd_ fields encode the
+arguments to the HINT. A simple implementation can simply execute the
+HINT as a FENCE that orders the null set of prior memory accesses before
+whichever subsequent memory accesses are encoded in the _succ_ field.
+Since the intersection of the predecessor and successor sets is null,
+the instruction imposes no memory orderings, and so it has no
+architecturally visible effect.
+
+<> lists all RV32I HINT code points. 91% of the
+HINT space is reserved for standard HINTs. The remainder of the HINT
+space is designated for custom HINTs: no standard HINTs will ever be
+defined in this subspace.
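The integer-computational HINT test described above amounts to checking the _rd_ field of an OP or OP-IMM instruction. This C sketch is illustrative only; note that it deliberately does not carve out the canonical NOP (`addi x0, x0, 0`), which the standard counts separately from the ADDI HINTs.

```c
#include <assert.h>
#include <stdint.h>

/* Extract the rd field, bits 11:7 of a 32-bit instruction word. */
static uint32_t rd_field(uint32_t insn) { return (insn >> 7) & 0x1F; }

/* An integer computational instruction (OP = 0110011, OP-IMM = 0010011)
   with rd = x0 writes no architectural state, so it executes as a HINT
   (or, for addi x0,x0,0, as the canonical NOP). */
static int is_int_hint(uint32_t insn)
{
    uint32_t opcode = insn & 0x7F;
    return (opcode == 0x33 || opcode == 0x13) && rd_field(insn) == 0;
}
```

For example, `add x0, x1, x2` (`0x00208033`) is a HINT, while `add x3, x1, x2` (`0x002081B3`) is an ordinary ADD.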
+
+[TIP]
+====
+We anticipate standard hints to eventually include memory-system spatial
+and temporal locality hints, branch prediction hints, thread-scheduling
+hints, security tags, and instrumentation flags for
+simulation/emulation.
+====
+
+[[t-rv32i-hints]]
+.RV32I HINT instructions.
+[cols="<,<,^,<",options="header"]
+|===
+|Instruction |Constraints |Code Points |Purpose
+
+|LUI |_rd_=`x0` |latexmath:[$2^{20}$] .18+<.>m|_Reserved for future standard use_
+
+|AUIPC |_rd_=`x0` |latexmath:[$2^{20}$]
+
+|ADDI |_rd_=`x0`, and either _rs1_latexmath:[$\neq$]`x0` or _imm_latexmath:[$\neq$]0 |latexmath:[$2^{17}-1$]
+
+|ANDI |_rd_=`x0` |latexmath:[$2^{17}$]
+
+|ORI |_rd_=`x0` |latexmath:[$2^{17}$]
+
+|XORI |_rd_=`x0` |latexmath:[$2^{17}$]
+
+|ADD |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SUB |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|AND |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|OR |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|XOR |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SLL |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SRL |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SRA |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|FENCE |_rd_=`x0`, _rs1_latexmath:[$\neq$]`x0`, _fm_=0, and either _pred_=0 or _succ_=0 |latexmath:[$2^{10}-63$]
+
+|FENCE |_rd_latexmath:[$\neq$]`x0`, _rs1_=`x0`, _fm_=0, and either _pred_=0 or _succ_=0 |latexmath:[$2^{10}-63$]
+
+|FENCE |_rd_=_rs1_=`x0`, _fm_=0, _pred_=0, _succ_latexmath:[$\neq$]0 |15
+
+|FENCE |_rd_=_rs1_=`x0`, _fm_=0, _pred_latexmath:[$\neq$]W, _succ_=0 |15
+
+|FENCE |_rd_=_rs1_=`x0`, _fm_=0, _pred_=W, _succ_=0 |1 |PAUSE
+
+|SLTI |_rd_=`x0` |latexmath:[$2^{17}$] .7+<.>m|_Designated for custom use_
+
+|SLTIU |_rd_=`x0` |latexmath:[$2^{17}$]
+
+|SLLI |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SRLI |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SRAI |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SLT |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SLTU |_rd_=`x0`
|latexmath:[$2^{10}$]
+|===
diff --git a/src/rv32e.adoc b/src/rv32e.adoc
new file mode 100644
index 0000000..775771f
--- /dev/null
+++ b/src/rv32e.adoc
@@ -0,0 +1,59 @@
+[[rv32e]]
+== RV32E Base Integer Instruction Set, Version 1.9
+
+This chapter describes a draft proposal for the RV32E base integer
+instruction set, which is a reduced version of RV32I designed for
+embedded systems. The only change is to reduce the number of integer
+registers to 16. This chapter only outlines the differences between
+RV32E and RV32I, and so should be read after <>.
+(((RV32E, design)))
+
+[NOTE]
+====
+RV32E was designed to provide an even smaller base core for embedded
+microcontrollers. Although we had mentioned this possibility in version
+2.0 of this document, we initially resisted defining this subset.
+However, given the demand for the smallest possible 32-bit
+microcontroller, and in the interests of preempting fragmentation in
+this space, we have now defined RV32E as a fourth standard base ISA in
+addition to RV32I, RV64I, and RV128I. There is also interest in defining
+an RV64E to reduce context state for highly threaded 64-bit processors.
+====
+
+=== RV32E Programmers’ Model
+
+RV32E reduces the integer register count to 16 general-purpose
+registers (`x0`–`x15`), where `x0` is a dedicated zero register.
+
+[TIP]
+====
+We have found that in the small RV32I core designs, the upper 16
+registers consume around one quarter of the total area of the core
+excluding memories, thus their removal saves around 25% core area with a
+corresponding core power reduction.
+
+This change requires a different calling convention and ABI. In
+particular, RV32E is only used with a soft-float calling convention. A
+new embedded ABI is under consideration that would work across RV32E and
+RV32I.
+====
+
+=== RV32E Instruction Set
+(((RV32E, difference from RV32I)))
+
+RV32E uses the same instruction-set encoding as RV32I, except that only
+registers `x0`–`x15` are provided.
Any future standard extensions will
+not make use of the instruction bits freed up by the reduced
+register-specifier fields and so these are designated for custom
+extensions.
+
+[NOTE]
+====
+RV32E can be combined with all current standard extensions. Defining the
+F, D, and Q extensions as having a 16-entry floating point register file
+when combined with RV32E was considered but decided against. To support
+systems with reduced floating-point register state, we intend to define
+a `Zfinx` extension that makes floating-point computations use the
+integer registers, removing the floating-point loads, stores, and moves
+between floating point and integer registers.
+====
diff --git a/src/rv64.adoc b/src/rv64.adoc
new file mode 100644
index 0000000..748fbc3
--- /dev/null
+++ b/src/rv64.adoc
@@ -0,0 +1,252 @@
+[[rv64]]
+== RV64I Base Integer Instruction Set, Version 2.1
+
+This chapter describes the RV64I base integer instruction set, which
+builds upon the RV32I variant described in <>.
+This chapter presents only the differences with RV32I, so should be read
+in conjunction with the earlier chapter.
+
+=== Register State
+
+RV64I widens the integer registers and supported user address space to
+64 bits (XLEN=64 in <>).
+
+=== Integer Computational Instructions
+
+Most integer computational instructions operate on XLEN-bit values.
+Additional instruction variants are provided to manipulate 32-bit values
+in RV64I, indicated by a `W` suffix to the opcode. These `\*W`
+instructions ignore the upper 32 bits of their inputs and always produce
+32-bit signed values, sign-extending them to 64 bits, i.e., bits XLEN-1
+through 31 are equal.
+
+The compiler and calling convention maintain an invariant that all
+32-bit values are held in a sign-extended format in 64-bit registers.
+Even 32-bit unsigned integers extend bit 31 into bits 63 through 32.
+Consequently, conversion between unsigned and signed 32-bit integers is +a no-op, as is conversion from a signed 32-bit integer to a signed +64-bit integer. Existing 64-bit wide SLTU and unsigned branch compares +still operate correctly on unsigned 32-bit integers under this +invariant. Similarly, existing 64-bit wide logical operations on 32-bit +sign-extended integers preserve the sign-extension property. A few new +instructions (ADD[I]W/SUBW/SxxW) are required for addition and shifts to +ensure reasonable performance for 32-bit values. +(((RV64I, shifts))) +(((RV64I, compares))) + +==== Integer Register-Immediate Instructions + +include::images/wavedrom/rv64i-base-int.adoc[] +[[rv64i-base-int]] +.RV64I register-immediate instructions +image::image_placeholder.png[] + +ADDIW is an RV64I instruction that adds the sign-extended 12-bit +immediate to register _rs1_ and produces the proper sign-extension of a +32-bit result in _rd_. Overflows are ignored and the result is the low +32 bits of the result sign-extended to 64 bits. Note, ADDIW _rd, rs1, 0_ +writes the sign-extension of the lower 32 bits of register _rs1_ into +register _rd_ (assembler pseudoinstruction SEXT.W). +//`the following diagram doesn't match the tex spec` + +include::images/wavedrom/rv64i-addiw.adoc[] +[[rv64i-addiw]] +.RV64I register-immediate (descr ADDIW) instructions +image::image_placeholder.png[] + +Shifts by a constant are encoded as a specialization of the I-type +format using the same instruction opcode as RV32I. The operand to be +shifted is in _rs1_, and the shift amount is encoded in the lower 6 bits +of the I-immediate field for RV64I. The right shift type is encoded in +bit 30. SLLI is a logical left shift (zeros are shifted into the lower +bits); SRLI is a logical right shift (zeros are shifted into the upper +bits); and SRAI is an arithmetic right shift (the original sign bit is +copied into the vacated upper bits). 
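The ADDIW semantics described above (and the SEXT.W idiom, ADDIW with a zero immediate) can be modeled in a few lines of C. This is an illustrative model, not normative pseudocode:

```c
#include <assert.h>
#include <stdint.h>

/* Model of ADDIW: add in 32 bits (overflow ignored), then sign-extend
   the 32-bit result to 64 bits. ADDIW rd, rs1, 0 is SEXT.W. */
static int64_t addiw(int64_t rs1, int32_t imm)
{
    uint32_t sum = (uint32_t)rs1 + (uint32_t)imm;
    return (int64_t)(int32_t)sum;  /* bits 63..31 of the result are equal */
}
```

For example, `addiw(0x00000000FFFFFFFF, 0)` yields `-1`, illustrating the sign-extension invariant for 32-bit values held in 64-bit registers.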
+(((RV64I, SLLI)))
+(((RV64I, SRAIW)))
+(((RV64I, SRLIW)))
+(((RV64I, RV64I-only)))
+
+
+SLLIW, SRLIW, and SRAIW are RV64I-only instructions that are analogously
+defined but operate on 32-bit values and sign-extend their 32-bit
+results to 64 bits. SLLIW, SRLIW, and SRAIW encodings with
+latexmath:[$imm[5] \neq 0$] are reserved.
+
+[NOTE]
+====
+Previously, SLLIW, SRLIW, and SRAIW with latexmath:[$imm[5] \neq 0$]
+were defined to cause illegal instruction exceptions, whereas now they
+are marked as reserved. This is a backwards-compatible change.
+====
+
+include::images/wavedrom/rv64_lui-auipc.adoc[]
+[[rv64_lui-auipc]]
+.RV64I register-immediate (descr) instructions
+image::image_placeholder.png[]
+
+LUI (load upper immediate) uses the same opcode as RV32I. LUI places the
+32-bit U-immediate into register _rd_, filling in the lowest 12 bits
+with zeros. The 32-bit result is sign-extended to 64 bits.
+
+AUIPC (add upper immediate to `pc`) uses the same opcode as RV32I. AUIPC
+is used to build `pc`-relative addresses and uses the U-type format.
+AUIPC forms a 32-bit offset from the U-immediate, filling in the lowest
+12 bits with zeros, sign-extends the result to 64 bits, adds it to the
+address of the AUIPC instruction, then places the result in register
+_rd_.
+
+Note that the set of address offsets that can be formed by pairing LUI
+with LD, AUIPC with JALR, etc. in RV64I is
+[latexmath:[${-}2^{31}{-}2^{11}$], latexmath:[$2^{31}{-}2^{11}{-}1$]].
+
+==== Integer Register-Register Operations
+
+//`this diagram doesn't match the tex specification`
+include::images/wavedrom/rv64i_int-reg-reg.adoc[]
+[[int_reg-reg]]
+.RV64I integer register-register instructions
+image::image_placeholder.png[]
+
+ADDW and SUBW are RV64I-only instructions that are defined analogously
+to ADD and SUB but operate on 32-bit values and produce signed 32-bit
+results.
Overflows are ignored, and the low 32 bits of the result are
+sign-extended to 64 bits and written to the destination register.
+
+SLL, SRL, and SRA perform logical left, logical right, and arithmetic
+right shifts on the value in register _rs1_ by the shift amount held in
+register _rs2_. In RV64I, only the low 6 bits of _rs2_ are considered
+for the shift amount.
+
+SLLW, SRLW, and SRAW are RV64I-only instructions that are analogously
+defined but operate on 32-bit values and sign-extend their 32-bit
+results to 64 bits. The shift amount is given by _rs2[4:0]_.
+
+=== Load and Store Instructions
+
+RV64I extends the address space to 64 bits. The execution environment
+will define what portions of the address space are legal to access.
+
+include::images/wavedrom/load_store.adoc[]
+[[load_store]]
+.Load and store instructions
+image::image_placeholder.png[]
+
+The LD instruction loads a 64-bit value from memory into register _rd_
+for RV64I.
+(((RV64I, LD)))
+
+The LW instruction loads a 32-bit value from memory and sign-extends
+this to 64 bits before storing it in register _rd_ for RV64I. The LWU
+instruction, on the other hand, zero-extends the 32-bit value from
+memory for RV64I. LH and LHU are defined analogously for 16-bit values,
+as are LB and LBU for 8-bit values. The SD, SW, SH, and SB instructions
+store 64-bit, 32-bit, 16-bit, and 8-bit values from the low bits of
+register _rs2_ to memory respectively.
+
+[[sec:rv64i-hints]]
+=== HINT Instructions
+
+All instructions that are microarchitectural HINTs in RV32I (see
+<>) are also HINTs in RV64I.
+The additional computational instructions in RV64I expand both the
+standard and custom HINT encoding spaces.
+(((RV64I, HINT)))
+
+<> lists all RV64I HINT code points. 91% of the
+HINT space is reserved for standard HINTs, but none are presently
+defined. The remainder of the HINT space is designated for custom HINTs;
+no standard HINTs will ever be defined in this subspace.
+
+[[rv64i-hints]]
+.RV64I HINT instructions.
+[cols="<,<,^,<",options="header",]
+|===
+|Instruction |Constraints |Code Points |Purpose
+
+|LUI |_rd_=`x0` |latexmath:[$2^{20}$] .24+<.>m|_Reserved for future standard use_
+
+|AUIPC |_rd_=`x0` |latexmath:[$2^{20}$]
+
+|ADDI |_rd_=`x0`, and either _rs1_latexmath:[$\neq$]`x0` or _imm_latexmath:[$\neq$]0 |latexmath:[$2^{17}-1$]
+
+|ANDI |_rd_=`x0` |latexmath:[$2^{17}$]
+
+|ORI |_rd_=`x0` |latexmath:[$2^{17}$]
+
+|XORI |_rd_=`x0` |latexmath:[$2^{17}$]
+
+|ADDIW |_rd_=`x0` |latexmath:[$2^{17}$]
+
+|ADD |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SUB |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|AND |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|OR |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|XOR |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SLL |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SRL |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SRA |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|ADDW |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SUBW |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SLLW |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SRLW |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SRAW |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|FENCE |_rd_=`x0`, _rs1_latexmath:[$\neq$]`x0`, _fm_=0, and either _pred_=0 or _succ_=0 |latexmath:[$2^{10}-63$]
+
+|FENCE |_rd_latexmath:[$\neq$]`x0`, _rs1_=`x0`, _fm_=0, and either _pred_=0 or _succ_=0 |latexmath:[$2^{10}-63$]
+
+|FENCE |_rd_=_rs1_=`x0`, _fm_=0, _pred_=0, _succ_latexmath:[$\neq$]0 |15
+
+|FENCE |_rd_=_rs1_=`x0`, _fm_=0, _pred_latexmath:[$\neq$]W, _succ_=0 |15
+
+|FENCE |_rd_=_rs1_=`x0`, _fm_=0, _pred_=W, _succ_=0 |1 |PAUSE
+
+|SLTI |_rd_=`x0` |latexmath:[$2^{17}$] .10+<.>m|_Designated for custom use_
+
+|SLTIU |_rd_=`x0` |latexmath:[$2^{17}$]
+
+|SLLI |_rd_=`x0` |latexmath:[$2^{11}$]
+
+|SRLI |_rd_=`x0` |latexmath:[$2^{11}$]
+
+|SRAI |_rd_=`x0` |latexmath:[$2^{11}$]
+
+|SLLIW |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SRLIW |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SRAIW |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SLT |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SLTU |_rd_=`x0` |latexmath:[$2^{10}$]
+|===
diff --git a/src/rvwmo.adoc b/src/rvwmo.adoc
new file mode 100644
index 0000000..e6bc412
--- /dev/null
+++ b/src/rvwmo.adoc
@@ -0,0 +1,831 @@
+[[memorymodel]]
+== RVWMO Memory Consistency Model, Version 2.0
+
+This chapter defines the RISC-V memory consistency model. A memory
+consistency model is a set of rules specifying the values that can be
+returned by loads of memory. RISC-V uses a memory model called `RVWMO`
+(RISC-V Weak Memory Ordering) which is designed to provide flexibility
+for architects to build high-performance scalable designs while
+simultaneously supporting a tractable programming model.
+(((design, high performance)))
+(((design, scalable)))
+
+Under RVWMO, code running on a single hart appears to execute in order
+from the perspective of other memory instructions in the same hart, but
+memory instructions from another hart may observe the memory
+instructions from the first hart being executed in a different order.
+Therefore, multithreaded code may require explicit synchronization to
+guarantee ordering between memory instructions from different harts. The
+base RISC-V ISA provides a FENCE instruction for this purpose, described
+in <>, while the atomics extension `A`
+additionally defines load-reserved/store-conditional and atomic
+read-modify-write instructions.
+(((atomics, misaligned)))
+
+The standard ISA extension for misaligned atomics `Zam`
+(<>) and the standard ISA extension for total
+store ordering `Ztso` (<>) augment RVWMO
+with additional rules specific to those extensions.
+
+The appendices to this specification provide both axiomatic and
+operational formalizations of the memory consistency model as well as
+additional explanatory material.
+(((FENCE)))
+(((SFENCE)))
+
+This chapter defines the memory model for regular main memory
+operations.
The interaction of the memory model with I/O memory, +instruction fetches, FENCE.I, page table walks, and SFENCE.VMA is not +(yet) formalized. Some or all of the above may be formalized in a future +revision of this specification. The RV128 base ISA and future ISA +extensions such as the `V` vector and `J` JIT extensions will need +to be incorporated into a future revision as well. + +Memory consistency models supporting overlapping memory accesses of +different widths simultaneously remain an active area of academic +research and are not yet fully understood. The specifics of how memory +accesses of different sizes interact under RVWMO are specified to the +best of our current abilities, but they are subject to revision should +new issues be uncovered. + +[[rvwmo]] +=== Definition of the RVWMO Memory Model + +The RVWMO memory model is defined in terms of the _global memory order_, +a total ordering of the memory operations produced by all harts. In +general, a multithreaded program has many different possible executions, +with each execution having its own corresponding global memory order. +((RVWMO)) + +The global memory order is defined over the primitive load and store +operations generated by memory instructions. It is then subject to the +constraints defined in the rest of this chapter. Any execution +satisfying all of the memory model constraints is a legal execution (as +far as the memory model is concerned). + +[[rvwmo-primitives]] +==== Memory Model Primitives + +The _program order_ over memory operations reflects the order in which +the instructions that generate each load and store are logically laid +out in that hart’s dynamic instruction stream; i.e., the order in which +a simple in-order processor would execute the instructions of that hart. + +Memory-accessing instructions give rise to _memory operations_. A memory +operation can be either a _load operation_, a _store operation_, or both +simultaneously. 
All memory operations are single-copy atomic: they can
+never be observed in a partially complete state.
+(((operations, memory)))
+
+Among instructions in RV32GC and RV64GC, each aligned memory instruction
+gives rise to exactly one memory operation, with two exceptions. First,
+an unsuccessful SC instruction does not give rise to any memory
+operations. Second, FLD and FSD instructions may each give rise to
+multiple memory operations if XLEN latexmath:[$<$] 64, as stated in
+<> and clarified below. An aligned AMO
+gives rise to a single memory operation that is both a load operation
+and a store operation simultaneously.
+
+Instructions in the RV128 base instruction set and in future ISA
+extensions such as V (vector) and P (SIMD) may give rise to multiple
+memory operations. However, the memory model for these extensions has
+not yet been formalized.
+
+A misaligned load or store instruction may be decomposed into a set of
+component memory operations of any granularity. An FLD or FSD
+instruction for which XLEN latexmath:[$<$] 64 may also be decomposed
+into a set of component memory operations of any granularity. The memory
+operations generated by such instructions are not ordered with respect
+to each other in program order, but they are ordered normally with
+respect to the memory operations generated by preceding and subsequent
+instructions in program order. The atomics extension `A` does not
+require execution environments to support misaligned atomic instructions
+at all; however, if misaligned atomics are supported via the `Zam`
+extension, LRs, SCs, and AMOs may be decomposed subject to the
+constraints of the atomicity axiom for misaligned atomics, which is
+defined in <>.
+(((decomposition)))
+
+The decomposition of misaligned memory operations down to byte
+granularity facilitates emulation on implementations that do not
+natively support misaligned accesses.
Such implementations might, for
+example, simply iterate over the bytes of a misaligned access one by
+one.
+
+An LR instruction and an SC instruction are said to be _paired_ if the
+LR precedes the SC in program order and if there are no other LR or SC
+instructions in between; the corresponding memory operations are said to
+be paired as well (except in case of a failed SC, where no store
+operation is generated). The complete list of conditions determining
+whether an SC must succeed, may succeed, or must fail is defined in
+<>.
+
+Load and store operations may also carry one or more ordering
+annotations from the following set: `acquire-RCpc`, `acquire-RCsc`,
+`release-RCpc`, and `release-RCsc`. An AMO or LR instruction with
+_aq_ set has an `acquire-RCsc` annotation. An AMO or SC instruction
+with _rl_ set has a `release-RCsc` annotation. An AMO, LR, or SC
+instruction with both _aq_ and _rl_ set has both `acquire-RCsc` and
+`release-RCsc` annotations.
+
+For convenience, we use the term `acquire annotation` to refer to an
+acquire-RCpc annotation or an acquire-RCsc annotation. Likewise, a
+`release annotation` refers to a release-RCpc annotation or a
+release-RCsc annotation. An `RCpc annotation` refers to an
+acquire-RCpc annotation or a release-RCpc annotation. An `RCsc
+annotation` refers to an acquire-RCsc annotation or a release-RCsc
+annotation.
+
+In the memory model literature, the term `RCpc` stands for release
+consistency with processor-consistent synchronization operations, and
+the term `RCsc` stands for release consistency with sequentially
+consistent synchronization operations.
+
+While there are many different definitions for acquire and release
+annotations in the literature, in the context of RVWMO these terms are
+concisely and completely defined by Preserved Program Order rules
+<>.
+
+`RCpc` annotations are currently only used when implicitly assigned to
+every memory access per the standard extension `Ztso`
+(<>).
Furthermore, although the ISA does not +currently contain native load-acquire or store-release instructions, nor +RCpc variants thereof, the RVWMO model itself is designed to be +forwards-compatible with the potential addition of any or all of the +above into the ISA in a future extension. + +[[mem-dependencies]] +==== Syntactic Dependencies + +The definition of the RVWMO memory model depends in part on the notion +of a syntactic dependency, defined as follows. + +In the context of defining dependencies, a `register` refers either to +an entire general-purpose register, some portion of a CSR, or an entire +CSR. The granularity at which dependencies are tracked through CSRs is +specific to each CSR and is defined in +<>. + +Syntactic dependencies are defined in terms of instructions’ _source +registers_, instructions’ _destination registers_, and the way +instructions _carry a dependency_ from their source registers to their +destination registers. This section provides a general definition of all +of these terms; however, <> provides a +complete listing of the specifics for each instruction. + +In general, a register latexmath:[$r$] other than `x0` is a _source +register_ for an instruction latexmath:[$i$] if any of the following +hold: + +* In the opcode of latexmath:[$i$], _rs1_, _rs2_, or _rs3_ is set to +latexmath:[$r$] +* latexmath:[$i$] is a CSR instruction, and in the opcode of +latexmath:[$i$], _csr_ is set to latexmath:[$r$], unless latexmath:[$i$] +is CSRRW or CSRRWI and _rd_ is set to `x0` +* latexmath:[$r$] is a CSR and an implicit source register for +latexmath:[$i$], as defined in <> +* latexmath:[$r$] is a CSR that aliases with another source register for +latexmath:[$i$] + +Memory instructions also further specify which source registers are +_address source registers_ and which are _data source registers_. 
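As a simplified illustration of how source and destination registers are read off an instruction word, the following C sketch extracts the _rd_, _rs1_, and _rs2_ fields of two base-format instructions and tests for a register-carried dependency. It assumes both instructions actually have _rs1_/_rs2_ fields, and it ignores intervening writes, CSRs, and implicit registers; `gpr_dependency` is an illustrative name, not part of the model.

```c
#include <assert.h>
#include <stdint.h>

/* Field positions in the base 32-bit instruction formats. */
static uint32_t rd_of (uint32_t insn) { return (insn >>  7) & 0x1F; }
static uint32_t rs1_of(uint32_t insn) { return (insn >> 15) & 0x1F; }
static uint32_t rs2_of(uint32_t insn) { return (insn >> 20) & 0x1F; }

/* j depends syntactically on i via a general-purpose register if some
   source register of j equals the destination register of i, and that
   register is not x0 (x0 never carries a dependency). */
static int gpr_dependency(uint32_t i, uint32_t j)
{
    uint32_t d = rd_of(i);
    return d != 0 && (rs1_of(j) == d || rs2_of(j) == d);
}
```

For example, `add x4, x3, x0` has a syntactic dependency on a preceding `add x3, x1, x2` via `x3`, but nothing can depend on an instruction whose destination is `x0`.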
+
+In general, a register latexmath:[$r$] other than `x0` is a _destination
+register_ for an instruction latexmath:[$i$] if any of the following
+hold:
+
+* In the opcode of latexmath:[$i$], _rd_ is set to latexmath:[$r$]
+* latexmath:[$i$] is a CSR instruction, and in the opcode of
+latexmath:[$i$], _csr_ is set to latexmath:[$r$], unless latexmath:[$i$]
+is CSRRS or CSRRC and _rs1_ is set to `x0` or latexmath:[$i$] is CSRRSI
+or CSRRCI and uimm[4:0] is set to zero.
+* latexmath:[$r$] is a CSR and an implicit destination register for
+latexmath:[$i$], as defined in <>
+* latexmath:[$r$] is a CSR that aliases with another destination
+register for latexmath:[$i$]
+
+Most non-memory instructions _carry a dependency_ from each of their
+source registers to each of their destination registers. However, there
+are exceptions to this rule; see <<source-dest-regs>>.
+
+Instruction latexmath:[$j$] has a _syntactic dependency_ on instruction
+latexmath:[$i$] via destination register latexmath:[$s$] of
+latexmath:[$i$] and source register latexmath:[$r$] of latexmath:[$j$]
+if either of the following hold:
+
+* latexmath:[$s$] is the same as latexmath:[$r$], and no instruction
+program-ordered between latexmath:[$i$] and latexmath:[$j$] has
+latexmath:[$r$] as a destination register
+* There is an instruction latexmath:[$m$] program-ordered between
+latexmath:[$i$] and latexmath:[$j$] such that all of the following hold:
+. latexmath:[$j$] has a syntactic dependency on latexmath:[$m$] via
+destination register latexmath:[$q$] and source register latexmath:[$r$]
+. latexmath:[$m$] has a syntactic dependency on latexmath:[$i$] via
+destination register latexmath:[$s$] and source register latexmath:[$p$]
+.
latexmath:[$m$] carries a dependency from latexmath:[$p$] to +latexmath:[$q$] + +Finally, in the definitions that follow, let latexmath:[$a$] and +latexmath:[$b$] be two memory operations, and let latexmath:[$i$] and +latexmath:[$j$] be the instructions that generate latexmath:[$a$] and +latexmath:[$b$], respectively. + +latexmath:[$b$] has a _syntactic address dependency_ on latexmath:[$a$] +if latexmath:[$r$] is an address source register for latexmath:[$j$] and +latexmath:[$j$] has a syntactic dependency on latexmath:[$i$] via source +register latexmath:[$r$] + +latexmath:[$b$] has a _syntactic data dependency_ on latexmath:[$a$] if +latexmath:[$b$] is a store operation, latexmath:[$r$] is a data source +register for latexmath:[$j$], and latexmath:[$j$] has a syntactic +dependency on latexmath:[$i$] via source register latexmath:[$r$] + +latexmath:[$b$] has a _syntactic control dependency_ on latexmath:[$a$] +if there is an instruction latexmath:[$m$] program-ordered between +latexmath:[$i$] and latexmath:[$j$] such that latexmath:[$m$] is a +branch or indirect jump and latexmath:[$m$] has a syntactic dependency +on latexmath:[$i$]. + +Generally speaking, non-AMO load instructions do not have data source +registers, and unconditional non-AMO store instructions do not have +destination registers. However, a successful SC instruction is +considered to have the register specified in _rd_ as a destination +register, and hence it is possible for an instruction to have a +syntactic dependency on a successful SC instruction that precedes it in +program order. + +==== Preserved Program Order + +The global memory order for any given execution of a program respects +some but not all of each hart’s program order. The subset of program +order that must be respected by the global memory order is known as +_preserved program order_. 
+
+The complete definition of preserved program order is as follows (and
+note that AMOs are simultaneously both loads and stores): memory
+operation latexmath:[$a$] precedes memory operation latexmath:[$b$] in
+preserved program order (and hence also in the global memory order) if
+latexmath:[$a$] precedes latexmath:[$b$] in program order,
+latexmath:[$a$] and latexmath:[$b$] both access regular main memory
+(rather than I/O regions), and any of the following hold:
+
+[[overlapping-ordering]]
+* Overlapping-Address Orderings:
+. latexmath:[$b$] is a store, and
+latexmath:[$a$] and latexmath:[$b$] access overlapping memory addresses
+. [[ppo:rdw]]latexmath:[$a$] and latexmath:[$b$] are loads,
+latexmath:[$x$] is a byte read by both latexmath:[$a$] and
+latexmath:[$b$], there is no store to latexmath:[$x$] between
+latexmath:[$a$] and latexmath:[$b$] in program order, and
+latexmath:[$a$] and latexmath:[$b$] return values for latexmath:[$x$]
+written by different memory operations
+. [[ppo:amoforward]]latexmath:[$a$] is
+generated by an AMO or SC instruction, latexmath:[$b$] is a load, and
+latexmath:[$b$] returns a value written by latexmath:[$a$]
+* Explicit Synchronization
+. There is a FENCE instruction that
+orders latexmath:[$a$] before latexmath:[$b$]
+. latexmath:[$a$] has an acquire
+annotation
+. latexmath:[$b$] has a release annotation
+. latexmath:[$a$] and latexmath:[$b$] both have
+RCsc annotations
+. latexmath:[$a$] is paired with
+latexmath:[$b$]
+* Syntactic Dependencies
+. latexmath:[$b$] has a syntactic address
+dependency on latexmath:[$a$]
+. latexmath:[$b$] has a syntactic data
+dependency on latexmath:[$a$]
+. latexmath:[$b$] is a store, and
+latexmath:[$b$] has a syntactic control dependency on latexmath:[$a$]
+* Pipeline Dependencies
+. latexmath:[$b$] is a
+load, and there exists some store latexmath:[$m$] between
+latexmath:[$a$] and latexmath:[$b$] in program order such that
+latexmath:[$m$] has an address or data dependency on latexmath:[$a$],
+and latexmath:[$b$] returns a value written by latexmath:[$m$]
+. latexmath:[$b$] is a store, and
+there exists some instruction latexmath:[$m$] between latexmath:[$a$]
+and latexmath:[$b$] in program order such that latexmath:[$m$] has an
+address dependency on latexmath:[$a$]
+
+==== Memory Model Axioms
+
+An execution of a RISC-V program obeys the RVWMO memory consistency
+model only if there exists a global memory order conforming to preserved
+program order and satisfying the _load value axiom_, the _atomicity
+axiom_, and the _progress axiom_.
+
+[[ax-load]]
+===== Load Value Axiom
+
+Each byte of each load latexmath:[$i$] returns the value written to that
+byte by the store that is the latest in global memory order among the
+following stores:
+
+. Stores that write that byte and that precede latexmath:[$i$] in the
+global memory order
+. Stores that write that byte and that precede latexmath:[$i$] in
+program order
+
+[[ax-atom]]
+===== Atomicity Axiom
+
+If latexmath:[$r$] and latexmath:[$w$] are paired load and store
+operations generated by aligned LR and SC instructions in a hart
+latexmath:[$h$], latexmath:[$s$] is a store to byte latexmath:[$x$], and
+latexmath:[$r$] returns a value written by latexmath:[$s$], then
+latexmath:[$s$] must precede latexmath:[$w$] in the global memory order,
+and there can be no store from a hart other than latexmath:[$h$] to byte
+latexmath:[$x$] following latexmath:[$s$] and preceding latexmath:[$w$]
+in the global memory order.
+
+The atomicity axiom theoretically supports LR/SC pairs of different
+widths and to mismatched addresses, since implementations are permitted
+to allow SC operations to succeed in such cases. However, in practice,
+we expect such patterns to be rare, and their use is discouraged.
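The load value axiom lends itself to a small executable illustration. The sketch below is not part of the specification; the data layout and function name are invented for this example. For one byte, it returns the value of the store that is latest in a hypothetical global memory order among stores that precede the load in that order, or that precede it in same-hart program order (the second clause capturing store-buffer forwarding):

```python
# Illustrative model of the RVWMO load value axiom (not normative).
# Each store is a dict with a global-memory-order index "gmo", a per-hart
# program-order index "po", a hart id, a byte address, and a value.

def load_value(stores, h, po, gmo, addr):
    """Value read for byte `addr` by a load on hart `h` at program-order
    index `po` and global-memory-order index `gmo`."""
    candidates = [s for s in stores
                  if s["addr"] == addr
                  and (s["gmo"] < gmo                       # globally before the load
                       or (s["hart"] == h and s["po"] < po))]  # forwarded same-hart store
    if not candidates:
        return None  # no store has written this byte
    # The axiom picks the latest such store in the global memory order.
    return max(candidates, key=lambda s: s["gmo"])["value"]

stores = [
    {"gmo": 5, "po": 1, "hart": 0, "addr": 0x100, "value": 7},  # hart 0's buffered store
    {"gmo": 0, "po": 0, "hart": 1, "addr": 0x100, "value": 3},  # globally visible store
]
# Hart 0's load reads its own store early via program order, even though that
# store is later in the global memory order:
print(load_value(stores, h=0, po=2, gmo=2, addr=0x100))  # prints 7
# Hart 1's load sees only the globally visible store:
print(load_value(stores, h=1, po=1, gmo=3, addr=0x100))  # prints 3
```

The second call shows why the program-order clause matters only within a hart: hart 1 cannot observe hart 0's buffered store before it reaches the global memory order.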
+
+[[ax-prog]]
+===== Progress Axiom
+
+No memory operation may be preceded in the global memory order by an
+infinite sequence of other memory operations.
+
+[[sec:csr-granularity]]
+=== CSR Dependency Tracking Granularity
+
+.Granularities at which syntactic dependencies are tracked through CSRs
+[cols="<,<,<",options="header",]
+|===
+|Name |Portions Tracked as Independent Units |Aliases
+|`fflags` |Bits 4, 3, 2, 1, 0 |`fcsr`
+|`frm` |entire CSR |`fcsr`
+|`fcsr` |Bits 7-5, 4, 3, 2, 1, 0 |`fflags`, `frm`
+|===
+
+Note: read-only CSRs are not listed, as they do not participate in the
+definition of syntactic dependencies.
+
+[[sec:source-dest-regs]]
+=== Source and Destination Register Listings
+
+This section provides a concrete listing of the source and destination
+registers for each instruction. These listings are used in the
+definition of syntactic dependencies in
+<<mem-dependencies>>.
+
+The term `accumulating CSR` is used to describe a CSR that is both a
+source and a destination register, but which carries a dependency only
+from itself to itself.
+
+Instructions carry a dependency from each source register in the
+`Source Registers` column to each destination register in the
+`Destination Registers` column, from each source register in the
+`Source Registers` column to each CSR in the `Accumulating CSRs`
+column, and from each CSR in the `Accumulating CSRs` column to itself,
+except where annotated otherwise.
+ +Key: + +latexmath:[$^A$]Address source register + +latexmath:[$^D$]Data source register + +latexmath:[$^\dagger$]The instruction does not carry a dependency from +any source register to any destination register + +latexmath:[$^\ddagger$]The instruction carries dependencies from source +register(s) to destination register(s) as specified + +[cols="<,<,<,<,<",] +|=== +|*RV32I Base Integer Instruction Set* | | | | +| |Source |Destination |Accumulating | +| |Registers |Registers |CSRs | +|LUI | |_rd_ | | +|AUIPC | |_rd_ | | +|JAL | |_rd_ | | +|JALRlatexmath:[$^\dagger$] |_rs1_ |_rd_ | | +|BEQ |_rs1_, _rs2_ | | | +|BNE |_rs1_, _rs2_ | | | +|BLT |_rs1_, _rs2_ | | | +|BGE |_rs1_, _rs2_ | | | +|BLTU |_rs1_, _rs2_ | | | +|BGEU |_rs1_, _rs2_ | | | +|LBlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$] |_rd_ | | +|LHlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$] |_rd_ | | +|LWlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$] |_rd_ | | +|LBUlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$] |_rd_ | | +|LHUlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$] |_rd_ | | +|SB |_rs1_latexmath:[$^A$], _rs2_latexmath:[$^D$] | | | +|SH |_rs1_latexmath:[$^A$], _rs2_latexmath:[$^D$] | | | +|SW |_rs1_latexmath:[$^A$], _rs2_latexmath:[$^D$] | | | +|ADDI |_rs1_ |_rd_ | | +|SLTI |_rs1_ |_rd_ | | +|SLTIU |_rs1_ |_rd_ | | +|XORI |_rs1_ |_rd_ | | +|ORI |_rs1_ |_rd_ | | +|ANDI |_rs1_ |_rd_ | | +|SLLI |_rs1_ |_rd_ | | +|SRLI |_rs1_ |_rd_ | | +|SRAI |_rs1_ |_rd_ | | +|ADD |_rs1_, _rs2_ |_rd_ | | +|SUB |_rs1_, _rs2_ |_rd_ | | +|SLL |_rs1_, _rs2_ |_rd_ | | +|SLT |_rs1_, _rs2_ |_rd_ | | +|SLTU |_rs1_, _rs2_ |_rd_ | | +|XOR |_rs1_, _rs2_ |_rd_ | | +|SRL |_rs1_, _rs2_ |_rd_ | | +|SRA |_rs1_, _rs2_ |_rd_ | | +|OR |_rs1_, _rs2_ |_rd_ | | +|AND |_rs1_, _rs2_ |_rd_ | | +|FENCE | | | | +|FENCE.I | | | | +|ECALL | | | | +|EBREAK | | | | +| | | | | +|=== + +[cols="<,<,<,<,<",] +|=== +|*RV32I Base Integer Instruction Set (continued)* | | | | + +| |Source |Destination |Accumulating | + +| |Registers |Registers |CSRs | + 
+|CSRRWlatexmath:[$^\ddagger$] |_rs1_, _csr_latexmath:[$^*$] |_rd_, _csr_ +| |latexmath:[$^*$]unless _rd_=`x0` + +|CSRRSlatexmath:[$^\ddagger$] |_rs1_, _csr_ |_rd_latexmath:[$^*$], _csr_ +| |latexmath:[$^*$]unless _rs1_=`x0` + +|CSRRClatexmath:[$^\ddagger$] |_rs1_, _csr_ |_rd_latexmath:[$^*$], _csr_ +| |latexmath:[$^*$]unless _rs1_=`x0` + +| |latexmath:[$\ddagger$]carries a dependency from _rs1_ to _csr_ and +from _csr_ to _rd_ | | | +|=== + +[cols="<,<,<,<,<",] +|=== +|*RV32I Base Integer Instruction Set (continued)* | | | | + +| |Source |Destination |Accumulating | + +| |Registers |Registers |CSRs | + +|CSRRWIlatexmath:[$^\ddagger$] |_csr_latexmath:[$^*$] |_rd_, _csr_ | +|latexmath:[$^*$]unless _rd_=`x0` + +|CSRRSIlatexmath:[$^\ddagger$] |_csr_ |_rd_, _csr_latexmath:[$^*$] | +|latexmath:[$^*$]unless uimm[4:0]=0 + +|CSRRCIlatexmath:[$^\ddagger$] |_csr_ |_rd_, _csr_latexmath:[$^*$] | +|latexmath:[$^*$]unless uimm[4:0]=0 + +| |latexmath:[$\ddagger$]carries a dependency from _csr_ to _rd_ | | | +|=== + +[cols="<,<,<,<,<",] +|=== +|*RV64I Base Integer Instruction Set* | | | | +| |Source |Destination |Accumulating | +| |Registers |Registers |CSRs | +|LWUlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$] |_rd_ | | +|LDlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$] |_rd_ | | +|SD |_rs1_latexmath:[$^A$], _rs2_latexmath:[$^D$] | | | +|SLLI |_rs1_ |_rd_ | | +|SRLI |_rs1_ |_rd_ | | +|SRAI |_rs1_ |_rd_ | | +|ADDIW |_rs1_ |_rd_ | | +|SLLIW |_rs1_ |_rd_ | | +|SRLIW |_rs1_ |_rd_ | | +|SRAIW |_rs1_ |_rd_ | | +|ADDW |_rs1_, _rs2_ |_rd_ | | +|SUBW |_rs1_, _rs2_ |_rd_ | | +|SLLW |_rs1_, _rs2_ |_rd_ | | +|SRLW |_rs1_, _rs2_ |_rd_ | | +|SRAW |_rs1_, _rs2_ |_rd_ | | +| | | | | +|=== + +[cols="<,<,<,<,<",] +|=== +|*RV32M Standard Extension* | | | | +| |Source |Destination |Accumulating | +| |Registers |Registers |CSRs | +|MUL |_rs1_, _rs2_ |_rd_ | | +|MULH |_rs1_, _rs2_ |_rd_ | | +|MULHSU |_rs1_, _rs2_ |_rd_ | | +|MULHU |_rs1_, _rs2_ |_rd_ | | +|DIV |_rs1_, _rs2_ |_rd_ | | +|DIVU |_rs1_, 
_rs2_ |_rd_ | | +|REM |_rs1_, _rs2_ |_rd_ | | +|REMU |_rs1_, _rs2_ |_rd_ | | +| | | | | +|=== + +[cols="<,<,<,<,<",] +|=== +|*RV64M Standard Extension* | | | | +| |Source |Destination |Accumulating | +| |Registers |Registers |CSRs | +|MULW |_rs1_, _rs2_ |_rd_ | | +|DIVW |_rs1_, _rs2_ |_rd_ | | +|DIVUW |_rs1_, _rs2_ |_rd_ | | +|REMW |_rs1_, _rs2_ |_rd_ | | +|REMUW |_rs1_, _rs2_ |_rd_ | | +| | | | | +|=== + +[cols="<,<,<,<,<",] +|=== +|*RV32A Standard Extension* | | | | + +| |Source |Destination |Accumulating | + +| |Registers |Registers |CSRs | + +|LR.Wlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$] |_rd_ | | + +|SC.Wlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$], +_rs2_latexmath:[$^D$] |_rd_latexmath:[$^*$] | |latexmath:[$^*$]if +successful + +|AMOSWAP.Wlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$], +_rs2_latexmath:[$^D$] |_rd_ | | + +|AMOADD.Wlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$], +_rs2_latexmath:[$^D$] |_rd_ | | + +|AMOXOR.Wlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$], +_rs2_latexmath:[$^D$] |_rd_ | | + +|AMOAND.Wlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$], +_rs2_latexmath:[$^D$] |_rd_ | | + +|AMOOR.Wlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$], +_rs2_latexmath:[$^D$] |_rd_ | | + +|AMOMIN.Wlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$], +_rs2_latexmath:[$^D$] |_rd_ | | + +|AMOMAX.Wlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$], +_rs2_latexmath:[$^D$] |_rd_ | | + +|AMOMINU.Wlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$], +_rs2_latexmath:[$^D$] |_rd_ | | + +|AMOMAXU.Wlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$], +_rs2_latexmath:[$^D$] |_rd_ | | + +| | | | | +|=== + +[cols="<,<,<,<,<",] +|=== +|*RV64A Standard Extension* | | | | + +| |Source |Destination |Accumulating | + +| |Registers |Registers |CSRs | + +|LR.Dlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$] |_rd_ | | + +|SC.Dlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$], +_rs2_latexmath:[$^D$] |_rd_latexmath:[$^*$] | |latexmath:[$^*$]if +successful + +|AMOSWAP.Dlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$], 
+_rs2_latexmath:[$^D$] |_rd_ | | + +|AMOADD.Dlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$], +_rs2_latexmath:[$^D$] |_rd_ | | + +|AMOXOR.Dlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$], +_rs2_latexmath:[$^D$] |_rd_ | | + +|AMOAND.Dlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$], +_rs2_latexmath:[$^D$] |_rd_ | | + +|AMOOR.Dlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$], +_rs2_latexmath:[$^D$] |_rd_ | | + +|AMOMIN.Dlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$], +_rs2_latexmath:[$^D$] |_rd_ | | + +|AMOMAX.Dlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$], +_rs2_latexmath:[$^D$] |_rd_ | | + +|AMOMINU.Dlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$], +_rs2_latexmath:[$^D$] |_rd_ | | + +|AMOMAXU.Dlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$], +_rs2_latexmath:[$^D$] |_rd_ | | + +| | | | | +|=== + +[cols="<,<,<,<,<",] +|=== +|*RV32F Standard Extension* | | | | + +| |Source |Destination |Accumulating | + +| |Registers |Registers |CSRs | + +|FLWlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$] |_rd_ | | + +|FSW |_rs1_latexmath:[$^A$], _rs2_latexmath:[$^D$] | | | + +|FMADD.S |_rs1_, _rs2_, _rs3_, frmlatexmath:[$^*$] |_rd_ |NV, OF, UF, NX +|latexmath:[$^*$]if rm=111 + +|FMSUB.S |_rs1_, _rs2_, _rs3_, frmlatexmath:[$^*$] |_rd_ |NV, OF, UF, NX +|latexmath:[$^*$]if rm=111 + +|FNMSUB.S |_rs1_, _rs2_, _rs3_, frmlatexmath:[$^*$] |_rd_ |NV, OF, UF, +NX |latexmath:[$^*$]if rm=111 + +|FNMADD.S |_rs1_, _rs2_, _rs3_, frmlatexmath:[$^*$] |_rd_ |NV, OF, UF, +NX |latexmath:[$^*$]if rm=111 + +|FADD.S |_rs1_, _rs2_, frmlatexmath:[$^*$] |_rd_ |NV, OF, NX +|latexmath:[$^*$]if rm=111 + +|FSUB.S |_rs1_, _rs2_, frmlatexmath:[$^*$] |_rd_ |NV, OF, NX +|latexmath:[$^*$]if rm=111 + +|FMUL.S |_rs1_, _rs2_, frmlatexmath:[$^*$] |_rd_ |NV, OF, UF, NX +|latexmath:[$^*$]if rm=111 + +|FDIV.S |_rs1_, _rs2_, frmlatexmath:[$^*$] |_rd_ |NV, DZ, OF, UF, NX +|latexmath:[$^*$]if rm=111 + +|FSQRT.S |_rs1_, frmlatexmath:[$^*$] |_rd_ |NV, NX |latexmath:[$^*$]if +rm=111 + +|FSGNJ.S |_rs1_, _rs2_ |_rd_ | | + +|FSGNJN.S |_rs1_, _rs2_ 
|_rd_ | | + +|FSGNJX.S |_rs1_, _rs2_ |_rd_ | | + +|FMIN.S |_rs1_, _rs2_ |_rd_ |NV | + +|FMAX.S |_rs1_, _rs2_ |_rd_ |NV | + +|FCVT.W.S |_rs1_, frmlatexmath:[$^*$] |_rd_ |NV, NX |latexmath:[$^*$]if +rm=111 + +|FCVT.WU.S |_rs1_, frmlatexmath:[$^*$] |_rd_ |NV, NX |latexmath:[$^*$]if +rm=111 + +|FMV.X.W |_rs1_ |_rd_ | | + +|FEQ.S |_rs1_, _rs2_ |_rd_ |NV | + +|FLT.S |_rs1_, _rs2_ |_rd_ |NV | + +|FLE.S |_rs1_, _rs2_ |_rd_ |NV | + +|FCLASS.S |_rs1_ |_rd_ | | + +|FCVT.S.W |_rs1_, frmlatexmath:[$^*$] |_rd_ |NX |latexmath:[$^*$]if +rm=111 + +|FCVT.S.WU |_rs1_, frmlatexmath:[$^*$] |_rd_ |NX |latexmath:[$^*$]if +rm=111 + +|FMV.W.X |_rs1_ |_rd_ | | + +| | | | | +|=== + +[cols="<,<,<,<,<",] +|=== +|*RV64F Standard Extension* | | | | + +| |Source |Destination |Accumulating | + +| |Registers |Registers |CSRs | + +|FCVT.L.S |_rs1_, frmlatexmath:[$^*$] |_rd_ |NV, NX |latexmath:[$^*$]if +rm=111 + +|FCVT.LU.S |_rs1_, frmlatexmath:[$^*$] |_rd_ |NV, NX |latexmath:[$^*$]if +rm=111 + +|FCVT.S.L |_rs1_, frmlatexmath:[$^*$] |_rd_ |NX |latexmath:[$^*$]if +rm=111 + +|FCVT.S.LU |_rs1_, frmlatexmath:[$^*$] |_rd_ |NX |latexmath:[$^*$]if +rm=111 + +| | | | | +|=== + +[cols="<,<,<,<,<",] +|=== +|*RV32D Standard Extension* | | | | + +| |Source |Destination |Accumulating | + +| |Registers |Registers |CSRs | + +|FLDlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$] |_rd_ | | + +|FSD |_rs1_latexmath:[$^A$], _rs2_latexmath:[$^D$] | | | + +|FMADD.D |_rs1_, _rs2_, _rs3_, frmlatexmath:[$^*$] |_rd_ |NV, OF, UF, NX +|latexmath:[$^*$]if rm=111 + +|FMSUB.D |_rs1_, _rs2_, _rs3_, frmlatexmath:[$^*$] |_rd_ |NV, OF, UF, NX +|latexmath:[$^*$]if rm=111 + +|FNMSUB.D |_rs1_, _rs2_, _rs3_, frmlatexmath:[$^*$] |_rd_ |NV, OF, UF, +NX |latexmath:[$^*$]if rm=111 + +|FNMADD.D |_rs1_, _rs2_, _rs3_, frmlatexmath:[$^*$] |_rd_ |NV, OF, UF, +NX |latexmath:[$^*$]if rm=111 + +|FADD.D |_rs1_, _rs2_, frmlatexmath:[$^*$] |_rd_ |NV, OF, NX +|latexmath:[$^*$]if rm=111 + +|FSUB.D |_rs1_, _rs2_, frmlatexmath:[$^*$] |_rd_ |NV, OF, NX 
+|latexmath:[$^*$]if rm=111 + +|FMUL.D |_rs1_, _rs2_, frmlatexmath:[$^*$] |_rd_ |NV, OF, UF, NX +|latexmath:[$^*$]if rm=111 + +|FDIV.D |_rs1_, _rs2_, frmlatexmath:[$^*$] |_rd_ |NV, DZ, OF, UF, NX +|latexmath:[$^*$]if rm=111 + +|FSQRT.D |_rs1_, frmlatexmath:[$^*$] |_rd_ |NV, NX |latexmath:[$^*$]if +rm=111 + +|FSGNJ.D |_rs1_, _rs2_ |_rd_ | | + +|FSGNJN.D |_rs1_, _rs2_ |_rd_ | | + +|FSGNJX.D |_rs1_, _rs2_ |_rd_ | | + +|FMIN.D |_rs1_, _rs2_ |_rd_ |NV | + +|FMAX.D |_rs1_, _rs2_ |_rd_ |NV | + +|FCVT.S.D |_rs1_, frmlatexmath:[$^*$] |_rd_ |NV, OF, UF, NX +|latexmath:[$^*$]if rm=111 + +|FCVT.D.S |_rs1_ |_rd_ |NV | + +|FEQ.D |_rs1_, _rs2_ |_rd_ |NV | + +|FLT.D |_rs1_, _rs2_ |_rd_ |NV | + +|FLE.D |_rs1_, _rs2_ |_rd_ |NV | + +|FCLASS.D |_rs1_ |_rd_ | | + +|FCVT.W.D |_rs1_, frmlatexmath:[$^*$] |_rd_ |NV, NX |latexmath:[$^*$]if +rm=111 + +|FCVT.WU.D |_rs1_, frmlatexmath:[$^*$] |_rd_ |NV, NX |latexmath:[$^*$]if +rm=111 + +|FCVT.D.W |_rs1_ |_rd_ | | + +|FCVT.D.WU |_rs1_ |_rd_ | | + +| | | | | +|=== + +[cols="<,<,<,<,<",] +|=== +|*RV64D Standard Extension* | | | | + +| |Source |Destination |Accumulating | + +| |Registers |Registers |CSRs | + +|FCVT.L.D |_rs1_, frmlatexmath:[$^*$] |_rd_ |NV, NX |latexmath:[$^*$]if +rm=111 + +|FCVT.LU.D |_rs1_, frmlatexmath:[$^*$] |_rd_ |NV, NX |latexmath:[$^*$]if +rm=111 + +|FMV.X.D |_rs1_ |_rd_ | | + +|FCVT.D.L |_rs1_, frmlatexmath:[$^*$] |_rd_ |NX |latexmath:[$^*$]if +rm=111 + +|FCVT.D.LU |_rs1_, frmlatexmath:[$^*$] |_rd_ |NX |latexmath:[$^*$]if +rm=111 + +|FMV.D.X |_rs1_ |_rd_ | | + +| | | | | +|=== + diff --git a/src/test.adoc b/src/test.adoc new file mode 100644 index 0000000..10992eb --- /dev/null +++ b/src/test.adoc @@ -0,0 +1,47 @@ +[appendix] +== Test + + +//include::images/graphviz/litmus_sample.dot[] +//[[litmus-sample]] +//.A sample litmus test and one forbidden execution (a0=1). +//image::image_placeholder.png[] + + +[graphviz,test-diagram,svg] +.... 
+digraph G { + +splines=spline; +pad="0.000000"; + + +/* the unlocked events */ +eiid0 [label="a: Ry=1", shape="none", fontsize=8, pos="1.000000,1.125000!", fixedsize="false", height="0.111111", width="0.555556"]; +eiid1 [label="c: Wx=t", shape="none", fontsize=8, pos="1.000000,0.562500!", fixedsize="false", height="0.111111", width="0.555556"]; +eiid2 [label="d: Rx=t", shape="none", fontsize=8, pos="2.500000,1.125000!", fixedsize="false", height="0.111111", width="0.555556"]; +eiid3 [label="e: Rt=$v$", shape="none", fontsize=8, pos="2.500000,0.562500!", fixedsize="false", height="0.111111", width="0.555556"]; +eiid4 [label="f: Wy=1", shape="none", fontsize=8, pos="2.500000,0.000000!", fixedsize="false", height="0.111111", width="0.555556"]; + +/* the intra_causality_data edges */ + + +/* the intra_causality_control edges */ + +/* the poi edges */ +/* the rfmap edges */ + + +/* The viewed-before edges */ +eiid0 -> eiid1 [label=<fenceppo>, color="darkgreen:indigo", fontsize=11, penwidth="3.000000", arrowsize="0.666700"]; +eiid1 -> eiid2 [label=<rf>, color="red", fontsize=11, penwidth="3.000000", arrowsize="0.666700"]; +eiid2 -> eiid3 [label=<addrppo>, color="indigo", fontsize=11, penwidth="3.000000", arrowsize="0.666700"]; +eiid2 -> eiid4 [label=<ppo>, color="indigo", fontsize=11, penwidth="3.000000", arrowsize="0.666700"]; +eiid3 -> eiid4 [label=<po>, color="black", fontsize=11, penwidth="3.000000", arrowsize="0.666700"]; +eiid4 -> eiid0 [label=<rf>, color="red", fontsize=11, penwidth="3.000000", arrowsize="0.666700"]; +} +.... + + +para + diff --git a/src/v-st-ext.adoc b/src/v-st-ext.adoc new file mode 100644 index 0000000..ab637ae --- /dev/null +++ b/src/v-st-ext.adoc @@ -0,0 +1,11 @@ +[[vector]] +== `V` Standard Extension for Vector Operations, Version 0.7 + +The current working group draft is hosted at +` https://github.com/riscv/riscv-v-spec`. 
+ +The base vector extension is intended to provide general support for +data-parallel execution within the 32-bit instruction encoding space, +with later vector extensions supporting richer functionality for certain +domains. + diff --git a/src/zam-st-ext.adoc b/src/zam-st-ext.adoc new file mode 100644 index 0000000..4468e31 --- /dev/null +++ b/src/zam-st-ext.adoc @@ -0,0 +1,52 @@ +[[zam]] +== `Zam` Standard Extension for Misaligned Atomics, v0.1 + +This chapter defines the ``Zam`` extension, which extends the ``A`` +extension by standardizing support for misaligned atomic memory +operations (AMOs). On platforms implementing ``Zam``, misaligned AMOs +need only execute atomically with respect to other accesses (including +non-atomic loads and stores) to the same address and of the same size. +More precisely, execution environments implementing ``Zam`` are subject +to the following axiom: + +[[misaligned]] +=== Atomicity Axiom for misaligned atomics + +If latexmath:[$r$] and latexmath:[$w$] are paired misaligned load and +store instructions from a hart latexmath:[$h$] with the same address and +of the same size, then there can be no store instruction latexmath:[$s$] +from a hart other than latexmath:[$h$] with the same address and of the +same size as latexmath:[$r$] and latexmath:[$w$] such that a store +operation generated by latexmath:[$s$] lies in between memory operations +generated by latexmath:[$r$] and latexmath:[$w$] in the global memory +order. Furthermore, there can be no load instruction latexmath:[$l$] +from a hart other than latexmath:[$h$] with the same address and of the +same size as latexmath:[$r$] and latexmath:[$w$] such that a load +operation generated by latexmath:[$l$] lies between two memory +operations generated by latexmath:[$r$] or by latexmath:[$w$] in the +global memory order. 
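The axiom above can be pictured with a small executable sketch. This is an illustration only, not part of the specification, and the function and data names are invented. It also simplifies the axiom: it treats any same-address, same-size access from another hart that falls strictly between the first and last memory operations generated by the pair as a violation, collapsing the axiom's separate store and load clauses into one check:

```python
# Simplified, illustrative check of the Zam misaligned-atomicity axiom
# (not normative). `ops_rw` are the memory operations generated by paired
# misaligned load r and store w on hart `h` (a misaligned access may generate
# several operations); `others` are operations from any hart to the same
# address and of the same size. "gmo" is a global-memory-order index.

def zam_atomicity_holds(ops_rw, others, h):
    lo = min(op["gmo"] for op in ops_rw)
    hi = max(op["gmo"] for op in ops_rw)
    for op in others:
        if op["hart"] != h and lo < op["gmo"] < hi:
            return False  # a foreign same-address, same-size access intervened
    return True
```

Under this simplification, operations from hart latexmath:[$h$] itself, and operations wholly before or after the pair in the global memory order, never violate the axiom.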
+ +This restricted form of atomicity is intended to balance the needs of +applications which require support for misaligned atomics and the +ability of the implementation to actually provide the necessary degree +of atomicity. + +Aligned instructions under `Zam` continue to behave as they normally +do under RVWMO. + +The intention of `Zam` is that it can be implemented in one of two +ways: + +. On hardware that natively supports atomic misaligned accesses to the +address and size in question (e.g., for misaligned accesses within a +single cache line): by simply following the same rules that would be +applied for aligned AMOs. +. On hardware that does not natively support misaligned accesses to the +address and size in question: by trapping on all instructions (including +loads) with that address and size and executing them (via any number of +memory operations) inside a mutex that is a function of the given memory +address and access size. AMOs may be emulated by splitting them into +separate load and store operations, but all preserved program order +rules (e.g., incoming and outgoing syntactic dependencies) must behave +as if the AMO is still a single memory operation. + diff --git a/src/zicsr.adoc b/src/zicsr.adoc new file mode 100644 index 0000000..3c4eed4 --- /dev/null +++ b/src/zicsr.adoc @@ -0,0 +1,242 @@ +[[csrinsts]] +== `Zicsr`, Control and Status Register (CSR) Instructions, Version 2.0 + +RISC-V defines a separate address space of 4096 Control and Status +registers associated with each hart. This chapter defines the full set +of CSR instructions that operate on these CSRs. + +[NOTE] +==== +While CSRs are primarily used by the privileged architecture, there are +several uses in unprivileged code including for counters and timers, and +for floating-point status. 
+
+The counters and timers are no longer considered mandatory parts of the
+standard base ISAs, and so the CSR instructions required to access them
+have been moved out of <> into this separate
+chapter.
+====
+
+=== CSR Instructions
+((CSR))
+
+All CSR instructions atomically read-modify-write a single CSR, whose
+CSR specifier is encoded in the 12-bit _csr_ field of the instruction
+held in bits 31–20. The immediate forms use a 5-bit zero-extended
+immediate encoded in the _rs1_ field.
+(((CSR, instructions)))
+
+include::images/wavedrom/csr-instr.adoc[]
+[[csr-instr]]
+.CSR instructions
+image::image_placeholder.png[]
+(((CSR, CSRRW)))
+
+The CSRRW (Atomic Read/Write CSR) instruction atomically swaps values in
+the CSRs and integer registers. CSRRW reads the old value of the CSR,
+zero-extends the value to XLEN bits, then writes it to integer register
+_rd_. The initial value in _rs1_ is written to the CSR. If _rd_=`x0`,
+then the instruction shall not read the CSR and shall not cause any of
+the side effects that might occur on a CSR read.
+(((CSR, CSRRS)))
+
+The CSRRS (Atomic Read and Set Bits in CSR) instruction reads the value
+of the CSR, zero-extends the value to XLEN bits, and writes it to
+integer register _rd_. The initial value in integer register _rs1_ is
+treated as a bit mask that specifies bit positions to be set in the CSR.
+Any bit that is high in _rs1_ will cause the corresponding bit to be set
+in the CSR, if that CSR bit is writable. Other bits in the CSR are not
+explicitly written.
+(((CSR, CSRRC)))
+
+The CSRRC (Atomic Read and Clear Bits in CSR) instruction reads the
+value of the CSR, zero-extends the value to XLEN bits, and writes it to
+integer register _rd_. The initial value in integer register _rs1_ is
+treated as a bit mask that specifies bit positions to be cleared in the
+CSR. Any bit that is high in _rs1_ will cause the corresponding bit to
+be cleared in the CSR, if that CSR bit is writable.
Other bits in the
+CSR are not explicitly written.
+
+For both CSRRS and CSRRC, if _rs1_=`x0`, then the instruction will not
+write to the CSR at all, and so shall not cause any of the side effects
+that might otherwise occur on a CSR write, nor raise illegal instruction
+exceptions on accesses to read-only CSRs. Both CSRRS and CSRRC always
+read the addressed CSR and cause any read side effects regardless of
+_rs1_ and _rd_ fields. Note that if _rs1_ specifies a register holding a
+zero value other than `x0`, the instruction will still attempt to write
+the unmodified value back to the CSR and will cause any attendant side
+effects. A CSRRW with _rs1_=`x0` will attempt to write zero to the
+destination CSR.
+(((CSR, CSRRWI)))
+(((CSR, CSRRSI)))
+(((CSR, CSRRCI)))
+
+The CSRRWI, CSRRSI, and CSRRCI variants are similar to CSRRW, CSRRS, and
+CSRRC respectively, except they update the CSR using an XLEN-bit value
+obtained by zero-extending a 5-bit unsigned immediate (uimm[4:0]) field
+encoded in the _rs1_ field instead of a value from an integer register.
+For CSRRSI and CSRRCI, if the uimm[4:0] field is zero, then these
+instructions will not write to the CSR, and shall not cause any of the
+side effects that might otherwise occur on a CSR write, nor raise
+illegal instruction exceptions on accesses to read-only CSRs. For
+CSRRWI, if _rd_=`x0`, then the instruction shall not read the CSR and
+shall not cause any of the side effects that might occur on a CSR read.
+Both CSRRSI and CSRRCI will always read the CSR and cause any read side
+effects regardless of _rd_ and _rs1_ fields.
+(((CSR, side effects)))
+
+[[csrsideeffects]]
+.Conditions determining whether a CSR instruction reads or writes the specified CSR.
+[cols="<,^,^,^,^",options="header",] +|=== +|Register operand | | | | +|Instruction |_rd_ is `x0` |_rs1_ is `x0` |Reads CSR |Writes CSR + +|CSRRW |Yes |– |No |Yes + +|CSRRW |No |– |Yes |Yes + +|CSRRS/CSRRC |– |Yes |Yes |No + +|CSRRS/CSRRC |– |No |Yes |Yes + +|Immediate operand | | | | + +|Instruction |_rd_ is `x0` |_uimm_ latexmath:[$=$]0 |Reads CSR |Writes +CSR + +|CSRRWI |Yes |– |No |Yes + +|CSRRWI |No |– |Yes |Yes + +|CSRRSI/CSRRCI |– |Yes |Yes |No + +|CSRRSI/CSRRCI |– |No |Yes |Yes +|=== +(((CSR, defects))) + +<> summarizes the behavior of the CSR +instructions with respect to whether they read and/or write the CSR. + +For any event or consequence that occurs due to a CSR having a +particular value, if a write to the CSR gives it that value, the +resulting event or consequence is said to be an _indirect effect_ of the +write. Indirect effects of a CSR write are not considered by the RISC-V +ISA to be side effects of that write. + +[TIP] +==== +An example of side effects for CSR accesses would be if reading from a +specific CSR causes a light bulb to turn on, while writing an odd value +to the same CSR causes the light to turn off. Assume writing an even +value has no effect. In this case, both the read and write have side +effects controlling whether the bulb is lit, as this condition is not +determined solely from the CSR value. (Note that after writing an odd +value to the CSR to turn off the light, then reading to turn the light +on, writing again the same odd value causes the light to turn off again. +Hence, on the last write, it is not a change in the CSR value that turns +off the light.) + +On the other hand, if a bulb is rigged to light whenever the value of a +particular CSR is odd, then turning the light on and off is not +considered a side effect of writing to the CSR but merely an indirect +effect of such writes. 
+ +More concretely, the RISC-V privileged architecture defined in Volume II +specifies that certain combinations of CSR values cause a trap to occur. +When an explicit write to a CSR creates the conditions that trigger the +trap, the trap is not considered a side effect of the write but merely +an indirect effect. +// check whether we are using "volume" + +Standard CSRs do not have any side effects on reads. Standard CSRs may +have side effects on writes. Custom extensions might add CSRs for which +accesses have side effects on either reads or writes. +==== + +Some CSRs, such as the instructions-retired counter, `instret`, may be +modified as side effects of instruction execution. In these cases, if a +CSR access instruction reads a CSR, it reads the value prior to the +execution of the instruction. If a CSR access instruction writes such a +CSR, the write is done instead of the increment. In particular, a value +written to `instret` by one instruction will be the value read by the +following instruction. + +The assembler pseudoinstruction to read a CSR, CSRR _rd, csr_, is +encoded as CSRRS _rd, csr, x0_. The assembler pseudoinstruction to write +a CSR, CSRW _csr, rs1_, is encoded as CSRRW _x0, csr, rs1_, while CSRWI +_csr, uimm_, is encoded as CSRRWI _x0, csr, uimm_. + +Further assembler pseudoinstructions are defined to set and clear bits +in the CSR when the old value is not required: CSRS/CSRC _csr, rs1_; +CSRSI/CSRCI _csr, uimm_. + +==== CSR Access Ordering +(((CSR, access ordering))) + +Each RISC-V hart normally observes its own CSR accesses, including its +implicit CSR accesses, as performed in program order. In particular, +unless specified otherwise, a CSR access is performed after the +execution of any prior instructions in program order whose behavior +modifies or is modified by the CSR state and before the execution of any +subsequent instructions in program order whose behavior modifies or is +modified by the CSR state. 
Furthermore, an explicit CSR read returns the
+CSR state before the execution of the instruction, while an explicit CSR
+write suppresses and overrides any implicit writes or modifications to
+the same CSR by the same instruction.
+
+Likewise, any side effects from an explicit CSR access are normally
+observed to occur synchronously in program order. Unless specified
+otherwise, the full consequences of any such side effects are observable
+by the very next instruction, and no consequences may be observed
+out-of-order by preceding instructions. (Note the distinction made
+earlier between side effects and indirect effects of CSR writes.)
+
+For the RVWMO memory consistency model <>, CSR accesses are weakly
+ordered by default, so other harts or devices may observe CSR accesses
+in an order different from program order. In addition, CSR accesses are
+not ordered with respect to explicit memory accesses, unless a CSR
+access modifies the execution behavior of the instruction that performs
+the explicit memory access or unless a CSR access and an explicit memory
+access are ordered by either the syntactic dependencies defined by the
+memory model or the ordering requirements defined by the Memory-Ordering
+PMAs section in Volume II of this manual. To enforce ordering in all
+other cases, software should execute a FENCE instruction between the
+relevant accesses. For the purposes of the FENCE instruction, CSR read
+accesses are classified as device input (I), and CSR write accesses are
+classified as device output (O).
+
+[NOTE]
+====
+Informally, the CSR space acts as a weakly ordered memory-mapped I/O
+region, as defined by the Memory-Ordering PMAs section in Volume II of
+this manual. As a result, the order of CSR accesses with respect to all
+other accesses is constrained by the same mechanisms that constrain the
+order of memory-mapped I/O accesses to such a region.
+
+These CSR-ordering constraints are imposed to support ordering main
+memory and memory-mapped I/O accesses with respect to CSR accesses that
+are visible to, or affected by, devices or other harts. Examples include
+the `time`, `cycle`, and `mcycle` CSRs, in addition to CSRs that reflect
+pending interrupts, like `mip` and `sip`. Note that implicit reads of
+such CSRs (e.g., taking an interrupt because of a change in `mip`) are
+also ordered as device input.
+====
+
+Most CSRs (including, e.g., the `fcsr`) are not visible to other harts;
+their accesses can be freely reordered in the global memory order with
+respect to FENCE instructions without violating this specification.
+
+The hardware platform may define that accesses to certain CSRs are
+strongly ordered, as defined by the Memory-Ordering PMAs section in
+Volume II of this manual. Accesses to strongly ordered CSRs have
+stronger ordering constraints with respect to accesses to both weakly
+ordered CSRs and accesses to memory-mapped I/O regions.
+
+[NOTE]
+====
+The rules for the reordering of CSR accesses in the global memory order
+should probably be moved to <>
+concerning the RVWMO memory consistency model.
+====
+
diff --git a/src/zifencei.adoc b/src/zifencei.adoc
new file mode 100644
index 0000000..48b4440
--- /dev/null
+++ b/src/zifencei.adoc
@@ -0,0 +1,97 @@
+[[zifencei]]
+== `Zifencei` Instruction-Fetch Fence, Version 2.0
+
+This chapter defines the `Zifencei` extension, which includes the
+FENCE.I instruction that provides explicit synchronization between
+writes to instruction memory and instruction fetches on the same hart.
+Currently, this instruction is the only standard mechanism to ensure
+that stores visible to a hart will also be visible to its instruction
+fetches.
+(((store instruction word, not included)))
+
+[NOTE]
+====
+We considered but did not include a `store instruction word`
+instruction as in cite:[majc].
JIT compilers may generate a large trace of
+instructions before a single FENCE.I, and amortize any instruction cache
+snooping/invalidation overhead by writing translated instructions to
+memory regions that are known not to reside in the I-cache.
+====
+
+[TIP]
+====
+The FENCE.I instruction was designed to support a wide variety of
+implementations. A simple implementation can flush the local instruction
+cache and the instruction pipeline when the FENCE.I is executed. A more
+complex implementation might snoop the instruction (data) cache on every
+data (instruction) cache miss, or use an inclusive unified private L2
+cache to invalidate lines from the primary instruction cache when they
+are being written by a local store instruction. If instruction and data
+caches are kept coherent in this way, or if the memory system consists
+of only uncached RAMs, then just the fetch pipeline needs to be flushed
+at a FENCE.I.
+
+The FENCE.I instruction was previously part of the base I instruction
+set. Two main issues motivated moving it out of the mandatory base,
+although at time of writing it is still the only standard method for
+maintaining instruction-fetch coherence.
+
+First, it has been recognized that on some systems, FENCE.I will be
+expensive to implement and alternate mechanisms are being discussed in
+the memory model task group. In particular, for designs that have an
+incoherent instruction cache and an incoherent data cache, or where the
+instruction cache refill does not snoop a coherent data cache, both
+caches must be completely flushed when a FENCE.I instruction is
+encountered. This problem is exacerbated when there are multiple levels
+of I and D cache in front of a unified cache or outer memory system.
+
+Second, the instruction is not powerful enough to be made available at
+user level in a Unix-like operating system environment.
The FENCE.I only +synchronizes the local hart, and the OS can reschedule the user hart to +a different physical hart after the FENCE.I. This would require the OS +to execute an additional FENCE.I as part of every context migration. For +this reason, the standard Linux ABI has removed FENCE.I from user-level +and now requires a system call to maintain instruction-fetch coherence, +which allows the OS to minimize the number of FENCE.I executions +required on current systems and provides forward-compatibility with +future improved instruction-fetch coherence mechanisms. + +Future approaches to instruction-fetch coherence under discussion +include providing more restricted versions of FENCE.I that only target a +given address specified in _rs1_, and/or allowing software to use an ABI +that relies on machine-mode cache-maintenance operations. +==== + +include::images/wavedrom/zifencei-ff.adoc[] +[[zifencei-ff]] +.FENCE.I instruction +image::image_placeholder.png[] +(((FENCE.I, synchronization))) + +The FENCE.I instruction is used to synchronize the instruction and data +streams. RISC-V does not guarantee that stores to instruction memory +will be made visible to instruction fetches on a RISC-V hart until that +hart executes a FENCE.I instruction. A FENCE.I instruction ensures that +a subsequent instruction fetch on a RISC-V hart will see any previous +data stores already visible to the same RISC-V hart. FENCE.I does _not_ +ensure that other RISC-V harts’ instruction fetches will observe the +local hart’s stores in a multiprocessor system. To make a store to +instruction memory visible to all RISC-V harts, the writing hart also +has to execute a data FENCE before requesting that all remote RISC-V +harts execute a FENCE.I. + +The unused fields in the FENCE.I instruction, _imm[11:0]_, _rs1_, and +_rd_, are reserved for finer-grain fences in future extensions. 
For
+forward compatibility, base implementations shall ignore these fields,
+and standard software shall zero these fields.
+(((FENCE.I, finer-grained)))
+(((FENCE.I, forward compatibility)))
+
+[NOTE]
+====
+Because FENCE.I only orders stores with a hart’s own instruction
+fetches, application code should only rely upon FENCE.I if the
+application thread will not be migrated to a different hart. The EEI can
+provide mechanisms for efficient multiprocessor instruction-stream
+synchronization.
+====
diff --git a/src/zihintpause.adoc b/src/zihintpause.adoc
new file mode 100644
index 0000000..9b30a61
--- /dev/null
+++ b/src/zihintpause.adoc
@@ -0,0 +1,61 @@
+[[zihintpause]]
+== `Zihintpause` Pause Hint, Version 2.0
+
+The PAUSE instruction is a HINT that indicates the current hart’s rate
+of instruction retirement should be temporarily reduced or paused. The
+duration of its effect must be bounded and may be zero. No architectural
+state is changed.
+(((PAUSE, HINT)))
+(((HINT, PAUSE)))
+
+Software can use the PAUSE instruction to reduce energy consumption
+while executing spin-wait code sequences. Multithreaded cores might
+temporarily relinquish execution resources to other harts when PAUSE is
+executed. It is recommended that a PAUSE instruction generally be
+included in the code sequence for a spin-wait loop.
+(((PAUSE, energy consumption)))
+
+A future extension might add primitives similar to the x86 MONITOR/MWAIT
+instructions, which provide a more efficient mechanism to wait on writes
+to a specific memory location. However, these instructions would not
+supplant PAUSE. PAUSE is more appropriate when polling for non-memory
+events, when polling for multiple events, or when software does not know
+precisely what events it is polling for.
+
+The duration of a PAUSE instruction’s effect may vary significantly
+within and among implementations.
In typical implementations this
+duration should be much less than the time to perform a context switch,
+probably more on the rough order of an on-chip cache miss latency or a
+cacheless access to main memory.
+(((PAUSE, duration)))
+
+A series of PAUSE instructions can be used to create a cumulative delay
+loosely proportional to the number of PAUSE instructions. In spin-wait
+loops in portable code, however, only one PAUSE instruction should be
+used before re-evaluating loop conditions, else the hart might stall
+longer than optimal on some implementations, degrading system
+performance.
+
+PAUSE is encoded as a FENCE instruction with _pred_=W, _succ_=0, _fm_=0,
+_rd_=`x0`, and _rs1_=`x0`.
+
+PAUSE is encoded as a hint within the FENCE opcode because some
+implementations are expected to deliberately stall the PAUSE instruction
+until outstanding memory transactions have completed. Because the
+successor set is null, however, PAUSE does not _mandate_ any particular
+memory ordering—hence, it truly is a HINT.
+(((PAUSE, encoding)))
+
+include::images/wavedrom/zihintpause-hint.adoc[]
+[[zihintpause-hint]]
+.Zihintpause fence instructions
+image::image_placeholder.png[]
+
+Like other FENCE instructions, PAUSE cannot be used within LR/SC
+sequences without voiding the forward-progress guarantee.
+(((PAUSE, LR/SC sequences)))
+
+The choice of a predecessor set of W is arbitrary, since the successor
+set is null. Other HINTs similar to PAUSE might be encoded with other
+predecessor sets.
+
diff --git a/src/ztso-st-ext.adoc b/src/ztso-st-ext.adoc
new file mode 100644
index 0000000..2ec457d
--- /dev/null
+++ b/src/ztso-st-ext.adoc
@@ -0,0 +1,38 @@
+[[ztso]]
+== `Ztso` Standard Extension for Total Store Ordering, v0.1
+
+This chapter defines the `Ztso` extension for the RISC-V Total Store
+Ordering (RVTSO) memory consistency model. RVTSO is defined as a delta
+from RVWMO, which is defined in <>.
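
The delta can also be stated executably. The sketch below is illustrative only: the operation kinds and annotation strings are invented names for this example, not terms defined by the specification; it simply encodes the implicit annotation strengthening that the adjustments below spell out.

```python
# Illustrative sketch: the RVTSO "delta from RVWMO" viewed as implicit
# ordering annotations added to every memory operation. All names here
# are invented for this example, not specification terms.
RVTSO_IMPLICIT = {
    "load":  frozenset({"acquire-RCpc"}),
    "store": frozenset({"release-RCpc"}),
    "amo":   frozenset({"acquire-RCsc", "release-RCsc"}),
}

def effective_annotations(kind, explicit=frozenset()):
    """Annotations an operation effectively carries under RVTSO:
    its explicit annotations plus the implicit ones RVTSO adds."""
    return frozenset(explicit) | RVTSO_IMPLICIT[kind]
```

For instance, a plain store effectively carries a release-RCpc annotation, so earlier program-order memory operations cannot be observed after it, matching TSO's preserved load-store and store-store order.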
+
+The Ztso extension is meant to facilitate the porting of code originally
+written for the x86 or SPARC architectures, both of which use TSO by
+default. It also supports implementations which inherently provide RVTSO
+behavior and want to expose that fact to software.
+
+RVTSO makes the following adjustments to RVWMO:
+
+* All load operations behave as if they have an acquire-RCpc annotation.
+* All store operations behave as if they have a release-RCpc annotation.
+* All AMOs behave as if they have both acquire-RCsc and release-RCsc
+annotations.
+
+These rules render all PPO rules except
+<>–<> redundant. They also make
+redundant any non-I/O fences that do not have both PW and SR set.
+Finally, they also imply that no memory operation will be reordered past
+an AMO in either direction.
+
+In the context of RVTSO, as is the case for RVWMO, the storage ordering
+annotations are concisely and completely defined by PPO rules
+<>–<>. In both of these
+memory models, it is the load value axiom that allows a hart to forward
+a value from its store buffer to a subsequent (in program order)
+load—that is to say that stores can be forwarded locally before they are
+visible to other harts.
+
+In spite of the fact that Ztso adds no new instructions to the ISA, code
+written assuming RVTSO will not run correctly on implementations not
+supporting Ztso. Binaries compiled to run only under Ztso should
+indicate as such via a flag in the binary, so that platforms which do
+not implement Ztso can simply refuse to run them.
+
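
As an aside on the binary flag mentioned above: the RISC-V ELF psABI reserves a bit in the ELF header's `e_flags` field (`EF_RISCV_TSO`) for exactly this purpose. The sketch below shows how a loader might check it; the constant value and field offsets follow the ELF and psABI documents, but the function itself is a hypothetical example, not part of this specification.

```python
import struct

EF_RISCV_TSO = 0x0010  # e_flags bit defined by the RISC-V ELF psABI

def requires_ztso(elf_bytes: bytes) -> bool:
    """Return True if the ELF header marks the binary as requiring RVTSO.
    Minimal parser: assumes a valid little-endian RISC-V ELF image.
    Hypothetical loader-side helper, for illustration only."""
    if elf_bytes[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    ei_class = elf_bytes[4]                 # 1 = ELF32, 2 = ELF64
    off = 0x24 if ei_class == 1 else 0x30   # e_flags offset per ELF spec
    (e_flags,) = struct.unpack_from("<I", elf_bytes, off)
    return bool(e_flags & EF_RISCV_TSO)
```

A platform that does not implement Ztso could refuse to load any image for which `requires_ztso` returns True.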