authorelisa <elisa@riscv.org>2021-09-10 14:07:18 -0700
committerelisa <elisa@riscv.org>2021-09-10 14:07:18 -0700
commit64353b3717c9d387759c61d16d3984760a028046 (patch)
tree7dc4b3a1aff37dfd38ff921fa21113a64989a5aa
parente1a563505224b47129b0cbb5c46ee22f5e0acddb (diff)
downloadriscv-isa-manual-64353b3717c9d387759c61d16d3984760a028046.zip
riscv-isa-manual-64353b3717c9d387759c61d16d3984760a028046.tar.gz
riscv-isa-manual-64353b3717c9d387759c61d16d3984760a028046.tar.bz2
adding converted adoc files
-rw-r--r-- src/a-st-ext.adoc | 452
-rw-r--r-- src/b-st-ext.adoc | 17
-rw-r--r-- src/bibliography.adoc | 4
-rw-r--r-- src/c-st-ext.adoc | 1123
-rw-r--r-- src/colophon.adoc | 347
-rw-r--r-- src/counters-f.adoc | 167
-rw-r--r-- src/counters.adoc | 189
-rw-r--r-- src/d-st-ext.adoc | 255
-rw-r--r-- src/extending.adoc | 364
-rw-r--r-- src/f-st-ext.adoc | 541
-rw-r--r-- src/history.adoc | 362
-rw-r--r-- src/index.adoc | 2
-rw-r--r-- src/intro-old.adoc | 733
-rw-r--r-- src/intro.adoc | 731
-rw-r--r-- src/j-st-ext.adoc | 10
-rw-r--r-- src/m-st-ext.adoc | 159
-rw-r--r-- src/mm-alloy.adoc | 239
-rw-r--r-- src/mm-eplan.adoc | 1705
-rw-r--r-- src/mm-formal.adoc | 1412
-rw-r--r-- src/mm-herd.adoc | 155
-rw-r--r-- src/naming.adoc | 184
-rw-r--r-- src/p-st-ext.adoc | 10
-rw-r--r-- src/q-st-ext.adoc | 148
-rw-r--r-- src/riscv-isa-unpriv-conv-review.adoc | 139
-rw-r--r-- src/riscv-isa-unpriv.adoc | 135
-rw-r--r-- src/rv-32-64g.adoc | 491
-rw-r--r-- src/rv128.adoc | 79
-rw-r--r-- src/rv32.adoc | 988
-rw-r--r-- src/rv32e.adoc | 59
-rw-r--r-- src/rv64.adoc | 252
-rw-r--r-- src/rvwmo.adoc | 831
-rw-r--r-- src/test.adoc | 47
-rw-r--r-- src/v-st-ext.adoc | 11
-rw-r--r-- src/zam-st-ext.adoc | 52
-rw-r--r-- src/zicsr.adoc | 242
-rw-r--r-- src/zifencei.adoc | 97
-rw-r--r-- src/zihintpause.adoc | 61
-rw-r--r-- src/ztso-st-ext.adoc | 38
38 files changed, 12831 insertions, 0 deletions
diff --git a/src/a-st-ext.adoc b/src/a-st-ext.adoc
new file mode 100644
index 0000000..7468009
--- /dev/null
+++ b/src/a-st-ext.adoc
@@ -0,0 +1,452 @@
+[[atomics]]
+== `A` Standard Extension for Atomic Instructions, Version 2.1
+
+The standard atomic-instruction extension, named `A`, contains
+instructions that atomically read-modify-write memory to support
+synchronization between multiple RISC-V harts running in the same memory
+space. The two forms of atomic instruction provided are
+load-reserved/store-conditional instructions and atomic fetch-and-op
+memory instructions. Both types of atomic instruction support various
+memory consistency orderings including unordered, acquire, release, and
+sequentially consistent semantics. These instructions allow RISC-V to
+support the RCsc memory consistency model cite:[Gharachorloo90memoryconsistency].
+
+[TIP]
+====
+After much debate, the language community and architecture community
+appear to have finally settled on release consistency as the standard
+memory consistency model and so the RISC-V atomic support is built
+around this model.
+====
+
+=== Specifying Ordering of Atomic Instructions
+
+The base RISC-V ISA has a relaxed memory model, with the FENCE
+instruction used to impose additional ordering constraints. The address
+space is divided by the execution environment into memory and I/O
+domains, and the FENCE instruction provides options to order accesses to
+one or both of these two address domains.
+
+To provide more efficient support for release consistency cite:[Gharachorloo90memoryconsistency], each atomic
+instruction has two bits, _aq_ and _rl_, used to specify additional
+memory ordering constraints as viewed by other RISC-V harts. The bits
+order accesses to one of the two address domains, memory or I/O,
+depending on which address domain the atomic instruction is accessing.
+No ordering constraint is implied to accesses to the other domain, and a
+FENCE instruction should be used to order across both domains.
+
+If both bits are clear, no additional ordering constraints are imposed
+on the atomic memory operation. If only the _aq_ bit is set, the atomic
+memory operation is treated as an _acquire_ access, i.e., no following
+memory operations on this RISC-V hart can be observed to take place
+before the acquire memory operation. If only the _rl_ bit is set, the
+atomic memory operation is treated as a _release_ access, i.e., the
+release memory operation cannot be observed to take place before any
+earlier memory operations on this RISC-V hart. If both the _aq_ and _rl_
+bits are set, the atomic memory operation is _sequentially consistent_
+and cannot be observed to happen before any earlier memory operations or
+after any later memory operations in the same RISC-V hart and to the
+same address domain.
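The four combinations of _aq_ and _rl_ correspond closely to the C11 memory orders. As a non-normative sketch, the same orderings can be expressed with C11 `<stdatomic.h>` atomics; on RISC-V a compiler typically lowers each of these to an AMO with the corresponding _aq_/_rl_ bits, though that mapping is a compiler choice rather than something this chapter mandates (the function names here are illustrative only):

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative counter; each bump_* variant requests one of the four
   orderings discussed above.  On RISC-V these would typically compile
   to amoadd.w with the matching aq/rl bits (compiler-dependent). */
static atomic_int counter = 0;

int bump_relaxed(void) {   /* aq=0, rl=0: no additional ordering */
    return atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
}
int bump_acquire(void) {   /* aq=1: later accesses stay after the AMO */
    return atomic_fetch_add_explicit(&counter, 1, memory_order_acquire);
}
int bump_release(void) {   /* rl=1: earlier accesses stay before the AMO */
    return atomic_fetch_add_explicit(&counter, 1, memory_order_release);
}
int bump_seq_cst(void) {   /* aq=1, rl=1: sequentially consistent */
    return atomic_fetch_add_explicit(&counter, 1, memory_order_seq_cst);
}
```

Each call returns the previous counter value, mirroring how an AMO returns the original memory value in _rd_.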
+
+[[lrsc]]
+=== Load-Reserved/Store-Conditional Instructions
+
+include::images/wavedrom/load-reserve-st-conditional.adoc[]
+[[load-res-st]]
+.Load-Reserved/Store-Conditional Instructions
+image::image_placeholder.png[]
+
+Complex atomic memory operations on a single memory word or doubleword
+are performed with the load-reserved (LR) and store-conditional (SC)
+instructions. LR.W loads a word from the address in _rs1_, places the
+sign-extended value in _rd_, and registers a _reservation set_—a set of
+bytes that subsumes the bytes in the addressed word. SC.W conditionally
+writes a word in _rs2_ to the address in _rs1_: the SC.W succeeds only
+if the reservation is still valid and the reservation set contains the
+bytes being written. If the SC.W succeeds, the instruction writes the
+word in _rs2_ to memory, and it writes zero to _rd_. If the SC.W fails,
+the instruction does not write to memory, and it writes a nonzero value
+to _rd_. Regardless of success or failure, executing an SC.W instruction
+invalidates any reservation held by this hart. LR.D and SC.D act
+analogously on doublewords and are only available on RV64. For RV64,
+LR.W and SC.W sign-extend the value placed in _rd_.
+
+[TIP]
+====
+Both compare-and-swap (CAS) and LR/SC can be used to build lock-free
+data structures. After extensive discussion, we opted for LR/SC for
+several reasons: 1) CAS suffers from the ABA problem, which LR/SC avoids
+because it monitors all writes to the address rather than only checking
+for changes in the data value; 2) CAS would also require a new integer
+instruction format to support three source operands (address, compare
+value, swap value) as well as a different memory system message format,
+which would complicate microarchitectures; 3) Furthermore, to avoid the
+ABA problem, other systems provide a double-wide CAS (DW-CAS) to allow a
+counter to be tested and incremented along with a data word. This
+requires reading five registers and writing two in one instruction, and
+also a new larger memory system message type, further complicating
+implementations; 4) LR/SC provides a more efficient implementation of
+many primitives as it only requires one load as opposed to two with CAS
+(one load before the CAS instruction to obtain a value for speculative
+computation, then a second load as part of the CAS instruction to check
+if the value is unchanged before updating).
+
+The main disadvantage of LR/SC over CAS is livelock, which we avoid,
+under certain circumstances, with an architected guarantee of eventual
+forward progress as described below. Another concern is whether the
+influence of the current x86 architecture, with its DW-CAS, will
+complicate porting of synchronization libraries and other software that
+assumes DW-CAS is the basic machine primitive. A possible mitigating
+factor is the recent addition of transactional memory instructions to
+x86, which might cause a move away from DW-CAS.
+
+More generally, a multi-word atomic primitive is desirable, but there is
+still considerable debate about what form this should take, and
+guaranteeing forward progress adds complexity to a system.
+====
+
+The failure code with value 1 is reserved to encode an unspecified
+failure. Other failure codes are reserved at this time, and portable
+software should only assume the failure code will be non-zero.
+
+[NOTE]
+====
+We reserve a failure code of 1 to mean `unspecified` so that simple
+implementations may return this value using the existing mux required
+for the SLT/SLTU instructions. More specific failure codes might be
+defined in future versions or extensions to the ISA.
+====
+
+For LR and SC, the A extension requires that the address held in _rs1_
+be naturally aligned to the size of the operand (i.e., eight-byte
+aligned for 64-bit words and four-byte aligned for 32-bit words). If the
+address is not naturally aligned, an address-misaligned exception or an
+access-fault exception will be generated. The access-fault exception can
+be generated for a memory access that would otherwise be able to
+complete except for the misalignment, if the misaligned access should
+not be emulated.
+(((alignment, misaligned)))
+
+[NOTE]
+====
+Emulating misaligned LR/SC sequences is impractical in most systems.
+
+Misaligned LR/SC sequences also raise the possibility of accessing
+multiple reservation sets at once, which present definitions do not
+provide for.
+====
+
+An implementation can register an arbitrarily large reservation set on
+each LR, provided the reservation set includes all bytes of the
+addressed data word or doubleword. An SC can only pair with the most
+recent LR in program order. An SC may succeed only if no store from
+another hart to the reservation set can be observed to have occurred
+between the LR and the SC, and if there is no other SC between the LR
+and itself in program order. An SC may succeed only if no write from a
+device other than a hart to the bytes accessed by the LR instruction can
+be observed to have occurred between the LR and SC. Note this LR might
+have had a different effective address and data size, but reserved the
+SC’s address as part of the reservation set.
+(((alignment, reservation set)))
+
+[NOTE]
+====
+Following this model, in systems with memory translation, an SC is
+allowed to succeed if the earlier LR reserved the same location using an
+alias with a different virtual address, but is also allowed to fail if
+the virtual address is different.
+
+To accommodate legacy devices and buses, writes from devices other than
+RISC-V harts are only required to invalidate reservations when they
+overlap the bytes accessed by the LR. These writes are not required to
+invalidate the reservation when they access other bytes in the
+reservation set.
+====
+
+The SC must fail if the address is not within the reservation set of the
+most recent LR in program order. The SC must fail if a store to the
+reservation set from another hart can be observed to occur between the
+LR and SC. The SC must fail if a write from some other device to the
+bytes accessed by the LR can be observed to occur between the LR and SC.
+(If such a device writes the reservation set but does not write the
+bytes accessed by the LR, the SC may or may not fail.) An SC must fail
+if there is another SC (to any address) between the LR and the SC in
+program order. The precise statement of the atomicity requirements for
+successful LR/SC sequences is defined by the Atomicity Axiom in
+<<rvwmo>>.
+
+
+[NOTE]
+====
+The platform should provide a means to determine the size and shape of
+the reservation set.
+
+A platform specification may constrain the size and shape of the
+reservation set. For example, the Unix platform is expected to require
+of main memory that the reservation set be of fixed size, contiguous,
+naturally aligned, and no greater than the virtual memory page size.
+
+A store-conditional instruction to a scratch word of memory should be
+used to forcibly invalidate any existing load reservation:
+
+* during a preemptive context switch, and
+* if necessary when changing virtual to physical address mappings, such
+as when migrating pages that might contain an active reservation.
+====
+
+The invalidation of a hart’s reservation when it executes an LR or SC
+implies that a hart can only hold one reservation at a time, and that an
+SC can only pair with the most recent LR, and LR with the next following
+SC, in program order. This is a restriction to the Atomicity Axiom in
+<<rvwmo>> that ensures software runs correctly on
+expected common implementations that operate in this manner.
+
+An SC instruction can never be observed by another RISC-V hart before
+the LR instruction that established the reservation. The LR/SC sequence
+can be given acquire semantics by setting the _aq_ bit on the LR
+instruction. The LR/SC sequence can be given release semantics by
+setting the _rl_ bit on the SC instruction. Setting the _aq_ bit on the
+LR instruction, and setting both the _aq_ and the _rl_ bit on the SC
+instruction makes the LR/SC sequence sequentially consistent, meaning
+that it cannot be reordered with earlier or later memory operations from
+the same hart.
+
+If neither bit is set on both LR and SC, the LR/SC sequence can be
+observed to occur before or after surrounding memory operations from the
+same RISC-V hart. This can be appropriate when the LR/SC sequence is
+used to implement a parallel reduction operation.
+
+Software should not set the _rl_ bit on an LR instruction unless the
+_aq_ bit is also set, nor should software set the _aq_ bit on an SC
+instruction unless the _rl_ bit is also set. LR._rl_ and SC._aq_
+instructions are not guaranteed to provide any stronger ordering than
+those with both bits clear, but may result in lower performance.
+
+[[cas]]
+....
+ # a0 holds address of memory location
+ # a1 holds expected value
+ # a2 holds desired value
+ # a0 holds return value, 0 if successful, !0 otherwise
+ cas:
+ lr.w t0, (a0) # Load original value.
+ bne t0, a1, fail # Doesn't match, so fail.
+ sc.w t0, a2, (a0) # Try to update.
+ bnez t0, cas # Retry if store-conditional failed.
+ li a0, 0 # Set return to success.
+ jr ra # Return.
+ fail:
+ li a0, 1 # Set return to failure.
+ jr ra # Return.
+....
+
+LR/SC can be used to construct lock-free data structures. An example
+using LR/SC to implement a compare-and-swap function is shown in
+<<cas>>. If inlined, compare-and-swap functionality need
+only take four instructions.
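As a non-normative sketch, the same compare-and-swap function can be written with the C11 strong compare-exchange; a compiler targeting RV32A/RV64A would typically expand it into an LR/SC retry loop much like the hand-written assembly above (the function name is illustrative only):

```c
#include <assert.h>
#include <stdatomic.h>

/* C11 analogue of the cas: routine above.  Returns 0 on success and a
   nonzero value on failure, matching the assembly's a0 result.  The
   strong form does not fail spuriously, corresponding to the retry
   loop around the store-conditional. */
int cas_word(atomic_int *addr, int expected, int desired) {
    return atomic_compare_exchange_strong(addr, &expected, desired) ? 0 : 1;
}
```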
+
+[[lrscseq]]
+=== Eventual Success of Store-Conditional Instructions
+
+The standard A extension defines _constrained LR/SC loops_, which have
+the following properties:
+
+* The loop comprises only an LR/SC sequence and code to retry the
+sequence in the case of failure, and must comprise at most 16
+instructions placed sequentially in memory.
+* An LR/SC sequence begins with an LR instruction and ends with an SC
+instruction. The dynamic code executed between the LR and SC
+instructions can only contain instructions from the base `I`
+instruction set, excluding loads, stores, backward jumps, taken backward
+branches, JALR, FENCE, and SYSTEM instructions. If the `C` extension
+is supported, then compressed forms of the aforementioned `I`
+instructions are also permitted.
+* The code to retry a failing LR/SC sequence can contain backwards jumps
+and/or branches to repeat the LR/SC sequence, but otherwise has the same
+constraint as the code between the LR and SC.
+* The LR and SC addresses must lie within a memory region with the
+_LR/SC eventuality_ property. The execution environment is responsible
+for communicating which regions have this property.
+* The SC must be to the same effective address and of the same data size
+as the latest LR executed by the same hart.
+
+LR/SC sequences that do not lie within constrained LR/SC loops are
+_unconstrained_. Unconstrained LR/SC sequences might succeed on some
+attempts on some implementations, but might never succeed on other
+implementations.
+
+[TIP]
+====
+We restricted the length of LR/SC loops to fit within 64 contiguous
+instruction bytes in the base ISA to avoid undue restrictions on
+instruction cache and TLB size and associativity. Similarly, we
+disallowed other loads and stores within the loops to avoid restrictions
+on data-cache associativity in simple implementations that track the
+reservation within a private cache. The restrictions on branches and
+jumps limit the time that can be spent in the sequence. Floating-point
+operations and integer multiply/divide were disallowed to simplify the
+operating system’s emulation of these instructions on implementations
+lacking appropriate hardware support.
+
+Software is not forbidden from using unconstrained LR/SC sequences, but
+portable software must detect the case that the sequence repeatedly
+fails, then fall back to an alternate code sequence that does not rely
+on an unconstrained LR/SC sequence. Implementations are permitted to
+unconditionally fail any unconstrained LR/SC sequence.
+====
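The fallback requirement for portable software can be sketched in C11 atomics: attempt the atomic sequence a bounded number of times, and report failure so the caller can take an alternate path (for example, a lock-based one). This is an illustrative pattern, not a prescribed API; the function name and retry bound are hypothetical:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical helper: try a compare-and-swap increment up to
   max_attempts times.  If the sequence keeps failing -- as an
   implementation may permit for unconstrained LR/SC sequences --
   return false so the caller can fall back to a code sequence that
   does not rely on LR/SC. */
bool try_increment(atomic_int *v, int max_attempts) {
    for (int i = 0; i < max_attempts; i++) {
        int old = atomic_load_explicit(v, memory_order_relaxed);
        if (atomic_compare_exchange_strong(v, &old, old + 1))
            return true;    /* sequence eventually succeeded */
    }
    return false;           /* caller must use the fallback path */
}
```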
+
+If a hart _H_ enters a constrained LR/SC loop, the execution environment
+must guarantee that one of the following events eventually occurs:
+
+* _H_ or some other hart executes a successful SC to the reservation set
+of the LR instruction in _H_’s constrained LR/SC loop.
+* Some other hart executes an unconditional store or AMO instruction to
+the reservation set of the LR instruction in _H_’s constrained LR/SC
+loop, or some other device in the system writes to that reservation set.
+* _H_ executes a branch or jump that exits the constrained LR/SC loop.
+* _H_ traps.
+
+[NOTE]
+====
+Note that these definitions permit an implementation to fail an SC
+instruction occasionally for any reason, provided the aforementioned
+guarantee is not violated.
+
+As a consequence of the eventuality guarantee, if some harts in an
+execution environment are executing constrained LR/SC loops, and no
+other harts or devices in the execution environment execute an
+unconditional store or AMO to that reservation set, then at least one
+hart will eventually exit its constrained LR/SC loop. By contrast, if
+other harts or devices continue to write to that reservation set, it is
+not guaranteed that any hart will exit its LR/SC loop.
+
+Loads and load-reserved instructions do not by themselves impede the
+progress of other harts’ LR/SC sequences. We note this constraint
+implies, among other things, that loads and load-reserved instructions
+executed by other harts (possibly within the same core) cannot impede
+LR/SC progress indefinitely. For example, cache evictions caused by
+another hart sharing the cache cannot impede LR/SC progress
+indefinitely. Typically, this implies reservations are tracked
+independently of evictions from any shared cache. Similarly, cache
+misses caused by speculative execution within a hart cannot impede LR/SC
+progress indefinitely.
+
+These definitions admit the possibility that SC instructions may
+spuriously fail for implementation reasons, provided progress is
+eventually made.
+
+One advantage of CAS is that it guarantees that some hart eventually
+makes progress, whereas an LR/SC atomic sequence could livelock
+indefinitely on some systems. To avoid this concern, we added an
+architectural guarantee of livelock freedom for certain LR/SC sequences.
+
+Earlier versions of this specification imposed a stronger
+starvation-freedom guarantee. However, the weaker livelock-freedom
+guarantee is sufficient to implement the C11 and C++11 languages, and is
+substantially easier to provide in some microarchitectural styles.
+====
+
+[[amo]]
+=== Atomic Memory Operations
+
+include::images/wavedrom/atomic-mem.adoc[]
+[[atomic-mem]]
+.Atomic memory operations
+image::image_placeholder.png[]
+
+The atomic memory operation (AMO) instructions perform read-modify-write
+operations for multiprocessor synchronization and are encoded with an
+R-type instruction format. These AMO instructions atomically load a data
+value from the address in _rs1_, place the value into register _rd_,
+apply a binary operator to the loaded value and the original value in
+_rs2_, then store the result back to the original address in _rs1_. AMOs
+can either operate on 64-bit (RV64 only) or 32-bit words in memory. For
+RV64, 32-bit AMOs always sign-extend the value placed in _rd_, and
+ignore the upper 32 bits of the original value of _rs2_.
+
+For AMOs, the A extension requires that the address held in _rs1_ be
+naturally aligned to the size of the operand (i.e., eight-byte aligned
+for 64-bit words and four-byte aligned for 32-bit words). If the address
+is not naturally aligned, an address-misaligned exception or an
+access-fault exception will be generated. The access-fault exception can
+be generated for a memory access that would otherwise be able to
+complete except for the misalignment, if the misaligned access should
+not be emulated. The `Zam` extension, described in
+<<zam>>, relaxes this requirement and specifies the
+semantics of misaligned AMOs.
+
+The operations supported are swap, integer add, bitwise AND, bitwise OR,
+bitwise XOR, and signed and unsigned integer maximum and minimum.
+Without ordering constraints, these AMOs can be used to implement
+parallel reduction operations, where typically the return value would be
+discarded by writing to `x0`.
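A parallel reduction with unordered AMOs can be sketched with relaxed C11 fetch-and-add, the analogue of an AMOADD whose destination is `x0`: the old value each operation returns is simply discarded, and no _aq_/_rl_ ordering is requested. The function name is illustrative only:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Accumulate an array into a shared counter with relaxed fetch-and-add.
   Each call site could run on a different hart; because a pure
   reduction needs no ordering, the relaxed order (aq=0, rl=0) suffices,
   and the returned old value is discarded (cf. writing rd = x0). */
void reduce_add(atomic_long *acc, const long *a, size_t n) {
    for (size_t i = 0; i < n; i++)
        (void)atomic_fetch_add_explicit(acc, a[i], memory_order_relaxed);
}
```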
+
+[NOTE]
+====
+We provided fetch-and-op style atomic primitives as they scale to highly
+parallel systems better than LR/SC or CAS. A simple microarchitecture
+can implement AMOs using the LR/SC primitives, provided the
+implementation can guarantee the AMO eventually completes. More complex
+implementations might also implement AMOs at memory controllers, and can
+optimize away fetching the original value when the destination is `x0`.
+
+The set of AMOs was chosen to support the C11/C++11 atomic memory
+operations efficiently, and also to support parallel reductions in
+memory. Another use of AMOs is to provide atomic updates to
+memory-mapped device registers (e.g., setting, clearing, or toggling
+bits) in the I/O space.
+====
+
+To help implement multiprocessor synchronization, the AMOs optionally
+provide release consistency semantics. If the _aq_ bit is set, then no
+later memory operations in this RISC-V hart can be observed to take
+place before the AMO. Conversely, if the _rl_ bit is set, then other
+RISC-V harts will not observe the AMO before memory accesses preceding
+the AMO in this RISC-V hart. Setting both the _aq_ and the _rl_ bit on
+an AMO makes the sequence sequentially consistent, meaning that it
+cannot be reordered with earlier or later memory operations from the
+same hart.
+
+[NOTE]
+====
+The AMOs were designed to implement the C11 and C++11 memory models
+efficiently. Although the FENCE R, RW instruction suffices to implement
+the _acquire_ operation and FENCE RW, W suffices to implement _release_,
+both imply additional unnecessary ordering as compared to AMOs with the
+corresponding _aq_ or _rl_ bit set.
+====
+
+An example code sequence for a critical section guarded by a
+test-and-test-and-set spinlock is shown in
+<<critical>>. Note the first AMO is marked _aq_ to
+order the lock acquisition before the critical section, and the second
+AMO is marked _rl_ to order the critical section before the lock
+relinquishment.
+
+[[critical]]
+....
+ li t0, 1 # Initialize swap value.
+ again:
+ lw t1, (a0) # Check if lock is held.
+ bnez t1, again # Retry if held.
+ amoswap.w.aq t1, t0, (a0) # Attempt to acquire lock.
+ bnez t1, again # Retry if held.
+ # ...
+ # Critical section.
+ # ...
+ amoswap.w.rl x0, x0, (a0) # Release lock by storing 0.
+....
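The same test-and-test-and-set idiom can be sketched in C11 atomics, with acquire ordering on the swap and release ordering on the unlock, mirroring the _aq_ and _rl_ markings above. This is an illustrative sketch: the release here is a store rather than the AMO swap recommended above, and the function names are hypothetical:

```c
#include <assert.h>
#include <stdatomic.h>

/* Test-and-test-and-set spinlock mirroring the assembly above: spin on
   a plain load until the lock looks free (lw/bnez), then attempt the
   acquisition with an acquire-ordered swap (amoswap.w.aq analogue). */
void spin_lock(atomic_int *lock) {
    for (;;) {
        while (atomic_load_explicit(lock, memory_order_relaxed) != 0)
            ;                                   /* lock held: keep spinning */
        if (atomic_exchange_explicit(lock, 1, memory_order_acquire) == 0)
            return;                             /* acquired the lock */
    }
}

/* Release by storing 0 with release ordering (the assembly instead
   uses amoswap.w.rl, which eases speculative lock elision). */
void spin_unlock(atomic_int *lock) {
    atomic_store_explicit(lock, 0, memory_order_release);
}
```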
+
+[NOTE]
+====
+We recommend the use of the AMO Swap idiom shown above for both lock
+acquire and release to simplify the implementation of speculative lock
+elision cite:[Rajwar:2001:SLE].
+====
+
+The instructions in the `A` extension can also be used to provide
+sequentially consistent loads and stores. A sequentially consistent load
+can be implemented as an LR with both _aq_ and _rl_ set. A sequentially
+consistent store can be implemented as an AMOSWAP that writes the old
+value to x0 and has both _aq_ and _rl_ set.
+
diff --git a/src/b-st-ext.adoc b/src/b-st-ext.adoc
new file mode 100644
index 0000000..d160e05
--- /dev/null
+++ b/src/b-st-ext.adoc
@@ -0,0 +1,17 @@
+[[bits]]
+== `B` Standard Extension for Bit Manipulation, Version 0.0
+
+This chapter is a placeholder for a future standard extension to provide
+bit manipulation instructions, including instructions to insert,
+extract, and test bit fields, and for rotations, funnel shifts, and bit
+and byte permutations.
+
+Although bit manipulation instructions are very effective in some
+application domains, particularly when dealing with externally packed
+data structures, we excluded them from the base ISAs as they are not
+useful in all domains and can add additional complexity or instruction
+formats to supply all needed operands.
+
+We anticipate the B extension will be a brownfield encoding within the
+base 30-bit instruction space.
+
diff --git a/src/bibliography.adoc b/src/bibliography.adoc
new file mode 100644
index 0000000..4cc3eb7
--- /dev/null
+++ b/src/bibliography.adoc
@@ -0,0 +1,4 @@
+[bibliography]
+== Bibliography
+
+bibliography::[]
diff --git a/src/c-st-ext.adoc b/src/c-st-ext.adoc
new file mode 100644
index 0000000..6675a5a
--- /dev/null
+++ b/src/c-st-ext.adoc
@@ -0,0 +1,1123 @@
+[[compressed]]
+== `C` Standard Extension for Compressed Instructions, Version 2.0
+
+This chapter describes the current proposal for the RISC-V standard
+compressed instruction-set extension, named `C`, which reduces static
+and dynamic code size by adding short 16-bit instruction encodings for
+common operations. The C extension can be added to any of the base ISAs
+(RV32, RV64, RV128), and we use the generic term `RVC` to cover any of
+these. Typically, 50%–60% of the RISC-V instructions in a program can be
+replaced with RVC instructions, resulting in a 25%–30% code-size
+reduction.
+
+=== Overview
+
+RVC uses a simple compression scheme that offers shorter 16-bit versions
+of common 32-bit RISC-V instructions when:
+
+* the immediate or address offset is small, or
+* one of the registers is the zero register (`x0`), the ABI link register
+(`x1`), or the ABI stack pointer (`x2`), or
+* the destination register and the first source register are identical, or
+* the registers used are the 8 most popular ones.
+
+The C extension is compatible with all other standard instruction
+extensions. The C extension allows 16-bit instructions to be freely
+intermixed with 32-bit instructions, with the latter now able to start
+on any 16-bit boundary, i.e., IALIGN=16. With the addition of the C
+extension, no instructions can raise instruction-address-misaligned
+exceptions.
+
+Removing the 32-bit alignment constraint on the original 32-bit
+instructions allows significantly greater code density.
+
+The compressed instruction encodings are mostly common across RV32C,
+RV64C, and RV128C, but as shown in
+Table <<rvcopcodemap>>, a few opcodes are used for
+different purposes depending on base ISA. For example, the wider
+address-space RV64C and RV128C variants require additional opcodes to
+compress loads and stores of 64-bit integer values, while RV32C uses the
+same opcodes to compress loads and stores of single-precision
+floating-point values. Similarly, RV128C requires additional opcodes to
+capture loads and stores of 128-bit integer values, while these same
+opcodes are used for loads and stores of double-precision floating-point
+values in RV32C and RV64C. If the C extension is implemented, the
+appropriate compressed floating-point load and store instructions must
+be provided whenever the relevant standard floating-point extension (F
+and/or D) is also implemented. In addition, RV32C includes a compressed
+jump and link instruction to compress short-range subroutine calls,
+where the same opcode is used to compress ADDIW for RV64C and RV128C.
+
+Double-precision loads and stores are a significant fraction of static
+and dynamic instructions, hence the motivation to include them in the
+RV32C and RV64C encoding.
+
+Although single-precision loads and stores are not a significant source
+of static or dynamic compression for benchmarks compiled for the
+currently supported ABIs, for microcontrollers that only provide
+hardware single-precision floating-point units and have an ABI that only
+supports single-precision floating-point numbers, the single-precision
+loads and stores will be used at least as frequently as double-precision
+loads and stores in the measured benchmarks. Hence, the motivation to
+provide compressed support for these in RV32C.
+
+Short-range subroutine calls are more likely in small binaries for
+microcontrollers, hence the motivation to include these in RV32C.
+
+Although reusing opcodes for different purposes for different base ISAs
+adds some complexity to documentation, the impact on implementation
+complexity is small even for designs that support multiple base ISAs.
+The compressed floating-point load and store variants use the same
+instruction format with the same register specifiers as the wider
+integer loads and stores.
+
+RVC was designed under the constraint that each RVC instruction expands
+into a single 32-bit instruction in either the base ISA (RV32I/E, RV64I,
+or RV128I) or the F and D standard extensions where present. Adopting
+this constraint has two main benefits:
+
+* Hardware designs can simply expand RVC instructions during decode,
+simplifying verification and minimizing modifications to existing
+microarchitectures.
+* Compilers can be unaware of the RVC extension and leave code compression
+to the assembler and linker, although a compression-aware compiler will
+generally be able to produce better results.
+
+We felt the multiple complexity reductions of a simple one-one mapping
+between C and base IFD instructions far outweighed the potential gains
+of a slightly denser encoding that added additional instructions only
+supported in the C extension, or that allowed encoding of multiple IFD
+instructions in one C instruction.
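The one-to-one expansion property can be illustrated for a single case, C.ADDI (quadrant `01`, funct3 `000`): every such 16-bit instruction maps to exactly one 32-bit `ADDI rd, rd, imm`. The sketch below follows the RVC bit layout (imm[5] in bit 12, _rd_ in bits 11:7, imm[4:0] in bits 6:2); it is illustrative only and does not distinguish the C.NOP/hint encodings with _rd_=`x0` or a zero immediate:

```c
#include <assert.h>
#include <stdint.h>

/* Expand a 16-bit C.ADDI into the 32-bit ADDI it maps to.
   Returns 0 for inputs that are not quadrant-1/funct3-000. */
uint32_t expand_c_addi(uint16_t c) {
    if ((c & 0x3) != 0x1 || ((c >> 13) & 0x7) != 0x0)
        return 0;                         /* not a C.ADDI encoding */
    uint32_t rd  = (c >> 7) & 0x1f;       /* rd/rs1, bits 11:7 */
    int32_t  imm = ((c >> 2) & 0x1f) | (((c >> 12) & 1) << 5);
    if (imm & 0x20)
        imm |= ~0x3f;                     /* sign-extend from bit 5 */
    /* ADDI rd, rd, imm : imm[11:0] | rs1 | funct3=000 | rd | 0010011 */
    return ((uint32_t)(imm & 0xfff) << 20) | (rd << 15) | (rd << 7) | 0x13;
}
```

For example, `c.addi s0, 1` (0x0405) expands to `addi s0, s0, 1` (0x00140413), exercising the single-instruction expansion a decoder would perform.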
+
+It is important to note that the C extension is not designed to be a
+stand-alone ISA, and is meant to be used alongside a base ISA.
+
+Variable-length instruction sets have long been used to improve code
+density. For example, the IBM Stretch cite:[stretch], developed in the late 1950s, had
+an ISA with 32-bit and 64-bit instructions, where some of the 32-bit
+instructions were compressed versions of the full 64-bit instructions.
+Stretch also employed the concept of limiting the set of registers that
+were addressable in some of the shorter instruction formats, with short
+branch instructions that could only refer to one of the index registers.
+The later IBM 360 architecture cite:[ibm360] supported a simple variable-length
+instruction encoding with 16-bit, 32-bit, or 48-bit instruction formats.
+
+In 1963, CDC introduced the Cray-designed CDC 6600 cite:[cdc6600], a precursor to RISC
+architectures, that introduced a register-rich load-store architecture
+with instructions of two lengths, 15-bits and 30-bits. The later Cray-1
+design used a very similar instruction format, with 16-bit and 32-bit
+instruction lengths.
+
+The initial RISC ISAs from the 1980s all picked performance over code
+size, which was reasonable for a workstation environment, but not for
+embedded systems. Hence, both ARM and MIPS subsequently made versions of
+the ISAs that offered smaller code size by offering an alternative
+16-bit wide instruction set instead of the standard 32-bit wide
+instructions. The compressed RISC ISAs reduced code size relative to
+their starting points by about 25–30%, yielding code that was
+significantly _smaller_ than 80x86. This result surprised some, as their
+intuition was that the variable-length CISC ISA should be smaller than
+RISC ISAs that offered only 16-bit and 32-bit formats.
+
+Since the original RISC ISAs did not leave sufficient opcode space free
+to include these unplanned compressed instructions, they were instead
+developed as complete new ISAs. This meant compilers needed different
+code generators for the separate compressed ISAs. The first compressed
+RISC ISA extensions (e.g., ARM Thumb and MIPS16) used only a fixed
+16-bit instruction size, which gave good reductions in static code size
+but caused an increase in dynamic instruction count, which led to lower
+performance compared to the original fixed-width 32-bit instruction
+size. This led to the development of a second generation of compressed
+RISC ISA designs with mixed 16-bit and 32-bit instruction lengths (e.g.,
+ARM Thumb2, microMIPS, PowerPC VLE), so that performance was similar to
+pure 32-bit instructions but with significant code size savings.
+Unfortunately, these different generations of compressed ISAs are
+incompatible with each other and with the original uncompressed ISA,
+leading to significant complexity in documentation, implementations, and
+software tools support.
+
+Of the commonly used 64-bit ISAs, only PowerPC and microMIPS currently
+support a compressed instruction format. It is surprising that the most
+popular 64-bit ISA for mobile platforms (ARM v8) does not include a
+compressed instruction format given that static code size and dynamic
+instruction fetch bandwidth are important metrics. Although static code
+size is not a major concern in larger systems, instruction fetch
+bandwidth can be a major bottleneck in servers running commercial
+workloads, which often have a large instruction working set.
+
+Benefiting from 25 years of hindsight, RISC-V was designed to support
+compressed instructions from the outset, leaving enough opcode space for
+RVC to be added as a simple extension on top of the base ISA (along with
+many other extensions). The philosophy of RVC is to reduce code size for
+embedded applications _and_ to improve performance and energy-efficiency
+for all applications due to fewer misses in the instruction cache.
+Waterman shows that RVC fetches 25%-30% fewer instruction bits, which
+reduces instruction cache misses by 20%-25%, or roughly the same
+performance impact as doubling the instruction cache size.
+
+=== Compressed Instruction Formats
+(((compressed, formats)))
+
+Table <<rvc-form>> shows the nine compressed instruction
+formats. CR, CI, and CSS can use any of the 32 RVI registers, but CIW,
+CL, CS, CA, and CB are limited to just 8 of them.
+Table <<registers>> lists these popular registers, which
+correspond to registers `x8` to `x15`. Note that there is a separate
+version of load and store instructions that use the stack pointer as the
+base address register, since saving to and restoring from the stack are
+so prevalent, and that they use the CI and CSS formats to allow access
+to all 32 data registers. CIW supplies an 8-bit immediate for the
+ADDI4SPN instruction.
+
+The RISC-V ABI was changed to make the frequently used registers map to
+registers `x8`–`x15`. This simplifies the decompression decoder by
+having a contiguous naturally aligned set of register numbers, and is
+also compatible with the RV32E base ISA, which only has 16 integer
+registers.
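As a non-normative illustration of how cheap this decompression step is, the mapping from a 3-bit compressed register specifier to the full 5-bit register number is a single add (the function name here is illustrative, not part of the specification):

```python
def decompress_reg(rvc_reg: int) -> int:
    """Map a 3-bit RVC register specifier (0-7) to the full 5-bit
    integer register number, i.e. x8-x15."""
    assert 0 <= rvc_reg <= 7
    return 8 + rvc_reg
```

For example, specifier `0b010` names `x10` (ABI name `a0`).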
+(((compressed, loads and stores)))
+
+Compressed register-based floating-point loads and stores also use the
+CL and CS formats respectively, with the eight registers mapping to `f8`
+to `f15`.
+(((calling convention, standard)))
+
+The standard RISC-V calling convention maps the most frequently used
+floating-point registers to registers `f8` to `f15`, which allows the
+same register decompression decoding as for integer register numbers.
+(((register source specifiers, c-ext)))
+
+The formats were designed to keep bits for the two register source
+specifiers in the same place in all instructions, while the destination
+register field can move. When the full 5-bit destination register
+specifier is present, it is in the same place as in the 32-bit RISC-V
+encoding. Where immediates are sign-extended, the sign-extension is
+always from bit 12. Immediate fields have been scrambled, as in the base
+specification, to reduce the number of immediate muxes required.
+
+The immediate fields are scrambled in the instruction formats instead of
+in sequential order so that as many bits as possible are in the same
+position in every instruction, thereby simplifying implementations.
+
+For many RVC instructions, zero-valued immediates are disallowed and
+`x0` is not a valid 5-bit register specifier. These restrictions free up
+encoding space for other instructions requiring fewer operand bits.
+
+include::images/wavedrom/cr-register.adoc[]
+(((compressed, 16-bit)))
+
+[[rvc-form]]
+.Compressed 16-bit RVC instruction formats
+[cols="<,<,<",]
+|===
+|Format |Meaning |Field layout (bits 15 ... 0)
+
+|CR |Register |funct4 (15:12), rd/rs1 (11:7), rs2 (6:2), op (1:0)
+|CI |Immediate |funct3 (15:13), imm (12), rd/rs1 (11:7), imm (6:2), op (1:0)
+|CSS |Stack-relative Store |funct3 (15:13), imm (12:7), rs2 (6:2), op (1:0)
+|CIW |Wide Immediate |funct3 (15:13), imm (12:5), rd′ (4:2), op (1:0)
+|CL |Load |funct3 (15:13), imm (12:10), rs1′ (9:7), imm (6:5), rd′ (4:2), op (1:0)
+|CS |Store |funct3 (15:13), imm (12:10), rs1′ (9:7), imm (6:5), rs2′ (4:2), op (1:0)
+|CA |Arithmetic |funct6 (15:10), rd′/rs1′ (9:7), funct2 (6:5), rs2′ (4:2), op (1:0)
+|CB |Branch/Arithmetic |funct3 (15:13), offset (12:10), rd′/rs1′ (9:7), offset (6:2), op (1:0)
+|CJ |Jump |funct3 (15:13), jump target (12:2), op (1:0)
+|===
+
+[[registers]]
+.Registers specified by the three-bit _rs1′_, _rs2′_, and _rd′_ fields
+of the CIW, CL, CS, CA, and CB formats.
+[cols="<,^,^,^,^,^,^,^,^",]
+|===
+|RVC Register Number |000 |001 |010 |011 |100 |101 |110 |111
+
+|Integer Register Number |`x8` |`x9` |`x10` |`x11` |`x12` |`x13` |`x14`|`x15`
+
+|Integer Register ABI Name |`s0` |`s1` |`a0` |`a1` |`a2` |`a3` |`a4`|`a5`
+
+|Floating-Point Register Number |`f8` |`f9` |`f10` |`f11` |`f12` |`f13`|`f14` |`f15`
+
+|Floating-Point Register ABI Name |`fs0` |`fs1` |`fa0` |`fa1` |`fa2`|`fa3` |`fa4` |`fa5`
+|===
+
+
+=== Load and Store Instructions
+
+To increase the reach of 16-bit instructions, data-transfer instructions
+use zero-extended immediates that are scaled by the size of the data in
+bytes: ×4 for words, ×8 for double words, and ×16 for quad words.
+
+RVC provides two variants of loads and stores. One uses the ABI stack
+pointer, `x2`, as the base address and can target any data register. The
+other can reference one of 8 base address registers and one of 8 data
+registers.
+
+==== Stack-Pointer-Based Loads and Stores
+
+include::images/wavedrom/c-sp-load-store.adoc[]
+
+These instructions use the CI format.
+
+C.LWSP loads a 32-bit value from memory into register _rd_. It computes
+an effective address by adding the _zero_-extended offset, scaled by 4,
+to the stack pointer, `x2`. It expands to `lw rd, offset(x2)`. C.LWSP is
+only valid when latexmath:[$\textit{rd}{\neq}\texttt{x0}$]; the code
+points with latexmath:[$\textit{rd}{=}\texttt{x0}$] are reserved.
+(((operations, memory)))
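As a non-normative sketch of the CI-format offset reassembly for C.LWSP, assuming the standard field placement (inst[12] holds offset[5], inst[6:4] holds offset[4:2], and inst[3:2] holds offset[7:6]):

```python
def decode_c_lwsp(inst: int):
    """Extract rd and the zero-extended byte offset from a 16-bit
    C.LWSP encoding, reassembling the scattered offset bits:
    inst[12]=offset[5], inst[6:4]=offset[4:2], inst[3:2]=offset[7:6]."""
    rd = (inst >> 7) & 0x1F
    offset = (((inst >> 12) & 0x1) << 5) \
           | (((inst >> 4) & 0x7) << 2) \
           | (((inst >> 2) & 0x3) << 6)
    return rd, offset
```

The load then reads the word at `x2 + offset`; recall that `rd` must be nonzero for the encoding to be valid.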
+
+C.LDSP is an RV64C/RV128C-only instruction that loads a 64-bit value
+from memory into register _rd_. It computes its effective address by
+adding the zero-extended offset, scaled by 8, to the stack pointer,
+`x2`. It expands to `ld rd, offset(x2)`. C.LDSP is only valid when
+latexmath:[$\textit{rd}{\neq}\texttt{x0}$]; the code points with
+latexmath:[$\textit{rd}{=}\texttt{x0}$] are reserved.
+
+C.LQSP is an RV128C-only instruction that loads a 128-bit value from
+memory into register _rd_. It computes its effective address by adding
+the zero-extended offset, scaled by 16, to the stack pointer, `x2`. It
+expands to `lq rd, offset(x2)`. C.LQSP is only valid when
+latexmath:[$\textit{rd}{\neq}\texttt{x0}$]; the code points with
+latexmath:[$\textit{rd}{=}\texttt{x0}$] are reserved.
+
+C.FLWSP is an RV32FC-only instruction that loads a single-precision
+floating-point value from memory into floating-point register _rd_. It
+computes its effective address by adding the _zero_-extended offset,
+scaled by 4, to the stack pointer, `x2`. It expands to
+`flw rd, offset(x2)`.
+
+C.FLDSP is an RV32DC/RV64DC-only instruction that loads a
+double-precision floating-point value from memory into floating-point
+register _rd_. It computes its effective address by adding the
+_zero_-extended offset, scaled by 8, to the stack pointer, `x2`. It
+expands to `fld rd, offset(x2)`.
+
+include::images/wavedrom/sp-base-ls-2.adoc[]
+
+These instructions use the CSS format.
+
+C.SWSP stores a 32-bit value in register _rs2_ to memory. It computes an
+effective address by adding the _zero_-extended offset, scaled by 4, to
+the stack pointer, `x2`. It expands to `sw rs2, offset(x2)`.
+
+C.SDSP is an RV64C/RV128C-only instruction that stores a 64-bit value in
+register _rs2_ to memory. It computes an effective address by adding the
+_zero_-extended offset, scaled by 8, to the stack pointer, `x2`. It
+expands to `sd rs2, offset(x2)`.
+
+C.SQSP is an RV128C-only instruction that stores a 128-bit value in
+register _rs2_ to memory. It computes an effective address by adding the
+_zero_-extended offset, scaled by 16, to the stack pointer, `x2`. It
+expands to `sq rs2, offset(x2)`.
+
+C.FSWSP is an RV32FC-only instruction that stores a single-precision
+floating-point value in floating-point register _rs2_ to memory. It
+computes an effective address by adding the _zero_-extended offset,
+scaled by 4, to the stack pointer, `x2`. It expands to
+`fsw rs2, offset(x2)`.
+
+C.FSDSP is an RV32DC/RV64DC-only instruction that stores a
+double-precision floating-point value in floating-point register _rs2_
+to memory. It computes an effective address by adding the
+_zero_-extended offset, scaled by 8, to the stack pointer, `x2`. It
+expands to `fsd rs2, offset(x2)`.
+
+Register save/restore code at function entry/exit represents a
+significant portion of static code size. The stack-pointer-based
+compressed loads and stores in RVC are effective at reducing the
+save/restore static code size by a factor of 2 while improving
+performance by reducing dynamic instruction bandwidth.
+
+A common mechanism used in other ISAs to further reduce save/restore
+code size is load-multiple and store-multiple instructions. We
+considered adopting these for RISC-V but noted the following drawbacks
+to these instructions:
+
+* These instructions complicate processor implementations.
+* For virtual memory systems, some data accesses could be resident in
+physical memory and some could not, which requires a new restart
+mechanism for partially executed instructions.
+* Unlike the rest of the RVC instructions, there is no IFD equivalent to
+Load Multiple and Store Multiple.
+* Unlike the rest of the RVC instructions, the compiler would have to be
+aware of these instructions to both generate the instructions and to
+allocate registers in an order that maximizes the chances of them being
+saved and restored, since they would be saved and restored in sequential
+order.
+* Simple microarchitectural implementations will constrain how other
+instructions can be scheduled around the load and store multiple
+instructions, leading to a potential performance loss.
+* The desire for sequential register allocation might conflict with the
+featured registers selected for the CIW, CL, CS, CA, and CB formats.
+
+Furthermore, much of the gains can be realized in software by replacing
+prologue and epilogue code with subroutine calls to common prologue and
+epilogue code, a technique described in Section 5.6 of .
+
+While reasonable architects might come to different conclusions, we
+decided to omit load and store multiple and instead use the
+software-only approach of calling save/restore millicode routines to
+attain the greatest code size reduction.
+
+==== Register-Based Loads and Stores
+
+[cols="<,^,^,^,^,^"]
+|===
+|funct3 (15:13) |imm (12:10) |rs1′ (9:7) |imm (6:5) |rd′ (4:2) |op (1:0)
+
+|C.LW |offset[5:3] |base |offset[2\|6] |dest |C0
+|C.LD |offset[5:3] |base |offset[7:6] |dest |C0
+|C.LQ |offset[5\|4\|8] |base |offset[7:6] |dest |C0
+|C.FLW |offset[5:3] |base |offset[2\|6] |dest |C0
+|C.FLD |offset[5:3] |base |offset[7:6] |dest |C0
+|===
+
+These instructions use the CL format.
+
+C.LW loads a 32-bit value from memory into register
+_rd latexmath:[$'$]_. It computes an effective address by adding the
+_zero_-extended offset, scaled by 4, to the base address in register
+_rs1 latexmath:[$'$]_. It expands to `lw rd', offset(rs1')`.
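Using the CL-format field positions from the encoding table above (offset[5:3] in inst[12:10], offset[2] in inst[6], offset[6] in inst[5]), a non-normative C.LW field extractor can be sketched as:

```python
def decode_c_lw(inst: int):
    """Extract (rd, rs1, offset) from a 16-bit C.LW encoding.
    The 3-bit register specifiers name x8-x15, so add 8; the
    zero-extended offset is reassembled from its scattered bits."""
    rs1 = 8 + ((inst >> 7) & 0x7)   # base address register, x8-x15
    rd = 8 + ((inst >> 2) & 0x7)    # destination register, x8-x15
    offset = (((inst >> 10) & 0x7) << 3) \
           | (((inst >> 6) & 0x1) << 2) \
           | (((inst >> 5) & 0x1) << 6)
    return rd, rs1, offset
```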
+
+C.LD is an RV64C/RV128C-only instruction that loads a 64-bit value from
+memory into register _rd latexmath:[$'$]_. It computes an effective
+address by adding the _zero_-extended offset, scaled by 8, to the base
+address in register _rs1 latexmath:[$'$]_. It expands to
+`ld rd', offset(rs1')`.
+
+C.LQ is an RV128C-only instruction that loads a 128-bit value from
+memory into register _rd latexmath:[$'$]_. It computes an effective
+address by adding the _zero_-extended offset, scaled by 16, to the base
+address in register _rs1 latexmath:[$'$]_. It expands to
+`lq rd', offset(rs1')`.
+
+C.FLW is an RV32FC-only instruction that loads a single-precision
+floating-point value from memory into floating-point register
+_rd latexmath:[$'$]_. It computes an effective address by adding the
+_zero_-extended offset, scaled by 4, to the base address in register
+_rs1 latexmath:[$'$]_. It expands to `flw rd', offset(rs1')`.
+
+C.FLD is an RV32DC/RV64DC-only instruction that loads a double-precision
+floating-point value from memory into floating-point register
+_rd latexmath:[$'$]_. It computes an effective address by adding the
+_zero_-extended offset, scaled by 8, to the base address in register
+_rs1 latexmath:[$'$]_. It expands to `fld rd', offset(rs1')`.
+
+[cols="<,^,^,^,^,^"]
+|===
+|funct3 (15:13) |imm (12:10) |rs1′ (9:7) |imm (6:5) |rs2′ (4:2) |op (1:0)
+
+|C.SW |offset[5:3] |base |offset[2\|6] |src |C0
+|C.SD |offset[5:3] |base |offset[7:6] |src |C0
+|C.SQ |offset[5\|4\|8] |base |offset[7:6] |src |C0
+|C.FSW |offset[5:3] |base |offset[2\|6] |src |C0
+|C.FSD |offset[5:3] |base |offset[7:6] |src |C0
+|===
+
+These instructions use the CS format.
+
+C.SW stores a 32-bit value in register _rs2 latexmath:[$'$]_ to memory.
+It computes an effective address by adding the _zero_-extended offset,
+scaled by 4, to the base address in register _rs1 latexmath:[$'$]_. It
+expands to `sw rs2', offset(rs1')`.
+
+C.SD is an RV64C/RV128C-only instruction that stores a 64-bit value in
+register _rs2 latexmath:[$'$]_ to memory. It computes an effective
+address by adding the _zero_-extended offset, scaled by 8, to the base
+address in register _rs1 latexmath:[$'$]_. It expands to
+`sd rs2, offset(rs1)`.
+
+C.SQ is an RV128C-only instruction that stores a 128-bit value in
+register _rs2 latexmath:[$'$]_ to memory. It computes an effective
+address by adding the _zero_-extended offset, scaled by 16, to the base
+address in register _rs1 latexmath:[$'$]_. It expands to
+`sq rs2', offset(rs1')`.
+
+C.FSW is an RV32FC-only instruction that stores a single-precision
+floating-point value in floating-point register _rs2 latexmath:[$'$]_ to
+memory. It computes an effective address by adding the _zero_-extended
+offset, scaled by 4, to the base address in register
+_rs1 latexmath:[$'$]_. It expands to `fsw rs2', offset(rs1')`.
+
+C.FSD is an RV32DC/RV64DC-only instruction that stores a
+double-precision floating-point value in floating-point register
+_rs2 latexmath:[$'$]_ to memory. It computes an effective address by
+adding the _zero_-extended offset, scaled by 8, to the base address in
+register _rs1 latexmath:[$'$]_. It expands to
+`fsd rs2', offset(rs1')`.
+
+=== Control Transfer Instructions
+
+RVC provides unconditional jump instructions and conditional branch
+instructions. As with base RVI instructions, the offsets of all RVC
+control transfer instructions are in multiples of 2 bytes.
+
+[cols="<,^,^"]
+|===
+|funct3 (15:13) |imm (12:2) |op (1:0)
+
+|C.J |offset[11\|4\|9:8\|10\|6\|7\|3:1\|5] |C1
+|C.JAL |offset[11\|4\|9:8\|10\|6\|7\|3:1\|5] |C1
+|===
+
+These instructions use the CJ format.
+
+C.J performs an unconditional control transfer. The offset is
+sign-extended and added to the `pc` to form the jump target address. C.J
+can therefore target a ±2 KiB range. C.J expands to
+`jal x0, offset`.
+
+C.JAL is an RV32C-only instruction that performs the same operation as
+C.J, but additionally writes the address of the instruction following
+the jump (`pc`+2) to the link register, `x1`. C.JAL expands to
+`jal x1, offset`.
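The CJ immediate scrambling above can be undone with straightforward bit selects. This non-normative sketch reassembles offset[11|4|9:8|10|6|7|3:1|5] from inst[12:2] and sign-extends from offset[11]:

```python
def decode_cj_offset(inst: int) -> int:
    """Reassemble the scrambled CJ-format jump offset held in
    inst[12:2], then sign-extend from bit 11 of the offset."""
    off = (((inst >> 12) & 0x1) << 11) \
        | (((inst >> 11) & 0x1) << 4) \
        | (((inst >> 9) & 0x3) << 8) \
        | (((inst >> 8) & 0x1) << 10) \
        | (((inst >> 7) & 0x1) << 6) \
        | (((inst >> 6) & 0x1) << 7) \
        | (((inst >> 3) & 0x7) << 1) \
        | (((inst >> 2) & 0x1) << 5)
    if off & (1 << 11):          # sign bit is offset[11]
        off -= 1 << 12
    return off
```

Bit 0 of the offset is implicitly zero, since all targets are 2-byte aligned.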
+
+[cols="<,^,^,^"]
+|===
+|funct4 (15:12) |rs1 (11:7) |rs2 (6:2) |op (1:0)
+
+|C.JR |src≠0 |0 |C2
+|C.JALR |src≠0 |0 |C2
+|===
+
+These instructions use the CR format.
+
+C.JR (jump register) performs an unconditional control transfer to the
+address in register _rs1_. C.JR expands to `jalr x0, 0(rs1)`. C.JR is
+only valid when latexmath:[$\textit{rs1}{\neq}\texttt{x0}$]; the code
+point with latexmath:[$\textit{rs1}{=}\texttt{x0}$] is reserved.
+
+C.JALR (jump and link register) performs the same operation as C.JR, but
+additionally writes the address of the instruction following the jump
+(`pc`+2) to the link register, `x1`. C.JALR expands to
+`jalr x1, 0(rs1)`. C.JALR is only valid when
+latexmath:[$\textit{rs1}{\neq}\texttt{x0}$]; the code point with
+latexmath:[$\textit{rs1}{=}\texttt{x0}$] corresponds to the C.EBREAK
+instruction.
+
+Strictly speaking, C.JALR does not expand exactly to a base RVI
+instruction as the value added to the PC to form the link address is 2
+rather than 4 as in the base ISA, but supporting both offsets of 2 and 4
+bytes is only a very minor change to the base microarchitecture.
+
+[cols="<,^,^,^,^"]
+|===
+|funct3 (15:13) |offset (12:10) |rs1′ (9:7) |offset (6:2) |op (1:0)
+
+|C.BEQZ |offset[8\|4:3] |src |offset[7:6\|2:1\|5] |C1
+|C.BNEZ |offset[8\|4:3] |src |offset[7:6\|2:1\|5] |C1
+|===
+
+These instructions use the CB format.
+
+C.BEQZ performs conditional control transfers. The offset is
+sign-extended and added to the `pc` to form the branch target address.
+It can therefore target a ±256 B range. C.BEQZ takes the
+branch if the value in register _rs1 latexmath:[$'$]_ is zero. It
+expands to `beq rs1', x0, offset`.
+
+C.BNEZ is defined analogously, but it takes the branch if
+_rs1 latexmath:[$'$]_ contains a nonzero value. It expands to
+`bne rs1', x0, offset`.
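A non-normative sketch of the CB branch-offset reassembly, using the bit placement given above (offset[8|4:3] in inst[12:10], offset[7:6|2:1|5] in inst[6:2]):

```python
def decode_cb_offset(inst: int) -> int:
    """Reassemble the scrambled CB-format branch offset and
    sign-extend from bit 8 of the offset."""
    off = (((inst >> 12) & 0x1) << 8) \
        | (((inst >> 10) & 0x3) << 3) \
        | (((inst >> 5) & 0x3) << 6) \
        | (((inst >> 3) & 0x3) << 1) \
        | (((inst >> 2) & 0x1) << 5)
    if off & (1 << 8):           # sign bit is offset[8]
        off -= 1 << 9
    return off
```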
+
+=== Integer Computational Instructions
+
+RVC provides several instructions for integer arithmetic and constant
+generation.
+
+==== Integer Constant-Generation Instructions
+
+The two constant-generation instructions both use the CI instruction
+format and can target any integer register.
+
+[cols="<,^,^,^,^"]
+|===
+|funct3 (15:13) |imm (12) |rd (11:7) |imm (6:2) |op (1:0)
+
+|C.LI |imm[5] |dest≠0 |imm[4:0] |C1
+|C.LUI |nzimm[17] |dest≠{0,2} |nzimm[16:12] |C1
+|===
+
+C.LI loads the sign-extended 6-bit immediate, _imm_, into register _rd_.
+C.LI expands into `addi rd, x0, imm`. C.LI is only valid when
+_rd_latexmath:[$\neq$]`x0`; the code points with _rd_=`x0` encode HINTs.
+
+C.LUI loads the non-zero 6-bit immediate field into bits 17–12 of the
+destination register, clears the bottom 12 bits, and sign-extends bit 17
+into all higher bits of the destination. C.LUI expands into
+`lui rd, nzimm`. C.LUI is only valid when
+latexmath:[$\textit{rd}{\neq}{\left\{\texttt{x0},\texttt{x2}\right\}}$],
+and when the immediate is not equal to zero. The code points with
+_nzimm_=0 are reserved; the remaining code points with _rd_=`x0` are
+HINTs; and the remaining code points with _rd_=`x2` correspond to the
+C.ADDI16SP instruction.
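The C.LUI result described above can be sketched non-normatively as follows, where `nzimm6` stands for the raw 6-bit immediate field:

```python
def c_lui_value(nzimm6: int) -> int:
    """Compute the value C.LUI writes to rd: the 6-bit immediate
    lands in bits 17:12, the low 12 bits are cleared, and bit 17
    is sign-extended into all higher bits."""
    assert 0 < nzimm6 < 64       # nzimm = 0 is reserved
    value = nzimm6 << 12
    if value & (1 << 17):        # sign-extend from bit 17
        value -= 1 << 18
    return value
```

So an immediate field of 1 yields `0x1000`, while the all-ones field yields `-0x1000`.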
+
+==== Integer Register-Immediate Operations
+
+These integer register-immediate operations are encoded in the CI format
+and perform operations on an integer register and a 6-bit immediate.
+
+[cols="<,^,^,^,^"]
+|===
+|funct3 (15:13) |imm (12) |rd (11:7) |imm (6:2) |op (1:0)
+
+|C.ADDI |nzimm[5] |dest≠0 |nzimm[4:0] |C1
+|C.ADDIW |imm[5] |dest≠0 |imm[4:0] |C1
+|C.ADDI16SP |nzimm[9] |2 |nzimm[4\|6\|8:7\|5] |C1
+|===
+
+C.ADDI adds the non-zero sign-extended 6-bit immediate to the value in
+register _rd_ then writes the result to _rd_. C.ADDI expands into
+`addi rd, rd, nzimm`. C.ADDI is only valid when
+_rd_latexmath:[$\neq$]`x0` and _nzimm_latexmath:[$\neq$]0. The code
+points with _rd_=`x0` encode the C.NOP instruction; the remaining code
+points with _nzimm_=0 encode HINTs.
+
+C.ADDIW is an RV64C/RV128C-only instruction that performs the same
+computation but produces a 32-bit result, then sign-extends the result
+to 64 bits. C.ADDIW expands into `addiw rd, rd, imm`. The immediate can
+be zero for C.ADDIW, where this corresponds to `sext.w rd`. C.ADDIW is
+only valid when _rd_latexmath:[$\neq$]`x0`; the code points with
+_rd_=`x0` are reserved.
+
+C.ADDI16SP shares the opcode with C.LUI, but has a destination field of
+`x2`. C.ADDI16SP adds the non-zero sign-extended 6-bit immediate to the
+value in the stack pointer (`sp`=`x2`), where the immediate is scaled to
+represent multiples of 16 in the range (-512,496). C.ADDI16SP is used to
+adjust the stack pointer in procedure prologues and epilogues. It
+expands into `addi x2, x2, nzimm`. C.ADDI16SP is only valid when
+_nzimm_latexmath:[$\neq$]0; the code point with _nzimm_=0 is reserved.
+
+In the standard RISC-V calling convention, the stack pointer `sp` is
+always 16-byte aligned.
+
+[cols="<,^,^,^"]
+|===
+|funct3 (15:13) |nzuimm (12:5) |rd′ (4:2) |op (1:0)
+
+|C.ADDI4SPN |nzuimm[5:4\|9:6\|2\|3] |dest |C0
+|===
+
+C.ADDI4SPN is a CIW-format instruction that adds a _zero_-extended
+non-zero immediate, scaled by 4, to the stack pointer, `x2`, and writes
+the result to _rd latexmath:[$'$]_. This instruction is used to generate
+pointers to stack-allocated variables, and expands to
+`addi rd', x2, nzuimm`. C.ADDI4SPN is only valid when
+_nzuimm_latexmath:[$\neq$]0; the code points with _nzuimm_=0 are reserved.
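A non-normative sketch of reassembling C.ADDI4SPN's zero-extended immediate nzuimm[5:4|9:6|2|3] from inst[12:5]:

```python
def decode_addi4spn_imm(inst: int) -> int:
    """Reassemble C.ADDI4SPN's zero-extended immediate from its
    scattered bit fields; the result is already scaled by 4."""
    return (((inst >> 11) & 0x3) << 4) \
         | (((inst >> 7) & 0xF) << 6) \
         | (((inst >> 6) & 0x1) << 2) \
         | (((inst >> 5) & 0x1) << 3)
```

Because bits 1 and 0 of nzuimm are never encoded, the result is always a non-negative multiple of 4, matching the word-aligned stack slots it addresses.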
+
+[cols="<,^,^,^,^"]
+|===
+|funct3 (15:13) |shamt (12) |rd (11:7) |shamt (6:2) |op (1:0)
+
+|C.SLLI |shamt[5] |dest≠0 |shamt[4:0] |C2
+|===
+
+C.SLLI is a CI-format instruction that performs a logical left shift of
+the value in register _rd_ then writes the result to _rd_. The shift
+amount is encoded in the _shamt_ field. For RV128C, a shift amount of
+zero is used to encode a shift of 64. C.SLLI expands into
+`slli rd, rd, shamt`, except for RV128C with `shamt=0`, which expands to
+`slli rd, rd, 64`.
+
+For RV32C, _shamt[5]_ must be zero; the code points with _shamt[5]_=1
+are designated for custom extensions. For RV32C and RV64C, the shift
+amount must be non-zero; the code points with _shamt_=0 are HINTs. For
+all base ISAs, the code points with _rd_=`x0` are HINTs, except those
+with _shamt[5]_=1 in RV32C.
+
+[cols="<,^,^,^,^,^"]
+|===
+|funct3 (15:13) |shamt (12) |funct2 (11:10) |rd′ (9:7) |shamt (6:2) |op (1:0)
+
+|C.SRLI |shamt[5] |C.SRLI |dest |shamt[4:0] |C1
+|C.SRAI |shamt[5] |C.SRAI |dest |shamt[4:0] |C1
+|===
+
+C.SRLI is a CB-format instruction that performs a logical right shift of
+the value in register _rd latexmath:[$'$]_ then writes the result to
+_rd latexmath:[$'$]_. The shift amount is encoded in the _shamt_ field.
+For RV128C, a shift amount of zero is used to encode a shift of 64.
+Furthermore, the shift amount is sign-extended for RV128C, and so the
+legal shift amounts are 1–31, 64, and 96–127. C.SRLI expands into
+`srli rd', rd', shamt`, except for RV128C with `shamt=0`, which
+expands to `srli rd', rd', 64`.
+
+For RV32C, _shamt[5]_ must be zero; the code points with _shamt[5]_=1
+are designated for custom extensions. For RV32C and RV64C, the shift
+amount must be non-zero; the code points with _shamt_=0 are HINTs.
+
+C.SRAI is defined analogously to C.SRLI, but instead performs an
+arithmetic right shift. C.SRAI expands to `srai rd', rd', shamt`.
+
+Left shifts are usually more frequent than right shifts, as left shifts
+are frequently used to scale address values. Right shifts have therefore
+been granted less encoding space and are placed in an encoding quadrant
+where all other immediates are sign-extended. For RV128, the decision
+was made to have the 6-bit shift-amount immediate also be sign-extended.
+Apart from reducing the decode complexity, we believe right-shift
+amounts of 96–127 will be more useful than 64–95, to allow extraction of
+tags located in the high portions of 128-bit address pointers. We note
+that RV128C will not be frozen at the same point as RV32C and RV64C, to
+allow evaluation of typical usage of 128-bit address-space codes.
+
+[cols="<,^,^,^,^,^"]
+|===
+|funct3 (15:13) |imm (12) |funct2 (11:10) |rd′ (9:7) |imm (6:2) |op (1:0)
+
+|C.ANDI |imm[5] |C.ANDI |dest |imm[4:0] |C1
+|===
+
+C.ANDI is a CB-format instruction that computes the bitwise AND of the
+value in register _rd latexmath:[$'$]_ and the sign-extended 6-bit
+immediate, then writes the result to _rd latexmath:[$'$]_. C.ANDI
+expands to `andi rd', rd', imm`.
+
+==== Integer Register-Register Operations
+
+[cols="<,^,^,^"]
+|===
+|funct4 (15:12) |rd (11:7) |rs2 (6:2) |op (1:0)
+
+|C.MV |dest≠0 |src≠0 |C2
+|C.ADD |dest≠0 |src≠0 |C2
+|===
+
+These instructions use the CR format.
+
+C.MV copies the value in register _rs2_ into register _rd_. C.MV expands
+into `add rd, x0, rs2`. C.MV is only valid when
+latexmath:[$\textit{rs2}{\neq}\texttt{x0}$]; the code points with
+latexmath:[$\textit{rs2}{=}\texttt{x0}$] correspond to the C.JR
+instruction. The code points with
+latexmath:[$\textit{rs2}{\neq}\texttt{x0}$] and
+latexmath:[$\textit{rd}{=}\texttt{x0}$] are HINTs.
+
+C.MV expands to a different instruction than the canonical MV
+pseudoinstruction, which instead uses ADDI. Implementations that handle
+MV specially, e.g. using register-renaming hardware, may find it more
+convenient to expand C.MV to MV instead of ADD, at slight additional
+hardware cost.
+
+C.ADD adds the values in registers _rd_ and _rs2_ and writes the result
+to register _rd_. C.ADD expands into `add rd, rd, rs2`. C.ADD is only
+valid when latexmath:[$\textit{rs2}{\neq}\texttt{x0}$]; the code points
+with latexmath:[$\textit{rs2}{=}\texttt{x0}$] correspond to the C.JALR
+and C.EBREAK instructions. The code points with
+latexmath:[$\textit{rs2}{\neq}\texttt{x0}$] and
+latexmath:[$\textit{rd}{=}\texttt{x0}$] are HINTs.
+
+[cols="<,^,^,^,^"]
+|===
+|funct6 (15:10) |rd′/rs1′ (9:7) |funct2 (6:5) |rs2′ (4:2) |op (1:0)
+
+|C.AND |dest |C.AND |src |C1
+|C.OR |dest |C.OR |src |C1
+|C.XOR |dest |C.XOR |src |C1
+|C.SUB |dest |C.SUB |src |C1
+|C.ADDW |dest |C.ADDW |src |C1
+|C.SUBW |dest |C.SUBW |src |C1
+|===
+
+These instructions use the CA format.
+
+C.AND computes the bitwise AND of the values in registers
+_rd latexmath:[$'$]_ and _rs2 latexmath:[$'$]_, then writes the result
+to register _rd latexmath:[$'$]_. C.AND expands into
+`and rd', rd', rs2'`.
+
+C.OR computes the bitwise OR of the values in registers
+_rd latexmath:[$'$]_ and _rs2 latexmath:[$'$]_, then writes the result
+to register _rd latexmath:[$'$]_. C.OR expands into
+`or rd', rd', rs2'`.
+
+C.XOR computes the bitwise XOR of the values in registers
+_rd latexmath:[$'$]_ and _rs2 latexmath:[$'$]_, then writes the result
+to register _rd latexmath:[$'$]_. C.XOR expands into
+`xor rd', rd', rs2'`.
+
+C.SUB subtracts the value in register _rs2 latexmath:[$'$]_ from the
+value in register _rd latexmath:[$'$]_, then writes the result to
+register _rd latexmath:[$'$]_. C.SUB expands into
+`sub rd', rd', rs2'`.
+
+C.ADDW is an RV64C/RV128C-only instruction that adds the values in
+registers _rd latexmath:[$'$]_ and _rs2 latexmath:[$'$]_, then
+sign-extends the lower 32 bits of the sum before writing the result to
+register _rd latexmath:[$'$]_. C.ADDW expands into
+`addw rd', rd', rs2'`.
+
+C.SUBW is an RV64C/RV128C-only instruction that subtracts the value in
+register _rs2 latexmath:[$'$]_ from the value in register
+_rd latexmath:[$'$]_, then sign-extends the lower 32 bits of the
+difference before writing the result to register _rd latexmath:[$'$]_.
+C.SUBW expands into `subw rd', rd', rs2'`.
+
+This group of six instructions does not provide large savings
+individually, but the instructions do not occupy much encoding space,
+are straightforward to implement, and as a group provide a worthwhile
+improvement in static and dynamic compression.
+
+==== Defined Illegal Instruction
+
+[cols="^,^,^,^,^"]
+|===
+|15:13 |12 |11:7 |6:2 |1:0
+
+|0 |0 |0 |0 |0
+|===
+
+A 16-bit instruction with all bits zero is permanently reserved as an
+illegal instruction.
+
+We reserve all-zero instructions to be illegal instructions to help trap
+attempts to execute zero-ed or non-existent portions of the memory
+space. The all-zero value should not be redefined in any non-standard
+extension. Similarly, we reserve instructions with all bits set to 1
+(corresponding to very long instructions in the RISC-V variable-length
+encoding scheme) as illegal to capture another common value seen in
+non-existent memory regions.
+
+==== NOP Instruction
+
+[cols="<,^,^,^,^"]
+|===
+|funct3 (15:13) |imm (12) |rd (11:7) |imm (6:2) |op (1:0)
+
+|C.NOP |0 |0 |0 |C1
+|===
+
+C.NOP is a CI-format instruction that does not change any user-visible
+state, except for advancing the `pc` and incrementing any applicable
+performance counters. C.NOP expands to `nop`. C.NOP is only valid when
+_imm_=0; the code points with _imm_latexmath:[$\neq$]0 encode HINTs.
+
+==== Breakpoint Instruction
+
+[cols="<,^,^"]
+|===
+|funct4 (15:12) |11:2 |op (1:0)
+
+|C.EBREAK |0 |C2
+|===
+
+Debuggers can use the C.EBREAK instruction, which expands to `ebreak`,
+to cause control to be transferred back to the debugging environment.
+C.EBREAK shares the opcode with the C.ADD instruction, but with _rd_ and
+_rs2_ both zero, thus can also use the CR format.
+
+=== Usage of C Instructions in LR/SC Sequences
+
+On implementations that support the C extension, compressed forms of
+the I instructions permitted inside constrained LR/SC sequences (see
+<<lrscseq>>) are also permitted inside constrained LR/SC sequences.
+
+The implication is that any implementation that claims to support both
+the A and C extensions must ensure that LR/SC sequences containing valid
+C instructions will eventually complete.
+
+[[rvc-hints]]
+=== HINT Instructions
+
+A portion of the RVC encoding space is reserved for microarchitectural
+HINTs. Like the HINTs in the RV32I base ISA (see
+<<rv32i-hints>>), these instructions do not
+modify any architectural state, except for advancing the `pc` and any
+applicable performance counters. HINTs are executed as no-ops on
+implementations that ignore them.
+
+RVC HINTs are encoded as computational instructions that do not modify
+the architectural state, either because _rd_=`x0` (e.g.
+C.ADD _x0_, _t0_), or because _rd_ is overwritten with a copy of itself
+(e.g. C.ADDI _t0_, 0).
+
+This HINT encoding has been chosen so that simple implementations can
+ignore HINTs altogether, and instead execute a HINT as a regular
+computational instruction that happens not to mutate the architectural
+state.
+
+RVC HINTs do not necessarily expand to their RVI HINT counterparts. For
+example, C.ADD _x0_, _t0_ might not encode the same HINT as
+ADD _x0_, _x0_, _t0_.
+
+The primary reason to not require an RVC HINT to expand to an RVI HINT
+is that HINTs are unlikely to be compressible in the same manner as the
+underlying computational instruction. Also, decoupling the RVC and RVI
+HINT mappings allows the scarce RVC HINT space to be allocated to the
+most popular HINTs, and in particular, to HINTs that are amenable to
+macro-op fusion.
+
+<<rvc-t-hints>> lists all RVC HINT code points. For RV32C, 78%
+of the HINT space is reserved for standard HINTs, but none are presently
+defined. The remainder of the HINT space is designated for custom HINTs;
+no standard HINTs will ever be defined in this subspace.
+
+[[rvc-t-hints]]
+.RVC HINT instructions.
+//[cols="<,<,>,<",options="header",]
+|===
+|Instruction |Constraints |Code Points |Purpose
+
+|C.NOP |_nzimm_latexmath:[$\neq$]0 |63 |_Reserved for future standard use_
+
+|C.ADDI |_rd_latexmath:[$\neq$]`x0`, _nzimm_=0 |31 |
+
+|C.LI |_rd_=`x0` |64 |
+
+|C.LUI |_rd_=`x0`, _nzimm_latexmath:[$\neq$]0 |63 |
+
+|C.MV |_rd_=`x0`, _rs2_latexmath:[$\neq$]`x0` |31 |
+
+|C.ADD |_rd_=`x0`, _rs2_latexmath:[$\neq$]`x0` |31 |
+
+|C.SLLI |_rd_=`x0`, _nzimm_latexmath:[$\neq$]0 |31 (RV32) |_Designated for custom use_
+
+| | |63 (RV64/128) |
+
+|C.SLLI64 |_rd_=`x0` |1 |
+
+|C.SLLI64 |_rd_latexmath:[$\neq$]`x0`, RV32 and RV64 only |31 |
+
+|C.SRLI64 |RV32 and RV64 only |8 |
+
+|C.SRAI64 |RV32 and RV64 only |8 |
+|===
+
+=== RVC Instruction Set Listings
+
+<<rvcopcodemap>> shows a map of the major
+opcodes for RVC. Each row of the table corresponds to one quadrant of
+the encoding space. The last quadrant, which has the two
+least-significant bits set, corresponds to instructions wider than 16
+bits, including those in the base ISAs. Several instructions are only
+valid for certain operands; when invalid, they are marked either _RES_
+to indicate that the opcode is reserved for future standard extensions;
+_Custom_ to indicate that the opcode is designated for custom
+extensions; or _HINT_ to indicate that the opcode is reserved for
+microarchitectural hints (see <<rvc-hints>>).
+
+//[cols=">,^,^,^,^,^,^,^,^,<",]
+[[rvcopcodemap]]
+.RVC opcode map
+|===
+|inst[15:13] |000 |001 |010 |011 |100 |101 |110 |111 |
+|inst[1:0] | | | | | | | | |
+|00 |ADDI4SPN |FLD |LW |FLW |_Reserved_ |FSD |SW |FSW |RV32
+| | |FLD | |LD | |FSD | |SD |RV64
+| | |LQ | |LD | |SQ | |SD |RV128
+|01 |ADDI |JAL |LI |LUI/ADDI16SP |MISC-ALU |J |BEQZ |BNEZ |RV32
+| | |ADDIW | | | | | | |RV64
+| | |ADDIW | | | | | | |RV128
+|10 |SLLI |FLDSP |LWSP |FLWSP |J[AL]R/MV/ADD |FSDSP |SWSP |FSWSP |RV32
+| | |FLDSP | |LDSP | |FSDSP | |SDSP |RV64
+| | |LQSP | |LDSP | |SQSP | |SDSP |RV128
+|11 |latexmath:[$>$]16b | | | | | | | |
+|===
+
+<<rvc-instr-table0>>, <<rvc-instr-table1>>, and <<rvc-instr-table2>> list the
+RVC instructions.
+
+[[rvc-instr-table0]]
+.Instruction listing for RVC, Quadrant 0
+//[cols="<,<,<,<,<,<,<,<,<,<,<,<,<,<,<,<,<,<",]
+|===
+
+|000| |0 | | | | | | | |0 | | |00 | |_Illegal instruction_
+
+|000|
+|nzuimm[5:4latexmath:[$\vert$]9:6latexmath:[$\vert$]2latexmath:[$\vert$]3]
+| | | | | | | |rd latexmath:[$'$] | | |00 | |C.ADDI4SPN _(RES,
+nzuimm=0)_
+
+|001| |uimm[5:3] | | |rs1 latexmath:[$'$] | | |uimm[7:6] |
+|rd latexmath:[$'$] | | |00 | |C.FLD _(RV32/64)_
+
+|001| |uimm[5:4latexmath:[$\vert$]8] | | |rs1 latexmath:[$'$] | |
+|uimm[7:6] | |rd latexmath:[$'$] | | |00 | |C.LQ _(RV128)_
+
+|010| |uimm[5:3] | | |rs1 latexmath:[$'$] | |
+|uimm[2latexmath:[$\vert$]6] | |rd latexmath:[$'$] | | |00 | |C.LW
+
+|011| |uimm[5:3] | | |rs1 latexmath:[$'$] | |
+|uimm[2latexmath:[$\vert$]6] | |rd latexmath:[$'$] | | |00 | |C.FLW
+_(RV32)_
+
+|011| |uimm[5:3] | | |rs1 latexmath:[$'$] | | |uimm[7:6] |
+|rd latexmath:[$'$] | | |00 | |C.LD _(RV64/128)_
+
+|100| |— | | | | | | | | | | |00 | |_Reserved_
+
+|101| |uimm[5:3] | | |rs1 latexmath:[$'$] | | |uimm[7:6] |
+|rs2 latexmath:[$'$] | | |00 | |C.FSD _(RV32/64)_
+
+|101| |uimm[5:4latexmath:[$\vert$]8] | | |rs1 latexmath:[$'$] | |
+|uimm[7:6] | |rs2 latexmath:[$'$] | | |00 | |C.SQ _(RV128)_
+
+|110| |uimm[5:3] | | |rs1 latexmath:[$'$] | |
+|uimm[2latexmath:[$\vert$]6] | |rs2 latexmath:[$'$] | | |00 | |C.SW
+
+|111| |uimm[5:3] | | |rs1 latexmath:[$'$] | |
+|uimm[2latexmath:[$\vert$]6] | |rs2 latexmath:[$'$] | | |00 | |C.FSW
+_(RV32)_
+
+|111| |uimm[5:3] | | |rs1 latexmath:[$'$] | | |uimm[7:6] |
+|rs2 latexmath:[$'$] | | |00 | |C.SD _(RV64/128)_
+
+|===
+
+[[rvc-instr-table1]]
+.Instruction listing for RVC, Quadrant 1
+//[cols="<,<,<,<,<,<,<,<,<,<,<,<,<,<,<,<,<,<",]
+|===
+
+|000 |nzimm[5] |0 | | | | |nzimm[4:0] | | | | |01 | |C.NOP _(HINT,
+nzimmlatexmath:[$\neq$]0)_
+
+|000 |nzimm[5] |rs1/rdlatexmath:[$\neq$]0 | | | | |nzimm[4:0] | |
+| | |01 | |C.ADDI _(HINT, nzimm=0)_
+
+|001 | |
+|imm[11latexmath:[$\vert$]4latexmath:[$\vert$]9:8latexmath:[$\vert$]10latexmath:[$\vert$]6latexmath:[$\vert$]7latexmath:[$\vert$]3:1latexmath:[$\vert$]5]
+| | | | | | | | | | |01 | |C.JAL _(RV32)_
+
+|001 |imm[5] |rs1/rdlatexmath:[$\neq$]0 | | | | |imm[4:0] | | | |
+|01 | |C.ADDIW _(RV64/128; RES, rd=0)_
+
+|010 |imm[5] |rdlatexmath:[$\neq$]0 | | | | |imm[4:0] | | | | |01
+| |C.LI _(HINT, rd=0)_
+
+|011 |nzimm[9] |2 | | | |
+|nzimm[4latexmath:[$\vert$]6latexmath:[$\vert$]8:7latexmath:[$\vert$]5]
+| | | | |01 | |C.ADDI16SP _(RES, nzimm=0)_
+
+|011 |nzimm[17] |rdlatexmath:[$\neq$]latexmath:[$\{0,2\}$] | | | |
+|nzimm[16:12] | | | | |01 | |C.LUI _(RES, nzimm=0; HINT, rd=0)_
+
+|100 |nzuimm[5] |00 | |rs1 latexmath:[$'$]/rd latexmath:[$'$] | |
+|nzuimm[4:0] | | | | |01 | |C.SRLI _(RV32 Custom, nzuimm[5]=1)_
+
+|100 |0 |00 | |rs1 latexmath:[$'$]/rd latexmath:[$'$] | | |0 | | |
+| |01 | |C.SRLI64 _(RV128; RV32/64 HINT)_
+
+|100 |nzuimm[5] |01 | |rs1 latexmath:[$'$]/rd latexmath:[$'$] | |
+|nzuimm[4:0] | | | | |01 | |C.SRAI _(RV32 Custom, nzuimm[5]=1)_
+
+|100 |0 |01 | |rs1 latexmath:[$'$]/rd latexmath:[$'$] | | |0 | | |
+| |01 | |C.SRAI64 _(RV128; RV32/64 HINT)_
+
+|100 |imm[5] |10 | |rs1 latexmath:[$'$]/rd latexmath:[$'$] | |
+|imm[4:0] | | | | |01 | |C.ANDI
+
+|100 |0 |11 | |rs1 latexmath:[$'$]/rd latexmath:[$'$] | | |00 |
+|rs2 latexmath:[$'$] | | |01 | |C.SUB
+
+|100 |0 |11 | |rs1 latexmath:[$'$]/rd latexmath:[$'$] | | |01 |
+|rs2 latexmath:[$'$] | | |01 | |C.XOR
+
+|100 |0 |11 | |rs1 latexmath:[$'$]/rd latexmath:[$'$] | | |10 |
+|rs2 latexmath:[$'$] | | |01 | |C.OR
+
+|100 |0 |11 | |rs1 latexmath:[$'$]/rd latexmath:[$'$] | | |11 |
+|rs2 latexmath:[$'$] | | |01 | |C.AND
+
+|100 |1 |11 | |rs1 latexmath:[$'$]/rd latexmath:[$'$] | | |00 |
+|rs2 latexmath:[$'$] | | |01 | |C.SUBW _(RV64/128; RV32 RES)_
+
+|100 |1 |11 | |rs1 latexmath:[$'$]/rd latexmath:[$'$] | | |01 |
+|rs2 latexmath:[$'$] | | |01 | |C.ADDW _(RV64/128; RV32 RES)_
+
+|100 |1 |11 | |— | | |10 | |— | | |01 | |_Reserved_
+
+|100 |1 |11 | |— | | |11 | |— | | |01 | |_Reserved_
+
+|101
+|imm[11latexmath:[$\vert$]4latexmath:[$\vert$]9:8latexmath:[$\vert$]10latexmath:[$\vert$]6latexmath:[$\vert$]7latexmath:[$\vert$]3:1latexmath:[$\vert$]5]
+| | | | | | | | | | |01 | |C.J
+
+|110 |imm[8latexmath:[$\vert$]4:3] | | |rs1 latexmath:[$'$] | |
+|imm[7:6latexmath:[$\vert$]2:1latexmath:[$\vert$]5] | | | | |01 |
+|C.BEQZ
+
+|111 |imm[8latexmath:[$\vert$]4:3] | | |rs1 latexmath:[$'$] | |
+|imm[7:6latexmath:[$\vert$]2:1latexmath:[$\vert$]5] | | | | |01 |
+|C.BNEZ
+
+|===
+
+[[rvc-instr-table2]]
+.Instruction listing for RVC, Quadrant 2
+//[cols="<,<,<,<,<,<,<,<,<,<,<,<,<,<,<,<,<,<",]
+|===
+
+|000 |nzuimm[5] |rs1/rdlatexmath:[$\neq$]0 | | | | |nzuimm[4:0] |
+| | | |10 | |C.SLLI _(HINT, rd=0; RV32 Custom, nzuimm[5]=1)_
+
+|000 |0 |rs1/rdlatexmath:[$\neq$]0 | | | | |0 | | | | |10 |
+|C.SLLI64 _(RV128; RV32/64 HINT; HINT, rd=0)_
+
+|001 |uimm[5] |rd | | | | |uimm[4:3latexmath:[$\vert$]8:6] | | | |
+|10 | |C.FLDSP _(RV32/64)_
+
+|001 |uimm[5] |rdlatexmath:[$\neq$]0 | | | |
+|uimm[4latexmath:[$\vert$]9:6] | | | | |10 | |C.LQSP _(RV128; RES,
+rd=0)_
+
+|010 |uimm[5] |rdlatexmath:[$\neq$]0 | | | |
+|uimm[4:2latexmath:[$\vert$]7:6] | | | | |10 | |C.LWSP _(RES, rd=0)_
+
+|011 |uimm[5] |rd | | | | |uimm[4:2latexmath:[$\vert$]7:6] | | | |
+|10 | |C.FLWSP _(RV32)_
+
+|011 |uimm[5] |rdlatexmath:[$\neq$]0 | | | |
+|uimm[4:3latexmath:[$\vert$]8:6] | | | | |10 | |C.LDSP _(RV64/128; RES,
+rd=0)_
+
+|100 |0 |rs1latexmath:[$\neq$]0 | | | | |0 | | | | |10 | |C.JR
+_(RES, rs1=0)_
+
+|100 |0 |rdlatexmath:[$\neq$]0 | | | | |rs2latexmath:[$\neq$]0 | |
+| | |10 | |C.MV _(HINT, rd=0)_
+
+|100 |1 |0 | | | | |0 | | | | |10 | |C.EBREAK
+
+|100 |1 |rs1latexmath:[$\neq$]0 | | | | |0 | | | | |10 | |C.JALR
+
+|100 |1 |rs1/rdlatexmath:[$\neq$]0 | | | | |rs2latexmath:[$\neq$]0 | |
+| | |10 | |C.ADD _(HINT, rd=0)_
+
+|101 |uimm[5:3latexmath:[$\vert$]8:6] | | | | | |rs2 | | | | |10 |
+|C.FSDSP _(RV32/64)_
+
+|101 |uimm[5:4latexmath:[$\vert$]9:6] | | | | | |rs2 | | | | |10 |
+|C.SQSP _(RV128)_
+
+|110 |uimm[5:2latexmath:[$\vert$]7:6] | | | | | |rs2 | | | | |10 |
+|C.SWSP
+
+|111 |uimm[5:2latexmath:[$\vert$]7:6] | | | | | |rs2 | | | | |10 |
+|C.FSWSP _(RV32)_
+
+|111 |uimm[5:3latexmath:[$\vert$]8:6] | | | | | |rs2 | | | | |10 |
+|C.SDSP _(RV64/128)_
+
+|===
diff --git a/src/colophon.adoc b/src/colophon.adoc
new file mode 100644
index 0000000..97f13cd
--- /dev/null
+++ b/src/colophon.adoc
@@ -0,0 +1,347 @@
+[colophon]
+= Preface
+
+This document describes the RISC-V unprivileged architecture.
+
+The ISA modules marked Ratified have been ratified at this time. The
+modules marked _Frozen_ are not expected to change significantly before
+being put up for ratification. The modules marked _Draft_ are expected
+to change before ratification.
+
+The document contains the following versions of the RISC-V ISA modules:
+
+[cols="^,<,^",options="header",]
+|===
+|Base |Version |Status
+|RVWMO |2.0 |*Ratified*
+|*RV32I* |*2.1* |*Ratified*
+|*RV64I* |*2.1* |*Ratified*
+|_RV32E_ |_1.9_ |_Draft_
+|_RV128I_ |_1.7_ |_Draft_
+|Extension |Version |Status
+|*M* |*2.0* |*Ratified*
+|*A* |*2.1* |*Ratified*
+|*F* |*2.2* |*Ratified*
+|*D* |*2.2* |*Ratified*
+|*Q* |*2.2* |*Ratified*
+|*C* |*2.0* |*Ratified*
+|_Counters_ |_2.0_ |_Draft_
+|_L_ |_0.0_ |_Draft_
+|_B_ |_0.0_ |_Draft_
+|_J_ |_0.0_ |_Draft_
+|_T_ |_0.0_ |_Draft_
+|_P_ |_0.2_ |_Draft_
+|_V_ |_0.7_ |_Draft_
+|*Zicsr* |*2.0* |*Ratified*
+|*Zifencei* |*2.0* |*Ratified*
+|_Zam_ |_0.1_ |_Draft_
+|_Ztso_ |_0.1_ |_Frozen_
+|===
+
+_Preface to Document Version 20191213-Base-Ratified_
+
+This document describes the RISC-V unprivileged architecture.
+
+The ISA modules marked Ratified have been ratified at this time. The
+modules marked _Frozen_ are not expected to change significantly before
+being put up for ratification. The modules marked _Draft_ are expected
+to change before ratification.
+
+The document contains the following versions of the RISC-V ISA modules:
+
+[cols="^,<,^",options="header",]
+|===
+|Base |Version |Status
+|RVWMO |2.0 |*Ratified*
+|*RV32I* |*2.1* |*Ratified*
+|*RV64I* |*2.1* |*Ratified*
+|_RV32E_ |_1.9_ |_Draft_
+|_RV128I_ |_1.7_ |_Draft_
+|Extension |Version |Status
+|*M* |*2.0* |*Ratified*
+|*A* |*2.1* |*Ratified*
+|*F* |*2.2* |*Ratified*
+|*D* |*2.2* |*Ratified*
+|*Q* |*2.2* |*Ratified*
+|*C* |*2.0* |*Ratified*
+|_Counters_ |_2.0_ |_Draft_
+|_L_ |_0.0_ |_Draft_
+|_B_ |_0.0_ |_Draft_
+|_J_ |_0.0_ |_Draft_
+|_T_ |_0.0_ |_Draft_
+|_P_ |_0.2_ |_Draft_
+|_V_ |_0.7_ |_Draft_
+|*Zicsr* |*2.0* |*Ratified*
+|*Zifencei* |*2.0* |*Ratified*
+|_Zam_ |_0.1_ |_Draft_
+|_Ztso_ |_0.1_ |_Frozen_
+|===
+
+The changes in this version of the document include:
+
+* The A extension, now version 2.1, was ratified by the board in
+December 2019.
+* Defined big-endian ISA variant.
+* Moved N extension for user-mode interrupts into Volume II.
+* Defined PAUSE hint instruction.
+
+_Preface to Document Version 20190608-Base-Ratified_
+
+This document describes the RISC-V unprivileged architecture.
+
+The RVWMO memory model has been ratified at this time. The ISA modules
+marked Ratified have been ratified at this time. The modules marked
+_Frozen_ are not expected to change significantly before being put up
+for ratification. The modules marked _Draft_ are expected to change
+before ratification.
+
+The document contains the following versions of the RISC-V ISA modules:
+
+[cols="^,<,^",options="header",]
+|===
+|Base |Version |Status
+|RVWMO |2.0 |*Ratified*
+|*RV32I* |*2.1* |*Ratified*
+|*RV64I* |*2.1* |*Ratified*
+|_RV32E_ |_1.9_ |_Draft_
+|_RV128I_ |_1.7_ |_Draft_
+|Extension |Version |Status
+|*Zifencei* |*2.0* |*Ratified*
+|*Zicsr* |*2.0* |*Ratified*
+|*M* |*2.0* |*Ratified*
+|_A_ |_2.0_ |_Frozen_
+|*F* |*2.2* |*Ratified*
+|*D* |*2.2* |*Ratified*
+|*Q* |*2.2* |*Ratified*
+|*C* |*2.0* |*Ratified*
+|_Ztso_ |_0.1_ |_Frozen_
+|_Counters_ |_2.0_ |_Draft_
+|_L_ |_0.0_ |_Draft_
+|_B_ |_0.0_ |_Draft_
+|_J_ |_0.0_ |_Draft_
+|_T_ |_0.0_ |_Draft_
+|_P_ |_0.2_ |_Draft_
+|_V_ |_0.7_ |_Draft_
+|_N_ |_1.1_ |_Draft_
+|_Zam_ |_0.1_ |_Draft_
+|===
+
+The changes in this version of the document include:
+
+* Moved description to *Ratified* for the ISA modules ratified by the
+board in early 2019.
+* Removed the A extension from ratification.
+* Changed document version scheme to avoid confusion with versions of
+the ISA modules.
+* Incremented the version numbers of the base integer ISA to 2.1,
+reflecting the presence of the ratified RVWMO memory model and exclusion
+of FENCE.I, counters, and CSR instructions that were in previous base
+ISA.
+* Incremented the version numbers of the F and D extensions to 2.2,
+reflecting that version 2.1 changed the canonical NaN, and version 2.2
+defined the NaN-boxing scheme and changed the definition of the FMIN and
+FMAX instructions.
+* Changed name of document to refer to ``unprivileged'' instructions as
+part of move to separate ISA specifications from platform profile
+mandates.
+* Added clearer and more precise definitions of execution environments,
+harts, traps, and memory accesses.
+* Defined instruction-set categories: _standard_, _reserved_, _custom_,
+_non-standard_, and _non-conforming_.
+* Removed text implying operation under alternate endianness, as
+alternate-endianness operation has not yet been defined for RISC-V.
+* Changed description of misaligned load and store behavior. The
+specification now allows visible misaligned address traps in execution
+environment interfaces, rather than just mandating invisible handling of
+misaligned loads and stores in user mode. Also, now allows access-fault
+exceptions to be reported for misaligned accesses (including atomics)
+that should not be emulated.
+* Moved FENCE.I out of the mandatory base and into a separate extension,
+with Zifencei ISA name. FENCE.I was removed from the Linux user ABI and
+is problematic in implementations with large incoherent instruction and
+data caches. However, it remains the only standard instruction-fetch
+coherence mechanism.
+* Removed prohibitions on using RV32E with other extensions.
+* Removed platform-specific mandates that certain encodings produce
+illegal instruction exceptions in RV32E and RV64I chapters.
+* Counter/timer instructions are now not considered part of the
+mandatory base ISA, and so CSR instructions were moved into separate
+chapter and marked as version 2.0, with the unprivileged counters moved
+into another separate chapter. The counters are not ready for
+ratification as there are outstanding issues, including counter
+inaccuracies.
+* A CSR-access ordering model has been added.
+* Explicitly defined the 16-bit half-precision floating-point format for
+floating-point instructions in the 2-bit _fmt field._
+* Defined the signed-zero behavior of FMIN._fmt_ and FMAX._fmt_, and
+changed their behavior on signaling-NaN inputs to conform to the
+minimumNumber and maximumNumber operations in the proposed IEEE 754-201x
+specification.
+* The memory consistency model, RVWMO, has been defined.
+* The ``Zam'' extension, which permits misaligned AMOs and specifies
+their semantics, has been defined.
+* The ``Ztso'' extension, which enforces a stricter memory consistency
+model than RVWMO, has been defined.
+* Improvements to the description and commentary.
+* Defined the term IALIGN as shorthand to describe the
+instruction-address alignment constraint.
+* Removed text of P extension chapter as now superseded by active task
+group documents.
+* Removed text of V extension chapter as now superseded by separate
+vector extension draft document.
+
+_Preface to Document Version 2.2_
+
+This is version 2.2 of the document describing the RISC-V user-level
+architecture. The document contains the following versions of the RISC-V
+ISA modules:
+
+[cols="^,<,^",options="header",]
+|===
+|Base |_Version_ |_Draft Frozen?_
+|RV32I |2.0 |Y
+|RV32E |1.9 |N
+|RV64I |2.0 |Y
+|RV128I |1.7 |N
+|Extension |Version |Frozen?
+|M |2.0 |Y
+|A |2.0 |Y
+|F |2.0 |Y
+|D |2.0 |Y
+|Q |2.0 |Y
+|L |0.0 |N
+|C |2.0 |Y
+|B |0.0 |N
+|J |0.0 |N
+|T |0.0 |N
+|P |0.1 |N
+|V |0.7 |N
+|N |1.1 |N
+|===
+
+To date, no parts of the standard have been officially ratified by the
+RISC-V Foundation, but the components labeled ``frozen'' above are not
+expected to change during the ratification process beyond resolving
+ambiguities and holes in the specification.
+
+The major changes in this version of the document include:
+
+* The previous version of this document was released under a Creative
+Commons Attribution 4.0 International License by the original authors,
+and this and future versions of this document will be released under the
+same license.
+* Rearranged chapters to put all extensions first in canonical order.
+* Improvements to the description and commentary.
+* Modified implicit hinting suggestion on JALR to support more efficient
+macro-op fusion of LUI/JALR and AUIPC/JALR pairs.
+* Clarification of constraints on load-reserved/store-conditional
+sequences.
+* A new table of control and status register (CSR) mappings.
+* Clarified purpose and behavior of high-order bits of `fcsr`.
+* Corrected the description of the FNMADD._fmt_ and FNMSUB._fmt_
+instructions, which had suggested the incorrect sign of a zero result.
+* Instructions FMV.S.X and FMV.X.S were renamed to FMV.W.X and FMV.X.W
+respectively to be more consistent with their semantics, which did not
+change. The old names will continue to be supported in the tools.
+* Specified behavior of narrower (latexmath:[$<$]FLEN) floating-point
+values held in wider `f` registers using NaN-boxing model.
+* Defined the exception behavior of FMA(latexmath:[$\infty$], 0, qNaN).
+* Added note indicating that the P extension might be reworked into an
+integer packed-SIMD proposal for fixed-point operations using the
+integer registers.
+* A draft proposal of the V vector instruction-set extension.
+* An early draft proposal of the N user-level traps extension.
+* An expanded pseudoinstruction listing.
+* Removal of the calling convention chapter, which has been superseded
+by the RISC-V ELF psABI Specification cite:[riscv-elf-psabi].
+* The C extension has been frozen and renumbered version 2.0.
+
+_Preface to Document Version 2.1_
+
+This is version 2.1 of the document describing the RISC-V user-level
+architecture. Note the frozen user-level ISA base and extensions IMAFDQ
+version 2.0 have not changed from the previous version of this
+document cite:[riscvtr2], but some specification holes have been fixed and the
+documentation has been improved. Some changes have been made to the
+software conventions.
+
+* Numerous additions and improvements to the commentary sections.
+* Separate version numbers for each chapter.
+* Modification to long instruction encodings latexmath:[$>$]64 bits to
+avoid moving the _rd_ specifier in very long instruction formats.
+* CSR instructions are now described in the base integer format where
+the counter registers are introduced, as opposed to only being
+introduced later in the floating-point section (and the companion
+privileged architecture manual).
+* The SCALL and SBREAK instructions have been renamed to ECALL and
+EBREAK, respectively. Their encoding and functionality are unchanged.
+* Clarification of floating-point NaN handling, and a new canonical NaN
+value.
+* Clarification of values returned by floating-point to integer
+conversions that overflow.
+* Clarification of LR/SC allowed successes and required failures,
+including use of compressed instructions in the sequence.
+* A new RV32E base ISA proposal for reduced integer register counts,
+which supports the M, A, and C extensions.
+* A revised calling convention.
+* Relaxed stack alignment for soft-float calling convention, and
+description of the RV32E calling convention.
+* A revised proposal for the C compressed extension, version 1.9.
+
+_Preface to Version 2.0_
+
+This is the second release of the user ISA specification, and we intend
+the specification of the base user ISA plus general extensions (i.e.,
+IMAFD) to remain fixed for future development. The following changes
+have been made since Version 1.0 cite:[openriscarch] of this ISA specification.
+
+* The ISA has been divided into an integer base with several standard
+extensions.
+* The instruction formats have been rearranged to make immediate
+encoding more efficient.
+* The base ISA has been defined to have a little-endian memory system,
+with big-endian or bi-endian as non-standard variants.
+* Load-Reserved/Store-Conditional (LR/SC) instructions have been added
+in the atomic instruction extension.
+* AMOs and LR/SC can support the release consistency model.
+* The FENCE instruction provides finer-grain memory and I/O orderings.
+* An AMO for fetch-and-XOR (AMOXOR) has been added, and the encoding for
+AMOSWAP has been changed to make room.
+* The AUIPC instruction, which adds a 20-bit upper immediate to the PC,
+replaces the RDNPC instruction, which only read the current PC value.
+This results in significant savings for position-independent code.
+* The JAL instruction has now moved to the U-Type format with an
+explicit destination register, and the J instruction has been dropped,
+being replaced by JAL with _rd_=`x0`. This removes the only instruction
+with an implicit destination register and removes the J-Type instruction
+format from the base ISA. There is an accompanying reduction in JAL
+reach, but a significant reduction in base ISA complexity.
+* The static hints on the JALR instruction have been dropped. The hints
+are redundant with the _rd_ and _rs1_ register specifiers for code
+compliant with the standard calling convention.
+* The JALR instruction now clears the lowest bit of the calculated
+target address, to simplify hardware and to allow auxiliary information
+to be stored in function pointers.
+* The MFTX.S and MFTX.D instructions have been renamed to FMV.X.S and
+FMV.X.D, respectively. Similarly, MXTF.S and MXTF.D instructions have
+been renamed to FMV.S.X and FMV.D.X, respectively.
+* The MFFSR and MTFSR instructions have been renamed to FRCSR and FSCSR,
+respectively. FRRM, FSRM, FRFLAGS, and FSFLAGS instructions have been
+added to individually access the rounding mode and exception flags
+subfields of the `fcsr`.
+* The FMV.X.S and FMV.X.D instructions now source their operands from
+_rs1_, instead of _rs2_. This change simplifies datapath design.
+* FCLASS.S and FCLASS.D floating-point classify instructions have been
+added.
+* A simpler NaN generation and propagation scheme has been adopted.
+* For RV32I, the system performance counters have been extended to
+64-bits wide, with separate read access to the upper and lower 32 bits.
+* Canonical NOP and MV encodings have been defined.
+* Standard instruction-length encodings have been defined for 48-bit,
+64-bit, and latexmath:[$>$]64-bit instructions.
+* Description of a 128-bit address space variant, RV128, has been added.
+* Major opcodes in the 32-bit base instruction format have been
+allocated for user-defined custom extensions.
+* A typographical error that suggested that stores source their data
+from _rd_ has been corrected to refer to _rs2_.
+
diff --git a/src/counters-f.adoc b/src/counters-f.adoc
new file mode 100644
index 0000000..4678d78
--- /dev/null
+++ b/src/counters-f.adoc
@@ -0,0 +1,167 @@
+== Counters
+
+RISC-V ISAs provide a set of up to 32latexmath:[$\times$]64-bit
+performance counters and timers that are accessible via unprivileged
+XLEN-bit read-only CSR registers `0xC00`–`0xC1F` (with the upper 32 bits
+accessed via CSR registers `0xC80`–`0xC9F` on RV32). The first three of
+these (CYCLE, TIME, and INSTRET) have dedicated functions (cycle count,
+real-time clock, and instructions-retired respectively), while the
+remaining counters, if implemented, provide programmable event counting.
+
+=== Base Counters and Timers
+
+[cols="^,^,^,^,^",options="header",]
+|===
+|csr (12 bits) |rs1 (5) |funct3 (3) |rd (5) |opcode (7)
+
+|RDCYCLE[H] |0 |CSRRS |dest |SYSTEM
+|RDTIME[H] |0 |CSRRS |dest |SYSTEM
+|RDINSTRET[H] |0 |CSRRS |dest |SYSTEM
+|===
+
+RV32I provides a number of 64-bit read-only user-level counters, which
+are mapped into the 12-bit CSR address space and accessed in 32-bit
+pieces using CSRRS instructions. In RV64I, the CSR instructions can
+manipulate 64-bit CSRs. In particular, the RDCYCLE, RDTIME, and
+RDINSTRET pseudoinstructions read the full 64 bits of the `cycle`,
+`time`, and `instret` counters. Hence, the RDCYCLEH, RDTIMEH, and
+RDINSTRETH instructions are RV32I-only.
+
+Some execution environments might prohibit access to counters to impede
+timing side-channel attacks.
+
+The RDCYCLE pseudoinstruction reads the low XLEN bits of the `cycle`
+CSR which holds a count of the number of clock cycles executed by the
+processor core on which the hart is running from an arbitrary start time
+in the past. RDCYCLEH is an RV32I-only instruction that reads bits 63–32
+of the same cycle counter. The underlying 64-bit counter should never
+overflow in practice. The rate at which the cycle counter advances will
+depend on the implementation and operating environment. The execution
+environment should provide a means to determine the current rate
+(cycles/second) at which the cycle counter is incrementing.
+
+RDCYCLE is intended to return the number of cycles executed by the
+processor core, not the hart. Precisely defining what is a ``core'' is
+difficult given some implementation choices (e.g., AMD Bulldozer).
+Precisely defining what is a ``clock cycle'' is also difficult given the
+range of implementations (including software emulations), but the intent
+is that RDCYCLE is used for performance monitoring along with the other
+performance counters. In particular, where there is one hart/core, one
+would expect cycle-count/instructions-retired to measure CPI for a hart.
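+
+For the common single-hart-per-core case, CPI over a code region can be
+estimated as in the following sketch (RV64 assumed, so each read returns
+the full 64-bit counter; the final division is done in software):
+
+....
+        rdcycle   s0          # cycles at start of region
+        rdinstret s1          # instructions retired at start
+        # ... code under measurement ...
+        rdcycle   t0
+        rdinstret t1
+        sub       s0, t0, s0  # elapsed cycles
+        sub       s1, t1, s1  # elapsed instructions
+        # CPI is elapsed cycles divided by elapsed instructions
+....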
+
+Cores don’t have to be exposed to software at all, and an implementor
+might choose to pretend multiple harts on one physical core are running
+on separate cores with one hart/core, and provide separate cycle
+counters for each hart. This might make sense in a simple barrel
+processor (e.g., CDC 6600 peripheral processors) where inter-hart timing
+interactions are non-existent or minimal.
+
+Where there is more than one hart/core and dynamic multithreading, it is
+not generally possible to separate out cycles per hart (especially with
+SMT). It might be possible to define a separate performance counter that
+tried to capture the number of cycles a particular hart was running, but
+this definition would have to be very fuzzy to cover all the possible
+threading implementations. For example, should we only count cycles for
+which any instruction was issued to execution for this hart, and/or
+cycles any instruction retired, or include cycles this hart was
+occupying machine resources but couldn’t execute due to stalls while
+other harts went into execution? Likely, ``all of the above'' would be
+needed to have understandable performance stats. This complexity of
+defining a per-hart cycle count, and also the need in any case for a
+total per-core cycle count when tuning multithreaded code led to just
+standardizing the per-core cycle counter, which also happens to work
+well for the common single hart/core case.
+
+Standardizing what happens during ``sleep'' is not practical given that
+what ``sleep'' means is not standardized across execution environments,
+but if the entire core is paused (entirely clock-gated or powered-down
+in deep sleep), then it is not executing clock cycles, and the cycle
+count shouldn’t be increasing per the spec. There are many details,
+e.g., whether clock cycles required to reset a processor after waking up
+from a power-down event should be counted, and these are considered
+execution-environment-specific details.
+
+Even though there is no precise definition that works for all platforms,
+this is still a useful facility for most platforms, and an imprecise,
+common, ``usually correct'' standard here is better than no standard.
+The intent of RDCYCLE was primarily performance monitoring/tuning, and
+the specification was written with that goal in mind.
+
+The RDTIME pseudoinstruction reads the low XLEN bits of the `time` CSR,
+which counts wall-clock real time that has passed from an arbitrary
+start time in the past. RDTIMEH is an RV32I-only instruction that reads
+bits 63–32 of the same real-time counter. The underlying 64-bit counter
+should never overflow in practice. The execution environment should
+provide a means of determining the period of the real-time counter
+(seconds/tick). The period must be constant. The real-time clocks of all
+harts in a single user application should be synchronized to within one
+tick of the real-time clock. The environment should provide a means to
+determine the accuracy of the clock.
+
+On some simple platforms, cycle count might represent a valid
+implementation of RDTIME, but in this case, platforms should implement
+the RDTIME instruction as an alias for RDCYCLE to make code more
+portable, rather than using RDCYCLE to measure wall-clock time.
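+
+A sketch of wall-clock timing with RDTIME (RV64 assumed, so one read
+returns the full 64-bit value; the tick period must be obtained from the
+execution environment):
+
+....
+        rdtime  s0            # ticks at start
+        # ... work being timed ...
+        rdtime  t0
+        sub     t0, t0, s0    # elapsed ticks
+        # seconds = elapsed ticks * (seconds/tick) from the environment
+....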
+
+The RDINSTRET pseudoinstruction reads the low XLEN bits of the
+`instret` CSR, which counts the number of instructions retired by this
+hart from some arbitrary start point in the past. RDINSTRETH is an
+RV32I-only instruction that reads bits 63–32 of the same instruction
+counter. The underlying 64-bit counter should never overflow in
+practice.
+
+The following code sequence will read a valid 64-bit cycle counter value
+into `x3`:`x2`, even if the counter overflows its lower half between
+reading its upper and lower halves.
+
+....
+ again:
+ rdcycleh x3
+ rdcycle x2
+ rdcycleh x4
+ bne x3, x4, again
+....
+
+We recommend provision of these basic counters in implementations as
+they are essential for basic performance analysis, adaptive and dynamic
+optimization, and to allow an application to work with real-time
+streams. Additional counters should be provided to help diagnose
+performance problems and these should be made accessible from user-level
+application code with low overhead.
+
+We required the counters be 64 bits wide, even on RV32, as otherwise it
+is very difficult for software to determine if values have overflowed.
+For a low-end implementation, the upper 32 bits of each counter can be
+implemented using software counters incremented by a trap handler
+triggered by overflow of the lower 32 bits. The sample code described
+above shows how the full 64-bit width value can be safely read using the
+individual 32-bit instructions.
+
+In some applications, it is important to be able to read multiple
+counters at the same instant in time. When run under a multitasking
+environment, a user thread can suffer a context switch while attempting
+to read the counters. One solution is for the user thread to read the
+real-time counter before and after reading the other counters to
+determine if a context switch occurred in the middle of the sequence, in
+which case the reads can be retried. We considered adding output latches
+to allow a user thread to snapshot the counter values atomically, but
+this would increase the size of the user context, especially for
+implementations with a richer set of counters.
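+
+A software-only sketch of the retry scheme described above (RV64
+assumed; the threshold `MAX_TICKS` is an illustrative, platform-tuned
+constant, not part of the specification):
+
+....
+    snap:
+        rdtime    s0            # timestamp before the group of reads
+        rdcycle   s1
+        rdinstret s2
+        rdtime    t0            # timestamp after
+        sub       t0, t0, s0    # elapsed ticks across the sequence
+        li        t1, MAX_TICKS
+        bltu      t1, t0, snap  # too long: likely context-switched, retry
+....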
+
+=== Hardware Performance Counters
+
+There is CSR space allocated for 29 additional unprivileged 64-bit
+hardware performance counters, `hpmcounter3`–`hpmcounter31`. For RV32,
+the upper 32 bits of these performance counters are accessible via
+additional CSRs `hpmcounter3h`–`hpmcounter31h`. These counters count
+platform-specific events and are configured via additional privileged
+registers. The number and width of these additional counters, and the
+set of events they count, are platform-specific.
+
+The privileged architecture manual describes the privileged CSRs that
+control access to these counters and that select the events to be
+counted.
+
+It would be useful to eventually standardize event settings to count
+ISA-level metrics, such as the number of floating-point instructions
+executed for example, and possibly a few common microarchitectural
+metrics, such as `L1 instruction cache misses`.
diff --git a/src/counters.adoc b/src/counters.adoc
new file mode 100644
index 0000000..670ed6e
--- /dev/null
+++ b/src/counters.adoc
@@ -0,0 +1,189 @@
+[[perf-counters]]
+== Counters
+
+RISC-V ISAs provide a set of up to 32 latexmath:[$\times$] 64-bit
+performance counters and timers that are accessible via unprivileged
+XLEN-bit read-only CSR registers `0xC00`–`0xC1F` (with the upper 32 bits
+accessed via CSR registers `0xC80`–`0xC9F` on RV32). The first three of
+these (CYCLE, TIME, and INSTRET) have dedicated functions (cycle count,
+real-time clock, and instructions-retired respectively), while the
+remaining counters, if implemented, provide programmable event counting.
+
+=== Base Counters and Timers
+
+include::images/wavedrom/counters-diag.adoc[]
+[[counter]]
+.Base counters and timers
+image::image_placeholder.png[]
+(((counters, read-only)))
+(((counters, user-level)))
+
+RV32I provides a number of 64-bit read-only user-level counters, which
+are mapped into the 12-bit CSR address space and accessed in 32-bit
+pieces using CSRRS instructions. In RV64I, the CSR instructions can
+manipulate 64-bit CSRs. In particular, the RDCYCLE, RDTIME, and
+RDINSTRET pseudoinstructions read the full 64 bits of the `cycle`,
+`time`, and `instret` counters. Hence, the RDCYCLEH, RDTIMEH, and
+RDINSTRETH instructions are RV32I-only.
+
+[NOTE]
+====
+Some execution environments might prohibit access to counters to impede
+timing side-channel attacks.
+====
+(((counters, pseudoinstruction)))
+
+The RDCYCLE pseudoinstruction reads the low XLEN bits of the `cycle`
+CSR which holds a count of the number of clock cycles executed by the
+processor core on which the hart is running from an arbitrary start time
+in the past. RDCYCLEH is an RV32I-only instruction that reads bits 63–32
+of the same cycle counter. The underlying 64-bit counter should never
+overflow in practice. The rate at which the cycle counter advances will
+depend on the implementation and operating environment. The execution
+environment should provide a means to determine the current rate
+(cycles/second) at which the cycle counter is incrementing.
+
+[TIP]
+====
+RDCYCLE is intended to return the number of cycles executed by the
+processor core, not the hart. Precisely defining what is a `core` is
+difficult given some implementation choices (e.g., AMD Bulldozer).
+Precisely defining what is a `clock cycle` is also difficult given the
+range of implementations (including software emulations), but the intent
+is that RDCYCLE is used for performance monitoring along with the other
+performance counters. In particular, where there is one hart/core, one
+would expect cycle-count/instructions-retired to measure CPI for a hart.
+
+Cores don’t have to be exposed to software at all, and an implementor
+might choose to pretend multiple harts on one physical core are running
+on separate cores with one hart/core, and provide separate cycle
+counters for each hart. This might make sense in a simple barrel
+processor (e.g., CDC 6600 peripheral processors) where inter-hart timing
+interactions are non-existent or minimal.
+(((counters, handling multithreading)))
+
+Where there is more than one hart/core and dynamic multithreading, it is
+not generally possible to separate out cycles per hart (especially with
+SMT). It might be possible to define a separate performance counter that
+tried to capture the number of cycles a particular hart was running, but
+this definition would have to be very fuzzy to cover all the possible
+threading implementations. For example, should we only count cycles for
+which any instruction was issued to execution for this hart, and/or
+cycles any instruction retired, or include cycles this hart was
+occupying machine resources but couldn’t execute due to stalls while
+other harts went into execution? Likely, `all of the above` would be
+needed to have understandable performance stats. This complexity of
+defining a per-hart cycle count, and also the need in any case for a
+total per-core cycle count when tuning multithreaded code led to just
+standardizing the per-core cycle counter, which also happens to work
+well for the common single hart/core case.
+(((counters, handling sleep cycles)))
+
+Standardizing what happens during `sleep` is not practical given that
+what `sleep` means is not standardized across execution environments,
+but if the entire core is paused (entirely clock-gated or powered-down
+in deep sleep), then it is not executing clock cycles, and the cycle
+count shouldn’t be increasing per the spec. There are many details,
+e.g., whether clock cycles required to reset a processor after waking up
+from a power-down event should be counted, and these are considered
+execution-environment-specific details.
+
+Even though there is no precise definition that works for all platforms,
+this is still a useful facility for most platforms, and an imprecise,
+common, `usually correct` standard here is better than no standard.
+The intent of RDCYCLE was primarily performance monitoring/tuning, and
+the specification was written with that goal in mind.
+====
+
+The RDTIME pseudoinstruction reads the low XLEN bits of the `time` CSR,
+which counts wall-clock real time that has passed from an arbitrary
+start time in the past. RDTIMEH is an RV32I-only instruction that reads
+bits 63–32 of the same real-time counter. The underlying 64-bit counter
+should never overflow in practice. The execution environment should
+provide a means of determining the period of the real-time counter
+(seconds/tick). The period must be constant. The real-time clocks of all
+harts in a single user application should be synchronized to within one
+tick of the real-time clock. The environment should provide a means to
+determine the accuracy of the clock.
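Given the platform-reported period, converting a tick delta into seconds is a single multiplication. This small Python sketch is illustrative; the function and parameter names are not part of the ISA, and the example period is an assumption.

```python
def elapsed_seconds(time_start, time_end, period_s):
    """Convert a delta of `time` CSR ticks into seconds, given the
    platform-reported period (seconds per tick)."""
    return (time_end - time_start) * period_s

# Hypothetical platform with a 2 ticks/second real-time counter:
assert elapsed_seconds(0, 4, 0.5) == 2.0
# A 10 MHz counter (period 100 ns) accumulating 10 million ticks:
assert abs(elapsed_seconds(100, 10_000_100, 1e-7) - 1.0) < 1e-9
```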
+
+[NOTE]
+====
+On some simple platforms, cycle count might represent a valid
+implementation of RDTIME, but in this case, platforms should implement
+the RDTIME instruction as an alias for RDCYCLE to make code more
+portable, rather than using RDCYCLE to measure wall-clock time.
+====
+(((counters, pseudoinstructions)))
+
+The RDINSTRET pseudoinstruction reads the low XLEN bits of the
+`instret` CSR, which counts the number of instructions retired by this
+hart from some arbitrary start point in the past. RDINSTRETH is an
+RV32I-only instruction that reads bits 63–32 of the same instruction
+counter. The underlying 64-bit counter should never overflow in
+practice.
+
+The following code sequence will read a valid 64-bit cycle counter value
+into `x3`:`x2`, even if the counter overflows its lower half between
+reading its upper and lower halves.
+
+.Sample code for reading the 64-bit cycle counter in RV32.
+....
+ again:
+ rdcycleh x3
+ rdcycle x2
+ rdcycleh x4
+ bne x3, x4, again
+....
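The behavior of this sequence can be modeled in ordinary software. In the Python sketch below, `MockCounter` is a stand-in for the real `cycle`/`cycleh` CSRs (it advances on every read, so a lower-half overflow can fall between the upper and lower reads); it shows why comparing the two upper-half reads yields a consistent 64-bit value.

```python
class MockCounter:
    """Model of a 64-bit counter read via two 32-bit CSR accesses.
    The counter advances on every read, so an overflow of the lower
    half can occur between the upper and lower reads."""
    def __init__(self, start, step):
        self.value = start
        self.step = step

    def read_low(self):            # models rdcycle
        v = self.value & 0xFFFFFFFF
        self.value += self.step
        return v

    def read_high(self):           # models rdcycleh
        v = self.value >> 32
        self.value += self.step
        return v

def read_cycle64(ctr):
    """hi/lo/hi sequence: retry until the upper half is stable."""
    while True:
        hi1 = ctr.read_high()
        lo = ctr.read_low()
        hi2 = ctr.read_high()
        if hi1 == hi2:
            return (hi1 << 32) | lo

# Start just below a lower-half overflow: the first attempt observes
# hi1 != hi2 and retries, returning the value seen at the lo read.
assert read_cycle64(MockCounter(start=0xFFFFFFFE, step=1)) == 0x1_00000002
```

A naive lo-then-hi read of the same mock counter could pair a pre-overflow lower half with a post-overflow upper half, which is exactly the hazard the branch in the assembly sequence guards against.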
+
+[TIP]
+====
+We recommend provision of these basic counters in implementations as
+they are essential for basic performance analysis, adaptive and dynamic
+optimization, and to allow an application to work with real-time
+streams. Additional counters should be provided to help diagnose
+performance problems and these should be made accessible from user-level
+application code with low overhead.
+
+We required the counters be 64 bits wide, even on RV32, as otherwise it
+is very difficult for software to determine if values have overflowed.
+For a low-end implementation, the upper 32 bits of each counter can be
+implemented using software counters incremented by a trap handler
+triggered by overflow of the lower 32 bits. The sample code described
+above shows how the full 64-bit width value can be safely read using the
+individual 32-bit instructions.
+
+In some applications, it is important to be able to read multiple
+counters at the same instant in time. When run under a multitasking
+environment, a user thread can suffer a context switch while attempting
+to read the counters. One solution is for the user thread to read the
+real-time counter before and after reading the other counters to
+determine if a context switch occurred in the middle of the sequence, in
+which case the reads can be retried. We considered adding output latches
+to allow a user thread to snapshot the counter values atomically, but
+this would increase the size of the user context, especially for
+implementations with a richer set of counters.
+====
+
+=== Hardware Performance Counters
+(((counters, performance)))
+
+There is CSR space allocated for 29 additional unprivileged 64-bit
+hardware performance counters, `hpmcounter3`–`hpmcounter31`. For RV32,
+the upper 32 bits of these performance counters are accessible via
+additional CSRs `hpmcounter3h`–`hpmcounter31h`. These counters count
+platform-specific events and are configured via additional privileged
+registers. The number and width of these additional counters, and the
+set of events they count, are platform-specific.
+
+[NOTE]
+====
+The privileged architecture manual describes the privileged CSRs that
+control access to these counters and that select the events to be
+counted.
+
+It would be useful to eventually standardize event settings to count
+ISA-level metrics, such as the number of floating-point instructions
+executed for example, and possibly a few common microarchitectural
+metrics, such as `L1 instruction cache misses`.
+====
+
diff --git a/src/d-st-ext.adoc b/src/d-st-ext.adoc
new file mode 100644
index 0000000..b2ea945
--- /dev/null
+++ b/src/d-st-ext.adoc
@@ -0,0 +1,255 @@
+== `D` Standard Extension for Double-Precision Floating-Point, Version 2.2
+
+This chapter describes the standard double-precision floating-point
+instruction-set extension, which is named `D` and adds
+double-precision floating-point computational instructions compliant
+with the IEEE 754-2008 arithmetic standard. The D extension depends on
+the base single-precision instruction subset F.
+(((double-precision floating point)))
+
+=== D Register State
+
+The D extension widens the 32 floating-point registers, `f0`–`f31`, to
+64 bits (FLEN=64) in <<fprs-d>>. The `f` registers can
+now hold either 32-bit or 64-bit floating-point values as described
+below in <<nanboxing>>.
+
+FLEN can be 32, 64, or 128 depending on which of the F, D, and Q
+extensions are supported. There can be up to four different
+floating-point precisions supported, including H, F, D, and Q.
+(((floating-point, supported precisions)))
+
+[[nanboxing]]
+=== NaN Boxing of Narrower Values
+
+When multiple floating-point precisions are supported, then valid values
+of narrower latexmath:[$n$]-bit types, latexmath:[$n<$]FLEN, are
+represented in the lower latexmath:[$n$] bits of an FLEN-bit NaN value,
+in a process termed NaN-boxing. The upper bits of a valid NaN-boxed
+value must be all 1s. Valid NaN-boxed latexmath:[$n$]-bit values
+therefore appear as negative quiet NaNs (qNaNs) when viewed as any wider
+latexmath:[$m$]-bit value, latexmath:[$n < m \leq$]FLEN. Any operation
+that writes a narrower result to an `f` register must write all 1s to
+the uppermost FLENlatexmath:[$-n$] bits to yield a legal NaN-boxed
+value.
+(((floating-point, requirements)))
+
+
+Software might not know the current type of data stored in a
+floating-point register but has to be able to save and restore the
+register values, hence the result of using wider operations to transfer
+narrower values has to be defined. A common case is for callee-saved
+registers, but a standard convention is also desirable for features
+including varargs, user-level threading libraries, virtual machine
+migration, and debugging.
+
+Floating-point latexmath:[$n$]-bit transfer operations move external
+values held in IEEE standard formats into and out of the `f` registers,
+and comprise floating-point loads and stores
+(FLlatexmath:[$n$]/FSlatexmath:[$n$]) and floating-point move
+instructions (FMV.latexmath:[$n$].X/FMV.X.latexmath:[$n$]). A narrower
+latexmath:[$n$]-bit transfer, latexmath:[$n<$]FLEN, into the `f`
+registers will create a valid NaN-boxed value. A narrower
+latexmath:[$n$]-bit transfer out of the floating-point registers will
+transfer the lower latexmath:[$n$] bits of the register ignoring the
+upper FLENlatexmath:[$-n$] bits.
+
+Apart from transfer operations described in the previous paragraph, all
+other floating-point operations on narrower latexmath:[$n$]-bit
+operands, latexmath:[$n<$]FLEN, check if the input operands are
+correctly NaN-boxed, i.e., all upper FLENlatexmath:[$-n$] bits are 1. If
+so, the latexmath:[$n$] least-significant bits of the input are used as
+the input value, otherwise the input value is treated as an
+latexmath:[$n$]-bit canonical NaN.
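The boxing and checking rules above can be modeled directly on the bit patterns. The Python sketch below is an illustrative model for FLEN=64, latexmath:[$n$]=32 (not a description of any particular hardware); `0x7FC00000` is the 32-bit canonical NaN.

```python
import struct

def nan_box(f32_bits):
    """Box a 32-bit value in 64 bits: upper FLEN-n bits are all 1s."""
    return (0xFFFFFFFF << 32) | (f32_bits & 0xFFFFFFFF)

def is_nan_boxed(f64_bits):
    return (f64_bits >> 32) == 0xFFFFFFFF

def unbox(f64_bits):
    """Operands that are not valid NaN-boxed values are treated as
    the 32-bit canonical NaN (0x7FC00000)."""
    if is_nan_boxed(f64_bits):
        return f64_bits & 0xFFFFFFFF
    return 0x7FC00000

bits = nan_box(0x3F800000)          # single-precision 1.0, boxed
# A boxed single appears as a negative quiet NaN when viewed as a double:
as_double = struct.unpack('<d', struct.pack('<Q', bits))[0]
assert as_double != as_double       # NaN compares unequal to itself
assert unbox(bits) == 0x3F800000
assert unbox(0x0000000012345678) == 0x7FC00000   # not boxed
```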
+
+Earlier versions of this document did not define the behavior of feeding
+the results of narrower or wider operands into an operation, except to
+require that wider saves and restores would preserve the value of a
+narrower operand. The new definition removes this
+implementation-specific behavior, while still accommodating both
+non-recoded and recoded implementations of the floating-point unit. The
+new definition also helps catch software errors by propagating NaNs if
+values are used incorrectly.
+
+Non-recoded implementations unpack and pack the operands to IEEE
+standard format on the input and output of every floating-point
+operation. The NaN-boxing cost to a non-recoded implementation is
+primarily in checking if the upper bits of a narrower operation
+represent a legal NaN-boxed value, and in writing all 1s to the upper
+bits of a result.
+
+Recoded implementations use a more convenient internal format to
+represent floating-point values, with an added exponent bit to allow all
+values to be held normalized. The cost to the recoded implementation is
+primarily the extra tagging needed to track the internal types and sign
+bits, but this can be done without adding new state bits by recoding
+NaNs internally in the exponent field. Small modifications are needed to
+the pipelines used to transfer values in and out of the recoded format,
+but the datapath and latency costs are minimal. The recoding process has
+to handle shifting of input subnormal values for wide operands in any
+case, and extracting the NaN-boxed value is a similar process to
+normalization except for skipping over leading-1 bits instead of
+skipping over leading-0 bits, allowing the datapath muxing to be shared.
+
+[[fld_fsd]]
+=== Double-Precision Load and Store Instructions
+
+The FLD instruction loads a double-precision floating-point value from
+memory into floating-point register _rd_. FSD stores a double-precision
+value from the floating-point registers to memory.
+
+The double-precision value may be a NaN-boxed single-precision value.
+
+[[fprs-d]]
+
+[cols="^,^,^,^,^",options="header"]
+|===
+|imm[11:0] |rs1 |width |rd |opcode
+|12 |5 |3 |5 |7
+|offset[11:0] |base |D |dest |LOAD-FP
+|===
+
+[cols="^,^,^,^,^,^",options="header"]
+|===
+|imm[11:5] |rs2 |rs1 |width |imm[4:0] |opcode
+|7 |5 |5 |3 |5 |7
+|offset[11:5] |src |base |D |offset[4:0] |STORE-FP
+|===
+
+FLD and FSD are only guaranteed to execute atomically if the effective
+address is naturally aligned and XLENlatexmath:[$\geq$]64.
+
+FLD and FSD do not modify the bits being transferred; in particular, the
+payloads of non-canonical NaNs are preserved.
+
+=== Double-Precision Floating-Point Computational Instructions
+
+The double-precision floating-point computational instructions are
+defined analogously to their single-precision counterparts, but operate
+on double-precision operands and produce double-precision results.
+
+[cols="^,^,^,^,^,^,^",options="header"]
+|===
+|funct5 |fmt |rs2 |rs1 |rm |rd |opcode
+|5 |2 |5 |5 |3 |5 |7
+|FADD/FSUB |D |src2 |src1 |RM |dest |OP-FP
+|FMUL/FDIV |D |src2 |src1 |RM |dest |OP-FP
+|FMIN-MAX |D |src2 |src1 |MIN/MAX |dest |OP-FP
+|FSQRT |D |0 |src |RM |dest |OP-FP
+|===
+
+[cols="^,^,^,^,^,^,^",options="header"]
+|===
+|rs3 |fmt |rs2 |rs1 |rm |rd |opcode
+|5 |2 |5 |5 |3 |5 |7
+|src3 |D |src2 |src1 |RM |dest |F[N]MADD/F[N]MSUB
+|===
+
+=== Double-Precision Floating-Point Conversion and Move Instructions
+
+Floating-point-to-integer and integer-to-floating-point conversion
+instructions are encoded in the OP-FP major opcode space. FCVT.W.D or
+FCVT.L.D converts a double-precision floating-point number in
+floating-point register _rs1_ to a signed 32-bit or 64-bit integer,
+respectively, in integer register _rd_. FCVT.D.W or FCVT.D.L converts a
+32-bit or 64-bit signed integer, respectively, in integer register _rs1_
+into a double-precision floating-point number in floating-point register
+_rd_. FCVT.WU.D, FCVT.LU.D, FCVT.D.WU, and FCVT.D.LU variants convert to
+or from unsigned integer values. For RV64, FCVT.W[U].D sign-extends the
+32-bit result. FCVT.L[U].D and FCVT.D.L[U] are RV64-only instructions.
+The range of valid inputs for FCVT._int_.D and the behavior for invalid
+inputs are the same as for FCVT._int_.S.
+
+All floating-point to integer and integer to floating-point conversion
+instructions round according to the _rm_ field. Note that FCVT.D.W[U]
+always produces an exact result and is unaffected by rounding mode.
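The rounding-mode behavior can be illustrated with a small model. This Python sketch covers only two of the rounding modes (RTZ and RNE) and only in-range inputs; it does not model the saturation or exception-flag behavior defined for invalid inputs.

```python
import math

def fcvt_w_d(x, rm):
    """Illustrative model of FCVT.W.D rounding for in-range inputs.
    rm: 'rtz' (round toward zero) or 'rne' (round to nearest, ties
    to even)."""
    if rm == 'rtz':
        return math.trunc(x)
    if rm == 'rne':
        return round(x)            # Python's round is ties-to-even
    raise ValueError("unmodeled rounding mode: " + rm)

assert fcvt_w_d(2.5, 'rtz') == 2
assert fcvt_w_d(2.5, 'rne') == 2   # tie rounds to the even integer
assert fcvt_w_d(3.5, 'rne') == 4
assert fcvt_w_d(-2.7, 'rtz') == -2
```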
+
+[cols="^,^,^,^,^,^,^",options="header"]
+|===
+|funct5 |fmt |rs2 |rs1 |rm |rd |opcode
+|5 |2 |5 |5 |3 |5 |7
+|FCVT._int_.D |D |W[U]/L[U] |src |RM |dest |OP-FP
+|FCVT.D._int_ |D |W[U]/L[U] |src |RM |dest |OP-FP
+|===
+
+The double-precision to single-precision and single-precision to
+double-precision conversion instructions, FCVT.S.D and FCVT.D.S, are
+encoded in the OP-FP major opcode space and both the source and
+destination are floating-point registers. The _rs2_ field encodes the
+datatype of the source, and the _fmt_ field encodes the datatype of the
+destination. FCVT.S.D rounds according to the RM field; FCVT.D.S will
+never round.
+
+[cols="^,^,^,^,^,^,^",options="header"]
+|===
+|funct5 |fmt |rs2 |rs1 |rm |rd |opcode
+|5 |2 |5 |5 |3 |5 |7
+|FCVT.S.D |S |D |src |RM |dest |OP-FP
+|FCVT.D.S |D |S |src |RM |dest |OP-FP
+|===
+
+Floating-point to floating-point sign-injection instructions, FSGNJ.D,
+FSGNJN.D, and FSGNJX.D are defined analogously to the single-precision
+sign-injection instruction.
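The single-precision definitions carry over as pure bit manipulations on the 64-bit encoding: the result takes the magnitude of _rs1_ and a sign derived from _rs2_. The Python sketch below is an illustrative model only (struct-based reinterpretation stands in for register reads and writes).

```python
import struct

SIGN = 1 << 63            # sign bit of the 64-bit encoding
MAG = SIGN - 1            # magnitude (exponent + fraction) bits

def _bits(x):
    return struct.unpack('<Q', struct.pack('<d', x))[0]

def _double(b):
    return struct.unpack('<d', struct.pack('<Q', b))[0]

def fsgnj_d(a, b):        # result = magnitude of a, sign of b
    return _double((_bits(a) & MAG) | (_bits(b) & SIGN))

def fsgnjn_d(a, b):       # sign of b, negated
    return _double((_bits(a) & MAG) | ((_bits(b) & SIGN) ^ SIGN))

def fsgnjx_d(a, b):       # sign = sign(a) XOR sign(b)
    return _double(_bits(a) ^ (_bits(b) & SIGN))

assert fsgnj_d(-1.5, 2.0) == 1.5    # FMV.D rd, rs is FSGNJ.D rd, rs, rs
assert fsgnjn_d(1.5, 2.0) == -1.5   # FNEG.D rd, rs is FSGNJN.D rd, rs, rs
assert fsgnjx_d(-1.5, -2.0) == 1.5  # two negative signs cancel
```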
+
+[cols="^,^,^,^,^,^,^",options="header"]
+|===
+|funct5 |fmt |rs2 |rs1 |rm |rd |opcode
+|5 |2 |5 |5 |3 |5 |7
+|FSGNJ |D |src2 |src1 |J[N]/JX |dest |OP-FP
+|===
+
+For XLENlatexmath:[$\geq$]64 only, instructions are provided to move bit
+patterns between the floating-point and integer registers. FMV.X.D moves
+the double-precision value in floating-point register _rs1_ to a
+representation in IEEE 754-2008 standard encoding in integer register
+_rd_. FMV.D.X moves the double-precision value encoded in IEEE 754-2008
+standard encoding from the integer register _rs1_ to the floating-point
+register _rd_.
+
+FMV.X.D and FMV.D.X do not modify the bits being transferred; in
+particular, the payloads of non-canonical NaNs are preserved.
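The bit-preserving property can be modeled as a raw reinterpretation of the 64-bit encoding. In this Python sketch (struct-based reinterpretation stands in for the register move; the payload value is arbitrary), a quiet NaN with a non-canonical payload round-trips unchanged.

```python
import struct

def fmv_x_d(f):
    """Move the raw IEEE 754 encoding of a double into a 64-bit int."""
    return struct.unpack('<Q', struct.pack('<d', f))[0]

def fmv_d_x(bits):
    """Reinterpret a 64-bit integer as a double, bit-for-bit."""
    return struct.unpack('<d', struct.pack('<Q', bits))[0]

# A quiet NaN with a non-canonical payload survives the round trip:
payload_nan = 0x7FF8000000000BEE
assert fmv_x_d(fmv_d_x(payload_nan)) == payload_nan
assert fmv_x_d(1.0) == 0x3FF0000000000000
```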
+
+[cols="^,^,^,^,^,^,^",options="header"]
+|===
+|funct5 |fmt |rs2 |rs1 |funct3 |rd |opcode
+|5 |2 |5 |5 |3 |5 |7
+|FMV.X.D |D |0 |src |000 |dest |OP-FP
+|FMV.D.X |D |0 |src |000 |dest |OP-FP
+|===
+
+Early versions of the RISC-V ISA had additional instructions to allow
+RV32 systems to transfer between the upper and lower portions of a
+64-bit floating-point register and an integer register. However, these
+would be the only instructions with partial register writes and would
+add complexity in implementations with recoded floating-point or
+register renaming, requiring a pipeline read-modify-write sequence.
+Scaling up to handling quad-precision for RV32 and RV64 would also
+require additional instructions if they were to follow this pattern. The
+ISA was defined to reduce the number of explicit int-float register
+moves, by having conversions and comparisons write results to the
+appropriate register file, so we expect the benefit of these
+instructions to be lower than for other ISAs.
+
+We note that for systems that implement a 64-bit floating-point unit
+including fused multiply-add support and 64-bit floating-point loads and
+stores, the marginal hardware cost of moving from a 32-bit to a 64-bit
+integer datapath is low, and a software ABI supporting 32-bit wide
+address-space and pointers can be used to avoid growth of static data
+and dynamic memory traffic.
+
+=== Double-Precision Floating-Point Compare Instructions
+
+The double-precision floating-point compare instructions are defined
+analogously to their single-precision counterparts, but operate on
+double-precision operands.
+
+[cols="^,^,^,^,^,^,^",options="header"]
+|===
+|funct5 |fmt |rs2 |rs1 |rm |rd |opcode
+|5 |2 |5 |5 |3 |5 |7
+|FCMP |D |src2 |src1 |EQ/LT/LE |dest |OP-FP
+|===
+
+=== Double-Precision Floating-Point Classify Instruction
+
+The double-precision floating-point classify instruction, FCLASS.D, is
+defined analogously to its single-precision counterpart, but operates on
+double-precision operands.
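The classification can be modeled by examining the sign, exponent, and fraction fields of the 64-bit encoding. This Python sketch follows the 10-category result mask defined for FCLASS in the F chapter (the bit numbering is assumed from that table: bit 0 is latexmath:[$-\infty$] up through bit 9 for quiet NaN).

```python
import struct

def fclass_d(x):
    """Illustrative FCLASS.D model: returns a 10-bit mask with exactly
    one bit set, identifying the class of the double-precision input."""
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    sign = bits >> 63
    exp = (bits >> 52) & 0x7FF
    frac = bits & ((1 << 52) - 1)
    if exp == 0x7FF:
        if frac == 0:
            return 1 << (7 if not sign else 0)   # +inf / -inf
        return 1 << (9 if frac >> 51 else 8)     # quiet / signaling NaN
    if exp == 0:
        if frac == 0:
            return 1 << (4 if not sign else 3)   # +0 / -0
        return 1 << (5 if not sign else 2)       # +/- subnormal
    return 1 << (6 if not sign else 1)           # +/- normal

assert fclass_d(1.0) == 1 << 6            # positive normal
assert fclass_d(-0.0) == 1 << 3           # negative zero
assert fclass_d(float('inf')) == 1 << 7
assert fclass_d(float('nan')) == 1 << 9   # quiet NaN
assert fclass_d(5e-324) == 1 << 5         # smallest positive subnormal
```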
+
+[cols="^,^,^,^,^,^,^",options="header"]
+|===
+|funct5 |fmt |rs2 |rs1 |funct3 |rd |opcode
+|5 |2 |5 |5 |3 |5 |7
+|FCLASS |D |0 |src |001 |dest |OP-FP
+|===
diff --git a/src/extending.adoc b/src/extending.adoc
new file mode 100644
index 0000000..5ddef11
--- /dev/null
+++ b/src/extending.adoc
@@ -0,0 +1,364 @@
+[[extending]]
+== Extending RISC-V
+
+In addition to supporting standard general-purpose software development,
+another goal of RISC-V is to provide a basis for more specialized
+instruction-set extensions or more customized accelerators. The
+instruction encoding spaces and optional variable-length instruction
+encoding are designed to make it easier to leverage software development
+effort for the standard ISA toolchain when building more customized
+processors. For example, the intent is to continue to provide full
+software support for implementations that only use the standard I base,
+perhaps together with many non-standard instruction-set extensions.
+
+This chapter describes various ways in which the base RISC-V ISA can be
+extended, together with the scheme for managing instruction-set
+extensions developed by independent groups. This volume only deals with
+the unprivileged ISA, although the same approach and terminology is used
+for supervisor-level extensions described in the second volume.
+
+=== Extension Terminology
+
+This section defines some standard terminology for describing RISC-V
+extensions.
+
+==== Standard versus Non-Standard Extension
+
+Any RISC-V processor implementation must support a base integer ISA
+(RV32I, RV32E, RV64I, or RV128I). In addition, an implementation may
+support one or more extensions. We divide extensions into two broad
+categories: _standard_ versus _non-standard_.
+
+* A standard extension is one that is generally useful and that is
+designed to not conflict with any other standard extension. Currently,
+`MAFDQLCBTPV`, described in other chapters of this manual, are either
+complete or planned standard extensions.
+* A non-standard extension may be highly specialized and may conflict
+with other standard or non-standard extensions. We anticipate a wide
+variety of non-standard extensions will be developed over time, with
+some eventually being promoted to standard extensions.
+
+==== Instruction Encoding Spaces and Prefixes
+
+An instruction encoding space is some number of instruction bits within
+which a base ISA or ISA extension is encoded. RISC-V supports varying
+instruction lengths, but even within a single instruction length, there
+are various sizes of encoding space available. For example, the base
+ISAs are defined within a 30-bit encoding space (bits 31–2 of the 32-bit
+instruction), while the atomic extension `A` fits within a 25-bit
+encoding space (bits 31–7).
+
+We use the term _prefix_ to refer to the bits to the _right_ of an
+instruction encoding space (since instruction fetch in RISC-V is
+little-endian, the bits to the right are stored at earlier memory
+addresses, hence form a prefix in instruction-fetch order). The prefix
+for the standard base ISA encoding is the two-bit `11` field held in
+bits 1–0 of the 32-bit word, while the prefix for the standard atomic
+extension `A` is the seven-bit `0101111` field held in bits 6–0 of
+the 32-bit word representing the AMO major opcode. A quirk of the
+encoding format is that the 3-bit funct3 field used to encode a minor
+opcode is not contiguous with the major opcode bits in the 32-bit
+instruction format, but is considered part of the prefix for 22-bit
+instruction spaces.
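The prefix and minor-opcode fields described above can be extracted with simple shifts and masks. The Python sketch below is illustrative; the example instruction word is constructed for the demonstration, not taken from a real encoding.

```python
def decode_prefix(inst):
    """Split a 32-bit instruction word into the fields discussed
    above (little-endian bit numbering)."""
    length_prefix = inst & 0b11    # bits 1-0: 0b11 selects 32-bit format
    major_opcode = inst & 0x7F     # bits 6-0: the 7-bit prefix
    funct3 = (inst >> 12) & 0b111  # minor opcode, non-contiguous with it
    return length_prefix, major_opcode, funct3

# A hypothetical word carrying the AMO major opcode (0b0101111) and
# funct3 = 0b010:
word = (0b010 << 12) | 0b0101111
assert decode_prefix(word) == (0b11, 0b0101111, 0b010)
```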
+
+Although an instruction encoding space could be of any size, adopting a
+smaller set of common sizes simplifies packing independently developed
+extensions into a single global encoding.
+<<encodingspaces>> gives the suggested sizes for RISC-V.
+
+[[encodingspaces]]
+.Suggested standard RISC-V instruction encoding space sizes.
+[cols="^,<,>,>,>,>",]
+|===
+|Size |Usage |# Available in standard instruction length | | |
+
+| | |16-bit |32-bit |48-bit |64-bit
+
+|14-bit |Quadrant of compressed 16-bit encoding |3 | | |
+
+|22-bit |Minor opcode in base 32-bit encoding | |latexmath:[$2^{8}$]
+|latexmath:[$2^{20}$] |latexmath:[$2^{35}$]
+
+|25-bit |Major opcode in base 32-bit encoding | |32
+|latexmath:[$2^{17}$] |latexmath:[$2^{32}$]
+
+|30-bit |Quadrant of base 32-bit encoding | |1 |latexmath:[$2^{12}$]
+|latexmath:[$2^{27}$]
+
+|32-bit |Minor opcode in 48-bit encoding | | |latexmath:[$2^{10}$]
+|latexmath:[$2^{25}$]
+
+|37-bit |Major opcode in 48-bit encoding | | |32 |latexmath:[$2^{20}$]
+
+|40-bit |Quadrant of 48-bit encoding | | |4 |latexmath:[$2^{17}$]
+
+|45-bit |Sub-minor opcode in 64-bit encoding | | | |latexmath:[$2^{12}$]
+
+|48-bit |Minor opcode in 64-bit encoding | | | |latexmath:[$2^{9}$]
+
+|52-bit |Major opcode in 64-bit encoding | | | |32
+|===
+
+==== Greenfield versus Brownfield Extensions
+
+We use the term _greenfield extension_ to describe an extension that
+begins populating a new instruction encoding space, and hence can only
+cause encoding conflicts at the prefix level. We use the term
+_brownfield extension_ to describe an extension that fits around
+existing encodings in a previously defined instruction space. A
+brownfield extension is necessarily tied to a particular greenfield
+parent encoding, and there may be multiple brownfield extensions to the
+same greenfield parent encoding. For example, the base ISAs are
+greenfield encodings of a 30-bit instruction space, while the FDQ
+floating-point extensions are all brownfield extensions adding to the
+parent base ISA 30-bit encoding space.
+
+Note that we consider the standard A extension to have a greenfield
+encoding as it defines a new previously empty 25-bit encoding space in
+the leftmost bits of the full 32-bit base instruction encoding, even
+though its standard prefix locates it within the 30-bit encoding space
+of its parent base ISA. Changing only its single 7-bit prefix could move
+the A extension to a different 30-bit encoding space while only worrying
+about conflicts at the prefix level, not within the encoding space
+itself.
+
+[[exttax]]
+.Two-dimensional characterization of standard instruction-set
+extensions.
+[cols=">,^,^",options="header",]
+|===
+| |Adds state |No new state
+|Greenfield |RV32I(30), RV64I(30) |A(25)
+|Brownfield |F(I), D(F), Q(D) |M(I)
+|===
+
+<<exttax>> shows the bases and standard extensions placed
+in a simple two-dimensional taxonomy. One axis is whether the extension
+is greenfield or brownfield, while the other axis is whether the
+extension adds architectural state. For greenfield extensions, the size
+of the instruction encoding space is given in parentheses. For
+brownfield extensions, the name of the extension (greenfield or
+brownfield) it builds upon is given in parentheses. Additional
+user-level architectural state usually implies changes to the
+supervisor-level system or possibly to the standard calling convention.
+
+Note that RV64I is not considered an extension of RV32I, but a different
+complete base encoding.
+
+==== Standard-Compatible Global Encodings
+
+A complete or _global_ encoding of an ISA for an actual RISC-V
+implementation must allocate a unique non-conflicting prefix for every
+included instruction encoding space. The bases and every standard
+extension have each had a standard prefix allocated to ensure they can
+all coexist in a global encoding.
+
+A _standard-compatible_ global encoding is one where the base and every
+included standard extension have their standard prefixes. A
+standard-compatible global encoding can include non-standard extensions
+that do not conflict with the included standard extensions. A
+standard-compatible global encoding can also use standard prefixes for
+non-standard extensions if the associated standard extensions are not
+included in the global encoding. In other words, a standard extension
+must use its standard prefix if included in a standard-compatible global
+encoding, but otherwise its prefix is free to be reallocated. These
+constraints allow a common toolchain to target the standard subset of
+any RISC-V standard-compatible global encoding.
+
+==== Guaranteed Non-Standard Encoding Space
+
+To support development of proprietary custom extensions, portions of the
+encoding space are guaranteed to never be used by standard extensions.
+
+=== RISC-V Extension Design Philosophy
+
+We intend to support a large number of independently developed
+extensions by encouraging extension developers to operate within
+instruction encoding spaces, and by providing tools to pack these into a
+standard-compatible global encoding by allocating unique prefixes. Some
+extensions are more naturally implemented as brownfield augmentations of
+existing extensions, and will share whatever prefix is allocated to
+their parent greenfield extension. The standard extension prefixes avoid
+spurious incompatibilities in the encoding of core functionality, while
+allowing custom packing of more esoteric extensions.
+
+This capability of repacking RISC-V extensions into different
+standard-compatible global encodings can be used in a number of ways.
+
+One use-case is developing highly specialized custom accelerators,
+designed to run kernels from important application domains. These might
+want to drop all but the base integer ISA and add in only the extensions
+that are required for the task in hand. The base ISAs have been designed
+to place minimal requirements on a hardware implementation, and have been
+encoded to use only a small fraction of a 32-bit instruction encoding
+space.
+
+Another use-case is to build a research prototype for a new type of
+instruction-set extension. The researchers might not want to expend the
+effort to implement a variable-length instruction-fetch unit, and so
+would like to prototype their extension using a simple 32-bit
+fixed-width instruction encoding. However, this new extension might be
+too large to coexist with standard extensions in the 32-bit space. If
+the research experiments do not need all of the standard extensions, a
+standard-compatible global encoding might drop the unused standard
+extensions and reuse their prefixes to place the proposed extension in a
+non-standard location to simplify engineering of the research prototype.
+Standard tools will still be able to target the base and any standard
+extensions that are present to reduce development time. Once the
+instruction-set extension has been evaluated and refined, it could then
+be made available for packing into a larger variable-length encoding
+space to avoid conflicts with all standard extensions.
+
+The following sections describe increasingly sophisticated strategies
+for developing implementations with new instruction-set extensions.
+These are mostly intended for use in highly customized, educational, or
+experimental architectures rather than for the main line of RISC-V ISA
+development.
+
+[[fix32b]]
+=== Extensions within fixed-width 32-bit instruction format
+
+In this section, we discuss adding extensions to implementations that
+only support the base fixed-width 32-bit instruction format.
+
+We anticipate the simplest fixed-width 32-bit encoding will be popular
+for many restricted accelerators and research prototypes.
+
+==== Available 30-bit instruction encoding spaces
+
+In the standard encoding, three of the available 30-bit instruction
+encoding spaces (those with 2-bit prefixes 00, 01, and 10) are used to
+enable the optional compressed instruction extension. However, if the
+compressed instruction-set extension is not required, then these three
+further 30-bit encoding spaces become available. This quadruples the
+available encoding space within the 32-bit format.
+
+==== Available 25-bit instruction encoding spaces
+
+A 25-bit instruction encoding space corresponds to a major opcode in the
+base and standard extension encodings.
+
+There are four major opcodes expressly designated for custom extensions
+(<<opcodemap>>), each of which represents a 25-bit
+encoding space. Two of these are reserved for eventual use in the RV128
+base encoding (they will become OP-IMM-64 and OP-64), but in the
+meantime can be used for non-standard extensions to RV32 and RV64.
+
+The two major opcodes reserved for RV64 (OP-IMM-32 and OP-32) can also
+be used for non-standard extensions to RV32 only.
+
+If an implementation does not require floating-point, then the seven
+major opcodes reserved for standard floating-point extensions (LOAD-FP,
+STORE-FP, MADD, MSUB, NMSUB, NMADD, OP-FP) can be reused for
+non-standard extensions. Similarly, the AMO major opcode can be reused
+if the standard atomic extensions are not required.
+
+If an implementation does not require instructions longer than 32 bits,
+then an additional four major opcodes are available (those marked in
+gray in <<opcodemap>>).
+
+The base RV32I encoding uses only 11 major opcodes plus 3 reserved
+opcodes, leaving up to 18 available for extensions. The base RV64I
+encoding uses only 13 major opcodes plus 3 reserved opcodes, leaving up
+to 16 available for extensions.
+
+==== Available 22-bit instruction encoding spaces
+
+A 22-bit encoding space corresponds to a funct3 minor opcode space in
+the base and standard extension encodings. Several major opcodes have
+funct3 minor opcode spaces that are not completely occupied, leaving
+several 22-bit encoding spaces available.
+
+Usually a major opcode selects the format used to encode operands in the
+remaining bits of the instruction, and ideally, an extension should
+follow the operand format of the major opcode to simplify hardware
+decoding.
+
+==== Other spaces
+
+Smaller spaces are available under certain major opcodes, and not all
+minor opcodes are entirely filled.
+
+=== Adding aligned 64-bit instruction extensions
+
+The simplest approach to provide space for extensions that are too large
+for the base 32-bit fixed-width instruction format is to add naturally
+aligned 64-bit instructions. The implementation must still support the
+32-bit base instruction format, but can require that 64-bit instructions
+are aligned on 64-bit boundaries to simplify instruction fetch, with a
+32-bit NOP instruction used as alignment padding where necessary.
+
+To simplify use of standard tools, the 64-bit instructions should be
+encoded as described in <<instlengthcode>>.
+However, an implementation might choose a non-standard
+instruction-length encoding for 64-bit instructions, while retaining the
+standard encoding for 32-bit instructions. For example, if compressed
+instructions are not required, then a 64-bit instruction could be
+encoded using one or more zero bits in the first two bits of an
+instruction.
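The standard length encoding referenced above can be modeled as a small decoder over the first 16-bit parcel of an instruction. The following Python sketch is illustrative only, mirroring the base-ISA length encoding up through 64-bit instructions; it is not normative:

```python
def insn_length(parcel: int) -> int:
    """Instruction length in bytes, from the low bits of the first
    16-bit parcel, per the standard length encoding."""
    if parcel & 0b11 != 0b11:
        return 2   # compressed 16-bit encoding (bits [1:0] != 11)
    if parcel & 0b11100 != 0b11100:
        return 4   # standard 32-bit encoding (bits [4:2] != 111)
    if parcel & 0b111111 == 0b011111:
        return 6   # 48-bit encoding
    if parcel & 0b1111111 == 0b0111111:
        return 8   # 64-bit encoding
    raise ValueError("longer or reserved length encoding")
```

An implementation that drops the compressed extension could, as the text suggests, repurpose the cases where bits [1:0] are not 11 to mark its longer instructions instead.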
+
+We anticipate processor generators that produce instruction-fetch units
+capable of automatically handling any combination of supported
+variable-length instruction encodings.
+
+=== Supporting VLIW encodings
+
+Although RISC-V was not designed as a base for a pure VLIW machine, VLIW
+encodings can be added as extensions using several alternative
+approaches. In all cases, the base 32-bit encoding has to be supported
+to allow use of any standard software tools.
+
+==== Fixed-size instruction group
+
+The simplest approach is to define a single large naturally aligned
+instruction format (e.g., 128 bits) within which VLIW operations are
+encoded. In a conventional VLIW, this approach would tend to waste
+instruction memory to hold NOPs, but a RISC-V-compatible implementation
+would also have to support the base 32-bit instructions, confining the
+VLIW code-size expansion to VLIW-accelerated functions.
+
+==== Encoded-Length Groups
+
+Another approach is to use the standard length encoding from
+<<instlengthcode>> to encode parallel
+instruction groups, allowing NOPs to be compressed out of the VLIW
+instruction. For example, a 64-bit instruction could hold two 28-bit
+operations, while a 96-bit instruction could hold three 28-bit
+operations, and so on. Alternatively, a 48-bit instruction could hold
+one 42-bit operation, while a 96-bit instruction could hold two 42-bit
+operations, and so on.
+
+This approach has the advantage of retaining the base ISA encoding for
+instructions holding a single operation, but has the disadvantage of
+requiring a new 28-bit or 42-bit encoding for operations within the VLIW
+instructions, and misaligned instruction fetch for larger groups. One
+simplification is to not allow VLIW instructions to straddle certain
+microarchitecturally significant boundaries (e.g., cache lines or
+virtual memory pages).
+
+==== Fixed-Size Instruction Bundles
+
+Another approach, similar to Itanium, is to use a larger naturally
+aligned fixed instruction bundle size (e.g., 128 bits) across which
+parallel operation groups are encoded. This simplifies instruction
+fetch, but shifts the complexity to the group execution engine. To
+remain RISC-V compatible, the base 32-bit instruction would still have
+to be supported.
+
+==== End-of-Group bits in Prefix
+
+None of the above approaches retains the RISC-V encoding for the
+individual operations within a VLIW instruction. Yet another approach is
+to repurpose the two prefix bits in the fixed-width 32-bit encoding. One
+prefix bit can be used to signal "end-of-group" if set, while the
+second bit could indicate execution under a predicate if clear. Standard
+RISC-V 32-bit instructions generated by tools unaware of the VLIW
+extension would have both prefix bits set (11) and thus have the correct
+semantics, with each instruction at the end of a group and not
+predicated.
+
+The main disadvantage of this approach is that the base ISAs lack the
+complex predication support usually required in an aggressive VLIW
+system, and it is difficult to add space to specify more predicate
+registers in the standard 30-bit encoding space.
+
diff --git a/src/f-st-ext.adoc b/src/f-st-ext.adoc
new file mode 100644
index 0000000..90239be
--- /dev/null
+++ b/src/f-st-ext.adoc
@@ -0,0 +1,541 @@
+[[single-float]]
+== `F` Standard Extension for Single-Precision Floating-Point, Version 2.2
+
+This chapter describes the standard instruction-set extension for
+single-precision floating-point, which is named `F` and adds
+single-precision floating-point computational instructions compliant
+with the IEEE 754-2008 arithmetic standard cite:[ieee754-2008]. The F extension depends on
+the `Zicsr` extension for control and status register access.
+
+=== F Register State
+
+The F extension adds 32 floating-point registers, `f0`–`f31`, each 32
+bits wide, and a floating-point control and status register `fcsr`,
+which contains the operating mode and exception status of the
+floating-point unit. This additional state is shown in
+<<fprs>>. We use the term FLEN to describe the width of
+the floating-point registers in the RISC-V ISA, and FLEN=32 for the F
+single-precision floating-point extension. Most floating-point
+instructions operate on values in the floating-point register file.
+Floating-point load and store instructions transfer floating-point
+values between registers and memory. Instructions to transfer values to
+and from the integer register file are also provided.
+
+[TIP]
+====
+We considered a unified register file for both integer and
+floating-point values as this simplifies software register allocation
+and calling conventions, and reduces total user state. However, a split
+organization increases the total number of registers accessible with a
+given instruction width, simplifies provision of enough regfile ports
+for wide superscalar issue, supports decoupled floating-point-unit
+architectures, and simplifies use of internal floating-point encoding
+techniques. Compiler support and calling conventions for split register
+file architectures are well understood, and using dirty bits on
+floating-point register file state can reduce context-switch overhead.
+====
+
+[[fprs]]
+.RISC-V standard F extension single-precision floating-point state
+image::f-standard.png[base,180,1000,align="center"]
+
+=== Floating-Point Control and Status Register
+
+The floating-point control and status register, `fcsr`, is a RISC-V
+control and status register (CSR). It is a 32-bit read/write register
+that selects the dynamic rounding mode for floating-point arithmetic
+operations and holds the accrued exception flags, as shown in <<fcsr>>.
+
+
+include::images/wavedrom/float-csr.adoc[]
+[[fcsr]]
+.Floating-point control and status register
+image::image_placeholder.png[]
+
+The `fcsr` register can be read and written with the FRCSR and FSCSR
+instructions, which are assembler pseudoinstructions built on the
+underlying CSR access instructions. FRCSR reads `fcsr` by copying it
+into integer register _rd_. FSCSR swaps the value in `fcsr` by copying
+the original value into integer register _rd_, and then writing a new
+value obtained from integer register _rs1_ into `fcsr`.
+
+The fields within the `fcsr` can also be accessed individually through
+different CSR addresses, and separate assembler pseudoinstructions are
+defined for these accesses. The FRRM instruction reads the Rounding Mode
+field `frm` and copies it into the least-significant three bits of
+integer register _rd_, with zero in all other bits. FSRM swaps the value
+in `frm` by copying the original value into integer register _rd_, and
+then writing a new value obtained from the three least-significant bits
+of integer register _rs1_ into `frm`. FRFLAGS and FSFLAGS are defined
+analogously for the Accrued Exception Flags field `fflags`.
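The relationship between `fcsr` and its `frm` and `fflags` views can be sketched as simple bit-field extraction. This Python model assumes the standard layout (`fflags` in bits 4:0, `frm` in bits 7:5) and is illustrative, not normative:

```python
def split_fcsr(fcsr: int) -> dict:
    """Model the separate CSR views that FRRM/FRFLAGS expose."""
    return {
        "fflags": fcsr & 0x1F,        # accrued exception flags, bits 4:0
        "frm":    (fcsr >> 5) & 0x7,  # dynamic rounding mode, bits 7:5
    }
```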
+
+
+Bits 31–8 of the `fcsr` are reserved for other standard extensions. If
+these extensions are not present, implementations shall ignore writes to
+these bits and supply a zero value when read. Standard software should
+preserve the contents of these bits.
+
+Floating-point operations use either a static rounding mode encoded in
+the instruction, or a dynamic rounding mode held in `frm`. Rounding
+modes are encoded as shown in <<rm>>. A value of 111 in the
+instruction’s _rm_ field selects the dynamic rounding mode held in
+`frm`. The behavior of floating-point instructions that depend on
+rounding mode when executed with a reserved rounding mode is _reserved_,
+including both static reserved rounding modes (101–110) and dynamic
+reserved rounding modes (101–111). Some instructions, including widening
+conversions, have the _rm_ field but are nevertheless mathematically
+unaffected by the rounding mode; software should set their _rm_ field to
+RNE (000) but implementations must treat the _rm_ field as usual (in
+particular, with regard to decoding legal vs. reserved encodings).
+
+[[rm]]
+.Rounding mode encoding.
+[cols="^,^,<",options="header",]
+|===
+|Rounding Mode |Mnemonic |Meaning
+|000 |RNE |Round to Nearest, ties to Even
+|001 |RTZ |Round towards Zero
+|010 |RDN |Round Down (towards latexmath:[$-\infty$])
+|011 |RUP |Round Up (towards latexmath:[$+\infty$])
+|100 |RMM |Round to Nearest, ties to Max Magnitude
+|101 | |_Reserved for future use._
+|110 | |_Reserved for future use._
+|111 |DYN |In instruction’s _rm_ field, selects dynamic rounding mode;
+in Rounding Mode register, _reserved_.
+|===
+
+
+[NOTE]
+====
+The C99 language standard effectively mandates the provision of a
+dynamic rounding mode register. In typical implementations, writes to
+the dynamic rounding mode CSR state will serialize the pipeline. Static
+rounding modes are used to implement specialized arithmetic operations
+that often have to switch frequently between different rounding modes.
+
+The ratified version of the F spec mandated that an illegal instruction
+exception was raised when an instruction was executed with a reserved
+dynamic rounding mode. This has been weakened to reserved, which matches
+the behavior of static rounding-mode instructions. Raising an illegal
+instruction exception is still valid behavior when encountering a
+reserved encoding, so implementations compatible with the ratified spec
+are compatible with the weakened spec.
+====
+
+
+The accrued exception flags indicate the exception conditions that have
+arisen on any floating-point arithmetic instruction since the field was
+last reset by software, as shown in <<bitdef>>. The base
+RISC-V ISA does not support generating a trap on the setting of a
+floating-point exception flag.
+(((floating-point, exception flag)))
+
+[[bitdef]]
+.Accrued exception flag encoding.
+[cols="^,<",options="header",]
+|===
+|Flag Mnemonic |Flag Meaning
+|NV |Invalid Operation
+|DZ |Divide by Zero
+|OF |Overflow
+|UF |Underflow
+|NX |Inexact
+|===
+
+[NOTE]
+====
+As allowed by the standard, we do not support traps on floating-point
+exceptions in the F extension, but instead require explicit checks of
+the flags in software. We considered adding branches controlled directly
+by the contents of the floating-point accrued exception flags, but
+ultimately chose to omit these instructions to keep the ISA simple.
+====
+
+=== NaN Generation and Propagation
+(((NaN, generation)))
+(((NaN, propagation)))
+
+Except when otherwise stated, if the result of a floating-point
+operation is NaN, it is the canonical NaN. The canonical NaN has a
+positive sign and all significand bits clear except the MSB, a.k.a. the
+quiet bit. For single-precision floating-point, this corresponds to the
+pattern `0x7fc00000`.
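The canonical NaN pattern can be checked directly by reinterpreting the bits as a single-precision value. This Python sketch uses the standard-library `struct` module purely as an illustration:

```python
import math
import struct

def canonical_nan_s() -> float:
    """Decode the single-precision canonical NaN pattern: sign 0,
    exponent all ones, quiet bit set, other significand bits clear."""
    return struct.unpack("<f", struct.pack("<I", 0x7FC00000))[0]
```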
+
+[TIP]
+====
+We considered propagating NaN payloads, as is recommended by the
+standard, but this decision would have increased hardware cost.
+Moreover, since this feature is optional in the standard, it cannot be
+used in portable code.
+
+Implementors are free to provide a NaN payload propagation scheme as a
+nonstandard extension enabled by a nonstandard operating mode. However,
+the canonical NaN scheme described above must always be supported and
+should be the default mode.
+====
+
+[NOTE]
+====
+We require implementations to return the standard-mandated default
+values in the case of exceptional conditions, without any further
+intervention on the part of user-level software (unlike the Alpha ISA
+floating-point trap barriers). We believe full hardware handling of
+exceptional cases will become more common, and so wish to avoid
+complicating the user-level ISA to optimize other approaches.
+Implementations can always trap to machine-mode software handlers to
+provide exceptional default values.
+====
+
+=== Subnormal Arithmetic
+(((operations, subnormal)))
+
+Operations on subnormal numbers are handled in accordance with the IEEE
+754-2008 standard.
+
+In the parlance of the IEEE standard, tininess is detected after
+rounding.
+(((tininess, handling)))
+
+[NOTE]
+====
+Detecting tininess after rounding results in fewer spurious underflow
+signals.
+====
+
+=== Single-Precision Load and Store Instructions
+
+Floating-point loads and stores use the same base+offset addressing mode
+as the integer base ISAs, with a base address in register _rs1_ and a
+12-bit signed byte offset. The FLW instruction loads a single-precision
+floating-point value from memory into floating-point register _rd_. FSW
+stores a single-precision value from floating-point register _rs2_ to
+memory.
+
+include::images/wavedrom/sp-load-store.adoc[]
+[[sp-ldst]]
+.SP load and store
+image::image_placeholder.png[]
+
+FLW and FSW are only guaranteed to execute atomically if the effective
+address is naturally aligned.
+
+FLW and FSW do not modify the bits being transferred; in particular, the
+payloads of non-canonical NaNs are preserved.
+
+As described in <<sp-ldst>>, the execution
+environment defines whether misaligned floating-point loads and stores
+are handled invisibly or raise a contained or fatal trap.
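The base+offset addressing used by FLW and FSW can be sketched as follows; this Python model assumes XLEN=32 and is illustrative only:

```python
def effective_address(rs1: int, imm12: int) -> int:
    """rs1 plus a sign-extended 12-bit immediate, modulo 2^32."""
    offset = imm12 - 0x1000 if imm12 & 0x800 else imm12  # sign-extend
    return (rs1 + offset) & 0xFFFFFFFF
```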
+
+[[single-float-compute]]
+=== Single-Precision Floating-Point Computational Instructions
+
+Floating-point arithmetic instructions with one or two source operands
+use the R-type format with the OP-FP major opcode. FADD.S and FMUL.S
+perform single-precision floating-point addition and multiplication
+respectively, between _rs1_ and _rs2_. FSUB.S performs the
+single-precision floating-point subtraction of _rs2_ from _rs1_. FDIV.S
+performs the single-precision floating-point division of _rs1_ by _rs2_.
+FSQRT.S computes the square root of _rs1_. In each case, the result is
+written to _rd_.
+
+The 2-bit floating-point format field _fmt_ is encoded as shown in
+<<fmt>>. It is set to _S_ (00) for all instructions in the F
+extension.
+
+[[fmt]]
+.Format field encoding
+[cols="^,^,<",options="header",]
+|===
+|_fmt_ field |Mnemonic |Meaning
+|00 |S |32-bit single-precision
+|01 |D |64-bit double-precision
+|10 |H |16-bit half-precision
+|11 |Q |128-bit quad-precision
+|===
+
+All floating-point operations that perform rounding can select the
+rounding mode using the _rm_ field with the encoding shown in
+<<rm>>.
+
+Floating-point minimum-number and maximum-number instructions FMIN.S and
+FMAX.S write, respectively, the smaller or larger of _rs1_ and _rs2_ to
+_rd_. For the purposes of these instructions only, the value
+latexmath:[$-0.0$] is considered to be less than the value
+latexmath:[$+0.0$]. If both inputs are NaNs, the result is the canonical
+NaN. If only one operand is a NaN, the result is the non-NaN operand.
+Signaling NaN inputs set the invalid operation exception flag, even when
+the result is not NaN.
+
+[NOTE]
+====
+Note that in version 2.2 of the F extension, the FMIN.S and FMAX.S
+instructions were amended to implement the proposed IEEE 754-201x
+minimumNumber and maximumNumber operations, rather than the IEEE
+754-2008 minNum and maxNum operations. These operations differ in their
+handling of signaling NaNs.
+====
+
+include::images/wavedrom/spfloat.adoc[]
+[[spfloat]]
+.Single-Precision Floating-Point Computational Instructions
+image::image_placeholder.png[]
+(((floating point, fused multiply-add)))
+
+Floating-point fused multiply-add instructions require a new standard
+instruction format. R4-type instructions specify three source registers
+(_rs1_, _rs2_, and _rs3_) and a destination register (_rd_). This format
+is only used by the floating-point fused multiply-add instructions.
+
+FMADD.S multiplies the values in _rs1_ and _rs2_, adds the value in
+_rs3_, and writes the final result to _rd_. FMADD.S computes
+_(rs1latexmath:[$\times$]rs2)+rs3_.
+
+FMSUB.S multiplies the values in _rs1_ and _rs2_, subtracts the value in
+_rs3_, and writes the final result to _rd_. FMSUB.S computes
+_(rs1latexmath:[$\times$]rs2)-rs3_.
+
+FNMSUB.S multiplies the values in _rs1_ and _rs2_, negates the product,
+adds the value in _rs3_, and writes the final result to _rd_. FNMSUB.S
+computes _-(rs1latexmath:[$\times$]rs2)+rs3_.
+
+FNMADD.S multiplies the values in _rs1_ and _rs2_, negates the product,
+subtracts the value in _rs3_, and writes the final result to _rd_.
+FNMADD.S computes _-(rs1latexmath:[$\times$]rs2)-rs3_.
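The arithmetic of the four variants can be summarized as below. Note this Python sketch deliberately ignores the defining property of fused multiply-add, namely that the product is not rounded before the addition (Python's `*` and `+` each round separately); it only illustrates the sign conventions:

```python
def fmadd(a, b, c):                 # FMADD.S:   (rs1 x rs2) + rs3
    return (a * b) + c

def fmsub(a, b, c):                 # FMSUB.S:   (rs1 x rs2) - rs3
    return (a * b) - c

def fnmsub(a, b, c):                # FNMSUB.S: -(rs1 x rs2) + rs3
    return -(a * b) + c

def fnmadd(a, b, c):                # FNMADD.S: -(rs1 x rs2) - rs3
    return -(a * b) - c
```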
+
+[NOTE]
+====
+The FNMSUB and FNMADD instructions are counterintuitively named, owing
+to the naming of the corresponding instructions in MIPS-IV. The MIPS
+instructions were defined to negate the sum, rather than negating the
+product as the RISC-V instructions do, so the naming scheme was more
+rational at the time. The two definitions differ with respect to
+signed-zero results. The RISC-V definition matches the behavior of the
+x86 and ARM fused multiply-add instructions, but unfortunately the
+RISC-V FNMSUB and FNMADD instruction names are swapped compared to x86
+and ARM.
+====
+
+include::images/wavedrom/fnmaddsub.adoc[]
+[[fnmaddsub]]
+.F[N]MADD/F[N]MSUB instructions
+image::image_placeholder.png[]
+
+[NOTE]
+====
+The fused multiply-add (FMA) instructions consume a large part of the
+32-bit instruction encoding space. Some alternatives considered were to
+restrict FMA to only use dynamic rounding modes, but static rounding
+modes are useful in code that exploits the lack of product rounding.
+Another alternative would have been to use rd to provide rs3, but this
+would require additional move instructions in some common sequences. The
+current design still leaves a large portion of the 32-bit encoding space
+open while avoiding having FMA be non-orthogonal.
+====
+
+The fused multiply-add instructions must set the invalid operation
+exception flag when the multiplicands are latexmath:[$\infty$] and zero,
+even when the addend is a quiet NaN.
+
+[NOTE]
+====
+The IEEE 754-2008 standard permits, but does not require, raising the
+invalid exception for the operation
+latexmath:[$\infty\times 0\ +$]qNaN.
+====
+
+=== Single-Precision Floating-Point Conversion and Move Instructions
+
+Floating-point-to-integer and integer-to-floating-point conversion
+instructions are encoded in the OP-FP major opcode space. FCVT.W.S or
+FCVT.L.S converts a floating-point number in floating-point register
+_rs1_ to a signed 32-bit or 64-bit integer, respectively, in integer
+register _rd_. FCVT.S.W or FCVT.S.L converts a 32-bit or 64-bit signed
+integer, respectively, in integer register _rs1_ into a floating-point
+number in floating-point register _rd_. FCVT.WU.S, FCVT.LU.S, FCVT.S.WU,
+and FCVT.S.LU variants convert to or from unsigned integer values. For
+XLENlatexmath:[$>32$], FCVT.W[U].S sign-extends the 32-bit result to the
+destination register width. FCVT.L[U].S and FCVT.S.L[U] are RV64-only
+instructions. If the rounded result is not representable in the
+destination format, it is clipped to the nearest value and the invalid
+flag is set. <<int_conv>> gives the range of valid inputs
+for FCVT._int_.S and the behavior for invalid inputs.
+(((floating-point, conversion)))
+
+[[int_conv]]
+.Domains of float-to-integer conversions and behavior for invalid inputs
+[cols="<,>,>,>,>",options="header",]
+|===
+| |FCVT.W.S |FCVT.WU.S |FCVT.L.S |FCVT.LU.S
+|Minimum valid input (after rounding) |latexmath:[$-2^{31}$] |0
+|latexmath:[$-2^{63}$] |0
+
+|Maximum valid input (after rounding) |latexmath:[$2^{31}-1$]
+|latexmath:[$2^{32}-1$] |latexmath:[$2^{63}-1$] |latexmath:[$2^{64}-1$]
+
+|Output for out-of-range negative input |latexmath:[$-2^{31}$] |0
+|latexmath:[$-2^{63}$] |0
+
+|Output for latexmath:[$-\infty$] |latexmath:[$-2^{31}$] |0
+|latexmath:[$-2^{63}$] |0
+
+|Output for out-of-range positive input |latexmath:[$2^{31}-1$]
+|latexmath:[$2^{32}-1$] |latexmath:[$2^{63}-1$] |latexmath:[$2^{64}-1$]
+
+|Output for latexmath:[$+\infty$] or NaN |latexmath:[$2^{31}-1$]
+|latexmath:[$2^{32}-1$] |latexmath:[$2^{63}-1$] |latexmath:[$2^{64}-1$]
+|===
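The clipping behavior in the table can be sketched for FCVT.W.S as follows. This Python model simplifies rounding to round-towards-zero and does not model exception flags; it is illustrative only:

```python
import math

def fcvt_w_s(x: float) -> int:
    """Out-of-range behavior of FCVT.W.S per the table above,
    with rounding simplified to round-towards-zero (RTZ)."""
    if math.isnan(x):
        return 2**31 - 1           # NaN maps to the most positive value
    r = math.trunc(x) if math.isfinite(x) else x
    if r >= 2**31:                 # +inf or out-of-range positive
        return 2**31 - 1
    if r < -2**31:                 # -inf or out-of-range negative
        return -(2**31)
    return int(r)
```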
+
+All floating-point-to-integer and integer-to-floating-point conversion
+instructions round according to the _rm_ field. A floating-point
+register can be initialized to floating-point positive zero using
+FCVT.S.W _rd_, `x0`, which will never set any exception flags.
+
+All floating-point conversion instructions set the Inexact exception
+flag if the rounded result differs from the operand value and the
+Invalid exception flag is not set.
+
+include::images/wavedrom/spfloat.adoc[]
+[[fcvt]]
+.SP float convert and move
+image::image_placeholder.png[]
+
+Floating-point to floating-point sign-injection instructions, FSGNJ.S,
+FSGNJN.S, and FSGNJX.S, produce a result that takes all bits except the
+sign bit from _rs1_. For FSGNJ, the result’s sign bit is _rs2_’s sign
+bit; for FSGNJN, the result’s sign bit is the opposite of _rs2_’s sign
+bit; and for FSGNJX, the sign bit is the XOR of the sign bits of _rs1_
+and _rs2_. Sign-injection instructions do not set floating-point
+exception flags, nor do they canonicalize NaNs. Note, FSGNJ.S _rx, ry,
+ry_ moves _ry_ to _rx_ (assembler pseudoinstruction FMV.S _rx, ry_);
+FSGNJN.S _rx, ry, ry_ moves the negation of _ry_ to _rx_ (assembler
+pseudoinstruction FNEG.S _rx, ry_); and FSGNJX.S _rx, ry, ry_ moves the
+absolute value of _ry_ to _rx_ (assembler pseudoinstruction FABS.S _rx,
+ry_).
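The three sign-injection operations can be sketched on raw single-precision bit patterns. As the text notes, no flags are set and NaNs are not canonicalized; this Python model is illustrative only:

```python
SIGN = 0x80000000  # sign bit of a single-precision pattern

def fsgnj(rs1: int, rs2: int) -> int:    # result sign = rs2's sign
    return (rs1 & 0x7FFFFFFF) | (rs2 & SIGN)

def fsgnjn(rs1: int, rs2: int) -> int:   # result sign = NOT rs2's sign
    return (rs1 & 0x7FFFFFFF) | (~rs2 & SIGN)

def fsgnjx(rs1: int, rs2: int) -> int:   # result sign = rs1 XOR rs2
    return rs1 ^ (rs2 & SIGN)
```

With both sources equal, these yield the FMV.S, FNEG.S, and FABS.S pseudoinstruction behaviors described above.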
+
+include::images/wavedrom/spfloat-cn-cmp.adoc[]
+[[spfloat-cn-cmp]]
+.SP floating point convert and compare
+image::image_placeholder.png[]
+
+[NOTE]
+====
+The sign-injection instructions provide floating-point MV, ABS, and NEG,
+as well as supporting a few other operations, including the IEEE
+copySign operation and sign manipulation in transcendental math function
+libraries. Although MV, ABS, and NEG only need a single register
+operand, whereas FSGNJ instructions need two, it is unlikely most
+microarchitectures would add optimizations to benefit from the reduced
+number of register reads for these relatively infrequent instructions.
+Even in this case, a microarchitecture can simply detect when both
+source registers are the same for FSGNJ instructions and only read a
+single copy.
+====
+
+Instructions are provided to move bit patterns between the
+floating-point and integer registers. FMV.X.W moves the single-precision
+value in floating-point register _rs1_ represented in IEEE 754-2008
+encoding to the lower 32 bits of integer register _rd_. The bits are not
+modified in the transfer, and in particular, the payloads of
+non-canonical NaNs are preserved. For RV64, the higher 32 bits of the
+destination register are filled with copies of the floating-point
+number’s sign bit.
+
+FMV.W.X moves the single-precision value encoded in IEEE 754-2008
+standard encoding from the lower 32 bits of integer register _rs1_ to
+the floating-point register _rd_. The bits are not modified in the
+transfer, and in particular, the payloads of non-canonical NaNs are
+preserved.
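The RV64 behavior of FMV.X.W, moving the 32-bit pattern unchanged and filling the upper bits with copies of the sign bit, can be sketched as follows (XLEN=64 assumed; illustrative only):

```python
def fmv_x_w_rv64(f_bits: int) -> int:
    """Move a 32-bit FP pattern to a 64-bit integer register,
    sign-extending from bit 31."""
    x = f_bits & 0xFFFFFFFF
    if x & 0x80000000:                 # replicate the FP sign bit
        x |= 0xFFFFFFFF00000000
    return x
```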
+
+[NOTE]
+====
+The FMV.W.X and FMV.X.W instructions were previously called FMV.S.X and
+FMV.X.S. The use of W is more consistent with their semantics as an
+instruction that moves 32 bits without interpreting them. This became
+clearer after defining NaN-boxing. To avoid disturbing existing code,
+both the W and S versions will be supported by tools.
+====
+
+include::images/wavedrom/spfloat-mv.adoc[]
+[[spfloat-mv]]
+.SP floating point move
+image::image_placeholder.png[]
+
+
+[TIP]
+====
+The base floating-point ISA was defined so as to allow implementations
+to employ an internal recoding of the floating-point format in registers
+to simplify handling of subnormal values and possibly to reduce
+functional unit latency. To this end, the F extension avoids
+representing integer values in the floating-point registers by defining
+conversion and comparison operations that read and write the integer
+register file directly. This also removes many of the common cases where
+explicit moves between integer and floating-point registers are
+required, reducing instruction count and critical paths for common
+mixed-format code sequences.
+====
+
+=== Single-Precision Floating-Point Compare Instructions
+
+Floating-point compare instructions (FEQ.S, FLT.S, FLE.S) perform the
+specified comparison between floating-point registers
+(latexmath:[$rs1 = rs2$], latexmath:[$rs1 < rs2$],
+latexmath:[$rs1 \leq rs2$]) writing 1 to the integer register _rd_ if the
+condition holds, and 0 otherwise.
+
+FLT.S and FLE.S perform what the IEEE 754-2008 standard refers to as
+_signaling_ comparisons: that is, they set the invalid operation
+exception flag if either input is NaN. FEQ.S performs a _quiet_
+comparison: it only sets the invalid operation exception flag if either
+input is a signaling NaN. For all three instructions, the result is 0 if
+either operand is NaN.
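The result values (though not the flag behavior) can be sketched directly, since Python float comparisons with NaN already return false, matching the rule that all three instructions produce 0 when either operand is NaN. Illustrative only:

```python
def feq(a: float, b: float) -> int:   # quiet comparison
    return int(a == b)

def flt(a: float, b: float) -> int:   # signaling comparison
    return int(a < b)

def fle(a: float, b: float) -> int:   # signaling comparison
    return int(a <= b)
```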
+
+include::images/wavedrom/spfloat-comp.adoc[]
+[[spfloat-comp]]
+.SP floating point compare
+image::image_placeholder.png[]
+
+[NOTE]
+====
+The F extension provides a latexmath:[$\leq$] comparison, whereas the
+base ISAs provide a latexmath:[$\geq$] branch comparison. Because
+latexmath:[$\leq$] can be synthesized from latexmath:[$\geq$] and
+vice-versa, there is no performance implication to this inconsistency,
+but it is nevertheless an unfortunate incongruity in the ISA.
+====
+
+=== Single-Precision Floating-Point Classify Instruction
+
+The FCLASS.S instruction examines the value in floating-point register
+_rs1_ and writes to integer register _rd_ a 10-bit mask that indicates
+the class of the floating-point number. The format of the mask is
+described in <<fclass>>. The corresponding bit in _rd_ will
+be set if the property is true and clear otherwise. All other bits in
+_rd_ are cleared. Note that exactly one bit in _rd_ will be set.
+FCLASS.S does not set the floating-point exception flags.
+(((floating-point, classification)))
+
+include::images/wavedrom/spfloat-classify.adoc[]
+[[spfloat-classify]]
+.SP floating point classify
+image::image_placeholder.png[]
+
+[[fclass]]
+.Format of result of FCLASS instruction.
+[cols="^,<",options="header",]
+|===
+|_rd_ bit |Meaning
+|0 |_rs1_ is latexmath:[$-\infty$].
+|1 |_rs1_ is a negative normal number.
+|2 |_rs1_ is a negative subnormal number.
+|3 |_rs1_ is latexmath:[$-0$].
+|4 |_rs1_ is latexmath:[$+0$].
+|5 |_rs1_ is a positive subnormal number.
+|6 |_rs1_ is a positive normal number.
+|7 |_rs1_ is latexmath:[$+\infty$].
+|8 |_rs1_ is a signaling NaN.
+|9 |_rs1_ is a quiet NaN.
+|===
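The classification in the table can be sketched as a decoder over a raw single-precision bit pattern, producing a mask with exactly one bit set. This Python model is illustrative only:

```python
def fclass_s(bits: int) -> int:
    """FCLASS.S result mask per the table above (bits 0..9)."""
    sign = bits >> 31
    exp  = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    if exp == 0xFF:
        if frac == 0:
            return 1 << (0 if sign else 7)         # -inf / +inf
        return 1 << (9 if frac & 0x400000 else 8)  # quiet / signaling NaN
    if exp == 0:
        if frac == 0:
            return 1 << (3 if sign else 4)         # -0 / +0
        return 1 << (2 if sign else 5)             # subnormal
    return 1 << (1 if sign else 6)                 # normal
```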
+
diff --git a/src/history.adoc b/src/history.adoc
new file mode 100644
index 0000000..8ec9fbd
--- /dev/null
+++ b/src/history.adoc
@@ -0,0 +1,362 @@
+[[history]]
+== History and Acknowledgments
+
+=== "Why Develop a New ISA?" Rationale from Berkeley Group
+
+We developed RISC-V to support our own needs in research and education,
+where our group is particularly interested in actual hardware
+implementations of research ideas (we have completed eleven different
+silicon fabrications of RISC-V since the first edition of this
+specification), and in providing real implementations for students to
+explore in classes (RISC-V processor RTL designs have been used in
+multiple undergraduate and graduate classes at Berkeley). In our current
+research, we are especially interested in the move towards specialized
+and heterogeneous accelerators, driven by the power constraints imposed
+by the end of conventional transistor scaling. We wanted a highly
+flexible and extensible base ISA around which to build our research
+effort.
+
+A question we have been repeatedly asked is "Why develop a new ISA?"
+The biggest obvious benefit of using an existing commercial ISA is the
+large and widely supported software ecosystem, both development tools
+and ported applications, which can be leveraged in research and
+teaching. Other benefits include the existence of large amounts of
+documentation and tutorial examples. However, our experience of using
+commercial instruction sets for research and teaching is that these
+benefits are smaller in practice, and do not outweigh the disadvantages:
+
+* *Commercial ISAs are proprietary.* Except for SPARC V8, which is an
+open IEEE standard cite:[sparcieee1994], most owners of commercial ISAs carefully guard
+their intellectual property and do not welcome freely available
+competitive implementations. This is much less of an issue for academic
+research and teaching using only software simulators, but has been a
+major concern for groups wishing to share actual RTL implementations. It
+is also a major concern for entities who do not want to trust the few
+sources of commercial ISA implementations, but who are prohibited from
+creating their own clean room implementations. We cannot guarantee that
+all RISC-V implementations will be free of third-party patent
+infringements, but we can guarantee we will not attempt to sue a RISC-V
+implementor.
+* *Commercial ISAs are only popular in certain market domains.* The most
+obvious examples at time of writing are that the ARM architecture is not
+well supported in the server space, and the Intel x86 architecture (or
+for that matter, almost every other architecture) is not well supported
+in the mobile space, though both Intel and ARM are attempting to enter
+each other’s market segments. Another example is ARC and Tensilica,
+which provide extensible cores but are focused on the embedded space.
+This market segmentation dilutes the benefit of supporting a particular
+commercial ISA as in practice the software ecosystem only exists for
+certain domains, and has to be built for others.
+* *Commercial ISAs come and go.* Previous research infrastructures have
+been built around commercial ISAs that are no longer popular (SPARC,
+MIPS) or even no longer in production (Alpha). These lose the benefit of
+an active software ecosystem, and the lingering intellectual property
+issues around the ISA and supporting tools interfere with the ability of
+interested third parties to continue supporting the ISA. An open ISA
+might also lose popularity, but any interested party can continue using
+and developing the ecosystem.
+* *Popular commercial ISAs are complex.* The dominant commercial ISAs
+(x86 and ARM) are both very complex to implement in hardware to the
+level of supporting common software stacks and operating systems. Worse,
+nearly all the complexity is due to bad, or at least outdated, ISA
+design decisions rather than features that truly improve efficiency.
+* *Commercial ISAs alone are not enough to bring up applications.* Even
+if we expend the effort to implement a commercial ISA, this is not
+enough to run existing applications for that ISA. Most applications need
+a complete ABI (application binary interface) to run, not just the
+user-level ISA. Most ABIs rely on libraries, which in turn rely on
+operating system support. To run an existing operating system requires
+implementing the supervisor-level ISA and device interfaces expected by
+the OS. These are usually much less well-specified and considerably more
+complex to implement than the user-level ISA.
+* *Popular commercial ISAs were not designed for extensibility.* The
+dominant commercial ISAs were not particularly designed for
+extensibility, and as a consequence have added considerable instruction
+encoding complexity as their instruction sets have grown. Companies such
+as Tensilica (acquired by Cadence) and ARC (acquired by Synopsys) have
+built ISAs and toolchains around extensibility, but have focused on
+embedded applications rather than general-purpose computing systems.
+* *A modified commercial ISA is a new ISA.* One of our main goals is to
+support architecture research, including major ISA extensions. Even
+small extensions diminish the benefit of using a standard ISA, as
+compilers have to be modified and applications rebuilt from source code
+to use the extension. Larger extensions that introduce new architectural
+state also require modifications to the operating system. Ultimately,
+the modified commercial ISA becomes a new ISA, but carries along all the
+legacy baggage of the base ISA.
+
+Our position is that the ISA is perhaps the most important interface in
+a computing system, and there is no reason that such an important
+interface should be proprietary. The dominant commercial ISAs are based
+on instruction-set concepts that were already well known over 30 years
+ago. Software developers should be able to target an open standard
+hardware target, and commercial processor designers should compete on
+implementation quality.
+
+We are far from the first to contemplate an open ISA design suitable for
+hardware implementation. We also considered other existing open ISA
+designs, of which the closest to our goals was the OpenRISC
+architecture cite:[openriscarch]. We decided against adopting the OpenRISC ISA for several
+technical reasons:
+
+* OpenRISC has condition codes and branch delay slots, which complicate
+higher performance implementations.
+* OpenRISC uses a fixed 32-bit encoding and 16-bit immediates, which
+precludes a denser instruction encoding and limits space for later
+expansion of the ISA.
+* OpenRISC does not support the 2008 revision to the IEEE 754
+floating-point standard.
+* The OpenRISC 64-bit design had not been completed when we began.
+
+By starting from a clean slate, we could design an ISA that met all of
+our goals, though of course, this took far more effort than we had
+planned at the outset. We have now invested considerable effort in
+building up the RISC-V ISA infrastructure, including documentation,
+compiler tool chains, operating system ports, reference ISA simulators,
+FPGA implementations, efficient ASIC implementations, architecture test
+suites, and teaching materials. Since the last edition of this manual,
+there has been considerable uptake of the RISC-V ISA in both academia
+and industry, and we have created the non-profit RISC-V Foundation to
+protect and promote the standard. The RISC-V Foundation website at
+https://riscv.org contains the latest information on the Foundation
+membership and various open-source projects using RISC-V.
+
+=== History from Revision 1.0 of ISA manual
+
+The RISC-V ISA and instruction-set manual builds upon several earlier
+projects. Several aspects of the supervisor-level machine and the
+overall format of the manual date back to the T0 (Torrent-0) vector
+microprocessor project at UC Berkeley and ICSI, begun in 1992. T0 was a
+vector processor based on the MIPS-II ISA, with Krste Asanović as main
+architect and RTL designer, and Brian Kingsbury and Bertrand Irrisou as
+principal VLSI implementors. David Johnson at ICSI was a major
+contributor to the T0 ISA design, particularly supervisor mode, and to
+the manual text. John Hauser also provided considerable feedback on the
+T0 ISA design.
+
+The Scale (Software-Controlled Architecture for Low Energy) project at
+MIT, begun in 2000, built upon the T0 project infrastructure, refined
+the supervisor-level interface, and moved away from the MIPS scalar ISA
+by dropping the branch delay slot. Ronny Krashinsky and Christopher
+Batten were the principal architects of the Scale Vector-Thread
+processor at MIT, while Mark Hampton ported the GCC-based compiler
+infrastructure and tools for Scale.
+
+A lightly edited version of the T0 MIPS scalar processor specification
+(MIPS-6371) was used in teaching a new version of the MIT 6.371
+Introduction to VLSI Systems class in the Fall 2002 semester, with Chris
+Terman and Krste Asanović as lecturers. Chris Terman contributed most of
+the lab material for the class (there was no TA!). The 6.371 class
+evolved into the trial 6.884 Complex Digital Design class at MIT, taught
+by Arvind and Krste Asanović in Spring 2005, which became a regular
+Spring class 6.375. A reduced version of the Scale MIPS-based scalar
+ISA, named SMIPS, was used in 6.884/6.375. Christopher Batten was the TA
+for the early offerings of these classes and developed a considerable
+amount of documentation and lab material based around the SMIPS ISA.
+This same SMIPS lab material was adapted and enhanced by TA Yunsup Lee
+for the UC Berkeley Fall 2009 CS250 VLSI Systems Design class taught by
+John Wawrzynek, Krste Asanović, and John Lazzaro.
+
+The Maven (Malleable Array of Vector-thread ENgines) project was a
+second-generation vector-thread architecture. Its design was led by
+Christopher Batten when he was an Exchange Scholar at UC Berkeley
+starting in summer 2007. Hidetaka Aoki, a visiting industrial fellow
+from Hitachi, gave considerable feedback on the early Maven ISA and
+microarchitecture design. The Maven infrastructure was based on the
+Scale infrastructure but the Maven ISA moved further away from the MIPS
+ISA variant defined in Scale, with a unified floating-point and integer
+register file. Maven was designed to support experimentation with
+alternative data-parallel accelerators. Yunsup Lee was the main
+implementor of the various Maven vector units, while Rimas Avižienis was
+the main implementor of the various Maven scalar units. Yunsup Lee and
+Christopher Batten ported GCC to work with the new Maven ISA.
+Christopher Celio provided the initial definition of a traditional
+vector instruction set (`Flood`) variant of Maven.
+
+Based on experience with all these previous projects, the RISC-V ISA
+definition was begun in Summer 2010, with Andrew Waterman, Yunsup Lee,
+Krste Asanović, and David Patterson as principal designers. An initial
+version of the RISC-V 32-bit instruction subset was used in the UC
+Berkeley Fall 2010 CS250 VLSI Systems Design class, with Yunsup Lee as
+TA. RISC-V is a clean break from the earlier MIPS-inspired designs. John
+Hauser contributed to the floating-point ISA definition, including the
+sign-injection instructions and a register encoding scheme that permits
+internal recoding of floating-point values.
+
+=== History from Revision 2.0 of ISA manual
+
+Multiple implementations of RISC-V processors have been completed,
+including several silicon fabrications, as shown in
+<<silicon>>.
+
+[[silicon]]
+[cols="<,>,<,<",options="header",]
+|===
+|Name |Tapeout Date |Process |ISA
+|Raven-1 |May 29, 2011 |ST 28nm FDSOI |RV64G1_Xhwacha1
+|EOS14 |April 1, 2012 |IBM 45nm SOI |RV64G1p1_Xhwacha2
+|EOS16 |August 17, 2012 |IBM 45nm SOI |RV64G1p1_Xhwacha2
+|Raven-2 |August 22, 2012 |ST 28nm FDSOI |RV64G1p1_Xhwacha2
+|EOS18 |February 6, 2013 |IBM 45nm SOI |RV64G1p1_Xhwacha2
+|EOS20 |July 3, 2013 |IBM 45nm SOI |RV64G1p99_Xhwacha2
+|Raven-3 |September 26, 2013 |ST 28nm SOI |RV64G1p99_Xhwacha2
+|EOS22 |March 7, 2014 |IBM 45nm SOI |RV64G1p9999_Xhwacha3
+|===
+
+The first RISC-V processors to be fabricated were written in Verilog and
+manufactured in a pre-production FDSOI technology from ST as the Raven-1
+testchip in 2011. Two cores were developed by Yunsup Lee and Andrew
+Waterman, advised by Krste Asanović, and fabricated together: 1) an RV64
+scalar core with error-detecting flip-flops, and 2) an RV64 core with an
+attached 64-bit floating-point vector unit. The first microarchitecture
+was informally known as `TrainWreck`, due to the short time available
+to complete the design with immature design libraries.
+
+Subsequently, a clean microarchitecture for an in-order decoupled RV64
+core was developed by Andrew Waterman, Rimas Avižienis, and Yunsup Lee,
+advised by Krste Asanović, and, continuing the railway theme, was
+codenamed `Rocket` after George Stephenson’s successful steam
+locomotive design. Rocket was written in Chisel, a new hardware design
+language developed at UC Berkeley. The IEEE floating-point units used in
+Rocket were developed by John Hauser, Andrew Waterman, and Brian
+Richards. Rocket has since been refined and developed further, and has
+been fabricated two more times in FDSOI (Raven-2, Raven-3), and five
+times in IBM SOI technology (EOS14, EOS16, EOS18, EOS20, EOS22) for a
+photonics project. Work is ongoing to make the Rocket design available
+as a parameterized RISC-V processor generator.
+
+EOS14–EOS22 chips include early versions of Hwacha, a 64-bit IEEE
+floating-point vector unit, developed by Yunsup Lee, Andrew Waterman,
+Huy Vo, Albert Ou, Quan Nguyen, and Stephen Twigg, advised by Krste
+Asanović. EOS16–EOS22 chips include dual cores with a cache-coherence
+protocol developed by Henry Cook and Andrew Waterman, advised by Krste
+Asanović. EOS14 silicon has successfully run at 1.25 GHz. EOS16 silicon suffered
+from a bug in the IBM pad libraries. EOS18 and EOS20 have successfully
+run at 1.35 GHz.
+
+Contributors to the Raven testchips include Yunsup Lee, Andrew Waterman,
+Rimas Avižienis, Brian Zimmer, Jaehwa Kwak, Ruzica Jevtić, Milovan
+Blagojević, Alberto Puggelli, Steven Bailey, Ben Keller, Pi-Feng Chiu,
+Brian Richards, Borivoje Nikolić, and Krste Asanović.
+
+Contributors to the EOS testchips include Yunsup Lee, Rimas Avižienis,
+Andrew Waterman, Henry Cook, Huy Vo, Daiwei Li, Chen Sun, Albert Ou,
+Quan Nguyen, Stephen Twigg, Vladimir Stojanović, and Krste Asanović.
+
+Andrew Waterman and Yunsup Lee developed the C++ ISA simulator
+`Spike`, used as a golden model in development and named after the
+golden spike used to celebrate completion of the US transcontinental
+railway. Spike has been made available as a BSD open-source project.
+
+Andrew Waterman completed a Master’s thesis with a preliminary design of
+the RISC-V compressed instruction set cite:[waterman-ms].
+
+Various FPGA implementations of the RISC-V have been completed,
+primarily as part of integrated demos for the Par Lab project research
+retreats. The largest FPGA design has 3 cache-coherent RV64IMA
+processors running a research operating system. Contributors to the FPGA
+implementations include Andrew Waterman, Yunsup Lee, Rimas Avižienis,
+and Krste Asanović.
+
+RISC-V processors have been used in several classes at UC Berkeley.
+Rocket was used in the Fall 2011 offering of CS250 as a basis for class
+projects, with Brian Zimmer as TA. For the undergraduate CS152 class in
+Spring 2012, Christopher Celio used Chisel to write a suite of
+educational RV32 processors, named `Sodor` after the island on which
+`Thomas the Tank Engine` and friends live. The suite includes a
+microcoded core, an unpipelined core, and 2, 3, and 5-stage pipelined
+cores, and is publicly available under a BSD license. The suite was
+subsequently updated and used again in CS152 in Spring 2013, with Yunsup
+Lee as TA, and in Spring 2014, with Eric Love as TA. Christopher Celio
+also developed an out-of-order RV64 design known as BOOM (Berkeley
+Out-of-Order Machine), with accompanying pipeline visualizations, that
+was used in the CS152 classes. The CS152 classes also used
+cache-coherent versions of the Rocket core developed by Andrew Waterman
+and Henry Cook.
+
+Over the summer of 2013, the RoCC (Rocket Custom Coprocessor) interface
+was defined to simplify adding custom accelerators to the Rocket core.
+Rocket and the RoCC interface were used extensively in the Fall 2013
+CS250 VLSI class taught by Jonathan Bachrach, with several student
+accelerator projects built to the RoCC interface. The Hwacha vector unit
+has been rewritten as a RoCC coprocessor.
+
+Two Berkeley undergraduates, Quan Nguyen and Albert Ou, successfully
+ported Linux to run on RISC-V in Spring 2013.
+
+Colin Schmidt successfully completed an LLVM backend for RISC-V 2.0 in
+January 2014.
+
+Darius Rad at Bluespec contributed soft-float ABI support to the GCC
+port in March 2014.
+
+John Hauser contributed the definition of the floating-point
+classification instructions.
+
+We are aware of several other RISC-V core implementations, including one
+in Verilog by Tommy Thorn, and one in Bluespec by Rishiyur Nikhil.
+
+=== Acknowledgments
+
+Thanks to Christopher F. Batten, Preston Briggs, Christopher Celio,
+David Chisnall, Stefan Freudenberger, John Hauser, Ben Keller, Rishiyur
+Nikhil, Michael Taylor, Tommy Thorn, and Robert Watson for comments on
+the draft ISA version 2.0 specification.
+
+=== History from Revision 2.1
+
+Uptake of the RISC-V ISA has been very rapid since the introduction of
+the frozen version 2.0 in May 2014, with too much activity to record in
+a short history section such as this. Perhaps the most important single
+event was the formation of the non-profit RISC-V Foundation in August
+2015. The Foundation will now take over stewardship of the official
+RISC-V ISA standard, and the official website `riscv.org` is the best
+place to obtain news and updates on the RISC-V standard.
+
+=== Acknowledgments
+
+Thanks to Scott Beamer, Allen J. Baum, Christopher Celio, David
+Chisnall, Paul Clayton, Palmer Dabbelt, Jan Gray, Michael Hamburg, and
+John Hauser for comments on the version 2.0 specification.
+
+=== History from Revision 2.2
+
+=== Acknowledgments
+
+Thanks to Jacob Bachmeyer, Alex Bradbury, David Horner, Stefan O’Rear,
+and Joseph Myers for comments on the version 2.1 specification.
+
+=== History for Revision 2.3
+
+Uptake of RISC-V continues at breakneck pace.
+
+John Hauser and Andrew Waterman contributed a hypervisor ISA extension
+based upon a proposal from Paolo Bonzini.
+
+Daniel Lustig, Arvind, Krste Asanović, Shaked Flur, Paul Loewenstein,
+Yatin Manerkar, Luc Maranget, Margaret Martonosi, Vijayanand Nagarajan,
+Rishiyur Nikhil, Jonas Oberhauser, Christopher Pulte, Jose Renau, Peter
+Sewell, Susmit Sarkar, Caroline Trippel, Muralidaran Vijayaraghavan,
+Andrew Waterman, Derek Williams, Andrew Wright, and Sizhuo Zhang
+contributed the memory consistency model.
+
+=== Funding
+
+Development of the RISC-V architecture and implementations has been
+partially funded by the following sponsors.
+
+* *Par Lab:* Research supported by Microsoft (Award # 024263) and Intel
+(Award # 024894) funding and by matching funding by U.C. Discovery (Award
+# DIG07-10227). Additional support came from Par Lab affiliates Nokia,
+NVIDIA, Oracle, and Samsung.
+* *Project Isis:* DoE Award DE-SC0003624.
+* *ASPIRE Lab*: DARPA PERFECT program, Award HR0011-12-2-0016. DARPA
+POEM program Award HR0011-11-C-0100. The Center for Future Architectures
+Research (C-FAR), a STARnet center funded by the Semiconductor Research
+Corporation. Additional support from ASPIRE industrial sponsor, Intel,
+and ASPIRE affiliates, Google, Hewlett Packard Enterprise, Huawei,
+Nokia, NVIDIA, Oracle, and Samsung.
+
+The content of this paper does not necessarily reflect the position or
+the policy of the US government and no official endorsement should be
+inferred.
diff --git a/src/index.adoc b/src/index.adoc
new file mode 100644
index 0000000..4abaca2
--- /dev/null
+++ b/src/index.adoc
@@ -0,0 +1,2 @@
+[index]
+== Index
diff --git a/src/intro-old.adoc b/src/intro-old.adoc
new file mode 100644
index 0000000..a1d24f7
--- /dev/null
+++ b/src/intro-old.adoc
@@ -0,0 +1,733 @@
+[[introduction]]
+== Introduction
+
+RISC-V (pronounced `risk-five`) is a new instruction-set architecture
+(ISA) that was originally designed to support computer architecture
+research and education, but which we now hope will also become a
+standard free and open architecture for industry implementations. Our
+goals in defining RISC-V include:
+
+* A completely _open_ ISA that is freely available to academia and
+industry.
+* A _real_ ISA suitable for direct native hardware implementation, not
+just simulation or binary translation.
+* An ISA that avoids `over-architecting` for a particular
+microarchitecture style (e.g., microcoded, in-order, decoupled,
+out-of-order) or implementation technology (e.g., full-custom, ASIC,
+FPGA), but which allows efficient implementation in any of these.
+* An ISA separated into a _small_ base integer ISA, usable by itself as
+a base for customized accelerators or for educational purposes, and
+optional standard extensions, to support general-purpose software
+development.
+* Support for the revised 2008 IEEE-754 floating-point standard.
+* An ISA supporting extensive ISA extensions and specialized variants.
+* Both 32-bit and 64-bit address space variants for applications,
+operating system kernels, and hardware implementations.
+* An ISA with support for highly parallel multicore or manycore
+implementations, including heterogeneous multiprocessors.
+* Optional _variable-length instructions_ to both expand available
+instruction encoding space and to support an optional _dense instruction
+encoding_ for improved performance, static code size, and energy
+efficiency.
+* A fully virtualizable ISA to ease hypervisor development.
+* An ISA that simplifies experiments with new privileged architecture
+designs.
+
+[TIP]
+====
+Commentary on our design decisions is formatted as in this paragraph.
+This non-normative text can be skipped if the reader is only interested
+in the specification itself.
+====
+
+[NOTE]
+====
+The name RISC-V was chosen to represent the fifth major RISC ISA design
+from UC Berkeley (RISC-I cite:[riscI-isca1981], RISC-II cite:[Katevenis:1983], SOAR cite:[Ungar:1984], and SPUR cite:[spur-jsscc1989] were the first
+four). We also pun on the use of the Roman numeral `V` to signify
+`variations` and `vectors`, as support for a range of architecture
+research, including various data-parallel accelerators, is an explicit
+goal of the ISA design.
+====
+(((ISA, definition)))
+
+The RISC-V ISA is defined avoiding implementation details as much as
+possible (although commentary is included on implementation-driven
+decisions) and should be read as the software-visible interface to a
+wide variety of implementations rather than as the design of a
+particular hardware artifact. The RISC-V manual is structured in two
+volumes. This volume covers the design of the base _unprivileged_
+instructions, including optional unprivileged ISA extensions.
+Unprivileged instructions are those that are generally usable in all
+privilege modes in all privileged architectures, though behavior might
+vary depending on privilege mode and privilege architecture. The second
+volume provides the design of the first (`classic`) privileged
+architecture. The manuals use IEC 80000-13:2008 conventions, with a byte
+of 8 bits.
+
+[TIP]
+====
+In the unprivileged ISA design, we tried to remove any dependence on
+particular microarchitectural features, such as cache line size, or on
+privileged architecture details, such as page translation. This is both
+for simplicity and to allow maximum flexibility for alternative
+microarchitectures or alternative privileged architectures.
+====
+
+=== RISC-V Hardware Platform Terminology
+
+A RISC-V hardware platform can contain one or more RISC-V-compatible
+processing cores together with other non-RISC-V-compatible cores,
+fixed-function accelerators, various physical memory structures, I/O
+devices, and an interconnect structure to allow the components to
+communicate.
+(((core, component)))
+
+A component is termed a _core_ if it contains an independent instruction
+fetch unit. A RISC-V-compatible core might support multiple
+RISC-V-compatible hardware threads, or _harts_, through multithreading.
+(((core, extensions, coprocessor)))
+
+A RISC-V core might have additional specialized instruction-set
+extensions or an added _coprocessor_. We use the term _coprocessor_ to
+refer to a unit that is attached to a RISC-V core and is mostly
+sequenced by a RISC-V instruction stream, but which contains additional
+architectural state and instruction-set extensions, and possibly some
+limited autonomy relative to the primary RISC-V instruction stream.
+(((core, accelerator)))
+
+We use the term _accelerator_ to refer to either a non-programmable
+fixed-function unit or a core that can operate autonomously but is
+specialized for certain tasks. In RISC-V systems, we expect many
+programmable accelerators will be RISC-V-based cores with specialized
+instruction-set extensions and/or customized coprocessors. An important
+class of RISC-V accelerators are I/O accelerators, which offload I/O
+processing tasks from the main application cores.
+(((core, cluster, multiprocessors)))
+
+The system-level organization of a RISC-V hardware platform can range
+from a single-core microcontroller to a many-thousand-node cluster of
+shared-memory manycore server nodes. Even small systems-on-a-chip might
+be structured as a hierarchy of multicomputers and/or multiprocessors to
+modularize development effort or to provide secure isolation between
+subsystems.
+
+=== RISC-V Software Execution Environments and Harts
+
+The behavior of a RISC-V program depends on the execution environment in
+which it runs. A RISC-V execution environment interface (EEI) defines
+the initial state of the program, the number and type of harts in the
+environment including the privilege modes supported by the harts, the
+accessibility and attributes of memory and I/O regions, the behavior of
+all legal instructions executed on each hart (i.e., the ISA is one
+component of the EEI), and the handling of any interrupts or exceptions
+raised during execution including environment calls. Examples of EEIs
+include the Linux application binary interface (ABI), or the RISC-V
+supervisor binary interface (SBI). The implementation of a RISC-V
+execution environment can be pure hardware, pure software, or a
+combination of hardware and software. For example, opcode traps and
+software emulation can be used to implement functionality not provided
+in hardware. Examples of execution environment implementations include:
+
+* `Bare metal` hardware platforms where harts are directly implemented
+by physical processor threads and instructions have full access to the
+physical address space. The hardware platform defines an execution
+environment that begins at power-on reset.
+* RISC-V operating systems that provide multiple user-level execution
+environments by multiplexing user-level harts onto available physical
+processor threads and by controlling access to memory via virtual
+memory.
+* RISC-V hypervisors that provide multiple supervisor-level execution
+environments for guest operating systems.
+* RISC-V emulators, such as Spike, QEMU, or rv8, which emulate RISC-V
+harts on an underlying x86 system, and which can provide either a
+user-level or a supervisor-level execution environment.
+
+[TIP]
+====
+A bare hardware platform can be considered to define an EEI, where the
+accessible harts, memory, and other devices populate the environment,
+and the initial state is that at power-on reset. Generally, most
+software is designed to use a more abstract interface to the hardware,
+as more abstract EEIs provide greater portability across different
+hardware platforms. Often EEIs are layered on top of one another, where
+one higher-level EEI uses another lower-level EEI.
+====
+(((hart, execution environment)))
+
+From the perspective of software running in a given execution
+environment, a hart is a resource that autonomously fetches and executes
+RISC-V instructions within that execution environment. In this respect,
+a hart behaves like a hardware thread resource even if time-multiplexed
+onto real hardware by the execution environment. Some EEIs support the
+creation and destruction of additional harts, for example, via
+environment calls to fork new harts.
+
+The execution environment is responsible for ensuring the eventual
+forward progress of each of its harts. For a given hart, that
+responsibility is suspended while the hart is exercising a mechanism
+that explicitly waits for an event, such as the wait-for-interrupt
+instruction defined in Volume II of this specification; and that
+responsibility ends if the hart is terminated. The following events
+constitute forward progress:
+
+* The retirement of an instruction.
+* A trap, as defined in <<trap-defn>>.
+* Any other event defined by an extension to constitute forward
+progress.
+
+
+[TIP]
+====
+The term hart was introduced in the work on Lithe in cite:[lithe-pan-hotpar09] and cite:[lithe-pan-pldi10] to provide a term to
+represent an abstract execution resource as opposed to a software thread
+programming abstraction.
+
+The important distinction between a hardware thread (hart) and a
+software thread context is that the software running inside an execution
+environment is not responsible for causing progress of each of its
+harts; that is the responsibility of the outer execution environment. So
+the environment’s harts operate like hardware threads from the
+perspective of the software inside the execution environment.
+
+An execution environment implementation might time-multiplex a set of
+guest harts onto fewer host harts provided by its own execution
+environment but must do so in a way that guest harts operate like
+independent hardware threads. In particular, if there are more guest
+harts than host harts then the execution environment must be able to
+preempt the guest harts and must not wait indefinitely for guest
+software on a guest hart to `yield` control of the guest hart.
+====
+
+=== RISC-V ISA Overview
+
+A RISC-V ISA is defined as a base integer ISA, which must be present in
+any implementation, plus optional extensions to the base ISA. The base
+integer ISAs are very similar to those of the early RISC processors
+except with no branch delay slots and with support for optional
+variable-length instruction encodings. A base is carefully restricted to
+a minimal set of instructions sufficient to provide a reasonable target
+for compilers, assemblers, linkers, and operating systems (with
+additional privileged operations), and so provides a convenient ISA and
+software toolchain `skeleton` around which more customized processor
+ISAs can be built.
+
+Although it is convenient to speak of _the_ RISC-V ISA, RISC-V is
+actually a family of related ISAs, of which there are currently four
+base ISAs. Each base integer instruction set is characterized by the
+width of the integer registers and the corresponding size of the address
+space and by the number of integer registers. There are two primary base
+integer variants, RV32I and RV64I, described in
+Chapters <<rv32>> and <<rv64>>, which provide 32-bit
+or 64-bit address spaces respectively. We use the term XLEN to refer to
+the width of an integer register in bits (either 32 or 64).
+Chapter <<rv32e>> describes the RV32E subset variant of the
+RV32I base instruction set, which has been added to support small
+microcontrollers, and which has half the number of integer registers.
+Chapter <<rv128>> sketches a future RV128I variant of the
+base integer instruction set supporting a flat 128-bit address space
+(XLEN=128). The base integer instruction sets use a two’s-complement
+representation for signed integer values.
+
+Although 64-bit address spaces are a requirement for larger systems, we
+believe 32-bit address spaces will remain adequate for many embedded and
+client devices for decades to come and will be desirable to lower memory
+traffic and energy consumption. In addition, 32-bit address spaces are
+sufficient for educational purposes. A larger flat 128-bit address space
+might eventually be required, so we ensured this could be accommodated
+within the RISC-V ISA framework.
+
+The four base ISAs in RISC-V are treated as distinct base ISAs. A common
+question is why is there not a single ISA, and in particular, why is
+RV32I not a strict subset of RV64I? Some earlier ISA designs (SPARC,
+MIPS) adopted a strict superset policy when increasing address space
+size to support running existing 32-bit binaries on new 64-bit hardware.
+
+The main advantage of explicitly separating base ISAs is that each base
+ISA can be optimized for its needs without having to support all the
+operations needed for other base ISAs. For example, RV64I can omit
+instructions and CSRs that are only needed to cope with the narrower
+registers in RV32I. The RV32I variants can use encoding space otherwise
+reserved for instructions only required by wider address-space variants.
+
+The main disadvantage of not treating the design as a single ISA is that
+it complicates the hardware needed to emulate one base ISA on another
+(e.g., RV32I on RV64I). However, differences in addressing and illegal
+instruction traps generally mean some mode switch would be required in
+hardware in any case even with full superset instruction encodings, and
+the different RISC-V base ISAs are similar enough that supporting
+multiple versions is relatively low cost. Although some have proposed
+that the strict superset design would allow legacy 32-bit libraries to
+be linked with 64-bit code, this is impractical in practice, even with
+compatible encodings, due to the differences in software calling
+conventions and system-call interfaces.
+
+The RISC-V privileged architecture provides fields in `misa` to control
+the unprivileged ISA at each level to support emulating different base
+ISAs on the same hardware. We note that newer SPARC and MIPS ISA
+revisions have deprecated support for running 32-bit code unchanged on
+64-bit systems.
+
+A related question is why there are different encodings for 32-bit adds
+in RV32I (ADD) and RV64I (ADDW). The ADDW opcode could be used for
+32-bit adds in RV32I and ADDD for 64-bit adds in RV64I, instead of the
+existing design which uses the same opcode ADD for 32-bit adds in RV32I
+and 64-bit adds in RV64I with a different opcode ADDW for 32-bit adds in
+RV64I. This would also be more consistent with the use of the same LW
+opcode for 32-bit load in both RV32I and RV64I. The very first versions
+of RISC-V ISA did have a variant of this alternate design, but the
+RISC-V design was changed to the current choice in January 2011. Our
+focus was on supporting 32-bit integers in the 64-bit ISA not on
+providing compatibility with the 32-bit ISA, and the motivation was to
+remove the asymmetry that arose from not all opcodes in RV32I having a
+*W suffix (e.g., ADDW, but AND not ANDW). In hindsight, this was
+perhaps not well-justified and a consequence of designing both ISAs at
+the same time as opposed to adding one later to sit on top of another,
+and also from a belief we had to fold platform requirements into the ISA
+spec which would imply that all the RV32I instructions would have been
+required in RV64I. It is too late to change the encoding now, but this
+is also of little practical consequence for the reasons stated above.
+
+It has been noted we could enable the *W variants as an extension to
+RV32I systems to provide a common encoding across RV64I and a future
+RV32 variant.
+
+RISC-V has been designed to support extensive customization and
+specialization. Each base integer ISA can be extended with one or more
+optional instruction-set extensions. An extension may be categorized as
+either standard, custom, or non-conforming. For this purpose, we divide
+each RISC-V instruction-set encoding space (and related encoding spaces
+such as the CSRs) into three disjoint categories: _standard_,
+_reserved_, and _custom_. Standard extensions and encodings are defined
+by the Foundation; any extensions not defined by the Foundation are
+_non-standard_. Each base ISA and its standard extensions use only
+standard encodings, and shall not conflict with each other in their uses
+of these encodings. Reserved encodings are currently not defined but are
+saved for future standard extensions; once thus used, they become
+standard encodings. Custom encodings shall never be used for standard
+extensions and are made available for vendor-specific non-standard
+extensions. Non-standard extensions are either custom extensions, that
+use only custom encodings, or _non-conforming_ extensions, that use any
+standard or reserved encoding. Instruction-set extensions are generally
+shared but may provide slightly different functionality depending on the
+base ISA. Chapter <<extensions>> describes various ways
+of extending the RISC-V ISA. We have also developed a naming convention
+for RISC-V base instructions and instruction-set extensions, described
+in detail in Chapter <<naming>>.
+
+To support more general software development, a set of standard
+extensions are defined to provide integer multiply/divide, atomic
+operations, and single and double-precision floating-point arithmetic.
+The base integer ISA is named `I` (prefixed by RV32 or RV64 depending
+on integer register width), and contains integer computational
+instructions, integer loads, integer stores, and control-flow
+instructions. The standard integer multiplication and division extension
+is named `M`, and adds instructions to multiply and divide values held
+in the integer registers. The standard atomic instruction extension,
+denoted by `A`, adds instructions that atomically read, modify, and
+write memory for inter-processor synchronization. The standard
+single-precision floating-point extension, denoted by `F`, adds
+floating-point registers, single-precision computational instructions,
+and single-precision loads and stores. The standard double-precision
+floating-point extension, denoted by `D`, expands the floating-point
+registers, and adds double-precision computational instructions, loads,
+and stores. The standard `C` compressed instruction extension provides
+narrower 16-bit forms of common instructions.
+
+Beyond the base integer ISA and the standard GC extensions, we believe
+it is rare that a new instruction will provide a significant benefit for
+all applications, although it may be very beneficial for a certain
+domain. As energy efficiency concerns are forcing greater
+specialization, we believe it is important to simplify the required
+portion of an ISA specification. Whereas other architectures usually
+treat their ISA as a single entity, which changes to a new version as
+instructions are added over time, RISC-V will endeavor to keep the base
+and each standard extension constant over time, and instead layer new
+instructions as further optional extensions. For example, the base
+integer ISAs will continue as fully supported standalone ISAs,
+regardless of any subsequent extensions.
+
+=== Memory
+
+A RISC-V hart has a single byte-addressable address space of
+latexmath:[$2^{\text{XLEN}}$] bytes for all memory accesses. A _word_ of
+memory is defined as 32 bits (4 bytes). Correspondingly, a _halfword_ is
+16 bits (2 bytes), a _doubleword_ is 64 bits (8 bytes), and a _quadword_
+is 128 bits (16 bytes). The memory address space is
+circular, so that the byte at address latexmath:[$2^{\text{XLEN}}-1$] is
+adjacent to the byte at address zero. Accordingly, memory address
+computations done by the hardware ignore overflow and instead wrap
+around modulo latexmath:[$2^{\text{XLEN}}$].
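As a non-normative illustration, this wraparound is exactly the behavior of unsigned modular arithmetic at XLEN width; a sketch for XLEN=32 in C (the function name is illustrative, not part of the specification):

```c
#include <stdint.h>

/* Effective-address computation for XLEN=32: unsigned 32-bit arithmetic
 * wraps modulo 2^32, so overflow in base+offset is simply ignored. */
static uint32_t effective_address_rv32(uint32_t base, int32_t offset)
{
    return base + (uint32_t)offset;   /* wraps modulo 2^32 */
}
```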
+
+The execution environment determines the mapping of hardware resources
+into a hart’s address space. Different address ranges of a hart’s
+address space may (1) be vacant, or (2) contain _main memory_, or
+(3) contain one or more _I/O devices_. Reads and writes of I/O devices
+may have visible side effects, but accesses to main memory cannot.
+Although it is possible for the execution environment to call everything
+in a hart’s address space an I/O device, it is usually expected that
+some portion will be specified as main memory.
+
+When a RISC-V platform has multiple harts, the address spaces of any two
+harts may be entirely the same, or entirely different, or may be partly
+different but sharing some subset of resources, mapped into the same or
+different address ranges.
+
+For a purely `bare metal` environment, all harts may see an identical
+address space, accessed entirely by physical addresses. However, when
+the execution environment includes an operating system employing address
+translation, it is common for each hart to be given a virtual address
+space that is largely or entirely its own.
+
+(((memory access, implicit and explicit)))
+Executing each RISC-V machine instruction entails one or more memory
+accesses, subdivided into _implicit_ and _explicit_ accesses. For each
+instruction executed, an _implicit_ memory read (instruction fetch) is
+done to obtain the encoded instruction to execute. Many RISC-V
+instructions perform no further memory accesses beyond instruction
+fetch. Specific load and store instructions perform an _explicit_ read
+or write of memory at an address determined by the instruction. The
+execution environment may dictate that instruction execution performs
+other _implicit_ memory accesses (such as to implement address
+translation) beyond those documented for the unprivileged ISA.
+
+The execution environment determines what portions of the non-vacant
+address space are accessible for each kind of memory access. For
+example, the set of locations that can be implicitly read for
+instruction fetch may or may not have any overlap with the set of
+locations that can be explicitly read by a load instruction; and the set
+of locations that can be explicitly written by a store instruction may
+be only a subset of locations that can be read. Ordinarily, if an
+instruction attempts to access memory at an inaccessible address, an
+exception is raised for the instruction. Vacant locations in the address
+space are never accessible.
+
+Except when specified otherwise, implicit reads that do not raise an
+exception and that have no side effects may occur arbitrarily early and
+speculatively, even before the machine could possibly prove that the
+read will be needed. For instance, a valid implementation could attempt
+to read all of main memory at the earliest opportunity, cache as many
+fetchable (executable) bytes as possible for later instruction fetches,
+and avoid reading main memory for instruction fetches ever again. To
+ensure that certain implicit reads are ordered only after writes to the
+same memory locations, software must execute specific fence or
+cache-control instructions defined for this purpose (such as the FENCE.I
+instruction defined in Chapter <<zifencei>>).
+
+(((memory access, implicit and explicit)))
+The memory accesses (implicit or explicit) made by a hart may appear to
+occur in a different order as perceived by another hart or by any other
+agent that can access the same memory. This perceived reordering of
+memory accesses is always constrained, however, by the applicable memory
+consistency model. The default memory consistency model for RISC-V is
+the RISC-V Weak Memory Ordering (RVWMO), defined in
+Chapter <<memorymodel>> and in appendices. Optionally,
+an implementation may adopt the stronger model of Total Store Ordering,
+as defined in Chapter <<ztso>>. The execution environment
+may also add constraints that further limit the perceived reordering of
+memory accesses. Since the RVWMO model is the weakest model allowed for
+any RISC-V implementation, software written for this model is compatible
+with the actual memory consistency rules of all RISC-V implementations.
+As with implicit reads, software must execute fence or cache-control
+instructions to ensure specific ordering of memory accesses beyond the
+requirements of the assumed memory consistency model and execution
+environment.
+
+=== Base Instruction-Length Encoding
+
+The base RISC-V ISA has fixed-length 32-bit instructions that must be
+naturally aligned on 32-bit boundaries. However, the standard RISC-V
+encoding scheme is designed to support ISA extensions with
+variable-length instructions, where each instruction can be any number
+of 16-bit instruction _parcels_ in length and parcels are naturally
+aligned on 16-bit boundaries. The standard compressed ISA extension
+described in Chapter <<compressed>> reduces code size by
+providing compressed 16-bit instructions and relaxes the alignment
+constraints to allow all instructions (16 bit and 32 bit) to be aligned
+on any 16-bit boundary to improve code density.
+
+We use the term IALIGN (measured in bits) to refer to the
+instruction-address alignment constraint the implementation enforces.
+IALIGN is 32 bits in the base ISA, but some ISA extensions, including
+the compressed ISA extension, relax IALIGN to 16 bits. IALIGN may not
+take on any value other than 16 or 32.
+
+(((ILEN)))
+We use the term ILEN (measured in bits) to refer to the maximum
+instruction length supported by an implementation, and which is always a
+multiple of IALIGN. For implementations supporting only a base
+instruction set, ILEN is 32 bits. Implementations supporting longer
+instructions have larger values of ILEN.
+
+<<instlengthcode>> illustrates the standard
+RISC-V instruction-length encoding convention. All the 32-bit
+instructions in the base ISA have their lowest two bits set to `11`. The
+optional compressed 16-bit instruction-set extensions have their lowest
+two bits equal to `00`, `01`, or `10`.
+
+==== Expanded Instruction-Length Encoding
+
+(((instruction length encoding)))
+A portion of the 32-bit instruction-encoding space has been tentatively
+allocated for instructions longer than 32 bits. The entirety of this
+space is reserved at this time, and the following proposal for encoding
+instructions longer than 32 bits is not considered frozen.
+
+Standard instruction-set extensions encoded with more than 32 bits have
+additional low-order bits set to `1`, with the conventions for 48-bit
+and 64-bit lengths shown in
+<<instlengthcode>>. Instruction lengths
+between 80 bits and 176 bits are encoded using a 3-bit field in bits
+[14:12] giving the number of 16-bit words in addition to the first
+5latexmath:[$\times$]16-bit words. The encoding with bits [14:12] set to
+`111` is reserved for future longer instruction encodings.
+
+[[instlengthcode]]
+.Instruction length code
+[cols="^,^,^,^,<",]
+|===
+| | | |`xxxxxxxxxxxxxxaa` |16-bit (`aa` latexmath:[$\neq$] `11`)
+
+| | |`xxxxxxxxxxxxxxxx` |`xxxxxxxxxxxbbb11` |32-bit (`bbb` latexmath:[$\neq$] `111`)
+
+| |latexmath:[$\cdot\cdot\cdot$]`xxxx` |`xxxxxxxxxxxxxxxx` |`xxxxxxxxxx011111` |48-bit
+
+| |latexmath:[$\cdot\cdot\cdot$]`xxxx` |`xxxxxxxxxxxxxxxx` |`xxxxxxxxx0111111` |64-bit
+
+| |latexmath:[$\cdot\cdot\cdot$]`xxxx` |`xxxxxxxxxxxxxxxx` |`xnnnxxxxx1111111` |(80+16*`nnn`)-bit, `nnn` latexmath:[$\neq$] `111`
+
+| |latexmath:[$\cdot\cdot\cdot$]`xxxx` |`xxxxxxxxxxxxxxxx` |`x111xxxxx1111111` |Reserved for latexmath:[$\geq$]192-bit instructions
+
+|Byte Address: |base+4 |base+2 |base |
+|===
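As a non-normative sketch, the length-encoding convention above can be decoded from the first 16-bit parcel alone. The function below (an illustrative helper, not part of the specification) returns the instruction length in bytes, reporting 0 for the reserved latexmath:[$\geq$]192-bit encodings:

```c
#include <stdint.h>

/* Decode the length in bytes of a RISC-V instruction from its first
 * 16-bit parcel, following the standard length-encoding convention.
 * Returns 0 for encodings reserved for >= 192-bit instructions. */
static int insn_length_bytes(uint16_t parcel)
{
    unsigned nnn;

    if ((parcel & 0x3) != 0x3)        /* aa != 11 */
        return 2;                     /* 16-bit */
    if ((parcel & 0x1c) != 0x1c)      /* bbb in bits [4:2] != 111 */
        return 4;                     /* 32-bit */
    if ((parcel & 0x3f) == 0x1f)      /* bits [5:0] = 011111 */
        return 6;                     /* 48-bit */
    if ((parcel & 0x7f) == 0x3f)      /* bits [6:0] = 0111111 */
        return 8;                     /* 64-bit */
    nnn = (parcel >> 12) & 0x7;       /* bits [6:0] = 1111111 */
    if (nnn != 0x7)
        return (80 + 16 * (int)nnn) / 8;  /* (80+16*nnn)-bit */
    return 0;                         /* reserved for >= 192 bits */
}
```

Note that only the low-order bits and bits [14:12] of the first parcel are consulted, which is why fixing the parcel order in memory (discussed later in this chapter) lets an instruction-fetch unit determine length quickly.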
+
+Given the code size and energy savings of a compressed format, we wanted
+to build in support for a compressed format to the ISA encoding scheme
+rather than adding this as an afterthought, but to allow simpler
+implementations we didn’t want to make the compressed format mandatory.
+We also wanted to optionally allow longer instructions to support
+experimentation and larger instruction-set extensions. Although our
+encoding convention required a tighter encoding of the core RISC-V ISA,
+this has several beneficial effects.
+(((IMAFD)))
+
+An implementation of the standard IMAFD ISA need only hold the
+most-significant 30 bits in instruction caches (a 6.25% saving). On
+instruction cache refills, any instructions encountered with either low
+bit clear should be recoded into illegal 30-bit instructions before
+storing in the cache to preserve illegal instruction exception behavior.
+
+Perhaps more importantly, by condensing our base ISA into a subset of
+the 32-bit instruction word, we leave more space available for
+non-standard and custom extensions. In particular, the base RV32I ISA
+uses less than 1/8 of the encoding space in the 32-bit instruction word.
+As described in Chapter <<extensions>>, an
+implementation that does not require support for the standard compressed
+instruction extension can map 3 additional non-conforming 30-bit
+instruction spaces into the 32-bit fixed-width format, while preserving
+support for standard latexmath:[$\geq$]32-bit instruction-set
+extensions. Further, if the implementation also does not need
+instructions latexmath:[$>$]32-bits in length, it can recover a further
+four major opcodes for non-conforming extensions.
+
+Encodings with bits [15:0] all zeros are defined as illegal
+instructions. These instructions are considered to be of minimal length:
+16 bits if any 16-bit instruction-set extension is present, otherwise 32
+bits. The encoding with bits [ILEN-1:0] all ones is also illegal; this
+instruction is considered to be ILEN bits long.
+
+We consider it a feature that any length of instruction containing all
+zero bits is not legal, as this quickly traps erroneous jumps into
+zeroed memory regions. Similarly, we also reserve the instruction
+encoding containing all ones to be an illegal instruction, to catch the
+other common pattern observed with unprogrammed non-volatile memory
+devices, disconnected memory buses, or broken memory devices.
+
+Software can rely on a naturally aligned 32-bit word containing zero to
+act as an illegal instruction on all RISC-V implementations, to be used
+by software where an illegal instruction is explicitly desired. Defining
+a corresponding known illegal value for all ones is more difficult due
+to the variable-length encoding. Software cannot generally use the
+illegal value of ILEN bits of all 1s, as software might not know ILEN
+for the eventual target machine (e.g., if software is compiled into a
+standard binary library used by many different machines). Defining a
+32-bit word of all ones as illegal was also considered, as all machines
+must support a 32-bit instruction size, but this requires the
+instruction-fetch unit on machines with ILENlatexmath:[$>$]32 to report an
+illegal instruction exception rather than an access-fault exception when
+such an instruction borders a protection boundary, complicating
+variable-instruction-length fetch and decode.
+(((endian, little and big)))
+
+RISC-V base ISAs have either little-endian or big-endian memory systems,
+with the privileged architecture further defining bi-endian operation.
+Instructions are stored in memory as a sequence of 16-bit little-endian
+parcels, regardless of memory system endianness. Parcels forming one
+instruction are stored at increasing halfword addresses, with the
+lowest-addressed parcel holding the lowest-numbered bits in the
+instruction specification.
+
+We originally chose little-endian byte ordering for the RISC-V memory
+system because little-endian systems are currently dominant commercially
+(all x86 systems; iOS, Android, and Windows for ARM). A minor point is
+that we have also found little-endian memory systems to be more natural
+for hardware designers. However, certain application areas, such as IP
+networking, operate on big-endian data structures, and certain legacy
+code bases have been built assuming big-endian processors, so we have
+defined big-endian and bi-endian variants of RISC-V.
+
+We have to fix the order in which instruction parcels are stored in
+memory, independent of memory system endianness, to ensure that the
+length-encoding bits always appear first in halfword address order. This
+allows the length of a variable-length instruction to be quickly
+determined by an instruction-fetch unit by examining only the first few
+bits of the first 16-bit instruction parcel.
+
+We further make the instruction parcels themselves little-endian to
+decouple the instruction encoding from the memory system endianness
+altogether. This design benefits both software tooling and bi-endian
+hardware. Otherwise, for instance, a RISC-V assembler or disassembler
+would always need to know the intended active endianness, despite that
+in bi-endian systems, the endianness mode might change dynamically
+during execution. In contrast, by giving instructions a fixed
+endianness, it is sometimes possible for carefully written software to
+be endianness-agnostic even in binary form, much like
+position-independent code.
+
+The choice to have instructions be only little-endian does have
+consequences, however, for RISC-V software that encodes or decodes
+machine instructions. Big-endian JIT compilers, for example, must swap
+the byte order when storing to instruction memory.
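For example (a non-normative sketch), a code generator running on a big-endian host stays correct by emitting each instruction byte by byte in parcel order, rather than storing whole words through its native byte order:

```c
#include <stdint.h>

/* Store a 32-bit RISC-V instruction as two 16-bit little-endian parcels,
 * one byte at a time, so the result is independent of the host's (or the
 * data memory system's) endianness. The lowest-addressed byte holds the
 * lowest-numbered bits of the instruction. */
static void store_insn32(uint8_t *mem, uint32_t insn)
{
    mem[0] = (uint8_t)(insn >>  0);
    mem[1] = (uint8_t)(insn >>  8);
    mem[2] = (uint8_t)(insn >> 16);
    mem[3] = (uint8_t)(insn >> 24);
}
```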
+
+Once we had decided to fix on a little-endian instruction encoding, this
+naturally led to placing the length-encoding bits in the LSB positions
+of the instruction format to avoid breaking up opcode fields.
+
+[[trap-defn]]
+=== Exceptions, Traps, and Interrupts
+(((exceptions)))
+(((traps)))
+(((interrupts)))
+
+We use the term _exception_ to refer to an unusual condition occurring
+at run time associated with an instruction in the current RISC-V hart.
+We use the term _interrupt_ to refer to an external asynchronous event
+that may cause a RISC-V hart to experience an unexpected transfer of
+control. We use the term _trap_ to refer to the transfer of control to a
+trap handler caused by either an exception or an interrupt.
+
+The instruction descriptions in following chapters describe conditions
+that can raise an exception during execution. The general behavior of
+most RISC-V EEIs is that a trap to some handler occurs when an exception
+is signaled on an instruction (except for floating-point exceptions,
+which, in the standard floating-point extensions, do not cause traps).
+The manner in which interrupts are generated, routed to, and enabled by
+a hart depends on the EEI.
+
+Our use of `exception` and `trap` is compatible with that in the
+IEEE-754 floating-point standard.
+
+How traps are handled and made visible to software running on the hart
+depends on the enclosing execution environment. From the perspective of
+software running inside an execution environment, traps encountered by a
+hart at runtime can have four different effects:
+
+Contained Trap:::
+ The trap is visible to, and handled by, software running inside the
+ execution environment. For example, in an EEI providing both
+ supervisor and user mode on harts, an ECALL by a user-mode hart will
+ generally result in a transfer of control to a supervisor-mode handler
+ running on the same hart. Similarly, in the same environment, when a
+ hart is interrupted, an interrupt handler will be run in supervisor
+ mode on the hart.
+Requested Trap:::
+ The trap is a synchronous exception that is an explicit call to the
+ execution environment requesting an action on behalf of software
+ inside the execution environment. An example is a system call. In this
+ case, execution may or may not resume on the hart after the requested
+ action is taken by the execution environment. For example, a system
+ call could remove the hart or cause an orderly termination of the
+ entire execution environment.
+Invisible Trap:::
+ The trap is handled transparently by the execution environment and
+ execution resumes normally after the trap is handled. Examples include
+ emulating missing instructions, handling non-resident page faults in a
+ demand-paged virtual-memory system, or handling device interrupts for
+ a different job in a multiprogrammed machine. In these cases, the
+ software running inside the execution environment is not aware of the
+ trap (we ignore timing effects in these definitions).
+Fatal Trap:::
+ The trap represents a fatal failure and causes the execution
+ environment to terminate execution. Examples include failing a
+ virtual-memory page-protection check or allowing a watchdog timer to
+ expire. Each EEI should define how execution is terminated and
+ reported to an external environment.
+
+<<trapcharacteristics>> shows the characteristics of each
+kind of trap. _Notes: 1) Termination may be requested. 2)
+Imprecise fatal traps might be observable by software._
+
+[[trapcharacteristics]]
+.Characteristics of traps
+[cols="<,^,^,^,^",options="header",]
+|===
+| |Contained |Requested |Invisible |Fatal
+|Execution terminates |No |Nolatexmath:[$^{1}$] |No |Yes
+|Software is oblivious |No |No |Yes |Yeslatexmath:[$^{2}$]
+|Handled by environment |No |Yes |Yes |Yes
+|===
+
+The EEI defines for each trap whether it is handled precisely, though
+the recommendation is to maintain preciseness where possible. Contained
+and requested traps can be observed to be imprecise by software inside
+the execution environment. Invisible traps, by definition, cannot be
+observed to be precise or imprecise by software running inside the
+execution environment. Fatal traps can be observed to be imprecise by
+software running inside the execution environment, if known-errorful
+instructions do not cause immediate termination.
+
+Because this document describes unprivileged instructions, traps are
+rarely mentioned. Architectural means to handle contained traps are
+defined in the privileged architecture manual, along with other features
+to support richer EEIs. Unprivileged instructions that are defined
+solely to cause requested traps are documented here. Invisible traps
+are, by their nature, out of scope for this document. Instruction
+encodings that are not defined here and not defined by some other means
+may cause a fatal trap.
+
+=== UNSPECIFIED Behaviors and Values
+(((unspecified, behaviors)))
+(((unspecified, values)))
+
+The architecture fully describes what implementations must do and any
+constraints on what they may do. In cases where the architecture
+intentionally does not constrain implementations, the term _unspecified_ is
+explicitly used.
+
+The term _unspecified_ refers to a behavior or value that is intentionally
+unconstrained. The definition of these behaviors or values is open to
+extensions, platform standards, or implementations. Extensions, platform
+standards, or implementation documentation may provide normative content
+to further constrain cases that the base architecture defines as _unspecified_.
+
+Like the base architecture, extensions should fully describe allowable
+behavior and values and use the term _unspecified_ for cases that are intentionally
+unconstrained. These cases may be constrained or defined by other
+extensions, platform standards, or implementations.
diff --git a/src/intro.adoc b/src/intro.adoc
new file mode 100644
index 0000000..d57a12a
--- /dev/null
+++ b/src/intro.adoc
@@ -0,0 +1,731 @@
+== Introduction
+
+RISC-V (pronounced `risk-five`) is a new instruction-set architecture
+(ISA) that was originally designed to support computer architecture
+research and education, but which we now hope will also become a
+standard free and open architecture for industry implementations. Our
+goals in defining RISC-V include:
+
+* A completely _open_ ISA that is freely available to academia and
+industry.
+* A _real_ ISA suitable for direct native hardware implementation, not
+just simulation or binary translation.
+* An ISA that avoids `over-architecting` for a particular
+microarchitecture style (e.g., microcoded, in-order, decoupled,
+out-of-order) or implementation technology (e.g., full-custom, ASIC,
+FPGA), but which allows efficient implementation in any of these.
+* An ISA separated into a _small_ base integer ISA, usable by itself as
+a base for customized accelerators or for educational purposes, and
+optional standard extensions, to support general-purpose software
+development.
+* Support for the 2008 revision of the IEEE-754 floating-point standard.
+* An ISA supporting extensive ISA extensions and specialized variants.
+* Both 32-bit and 64-bit address space variants for applications,
+operating system kernels, and hardware implementations.
+* An ISA with support for highly parallel multicore or manycore
+implementations, including heterogeneous multiprocessors.
+* Optional _variable-length instructions_ to both expand available
+instruction encoding space and to support an optional _dense instruction
+encoding_ for improved performance, static code size, and energy
+efficiency.
+* A fully virtualizable ISA to ease hypervisor development.
+* An ISA that simplifies experiments with new privileged architecture
+designs.
+
+[TIP]
+====
+Commentary on our design decisions is formatted as in this paragraph.
+This non-normative text can be skipped if the reader is only interested
+in the specification itself.
+====
+
+[NOTE]
+====
+The name RISC-V was chosen to represent the fifth major RISC ISA design
+from UC Berkeley (RISC-I cite:[riscI-isca1981], RISC-II cite:[Katevenis:1983], SOAR cite:[Ungar:1984], and SPUR cite:[spur-jsscc1989] were the first
+four). We also pun on the use of the Roman numeral `V` to signify
+`variations` and `vectors`, as support for a range of architecture
+research, including various data-parallel accelerators, is an explicit
+goal of the ISA design.
+====
+(((ISA, definition)))
+
+The RISC-V ISA is defined avoiding implementation details as much as
+possible (although commentary is included on implementation-driven
+decisions) and should be read as the software-visible interface to a
+wide variety of implementations rather than as the design of a
+particular hardware artifact. The RISC-V manual is structured in two
+volumes. This volume covers the design of the base _unprivileged_
+instructions, including optional unprivileged ISA extensions.
+Unprivileged instructions are those that are generally usable in all
+privilege modes in all privileged architectures, though behavior might
+vary depending on privilege mode and privileged architecture. The second
+volume provides the design of the first (`classic`) privileged
+architecture. The manuals use IEC 80000-13:2008 conventions, with a byte
+of 8 bits.
+
+[TIP]
+====
+In the unprivileged ISA design, we tried to remove any dependence on
+particular microarchitectural features, such as cache line size, or on
+privileged architecture details, such as page translation. This is both
+for simplicity and to allow maximum flexibility for alternative
+microarchitectures or alternative privileged architectures.
+====
+
+=== RISC-V Hardware Platform Terminology
+
+A RISC-V hardware platform can contain one or more RISC-V-compatible
+processing cores together with other non-RISC-V-compatible cores,
+fixed-function accelerators, various physical memory structures, I/O
+devices, and an interconnect structure to allow the components to
+communicate.
+(((core, component)))
+
+A component is termed a _core_ if it contains an independent instruction
+fetch unit. A RISC-V-compatible core might support multiple
+RISC-V-compatible hardware threads, or _harts_, through multithreading.
+(((core, extensions, coprocessor)))
+
+A RISC-V core might have additional specialized instruction-set
+extensions or an added _coprocessor_. We use the term _coprocessor_ to
+refer to a unit that is attached to a RISC-V core and is mostly
+sequenced by a RISC-V instruction stream, but which contains additional
+architectural state and instruction-set extensions, and possibly some
+limited autonomy relative to the primary RISC-V instruction stream.
+
+We use the term _accelerator_ to refer to either a non-programmable
+fixed-function unit or a core that can operate autonomously but is
+specialized for certain tasks. In RISC-V systems, we expect many
+programmable accelerators will be RISC-V-based cores with specialized
+instruction-set extensions and/or customized coprocessors. An important
+class of RISC-V accelerators are I/O accelerators, which offload I/O
+processing tasks from the main application cores.
+(((core, accelerator)))
+
+The system-level organization of a RISC-V hardware platform can range
+from a single-core microcontroller to a many-thousand-node cluster of
+shared-memory manycore server nodes. Even small systems-on-a-chip might
+be structured as a hierarchy of multicomputers and/or multiprocessors to
+modularize development effort or to provide secure isolation between
+subsystems.
+(((core, cluster, multiprocessors)))
+
+=== RISC-V Software Execution Environments and Harts
+
+The behavior of a RISC-V program depends on the execution environment in
+which it runs. A RISC-V execution environment interface (EEI) defines
+the initial state of the program, the number and type of harts in the
+environment including the privilege modes supported by the harts, the
+accessibility and attributes of memory and I/O regions, the behavior of
+all legal instructions executed on each hart (i.e., the ISA is one
+component of the EEI), and the handling of any interrupts or exceptions
+raised during execution including environment calls. Examples of EEIs
+include the Linux application binary interface (ABI), or the RISC-V
+supervisor binary interface (SBI). The implementation of a RISC-V
+execution environment can be pure hardware, pure software, or a
+combination of hardware and software. For example, opcode traps and
+software emulation can be used to implement functionality not provided
+in hardware. Examples of execution environment implementations include:
+
+* `Bare metal` hardware platforms where harts are directly implemented
+by physical processor threads and instructions have full access to the
+physical address space. The hardware platform defines an execution
+environment that begins at power-on reset.
+* RISC-V operating systems that provide multiple user-level execution
+environments by multiplexing user-level harts onto available physical
+processor threads and by controlling access to memory via virtual
+memory.
+* RISC-V hypervisors that provide multiple supervisor-level execution
+environments for guest operating systems.
+* RISC-V emulators, such as Spike, QEMU or rv8, which emulate RISC-V
+harts on an underlying x86 system, and which can provide either a
+user-level or a supervisor-level execution environment.
+
+[TIP]
+====
+A bare hardware platform can be considered to define an EEI, where the
+accessible harts, memory, and other devices populate the environment,
+and the initial state is that at power-on reset. Generally, most
+software is designed to use a more abstract interface to the hardware,
+as more abstract EEIs provide greater portability across different
+hardware platforms. Often EEIs are layered on top of one another, where
+one higher-level EEI uses another lower-level EEI.
+====
+(((hart, execution environment)))
+
+From the perspective of software running in a given execution
+environment, a hart is a resource that autonomously fetches and executes
+RISC-V instructions within that execution environment. In this respect,
+a hart behaves like a hardware thread resource even if time-multiplexed
+onto real hardware by the execution environment. Some EEIs support the
+creation and destruction of additional harts, for example, via
+environment calls to fork new harts.
+
+The execution environment is responsible for ensuring the eventual
+forward progress of each of its harts. For a given hart, that
+responsibility is suspended while the hart is exercising a mechanism
+that explicitly waits for an event, such as the wait-for-interrupt
+instruction defined in Volume II of this specification; and that
+responsibility ends if the hart is terminated. The following events
+constitute forward progress:
+
+* The retirement of an instruction.
+* A trap, as defined in <<trap-defn>>.
+* Any other event defined by an extension to constitute forward
+progress.
+
+[TIP]
+====
+The term hart was introduced in the work on Lithe cite:[lithe-pan-hotpar09] and cite:[lithe-pan-pldi10] to provide a term to
+represent an abstract execution resource as opposed to a software thread
+programming abstraction.
+
+The important distinction between a hardware thread (hart) and a
+software thread context is that the software running inside an execution
+environment is not responsible for causing progress of each of its
+harts; that is the responsibility of the outer execution environment. So
+the environment’s harts operate like hardware threads from the
+perspective of the software inside the execution environment.
+
+An execution environment implementation might time-multiplex a set of
+guest harts onto fewer host harts provided by its own execution
+environment but must do so in a way that guest harts operate like
+independent hardware threads. In particular, if there are more guest
+harts than host harts then the execution environment must be able to
+preempt the guest harts and must not wait indefinitely for guest
+software on a guest hart to "yield" control of the guest hart.
+====
+
+=== RISC-V ISA Overview
+
+A RISC-V ISA is defined as a base integer ISA, which must be present in
+any implementation, plus optional extensions to the base ISA. The base
+integer ISAs are very similar to that of the early RISC processors
+except with no branch delay slots and with support for optional
+variable-length instruction encodings. A base is carefully restricted to
+a minimal set of instructions sufficient to provide a reasonable target
+for compilers, assemblers, linkers, and operating systems (with
+additional privileged operations), and so provides a convenient ISA and
+software toolchain `skeleton` around which more customized processor
+ISAs can be built.
+
+Although it is convenient to speak of _the_ RISC-V ISA, RISC-V is
+actually a family of related ISAs, of which there are currently four
+base ISAs. Each base integer instruction set is characterized by the
+width of the integer registers and the corresponding size of the address
+space and by the number of integer registers. There are two primary base
+integer variants, RV32I and RV64I, described in
+<<rv32>> and <<rv64>>, which provide 32-bit
+or 64-bit address spaces respectively. We use the term XLEN to refer to
+the width of an integer register in bits (either 32 or 64).
+<<rv32e>> describes the RV32E subset variant of the
+RV32I base instruction set, which has been added to support small
+microcontrollers, and which has half the number of integer registers.
+<<rv128>> sketches a future RV128I variant of the
+base integer instruction set supporting a flat 128-bit address space
+(XLEN=128). The base integer instruction sets use a two’s-complement
+representation for signed integer values.
+
+Although 64-bit address spaces are a requirement for larger systems, we
+believe 32-bit address spaces will remain adequate for many embedded and
+client devices for decades to come and will be desirable to lower memory
+traffic and energy consumption. In addition, 32-bit address spaces are
+sufficient for educational purposes. A larger flat 128-bit address space
+might eventually be required, so we ensured this could be accommodated
+within the RISC-V ISA framework.
+
+The four base ISAs in RISC-V are treated as distinct base ISAs. A common
+question is why there is not a single ISA, and in particular, why
+RV32I is not a strict subset of RV64I. Some earlier ISA designs (SPARC,
+MIPS) adopted a strict superset policy when increasing address space
+size to support running existing 32-bit binaries on new 64-bit hardware.
+
+The main advantage of explicitly separating base ISAs is that each base
+ISA can be optimized for its needs without having to support all the
+operations needed for other base ISAs. For example, RV64I can omit
+instructions and CSRs that are only needed to cope with the narrower
+registers in RV32I. The RV32I variants can use encoding space otherwise
+reserved for instructions only required by wider address-space variants.
+
+The main disadvantage of not treating the design as a single ISA is that
+it complicates the hardware needed to emulate one base ISA on another
+(e.g., RV32I on RV64I). However, differences in addressing and illegal
+instruction traps generally mean some mode switch would be required in
+hardware in any case even with full superset instruction encodings, and
+the different RISC-V base ISAs are similar enough that supporting
+multiple versions is relatively low cost. Although some have proposed
+that the strict superset design would allow legacy 32-bit libraries to
+be linked with 64-bit code, this is impractical in practice, even with
+compatible encodings, due to the differences in software calling
+conventions and system-call interfaces.
+
+The RISC-V privileged architecture provides fields in `misa` to control
+the unprivileged ISA at each level to support emulating different base
+ISAs on the same hardware. We note that newer SPARC and MIPS ISA
+revisions have deprecated support for running 32-bit code unchanged on
+64-bit systems.
+
+A related question is why there are different encodings for 32-bit adds
+in RV32I (ADD) and RV64I (ADDW). The ADDW opcode could be used for
+32-bit adds in RV32I and ADDD for 64-bit adds in RV64I, instead of the
+existing design which uses the same opcode ADD for 32-bit adds in RV32I
+and 64-bit adds in RV64I with a different opcode ADDW for 32-bit adds in
+RV64I. This would also be more consistent with the use of the same LW
+opcode for 32-bit load in both RV32I and RV64I. The very first versions
+of RISC-V ISA did have a variant of this alternate design, but the
+RISC-V design was changed to the current choice in January 2011. Our
+focus was on supporting 32-bit integers in the 64-bit ISA not on
+providing compatibility with the 32-bit ISA, and the motivation was to
+remove the asymmetry that arose from having not all opcodes in RV32I
+have a *W suffix (e.g., ADDW, but AND not ANDW). In hindsight, this was
+perhaps not well-justified and a consequence of designing both ISAs at
+the same time as opposed to adding one later to sit on top of another,
+and also from a belief we had to fold platform requirements into the ISA
+spec which would imply that all the RV32I instructions would have been
+required in RV64I. It is too late to change the encoding now, but this
+is also of little practical consequence for the reasons stated above.
+
+It has been noted we could enable the *W variants as an extension to
+RV32I systems to provide a common encoding across RV64I and a future
+RV32 variant.
+
+RISC-V has been designed to support extensive customization and
+specialization. Each base integer ISA can be extended with one or more
+optional instruction-set extensions. An extension may be categorized as
+either standard, custom, or non-conforming. For this purpose, we divide
+each RISC-V instruction-set encoding space (and related encoding spaces
+such as the CSRs) into three disjoint categories: _standard_,
+_reserved_, and _custom_. Standard extensions and encodings are defined
+by the Foundation; any extensions not defined by the Foundation are
+_non-standard_. Each base ISA and its standard extensions use only
+standard encodings, and shall not conflict with each other in their uses
+of these encodings. Reserved encodings are currently not defined but are
+saved for future standard extensions; once thus used, they become
+standard encodings. Custom encodings shall never be used for standard
+extensions and are made available for vendor-specific non-standard
+extensions. Non-standard extensions are either custom extensions, that
+use only custom encodings, or _non-conforming_ extensions, that use any
+standard or reserved encoding. Instruction-set extensions are generally
+shared but may provide slightly different functionality depending on the
+base ISA. <<extensions>> describes various ways
+of extending the RISC-V ISA. We have also developed a naming convention
+for RISC-V base instructions and instruction-set extensions, described
+in detail in <<naming>>.
+
+To support more general software development, a set of standard
+extensions are defined to provide integer multiply/divide, atomic
+operations, and single and double-precision floating-point arithmetic.
+The base integer ISA is named `I` (prefixed by RV32 or RV64 depending
+on integer register width), and contains integer computational
+instructions, integer loads, integer stores, and control-flow
+instructions. The standard integer multiplication and division extension
+is named `M`, and adds instructions to multiply and divide values held
+in the integer registers. The standard atomic instruction extension,
+denoted by `A`, adds instructions that atomically read, modify, and
+write memory for inter-processor synchronization. The standard
+single-precision floating-point extension, denoted by `F`, adds
+floating-point registers, single-precision computational instructions,
+and single-precision loads and stores. The standard double-precision
+floating-point extension, denoted by `D`, expands the floating-point
+registers, and adds double-precision computational instructions, loads,
+and stores. The standard `C` compressed instruction extension provides
+narrower 16-bit forms of common instructions.
+
+Beyond the base integer ISA and the standard GC extensions, we believe
+it is rare that a new instruction will provide a significant benefit for
+all applications, although it may be very beneficial for a certain
+domain. As energy efficiency concerns are forcing greater
+specialization, we believe it is important to simplify the required
+portion of an ISA specification. Whereas other architectures usually
+treat their ISA as a single entity, which changes to a new version as
+instructions are added over time, RISC-V will endeavor to keep the base
+and each standard extension constant over time, and instead layer new
+instructions as further optional extensions. For example, the base
+integer ISAs will continue as fully supported standalone ISAs,
+regardless of any subsequent extensions.
+
+=== Memory
+
+A RISC-V hart has a single byte-addressable address space of
+latexmath:[$2^{\text{XLEN}}$] bytes for all memory accesses. A _word_ of
+memory is defined as 32 bits (4 bytes). Correspondingly, a _halfword_ is
+16 bits (2 bytes), a _doubleword_ is 64 bits (8 bytes), and a _quadword_
+is 128 bits (16 bytes). The memory address space is
+circular, so that the byte at address latexmath:[$2^{\text{XLEN}}-1$] is
+adjacent to the byte at address zero. Accordingly, memory address
+computations done by the hardware ignore overflow and instead wrap
+around modulo latexmath:[$2^{\text{XLEN}}$].
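As an illustrative, non-normative sketch, the modular address arithmetic described above can be modeled as follows. The function name and the XLEN=32 choice are ours, not part of the specification:

```python
# Illustrative sketch (not normative): RISC-V address computations ignore
# overflow and wrap modulo 2**XLEN.
XLEN = 32  # assumed register width for this example


def effective_address(base: int, offset: int) -> int:
    """Compute base+offset modulo 2**XLEN, as the hardware would."""
    return (base + offset) % (1 << XLEN)


# The byte at address 2**XLEN - 1 is adjacent to the byte at address 0:
print(effective_address(0xFFFFFFFF, 1))  # wraps to 0
```

Note that a negative offset wraps the same way, so `effective_address(0, -4)` yields the top of the address space.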
+
+The execution environment determines the mapping of hardware resources
+into a hart’s address space. Different address ranges of a hart’s
+address space may (1) be vacant, or (2) contain _main memory_, or
+(3) contain one or more _I/O devices_. Reads and writes of I/O devices
+may have visible side effects, but accesses to main memory cannot.
+Although it is possible for the execution environment to call everything
+in a hart’s address space an I/O device, it is usually expected that
+some portion will be specified as main memory.
+
+When a RISC-V platform has multiple harts, the address spaces of any two
+harts may be entirely the same, or entirely different, or may be partly
+different but sharing some subset of resources, mapped into the same or
+different address ranges.
+
+For a purely `bare metal` environment, all harts may see an identical
+address space, accessed entirely by physical addresses. However, when
+the execution environment includes an operating system employing address
+translation, it is common for each hart to be given a virtual address
+space that is largely or entirely its own.
+(((memory access, implicit and explicit)))
+
+Executing each RISC-V machine instruction entails one or more memory
+accesses, subdivided into _implicit_ and _explicit_ accesses. For each
+instruction executed, an _implicit_ memory read (instruction fetch) is
+done to obtain the encoded instruction to execute. Many RISC-V
+instructions perform no further memory accesses beyond instruction
+fetch. Specific load and store instructions perform an _explicit_ read
+or write of memory at an address determined by the instruction. The
+execution environment may dictate that instruction execution performs
+other _implicit_ memory accesses (such as to implement address
+translation) beyond those documented for the unprivileged ISA.
+
+The execution environment determines what portions of the non-vacant
+address space are accessible for each kind of memory access. For
+example, the set of locations that can be implicitly read for
+instruction fetch may or may not have any overlap with the set of
+locations that can be explicitly read by a load instruction; and the set
+of locations that can be explicitly written by a store instruction may
+be only a subset of locations that can be read. Ordinarily, if an
+instruction attempts to access memory at an inaccessible address, an
+exception is raised for the instruction. Vacant locations in the address
+space are never accessible.
+
+Except when specified otherwise, implicit reads that do not raise an
+exception and that have no side effects may occur arbitrarily early and
+speculatively, even before the machine could possibly prove that the
+read will be needed. For instance, a valid implementation could attempt
+to read all of main memory at the earliest opportunity, cache as many
+fetchable (executable) bytes as possible for later instruction fetches,
+and avoid reading main memory for instruction fetches ever again. To
+ensure that certain implicit reads are ordered only after writes to the
+same memory locations, software must execute specific fence or
+cache-control instructions defined for this purpose (such as the FENCE.I
+instruction defined in <<zifencei>>).
+(((memory access, implicit and explicit)))
+
+The memory accesses (implicit or explicit) made by a hart may appear to
+occur in a different order as perceived by another hart or by any other
+agent that can access the same memory. This perceived reordering of
+memory accesses is always constrained, however, by the applicable memory
+consistency model. The default memory consistency model for RISC-V is
+the RISC-V Weak Memory Ordering (RVWMO), defined in
+<<memorymodel>> and in appendices. Optionally,
+an implementation may adopt the stronger model of Total Store Ordering,
+as defined in <<ztso>>. The execution environment
+may also add constraints that further limit the perceived reordering of
+memory accesses. Since the RVWMO model is the weakest model allowed for
+any RISC-V implementation, software written for this model is compatible
+with the actual memory consistency rules of all RISC-V implementations.
+As with implicit reads, software must execute fence or cache-control
+instructions to ensure specific ordering of memory accesses beyond the
+requirements of the assumed memory consistency model and execution
+environment.
+
+=== Base Instruction-Length Encoding
+
+The base RISC-V ISA has fixed-length 32-bit instructions that must be
+naturally aligned on 32-bit boundaries. However, the standard RISC-V
+encoding scheme is designed to support ISA extensions with
+variable-length instructions, where each instruction can be any number
+of 16-bit instruction _parcels_ in length and parcels are naturally
+aligned on 16-bit boundaries. The standard compressed ISA extension
+described in <<compressed>> reduces code size by
+providing compressed 16-bit instructions and relaxes the alignment
+constraints to allow all instructions (16 bit and 32 bit) to be aligned
+on any 16-bit boundary to improve code density.
+
+We use the term IALIGN (measured in bits) to refer to the
+instruction-address alignment constraint the implementation enforces.
+IALIGN is 32 bits in the base ISA, but some ISA extensions, including
+the compressed ISA extension, relax IALIGN to 16 bits. IALIGN may not
+take on any value other than 16 or 32.
+(((ILEN)))
+
+We use the term ILEN (measured in bits) to refer to the maximum
+instruction length supported by an implementation, and which is always a
+multiple of IALIGN. For implementations supporting only a base
+instruction set, ILEN is 32 bits. Implementations supporting longer
+instructions have larger values of ILEN.
+
+<<instlengthcode>> illustrates the standard
+RISC-V instruction-length encoding convention. All the 32-bit
+instructions in the base ISA have their lowest two bits set to `11`. The
+optional compressed 16-bit instruction-set extensions have their lowest
+two bits equal to `00`, `01`, or `10`.
+
+==== Expanded Instruction-Length Encoding
+(((instruction length encoding)))
+
+A portion of the 32-bit instruction-encoding space has been tentatively
+allocated for instructions longer than 32 bits. The entirety of this
+space is reserved at this time, and the following proposal for encoding
+instructions longer than 32 bits is not considered frozen.
+
+Standard instruction-set extensions encoded with more than 32 bits have
+additional low-order bits set to `1`, with the conventions for 48-bit
+and 64-bit lengths shown in
+<<instlengthcode>>. Instruction lengths
+between 80 bits and 176 bits are encoded using a 3-bit field in bits
+[14:12] giving the number of 16-bit words in addition to the first
+5 latexmath:[$\times$] 16-bit words. The encoding with bits [14:12] set to
+`111` is reserved for future longer instruction encodings.
+
+[[instlengthcode]]
+[cols="^,^,^,^,<",]
+|===
+| | | |`xxxxxxxxxxxxxxaa` |16-bit (`aa` latexmath:[$\neq$]`11`)
+
+| | | | |
+
+| | |`xxxxxxxxxxxxxxxx` |`xxxxxxxxxxxbbb11` |32-bit (`bbb`
+latexmath:[$\neq$]`111`)
+
+| | | | |
+
+| |latexmath:[$\cdot\cdot\cdot$]`xxxx` |`xxxxxxxxxxxxxxxx`
+|`xxxxxxxxxx011111` |48-bit
+
+| | | | |
+
+| |latexmath:[$\cdot\cdot\cdot$]`xxxx` |`xxxxxxxxxxxxxxxx`
+|`xxxxxxxxx0111111` |64-bit
+
+| | | | |
+
+| |latexmath:[$\cdot\cdot\cdot$]`xxxx` |`xxxxxxxxxxxxxxxx`
+|`xnnnxxxxx1111111` |(80+16*`nnn`)-bit, `nnn` latexmath:[$\neq$] `111`
+
+| | | | |
+
+| |latexmath:[$\cdot\cdot\cdot$]`xxxx` |`xxxxxxxxxxxxxxxx`
+|`x111xxxxx1111111` |Reserved for latexmath:[$\geq$]192-bits
+
+| | | | |
+
+|Byte Address: |base+4 |base+2 |base |
+|===
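The length-encoding convention above can be sketched as a simple decoder over the first 16-bit parcel of an instruction. This is an illustrative, non-normative model (the encodings beyond 32 bits are, as noted, not frozen), and the function name is our own:

```python
# Illustrative sketch of the standard instruction-length encoding,
# decoding the length from the first (lowest-addressed) 16-bit parcel.
def instruction_length_bits(parcel: int) -> int:
    """Return the instruction length in bits implied by the first parcel."""
    if parcel & 0b11 != 0b11:
        return 16                    # lowest two bits != 11: compressed
    if parcel & 0b11111 != 0b11111:
        return 32                    # bits [4:2] != 111: 32-bit
    if parcel & 0b111111 == 0b011111:
        return 48                    # ...011111
    if parcel & 0b1111111 == 0b0111111:
        return 64                    # ...0111111
    nnn = (parcel >> 12) & 0b111     # bits [14:12]
    if nnn != 0b111:
        return 80 + 16 * nnn         # (80+16*nnn)-bit encoding
    raise ValueError("reserved encoding for >=192-bit instructions")
```

For example, the first parcel of a base-ISA instruction such as ADDI (`0x0013`) yields 32, while any parcel with low bits `00`, `01`, or `10` yields 16.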
+
+Given the code size and energy savings of a compressed format, we wanted
+to build in support for a compressed format to the ISA encoding scheme
+rather than adding this as an afterthought, but to allow simpler
+implementations we didn’t want to make the compressed format mandatory.
+We also wanted to optionally allow longer instructions to support
+experimentation and larger instruction-set extensions. Although our
+encoding convention required a tighter encoding of the core RISC-V ISA,
+this has several beneficial effects.
+(((IMAFD)))
+
+An implementation of the standard IMAFD ISA need only hold the
+most-significant 30 bits in instruction caches (a 6.25% saving). On
+instruction cache refills, any instructions encountered with either low
+bit clear should be recoded into illegal 30-bit instructions before
+storing in the cache to preserve illegal instruction exception behavior.
+
+Perhaps more importantly, by condensing our base ISA into a subset of
+the 32-bit instruction word, we leave more space available for
+non-standard and custom extensions. In particular, the base RV32I ISA
+uses less than 1/8 of the encoding space in the 32-bit instruction word.
+As described in <<extensions>>, an
+implementation that does not require support for the standard compressed
+instruction extension can map 3 additional non-conforming 30-bit
+instruction spaces into the 32-bit fixed-width format, while preserving
+support for standard latexmath:[$\geq$]32-bit instruction-set
+extensions. Further, if the implementation also does not need
+instructions latexmath:[$>$]32-bits in length, it can recover a further
+four major opcodes for non-conforming extensions.
+
+Encodings with bits [15:0] all zeros are defined as illegal
+instructions. These instructions are considered to be of minimal length:
+16 bits if any 16-bit instruction-set extension is present, otherwise 32
+bits. The encoding with bits [ILEN-1:0] all ones is also illegal; this
+instruction is considered to be ILEN bits long.
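As a non-normative illustration of the two guaranteed-illegal patterns just described, the check below models them directly; the function name and parameterization are ours:

```python
# Illustrative check (not normative) for the architecturally illegal
# encodings: bits [15:0] all zeros, or bits [ILEN-1:0] all ones.
def is_defined_illegal(inst: int, length_bits: int, ilen: int) -> bool:
    """True if the encoding is one of the two always-illegal patterns."""
    if inst & 0xFFFF == 0:
        return True                       # low 16 bits all zero
    if length_bits == ilen and inst == (1 << ilen) - 1:
        return True                       # all ILEN bits set
    return False
```

A 16-bit parcel of all ones is thus not guaranteed illegal on a machine with ILEN = 32, which is exactly why software cannot portably use the all-ones pattern.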
+
+We consider it a feature that any length of instruction containing all
+zero bits is not legal, as this quickly traps erroneous jumps into
+zeroed memory regions. Similarly, we also reserve the instruction
+encoding containing all ones to be an illegal instruction, to catch the
+other common pattern observed with unprogrammed non-volatile memory
+devices, disconnected memory buses, or broken memory devices.
+
+Software can rely on a naturally aligned 32-bit word containing zero to
+act as an illegal instruction on all RISC-V implementations, to be used
+by software where an illegal instruction is explicitly desired. Defining
+a corresponding known illegal value for all ones is more difficult due
+to the variable-length encoding. Software cannot generally use the
+illegal value of ILEN bits of all 1s, as software might not know ILEN
+for the eventual target machine (e.g., if software is compiled into a
+standard binary library used by many different machines). Defining a
+32-bit word of all ones as illegal was also considered, as all machines
+must support a 32-bit instruction size, but this requires the
+instruction-fetch unit on machines with ILEN latexmath:[$>$]32 to report an
+illegal instruction exception rather than an access-fault exception when
+such an instruction borders a protection boundary, complicating
+variable-instruction-length fetch and decode.
+(((endian, little and big)))
+
+RISC-V base ISAs have either little-endian or big-endian memory systems,
+with the privileged architecture further defining bi-endian operation.
+Instructions are stored in memory as a sequence of 16-bit little-endian
+parcels, regardless of memory system endianness. Parcels forming one
+instruction are stored at increasing halfword addresses, with the
+lowest-addressed parcel holding the lowest-numbered bits in the
+instruction specification.
+(((bi-endian)))
+(((endian, bi-)))
+
+We originally chose little-endian byte ordering for the RISC-V memory
+system because little-endian systems are currently dominant commercially
+(all x86 systems; iOS, Android, and Windows for ARM). A minor point is
+that we have also found little-endian memory systems to be more natural
+for hardware designers. However, certain application areas, such as IP
+networking, operate on big-endian data structures, and certain legacy
+code bases have been built assuming big-endian processors, so we have
+defined big-endian and bi-endian variants of RISC-V.
+
+We have to fix the order in which instruction parcels are stored in
+memory, independent of memory system endianness, to ensure that the
+length-encoding bits always appear first in halfword address order. This
+allows the length of a variable-length instruction to be quickly
+determined by an instruction-fetch unit by examining only the first few
+bits of the first 16-bit instruction parcel.
+
+We further make the instruction parcels themselves little-endian to
+decouple the instruction encoding from the memory system endianness
+altogether. This design benefits both software tooling and bi-endian
+hardware. Otherwise, for instance, a RISC-V assembler or disassembler
+would always need to know the intended active endianness, despite that
+in bi-endian systems, the endianness mode might change dynamically
+during execution. In contrast, by giving instructions a fixed
+endianness, it is sometimes possible for carefully written software to
+be endianness-agnostic even in binary form, much like
+position-independent code.
+
+The choice to have instructions be only little-endian does have
+consequences, however, for RISC-V software that encodes or decodes
+machine instructions. Big-endian JIT compilers, for example, must swap
+the byte order when storing to instruction memory.
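The parcel-ordering rule can be illustrated with a small, non-normative encoder; `struct.pack("<H", ...)` emits one little-endian 16-bit parcel, regardless of the endianness of the host running this sketch:

```python
import struct

# Illustrative sketch (not normative): instructions are stored as a
# sequence of little-endian 16-bit parcels at increasing halfword
# addresses, regardless of the memory system's data endianness.
def encode_instruction(inst: int, length_bits: int) -> bytes:
    """Split an instruction into 16-bit parcels, lowest-numbered bits
    first, each parcel stored little-endian."""
    parcels = [(inst >> s) & 0xFFFF for s in range(0, length_bits, 16)]
    return b"".join(struct.pack("<H", p) for p in parcels)


# A big-endian JIT holding the 32-bit encoding 0x00000013 (an ADDI NOP)
# must still emit the byte sequence 13 00 00 00:
assert encode_instruction(0x00000013, 32) == b"\x13\x00\x00\x00"
```

This is the byte swap the preceding paragraph refers to: on a big-endian host, storing the instruction word with a plain word store would produce `00 00 00 13` instead.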
+
+Once we had decided to fix on a little-endian instruction encoding, this
+naturally led to placing the length-encoding bits in the LSB positions
+of the instruction format to avoid breaking up opcode fields.
+
+[[trap-defn]]
+=== Exceptions, Traps, and Interrupts
+(((exceptions)))
+(((traps)))
+(((interrupts)))
+
+We use the term _exception_ to refer to an unusual condition occurring
+at run time associated with an instruction in the current RISC-V hart.
+We use the term _interrupt_ to refer to an external asynchronous event
+that may cause a RISC-V hart to experience an unexpected transfer of
+control. We use the term _trap_ to refer to the transfer of control to a
+trap handler caused by either an exception or an interrupt.
+
+The instruction descriptions in following chapters describe conditions
+that can raise an exception during execution. The general behavior of
+most RISC-V EEIs is that a trap to some handler occurs when an exception
+is signaled on an instruction (except for floating-point exceptions,
+which, in the standard floating-point extensions, do not cause traps).
+The manner in which interrupts are generated, routed to, and enabled by
+a hart depends on the EEI.
+
+Our use of `exception` and `trap` is compatible with that in the
+IEEE-754 floating-point standard.
+
+How traps are handled and made visible to software running on the hart
+depends on the enclosing execution environment. From the perspective of
+software running inside an execution environment, traps encountered by a
+hart at runtime can have four different effects:
+
+Contained Trap:::
+ The trap is visible to, and handled by, software running inside the
+ execution environment. For example, in an EEI providing both
+ supervisor and user mode on harts, an ECALL by a user-mode hart will
+ generally result in a transfer of control to a supervisor-mode handler
+ running on the same hart. Similarly, in the same environment, when a
+ hart is interrupted, an interrupt handler will be run in supervisor
+ mode on the hart.
+Requested Trap:::
+ The trap is a synchronous exception that is an explicit call to the
+ execution environment requesting an action on behalf of software
+ inside the execution environment. An example is a system call. In this
+ case, execution may or may not resume on the hart after the requested
+ action is taken by the execution environment. For example, a system
+ call could remove the hart or cause an orderly termination of the
+ entire execution environment.
+Invisible Trap:::
+ The trap is handled transparently by the execution environment and
+ execution resumes normally after the trap is handled. Examples include
+ emulating missing instructions, handling non-resident page faults in a
+ demand-paged virtual-memory system, or handling device interrupts for
+ a different job in a multiprogrammed machine. In these cases, the
+ software running inside the execution environment is not aware of the
+ trap (we ignore timing effects in these definitions).
+Fatal Trap:::
+ The trap represents a fatal failure and causes the execution
+ environment to terminate execution. Examples include failing a
+ virtual-memory page-protection check or allowing a watchdog timer to
+ expire. Each EEI should define how execution is terminated and
+ reported to an external environment.
+
+<<trapcharacteristics>> shows the characteristics of each
+kind of trap.
+
+[[trapcharacteristics]]
+.Characteristics of traps; 1) Termination may be requested. 2)
+Imprecise fatal traps might be observable by software.
+[cols="<,^,^,^,^",options="header",]
+|===
+| |Contained |Requested |Invisible |Fatal
+|Execution terminates |No |Nolatexmath:[$^{1}$] |No |Yes
+|Software is oblivious |No |No |Yes |Yeslatexmath:[$^{2}$]
+|Handled by environment |No |Yes |Yes |Yes
+|===
+
+The EEI defines for each trap whether it is handled precisely, though
+the recommendation is to maintain preciseness where possible. Contained
+and requested traps can be observed to be imprecise by software inside
+the execution environment. Invisible traps, by definition, cannot be
+observed to be precise or imprecise by software running inside the
+execution environment. Fatal traps can be observed to be imprecise by
+software running inside the execution environment, if known-errorful
+instructions do not cause immediate termination.
+
+Because this document describes unprivileged instructions, traps are
+rarely mentioned. Architectural means to handle contained traps are
+defined in the privileged architecture manual, along with other features
+to support richer EEIs. Unprivileged instructions that are defined
+solely to cause requested traps are documented here. Invisible traps
+are, by their nature, out of scope for this document. Instruction
+encodings that are not defined here and not defined by some other means
+may cause a fatal trap.
+
+=== UNSPECIFIED Behaviors and Values
+(((unspecified, behaviors)))
+(((unspecified, values)))
+
+The architecture fully describes what implementations must do and any
+constraints on what they may do. In cases where the architecture
+intentionally does not constrain implementations, the term _unspecified_
+is explicitly used.
+
+The term _unspecified_ refers to a behavior or value that is intentionally
+unconstrained. The definition of these behaviors or values is open to
+extensions, platform standards, or implementations. Extensions, platform
+standards, or implementation documentation may provide normative content
+to further constrain cases that the base architecture defines as
+_unspecified_.
+
+Like the base architecture, extensions should fully describe allowable
+behavior and values and use the term _unspecified_ for cases that are intentionally
+unconstrained. These cases may be constrained or defined by other
+extensions, platform standards, or implementations.
diff --git a/src/j-st-ext.adoc b/src/j-st-ext.adoc
new file mode 100644
index 0000000..0b5358c
--- /dev/null
+++ b/src/j-st-ext.adoc
@@ -0,0 +1,10 @@
+[[j-extendj]]
+== `J` Standard Extension for Dynamically Translated Languages, Version 0.0
+
+This chapter is a placeholder for a future standard extension to support
+dynamically translated languages.
+
+Many popular languages are usually implemented via dynamic translation,
+including Java and Javascript. These languages can benefit from
+additional ISA support for dynamic checks and garbage collection.
+
diff --git a/src/m-st-ext.adoc b/src/m-st-ext.adoc
new file mode 100644
index 0000000..6b2593e
--- /dev/null
+++ b/src/m-st-ext.adoc
@@ -0,0 +1,159 @@
+[[mstandard]]
+== `M` Standard Extension for Integer Multiplication and Division, Version 2.0
+
+This chapter describes the standard integer multiplication and division
+instruction extension, which is named `M` and contains instructions
+that multiply or divide values held in two integer registers.
+
+[TIP]
+====
+We separate integer multiply and divide out from the base to simplify
+low-end implementations, or for applications where integer multiply and
+divide operations are either infrequent or better handled in attached
+accelerators.
+====
+
+=== Multiplication Operations
+
+include::images/wavedrom/m-st-ext-for-int-mult.adoc[]
+[[m-st-ext-for-int-mult]]
+.Multiplication operation instructions
+image::image_placeholder.png[]
+(((MUL, MULH)))
+(((MUL, MULHU)))
+(((MUL, MULHSU)))
+
+MUL performs an XLEN-bitlatexmath:[$\times$]XLEN-bit multiplication of
+_rs1_ by _rs2_ and places the lower XLEN bits in the destination
+register. MULH, MULHU, and MULHSU perform the same multiplication but
+return the upper XLEN bits of the full 2latexmath:[$\times$]XLEN-bit
+product, for signedlatexmath:[$\times$]signed,
+unsignedlatexmath:[$\times$]unsigned, and signedlatexmath:[$\times$]unsigned
+multiplication, respectively. If both the high and low bits of the same
+product are required, then the recommended code sequence is: MULH[[S]U]
+_rdh, rs1, rs2_; MUL _rdl, rs1, rs2_ (source register specifiers must be
+in the same order and _rdh_ cannot be the same as _rs1_ or _rs2_).
+Microarchitectures can then fuse these into a single multiply operation
+instead of performing two separate multiplies.
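+As a non-normative illustration, the multiply semantics above can be
+sketched in Python for XLEN=32. The function names and explicit masking
+are ours, not part of the ISA; hardware operates on register values
+directly.
+
```python
XLEN = 32
MASK = (1 << XLEN) - 1

def to_signed(x):
    # interpret an XLEN-bit value as a two's-complement integer
    return x - (1 << XLEN) if x & (1 << (XLEN - 1)) else x

def mul(rs1, rs2):
    # lower XLEN bits of the 2*XLEN-bit product (same for any signedness)
    return (rs1 * rs2) & MASK

def mulh(rs1, rs2):
    # upper XLEN bits of the signed x signed product
    return ((to_signed(rs1) * to_signed(rs2)) >> XLEN) & MASK

def mulhu(rs1, rs2):
    # upper XLEN bits of the unsigned x unsigned product
    return ((rs1 * rs2) >> XLEN) & MASK

def mulhsu(rs1, rs2):
    # upper XLEN bits of the signed(rs1) x unsigned(rs2) product
    return ((to_signed(rs1) * rs2) >> XLEN) & MASK
```
+
+Note that the low half returned by `mul` is identical regardless of how
+the operands' signedness is interpreted; only the high half differs.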
+
+[NOTE]
+====
+MULHSU is used in multi-word signed multiplication to multiply the
+most-significant word of the multiplicand (which contains the sign bit)
+with the less-significant words of the multiplier (which are unsigned).
+====
+
+MULW is an RV64 instruction that multiplies the lower 32 bits of the
+source registers, placing the sign-extension of the lower 32 bits of the
+result into the destination register.
+
+[NOTE]
+====
+In RV64, MUL can be used to obtain the upper 32 bits of the 64-bit
+product, but signed arguments must be proper 32-bit signed values,
+whereas unsigned arguments must have their upper 32 bits clear. If the
+arguments are not known to be sign- or zero-extended, an alternative is
+to shift both arguments left by 32 bits, then use MULH[[S]U].
+====
+
+=== Division Operations
+
+include::images/wavedrom/division-op.adoc[]
+[[division-op]]
+.Division operation instructions
+image::image_placeholder.png[]
+(((MUL, DIV)))
+(((MUL, DIVU)))
+
+DIV and DIVU perform an XLEN-bit by XLEN-bit signed and unsigned
+integer division of _rs1_ by _rs2_, rounding towards zero. REM and REMU
+provide the remainder of the corresponding division operation. For REM,
+the sign of the result equals the sign of the dividend.
+
+[NOTE]
+====
+For both signed and unsigned division, it holds that
+latexmath:[$\textrm{dividend} = \textrm{divisor} \times \textrm{quotient} + \textrm{remainder}$].
+====
+
+If both the quotient and remainder are required from the same division,
+the recommended code sequence is: DIV[U] _rdq, rs1, rs2_; REM[U] _rdr,
+rs1, rs2_ (_rdq_ cannot be the same as _rs1_ or _rs2_).
+Microarchitectures can then fuse these into a single divide operation
+instead of performing two separate divides.
+
+DIVW and DIVUW are RV64 instructions that divide the lower 32 bits of
+_rs1_ by the lower 32 bits of _rs2_, treating them as signed and
+unsigned integers respectively, placing the 32-bit quotient in _rd_,
+sign-extended to 64 bits. REMW and REMUW are RV64 instructions that
+provide the corresponding signed and unsigned remainder operations
+respectively. Both REMW and REMUW always sign-extend the 32-bit result
+to 64 bits, including on a divide by zero.
+(((MUL, div by zero)))
+
+The semantics for division by zero and division overflow are summarized
+in <<divby0>>. The quotient of division by zero has all bits
+set, and the remainder of division by zero equals the dividend. Signed
+division overflow occurs only when the most-negative integer is divided
+by latexmath:[$-1$]. The quotient of a signed division with overflow is
+equal to the dividend, and the remainder is zero. Unsigned division
+overflow cannot occur.
+
+[[divby0]]
+.Semantics for division by zero and division overflow.
+[cols="<,^,^,^,^,^,^",options="header",]
+|===
+|Condition |Dividend |Divisor |DIVU[W] |REMU[W] |DIV[W] |REM[W]
+|Division by zero |latexmath:[$x$] |0 |latexmath:[$2^{L}-1$]
+|latexmath:[$x$] |latexmath:[$-1$] |latexmath:[$x$]
+
+|Overflow (signed only) |latexmath:[$-2^{L-1}$] |latexmath:[$-1$] |– |–
+|latexmath:[$-2^{L-1}$] |0
+|===
+
+In <<divby0>>, L is the width of the operation in bits: XLEN for DIV[U] and REM[U], or 32 for DIV[U]W and REM[U]W.
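+These semantics, including the division-by-zero and overflow rows of the
+table, can be sketched in Python for XLEN=32 (non-normative; names are
+ours). Note that RISC-V quotients round toward zero, whereas Python's
+`//` rounds toward negative infinity, so the signed quotient is computed
+on magnitudes.
+
```python
XLEN = 32
MASK = (1 << XLEN) - 1
MIN_SIGNED = -(1 << (XLEN - 1))

def to_signed(x):
    return x - (1 << XLEN) if x & (1 << (XLEN - 1)) else x

def _quot_toward_zero(a, b):
    q = abs(a) // abs(b)
    return -q if (a < 0) != (b < 0) else q

def divu(rs1, rs2):
    return MASK if rs2 == 0 else rs1 // rs2      # div by zero: all bits set

def remu(rs1, rs2):
    return rs1 if rs2 == 0 else rs1 % rs2        # div by zero: dividend

def div(rs1, rs2):
    a, b = to_signed(rs1), to_signed(rs2)
    if b == 0:
        return MASK                              # all bits set (i.e., -1)
    if a == MIN_SIGNED and b == -1:
        return a & MASK                          # overflow: quotient = dividend
    return _quot_toward_zero(a, b) & MASK

def rem(rs1, rs2):
    a, b = to_signed(rs1), to_signed(rs2)
    if b == 0:
        return rs1                               # remainder = dividend
    if a == MIN_SIGNED and b == -1:
        return 0                                 # overflow: remainder = 0
    return (a - _quot_toward_zero(a, b) * b) & MASK
```
+
+With these definitions the sign of a nonzero `rem` result always matches
+the sign of the dividend, as the text requires.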
+
+[TIP]
+====
+We considered raising exceptions on integer divide by zero, with these
+exceptions causing a trap in most execution environments. However, this
+would be the only arithmetic trap in the standard ISA (floating-point
+exceptions set flags and write default values, but do not cause traps)
+and would require language implementors to interact with the execution
+environment’s trap handlers for this case. Further, where language
+standards mandate that a divide-by-zero exception must cause an
+immediate control flow change, only a single branch instruction needs to
+be added to each divide operation, and this branch instruction can be
+inserted after the divide and should normally be very predictably not
+taken, adding little runtime overhead.
+
+The value of all bits set is returned for both unsigned and signed
+divide by zero to simplify the divider circuitry. The value of all 1s is
+both the natural value to return for unsigned divide, representing the
+largest unsigned number, and also the natural result for simple unsigned
+divider implementations. Signed division is often implemented using an
+unsigned division circuit and specifying the same overflow result
+simplifies the hardware.
+====
+
+=== Zmmul Extension, Version 0.1
+
+The Zmmul extension implements the multiplication subset of the M
+extension. It adds all of the instructions defined in
+<<Multiplication Operations>>, namely: MUL, MULH, MULHU,
+MULHSU, and (for RV64 only) MULW. The encodings are identical to those
+of the corresponding M-extension instructions.
+(((MUL, Zmmul)))
+
+[NOTE]
+====
+The Zmmul extension enables low-cost implementations that require
+multiplication operations but not division. For many microcontroller
+applications, division operations are too infrequent to justify the cost
+of divider hardware. By contrast, multiplication operations are more
+frequent, making the cost of multiplier hardware more justifiable.
+Simple FPGA soft cores particularly benefit from eliminating division
+but retaining multiplication, since many FPGAs provide hardwired
+multipliers but require dividers be implemented in soft logic.
+====
+
diff --git a/src/mm-alloy.adoc b/src/mm-alloy.adoc
new file mode 100644
index 0000000..93c0c41
--- /dev/null
+++ b/src/mm-alloy.adoc
@@ -0,0 +1,239 @@
+[[sec:alloy]]
+== Formal Axiomatic Specification in Alloy
+
+We present a formal specification of the RVWMO memory model in Alloy
+(http://alloy.mit.edu). This model is available online at
+https://github.com/daniellustig/riscv-memory-model.
+
+The online material also contains some litmus tests and some examples of
+how Alloy can be used to model check some of the mappings in
+Section <<sec:memory:porting>>.
+
+
+....
+////////////////////////////////////////////////////////////////////////////////
+// =RVWMO PPO=
+
+// Preserved Program Order
+fun ppo : Event->Event {
+ // same-address ordering
+ po_loc :> Store
+ + rdw
+ + (AMO + StoreConditional) <: rfi
+
+ // explicit synchronization
+ + ppo_fence
+ + Acquire <: ^po :> MemoryEvent
+ + MemoryEvent <: ^po :> Release
+ + RCsc <: ^po :> RCsc
+ + pair
+
+ // syntactic dependencies
+ + addrdep
+ + datadep
+ + ctrldep :> Store
+
+ // pipeline dependencies
+ + (addrdep+datadep).rfi
+ + addrdep.^po :> Store
+}
+
+// the global memory order respects preserved program order
+fact { ppo in ^gmo }
+....
+
+
+....
+////////////////////////////////////////////////////////////////////////////////
+// =RVWMO axioms=
+
+// Load Value Axiom
+fun candidates[r: MemoryEvent] : set MemoryEvent {
+ (r.~^gmo & Store & same_addr[r]) // writes preceding r in gmo
+ + (r.^~po & Store & same_addr[r]) // writes preceding r in po
+}
+
+fun latest_among[s: set Event] : Event { s - s.~^gmo }
+
+pred LoadValue {
+ all w: Store | all r: Load |
+ w->r in rf <=> w = latest_among[candidates[r]]
+}
+
+// Atomicity Axiom
+pred Atomicity {
+ all r: Store.~pair | // starting from the lr,
+ no x: Store & same_addr[r] | // there is no store x to the same addr
+ x not in same_hart[r] // such that x is from a different hart,
+ and x in r.~rf.^gmo // x follows (the store r reads from) in gmo,
+ and r.pair in x.^gmo // and r follows x in gmo
+}
+
+// Progress Axiom implicit: Alloy only considers finite executions
+
+pred RISCV_mm { LoadValue and Atomicity /* and Progress */ }
+....
+
+
+....
+////////////////////////////////////////////////////////////////////////////////
+// Basic model of memory
+
+sig Hart { // hardware thread
+ start : one Event
+}
+sig Address {}
+abstract sig Event {
+ po: lone Event // program order
+}
+
+abstract sig MemoryEvent extends Event {
+ address: one Address,
+ acquireRCpc: lone MemoryEvent,
+ acquireRCsc: lone MemoryEvent,
+ releaseRCpc: lone MemoryEvent,
+ releaseRCsc: lone MemoryEvent,
+ addrdep: set MemoryEvent,
+ ctrldep: set Event,
+ datadep: set MemoryEvent,
+ gmo: set MemoryEvent, // global memory order
+ rf: set MemoryEvent
+}
+sig LoadNormal extends MemoryEvent {} // l{b|h|w|d}
+sig LoadReserve extends MemoryEvent { // lr
+ pair: lone StoreConditional
+}
+sig StoreNormal extends MemoryEvent {} // s{b|h|w|d}
+// all StoreConditionals in the model are assumed to be successful
+sig StoreConditional extends MemoryEvent {} // sc
+sig AMO extends MemoryEvent {} // amo
+sig NOP extends Event {}
+
+fun Load : Event { LoadNormal + LoadReserve + AMO }
+fun Store : Event { StoreNormal + StoreConditional + AMO }
+
+sig Fence extends Event {
+ pr: lone Fence, // opcode bit
+ pw: lone Fence, // opcode bit
+ sr: lone Fence, // opcode bit
+ sw: lone Fence // opcode bit
+}
+sig FenceTSO extends Fence {}
+
+/* Alloy encoding detail: opcode bits are either set (encoded, e.g.,
+ * as f.pr in iden) or unset (f.pr not in iden). The bits cannot be used for
+ * anything else */
+fact { pr + pw + sr + sw in iden }
+// likewise for ordering annotations
+fact { acquireRCpc + acquireRCsc + releaseRCpc + releaseRCsc in iden }
+// don't try to encode FenceTSO via pr/pw/sr/sw; just use it as-is
+fact { no FenceTSO.(pr + pw + sr + sw) }
+....
+
+
+....
+////////////////////////////////////////////////////////////////////////////////
+// =Basic model rules=
+
+// Ordering annotation groups
+fun Acquire : MemoryEvent { MemoryEvent.acquireRCpc + MemoryEvent.acquireRCsc }
+fun Release : MemoryEvent { MemoryEvent.releaseRCpc + MemoryEvent.releaseRCsc }
+fun RCpc : MemoryEvent { MemoryEvent.acquireRCpc + MemoryEvent.releaseRCpc }
+fun RCsc : MemoryEvent { MemoryEvent.acquireRCsc + MemoryEvent.releaseRCsc }
+
+// There is no such thing as store-acquire or load-release, unless it's both
+fact { Load & Release in Acquire }
+fact { Store & Acquire in Release }
+
+// FENCE PPO
+fun FencePRSR : Fence { Fence.(pr & sr) }
+fun FencePRSW : Fence { Fence.(pr & sw) }
+fun FencePWSR : Fence { Fence.(pw & sr) }
+fun FencePWSW : Fence { Fence.(pw & sw) }
+
+fun ppo_fence : MemoryEvent->MemoryEvent {
+ (Load <: ^po :> FencePRSR).(^po :> Load)
+ + (Load <: ^po :> FencePRSW).(^po :> Store)
+ + (Store <: ^po :> FencePWSR).(^po :> Load)
+ + (Store <: ^po :> FencePWSW).(^po :> Store)
+ + (Load <: ^po :> FenceTSO) .(^po :> MemoryEvent)
+ + (Store <: ^po :> FenceTSO) .(^po :> Store)
+}
+
+// auxiliary definitions
+fun po_loc : Event->Event { ^po & address.~address }
+fun same_hart[e: Event] : set Event { e + e.^~po + e.^po }
+fun same_addr[e: Event] : set Event { e.address.~address }
+
+// initial stores
+fun NonInit : set Event { Hart.start.*po }
+fun Init : set Event { Event - NonInit }
+fact { Init in StoreNormal }
+fact { Init->(MemoryEvent & NonInit) in ^gmo }
+fact { all e: NonInit | one e.*~po.~start } // each event is in exactly one hart
+fact { all a: Address | one Init & a.~address } // one init store per address
+fact { no Init <: po and no po :> Init }
+....
+
+
+....
+// po
+fact { acyclic[po] }
+
+// gmo
+fact { total[^gmo, MemoryEvent] } // gmo is a total order over all MemoryEvents
+
+//rf
+fact { rf.~rf in iden } // each read returns the value of only one write
+fact { rf in Store <: address.~address :> Load }
+fun rfi : MemoryEvent->MemoryEvent { rf & (*po + *~po) }
+
+//dep
+fact { no StoreNormal <: (addrdep + ctrldep + datadep) }
+fact { addrdep + ctrldep + datadep + pair in ^po }
+fact { datadep in datadep :> Store }
+fact { ctrldep.*po in ctrldep }
+fact { no pair & (^po :> (LoadReserve + StoreConditional)).^po }
+fact { StoreConditional in LoadReserve.pair } // assume all SCs succeed
+
+// rdw
+fun rdw : Event->Event {
+ (Load <: po_loc :> Load) // start with all same_address load-load pairs,
+ - (~rf.rf) // subtract pairs that read from the same store,
+ - (po_loc.rfi) // and subtract out "fri-rfi" patterns
+}
+
+// filter out redundant instances and/or visualizations
+fact { no gmo & gmo.gmo } // keep the visualization uncluttered
+fact { all a: Address | some a.~address }
+
+////////////////////////////////////////////////////////////////////////////////
+// =Optional: opcode encoding restrictions=
+
+// the list of blessed fences
+fact { Fence in
+ Fence.pr.sr
+ + Fence.pw.sw
+ + Fence.pr.pw.sw
+ + Fence.pr.sr.sw
+ + FenceTSO
+ + Fence.pr.pw.sr.sw
+}
+
+pred restrict_to_current_encodings {
+ no (LoadNormal + StoreNormal) & (Acquire + Release)
+}
+
+////////////////////////////////////////////////////////////////////////////////
+// =Alloy shortcuts=
+pred acyclic[rel: Event->Event] { no iden & ^rel }
+pred total[rel: Event->Event, bag: Event] {
+ all disj e, e': bag | e->e' in rel + ~rel
+ acyclic[rel]
+}
+....
diff --git a/src/mm-eplan.adoc b/src/mm-eplan.adoc
new file mode 100644
index 0000000..a58d306
--- /dev/null
+++ b/src/mm-eplan.adoc
@@ -0,0 +1,1705 @@
+[appendix]
+== RVWMO Explanatory Material, Version 0.1
+[[mm-explain]]
+
+This section provides more explanation for RVWMO
+<<memorymodel>>, using more informal
+language and concrete examples. These are intended to clarify the
+meaning and intent of the axioms and preserved program order rules. This
+appendix should be treated as commentary; all normative material is
+provided in <<memorymodel>> and in the rest of
+the main body of the ISA specification. All currently known
+discrepancies are listed in <<discrepancies>>. Any
+other discrepancies are unintentional.
+
+[[whyrvwmo]]
+=== Why RVWMO?
+
+Memory consistency models fall along a loose spectrum from weak to
+strong. Weak memory models allow more hardware implementation
+flexibility and deliver arguably better performance, performance per
+watt, power, scalability, and hardware verification overheads than
+strong models, at the expense of a more complex programming model.
+Strong models provide simpler programming models, but at the cost of
+imposing more restrictions on the kinds of (non-speculative) hardware
+optimizations that can be performed in the pipeline and in the memory
+system, and in turn imposing some cost in terms of power, area overhead,
+and verification burden.
+
+RISC-V has chosen the RVWMO memory model, a variant of release
+consistency. This places it in between the two extremes of the memory
+model spectrum. The RVWMO memory model enables architects to build
+simple implementations, aggressive implementations, implementations
+embedded deeply inside a much larger system and subject to complex
+memory system interactions, or any number of other possibilities, all
+while simultaneously being strong enough to support programming language
+memory models at high performance.
+
+To facilitate the porting of code from other architectures, some
+hardware implementations may choose to implement the Ztso extension,
+which provides stricter RVTSO ordering semantics by default. Code
+written for RVWMO is automatically and inherently compatible with RVTSO,
+but code written assuming RVTSO is not guaranteed to run correctly on
+RVWMO implementations. In fact, most RVWMO implementations will (and
+should) simply refuse to run RVTSO-only binaries. Each implementation
+must therefore choose whether to prioritize compatibility with RVTSO
+code (e.g., to facilitate porting from x86) or whether to instead
+prioritize compatibility with other RISC-V cores implementing RVWMO.
+
+Some fences and/or memory ordering annotations in code written for RVWMO
+may become redundant under RVTSO; the cost that the default of RVWMO
+imposes on Ztso implementations is the incremental overhead of fetching
+those fences (e.g., FENCE R,RW and FENCE RW,W) which become no-ops on
+that implementation. However, these fences must remain present in the
+code if compatibility with non-Ztso implementations is desired.
+
+[[litmustests]]
+=== Litmus Tests
+
+The explanations in this chapter make use of _litmus tests_, or small
+programs designed to test or highlight one particular aspect of a memory
+model. <<litmus-sample>> shows an example
+of a litmus test with two harts. As a convention for this figure and for
+all figures that follow in this chapter, we assume that `s0`–`s2` are
+pre-set to the same value in all harts and that `s0` holds the address
+labeled `x`, `s1` holds `y`, and `s2` holds `z`, where `x`, `y`, and `z`
+are disjoint memory locations aligned to 8 byte boundaries. Each figure
+shows the litmus test code on the left, and a visualization of one
+particular valid or invalid execution on the right.
+
+include::images/graphviz/litmus_sample.adoc[]
+[[litmus-sample]]
+.A sample litmus test and one forbidden execution (a0=1)
+image::image_placeholder.png[float="right"]
+
+.Key for sample litmus test
+[cols="^,<,^,<",options="header",]
+|===
+|Hart 0 | |Hart 1 |
+| |latexmath:[$\vdots$] | |latexmath:[$\vdots$]
+| |li t1,1 | |li t4,4
+|(a) |sw t1,0(s0) |(e) |sw t4,0(s0)
+| |latexmath:[$\vdots$] | |latexmath:[$\vdots$]
+| |li t2,2 | |
+|(b) |sw t2,0(s0) | |
+| |latexmath:[$\vdots$] | |latexmath:[$\vdots$]
+|(c) |lw a0,0(s0) | |
+| |latexmath:[$\vdots$] | |latexmath:[$\vdots$]
+| |li t3,3 | |li t5,5
+|(d) |sw t3,0(s0) |(f) |sw t5,0(s0)
+| |latexmath:[$\vdots$] | |latexmath:[$\vdots$]
+|===
+
+
+Litmus tests are used to understand the implications of the memory model
+in specific concrete situations. For example, in the litmus test of
+<<litmus-sample>>, the final value of `a0`
+in the first hart can be either 2, 4, or 5, depending on the dynamic
+interleaving of the instruction stream from each hart at runtime.
+However, in this example, the final value of `a0` in Hart 0 will never
+be 1 or 3; intuitively, the value 1 will no longer be visible at the
+time the load executes, and the value 3 will not yet be visible by the
+time the load executes. We analyze this test and many others below.
+
+[[litmus-key]]
+.A key for the litmus test diagrams drawn in this appendix
+[cols="^,<",options="header",]
+|===
+|Edge |Full Name (and explanation)
+|rf |Reads From (from each store to the loads that return a value
+written by that store)
+
+|co |Coherence (a total order on the stores to each address)
+
+|fr |From-Reads (from each load to co-successors of the store from which
+the load returned a value)
+
+|ppo |Preserved Program Order
+
+|fence |Orderings enforced by a FENCE instruction
+
+|addr |Address Dependency
+
+|ctrl |Control Dependency
+
+|data |Data Dependency
+|===
+
+The diagram shown to the right of each litmus test shows a visual
+representation of the particular execution candidate being considered.
+These diagrams use a notation that is common in the memory model
+literature for constraining the set of possible global memory orders
+that could produce the execution in question. It is also the basis for
+the [.sans-serif]#herd# models presented in
+<<herd>>. This notation is explained in
+<<litmus-key>>. Of the listed relations, rf edges between
+harts, co edges, fr edges, and ppo edges directly constrain the global
+memory order (as do fence, addr, data, and some ctrl edges, via ppo).
+Other edges (such as intra-hart rf edges) are informative but do not
+constrain the global memory order.
+
+For example, in <<litmus-sample>>, `a0=1`
+could occur only if one of the following were true:
+
+* {empty}(b) appears before (a) in global memory order (and in the
+coherence order co). However, this violates the RVWMO PPO rule for
+same-address store-store ordering. The co edge from (b) to (a) highlights this
+contradiction.
+* {empty}(a) appears before (b) in global memory order (and in the
+coherence order co). However, in this case, the Load Value Axiom would
+be violated, because (a) is not the latest matching store prior to (c)
+in program order. The fr edge from (c) to (b) highlights this
+contradiction.
+
+Since neither of these scenarios satisfies the RVWMO axioms, the outcome
+`a0=1` is forbidden.
+
+Beyond what is described in this appendix, a suite of more than seven
+thousand litmus tests is available at
+https://github.com/litmus-tests/litmus-tests-riscv.
+
+The litmus tests repository also provides instructions on how to run the
+litmus tests on RISC-V hardware and how to compare the results with the
+operational and axiomatic models.
+
+In the future, we expect to adapt these memory model litmus tests for
+use as part of the RISC-V compliance test suite as well.
+
+=== Explaining the RVWMO Rules
+
+In this section, we provide explanation and examples for all of the
+RVWMO rules and axioms.
+
+==== Preserved Program Order and Global Memory Order
+
+Preserved program order represents the subset of program order that must
+be respected within the global memory order. Conceptually, events from
+the same hart that are ordered by preserved program order must appear in
+that order from the perspective of other harts and/or observers. Events
+from the same hart that are not ordered by preserved program order, on
+the other hand, may appear reordered from the perspective of other harts
+and/or observers.
+
+Informally, the global memory order represents the order in which loads
+and stores perform. The formal memory model literature has moved away
+from specifications built around the concept of performing, but the idea
+is still useful for building up informal intuition. A load is said to
+have performed when its return value is determined. A store is said to
+have performed not when it has executed inside the pipeline, but rather
+only when its value has been propagated to globally visible memory. In
+this sense, the global memory order also represents the contribution of
+the coherence protocol and/or the rest of the memory system to
+interleave the (possibly reordered) memory accesses being issued by each
+hart into a single total order agreed upon by all harts.
+
+The order in which loads perform does not always directly correspond to
+the relative age of the values those two loads return. In particular, a
+load latexmath:[$b$] may perform before another load latexmath:[$a$] to
+the same address (i.e., latexmath:[$b$] may execute before
+latexmath:[$a$], and latexmath:[$b$] may appear before latexmath:[$a$]
+in the global memory order), but latexmath:[$a$] may nevertheless return
+an older value than latexmath:[$b$]. This discrepancy captures (among
+other things) the reordering effects of buffering placed between the
+core and memory. For example, latexmath:[$b$] may have returned a value
+from a store in the store buffer, while latexmath:[$a$] may have ignored
+that younger store and read an older value from memory instead. To
+account for this, at the time each load performs, the value it returns
+is determined by the load value axiom, not just strictly by determining
+the most recent store to the same address in the global memory order, as
+described below.
+
+
+
+[[loadvalueaxiom]]
+==== Load value axiom
+
+[NOTE]
+====
+Load Value Axiom: Each byte of each load i returns the value written
+to that byte by the store that is the latest in global memory order among
+the following stores:
+
+. Stores that write that byte and that precede i in the global memory
+order
+. Stores that write that byte and that precede i in program order
+====
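+The axiom can be sketched as executable Python over single-byte events
+(illustrative only; the event encoding and the order-index mappings are
+our own, mirroring the Alloy model in the formal appendix):
+
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    name: str
    hart: int
    addr: str

def load_value(load, stores, gmo, po):
    """Return the store a load reads from, per the load value axiom.
    gmo: event -> position in the global memory order
    po:  event -> position in its hart's program order"""
    candidates = [
        s for s in stores
        if s.addr == load.addr and (
            gmo[s] < gmo[load]                             # precedes in gmo,
            or (s.hart == load.hart and po[s] < po[load])  # or precedes in po
        )
    ]
    # the load returns the value of the latest candidate in gmo
    return max(candidates, key=lambda s: gmo[s], default=None)
```
+
+A store that sits in the local store buffer (later in the global memory
+order but earlier in program order) still wins over an older store from
+another hart, which is exactly the store-buffer-forwarding case examined
+in this section.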
+
+Preserved program order is _not_ required to respect the ordering of a
+store followed by a load to an overlapping address. This complexity
+arises due to the ubiquity of store buffers in nearly all
+implementations. Informally, the load may perform (return a value) by
+forwarding from the store while the store is still in the store buffer,
+and hence before the store itself performs (writes back to globally
+visible memory). Any other hart will therefore observe the load as
+performing before the store.
+
+
+include::images/graphviz/storebuffer.adoc[float="right"]
+[[storebuffer]]
+.A store buffer forwarding litmus test (outcome permitted)
+image::image_placeholder.png[]
+
+.Key for store buffer forwarding litmus test
+[cols="^,<,^,<",options="header",]
+|===
+|Hart 0 | |Hart 1 |
+| |li t1, 1 | |li t1, 1
+|(a) |sw t1,0(s0) |(e) |sw t1,0(s1)
+|(b) |lw a0,0(s0) |(f) |lw a2,0(s1)
+|(c) |fence r,r |(g) |fence r,r
+|(d) |lw a1,0(s1) |(h) |lw a3,0(s0)
+|Outcome: `a0=1`, `a1=0`, `a2=1`, `a3=0` | | |
+|===
+
+
+Consider the litmus test of <<storebuffer>>. When running
+this program on an implementation with store buffers, it is possible to
+arrive at the final outcome `a0=1`, `a1=0`, `a2=1`, `a3=0` as follows:
+
+* {empty}(a) executes and enters the first hart’s private store buffer
+* {empty}(b) executes and forwards its return value 1 from (a) in the
+store buffer
+* {empty}(c) executes since all previous loads (i.e., (b)) have
+completed
+* {empty}(d) executes and reads the value 0 from memory
+* {empty}(e) executes and enters the second hart’s private store buffer
+* {empty}(f) executes and forwards its return value 1 from (e) in the
+store buffer
+* {empty}(g) executes since all previous loads (i.e., (f)) have
+completed
+* {empty}(h) executes and reads the value 0 from memory
+* {empty}(a) drains from the first hart’s store buffer to memory
+* {empty}(e) drains from the second hart’s store buffer to memory
+
+Therefore, the memory model must be able to account for this behavior.
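+The sequence of events above can be replayed with a toy store-buffer
+model (illustrative only; `x` and `y` stand for the locations addressed
+via `s0` and `s1`, and the FENCE R,R instructions (c) and (g) are
+trivially satisfied because each hart's loads execute in program order):
+
```python
memory = {'x': 0, 'y': 0}
sb = {0: {}, 1: {}}                    # per-hart private store buffers

def store(hart, addr, val):
    sb[hart][addr] = val               # enters the store buffer, not memory

def load(hart, addr):
    # forward from the hart's own store buffer if possible, else read memory
    return sb[hart].get(addr, memory[addr])

def drain(hart, addr):
    memory[addr] = sb[hart].pop(addr)  # the store reaches global memory

store(0, 'x', 1)        # (a)
a0 = load(0, 'x')       # (b) forwards 1 from hart 0's store buffer
a1 = load(0, 'y')       # (d) reads 0 from memory
store(1, 'y', 1)        # (e)
a2 = load(1, 'y')       # (f) forwards 1 from hart 1's store buffer
a3 = load(1, 'x')       # (h) reads 0: (a) has not drained yet
drain(0, 'x')           # (a) drains to memory
drain(1, 'y')           # (e) drains to memory
assert (a0, a1, a2, a3) == (1, 0, 1, 0)
```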
+
+To put it another way, suppose the definition of preserved program order
+did include the following hypothetical rule: memory access
+latexmath:[$a$] precedes memory access latexmath:[$b$] in preserved
+program order (and hence also in the global memory order) if
+latexmath:[$a$] precedes latexmath:[$b$] in program order and
+latexmath:[$a$] and latexmath:[$b$] are accesses to the same memory
+location, latexmath:[$a$] is a write, and latexmath:[$b$] is a read.
+Call this `Rule X`. Then we get the following:
+
+* {empty}(a) precedes (b): by rule X
+* {empty}(b) precedes (d): by rule <<ppo:fence>>
+* {empty}(d) precedes (e): by the load value axiom. Otherwise, if (e)
+preceded (d), then (d) would be required to return the value 1. (This is
+a perfectly legal execution; it’s just not the one in question)
+* {empty}(e) precedes (f): by rule X
+* {empty}(f) precedes (h): by rule <<ppo:fence>>
+* {empty}(h) precedes (a): by the load value axiom, as above.
+
+The global memory order must be a total order and cannot be cyclic,
+because a cycle would imply that every event in the cycle happens before
+itself, which is impossible. Therefore, the execution proposed above
+would be forbidden, and hence the addition of rule X would forbid
+implementations with store buffer forwarding, which would clearly be
+undesirable.
+
+Nevertheless, even if (b) precedes (a) and/or (f) precedes (e) in the
+global memory order, the only sensible possibility in this example is
+for (b) to return the value written by (a), and likewise for (f) and
+(e). This combination of circumstances is what leads to the second
+option in the definition of the load value axiom. Even though (b)
+precedes (a) in the global memory order, (a) will still be visible to
+(b) by virtue of sitting in the store buffer at the time (b) executes.
+Therefore, even if (b) precedes (a) in the global memory order, (b)
+should return the value written by (a) because (a) precedes (b) in
+program order. Likewise for (e) and (f).
+
+include::images/graphviz/ppoca.adoc[]
+[[ppoca]]
+.A test that highlights the behavior of store buffers
+image::image_placeholder.png[]
+
+.Key for test that highlights the behavior of store buffers
+[cols="^,<,^,<",options="header",]
+|===
+|Hart 0 | |Hart 1 |
+| |li t1, 1 | |li t1, 1
+|(a) |sw t1,0(s0) | |LOOP:
+|(b) |fence w,w |(d) |lw a0,0(s1)
+|(c) |sw t1,0(s1) | |beqz a0, LOOP
+| | |(e) |sw t1,0(s2)
+| | |(f) |lw a1,0(s2)
+| | | |xor a2,a1,a1
+| | | |add s0,s0,a2
+| | |(g) |lw a2,0(s0)
+|Outcome: `a0=1`, `a1=1`, `a2=0` | | |
+|===
+
+
+Another test that highlights the behavior of store buffers is shown in
+<<ppoca>>. In this example, (d) is
+ordered before (e) because of the control dependency, and (f) is ordered
+before (g) because of the address dependency. However, (e) is _not_
+necessarily ordered before (f), even though (f) returns the value
+written by (e). This could correspond to the following sequence of
+events:
+
+* {empty}(e) executes speculatively and enters the second hart’s private
+store buffer (but does not drain to memory)
+* {empty}(f) executes speculatively and forwards its return value 1 from
+(e) in the store buffer
+* {empty}(g) executes speculatively and reads the value 0 from memory
+* {empty}(a) executes, enters the first hart’s private store buffer, and
+drains to memory
+* {empty}(b) executes and retires
+* {empty}(c) executes, enters the first hart’s private store buffer, and
+drains to memory
+* {empty}(d) executes and reads the value 1 from memory
+* (e), (f), and (g) commit, since the speculation turned out to be
+correct
+* {empty}(e) drains from the store buffer to memory
+
+[[atomicityaxiom]]
+==== Atomicity axiom
+
+.Atomicity Axiom (for Aligned Atomics)
+****
+If _r_ and _w_ are paired load and store operations generated by aligned
+LR and SC instructions in a hart _h_, _s_ is a store to byte _x_, and
+_r_ returns a value written by _s_, then _s_ must precede _w_ in the
+global memory order, and there can be no store from a hart other than
+_h_ to byte _x_ following _s_ and preceding _w_ in the global memory
+order.
+****
+
+The RISC-V architecture decouples the notion of atomicity from the
+notion of ordering. Unlike architectures such as TSO, RISC-V atomics
+under RVWMO do not impose any ordering requirements by default. Ordering
+semantics are only guaranteed by the PPO rules that otherwise apply.
+
+RISC-V contains two types of atomics: AMOs and LR/SC pairs. These
+conceptually behave differently, in the following way. LR/SC behave as
+if the old value is brought up to the core, modified, and written back
+to memory, all while a reservation is held on that memory location. AMOs
+on the other hand conceptually behave as if they are performed directly
+in memory. AMOs are therefore inherently atomic, while LR/SC pairs are
+atomic in the slightly different sense that the memory location in
+question will not be modified by another hart during the time the
+original hart holds the reservation.
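The LR/SC usage pattern described above can be sketched in C (an illustration, not part of the specification): compilers for RISC-V commonly lower a weak compare-exchange retry loop like the one below to an LR/SC sequence. The helper name `fetch_add_casloop` is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Atomically add `delta` to `*obj` via a retry loop. On RISC-V, the
 * weak compare-exchange below is commonly lowered to an LR/SC pair:
 * LR reserves the location, and the SC fails (forcing a retry) if the
 * location was modified by another hart while the reservation was
 * held. */
static int fetch_add_casloop(atomic_int *obj, int delta) {
    int old = atomic_load_explicit(obj, memory_order_relaxed);
    while (!atomic_compare_exchange_weak_explicit(
            obj, &old, old + delta,
            memory_order_relaxed, memory_order_relaxed)) {
        /* `old` has been refreshed with the current value; retry. */
    }
    return old;
}
```

The loop retries only when the store-conditional side of the exchange fails, mirroring the reservation-based atomicity discussed above.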
+
+[[lrsdsc]]
+.Litmus tests whose SCs may (but are not guaranteed to) succeed
+====
+....
+(a) lr.d a0, 0(s0)
+(b) sd   t1, 0(s0)
+(c) sc.d t2, 0(s0)
+....
+
+....
+(a) lr.d a0, 0(s0)
+(b) sw   t1, 4(s0)
+(c) sc.d t2, 0(s0)
+....
+
+....
+(a) lr.w a0, 0(s0)
+(b) sw   t1, 4(s0)
+(c) sc.w t2, 0(s0)
+....
+
+....
+(a) lr.w a0, 0(s0)
+(b) sw   t1, 4(s0)
+(c) sc.w t2, 8(s0)
+....
+====
+
+The atomicity axiom forbids stores from other harts from being
+interleaved in global memory order between an LR and the SC paired with
+that LR. The atomicity axiom does not forbid loads from being
+interleaved between the paired operations in program order or in the
+global memory order, nor does it forbid stores from the same hart or
+stores to non-overlapping locations from appearing between the paired
+operations in either program order or in the global memory order. For
+example, the SC instructions in
+<<lrsdsc>> may (but are not
+guaranteed to) succeed. None of those successes would violate the
+atomicity axiom, because the intervening non-conditional stores are from
+the same hart as the paired load-reserved and store-conditional
+instructions. This way, a memory system that tracks memory accesses at
+cache line granularity (and which therefore will see the four snippets
+of <<lrsdsc>> as identical) will not
+be forced to fail a store-conditional instruction that happens to
+(falsely) share another portion of the same cache line as the memory
+location being held by the reservation.
+
+The atomicity axiom also technically supports cases in which the LR and
+SC touch different addresses and/or use different access sizes; however,
+use cases for such behaviors are expected to be rare in practice.
+Likewise, scenarios in which stores from the same hart between an LR/SC
+pair actually overlap the memory location(s) referenced by the LR or SC
+are expected to be rare compared to scenarios where the intervening
+store may simply fall onto the same cache line.
+
+[[mm-progress]]
+==== Progress
+
+.Progress Axiom
+****
+No memory operation may be preceded in the global memory order by an
+infinite sequence of other memory operations.
+****
+
+The progress axiom ensures a minimal forward progress guarantee. It
+ensures that stores from one hart will eventually be made visible to
+other harts in the system in a finite amount of time, and that loads
+from other harts will eventually be able to read those values (or
+successors thereof). Without this rule, it would be legal, for example,
+for a spinlock to spin infinitely on a value, even with a store from
+another hart waiting to unlock the spinlock.
+
+The progress axiom is intended not to impose any other notion of
+fairness, latency, or quality of service onto the harts in a RISC-V
+implementation. Any stronger notions of fairness are up to the rest of
+the ISA and/or up to the platform and/or device to define and implement.
+
+The forward progress axiom will in almost all cases be naturally
+satisfied by any standard cache coherence protocol. Implementations with
+non-coherent caches may have to provide some other mechanism to ensure
+the eventual visibility of all stores (or successors thereof) to all
+harts.
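The spinlock scenario above can be sketched in C (an illustrative sketch, not part of the specification); the names `spin_lock`, `spin_unlock`, and `increment_n` are hypothetical. The progress axiom is what rules out the acquire loop spinning forever once another hart's release store of 0 has been performed.

```c
#include <assert.h>
#include <stdatomic.h>

/* A minimal test-and-set spinlock. The spin in spin_lock() terminates
 * only when it observes the lock word as 0; under RVWMO, the progress
 * axiom guarantees that another hart's unlocking store eventually
 * becomes visible, so the spin cannot wait forever on a released
 * lock. */
static atomic_int lock_word = 0;
static int counter = 0;   /* data protected by the lock */

static void spin_lock(void) {
    while (atomic_exchange_explicit(&lock_word, 1, memory_order_acquire)) {
        /* spin until the previous holder's store of 0 becomes visible */
    }
}

static void spin_unlock(void) {
    atomic_store_explicit(&lock_word, 0, memory_order_release);
}

static void increment_n(int n) {
    for (int i = 0; i < n; i++) {
        spin_lock();
        counter++;
        spin_unlock();
    }
}
```

In a real program each call to `increment_n` would run on its own thread; the lock guarantees mutual exclusion, while the progress axiom guarantees that each contending hart eventually observes the lock being released.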
+
+[[mm-overlap]]
+==== Overlapping-Address Orderings (Rules 1–3)
+
+****
+Rule 1: _b_ is a store, and _a_ and _b_ access overlapping memory
+addresses +
+Rule 2: _a_ and _b_ are loads, _x_ is a byte read by both _a_ and _b_,
+there is no store to _x_ between _a_ and _b_ in program order, and _a_
+and _b_ return values for _x_ written by different memory operations +
+Rule 3: _a_ is generated by an AMO or SC instruction, _b_ is a load,
+and _b_ returns a value written by _a_
+****
+
+Same-address orderings where the latter is a store are straightforward:
+a load or store can never be reordered with a later store to an
+overlapping memory location. From a microarchitecture perspective,
+generally speaking, it is difficult or impossible to undo a
+speculatively reordered store if the speculation turns out to be
+invalid, so such behavior is simply disallowed by the model.
+Same-address orderings from a store to a later load, on the other hand,
+do not need to be enforced. As discussed in
+<<sec:memory:loadvalueaxiom>>, this reflects the observable
+behavior of implementations that forward values from buffered stores to
+later loads.
+
+Same-address load-load ordering requirements are far more subtle. The
+basic requirement is that a younger load must not return a value that is
+older than a value returned by an older load in the same hart to the
+same address. This is often known as ``CoRR`` (Coherence for Read-Read
+pairs), or as part of a broader ``coherence`` or ``sequential
+consistency per location`` requirement. Some architectures in the past
+have relaxed same-address load-load ordering, but in hindsight this is
+generally considered to complicate the programming model too much, and
+so RVWMO requires CoRR ordering to be enforced. However, because the
+global memory order corresponds to the order in which loads perform
+rather than the ordering of the values being returned, capturing CoRR
+requirements in terms of the global memory order requires a bit of
+indirection.
+
+[[frirfi]]
+.A litmus test illustrating the fri-rfi pattern
+[cols="^,<,^,<",options="header",]
+|===
+|Hart 0 | |Hart 1 |
+| |li t1, 1 | |li  t2, 2
+|(a) |sw t1,0(s0) |(d) |lw  a0,0(s1)
+|(b) |fence w, w |(e) |sw  t2,0(s1)
+|(c) |sw t1,0(s1) |(f) |lw  a1,0(s1)
+| | |(g) |xor t3,a1,a1
+| | |(h) |add s0,s0,t3
+| | |(i) |lw  a2,0(s0)
+|Outcome: `a0=1`, `a1=2`, `a2=0` | | |
+|===
+
+
+Consider the litmus test of
+<<frirfi>>, which is one particular
+instance of the more general ``fri-rfi`` pattern. The term ``fri-rfi``
+refers to the sequence (d), (e), (f): (d) ``from-reads`` (i.e., reads
+from an earlier write than) (e), which is in the same hart, and (f)
+reads from (e), which is in the same hart.
+
+From a microarchitectural perspective, outcome `a0=1`, `a1=2`, `a2=0` is
+legal (as are various other less subtle outcomes). Intuitively, the
+following would produce the outcome in question:
+
+* {empty}(d) stalls (for whatever reason; perhaps it’s stalled waiting
+for some other preceding instruction)
+* {empty}(e) executes and enters the store buffer (but does not yet
+drain to memory)
+* {empty}(f) executes and forwards from (e) in the store buffer
+* (g), (h), and (i) execute
+* {empty}(a) executes and drains to memory, (b) executes, and (c)
+executes and drains to memory
+* {empty}(d) unstalls and executes
+* {empty}(e) drains from the store buffer to memory
+
+This corresponds to a global memory order of (f), (i), (a), (c), (d),
+(e). Note that even though (f) performs before (d), the value returned
+by (f) is newer than the value returned by (d). Therefore, this
+execution is legal and does not violate the CoRR requirements.
+
+Likewise, if two back-to-back loads return the values written by the
+same store, then they may also appear out-of-order in the global memory
+order without violating CoRR. Note that this is not the same as saying
+that the two loads return the same value, since two different stores may
+write the same value.
+
+[[rsw]]
+.A litmus test illustrating the RSW pattern
+[cols="^,<,^,<",options="header",]
+|===
+|Hart 0 | |Hart 1 |
+| |li t1, 1 |(d) |lw  a0,0(s1)
+|(a) |sw t1,0(s0) |(e) |xor t2,a0,a0
+|(b) |fence w, w |(f) |add s4,s2,t2
+|(c) |sw t1,0(s1) |(g) |lw  a1,0(s4)
+| | |(h) |lw  a2,0(s2)
+| | |(i) |xor t3,a2,a2
+| | |(j) |add s0,s0,t3
+| | |(k) |lw  a3,0(s0)
+|Outcome: `a0=1`, `a1=v`, `a2=v`, `a3=0` | | |
+|===
+
+
+Consider the litmus test of <<rsw>>.
+The outcome `a0=1`, `a1=v`, `a2=v`, `a3=0` (where latexmath:[$v$] is
+some value written by another hart) can be observed by allowing (g) and
+(h) to be reordered. This might be done speculatively, and the
+speculation can be justified by the microarchitecture (e.g., by snooping
+for cache invalidations and finding none) because replaying (h) after
+(g) would return the value written by the same store anyway. Hence
+assuming `a1` and `a2` would end up with the same value written by the
+same store anyway, (g) and (h) can be legally reordered. The global
+memory order corresponding to this execution would be
+(h),(k),(a),(c),(d),(g).
+
+Executions of the test in <<rsw>> in
+which `a1` does not equal `a2` do in fact require that (g) appears
+before (h) in the global memory order. Allowing (h) to appear before (g)
+in the global memory order would in that case result in a violation of
+CoRR, because then (h) would return an older value than that returned by
+(g). Therefore, PPO rule 2 forbids this CoRR violation
+from occurring. As such, PPO rule 2 strikes a careful
+balance between enforcing CoRR in all cases while simultaneously being
+weak enough to permit `RSW` and `fri-rfi` patterns that commonly
+appear in real microarchitectures.
+
+There is one more overlapping-address rule: PPO rule 3
+simply states that a value cannot
+be returned from an AMO or SC to a subsequent load until the AMO or SC
+has (in the case of the SC, successfully) performed globally. This
+follows somewhat naturally from the conceptual view that both AMOs and
+SC instructions are meant to be performed atomically in memory. However,
+notably, PPO rule 3 states that hardware
+may not even non-speculatively forward the value being stored by an
+AMOSWAP to a subsequent load, even though for AMOSWAP that store value
+is not actually semantically dependent on the previous value in memory,
+as is the case for the other AMOs. The same holds true even when
+forwarding from SC store values that are not semantically dependent on
+the value returned by the paired LR.
+
+The three PPO rules above also apply when the memory accesses in
+question only overlap partially. This can occur, for example, when
+accesses of different sizes are used to access the same object. Note
+also that the base addresses of two overlapping memory operations need
+not necessarily be the same for two memory accesses to overlap. When
+misaligned memory accesses are being used, the overlapping-address PPO
+rules apply to each of the component memory accesses independently.
+
+[[mm-fence]]
+==== Fences
+
+****
+Rule 4: There is a FENCE instruction that orders _a_ before _b_
+****
+
+By default, the FENCE instruction ensures that all memory accesses from
+instructions preceding the fence in program order (the ``predecessor
+set``) appear earlier in the global memory order than memory accesses
+from instructions appearing after the fence in program order (the
+``successor set``). However, fences can optionally further restrict the
+predecessor set and/or the successor set to a smaller set of memory
+accesses in order to provide some speedup. Specifically, fences have PR,
+PW, SR, and SW bits which restrict the predecessor and/or successor
+sets. The predecessor set includes loads (resp. stores) if and only if
+PR (resp. PW) is set. Similarly, the successor set includes loads
+(resp. stores) if and only if SR (resp. SW) is set.
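The predecessor/successor-set rule above can be modeled directly. The following is a toy model in C (illustrative only, not part of the specification; the names `fence_bits` and `fence_orders` are hypothetical): a fence orders a preceding access before a following one exactly when the first falls in the predecessor set and the second in the successor set.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the FENCE bit semantics described above: a fence orders
 * predecessor access `a` before successor access `b` iff `a` is in the
 * predecessor set and `b` is in the successor set, as selected by the
 * PR/PW/SR/SW bits. */
struct fence_bits { bool pr, pw, sr, sw; };

static bool fence_orders(struct fence_bits f, bool a_is_load, bool b_is_load) {
    bool a_in_pred = a_is_load ? f.pr : f.pw;   /* loads need PR, stores PW */
    bool b_in_succ = b_is_load ? f.sr : f.sw;   /* loads need SR, stores SW */
    return a_in_pred && b_in_succ;
}
```

For instance, in this model a FENCE W,W orders a store before a later store but imposes no order between two loads, matching the text above.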
+
+The FENCE encoding currently has nine non-trivial combinations of the
+four bits PR, PW, SR, and SW, plus one extra encoding FENCE.TSO which
+facilitates mapping of ``acquire+release`` or RVTSO semantics. The
+remaining seven combinations have empty predecessor and/or successor
+sets and hence are no-ops. Of the ten non-trivial options, only six are
+commonly used in practice:
+
+* FENCE RW,RW
+* FENCE.TSO
+* FENCE RW,W
+* FENCE R,RW
+* FENCE R,R
+* FENCE W,W
+
+FENCE instructions using any other combination of PR, PW, SR, and SW are
+reserved. We strongly recommend that programmers stick to these six.
+Other combinations may have unknown or unexpected interactions with the
+memory model.
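Two of the six recommended combinations appear in the classic message-passing idiom, sketched here in C (an illustration, not part of the specification; `producer`, `consumer`, and `payload` are hypothetical names). On RISC-V, a compiler may implement the release fence below as `fence rw,w` and the acquire fence as `fence r,rw`.

```c
#include <assert.h>
#include <stdatomic.h>

/* Message passing with explicit fences: the producer publishes
 * `payload` before setting `flag`; the consumer waits for `flag` and
 * then reads `payload`. The fences supply the ordering that the
 * relaxed atomic accesses alone would not. */
static int payload;
static atomic_int flag = 0;

static void producer(void) {
    payload = 42;                                  /* plain store      */
    atomic_thread_fence(memory_order_release);     /* cf. fence rw,w   */
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
}

static int consumer(void) {
    while (!atomic_load_explicit(&flag, memory_order_relaxed)) {
        /* spin until the producer's flag store becomes visible */
    }
    atomic_thread_fence(memory_order_acquire);     /* cf. fence r,rw   */
    return payload;                                /* observes 42      */
}
```

With the two threads running concurrently, the release/acquire fence pair guarantees the consumer that reads `flag` as 1 also observes `payload` as 42.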
+
+Finally, we note that since RISC-V uses a multi-copy atomic memory
+model, programmers can reason about fences bits in a thread-local
+manner. There is no complex notion of ``fence cumulativity`` as found in
+memory models that are not multi-copy atomic.
+
+[[sec:memory:acqrel]]
+==== Explicit Synchronization (Rules 5–8)
+
+****
+Rule 5: _a_ has an acquire annotation +
+Rule 6: _b_ has a release annotation +
+Rule 7: _a_ and _b_ both have RCsc annotations +
+Rule 8: _a_ is paired with _b_
+****
+
+An _acquire_ operation, as would be used at the start of a critical
+section, requires all memory operations following the acquire in program
+order to also follow the acquire in the global memory order. This
+ensures, for example, that all loads and stores inside the critical
+section are up to date with respect to the synchronization variable
+being used to protect it. Acquire ordering can be enforced in one of two
+ways: with an acquire annotation, which enforces ordering with respect
+to just the synchronization variable itself, or with a FENCE R,RW, which
+enforces ordering with respect to all previous loads.
+
+[[spinlock_atomics]]
+.A spinlock with atomics
+....
+ sd x1, (a1) # Arbitrary unrelated store
+ ld x2, (a2) # Arbitrary unrelated load
+ li t0, 1 # Initialize swap value.
+ again:
+ amoswap.w.aq t0, t0, (a0) # Attempt to acquire lock.
+ bnez t0, again # Retry if held.
+ # ...
+ # Critical section.
+ # ...
+ amoswap.w.rl x0, x0, (a0) # Release lock by storing 0.
+ sd x3, (a3) # Arbitrary unrelated store
+ ld x4, (a4) # Arbitrary unrelated load
+....
+
+Consider
+<<spinlock_atomics>>.
+Because this example uses _aq_, the loads and stores in the critical
+section are guaranteed to appear in the global memory order after the
+AMOSWAP used to acquire the lock. However, assuming `a0`, `a1`, and `a2`
+point to different memory locations, the loads and stores in the
+critical section may or may not appear after the ``Arbitrary unrelated
+load`` at the beginning of the example in the global memory order.
+
+[[spinlock_fences]]
+.A spinlock with fences
+....
+ sd x1, (a1) # Arbitrary unrelated store
+ ld x2, (a2) # Arbitrary unrelated load
+ li t0, 1 # Initialize swap value.
+ again:
+ amoswap.w t0, t0, (a0) # Attempt to acquire lock.
+ fence r, rw # Enforce "acquire" memory ordering
+ bnez t0, again # Retry if held.
+ # ...
+ # Critical section.
+ # ...
+ fence rw, w # Enforce "release" memory ordering
+ amoswap.w x0, x0, (a0) # Release lock by storing 0.
+ sd x3, (a3) # Arbitrary unrelated store
+ ld x4, (a4) # Arbitrary unrelated load
+....
+
+Now, consider the alternative in
+<<spinlock_fences>>. In
+this case, even though the AMOSWAP does not enforce ordering with an
+_aq_ bit, the fence nevertheless enforces that the acquire AMOSWAP
+appears earlier in the global memory order than all loads and stores in
+the critical section. Note, however, that in this case, the fence also
+enforces additional orderings: it also requires that the ``Arbitrary
+unrelated load`` at the start of the program appears earlier in the
+global memory order than the loads and stores of the critical section.
+(This particular fence does not, however, enforce any ordering with
+respect to the ``Arbitrary unrelated store`` at the start of the
+snippet.) In this way, fence-enforced orderings are slightly coarser
+than orderings enforced by _.aq_.
+
+Release orderings work exactly the same as acquire orderings, just in
+the opposite direction. Release semantics require all loads and stores
+preceding the release operation in program order to also precede the
+release operation in the global memory order. This ensures, for example,
+that memory accesses in a critical section appear before the
+lock-releasing store in the global memory order. Just as for acquire
+semantics, release semantics can be enforced using release annotations
+or with a FENCE RW,W operation. Using the same examples, the ordering
+between the loads and stores in the critical section and the ``Arbitrary
+unrelated store`` at the end of the code snippet is enforced only by the
+FENCE RW,W in
+<<spinlock_fences>>, not by
+the _rl_ in
+<<spinlock_atomics>>.
+
+With RCpc annotations alone, store-release-to-load-acquire ordering is
+not enforced. This facilitates the porting of code written under the TSO
+and/or RCpc memory models. To enforce store-release-to-load-acquire
+ordering, the code must use store-release-RCsc and load-acquire-RCsc
+operations so that PPO rule 7 applies. RCpc alone is
+sufficient for many use cases in C/C++ but is insufficient for many
+other use cases in C/C++, Java, and Linux, to name just a few examples;
+see <<sec:memory:porting>> for details.
+
+PPO rule 8 indicates that an SC must appear after
+its paired LR in the global memory order. This will follow naturally
+from the common use of LR/SC to perform an atomic read-modify-write
+operation due to the inherent data dependency. However, PPO
+rule 8 also applies even when the value being stored
+does not syntactically depend on the value returned by the paired LR.
+
+Lastly, we note that just as with fences, programmers need not worry
+about ``cumulativity`` when analyzing ordering annotations.
+
+[[sec:memory:dependencies]]
+==== Syntactic Dependencies (Rules 9–11)
+
+****
+Rule 9: _b_ has a syntactic address dependency on _a_ +
+Rule 10: _b_ has a syntactic data dependency on _a_ +
+Rule 11: _b_ is a store, and _b_ has a syntactic control dependency on
+_a_
+****
+
+Dependencies from a load to a later memory operation in the same hart
+are respected by the RVWMO memory model. The Alpha memory model was
+notable for choosing _not_ to enforce the ordering of such dependencies,
+but most modern hardware and software memory models consider allowing
+dependent instructions to be reordered too confusing and
+counterintuitive. Furthermore, modern code sometimes intentionally uses
+such dependencies as a particularly lightweight ordering enforcement
+mechanism.
+
+The terms in
+<<mm-dependencies>>
+work as follows. Instructions are said to carry dependencies from their
+source register(s) to their destination register(s) whenever the value
+written into each destination register is a function of the source
+register(s). For most instructions, this means that the destination
+register(s) carry a dependency from all source register(s). However,
+there are a few notable exceptions. In the case of memory instructions,
+the value written into the destination register ultimately comes from
+the memory system rather than from the source register(s) directly, and
+so this breaks the chain of dependencies carried from the source
+register(s). In the case of unconditional jumps, the value written into
+the destination register comes from the current `pc` (which is never
+considered a source register by the memory model), and so likewise, JALR
+(the only jump with a source register) does not carry a dependency from
+_rs1_ to _rd_.
+
+[[fflags]]
+.A syntactic dependency through the `fflags` CSR
+....
+(a) fadd f3,f1,f2
+(b) fadd f6,f4,f5
+(c) csrrs a0,fflags,x0
+....
+
+The notion of accumulating into a destination register rather than
+writing into it reflects the behavior of CSRs such as `fflags`. In
+particular, an accumulation into a register does not clobber any
+previous writes or accumulations into the same register. For example, in
+<<fflags>>, (c) has a syntactic
+dependency on both (a) and (b).
+
+Like other modern memory models, the RVWMO memory model uses syntactic
+rather than semantic dependencies. In other words, this definition
+depends on the identities of the registers being accessed by different
+instructions, not the actual contents of those registers. This means
+that an address, control, or data dependency must be enforced even if
+the calculation could seemingly be ``optimized away``. This choice
+ensures that RVWMO remains compatible with code that uses these false
+syntactic dependencies as a lightweight ordering mechanism.
+
+[[address]]
+.A syntactic address dependency
+....
+ld  a1,0(s0)
+xor a2,a1,a1
+add s1,s1,a2
+ld  a5,0(s1)
+....
+
+For example, there is a syntactic address dependency from the memory
+operation generated by the first instruction to the memory operation
+generated by the last instruction in
+<<address>>, even though `a1` XOR
+`a1` is zero and hence has no effect on the address accessed by the
+second load.
+
+The benefit of using dependencies as a lightweight synchronization
+mechanism is that the ordering enforcement requirement is limited only
+to the specific two instructions in question. Other non-dependent
+instructions may be freely reordered by aggressive implementations. One
+alternative would be to use a load-acquire, but this would enforce
+ordering for the first load with respect to _all_ subsequent
+instructions. Another would be to use a FENCE R,R, but this would
+include all previous and all subsequent loads, making this option more
+expensive.
+
+[[control1]]
+.A syntactic control dependency
+....
+lw  x1,0(x2)
+bne x1,x0,next
+sw  x3,0(x4)
+next:
+sw  x5,0(x6)
+....
+
+Control dependencies behave differently from address and data
+dependencies in the sense that a control dependency always extends to
+all instructions following the original target in program order.
+Consider <<control1>>: the
+instruction at `next` will always execute, but the memory operation
+generated by that last instruction nevertheless still has a control
+dependency from the memory operation generated by the first instruction.
+
+[[control2]]
+.Another snippet with a syntactic control dependency
+....
+lw  x1,0(x2)
+bne x1,x0,next
+next:
+sw  x3,0(x4)
+....
+
+Likewise, consider <<control2>>.
+Even though both branch outcomes have the same target, there is still a
+control dependency from the memory operation generated by the first
+instruction in this snippet to the memory operation generated by the
+last instruction. This definition of control dependency is subtly
+stronger than what might be seen in other contexts (e.g., C++), but it
+conforms with standard definitions of control dependencies in the
+literature.
+
+Notably, PPO rules 9–11 are also
+intentionally designed to respect dependencies that originate from the
+output of a successful store-conditional instruction. Typically, an SC
+instruction will be followed by a conditional branch checking whether
+the outcome was successful; this implies that there will be a control
+dependency from the store operation generated by the SC instruction to
+any memory operations following the branch. PPO
+rule 11 in turn implies that any subsequent store
+operations will appear later in the global memory order than the store
+operation generated by the SC. However, since control, address, and data
+dependencies are defined over memory operations, and since an
+unsuccessful SC does not generate a memory operation, no order is
+enforced between an unsuccessful SC and its dependent instructions.
+Moreover, since SC is defined to carry dependencies from its source
+registers to _rd_ only when the SC is successful, an unsuccessful SC has
+no effect on the global memory order.
+
+[[successdeps]]
+.A litmus test with a dependency originating at a store-conditional instruction
+[cols="^,<,^,<",]
+|===
+|Initial values: 0(s0)=1; 0(s1)=1 | | |
+| | | |
+|Hart 0 | |Hart 1 |
+|(a) |ld a0,0(s0) |(e) |ld a3,0(s2)
+|(b) |lr a1,0(s1) |(f) |sd a3,0(s0)
+|(c) |sc a2,a0,0(s1) | |
+|(d) |sd a2,0(s2) | |
+|Outcome: `a0=0`, `a3=0` | | |
+|===
+
+
+In addition, the choice to respect dependencies originating at
+store-conditional instructions ensures that certain out-of-thin-air-like
+behaviors will be prevented. Consider
+<<successdeps>>. Suppose a
+hypothetical implementation could occasionally make some early guarantee
+that a store-conditional operation will succeed. In this case, (c) could
+return 0 to `a2` early (before actually executing), allowing the
+sequence (d), (e), (f), (a), and then (b) to execute, and then (c) might
+execute (successfully) only at that point. This would imply that (c)
+writes its own success value to `0(s1)`! Fortunately, this situation and
+others like it are prevented by the fact that RVWMO respects
+dependencies originating at the stores generated by successful SC
+instructions.
+
+We also note that syntactic dependencies between instructions only have
+any force when they take the form of a syntactic address, control,
+and/or data dependency. For example: a syntactic dependency between two
+``F`` instructions via one of the ``accumulating CSRs`` in
+<<sec:source-dest-regs>> does _not_ imply
+that the two ``F`` instructions must be executed in order. Such a
+dependency would only serve to ultimately set up a later dependency
+from both ``F`` instructions to a subsequent CSR instruction accessing
+the CSR flag in question.
+
+[[sec:memory:ppopipeline]]
+==== Pipeline Dependencies (Rules 12–13)
+
+****
+Rule 12: _b_ is a load, and there exists some store _m_ between _a_ and
+_b_ in program order such that _m_ has an address or data dependency on
+_a_, and _b_ returns a value written by _m_ +
+Rule 13: _b_ is a store, and there exists some instruction _m_ between
+_a_ and _b_ in program order such that _m_ has an address dependency on
+_a_
+****
+
+[[addrdatarfi]]
+.A litmus test illustrating store-forwarding dependencies
+[cols="^,<,^,<",options="header",]
+|===
+|Hart 0 | |Hart 1 |
+| |li t1, 1 |(d) |lw a0, 0(s1)
+|(a) |sw t1,0(s0) |(e) |sw a0, 0(s2)
+|(b) |fence w, w |(f) |lw a1, 0(s2)
+|(c) |sw t1,0(s1) | |xor a2,a1,a1
+| | | |add s0,s0,a2
+| | |(g) |lw a3,0(s0)
+|Outcome: `a0=1`, `a3=0` | | |
+|===
+
+
+PPO rules 12 and 13 reflect behaviors of almost all real processor
+pipeline implementations. Rule 12 states that a load cannot forward
+from a store until the address and data for that store are known.
+Consider <<addrdatarfi>>: (f) cannot be
+executed until the data for (e) has been resolved, because (f) must
+return the value written by (e) (or by something even later in the
+global memory order), and the old value must not be clobbered by the
+writeback of (e) before (d) has had a chance to perform. Therefore, (f)
+will never perform before (d) has performed.
+
+[[addrdatarfi_no]]
+.A variant in which an intervening store breaks the dependency
+[cols="^,<,^,<",options="header",]
+|===
+|Hart 0 | |Hart 1 |
+| |li t1, 1 | |li t1, 1
+|(a) |sw t1,0(s0) |(d) |lw a0, 0(s1)
+|(b) |fence w, w |(e) |sw a0, 0(s2)
+|(c) |sw t1,0(s1) |(f) |sw t1, 0(s2)
+| | |(g) |lw a1, 0(s2)
+| | | |xor a2,a1,a1
+| | | |add s0,s0,a2
+| | |(h) |lw a3,0(s0)
+|Outcome: `a0=1`, `a3=0` | | |
+|===
+
+
+If there were another store to the same address in between (e) and (f),
+as in <<addrdatarfi_no>>,
+then (f) would no longer be dependent on the data of (e) being resolved,
+and hence the dependency of (f) on (d), which produces the data for (e),
+would be broken.
+
+Rule 13 makes a similar observation to the
+previous rule: a store cannot be performed at memory until all previous
+loads that might access the same address have themselves been performed.
+Such a load must appear to execute before the store, but it cannot do so
+if the store were to overwrite the value in memory before the load had a
+chance to read the old value. Likewise, a store generally cannot be
+performed until it is known that preceding instructions will not cause
+an exception due to failed address resolution, and in this sense,
+rule 13 can be seen as somewhat of a special case
+of rule 11.
+
+[[addrpo]]
+.A litmus test illustrating address-resolution ordering
+[cols="^,<,^,<",options="header",]
+|===
+|Hart 0 | |Hart 1 |
+| | | |li t1, 1
+|(a) |lw a0,0(s0) |(d) |lw a1, 0(s1)
+|(b) |fence rw,rw |(e) |lw a2, 0(a1)
+|(c) |sw s2,0(s1) |(f) |sw t1, 0(s0)
+|Outcome: `a0=1`, `a1=t` | | |
+|===
+
+
+Consider <<addrpo>>: (f) cannot be
+executed until the address for (e) is resolved, because it may turn out
+that the addresses match; i.e., that `a1=s0`. Therefore, (f) cannot be
+sent to memory before (d) has executed and confirmed whether the
+addresses do indeed overlap.
+
+=== Beyond Main Memory
+
+RVWMO does not currently attempt to formally describe how FENCE.I,
+SFENCE.VMA, I/O fences, and PMAs behave. All of these behaviors will be
+described by future formalizations. In the meantime, the behavior of
+FENCE.I is described in the chapter on the Zifencei extension, the
+behavior of SFENCE.VMA is described in the RISC-V Instruction Set
+Privileged Architecture Manual, and the behavior of I/O fences and the
+effects of PMAs are described below.
+
+==== Coherence and Cacheability
+
+The RISC-V Privileged ISA defines Physical Memory Attributes (PMAs)
+which specify, among other things, whether portions of the address space
+are coherent and/or cacheable. See the RISC-V Privileged ISA
+Specification for the complete details. Here, we simply discuss how the
+various details in each PMA relate to the memory model:
+
+* Main memory vs. I/O, and I/O memory ordering PMAs: the memory model as
+defined applies to main memory regions. I/O ordering is discussed below.
+* Supported access types and atomicity PMAs: the memory model is simply
+applied on top of whatever primitives each region supports.
+* Cacheability PMAs: the cacheability PMAs in general do not affect the
+memory model. Non-cacheable regions may have more restrictive behavior
+than cacheable regions, but the set of allowed behaviors does not change
+regardless. However, some platform-specific and/or device-specific
+cacheability settings may differ.
+* Coherence PMAs: The memory consistency model for memory regions marked
+as non-coherent in PMAs is currently platform-specific and/or
+device-specific: the load-value axiom, the atomicity axiom, and the
+progress axiom all may be violated with non-coherent memory. Note
+however that coherent memory does not require a hardware cache coherence
+protocol. The RISC-V Privileged ISA Specification suggests that
+hardware-incoherent regions of main memory are discouraged, but the
+memory model is compatible with hardware coherence, software coherence,
+implicit coherence due to read-only memory, implicit coherence due to
+only one agent having access, or otherwise.
+* Idempotency PMAs: Idempotency PMAs are used to specify memory regions
+for which loads and/or stores may have side effects, and this in turn is
+used by the microarchitecture to determine, e.g., whether prefetches are
+legal. This distinction does not affect the memory model.
+
+==== I/O Ordering
+
+For I/O, the load value axiom and atomicity axiom in general do not
+apply, as both reads and writes might have device-specific side effects
+and may return values other than the value ``written`` by the most
+recent store to the same address. Nevertheless, the following preserved
+program order rules still generally apply for accesses to I/O memory:
+memory access latexmath:[$a$] precedes memory access latexmath:[$b$] in
+global memory order if latexmath:[$a$] precedes latexmath:[$b$] in
+program order and one or more of the following holds:
+
+. latexmath:[$a$] precedes latexmath:[$b$] in preserved program order as
+defined earlier in this chapter, with the exception that acquire and
+release ordering annotations apply only from one memory operation to
+another memory operation and from one I/O operation to another I/O
+operation, but not from a memory operation to an I/O operation, nor
+vice versa
+. latexmath:[$a$] and latexmath:[$b$] are accesses to overlapping
+addresses in an I/O region
+. latexmath:[$a$] and latexmath:[$b$] are accesses to the same strongly
+ordered I/O region
+. latexmath:[$a$] and latexmath:[$b$] are accesses to I/O regions, and
+the channel associated with the I/O region accessed by either
+latexmath:[$a$] or latexmath:[$b$] is channel 1
+. latexmath:[$a$] and latexmath:[$b$] are accesses to I/O regions
+associated with the same channel (except for channel 0)
+
+Note that the FENCE instruction distinguishes between main memory
+operations and I/O operations in its predecessor and successor sets. To
+enforce ordering between I/O operations and main memory operations, code
+must use a FENCE with PI, PO, SI, and/or SO, plus PR, PW, SR, and/or SW.
+For example, to enforce ordering between a write to main memory and an
+I/O write to a device register, a FENCE W,O or stronger is needed.
+
+....
+sd    t0, 0(a0)    # store to main memory
+fence w,o          # order the main memory write before the I/O write
+sd    a0, 0(a1)    # MMIO store to a device register
+....
+
+When a fence is in fact used, implementations must assume that the
+device may attempt to access memory immediately after receiving the MMIO
+signal, and subsequent memory accesses from that device to memory must
+observe the effects of all accesses ordered prior to that MMIO
+operation. In other words, in the example above,
+suppose `0(a0)` is in main memory and `0(a1)` is the address of a device
+register in I/O memory. If the device accesses `0(a0)` upon receiving
+the MMIO write, then that load must conceptually appear after the first
+store to `0(a0)` according to the rules of the RVWMO memory model. In
+some implementations, the only way to ensure this will be to require
+that the first store does in fact complete before the MMIO write is
+issued. Other implementations may find ways to be more aggressive, while
+others still may not need to do anything different at all for I/O and
+main memory accesses. Nevertheless, the RVWMO memory model does not
+distinguish between these options; it simply provides an
+implementation-agnostic mechanism to specify the orderings that must be
+enforced.
+
+Many architectures include separate notions of ``ordering`` and
+``completion`` fences, especially as it relates to I/O (as opposed to
+regular main memory). Ordering fences simply ensure that memory
+operations stay in order, while completion fences ensure that
+predecessor accesses have all completed before any successors are made
+visible. RISC-V does not explicitly distinguish between ordering and
+completion fences. Instead, this distinction is simply inferred from
+different uses of the FENCE bits.
+
+For implementations that conform to the RISC-V Unix Platform
+Specification, I/O devices and DMA operations are required to access
+memory coherently and via strongly ordered I/O channels. Therefore,
+accesses to regular main memory regions that are concurrently accessed
+by external devices can also use the standard synchronization
+mechanisms. Implementations that do not conform to the Unix Platform
+Specification and/or in which devices do not access memory coherently
+will need to use mechanisms (which are currently platform-specific or
+device-specific) to enforce coherency.
+
+I/O regions in the address space should be considered non-cacheable
+regions in the PMAs for those regions. Such regions can be considered
+coherent by the PMA if they are not cached by any agent.
+
+The ordering guarantees in this section may not apply beyond a
+platform-specific boundary between the RISC-V cores and the device. In
+particular, I/O accesses sent across an external bus (e.g., PCIe) may be
+reordered before they reach their ultimate destination. Ordering must be
+enforced in such situations according to the platform-specific rules of
+those external devices and buses.
+
+[[sec:memory:porting]]
+=== Code Porting and Mapping Guidelines
+
+[[tsomappings]]
+.Mappings from TSO operations to RISC-V operations
+[cols="<,<",options="header",]
+|===
+|x86/TSO Operation |RVWMO Mapping
+|Load |`l{b|h|w|d}; fence r,rw `
+|Store |`fence rw,w; s{b|h|w|d} `
+|Atomic RMW |`amo<op>.{w|d}.aqrl OR `
+| |`loop:lr.{w|d}.aq; <op>; sc.{w|d}.aqrl; bnez loop `
+|Fence |`fence rw,rw `
+|===
+
+<<tsomappings>> provides a mapping from TSO memory
+operations onto RISC-V memory instructions. Normal x86 loads and stores
+are all inherently acquire-RCpc and release-RCpc operations: TSO
+enforces all load-load, load-store, and store-store ordering by default.
+Therefore, under RVWMO, all TSO loads must be mapped onto a load
+followed by FENCE R,RW, and all TSO stores must be mapped onto
+FENCE RW,W followed by a store. TSO atomic read-modify-writes and x86
+instructions using the LOCK prefix are fully ordered and can be
+implemented either via an AMO with both _aq_ and _rl_ set, or via an LR
+with _aq_ set, the arithmetic operation in question, an SC with both
+_aq_ and _rl_ set, and a conditional branch checking the success
+condition. In the latter case, the _rl_ annotation on the LR turns out
+(for non-obvious reasons) to be redundant and can be omitted.
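+
+For example, an x86 `lock add` could be realized with the LR/SC mapping
+as follows (a sketch only; the register choices are arbitrary, and the
+increment is assumed to be in `a1`):
+
+....
+loop:
+    lr.w.aq     t0, (a0)      # load-reserved old value, acquire
+    add         t1, t0, a1    # <op>: the arithmetic operation
+    sc.w.aqrl   t2, t1, (a0)  # store-conditional, fully ordered
+    bnez        t2, loop      # retry if the reservation was lost
+....
+
+As noted above, the _rl_ annotation on the LR would be redundant and is
+omitted here.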
+
+Alternatives to <<tsomappings>> are also possible. A TSO
+store can be mapped onto AMOSWAP with _rl_ set. However, since RVWMO PPO
+rule 3 forbids forwarding of values from
+AMOs to subsequent loads, the use of AMOSWAP for stores may negatively
+affect performance. A TSO load can be mapped using LR with _aq_ set: all
+such LR instructions will be unpaired, but that fact in and of itself
+does not preclude the use of LR for loads. However, again, this mapping
+may also negatively affect performance if it puts more pressure on the
+reservation mechanism than was originally intended.
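+
+As a sketch, these two alternative mappings would look like the
+following (register choices arbitrary):
+
+....
+# TSO store alternative: AMOSWAP with rl set, old value discarded
+amoswap.w.rl x0, t0, (a0)
+
+# TSO load alternative: an unpaired LR with aq set
+lr.w.aq      t0, (a0)
+....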
+
+[[powermappings]]
+.Mappings from Power operations to RISC-V operations
+[cols="<,<",options="header",]
+|===
+|Power Operation |RVWMO Mapping
+|Load |`l{b|h|w|d} `
+|Load-Reserve |`lr.{w|d} `
+|Store |`s{b|h|w|d} `
+|Store-Conditional |`sc.{w|d} `
+|`lwsync ` |`fence.tso `
+|`sync ` |`fence rw,rw `
+|`isync ` |`fence.i; fence r,r `
+|===
+
+<<powermappings>> provides a mapping from Power memory
+operations onto RISC-V memory instructions. Power ISYNC maps on RISC-V
+to a FENCE.I followed by a FENCE R,R; the latter fence is needed because
+ISYNC is used to define a `control+control fence` dependency that is
+not present in RVWMO.
+
+[[armmappings]]
+.Mappings from ARM operations to RISC-V operations
+[cols="<,<",options="header",]
+|===
+|ARM Operation |RVWMO Mapping
+|Load |`l{b|h|w|d} `
+|Load-Acquire |`fence rw, rw; l{b|h|w|d}; fence r,rw `
+|Load-Exclusive |`lr.{w|d} `
+|Load-Acquire-Exclusive |`lr.{w|d}.aqrl `
+|Store |`s{b|h|w|d} `
+|Store-Release |`fence rw,w; s{b|h|w|d} `
+|Store-Exclusive |`sc.{w|d} `
+|Store-Release-Exclusive |`sc.{w|d}.rl `
+|`dmb ` |`fence rw,rw `
+|`dmb.ld ` |`fence r,rw `
+|`dmb.st ` |`fence w,w `
+|`isb ` |`fence.i; fence r,r `
+|===
+
+<<armmappings>> provides a mapping from ARM memory
+operations onto RISC-V memory instructions. Since RISC-V does not
+currently have plain load and store opcodes with _aq_ or _rl_
+annotations, ARM load-acquire and store-release operations should be
+mapped using fences instead. Furthermore, in order to enforce
+store-release-to-load-acquire ordering, there must be a FENCE RW,RW
+between the store-release and the load-acquire; <<armmappings>>
+enforces this by always placing the fence in front of each acquire
+operation. ARM load-exclusive and store-exclusive instructions can
+likewise map onto their RISC-V LR and SC equivalents, but instead of
+placing a FENCE RW,RW in front of an LR with _aq_ set, we simply also
+set _rl_ instead. ARM ISB maps on RISC-V to FENCE.I followed by
+FENCE R,R similarly to how ISYNC maps for Power.
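+
+For instance, an ARM store-release followed elsewhere by a load-acquire
+of a different location would map onto RISC-V roughly as follows (a
+sketch; `a0` and `a1` are arbitrary address registers):
+
+....
+fence rw,w       # release fence of the store-release mapping
+sw    t0, 0(a0)  # the store-release itself
+...
+fence rw,rw      # leading fence of the acquire mapping; also enforces
+                 # store-release-to-load-acquire ordering
+lw    t1, 0(a1)  # the load-acquire itself
+fence r,rw       # trailing fence of the acquire mapping
+....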
+
+[[linuxmappings]]
+.Mappings from Linux memory primitives to RISC-V primitives.
+[cols="<,<",options="header",]
+|===
+|Linux Operation |RVWMO Mapping
+|`smp_mb() ` |`fence rw,rw `
+
+|`smp_rmb() ` |`fence r,r `
+
+|`smp_wmb() ` |`fence w,w `
+
+|`dma_rmb() ` |`fence r,r `
+
+|`dma_wmb() ` |`fence w,w `
+
+|`mb() ` |`fence iorw,iorw `
+
+|`rmb() ` |`fence ri,ri `
+
+|`wmb() ` |`fence wo,wo `
+
+|`smp_load_acquire() ` |`l{b|h|w|d}; fence r,rw `
+
+|`smp_store_release() ` |`fence.tso; s{b|h|w|d} `
+
+|Linux Construct |RVWMO AMO Mapping
+
+|`atomic_<op>_relaxed ` |`amo<op>.{w|d} `
+
+|`atomic_<op>_acquire ` |`amo<op>.{w|d}.aq `
+
+|`atomic_<op>_release ` |`amo<op>.{w|d}.rl `
+
+|`atomic_<op> ` |`amo<op>.{w|d}.aqrl `
+
+|Linux Construct |RVWMO LR/SC Mapping
+
+|`atomic_<op>_relaxed ` |`loop:lr.{w|d}; <op>; sc.{w|d}; bnez loop `
+
+|`atomic_<op>_acquire ` |`loop:lr.{w|d}.aq; <op>; sc.{w|d}; bnez loop `
+
+|`atomic_<op>_release`
+|`loop:lr.{w|d}; <op>; sc.{w|d}.aqrl^*; bnez loop OR `
+
+| |`fence.tso; loop:lr.{w|d}; <op>; sc.{w|d}^*; bnez loop `
+
+|`atomic_<op> ` |`loop:lr.{w|d}.aq; <op>; sc.{w|d}.aqrl; bnez loop `
+|===
+
+With regard to <<linuxmappings>>, other
+constructs (such as spinlocks) should follow accordingly. Platforms or
+devices with non-coherent DMA may need additional synchronization (such
+as cache flush or invalidate mechanisms); currently any such extra
+synchronization will be device-specific.
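+
+As one example, a simple test-and-set spinlock with acquire and release
+semantics could be built from the AMO mappings above (a sketch; `a0` is
+assumed to hold the lock address):
+
+....
+        li           t1, 1
+acquire:
+        lw           t0, 0(a0)     # spin on a plain load first
+        bnez         t0, acquire
+        amoswap.w.aq t0, t1, (a0)  # attempt to take the lock, acquire
+        bnez         t0, acquire   # retry if another hart won the race
+        ...                        # critical section
+release:
+        amoswap.w.rl x0, x0, (a0)  # release the lock by storing zero
+....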
+
+<<linuxmappings>> provides a mapping of Linux memory
+ordering macros onto RISC-V memory instructions. The Linux fences
+`dma_rmb()` and `dma_wmb()` map onto FENCE R,R and FENCE W,W,
+respectively, since the RISC-V Unix Platform requires coherent DMA, but
+would be mapped onto FENCE RI,RI and FENCE WO,WO, respectively, on a
+platform with non-coherent DMA. Platforms with non-coherent DMA may also
+require a mechanism by which cache lines can be flushed and/or
+invalidated. Such mechanisms will be device-specific and/or standardized
+in a future extension to the ISA.
+
+The Linux mappings for release operations may seem stronger than
+necessary, but these mappings are needed to cover some cases in which
+Linux requires stronger orderings than the more intuitive mappings would
+provide. In particular, as of the time this text is being written, Linux
+is actively debating whether to require load-load, load-store, and
+store-store orderings between accesses in one critical section and
+accesses in a subsequent critical section in the same hart and protected
+by the same synchronization object. Not all combinations of
+FENCE RW,W/FENCE R,RW mappings with _aq_/_rl_ mappings combine to
+provide such orderings. There are a few ways around this problem,
+including:
+
+. Always use FENCE RW,W/FENCE R,RW, and never use _aq_/_rl_. This
+suffices but is undesirable, as it defeats the purpose of the _aq_/_rl_
+modifiers.
+. Always use _aq_/_rl_, and never use FENCE RW,W/FENCE R,RW. This does
+not currently work due to the lack of load and store opcodes with _aq_
+and _rl_ modifiers.
+. Strengthen the mappings of release operations such that they would
+enforce sufficient orderings in the presence of either type of acquire
+mapping. This is the currently recommended solution, and the one shown
+in <<linuxmappings>>.
+
+....
+RVWMO Mapping:
+(a) lw           a0, 0(s0)
+(b) fence.tso               // vs. fence rw,w
+(c) sd           x0, 0(s1)
+    ...
+loop:
+(d) amoswap.d.aq a1, t1, 0(s1)
+    bnez         a1, loop
+(e) lw           a2, 0(s2)
+....
+
+For example, the critical section ordering rule currently being debated
+by the Linux community would require (a) to be ordered before (e) in
+the example above. If that will indeed be
+required, then it would be insufficient for (b) to map as FENCE RW,W.
+That said, these mappings are subject to change as the Linux Kernel
+Memory Model evolves.
+
+[[tab:c11mappings]]
+.Mappings from C/C++ primitives to RISC-V primitives.
+[cols="<,<",options="header",]
+|===
+|C/C++ Construct |RVWMO Mapping
+|Non-atomic load |`l{b|h|w|d} `
+
+|`atomic_load(memory_order_relaxed) ` |`l{b|h|w|d} `
+
+|`atomic_load(memory_order_acquire) ` |`l{b|h|w|d}; fence r,rw `
+
+|`atomic_load(memory_order_seq_cst) `
+|`fence rw,rw; l{b|h|w|d}; fence r,rw `
+
+|Non-atomic store |`s{b|h|w|d} `
+
+|`atomic_store(memory_order_relaxed) ` |`s{b|h|w|d} `
+
+|`atomic_store(memory_order_release) ` |`fence rw,w; s{b|h|w|d} `
+
+|`atomic_store(memory_order_seq_cst) ` |`fence rw,w; s{b|h|w|d} `
+
+|`atomic_thread_fence(memory_order_acquire) ` |`fence r,rw `
+
+|`atomic_thread_fence(memory_order_release) ` |`fence rw,w `
+
+|`atomic_thread_fence(memory_order_acq_rel) ` |`fence.tso`
+
+|`atomic_thread_fence(memory_order_seq_cst) ` |`fence rw,rw `
+
+|C/C++ Construct |RVWMO AMO Mapping
+
+|`atomic_<op>(memory_order_relaxed) ` |`amo<op>.{w|d} `
+
+|`atomic_<op>(memory_order_acquire) ` |`amo<op>.{w|d}.aq `
+
+|`atomic_<op>(memory_order_release) ` |`amo<op>.{w|d}.rl `
+
+|`atomic_<op>(memory_order_acq_rel) ` |`amo<op>.{w|d}.aqrl `
+
+|`atomic_<op>(memory_order_seq_cst) ` |`amo<op>.{w|d}.aqrl `
+
+|C/C++ Construct |RVWMO LR/SC Mapping
+
+|`atomic_<op>(memory_order_relaxed)` |`loop:lr.{w|d}; <op>; sc.{w|d}; `
+
+| |`bnez loop `
+
+|`atomic_<op>(memory_order_acquire)`
+|`loop:lr.{w|d}.aq; <op>; sc.{w|d}; `
+
+| |`bnez loop `
+
+|`atomic_<op>(memory_order_release)`
+|`loop:lr.{w|d}; <op>; sc.{w|d}.rl; `
+
+| |`bnez loop `
+
+|`atomic_<op>(memory_order_acq_rel)`
+|`loop:lr.{w|d}.aq; <op>; sc.{w|d}.rl; `
+
+| |`bnez loop `
+
+|`atomic_<op>(memory_order_seq_cst)` |`loop:lr.{w|d}.aqrl; <op>; `
+
+| |`sc.{w|d}.rl; bnez loop `
+|===
+
+[[tab:c11mappings_hypothetical]]
+.Hypothetical mappings from C/C++ primitives to RISC-V primitives, if
+native load-acquire and store-release opcodes are introduced.
+[cols="<,<",options="header",]
+|===
+|C/C++ Construct |RVWMO Mapping
+|Non-atomic load |`l{b|h|w|d} `
+
+|`atomic_load(memory_order_relaxed) ` |`l{b|h|w|d} `
+
+|`atomic_load(memory_order_acquire) ` |`l{b|h|w|d}.aq `
+
+|`atomic_load(memory_order_seq_cst) ` |`l{b|h|w|d}.aq `
+
+|Non-atomic store |`s{b|h|w|d} `
+
+|`atomic_store(memory_order_relaxed) ` |`s{b|h|w|d} `
+
+|`atomic_store(memory_order_release) ` |`s{b|h|w|d}.rl `
+
+|`atomic_store(memory_order_seq_cst) ` |`s{b|h|w|d}.rl `
+
+|`atomic_thread_fence(memory_order_acquire) ` |`fence r,rw `
+
+|`atomic_thread_fence(memory_order_release) ` |`fence rw,w `
+
+|`atomic_thread_fence(memory_order_acq_rel) ` |`fence.tso`
+
+|`atomic_thread_fence(memory_order_seq_cst) ` |`fence rw,rw `
+
+|C/C++ Construct |RVWMO AMO Mapping
+
+|`atomic_<op>(memory_order_relaxed) ` |`amo<op>.{w|d} `
+
+|`atomic_<op>(memory_order_acquire) ` |`amo<op>.{w|d}.aq `
+
+|`atomic_<op>(memory_order_release) ` |`amo<op>.{w|d}.rl `
+
+|`atomic_<op>(memory_order_acq_rel) ` |`amo<op>.{w|d}.aqrl `
+
+|`atomic_<op>(memory_order_seq_cst) ` |`amo<op>.{w|d}.aqrl `
+
+|C/C++ Construct |RVWMO LR/SC Mapping
+
+|`atomic_<op>(memory_order_relaxed) ` |`lr.{w|d}; <op>; sc.{w|d} `
+
+|`atomic_<op>(memory_order_acquire) ` |`lr.{w|d}.aq; <op>; sc.{w|d} `
+
+|`atomic_<op>(memory_order_release) ` |`lr.{w|d}; <op>; sc.{w|d}.rl `
+
+|`atomic_<op>(memory_order_acq_rel) ` |`lr.{w|d}.aq; <op>; sc.{w|d}.rl `
+
+|`atomic_<op>(memory_order_seq_cst) `
+|`lr.{w|d}.aq^*; <op>; sc.{w|d}.rl `
+
+|latexmath:[$^*$] must be `lr.{w|d}.aqrl` in order to interoperate with
+code mapped per <<tab:c11mappings>> |
+|===
+
+<<tab:c11mappings>> provides a mapping of C11/C++11 atomic
+operations onto RISC-V memory instructions. If load and store opcodes
+with _aq_ and _rl_ modifiers are introduced, then the mappings in
+<<tab:c11mappings_hypothetical>> will suffice. Note however that
+the two mappings only interoperate correctly if
+`atomic_<op>(memory_order_seq_cst)` is mapped using an LR that has both
+_aq_ and _rl_ set.
+
+Any AMO can be emulated by an LR/SC pair, but care must be taken to
+ensure that any PPO orderings that originate from the LR are also made
+to originate from the SC, and that any PPO orderings that terminate at
+the SC are also made to terminate at the LR. For example, the LR must
+also be made to respect any data dependencies that the AMO has, given
+that load operations do not otherwise have any notion of a data
+dependency. Likewise, the effect of a FENCE R,R elsewhere in the same hart
+must also be made to apply to the SC, which would not otherwise respect
+that fence. The emulator may achieve this effect by simply mapping AMOs
+onto `lr.aq; <op>; sc.aqrl`, matching the mapping used elsewhere for
+fully ordered atomics.
+
+These C11/C++11 mappings require the platform to provide the following
+Physical Memory Attributes (as defined in the RISC-V Privileged ISA) for
+all memory:
+
+* main memory
+* coherent
+* AMOArithmetic
+* RsrvEventual
+
+Platforms with different attributes may require different mappings, or
+require platform-specific SW (e.g., memory-mapped I/O).
+
+=== Implementation Guidelines
+
+The RVWMO and RVTSO memory models by no means preclude
+microarchitectures from employing sophisticated speculation techniques
+or other forms of optimization in order to deliver higher performance.
+The models also do not impose any requirement to use any one particular
+cache hierarchy, nor even to use a cache coherence protocol at all.
+Instead, these models only specify the behaviors that can be exposed to
+software. Microarchitectures are free to use any pipeline design, any
+coherent or non-coherent cache hierarchy, any on-chip interconnect,
+etc., as long as the design only admits executions that satisfy the
+memory model rules. That said, to help people understand the actual
+implementations of the memory model, in this section we provide some
+guidelines on how architects and programmers should interpret the
+models’ rules.
+
+Both RVWMO and RVTSO are multi-copy atomic (or
+``other-multi-copy-atomic``): any store value that is visible to a hart
+other than the one that originally issued it must also be conceptually
+visible to all other harts in the system. In other words, harts may
+forward from their own previous stores before those stores have become
+globally visible to all harts, but no early inter-hart forwarding is
+permitted. Multi-copy atomicity may be enforced in a number of ways. It
+might hold inherently due to the physical design of the caches and store
+buffers, it may be enforced via a single-writer/multiple-reader cache
+coherence protocol, or it might hold due to some other mechanism.
+
+Although multi-copy atomicity does impose some restrictions on the
+microarchitecture, it is one of the key properties keeping the memory
+model from becoming extremely complicated. For example, a hart may not
+legally forward a value from a neighbor hart’s private store buffer
+(unless of course it is done in such a way that no new illegal behaviors
+become architecturally visible). Nor may a cache coherence protocol
+forward a value from one hart to another until the coherence protocol
+has invalidated all older copies from other caches. Of course,
+microarchitectures may (and high-performance implementations likely
+will) violate these rules under the covers through speculation or other
+optimizations, as long as any non-compliant behaviors are not exposed to
+the programmer.
+
+As a rough guideline for interpreting the PPO rules in RVWMO, we expect
+the following from the software perspective:
+
+* programmers will use PPO rules 1 and 4–8 (the same-address, fence,
+acquire, release, RCsc, and pairing rules) regularly and actively.
+* expert programmers will use PPO rules 9–11 (the syntactic dependency
+rules) to speed up critical paths of important data structures.
+* even expert programmers will rarely if ever use PPO rules 2–3 and
+12–13 directly. These are included to facilitate common
+microarchitectural optimizations (rule 2) and the operational formal
+modeling approach (rules 3 and 12–13) described in
+<<sec:operational>>. They also facilitate the process of porting code
+from other architectures that have similar rules.
+
+We also expect the following from the hardware perspective:
+
+* PPO rules 1 and 3–6 reflect well-understood rules that should pose
+few surprises to architects.
+* PPO rule 2 reflects a natural and common hardware optimization, but
+one that is very subtle and hence is worth double checking carefully.
+* PPO rule 7 may not be immediately obvious to architects, but it is a
+standard memory model requirement.
+* The load value axiom, the atomicity axiom, and PPO rules 8–13
+reflect rules that most hardware implementations will enforce
+naturally, unless they contain extreme optimizations. Of course,
+implementations should make sure to double check these rules
+nevertheless. Hardware must also ensure that syntactic dependencies
+are not ``optimized away``.
+
+Architectures are free to implement any of the memory model rules as
+conservatively as they choose. For example, a hardware implementation
+may choose to do any or all of the following:
+
+* interpret all fences as if they were FENCE RW,RW (or FENCE IORW,IORW,
+if I/O is involved), regardless of the bits actually set
+* implement all fences with PW and SR as if they were FENCE RW,RW (or
+FENCE IORW,IORW, if I/O is involved), as PW with SR is the most
+expensive of the four possible main memory ordering components anyway
+* emulate _aq_ and _rl_ as described in <<sec:memory:porting>>
+* enforce all same-address load-load ordering, even in the presence of
+patterns such as ``fri-rfi`` and ``RSW``
+* forbid any forwarding of a value from a store in the store buffer to a
+subsequent AMO or LR to the same address
+* forbid any forwarding of a value from an AMO or SC in the store buffer
+to a subsequent load to the same address
+* implement TSO on all memory accesses, and ignore any main memory
+fences that do not include PW and SR ordering (e.g., as Ztso
+implementations will do)
+* implement all atomics to be RCsc or even fully ordered, regardless of
+annotation
+
+Architectures that implement RVTSO can safely do the following:
+
+* Ignore all fences that do not have both PW and SR (unless the fence
+also orders I/O)
+* Ignore all PPO rules except for rules 4 through 7, since the rest are
+redundant with other PPO rules under RVTSO assumptions
+
+Other general notes:
+
+* Silent stores (i.e., stores that write the same value that already
+exists at a memory location) behave like any other store from a memory
+model point of view. Likewise, AMOs which do not actually change the
+value in memory (e.g., an AMOMAX for which the value in _rs2_ is smaller
+than the value currently in memory) are still semantically considered
+store operations. Microarchitectures that attempt to implement silent
+stores must take care to ensure that the memory model is still obeyed,
+particularly in cases such as RSW (<<sec:memory:overlap>>)
+which tend to be incompatible with silent stores.
+* Writes may be merged (i.e., two consecutive writes to the same address
+may be merged) or subsumed (i.e., the earlier of two back-to-back writes
+to the same address may be elided) as long as the resulting behavior
+does not otherwise violate the memory model semantics.
+
+The question of write subsumption can be understood from the following
+example:
+
+[cols="^,<,^,<",options="header",]
+|===
+|Hart 0 | |Hart 1 |
+| |li t1, 3 | |li t3, 2
+| |li t2, 1 | |
+|(a) |sw t1,0(s0) |(d) |lw a0,0(s1)
+|(b) |fence w, w |(e) |sw a0,0(s0)
+|(c) |sw t2,0(s1) |(f) |sw t3,0(s0)
+|===
+
+As written, if the load (d) reads value latexmath:[$1$], then (a) must
+precede (f) in the global memory order:
+
+* {empty}(a) precedes (c) in the global memory order because of rule 4
+* {empty}(c) precedes (d) in the global memory order because of the
+load value axiom
+* {empty}(d) precedes (e) in the global memory order because of rule 10
+* {empty}(e) precedes (f) in the global memory order because of rule 1
+
+In other words the final value of the memory location whose address is
+in `s0` must be latexmath:[$2$] (the value written by the store (f)) and
+cannot be latexmath:[$3$] (the value written by the store (a)).
+
+A very aggressive microarchitecture might erroneously decide to discard
+(e), as (f) supersedes it, and this may in turn lead the
+microarchitecture to break the now-eliminated dependency between (d) and
+(f) (and hence also between (a) and (f)). This would violate the memory
+model rules, and hence it is forbidden. Write subsumption may in other
+cases be legal, if for example there were no data dependency between (d)
+and (e).
+
+==== Possible Future Extensions
+
+We expect that any or all of the following possible future extensions
+would be compatible with the RVWMO memory model:
+
+* 'V' vector ISA extensions
+* 'J' JIT extension
+* Native encodings for load and store opcodes with _aq_ and _rl_ set
+* Fences limited to certain addresses
+* Cache writeback/flush/invalidate/etc. instructions
+
+[[discrepancies]]
+=== Known Issues
+
+[[mixedrsw]]
+==== Mixed-size RSW
+
+.Mixed-size RSW variant 1
+[cols="^,<,^,<",options="header",]
+|===
+|Hart 0 | |Hart 1 |
+| |li t1, 1 | |li t1, 1
+|(a) |lw a0,0(s0) |(d) |lw a1,0(s1)
+|(b) |fence rw,rw |(e) |amoswap.w.rl a2,t1,0(s2)
+|(c) |sw t1,0(s1) |(f) |ld a3,0(s2)
+| | |(g) |lw a4,4(s2)
+| | | |xor a5,a4,a4
+| | | |add s0,s0,a5
+| | |(h) |sw a2,0(s0)
+|Outcome: `a0=1`, `a1=1`, `a2=0`, `a3=1`, `a4=0` | | |
+|===
+
+.Mixed-size RSW variant 2
+[cols="^,<,^,<",options="header",]
+|===
+|Hart 0 | |Hart 1 |
+| |li t1, 1 | |li t1, 1
+|(a) |lw a0,0(s0) |(d) |ld a1,0(s1)
+|(b) |fence rw,rw |(e) |lw a2,4(s1)
+|(c) |sw t1,0(s1) | |xor a3,a2,a2
+| | | |add s0,s0,a3
+| | |(f) |sw a2,0(s0)
+|Outcome: `a0=0`, `a1=1`, `a2=0` | | |
+|===
+
+.Mixed-size RSW variant 3
+[cols="^,<,^,<",options="header",]
+|===
+|Hart 0 | |Hart 1 |
+| |li t1, 1 | |li t1, 1
+|(a) |lw a0,0(s0) |(d) |sw t1,4(s1)
+|(b) |fence rw,rw |(e) |ld a1,0(s1)
+|(c) |sw t1,0(s1) |(f) |lw a2,4(s1)
+| | | |xor a3,a2,a2
+| | | |add s0,s0,a3
+| | |(g) |sw a2,0(s0)
+|Outcome: `a0=1`, `a1=0x100000001`, `a2=1` | | |
+|===
+
+There is a known discrepancy between the operational and axiomatic
+specifications within the family of mixed-size RSW variants shown in
+the three litmus tests above.
+To address this, we may choose to add something like the following new
+PPO rule: Memory operation latexmath:[$a$] precedes memory operation
+latexmath:[$b$] in preserved program order (and hence also in the global
+memory order) if latexmath:[$a$] precedes latexmath:[$b$] in program
+order, latexmath:[$a$] and latexmath:[$b$] both access regular main
+memory (rather than I/O regions), latexmath:[$a$] is a load,
+latexmath:[$b$] is a store, there is a load latexmath:[$m$] between
+latexmath:[$a$] and latexmath:[$b$], there is a byte latexmath:[$x$]
+that both latexmath:[$a$] and latexmath:[$m$] read, there is no store
+between latexmath:[$a$] and latexmath:[$m$] that writes to
+latexmath:[$x$], and latexmath:[$m$] precedes latexmath:[$b$] in PPO. In
+other words, in herd syntax, we may choose to add
+`(po-loc & rsw);ppo;[W]` to PPO. Many implementations will already
+enforce this ordering naturally. As such, even though this rule is not
+official, we recommend that implementers enforce it nevertheless in
+order to ensure forwards compatibility with the possible future addition
+of this rule to RVWMO.
diff --git a/src/mm-formal.adoc b/src/mm-formal.adoc
new file mode 100644
index 0000000..feea627
--- /dev/null
+++ b/src/mm-formal.adoc
@@ -0,0 +1,1412 @@
+[appendix]
+== Formal Memory Model Specifications, Version 0.1
+[[mm-formal]]
+
+To facilitate formal analysis of RVWMO, this chapter presents a set of
+formalizations using different tools and modeling approaches. Any
+discrepancies are unintended; the expectation is that the models
+describe exactly the same sets of legal behaviors.
+
+This appendix should be treated as commentary; all normative material is
+provided in <<memorymodel>> and in the rest of
+the main body of the ISA specification. All currently known
+discrepancies are listed in
+<<discrepancies>>. Any other
+discrepancies are unintentional.
+
+[[alloy]]
+=== Formal Axiomatic Specification in Alloy
+
+We present a formal specification of the RVWMO memory model in Alloy
+(http://alloy.mit.edu). This model is available online at
+https://github.com/daniellustig/riscv-memory-model.
+
+The online material also contains some litmus tests and some examples of
+how Alloy can be used to model check some of the mappings in
+<<sec:memory:porting>>.
+
+....
+////////////////////////////////////////////////////////////////////////////////
+// =RVWMO PPO=
+
+// Preserved Program Order
+fun ppo : Event->Event {
+ // same-address ordering
+ po_loc :> Store
+ + rdw
+ + (AMO + StoreConditional) <: rfi
+
+ // explicit synchronization
+ + ppo_fence
+ + Acquire <: ^po :> MemoryEvent
+ + MemoryEvent <: ^po :> Release
+ + RCsc <: ^po :> RCsc
+ + pair
+
+ // syntactic dependencies
+ + addrdep
+ + datadep
+ + ctrldep :> Store
+
+ // pipeline dependencies
+ + (addrdep+datadep).rfi
+ + addrdep.^po :> Store
+}
+
+// the global memory order respects preserved program order
+fact { ppo in ^gmo }
+....
+
+....
+////////////////////////////////////////////////////////////////////////////////
+// =RVWMO axioms=
+
+// Load Value Axiom
+fun candidates[r: MemoryEvent] : set MemoryEvent {
+ (r.~^gmo & Store & same_addr[r]) // writes preceding r in gmo
+ + (r.^~po & Store & same_addr[r]) // writes preceding r in po
+}
+
+fun latest_among[s: set Event] : Event { s - s.~^gmo }
+
+pred LoadValue {
+ all w: Store | all r: Load |
+ w->r in rf <=> w = latest_among[candidates[r]]
+}
+
+// Atomicity Axiom
+pred Atomicity {
+ all r: Store.~pair | // starting from the lr,
+ no x: Store & same_addr[r] | // there is no store x to the same addr
+ x not in same_hart[r] // such that x is from a different hart,
+ and x in r.~rf.^gmo // x follows (the store r reads from) in gmo,
+ and r.pair in x.^gmo // and r follows x in gmo
+}
+
+// Progress Axiom implicit: Alloy only considers finite executions
+
+pred RISCV_mm { LoadValue and Atomicity /* and Progress */ }
+....
+
+....
+////////////////////////////////////////////////////////////////////////////////
+// Basic model of memory
+
+sig Hart { // hardware thread
+ start : one Event
+}
+sig Address {}
+abstract sig Event {
+ po: lone Event // program order
+}
+
+abstract sig MemoryEvent extends Event {
+ address: one Address,
+ acquireRCpc: lone MemoryEvent,
+ acquireRCsc: lone MemoryEvent,
+ releaseRCpc: lone MemoryEvent,
+ releaseRCsc: lone MemoryEvent,
+ addrdep: set MemoryEvent,
+ ctrldep: set Event,
+ datadep: set MemoryEvent,
+ gmo: set MemoryEvent, // global memory order
+ rf: set MemoryEvent
+}
+sig LoadNormal extends MemoryEvent {} // l{b|h|w|d}
+sig LoadReserve extends MemoryEvent { // lr
+ pair: lone StoreConditional
+}
+sig StoreNormal extends MemoryEvent {} // s{b|h|w|d}
+// all StoreConditionals in the model are assumed to be successful
+sig StoreConditional extends MemoryEvent {} // sc
+sig AMO extends MemoryEvent {} // amo
+sig NOP extends Event {}
+
+fun Load : Event { LoadNormal + LoadReserve + AMO }
+fun Store : Event { StoreNormal + StoreConditional + AMO }
+
+sig Fence extends Event {
+ pr: lone Fence, // opcode bit
+ pw: lone Fence, // opcode bit
+ sr: lone Fence, // opcode bit
+ sw: lone Fence // opcode bit
+}
+sig FenceTSO extends Fence {}
+
+/* Alloy encoding detail: opcode bits are either set (encoded, e.g.,
+ * as f.pr in iden) or unset (f.pr not in iden). The bits cannot be used for
+ * anything else */
+fact { pr + pw + sr + sw in iden }
+// likewise for ordering annotations
+fact { acquireRCpc + acquireRCsc + releaseRCpc + releaseRCsc in iden }
+// don't try to encode FenceTSO via pr/pw/sr/sw; just use it as-is
+fact { no FenceTSO.(pr + pw + sr + sw) }
+....
+
+
+....
+////////////////////////////////////////////////////////////////////////////////
+// =Basic model rules=
+
+// Ordering annotation groups
+fun Acquire : MemoryEvent { MemoryEvent.acquireRCpc + MemoryEvent.acquireRCsc }
+fun Release : MemoryEvent { MemoryEvent.releaseRCpc + MemoryEvent.releaseRCsc }
+fun RCpc : MemoryEvent { MemoryEvent.acquireRCpc + MemoryEvent.releaseRCpc }
+fun RCsc : MemoryEvent { MemoryEvent.acquireRCsc + MemoryEvent.releaseRCsc }
+
+// There is no such thing as store-acquire or load-release, unless it's both
+fact { Load & Release in Acquire }
+fact { Store & Acquire in Release }
+
+// FENCE PPO
+fun FencePRSR : Fence { Fence.(pr & sr) }
+fun FencePRSW : Fence { Fence.(pr & sw) }
+fun FencePWSR : Fence { Fence.(pw & sr) }
+fun FencePWSW : Fence { Fence.(pw & sw) }
+
+fun ppo_fence : MemoryEvent->MemoryEvent {
+ (Load <: ^po :> FencePRSR).(^po :> Load)
+ + (Load <: ^po :> FencePRSW).(^po :> Store)
+ + (Store <: ^po :> FencePWSR).(^po :> Load)
+ + (Store <: ^po :> FencePWSW).(^po :> Store)
+ + (Load <: ^po :> FenceTSO) .(^po :> MemoryEvent)
+ + (Store <: ^po :> FenceTSO) .(^po :> Store)
+}
+
+// auxiliary definitions
+fun po_loc : Event->Event { ^po & address.~address }
+fun same_hart[e: Event] : set Event { e + e.^~po + e.^po }
+fun same_addr[e: Event] : set Event { e.address.~address }
+
+// initial stores
+fun NonInit : set Event { Hart.start.*po }
+fun Init : set Event { Event - NonInit }
+fact { Init in StoreNormal }
+fact { Init->(MemoryEvent & NonInit) in ^gmo }
+fact { all e: NonInit | one e.*~po.~start } // each event is in exactly one hart
+fact { all a: Address | one Init & a.~address } // one init store per address
+fact { no Init <: po and no po :> Init }
+....
+
+
+....
+// po
+fact { acyclic[po] }
+
+// gmo
+fact { total[^gmo, MemoryEvent] } // gmo is a total order over all MemoryEvents
+
+//rf
+fact { rf.~rf in iden } // each read returns the value of only one write
+fact { rf in Store <: address.~address :> Load }
+fun rfi : MemoryEvent->MemoryEvent { rf & (*po + *~po) }
+
+//dep
+fact { no StoreNormal <: (addrdep + ctrldep + datadep) }
+fact { addrdep + ctrldep + datadep + pair in ^po }
+fact { datadep in datadep :> Store }
+fact { ctrldep.*po in ctrldep }
+fact { no pair & (^po :> (LoadReserve + StoreConditional)).^po }
+fact { StoreConditional in LoadReserve.pair } // assume all SCs succeed
+
+// rdw
+fun rdw : Event->Event {
+ (Load <: po_loc :> Load) // start with all same_address load-load pairs,
+ - (~rf.rf) // subtract pairs that read from the same store,
+ - (po_loc.rfi) // and subtract out "fri-rfi" patterns
+}
+
+// filter out redundant instances and/or visualizations
+fact { no gmo & gmo.gmo } // keep the visualization uncluttered
+fact { all a: Address | some a.~address }
+
+////////////////////////////////////////////////////////////////////////////////
+// =Optional: opcode encoding restrictions=
+
+// the list of blessed fences
+fact { Fence in
+ Fence.pr.sr
+ + Fence.pw.sw
+ + Fence.pr.pw.sw
+ + Fence.pr.sr.sw
+ + FenceTSO
+ + Fence.pr.pw.sr.sw
+}
+
+pred restrict_to_current_encodings {
+ no (LoadNormal + StoreNormal) & (Acquire + Release)
+}
+
+////////////////////////////////////////////////////////////////////////////////
+// =Alloy shortcuts=
+pred acyclic[rel: Event->Event] { no iden & ^rel }
+pred total[rel: Event->Event, bag: Event] {
+ all disj e, e': bag | e->e' in rel + ~rel
+ acyclic[rel]
+}
+....
+
+[[sec:herd]]
+=== Formal Axiomatic Specification in Herd
+
+The tool [.sans-serif]#herd# takes a memory model and a litmus test as
+input and simulates the execution of the test on top of the memory
+model. Memory models are written in the domain specific language Cat.
+This section provides two Cat memory models of RVWMO. The first model,
+Figure #fig:herd2[[fig:herd2]], follows the _global memory order_
+definition of RVWMO (Chapter #ch:memorymodel[[ch:memorymodel]]) as
+closely as is possible for a Cat model. The second model,
+Figure #fig:herd3[[fig:herd3]], is an equivalent, more efficient model
+of RVWMO based on partial orders.
+
+The simulator [.sans-serif]#herd# is part of the [.sans-serif]#diy# tool
+suite — see http://diy.inria.fr for software and documentation. The
+models and more are available online
+at http://diy.inria.fr/cats7/riscv/.
+
+
+....
+(*************)
+(* Utilities *)
+(*************)
+
+(* All fence relations *)
+let fence.r.r = [R];fencerel(Fence.r.r);[R]
+let fence.r.w = [R];fencerel(Fence.r.w);[W]
+let fence.r.rw = [R];fencerel(Fence.r.rw);[M]
+let fence.w.r = [W];fencerel(Fence.w.r);[R]
+let fence.w.w = [W];fencerel(Fence.w.w);[W]
+let fence.w.rw = [W];fencerel(Fence.w.rw);[M]
+let fence.rw.r = [M];fencerel(Fence.rw.r);[R]
+let fence.rw.w = [M];fencerel(Fence.rw.w);[W]
+let fence.rw.rw = [M];fencerel(Fence.rw.rw);[M]
+let fence.tso =
+ let f = fencerel(Fence.tso) in
+ ([W];f;[W]) | ([R];f;[M])
+
+let fence =
+ fence.r.r | fence.r.w | fence.r.rw |
+ fence.w.r | fence.w.w | fence.w.rw |
+ fence.rw.r | fence.rw.w | fence.rw.rw |
+ fence.tso
+
+(* Same address, no W to the same address in-between *)
+let po-loc-no-w = po-loc \ (po-loc?;[W];po-loc)
+(* Read same write *)
+let rsw = rf^-1;rf
+(* Acquire, or stronger *)
+let AQ = Acq|AcqRel
+(* Release or stronger *)
+and RL = Rel|AcqRel
+(* All RCsc *)
+let RCsc = Acq|Rel|AcqRel
+(* Amo events are both R and W, relation rmw relates paired lr/sc *)
+let AMO = R & W
+let StCond = range(rmw)
+
+(*************)
+(* ppo rules *)
+(*************)
+
+(* Overlapping-Address Orderings *)
+let r1 = [M];po-loc;[W]
+and r2 = ([R];po-loc-no-w;[R]) \ rsw
+and r3 = [AMO|StCond];rfi;[R]
+(* Explicit Synchronization *)
+and r4 = fence
+and r5 = [AQ];po;[M]
+and r6 = [M];po;[RL]
+and r7 = [RCsc];po;[RCsc]
+and r8 = rmw
+(* Syntactic Dependencies *)
+and r9 = [M];addr;[M]
+and r10 = [M];data;[W]
+and r11 = [M];ctrl;[W]
+(* Pipeline Dependencies *)
+and r12 = [R];(addr|data);[W];rfi;[R]
+and r13 = [R];addr;[M];po;[W]
+
+let ppo = r1 | r2 | r3 | r4 | r5 | r6 | r7 | r8 | r9 | r10 | r11 | r12 | r13
+....
+
+
+....
+Total
+
+(* Notice that herd has defined its own rf relation *)
+
+(* Define ppo *)
+include "riscv-defs.cat"
+
+(********************************)
+(* Generate global memory order *)
+(********************************)
+
+let gmo0 = (* precursor: i.e., build gmo as a total order that includes gmo0 *)
+ loc & (W\FW) * FW | # Final write after any write to the same location
+ ppo | # ppo compatible
+ rfe # includes herd external rf (optimization)
+
+(* Walk over all linear extensions of gmo0 *)
+with gmo from linearizations(M\IW,gmo0)
+
+(* Add initial writes upfront -- convenient for computing rfGMO *)
+let gmo = gmo | loc & IW * (M\IW)
+
+(**********)
+(* Axioms *)
+(**********)
+
+(* Compute rf according to the load value axiom, aka rfGMO *)
+let WR = loc & ([W];(gmo|po);[R])
+let rfGMO = WR \ (loc&([W];gmo);WR)
+
+(* Check equality of herd rf and of rfGMO *)
+empty (rf\rfGMO)|(rfGMO\rf) as RfCons
+
+(* Atomicity axiom *)
+let infloc = (gmo & loc)^-1
+let inflocext = infloc & ext
+let winside = (infloc;rmw;inflocext) & (infloc;rf;rmw;inflocext) & [W]
+empty winside as Atomic
+....
+
+
+....
+Partial
+
+(***************)
+(* Definitions *)
+(***************)
+
+(* Define ppo *)
+include "riscv-defs.cat"
+
+(* Compute coherence relation *)
+include "cos-opt.cat"
+
+(**********)
+(* Axioms *)
+(**********)
+
+(* Sc per location *)
+acyclic co|rf|fr|po-loc as Coherence
+
+(* Main model axiom *)
+acyclic co|rfe|fr|ppo as Model
+
+(* Atomicity axiom *)
+empty rmw & (fre;coe) as Atomic
+....
+
+[[sec:operational]]
+=== An Operational Memory Model
+
+This is an alternative presentation of the RVWMO memory model in
+operational style. It aims to admit exactly the same extensional
+behavior as the axiomatic presentation: for any given program, admitting
+an execution if and only if the axiomatic presentation allows it.
+
+The axiomatic presentation is defined as a predicate on complete
+candidate executions. In contrast, this operational presentation has an
+abstract microarchitectural flavor: it is expressed as a state machine,
+with states that are an abstract representation of hardware machine
+states, and with explicit out-of-order and speculative execution (but
+abstracting from more implementation-specific microarchitectural details
+such as register renaming, store buffers, cache hierarchies, cache
+protocols, etc.). As such, it can provide useful intuition. It can also
+construct executions incrementally, making it possible to interactively
+and randomly explore the behavior of larger examples, while the
+axiomatic model requires complete candidate executions over which the
+axioms can be checked.
+
+The operational presentation covers mixed-size execution, with
+potentially overlapping memory accesses of different power-of-two byte
+sizes. Misaligned accesses are broken up into single-byte accesses.
+
+The operational model, together with a fragment of the RISC-V ISA
+semantics (RV64I and A), are integrated into the `rmem` exploration tool
+(https://github.com/rems-project/rmem). `rmem` can explore litmus tests
+(see #sec:litmustests[[sec:litmustests]]) and small ELF binaries
+exhaustively, pseudo-randomly and interactively. In `rmem`, the ISA
+semantics is expressed explicitly in Sail (see
+https://github.com/rems-project/sail for the Sail language, and
+https://github.com/rems-project/sail-riscv for the RISC-V ISA model),
+and the concurrency semantics is expressed in Lem (see
+https://github.com/rems-project/lem for the Lem language).
+
+`rmem` has a command-line interface and a web-interface. The
+web-interface runs entirely on the client side, and is provided online
+together with a library of litmus tests:
+http://www.cl.cam.ac.uk/~pes20/rmem. The command-line interface is
+faster than the web-interface, especially in exhaustive mode.
+
+Below is an informal introduction of the model states and transitions.
+The description of the formal model starts in the next subsection.
+
+Terminology: In contrast to the axiomatic presentation, here every
+memory operation is either a load or a store. Hence, AMOs give rise to
+two distinct memory operations, a load and a store. When used in
+conjunction with "instruction", the terms "load" and "store" refer
+to instructions that give rise to such memory operations; as such, both
+include AMO instructions. The term "acquire" refers to an instruction
+(or its memory operation) with the acquire-RCpc or acquire-RCsc
+annotation. The term "release" refers to an instruction (or its memory
+operation) with the release-RCpc or release-RCsc annotation.
+
+==== Model states
+
+A model state consists of a shared memory and a tuple of hart states.
+
+[cols="^,^,^",]
+|===
+|Hart latexmath:[$1$] |*…* |Hart latexmath:[$n$]
+
+|latexmath:[$\big\uparrow$] latexmath:[$\big\downarrow$] |
+|latexmath:[$\big\uparrow$] latexmath:[$\big\downarrow$]
+
+|Shared Memory | |
+|===
+
+The shared memory state records all the memory store operations that
+have propagated so far, in the order they propagated (this can be made
+more efficient, but for simplicity of the presentation we keep it this
+way).
+
+Each hart state consists principally of a tree of instruction instances,
+some of which have been _finished_, and some of which have not.
+Non-finished instruction instances can be subject to _restart_, e.g. if
+they depend on an out-of-order or speculative load that turns out to be
+unsound.
+
+Conditional branch and indirect jump instructions may have multiple
+successors in the instruction tree. When such an instruction is
+finished, any untaken alternative paths are discarded.
+
+Each instruction instance in the instruction tree has a state that
+includes an execution state of the intra-instruction semantics (the ISA
+pseudocode for this instruction). The model uses a formalization of the
+intra-instruction semantics in Sail. One can think of the execution
+state of an instruction as a representation of the pseudocode control
+state, pseudocode call stack, and local variable values. An instruction
+instance state also includes information about the instance’s memory and
+register footprints, its register reads and writes, its memory
+operations, whether it is finished, etc.
+
+==== Model transitions
+
+The model defines, for any model state, the set of allowed transitions,
+each of which is a single atomic step to a new abstract machine state.
+Execution of a single instruction will typically involve many
+transitions, and they may be interleaved in operational-model execution
+with transitions arising from other instructions. Each transition arises
+from a single instruction instance; it will change the state of that
+instance, and it may depend on or change the rest of its hart state and
+the shared memory state, but it does not depend on other hart states,
+and it will not change them. The transitions are introduced below and
+defined in Section #sec:omm:transitions[1.3.5], with a precondition and
+a construction of the post-transition model state for each.
+
+Transitions for all instructions:
+
+* : This transition represents a fetch and decode of a new instruction
+instance, as a program order successor of a previously fetched
+instruction instance (or the initial fetch address).
++
+The model assumes the instruction memory is fixed; it does not describe
+the behavior of self-modifying code. In particular, the transition does
+not generate memory load operations, and the shared memory is not
+involved in the transition. Instead, the model depends on an external
+oracle that provides an opcode when given a memory location.
+* : This is a write of a register value.
+* : This is a read of a register value from the most recent
+program-order-predecessor instruction instance that writes to that
+register.
+* : This covers pseudocode internal computation: arithmetic, function
+calls, etc.
+* : At this point the instruction pseudocode is done, the instruction
+cannot be restarted, memory accesses cannot be discarded, and all memory
+effects have taken place. For conditional branch and indirect jump
+instructions, any program order successors that were fetched from an
+address that is not the one that was written to the _pc_ register are
+discarded, together with the sub-tree of instruction instances below
+them.
+
+Transitions specific to load instructions:
+
+* : At this point the memory footprint of the load instruction is
+provisionally known (it could change if earlier instructions are
+restarted) and its individual memory load operations can start being
+satisfied.
+* : This partially or entirely satisfies a single memory load operation
+by forwarding, from program-order-previous memory store operations.
+* : This entirely satisfies the outstanding slices of a single memory
+load operation, from memory.
+* : At this point all the memory load operations of the instruction have
+been entirely satisfied and the instruction pseudocode can continue
+executing. A load instruction can be subject to being restarted until
+the transition. But, under some conditions, the model might treat a load
+instruction as non-restartable even before it is finished (e.g. see ).
+
+Transitions specific to store instructions:
+
+* : At this point the memory footprint of the store is provisionally
+known.
+* : At this point the memory store operations have their values and
+program-order-successor memory load operations can be satisfied by
+forwarding from them.
+* : At this point the store operations are guaranteed to happen (the
+instruction can no longer be restarted or discarded), and they can start
+being propagated to memory.
+* : This propagates a single memory store operation to memory.
+* : At this point all the memory store operations of the instruction
+have been propagated to memory, and the instruction pseudocode can
+continue executing.
+
+Transitions specific to `sc` instructions:
+
+* : This causes the `sc` to fail, either a spontaneous fail or because
+it is not paired with a program-order-previous `lr`.
+* : This transition indicates the `sc` is paired with an `lr` and might
+succeed.
+* : This is an atomic execution of the transitions and ; it is enabled
+only if the stores from which the `lr` read have not been
+overwritten.
+* : This causes the `sc` to fail, either a spontaneous fail or because
+the stores from which the `lr` read have been overwritten.
+
+Transitions specific to AMO instructions:
+
+* : This is an atomic execution of all the transitions needed to satisfy
+the load operation, do the required arithmetic, and propagate the store
+operation.
+
+Transitions specific to fence instructions:
+
+* {blank}
+
+The transitions labeled latexmath:[$\circ$] can always be taken eagerly,
+as soon as their precondition is satisfied, without excluding other
+behavior; those labeled latexmath:[$\bullet$] cannot. Although is marked
+with a latexmath:[$\bullet$], it can be taken eagerly as long as it is
+not taken infinitely many times.
+
+An instance of a non-AMO load instruction, after being fetched, will
+typically experience the following transitions in this order:
+
+. {blank}
+. {blank}
+. and/or (as many as needed to satisfy all the load operations of the
+instance)
+. {blank}
+. {blank}
+. {blank}
+
+Before, between and after the transitions above, any number of
+transitions may appear. In addition, a transition for fetching the
+instruction in the next program location will be available until it is
+taken.
+
+This concludes the informal description of the operational model. The
+following sections describe the formal operational model.
+
+[[sec:omm:pseudocode_exec]]
+==== Intra-instruction Pseudocode Execution
+
+The intra-instruction semantics for each instruction instance is
+expressed as a state machine, essentially running the instruction
+pseudocode. Given a pseudocode execution state, it computes the next
+state. Most states identify a pending memory or register operation,
+requested by the pseudocode, which the memory model has to do. The
+states are (this is a tagged union; tags in small-caps):
+
+[cols="<,<",]
+|===
+|Load_mem(_kind_, _address_, _size_, _load_continuation_) |memory load
+operation
+
+|Early_sc_fail(_res_continuation_) |allow `sc` to fail early
+
+|Store_ea(_kind_, _address_, _size_, _next_state_) |memory store
+effective address
+
+|Store_memv(_mem_value_, _store_continuation_) |memory store value
+
+|Fence(_kind_, _next_state_) |fence
+
+|Read_reg(_reg_name_, _read_continuation_) |register read
+
+|Write_reg(_reg_name_, _reg_value_, _next_state_) |register write
+
+|Internal(_next_state_) |pseudocode internal step
+
+|Done |end of pseudocode
+|===
+
+Here:
+
+_mem_value_ and _reg_value_ are lists of bytes;
+
+_address_ is an integer of XLEN bits;
+
+for load/store, _kind_ identifies whether it is `lr/sc`,
+acquire-RCpc/release-RCpc, acquire-RCsc/release-RCsc,
+acquire-release-RCsc;
+
+for fences, _kind_ identifies whether it is a normal or TSO fence, and
+(for normal fences) the predecessor and successor ordering bits;
+
+_reg_name_ identifies a register and a slice thereof (start and end bit
+indices); and
+
+the continuations describe how the instruction instance will continue
+for each value that might be provided by the surrounding memory model
+(the _load_continuation_ and _read_continuation_ take the value loaded
+from memory and read from the previous register write, the
+_store_continuation_ takes _false_ for an `sc` that failed and _true_ in
+all other cases, and _res_continuation_ takes _false_ if the `sc` fails
+and _true_ otherwise).
+
+For example, given the load instruction `lw x1,0(x2)`, an execution will
+typically go as follows. The initial execution state will be computed
+from the pseudocode for the given opcode. This can be expected to be
+Read_reg(`x2`, _read_continuation_). Feeding the most recently written
+value of register `x2` (the instruction semantics will be blocked if
+necessary until the register value is available), say `0x4000`, to
+_read_continuation_ returns Load_mem(`plain_load`, `0x4000`, `4`,
+_load_continuation_). Feeding the 4-byte value loaded from memory
+location `0x4000`, say `0x42`, to _load_continuation_ returns
+Write_reg(`x1`, `0x42`, Done). Many Internal(_next_state_) states may
+appear before and between the states above.
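The continuation-passing style of these execution states can be sketched in Python. The tags, closures, and driver loop below are illustrative stand-ins for the Sail semantics used by `rmem`, not the actual implementation; the loaded value is abbreviated as a single integer for simplicity.

```python
# Minimal sketch of the intra-instruction state machine for `lw x1,0(x2)`.
# Tags mirror the tagged union above; continuations are Python closures.

def lw_x1_0_x2():
    """Initial pseudocode state for `lw x1,0(x2)` (hypothetical encoding)."""
    def read_cont(x2_value):
        address = x2_value + 0  # base register value plus immediate offset
        def load_cont(mem_value):
            return ("Write_reg", "x1", mem_value, ("Done",))
        return ("Load_mem", "plain_load", address, 4, load_cont)
    return ("Read_reg", "x2", read_cont)

def run(state, regs, memory):
    """Drive the state machine, supplying register/memory values on demand."""
    while True:
        tag = state[0]
        if tag == "Read_reg":
            _, name, cont = state
            state = cont(regs[name])        # feed the most recent reg value
        elif tag == "Load_mem":
            _, kind, addr, size, cont = state
            state = cont(memory[addr])      # feed the value loaded from memory
        elif tag == "Write_reg":
            _, name, value, next_state = state
            regs[name] = value
            state = next_state
        elif tag == "Done":
            return regs

regs = run(lw_x1_0_x2(), {"x2": 0x4000}, {0x4000: 0x42})
```

Running this with `x2` holding `0x4000` and memory location `0x4000` holding `0x42` reproduces the walk described above, ending with `0x42` written to `x1`.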
+
+Notice that writing to memory is split into two steps, Store_ea and
+Store_memv: the first one makes the memory footprint of the store
+provisionally known, and the second one adds the value to be stored. We
+ensure these are paired in the pseudocode (Store_ea followed by
+Store_memv), but there may be other steps between them.
+
+It is observable that the Store_ea can occur before the value to be
+stored is determined. For example, for the litmus test
+LB+fence.r.rw+data-po to be allowed by the operational model (as it is
+by RVWMO), the first store in Hart 1 has to take the Store_ea step
+before its value is determined, so that the second store can see it is
+to a non-overlapping memory footprint, allowing the second store to be
+committed out of order without violating coherence.
+
+The pseudocode of each instruction performs at most one store or one
+load, except for AMOs that perform exactly one load and one store. Those
+memory accesses are then split apart into the architecturally atomic
+units by the hart semantics (see and below).
+
+Informally, each bit of a register read should be satisfied from a
+register write by the most recent (in program order) instruction
+instance that can write that bit (or from the hart’s initial register
+state if there is no such write). Hence, it is essential to know the
+register write footprint of each instruction instance, which we
+calculate when the instruction instance is created (see the action of
+below). We ensure in the pseudocode that each instruction does at most
+one register write to each register bit, and also that it does not try
+to read a register value it just wrote.
+
+Data-flow dependencies (address and data) in the model emerge from the
+fact that each register read has to wait for the appropriate register
+write to be executed (as described above).
+
+[[sec:omm:inst_state]]
+==== Instruction Instance State
+
+Each instruction instance latexmath:[$i$] has a state comprising:
+
+* _program_loc_, the memory address from which the instruction was
+fetched;
+* _instruction_kind_, identifying whether this is a load, store, AMO,
+fence, branch/jump, or a "simple" instruction (this also includes a
+_kind_ similar to the one described for the pseudocode execution
+states);
+* _src_regs_, the set of source _reg_name_s (including system
+registers), as statically determined from the pseudocode of the
+instruction;
+* _dst_regs_, the destination _reg_name_s (including system registers),
+as statically determined from the pseudocode of the instruction;
+* _pseudocode_state_ (or sometimes just "state" for short), one of (this
+is a tagged union; tags in small-caps):
++
+[cols="<,<",]
+|===
+|Plain(_isa_state_) |ready to make a pseudocode transition
+
+|Pending_mem_loads(_load_continuation_) |requesting memory load
+operation(s)
+
+|Pending_mem_stores(_store_continuation_) |requesting memory store
+operation(s)
+|===
+* _reg_reads_, the register reads the instance has performed, including,
+for each one, the register write slices it read from;
+* _reg_writes_, the register writes the instance has performed;
+* _mem_loads_, a set of memory load operations, and for each one the
+as-yet-unsatisfied slices (the byte indices that have not been satisfied
+yet), and, for the satisfied slices, the store slices (each consisting
+of a memory store operation and subset of its byte indices) that
+satisfied it.
+* _mem_stores_, a set of memory store operations, and for each one a
+flag that indicates whether it has been propagated (passed to the shared
+memory) or not.
+* information recording whether the instance is committed, finished,
+etc.
+
+Each memory load operation includes a memory footprint (address and
+size). Each memory store operation includes a memory footprint, and,
+when available, a value.
+
+A load instruction instance with a non-empty _mem_loads_, for which all
+the load operations are satisfied (i.e., there are no unsatisfied load
+slices), is said to be _entirely satisfied_.
+
+Informally, an instruction instance is said to have _fully determined
+data_ if the load (and `sc`) instructions feeding its source registers
+are finished. Similarly, it is said to have a _fully determined memory
+footprint_ if the load (and `sc`) instructions feeding its memory
+operation address register are finished. Formally, we first define the
+notion of _fully determined register write_: a register write
+latexmath:[$w$] from _reg_writes_ of instruction instance
+latexmath:[$i$] is said to be _fully determined_ if one of the following
+conditions hold:
+
+. latexmath:[$i$] is finished; or
+. the value written by latexmath:[$w$] is not affected by a memory
+operation that latexmath:[$i$] has made (i.e. a value loaded from memory
+or the result of `sc`), and, for every register read that
+latexmath:[$i$] has made, that affects latexmath:[$w$], the register
+write from which latexmath:[$i$] read is fully determined (or
+latexmath:[$i$] read from the initial register state).
+
+Now, an instruction instance latexmath:[$i$] is said to have _fully
+determined data_ if for every register read latexmath:[$r$] from
+_reg_reads_, the register writes that latexmath:[$r$] reads from are
+fully determined. An instruction instance latexmath:[$i$] is said to
+have a _fully determined memory footprint_ if for every register read
+latexmath:[$r$] from _reg_reads_ that feeds into latexmath:[$i$]’s
+memory operation address, the register writes that latexmath:[$r$] reads
+from are fully determined.
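As a sketch, the recursive check for fully determined register writes might look as follows. The dictionary fields are hypothetical names for the two conditions above, not types from the formal model: `finished` is condition 1, `from_memory` says whether the written value is affected by a memory operation of the instance, and `reads_from` holds, for each register read affecting the write, the write it read from (`None` standing for the initial register state).

```python
# Sketch of the "fully determined register write" definition.
# Field names are illustrative, not from the formal model.

def fully_determined(w):
    if w["finished"]:                      # condition 1
        return True
    if w["from_memory"]:                   # value depends on a load/sc result
        return False
    return all(src is None or fully_determined(src)   # condition 2, recursively
               for src in w["reads_from"])

# An unfinished arithmetic write over the initial register state is fully
# determined; an unfinished write of a loaded value is not, and neither is
# a write that reads from it.
init_add = {"finished": False, "from_memory": False, "reads_from": [None]}
pending_load = {"finished": False, "from_memory": True, "reads_from": []}
dependent = {"finished": False, "from_memory": False,
             "reads_from": [pending_load]}
```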
+
+The `rmem` tool records, for every register write, the set of register
+writes from other instructions that have been read by this instruction
+at the point of performing the write. By carefully arranging the
+pseudocode of the instructions covered by the tool, we were able to
+ensure that this is exactly the set of register writes on which the
+write depends.
+
+==== Hart State
+
+The model state of a single hart comprises:
+
+* _hart_id_, a unique identifier of the hart;
+* _initial_register_state_, the initial register value for each
+register;
+* _initial_fetch_address_, the initial instruction fetch address;
+* _instruction_tree_, a tree of the instruction instances that have been
+fetched (and not discarded), in program order.
+
+==== Shared Memory State
+
+The model state of the shared memory comprises a list of memory store
+operations, in the order they propagated to the shared memory.
+
+When a store operation is propagated to the shared memory it is simply
+added to the end of the list. When a load operation is satisfied from
+memory, for each byte of the load operation, the most recent
+corresponding store slice is returned.
+
+For most purposes, it is simpler to think of the shared memory as an
+array, i.e., a map from memory locations to memory store operation
+slices, where each memory location is mapped to a one-byte slice of the
+most recent memory store operation to that location. However, this
+abstraction is not detailed enough to properly handle the `sc`
+instruction. RVWMO allows store operations from the same hart as the
+`sc` to intervene between the store operation of the `sc` and the store
+operations the paired `lr` read from. To allow such store operations to
+intervene, and forbid others, the array abstraction must be extended to
+record more information. Here we use a list because it is simple, but a
+more efficient and scalable implementation would probably use
+something better.
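A minimal sketch of this list-based shared memory, with per-byte load satisfaction from the most recently propagated overlapping store, might look like this (the class and method names are illustrative, not from `rmem`):

```python
# Shared memory as an ordered list of propagated memory store operations.
# A store operation covers the byte range [address, address + len(value)).
# Satisfying a load returns, for each byte of its footprint, the slice of
# the most recently propagated store covering that byte.

class SharedMemory:
    def __init__(self):
        self.stores = []  # in propagation order, oldest first

    def propagate(self, address, value_bytes):
        self.stores.append((address, value_bytes))

    def satisfy_load(self, address, size):
        result = [None] * size
        for base, value in self.stores:   # later stores overwrite earlier ones
            for i, b in enumerate(value):
                off = base + i - address
                if 0 <= off < size:
                    result[off] = b
        return bytes(result)

mem = SharedMemory()
mem.propagate(0x1000, b"\x00\x00\x00\x00")  # initial store to the footprint
mem.propagate(0x1000, b"\xaa\xbb")          # later, partially overlapping store
loaded = mem.satisfy_load(0x1000, 4)
```

The 4-byte load sees the two bytes of the later store and the remaining bytes of the initial store, matching the per-byte "most recent store slice" rule.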
+
+[[sec:omm:transitions]]
+==== Transitions
+
+Each of the paragraphs below describes a single kind of system
+transition. The description starts with a condition over the current
+system state. The transition can be taken in the current state only if
+the condition is satisfied. The condition is followed by an action that
+is applied to that state when the transition is taken, in order to
+generate the new system state.
+
+[[omm:fetch]]
+===== Fetch instruction
+
+A possible program-order-successor of instruction instance
+latexmath:[$i$] can be fetched from address _loc_ if:
+
+. it has not already been fetched, i.e., none of the immediate
+successors of latexmath:[$i$] in the hart’s _instruction_tree_ are from
+_loc_; and
+. if latexmath:[$i$]’s pseudocode has already written an address to
+_pc_, then _loc_ must be that address, otherwise _loc_ is:
+* for a conditional branch, the successor address or the branch target
+address;
+* for a (direct) jump and link instruction (`jal`), the target address;
+* for an indirect jump instruction (`jalr`), any address; and
+* for any other instruction, latexmath:[$i.\textit{program\_loc}+4$].
+
+Action: construct a freshly initialized instruction instance
+latexmath:[$i'$] for the instruction in the program memory at _loc_,
+with state Plain(_isa_state_), computed from the instruction pseudocode,
+including the static information available from the pseudocode such as
+its _instruction_kind_, _src_regs_, and _dst_regs_, and add
+latexmath:[$i'$] to the hart’s _instruction_tree_ as a successor of
+latexmath:[$i$].
+
+The possible next fetch addresses (_loc_) are available immediately
+after fetching latexmath:[$i$] and the model does not need to wait for
+the pseudocode to write to _pc_; this allows out-of-order execution, and
+speculation past conditional branches and jumps. For most instructions
+these addresses are easily obtained from the instruction pseudocode. The
+only exception to that is the indirect jump instruction (`jalr`), where
+the address depends on the value held in a register. In principle the
+mathematical model should allow speculation to arbitrary addresses here.
+The exhaustive search in the `rmem` tool handles this by running the
+exhaustive search multiple times with a growing set of possible next
+fetch addresses for each indirect jump. The initial search uses empty
+sets, hence there is no fetch after indirect jump instruction until the
+pseudocode of the instruction writes to _pc_, and then we use that value
+for fetching the next instruction. Before starting the next iteration of
+exhaustive search, we collect for each indirect jump (grouped by code
+location) the set of values it wrote to _pc_ in all the executions in
+the previous search iteration, and use that as possible next fetch
+addresses of the instruction. This process terminates when no new fetch
+addresses are detected.
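The iterative search described above can be sketched as a fixed-point computation. Here `explore` is a hypothetical stand-in for one exhaustive search run: given the set of next-fetch addresses currently allowed after an indirect jump, it returns the set of values that jump wrote to _pc_ across all explored executions.

```python
# Sketch of the growing-set search for indirect-jump fetch addresses.
# `explore` and the toy address values below are illustrative placeholders.

def fixed_point_fetch_addresses(explore):
    targets = set()  # initial search: no fetch past the indirect jump
    while True:
        written = explore(frozenset(targets))
        if written <= targets:   # no new fetch addresses detected: done
            return targets
        targets |= written       # grow the set and search again

# Toy oracle: the jump can reach 0x80 directly, and once speculation to
# 0x80 is allowed, an execution writing 0x90 to pc also becomes visible.
def toy_explore(targets):
    return {0x80} | ({0x90} if 0x80 in targets else set())

result = fixed_point_fetch_addresses(toy_explore)
```

With the toy oracle, the search converges after two growth steps, illustrating how later iterations can uncover targets that only arise once earlier speculation is permitted.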
+
+[[omm:initiate_load]]
+===== Initiate memory load operations
+
+An instruction instance latexmath:[$i$] in state Plain(Load_mem(_kind_,
+_address_, _size_, _load_continuation_)) can always initiate the
+corresponding memory load operations. Action:
+
+. Construct the appropriate memory load operations latexmath:[$mlos$]:
+* if _address_ is aligned to _size_ then latexmath:[$mlos$] is a single
+memory load operation of _size_ bytes from _address_;
+* otherwise, latexmath:[$mlos$] is a set of _size_ memory load
+operations, each of one byte, from the addresses
+latexmath:[$\textit{address}\ldots\textit{address}+\textit{size}-1$].
+. set _mem_loads_ of latexmath:[$i$] to latexmath:[$mlos$]; and
+. update the state of latexmath:[$i$] to
+Pending_mem_loads(_load_continuation_).
+
+In Section #sec:rvwmo:primitives[[sec:rvwmo:primitives]] it is said that
+misaligned memory accesses may be decomposed at any granularity. Here we
+decompose them to one-byte accesses as this granularity subsumes all
+others.
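The construction of latexmath:[$mlos$] above can be sketched in Python, representing each memory load operation as an (address, size) pair (the function name is illustrative):

```python
# Sketch of the load decomposition rule: an access aligned to its size
# becomes one load operation of that size; a misaligned access becomes
# `size` single-byte operations at consecutive addresses.

def memory_load_operations(address, size):
    if address % size == 0:
        return [(address, size)]
    return [(address + i, 1) for i in range(size)]

aligned = memory_load_operations(0x1000, 4)     # one 4-byte operation
misaligned = memory_load_operations(0x1001, 4)  # four 1-byte operations
```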
+
+[[omm:sat_by_forwarding]]
+===== Satisfy memory load operation by forwarding from unpropagated stores
+
+For a non-AMO load instruction instance latexmath:[$i$] in state
+Pending_mem_loads(_load_continuation_), and a memory load operation
+latexmath:[$mlo$] in latexmath:[$i.\textit{mem\_loads}$] that has
+unsatisfied slices, the memory load operation can be partially or
+entirely satisfied by forwarding from unpropagated memory store
+operations by store instruction instances that are program-order-before
+latexmath:[$i$] if:
+
+. all program-order-previous `fence` instructions with `.sr` and `.pw`
+set are finished;
+. for every program-order-previous `fence` instruction, latexmath:[$f$],
+with `.sr` and `.pr` set, and `.pw` not set, if latexmath:[$f$] is not
+finished then all load instructions that are program-order-before
+latexmath:[$f$] are entirely satisfied;
+. for every program-order-previous `fence.tso` instruction,
+latexmath:[$f$], that is not finished, all load instructions that are
+program-order-before latexmath:[$f$] are entirely satisfied;
+. if latexmath:[$i$] is a load-acquire-RCsc, all program-order-previous
+store-releases-RCsc are finished;
+. if latexmath:[$i$] is a load-acquire-release, all
+program-order-previous instructions are finished;
+. all non-finished program-order-previous load-acquire instructions are
+entirely satisfied; and
+. all program-order-previous store-acquire-release instructions are
+finished.
+
+Let latexmath:[$msoss$] be the set of all unpropagated memory store
+operation slices from non-`sc` store instruction instances that are
+program-order-before latexmath:[$i$] and have already calculated the
+value to be stored, that overlap with the unsatisfied slices of
+latexmath:[$mlo$], and which are not superseded by intervening store
+operations or store operations that are read from by an intervening
+load. The last condition requires, for each memory store operation slice
+latexmath:[$msos$] in latexmath:[$msoss$] from instruction
+latexmath:[$i'$]:
+
+* that there is no store instruction program-order-between latexmath:[$i$]
+and latexmath:[$i'$] with a memory store operation overlapping
+latexmath:[$msos$]; and
+
+* that there is no load instruction program-order-between latexmath:[$i$]
+and latexmath:[$i'$] that was satisfied from an overlapping memory store
+operation slice from a different hart.
+
+Action:
+
+. update latexmath:[$i.\textit{mem\_loads}$] to indicate that
+latexmath:[$mlo$] was satisfied by latexmath:[$msoss$]; and
+. restart any speculative instructions which have violated coherence as
+a result of this, i.e., for every non-finished instruction
+latexmath:[$i'$] that is a program-order-successor of latexmath:[$i$],
+and every memory load operation latexmath:[$mlo'$] of latexmath:[$i'$]
+that was satisfied from latexmath:[$msoss'$], if there exists a memory
+store operation slice latexmath:[$msos'$] in latexmath:[$msoss'$], and
+an overlapping memory store operation slice from a different memory
+store operation in latexmath:[$msoss$], and latexmath:[$msos'$] is not
+from an instruction that is a program-order-successor of
+latexmath:[$i$], restart latexmath:[$i'$] and its _restart-dependents_.
+
+Where the _restart-dependents_ of instruction latexmath:[$j$] are:
+
+* program-order-successors of latexmath:[$j$] that have data-flow
+dependency on a register write of latexmath:[$j$];
+
+* program-order-successors of latexmath:[$j$] that have a memory load
+operation that reads from a memory store operation of latexmath:[$j$]
+(by forwarding);
+
+* if latexmath:[$j$] is a load-acquire, all the program-order-successors
+of latexmath:[$j$];
+
+* if latexmath:[$j$] is a load, for every `fence`, latexmath:[$f$], with
+`.sr` and `.pr` set, and `.pw` not set, that is a
+program-order-successor of latexmath:[$j$], all the load instructions
+that are program-order-successors of latexmath:[$f$];
+
+* if latexmath:[$j$] is a load, for every `fence.tso`, latexmath:[$f$],
+that is a program-order-successor of latexmath:[$j$], all the load
+instructions that are program-order-successors of latexmath:[$f$]; and
+
+* (recursively) all the restart-dependents of all the instruction
+instances above.
+
+Forwarding memory store operations to a memory load might satisfy only
+some slices of the load, leaving other slices unsatisfied.
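A minimal sketch of this partial satisfaction, with the load footprint and forwarded store slices modeled as sets of byte addresses (all names here are illustrative, not from the model):

```python
def forward_from_stores(load_bytes, store_data):
    """Satisfy a load's byte slices by forwarding.

    load_bytes: set of byte addresses the load still needs.
    store_data: dict mapping byte address -> forwarded byte value
    (from unpropagated store operations).
    Returns (satisfied, unsatisfied) sets of byte addresses."""
    satisfied = {a for a in load_bytes if a in store_data}
    return satisfied, load_bytes - satisfied
```

Any bytes left in the unsatisfied set must later be satisfied by further forwarding or by reading from memory.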
+
+A program-order-previous store operation that was not available when
+taking the transition above might make latexmath:[$msoss$] provisionally
+unsound (violating coherence) when it becomes available. That store will
+prevent the load from being finished (see <<omm:finish>>), and will
+cause it to restart when that store operation is propagated (see
+<<omm:prop_store>>).
+
+A consequence of the transition condition above is that
+store-release-RCsc memory store operations cannot be forwarded to
+load-acquire-RCsc instructions: latexmath:[$msoss$] does not include
+memory store operations from finished stores (as those must be
+propagated memory store operations), and the condition above requires
+all program-order-previous store-releases-RCsc to be finished when the
+load is acquire-RCsc.
+
+[[omm:sat_from_mem]]
+===== Satisfy memory load operation from memory
+
+For an instruction instance latexmath:[$i$] of a non-AMO load
+instruction or an AMO instruction in the context of the <<omm:do_amo>>
+transition, any memory load operation latexmath:[$mlo$] in
+latexmath:[$i.\textit{mem\_loads}$] that has unsatisfied slices can be
+satisfied from memory if all the conditions of
+<<omm:sat_by_forwarding>> are satisfied. Action: let
+latexmath:[$msoss$] be the memory store operation slices from memory
+covering the unsatisfied slices of latexmath:[$mlo$], and apply the
+action of <<omm:sat_by_forwarding>>.
+
+Note that <<omm:sat_by_forwarding>> might leave some slices of the
+memory load operation unsatisfied; those will have to be satisfied by
+taking that transition again, or by taking <<omm:sat_from_mem>>.
+<<omm:sat_from_mem>>, on the other hand, will always satisfy all the
+unsatisfied slices of the memory load operation.
+
+[[omm:complete_loads]]
+===== Complete load operations
+
+A load instruction instance latexmath:[$i$] in state
+Pending_mem_loads(_load_continuation_) can be completed (not to be
+confused with finished) if all the memory load operations
+latexmath:[$i.\textit{mem\_loads}$] are entirely satisfied (i.e. there
+are no unsatisfied slices). Action: update the state of latexmath:[$i$]
+to Plain(_load_continuation(mem_value)_), where _mem_value_ is assembled
+from all the memory store operation slices that satisfied
+latexmath:[$i.\textit{mem\_loads}$].
+
+[[omm:early_sc_fail]]
+===== Early `sc` fail
+
+An `sc` instruction instance latexmath:[$i$] in state
+Plain(Early_sc_fail(_res_continuation_)) can always be made to fail.
+Action: update the state of latexmath:[$i$] to
+Plain(_res_continuation(false)_).
+
+[[omm:paired_sc]]
+===== Paired `sc`
+
+An `sc` instruction instance latexmath:[$i$] in state
+Plain(Early_sc_fail(_res_continuation_)) can continue its (potentially
+successful) execution if latexmath:[$i$] is paired with an `lr`. Action:
+update the state of latexmath:[$i$] to Plain(_res_continuation(true)_).
+
+[[omm:initiate_store_footprint]]
+===== Initiate memory store operation footprints
+
+An instruction instance latexmath:[$i$] in state Plain(Store_ea(_kind_,
+_address_, _size_, _next_state_)) can always announce its pending memory
+store operation footprint. Action:
+
+. construct the appropriate memory store operations latexmath:[$msos$]
+(without the store value):
+* if _address_ is aligned to _size_ then latexmath:[$msos$] is a single
+memory store operation of _size_ bytes to _address_;
+* otherwise, latexmath:[$msos$] is a set of _size_ memory store
+operations, each of one-byte size, to the addresses
+latexmath:[$\textit{address}\ldots\textit{address}+\textit{size}-1$].
+. set latexmath:[$i.\textit{mem\_stores}$] to latexmath:[$msos$]; and
+. update the state of latexmath:[$i$] to Plain(_next_state_).
+
+Note that after taking the transition above the memory store operations
+do not yet have their values. The importance of splitting this
+transition from the transition below is that it allows other
+program-order-successor store instructions to observe the memory
+footprint of this instruction, and if they don’t overlap, propagate out
+of order as early as possible (i.e. before the data register value
+becomes available).
+
+[[omm:instantiate_store_value]]
+===== Instantiate memory store operation values
+
+An instruction instance latexmath:[$i$] in state
+Plain(Store_memv(_mem_value_, _store_continuation_)) can always
+instantiate the values of the memory store operations
+latexmath:[$i.\textit{mem\_stores}$]. Action:
+
+. split _mem_value_ between the memory store operations
+latexmath:[$i.\textit{mem\_stores}$]; and
+. update the state of latexmath:[$i$] to
+Pending_mem_stores(_store_continuation_).
+
+[[omm:commit_stores]]
+===== Commit store instruction
+
+An uncommitted instruction instance latexmath:[$i$] of a non-`sc` store
+instruction, or an `sc` instruction in the context of the <<omm:commit_sc>>
+transition, in state Pending_mem_stores(_store_continuation_), can be
+committed (not to be confused with propagated) if:
+
+. latexmath:[$i$] has fully determined data;
+. all program-order-previous conditional branch and indirect jump
+instructions are finished;
+. all program-order-previous `fence` instructions with `.sw` set are
+finished;
+. all program-order-previous `fence.tso` instructions are finished;
+. all program-order-previous load-acquire instructions are finished;
+. all program-order-previous store-acquire-release instructions are
+finished;
+. if latexmath:[$i$] is a store-release, all program-order-previous
+instructions are finished;
+. [[omm:commit_store:prev_addrs]]all program-order-previous memory
+access instructions have a fully determined memory footprint;
+. [[omm:commit_store:prev_stores]]all program-order-previous store
+instructions, except for `sc` that failed, have initiated and so have
+non-empty _mem_stores_; and
+. [[omm:commit_store:prev_loads]]all program-order-previous load
+instructions have initiated and so have non-empty _mem_loads_.
+
+Action: record that latexmath:[$i$] is committed.
+
+Notice that if condition <<omm:commit_store:prev_addrs>> is satisfied,
+conditions <<omm:commit_store:prev_stores>> and
+<<omm:commit_store:prev_loads>> are also satisfied, or will be
+satisfied after taking some eager transitions. Hence, requiring them
+does not strengthen the model. By requiring them, we guarantee that
+previous memory access instructions have taken enough transitions to
+make their memory operations visible for the condition check of
+<<omm:prop_store>>, which is the next transition the instruction will
+take, making that condition simpler.
+
+[[omm:prop_store]]
+===== Propagate store operation
+
+For a committed instruction instance latexmath:[$i$] in state
+Pending_mem_stores(_store_continuation_), and an unpropagated memory
+store operation latexmath:[$mso$] in
+latexmath:[$i.\textit{mem\_stores}$], latexmath:[$mso$] can be
+propagated if:
+
+. all memory store operations of program-order-previous store
+instructions that overlap with latexmath:[$mso$] have already
+propagated;
+. all memory load operations of program-order-previous load instructions
+that overlap with latexmath:[$mso$] have already been satisfied, and
+(the load instructions) are _non-restartable_ (see definition below);
+and
+. all memory load operations that were satisfied by forwarding
+latexmath:[$mso$] are entirely satisfied.
+
+Where a non-finished instruction instance latexmath:[$j$] is
+_non-restartable_ if:
+
+. there does not exist a store instruction latexmath:[$s$] and an
+unpropagated memory store operation latexmath:[$mso$] of latexmath:[$s$]
+such that applying the action of the <<omm:prop_store>> transition to
+latexmath:[$mso$] will result in the restart of latexmath:[$j$]; and
+. there does not exist a non-finished load instruction latexmath:[$l$]
+and a memory load operation latexmath:[$mlo$] of latexmath:[$l$] such
+that applying the action of the
+<<omm:sat_by_forwarding>>/<<omm:sat_from_mem>> transition (even if
+latexmath:[$mlo$] is already satisfied) to latexmath:[$mlo$] will result
+in the restart of latexmath:[$j$].
+
+Action:
+
+. update the shared memory state with latexmath:[$mso$];
+. update latexmath:[$i.\textit{mem\_stores}$] to indicate that
+latexmath:[$mso$] was propagated; and
+. restart any speculative instructions which have violated coherence as
+a result of this, i.e., for every non-finished instruction
+latexmath:[$i'$] program-order-after latexmath:[$i$] and every memory
+load operation latexmath:[$mlo'$] of latexmath:[$i'$] that was satisfied
+from latexmath:[$msoss'$], if there exists a memory store operation
+slice latexmath:[$msos'$] in latexmath:[$msoss'$] that overlaps with
+latexmath:[$mso$] and is not from latexmath:[$mso$], and
+latexmath:[$msos'$] is not from a program-order-successor of
+latexmath:[$i$], restart latexmath:[$i'$] and its _restart-dependents_
+(see <<omm:sat_by_forwarding>>).
+
+[[omm:commit_sc]]
+===== Commit and propagate store operation of an `sc`
+
+An uncommitted `sc` instruction instance latexmath:[$i$], from hart
+latexmath:[$h$], in state Pending_mem_stores(_store_continuation_), with
+a paired `lr` latexmath:[$i'$] that has been satisfied by some store
+slices latexmath:[$msoss$], can be committed and propagated at the same
+time if:
+
+. latexmath:[$i'$] is finished;
+. every memory store operation that has been forwarded to
+latexmath:[$i'$] is propagated;
+. the conditions of <<omm:commit_stores>> are satisfied;
+. the conditions of <<omm:prop_store>> are satisfied (notice that an
+`sc` instruction can only have one memory store operation); and
+. for every store slice latexmath:[$msos$] from latexmath:[$msoss$],
+latexmath:[$msos$] has not been overwritten, in the shared memory, by a
+store that is from a hart that is not latexmath:[$h$], at any point
+since latexmath:[$msos$] was propagated to memory.
+
+Action:
+
+. apply the actions of <<omm:commit_stores>>; and
+. apply the action of <<omm:prop_store>>.
+
+[[omm:late_sc_fail]]
+===== Late `sc` fail
+
+An `sc` instruction instance latexmath:[$i$] in state
+Pending_mem_stores(_store_continuation_), that has not propagated its
+memory store operation, can always be made to fail. Action:
+
+. clear latexmath:[$i.\textit{mem\_stores}$]; and
+. update the state of latexmath:[$i$] to
+Plain(_store_continuation(false)_).
+
+For efficiency, the `rmem` tool allows this transition only when it is
+not possible to take the <<omm:commit_sc>> transition. This does not
+affect the set of allowed final states, but when explored
+interactively, if the `sc` should fail one should use the
+<<omm:early_sc_fail>> transition instead of waiting for this
+transition.
+
+[[omm:complete_stores]]
+===== Complete store operations
+
+A store instruction instance latexmath:[$i$] in state
+Pending_mem_stores(_store_continuation_), for which all the memory store
+operations in latexmath:[$i.\textit{mem\_stores}$] have been propagated,
+can always be completed (not to be confused with finished). Action:
+update the state of latexmath:[$i$] to
+Plain(_store_continuation(true)_).
+
+[[omm:do_amo]]
+===== Satisfy, commit and propagate operations of an AMO
+
+An AMO instruction instance latexmath:[$i$] in state
+Pending_mem_loads(_load_continuation_) can perform its memory access if
+it is possible to perform the following sequence of transitions with no
+intervening transitions:
+
+. <<omm:sat_from_mem>>
+. <<omm:complete_loads>>
+. <<omm:sail_interp>> (zero or more times)
+. <<omm:initiate_store_footprint>>
+. <<omm:instantiate_store_value>>
+. <<omm:commit_stores>>
+. <<omm:prop_store>>
+. <<omm:complete_stores>>
+
+and in addition, the condition of <<omm:finish>>, with the exception of
+not requiring latexmath:[$i$] to be in state Plain(Done), holds after
+those transitions. Action: perform the above sequence of transitions
+(this does not include <<omm:finish>>), one after the other, with no
+intervening transitions.
+
+Notice that program-order-previous stores cannot be forwarded to the
+load of an AMO. This is simply because the sequence of transitions above
+does not include the forwarding transition. But even if it did include
+it, the sequence would fail when trying to take the <<omm:prop_store>>
+transition, as that transition requires all overlapping memory store
+operations of program-order-previous store instructions to be
+propagated, while forwarding requires the store operation to be
+unpropagated.
+
+In addition, the store of an AMO cannot be forwarded to a
+program-order-successor load. Before taking the transition above, the
+store operation of the AMO does not have its value and therefore cannot
+be forwarded; after taking the transition above the store operation is
+propagated and therefore cannot be forwarded.
+
+[[omm:commit_fence]]
+===== Commit fence
+
+A fence instruction instance latexmath:[$i$] in state
+Plain(Fence(_kind_, _next_state_)) can be committed if:
+
+. if latexmath:[$i$] is a normal fence and it has `.pr` set, all
+program-order-previous load instructions are finished;
+. if latexmath:[$i$] is a normal fence and it has `.pw` set, all
+program-order-previous store instructions are finished; and
+. if latexmath:[$i$] is a `fence.tso`, all program-order-previous load
+and store instructions are finished.
+
+Action:
+
+. record that latexmath:[$i$] is committed; and
+. update the state of latexmath:[$i$] to Plain(_next_state_).
+
+[[omm:reg_read]]
+===== Register read
+
+An instruction instance latexmath:[$i$] in state
+Plain(Read_reg(_reg_name_, _read_cont_)) can do a register read of
+_reg_name_ if every instruction instance that it needs to read from has
+already performed the expected _reg_name_ register write.
+
+Let _read_sources_ include, for each bit of _reg_name_, the write to
+that bit by the most recent (in program order) instruction instance that
+can write to that bit, if any. If there is no such instruction, the
+source is the initial register value from _initial_register_state_. Let
+_reg_value_ be the value assembled from _read_sources_. Action:
+
+. add _reg_name_ to latexmath:[$i.\textit{reg\_reads}$] with
+_read_sources_ and _reg_value_; and
+. update the state of latexmath:[$i$] to Plain(_read_cont(reg_value)_).
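The per-bit selection of _read_sources_ can be sketched as follows (a simplified illustration, not the model itself; program order is modeled as an integer index and register values as per-bit dictionaries):

```python
def read_sources(bits, writes, initial):
    """For each register bit, take the most recent (in program order)
    earlier write covering that bit; fall back to the initial register
    state if no earlier instruction writes the bit.

    writes: list of (po_index, bits_written, bit_values) tuples for
    all program-order-previous register writes."""
    sources = {}
    for b in bits:
        covering = [(po, vals[b]) for po, wbits, vals in writes if b in wbits]
        # max() picks the write with the highest program-order index.
        sources[b] = max(covering)[1] if covering else initial[b]
    return sources
```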
+
+[[omm:reg_write]]
+===== Register write
+
+An instruction instance latexmath:[$i$] in state
+Plain(Write_reg(_reg_name_, _reg_value_, _next_state_)) can always do a
+_reg_name_ register write. Action:
+
+. add _reg_name_ to latexmath:[$i.\textit{reg\_writes}$] with
+latexmath:[$deps$] and _reg_value_; and
+. update the state of latexmath:[$i$] to Plain(_next_state_).
+
+where latexmath:[$deps$] is a pair of the set of all _read_sources_ from
+latexmath:[$i.\textit{reg\_reads}$], and a flag that is true iff
+latexmath:[$i$] is a load instruction instance that has already been
+entirely satisfied.
+
+[[omm:sail_interp]]
+===== Pseudocode internal step
+
+An instruction instance latexmath:[$i$] in state
+Plain(Internal(_next_state_)) can always do that pseudocode-internal
+step. Action: update the state of latexmath:[$i$] to
+Plain(_next_state_).
+
+[[omm:finish]]
+===== Finish instruction
+
+A non-finished instruction instance latexmath:[$i$] in state Plain(Done)
+can be finished if:
+
+. if latexmath:[$i$] is a load instruction:
+.. all program-order-previous load-acquire instructions are finished;
+.. all program-order-previous `fence` instructions with `.sr` set are
+finished;
+.. for every program-order-previous `fence.tso` instruction,
+latexmath:[$f$], that is not finished, all load instructions that are
+program-order-before latexmath:[$f$] are finished; and
+.. it is guaranteed that the values read by the memory load operations
+of latexmath:[$i$] will not cause coherence violations, i.e., for any
+program-order-previous instruction instance latexmath:[$i'$], let
+latexmath:[$\textit{cfp}$] be the combined footprint of propagated
+memory store operations from store instructions program-order-between
+latexmath:[$i$] and latexmath:[$i'$], and _fixed memory store
+operations_ that were forwarded to latexmath:[$i$] from store
+instructions program-order-between latexmath:[$i$] and latexmath:[$i'$]
+including latexmath:[$i'$], and let
+latexmath:[$\overline{\textit{cfp}}$] be the complement of
+latexmath:[$\textit{cfp}$] in the memory footprint of latexmath:[$i$].
+If latexmath:[$\overline{\textit{cfp}}$] is not empty:
+... latexmath:[$i'$] has a fully determined memory footprint;
+... latexmath:[$i'$] has no unpropagated memory store operations that
+overlap with latexmath:[$\overline{\textit{cfp}}$]; and
+... if latexmath:[$i'$] is a load with a memory footprint that overlaps
+with latexmath:[$\overline{\textit{cfp}}$], then all the memory load
+operations of latexmath:[$i'$] that overlap with
+latexmath:[$\overline{\textit{cfp}}$] are satisfied and latexmath:[$i'$]
+is _non-restartable_ (see the <<omm:prop_store>> transition for how to
+determine if an instruction is non-restartable).
+instruction is non-restartable).
++
+Here, a memory store operation is called fixed if the store instruction
+has fully determined data.
+. latexmath:[$i$] has fully determined data; and
+. if latexmath:[$i$] is not a fence, all program-order-previous
+conditional branch and indirect jump instructions are finished.
+
+Action:
+
+. if latexmath:[$i$] is a conditional branch or indirect jump
+instruction, discard any untaken paths of execution, i.e., remove all
+instruction instances that are not reachable by the branch/jump taken in
+_instruction_tree_; and
+. record the instruction as finished, i.e., set _finished_ to _true_.
+
+[[sec:omm:limitations]]
+==== Limitations
+
+* The model covers user-level RV64I and RV64A. In particular, it does
+not support the misaligned atomics extension `Zam` or the total store
+ordering extension `Ztso`. It should be trivial to adapt the model to
+RV32I/A and to the G, Q and C extensions, but we have never tried it.
+This will involve, mostly, writing Sail code for the instructions, with
+minimal, if any, changes to the concurrency model.
+* The model covers only normal memory accesses (it does not handle I/O
+accesses).
+* The model does not cover TLB-related effects.
+* The model assumes the instruction memory is fixed. In particular, the
+fetch transition does not generate memory load operations, and the
+shared memory is not involved in that transition. Instead, the model depends on
+an external oracle that provides an opcode when given a memory location.
+* The model does not cover exceptions, traps and interrupts.
diff --git a/src/mm-herd.adoc b/src/mm-herd.adoc
new file mode 100644
index 0000000..f1c0fd8
--- /dev/null
+++ b/src/mm-herd.adoc
@@ -0,0 +1,155 @@
+[[sec:herd]]
+== Formal Axiomatic Specification in Herd
+
+The tool `herd` takes a memory model and a litmus test as input and
+simulates the execution of the test on top of the memory model. Memory
+models are written in the domain-specific language Cat. This section
+provides two Cat memory models of RVWMO. The first model, the Total
+model below, follows the _global memory order_ definition of RVWMO in
+<<ch:memorymodel>>, as much as is possible for a Cat model. The second
+model, the Partial model below, is an equivalent, more efficient,
+partial-order-based RVWMO model. Both models include the shared
+definitions in the first listing (`riscv-defs.cat`).
+
+The simulator `herd` is part of the `diy` tool suite; see
+http://diy.inria.fr for software and documentation. The models and more
+are available online at http://diy.inria.fr/cats7/riscv/.
+
+
+....
+(*************)
+(* Utilities *)
+(*************)
+
+(* All fence relations *)
+let fence.r.r = [R];fencerel(Fence.r.r);[R]
+let fence.r.w = [R];fencerel(Fence.r.w);[W]
+let fence.r.rw = [R];fencerel(Fence.r.rw);[M]
+let fence.w.r = [W];fencerel(Fence.w.r);[R]
+let fence.w.w = [W];fencerel(Fence.w.w);[W]
+let fence.w.rw = [W];fencerel(Fence.w.rw);[M]
+let fence.rw.r = [M];fencerel(Fence.rw.r);[R]
+let fence.rw.w = [M];fencerel(Fence.rw.w);[W]
+let fence.rw.rw = [M];fencerel(Fence.rw.rw);[M]
+let fence.tso =
+ let f = fencerel(Fence.tso) in
+ ([W];f;[W]) | ([R];f;[M])
+
+let fence =
+ fence.r.r | fence.r.w | fence.r.rw |
+ fence.w.r | fence.w.w | fence.w.rw |
+ fence.rw.r | fence.rw.w | fence.rw.rw |
+ fence.tso
+
+(* Same address, no W to the same address in-between *)
+let po-loc-no-w = po-loc \ (po-loc?;[W];po-loc)
+(* Read same write *)
+let rsw = rf^-1;rf
+(* Acquire, or stronger *)
+let AQ = Acq|AcqRel
+(* Release or stronger *)
+and RL = Rel|AcqRel
+(* All RCsc *)
+let RCsc = Acq|Rel|AcqRel
+(* Amo events are both R and W, relation rmw relates paired lr/sc *)
+let AMO = R & W
+let StCond = range(rmw)
+
+(*************)
+(* ppo rules *)
+(*************)
+
+(* Overlapping-Address Orderings *)
+let r1 = [M];po-loc;[W]
+and r2 = ([R];po-loc-no-w;[R]) \ rsw
+and r3 = [AMO|StCond];rfi;[R]
+(* Explicit Synchronization *)
+and r4 = fence
+and r5 = [AQ];po;[M]
+and r6 = [M];po;[RL]
+and r7 = [RCsc];po;[RCsc]
+and r8 = rmw
+(* Syntactic Dependencies *)
+and r9 = [M];addr;[M]
+and r10 = [M];data;[W]
+and r11 = [M];ctrl;[W]
+(* Pipeline Dependencies *)
+and r12 = [R];(addr|data);[W];rfi;[R]
+and r13 = [R];addr;[M];po;[W]
+
+let ppo = r1 | r2 | r3 | r4 | r5 | r6 | r7 | r8 | r9 | r10 | r11 | r12 | r13
+....
+
+
+....
+Total
+
+(* Notice that herd has defined its own rf relation *)
+
+(* Define ppo *)
+include "riscv-defs.cat"
+
+(********************************)
+(* Generate global memory order *)
+(********************************)
+
+let gmo0 = (* precursor: ie build gmo as an total order that include gmo0 *)
+ loc & (W\FW) * FW | # Final write after any write to the same location
+ ppo | # ppo compatible
+ rfe # includes herd external rf (optimization)
+
+(* Walk over all linear extensions of gmo0 *)
+with gmo from linearizations(M\IW,gmo0)
+
+(* Add initial writes upfront -- convenient for computing rfGMO *)
+let gmo = gmo | loc & IW * (M\IW)
+
+(**********)
+(* Axioms *)
+(**********)
+
+(* Compute rf according to the load value axiom, aka rfGMO *)
+let WR = loc & ([W];(gmo|po);[R])
+let rfGMO = WR \ (loc&([W];gmo);WR)
+
+(* Check equality of herd rf and of rfGMO *)
+empty (rf\rfGMO)|(rfGMO\rf) as RfCons
+
+(* Atomicity axiom *)
+let infloc = (gmo & loc)^-1
+let inflocext = infloc & ext
+let winside = (infloc;rmw;inflocext) & (infloc;rf;rmw;inflocext) & [W]
+empty winside as Atomic
+....
+
+
+....
+Partial
+
+(***************)
+(* Definitions *)
+(***************)
+
+(* Define ppo *)
+include "riscv-defs.cat"
+
+(* Compute coherence relation *)
+include "cos-opt.cat"
+
+(**********)
+(* Axioms *)
+(**********)
+
+(* Sc per location *)
+acyclic co|rf|fr|po-loc as Coherence
+
+(* Main model axiom *)
+acyclic co|rfe|fr|ppo as Model
+
+(* Atomicity axiom *)
+empty rmw & (fre;coe) as Atomic
+....
diff --git a/src/naming.adoc b/src/naming.adoc
new file mode 100644
index 0000000..2ac4465
--- /dev/null
+++ b/src/naming.adoc
@@ -0,0 +1,184 @@
+[[naming]]
+== ISA Extension Naming Conventions
+
+This chapter describes the RISC-V ISA extension naming scheme that is
+used to concisely describe the set of instructions present in a hardware
+implementation, or the set of instructions used by an application binary
+interface (ABI).
+
+The RISC-V ISA is designed to support a wide variety of implementations
+with various experimental instruction-set extensions. We have found that
+an organized naming scheme simplifies software tools and documentation.
+
+=== Case Sensitivity
+
+The ISA naming strings are case insensitive.
+
+=== Base Integer ISA
+
+RISC-V ISA strings begin with either RV32I, RV32E, RV64I, or RV128I
+indicating the supported address space size in bits for the base integer
+ISA.
+
+=== Instruction-Set Extension Names
+
+Standard ISA extensions are given a name consisting of a single letter.
+For example, the first four standard extensions to the integer bases
+are: `M` for integer multiplication and division, `A` for atomic
+memory instructions, `F` for single-precision floating-point
+instructions, and `D` for double-precision floating-point
+instructions. Any RISC-V instruction-set variant can be succinctly
+described by concatenating the base integer prefix with the names of the
+included extensions, e.g., `RV64IMAFD`.
+
+We have also defined an abbreviation `G` to represent the
+`IMAFDZicsr_Zifencei` base and extensions, as this is intended to
+represent our standard general-purpose ISA.
+
+Standard extensions to the RISC-V ISA are given other reserved letters,
+e.g., `Q` for quad-precision floating-point, or `C` for the 16-bit
+compressed instruction format.
+
+Some ISA extensions depend on the presence of other extensions, e.g.,
+`D` depends on `F` and `F` depends on `Zicsr`. These dependences
+may be implicit in the ISA name: for example, RV32IF is equivalent to
+RV32IFZicsr, and RV32ID is equivalent to RV32IFD and RV32IFDZicsr.
+
+=== Version Numbers
+
+Recognizing that instruction sets may expand or alter over time, we
+encode extension version numbers following the extension name. Version
+numbers are divided into major and minor version numbers, separated by a
+`p`. If the minor version is `0`, then `p0` can be omitted from
+the version string. Changes in major version numbers imply a loss of
+backwards compatibility, whereas changes in only the minor version
+number must be backwards-compatible. For example, the original 64-bit
+standard ISA defined in release 1.0 of this manual can be written in
+full as `RV64I1p0M1p0A1p0F1p0D1p0`, or more concisely as
+`RV64I1M1A1F1D1`.
+
+We introduced the version numbering scheme with the second release.
+Hence, we define the default version of a standard extension to be the
+version present at that time, e.g., `RV32I` is equivalent to
+`RV32I2`.
+
+=== Underscores
+
+Underscores `_` may be used to separate ISA extensions to improve
+readability and to provide disambiguation, e.g., `RV32I2_M2_A2`.
+
+Because the `P` extension for Packed SIMD can be confused for the
+decimal point in a version number, it must be preceded by an underscore
+if it follows a number. For example, `rv32i2p2` means version 2.2 of
+RV32I, whereas `rv32i2_p2` means version 2.0 of RV32I with version 2.0
+of the P extension.
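The disambiguation rule can be sketched with a small parser (`parse_isa` is a hypothetical illustration, not an official tool; it handles just the cases discussed here and defaults missing versions to 2.0 per the default-version convention):

```python
import re

def parse_isa(isa):
    """Split an ISA string into its base width and a list of
    (extension, major, minor) triples. Underscore-separated parts are
    parsed independently, so 'p' after a digit is a minor version,
    while '_p' starts the Packed-SIMD extension."""
    s = isa.lower()
    m = re.match(r"rv(\d+)", s)
    bits, body = int(m.group(1)), s[m.end():]
    exts = []
    for part in body.split("_"):
        # Multi-letter Z/S/H/X names first, then single letters.
        for em in re.finditer(r"([zshx][a-z]+|[a-y])(?:(\d+)(?:p(\d+))?)?", part):
            name, major, minor = em.groups()
            exts.append((name, int(major) if major else 2,
                         int(minor) if minor else 0))
    return bits, exts
```

For example, `rv32i2p2` parses as version 2.2 of RV32I, whereas `rv32i2_p2` parses as RV32I version 2.0 plus version 2.0 of the P extension.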
+
+=== Additional Standard Extension Names
+
+Standard extensions can also be named using a single `Z` followed by
+an alphabetical name and an optional version number. For example,
+`Zifencei` names the instruction-fetch fence extension described in
+<<zifencei>>; `Zifencei2` and
+`Zifencei2p0` name version 2.0 of same.
+
+The first letter following the `Z` conventionally indicates the most
+closely related alphabetical extension category, IMAFDQLCBKJTPV. For the
+`Zam` extension for misaligned atomics, for example, the letter `a`
+indicates the extension is related to the `A` standard extension. If
+multiple `Z` extensions are named, they should be ordered first by
+category, then alphabetically within a category—for example,
+`Zicsr_Zifencei_Zam`.
+
+Extensions with the `Z` prefix must be separated from other
+multi-letter extensions by an underscore, e.g.,
+`RV32IMACZicsr_Zifencei`.
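The ordering rule for multiple `Z` extensions can be sketched as follows (`sort_z_extensions` is an illustrative helper assuming lowercase names, not an official tool):

```python
CANONICAL = "imafdqlcbkjtpv"  # canonical category order from the text

def sort_z_extensions(names):
    """Order 'Z' extension names first by the category letter that
    follows the 'z' (canonical IMAFDQLCBKJTPV order), then
    alphabetically within a category."""
    return sorted(names, key=lambda n: (CANONICAL.index(n[1]), n))
```

Sorting `zam`, `zifencei`, and `zicsr` this way yields the order used in the example above, `Zicsr_Zifencei_Zam`.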
+
+=== Supervisor-level Instruction-Set Extensions
+
+Standard supervisor-level instruction-set extensions are defined in
+Volume II, but are named using `S` as a prefix, followed by an
+alphabetical name and an optional version number. Supervisor-level
+extensions must be separated from other multi-letter extensions by an
+underscore.
+
+Standard supervisor-level extensions should be listed after standard
+unprivileged extensions. If multiple supervisor-level extensions are
+listed, they should be ordered alphabetically.
+
+=== Hypervisor-level Instruction-Set Extensions
+
+Standard hypervisor-level instruction-set extensions are named like
+supervisor-level extensions, but beginning with the letter `H` instead
+of the letter `S`.
+
+Standard hypervisor-level extensions should be listed after standard
+lesser-privileged extensions. If multiple hypervisor-level extensions
+are listed, they should be ordered alphabetically.
+
+=== Machine-level Instruction-Set Extensions
+
+Standard machine-level instruction-set extensions are prefixed with the
+three letters `Zxm`.
+
+Standard machine-level extensions should be listed after standard
+lesser-privileged extensions. If multiple machine-level extensions are
+listed, they should be ordered alphabetically.
+
+=== Non-Standard Extension Names
+
+Non-standard extensions are named using a single `X` followed by an
+alphabetical name and an optional version number. For example,
+`Xhwacha` names the Hwacha vector-fetch ISA extension; `Xhwacha2`
+and `Xhwacha2p0` name version 2.0 of same.
+
+Non-standard extensions must be listed after all standard extensions.
+They must be separated from other multi-letter extensions by an
+underscore. For example, an ISA with non-standard extensions Argle and
+Bargle may be named `RV64IZifencei_Xargle_Xbargle`.
+
+If multiple non-standard extensions are listed, they should be ordered
+alphabetically.
+
+=== Subset Naming Convention
+
+<<isanametable>> summarizes the standardized extension names.
+
+[[isanametable]]
+.Standard ISA extension names. The table also defines the canonical
+order in which extension names must appear in the name string, with
+top-to-bottom in table indicating first-to-last in the name string,
+e.g., RV32IMACV is legal, whereas RV32IMAVC is not.
+[cols="<,^,^",options="header",]
+|===
+|Subset |Name |Implies
+|Base ISA | |
+|Integer |I |
+|Reduced Integer |E |
+|Standard Unprivileged Extensions | |
+|Integer Multiplication and Division |M |
+|Atomics |A |
+|Single-Precision Floating-Point |F |Zicsr
+|Double-Precision Floating-Point |D |F
+|General |G |IMADZifencei
+|Quad-Precision Floating-Point |Q |D
+|16-bit Compressed Instructions |C |
+|Bit Manipulation |B |
+|Cryptography Extensions |K |
+|Dynamic Languages |J |
+|Packed-SIMD Extensions |P |
+|Vector Extensions |V |
+|Control and Status Register Access |Zicsr |
+|Instruction-Fetch Fence |Zifencei |
+|Misaligned Atomics |Zam |A
+|Total Store Ordering |Ztso |
+|Standard Supervisor-Level Extensions | |
+|Supervisor-level extension `def` |Sdef |
+|Standard Hypervisor-Level Extensions | |
+|Hypervisor-level extension `ghi` |Hghi |
+|Standard Machine-Level Extensions | |
+|Machine-level extension `jkl` |Zxmjkl |
+|Non-Standard Extensions | |
+|Non-standard extension `mno` |Xmno |
+|===
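
The ordering and underscore rules above can be sketched procedurally. The following is a minimal, hypothetical helper (not part of the specification, and it ignores version-number suffixes) that splits an ISA name string into its base and extension parts:

```python
# Hypothetical sketch of ISA-string splitting; ignores version suffixes.
BASES = ("RV32I", "RV32E", "RV64I", "RV128I")

def split_isa_string(isa):
    isa = isa.upper()
    base = next((b for b in BASES if isa.startswith(b)), None)
    if base is None:
        raise ValueError("unknown base ISA")
    exts = []
    # Multi-letter extensions (Z*, S*, H*, X*) are underscore-separated;
    # single-letter extensions may run together before any underscore.
    for chunk in isa[len(base):].split("_"):
        while chunk:
            if chunk[0] in "ZSHX":
                exts.append(chunk)      # rest of chunk is one multi-letter name
                chunk = ""
            else:
                exts.append(chunk[0])   # single-letter standard extension
                chunk = chunk[1:]
    return base, exts

# split_isa_string("RV32IMACZicsr_Zifencei")
#   -> ("RV32I", ["M", "A", "C", "ZICSR", "ZIFENCEI"])
```

A real validator would additionally check the canonical ordering defined in <<isanametable>> and parse `p`-separated version numbers.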
+
diff --git a/src/p-st-ext.adoc b/src/p-st-ext.adoc
new file mode 100644
index 0000000..79a070a
--- /dev/null
+++ b/src/p-st-ext.adoc
@@ -0,0 +1,10 @@
+[[packedsimd]]
+== `P` Standard Extension for Packed-SIMD Instructions, Version 0.2
+
+Discussions at the 5th RISC-V workshop indicated a desire to drop this
+packed-SIMD proposal for floating-point registers in favor of
+standardizing on the V extension for large floating-point SIMD
+operations. However, there was interest in packed-SIMD fixed-point
+operations for use in the integer registers of small RISC-V
+implementations. A task group is working to define the new P extension.
+
diff --git a/src/q-st-ext.adoc b/src/q-st-ext.adoc
new file mode 100644
index 0000000..02122b7
--- /dev/null
+++ b/src/q-st-ext.adoc
@@ -0,0 +1,148 @@
+== `Q` Standard Extension for Quad-Precision Floating-Point, Version 2.2
+
+This chapter describes the Q standard extension for 128-bit
+quad-precision binary floating-point instructions compliant with the
+IEEE 754-2008 arithmetic standard. The quad-precision binary
+floating-point instruction-set extension is named `Q`; it depends on
+the double-precision floating-point extension D. The floating-point
+registers are now extended to hold either a single, double, or
+quad-precision floating-point value (FLEN=128). The NaN-boxing scheme
+described in <<nanboxing>> is now extended
+recursively to allow a single-precision value to be NaN-boxed inside a
+double-precision value which is itself NaN-boxed inside a quad-precision
+value.
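
The recursive NaN-boxing rule (a narrower value occupies the low bits of the register image and all bits above it are ones) can be illustrated with raw bit manipulation. This is a sketch with hypothetical helper names, not spec pseudocode:

```python
import struct

FLEN = 128  # register width with the Q extension

def nan_box(bits, width, flen=FLEN):
    # Place a `width`-bit value in the low bits; set all upper bits to 1.
    assert 0 <= bits < (1 << width)
    return ((1 << flen) - (1 << width)) | bits

def is_valid_narrow(regval, width, flen=FLEN):
    # A narrower operand is valid only if every bit above it is 1.
    return regval >> width == (1 << (flen - width)) - 1

# Single-precision 1.0 boxed into a 128-bit register image.
one_f32 = struct.unpack("<I", struct.pack("<f", 1.0))[0]  # 0x3F800000
reg = nan_box(one_f32, 32)
# The same image is also a validly boxed 64-bit value (whose payload is a
# negative NaN), which is exactly the recursive property described above.
```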
+
+=== Quad-Precision Load and Store Instructions
+
+New 128-bit variants of LOAD-FP and STORE-FP instructions are added,
+encoded with a new value for the funct3 width field.
+
+[cols="^2,^1,^1,^1,^1",options="header",]
+|===
+|imm[11:0] |rs1 |funct3 |rd |opcode
+|12 |5 |3 |5 |7
+|offset[11:0] |base |Q |dest |LOAD-FP
+|===
+
+[cols="^2,^1,^1,^1,^1,^1",options="header",]
+|===
+|imm[11:5] |rs2 |rs1 |funct3 |imm[4:0] |opcode
+|7 |5 |5 |3 |5 |7
+|offset[11:5] |src |base |Q |offset[4:0] |STORE-FP
+|===
+
+FLQ and FSQ are only guaranteed to execute atomically if the effective
+address is naturally aligned and XLEN=128.
+
+FLQ and FSQ do not modify the bits being transferred; in particular, the
+payloads of non-canonical NaNs are preserved.
+
+=== Quad-Precision Computational Instructions
+
+A new supported format is added to the format field of most
+instructions, as shown in <<fpextfmt>>.
+
+[[fpextfmt]]
+.Format field encoding.
+[cols="^,^,<",options="header",]
+|===
+|_fmt_ field |Mnemonic |Meaning
+|00 |S |32-bit single-precision
+|01 |D |64-bit double-precision
+|10 |H |16-bit half-precision
+|11 |Q |128-bit quad-precision
+|===
+
+The quad-precision floating-point computational instructions are defined
+analogously to their double-precision counterparts, but operate on
+quad-precision operands and produce quad-precision results.
+
+[cols="^,^,^,^,^,^,^",options="header",]
+|===
+|funct5 |fmt |rs2 |rs1 |rm |rd |opcode
+|5 |2 |5 |5 |3 |5 |7
+|FADD/FSUB |Q |src2 |src1 |RM |dest |OP-FP
+|FMUL/FDIV |Q |src2 |src1 |RM |dest |OP-FP
+|FMIN-MAX |Q |src2 |src1 |MIN/MAX |dest |OP-FP
+|FSQRT |Q |0 |src |RM |dest |OP-FP
+|===
+
+[cols="^,^,^,^,^,^,^",options="header",]
+|===
+|rs3 |fmt |rs2 |rs1 |rm |rd |opcode
+|5 |2 |5 |5 |3 |5 |7
+|src3 |Q |src2 |src1 |RM |dest |F[N]MADD/F[N]MSUB
+|===
+
+=== Quad-Precision Convert and Move Instructions
+
+New floating-point-to-integer and integer-to-floating-point conversion
+instructions are added. These instructions are defined analogously to
+the double-precision-to-integer and integer-to-double-precision
+conversion instructions. FCVT.W.Q or FCVT.L.Q converts a quad-precision
+floating-point number to a signed 32-bit or 64-bit integer,
+respectively. FCVT.Q.W or FCVT.Q.L converts a 32-bit or 64-bit signed
+integer, respectively, into a quad-precision floating-point number.
+FCVT.WU.Q, FCVT.LU.Q, FCVT.Q.WU, and FCVT.Q.LU variants convert to or
+from unsigned integer values. FCVT.L[U].Q and FCVT.Q.L[U] are RV64-only
+instructions.
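
These conversions inherit the F-extension behavior for exceptional inputs: out-of-range values saturate to the nearest representable integer bound and NaN converts to the maximum value. A hedged sketch (hypothetical helper `fcvt_w`; round-toward-zero only, whereas real hardware honors the rm field, and out-of-range or NaN inputs also raise the invalid flag):

```python
import math

def fcvt_w(x, bits=32, signed=True):
    # Sketch of saturating float->integer conversion, RTZ rounding.
    lo = -(1 << (bits - 1)) if signed else 0
    hi = (1 << (bits - 1)) - 1 if signed else (1 << bits) - 1
    if math.isnan(x):
        return hi                      # NaN converts to the maximum value
    if math.isinf(x):
        return hi if x > 0 else lo
    t = math.trunc(x)                  # round toward zero
    return max(lo, min(hi, t))         # saturate out-of-range results
```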
+
+[cols="^,^,^,^,^,^,^",options="header",]
+|===
+|funct5 |fmt |rs2 |rs1 |rm |rd |opcode
+|5 |2 |5 |5 |3 |5 |7
+|FCVT._int_.Q |Q |W[U]/L[U] |src |RM |dest |OP-FP
+|FCVT.Q._int_ |Q |W[U]/L[U] |src |RM |dest |OP-FP
+|===
+
+New floating-point-to-floating-point conversion instructions are added.
+These instructions are defined analogously to the double-precision
+floating-point-to-floating-point conversion instructions. FCVT.S.Q or
+FCVT.Q.S converts a quad-precision floating-point number to a
+single-precision floating-point number, or vice-versa, respectively.
+FCVT.D.Q or FCVT.Q.D converts a quad-precision floating-point number to
+a double-precision floating-point number, or vice-versa, respectively.
+
+[cols="^,^,^,^,^,^,^",options="header",]
+|===
+|funct5 |fmt |rs2 |rs1 |rm |rd |opcode
+|5 |2 |5 |5 |3 |5 |7
+|FCVT.S.Q |S |Q |src |RM |dest |OP-FP
+|FCVT.Q.S |Q |S |src |RM |dest |OP-FP
+|FCVT.D.Q |D |Q |src |RM |dest |OP-FP
+|FCVT.Q.D |Q |D |src |RM |dest |OP-FP
+|===
+
+The floating-point-to-floating-point sign-injection instructions,
+FSGNJ.Q, FSGNJN.Q, and FSGNJX.Q, are defined analogously to the
+double-precision sign-injection instructions.
+
+[cols="^,^,^,^,^,^,^",options="header",]
+|===
+|funct5 |fmt |rs2 |rs1 |rm |rd |opcode
+|5 |2 |5 |5 |3 |5 |7
+|FSGNJ |Q |src2 |src1 |J[N]/JX |dest |OP-FP
+|===
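
Sign injection is a pure bit-pattern operation, so it can be sketched directly on raw 128-bit patterns (hypothetical helper names, not spec pseudocode):

```python
SIGN = 1 << 127        # sign bit of a quad-precision (128-bit) pattern

def fsgnj(a, b):       # FSGNJ.Q: magnitude of src1, sign of src2
    return (a & (SIGN - 1)) | (b & SIGN)

def fsgnjn(a, b):      # FSGNJN.Q: magnitude of src1, inverted sign of src2
    return (a & (SIGN - 1)) | ((b ^ SIGN) & SIGN)

def fsgnjx(a, b):      # FSGNJX.Q: sign is the XOR of the two signs
    return a ^ (b & SIGN)
```

As at the other widths, using the same source register twice yields the pseudoinstructions: FSGNJ.Q gives a register move, FSGNJN.Q a negation, and FSGNJX.Q an absolute value.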
+
+FMV.X.Q and FMV.Q.X instructions are not provided in RV32 or RV64, so
+quad-precision bit patterns must be moved to the integer registers via
+memory.
+
+RV128 will support FMV.X.Q and FMV.Q.X in the Q extension.
+
+=== Quad-Precision Floating-Point Compare Instructions
+
+The quad-precision floating-point compare instructions are defined
+analogously to their double-precision counterparts, but operate on
+quad-precision operands.
+
+[cols="^,^,^,^,^,^,^",options="header",]
+|===
+|funct5 |fmt |rs2 |rs1 |rm |rd |opcode
+|5 |2 |5 |5 |3 |5 |7
+|FCMP |Q |src2 |src1 |EQ/LT/LE |dest |OP-FP
+|===
+
+=== Quad-Precision Floating-Point Classify Instruction
+
+The quad-precision floating-point classify instruction, FCLASS.Q, is
+defined analogously to its double-precision counterpart, but operates on
+quad-precision operands.
+
+[cols="^,^,^,^,^,^,^",options="header",]
+|===
+|funct5 |fmt |rs2 |rs1 |rm |rd |opcode
+|5 |2 |5 |5 |3 |5 |7
+|FCLASS |Q |0 |src |001 |dest |OP-FP
+|===
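
Classification inspects only the raw fields of the quad-precision pattern (1-bit sign, 15-bit exponent, 112-bit significand). A sketch with a hypothetical helper that returns the class index, where the instruction itself writes the one-hot mask `1 << index` to rd:

```python
def fclass_q(bits):
    # Sketch of FCLASS.Q on a raw 128-bit quad-precision pattern.
    sign = bits >> 127
    exp  = (bits >> 112) & 0x7FFF
    frac = bits & ((1 << 112) - 1)
    if exp == 0x7FFF:
        if frac == 0:
            return 0 if sign else 7        # -infinity / +infinity
        quiet = frac >> 111                # MSB of the significand field
        return 9 if quiet else 8           # quiet NaN / signaling NaN
    if exp == 0:
        if frac == 0:
            return 3 if sign else 4        # -0 / +0
        return 2 if sign else 5            # negative / positive subnormal
    return 1 if sign else 6                # negative / positive normal
```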
diff --git a/src/riscv-isa-unpriv-conv-review.adoc b/src/riscv-isa-unpriv-conv-review.adoc
new file mode 100644
index 0000000..2f3c349
--- /dev/null
+++ b/src/riscv-isa-unpriv-conv-review.adoc
@@ -0,0 +1,139 @@
+[[risc-v-isa]]
+= The RISC-V Instruction Set Manual
+:description: Volume I: Unprivileged ISA
+:company: RISC-V.org
+//:authors: Editors: Andrew waterman, Krste Asanovic, SiFive, Inc., CS Division, EECS Department, University of California, Berkeley
+:revdate: 09/2021
+:revnumber: Convert pre
+:revremark: Pre-release version
+//development: assume everything can change
+//stable: assume everything could change
+//frozen: if you implement this version you assume the risk that something might change because of the public review cycle but we expect little to no change.
+//ratified: you can implement this and be assured nothing will change. if something needs to change due to an errata or enhancement, it will come out in a new extension. we do not revise extensions.
+:url-riscv: http://riscv.org
+:doctype: book
+//:doctype: report
+:colophon:
+:preface-title: Preamble
+:appendix-caption: Appendix
+:imagesdir: images
+:title-logo-image: image:risc-v_logo.png[pdfwidth=3.25in,align=center]
+:page-background-image: image:draft.png[opacity=20%]
+//:title-page-background-image: none
+:back-cover-image: image:backpage.png[opacity=25%]
+// Settings:
+:experimental:
+:reproducible:
+:imagesoutdir: images
+:bibtex-file: resources/riscv-spec.bib
+:bibtex-order: alphabetical
+:bibtex-style: apa
+:icons: font
+:lang: en
+:listing-caption: Listing
+:sectnums:
+:toc: left
+:toclevels: 4
+:source-highlighter: pygments
+ifdef::backend-pdf[]
+:source-highlighter: coderay
+endif::[]
+:table-caption: Table
+:figure-caption: Figure
+:xrefstyle: full
+:chapter-refsig: Chapter
+:section-refsig: Section
+:appendix-refsig: Appendix
+:data-uri:
+:hide-uri-scheme:
+:stem: latexmath
+:footnote:
+
+_Contributors to all versions of the spec in alphabetical order (please contact editors to suggest
+corrections): Arvind, Krste Asanović, Rimas Avižienis, Jacob Bachmeyer, Christopher F. Batten,
+Allen J. Baum, Alex Bradbury, Scott Beamer, Preston Briggs, Christopher Celio, Chuanhua
+Chang, David Chisnall, Paul Clayton, Palmer Dabbelt, Ken Dockser, Roger Espasa, Greg Favor,
+Shaked Flur, Stefan Freudenberger, Marc Gauthier, Andy Glew, Jan Gray, Michael Hamburg, John
+Hauser, David Horner, Bruce Hoult, Bill Huffman, Alexandre Joannou, Olof Johansson, Ben Keller,
+David Kruckemyer, Yunsup Lee, Paul Loewenstein, Daniel Lustig, Yatin Manerkar, Luc Maranget,
+Margaret Martonosi, Joseph Myers, Vijayanand Nagarajan, Rishiyur Nikhil, Jonas Oberhauser,
+Stefan O’Rear, Albert Ou, John Ousterhout, David Patterson, Christopher Pulte, Jose Renau,
+Josh Scheid, Colin Schmidt, Peter Sewell, Susmit Sarkar, Michael Taylor, Wesley Terpstra, Matt
+Thomas, Tommy Thorn, Caroline Trippel, Ray VanDeWalker, Muralidaran Vijayaraghavan, Megan
+Wachs, Andrew Waterman, Robert Watson, Derek Williams, Andrew Wright, Reinoud Zandijk,
+and Sizhuo Zhang._
+
+_This document is released under a Creative Commons Attribution 4.0 International License._
+
+_This document is a derivative of “The RISC-V Instruction Set Manual, Volume I: User-Level ISA
+Version 2.1” released under the following license: ©2010–2017 Andrew Waterman, Yunsup Lee,
+David Patterson, Krste Asanović. Creative Commons Attribution 4.0 International License.
+Please cite as: “The RISC-V Instruction Set Manual, Volume I: User-Level ISA, Document
+Version 20191214-draft”, Editors Andrew Waterman and Krste Asanović, RISC-V Foundation,
+December 2019._
+
+//the colophon allows for a section after the preamble that is part of the frontmatter and therefore not assigned a page number.
+include::colophon.adoc[]
+//preface.tex
+include::intro.adoc[]
+//intro.tex
+include::rv32.adoc[]
+//rv32.tex
+include::zifencei.adoc[]
+//zfencei.tex
+include::zihintpause.adoc[]
+//zihintpause.tex
+include::rv32e.adoc[]
+//rv32e.tex
+include::rv64.adoc[]
+//rv64.tex
+include::rv128.adoc[]
+//rv128.tex
+include::m-st-ext.adoc[]
+//m.tex
+include::a-st-ext.adoc[]
+//a.tex
+include::zicsr.adoc[]
+//csr.tex
+include::counters.adoc[]
+//counters.tex
+include::f-st-ext.adoc[]
+//f.tex
+include::d-st-ext.adoc[]
+//d.tex
+include::q-st-ext.adoc[]
+//q.tex
+include::rvwmo.adoc[]
+//rvwmo.tex
+include::c-st-ext.adoc[]
+//c.tex
+include::b-st-ext.adoc[]
+//b.tex
+include::j-st-ext.adoc[]
+//j.tex
+include::p-st-ext.adoc[]
+//p.tex
+include::v-st-ext.adoc[]
+//v.tex
+include::zam-st-ext.adoc[]
+//zam.tex
+include::ztso-st-ext.adoc[]
+//ztso.tex
+include::rv-32-64g.adoc[]
+//gmaps.tex
+include::extending.adoc[]
+//extensions.tex
+include::naming.adoc[]
+//naming.tex
+include::history.adoc[]
+//history.tex
+//include::test.adoc[]
+//include::mm-eplan.adoc[]
+//memory.tex
+//include::mm-formal.adoc[]
+//end of memory.tex, memory-model-alloy.tex, memory-model-herd.tex
+include::index.adoc[]
+// this is generated from index markers.
+include::bibliography.adoc[]
+// this references the riscv-spec.bi file that has been copied into the resources directoy
+
diff --git a/src/riscv-isa-unpriv.adoc b/src/riscv-isa-unpriv.adoc
new file mode 100644
index 0000000..459eb6f
--- /dev/null
+++ b/src/riscv-isa-unpriv.adoc
@@ -0,0 +1,135 @@
+[[risc-v-isa]]
+= The RISC-V Instruction Set Manual
+:description: Volume I: Unprivileged ISA
+:company: RISC-V.org
+//:authors: Editors: Andrew waterman, Krste Asanovic, SiFive, Inc., CS Division, EECS Department, University of California, Berkeley
+:revdate: 07/2021
+:revnumber: Version 20191214-draft
+:revremark: Pre-release version
+//development: assume everything can change
+//stable: assume everything could change
+//frozen: if you implement this version you assume the risk that something might change because of the public review cycle but we expect little to no change.
+//ratified: you can implement this and be assured nothing will change. if something needs to change due to an errata or enhancement, it will come out in a new extension. we do not revise extensions.
+:url-riscv: http://riscv.org
+:doctype: book
+//:doctype: report
+:preface-title: Preamble
+:colophon:
+:appendix-caption: Appendix
+:imagesdir: images
+:title-logo-image: image:risc-v_logo.png[pdfwidth=3.25in,align=center]
+//:page-background-image: image:draft.svg[opacity=20%]
+//:title-page-background-image: none
+:back-cover-image: image:backpage.png[opacity=25%]
+// Settings:
+:experimental:
+:reproducible:
+// needs to be changed? bug discussion started
+//:WaveDromEditorApp: app/wavedrom-editor.app
+:imagesoutdir: images
+:bibtex-file: resources/riscv-spec.bib
+:bibtex-order: alphabetical
+:bibtex-style: ieee
+:icons: font
+:lang: en
+:listing-caption: Listing
+:sectnums:
+:toc: left
+:toclevels: 4
+:source-highlighter: pygments
+ifdef::backend-pdf[]
+:source-highlighter: coderay
+endif::[]
+:data-uri:
+:hide-uri-scheme:
+:stem: latexmath
+:footnote:
+:xrefstyle: short
+
+Contributors to all versions of the spec in alphabetical order (please contact editors to suggest
+corrections): Arvind, Krste Asanović, Rimas Avižienis, Jacob Bachmeyer, Christopher F. Batten,
+Allen J. Baum, Alex Bradbury, Scott Beamer, Preston Briggs, Christopher Celio, Chuanhua
+Chang, David Chisnall, Paul Clayton, Palmer Dabbelt, Ken Dockser, Roger Espasa, Greg Favor,
+Shaked Flur, Stefan Freudenberger, Marc Gauthier, Andy Glew, Jan Gray, Michael Hamburg, John
+Hauser, David Horner, Bruce Hoult, Bill Huffman, Alexandre Joannou, Olof Johansson, Ben Keller,
+David Kruckemyer, Yunsup Lee, Paul Loewenstein, Daniel Lustig, Yatin Manerkar, Luc Maranget,
+Margaret Martonosi, Joseph Myers, Vijayanand Nagarajan, Rishiyur Nikhil, Jonas Oberhauser,
+Stefan O’Rear, Albert Ou, John Ousterhout, David Patterson, Christopher Pulte, Jose Renau,
+Josh Scheid, Colin Schmidt, Peter Sewell, Susmit Sarkar, Michael Taylor, Wesley Terpstra, Matt
+Thomas, Tommy Thorn, Caroline Trippel, Ray VanDeWalker, Muralidaran Vijayaraghavan, Megan
+Wachs, Andrew Waterman, Robert Watson, Derek Williams, Andrew Wright, Reinoud Zandijk,
+and Sizhuo Zhang.
+
+This document is released under a Creative Commons Attribution 4.0 International License.
+
+This document is a derivative of “The RISC-V Instruction Set Manual, Volume I: User-Level ISA
+Version 2.1” released under the following license: ©2010–2017 Andrew Waterman, Yunsup Lee,
+David Patterson, Krste Asanović. Creative Commons Attribution 4.0 International License.
+Please cite as: “The RISC-V Instruction Set Manual, Volume I: User-Level ISA, Document
+Version 20191214-draft”, Editors Andrew Waterman and Krste Asanović, RISC-V Foundation,
+December 2019.
+
+//the colophon allows for a section after the preamble that is part of the frontmatter and therefore not assigned a page number.
+include::colophon.adoc[]
+//preface.tex
+//While some documents need several levels of introductory material, other documents only need a brief introduction.
+//include::overview.adoc[]
+include::introduction.adoc[]
+//intro.tex
+include::rv32.adoc[]
+//rv32.tex
+include::zifencei.adoc[]
+//zfencei.tex
+include::zihintpause.adoc[]
+//zihintpause.tex
+include::rv32e.adoc[]
+//rv32e.tex
+include::rv64.adoc[]
+//rv64.tex
+include::rv128.adoc[]
+//rv128.tex
+include::m-st-ext.adoc[]
+//m.tex
+include::a-st-ext.adoc[]
+//a.tex
+include::zicsr.adoc[]
+//csr.tex
+include::f-st-ext.adoc[]
+//f.tex
+include::d-st-ext.adoc[]
+//d.tex
+include::q-st-ext.adoc[]
+//q.tex
+include::rvwmo.adoc[]
+//rvwmo.tex
+include::c-st-ext.adoc[]
+//c.tex
+include::b-st-ext.adoc[]
+//b.tex
+include::j-st-ext.adoc[]
+//j.tex
+include::p-st-ext.adoc[]
+//p.tex
+include::v-st-ext.adoc[]
+//v.tex
+include::zam-st-ext.adoc[]
+//zam.tex
+include::ztso-st-ext.adoc[]
+//ztso.tex
+include::rv-32-64g.adoc[]
+//gmaps.tex
+include::extending.adoc[]
+//extensions.tex
+include::naming.adoc[]
+//naming.tex
+include::history.adoc[]
+//history.tex
+//include::mm-explain.adoc[]
+//memory.tex
+include::mm-formal.adoc[]
+//end of memory.tex, memory-model-alloy.tex, memory-model-herd.tex
+//include::index.adoc[]
+// this is generated from index markers.
+include::bibliography.adoc[]
+// this references the riscv-spec.bib file that has been copied into the resources directory
+
diff --git a/src/rv-32-64g.adoc b/src/rv-32-64g.adoc
new file mode 100644
index 0000000..4355fae
--- /dev/null
+++ b/src/rv-32-64g.adoc
@@ -0,0 +1,491 @@
+[[rv32-64g]]
+== RV32/64G Instruction Set Listings
+
+One goal of the RISC-V project is that it be used as a stable software
+development target. For this purpose, we define a combination of a base
+ISA (RV32I or RV64I) plus selected standard extensions (IMAFD, Zicsr,
+Zifencei) as a ``general-purpose`` ISA, and we use the abbreviation G
+for the IMAFDZicsr_Zifencei combination of instruction-set extensions.
+This chapter presents opcode maps and instruction-set listings for RV32G
+and RV64G.
+
+[[opcodemap]]
+.RISC-V base opcode map, inst[1:0]=11
+[cols=">,^,^,^,^,^,^,^,^",options="header",]
+|===
+|inst[6:5]\inst[4:2] |000 |001 |010 |011 |100 |101 |110 |111 (latexmath:[$>32b$])
+|00 |LOAD |LOAD-FP |_custom-0_ |MISC-MEM |OP-IMM |AUIPC |OP-IMM-32 |latexmath:[$48b$]
+|01 |STORE |STORE-FP |_custom-1_ |AMO |OP |LUI |OP-32 |latexmath:[$64b$]
+|10 |MADD |MSUB |NMSUB |NMADD |OP-FP |_reserved_ |_custom-2/rv128_ |latexmath:[$48b$]
+|11 |BRANCH |JALR |_reserved_ |JAL |SYSTEM |_reserved_ |_custom-3/rv128_ |latexmath:[$\geq80b$]
+|===
+
+<<opcodemap>> shows a map of the major opcodes for
+RVG. Major opcodes with 3 or more lower bits set are reserved for
+instruction lengths greater than 32 bits. Opcodes marked as _reserved_
+should be avoided for custom instruction-set extensions as they might be
+used by future standard extensions. Major opcodes marked as _custom-0_
+and _custom-1_ will be avoided by future standard extensions and are
+recommended for use by custom instruction-set extensions within the base
+32-bit instruction format. The opcodes marked _custom-2/rv128_ and
+_custom-3/rv128_ are reserved for future use by RV128, but will
+otherwise be avoided for standard extensions and so can also be used for
+custom instruction-set extensions in RV32 and RV64.
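
The reservation of low-bit patterns follows the standard instruction-length encoding defined in the base ISA chapters. A sketch of length determination from the low bits of an instruction parcel (lengths above 64 bits omitted):

```python
def insn_length(low_bits):
    # Sketch of the standard RISC-V instruction-length encoding.
    if low_bits & 0b11 != 0b11:
        return 16                            # compressed (C) instruction
    if low_bits & 0b11100 != 0b11100:
        return 32                            # standard 32-bit instruction
    if low_bits & 0b111111 == 0b011111:
        return 48
    if low_bits & 0b1111111 == 0b0111111:
        return 64
    raise NotImplementedError("longer or reserved length encoding")
```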
+
+We believe RV32G and RV64G provide simple but complete instruction sets
+for a broad range of general-purpose computing. The optional compressed
+instruction set described in <<compressed>> can
+be added (forming RV32GC and RV64GC) to improve performance, code size,
+and energy efficiency, though with some additional hardware complexity.
+
+As we move beyond IMAFDC into further instruction-set extensions, the
+added instructions tend to be more domain-specific and only provide
+benefits to a restricted class of applications, e.g., for multimedia or
+security. Unlike most commercial ISAs, the RISC-V ISA design clearly
+separates the base ISA and broadly applicable standard extensions from
+these more specialized additions. <<extensions>>
+has a more extensive discussion of ways to add extensions to the RISC-V
+ISA.
+
+[cols="<,<,<,<,<,<,<,<,<,<,<,<",]
+|===
+| | | | | | | | | | | |
+
+| | | | | | | | | | | |
+
+| |funct7 | | | |rs2 | |rs1 |funct3 |rd |opcode |R-type
+
+| |imm[11:0] | | | | | |rs1 |funct3 |rd |opcode |I-type
+
+| |imm[11:5] | | | |rs2 | |rs1 |funct3 |imm[4:0] |opcode |S-type
+
+| |imm[12latexmath:[$\vert$]10:5] | | | |rs2 | |rs1 |funct3
+|imm[4:1latexmath:[$\vert$]11] |opcode |B-type
+
+| |imm[31:12] | | | | | | | |rd |opcode |U-type
+
+|
+|imm[20latexmath:[$\vert$]10:1latexmath:[$\vert$]11latexmath:[$\vert$]19:12]
+| | | | | | | |rd |opcode |J-type
+
+| | | | | | | | | | | |
+
+| |*RV32I Base Instruction Set* | | | | | | | | | |
+
+| |imm[31:12] | | | | | | | |rd |0110111 |LUI
+
+| |imm[31:12] | | | | | | | |rd |0010111 |AUIPC
+
+|
+|imm[20latexmath:[$\vert$]10:1latexmath:[$\vert$]11latexmath:[$\vert$]19:12]
+| | | | | | | |rd |1101111 |JAL
+
+| |imm[11:0] | | | | | |rs1 |000 |rd |1100111 |JALR
+
+| |imm[12latexmath:[$\vert$]10:5] | | | |rs2 | |rs1 |000
+|imm[4:1latexmath:[$\vert$]11] |1100011 |BEQ
+
+| |imm[12latexmath:[$\vert$]10:5] | | | |rs2 | |rs1 |001
+|imm[4:1latexmath:[$\vert$]11] |1100011 |BNE
+
+| |imm[12latexmath:[$\vert$]10:5] | | | |rs2 | |rs1 |100
+|imm[4:1latexmath:[$\vert$]11] |1100011 |BLT
+
+| |imm[12latexmath:[$\vert$]10:5] | | | |rs2 | |rs1 |101
+|imm[4:1latexmath:[$\vert$]11] |1100011 |BGE
+
+| |imm[12latexmath:[$\vert$]10:5] | | | |rs2 | |rs1 |110
+|imm[4:1latexmath:[$\vert$]11] |1100011 |BLTU
+
+| |imm[12latexmath:[$\vert$]10:5] | | | |rs2 | |rs1 |111
+|imm[4:1latexmath:[$\vert$]11] |1100011 |BGEU
+
+| |imm[11:0] | | | | | |rs1 |000 |rd |0000011 |LB
+
+| |imm[11:0] | | | | | |rs1 |001 |rd |0000011 |LH
+
+| |imm[11:0] | | | | | |rs1 |010 |rd |0000011 |LW
+
+| |imm[11:0] | | | | | |rs1 |100 |rd |0000011 |LBU
+
+| |imm[11:0] | | | | | |rs1 |101 |rd |0000011 |LHU
+
+| |imm[11:5] | | | |rs2 | |rs1 |000 |imm[4:0] |0100011 |SB
+
+| |imm[11:5] | | | |rs2 | |rs1 |001 |imm[4:0] |0100011 |SH
+
+| |imm[11:5] | | | |rs2 | |rs1 |010 |imm[4:0] |0100011 |SW
+
+| |imm[11:0] | | | | | |rs1 |000 |rd |0010011 |ADDI
+
+| |imm[11:0] | | | | | |rs1 |010 |rd |0010011 |SLTI
+
+| |imm[11:0] | | | | | |rs1 |011 |rd |0010011 |SLTIU
+
+| |imm[11:0] | | | | | |rs1 |100 |rd |0010011 |XORI
+
+| |imm[11:0] | | | | | |rs1 |110 |rd |0010011 |ORI
+
+| |imm[11:0] | | | | | |rs1 |111 |rd |0010011 |ANDI
+
+| |0000000 | | | |shamt | |rs1 |001 |rd |0010011 |SLLI
+
+| |0000000 | | | |shamt | |rs1 |101 |rd |0010011 |SRLI
+
+| |0100000 | | | |shamt | |rs1 |101 |rd |0010011 |SRAI
+
+| |0000000 | | | |rs2 | |rs1 |000 |rd |0110011 |ADD
+
+| |0100000 | | | |rs2 | |rs1 |000 |rd |0110011 |SUB
+
+| |0000000 | | | |rs2 | |rs1 |001 |rd |0110011 |SLL
+
+| |0000000 | | | |rs2 | |rs1 |010 |rd |0110011 |SLT
+
+| |0000000 | | | |rs2 | |rs1 |011 |rd |0110011 |SLTU
+
+| |0000000 | | | |rs2 | |rs1 |100 |rd |0110011 |XOR
+
+| |0000000 | | | |rs2 | |rs1 |101 |rd |0110011 |SRL
+
+| |0100000 | | | |rs2 | |rs1 |101 |rd |0110011 |SRA
+
+| |0000000 | | | |rs2 | |rs1 |110 |rd |0110011 |OR
+
+| |0000000 | | | |rs2 | |rs1 |111 |rd |0110011 |AND
+
+| |fm | |pred | | |succ |rs1 |000 |rd |0001111 |FENCE
+
+| |1000 | |0011 | | |0011 |00000 |000 |00000 |0001111 |FENCE.TSO
+
+| |0000 | |0001 | | |0000 |00000 |000 |00000 |0001111 |PAUSE
+
+| |000000000000 | | | | | |00000 |000 |00000 |1110011 |ECALL
+
+| |000000000001 | | | | | |00000 |000 |00000 |1110011 |EBREAK
+
+| | | | | | | | | | | |
+|===
+
+[cols="<,<,<,<,<,<,<,<,<,<,<,<",]
+|===
+| | | | | | | | | | | |
+
+| | | | | | | | | | | |
+
+| |funct7 | | | |rs2 | |rs1 |funct3 |rd |opcode |R-type
+
+| |imm[11:0] | | | | | |rs1 |funct3 |rd |opcode |I-type
+
+| |imm[11:5] | | | |rs2 | |rs1 |funct3 |imm[4:0] |opcode |S-type
+
+| | | | | | | | | | | |
+
+| |*RV64I Base Instruction Set (in addition to RV32I)* | | | | | | | | |
+|
+
+| |imm[11:0] | | | | | |rs1 |110 |rd |0000011 |LWU
+
+| |imm[11:0] | | | | | |rs1 |011 |rd |0000011 |LD
+
+| |imm[11:5] | | | |rs2 | |rs1 |011 |imm[4:0] |0100011 |SD
+
+| |000000 | | |shamt | | |rs1 |001 |rd |0010011 |SLLI
+
+| |000000 | | |shamt | | |rs1 |101 |rd |0010011 |SRLI
+
+| |010000 | | |shamt | | |rs1 |101 |rd |0010011 |SRAI
+
+| |imm[11:0] | | | | | |rs1 |000 |rd |0011011 |ADDIW
+
+| |0000000 | | | |shamt | |rs1 |001 |rd |0011011 |SLLIW
+
+| |0000000 | | | |shamt | |rs1 |101 |rd |0011011 |SRLIW
+
+| |0100000 | | | |shamt | |rs1 |101 |rd |0011011 |SRAIW
+
+| |0000000 | | | |rs2 | |rs1 |000 |rd |0111011 |ADDW
+
+| |0100000 | | | |rs2 | |rs1 |000 |rd |0111011 |SUBW
+
+| |0000000 | | | |rs2 | |rs1 |001 |rd |0111011 |SLLW
+
+| |0000000 | | | |rs2 | |rs1 |101 |rd |0111011 |SRLW
+
+| |0100000 | | | |rs2 | |rs1 |101 |rd |0111011 |SRAW
+
+| | | | | | | | | | | |
+
+| |*RV32/RV64 _Zifencei_ Standard Extension* | | | | | | | | | |
+
+| |imm[11:0] | | | | | |rs1 |001 |rd |0001111 |FENCE.I
+
+| | | | | | | | | | | |
+
+| |*RV32/RV64 _Zicsr_ Standard Extension* | | | | | | | | | |
+
+| |csr | | | | | |rs1 |001 |rd |1110011 |CSRRW
+
+| |csr | | | | | |rs1 |010 |rd |1110011 |CSRRS
+
+| |csr | | | | | |rs1 |011 |rd |1110011 |CSRRC
+
+| |csr | | | | | |uimm |101 |rd |1110011 |CSRRWI
+
+| |csr | | | | | |uimm |110 |rd |1110011 |CSRRSI
+
+| |csr | | | | | |uimm |111 |rd |1110011 |CSRRCI
+
+| | | | | | | | | | | |
+
+| |*RV32M Standard Extension* | | | | | | | | | |
+
+| |0000001 | | | |rs2 | |rs1 |000 |rd |0110011 |MUL
+
+| |0000001 | | | |rs2 | |rs1 |001 |rd |0110011 |MULH
+
+| |0000001 | | | |rs2 | |rs1 |010 |rd |0110011 |MULHSU
+
+| |0000001 | | | |rs2 | |rs1 |011 |rd |0110011 |MULHU
+
+| |0000001 | | | |rs2 | |rs1 |100 |rd |0110011 |DIV
+
+| |0000001 | | | |rs2 | |rs1 |101 |rd |0110011 |DIVU
+
+| |0000001 | | | |rs2 | |rs1 |110 |rd |0110011 |REM
+
+| |0000001 | | | |rs2 | |rs1 |111 |rd |0110011 |REMU
+
+| | | | | | | | | | | |
+
+| |*RV64M Standard Extension (in addition to RV32M)* | | | | | | | | | |
+
+| |0000001 | | | |rs2 | |rs1 |000 |rd |0111011 |MULW
+
+| |0000001 | | | |rs2 | |rs1 |100 |rd |0111011 |DIVW
+
+| |0000001 | | | |rs2 | |rs1 |101 |rd |0111011 |DIVUW
+
+| |0000001 | | | |rs2 | |rs1 |110 |rd |0111011 |REMW
+
+| |0000001 | | | |rs2 | |rs1 |111 |rd |0111011 |REMUW
+
+| | | | | | | | | | | |
+|===
+
+[cols="<,<,<,<,<,<,<,<,<,<,<,<",]
+|===
+| | | | | | | | | | | |
+| | | | | | | | | | | |
+| |funct7 | | | |rs2 | |rs1 |funct3 |rd |opcode |R-type
+| | | | | | | | | | | |
+| |*RV32A Standard Extension* | | | | | | | | | |
+| |00010 | |aq |rl |00000 | |rs1 |010 |rd |0101111 |LR.W
+| |00011 | |aq |rl |rs2 | |rs1 |010 |rd |0101111 |SC.W
+| |00001 | |aq |rl |rs2 | |rs1 |010 |rd |0101111 |AMOSWAP.W
+| |00000 | |aq |rl |rs2 | |rs1 |010 |rd |0101111 |AMOADD.W
+| |00100 | |aq |rl |rs2 | |rs1 |010 |rd |0101111 |AMOXOR.W
+| |01100 | |aq |rl |rs2 | |rs1 |010 |rd |0101111 |AMOAND.W
+| |01000 | |aq |rl |rs2 | |rs1 |010 |rd |0101111 |AMOOR.W
+| |10000 | |aq |rl |rs2 | |rs1 |010 |rd |0101111 |AMOMIN.W
+| |10100 | |aq |rl |rs2 | |rs1 |010 |rd |0101111 |AMOMAX.W
+| |11000 | |aq |rl |rs2 | |rs1 |010 |rd |0101111 |AMOMINU.W
+| |11100 | |aq |rl |rs2 | |rs1 |010 |rd |0101111 |AMOMAXU.W
+| | | | | | | | | | | |
+| |*RV64A Standard Extension (in addition to RV32A)* | | | | | | | | | |
+| |00010 | |aq |rl |00000 | |rs1 |011 |rd |0101111 |LR.D
+| |00011 | |aq |rl |rs2 | |rs1 |011 |rd |0101111 |SC.D
+| |00001 | |aq |rl |rs2 | |rs1 |011 |rd |0101111 |AMOSWAP.D
+| |00000 | |aq |rl |rs2 | |rs1 |011 |rd |0101111 |AMOADD.D
+| |00100 | |aq |rl |rs2 | |rs1 |011 |rd |0101111 |AMOXOR.D
+| |01100 | |aq |rl |rs2 | |rs1 |011 |rd |0101111 |AMOAND.D
+| |01000 | |aq |rl |rs2 | |rs1 |011 |rd |0101111 |AMOOR.D
+| |10000 | |aq |rl |rs2 | |rs1 |011 |rd |0101111 |AMOMIN.D
+| |10100 | |aq |rl |rs2 | |rs1 |011 |rd |0101111 |AMOMAX.D
+| |11000 | |aq |rl |rs2 | |rs1 |011 |rd |0101111 |AMOMINU.D
+| |11100 | |aq |rl |rs2 | |rs1 |011 |rd |0101111 |AMOMAXU.D
+| | | | | | | | | | | |
+|===
+
+[cols="<,<,<,<,<,<,<,<,<,<,<,<",]
+|===
+| | | | | | | | | | | |
+| | | | | | | | | | | |
+| |funct7 | | | |rs2 | |rs1 |funct3 |rd |opcode |R-type
+| |rs3 | |funct2 | |rs2 | |rs1 |funct3 |rd |opcode |R4-type
+| |imm[11:0] | | | | | |rs1 |funct3 |rd |opcode |I-type
+| |imm[11:5] | | | |rs2 | |rs1 |funct3 |imm[4:0] |opcode |S-type
+| | | | | | | | | | | |
+| |*RV32F Standard Extension* | | | | | | | | | |
+| |imm[11:0] | | | | | |rs1 |010 |rd |0000111 |FLW
+| |imm[11:5] | | | |rs2 | |rs1 |010 |imm[4:0] |0100111 |FSW
+| |rs3 | |00 | |rs2 | |rs1 |rm |rd |1000011 |FMADD.S
+| |rs3 | |00 | |rs2 | |rs1 |rm |rd |1000111 |FMSUB.S
+| |rs3 | |00 | |rs2 | |rs1 |rm |rd |1001011 |FNMSUB.S
+| |rs3 | |00 | |rs2 | |rs1 |rm |rd |1001111 |FNMADD.S
+| |0000000 | | | |rs2 | |rs1 |rm |rd |1010011 |FADD.S
+| |0000100 | | | |rs2 | |rs1 |rm |rd |1010011 |FSUB.S
+| |0001000 | | | |rs2 | |rs1 |rm |rd |1010011 |FMUL.S
+| |0001100 | | | |rs2 | |rs1 |rm |rd |1010011 |FDIV.S
+| |0101100 | | | |00000 | |rs1 |rm |rd |1010011 |FSQRT.S
+| |0010000 | | | |rs2 | |rs1 |000 |rd |1010011 |FSGNJ.S
+| |0010000 | | | |rs2 | |rs1 |001 |rd |1010011 |FSGNJN.S
+| |0010000 | | | |rs2 | |rs1 |010 |rd |1010011 |FSGNJX.S
+| |0010100 | | | |rs2 | |rs1 |000 |rd |1010011 |FMIN.S
+| |0010100 | | | |rs2 | |rs1 |001 |rd |1010011 |FMAX.S
+| |1100000 | | | |00000 | |rs1 |rm |rd |1010011 |FCVT.W.S
+| |1100000 | | | |00001 | |rs1 |rm |rd |1010011 |FCVT.WU.S
+| |1110000 | | | |00000 | |rs1 |000 |rd |1010011 |FMV.X.W
+| |1010000 | | | |rs2 | |rs1 |010 |rd |1010011 |FEQ.S
+| |1010000 | | | |rs2 | |rs1 |001 |rd |1010011 |FLT.S
+| |1010000 | | | |rs2 | |rs1 |000 |rd |1010011 |FLE.S
+| |1110000 | | | |00000 | |rs1 |001 |rd |1010011 |FCLASS.S
+| |1101000 | | | |00000 | |rs1 |rm |rd |1010011 |FCVT.S.W
+| |1101000 | | | |00001 | |rs1 |rm |rd |1010011 |FCVT.S.WU
+| |1111000 | | | |00000 | |rs1 |000 |rd |1010011 |FMV.W.X
+| | | | | | | | | | | |
+| |*RV64F Standard Extension (in addition to RV32F)* | | | | | | | | | |
+| |1100000 | | | |00010 | |rs1 |rm |rd |1010011 |FCVT.L.S
+| |1100000 | | | |00011 | |rs1 |rm |rd |1010011 |FCVT.LU.S
+| |1101000 | | | |00010 | |rs1 |rm |rd |1010011 |FCVT.S.L
+| |1101000 | | | |00011 | |rs1 |rm |rd |1010011 |FCVT.S.LU
+| | | | | | | | | | | |
+|===
+
+[cols="<,<,<,<,<,<,<,<,<,<,<,<",]
+|===
+| | | | | | | | | | | |
+| | | | | | | | | | | |
+| |funct7 | | | |rs2 | |rs1 |funct3 |rd |opcode |R-type
+| |rs3 | |funct2 | |rs2 | |rs1 |funct3 |rd |opcode |R4-type
+| |imm[11:0] | | | | | |rs1 |funct3 |rd |opcode |I-type
+| |imm[11:5] | | | |rs2 | |rs1 |funct3 |imm[4:0] |opcode |S-type
+| | | | | | | | | | | |
+| |*RV32D Standard Extension* | | | | | | | | | |
+| |imm[11:0] | | | | | |rs1 |011 |rd |0000111 |FLD
+| |imm[11:5] | | | |rs2 | |rs1 |011 |imm[4:0] |0100111 |FSD
+| |rs3 | |01 | |rs2 | |rs1 |rm |rd |1000011 |FMADD.D
+| |rs3 | |01 | |rs2 | |rs1 |rm |rd |1000111 |FMSUB.D
+| |rs3 | |01 | |rs2 | |rs1 |rm |rd |1001011 |FNMSUB.D
+| |rs3 | |01 | |rs2 | |rs1 |rm |rd |1001111 |FNMADD.D
+| |0000001 | | | |rs2 | |rs1 |rm |rd |1010011 |FADD.D
+| |0000101 | | | |rs2 | |rs1 |rm |rd |1010011 |FSUB.D
+| |0001001 | | | |rs2 | |rs1 |rm |rd |1010011 |FMUL.D
+| |0001101 | | | |rs2 | |rs1 |rm |rd |1010011 |FDIV.D
+| |0101101 | | | |00000 | |rs1 |rm |rd |1010011 |FSQRT.D
+| |0010001 | | | |rs2 | |rs1 |000 |rd |1010011 |FSGNJ.D
+| |0010001 | | | |rs2 | |rs1 |001 |rd |1010011 |FSGNJN.D
+| |0010001 | | | |rs2 | |rs1 |010 |rd |1010011 |FSGNJX.D
+| |0010101 | | | |rs2 | |rs1 |000 |rd |1010011 |FMIN.D
+| |0010101 | | | |rs2 | |rs1 |001 |rd |1010011 |FMAX.D
+| |0100000 | | | |00001 | |rs1 |rm |rd |1010011 |FCVT.S.D
+| |0100001 | | | |00000 | |rs1 |rm |rd |1010011 |FCVT.D.S
+| |1010001 | | | |rs2 | |rs1 |010 |rd |1010011 |FEQ.D
+| |1010001 | | | |rs2 | |rs1 |001 |rd |1010011 |FLT.D
+| |1010001 | | | |rs2 | |rs1 |000 |rd |1010011 |FLE.D
+| |1110001 | | | |00000 | |rs1 |001 |rd |1010011 |FCLASS.D
+| |1100001 | | | |00000 | |rs1 |rm |rd |1010011 |FCVT.W.D
+| |1100001 | | | |00001 | |rs1 |rm |rd |1010011 |FCVT.WU.D
+| |1101001 | | | |00000 | |rs1 |rm |rd |1010011 |FCVT.D.W
+| |1101001 | | | |00001 | |rs1 |rm |rd |1010011 |FCVT.D.WU
+| | | | | | | | | | | |
+| |*RV64D Standard Extension (in addition to RV32D)* | | | | | | | | | |
+| |1100001 | | | |00010 | |rs1 |rm |rd |1010011 |FCVT.L.D
+| |1100001 | | | |00011 | |rs1 |rm |rd |1010011 |FCVT.LU.D
+| |1110001 | | | |00000 | |rs1 |000 |rd |1010011 |FMV.X.D
+| |1101001 | | | |00010 | |rs1 |rm |rd |1010011 |FCVT.D.L
+| |1101001 | | | |00011 | |rs1 |rm |rd |1010011 |FCVT.D.LU
+| |1111001 | | | |00000 | |rs1 |000 |rd |1010011 |FMV.D.X
+| | | | | | | | | | | |
+|===
+
+.Instruction listing for RISC-V
+[cols="<,<,<,<,<,<,<,<,<,<,<,<",]
+|===
+| | | | | | | | | | | |
+| | | | | | | | | | | |
+| |funct7 | | | |rs2 | |rs1 |funct3 |rd |opcode |R-type
+| |rs3 | |funct2 | |rs2 | |rs1 |funct3 |rd |opcode |R4-type
+| |imm[11:0] | | | | | |rs1 |funct3 |rd |opcode |I-type
+| |imm[11:5] | | | |rs2 | |rs1 |funct3 |imm[4:0] |opcode |S-type
+| | | | | | | | | | | |
+| |*RV32Q Standard Extension* | | | | | | | | | |
+| |imm[11:0] | | | | | |rs1 |100 |rd |0000111 |FLQ
+| |imm[11:5] | | | |rs2 | |rs1 |100 |imm[4:0] |0100111 |FSQ
+| |rs3 | |11 | |rs2 | |rs1 |rm |rd |1000011 |FMADD.Q
+| |rs3 | |11 | |rs2 | |rs1 |rm |rd |1000111 |FMSUB.Q
+| |rs3 | |11 | |rs2 | |rs1 |rm |rd |1001011 |FNMSUB.Q
+| |rs3 | |11 | |rs2 | |rs1 |rm |rd |1001111 |FNMADD.Q
+| |0000011 | | | |rs2 | |rs1 |rm |rd |1010011 |FADD.Q
+| |0000111 | | | |rs2 | |rs1 |rm |rd |1010011 |FSUB.Q
+| |0001011 | | | |rs2 | |rs1 |rm |rd |1010011 |FMUL.Q
+| |0001111 | | | |rs2 | |rs1 |rm |rd |1010011 |FDIV.Q
+| |0101111 | | | |00000 | |rs1 |rm |rd |1010011 |FSQRT.Q
+| |0010011 | | | |rs2 | |rs1 |000 |rd |1010011 |FSGNJ.Q
+| |0010011 | | | |rs2 | |rs1 |001 |rd |1010011 |FSGNJN.Q
+| |0010011 | | | |rs2 | |rs1 |010 |rd |1010011 |FSGNJX.Q
+| |0010111 | | | |rs2 | |rs1 |000 |rd |1010011 |FMIN.Q
+| |0010111 | | | |rs2 | |rs1 |001 |rd |1010011 |FMAX.Q
+| |0100000 | | | |00011 | |rs1 |rm |rd |1010011 |FCVT.S.Q
+| |0100011 | | | |00000 | |rs1 |rm |rd |1010011 |FCVT.Q.S
+| |0100001 | | | |00011 | |rs1 |rm |rd |1010011 |FCVT.D.Q
+| |0100011 | | | |00001 | |rs1 |rm |rd |1010011 |FCVT.Q.D
+| |1010011 | | | |rs2 | |rs1 |010 |rd |1010011 |FEQ.Q
+| |1010011 | | | |rs2 | |rs1 |001 |rd |1010011 |FLT.Q
+| |1010011 | | | |rs2 | |rs1 |000 |rd |1010011 |FLE.Q
+| |1110011 | | | |00000 | |rs1 |001 |rd |1010011 |FCLASS.Q
+| |1100011 | | | |00000 | |rs1 |rm |rd |1010011 |FCVT.W.Q
+| |1100011 | | | |00001 | |rs1 |rm |rd |1010011 |FCVT.WU.Q
+| |1101011 | | | |00000 | |rs1 |rm |rd |1010011 |FCVT.Q.W
+| |1101011 | | | |00001 | |rs1 |rm |rd |1010011 |FCVT.Q.WU
+| | | | | | | | | | | |
+| |*RV64Q Standard Extension (in addition to RV32Q)* | | | | | | | | | |
+| |1100011 | | | |00010 | |rs1 |rm |rd |1010011 |FCVT.L.Q
+| |1100011 | | | |00011 | |rs1 |rm |rd |1010011 |FCVT.LU.Q
+| |1101011 | | | |00010 | |rs1 |rm |rd |1010011 |FCVT.Q.L
+| |1101011 | | | |00011 | |rs1 |rm |rd |1010011 |FCVT.Q.LU
+| | | | | | | | | | | |
+|===
+
+<<rvgcsrnames>> lists the CSRs that have currently been
+allocated CSR addresses. The timers, counters, and floating-point CSRs
+are the only CSRs defined in this specification.
+
+[[rvgcsrnames]]
+.RISC-V control and status register (CSR) address map.
+[cols="<,<,<,<",options="header",]
+|===
+|Number |Privilege |Name |Description
+|Floating-Point Control and Status Registers | | |
+
+|`0x001 ` |Read/write |`fflags ` |Floating-Point Accrued Exceptions.
+
+|`0x002 ` |Read/write |`frm ` |Floating-Point Dynamic Rounding Mode.
+
+|`0x003 ` |Read/write |`fcsr ` |Floating-Point Control and Status
+Register (`frm` + `fflags`).
+
+|Counters and Timers | | |
+
+|`0xC00 ` |Read-only |`cycle ` |Cycle counter for RDCYCLE instruction.
+
+|`0xC01 ` |Read-only |`time ` |Timer for RDTIME instruction.
+
+|`0xC02 ` |Read-only |`instret ` |Instructions-retired counter for
+RDINSTRET instruction.
+
+|`0xC80 ` |Read-only |`cycleh ` |Upper 32 bits of `cycle`, RV32I only.
+
+|`0xC81 ` |Read-only |`timeh ` |Upper 32 bits of `time`, RV32I only.
+
+|`0xC82 ` |Read-only |`instreth ` |Upper 32 bits of `instret`, RV32I
+only.
+|===
+
diff --git a/src/rv128.adoc b/src/rv128.adoc
new file mode 100644
index 0000000..1513bf9
--- /dev/null
+++ b/src/rv128.adoc
@@ -0,0 +1,79 @@
+[[rv128]]
+== RV128I Base Integer Instruction Set, Version 1.7
+
+====
+_"There is only one mistake that can be made in computer design that is
+difficult to recover from—not having enough address bits for memory
+addressing and memory management."_ Bell and Strecker, ISCA-3, 1976.
+====
+
+This chapter describes RV128I, a variant of the RISC-V ISA supporting a
+flat 128-bit address space. The variant is a straightforward
+extrapolation of the existing RV32I and RV64I designs.
+(((RV128, design)))
+
+[TIP]
+====
+The primary reason to extend integer register width is to support larger
+address spaces. It is not clear when a flat address space larger than 64
+bits will be required. At the time of writing, the fastest supercomputer
+in the world as measured by the Top500 benchmark had over 1 PiB of DRAM, and
+would require over 50 bits of address space if all the DRAM resided in a
+single address space. Some warehouse-scale computers already contain
+even larger quantities of DRAM, and new dense solid-state non-volatile
+memories and fast interconnect technologies might drive a demand for
+even larger memory spaces. Exascale systems research is targeting
+100 PB memory systems, which occupy 57 bits of address space. At historic rates of
+growth, it is possible that greater than 64 bits of address space might
+be required before 2030.
+====
+
+History suggests that whenever it becomes clear that more than 64 bits
+of address space is needed, architects will repeat intensive debates
+about alternatives to extending the address space, including
+segmentation, 96-bit address spaces, and software workarounds, until,
+finally, flat 128-bit address spaces will be adopted as the simplest and
+best solution.
+(((RV128, evolution)))
+
+[TIP]
+====
+We have not frozen the RV128 spec at this time, as there might be need
+to evolve the design based on actual usage of 128-bit address spaces.
+====
+(((RV128I, as relates to RV64I)))
+
+RV128I builds upon RV64I in the same way RV64I builds upon RV32I, with
+integer registers extended to 128 bits (i.e., XLEN=128). Most integer
+computational instructions are unchanged as they are defined to operate
+on XLEN bits. The RV64I `\*W` integer instructions that operate on
+32-bit values in the low bits of a register are retained but now sign
+extend their results from bit 31 to bit 127. A new set of `\*D` integer
+instructions are added that operate on 64-bit values held in the low
+bits of the 128-bit integer registers and sign extend their results from
+bit 63 to bit 127. The `\*D` instructions consume two major opcodes
+(OP-IMM-64 and OP-64) in the standard 32-bit encoding.
+(((RV128I, compatibility with RV64)))
+
+To improve compatibility with RV64, in a reverse of how RV32 to RV64 was
+handled, we might change the decoding around to rename RV64I ADD as a
+64-bit ADDD, and add a 128-bit ADDQ in what was previously the OP-64
+major opcode (now renamed the OP-128 major opcode).
+
+
+Shifts by an immediate (SLLI/SRLI/SRAI) are now encoded using the low 7
+bits of the I-immediate, and variable shifts (SLL/SRL/SRA) use the low 7
+bits of the shift amount source register.
+(((RV128I, LDU)))
+
+A LDU (load double unsigned) instruction is added using the existing
+LOAD major opcode, along with new LQ and SQ instructions to load and
+store quadword values. SQ is added to the STORE major opcode, while LQ
+is added to the MISC-MEM major opcode.
+
+
+The floating-point instruction set is unchanged, although the 128-bit Q
+floating-point extension can now support FMV.X.Q and FMV.Q.X
+instructions, together with additional FCVT instructions to and from the
+T (128-bit) integer format.
+
diff --git a/src/rv32.adoc b/src/rv32.adoc
new file mode 100644
index 0000000..b035fe3
--- /dev/null
+++ b/src/rv32.adoc
@@ -0,0 +1,988 @@
+[[rv32]]
+== RV32I Base Integer Instruction Set, Version 2.1
+
+This chapter describes the RV32I base integer instruction set.
+
+[TIP]
+====
+RV32I was designed to be sufficient to form a compiler target and to
+support modern operating system environments. The ISA was also designed
+to reduce the hardware required in a minimal implementation. RV32I
+contains 40 unique instructions, though a simple implementation might
+cover the ECALL/EBREAK instructions with a single SYSTEM hardware
+instruction that always traps and might be able to implement the FENCE
+instruction as a NOP, reducing base instruction count to 38 total. RV32I
+can emulate almost any other ISA extension (except the A extension,
+which requires additional hardware support for atomicity).
+
+In practice, a hardware implementation including the machine-mode
+privileged architecture will also require the 6 CSR instructions.
+
+Subsets of the base integer ISA might be useful for pedagogical
+purposes, but the base has been defined such that there should be little
+incentive to subset a real hardware implementation beyond omitting
+support for misaligned memory accesses and treating all SYSTEM
+instructions as a single trap.
+====
+
+[NOTE]
+====
+The standard RISC-V assembly language syntax is documented in the
+Assembly Programmer’s Manual cite:[riscv-asm-manual].
+
+Most of the commentary for RV32I also applies to the RV64I base.
+====
+
+=== Programmers’ Model for Base Integer ISA
+
+<<img-gprs>> shows the unprivileged state for the base
+integer ISA. For RV32I, the 32 `x` registers are each 32 bits wide,
+i.e., XLEN=32. Register `x0` is hardwired with all bits equal to 0.
+General purpose registers `x1`–`x31` hold values that various
+instructions interpret as a collection of Boolean values, or as two’s
+complement signed binary integers or unsigned binary integers.
+
+There is one additional unprivileged register: the program counter `pc`
+holds the address of the current instruction.
+
+[[img-gprs]]
+.RISC-V base unprivileged integer register state.
+image::base-unpriv-reg-state.png[base,180,1000,align="center"]
+
+There is no dedicated stack pointer or subroutine return address link
+register in the Base Integer ISA; the instruction encoding allows any
+`x` register to be used for these purposes. However, the standard
+software calling convention uses register `x1` to hold the return
+address for a call, with register `x5` available as an alternate link
+register. The standard calling convention uses register `x2` as the
+stack pointer.
+
+Hardware might choose to accelerate function calls and returns that use
+`x1` or `x5`. See the descriptions of the JAL and JALR instructions.
+
+The optional compressed 16-bit instruction format is designed around the
+assumption that `x1` is the return address register and `x2` is the
+stack pointer. Software using other conventions will operate correctly
+but may have greater code size.
+
+The number of available architectural registers can have large impacts
+on code size, performance, and energy consumption. Although 16 registers
+would arguably be sufficient for an integer ISA running compiled code,
+it is impossible to encode a complete ISA with 16 registers in 16-bit
+instructions using a 3-address format. Although a 2-address format would
+be possible, it would increase instruction count and lower efficiency.
+We wanted to avoid intermediate instruction sizes (such as Xtensa’s
+24-bit instructions) to simplify base hardware implementations, and once
+a 32-bit instruction size was adopted, it was straightforward to support
+32 integer registers. A larger number of integer registers also helps
+performance on high-performance code, where there can be extensive use
+of loop unrolling, software pipelining, and cache tiling.
+
+For these reasons, we chose a conventional size of 32 integer registers
+for RV32I. Dynamic register usage tends to be dominated by a few
+frequently accessed registers, and regfile implementations can be
+optimized to reduce access energy for the frequently accessed
+registers. The optional compressed 16-bit instruction format mostly
+only accesses 8 registers and hence can provide a dense instruction
+encoding, while additional instruction-set extensions could support a
+much larger register space (either flat or hierarchical) if desired.
+
+For resource-constrained embedded applications, we have defined the
+RV32E subset, which only has 16 registers
+(<<rv32e>>).
+
+=== Base Instruction Formats
+
+In the base RV32I ISA, there are four core instruction formats
+(R/I/S/U), as shown in <<base_instr>>. All are a fixed 32
+bits in length and must be aligned on a four-byte boundary in memory. An
+instruction-address-misaligned exception is generated on a taken branch
+or unconditional jump if the target address is not four-byte aligned.
+This exception is reported on the branch or jump instruction, not on the
+target instruction. No instruction-address-misaligned exception is
+generated for a conditional branch that is not taken.
+
+The alignment constraint for base ISA instructions is relaxed to a
+two-byte boundary when instruction extensions with 16-bit lengths or
+other odd multiples of 16-bit lengths are added (i.e., IALIGN=16).
+
+Instruction-address-misaligned exceptions are reported on the branch or
+jump that would cause instruction misalignment to help debugging, and to
+simplify hardware design for systems with IALIGN=32, where these are the
+only places where misalignment can occur.
+
+The behavior upon decoding a reserved instruction is UNSPECIFIED.
+
+Some platforms may require that opcodes reserved for standard use raise
+an illegal-instruction exception. Other platforms may permit reserved
+opcode space to be used for non-conforming extensions.
+
+include::images/wavedrom/instruction_formats.adoc[]
+[[base_instr]]
+.Instruction formats
+image::image_placeholder.png[]
+
+[NOTE]
+====
+Each immediate subfield in <<base_instr>> above is labeled with the bit position (imm[x]) in the immediate value being produced, rather than the bit position within the instruction’s immediate field as is usually done.
+====
+
+The RISC-V ISA keeps the source (_rs1_ and _rs2_) and destination (_rd_)
+registers at the same position in all formats to simplify decoding.
+Except for the 5-bit immediates used in CSR instructions
+(<<csrinsts>>), immediates are always
+sign-extended, and are generally packed towards the leftmost available
+bits in the instruction and have been allocated to reduce hardware
+complexity. In particular, the sign bit for all immediates is always in
+bit 31 of the instruction to speed sign-extension circuitry.
+
+[NOTE]
+====
+Decoding register specifiers is usually on the critical paths in
+implementations, and so the instruction format was chosen to keep all
+register specifiers at the same position in all formats at the expense
+of having to move immediate bits across formats (a property shared with
+RISC-IV, aka SPUR cite:[spur-jsscc1989]).
+
+In practice, most immediates are either small or require all XLEN bits.
+We chose an asymmetric immediate split (12 bits in regular instructions
+plus a special load-upper-immediate instruction with 20 bits) to
+increase the opcode space available for regular instructions.
+
+Immediates are sign-extended because we did not observe a benefit to
+using zero-extension for some immediates as in the MIPS ISA and wanted
+to keep the ISA as simple as possible.
+====
+
+=== Immediate Encoding Variants
+
+There are a further two variants of the instruction formats (B/J) based
+on the handling of immediates, as shown in <<baseinstformatsimm>>.
+
+include::images/wavedrom/immediate_variants.adoc[]
+[[baseinstformatsimm]]
+.RISC-V base instruction formats.
+image::image_placeholder.png[]
+
+[NOTE]
+====
+Each immediate subfield is labeled with the bit
+position (imm[x]) in the immediate value being produced, rather than the bit position within the
+instruction’s immediate field as is usually done.
+====
+
+The only difference between the S and B formats is that the 12-bit
+immediate field is used to encode branch offsets in multiples of 2 in
+the B format. Instead of shifting all bits in the instruction-encoded
+immediate left by one in hardware as is conventionally done, the middle
+bits (imm[10:1]) and sign bit stay in fixed positions, while the lowest
+bit in S format (inst[7]) encodes a high-order bit in B format.
+
+Similarly, the only difference between the U and J formats is that the
+20-bit immediate is shifted left by 12 bits to form U immediates and by
+1 bit to form J immediates. The location of instruction bits in the U
+and J format immediates is chosen to maximize overlap with the other
+formats and with each other.
+
+<<immtypes>> shows the immediates produced by
+each of the base instruction formats, and is labeled to show which
+instruction bit `(inst[_y_])` produces each bit of the immediate value.
+
+include::images/wavedrom/immediate.adoc[]
+[[immtypes]]
+.Immediate variants for I, S, B, U, and J
+image::image_placeholder.png[]
+
+
+Sign-extension is one of the most critical operations on immediates
+(particularly for XLEN latexmath:[$>$]32), and in RISC-V the sign bit for
+all immediates is always held in bit 31 of the instruction to allow
+sign-extension to proceed in parallel with instruction decoding.
+
+Although more complex implementations might have separate adders for
+branch and jump calculations and so would not benefit from keeping the
+location of immediate bits constant across types of instruction, we
+wanted to reduce the hardware cost of the simplest implementations. By
+rotating bits in the instruction encoding of B and J immediates instead
+of using dynamic hardware muxes to multiply the immediate by 2, we
+reduce instruction signal fanout and immediate mux costs by around a
+factor of 2. The scrambled immediate encoding will add negligible time
+to static or ahead-of-time compilation. For dynamic generation of
+instructions, there is some small additional overhead, but the most
+common short forward branches have straightforward immediate encodings.
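+
+As a concrete illustration, the following Python sketch (the function
+names are ours, not part of the specification) reassembles a B-type
+immediate from a 32-bit instruction word, showing that the sign bit
+stays in inst[31] and only inst[7] moves relative to the S format:
+
+[source,python]
+....
+def sext(value, bits):
+    # Sign-extend a `bits`-wide field to a Python integer.
+    sign = 1 << (bits - 1)
+    return (value & (sign - 1)) - (value & sign)
+
+def b_imm(inst):
+    # Reassemble the scrambled B-type immediate.
+    imm = ((inst >> 31) & 0x1) << 12    # inst[31]    -> imm[12] (sign)
+    imm |= ((inst >> 7) & 0x1) << 11    # inst[7]     -> imm[11]
+    imm |= ((inst >> 25) & 0x3F) << 5   # inst[30:25] -> imm[10:5]
+    imm |= ((inst >> 8) & 0xF) << 1     # inst[11:8]  -> imm[4:1]
+    return sext(imm, 13)                # imm[0] is always zero
+....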
+
+=== Integer Computational Instructions
+
+Most integer computational instructions operate on XLEN bits of values
+held in the integer register file. Integer computational instructions
+are either encoded as register-immediate operations using the I-type
+format or as register-register operations using the R-type format. The
+destination is register _rd_ for both register-immediate and
+register-register instructions. No integer computational instructions
+cause arithmetic exceptions.
+
+We did not include special instruction-set support for overflow checks
+on integer arithmetic operations in the base instruction set, as many
+overflow checks can be cheaply implemented using RISC-V branches.
+Overflow checking for unsigned addition requires only a single
+additional branch instruction after the addition:
+`add t0, t1, t2; bltu t0, t1, overflow`.
+
+For signed addition, if one operand’s sign is known, overflow checking
+requires only a single branch after the addition:
+`addi t0, t1, +imm; blt t0, t1, overflow`. This covers the common case
+of addition with an immediate operand.
+
+For general signed addition, three additional instructions after the
+addition are required, leveraging the observation that the sum should be
+less than one of the operands if and only if the other operand is
+negative.
+
+[source,txt]
+....
+ add t0, t1, t2
+ slti t3, t2, 0
+ slt t4, t0, t1
+ bne t3, t4, overflow
+....
+
+In RV64I, checks of 32-bit signed additions can be optimized further by
+comparing the results of ADD and ADDW on the operands.
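+
+The three-instruction check above can be modeled directly. This Python
+sketch (our own model, assuming XLEN=32) mirrors each instruction of the
+sequence:
+
+[source,python]
+....
+MASK = 0xFFFFFFFF          # model XLEN=32 registers
+
+def sext32(x):
+    x &= MASK
+    return x - (1 << 32) if x & 0x80000000 else x
+
+def signed_add_overflows(t1, t2):
+    t0 = (t1 + t2) & MASK                      # add  t0, t1, t2
+    t3 = 1 if sext32(t2) < 0 else 0            # slti t3, t2, 0
+    t4 = 1 if sext32(t0) < sext32(t1) else 0   # slt  t4, t0, t1
+    return t3 != t4                            # bne  t3, t4, overflow
+....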
+
+==== Integer Register-Immediate Instructions
+
+include::images/wavedrom/integer_computational.adoc[]
+.Integer Computational Instructions
+image::image_placeholder.png[]
+
+ADDI adds the sign-extended 12-bit immediate to register _rs1_.
+Arithmetic overflow is ignored and the result is simply the low XLEN
+bits of the result. ADDI _rd, rs1, 0_ is used to implement the MV _rd,
+rs1_ assembler pseudoinstruction.
+
+SLTI (set less than immediate) places the value 1 in register _rd_ if
+register _rs1_ is less than the sign-extended immediate when both are
+treated as signed numbers, else 0 is written to _rd_. SLTIU is similar
+but compares the values as unsigned numbers (i.e., the immediate is
+first sign-extended to XLEN bits then treated as an unsigned number).
+Note, SLTIU _rd, rs1, 1_ sets _rd_ to 1 if _rs1_ equals zero, otherwise
+sets _rd_ to 0 (assembler pseudoinstruction SEQZ _rd, rs_).
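+
+A short Python model (ours, assuming XLEN=32) makes the immediate
+handling explicit: the immediate is sign-extended first and only then
+treated as unsigned:
+
+[source,python]
+....
+XLEN_MASK = 0xFFFFFFFF
+
+def sltiu(rs1, imm12):
+    # Sign-extend the 12-bit immediate, then compare as unsigned.
+    imm = imm12 - (1 << 12) if imm12 & 0x800 else imm12
+    return 1 if (rs1 & XLEN_MASK) < (imm & XLEN_MASK) else 0
+....
+
+So `sltiu rd, rs1, 1` writes 1 exactly when _rs1_ is zero (SEQZ), and an
+immediate of -1 compares as the largest unsigned value.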
+
+ANDI, ORI, XORI are logical operations that perform bitwise AND, OR, and
+XOR on register _rs1_ and the sign-extended 12-bit immediate and place
+the result in _rd_. Note, XORI _rd, rs1, -1_ performs a bitwise logical
+inversion of register _rs1_ (assembler pseudoinstruction NOT _rd, rs_).
+
+include::images/wavedrom/int-comp-slli-srli-srai.adoc[]
+[[int-comp-slli-srli-srai]]
+.Integer register-immediate, SLLI, SRLI, SRAI
+image::image_placeholder.png[]
+
+
+Shifts by a constant are encoded as a specialization of the I-type
+format. The operand to be shifted is in _rs1_, and the shift amount is
+encoded in the lower 5 bits of the I-immediate field. The right shift
+type is encoded in bit 30. SLLI is a logical left shift (zeros are
+shifted into the lower bits); SRLI is a logical right shift (zeros are
+shifted into the upper bits); and SRAI is an arithmetic right shift (the
+original sign bit is copied into the vacated upper bits).
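+
+The distinction between the two right shifts can be sketched in Python
+(our model, 32-bit values):
+
+[source,python]
+....
+def srli32(rs1, shamt):
+    return (rs1 & 0xFFFFFFFF) >> shamt        # zeros enter from the top
+
+def srai32(rs1, shamt):
+    v = rs1 & 0xFFFFFFFF
+    if v & 0x80000000:
+        v -= 1 << 32                          # reinterpret as signed
+    return (v >> shamt) & 0xFFFFFFFF          # sign bit is replicated
+....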
+
+
+include::images/wavedrom/int-comp-lui-aiupc.adoc[]
+[[int-comp-lui-aiupc]]
+.Integer register-immediate, U-immediate
+image::image_placeholder.png[]
+
+
+LUI (load upper immediate) is used to build 32-bit constants and uses
+the U-type format. LUI places the 32-bit U-immediate value into the
+destination register _rd_, filling in the lowest 12 bits with zeros.
+
+AUIPC (add upper immediate to `pc`) is used to build `pc`-relative
+addresses and uses the U-type format. AUIPC forms a 32-bit offset from
+the U-immediate, filling in the lowest 12 bits with zeros, adds this
+offset to the address of the AUIPC instruction, then places the result
+in register _rd_.
+
+The assembly syntax for `lui` and `auipc` does not represent the lower
+12 bits of the U-immediate, which are always zero.
+
+The AUIPC instruction supports two-instruction sequences to access
+arbitrary offsets from the PC for both control-flow transfers and data
+accesses. The combination of an AUIPC and the 12-bit immediate in a JALR
+can transfer control to any 32-bit PC-relative address, while an AUIPC
+plus the 12-bit immediate offset in regular load or store instructions
+can access any 32-bit PC-relative data address.
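+
+Because the 12-bit immediates added by ADDI, loads, stores, and JALR are
+sign-extended, materializing an arbitrary 32-bit constant with LUI
+requires rounding the upper part up when bit 11 of the constant is set.
+A Python sketch of the split (our helper, analogous to the assembler's
+%hi/%lo relocation operators):
+
+[source,python]
+....
+def hi_lo(const):
+    # Split a 32-bit constant into LUI (hi) and ADDI (lo) parts.
+    lo = const & 0xFFF
+    if lo & 0x800:
+        lo -= 0x1000                 # ADDI sign-extends; compensate
+    hi = (const - lo) & 0xFFFFFFFF   # multiple of 0x1000, fed to LUI
+    return hi >> 12, lo
+....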
+
+The current PC can be obtained by setting the U-immediate to 0. Although
+a JAL +4 instruction could also be used to obtain the local PC (of the
+instruction following the JAL), it might cause pipeline breaks in
+simpler microarchitectures or pollute BTB structures in more complex
+microarchitectures.
+
+==== Integer Register-Register Operations
+
+RV32I defines several arithmetic R-type operations. All operations read
+the _rs1_ and _rs2_ registers as source operands and write the result
+into register _rd_. The _funct7_ and _funct3_ fields select the type of
+operation.
+
+include::images/wavedrom/int_reg-reg.adoc[]
+[[int-reg-reg]]
+.Integer register-register
+image::image_placeholder.png[]
+
+ADD performs the addition of _rs1_ and _rs2_. SUB performs the
+subtraction of _rs2_ from _rs1_. Overflows are ignored and the low XLEN
+bits of results are written to the destination _rd_. SLT and SLTU
+perform signed and unsigned compares respectively, writing 1 to _rd_ if
+_rs1_ < _rs2_, 0 otherwise. Note, SLTU _rd_, _x0_, _rs2_ sets _rd_ to 1 if
+_rs2_ is not equal to zero, otherwise sets _rd_ to zero (assembler
+pseudoinstruction SNEZ _rd, rs_). AND, OR, and XOR perform bitwise
+logical operations.
+
+SLL, SRL, and SRA perform logical left, logical right, and arithmetic
+right shifts on the value in register _rs1_ by the shift amount held in
+the lower 5 bits of register _rs2_.
+
+==== NOP Instruction
+
+include::images/wavedrom/nop.adoc[]
+[[nop]]
+.NOP instructions
+image::image_placeholder.png[]
+
+The NOP instruction does not change any architecturally visible state,
+except for advancing the `pc` and incrementing any applicable
+performance counters. NOP is encoded as ADDI _x0, x0, 0_.
+
+NOPs can be used to align code segments to microarchitecturally
+significant address boundaries, or to leave space for inline code
+modifications. Although there are many possible ways to encode a NOP, we
+define a canonical NOP encoding to allow microarchitectural
+optimizations as well as for more readable disassembly output. The other
+NOP encodings are made available for HINT instructions
+(Section <<rv32i-hints>>).
+
+ADDI was chosen for the NOP encoding as this is most likely to take
+fewest resources to execute across a range of systems (if not optimized
+away in decode). In particular, the instruction only reads one register.
+Also, an ADDI functional unit is more likely to be available in a
+superscalar design as adds are the most common operation. In particular,
+address-generation functional units can execute ADDI using the same
+hardware needed for base+offset address calculations, while
+register-register ADD or logical/shift operations require additional
+hardware.
+
+=== Control Transfer Instructions
+
+RV32I provides two types of control transfer instructions: unconditional
+jumps and conditional branches. Control transfer instructions in RV32I
+do _not_ have architecturally visible delay slots.
+
+If an instruction access-fault or instruction page-fault exception
+occurs on the target of a jump or taken branch, the exception is
+reported on the target instruction, not on the jump or branch
+instruction.
+
+==== Unconditional Jumps
+
+The jump and link (JAL) instruction uses the J-type format, where the
+J-immediate encodes a signed offset in multiples of 2 bytes. The offset
+is sign-extended and added to the address of the jump instruction to
+form the jump target address. Jumps can therefore target a
+latexmath:[$\pm$]1 MiB range. JAL stores the address of the instruction
+following the jump (`pc`+4) into register _rd_. The standard software
+calling convention uses `x1` as the return address register and `x5` as
+an alternate link register.
+
+The alternate link register supports calling millicode routines (e.g.,
+those to save and restore registers in compressed code) while preserving
+the regular return address register. The register `x5` was chosen as the
+alternate link register as it maps to a temporary in the standard
+calling convention, and has an encoding that is only one bit different
+than the regular link register.
+
+Plain unconditional jumps (assembler pseudoinstruction J) are encoded as
+a JAL with _rd_=`x0`.
+
+include::images/wavedrom/ct-unconditional.adoc[]
+[[ct-unconditional]]
+.Plain unconditional jumps
+image::image_placeholder.png[]
+
+The indirect jump instruction JALR (jump and link register) uses the
+I-type encoding. The target address is obtained by adding the
+sign-extended 12-bit I-immediate to the register _rs1_, then setting the
+least-significant bit of the result to zero. The address of the
+instruction following the jump (`pc`+4) is written to register _rd_.
+Register `x0` can be used as the destination if the result is not
+required.
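+
+The target and link-value computation can be sketched as follows
+(Python, our model, assuming XLEN=32):
+
+[source,python]
+....
+def jalr(pc, rs1_val, imm12):
+    imm = imm12 - (1 << 12) if imm12 & 0x800 else imm12  # sign-extend
+    target = ((rs1_val + imm) & 0xFFFFFFFF) & ~1         # clear bit 0
+    link = (pc + 4) & 0xFFFFFFFF                         # written to rd
+    return target, link
+....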
+
+include::images/wavedrom/ct-unconditional-2.adoc[]
+[[ct-unconditional-2]]
+.Indirect unconditional jump
+image::image_placeholder.png[]
+
+The unconditional jump instructions all use PC-relative addressing to
+help support position-independent code. The JALR instruction was defined
+to enable a two-instruction sequence to jump anywhere in a 32-bit
+absolute address range. A LUI instruction can first load _rs1_ with the
+upper 20 bits of a target address, then JALR can add in the lower bits.
+Similarly, AUIPC then JALR can jump anywhere in a 32-bit `pc`-relative
+address range.
+
+Note that the JALR instruction does not treat the 12-bit immediate as
+multiples of 2 bytes, unlike the conditional branch instructions. This
+avoids one more immediate format in hardware. In practice, most uses of
+JALR will have either a zero immediate or be paired with a LUI or AUIPC,
+so the slight reduction in range is not significant.
+
+Clearing the least-significant bit when calculating the JALR target
+address both simplifies the hardware slightly and allows the low bit of
+function pointers to be used to store auxiliary information. Although
+there is potentially a slight loss of error checking in this case, in
+practice jumps to an incorrect instruction address will usually quickly
+raise an exception.
+
+When used with a base _rs1_=`x0`, JALR can be used to
+implement a single instruction subroutine call to the lowest or highest
+address region from anywhere in the address space, which could be used
+to implement fast calls to a small runtime library. Alternatively, an
+ABI could dedicate a general-purpose register to point to a library
+elsewhere in the address space.
+
+The JAL and JALR instructions will generate an
+instruction-address-misaligned exception if the target address is not
+aligned to a four-byte boundary.
+
+Instruction-address-misaligned exceptions are not possible on machines
+that support extensions with 16-bit aligned instructions, such as the
+compressed instruction-set extension, C.
+
+Return-address prediction stacks are a common feature of
+high-performance instruction-fetch units, but require accurate detection
+of instructions used for procedure calls and returns to be effective.
+For RISC-V, hints as to the instructions’ usage are encoded implicitly
+via the register numbers used. A JAL instruction should push the return
+address onto a return-address stack (RAS) only when _rd_ is `x1` or
+`x5`. JALR instructions should push/pop a RAS as shown in <<rashints>>.
+
+[[rashints]]
+.Return-address stack prediction hints encoded in the register operands
+of a JALR instruction.
+[cols="^,^,^,<",options="header",]
+|===
+|_rd_ is `x1`/`x5` |_rs1_ is `x1`/`x5` |_rd_=_rs1_ |RAS action
+|No |No |– |None
+
+|No |Yes |– |Pop
+
+|Yes |No |– |Push
+
+|Yes |Yes |No |Pop, then push
+
+|Yes |Yes |Yes |Push
+|===
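+
+The table reduces to a small decision function; this Python sketch
+(ours) returns the RAS action for given _rd_/_rs1_ register numbers:
+
+[source,python]
+....
+LINK_REGS = {1, 5}                     # x1 and x5
+
+def ras_action(rd, rs1):
+    rd_link = rd in LINK_REGS
+    rs1_link = rs1 in LINK_REGS
+    if not rd_link:
+        return "pop" if rs1_link else "none"
+    if not rs1_link:
+        return "push"
+    return "push" if rd == rs1 else "pop, then push"
+....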
+
+Some other ISAs added explicit hint bits to their indirect-jump
+instructions to guide return-address stack manipulation. We use implicit
+hinting tied to register numbers and the calling convention to reduce
+the encoding space used for these hints.
+
+When two different link registers (`x1` and `x5`) are given as _rs1_ and
+_rd_, then the RAS is both popped and pushed to support coroutines. If
+_rs1_ and _rd_ are the same link register (either `x1` or `x5`), the RAS
+is only pushed to enable macro-op fusion of the sequences:
+`lui ra, imm20; jalr ra, imm12(ra)` and
+`auipc ra, imm20; jalr ra, imm12(ra)`.
+
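The hint rules in the table above can be captured in a few lines. The following Python sketch (ours, purely illustrative; the function name is not part of the specification) models the RAS action a predictor would take for a given JALR:

```python
LINK_REGS = {1, 5}  # x1 (ra) and x5 (t0) serve as link registers

def jalr_ras_action(rd: int, rs1: int) -> str:
    """Return-address-stack action implied by a JALR's register operands."""
    rd_link, rs1_link = rd in LINK_REGS, rs1 in LINK_REGS
    if not rd_link:
        return "pop" if rs1_link else "none"
    if not rs1_link:
        return "push"
    # both rd and rs1 are link registers
    return "push" if rd == rs1 else "pop, then push"

assert jalr_ras_action(0, 1) == "pop"             # ret: jalr x0, 0(x1)
assert jalr_ras_action(1, 6) == "push"            # indirect call: jalr x1, 0(x6)
assert jalr_ras_action(5, 1) == "pop, then push"  # coroutine swap between x1 and x5
assert jalr_ras_action(1, 1) == "push"            # lui/auipc + jalr fusion case
```
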
+==== Conditional Branches
+
+All branch instructions use the B-type instruction format. The 12-bit
+B-immediate encodes signed offsets in multiples of 2 bytes. The offset
+is sign-extended and added to the address of the branch instruction to
+give the target address. The conditional branch range is
+latexmath:[$\pm4$] KiB.
+
+include::images/wavedrom/ct-conditional.adoc[]
+[[ct-conditional]]
+.Conditional branches
+image::image_placeholder.png[]
+
+Branch instructions compare two registers. BEQ and BNE take the branch
+if registers _rs1_ and _rs2_ are equal or unequal respectively. BLT and
+BLTU take the branch if _rs1_ is less than _rs2_, using signed and
+unsigned comparison respectively. BGE and BGEU take the branch if _rs1_
+is greater than or equal to _rs2_, using signed and unsigned comparison
+respectively. Note, BGT, BGTU, BLE, and BLEU can be synthesized by
+reversing the operands to BLT, BLTU, BGE, and BGEU, respectively.
+
+Signed array bounds may be checked with a single BLTU instruction, since
+any negative index will compare greater than any nonnegative bound.
+
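The single-compare trick works because a negative index, reinterpreted as an unsigned XLEN-bit value, is larger than any nonnegative bound. A quick model (hypothetical helper, Python used only for illustration):

```python
def bltu_bounds_check(index: int, bound: int, xlen: int = 32) -> bool:
    """Model of one BLTU: both operands are treated as unsigned XLEN-bit values."""
    mask = (1 << xlen) - 1
    return (index & mask) < (bound & mask)

assert bltu_bounds_check(3, 10)        # in bounds
assert not bltu_bounds_check(10, 10)   # one past the end
assert not bltu_bounds_check(-1, 10)   # negative index wraps to 0xFFFFFFFF
```
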
+Software should be optimized such that the sequential code path is the
+most common path, with less-frequently taken code paths placed out of
+line. Software should also assume that backward branches will be
+predicted taken and forward branches as not taken, at least the first
+time they are encountered. Dynamic predictors should quickly learn any
+predictable branch behavior.
+
+Unlike some other architectures, the RISC-V jump (JAL with _rd_=`x0`)
+instruction should always be used for unconditional branches instead of
+a conditional branch instruction with an always-true condition. RISC-V
+jumps are also PC-relative and support a much wider offset range than
+branches, and will not pollute conditional-branch prediction tables.
+
+The conditional branches were designed to include arithmetic comparison
+operations between two registers (as also done in PA-RISC, Xtensa, and
+MIPS R6), rather than use condition codes (x86, ARM, SPARC, PowerPC), or
+to only compare one register against zero (Alpha, MIPS), or two
+registers only for equality (MIPS). This design was motivated by the
+observation that a combined compare-and-branch instruction fits into a
+regular pipeline, avoids additional condition code state or use of a
+temporary register, and reduces static code size and dynamic instruction
+fetch traffic. Another point is that comparisons against zero require
+non-trivial circuit delay (especially after the move to static logic in
+advanced processes) and so are almost as expensive as arithmetic
+magnitude compares. Another advantage of a fused compare-and-branch
+instruction is that branches are observed earlier in the front-end
+instruction stream, and so can be predicted earlier. There is perhaps an
+advantage to a design with condition codes in the case where multiple
+branches can be taken based on the same condition codes, but we believe
+this case to be relatively rare.
+
+We considered but did not include static branch hints in the instruction
+encoding. These can reduce the pressure on dynamic predictors, but
+require more instruction encoding space and software profiling for best
+results, and can result in poor performance if production runs do not
+match profiling runs.
+
+We considered but did not include conditional moves or predicated
+instructions, which can effectively replace unpredictable short forward
+branches. Conditional moves are the simpler of the two, but are
+difficult to use with conditional code that might cause exceptions
+(memory accesses and floating-point operations). Predication adds
+additional flag state to a system, additional instructions to set and
+clear flags, and additional encoding overhead on every instruction. Both
+conditional move and predicated instructions add complexity to
+out-of-order microarchitectures, adding an implicit third source operand
+due to the need to copy the original value of the destination
+architectural register into the renamed destination physical register if
+the predicate is false. Also, static compile-time decisions to use
+predication instead of branches can result in lower performance on
+inputs not included in the compiler training set, especially given that
+unpredictable branches are rare, and becoming rarer as branch prediction
+techniques improve.
+
+We note that various microarchitectural techniques exist to dynamically
+convert unpredictable short forward branches into internally predicated
+code to avoid the cost of flushing pipelines on a branch mispredict cite:[heil-tr1996], cite:[Klauser-1998], cite:[Kim-micro2005] and
+have been implemented in commercial processors cite:[ibmpower7]. The simplest techniques
+just reduce the penalty of recovering from a mispredicted short forward
+branch by only flushing instructions in the branch shadow instead of the
+entire fetch pipeline, or by fetching instructions from both sides using
+wide instruction fetch or idle instruction fetch slots. More complex
+techniques for out-of-order cores add internal predicates on
+instructions in the branch shadow, with the internal predicate value
+written by the branch instruction, allowing the branch and following
+instructions to be executed speculatively and out-of-order with respect
+to other code.
+
+The conditional branch instructions will generate an
+instruction-address-misaligned exception if the target address is not
+aligned to a four-byte boundary and the branch condition evaluates to
+true. If the branch condition evaluates to false, the
+instruction-address-misaligned exception will not be raised.
+
+Instruction-address-misaligned exceptions are not possible on machines
+that support extensions with 16-bit aligned instructions, such as the
+compressed instruction-set extension, C.
+
+[[ldst]]
+=== Load and Store Instructions
+
+RV32I is a load-store architecture, where only load and store
+instructions access memory and arithmetic instructions only operate on
+CPU registers. RV32I provides a 32-bit address space that is
+byte-addressed. The EEI will define what portions of the address space
+are legal to access with which instructions (e.g., some addresses might
+be read only, or support word access only). Loads with a destination of
+`x0` must still raise any exceptions and cause any other side effects
+even though the load value is discarded.
+
+The EEI will define whether the memory system is little-endian or
+big-endian. In RISC-V, endianness is byte-address invariant.
+
+In a system for which endianness is byte-address invariant, the
+following property holds: if a byte is stored to memory at some address
+in some endianness, then a byte-sized load from that address in any
+endianness returns the stored value.
+
+In a little-endian configuration, multibyte stores write the
+least-significant register byte at the lowest memory byte address,
+followed by the other register bytes in ascending order of their
+significance. Loads similarly transfer the contents of the lesser memory
+byte addresses to the less-significant register bytes.
+
+In a big-endian configuration, multibyte stores write the
+most-significant register byte at the lowest memory byte address,
+followed by the other register bytes in descending order of their
+significance. Loads similarly transfer the contents of the greater
+memory byte addresses to the less-significant register bytes.
+
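Byte-address invariance can be demonstrated with a short sketch, using Python's `struct` module to stand in for the two memory configurations:

```python
import struct

value = 0x11223344
# A little-endian store puts the least-significant byte at the lowest address;
# a big-endian store puts the most-significant byte there.
assert struct.pack("<I", value) == bytes([0x44, 0x33, 0x22, 0x11])
assert struct.pack(">I", value) == bytes([0x11, 0x22, 0x33, 0x44])

# Byte-address invariance: a one-byte store followed by a one-byte load from
# the same address returns the stored value in either endianness, because byte
# order only governs how multibyte values map onto byte addresses.
mem = bytearray(4)
mem[2] = 0xAB          # byte store at address 2
assert mem[2] == 0xAB  # byte load sees the same value regardless of endianness
```
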
+include::images/wavedrom/load_store.adoc[]
+[[load-store,load and store]]
+.Load and store instructions
+image::image_placeholder.png[]
+
+Load and store instructions transfer a value between the registers and
+memory. Loads are encoded in the I-type format and stores are S-type.
+The effective address is obtained by adding register _rs1_ to the
+sign-extended 12-bit offset. Loads copy a value from memory to register
+_rd_. Stores copy the value in register _rs2_ to memory.
+
+The LW instruction loads a 32-bit value from memory into _rd_. LH loads
+a 16-bit value from memory, then sign-extends to 32-bits before storing
+in _rd_. LHU loads a 16-bit value from memory but then zero extends to
+32-bits before storing in _rd_. LB and LBU are defined analogously for
+8-bit values. The SW, SH, and SB instructions store 32-bit, 16-bit, and
+8-bit values from the low bits of register _rs2_ to memory.
+
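The addressing and extension rules above can be sketched as a small behavioral model (illustrative Python, not normative; the helper names are ours):

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return ((value & ((1 << bits) - 1)) ^ sign) - sign

def load_rv32(mem: bytes, rs1: int, imm12: int, size: int, signed: bool) -> int:
    """Model of LB/LBU/LH/LHU/LW on a little-endian memory."""
    addr = (rs1 + sign_extend(imm12, 12)) & 0xFFFFFFFF  # effective address
    raw = int.from_bytes(mem[addr:addr + size], "little")
    return sign_extend(raw, 8 * size) & 0xFFFFFFFF if signed else raw

mem = bytes([0x80, 0xFF, 0x01, 0x00])
assert load_rv32(mem, 0, 0, 1, signed=True) == 0xFFFFFF80   # LB sign-extends
assert load_rv32(mem, 0, 0, 1, signed=False) == 0x80        # LBU zero-extends
assert load_rv32(mem, 4, 0xFFC, 2, signed=False) == 0xFF80  # LHU, offset -4
```
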
+Regardless of EEI, loads and stores whose effective addresses are
+naturally aligned shall not raise an address-misaligned exception. Loads
+and stores whose effective address is not naturally aligned to the
+referenced datatype (i.e., the effective address is not divisible by the
+size of the access in bytes) have behavior dependent on the EEI.
+
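Natural alignment is simply divisibility of the effective address by the access size in bytes, e.g.:

```python
def naturally_aligned(addr: int, size_bytes: int) -> bool:
    """True when the address is a multiple of the access size in bytes."""
    return addr % size_bytes == 0

assert naturally_aligned(0x1000, 4)      # LW at a word boundary: never traps
assert not naturally_aligned(0x1002, 4)  # misaligned LW: behavior is EEI-defined
assert naturally_aligned(0x1002, 2)      # the same address is fine for an LH
```
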
+An EEI may guarantee that misaligned loads and stores are fully
+supported, and so the software running inside the execution environment
+will never experience a contained or fatal address-misaligned trap. In
+this case, the misaligned loads and stores can be handled in hardware,
+or via an invisible trap into the execution environment implementation,
+or possibly a combination of hardware and invisible trap depending on
+address.
+
+An EEI may not guarantee misaligned loads and stores are handled
+invisibly. In this case, loads and stores that are not naturally aligned
+may either complete execution successfully or raise an exception. The
+exception raised can be either an address-misaligned exception or an
+access-fault exception. For a memory access that would otherwise be able
+to complete except for the misalignment, an access-fault exception can
+be raised instead of an address-misaligned exception if the misaligned
+access should not be emulated, e.g., if accesses to the memory region
+have side effects. When an EEI does not guarantee misaligned loads and
+stores are handled invisibly, the EEI must define if exceptions caused
+by address misalignment result in a contained trap (allowing software
+running inside the execution environment to handle the trap) or a fatal
+trap (terminating execution).
+
+Misaligned accesses are occasionally required when porting legacy code,
+and help performance on applications when using any form of packed-SIMD
+extension or handling externally packed data structures. Our rationale
+for allowing EEIs to choose to support misaligned accesses via the
+regular load and store instructions is to simplify the addition of
+misaligned hardware support. One option would have been to disallow
+misaligned accesses in the base ISAs and then provide some separate ISA
+support for misaligned accesses, either special instructions to help
+software handle misaligned accesses or a new hardware addressing mode
+for misaligned accesses. Special instructions are difficult to use,
+complicate the ISA, and often add new processor state (e.g., SPARC VIS
+align address offset register) or complicate access to existing
+processor state (e.g., MIPS LWL/LWR partial register writes). In
+addition, for loop-oriented packed-SIMD code, the extra overhead when
+operands are misaligned motivates software to provide multiple forms of
+loop depending on operand alignment, which complicates code generation
+and adds to loop startup overhead. New misaligned hardware addressing
+modes take considerable space in the instruction encoding or require
+very simplified addressing modes (e.g., register indirect only).
+
+Even when misaligned loads and stores complete successfully, these
+accesses might run extremely slowly depending on the implementation
+(e.g., when implemented via an invisible trap). Furthermore, whereas
+naturally aligned loads and stores are guaranteed to execute atomically,
+misaligned loads and stores might not, and hence require additional
+synchronization to ensure atomicity.
+
+We do not mandate atomicity for misaligned accesses so execution
+environment implementations can use an invisible machine trap and a
+software handler to handle some or all misaligned accesses. If hardware
+misaligned support is provided, software can exploit this by simply
+using regular load and store instructions. Hardware can then
+automatically optimize accesses depending on whether runtime addresses
+are aligned.
+
+[[fence]]
+=== Memory Ordering Instructions
+
+include::images/wavedrom/mem_order.adoc[]
+[[mem-order]]
+.Memory ordering instructions
+image::image_placeholder.png[]
+
+The FENCE instruction is used to order device I/O and memory accesses as
+viewed by other RISC-V harts and external devices or coprocessors. Any
+combination of device input (I), device output (O), memory reads \(R),
+and memory writes (W) may be ordered with respect to any combination of
+the same. Informally, no other RISC-V hart or external device can
+observe any operation in the _successor_ set following a FENCE before
+any operation in the _predecessor_ set preceding the FENCE.
+<<memorymodel>> provides a precise description
+of the RISC-V memory consistency model.
+
+The FENCE instruction also orders memory reads and writes made by the
+hart as observed by memory reads and writes made by an external device.
+However, FENCE does not order observations of events made by an external
+device using any other signaling mechanism.
+
+A device might observe an access to a memory location via some external
+communication mechanism, e.g., a memory-mapped control register that
+drives an interrupt signal to an interrupt controller. This
+communication is outside the scope of the FENCE ordering mechanism and
+hence the FENCE instruction can provide no guarantee on when a change in
+the interrupt signal is visible to the interrupt controller. Specific
+devices might provide additional ordering guarantees to reduce software
+overhead but those are outside the scope of the RISC-V memory model.
+
+The EEI will define what I/O operations are possible, and in particular,
+which memory addresses when accessed by load and store instructions will
+be treated and ordered as device input and device output operations
+respectively rather than memory reads and writes. For example,
+memory-mapped I/O devices will typically be accessed with uncached loads
+and stores that are ordered using the I and O bits rather than the R and
+W bits. Instruction-set extensions might also describe new I/O
+instructions that will also be ordered using the I and O bits in a
+FENCE.
+
+[[fm]]
+.Fence mode encoding
+|===
+|_fm_ field |Mnemonic |Meaning
+|0000 |_none_ |Normal Fence
+|1000 |TSO |With FENCE RW,RW: exclude write-to-read ordering; otherwise: _Reserved for future use._
+|_other_ | |_Reserved for future use._
+|===
+
+The fence mode field _fm_ defines the semantics of the FENCE. A FENCE
+with _fm_=0000 orders all memory operations in its predecessor set
+before all memory operations in its successor set.
+
+The optional FENCE.TSO instruction is encoded as a FENCE instruction
+with _fm_=1000, _predecessor_=RW, and _successor_=RW. FENCE.TSO orders
+all load operations in its predecessor set before all memory operations
+in its successor set, and all store operations in its predecessor set
+before all store operations in its successor set. This leaves non-AMO
+store operations in the FENCE.TSO’s predecessor set unordered with
+non-AMO loads in its successor set.
+
+The FENCE.TSO encoding was added as an optional extension to the
+original base FENCE instruction encoding. The base definition requires
+that implementations ignore any set bits and treat the FENCE as global,
+and so this is a backwards-compatible extension.
+
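As an illustration of the encoding, this sketch (ours) splits a FENCE instruction word into its fields; `0x8330000F` is the canonical FENCE.TSO encoding (_fm_=1000, _pred_=_succ_=RW, _rs1_=_rd_=`x0`):

```python
def decode_fence(insn: int) -> dict:
    """Split a FENCE instruction word into its fm/pred/succ/rs1/rd fields."""
    assert insn & 0x7F == 0b0001111 and (insn >> 12) & 0x7 == 0  # MISC-MEM, funct3=0
    return {
        "fm":   (insn >> 28) & 0xF,   # fence mode
        "pred": (insn >> 24) & 0xF,   # predecessor set bits: PI, PO, PR, PW
        "succ": (insn >> 20) & 0xF,   # successor set bits: SI, SO, SR, SW
        "rs1":  (insn >> 15) & 0x1F,  # reserved; ignored by base implementations
        "rd":   (insn >> 7) & 0x1F,   # reserved; ignored by base implementations
    }

tso = decode_fence(0x8330000F)  # FENCE.TSO
assert tso["fm"] == 0b1000
assert tso["pred"] == 0b0011 and tso["succ"] == 0b0011  # RW before, RW after
assert tso["rs1"] == 0 and tso["rd"] == 0
```
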
+The unused fields in the FENCE instructions--_rs1_ and _rd_--are reserved
+for finer-grain fences in future extensions. For forward compatibility,
+base implementations shall ignore these fields, and standard software
+shall zero these fields. Likewise, many _fm_ and predecessor/successor
+set settings in <<fm>> are also reserved for future use.
+Base implementations shall treat all such reserved configurations as
+normal fences with _fm_=0000, and standard software shall use only
+non-reserved configurations.
+
+We chose a relaxed memory model to allow high performance from simple
+machine implementations and from likely future coprocessor or
+accelerator extensions. We separate out I/O ordering from memory R/W
+ordering to avoid unnecessary serialization within a device-driver hart
+and also to support alternative non-memory paths to control added
+coprocessors or I/O devices. Simple implementations may additionally
+ignore the _predecessor_ and _successor_ fields and always execute a
+conservative fence on all operations.
+
+=== Environment Call and Breakpoints
+
+SYSTEM instructions are used to access system functionality that might
+require privileged access and are encoded using the I-type instruction
+format. These can be divided into two main classes: those that
+atomically read-modify-write control and status registers (CSRs), and
+all other potentially privileged instructions. CSR instructions are
+described in <<csrinsts>>, and the base
+unprivileged instructions are described in the following section.
+
+
+[TIP]
+====
+The SYSTEM instructions are defined to allow simpler implementations to
+always trap to a single software trap handler. More sophisticated
+implementations might execute more of each system instruction in
+hardware.
+====
+
+include::images/wavedrom/env_call-breakpoint.adoc[]
+[[env-call]]
+.Environment call and breakpoint instructions
+image::image_placeholder.png[]
+
+These two instructions cause a precise requested trap to the supporting
+execution environment.
+
+The ECALL instruction is used to make a service request to the execution
+environment. The EEI will define how parameters for the service request
+are passed, but usually these will be in defined locations in the
+integer register file.
+
+The EBREAK instruction is used to return control to a debugging
+environment.
+
+ECALL and EBREAK were previously named SCALL and SBREAK. The
+instructions have the same functionality and encoding, but were renamed
+to reflect that they can be used more generally than to call a
+supervisor-level operating system or debugger.
+
+EBREAK was primarily designed to be used by a debugger to cause
+execution to stop and fall back into the debugger. EBREAK is also used
+by the standard gcc compiler to mark code paths that should not be
+executed.
+
+Another use of EBREAK is to support _semihosting_, where the execution
+environment includes a debugger that can provide services over an
+alternate system call interface built around the EBREAK instruction.
+Because the RISC-V base ISAs do not provide more than one EBREAK
+instruction, RISC-V semihosting uses a special sequence of instructions
+to distinguish a semihosting EBREAK from a debugger inserted EBREAK.
+
+....
+ slli x0, x0, 0x1f # Entry NOP
+ ebreak # Break to debugger
+ srai x0, x0, 7 # NOP encoding the semihosting call number 7
+....
+
+Note that these three instructions must be 32-bit-wide instructions,
+i.e., they mustn’t be among the compressed 16-bit instructions described
+in <<compressed>>.
+
+The shift NOP instructions are still considered available for use as
+HINTs.
+
+Semihosting is a form of service call and would be more naturally
+encoded as an ECALL using an existing ABI, but this would require the
+debugger to be able to intercept ECALLs, which is a newer addition to
+the debug standard. We intend to move over to using ECALLs with a
+standard ABI, in which case, semihosting can share a service ABI with an
+existing standard.
+
+We note that ARM processors have also moved to using SVC instead of BKPT
+for semihosting calls in newer designs.
+
+=== HINT Instructions
+
+RV32I reserves a large encoding space for HINT instructions, which are
+usually used to communicate performance hints to the microarchitecture.
+Like the NOP instruction, HINTs do not change any architecturally
+visible state, except for advancing the `pc` and any applicable
+performance counters. Implementations are always allowed to ignore the
+encoded hints.
+
+Most RV32I HINTs are encoded as integer computational instructions with
+_rd_=x0. The other RV32I HINTs are encoded as FENCE instructions with
+a null predecessor or successor set and with _fm_=0.
+
+These HINT encodings have been chosen so that simple implementations can
+ignore HINTs altogether, and instead execute a HINT as a regular
+instruction that happens not to mutate the architectural state. For
+example, ADD is a HINT if the destination register is `x0`; the five-bit
+_rs1_ and _rs2_ fields encode arguments to the HINT. However, a simple
+implementation can simply execute the HINT as an ADD of _rs1_ and _rs2_
+that writes `x0`, which has no architecturally visible effect.
+
+As another example, a FENCE instruction with a zero _pred_ field and a
+zero _fm_ field is a HINT; the _succ_, _rs1_, and _rd_ fields encode the
+arguments to the HINT. A simple implementation can simply execute the
+HINT as a FENCE that orders the null set of prior memory accesses before
+whichever subsequent memory accesses are encoded in the _succ_ field.
+Since the intersection of the predecessor and successor sets is null,
+the instruction imposes no memory orderings, and so it has no
+architecturally visible effect.
+
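To make the ADD example concrete, here is a small decoder sketch (ours, illustrative only) that recognizes the ADD-with-_rd_=`x0` HINT code point:

```python
def is_add_hint(insn: int) -> bool:
    """True for ADD (OP opcode, funct3=0, funct7=0) whose destination is x0."""
    opcode = insn & 0x7F
    rd = (insn >> 7) & 0x1F
    funct3 = (insn >> 12) & 0x7
    funct7 = (insn >> 25) & 0x7F
    return opcode == 0b0110011 and funct3 == 0 and funct7 == 0 and rd == 0

assert is_add_hint(0x00208033)      # add x0, x1, x2: a HINT code point
assert not is_add_hint(0x002081B3)  # add x3, x1, x2: an ordinary ADD
```
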
+<<t-rv32i-hints>> lists all RV32I HINT code points. 91% of the
+HINT space is reserved for standard HINTs. The remainder of the HINT
+space is designated for custom HINTs: no standard HINTs will ever be
+defined in this subspace.
+
+[TIP]
+====
+We anticipate standard hints to eventually include memory-system spatial
+and temporal locality hints, branch prediction hints, thread-scheduling
+hints, security tags, and instrumentation flags for
+simulation/emulation.
+====
+
+[[t-rv32i-hints]]
+.RV32I HINT instructions.
+[cols="<,<,^,<",options="header"]
+|===
+|Instruction |Constraints |Code Points |Purpose
+
+|LUI |_rd_=`x0` |latexmath:[$2^{20}$] .18+<.>m|_Reserved for future standard use_
+
+|AUIPC |_rd_=`x0` |latexmath:[$2^{20}$]
+
+|ADDI |_rd_=`x0`, and either _rs1_latexmath:[$\neq$]`x0` or _imm_latexmath:[$\neq$]0 |latexmath:[$2^{17}-1$]
+
+|ANDI |_rd_=`x0` |latexmath:[$2^{17}$]
+
+|ORI |_rd_=`x0` |latexmath:[$2^{17}$]
+
+|XORI |_rd_=`x0` |latexmath:[$2^{17}$]
+
+|ADD |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SUB |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|AND |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|OR |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|XOR |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SLL |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SRL |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SRA |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|FENCE |_rd_=`x0`, _rs1_latexmath:[$\neq$]`x0`, _fm_=0, and either _pred_=0 or _succ_=0 |latexmath:[$2^{10}-63$]
+
+|FENCE |_rd_latexmath:[$\neq$]`x0`, _rs1_=`x0`, _fm_=0, and either _pred_=0 or _succ_=0 |latexmath:[$2^{10}-63$]
+
+|FENCE |_rd_=_rs1_=`x0`, _fm_=0, _pred_=0, _succ_latexmath:[$\neq$]0 |15
+
+|FENCE |_rd_=_rs1_=`x0`, _fm_=0, _pred_latexmath:[$\neq$]W, _succ_=0 |15
+
+|FENCE |_rd_=_rs1_=`x0`, _fm_=0, _pred_=W, _succ_=0 |1 |PAUSE
+
+|SLTI |_rd_=`x0` |latexmath:[$2^{17}$] .7+<.>m|_Designated for custom use_
+
+|SLTIU|_rd_=`x0` |latexmath:[$2^{17}$]
+
+|SLLI |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SRLI |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SRAI |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SLT |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SLTU |_rd_=`x0` |latexmath:[$2^{10}$]
+|===
diff --git a/src/rv32e.adoc b/src/rv32e.adoc
new file mode 100644
index 0000000..775771f
--- /dev/null
+++ b/src/rv32e.adoc
@@ -0,0 +1,59 @@
+[[rv32e]]
+== RV32E Base Integer Instruction Set, Version 1.9
+
+This chapter describes a draft proposal for the RV32E base integer
+instruction set, which is a reduced version of RV32I designed for
+embedded systems. The only change is to reduce the number of integer
+registers to 16. This chapter only outlines the differences between
+RV32E and RV32I, and so should be read after <<rv32>>.
+(((RV32E, design)))
+
+[NOTE]
+====
+RV32E was designed to provide an even smaller base core for embedded
+microcontrollers. Although we had mentioned this possibility in version
+2.0 of this document, we initially resisted defining this subset.
+However, given the demand for the smallest possible 32-bit
+microcontroller, and in the interests of preempting fragmentation in
+this space, we have now defined RV32E as a fourth standard base ISA in
+addition to RV32I, RV64I, and RV128I. There is also interest in defining
+an RV64E to reduce context state for highly threaded 64-bit processors.
+====
+
+=== RV32E Programmers’ Model
+
+RV32E reduces the integer register count to 16 general-purpose
+registers, (`x0`–`x15`), where `x0` is a dedicated zero register.
+
+[TIP]
+====
+We have found that in the small RV32I core designs, the upper 16
+registers consume around one quarter of the total area of the core
+excluding memories, thus their removal saves around 25% core area with a
+corresponding core power reduction.
+
+This change requires a different calling convention and ABI. In
+particular, RV32E is only used with a soft-float calling convention. A
+new embedded ABI is under consideration that would work across RV32E and
+RV32I.
+====
+
+=== RV32E Instruction Set
+(((RV32E, difference from RV32I)))
+
+RV32E uses the same instruction-set encoding as RV32I, except that only
+registers `x0`–`x15` are provided. Any future standard extensions will
+not make use of the instruction bits freed up by the reduced
+register-specifier fields and so these are designated for custom
+extensions.
+
+[NOTE]
+====
+RV32E can be combined with all current standard extensions. Defining the
+F, D, and Q extensions as having a 16-entry floating point register file
+when combined with RV32E was considered but decided against. To support
+systems with reduced floating-point register state, we intend to define
+a `Zfinx` extension that makes floating-point computations use the
+integer registers, removing the floating-point loads, stores, and moves
+between floating point and integer registers.
+====
diff --git a/src/rv64.adoc b/src/rv64.adoc
new file mode 100644
index 0000000..748fbc3
--- /dev/null
+++ b/src/rv64.adoc
@@ -0,0 +1,252 @@
+[[rv64]]
+== RV64I Base Integer Instruction Set, Version 2.1
+
+This chapter describes the RV64I base integer instruction set, which
+builds upon the RV32I variant described in <<rv32>>.
+This chapter presents only the differences with RV32I, so should be read
+in conjunction with the earlier chapter.
+
+=== Register State
+
+RV64I widens the integer registers and supported user address space to
+64 bits (XLEN=64 in <<gprs>>).
+
+=== Integer Computational Instructions
+
+Most integer computational instructions operate on XLEN-bit values.
+Additional instruction variants are provided to manipulate 32-bit values
+in RV64I, indicated by a `W` suffix to the opcode. These `\*W`
+instructions ignore the upper 32 bits of their inputs and always produce
+32-bit signed values, sign-extending them to 64 bits, i.e. bits XLEN-1
+through 31 are equal.
+
+The compiler and calling convention maintain an invariant that all
+32-bit values are held in a sign-extended format in 64-bit registers.
+Even 32-bit unsigned integers extend bit 31 into bits 63 through 32.
+Consequently, conversion between unsigned and signed 32-bit integers is
+a no-op, as is conversion from a signed 32-bit integer to a signed
+64-bit integer. Existing 64-bit wide SLTU and unsigned branch compares
+still operate correctly on unsigned 32-bit integers under this
+invariant. Similarly, existing 64-bit wide logical operations on 32-bit
+sign-extended integers preserve the sign-extension property. A few new
+instructions (ADD[I]W/SUBW/SxxW) are required for addition and shifts to
+ensure reasonable performance for 32-bit values.
+(((RV64I, shifts)))
+(((RV64I, compares)))
+
+==== Integer Register-Immediate Instructions
+
+include::images/wavedrom/rv64i-base-int.adoc[]
+[[rv64i-base-int]]
+.RV64I register-immediate instructions
+image::image_placeholder.png[]
+
+ADDIW is an RV64I instruction that adds the sign-extended 12-bit
+immediate to register _rs1_ and produces the proper sign-extension of a
+32-bit result in _rd_. Overflows are ignored and the result is the low
+32 bits of the result sign-extended to 64 bits. Note, ADDIW _rd, rs1, 0_
+writes the sign-extension of the lower 32 bits of register _rs1_ into
+register _rd_ (assembler pseudoinstruction SEXT.W).
+//`the following diagram doesn't match the tex spec`
+
+include::images/wavedrom/rv64i-addiw.adoc[]
+[[rv64i-addiw]]
+.RV64I ADDIW instruction
+image::image_placeholder.png[]
+
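A behavioral model of ADDIW (illustrative Python, with 64-bit register values carried as unsigned integers):

```python
MASK64 = (1 << 64) - 1

def sext(value: int, bits: int) -> int:
    """Interpret the low `bits` of value as two's complement."""
    sign = 1 << (bits - 1)
    return ((value & ((1 << bits) - 1)) ^ sign) - sign

def addiw(rs1: int, imm12: int) -> int:
    """ADDIW: 32-bit add of rs1 and the sign-extended immediate, then sign-extend."""
    result32 = (rs1 + sext(imm12, 12)) & 0xFFFFFFFF
    return sext(result32, 32) & MASK64

assert addiw(0x7FFFFFFF, 1) == 0xFFFFFFFF80000000  # overflow into bit 31 sign-extends
assert addiw(0x00000000FFFFFFFF, 0) == MASK64      # ADDIW rd, rs1, 0 acts as SEXT.W
```
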
+Shifts by a constant are encoded as a specialization of the I-type
+format using the same instruction opcode as RV32I. The operand to be
+shifted is in _rs1_, and the shift amount is encoded in the lower 6 bits
+of the I-immediate field for RV64I. The right shift type is encoded in
+bit 30. SLLI is a logical left shift (zeros are shifted into the lower
+bits); SRLI is a logical right shift (zeros are shifted into the upper
+bits); and SRAI is an arithmetic right shift (the original sign bit is
+copied into the vacated upper bits).
+(((RV64I, SLLI)))
+(((RV64I, SRAIW)))
+(((RV64I, SRLIW)))
+(((RV64I, RV64I-only)))
+
+
+SLLIW, SRLIW, and SRAIW are RV64I-only instructions that are analogously
+defined but operate on 32-bit values and sign-extend their 32-bit
+results to 64 bits. SLLIW, SRLIW, and SRAIW encodings with
+latexmath:[$imm[5] \neq 0$] are reserved.
+
+[NOTE]
+====
+Previously, SLLIW, SRLIW, and SRAIW with latexmath:[$imm[5] \neq 0$]
+were defined to cause illegal instruction exceptions, whereas now they
+are marked as reserved. This is a backwards-compatible change.
+====
+
+include::images/wavedrom/rv64_lui-auipc.adoc[]
+[[rv64_lui-auipc]]
+.RV64I LUI and AUIPC instructions
+image::image_placeholder.png[]
+
+LUI (load upper immediate) uses the same opcode as RV32I. LUI places the
+32-bit U-immediate into register _rd_, filling in the lowest 12 bits
+with zeros. The 32-bit result is sign-extended to 64 bits.
+
+AUIPC (add upper immediate to `pc`) uses the same opcode as RV32I. AUIPC
+is used to build `pc`-relative addresses and uses the U-type format.
+AUIPC forms a 32-bit offset from the U-immediate, filling in the lowest
+12 bits with zeros, sign-extends the result to 64 bits, adds it to the
+address of the AUIPC instruction, then places the result in register
+_rd_.
+
+Note that the set of address offsets that can be formed by pairing LUI
+with LD, AUIPC with JALR, etc. in RV64I is
+[latexmath:[${-}2^{31}{-}2^{11}$], latexmath:[$2^{31}{-}2^{11}{-}1$]].
+
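The end points of that interval can be checked mechanically; the sketch below (ours) sign-extends a 32-bit LUI result and adds the extreme 12-bit immediates:

```python
def sext(value: int, bits: int) -> int:
    sign = 1 << (bits - 1)
    return ((value & ((1 << bits) - 1)) ^ sign) - sign

def lui_plus_imm12(u_imm20: int, imm12: int) -> int:
    """Offset formed by RV64 LUI (sign-extended) plus a 12-bit I/S-immediate."""
    return sext(u_imm20 << 12, 32) + sext(imm12, 12)

extremes = [lui_plus_imm12(u, i) for u in (0x80000, 0x7FFFF) for i in (0x800, 0x7FF)]
assert min(extremes) == -2**31 - 2**11
assert max(extremes) == 2**31 - 2**11 - 1
```
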
+==== Integer Register-Register Operations
+
+//`this diagrM doesn't match the tex specification`
+include::images/wavedrom/rv64i_int-reg-reg.adoc[]
+[[int_reg-reg]]
+.RV64I integer register-register instructions
+image::image_placeholder.png[]
+
+ADDW and SUBW are RV64I-only instructions that are defined analogously
+to ADD and SUB but operate on 32-bit values and produce signed 32-bit
+results. Overflows are ignored, and the low 32 bits of the result are
+sign-extended to 64 bits and written to the destination register.
+
+SLL, SRL, and SRA perform logical left, logical right, and arithmetic
+right shifts on the value in register _rs1_ by the shift amount held in
+register _rs2_. In RV64I, only the low 6 bits of _rs2_ are considered
+for the shift amount.
+
+SLLW, SRLW, and SRAW are RV64I-only instructions that are analogously
+defined but operate on 32-bit values and sign-extend their 32-bit
+results to 64 bits. The shift amount is given by _rs2[4:0]_.
+
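SRAW, the subtlest of the three, can be modeled as follows (illustrative Python; 64-bit register values as unsigned integers):

```python
MASK64 = (1 << 64) - 1

def sext(value: int, bits: int) -> int:
    sign = 1 << (bits - 1)
    return ((value & ((1 << bits) - 1)) ^ sign) - sign

def sraw(rs1: int, rs2: int) -> int:
    """SRAW: arithmetic right shift of rs1's low 32 bits, sign-extended to 64."""
    shamt = rs2 & 0x1F  # only rs2[4:0] supplies the shift amount
    return sext(sext(rs1, 32) >> shamt, 32) & MASK64

assert sraw(0x0000000080000000, 4) == 0xFFFFFFFFF8000000  # sign bit shifts in
assert sraw(0xFFFFFFFF00000010, 36) == 1                  # 36 & 31 == 4
```
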
+=== Load and Store Instructions
+
+RV64I extends the address space to 64 bits. The execution environment
+will define what portions of the address space are legal to access.
+
+include::images/wavedrom/load_store.adoc[]
+[[load_store]]
+.Load and store instructions
+image::image_placeholder.png[]
+
+The LD instruction loads a 64-bit value from memory into register _rd_
+for RV64I.
+(((RV64I, LD)))
+
+The LW instruction loads a 32-bit value from memory and sign-extends
+this to 64 bits before storing it in register _rd_ for RV64I. The LWU
+instruction, on the other hand, zero-extends the 32-bit value from
+memory for RV64I. LH and LHU are defined analogously for 16-bit values,
+as are LB and LBU for 8-bit values. The SD, SW, SH, and SB instructions
+store 64-bit, 32-bit, 16-bit, and 8-bit values from the low bits of
+register _rs2_ to memory respectively.
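The sign- versus zero-extension distinction between LW and LWU can be illustrated with a small Python model (the `load` helper is hypothetical; little-endian byte order is assumed, as in RISC-V):

```python
def load(mem, addr, width, signed):
    """Little-endian load of `width` bytes, extended to 64 bits."""
    raw = int.from_bytes(mem[addr:addr + width], "little")
    if signed and raw & (1 << (8 * width - 1)):
        raw -= 1 << (8 * width)          # sign-extend
    return raw & ((1 << 64) - 1)         # two's-complement 64-bit view

mem = bytes.fromhex("f0ffffff")          # the 32-bit value 0xFFFFFFF0
assert load(mem, 0, 4, signed=True)  == 0xFFFFFFFFFFFFFFF0   # LW
assert load(mem, 0, 4, signed=False) == 0x00000000FFFFFFF0   # LWU
```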
+
+[[sec:rv64i-hints]]
+=== HINT Instructions
+
+All instructions that are microarchitectural HINTs in RV32I (see
+<<rv32i-hints>>) are also HINTs in RV64I.
+The additional computational instructions in RV64I expand both the
+standard and custom HINT encoding spaces.
+(((RV64I, HINT)))
+
+<<rv64i-hints>> lists all RV64I HINT code points. 91% of the
+HINT space is reserved for standard HINTs, but none are presently
+defined. The remainder of the HINT space is designated for custom HINTs;
+no standard HINTs will ever be defined in this subspace.
+
+[[rv64i-hints]]
+.RV64I HINT instructions.
+[cols="<,<,^,<",options="header",]
+|===
+|Instruction |Constraints |Code Points |Purpose
+
+|LUI |_rd_=`x0` |latexmath:[$2^{20}$] .25+|_Reserved for future standard use_
+
+|AUIPC |_rd_=`x0` |latexmath:[$2^{20}$]
+
+|ADDI |_rd_=`x0`, and either _rs1_latexmath:[$\neq$]`x0` or _imm_latexmath:[$\neq$]0 |latexmath:[$2^{17}-1$]
+
+|ANDI |_rd_=`x0` |latexmath:[$2^{17}$]
+
+|ORI |_rd_=`x0` |latexmath:[$2^{17}$]
+
+|XORI |_rd_=`x0` |latexmath:[$2^{17}$]
+
+|ADDIW |_rd_=`x0` |latexmath:[$2^{17}$]
+
+|ADD |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SUB |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|AND |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|OR |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|XOR |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SLL |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SRL |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SRA |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|ADDW |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SUBW |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SLLW |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SRLW |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SRAW |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|FENCE |_rd_=`x0`, _rs1_latexmath:[$\neq$]`x0`, _fm_=0, and either _pred_=0 or _succ_=0 |latexmath:[$2^{10}-63$]
+
+|FENCE |_rd_latexmath:[$\neq$]`x0`, _rs1_=`x0`, _fm_=0, and either _pred_=0 or _succ_=0 |latexmath:[$2^{10}-63$]
+
+|FENCE |_rd_=_rs1_=`x0`, _fm_=0, _pred_=0, _succ_latexmath:[$\neq$]0 |15
+
+|FENCE |_rd_=_rs1_=`x0`, _fm_=0, _pred_latexmath:[$\neq$]W, _succ_=0 |15
+
+|FENCE |_rd_=_rs1_=`x0`, _fm_=0, _pred_=W, _succ_=0 |1
+
+|SLTI |_rd_=`x0` |latexmath:[$2^{17}$] .10+|_Designated for custom use_
+
+|SLTIU |_rd_=`x0` |latexmath:[$2^{17}$]
+
+|SLLI |_rd_=`x0` |latexmath:[$2^{11}$]
+
+|SRLI |_rd_=`x0` |latexmath:[$2^{11}$]
+
+|SRAI |_rd_=`x0` |latexmath:[$2^{11}$]
+
+|SLLIW |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SRLIW |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SRAIW |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SLT |_rd_=`x0` |latexmath:[$2^{10}$]
+
+|SLTU |_rd_=`x0` |latexmath:[$2^{10}$]
+|===
diff --git a/src/rvwmo.adoc b/src/rvwmo.adoc
new file mode 100644
index 0000000..e6bc412
--- /dev/null
+++ b/src/rvwmo.adoc
@@ -0,0 +1,831 @@
+[[memorymodel]]
+== RVWMO Memory Consistency Model, Version 2.0
+
+This chapter defines the RISC-V memory consistency model. A memory
+consistency model is a set of rules specifying the values that can be
+returned by loads of memory. RISC-V uses a memory model called `RVWMO`
+(RISC-V Weak Memory Ordering) which is designed to provide flexibility
+for architects to build high-performance scalable designs while
+simultaneously supporting a tractable programming model.
+(((design, high performance)))
+(((design, scalable)))
+
+Under RVWMO, code running on a single hart appears to execute in order
+from the perspective of other memory instructions in the same hart, but
+memory instructions from another hart may observe the memory
+instructions from the first hart being executed in a different order.
+Therefore, multithreaded code may require explicit synchronization to
+guarantee ordering between memory instructions from different harts. The
+base RISC-V ISA provides a FENCE instruction for this purpose, described
+in <<fence>>, while the atomics extension `A`
+additionally defines load-reserved/store-conditional and atomic
+read-modify-write instructions.
+(((atomics, misaligned)))
+
+The standard ISA extension for misaligned atomics `Zam`
+(<<zam>>) and the standard ISA extension for total
+store ordering `Ztso` (<<ztso>>) augment RVWMO
+with additional rules specific to those extensions.
+
+The appendices to this specification provide both axiomatic and
+operational formalizations of the memory consistency model as well as
+additional explanatory material.
+(((FENCE)))
+(((SFENCE)))
+
+This chapter defines the memory model for regular main memory
+operations. The interaction of the memory model with I/O memory,
+instruction fetches, FENCE.I, page table walks, and SFENCE.VMA is not
+(yet) formalized. Some or all of the above may be formalized in a future
+revision of this specification. The RV128 base ISA and future ISA
+extensions such as the `V` vector and `J` JIT extensions will need
+to be incorporated into a future revision as well.
+
+Memory consistency models supporting overlapping memory accesses of
+different widths simultaneously remain an active area of academic
+research and are not yet fully understood. The specifics of how memory
+accesses of different sizes interact under RVWMO are specified to the
+best of our current abilities, but they are subject to revision should
+new issues be uncovered.
+
+[[rvwmo]]
+=== Definition of the RVWMO Memory Model
+
+The RVWMO memory model is defined in terms of the _global memory order_,
+a total ordering of the memory operations produced by all harts. In
+general, a multithreaded program has many different possible executions,
+with each execution having its own corresponding global memory order.
+(((RVWMO)))
+
+The global memory order is defined over the primitive load and store
+operations generated by memory instructions. It is then subject to the
+constraints defined in the rest of this chapter. Any execution
+satisfying all of the memory model constraints is a legal execution (as
+far as the memory model is concerned).
+
+[[rvwmo-primitives]]
+==== Memory Model Primitives
+
+The _program order_ over memory operations reflects the order in which
+the instructions that generate each load and store are logically laid
+out in that hart’s dynamic instruction stream; i.e., the order in which
+a simple in-order processor would execute the instructions of that hart.
+
+Memory-accessing instructions give rise to _memory operations_. A memory
+operation can be either a _load operation_, a _store operation_, or both
+simultaneously. All memory operations are single-copy atomic: they can
+never be observed in a partially complete state.
+(((operations, memory)))
+
+Among instructions in RV32GC and RV64GC, each aligned memory instruction
+gives rise to exactly one memory operation, with two exceptions. First,
+an unsuccessful SC instruction does not give rise to any memory
+operations. Second, FLD and FSD instructions may each give rise to
+multiple memory operations if XLENlatexmath:[$<$]64, as stated in
+<<fld_fsd>> and clarified below. An aligned AMO
+gives rise to a single memory operation that is both a load operation
+and a store operation simultaneously.
+
+Instructions in the RV128 base instruction set and in future ISA
+extensions such as V (vector) and P (SIMD) may give rise to multiple
+memory operations. However, the memory model for these extensions has
+not yet been formalized.
+
+A misaligned load or store instruction may be decomposed into a set of
+component memory operations of any granularity. An FLD or FSD
+instruction for which XLENlatexmath:[$<$]64 may also be decomposed into
+a set of component memory operations of any granularity. The memory
+operations generated by such instructions are not ordered with respect
+to each other in program order, but they are ordered normally with
+respect to the memory operations generated by preceding and subsequent
+instructions in program order. The atomics extension `A` does not
+require execution environments to support misaligned atomic instructions
+at all; however, if misaligned atomics are supported via the `Zam`
+extension, LRs, SCs, and AMOs may be decomposed subject to the
+constraints of the atomicity axiom for misaligned atomics, which is
+defined in <<zam>>.
+((decomposition))
+
+The decomposition of misaligned memory operations down to byte
+granularity facilitates emulation on implementations that do not
+natively support misaligned accesses. Such implementations might, for
+example, simply iterate over the bytes of a misaligned access one by
+one.
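Such byte-by-byte emulation can be sketched as follows (illustrative Python, not normative); note how a concurrent observer may see a partially written value precisely because the component operations are unordered among themselves:

```python
def byte_ops(addr, value, width):
    """Decompose a store into byte-granularity component operations."""
    return [(addr + i, (value >> (8 * i)) & 0xFF) for i in range(width)]

ops = byte_ops(1, 0xDDCCBBAA, 4)     # a misaligned 4-byte store
mem = bytearray(8)
for a, b in ops[:2]:                 # only two component stores performed
    mem[a] = b

# Another hart loading the same word may observe a torn value, since
# no ordering is guaranteed among the component operations.
torn = int.from_bytes(mem[1:5], "little")
assert torn == 0x0000BBAA
```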
+
+An LR instruction and an SC instruction are said to be _paired_ if the
+LR precedes the SC in program order and if there are no other LR or SC
+instructions in between; the corresponding memory operations are said to
+be paired as well (except in case of a failed SC, where no store
+operation is generated). The complete list of conditions determining
+whether an SC must succeed, may succeed, or must fail is defined in
+<<lrsc>>.
+
+Load and store operations may also carry one or more ordering
+annotations from the following set: `acquire-RCpc`, `acquire-RCsc`,
+`release-RCpc`, and `release-RCsc`. An AMO or LR instruction with
+_aq_ set has an `acquire-RCsc` annotation. An AMO or SC instruction
+with _rl_ set has a `release-RCsc` annotation. An AMO, LR, or SC
+instruction with both _aq_ and _rl_ set has both `acquire-RCsc` and
+`release-RCsc` annotations.
+
+For convenience, we use the term `acquire annotation` to refer to an
+acquire-RCpc annotation or an acquire-RCsc annotation. Likewise, a
+`release annotation` refers to a release-RCpc annotation or a
+release-RCsc annotation. An `RCpc annotation` refers to an
+acquire-RCpc annotation or a release-RCpc annotation. An `RCsc
+annotation` refers to an acquire-RCsc annotation or a release-RCsc
+annotation.
+
+In the memory model literature, the term `RCpc` stands for release
+consistency with processor-consistent synchronization operations, and
+the term `RCsc` stands for release consistency with sequentially
+consistent synchronization operations.
+
+While there are many different definitions for acquire and release
+annotations in the literature, in the context of RVWMO these terms are
+concisely and completely defined by Preserved Program Order rules
+<<rcsc>>.
+
+`RCpc` annotations are currently only used when implicitly assigned to
+every memory access per the standard extension `Ztso`
+(<<ztso>>). Furthermore, although the ISA does not
+currently contain native load-acquire or store-release instructions, nor
+RCpc variants thereof, the RVWMO model itself is designed to be
+forwards-compatible with the potential addition of any or all of the
+above into the ISA in a future extension.
+
+[[mem-dependencies]]
+==== Syntactic Dependencies
+
+The definition of the RVWMO memory model depends in part on the notion
+of a syntactic dependency, defined as follows.
+
+In the context of defining dependencies, a `register` refers either to
+an entire general-purpose register, some portion of a CSR, or an entire
+CSR. The granularity at which dependencies are tracked through CSRs is
+specific to each CSR and is defined in
+<<csr-granularity>>.
+
+Syntactic dependencies are defined in terms of instructions’ _source
+registers_, instructions’ _destination registers_, and the way
+instructions _carry a dependency_ from their source registers to their
+destination registers. This section provides a general definition of all
+of these terms; however, <<source-dest-regs>> provides a
+complete listing of the specifics for each instruction.
+
+In general, a register latexmath:[$r$] other than `x0` is a _source
+register_ for an instruction latexmath:[$i$] if any of the following
+hold:
+
+* In the opcode of latexmath:[$i$], _rs1_, _rs2_, or _rs3_ is set to
+latexmath:[$r$]
+* latexmath:[$i$] is a CSR instruction, and in the opcode of
+latexmath:[$i$], _csr_ is set to latexmath:[$r$], unless latexmath:[$i$]
+is CSRRW or CSRRWI and _rd_ is set to `x0`
+* latexmath:[$r$] is a CSR and an implicit source register for
+latexmath:[$i$], as defined in <<source-dest-regs>>
+* latexmath:[$r$] is a CSR that aliases with another source register for
+latexmath:[$i$]
+
+Memory instructions also further specify which source registers are
+_address source registers_ and which are _data source registers_.
+
+In general, a register latexmath:[$r$] other than `x0` is a _destination
+register_ for an instruction latexmath:[$i$] if any of the following
+hold:
+
+* In the opcode of latexmath:[$i$], _rd_ is set to latexmath:[$r$]
+* latexmath:[$i$] is a CSR instruction, and in the opcode of
+latexmath:[$i$], _csr_ is set to latexmath:[$r$], unless latexmath:[$i$]
+is CSRRS or CSRRC and _rs1_ is set to `x0` or latexmath:[$i$] is CSRRSI
+or CSRRCI and uimm[4:0] is set to zero.
+* latexmath:[$r$] is a CSR and an implicit destination register for
+latexmath:[$i$], as defined in <<source-dest-regs>>
+* latexmath:[$r$] is a CSR that aliases with another destination
+register for latexmath:[$i$]
+
+Most non-memory instructions _carry a dependency_ from each of their
+source registers to each of their destination registers. However, there
+are exceptions to this rule; see <<source-dest-regs>>.
+
+Instruction latexmath:[$j$] has a _syntactic dependency_ on instruction
+latexmath:[$i$] via destination register latexmath:[$s$] of
+latexmath:[$i$] and source register latexmath:[$r$] of latexmath:[$j$]
+if either of the following hold:
+
+* latexmath:[$s$] is the same as latexmath:[$r$], and no instruction
+program-ordered between latexmath:[$i$] and latexmath:[$j$] has
+latexmath:[$r$] as a destination register
+* There is an instruction latexmath:[$m$] program-ordered between
+latexmath:[$i$] and latexmath:[$j$] such that all of the following hold:
+. latexmath:[$j$] has a syntactic dependency on latexmath:[$m$] via
+destination register latexmath:[$q$] and source register latexmath:[$r$]
+. latexmath:[$m$] has a syntactic dependency on latexmath:[$i$] via
+destination register latexmath:[$s$] and source register latexmath:[$p$]
+. latexmath:[$m$] carries a dependency from latexmath:[$p$] to
+latexmath:[$q$]
+
+Finally, in the definitions that follow, let latexmath:[$a$] and
+latexmath:[$b$] be two memory operations, and let latexmath:[$i$] and
+latexmath:[$j$] be the instructions that generate latexmath:[$a$] and
+latexmath:[$b$], respectively.
+
+latexmath:[$b$] has a _syntactic address dependency_ on latexmath:[$a$]
+if latexmath:[$r$] is an address source register for latexmath:[$j$] and
+latexmath:[$j$] has a syntactic dependency on latexmath:[$i$] via source
+register latexmath:[$r$]
+
+latexmath:[$b$] has a _syntactic data dependency_ on latexmath:[$a$] if
+latexmath:[$b$] is a store operation, latexmath:[$r$] is a data source
+register for latexmath:[$j$], and latexmath:[$j$] has a syntactic
+dependency on latexmath:[$i$] via source register latexmath:[$r$]
+
+latexmath:[$b$] has a _syntactic control dependency_ on latexmath:[$a$]
+if there is an instruction latexmath:[$m$] program-ordered between
+latexmath:[$i$] and latexmath:[$j$] such that latexmath:[$m$] is a
+branch or indirect jump and latexmath:[$m$] has a syntactic dependency
+on latexmath:[$i$].
+
+Generally speaking, non-AMO load instructions do not have data source
+registers, and unconditional non-AMO store instructions do not have
+destination registers. However, a successful SC instruction is
+considered to have the register specified in _rd_ as a destination
+register, and hence it is possible for an instruction to have a
+syntactic dependency on a successful SC instruction that precedes it in
+program order.
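The recursive structure of these definitions can be sketched in Python, under the simplifying assumption that every intermediate instruction carries a dependency from each of its sources to each of its destinations (the common case noted above; `prog` and `dep` are illustrative names, not part of the specification):

```python
def dep(prog, i, j, r):
    """True if instruction j has a syntactic dependency on instruction i
    via source register r of j.  prog[k] = (sources, destinations)."""
    # Base case: i writes r and no instruction between i and j overwrites it.
    if r in prog[i][1] and all(r not in prog[m][1] for m in range(i + 1, j)):
        return True
    # Recursive case: an intermediate m depends on i via one of its
    # sources p and carries that dependency through to j's source r.
    return any(dep(prog, m, j, r) and dep(prog, i, m, p)
               for m in range(i + 1, j) for p in prog[m][0])

prog = [
    (set(),  {"x1"}),   # 0: load into x1
    ({"x1"}, {"x2"}),   # 1: addi x2, x1, 0  (carries x1 -> x2)
    ({"x2"}, set()),    # 2: store with x2 as an address source
]
assert dep(prog, 0, 2, "x2")        # address dependency through x2

broken = [(set(), {"x1"}), ({"x3"}, {"x2"}), ({"x2"}, set())]
assert not dep(broken, 0, 2, "x2")  # chain severed: x3 never written by 0
```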
+
+==== Preserved Program Order
+
+The global memory order for any given execution of a program respects
+some but not all of each hart’s program order. The subset of program
+order that must be respected by the global memory order is known as
+_preserved program order_.
+
+The complete definition of preserved program order is as follows (and
+note that AMOs are simultaneously both loads and stores): memory
+operation latexmath:[$a$] precedes memory operation latexmath:[$b$] in
+preserved program order (and hence also in the global memory order) if
+latexmath:[$a$] precedes latexmath:[$b$] in program order,
+latexmath:[$a$] and latexmath:[$b$] both access regular main memory
+(rather than I/O regions), and any of the following hold:
+
+[[overlapping-orering]]
+* Overlapping-Address Orderings:
+. latexmath:[$b$] is a store, and
+latexmath:[$a$] and latexmath:[$b$] access overlapping memory addresses
+. [[ppo:rdw]]latexmath:[$a$] and latexmath:[$b$] are loads,
+latexmath:[$x$] is a byte read by both latexmath:[$a$] and
+latexmath:[$b$], there is no store to latexmath:[$x$] between
+latexmath:[$a$] and latexmath:[$b$] in program order, and
+latexmath:[$a$] and latexmath:[$b$] return values for latexmath:[$x$]
+written by different memory operations
+. [[ppo:amoforward]]latexmath:[$a$] is
+generated by an AMO or SC instruction, latexmath:[$b$] is a load, and
+latexmath:[$b$] returns a value written by latexmath:[$a$]
+* Explicit Synchronization
+. There is a FENCE instruction that
+orders latexmath:[$a$] before latexmath:[$b$]
+. latexmath:[$a$] has an acquire
+annotation
+. latexmath:[$b$] has a release annotation
+. latexmath:[$a$] and latexmath:[$b$] both have
+RCsc annotations
+. latexmath:[$a$] is paired with
+latexmath:[$b$]
+* Syntactic Dependencies
+. latexmath:[$b$] has a syntactic address
+dependency on latexmath:[$a$]
+. latexmath:[$b$] has a syntactic data
+dependency on latexmath:[$a$]
+. latexmath:[$b$] is a store, and
+latexmath:[$b$] has a syntactic control dependency on latexmath:[$a$]
+* Pipeline Dependencies
+. latexmath:[$b$] is a
+load, and there exists some store latexmath:[$m$] between
+latexmath:[$a$] and latexmath:[$b$] in program order such that
+latexmath:[$m$] has an address or data dependency on latexmath:[$a$],
+and latexmath:[$b$] returns a value written by latexmath:[$m$]
+. latexmath:[$b$] is a store, and
+there exists some instruction latexmath:[$m$] between latexmath:[$a$]
+and latexmath:[$b$] in program order such that latexmath:[$m$] has an
+address dependency on latexmath:[$a$]
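Since preserved program order must be respected by the global memory order, checking a candidate execution against these rules reduces to verifying pair containment. A minimal, non-normative Python sketch:

```python
def respects_ppo(gmo, ppo_pairs):
    """A candidate global memory order (listed first-to-last) is legal
    w.r.t. preserved program order only if every PPO pair (a, b)
    appears with a before b."""
    pos = {op: k for k, op in enumerate(gmo)}
    return all(pos[a] < pos[b] for a, b in ppo_pairs)

# Message passing: hart 0 runs  Sx; FENCE w,w; Sy  -> PPO orders Sx before Sy
#                  hart 1 runs  Ry; (addr dep) Rx  -> PPO orders Ry before Rx
ppo = {("Sx", "Sy"), ("Ry", "Rx")}
assert respects_ppo(["Sx", "Sy", "Ry", "Rx"], ppo)
assert not respects_ppo(["Sy", "Ry", "Rx", "Sx"], ppo)
```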
+
+==== Memory Model Axioms
+
+An execution of a RISC-V program obeys the RVWMO memory consistency
+model only if there exists a global memory order conforming to preserved
+program order and satisfying the _load value axiom_, the _atomicity
+axiom_, and the _progress axiom_.
+
+[[ax-load]]
+===== Load Value Axiom
+
+Each byte of each load latexmath:[$i$] returns the value written to that
+byte by the store that is the latest in global memory order among the
+following stores:
+
+. Stores that write that byte and that precede latexmath:[$i$] in the
+global memory order
+. Stores that write that byte and that precede latexmath:[$i$] in
+program order
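A non-normative Python sketch of this axiom for a single byte follows; the second clause is what permits a hart to forward a value from its own in-flight store before that store becomes globally visible (all operation names are illustrative):

```python
def load_value(load, gmo, po, stores):
    """Return the value read by `load` for one byte: the latest store in
    the global memory order `gmo` among stores that precede the load
    either in gmo or in the loading hart's program order `po`."""
    gpos = {op: k for k, op in enumerate(gmo)}
    visible = [s for s in stores
               if gpos[s] < gpos[load]
               or (s in po and po.index(s) < po.index(load))]
    return stores[max(visible, key=gpos.get)] if visible else None

# Hart 0 executes  S1: x = 1  then  L: read x, but S1 becomes globally
# visible only after L.  The load still returns 1, by forwarding.
gmo = ["S0", "L", "S1"]          # global memory order
po  = ["S1", "L"]                # hart 0's program order
assert load_value("L", gmo, po, {"S0": 0, "S1": 1}) == 1
```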
+
+[[ax-atom]]
+===== Atomicity Axiom
+
+If latexmath:[$r$] and latexmath:[$w$] are paired load and store
+operations generated by aligned LR and SC instructions in a hart
+latexmath:[$h$], latexmath:[$s$] is a store to byte latexmath:[$x$], and
+latexmath:[$r$] returns a value written by latexmath:[$s$], then
+latexmath:[$s$] must precede latexmath:[$w$] in the global memory order,
+and there can be no store from a hart other than latexmath:[$h$] to byte
+latexmath:[$x$] following latexmath:[$s$] and preceding latexmath:[$w$]
+in the global memory order.
+
+The atomicity axiom theoretically supports LR/SC pairs of different
+widths and to mismatched addresses, since implementations are permitted
+to allow SC operations to succeed in such cases. However, in practice,
+we expect such patterns to be rare, and their use is discouraged.
+
+[[ax-prog]]
+===== Progress Axiom
+
+No memory operation may be preceded in the global memory order by an
+infinite sequence of other memory operations.
+
+[[sec:csr-granularity]]
+=== CSR Dependency Tracking Granularity
+
+.Granularities at which syntactic dependencies are tracked through CSRs
+[cols="<,<,<",options="header",]
+|===
+|Name |Portions Tracked as Independent Units |Aliases
+|`fflags` |Bits 4, 3, 2, 1, 0 |`fcsr`
+|`frm` |entire CSR |`fcsr`
+|`fcsr` |Bits 7-5, 4, 3, 2, 1, 0 |`fflags`, `frm`
+|===
+
+Note: read-only CSRs are not listed, as they do not participate in the
+definition of syntactic dependencies.
+
+[[sec:source-dest-regs]]
+=== Source and Destination Register Listings
+
+This section provides a concrete listing of the source and destination
+registers for each instruction. These listings are used in the
+definition of syntactic dependencies in
+<<mem-dependencies>>.
+
+The term `accumulating CSR` is used to describe a CSR that is both a
+source and a destination register, but which carries a dependency only
+from itself to itself.
+
+Instructions carry a dependency from each source register in the
+`Source Registers` column to each destination register in the
+`Destination Registers` column, from each source register in the
+`Source Registers` column to each CSR in the `Accumulating CSRs`
+column, and from each CSR in the `Accumulating CSRs` column to itself,
+except where annotated otherwise.
+
+Key:
+
+latexmath:[$^A$]Address source register
+
+latexmath:[$^D$]Data source register
+
+latexmath:[$^\dagger$]The instruction does not carry a dependency from
+any source register to any destination register
+
+latexmath:[$^\ddagger$]The instruction carries dependencies from source
+register(s) to destination register(s) as specified
+
+[cols="<,<,<,<,<",]
+|===
+|*RV32I Base Integer Instruction Set* | | | |
+| |Source |Destination |Accumulating |
+| |Registers |Registers |CSRs |
+|LUI | |_rd_ | |
+|AUIPC | |_rd_ | |
+|JAL | |_rd_ | |
+|JALRlatexmath:[$^\dagger$] |_rs1_ |_rd_ | |
+|BEQ |_rs1_, _rs2_ | | |
+|BNE |_rs1_, _rs2_ | | |
+|BLT |_rs1_, _rs2_ | | |
+|BGE |_rs1_, _rs2_ | | |
+|BLTU |_rs1_, _rs2_ | | |
+|BGEU |_rs1_, _rs2_ | | |
+|LBlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$] |_rd_ | |
+|LHlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$] |_rd_ | |
+|LWlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$] |_rd_ | |
+|LBUlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$] |_rd_ | |
+|LHUlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$] |_rd_ | |
+|SB |_rs1_latexmath:[$^A$], _rs2_latexmath:[$^D$] | | |
+|SH |_rs1_latexmath:[$^A$], _rs2_latexmath:[$^D$] | | |
+|SW |_rs1_latexmath:[$^A$], _rs2_latexmath:[$^D$] | | |
+|ADDI |_rs1_ |_rd_ | |
+|SLTI |_rs1_ |_rd_ | |
+|SLTIU |_rs1_ |_rd_ | |
+|XORI |_rs1_ |_rd_ | |
+|ORI |_rs1_ |_rd_ | |
+|ANDI |_rs1_ |_rd_ | |
+|SLLI |_rs1_ |_rd_ | |
+|SRLI |_rs1_ |_rd_ | |
+|SRAI |_rs1_ |_rd_ | |
+|ADD |_rs1_, _rs2_ |_rd_ | |
+|SUB |_rs1_, _rs2_ |_rd_ | |
+|SLL |_rs1_, _rs2_ |_rd_ | |
+|SLT |_rs1_, _rs2_ |_rd_ | |
+|SLTU |_rs1_, _rs2_ |_rd_ | |
+|XOR |_rs1_, _rs2_ |_rd_ | |
+|SRL |_rs1_, _rs2_ |_rd_ | |
+|SRA |_rs1_, _rs2_ |_rd_ | |
+|OR |_rs1_, _rs2_ |_rd_ | |
+|AND |_rs1_, _rs2_ |_rd_ | |
+|FENCE | | | |
+|FENCE.I | | | |
+|ECALL | | | |
+|EBREAK | | | |
+| | | | |
+|===
+
+[cols="<,<,<,<,<",]
+|===
+|*RV32I Base Integer Instruction Set (continued)* | | | |
+
+| |Source |Destination |Accumulating |
+
+| |Registers |Registers |CSRs |
+
+|CSRRWlatexmath:[$^\ddagger$] |_rs1_, _csr_latexmath:[$^*$] |_rd_, _csr_
+| |latexmath:[$^*$]unless _rd_=`x0`
+
+|CSRRSlatexmath:[$^\ddagger$] |_rs1_, _csr_ |_rd_, _csr_latexmath:[$^*$]
+| |latexmath:[$^*$]unless _rs1_=`x0`
+
+|CSRRClatexmath:[$^\ddagger$] |_rs1_, _csr_ |_rd_, _csr_latexmath:[$^*$]
+| |latexmath:[$^*$]unless _rs1_=`x0`
+
+| |latexmath:[$\ddagger$]carries a dependency from _rs1_ to _csr_ and
+from _csr_ to _rd_ | | |
+|===
+
+[cols="<,<,<,<,<",]
+|===
+|*RV32I Base Integer Instruction Set (continued)* | | | |
+
+| |Source |Destination |Accumulating |
+
+| |Registers |Registers |CSRs |
+
+|CSRRWIlatexmath:[$^\ddagger$] |_csr_latexmath:[$^*$] |_rd_, _csr_ |
+|latexmath:[$^*$]unless _rd_=`x0`
+
+|CSRRSIlatexmath:[$^\ddagger$] |_csr_ |_rd_, _csr_latexmath:[$^*$] |
+|latexmath:[$^*$]unless uimm[4:0]=0
+
+|CSRRCIlatexmath:[$^\ddagger$] |_csr_ |_rd_, _csr_latexmath:[$^*$] |
+|latexmath:[$^*$]unless uimm[4:0]=0
+
+| |latexmath:[$\ddagger$]carries a dependency from _csr_ to _rd_ | | |
+|===
+
+[cols="<,<,<,<,<",]
+|===
+|*RV64I Base Integer Instruction Set* | | | |
+| |Source |Destination |Accumulating |
+| |Registers |Registers |CSRs |
+|LWUlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$] |_rd_ | |
+|LDlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$] |_rd_ | |
+|SD |_rs1_latexmath:[$^A$], _rs2_latexmath:[$^D$] | | |
+|SLLI |_rs1_ |_rd_ | |
+|SRLI |_rs1_ |_rd_ | |
+|SRAI |_rs1_ |_rd_ | |
+|ADDIW |_rs1_ |_rd_ | |
+|SLLIW |_rs1_ |_rd_ | |
+|SRLIW |_rs1_ |_rd_ | |
+|SRAIW |_rs1_ |_rd_ | |
+|ADDW |_rs1_, _rs2_ |_rd_ | |
+|SUBW |_rs1_, _rs2_ |_rd_ | |
+|SLLW |_rs1_, _rs2_ |_rd_ | |
+|SRLW |_rs1_, _rs2_ |_rd_ | |
+|SRAW |_rs1_, _rs2_ |_rd_ | |
+| | | | |
+|===
+
+[cols="<,<,<,<,<",]
+|===
+|*RV32M Standard Extension* | | | |
+| |Source |Destination |Accumulating |
+| |Registers |Registers |CSRs |
+|MUL |_rs1_, _rs2_ |_rd_ | |
+|MULH |_rs1_, _rs2_ |_rd_ | |
+|MULHSU |_rs1_, _rs2_ |_rd_ | |
+|MULHU |_rs1_, _rs2_ |_rd_ | |
+|DIV |_rs1_, _rs2_ |_rd_ | |
+|DIVU |_rs1_, _rs2_ |_rd_ | |
+|REM |_rs1_, _rs2_ |_rd_ | |
+|REMU |_rs1_, _rs2_ |_rd_ | |
+| | | | |
+|===
+
+[cols="<,<,<,<,<",]
+|===
+|*RV64M Standard Extension* | | | |
+| |Source |Destination |Accumulating |
+| |Registers |Registers |CSRs |
+|MULW |_rs1_, _rs2_ |_rd_ | |
+|DIVW |_rs1_, _rs2_ |_rd_ | |
+|DIVUW |_rs1_, _rs2_ |_rd_ | |
+|REMW |_rs1_, _rs2_ |_rd_ | |
+|REMUW |_rs1_, _rs2_ |_rd_ | |
+| | | | |
+|===
+
+[cols="<,<,<,<,<",]
+|===
+|*RV32A Standard Extension* | | | |
+
+| |Source |Destination |Accumulating |
+
+| |Registers |Registers |CSRs |
+
+|LR.Wlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$] |_rd_ | |
+
+|SC.Wlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$],
+_rs2_latexmath:[$^D$] |_rd_latexmath:[$^*$] | |latexmath:[$^*$]if
+successful
+
+|AMOSWAP.Wlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$],
+_rs2_latexmath:[$^D$] |_rd_ | |
+
+|AMOADD.Wlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$],
+_rs2_latexmath:[$^D$] |_rd_ | |
+
+|AMOXOR.Wlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$],
+_rs2_latexmath:[$^D$] |_rd_ | |
+
+|AMOAND.Wlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$],
+_rs2_latexmath:[$^D$] |_rd_ | |
+
+|AMOOR.Wlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$],
+_rs2_latexmath:[$^D$] |_rd_ | |
+
+|AMOMIN.Wlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$],
+_rs2_latexmath:[$^D$] |_rd_ | |
+
+|AMOMAX.Wlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$],
+_rs2_latexmath:[$^D$] |_rd_ | |
+
+|AMOMINU.Wlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$],
+_rs2_latexmath:[$^D$] |_rd_ | |
+
+|AMOMAXU.Wlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$],
+_rs2_latexmath:[$^D$] |_rd_ | |
+
+| | | | |
+|===
+
+[cols="<,<,<,<,<",]
+|===
+|*RV64A Standard Extension* | | | |
+
+| |Source |Destination |Accumulating |
+
+| |Registers |Registers |CSRs |
+
+|LR.Dlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$] |_rd_ | |
+
+|SC.Dlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$],
+_rs2_latexmath:[$^D$] |_rd_latexmath:[$^*$] | |latexmath:[$^*$]if
+successful
+
+|AMOSWAP.Dlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$],
+_rs2_latexmath:[$^D$] |_rd_ | |
+
+|AMOADD.Dlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$],
+_rs2_latexmath:[$^D$] |_rd_ | |
+
+|AMOXOR.Dlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$],
+_rs2_latexmath:[$^D$] |_rd_ | |
+
+|AMOAND.Dlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$],
+_rs2_latexmath:[$^D$] |_rd_ | |
+
+|AMOOR.Dlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$],
+_rs2_latexmath:[$^D$] |_rd_ | |
+
+|AMOMIN.Dlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$],
+_rs2_latexmath:[$^D$] |_rd_ | |
+
+|AMOMAX.Dlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$],
+_rs2_latexmath:[$^D$] |_rd_ | |
+
+|AMOMINU.Dlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$],
+_rs2_latexmath:[$^D$] |_rd_ | |
+
+|AMOMAXU.Dlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$],
+_rs2_latexmath:[$^D$] |_rd_ | |
+
+| | | | |
+|===
+
+[cols="<,<,<,<,<",]
+|===
+|*RV32F Standard Extension* | | | |
+
+| |Source |Destination |Accumulating |
+
+| |Registers |Registers |CSRs |
+
+|FLWlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$] |_rd_ | |
+
+|FSW |_rs1_latexmath:[$^A$], _rs2_latexmath:[$^D$] | | |
+
+|FMADD.S |_rs1_, _rs2_, _rs3_, frmlatexmath:[$^*$] |_rd_ |NV, OF, UF, NX
+|latexmath:[$^*$]if rm=111
+
+|FMSUB.S |_rs1_, _rs2_, _rs3_, frmlatexmath:[$^*$] |_rd_ |NV, OF, UF, NX
+|latexmath:[$^*$]if rm=111
+
+|FNMSUB.S |_rs1_, _rs2_, _rs3_, frmlatexmath:[$^*$] |_rd_ |NV, OF, UF,
+NX |latexmath:[$^*$]if rm=111
+
+|FNMADD.S |_rs1_, _rs2_, _rs3_, frmlatexmath:[$^*$] |_rd_ |NV, OF, UF,
+NX |latexmath:[$^*$]if rm=111
+
+|FADD.S |_rs1_, _rs2_, frmlatexmath:[$^*$] |_rd_ |NV, OF, NX
+|latexmath:[$^*$]if rm=111
+
+|FSUB.S |_rs1_, _rs2_, frmlatexmath:[$^*$] |_rd_ |NV, OF, NX
+|latexmath:[$^*$]if rm=111
+
+|FMUL.S |_rs1_, _rs2_, frmlatexmath:[$^*$] |_rd_ |NV, OF, UF, NX
+|latexmath:[$^*$]if rm=111
+
+|FDIV.S |_rs1_, _rs2_, frmlatexmath:[$^*$] |_rd_ |NV, DZ, OF, UF, NX
+|latexmath:[$^*$]if rm=111
+
+|FSQRT.S |_rs1_, frmlatexmath:[$^*$] |_rd_ |NV, NX |latexmath:[$^*$]if
+rm=111
+
+|FSGNJ.S |_rs1_, _rs2_ |_rd_ | |
+
+|FSGNJN.S |_rs1_, _rs2_ |_rd_ | |
+
+|FSGNJX.S |_rs1_, _rs2_ |_rd_ | |
+
+|FMIN.S |_rs1_, _rs2_ |_rd_ |NV |
+
+|FMAX.S |_rs1_, _rs2_ |_rd_ |NV |
+
+|FCVT.W.S |_rs1_, frmlatexmath:[$^*$] |_rd_ |NV, NX |latexmath:[$^*$]if
+rm=111
+
+|FCVT.WU.S |_rs1_, frmlatexmath:[$^*$] |_rd_ |NV, NX |latexmath:[$^*$]if
+rm=111
+
+|FMV.X.W |_rs1_ |_rd_ | |
+
+|FEQ.S |_rs1_, _rs2_ |_rd_ |NV |
+
+|FLT.S |_rs1_, _rs2_ |_rd_ |NV |
+
+|FLE.S |_rs1_, _rs2_ |_rd_ |NV |
+
+|FCLASS.S |_rs1_ |_rd_ | |
+
+|FCVT.S.W |_rs1_, frmlatexmath:[$^*$] |_rd_ |NX |latexmath:[$^*$]if
+rm=111
+
+|FCVT.S.WU |_rs1_, frmlatexmath:[$^*$] |_rd_ |NX |latexmath:[$^*$]if
+rm=111
+
+|FMV.W.X |_rs1_ |_rd_ | |
+
+| | | | |
+|===
+
+[cols="<,<,<,<,<",]
+|===
+|*RV64F Standard Extension* | | | |
+
+| |Source |Destination |Accumulating |
+
+| |Registers |Registers |CSRs |
+
+|FCVT.L.S |_rs1_, frmlatexmath:[$^*$] |_rd_ |NV, NX |latexmath:[$^*$]if
+rm=111
+
+|FCVT.LU.S |_rs1_, frmlatexmath:[$^*$] |_rd_ |NV, NX |latexmath:[$^*$]if
+rm=111
+
+|FCVT.S.L |_rs1_, frmlatexmath:[$^*$] |_rd_ |NX |latexmath:[$^*$]if
+rm=111
+
+|FCVT.S.LU |_rs1_, frmlatexmath:[$^*$] |_rd_ |NX |latexmath:[$^*$]if
+rm=111
+
+| | | | |
+|===
+
+[cols="<,<,<,<,<",]
+|===
+|*RV32D Standard Extension* | | | |
+
+| |Source |Destination |Accumulating |
+
+| |Registers |Registers |CSRs |
+
+|FLDlatexmath:[$^\dagger$] |_rs1_latexmath:[$^A$] |_rd_ | |
+
+|FSD |_rs1_latexmath:[$^A$], _rs2_latexmath:[$^D$] | | |
+
+|FMADD.D |_rs1_, _rs2_, _rs3_, frmlatexmath:[$^*$] |_rd_ |NV, OF, UF, NX
+|latexmath:[$^*$]if rm=111
+
+|FMSUB.D |_rs1_, _rs2_, _rs3_, frmlatexmath:[$^*$] |_rd_ |NV, OF, UF, NX
+|latexmath:[$^*$]if rm=111
+
+|FNMSUB.D |_rs1_, _rs2_, _rs3_, frmlatexmath:[$^*$] |_rd_ |NV, OF, UF,
+NX |latexmath:[$^*$]if rm=111
+
+|FNMADD.D |_rs1_, _rs2_, _rs3_, frmlatexmath:[$^*$] |_rd_ |NV, OF, UF,
+NX |latexmath:[$^*$]if rm=111
+
+|FADD.D |_rs1_, _rs2_, frmlatexmath:[$^*$] |_rd_ |NV, OF, NX
+|latexmath:[$^*$]if rm=111
+
+|FSUB.D |_rs1_, _rs2_, frmlatexmath:[$^*$] |_rd_ |NV, OF, NX
+|latexmath:[$^*$]if rm=111
+
+|FMUL.D |_rs1_, _rs2_, frmlatexmath:[$^*$] |_rd_ |NV, OF, UF, NX
+|latexmath:[$^*$]if rm=111
+
+|FDIV.D |_rs1_, _rs2_, frmlatexmath:[$^*$] |_rd_ |NV, DZ, OF, UF, NX
+|latexmath:[$^*$]if rm=111
+
+|FSQRT.D |_rs1_, frmlatexmath:[$^*$] |_rd_ |NV, NX |latexmath:[$^*$]if
+rm=111
+
+|FSGNJ.D |_rs1_, _rs2_ |_rd_ | |
+
+|FSGNJN.D |_rs1_, _rs2_ |_rd_ | |
+
+|FSGNJX.D |_rs1_, _rs2_ |_rd_ | |
+
+|FMIN.D |_rs1_, _rs2_ |_rd_ |NV |
+
+|FMAX.D |_rs1_, _rs2_ |_rd_ |NV |
+
+|FCVT.S.D |_rs1_, frmlatexmath:[$^*$] |_rd_ |NV, OF, UF, NX
+|latexmath:[$^*$]if rm=111
+
+|FCVT.D.S |_rs1_ |_rd_ |NV |
+
+|FEQ.D |_rs1_, _rs2_ |_rd_ |NV |
+
+|FLT.D |_rs1_, _rs2_ |_rd_ |NV |
+
+|FLE.D |_rs1_, _rs2_ |_rd_ |NV |
+
+|FCLASS.D |_rs1_ |_rd_ | |
+
+|FCVT.W.D |_rs1_, frmlatexmath:[$^*$] |_rd_ |NV, NX |latexmath:[$^*$]if
+rm=111
+
+|FCVT.WU.D |_rs1_, frmlatexmath:[$^*$] |_rd_ |NV, NX |latexmath:[$^*$]if
+rm=111
+
+|FCVT.D.W |_rs1_ |_rd_ | |
+
+|FCVT.D.WU |_rs1_ |_rd_ | |
+
+| | | | |
+|===
+
+[cols="<,<,<,<,<",]
+|===
+|*RV64D Standard Extension* | | | |
+
+| |Source |Destination |Accumulating |
+
+| |Registers |Registers |CSRs |
+
+|FCVT.L.D |_rs1_, frmlatexmath:[$^*$] |_rd_ |NV, NX |latexmath:[$^*$]if
+rm=111
+
+|FCVT.LU.D |_rs1_, frmlatexmath:[$^*$] |_rd_ |NV, NX |latexmath:[$^*$]if
+rm=111
+
+|FMV.X.D |_rs1_ |_rd_ | |
+
+|FCVT.D.L |_rs1_, frmlatexmath:[$^*$] |_rd_ |NX |latexmath:[$^*$]if
+rm=111
+
+|FCVT.D.LU |_rs1_, frmlatexmath:[$^*$] |_rd_ |NX |latexmath:[$^*$]if
+rm=111
+
+|FMV.D.X |_rs1_ |_rd_ | |
+
+| | | | |
+|===
+
diff --git a/src/test.adoc b/src/test.adoc
new file mode 100644
index 0000000..10992eb
--- /dev/null
+++ b/src/test.adoc
@@ -0,0 +1,47 @@
+[appendix]
+== Test
+
+
+//include::images/graphviz/litmus_sample.dot[]
+//[[litmus-sample]]
+//.A sample litmus test and one forbidden execution (a0=1).
+//image::image_placeholder.png[]
+
+
+[graphviz,test-diagram,svg]
+....
+digraph G {
+
+splines=spline;
+pad="0.000000";
+
+
+/* the unlocked events */
+eiid0 [label="a: Ry=1", shape="none", fontsize=8, pos="1.000000,1.125000!", fixedsize="false", height="0.111111", width="0.555556"];
+eiid1 [label="c: Wx=t", shape="none", fontsize=8, pos="1.000000,0.562500!", fixedsize="false", height="0.111111", width="0.555556"];
+eiid2 [label="d: Rx=t", shape="none", fontsize=8, pos="2.500000,1.125000!", fixedsize="false", height="0.111111", width="0.555556"];
+eiid3 [label="e: Rt=$v$", shape="none", fontsize=8, pos="2.500000,0.562500!", fixedsize="false", height="0.111111", width="0.555556"];
+eiid4 [label="f: Wy=1", shape="none", fontsize=8, pos="2.500000,0.000000!", fixedsize="false", height="0.111111", width="0.555556"];
+
+/* the intra_causality_data edges */
+
+
+/* the intra_causality_control edges */
+
+/* the poi edges */
+/* the rfmap edges */
+
+
+/* The viewed-before edges */
+eiid0 -> eiid1 [label=<<font color="darkgreen">fence</font><font color="indigo">ppo</font>>, color="darkgreen:indigo", fontsize=11, penwidth="3.000000", arrowsize="0.666700"];
+eiid1 -> eiid2 [label=<<font color="red">rf</font>>, color="red", fontsize=11, penwidth="3.000000", arrowsize="0.666700"];
+eiid2 -> eiid3 [label=<<font color="indigo">addr</font><font color="indigo">ppo</font>>, color="indigo", fontsize=11, penwidth="3.000000", arrowsize="0.666700"];
+eiid2 -> eiid4 [label=<<font color="indigo">ppo</font>>, color="indigo", fontsize=11, penwidth="3.000000", arrowsize="0.666700"];
+eiid3 -> eiid4 [label=<<font color="black">po</font>>, color="black", fontsize=11, penwidth="3.000000", arrowsize="0.666700"];
+eiid4 -> eiid0 [label=<<font color="red">rf</font>>, color="red", fontsize=11, penwidth="3.000000", arrowsize="0.666700"];
+}
+....
+
+
+para
+
diff --git a/src/v-st-ext.adoc b/src/v-st-ext.adoc
new file mode 100644
index 0000000..ab637ae
--- /dev/null
+++ b/src/v-st-ext.adoc
@@ -0,0 +1,11 @@
+[[vector]]
+== `V` Standard Extension for Vector Operations, Version 0.7
+
+The current working group draft is hosted at
+`https://github.com/riscv/riscv-v-spec`.
+
+The base vector extension is intended to provide general support for
+data-parallel execution within the 32-bit instruction encoding space,
+with later vector extensions supporting richer functionality for certain
+domains.
+
diff --git a/src/zam-st-ext.adoc b/src/zam-st-ext.adoc
new file mode 100644
index 0000000..4468e31
--- /dev/null
+++ b/src/zam-st-ext.adoc
@@ -0,0 +1,52 @@
+[[zam]]
+== `Zam` Standard Extension for Misaligned Atomics, v0.1
+
+This chapter defines the `Zam` extension, which extends the `A`
+extension by standardizing support for misaligned atomic memory
+operations (AMOs). On platforms implementing `Zam`, misaligned AMOs
+need only execute atomically with respect to other accesses (including
+non-atomic loads and stores) to the same address and of the same size.
+More precisely, execution environments implementing `Zam` are subject
+to the following axiom:
+
+[[misaligned]]
+=== Atomicity Axiom for misaligned atomics
+
+If latexmath:[$r$] and latexmath:[$w$] are paired misaligned load and
+store instructions from a hart latexmath:[$h$] with the same address and
+of the same size, then there can be no store instruction latexmath:[$s$]
+from a hart other than latexmath:[$h$] with the same address and of the
+same size as latexmath:[$r$] and latexmath:[$w$] such that a store
+operation generated by latexmath:[$s$] lies in between memory operations
+generated by latexmath:[$r$] and latexmath:[$w$] in the global memory
+order. Furthermore, there can be no load instruction latexmath:[$l$]
+from a hart other than latexmath:[$h$] with the same address and of the
+same size as latexmath:[$r$] and latexmath:[$w$] such that a load
+operation generated by latexmath:[$l$] lies between two memory
+operations generated by latexmath:[$r$] or by latexmath:[$w$] in the
+global memory order.
+
+This restricted form of atomicity is intended to balance the needs of
+applications which require support for misaligned atomics and the
+ability of the implementation to actually provide the necessary degree
+of atomicity.
+
+Aligned instructions under `Zam` continue to behave as they normally
+do under RVWMO.
+
+The intention of `Zam` is that it can be implemented in one of two
+ways:
+
+. On hardware that natively supports atomic misaligned accesses to the
+address and size in question (e.g., for misaligned accesses within a
+single cache line): by simply following the same rules that would be
+applied for aligned AMOs.
+. On hardware that does not natively support misaligned accesses to the
+address and size in question: by trapping on all instructions (including
+loads) with that address and size and executing them (via any number of
+memory operations) inside a mutex that is a function of the given memory
+address and access size. AMOs may be emulated by splitting them into
+separate load and store operations, but all preserved program order
+rules (e.g., incoming and outgoing syntactic dependencies) must behave
+as if the AMO is still a single memory operation.
+
diff --git a/src/zicsr.adoc b/src/zicsr.adoc
new file mode 100644
index 0000000..3c4eed4
--- /dev/null
+++ b/src/zicsr.adoc
@@ -0,0 +1,242 @@
+[[csrinsts]]
+== `Zicsr`, Control and Status Register (CSR) Instructions, Version 2.0
+
+RISC-V defines a separate address space of 4096 Control and Status
+registers associated with each hart. This chapter defines the full set
+of CSR instructions that operate on these CSRs.
+
+[NOTE]
+====
+While CSRs are primarily used by the privileged architecture, there are
+several uses in unprivileged code including for counters and timers, and
+for floating-point status.
+
+The counters and timers are no longer considered mandatory parts of the
+standard base ISAs, and so the CSR instructions required to access them
+have been moved out of <<rv32>> into this separate
+chapter.
+====
+
+=== CSR Instructions
+((CSR))
+
+All CSR instructions atomically read-modify-write a single CSR, whose
+CSR specifier is encoded in the 12-bit _csr_ field of the instruction
+held in bits 31–20. The immediate forms use a 5-bit zero-extended
+immediate encoded in the _rs1_ field.
+(((CSR, instructions)))
+
+include::images/wavedrom/csr-instr.adoc[]
+[[csr-instr]]
+.CSR instructions
+image::image_placeholder.png[]
+(((CSR, CSRRW)))
+
+The CSRRW (Atomic Read/Write CSR) instruction atomically swaps values in
+the CSRs and integer registers. CSRRW reads the old value of the CSR,
+zero-extends the value to XLEN bits, then writes it to integer register
+_rd_. The initial value in _rs1_ is written to the CSR. If _rd_=`x0`,
+then the instruction shall not read the CSR and shall not cause any of
+the side effects that might occur on a CSR read.
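+
+As an illustration of the swap behavior, a common trap-handler idiom
+exchanges an integer register with a scratch CSR in a single
+instruction (the CSR and register choices below are examples, not a
+normative sequence):
+
+[source,asm]
+----
+# Atomically swap t0 with the mscratch CSR: the old mscratch value
+# lands in t0, and the old t0 value lands in mscratch.
+csrrw t0, mscratch, t0
+----
+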
+(((CSR, CSRRS)))
+
+The CSRRS (Atomic Read and Set Bits in CSR) instruction reads the value
+of the CSR, zero-extends the value to XLEN bits, and writes it to
+integer register _rd_. The initial value in integer register _rs1_ is
+treated as a bit mask that specifies bit positions to be set in the CSR.
+Any bit that is high in _rs1_ will cause the corresponding bit to be set
+in the CSR, if that CSR bit is writable. Other bits in the CSR are not
+explicitly written.
+(((CSR, CSRRC)))
+
+The CSRRC (Atomic Read and Clear Bits in CSR) instruction reads the
+value of the CSR, zero-extends the value to XLEN bits, and writes it to
+integer register _rd_. The initial value in integer register _rs1_ is
+treated as a bit mask that specifies bit positions to be cleared in the
+CSR. Any bit that is high in _rs1_ will cause the corresponding bit to
+be cleared in the CSR, if that CSR bit is writable. Other bits in the
+CSR are not explicitly written.
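+
+A brief sketch of the set/clear semantics follows; the CSR is
+illustrative (bit 3 of `mstatus` is the machine interrupt-enable bit in
+the privileged architecture):
+
+[source,asm]
+----
+li    t1, 0x8            # mask with bit 3 set
+csrrs t0, mstatus, t1    # t0 = old mstatus; set bit 3 in mstatus
+csrrc zero, mstatus, t1  # clear bit 3; old value discarded (rd = x0)
+----
+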
+
+For both CSRRS and CSRRC, if _rs1_=`x0`, then the instruction will not
+write to the CSR at all, and so shall not cause any of the side effects
+that might otherwise occur on a CSR write, nor raise illegal instruction
+exceptions on accesses to read-only CSRs. Both CSRRS and CSRRC always
+read the addressed CSR and cause any read side effects regardless of
+_rs1_ and _rd_ fields. Note that if _rs1_ specifies a register holding a
+zero value other than `x0`, the instruction will still attempt to write
+the unmodified value back to the CSR and will cause any attendant side
+effects. A CSRRW with _rs1_=`x0` will attempt to write zero to the
+destination CSR.
+(((CSR, CSRRWI)))
+(((CSR, CSRRSI)))
+(((CSR, CSRRCI)))
+
+The CSRRWI, CSRRSI, and CSRRCI variants are similar to CSRRW, CSRRS, and
+CSRRC respectively, except they update the CSR using an XLEN-bit value
+obtained by zero-extending a 5-bit unsigned immediate (uimm[4:0]) field
+encoded in the _rs1_ field instead of a value from an integer register.
+For CSRRSI and CSRRCI, if the uimm[4:0] field is zero, then these
+instructions will not write to the CSR, and shall not cause any of the
+side effects that might otherwise occur on a CSR write, nor raise
+illegal instruction exceptions on accesses to read-only CSRs. For
+CSRRWI, if _rd_=`x0`, then the instruction shall not read the CSR and
+shall not cause any of the side effects that might occur on a CSR read.
+Both CSRRSI and CSRRCI will always read the CSR and cause any read side
+effects regardless of _rd_ and _rs1_ fields.
+(((CSR, side effects)))
+
+[[csrsideeffects]]
+.Conditions determining whether a CSR instruction reads or writes the specified CSR.
+[cols="<,^,^,^,^",options="header",]
+|===
+|Register operand | | | |
+|Instruction |_rd_ is `x0` |_rs1_ is `x0` |Reads CSR |Writes CSR
+
+|CSRRW |Yes |– |No |Yes
+
+|CSRRW |No |– |Yes |Yes
+
+|CSRRS/CSRRC |– |Yes |Yes |No
+
+|CSRRS/CSRRC |– |No |Yes |Yes
+
+|Immediate operand | | | |
+
+|Instruction |_rd_ is `x0` |_uimm_ latexmath:[$=$]0 |Reads CSR |Writes
+CSR
+
+|CSRRWI |Yes |– |No |Yes
+
+|CSRRWI |No |– |Yes |Yes
+
+|CSRRSI/CSRRCI |– |Yes |Yes |No
+
+|CSRRSI/CSRRCI |– |No |Yes |Yes
+|===
+(((CSR, defects)))
+
+<<csrsideeffects>> summarizes the behavior of the CSR
+instructions with respect to whether they read and/or write the CSR.
+
+For any event or consequence that occurs due to a CSR having a
+particular value, if a write to the CSR gives it that value, the
+resulting event or consequence is said to be an _indirect effect_ of the
+write. Indirect effects of a CSR write are not considered by the RISC-V
+ISA to be side effects of that write.
+
+[TIP]
+====
+An example of side effects for CSR accesses would be if reading from a
+specific CSR causes a light bulb to turn on, while writing an odd value
+to the same CSR causes the light to turn off. Assume writing an even
+value has no effect. In this case, both the read and write have side
+effects controlling whether the bulb is lit, as this condition is not
+determined solely from the CSR value. (Note that after writing an odd
+value to the CSR to turn off the light, then reading to turn the light
+on, writing again the same odd value causes the light to turn off again.
+Hence, on the last write, it is not a change in the CSR value that turns
+off the light.)
+
+On the other hand, if a bulb is rigged to light whenever the value of a
+particular CSR is odd, then turning the light on and off is not
+considered a side effect of writing to the CSR but merely an indirect
+effect of such writes.
+
+More concretely, the RISC-V privileged architecture defined in Volume II
+specifies that certain combinations of CSR values cause a trap to occur.
+When an explicit write to a CSR creates the conditions that trigger the
+trap, the trap is not considered a side effect of the write but merely
+an indirect effect.
+// check whether we are using "volume"
+
+Standard CSRs do not have any side effects on reads. Standard CSRs may
+have side effects on writes. Custom extensions might add CSRs for which
+accesses have side effects on either reads or writes.
+====
+
+Some CSRs, such as the instructions-retired counter, `instret`, may be
+modified as side effects of instruction execution. In these cases, if a
+CSR access instruction reads a CSR, it reads the value prior to the
+execution of the instruction. If a CSR access instruction writes such a
+CSR, the write is done instead of the increment. In particular, a value
+written to `instret` by one instruction will be the value read by the
+following instruction.
+
+The assembler pseudoinstruction to read a CSR, CSRR _rd, csr_, is
+encoded as CSRRS _rd, csr, x0_. The assembler pseudoinstruction to write
+a CSR, CSRW _csr, rs1_, is encoded as CSRRW _x0, csr, rs1_, while CSRWI
+_csr, uimm_, is encoded as CSRRWI _x0, csr, uimm_.
+
+Further assembler pseudoinstructions are defined to set and clear bits
+in the CSR when the old value is not required: CSRS/CSRC _csr, rs1_;
+CSRSI/CSRCI _csr, uimm_.
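+
+The pseudoinstruction expansions described above can be illustrated as
+follows (the CSR names are examples only):
+
+[source,asm]
+----
+csrr  t0, mcycle      # expands to: csrrs  t0, mcycle, x0
+csrw  mtvec, t1       # expands to: csrrw  x0, mtvec, t1
+csrwi mscratch, 4     # expands to: csrrwi x0, mscratch, 4
+csrsi mstatus, 0x8    # set bits when the old value is not required
+----
+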
+
+==== CSR Access Ordering
+(((CSR, access ordering)))
+
+Each RISC-V hart normally observes its own CSR accesses, including its
+implicit CSR accesses, as performed in program order. In particular,
+unless specified otherwise, a CSR access is performed after the
+execution of any prior instructions in program order whose behavior
+modifies or is modified by the CSR state and before the execution of any
+subsequent instructions in program order whose behavior modifies or is
+modified by the CSR state. Furthermore, an explicit CSR read returns the
+CSR state before the execution of the instruction, while an explicit CSR
+write suppresses and overrides any implicit writes or modifications to
+the same CSR by the same instruction.
+
+Likewise, any side effects from an explicit CSR access are normally
+observed to occur synchronously in program order. Unless specified
+otherwise, the full consequences of any such side effects are observable
+by the very next instruction, and no consequences may be observed
+out-of-order by preceding instructions. (Note the distinction made
+earlier between side effects and indirect effects of CSR writes.)
+
+For the RVWMO memory consistency model <<memorymodel>>, CSR accesses are weakly
+ordered by default, so other harts or devices may observe CSR accesses
+in an order different from program order. In addition, CSR accesses are
+not ordered with respect to explicit memory accesses, unless a CSR
+access modifies the execution behavior of the instruction that performs
+the explicit memory access or unless a CSR access and an explicit memory
+access are ordered by either the syntactic dependencies defined by the
+memory model or the ordering requirements defined by the Memory-Ordering
+PMAs section in Volume II of this manual. To enforce ordering in all
+other cases, software should execute a FENCE instruction between the
+relevant accesses. For the purposes of the FENCE instruction, CSR read
+accesses are classified as device input (I), and CSR write accesses are
+classified as device output (O).
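+
+For instance, because CSR reads are classified as device input, a FENCE
+whose predecessor set includes I can order a CSR read before a
+subsequent load from main memory (the registers and address are
+illustrative):
+
+[source,asm]
+----
+csrr  t0, time      # CSR read: classified as device input (I)
+fence i, r          # order prior device input before later memory reads
+lw    t1, 0(a0)     # subsequent load from main memory
+----
+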
+
+[NOTE]
+====
+Informally, the CSR space acts as a weakly ordered memory-mapped I/O
+region, as defined by the Memory-Ordering PMAs section in Volume II of
+this manual. As a result, the order of CSR accesses with respect to all
+other accesses is constrained by the same mechanisms that constrain the
+order of memory-mapped I/O accesses to such a region.
+
+These CSR-ordering constraints are imposed to support ordering main
+memory and memory-mapped I/O accesses with respect to CSR accesses that
+are visible to, or affected by, devices or other harts. Examples include
+the `time`, `cycle`, and `mcycle` CSRs, in addition to CSRs that reflect
+pending interrupts, like `mip` and `sip`. Note that implicit reads of
+such CSRs (e.g., taking an interrupt because of a change in `mip`) are
+also ordered as device input.
+====
+
+Most CSRs (including, e.g., the `fcsr`) are not visible to other harts;
+their accesses can be freely reordered in the global memory order with
+respect to FENCE instructions without violating this specification.
+
+The hardware platform may define that accesses to certain CSRs are
+strongly ordered, as defined by the Memory-Ordering PMAs section in
+Volume II of this manual. Accesses to strongly ordered CSRs have
+stronger ordering constraints with respect to accesses to both weakly
+ordered CSRs and accesses to memory-mapped I/O regions.
+
+[NOTE]
+====
+The rules for the reordering of CSR accesses in the global memory order
+should probably be moved to <<memorymodel>>
+concerning the RVWMO memory consistency model.
+====
+
diff --git a/src/zifencei.adoc b/src/zifencei.adoc
new file mode 100644
index 0000000..48b4440
--- /dev/null
+++ b/src/zifencei.adoc
@@ -0,0 +1,97 @@
+[[zifencei]]
+== `Zifencei` Instruction-Fetch Fence, Version 2.0
+
+This chapter defines the `Zifencei` extension, which includes the
+FENCE.I instruction that provides explicit synchronization between
+writes to instruction memory and instruction fetches on the same hart.
+Currently, this instruction is the only standard mechanism to ensure
+that stores visible to a hart will also be visible to its instruction
+fetches.
+(((store instruction word, not included)))
+
+[NOTE]
+====
+We considered but did not include a `store instruction word`
+instruction as in cite:[majc]. JIT compilers may generate a large trace of
+instructions before a single FENCE.I, and amortize any instruction cache
+snooping/invalidation overhead by writing translated instructions to
+memory regions that are known not to reside in the I-cache.
+====
+
+[TIP]
+====
+The FENCE.I instruction was designed to support a wide variety of
+implementations. A simple implementation can flush the local instruction
+cache and the instruction pipeline when the FENCE.I is executed. A more
+complex implementation might snoop the instruction (data) cache on every
+data (instruction) cache miss, or use an inclusive unified private L2
+cache to invalidate lines from the primary instruction cache when they
+are being written by a local store instruction. If instruction and data
+caches are kept coherent in this way, or if the memory system consists
+of only uncached RAMs, then just the fetch pipeline needs to be flushed
+at a FENCE.I.
+
+The FENCE.I instruction was previously part of the base I instruction
+set. Two main issues motivated moving it out of the mandatory base,
+although at the time of writing it is still the only standard method for
+maintaining instruction-fetch coherence.
+
+First, it has been recognized that on some systems, FENCE.I will be
+expensive to implement and alternate mechanisms are being discussed in
+the memory model task group. In particular, for designs that have an
+incoherent instruction cache and an incoherent data cache, or where the
+instruction cache refill does not snoop a coherent data cache, both
+caches must be completely flushed when a FENCE.I instruction is
+encountered. This problem is exacerbated when there are multiple levels
+of I and D cache in front of a unified cache or outer memory system.
+
+Second, the instruction is not powerful enough to be made available at user
+level in a Unix-like operating system environment. The FENCE.I only
+synchronizes the local hart, and the OS can reschedule the user hart to
+a different physical hart after the FENCE.I. This would require the OS
+to execute an additional FENCE.I as part of every context migration. For
+this reason, the standard Linux ABI has removed FENCE.I from user-level
+and now requires a system call to maintain instruction-fetch coherence,
+which allows the OS to minimize the number of FENCE.I executions
+required on current systems and provides forward-compatibility with
+future improved instruction-fetch coherence mechanisms.
+
+Future approaches to instruction-fetch coherence under discussion
+include providing more restricted versions of FENCE.I that only target a
+given address specified in _rs1_, and/or allowing software to use an ABI
+that relies on machine-mode cache-maintenance operations.
+====
+
+include::images/wavedrom/zifencei-ff.adoc[]
+[[zifencei-ff]]
+.FENCE.I instruction
+image::image_placeholder.png[]
+(((FENCE.I, synchronization)))
+
+The FENCE.I instruction is used to synchronize the instruction and data
+streams. RISC-V does not guarantee that stores to instruction memory
+will be made visible to instruction fetches on a RISC-V hart until that
+hart executes a FENCE.I instruction. A FENCE.I instruction ensures that
+a subsequent instruction fetch on a RISC-V hart will see any previous
+data stores already visible to the same RISC-V hart. FENCE.I does _not_
+ensure that other RISC-V harts’ instruction fetches will observe the
+local hart’s stores in a multiprocessor system. To make a store to
+instruction memory visible to all RISC-V harts, the writing hart also
+has to execute a data FENCE before requesting that all remote RISC-V
+harts execute a FENCE.I.
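+
+A minimal single-hart sequence for executing freshly written code might
+look like the following sketch (the code-buffer register is
+illustrative):
+
+[source,asm]
+----
+sw      t0, 0(a0)    # store a new instruction word to a code buffer
+fence.i              # ensure subsequent fetches see the prior store
+jalr    ra, 0(a0)    # jump to and execute the newly written code
+----
+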
+
+The unused fields in the FENCE.I instruction, _imm[11:0]_, _rs1_, and
+_rd_, are reserved for finer-grain fences in future extensions. For
+forward compatibility, base implementations shall ignore these fields,
+and standard software shall zero these fields.
+(((FENCE.I, finer-grained)))
+(((FENCE.I, forward compatibility)))
+
+[NOTE]
+====
+Because FENCE.I only orders stores with a hart’s own instruction
+fetches, application code should only rely upon FENCE.I if the
+application thread will not be migrated to a different hart. The EEI can
+provide mechanisms for efficient multiprocessor instruction-stream
+synchronization.
+====
diff --git a/src/zihintpause.adoc b/src/zihintpause.adoc
new file mode 100644
index 0000000..9b30a61
--- /dev/null
+++ b/src/zihintpause.adoc
@@ -0,0 +1,61 @@
+[[zihintpause]]
+== `Zihintpause` Pause Hint, Version 2.0
+
+The PAUSE instruction is a HINT that indicates the current hart’s rate
+of instruction retirement should be temporarily reduced or paused. The
+duration of its effect must be bounded and may be zero. No architectural
+state is changed.
+(((PAUSE, HINT)))
+(((HINT, PAUSE)))
+
+Software can use the PAUSE instruction to reduce energy consumption
+while executing spin-wait code sequences. Multithreaded cores might
+temporarily relinquish execution resources to other harts when PAUSE is
+executed. It is recommended that a PAUSE instruction generally be
+included in the code sequence for a spin-wait loop.
+(((PAUSE, energy consumption)))
+
+A future extension might add primitives similar to the x86 MONITOR/MWAIT
+instructions, which provide a more efficient mechanism to wait on writes
+to a specific memory location. However, these instructions would not
+supplant PAUSE. PAUSE is more appropriate when polling for non-memory
+events, when polling for multiple events, or when software does not know
+precisely what events it is polling for.
+
+The duration of a PAUSE instruction’s effect may vary significantly
+within and among implementations. In typical implementations this
+duration should be much less than the time to perform a context switch,
+probably more on the rough order of an on-chip cache miss latency or a
+cacheless access to main memory.
+(((PAUSE, duration)))
+
+A series of PAUSE instructions can be used to create a cumulative delay
+loosely proportional to the number of PAUSE instructions. In spin-wait
+loops in portable code, however, only one PAUSE instruction should be
+used before re-evaluating loop conditions, else the hart might stall
+longer than optimal on some implementations, degrading system
+performance.
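+
+A portable spin-wait loop would therefore contain a single PAUSE per
+iteration, for example (the lock-word location is illustrative):
+
+[source,asm]
+----
+spin:
+    pause            # hint: reduce retirement rate while spinning
+    lw   t0, 0(a0)   # re-read the lock word
+    bnez t0, spin    # keep waiting while it is nonzero
+----
+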
+
+PAUSE is encoded as a FENCE instruction with _pred_=W, _succ_=0, _fm_=0,
+_rd_=`x0`, and _rs1_=`x0`.
+
+PAUSE is encoded as a hint within the FENCE opcode because some
+implementations are expected to deliberately stall the PAUSE instruction
+until outstanding memory transactions have completed. Because the
+successor set is null, however, PAUSE does not _mandate_ any particular
+memory ordering—hence, it truly is a HINT.
+(((PAUSE, encoding)))
+
+include::images/wavedrom/zihintpause-hint.adoc[]
+[[zihintpause-hint]]
+.Zihintpause fence instructions
+image::image_placeholder.png[]
+
+Like other FENCE instructions, PAUSE cannot be used within LR/SC
+sequences without voiding the forward-progress guarantee.
+(((PAUSE, LR/SC sequences)))
+
+The choice of a predecessor set of W is arbitrary, since the successor
+set is null. Other HINTs similar to PAUSE might be encoded with other
+predecessor sets.
+
diff --git a/src/ztso-st-ext.adoc b/src/ztso-st-ext.adoc
new file mode 100644
index 0000000..2ec457d
--- /dev/null
+++ b/src/ztso-st-ext.adoc
@@ -0,0 +1,38 @@
+[[ztso]]
+== `Ztso` Standard Extension for Total Store Ordering, v0.1
+
+This chapter defines the `Ztso` extension for the RISC-V Total Store
+Ordering (RVTSO) memory consistency model. RVTSO is defined as a delta
+from RVWMO, which is defined in <<rvwmo>>.
+
+The Ztso extension is meant to facilitate the porting of code originally
+written for the x86 or SPARC architectures, both of which use TSO by
+default. It also supports implementations which inherently provide RVTSO
+behavior and want to expose that fact to software.
+
+RVTSO makes the following adjustments to RVWMO:
+
+* All load operations behave as if they have an acquire-RCpc annotation.
+* All store operations behave as if they have a release-RCpc annotation.
+* All AMOs behave as if they have both acquire-RCsc and release-RCsc
+annotations.
+
+These rules render all PPO rules except
+<<fence>>–<<rcsc>> redundant. They also make
+redundant any non-I/O fences that do not have both PW and SR set.
+Finally, they also imply that no memory operation will be reordered past
+an AMO in either direction.
+
+In the context of RVTSO, as is the case for RVWMO, the storage ordering
+annotations are concisely and completely defined by PPO rules
+<<acquire>>–<<rcsc>>. In both of these
+memory models, it is the load value axiom that allows a hart to forward a value from its
+store buffer to a subsequent (in program order) load—that is to say that
+stores can be forwarded locally before they are visible to other harts.
+
+In spite of the fact that Ztso adds no new instructions to the ISA, code
+written assuming RVTSO will not run correctly on implementations not
+supporting Ztso. Binaries compiled to run only under Ztso should
+indicate as such via a flag in the binary, so that platforms which do
+not implement Ztso can simply refuse to run them.
+