diff options
Diffstat (limited to 'src/a-st-ext.adoc')
| -rw-r--r-- | src/a-st-ext.adoc | 484 |
1 files changed, 484 insertions, 0 deletions
diff --git a/src/a-st-ext.adoc b/src/a-st-ext.adoc new file mode 100644 index 0000000..593a51a --- /dev/null +++ b/src/a-st-ext.adoc @@ -0,0 +1,484 @@ +== "A" Extension for Atomic Instructions, Version 2.1 + +The atomic-instruction extension, named "A", contains +instructions that atomically read-modify-write memory to support +synchronization between multiple RISC-V harts running in the same memory +space. The two forms of atomic instruction provided are +load-reserved/store-conditional instructions and atomic fetch-and-op +memory instructions. Both types of atomic instruction support various +memory consistency orderings including unordered, acquire, release, and +sequentially consistent semantics. These instructions allow RISC-V to +support the RCsc memory consistency model. cite:[Gharachorloo90memoryconsistency] + +[NOTE] +==== +After much debate, the language community and architecture community +appear to have finally settled on release consistency as the standard +memory consistency model and so the RISC-V atomic support is built +around this model. +==== + +The A extension comprises instructions provided by the Zaamo and Zalrsc +extensions. + +=== Specifying Ordering of Atomic Instructions + +The base RISC-V ISA has a relaxed memory model, with the FENCE +instruction used to impose additional ordering constraints. The address +space is divided by the execution environment into memory and I/O +domains, and the FENCE instruction provides options to order accesses to +one or both of these two address domains. + +To provide more efficient support for release consistency cite:[Gharachorloo90memoryconsistency], each atomic +instruction has two bits, _aq_ and _rl_, used to specify additional +memory ordering constraints as viewed by other RISC-V harts. The bits +order accesses to one of the two address domains, memory or I/O, +depending on which address domain the atomic instruction is accessing. +No ordering constraint is implied to accesses to the other domain, and a +FENCE instruction should be used to order across both domains. + +If both bits are clear, no additional ordering constraints are imposed +on the atomic memory operation. If only the _aq_ bit is set, the atomic +memory operation is treated as an _acquire_ access, i.e., no following +memory operations on this RISC-V hart can be observed to take place +before the acquire memory operation. If only the _rl_ bit is set, the +atomic memory operation is treated as a _release_ access, i.e., the +release memory operation cannot be observed to take place before any +earlier memory operations on this RISC-V hart. If both the _aq_ and _rl_ +bits are set, the atomic memory operation is _sequentially consistent_ +and cannot be observed to happen before any earlier memory operations or +after any later memory operations in the same RISC-V hart and to the +same address domain. + +[[sec:lrsc]] +=== "Zalrsc" Extension for Load-Reserved/Store-Conditional Instructions + +include::images/wavedrom/load-reserve-st-conditional.edn[] + +Complex atomic memory operations on a single memory word or doubleword +are performed with the load-reserved (LR) and store-conditional (SC) +instructions. LR.W loads a word from the address in _rs1_, places the +sign-extended value in _rd_, and registers a _reservation set_—a set of +bytes that subsumes the bytes in the addressed word. SC.W conditionally +writes a word in _rs2_ to the address in _rs1_: the SC.W succeeds only +if the reservation is still valid and the reservation set contains the +bytes being written. If the SC.W succeeds, the instruction writes the +word in _rs2_ to memory, and it writes zero to _rd_. If the SC.W fails, +the instruction does not write to memory, and it writes a nonzero value +to _rd_. +No SC.W instruction shall retire unless it passes memory permission checks, +but it is UNSPECIFIED whether any side effects of implicit address translation +and protection memory accesses (such as setting a page-table entry D bit) +occur on a failed SC.W. +For the purposes of memory protection, a failed SC.W may be +treated like a store. +Regardless of success or failure, executing an +SC.W instruction invalidates any reservation held by this hart. LR.D and +SC.D act analogously on doublewords and are only available on RV64. For +RV64, LR.W and SC.W sign-extend the value placed in _rd_. + +[NOTE] +==== +Both compare-and-swap (CAS) and LR/SC can be used to build lock-free +data structures. After extensive discussion, we opted for LR/SC for +several reasons: 1) CAS suffers from the ABA problem, which LR/SC avoids +because it monitors all writes to the address rather than only checking +for changes in the data value; 2) CAS would also require a new integer +instruction format to support three source operands (address, compare +value, swap value) as well as a different memory system message format, +which would complicate microarchitectures; 3) Furthermore, to avoid the +ABA problem, other systems provide a double-wide CAS (DW-CAS) to allow a +counter to be tested and incremented along with a data word. This +requires reading five registers and writing two in one instruction, and +also a new larger memory system message type, further complicating +implementations; 4) LR/SC provides a more efficient implementation of +many primitives as it only requires one load as opposed to two with CAS +(one load before the CAS instruction to obtain a value for speculative +computation, then a second load as part of the CAS instruction to check +if value is unchanged before updating). + +The main disadvantage of LR/SC over CAS is livelock, which we avoid, +under certain circumstances, with an architected guarantee of eventual +forward progress as described below. Another concern is whether the +influence of the current x86 architecture, with its DW-CAS, will +complicate porting of synchronization libraries and other software that +assumes DW-CAS is the basic machine primitive. A possible mitigating +factor is the recent addition of transactional memory instructions to +x86, which might cause a move away from DW-CAS. + +More generally, a multi-word atomic primitive is desirable, but there is +still considerable debate about what form this should take, and +guaranteeing forward progress adds complexity to a system. +==== + +The failure code with value 1 encodes an unspecified failure. Other +failure codes are reserved at this time. Portable software should only +assume the failure code will be non-zero. + +[NOTE] +==== +We reserve a failure code of 1 to mean ''unspecified'' so that simple +implementations may return this value using the existing multiplexer required +for the SLT/SLTU instructions. More specific failure codes might be +defined in future versions or extensions to the ISA. +==== + +For LR and SC, the Zalrsc extension requires that the address held in _rs1_ +be naturally aligned to the size of the operand (i.e., eight-byte +aligned for _doublewords_ and four-byte aligned for _words_). If the +address is not naturally aligned, an address-misaligned exception or an +access-fault exception will be generated. The access-fault exception can +be generated for a memory access that would otherwise be able to +complete except for the misalignment, if the misaligned access should +not be emulated. +[NOTE] +==== +Emulating misaligned LR/SC sequences is impractical in most systems. + +Misaligned LR/SC sequences also raise the possibility of accessing +multiple reservation sets at once, which present definitions do not +provide for. +==== + +An implementation can register an arbitrarily large reservation set on +each LR, provided the reservation set includes all bytes of the +addressed data word or doubleword. An SC can only pair with the most +recent LR in program order. An SC may succeed only if no store from +another hart to the reservation set can be observed to have occurred +between the LR and the SC, and if there is no other SC between the LR +and itself in program order. An SC may succeed only if no write from a +device other than a hart to the bytes accessed by the LR instruction can +be observed to have occurred between the LR and SC. Note this LR might +have had a different effective address and data size, but reserved the +SC's address as part of the reservation set. + +[NOTE] +==== +Following this model, in systems with memory translation, an SC is +allowed to succeed if the earlier LR reserved the same location using an +alias with a different virtual address, but is also allowed to fail if +the virtual address is different. + +To accommodate legacy devices and buses, writes from devices other than +RISC-V harts are only required to invalidate reservations when they +overlap the bytes accessed by the LR. These writes are not required to +invalidate the reservation when they access other bytes in the +reservation set. +==== + +The SC must fail if the address is not within the reservation set of the +most recent LR in program order. The SC must fail if a store to the +reservation set from another hart can be observed to occur between the +LR and SC. The SC must fail if a write from some other device to the +bytes accessed by the LR can be observed to occur between the LR and SC. +(If such a device writes the reservation set but does not write the +bytes accessed by the LR, the SC may or may not fail.) An SC must fail +if there is another SC (to any address) between the LR and the SC in +program order. The precise statement of the atomicity requirements for +successful LR/SC sequences is defined by the Atomicity Axiom in +<<rvwmo>>. + +[NOTE] +==== +The platform should provide a means to determine the size and shape of +the reservation set. + +A platform specification may constrain the size and shape of the +reservation set. + +A store-conditional instruction to a scratch word of memory should be +used to forcibly invalidate any existing load reservation: + +* during a preemptive context switch, and +* if necessary when changing virtual to physical address mappings, such +as when migrating pages that might contain an active reservation. + +The invalidation of a hart's reservation when it executes an LR or SC +imply that a hart can only hold one reservation at a time, and that an +SC can only pair with the most recent LR, and LR with the next following +SC, in program order. This is a restriction to the Atomicity Axiom in +<<rvwmo>> that ensures software runs correctly on +expected common implementations that operate in this manner. +==== + +An SC instruction can never be observed by another RISC-V hart before +the LR instruction that established the reservation. + +[NOTE] +==== +The LR/SC sequence +can be given acquire semantics by setting the _aq_ bit on the LR +instruction. The LR/SC sequence can be given release semantics by +by setting the _rl_ bit on the SC instruction. Assuming +suitable mappings for other atomic operations, setting the +_aq_ bit on the LR instruction, and setting the +_rl_ bit on the SC instruction makes the LR/SC +sequence sequentially consistent in the C\++ `memory_order_seq_cst` +sense. Such a sequence does not act as a fence for ordering ordinary +load and store instructions before and after the sequence. Specific +instruction mappings for other C++ atomic operations, +or stronger notions of "sequential consistency", may require both +bits to be set on either or both of the LR or SC instruction. + +If neither bit is set on either LR or SC, the LR/SC sequence can be +observed to occur before or after surrounding memory operations from the +same RISC-V hart. This can be appropriate when the LR/SC sequence is +used to implement a parallel reduction operation. +==== + +Software should not set the _rl_ bit on an LR instruction unless the +_aq_ bit is also set, nor should software set the _aq_ bit on an SC +instruction unless the _rl_ bit is also set. LR._rl_ and SC._aq_ +instructions are not guaranteed to provide any stronger ordering than +those with both bits clear, but may result in lower performance. + +[NOTE] +==== +[[cas]] +[source,asm] +.Sample code for compare-and-swap function using LR/SC. + # a0 holds address of memory location + # a1 holds expected value + # a2 holds desired value + # a0 holds return value, 0 if successful, !0 otherwise + cas: + lr.w t0, (a0) # Load original value. + bne t0, a1, fail # Doesn't match, so fail. + sc.w t0, a2, (a0) # Try to update. + bnez t0, cas # Retry if store-conditional failed. + li a0, 0 # Set return to success. + jr ra # Return. + fail: + li a0, 1 # Set return to failure. + jr ra # Return. + +LR/SC can be used to construct lock-free data structures. An example +using LR/SC to implement a compare-and-swap function is shown in +<<cas>>. If inlined, compare-and-swap functionality need only take four instructions. +==== + +[[sec:lrscseq]] +=== Eventual Success of Store-Conditional Instructions + +The Zalrsc extension defines _constrained LR/SC loops_, which have +the following properties: + +* The loop comprises only an LR/SC sequence and code to retry the +sequence in the case of failure, and must comprise at most 16 +instructions placed sequentially in memory. +* An LR/SC sequence begins with an LR instruction and ends with an SC +instruction. The dynamic code executed between the LR and SC +instructions can only contain instructions from the base ''I'' +instruction set, excluding loads, stores, backward jumps, taken backward +branches, JALR, FENCE, and SYSTEM instructions. +Compressed forms of the aforementioned ''I'' instructions in the +C (hence Zca) and Zcb extensions are also permitted. +* The code to retry a failing LR/SC sequence can contain backwards jumps +and/or branches to repeat the LR/SC sequence, but otherwise has the same +constraint as the code between the LR and SC. +* The LR and SC addresses must lie within a memory region with the +_LR/SC eventuality_ property. The execution environment is responsible +for communicating which regions have this property. +* The SC must be to the same effective address and of the same data size +as the latest LR executed by the same hart. + +LR/SC sequences that do not lie within constrained LR/SC loops are +_unconstrained_. Unconstrained LR/SC sequences might succeed on some +attempts on some implementations, but might never succeed on other +implementations. + +[NOTE] +==== +We restricted the length of LR/SC loops to fit within 64 contiguous +instruction bytes in the base ISA to avoid undue restrictions on +instruction cache and TLB size and associativity. Similarly, we +disallowed other loads and stores within the loops to avoid restrictions +on data-cache associativity in simple implementations that track the +reservation within a private cache. The restrictions on branches and +jumps limit the time that can be spent in the sequence. Floating-point +operations and integer multiply/divide were disallowed to simplify the +operating system's emulation of these instructions on implementations +lacking appropriate hardware support. + +Software is not forbidden from using unconstrained LR/SC sequences, but +portable software must detect the case that the sequence repeatedly +fails, then fall back to an alternate code sequence that does not rely +on an unconstrained LR/SC sequence. Implementations are permitted to +unconditionally fail any unconstrained LR/SC sequence. +==== + +If a hart _H_ enters a constrained LR/SC loop, the execution environment +must guarantee that one of the following events eventually occurs: + +* _H_ or some other hart executes a successful SC to the reservation set +of the LR instruction in _H_'s constrained LR/SC loops. +* Some other hart executes an unconditional store or AMO instruction to +the reservation set of the LR instruction in _H_'s constrained LR/SC +loop, or some other device in the system writes to that reservation set. +* _H_ executes a branch or jump that exits the constrained LR/SC loop. +* _H_ traps. + +[NOTE] +==== +Note that these definitions permit an implementation to fail an SC +instruction occasionally for any reason, provided the aforementioned +guarantee is not violated. + +As a consequence of the eventuality guarantee, if some harts in an +execution environment are executing constrained LR/SC loops, and no +other harts or devices in the execution environment execute an +unconditional store or AMO to that reservation set, then at least one +hart will eventually exit its constrained LR/SC loop. By contrast, if +other harts or devices continue to write to that reservation set, it is +not guaranteed that any hart will exit its LR/SC loop. + +Loads and load-reserved instructions do not by themselves impede the +progress of other harts' LR/SC sequences. We note this constraint +implies, among other things, that loads and load-reserved instructions +executed by other harts (possibly within the same core) cannot impede +LR/SC progress indefinitely. For example, cache evictions caused by +another hart sharing the cache cannot impede LR/SC progress +indefinitely. Typically, this implies reservations are tracked +independently of evictions from any shared cache. Similarly, cache +misses caused by speculative execution within a hart cannot impede LR/SC +progress indefinitely. + +These definitions admit the possibility that SC instructions may +spuriously fail for implementation reasons, provided progress is +eventually made. + +One advantage of CAS is that it guarantees that some hart eventually +makes progress, whereas an LR/SC atomic sequence could livelock +indefinitely on some systems. To avoid this concern, we added an +architectural guarantee of livelock freedom for certain LR/SC sequences. + +Earlier versions of this specification imposed a stronger +starvation-freedom guarantee. However, the weaker livelock-freedom +guarantee is sufficient to implement the C11 and C++11 languages, and is +substantially easier to provide in some microarchitectural styles. +==== + +[[sec:amo]] +=== "Zaamo" Extension for Atomic Memory Operations + +include::images/wavedrom/atomic-mem.edn[] + +The atomic memory operation (AMO) instructions perform read-modify-write +operations for multiprocessor synchronization and are encoded with an +R-type instruction format. These AMO instructions atomically load a data +value from the address in _rs1_, place the value into register _rd_, +apply a binary operator to the loaded value and the original value in +_rs2_, then store the result back to the original address in _rs1_. AMOs +can either operate on _doublewords_ (RV64 only) or _words_ in memory. For +RV64, 32-bit AMOs always sign-extend the value placed in _rd_, and +ignore the upper 32 bits of the original value of _rs2_. + +For AMOs, the Zaamo extension requires that the address held in _rs1_ be +naturally aligned to the size of the operand (i.e., eight-byte aligned +for _doublewords_ and four-byte aligned for _words_). If the address +is not naturally aligned, an address-misaligned exception or an +access-fault exception will be generated. The access-fault exception can +be generated for a memory access that would otherwise be able to +complete except for the misalignment, if the misaligned access should +not be emulated. + +The misaligned atomicity granule PMA, defined in Volume II of this manual, +optionally relaxes this alignment requirement. +If present, the misaligned atomicity granule PMA specifies the size +of a misaligned atomicity granule, a power-of-two number of bytes. +The misaligned atomicity granule PMA applies only to AMOs, loads and stores +defined in the base ISAs, and loads and stores of no more than XLEN bits +defined in the F, D, and Q extensions. +For an instruction in that set, if all accessed bytes lie within the same +misaligned atomicity granule, the instruction will not raise an exception for +reasons of address alignment, and the instruction will give rise to only one +memory operation for the purposes of RVWMO--i.e., it will execute atomically. + +The operations supported are swap, integer add, bitwise AND, bitwise OR, +bitwise XOR, and signed and unsigned integer maximum and minimum. +Without ordering constraints, these AMOs can be used to implement +parallel reduction operations, where typically the return value would be +discarded by writing to `x0`. + +[NOTE] +==== +We provided fetch-and-op style atomic primitives as they scale to highly +parallel systems better than LR/SC or CAS. A simple microarchitecture +can implement AMOs using the LR/SC primitives, provided the +implementation can guarantee the AMO eventually completes. More complex +implementations might also implement AMOs at memory controllers, and can +optimize away fetching the original value when the destination is `x0`. + +The set of AMOs was chosen to support the C11/C++11 atomic memory +operations efficiently, and also to support parallel reductions in +memory. Another use of AMOs is to provide atomic updates to +memory-mapped device registers (e.g., setting, clearing, or toggling +bits) in the I/O space. + +The Zaamo extension enables microcontroller class implementations to utilize +atomic primitives from the AMO subset of the A extension. Typically such +implementations do not have caches and thus may not be able to naturally support +the LR/SC instructions provided by the Zalrsc extension. +==== + +To help implement multiprocessor synchronization, the AMOs optionally +provide release consistency semantics. If the _aq_ bit is set, then no +later memory operations in this RISC-V hart can be observed to take +place before the AMO. Conversely, if the _rl_ bit is set, then other +RISC-V harts will not observe the AMO before memory accesses preceding +the AMO in this RISC-V hart. Setting both the _aq_ and the _rl_ bit on +an AMO makes the sequence sequentially consistent, meaning that it +cannot be reordered with earlier or later memory operations from the +same hart. + +[NOTE] +==== +The AMOs were designed to implement the C11 and C++11 memory models +efficiently. Although the FENCE R, RW instruction suffices to implement +the _acquire_ operation and FENCE RW, W suffices to implement _release_, +both imply additional unnecessary ordering as compared to AMOs with the +corresponding _aq_ or _rl_ bit set. +==== + +[NOTE] +==== +An example code sequence for a critical section guarded by a +test-and-test-and-set spinlock is shown in +Example <<critical>>. Note the first AMO is marked _aq_ to +order the lock acquisition before the critical section, and the second +AMO is marked _rl_ to order the critical section before the lock +relinquishment. + +[[critical]] +[source,asm] +.Sample code for mutual exclusion. `a0` contains the address of the lock. + li t0, 1 # Initialize swap value. + again: + lw t1, (a0) # Check if lock is held. + bnez t1, again # Retry if held. + amoswap.w.aq t1, t0, (a0) # Attempt to acquire lock. + bnez t1, again # Retry if held. + # ... + # Critical section. + # ... + amoswap.w.rl x0, x0, (a0) # Release lock by storing 0. + +We recommend the use of the AMO Swap idiom shown in <<critical>> for both lock +acquire and release to simplify the implementation of speculative lock +elision. cite:[Rajwar:2001:SLE] +==== + +[NOTE] +==== +The instructions in the "A" extension can be used to provide sequentially +consistent loads and stores, but this constrains hardware +reordering of memory accesses more than necessary. +A C++ sequentially consistent load can be implemented as +an LR with _aq_ set. However, the LR/SC eventual +success guarantee may slow down concurrent loads from the same effective +address. A sequentially consistent store can be implemented as an AMOSWAP +that writes the old value to `x0` and has _rl_ set. However the superfluous +load may impose ordering constraints that are unnecessary for this use case. +Specific compilation conventions may require both the _aq_ and _rl_ +bits to be set in either or both the LR and AMOSWAP instructions. +==== |
