diff options
author | Andrew Waterman <andrew@sifive.com> | 2017-02-01 20:41:47 -0800 |
---|---|---|
committer | Andrew Waterman <andrew@sifive.com> | 2017-02-01 20:41:47 -0800 |
commit | ab6f8c9bd7bc85361fcf35667d1fddfaf367a53f (patch) | |
tree | 716a2118ca0565dbb4e7903723f283ae4dd13c46 /src/a.tex | |
parent | 207a7c6ee51aa2fd74d4618cd1369ddc21706b9e (diff) | |
download | riscv-isa-manual-ab6f8c9bd7bc85361fcf35667d1fddfaf367a53f.zip riscv-isa-manual-ab6f8c9bd7bc85361fcf35667d1fddfaf367a53f.tar.gz riscv-isa-manual-ab6f8c9bd7bc85361fcf35667d1fddfaf367a53f.tar.bz2 |
Reorganize directory structure
Diffstat (limited to 'src/a.tex')
-rw-r--r-- | src/a.tex | 379 |
1 files changed, 379 insertions, 0 deletions
diff --git a/src/a.tex b/src/a.tex new file mode 100644 index 0000000..8c3c745 --- /dev/null +++ b/src/a.tex @@ -0,0 +1,379 @@ +\chapter{``A'' Standard Extension for Atomic Instructions, Version 2.0} +\label{atomics} + +The standard atomic instruction extension is denoted by instruction +subset name ``A'', and contains instructions that atomically +read-modify-write memory to support synchronization between multiple +RISC-V threads running in the same memory space. The two forms of +atomic instruction provided are load-reserved/store-conditional +instructions and atomic fetch-and-op memory instructions. Both types +of atomic instruction support various memory consistency orderings +including unordered, acquire, release, and sequentially consistent +semantics. These instructions allow RISC-V to support the RCsc memory +consistency model~\cite{Gharachorloo90memoryconsistency}. + +\begin{commentary} +After much debate, the language community and architecture community +appear to have finally settled on release consistency as the standard +memory consistency model and so the RISC-V atomic support is built +around this model. +\end{commentary} + +\section{Specifying Ordering of Atomic Instructions} + +The base RISC-V ISA has a relaxed memory model, with the FENCE +instruction used to impose additional ordering constraints. The +address space is divided by the execution environment into memory and +I/O domains, and the FENCE instruction provides options to order +accesses to one or both of these two address domains. + +To provide more efficient support for release +consistency~\cite{Gharachorloo90memoryconsistency}, each atomic +instruction has two bits, {\em aq} and {\em rl}, used to specify +additional memory ordering constraints as viewed by other RISC-V +threads. The bits order accesses to one of the two address domains, +memory or I/O, depending on which address domain the atomic +instruction is accessing. No ordering constraint is implied to +accesses to the other domain, and a FENCE instruction should be used +to order across both domains. + +If both bits are clear, no additional ordering constraints are imposed +on the atomic memory operation. If only the {\em aq} bit is set, the +atomic memory operation is treated as an {\em acquire} access, i.e., +no following memory operations on this RISC-V thread can be observed +to take place before the acquire memory operation. If only the {\em + rl} bit is set, the atomic memory operation is treated as a {\em + release} access, i.e., the release memory operation can not be +observed to take place before any earlier memory operations on this +RISC-V thread. If both the {\em aq} and {\em rl} bits are set, the +atomic memory operation is {\em sequentially consistent} and cannot be +observed to happen before any earlier memory operations or after any +later memory operations in the same RISC-V thread, and can only be +observed by any other thread in the same global order of all +sequentially consistent atomic memory operations to the same address +domain. + +\begin{commentary} +Theoretically, the definition of the {\em aq} and {\em rl} bits allows +for implementations without global store atomicity. When both {\em + aq} and {\em rl} bits are set, however, we require full sequential +consistency for the atomic operation which implies global store +atomicity in addition to both acquire and release semantics. In +practice, hardware systems are usually implemented with global store +atomicity, embodied in local processor ordering rules together with +single-writer cache coherence protocols. +\end{commentary} + +\section{Load-Reserved/Store-Conditional Instructions} + +\vspace{-0.2in} +\begin{center} +\begin{tabular}{R@{}W@{}W@{}R@{}R@{}F@{}R@{}O} +\\ +\instbitrange{31}{27} & +\instbit{26} & +\instbit{25} & +\instbitrange{24}{20} & +\instbitrange{19}{15} & +\instbitrange{14}{12} & +\instbitrange{11}{7} & +\instbitrange{6}{0} \\ +\hline +\multicolumn{1}{|c|}{funct5} & +\multicolumn{1}{c|}{aq} & +\multicolumn{1}{c|}{rl} & +\multicolumn{1}{c|}{rs2} & +\multicolumn{1}{c|}{rs1} & +\multicolumn{1}{c|}{funct3} & +\multicolumn{1}{c|}{rd} & +\multicolumn{1}{c|}{opcode} \\ +\hline +5 & 1 & 1 & 5 & 5 & 3 & 5 & 7 \\ +LR & \multicolumn{2}{c}{ordering} & 0 & addr & width & dest & AMO \\ +SC & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO \\ +\end{tabular} +\end{center} + +Complex atomic memory operations on a single memory word are performed +with the load-reserved (LR) and store-conditional (SC) instructions. +LR loads a word from the address in {\em rs1}, places the +sign-extended value in {\em rd}, and registers a reservation on the +memory address. SC writes a word in {\em rs2} to the address in {\em + rs1}, provided a valid reservation still exists on that address. SC +writes zero to {\em rd} on success or a nonzero code on failure. + +\begin{commentary} +Both compare-and-swap (CAS) and LR/SC can be used to build lock-free +data structures. After extensive discussion, we opted for LR/SC for +several reasons: 1) CAS suffers from the ABA problem, which LR/SC +avoids because it monitors all accesses to the address rather than +only checking for changes in the data value; 2) CAS would also require +a new integer instruction format to support three source operands +(address, compare value, swap value) as well as a different memory +system message format, which would complicate microarchitectures; 3) +Furthermore, to avoid the ABA problem, other systems provide a +double-wide CAS (DW-CAS) to allow a counter to be tested and +incremented along with a data word. This requires reading five +registers and writing two in one instruction, and also a new larger +memory system message type, further complicating implementations; 4) +LR/SC provides a more efficient implementation of many primitives as +it only requires one load as opposed to two with CAS (one load before +the CAS instruction to obtain a value for speculative computation, +then a second load as part of the CAS instruction to check if value is +unchanged before updating). + +The main disadvantage of LR/SC over CAS is livelock, which we avoid +with an architected guarantee of eventual forward progress as +described below. Another concern is whether the influence of the +current x86 architecture, with its DW-CAS, will complicate porting of +synchronization libraries and other software that assumes DW-CAS is +the basic machine primitive. A possible mitigating factor is the +recent addition of transactional memory instructions to x86, which +might cause a move away from DW-CAS. +\end{commentary} + +The failure code with value 1 is reserved to encode an unspecified +failure. Other failure codes are reserved at this time, and portable +software should only assume the failure code will be non-zero. LR and +SC operate on naturally-aligned 64-bit (RV64 only) or 32-bit words in +memory. Misaligned addresses will generate misaligned address +exceptions. + +\begin{commentary} +We reserve a failure code of 1 to mean ``unspecified'' so that simple +implementations may return this value using the existing mux required +for the SLT/SLTU instructions. More specific failure codes might be +defined in future versions or extensions to the ISA. +\end{commentary} + +\label{lrscseq} + +In the standard A extension, certain constrained LR/SC sequences are +guaranteed to succeed eventually. The static code for the LR/SC +sequence plus the code to retry the sequence in case of failure must +comprise at most 16 integer instructions placed sequentially in +memory. For the sequence to be guaranteed to eventually succeed, the +dynamic code executed between the LR and SC instructions can only +contain other instructions from the base ``I'' subset, excluding +loads, stores, backward jumps or taken backward branches, FENCE, +FENCE.I, and SYSTEM instructions. The code to retry a failing LR/SC +sequence can contain backward jumps and/or branches to repeat the +LR/SC sequence, but otherwise has the same constraints. The SC must +be to the same address as the latest LR executed. LR/SC sequences +that do not meet these constraints might complete on some attempts on +some implementations, but there is no guarantee of eventual success. + +\begin{commentary} +One advantage of CAS is that it guarantees that some thread eventually +makes progress, whereas an LR/SC atomic sequence could livelock +indefinitely on some systems. To avoid this concern, we added an +architectural guarantee of forward progress to LR/SC atomic sequences. +The restrictions on LR/SC sequence contents allows an implementation +to capture a cache line on the LR and complete the LR/SC sequence by +holding off remote cache interventions for a bounded short +time. Interrupts and TLB misses might cause the reservation to be +lost, but eventually the atomic sequence can complete. We restricted +the length of LR/SC sequences to fit within 64 contiguous instruction +bytes in the base ISA to avoid undue restrictions on instruction cache +and TLB size and associativity. Similarly, we disallowed other loads +and stores within the sequences to avoid restrictions on data cache +associativity. The restrictions on branches and jumps limits the time +that can be spent in the sequence. Floating-point operations and +integer multiply/divide were disallowed to simplify the operating +system's emulation of these instructions on implementations lacking +appropriate hardware support. +\end{commentary} + +An implementation can reserve an arbitrary subset of the memory space +on each LR and multiple LR reservations might be active simultaneously +for a single hart. An SC can succeed if no accesses from other harts +to the address can be observed to have occurred between the SC and +the last LR in this hart to reserve the address. Note this LR might +have had a different address argument, but reserved the SC's address +as part of the memory subset. Following this model, in systems with +memory translation, an SC is allowed to succeed if the earlier LR +reserved the same location using an alias with a different virtual +address, but is also allowed to fail if the virtual address is +different. The SC must fail if there is an observable memory access +from another hart to the address, or if there is an intervening +context switch on this hart, or if in the meantime the hart executed a +privileged exception-return instruction. + +\begin{commentary} +The specification explicitly allows implementations to support more +powerful implementations with wider guarantees, provided they do not +void the atomicity guarantees for the constrained sequences. +\end{commentary} + +LR/SC can be used to construct lock-free data structures. An example +using LR/SC to implement a compare-and-swap function is shown in +Figure~\ref{cas}. If inlined, compare-and-swap functionality need +only take three instructions. + +\begin{figure}[h!] +\begin{center} +\begin{verbatim} + # a0 holds address of memory location + # a1 holds expected value + # a2 holds desired value + # a0 holds return value, 0 if successful, !0 otherwise + cas: + lr.w t0, (a0) # Load original value. + bne t0, a1, fail # Doesn't match, so fail. + sc.w a0, a2, (a0) # Try to update. + jr ra # Return. + fail: + li a0, 1 # Set return to failure. + jr ra # Return. +\end{verbatim} +\end{center} +\caption{Sample code for compare-and-swap function using LR/SC.} +\label{cas} +\end{figure} + +An SC instruction can never be observed by another RISC-V thread +before the immediately preceding LR. Due to the atomic nature of the +LR/SC sequence, no memory operations from any thread can be observed +to have occurred between the LR and a successful SC. The LR/SC +sequence can be given acquire semantics by setting the {\em aq} bit on +the SC instruction. The LR/SC sequence can be given release semantics +by setting the {\em rl} bit on the LR instruction. Setting both {\em + aq} and {\em rl} bits on the LR instruction, and setting the {\em + aq} bit on the SC instruction makes the LR/SC sequence sequentially +consistent with respect to other sequentially consistent atomic +operations. + +If neither bit is set on both LR and SC, the LR/SC sequence can be +observed to occur before or after surrounding memory operations from +the same RISC-V thread. This can be appropriate when the LR/SC +sequence is used to implement a parallel reduction operation. + +\begin{commentary} +In general, a multi-word atomic primitive is desirable but there is +still considerable debate about what form this should take, and +guaranteeing forward progress adds complexity to a system. Our +current thoughts are to include a small limited-capacity transactional +memory buffer along the lines of the original transactional memory +proposals as an optional standard extension ``T''. +\end{commentary} + +\section{Atomic Memory Operations} + +\vspace{-0.2in} +\begin{center} +\begin{tabular}{O@{}W@{}W@{}R@{}R@{}F@{}R@{}R} +\\ +\instbitrange{31}{27} & +\instbit{26} & +\instbit{25} & +\instbitrange{24}{20} & +\instbitrange{19}{15} & +\instbitrange{14}{12} & +\instbitrange{11}{7} & +\instbitrange{6}{0} \\ +\hline +\multicolumn{1}{|c|}{funct5} & +\multicolumn{1}{c|}{aq} & +\multicolumn{1}{c|}{rl} & +\multicolumn{1}{c|}{rs2} & +\multicolumn{1}{c|}{rs1} & +\multicolumn{1}{c|}{funct3} & +\multicolumn{1}{c|}{rd} & +\multicolumn{1}{c|}{opcode} \\ +\hline +5 & 1 & 1 & 5 & 5 & 3 & 5 & 7 \\ +AMOSWAP.W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO \\ +AMOADD.W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO \\ +AMOAND.W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO \\ +AMOOR.W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO \\ +AMOXOR.W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO \\ +AMOMAX[U].W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO \\ +AMOMIN[U].W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO \\ +\end{tabular} +\end{center} + +\vspace{-0.1in} The atomic memory operation (AMO) instructions perform +read-modify-write operations for multiprocessor synchronization and +are encoded with an R-type instruction format. These AMO instructions +atomically load a data value from the address in {\em rs1}, place the +value into register {\em rd}, apply a binary operator to the loaded +value and the original value in {\em rs2}, then store the result back +to the address in {\em rs1}. AMOs can either operate on 64-bit (RV64 +only) or 32-bit words in memory. For RV64, 32-bit AMOs always +sign-extend the value placed in {\em rd}. The address held in {\em + rs1} must be naturally aligned to the size of the operand (i.e., +eight-byte aligned for 64-bit words and four-byte aligned for 32-bit +words). If the address is not naturally aligned, a misaligned address +exception will be generated. + +The operations supported are swap, integer add, logical AND, logical +OR, logical XOR, and signed and unsigned integer maximum and minimum. +Without ordering constraints, these AMOs can be used to implement +parallel reduction operations, where typically the return value would +be discarded by writing to {\tt x0}. + +\begin{commentary} +We provided fetch-and-op style atomic primitives as they scale to +highly parallel systems better than LR/SC or CAS. A simple +microarchitecture can implement AMOs using the LR/SC primitives. More +complex implementations might also implement AMOs at memory +controllers, and can optimize away fetching the original value when +the destination is {\tt x0}. +\end{commentary} + +To help implement multiprocessor synchronization, the AMOs optionally +provide release consistency semantics. If the {\em aq} bit is set, +then no later memory operations in this RISC-V thread can be observed +to take place before the AMO. +Conversely, if the {\em rl} bit is set, then other +RISC-V threads will not observe the AMO before memory accesses +preceding the AMO in this RISC-V thread. + +\begin{commentary} +The AMOs were designed to implement the C11 and C++11 memory models +efficiently. Although the FENCE R, RW instruction suffices to +implement the {\em acquire} operation and FENCE RW, W suffices to +implement {\em release}, both imply additional unnecessary ordering as +compared to AMOs with the corresponding {\em aq} or {\em rl} bit set. +\end{commentary} + +AMOs can also be used to provide sequentially consistent loads and +stores. A sequentially consistent load can be implemented as an +AMOADD of x0 with both {\em aq} and {\em rl} set. A sequentially +consistent store can be implemented as an AMOSWAP that writes the old +value to x0 and has both {\em aq} and {\em rl} set. + +An example code sequence for a critical section guarded by a +test-and-set spinlock is shown in Figure~\ref{critical}. Note the +first AMO is marked {\em aq} to order the lock acquisition before the +critical section, and the second AMO is marked {\em rl} to order +the critical section before the lock relinquishment. + +\begin{figure}[h!] +\begin{center} +\begin{verbatim} + li t0, 1 # Initialize swap value. + again: + amoswap.w.aq t0, t0, (a0) # Attempt to acquire lock. + bnez t0, again # Retry if held. + # ... + # Critical section. + # ... + amoswap.w.rl x0, x0, (a0) # Release lock by storing 0. +\end{verbatim} +\end{center} +\caption{Sample code for mutual exclusion. {\tt a0} contains the address of the lock.} +\label{critical} +\end{figure} + +\begin{commentary} +We recommend the use of the AMO Swap idiom shown above for both lock +acquire and release to simplify the implementation of speculative lock +elision~\cite{Rajwar:2001:SLE}. + +At the risk of complicating the implementation of atomic operations, +microarchitectures can elide the store within the acquire swap if the +lock value matches the swap value, to avoid dirtying a cache line held +in a shared or exclusive clean state. The effect is similar to a +test-and-test-and-set lock but with shorter code paths. +\end{commentary} |