aboutsummaryrefslogtreecommitdiff
path: root/src/a.tex
diff options
context:
space:
mode:
Diffstat (limited to 'src/a.tex')
-rw-r--r--src/a.tex404
1 files changed, 0 insertions, 404 deletions
diff --git a/src/a.tex b/src/a.tex
deleted file mode 100644
index 163c52f..0000000
--- a/src/a.tex
+++ /dev/null
@@ -1,404 +0,0 @@
-\chapter{``A'' Standard Extension for Atomic Instructions, Version 2.0}
-\label{atomics}
-
-The standard atomic instruction extension is denoted by instruction
-subset name ``A'', and contains instructions that atomically
-read-modify-write memory to support synchronization between multiple
-RISC-V harts running in the same memory space. The two forms of
-atomic instruction provided are load-reserved/store-conditional
-instructions and atomic fetch-and-op memory instructions. Both types
-of atomic instruction support various memory consistency orderings
-including unordered, acquire, release, and sequentially consistent
-semantics. These instructions allow RISC-V to support the RCsc memory
-consistency model~\cite{Gharachorloo90memoryconsistency}.
-
-\begin{commentary}
-After much debate, the language community and architecture community
-appear to have finally settled on release consistency as the standard
-memory consistency model and so the RISC-V atomic support is built
-around this model.
-\end{commentary}
-
-\section{Specifying Ordering of Atomic Instructions}
-
-The base RISC-V ISA has a relaxed memory model, with the FENCE
-instruction used to impose additional ordering constraints. The
-address space is divided by the execution environment into memory and
-I/O domains, and the FENCE instruction provides options to order
-accesses to one or both of these two address domains.
-
-To provide more efficient support for release
-consistency~\cite{Gharachorloo90memoryconsistency}, each atomic
-instruction has two bits, {\em aq} and {\em rl}, used to specify
-additional memory ordering constraints as viewed by other RISC-V
-harts. The bits order accesses to one of the two address domains,
-memory or I/O, depending on which address domain the atomic
-instruction is accessing. No ordering constraint is implied to
-accesses to the other domain, and a FENCE instruction should be used
-to order across both domains.
-
-If both bits are clear, no additional ordering constraints are imposed
-on the atomic memory operation. If only the {\em aq} bit is set, the
-atomic memory operation is treated as an {\em acquire} access, i.e.,
-no following memory operations on this RISC-V hart can be observed
-to take place before the acquire memory operation. If only the {\em
- rl} bit is set, the atomic memory operation is treated as a {\em
- release} access, i.e., the release memory operation cannot be
-observed to take place before any earlier memory operations on this
-RISC-V hart. If both the {\em aq} and {\em rl} bits are set, the
-atomic memory operation is {\em sequentially consistent} and cannot be
-observed to happen before any earlier memory operations or after any
-later memory operations in the same RISC-V hart and to the same
-address domain.
-
-\section{Load-Reserved/Store-Conditional Instructions}
-\label{sec:lrsc}
-
-\vspace{-0.2in}
-\begin{center}
-\begin{tabular}{R@{}W@{}W@{}R@{}R@{}F@{}R@{}O}
-\\
-\instbitrange{31}{27} &
-\instbit{26} &
-\instbit{25} &
-\instbitrange{24}{20} &
-\instbitrange{19}{15} &
-\instbitrange{14}{12} &
-\instbitrange{11}{7} &
-\instbitrange{6}{0} \\
-\hline
-\multicolumn{1}{|c|}{funct5} &
-\multicolumn{1}{c|}{aq} &
-\multicolumn{1}{c|}{rl} &
-\multicolumn{1}{c|}{rs2} &
-\multicolumn{1}{c|}{rs1} &
-\multicolumn{1}{c|}{funct3} &
-\multicolumn{1}{c|}{rd} &
-\multicolumn{1}{c|}{opcode} \\
-\hline
-5 & 1 & 1 & 5 & 5 & 3 & 5 & 7 \\
-LR & \multicolumn{2}{c}{ordering} & 0 & addr & width & dest & AMO \\
-SC & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO \\
-\end{tabular}
-\end{center}
-
-Complex atomic memory operations on a single memory word are performed
-with the load-reserved (LR) and store-conditional (SC) instructions.
-LR loads a word from the address in {\em rs1}, places the
-sign-extended value in {\em rd}, and registers a reservation on the
-memory address and a range of bytes including at least all bytes of
-the addressed word. SC writes a word in {\em rs2} to the address in
-{\em rs1}, provided a valid reservation still exists on that address.
-SC writes zero to {\em rd} on success or a nonzero code on failure.
-
-\begin{commentary}
-Both compare-and-swap (CAS) and LR/SC can be used to build lock-free
-data structures. After extensive discussion, we opted for LR/SC for
-several reasons: 1) CAS suffers from the ABA problem, which LR/SC
-avoids because it monitors all accesses to the address rather than
-only checking for changes in the data value; 2) CAS would also require
-a new integer instruction format to support three source operands
-(address, compare value, swap value) as well as a different memory
-system message format, which would complicate microarchitectures; 3)
-Furthermore, to avoid the ABA problem, other systems provide a
-double-wide CAS (DW-CAS) to allow a counter to be tested and
-incremented along with a data word. This requires reading five
-registers and writing two in one instruction, and also a new larger
-memory system message type, further complicating implementations; 4)
-LR/SC provides a more efficient implementation of many primitives as
-it only requires one load as opposed to two with CAS (one load before
-the CAS instruction to obtain a value for speculative computation,
-then a second load as part of the CAS instruction to check if value is
-unchanged before updating).
-
-The main disadvantage of LR/SC over CAS is livelock, which we avoid
-with an architected guarantee of eventual forward progress as
-described below. Another concern is whether the influence of the
-current x86 architecture, with its DW-CAS, will complicate porting of
-synchronization libraries and other software that assumes DW-CAS is
-the basic machine primitive. A possible mitigating factor is the
-recent addition of transactional memory instructions to x86, which
-might cause a move away from DW-CAS.
-\end{commentary}
-
-The failure code with value 1 is reserved to encode an unspecified
-failure. Other failure codes are reserved at this time, and portable
-software should only assume the failure code will be non-zero.
-
-\begin{commentary}
-We reserve a failure code of 1 to mean ``unspecified'' so that simple
-implementations may return this value using the existing mux required
-for the SLT/SLTU instructions. More specific failure codes might be
-defined in future versions or extensions to the ISA.
-\end{commentary}
-
-For LR and SC, the A extension requires that the address held in {\em
- rs1} be naturally aligned to the size of the operand (i.e.,
-eight-byte aligned for 64-bit words and four-byte aligned for 32-bit
-words). If the address is not naturally aligned, a misaligned address
-exception or an access exception will be generated. The access
-exception can be generated for a memory access that would otherwise be
-able to complete except for the misalignment, if the misaligned access
-should not be emulated.
-
-\label{lrscseq}
-
-In the standard A extension, certain constrained LR/SC sequences are
-guaranteed to succeed eventually. The static code for the LR/SC
-sequence plus the code to retry the sequence in case of failure must
-comprise at most 16 integer instructions placed sequentially in
-memory. For the sequence to be guaranteed to eventually succeed, the
-dynamic code executed between the LR and SC instructions can only
-contain other instructions from the base ``I'' subset, excluding
-loads, stores, backward jumps or taken backward branches, FENCE,
-FENCE.I, and SYSTEM instructions. The code to retry a failing LR/SC
-sequence can contain backward jumps and/or branches to repeat the
-LR/SC sequence, but otherwise has the same constraints. The SC must
-be to the same address and of the same data size as the latest LR
-executed. The execution environment can limit the instruction and
-data memory regions within which forward progress is guaranteed.
-LR/SC sequences that do not meet all these constraints might complete on
-some attempts on some implementations, but there is no guarantee of
-eventual success.
-
-\begin{commentary}
-One advantage of CAS is that it guarantees that some hart eventually
-makes progress, whereas an LR/SC atomic sequence could livelock
-indefinitely on some systems. To avoid this concern, we added an
-architectural guarantee of forward progress to LR/SC atomic sequences.
-The restrictions on LR/SC sequence contents allows an implementation
-to capture a cache line on the LR and complete the LR/SC sequence by
-holding off remote cache interventions for a bounded short
-time. Interrupts and TLB misses might cause the reservation to be
-lost, but eventually the atomic sequence can complete. We restricted
-the length of LR/SC sequences to fit within 64 contiguous instruction
-bytes in the base ISA to avoid undue restrictions on instruction cache
-and TLB size and associativity. Similarly, we disallowed other loads
-and stores within the sequences to avoid restrictions on data-cache
-associativity. The restrictions on branches and jumps limits the time
-that can be spent in the sequence. Floating-point operations and
-integer multiply/divide were disallowed to simplify the operating
-system's emulation of these instructions on implementations lacking
-appropriate hardware support.
-
-Although software is not forbidden from using LR/SC sequences that do not meet
-the forward-progress constraints, portable software must detect the case that
-the sequence repeatedly fails, then fall back to an alternate code sequence
-that does not run afoul of the forward-progress constraints.
-Implementations are permitted to simply always fail any LR/SC sequence that does
-not meet the forward-progress guarantee.
-\end{commentary}
-
-An implementation can reserve an arbitrarily large subset of the
-address space on each LR, provided the memory range includes all bytes
-of the addressed data word.
-An SC can only pair with the most recent LR in program order. An SC
-may succeed if no store from another hart to the address range
-reserved by the LR can be observed to have occurred between the LR and
-the SC, and if there is no other SC between the LR and itself in
-program order. Note this LR might have had a different address
-argument, but reserved the SC's address as part of the memory subset.
-Following this model, in systems with memory translation, an SC is
-allowed to succeed if the earlier LR reserved the same location using
-an alias with a different virtual address, but is also allowed to fail
-if the virtual address is different. The SC must fail if a store from
-another hart to the address range reserved by the LR can be observed
-to occur between the LR and the SC. An SC must fail if there is
-another SC (to any address) between the LR and the SC in program
-order. The precise statement of the atomicity requirements for
-successful LR/SC sequences is defined by the Atomicity Axiom in
-Section~\ref{sec:rvwmo}.
-
-\begin{commentary}
-A store-conditional instruction to a scratch word of memory should be used
-during a preemptive context switch to forcibly yield any existing load
-reservation.
-\end{commentary}
-
-LR/SC can be used to construct lock-free data structures. An example
-using LR/SC to implement a compare-and-swap function is shown in
-Figure~\ref{cas}. If inlined, compare-and-swap functionality need
-only take four instructions.
-
-\begin{figure}[h!]
-\begin{center}
-\begin{verbatim}
- # a0 holds address of memory location
- # a1 holds expected value
- # a2 holds desired value
- # a0 holds return value, 0 if successful, !0 otherwise
- cas:
- lr.w t0, (a0) # Load original value.
- bne t0, a1, fail # Doesn't match, so fail.
- sc.w a0, a2, (a0) # Try to update.
- bnez a0, cas # Retry if store-conditional failed.
- jr ra # Return.
- fail:
- li a0, 1 # Set return to failure.
- jr ra # Return.
-\end{verbatim}
-\end{center}
-\caption{Sample code for compare-and-swap function using LR/SC.}
-\label{cas}
-\end{figure}
-
-An SC instruction can never be observed by another RISC-V hart
-before the immediately preceding LR. Due to the atomic nature of the
-LR/SC sequence, no memory operations from another hart can be observed
-to have occurred between the LR and a successful SC. The LR/SC
-sequence can be given acquire semantics by setting the {\em aq} bit on
-the LR instruction. The LR/SC sequence can be given release semantics
-by setting the {\em rl} bit on the SC instruction. Setting the {\em
- aq} bit on the LR instruction, and setting both the {\em aq} and the {\em
- rl} bit on the SC instruction makes the LR/SC sequence sequentially
-consistent, meaning that it cannot be reordered with earlier or
-later memory operations from the same hart.
-
-The {\em rl} bit on an LR instruction must not be set unless the {\em aq} bit is also set.
-The {\em aq} bit on an SC instruction must not be set unless the {\em rl} bit is also set.
-
-If neither bit is set on both LR and SC, the LR/SC sequence can be
-observed to occur before or after surrounding memory operations from
-the same RISC-V hart. This can be appropriate when the LR/SC
-sequence is used to implement a parallel reduction operation.
-
-\begin{commentary}
-In general, a multi-word atomic primitive is desirable but there is
-still considerable debate about what form this should take, and
-guaranteeing forward progress adds complexity to a system. Our
-current thoughts are to include a small limited-capacity transactional
-memory buffer along the lines of the original transactional memory
-proposals as an optional standard extension ``T''.
-\end{commentary}
-
-\section{Atomic Memory Operations}
-\label{sec:amo}
-
-\vspace{-0.2in}
-\begin{center}
-\begin{tabular}{O@{}W@{}W@{}R@{}R@{}F@{}R@{}R}
-\\
-\instbitrange{31}{27} &
-\instbit{26} &
-\instbit{25} &
-\instbitrange{24}{20} &
-\instbitrange{19}{15} &
-\instbitrange{14}{12} &
-\instbitrange{11}{7} &
-\instbitrange{6}{0} \\
-\hline
-\multicolumn{1}{|c|}{funct5} &
-\multicolumn{1}{c|}{aq} &
-\multicolumn{1}{c|}{rl} &
-\multicolumn{1}{c|}{rs2} &
-\multicolumn{1}{c|}{rs1} &
-\multicolumn{1}{c|}{funct3} &
-\multicolumn{1}{c|}{rd} &
-\multicolumn{1}{c|}{opcode} \\
-\hline
-5 & 1 & 1 & 5 & 5 & 3 & 5 & 7 \\
-AMOSWAP.W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO \\
-AMOADD.W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO \\
-AMOAND.W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO \\
-AMOOR.W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO \\
-AMOXOR.W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO \\
-AMOMAX[U].W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO \\
-AMOMIN[U].W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO \\
-\end{tabular}
-\end{center}
-
-\vspace{-0.1in} The atomic memory operation (AMO) instructions perform
-read-modify-write operations for multiprocessor synchronization and
-are encoded with an R-type instruction format. These AMO instructions
-atomically load a data value from the address in {\em rs1}, place the
-value into register {\em rd}, apply a binary operator to the loaded
-value and the original value in {\em rs2}, then store the result back
-to the address in {\em rs1}. AMOs can either operate on 64-bit (RV64
-only) or 32-bit words in memory. For RV64, 32-bit AMOs always
-sign-extend the value placed in {\em rd}.
-
-For AMOs, the A extension requires that the address held in {\em rs1}
-be naturally aligned to the size of the operand (i.e., eight-byte
-aligned for 64-bit words and four-byte aligned for 32-bit words). If
-the address is not naturally aligned, a misaligned address exception
-or an access exception will be generated. The access exception can be
-generated for a memory access that would otherwise be able to complete
-except for the misalignment, if the misaligned access should not be
-emulated. The ``Zam'' extension, described in Chapter~\ref{sec:zam},
-relaxes this requirement and specifies the semantics of misaligned
-AMOs.
-
-The operations supported are swap, integer add, bitwise AND, bitwise
-OR, bitwise XOR, and signed and unsigned integer maximum and minimum.
-Without ordering constraints, these AMOs can be used to implement
-parallel reduction operations, where typically the return value would
-be discarded by writing to {\tt x0}.
-
-\begin{commentary}
-We provided fetch-and-op style atomic primitives as they scale to
-highly parallel systems better than LR/SC or CAS. A simple
-microarchitecture can implement AMOs using the LR/SC primitives. More
-complex implementations might also implement AMOs at memory
-controllers, and can optimize away fetching the original value when
-the destination is {\tt x0}.
-
-The set of AMOs was chosen to support the C11/C++11 atomic memory
-operations efficiently, and also to support parallel reductions in
-memory. Another use of AMOs is to provide atomic updates to
-memory-mapped device registers (e.g., setting, clearing, or toggling
-bits) in the I/O space.
-\end{commentary}
-
-To help implement multiprocessor synchronization, the AMOs optionally
-provide release consistency semantics. If the {\em aq} bit is set,
-then no later memory operations in this RISC-V hart can be observed
-to take place before the AMO.
-Conversely, if the {\em rl} bit is set, then other
-RISC-V harts will not observe the AMO before memory accesses
-preceding the AMO in this RISC-V hart. Setting both the {\em aq} and the {\em
-rl} bit on an AMO makes the sequence sequentially consistent, meaning that
-it cannot be reordered with earlier or later memory operations from the same
-hart.
-
-\begin{commentary}
-The AMOs were designed to implement the C11 and C++11 memory models
-efficiently. Although the FENCE R, RW instruction suffices to
-implement the {\em acquire} operation and FENCE RW, W suffices to
-implement {\em release}, both imply additional unnecessary ordering as
-compared to AMOs with the corresponding {\em aq} or {\em rl} bit set.
-\end{commentary}
-
-An example code sequence for a critical section guarded by a
-test-and-set spinlock is shown in Figure~\ref{critical}. Note the
-first AMO is marked {\em aq} to order the lock acquisition before the
-critical section, and the second AMO is marked {\em rl} to order
-the critical section before the lock relinquishment.
-
-\begin{figure}[h!]
-\begin{center}
-\begin{verbatim}
- li t0, 1 # Initialize swap value.
- again:
- amoswap.w.aq t0, t0, (a0) # Attempt to acquire lock.
- bnez t0, again # Retry if held.
- # ...
- # Critical section.
- # ...
- amoswap.w.rl x0, x0, (a0) # Release lock by storing 0.
-\end{verbatim}
-\end{center}
-\caption{Sample code for mutual exclusion. {\tt a0} contains the address of the lock.}
-\label{critical}
-\end{figure}
-
-\begin{commentary}
-We recommend the use of the AMO Swap idiom shown above for both lock
-acquire and release to simplify the implementation of speculative lock
-elision~\cite{Rajwar:2001:SLE}.
-\end{commentary}
-
-The instructions in the ``A'' extension can also be used to provide
-sequentially consistent loads and stores. A sequentially consistent load can
-be implemented as an LR with both {\em aq} and {\em rl} set. A sequentially
-consistent store can be implemented as an AMOSWAP that writes the old value to
-x0 and has both {\em aq} and {\em rl} set.