\chapter{``A'' Standard Extension for Atomic Instructions, Version 2.0}
\label{atomics}

The standard atomic instruction extension is denoted by instruction
subset name ``A'', and contains instructions that atomically
read-modify-write memory to support synchronization between multiple
RISC-V harts running in the same memory space.  The two forms of
atomic instruction provided are load-reserved/store-conditional
instructions and atomic fetch-and-op memory instructions.  Both types
of atomic instruction support various memory consistency orderings
including unordered, acquire, release, and sequentially consistent
semantics.  These instructions allow RISC-V to support the RCsc memory
consistency model~\cite{Gharachorloo90memoryconsistency}.

\begin{commentary}
After much debate, the language community and architecture community
appear to have finally settled on release consistency as the standard
memory consistency model and so the RISC-V atomic support is built
around this model.
\end{commentary}

\section{Specifying Ordering of Atomic Instructions}

The base RISC-V ISA has a relaxed memory model, with the FENCE
instruction used to impose additional ordering constraints.  The
address space is divided by the execution environment into memory and
I/O domains, and the FENCE instruction provides options to order
accesses to one or both of these two address domains.

To provide more efficient support for release
consistency~\cite{Gharachorloo90memoryconsistency}, each atomic
instruction has two bits, {\em aq} and {\em rl}, used to specify
additional memory ordering constraints as viewed by other RISC-V
harts.  The bits order accesses to one of the two address domains,
memory or I/O, depending on which address domain the atomic
instruction is accessing.  No ordering constraint is implied to
accesses to the other domain, and a FENCE instruction should be used
to order across both domains.

If both bits are clear, no additional ordering constraints are imposed
on the atomic memory operation.  If only the {\em aq} bit is set, the
atomic memory operation is treated as an {\em acquire} access, i.e.,
no following memory operations on this RISC-V hart can be observed
to take place before the acquire memory operation.  If only the {\em
  rl} bit is set, the atomic memory operation is treated as a {\em
  release} access, i.e., the release memory operation cannot be
observed to take place before any earlier memory operations on this
RISC-V hart.  If both the {\em aq} and {\em rl} bits are set, the
atomic memory operation is {\em sequentially consistent} and cannot be
observed to happen before any earlier memory operations or after any
later memory operations in the same RISC-V hart and to the same
address domain.

\begin{commentary}
Theoretically, the definition of the {\em aq} and {\em rl} bits allows
for implementations without global store atomicity.  When both {\em
  aq} and {\em rl} bits are set, however, we require full sequential
consistency for the atomic operation which implies global store
atomicity in addition to both acquire and release semantics.  In
practice, hardware systems are usually implemented with global store
atomicity, embodied in local processor ordering rules together with
single-writer cache coherence protocols.
\end{commentary}

\section{Load-Reserved/Store-Conditional Instructions}
\label{sec:lrsc}

\vspace{-0.2in}
\begin{center}
\begin{tabular}{R@{}W@{}W@{}R@{}R@{}F@{}R@{}O}
\\
\instbitrange{31}{27} &
\instbit{26} &
\instbit{25} &
\instbitrange{24}{20} &
\instbitrange{19}{15} &
\instbitrange{14}{12} &
\instbitrange{11}{7} &
\instbitrange{6}{0} \\
\hline
\multicolumn{1}{|c|}{funct5} &
\multicolumn{1}{c|}{aq} &
\multicolumn{1}{c|}{rl} &
\multicolumn{1}{c|}{rs2} &
\multicolumn{1}{c|}{rs1} &
\multicolumn{1}{c|}{funct3} &
\multicolumn{1}{c|}{rd} &
\multicolumn{1}{c|}{opcode} \\
\hline
5 & 1 & 1 & 5 & 5 & 3 & 5 & 7 \\
LR & \multicolumn{2}{c}{ordering} & 0   & addr & width & dest & AMO    \\
SC & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO  \\
\end{tabular}
\end{center}

Complex atomic memory operations on a single memory word are performed
with the load-reserved (LR) and store-conditional (SC) instructions.
LR loads a word from the address in {\em rs1}, places the
sign-extended value in {\em rd}, and registers a reservation on the
memory address.  SC writes a word in {\em rs2} to the address in {\em
  rs1}, provided a valid reservation still exists on that address.  SC
writes zero to {\em rd} on success or a nonzero code on failure.

\begin{commentary}
Both compare-and-swap (CAS) and LR/SC can be used to build lock-free
data structures.  After extensive discussion, we opted for LR/SC for
several reasons: 1) CAS suffers from the ABA problem, which LR/SC
avoids because it monitors all accesses to the address rather than
only checking for changes in the data value; 2) CAS would also require
a new integer instruction format to support three source operands
(address, compare value, swap value) as well as a different memory
system message format, which would complicate microarchitectures; 3)
Furthermore, to avoid the ABA problem, other systems provide a
double-wide CAS (DW-CAS) to allow a counter to be tested and
incremented along with a data word. This requires reading five
registers and writing two in one instruction, and also a new larger
memory system message type, further complicating implementations; 4)
LR/SC provides a more efficient implementation of many primitives as
it only requires one load as opposed to two with CAS (one load before
the CAS instruction to obtain a value for speculative computation,
then a second load as part of the CAS instruction to check if value is
unchanged before updating).

The main disadvantage of LR/SC over CAS is livelock, which we avoid
with an architected guarantee of eventual forward progress as
described below.  Another concern is whether the influence of the
current x86 architecture, with its DW-CAS, will complicate porting of
synchronization libraries and other software that assumes DW-CAS is
the basic machine primitive.  A possible mitigating factor is the
recent addition of transactional memory instructions to x86, which
might cause a move away from DW-CAS.
\end{commentary}

The failure code with value 1 is reserved to encode an unspecified
failure.  Other failure codes are reserved at this time, and portable
software should only assume the failure code will be non-zero.

\begin{commentary}
We reserve a failure code of 1 to mean ``unspecified'' so that simple
implementations may return this value using the existing mux required
for the SLT/SLTU instructions.  More specific failure codes might be
defined in future versions or extensions to the ISA.
\end{commentary}

For LR and SC, the A extension requires that the address held in {\em rs1} be
naturally aligned to the size of the operand (i.e., eight-byte aligned for
64-bit words and four-byte aligned for 32-bit words).  If the address is
not naturally aligned, a misaligned address exception will be generated.

\label{lrscseq}

In the standard A extension, certain constrained LR/SC sequences are
guaranteed to succeed eventually.  The static code for the LR/SC
sequence plus the code to retry the sequence in case of failure must
comprise at most 16 integer instructions placed sequentially in
memory.  For the sequence to be guaranteed to eventually succeed, the
dynamic code executed between the LR and SC instructions can only
contain other instructions from the base ``I'' subset, excluding
loads, stores, backward jumps or taken backward branches, FENCE,
FENCE.I, and SYSTEM instructions.  The code to retry a failing LR/SC
sequence can contain backward jumps and/or branches to repeat the
LR/SC sequence, but otherwise has the same constraints.  The SC must
be to the same address and of the same data size as the latest LR
executed.  LR/SC sequences that do not meet these constraints might
complete on some attempts on some implementations, but there is no
guarantee of eventual success.

\begin{commentary}
One advantage of CAS is that it guarantees that some hart eventually
makes progress, whereas an LR/SC atomic sequence could livelock
indefinitely on some systems.  To avoid this concern, we added an
architectural guarantee of forward progress to LR/SC atomic sequences.
The restrictions on LR/SC sequence contents allows an implementation
to capture a cache line on the LR and complete the LR/SC sequence by
holding off remote cache interventions for a bounded short
time. Interrupts and TLB misses might cause the reservation to be
lost, but eventually the atomic sequence can complete.  We restricted
the length of LR/SC sequences to fit within 64 contiguous instruction
bytes in the base ISA to avoid undue restrictions on instruction cache
and TLB size and associativity.  Similarly, we disallowed other loads
and stores within the sequences to avoid restrictions on data cache
associativity.  The restrictions on branches and jumps limits the time
that can be spent in the sequence.  Floating-point operations and
integer multiply/divide were disallowed to simplify the operating
system's emulation of these instructions on implementations lacking
appropriate hardware support.

Although software is not forbidden from using LR/SC sequences that do not meet
the forward-progress constraints, portable software must detect the case that
the sequence repeatedly fails, then fall back to an alternate code sequence
that does not run afoul of the forward-progress constraints.
Implementations are permitted to simply always fail any LR/SC sequences that do
not meet the forward-progress guarantees.
\end{commentary}

Certain LR/SC sequences that do not meet the forward progress
guarantees may succeed on some implementations.  An implementation can
reserve an arbitrary subset of the memory space on each LR.  An SC may
succeed if no store from another hart to the address range reserved by
the LR can be observed to have occurred between the LR and the SC, and
if there is no other SC between the LR and itself in program order.
Note this LR might have had a different address argument, but reserved
the SC's address as part of the memory subset.  Following this model,
in systems with memory translation, an SC is allowed to succeed if the
earlier LR reserved the same location using an alias with a different
virtual address, but is also allowed to fail if the virtual address is
different.  The SC must fail if a store from another hart to the
address range reserved by the LR can be observed to occur between the
LR and the SC.  The precise statement of the atomicity requirements for
successful LR/SC sequences is defined by the Atomicity Axiom in
Section~\ref{sec:rvwmo}.

In the standard A extension, an SC must fail if there is another SC (to
any address) between the LR and the SC in program order.  In other
words, multiple outstanding reservations are not permitted.

\begin{commentary}
A store-conditional instruction to a scratch word of memory should be used
during a preemptive context switch to forcibly yield any existing load
reservation.
\end{commentary}

\begin{commentary}
The specification explicitly allows implementations to support more
powerful implementations with wider guarantees, provided they do not
void the atomicity guarantees for the constrained sequences.
\end{commentary}

LR/SC can be used to construct lock-free data structures.  An example
using LR/SC to implement a compare-and-swap function is shown in
Figure~\ref{cas}.  If inlined, compare-and-swap functionality need
only take four instructions.

\begin{figure}[h!]
\begin{center}
\begin{verbatim}
        # a0 holds address of memory location 
        # a1 holds expected value
        # a2 holds desired value
        # a0 holds return value, 0 if successful, !0 otherwise
    cas:
        lr.w t0, (a0)        # Load original value.
        bne t0, a1, fail     # Doesn't match, so fail.
        sc.w a0, a2, (a0)    # Try to update.
        bnez a0, cas         # Retry if store-conditional failed.
        jr ra                # Return.
    fail:
        li a0, 1             # Set return to failure.
        jr ra                # Return.
\end{verbatim}
\end{center}
\caption{Sample code for compare-and-swap function using LR/SC.}
\label{cas}
\end{figure}

An SC instruction can never be observed by another RISC-V hart
before the immediately preceding LR.  Due to the atomic nature of the
LR/SC sequence, no memory operations from another hart can be observed
to have occurred between the LR and a successful SC.  The LR/SC
sequence can be given acquire semantics by setting the {\em aq} bit on
the LR instruction.  The LR/SC sequence can be given release semantics
by setting the {\em rl} bit on the SC instruction.  Setting the {\em
  aq} bit on the LR instruction, and setting both the {\em aq} and the {\em
  rl} bit on the SC instruction makes the LR/SC sequence sequentially
consistent, meaning that it cannot be reordered with earlier or
later memory operations from the same hart.

The {\em rl} bit on an LR instruction must not be set unless the {\em aq} bit is also set.
The {\em aq} bit on an SC instruction must not be set unless the {\em rl} bit is also set.

If neither bit is set on both LR and SC, the LR/SC sequence can be
observed to occur before or after surrounding memory operations from
the same RISC-V hart.  This can be appropriate when the LR/SC
sequence is used to implement a parallel reduction operation.

\begin{commentary}
In general, a multi-word atomic primitive is desirable but there is
still considerable debate about what form this should take, and
guaranteeing forward progress adds complexity to a system.  Our
current thoughts are to include a small limited-capacity transactional
memory buffer along the lines of the original transactional memory
proposals as an optional standard extension ``T''.
\end{commentary}

\section{Atomic Memory Operations}
\label{sec:amo}

\vspace{-0.2in}
\begin{center}
\begin{tabular}{O@{}W@{}W@{}R@{}R@{}F@{}R@{}R}
\\
\instbitrange{31}{27} &
\instbit{26} &
\instbit{25} &
\instbitrange{24}{20} &
\instbitrange{19}{15} &
\instbitrange{14}{12} &
\instbitrange{11}{7} &
\instbitrange{6}{0} \\
\hline
\multicolumn{1}{|c|}{funct5} &
\multicolumn{1}{c|}{aq} &
\multicolumn{1}{c|}{rl} &
\multicolumn{1}{c|}{rs2} &
\multicolumn{1}{c|}{rs1} &
\multicolumn{1}{c|}{funct3} &
\multicolumn{1}{c|}{rd} &
\multicolumn{1}{c|}{opcode} \\
\hline
5 & 1 & 1 & 5 & 5 & 3 & 5 & 7 \\
AMOSWAP.W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO  \\
AMOADD.W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO  \\
AMOAND.W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO  \\
AMOOR.W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO  \\
AMOXOR.W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO  \\
AMOMAX[U].W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO  \\
AMOMIN[U].W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO  \\
\end{tabular}
\end{center}

\vspace{-0.1in} The atomic memory operation (AMO) instructions perform
read-modify-write operations for multiprocessor synchronization and
are encoded with an R-type instruction format.  These AMO instructions
atomically load a data value from the address in {\em rs1}, place the
value into register {\em rd}, apply a binary operator to the loaded
value and the original value in {\em rs2}, then store the result back
to the address in {\em rs1}. AMOs can either operate on 64-bit (RV64
only) or 32-bit words in memory.  For RV64, 32-bit AMOs always
sign-extend the value placed in {\em rd}.

For AMOs, the A extension requires that the address held in {\em rs1} be
naturally aligned to the size of the operand (i.e., eight-byte aligned for
64-bit words and four-byte aligned for 32-bit words).  If the address is
not naturally aligned, a misaligned address exception will be generated.
The ``Zam'' extension, described in Chapter~\ref{sec:zam}, relaxes this
requirement and specifies the semantics of misaligned AMOs.

The operations supported are swap, integer add, logical AND, logical
OR, logical XOR, and signed and unsigned integer maximum and minimum.
Without ordering constraints, these AMOs can be used to implement
parallel reduction operations, where typically the return value would
be discarded by writing to {\tt x0}.

\begin{commentary}
We provided fetch-and-op style atomic primitives as they scale to
highly parallel systems better than LR/SC or CAS.  A simple
microarchitecture can implement AMOs using the LR/SC primitives.  More
complex implementations might also implement AMOs at memory
controllers, and can optimize away fetching the original value when
the destination is {\tt x0}.

The set of AMOs was chosen to support the C11/C++11 atomic memory
operations efficiently, and also to support parallel reductions in
memory.  Another use of AMOs is to provide atomic updates to
memory-mapped device registers (e.g., setting, clearing, or toggling
bits) in the I/O space.
\end{commentary}

To help implement multiprocessor synchronization, the AMOs optionally
provide release consistency semantics.  If the {\em aq} bit is set,
then no later memory operations in this RISC-V hart can be observed
to take place before the AMO.
Conversely, if the {\em rl} bit is set, then other
RISC-V harts will not observe the AMO before memory accesses
preceding the AMO in this RISC-V hart.  Setting both the {\em aq} and the {\em
rl} bit on an AMO makes the sequence sequentially consistent, meaning that
it cannot be reordered with earlier or later memory operations from the same
hart.

\begin{commentary}
The AMOs were designed to implement the C11 and C++11 memory models
efficiently.  Although the FENCE R, RW instruction suffices to
implement the {\em acquire} operation and FENCE RW, W suffices to
implement {\em release}, both imply additional unnecessary ordering as
compared to AMOs with the corresponding {\em aq} or {\em rl} bit set.
\end{commentary}

An example code sequence for a critical section guarded by a
test-and-set spinlock is shown in Figure~\ref{critical}.  Note the
first AMO is marked {\em aq} to order the lock acquisition before the
critical section, and the second AMO is marked {\em rl} to order
the critical section before the lock relinquishment.

\begin{figure}[h!]
\begin{center}
\begin{verbatim}
        li           t0, 1        # Initialize swap value.
    again:
        amoswap.w.aq t0, t0, (a0) # Attempt to acquire lock.
        bnez         t0, again    # Retry if held.
        # ...
        # Critical section.
        # ...
        amoswap.w.rl x0, x0, (a0) # Release lock by storing 0.
\end{verbatim}
\end{center}
\caption{Sample code for mutual exclusion.  {\tt a0} contains the address of the lock.}
\label{critical}
\end{figure}

\begin{commentary}
We recommend the use of the AMO Swap idiom shown above for both lock
acquire and release to simplify the implementation of speculative lock
elision~\cite{Rajwar:2001:SLE}.
\end{commentary}

The instructions in the ``A'' extension can also be used to provide
sequentially consistent loads and stores.  A sequentially consistent load can
be implemented as an LR with both {\em aq} and {\em rl} set. A sequentially
consistent store can be implemented as an AMOSWAP that writes the old value to
x0 and has both {\em aq} and {\em rl} set.