src/a.tex


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404

\chapter{``A'' Standard Extension for Atomic Instructions, Version 2.0}
\label{atomics}

The standard atomic instruction extension is denoted by instruction
subset name ``A'', and contains instructions that atomically
read-modify-write memory to support synchronization between multiple
RISC-V harts running in the same memory space.  The two forms of
atomic instruction provided are load-reserved/store-conditional
instructions and atomic fetch-and-op memory instructions.  Both types
of atomic instruction support various memory consistency orderings
including unordered, acquire, release, and sequentially consistent
semantics.  These instructions allow RISC-V to support the RCsc memory
consistency model~\cite{Gharachorloo90memoryconsistency}.

\begin{commentary}
After much debate, the language community and architecture community
appear to have finally settled on release consistency as the standard
memory consistency model and so the RISC-V atomic support is built
around this model.
\end{commentary}

\section{Specifying Ordering of Atomic Instructions}

The base RISC-V ISA has a relaxed memory model, with the FENCE
instruction used to impose additional ordering constraints.  The
address space is divided by the execution environment into memory and
I/O domains, and the FENCE instruction provides options to order
accesses to one or both of these two address domains.

To provide more efficient support for release
consistency~\cite{Gharachorloo90memoryconsistency}, each atomic
instruction has two bits, {\em aq} and {\em rl}, used to specify
additional memory ordering constraints as viewed by other RISC-V
harts.  The bits order accesses to one of the two address domains,
memory or I/O, depending on which address domain the atomic
instruction is accessing.  No ordering constraint is implied to
accesses to the other domain, and a FENCE instruction should be used
to order across both domains.

If both bits are clear, no additional ordering constraints are imposed
on the atomic memory operation.  If only the {\em aq} bit is set, the
atomic memory operation is treated as an {\em acquire} access, i.e.,
no following memory operations on this RISC-V hart can be observed
to take place before the acquire memory operation.  If only the {\em
  rl} bit is set, the atomic memory operation is treated as a {\em
  release} access, i.e., the release memory operation cannot be
observed to take place before any earlier memory operations on this
RISC-V hart.  If both the {\em aq} and {\em rl} bits are set, the
atomic memory operation is {\em sequentially consistent} and cannot be
observed to happen before any earlier memory operations or after any
later memory operations in the same RISC-V hart and to the same
address domain.

\section{Load-Reserved/Store-Conditional Instructions}
\label{sec:lrsc}

\vspace{-0.2in}
\begin{center}
\begin{tabular}{R@{}W@{}W@{}R@{}R@{}F@{}R@{}O}
\\
\instbitrange{31}{27} &
\instbit{26} &
\instbit{25} &
\instbitrange{24}{20} &
\instbitrange{19}{15} &
\instbitrange{14}{12} &
\instbitrange{11}{7} &
\instbitrange{6}{0} \\
\hline
\multicolumn{1}{|c|}{funct5} &
\multicolumn{1}{c|}{aq} &
\multicolumn{1}{c|}{rl} &
\multicolumn{1}{c|}{rs2} &
\multicolumn{1}{c|}{rs1} &
\multicolumn{1}{c|}{funct3} &
\multicolumn{1}{c|}{rd} &
\multicolumn{1}{c|}{opcode} \\
\hline
5 & 1 & 1 & 5 & 5 & 3 & 5 & 7 \\
LR & \multicolumn{2}{c}{ordering} & 0   & addr & width & dest & AMO    \\
SC & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO  \\
\end{tabular}
\end{center}

Complex atomic memory operations on a single memory word are performed
with the load-reserved (LR) and store-conditional (SC) instructions.
LR loads a word from the address in {\em rs1}, places the
sign-extended value in {\em rd}, and registers a reservation on the
memory address and a range of bytes including at least all bytes of
the addressed word.  SC writes a word in {\em rs2} to the address in
{\em rs1}, provided a valid reservation still exists on that address.
SC writes zero to {\em rd} on success or a nonzero code on failure.

\begin{commentary}
Both compare-and-swap (CAS) and LR/SC can be used to build lock-free
data structures.  After extensive discussion, we opted for LR/SC for
several reasons: 1) CAS suffers from the ABA problem, which LR/SC
avoids because it monitors all accesses to the address rather than
only checking for changes in the data value; 2) CAS would also require
a new integer instruction format to support three source operands
(address, compare value, swap value) as well as a different memory
system message format, which would complicate microarchitectures; 3)
Furthermore, to avoid the ABA problem, other systems provide a
double-wide CAS (DW-CAS) to allow a counter to be tested and
incremented along with a data word. This requires reading five
registers and writing two in one instruction, and also a new larger
memory system message type, further complicating implementations; 4)
LR/SC provides a more efficient implementation of many primitives as
it only requires one load as opposed to two with CAS (one load before
the CAS instruction to obtain a value for speculative computation,
then a second load as part of the CAS instruction to check if value is
unchanged before updating).

The main disadvantage of LR/SC over CAS is livelock, which we avoid
with an architected guarantee of eventual forward progress as
described below.  Another concern is whether the influence of the
current x86 architecture, with its DW-CAS, will complicate porting of
synchronization libraries and other software that assumes DW-CAS is
the basic machine primitive.  A possible mitigating factor is the
recent addition of transactional memory instructions to x86, which
might cause a move away from DW-CAS.
\end{commentary}

The failure code with value 1 is reserved to encode an unspecified
failure.  Other failure codes are reserved at this time, and portable
software should only assume the failure code will be non-zero.

\begin{commentary}
We reserve a failure code of 1 to mean ``unspecified'' so that simple
implementations may return this value using the existing mux required
for the SLT/SLTU instructions.  More specific failure codes might be
defined in future versions or extensions to the ISA.
\end{commentary}

For LR and SC, the A extension requires that the address held in {\em
  rs1} be naturally aligned to the size of the operand (i.e.,
eight-byte aligned for 64-bit words and four-byte aligned for 32-bit
words).  If the address is not naturally aligned, a misaligned address
exception or an access exception will be generated.  The access
exception can be generated for a memory access that would otherwise be
able to complete except for the misalignment, if the misaligned access
should not be emulated.

\label{lrscseq}

In the standard A extension, certain constrained LR/SC sequences are
guaranteed to succeed eventually.  The static code for the LR/SC
sequence plus the code to retry the sequence in case of failure must
comprise at most 16 integer instructions placed sequentially in
memory.  For the sequence to be guaranteed to eventually succeed, the
dynamic code executed between the LR and SC instructions can only
contain other instructions from the base ``I'' subset, excluding
loads, stores, backward jumps or taken backward branches, FENCE,
FENCE.I, and SYSTEM instructions.  The code to retry a failing LR/SC
sequence can contain backward jumps and/or branches to repeat the
LR/SC sequence, but otherwise has the same constraints.  The SC must
be to the same address and of the same data size as the latest LR
executed.  The execution environment can limit the instruction and
data memory regions within which forward progress is guaranteed.
LR/SC sequences that do not meet all these constraints might complete on
some attempts on some implementations, but there is no guarantee of
eventual success.

\begin{commentary}
One advantage of CAS is that it guarantees that some hart eventually
makes progress, whereas an LR/SC atomic sequence could livelock
indefinitely on some systems.  To avoid this concern, we added an
architectural guarantee of forward progress to LR/SC atomic sequences.
The restrictions on LR/SC sequence contents allows an implementation
to capture a cache line on the LR and complete the LR/SC sequence by
holding off remote cache interventions for a bounded short
time. Interrupts and TLB misses might cause the reservation to be
lost, but eventually the atomic sequence can complete.  We restricted
the length of LR/SC sequences to fit within 64 contiguous instruction
bytes in the base ISA to avoid undue restrictions on instruction cache
and TLB size and associativity.  Similarly, we disallowed other loads
and stores within the sequences to avoid restrictions on data-cache
associativity.  The restrictions on branches and jumps limits the time
that can be spent in the sequence.  Floating-point operations and
integer multiply/divide were disallowed to simplify the operating
system's emulation of these instructions on implementations lacking
appropriate hardware support.

Although software is not forbidden from using LR/SC sequences that do not meet
the forward-progress constraints, portable software must detect the case that
the sequence repeatedly fails, then fall back to an alternate code sequence
that does not run afoul of the forward-progress constraints.
Implementations are permitted to simply always fail any LR/SC sequence that does
not meet the forward-progress guarantee.
\end{commentary}

An implementation can reserve an arbitrarily large subset of the
address space on each LR, provided the memory range includes all bytes
of the addressed data word.
An SC can only pair with the most recent LR in program order.  An SC
may succeed if no store from another hart to the address range
reserved by the LR can be observed to have occurred between the LR and
the SC, and if there is no other SC between the LR and itself in
program order.  Note this LR might have had a different address
argument, but reserved the SC's address as part of the memory subset.
Following this model, in systems with memory translation, an SC is
allowed to succeed if the earlier LR reserved the same location using
an alias with a different virtual address, but is also allowed to fail
if the virtual address is different.  The SC must fail if a store from
another hart to the address range reserved by the LR can be observed
to occur between the LR and the SC.  An SC must fail if there is
another SC (to any address) between the LR and the SC in program
order.  The precise statement of the atomicity requirements for
successful LR/SC sequences is defined by the Atomicity Axiom in
Section~\ref{sec:rvwmo}.

\begin{commentary}
A store-conditional instruction to a scratch word of memory should be used
during a preemptive context switch to forcibly yield any existing load
reservation.
\end{commentary}

LR/SC can be used to construct lock-free data structures.  An example
using LR/SC to implement a compare-and-swap function is shown in
Figure~\ref{cas}.  If inlined, compare-and-swap functionality need
only take four instructions.

\begin{figure}[h!]
\begin{center}
\begin{verbatim}
        # a0 holds address of memory location 
        # a1 holds expected value
        # a2 holds desired value
        # a0 holds return value, 0 if successful, !0 otherwise
    cas:
        lr.w t0, (a0)        # Load original value.
        bne t0, a1, fail     # Doesn't match, so fail.
        sc.w a0, a2, (a0)    # Try to update.
        bnez a0, cas         # Retry if store-conditional failed.
        jr ra                # Return.
    fail:
        li a0, 1             # Set return to failure.
        jr ra                # Return.
\end{verbatim}
\end{center}
\caption{Sample code for compare-and-swap function using LR/SC.}
\label{cas}
\end{figure}

An SC instruction can never be observed by another RISC-V hart
before the immediately preceding LR.  Due to the atomic nature of the
LR/SC sequence, no memory operations from another hart can be observed
to have occurred between the LR and a successful SC.  The LR/SC
sequence can be given acquire semantics by setting the {\em aq} bit on
the LR instruction.  The LR/SC sequence can be given release semantics
by setting the {\em rl} bit on the SC instruction.  Setting the {\em
  aq} bit on the LR instruction, and setting both the {\em aq} and the {\em
  rl} bit on the SC instruction makes the LR/SC sequence sequentially
consistent, meaning that it cannot be reordered with earlier or
later memory operations from the same hart.

The {\em rl} bit on an LR instruction must not be set unless the {\em aq} bit is also set.
The {\em aq} bit on an SC instruction must not be set unless the {\em rl} bit is also set.

If neither bit is set on both LR and SC, the LR/SC sequence can be
observed to occur before or after surrounding memory operations from
the same RISC-V hart.  This can be appropriate when the LR/SC
sequence is used to implement a parallel reduction operation.

\begin{commentary}
In general, a multi-word atomic primitive is desirable but there is
still considerable debate about what form this should take, and
guaranteeing forward progress adds complexity to a system.  Our
current thoughts are to include a small limited-capacity transactional
memory buffer along the lines of the original transactional memory
proposals as an optional standard extension ``T''.
\end{commentary}

\section{Atomic Memory Operations}
\label{sec:amo}

\vspace{-0.2in}
\begin{center}
\begin{tabular}{O@{}W@{}W@{}R@{}R@{}F@{}R@{}R}
\\
\instbitrange{31}{27} &
\instbit{26} &
\instbit{25} &
\instbitrange{24}{20} &
\instbitrange{19}{15} &
\instbitrange{14}{12} &
\instbitrange{11}{7} &
\instbitrange{6}{0} \\
\hline
\multicolumn{1}{|c|}{funct5} &
\multicolumn{1}{c|}{aq} &
\multicolumn{1}{c|}{rl} &
\multicolumn{1}{c|}{rs2} &
\multicolumn{1}{c|}{rs1} &
\multicolumn{1}{c|}{funct3} &
\multicolumn{1}{c|}{rd} &
\multicolumn{1}{c|}{opcode} \\
\hline
5 & 1 & 1 & 5 & 5 & 3 & 5 & 7 \\
AMOSWAP.W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO  \\
AMOADD.W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO  \\
AMOAND.W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO  \\
AMOOR.W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO  \\
AMOXOR.W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO  \\
AMOMAX[U].W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO  \\
AMOMIN[U].W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO  \\
\end{tabular}
\end{center}

\vspace{-0.1in} The atomic memory operation (AMO) instructions perform
read-modify-write operations for multiprocessor synchronization and
are encoded with an R-type instruction format.  These AMO instructions
atomically load a data value from the address in {\em rs1}, place the
value into register {\em rd}, apply a binary operator to the loaded
value and the original value in {\em rs2}, then store the result back
to the address in {\em rs1}. AMOs can either operate on 64-bit (RV64
only) or 32-bit words in memory.  For RV64, 32-bit AMOs always
sign-extend the value placed in {\em rd}.

For AMOs, the A extension requires that the address held in {\em rs1}
be naturally aligned to the size of the operand (i.e., eight-byte
aligned for 64-bit words and four-byte aligned for 32-bit words).  If
the address is not naturally aligned, a misaligned address exception
or an access exception will be generated.  The access exception can be
generated for a memory access that would otherwise be able to complete
except for the misalignment, if the misaligned access should not be
emulated.  The ``Zam'' extension, described in Chapter~\ref{sec:zam},
relaxes this requirement and specifies the semantics of misaligned
AMOs.

The operations supported are swap, integer add, bitwise AND, bitwise
OR, bitwise XOR, and signed and unsigned integer maximum and minimum.
Without ordering constraints, these AMOs can be used to implement
parallel reduction operations, where typically the return value would
be discarded by writing to {\tt x0}.

\begin{commentary}
We provided fetch-and-op style atomic primitives as they scale to
highly parallel systems better than LR/SC or CAS.  A simple
microarchitecture can implement AMOs using the LR/SC primitives.  More
complex implementations might also implement AMOs at memory
controllers, and can optimize away fetching the original value when
the destination is {\tt x0}.

The set of AMOs was chosen to support the C11/C++11 atomic memory
operations efficiently, and also to support parallel reductions in
memory.  Another use of AMOs is to provide atomic updates to
memory-mapped device registers (e.g., setting, clearing, or toggling
bits) in the I/O space.
\end{commentary}

To help implement multiprocessor synchronization, the AMOs optionally
provide release consistency semantics.  If the {\em aq} bit is set,
then no later memory operations in this RISC-V hart can be observed
to take place before the AMO.
Conversely, if the {\em rl} bit is set, then other
RISC-V harts will not observe the AMO before memory accesses
preceding the AMO in this RISC-V hart.  Setting both the {\em aq} and the {\em
rl} bit on an AMO makes the sequence sequentially consistent, meaning that
it cannot be reordered with earlier or later memory operations from the same
hart.

\begin{commentary}
The AMOs were designed to implement the C11 and C++11 memory models
efficiently.  Although the FENCE R, RW instruction suffices to
implement the {\em acquire} operation and FENCE RW, W suffices to
implement {\em release}, both imply additional unnecessary ordering as
compared to AMOs with the corresponding {\em aq} or {\em rl} bit set.
\end{commentary}

An example code sequence for a critical section guarded by a
test-and-set spinlock is shown in Figure~\ref{critical}.  Note the
first AMO is marked {\em aq} to order the lock acquisition before the
critical section, and the second AMO is marked {\em rl} to order
the critical section before the lock relinquishment.

\begin{figure}[h!]
\begin{center}
\begin{verbatim}
        li           t0, 1        # Initialize swap value.
    again:
        amoswap.w.aq t0, t0, (a0) # Attempt to acquire lock.
        bnez         t0, again    # Retry if held.
        # ...
        # Critical section.
        # ...
        amoswap.w.rl x0, x0, (a0) # Release lock by storing 0.
\end{verbatim}
\end{center}
\caption{Sample code for mutual exclusion.  {\tt a0} contains the address of the lock.}
\label{critical}
\end{figure}

\begin{commentary}
We recommend the use of the AMO Swap idiom shown above for both lock
acquire and release to simplify the implementation of speculative lock
elision~\cite{Rajwar:2001:SLE}.
\end{commentary}

The instructions in the ``A'' extension can also be used to provide
sequentially consistent loads and stores.  A sequentially consistent load can
be implemented as an LR with both {\em aq} and {\em rl} set. A sequentially
consistent store can be implemented as an AMOSWAP that writes the old value to
x0 and has both {\em aq} and {\em rl} set.