src/a.tex


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379

\chapter{``A'' Standard Extension for Atomic Instructions, Version 2.0}
\label{atomics}

The standard atomic instruction extension is denoted by instruction
subset name ``A'', and contains instructions that atomically
read-modify-write memory to support synchronization between multiple
RISC-V threads running in the same memory space.  The two forms of
atomic instruction provided are load-reserved/store-conditional
instructions and atomic fetch-and-op memory instructions.  Both types
of atomic instruction support various memory consistency orderings
including unordered, acquire, release, and sequentially consistent
semantics.  These instructions allow RISC-V to support the RCsc memory
consistency model~\cite{Gharachorloo90memoryconsistency}.

\begin{commentary}
After much debate, the language community and architecture community
appear to have finally settled on release consistency as the standard
memory consistency model and so the RISC-V atomic support is built
around this model.
\end{commentary}

\section{Specifying Ordering of Atomic Instructions}

The base RISC-V ISA has a relaxed memory model, with the FENCE
instruction used to impose additional ordering constraints.  The
address space is divided by the execution environment into memory and
I/O domains, and the FENCE instruction provides options to order
accesses to one or both of these two address domains.

To provide more efficient support for release
consistency~\cite{Gharachorloo90memoryconsistency}, each atomic
instruction has two bits, {\em aq} and {\em rl}, used to specify
additional memory ordering constraints as viewed by other RISC-V
threads.  The bits order accesses to one of the two address domains,
memory or I/O, depending on which address domain the atomic
instruction is accessing.  No ordering constraint is implied to
accesses to the other domain, and a FENCE instruction should be used
to order across both domains.

If both bits are clear, no additional ordering constraints are imposed
on the atomic memory operation.  If only the {\em aq} bit is set, the
atomic memory operation is treated as an {\em acquire} access, i.e.,
no following memory operations on this RISC-V thread can be observed
to take place before the acquire memory operation.  If only the {\em
  rl} bit is set, the atomic memory operation is treated as a {\em
  release} access, i.e., the release memory operation can not be
observed to take place before any earlier memory operations on this
RISC-V thread.  If both the {\em aq} and {\em rl} bits are set, the
atomic memory operation is {\em sequentially consistent} and cannot be
observed to happen before any earlier memory operations or after any
later memory operations in the same RISC-V thread, and can only be
observed by any other thread in the same global order of all
sequentially consistent atomic memory operations to the same address
domain.

\begin{commentary}
Theoretically, the definition of the {\em aq} and {\em rl} bits allows
for implementations without global store atomicity.  When both {\em
  aq} and {\em rl} bits are set, however, we require full sequential
consistency for the atomic operation which implies global store
atomicity in addition to both acquire and release semantics.  In
practice, hardware systems are usually implemented with global store
atomicity, embodied in local processor ordering rules together with
single-writer cache coherence protocols.
\end{commentary}

\section{Load-Reserved/Store-Conditional Instructions}

\vspace{-0.2in}
\begin{center}
\begin{tabular}{R@{}W@{}W@{}R@{}R@{}F@{}R@{}O}
\\
\instbitrange{31}{27} &
\instbit{26} &
\instbit{25} &
\instbitrange{24}{20} &
\instbitrange{19}{15} &
\instbitrange{14}{12} &
\instbitrange{11}{7} &
\instbitrange{6}{0} \\
\hline
\multicolumn{1}{|c|}{funct5} &
\multicolumn{1}{c|}{aq} &
\multicolumn{1}{c|}{rl} &
\multicolumn{1}{c|}{rs2} &
\multicolumn{1}{c|}{rs1} &
\multicolumn{1}{c|}{funct3} &
\multicolumn{1}{c|}{rd} &
\multicolumn{1}{c|}{opcode} \\
\hline
5 & 1 & 1 & 5 & 5 & 3 & 5 & 7 \\
LR & \multicolumn{2}{c}{ordering} & 0   & addr & width & dest & AMO    \\
SC & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO  \\
\end{tabular}
\end{center}

Complex atomic memory operations on a single memory word are performed
with the load-reserved (LR) and store-conditional (SC) instructions.
LR loads a word from the address in {\em rs1}, places the
sign-extended value in {\em rd}, and registers a reservation on the
memory address.  SC writes a word in {\em rs2} to the address in {\em
  rs1}, provided a valid reservation still exists on that address.  SC
writes zero to {\em rd} on success or a nonzero code on failure.

\begin{commentary}
Both compare-and-swap (CAS) and LR/SC can be used to build lock-free
data structures.  After extensive discussion, we opted for LR/SC for
several reasons: 1) CAS suffers from the ABA problem, which LR/SC
avoids because it monitors all accesses to the address rather than
only checking for changes in the data value; 2) CAS would also require
a new integer instruction format to support three source operands
(address, compare value, swap value) as well as a different memory
system message format, which would complicate microarchitectures; 3)
Furthermore, to avoid the ABA problem, other systems provide a
double-wide CAS (DW-CAS) to allow a counter to be tested and
incremented along with a data word. This requires reading five
registers and writing two in one instruction, and also a new larger
memory system message type, further complicating implementations; 4)
LR/SC provides a more efficient implementation of many primitives as
it only requires one load as opposed to two with CAS (one load before
the CAS instruction to obtain a value for speculative computation,
then a second load as part of the CAS instruction to check if value is
unchanged before updating).

The main disadvantage of LR/SC over CAS is livelock, which we avoid
with an architected guarantee of eventual forward progress as
described below.  Another concern is whether the influence of the
current x86 architecture, with its DW-CAS, will complicate porting of
synchronization libraries and other software that assumes DW-CAS is
the basic machine primitive.  A possible mitigating factor is the
recent addition of transactional memory instructions to x86, which
might cause a move away from DW-CAS.
\end{commentary}

The failure code with value 1 is reserved to encode an unspecified
failure.  Other failure codes are reserved at this time, and portable
software should only assume the failure code will be non-zero.  LR and
SC operate on naturally-aligned 64-bit (RV64 only) or 32-bit words in
memory.  Misaligned addresses will generate misaligned address
exceptions.

\begin{commentary}
We reserve a failure code of 1 to mean ``unspecified'' so that simple
implementations may return this value using the existing mux required
for the SLT/SLTU instructions.  More specific failure codes might be
defined in future versions or extensions to the ISA.
\end{commentary}

\label{lrscseq}

In the standard A extension, certain constrained LR/SC sequences are
guaranteed to succeed eventually.  The static code for the LR/SC
sequence plus the code to retry the sequence in case of failure must
comprise at most 16 integer instructions placed sequentially in
memory.  For the sequence to be guaranteed to eventually succeed, the
dynamic code executed between the LR and SC instructions can only
contain other instructions from the base ``I'' subset, excluding
loads, stores, backward jumps or taken backward branches, FENCE,
FENCE.I, and SYSTEM instructions.  The code to retry a failing LR/SC
sequence can contain backward jumps and/or branches to repeat the
LR/SC sequence, but otherwise has the same constraints.  The SC must
be to the same address as the latest LR executed.  LR/SC sequences
that do not meet these constraints might complete on some attempts on
some implementations, but there is no guarantee of eventual success.

\begin{commentary}
One advantage of CAS is that it guarantees that some thread eventually
makes progress, whereas an LR/SC atomic sequence could livelock
indefinitely on some systems.  To avoid this concern, we added an
architectural guarantee of forward progress to LR/SC atomic sequences.
The restrictions on LR/SC sequence contents allows an implementation
to capture a cache line on the LR and complete the LR/SC sequence by
holding off remote cache interventions for a bounded short
time. Interrupts and TLB misses might cause the reservation to be
lost, but eventually the atomic sequence can complete.  We restricted
the length of LR/SC sequences to fit within 64 contiguous instruction
bytes in the base ISA to avoid undue restrictions on instruction cache
and TLB size and associativity.  Similarly, we disallowed other loads
and stores within the sequences to avoid restrictions on data cache
associativity.  The restrictions on branches and jumps limits the time
that can be spent in the sequence.  Floating-point operations and
integer multiply/divide were disallowed to simplify the operating
system's emulation of these instructions on implementations lacking
appropriate hardware support.
\end{commentary}

An implementation can reserve an arbitrary subset of the memory space
on each LR and multiple LR reservations might be active simultaneously
for a single hart.  An SC can succeed if no accesses from other harts
to the address can be observed to have occurred between the SC and
the last LR in this hart to reserve the address.  Note this LR might
have had a different address argument, but reserved the SC's address
as part of the memory subset.  Following this model, in systems with
memory translation, an SC is allowed to succeed if the earlier LR
reserved the same location using an alias with a different virtual
address, but is also allowed to fail if the virtual address is
different.  The SC must fail if there is an observable memory access
from another hart to the address, or if there is an intervening
context switch on this hart, or if in the meantime the hart executed a
privileged exception-return instruction.

\begin{commentary}
The specification explicitly allows implementations to support more
powerful implementations with wider guarantees, provided they do not
void the atomicity guarantees for the constrained sequences.
\end{commentary}

LR/SC can be used to construct lock-free data structures.  An example
using LR/SC to implement a compare-and-swap function is shown in
Figure~\ref{cas}.  If inlined, compare-and-swap functionality need
only take three instructions.

\begin{figure}[h!]
\begin{center}
\begin{verbatim}
        # a0 holds address of memory location 
        # a1 holds expected value
        # a2 holds desired value
        # a0 holds return value, 0 if successful, !0 otherwise
    cas:
        lr.w t0, (a0)        # Load original value.
        bne t0, a1, fail     # Doesn't match, so fail.
        sc.w a0, a2, (a0)    # Try to update.
        jr ra                # Return.
    fail:
        li a0, 1             # Set return to failure.
        jr ra                # Return.
\end{verbatim}
\end{center}
\caption{Sample code for compare-and-swap function using LR/SC.}
\label{cas}
\end{figure}

An SC instruction can never be observed by another RISC-V thread
before the immediately preceding LR.  Due to the atomic nature of the
LR/SC sequence, no memory operations from any thread can be observed
to have occurred between the LR and a successful SC.  The LR/SC
sequence can be given acquire semantics by setting the {\em aq} bit on
the SC instruction.  The LR/SC sequence can be given release semantics
by setting the {\em rl} bit on the LR instruction.  Setting both {\em
  aq} and {\em rl} bits on the LR instruction, and setting the {\em
  aq} bit on the SC instruction makes the LR/SC sequence sequentially
consistent with respect to other sequentially consistent atomic
operations.

If neither bit is set on both LR and SC, the LR/SC sequence can be
observed to occur before or after surrounding memory operations from
the same RISC-V thread.  This can be appropriate when the LR/SC
sequence is used to implement a parallel reduction operation.

\begin{commentary}
In general, a multi-word atomic primitive is desirable but there is
still considerable debate about what form this should take, and
guaranteeing forward progress adds complexity to a system.  Our
current thoughts are to include a small limited-capacity transactional
memory buffer along the lines of the original transactional memory
proposals as an optional standard extension ``T''.
\end{commentary}

\section{Atomic Memory Operations}

\vspace{-0.2in}
\begin{center}
\begin{tabular}{O@{}W@{}W@{}R@{}R@{}F@{}R@{}R}
\\
\instbitrange{31}{27} &
\instbit{26} &
\instbit{25} &
\instbitrange{24}{20} &
\instbitrange{19}{15} &
\instbitrange{14}{12} &
\instbitrange{11}{7} &
\instbitrange{6}{0} \\
\hline
\multicolumn{1}{|c|}{funct5} &
\multicolumn{1}{c|}{aq} &
\multicolumn{1}{c|}{rl} &
\multicolumn{1}{c|}{rs2} &
\multicolumn{1}{c|}{rs1} &
\multicolumn{1}{c|}{funct3} &
\multicolumn{1}{c|}{rd} &
\multicolumn{1}{c|}{opcode} \\
\hline
5 & 1 & 1 & 5 & 5 & 3 & 5 & 7 \\
AMOSWAP.W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO  \\
AMOADD.W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO  \\
AMOAND.W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO  \\
AMOOR.W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO  \\
AMOXOR.W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO  \\
AMOMAX[U].W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO  \\
AMOMIN[U].W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO  \\
\end{tabular}
\end{center}

\vspace{-0.1in} The atomic memory operation (AMO) instructions perform
read-modify-write operations for multiprocessor synchronization and
are encoded with an R-type instruction format.  These AMO instructions
atomically load a data value from the address in {\em rs1}, place the
value into register {\em rd}, apply a binary operator to the loaded
value and the original value in {\em rs2}, then store the result back
to the address in {\em rs1}. AMOs can either operate on 64-bit (RV64
only) or 32-bit words in memory.  For RV64, 32-bit AMOs always
sign-extend the value placed in {\em rd}. The address held in {\em
  rs1} must be naturally aligned to the size of the operand (i.e.,
eight-byte aligned for 64-bit words and four-byte aligned for 32-bit
words).  If the address is not naturally aligned, a misaligned address
exception will be generated.

The operations supported are swap, integer add, logical AND, logical
OR, logical XOR, and signed and unsigned integer maximum and minimum.
Without ordering constraints, these AMOs can be used to implement
parallel reduction operations, where typically the return value would
be discarded by writing to {\tt x0}.

\begin{commentary}
We provided fetch-and-op style atomic primitives as they scale to
highly parallel systems better than LR/SC or CAS.  A simple
microarchitecture can implement AMOs using the LR/SC primitives.  More
complex implementations might also implement AMOs at memory
controllers, and can optimize away fetching the original value when
the destination is {\tt x0}.
\end{commentary}

To help implement multiprocessor synchronization, the AMOs optionally
provide release consistency semantics.  If the {\em aq} bit is set,
then no later memory operations in this RISC-V thread can be observed
to take place before the AMO.
Conversely, if the {\em rl} bit is set, then other
RISC-V threads will not observe the AMO before memory accesses
preceding the AMO in this RISC-V thread.

\begin{commentary}
The AMOs were designed to implement the C11 and C++11 memory models
efficiently.  Although the FENCE R, RW instruction suffices to
implement the {\em acquire} operation and FENCE RW, W suffices to
implement {\em release}, both imply additional unnecessary ordering as
compared to AMOs with the corresponding {\em aq} or {\em rl} bit set.
\end{commentary}

AMOs can also be used to provide sequentially consistent loads and
stores.  A sequentially consistent load can be implemented as an
AMOADD of x0 with both {\em aq} and {\em rl} set. A sequentially
consistent store can be implemented as an AMOSWAP that writes the old
value to x0 and has both {\em aq} and {\em rl} set.

An example code sequence for a critical section guarded by a
test-and-set spinlock is shown in Figure~\ref{critical}.  Note the
first AMO is marked {\em aq} to order the lock acquisition before the
critical section, and the second AMO is marked {\em rl} to order
the critical section before the lock relinquishment.

\begin{figure}[h!]
\begin{center}
\begin{verbatim}
        li           t0, 1        # Initialize swap value.
    again:
        amoswap.w.aq t0, t0, (a0) # Attempt to acquire lock.
        bnez         t0, again    # Retry if held.
        # ...
        # Critical section.
        # ...
        amoswap.w.rl x0, x0, (a0) # Release lock by storing 0.
\end{verbatim}
\end{center}
\caption{Sample code for mutual exclusion.  {\tt a0} contains the address of the lock.}
\label{critical}
\end{figure}

\begin{commentary}
We recommend the use of the AMO Swap idiom shown above for both lock
acquire and release to simplify the implementation of speculative lock
elision~\cite{Rajwar:2001:SLE}.

At the risk of complicating the implementation of atomic operations,
microarchitectures can elide the store within the acquire swap if the
lock value matches the swap value, to avoid dirtying a cache line held
in a shared or exclusive clean state.  The effect is similar to a
test-and-test-and-set lock but with shorter code paths.
\end{commentary}