src/a-st-ext.adoc


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480

[[atomics]]
== "A" Extension for Atomic Instructions, Version 2.1

The atomic-instruction extension, named "A", contains
instructions that atomically read-modify-write memory to support
synchronization between multiple RISC-V harts running in the same memory
space. The two forms of atomic instruction provided are
load-reserved/store-conditional instructions and atomic fetch-and-op
memory instructions. Both types of atomic instruction support various
memory consistency orderings including unordered, acquire, release, and
sequentially consistent semantics. These instructions allow RISC-V to
support the RCsc memory consistency model. cite:[Gharachorloo90memoryconsistency]

[NOTE]
====
After much debate, the language community and architecture community
appear to have finally settled on release consistency as the standard
memory consistency model and so the RISC-V atomic support is built
around this model.
====

The A extension comprises instructions provided by the Zaamo and Zalrsc
extensions.

=== Specifying Ordering of Atomic Instructions

The base RISC-V ISA has a relaxed memory model, with the FENCE
instruction used to impose additional ordering constraints. The address
space is divided by the execution environment into memory and I/O
domains, and the FENCE instruction provides options to order accesses to
one or both of these two address domains.

To provide more efficient support for release consistency cite:[Gharachorloo90memoryconsistency], each atomic
instruction has two bits, _aq_ and _rl_, used to specify additional
memory ordering constraints as viewed by other RISC-V harts. The bits
order accesses to one of the two address domains, memory or I/O,
depending on which address domain the atomic instruction is accessing.
No ordering constraint is implied to accesses to the other domain, and a
FENCE instruction should be used to order across both domains.

If both bits are clear, no additional ordering constraints are imposed
on the atomic memory operation. If only the _aq_ bit is set, the atomic
memory operation is treated as an _acquire_ access, i.e., no following
memory operations on this RISC-V hart can be observed to take place
before the acquire memory operation. If only the _rl_ bit is set, the
atomic memory operation is treated as a _release_ access, i.e., the
release memory operation cannot be observed to take place before any
earlier memory operations on this RISC-V hart. If both the _aq_ and _rl_
bits are set, the atomic memory operation is _sequentially consistent_
and cannot be observed to happen before any earlier memory operations or
after any later memory operations in the same RISC-V hart and to the
same address domain.

[[sec:lrsc]]
=== "Zalrsc" Extension for Load-Reserved/Store-Conditional Instructions

include::images/wavedrom/load-reserve-st-conditional.adoc[]

Complex atomic memory operations on a single memory word or doubleword
are performed with the load-reserved (LR) and store-conditional (SC)
instructions. LR.W loads a word from the address in _rs1_, places the
sign-extended value in _rd_, and registers a _reservation set_—a set of
bytes that subsumes the bytes in the addressed word. SC.W conditionally
writes a word in _rs2_ to the address in _rs1_: the SC.W succeeds only
if the reservation is still valid and the reservation set contains the
bytes being written. If the SC.W succeeds, the instruction writes the
word in _rs2_ to memory, and it writes zero to _rd_. If the SC.W fails,
the instruction does not write to memory, and it writes a nonzero value
to _rd_.  For the purposes of memory protection, a failed SC.W may be
treated like a store.  Regardless of success or failure, executing an
SC.W instruction invalidates any reservation held by this hart. LR.D and
SC.D act analogously on doublewords and are only available on RV64. For
RV64, LR.W and SC.W sign-extend the value placed in _rd_.

[NOTE]
====
Both compare-and-swap (CAS) and LR/SC can be used to build lock-free
data structures. After extensive discussion, we opted for LR/SC for
several reasons: 1) CAS suffers from the ABA problem, which LR/SC avoids
because it monitors all writes to the address rather than only checking
for changes in the data value; 2) CAS would also require a new integer
instruction format to support three source operands (address, compare
value, swap value) as well as a different memory system message format,
which would complicate microarchitectures; 3) Furthermore, to avoid the
ABA problem, other systems provide a double-wide CAS (DW-CAS) to allow a
counter to be tested and incremented along with a data word. This
requires reading five registers and writing two in one instruction, and
also a new larger memory system message type, further complicating
implementations; 4) LR/SC provides a more efficient implementation of
many primitives as it only requires one load as opposed to two with CAS
(one load before the CAS instruction to obtain a value for speculative
computation, then a second load as part of the CAS instruction to check
if value is unchanged before updating).

The main disadvantage of LR/SC over CAS is livelock, which we avoid,
under certain circumstances, with an architected guarantee of eventual
forward progress as described below. Another concern is whether the
influence of the current x86 architecture, with its DW-CAS, will
complicate porting of synchronization libraries and other software that
assumes DW-CAS is the basic machine primitive. A possible mitigating
factor is the recent addition of transactional memory instructions to
x86, which might cause a move away from DW-CAS.

More generally, a multi-word atomic primitive is desirable, but there is
still considerable debate about what form this should take, and
guaranteeing forward progress adds complexity to a system.
====

The failure code with value 1 encodes an unspecified failure. Other
failure codes are reserved at this time. Portable software should only
assume the failure code will be non-zero.

[NOTE]
====
We reserve a failure code of 1 to mean ''unspecified'' so that simple
implementations may return this value using the existing mux required
for the SLT/SLTU instructions. More specific failure codes might be
defined in future versions or extensions to the ISA.
====

For LR and SC, the Zalrsc extension requires that the address held in _rs1_
be naturally aligned to the size of the operand (i.e., eight-byte
aligned for _doublewords_ and four-byte aligned for _words_). If the
address is not naturally aligned, an address-misaligned exception or an
access-fault exception will be generated. The access-fault exception can
be generated for a memory access that would otherwise be able to
complete except for the misalignment, if the misaligned access should
not be emulated.
[NOTE]
====
Emulating misaligned LR/SC sequences is impractical in most systems.

Misaligned LR/SC sequences also raise the possibility of accessing
multiple reservation sets at once, which present definitions do not
provide for.
====

An implementation can register an arbitrarily large reservation set on
each LR, provided the reservation set includes all bytes of the
addressed data word or doubleword. An SC can only pair with the most
recent LR in program order. An SC may succeed only if no store from
another hart to the reservation set can be observed to have occurred
between the LR and the SC, and if there is no other SC between the LR
and itself in program order. An SC may succeed only if no write from a
device other than a hart to the bytes accessed by the LR instruction can
be observed to have occurred between the LR and SC. Note this LR might
have had a different effective address and data size, but reserved the
SC's address as part of the reservation set.

[NOTE]
====
Following this model, in systems with memory translation, an SC is
allowed to succeed if the earlier LR reserved the same location using an
alias with a different virtual address, but is also allowed to fail if
the virtual address is different.

To accommodate legacy devices and buses, writes from devices other than
RISC-V harts are only required to invalidate reservations when they
overlap the bytes accessed by the LR. These writes are not required to
invalidate the reservation when they access other bytes in the
reservation set.
====

The SC must fail if the address is not within the reservation set of the
most recent LR in program order. The SC must fail if a store to the
reservation set from another hart can be observed to occur between the
LR and SC. The SC must fail if a write from some other device to the
bytes accessed by the LR can be observed to occur between the LR and SC.
(If such a device writes the reservation set but does not write the
bytes accessed by the LR, the SC may or may not fail.) An SC must fail
if there is another SC (to any address) between the LR and the SC in
program order. The precise statement of the atomicity requirements for
successful LR/SC sequences is defined by the Atomicity Axiom in
<<rvwmo>>.

[NOTE]
====
The platform should provide a means to determine the size and shape of
the reservation set.

A platform specification may constrain the size and shape of the
reservation set.

A store-conditional instruction to a scratch word of memory should be
used to forcibly invalidate any existing load reservation:

* during a preemptive context switch, and
* if necessary when changing virtual to physical address mappings, such
as when migrating pages that might contain an active reservation.

The invalidation of a hart's reservation when it executes an LR or SC
imply that a hart can only hold one reservation at a time, and that an
SC can only pair with the most recent LR, and LR with the next following
SC, in program order. This is a restriction to the Atomicity Axiom in
<<rvwmo>> that ensures software runs correctly on
expected common implementations that operate in this manner.
====

An SC instruction can never be observed by another RISC-V hart before
the LR instruction that established the reservation.

[NOTE]
====
The LR/SC sequence
can be given acquire semantics by setting the _aq_ bit on the LR
instruction. The LR/SC sequence can be given release semantics by
by setting the _rl_ bit on the SC instruction.  Assuming
suitable mappings for other atomic operations, setting the
_aq_ bit on the LR instruction, and setting the
_rl_ bit on the SC instruction makes the LR/SC
sequence sequentially consistent in the C\++ `memory_order_seq_cst`
sense. Such a sequence does not act as a fence for ordering ordinary
load and store instructions before and after the sequence. Specific
instruction mappings for other C++ atomic operations,
or stronger notions of "sequential consistency", may require both
bits to be set on either or both of the LR or SC instruction.

If neither bit is set on either LR or SC, the LR/SC sequence can be
observed to occur before or after surrounding memory operations from the
same RISC-V hart. This can be appropriate when the LR/SC sequence is
used to implement a parallel reduction operation.
====

Software should not set the _rl_ bit on an LR instruction unless the
_aq_ bit is also set, nor should software set the _aq_ bit on an SC
instruction unless the _rl_ bit is also set. LR._rl_ and SC._aq_
instructions are not guaranteed to provide any stronger ordering than
those with both bits clear, but may result in lower performance.

<<<

[[cas]]
[source,asm]
.Sample code for compare-and-swap function using LR/SC.
        # a0 holds address of memory location
        # a1 holds expected value
        # a2 holds desired value
        # a0 holds return value, 0 if successful, !0 otherwise
    cas:
        lr.w t0, (a0)        # Load original value.
        bne t0, a1, fail     # Doesn't match, so fail.
        sc.w t0, a2, (a0)    # Try to update.
        bnez t0, cas         # Retry if store-conditional failed.
        li a0, 0             # Set return to success.
        jr ra                # Return.
    fail:
        li a0, 1             # Set return to failure.
        jr ra                # Return.

LR/SC can be used to construct lock-free data structures. An example
using LR/SC to implement a compare-and-swap function is shown in
<<cas>>. If inlined, compare-and-swap functionality need only take four instructions.

[[sec:lrscseq]]
=== Eventual Success of Store-Conditional Instructions

The Zalrsc extension defines _constrained LR/SC loops_, which have
the following properties:

* The loop comprises only an LR/SC sequence and code to retry the
sequence in the case of failure, and must comprise at most 16
instructions placed sequentially in memory.
* An LR/SC sequence begins with an LR instruction and ends with an SC
instruction. The dynamic code executed between the LR and SC
instructions can only contain instructions from the base ''I''
instruction set, excluding loads, stores, backward jumps, taken backward
branches, JALR, FENCE, and SYSTEM instructions. If the ''C'' extension
is supported, then compressed forms of the aforementioned ''I''
instructions are also permitted.
* The code to retry a failing LR/SC sequence can contain backwards jumps
and/or branches to repeat the LR/SC sequence, but otherwise has the same
constraint as the code between the LR and SC.
* The LR and SC addresses must lie within a memory region with the
_LR/SC eventuality_ property. The execution environment is responsible
for communicating which regions have this property.
* The SC must be to the same effective address and of the same data size
as the latest LR executed by the same hart.

LR/SC sequences that do not lie within constrained LR/SC loops are
_unconstrained_. Unconstrained LR/SC sequences might succeed on some
attempts on some implementations, but might never succeed on other
implementations.

[NOTE]
====
We restricted the length of LR/SC loops to fit within 64 contiguous
instruction bytes in the base ISA to avoid undue restrictions on
instruction cache and TLB size and associativity. Similarly, we
disallowed other loads and stores within the loops to avoid restrictions
on data-cache associativity in simple implementations that track the
reservation within a private cache. The restrictions on branches and
jumps limit the time that can be spent in the sequence. Floating-point
operations and integer multiply/divide were disallowed to simplify the
operating system's emulation of these instructions on implementations
lacking appropriate hardware support.

Software is not forbidden from using unconstrained LR/SC sequences, but
portable software must detect the case that the sequence repeatedly
fails, then fall back to an alternate code sequence that does not rely
on an unconstrained LR/SC sequence. Implementations are permitted to
unconditionally fail any unconstrained LR/SC sequence.
====

If a hart _H_ enters a constrained LR/SC loop, the execution environment
must guarantee that one of the following events eventually occurs:

* _H_ or some other hart executes a successful SC to the reservation set
of the LR instruction in _H_'s constrained LR/SC loops.
* Some other hart executes an unconditional store or AMO instruction to
the reservation set of the LR instruction in _H_'s constrained LR/SC
loop, or some other device in the system writes to that reservation set.
* _H_ executes a branch or jump that exits the constrained LR/SC loop.
* _H_ traps.

[NOTE]
====
Note that these definitions permit an implementation to fail an SC
instruction occasionally for any reason, provided the aforementioned
guarantee is not violated.

As a consequence of the eventuality guarantee, if some harts in an
execution environment are executing constrained LR/SC loops, and no
other harts or devices in the execution environment execute an
unconditional store or AMO to that reservation set, then at least one
hart will eventually exit its constrained LR/SC loop. By contrast, if
other harts or devices continue to write to that reservation set, it is
not guaranteed that any hart will exit its LR/SC loop.

Loads and load-reserved instructions do not by themselves impede the
progress of other harts' LR/SC sequences. We note this constraint
implies, among other things, that loads and load-reserved instructions
executed by other harts (possibly within the same core) cannot impede
LR/SC progress indefinitely. For example, cache evictions caused by
another hart sharing the cache cannot impede LR/SC progress
indefinitely. Typically, this implies reservations are tracked
independently of evictions from any shared cache. Similarly, cache
misses caused by speculative execution within a hart cannot impede LR/SC
progress indefinitely.

These definitions admit the possibility that SC instructions may
spuriously fail for implementation reasons, provided progress is
eventually made.

One advantage of CAS is that it guarantees that some hart eventually
makes progress, whereas an LR/SC atomic sequence could livelock
indefinitely on some systems. To avoid this concern, we added an
architectural guarantee of livelock freedom for certain LR/SC sequences.

Earlier versions of this specification imposed a stronger
starvation-freedom guarantee. However, the weaker livelock-freedom
guarantee is sufficient to implement the C11 and C++11 languages, and is
substantially easier to provide in some microarchitectural styles.
====

[[sec:amo]]
=== "Zaamo" Extension for Atomic Memory Operations

include::images/wavedrom/atomic-mem.adoc[]

The atomic memory operation (AMO) instructions perform read-modify-write
operations for multiprocessor synchronization and are encoded with an
R-type instruction format. These AMO instructions atomically load a data
value from the address in _rs1_, place the value into register _rd_,
apply a binary operator to the loaded value and the original value in
_rs2_, then store the result back to the original address in _rs1_. AMOs
can either operate on _doublewords_ (RV64 only) or _words_ in memory. For
RV64, 32-bit AMOs always sign-extend the value placed in _rd_, and
ignore the upper 32 bits of the original value of _rs2_.

For AMOs, the Zaamo extension requires that the address held in _rs1_ be
naturally aligned to the size of the operand (i.e., eight-byte aligned
for _doublewords_ and four-byte aligned for _words_). If the address
is not naturally aligned, an address-misaligned exception or an
access-fault exception will be generated. The access-fault exception can
be generated for a memory access that would otherwise be able to
complete except for the misalignment, if the misaligned access should
not be emulated.

The misaligned atomicity granule PMA, defined in Volume II of this manual,
optionally relaxes this alignment requirement.
If present, the misaligned atomicity granule PMA specifies the size
of a misaligned atomicity granule, a power-of-two number of bytes.
The misaligned atomicity granule PMA applies only to AMOs, loads and stores
defined in the base ISAs, and loads and stores of no more than XLEN bits
defined in the F, D, and Q extensions.
For an instruction in that set, if all accessed bytes lie within the same
misaligned atomicity granule, the instruction will not raise an exception for
reasons of address alignment, and the instruction will give rise to only one
memory operation for the purposes of RVWMO--i.e., it will execute atomically.

The operations supported are swap, integer add, bitwise AND, bitwise OR,
bitwise XOR, and signed and unsigned integer maximum and minimum.
Without ordering constraints, these AMOs can be used to implement
parallel reduction operations, where typically the return value would be
discarded by writing to `x0`.

[NOTE]
====
We provided fetch-and-op style atomic primitives as they scale to highly
parallel systems better than LR/SC or CAS. A simple microarchitecture
can implement AMOs using the LR/SC primitives, provided the
implementation can guarantee the AMO eventually completes. More complex
implementations might also implement AMOs at memory controllers, and can
optimize away fetching the original value when the destination is `x0`.

The set of AMOs was chosen to support the C11/C++11 atomic memory
operations efficiently, and also to support parallel reductions in
memory. Another use of AMOs is to provide atomic updates to
memory-mapped device registers (e.g., setting, clearing, or toggling
bits) in the I/O space.

The Zaamo extension enables microcontroller class implementations to utilize
atomic primitives from the AMO subset of the A extension. Typically such
implementations do not have caches and thus may not be able to naturally support
the LR/SC instructions provided by the Zalrsc extension.
====

To help implement multiprocessor synchronization, the AMOs optionally
provide release consistency semantics. If the _aq_ bit is set, then no
later memory operations in this RISC-V hart can be observed to take
place before the AMO. Conversely, if the _rl_ bit is set, then other
RISC-V harts will not observe the AMO before memory accesses preceding
the AMO in this RISC-V hart. Setting both the _aq_ and the _rl_ bit on
an AMO makes the sequence sequentially consistent, meaning that it
cannot be reordered with earlier or later memory operations from the
same hart.

[NOTE]
====
The AMOs were designed to implement the C11 and C++11 memory models
efficiently. Although the FENCE R, RW instruction suffices to implement
the _acquire_ operation and FENCE RW, W suffices to implement _release_,
both imply additional unnecessary ordering as compared to AMOs with the
corresponding _aq_ or _rl_ bit set.
====

An example code sequence for a critical section guarded by a
test-and-test-and-set spinlock is shown in
Example <<critical>>. Note the first AMO is marked _aq_ to
order the lock acquisition before the critical section, and the second
AMO is marked _rl_ to order the critical section before the lock
relinquishment.

<<<

[[critical]]
[source,asm]
.Sample code for mutual exclusion. `a0` contains the address of the lock.
        li           t0, 1        # Initialize swap value.
    again:
        lw           t1, (a0)     # Check if lock is held.
        bnez         t1, again    # Retry if held.
        amoswap.w.aq t1, t0, (a0) # Attempt to acquire lock.
        bnez         t1, again    # Retry if held.
        # ...
        # Critical section.
        # ...
        amoswap.w.rl x0, x0, (a0) # Release lock by storing 0.

[NOTE]
====
We recommend the use of the AMO Swap idiom shown above for both lock
acquire and release to simplify the implementation of speculative lock
elision. cite:[Rajwar:2001:SLE]
====

[NOTE]
====
The instructions in the "A" extension can be used to provide sequentially
consistent loads and stores, but this constrains hardware
reordering of memory accesses more than necessary.
A C++ sequentially consistent load can be implemented as
an LR with _aq_ set. However, the LR/SC eventual
success guarantee may slow down concurrent loads from the same effective
address. A sequentially consistent store can be implemented as an AMOSWAP
that writes the old value to `x0` and has _rl_ set. However the superfluous
load may impose ordering constraints that are unnecessary for this use case.
Specific compilation conventions may require both the _aq_ and _rl_
bits to be set in either or both the LR and AMOSWAP instructions.
====