[appendix]
[[mm-explain]]
== RVWMO Explanatory Material, Version 0.1

This section provides more explanation of RVWMO
(<<memorymodel>>), using more informal
language and concrete examples. These are intended to clarify the
meaning and intent of the axioms and preserved program order rules. This
appendix should be treated as commentary; all normative material is
provided in <<memorymodel>> and in the rest of
the main body of the ISA specification. All currently known
discrepancies are listed in <<discrepancies>>. Any
other discrepancies are unintentional.

[[whyrvwmo]]
=== Why RVWMO?

Memory consistency models fall along a loose spectrum from weak to
strong. Weak memory models give hardware implementations more
flexibility and deliver arguably better performance, performance per
watt, power efficiency, and scalability, along with lower hardware
verification overhead, than strong models, at the expense of a more
complex programming model.
Strong models provide simpler programming models, but at the cost of
imposing more restrictions on the kinds of (non-speculative) hardware
optimizations that can be performed in the pipeline and in the memory
system, and in turn imposing some cost in terms of power, area overhead,
and verification burden.

RISC-V has chosen the RVWMO memory model, a variant of release
consistency. This places it in between the two extremes of the memory
model spectrum. The RVWMO memory model enables architects to build
simple implementations, aggressive implementations, implementations
embedded deeply inside a much larger system and subject to complex
memory system interactions, or any number of other possibilities, all
while simultaneously being strong enough to support programming language
memory models at high performance.

To facilitate the porting of code from other architectures, some
hardware implementations may choose to implement the Ztso extension,
which provides stricter RVTSO ordering semantics by default. Code
written for RVWMO is automatically and inherently compatible with RVTSO,
but code written assuming RVTSO is not guaranteed to run correctly on
RVWMO implementations. In fact, most RVWMO implementations will (and
should) simply refuse to run RVTSO-only binaries. Each implementation
must therefore choose whether to prioritize compatibility with RVTSO
code (e.g., to facilitate porting from x86) or whether to instead
prioritize compatibility with other RISC-V cores implementing RVWMO.

Some fences and/or memory ordering annotations in code written for RVWMO
may become redundant under RVTSO; the cost that the default of RVWMO
imposes on Ztso implementations is the incremental overhead of fetching
those fences (e.g., FENCE R,RW and FENCE RW,W) which become no-ops on
that implementation. However, these fences must remain present in the
code if compatibility with non-Ztso implementations is desired.
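
For example, the following message-passing fragment (a non-normative
sketch) is written for RVWMO. On a Ztso implementation, both fences are
redundant and behave as no-ops, but they must remain in the code for
the fragment to stay correct on non-Ztso implementations:

[source,asm]
----
# Hart 0 (producer)
sw    t0, 0(s0)      # write the payload
fence rw, w          # release-style fence; a no-op under Ztso
sw    t1, 0(s1)      # publish the flag

# Hart 1 (consumer)
lw    a0, 0(s1)      # read the flag
fence r, rw          # acquire-style fence; a no-op under Ztso
lw    a1, 0(s0)      # read the payload
----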

[[litmustests]]
=== Litmus Tests

The explanations in this chapter make use of _litmus tests_, or small
programs designed to test or highlight one particular aspect of a memory
model. <<litmus-sample>> shows an example
of a litmus test with two harts. As a convention for this figure and for
all figures that follow in this chapter, we assume that `s0-s2` are
pre-set to the same value in all harts and that `s0` holds the address
labeled `x`, `s1` holds `y`, and `s2` holds `z`, where `x`, `y`, and `z`
are disjoint memory locations aligned to 8-byte boundaries. All other registers and all referenced memory locations are presumed to be initialized to zero. Each figure
shows the litmus test code on the left, and a visualization of one
particular valid or invalid execution on the right.

[[litmus-sample, Litmus sample]]
[float="center",align="center",cols="1a,.^1a",frame="none",grid="none",options="noheader"]
.A sample litmus test and one forbidden execution (`a0=1`).
|===
|
[.left]
[%autowidth,float="center",align="center",cols="^,<,^,<",options="header"]
!===
2+!Hart 0 2+!Hart 1 
! !&#8942; ! !&#8942;
! !li t1,1 ! !li t4,4
!(a) !sw t1,0(s0) !(e) !sw t4,0(s0)
! !&#8942; ! !&#8942;
! !li t2,2 ! !
!(b) !sw t2,0(s0) ! !
! !&#8942; ! !&#8942;
!(c) !lw a0,0(s0) ! !
! !&#8942; ! !&#8942;
! !li t3,3 ! !li t5,5
!(d) !sw t3,0(s0) !(f) !sw t5,0(s0)
! !&#8942; ! !&#8942;
!===
|
!===
//a! graphviz::images/graphviz/litmus_sample.txt[]
a! image::graphviz/litmus_sample.png[]
!===
|===

Litmus tests are used to understand the implications of the memory model
in specific concrete situations. For example, in the litmus test of
<<litmus-sample>>, the final value of `a0`
in the first hart can be either 2, 4, or 5, depending on the dynamic
interleaving of the instruction stream from each hart at runtime.
However, in this example, the final value of `a0` in Hart 0 will never
be 1 or 3; intuitively, the value 1 will no longer be visible at the
time the load executes, and the value 3 will not yet be visible by the
time the load executes. We analyze this test and many others below.

<<<
[[litmus-key]]
.A key for the litmus test diagrams drawn in this appendix
[%autowidth,cols="<,<",align="center",float="center",options="header",]
|===
|Edge |Full Name (and explanation)
|rf |Reads From (from each store to the loads that return a value
written by that store)

|co |Coherence (a total order on the stores to each address)

|fr |From-Reads (from each load to co-successors of the store from which
the load returned a value)

|ppo |Preserved Program Order

|fence |Orderings enforced by a FENCE instruction

|addr |Address Dependency

|ctrl |Control Dependency

|data |Data Dependency
|===

The diagram shown to the right of each litmus test shows a visual
representation of the particular execution candidate being considered.
These diagrams use a notation that is common in the memory model
literature for constraining the set of possible global memory orders
that could produce the execution in question. It is also the basis for
the _herd_ models presented in
<<sec:herd>>. This notation is explained in
<<litmus-key>>. Of the listed relations, rf edges between
harts, co edges, fr edges, and ppo edges directly constrain the global
memory order (as do fence, addr, data, and some ctrl edges, via ppo).
Other edges (such as intra-hart rf edges) are informative but do not
constrain the global memory order.

For example, in <<litmus-sample>>, `a0=1`
could occur only if one of the following were true:

* (b) appears before (a) in global memory order (and in the
coherence order co). However, this violates RVWMO PPO
<<overlapping-ordering, rule 1>>. The co edge from (b) to (a) highlights this
contradiction.
* (a) appears before (b) in global memory order (and in the
coherence order co). However, in this case, the Load Value Axiom would
be violated, because (a) is not the latest matching store prior to (c)
in program order. The fr edge from (c) to (b) highlights this
contradiction.

Since neither of these scenarios satisfies the RVWMO axioms, the outcome
`a0=1` is forbidden.

Beyond what is described in this appendix, a suite of more than seven
thousand litmus tests is available at
https://github.com/litmus-tests/litmus-tests-riscv.

[NOTE]
====
The litmus tests repository also provides instructions on how to run the
litmus tests on RISC-V hardware and how to compare the results with the
operational and axiomatic models.

In the future, we expect to adapt these memory model litmus tests for
use as part of the RISC-V compliance test suite as well.
====

=== Explaining the RVWMO Rules

In this section, we provide explanation and examples for all of the
RVWMO rules and axioms.

==== Preserved Program Order and Global Memory Order

Preserved program order represents the subset of program order that must
be respected within the global memory order. Conceptually, events from
the same hart that are ordered by preserved program order must appear in
that order from the perspective of other harts and/or observers. Events
from the same hart that are not ordered by preserved program order, on
the other hand, may appear reordered from the perspective of other harts
and/or observers.

Informally, the global memory order represents the order in which loads
and stores perform. The formal memory model literature has moved away
from specifications built around the concept of performing, but the idea
is still useful for building up informal intuition. A load is said to
have performed when its return value is determined. A store is said to
have performed not when it has executed inside the pipeline, but rather
only when its value has been propagated to globally visible memory. In
this sense, the global memory order also represents the contribution of
the coherence protocol and/or the rest of the memory system to
interleave the (possibly reordered) memory accesses being issued by each
hart into a single total order agreed upon by all harts.

The order in which two loads perform does not always directly correspond
to the relative age of the values those loads return. In particular, a
load _b_ may perform before another load _a_ to
the same address (i.e., _b_ may execute before
_a_, and _b_ may appear before _a_
in the global memory order), but _a_ may nevertheless return
an older value than _b_. This discrepancy captures (among
other things) the reordering effects of buffering placed between the
core and memory. For example, _b_ may have returned a value
from a store in the store buffer, while _a_ may have ignored
that younger store and read an older value from memory instead. To
account for this, at the time each load performs, the value it returns
is determined by the load value axiom, not just strictly by determining
the most recent store to the same address in the global memory order, as
described below.

[[loadvalueaxiom, Load value axiom]]
==== Load value axiom

[IMPORTANT]
====
<<ax-load>>: Each byte of each load _i_ returns the value written
to that byte by the store that is the latest in global memory order among
the following stores:

. Stores that write that byte and that precede _i_ in the global memory
order
. Stores that write that byte and that precede _i_ in program order
====

Preserved program order is _not_ required to respect the ordering of a
store followed by a load to an overlapping address. This complexity
arises due to the ubiquity of store buffers in nearly all
implementations. Informally, the load may perform (return a value) by
forwarding from the store while the store is still in the store buffer,
and hence before the store itself performs (writes back to globally
visible memory). Any other hart will therefore observe the load as
performing before the store.

Consider <<litms_sb_forward>>. When running this program on an implementation with
store buffers, it is possible to arrive at the final outcome `a0=1`, `a1=0`, `a2=1`, `a3=0` as follows:

[[litms_sb_forward]]
.A store buffer forwarding litmus test (outcome permitted)
[float="center",align="center",cols=".^1a,.^1a",frame="none",grid="none",options="noheader"]
|===
|
[%autowidth,float="center",align="center",cols="^,<,^,<",options="header",align="center"]
!===
2+^!Hart 0 2+^!Hart 1
2+^!li t1, 1 2+^!li t1, 1
2+<!(a) sw t1,0(s0) 2+!(e) sw t1,0(s1)
2+<!(b) lw a0,0(s0) 2+!(f) lw a2,0(s1)
2+<!(c) fence r,r 2+!(g) fence r,r
2+<!(d) lw a1,0(s1) 2+!(h) lw a3,0(s0)
4+^!Outcome: `a0=1`, `a1=0`, `a2=1`, `a3=0`
!===
|
!===
//a! graphviz::images/graphviz/litmus_sb_fwd.txt[]
a! image::graphviz/litmus_sb_fwd.png[]
!===
|===

* (a) executes and enters the first hart's private store buffer
* (b) executes and forwards its return value 1 from (a) in the
store buffer
* (c) executes since all previous loads (i.e., (b)) have
completed
* (d) executes and reads the value 0 from memory
* (e) executes and enters the second hart's private store buffer
* (f) executes and forwards its return value 1 from (e) in the
store buffer
* (g) executes since all previous loads (i.e., (f)) have
completed
* (h) executes and reads the value 0 from memory
* (a) drains from the first hart's store buffer to memory
* (e) drains from the second hart's store buffer to memory

Therefore, the memory model must be able to account for this behavior.

To put it another way, suppose the definition of preserved program order
did include the following hypothetical rule: memory access
_a_ precedes memory access _b_ in preserved
program order (and hence also in the global memory order) if
_a_ precedes _b_ in program order and
_a_ and _b_ are accesses to the same memory
location, _a_ is a write, and _b_ is a read.
Call this "Rule X". Then we get the following:

* (a) precedes (b): by rule X
* (b) precedes (d): by rule <<overlapping-ordering, 4>>
* (d) precedes (e): by the load value axiom. Otherwise, if (e)
preceded (d), then (d) would be required to return the value 1. (This is
a perfectly legal execution; it's just not the one in question)
* (e) precedes (f): by rule X
* (f) precedes (h): by rule <<overlapping-ordering, 4>>
* (h) precedes (a): by the load value axiom, as above.

The global memory order must be a total order and cannot be cyclic,
because a cycle would imply that every event in the cycle happens before
itself, which is impossible. Therefore, the execution proposed above
would be forbidden, and hence the addition of rule X would forbid
implementations with store buffer forwarding, which would clearly be
undesirable.

Nevertheless, even if (b) precedes (a) and/or (f) precedes (e) in the
global memory order, the only sensible possibility in this example is
for (b) to return the value written by (a), and likewise for (f) and
(e). This combination of circumstances is what leads to the second
option in the definition of the load value axiom. Even though (b)
precedes (a) in the global memory order, (a) will still be visible to
(b) by virtue of sitting in the store buffer at the time (b) executes.
Therefore, even if (b) precedes (a) in the global memory order, (b)
should return the value written by (a) because (a) precedes (b) in
program order. Likewise for (e) and (f).

[[litmus_ppoca]]
.The "PPOCA" store buffer forwarding litmus test (outcome permitted)
[float="center",align="center",cols=".^1a,.^1a",frame="none",grid="none",options="noheader"]
|===
|
[%autowidth,cols="^,<,^,<",options="header",float="center",align="center"]
!===
2+^!Hart 0 2+^!Hart 1
! !li t1, 1 !!li t1, 1
!(a) !sw t1,0(s0) !!LOOP:
!(b) !fence w,w !(d) !lw a0,0(s1)
!(c) !sw t1,0(s1) !!beqz a0, LOOP
2+! !(e) !sw t1,0(s2)
2+! !(f) !lw a1,0(s2)
2+! ! !xor a2,a1,a1
2+! ! !add s0,s0,a2
2+! !(g) !lw a2,0(s0)
4+!Outcome: `a0=1`, `a1=1`, `a2=0`
!===
|
!===
//a! graphviz::images/graphviz/litmus_ppoca.txt[]
a! image::graphviz/litmus_ppoca.png[]
!===
|===

Another test that highlights the behavior of store buffers is shown in
<<litmus_ppoca>>. In this example, (d) is
ordered before (e) because of the control dependency, and (f) is ordered
before (g) because of the address dependency. However, (e) is _not_
necessarily ordered before (f), even though (f) returns the value
written by (e). This could correspond to the following sequence of
events:

* (e) executes speculatively and enters the second hart's private
store buffer (but does not drain to memory)
* (f) executes speculatively and forwards its return value 1 from
(e) in the store buffer
* (g) executes speculatively and reads the value 0 from memory
* (a) executes, enters the first hart's private store buffer, and
drains to memory
* (b) executes and retires
* (c) executes, enters the first hart's private store buffer, and
drains to memory
* (d) executes and reads the value 1 from memory
* (e), (f), and (g) commit, since the speculation turned out to be
correct
* (e) drains from the store buffer to memory

[[atomicityaxiom]]
==== Atomicity axiom

[IMPORTANT]
====
<<ax-atom, Atomicity Axiom>> (for Aligned Atomics): If _r_ and _w_ are paired load and
store operations generated by aligned LR and SC instructions in a hart
_h_, _s_ is a store to byte _x_, and _r_ returns a value written by _s_, then _s_ must
precede _w_ in the global memory order, and there can be no store from
a hart other than _h_ to byte _x_ following _s_ and preceding _w_ in the global
memory order.
====

The RISC-V architecture decouples the notion of atomicity from the
notion of ordering. Unlike architectures such as TSO, RISC-V atomics
under RVWMO do not impose any ordering requirements by default. Ordering
semantics are only guaranteed by the PPO rules that otherwise apply.

RISC-V contains two types of atomics: AMOs and LR/SC pairs. These
conceptually behave differently, in the following way. LR/SC behave as
if the old value is brought up to the core, modified, and written back
to memory, all while a reservation is held on that memory location. AMOs
on the other hand conceptually behave as if they are performed directly
in memory. AMOs are therefore inherently atomic, while LR/SC pairs are
atomic in the slightly different sense that the memory location in
question will not be modified by another hart during the time the
original hart holds the reservation.
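
As a non-normative sketch of this difference, both fragments below
atomically increment the word at `0(a0)`. The AMO performs the
read-modify-write as a single memory operation, while the LR/SC version
holds a reservation across the sequence and retries if the reservation
is lost:

[source,asm]
----
# Atomic increment using a single AMO
li       t0, 1
amoadd.w x0, t0, (a0)    # rd=x0 discards the old value

# Atomic increment using an LR/SC retry loop
li       t0, 1
retry:
lr.w     t1, 0(a0)       # load old value and acquire a reservation
add      t2, t1, t0      # compute the incremented value
sc.w     t3, t2, 0(a0)   # store only if the reservation is still held
bnez     t3, retry       # t3 != 0 means the SC failed; retry
----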

[[litmus_lrsdsc]]
.In all four (independent) instances, the final store-conditional instruction is permitted but not guaranteed to succeed
[frame=none]
|====
|(a) lr.d a0, 0(s0) |(a) lr.d a0, 0(s0) |(a) lr.w a0, 0(s0) |(a) lr.w a0, 0(s0)

|(b) sd t1, 0(s0) |(b) sw t1, 4(s0) |(b) sw t1, 4(s0) |(b) sw t1, 4(s0)

|(c) sc.d t3, t2, 0(s0) |(c) sc.d t3, t2, 0(s0) |(c) sc.w t3, t2, 0(s0) |(c) addi s0, s0, 8

| | | |(d) sc.w t3, t2, -8(s0)
|====

The atomicity axiom forbids stores from other harts from being
interleaved in global memory order between an LR and the SC paired with
that LR. The atomicity axiom does not forbid loads from being
interleaved between the paired operations in program order or in the
global memory order, nor does it forbid stores from the same hart or
stores to non-overlapping locations from appearing between the paired
operations in either program order or in the global memory order. For
example, the SC instructions in <<litmus_lrsdsc>> may (but are not
guaranteed to) succeed. None of those successes would violate the
atomicity axiom, because the intervening non-conditional stores are from
the same hart as the paired load-reserved and store-conditional
instructions. This way, a memory system that tracks memory accesses at
cache line granularity (and which therefore will see the four snippets
of <<litmus_lrsdsc>> as identical) will not
be forced to fail a store-conditional instruction that happens to
(falsely) share another portion of the same cache line as the memory
location being held by the reservation.

The atomicity axiom also technically supports cases in which the LR and
SC touch different addresses and/or use different access sizes; however,
use cases for such behaviors are expected to be rare in practice.
Likewise, scenarios in which stores from the same hart between an LR/SC
pair actually overlap the memory location(s) referenced by the LR or SC
are expected to be rare compared to scenarios where the intervening
store may simply fall onto the same cache line.

[[mm-progress]]
==== Progress axiom

[IMPORTANT]
====
<<ax-prog, Progress Axiom>>: No memory operation may be preceded in the global
memory order by an infinite sequence of other memory operations.
====

The progress axiom ensures a minimal forward progress guarantee. It
ensures that stores from one hart will eventually be made visible to
other harts in the system in a finite amount of time, and that loads
from other harts will eventually be able to read those values (or
successors thereof). Without this rule, it would be legal, for example,
for a spinlock to spin infinitely on a value, even with a store from
another hart waiting to unlock the spinlock.
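
As a minimal sketch, without the progress axiom the loop below could
legally spin forever even after the other hart performs its store; the
axiom guarantees that the store eventually becomes visible to the
spinning hart:

[source,asm]
----
# Hart 0: release the lock
sw   x0, 0(a0)      # store 0 to the lock word

# Hart 1: spin until the lock is observed to be free
spin:
lw   t0, 0(a0)      # must eventually return 0 (or a successor value)
bnez t0, spin
----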

The progress axiom is intended not to impose any other notion of
fairness, latency, or quality of service onto the harts in a RISC-V
implementation. Any stronger notions of fairness are up to the rest of
the ISA and/or up to the platform and/or device to define and implement.

The forward progress axiom will in almost all cases be naturally
satisfied by any standard cache coherence protocol. Implementations with
non-coherent caches may have to provide some other mechanism to ensure
the eventual visibility of all stores (or successors thereof) to all
harts.

[[mm-overlap]]
==== Overlapping-Address Orderings (<<overlapping-ordering, Rules 1-3>>)

[IMPORTANT]
====
<<overlapping-ordering, Rule 1>>: b is a store, and a and b access overlapping memory addresses

<<overlapping-ordering, Rule 2>>: a and b are loads, x is a byte read by both a and b, there is no
store to x between a and b in program order, and a and b return values
for x written by different memory operations

<<overlapping-ordering, Rule 3>>: a is generated by an AMO or SC instruction, b is a load, and b
returns a value written by a
====

Same-address orderings where the latter is a store are straightforward:
a load or store can never be reordered with a later store to an
overlapping memory location. From a microarchitecture perspective,
generally speaking, it is difficult or impossible to undo a
speculatively reordered store if the speculation turns out to be
invalid, so such behavior is simply disallowed by the model.
Same-address orderings from a store to a later load, on the other hand,
do not need to be enforced. As discussed in
<<loadvalueaxiom>>, this reflects the observable
behavior of implementations that forward values from buffered stores to
later loads.

Same-address load-load ordering requirements are far more subtle. The
basic requirement is that a younger load must not return a value that is
older than a value returned by an older load in the same hart to the
same address. This is often known as "CoRR" (Coherence for Read-Read
pairs), or as part of a broader "coherence" or "sequential
consistency per location" requirement. Some architectures in the past
have relaxed same-address load-load ordering, but in hindsight this is
generally considered to complicate the programming model too much, and
so RVWMO requires CoRR ordering to be enforced. However, because the
global memory order corresponds to the order in which loads perform
rather than the ordering of the values being returned, capturing CoRR
requirements in terms of the global memory order requires a bit of
indirection.

[[frirfi]]
.Litmus test MP+fence.w.w+fri-rfi-addr (outcome permitted)
[float="center",align="center",cols=".^1a,.^1a",frame="none",grid="none",options="noheader"]
|===
|
[%autowidth,cols="^,<,^,<",options="header",float="center",align="center"]
!===
2+!Hart 0 2+^!Hart 1
!!li t1, 1 !!li t2, 2
>!(a) !sw t1,0(s0) >!(d) !lw a0,0(s1)
>!(b) !fence w, w >!(e) !sw t2,0(s1)
>!(c) !sw t1,0(s1) >!(f) !lw a1,0(s1)
! ! >!(g) !xor t3,a1,a1
! ! >!(h) !add s0,s0,t3
! ! >!(i) !lw a2,0(s0)
4+^!Outcome: `a0=1`, `a1=2`, `a2=0`
!===
|
!===
//a! graphviz::images/graphviz/litmus_mp_fenceww_fri_rfi_addr.txt[]
a! image::graphviz/litmus_mp_fenceww_fri_rfi_addr.png[]
!===
|===

Consider the litmus test of <<frirfi>>, which is one particular
instance of the more general "fri-rfi" pattern. The term "fri-rfi"
refers to the sequence (d), (e), (f): (d) "from-reads" (i.e., reads
from an earlier write than) (e), which is in the same hart, and (f)
reads from (e), which is in the same hart.

From a microarchitectural perspective, outcome `a0=1`, `a1=2`, `a2=0` is
legal (as are various other less subtle outcomes). Intuitively, the
following would produce the outcome in question:

* (d) stalls (for whatever reason; perhaps it's stalled waiting
for some other preceding instruction)
* (e) executes and enters the store buffer (but does not yet
drain to memory)
* (f) executes and forwards from (e) in the store buffer
* (g), (h), and (i) execute
* (a) executes and drains to memory, (b) executes, and (c)
executes and drains to memory
* (d) unstalls and executes
* (e) drains from the store buffer to memory

This corresponds to a global memory order of (f), (i), (a), (c), (d),
(e). Note that even though (f) performs before (d), the value returned
by (f) is newer than the value returned by (d). Therefore, this
execution is legal and does not violate the CoRR requirements.

Likewise, if two back-to-back loads return the values written by the
same store, then they may also appear out-of-order in the global memory
order without violating CoRR. Note that this is not the same as saying
that the two loads return the same value, since two different stores may
write the same value.

[[litmus-rsw]]
.Litmus test RSW (outcome permitted)
[float="center",align="center",cols=".^1a,.^1a",frame="none",grid="none",options="noheader"]
|===
|
[%autowidth,cols="^,<,^,<",options="header",float="center",align="center"]
!===
2+!Hart 0 2+^!Hart 1
2+!li t1, 1 >!(d) <!lw  a0,0(s1)
>!(a) <!sw t1,0(s0) >!(e) !xor t2,a0,a0
>!(b) <!fence w, w >!(f) !add s4,s2,t2
>!(c) <!sw t1,0(s1) >!(g) !lw  a1,0(s4)
! ! >!(h) !lw  a2,0(s2)
! ! >!(i) !xor t3,a2,a2
! ! >!(j) !add s0,s0,t3
! ! >!(k) !lw  a3,0(s0)
4+!Outcome: `a0=1`, `a1=v`, `a2=v`, `a3=0`
!===
|
!===
//a! graphviz::images/graphviz/litmus_rsw.txt[]
a! image::graphviz/litmus_rsw.png[]
!===
|===

Consider the litmus test of <<litmus-rsw>>.
The outcome `a0=1`, `a1=v`, `a2=v`, `a3=0` (where _v_ is
some value written by another hart) can be observed by allowing (g) and
(h) to be reordered. This might be done speculatively, and the
speculation can be justified by the microarchitecture (e.g., by snooping
for cache invalidations and finding none) because replaying (h) after
(g) would return the value written by the same store anyway. Hence,
since `a1` and `a2` end up with the same value written by the
same store either way, (g) and (h) can be legally reordered. The global
memory order corresponding to this execution would be
(h),(k),(a),(c),(d),(g).

Executions of the test in <<litmus-rsw>> in
which `a1` does not equal `a2` do in fact require that (g) appears
before (h) in the global memory order. Allowing (h) to appear before (g)
in the global memory order would in that case result in a violation of
CoRR, because then (h) would return an older value than that returned by
(g). Therefore, <<overlapping-ordering, rule 2>> forbids this CoRR violation
from occurring. As such, <<overlapping-ordering, rule 2>> strikes a careful
balance between enforcing CoRR in all cases while simultaneously being
weak enough to permit "RSW" and "fri-rfi" patterns that commonly
appear in real microarchitectures.

There is one more overlapping-address rule: <<overlapping-ordering, 
rule 3>> simply states that a value cannot
be returned from an AMO or SC to a subsequent load until the AMO or SC
has (in the case of the SC, successfully) performed globally. This
follows somewhat naturally from the conceptual view that both AMOs and
SC instructions are meant to be performed atomically in memory. However,
notably, <<overlapping-ordering, rule 3>> states that hardware
may not even non-speculatively forward the value being stored by an
AMOSWAP to a subsequent load, even though for AMOSWAP that store value
is not actually semantically dependent on the previous value in memory,
as is the case for the other AMOs. The same holds true even when
forwarding from SC store values that are not semantically dependent on
the value returned by the paired LR.
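
A short sketch of <<overlapping-ordering, rule 3>>: even though the
value stored by the AMOSWAP below does not depend on the old contents
of memory, the load may not return that value until the AMO has
performed globally:

[source,asm]
----
amoswap.w a0, t0, (s0)   # (a) atomically writes t0 to 0(s0)
lw        a1, 0(s0)      # (b) may return t0 only after (a) performs globally
----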

The three PPO rules above also apply when the memory accesses in
question only overlap partially. This can occur, for example, when
accesses of different sizes are used to access the same object. Note
also that the base addresses of two overlapping memory operations need
not necessarily be the same for two memory accesses to overlap. When
misaligned memory accesses are being used, the overlapping-address PPO
rules apply to each of the component memory accesses independently.
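
For example (a non-normative sketch), the two accesses below use
different effective addresses and different sizes, yet they overlap in
bytes 2-3 of the word at `s0`, and so <<overlapping-ordering, rule 1>>
orders the load before the later store:

[source,asm]
----
lh t0, 2(s0)   # (a) loads bytes 2-3 of the word at s0
sw t1, 0(s0)   # (b) stores bytes 0-3; overlaps (a), so rule 1 applies
----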

[[mm-fence]]
==== Fences (<<overlapping-ordering, Rule 4>>)

[IMPORTANT]
====
Rule <<overlapping-ordering, 4>>: There is a FENCE instruction that orders a before b
====

By default, the FENCE instruction ensures that all memory accesses from
instructions preceding the fence in program order (the "predecessor
set") appear earlier in the global memory order than memory accesses
from instructions appearing after the fence in program order (the
"successor set"). However, fences can optionally further restrict the
predecessor set and/or the successor set to a smaller set of memory
accesses in order to provide some speedup. Specifically, fences have PR,
PW, SR, and SW bits which restrict the predecessor and/or successor
sets. The predecessor set includes loads (resp. stores) if and only if PR
(resp. PW) is set. Similarly, the successor set includes loads
(resp. stores) if and only if SR (resp. SW) is set.
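
For example (a sketch), a FENCE W,W includes only stores in both sets.
It orders the two stores below, but it places no constraint on the
load:

[source,asm]
----
sw    t0, 0(s0)   # in the predecessor set (PW is set)
lw    a0, 0(s2)   # in neither set; unconstrained by this fence
fence w, w        # PR=0, PW=1, SR=0, SW=1
sw    t1, 0(s1)   # in the successor set (SW is set)
----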

The FENCE encoding currently has nine non-trivial combinations of the
four bits PR, PW, SR, and SW, plus one extra encoding FENCE.TSO which
facilitates mapping of "acquire+release" or RVTSO semantics. The
remaining seven combinations have empty predecessor and/or successor
sets and hence are no-ops. Of the ten non-trivial options, only six are
commonly used in practice:

* FENCE RW,RW
* FENCE.TSO
* FENCE RW,W
* FENCE R,RW
* FENCE R,R
* FENCE W,W

FENCE instructions using any other combination of PR, PW, SR, and SW are
reserved. We strongly recommend that programmers stick to these six.
Other combinations may have unknown or unexpected interactions with the
memory model.

Finally, we note that since RISC-V uses a multi-copy atomic memory
model, programmers can reason about fence bits in a thread-local
manner. There is no complex notion of "fence cumulativity" as found in
memory models that are not multi-copy atomic.

[[sec:memory:acqrel]]
==== Explicit Synchronization (<<overlapping-ordering, Rules 5-8>>)

[IMPORTANT]
====
<<overlapping-ordering, Rule 5>>: a has an acquire annotation

<<overlapping-ordering, Rule 6>>: b has a release annotation

<<overlapping-ordering, Rule 7>>: a and b both have RCsc annotations

<<overlapping-ordering, Rule 8>>: a is paired with b
====

An _acquire_ operation, as would be used at the start of a critical
section, requires all memory operations following the acquire in program
order to also follow the acquire in the global memory order. This
ensures, for example, that all loads and stores inside the critical
section are up to date with respect to the synchronization variable
being used to protect it. Acquire ordering can be enforced in one of two
ways: with an acquire annotation, which enforces ordering with respect
to just the synchronization variable itself, or with a FENCE R,RW, which
enforces ordering with respect to all previous loads.

[[spinlock_atomics]]
.A spinlock with atomics
[source%linenums,asm]
....
          sd           x1, (a1)     # Arbitrary unrelated store
          ld           x2, (a2)     # Arbitrary unrelated load
          li           t0, 1        # Initialize swap value.
      again:
          amoswap.w.aq t0, t0, (a0) # Attempt to acquire lock.
          bnez         t0, again    # Retry if held.
          # ...
          # Critical section.
          # ...
          amoswap.w.rl x0, x0, (a0) # Release lock by storing 0.
          sd           x3, (a3)     # Arbitrary unrelated store
          ld           x4, (a4)     # Arbitrary unrelated load
....

Consider <<spinlock_atomics, Example 1>>.
Because this example uses _aq_, the loads and stores in the critical
section are guaranteed to appear in the global memory order after the
AMOSWAP used to acquire the lock. However, assuming `a0`, `a1`, and `a2`
point to different memory locations, the loads and stores in the
critical section may or may not appear after the "Arbitrary unrelated
load" at the beginning of the example in the global memory order.

[[spinlock_fences]]
.A spinlock with fences
[source%linenums,asm]
....
          sd           x1, (a1)     # Arbitrary unrelated store
          ld           x2, (a2)     # Arbitrary unrelated load
          li           t0, 1        # Initialize swap value.
      again:
          amoswap.w    t0, t0, (a0) # Attempt to acquire lock.
          fence        r, rw        # Enforce "acquire" memory ordering
          bnez         t0, again    # Retry if held.
          # ...
          # Critical section.
          # ...
          fence        rw, w        # Enforce "release" memory ordering
          amoswap.w    x0, x0, (a0) # Release lock by storing 0.
          sd           x3, (a3)     # Arbitrary unrelated store
          ld           x4, (a4)     # Arbitrary unrelated load
....

Now, consider the alternative in <<spinlock_fences, Example 2>>. In
this case, even though the AMOSWAP does not enforce ordering with an
_aq_ bit, the fence nevertheless enforces that the acquire AMOSWAP
appears earlier in the global memory order than all loads and stores in
the critical section. Note, however, that in this case, the fence also
enforces additional orderings: it also requires that the "Arbitrary
unrelated load" at the start of the program appears earlier in the
global memory order than the loads and stores of the critical section.
(This particular fence does not, however, enforce any ordering with
respect to the "Arbitrary unrelated store" at the start of the
snippet.) In this way, fence-enforced orderings are slightly coarser
than orderings enforced by _.aq_.

Release orderings work exactly the same as acquire orderings, just in
the opposite direction. Release semantics require all loads and stores
preceding the release operation in program order to also precede the
release operation in the global memory order. This ensures, for example,
that memory accesses in a critical section appear before the
lock-releasing store in the global memory order. Just as for acquire
semantics, release semantics can be enforced using release annotations
or with a FENCE RW,W operation. Using the same examples, the ordering
between the loads and stores in the critical section and the "Arbitrary
unrelated store" at the end of the code snippet is enforced only by the
FENCE RW,W in <<spinlock_fences, Example 2>>, not by
the _rl_ in <<spinlock_atomics, Example 1>>.

With RCpc annotations alone, store-release-to-load-acquire ordering is
not enforced. This facilitates the porting of code written under the TSO
and/or RCpc memory models. To enforce store-release-to-load-acquire
ordering, the code must use store-release-RCsc and load-acquire-RCsc
operations so that PPO <<overlapping-ordering, rule 7>> applies. RCpc alone is
sufficient for many use cases in C/C++ but is insufficient for many
other use cases in C/C++, Java, and Linux, to name just a few examples;
see <<memory_porting, Memory Porting>> for details.

PPO <<overlapping-ordering, rule 8>> indicates that an SC must appear after
its paired LR in the global memory order. This will follow naturally
from the common use of LR/SC to perform an atomic read-modify-write
operation due to the inherent data dependency. However, PPO
<<overlapping-ordering, rule 8>> also applies even when the value being stored
does not syntactically depend on the value returned by the paired LR.
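
A brief sketch: in the pair below, the value stored by the SC has no
syntactic dependency on the value returned by the LR, but
<<overlapping-ordering, rule 8>> still places the SC after its paired
LR in the global memory order:

[source,asm]
----
lr.w a0, 0(s0)       # (a) load-reserved
sc.w t0, t1, 0(s0)   # (b) stores t1, which is unrelated to a0
----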

Lastly, we note that just as with fences, programmers need not worry
about "cumulativity" when analyzing ordering annotations.

[[sec:memory:dependencies]]
==== Syntactic Dependencies (<<overlapping-ordering, Rules 9-11>>)

[[ppo-addr]]
[IMPORTANT]
====
<<overlapping-ordering, Rule 9>>: b has a syntactic address dependency on a

<<overlapping-ordering, Rule 10>>: b has a syntactic data dependency on a

<<overlapping-ordering, Rule 11>>: b is a store, and b has a syntactic control dependency on a
====

Dependencies from a load to a later memory operation in the same hart
are respected by the RVWMO memory model. The Alpha memory model was
notable for choosing _not_ to enforce the ordering of such dependencies,
but most modern hardware and software memory models consider allowing
dependent instructions to be reordered too confusing and
counterintuitive. Furthermore, modern code sometimes intentionally uses
such dependencies as a particularly lightweight ordering enforcement
mechanism.

The terms in <<mem-dependencies>> work as follows. Instructions
are said to carry dependencies from their
source register(s) to their destination register(s) whenever the value
written into each destination register is a function of the source
register(s). For most instructions, this means that the destination
register(s) carry a dependency from all source register(s). However,
there are a few notable exceptions. In the case of memory instructions,
the value written into the destination register ultimately comes from
the memory system rather than from the source register(s) directly, and
so this breaks the chain of dependencies carried from the source
register(s). In the case of unconditional jumps, the value written into
the destination register comes from the current `pc` (which is never
considered a source register by the memory model), and so likewise, JALR
(the only jump with a source register) does not carry a dependency from
_rs1_ to _rd_.


[[fflags]]
.(c) has a syntactic dependency on both (a) and (b) via fflags, a destination register that both (a) and (b) implicitly accumulate into
[.text-center,source%linenums,asm]
----
(a) fadd f3,f1,f2
(b) fadd f6,f4,f5
(c) csrrs a0,fflags,x0
----

The notion of accumulating into a destination register rather than
writing into it reflects the behavior of CSRs such as `fflags`. In
particular, an accumulation into a register does not clobber any
previous writes or accumulations into the same register. For example, in
<<fflags>>, (c) has a syntactic dependency on both (a) and (b).

Like other modern memory models, the RVWMO memory model uses syntactic
rather than semantic dependencies. In other words, this definition
depends on the identities of the registers being accessed by different
instructions, not the actual contents of those registers. This means
that an address, control, or data dependency must be enforced even if
the calculation could seemingly be "optimized away". This choice
ensures that RVWMO remains compatible with code that uses these false
syntactic dependencies as a lightweight ordering mechanism.

[[address]]
.A syntactic address dependency
[.text-center, source%linenums, asm]
----
ld a1,0(s0)
xor a2,a1,a1
add s1,s1,a2
ld a5,0(s1)
----

For example, there is a syntactic address dependency from the memory
operation generated by the first instruction to the memory operation
generated by the last instruction in
<<address>>, even though `a1` XOR
`a1` is zero and hence has no effect on the address accessed by the
second load.

The benefit of using dependencies as a lightweight synchronization
mechanism is that the ordering enforcement requirement is limited only
to the specific two instructions in question. Other non-dependent
instructions may be freely reordered by aggressive implementations. One
alternative would be to use a load-acquire, but this would enforce
ordering for the first load with respect to _all_ subsequent
instructions. Another would be to use a FENCE R,R, but this would
include all previous and all subsequent loads, making this option more
expensive.

[[control1]]
.A syntactic control dependency
[.text-center, source%linenums, asm]
----
lw x1,0(x2)
bne x1,x0,next
sw x3,0(x4)
next: sw x5,0(x6)
----

Control dependencies behave differently from address and data
dependencies in the sense that a control dependency always extends to
all instructions following the original target in program order.
Consider <<control1>>: the
instruction at `next` will always execute, but the memory operation
generated by that last instruction nevertheless still has a control
dependency from the memory operation generated by the first instruction.

[[control2]]
.Another syntactic control dependency
[.text-center,source%linenums,asm]
----
lw x1,0(x2)
bne x1,x0,next
next: sw x3,0(x4)
----

Likewise, consider <<control2>>.
Even though both branch outcomes have the same target, there is still a
control dependency from the memory operation generated by the first
instruction in this snippet to the memory operation generated by the
last instruction. This definition of control dependency is subtly
stronger than what might be seen in other contexts (e.g., C++), but it
conforms with standard definitions of control dependencies in the
literature.

Notably, PPO rules <<overlapping-ordering, 9-11>> are also
intentionally designed to respect dependencies that originate from the
output of a successful store-conditional instruction. Typically, an SC
instruction will be followed by a conditional branch checking whether
the outcome was successful; this implies that there will be a control
dependency from the store operation generated by the SC instruction to
any memory operations following the branch. PPO
<<overlapping-ordering, rule 11>> in turn implies that any subsequent store
operations will appear later in the global memory order than the store
operation generated by the SC. However, since control, address, and data
dependencies are defined over memory operations, and since an
unsuccessful SC does not generate a memory operation, no order is
enforced between an unsuccessful SC and its dependent instructions.
Moreover, since SC is defined to carry dependencies from its source
registers to _rd_ only when the SC is successful, an unsuccessful SC has
no effect on the global memory order.
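
The common pattern looks like the following non-normative sketch, in
which the branch on the SC's result register creates a control
dependency from the store generated by the SC to every later store:

[source,asm]
----
retry:
lr.w a0, 0(s1)       # load-reserved
addi a1, a0, 1       # compute an updated value
sc.w t0, a1, 0(s1)   # (a) store-conditional; t0=0 on success
bnez t0, retry       # branch on the SC outcome
sw   t1, 0(s2)       # (b) control-dependent on (a); ordered by rule 11
----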

[[litmus_lb_lrsc]]
.A variant of the LB litmus test (outcome forbidden)
[float="center",align="center",cols=".^1a,.^1a",frame="none",grid="none",options="noheader"]
|===
|
[%autowidth,cols="^,<,^,<",float="center",align="center"]
!===
4+!Initial values: 0(s0)=1; 0(s1)=1
4+!
2+^!Hart 0 2+^!Hart 1 
!(a) !ld a0,0(s0) !(e) !ld a3,0(s2)
!(b) !lr a1,0(s1) !(f) !sd a3,0(s0)
!(c) !sc a2,a0,0(s1) ! !
!(d) !sd a2,0(s2) ! !
4+!Outcome: `a0=0`, `a3=0`
!===
|
!===
//a! graphviz::images/graphviz/litmus_lb_lrsc.txt[]
a! image::graphviz/litmus_lb_lrsc.png[]
!===
|===

In addition, the choice to respect dependencies originating at
store-conditional instructions ensures that certain out-of-thin-air-like
behaviors will be prevented. Consider
<<litmus_lb_lrsc>>. Suppose a
hypothetical implementation could occasionally make some early guarantee
that a store-conditional operation will succeed. In this case, (c) could
return 0 to `a2` early (before actually executing), allowing the
sequence (d), (e), (f), (a), and then (b) to execute, and then (c) might
execute (successfully) only at that point. This would imply that (c)
writes its own success value to `0(s1)`! Fortunately, this situation and
others like it are prevented by the fact that RVWMO respects
dependencies originating at the stores generated by successful SC
instructions.

We also note that syntactic dependencies between instructions only have
any force when they take the form of a syntactic address, control,
and/or data dependency. For example: a syntactic dependency between two
`F` instructions via one of the "accumulating CSRs" in
<<source-dest-regs>> does _not_ imply
that the two `F` instructions must be executed in order. Such a
dependency would only serve to ultimately set up a dependency from
both `F` instructions to a later CSR instruction accessing the CSR
flag in question.

[[memory-ppopipeline]]
==== Pipeline Dependencies (<<overlapping-ordering, Rules 12-13>>)

[[addrdatarfi]]
[IMPORTANT]
====
<<overlapping-ordering, Rule 12>>: b is a load, and there exists some store m between a and b in
program order such that m has an address or data dependency on a,
and b returns a value written by m

<<overlapping-ordering, Rule 13>>: b is a store, and there exists some instruction m between a and
b in program order such that m has an address dependency on a
====

[[litmus_datarfi]]
.Because of PPO <<overlapping-ordering, rule 12>> and the data dependency from (d) to (e), (d) must also precede (f) in the global memory order (outcome forbidden)
[float="center",align="center",cols=".^1a,.^1a",frame="none",grid="none",options="noheader"]
|===
|
[%autowidth,float="center",align="center",cols="^,<,^,<",options="header",]
!===
2+!Hart 0 2+! Hart 1
! !li t1, 1 !(d) !lw a0, 0(s1)
!(a) !sw t1,0(s0) !(e) !sw a0, 0(s2)
!(b) !fence w, w !(f) !lw a1, 0(s2)
!(c) !sw t1,0(s1) ! !xor a2,a1,a1
! ! ! !add s0,s0,a2
! ! !(g) !lw a3,0(s0)
4+!Outcome: `a0=1`, `a3=0`
!===
|
!===
//a! graphviz::images/graphviz/litmus_datarfi.txt[]
a! image::graphviz/litmus_datarfi.png[]
!===
|===

PPO rules <<overlapping-ordering, 12>> and <<overlapping-ordering, 13>> reflect behaviors of almost all real processor
pipeline implementations. Rule <<overlapping-ordering, 12>>
states that a load cannot forward from a store until the address and
data for that store are known. Consider <<litmus_datarfi>>: (f) cannot be
executed until the data for (e) has been resolved, because (f) must
return the value written by (e) (or by something even later in the
global memory order), and the old value must not be clobbered by the
writeback of (e) before (d) has had a chance to perform. Therefore, (f)
will never perform before (d) has performed.


[[litmus_datacoirfi]]
.Because of the extra store between (e) and (g), (d) no longer necessarily precedes (g) (outcome permitted)
[float="center",align="center",cols=".^1a,.^1a",frame="none",grid="none",options="noheader"]
|===
|
[%autowidth,cols="^,<,^,<",float="center",align="center",options="header",]
!===
2+!Hart 0 2+!Hart 1
2+!li t1, 1 2+^!li t1, 1
!(a) !sw t1,0(s0) !(d) !lw a0, 0(s1)
!(b) !fence w, w !(e) !sw a0, 0(s2)
!(c) !sw t1,0(s1) !(f) !sw t1, 0(s2)
! ! !(g) !lw a1, 0(s2)
! ! ! !xor a2,a1,a1
! ! ! !add s0,s0,a2
! ! !(h) !lw a3,0(s0)
4+!Outcome: `a0=1`, `a3=0`
!===
|
!===
//a! graphviz::images/graphviz/litmus_datacoirfi.txt[]
a! image::graphviz/litmus_datacoirfi.png[]
!===
|===

If there were another store to the same address in between (e) and (f),
as in <<litmus_datacoirfi>>,
then (f) would no longer be dependent on the data of (e) being resolved,
and hence the dependency of (f) on (d), which produces the data for (e),
would be broken.

Rule <<overlapping-ordering, 13>> makes an observation similar to the
previous rule: a store cannot be performed at memory until all previous
loads that might access the same address have themselves been performed.
Such a load must appear to execute before the store, but it cannot do so
if the store were to overwrite the value in memory before the load had a
chance to read the old value. Likewise, a store generally cannot be
performed until it is known that preceding instructions will not cause
an exception due to failed address resolution, and in this sense,
rule <<overlapping-ordering, 13>> can be seen as somewhat of a special case
of rule <<overlapping-ordering, 11>>.

[[litmus:addrdatarfi_no]]
.Because of the address dependency from (d) to (e), (d) also precedes (f) (outcome forbidden)
[float="center",align="center",cols=".^1a,.^1a",frame="none",grid="none",options="noheader"]
|===
|
[%autowidth,cols="^,<,^,<",float="center",align="center",options="header"]
!===
2+!Hart 0 2+^!Hart 1
2+! 2+^!li t1, 1
!(a) !lw a0,0(s0) !(d) !lw a1, 0(s1)
!(b) !fence rw,rw !(e) !lw a2, 0(a1)
!(c) !sw s2,0(s1) !(f) !sw t1, 0(s0)
4+!Outcome: `a0=1`, `a1=t`
!===
|
!===
//a! graphviz::images/graphviz/litmus_addrpo.txt[]
a! image::graphviz/litmus_addrpo.png[]
!===
|===

Consider <<litmus:addrdatarfi_no>>: (f) cannot be
executed until the address for (e) is resolved, because it may turn out
that the addresses match; i.e., that `a1=s0`. Therefore, (f) cannot be
sent to memory before (d) has executed and confirmed whether the
addresses do indeed overlap.

=== Beyond Main Memory

RVWMO does not currently attempt to formally describe how FENCE.I,
SFENCE.VMA, I/O fences, and PMAs behave. All of these behaviors will be
described by future formalizations. In the meantime, the behavior of
FENCE.I is described in <<zifencei>>, the
behavior of SFENCE.VMA is described in the RISC-V Instruction Set
Privileged Architecture Manual, and the behavior of I/O fences and the
effects of PMAs are described below.

==== Coherence and Cacheability

The RISC-V Privileged ISA defines Physical Memory Attributes (PMAs)
which specify, among other things, whether portions of the address space
are coherent and/or cacheable. See the RISC-V Privileged ISA
Specification for the complete details. Here, we simply discuss how the
various details in each PMA relate to the memory model:

* Main memory vs. I/O, and I/O memory ordering PMAs: the memory model as
defined applies to main memory regions. I/O ordering is discussed below.
* Supported access types and atomicity PMAs: the memory model is simply
applied on top of whatever primitives each region supports.
* Cacheability PMAs: the cacheability PMAs in general do not affect the
memory model. Non-cacheable regions may have more restrictive behavior
than cacheable regions, but the set of allowed behaviors does not change
regardless. However, some platform-specific and/or device-specific
cacheability settings may differ.
* Coherence PMAs: The memory consistency model for memory regions marked
as non-coherent in PMAs is currently platform-specific and/or
device-specific: the load-value axiom, the atomicity axiom, and the
progress axiom all may be violated with non-coherent memory. Note
however that coherent memory does not require a hardware cache coherence
protocol. The RISC-V Privileged ISA Specification discourages
hardware-incoherent regions of main memory, but the
memory model is compatible with hardware coherence, software coherence,
implicit coherence due to read-only memory, implicit coherence due to
only one agent having access, or otherwise.
* Idempotency PMAs: Idempotency PMAs are used to specify memory regions
for which loads and/or stores may have side effects, and this in turn is
used by the microarchitecture to determine, e.g., whether prefetches are
legal. This distinction does not affect the memory model.

==== I/O Ordering

For I/O, the load value axiom and atomicity axiom in general do not
apply, as both reads and writes might have device-specific side effects
and may return values other than the value "written" by the most
recent store to the same address. Nevertheless, the following preserved
program order rules still generally apply for accesses to I/O memory:
memory access _a_ precedes memory access _b_ in
global memory order if _a_ precedes _b_ in
program order and one or more of the following holds:

. _a_ precedes _b_ in preserved program order as
defined in <<memorymodel>>, with the exception
that acquire and release ordering annotations apply only from one memory
operation to another memory operation and from one I/O operation to
another I/O operation, but not from a memory operation to an I/O nor
vice versa
. _a_ and _b_ are accesses to overlapping
addresses in an I/O region
. _a_ and _b_ are accesses to the same strongly
ordered I/O region
. _a_ and _b_ are accesses to I/O regions, and
the channel associated with the I/O region accessed by either
_a_ or _b_ is channel 1
. _a_ and _b_ are accesses to I/O regions
associated with the same channel (except for channel 0)
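
As a hypothetical illustration of these rules, suppose `s0` points into a
strongly ordered I/O region and `a0` points into an unrelated, weakly
ordered I/O region (the registers, offsets, and region assignments here
are assumptions for the sketch, not requirements):

[source,asm]
----
sw t0, 0(s0)    # device register write A
sw t1, 4(s0)    # write B: ordered after A by rule 3 (same strongly ordered region)
fence o,o       # orders prior I/O writes before subsequent I/O writes
sw t2, 0(a0)    # write C: ordered after A and B only because of the fence
----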

Note that the FENCE instruction distinguishes between main memory
operations and I/O operations in its predecessor and successor sets. To
enforce ordering between I/O operations and main memory operations, code
must use a FENCE with PI, PO, SI, and/or SO, plus PR, PW, SR, and/or SW.
For example, to enforce ordering between a write to main memory and an
I/O write to a device register, a FENCE W,O or stronger is needed.
[[wo]]
.Ordering memory and I/O accesses
[.text-center,source%linenums,asm]
----
sd t0, 0(a0)
fence w,o 
sd a0, 0(a1)
----

When a fence is in fact used, implementations must assume that the
device may attempt to access memory immediately after receiving the MMIO
signal, and subsequent memory accesses from that device to memory must
observe the effects of all accesses ordered prior to that MMIO
operation. In other words, in <<wo>>,
suppose `0(a0)` is in main memory and `0(a1)` is the address of a device
register in I/O memory. If the device accesses `0(a0)` upon receiving
the MMIO write, then that load must conceptually appear after the first
store to `0(a0)` according to the rules of the RVWMO memory model. In
some implementations, the only way to ensure this will be to require
that the first store does in fact complete before the MMIO write is
issued. Other implementations may find ways to be more aggressive, while
others still may not need to do anything different at all for I/O and
main memory accesses. Nevertheless, the RVWMO memory model does not
distinguish between these options; it simply provides an
implementation-agnostic mechanism to specify the orderings that must be
enforced.

Many architectures include separate notions of "ordering" and
"completion" fences, especially as they relate to I/O (as opposed to
regular main memory). Ordering fences simply ensure that memory
operations stay in order, while completion fences ensure that
predecessor accesses have all completed before any successors are made
visible. RISC-V does not explicitly distinguish between ordering and
completion fences. Instead, this distinction is simply inferred from
different uses of the FENCE bits.

For implementations that conform to the RISC-V Unix Platform
Specification, I/O devices and DMA operations are required to access
memory coherently and via strongly ordered I/O channels. Therefore,
accesses to regular main memory regions that are concurrently accessed
by external devices can also use the standard synchronization
mechanisms. Implementations that do not conform to the Unix Platform
Specification and/or in which devices do not access memory coherently
will need to use mechanisms (which are currently platform-specific or
device-specific) to enforce coherency.

I/O regions in the address space should be considered non-cacheable
regions in the PMAs for those regions. Such regions can be considered
coherent by the PMA if they are not cached by any agent.

The ordering guarantees in this section may not apply beyond a
platform-specific boundary between the RISC-V cores and the device. In
particular, I/O accesses sent across an external bus (e.g., PCIe) may be
reordered before they reach their ultimate destination. Ordering must be
enforced in such situations according to the platform-specific rules of
those external devices and buses.

[[memory_porting]]
=== Code Porting and Mapping Guidelines

[[tsomappings]]
.Mappings from TSO operations to RISC-V operations
[%autowidth,float="center", align="center",cols="<,<",options="header",separator=!]
|===
!x86/TSO Operation !RVWMO Mapping
!Load ! `l{b|h|w|d}; fence r,rw`
!Store !`fence rw,w; s{b|h|w|d}`
!Atomic RMW !`amo<op>.{w|d}.aqrl OR` +
`loop:lr.{w|d}.aq; <op>; sc.{w|d}.aqrl; bnez loop`
!Fence !`fence rw,rw`
|===

<<tsomappings>> provides a mapping from TSO memory
operations onto RISC-V memory instructions. Normal x86 loads and stores
are all inherently acquire-RCpc and release-RCpc operations: TSO
enforces all load-load, load-store, and store-store ordering by default.
Therefore, under RVWMO, all TSO loads must be mapped onto a load
followed by FENCE R,RW, and all TSO stores must be mapped onto
FENCE RW,W followed by a store. TSO atomic read-modify-writes and x86
instructions using the LOCK prefix are fully ordered and can be
implemented either via an AMO with both _aq_ and _rl_ set, or via an LR
with _aq_ set, the arithmetic operation in question, an SC with both
_aq_ and _rl_ set, and a conditional branch checking the success
condition. In the latter case, the _rl_ annotation on the LR turns out
(for non-obvious reasons) to be redundant and can be omitted.
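
For example, a TSO message-passing producer that writes a data value and
then sets a flag might be ported per <<tsomappings>> as follows (a
hypothetical sketch; `a0` and `a1` are assumed to hold the addresses of
the data and flag locations):

[source,asm]
----
li    t0, 1
fence rw,w       # TSO store mapping: leading fence
sw    t0, 0(a0)  # write the data
fence rw,w       # TSO store mapping: leading fence
sw    t0, 0(a1)  # set the flag; remains ordered after the data write
----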

Alternatives to <<tsomappings>> are also possible. A TSO
store can be mapped onto AMOSWAP with _rl_ set. However, since RVWMO PPO
Rule <<overlapping-ordering, 3>> forbids forwarding of values from
AMOs to subsequent loads, the use of AMOSWAP for stores may negatively
affect performance. A TSO load can be mapped using LR with _aq_ set: all
such LR instructions will be unpaired, but that fact in and of itself
does not preclude the use of LR for loads. However, again, this mapping
may also negatively affect performance if it puts more pressure on the
reservation mechanism than was originally intended.
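
As a hypothetical sketch of the first alternative, a TSO store of `t0` to
the address in `a0` could be written as a release AMO, discarding the old
value by writing it to `x0`:

[source,asm]
----
amoswap.w.rl x0, t0, (a0)   # store t0 to 0(a0); rl makes it a release operation
----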

[[powermappings]]
.Mappings from Power operations to RISC-V operations
[%autowidth,float="center",align="center",cols="<,<",options="header",separator=!]
|===
!Power Operation !RVWMO Mapping
!Load !`l{b|h|w|d}`
!Load-Reserve !`lr.{w|d}`
!Store !`s{b|h|w|d}`
!Store-Conditional !`sc.{w|d}`
!`lwsync` !`fence.tso`
!`sync` !`fence rw,rw`
!`isync` !`fence.i; fence r,r`
|===

<<powermappings>> provides a mapping from Power memory
operations onto RISC-V memory instructions. Power ISYNC maps on RISC-V
to a FENCE.I followed by a FENCE R,R; the latter fence is needed because
ISYNC is used to define a "control+control fence" dependency that is
not present in RVWMO.

[[armmappings]]
.Mappings from ARM operations to RISC-V operations
[%autowidth,float="center",align="center",cols="<,<",options="header",separator=!]
|===
!ARM Operation !RVWMO Mapping
!Load !`l{b|h|w|d}`
!Load-Acquire !`fence rw, rw; l{b|h|w|d}; fence r,rw`
!Load-Exclusive !`lr.{w|d}`
!Load-Acquire-Exclusive !`lr.{w|d}.aqrl`
!Store !`s{b|h|w|d}`
!Store-Release !`fence rw,w; s{b|h|w|d}`
!Store-Exclusive !`sc.{w|d}`
!Store-Release-Exclusive !`sc.{w|d}.rl`
!`dmb` !`fence rw,rw`
!`dmb.ld` !`fence r,rw`
!`dmb.st` !`fence w,w`
!`isb` !`fence.i; fence r,r`
|===

<<armmappings>> provides a mapping from ARM memory
operations onto RISC-V memory instructions. Since RISC-V does not
currently have plain load and store opcodes with _aq_ or _rl_
annotations, ARM load-acquire and store-release operations should be
mapped using fences instead. Furthermore, in order to enforce
store-release-to-load-acquire ordering, there must be a FENCE RW,RW
between the store-release and load-acquire; <<armmappings>>
enforces this by always placing the fence in front of each acquire
operation. ARM load-exclusive and store-exclusive instructions can
likewise map onto their RISC-V LR and SC equivalents, but instead of
placing a FENCE RW,RW in front of an LR with _aq_ set, we simply also
set _rl_ instead. ARM ISB maps on RISC-V to FENCE.I followed by
FENCE R,R similarly to how ISYNC maps for Power.
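
For instance, an ARM store-release followed in program order by a
load-acquire would map per <<armmappings>> as follows (a hypothetical
sketch; `a0` and `a1` are assumed addresses):

[source,asm]
----
fence rw,w       # store-release prefix
sw    t0, 0(a0)  # the release store
fence rw,rw      # load-acquire prefix; also orders the release before the acquire
lw    t1, 0(a1)  # the acquire load
fence r,rw       # load-acquire suffix
----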

[[linuxmappings]]
.Mappings from Linux memory primitives to RISC-V primitives.
[%autowidth,float="center",align="center",cols="<,<",options="header",separator=!]
|===
!Linux Operation !RVWMO Mapping

!`smp_mb()` !`fence rw,rw`

!`smp_rmb()` !`fence r,r`

!`smp_wmb()` !`fence w,w`

!`dma_rmb()` !`fence r,r`

!`dma_wmb()` !`fence w,w`

!`mb()` !`fence iorw,iorw`

!`rmb()` !`fence ri,ri`

!`wmb()` !`fence wo,wo`

!`smp_load_acquire()` !`l{b|h|w|d}; fence r,rw`

!`smp_store_release()` !`fence.tso; s{b|h|w|d}`

!Linux Construct !RVWMO AMO Mapping

!`atomic <op> relaxed` !`amo<op>.{w|d}`

!`atomic <op> acquire` !`amo<op>.{w|d}.aq`

!`atomic <op> release` !`amo<op>.{w|d}.rl`

!`atomic <op>` !`amo<op>.{w|d}.aqrl`

!Linux Construct !RVWMO LR/SC Mapping

!`atomic <op> relaxed` !`loop:lr.{w|d}; <op>; sc.{w|d}; bnez loop`

!`atomic <op> acquire` !`loop:lr.{w|d}.aq; <op>; sc.{w|d}; bnez loop`

!`atomic <op> release` !`loop:lr.{w|d}; <op>; sc.{w|d}.aqrl*; bnez loop OR`

! !`fence.tso; loop:lr.{w|d}; <op>; sc.{w|d}*; bnez loop`

!`atomic <op>` !`loop:lr.{w|d}.aq; <op>; sc.{w|d}.aqrl; bnez loop`

|===

<<linuxmappings>> provides a mapping of Linux memory
ordering macros onto RISC-V memory instructions. The Linux fences
`dma_rmb()` and `dma_wmb()` map onto FENCE R,R and FENCE W,W,
respectively, since the RISC-V Unix Platform requires coherent DMA, but
would be mapped onto FENCE RI,RI and FENCE WO,WO, respectively, on a
platform with non-coherent DMA. Platforms with non-coherent DMA may also
require a mechanism by which cache lines can be flushed and/or
invalidated. Such mechanisms will be device-specific and/or standardized
in a future extension to the ISA.

Other constructs (such as spinlocks) should follow these mappings
accordingly. Platforms or devices with non-coherent DMA may need
additional synchronization (such as cache flush or invalidate
mechanisms); currently any such extra synchronization will be
device-specific.

The Linux mappings for release operations may seem stronger than
necessary, but these mappings are needed to cover some cases in which
Linux requires stronger orderings than the more intuitive mappings would
provide. In particular, as of the time this text is being written, Linux
is actively debating whether to require load-load, load-store, and
store-store orderings between accesses in one critical section and
accesses in a subsequent critical section in the same hart and protected
by the same synchronization object. Not all combinations of the
FENCE RW,W/FENCE R,RW mappings with the _aq_/_rl_ mappings
provide such orderings. There are a few ways around this problem,
including:

. Always use FENCE RW,W/FENCE R,RW, and never use _aq_/_rl_. This
suffices but is undesirable, as it defeats the purpose of the _aq_/_rl_
modifiers.
. Always use _aq_/_rl_, and never use FENCE RW,W/FENCE R,RW. This does
not currently work due to the lack of load and store opcodes with _aq_
and _rl_ modifiers.
. Strengthen the mappings of release operations such that they would
enforce sufficient orderings in the presence of either type of acquire
mapping. This is the currently recommended solution, and the one shown
in <<linuxmappings>>.

For example, the critical section ordering rule currently being debated
by the Linux community would require (a) to be ordered before (e) in
<<lkmm_ll>>. If that ordering is indeed required, then mapping (b) as
FENCE RW,W would be insufficient. That said, these mappings are subject
to change as the Linux Kernel Memory Model evolves.

[[lkmm_ll]]
.Orderings between critical sections in Linux
[source%linenums,asm]
----
Linux Code:
(a) int r0 = *x;
(bc) spin_unlock(y, 0);
....
....
(d) spin_lock(y);
(e) int r1 = *z;

RVWMO Mapping:
(a) lw a0, 0(s0)
(b) fence.tso // vs. fence rw,w
(c) sd x0,0(s1)
....
loop:
(d) amoswap.d.aq a1,t1,0(s1)
bnez a1,loop
(e) lw a2,0(s2)
----

<<c11mappings>> provides a mapping of C11/C++11 atomic
operations onto RISC-V memory instructions. If load and store opcodes
with _aq_ and _rl_ modifiers are introduced, then the mappings in
<<c11mappings_hypothetical>> will suffice. Note however that
the two mappings only interoperate correctly if
`atomic_<op>(memory_order_seq_cst)` is mapped using an LR that has both
_aq_ and _rl_ set.
Even more importantly, a <<c11mappings>> sequentially consistent store
followed by a <<c11mappings_hypothetical>> sequentially consistent load
can be reordered unless the <<c11mappings>> mapping of stores is
strengthened by either adding a second fence or mapping the store
to `amoswap.rl` instead.
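
A hypothetical sketch of these two strengthened seq_cst store mappings
(either restores ordering between a sequentially consistent store and a
subsequent <<c11mappings_hypothetical>> sequentially consistent load;
`a0` is an assumed address):

[source,asm]
----
# Option 1: add a trailing fence to the store mapping of <<c11mappings>>
fence rw,w
sw    t0, 0(a0)
fence rw,rw

# Option 2: map the seq_cst store to a release AMO instead
amoswap.w.rl x0, t0, (a0)
----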

[[c11mappings]]
.Mappings from C/C++ primitives to RISC-V primitives.
[%autowidth,float="center",align="center",cols="<,<",options="header",separator=!]
|===

!C/C++ Construct ! RVWMO Mapping

!Non-atomic load ! `l{b|h|w|d}`

!`atomic_load(memory_order_relaxed)` !`l{b|h|w|d}`

!`atomic_load(memory_order_acquire)` !`l{b|h|w|d}; fence r,rw`

!`atomic_load(memory_order_seq_cst)` !`fence rw,rw; l{b|h|w|d}; fence r,rw`

!Non-atomic store !`s{b|h|w|d}`

!`atomic_store(memory_order_relaxed)` !`s{b|h|w|d}`

!`atomic_store(memory_order_release)` !`fence rw,w; s{b|h|w|d}`

!`atomic_store(memory_order_seq_cst)` !`fence rw,w; s{b|h|w|d}`

!`atomic_thread_fence(memory_order_acquire)` !`fence r,rw`

!`atomic_thread_fence(memory_order_release)` !`fence rw,w`

!`atomic_thread_fence(memory_order_acq_rel)` !`fence.tso`

!`atomic_thread_fence(memory_order_seq_cst)` !`fence rw,rw`

!C/C++ Construct !RVWMO AMO Mapping

!`atomic_<op>(memory_order_relaxed)` !`amo<op>.{w|d}`

!`atomic_<op>(memory_order_acquire)` !`amo<op>.{w|d}.aq`

!`atomic_<op>(memory_order_release)` !`amo<op>.{w|d}.rl`

!`atomic_<op>(memory_order_acq_rel)` !`amo<op>.{w|d}.aqrl`

!`atomic_<op>(memory_order_seq_cst)` !`amo<op>.{w|d}.aqrl`

!C/C++ Construct !RVWMO LR/SC Mapping

!`atomic_<op>(memory_order_relaxed)` !`loop:lr.{w|d}; <op>; sc.{w|d};`

! !`bnez loop`

!`atomic_<op>(memory_order_acquire)` !`loop:lr.{w|d}.aq; <op>; sc.{w|d};`

! !`bnez loop`

!`atomic_<op>(memory_order_release)` !`loop:lr.{w|d}; <op>; sc.{w|d}.rl;`

! !`bnez loop`

!`atomic_<op>(memory_order_acq_rel)` !`loop:lr.{w|d}.aq; <op>; sc.{w|d}.rl;`

! !`bnez loop`

!`atomic_<op>(memory_order_seq_cst)` !`loop:lr.{w|d}.aqrl; <op>;`

! !`sc.{w|d}.rl; bnez loop`

|===

[[c11mappings_hypothetical]]
.Hypothetical mappings from C/C++ primitives to RISC-V primitives, if native load-acquire and store-release opcodes are introduced.
[%autowidth,float="center",align="center",cols="<,<",options="header",separator=!]
|===
!C/C++ Construct !RVWMO Mapping

!Non-atomic load !`l{b|h|w|d}`

!`atomic_load(memory_order_relaxed)` !`l{b|h|w|d}`

!`atomic_load(memory_order_acquire)` !`l{b|h|w|d}.aq`

!`atomic_load(memory_order_seq_cst)` !`l{b|h|w|d}.aq`

!Non-atomic store !`s{b|h|w|d}`

!`atomic_store(memory_order_relaxed)` !`s{b|h|w|d}`

!`atomic_store(memory_order_release)` !`s{b|h|w|d}.rl`

!`atomic_store(memory_order_seq_cst)` !`s{b|h|w|d}.rl`

!`atomic_thread_fence(memory_order_acquire)` !`fence r,rw`

!`atomic_thread_fence(memory_order_release)` !`fence rw,w`

!`atomic_thread_fence(memory_order_acq_rel)` !`fence.tso`

!`atomic_thread_fence(memory_order_seq_cst)` !`fence rw,rw`

!C/C++ Construct !RVWMO AMO Mapping

!`atomic_<op>(memory_order_relaxed)` !`amo<op>.{w|d}`

!`atomic_<op>(memory_order_acquire)` !`amo<op>.{w|d}.aq`

!`atomic_<op>(memory_order_release)` !`amo<op>.{w|d}.rl`

!`atomic_<op>(memory_order_acq_rel)` !`amo<op>.{w|d}.aqrl`

!`atomic_<op>(memory_order_seq_cst)` !`amo<op>.{w|d}.aqrl`

!C/C++ Construct !RVWMO LR/SC Mapping

!`atomic_<op>(memory_order_relaxed)` !`lr.{w|d}; <op>; sc.{w|d}`

!`atomic_<op>(memory_order_acquire)` !`lr.{w|d}.aq; <op>; sc.{w|d}`

!`atomic_<op>(memory_order_release)` !`lr.{w|d}; <op>; sc.{w|d}.rl`

!`atomic_<op>(memory_order_acq_rel)` !`lr.{w|d}.aq; <op>; sc.{w|d}.rl`

!`atomic_<op>(memory_order_seq_cst)` !`lr.{w|d}.aq*; <op>; sc.{w|d}.rl`

2+!`*` must be `lr.{w|d}.aqrl` in order to interoperate with code mapped per <<c11mappings>>
|===

Any AMO can be emulated by an LR/SC pair, but care must be taken to
ensure that any PPO orderings that originate from the LR are also made
to originate from the SC, and that any PPO orderings that terminate at
the SC are also made to terminate at the LR. For example, the LR must
also be made to respect any data dependencies that the AMO has, given
that load operations do not otherwise have any notion of a data
dependency. Likewise, the effect of a FENCE R,R elsewhere in the same hart
must also be made to apply to the SC, which would not otherwise respect
that fence. The emulator may achieve this effect by simply mapping AMOs
onto `lr.aq; <op>; sc.aqrl`, matching the mapping used elsewhere for
fully ordered atomics.
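
For example, a hypothetical emulation of `amoadd.w a2, a1, (a0)` along
these lines:

[source,asm]
----
loop:
lr.w.aq    a2, (a0)      # aq preserves orderings that terminate at the AMO
add        t0, a2, a1    # the AMO's arithmetic operation
sc.w.aqrl  t1, t0, (a0)  # aqrl preserves orderings that originate at the AMO
bnez       t1, loop      # retry if the reservation was lost
----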

These C11/C++11 mappings require the platform to provide the following
Physical Memory Attributes (as defined in the RISC-V Privileged ISA) for
all memory:

* main memory
* coherent
* AMOArithmetic
* RsrvEventual

Platforms with different attributes may require different mappings, or
may require platform-specific software (e.g., for memory-mapped I/O).

=== Implementation Guidelines

The RVWMO and RVTSO memory models by no means preclude
microarchitectures from employing sophisticated speculation techniques
or other forms of optimization in order to deliver higher performance.
The models also do not impose any requirement to use any one particular
cache hierarchy, nor even to use a cache coherence protocol at all.
Instead, these models only specify the behaviors that can be exposed to
software. Microarchitectures are free to use any pipeline design, any
coherent or non-coherent cache hierarchy, any on-chip interconnect,
etc., as long as the design only admits executions that satisfy the
memory model rules. That said, to help people understand the actual
implementations of the memory model, in this section we provide some
guidelines on how architects and programmers should interpret the
models' rules.

Both RVWMO and RVTSO are multi-copy atomic (or
_other-multi-copy-atomic_): any store value that is visible to a hart
other than the one that originally issued it must also be conceptually
visible to all other harts in the system. In other words, harts may
forward from their own previous stores before those stores have become
globally visible to all harts, but no early inter-hart forwarding is
permitted. Multi-copy atomicity may be enforced in a number of ways. It
might hold inherently due to the physical design of the caches and store
buffers, it may be enforced via a single-writer/multiple-reader cache
coherence protocol, or it might hold due to some other mechanism.

Although multi-copy atomicity does impose some restrictions on the
microarchitecture, it is one of the key properties keeping the memory
model from becoming extremely complicated. For example, a hart may not
legally forward a value from a neighbor hart's private store buffer
(unless of course it is done in such a way that no new illegal behaviors
become architecturally visible). Nor may a cache coherence protocol
forward a value from one hart to another until the coherence protocol
has invalidated all older copies from other caches. Of course,
microarchitectures may (and high-performance implementations likely
will) violate these rules under the covers through speculation or other
optimizations, as long as any non-compliant behaviors are not exposed to
the programmer.

As a rough guideline for interpreting the PPO rules in RVWMO, we expect
the following from the software perspective:

* programmers will use PPO rules <<overlapping-ordering, 1>> and <<overlapping-ordering, 4-8>> regularly and actively.
* expert programmers will use PPO rules <<overlapping-ordering, 9-11>> to speed up critical paths
of important data structures.
* even expert programmers will rarely if ever use PPO rules <<overlapping-ordering, 2-3>> and
<<overlapping-ordering, 12-13>> directly.
These are included to facilitate common microarchitectural optimizations
(rule <<overlapping-ordering, 2>>) and the operational formal modeling approach (rules <<overlapping-ordering, 3>> and
<<overlapping-ordering, 12-13>>) described
in <<operational>>. They also facilitate the
process of porting code from other architectures that have similar
rules.

We also expect the following from the hardware perspective:

* PPO rules <<overlapping-ordering, 1>> and <<overlapping-ordering, 3-6>> reflect
well-understood rules that should pose few surprises to architects.
* PPO rule <<overlapping-ordering, 2>> reflects a natural and common hardware
optimization, but one that is very subtle and hence is worth double
checking carefully.
* PPO rule <<overlapping-ordering, 7>> may not be immediately obvious to
architects, but it is a standard memory model requirement.
* The load value axiom, the atomicity axiom, and PPO rules
<<overlapping-ordering, 8-13>> reflect rules that most
hardware implementations will enforce naturally, unless they contain
extreme optimizations. Of course, implementations should make sure to
double check these rules nevertheless. Hardware must also ensure that
syntactic dependencies are not "optimized away", as illustrated below.
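
The `xor`/`add` idiom used throughout the litmus tests in this chapter is
a minimal sketch of such a syntactic dependency:

[source,asm]
----
lw  a1, 0(s1)    # load a value
xor a2, a1, a1   # a2 is always zero, but syntactically depends on a1
add s0, s0, a2   # s0 is numerically unchanged, yet now also depends on a1
lw  a3, 0(s0)    # this load's address carries a syntactic dependency on a1
----

Hardware that rewrote `xor a2, a1, a1` into `li a2, 0` would sever this
dependency and could thereby violate the syntactic address dependency
rule (rule <<overlapping-ordering, 9>>).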

Architectures are free to implement any of the memory model rules as
conservatively as they choose. For example, a hardware implementation
may choose to do any or all of the following:

* interpret all fences as if they were FENCE RW,RW (or FENCE IORW,IORW,
if I/O is involved), regardless of the bits actually set
* implement all fences with PW and SR as if they were FENCE RW,RW (or
FENCE IORW,IORW, if I/O is involved), as PW with SR is the most
expensive of the four possible main memory ordering components anyway
* emulate _aq_ and _rl_ as described in <<memory_porting>>
* enforce all same-address load-load ordering, even in the presence of
patterns such as `fri-rfi` and `RSW`
* forbid any forwarding of a value from a store in the store buffer to a
subsequent AMO or LR to the same address
* forbid any forwarding of a value from an AMO or SC in the store buffer
to a subsequent load to the same address
* implement TSO on all memory accesses, and ignore any main memory
fences that do not include PW and SR ordering (e.g., as Ztso
implementations will do)
* implement all atomics to be RCsc or even fully ordered, regardless of
annotation

Architectures that implement RVTSO can safely do the following:

* Ignore all fences that do not have both PW and SR (unless the fence
also orders I/O)
* Ignore all PPO rules except for rules <<overlapping-ordering, 4>> through <<overlapping-ordering, 7>>, since the rest
are redundant with other PPO rules under RVTSO assumptions

Other general notes:

* Silent stores (i.e., stores that write the same value that already
exists at a memory location) behave like any other store from a memory
model point of view. Likewise, AMOs which do not actually change the
value in memory (e.g., an AMOMAX for which the value in _rs2_ is smaller
than the value currently in memory) are still semantically considered
store operations. Microarchitectures that attempt to implement silent
stores must take care to ensure that the memory model is still obeyed,
particularly in cases such as RSW (see <<mm-overlap>>), which tend to be
incompatible with silent stores.
* Writes may be merged (i.e., two consecutive writes to the same address
may be merged) or subsumed (i.e., the earlier of two back-to-back writes
to the same address may be elided) as long as the resulting behavior
does not otherwise violate the memory model semantics.

The question of write subsumption can be understood from the following
example:

.Write subsumption litmus test, allowed execution
[float="center",align="center",cols=".^1a,.^1a",frame="none",grid="none",options="noheader"]
|===
|
[%autowidth,float="center",align="center",cols="^,<,^,<",options="header",]
!===
2+!Hart 0 2+^!Hart 1
2+!li t1, 3 2+^!li t3, 2
! !li t2, 1 ! !
!(a) !sw t1,0(s0) !(d) !lw a0,0(s1)
!(b) !fence w, w !(e) !sw a0,0(s0)
!(c) !sw t2,0(s1) !(f) !sw t3,0(s0)
!===
|
!===
//a! graphviz::images/graphviz/litmus_subsumption.txt[]
a! image::graphviz/litmus_subsumption.png[]
!===
|===

As written, if the load (d) reads value _1_, then (a) must
precede (f) in the global memory order:

* (a) precedes (c) in the global memory order because of rule 4 (the
FENCE W,W between them)
* (c) precedes (d) in the global memory order because of the Load
Value axiom
* (d) precedes (e) in the global memory order because of rule 10 (the
store data of (e) has a syntactic data dependency on (d))
* (e) precedes (f) in the global memory order because of rule 1

In other words, the final value of the memory location whose address is
in `s0` must be _2_ (the value written by the store (f)) and
cannot be _3_ (the value written by the store (a)).

A very aggressive microarchitecture might erroneously decide to discard
(e), as (f) supersedes it, and this may in turn lead the
microarchitecture to break the now-eliminated dependency between (d) and
(f) (and hence also between (a) and (f)). This would violate the memory
model rules, and hence it is forbidden. Write subsumption may in other
cases be legal, if, for example, there were no data dependency between
(d) and (e).

==== Possible Future Extensions

We expect that any or all of the following possible future extensions
would be compatible with the RVWMO memory model:

* "V" vector ISA extensions
* "J" JIT extension
* Native encodings for load and store opcodes with _aq_ and _rl_ set
* Fences limited to certain addresses
* Cache writeback/flush/invalidate/etc. instructions

[[discrepancies]]
=== Known Issues

[[mixedrsw]]
==== Mixed-size RSW

[[rsw1]]
.Mixed-size discrepancy (permitted by axiomatic models, forbidden by operational model)
[%autowidth,float="center",align="center",cols="^,<,^,<",options="header",]
|===
2+|Hart 0 2+^|Hart 1
2+|li t1, 1 2+^|li t1, 1
|(a) |lw a0,0(s0) |(d) |lw a1,0(s1)
|(b) |fence rw,rw |(e) |amoswap.w.rl a2,t1,0(s2)
|(c) |sw t1,0(s1) |(f) |ld a3,0(s2)
| | |(g) |lw a4,4(s2)
| | | |xor a5,a4,a4
| | | |add s0,s0,a5
| | |(h) |sw t1,0(s0)
4+|Outcome: `a0=1`, `a1=1`, `a2=0`, `a3=1`, `a4=0`
|===

[[rsw2]]
.Mixed-size discrepancy (permitted by axiomatic models, forbidden by operational model)
[%autowidth,float="center",align="center",cols="^,<,^,<",options="header"]
|===
2+|Hart 0 2+^|Hart 1 
2+|li t1, 1 2+^|li t1, 1
|(a) |lw a0,0(s0) |(d) |ld a1,0(s1)
|(b) |fence rw,rw |(e) |lw a2,4(s1)
|(c) |sw t1,0(s1) | |xor a3,a2,a2
| | | |add s0,s0,a3
| | |(f) |sw t1,0(s0)
4+|Outcome: `a0=1`, `a1=1`, `a2=0`
|===

[[rsw3]]
.Mixed-size discrepancy (permitted by axiomatic models, forbidden by operational model)
[%autowidth,float="center",align="center",cols="^,<,^,<",options="header",]
|===
2+|Hart 0 2+^|Hart 1
2+|li t1, 1 2+^|li t1, 1
|(a) |lw a0,0(s0) |(d) |sw t1,4(s1)
|(b) |fence rw,rw |(e) |ld a1,0(s1)
|(c) |sw t1,0(s1) |(f) |lw a2,4(s1)
| | | |xor a3,a2,a2
| | | |add s0,s0,a3
| | |(g) |sw t1,0(s0)
4+|Outcome: `a0=1`, `a1=0x100000001`, `a2=1`
|===

There is a known discrepancy between the operational and axiomatic
specifications within the family of mixed-size RSW variants shown in
<<rsw1>>-<<rsw3>>.
To address this, we may choose to add something like the following new
PPO rule: Memory operation _a_ precedes memory operation
_b_ in preserved program order (and hence also in the global
memory order) if _a_ precedes _b_ in program
order, _a_ and _b_ both access regular main
memory (rather than I/O regions), _a_ is a load,
_b_ is a store, there is a load _m_ between
_a_ and _b_, there is a byte _x_
that both _a_ and _m_ read, there is no store
between _a_ and _m_ that writes to
_x_, and _m_ precedes _b_ in PPO. In
other words, in herd syntax, we may choose to add
`(po-loc & rsw);ppo;[W]` to PPO. Many implementations will already
enforce this ordering naturally. As such, even though this rule is not
official, we recommend that implementers enforce it nevertheless in
order to ensure forwards compatibility with the possible future addition
of this rule to RVWMO.