Updates to the memory consistency model spec

This giant patch is the result of months of work from a lot of different people in the memory model TG.
author: Daniel Lustig <dlustig@nvidia.com> 2018-05-02 16:31:03 -0700
committer: Daniel Lustig <dlustig@nvidia.com> 2018-05-02 16:31:03 -0700
commit: 03a5e722fc0fe7b94dd0a49f550ff7b41a63f612 (patch)
tree: f6db80e1e442798654d12bc5e9bc151930d49570 /src/memory.tex
parent: 3559c11db55e96e1220c6b032d9d920b1808f151 (diff)
1 files changed, 910 insertions, 1103 deletions
diff --git a/src/memory.tex b/src/memory.tex
index dd0a1fe..d0c71af 100644
--- a/src/memory.tex
+++ b/src/memory.tex
@@ -1,118 +1,130 @@
-\newenvironment{tentative}
-{ \vspace{-0.2in}
-  \begin{quotation}
-  \noindent
-  \color{red}MEMORY MODEL TASK GROUP TO-DO
-
-  \small \em
-  \rule{\linewidth}{1pt}\\
-}
-{ 
-  \end{quotation}
-  \vspace{-0.2in}
-}
-\lstdefinelanguage{alloy}{
-  morekeywords={abstract, sig, extends, pred, fun, fact, no, set, one, lone, let, not, all, iden, some, run, for},
-  morecomment=[l]{//},
-  morecomment=[s]{/*}{*/},
-  commentstyle=\color{green!40!black},
-  keywordstyle=\color{blue!40!black},
-  moredelim=**[is][\color{red}]{@}{@},
-  escapeinside={!}{!},
-}
-\lstset{language=alloy}
-\lstset{aboveskip=0pt}
-\lstset{belowskip=0pt}
-
-\newcommand{\diagram}{(picture coming soon)}
+%%%%Nice fonts in diagrams
+%%Images
+\makeatletter
+% for the fig2dev version on P desktop
+\gdef\SetFigFont#1#2#3#4#5{%
+  \reset@font\fontsize{12}{#2pt}%
+%  \fontfamily{#3}\fontseries{#4}\fontshape{#5}%
+  \fontfamily{\sfdefault}\fontseries{#4}\fontshape{#5}%
+  \selectfont}
+\makeatother
 
 \chapter{RVWMO Explanatory Material}
-\label{sec:explanation}
-This section provides more explanation for the RVWMO memory model, using more informal language and concrete examples.
+\label{sec:memorymodelexplanation}
+This section provides more explanation for RVWMO (Chapter~\ref{ch:memorymodel}), using more informal language and concrete examples.
 These are intended to clarify the meaning and intent of the axioms and preserved program order rules.
-%In case of any discrepancy between the informal descriptions here and the formal descriptions elsewhere, the formal definitions should be considered authoritative.
 
 \section{Why RVWMO?}
-\label{sec:whynottso}
+\label{sec:whyrvwmo}
 
 Memory consistency models fall along a loose spectrum from weak to strong.
-Weak memory models (e.g., ARMv7, Power, Alpha) allow more hardware implementation flexibility and deliver arguably better performance, performance per watt, power, scalability, and hardware verification overheads than strong models, at the expense of a more complex programming model.
-%Models which are too weak may not even be properly analyzable with modern formal analysis techniques.
-Strong models (e.g., sequential consistency, TSO) provide simpler programming models, but at the cost of imposing more restrictions on the kinds of hardware optimizations that can be performed in the pipeline and in the memory system, with some cost to power and area overheads, and with some added hardware verification burden.
+Weak memory models allow more hardware implementation flexibility and deliver arguably better performance, performance per watt, power, scalability, and hardware verification overheads than strong models, at the expense of a more complex programming model.
+Strong models provide simpler programming models, but at the cost of imposing more restrictions on the kinds of (non-speculative) hardware optimizations that can be performed in the pipeline and in the memory system, and in turn imposing some cost in terms of power, area overhead, and verification burden.
 
-For the base ISA, RISC-V has chosen the RVWMO memory model, which is a variant of release consistency.
+RISC-V has chosen the RVWMO memory model, a variant of release consistency.
 This places it in between the two extremes of the memory model spectrum.
-It is not as weak as the Power memory model, and this buys back some programming model simplicity without giving up very much in terms of performance.
-RVWMO is also not as restrictive as RVTSO, and hence it remains weak enough to ensure that implementations can be performant and scalable without incurring huge hardware complexity overheads.
-RVWMO is similar to the ARMv8 memory model in this regard.
-
-As such, the RVWMO memory model enables architects to build simple implementations, aggressive implementations, implementations embedded deeply inside a much larger system and subject to complex memory system interactions, or any number of other possibilities, all while simultaneously being strong enough to support programming language memory models at high performance.
+The RVWMO memory model enables architects to build simple implementations, aggressive implementations, implementations embedded deeply inside a much larger system and subject to complex memory system interactions, or any number of other possibilities, all while simultaneously being strong enough to support programming language memory models at high performance.
 
-The risk of a weak memory model lies in the complexity of the programming model.
-Buggy code which ``just worked'' on stronger implementations may well break on more aggressive implementations due to the bugs simply not manifesting on the stronger-than-necessary implementations.
-For these situations, though, the root cause is the bug in the original software, not the memory model itself.
-The risk of finding short-term bugs in code ported from other architectures is outweighed by the long-term benefits that the weak memory model delivers more generally.
-
-To mitigate this risk, some hardware implementations may choose to stick with RVTSO, and that is perfectly acceptable and perfectly compatible with the RVWMO memory model.
-The cost that the weak memory model imposes on such implementations is the incremental overhead of fetching instructions (e.g., {\tt fence~r,rw} and {\tt fence rw,w}) which become no-ops on that implementation.
-(These fences must remain present in the code to ensure compatibility with other more weakly-ordered RISC-V implementations.)
-
-Most software is also fully compatible with weak memory models.
-C/C++, Java, and Linux, to name some of the most notable and more formally analyzed examples, are all entirely compatible with weak non-atomic memory models, as all are designed to run not just on x86 but also on ARM, Power, and many other architectures.
-It is true that some code, e.g., code ported from x86, does sometimes (correctly or incorrectly) assume a stronger model such as TSO.
-For such code, the RVWMO memory model provides a means for restoring TSO to sections of code through fences and atomics with {\tt .aq} and {\tt .rl} bits in the ``A'' extension, until such code can be ported to RVWMO over time.
-
-Designers who wish to provide drop-in compatibility with x86 code can also implement the Ztso extension which enforces RVTSO.
+To facilitate the porting of code from other architectures, some hardware implementations may choose to implement the Ztso extension, which provides stricter RVTSO ordering semantics by default.
 Code written for RVWMO is automatically and inherently compatible with RVTSO, but code written assuming RVTSO is not guaranteed to run correctly on RVWMO implementations.
-In fact, RVWMO implementations will (and should) simply refuse to run TSO-only binaries.
+In fact, most RVWMO implementations will (and should) simply refuse to run RVTSO-only binaries.
 Each implementation must therefore choose whether to prioritize compatibility with RVTSO code (e.g., to facilitate porting from x86) or whether to instead prioritize compatibility with other RISC-V cores implementing RVWMO.
 
+Some fences and/or memory ordering annotations in code written for RVWMO may become redundant under RVTSO; the cost that the default of RVWMO imposes on Ztso implementations is the incremental overhead of fetching those fences (e.g., FENCE~R,RW and FENCE~RW,W) which become no-ops on that implementation.
+However, these fences must remain present in the code if compatibility with non-Ztso implementations is desired.
+
 \section{Litmus Tests}
 The explanations in this chapter make use of {\em litmus tests}, or small programs designed to test or highlight one particular aspect of a memory model.
 Figure~\ref{fig:litmus:sample} shows an example of a litmus test with two harts.
-For this figure (and for all figures that follow in this chapter), we assume that {\tt s0}--{\tt s2} are pre-set to the same value in all harts.
-As a convention, we will assume that {\tt s0} holds the address labeled {\tt x}, {\tt s1} holds {\tt y}, and {\tt s2} holds {\tt z}, where {\tt x}, {\tt y}, and {\tt z} are different memory addresses.
-This figure shows the same program twice: on the left in RISC-V assembly, and again on the right in graphical form.
+As a convention for this figure and for all figures that follow in this chapter, we assume that {\tt s0}--{\tt s2} are pre-set to the same value in all harts and that {\tt s0} holds the address labeled {\tt x}, {\tt s1} holds {\tt y}, and {\tt s2} holds {\tt z}, where {\tt x}, {\tt y}, and {\tt z} are disjoint memory locations aligned to 8-byte boundaries.
+Each figure shows the litmus test code on the left, and a visualization of one particular valid or invalid execution on the right.
 
 \begin{figure}[h!]
   \centering
-  {
+    \begin{tabular}{m{.4\linewidth}m{.05\linewidth}m{.4\linewidth}}
     \tt\small
     \begin{tabular}{cl||cl}
     \multicolumn{2}{c}{Hart 0} & \multicolumn{2}{c}{Hart 1} \\
     \hline
           & $\vdots$    &     & $\vdots$    \\
-          & li t1, 1    &     & li t4, 4    \\
+          & li t1,1     &     & li t4,4     \\
       (a) & sw t1,0(s0) & (e) & sw t4,0(s0) \\
           & $\vdots$    &     & $\vdots$    \\
-          & li t2, 2    &     &             \\
+          & li t2,2     &     &             \\
       (b) & sw t2,0(s0) &     &             \\
           & $\vdots$    &     & $\vdots$    \\
       (c) & lw a0,0(s0) &     &             \\
           & $\vdots$    &     & $\vdots$    \\
-          & li t3, 3    &     & li t5, 5    \\
+          & li t3,3     &     & li t5,5     \\
       (d) & sw t3,0(s0) & (f) & sw t5,0(s0) \\
           & $\vdots$    &     & $\vdots$    \\
     \end{tabular}
-  }
-  ~~~~
-  \diagram
-  \caption{A sample litmus test}
+    & &
+    \input{figs/litmus_sample.pdf_t}
+\end{tabular}
+    \caption{A sample litmus test and one forbidden execution ({\tt a0=1}).}
   \label{fig:litmus:sample}
 \end{figure}
 
 Litmus tests are used to understand the implications of the memory model in specific concrete situations.
 For example, in the litmus test of Figure~\ref{fig:litmus:sample}, the final value of {\tt a0} in the first hart can be either 2, 4, or 5, depending on the dynamic interleaving of the instruction stream from each hart at runtime.
-However, in this example, the final value of {\tt a0} in Hart 0 will never be 1 or 3: the value 1 will no longer be visible at the time the load executes, and the value 3 will not yet be visible by the time the load executes.
-
+However, in this example, the final value of {\tt a0} in Hart 0 will never be 1 or 3; intuitively, the value 1 will no longer be visible at the time the load executes, and the value 3 will not yet be visible by the time the load executes.
 We analyze this test and many others below.
 
+\begin{table}[h]
+  \centering\small
+  \begin{tabular}{|c|l|}
+    \hline
+    Edge & Full Name (and explanation) \\
+    \hline
+    \sf rf   & Reads From (from each store to the loads that return a value written by that store) \\
+    \hline
+    \sf co   & Coherence (a total order on the stores to each address) \\
+    \hline
+    \sf fr   & From-Reads (from each load to co-successors of the store from which the load returned a value) \\
+    \hline
+    \sf ppo  & Preserved Program Order \\
+    \hline
+    \sf fence & Orderings enforced by a FENCE instruction \\
+    \hline
+    \sf addr & Address Dependency \\
+    \hline
+    \sf ctrl & Control Dependency \\
+    \hline
+    \sf data & Data Dependency \\
+    \hline
+  \end{tabular}
+  \caption{A key for the litmus test diagrams drawn in this appendix}
+  \label{tab:litmus:key}
+\end{table}
+
+The diagram shown to the right of each litmus test shows a visual representation of the particular execution candidate being considered.
+These diagrams use a notation that is common in the memory model literature for constraining the set of possible global memory orders that could produce the execution in question.
+It is also the basis for the \textsf{herd} models presented in Appendix~\ref{sec:herd}.
+This notation is explained in Table~\ref{tab:litmus:key}.
+Of the listed relations, {\sf rf} edges between harts, {\sf co} edges, {\sf fr} edges, and {\sf ppo} edges directly constrain the global memory order (as do {\sf fence}, {\sf addr}, {\sf data}, and some {\sf ctrl} edges, via {\sf ppo}).
+Other edges (such as intra-hart {\sf rf} edges) are informative but do not constrain the global memory order.
+
+For example, in Figure~\ref{fig:litmus:sample}, {\tt a0=1} could occur only if one of the following were true:
+\begin{itemize}
+  \item (b) appears before (a) in global memory order (and in the coherence order {\sf co}).  However, this violates RVWMO PPO rule~\ref{ppo:->st}.  The {\sf co} edge from (b) to (a) highlights this contradiction.
+  \item (a) appears before (b) in global memory order (and in the coherence order {\sf co}).  However, in this case, the Load Value Axiom would be violated, because (a) is not the latest matching store prior to (c) in program order.  The {\sf fr} edge from (c) to (b) highlights this contradiction.
+\end{itemize}
+Since neither of these scenarios satisfies the RVWMO axioms, the outcome {\tt a0=1} is forbidden.
+
+Beyond what is described in this appendix, a suite of more than seven thousand litmus tests is available at \url{http://diy.inria.fr/cats7/riscv/}.
+
+\begin{commentary}
+  In the future, we expect to adapt these memory model litmus tests for use as part of the RISC-V compliance test suite as well.
+\end{commentary}
+
 \section{Explaining the RVWMO Rules}
 In this section, we provide explanation and examples for all of the RVWMO rules and axioms.
 
 \subsection{Preserved Program Order and Global Memory Order}
-Preserved program order represents the set of intra-hart orderings that the hart's pipeline must ensure are maintained as the instructions execute, even in the presence of hardware optimizations that might otherwise reorder those operations.
-Events from the same hart which are not ordered by preserved program order, on the other hand, may appear reordered from the perspective of other harts and/or observers.
+Preserved program order represents the subset of program order that must be respected within the global memory order.
+Conceptually, events from the same hart that are ordered by preserved program order must appear in that order from the perspective of other harts and/or observers.
+Events from the same hart that are not ordered by preserved program order, on the other hand, may appear reordered from the perspective of other harts and/or observers.
 
 Informally, the global memory order represents the order in which loads and stores perform.
 The formal memory model literature has moved away from specifications built around the concept of performing, but the idea is still useful for building up informal intuition.
@@ -122,12 +134,14 @@ In this sense, the global memory order also represents the contribution of the c
 
 The order in which loads perform does not always directly correspond to the relative age of the values those two loads return.
 In particular, a load $b$ may perform before another load $a$ to the same address (i.e., $b$ may execute before $a$, and $b$ may appear before $a$ in the global memory order), but $a$ may nevertheless return an older value than $b$.
-This discrepancy captures the reordering effects of store buffers placed between the core and memory: a younger load may read from a value in the store buffer, while an older load which appears before that store in program order may ignore that younger store and read an older value from memory instead.
+This discrepancy captures (among other things) the reordering effects of buffering placed between the core and memory.
+For example, $b$ may have returned a value from a store in the store buffer, while $a$ may have ignored that younger store and read an older value from memory instead.
 To account for this, at the time each load performs, the value it returns is determined by the load value axiom, not just strictly by determining the most recent store to the same address in the global memory order, as described below.
 
-\subsection{Store Buffering (Load Value Axiom)}
+\subsection{\nameref*{rvwmo:ax:load}}
+\label{sec:memory:loadvalueaxiom}
 \begin{tabular}{p{1cm}|p{12cm}} &
-Load value axiom:\loadvalueaxiom
+\nameref{rvwmo:ax:load}: \loadvalueaxiom
 \end{tabular}
 
 Preserved program order is {\em not} required to respect the ordering of a store followed by a load to an overlapping address.
@@ -137,6 +151,7 @@ Any other hart will therefore observe the load as performing before the store.
 
 \begin{figure}[h!]
   \centering
+  \begin{tabular}{m{.4\linewidth}@{\qquad}m{.4\linewidth}}
   {
     \tt\small
     \begin{tabular}{cl||cl}
@@ -147,17 +162,20 @@ Any other hart will therefore observe the load as performing before the store.
       (b) & lw a0,0(s0) & (f) & lw a2,0(s1) \\
       (c) & fence r,r   & (g) & fence r,r   \\
       (d) & lw a1,0(s1) & (h) & lw a3,0(s0) \\
+      \hline
+      \multicolumn{4}{c}{Outcome: {\tt a0=1}, {\tt a1=0}, {\tt a2=1}, {\tt a3=0}}
     \end{tabular}
   }
-  ~~~~
-  \diagram
-  \caption{A store buffer forwarding litmus test}
+  &
+  \input{figs/litmus_sb_fwd.pdf_t}
+  \end{tabular}
+  \caption{A store buffer forwarding litmus test (outcome permitted)}
   \label{fig:litmus:storebuffer}
 \end{figure}
 
 Consider the litmus test of Figure~\ref{fig:litmus:storebuffer}.
 When running this program on an implementation with store buffers, it is possible to arrive at the final outcome
-{\tt a0=1, a1=0, a2=1, a3=0}
+{\tt a0=1}, {\tt a1=0}, {\tt a2=1}, {\tt a3=0}
 as follows:
 \begin{itemize}
   \item (a) executes and enters the first hart's private store buffer
@@ -185,21 +203,236 @@ memory access $a$ precedes memory access $b$ in preserved program order (and hen
   \item (h) precedes (a): by the load value axiom, as above.
 \end{itemize}
 The global memory order must be a total order and cannot be cyclic, because a cycle would imply that every event in the cycle happens before itself, which is impossible.
-Therefore, the execution proposed above would be forbidden, and hence the addition of rule X would break the memory model.
+Therefore, the execution proposed above would be forbidden, and hence the addition of rule X would forbid implementations with store buffer forwarding, which would clearly be undesirable.
 
 Nevertheless, even if (b) precedes (a) and/or (f) precedes (e) in the global memory order, the only sensible possibility in this example is for (b) to return the value written by (a), and likewise for (f) and (e).  This combination of circumstances is what leads to the second option in the definition of the load value axiom.
 Even though (b) precedes (a) in the global memory order, (a) will still be visible to (b) by virtue of sitting in the store buffer at the time (b) executes.
 Therefore, even if (b) precedes (a) in the global memory order, (b) should return the value written by (a) because (a) precedes (b) in program order.
 Likewise for (e) and (f).
 
-\subsection{Same-Address Orderings, Part 1 (Rule~\ref{ppo:->st})}
+\begin{figure}[h!]
+  \centering
+  \begin{tabular}{m{.4\linewidth}@{\qquad}m{.4\linewidth}}
+  {
+    \tt\small
+    \begin{tabular}{cl||cl}
+    \multicolumn{2}{c}{Hart 0} & \multicolumn{2}{c}{Hart 1} \\
+    \hline
+          & li t1, 1    &     & li t1, 1      \\
+      (a) & sw t1,0(s0) &     & LOOP:         \\
+      (b) & fence w,w   & (d) & lw a0,0(s1)   \\
+      (c) & sw t1,0(s1) &     & beqz a0, LOOP \\
+          &             & (e) & sw t1,0(s2)   \\
+          &             & (f) & lw a1,0(s2)   \\
+          &             &     & xor a2,a1,a1  \\
+          &             &     & add s0,s0,a2  \\
+          &             & (g) & lw a2,0(s0)   \\
+      \hline
+      \multicolumn{4}{c}{Outcome: {\tt a0=1}, {\tt a1=1}, {\tt a2=0}}
+    \end{tabular}
+  }
+  &
+  \input{figs/litmus_ppoca.pdf_t}
+  \end{tabular}
+  \caption{The ``PPOCA'' store buffer forwarding litmus test (outcome permitted)}
+  \label{fig:litmus:ppoca}
+\end{figure}
+
+Another test that highlights the behavior of store buffers is shown in Figure~\ref{fig:litmus:ppoca}.
+In this example, (d) is ordered before (e) because of the control dependency, and (f) is ordered before (g) because of the address dependency.
+However, (e) is {\em not} necessarily ordered before (f), even though (f) returns the value written by (e).
+This could correspond to the following sequence of events:
+\begin{itemize}
+  \item (e) executes speculatively and enters the second hart's private store buffer (but does not drain to memory)
+  \item (f) executes speculatively and forwards its return value 1 from (e) in the store buffer
+  \item (g) executes speculatively and reads the value 0 from memory
+  \item (a) executes, enters the first hart's private store buffer, and drains to memory
+  \item (b) executes and retires
+  \item (c) executes, enters the first hart's private store buffer, and drains to memory
+  \item (d) executes and reads the value 1 from memory
+  \item (e), (f), and (g) commit, since the speculation turned out to be correct
+  \item (e) drains from the store buffer to memory
+\end{itemize}
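+As a sketch of how this sequence maps onto the model, one global memory order consistent with these events is
+\[
+  \mbox{(f)} < \mbox{(g)} < \mbox{(a)} < \mbox{(c)} < \mbox{(d)} < \mbox{(e)},
+\]
+where $<$ is used here only as informal shorthand for ``appears earlier in the global memory order''.
+This order respects the address dependency from (f) to (g), the fence ordering from (a) to (c), and the control dependency from (d) to (e).
+It is also consistent with the values returned: (g) precedes (a) and reads 0, (d) follows (c) and reads 1, and (f) returns the value written by (e) even though (f) precedes (e), because (e) precedes (f) in program order.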
+
+\subsection{\nameref*{rvwmo:ax:atom}}
+\label{sec:memory:atomicityaxiom}
 \begin{tabular}{p{1cm}|p{12cm}} &
-Rule \ref{ppo:->st}: \ppost
+\nameref{rvwmo:ax:atom} (for Aligned Atomics): \atomicityaxiom
+\end{tabular}
+
+The RISC-V architecture decouples the notion of atomicity from the notion of ordering.  Unlike atomics in architectures that implement TSO, RISC-V atomics under RVWMO do not impose any ordering requirements by default.  Ordering semantics are only guaranteed by the PPO rules that otherwise apply.
+
+RISC-V contains two types of atomics: AMOs and LR/SC pairs.
+These conceptually behave differently, in the following way.
+LR/SC behave as if the old value is brought up to the core, modified, and written back to memory, all while a reservation is held on that memory location.
+AMOs on the other hand conceptually behave as if they are performed directly in memory.
+AMOs are therefore inherently atomic, while LR/SC pairs are atomic in the slightly different sense that the memory location in question will not be modified by another hart during the time the original hart holds the reservation.
+
+\begin{figure}[h!]
+  \centering\small
+  \begin{verbbox}
+  (a) lr.d a0, 0(s0)
+  (b) sd   t1, 0(s0)
+  (c) sc.d t2, 0(s0)
+  \end{verbbox}
+  \theverbbox
+  ~~~~~~
+  \begin{verbbox}
+  (a) lr.d a0, 0(s0)
+  (b) sw   t1, 4(s0)
+  (c) sc.d t2, 0(s0)
+  \end{verbbox}
+  \theverbbox
+  ~~~~~~
+  \begin{verbbox}
+  (a) lr.w a0, 0(s0)
+  (b) sw   t1, 4(s0)
+  (c) sc.w t2, 0(s0)
+  \end{verbbox}
+  \theverbbox
+  ~~~~~~
+  \begin{verbbox}
+  (a) lr.w a0, 0(s0)
+  (b) sw   t1, 4(s0)
+  (c) sc.w t2, 8(s0)
+  \end{verbbox}
+  \theverbbox
+  \caption{In all four (independent) code snippets, the store-conditional (c) is permitted but not guaranteed to succeed}
+  \label{fig:litmus:lrsdsc}
+\end{figure}
+
+The atomicity axiom forbids stores from other harts from being interleaved in global memory order between an LR and the SC paired with that LR.
+The atomicity axiom does not forbid loads from being interleaved between the paired operations in program order or in the global memory order, nor does it forbid stores from the same hart or stores to non-overlapping locations from appearing between the paired operations in either program order or in the global memory order.
+For example, the SC instructions in Figure~\ref{fig:litmus:lrsdsc} may (but are not guaranteed to) succeed.
+None of those successes would violate the atomicity axiom, because the intervening non-conditional stores are from the same hart as the paired load-reserved and store-conditional instructions.
+This way, a memory system that tracks memory accesses at cache line granularity (and which therefore will see the four snippets of Figure~\ref{fig:litmus:lrsdsc} as identical) will not be forced to fail a store conditional instruction that happens to (falsely) share another portion of the same cache line as the memory location being held by the reservation.
+
+The atomicity axiom also technically supports cases in which the LR and SC touch different addresses and/or use different access sizes; however, use cases for such behaviors are expected to be rare in practice.
+Likewise, scenarios in which stores from the same hart between an LR/SC pair actually overlap the memory location(s) referenced by the LR or SC are expected to be rare compared to scenarios where the intervening store may simply fall onto the same cache line.
+
+\subsection{\nameref*{rvwmo:ax:prog}}
+\label{sec:memory:progress}
+\begin{tabular}{p{1cm}|p{12cm}} &
+\nameref{rvwmo:ax:prog}: \progressaxiom
+\end{tabular}
+
+The progress axiom ensures a minimal forward progress guarantee.
+It ensures that stores from one hart will eventually be made visible to other harts in the system in a finite amount of time, and that loads from other harts will eventually be able to read those values (or successors thereof).
+Without this rule, it would be legal, for example, for a spinlock to spin infinitely on a value, even with a store from another hart waiting to unlock the spinlock.
+
+The progress axiom is intended not to impose any other notion of fairness, latency, or quality of service onto the harts in a RISC-V implementation.
+Any stronger notions of fairness are up to the rest of the ISA and/or up to the platform and/or device to define and implement.
+
+The forward progress axiom will in almost all cases be naturally satisfied by any standard cache coherence protocol.
+Implementations with non-coherent caches may have to provide some other mechanism to ensure the eventual visibility of all stores (or successors thereof) to all harts.
+
+\subsection{Overlapping-Address Orderings (Rules~\ref{ppo:->st}--\ref{ppo:amoforward})}
+\label{sec:memory:overlap}
+\begin{tabular}{p{1cm}|p{12cm}}
+  & Rule \ref{ppo:->st}: \ppost \\
+  & Rule \ref{ppo:rdw}: \ppordw \\
+  & Rule \ref{ppo:amoforward}: \ppoamoforward \\
 \end{tabular}
 
 Same-address orderings where the latter is a store are straightforward: a load or store can never be reordered with a later store to an overlapping memory location.  From a microarchitecture perspective, generally speaking, it is difficult or impossible to undo a speculatively reordered store if the speculation turns out to be invalid, so such behavior is simply disallowed by the model.
+Same-address orderings from a store to a later load, on the other hand, do not need to be enforced.
+As discussed in Section~\ref{sec:memory:loadvalueaxiom}, this reflects the observable behavior of implementations that forward values from buffered stores to later loads.
+
+Same-address load-load ordering requirements are far more subtle.
+The basic requirement is that a younger load must not return a value that is older than a value returned by an older load in the same hart to the same address.  This is often known as ``CoRR'' (Coherence for Read-Read pairs), or as part of a broader ``coherence'' or ``sequential consistency per location'' requirement.
+Some architectures in the past have relaxed same-address load-load ordering, but in hindsight this is generally considered to complicate the programming model too much, and so RVWMO requires CoRR ordering to be enforced.
+However, because the global memory order corresponds to the order in which loads perform rather than the ordering of the values being returned, capturing CoRR requirements in terms of the global memory order requires a bit of indirection.
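+
+As an informal sketch (the symbols $a$, $b$, $s_a$, and $s_b$ are introduced here only for illustration), the CoRR requirement can be written in terms of the relations of Table~\ref{tab:litmus:key}.
+If loads $a$ and $b$ in the same hart access the same address, $a$ precedes $b$ in program order, and $a$ and $b$ return values written by stores $s_a$ and $s_b$ respectively, then
+\[
+  (s_b, s_a) \notin {\sf co},
+\]
+i.e., $s_b$ must not precede $s_a$ in the coherence order.
+Note that this constrains the values returned, not the positions of $a$ and $b$ in the global memory order.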
+
+\begin{figure}[h!]
+  \centering
+  \begin{tabular}{m{.4\linewidth}@{\qquad}m{.4\linewidth}}
+    {\tt\small
+    \begin{tabular}{cl||cl}
+    \multicolumn{2}{c}{Hart 0} & \multicolumn{2}{c}{Hart 1} \\
+    \hline
+          & li t1, 1    &     & li~ t2, 2    \\
+      (a) & sw t1,0(s0) & (d) & lw~ a0,0(s1) \\
+      (b) & fence w, w  & (e) & sw~ t2,0(s1) \\
+      (c) & sw t1,0(s1) & (f) & lw~ a1,0(s1) \\
+          &             & (g) & xor t3,a1,a1 \\
+          &             & (h) & add s0,s0,t3 \\
+          &             & (i) & lw~ a2,0(s0) \\
+      \hline
+      \multicolumn{4}{c}{Outcome: {\tt a0=1}, {\tt a1=2}, {\tt a2=0}}
+    \end{tabular}
+  }
+  &
+  \input{figs/litmus_mp_fenceww_fri_rfi_addr.pdf_t}
+  \end{tabular}
+  \caption{Litmus test MP+fence.w.w+fri-rfi-addr (outcome permitted)}
+  \label{fig:litmus:frirfi}
+\end{figure}
+
+Consider the litmus test of Figure~\ref{fig:litmus:frirfi}, which is one particular instance of the more general ``fri-rfi'' pattern.
+The term ``fri-rfi'' refers to the sequence (d), (e), (f): (d) ``from-reads'' (i.e., reads from an earlier write than) (e), which is in the same hart, and (f) reads from (e), which is also in the same hart.
+
+From a microarchitectural perspective, outcome {\tt a0=1}, {\tt a1=2}, {\tt a2=0} is legal (as are various other less subtle outcomes).  Intuitively, the following would produce the outcome in question:
+\begin{itemize}
+  \item (d) stalls (for whatever reason; perhaps it's stalled waiting for some other preceding instruction)
+  \item (e) executes and enters the store buffer (but does not yet drain to memory)
+  \item (f) executes and forwards from (e) in the store buffer
+  \item (g), (h), and (i) execute
+  \item (a) executes and drains to memory, (b) executes, and (c) executes and drains to memory
+  \item (d) unstalls and executes
+  \item (e) drains from the store buffer to memory
+\end{itemize}
+This corresponds to a global memory order of (f), (i), (a), (c), (d), (e).
+Note that even though (f) performs before (d), the value returned by (f) is newer than the value returned by (d).
+Therefore, this execution is legal and does not violate the CoRR requirements.
 
-Same-address load-load orderings are far more subtle; see Chapter~\ref{sec:ppo:rdw}.
+Likewise, if two back-to-back loads return the values written by the same store, then they may also appear out-of-order in the global memory order without violating CoRR.  Note that this is not the same as saying that the two loads return the same value, since two different stores may write the same value.
+
+\begin{figure}[h!]
+  \centering
+  \begin{tabular}{m{.4\linewidth}@{\qquad\quad}m{.6\linewidth}}
+  {
+    \tt\small
+    \begin{tabular}{cl||cl}
+    \multicolumn{2}{c}{Hart 0} & \multicolumn{2}{c}{Hart 1} \\
+    \hline
+          & li t1, 1    & (d) & lw~ a0,0(s1) \\
+      (a) & sw t1,0(s0) & (e) & xor t2,a0,a0 \\
+      (b) & fence w, w  & (f) & add s4,s2,t2 \\
+      (c) & sw t1,0(s1) & (g) & lw~ a1,0(s4) \\
+          &             & (h) & lw~ a2,0(s2) \\
+          &             & (i) & xor t3,a2,a2 \\
+          &             & (j) & add s0,s0,t3 \\
+          &             & (k) & lw~ a3,0(s0) \\
+      \hline
+      \multicolumn{4}{c}{Outcome: {\tt a0=1}, {\tt a1=$v$}, {\tt a2=$v$}, {\tt a3=0}}
+    \end{tabular}
+  }
+  &
+  \input{figs/litmus_rsw.pdf_t}
+   \end{tabular}
+  \caption{Litmus test RSW (outcome permitted)}
+  \label{fig:litmus:rsw}
+\end{figure}
+
+Consider the litmus test of Figure~\ref{fig:litmus:rsw}.
+The outcome {\tt a0=1}, {\tt a1=$v$},  {\tt a2=$v$}, {\tt a3=0} (where $v$ is some value written by another hart) can be observed by allowing (g) and (h) to be reordered.  This might be done speculatively, and the speculation can be justified by the microarchitecture (e.g., by snooping for cache invalidations and finding none) because replaying (h) after (g) would return the value written by the same store anyway.
+Hence, provided {\tt a1} and {\tt a2} return the value written by the same store, (g) and (h) can be legally reordered.
+The global memory order corresponding to this execution would be (h),(k),(a),(c),(d),(g).
+
+Executions of the test in Figure~\ref{fig:litmus:rsw} in which {\tt a1} does not equal {\tt a2} do in fact require that (g) appears before (h) in the global memory order.
+Allowing (h) to appear before (g) in the global memory order would in that case result in a violation of CoRR, because then (h) would return an older value than that returned by (g).
+Therefore, PPO rule~\ref{ppo:rdw} forbids this CoRR violation from occurring.
+As such, PPO rule~\ref{ppo:rdw} strikes a careful balance between enforcing CoRR in all cases and remaining weak enough to permit the ``RSW'' and ``fri-rfi'' patterns that commonly appear in real microarchitectures.
+
+There is one more overlapping-address rule: PPO rule~\ref{ppo:amoforward} simply states that a value cannot be returned from an AMO or SC to a subsequent load until the AMO or SC has (in the case of the SC, successfully) performed globally.
+This follows somewhat naturally from the conceptual view that both AMOs and SC instructions are meant to be performed atomically in memory.
+However, notably, PPO rule~\ref{ppo:amoforward} states that hardware may not even non-speculatively forward the value being stored by an AMOSWAP to a subsequent load, even though for AMOSWAP that store value is not actually semantically dependent on the previous value in memory, as is the case for the other AMOs.
+The same holds true even when forwarding from SC store values that are not semantically dependent on the value returned by the paired LR.
+
+The three PPO rules above also apply when the memory accesses in question only overlap partially.
+This can occur, for example, when accesses of different sizes are used to access the same object.
+Note also that two memory accesses need not have the same base address in order to overlap.
+When misaligned memory accesses are being used, the overlapping-address PPO rules apply to each of the component memory accesses independently.
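+
+As a minimal sketch of a partial overlap (the register assignments are hypothetical and chosen only for illustration), consider a word load followed by a byte store into the middle of the same word:
+\begin{verbatim}
+    lw  a0,0(s0)    # accesses bytes 0-3 at the address in s0
+    sb  t1,2(s0)    # accesses byte 2 only: different size, different offset
+\end{verbatim}
+The two accesses overlap on a single byte even though their sizes and offsets differ.  Since the later access is a store, the overlapping-address rule covering a later store would be expected to place the load before the store in the global memory order.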
 
 \begin{comment}
 The formal model captures this as follows:
@@ -209,52 +442,54 @@ The formal model captures this as follows:
 \end{itemize}
 \end{comment}
 
-\subsection{Fences (Rule~\ref{ppo:fence})}\label{sec:fence}
+\subsection{Fences (Rule~\ref{ppo:fence})}
+\label{sec:mm:fence}
 \begin{tabular}{p{1cm}|p{12cm}} &
 Rule \ref{ppo:fence}: \ppofence
 \end{tabular}
 
-By default, the {\tt fence} instruction ensures that all memory accesses from instructions preceding the fence in program order (the ``predecessor set'') appear earlier in the global memory order than memory accesses from instructions appearing after the fence in program order (the ``successor set'').
+By default, the FENCE instruction ensures that all memory accesses from instructions preceding the fence in program order (the ``predecessor set'') appear earlier in the global memory order than memory accesses from instructions appearing after the fence in program order (the ``successor set'').
 However, fences can optionally further restrict the predecessor set and/or the successor set to  a smaller set of memory accesses in order to provide some speedup.
-Specifically, fences have {\tt .pr}, {\tt .pw}, {\tt .sr}, and {\tt .sw} bits which restrict the predecessor and/or successor sets.
-The predecessor set includes loads (resp.\@ stores) if and only if {\tt .pr} (resp.\@ {\tt .pw}) is set.
-Similarly, the successor set includes loads (resp.\@ stores) if and only if {\tt .sr} (resp.\@ {\tt .sw}) is set.
+Specifically, fences have PR, PW, SR, and SW bits which restrict the predecessor and/or successor sets.
+The predecessor set includes loads (resp.\@ stores) if and only if PR (resp.\@ PW) is set.
+Similarly, the successor set includes loads (resp.\@ stores) if and only if SR (resp.\@ SW) is set.
 
-The full RISC-V opcode encoding currently has nine non-trivial combinations of the four bits {\tt pr}, {\tt pw}, {\tt sr}, and {\tt sw}, plus one extra encoding which is expected to be added to facilitate mapping of ``acquire+release'' or TSO semantics.
+The FENCE encoding currently has nine non-trivial combinations of the four bits PR, PW, SR, and SW, plus one extra encoding FENCE.TSO which facilitates mapping of ``acquire+release'' or RVTSO semantics.
 The remaining seven combinations have empty predecessor and/or successor sets and hence are no-ops.
 Of the ten non-trivial options, only six are commonly used in practice:
-{\tt
+{
 \begin{itemize}
-  \item fence rw,rw
-  \item fence.tso \textrm{(i.e., a combined {\tt fence r,rw} $+$ {\tt fence rw,w})}
-  \item fence rw,w
-  \item fence r,rw
-  \item fence r,r
-  \item fence w,w
+  \item FENCE RW,RW
+  \item FENCE.TSO
+  \item FENCE RW,W
+  \item FENCE R,RW
+  \item FENCE R,R
+  \item FENCE W,W
 \end{itemize}
 }
-We strongly recommend that programmers stick to these six, as these are the best understood.  {\tt fence} instructions using any other combination of {\tt .pr}, {\tt .pw}, {\tt .sr}, and {\tt .sw} are reserved.
+FENCE instructions using any other combination of PR, PW, SR, and SW are reserved.  We strongly recommend that programmers stick to these six.
+Other combinations may have unknown or unexpected interactions with the memory model.
 
-Finally, we note that since RISC-V uses a multi-copy atomic memory model, programmers can reason about fences and the {\tt .aq} and {\tt .rl} bits in a thread-local manner.  There is no complex notion of ``fence cumulativity'' as found in memory models which are not multi-copy atomic.
+Finally, we note that since RISC-V uses a multi-copy atomic memory model, programmers can reason about fence bits in a thread-local manner.  There is no complex notion of ``fence cumulativity'' as found in memory models that are not multi-copy atomic.
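+
+As an illustrative sketch of this thread-local reasoning, consider the standard message-passing idiom below.  The register assignments and initial values are assumptions made for the example: {\tt 0(s0)} holds data, {\tt 0(s1)} holds a flag, and both are initially zero.
+\begin{verbatim}
+    # Hart 0                        # Hart 1
+    li    t1,1
+    sw    t1,0(s0)  # write data    lw    a0,0(s1)  # read flag
+    fence w,w                       fence r,r
+    sw    t1,0(s1)  # set flag      lw    a1,0(s0)  # read data
+\end{verbatim}
+FENCE~W,W orders the two stores on hart 0 and FENCE~R,R orders the two loads on hart 1, so the outcome {\tt a0=1}, {\tt a1=0} is forbidden.  Each fence can be justified by looking only at the hart on which it appears.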
 
-\subsection{Acquire/Release Ordering (Rules~\ref{ppo:acquire}--\ref{ppo:amoload})}\label{sec:acqrel}
+\subsection{Explicit Synchronization (Rules~\ref{ppo:acquire}--\ref{ppo:pair})}
+\label{sec:memory:acqrel}
 \begin{tabular}{p{1cm}|p{12cm}}
   & Rule \ref{ppo:acquire}: \ppoacquire \\
-  %& Rule \ref{ppo:loadtoacq}: \ppoloadtoacq \\
   & Rule \ref{ppo:release}: \pporelease \\
-  & Rule \ref{ppo:strongacqrel}: \ppostrongacqrel \\
-  & Rule \ref{ppo:amostore}: \ppoamostore \\
-  & Rule \ref{ppo:amoload}: \ppoamoload \\
+  & Rule \ref{ppo:rcsc}: \pporcsc \\
+  & Rule \ref{ppo:pair}: \ppopair \\
 \end{tabular}
 
-An {\em acquire} operation is used at the start of a critical section.  The general requirement for acquire semantics is that all loads and stores inside the critical section are up to date with respect to the synchronization variable being used to protect it.  In other words, an acquire operation requires load-to-load/store ordering.
-Acquire ordering can be enforced in one of two ways: setting {\tt .aq}, which enforces ordering with respect to just the synchronization variable itself, or with a {\tt FENCE r,rw}, which enforces ordering with respect to all previous loads.  
+An {\em acquire} operation, as would be used at the start of a critical section, requires all memory operations following the acquire in program order to also follow the acquire in the global memory order.
+This ensures, for example, that all loads and stores inside the critical section are up to date with respect to the synchronization variable being used to protect it.
+Acquire ordering can be enforced in one of two ways: with an acquire annotation, which enforces ordering with respect to just the synchronization variable itself, or with a FENCE~R,RW, which enforces ordering with respect to all previous loads.
 
 \begin{figure}[h!]
   \centering\small
   \begin{verbatim}
-          sd           x1, (a1)     # Random unrelated store
-          ld           x2, (a2)     # Random unrelated load
+          sd           x1, (a1)     # Arbitrary unrelated store
+          ld           x2, (a2)     # Arbitrary unrelated load
           li           t0, 1        # Initialize swap value.
       again:
           amoswap.w.aq t0, t0, (a0) # Attempt to acquire lock.
@@ -263,21 +498,21 @@ Acquire ordering can be enforced in one of two ways: setting {\tt .aq}, which en
           # Critical section.
           # ...
           amoswap.w.rl x0, x0, (a0) # Release lock by storing 0.
-          sd           x3, (a3)     # Random unrelated store
-          ld           x4, (a4)     # Random unrelated load
+          sd           x3, (a3)     # Arbitrary unrelated store
+          ld           x4, (a4)     # Arbitrary unrelated load
   \end{verbatim}
   \caption{A spinlock with atomics}
   \label{fig:litmus:spinlock_atomics}
 \end{figure}
 
-Consider Figure~\ref{fig:litmus:spinlock_atomics}:
-Because this example uses {\tt .aq}, the loads and stores in the critical section are guaranteed to appear in the global memory order after the {\tt amoswap} used to acquire the lock.  However, assuming {\tt a0}, {\tt a1}, and {\tt a2} point to different memory locations, the loads and stores in the critical section may or may not appear after the ``random unrelated load'' at the beginning of the example in the global memory order.
+Consider Figure~\ref{fig:litmus:spinlock_atomics}.
+Because this example uses {\em aq}, the loads and stores in the critical section are guaranteed to appear in the global memory order after the AMOSWAP used to acquire the lock.  However, assuming {\tt a0}, {\tt a1}, and {\tt a2} point to different memory locations, the loads and stores in the critical section may or may not appear after the ``Arbitrary unrelated load'' at the beginning of the example in the global memory order.
 
 \begin{figure}[h!]
   \centering\small
   \begin{verbatim}
-          sd           x1, (a1)     # Random unrelated store
-          ld           x2, (a2)     # Random unrelated load
+          sd           x1, (a1)     # Arbitrary unrelated store
+          ld           x2, (a2)     # Arbitrary unrelated load
           li           t0, 1        # Initialize swap value.
       again:
           amoswap.w    t0, t0, (a0) # Attempt to acquire lock.
@@ -288,99 +523,74 @@ Because this example uses {\tt .aq}, the loads and stores in the critical sectio
           # ...
           fence        rw, w        # Enforce "release" memory ordering
           amoswap.w    x0, x0, (a0) # Release lock by storing 0.
-          sd           x3, (a3)     # Random unrelated store
-          ld           x4, (a4)     # Random unrelated load
+          sd           x3, (a3)     # Arbitrary unrelated store
+          ld           x4, (a4)     # Arbitrary unrelated load
   \end{verbatim}
   \caption{A spinlock with fences}
   \label{fig:litmus:spinlock_fences}
 \end{figure}
 
 Now, consider the alternative in Figure~\ref{fig:litmus:spinlock_fences}.
-In this case, even though the {\tt amoswap} does not enforce ordering with an {\tt .aq} bit, the fence nevertheless enforces that the acquire {\tt amoswap} appears earlier in the global memory order than all loads and stores in the critical section.
-Note, however, that in this case, the fence also enforces additional orderings: it also requires that the ``random unrelated load'' at the start of the program appears also appears earlier in the global memory order than the loads and stores of the critical section.  (This particular fence does not, however, enforce any ordering with respect to the ``random unrelated store'' at the start of the snippet.)
-In this way, fence-enforced orderings are slightly coarser than orderings enforced by {\tt.aq}.
+In this case, even though the AMOSWAP does not enforce ordering with an {\em aq} bit, the fence nevertheless enforces that the acquire AMOSWAP appears earlier in the global memory order than all loads and stores in the critical section.
+Note, however, that in this case, the fence also enforces additional orderings: it requires that the ``Arbitrary unrelated load'' at the start of the program also appears earlier in the global memory order than the loads and stores of the critical section.  (This particular fence does not, however, enforce any ordering with respect to the ``Arbitrary unrelated store'' at the start of the snippet.)
+In this way, fence-enforced orderings are slightly coarser than orderings enforced by {\em aq}.
 
-\begin{comment}
-A load-acquire also appears later in the global memory order than any previous paired loads to overlapping addresses.
-This rule is in place primarily to ensure compatibility with C/C++ release sequences.
-Consider the example of Figure~\ref{fig:relseq}:
+Release orderings work exactly the same as acquire orderings, just in the opposite direction.  Release semantics require all loads and stores preceding the release operation in program order to also precede the release operation in the global memory order.
+This ensures, for example, that memory accesses in a critical section appear before the lock-releasing store in the global memory order.  Just as for acquire semantics, release semantics can be enforced using release annotations or with a FENCE~RW,W operation.  Using the same examples, the ordering between the loads and stores in the critical section and the ``Arbitrary unrelated store'' at the end of the code snippet is enforced only by the FENCE~RW,W in Figure~\ref{fig:litmus:spinlock_fences}, not by the {\em rl} in Figure~\ref{fig:litmus:spinlock_atomics}.
 
-\begin{figure}[h!]
-  \centering
-  {\tt\small
-    \begin{tabular}{cl||cl}
-    \multicolumn{2}{c}{Thread 0} & \multicolumn{2}{c}{Thread 1} \\
-    \hline
-      (a)  & x = 1;                     & (d) & atomic\_exchange(y, 2, memory\_order\_relaxed); \\
-      (bc) & atomic\_store(y, 1,        & (e) & atomic\_exchange(y, 2, memory\_order\_acquire); \\
-           & ~~memory\_order\_release); & (f) & int a1 = x;            \\
-    \end{tabular}
-  }
+With RCpc annotations alone, store-release-to-load-acquire ordering is not enforced.  This facilitates the porting of code written under the TSO and/or RCpc memory models.  
+To enforce store-release-to-load-acquire ordering, the code must use store-release-RCsc and load-acquire-RCsc operations so that PPO rule \ref{ppo:rcsc} applies.
+RCpc alone is sufficient for many use cases in C/C++ but is insufficient for many other use cases in C/C++, Java, and Linux, to name just a few examples; see Section~\ref{sec:memory:porting} for details.
 
-  \bigskip
+PPO rule~\ref{ppo:pair} indicates that an SC must appear after its paired LR in the global memory order.
+This will follow naturally from the common use of LR/SC to perform an atomic read-modify-write operation due to the inherent data dependency.
+However, PPO rule~\ref{ppo:pair} also applies even when the value being stored does not syntactically depend on the value returned by the paired LR.
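+
+A minimal sketch of this second case follows (the register assignments are hypothetical):
+\begin{verbatim}
+    lr.w  t0,(a0)       # load-reserved; the result in t0 is never used
+    sc.w  t1,a1,(a0)    # stores a1, which is not computed from t0
+\end{verbatim}
+Here there is no data dependency from the LR to the SC, yet if the SC succeeds, PPO rule~\ref{ppo:pair} still places its store after the load generated by the LR in the global memory order.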
 
-  {\tt\small
-    \begin{tabular}{cl||cl}
-    \multicolumn{2}{c}{Hart 0} & \multicolumn{2}{c}{Hart 1} \\
-    \hline
-          & li t1, 1    &     & li t1, 1    \\
-      (a) & sd t1,0(s0) & (d) & amoswap~~~ a0,t1,0(s1) \\
-      (b) & fence rw,w  & (e) & amoswap.aq x0,a1,0(s1) \\
-      (c) & sd t1,0(s1) & (f) & ld~~~~~~~~ a1,0(s0)    \\
-    \end{tabular}
-  }
-  ~~~~
-  \diagram
-  \caption{C/C++ release sequence example}
-  \label{fig:relseq}
-\end{figure}
-
-In Figure~\ref{fig:relseq}, the original C/C++ source code has a ``synchronizes-with'' relationship from (c) to (e) via (d), where the latter is part of the ``release sequence'' of (c).
-Therefore, RVWMO must somehow require (c) to appear before (e) in the global memory order.
-Without rule~\ref{ppo:loadtoacq}, (c) would be ordered before (d), but (d) would {\em not} be ordered before (e) due to ``fri; rfi'' behavior (Chapter~\ref{sec:ppo:rdw}).
-Rule~\ref{ppo:loadtoacq} therefore fixes the missing link by placing both the load and the store part of (d) before (e) in the global memory order.
-\end{comment}
-
-Release orderings work exactly the same as acquire orderings, just in the opposite direction.  Release semantics require all loads and stores in the critical section to appear before the lock-releasing store (here, an {\tt amoswap}) in the global memory order.  This can be enforced using the {\tt .rl} bit or with a {\tt fence rw,w} operations.  Likewise, the ordering between the loads and stores in the critical section and the ``random unrelated store'' at the end of the code snippet is enforced only by the {\tt fence rw,w} in the second example, not by the {\tt .rl} in the first example.
-%Note that a corollary of rule~\ref{ppo:loadtoacq} is not needed for release operations because it would be redundant with rule~\ref{ppo:->st}.
-
-By default, store-release-to-load-acquire ordering is not enforced.  This facilitates the porting of code written under the TSO and/or RCpc memory models; see Chapter~\ref{sec:porting} for details.
-To enforce store-release-to-load-acquire ordering, use store-release-RCsc and load-acquire-RCsc operations, so that PPO rule \ref{ppo:strongacqrel} applies.
-The use of only store-release-RCsc and load-acquire-RCsc operations implies sequential consistency, as the combination of PPO rules \ref{ppo:acquire}--\ref{ppo:strongacqrel} implies that all RCsc accesses will respect program order.
-
-AMOs with both {\tt .aq} and {\tt .rl} set are fully-ordered operations.
-Treating the load part and the store part as independent RCsc operations is not in and of itself sufficient to enforce full fencing behavior, but this subtle weak behavior is counterintuitive and not much of an advantage architecturally, especially with {\tt lr} and {\tt sc} also available.
-For this reason, AMOs annotated with {\tt .aqrl} are strengthened to being fully-ordered under RVWMO.
+Lastly, we note that just as with fences, programmers need not worry about ``cumulativity'' when analyzing ordering annotations.
 
-%The RVWMO memory model rules do not place any explicit restrictions on whether atomics can forward values from stores still in a store buffer, as some particularly aggressive microarchitectures may do this at times.
-%Such behavior is compatible with higher-level software memory model such as the one used by C/C++ if the mappings of Chapter~\ref{sec:porting} are used.
-%However, such forwarding can be prevented manually if desired by placing a {\tt fence w,r,[addr]} between the store and the load in question.
-
-\subsection{Dependencies (Rules~\ref{ppo:addr}--\ref{ppo:success})}
-\label{sec:depspart1}
+\subsection{Syntactic Dependencies (Rules~\ref{ppo:addr}--\ref{ppo:ctrl})}
+\label{sec:memory:dependencies}
 \begin{tabular}{p{1cm}|p{12cm}}
   & Rule \ref{ppo:addr}: \ppoaddr \\
   & Rule \ref{ppo:data}: \ppodata \\
   & Rule \ref{ppo:ctrl}: \ppoctrl \\
-  & Rule \ref{ppo:success}: \pposuccess \\
 \end{tabular}
 
 Dependencies from a load to a later memory operation in the same hart are respected by the RVWMO memory model.
 The Alpha memory model was notable for choosing {\em not} to enforce the ordering of such dependencies, but most modern hardware and software memory models consider allowing dependent instructions to be reordered too confusing and counterintuitive.
 Furthermore, modern code sometimes intentionally uses such dependencies as a particularly lightweight ordering enforcement mechanism.
 
+The terms in Section~\ref{sec:memorymodel:dependencies} work as follows.
+Instructions are said to carry dependencies from their source register(s) to their destination register(s) whenever the value written into each destination register is a function of the source register(s).
+For most instructions, this means that the destination register(s) carry a dependency from all source register(s).
+However, there are a few notable exceptions.
+In the case of memory instructions, the value written into the destination register ultimately comes from the memory system rather than from the source register(s) directly, and so this breaks the chain of dependencies carried from the source register(s).
+In the case of unconditional jumps, the value written into the destination register comes from the current {\tt pc} (which is never considered a source register by the memory model), and so likewise, JALR (the only jump with a source register) does not carry a dependency from {\em rs1} to {\em rd}.
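+
+The following sketch illustrates these cases (the register assignments are hypothetical, and the three instructions are meant to be read independently of one another):
+\begin{verbatim}
+    add  a1,a0,a2     # carries a dependency from a0 and a2 to a1
+    lw   a3,0(s0)     # a3 comes from memory: nothing is carried from s0 to a3
+    jalr ra,0(t0)     # ra comes from pc: nothing is carried from t0 to ra
+\end{verbatim}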
+
+\begin{verbbox}
+(a) fadd  f3,f1,f2
+(b) fadd  f6,f4,f5
+(c) csrrs a0,fflags,x0
+\end{verbbox}
+\begin{figure}[h!]
+  \centering\small
+  \theverbbox
+  \caption{(c) has a syntactic dependency on both (a) and (b) via {\tt fflags}, a destination register that both (a) and (b) implicitly accumulate into}
+  \label{fig:litmus:fflags}
+\end{figure}
+
+The notion of accumulating into a destination register rather than writing into it reflects the behavior of CSRs such as {\tt fflags}.
+In particular, an accumulation into a register does not clobber any previous writes or accumulations into the same register.
+For example, in Figure~\ref{fig:litmus:fflags}, (c) has a syntactic dependency on both (a) and (b).
+
 Like other modern memory models, the RVWMO memory model uses syntactic rather than semantic dependencies.
 In other words, this definition depends on the identities of the
 registers being accessed by different instructions, not the actual
 contents of those registers.  This means that an address, control, or
 data dependency must be enforced even if the calculation could seemingly
 be ``optimized away''.
-This choice ensures that RVWMO remains compatible with programmers that use these false syntactic dependencies intentionally to form a lightweight type of ordering mechanism.
+This choice ensures that RVWMO remains compatible with code that uses these false syntactic dependencies as a lightweight ordering mechanism.
 
-For example, there is a syntactic address
-dependency from the first instruction to the last instruction in the
-Figure~\ref{fig:litmus:address}, even though {\tt a1} XOR {\tt a1} is zero and
-hence has no effect on the address accessed by the second load.
 \begin{verbbox}
 ld  a1,0(s0)
 xor a2,a1,a1
@@ -394,13 +604,16 @@ ld  a5,0(s1)
   \label{fig:litmus:address}
 \end{figure}
 
+For example, there is a syntactic address
+dependency from the memory operation generated by the first instruction to the memory operation generated by the last instruction in
+Figure~\ref{fig:litmus:address}, even though {\tt a1} XOR {\tt a1} is zero and
+hence has no effect on the address accessed by the second load.
+
 The benefit of using dependencies as a lightweight synchronization mechanism is that the ordering enforcement requirement is limited only to the specific two instructions in question.
 Other non-dependent instructions may be freely-reordered by aggressive implementations.
 One alternative would be to use a load-acquire, but this would enforce ordering for the first load with respect to {\em all} subsequent instructions.
-Another would be to use a {\tt fence r,r}, but this would include all previous and all subsequent loads, making this option each more expensive.
+Another would be to use a FENCE~R,R, but this would include all previous and all subsequent loads, making this option more expensive.
 
-Control dependencies behave differently from address and data dependencies in the sense that a control dependency always extends to all instructions following the original target in program order.
-Consider Figure~\ref{fig:litmus:control1}: the instruction at {\tt next} will always execute, but it nevertheless still has control dependency from the first instruction.
 \begin{verbbox}
       lw  x1,0(x2)
       bne x1,x0,NEXT
@@ -414,6 +627,9 @@ next: sw  x5,0(x6)
   \label{fig:litmus:control1}
 \end{figure}
 
+Control dependencies behave differently from address and data dependencies in the sense that a control dependency always extends to all instructions following the original target in program order.
+Consider Figure~\ref{fig:litmus:control1}: the instruction at {\tt next} will always execute, but the memory operation generated by that last instruction nevertheless still has a control dependency from the memory operation generated by the first instruction.
+
 \begin{verbbox}
         lw  x1,0(x2)
         bne x1,x0,NEXT
@@ -427,419 +643,188 @@ next: sw  x5,0(x6)
 \end{figure}
 
 Likewise, consider Figure~\ref{fig:litmus:control2}.
-Even though both branch outcomes have the same target, there is still a control dependency from the first instruction in this snippet to the last.
+Even though both branch outcomes have the same target, there is still a control dependency from the memory operation generated by the first instruction in this snippet to the memory operation generated by the last instruction.
 This definition of control dependency is subtly stronger than what might be seen in other contexts (e.g., C++), but it conforms with standard definitions of control dependencies in the literature.
 
+Notably, PPO rules \ref{ppo:addr}--\ref{ppo:ctrl} are also intentionally designed to respect dependencies that originate from the output of a successful store conditional instruction.
+Typically, an SC instruction will be followed by a conditional branch checking whether the outcome was successful; this implies that there will be a control dependency from the store operation generated by the SC instruction to any memory operations following the branch.
+PPO rule~\ref{ppo:ctrl} in turn implies that any subsequent store operations will appear later in the global memory order than the store operation generated by the SC.
+However, since control, address, and data dependencies are defined over memory operations, and since an unsuccessful SC does not generate a memory operation, no order is enforced between an unsuccessful SC and its dependent instructions.
+Moreover, since SC is defined to carry dependencies from its source registers to {\em rd} only when the SC is successful, an unsuccessful SC has no effect on the global memory order.
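+
+A sketch of the typical pattern is shown below (the register assignments and the label are hypothetical):
+\begin{verbatim}
+retry:
+    lr.w  t0,(a0)
+    sc.w  t1,a1,(a0)    # t1 = 0 on success, nonzero on failure
+    bnez  t1,retry      # branch reads t1, the destination register of the SC
+    sw    a2,0(a3)      # control-dependent on the SC via the branch
+\end{verbatim}
+When the SC succeeds, the branch falls through and PPO rule~\ref{ppo:ctrl} orders the store generated by the SC before the final {\tt sw}.  On a failed iteration the SC generates no memory operation, so there is nothing for the rule to order.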
+
+
 \begin{figure}[h!]
-  \center
+  \centering
+  \begin{tabular}{m{.4\linewidth}m{0.05\linewidth}m{.4\linewidth}}
   {
     \tt\small
     \begin{tabular}{cl||cl}
+    \multicolumn{4}{c}{Initial values: 0(s0)=1; 0(s1)=1} \\
+    \\
     \multicolumn{2}{c}{Hart 0} & \multicolumn{2}{c}{Hart 1} \\
     \hline
       (a) & ld a0,0(s0)    & (e) & ld a3,0(s2) \\
       (b) & lr a1,0(s1)    & (f) & sd a3,0(s0) \\
       (c) & sc a2,a0,0(s1) &                    \\
       (d) & sd a2,0(s2)    &                    \\
+      \hline
+      \multicolumn{4}{c}{Outcome: {\tt a0=0}, {\tt a3=0}}
     \end{tabular}
   }
-  ~~~~
-  \diagram
-  \caption{A variant of the LB litmus test}
+  & &
+  \input{figs/litmus_lb_lrsc.pdf_t}
+  \end{tabular}
+  \caption{A variant of the LB litmus test (outcome forbidden)}
   \label{fig:litmus:successdeps}
 \end{figure}
 
-Finally, we highlight a unique new rule regarding the success registers written by store-conditional instructions.
-In certain cases, without PPO rule \ref{ppo:success}, a store conditional could in theory be made to store its own success output value as its data, in a manner reminiscent of so-called out-of-thin-air behavior.
-This is shown in Figure~\ref{fig:litmus:successdeps}.
+In addition, the choice to respect dependencies originating at store-conditional instructions ensures that certain out-of-thin-air-like behaviors will be prevented.
+Consider Figure~\ref{fig:litmus:successdeps}.
 Suppose a hypothetical implementation could occasionally make some early guarantee that a store-conditional operation will succeed.
 In this case, (c) could return 0 to {\tt a2} early (before actually executing), allowing the sequence (d), (e), (f), (a), and then (b) to execute, and then (c) might execute (successfully) only at that point.
 This would imply that (c) writes its own success value to {\tt 0(s1)}!
+Fortunately, this situation and others like it are prevented by the fact that RVWMO respects dependencies originating at the stores generated by successful SC instructions.
 
-To rule out this bizarre behavior, PPO rule~\ref{ppo:success} says that store-conditional instructions may not return success or failure into the destination register until both the address and data for the instruction have been resolved.
-In the example above, this would enforce an ordering from (a) to (d), and this would in turn form a cycle that rules out the strange proposed execution.
+We also note that syntactic dependencies between instructions only have any force when they take the form of a syntactic address, control, and/or data dependency.
+For example, a syntactic dependency between two ``F'' instructions via one of the ``accumulating CSRs'' in Section~\ref{sec:source-dest-regs} does {\em not} imply that the two ``F'' instructions must be executed in order.
+Such a dependency would only serve to set up a dependency from both ``F'' instructions to a later CSR instruction accessing the CSR flag in question.
 
-\subsection{Same-Address Load-Load Ordering (Rule~\ref{ppo:rdw})}
-\label{sec:ppo:rdw}
+\subsection{Pipeline Dependencies (Rules~\ref{ppo:addrdatarfi}--\ref{ppo:addrpo})}
+\label{sec:memory:ppopipeline}
 \begin{tabular}{p{1cm}|p{12cm}}
-  & Rule \ref{ppo:rdw}: \ppordw \\
+  & Rule \ref{ppo:addrdatarfi}: \ppoaddrdatarfi \\
+  & Rule \ref{ppo:addrpo}: \ppoaddrpo \\
+%  & Rule \ref{ppo:ctrlcfence}: \ppoctrlcfence \\
+%  & Rule \ref{ppo:addrpocfence}: \ppoaddrpocfence \\
 \end{tabular}
 
-In contrast to same-address orderings ending in a store, same-address load-load ordering requirements are very subtle.
-
-The basic requirement is that a younger load must not return a value which is older than a value returned by an older load in the same hart to the same address.  This is often known as ``CoRR'' (Coherence for Read-Read pairs), or as part of a broader ``coherence'' or ``sequential consistency per location'' requirement.
-Some architectures in the past have relaxed same-address load-load ordering, but in hindsight this is generally considered to complicate the programming model too much, and so RVWMO requires CoRR ordering to be enforced.
-However, because the global memory order corresponds to the order in which loads perform rather than the ordering of the values being returned, capturing CoRR requirements in terms of the global memory order requires a bit of indirection.
-
-\begin{figure}[h!]
-  \center
-  {
-    \tt\small
-    \begin{tabular}{cl||cl}
-    \multicolumn{2}{c}{Hart 0} & \multicolumn{2}{c}{Hart 1} \\
-    \hline
-          & li t1, 1    &     & li~ t2, 2    \\
-      (a) & sw t1,0(s0) & (d) & lw~ a0,0(s1) \\
-      (b) & fence w, w  & (e) & sw~ t2,0(s1) \\
-      (c) & sw t1,0(s1) & (f) & lw~ a1,0(s1) \\
-          &             & (g) & xor t3,a1,a1 \\
-          &             & (h) & add s0,s0,t3 \\
-          &             & (i) & lw~ a2,0(s0) \\
-    \end{tabular}
-  }
-  ~~~~
-  \diagram
-  \caption{Litmus test MP+FENCE+fri-rfi-addr}
-  \label{fig:litmus:frirfi}
-\end{figure}
-
-Consider the litmus test of Figure~\ref{fig:litmus:frirfi}, which is one particular instance of the more general ``fri-rfi'' pattern.
-The term ``fri-rfi'' refers to the sequence (d),(e),(f): (d) ``from-reads'' (i.e., reads from an earlier write than) (e) which is the same hart, and (f) reads from (e) which is in the same hart.
-
-From a microarchitectural perspective, outcome {\tt a0=1, a1=2, a2=0} is legal (as are various other less subtle outcomes).  Intuitively, the following would produce the outcome in question:
-\begin{itemize}
-  \item (a), (b), (c) execute
-  \item (d) stalls (for whatever reason; perhaps it's stalled waiting for some other preceding instruction)
-  \item (e) executes and enters the store buffer
-  \item (f) forwards from (e) in the store buffer
-  \item (g), (h), and (i) execute
-  \item (d) unstalls and executes
-  \item (e) drains from the store buffer to memory
-\end{itemize}
-This corresponds to a global memory order of (e),(f),(i),(a),(c),(d).
-Note that even though (f) performs before (d), the value returned by (f) is newer than the value returned by (d).
-Therefore, this execution is legal and does not violate the CoRR requirements even though (f) appears before (d) in global memory order.
-
-Likewise, if two back-to-back loads return the values written by the same store, then they may also appear out-of-order in the global memory order without violating CoRR.  Note that this is not the same as saying that the two loads return the same value, since two different stores may write the same value.   Consider the litmus test of Figure~\ref{fig:litmus:rsw}:
-
 \begin{figure}[h!]
   \centering
+  \begin{tabular}{m{.4\linewidth}m{.05\linewidth}m{.4\linewidth}}
   {
     \tt\small
     \begin{tabular}{cl||cl}
     \multicolumn{2}{c}{Hart 0} & \multicolumn{2}{c}{Hart 1} \\
     \hline
-          & li t1, 1    & (d) & lw~ a0,0(s1) \\
-      (a) & sw t1,0(s0) & (e) & xor t2,a0,a0 \\
-      (b) & fence w, w  & (f) & add s2,s2,t2 \\
-      (c) & sw t1,0(s1) & (g) & lw~ a1,0(s2) \\
-          &             & (h) & lw~ a2,0(s2) \\
-          &             & (i) & xor t3,a2,a2 \\
-          &             & (j) & add s0,s0,t3 \\
-          &             & (k) & lw~ a3,0(s0) \\
+          & li t1, 1    & (d) & lw a0, 0(s1)   \\
+      (a) & sw t1,0(s0) & (e) & sw a0, 0(s2)   \\
+      (b) & fence w, w  & (f) & lw a1, 0(s2)   \\
+      (c) & sw t1,0(s1) &     & xor a2,a1,a1   \\
+          &             &     & add s0,s0,a2   \\
+          &             & (g) & lw a3,0(s0)    \\   
+      \hline
+      \multicolumn{4}{c}{Outcome: {\tt a0=1}, {\tt a3=0}}
     \end{tabular}
-  }
-  ~~~~
-  \diagram
-  \caption{Litmus test RSW}
-  \label{fig:litmus:rsw}
-\end{figure}
-
-The outcome {\tt a0=1,a1=a2,a3=0} can be observed by allowing (g) and (h) to be reordered.  This might be done speculatively, and the speculation can be justified by the microarchitecture (e.g., by snooping for cache invalidations and finding none) because replaying (h) after (g) would return the value written by the same store anyway.
-Hence assuming {\tt a1=a2}, (g) and (h) can be reordered.
-The global memory order corresponding to this execution would be (h),(k),(a),(c),(d),(g).
-
-Executions of the above test in which {\tt a1} does not equal {\tt a2} do in fact require that (g) appears before (h) in the global memory order.
-Allowing (h) to appear before (g) in the global memory order would in fact result in a violation of CoRR, because then (h) would return an older value than that returned by (g).
-Therefore, PPO rule~\ref{ppo:rdw} forbids this CoRR violation from occurring.
-As such, PPO rule~\ref{ppo:rdw} strikes a careful balance between enforcing CoRR in all cases while simultaneously being weak enough to permit ``RSW'' and ``fri-rfi'' patterns that commonly appear in real microarchitectures.
-
-
-\subsection{Atomics and LR/SCs (Atomicity Axiom)}
-\begin{tabular}{p{1cm}|p{12cm}} &
-Atomicity axiom: \atomicityaxiom
-\end{tabular}
-
-The RISC-V architecture decouples the notion of atomicity from the notion of ordering.  Unlike architectures such as TSO, RISC-V atomics under RVWMO do not impose any ordering requirements by default.  Ordering semantics are only guaranteed by the PPO rules that otherwise apply.
-This relaxed nature allows implementations to be aggressive about forwarding values even before a paired store has been committed to memory.
-
-Roughly speaking, the atomicity rule states that there can be no store from another hart during the time the reservation is held.
-For AMOs, the reservation is held as the AMO is being performed.
-For successful {\tt lr}/{\tt sc} pairs, the reservation is held between the time the {\tt lr} is performed and the time the {\tt sc} is performed.
-In most cases, the atomicity rule states that there can be no store from another hart between the load and its paired store in global memory order.
-
-There is one exception, however: if the paired load returns its value from a store $s$ still in the store buffer (which some implementations may permit), then the reservation may not need to be acquired until $s$ is ready to leave the store buffer, and this may occur after the paired load has already performed.
-Therefore, in this case, the requirement is only that no other store from another hart to an overlapping address can appear between time that $s$ performs and the time that the paired store performs.
-Consider the example of Figure~\ref{fig:litmus:lateatomic}:
-
-\begin{figure}[h!]
-  \centering
-  {
-    \setlength{\tabcolsep}{2mm}
-    \tt\footnotesize
-    \begin{tabular}{cl||cl}
-    \multicolumn{2}{c}{Hart 0} & \multicolumn{2}{c}{Hart 1} \\
-    \hline
-          & li~~~~~~ t1, 2       &     & li~~~~~~~~ t3, 2    \\
-          & li~~~~~~ t2, 1       &     & li~~~~~~~~ t4, 1    \\
-      (a) & sd~~~~~~ t1,0(s0)    & (d) & sd~~~~~~~~ t3,0(s1) \\
-      (b) & amoor.aq a0,t2,0(s0) & (e) & amoswap.rl x0,t4,0(s0) \\
-      (c) & sd~~~~~~ t2,0(s1)    &     &             \\
-    \end{tabular}
-  }
-  ~~
-  \diagram
-  \caption{A litmus test where the reservation for {\tt 0(s0)} may not be acquired until after the load of (b) has already completed}
-  \label{fig:litmus:lateatomic}
-\end{figure}
-
-The outcome {\tt 0(s0)=3, 0(s1)=2} is legal, with the global memory order of (b0),(c),(d),(e),(a),(b1), where (b0) and (b1) represent the load and store parts, respectively, of (b).
-The atomic operation (b) does not need to grab the reservation until (a) is ready to leave the store buffer.
-Therefore, although (e) is a store to the same address from another hart, and even though (e) lies between (b0) and (b1) in global memory order, this execution does not violate the atomicity axiom because (e) comes after (a) in global memory order.
+  } & &
+  \input{figs/litmus_datarfi.pdf_t}
+  \end{tabular}
 
-\begin{verbbox}
-(a) lr t0, 0(a0)
-(b) sd t1, 0(a0)
-(c) sc t2, 0(a0)
-\end{verbbox}
-\begin{figure}[h!]
-  \centering\small
-  \theverbbox
-  \caption{Store-conditional (c) may succeed on some implementations}
-  \label{fig:litmus:lrsdsc}
+  \caption{Because of PPO rule~\ref{ppo:addrdatarfi} and the data dependency from (d) to (e), (d) must also precede (f) in the global memory order (outcome forbidden)}
+  \label{fig:litmus:addrdatarfi}
 \end{figure}
 
-The atomicity rule does not forbid loads from being interleaved between the paired operations in program order or in the global memory order, nor does it forbid stores from the same hart from appearing between the paired operations in either program order or in the global memory order.
-For example, the sequence in Figure~\ref{fig:litmus:lrsdsc} is legal, and the {\tt sc} may (but is not guaranteed to) succeed.
-By preserved program order rule \ref{ppo:->st}, the program order of the three operations must be maintained in the global memory order.  This does not violate the atomicity axiom, because the intervening non-conditional store is from the same hart as the paired load-reserved and store-conditional instructions.
+PPO rules~\ref{ppo:addrdatarfi} and \ref{ppo:addrpo} reflect behaviors of almost all real processor pipeline implementations.
+Rule~\ref{ppo:addrdatarfi} states that a load cannot forward from a store until the address and data for that store are known.
+Consider Figure~\ref{fig:litmus:addrdatarfi}:
+(f) cannot be executed until the data for (e) has been resolved, because (f) must return the value written by (e) (or by something even later in the global memory order), and the old value must not be clobbered by the writeback of (e) before (d) has had a chance to perform.
+Therefore, (f) will never perform before (d) has performed.
 
 \begin{figure}[h!]
   \centering
+  \begin{tabular}{m{.4\linewidth}m{.05\linewidth}m{.4\linewidth}}
   {
-    \setlength{\tabcolsep}{1mm}
-    \tt\footnotesize
+    \tt\small
     \begin{tabular}{cl||cl}
     \multicolumn{2}{c}{Hart 0} & \multicolumn{2}{c}{Hart 1} \\
     \hline
-          & li t0, 1              &     &             \\
-      (a) & amoor.aq a0,t0,0(s0)  & (c) & amoadd.aq a1,x0,0(s1) \\
-      (b) & sd~~~~~~ a0,0(s1)     & (d) & ld~~~~~~~ a2,0(s0) \\
+          & li t1, 1    &     & li t1, 1       \\
+      (a) & sw t1,0(s0) & (d) & lw a0, 0(s1)   \\
+      (b) & fence w, w  & (e) & sw a0, 0(s2)   \\
+      (c) & sw t1,0(s1) & (f) & sw t1, 0(s2)   \\
+          &             & (g) & lw a1, 0(s2)   \\
+          &             &     & xor a2,a1,a1   \\
+          &             &     & add s0,s0,a2   \\
+          &             & (h) & lw a3,0(s0)    \\   
+
+      \hline
+      \multicolumn{4}{c}{Outcome: {\tt a0=1}, {\tt a3=0}}
     \end{tabular}
-  }
-  ~~
-  \diagram
-  \caption{The {\tt .aq} applies only to the load part of (a), and hence it does not order the store part of (a) before (b)}
-  \label{fig:litmus:amoaq}
-\end{figure}
-
-Likewise, in the test of Figure~\ref{fig:litmus:amoaq}, the following global memory order could result in the outcome {\tt a1=1, a2=0}: (a0), (b), (c), (d), (a1).
-
-Overall, the atomicity rule ensures that non-synchronization atomic operations (e.g., incrementing a counter) can be made as efficient as possible in high-performance implementations, while simultaneously ensuring that the atomicity conditions necessary for achieving consensus are maintained.
-
-
-\begin{comment}
-%\subsection{Atomics and Store Forwarding (Rules \ref{ppo:rmwrfi}--\ref{ppo:rfiaq})}
-\subsection{Atomics and Store Forwarding (Rule \ref{ppo:rfiaq})}
-\begin{tabular}{p{1cm}|p{12cm}}
-%  & Rule \ref{ppo:rmwrfi}: \ppormwrfi \\
-  & Rule \ref{ppo:rfiaq}: \pporfiaq \\
-\end{tabular}
-
-  There is one exception to the rule that ``fri-rfi'' reordering from Chapter~\ref{sec:ppo:rdw} is permitted: sequences in which the first memory access is part of an AMO or {\tt lr}/{\tt sc} pair and the second is a load with its {\tt .aq} bit set.  Rule~\ref{ppo:rfiaq} ensures compatibility with causality chains of the form ({\tt rf; ppo; rf; ppo;} $\dots$), and with C/C++ release sequences in particular.  Consider the following variant of test MP+FENCE+fri-rfi-addr (with labels reused from the earlier example):
-
-\begin{center}
-  \tt\footnotesize
-  \begin{tabular}{cl||cl}
-  \multicolumn{2}{c}{Hart 0} & \multicolumn{2}{c}{Hart 1} \\
-  \hline
-        & li t1, 1    &      & li    t2, 2    \\
-    (a) & sw t1,0(s0) &      & li    t3, 1    \\
-    (b) & fence w, w  & (de) & amoor t3,a0,0(s1) \\
-    (c) & sw t1,0(s1) & (f)  & lw.aq a1,0(s1) \\
-        &             & (i)  & lw    a2,0(s0) \\
-  \end{tabular}
-  ~~~~
-  \begin{tabular}{cl||cl}
-  \multicolumn{4}{c}{\rm Abstracted assembly} \\
-  \\
-  \multicolumn{2}{c}{Hart 0} & \multicolumn{2}{c}{Hart 1} \\
-  \hline
-        &            &      &            \\
-    (a) & St [x], 1  &      &            \\
-    (b) & Fence w,w  & (de) & AMOOr a0, 1, [y] \\ 
-    (c) & St [y], 1  & (f)  & Ld.aq a1, [y] \\
-        &            & (i)  & Ld a2, [x] \\
+  } & &
+  \input{figs/litmus_datacoirfi.pdf_t}
   \end{tabular}
-\end{center}
-
-Programmers (and C/C++) expect the causality chain from (a) to (c) to (de) to (f) to (i) to be enforced.
-However, the PPO rules covered so far only enforce global memory ordering from (a) to (c) to (d) (the load of the AMOOr) to (e) (the store of the AMOOr), and from (f) to (i), but not from (e) to (f).
-\ref{ppo:rfiaq} fills this missing link by ensuring that the ordering from (e) to (f) is respected, and hence that the entire ordering chain from (a) to (i) is respected.
-\end{comment}
-
-
-\subsection{Pipeline Dependency Artifacts (Rules~\ref{ppo:ld->st->ld}--\ref{ppo:addrpocfence})}
-\label{sec:ppopipeline}
-\begin{tabular}{p{1cm}|p{12cm}}
-  & Rule \ref{ppo:ld->st->ld}: \ppoldstld \\
-  & Rule \ref{ppo:addrpo}: \ppoaddrpo \\
-  & Rule \ref{ppo:ctrlcfence}: \ppoctrlcfence \\
-  & Rule \ref{ppo:addrpocfence}: \ppoaddrpocfence \\
-\end{tabular}
-
-These four ``compound dependency'' rules reflect behaviors of almost all real processor pipeline implementations, and they are added into the model explicitly to simplify the definition of the formal operational memory model and to improve compatibility with known patterns on other architectures.
-
-\begin{figure}[h!]
-  \centering
-  {
-    \tt\small
-    \begin{tabular}{cl}
-      (a) & lw a0, 0(s0)   \\
-      (b) & sw a0, 0(s1)   \\
-      (c) & lw a1, 0(s1)   \\
-    \end{tabular}
-  }
-  ~~~~
-  \diagram
-  \caption{Because of the data dependency from (a) to (b), (a) is also ordered before (c)}
-  \label{fig:litmus:addrdatarfi}
-\end{figure}
 
-Rule~\ref{ppo:ld->st->ld} states that a load forward from a store until the address and data for that store are known.
-Consider Figure~\ref{fig:litmus:addrdatarfi}:
-(c) cannot be executed until the data for (b) has been resolved, because (c) must return the value written by (b) (or by something even later in the global memory order).  Therefore, (c) will never execute before (a) has executed.
-
-\begin{figure}[h!]
-  \centering
-  {
-    \tt\small
-    \begin{tabular}{cl}
-          & li t1, 1       \\
-      (a) & lw a0, 0(s0)   \\
-      (b) & sw a0, 0(s1)   \\
-          & sw t1, 0(s1)   \\
-      (c) & lw a1, 0(s1)   \\
-    \end{tabular}
-  }
-  ~~~~
-  \diagram
-  \caption{Because of the extra store between (b) and (c), (a) is no longer necessarily ordered before (c)}
+  \caption{Because of the extra store between (e) and (g), (d) no longer necessarily precedes (g) (outcome permitted)}
   \label{fig:litmus:addrdatarfi_no}
 \end{figure}
 
-If there were another store to the same address in between (b) and (c), as in Figure~\ref{fig:litmus:addrdatarfi_no}, then (c) would no longer dependent on the data of (b) being resolved, and hence the dependency of (c) on (a), which produces the data for (b), would be broken.
-
-One subtle related note is that {\tt amoswap} does not contain a data dependency from its load to its store.  Nor does every {\tt sc} have a data dependency on its paired {\tt lr}.
-Therefore, Rule~\ref{ppo:ld->st->ld} does not enforce an ordering from paired loads of this category to subsequent loads to overlapping addresses.
-%Rule~\ref{ppo:loadtoacq} is therefore not quite redundant with Rule~\ref{ppo:ld->st->ld}, even when the first load and the store in this example are paired.
+If there were another store to the same address in between (e) and (g), as in Figure~\ref{fig:litmus:addrdatarfi_no}, then (g) would no longer be dependent on the data of (e) being resolved, and hence the dependency of (g) on (d), which produces the data for (e), would be broken.
 
-Rule~\ref{ppo:addrpo} makes a similar observation to the previous rule: a store cannot be performed at memory until all previous loads which might access the same address have themselves been performed.
+Rule~\ref{ppo:addrpo} makes a similar observation to the previous rule: a store cannot be performed at memory until all previous loads that might access the same address have themselves been performed.
 Such a load must appear to execute before the store, but it cannot do so if the store were to overwrite the value in memory before the load had a chance to read the old value.
+Likewise, a store generally cannot be performed until it is known that preceding instructions will not cause an exception due to failed address resolution, and in this sense, rule~\ref{ppo:addrpo} can be seen as something of a special case of rule~\ref{ppo:ctrl}.
 
 \begin{figure}[h!]
   \centering
-  {
+  \begin{tabular}{m{.4\linewidth}m{.05\linewidth}m{.4\linewidth}}
     \tt\small
-    \begin{tabular}{cl}
-        & li t1, 1       \\
-    (a) & lw a0, 0(s0)   \\
-    (b) & lw a1, 0(a0)   \\
-    (c) & sw t1, 0(s1)   \\
-    \end{tabular}
-  }
-  ~~~~
-  \diagram
-  \caption{Because of the address dependency from (a) to (b), (a) is also ordered before (c)}
+    \begin{tabular}{cl||cl}
+    \multicolumn{2}{c}{Hart 0} & \multicolumn{2}{c}{Hart 1} \\
+    \hline
+        &             &     & li t1, 1       \\
+    (a) & lw a0,0(s0) & (d) & lw a1, 0(s1)   \\
+    (b) & fence rw,rw & (e) & lw a2, 0(a1)   \\
+    (c) & sw s2,0(s1) & (f) & sw t1, 0(s0)   \\
+    \hline
+    \multicolumn{4}{c}{Outcome: {\tt a0=1}, {\tt a1=t}}
+    \end{tabular}  
+    & &
+    \input{figs/litmus_addrpo.pdf_t}
+  \end{tabular}
+  \caption{Because of the address dependency from (d) to (e), (d) also precedes (f) (outcome forbidden)}
   \label{fig:litmus:addrpo}
 \end{figure}
 
 Consider Figure~\ref{fig:litmus:addrpo}:
-(c) cannot be executed until the address for (b) is resolved, because it may turn out that the addresses match; i.e., that {\tt a0=s1}.  Therefore, (c) cannot be sent to memory before (a) has executed and confirmed whether the addresses to indeed overlap.
-
-\begin{figure}[h!]
-  \centering
-  {
-    \tt\small
-    \begin{tabular}{rl}
-    (a)       & ld a0, 0(s0) \\
-              & xor a1,a0,a0 \\
-              & bne a1, critical \\
-    critical: & fence.i \\
-    (c)       & ld a1, 0(s1) \\
-    \end{tabular}
-  }
-  ~~~~
-  \diagram
-  \caption{Because of the control dependency from (a) to (c), (a) is also ordered before (c)}
-  \label{fig:litmus:ctrlcfence}
-\end{figure}
-
-Rule~\ref{ppo:ctrlcfence} reflects the idiom of Figure~\ref{fig:litmus:ctrlcfence} for a lightweight acquire fence:
-In this code snippet, (c) cannot execute until the {\tt fence.i} is cleared.  The {\tt fence.i} cannot clear until the branch has executed and drained.  The branch cannot execute until it receives the value from (a) through the {\tt xor}.  Therefore, (a) must be ordered before (c) in the global memory order.
-
-\begin{figure}[h!]
-  \centering
-  {
-    \tt\small
-    \begin{tabular}{cl}
-    (a)       & ld a0, 0(s0) \\
-    (b)       & ld a1, 0(a0) \\
-              & fence.i \\
-    (c)       & ld a1, 0(s1) \\
-    \end{tabular}
-  }
-  ~~~~
-  \diagram
-  \caption{Because of the address dependency from (a) to (b) and the {\tt fence.i} between (b) and (c), (a) is also ordered before (c)}
-  \label{fig:litmus:addrpocfence}
-\end{figure}
+(f) cannot be executed until the address for (e) is resolved, because it may turn out that the addresses match; i.e., that {\tt a1=s0}.  Therefore, (f) cannot be sent to memory before (d) has executed and confirmed whether the addresses do indeed overlap.
 
-Rule~\ref{ppo:addrpocfence} and Figure~\ref{fig:litmus:addrpocfence} present a similar situation:
-Once again, (c) cannot execute until the {\tt fence.i} is cleared.  The {\tt fence.i} cannot clear until both (a) and (b) have at least issued (even if they have not yet returned a value).  Finally, (b) cannot issue until it receives its address from (a).  Therefore, (a) must be ordered before (c).
 
+\section{Beyond Main Memory}
 
-\section{FENCE.I, SFENCE.VMA, and I/O Fences}
-
-In this section, we provide an informal description of how the {\tt fence.i}, {\tt sfence.vma}, and I/O fences interact with the memory model.
-
-Instruction fetches and address translation operations (where applicable) follow the RISC-V memory model as well as the rules below.
-\begin{itemize}
-  \item {\tt fence.i}: Conceptually, {\tt fence.i} ensures that no instructions following the {\tt fence.i} are issued until all instructions prior to the {\tt fence.i} have executed (but not necessarily performed globally).
-    This implies that the fetch of each instruction following the {\tt fence.i} in program order appears later in the global memory order than all stores prior to the {\tt fence.i} in program order.
-    That in turn means that instruction caches which hardware does not keep coherent with normal memory must be flushed when a {\tt fence.i} instruction is executed.
-    ({\tt fence.i} is also used form the patterns of Chapter~\ref{sec:ppopipeline}.)
-  \item {\tt sfence.vma}: Conceptually, the instruction fetch and address translation operations of each instruction following the {\tt sfence.vma} in program order appears later in the global memory order than all stores prior to the {\tt sfence.vma} in program order.
-    This implies that stale entries in the local hart's TLBs must be invalidated.
-  \item Conceptually, updates to the page table made by a hardware page table walker form a paired atomic read-modify-write operation subject to the rules of the atomicity axiom
-\end{itemize}
+RVWMO does not currently attempt to formally describe how FENCE.I, SFENCE.VMA, I/O fences, and PMAs behave.
+All of these behaviors will be described by future formalizations.
+In the meantime, the behavior of FENCE.I is described in Section~\ref{sec:fence}, the behavior of SFENCE.VMA is described in the RISC-V Instruction Set Privileged Architecture Manual, and the behavior of I/O fences and the effects of PMAs are described below.
 
 \subsection{Coherence and Cacheability}
 
-The RISC-V ISA defines Physical Memory Attributes (PMAs) which specify, among other things, whether portions of the address space are coherent and/or cacheable.
-See the privileged spec for the complete details.
+The RISC-V Privileged ISA defines Physical Memory Attributes (PMAs) which specify, among other things, whether portions of the address space are coherent and/or cacheable.
+See the RISC-V Privileged ISA Specification for the complete details.
 Here, we simply discuss how the various details in each PMA relate to the memory model:
 
 \begin{itemize}
   \item Main memory vs.\@ I/O, and I/O memory ordering PMAs: the memory model as defined applies to main memory regions.  I/O ordering is discussed below.
   \item Supported access types and atomicity PMAs: the memory model is simply applied on top of whatever primitives each region supports.
-  \item Coherence and cacheability PMAs: neither the coherence nor the cacheability PMAs affect the memory model.  The RISC-V privileged specification suggests that hardware-incoherent regions of main memory are discouraged, but the memory model is compatible with hardware coherence, software coherence, implicit coherence due to read-only memory, implicit coherence due to only one agent having access, or otherwise.  Likewise, non-cacheable regions may have more restrictive behavior than cacheable regions, but the set of allowed behaviors does not change regardless.
+  \item Cacheability PMAs: the cacheability PMAs in general do not affect the memory model.  Non-cacheable regions may have more restrictive behavior than cacheable regions, but the set of allowed behaviors does not change regardless.  However, some platform-specific and/or device-specific cacheability settings may differ.
+  \item Coherence PMAs: The memory consistency model for memory regions marked as non-coherent in PMAs is currently platform-specific and/or device-specific: the load-value axiom, the atomicity axiom, and the progress axiom all may be violated with non-coherent memory.  Note however that coherent memory does not require a hardware cache coherence protocol.  The RISC-V Privileged ISA Specification suggests that hardware-incoherent regions of main memory are discouraged, but the memory model is compatible with hardware coherence, software coherence, implicit coherence due to read-only memory, implicit coherence due to only one agent having access, or otherwise.
   \item Idempotency PMAs: Idempotency PMAs are used to specify memory regions for which loads and/or stores may have side effects, and this in turn is used by the microarchitecture to determine, e.g., whether prefetches are legal.  This distinction does not affect the memory model.
 \end{itemize}
 
 
 \subsection{I/O Ordering}
 
-For I/O, the load value axiom and atomicity axiom in general do not apply, as both reads and writes might have device-specific side effects.
-The preserved program order rules do not generally apply to I/O either.
-Instead, we informally say that memory access $a$ is ordered before memory access $b$ if $a$ precedes $b$ in program order and one or more of the following holds:
+For I/O, the load value axiom and atomicity axiom in general do not apply, as both reads and writes might have device-specific side effects and may return values other than the value ``written'' by the most recent store to the same address.
+Nevertheless, the following preserved program order rules still generally apply for accesses to I/O memory:
+memory access $a$ precedes memory access $b$ in global memory order if $a$ precedes $b$ in program order and one or more of the following holds:
 \begin{enumerate}
+  \item $a$ precedes $b$ in preserved program order as defined in Chapter~\ref{ch:memorymodel}, with the exception that acquire and release ordering annotations apply only from one memory operation to another memory operation and from one I/O operation to another I/O operation, but not from a memory operation to an I/O operation, nor vice versa
   \item $a$ and $b$ are accesses to overlapping addresses in an I/O region
   \item $a$ and $b$ are accesses to the same strongly-ordered I/O region
   \item $a$ and $b$ are accesses to I/O regions, and the channel associated with the I/O region accessed by either $a$ or $b$ is channel 1
   \item $a$ and $b$ are accesses to I/O regions associated with the same channel (except for channel 0)
-  \item $a$ and $b$ are separated in program order by a FENCE, $a$ is in the predecessor set of the FENCE, and $b$ is in the successor set of the FENCE.  The predecessor and successor sets include the sets described by all eight FENCE bits {\tt .pr}, {\tt .pw}, {\tt .pi}, {\tt .po}, {\tt .sr}, {\tt .sw}, {\tt .si}, and {\tt .so}.
-  \item $a$ and $b$ are accesses to I/O regions, and $a$ has {\tt .aq} set
-  \item $a$ and $b$ are accesses to I/O regions, and $b$ has {\tt .rl} set
-  \item $a$ and $b$ are accesses to I/O regions, and $a$ and $b$ both have {\tt .aq} and {\tt .rl} set
-  \item $a$ and $b$ are accesses to I/O regions, and $a$ is an AMO that has {\tt .aq} and {\tt .rl} set
-  \item $a$ and $b$ are accesses to I/O regions, and $b$ is an AMO that has {\tt .aq} and {\tt .rl} set
 \end{enumerate}
 
-As described above, accesses to I/O memory require stronger synchronization that what is enforced by the RVWMO PPO rules.
-For such cases, {\tt FENCE} operations with {\tt .pi}, {\tt .po}, {\tt .si}, and/or {\tt .so} are needed.
-For example, to enforce ordering between a write to normal memory and an MMIO write to a device register, a {\tt FENCE w,o} or stronger is needed.
-Even {\tt .aq} and {\tt .rl} do not enforce ordering between normal memory accesses and accesses to I/O memory.
-When a fence is in fact used, implementations must assume that the device may attempt to access memory immediately after receiving the MMIO signal, and subsequent memory accesses from that device to memory must observe the effects of all accesses ordered prior to that MMIO operation.
+Note that the FENCE instruction distinguishes between main memory operations and I/O operations in its predecessor and successor sets.
+To enforce ordering between I/O operations and main memory operations, code must use a FENCE with PI, PO, SI, and/or SO, plus PR, PW, SR, and/or SW.
+For example, to enforce ordering between a write to main memory and an I/O write to a device register, a FENCE~W,O or stronger is needed.
 
 \begin{verbbox}
   sd t0, 0(a0)
@@ -853,153 +838,241 @@ When a fence is in fact used, implementations must assume that the device may at
   \label{fig:litmus:wo}
 \end{figure}
 
-In other words, in Figure~\ref{fig:litmus:wo}, suppose {\tt 0(a0)} is in normal memory and {\tt 0(a1)} is the address of a device register in I/O memory.
+When a fence is in fact used, implementations must assume that the device may attempt to access memory immediately after receiving the MMIO signal, and subsequent memory accesses from that device to memory must observe the effects of all accesses ordered prior to that MMIO operation.
+In other words, in Figure~\ref{fig:litmus:wo}, suppose {\tt 0(a0)} is in main memory and {\tt 0(a1)} is the address of a device register in I/O memory.
 If the device accesses {\tt 0(a0)} upon receiving the MMIO write, then that load must conceptually appear after the first store to {\tt 0(a0)} according to the rules of the RVWMO memory model.
 In some implementations, the only way to ensure this will be to require that the first store does in fact complete before the MMIO write is issued.
-Other implementations may find ways to be more aggressive, while others still may not need to do anything different at all for I/O and normal memory accesses.
+Other implementations may find ways to be more aggressive, while others still may not need to do anything different at all for I/O and main memory accesses.
 Nevertheless, the RVWMO memory model does not distinguish between these options; it simply provides an implementation-agnostic mechanism to specify the orderings that must be enforced.
 
-Many architectures include separate notions of ``ordering'' and ``completion'' fences, especially as it relates to I/O (as opposed to normal memory).
+Many architectures include separate notions of ``ordering'' and ``completion'' fences, especially as it relates to I/O (as opposed to regular main memory).
 Ordering fences simply ensure that memory operations stay in order, while completion fences ensure that predecessor accesses have all completed before any successors are made visible.
 RISC-V does not explicitly distinguish between ordering and completion fences.
 Instead, this distinction is simply inferred from different uses of the FENCE bits.
 
-For implementations that conform to the RISC-V Unix Platform Specification, I/O devices, DMA operations, etc.\@ are required to access memory coherently and via strongly-ordered I/O channels.
-Therefore, accesses to normal memory regions that are shared with I/O devices can also use the standard synchronization mechanisms.
-Implementations which do not conform to the Unix Platform Specification and/or in which devices do not access memory coherently will need to use platform-specific mechanisms (such as cache flushes) to enforce coherency.
+For implementations that conform to the RISC-V Unix Platform Specification, I/O devices and DMA operations are required to access memory coherently and via strongly-ordered I/O channels.
+Therefore, accesses to regular main memory regions that are concurrently accessed by external devices can also use the standard synchronization mechanisms.
+Implementations that do not conform to the Unix Platform Specification and/or in which devices do not access memory coherently will need to use mechanisms (which are currently platform-specific or device-specific) to enforce coherency.
 
 I/O regions in the address space should be considered non-cacheable regions in the PMAs for those regions.  Such regions can be considered coherent by the PMA if they are not cached by any agent.
 
 The ordering guarantees in this section may not apply beyond a platform-specific boundary between the RISC-V cores and the device.  In particular, I/O accesses sent across an external bus (e.g., PCIe) may be reordered before they reach their ultimate destination.  Ordering must be enforced in such situations according to the platform-specific rules of those external devices and buses.
 
-\section{Code Examples}
-\label{sec:mmcode}
+\section{Code Porting and Mapping Guidelines}
+\label{sec:memory:porting}
 
-\subsection{Compare and Swap}
-An example
-using {\tt lr}/{\tt sc} to implement a compare-and-swap function is shown in
-Figure~\ref{cas}.  If inlined, compare-and-swap functionality need
-only take three instructions.
+\begin{table}[h!]
+  \centering
+  \begin{tabular}{|l|l|}
+    \hline
+    x86/TSO Operation & RVWMO Mapping \\
+    \hline
+    \hline
+    Load              & \tt l\{b|h|w|d\}; fence r,rw               \\
+    \hline
+    Store             & \tt fence rw,w; s\{b|h|w|d\}               \\
+    \hline
+    \multirow{2}{*}{Atomic RMW}
+    & \tt amo<op>.\{w|d\}.aqrl \textrm{OR} \\
+    & \tt loop:\@ lr.\{w|d\}.aq; <op>; sc.\{w|d\}.aqrl; bnez loop \\
+    \hline
+    Fence             & \tt fence rw,rw \\
+    \hline
+  \end{tabular}
+  \caption{Mappings from TSO operations to RISC-V operations}
+  \label{tab:tsomappings}
+\end{table}
 
-\begin{figure}[h!]
-\begin{center}
-\begin{verbatim}
-        # a0 holds address of memory location 
-        # a1 holds expected value
-        # a2 holds desired value
-        # a0 holds return value, 0 if successful, !0 otherwise
-    cas:
-        lr.w t0, (a0)        # Load original value.
-        bne t0, a1, fail     # Doesn't match, so fail.
-        sc.w a0, a2, (a0)    # Try to update.
-        jr ra                # Return.
-    fail:
-        li a0, 1             # Set return to failure.
-        jr ra                # Return.
-\end{verbatim}
-\end{center}
-  \caption{Sample code for compare-and-swap function using {\tt lr}/{\tt sc}.}
-\label{cas}
-\end{figure}
+Table~\ref{tab:tsomappings} provides a mapping from TSO memory operations onto RISC-V memory instructions.
+Normal x86 loads and stores are all inherently acquire-RCpc and release-RCpc operations: TSO enforces all load-load, load-store, and store-store ordering by default.
+Therefore, under RVWMO, all TSO loads must be mapped onto a load followed by FENCE~R,RW, and all TSO stores must be mapped onto FENCE~RW,W followed by a store.
+TSO atomic read-modify-writes and x86 instructions using the LOCK prefix are fully-ordered and can be implemented either via an AMO with both {\em aq} and {\em rl} set, or via an LR with {\em aq} set, the arithmetic operation in question, an SC with both {\em aq} and {\em rl} set, and a conditional branch checking the success condition.
+In the latter case, the {\em rl} annotation on the LR turns out (for non-obvious reasons) to be redundant and can be omitted.
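+
+As a minimal sketch of Table~\ref{tab:tsomappings}, assuming 32-bit accesses and an arbitrary (hypothetical) choice of registers and of the label {\tt loop}, a TSO load, a TSO store, and a TSO atomic fetch-and-add might be translated as follows.  The AMO form and the LR/SC form are alternatives; only one of the two would be emitted.
+\begin{verbatim}
+        # TSO load of the word at 0(a0) into t0
+        lw            t0, 0(a0)
+        fence         r, rw
+        # TSO store of t1 to the word at 0(a1)
+        fence         rw, w
+        sw            t1, 0(a1)
+        # TSO atomic fetch-and-add of t2 to the word at (a2): AMO form
+        amoadd.w.aqrl t3, t2, (a2)
+        # The same fetch-and-add: LR/SC form
+    loop:
+        lr.w.aq       t3, (a2)       # t3 = old value
+        add           t4, t3, t2     # t4 = old value + addend
+        sc.w.aqrl     t5, t4, (a2)   # t5 = 0 on success
+        bnez          t5, loop       # retry if the SC failed
+\end{verbatim}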
 
-\subsection{Spinlocks}
-\label{sec:spinlock}
+Alternatives to Table~\ref{tab:tsomappings} are also possible.
+A TSO store can be mapped onto AMOSWAP with {\em rl} set.
+However, since RVWMO PPO Rule~\ref{ppo:amoforward} forbids forwarding of values from AMOs to subsequent loads, the use of AMOSWAP for stores may negatively affect performance.
+A TSO load can be mapped using LR with {\em aq} set: all such LR instructions will be unpaired, but that fact in and of itself does not preclude the use of LR for loads.
+Again, however, this mapping may negatively affect performance if it puts more pressure on the reservation mechanism than was originally intended.
 
-An example code sequence for a critical section guarded by a
-test-and-set spinlock is shown in Figure~\ref{critical}.  Note the
-first AMO is marked {\tt .aq} to order the lock acquisition before the
-critical section, and the second AMO is marked {\tt .rl} to order
-the critical section before the lock relinquishment.
+\begin{table}[h!]
+  \centering
+  \begin{tabular}{|l|l|}
+    \hline
+    Power Operation & RVWMO Mapping \\
+    \hline
+    \hline
+    Load              & \tt l\{b|h|w|d\}  \\
+    \hline
+    Load-Reserve      & \tt lr.\{w|d\}  \\
+    \hline
+    Store             & \tt s\{b|h|w|d\}  \\
+    \hline
+    Store-Conditional & \tt sc.\{w|d\}  \\
+    \hline
+    \tt lwsync        & \tt fence.tso \\
+    \hline
+    \tt sync          & \tt fence rw,rw \\
+    \hline
+    \tt isync         & \tt fence.i; fence r,r \\
+    \hline
+  \end{tabular}
+  \caption{Mappings from Power operations to RISC-V operations}
+  \label{tab:powermappings}
+\end{table}
+
+Table~\ref{tab:powermappings} provides a mapping from Power memory operations onto RISC-V memory instructions.
+Power ISYNC maps on RISC-V to a FENCE.I followed by a FENCE~R,R; the latter fence is needed because ISYNC is used to define a ``control+control fence'' dependency that is not present in RVWMO.
+
+\begin{table}[h!]
+  \centering
+  \begin{tabular}{|l|l|}
+    \hline
+    ARM Operation             & RVWMO Mapping \\
+    \hline
+    \hline
+    Load                      & \tt l\{b|h|w|d\}  \\
+    \hline
+    Load-Acquire              & \tt fence rw,rw; l\{b|h|w|d\}; fence r,rw  \\
+    \hline
+    Load-Exclusive            & \tt lr.\{w|d\}  \\
+    \hline
+    Load-Acquire-Exclusive    & \tt lr.\{w|d\}.aqrl \\
+    \hline
+    Store                     & \tt s\{b|h|w|d\}  \\
+    \hline
+    Store-Release             & \tt fence rw,w; s\{b|h|w|d\}  \\
+    \hline
+    Store-Exclusive           & \tt sc.\{w|d\}  \\
+    \hline
+    Store-Release-Exclusive   & \tt sc.\{w|d\}.rl  \\
+    \hline
+    \tt dmb                   & \tt fence rw,rw \\
+    \hline
+    \tt dmb.ld                & \tt fence r,rw \\
+    \hline
+    \tt dmb.st                & \tt fence w,w \\
+    \hline
+    \tt isb                   & \tt fence.i; fence r,r \\
+    \hline
+  \end{tabular}
+  \caption{Mappings from ARM operations to RISC-V operations}
+  \label{tab:armmappings}
+\end{table}
+
+Table~\ref{tab:armmappings} provides a mapping from ARM memory operations onto RISC-V memory instructions.
+Since RISC-V does not currently have plain load and store opcodes with {\em aq} or {\em rl} annotations, ARM load-acquire and store-release operations should be mapped using fences instead.
+Furthermore, in order to enforce store-release-to-load-acquire ordering, there must be a FENCE~RW,RW between the store-release and load-acquire; Table~\ref{tab:armmappings} enforces this by always placing the fence in front of each acquire operation.
+ARM load-exclusive and store-exclusive instructions can likewise map onto their RISC-V LR and SC equivalents, but rather than placing a FENCE~RW,RW in front of an LR with {\em aq} set, we simply set {\em rl} on that LR as well.
+ARM ISB maps on RISC-V to FENCE.I followed by FENCE~R,R similarly to how ISYNC maps for Power.
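+
+As an illustrative sketch of Table~\ref{tab:armmappings}, assuming 32-bit accesses and a hypothetical register assignment, an ARM store-release followed later by an ARM load-acquire might be translated as follows.  The FENCE~RW,RW in front of the load is what provides the store-release-to-load-acquire ordering discussed above.
+\begin{verbatim}
+        # ARM: stlr w1, [x0]     (store-release)
+        fence  rw, w
+        sw     a1, 0(a0)
+        # ...
+        # ARM: ldar w3, [x2]     (load-acquire)
+        fence  rw, rw
+        lw     a3, 0(a2)
+        fence  r, rw
+\end{verbatim}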
+
+\begin{table}[h!]
+  \centering
+  \begin{tabular}{|l|l|}
+    \hline
+    Linux Operation           & RVWMO Mapping \\
+    \hline
+    \hline
+    \tt smp\_mb()             & \tt fence rw,rw \\
+    \hline
+    \tt smp\_rmb()            & \tt fence r,r \\
+    \hline
+    \tt smp\_wmb()            & \tt fence w,w \\
+    \hline
+    \tt dma\_rmb()            & \tt fence r,r \\
+    \hline
+    \tt dma\_wmb()            & \tt fence w,w \\
+    \hline
+    \tt mb()                  & \tt fence iorw,iorw \\
+    \hline
+    \tt rmb()                 & \tt fence ri,ri \\
+    \hline
+    \tt wmb()                 & \tt fence wo,wo \\
+    \hline
+    \tt smp\_load\_acquire()   & \tt l\{b|h|w|d\}; fence r,rw \\
+    \hline
+    \tt smp\_store\_release()  & \tt fence.tso; s\{b|h|w|d\}  \\
+    \hline
+    \hline
+    Linux Construct            & RVWMO AMO Mapping        \\
+    \hline
+    \tt atomic\_<op>\_relaxed  & \tt amo<op>.\{w|d\}      \\
+    \hline
+    \tt atomic\_<op>\_acquire  & \tt amo<op>.\{w|d\}.aq   \\
+    \hline
+    \tt atomic\_<op>\_release  & \tt amo<op>.\{w|d\}.rl   \\
+    \hline
+    \tt atomic\_<op>           & \tt amo<op>.\{w|d\}.aqrl \\
+    \hline
+    \hline
+    Linux Construct            & RVWMO LR/SC Mapping\\
+    \hline
+    \tt atomic\_<op>\_relaxed  & \tt loop:\@ lr.\{w|d\}; <op>; sc.\{w|d\}; bnez loop \\
+    \hline
+    \tt atomic\_<op>\_acquire  & \tt loop:\@ lr.\{w|d\}.aq; <op>; sc.\{w|d\}; bnez loop \\
+    \hline
+    \multirow{2}{*}{\tt atomic\_<op>\_release}
+      & \tt loop:\@ lr.\{w|d\}; <op>; sc.\{w|d\}.aqrl$^*$; bnez loop \textrm{OR} \\
+      & \tt fence.tso; loop:\@ lr.\{w|d\}; <op>; sc.\{w|d\}$^*$; bnez loop \\
+    \hline
+    \tt atomic\_<op>           & \tt loop:\@ lr.\{w|d\}.aq; <op>; sc.\{w|d\}.aqrl; bnez loop \\
+    \hline
+  \end{tabular}
+  \caption{Mappings from Linux memory primitives to RISC-V primitives.  Other constructs (such as spinlocks) should follow accordingly.  Platforms or devices with non-coherent DMA may need additional synchronization (such as cache flush or invalidate mechanisms); currently any such extra synchronization will be device-specific.}
+  \label{tab:linuxmappings}
+\end{table}
+
+Table~\ref{tab:linuxmappings} provides a mapping of Linux memory ordering macros onto RISC-V memory instructions.
+%The now-deprecated fence {\tt smp\_read\_barrier\_depends()} map to a no-op due to preserved program order rules \ref{ppo:addr}--\ref{ppo:ctrl}.
+The Linux fences {\tt dma\_rmb()} and {\tt dma\_wmb()} map onto FENCE~R,R and FENCE~W,W, respectively, since the RISC-V Unix Platform requires coherent DMA, but would be mapped onto FENCE~RI,RI and FENCE~WO,WO, respectively, on a platform with non-coherent DMA.
+Platforms with non-coherent DMA may also require a mechanism by which cache lines can be flushed and/or invalidated.
+Such mechanisms will be device-specific and/or standardized in a future extension to the ISA.
+
+The Linux mappings for release operations may seem stronger than necessary, but these mappings are needed to cover some cases in which Linux requires stronger orderings than the more intuitive mappings would provide.
+In particular, as of the time this text is being written, Linux is actively debating whether to require load-load, load-store, and store-store orderings between accesses in one critical section and accesses in a subsequent critical section in the same hart and protected by the same synchronization object.
+Not all combinations of FENCE~RW,W/FENCE~R,RW mappings with {\em aq}/{\em rl} mappings combine to provide such orderings.
+There are a few ways around this problem, including:
+\begin{enumerate}
+  \item Always use FENCE~RW,W/FENCE~R,RW, and never use {\em aq}/{\em rl}.  This suffices but is undesirable, as it defeats the purpose of the {\em aq}/{\em rl} modifiers.
+  \item Always use {\em aq}/{\em rl}, and never use FENCE~RW,W/FENCE~R,RW.  This does not currently work due to the lack of load and store opcodes with {\em aq} and {\em rl} modifiers.
+  \item Strengthen the mappings of release operations such that they would enforce sufficient orderings in the presence of either type of acquire mapping.  This is the currently-recommended solution, and the one shown in Table~\ref{tab:linuxmappings}.
+\end{enumerate}
 
 \begin{figure}[h!]
-\begin{center}
-\begin{verbatim}
-        li           t0, 1        # Initialize swap value.
-    again:
-        amoswap.w.aq t0, t0, (a0) # Attempt to acquire lock.
-        bnez         t0, again    # Retry if held.
-        # ...
-        # Critical section.
-        # ...
-        amoswap.w.rl x0, x0, (a0) # Release lock by storing 0.
-\end{verbatim}
-\end{center}
-\caption{Sample code for mutual exclusion.  {\tt a0} contains the address of the lock.}
-\label{critical}
+  \centering\small
+  \begin{verbbox}
+Linux code:
+(a)  int r0 = *x;
+(bc) spin_unlock(y, 0);
+     ...
+     ...
+(d)  spin_lock(y);
+(e)  int r1 = *z;
+  \end{verbbox}
+  \theverbbox
+  ~~~~~~~~~~
+  \begin{verbbox}
+RVWMO Mapping:
+(a) lw           a0, 0(s0)
+(b) fence.tso  // vs. fence rw,w
+(c) sd           x0,0(s1)
+    ...
+    loop:
+(d) amoswap.d.aq a1,t1,0(s1)
+    bnez         a1,loop
+(e) lw           a2,0(s2)
+  \end{verbbox}
+  \theverbbox
+  \caption{Orderings between critical sections in Linux}
+  \label{fig:litmus:lkmm_ll}
 \end{figure}
 
-\section{Code Porting Guidelines}
-\label{sec:porting}
-
-Normal x86 loads and stores are all inherently acquire and release operations: TSO enforces all load-load, load-store, and store-store ordering by default.
-All TSO loads must be mapped onto {\tt l\{b|h|w|d\}; fence r,rw}, and all TSO stores must either be mapped onto {\tt amoswap.rl x0} or onto {\tt fence rw,w; s\{b|h|w|d\}}.
-Alternatively, TSO loads and stores can be mapped onto {\tt l\{b|h|w|d\}.aq} and {\tt s\{b|h|w|d\}.rl} assembler pseudoinstructions to facilitate forwards compatibility in case such instructions are added to the ISA one day.
-However, in the meantime, the assembler will generate the same fence-based and/or {\tt amoswap}-based versions for these pseudoinstructions.
-%However, the more correct solution to porting code from x86-TSO (which is generally overly-constrained at the assembly level compared to DRF software requirements) is to rewrite the algorithm to determine which orderings the original algorithm actually required, and then to re-code the algorithm in terms of the RVWMO memory model.
-x86 atomics using the LOCK prefix are all sequentially consistent and when ported naively to RISC-V must be marked as {\tt .aqrl}.
-
-A Power {\tt sync}/{\tt hwsync} fence, an ARM {\tt dmb} fence, and an x86 {\tt mfence} are all equivalent to a RISC-V {\tt fence rw,rw}.
-Power {\tt isync} and ARM {\tt isb} map to RISC-V {\tt fence.i}.
-A Power {\tt lwsync} map onto {\tt fence.tso}, or onto {\tt fence rw,rw} when {\tt fence.tso} is not available.
-ARM {\tt dmb ld} and {\tt dmb st} fences map to RISC-V {\tt fence r,rw} and {\tt fence w,w}, respectively.
-
-A direct mapping of ARMv8 atomics that maps unordered instructions to unordered instructions, RCpc instructions to RCpc instructions, and RCsc instructions to RCsc instructions is likely to work in the majority of cases.
-Mapping even unordered load-reserved instructions onto {\tt lr.aq} (particularly for LR/SC pairs without internal data dependencies) is an even safer bet, as this ensures C/C++ release sequences will be respected.
-However, due to a subtle mismatch between the two models, strict theoretical compatibility with the ARMv8 memory model requires that a naive mapping translate all ARMv8 store conditional and load-acquire operations map onto RISC-V RCsc operations.
-Any atomics which are naively ported into RCsc operations may revert back to the straightforward mapping if the programmer can verify that the code is not relying on an ordering from the store-conditional to the load-acquire (as this is not common).
-
-%ARMv8 solves the C/C++ release sequence problem of Chapter~\ref{sec:acqrel} through a rule that is different from rule~\ref{ppo:loadtoacq}.
-%Therefore, strict formal compatibility requires naively-ported ARMv8 load-acquire operations to be preceded by a {\tt fence w,r,[addr]} or stronger.
-%The naive translations of ARM {\tt ldapr}, {\tt ldar}, and {\tt stlr} are therefore ``{\tt fence w,r,[addr]; amoor.aq rd,[addr],x0}'', ``{\tt fence w,r,[addr]; amoor.aq.rl rd,[addr],x0}'' and ``{\tt amoswap.aq.rl} with {\tt rd=x0}'', respectively.
-%In general the extra fence would not have been necessary if the original source were recompiled to RISC-V natively, because the RVWMO memory model already solves the same underlying problem just in a different way.
-%Naive ports may choose whether to stick to a strict naive port or to assume that the (cheaper) mapping without the fence is more than likely sufficient.
-
-The Linux fences {\tt smp\_mb()}, {\tt smp\_wmb()}, {\tt smp\_rmb()} map onto {\tt fence rw,rw}, {\tt fence w,w}, and {\tt fence r,r}, respectively.  The fence {\tt smp\_read\_barrier\_depends()} map to a no-op due to preserved program order rules \ref{ppo:addr}--\ref{ppo:ctrl}.
-The Linux fences {\tt dma\_rmb()} and {\tt dma\_wmb()} map onto {\tt fence r,r} and {\tt fence w,w}, respectively, since the RISC-V Unix Platform requires coherent DMA.
-The Linux fences {\tt rmb()}, {\tt wmb()}, and {\tt mb()} map onto {\tt fence ri,ri}, {\tt fence wo,wo}, and {\tt fence rwio,rwio}, respectively.
-
-%\begin{table}[h!]
-%  \begin{tabular}{|l|l|l|}
-%    \hline
-%    C/C++ Construct                            & Base ISA Mapping & `A' Extension Mapping \\
-%    \hline
-%    \hline
-%    Non-atomic load                            & \multicolumn{2}{l|}{\tt ld}               \\
-%    \hline
-%    \tt atomic\_load(memory\_order\_relaxed)   & \multicolumn{2}{l|}{\tt ld}               \\
-%    \hline
-%    %\tt atomic\_load(memory\_order\_consume)   & \multicolumn{2}{l|}{\tt ld; fence r,rw}   \\
-%    %\hline
-%    \tt atomic\_load(memory\_order\_acquire)   & \tt fence r,r,[addr]; & \tt ld.aq   \\
-%                                               & \tt ld; fence r,rw    &                       \\
-%    \hline
-%    \tt atomic\_load(memory\_order\_seq\_cst)  & \tt fence rw,rw; ld; & \tt ld.aq.rl rs2=x0   \\
-%                                               & \tt fence r,rw       &                       \\
-%    \hline
-%    \hline
-%    Non-atomic store                           & \multicolumn{2}{l|}{\tt sd}               \\
-%    \hline
-%    \tt atomic\_store(memory\_order\_relaxed)  & \multicolumn{2}{l|}{\tt sd}               \\
-%    \hline
-%    \tt atomic\_store(memory\_order\_release)  & \tt fence rw,w; sd  & \tt sd.rl x0   \\
-%    \hline
-%    \tt atomic\_store(memory\_order\_seq\_cst) & \tt fence rw,rw; sd & \tt sd.aq.rl x0  \\
-%    \hline
-%    \hline
-%    \tt atomic\_thread\_fence(memory\_order\_acquire)  & \multicolumn{2}{l|}{\tt fence r,rw} \\
-%    \hline
-%    \tt atomic\_thread\_fence(memory\_order\_release)  & \multicolumn{2}{l|}{\tt fence rw,w} \\
-%    \hline
-%    \tt atomic\_thread\_fence(memory\_order\_acq\_rel) & \multicolumn{2}{l|}{{\tt fence rw,rw~} or {~\tt fence rw,w; fence r,rw}} \\
-%    \hline
-%    \tt atomic\_thread\_fence(memory\_order\_seq\_cst) & \multicolumn{2}{l|}{\tt fence rw,rw} \\
-%    \hline
-%  \end{tabular}
-%  \caption{Mappings from C/C++ primitives to RISC-V primitives.  The atomics mapping is preferable where available.}
-%  \label{tab:mappings}
-%\end{table}
+For example, the critical section ordering rule currently being debated by the Linux community would require (a) to be ordered before (e) in Figure~\ref{fig:litmus:lkmm_ll}.
+If that is indeed required, then it would be insufficient for (b) to map as FENCE~RW,W.
+That said, these mappings are subject to change as the Linux Kernel Memory Model evolves.
 
 \begin{table}[h!]
+  \centering
   \begin{tabular}{|l|l|}
     \hline
     C/C++ Construct                            & RVWMO Mapping \\
@@ -1022,7 +1095,7 @@ The Linux fences {\tt rmb()}, {\tt wmb()}, and {\tt mb()} map onto {\tt fence ri
     \hline
     \tt atomic\_store(memory\_order\_release)  & \tt fence rw,w; s\{b|h|w|d\}  \\
     \hline
-    \tt atomic\_store(memory\_order\_seq\_cst) & \tt fence rw,rw; s\{b|h|w|d\} \\
+    \tt atomic\_store(memory\_order\_seq\_cst) & \tt fence rw,w; s\{b|h|w|d\}  \\
     \hline
     \hline
     \tt atomic\_thread\_fence(memory\_order\_acquire)  & \tt fence r,rw \\
@@ -1033,113 +1106,183 @@ The Linux fences {\tt rmb()}, {\tt wmb()}, and {\tt mb()} map onto {\tt fence ri
     \hline
     \tt atomic\_thread\_fence(memory\_order\_seq\_cst) & \tt fence rw,rw \\
     \hline
+    \hline
+    C/C++ Construct                           & RVWMO AMO Mapping        \\
+    \hline
+    \tt atomic\_<op>(memory\_order\_relaxed)  & \tt amo<op>.\{w|d\}      \\
+    \hline
+    \tt atomic\_<op>(memory\_order\_acquire)  & \tt amo<op>.\{w|d\}.aq   \\
+    \hline
+    \tt atomic\_<op>(memory\_order\_release)  & \tt amo<op>.\{w|d\}.rl   \\
+    \hline
+    \tt atomic\_<op>(memory\_order\_acq\_rel) & \tt amo<op>.\{w|d\}.aqrl \\
+    \hline
+    \tt atomic\_<op>(memory\_order\_seq\_cst) & \tt amo<op>.\{w|d\}.aqrl \\
+    \hline
+    \hline
+    C/C++ Construct                           & RVWMO LR/SC Mapping\\
+    \hline
+    \multirow{2}{*}{\tt atomic\_<op>(memory\_order\_relaxed)}
+      & \tt loop:\@ lr.\{w|d\}; <op>; sc.\{w|d\}; \\
+      & \tt bnez loop \\
+    \hline
+    \multirow{2}{*}{\tt atomic\_<op>(memory\_order\_acquire)}
+      & \tt loop:\@ lr.\{w|d\}.aq; <op>; sc.\{w|d\}; \\
+      & \tt bnez loop \\
+    \hline
+    \multirow{2}{*}{\tt atomic\_<op>(memory\_order\_release)}
+      & \tt loop:\@ lr.\{w|d\}; <op>; sc.\{w|d\}.rl; \\
+      & \tt bnez loop \\
+    \hline
+    \multirow{2}{*}{\tt atomic\_<op>(memory\_order\_acq\_rel)}
+      & \tt loop:\@ lr.\{w|d\}.aq; <op>; sc.\{w|d\}.rl; \\
+      & \tt bnez loop \\
+    \hline
+    \multirow{2}{*}{\tt atomic\_<op>(memory\_order\_seq\_cst)}
+      & \tt loop:\@ lr.\{w|d\}.aqrl; <op>; \\
+      & \tt sc.\{w|d\}.rl; bnez loop \\
+    \hline
   \end{tabular}
-  \caption{Mappings from C/C++ primitives to RISC-V primitives.} %  The atomics mapping is preferable where available.}
-  \label{tab:mappings}
+  \caption{Mappings from C/C++ primitives to RISC-V primitives.}
+  \label{tab:c11mappings}
 \end{table}
 
-The C11/C++11 {\tt memory\_order\_*} primitives should be mapped as shown in Table~\ref{tab:mappings}.
-The {\tt memory\_order\_acquire} orderings in particular must use fences rather than atomics to ensure that release sequences behave correctly even in the presence of {\tt amoswap}.
-The {\tt memory\_order\_release} mappings may use {\tt .rl} as an alternative.
-
 \begin{table}[h!]
-\centering
+  \centering
   \begin{tabular}{|l|l|}
     \hline
-    Ordering Annotation & Fence-based Equivalent \\
+    C/C++ Construct                            & RVWMO Mapping \\
+    \hline
+    \hline
+    Non-atomic load                            & \tt l\{b|h|w|d\}               \\
+    \hline
+    \tt atomic\_load(memory\_order\_relaxed)   & \tt l\{b|h|w|d\}               \\
+    \hline
+    \tt atomic\_load(memory\_order\_acquire)   & \tt l\{b|h|w|d\}.aq  \\
+    \hline
+    \tt atomic\_load(memory\_order\_seq\_cst)  & \tt l\{b|h|w|d\}.aq  \\
+    \hline
+    \hline
+    Non-atomic store                           & \tt s\{b|h|w|d\}               \\
+    \hline
+    \tt atomic\_store(memory\_order\_relaxed)  & \tt s\{b|h|w|d\}               \\
+    \hline
+    \tt atomic\_store(memory\_order\_release)  & \tt s\{b|h|w|d\}.rl  \\
+    \hline
+    \tt atomic\_store(memory\_order\_seq\_cst) & \tt s\{b|h|w|d\}.rl \\
+    \hline
+    \hline
+    \tt atomic\_thread\_fence(memory\_order\_acquire)  & \tt fence r,rw \\
+    \hline
+    \tt atomic\_thread\_fence(memory\_order\_release)  & \tt fence rw,w \\
+    \hline
+    \tt atomic\_thread\_fence(memory\_order\_acq\_rel) & {\tt fence.tso} \\
+    \hline
+    \tt atomic\_thread\_fence(memory\_order\_seq\_cst) & \tt fence rw,rw \\
+    \hline
+    \hline
+    C/C++ Construct                           & RVWMO AMO Mapping    \\
+    \hline
+    \tt atomic\_<op>(memory\_order\_relaxed)  & \tt amo<op>.\{w|d\}      \\
+    \hline
+    \tt atomic\_<op>(memory\_order\_acquire)  & \tt amo<op>.\{w|d\}.aq   \\
+    \hline
+    \tt atomic\_<op>(memory\_order\_release)  & \tt amo<op>.\{w|d\}.rl   \\
+    \hline
+    \tt atomic\_<op>(memory\_order\_acq\_rel) & \tt amo<op>.\{w|d\}.aqrl \\
     \hline
-    \tt l\{b|h|w|d|r\}.aq        & \tt l\{b|h|w|d|r\}; fence r,rw \\
+    \tt atomic\_<op>(memory\_order\_seq\_cst) & \tt amo<op>.\{w|d\}.aqrl \\
     \hline
-    \tt l\{b|h|w|d|r\}.aqrl      & \tt fence rw,rw; l\{b|h|w|d|r\}; fence r,rw \\
     \hline
-    \tt s\{b|h|w|d|c\}.rl        & \tt fence rw,w; s\{b|h|w|d|c\} \\
+    C/C++ Construct                           & RVWMO LR/SC Mapping\\
     \hline
-    \tt s\{b|h|w|d|c\}.aqrl      & \tt fence rw,w; s\{b|h|w|d|c\} \\
+    \tt atomic\_<op>(memory\_order\_relaxed)  & \tt lr.\{w|d\}; <op>; sc.\{w|d\} \\
     \hline
-    \tt amo<op>.aq               & \tt amo<op>; fence r,rw \\
+    \tt atomic\_<op>(memory\_order\_acquire)  & \tt lr.\{w|d\}.aq; <op>; sc.\{w|d\} \\
     \hline
-    \tt amo<op>.rl               & \tt fence rw,w; amo<op> \\
+    \tt atomic\_<op>(memory\_order\_release)  & \tt lr.\{w|d\}; <op>; sc.\{w|d\}.rl \\
     \hline
-    \tt amo<op>.aqrl             & \tt fence rw,rw; amo<op>; fence rw,rw \\
+    \tt atomic\_<op>(memory\_order\_acq\_rel) & \tt lr.\{w|d\}.aq; <op>; sc.\{w|d\}.rl \\
     \hline
+    \tt atomic\_<op>(memory\_order\_seq\_cst) & \tt lr.\{w|d\}.aq$^*$; <op>; sc.\{w|d\}.rl \\
+    \hline
+    \multicolumn{2}{l}{$^*$must be {\tt lr.\{w|d\}.aqrl} in order to interoperate with code mapped per Table~\ref{tab:c11mappings}}
   \end{tabular}
-  \caption{Mappings from {\tt .aq} and/or {\tt .rl} to fence-based equivalents.  An alternative mapping places a {\tt fence rw,rw} after the existing {\tt s\{b|h|w|d|c\}} mapping rather than at the front of the {\tt l\{b|h|w|d|r\}} mapping.}
-  \label{tab:aqrltofence}
+  \caption{Hypothetical mappings from C/C++ primitives to RISC-V primitives, if native load-acquire and store-release opcodes are introduced.}
+  \label{tab:c11mappings_hypothetical}
 \end{table}
 
-It is also safe to translate any {\tt .aq}, {\tt .rl}, or {\tt .aqrl} annotation into the fence-based snippets of Table~\ref{tab:aqrltofence}.
-These can also be used as a legal implementation of {\tt l\{b|h|w|d\}} or {\tt s\{b|h|w|d\}} pseudoinstructions for as long as those instructions are not added to the ISA.
+Table~\ref{tab:c11mappings} provides a mapping of C11/C++11 atomic operations onto RISC-V memory instructions.
+If load and store opcodes with {\em aq} and {\em rl} modifiers are introduced, then the mappings in Table~\ref{tab:c11mappings_hypothetical} will suffice.
+Note however that the two mappings only interoperate correctly if {\tt atomic\_<op>(memory\_order\_seq\_cst)} is mapped using an LR that has both {\em aq} and {\em rl} set.
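+
+As a rough sketch of how Table~\ref{tab:c11mappings} applies to a message-passing idiom, consider a writer that publishes a value with a release store and a reader that uses a relaxed load followed by an acquire fence.  The variable names {\tt data} and {\tt flag}, the register assignment, and the label {\tt skip} are assumptions made for illustration only.
+\begin{verbatim}
+        # Writer: data = 42; atomic_store(&flag, 1, memory_order_release)
+        li     t0, 42
+        sw     t0, 0(a0)        # non-atomic store to data
+        li     t1, 1
+        fence  rw, w            # release ordering for the flag store
+        sw     t1, 0(a1)        # store to flag
+
+        # Reader: if (atomic_load(&flag, memory_order_relaxed)) {
+        #           atomic_thread_fence(memory_order_acquire); r = data; }
+        lw     t2, 0(a1)        # relaxed load of flag
+        beqz   t2, skip         # flag not yet set
+        fence  r, rw            # acquire fence
+        lw     t3, 0(a0)        # load of data
+    skip:
+\end{verbatim}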
+
+Any AMO can be emulated by an LR/SC pair, but care must be taken to ensure that any PPO orderings that originate from the LR are also made to originate from the SC, and that any PPO orderings that terminate at the SC are also made to terminate at the LR.
+For example, the LR must also be made to respect any data dependencies that the AMO has, given that load operations do not otherwise have any notion of a data dependency.
+Likewise, the effect of a FENCE~R,R elsewhere in the same hart must also be made to apply to the SC, which would not otherwise respect that fence.
+The emulator may achieve this effect by simply mapping AMOs onto {\tt lr.aq;~<op>;~sc.aqrl}, matching the mapping used elsewhere for fully-ordered atomics.
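+
+A minimal sketch of such an emulation, assuming the AMO being emulated is {\tt amoadd.w~t0,t1,(a0)} and that {\tt t2} and {\tt t3} are free scratch registers (both assumptions, as is the label {\tt retry}), might look as follows:
+\begin{verbatim}
+        # Emulation of: amoadd.w t0, t1, (a0)
+    retry:
+        lr.w.aq    t0, (a0)       # t0 = old value, as the AMO would return
+        add        t2, t0, t1     # t2 = new value to be written
+        sc.w.aqrl  t3, t2, (a0)   # t3 = 0 on success
+        bnez       t3, retry      # retry if the SC failed
+\end{verbatim}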
 
 \section{Implementation Guidelines}
 
 The RVWMO and RVTSO memory models by no means preclude microarchitectures from employing sophisticated speculation techniques or other forms of optimization in order to deliver higher performance.
 The models also do not impose any requirement to use any one particular cache hierarchy, nor even to use a cache coherence protocol at all.
 Instead, these models only specify the behaviors that can be exposed to software.
-Microarchitectures are free to use any pipeline design, any coherent or non-coherent cache hierarchy, any on-chip interconnect, etc., as long as the design satisfy the memory model rules.
-That said, to help people understand the actual implementations of the memory model, in this section we provide some guidelines below on how architects and programmers should interpret the models' rules.
+Microarchitectures are free to use any pipeline design, any coherent or non-coherent cache hierarchy, any on-chip interconnect, etc., as long as the design only admits executions that satisfy the memory model rules.
+That said, to help people understand the actual implementations of the memory model, in this section we provide some guidelines on how architects and programmers should interpret the models' rules.
 
-Both RVWMO and RVTSO are multi-copy atomic (or ``other-multi-copy-atomic''): any store value which is visible to a hart other than the one that originally issued it must also be conceptually visible to all other harts in the system.
-In other words, harts may forward from their own previous stores before those stores have become globally visible to all harts, but no other early intra-hart forwarding is permitted.
+Both RVWMO and RVTSO are multi-copy atomic (or ``other-multi-copy-atomic''): any store value that is visible to a hart other than the one that originally issued it must also be conceptually visible to all other harts in the system.
+In other words, harts may forward from their own previous stores before those stores have become globally visible to all harts, but no early inter-hart forwarding is permitted.
 Multi-copy atomicity may be enforced in a number of ways.
 It might hold inherently due to the physical design of the caches and store buffers, it may be enforced via a single-writer/multiple-reader cache coherence protocol, or it might hold due to some other mechanism.
 
 Although multi-copy atomicity does impose some restrictions on the microarchitecture, it is one of the key properties keeping the memory model from becoming extremely complicated.
-For example, a hart may not legally forward a value from a neighbor hart's private store buffer, unless those two harts are the only two in the system.
+For example, a hart may not legally forward a value from a neighbor hart's private store buffer (unless of course it is done in such a way that no new illegal behaviors become architecturally visible).
 Nor may a cache coherence protocol forward a value from one hart to another until the coherence protocol has invalidated all older copies from other caches.
 Of course, microarchitectures may (and high-performance implementations likely will) violate these rules under the covers through speculation or other optimizations, as long as any non-compliant behaviors are not exposed to the programmer.
 
 As a rough guideline for interpreting the PPO rules in RVWMO, we expect the following from the software perspective:
 \begin{itemize}
-  \item programmers will use PPO rules \ref{ppo:->st}--\ref{ppo:amoload} regularly and actively.
+  \item programmers will use PPO rules \ref{ppo:->st} and \ref{ppo:fence}--\ref{ppo:pair} regularly and actively.
   \item expert programmers will use PPO rules \ref{ppo:addr}--\ref{ppo:ctrl} to speed up critical paths of important data structures.
-  %\item expert programmers will occasionally use PPO rules \ref{ppo:rdw}--\ref{ppo:rfiaq} in very aggressive code and/or as part of a longer chain of synchronization.
-  \item even expert programmers will rarely if ever use PPO rules \ref{ppo:success}--\ref{ppo:addrpocfence} directly.  These are included to facilitate common microarchitectural optimizations (rule~\ref{ppo:rdw}) and the operational formal modeling approach (rules \ref{ppo:success} and \ref{ppo:ld->st->ld}--\ref{ppo:addrpocfence}) described in Chapter~\ref{sec:operational}.  They also facilitate the process of porting code from other architectures which have similar rules.
+  \item even expert programmers will rarely if ever use PPO rules \ref{ppo:rdw}--\ref{ppo:amoforward} and \ref{ppo:addrdatarfi}--\ref{ppo:addrpo} directly.  These are included to facilitate common microarchitectural optimizations (rule~\ref{ppo:rdw}) and the operational formal modeling approach (rules \ref{ppo:amoforward} and \ref{ppo:addrdatarfi}--\ref{ppo:addrpo}) described in Section~\ref{sec:operational}.  They also facilitate the process of porting code from other architectures that have similar rules.
 \end{itemize}
 
 We also expect the following from the hardware perspective:
 \begin{itemize}
-  \item PPO rules \ref{ppo:->st}--\ref{ppo:release} and \ref{ppo:amostore}--\ref{ppo:amoload} reflect well-understood rules that should pose few surprises to architects.
-  \item PPO rule \ref{ppo:strongacqrel} may not be immediately obvious to architects, but is somewhat standard nevertheless
-  \item The load value axiom, the atomicity axiom, and PPO rules \ref{ppo:addr}--\ref{ppo:ctrl} and \ref{ppo:ld->st->ld}--\ref{ppo:addrpocfence} reflect rules that most hardware implementations will enforce naturally, unless they contain extreme optimizations.  Of course, implementations should make sure to double check these rules nevertheless.  Hardware must also ensure that syntactic dependencies are not ``optimized away''.
-  %\item PPO rules \ref{ppo:strongacqrel} and \ref{ppo:rfiaq} may not be obvious or intuitive, and hence they deserve particular attention.
-  %\item PPO rules \ref{ppo:strongacqrel} and \ref{ppo:rmwrfi}--\ref{ppo:rfiaq} may not be obvious or intuitive, and hence they deserve particular attention.
-  \item PPO rule \ref{ppo:success} is not obvious, but it is necessary to avoid certain out-of-thin-air-like behavior that appears with store-conditional success values
+  \item PPO rules \ref{ppo:->st} and \ref{ppo:amoforward}--\ref{ppo:release} reflect well-understood rules that should pose few surprises to architects.
   \item PPO rule \ref{ppo:rdw} reflects a natural and common hardware optimization, but one that is very subtle and hence is worth double checking carefully.
+  \item PPO rule \ref{ppo:rcsc} may not be immediately obvious to architects, but it is a standard memory model requirement.
+  \item The load value axiom, the atomicity axiom, and PPO rules \ref{ppo:pair}--\ref{ppo:addrpo} reflect rules that most hardware implementations will enforce naturally, unless they contain extreme optimizations.  Of course, implementations should make sure to double check these rules nevertheless.  Hardware must also ensure that syntactic dependencies are not ``optimized away''.
 \end{itemize}
 
 Architectures are free to implement any of the memory model rules as conservatively as they choose.  For example, a hardware implementation may choose to do any or all of the following:
   \begin{itemize}
-    \item interpret all fences as if they were {\tt fence rw,rw} (or {\tt fence iorw,iorw}, if I/O is involved), regardless of the bits actually set
-    \item implement all fences with {\tt .pw} and {\tt .sr} as if they were {\tt fence~rw,rw} (or {\tt fence~iorw,iorw}, if I/O is involved), as ``{\tt w,r}'' is the most expensive of the four possible normal memory orderings anyway
-    \item ignore any addresses passed to a fence instruction and simply implement the fence for all addresses
-    \item implement an instruction with {\tt .aq} set as being preceded immediately by {\tt fence r,rw}
-    \item implement an instruction with {\tt .rl} set as being succeeded immediately by {\tt fence rw,w}
+    \item interpret all fences as if they were FENCE~RW,RW (or FENCE~IORW,IORW, if I/O is involved), regardless of the bits actually set
+    \item implement all fences with PW and SR as if they were FENCE~RW,RW (or FENCE~IORW,IORW, if I/O is involved), as PW with SR is the most expensive of the four possible main memory ordering components anyway
+    \item emulate {\em aq} and {\em rl} as described in Section~\ref{sec:memory:porting}
     \item enforce all same-address load-load ordering, even in the presence of patterns such as ``fri-rfi'' and ``RSW''
-    \item forbid any forwarding of a value from a store in the store buffer to a subsequent AMO or {\tt lr} to the same address
-    \item forbid any forwarding of a value from an AMO or {\tt sc} in the store buffer to a subsequent load to the same address
-    \item implement TSO on all memory accesses, and ignore any normal memory fences that do not include ``{\tt w,r}'' ordering
-    \item implement all atomics to be RCsc; i.e., always enforce all store-release-to-load-acquire ordering
+    \item forbid any forwarding of a value from a store in the store buffer to a subsequent AMO or LR to the same address
+    \item forbid any forwarding of a value from an AMO or SC in the store buffer to a subsequent load to the same address
+    \item implement TSO on all memory accesses, and ignore any main memory fences that do not include PW and SR ordering (e.g., as Ztso implementations will do)
+    \item implement all atomics to be RCsc or even fully ordered, regardless of annotation
   \end{itemize}
-%PPO rules~\ref{ppo:ld->st->ld}--\ref{ppo:addrpocfence} are not intended to impose any ordering requirements onto a processor pipeline beyond constraints which arise naturally, but extremely-optimized pipelines should be careful not to violate these rules nevertheless (or to ensure that any speculation-based optimizations do not make illegal behaviors visible to software). 
 
-Architectures which implement RVTSO can safely do the following:
+Architectures that implement RVTSO can safely do the following:
 \begin{itemize}
-  \item Ignore all {\tt .aq} and {\tt .rl} bits, since these are implicitly always set under RVTSO.  ({\tt .aqrl} cannot be ignored, however, due to PPO rules \ref{ppo:strongacqrel}--\ref{ppo:amoload}.)
-  \item Ignore all fences which do not have both {\tt .pw} and {\tt .sr} (unless the fence also orders I/O)
-  \item Ignore PPO rules \ref{ppo:->st} and \ref{ppo:addr}--\ref{ppo:addrpocfence}, since these are redundant with other PPO rules under RVTSO assumptions
+  \item Ignore all fences that do not have both PW and SR (unless the fence also orders I/O)
+  \item Ignore all PPO rules except for rules \ref{ppo:fence} through \ref{ppo:rcsc}, since the rest are redundant with other PPO rules under RVTSO assumptions
 \end{itemize}
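The fence-filtering rule in the list above can be written as a small predicate. The following is a minimal Python sketch, not part of the specification; the function name {\tt rvtso\_can\_ignore\_fence} and the string encoding of the predecessor and successor sets are assumptions made for illustration only.

```python
# Illustrative sketch only; not part of the RVWMO/RVTSO specification.
# A fence is described by its predecessor and successor sets, written as
# strings over the letters i, o, r, w (e.g. "rw", "iorw").
# The function name and this encoding are hypothetical.

def rvtso_can_ignore_fence(pred: str, succ: str) -> bool:
    """Return True if an RVTSO implementation may treat the fence as a no-op."""
    # A fence that orders I/O must always be honored.
    orders_io = any(bit in "io" for bit in pred + succ)
    # Only fences with both PW and SR add ordering beyond what TSO provides.
    has_pw_and_sr = "w" in pred and "r" in succ
    return not orders_io and not has_pw_and_sr
```

Under these assumptions, a fence such as FENCE~R,RW or FENCE~W,W could be ignored, whereas FENCE~RW,RW, FENCE~W,R, and any fence with an I/O bit set could not.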
 
 Other general notes:
 
 \begin{itemize}
-  \item Silent stores (i.e., stores which write the same value that already exists at a memory location) do not have any special behavior from a memory model point of view.  Microarchitectures that attempt to implement silent stores must take care to ensure that the memory model is still obeyed, particularly in cases such as RSW (Chapter~\ref{sec:ppo:rdw}) which tend to be incompatible with silent stores.
+  \item Silent stores (i.e., stores that write the same value that already exists at a memory location) behave like any other store from a memory model point of view.  Microarchitectures that attempt to implement silent stores must take care to ensure that the memory model is still obeyed, particularly in cases such as RSW (Section~\ref{sec:memory:overlap}) which tend to be incompatible with silent stores.
   \item Writes may be merged (i.e., two consecutive writes to the same address may be merged) or subsumed (i.e., the earlier of two back-to-back writes to the same address may be elided) as long as the resulting behavior does not otherwise violate the memory model semantics.
 \end{itemize}
 
 The question of write subsumption can be understood from the following example:
 \begin{figure}[h!]
   \centering
-  {
+  \begin{tabular}{m{.4\linewidth}m{.1\linewidth}m{.4\linewidth}}
     \tt\small
     \begin{tabular}{cl||cl}
     \multicolumn{2}{c}{Hart 0} & \multicolumn{2}{c}{Hart 1} \\
@@ -1148,39 +1291,28 @@ The question of write subsumption can be understood from the following example:
         & li t2, 1    &     &              \\
     (a) & sw t1,0(s0) & (d) & lw  a0,0(s1) \\
     (b) & fence w, w  & (e) & sw  a0,0(s0) \\
-    (c) & sw t2,0(s1) & (f) & lw  t3,0(s0) \\
+    (c) & sw t2,0(s1) & (f) & sw  t3,0(s0) \\
     \end{tabular}
-  }
-  ~~~~
-  \diagram
-  \caption{Write subsumption litmus test}
+  & &
+    \input{figs/litmus_subsumption.pdf_t}
+  \end{tabular}
+  \caption{Write subsumption litmus test, allowed execution.}
   \label{fig:litmus:subsumption}
 \end{figure}
 
-As written, (a) must follow (f) in the global memory order:
+As written, if the load~(d) reads value~$1$, then (a) must precede (f) in the global memory order:
 \begin{itemize}
-  \item (a) follows (c) in the global memory order because of rule 2
-  \item (c) follows (d) in the global memory order because of the Load Value axiom
-  \item (d) follows (e) in the global memory order because of rule 7
-  \item (e) follows (f) in the global memory order because of rule 1
+  \item (a) precedes (c) in the global memory order because of rule 2
+  \item (c) precedes (d) in the global memory order because of the Load Value axiom
+  \item (d) precedes (e) in the global memory order because of rule 7
+  \item (e) precedes (f) in the global memory order because of rule 1
 \end{itemize}
+In other words, the final value of the memory location whose address is in {\tt s0} must be~$2$ (the value written by the store~(f)) and cannot be~$3$ (the value written by the store~(a)).
 
 A very aggressive microarchitecture might erroneously decide to discard (e), as (f) supersedes it, and this may in turn lead the microarchitecture to break the now-eliminated dependency between (d) and (f) (and hence also between (a) and (f)).
 This would violate the memory model rules, and hence it is forbidden.
 Write subsumption may in other cases be legal, if for example there were no data dependency between (d) and (e).
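
The chain of orderings in the write subsumption example can be checked mechanically. The following is a minimal Python sketch, not part of the specification: the edge list simply transcribes the four ``precedes'' bullets for the case where (d) reads the value written by (c), and the helper name {\tt reachable} is hypothetical.

```python
# Illustrative sketch only: the edges transcribe the four "precedes" bullets
# of the write subsumption litmus test. `reachable` is a hypothetical helper.

def reachable(edges, src, dst):
    """Return True if dst can be reached from src by following edges."""
    seen, stack = set(), [src]
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(b for (a, b) in edges if a == node)
    return False

# Global memory order constraints when load (d) reads the value stored by (c).
edges = [
    ("a", "c"),  # ordered by the fence (b)
    ("c", "d"),  # (d) reads from (c): Load Value axiom
    ("d", "e"),  # data dependency from (d) to (e)
    ("e", "f"),  # (e) and (f) are stores to the same address
]

# With all four constraints in place, (a) must precede (f).
a_before_f = reachable(edges, "a", "f")

# Without the (d)->(e) data dependency, nothing forces (a) before (f).
no_dep = [edge for edge in edges if edge != ("d", "e")]
a_before_f_without_dep = reachable(no_dep, "a", "f")
```

The second query models the closing remark above: once the data dependency between (d) and (e) is absent, the chain from (a) to (f) no longer exists, which is why subsumption of (e) may be legal in that case.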
 
-\section{Summary of New/Modified ISA Features}
-
-At a high level, PPO rules \ref{ppo:strongacqrel}, \ref{ppo:success}, and \ref{ppo:ld->st->ld}--\ref{ppo:addrpocfence} are all new rules that did not exist in the original ISA spec.  Rule~\ref{ppo:rdw} and the specifics of the atomicity axiom were addressed but not stated in detail.
-
-Other new or modified ISA details:
-\begin{itemize}
-  \item There is an RCpc ({\tt .aq} and {\tt .rl}) vs.\@ RCsc ({\tt .aqrl}) distinction
-  \item Load-release and store-acquire are deprecated
-  \item {\tt lr}/{\tt sc} behavior was clarified
-  %\item Fences reserve two bits for platform-specific use
-\end{itemize}
-
 \subsection{Possible Future Extensions}
 
 We expect that any or all of the following possible future extensions would be compatible with the RVWMO memory model:
@@ -1189,423 +1321,98 @@ We expect that any or all of the following possible future extensions would be c
   \item `V' vector ISA extensions
   \item A transactional memory subset of the `T' ISA extension
   \item `J' JIT extension
-  \item Native encodings for {\tt l\{b|h|w|d\}.aq}/{\tt s\{b|h|w|d\}.rl}
+  \item Native encodings for load and store opcodes with {\em aq} and {\em rl} set
   \item Fences limited to certain addresses
-  \item Cache writeback/flush/invalidate/etc.\@ hints, but these should be considered hints, not functional requirements.  Any cache management operations which are required for basic correctness should be described as (possibly address range-limited) fences to comply with the RISC-V philosophy (see also {\tt fence.i} and {\tt sfence.vma}).  For example, a functional cache writeback instruction might instead be written as ``{\tt fence~rw[addr],w[addr]}''.
+  \item Cache writeback/flush/invalidate/etc.\@ instructions
 \end{itemize}
 
-\section{Litmus Tests}
-
-These litmus tests represent some of the better-known litmus tests in the field, plus some tests that are randomly-generated, plus some tests that are generated to be particularly relevant to the RVWMO memory model.
-
-All will be made available for download once they are generated.
-
-We expect that these tests will one day serve as part of a compliance test suite, and we expect that many architects will use them for verification purposes as well.
+\section{Known Issues}
+\label{sec:memory:discrepancies}
 
-COMING SOON!
-
-\chapter{Formal Memory Model Specifications}
-
-\begin{commentary}
-  To facilitate formal analysis of RVWMO, we present a set of formalizations in this chapter.  Any discrepancies are unintended; the expectation is that the models will describe exactly the same sets of legal behaviors, pending some memory model changes that have not quite been added to all of the formalizations yet.
-
-  As such, these formalizations should be considered snapshots from some point in time during the development process rather than finalized specifications.
-
-  At this point, no individual formalization is considered authoritative, but we may designate one as such in collaboration with the ISA specification and/or formalization task groups.
-\end{commentary}
-
-\section{Formal Axiomatic Specification in Alloy}
-\label{sec:alloy}
-
-We present two formal specifications of the RVWMO memory model in Alloy (\url{http://alloy.mit.edu}).
-
-The first corresponds directly to the natural language model earlier in this chapter.
+\subsection{Mixed-size RSW}
+\label{sec:memory:discrepancies:mixedrsw}
 
 \begin{figure}[h!]
-  {
-  \tt\bfseries\centering\footnotesize
-  \begin{lstlisting}
-////////////////////////////////////////////////////////////////////////////////
-// =RISC-V RVWMO axioms=
-
-// Preserved Program Order
-fun ppo : Event->Event {
-  // same-address ordering
-  po_loc :> Store
-
-  // explicit synchronization
-  + ppo_fence
-  + Load.aq <: ^po
-  + ^po :> Store.rl
-  + Store.aq.rl <: ^po :> Load.aq.rl
-  + ^po :> Load.sc
-  + Store.sc <: ^po
-
-  // dependencies
-  + addr
-  + data
-  + ctrl :> Store
-  + (addr+data).successdep
-
-  // CoRR
-  + rdw & po_loc_no_intervening_write
-
-  // pipeline dependency artifacts
-  + (addr+data).rfi
-  + addr.^po :> Store
-  + ctrl.(FenceI <: ^po)
-  + addr.^po.(FenceI <: ^po)
-}
-
-// the global memory order respects preserved program order
-fact { ppo in gmo }
-\end{lstlisting}}
-  \caption{The RVWMO memory model formalized in Alloy (1/4: PPO)}
-  \label{fig:alloy1}
-\end{figure}
-\begin{figure}[h!]
-  {
-  \tt\bfseries\centering\footnotesize
-  \begin{lstlisting}
-// Load value axiom
-fun candidates[r: Load] : set Store {
-  (r.~gmo & Store & same_addr[r]) // writes preceding r in gmo
-  + (r.^~po & Store & same_addr[r]) // writes preceding r in po
-}
-
-fun latest_among[s: set Event] : Event { s - s.~gmo }
-
-pred LoadValue {
-  all w: Store | all r: Load |
-    w->r in rf <=> w = latest_among[candidates[r]]
-}
-
-fun after_reserve_of[r: Load] : Event { latest_among[r + r.~rf].gmo }
-
-pred Atomicity {
-  all r: Store.~rmw |               // starting from the read r of an atomic,
-    no x: Store & same_addr[r + r.rmw] | // there is no write x to the same addr
-      x not in same_hart[r]         // from a different hart, such that
-      and x in after_reserve_of[r]  // x follows (the write r reads from) in gmo
-      and r.rmw in x.gmo            // and r follows x in gmo
-}
-
-pred RISCV_mm { LoadValue and Atomicity }
-\end{lstlisting}}
-  \caption{The RVWMO memory model formalized in Alloy (2/4: Axioms)}
-  \label{fig:alloy2}
-\end{figure}
-\begin{figure}[h!]
-  {
-  \tt\bfseries\centering\footnotesize
-  \begin{lstlisting}
-////////////////////////////////////////////////////////////////////////////////
-// Basic model of memory
-
-sig Hart {  // hardware thread
-  start : one Event
-}
-sig Address {}
-abstract sig Event {
-  po: lone Event // program order
-}
-
-abstract sig MemoryEvent extends Event {
-  address: one Address,
-  aq: lone MemoryEvent, // opcode bit
-  rl: lone MemoryEvent, // opcode bit
-  sc: lone MemoryEvent, // for AMOs with .aq and .rl, to distinguish from lr/sc
-  gmo: set MemoryEvent   // global memory order
-}
-sig Load extends MemoryEvent {
-  addr: set Event,
-  ctrl: set Event,
-  data: set Store,
-  successdep: set Event,
-  rmw: lone Store
-}
-sig Store extends MemoryEvent {
-  rf: set Load
-}
-sig Fence extends Event {
-  pr: lone Fence, // opcode bit
-  pw: lone Fence, // opcode bit
-  sr: lone Fence, // opcode bit
-  sw: lone Fence  // opcode bit
-}
-sig FenceI extends Event {}
-   
-// FENCE PPO
-fun FencePRSR : Fence { Fence.(pr & sr) } 
-fun FencePRSW : Fence { Fence.(pr & sw) } 
-fun FencePWSR : Fence { Fence.(pw & sr) } 
-fun FencePWSW : Fence { Fence.(pw & sw) } 
-
-fun ppo_fence : MemoryEvent->MemoryEvent {
-    (Load  <: ^po :> FencePRSR).(^po :> Load)
-  + (Load  <: ^po :> FencePRSW).(^po :> Store)
-  + (Store <: ^po :> FencePWSR).(^po :> Load)
-  + (Store <: ^po :> FencePWSW).(^po :> Store)
-}
-\end{lstlisting}}
-  \caption{The RVWMO memory model formalized in Alloy (3/4: model of memory)}
-  \label{fig:alloy3}
-\end{figure}
-\begin{figure}[h!]
-  {
-  \tt\bfseries\centering\footnotesize
-  \begin{lstlisting}
-// auxiliary definitions
-fun po_loc_no_intervening_write : MemoryEvent->MemoryEvent {
-  po_loc - ((po_loc :> Store).po_loc)
-}
-
-fun RFInit : Load { Load - Store.rf }
-fun rsw : Load->Load { ~rf.rf + (RFInit <: address.~address :> RFInit) }
-fun rdw : Load->Load { (Load <: po_loc :> Load) - rsw }
-
-fun po_loc : Event->Event { ^po & address.~address }
-fun same_hart[e: Event] : set Event { e + e.^~po + e.^po }
-fun same_addr[e: Event] : set Event { e.address.~address }
-
-// basic facts about well-formed execution candidates
-fact { acyclic[po] }
-fact { all e: Event | one e.*~po.~start }  // each event is in exactly one hart
-fact { rf.~rf in iden } // each read returns the value of only one write
-fact { total[gmo, MemoryEvent] } // gmo is a total order over all MemoryEvents
-
-//rf
-fact { rf in address.~address }
-fun rfi : Store->Load { Store <: po_loc_no_intervening_write :> Load }
-
-//dep
-fact { addr + ctrl + data in ^po }
-fact { successdep in (Write.~rmw) <: ^po }
-fact { ctrl.*po in ctrl }
-fact { rmw in ^po }
-
-////////////////////////////////////////////////////////////////////////////////
-// =Opcode encoding restrictions=
-
-// opcode bits are either set (encoded, e.g., as f.pr in iden) or unset
-// (f.pr not in iden).  The bits cannot be used for anything else
-fact { pr + pw + sr + sw + aq + rl + sc in iden }
-fact { sc in aq + rl }
-fact { Load.sc.rmw in Store.sc and Store.sc.~rmw in Load.sc }
-
-// Fences must have either pr or pw set, and either sr or sw set
-fact { Fence in Fence.(pr + pw) & Fence.(sr + sw) }
-
-// there is no write-acquire, but there is write-strong-acquire
-fact { Store & Acquire in Release }
-fact { Load & Release in Acquire }
-
-////////////////////////////////////////////////////////////////////////////////
-// =Alloy shortcuts=
-pred acyclic[rel: Event->Event] { no iden & ^rel }
-pred total[rel: Event->Event, bag: Event] {
-  all disj e, e': bag | e->e' in rel + ~rel
-  acyclic[rel]
-}
-\end{lstlisting}}
-  \caption{The RVWMO memory model formalized in Alloy (4/4: Auxiliaries)}
-  \label{fig:alloy4}
-\end{figure}
-
-\clearpage
-The second is an equivalent formulation which is slightly more complex but which is more computationally efficient.  We expect that analysis tools will be built off of this second formulation.  Also included are empirical checks that the two models match.
-
-This formulation, however, does not apply when mixed-size accesses are used, nor when {\tt lr}/{\tt sc} to different addresses are used.
-
-\begin{figure}[h!]
-  {
-  \tt\bfseries\centering\footnotesize
-  \begin{lstlisting}
-// coherence order: a total order on the writes to each address
-fun co : Write->Write { Write <: ((address.~address) & gmo) :> Write }
-// from-read: from a read to the coherence successors of the rf-source of the write
-fun fr : Read->Write { ~rf.co + ((Read - Write.rf) <: address.~address :> Write) }
-
-// e = external; i.e., from a different hart
-fun rfe : Store->Load  { rf - iden - ^po - ^~po }
-fun coe : Store->Store { co - iden - ^po - ^~po }
-fun fre :  Load->Store { fr - iden - ^po - ^~po }
-
-pred sc_per_location  { acyclic[rf + co + fr + po_loc] }
-pred atomicity { no rmw & fre.coe }
-pred causality { acyclic[rfe + co + fr + ppo] }
-
-// equality checks
-run RISCV_mm_com_sanity { RISCV_mm_com } for 3
-check RISCV_mm_gmo_com { RISCV_mm => RISCV_mm_com } for 6
-check RISCV_mm_com_gmo {
-  rmw in address.~address => // the rf/co/fr model assumes rmw in same addr
-  RISCV_mm_com =>
-  rfe + co + fr in gmo =>    // pick a gmo which matches rfe+co+fr
-  RISCV_mm
-} for 6
-\end{lstlisting}
+  \centering\small
+  {\tt
+    \begin{tabular}{cl||cl}
+    \multicolumn{2}{c}{Hart 0} & \multicolumn{2}{c}{Hart 1} \\
+    \hline
+          & li t1, 1    &     & li t1, 1    \\
+      (a) & lw a0,0(s0) & (d) & lw a1,0(s1) \\
+      (b) & fence rw,rw & (e) & amoswap.w.rl a2,t1,0(s2) \\
+      (c) & sw t1,0(s1) & (f) & ld a3,0(s2) \\
+          &             & (g) & lw a4,4(s2) \\
+          &             &     & xor a5,a4,a4  \\
+          &             &     & add s0,s0,a5  \\
+          &             & (h) & sw a2,0(s0)   \\
+      \hline
+      \multicolumn{4}{c}{Outcome: {\tt a0=1}, {\tt a1=1}, {\tt a2=0}, {\tt a3=1}, {\tt a4=0}}
+    \end{tabular}
   }
-  \caption{An alternative, more computationally efficient but less complete axiomatic definition}
-  \label{fig:com}
+  \caption{Mixed-size discrepancy (permitted by axiomatic models, forbidden by operational model)}
+  \label{fig:litmus:discrepancy:rsw1}
 \end{figure}
 
-\clearpage
-\section{Formal Axiomatic Specification in Herd}
-
-See also: \url{http://moscova.inria.fr/~maranget/cats7/riscv}
-
-(This herd model is not yet updated to account for rules \ref{ppo:amostore}--\ref{ppo:amoload} and \ref{ppo:success}, and rule {\tt r4} has been tentatively removed from RVWMO.  Updates to come...)
-
 \begin{figure}[h!]
-  {
-  \tt\bfseries\centering\footnotesize
-  \begin{lstlisting}
-(*************)
-(* Utilities *)
-(*************)
-
-let fence.r.r = [R];fencerel(Fence.r.r);[R]
-let fence.r.w = [R];fencerel(Fence.r.w);[W]
-let fence.r.rw = [R];fencerel(Fence.r.rw);[M]
-let fence.w.r = [W];fencerel(Fence.w.r);[R]
-let fence.w.w = [W];fencerel(Fence.w.w);[W]
-let fence.w.rw = [W];fencerel(Fence.w.rw);[M]
-let fence.rw.r = [M];fencerel(Fence.rw.r);[R]
-let fence.rw.w = [M];fencerel(Fence.rw.w);[W]
-let fence.rw.rw = [M];fencerel(Fence.rw.rw);[M]
-
-let fence = 
-  fence.r.r | fence.r.w | fence.r.rw |
-  fence.w.r | fence.w.w | fence.w.rw |
-  fence.rw.r | fence.rw.w | fence.rw.rw
-
-
-let po-loc-no-w = po-loc \ (po-loc;[W];po-loc)
-let rsw = rf^-1;rf
-
-let LD-ACQ = R & (Acq|AcqRel)
-and ST-REL = W & (Rel|AcqRel)
-
-(*************)
-(* ppo rules *)
-(*************)
-
-let r1 = [M];po-loc;[W]
-and r2 = fence
-and r3 = [LD-ACQ];po;[M]
-and r4 = [R];po-loc;[LD-ACQ]
-and r5 = [M];po;[ST-REL]
-and r6 = [W & AcqRel];po;[R & AcqRel]
-and r7 = [R];addr;[M]
-and r8 = [R];data;[W]
-and r9 = [R];ctrl;[W]
-and r10 = ([R];po-loc-no-w;[R]) \ rsw
-and r11 = [R];(addr|data);[W];po-loc-no-w;[R]
-and r12 = [R];addr;[M];po;[W]
-and r13 = [R];ctrl;[Fence.i];po;[R]
-and r14 = [R];addr;[M];po;[Fence.i];po;[M]
-
-let ppo =
- r1
-| r2
-| r3
-| r4
-| r5
-| r6
-| r7
-| r8
-| r9
-| r10
-| r11
-| r12
-| r13
-| r14
-\end{lstlisting}
+  \centering\small
+  {\tt
+    \begin{tabular}{cl||cl}
+    \multicolumn{2}{c}{Hart 0} & \multicolumn{2}{c}{Hart 1} \\
+    \hline
+          & li t1, 1    &     & li t1, 1      \\
+      (a) & lw a0,0(s0) & (d) & ld a1,0(s1)   \\
+      (b) & fence rw,rw & (e) & lw a2,4(s1)   \\
+      (c) & sw t1,0(s1) &     & xor a3,a2,a2  \\
+          &             &     & add s0,s0,a3  \\
+          &             & (f) & sw a2,0(s0)   \\
+      \hline
+      \multicolumn{4}{c}{Outcome: {\tt a0=0}, {\tt a1=1}, {\tt a2=0}}
+    \end{tabular}
   }
-  \caption{{\tt riscv-defs.cat}, part of a herd version of the RVWMO memory model (1/3)}
-  \label{fig:herd1}
+  \caption{Mixed-size discrepancy (permitted by axiomatic models, forbidden by operational model)}
+  \label{fig:litmus:discrepancy:rsw2}
 \end{figure}
 
 \begin{figure}[h!]
-  {
-  \tt\bfseries\centering\footnotesize
-  \begin{lstlisting}
-Total
-
-(* Notice that herd has defined its own rf relation *)
-
-(* Define ppo *)
-include "riscv-defs.cat"
-
-(********************************)
-(* Generate global memory order *)
-(********************************)
-
-let gmo0 = (* precursor: ie build gmo as an total order that include gmo0 *)
-  loc & (W\FW) * FW | # Final write before any write to the same location
-  ppo |               # ppo compatible
-  rfe                 # first half of 
-
-(* Walk over all linear extensions of gmo0 *)
-with  gmo from linearisations(M\IW,gmo0)
-
-(* Add initial writes upfront -- convenient for computing rfGMO *)
-let gmo = gmo | loc & IW * (M\IW)
-
-(**********)
-(* Axioms *)
-(**********)
-
-(* Compute rf according to the load value axiom, aka rfGMO *)
-let WR = loc & ([W];(gmo|po);[R])
-let rfGMO = WR \ (loc&([W];gmo);WR)
-
-(* Check equality of herd rf and of rfGMO *)
-empty (rf\rfGMO)|(rfGMO\rf) as RfCons
-
-(* Atomic axion *)
-let infloc = (gmo & loc)^-1
-let inflocext = infloc & ext
-
-let winside  = (infloc;rmw;inflocext) & (infloc;rf;rmw;inflocext) & [W]
-empty winside as Atomic
-\end{lstlisting}
+  \centering\small
+  {\tt
+    \begin{tabular}{cl||cl}
+    \multicolumn{2}{c}{Hart 0} & \multicolumn{2}{c}{Hart 1} \\
+    \hline
+          & li t1, 1    &     & li t1, 1      \\
+      (a) & lw a0,0(s0) & (d) & sw t1,4(s1)   \\
+      (b) & fence rw,rw & (e) & ld a1,0(s1)   \\
+      (c) & sw t1,0(s1) & (f) & lw a2,4(s1)   \\
+          &             &     & xor a3,a2,a2  \\
+          &             &     & add s0,s0,a3  \\
+          &             & (g) & sw a2,0(s0)   \\
+      \hline
+      \multicolumn{4}{c}{Outcome: {\tt a0=1}, {\tt a1=0x100000001}, {\tt a2=1}}
+    \end{tabular}
   }
-  \caption{{\tt riscv.cat}, a herd version of the RVWMO memory model (2/3)}
-  \label{fig:herd2}
+  \caption{Mixed-size discrepancy (permitted by axiomatic models, forbidden by operational model)}
+  \label{fig:litmus:discrepancy:rsw3}
 \end{figure}
 
-\begin{figure}[h!]
-  {
-  \tt\bfseries\centering\footnotesize
-  \begin{lstlisting}
-Partial
-
-(***************)
-(* Definitions *)
-(***************)
+There is a known discrepancy between the operational and axiomatic specifications within the family of mixed-size RSW variants shown in Figures~\ref{fig:litmus:discrepancy:rsw1}--\ref{fig:litmus:discrepancy:rsw3}.
+To address this, we may choose to add something like the following new PPO rule:
+Memory operation $a$ precedes memory operation $b$ in preserved program order (and hence also in the global memory order) if $a$ precedes $b$ in program order, $a$ and $b$ both access regular main memory (rather than I/O regions), $a$ is a load, $b$ is a store, there is a load $m$ between $a$ and $b$, there is a byte $x$ that both $a$ and $m$ read, there is no store between $a$ and $m$ that writes to $x$, and $m$ precedes $b$ in PPO.
+In other words, in {\sf herd} syntax, we may choose to add ``{\tt (po-loc \& rsw);ppo;[W]}'' to PPO.
+Many implementations will already enforce this ordering naturally.
+As such, even though this rule is not official, we recommend that implementers enforce it nevertheless in order to ensure forwards compatibility with the possible future addition of this rule to RVWMO.
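
The relational expression ``{\tt (po-loc \& rsw);ppo;[W]}'' can be read as a composition of relations over events. The following is a minimal Python sketch of that reading, not an excerpt from any formal model: relations are plain sets of pairs, and the function names and the three toy events are assumptions chosen only to illustrate the shape of the proposed rule.

```python
# Illustrative sketch only: relations are sets of (event, event) pairs.
# The function names and the toy events below are hypothetical.

def compose(r, s):
    """Relational composition r;s."""
    return {(x, z) for (x, y1) in r for (y2, z) in s if y1 == y2}

def proposed_edges(po_loc, rsw, ppo, stores):
    """Pairs (a, b) ordered by (po-loc & rsw);ppo;[W]."""
    step = compose(po_loc & rsw, ppo)
    # [W] restricts the target of the relation to stores.
    return {(a, b) for (a, b) in step if b in stores}

# Toy execution: loads "a" and "m" read the same write, "m" follows "a" in
# program order at an overlapping location, and "m" precedes store "b" in PPO.
po_loc = {("a", "m")}
rsw = {("a", "m"), ("m", "a")}
ppo = {("m", "b")}
stores = {"b"}

new_ppo = proposed_edges(po_loc, rsw, ppo, stores)
```

In this toy execution the proposed rule adds the single pair from load {\tt a} to store {\tt b}; if {\tt a} and {\tt m} did not read the same write, the {\tt rsw} relation would be empty and no ordering would be added.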
 
-(* Define ppo *)
-include "riscv-defs.cat"
+\chapter{Formal Memory Model Specifications}
 
-(* Compute coherence relation *)
-include "cos-opt.cat"
+To facilitate formal analysis of RVWMO, this chapter presents a set of formalizations using different tools and modeling approaches.  Any discrepancies are unintended; the expectation is that the models describe exactly the same sets of legal behaviors.
 
-(**********)
-(* Axioms *)
-(**********)
+All currently known discrepancies are listed in Section~\ref{sec:memory:discrepancies}.
 
-(* Sc per location *)
-acyclic co|rf|fr|po-loc as Coherence
+\clearpage
+\input{memory-model-alloy.tex}
 
-(* Main model axiom *)
-acyclic co|rfe|fr|ppo as Model
+\clearpage
+\input{memory-model-herd.tex}
 
-(* Atomicity axiom *)
-empty rmw & (fre;coe) as Atomic
-\end{lstlisting}
-  }
-  \caption{{\tt riscv.cat}, part of an alternative herd presentation of the RVWMO memory model (2/3)}
-  \label{fig:herd3}
-\end{figure}
+\clearpage
+\input{memory-model-operational.tex}