author     Andrew Waterman <andrew@sifive.com>   2017-12-12 20:15:35 -0800
committer  Andrew Waterman <andrew@sifive.com>   2017-12-12 20:15:35 -0800
commit     4b5852c190f3eff38c471b6d5e1a5f1adc7d9142 (patch)
tree       ca1dd9a8b9da4abbff517fe37ff2c27e1cf84903 /src
parent     9da1a115bcc4fe327f35acceb851d4850d12e9fa (diff)
download   riscv-isa-manual-4b5852c190f3eff38c471b6d5e1a5f1adc7d9142.zip
           riscv-isa-manual-4b5852c190f3eff38c471b6d5e1a5f1adc7d9142.tar.gz
           riscv-isa-manual-4b5852c190f3eff38c471b6d5e1a5f1adc7d9142.tar.bz2
Add memory consistency model draft proposal
Diffstat (limited to 'src')
-rw-r--r--   src/memory.tex     | 227
-rw-r--r--   src/preamble.tex   |   5
-rw-r--r--   src/riscv-spec.tex |   3
-rw-r--r--   src/rv32.tex       | 539
4 files changed, 373 insertions, 401 deletions
diff --git a/src/memory.tex b/src/memory.tex
index e783c39..94e3733 100644
--- a/src/memory.tex
+++ b/src/memory.tex
@@ -26,233 +26,6 @@
\newcommand{\diagram}{(picture coming soon)}
-\chapter{The RISC-V Memory Model}
-
-This document defines the RISC-V memory consistency model.
-A memory consistency model is a set of rules that specifies which values can be legally returned by loads of memory.
-The RISC-V ISA by default uses a memory model called ``RVWMO'' (RISC-V Weak Memory Ordering).
-RVWMO is designed to provide lots of flexibility for architects to build high-performance scalable designs while simultaneously providing a relatively sane memory model to programmers.
-
-The RISC-V ISA also provides the optional Ztso extension which imposes the stronger RVTSO (RISC-V Total Store Ordering) memory model for hardware that chooses to provide it.
-RVTSO provides a simpler programming model but is more restrictive on the implementations that can be legally built.
-%Other models (e.g., IBM Power) provide a model which allows for more aggressive and flexible implementations, but at the cost of a much more complicated programming model.
-%RVWMO, a variant of release consistency, balances these two extremes.
-%The RVWMO memory model also addresses known shortcomings of memory models of the past (e.g., by enforcing memory ordering on address dependencies and on most same-address load-load pairs).
-
-Under both models, code running on a single hart will appear to execute in order from the perspective of other memory operations in the same hart, but memory instructions from another hart may observe the memory instructions from the first hart being executed in a different order.
-Therefore, under both models, multithreaded code may require explicit synchronization to guarantee ordering between memory operations from different harts.
-The base RISC-V ISA provides a {\tt fence} instruction for this purpose, while the optional ``A'' atomics extension defines load-reserved/store-conditional and atomic read-modify-write operations.
-
-%This chapter provides a precise formal specification of the RVWMO memory consistency model.
-%Section~\ref{sec:definition} provides a natural language specification of the model.
-%Section~\ref{sec:explanation} provides explanation and intuition for the design of RVWMO,
-%The sections that follow provide other explanatory and reference material, and then Sections~\ref{sec:alloy}--\ref{sec:operational} provide a suite of formalisms that enable rigorous analysis.
-%Finally, Chapter~\ref{sec:tso} defines RVTSO and the Ztso extension.
-
-\begin{commentary}
- The RVWMO and RVTSO models are formalized as presented, but the interaction of the memory model with I/O, instruction fetches, page table walks, and {\tt sfence.vma} is not formalized. We may formalize some or all of them in a future revision. The ``V'' vector, transactional memory, and ``J'' JIT extensions will need to be incorporated into a future revision as well.
-
- Memory models which permit memory accesses of different sizes are an active and open area of research, and the specifics of how the RISC-V memory model deals with mixed-size memory accesses are subject to change. Nevertheless, we provide an educated guess at sane rules for such situations as a guideline for architects and implementers.
-\end{commentary}
-
-\section{Definition of the RVWMO Memory Model}
-\label{sec:definition}
-
-The RVWMO memory model is defined in terms of the {\em global memory order}, a total ordering of the memory accesses produced by all harts.
-In general, a multithreaded program will have many different possible executions, and each execution will have its own corresponding global memory order.
-
-The global memory order is defined in terms of the primitive load(s) and/or store(s) generated by each memory instruction.
-It is then subject to the constraints defined in the rest of this chapter.
-Any execution which satisfies all of the memory model constraints is a legal execution (as far as the memory model is concerned).
-
-\subsection{Memory Model Primitives}
-The RVWMO memory model is specified in terms of the memory accesses generated by each instruction.
-The {\em program order} over memory accesses reflects the order in which the instructions that generate each load and store are originally laid out in that hart's dynamic instruction stream; i.e., the order in which a simple in-order processor would execute the instructions of that hart.
-Table~\ref{tab:memoryaccesses} summarizes the memory accesses generated by each type of memory instruction.
-
-\begin{table}[h!]
- \centering
- \begin{tabular}{|l|l|}
- \hline
- RISC-V Instruction & Memory Accesses \\
- \hline
- \hline
- \tt l\{b|h|w|d\} & load$^*$ \\
- \tt s\{b|h|w|d\} & store$^*$ \\
- \hline
- \hline
- \tt lr & load \\
- \tt lr.aq & load-acquire-RCpc \\
- \tt lr.aqrl & load-acquire-RCsc \\
- \tt lr.rl & (deprecated) \\
- \hline
- \hline
- \tt sc & store \\
- \tt sc.rl & store-release-RCpc \\
- \tt sc.aqrl & store-release-RCsc \\
- \tt sc.aq & (deprecated) \\
- \hline
- \hline
- \tt amo<op> & load; {\em $<$op$>$}; store \\
- \tt amo<op>.aq & load-acquire-RCpc; {\em $<$op$>$}; store \\
- \tt amo<op>.rl & load; {\em $<$op$>$}; store-release-RCpc \\
- \tt amo<op>.aqrl & load-SC; {\em $<$op$>$}; store-SC \\
- \hline
- \multicolumn{2}{l}{$^*$: possibly multiple if misaligned}
- \end{tabular}
- \caption{Mapping of instructions into memory accesses, for the purposes of describing the memory model. ``{\tt.sc}'' is a synonym for ``{\tt.aqrl}''.}
- \label{tab:memoryaccesses}
-\end{table}
-
-Every aligned load or store instruction gives rise to exactly one memory access that executes in a single-copy atomic fashion: it can never be observed in a partially-incomplete state.
-Every misaligned load or store may be decomposed into a set of component loads or stores at any granularity.
-The memory accesses generated by misaligned loads and stores are not ordered with respect to each other in program order, but they are ordered with respect to the memory accesses generated by preceding and subsequent instructions in program order.
-
-\begin{commentary}
- The legal decomposition of unaligned memory operations down to even byte granularity facilitates emulation on implementations that do not natively support unaligned accesses.
- Such implementations might, for example, simply iterate over the bytes of a misaligned access one by one.
-\end{commentary}
-
-AMOs give rise to exactly two memory accesses: one load and one store. These accesses are said to be {\em paired}.
-Every {\tt lr} instruction gives rise to exactly one load.
-Every {\tt sc} instruction gives rise to either zero stores or one store, depending on whether the store conditional succeeds.
-An {\tt lr} instruction is said to be paired with the first {\tt sc} instruction that follows it in program order, unless the {\tt sc} is not successful or there is another {\tt lr} instruction in between.
-The memory accesses generated by paired {\tt lr} and (successful) {\tt sc} instructions are also said to be paired.
-Both {\tt lr} and {\tt sc} instructions may be unpaired if they do not meet the conditions above.
-The complete list of conditions determining the success or failure of store conditional instructions is defined in the ``A'' extension.
-
-Loads and stores generated by atomics may carry ordering annotations.
-Loads may carry ``acquire-RCpc'' or ``acquire-RCsc'' annotations.
-The term ``load-acquire'' without further annotation refers to both collectively.
-Stores may carry ``release-RCpc'' or ``release-RCsc'' annotations, and once again ``store-release'' without further annotation refers to both together.
-In the memory model literature, the term ``RCpc'' stands for release consistency with processor-consistent synchronization operations, and the term ``RCsc'' stands for release consistency with sequentially-consistent synchronization operations.
-Finally, AMOs with both {\tt .aq} and {\tt .rl} set are sequentially-consistent in an even stronger sense: they do not allow any reordering in either direction.
-The precise semantics of these annotations as they apply to RVWMO are described by the memory model rules below.
-
-\begin{commentary}
- Although the ISA does not currently contain {\tt l\{b|h|w|d\}.aq[rl]} or {\tt s\{b|h|w|d\}.[aq]rl} instructions, we may add them as assembler pseudoinstructions to facilitate forward compatibility with their potential future addition into the ISA. These pseudoinstructions will generally assemble per the fence-based mappings of Section~\ref{sec:porting} unless and until the instructions are added to the ISA. The RVWMO memory model is also designed to easily accommodate the possible future inclusion of such instructions.
-\end{commentary}
-
-\subsection{Dependencies}
-The definition of the RVWMO memory model depends in part on the notion of a syntactic dependency.
-A register $r$ read by an instruction $b$ has a syntactic dependency on an instruction $a$ if $a$ precedes $b$ in program order, $r$ is not {\tt x0}, and either of the following hold:
-\begin{enumerate}
- \item $r$ is written by $a$ and read by $b$, and no other instruction between $a$ and $b$ in program order modifies $r$
- \item There is some other instruction $i$ such that a register read by $i$ has a dependency on $a$, and a register read by $b$ has a dependency on $i$
-\end{enumerate}
-
-Specific types of dependencies are defined as follows.
-First, for two instructions $a$ and $b$, $b$ has a syntactic address dependency on $a$ if a register used to calculate the address accessed by $b$ has a syntactic dependency on $a$.
-Second, $b$ has a syntactic data dependency on $a$ if $b$ is a store and a register used to calculate the data being written by $b$ has a syntactic dependency on $a$.
-Third, $b$ has a syntactic control dependency on $a$ if there exists a branch $m$ between $a$ and $b$ in program order such that a register checked as part of the condition of $m$ has a syntactic dependency on $a$.
-Finally, $b$ has a syntactic success dependency on $a$ if $a$ is a store-conditional and a register read by $b$ has a syntactic dependency on $a$ via the store conditional success register written by $a$.
-
-\subsection{Preserved Program Order}
-The global memory order for any given execution of a program respects some but not necessarily all of each hart's program order.
-The subset of program order which must be respected by the global memory order for all executions is known as {\em preserved program order}.
-
-\newcommand{\ppost}{$a$ and $b$ are accesses to overlapping memory addresses, and $b$ is a store}
-\newcommand{\ppofence}{$a$ and $b$ are separated in program order by a fence, $a$ is in the predecessor set of the fence, and $b$ is in the successor set of the fence}
-\newcommand{\ppoacquire}{$a$ is a load-acquire}
-\newcommand{\ppoamoload}{$b$ is a load-SC}
-\newcommand{\ppoamostore}{$a$ is a store-SC}
-%\newcommand{\ppoloadtoacq}{$a$ is a load, $b$ is a load-acquire, and $a$ and $b$ are accesses to overlapping memory accesses}
-%\newcommand{\ppoloadtoacq}{$a$ is a paired load, $b$ is a load-acquire, $a$ and $b$ are accesses to overlapping memory accesses, and there is no other store to an overlapping address between the store paired with $a$ and $b$}
-\newcommand{\ppoloadtoacq}{$a$ is a paired store, $b$ is a load-acquire, $a$ and $b$ are accesses to overlapping memory addresses, and there is no other store to an overlapping address between the store paired with $a$ and $b$}
-\newcommand{\pposuccess}{$a$ is a load, and there exists some $m$ such that $m$ has an address or data dependency on $a$ and $b$ has a success dependency on $m$}
-\newcommand{\pporelease}{$b$ is a store-release}
-\newcommand{\ppostrongacqrel}{$a$ is a store-release-RCsc and $b$ is a load-acquire-RCsc}
-\newcommand{\ppoaddr}{$a$ is a load, and $b$ has a syntactic address dependency on $a$}
-\newcommand{\ppodata}{$a$ is a load, $b$ is a store, and $b$ has a syntactic data dependency on $a$}
-\newcommand{\ppoctrl}{$a$ is a load, $b$ is a store, and $b$ has a syntactic control dependency on $a$}
-\newcommand{\ppordw}{$a$ and $b$ are loads to overlapping memory addresses, there is no store to overlapping memory location(s) between $a$ and $b$ in program order, and $a$ and $b$ return values from different stores}
-%\newcommand{\ppormwrfi}{$a$ is a paired load, $b$ is a load to an overlapping address, $b$ is a load-acquire, and there is no other store to an overlapping address between the store paired with $a$ and $b$}
-\newcommand{\pporfiaq}{$a$ is a paired store, $b$ is a load to an overlapping address, $b$ is a load-acquire, and there is no store to overlapping memory location(s) between $a$ and $b$ in program order}
-\newcommand{\ppoldstld}{$a$ and $b$ are loads, and there exists some store $m$ between $a$ and $b$ in program order such that $m$ has an address or data dependency on $a$, $m$ accesses a memory address that overlaps the address accessed by $b$, and there is no other store to overlapping memory location(s) between $m$ and $b$ in program order}
-\newcommand{\ppoaddrpo}{$b$ is a store, and there exists some instruction $m$ between $a$ and $b$ in program order such that $m$ has an address dependency on $a$}
-\newcommand{\ppoctrlcfence}{$a$ and $b$ are loads, $b$ has a syntactic control dependency on $a$, and there exists a {\tt fence.i} between the branch used to form the control dependency and $b$ in program order}
-\newcommand{\ppoaddrpocfence}{$a$ is a load, there exists an instruction $m$ which has a syntactic address dependency on $a$, and there exists a {\tt fence.i} between $m$ and $b$ in program order}
-
-The complete definition of preserved program order is as follows:
-memory access $a$ precedes memory access $b$ in preserved program order (and hence also in the global memory order) if $a$ precedes $b$ in program order, $a$ and $b$ are both accesses to normal memory (i.e., not to I/O regions), and any of the following hold:
-
-\begin{itemize}
- \item Basic same-address orderings:
- \begin{enumerate}
- \item\label{ppo:->st} \ppost
- \end{enumerate}
- \item Explicit synchronization
- \begin{enumerate}[resume]
- \item\label{ppo:fence} \ppofence
- \item\label{ppo:acquire} \ppoacquire
- %\item\label{ppo:loadtoacq} \ppoloadtoacq
- \item\label{ppo:release} \pporelease
- \item\label{ppo:strongacqrel} \ppostrongacqrel
- \item\label{ppo:amostore} \ppoamostore
- \item\label{ppo:amoload} \ppoamoload
- \end{enumerate}
- \item Dependencies
- \begin{enumerate}[resume]
- \item\label{ppo:addr} \ppoaddr
- \item\label{ppo:data} \ppodata
- \item\label{ppo:ctrl} \ppoctrl
- \item\label{ppo:success} \pposuccess
- \end{enumerate}
- \item Same-address load-load ordering
- \begin{enumerate}[resume]
- \item\label{ppo:rdw} \ppordw
- \end{enumerate}
- %\item Orderings for atomics
- % \begin{enumerate}[resume]
- % %\item\label{ppo:rmwrfi} \ppormwrfi
- % \item\label{ppo:rfiaq} \pporfiaq
- % \end{enumerate}
- \item Pipeline dependency artifacts
- \begin{enumerate}[resume]
- \item\label{ppo:ld->st->ld} \ppoldstld
- \item\label{ppo:addrpo} \ppoaddrpo
- \item\label{ppo:ctrlcfence} \ppoctrlcfence
- \item\label{ppo:addrpocfence} \ppoaddrpocfence
- \end{enumerate}
-\end{itemize}
-
-\subsection{Memory Model Axioms}
-
-An execution of a RISC-V program obeys the RVWMO memory consistency model only if it obeys the {\em load value axiom} and the {\em atomicity axiom}.
-
-\newcommand{\loadvalueaxiom}{
- Each byte of each load returns the corresponding byte written by whichever of the following two stores comes later in the global memory order:
- \begin{enumerate}
- \item the latest store to the same address and preceding the load in the global memory order
- \item the latest store to the same address and preceding the load in program order
- \end{enumerate}
-}
-
-\newcommand{\atomicityaxiom}{If $r$ and $w$ are a paired load and store, and if $s$ is any store from which $r$ returns a value, then there can be no store from another hart to an overlapping memory location which follows both $r$ and $s$ and which precedes $w$ in the global memory order.}
-
-\textbf{Load Value Axiom:} \loadvalueaxiom
-\textbf{Atomicity Axiom:} \atomicityaxiom
-
-\section{Definition of the RVTSO Memory Model}
-\label{sec:tso}
-
-RISC-V cores which implement Ztso impose RVTSO onto all memory accesses.
-RVTSO behaves just like RVWMO but with the following modifications:
-
-\begin{itemize}
- \item All {\tt l\{b|h|w|d|r\}} instructions behave as if {\tt .aq} is set
- \item All {\tt s\{b|h|w|d|c\}} instructions behave as if {\tt .rl} is set
- \item All AMO instructions behave as if {\tt .aq} and {\tt .rl} are both set
-\end{itemize}
-
-These rules render PPO rules \ref{ppo:->st} and \ref{ppo:addr}--\ref{ppo:addrpocfence} redundant.
-They also make redundant any non-I/O fences that do not have both {\tt .pw} and {\tt .sr} set.
-Finally, they also imply that all AMO instructions are fully-fenced; nothing will be reordered past an AMO.
-
-\begin{commentary}
- The definitions of RVTSO and Ztso are new and have not yet been reviewed as carefully as the RVWMO model. For example, it is not yet properly specified how LR/SC will behave under RVTSO.
-\end{commentary}
-
\chapter{RVWMO Explanatory Material}
\label{sec:explanation}
This section provides more explanation for the RVWMO memory model, using more informal language and concrete examples.
diff --git a/src/preamble.tex b/src/preamble.tex
index bbc39bd..68c1b18 100644
--- a/src/preamble.tex
+++ b/src/preamble.tex
@@ -9,6 +9,11 @@
\usepackage{longtable}
\usepackage{multirow}
\usepackage{float}
+\usepackage{xcolor}
+\usepackage{listings}
+\usepackage{comment}
+\usepackage{enumitem}
+\usepackage{verbatimbox}
\usepackage[olditem,oldenum]{paralist}
diff --git a/src/riscv-spec.tex b/src/riscv-spec.tex
index f793b88..b76a2b4 100644
--- a/src/riscv-spec.tex
+++ b/src/riscv-spec.tex
@@ -86,6 +86,9 @@ Andrew Waterman and Krste Asanovi\'{c}, RISC-V Foundation, May 2017.
\input{naming}
\input{history}
+\appendix
+\input{memory}
+
\bibliographystyle{plain}
\bibliography{riscv-spec}
diff --git a/src/rv32.tex b/src/rv32.tex
index 724a98d..7008795 100644
--- a/src/rv32.tex
+++ b/src/rv32.tex
@@ -1021,180 +1021,6 @@ store instructions. Hardware can then automatically optimize accesses
depending on whether runtime addresses are aligned.
\end{commentary}
-\section{Memory Model}
-
-\begin{commentary}
-This section is out of date as the RISC-V memory model is currently
-under revision to ensure it can efficiently support current
-programming language memory models. The revised base memory model
-will contain further ordering constraints, including at least that
-loads to the same address from the same hart cannot be reordered, and
-that syntactic data dependencies between instructions are respected.
-\end{commentary}
-
-The base RISC-V ISA supports multiple concurrent threads of execution
-within a single user address space. Each RISC-V hardware thread, or
-{\em hart}, has its own user register state and program counter, and
-executes an independent sequential instruction stream. The execution
-environment will define how RISC-V harts are created and managed.
-RISC-V harts can communicate and synchronize with other harts either
-via calls to the execution environment, which are documented
-separately in the specification for each execution environment, or
-directly via the shared memory system. RISC-V harts can also interact
-with I/O devices, and indirectly with each other, via loads and stores
-to portions of the address space assigned to I/O.
-
-\begin{commentary}
- We use the term {\em hart} to unambiguously and concisely describe a
- hardware thread as opposed to software-managed thread contexts.
-\end{commentary}
-
-In the base RISC-V ISA, each RISC-V hart observes its own memory
-operations as if they executed sequentially in program order. RISC-V
-has a relaxed memory model between harts, requiring an explicit
-FENCE instruction to guarantee ordering between memory
-operations from different RISC-V harts. Chapter~\ref{atomics}
-describes the optional atomic memory instruction extensions ``A'',
-which provide additional synchronization operations.
-
-\vspace{-0.2in}
-\begin{center}
-\begin{tabular}{F@{}IIIIIIIIF@{}F@{}F@{}S}
-\\
-\instbitrange{31}{28} &
-\multicolumn{1}{c}{\instbit{27}} &
-\multicolumn{1}{c}{\instbit{26}} &
-\multicolumn{1}{c}{\instbit{25}} &
-\multicolumn{1}{c}{\instbit{24}} &
-\multicolumn{1}{c}{\instbit{23}} &
-\multicolumn{1}{c}{\instbit{22}} &
-\multicolumn{1}{c}{\instbit{21}} &
-\multicolumn{1}{c}{\instbit{20}} &
-\instbitrange{19}{15} &
-\instbitrange{14}{12} &
-\instbitrange{11}{7} &
-\instbitrange{6}{0} \\
-\hline
-\multicolumn{1}{|c|}{0} &
-\multicolumn{1}{c|}{PI} &
-\multicolumn{1}{c|}{PO} &
-\multicolumn{1}{c|}{PR} &
-\multicolumn{1}{c|}{PW} &
-\multicolumn{1}{|c|}{SI} &
-\multicolumn{1}{c|}{SO} &
-\multicolumn{1}{c|}{SR} &
-\multicolumn{1}{c|}{SW} &
-\multicolumn{1}{c|}{rs1} &
-\multicolumn{1}{c|}{funct3} &
-\multicolumn{1}{c|}{rd} &
-\multicolumn{1}{c|}{opcode} \\
-\hline
-4 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 5 & 3 & 5 & 7 \\
-0 & \multicolumn{4}{c}{predecessor} & \multicolumn{4}{c}{successor} & 0 & FENCE & 0 & MISC-MEM \\
-\end{tabular}
-\end{center}
-
-The FENCE instruction is used to order device I/O and
-memory accesses as viewed by other RISC-V harts and external devices
-or coprocessors. Any combination of device input (I), device output
-(O), memory reads (R), and memory writes (W) may be ordered with
-respect to any combination of the same. Informally, no other RISC-V
-hart or external device can observe any operation in the {\em
- successor} set following a FENCE before any operation in the {\em
- predecessor} set preceding the FENCE. The execution environment
-will define what I/O operations are possible, and in particular, which
-load and store instructions might be treated and ordered as device
-input and device output operations respectively rather than memory
-reads and writes. For example, memory-mapped I/O devices will
-typically be accessed with uncached loads and stores that are ordered
-using the I and O bits rather than the R and W bits. Instruction-set
-extensions might also describe new coprocessor I/O instructions that
-will also be ordered using the I and O bits in a FENCE.
-
-The unused fields in the FENCE instruction, {\em imm[11:8]}, {\em rs1}, and
-{\em rd}, are reserved for finer-grain fences in future extensions. For
-forward compatibility, base implementations shall ignore these fields, and
-standard software shall zero these fields.
-
-\begin{commentary}
-We chose a relaxed memory model to allow high performance from simple
-machine implementations; however, a completely relaxed memory model is
-too weak to support programming language memory models, and so the
-memory model is being tightened.
-
-A relaxed memory model is also most compatible with likely future
-coprocessor or accelerator extensions. We separate out I/O ordering
-from memory R/W ordering to avoid unnecessary serialization within a
-device-driver hart and also to support alternative non-memory paths
-to control added coprocessors or I/O devices. Simple implementations
-may additionally ignore the {\em predecessor} and {\em successor}
-fields and always execute a conservative fence on all operations.
-\end{commentary}
-
-\vspace{-0.4in}
-\begin{center}
-\begin{tabular}{M@{}R@{}S@{}R@{}O}
-\\
-\instbitrange{31}{20} &
-\instbitrange{19}{15} &
-\instbitrange{14}{12} &
-\instbitrange{11}{7} &
-\instbitrange{6}{0} \\
-\hline
-\multicolumn{1}{|c|}{imm[11:0]} &
-\multicolumn{1}{c|}{rs1} &
-\multicolumn{1}{c|}{funct3} &
-\multicolumn{1}{c|}{rd} &
-\multicolumn{1}{c|}{opcode} \\
-\hline
-12 & 5 & 3 & 5 & 7 \\
-0 & 0 & FENCE.I & 0 & MISC-MEM \\
-\end{tabular}
-\end{center}
-
-The FENCE.I instruction is used to synchronize the instruction and
-data streams. RISC-V does not guarantee that stores to instruction
-memory will be made visible to instruction fetches on the same RISC-V
-hart until a FENCE.I instruction is executed. A FENCE.I instruction
-only ensures that a subsequent instruction fetch on a RISC-V hart
-will see any previous data stores already visible to the same RISC-V
-hart. FENCE.I does {\em not} ensure that other RISC-V harts'
-instruction fetches will observe the local hart's stores in a
-multiprocessor system. To make a store to instruction memory visible
-to all RISC-V harts, the writing hart has to execute a data FENCE
-before requesting that all remote RISC-V harts execute a FENCE.I.
-
-The unused fields in the FENCE.I instruction, {\em imm[11:0]}, {\em rs1}, and
-{\em rd}, are reserved for finer-grain fences in future extensions. For
-forward compatibility, base implementations shall ignore these fields, and
-standard software shall zero these fields.
-
-\begin{commentary}
-Because FENCE.I only orders stores with a hart's own instruction fetches,
-application code should only rely upon FENCE.I if the application thread will
-not be migrated to a different hart. The ABI will provide mechanisms for
-multiprocessor instruction-stream synchronization.
-\end{commentary}
-
-\begin{commentary}
-The FENCE.I instruction was designed to support a wide variety of
-implementations. A simple implementation can flush the local
-instruction cache and the instruction pipeline when the FENCE.I is
-executed. A more complex implementation might snoop the instruction
-(data) cache on every data (instruction) cache miss, or use an
-inclusive unified private L2 cache to invalidate lines from the
-primary instruction cache when they are being written by a local store
-instruction. If instruction and data caches are kept coherent in this
-way, then only the pipeline needs to be flushed at a FENCE.I.
-
-We considered but did not include a ``store instruction word''
-instruction (as in MAJC~\cite{majc}). JIT compilers may generate a
-large trace of instructions before a single FENCE.I, and amortize any
-instruction cache snooping/invalidation overhead by writing translated
-instructions to memory regions that are known not to reside in the
-I-cache.
-\end{commentary}
-
\section{Control and Status Register Instructions}
\label{sec:csrinsts}
@@ -1453,3 +1279,368 @@ renamed to reflect that they can be used more generally than to call a
supervisor-level operating system or debugger.
\end{commentary}
+\section{Memory Consistency Model}
+
+This section defines the RISC-V memory consistency model.
+A memory consistency model is a set of rules that specifies which values can be legally returned by loads of memory.
+The RISC-V ISA by default uses a memory model called ``RVWMO'' (RISC-V Weak Memory Ordering).
+RVWMO is designed to provide flexibility for architects to build high-performance scalable designs, while simultaneously supporting a tractable programming model.
+
+The RISC-V ISA also provides the optional Ztso extension which imposes the stronger RVTSO (RISC-V Total Store Ordering) memory model for hardware that chooses to provide it.
+RVTSO provides a simpler programming model but is more restrictive on the implementations that can be legally built.
+%Other models (e.g., IBM Power) provide a model which allows for more aggressive and flexible implementations, but at the cost of a much more complicated programming model.
+%RVWMO, a variant of release consistency, balances these two extremes.
+%The RVWMO memory model also addresses known shortcomings of memory models of the past (e.g., by enforcing memory ordering on address dependencies and on most same-address load-load pairs).
+
+Under both models, code running on a single hart will appear to execute in order from the perspective of other memory operations in the same hart, but memory instructions from another hart may observe the memory instructions from the first hart being executed in a different order.
+Therefore, under both models, multithreaded code may require explicit synchronization to guarantee ordering between memory operations from different harts.
+The base RISC-V ISA provides a {\tt fence} instruction for this purpose, described in Section~\ref{sec:fence}, while the optional ``A'' atomics extension defines load-reserved/store-conditional and atomic read-modify-write operations.
+
+Appendix~\ref{sec:explanation} provides a formal model and additional explanatory material about the memory model.
+
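+As a non-normative illustration, consider the following message-passing idiom, in which hart 0 writes a data word and then sets a flag, and hart 1 polls the flag and then reads the data ({\tt a0} is assumed to hold the address of the data word and {\tt a1} the address of the flag, both initially zero and written by no other hart). Under RVWMO, the two {\tt fence} instructions are what guarantee that once hart 1 observes the flag set, its subsequent load returns the data written by hart 0:
+
+\begin{verbatim}
+    # Hart 0 (producer)
+    li    t0, 42
+    sw    t0, 0(a0)      # write data
+    fence w, w           # order the data write before the flag write
+    li    t1, 1
+    sw    t1, 0(a1)      # set flag
+
+    # Hart 1 (consumer)
+loop:
+    lw    t2, 0(a1)      # poll flag
+    beqz  t2, loop
+    fence r, r           # order the flag read before the data read
+    lw    t3, 0(a0)      # read data; observes 42 once the flag was seen set
+\end{verbatim}
+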
+%This chapter provides a precise formal specification of the RVWMO memory consistency model.
+%Section~\ref{sec:definition} provides a natural language specification of the model.
+%Section~\ref{sec:explanation} provides explanation and intuition for the design of RVWMO,
+%The sections that follow provide other explanatory and reference material, and then Sections~\ref{sec:alloy}--\ref{sec:operational} provide a suite of formalisms that enable rigorous analysis.
+%Finally, Chapter~\ref{sec:tso} defines RVTSO and the Ztso extension.
+
+\begin{commentary}
+ The RVWMO and RVTSO models are formalized as presented, but the interaction of the memory model with I/O, instruction fetches, page table walks, and {\tt sfence.vma} is not formalized. We may formalize some or all of them in a future revision. The ``V'' vector, transactional memory, and ``J'' JIT extensions will need to be incorporated into a future revision as well.
+
+ Memory models which permit memory accesses of different sizes are an active and open area of research, and the specifics of how the RISC-V memory model deals with mixed-size memory accesses are subject to change. Nevertheless, we provide an educated guess at sane rules for such situations as a guideline for architects and implementers.
+\end{commentary}
+
+\subsection{Definition of the RVWMO Memory Model}
+\label{sec:definition}
+
+The RVWMO memory model is defined in terms of the {\em global memory order}, a total ordering of the memory accesses produced by all harts.
+In general, a multithreaded program will have many different possible executions, and each execution will have its own corresponding global memory order.
+
+The global memory order is defined in terms of the primitive load(s) and/or store(s) generated by each memory instruction.
+It is then subject to the constraints defined in the rest of this section.
+Any execution which satisfies all of the memory model constraints is a legal execution (as far as the memory model is concerned).
+
+\subsubsection*{Memory Model Primitives}
+The RVWMO memory model is specified in terms of the memory accesses generated by each instruction.
+The {\em program order} over memory accesses reflects the order in which the instructions that generate each load and store are originally laid out in that hart's dynamic instruction stream; i.e., the order in which a simple in-order processor would execute the instructions of that hart.
+Table~\ref{tab:memoryaccesses} summarizes the memory accesses generated by each type of memory instruction.
+
+\begin{table}[h!]
+ \centering
+ \begin{tabular}{|l|l|}
+ \hline
+ RISC-V Instruction & Memory Accesses \\
+ \hline
+ \hline
+ \tt l\{b|h|w|d\} & load$^*$ \\
+ \tt s\{b|h|w|d\} & store$^*$ \\
+ \hline
+ \hline
+ \tt lr & load \\
+ \tt lr.aq & load-acquire-RCpc \\
+ \tt lr.aqrl & load-acquire-RCsc \\
+ \tt lr.rl & (deprecated) \\
+ \hline
+ \hline
+ \tt sc & store \\
+ \tt sc.rl & store-release-RCpc \\
+ \tt sc.aqrl & store-release-RCsc \\
+ \tt sc.aq & (deprecated) \\
+ \hline
+ \hline
+ \tt amo<op> & load; {\em $<$op$>$}; store \\
+ \tt amo<op>.aq & load-acquire-RCpc; {\em $<$op$>$}; store \\
+ \tt amo<op>.rl & load; {\em $<$op$>$}; store-release-RCpc \\
+ \tt amo<op>.aqrl & load-SC; {\em $<$op$>$}; store-SC \\
+ \hline
+ \multicolumn{2}{l}{$^*$: possibly multiple if misaligned}
+ \end{tabular}
+ \caption{Mapping of instructions into memory accesses, for the purposes of describing the memory model. ``{\tt.sc}'' is a synonym for ``{\tt.aqrl}''.}
+ \label{tab:memoryaccesses}
+\end{table}
+
+Every aligned load or store instruction gives rise to exactly one memory access that executes in a single-copy atomic fashion: it can never be observed in a partially-incomplete state.
+Every misaligned load or store may be decomposed into a set of component loads or stores at any granularity.
+The memory accesses generated by misaligned loads and stores are not ordered with respect to each other in program order, but they are ordered with respect to the memory accesses generated by preceding and subsequent instructions in program order.
+
+\begin{commentary}
+ The legal decomposition of unaligned memory operations down to even byte granularity facilitates emulation on implementations that do not natively support unaligned accesses.
+ Such implementations might, for example, simply iterate over the bytes of a misaligned access one by one.
+\end{commentary}
+
+AMOs give rise to exactly two memory accesses: one load and one store. These accesses are said to be {\em paired}.
+Every {\tt lr} instruction gives rise to exactly one load.
+Every {\tt sc} instruction gives rise to either zero stores or one store, depending on whether the store conditional succeeds.
+An {\tt lr} instruction is said to be paired with the first {\tt sc} instruction that follows it in program order, unless the {\tt sc} is not successful or there is another {\tt lr} instruction in between.
+The memory accesses generated by paired {\tt lr} and (successful) {\tt sc} instructions are also said to be paired.
+Both {\tt lr} and {\tt sc} instructions may be unpaired if they do not meet the conditions above.
+The complete list of conditions determining the success or failure of store conditional instructions is defined in the ``A'' extension.
+
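+For example, the following non-normative fragment atomically increments the word at the address held in {\tt a0} using a paired {\tt lr.w}/{\tt sc.w}; the {\tt sc.w} writes a success code to {\tt t1}, and the loop retries until the store-conditional succeeds:
+
+\begin{verbatim}
+retry:
+    lr.w  t0, (a0)        # load-reserved, paired with the sc.w below
+    addi  t0, t0, 1
+    sc.w  t1, t0, (a0)    # store-conditional; t1 = 0 on success
+    bnez  t1, retry       # retry if the sc.w failed
+\end{verbatim}
+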
+Loads and stores generated by atomics may carry ordering annotations.
+Loads may carry ``acquire-RCpc'' or ``acquire-RCsc'' annotations.
+The term ``load-acquire'' without further annotation refers to both collectively.
+Stores may carry ``release-RCpc'' or ``release-RCsc'' annotations, and once again ``store-release'' without further annotation refers to both together.
+In the memory model literature, the term ``RCpc'' stands for release consistency with processor-consistent synchronization operations, and the term ``RCsc'' stands for release consistency with sequentially-consistent synchronization operations.
+Finally, AMOs with both {\tt .aq} and {\tt .rl} set are sequentially-consistent in an even stronger sense: they do not allow any reordering in either direction.
+The precise semantics of these annotations as they apply to RVWMO are described by the memory model rules below.
+
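+As a non-normative example of how these annotations might be used, a simple spinlock can be acquired with an {\tt amoswap.w.aq}, so that accesses in the critical section cannot be reordered before the lock acquisition, and released with an {\tt amoswap.w.rl}, so that accesses in the critical section cannot be reordered after the release. Here {\tt a0} is assumed to hold the address of the lock word, with zero meaning unlocked:
+
+\begin{verbatim}
+        li           t0, 1
+acquire:
+        amoswap.w.aq t1, t0, (a0)   # swap in 1; the load half is acquire-RCpc
+        bnez         t1, acquire    # retry if the lock was already held
+        # ... critical section ...
+        amoswap.w.rl x0, x0, (a0)   # swap in 0; the store half is release-RCpc
+\end{verbatim}
+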
+\begin{commentary}
+ Although the ISA does not currently contain {\tt l\{b|h|w|d\}.aq[rl]} or {\tt s\{b|h|w|d\}.[aq]rl} instructions, we may add them as assembler pseudoinstructions to facilitate forward compatibility with their potential future addition into the ISA. These pseudoinstructions will generally assemble per the fence-based mappings of Section~\ref{sec:porting} unless and until the instructions are added to the ISA. The RVWMO memory model is also designed to easily accommodate the possible future inclusion of such instructions.
+\end{commentary}
+
+\subsubsection*{Dependencies}
+The definition of the RVWMO memory model depends in part on the notion of a syntactic dependency.
+A register $r$ read by an instruction $b$ has a syntactic dependency on an instruction $a$ if $a$ precedes $b$ in program order, $r$ is not {\tt x0}, and either of the following hold:
+\begin{enumerate}
+ \item $r$ is written by $a$ and read by $b$, and no other instruction between $a$ and $b$ in program order modifies $r$
+ \item There is some other instruction $i$ such that a register read by $i$ has a dependency on $a$, and a register read by $b$ has a dependency on $i$
+\end{enumerate}
+
+Specific types of dependencies are defined as follows.
+First, for two instructions $a$ and $b$, $b$ has a syntactic address dependency on $a$ if a register used to calculate the address accessed by $b$ has a syntactic dependency on $a$.
+Second, $b$ has a syntactic data dependency on $a$ if $b$ is a store and a register used to calculate the data being written by $b$ has a syntactic dependency on $a$.
+Third, $b$ has a syntactic control dependency on $a$ if there exists a branch $m$ between $a$ and $b$ in program order such that a register checked as part of the condition of $m$ has a syntactic dependency on $a$.
+Finally, $b$ has a syntactic success dependency on $a$ if $a$ is a store-conditional and a register read by $b$ has a syntactic dependency on $a$ via the store conditional success register written by $a$.
+
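+As a non-normative illustration of these definitions (register names are illustrative), in the fragment below the second {\tt lw} has a syntactic address dependency on the first {\tt lw}, because the register {\tt t0} used to compute its address is written by the first {\tt lw}; the {\tt sw} to {\tt 0(a1)} has a syntactic data dependency on the second {\tt lw}, because the data it writes is read from {\tt t1}; and the final {\tt sw} has a syntactic control dependency on the second {\tt lw}, because the condition of the intervening branch reads {\tt t1}:
+
+\begin{verbatim}
+    lw    t0, 0(a0)
+    lw    t1, 0(t0)       # address dependency on the first lw
+    sw    t1, 0(a1)       # data dependency on the second lw
+    beqz  t1, skip
+    sw    x0, 0(a2)       # control dependency on the second lw
+skip:
+\end{verbatim}
+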
+\subsubsection*{Preserved Program Order}
+The global memory order for any given execution of a program respects some but not necessarily all of each hart's program order.
+The subset of program order which must be respected by the global memory order for all executions is known as {\em preserved program order}.
+
+\newcommand{\ppost}{$a$ and $b$ are accesses to overlapping memory addresses, and $b$ is a store}
+\newcommand{\ppofence}{$a$ and $b$ are separated in program order by a fence, $a$ is in the predecessor set of the fence, and $b$ is in the successor set of the fence}
+\newcommand{\ppoacquire}{$a$ is a load-acquire}
+\newcommand{\ppoamoload}{$b$ is a load-SC}
+\newcommand{\ppoamostore}{$a$ is a store-SC}
+%\newcommand{\ppoloadtoacq}{$a$ is a load, $b$ is a load-acquire, and $a$ and $b$ are accesses to overlapping memory accesses}
+%\newcommand{\ppoloadtoacq}{$a$ is a paired load, $b$ is a load-acquire, $a$ and $b$ are accesses to overlapping memory accesses, and there is no other store to an overlapping address between the store paired with $a$ and $b$}
+\newcommand{\ppoloadtoacq}{$a$ is a paired store, $b$ is a load-acquire, $a$ and $b$ are accesses to overlapping memory addresses, and there is no other store to an overlapping address between the store paired with $a$ and $b$}
+\newcommand{\pposuccess}{$a$ is a load, and there exists some $m$ such that $m$ has an address or data dependency on $a$ and $b$ has a success dependency on $m$}
+\newcommand{\pporelease}{$b$ is a store-release}
+\newcommand{\ppostrongacqrel}{$a$ is a store-release-RCsc and $b$ is a load-acquire-RCsc}
+\newcommand{\ppoaddr}{$a$ is a load, and $b$ has a syntactic address dependency on $a$}
+\newcommand{\ppodata}{$a$ is a load, $b$ is a store, and $b$ has a syntactic data dependency on $a$}
+\newcommand{\ppoctrl}{$a$ is a load, $b$ is a store, and $b$ has a syntactic control dependency on $a$}
+\newcommand{\ppordw}{$a$ and $b$ are loads to overlapping memory addresses, there is no store to overlapping memory location(s) between $a$ and $b$ in program order, and $a$ and $b$ return values from different stores}
+%\newcommand{\ppormwrfi}{$a$ is a paired load, $b$ is a load to an overlapping address, $b$ is a load-acquire, and there is no other store to an overlapping address between the store paired with $a$ and $b$}
+\newcommand{\pporfiaq}{$a$ is a paired store, $b$ is a load to an overlapping address, $b$ is a load-acquire, and there is no store to overlapping memory location(s) between $a$ and $b$ in program order}
+\newcommand{\ppoldstld}{$a$ and $b$ are loads, and there exists some store $m$ between $a$ and $b$ in program order such that $m$ has an address or data dependency on $a$, $m$ accesses a memory address that overlaps the address accessed by $b$, and there is no other store to overlapping memory location(s) between $m$ and $b$ in program order}
+\newcommand{\ppoaddrpo}{$b$ is a store, and there exists some instruction $m$ between $a$ and $b$ in program order such that $m$ has an address dependency on $a$}
+\newcommand{\ppoctrlcfence}{$a$ and $b$ are loads, $b$ has a syntactic control dependency on $a$, and there exists a {\tt fence.i} between the branch used to form the control dependency and $b$ in program order}
+\newcommand{\ppoaddrpocfence}{$a$ is a load, there exists an instruction $m$ which has a syntactic address dependency on $a$, and there exists a {\tt fence.i} between $m$ and $b$ in program order}
+
+The complete definition of preserved program order is as follows:
+memory access $a$ precedes memory access $b$ in preserved program order (and hence also in the global memory order) if $a$ precedes $b$ in program order, $a$ and $b$ are both accesses to normal memory (i.e., not to I/O regions), and any of the following hold:
+
+\begin{itemize}
+ \item Basic same-address orderings:
+ \begin{enumerate}
+ \item\label{ppo:->st} \ppost
+ \end{enumerate}
+ \item Explicit synchronization
+ \begin{enumerate}[resume]
+ \item\label{ppo:fence} \ppofence
+ \item\label{ppo:acquire} \ppoacquire
+ %\item\label{ppo:loadtoacq} \ppoloadtoacq
+ \item\label{ppo:release} \pporelease
+ \item\label{ppo:strongacqrel} \ppostrongacqrel
+ \item\label{ppo:amostore} \ppoamostore
+ \item\label{ppo:amoload} \ppoamoload
+ \end{enumerate}
+ \item Dependencies
+ \begin{enumerate}[resume]
+ \item\label{ppo:addr} \ppoaddr
+ \item\label{ppo:data} \ppodata
+ \item\label{ppo:ctrl} \ppoctrl
+ \item\label{ppo:success} \pposuccess
+ \end{enumerate}
+ \item Same-address load-load ordering
+ \begin{enumerate}[resume]
+ \item\label{ppo:rdw} \ppordw
+ \end{enumerate}
+ %\item Orderings for atomics
+ % \begin{enumerate}[resume]
+ % %\item\label{ppo:rmwrfi} \ppormwrfi
+ % \item\label{ppo:rfiaq} \pporfiaq
+ % \end{enumerate}
+ \item Pipeline dependency artifacts
+ \begin{enumerate}[resume]
+ \item\label{ppo:ld->st->ld} \ppoldstld
+ \item\label{ppo:addrpo} \ppoaddrpo
+ \item\label{ppo:ctrlcfence} \ppoctrlcfence
+ \item\label{ppo:addrpocfence} \ppoaddrpocfence
+ \end{enumerate}
+\end{itemize}
+
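+As a non-normative illustration of two of these rules, consider the single-hart fragments below (register names are illustrative). In the first fragment, the two stores are ordered by the fence rule, since the first store is in the predecessor set and the second store is in the successor set of the {\tt fence w, w}. In the second fragment, the two loads are ordered by the address dependency rule, since the address of the second load is computed, via the intervening {\tt add}, from the value returned by the first load:
+
+\begin{verbatim}
+    # ordered by the fence rule:
+    sw    t0, 0(a0)       # in the predecessor set of the fence (W)
+    fence w, w
+    sw    t1, 4(a0)       # in the successor set of the fence (W)
+
+    # ordered by the address dependency rule:
+    lw    t2, 0(a1)
+    add   t3, t2, a2      # carries the syntactic dependency on t2
+    lw    t4, 0(t3)       # address depends on the value returned by the first lw
+\end{verbatim}
+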
+\subsubsection*{Memory Model Axioms}
+
+An execution of a RISC-V program obeys the RVWMO memory consistency model only if it obeys the {\em load value axiom} and the {\em atomicity axiom}.
+
+\newcommand{\loadvalueaxiom}{
+ Each byte of each load returns the corresponding byte written by whichever of the following two stores comes later in the global memory order:
+ \begin{enumerate}
+ \item the latest store to the same address and preceding the load in the global memory order
+ \item the latest store to the same address and preceding the load in program order
+ \end{enumerate}
+}
+
+\newcommand{\atomicityaxiom}{If $r$ and $w$ are a paired load and store, and if $s$ is any store from which $r$ returns a value, then there can be no store from another hart to an overlapping memory location which follows both $r$ and $s$ and which precedes $w$ in the global memory order.}
+
+\textbf{Load Value Axiom:} \loadvalueaxiom
+\textbf{Atomicity Axiom:} \atomicityaxiom
+
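+As a non-normative illustration of the load value axiom, the second option permits a hart to forward a value from its own store before that store is visible to any other hart. In the fragment below, the {\tt lw} may return the value 1 written by the preceding {\tt sw} even if that {\tt sw} comes later than the {\tt lw} in the global memory order (for example, because the store is still sitting in a private store buffer when the load executes):
+
+\begin{verbatim}
+    li  t0, 1
+    sw  t0, 0(a0)     # may still be buffered locally
+    lw  t1, 0(a0)     # may return 1 by forwarding from the local store
+\end{verbatim}
+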
+\subsection{Definition of the RVTSO Memory Model}
+\label{sec:tso}
+
+RISC-V cores which implement Ztso impose RVTSO onto all memory accesses.
+RVTSO behaves just like RVWMO but with the following modifications:
+
+\begin{itemize}
+ \item All {\tt l\{b|h|w|d|r\}} instructions behave as if {\tt .aq} is set
+ \item All {\tt s\{b|h|w|d|c\}} instructions behave as if {\tt .rl} is set
+ \item All AMO instructions behave as if {\tt .aq} and {\tt .rl} are both set
+\end{itemize}
+
+These rules render PPO rules \ref{ppo:->st} and \ref{ppo:addr}--\ref{ppo:addrpocfence} redundant.
+They also make redundant any non-I/O fences that do not have both {\tt .pw} and {\tt .sr} set.
+Finally, they also imply that all AMO instructions are fully-fenced; nothing will be reordered past an AMO.
+
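+For example, because these rules leave write-to-read orderings unconstrained, a fence with both the predecessor-write and successor-read bits set remains meaningful under RVTSO. In the non-normative Dekker-style fragment below ({\tt a0} and {\tt a1} are assumed to hold the addresses of two flag words, both initially zero), the {\tt fence w, r} on each hart is still required to guarantee that at least one hart observes the other's flag set:
+
+\begin{verbatim}
+    # Hart 0                     # Hart 1
+    li    t0, 1                  li    t0, 1
+    sw    t0, 0(a0)              sw    t0, 0(a1)
+    fence w, r                   fence w, r
+    lw    t1, 0(a1)              lw    t1, 0(a0)
+\end{verbatim}
+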
+\begin{commentary}
+ The definitions of RVTSO and Ztso are new and have not yet been reviewed as carefully as the RVWMO model. For example, it is not yet properly specified how LR/SC will behave under RVTSO.
+\end{commentary}
+
+\section{Memory Ordering Instructions}
+\label{sec:fence}
+
+\vspace{-0.2in}
+\begin{center}
+\begin{tabular}{F@{}IIIIIIIIF@{}F@{}F@{}S}
+\\
+\instbitrange{31}{28} &
+\multicolumn{1}{c}{\instbit{27}} &
+\multicolumn{1}{c}{\instbit{26}} &
+\multicolumn{1}{c}{\instbit{25}} &
+\multicolumn{1}{c}{\instbit{24}} &
+\multicolumn{1}{c}{\instbit{23}} &
+\multicolumn{1}{c}{\instbit{22}} &
+\multicolumn{1}{c}{\instbit{21}} &
+\multicolumn{1}{c}{\instbit{20}} &
+\instbitrange{19}{15} &
+\instbitrange{14}{12} &
+\instbitrange{11}{7} &
+\instbitrange{6}{0} \\
+\hline
+\multicolumn{1}{|c|}{0} &
+\multicolumn{1}{c|}{PI} &
+\multicolumn{1}{c|}{PO} &
+\multicolumn{1}{c|}{PR} &
+\multicolumn{1}{c|}{PW} &
+\multicolumn{1}{|c|}{SI} &
+\multicolumn{1}{c|}{SO} &
+\multicolumn{1}{c|}{SR} &
+\multicolumn{1}{c|}{SW} &
+\multicolumn{1}{c|}{rs1} &
+\multicolumn{1}{c|}{funct3} &
+\multicolumn{1}{c|}{rd} &
+\multicolumn{1}{c|}{opcode} \\
+\hline
+4 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 5 & 3 & 5 & 7 \\
+0 & \multicolumn{4}{c}{predecessor} & \multicolumn{4}{c}{successor} & 0 & FENCE & 0 & MISC-MEM \\
+\end{tabular}
+\end{center}
+
+The FENCE instruction is used to order device I/O and
+memory accesses as viewed by other RISC-V harts and external devices
+or coprocessors. Any combination of device input (I), device output
+(O), memory reads (R), and memory writes (W) may be ordered with
+respect to any combination of the same. Informally, no other RISC-V
+hart or external device can observe any operation in the {\em
+ successor} set following a FENCE before any operation in the {\em
+ predecessor} set preceding the FENCE. The execution environment
+will define what I/O operations are possible, and in particular, which
+load and store instructions might be treated and ordered as device
+input and device output operations respectively rather than memory
+reads and writes. For example, memory-mapped I/O devices will
+typically be accessed with uncached loads and stores that are ordered
+using the I and O bits rather than the R and W bits. Instruction-set
+extensions might also describe new coprocessor I/O instructions that
+will also be ordered using the I and O bits in a FENCE.
+
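+As a non-normative example of the intended use of the I and O bits, a device driver that writes a buffer descriptor to ordinary memory and then writes a memory-mapped doorbell register might use a {\tt fence w, o} so that the memory writes are ordered before the device output operation. Here {\tt a0} is assumed to point to the descriptor in memory and {\tt a1} to a doorbell register that the execution environment treats as device output:
+
+\begin{verbatim}
+    sw    t0, 0(a0)      # write descriptor fields (memory writes, W)
+    sw    t1, 8(a0)
+    fence w, o           # order memory writes before device output
+    sw    t2, 0(a1)      # ring the doorbell (device output, O)
+\end{verbatim}
+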
+The unused fields in the FENCE instruction, {\em imm[11:8]}, {\em rs1}, and
+{\em rd}, are reserved for finer-grain fences in future extensions. For
+forward compatibility, base implementations shall ignore these fields, and
+standard software shall zero these fields.
+
+\begin{commentary}
+We chose a relaxed memory model to allow high performance from simple
+machine implementations and from likely future
+coprocessor or accelerator extensions. We separate out I/O ordering
+from memory R/W ordering to avoid unnecessary serialization within a
+device-driver hart and also to support alternative non-memory paths
+to control added coprocessors or I/O devices. Simple implementations
+may additionally ignore the {\em predecessor} and {\em successor}
+fields and always execute a conservative fence on all operations.
+\end{commentary}
+
+\vspace{-0.4in}
+\begin{center}
+\begin{tabular}{M@{}R@{}S@{}R@{}O}
+\\
+\instbitrange{31}{20} &
+\instbitrange{19}{15} &
+\instbitrange{14}{12} &
+\instbitrange{11}{7} &
+\instbitrange{6}{0} \\
+\hline
+\multicolumn{1}{|c|}{imm[11:0]} &
+\multicolumn{1}{c|}{rs1} &
+\multicolumn{1}{c|}{funct3} &
+\multicolumn{1}{c|}{rd} &
+\multicolumn{1}{c|}{opcode} \\
+\hline
+12 & 5 & 3 & 5 & 7 \\
+0 & 0 & FENCE.I & 0 & MISC-MEM \\
+\end{tabular}
+\end{center}
+
+The FENCE.I instruction is used to synchronize the instruction and
+data streams. RISC-V does not guarantee that stores to instruction
+memory will be made visible to instruction fetches on the same RISC-V
+hart until a FENCE.I instruction is executed. A FENCE.I instruction
+only ensures that a subsequent instruction fetch on a RISC-V hart
+will see any previous data stores already visible to the same RISC-V
+hart. FENCE.I does {\em not} ensure that other RISC-V harts'
+instruction fetches will observe the local hart's stores in a
+multiprocessor system. To make a store to instruction memory visible
+to all RISC-V harts, the writing hart has to execute a data FENCE
+before requesting that all remote RISC-V harts execute a FENCE.I.
+
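+As a non-normative sketch, a routine that writes a new instruction into memory and then executes it on the same hart might use the sequence below; per the preceding paragraph, making the new code visible to other harts would additionally require a data {\tt fence} on the writing hart and a FENCE.I on each remote hart, typically requested through the execution environment:
+
+\begin{verbatim}
+    sw      t0, 0(a0)    # store a new instruction word at address a0
+    fence.i              # synchronize this hart's instruction fetches
+    jalr    ra, 0(a0)    # safe to execute the newly written code locally
+\end{verbatim}
+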
+The unused fields in the FENCE.I instruction, {\em imm[11:0]}, {\em rs1}, and
+{\em rd}, are reserved for finer-grain fences in future extensions. For
+forward compatibility, base implementations shall ignore these fields, and
+standard software shall zero these fields.
+
+\begin{commentary}
+Because FENCE.I only orders stores with a hart's own instruction fetches,
+application code should only rely upon FENCE.I if the application thread will
+not be migrated to a different hart. The ABI will provide mechanisms for
+multiprocessor instruction-stream synchronization.
+\end{commentary}
+
+\begin{commentary}
+The FENCE.I instruction was designed to support a wide variety of
+implementations. A simple implementation can flush the local
+instruction cache and the instruction pipeline when the FENCE.I is
+executed. A more complex implementation might snoop the instruction
+(data) cache on every data (instruction) cache miss, or use an
+inclusive unified private L2 cache to invalidate lines from the
+primary instruction cache when they are being written by a local store
+instruction. If instruction and data caches are kept coherent in this
+way, then only the pipeline needs to be flushed at a FENCE.I.
+
+We considered but did not include a ``store instruction word''
+instruction (as in MAJC~\cite{majc}). JIT compilers may generate a
+large trace of instructions before a single FENCE.I, and amortize any
+instruction cache snooping/invalidation overhead by writing translated
+instructions to memory regions that are known not to reside in the
+I-cache.
+\end{commentary}