4 files changed, 21 insertions, 1490 deletions
diff --git a/src/p.tex b/src/p.tex
index 074e475..dac4e4f 100644
--- a/src/p.tex
+++ b/src/p.tex
@@ -1,5 +1,5 @@
 \chapter{``P'' Standard Extension for Packed-SIMD Instructions,
-  Version 0.1}
+  Version 0.2}
 \label{sec:packedsimd}
 
 \begin{commentary}
@@ -8,94 +8,7 @@
   standardizing on the V extension for large floating-point SIMD
   operations.  However, there was interest in packed-SIMD fixed-point
   operations for use in the integer registers of small RISC-V
-  implementations.
+  implementations. A task group is working to define the new P
+  extension.
 \end{commentary}
 
-In this chapter, we outline a standard packed-SIMD extension for
-RISC-V.  We've reserved the instruction-set extension name ``P'' for a future
-standard set of packed-SIMD extensions.  Many other extensions can
-build upon a packed-SIMD extension, taking advantage of the wide data
-registers and datapaths separate from the integer unit.
-
-\begin{commentary}
-Packed-SIMD extensions, first introduced with the Lincoln Labs TX-2~\cite{tx2},
-have become a popular way to provide higher throughput on data-parallel
-codes. Earlier commercial microprocessor implementations include the
-Intel i860, HP PA-RISC MAX~\cite{lee-max-ieeemicro1996}, SPARC
-VIS~\cite{tremblay-vis-ieeemicro1996}, MIPS
-MDMX~\cite{gwennap-mdmx-mpr1996}, PowerPC
-AltiVec~\cite{diefendorff-altivec-ieeemicro2000}, Intel x86
-MMX/SSE~\cite{peleg-mmx-ieeemicro1996, raman-sse-ieeemicro2000}, while
-recent designs include Intel x86 AVX~\cite{lomont-avx-irm2011} and ARM
-Neon~\cite{goodacre-armisa-computer2005}.  We describe a standard
-framework for adding packed-SIMD in this chapter, but are not actively
-working on such a design.  In our opinion, packed-SIMD designs represent
-a reasonable design point when reusing existing wide datapath resources,
-but if significant additional resources are to be devoted to
-data-parallel execution then designs based on traditional vector
-architectures are a better choice and should use the V extension.
-
-\end{commentary}
-
-A RISC-V packed-SIMD extension reuses the floating-point registers
-({\tt f0}-{\tt f31}).  These registers can be defined to have widths
-of FLEN=32 to FLEN=1024.  The standard floating-point instruction-set
-extensions require registers of width 32 bits (``F''), 64 bits (``D''),
-or 128 bits (``Q'').
-
-\begin{commentary}
-It is natural to use the floating-point registers for packed-SIMD
-values rather than the integer registers (PA-RISC and Alpha
-packed-SIMD extensions) as this frees the integer registers for
-control and address values, simplifies reuse of scalar floating-point
-units for SIMD floating-point execution, and leads naturally to a
-decoupled integer/floating-point hardware design.  The floating-point
-load and store instruction encodings also have space to handle wider
-packed-SIMD registers.  However, reusing the floating-point registers
-for packed-SIMD values does make it more difficult to use a recoded
-internal format for floating-point values.
-\end{commentary}
-
-The existing floating-point load and store instructions are used to
-load and store various-sized words from memory to the {\tt f}
-registers.  The base ISA supports 32-bit and 64-bit loads and stores,
-but the LOAD-FP and STORE-FP instruction encodings allows 8 different
-widths to be encoded as shown in Table~\ref{psimdwidth}.  When used
-with packed-SIMD operations, it is desirable to support non-naturally
-aligned loads and stores in hardware.
-
-\begin{table}[htp]
-\begin{center}
-\begin{tabular}{|c|l|r|}
-\hline
-{\em width} field &
-Code &
-Size in bits\\
-\hline
-000 & B  &  8   \\
-001 & H  & 16   \\
-010 & W  & 32   \\
-011 & D  & 64   \\
-100 & Q  & 128  \\
-101 & Q2 & 256  \\
-110 & Q4 & 512  \\
-111 & Q8 & 1024 \\
-\hline
-\end{tabular}
-\end{center}
-\caption{LOAD-FP and STORE-FP width encoding.}
-\label{psimdwidth}
-\end{table}
-
-Packed-SIMD computational instructions operate on packed values in
-{\tt f} registers.  Each value can be 8-bit, 16-bit, 32-bit, 64-bit,
-or 128-bit, and both integer and floating-point representations can be
-supported.  For example, a 64-bit packed-SIMD extension can treat each
-register as 1$\times$64-bit, 2$\times$32-bit, 4$\times$16-bit, or
-8$\times$8-bit packed values.
-
-\begin{commentary}
-Simple packed-SIMD extensions might fit in unused 32-bit instruction
-opcodes, but more extensive packed-SIMD extensions will likely require
-a dedicated 30-bit instruction space.
-\end{commentary}
diff --git a/src/preface.tex b/src/preface.tex
index 69615b0..535c78a 100644
--- a/src/preface.tex
+++ b/src/preface.tex
@@ -31,7 +31,7 @@ modules:
     \bf Zifencei   & \bf 2.0 & \bf Ratification \\
     \bf Zicsr      & \bf 2.0 & \bf Ratification \\
     \bf M          & \bf 2.0 & \bf Ratification \\
-    \bf A          & \bf 2.0 & \bf Ratification \\
+    \em A          & \em 2.0 & \em Frozen \\
     \bf F          & \bf 2.2 & \bf Ratification \\
     \bf D          & \bf 2.2 & \bf Ratification \\
     \bf Q          & \bf 2.2 & \bf Ratification \\
@@ -42,8 +42,8 @@ modules:
     \em B          & \em 0.0 & \em Draft \\
     \em J          & \em 0.0 & \em Draft \\
     \em T          & \em 0.0 & \em Draft \\
-    \em P          & \em 0.1 & \em Draft \\
-    \em V          & \em 0.4 & \em Draft \\
+    \em P          & \em 0.2 & \em Draft \\
+    \em V          & \em 0.7 & \em Draft \\
     \em N          & \em 1.1 & \em Draft \\
     \em Zam        & \em 0.1 & \em Draft \\
     \hline
@@ -56,6 +56,7 @@ The changes in this version of the document include:
 \begin{itemize}
 \parskip 0pt
 \itemsep 1pt
+\item Removed the A extension from ratification.
 \item Changed document version scheme to avoid confusion with versions
   of the ISA modules.
 \item Incremented the version numbers of the base integer ISA to 2.1,
@@ -109,6 +110,10 @@ The changes in this version of the document include:
 \item Improvements to the description and commentary.
 \item Defined the term IALIGN as shorthand to describe the instruction-address
   alignment constraint.
+\item Removed text of P extension chapter as now superseded by active task
+  group documents.
+\item Removed text of V extension chapter as now superseded by separate vector
+  extension draft document.
 \end{itemize}
 
 \section*{Preface to Document Version 2.2}
@@ -140,7 +145,7 @@ versions of the RISC-V ISA modules:
     J        & 0.0 & N \\
     T        & 0.0 & N \\
     P        & 0.1 & N \\
-    V        & 0.2 & N \\
+    V        & 0.7 & N \\
     N        & 1.1 & N \\
     \hline
   \end{tabular}
diff --git a/src/riscv-spec.tex b/src/riscv-spec.tex
index 1ec7ccd..17bed0a 100644
--- a/src/riscv-spec.tex
+++ b/src/riscv-spec.tex
@@ -6,8 +6,8 @@
 
 \input{preamble}
 
-\newcommand{\specrev}{\mbox{20181221-Public-Review-{\em draft}}}
-\newcommand{\specmonthyear}{\mbox{December 2018}}
+\newcommand{\specrev}{\mbox{20190305-Base-Ratification}}
+\newcommand{\specmonthyear}{\mbox{March 2019}}
 
 \begin{document}
 
diff --git a/src/v.tex b/src/v.tex
index c5c277d..e367b99 100644
--- a/src/v.tex
+++ b/src/v.tex
@@ -1,1402 +1,15 @@
-\chapter{``V'' Standard Extension for Vector Operations, Version 0.4-DRAFT}
+\chapter{``V'' Standard Extension for Vector Operations, Version 0.7}
 \label{sec:vector}
 
-{\bf This version is out-of-date with respect to the current working
-  group draft, which is now hosted on {\tt https://github.com/riscv/riscv-v-spec}.}
-
-This chapter presents a proposal for the RISC-V base vector
-instruction-set extension.  The base vector extension is intended to
-provide general support for data-parallel execution within the 32-bit
-instruction encoding space, with later vector extensions supporting
-richer functionality for certain domains.
-
-\begin{commentary}
-The vector extension is based on the style of vector register
-architecture introduced by Seymour Cray in the 1970s, as opposed to
-the earlier packed SIMD approach, introduced with the Lincoln Labs
-TX-2 in 1957 and now adopted by most other commercial instruction
-sets.
-\end{commentary}
-
-The base vector extension defines the components that must be included
-when the ``V'' bit is set in the {\tt misa} register, and consequently
-those that will be assumed to exist by software written for an ABI
-specifying V.
-
-\begin{commentary}
-  This draft version of the chapter includes additional specifications
-  of proposed extensions to the base vector extension to explain some
-  of the encoding choices made for the base.
-\end{commentary}
-
-The vector extension supports a configurable vector unit, to enable
-implementations to tradeoff the number of active architectural vector
-registers and supported element widths against available maximum
-vector length.  The vector extension is designed to allow the same
-binary code to work efficiently across a variety of hardware
-implementations varying in physical vector storage capacity and
-datapath spatial and/or temporal parallelism.
-
-\begin{commentary}
-The vector instruction set contains many features developed in earlier
-research projects, including the Berkeley T0~\cite{} and VIRAM~\cite{VIRAM}
-vector microprocessors, the MIT Scale vector-thread processor~\cite{},
-and the Berkeley Maven~\cite{} and Hwacha~\cite{} projects.
-\end{commentary}
-
-\section{Vector Unit State}
-
-The additional vector unit architectural state includes 32 vector
-registers ({\tt v0}--{\tt v31}), and an XLEN-bit WARL vector length
-CSR, {\tt vl}.  Each vector register {\tt v}$n$ has an associated
-16-bit configuration field {\tt vtype}$n$ described below. A 6-bit
-global maximum element width register {\tt vmaxew} defines the maximum
-number of bits of storage in every element of every active vector
-register.
-
-\begin{commentary}
-  Future vector extensions using wider instruction encodings can
-  support more architectural vector registers. For example, 256
-  architectural vector registers in a 64-bit instruction encoding.
-\end{commentary}
-
-\begin{commentary}
-  Future 2D shape extensions add two more vector length registers,
-  {\tt vm} and {\tt vn}.
-\end{commentary}
-
-There is also a 3-bit fixed-point rounding mode CSR {\tt vxrm}, and a
-single-bit fixed-point saturation status CSR {\tt vxsat}.  The {\tt
-  vcs} CSR alias provides combined access to the {\tt vl}, {\tt vxrm},
-{\tt vxsat} fields to reduce context switch time.  The {\tt vcs}
-register also includes a configuration mode field to support future
-extended configuration modes.
-
-\begin{discussion}
-The components of vcs might not need separate CSR addresses,
-depending on how they're accessed via other non-CSR instructions.
-\end{discussion}
-
-\section{Vector Unit Type Configuration Register ({\tt vtype}$n$)}
-
-The vector unit must be configured before use.  Each architectural
-vector register, {\tt v}$n$, is configured via 16 bits of vector type
-configuration state {\tt vtype}$n$, which can be accessed via vector
-configuration ({\tt vcfg}) CSRs and other rapid vector configuration
-instructions as described below.  The vector register type
-configuration encodes the overall organization, or {\em shape}, of the
-elements in each vector register (e.g., scalar versus 1-D vector), as
-well as the bitwidth and numeric representation of each element.  As
-shown in Figure~\ref{fig:vtype}, the 16-bit {\tt vtype}$n$ encoding is
-divided into a 5-bit current shape field {\tt vshape}$n$, a 5-bit
-representation field {\tt verep}$n$, and a 6-bit element bit-width
-field {\tt vew}$n$\, held in the {\tt vcfg}$x$ CSRs.  The combination
-of an element numeric representation and an element bitwidth is called
-an element {\em format}.  Each vector register can also be disabled to
-free physical vector storage for other architectural vector registers.
-
-\begin{figure}[htb]
-\begin{center}
-\begin{tabular}{O@{}O@{}O}
-\\
-\instbitrange{15}{11} &
-\instbitrange{10}{6} &
-\instbitrange{5}{0} \\
-\hline
-\multicolumn{1}{|c|}{{\tt vshape}$n$} &
-\multicolumn{1}{c|}{{\tt verep}$n$} &
-\multicolumn{1}{c|}{{\tt vew}$n$} \\
-\hline
-5 & 5 & 6 \\
-\end{tabular}
-\end{center}
-\caption{Location of subfields within a single {\tt vtype}$n$ field.}
-\label{fig:vtype}
-\end{figure}
-
-\begin{commentary}
-  It was also common in earlier vector machines to support multiple
-  precisions within the vector datapath.  In particular, the CDC
-  STAR-100~\cite{cdcstar100} supported single-precision and
-  double-precision floating-point operations and also bit, byte, and
-  nibble operations in the vector unit; TI ASC~\cite{tiasc} designs
-  supported dividing 64-bit vector lanes into two 32-bit lanes for
-  double throughput.
-\end{commentary}
-
-\clearpage
-
-\section{Shape Encoding}
-
-The 5-bit shape field describes the structure of the elements within
-the vector register.  In the base vector extension, the shape can be
-set to either scalar or vector.
-
-\begin{table}[hbt]
-  \centering
-  \begin{tabular}{|c|l|}
-    \hline
-        {\tt vshape} & Shape \\
-        \hline
-        00000  & scalar  \\
-        00100  & 1-D vector, length controlled by {\tt vl}  \\
-        \hline
-        \multicolumn{2}{|c|}{All other encodings reserved}\\
-        \hline
-  \end{tabular}
-  \caption{Base vector encoding of {\tt vshape}$n$ field.}
-  \label{tab:vshape}
-\end{table}
+The current working group draft is hosted at {\tt
+  https://github.com/riscv/riscv-v-spec}.
 
 \begin{commentary}
-  For the base vector ISA, only a single bit is required in each {\tt
-    vshape} field to select between scalar and 1-D vector elements
-  with the other bits hardwired to zero.
+The base vector extension is intended to provide general support for
+data-parallel execution within the 32-bit instruction encoding space,
+with later vector extensions supporting richer functionality for
+certain domains.
 \end{commentary}
   
-\begin{table}[hbt]
-  \centering
-  \begin{tabular}{|c|l|}
-    \hline
-        {\tt vshape} & Shape \\
-        \hline
-        00000  & scalar \\
-        00001  & {\em Reserved} \\
-        0001x  & {\em Reserved} \\
-        \hline
-        00100  & 1-D vector {\tt vl} \\
-        01000  & 1-D vector {\tt vm} \\
-        01100  & 1-D vector {\tt vn} \\
-        \hline
-        00101  & 2-D matrix {\tt vl} x {\tt vl} \\
-        00110  & 2-D matrix {\tt vl} x {\tt vm} \\
-        00111  & 2-D matrix {\tt vl} x {\tt vn} \\
-        \hline
-        01001  & 2-D matrix {\tt vm} x {\tt vl} \\
-        01010  & 2-D matrix {\tt vm} x {\tt vm} \\
-        01011  & 2-D matrix {\tt vm} x {\tt vn} \\
-        \hline
-        01101  & 2-D matrix {\tt vn} x {\tt vl} \\
-        01110  & 2-D matrix {\tt vn} x {\tt vm} \\
-        01111  & 2-D matrix {\tt vn} x {\tt vn} \\
-        \hline
-        1xxxx  & {\em Reserved}/{\em Custom} \\
-        \hline
-  \end{tabular}
-  \caption{Extended encoding of per-vector-register {\tt vshape} field.}
-  \label{tab:extvshape}
-\end{table}
-
-\begin{commentary}
-  A sketch of the proposed encodings for the 2D shape extension is
-  shown in the Table.
-\end{commentary}
-
-\clearpage
-
-\section{Representation Encoding}
-
-The 5-bit {\tt verep}$n$ register sets the numeric representation of
-each element of the vector register.  In the base vector
-extension, the representation can be set to unsigned integer,
-two's-complement signed integer, or floating-point.  The
-floating-point representations follow the IEEE 754 standards.
-
-\begin{table}[hbtp]
-  \centering
-  \begin{tabular}{|c|l|}
-    \hline
-    {\tt verep} & Representation \\
-    \hline
-    00000 & Unsigned integer \\
-    00001 & Two's-complement signed integer \\
-    00010 & {\em Reserved (unsigned floating-point?)}\\
-    00011 & IEEE-754 floating-point \\
-    \hline
-    \multicolumn{2}{|c|}{All other encodings reserved}\\
-    \hline
-  \end{tabular}
-  \caption{Base vector representation encoding.}
-  \label{tab:verep}
-\end{table}
-
-\begin{table}[hbtp]
-  \centering
-  \begin{tabular}{|c|l|}
-    \hline
-    {\tt verep} & Representation \\
-    \hline
-    00000 & Unsigned integer \\
-    00001 & Two's-complement signed integer \\
-    00010 & {\em Reserved (unsigned floating-point)}\\
-    00011 & IEEE-754 floating-point \\
-    \hline
-    001x0 & {\em Reserved} \\
-    00101 & Complex signed integer \\
-    00111 & Complex floating-point \\
-    \hline
-    01000 & Prime Galois field - integer representation \\
-    01001 & Prime Galois field - Montgomery representation \\
-    01100 & Binary extension Galois field - polynomial basis \\
-    01101 & Binary extension Galois field - normal basis \\
-    \hline
-    01010 & UNORM \\
-    01011 & SNORM \\
-    01110 & {\em Reserved} \\
-    01111 & {\em Reserved (complex SNORM?)} \\
-    \hline
-    10xxx & Custom representations \\
-    \hline
-    11xxx & {\em Reserved} \\
-    \hline
-  \end{tabular}
-  \caption{Extended vector representation encoding.}
-  \label{tab:extverep}
-\end{table}
-
-\begin{commentary}
-  The complex representations split the element width given in {\tt
-    vew}$n$ into two equal-sized real and imaginary fields, so an
-  element width of 64 bits can hold a single complex value with a
-  32-bit real and a 32-bit imaginary component.
-\end{commentary}
-
-\clearpage
-
-\section{Element Bitwidth}
-
-Each vector register, {\tt v}$n$, has a 6-bit element width
-register, {\tt vew}$n$, to specify the number of bits for each element
-of the current type in the vector register.
-
-The largest element width supported is
-termed ELEN, and is defined to be the larger of the supported integer
-and floating-point type widths:
-\[ \mbox{\em ELEN} = max(\mbox{\em XLEN}, \mbox{\em FLEN}) \]
-For the base vector ISA, the bit width can be set at any power of two
-between 8 and ELEN.
-
-\begin{table}[hbt]
-  \centering
-  \begin{tabular}{|c|r|l|}
-    \hline
-        {\tt vew} & Width & Required in Base \\
-        \hline
-        000 000 & disabled & All \\
-        001 000 & 8 & All \\
-        010 000 & 16 & All \\
-        011 000 & 32 & All \\
-        100 000 & 64 & RV32D, RV64, RV128\\
-        101 000 & 128 & RV64Q, RV128\\
-        \hline
-        \multicolumn{3}{|c|}{All other encodings reserved.}\\
-        \hline
-  \end{tabular}
-  \caption{Base vector ISA encoding of vector element width ({\tt
-      vew}$n$) register fields.}
-  \label{tab:basevew}
-\end{table}
-
-\begin{table}[hbtp]
-  \centering
-  \begin{tabular}{|c|r|}
-    \hline
-        {\tt vew} & Width \\
-        \hline
-        000 000 & disabled \\
-        000 001 & 1 \\
-        000 xxx & \multicolumn{1}{r|}{steps of 1}\\
-        000 111 & 7 \\
-        \hline
-        001 000 & 8 \\
-        001 xxx & \multicolumn{1}{r|}{steps of 1}\\
-        001 111 & 15 \\
-        \hline
-        010 000 & 16 \\
-        010 xxx & \multicolumn{1}{r|}{steps of 2}\\
-        010 111 & 30 \\
-        \hline
-        011 000 & 32 \\
-        011 xxx & \multicolumn{1}{r|}{steps of 4}\\
-        011 111 & 60 \\
-        \hline
-        100 000 & 64 \\
-        100 xxx & \multicolumn{1}{r|}{steps of 8}\\
-        100 111 & 120 \\
-        \hline
-        101 xxx & reserved \\
-        \hline
-        110 000 & 128 \\
-        110 001 & 192 \\
-        110 010 & 2048 \\
-        110 011 & 3072 \\
-        110 100 & 512 \\
-        110 101 & 768 \\
-        110 110 & 8192 \\
-        110 111 & 12288 \\
-        \hline
-        111 000 & 256 \\
-        111 001 & 384 \\
-        111 010 & 4096 \\
-        111 011 & 6144 \\
-        111 100 & 1024 \\
-        111 101 & 1536 \\
-        111 110 & 16384 \\
-        111 111 & 24576 \\
-        \hline
-  \end{tabular}
-
-   \caption{Proposed extended encoding of vector element width ({\tt
-       vew}$n$) register fields. Every bit width between 1 and 16 can
-     be supported.  Bit widths in steps of 2 between 16 to 32 (i.e.,
-     16, 18, 20, ...).  Bit widths in steps of 4 between 32 to 64
-     (i.e., 32, 36, 40, ...).  Bit widths in steps of 8 between 64 and
-     128 (i.e., 64, 72, 80,...).  For bit widths greater than 128, all
-     powers-of-two up to 16384 and all widths 1.5$\times$ greater are
-     supported (128, 384, 512, 768,...).  }
-   \label{tab:extvew}
-\end{table}
-
-\begin{commentary}
-    The extended bit-width encoding is designed to minimize the number
-    of state bits required to support useful subsets of widths. For
-    example, an RV32 system only needs two bits of state per {\tt
-      vew}$n$ field to represent {\em disabled}, 8, 16, and 32. An
-    RV32 system with 3 bits of state can represent {\em disabled}, 4,
-    8, 12, 16, 24, 32, and 48.  An RV64 system with 4 bits of state
-    can represent {\em disabled}, 4, 8, 12, 16, 24, 32, 48, 64, 96,
-    128, 256, 512, 1024.
-\end{commentary}
-
-\clearpage
-
-\section{Base Vector Extension Supported Types}
-
-The types supported by the base V extension depend upon the base
-scalar ISA and supported extensions.  When the base V extension is
-added to a base scalar ISA, it must support the vector data element
-types implied by the supported scalar types as defined by
-Table~\ref{tab:velemtypes}.
-
-\begin{table}[hbt]
-  \centering
-\begin{tabular}{|l|l|}
-  \hline
-  \multicolumn{2}{|c|}{Supported Fixed-Point Formats} \\
-  \hline
-  RV32I  & I8, U8, I16, U16, I32, U32 \\
-  RV64I  & I8, U8, I16, U16, I32, U32, I64, U64 \\
-  RV128I & I8, U8, I16, U16, I32, U32, I64, U64, I128, U128 \\
-  \hline
-  \hline
-  \multicolumn{2}{|c|}{Supported Floating-Point Formats} \\
-  \hline
-  F      & F16, F32 \\
-  FD     & F16, F32, F64 \\
-  FDQ    & F16, F32, F64, F128 \\
-  \hline
-\end{tabular}
-\caption{Supported data element formats depending on base integer ISA
-  and supported floating-point extensions.  I$x$ indicates a signed
-  integer of $x$ bits, U$x$ indicates an unsigned integer of $x$ bits,
-  and F$x$ indicates an IEEE floating-point number of $x$ bits.}
-\label{tab:velemtypes}
-\end{table}
-
-\begin{commentary}
-  Future vector extensions might expand the set of supported
-  datatypes, including custom application-specific datatypes.
-\end{commentary}
-
-\clearpage
-
-\section{Maximum Vector Element Width ({\tt vmaxew})}
-
-The global {\tt vmaxew} field is used to support more complex vector
-runtime environments where the types to be held in each register of a
-single configuration may vary dynamically, and may not even be known
-at compile time due to separate compilation.
-
-The global maximum element width register {\tt vmaxew} defines the
-maximum number of bits of storage in every element of every active
-architectural register, or if zero, defers to the per-vector-register
-width field.
-
-\begin{commentary}
-  The VIRAM processor had a virtual processor width
-  register similar to {\tt vmaxew}~\cite{VIRAM}.
-\end{commentary}
-
-If {\tt vmaxew} is zero, then the per-element vector element widths
-{\tt vew}$n$ determine the minimum storage required for each element
-of the associated vector register {\tt v}$n$.
-
-If {\tt vmaxew} is non-zero, it sets the largest element width that
-can be supported in any vector register element in the current
-configuration.
-
-\clearpage
-
-\section{Vector Configuration Registers ({\tt vcfg0}--{\tt vcfg15})}
-
-The vector type configuration requires 512 bits of state (32 vector
-registers each with 16-bit {\tt vtype}$n$ field) that can be accessed
-via the {\tt vcfg CSRs}.
-
-RV128 uses four vector configuration CSRs: {\tt vcfg0} holds
-configuration data for {\tt v0}--{\tt v7} with bits $16n$ to $16n+15$
-holding {\tt vtype}$n$, while {\tt vcfg4}, {\tt vcfg8} and {\tt
-  vcfg12} similarly holds configuration data for {\tt v8}--{\tt v15},
-  {\tt v16}--{\tt v23}, and {\tt v24}--{\tt v31} respectively.
-
-In RV64, the {\tt vcfg2} CSR provides access to the upper 64 bits of {\tt
-  vcfg0} and {\tt vcfg6} provides access to the upper 64 bits of
-{\tt vcfg4}.  In RV32, the {\tt vcfg1}, {\tt vcfg3}, {\tt vcfg5}
-and {\tt vcfg7} CSRs provides access to the upper bits of {\tt
-  vcfg0}, {\tt vcfg2}, {\tt vcfg4} and {\tt vcfg6} respectively.
-
-Any CSR write to a {\tt vcfg}$x$ register zeros all {\tt vcfg}$y$
-registers, for $y>x$.  As a result configuration data should be
-written from the {\tt vcfg0} CSR upwards.
-
-\begin{commentary}
-  Zeroing higher-numbered {\tt vcfg}$y$ registers allows more rapid
-  reconfiguration of the vector register file via CSR writes, and
-  provides backward-compatibility for extensions that increase the
-  number of possible architectural vector registers.  This choice does
-  prevent the use of CSRRW instructions to swap the configuration
-  context; an entire old configuration must be read out before a new
-  configuration is written in.
-\end{commentary}
-
-Additional instructions are provided to support more rapid changes to
-the vector unit configuration as described below.
-
-\section{Legal Vector Unit Configurations}
-
-To simplify hardware configuration calculations and to reduce software
-context-switch complexity, vector unit configurations are constrained
-to have non-disabled architectural vector registers numbered
-contiguously starting at {\tt v0}.  An exception will be raised if an
-instruction tries to change {\tt vtype}$n$ in a way that violates this
-constraint.
-
-\begin{commentary}
-  During a software vector-context save, the software handler can stop
-  searching for active architectural registers after encountering the
-  first disabled vector register.  Hardware to calculate physical
-  register allocation is also simplified with this constraint.
-\end{commentary}
-
-\clearpage
-
-\section{Vector Unit CSRs}
-
-\begin{table}[hbt]
-  \centering
-  \begin{tabular}{|l|c|l|l|}
-    \hline
-    CSR name & Number & Base ISA & Description\\
-    \hline
-    {\tt vcs}  & TBD & RV32, RV64, RV128 & Vector control-status register\\
-    {\tt vl}    & TBD & RV32, RV64, RV128 & Active vector length\\
-    {\tt vxrm}  & TBD & RV32, RV64, RV128 & Vector fixed-point rounding mode\\
-    {\tt vxsat} & TBD & RV32, RV64, RV128 & Vector fixed-point
-    saturation flag \\
-    {\tt vmaxew} & TBD & RV32, RV64, RV128 & Global maximum vector element width \\
-    \hline
-    {\tt vcfg0} & TBD & RV32, RV64, RV128 & \multirow{16}{*}{Vector
-      register configuration}\\
-    {\tt vcfg1} & TBD & RV32 &\\
-    {\tt vcfg2} & TBD & RV32, RV64 &\\
-    {\tt vcfg3} & TBD & RV32 &\\
-    {\tt vcfg4}  & TBD & RV32, RV64, RV128 &\\
-    {\tt vcfg5} & TBD & RV32 &\\
-    {\tt vcfg6} & TBD & RV32, RV64 &\\
-    {\tt vcfg7} & TBD & RV32 &\\
-    {\tt vcfg8} & TBD & RV32, RV64, RV128 & \\
-    {\tt vcfg9} & TBD & RV32 &\\
-    {\tt vcfg10} & TBD & RV32, RV64 &\\
-    {\tt vcfg11} & TBD & RV32 &\\
-    {\tt vcfg12}  & TBD & RV32, RV64, RV128 &\\
-    {\tt vcfg13} & TBD & RV32 &\\
-    {\tt vcfg14} & TBD & RV32, RV64 &\\
-    {\tt vcfg15} & TBD & RV32 &\\
-    \hline
-  \end{tabular}
-  \caption{Vector extension CSRs.}
-  \label{tab:vcsrs}
-\end{table}
-
-\clearpage
-
-\section{Maximum Vector Length (MVL)}
-
-The implementation determines an available {\em maximum vector length}
-(MVL) dependent on the current vector type configuration held in {\tt
-  vcfg}$x$ and {\tt vmaxew}.  The available MVL depends on the
-configuration setting and on the implementation's microarchitecture,
-but MVL must always have the same value for the same configuration
-parameters on a given hart.
-
-\begin{commentary}
-  Several earlier vector machines had the ability to configure
-  physical vector register storage into a larger number of short
-  vectors or a shorter number of long vectors. In particular the
-  Fujitsu VP series~\cite{vp200} supported combining power-of-2 base
-  vector registers into longer vector registers.
-
-  The Scale~\cite{}, Maven~\cite{}, and Hwacha~\cite{} processors also
-  support configuration-dependent MVL.
-\end{commentary}
-
-\begin{commentary}
-  Previously, the specification imposed a minimum vector length (4) on
-  all configurations to allow stripmining code to be removed for short
-  vector lengths.  With the expanded scope of the vector unit types,
-  this would be too onerous to support, and so the requirement is removed.
-\end{commentary}
-
-\begin{discussion}
-  A separate mechanism for supporting fixed vector lengths should be
-  designed, possibly as part of an optional extension.
-\end{discussion}
-
-Any change to the vector configuration that might change MVL cause the
-entire vector unit state to be zeroed.  Any write to the global {\tt
-  vmaxew} causes the entire vector unit state to be zeroed, even if
-the value in {\tt vmaxew} is unchanged.
-
-If {\tt vmaxew} is non-zero, any write to an individual {\tt vew}$n$
-register that would set the width greater than {\tt vmaxew} raises an
-illegal instruction exception and leaves the vector unit state
-unchanged.
-
-If {\tt vmaxew} is non-zero, any write to an individual {\tt vew}$n$
-field with a value less than or equal to the value in {\tt vmaxew}
-only zeros the associated vector register {\tt v}$n$ and leaves other
-vector unit state unchanged.  The vector register data is zeroed even
-if {\tt vew}$n$ would be unchanged by the write.
-
-If {\tt vmaxew} is zero, then any write to an individual {\tt vew}$n$
-register zeros the associated {\tt v}$n$ vector register.  In addition,
-any write that changes the value in {\tt vew}$n$, zeros the entire vector
-unit state.
-
-\begin{commentary}
-  The state is zeroed to hide implementation-dependent bit mappings
-  and to provide additional security when context swapping.  Zero is
-  also a convenient initial value for some loops.
-
-  In-order implementations will probably use a flag bit per register to
-  mux in 0 instead of garbage values on each source until it is
-  overwritten.  For in-order machines, vector lengths less than MVL
-  complicate this zeroing, but these cases can be handled by adding a
-  zero bit per element or element group.  Machines with vector
-  register renaming can just initialize the rename table to point
-  entries at a physical zero register.
-\end{commentary}
-
-Each vector register can be reconfigured dynamically to hold different
-formats without zeroing the entire vector unit state provided that: if
-{\tt vmaxew} is zero, the bit-width of the new format is the same as
-the current {\tt vew}; or if {\tt vmaxew} is non-zero, the format does
-not require more than {\tt vmaxew} bits.  Any change to a vector
-register's format zeros the affected vector register.
-
-If a vector register is disabled, then any vector instruction
-that attempts to access that vector register will raise an
-illegal instruction exception.  Attempting to write any {\tt
-  vmaxew}$n$ with an unsupported value will raise an illegal
-instruction exception.
-
-\begin{commentary}
-  Vector registers have both a maximum element width and a
-  current element data type to allow the same vector register to
-  be changed to different types during execution provided the
-  maximum width is not exceeded.  This reduces register pressure and
-  helps support vector function calls, where the caller does not know
-  the types needed by the callee, as described below.
-\end{commentary}
-
-\begin{commentary}
-  The set of supported types might be greatly increased with future
-  extensions.  For example (and not limited to), new scalar types in
-  new number systems, a complex type with real and imaginary
-  components, a key-value type, or an application-specific structure
-  type with multiple constituent fields.  Auxiliary type
-  configuration state might be required in these cases.
-\end{commentary}
-
-Attempting to write an unsupported type or a type that requires more
-than the current {\tt vmaxew} width to a {\tt vetype} field will raise
-an illegal instruction exception.
-
-\begin{commentary}
-Implementations must still raise an exception for a {\tt vetype}$n$
-setting that is greater than the architectural {\tt vmaxew}$n$ width,
-even if they internally implement a larger physical {\tt vmaxew}$n$
-that could accommodate the {\tt vetype}$n$ request.
-\end{commentary}
-
-\begin{discussion}
-We can either have 1) implementations raise exceptions whenever
-illegal values are written to {\tt vmaxew} and {\tt vetype} fields
-(current design), 2) raise exceptions at use if config holds illegal
-values, or 3) make the fields WARL so they silently reduce to supported types
-with no exceptions.  Option 2 could complicate vector unit context
-switch code by having more cases to check, while Option 3 could make
-debugging more difficult by allowing code to run with reduced
-precision or incorrect types.
-\end{discussion}
-
-\begin{commentary}
-Three broad classes of implementation can be distinguished by how they
-handle {\tt vmaxew} settings.
-
-The simplest is {\em max-width-per-implementation} (MWPI), where the
-vector unit is organized in fixed ELEN-width physical lanes, and
-changes to {\tt vmaxew} settings simply cause portions of the
-physical registers and datapath to be disabled for operations narrower
-than ELEN bits.
-
-The next most complex implementation, {\em
-  max-width-per-configuration} (MWPC), uses the maximum width across
-all {\tt vmaxew} settings in a dynamic configuration to divide the
-physical register storage and datapaths.  For example, a MWPC machine
-with ELEN=64 might subdivide physical lanes into 32-bit datapaths if
-no {\tt vmaxew} setting is greater than 32.  Operations on
-sub-32-bit quantities would disable appropriate portions of the
-physical registers and functional units in each 32-bit lane.  Several
-early vector supercomputers, including the CDC
-Star-100~\cite{cdcstart100}, provided a similar facility to divide
-64-bit physical vector lanes into narrower 32-bit lanes.
-
-The most complex implementations are {\em max-width-per-register}
-(MWPR), which reduce wasted space in the physical register files by
-packing elements in each vector register according to the individual
-{\tt vmaxew} settings and which within one configuration can
-execute instructions with narrower datatypes at higher rates than for
-wider datatypes.  The Berkeley Hwacha vector
-engine~\cite{hwachatr,mixedprecision} is an example microarchitecture
-with this property.
-\end{commentary}
-
-\clearpage
-
-{\bf Following Sections are out-of-date.}
-
-\section{Vector Instruction Formats}
-
-\begin{commentary}
-  The instruction encoding is a work in progress.
-
-  An important design goal was that the base vector extension fit
-  within a few major opcodes of the 32-bit encoding.  It is envisioned
-  that future vector extensions will use 48-bit or 64-bit encodings to
-  increase both the opcode space and the set of architectural
-  registers.  The 64-bit vector encoding would support 256
-  architectural vector registers and orthogonal specification of a
-  predicate register in each instruction.
-\end{commentary}
-
-Vector arithmetic and vector memory instructions are encoded in new
-variants of the R-format, shown in Figure~\ref{fig:vinstformats}.
-Both new formats use one bit to hold a {\em vp} field, which usually
-controls the predicate register in use, either {\tt vp0} or {\tt vp1}.
-The VR4 form is used for fused multiply-add instructions.  The
-existing RISC-V instruction formats are used for other vector-related
-instructions, such as the vector configuration instructions.
-
-\vspace{-0.2in}
-\begin{figure}[h]
-\begin{center}
-\setlength{\tabcolsep}{4pt}
-\begin{tabular}{p{0.7in}@{}p{0.4in}@{}p{0.7in}@{}p{0.7in}@{}p{0.5in}@{}p{0.4in}@{}p{0.7in}@{}p{1in}l}
-\\
-\instbitrange{31}{27} &
-\instbitrange{26}{25} &
-\instbitrange{24}{20} &
-\instbitrange{19}{15} &
-\instbitrange{14}{13} &
-\instbit{12} &
-\instbitrange{11}{7} &
-\instbitrange{6}{0} \\
-\cline{1-8}
-\multicolumn{2}{|c|}{funct7} &
-\multicolumn{1}{c|}{rs2} &
-\multicolumn{1}{c|}{rs1} &
-\multicolumn{1}{c|}{funct2} &
-\multicolumn{1}{c|}{vp} &
-\multicolumn{1}{c|}{rd} &
-\multicolumn{1}{c|}{opcode} &
-VR-type \\
-\cline{1-8}
-\\
-\cline{1-8}
-\multicolumn{1}{|c|}{rs3} &
-\multicolumn{1}{c|}{fmt} &
-\multicolumn{1}{c|}{rs2} &
-\multicolumn{1}{c|}{rs1} &
-\multicolumn{1}{c|}{funct2} &
-\multicolumn{1}{c|}{vp} &
-\multicolumn{1}{c|}{rd} &
-\multicolumn{1}{c|}{opcode} &
-VR4-type \\
-\cline{1-8}
-\end{tabular}
-\end{center}
-\caption{New V extension instruction formats.  }
-\label{fig:vinstformats}
-\end{figure}
-
-Most vector instructions are available in both vector-vector and
-vector-scalar variants.  Vector-vector instructions take the first
-operand from the vector register specified by {\em rs1} and the second
-operand from the vector register specified by {\em rs2}.
-
-For vector-scalar operations, the {\em rs1} field specifies the scalar
-register to be accessed.  For most vector-scalar instructions, the
-type of the vector operand specified by {\em rs2} indicates whether
-the integer or floating-point scalar register file is accessed using
-the {\em rs1} register specifier.
-
-Some non-commutative vector-scalar instructions (such as sub) are
-provided in two forms, with the scalar value used as the second
-operand.
-
-\begin{commentary}
-  The {\em rs1} field is used to provide the scalar operand because in
-  the base encoding, whenever an instruction has a single scalar
-  source operand, it is encoded in the {\tt rs1} field.
-\end{commentary}
-
-\section{Polymorphic Vector Instructions}
-
-The vector extension uses a polymorphic instruction encoding where the
-opcode is combined with the types of the source and destination
-registers to determine the operation to be performed.  For example, an
-ADD opcode will perform a 32-bit integer vector-vector add if both
-vector source operands and the vector destination register are 32-bit
-integers, but will perform a 16-bit floating-point vector-vector
-operation if both vector source operands and the vector destination
-are 16-bit floats.
-
-The polymorphic encoding also naturally supports operations with mixed
-precisions on the input and output, and also supports extending the
-instruction set with new types without necessarily increasing the
-opcode space.
-
-Not all combinations of source and destination argument types need be
-supported.  The base vector extension mandates only that
-implementations provide a subset of combinations of types on inputs
-and outputs.  Table~\ref{tab:vtypemix} shows the general rules for
-integer and floating-point instructions, but the detailed instruction
-listing should be consulted for accurate information.
-
-\begin{table}
-  \centering
-  \begin{tabular}{|r|r|r|r|r|}
-    \hline
-    \multicolumn{1}{|c|}{Src1} &
-    \multicolumn{1}{c|}{Src2} &
-    \multicolumn{1}{c|}{Src3} &
-    \multicolumn{1}{c|}{Dest} &
-    \multicolumn{1}{c|}{Example} \\
-    \hline
-    \hline
-    \multicolumn{5}{|c|}{Integer vector-scalar}\\
-    \hline
-    XLEN &   X & - &  X & 64b + 32b $\rightarrow$ 32b \\
-    XLEN &   X & - & 2X & 64b + 8b  $\rightarrow$ 16b \\
-    \hline
-    \hline
-    \multicolumn{5}{|c|}{Integer vector-vector}\\
-    \hline
-      X &  X & - &   X & 32b + 32b $\rightarrow$ 32b \\
-      X &  X & - &  2X & 16b + 16b $\rightarrow$ 32b \\
-     2X &  X & - &  2X & 64b + 32b $\rightarrow$ 64b \\
-    \hline
-    \hline
-    \multicolumn{5}{|c|}{Floating-point vector-scalar}\\
-    \hline
-     F &  F & -  &  F &  64b + 64b $\rightarrow$ 64b \\
-     F &  F & F  &  F &  32b $\times$ 32b + 32b $\rightarrow$ 32b \\
-     F &  F & -  & 2F &  32b + 32b $\rightarrow$ 64b \\
-     F &  F & 2F & 2F &  32b $\times$ 32b + 64b $\rightarrow$ 64b \\
-    \hline
-    \hline
-    \multicolumn{5}{|c|}{Floating-point vector-vector}\\
-    \hline
-      F &  F  & - &   F & 32b + 32b $\rightarrow$ 32b \\
-      F &  F  & - &  2F & 16b + 16b $\rightarrow$ 32b \\
-     2F &  F  & - &  2F & 64b + 32b $\rightarrow$ 64b \\
-      F &  F & F  &  F &  64b $\times$ 64b + 64b $\rightarrow$ 64b \\
-      F &  F & 2F & 2F &  16b $\times$ 16b + 32b $\rightarrow$ 32b \\
-    \hline
-  \end{tabular}
-  \caption{General rules for supported types per instruction in base
-    vector extension.  X represents the number of bits in an integer
-    type and F represents the number of bits in a floating-point type.
-    Individual instruction types will provide more detailed listings.
-    Note that the type of a scalar floating-point operand can never be
-    different from that of the vector in Src2, hence the Src1=2F case
-    is missing from vector-scalar operations.}
-  \label{tab:vtypemix}
-\end{table}
-
-A general rule in the base vector instruction set is that the
-destination precision is never less than any source operand, except
-for explicit type-conversion instructions.  Another general rule is
-that the input operands can only be the same width or half the width
-of the destination operand except for the scalar operand in integer
-vector-scalar instructions, which is always XLEN wide.  Also, src2 is
-never larger than src1 or src3.
-
-Integer computations of mixed-precision values always align values by
-their LSB, and sign or zero-extends any smaller value according to its
-type.  The result is truncated to fit in the destination type.  Note a
-scalar integer value is already XLEN bits wide, and as wide as any
-possible integer vector value.
-
-Floating-point computations on mixed-precision values act as if the
-calculations are performed exactly then rounded once to the
-destination format.
-
-\section{Rapid Configuration Instructions}
-
-It can take several CSR instructions to set up the {\tt vcfg} and
-{\tt vnp} CSRs for a given configuration.  Specialized configuration
-instructions are provided to quickly set up common configurations in
-the {\tt vcfg} and {\tt vnp} CSRs.
-
-The {\tt vsetdcfg} instruction takes a scalar register value encoded as
-shown in Figure~\ref{fig:vcfg}, and returns the corresponding MVL in
-the destination register.  The {\tt vsetdcfg} and {\tt vsetdcfgi}
-instructions also clear the {\tt vnp} register, so no predicate
-registers are allocated.
-
-\begin{discussion}
-  For now, only a 32-bit value supporting up to three different vector
-  data types is supported by the {\tt vsetdcfg} instruction.  RV64 and
-  RV128 could support a larger number of types, though it's not clear if
-  the hardware cost (area, latency) to support a larger number of
-  different types is justified.
-\end{discussion}
-
-\begin{figure}[b]
-  \centering
-  \begin{tabular}{p{1cm}p{1cm}ccc|c|c|c|c|c|c|c|l}
-    \multicolumn{1}{c}{} &
-    \multicolumn{1}{c}{} &
-    \multicolumn{1}{c}{} &
-    \multicolumn{1}{c}{} & 
-    \multicolumn{1}{c}{} &
-    \multicolumn{1}{c}{} &
-    \multicolumn{1}{c}{} &
-    \multicolumn{1}{c}{} &
-    \multicolumn{1}{c}{} &
-    \multicolumn{1}{c}{mode} &
-    \multicolumn{1}{c}{} &
-    \multicolumn{1}{c}{} &  \\
-    \cline{6-12}
-    & & & & &
-    \tt type2 & \tt ntype2 &
-    \tt type1 & \tt ntype1 &
-    0 &
-    \tt type0 & \tt ntype0 &  \\
-    \cline{6-12}
-    \multicolumn{1}{c}{} &
-    \multicolumn{1}{c}{} &
-    \multicolumn{1}{c}{} &
-    \multicolumn{1}{c}{} & 
-    \multicolumn{1}{c}{} &
-    \multicolumn{1}{c}{5} &
-    \multicolumn{1}{c}{5} &
-    \multicolumn{1}{c}{5} &
-    \multicolumn{1}{c}{5} &
-    \multicolumn{1}{c}{2} &
-    \multicolumn{1}{c}{5} &
-    \multicolumn{1}{c}{5} &  \\
-    %% \cline{2-12}
-    %% & \multicolumn{1}{|c|}{0} & F128 &
-    %% \multicolumn{1}{c|}{type3} & \multicolumn{1}{c|}{\#type3} &
-    %% type2 & \#type2 & type1 & \#type1 & 0 & type0 & \#type0 & RV64 \\
-    %% \cline{2-12}
-    %% & & &
-    %% \multicolumn{1}{c}{} &
-    %% \multicolumn{1}{c}{24} &
-    %% \multicolumn{1}{c}{5} &
-    %% \multicolumn{1}{c}{5} &
-    %% \multicolumn{1}{c}{5} &
-    %% \multicolumn{1}{c}{5} &
-    %% \multicolumn{1}{c}{2} &
-    %% \multicolumn{1}{c}{5} &
-    %% \multicolumn{1}{c}{5} &  \\
-    %% \cline{1-12}
-    %% \multicolumn{1}{|c|}{0} & \multicolumn{1}{c|}{X128} &
-    %% \multicolumn{1}{c|}{F128} & I64 & F64 & F32 & F16 & I32 & I16 & I8 & RV128 \\
-    %% \cline{1-12}
-    %% \multicolumn{1}{c}{83} &
-    %% \multicolumn{1}{c}{5} &
-    %% \multicolumn{1}{c}{5} &
-    %% \multicolumn{1}{c}{5} &
-    %% \multicolumn{1}{c}{5} &
-    %% \multicolumn{1}{c}{2} &
-    %% \multicolumn{1}{c}{5} &
-    %% \multicolumn{1}{c}{5} &  \\
-  \end{tabular}
-  \caption{Format of the {\tt vsetdcfg} value.  The value contains
-    three pairs of a 5-bit type and a 5-bit number of registers
-    to create of that type. A value of 0 for the number of a type
-    indicates that 32 registers should be allocated.  A value of 0 for
-    the type indicates this pair should be skipped.  The types must be
-    of monotonically increasing size from type0 to type2. }
-  \label{fig:vcfg}
-\end{figure}
-
-The {\tt vsetdcfg} value specifies how many vector registers of each
-datatype are allocated, and is divided into a 2-bit mode field and
-pairs of 5-bit fields for each data type in the configuration.
-
-The 2-bit mode field indicates the configuration mode of the vector
-unit and is zero for the base vector extension.
-
-\begin{commentary}
-  The standard vector extension operating mode configures the vector
-  unit into some number of vector registers, each with some number of
-  elements of types supported by the scalar unit.
-
-  At least one alternative mode is planned, where the vector unit is
-  configured as some number of registers each holding a single large
-  element, e.g., 256 bits.  This would be the base for cryptographic
-  operations, or other coprocessors that operated on large structures.
-
-  Other modes can be used to reconfigure the vector unit register file
-  and functional units for other domain-specific purposes.
-\end{commentary}
-
-Each datatype pair contains a 5-bit {\tt type}$x$ value encoded as a
-{\tt vetype}$n$ value, and a 5-bit {\tt ntype}$x$ for the number of
-registers to allocate for that type. If the {\tt type0} field is
-non-zero, the {\tt vsetdcfg} instruction will configure the first {\tt
-  ntype0} vector data registers to have {\tt vetype}$n$ values of {\tt
-  type0} with {\tt vmaxew}$n$ values set accordingly as shown in
-Table~\ref{tab:vetype}.  If the {\tt type0} value is 0, the datatype
-pair is skipped.  If the {\tt type1} field is non-zero, then the next
-{\tt ntype1} vector registers are configured to be of the type given
-in {\tt type1}.  Similarly for the {\tt type2} pair.
-
-A value of zero in a {\tt type}$x$ field indicates this datatype pair
-should be ignored.  A value of zero in a {\tt ntype}$x$ field
-indicates 32 registers should be allocated for the corresponding type.
-
-\begin{commentary}
-Zero values are skipped to simplify setting a configuration with two
-different data types, where a single LUI instruction can set the upper
-20 bits leaving the low bits zero.
-
-A single 12-bit immediate value is sufficient to create a
-configuration with some number of vector registers with a single given
-datatype.
-
-A compressed C.LI with a zero-extended 5-bit immediate can create a
-configuration with 32 vector registers of a given datatype.
-\end{commentary}
-
-A corresponding {\tt vsetdcfgi} instruction takes a 12-bit immediate
-value to set the configuration instead of a scalar value, but
-otherwise is identical to the {\tt vsetdcfg} instruction.
-
-\begin{discussion}
-It is not clear how many immediate bits will be made available for the
-{\tt vsetdcfgi} instruction.  If encoding space is available for both
-12 immediate bits and a source register specifier, then {\tt
-  vsetdcfgi} can be defined to read the source register, OR in the
-bits in the immediate, then create a configuration.  In this case,
-there is no need for a separate {\tt vsetdcfg} instruction.
-\end{discussion}
-
-The configuration value given must result in a legal configuration or
-else an illegal instruction exception will be raised.
-
-If a zero argument is given to {\tt vsetdcfg} the vector unit will be
-disabled and the value 0 will be returned for MVL.  This instruction
-({\tt vsetdcfg x0, x0}) is given the assembly pseudo-code {\tt
-  vdisable}.
-
-Separate {\tt vsetpcfg} and {\tt vsetpcfgi} instructions are provided
-that write the source value to the {\tt vnp} register and return the
-new MVL.  These writes also clear the vector data registers, set all
-bits in the allocated predicate registers, and set {\tt vl}=MVL. A
-{\tt vsetpcfg} or {\tt vsetpcfgi} instruction can be used after a {\tt
-  vsetdcfg} to complete a reconfiguration of the vector unit.
-
-\begin{discussion}
-  If {\tt vnp} is made accessible as a separate CSR, the {\tt vsetpcfg}
-  and {\tt vsetpcfgi} instructions are less useful.  The only advantage
-  over a CSR instruction is that they return MVL, which is rarely
-  needed, and which can be obtained via the {\tt setvl} instruction.
-\end{discussion}
-
-\section{Vector-Type-Change Instructions}
-
-To quickly change the individual types of a vector register, {\tt
-  vetyperw} and {\tt vetyperwi} instructions are provided to change
-the type of the specified vector data register to the given scalar
-register value or 5-bit immediate value respectively, while returning
-the previous type in the destination scalar register.
-
-A vector convert instruction, described below, can simultaneously
-convert a source vector register into a new type, and set that type in
-the destination vector register.
-
-\section{Vector Length}
-
-The active vector length is held in the XLEN-bit WARL vector length
-CSR {\tt vl}, which can only hold values between 0 and MVL inclusive.
-Any writes to the configuration registers ({\tt vcfg}$x$ or {\tt
-  vnp}) cause {\tt vl} to be initialized with MVL. Changes to {\tt
-  vetype}$n$ via vector-type-change instructions do not affect {\tt
-  vl}.
-
-The active vector length is usually set via the {\tt setvl}
-instruction.  The source argument to the {\tt setvl} is the requested
-application vector length (AVL) as an unsigned XLEN-bit integer. The
-{\tt setvl} instruction calculates the value to assign to {\tt vl}
-according to Table~\ref{tab:vlcalc}.  The result of this calculation
-is also returned as the result of the {\tt setvl} instruction.
-
-\begin{commentary}
-Earlier drafts encoded {\tt setvl} using a modified CSRRW instruction
-whereas it is now encoded as a separate new instruction.
-\end{commentary}
-
-\begin{table}
-  \centering
-  \begin{tabular}{|c|c|}
-    \hline
-    AVL Value & {\tt vl} setting \\
-    \hline
-    AVL $\geq$ 2\,MVL & MVL \\
-    2\,MVL $>$ AVL $>$ MVL & $\lceil$AVL$/2\rceil$ \\
-    MVL $\geq$ AVL & AVL \\
-    \hline
-  \end{tabular}
-  \caption{Operation of {\tt setvl} instruction to set vector
-    length register {\tt vl} based on requested application vector
-    length (AVL) and current maximum vector length (MVL).}
-  \label{tab:vlcalc}
-\end{table}
-
-\begin{commentary}
-  The rules for setting the {\tt vl} register help keep vector
-  pipelines full over the last two iterations of a stripmined loop.
-  This version of the rules guarantees monotonically decreasing vector
-  lengths. 
-  Similar rules were previously used in Cray-designed machines~\cite{crayx1asm}.
-\end{commentary}
-
-\begin{discussion}
-  There are multiple possible rules for setting VL, and we could give
-  implementations freedom to use different VL setting rules.
-\end{discussion}
-
-\begin{commentary}
-  The idea of having implementation-defined vector length dates back
-  to at least the IBM 3090 Vector Facility~\cite{ibm370varch}, which
-  used a special ``Load Vector Count and Update'' (VLVCU) instruction
-  to control stripmine loops.  The {\tt setvl} instruction included
-  here is based on the simpler {\tt setvlr} instruction introduced by
-  Asanovi\'{c}~\cite{krstephd}.
-\end{commentary}
-
-The {\tt setvl} instruction is typically used at the start of every
-iteration of a stripmined loop to set the number of vector elements to
-operate on in the following loop iteration.  The current MVL can be
-obtained from a vector configuration instruction, or by performing a
-{\tt setvl} with a source argument that has all bits set (largest
-unsigned integer).
-
-When {\tt vl} is less than MVL, vector instructions will set all
-elements in the range [{\tt vl}:MAXVL-1] in the destination vector
-data register or destination vector predicate register to zero.
-
-\begin{commentary}
-  Requiring zeroing of elements past the current active vector length
-  simplifies the design of units with renamed vector data registers.
-  If the specification left destination elements unchanged, renaming
-  implementations would have to copy the tail of the old destination
-  register to the newly allocated destination register.
-  Alternatively, specifying the tail to be undefined will expose
-  implementation differences and possibly cause a security hole.
-
-  Implementations that do not support renaming will have to zero the
-  tail of a vector, but this can reuse the mechanism that is already
-  required to initialize all vector data registers to zero on
-  reconfiguration, for example, by having a zero bit on each element
-  or element group.
-\end{commentary}
-
-No element operations are performed for any vector instruction when
-{\tt vl}=0.
-
-\begin{commentary}
-  Two possible choices are to 1) require destination registers to be
-  completely zeroed when {\tt vl}=0, or 2) no changes to the
-  destination registers.  Option 2 is currently chosen as this will
-  prevent unnecessary work in some implementations, and option 1 does
-  not provide a clear advantage beyond seeming more consistent with
-  the {\tt vl}$>$0 case.
-\end{commentary}
-
-\begin{figure}[bt]
-  \centering
-\begin{verbatim}
-                 # Vector-vector 32-bit add loop.
-                 # a0 holds N
-                 # a1 holds pointer to result vector
-                 # a2 holds pointer to first source vector
-                 # a3 holds pointer to second source vector
-                 li t0, (2<<VNTYPE0|VREGF32)
-                 vsetdcfg t0     # Configure with two 32-bit float vectors
-
-          loop:  setvl t0, a0    # Set length, get how many elements in strip
-                 vld v0, a2      # Load first vector
-                 sll t1, t0, 2   # Multiply length by 4 to get bytes
-                 add a2, t1      # Bump pointer
-                 vld v1, a3      # Load second vector
-                 add a3, t1      # Bump pointer
-                 vadd v0, v1     # Add elements
-                 sub a0, t0      # Decrement elements completed
-                 vst  v0, a1     # Store result vector
-                 add a1, t1      # Bump pointer
-                 bnez a0, loop   # Any more?
-
-                 vdisable        # Turn off vector unit
-\end{verbatim}
-\caption{Example vector-vector add loop.}
-\label{fig:vvadd}
-\end{figure}
-
-\section{Predicated Execution}
-
-
-\begin{commentary}
-  The 32-bit base encoding does not leave room for a fully orthogonal
-  predicate register specifier.  A single bit is dedicated to the
-  predicate register specification, and is used to select between two
-  active predicate registers, {\tt vp0} or {\tt vp1}. An alternative
-  scheme would have used the bit to select between {\tt vp0} and
-  unpredicated (all elements active).  However, given the ease of
-  setting all predicate bits in a vector predicate register with a
-  single predicate instruction, the current scheme provides more
-  flexibility.
-
-  When there are no vector predicate registers enabled, {\tt vp0}
-  returns all set bits when read.  So, the assembler convention is to
-  assume {\tt vp0} as the predicate register when no predicate
-  register is explicitly given.  The assembler can support a strict
-  operands option to require the vector predicate register is
-  explicitly specified.
-\end{commentary}
-
-At element positions where the selected predicate register bit is
-zero, the corresponding vector element operation has no effect (does
-not change architectural state or generate exceptions), except to
-write a zero to the element position in the destination vector
-register.
-
-\begin{discussion}
-  The previous proposal (undisturb) left the destination vector
-  unchanged at element positions where the predicate bit is false,
-  whereas the current plan-of-record (zero) writes zero to the
-  destination where the predicate bit is false.
-
-  The advantage of the undisturb option is that it can require fewer
-  instructions and fewer architectural registers for many common code
-  sequences.  For in-order machines without register renaming, the
-  undisturb operation simply disables writes to the destination
-  elements, except for vector registers that have not been written
-  since configuration time. Typically an extra zero bit per vector
-  register or element group will be added to represent a zeroed
-  register instead of actually zeroing state at configuration time.
-  For predicated undisturb writes to these uninitialized registers,
-  the predicated false elements must be explicitly written with zeros
-  on each element group and the zero bit is then cleared down.
-  However, in a machine with vector register renaming, undisturb does
-  imply an additional read of the original destination register to
-  write the value into the new physical destination register when the
-  predicate is false.  This additional read port will often be cheaper
-  than in a scalar machine as vector machines often time-multiplex
-  read ports, and the additional read can be skipped when the
-  predicate registers are disabled ({\tt vnp}=0) or when the source is
-  known to be zero after configuration, but still adds complexity to a
-  design.
-
-  The advantage of the zero option is that a machine with vector
-  register renaming does not need to read the original destination
-  vector register and so a read port is saved.  The disadvantage of
-  the zero option is that more instructions and architectural
-  registers are required for common code sequences, and simpler
-  microarchitectures without register renaming are penalized by
-  requiring longer code sequences and greater register pressure.  In
-  particular, vector merge instructions are required to collect
-  results from two divergent control paths, and each vector merge has
-  to read two vector values and write a vector result.  Whether the
-  zero option saves total register file traffic in a register-renamed
-  microarchitecture depends on the ratio of a) internal temporary
-  writes, to b) writes creating values that are live out of each basic
-  block, and also to the frequency of control flow merges.
-
-  Overall, the zero option removes significant complexity from the
-  renamed machines while reducing efficiency somewhat for the
-  non-renamed machines, and is the current plan-of-record.
-\end{discussion}
-
-\section{Vector Load/Store Instructions}
-
-Three vector load/store addressing modes are supported: unit-stride,
-constant-stride, and indexed (scatter/gather).  Each addressing mode
-has a 7-bit unsigned immediate offset that is scaled by the element
-type.
-
-The unit-stride address mode takes a scalar base byte address, adds
-the scaled immediate, then generates a contiguous set of element
-addresses for loads or stores.
-
-\begin{commentary}
-  The primary use of immediates in unit-stride loads is to generate
-  overlapping unit-stride loads for convolution operations.
-\end{commentary}
-
-The constant-stride address mode takes a scalar base byte address, a
-stride value encoded in bytes, and adds a scaled immediate value.
-
-\begin{commentary}
-  The stride value is in bytes to allow a single stride register to be
-  used to support operations on arrays-of-structures, where not all
-  elements in each structure have the same size.  The immediate value
-  is still scaled by element size to increase reach, given that
-  element types will be naturally aligned.
-\end{commentary}
-
-The indexed address mode takes a scalar base byte address and a vector
-of byte offsets.  The scalar base address and the immediate value are
-added to each element of the offset vector to give a vector of addresses
-used in a scatter/gather.
-
-Indexed stores are provided in three types: unordered, ordered, and
-reverse-ordered.  The unordered indexed stores might update the same
-memory location from two different elements in an unspecified order.
-The ordered stores always update memory locations in increasing vector
-element order.  The reverse-ordered stores always update memory
-locations in decreasing vector element order.
-
-\begin{commentary}
-  The reverse-ordered stores support vectorization of software memory
-  disambiguation techniques.  A reverse-ordered store of element id
-  into a hash table indexed by a hash on a store access address,
-  followed by a read of the hash table using a load access address and
-  a comparison against the original element id, will indicate if
-  there's a potential RAW hazard with an earlier loop iteration.
-\end{commentary}
-
-\begin{discussion}
-  Not clear if there is sufficient realizable improvement for
-  supporting unordered stores over ordered stores.
-\end{discussion}
-
-Vector loads/stores have a simple memory model, where each vector
-load/store is observed to complete sequentially in program order only
-on the local hart, i.e., a vector load on a hart will observe all earlier
-vector stores on the same hart, and no later vector stores.
-
-Vector loads are available in a length-speculative form that writes
-predicate register {\tt vp1} in addition to the destination vector
-data register.  These instructions raise an illegal instruction
-exception if {\tt vp1} is not configured.  For elements that do not
-generate a permissions fault, the length-speculative vector loads
-operate as normal except to also clear the bit in {\tt vp1}.  If an
-element encounters a permission fault, a zero is written to the
-destination vector register element and the {\tt vp1} bit is set to a
-1.  Implementations may treat elements past the first faulting element
-as also causing a fault even if they might not cause a permissions
-fault when accessed alone.
-
-Once software determines the active vector length, it should check if
-any loads within the active vector length caused a fault, and in this
-case, generate a non-length-speculative load to trigger reporting of
-the error.
-
-\begin{commentary}
-  Length-speculative vector loads are required to vectorize while
-  loops with data-dependent exits (e.g., strlen).
-
-  The only faults ignored by the length-speculative vector loads are
-  ones that would have resulted in a permissions violation.  Page
-  faults and other virtualization-related faults should be handled
-  invisibly to the user thread by the execution environment.
-
-  A malicious program can use length-speculative vector loads to probe
-  accessible address space without fear of a fatal fault.
-\end{commentary}
-
-\section{Vector Register Gather}
-
-A vector register gather produces a new result data vector by gathering
-elements from one source data vector at the element locations
-specified by a second source index vector.  Data source and
-destination vector types must agree.  The index vector can have any
-integer type.  Legal element indices can range from 0 to current
-MAXVL.  Indices out of this range raise an illegal instruction
-exception.
-
-\begin{verbatim}
-  # vindices holds values from 0..MAXVL
-  vrgather  vdest, vsrc, vindices
-\end{verbatim}
-
-\section{Vector Slide}
-
-Reductions (and convolutions) are supported via a vector slide
-instruction that takes elements starting from the middle of one vector
-and places these at the beginning of a second vector register.  This
-supports a recursive-halving reduction approach for any binary
-associative operator.
-
-\begin{commentary}
-  A similar vector register extract instruction was added to the Cray
-  C90 after memory latency grew too large for the memory-memory
-  reductions used in earlier Crays.
-
-  The vector unit microarchitecture can be optimized for the
-  power-of-2 sized element offsets used for reductions.
-\end{commentary}
-
-
-\section{Fixed-Point Support}
-
-Clip instruction supports scaling, rounding, and clipping to
-destination type.  Rounding set by CSR fixed-point rounding mode
-(truncate, jam, round-up, round-nearest-even).  Clipping set by CSR
-clip mode (wrap, saturate).
-
-Add with average, rounding set by rounding mode.
-
-Multiply with same size source and destination types, with some result
-scaling values (+1, 0, -1, -8?) and rounding and clipping according to
-CSR mode.
-
-Accumulate with carry into predicate register to support larger
-precise dot-products.
-
-\section{Optional Transcendental Support}
-
-\section{Instruction-Set Encoding}
-
-\note{This section is out of date.}
 
-On the next two pages is a proposed instruction-set encoding.
 
-\input{v-instr-table}