-rw-r--r--  src/p.tex             93
-rw-r--r--  src/preface.tex       13
-rw-r--r--  src/riscv-spec.tex     4
-rw-r--r--  src/v.tex           1401
4 files changed, 21 insertions, 1490 deletions
diff --git a/src/p.tex b/src/p.tex
index 074e475..dac4e4f 100644
--- a/src/p.tex
+++ b/src/p.tex
@@ -1,5 +1,5 @@
\chapter{``P'' Standard Extension for Packed-SIMD Instructions,
- Version 0.1}
+ Version 0.2}
\label{sec:packedsimd}
\begin{commentary}
@@ -8,94 +8,7 @@
standardizing on the V extension for large floating-point SIMD
operations. However, there was interest in packed-SIMD fixed-point
operations for use in the integer registers of small RISC-V
- implementations.
+ implementations. A task group is working to define the new P
+ extension.
\end{commentary}
-In this chapter, we outline a standard packed-SIMD extension for
-RISC-V. We've reserved the instruction-set extension name ``P'' for a future
-standard set of packed-SIMD extensions. Many other extensions can
-build upon a packed-SIMD extension, taking advantage of the wide data
-registers and datapaths separate from the integer unit.
-
-\begin{commentary}
-Packed-SIMD extensions, first introduced with the Lincoln Labs TX-2~\cite{tx2},
-have become a popular way to provide higher throughput on data-parallel
-codes. Earlier commercial microprocessor implementations include the
-Intel i860, HP PA-RISC MAX~\cite{lee-max-ieeemicro1996}, SPARC
-VIS~\cite{tremblay-vis-ieeemicro1996}, MIPS
-MDMX~\cite{gwennap-mdmx-mpr1996}, PowerPC
-AltiVec~\cite{diefendorff-altivec-ieeemicro2000}, Intel x86
-MMX/SSE~\cite{peleg-mmx-ieeemicro1996, raman-sse-ieeemicro2000}, while
-recent designs include Intel x86 AVX~\cite{lomont-avx-irm2011} and ARM
-Neon~\cite{goodacre-armisa-computer2005}. We describe a standard
-framework for adding packed-SIMD in this chapter, but are not actively
-working on such a design. In our opinion, packed-SIMD designs represent
-a reasonable design point when reusing existing wide datapath resources,
-but if significant additional resources are to be devoted to
-data-parallel execution then designs based on traditional vector
-architectures are a better choice and should use the V extension.
-
-\end{commentary}
-
-A RISC-V packed-SIMD extension reuses the floating-point registers
-({\tt f0}-{\tt f31}). These registers can be defined to have widths
-of FLEN=32 to FLEN=1024. The standard floating-point instruction-set
-extensions require registers of width 32 bits (``F''), 64 bits (``D''),
-or 128 bits (``Q'').
-
-\begin{commentary}
-It is natural to use the floating-point registers for packed-SIMD
-values rather than the integer registers (PA-RISC and Alpha
-packed-SIMD extensions) as this frees the integer registers for
-control and address values, simplifies reuse of scalar floating-point
-units for SIMD floating-point execution, and leads naturally to a
-decoupled integer/floating-point hardware design. The floating-point
-load and store instruction encodings also have space to handle wider
-packed-SIMD registers. However, reusing the floating-point registers
-for packed-SIMD values does make it more difficult to use a recoded
-internal format for floating-point values.
-\end{commentary}
-
-The existing floating-point load and store instructions are used to
-load and store various-sized words from memory to the {\tt f}
-registers. The base ISA supports 32-bit and 64-bit loads and stores,
-but the LOAD-FP and STORE-FP instruction encodings allows 8 different
-widths to be encoded as shown in Table~\ref{psimdwidth}. When used
-with packed-SIMD operations, it is desirable to support non-naturally
-aligned loads and stores in hardware.
-
-\begin{table}[htp]
-\begin{center}
-\begin{tabular}{|c|l|r|}
-\hline
-{\em width} field &
-Code &
-Size in bits\\
-\hline
-000 & B & 8 \\
-001 & H & 16 \\
-010 & W & 32 \\
-011 & D & 64 \\
-100 & Q & 128 \\
-101 & Q2 & 256 \\
-110 & Q4 & 512 \\
-111 & Q8 & 1024 \\
-\hline
-\end{tabular}
-\end{center}
-\caption{LOAD-FP and STORE-FP width encoding.}
-\label{psimdwidth}
-\end{table}
-
-Packed-SIMD computational instructions operate on packed values in
-{\tt f} registers. Each value can be 8-bit, 16-bit, 32-bit, 64-bit,
-or 128-bit, and both integer and floating-point representations can be
-supported. For example, a 64-bit packed-SIMD extension can treat each
-register as 1$\times$64-bit, 2$\times$32-bit, 4$\times$16-bit, or
-8$\times$8-bit packed values.
-
-\begin{commentary}
-Simple packed-SIMD extensions might fit in unused 32-bit instruction
-opcodes, but more extensive packed-SIMD extensions will likely require
-a dedicated 30-bit instruction space.
-\end{commentary}
diff --git a/src/preface.tex b/src/preface.tex
index 69615b0..535c78a 100644
--- a/src/preface.tex
+++ b/src/preface.tex
@@ -31,7 +31,7 @@ modules:
\bf Zifencei & \bf 2.0 & \bf Ratification \\
\bf Zicsr & \bf 2.0 & \bf Ratification \\
\bf M & \bf 2.0 & \bf Ratification \\
- \bf A & \bf 2.0 & \bf Ratification \\
+ \em A & \em 2.0 & Frozen \\
\bf F & \bf 2.2 & \bf Ratification \\
\bf D & \bf 2.2 & \bf Ratification \\
\bf Q & \bf 2.2 & \bf Ratification \\
@@ -42,8 +42,8 @@ modules:
\em B & \em 0.0 & \em Draft \\
\em J & \em 0.0 & \em Draft \\
\em T & \em 0.0 & \em Draft \\
- \em P & \em 0.1 & \em Draft \\
- \em V & \em 0.4 & \em Draft \\
+ \em P & \em 0.2 & \em Draft \\
+ \em V & \em 0.7 & \em Draft \\
\em N & \em 1.1 & \em Draft \\
\em Zam & \em 0.1 & \em Draft \\
\hline
@@ -56,6 +56,7 @@ The changes in this version of the document include:
\begin{itemize}
\parskip 0pt
\itemsep 1pt
+\item Removed the A extension from ratification.
\item Changed document version scheme to avoid confusion with versions
of the ISA modules.
\item Incremented the version numbers of the base integer ISA to 2.1,
@@ -109,6 +110,10 @@ The changes in this version of the document include:
\item Improvements to the description and commentary.
\item Defined the term IALIGN as shorthand to describe the instruction-address
alignment constraint.
+\item Removed text of P extension chapter as now superceded by active task
+ group documents.
+\item Removed text of V extension chapter as now superceded by separate vector
+ extension draft document.
\end{itemize}
\section*{Preface to Document Version 2.2}
@@ -140,7 +145,7 @@ versions of the RISC-V ISA modules:
J & 0.0 & N \\
T & 0.0 & N \\
P & 0.1 & N \\
- V & 0.2 & N \\
+ V & 0.7 & N \\
N & 1.1 & N \\
\hline
\end{tabular}
diff --git a/src/riscv-spec.tex b/src/riscv-spec.tex
index 1ec7ccd..17bed0a 100644
--- a/src/riscv-spec.tex
+++ b/src/riscv-spec.tex
@@ -6,8 +6,8 @@
\input{preamble}
-\newcommand{\specrev}{\mbox{20181221-Public-Review-{\em draft}}}
-\newcommand{\specmonthyear}{\mbox{December 2018}}
+\newcommand{\specrev}{\mbox{20190305-Base-Ratification}}
+\newcommand{\specmonthyear}{\mbox{March 2019}}
\begin{document}
diff --git a/src/v.tex b/src/v.tex
index c5c277d..e367b99 100644
--- a/src/v.tex
+++ b/src/v.tex
@@ -1,1402 +1,15 @@
-\chapter{``V'' Standard Extension for Vector Operations, Version 0.4-DRAFT}
+\chapter{``V'' Standard Extension for Vector Operations, Version 0.7}
\label{sec:vector}
-{\bf This version is out-of-date with respect to the current working
- group draft, which is now hosted on {\tt https://github.com/riscv/riscv-v-spec}.}
-
-This chapter presents a proposal for the RISC-V base vector
-instruction-set extension. The base vector extension is intended to
-provide general support for data-parallel execution within the 32-bit
-instruction encoding space, with later vector extensions supporting
-richer functionality for certain domains.
-
-\begin{commentary}
-The vector extension is based on the style of vector register
-architecture introduced by Seymour Cray in the 1970s, as opposed to
-the earlier packed SIMD approach, introduced with the Lincoln Labs
-TX-2 in 1957 and now adopted by most other commercial instruction
-sets.
-\end{commentary}
-
-The base vector extension defines the components that must be included
-when the ``V'' bit is set in the {\tt misa} register, and consequently
-those that will be assumed to exist by software written for an ABI
-specifying V.
-
-\begin{commentary}
- This draft version of the chapter includes additional specifications
- of proposed extensions to the base vector extension to explain some
- of the encoding choices made for the base.
-\end{commentary}
-
-The vector extension supports a configurable vector unit, to enable
-implementations to tradeoff the number of active architectural vector
-registers and supported element widths against available maximum
-vector length. The vector extension is designed to allow the same
-binary code to work efficiently across a variety of hardware
-implementations varying in physical vector storage capacity and
-datapath spatial and/or temporal parallelism.
-
-\begin{commentary}
-The vector instruction set contains many features developed in earlier
-research projects, including the Berkeley T0~\cite{} and VIRAM~\cite{VIRAM}
-vector microprocessors, the MIT Scale vector-thread processor~\cite{},
-and the Berkeley Maven~\cite{} and Hwacha~\cite{} projects.
-\end{commentary}
-
-\section{Vector Unit State}
-
-The additional vector unit architectural state includes 32 vector
-registers ({\tt v0}--{\tt v31}), and an XLEN-bit WARL vector length
-CSR, {\tt vl}. Each vector register {\tt v}$n$ has an associated
-16-bit configuration field {\tt vtype}$n$ described below. A 6-bit
-global maximum element width register {\tt vmaxew} defines the maximum
-number of bits of storage in every element of every active vector
-register.
-
-\begin{commentary}
- Future vector extensions using wider instruction encodings can
- support more architectural vector registers. For example, 256
- architectural vector registers in a 64-bit instruction encoding.
-\end{commentary}
-
-\begin{commentary}
- Future 2D shape extensions add two more vector length registers,
- {\tt vm} and {\tt vn}.
-\end{commentary}
-
-There is also a 3-bit fixed-point rounding mode CSR {\tt vxrm}, and a
-single-bit fixed-point saturation status CSR {\tt vxsat}. The {\tt
- vcs} CSR alias provides combined access to the {\tt vl}, {\tt vxrm},
-{\tt vxsat} fields to reduce context switch time. The {\tt vcs}
-register also includes a configuration mode field to support future
-extended configuration modes.
-
-\begin{discussion}
-The components of vcs might not need separate CSR addresses,
-depending on how they're accessed via other non-CSR instructions.
-\end{discussion}
-
-\section{Vector Unit Type Configuration Register ({\tt vtype}$n$)}
-
-The vector unit must be configured before use. Each architectural
-vector register, {\tt v}$n$, is configured via 16 bits of vector type
-configuration state {\tt vtype}$n$, which can be accessed via vector
-configuration ({\tt vcfg}) CSRs and other rapid vector configuration
-instructions as described below. The vector register type
-configuration encodes the overall organization, or {\em shape}, of the
-elements in each vector register (e.g., scalar versus 1-D vector), as
-well as the bitwidth and numeric representation of each element. As
-shown in Figure~\ref{fig:vtype}, the 16-bit {\tt vtype}$n$ encoding is
-divided into a 5-bit current shape field {\tt vshape}$n$, a 5-bit
-representation field {\tt verep}$n$, and a 6-bit element bit-width
-field {\tt vew}$n$\, held in the {\tt vcfg}$x$ CSRs. The combination
-of an element numeric representation and an element bitwidth is called
-an element {\em format}. Each vector register can also be disabled to
-free physical vector storage for other architectural vector registers.
-
-\begin{figure}[htb]
-\begin{center}
-\begin{tabular}{O@{}O@{}O}
-\\
-\instbitrange{15}{11} &
-\instbitrange{10}{6} &
-\instbitrange{5}{0} \\
-\hline
-\multicolumn{1}{|c|}{{\tt vshape}$n$} &
-\multicolumn{1}{c|}{{\tt verep}$n$} &
-\multicolumn{1}{c|}{{\tt vew}$n$} \\
-\hline
-5 & 5 & 6 \\
-\end{tabular}
-\end{center}
-\caption{Location of subfields within a single {\tt vtype}$n$ field.}
-\label{fig:vtype}
-\end{figure}
-
-\begin{commentary}
- It was also common in earlier vector machines to support multiple
- precisions within the vector datapath. In particular, the CDC
- STAR-100~\cite{cdcstar100} supported single-precision and
- double-precision floating-point operations and also bit, byte, and
- nibble operations in the vector unit; TI ASC~\cite{tiasc} designs
- supported dividing 64-bit vector lanes into two 32-bit lanes for
- double throughput.
-\end{commentary}
-
-\clearpage
-
-\section{Shape Encoding}
-
-The 5-bit shape field describes the structure of the elements within
-the vector register. In the base vector extension, the shape can be
-set to either scalar or vector.
-
-\begin{table}[hbt]
- \centering
- \begin{tabular}{|c|l|}
- \hline
- {\tt vshape} & Shape \\
- \hline
- 00000 & scalar \\
- 00100 & 1-D vector, length controlled by {\tt vl} \\
- \hline
- \multicolumn{2}{|c|}{All other encodings reserved}\\
- \hline
- \end{tabular}
- \caption{Base vector encoding of {\tt vshape}$n$ field.}
- \label{tab:vshape}
-\end{table}
+The current working group draft is hosted at {\tt
+ https://github.com/riscv/riscv-v-spec}.
\begin{commentary}
- For the base vector ISA, only a single bit is required in each {\tt
- vshape} field to select between scalar and 1-D vector elements
- with the other bits hardwired to zero.
+The base vector extension is intended to provide general support for
+data-parallel execution within the 32-bit instruction encoding space,
+with later vector extensions supporting richer functionality for
+certain domains.
\end{commentary}
-\begin{table}[hbt]
- \centering
- \begin{tabular}{|c|l|}
- \hline
- {\tt vshape} & Shape \\
- \hline
- 00000 & scalar \\
- 00001 & {\em Reserved} \\
- 0001x & {\em Reserved} \\
- \hline
- 00100 & 1-D vector {\tt vl} \\
- 01000 & 1-D vector {\tt vm} \\
- 01100 & 1-D vector {\tt vn} \\
- \hline
- 00101 & 2-D matrix {\tt vl} x {\tt vl} \\
- 00110 & 2-D matrix {\tt vl} x {\tt vm} \\
- 00111 & 2-D matrix {\tt vl} x {\tt vn} \\
- \hline
- 01001 & 2-D matrix {\tt vm} x {\tt vl} \\
- 01010 & 2-D matrix {\tt vm} x {\tt vm} \\
- 01011 & 2-D matrix {\tt vm} x {\tt vn} \\
- \hline
- 01101 & 2-D matrix {\tt vn} x {\tt vl} \\
- 01110 & 2-D matrix {\tt vn} x {\tt vm} \\
- 01111 & 2-D matrix {\tt vn} x {\tt vn} \\
- \hline
- 1xxxx & {\em Reserved}/{\em Custom} \\
- \hline
- \end{tabular}
- \caption{Extended encoding of per-vector-register {\tt vshape} field.}
- \label{tab:extvshape}
-\end{table}
-
-\begin{commentary}
- A sketch of the proposed encodings for the 2D shape extension is
- shown in the Table.
-\end{commentary}
-
-\clearpage
-
-\section{Representation Encoding}
-
-The 5-bit {\tt verep}$n$ register sets the numeric representation of
-each element of the vector register. In the base vector
-extension, the representation can be set to unsigned integer,
-two's-complement signed integer, or floating-point. The
-floating-point representations follow the IEEE 754 standards.
-
-\begin{table}[hbtp]
- \centering
- \begin{tabular}{|c|l|}
- \hline
- {\tt verep} & Representation \\
- \hline
- 00000 & Unsigned integer \\
- 00001 & Two's-complement signed integer \\
- 00010 & {\em Reserved (unsigned floating-point?)}\\
- 00011 & IEEE-754 floating-point \\
- \hline
- \multicolumn{2}{|c|}{All other encodings reserved}\\
- \hline
- \end{tabular}
- \caption{Base vector representation encoding.}
- \label{tab:verep}
-\end{table}
-
-\begin{table}[hbtp]
- \centering
- \begin{tabular}{|c|l|}
- \hline
- {\tt verep} & Representation \\
- \hline
- 00000 & Unsigned integer \\
- 00001 & Two's-complement signed integer \\
- 00010 & {\em Reserved (unsigned floating-point)}\\
- 00011 & IEEE-754 floating-point \\
- \hline
- 001x0 & {\em Reserved} \\
- 00101 & Complex signed integer \\
- 00111 & Complex floating-point \\
- \hline
- 01000 & Prime Galois field - integer representation \\
- 01001 & Prime Galois field - Montgomery representation \\
- 01100 & Binary extension Galois field - polynomial basis \\
- 01101 & Binary extension Galois field - normal basis \\
- \hline
- 01010 & UNORM \\
- 01011 & SNORM \\
- 01110 & {\em Reserved} \\
- 01111 & {\em Reserved (complex SNORM?)} \\
- \hline
- 10xxx & Custom representations \\
- \hline
- 11xxx & {\em Reserved} \\
- \hline
- \end{tabular}
- \caption{Extended vector representation encoding.}
- \label{tab:extverep}
-\end{table}
-
-\begin{commentary}
- The complex representations split the element width given in {\tt
- vew}$n$ into two equal-sized real and imaginary fields, so an
- element width of 64 bits can hold a single complex value with a
- 32-bit real and a 32-bit imaginary component.
-\end{commentary}
-
-\clearpage
-
-\section{Element Bitwidth}
-
-Each vector register, {\tt v}$n$, has a 6-bit element width
-register, {\tt vew}$n$, to specify the number of bits for each element
-of the current type in the vector register.
-
-The largest element width supported is
-termed ELEN, and is defined to be the larger of the supported integer
-and floating-point type widths:
-\[ \mbox{\em ELEN} = max(\mbox{\em XLEN}, \mbox{\em FLEN}) \]
-For the base vector ISA, the bit width can be set at any power of two
-between 8 and ELEN.
-
-\begin{table}[hbt]
- \centering
- \begin{tabular}{|c|r|l|}
- \hline
- {\tt vew} & Width & Required in Base \\
- \hline
- 000 000 & disabled & All \\
- 001 000 & 8 & All \\
- 010 000 & 16 & All \\
- 011 000 & 32 & All \\
- 100 000 & 64 & RV32D, RV64, RV128\\
- 101 000 & 128 & RV64Q, RV128\\
- \hline
- \multicolumn{3}{|c|}{All other encodings reserved.}\\
- \hline
- \end{tabular}
- \caption{Base vector ISA encoding of vector element width ({\tt
- vew}$n$) register fields.}
- \label{tab:basevew}
-\end{table}
-
-\begin{table}[hbtp]
- \centering
- \begin{tabular}{|c|r|}
- \hline
- {\tt vew} & Width \\
- \hline
- 000 000 & disabled \\
- 000 001 & 1 \\
- 000 xxx & \multicolumn{1}{r|}{steps of 1}\\
- 000 111 & 7 \\
- \hline
- 001 000 & 8 \\
- 001 xxx & \multicolumn{1}{r|}{steps of 1}\\
- 001 111 & 15 \\
- \hline
- 010 000 & 16 \\
- 010 xxx & \multicolumn{1}{r|}{steps of 2}\\
- 010 111 & 30 \\
- \hline
- 011 000 & 32 \\
- 011 xxx & \multicolumn{1}{r|}{steps of 4}\\
- 011 111 & 60 \\
- \hline
- 100 000 & 64 \\
- 100 xxx & \multicolumn{1}{r|}{steps of 8}\\
- 100 111 & 120 \\
- \hline
- 101 xxx & reserved \\
- \hline
- 110 000 & 128 \\
- 110 001 & 192 \\
- 110 010 & 2048 \\
- 110 011 & 3072 \\
- 110 100 & 512 \\
- 110 101 & 768 \\
- 110 110 & 8192 \\
- 110 111 & 12288 \\
- \hline
- 111 000 & 256 \\
- 111 001 & 384 \\
- 111 010 & 4096 \\
- 111 011 & 6144 \\
- 111 100 & 1024 \\
- 111 101 & 1536 \\
- 111 110 & 16384 \\
- 111 111 & 24576 \\
- \hline
- \end{tabular}
-
- \caption{Proposed extended encoding of vector element width ({\tt
- vew}$n$) register fields. Every bit width between 1 and 16 can
- be supported. Bit widths in steps of 2 between 16 to 32 (i.e.,
- 16, 18, 20, ...). Bit widths in steps of 4 between 32 to 64
- (i.e., 32, 36, 40, ...). Bit widths in steps of 8 between 64 and
- 128 (i.e., 64, 72, 80,...). For bit widths greater than 128, all
- powers-of-two up to 16384 and all widths 1.5$\times$ greater are
- supported (128, 384, 512, 768,...). }
- \label{tab:extvew}
-\end{table}
-
-\begin{commentary}
- The extended bit-width encoding is designed to minimize the number
- of state bits required to support useful subsets of widths. For
- example, an RV32 system only needs two bits of state per {\tt
- vew}$n$ field to represent {\em disabled}, 8, 16, and 32. An
- RV32 system with 3 bits of state can represent {\em disabled}, 4,
- 8, 12, 16, 24, 32, and 48. An RV64 system with 4 bits of state
- can represent {\em disabled}, 4, 8, 12, 16, 24, 32, 48, 64, 96,
- 128, 256, 512, 1024.
-\end{commentary}
-
-\clearpage
-
-\section{Base Vector Extension Supported Types}
-
-The types supported by the base V extension depend upon the base
-scalar ISA and supported extensions. When the base V extension is
-added to a base scalar ISA, it must support the vector data element
-types implied by the supported scalar types as defined by
-Table~\ref{tab:velemtypes}.
-
-\begin{table}[hbt]
- \centering
-\begin{tabular}{|l|l|}
- \hline
- \multicolumn{2}{|c|}{Supported Fixed-Point Formats} \\
- \hline
- RV32I & I8, U8, I16, U16, I32, U32 \\
- RV64I & I8, U8, I16, U16, I32, U32, I64, U64 \\
- RV128I & I8, U8, I16, U16, I32, U32, I64, U64, I128, U128 \\
- \hline
- \hline
- \multicolumn{2}{|c|}{Supported Floating-Point Formats} \\
- \hline
- F & F16, F32 \\
- FD & F16, F32, F64 \\
- FDQ & F16, F32, F64, F128 \\
- \hline
-\end{tabular}
-\caption{Supported data element formats depending on base integer ISA
- and supported floating-point extensions. I$x$ indicates a signed
- integer of $x$ bits, U$x$ indicates an unsigned integer of $x$ bits,
- and F$x$ indicates an IEEE floating-point number of $x$ bits.}
-\label{tab:velemtypes}
-\end{table}
-
-\begin{commentary}
- Future vector extensions might expand the set of supported
- datatypes, including custom application-specific datatypes.
-\end{commentary}
-
-\clearpage
-
-\section{Maximum Vector Element Width ({\tt vmaxew})}
-
-The global {\tt vmaxew} field is used to support more complex vector
-runtime environments where the types to be held in each register of a
-single configuration may vary dynamically, and may not even be known
-at compile time due to separate compilation.
-
-The global maximum element width register {\tt vmaxew} defines the
-maximum number of bits of storage in every element of every active
-architectural register, or if zero, defers to the per-vector-register
-width field.
-
-\begin{commentary}
- The VIRAM processor had a virtual processor width
- register similar to {\tt vmaxew}~\cite{VIRAM}.
-\end{commentary}
-
-If {\tt vmaxew} is zero, then the per-element vector element widths
-{\tt vew}$n$ determine the minimum storage required for each element
-of the associated vector register {\tt v}$n$.
-
-If {\tt vmaxew} is non-zero, it sets the largest element width that
-can be supported in any vector register element in the current
-configuration.
-
-\clearpage
-
-\section{Vector Configuration Registers ({\tt vcfg0}--{\tt vcfg15})}
-
-The vector type configuration requires 512 bits of state (32 vector
-registers each with 16-bit {\tt vtype}$n$ field) that can be accessed
-via the {\tt vcfg CSRs}.
-
-RV128 uses four vector configuration CSRs: {\tt vcfg0} holds
-configuration data for {\tt v0}--{\tt v7} with bits $16n$ to $16n+15$
-holding {\tt vtype}$n$, while {\tt vcfg4}, {\tt vcfg8} and {\tt
- vcfg12} similarly holds configuration data for {\tt v8}--{\tt v15},
- {\tt v16}--{\tt v23}, and {\tt v24}--{\tt v31} respectively.
-
-In RV64, the {\tt vcfg2} CSR provides access to the upper 64 bits of {\tt
- vcfg0} and {\tt vcfg6} provides access to the upper 64 bits of
-{\tt vcfg4}. In RV32, the {\tt vcfg1}, {\tt vcfg3}, {\tt vcfg5}
-and {\tt vcfg7} CSRs provides access to the upper bits of {\tt
- vcfg0}, {\tt vcfg2}, {\tt vcfg4} and {\tt vcfg6} respectively.
-
-Any CSR write to a {\tt vcfg}$x$ register zeros all {\tt vcfg}$y$
-registers, for $y>x$. As a result configuration data should be
-written from the {\tt vcfg0} CSR upwards.
-
-\begin{commentary}
- Zeroing higher-numbered {\tt vcfg}$y$ registers allows more rapid
- reconfiguration of the vector register file via CSR writes, and
- provides backward-compatibility for extensions that increase the
- number of possible architectural vector registers. This choice does
- prevent the use of CSRRW instructions to swap the configuration
- context; an entire old configuration must be read out before a new
- configuration is written in.
-\end{commentary}
-
-Additional instructions are provided to support more rapid changes to
-the vector unit configuration as described below.
-
-\section{Legal Vector Unit Configurations}
-
-To simplify hardware configuration calculations and to reduce software
-context-switch complexity, vector unit configurations are constrained
-to have non-disabled architectural vector registers numbered
-contiguously starting at {\tt v0}. An exception will be raised if an
-instruction tries to change {\tt vtype}$n$ in a way that violates this
-constraint.
-
-\begin{commentary}
- During a software vector-context save, the software handler can stop
- searching for active architectural registers after encountering the
- first disabled vector register. Hardware to calculate physical
- register allocation is also simplified with this constraint.
-\end{commentary}
-
-\clearpage
-
-\section{Vector Unit CSRs}
-
-\begin{table}[hbt]
- \centering
- \begin{tabular}{|l|c|l|l|}
- \hline
- CSR name & Number & Base ISA & Description\\
- \hline
- {\tt vcs} & TBD & RV32, RV64, RV128 & Vector control-status register\\
- {\tt vl} & TBD & RV32, RV64, RV128 & Active vector length\\
- {\tt vxrm} & TBD & RV32, RV64, RV128 & Vector fixed-point rounding mode\\
- {\tt vxsat} & TBD & RV32, RV64, RV128 & Vector fixed-point
- saturation flag \\
- {\tt vmaxew} & TBD & RV32, RV64, RV128 & Global maximum vector element width \\
- \hline
- {\tt vcfg0} & TBD & RV32, RV64, RV128 & \multirow{16}{*}{Vector
- register configuration}\\
- {\tt vcfg1} & TBD & RV32 &\\
- {\tt vcfg2} & TBD & RV32, RV64 &\\
- {\tt vcfg3} & TBD & RV32 &\\
- {\tt vcfg4} & TBD & RV32, RV64, RV128 &\\
- {\tt vcfg5} & TBD & RV32 &\\
- {\tt vcfg6} & TBD & RV32, RV64 &\\
- {\tt vcfg7} & TBD & RV32 &\\
- {\tt vcfg8} & TBD & RV32, RV64, RV128 & \\
- {\tt vcfg9} & TBD & RV32 &\\
- {\tt vcfg10} & TBD & RV32, RV64 &\\
- {\tt vcfg11} & TBD & RV32 &\\
- {\tt vcfg12} & TBD & RV32, RV64, RV128 &\\
- {\tt vcfg13} & TBD & RV32 &\\
- {\tt vcfg14} & TBD & RV32, RV64 &\\
- {\tt vcfg15} & TBD & RV32 &\\
- \hline
- \end{tabular}
- \caption{Vector extension CSRs.}
- \label{tab:vcsrs}
-\end{table}
-
-\clearpage
-
-\section{Maximum Vector Length (MVL)}
-
-The implementation determines an available {\em maximum vector length}
-(MVL) dependent on the current vector type configuration held in {\tt
- vcfg}$x$ and {\tt vmaxew}. The available MVL depends on the
-configuration setting and on the implementation's microarchitecture,
-but MVL must always have the same value for the same configuration
-parameters on a given hart.
-
-\begin{commentary}
- Several earlier vector machines had the ability to configure
- physical vector register storage into a larger number of short
- vectors or a shorter number of long vectors. In particular the
- Fujitsu VP series~\cite{vp200} supported combining power-of-2 base
- vector registers into longer vector registers.
-
- The Scale~\cite{}, Maven~\cite{}, and Hwacha~\cite{} processors also
- support configuration-dependent MVL.
-\end{commentary}
-
-\begin{commentary}
- Previously, the specification imposed a minimum vector length (4) on
- all configurations to allow stripmining code to be removed for short
- vector lengths. With the expanded scope of the vector unit types,
- this would be too onerous to support, and so the requirement is removed.
-\end{commentary}
-
-\begin{discussion}
- A separate mechanism for supporting fixed vector lengths should be
- designed, possibly as part of an optional extension.
-\end{discussion}
-
-Any change to the vector configuration that might change MVL cause the
-entire vector unit state to be zeroed. Any write to the global {\tt
- vmaxew} causes the entire vector unit state to be zeroed, even if
-the value in {\tt vmaxew} is unchanged.
-
-If {\tt vmaxew} is non-zero, any write to an individual {\tt vew}$n$
-register that would set the width greater than {\tt vmaxew} raises an
-illegal instruction exception and leaves the vector unit state
-unchanged.
-
-If {\tt vmaxew} is non-zero, any write to an individual {\tt vew}$n$
-field with a value less than or equal to the value in {\tt vmaxew}
-only zeros the associated vector register {\tt v}$n$ and leaves other
-vector unit state unchanged. The vector register data is zeroed even
-if {\tt vew}$n$ would be unchanged by the write.
-
-If {\tt vmaxew} is zero, then any write to an individual {\tt vew}$n$
-register zeros the associated {\tt v}$n$ vector register. In addition,
-any write that changes the value in {\tt vew}$n$, zeros the entire vector
-unit state.
-
-\begin{commentary}
- The state is zeroed to hide implementation-dependent bit mappings
- and to provide additional security when context swapping. Zero is
- also a convenient initial value for some loops.
-
- In-order implementations will probably use a flag bit per register to
- mux in 0 instead of garbage values on each source until it is
- overwritten. For in-order machines, vector lengths less than MVL
- complicate this zeroing, but these cases can be handled by adding a
- zero bit per element or element group. Machines with vector
- register renaming can just initialize the rename table to point
- entries at a physical zero register.
-\end{commentary}
-
-Each vector register can be reconfigured dynamically to hold different
-formats without zeroing the entire vector unit state provided that: if
-{\tt vmaxew} is zero, the bit-width of the new format is the same as
-the current {\tt vew}; or if {\tt vmaxew} is non-zero, the format does
-not require more than {\tt vmaxew} bits. Any change to a vector
-register's format zeros the affected vector register.
-
-If a vector register is disabled, then any vector instruction
-that attempts to access that vector register will raise an
-illegal instruction exception. Attempting to write any {\tt
- vmaxew}$n$ with an unsupported value will raise an illegal
-instruction exception.
-
-\begin{commentary}
- Vector registers have both a maximum element width and a
- current element data type to allow the same vector register to
- be changed to different types during execution provided the
- maximum width is not exceeded. This reduces register pressure and
- helps support vector function calls, where the caller does not know
- the types needed by the callee, as described below.
-\end{commentary}
-
-\begin{commentary}
- The set of supported types might be greatly increased with future
- extensions. For example (and not limited to), new scalar types in
- new number systems, a complex type with real and imaginary
- components, a key-value type, or an application-specific structure
- type with multiple constituent fields. Auxiliary type
- configuration state might be required in these cases.
-\end{commentary}
-
-Attempting to write an unsupported type or a type that requires more
-than the current {\tt vmaxew} width to a {\tt vetype} field will raise
-an illegal instruction exception.
-
-\begin{commentary}
-Implementations must still raise an exception for a {\tt vetype}$n$
-setting that is greater than the architectural {\tt vmaxew}$n$ width,
-even if they internally implement a larger physical {\tt vmaxew}$n$
-that could accommodate the {\tt vetype}$n$ request.
-\end{commentary}
-
-\begin{discussion}
-We can either have 1) implementations raise exceptions whenever
-illegal values are written to {\tt vmaxew} and {\tt vetype} fields
-(current design), 2) raise exceptions at use if config holds illegal
-values, 3) make the fields WARL so silently reduce to supported types
-with no exceptions. Option 2 could complicate vector unit context
-switch code by having more cases to check, while Option 3 could make
-debugging more difficult by allowing code to run with reduced
-precision or incorrect types.
-\end{discussion}
-
-\begin{commentary}
-Three broad classes of implementation can be distinguished by how they
-handle {\tt vmaxew} settings.
-
-The simplest is {\em max-width-per-implementation} (MWPI), where the
-vector unit is organized in fixed ELEN-width physical lanes, and
-changes to {\tt vmaxew} settings simply cause portions of the
-physical registers and datapath to be disabled for operations narrower
-than ELEN bits.
-
-The next most complex implementation, {\em
- max-width-per-configuration} (MWPC), uses the maximum width across
-all {\tt vmaxew} settings in a dynamic configuration to divide the
-physical register storage and datapaths. For example, a MWPC machine
-with ELEN=64 might subdivide physical lanes into 32-bit datapaths if
-no {\tt vmaxew} setting is greater than 32. Operations on
-sub-32-bit quantities would disable appropriate portions of the
-physical registers and functional units in each 32-bit lane. Several
-early vector supercomputers, including the CDC
-Star-100~\cite{cdcstart100}, provided a similar facility to divide
-64-bit physical vector lanes into narrower 32-bit lanes.
-
-The most complex implementations are {\em max-width-per-register}
-(MWPR), which reduce wasted space in the physical register files by
-packing elements in each vector register according to the individual
-{\tt vmaxew} settings and which within one configuration can
-execute instructions with narrower datatypes at higher rates than for
-wider datatypes. The Berkeley Hwacha vector
-engine~\cite{hwachatr,mixedprecision} is an example microarchitecture
-with this property.
-\end{commentary}
-
-\clearpage
-
-{\bf The following sections are out of date.}
-
-\section{Vector Instruction Formats}
-
-\begin{commentary}
- The instruction encoding is a work in progress.
-
- An important design goal was that the base vector extension fit
- within a few major opcodes of the 32-bit encoding. It is envisioned
- that future vector extensions will use 48-bit or 64-bit encodings to
- increase both the opcode space and the set of architectural
- registers. The 64-bit vector encoding would support 256
- architectural vector registers and orthogonal specification of a
- predicate register in each instruction.
-\end{commentary}
-
-Vector arithmetic and vector memory instructions are encoded in new
-variants of the R-format, shown in Figure~\ref{fig:vinstformats}.
-Both new formats use one bit to hold a {\em vp} field, which usually
-controls the predicate register in use, either {\tt vp0} or {\tt vp1}.
-The VR4 form is used for fused multiply-add instructions. The
-existing RISC-V instruction formats are used for other vector-related
-instructions, such as the vector configuration instructions.
-
-\vspace{-0.2in}
-\begin{figure}[h]
-\begin{center}
-\setlength{\tabcolsep}{4pt}
-\begin{tabular}{p{0.7in}@{}p{0.4in}@{}p{0.7in}@{}p{0.7in}@{}p{0.5in}@{}p{0.4in}@{}p{0.7in}@{}p{1in}l}
-\\
-\instbitrange{31}{27} &
-\instbitrange{26}{25} &
-\instbitrange{24}{20} &
-\instbitrange{19}{15} &
-\instbitrange{14}{13} &
-\instbit{12} &
-\instbitrange{11}{7} &
-\instbitrange{6}{0} \\
-\cline{1-8}
-\multicolumn{2}{|c|}{funct7} &
-\multicolumn{1}{c|}{rs2} &
-\multicolumn{1}{c|}{rs1} &
-\multicolumn{1}{c|}{funct2} &
-\multicolumn{1}{c|}{vp} &
-\multicolumn{1}{c|}{rd} &
-\multicolumn{1}{c|}{opcode} &
-VR-type \\
-\cline{1-8}
-\\
-\cline{1-8}
-\multicolumn{1}{|c|}{rs3} &
-\multicolumn{1}{c|}{fmt} &
-\multicolumn{1}{c|}{rs2} &
-\multicolumn{1}{c|}{rs1} &
-\multicolumn{1}{c|}{funct2} &
-\multicolumn{1}{c|}{vp} &
-\multicolumn{1}{c|}{rd} &
-\multicolumn{1}{c|}{opcode} &
-VR4-type \\
-\cline{1-8}
-\end{tabular}
-\end{center}
-\caption{New V extension instruction formats. }
-\label{fig:vinstformats}
-\end{figure}
-
-Most vector instructions are available in both vector-vector and
-vector-scalar variants. Vector-vector instructions take the first
-operand from the vector register specified by {\em rs1} and the second
-operand from the vector register specified by {\em rs2}.
-
-For vector-scalar operations, the {\em rs1} field specifies the scalar
-register to be accessed. For most vector-scalar instructions, the
-type of the vector operand specified by {\em rs2} indicates whether
-the integer or floating-point scalar register file is accessed using
-the {\em rs1} register specifier.
-
-Some non-commutative vector-scalar instructions (such as sub) are
-provided in two forms, one of which uses the scalar value as the second
-operand.
-
-\begin{commentary}
- The {\em rs1} field is used to provide the scalar operand because in
- the base encoding, whenever an instruction has a single scalar
- source operand, it is encoded in the {\tt rs1} field.
-\end{commentary}
-
-\section{Polymorphic Vector Instructions}
-
-The vector extension uses a polymorphic instruction encoding where the
-opcode is combined with the types of the source and destination
-registers to determine the operation to be performed. For example, an
-ADD opcode will perform a 32-bit integer vector-vector add if both
-vector source operands and the vector destination register are 32-bit
-integers, but will perform a 16-bit floating-point vector-vector
-operation if both vector source operands and the vector destination
-are 16-bit floats.
-
-The polymorphic encoding also naturally supports operations with mixed
-precisions on the input and output, and also supports extending the
-instruction set with new types without necessarily increasing the
-opcode space.
-
-Not all combinations of source and destination argument types need be
-supported. The base vector extension mandates only that
-implementations provide a subset of combinations of types on inputs
-and outputs. Table~\ref{tab:vtypemix} shows the general rules for
-integer and floating-point instructions, but the detailed instruction
-listing should be consulted for accurate information.
-
-\begin{table}
- \centering
- \begin{tabular}{|r|r|r|r|r|}
- \hline
- \multicolumn{1}{|c|}{Src1} &
- \multicolumn{1}{c|}{Src2} &
- \multicolumn{1}{c|}{Src3} &
- \multicolumn{1}{c|}{Dest} &
- \multicolumn{1}{c|}{Example} \\
- \hline
- \hline
- \multicolumn{5}{|c|}{Integer vector-scalar}\\
- \hline
- XLEN & X & - & X & 64b + 32b $\rightarrow$ 32b \\
- XLEN & X & - & 2X & 64b + 8b $\rightarrow$ 16b \\
- \hline
- \hline
- \multicolumn{5}{|c|}{Integer vector-vector}\\
- \hline
- X & X & - & X & 32b + 32b $\rightarrow$ 32b \\
- X & X & - & 2X & 16b + 16b $\rightarrow$ 32b \\
- 2X & X & - & 2X & 64b + 32b $\rightarrow$ 64b \\
- \hline
- \hline
- \multicolumn{5}{|c|}{Floating-point vector-scalar}\\
- \hline
- F & F & - & F & 64b + 64b $\rightarrow$ 64b \\
- F & F & F & F & 32b $\times$ 32b + 32b $\rightarrow$ 32b \\
- F & F & - & 2F & 32b + 32b $\rightarrow$ 64b \\
- F & F & 2F & 2F & 32b $\times$ 32b + 64b $\rightarrow$ 64b \\
- \hline
- \hline
- \multicolumn{5}{|c|}{Floating-point vector-vector}\\
- \hline
- F & F & - & F & 32b + 32b $\rightarrow$ 32b \\
- F & F & - & 2F & 16b + 16b $\rightarrow$ 32b \\
- 2F & F & - & 2F & 64b + 32b $\rightarrow$ 64b \\
- F & F & F & F & 64b $\times$ 64b + 64b $\rightarrow$ 64b \\
- F & F & 2F & 2F & 16b $\times$ 16b + 32b $\rightarrow$ 32b \\
- \hline
- \end{tabular}
- \caption{General rules for supported types per instruction in base
- vector extension. X represents the number of bits in an integer
- type and F represents the number of bits in a floating-point type.
- Individual instruction types will provide more detailed listings.
- Note that the type of a scalar floating-point operand can never be
- different from that of the vector in Src2, hence the Src1=2F case
- is missing from vector-scalar operations.}
- \label{tab:vtypemix}
-\end{table}
-
-A general rule in the base vector instruction set is that the
-destination precision is never less than that of any source operand, except
-for explicit type-conversion instructions. Another general rule is
-that the input operands can only be the same width or half the width
-of the destination operand except for the scalar operand in integer
-vector-scalar instructions, which is always XLEN wide. Also, src2 is
-never larger than src1 or src3.
-
-Integer computations on mixed-precision values always align values by
-their LSB, and sign- or zero-extend any smaller value according to its
-type. The result is truncated to fit in the destination type. Note a
-scalar integer value is already XLEN bits wide, and as wide as any
-possible integer vector value.
-
-Floating-point computations on mixed-precision values act as if the
-calculations are performed exactly then rounded once to the
-destination format.
-
-\section{Rapid Configuration Instructions}
-
-It can take several CSR instructions to set up the {\tt vcfg} and
-{\tt vnp} CSRs for a given configuration. Specialized configuration
-instructions are provided to quickly set up common configurations in
-the {\tt vcfg} and {\tt vnp} CSRs.
-
-The {\tt vsetdcfg} instruction takes a scalar register value encoded as
-shown in Figure~\ref{fig:vcfg}, and returns the corresponding MVL in
-the destination register. The {\tt vsetdcfg} and {\tt vsetdcfgi}
-instructions also clear the {\tt vnp} register, so no predicate
-registers are allocated.
-
-\begin{discussion}
- For now, only a 32-bit value supporting up to three different vector
- data types is supported by the {\tt vsetdcfg} instruction. RV64 and
- RV128 could support a larger number of types, though it's not clear if
- the hardware cost (area, latency) to support a larger number of
- different types is justified.
-\end{discussion}
-
-\begin{figure}[b]
- \centering
- \begin{tabular}{p{1cm}p{1cm}ccc|c|c|c|c|c|c|c|l}
- \multicolumn{1}{c}{} &
- \multicolumn{1}{c}{} &
- \multicolumn{1}{c}{} &
- \multicolumn{1}{c}{} &
- \multicolumn{1}{c}{} &
- \multicolumn{1}{c}{} &
- \multicolumn{1}{c}{} &
- \multicolumn{1}{c}{} &
- \multicolumn{1}{c}{} &
- \multicolumn{1}{c}{mode} &
- \multicolumn{1}{c}{} &
- \multicolumn{1}{c}{} & \\
- \cline{6-12}
- & & & & &
- \tt type2 & \tt ntype2 &
- \tt type1 & \tt ntype1 &
- 0 &
- \tt type0 & \tt ntype0 & \\
- \cline{6-12}
- \multicolumn{1}{c}{} &
- \multicolumn{1}{c}{} &
- \multicolumn{1}{c}{} &
- \multicolumn{1}{c}{} &
- \multicolumn{1}{c}{} &
- \multicolumn{1}{c}{5} &
- \multicolumn{1}{c}{5} &
- \multicolumn{1}{c}{5} &
- \multicolumn{1}{c}{5} &
- \multicolumn{1}{c}{2} &
- \multicolumn{1}{c}{5} &
- \multicolumn{1}{c}{5} & \\
- %% \cline{2-12}
- %% & \multicolumn{1}{|c|}{0} & F128 &
- %% \multicolumn{1}{c|}{type3} & \multicolumn{1}{c|}{\#type3} &
- %% type2 & \#type2 & type1 & \#type1 & 0 & type0 & \#type0 & RV64 \\
- %% \cline{2-12}
- %% & & &
- %% \multicolumn{1}{c}{} &
- %% \multicolumn{1}{c}{24} &
- %% \multicolumn{1}{c}{5} &
- %% \multicolumn{1}{c}{5} &
- %% \multicolumn{1}{c}{5} &
- %% \multicolumn{1}{c}{5} &
- %% \multicolumn{1}{c}{2} &
- %% \multicolumn{1}{c}{5} &
- %% \multicolumn{1}{c}{5} & \\
- %% \cline{1-12}
- %% \multicolumn{1}{|c|}{0} & \multicolumn{1}{c|}{X128} &
- %% \multicolumn{1}{c|}{F128} & I64 & F64 & F32 & F16 & I32 & I16 & I8 & RV128 \\
- %% \cline{1-12}
- %% \multicolumn{1}{c}{83} &
- %% \multicolumn{1}{c}{5} &
- %% \multicolumn{1}{c}{5} &
- %% \multicolumn{1}{c}{5} &
- %% \multicolumn{1}{c}{5} &
- %% \multicolumn{1}{c}{2} &
- %% \multicolumn{1}{c}{5} &
- %% \multicolumn{1}{c}{5} & \\
- \end{tabular}
- \caption{Format of the {\tt vsetdcfg} value. The value contains
- three pairs of a 5-bit type and a 5-bit number of registers
- to create of that type. A value of 0 for the number of a type
- indicates that 32 registers should be allocated. A value of 0 for
- the type indicates this pair should be skipped. The types must be
- of monotonically increasing size from type0 to type2. }
- \label{fig:vcfg}
-\end{figure}
-
-The {\tt vsetdcfg} value specifies how many vector registers of each
-datatype are allocated, and is divided into a 2-bit mode field and
-pairs of 5-bit fields for each data type in the configuration.
-
-The 2-bit mode field indicates the configuration mode of the vector
-unit and is zero for the base vector extension.
-
-\begin{commentary}
- The standard vector extension operating mode configures the vector
- unit into some number of vector registers, each with some number of
- elements of types supported by the scalar unit.
-
- At least one alternative mode is planned, where the vector unit is
- configured as some number of registers each holding a single large
- element, e.g., 256 bits. This would be the base for cryptographic
- operations, or other coprocessors that operated on large structures.
-
- Other modes can be used to reconfigure the vector unit register file
- and functional units for other domain-specific purposes.
-\end{commentary}
-
-Each datatype pair contains a 5-bit {\tt type}$x$ value encoded as a
-{\tt vetype}$n$ value, and a 5-bit {\tt ntype}$x$ for the number of
-registers to allocate for that type. If the {\tt type0} field is
-non-zero, the {\tt vsetdcfg} instruction will configure the first {\tt
- ntype0} vector data registers to have {\tt vetype}$n$ values of {\tt
- type0} with {\tt vmaxew}$n$ values set accordingly as shown in
-Table~\ref{tab:vetype}. If the {\tt type0} value is 0, the datatype
-pair is skipped. If the {\tt type1} field is non-zero, then the next
-{\tt ntype1} vector registers are configured to be of the type given
-in {\tt type1}. Similarly for the {\tt type2} pair.
-
-A value of zero in a {\tt type}$x$ field indicates this datatype pair
-should be ignored. A value of zero in a {\tt ntype}$x$ field
-indicates 32 registers should be allocated for the corresponding type.
-
-\begin{commentary}
-Zero values are skipped to simplify setting a configuration with two
-different data types, where a single LUI instruction can set the upper
-20 bits leaving the low bits zero.
-
-A single 12-bit immediate value is sufficient to create a
-configuration with some number of vector registers with a single given
-datatype.
-
-A compressed C.LI with a zero-extended 5-bit immediate can create a
-configuration with 32 vector registers of a given datatype.
-\end{commentary}
-
-A corresponding {\tt vsetdcfgi} instruction takes a 12-bit immediate
-value to set the configuration instead of a scalar value, but
-otherwise is identical to the {\tt vsetdcfg} instruction.
-
-\begin{discussion}
-It is not clear how many immediate bits will be made available for the
-{\tt vsetdcfgi} instruction. If encoding space is available for both
-12 immediate bits and a source register specifier, then {\tt
- vsetdcfgi} can be defined to read the source register, OR in the
-bits in the immediate, then create a configuration. In this case,
-there is no need for a separate {\tt vsetdcfg} instruction.
-\end{discussion}
-
-The configuration value given must result in a legal configuration or
-else an illegal instruction exception will be raised.
-
-If a zero argument is given to {\tt vsetdcfg} the vector unit will be
-disabled and the value 0 will be returned for MVL. This instruction
-({\tt vsetdcfg x0, x0}) is given the assembly pseudoinstruction name {\tt
- vdisable}.
-
-Separate {\tt vsetpcfg} and {\tt vsetpcfgi} instructions are provided
-that write the source value to the {\tt vnp} register and return the
-new MVL. These writes also clear the vector data registers, set all
-bits in the allocated predicate registers, and set {\tt vl}=MVL. A
-{\tt vsetpcfg} or {\tt vsetpcfgi} instruction can be used after a {\tt
- vsetdcfg} to complete a reconfiguration of the vector unit.
-
-\begin{discussion}
- If {\tt vnp} is made accessible as a separate CSR, the {\tt vsetpcfg}
- and {\tt vsetpcfgi} instructions are less useful. The only advantage
- over a CSR instruction is that they return MVL, which is rarely
- needed, and which can be obtained via the {\tt setvl} instruction.
-\end{discussion}
-
-\section{Vector-Type-Change Instructions}
-
-To quickly change the individual types of a vector register, {\tt
- vetyperw} and {\tt vetyperwi} instructions are provided to change
-the type of the specified vector data register to the given scalar
-register value or 5-bit immediate value respectively, while returning
-the previous type in the destination scalar register.
-
-A vector convert instruction, described below, can simultaneously
-convert a source vector register into a new type, and set that type in
-the destination vector register.
-
-\section{Vector Length}
-
-The active vector length is held in the XLEN-bit WARL vector length
-CSR {\tt vl}, which can only hold values between 0 and MVL inclusive.
-Any writes to the configuration registers ({\tt vcfg}$x$ or {\tt
- vnp}) cause {\tt vl} to be initialized with MVL. Changes to {\tt
- vetype}$n$ via vector-type-change instructions do not affect {\tt
- vl}.
-
-The active vector length is usually set via the {\tt setvl}
-instruction. The source argument to the {\tt setvl} is the requested
-application vector length (AVL) as an unsigned XLEN-bit integer. The
-{\tt setvl} instruction calculates the value to assign to {\tt vl}
-according to Table~\ref{tab:vlcalc}. The result of this calculation
-is also returned as the result of the {\tt setvl} instruction.
-
-\begin{commentary}
-Earlier drafts encoded {\tt setvl} using a modified CSRRW instruction
-whereas it is now encoded as a separate new instruction.
-\end{commentary}
-
-\begin{table}
- \centering
- \begin{tabular}{|c|c|}
- \hline
- AVL Value & {\tt vl} setting \\
- \hline
- AVL $\geq$ 2\,MVL & MVL \\
- 2\,MVL $>$ AVL $>$ MVL & $\lceil$AVL$/2\rceil$ \\
- MVL $\geq$ AVL & AVL \\
- \hline
- \end{tabular}
- \caption{Operation of {\tt setvl} instruction to set vector
- length register {\tt vl} based on requested application vector
- length (AVL) and current maximum vector length (MVL).}
- \label{tab:vlcalc}
-\end{table}
-
-\begin{commentary}
- The rules for setting the {\tt vl} register help keep vector
- pipelines full over the last two iterations of a stripmined loop.
- This version of the rules guarantees monotonically decreasing vector
- lengths.
- Similar rules were previously used in Cray-designed machines~\cite{crayx1asm}.
-\end{commentary}
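-
-\begin{commentary}
- As a worked illustration of Table~\ref{tab:vlcalc}, assume a
- hypothetical configuration with MVL=8 and a stripmined loop over 21
- elements (both values are chosen here only for illustration).
- Successive {\tt setvl} calls would then give
- \[
- \begin{array}{lll}
- \mbox{AVL}=21 \geq 2\,\mbox{MVL}=16 & \Rightarrow & \mbox{\tt vl}=8 \\
- 2\,\mbox{MVL} > \mbox{AVL}=13 > \mbox{MVL} & \Rightarrow & \mbox{\tt vl}=\lceil 13/2 \rceil=7 \\
- \mbox{MVL} \geq \mbox{AVL}=6 & \Rightarrow & \mbox{\tt vl}=6
- \end{array}
- \]
- so the strip lengths are 8, 7, 6 and never increase. Simply taking
- the minimum of AVL and MVL would instead give strips of 8, 8, 5,
- leaving a shorter final strip.
-\end{commentary}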
-
-\begin{discussion}
- There are multiple possible rules for setting VL, and we could give
- implementations freedom to use different VL setting rules.
-\end{discussion}
-
-\begin{commentary}
- The idea of having implementation-defined vector length dates back
- to at least the IBM 3090 Vector Facility~\cite{ibm370varch}, which
- used a special ``Load Vector Count and Update'' (VLVCU) instruction
- to control stripmine loops. The {\tt setvl} instruction included
- here is based on the simpler {\tt setvlr} instruction introduced by
- Asanovi\'{c}~\cite{krstephd}.
-\end{commentary}
-
-The {\tt setvl} instruction is typically used at the start of every
-iteration of a stripmined loop to set the number of vector elements to
-operate on in the following loop iteration. The current MVL can be
-obtained from a vector configuration instruction, or by performing a
-{\tt setvl} with a source argument that has all bits set (largest
-unsigned integer).
-
-When {\tt vl} is less than MVL, vector instructions will set all
-elements in the range [{\tt vl}:MVL-1] in the destination vector
-data register or destination vector predicate register to zero.
-
-\begin{commentary}
- Requiring zeroing of elements past the current active vector length
- simplifies the design of units with renamed vector data registers.
- If the specification left destination elements unchanged, renaming
- implementations would have to copy the tail of the old destination
- register to the newly allocated destination register.
- Alternatively, specifying the tail to be undefined will expose
- implementation differences and possibly cause a security hole.
-
- Implementations that do not support renaming will have to zero the
- tail of a vector, but this can reuse the mechanism that is already
- required to initialize all vector data registers to zero on
- reconfiguration, for example, by having a zero bit on each element
- or element group.
-\end{commentary}
-
-No element operations are performed for any vector instruction when
-{\tt vl}=0.
-
-\begin{commentary}
- Two possible choices are to 1) require destination registers to be
- completely zeroed when {\tt vl}=0, or 2) no changes to the
- destination registers. Option 2 is currently chosen as this
- prevents unnecessary work in some implementations, and option 1 does
- not provide a clear advantage beyond seeming more consistent with
- the {\tt vl}$>$0 case.
-\end{commentary}
-
-\begin{figure}[bt]
- \centering
-\begin{verbatim}
- # Vector-vector 32-bit add loop.
- # a0 holds N
- # a1 holds pointer to result vector
- # a2 holds pointer to first source vector
- # a3 holds pointer to second source vector
- li t0, (2<<VNTYPE0|VREGF32)
- vsetdcfg t0 # Configure with two 32-bit float vectors
-
- loop: setvl t0, a0 # Set length, get how many elements in strip
- vld v0, a2 # Load first vector
- sll t1, t0, 2 # Multiply length by 4 to get bytes
- add a2, t1 # Bump pointer
- vld v1, a3 # Load second vector
- add a3, t1 # Bump pointer
- vadd v0, v1 # Add elements
- sub a0, t0 # Decrement elements completed
- vst v0, a1 # Store result vector
- add a1, t1 # Bump pointer
- bnez a0, loop # Any more?
-
- vdisable # Turn off vector unit
-\end{verbatim}
-\caption{Example vector-vector add loop.}
-\label{fig:vvadd}
-\end{figure}
-
-\section{Predicated Execution}
-
-
-\begin{commentary}
- The 32-bit base encoding does not leave room for a fully orthogonal
- predicate register specifier. A single bit is dedicated to the
- predicate register specification, and is used to select between two
- active predicate registers, {\tt vp0} or {\tt vp1}. An alternative
- scheme would have used the bit to select between {\tt vp0} and
- unpredicated (all elements active). However, given the ease of
- setting all predicate bits in a vector predicate register with a
- single predicate instruction, the current scheme provides more
- flexibility.
-
- When there are no vector predicate registers enabled, {\tt vp0}
- returns all set bits when read. So, the assembler convention is to
- assume {\tt vp0} as the predicate register when no predicate
- register is explicitly given. The assembler can support a strict
- operands option to require that the vector predicate register be
- explicitly specified.
-\end{commentary}
-
-At element positions where the selected predicate register bit is
-zero, the corresponding vector element operation has no effect (does
-not change architectural state or generate exceptions), except to
-write a zero to the element position in the destination vector
-register.
-
-\begin{discussion}
- The previous proposal (undisturb) left the destination vector
- unchanged at element positions where the predicate bit is false,
- whereas the current plan-of-record (zero) writes zero to the
- destination where the predicate bit is false.
-
- The advantage of the undisturb option is that it can require fewer
- instructions and fewer architectural registers for many common code
- sequences. For in-order machines without register renaming, the
- undisturb operation simply disables writes to the destination
- elements, except for vector registers that have not been written
- since configuration time. Typically an extra zero bit per vector
- register or element group will be added to represent a zeroed
- register instead of actually zeroing state at configuration time.
- For predicated undisturb writes to these uninitialized registers,
- the predicated false elements must be explicitly written with zeros
- on each element group and the zero bit is then cleared.
- However, in a machine with vector register renaming, undisturb does
- imply an additional read of the original destination register to
- write the value into the new physical destination register when the
- predicate is false. This additional read port will often be cheaper
- than in a scalar machine as vector machines often time-multiplex
- read ports, and the additional read can be skipped when the
- predicate registers are disabled ({\tt vnp}=0) or when the source is
- known to be zero after configuration, but still adds complexity to a
- design.
-
- The advantage of the zero option is that a machine with vector
- register renaming does not need to read the original destination
- vector register and so a read port is saved. The disadvantage of
- the zero option is that more instructions and architectural
- registers are required for common code sequences, and simpler
- microarchitectures without register renaming are penalized by
- requiring longer code sequences and greater register pressure. In
- particular, vector merge instructions are required to collect
- results from two divergent control paths, and each vector merge has
- to read two vector values and write a vector result. Whether the
- zero option saves total register file traffic in a register-renamed
- microarchitecture depends on the ratio of a) internal temporary
- writes, to b) writes creating values that are live out of each basic
- block, and also to the frequency of control flow merges.
-
- Overall, the zero option removes significant complexity from the
- renamed machines while reducing efficiency somewhat for the
- non-renamed machines, and is the current plan-of-record.
-\end{discussion}
-
-\section{Vector Load/Store Instructions}
-
-Three vector load/store addressing modes are supported: unit-stride,
-constant-stride, and indexed (scatter/gather). Each addressing mode
-has a 7-bit unsigned immediate offset that is scaled by the element
-type.
-
-The unit-stride address mode takes a scalar base byte address, adds
-the scaled immediate, then generates a contiguous set of element
-addresses for loads or stores.
-
-\begin{commentary}
- The primary use of immediates in unit-stride loads is to generate
- overlapping unit-stride loads for convolution operations.
-\end{commentary}
-
-The constant-stride address mode takes a scalar base byte address, a
-stride value encoded in bytes, and adds a scaled immediate value.
-
-\begin{commentary}
- The stride value is in bytes to allow a single stride register to be
- used to support operations on arrays-of-structures, where not all
- elements in each structure have the same size. The immediate value
- is still scaled by element size to increase reach, given that
- element types will be naturally aligned.
-\end{commentary}
-
-The indexed address mode takes a scalar base byte address and a vector
-of byte offsets. The scalar base address and the immediate value are
-added to each element of the offset vector to give a vector of addresses
-used in a scatter/gather.
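-
-\begin{commentary}
- One way to read the three addressing modes, as a sketch rather than
- a normative definition, is in terms of the byte address $a_i$ of
- element $i$. The symbols are introduced here for illustration only:
- $b$ is the scalar base byte address, $m$ the unsigned immediate, $w$
- the element size in bytes, $s$ the stride in bytes, and $o_i$ element
- $i$ of the byte-offset vector.
- \[
- \begin{array}{ll}
- \mbox{unit-stride:} & a_i = b + m\,w + i\,w \\
- \mbox{constant-stride:} & a_i = b + m\,w + i\,s \\
- \mbox{indexed:} & a_i = b + m\,w + o_i
- \end{array}
- \]
- for $0 \leq i <$ {\tt vl}. Under this reading the immediate
- contributes the same scaled term $m\,w$ in every mode, and only the
- per-element term differs.
-\end{commentary}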
-
-Indexed stores are provided in three types: unordered, ordered, and
-reverse-ordered. The unordered indexed stores might update the same
-memory location from two different elements in an unspecified order.
-The ordered stores always update memory locations in increasing vector
-element order. The reverse-ordered stores always update memory
-locations in decreasing vector element order.
-
-\begin{commentary}
- The reverse-ordered stores support vectorization of software memory
- disambiguation techniques. A reverse-ordered store of element id
- into a hash table indexed by a hash on a store access address,
- followed by a read of the hash table using a load access address and
- a comparison against the original element id, will indicate if
- there's a potential RAW hazard with an earlier loop iteration.
-\end{commentary}
-
-\begin{discussion}
- It is not clear whether there is sufficient realizable improvement from
- supporting unordered stores over ordered stores.
-\end{discussion}
-
-Vector loads/stores have a simple memory model, where each vector
-load/store is observed to complete sequentially in program order only
-on the local hart, i.e., a vector load on a hart will observe all earlier
-vector stores on the same hart, and no later vector stores.
-
-Vector loads are available in a length-speculative form that writes
-predicate register {\tt vp1} in addition to the destination vector
-data register. These instructions raise an illegal instruction
-exception if {\tt vp1} is not configured. For elements that do not
-generate a permissions fault, the length-speculative vector loads
-operate as normal except that they also clear the corresponding bit in
-{\tt vp1}. If an element encounters a permissions fault, a zero is
-written to the destination vector register element and the
-corresponding {\tt vp1} bit is set to
-1. Implementations may treat elements past the first faulting element
-as also causing a fault even if they might not cause a permissions
-fault when accessed alone.
-
-Once software determines the active vector length, it should check if
-any loads within the active vector length caused a fault, and in this
-case, generate a non-length-speculative load to trigger reporting of
-the error.
-
-\begin{commentary}
- Length-speculative vector loads are required to vectorize while
- loops with data-dependent exits (e.g., {\tt strlen}).
-
- The only faults ignored by the length-speculative vector loads are
- ones that would have resulted in a permissions violation. Page
- faults and other virtualization-related faults should be handled
- invisibly to the user thread by the execution environment.
-
- A malicious program can use length-speculative vector loads to probe
- accessible address space without fear of a fatal fault.
-\end{commentary}
-
-\section{Vector Register Gather}
-
-A vector register gather produces a new result data vector by gathering
-elements from one source data vector at the element locations
-specified by a second source index vector. Data source and
-destination vector types must agree. The index vector can have any
-integer type. Legal element indices can range from 0 to current
-MAXVL. Indices out of this range raise an illegal instruction
-exception.
-
-\begin{verbatim}
- # vindices holds values from 0..MAXVL
- vrgather vdest, vsrc, vindices
-\end{verbatim}
-
-\section{Vector Slide}
-
-Reductions (and convolutions) are supported via a vector slide
-instruction that takes elements starting from the middle of one vector
-and places these at the beginning of a second vector register. This
-supports a recursive-halving reduction approach for any binary
-associative operator.
-
-\begin{commentary}
- A similar vector register extract instruction was added to the Cray
- C90 after memory latency grew too large for the memory-memory
- reductions used in earlier Crays.
-
- The vector unit microarchitecture can be optimized for the
- power-of-2 sized element offsets used for reductions.
-\end{commentary}
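-
-\begin{commentary}
- As an illustrative sketch of recursive halving, consider summing the
- eight-element vector $(1,2,3,4,5,6,7,8)$. The exact slide semantics
- are not defined in this draft, so the steps below assume a slide by
- $k$ that moves element $i+k$ of the source to element $i$ of the
- result.
- \[
- \begin{array}{ll}
- \mbox{slide by 4: } (5,6,7,8) & \mbox{add: } (6,8,10,12) \\
- \mbox{slide by 2: } (10,12) & \mbox{add: } (16,20) \\
- \mbox{slide by 1: } (20) & \mbox{add: } (36)
- \end{array}
- \]
- After $\log_2 8 = 3$ slide-and-add steps, element 0 holds the sum 36.
- Each step halves the number of active elements, and every slide
- offset is a power of 2.
-\end{commentary}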
-
-
-\section{Fixed-Point Support}
-
-The clip instruction supports scaling, rounding, and clipping to the
-destination type. Rounding is set by the CSR fixed-point rounding mode
-(truncate, jam, round-up, round-nearest-even). Clipping is set by the CSR
-clip mode (wrap, saturate).
-
-Add with average, rounding set by rounding mode.
-
-Multiply with same size source and destination types, with some result
-scaling values (+1, 0, -1, -8?) and rounding and clipping according to
-CSR mode.
-
-Accumulate with carry into predicate register to support larger
-precise dot-products.
-
-\section{Optional Transcendental Support}
-
-\section{Instruction-Set Encoding}
-
-\note{This section is out of date.}
-On the next two pages is a proposed instruction-set encoding.
-\input{v-instr-table}