author | Krste Asanovic <krste@eecs.berkeley.edu> | 2019-03-05 08:28:24 -0800 |
---|---|---|
committer | Krste Asanovic <krste@eecs.berkeley.edu> | 2019-03-05 08:28:24 -0800 |
commit | 8dfbd56cfb2f2504178f6c70b4801e75cf676b57 (patch) | |
tree | c72b0083afcf603e6b7356d17a851b8b8bea94b5 | |
parent | c2869697edbaa458ca2496facdd5b3c77d5be134 (diff) | |
Version 20190305-Base-Ratification for ratification vote.
-rw-r--r-- | src/p.tex | 93
-rw-r--r-- | src/preface.tex | 13
-rw-r--r-- | src/riscv-spec.tex | 4
-rw-r--r-- | src/v.tex | 1401
4 files changed, 21 insertions, 1490 deletions
@@ -1,5 +1,5 @@
 \chapter{``P'' Standard Extension for Packed-SIMD Instructions,
- Version 0.1}
+ Version 0.2}
 \label{sec:packedsimd}
 
 \begin{commentary}
@@ -8,94 +8,7 @@
  standardizing on the V extension for large floating-point SIMD
  operations. However, there was interest in packed-SIMD fixed-point
  operations for use in the integer registers of small RISC-V
- implementations.
+ implementations. A task group is working to define the new P
+ extension.
 \end{commentary}
 
-In this chapter, we outline a standard packed-SIMD extension for
-RISC-V. We've reserved the instruction-set extension name ``P'' for a future
-standard set of packed-SIMD extensions. Many other extensions can
-build upon a packed-SIMD extension, taking advantage of the wide data
-registers and datapaths separate from the integer unit.
-
-\begin{commentary}
-Packed-SIMD extensions, first introduced with the Lincoln Labs TX-2~\cite{tx2},
-have become a popular way to provide higher throughput on data-parallel
-codes. Earlier commercial microprocessor implementations include the
-Intel i860, HP PA-RISC MAX~\cite{lee-max-ieeemicro1996}, SPARC
-VIS~\cite{tremblay-vis-ieeemicro1996}, MIPS
-MDMX~\cite{gwennap-mdmx-mpr1996}, PowerPC
-AltiVec~\cite{diefendorff-altivec-ieeemicro2000}, Intel x86
-MMX/SSE~\cite{peleg-mmx-ieeemicro1996, raman-sse-ieeemicro2000}, while
-recent designs include Intel x86 AVX~\cite{lomont-avx-irm2011} and ARM
-Neon~\cite{goodacre-armisa-computer2005}. We describe a standard
-framework for adding packed-SIMD in this chapter, but are not actively
-working on such a design. In our opinion, packed-SIMD designs represent
-a reasonable design point when reusing existing wide datapath resources,
-but if significant additional resources are to be devoted to
-data-parallel execution then designs based on traditional vector
-architectures are a better choice and should use the V extension.
-
-\end{commentary}
-
-A RISC-V packed-SIMD extension reuses the floating-point registers
-({\tt f0}-{\tt f31}). These registers can be defined to have widths
-of FLEN=32 to FLEN=1024. The standard floating-point instruction-set
-extensions require registers of width 32 bits (``F''), 64 bits (``D''),
-or 128 bits (``Q'').
-
-\begin{commentary}
-It is natural to use the floating-point registers for packed-SIMD
-values rather than the integer registers (PA-RISC and Alpha
-packed-SIMD extensions) as this frees the integer registers for
-control and address values, simplifies reuse of scalar floating-point
-units for SIMD floating-point execution, and leads naturally to a
-decoupled integer/floating-point hardware design. The floating-point
-load and store instruction encodings also have space to handle wider
-packed-SIMD registers. However, reusing the floating-point registers
-for packed-SIMD values does make it more difficult to use a recoded
-internal format for floating-point values.
-\end{commentary}
-
-The existing floating-point load and store instructions are used to
-load and store various-sized words from memory to the {\tt f}
-registers. The base ISA supports 32-bit and 64-bit loads and stores,
-but the LOAD-FP and STORE-FP instruction encodings allows 8 different
-widths to be encoded as shown in Table~\ref{psimdwidth}. When used
-with packed-SIMD operations, it is desirable to support non-naturally
-aligned loads and stores in hardware.
-
-\begin{table}[htp]
-\begin{center}
-\begin{tabular}{|c|l|r|}
-\hline
-{\em width} field &
-Code &
-Size in bits\\
-\hline
-000 & B & 8 \\
-001 & H & 16 \\
-010 & W & 32 \\
-011 & D & 64 \\
-100 & Q & 128 \\
-101 & Q2 & 256 \\
-110 & Q4 & 512 \\
-111 & Q8 & 1024 \\
-\hline
-\end{tabular}
-\end{center}
-\caption{LOAD-FP and STORE-FP width encoding.}
-\label{psimdwidth}
-\end{table}
-
-Packed-SIMD computational instructions operate on packed values in
-{\tt f} registers. Each value can be 8-bit, 16-bit, 32-bit, 64-bit,
-or 128-bit, and both integer and floating-point representations can be
-supported. For example, a 64-bit packed-SIMD extension can treat each
-register as 1$\times$64-bit, 2$\times$32-bit, 4$\times$16-bit, or
-8$\times$8-bit packed values.
-
-\begin{commentary}
-Simple packed-SIMD extensions might fit in unused 32-bit instruction
-opcodes, but more extensive packed-SIMD extensions will likely require
-a dedicated 30-bit instruction space.
-\end{commentary}
diff --git a/src/preface.tex b/src/preface.tex
index 69615b0..535c78a 100644
--- a/src/preface.tex
+++ b/src/preface.tex
@@ -31,7 +31,7 @@ modules:
  \bf Zifencei & \bf 2.0 & \bf Ratification \\
  \bf Zicsr & \bf 2.0 & \bf Ratification \\
  \bf M & \bf 2.0 & \bf Ratification \\
- \bf A & \bf 2.0 & \bf Ratification \\
+ \em A & \em 2.0 & Frozen \\
  \bf F & \bf 2.2 & \bf Ratification \\
  \bf D & \bf 2.2 & \bf Ratification \\
  \bf Q & \bf 2.2 & \bf Ratification \\
@@ -42,8 +42,8 @@ modules:
  \em B & \em 0.0 & \em Draft \\
  \em J & \em 0.0 & \em Draft \\
  \em T & \em 0.0 & \em Draft \\
- \em P & \em 0.1 & \em Draft \\
- \em V & \em 0.4 & \em Draft \\
+ \em P & \em 0.2 & \em Draft \\
+ \em V & \em 0.7 & \em Draft \\
  \em N & \em 1.1 & \em Draft \\
  \em Zam & \em 0.1 & \em Draft \\
  \hline
@@ -56,6 +56,7 @@ The changes in this version of the document include:
 \begin{itemize}
 \parskip 0pt
 \itemsep 1pt
+\item Removed the A extension from ratification.
 \item Changed document version scheme to avoid confusion with versions of the
  ISA modules.
 \item Incremented the version numbers of the base integer ISA to 2.1,
@@ -109,6 +110,10 @@ The changes in this version of the document include:
 \item Improvements to the description and commentary.
 \item Defined the term IALIGN as shorthand to describe the instruction-address
  alignment constraint.
+\item Removed text of P extension chapter as now superceded by active task
+ group documents.
+\item Removed text of V extension chapter as now superceded by separate vector
+ extension draft document.
 \end{itemize}
 
 \section*{Preface to Document Version 2.2}
@@ -140,7 +145,7 @@ versions of the RISC-V ISA modules:
  J & 0.0 & N \\
  T & 0.0 & N \\
  P & 0.1 & N \\
- V & 0.2 & N \\
+ V & 0.7 & N \\
  N & 1.1 & N \\
  \hline
 \end{tabular}
diff --git a/src/riscv-spec.tex b/src/riscv-spec.tex
index 1ec7ccd..17bed0a 100644
--- a/src/riscv-spec.tex
+++ b/src/riscv-spec.tex
@@ -6,8 +6,8 @@
 
 \input{preamble}
 
-\newcommand{\specrev}{\mbox{20181221-Public-Review-{\em draft}}}
-\newcommand{\specmonthyear}{\mbox{December 2018}}
+\newcommand{\specrev}{\mbox{20190305-Base-Ratification}}
+\newcommand{\specmonthyear}{\mbox{March 2019}}
 
 \begin{document}
 
@@ -1,1402 +1,15 @@
-\chapter{``V'' Standard Extension for Vector Operations, Version 0.4-DRAFT}
+\chapter{``V'' Standard Extension for Vector Operations, Version 0.7}
 \label{sec:vector}
-{\bf This version is out-of-date with respect to the current working
- group draft, which is now hosted on {\tt https://github.com/riscv/riscv-v-spec}.}
-
-This chapter presents a proposal for the RISC-V base vector
-instruction-set extension. The base vector extension is intended to
-provide general support for data-parallel execution within the 32-bit
-instruction encoding space, with later vector extensions supporting
-richer functionality for certain domains.
-
-\begin{commentary}
-The vector extension is based on the style of vector register
-architecture introduced by Seymour Cray in the 1970s, as opposed to
-the earlier packed SIMD approach, introduced with the Lincoln Labs
-TX-2 in 1957 and now adopted by most other commercial instruction
-sets.
-\end{commentary}
-
-The base vector extension defines the components that must be included
-when the ``V'' bit is set in the {\tt misa} register, and consequently
-those that will be assumed to exist by software written for an ABI
-specifying V.
-
-\begin{commentary}
- This draft version of the chapter includes additional specifications
- of proposed extensions to the base vector extension to explain some
- of the encoding choices made for the base.
-\end{commentary}
-
-The vector extension supports a configurable vector unit, to enable
-implementations to tradeoff the number of active architectural vector
-registers and supported element widths against available maximum
-vector length. The vector extension is designed to allow the same
-binary code to work efficiently across a variety of hardware
-implementations varying in physical vector storage capacity and
-datapath spatial and/or temporal parallelism.
-
-\begin{commentary}
-The vector instruction set contains many features developed in earlier
-research projects, including the Berkeley T0~\cite{} and VIRAM~\cite{VIRAM}
-vector microprocessors, the MIT Scale vector-thread processor~\cite{},
-and the Berkeley Maven~\cite{} and Hwacha~\cite{} projects.
-\end{commentary}
-
-\section{Vector Unit State}
-
-The additional vector unit architectural state includes 32 vector
-registers ({\tt v0}--{\tt v31}), and an XLEN-bit WARL vector length
-CSR, {\tt vl}. Each vector register {\tt v}$n$ has an associated
-16-bit configuration field {\tt vtype}$n$ described below. A 6-bit
-global maximum element width register {\tt vmaxew} defines the maximum
-number of bits of storage in every element of every active vector
-register.
-
-\begin{commentary}
- Future vector extensions using wider instruction encodings can
- support more architectural vector registers. For example, 256
- architectural vector registers in a 64-bit instruction encoding.
-\end{commentary}
-
-\begin{commentary}
- Future 2D shape extensions add two more vector length registers,
- {\tt vm} and {\tt vn}.
-\end{commentary}
-
-There is also a 3-bit fixed-point rounding mode CSR {\tt vxrm}, and a
-single-bit fixed-point saturation status CSR {\tt vxsat}. The {\tt
- vcs} CSR alias provides combined access to the {\tt vl}, {\tt vxrm},
-{\tt vxsat} fields to reduce context switch time. The {\tt vcs}
-register also includes a configuration mode field to support future
-extended configuration modes.
-
-\begin{discussion}
-The components of vcs might not need separate CSR addresses,
-depending on how they're accessed via other non-CSR instructions.
-\end{discussion}
-
-\section{Vector Unit Type Configuration Register ({\tt vtype}$n$)}
-
-The vector unit must be configured before use. Each architectural
-vector register, {\tt v}$n$, is configured via 16 bits of vector type
-configuration state {\tt vtype}$n$, which can be accessed via vector
-configuration ({\tt vcfg}) CSRs and other rapid vector configuration
-instructions as described below. The vector register type
-configuration encodes the overall organization, or {\em shape}, of the
-elements in each vector register (e.g., scalar versus 1-D vector), as
-well as the bitwidth and numeric representation of each element. As
-shown in Figure~\ref{fig:vtype}, the 16-bit {\tt vtype}$n$ encoding is
-divided into a 5-bit current shape field {\tt vshape}$n$, a 5-bit
-representation field {\tt verep}$n$, and a 6-bit element bit-width
-field {\tt vew}$n$\, held in the {\tt vcfg}$x$ CSRs. The combination
-of an element numeric representation and an element bitwidth is called
-an element {\em format}. Each vector register can also be disabled to
-free physical vector storage for other architectural vector registers.
-
-\begin{figure}[htb]
-\begin{center}
-\begin{tabular}{O@{}O@{}O}
-\\
-\instbitrange{15}{11} &
-\instbitrange{10}{6} &
-\instbitrange{5}{0} \\
-\hline
-\multicolumn{1}{|c|}{{\tt vshape}$n$} &
-\multicolumn{1}{c|}{{\tt verep}$n$} &
-\multicolumn{1}{c|}{{\tt vew}$n$} \\
-\hline
-5 & 5 & 6 \\
-\end{tabular}
-\end{center}
-\caption{Location of subfields within a single {\tt vtype}$n$ field.}
-\label{fig:vtype}
-\end{figure}
-
-\begin{commentary}
- It was also common in earlier vector machines to support multiple
- precisions within the vector datapath. In particular, the CDC
- STAR-100~\cite{cdcstar100} supported single-precision and
- double-precision floating-point operations and also bit, byte, and
- nibble operations in the vector unit; TI ASC~\cite{tiasc} designs
- supported dividing 64-bit vector lanes into two 32-bit lanes for
- double throughput.
-\end{commentary}
-
-\clearpage
-
-\section{Shape Encoding}
-
-The 5-bit shape field describes the structure of the elements within
-the vector register. In the base vector extension, the shape can be
-set to either scalar or vector.
-
-\begin{table}[hbt]
- \centering
- \begin{tabular}{|c|l|}
- \hline
- {\tt vshape} & Shape \\
- \hline
- 00000 & scalar \\
- 00100 & 1-D vector, length controlled by {\tt vl} \\
- \hline
- \multicolumn{2}{|c|}{All other encodings reserved}\\
- \hline
- \end{tabular}
- \caption{Base vector encoding of {\tt vshape}$n$ field.}
- \label{tab:vshape}
-\end{table}
+The current working group draft is hosted at {\tt
+ https://github.com/riscv/riscv-v-spec}.
 \begin{commentary}
- For the base vector ISA, only a single bit is required in each {\tt
- vshape} field to select between scalar and 1-D vector elements
- with the other bits hardwired to zero.
+The base vector extension is intended to provide general support for
+data-parallel execution within the 32-bit instruction encoding space,
+with later vector extensions supporting richer functionality for
+certain domains.
 \end{commentary}
-\begin{table}[hbt]
- \centering
- \begin{tabular}{|c|l|}
- \hline
- {\tt vshape} & Shape \\
- \hline
- 00000 & scalar \\
- 00001 & {\em Reserved} \\
- 0001x & {\em Reserved} \\
- \hline
- 00100 & 1-D vector {\tt vl} \\
- 01000 & 1-D vector {\tt vm} \\
- 01100 & 1-D vector {\tt vn} \\
- \hline
- 00101 & 2-D matrix {\tt vl} x {\tt vl} \\
- 00110 & 2-D matrix {\tt vl} x {\tt vm} \\
- 00111 & 2-D matrix {\tt vl} x {\tt vn} \\
- \hline
- 01001 & 2-D matrix {\tt vm} x {\tt vl} \\
- 01010 & 2-D matrix {\tt vm} x {\tt vm} \\
- 01011 & 2-D matrix {\tt vm} x {\tt vn} \\
- \hline
- 01101 & 2-D matrix {\tt vn} x {\tt vl} \\
- 01110 & 2-D matrix {\tt vn} x {\tt vm} \\
- 01111 & 2-D matrix {\tt vn} x {\tt vn} \\
- \hline
- 1xxxx & {\em Reserved}/{\em Custom} \\
- \hline
- \end{tabular}
- \caption{Extended encoding of per-vector-register {\tt vshape} field.}
- \label{tab:extvshape}
-\end{table}
-
-\begin{commentary}
- A sketch of the proposed encodings for the 2D shape extension is
- shown in the Table.
-\end{commentary}
-
-\clearpage
-
-\section{Representation Encoding}
-
-The 5-bit {\tt verep}$n$ register sets the numeric representation of
-each element of the vector register. In the base vector
-extension, the representation can be set to unsigned integer,
-two's-complement signed integer, or floating-point. The
-floating-point representations follow the IEEE 754 standards.
-
-\begin{table}[hbtp]
- \centering
- \begin{tabular}{|c|l|}
- \hline
- {\tt verep} & Representation \\
- \hline
- 00000 & Unsigned integer \\
- 00001 & Two's-complement signed integer \\
- 00010 & {\em Reserved (unsigned floating-point?)}\\
- 00011 & IEEE-754 floating-point \\
- \hline
- \multicolumn{2}{|c|}{All other encodings reserved}\\
- \hline
- \end{tabular}
- \caption{Base vector representation encoding.}
- \label{tab:verep}
-\end{table}
-
-\begin{table}[hbtp]
- \centering
- \begin{tabular}{|c|l|}
- \hline
- {\tt verep} & Representation \\
- \hline
- 00000 & Unsigned integer \\
- 00001 & Two's-complement signed integer \\
- 00010 & {\em Reserved (unsigned floating-point)}\\
- 00011 & IEEE-754 floating-point \\
- \hline
- 001x0 & {\em Reserved} \\
- 00101 & Complex signed integer \\
- 00111 & Complex floating-point \\
- \hline
- 01000 & Prime Galois field - integer representation \\
- 01001 & Prime Galois field - Montgomery representation \\
- 01100 & Binary extension Galois field - polynomial basis \\
- 01101 & Binary extension Galois field - normal basis \\
- \hline
- 01010 & UNORM \\
- 01011 & SNORM \\
- 01110 & {\em Reserved} \\
- 01111 & {\em Reserved (complex SNORM?)} \\
- \hline
- 10xxx & Custom representations \\
- \hline
- 11xxx & {\em Reserved} \\
- \hline
- \end{tabular}
- \caption{Extended vector representation encoding.}
- \label{tab:extverep}
-\end{table}
-
-\begin{commentary}
- The complex representations split the element width given in {\tt
- vew}$n$ into two equal-sized real and imaginary fields, so an
- element width of 64 bits can hold a single complex value with a
- 32-bit real and a 32-bit imaginary component.
-\end{commentary}
-
-\clearpage
-
-\section{Element Bitwidth}
-
-Each vector register, {\tt v}$n$, has a 6-bit element width
-register, {\tt vew}$n$, to specify the number of bits for each element
-of the current type in the vector register.
-
-The largest element width supported is
-termed ELEN, and is defined to be the larger of the supported integer
-and floating-point type widths:
-\[ \mbox{\em ELEN} = max(\mbox{\em XLEN}, \mbox{\em FLEN}) \]
-For the base vector ISA, the bit width can be set at any power of two
-between 8 and ELEN.
-
-\begin{table}[hbt]
- \centering
- \begin{tabular}{|c|r|l|}
- \hline
- {\tt vew} & Width & Required in Base \\
- \hline
- 000 000 & disabled & All \\
- 001 000 & 8 & All \\
- 010 000 & 16 & All \\
- 011 000 & 32 & All \\
- 100 000 & 64 & RV32D, RV64, RV128\\
- 101 000 & 128 & RV64Q, RV128\\
- \hline
- \multicolumn{3}{|c|}{All other encodings reserved.}\\
- \hline
- \end{tabular}
- \caption{Base vector ISA encoding of vector element width ({\tt
- vew}$n$) register fields.}
- \label{tab:basevew}
-\end{table}
-
-\begin{table}[hbtp]
- \centering
- \begin{tabular}{|c|r|}
- \hline
- {\tt vew} & Width \\
- \hline
- 000 000 & disabled \\
- 000 001 & 1 \\
- 000 xxx & \multicolumn{1}{r|}{steps of 1}\\
- 000 111 & 7 \\
- \hline
- 001 000 & 8 \\
- 001 xxx & \multicolumn{1}{r|}{steps of 1}\\
- 001 111 & 15 \\
- \hline
- 010 000 & 16 \\
- 010 xxx & \multicolumn{1}{r|}{steps of 2}\\
- 010 111 & 30 \\
- \hline
- 011 000 & 32 \\
- 011 xxx & \multicolumn{1}{r|}{steps of 4}\\
- 011 111 & 60 \\
- \hline
- 100 000 & 64 \\
- 100 xxx & \multicolumn{1}{r|}{steps of 8}\\
- 100 111 & 120 \\
- \hline
- 101 xxx & reserved \\
- \hline
- 110 000 & 128 \\
- 110 001 & 192 \\
- 110 010 & 2048 \\
- 110 011 & 3072 \\
- 110 100 & 512 \\
- 110 101 & 768 \\
- 110 110 & 8192 \\
- 110 111 & 12288 \\
- \hline
- 111 000 & 256 \\
- 111 001 & 384 \\
- 111 010 & 4096 \\
- 111 011 & 6144 \\
- 111 100 & 1024 \\
- 111 101 & 1536 \\
- 111 110 & 16384 \\
- 111 111 & 24576 \\
- \hline
- \end{tabular}
-
- \caption{Proposed extended encoding of vector element width ({\tt
- vew}$n$) register fields. Every bit width between 1 and 16 can
- be supported. Bit widths in steps of 2 between 16 to 32 (i.e.,
- 16, 18, 20, ...). Bit widths in steps of 4 between 32 to 64
- (i.e., 32, 36, 40, ...). Bit widths in steps of 8 between 64 and
- 128 (i.e., 64, 72, 80,...). For bit widths greater than 128, all
- powers-of-two up to 16384 and all widths 1.5$\times$ greater are
- supported (128, 384, 512, 768,...). }
- \label{tab:extvew}
-\end{table}
-
-\begin{commentary}
- The extended bit-width encoding is designed to minimize the number
- of state bits required to support useful subsets of widths. For
- example, an RV32 system only needs two bits of state per {\tt
- vew}$n$ field to represent {\em disabled}, 8, 16, and 32. An
- RV32 system with 3 bits of state can represent {\em disabled}, 4,
- 8, 12, 16, 24, 32, and 48. An RV64 system with 4 bits of state
- can represent {\em disabled}, 4, 8, 12, 16, 24, 32, 48, 64, 96,
- 128, 256, 512, 1024.
-\end{commentary}
-
-\clearpage
-
-\section{Base Vector Extension Supported Types}
-
-The types supported by the base V extension depend upon the base
-scalar ISA and supported extensions. When the base V extension is
-added to a base scalar ISA, it must support the vector data element
-types implied by the supported scalar types as defined by
-Table~\ref{tab:velemtypes}.
-
-\begin{table}[hbt]
- \centering
-\begin{tabular}{|l|l|}
- \hline
- \multicolumn{2}{|c|}{Supported Fixed-Point Formats} \\
- \hline
- RV32I & I8, U8, I16, U16, I32, U32 \\
- RV64I & I8, U8, I16, U16, I32, U32, I64, U64 \\
- RV128I & I8, U8, I16, U16, I32, U32, I64, U64, I128, U128 \\
- \hline
- \hline
- \multicolumn{2}{|c|}{Supported Floating-Point Formats} \\
- \hline
- F & F16, F32 \\
- FD & F16, F32, F64 \\
- FDQ & F16, F32, F64, F128 \\
- \hline
-\end{tabular}
-\caption{Supported data element formats depending on base integer ISA
- and supported floating-point extensions. I$x$ indicates a signed
- integer of $x$ bits, U$x$ indicates an unsigned integer of $x$ bits,
- and F$x$ indicates an IEEE floating-point number of $x$ bits.}
-\label{tab:velemtypes}
-\end{table}
-
-\begin{commentary}
- Future vector extensions might expand the set of supported
- datatypes, including custom application-specific datatypes.
-\end{commentary}
-
-\clearpage
-
-\section{Maximum Vector Element Width ({\tt vmaxew})}
-
-The global {\tt vmaxew} field is used to support more complex vector
-runtime environments where the types to be held in each register of a
-single configuration may vary dynamically, and may not even be known
-at compile time due to separate compilation.
-
-The global maximum element width register {\tt vmaxew} defines the
-maximum number of bits of storage in every element of every active
-architectural register, or if zero, defers to the per-vector-register
-width field.
-
-\begin{commentary}
- The VIRAM processor had a virtual processor width
- register similar to {\tt vmaxew}~\cite{VIRAM}.
-\end{commentary}
-
-If {\tt vmaxew} is zero, then the per-element vector element widths
-{\tt vew}$n$ determine the minimum storage required for each element
-of the associated vector register {\tt v}$n$.
-
-If {\tt vmaxew} is non-zero, it sets the largest element width that
-can be supported in any vector register element in the current
-configuration.
-
-\clearpage
-
-\section{Vector Configuration Registers ({\tt vcfg0}--{\tt vcfg15})}
-
-The vector type configuration requires 512 bits of state (32 vector
-registers each with 16-bit {\tt vtype}$n$ field) that can be accessed
-via the {\tt vcfg CSRs}.
-
-RV128 uses four vector configuration CSRs: {\tt vcfg0} holds
-configuration data for {\tt v0}--{\tt v7} with bits $16n$ to $16n+15$
-holding {\tt vtype}$n$, while {\tt vcfg4}, {\tt vcfg8} and {\tt
- vcfg12} similarly holds configuration data for {\tt v8}--{\tt v15},
- {\tt v16}--{\tt v23}, and {\tt v24}--{\tt v31} respectively.
-
-In RV64, the {\tt vcfg2} CSR provides access to the upper 64 bits of {\tt
- vcfg0} and {\tt vcfg6} provides access to the upper 64 bits of
-{\tt vcfg4}. In RV32, the {\tt vcfg1}, {\tt vcfg3}, {\tt vcfg5}
-and {\tt vcfg7} CSRs provides access to the upper bits of {\tt
- vcfg0}, {\tt vcfg2}, {\tt vcfg4} and {\tt vcfg6} respectively.
-
-Any CSR write to a {\tt vcfg}$x$ register zeros all {\tt vcfg}$y$
-registers, for $y>x$. As a result configuration data should be
-written from the {\tt vcfg0} CSR upwards.
-
-\begin{commentary}
- Zeroing higher-numbered {\tt vcfg}$y$ registers allows more rapid
- reconfiguration of the vector register file via CSR writes, and
- provides backward-compatibility for extensions that increase the
- number of possible architectural vector registers. This choice does
- prevent the use of CSRRW instructions to swap the configuration
- context; an entire old configuration must be read out before a new
- configuration is written in.
-\end{commentary}
-
-Additional instructions are provided to support more rapid changes to
-the vector unit configuration as described below.
-
-\section{Legal Vector Unit Configurations}
-
-To simplify hardware configuration calculations and to reduce software
-context-switch complexity, vector unit configurations are constrained
-to have non-disabled architectural vector registers numbered
-contiguously starting at {\tt v0}. An exception will be raised if an
-instruction tries to change {\tt vtype}$n$ in a way that violates this
-constraint.
-
-\begin{commentary}
- During a software vector-context save, the software handler can stop
- searching for active architectural registers after encountering the
- first disabled vector register. Hardware to calculate physical
- register allocation is also simplified with this constraint.
-\end{commentary}
-
-\clearpage
-
-\section{Vector Unit CSRs}
-
-\begin{table}[hbt]
- \centering
- \begin{tabular}{|l|c|l|l|}
- \hline
- CSR name & Number & Base ISA & Description\\
- \hline
- {\tt vcs} & TBD & RV32, RV64, RV128 & Vector control-status register\\
- {\tt vl} & TBD & RV32, RV64, RV128 & Active vector length\\
- {\tt vxrm} & TBD & RV32, RV64, RV128 & Vector fixed-point rounding mode\\
- {\tt vxsat} & TBD & RV32, RV64, RV128 & Vector fixed-point
- saturation flag \\
- {\tt vmaxew} & TBD & RV32, RV64, RV128 & Global maximum vector element width \\
- \hline
- {\tt vcfg0} & TBD & RV32, RV64, RV128 & \multirow{16}{*}{Vector
- register configuration}\\
- {\tt vcfg1} & TBD & RV32 &\\
- {\tt vcfg2} & TBD & RV32, RV64 &\\
- {\tt vcfg3} & TBD & RV32 &\\
- {\tt vcfg4} & TBD & RV32, RV64, RV128 &\\
- {\tt vcfg5} & TBD & RV32 &\\
- {\tt vcfg6} & TBD & RV32, RV64 &\\
- {\tt vcfg7} & TBD & RV32 &\\
- {\tt vcfg8} & TBD & RV32, RV64, RV128 & \\
- {\tt vcfg9} & TBD & RV32 &\\
- {\tt vcfg10} & TBD & RV32, RV64 &\\
- {\tt vcfg11} & TBD & RV32 &\\
- {\tt vcfg12} & TBD & RV32, RV64, RV128 &\\
- {\tt vcfg13} & TBD & RV32 &\\
- {\tt vcfg14} & TBD & RV32, RV64 &\\
- {\tt vcfg15} & TBD & RV32 &\\
- \hline
- \end{tabular}
- \caption{Vector extension CSRs.}
- \label{tab:vcsrs}
-\end{table}
-
-\clearpage
-
-\section{Maximum Vector Length (MVL)}
-
-The implementation determines an available {\em maximum vector length}
-(MVL) dependent on the current vector type configuration held in {\tt
- vcfg}$x$ and {\tt vmaxew}. The available MVL depends on the
-configuration setting and on the implementation's microarchitecture,
-but MVL must always have the same value for the same configuration
-parameters on a given hart.
-
-\begin{commentary}
- Several earlier vector machines had the ability to configure
- physical vector register storage into a larger number of short
- vectors or a shorter number of long vectors. In particular the
- Fujitsu VP series~\cite{vp200} supported combining power-of-2 base
- vector registers into longer vector registers.
-
- The Scale~\cite{}, Maven~\cite{}, and Hwacha~\cite{} processors also
- support configuration-dependent MVL.
-\end{commentary}
-
-\begin{commentary}
- Previously, the specification imposed a minimum vector length (4) on
- all configurations to allow stripmining code to be removed for short
- vector lengths. With the expanded scope of the vector unit types,
- this would be too onerous to support, and so the requirement is removed.
-\end{commentary}
-
-\begin{discussion}
- A separate mechanism for supporting fixed vector lengths should be
- designed, possibly as part of an optional extension.
-\end{discussion}
-
-Any change to the vector configuration that might change MVL cause the
-entire vector unit state to be zeroed. Any write to the global {\tt
- vmaxew} causes the entire vector unit state to be zeroed, even if
-the value in {\tt vmaxew} is unchanged.
-
-If {\tt vmaxew} is non-zero, any write to an individual {\tt vew}$n$
-register that would set the width greater than {\tt vmaxew} raises an
-illegal instruction exception and leaves the vector unit state
-unchanged.
-
-If {\tt vmaxew} is non-zero, any write to an individual {\tt vew}$n$
-field with a value less than or equal to the value in {\tt vmaxew}
-only zeros the associated vector register {\tt v}$n$ and leaves other
-vector unit state unchanged. The vector register data is zeroed even
-if {\tt vew}$n$ would be unchanged by the write.
-
-If {\tt vmaxew} is zero, then any write to an individual {\tt vew}$n$
-register zeros the associated {\tt v}$n$ vector register. In addition,
-any write that changes the value in {\tt vew}$n$, zeros the entire vector
-unit state.
-
-\begin{commentary}
- The state is zeroed to hide implementation-dependent bit mappings
- and to provide additional security when context swapping. Zero is
- also a convenient initial value for some loops.
-
- In-order implementations will probably use a flag bit per register to
- mux in 0 instead of garbage values on each source until it is
- overwritten. For in-order machines, vector lengths less than MVL
- complicate this zeroing, but these cases can be handled by adding a
- zero bit per element or element group. Machines with vector
- register renaming can just initialize the rename table to point
- entries at a physical zero register.
-\end{commentary}
-
-Each vector register can be reconfigured dynamically to hold different
-formats without zeroing the entire vector unit state provided that: if
-{\tt vmaxew} is zero, the bit-width of the new format is the same as
-the current {\tt vew}; or if {\tt vmaxew} is non-zero, the format does
-not require more than {\tt vmaxew} bits. Any change to a vector
-register's format zeros the affected vector register.
-
-If a vector register is disabled, then any vector instruction
-that attempts to access that vector register will raise an
-illegal instruction exception. Attempting to write any {\tt
- vmaxew}$n$ with an unsupported value will raise an illegal
-instruction exception.
-
-\begin{commentary}
- Vector registers have both a maximum element width and a
- current element data type to allow the same vector register to
- be changed to different types during execution provided the
- maximum width is not exceeded. This reduces register pressure and
- helps support vector function calls, where the caller does not know
- the types needed by the callee, as described below.
-\end{commentary}
-
-\begin{commentary}
- The set of supported types might be greatly increased with future
- extensions. For example (and not limited to), new scalar types in
- new number systems, a complex type with real and imaginary
- components, a key-value type, or an application-specific structure
- type with multiple constituent fields. Auxiliary type
- configuration state might be required in these cases.
-\end{commentary}
-
-Attempting to write an unsupported type or a type that requires more
-than the current {\tt vmaxew} width to a {\tt vetype} field will raise
-an illegal instruction exception.
-
-\begin{commentary}
-Implementations must still raise an exception for a {\tt vetype}$n$
-setting that is greater than the architectural {\tt vmaxew}$n$ width,
-even if they internally implement a larger physical {\tt vmaxew}$n$
-that could accommodate the {\tt vetype}$n$ request.
-\end{commentary}
-
-\begin{discussion}
-We can either have 1) implementations raise exceptions whenever
-illegal values are written to {\tt vmaxew} and {\tt vetype} fields
-(current design), 2) raise exceptions at use if config holds illegal
-values, 3) make the fields WARL so silently reduce to supported types
-with no exceptions. Option 2 could complicate vector unit context
-switch code by having more cases to check, while Option 3 could make
-debugging more difficult by allowing code to run with reduced
-precision or incorrect types.
-\end{discussion}
-
-\begin{commentary}
-Three broad classes of implementation can be distinguished by how they
-handle {\tt vmaxew} settings.
-
-The simplest is {\em max-width-per-implementation} (MWPI), where the
-vector unit is organized in fixed ELEN-width physical lanes, and
-changes to {\tt vmaxew} settings simply cause portions of the
-physical registers and datapath to be disabled for operations narrower
-than ELEN bits.
-
-The next most complex implementation, {\em
- max-width-per-configuration} (MWPC), uses the maximum width across
-all {\tt vmaxew} settings in a dynamic configuration to divide the
-physical register storage and datapaths. For example, a MWPC machine
-with ELEN=64 might subdivide physical lanes into 32-bit datapaths if
-no {\tt vmaxew} setting is greater than 32. Operations on
-sub-32-bit quantities would disable appropriate portions of the
-physical registers and functional units in each 32-bit lane. Several
-early vector supercomputers, including the CDC
-Star-100~\cite{cdcstart100}, provided a similar facility to divide
-64-bit physical vector lanes into narrower 32-bit lanes.
-
-The most complex implementations are {\em max-width-per-register}
-(MWPR), which reduce wasted space in the physical register files by
-packing elements in each vector register according to the individual
-{\tt vmaxew} settings and which within one configuration can
-execute instructions with narrower datatypes at higher rates than for
-wider datatypes. The Berkeley Hwacha vector
-engine~\cite{hwachatr,mixedprecision} is an example microarchitecture
-with this property.
-\end{commentary}
-
-\clearpage
-
-{\bf Following Sections are out-of-date.}
-
-\section{Vector Instruction Formats}
-
-\begin{commentary}
- The instruction encoding is a work in progress.
-
- An important design goal was that the base vector extension fit
- within a few major opcodes of the 32-bit encoding. It is envisioned
- that future vector extensions will use 48-bit or 64-bit encodings to
- increase both the opcode space and the set of architectural
- registers. The 64-bit vector encoding would support 256
- architectural vector registers and orthogonal specification of a
- predicate register in each instruction.
-\end{commentary}
-
-Vector arithmetic and vector memory instructions are encoded in new
-variants of the R-format, shown in Figure~\ref{fig:vinstformats}.
-Both new formats use one bit to hold a {\em vp} field, which usually
-controls the predicate register in use, either {\tt vp0} or {\tt vp1}.
-The VR4 form is used for fused multiply-add instructions. The
-existing RISC-V instruction formats are used for other vector-related
-instructions, such as the vector configuration instructions.
- -\vspace{-0.2in} -\begin{figure}[h] -\begin{center} -\setlength{\tabcolsep}{4pt} -\begin{tabular}{p{0.7in}@{}p{0.4in}@{}p{0.7in}@{}p{0.7in}@{}p{0.5in}@{}p{0.4in}@{}p{0.7in}@{}p{1in}l} -\\ -\instbitrange{31}{27} & -\instbitrange{26}{25} & -\instbitrange{24}{20} & -\instbitrange{19}{15} & -\instbitrange{14}{13} & -\instbit{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\cline{1-8} -\multicolumn{2}{|c|}{funct7} & -\multicolumn{1}{c|}{rs2} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{funct2} & -\multicolumn{1}{c|}{vp} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} & -VR-type \\ -\cline{1-8} -\\ -\cline{1-8} -\multicolumn{1}{|c|}{rs3} & -\multicolumn{1}{c|}{fmt} & -\multicolumn{1}{c|}{rs2} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{funct2} & -\multicolumn{1}{c|}{vp} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} & -VR4-type \\ -\cline{1-8} -\end{tabular} -\end{center} -\caption{New V extension instruction formats. } -\label{fig:vinstformats} -\end{figure} - -Most vector instructions are available in both vector-vector and -vector-scalar variants. Vector-vector instructions take the first -operand from the vector register specified by {\em rs1} and the second -operand from the vector register specified by {\em rs2}. - -For vector-scalar operations, the {\em rs1} field specifies the scalar -register to be accessed. For most vector-scalar instructions, the -type of the vector operand specified by {\em rs2} indicates whether -the integer or floating-point scalar register file is accessed using -the {\em rs1} register specifier. - -Some non-commutative vector-scalar instructions (such as sub) are -provided in two forms, with the scalar value used as the second -operand. - -\begin{commentary} - The {\em rs1} field is used to provide the scalar operand because in - the base encoding, whenever an instruction has a single scalar - source operand, it is encoded in the {\tt rs1} field. 
-\end{commentary} - -\section{Polymorphic Vector Instructions} - -The vector extension uses a polymorphic instruction encoding where the -opcode is combined with the types of the source and destination -registers to determine the operation to be performed. For example, an -ADD opcode will perform a 32-bit integer vector-vector add if both -vector source operands and the vector destination register are 32-bit -integers, but will perform a 16-bit floating-point vector-vector -operation if both vector source operands and the vector destination -are 16-bit floats. - -The polymorphic encoding also naturally supports operations with mixed -precisions on the input and output, and also supports extending the -instruction set with new types without necessarily increasing the -opcode space. - -Not all combinations of source and destination argument types need be -supported. The base vector extension mandates only that -implementations provide a subset of combinations of types on inputs -and outputs. Table~\ref{tab:vtypemix} shows the general rules for -integer and floating-point instructions, but the detailed instruction -listing should be consulted for accurate information. 
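The polymorphic dispatch described in this section can be sketched as a lookup keyed on the opcode together with the operand types. The following Python fragment is only an illustration under stated assumptions: the type tuples, the {\tt SUPPORTED} table, and the {\tt resolve} function are hypothetical names invented for this example, and raising {\tt ValueError} merely stands in for whatever an implementation does with an unsupported combination.

```python
# Hypothetical model of polymorphic dispatch. The type tuples, the table
# contents, and the function name are illustrative assumptions, not part
# of the draft specification.
I16 = ("int", 16)
I32 = ("int", 32)
F16 = ("float", 16)

# (opcode, src1 type, src2 type, dest type) -> concrete operation
SUPPORTED = {
    ("ADD", I32, I32, I32): "32-bit integer vector-vector add",
    ("ADD", F16, F16, F16): "16-bit floating-point vector-vector add",
    ("ADD", I16, I16, I32): "widening 16-bit to 32-bit integer add",
}


def resolve(opcode, src1, src2, dest):
    """Select the operation implied by one opcode and the register types."""
    key = (opcode, src1, src2, dest)
    if key not in SUPPORTED:
        # Stand-in for an unsupported combination of operand types.
        raise ValueError("unsupported type combination: %r" % (key,))
    return SUPPORTED[key]
```

The same {\tt ADD} opcode resolves to different operations purely because the configured register types differ, which is why new types can be added without consuming opcode space.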
- -\begin{table} - \centering - \begin{tabular}{|r|r|r|r|r|} - \hline - \multicolumn{1}{|c|}{Src1} & - \multicolumn{1}{c|}{Src2} & - \multicolumn{1}{c|}{Src3} & - \multicolumn{1}{c|}{Dest} & - \multicolumn{1}{c|}{Example} \\ - \hline - \hline - \multicolumn{5}{|c|}{Integer vector-scalar}\\ - \hline - XLEN & X & - & X & 64b + 32b $\rightarrow$ 32b \\ - XLEN & X & - & 2X & 64b + 8b $\rightarrow$ 16b \\ - \hline - \hline - \multicolumn{5}{|c|}{Integer vector-vector}\\ - \hline - X & X & - & X & 32b + 32b $\rightarrow$ 32b \\ - X & X & - & 2X & 16b + 16b $\rightarrow$ 32b \\ - 2X & X & - & 2X & 64b + 32b $\rightarrow$ 64b \\ - \hline - \hline - \multicolumn{5}{|c|}{Floating-point vector-scalar}\\ - \hline - F & F & - & F & 64b + 64b $\rightarrow$ 64b \\ - F & F & F & F & 32b $\times$ 32b + 32b $\rightarrow$ 32b \\ - F & F & - & 2F & 32b + 32b $\rightarrow$ 64b \\ - F & F & 2F & 2F & 32b $\times$ 32b + 64b $\rightarrow$ 64b \\ - \hline - \hline - \multicolumn{5}{|c|}{Floating-point vector-vector}\\ - \hline - F & F & - & F & 32b + 32b $\rightarrow$ 32b \\ - F & F & - & 2F & 16b + 16b $\rightarrow$ 32b \\ - 2F & F & - & 2F & 64b + 32b $\rightarrow$ 64b \\ - F & F & F & F & 64b $\times$ 64b + 64b $\rightarrow$ 64b \\ - F & F & 2F & 2F & 16b $\times$ 16b + 32b $\rightarrow$ 32b \\ - \hline - \end{tabular} - \caption{General rules for supported types per instruction in base - vector extension. X represents the number of bits in an integer - type and F represents the number of bits in a floating-point type. - Individual instruction types will provide more detailed listings. - Note that the type of a scalar floating-point operand can never be - different from that of the vector in Src2, hence the Src1=2F case - is missing from vector-scalar operations.} - \label{tab:vtypemix} -\end{table} - -A general rule in the base vector instruction set is that the -destination precision is never less than any source operand, except -for explicit type-conversion instructions. 
Another general rule is -that the input operands can only be the same width or half the width -of the destination operand except for the scalar operand in integer -vector-scalar instructions, which is always XLEN wide. Also, src2 is -never larger than src1 or src3. - -Integer computations on mixed-precision values always align values by -their LSB, and sign- or zero-extend any smaller value according to its -type. The result is truncated to fit in the destination type. Note that a -scalar integer value is already XLEN bits wide, and so is as wide as any -possible integer vector value. - -Floating-point computations on mixed-precision values act as if the -calculations are performed exactly then rounded once to the -destination format. - -\section{Rapid Configuration Instructions} - -It can take several CSR instructions to set up the {\tt vcfg} and -{\tt vnp} CSRs for a given configuration. Specialized configuration -instructions are provided to quickly set up common configurations in -the {\tt vcfg} and {\tt vnp} CSRs. - -The {\tt vsetdcfg} instruction takes a scalar register value encoded as -shown in Figure~\ref{fig:vcfg}, and returns the corresponding MVL in -the destination register. The {\tt vsetdcfg} and {\tt vsetdcfgi} -instructions also clear the {\tt vnp} register, so no predicate -registers are allocated. - -\begin{discussion} - For now, only a 32-bit value supporting up to three different vector - data types is supported by the {\tt vsetdcfg} instruction. RV64 and - RV128 could support a larger number of types, though it's not clear if - the hardware cost (area, latency) to support a larger number of - different types is justified.
-\end{discussion} - -\begin{figure}[b] - \centering - \begin{tabular}{p{1cm}p{1cm}ccc|c|c|c|c|c|c|c|l} - \multicolumn{1}{c}{} & - \multicolumn{1}{c}{} & - \multicolumn{1}{c}{} & - \multicolumn{1}{c}{} & - \multicolumn{1}{c}{} & - \multicolumn{1}{c}{} & - \multicolumn{1}{c}{} & - \multicolumn{1}{c}{} & - \multicolumn{1}{c}{} & - \multicolumn{1}{c}{mode} & - \multicolumn{1}{c}{} & - \multicolumn{1}{c}{} & \\ - \cline{6-12} - & & & & & - \tt type2 & \tt ntype2 & - \tt type1 & \tt ntype1 & - 0 & - \tt type0 & \tt ntype0 & \\ - \cline{6-12} - \multicolumn{1}{c}{} & - \multicolumn{1}{c}{} & - \multicolumn{1}{c}{} & - \multicolumn{1}{c}{} & - \multicolumn{1}{c}{} & - \multicolumn{1}{c}{5} & - \multicolumn{1}{c}{5} & - \multicolumn{1}{c}{5} & - \multicolumn{1}{c}{5} & - \multicolumn{1}{c}{2} & - \multicolumn{1}{c}{5} & - \multicolumn{1}{c}{5} & \\ - %% \cline{2-12} - %% & \multicolumn{1}{|c|}{0} & F128 & - %% \multicolumn{1}{c|}{type3} & \multicolumn{1}{c|}{\#type3} & - %% type2 & \#type2 & type1 & \#type1 & 0 & type0 & \#type0 & RV64 \\ - %% \cline{2-12} - %% & & & - %% \multicolumn{1}{c}{} & - %% \multicolumn{1}{c}{24} & - %% \multicolumn{1}{c}{5} & - %% \multicolumn{1}{c}{5} & - %% \multicolumn{1}{c}{5} & - %% \multicolumn{1}{c}{5} & - %% \multicolumn{1}{c}{2} & - %% \multicolumn{1}{c}{5} & - %% \multicolumn{1}{c}{5} & \\ - %% \cline{1-12} - %% \multicolumn{1}{|c|}{0} & \multicolumn{1}{c|}{X128} & - %% \multicolumn{1}{c|}{F128} & I64 & F64 & F32 & F16 & I32 & I16 & I8 & RV128 \\ - %% \cline{1-12} - %% \multicolumn{1}{c}{83} & - %% \multicolumn{1}{c}{5} & - %% \multicolumn{1}{c}{5} & - %% \multicolumn{1}{c}{5} & - %% \multicolumn{1}{c}{5} & - %% \multicolumn{1}{c}{2} & - %% \multicolumn{1}{c}{5} & - %% \multicolumn{1}{c}{5} & \\ - \end{tabular} - \caption{Format of the {\tt vsetdcfg} value. The value contains - three pairs of a 5-bit type and a 5-bit number of registers - to create of that type. 
A value of 0 for the number of a type - indicates that 32 registers should be allocated. A value of 0 for - the type indicates this pair should be skipped. The types must be - of monotonically increasing size from type0 to type2. } - \label{fig:vcfg} -\end{figure} - -The {\tt vsetdcfg} value specifies how many vector registers of each -datatype are allocated, and is divided into a 2-bit mode field and -pairs of 5-bit fields for each data type in the configuration. - -The 2-bit mode field indicates the configuration mode of the vector -unit and is zero for the base vector extension. - -\begin{commentary} - The standard vector extension operating mode configures the vector - unit into some number of vector registers, each with some number of - elements of types supported by the scalar unit. - - At least one alternative mode is planned, where the vector unit is - configured as some number of registers each holding a single large - element, e.g., 256 bits. This would be the base for cryptographic - operations, or other coprocessors that operated on large structures. - - Other modes can be used to reconfigure the vector unit register file - and functional units for other domain-specific purposes. -\end{commentary} - -Each datatype pair contains a 5-bit {\tt type}$x$ value encoded as a -{\tt vetype}$n$ value, and a 5-bit {\tt ntype}$x$ for the number of -registers to allocate for that type. If the {\tt type0} field is -non-zero, the {\tt vsetdcfg} instruction will configure the first {\tt - ntype0} vector data registers to have {\tt vetype}$n$ values of {\tt - type0} with {\tt vmaxew}$n$ values set accordingly as shown in -Table~\ref{tab:vetype}. If the {\tt type0} value is 0, the datatype -pair is skipped. If the {\tt type1} field is non-zero, then the next -{\tt ntype1} vector registers are configured to be of the type given -in {\tt type1}. Similarly for the {\tt type2} pair. - -A value of zero in a {\tt type}$x$ field indicates this datatype pair -should be ignored. 
A value of zero in a {\tt ntype}$x$ field -indicates 32 registers should be allocated for the corresponding type. - -\begin{commentary} -Zero values are skipped to simplify setting a configuration with two -different data types, where a single LUI instruction can set the upper -20 bits leaving the low bits zero. - -A single 12-bit immediate value is sufficient to create a -configuration with some number of vector registers with a single given -datatype. - -A compressed C.LI with a zero-extended 5-bit immediate can create a -configuration with 32 vector registers of a given datatype. -\end{commentary} - -A corresponding {\tt vsetdcfgi} instruction takes a 12-bit immediate -value to set the configuration instead of a scalar value, but -otherwise is identical to the {\tt vsetdcfg} instruction. - -\begin{discussion} -It is not clear how many immediate bits will be made available for the -{\tt vsetdcfgi} instruction. If encoding space is available for both -12 immediate bits and a source register specifier, then {\tt - vsetdcfgi} can be defined to read the source register, OR in the -bits in the immediate, then create a configuration. In this case, -there is no need for a separate {\tt vsetdcfg} instruction. -\end{discussion} - -The configuration value given must result in a legal configuration or -else an illegal instruction exception will be raised. - -If a zero argument is given to {\tt vsetdcfg}, the vector unit will be -disabled and the value 0 will be returned for MVL. This instruction -({\tt vsetdcfg x0, x0}) is given the assembly pseudo-instruction {\tt - vdisable}. - -Separate {\tt vsetpcfg} and {\tt vsetpcfgi} instructions are provided -that write the source value to the {\tt vnp} register and return the -new MVL. These writes also clear the vector data registers, set all -bits in the allocated predicate registers, and set {\tt vl}=MVL.
A -{\tt vsetpcfg} or {\tt vsetpcfgi} instruction can be used after a {\tt - vsetdcfg} to complete a reconfiguration of the vector unit. - -\begin{discussion} - If {\tt vnp} is made accessible as a separate CSR, the {\tt vsetpcfg} - and {\tt vsetpcfgi} instructions are less useful. The only advantage - over a CSR instruction is that they return MVL, which is rarely - needed, and which can be obtained via the {\tt setvl} instruction. -\end{discussion} - -\section{Vector-Type-Change Instructions} - -To quickly change the type of an individual vector register, {\tt - vetyperw} and {\tt vetyperwi} instructions are provided to change -the type of the specified vector data register to the given scalar -register value or 5-bit immediate value respectively, while returning -the previous type in the destination scalar register. - -A vector convert instruction, described below, can simultaneously -convert a source vector register into a new type, and set that type in -the destination vector register. - -\section{Vector Length} - -The active vector length is held in the XLEN-bit WARL vector length -CSR {\tt vl}, which can only hold values between 0 and MVL inclusive. -Any write to the configuration registers ({\tt vcfg}$x$ or {\tt - vnp}) causes {\tt vl} to be initialized with MVL. Changes to {\tt - vetype}$n$ via vector-type-change instructions do not affect {\tt - vl}. - -The active vector length is usually set via the {\tt setvl} -instruction. The source argument to {\tt setvl} is the requested -application vector length (AVL) as an unsigned XLEN-bit integer. The -{\tt setvl} instruction calculates the value to assign to {\tt vl} -according to Table~\ref{tab:vlcalc}. The result of this calculation -is also returned as the result of the {\tt setvl} instruction. - -\begin{commentary} -Earlier drafts encoded {\tt setvl} using a modified CSRRW instruction -whereas it is now encoded as a separate new instruction.
-\end{commentary} - -\begin{table} - \centering - \begin{tabular}{|c|c|} - \hline - AVL Value & {\tt vl} setting \\ - \hline - AVL $\geq$ 2\,MVL & MVL \\ - 2\,MVL $>$ AVL $>$ MVL & $\lceil$AVL$/2\rceil$ \\ - MVL $\geq$ AVL & AVL \\ - \hline - \end{tabular} - \caption{Operation of {\tt setvl} instruction to set vector - length register {\tt vl} based on requested application vector - length (AVL) and current maximum vector length (MVL).} - \label{tab:vlcalc} -\end{table} - -\begin{commentary} - The rules for setting the {\tt vl} register help keep vector - pipelines full over the last two iterations of a stripmined loop. - This version of the rules guarantees monotonically decreasing vector - lengths. - Similar rules were previously used in Cray-designed machines~\cite{crayx1asm}. -\end{commentary} - -\begin{discussion} - There are multiple possible rules for setting VL, and we could give - implementations freedom to use different VL setting rules. -\end{discussion} - -\begin{commentary} - The idea of having implementation-defined vector length dates back - to at least the IBM 3090 Vector Facility~\cite{ibm370varch}, which - used a special ``Load Vector Count and Update'' (VLVCU) instruction - to control stripmine loops. The {\tt setvl} instruction included - here is based on the simpler {\tt setvlr} instruction introduced by - Asanovi\'{c}~\cite{krstephd}. -\end{commentary} - -The {\tt setvl} instruction is typically used at the start of every -iteration of a stripmined loop to set the number of vector elements to -operate on in the following loop iteration. The current MVL can be -obtained from a vector configuration instruction, or by performing a -{\tt setvl} with a source argument that has all bits set (largest -unsigned integer). - -When {\tt vl} is less than MVL, vector instructions will set all -elements in the range [{\tt vl}:MAXVL-1] in the destination vector -data register or destination vector predicate register to zero. 
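As a rough illustration of the rule in Table~\ref{tab:vlcalc}, the following Python sketch models the {\tt setvl} calculation and the sequence of strip lengths that a stripmined loop would observe. The function names {\tt setvl\_model} and {\tt strip\_lengths} are hypothetical and exist only for this example; the sketch is a behavioral model under those assumptions, not part of the specification.

```python
def setvl_model(avl, mvl):
    """Model of the vl value chosen for a requested length avl (AVL)
    and a maximum vector length mvl (MVL), following the three-row table."""
    if avl >= 2 * mvl:
        return mvl                # AVL >= 2 MVL: run a full-length strip
    if avl > mvl:
        return (avl + 1) // 2     # 2 MVL > AVL > MVL: ceil(AVL / 2)
    return avl                    # MVL >= AVL: everything fits in one strip


def strip_lengths(n, mvl):
    """Return the vl used on each iteration of a stripmined loop over n
    elements, mirroring 'setvl t0, a0' followed by 'sub a0, t0'."""
    if mvl <= 0:
        raise ValueError("mvl must be positive (vector unit not disabled)")
    lengths = []
    while n > 0:
        vl = setvl_model(n, mvl)
        lengths.append(vl)
        n -= vl
    return lengths
```

For example, with MVL=4 and N=10, {\tt strip\_lengths(10, 4)} gives {\tt [4, 3, 3]}: the remaining six elements are split evenly over the last two strips instead of running one full strip followed by a short one, and the lengths never increase from one strip to the next.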
- -\begin{commentary} - Requiring zeroing of elements past the current active vector length - simplifies the design of units with renamed vector data registers. - If the specification left destination elements unchanged, renaming - implementations would have to copy the tail of the old destination - register to the newly allocated destination register. - Alternatively, specifying the tail to be undefined will expose - implementation differences and possibly cause a security hole. - - Implementations that do not support renaming will have to zero the - tail of a vector, but this can reuse the mechanism that is already - required to initialize all vector data registers to zero on - reconfiguration, for example, by having a zero bit on each element - or element group. -\end{commentary} - -No element operations are performed for any vector instruction when -{\tt vl}=0. - -\begin{commentary} - Two possible choices are to 1) require destination registers to be - completely zeroed when {\tt vl}=0, or 2) make no changes to the - destination registers. Option 2 is currently chosen as this - prevents unnecessary work in some implementations, and option 1 does - not provide a clear advantage beyond seeming more consistent with - the {\tt vl}>0 case. -\end{commentary} - -\begin{figure}[bt] - \centering -\begin{verbatim} - # Vector-vector 32-bit add loop. - # a0 holds N - # a1 holds pointer to result vector - # a2 holds pointer to first source vector - # a3 holds pointer to second source vector - li t0, (2<<VNTYPE0|VREGF32) - vsetdcfg t0 # Configure with two 32-bit float vectors - - loop: setvl t0, a0 # Set length, get how many elements in strip - vld v0, a2 # Load first vector - sll t1, t0, 2 # Multiply length by 4 to get bytes - add a2, t1 # Bump pointer - vld v1, a3 # Load second vector - add a3, t1 # Bump pointer - vadd v0, v1 # Add elements - sub a0, t0 # Decrement elements completed - vst v0, a1 # Store result vector - add a1, t1 # Bump pointer - bnez a0, loop # Any more?
- - vdisable # Turn off vector unit -\end{verbatim} -\caption{Example vector-vector add loop.} -\label{fig:vvadd} -\end{figure} - -\section{Predicated Execution} - - -\begin{commentary} - The 32-bit base encoding does not leave room for a fully orthogonal - predicate register specifier. A single bit is dedicated to the - predicate register specification, and is used to select between two - active predicate registers, {\tt vp0} or {\tt vp1}. An alternative - scheme would have used the bit to select between {\tt vp0} and - unpredicated (all elements active). However, given the ease of - setting all predicate bits in a vector predicate register with a - single predicate instruction, the current scheme provides more - flexibility. - - When there are no vector predicate registers enabled, {\tt vp0} - returns all set bits when read. So, the assembler convention is to - assume {\tt vp0} as the predicate register when no predicate - register is explicitly given. The assembler can support a strict - operands option to require the vector predicate register is - explicitly specified. -\end{commentary} - -At element positions where the selected predicate register bit is -zero, the corresponding vector element operation has no effect (does -not change architectural state or generate exceptions), except to -write a zero to the element position in the destination vector -register. - -\begin{discussion} - The previous proposal (undisturb) left the destination vector - unchanged at element positions where the predicate bit is false, - whereas the current plan-of-record (zero) writes zero to the - destination where the predicate bit is false. - - The advantage of the undisturb option is that it can require fewer - instructions and fewer architectural registers for many common code - sequences. 
For in-order machines without register renaming, the - undisturb operation simply disables writes to the destination - elements, except for vector registers that have not been written - since configuration time. Typically an extra zero bit per vector - register or element group will be added to represent a zeroed - register instead of actually zeroing state at configuration time. - For predicated undisturb writes to these uninitialized registers, - the predicated false elements must be explicitly written with zeros - on each element group and the zero bit is then cleared down. - However, in a machine with vector register renaming, undisturb does - imply an additional read of the original destination register to - write the value into the new physical destination register when the - predicate is false. This additional read port will often be cheaper - than in a scalar machine as vector machines often time-multiplex - read ports, and the additional read can be skipped when the - predicate registers are disabled ({\tt vnp}=0) or when the source is - known to be zero after configuration, but still adds complexity to a - design. - - The advantage of the zero option is that a machine with vector - register renaming does not need to read the original destination - vector register and so a read port is saved. The disadvantage of - the zero option is that more instructions and architectural - registers are required for common code sequences, and simpler - microarchitectures without register renaming are penalized by - requiring longer code sequences and greater register pressure. In - particular, vector merge instructions are required to collect - results from two divergent control paths, and each vector merge has - to read two vector values and write a vector result. 
Whether the - zero option saves total register file traffic in a register-renamed - microarchitecture depends on the ratio of a) internal temporary - writes, to b) writes creating values that are live out of each basic - block, and also on the frequency of control flow merges. - - Overall, the zero option removes significant complexity from the - renamed machines while reducing efficiency somewhat for the - non-renamed machines, and is the current plan-of-record. -\end{discussion} - -\section{Vector Load/Store Instructions} - -Three vector load/store addressing modes are supported: unit-stride, -constant-stride, and indexed (scatter/gather). Each addressing mode -has a 7-bit unsigned immediate offset that is scaled by the element -size. - -The unit-stride address mode takes a scalar base byte address, adds -the scaled immediate, then generates a contiguous set of element -addresses for loads or stores. - -\begin{commentary} - The primary use of immediates in unit-stride loads is to generate - overlapping unit-stride loads for convolution operations. -\end{commentary} - -The constant-stride address mode takes a scalar base byte address, a -stride value encoded in bytes, and adds a scaled immediate value. - -\begin{commentary} - The stride value is in bytes to allow a single stride register to be - used to support operations on arrays-of-structures, where not all - elements in each structure have the same size. The immediate value - is still scaled by element size to increase reach, given that - element types will be naturally aligned. -\end{commentary} - -The indexed address mode takes a scalar base byte address and a vector -of byte offsets. The scalar base address and the immediate value are -added to each element of the offset vector to give a vector of addresses -used in a scatter/gather. - -Indexed stores are provided in three types: unordered, ordered, and -reverse-ordered.
The unordered indexed stores might update the same -memory location from two different elements in an unspecified order. -The ordered stores always update memory locations in increasing vector -element order. The reverse-ordered stores always update memory -locations in decreasing vector element order. - -\begin{commentary} - The reverse-ordered stores support vectorization of software memory - disambiguation techniques. A reverse-ordered store of element id - into a hash table indexed by a hash on a store access address, - followed by a read of the hash table using a load access address and - a comparison against the original element id, will indicate if - there's a potential RAW hazard with an earlier loop iteration. -\end{commentary} - -\begin{discussion} - It is not clear whether there is sufficient realizable improvement from - supporting unordered stores over ordered stores. -\end{discussion} - -Vector loads/stores have a simple memory model, where each vector -load/store is observed to complete sequentially in program order only -on the local hart, i.e., a vector load on a hart will observe all earlier -vector stores on the same hart, and no later vector stores. - -Vector loads are available in a length-speculative form that writes -predicate register {\tt vp1} in addition to the destination vector -data register. These instructions raise an illegal instruction -exception if {\tt vp1} is not configured. For elements that do not -generate a permissions fault, the length-speculative vector loads -operate as normal except that they also clear the bit in {\tt vp1}. If an -element encounters a permissions fault, a zero is written to the -destination vector register element and the {\tt vp1} bit is set to -1. Implementations may treat elements past the first faulting element -as also causing a fault even if they might not cause a permissions -fault when accessed alone.
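The per-element behavior of a length-speculative load described above can be sketched in Python as follows. This is a hedged behavioral model only: memory is a plain dictionary, the set of faulting addresses is supplied by the caller, and the names {\tt speculative\_load}, {\tt faulting}, {\tt fault\_tail}, and {\tt first\_fault} are assumptions made for this example. The {\tt fault\_tail} flag models the permitted behavior of treating every element past the first faulting element as faulting.

```python
def speculative_load(memory, addrs, faulting, fault_tail=False):
    """Model one length-speculative vector load.

    memory     -- dict mapping element address to value (assumed model)
    addrs      -- element addresses in vector element order
    faulting   -- set of addresses that would raise a permissions fault
    fault_tail -- if True, elements after the first fault also fault
    Returns (dest, vp1): destination elements and predicate bits.
    """
    dest, vp1 = [], []
    seen_fault = False
    for addr in addrs:
        if addr in faulting or (fault_tail and seen_fault):
            seen_fault = True
            dest.append(0)             # zero written to the destination element
            vp1.append(1)              # vp1 bit set to 1 on a fault
        else:
            dest.append(memory[addr])  # normal load of this element
            vp1.append(0)              # vp1 bit cleared when no fault occurs
    return dest, vp1


def first_fault(vp1):
    """Index of the first element flagged in vp1, or None if none faulted."""
    for i, bit in enumerate(vp1):
        if bit:
            return i
    return None
```

Software would compare the result of {\tt first\_fault} with the active vector length it eventually settles on; only a fault at an index below that length needs to be replayed with a non-length-speculative load.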
- -Once software determines the active vector length, it should check if -any loads within the active vector length caused a fault, and in this -case, generate a non-length-speculative load to trigger reporting of -the error. - -\begin{commentary} - Length-speculative vector loads are required to vectorize while - loops, with data-dependent exits (e.g. strlen). - - The only faults ignored by the length-speculative vector loads are - ones that would have resulted in a permissions violation. Page - faults and other virtualization-related faults should be handled - invisibly to the user thread by the execution environment. - - A malicious program can use length-speculative vector loads to probe - accessible address space without fear of a fatal fault. -\end{commentary} - -\section{Vector Register Gather} - -A vector register gather produces a new result data vector by gathering -elements from one source data vector at the element locations -specified by a second source index vector. Data source and -destination vector types must agree. The index vector can have any -integer type. Legal element indices can range from 0 to current -MAXVL. Indices out of this range raise an illegal instruction -exception. - -\begin{verbatim} - # vindices holds values from 0..MAXVL - vrgather vdest, vsrc, vindices -\end{verbatim} - -\section{Vector Slide} - -Reductions (and convolutions) are supported via a vector slide -instruction that takes elements starting from the middle of one vector -and places these at the beginning of a second vector register. This -supports a recursive-halving reduction approach for any binary -associative operator. - -\begin{commentary} - A similar vector register extract instruction was added to the Cray - C90 after memory latency grew too large for the memory-memory - reductions used in earlier Crays. - - The vector unit microarchitecture can be optimized for the - power-of-2 sized element offsets used for reductions. 
-\end{commentary} - - -\section{Fixed-Point Support} - -Clip instruction supports scaling, rounding, and clipping to -destination type. Rounding set by CSR fixed-point rounding mode -(truncate, jam, round-up, round-nearest-even). Clipping set by CSR -clip mode (wrap, saturate). - -Add with average, rounding set by rounding mode. - -Multiply with same size source and destination types, with some result -scaling values (+1, 0, -1, -8?) and rounding and clipping according to -CSR mode. - -Accumulate with carry into predicate register to support larger -precise dot-products. - -\section{Optional Transcendental Support} - -\section{Instruction-Set Encoding} - -\note{This section is out of date.} -On the next two pages is a proposed instruction-set encoding. -\input{v-instr-table} |