\chapter{``V'' Standard Extension for Vector Operations, Version 0.3-DRAFT} \label{sec:bits}
This chapter presents a proposal for the RISC-V vector instruction set extension. The vector extension supports a configurable vector unit, to trade off the number of architectural vector registers and supported element widths against the available maximum vector length. The vector extension is designed to allow the same binary code to work efficiently across a variety of hardware implementations varying in physical vector storage capacity and datapath spatial and/or temporal parallelism. The base vector extension is intended to provide general support for data-parallel execution within the 32-bit encoding space, with later vector extensions supporting richer functionality for certain domains.
\begin{commentary} The vector extension is based on the style of vector register architecture introduced by Seymour Cray in the 1970s, as opposed to the earlier packed SIMD approach, introduced with the Lincoln Labs TX-2 in 1957 and now adopted by most other commercial instruction sets. The vector instruction set contains many features developed in earlier research projects, including the Berkeley T0~\cite{} and VIRAM~\cite{} vector microprocessors, the MIT Scale vector-thread processor~\cite{}, and the Berkeley Maven~\cite{} and Hwacha~\cite{} projects. \end{commentary}
\section{Vector Unit State}
The additional vector unit architectural state consists of 32 vector data registers ({\tt v0}--{\tt v31}), 8 vector predicate registers ({\tt vp0}--{\tt vp7}), and an XLEN-bit WARL vector length CSR, {\tt vl}. In addition, the current configuration of the vector unit is held in a set of vector configuration CSRs ({\tt vdcfg0}--{\tt vdcfg7} and {\tt vnp}), as described below. The implementation determines an available {\em maximum vector length} (MVL) for the current configuration held in the {\tt vdcfg} and {\tt vnp} registers.
There is also a 3-bit fixed-point rounding mode CSR {\tt vxrm}, and a single-bit fixed-point saturation status CSR {\tt vxsat}.
\begin{commentary} Future vector extensions using wider instruction encodings can support more architectural vector registers. For example, a 64-bit encoding could provide 256 architectural vector registers. \end{commentary}
The {\tt vcs} CSR alias provides combined access to the {\tt vl}, {\tt vxrm}, {\tt vxsat}, and {\tt vnp} fields to reduce context switch time. The {\tt vcs} register also includes a configuration mode field to support future extended configuration modes.
\begin{discussion} The components of {\tt vcs} might not need separate CSR addresses, depending on how they are accessed via other non-CSR instructions. \end{discussion}
\begin{table} \centering \begin{tabular}{|l|c|l|l|} \hline CSR name & Number & Base ISA & Description\\ \hline {\tt vcs} & TBD & RV32, RV64, RV128 & Vector control-status register\\ {\tt vl} & TBD & RV32, RV64, RV128 & Active vector length\\ {\tt vxrm} & TBD & RV32, RV64, RV128 & Vector fixed-point rounding mode\\ {\tt vxsat} & TBD & RV32, RV64, RV128 & Vector fixed-point saturation flag \\ \hline {\tt vnp} & TBD & RV32, RV64, RV128 & Number of vector predicate registers\\ \hline {\tt vdcfg0} & TBD & RV32, RV64, RV128 & \multirow{8}{*}{Vector data register configuration}\\ {\tt vdcfg1} & TBD & RV32 &\\ {\tt vdcfg2} & TBD & RV32, RV64 &\\ {\tt vdcfg3} & TBD & RV32 &\\ {\tt vdcfg4} & TBD & RV32, RV64, RV128 &\\ {\tt vdcfg5} & TBD & RV32 &\\ {\tt vdcfg6} & TBD & RV32, RV64 &\\ {\tt vdcfg7} & TBD & RV32 &\\ \hline \end{tabular} \caption{Vector extension CSRs.} \label{tab:vcsrs} \end{table}
The vector unit must be configured before use. Each architectural vector data register ({\tt v0}--{\tt v31}) is configured with the bit width and type of each element of that vector data register, or can be disabled to free physical vector storage for other architectural vector data registers.
The number of available vector predicate registers can also be set independently, from 0 to 8.
\begin{commentary} Several earlier vector machines had the ability to configure physical vector register storage into a larger number of short vectors or a smaller number of long vectors, in particular the Fujitsu VP series~\cite{vp200}. \end{commentary}
The available MVL depends on the configuration setting, but MVL must always have the same value for the same configuration parameters on a given implementation. Implementations must provide an MVL of at least four elements for all supported configuration settings.
\begin{commentary} Specifying a minimum MVL allows operations on known-short vectors to be expressed without requiring stripmining instructions. \end{commentary}
\begin{discussion} Both min(MVL) and max(MVL) might be better expressed as part of a profile. \end{discussion}
Each vector data register's current configuration is described with an 8-bit encoding split into a 3-bit current maximum-width field {\tt vemaxw}$n$ and a 5-bit type field {\tt vetype}$n$, held in the {\tt vdcfg}$x$ CSRs. The configuration state is also accessible via other specialized vector configuration instructions.
\section{Element Datatypes and Width}
The datatypes and operations supported by the V extension depend upon the base scalar ISA and supported extensions, and may include 8-bit, 16-bit, 32-bit, 64-bit, and 128-bit integer and fixed-point data types (X8/X8U, X16/X16U, X32/X32U, X64/X64U, and X128/X128U respectively, where the U suffix indicates unsigned), and 16-bit, 32-bit, 64-bit, and 128-bit floating-point types (F16, F32, F64, and F128 respectively). When the V extension is added, it must support the vector data element types implied by the supported scalar types as defined by Table~\ref{tab:velemtypes}.
The largest supported element width is:
\[ \mbox{\em ELEN} = \max(\mbox{\em XLEN}, \mbox{\em FLEN}) \]
\begin{commentary} Compiler support for vectorization is greatly simplified when any hardware-supported data types are supported by both scalar and vector instructions. \end{commentary}
\begin{table} \centering \begin{tabular}{|l|l|} \hline \multicolumn{2}{|c|}{Supported Fixed-Point Types} \\ \hline RV32I & X8, X8U, X16, X16U, X32, X32U \\ RV64I & X8, X8U, X16, X16U, X32, X32U, X64, X64U \\ RV128I & X8, X8U, X16, X16U, X32, X32U, X64, X64U, X128, X128U \\ \hline \hline \multicolumn{2}{|c|}{Supported Floating-Point Types} \\ \hline F & F16, F32 \\ FD & F16, F32, F64 \\ FDQ & F16, F32, F64, F128 \\ \hline \end{tabular} \caption{Supported data element types depending on base integer ISA and supported floating-point extensions. Signed and unsigned integers are given separate types (e.g., X32 is a signed 32-bit value, whereas X32U is an unsigned 32-bit value). Note that supporting a given floating-point width mandates support for all narrower floating-point widths.} \label{tab:velemtypes} \end{table}
\begin{commentary} Future vector extensions might expand the set of supported datatypes, including custom application-specific datatypes. \end{commentary}
Adding the vector extension to any machine with floating-point support adds support for the IEEE standard half-precision 16-bit floating-point data type. This includes a set of scalar half-precision instructions described in Section~\ref{sec:scalarhalffloat}. The scalar half-precision instructions follow the template for other floating-point precisions, but use the hitherto unused {\em fmt} field encoding of {\tt 10}.
\begin{commentary} There is interest in splitting off the scalar half-precision instructions into their own named extension.
\end{commentary}
\section{Vector Element Width ({\tt vemaxw}$n$)}
The current maximum element width for vector data register $n$ is held in a three-bit field, {\tt vemaxw}$n$, encoded as shown in Table~\ref{tab:vemaxw}.
\begin{table}[hbt] \centering \begin{tabular}{|r|c|} \hline Width & Encoding \\ \hline Disabled & 000 \\ 8 & 100 \\ 16 & 101 \\ 32 & 110 \\ 64 & 111 \\ 128 & 011 \\
%% 256 & 010 \\
%% 512 & 001 \\
\hline \end{tabular} \caption{Encoding of vector element maximum-width fields {\tt vemaxw0}--{\tt vemaxw31}. All other values are reserved.} \label{tab:vemaxw} \end{table}
\begin{commentary} Future extensions might increase the supported vector element widths beyond those of the base scalar ISA, or support smaller non-power-of-2 widths. At least one of the remaining width values should be reserved to support a width-encoding escape to support this larger range of width values. \end{commentary}
\begin{commentary} Three broad classes of implementation can be distinguished by how they handle {\tt vemaxw}$n$ settings. The simplest is {\em max-width-per-implementation} (MWPI), where the vector unit is organized in fixed ELEN-width physical lanes, and changes to {\tt vemaxw}$n$ settings simply cause portions of the physical registers and datapath to be disabled for operations narrower than ELEN bits. The next most complex implementation, {\em max-width-per-configuration} (MWPC), uses the maximum width across all {\tt vemaxw}$n$ settings in a dynamic configuration to divide the physical register storage and datapaths. For example, an MWPC machine with ELEN=64 might subdivide physical lanes into 32-bit datapaths if no {\tt vemaxw}$n$ setting is greater than 32. Operations on sub-32-bit quantities would disable appropriate portions of the physical registers and functional units in each 32-bit lane.
Several early vector supercomputers, including the CDC Star-100~\cite{cdcstart100}, provided a similar facility to divide 64-bit physical vector lanes into narrower 32-bit lanes. The most complex implementations are {\em max-width-per-register} (MWPR), which reduce wasted space in the physical register files by packing elements in each vector register according to the individual {\tt vemaxw}$n$ settings and which within one configuration can execute instructions with narrower datatypes at higher rates than for wider datatypes. The Berkeley Hwacha vector engine~\cite{hwachatr,mixedprecision} is an example microarchitecture with this property. \end{commentary}
Any write to any {\tt vemaxw}$n$ field configures the entire vector unit and causes all vector data registers to be zeroed, all bits in the vector predicate registers to be set, and the vector length register {\tt vl} to be set to the maximum supported vector length.
\begin{commentary} Vector registers are zeroed on reconfiguration to prevent security holes and to avoid exposing differences between how different implementations manage physical vector register storage. In-order implementations will probably use a flag bit per register to mux in 0 instead of garbage values on each source until it is overwritten. For in-order machines, vector lengths less than MVL complicate this zeroing, but these cases can be handled by adding a zero bit per element or element group. Machines with vector register renaming can just initialize the rename table to point entries at a physical zero register. \end{commentary}
If a vector data register is disabled, then any vector instruction that attempts to access that vector data register will raise an illegal instruction exception. Attempting to write any {\tt vemaxw}$n$ with an unsupported value will raise an illegal instruction exception.
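As a non-normative illustration, the width encodings of Table~\ref{tab:vemaxw} can be modelled with the short Python sketch below. The names {\tt VEMAXW\_WIDTH} and {\tt decode\_vemaxw} are hypothetical, and a Python {\tt ValueError} merely stands in for the illegal instruction exception; neither is part of the proposal.

```python
# Width encodings copied from the vemaxw table; all other values are reserved.
VEMAXW_WIDTH = {
    0b000: 0,    # register disabled
    0b100: 8,
    0b101: 16,
    0b110: 32,
    0b111: 64,
    0b011: 128,
}

def decode_vemaxw(enc):
    """Return the element width in bits (0 = disabled) for a 3-bit encoding."""
    if enc not in VEMAXW_WIDTH:
        # Hypothetical stand-in for the illegal instruction exception.
        raise ValueError(f"reserved vemaxw encoding {enc:#05b}")
    return VEMAXW_WIDTH[enc]
```

Note that an actual implementation would additionally reject widths greater than ELEN, which this sketch does not model.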
\section{Vector Element Type ({\tt vetype}$n$)} The current element type of vector data register $n$ is held in a five-bit {\tt vetype}$n$ field encoded as shown in Table~\ref{tab:vetype}. The element type {\tt vetype}$n$ of a vector data register is constrained to have equal or lesser width than the value in the corresponding {\tt vemaxw}$n$ field. A write to a {\tt vetype}$n$ field zeros the associated vector data register {\tt v}$n$, but leaves other vector unit state undisturbed. Changes to {\tt vetype}$n$ do not alter MVL. \begin{table}[hbt] \centering \begin{tabular}{|l|c|c|} \hline Type & {\tt vemaxw} equivalent & {\tt vetype} encoding \\ \hline Disabled & 000 & 00000 \\ \hline \hline \multicolumn{3}{|c|}{Floating-Point types} \\ \hline F16 & 101 & 01101 \\ F32 & 110 & 01110 \\ F64 & 111 & 01111 \\ F128 & 011 & 01011 \\ \hline \hline \multicolumn{3}{|c|}{Signed integer and fixed-point types} \\ \hline X8 & 100 & 10100 \\ X16 & 101 & 10101 \\ X32 & 110 & 10110 \\ X64 & 111 & 10111 \\ X128 & 011 & 10011 \\ \hline \hline \multicolumn{3}{|c|}{Unsigned integer and fixed-point types} \\ \hline X8U & 100 & 11100 \\ X16U & 101 & 11101 \\ X32U & 110 & 11110 \\ X64U & 111 & 11111 \\ X128U & 011 & 11011 \\ \hline \end{tabular} \caption{Encoding of {\tt vetype} fields. All other values are reserved. The middle column shows the value that will be written to {\tt vemaxw}$n$ for configuration instructions that write both {\tt vetype}$n$ and {\tt vemaxw}$n$ fields. For these standard types, {\tt vemaxw}$n$ follows the low three bits of {\tt vetype}$n$. The value of {\tt vetype}$n$ can be changed independently of {\tt vemaxw}$n$ provided the required element width is less than or equal to {\tt vemaxw}$n$.} \label{tab:vetype} \end{table} \begin{commentary} Vector data registers have both a maximum element width and a current element data type to allow the same vector data register to be changed to different types during execution provided the maximum width is not exceeded. 
This reduces register pressure and helps support vector function calls, where the caller does not know the types needed by the callee, as described below. \end{commentary}
\begin{commentary} The set of supported types might be greatly increased with future extensions. Examples include, but are not limited to, new scalar types in new number systems, a complex type with real and imaginary components, a key-value type, or an application-specific structure type with multiple constituent fields. Auxiliary type configuration state might be required in these cases. \end{commentary}
Attempting to write an unsupported type or a type that requires more than the current {\tt vemaxw} width to a {\tt vetype} field will raise an illegal instruction exception.
\begin{commentary} Implementations must still raise an exception for a {\tt vetype}$n$ setting that is greater than the architectural {\tt vemaxw}$n$ width, even if they internally implement a larger physical {\tt vemaxw}$n$ that could accommodate the {\tt vetype}$n$ request. \end{commentary}
\begin{discussion} We can either 1) have implementations raise exceptions whenever illegal values are written to {\tt vemaxw} and {\tt vetype} fields (current design), 2) raise exceptions at use if the configuration holds illegal values, or 3) make the fields WARL, so that writes silently reduce to supported types with no exceptions. Option 2 could complicate vector unit context switch code by having more cases to check, while Option 3 could make debugging more difficult by allowing code to run with reduced precision or incorrect types. \end{discussion}
\section{Vector Predicate Configuration Register ({\tt vnp})}
The {\tt vnp} CSR holds a single 4-bit value giving the number of enabled architectural predicate registers, between 0 and 8. Any write to {\tt vnp} zeros all vector data registers, sets all bits in visible vector predicate registers, and sets the vector length register {\tt vl} to the maximum supported vector length.
Attempting to write a value larger than 8 to {\tt vnp} raises an illegal instruction exception.
\begin{discussion} The number of vector predicate registers supported in the base ISA could be changed. The base encoding could support up to 32 predicate registers, but it is not clear these would be used frequently enough to warrant the increased architectural cost for all implementations. \end{discussion}
When {\tt vnp} is 0, any instruction that reads a vector predicate register other than {\tt vp0} will raise an illegal instruction exception, while reads of {\tt vp0} will return all ones to provide unpredicated execution. When {\tt vnp} is 0, any instruction that attempts to write any vector predicate register will raise an illegal instruction exception.
\section{Vector Data Configuration Registers ({\tt vdcfg0}--{\tt vdcfg7})}
The vector data register configuration requires 256 bits of state (32 vector data registers each with a 3-bit {\tt vemaxw}$n$ field and a 5-bit {\tt vetype}$n$ field), and is held in the {\tt vdcfg} CSRs. RV128 has two vector configuration CSRs: {\tt vdcfg0} holds configuration data for {\tt v0}--{\tt v15} with bits $8n$ to $8n+4$ holding {\tt vetype}$n$ and bits $8n+5$ to $8n+7$ holding {\tt vemaxw}$n$, while {\tt vdcfg4} similarly holds configuration data for {\tt v16}--{\tt v31}. In RV64, the {\tt vdcfg2} CSR provides access to the upper 64 bits of {\tt vdcfg0} and {\tt vdcfg6} provides access to the upper 64 bits of {\tt vdcfg4}. In RV32, the {\tt vdcfg1}, {\tt vdcfg3}, {\tt vdcfg5} and {\tt vdcfg7} CSRs provide access to the upper bits of {\tt vdcfg0}, {\tt vdcfg2}, {\tt vdcfg4} and {\tt vdcfg6} respectively. Any CSR write to a {\tt vdcfg}$x$ register zeros all {\tt vdcfg}$y$ registers, for $y>x$, and also zeros the {\tt vnp} register. As a result, configuration data should be written from the {\tt vdcfg0} CSR upwards, followed by the {\tt vnp} setting if non-zero.
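The field layout above can be illustrated with a small Python sketch that treats the configuration as a single 256-bit image, with the 8-bit field for {\tt v}$n$ starting at bit $8n$, and then slices that image into the {\tt vdcfg} CSRs visible at a given XLEN. This is only a reading of the text above, not a normative definition: the helper names {\tt field}, {\tt pack\_config}, and {\tt split\_csrs} are hypothetical, and the sketch models only the bit layout, not the side effects of CSR writes.

```python
def field(vemaxw, vetype):
    """Build the 8-bit per-register field: vetype in bits 4:0, vemaxw in bits 7:5."""
    return ((vemaxw & 0b111) << 5) | (vetype & 0b11111)

def pack_config(fields):
    """Pack up to 32 per-register fields into one 256-bit integer image.

    fields[n] describes vector data register v<n>; missing entries stay 0
    (disabled).
    """
    image = 0
    for n, f in enumerate(fields[:32]):
        image |= (f & 0xFF) << (8 * n)
    return image

def split_csrs(image, xlen):
    """Return {csr_index: value} for the vdcfg CSRs that exist at this XLEN.

    Assumes vdcfg<k> starts at bit 32*k of the image, so RV32 sees
    vdcfg0..7, RV64 sees vdcfg0/2/4/6, and RV128 sees vdcfg0/4.
    """
    step = xlen // 32
    mask = (1 << xlen) - 1
    return {k: (image >> (32 * k)) & mask for k in range(0, 8, step)}
```

For instance, an X32 register has {\tt vetype}=10110 and {\tt vemaxw}=110, so {\tt field(0b110, 0b10110)} yields the byte {\tt 0xD6}; with {\tt v0}--{\tt v4} configured this way, an RV32 system would hold {\tt 0xD6D6D6D6} in {\tt vdcfg0} and {\tt 0xD6} in {\tt vdcfg1} under this sketch.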
\begin{commentary} Zeroing higher-numbered {\tt vdcfg}$y$ registers allows more rapid reconfiguration of the vector register file via CSR writes, and provides backward-compatibility for extensions that increase the number of possible architectural vector registers. This choice does prevent the use of CSRRW instructions to swap the configuration context. \end{commentary}
\begin{commentary} Additional instructions are provided to support more rapid changes to the vector unit configuration as described below. These directly affect the {\tt vemaxw}$n$ and {\tt vetype}$n$ fields and do not necessarily have the same side effects as the CSR writes through the {\tt vdcfg}$n$ addresses. \end{commentary}
\section{Legal Vector Unit Configurations}
To simplify hardware configuration calculations and to reduce software context-switch complexity, vector unit configurations are constrained to have non-disabled architectural vector registers numbered contiguously starting at {\tt v0}. Also, {\tt vemaxw}$m$ must be greater than or equal to {\tt vemaxw}$n$, for $m > n$, i.e., configured element widths must not decrease with increasing architectural vector register number. An exception will be raised if any instruction tries to change {\tt vemaxw}$n$ in a way that violates this constraint.
\begin{commentary} During a software vector-context save, the software handler can stop searching for active architectural registers after encountering the first disabled vector register. Hardware to calculate physical register allocation might be slightly simplified with this constraint, and might be able to pack register storage more tightly with monotonically increasing element size. In a vector-function calling convention, higher-numbered registers are usually made available to the callee, and must usually hold wider, often ELEN-width, elements.
The context that configures the vector unit might have known-narrower element types and can save storage by configuring the lower-numbered architectural vector registers accordingly. \end{commentary}
\section{Vector Instruction Formats}
\begin{commentary} The instruction encoding is a work in progress. An important design goal was that the base vector extension fit within a few major opcodes of the 32-bit encoding. It is envisioned that future vector extensions will use 48-bit or 64-bit encodings to increase both the opcode space and the set of architectural registers. The 64-bit vector encoding would support 256 architectural vector registers and orthogonal specification of a predicate register in each instruction. \end{commentary}
Vector arithmetic and vector memory instructions are encoded in new variants of the R-format, shown in Figure~\ref{fig:vinstformats}. Both new formats use one bit to hold a {\em vp} field, which usually controls the predicate register in use, either {\tt vp0} or {\tt vp1}. The VR4 form is used for fused multiply-add instructions. The existing RISC-V instruction formats are used for other vector-related instructions, such as the vector configuration instructions.
\vspace{-0.2in} \begin{figure}[h] \begin{center} \setlength{\tabcolsep}{4pt} \begin{tabular}{p{0.7in}@{}p{0.4in}@{}p{0.7in}@{}p{0.7in}@{}p{0.5in}@{}p{0.4in}@{}p{0.7in}@{}p{1in}l} \\ \instbitrange{31}{27} & \instbitrange{26}{25} & \instbitrange{24}{20} & \instbitrange{19}{15} & \instbitrange{14}{13} & \instbit{12} & \instbitrange{11}{7} & \instbitrange{6}{0} \\ \cline{1-8} \multicolumn{2}{|c|}{funct7} & \multicolumn{1}{c|}{rs2} & \multicolumn{1}{c|}{rs1} & \multicolumn{1}{c|}{funct2} & \multicolumn{1}{c|}{vp} & \multicolumn{1}{c|}{rd} & \multicolumn{1}{c|}{opcode} & VR-type \\ \cline{1-8} \\ \cline{1-8} \multicolumn{1}{|c|}{rs3} & \multicolumn{1}{c|}{fmt} & \multicolumn{1}{c|}{rs2} & \multicolumn{1}{c|}{rs1} & \multicolumn{1}{c|}{funct2} & \multicolumn{1}{c|}{vp} & \multicolumn{1}{c|}{rd} & \multicolumn{1}{c|}{opcode} & VR4-type \\ \cline{1-8} \end{tabular} \end{center} \caption{New V extension instruction formats. } \label{fig:vinstformats} \end{figure} Most vector instructions are available in both vector-vector and vector-scalar variants. Vector-vector instructions take the first operand from the vector register specified by {\em rs1} and the second operand from the vector register specified by {\em rs2}. For vector-scalar operations, the {\em rs1} field specifies the scalar register to be accessed. For most vector-scalar instructions, the type of the vector operand specified by {\em rs2} indicates whether the integer or floating-point scalar register file is accessed using the {\em rs1} register specifier. Some non-commutative vector-scalar instructions (such as sub) are provided in two forms, with the scalar value used as the second operand. \begin{commentary} The {\em rs1} field is used to provide the scalar operand because in the base encoding, whenever an instruction has a single scalar source operand, it is encoded in the {\tt rs1} field. 
\end{commentary} \section{Polymorphic Vector Instructions} The vector extension uses a polymorphic instruction encoding where the opcode is combined with the types of the source and destination registers to determine the operation to be performed. For example, an ADD opcode will perform a 32-bit integer vector-vector add if both vector source operands and the vector destination register are 32-bit integers, but will perform a 16-bit floating-point vector-vector operation if both vector source operands and the vector destination are 16-bit floats. The polymorphic encoding also naturally supports operations with mixed precisions on the input and output, and also supports extending the instruction set with new types without necessarily increasing the opcode space. Not all combinations of source and destination argument types need be supported. The base vector extension mandates only that implementations provide a subset of combinations of types on inputs and outputs. Table~\ref{tab:vtypemix} shows the general rules for integer and floating-point instructions, but the detailed instruction listing should be consulted for accurate information. 
\begin{table} \centering \begin{tabular}{|r|r|r|r|r|} \hline \multicolumn{1}{|c|}{Src1} & \multicolumn{1}{c|}{Src2} & \multicolumn{1}{c|}{Src3} & \multicolumn{1}{c|}{Dest} & \multicolumn{1}{c|}{Example} \\ \hline \hline \multicolumn{5}{|c|}{Integer vector-scalar}\\ \hline XLEN & X & - & X & 64b + 32b $\rightarrow$ 32b \\ XLEN & X & - & 2X & 64b + 8b $\rightarrow$ 16b \\ \hline \hline \multicolumn{5}{|c|}{Integer vector-vector}\\ \hline X & X & - & X & 32b + 32b $\rightarrow$ 32b \\ X & X & - & 2X & 16b + 16b $\rightarrow$ 32b \\ 2X & X & - & 2X & 64b + 32b $\rightarrow$ 64b \\ \hline \hline \multicolumn{5}{|c|}{Floating-point vector-scalar}\\ \hline F & F & - & F & 64b + 64b $\rightarrow$ 64b \\ F & F & F & F & 32b $\times$ 32b + 32b $\rightarrow$ 32b \\ F & F & - & 2F & 32b + 32b $\rightarrow$ 64b \\ F & F & 2F & 2F & 32b $\times$ 32b + 64b $\rightarrow$ 64b \\ \hline \hline \multicolumn{5}{|c|}{Floating-point vector-vector}\\ \hline F & F & - & F & 32b + 32b $\rightarrow$ 32b \\ F & F & - & 2F & 16b + 16b $\rightarrow$ 32b \\ 2F & F & - & 2F & 64b + 32b $\rightarrow$ 64b \\ F & F & F & F & 64b $\times$ 64b + 64b $\rightarrow$ 64b \\ F & F & 2F & 2F & 16b $\times$ 16b + 32b $\rightarrow$ 32b \\ \hline \end{tabular} \caption{General rules for supported types per instruction in base vector extension. X represents the number of bits in an integer type and F represents the number of bits in a floating-point type. Individual instruction types will provide more detailed listings. Note that the type of a scalar floating-point operand can never be different from that of the vector in Src2, hence the Src1=2F case is missing from vector-scalar operations.} \label{tab:vtypemix} \end{table} A general rule in the base vector instruction set is that the destination precision is never less than any source operand, except for explicit type-conversion instructions. 
Another general rule is that the input operands can only be the same width or half the width of the destination operand except for the scalar operand in integer vector-scalar instructions, which is always XLEN wide. Also, src2 is never larger than src1 or src3. Integer computations on mixed-precision values always align values by their LSB, and sign- or zero-extend any smaller value according to its type. The result is truncated to fit in the destination type. Note that a scalar integer value is already XLEN bits wide, and so is at least as wide as any possible integer vector element. Floating-point computations on mixed-precision values act as if the calculations are performed exactly and then rounded once to the destination format.
\section{Rapid Configuration Instructions}
It can take several CSR instructions to set up the {\tt vdcfg} and {\tt vnp} CSRs for a given configuration. Specialized configuration instructions are provided to quickly set up common configurations in the {\tt vdcfg} and {\tt vnp} CSRs. The {\tt vsetdcfg} instruction takes a scalar register value encoded as shown in Figure~\ref{fig:vdcfg}, and returns the corresponding MVL in the destination register. The {\tt vsetdcfg} and {\tt vsetdcfgi} instructions also clear the {\tt vnp} register, so no predicate registers are allocated.
\begin{discussion} For now, only a 32-bit value supporting up to three different vector data types is supported by the {\tt vsetdcfg} instruction. RV64 and RV128 could support a larger number of types, though it is not clear if the hardware cost (area, latency) to support a larger number of different types is justified.
\end{discussion}
\begin{figure}[b] \centering \begin{tabular}{p{1cm}p{1cm}ccc|c|c|c|c|c|c|c|l}
\multicolumn{1}{c}{} & \multicolumn{1}{c}{} & \multicolumn{1}{c}{} & \multicolumn{1}{c}{} & \multicolumn{1}{c}{} & \multicolumn{1}{c}{} & \multicolumn{1}{c}{} & \multicolumn{1}{c}{} & \multicolumn{1}{c}{} & \multicolumn{1}{c}{mode} & \multicolumn{1}{c}{} & \multicolumn{1}{c}{} & \\ \cline{6-12}
& & & & & \tt type2 & \tt ntype2 & \tt type1 & \tt ntype1 & 0 & \tt type0 & \tt ntype0 & \\ \cline{6-12}
\multicolumn{1}{c}{} & \multicolumn{1}{c}{} & \multicolumn{1}{c}{} & \multicolumn{1}{c}{} & \multicolumn{1}{c}{} & \multicolumn{1}{c}{5} & \multicolumn{1}{c}{5} & \multicolumn{1}{c}{5} & \multicolumn{1}{c}{5} & \multicolumn{1}{c}{2} & \multicolumn{1}{c}{5} & \multicolumn{1}{c}{5} & \\
%% \cline{2-12}
%% & \multicolumn{1}{|c|}{0} & F128 &
%% \multicolumn{1}{c|}{type3} & \multicolumn{1}{c|}{\#type3} &
%% type2 & \#type2 & type1 & \#type1 & 0 & type0 & \#type0 & RV64 \\
%% \cline{2-12}
%% & & &
%% \multicolumn{1}{c}{} &
%% \multicolumn{1}{c}{24} &
%% \multicolumn{1}{c}{5} &
%% \multicolumn{1}{c}{5} &
%% \multicolumn{1}{c}{5} &
%% \multicolumn{1}{c}{5} &
%% \multicolumn{1}{c}{2} &
%% \multicolumn{1}{c}{5} &
%% \multicolumn{1}{c}{5} & \\
%% \cline{1-12}
%% \multicolumn{1}{|c|}{0} & \multicolumn{1}{c|}{X128} &
%% \multicolumn{1}{c|}{F128} & X64 & F64 & F32 & F16 & X32 & X16 & X8 & RV128 \\
%% \cline{1-12}
%% \multicolumn{1}{c}{83} &
%% \multicolumn{1}{c}{5} &
%% \multicolumn{1}{c}{5} &
%% \multicolumn{1}{c}{5} &
%% \multicolumn{1}{c}{5} &
%% \multicolumn{1}{c}{2} &
%% \multicolumn{1}{c}{5} &
%% \multicolumn{1}{c}{5} & \\
\end{tabular}
\caption{Format of the {\tt vsetdcfg} value. The value contains three pairs of a 5-bit type and a 5-bit number of registers to create of that type. A value of 0 for the number of a type indicates that 32 registers should be allocated. A value of 0 for the type indicates this pair should be skipped.
The types must be of monotonically increasing size from type0 to type2. } \label{fig:vdcfg} \end{figure}
The {\tt vsetdcfg} value specifies how many vector registers of each datatype are allocated, and is divided into a 2-bit mode field and pairs of 5-bit fields for each data type in the configuration. The 2-bit mode field indicates the configuration mode of the vector unit and is zero for the base vector extension.
\begin{commentary} The standard vector extension operating mode configures the vector unit into some number of vector registers, each with some number of elements of types supported by the scalar unit. At least one alternative mode is planned, where the vector unit is configured as some number of registers each holding a single large element, e.g., 256 bits. This would be the base for cryptographic operations, or other coprocessors that operate on large structures. Other modes can be used to reconfigure the vector unit register file and functional units for other domain-specific purposes. \end{commentary}
Each datatype pair contains a 5-bit {\tt type}$x$ value encoded as a {\tt vetype}$n$ value, and a 5-bit {\tt ntype}$x$ for the number of registers to allocate for that type. If the {\tt type0} field is non-zero, the {\tt vsetdcfg} instruction will configure the first {\tt ntype0} vector data registers to have {\tt vetype}$n$ values of {\tt type0} with {\tt vemaxw}$n$ values set accordingly as shown in Table~\ref{tab:vetype}. If the {\tt type1} field is non-zero, then the next {\tt ntype1} vector registers are configured to be of the type given in {\tt type1}. The {\tt type2} pair is handled similarly. A value of zero in a {\tt type}$x$ field indicates this datatype pair should be ignored. A value of zero in a {\tt ntype}$x$ field indicates 32 registers should be allocated for the corresponding type.
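The allocation rules above can be sketched in Python as follows. The sketch models only how the ({\tt type}$x$, {\tt ntype}$x$) pairs expand into per-register {\tt vetype} values; it deliberately does not model the bit positions of the fields within the {\tt vsetdcfg} value. The names {\tt VETYPE\_WIDTH} and {\tt expand\_pairs} are hypothetical, {\tt ValueError} stands in for the illegal instruction exception, and two points are assumptions rather than statements of the proposal: that equal widths in successive pairs are permitted, and that requesting more than 32 registers in total is illegal.

```python
# Element width in bits for each vetype encoding in the vetype table.
VETYPE_WIDTH = {
    0b01101: 16, 0b01110: 32, 0b01111: 64, 0b01011: 128,            # F16..F128
    0b10100: 8, 0b10101: 16, 0b10110: 32, 0b10111: 64, 0b10011: 128,  # X8..X128
    0b11100: 8, 0b11101: 16, 0b11110: 32, 0b11111: 64, 0b11011: 128,  # X8U..X128U
}

def expand_pairs(pairs):
    """Expand [(type0, ntype0), (type1, ntype1), (type2, ntype2)] into a
    list whose entry n is the vetype configured for register v<n>."""
    regs = []
    last_width = 0
    for vetype, ntype in pairs:
        if vetype == 0:
            continue                      # a zero type skips the pair
        if vetype not in VETYPE_WIDTH:
            raise ValueError("reserved vetype encoding")
        width = VETYPE_WIDTH[vetype]
        if width < last_width:
            # Assumption: widths may repeat but must never decrease.
            raise ValueError("element widths must not decrease")
        count = 32 if ntype == 0 else ntype   # ntype of 0 means 32 registers
        regs.extend([vetype] * count)
        last_width = width
    if len(regs) > 32:
        # Assumption: over-allocation is treated as an illegal configuration.
        raise ValueError("more than 32 vector data registers requested")
    return regs
```

For example, {\tt expand\_pairs([(0b10101, 4), (0b10110, 2), (0, 0)])} describes four X16 registers ({\tt v0}--{\tt v3}) followed by two X32 registers ({\tt v4}--{\tt v5}), with all remaining registers left disabled.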
\begin{commentary} Zero values are skipped to simplify setting a configuration with two different data types, where a single LUI instruction can set the upper 20 bits leaving the low bits zero. A single 12-bit immediate value is sufficient to create a configuration with some number of vector registers with a single given datatype. A compressed C.LI with a zero-extended 5-bit immediate can create a configuration with 32 vector registers of a given datatype. \end{commentary}
A corresponding {\tt vsetdcfgi} instruction takes a 12-bit immediate value to set the configuration instead of a scalar value, but otherwise is identical to the {\tt vsetdcfg} instruction.
\begin{discussion} It is not clear how many immediate bits will be made available for the {\tt vsetdcfgi} instruction. If encoding space is available for both 12 immediate bits and a source register specifier, then {\tt vsetdcfgi} can be defined to read the source register, OR in the bits in the immediate, then create a configuration. In this case, there is no need for a separate {\tt vsetdcfg} instruction. \end{discussion}
The configuration value given must result in a legal configuration or else an illegal instruction exception will be raised. If a zero argument is given to {\tt vsetdcfg}, the vector unit will be disabled and the value 0 will be returned for MVL. This instruction ({\tt vsetdcfg x0, x0}) is given the assembly pseudo-instruction name {\tt vdisable}. Separate {\tt vsetpcfg} and {\tt vsetpcfgi} instructions are provided that write the source value to the {\tt vnp} register and return the new MVL. These writes also clear the vector data registers, set all bits in the allocated predicate registers, and set {\tt vl}=MVL. A {\tt vsetpcfg} or {\tt vsetpcfgi} instruction can be used after a {\tt vsetdcfg} to complete a reconfiguration of the vector unit.
\begin{discussion} If {\tt vnp} is made accessible as a separate CSR, the {\tt vsetpcfg} and {\tt vsetpcfgi} instructions are less useful.
The only advantage over a CSR instruction is that they return MVL, which is rarely needed, and which can be obtained via the {\tt setvl} instruction.
\end{discussion}

\section{Vector-Type-Change Instructions}

To quickly change the type of an individual vector register, {\tt vetyperw} and {\tt vetyperwi} instructions are provided to change the type of the specified vector data register to the given scalar register value or 5-bit immediate value respectively, while returning the previous type in the destination scalar register.

A vector convert instruction, described below, can simultaneously convert a source vector register into a new type, and set that type in the destination vector register.

\section{Vector Length}

The active vector length is held in the XLEN-bit WARL vector length CSR {\tt vl}, which can only hold values between 0 and MVL inclusive. Any write to the configuration registers ({\tt vdcfg}$x$ or {\tt vnp}) causes {\tt vl} to be initialized with MVL. Changes to {\tt vetype}$n$ via vector-type-change instructions do not affect {\tt vl}.

The active vector length is usually set via the {\tt setvl} instruction. The source argument to {\tt setvl} is the requested application vector length (AVL) as an unsigned XLEN-bit integer. The {\tt setvl} instruction calculates the value to assign to {\tt vl} according to Table~\ref{tab:vlcalc}. The result of this calculation is also returned as the result of the {\tt setvl} instruction.

\begin{commentary}
Earlier drafts encoded {\tt setvl} using a modified CSRRW instruction whereas it is now encoded as a separate new instruction.
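
As an illustration of Table~\ref{tab:vlcalc}, assume a hypothetical MVL of 8 and a stripmined loop that executes {\tt setvl} on the remaining element count at the start of each iteration. Starting from an AVL of 19, the successive settings would be:

\begin{center}
\begin{tabular}{|c|c|c|c|}
\hline
Iteration & AVL & Rule applied & {\tt vl} setting \\
\hline
1 & 19 & AVL $\geq$ 2\,MVL & 8 \\
2 & 11 & 2\,MVL $>$ AVL $>$ MVL & $\lfloor 11/2 \rfloor = 5$ \\
3 & 6 & MVL $\geq$ AVL & 6 \\
\hline
\end{tabular}
\end{center}

The three iterations process $8 + 5 + 6 = 19$ elements. Under these assumed values, the last two iterations handle 5 and 6 elements rather than 8 and 3.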
\end{commentary}

\begin{table}
\centering
\begin{tabular}{|c|c|}
\hline
AVL Value & {\tt vl} setting \\
\hline
AVL $\geq$ 2\,MVL & MVL \\
2\,MVL $>$ AVL $>$ MVL & $\lfloor$AVL$/2\rfloor$ \\
MVL $\geq$ AVL & AVL \\
\hline
\end{tabular}
\caption{Operation of {\tt setvl} instruction to set vector length register {\tt vl} based on requested application vector length (AVL) and current maximum vector length (MVL).}
\label{tab:vlcalc}
\end{table}

\begin{commentary}
The rules for setting the {\tt vl} register help keep vector pipelines full over the last two iterations of a stripmined loop. Similar rules were previously used in Cray-designed machines~\cite{crayx1asm}.
\end{commentary}

\begin{discussion}
There are multiple possible rules for setting {\tt vl}, and we could give implementations freedom to use different rules.
\end{discussion}

\begin{commentary}
The idea of having implementation-defined vector length dates back to at least the IBM 3090 Vector Facility~\cite{ibm370varch}, which used a special ``Load Vector Count and Update'' (VLVCU) instruction to control stripmine loops. The {\tt setvl} instruction included here is based on the simpler {\tt setvlr} instruction introduced by Asanovi\'{c}~\cite{krstephd}.
\end{commentary}

The {\tt setvl} instruction is typically used at the start of every iteration of a stripmined loop to set the number of vector elements to operate on in the following loop iteration. The current MVL can be obtained from a vector configuration instruction, or by performing a {\tt setvl} with a source argument that has all bits set (largest unsigned integer).

When {\tt vl} is less than MVL, vector instructions will set all elements in the range [{\tt vl}:MVL$-$1] in the destination vector data register or destination vector predicate register to zero.

\begin{commentary}
Requiring zeroing of elements past the current active vector length simplifies the design of units with renamed vector data registers.
If the specification left destination elements unchanged, renaming implementations would have to copy the tail of the old destination register to the newly allocated destination register. Alternatively, specifying the tail to be undefined would expose implementation differences and could open a security hole. Implementations that do not support renaming will have to zero the tail of a vector, but this can reuse the mechanism that is already required to initialize all vector data registers to zero on reconfiguration, for example, by having a zero bit on each element or element group.
\end{commentary}

No element operations are performed for any vector instruction when {\tt vl}=0.

\begin{commentary}
Two possible choices are to 1) require destination registers to be completely zeroed when {\tt vl}=0, or 2) make no changes to the destination registers. Option 2 is currently chosen as it avoids unnecessary work in some implementations, and option 1 does not provide a clear advantage beyond seeming more consistent with the {\tt vl}$>$0 case.
\end{commentary}

\begin{figure}[bt] \centering \begin{verbatim} # Vector-vector 32-bit add loop. # a0 holds N # a1 holds pointer to result vector # a2 holds pointer to first source vector # a3 holds pointer to second source vector li t0, (2<