\chapter{``P'' Standard Extension for Packed-SIMD Instructions, Version 0.1}
\label{sec:packedsimd}

\begin{commentary}
Discussions at the 5th RISC-V workshop indicated a desire to drop this packed-SIMD proposal for floating-point registers in favor of standardizing on the V extension for large floating-point SIMD operations. However, there was interest in packed-SIMD fixed-point operations for use in the integer registers of small RISC-V implementations.
\end{commentary}

In this chapter, we outline a standard packed-SIMD extension for RISC-V. We've reserved the instruction subset name ``P'' for a future standard set of packed-SIMD extensions. Many other extensions can build upon a packed-SIMD extension, taking advantage of the wide data registers and datapaths separate from the integer unit.

\begin{commentary}
Packed-SIMD extensions, first introduced with the Lincoln Labs TX-2~\cite{tx2}, have become a popular way to provide higher throughput on data-parallel codes. Earlier commercial microprocessor implementations include the Intel i860, HP PA-RISC MAX~\cite{lee-max-ieeemicro1996}, SPARC VIS~\cite{tremblay-vis-ieeemicro1996}, MIPS MDMX~\cite{gwennap-mdmx-mpr1996}, PowerPC AltiVec~\cite{diefendorff-altivec-ieeemicro2000}, and Intel x86 MMX/SSE~\cite{peleg-mmx-ieeemicro1996, raman-sse-ieeemicro2000}, while recent designs include Intel x86 AVX~\cite{lomont-avx-irm2011} and ARM Neon~\cite{goodacre-armisa-computer2005}.

We describe a standard framework for adding packed-SIMD in this chapter, but are not actively working on such a design. In our opinion, packed-SIMD designs represent a reasonable design point when reusing existing wide datapath resources, but if significant additional resources are to be devoted to data-parallel execution then designs based on traditional vector architectures are a better choice and should use the V extension.
\end{commentary}

A RISC-V packed-SIMD extension reuses the floating-point registers ({\tt f0}--{\tt f31}).
These registers can be defined to have widths of FLEN=32 to FLEN=1024. The standard floating-point instruction subsets require registers of width 32 bits (``F''), 64 bits (``D''), or 128 bits (``Q'').

\begin{commentary}
It is natural to use the floating-point registers for packed-SIMD values rather than the integer registers (as in the PA-RISC and Alpha packed-SIMD extensions) as this frees the integer registers for control and address values, simplifies reuse of scalar floating-point units for SIMD floating-point execution, and leads naturally to a decoupled integer/floating-point hardware design. The floating-point load and store instruction encodings also have space to handle wider packed-SIMD registers. However, reusing the floating-point registers for packed-SIMD values does make it more difficult to use a recoded internal format for floating-point values.
\end{commentary}

The existing floating-point load and store instructions are used to move various-sized words between memory and the {\tt f} registers. The base ISA supports 32-bit and 64-bit loads and stores, but the LOAD-FP and STORE-FP instruction encodings allow 8 different widths to be encoded as shown in Table~\ref{psimdwidth}. When these loads and stores are used with packed-SIMD operations, it is desirable to support non-naturally aligned accesses in hardware.

\begin{table}[htp]
\begin{center}
\begin{tabular}{|c|l|r|}
\hline
{\em width} field & Code & Size in bits\\
\hline
000 & B  & 8 \\
001 & H  & 16 \\
010 & W  & 32 \\
011 & D  & 64 \\
100 & Q  & 128 \\
101 & Q2 & 256 \\
110 & Q4 & 512 \\
111 & Q8 & 1024 \\
\hline
\end{tabular}
\end{center}
\caption{LOAD-FP and STORE-FP width encoding.}
\label{psimdwidth}
\end{table}

Packed-SIMD computational instructions operate on packed values in {\tt f} registers. Each value can be 8-bit, 16-bit, 32-bit, 64-bit, or 128-bit, and both integer and floating-point representations can be supported.
For example, a 64-bit packed-SIMD extension can treat each register as 1$\times$64-bit, 2$\times$32-bit, 4$\times$16-bit, or 8$\times$8-bit packed values.

\begin{commentary}
Simple packed-SIMD extensions might fit in unused 32-bit instruction opcodes, but more extensive packed-SIMD extensions will likely require a dedicated 30-bit instruction space.
\end{commentary}
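
As an informal illustration of the width encoding in Table~\ref{psimdwidth} and of the lane interpretation of a 64-bit register described above, the following C sketch models both in software. It is not part of any proposed instruction set: the function names ({\tt psimd\_width\_bits}, {\tt psimd\_lane\_count}, {\tt psimd\_add64}) are hypothetical, and the wraparound (modular) per-lane addition is an assumption chosen only to show that lanes are independent; this chapter does not define any particular computational instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Size in bits selected by the 3-bit width field of LOAD-FP/STORE-FP.
 * Follows the table: 000 -> 8, 001 -> 16, ..., 111 -> 1024,
 * i.e. each successive code doubles the size. */
unsigned psimd_width_bits(unsigned width_field)
{
    assert(width_field <= 7u);
    return 8u << width_field;
}

/* Number of packed values of lane_bits each that fit in a register
 * of flen bits (e.g. flen = 64, lane_bits = 16 gives 4 lanes). */
unsigned psimd_lane_count(unsigned flen, unsigned lane_bits)
{
    assert(lane_bits != 0u && flen % lane_bits == 0u);
    return flen / lane_bits;
}

/* Hypothetical packed integer add on a 64-bit register image.
 * ASSUMPTION: each lane wraps around independently (no saturation,
 * no carry into the neighbouring lane). */
uint64_t psimd_add64(uint64_t a, uint64_t b, unsigned lane_bits)
{
    assert(lane_bits == 8u || lane_bits == 16u ||
           lane_bits == 32u || lane_bits == 64u);

    /* Mask selecting the low lane_bits bits; a shift by 64 would be
     * undefined in C, so the full-width case is handled separately. */
    uint64_t mask = (lane_bits == 64u)
                        ? UINT64_MAX
                        : ((UINT64_C(1) << lane_bits) - 1u);
    uint64_t result = 0;

    for (unsigned shift = 0; shift < 64u; shift += lane_bits) {
        /* Bring the lane to the bottom, add, and discard any carry
         * out of the lane by masking. */
        uint64_t lane = ((a >> shift) + (b >> shift)) & mask;
        result |= lane << shift;
    }
    return result;
}
```

With 8-bit lanes, adding {\tt 0x01} to a lane holding {\tt 0xFF} wraps that lane to {\tt 0x00} without disturbing its neighbour, whereas the same bit patterns interpreted as 16-bit lanes produce {\tt 0x0100} in each lane. A hardware implementation would perform all lanes in parallel rather than in a loop.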