\chapter{``P'' Standard Extension for Packed-SIMD Instructions,
  Version 0.1}
\label{sec:packedsimd}

\begin{commentary}
  Discussions at the 5th RISC-V workshop indicated a desire to drop
  this packed-SIMD proposal for floating-point registers in favor of
  standardizing on the V extension for large floating-point SIMD
  operations.  However, there was interest in packed-SIMD fixed-point
  operations for use in the integer registers of small RISC-V
  implementations.
\end{commentary}

In this chapter, we outline a standard packed-SIMD extension for
RISC-V.  We've reserved the instruction subset name ``P'' for a future
standard set of packed-SIMD extensions.  Many other extensions can
build upon a packed-SIMD extension, taking advantage of the wide data
registers and datapaths separate from the integer unit.

\begin{commentary}
Packed-SIMD extensions, first introduced with the Lincoln Labs TX-2~\cite{tx2},
have become a popular way to provide higher throughput on data-parallel
codes. Earlier commercial microprocessor implementations include the
Intel i860, HP PA-RISC MAX~\cite{lee-max-ieeemicro1996}, SPARC
VIS~\cite{tremblay-vis-ieeemicro1996}, MIPS
MDMX~\cite{gwennap-mdmx-mpr1996}, PowerPC
AltiVec~\cite{diefendorff-altivec-ieeemicro2000}, Intel x86
MMX/SSE~\cite{peleg-mmx-ieeemicro1996, raman-sse-ieeemicro2000}, while
recent designs include Intel x86 AVX~\cite{lomont-avx-irm2011} and ARM
Neon~\cite{goodacre-armisa-computer2005}.  We describe a standard
framework for adding packed-SIMD in this chapter, but are not actively
working on such a design.  In our opinion, packed-SIMD designs represent
a reasonable design point when reusing existing wide datapath resources,
but if significant additional resources are to be devoted to
data-parallel execution then designs based on traditional vector
architectures are a better choice and should use the V extension.

\end{commentary}

A RISC-V packed-SIMD extension reuses the floating-point registers
({\tt f0}-{\tt f31}).  These registers can be defined to have widths
of FLEN=32 to FLEN=1024.  The standard floating-point instruction
subsets require registers of width 32 bits (``F''), 64 bits (``D''),
or 128 bits (``Q'').

\begin{commentary}
It is natural to use the floating-point registers for packed-SIMD
values rather than the integer registers (PA-RISC and Alpha
packed-SIMD extensions) as this frees the integer registers for
control and address values, simplifies reuse of scalar floating-point
units for SIMD floating-point execution, and leads naturally to a
decoupled integer/floating-point hardware design.  The floating-point
load and store instruction encodings also have space to handle wider
packed-SIMD registers.  However, reusing the floating-point registers
for packed-SIMD values does make it more difficult to use a recoded
internal format for floating-point values.
\end{commentary}

The existing floating-point load and store instructions are used to
load and store various-sized words from memory to the {\tt f}
registers.  The base ISA supports 32-bit and 64-bit loads and stores,
but the LOAD-FP and STORE-FP instruction encodings allows 8 different
widths to be encoded as shown in Table~\ref{psimdwidth}.  When used
with packed-SIMD operations, it is desirable to support non-naturally
aligned loads and stores in hardware.

\begin{table}[htp]
\begin{center}
\begin{tabular}{|c|l|r|}
\hline
{\em width} field &
Code &
Size in bits\\
\hline
000 & B  &  8   \\
001 & H  & 16   \\
010 & W  & 32   \\
011 & D  & 64   \\
100 & Q  & 128  \\
101 & Q2 & 256  \\
110 & Q4 & 512  \\
111 & Q8 & 1024 \\
\hline
\end{tabular}
\end{center}
\caption{LOAD-FP and STORE-FP width encoding.}
\label{psimdwidth}
\end{table}

Packed-SIMD computational instructions operate on packed values in
{\tt f} registers.  Each value can be 8-bit, 16-bit, 32-bit, 64-bit,
or 128-bit, and both integer and floating-point representations can be
supported.  For example, a 64-bit packed-SIMD extension can treat each
register as 1$\times$64-bit, 2$\times$32-bit, 4$\times$16-bit, or
8$\times$8-bit packed values.

\begin{commentary}
Simple packed-SIMD extensions might fit in unused 32-bit instruction
opcodes, but more extensive packed-SIMD extensions will likely require
a dedicated 30-bit instruction space.
\end{commentary}