src/m.tex


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188

\chapter{``M'' Standard Extension for Integer Multiplication and
  Division, Version 2.0}

This chapter describes the standard integer multiplication and
division instruction extension, which is named ``M'' and contains
instructions that multiply or divide values held in two integer
registers.

\begin{commentary}
We separate integer multiply and divide out from the base to simplify
low-end implementations, or for applications where integer multiply
and divide operations are either infrequent or better handled in
attached accelerators.
\end{commentary}

\section{Multiplication Operations}
\label{multiplication-operations}

\vspace{-0.2in}
\begin{center}
\begin{tabular}{S@{}R@{}R@{}S@{}R@{}O}
\\
\instbitrange{31}{25} &
\instbitrange{24}{20} &
\instbitrange{19}{15} &
\instbitrange{14}{12} &
\instbitrange{11}{7} &
\instbitrange{6}{0} \\
\hline
\multicolumn{1}{|c|}{funct7} &
\multicolumn{1}{c|}{rs2} &
\multicolumn{1}{c|}{rs1} &
\multicolumn{1}{c|}{funct3} &
\multicolumn{1}{c|}{rd} &
\multicolumn{1}{c|}{opcode} \\
\hline
7 & 5 & 5 & 3 & 5 & 7 \\
MULDIV & multiplier & multiplicand & MUL/MULH[[S]U] & dest & OP    \\
MULDIV & multiplier & multiplicand & MULW           & dest & OP-32 \\
\end{tabular}
\end{center}

MUL performs an XLEN-bit$\times$XLEN-bit multiplication
of {\em rs1} by {\em rs2} and places the
lower XLEN bits in the destination register.  MULH, MULHU, and MULHSU
perform the same multiplication but return the upper XLEN bits of the
full 2$\times$XLEN-bit product, for signed$\times$signed,
unsigned$\times$unsigned, and \wunits{signed}{\em rs1}$\times$\wunits{unsigned}{\em rs2} multiplication,
respectively.  If both the high and low bits of the same product are
required, then the recommended code sequence is: MULH[[S]U] {\em rdh,
  rs1, rs2}; MUL {\em rdl, rs1, rs2} (source register specifiers must
be in same order and {\em rdh} cannot be the same as {\em rs1} or {\em
  rs2}).  Microarchitectures can then fuse these into a single
multiply operation instead of performing two separate multiplies.

\begin{commentary}
MULHSU is used in multi-word signed multiplication to multiply the
most-significant word of the multiplicand (which contains the sign bit)
with the less-significant words of the multiplier (which are unsigned).
\end{commentary}

MULW is an RV64 instruction that multiplies the lower 32 bits of the source
registers, placing the sign-extension of the lower 32 bits of the result
into the destination register.

\begin{commentary}
In RV64, MUL can be used to obtain the upper 32 bits of the 64-bit product,
but signed arguments must be proper 32-bit signed values, whereas unsigned
arguments must have their upper 32 bits clear.  If the
arguments are not known to be sign- or zero-extended, an alternative is to
shift both arguments left by 32 bits, then use MULH[[S]U].
\end{commentary}

\section{Division Operations}

\vspace{-0.2in}
\begin{center}
\begin{tabular}{S@{}R@{}R@{}O@{}R@{}O}
\\
\instbitrange{31}{25} &
\instbitrange{24}{20} &
\instbitrange{19}{15} &
\instbitrange{14}{12} &
\instbitrange{11}{7} &
\instbitrange{6}{0} \\
\hline
\multicolumn{1}{|c|}{funct7} &
\multicolumn{1}{c|}{rs2} &
\multicolumn{1}{c|}{rs1} &
\multicolumn{1}{c|}{funct3} &
\multicolumn{1}{c|}{rd} &
\multicolumn{1}{c|}{opcode} \\
\hline
7 & 5 & 5 & 3 & 5 & 7 \\
MULDIV & divisor & dividend & DIV[U]/REM[U]   & dest & OP    \\
MULDIV & divisor & dividend & DIV[U]W/REM[U]W & dest & OP-32 \\
\end{tabular}
\end{center}

DIV and DIVU perform an XLEN bits by XLEN bits signed and unsigned integer
division of {\em rs1} by {\em rs2}, rounding towards zero.
REM and REMU provide the remainder of the corresponding division operation.
For REM, the sign of the result equals the sign of the dividend.

\begin{commentary}
For both signed and unsigned division, it holds that
\mbox{$\textrm{dividend} = \textrm{divisor} \times \textrm{quotient} + \textrm{remainder}$}.
\end{commentary}

If both the quotient and remainder
are required from the same division, the recommended code sequence is:
DIV[U] {\em rdq, rs1, rs2}; REM[U] {\em rdr, rs1, rs2} ({\em rdq}
cannot be the same as {\em rs1} or {\em rs2}).  Microarchitectures can
then fuse these into a single divide operation instead of performing
two separate divides.

DIVW and DIVUW are RV64 instructions that divide the
lower 32 bits of {\em rs1} by the lower 32 bits of {\em rs2}, treating
them as signed and unsigned integers respectively, placing the 32-bit
quotient in {\em rd}, sign-extended to 64 bits.  REMW and REMUW
are RV64 instructions that provide the corresponding
signed and unsigned remainder operations respectively. Both REMW and
REMUW always sign-extend the 32-bit result to 64 bits, including on a
divide by zero.

The semantics for division by zero and division overflow are summarized in
Table~\ref{tab:divby0}.  The quotient of division by zero has all bits set, and
the remainder of division by zero equals the dividend.  Signed division overflow
occurs only when the most-negative integer is divided by $-1$.  The quotient of
a signed division with overflow is equal to the dividend, and the remainder is
zero. Unsigned division overflow cannot occur.

\begin{table}[h]
\center
\begin{tabular}{|l|c|c||c|c|c|c|}
\hline
Condition              & Dividend   & Divisor & DIVU[W]   & REMU[W] & DIV[W]     & REM[W] \\ \hline
Division by zero       & $x$        & 0       & $2^{L}-1$ & $x$     & $-1$       & $x$    \\
Overflow (signed only) & $-2^{L-1}$ & $-1$    & --        & --      & $-2^{L-1}$ & 0      \\
\hline
\end{tabular}
\caption{Semantics for division by zero and division overflow.
L is the width of the operation in bits: XLEN for DIV[U] and REM[U], or
32 for DIV[U]W and REM[U]W.}
\label{tab:divby0}
\end{table}

\begin{commentary}
We considered raising exceptions on integer divide by zero, with these
exceptions causing a trap in most execution environments.  However,
this would be the only arithmetic trap in the standard ISA
(floating-point exceptions set flags and write default values, but do
not cause traps) and would require language implementors to interact
with the execution environment's trap handlers for this case.
Further, where language standards mandate that a divide-by-zero
exception must cause an immediate control flow change, only a single
branch instruction needs to be added to each divide operation, and
this branch instruction can be inserted after the divide and should
normally be very predictably not taken, adding little runtime
overhead.

The value of all bits set is returned for both unsigned and signed
divide by zero to simplify the divider circuitry.  The value of all 1s
is both the natural value to return for unsigned divide, representing
the largest unsigned number, and also the natural result for simple
unsigned divider implementations.  Signed division is often
implemented using an unsigned division circuit and specifying the same
overflow result simplifies the hardware.
\end{commentary}

\section{Zmmul Extension, Version 0.1}

The Zmmul extension implements the multiplication subset of the M extension.
It adds all of the instructions defined in Section~\ref{multiplication-operations},
namely: MUL, MULH, MULHU, MULHSU, and (for RV64 only) MULW.
The encodings are identical to those of the corresponding M-extension instructions.

\begin{commentary}
The Zmmul extension enables low-cost implementations that require
multiplication operations but not division.
For many microcontroller applications, division operations are too
infrequent to justify the cost of divider hardware.
By contrast, multiplication operations are more frequent, making the cost of
multiplier hardware more justifiable.
Simple FPGA soft cores particularly benefit from eliminating division but
retaining multiplication, since many FPGAs provide hardwired multipliers
but require dividers be implemented in soft logic.
\end{commentary}