src/zfa.adoc


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239

[[zfa]]
== "Zfa" Extension for Additional Floating-Point Instructions, Version 1.0

This chapter describes the Zfa standard extension, which adds
instructions for immediate loads, IEEE 754-2019 minimum and maximum
operations, round-to-integer operations, and quiet floating-point
comparisons. For RV32D, the Zfa extension also adds instructions to
transfer double-precision floating-point values to and from integer
registers, and for RV64Q, it adds analogous instructions for
quad-precision floating-point values. The Zfa extension depends on the F
extension.

=== Load-Immediate Instructions

The FLI.S instruction loads one of 32 single-precision floating-point
constants, encoded in the _rs1_ field, into floating-point register
_rd_. The correspondence of _rs1_ field values and single-precision
floating-point values is shown in <<tab:flis, Table 37>>. FLI.S is encoded
like FMV.W.X, but with _rs2_=1.

[[tab:flis]]
.Immediate values loaded by the FLI.S instruction.
[%autowidth,float="center",align="center",cols=">,>,^,^,^",options="header",]
|===
|_rs1_ |Value |Sign |Exponent |Significand
|0 |latexmath:[$-1.0$] |`1` |`01111111` |`000...000`
|1 |_Minimum positive normal_ |`0` |`00000001` |`000...000`
|2 |latexmath:[$1.0 \times 2^{-16}$] |`0` |`01101111` |`000...000`
|3 |latexmath:[$1.0 \times 2^{-15}$] |`0` |`01110000` |`000...000`
|4 |latexmath:[$1.0 \times 2^{-8}$] |`0` |`01110111` |`000...000`
|5 |latexmath:[$1.0 \times 2^{-7}$] |`0` |`01111000` |`000...000`
|6 |0.0625 (latexmath:[$2^{-4}$]) |`0` |`01111011` |`000...000`
|7 |0.125 (latexmath:[$2^{-3}$]) |`0` |`01111100` |`000...000`
|8 |0.25 |`0` |`01111101` |`000...000`
|9 |0.3125 |`0` |`01111101` |`010...000`
|10 |0.375 |`0` |`01111101` |`100...000`
|11 |0.4375 |`0` |`01111101` |`110...000`
|12 |0.5 |`0` |`01111110` |`000...000`
|13 |0.625 |`0` |`01111110` |`010...000`
|14 |0.75 |`0` |`01111110` |`100...000`
|15 |0.875 |`0` |`01111110` |`110...000`
|16 |1.0 |`0` |`01111111` |`000...000`
|17 |1.25 |`0` |`01111111` |`010...000`
|18 |1.5 |`0` |`01111111` |`100...000`
|19 |1.75 |`0` |`01111111` |`110...000`
|20 |2.0 |`0` |`10000000` |`000...000`
|21 |2.5 |`0` |`10000000` |`010...000`
|22 |3 |`0` |`10000000` |`100...000`
|23 |4 |`0` |`10000001` |`000...000`
|24 |8 |`0` |`10000010` |`000...000`
|25 |16 |`0` |`10000011` |`000...000`
|26 |128 (latexmath:[$2^7$]) |`0` |`10000110` |`000...000`
|27 |256 (latexmath:[$2^8$]) |`0` |`10000111` |`000...000`
|28 |latexmath:[$2^{15}$] |`0` |`10001110` |`000...000`
|29 |latexmath:[$2^{16}$] |`0` |`10001111` |`000...000`
|30 |latexmath:[$+\infty$] |`0` |`11111111` |`000...000`
|31 |_Canonical NaN_ |`0` |`11111111` |`100...000`
|===

[TIP]
====
The preferred assembly syntax for entries 1, 30, and 31 is `min`, `inf`,
and `nan`, respectively. For entries 0 through 29 (including entry 1),
the assembler will accept decimal constants in C-like syntax.
====
[TIP]
====
The set of 32 constants was chosen by examining floating-point
libraries, including the C standard math library, and to optimize
fixed-point to floating-point conversion.

Entries 8-22 follow a regular encoding pattern. No entry sets mantissa
bits other than the two most significant ones.
====

If the D extension is implemented, FLI.D performs the analogous
operation, but loads a double-precision value into floating-point
register _rd_. Note that entry 1 (corresponding to the minimum positive
normal value) has a numerically different value for double-precision
than for single-precision. FLI.D is encoded like FLI.S, but with
_fmt_=D.

If the Q extension is implemented, FLI.Q performs the analogous
operation, but loads a quad-precision value into floating-point register
_rd_. Note that entry 1 (corresponding to the minimum positive normal
value) has a numerically different value for quad-precision. FLI.Q is
encoded like FLI.S, but with _fmt_=Q.

If the Zfh or Zvfh extension is implemented, FLI.H performs the
analogous operation, but loads a half-precision floating-point value
into register _rd_. Note that entry 1 (corresponding to the minimum
positive normal value) has a numerically different value for
half-precision. Furthermore, since latexmath:[$2^{16}$] is not
representable in half-precision floating-point, entry 29 in the table
instead loads positive infinity—i.e., it is redundant with entry 30.
FLI.H is encoded like FLI.S, but with _fmt_=H.
[NOTE]
====
Additionally, since latexmath:[$2^{-16}$] and latexmath:[$2^{-15}$] are subnormal in half-precision, entry 1 is numerically greater than entries 2 and 3 for FLI.H.
====
The FLI._fmt_ instructions never set any floating-point exception flags.

=== Minimum and Maximum Instructions

The FMINM.S and FMAXM.S instructions are defined like the FMIN.S and
FMAX.S instructions, except that if either input is NaN, the result is
the canonical NaN.

If the D extension is implemented, FMINM.D and FMAXM.D instructions are
analogously defined to operate on double-precision numbers.

If the Zfh extension is implemented, FMINM.H and FMAXM.H instructions
are analogously defined to operate on half-precision numbers.

If the Q extension is implemented, FMINM.Q and FMAXM.Q instructions are
analogously defined to operate on quad-precision numbers.

These instructions are encoded like their FMIN and FMAX counterparts,
but with instruction bit 13 set to 1.
[NOTE]
====
These instructions implement the IEEE 754-2019 minimum and maximum
operations.
====
=== Round-to-Integer Instructions

The FROUND.S instruction rounds the single-precision floating-point
number in floating-point register _rs1_ to an integer, according to the
rounding mode specified in the instruction's _rm_ field. It then writes
that integer, represented as a single-precision floating-point number,
to floating-point register _rd_. Zero and infinite inputs are copied to
_rd_ unmodified. Signaling NaN inputs cause the invalid operation
exception flag to be set; no other exception flags are set. FROUND.S is
encoded like FCVT.S.D, but with _rs2_=4.

The FROUNDNX.S instruction is defined similarly, but it also sets the
inexact exception flag if the input differs from the rounded result and
is not NaN. FROUNDNX.S is encoded like FCVT.S.D, but with _rs2_=5.

If the D extension is implemented, FROUND.D and FROUNDNX.D instructions
are analogously defined to operate on double-precision numbers. They are
encoded like FCVT.D.S, but with _rs2_=4 and 5, respectively,

If the Zfh extension is implemented, FROUND.H and FROUNDNX.H
instructions are analogously defined to operate on half-precision
numbers. They are encoded like FCVT.H.S, but with _rs2_=4 and 5,
respectively,

If the Q extension is implemented, FROUND.Q and FROUNDNX.Q instructions
are analogously defined to operate on quad-precision numbers. They are
encoded like FCVT.Q.S, but with _rs2_=4 and 5, respectively,
[NOTE]
====
The FROUNDNX._fmt_ instructions implement the IEEE 754-2019
roundToIntegralExact operation, and the FROUND._fmt_ instructions
implement the other operations in the roundToIntegral family.
====
=== Modular Convert-to-Integer Instruction

The FCVTMOD.W.D instruction is defined similarly to the FCVT.W.D
instruction, with the following differences. FCVTMOD.W.D always rounds
towards zero. Bits 31:0 are taken from the rounded, unbounded two's
complement result, then sign-extended to XLEN bits and written to
integer register _rd_. latexmath:[$\pm\infty$] and NaN are converted to
zero.

Floating-point exception flags are raised the same as they would be for
FCVT.W.D with the same input operand.

This instruction is only provided if the D extension is implemented. It
is encoded like FCVT.W.D, but with the rs2 field set to 8 and the _rm_
field set to 1 (RTZ). Other _rm_ values are _reserved_.
[TIP]
====
The assembly syntax requires the RTZ rounding mode to be explicitly
specified, i.e., `fcvtmod.w.d rd, rs1, rtz`.

The FCVTMOD.W.D instruction was added principally to accelerate the
processing of JavaScript Numbers. Numbers are double-precision
values, but some operators implicitly truncate them to signed integers
mod latexmath:[$2^{32}$].
====
=== Move Instructions

For RV32 only, if the D extension is implemented, the FMVH.X.D
instruction moves bits 63:32 of floating-point register _rs1_ into
integer register _rd_. It is encoded in the OP-FP major opcode with
_funct3_=0, _rs2_=1, and _funct7_=1110001.
[NOTE]
====
FMVH.X.D is used in conjunction with the existing FMV.X.W instruction to
move a double-precision floating-point number to a pair of x-registers.
====
For RV32 only, if the D extension is implemented, the FMVP.D.X
instruction moves a double-precision number from a pair of integer
registers into a floating-point register. Integer registers _rs1_ and
_rs2_ supply bits 31:0 and 63:32, respectively; the result is written to
floating-point register _rd_. FMVP.D.X is encoded in the OP-FP major
opcode with _funct3_=0 and _funct7_=1011001.

For RV64 only, if the Q extension is implemented, the FMVH.X.Q
instruction moves bits 127:64 of floating-point register _rs1_ into
integer register _rd_. It is encoded in the OP-FP major opcode with
_funct3_=0, _rs2_=1, and _funct7_=1110011.
[NOTE]
====
FMVH.X.Q is used in conjunction with the existing FMV.X.D instruction to
move a quad-precision floating-point number to a pair of x-registers.
====
For RV64 only, if the Q extension is implemented, the FMVP.Q.X
instruction moves a double-precision number from a pair of integer
registers into a floating-point register. Integer registers _rs1_ and
_rs2_ supply bits 63:0 and 127:64, respectively; the result is written
to floating-point register _rd_. FMVP.Q.X is encoded in the OP-FP major
opcode with _funct3_=0 and _funct7_=1011011.

=== Comparison Instructions

The FLEQ.S and FLTQ.S instructions are defined like the FLE.S and FLT.S
instructions, except that quiet NaN inputs do not cause the invalid
operation exception flag to be set.

If the D extension is implemented, FLEQ.D and FLTQ.D instructions are
analogously defined to operate on double-precision numbers.

If the Zfh extension is implemented, FLEQ.H and FLTQ.H instructions are
analogously defined to operate on half-precision numbers.

If the Q extension is implemented, FLEQ.Q and FLTQ.Q instructions are
analogously defined to operate on quad-precision numbers.

These instructions are encoded like their FLE and FLT counterparts, but
with instruction bit 14 set to 1.
[NOTE]
====
We do not expect analogous comparison instructions will be added to the
vector ISA, since they can be reasonably efficiently emulated using
masking.
====