author John Hauser <jhauser@eecs.berkeley.edu> 2014-12-17 19:08:03 -0800
committer John Hauser <jhauser@eecs.berkeley.edu> 2014-12-17 19:08:03 -0800
commit 7276b0022ec5f461af9c3b4a1fe2e5526825b58e
tree 3bdcfb60a30c5161db65da8d110be9434e0ffad3 /doc/SoftFloat.html
parent 437d9b9fb281962ea10d5e4475e3851eaa7ffd25
Finalized documentation for SoftFloat Release 3.
Diffstat (limited to 'doc/SoftFloat.html')
-rw-r--r-- doc/SoftFloat.html 350
1 files changed, 185 insertions, 165 deletions
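
Note on section 8.6 of the documentation touched by this commit: the text defines the remainder as a - n*b, where n is the integer closest to a / b, and an exact halfway case goes to the even integer. The following is a minimal sketch of that rule in plain C, not SoftFloat's implementation. The names ieee_rem_sketch and ieee_rem_sketch_checks are hypothetical, and the arithmetic is only trustworthy for nonzero b and small operands whose quotient and product are exactly representable. Real code would call f64_rem, or the pointer-based f128M_rem form shown in the diff.

```c
#include <assert.h>

/* Hypothetical helper, not part of SoftFloat: computes a - n*b, where n is
   the integer nearest a/b and exact ties go to the even integer.
   Assumes b != 0 and small operands so that every step is exact. */
double ieee_rem_sketch(double a, double b)
{
    double q = a / b;
    long long n = (long long)q;          /* truncates toward zero */
    double frac = q - (double)n;         /* same sign as q, magnitude < 1 */
    long long step = (q < 0) ? -1 : 1;   /* direction away from zero */

    if (frac > 0.5 || frac < -0.5) {
        n += step;                       /* nearest integer is one step out */
    } else if ((frac == 0.5 || frac == -0.5) && (n % 2 != 0)) {
        n += step;                       /* exact tie: move to the even neighbor */
    }
    return a - (double)n * b;
}

/* Spot checks of the tie-to-even rule; call from a test or from main. */
void ieee_rem_sketch_checks(void)
{
    assert(ieee_rem_sketch(5.0, 2.0) == 1.0);   /* 2.5 ties to n = 2 */
    assert(ieee_rem_sketch(7.0, 2.0) == -1.0);  /* 3.5 ties to n = 4 */
    assert(ieee_rem_sketch(-1.0, 2.0) == -1.0); /* -0.5 ties to n = 0 */
}
```

The result can be negative even when both operands are positive, as in the second check, which is what distinguishes this operation from C's fmod.
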
diff --git a/doc/SoftFloat.html b/doc/SoftFloat.html
index fa3919a..d406d91 100644
--- a/doc/SoftFloat.html
+++ b/doc/SoftFloat.html
@@ -11,66 +11,59 @@
<P>
John R. Hauser<BR>
-2014 ______<BR>
-</P>
-
-<P>
-*** CONTENT DONE.
-</P>
-
-<P>
-*** REPLACE QUOTATION MARKS.
-<BR>
-*** REPLACE APOSTROPHES.
-<BR>
-*** REPLACE EM DASH.
+2014 Dec 17<BR>
</P>
<H2>Contents</H2>
-<P>
-*** CHECK.<BR>
-*** FIX FORMATTING.
-</P>
-
-<PRE>
- Introduction
- Limitations
- Acknowledgments and License
- Types and Functions
- Boolean and Integer Types
- Floating-Point Types
- Supported Floating-Point Functions
- Non-canonical Representations in extFloat80_t
- Conventions for Passing Arguments and Results
- Reserved Names
- Mode Variables
- Rounding Mode
- Underflow Detection
- Rounding Precision for 80-Bit Extended Format
- Exceptions and Exception Flags
- Function Details
- Conversions from Integer to Floating-Point
- Conversions from Floating-Point to Integer
- Conversions Among Floating-Point Types
- Basic Arithmetic Functions
- Fused Multiply-Add Functions
- Remainder Functions
- Round-to-Integer Functions
- Comparison Functions
- Signaling NaN Test Functions
- Raise-Exception Function
- Changes from SoftFloat Release 2
- Name Changes
- Changes to Function Arguments
- Added Capabilities
- Better Compatibility with the C Language
- New Organization as a Library
- Optimization Gains (and Losses)
- Future Directions
- Contact Information
-</PRE>
+<BLOCKQUOTE>
+<TABLE BORDER=0 CELLSPACING=0 CELLPADDING=0>
+<COL WIDTH=25>
+<COL WIDTH=*>
+<TR><TD COLSPAN=2>1. Introduction</TD></TR>
+<TR><TD COLSPAN=2>2. Limitations</TD></TR>
+<TR><TD COLSPAN=2>3. Acknowledgments and License</TD></TR>
+<TR><TD COLSPAN=2>4. Types and Functions</TD></TR>
+<TR><TD></TD><TD>4.1. Boolean and Integer Types</TD></TR>
+<TR><TD></TD><TD>4.2. Floating-Point Types</TD></TR>
+<TR><TD></TD><TD>4.3. Supported Floating-Point Functions</TD></TR>
+<TR>
+ <TD></TD>
+ <TD>4.4. Non-canonical Representations in <CODE>extFloat80_t</CODE></TD>
+</TR>
+<TR><TD></TD><TD>4.5. Conventions for Passing Arguments and Results</TD></TR>
+<TR><TD COLSPAN=2>5. Reserved Names</TD></TR>
+<TR><TD COLSPAN=2>6. Mode Variables</TD></TR>
+<TR><TD></TD><TD>6.1. Rounding Mode</TD></TR>
+<TR><TD></TD><TD>6.2. Underflow Detection</TD></TR>
+<TR>
+ <TD></TD>
+ <TD>6.3. Rounding Precision for the <NOBR>80-Bit</NOBR> Extended Format</TD>
+</TR>
+<TR><TD COLSPAN=2>7. Exceptions and Exception Flags</TD></TR>
+<TR><TD COLSPAN=2>8. Function Details</TD></TR>
+<TR><TD></TD><TD>8.1. Conversions from Integer to Floating-Point</TD></TR>
+<TR><TD></TD><TD>8.2. Conversions from Floating-Point to Integer</TD></TR>
+<TR><TD></TD><TD>8.3. Conversions Among Floating-Point Types</TD></TR>
+<TR><TD></TD><TD>8.4. Basic Arithmetic Functions</TD></TR>
+<TR><TD></TD><TD>8.5. Fused Multiply-Add Functions</TD></TR>
+<TR><TD></TD><TD>8.6. Remainder Functions</TD></TR>
+<TR><TD></TD><TD>8.7. Round-to-Integer Functions</TD></TR>
+<TR><TD></TD><TD>8.8. Comparison Functions</TD></TR>
+<TR><TD></TD><TD>8.9. Signaling NaN Test Functions</TD></TR>
+<TR><TD></TD><TD>8.10. Raise-Exception Function</TD></TR>
+<TR><TD COLSPAN=2>9. Changes from SoftFloat <NOBR>Release 2</NOBR></TD></TR>
+<TR><TD></TD><TD>9.1. Name Changes</TD></TR>
+<TR><TD></TD><TD>9.2. Changes to Function Arguments</TD></TR>
+<TR><TD></TD><TD>9.3. Added Capabilities</TD></TR>
+<TR><TD></TD><TD>9.4. Better Compatibility with the C Language</TD></TR>
+<TR><TD></TD><TD>9.5. New Organization as a Library</TD></TR>
+<TR><TD></TD><TD>9.6. Optimization Gains (and Losses)</TD></TR>
+<TR><TD COLSPAN=2>10. Future Directions</TD></TR>
+<TR><TD COLSPAN=2>11. Contact Information</TD></TR>
+</TABLE>
+</BLOCKQUOTE>
<H2>1. Introduction</H2>
@@ -156,15 +149,20 @@ SoftFloat <NOBR>Release 3</NOBR>.
The SoftFloat package was written by me, <NOBR>John R.</NOBR> Hauser.
<NOBR>Release 3</NOBR> of SoftFloat is a completely new implementation
supplanting earlier releases.
-This project was done in the employ of the University of California, Berkeley,
-within the Department of Electrical Engineering and Computer Sciences, first
-for the Parallel Computing Laboratory (Par Lab) and then for the ASPIRE Lab.
+This project (<NOBR>Release 3</NOBR> only, not earlier releases) was done in
+the employ of the University of California, Berkeley, within the Department of
+Electrical Engineering and Computer Sciences, first for the Parallel Computing
+Laboratory (Par Lab) and then for the ASPIRE Lab.
The work was officially overseen by Prof. Krste Asanovic, with funding provided
by these sources:
<BLOCKQUOTE>
<TABLE>
+<COL WIDTH=*>
+<COL WIDTH=10>
+<COL WIDTH=*>
<TR>
-<TD><NOBR>Par Lab:</NOBR></TD>
+<TD VALIGN=TOP><NOBR>Par Lab:</NOBR></TD>
+<TD></TD>
<TD>
Microsoft (Award #024263), Intel (Award #024894), and U.C. Discovery
(Award #DIG07-10227), with additional support from Par Lab affiliates Nokia,
@@ -172,7 +170,8 @@ NVIDIA, Oracle, and Samsung.
</TD>
</TR>
<TR>
-<TD><NOBR>ASPIRE Lab:</NOBR></TD>
+<TD VALIGN=TOP><NOBR>ASPIRE Lab:</NOBR></TD>
+<TD></TD>
<TD>
DARPA PERFECT program (Award #HR0011-12-2-0016), with additional support from
ASPIRE industrial sponsor Intel and ASPIRE affiliates Google, Nokia, NVIDIA,
@@ -245,16 +244,18 @@ for these headers.
Header <CODE>softfloat.h</CODE> depends only on the name <CODE>bool</CODE> from
<CODE>&lt;stdbool.h&gt;</CODE> and on these type names from
<CODE>&lt;stdint.h&gt;</CODE>:
+<BLOCKQUOTE>
<PRE>
- uint16_t
- uint32_t
- uint64_t
- int32_t
- int64_t
- uint_fast8_t
- uint_fast32_t
- uint_fast64_t
+uint16_t
+uint32_t
+uint64_t
+int32_t
+int64_t
+uint_fast8_t
+uint_fast32_t
+uint_fast64_t
</PRE>
+</BLOCKQUOTE>
</P>
@@ -263,26 +264,22 @@ Header <CODE>softfloat.h</CODE> depends only on the name <CODE>bool</CODE> from
<P>
The <CODE>softfloat.h</CODE> header defines four floating-point types:
<BLOCKQUOTE>
-<TABLE>
+<TABLE CELLSPACING=0 CELLPADDING=0>
<TR>
<TD><CODE>float32_t</CODE></TD>
-<TD>&nbsp;</TD>
<TD><NOBR>32-bit</NOBR> single-precision binary format</TD>
</TR>
<TR>
<TD><CODE>float64_t</CODE></TD>
-<TD>&nbsp;</TD>
<TD><NOBR>64-bit</NOBR> double-precision binary format</TD>
</TR>
<TR>
-<TD><CODE>extFloat80_t</CODE></TD>
-<TD>&nbsp;</TD>
+<TD><CODE>extFloat80_t&nbsp;&nbsp;&nbsp;</CODE></TD>
<TD><NOBR>80-bit</NOBR> double-extended-precision binary format (old Intel or
Motorola format)</TD>
</TR>
<TR>
<TD><CODE>float128_t</CODE></TD>
-<TD>&nbsp;</TD>
<TD><NOBR>128-bit</NOBR> quadruple-precision binary format</TD>
</TR>
</TABLE>
@@ -304,10 +301,10 @@ Header file <CODE>softfloat.h</CODE> also defines a structure,
This structure is the same size as type <CODE>extFloat80_t</CODE> and contains
at least these two fields (not necessarily in this order):
<BLOCKQUOTE>
-<TABLE>
-<TR><TD><CODE>uint16_t signExp;</CODE></TD></TR>
-<TR><TD><CODE>uint64_t signif;</CODE></TD></TR>
-</TABLE>
+<PRE>
+uint16_t signExp;
+uint64_t signif;
+</PRE>
</BLOCKQUOTE>
Field <CODE>signExp</CODE> contains the sign and exponent of the floating-point
value, with the sign in the most significant bit (<NOBR>bit 15</NOBR>) and the
@@ -339,8 +336,8 @@ operation defined by the IEEE Standard;
for each format, the floating-point remainder operation defined by the IEEE
Standard;
<LI>
-for each format, a ``round to integer'' operation that rounds to the nearest
-integer value in the same format; and
+for each format, a &ldquo;round to integer&rdquo; operation that rounds to the
+nearest integer value in the same format; and
<LI>
comparisons between two values in the same floating-point format.
</UL>
@@ -357,12 +354,12 @@ not supported in SoftFloat <NOBR>Release 3</NOBR>:
conversions between floating-point formats and decimal or hexadecimal character
sequences;
<LI>
-all ``quiet-computation'' operations (<B>copy</B>, <B>negate</B>, <B>abs</B>,
-and <B>copySign</B>, which all involve only simple copying and/or manipulation
-of the floating-point sign bit); and
+all &ldquo;quiet-computation&rdquo; operations (<B>copy</B>, <B>negate</B>,
+<B>abs</B>, and <B>copySign</B>, which all involve only simple copying and/or
+manipulation of the floating-point sign bit); and
<LI>
-all ``non-computational'' operations other than <B>isSignaling</B> (which is
-supported).
+all &ldquo;non-computational&rdquo; operations other than <B>isSignaling</B>
+(which is supported).
</UL>
</P>
@@ -393,9 +390,9 @@ leading significand bit must <NOBR>be 1</NOBR> unless it is required to
For <NOBR>Release 3</NOBR> of SoftFloat, functions are not guaranteed to
operate as expected when inputs of type <CODE>extFloat80_t</CODE> are
non-canonical.
-Assuming all of a function's <CODE>extFloat80_t</CODE> inputs (if any) are
-canonical, function outputs of type <CODE>extFloat80_t</CODE> will always be
-canonical.
+Assuming all of a function&rsquo;s <CODE>extFloat80_t</CODE> inputs (if any)
+are canonical, function outputs of type <CODE>extFloat80_t</CODE> will always
+be canonical.
</P>
<H3>4.5. Conventions for Passing Arguments and Results</H3>
@@ -426,8 +423,8 @@ SoftFloat supplies this function:
The first two arguments point to the values to be added, and the last argument
points to the location where the sum will be stored.
The <CODE>M</CODE> in the name <CODE>f128M_add</CODE> is mnemonic for the fact
-that the <NOBR>128-bit</NOBR> inputs and outputs are ``in memory'', pointed to
-by pointer arguments.
+that the <NOBR>128-bit</NOBR> inputs and outputs are &ldquo;in memory&rdquo;,
+pointed to by pointer arguments.
</P>
<P>
@@ -464,10 +461,11 @@ platforms of interest, programmers can use whichever version they prefer.
<P>
In addition to the variables and functions documented here, SoftFloat defines
some symbol names for its own private use.
-These private names always begin with the prefix `<CODE>softfloat_</CODE>'.
+These private names always begin with the prefix
+&lsquo;<CODE>softfloat_</CODE>&rsquo;.
When a program includes header <CODE>softfloat.h</CODE> or links with the
-SoftFloat library, all names with prefix `<CODE>softfloat_</CODE>' are reserved
-for possible use by SoftFloat.
+SoftFloat library, all names with prefix &lsquo;<CODE>softfloat_</CODE>&rsquo;
+are reserved for possible use by SoftFloat.
Applications that use SoftFloat should not define their own names with this
prefix, and should reference only such names as are documented.
</P>
@@ -477,7 +475,7 @@ prefix, and should reference only such names as are documented.
<P>
The following variables control rounding mode, underflow detection, and the
-<NOBR>80-bit</NOBR> extended format's rounding precision:
+<NOBR>80-bit</NOBR> extended format&rsquo;s rounding precision:
<BLOCKQUOTE>
<CODE>softfloat_roundingMode</CODE><BR>
<CODE>softfloat_detectTininess</CODE><BR>
@@ -497,30 +495,25 @@ The rounding mode is selected by the global variable
</BLOCKQUOTE>
This variable may be set to one of the values
<BLOCKQUOTE>
-<TABLE>
+<TABLE CELLSPACING=0 CELLPADDING=0>
<TR>
<TD><CODE>softfloat_round_near_even</CODE></TD>
-<TD>&nbsp;</TD>
<TD>round to nearest, with ties to even</TD>
</TR>
<TR>
-<TD><CODE>softfloat_round_near_maxMag</CODE></TD>
-<TD>&nbsp;</TD>
+<TD><CODE>softfloat_round_near_maxMag&nbsp;&nbsp;</CODE></TD>
<TD>round to nearest, with ties to maximum magnitude (away from zero)</TD>
</TR>
<TR>
<TD><CODE>softfloat_round_minMag</CODE></TD>
-<TD>&nbsp;</TD>
<TD>round to minimum magnitude (toward zero)</TD>
</TR>
<TR>
<TD><CODE>softfloat_round_min</CODE></TD>
-<TD>&nbsp;</TD>
<TD>round to minimum (down)</TD>
</TR>
<TR>
<TD><CODE>softfloat_round_max</CODE></TD>
-<TD>&nbsp;</TD>
<TD>round to maximum (up)</TD>
</TR>
</TABLE>
@@ -550,7 +543,7 @@ Like most systems (and as required by the newer 2008 IEEE Standard), SoftFloat
always detects loss of accuracy for underflow as an inexact result.
</P>
-<H3>6.3. Rounding Precision for 80-Bit Extended Format</H3>
+<H3>6.3. Rounding Precision for the <NOBR>80-Bit</NOBR> Extended Format</H3>
<P>
For <CODE>extFloat80_t</CODE> only, the rounding precision of the basic
@@ -639,7 +632,7 @@ It does always raise the <I>inexact</I> exception flag as required.
In this section, <CODE>&lt;<I>float</I>&gt;</CODE> appears in function names as
a substitute for one of these abbreviations:
<BLOCKQUOTE>
-<TABLE>
+<TABLE CELLSPACING=0 CELLPADDING=0>
<TR>
<TD><CODE>f32</CODE></TD>
<TD>indicates <CODE>float32_t</CODE>, passed by value</TD>
@@ -696,11 +689,14 @@ Each conversion function takes one input of the appropriate type and generates
one output.
The following illustrates the signatures of these functions in cases when the
floating-point result is passed either by value or via pointers:
+<BLOCKQUOTE>
<PRE>
- float64_t i32_to_f64( int32_t <I>a</I> );
-
- void i32_to_f128M( int32_t <I>a</I>, float128_t *<I>destPtr</I> );
+float64_t i32_to_f64( int32_t <I>a</I> );
</PRE>
+<PRE>
+void i32_to_f128M( int32_t <I>a</I>, float128_t *<I>destPtr</I> );
+</PRE>
+</BLOCKQUOTE>
</P>
<H3>8.2. Conversions from Floating-Point to Integer</H3>
@@ -717,12 +713,15 @@ functions:
</BLOCKQUOTE>
The functions have signatures as follows, depending on whether the
floating-point input is passed by value or via pointers:
+<BLOCKQUOTE>
<PRE>
- int32_t f64_to_i32( float64_t <I>a</I>, uint_fast8_t <I>roundingMode</I>, bool <I>exact</I> );
-
- int32_t
- f128M_to_i32( const float128_t *<I>aPtr</I>, uint_fast8_t <I>roundingMode</I>, bool <I>exact</I> );
+int32_t f64_to_i32( float64_t <I>a</I>, uint_fast8_t <I>roundingMode</I>, bool <I>exact</I> );
</PRE>
+<PRE>
+int32_t
+ f128M_to_i32( const float128_t *<I>aPtr</I>, uint_fast8_t <I>roundingMode</I>, bool <I>exact</I> );
+</PRE>
+</BLOCKQUOTE>
The <CODE><I>roundingMode</I></CODE> argument specifies the rounding mode for
the conversion.
The variable that usually indicates rounding mode,
@@ -768,12 +767,14 @@ and convenience:
These functions round only toward zero (to minimum magnitude).
The signatures for these functions are the same as above without the redundant
<CODE><I>roundingMode</I></CODE> argument:
+<BLOCKQUOTE>
<PRE>
- int32_t f64_to_i32_r_minMag( float64_t <I>a</I>, bool <I>exact</I> );
+int32_t f64_to_i32_r_minMag( float64_t <I>a</I>, bool <I>exact</I> );
</PRE>
<PRE>
- int32_t f128M_to_i32_r_minMag( const float128_t *<I>aPtr</I>, bool <I>exact</I> );
+int32_t f128M_to_i32_r_minMag( const float128_t *<I>aPtr</I>, bool <I>exact</I> );
</PRE>
+</BLOCKQUOTE>
</P>
<H3>8.3. Conversions Among Floating-Point Types</H3>
@@ -789,18 +790,20 @@ result are different formats.
There are four different styles of signature for these functions, depending on
whether the input and the output floating-point values are passed by value or
via pointers:
+<BLOCKQUOTE>
<PRE>
- float32_t f64_to_f32( float64_t <I>a</I> );
+float32_t f64_to_f32( float64_t <I>a</I> );
</PRE>
<PRE>
- float32_t f128M_to_f32( const float128_t *<I>aPtr</I> );
+float32_t f128M_to_f32( const float128_t *<I>aPtr</I> );
</PRE>
<PRE>
- void f32_to_f128M( float32_t <I>a</I>, float128_t *<I>destPtr</I> );
+void f32_to_f128M( float32_t <I>a</I>, float128_t *<I>destPtr</I> );
</PRE>
<PRE>
- void extF80M_to_f128M( const extFloat80_t *<I>aPtr</I>, float128_t *<I>destPtr</I> );
+void extF80M_to_f128M( const extFloat80_t *<I>aPtr</I>, float128_t *<I>destPtr</I> );
</PRE>
+</BLOCKQUOTE>
</P>
<P>
@@ -823,22 +826,22 @@ Each floating-point operation takes two operands, except for <CODE>sqrt</CODE>
(square root) which takes only one.
The operands and result are all of the same floating-point format.
Signatures for these functions take the following forms:
+<BLOCKQUOTE>
<PRE>
- float64_t f64_add( float64_t <I>a</I>, float64_t <I>b</I> );
+float64_t f64_add( float64_t <I>a</I>, float64_t <I>b</I> );
</PRE>
<PRE>
- void
- f128M_add(
- const float128_t *<I>aPtr</I>, const float128_t *<I>bPtr</I>, float128_t *<I>destPtr</I> );
+void
+ f128M_add(
+ const float128_t *<I>aPtr</I>, const float128_t *<I>bPtr</I>, float128_t *<I>destPtr</I> );
</PRE>
-</P>
-<P>
<PRE>
- float64_t f64_sqrt( float64_t <I>a</I> );
+float64_t f64_sqrt( float64_t <I>a</I> );
</PRE>
<PRE>
- void f128M_sqrt( const float128_t *<I>aPtr</I>, float128_t *<I>destPtr</I> );
+void f128M_sqrt( const float128_t *<I>aPtr</I>, float128_t *<I>destPtr</I> );
</PRE>
+</BLOCKQUOTE>
When floating-point values are passed indirectly through pointers, arguments
<CODE><I>aPtr</I></CODE> and <CODE><I>bPtr</I></CODE> point to the input
operands, and the last argument, <CODE><I>destPtr</I></CODE>, points to the
@@ -850,7 +853,7 @@ Rounding of the <NOBR>80-bit</NOBR> double-extended-precision
(<CODE>extFloat80_t</CODE>) functions is affected by variable
<CODE>extF80_roundingPrecision</CODE>, as explained earlier in
<NOBR>section 6.3</NOBR>,
-<I>Rounding Precision for <NOBR>80-Bit</NOBR> Extended Format</I>.
+<I>Rounding Precision for the <NOBR>80-Bit</NOBR> Extended Format</I>.
</P>
<H3>8.5. Fused Multiply-Add Functions</H3>
@@ -873,18 +876,20 @@ No fused multiple-add function is currently provided for the
<P>
Depending on whether floating-point values are passed by value or via pointers,
the fused multiply-add functions have signatures of these forms:
+<BLOCKQUOTE>
<PRE>
- float64_t f64_mulAdd( float64_t <I>a</I>, float64_t <I>b</I>, float64_t <I>c</I> );
+float64_t f64_mulAdd( float64_t <I>a</I>, float64_t <I>b</I>, float64_t <I>c</I> );
</PRE>
<PRE>
- void
- f128M_mulAdd(
- const float128_t *<I>aPtr</I>,
- const float128_t *<I>bPtr</I>,
- const float128_t *<I>cPtr</I>,
- float128_t *<I>destPtr</I>
- );
+void
+ f128M_mulAdd(
+ const float128_t *<I>aPtr</I>,
+ const float128_t *<I>bPtr</I>,
+ const float128_t *<I>cPtr</I>,
+ float128_t *<I>destPtr</I>
+ );
</PRE>
+</BLOCKQUOTE>
The functions compute
<NOBR>(<CODE><I>a</I></CODE> &times; <CODE><I>b</I></CODE>)
+ <CODE><I>c</I></CODE></NOBR>
@@ -915,14 +920,16 @@ Each remainder operation takes two floating-point operands of the same format
and returns a result in the same format.
Depending on whether floating-point values are passed by value or via pointers,
the remainder functions have signatures of these forms:
+<BLOCKQUOTE>
<PRE>
- float64_t f64_rem( float64_t <I>a</I>, float64_t <I>b</I> );
+float64_t f64_rem( float64_t <I>a</I>, float64_t <I>b</I> );
</PRE>
<PRE>
- void
- f128M_rem(
- const float128_t *<I>aPtr</I>, const float128_t *<I>bPtr</I>, float128_t *<I>destPtr</I> );
+void
+ f128M_rem(
+ const float128_t *<I>aPtr</I>, const float128_t *<I>bPtr</I>, float128_t *<I>destPtr</I> );
</PRE>
+</BLOCKQUOTE>
When floating-point values are passed indirectly through pointers, arguments
<CODE><I>aPtr</I></CODE> and <CODE><I>bPtr</I></CODE> point to operands
<CODE><I>a</I></CODE> and <CODE><I>b</I></CODE> respectively, and
@@ -938,8 +945,8 @@ where <I>n</I> is the integer closest to
If <NOBR><CODE><I>a</I></CODE> &divide; <CODE><I>b</I></CODE></NOBR> is exactly
halfway between two integers, <I>n</I> is the <EM>even</EM> integer closest to
<NOBR><CODE><I>a</I></CODE> &divide; <CODE><I>b</I></CODE></NOBR>.
-The IEEE Standard's remainder operation is always exact and so requires no
-rounding.
+The IEEE Standard&rsquo;s remainder operation is always exact and so requires
+no rounding.
</P>
<P>
@@ -968,18 +975,20 @@ and the resulting integer value is returned in the same floating-point format.
<P>
The signatures of the round-to-integer functions are similar to those for
conversions to an integer type:
+<BLOCKQUOTE>
<PRE>
- float64_t f64_roundToInt( float64_t <I>a</I>, uint_fast8_t <I>roundingMode</I>, bool <I>exact</I> );
+float64_t f64_roundToInt( float64_t <I>a</I>, uint_fast8_t <I>roundingMode</I>, bool <I>exact</I> );
</PRE>
<PRE>
- void
- f128M_roundToInt(
- const float128_t *<I>aPtr</I>,
- uint_fast8_t <I>roundingMode</I>,
- bool <I>exact</I>,
- float128_t *<I>destPtr</I>
- );
+void
+ f128M_roundToInt(
+ const float128_t *<I>aPtr</I>,
+ uint_fast8_t <I>roundingMode</I>,
+ bool <I>exact</I>,
+ float128_t *<I>destPtr</I>
+ );
</PRE>
+</BLOCKQUOTE>
The <CODE><I>roundingMode</I></CODE> argument specifies the rounding mode to
apply.
The variable that usually indicates rounding mode,
@@ -1005,17 +1014,19 @@ provided:
<CODE>&lt;<I>float</I>&gt;_lt</CODE>
</BLOCKQUOTE>
Each comparison takes two operands of the same type and returns a Boolean.
-The abbreviation <CODE>eq</CODE> stands for ``equal'' (=);
-<CODE>le</CODE> stands for ``less than or equal'' (&le;);
-and <CODE>lt</CODE> stands for ``less than'' (&lt;).
+The abbreviation <CODE>eq</CODE> stands for &ldquo;equal&rdquo; (=);
+<CODE>le</CODE> stands for &ldquo;less than or equal&rdquo; (&le;);
+and <CODE>lt</CODE> stands for &ldquo;less than&rdquo; (&lt;).
Depending on whether the floating-point operands are passed by value or via
pointers, the comparison functions have signatures of these forms:
+<BLOCKQUOTE>
<PRE>
- bool f64_eq( float64_t <I>a</I>, float64_t <I>b</I> );
+bool f64_eq( float64_t <I>a</I>, float64_t <I>b</I> );
</PRE>
<PRE>
- bool f128M_eq( const float128_t *<I>aPtr</I>, const float128_t *<I>bPtr</I> );
+bool f128M_eq( const float128_t *<I>aPtr</I>, const float128_t *<I>bPtr</I> );
</PRE>
+</BLOCKQUOTE>
</P>
<P>
@@ -1058,21 +1069,25 @@ provided with these names:
The functions take one floating-point operand and return a Boolean indicating
whether the operand is a signaling NaN.
Accordingly, the functions have the forms
+<BLOCKQUOTE>
<PRE>
- bool f64_isSignalingNaN( float64_t <I>a</I> );
+bool f64_isSignalingNaN( float64_t <I>a</I> );
</PRE>
<PRE>
- bool f128M_isSignalingNaN( const float128_t *<I>aPtr</I> );
+bool f128M_isSignalingNaN( const float128_t *<I>aPtr</I> );
</PRE>
+</BLOCKQUOTE>
</P>
<H3>8.10. Raise-Exception Function</H3>
<P>
SoftFloat provides a single function for raising floating-point exceptions:
+<BLOCKQUOTE>
<PRE>
- void softfloat_raise( uint_fast8_t <I>exceptions</I> );
+void softfloat_raise( uint_fast8_t <I>exceptions</I> );
</PRE>
+</BLOCKQUOTE>
The <CODE><I>exceptions</I></CODE> argument is a mask indicating the set of
exceptions to raise.
(See earlier section 7, <I>Exceptions and Exception Flags</I>.)
@@ -1084,6 +1099,11 @@ function may cause a trap or abort appropriate for the current system.
<H2>9. Changes from SoftFloat <NOBR>Release 2</NOBR></H2>
+<P>
+Apart from the change in the legal use license, there are numerous technical
+differences between <NOBR>Release 3</NOBR> of SoftFloat and earlier releases.
+</P>
+
<H3>9.1. Name Changes</H3>
<P>
@@ -1214,17 +1234,17 @@ Lastly, there are a few other changes to function names:
<TR>
<TD><CODE>_round_to_zero</CODE></TD>
<TD><CODE>_r_minMag</CODE></TD>
-<TD>conversions from floating-point to integer, section 8.2</TD>
+<TD>conversions from floating-point to integer (<NOBR>section 8.2</NOBR>)</TD>
</TR>
<TR>
<TD><CODE>round_to_int</CODE></TD>
<TD><CODE>roundToInt</CODE></TD>
-<TD>round-to-integer functions, section 8.7</TD>
+<TD>round-to-integer functions (<NOBR>section 8.7</NOBR>)</TD>
</TR>
<TR>
<TD><CODE>is_signaling_nan&nbsp;&nbsp;&nbsp;&nbsp;</CODE></TD>
<TD><CODE>isSignalingNaN</CODE></TD>
-<TD>signaling NaN test functions, section 8.9</TD>
+<TD>signaling NaN test functions (<NOBR>section 8.9</NOBR>)</TD>
</TR>
</TABLE>
</BLOCKQUOTE>
@@ -1296,7 +1316,7 @@ argument <CODE><I>exact</I></CODE>.
<P>
With <NOBR>Release 3</NOBR>, a port of SoftFloat can now define any of the
floating-point types <CODE>float32_t</CODE>, <CODE>float64_t</CODE>,
-<CODE>extFloat80_t</CODE>, and <CODE>float128_t</CODE> as aliases for C's
+<CODE>extFloat80_t</CODE>, and <CODE>float128_t</CODE> as aliases for C&rsquo;s
standard floating-point types <CODE>float</CODE>, <CODE>double</CODE>, and
<CODE>long</CODE> <CODE>double</CODE>, using either <CODE>#define</CODE> or
<CODE>typedef</CODE>.
@@ -1304,9 +1324,9 @@ This potential convenience was not supported under <NOBR>Release 2</NOBR>.
</P>
<P>
-(Note, however, that there may be a performance cost to defining SoftFloat's
-floating-point types this way, depending on the platform and the applications
-using SoftFloat.
+(Note, however, that there may be a performance cost to defining
+SoftFloat&rsquo;s floating-point types this way, depending on the platform and
+the applications using SoftFloat.
Ports of SoftFloat may choose to forgo the convenience in favor of better
speed.)
</P>
@@ -1338,7 +1358,7 @@ Fused multiply-add functions have been added for the non-extended formats,
<P>
<NOBR>Release 3</NOBR> of SoftFloat is written to conform better to the ISO C
-Standard's rules for portability.
+Standard&rsquo;s rules for portability.
For example, older releases of SoftFloat employed type conversions in ways
that, while commonly practiced, are not fully defined by the C Standard.
Such problematic type conversions have generally been replaced by the use of
@@ -1387,8 +1407,8 @@ Some loss of speed has been observed due to this change.
The following improvements are anticipated for future releases of SoftFloat:
<UL>
<LI>
-support for the common <NOBR>16-bit</NOBR> ``half-precision'' floating-point
-format;
+support for the common <NOBR>16-bit</NOBR> &ldquo;half-precision&rdquo;
+floating-point format;
<LI>
more functions from the 2008 version of the IEEE Floating-Point Standard;
<LI>