author | John Hauser <jhauser@eecs.berkeley.edu> | 2014-12-17 19:08:03 -0800 |
---|---|---|
committer | John Hauser <jhauser@eecs.berkeley.edu> | 2014-12-17 19:08:03 -0800 |
commit | 7276b0022ec5f461af9c3b4a1fe2e5526825b58e (patch) | |
tree | 3bdcfb60a30c5161db65da8d110be9434e0ffad3 /doc/SoftFloat.html | |
parent | 437d9b9fb281962ea10d5e4475e3851eaa7ffd25 (diff) | |
Finalized documentation for SoftFloat Release 3.
Diffstat (limited to 'doc/SoftFloat.html')
-rw-r--r-- | doc/SoftFloat.html | 350 |
1 files changed, 185 insertions, 165 deletions
diff --git a/doc/SoftFloat.html b/doc/SoftFloat.html index fa3919a..d406d91 100644 --- a/doc/SoftFloat.html +++ b/doc/SoftFloat.html @@ -11,66 +11,59 @@ <P> John R. Hauser<BR> -2014 ______<BR> -</P> - -<P> -*** CONTENT DONE. -</P> - -<P> -*** REPLACE QUOTATION MARKS. -<BR> -*** REPLACE APOSTROPHES. -<BR> -*** REPLACE EM DASH. +2014 Dec 17<BR> </P> <H2>Contents</H2> -<P> -*** CHECK.<BR> -*** FIX FORMATTING. -</P> - -<PRE> - Introduction - Limitations - Acknowledgments and License - Types and Functions - Boolean and Integer Types - Floating-Point Types - Supported Floating-Point Functions - Non-canonical Representations in extFloat80_t - Conventions for Passing Arguments and Results - Reserved Names - Mode Variables - Rounding Mode - Underflow Detection - Rounding Precision for 80-Bit Extended Format - Exceptions and Exception Flags - Function Details - Conversions from Integer to Floating-Point - Conversions from Floating-Point to Integer - Conversions Among Floating-Point Types - Basic Arithmetic Functions - Fused Multiply-Add Functions - Remainder Functions - Round-to-Integer Functions - Comparison Functions - Signaling NaN Test Functions - Raise-Exception Function - Changes from SoftFloat Release 2 - Name Changes - Changes to Function Arguments - Added Capabilities - Better Compatibility with the C Language - New Organization as a Library - Optimization Gains (and Losses) - Future Directions - Contact Information -</PRE> +<BLOCKQUOTE> +<TABLE BORDER=0 CELLSPACING=0 CELLPADDING=0> +<COL WIDTH=25> +<COL WIDTH=*> +<TR><TD COLSPAN=2>1. Introduction</TD></TR> +<TR><TD COLSPAN=2>2. Limitations</TD></TR> +<TR><TD COLSPAN=2>3. Acknowledgments and License</TD></TR> +<TR><TD COLSPAN=2>4. Types and Functions</TD></TR> +<TR><TD></TD><TD>4.1. Boolean and Integer Types</TD></TR> +<TR><TD></TD><TD>4.2. Floating-Point Types</TD></TR> +<TR><TD></TD><TD>4.3. Supported Floating-Point Functions</TD></TR> +<TR> + <TD></TD> + <TD>4.4. 
Non-canonical Representations in <CODE>extFloat80_t</CODE></TD> +</TR> +<TR><TD></TD><TD>4.5. Conventions for Passing Arguments and Results</TD></TR> +<TR><TD COLSPAN=2>5. Reserved Names</TD></TR> +<TR><TD COLSPAN=2>6. Mode Variables</TD></TR> +<TR><TD></TD><TD>6.1. Rounding Mode</TD></TR> +<TR><TD></TD><TD>6.2. Underflow Detection</TD></TR> +<TR> + <TD></TD> + <TD>6.3. Rounding Precision for the <NOBR>80-Bit</NOBR> Extended Format</TD> +</TR> +<TR><TD COLSPAN=2>7. Exceptions and Exception Flags</TD></TR> +<TR><TD COLSPAN=2>8. Function Details</TD></TR> +<TR><TD></TD><TD>8.1. Conversions from Integer to Floating-Point</TD></TR> +<TR><TD></TD><TD>8.2. Conversions from Floating-Point to Integer</TD></TR> +<TR><TD></TD><TD>8.3. Conversions Among Floating-Point Types</TD></TR> +<TR><TD></TD><TD>8.4. Basic Arithmetic Functions</TD></TR> +<TR><TD></TD><TD>8.5. Fused Multiply-Add Functions</TD></TR> +<TR><TD></TD><TD>8.6. Remainder Functions</TD></TR> +<TR><TD></TD><TD>8.7. Round-to-Integer Functions</TD></TR> +<TR><TD></TD><TD>8.8. Comparison Functions</TD></TR> +<TR><TD></TD><TD>8.9. Signaling NaN Test Functions</TD></TR> +<TR><TD></TD><TD>8.10. Raise-Exception Function</TD></TR> +<TR><TD COLSPAN=2>9. Changes from SoftFloat <NOBR>Release 2</NOBR></TD></TR> +<TR><TD></TD><TD>9.1. Name Changes</TD></TR> +<TR><TD></TD><TD>9.2. Changes to Function Arguments</TD></TR> +<TR><TD></TD><TD>9.3. Added Capabilities</TD></TR> +<TR><TD></TD><TD>9.4. Better Compatibility with the C Language</TD></TR> +<TR><TD></TD><TD>9.5. New Organization as a Library</TD></TR> +<TR><TD></TD><TD>9.6. Optimization Gains (and Losses)</TD></TR> +<TR><TD COLSPAN=2>10. Future Directions</TD></TR> +<TR><TD COLSPAN=2>11. Contact Information</TD></TR> +</TABLE> +</BLOCKQUOTE> <H2>1. Introduction</H2> @@ -156,15 +149,20 @@ SoftFloat <NOBR>Release 3</NOBR>. The SoftFloat package was written by me, <NOBR>John R.</NOBR> Hauser. 
<NOBR>Release 3</NOBR> of SoftFloat is a completely new implementation supplanting earlier releases. -This project was done in the employ of the University of California, Berkeley, -within the Department of Electrical Engineering and Computer Sciences, first -for the Parallel Computing Laboratory (Par Lab) and then for the ASPIRE Lab. +This project (<NOBR>Release 3</NOBR> only, not earlier releases) was done in +the employ of the University of California, Berkeley, within the Department of +Electrical Engineering and Computer Sciences, first for the Parallel Computing +Laboratory (Par Lab) and then for the ASPIRE Lab. The work was officially overseen by Prof. Krste Asanovic, with funding provided by these sources: <BLOCKQUOTE> <TABLE> +<COL WIDTH=*> +<COL WIDTH=10> +<COL WIDTH=*> <TR> -<TD><NOBR>Par Lab:</NOBR></TD> +<TD VALIGN=TOP><NOBR>Par Lab:</NOBR></TD> +<TD></TD> <TD> Microsoft (Award #024263), Intel (Award #024894), and U.C. Discovery (Award #DIG07-10227), with additional support from Par Lab affiliates Nokia, @@ -172,7 +170,8 @@ NVIDIA, Oracle, and Samsung. </TD> </TR> <TR> -<TD><NOBR>ASPIRE Lab:</NOBR></TD> +<TD VALIGN=TOP><NOBR>ASPIRE Lab:</NOBR></TD> +<TD></TD> <TD> DARPA PERFECT program (Award #HR0011-12-2-0016), with additional support from ASPIRE industrial sponsor Intel and ASPIRE affiliates Google, Nokia, NVIDIA, @@ -245,16 +244,18 @@ for these headers. 
Header <CODE>softfloat.h</CODE> depends only on the name <CODE>bool</CODE> from <CODE><stdbool.h></CODE> and on these type names from <CODE><stdint.h></CODE>: +<BLOCKQUOTE> <PRE> - uint16_t - uint32_t - uint64_t - int32_t - int64_t - uint_fast8_t - uint_fast32_t - uint_fast64_t +uint16_t +uint32_t +uint64_t +int32_t +int64_t +uint_fast8_t +uint_fast32_t +uint_fast64_t </PRE> +</BLOCKQUOTE> </P> @@ -263,26 +264,22 @@ Header <CODE>softfloat.h</CODE> depends only on the name <CODE>bool</CODE> from <P> The <CODE>softfloat.h</CODE> header defines four floating-point types: <BLOCKQUOTE> -<TABLE> +<TABLE CELLSPACING=0 CELLPADDING=0> <TR> <TD><CODE>float32_t</CODE></TD> -<TD> </TD> <TD><NOBR>32-bit</NOBR> single-precision binary format</TD> </TR> <TR> <TD><CODE>float64_t</CODE></TD> -<TD> </TD> <TD><NOBR>64-bit</NOBR> double-precision binary format</TD> </TR> <TR> -<TD><CODE>extFloat80_t</CODE></TD> -<TD> </TD> +<TD><CODE>extFloat80_t </CODE></TD> <TD><NOBR>80-bit</NOBR> double-extended-precision binary format (old Intel or Motorola format)</TD> </TR> <TR> <TD><CODE>float128_t</CODE></TD> -<TD> </TD> <TD><NOBR>128-bit</NOBR> quadruple-precision binary format</TD> </TR> </TABLE> @@ -304,10 +301,10 @@ Header file <CODE>softfloat.h</CODE> also defines a structure, This structure is the same size as type <CODE>extFloat80_t</CODE> and contains at least these two fields (not necessarily in this order): <BLOCKQUOTE> -<TABLE> -<TR><TD><CODE>uint16_t signExp;</CODE></TD></TR> -<TR><TD><CODE>uint64_t signif;</CODE></TD></TR> -</TABLE> +<PRE> +uint16_t signExp; +uint64_t signif; +</PRE> </BLOCKQUOTE> Field <CODE>signExp</CODE> contains the sign and exponent of the floating-point value, with the sign in the most significant bit (<NOBR>bit 15</NOBR>) and the @@ -339,8 +336,8 @@ operation defined by the IEEE Standard; for each format, the floating-point remainder operation defined by the IEEE Standard; <LI> -for each format, a ``round to integer'' operation that rounds to the nearest 
-integer value in the same format; and +for each format, a “round to integer” operation that rounds to the +nearest integer value in the same format; and <LI> comparisons between two values in the same floating-point format. </UL> @@ -357,12 +354,12 @@ not supported in SoftFloat <NOBR>Release 3</NOBR>: conversions between floating-point formats and decimal or hexadecimal character sequences; <LI> -all ``quiet-computation'' operations (<B>copy</B>, <B>negate</B>, <B>abs</B>, -and <B>copySign</B>, which all involve only simple copying and/or manipulation -of the floating-point sign bit); and +all “quiet-computation” operations (<B>copy</B>, <B>negate</B>, +<B>abs</B>, and <B>copySign</B>, which all involve only simple copying and/or +manipulation of the floating-point sign bit); and <LI> -all ``non-computational'' operations other than <B>isSignaling</B> (which is -supported). +all “non-computational” operations other than <B>isSignaling</B> +(which is supported). </UL> </P> @@ -393,9 +390,9 @@ leading significand bit must <NOBR>be 1</NOBR> unless it is required to For <NOBR>Release 3</NOBR> of SoftFloat, functions are not guaranteed to operate as expected when inputs of type <CODE>extFloat80_t</CODE> are non-canonical. -Assuming all of a function's <CODE>extFloat80_t</CODE> inputs (if any) are -canonical, function outputs of type <CODE>extFloat80_t</CODE> will always be -canonical. +Assuming all of a function’s <CODE>extFloat80_t</CODE> inputs (if any) +are canonical, function outputs of type <CODE>extFloat80_t</CODE> will always +be canonical. </P> <H3>4.5. Conventions for Passing Arguments and Results</H3> @@ -426,8 +423,8 @@ SoftFloat supplies this function: The first two arguments point to the values to be added, and the last argument points to the location where the sum will be stored. 
The <CODE>M</CODE> in the name <CODE>f128M_add</CODE> is mnemonic for the fact -that the <NOBR>128-bit</NOBR> inputs and outputs are ``in memory'', pointed to -by pointer arguments. +that the <NOBR>128-bit</NOBR> inputs and outputs are “in memory”, +pointed to by pointer arguments. </P> <P> @@ -464,10 +461,11 @@ platforms of interest, programmers can use whichever version they prefer. <P> In addition to the variables and functions documented here, SoftFloat defines some symbol names for its own private use. -These private names always begin with the prefix `<CODE>softfloat_</CODE>'. +These private names always begin with the prefix +‘<CODE>softfloat_</CODE>’. When a program includes header <CODE>softfloat.h</CODE> or links with the -SoftFloat library, all names with prefix `<CODE>softfloat_</CODE>' are reserved -for possible use by SoftFloat. +SoftFloat library, all names with prefix ‘<CODE>softfloat_</CODE>’ +are reserved for possible use by SoftFloat. Applications that use SoftFloat should not define their own names with this prefix, and should reference only such names as are documented. </P> @@ -477,7 +475,7 @@ prefix, and should reference only such names as are documented. 
<P> The following variables control rounding mode, underflow detection, and the -<NOBR>80-bit</NOBR> extended format's rounding precision: +<NOBR>80-bit</NOBR> extended format’s rounding precision: <BLOCKQUOTE> <CODE>softfloat_roundingMode</CODE><BR> <CODE>softfloat_detectTininess</CODE><BR> @@ -497,30 +495,25 @@ The rounding mode is selected by the global variable </BLOCKQUOTE> This variable may be set to one of the values <BLOCKQUOTE> -<TABLE> +<TABLE CELLSPACING=0 CELLPADDING=0> <TR> <TD><CODE>softfloat_round_near_even</CODE></TD> -<TD> </TD> <TD>round to nearest, with ties to even</TD> </TR> <TR> -<TD><CODE>softfloat_round_near_maxMag</CODE></TD> -<TD> </TD> +<TD><CODE>softfloat_round_near_maxMag </CODE></TD> <TD>round to nearest, with ties to maximum magnitude (away from zero)</TD> </TR> <TR> <TD><CODE>softfloat_round_minMag</CODE></TD> -<TD> </TD> <TD>round to minimum magnitude (toward zero)</TD> </TR> <TR> <TD><CODE>softfloat_round_min</CODE></TD> -<TD> </TD> <TD>round to minimum (down)</TD> </TR> <TR> <TD><CODE>softfloat_round_max</CODE></TD> -<TD> </TD> <TD>round to maximum (up)</TD> </TR> </TABLE> @@ -550,7 +543,7 @@ Like most systems (and as required by the newer 2008 IEEE Standard), SoftFloat always detects loss of accuracy for underflow as an inexact result. </P> -<H3>6.3. Rounding Precision for 80-Bit Extended Format</H3> +<H3>6.3. Rounding Precision for the <NOBR>80-Bit</NOBR> Extended Format</H3> <P> For <CODE>extFloat80_t</CODE> only, the rounding precision of the basic @@ -639,7 +632,7 @@ It does always raise the <I>inexact</I> exception flag as required. In this section, <CODE><<I>float</I>></CODE> appears in function names as a substitute for one of these abbreviations: <BLOCKQUOTE> -<TABLE> +<TABLE CELLSPACING=0 CELLPADDING=0> <TR> <TD><CODE>f32</CODE></TD> <TD>indicates <CODE>float32_t</CODE>, passed by value</TD> @@ -696,11 +689,14 @@ Each conversion function takes one input of the appropriate type and generates one output. 
The following illustrates the signatures of these functions in cases when the floating-point result is passed either by value or via pointers: +<BLOCKQUOTE> <PRE> - float64_t i32_to_f64( int32_t <I>a</I> ); - - void i32_to_f128M( int32_t <I>a</I>, float128_t *<I>destPtr</I> ); +float64_t i32_to_f64( int32_t <I>a</I> ); </PRE> +<PRE> +void i32_to_f128M( int32_t <I>a</I>, float128_t *<I>destPtr</I> ); +</PRE> +</BLOCKQUOTE> </P> <H3>8.2. Conversions from Floating-Point to Integer</H3> @@ -717,12 +713,15 @@ functions: </BLOCKQUOTE> The functions have signatures as follows, depending on whether the floating-point input is passed by value or via pointers: +<BLOCKQUOTE> <PRE> - int32_t f64_to_i32( float64_t <I>a</I>, uint_fast8_t <I>roundingMode</I>, bool <I>exact</I> ); - - int32_t - f128M_to_i32( const float128_t *<I>aPtr</I>, uint_fast8_t <I>roundingMode</I>, bool <I>exact</I> ); +int32_t f64_to_i32( float64_t <I>a</I>, uint_fast8_t <I>roundingMode</I>, bool <I>exact</I> ); </PRE> +<PRE> +int32_t + f128M_to_i32( const float128_t *<I>aPtr</I>, uint_fast8_t <I>roundingMode</I>, bool <I>exact</I> ); +</PRE> +</BLOCKQUOTE> The <CODE><I>roundingMode</I></CODE> argument specifies the rounding mode for the conversion. The variable that usually indicates rounding mode, @@ -768,12 +767,14 @@ and convenience: These functions round only toward zero (to minimum magnitude). The signatures for these functions are the same as above without the redundant <CODE><I>roundingMode</I></CODE> argument: +<BLOCKQUOTE> <PRE> - int32_t f64_to_i32_r_minMag( float64_t <I>a</I>, bool <I>exact</I> ); +int32_t f64_to_i32_r_minMag( float64_t <I>a</I>, bool <I>exact</I> ); </PRE> <PRE> - int32_t f128M_to_i32_r_minMag( const float128_t *<I>aPtr</I>, bool <I>exact</I> ); +int32_t f128M_to_i32_r_minMag( const float128_t *<I>aPtr</I>, bool <I>exact</I> ); </PRE> +</BLOCKQUOTE> </P> <H3>8.3. Conversions Among Floating-Point Types</H3> @@ -789,18 +790,20 @@ result are different formats. 
There are four different styles of signature for these functions, depending on whether the input and the output floating-point values are passed by value or via pointers: +<BLOCKQUOTE> <PRE> - float32_t f64_to_f32( float64_t <I>a</I> ); +float32_t f64_to_f32( float64_t <I>a</I> ); </PRE> <PRE> - float32_t f128M_to_f32( const float128_t *<I>aPtr</I> ); +float32_t f128M_to_f32( const float128_t *<I>aPtr</I> ); </PRE> <PRE> - void f32_to_f128M( float32_t <I>a</I>, float128_t *<I>destPtr</I> ); +void f32_to_f128M( float32_t <I>a</I>, float128_t *<I>destPtr</I> ); </PRE> <PRE> - void extF80M_to_f128M( const extFloat80_t *<I>aPtr</I>, float128_t *<I>destPtr</I> ); +void extF80M_to_f128M( const extFloat80_t *<I>aPtr</I>, float128_t *<I>destPtr</I> ); </PRE> +</BLOCKQUOTE> </P> <P> @@ -823,22 +826,22 @@ Each floating-point operation takes two operands, except for <CODE>sqrt</CODE> (square root) which takes only one. The operands and result are all of the same floating-point format. Signatures for these functions take the following forms: +<BLOCKQUOTE> <PRE> - float64_t f64_add( float64_t <I>a</I>, float64_t <I>b</I> ); +float64_t f64_add( float64_t <I>a</I>, float64_t <I>b</I> ); </PRE> <PRE> - void - f128M_add( - const float128_t *<I>aPtr</I>, const float128_t *<I>bPtr</I>, float128_t *<I>destPtr</I> ); +void + f128M_add( + const float128_t *<I>aPtr</I>, const float128_t *<I>bPtr</I>, float128_t *<I>destPtr</I> ); </PRE> -</P> -<P> <PRE> - float64_t f64_sqrt( float64_t <I>a</I> ); +float64_t f64_sqrt( float64_t <I>a</I> ); </PRE> <PRE> - void f128M_sqrt( const float128_t *<I>aPtr</I>, float128_t *<I>destPtr</I> ); +void f128M_sqrt( const float128_t *<I>aPtr</I>, float128_t *<I>destPtr</I> ); </PRE> +</BLOCKQUOTE> When floating-point values are passed indirectly through pointers, arguments <CODE><I>aPtr</I></CODE> and <CODE><I>bPtr</I></CODE> point to the input operands, and the last argument, <CODE><I>destPtr</I></CODE>, points to the @@ -850,7 +853,7 @@ Rounding of the 
<NOBR>80-bit</NOBR> double-extended-precision (<CODE>extFloat80_t</CODE>) functions is affected by variable <CODE>extF80_roundingPrecision</CODE>, as explained earlier in <NOBR>section 6.3</NOBR>, -<I>Rounding Precision for <NOBR>80-Bit</NOBR> Extended Format</I>. +<I>Rounding Precision for the <NOBR>80-Bit</NOBR> Extended Format</I>. </P> <H3>8.5. Fused Multiply-Add Functions</H3> @@ -873,18 +876,20 @@ No fused multiple-add function is currently provided for the <P> Depending on whether floating-point values are passed by value or via pointers, the fused multiply-add functions have signatures of these forms: +<BLOCKQUOTE> <PRE> - float64_t f64_mulAdd( float64_t <I>a</I>, float64_t <I>b</I>, float64_t <I>c</I> ); +float64_t f64_mulAdd( float64_t <I>a</I>, float64_t <I>b</I>, float64_t <I>c</I> ); </PRE> <PRE> - void - f128M_mulAdd( - const float128_t *<I>aPtr</I>, - const float128_t *<I>bPtr</I>, - const float128_t *<I>cPtr</I>, - float128_t *<I>destPtr</I> - ); +void + f128M_mulAdd( + const float128_t *<I>aPtr</I>, + const float128_t *<I>bPtr</I>, + const float128_t *<I>cPtr</I>, + float128_t *<I>destPtr</I> + ); </PRE> +</BLOCKQUOTE> The functions compute <NOBR>(<CODE><I>a</I></CODE> × <CODE><I>b</I></CODE>) + <CODE><I>c</I></CODE></NOBR> @@ -915,14 +920,16 @@ Each remainder operation takes two floating-point operands of the same format and returns a result in the same format. 
Depending on whether floating-point values are passed by value or via pointers, the remainder functions have signatures of these forms: +<BLOCKQUOTE> <PRE> - float64_t f64_rem( float64_t <I>a</I>, float64_t <I>b</I> ); +float64_t f64_rem( float64_t <I>a</I>, float64_t <I>b</I> ); </PRE> <PRE> - void - f128M_rem( - const float128_t *<I>aPtr</I>, const float128_t *<I>bPtr</I>, float128_t *<I>destPtr</I> ); +void + f128M_rem( + const float128_t *<I>aPtr</I>, const float128_t *<I>bPtr</I>, float128_t *<I>destPtr</I> ); </PRE> +</BLOCKQUOTE> When floating-point values are passed indirectly through pointers, arguments <CODE><I>aPtr</I></CODE> and <CODE><I>bPtr</I></CODE> point to operands <CODE><I>a</I></CODE> and <CODE><I>b</I></CODE> respectively, and @@ -938,8 +945,8 @@ where <I>n</I> is the integer closest to If <NOBR><CODE><I>a</I></CODE> ÷ <CODE><I>b</I></CODE></NOBR> is exactly halfway between two integers, <I>n</I> is the <EM>even</EM> integer closest to <NOBR><CODE><I>a</I></CODE> ÷ <CODE><I>b</I></CODE></NOBR>. -The IEEE Standard's remainder operation is always exact and so requires no -rounding. +The IEEE Standard’s remainder operation is always exact and so requires +no rounding. </P> <P> @@ -968,18 +975,20 @@ and the resulting integer value is returned in the same floating-point format. 
<P> The signatures of the round-to-integer functions are similar to those for conversions to an integer type: +<BLOCKQUOTE> <PRE> - float64_t f64_roundToInt( float64_t <I>a</I>, uint_fast8_t <I>roundingMode</I>, bool <I>exact</I> ); +float64_t f64_roundToInt( float64_t <I>a</I>, uint_fast8_t <I>roundingMode</I>, bool <I>exact</I> ); </PRE> <PRE> - void - f128M_roundToInt( - const float128_t *<I>aPtr</I>, - uint_fast8_t <I>roundingMode</I>, - bool <I>exact</I>, - float128_t *<I>destPtr</I> - ); +void + f128M_roundToInt( + const float128_t *<I>aPtr</I>, + uint_fast8_t <I>roundingMode</I>, + bool <I>exact</I>, + float128_t *<I>destPtr</I> + ); </PRE> +</BLOCKQUOTE> The <CODE><I>roundingMode</I></CODE> argument specifies the rounding mode to apply. The variable that usually indicates rounding mode, @@ -1005,17 +1014,19 @@ provided: <CODE><<I>float</I>>_lt</CODE> </BLOCKQUOTE> Each comparison takes two operands of the same type and returns a Boolean. -The abbreviation <CODE>eq</CODE> stands for ``equal'' (=); -<CODE>le</CODE> stands for ``less than or equal'' (≤); -and <CODE>lt</CODE> stands for ``less than'' (<). +The abbreviation <CODE>eq</CODE> stands for “equal” (=); +<CODE>le</CODE> stands for “less than or equal” (≤); +and <CODE>lt</CODE> stands for “less than” (<). Depending on whether the floating-point operands are passed by value or via pointers, the comparison functions have signatures of these forms: +<BLOCKQUOTE> <PRE> - bool f64_eq( float64_t <I>a</I>, float64_t <I>b</I> ); +bool f64_eq( float64_t <I>a</I>, float64_t <I>b</I> ); </PRE> <PRE> - bool f128M_eq( const float128_t *<I>aPtr</I>, const float128_t *<I>bPtr</I> ); +bool f128M_eq( const float128_t *<I>aPtr</I>, const float128_t *<I>bPtr</I> ); </PRE> +</BLOCKQUOTE> </P> <P> @@ -1058,21 +1069,25 @@ provided with these names: The functions take one floating-point operand and return a Boolean indicating whether the operand is a signaling NaN. 
Accordingly, the functions have the forms +<BLOCKQUOTE> <PRE> - bool f64_isSignalingNaN( float64_t <I>a</I> ); +bool f64_isSignalingNaN( float64_t <I>a</I> ); </PRE> <PRE> - bool f128M_isSignalingNaN( const float128_t *<I>aPtr</I> ); +bool f128M_isSignalingNaN( const float128_t *<I>aPtr</I> ); </PRE> +</BLOCKQUOTE> </P> <H3>8.10. Raise-Exception Function</H3> <P> SoftFloat provides a single function for raising floating-point exceptions: +<BLOCKQUOTE> <PRE> - void softfloat_raise( uint_fast8_t <I>exceptions</I> ); +void softfloat_raise( uint_fast8_t <I>exceptions</I> ); </PRE> +</BLOCKQUOTE> The <CODE><I>exceptions</I></CODE> argument is a mask indicating the set of exceptions to raise. (See earlier section 7, <I>Exceptions and Exception Flags</I>.) @@ -1084,6 +1099,11 @@ function may cause a trap or abort appropriate for the current system. <H2>9. Changes from SoftFloat <NOBR>Release 2</NOBR></H2> +<P> +Apart from the change in the legal use license, there are numerous technical +differences between <NOBR>Release 3</NOBR> of SoftFloat and earlier releases. +</P> + <H3>9.1. Name Changes</H3> <P> @@ -1214,17 +1234,17 @@ Lastly, there are a few other changes to function names: <TR> <TD><CODE>_round_to_zero</CODE></TD> <TD><CODE>_r_minMag</CODE></TD> -<TD>conversions from floating-point to integer, section 8.2</TD> +<TD>conversions from floating-point to integer (<NOBR>section 8.2</NOBR>)</TD> </TR> <TR> <TD><CODE>round_to_int</CODE></TD> <TD><CODE>roundToInt</CODE></TD> -<TD>round-to-integer functions, section 8.7</TD> +<TD>round-to-integer functions (<NOBR>section 8.7</NOBR>)</TD> </TR> <TR> <TD><CODE>is_signaling_nan </CODE></TD> <TD><CODE>isSignalingNaN</CODE></TD> -<TD>signaling NaN test functions, section 8.9</TD> +<TD>signaling NaN test functions (<NOBR>section 8.9</NOBR>)</TD> </TR> </TABLE> </BLOCKQUOTE> @@ -1296,7 +1316,7 @@ argument <CODE><I>exact</I></CODE>. 
<P> With <NOBR>Release 3</NOBR>, a port of SoftFloat can now define any of the floating-point types <CODE>float32_t</CODE>, <CODE>float64_t</CODE>, -<CODE>extFloat80_t</CODE>, and <CODE>float128_t</CODE> as aliases for C's +<CODE>extFloat80_t</CODE>, and <CODE>float128_t</CODE> as aliases for C’s standard floating-point types <CODE>float</CODE>, <CODE>double</CODE>, and <CODE>long</CODE> <CODE>double</CODE>, using either <CODE>#define</CODE> or <CODE>typedef</CODE>. @@ -1304,9 +1324,9 @@ This potential convenience was not supported under <NOBR>Release 2</NOBR>. </P> <P> -(Note, however, that there may be a performance cost to defining SoftFloat's -floating-point types this way, depending on the platform and the applications -using SoftFloat. +(Note, however, that there may be a performance cost to defining +SoftFloat’s floating-point types this way, depending on the platform and +the applications using SoftFloat. Ports of SoftFloat may choose to forgo the convenience in favor of better speed.) </P> @@ -1338,7 +1358,7 @@ Fused multiply-add functions have been added for the non-extended formats, <P> <NOBR>Release 3</NOBR> of SoftFloat is written to conform better to the ISO C -Standard's rules for portability. +Standard’s rules for portability. For example, older releases of SoftFloat employed type conversions in ways that, while commonly practiced, are not fully defined by the C Standard. Such problematic type conversions have generally been replaced by the use of @@ -1387,8 +1407,8 @@ Some loss of speed has been observed due to this change. The following improvements are anticipated for future releases of SoftFloat: <UL> <LI> -support for the common <NOBR>16-bit</NOBR> ``half-precision'' floating-point -format; +support for the common <NOBR>16-bit</NOBR> “half-precision” +floating-point format; <LI> more functions from the 2008 version of the IEEE Floating-Point Standard; <LI> |