author | John Hauser <jhauser@eecs.berkeley.edu> | 2014-12-17 19:09:39 -0800 |
---|---|---|
committer | John Hauser <jhauser@eecs.berkeley.edu> | 2014-12-17 19:09:39 -0800 |
commit | cec54960bbbfa351cab7dab75eb1418585e4fe64 (patch) | |
tree | 8c606f0c513bc0ef9582795bd159be8dcffaf565 /doc/TestFloat-general.html | |
parent | 86cdc156a7c1bb471c11b14d65b9d2b48b714935 (diff) | |
Finalized documentation for TestFloat Release 3.
Diffstat (limited to 'doc/TestFloat-general.html')
-rw-r--r-- | doc/TestFloat-general.html | 507 |
1 files changed, 305 insertions, 202 deletions
diff --git a/doc/TestFloat-general.html b/doc/TestFloat-general.html index 1618d4a..d72807e 100644 --- a/doc/TestFloat-general.html +++ b/doc/TestFloat-general.html @@ -11,49 +11,38 @@ <P> John R. Hauser<BR> -2014 ______<BR> -</P> - -<P> -*** CONTENT DONE. -</P> - -<P> -*** REPLACE QUOTATION MARKS. -<BR> -*** REPLACE APOSTROPHES. -<BR> -*** REPLACE EM DASH. +2014 Dec 17<BR> </P> <H2>Contents</H2> -<P> -*** CHECK.<BR> -*** FIX FORMATTING. -</P> - -<PRE> - Introduction - Limitations - Acknowledgments and License - What TestFloat Does - Executing TestFloat - Operations Tested by TestFloat - Conversion Operations - Basic Arithmetic Operations - Fused Multiply-Add Operations - Remainder Operations - Round-to-Integer Operations - Comparison Operations - Interpreting TestFloat Output - Variations Allowed by the IEEE Floating-Point Standard - Underflow - NaNs - Conversions to Integer - Contact Information -</PRE> +<BLOCKQUOTE> +<TABLE BORDER=0 CELLSPACING=0 CELLPADDING=0> +<COL WIDTH=25> +<COL WIDTH=*> +<TR><TD COLSPAN=2>1. Introduction</TD></TR> +<TR><TD COLSPAN=2>2. Limitations</TD></TR> +<TR><TD COLSPAN=2>3. Acknowledgments and License</TD></TR> +<TR><TD COLSPAN=2>4. What TestFloat Does</TD></TR> +<TR><TD COLSPAN=2>5. Executing TestFloat</TD></TR> +<TR><TD COLSPAN=2>6. Operations Tested by TestFloat</TD></TR> +<TR><TD></TD><TD>6.1. Conversion Operations</TD></TR> +<TR><TD></TD><TD>6.2. Basic Arithmetic Operations</TD></TR> +<TR><TD></TD><TD>6.3. Fused Multiply-Add Operations</TD></TR> +<TR><TD></TD><TD>6.4. Remainder Operations</TD></TR> +<TR><TD></TD><TD>6.5. Round-to-Integer Operations</TD></TR> +<TR><TD></TD><TD>6.6. Comparison Operations</TD></TR> +<TR><TD COLSPAN=2>7. Interpreting TestFloat Output</TD></TR> +<TR> + <TD COLSPAN=2>8. Variations Allowed by the IEEE Floating-Point Standard</TD> +</TR> +<TR><TD></TD><TD>8.1. Underflow</TD></TR> +<TR><TD></TD><TD>8.2. NaNs</TD></TR> +<TR><TD></TD><TD>8.3. Conversions to Integer</TD></TR> +<TR><TD COLSPAN=2>9. 
Contact Information</TD></TR> +</TABLE> +</BLOCKQUOTE> <H2>1. Introduction</H2> @@ -89,8 +78,8 @@ Details about the standard are available elsewhere. <P> The current version of TestFloat is <NOBR>Release 3</NOBR>. -The set of TestFloat programs as well as the programs' arguments and behavior -have changed some compared to earlier TestFloat releases. +The set of TestFloat programs as well as the programs’ arguments and +behavior have changed some compared to earlier TestFloat releases. </P> @@ -119,15 +108,20 @@ bugs can be found through links posted on the TestFloat Web page The TestFloat package was written by me, <NOBR>John R.</NOBR> Hauser. <NOBR>Release 3</NOBR> of TestFloat is a completely new implementation supplanting earlier releases. -This project was done in the employ of the University of California, Berkeley, -within the Department of Electrical Engineering and Computer Sciences, first -for the Parallel Computing Laboratory (Par Lab) and then for the ASPIRE Lab. +This project (<NOBR>Release 3</NOBR> only, not earlier releases) was done in +the employ of the University of California, Berkeley, within the Department of +Electrical Engineering and Computer Sciences, first for the Parallel Computing +Laboratory (Par Lab) and then for the ASPIRE Lab. The work was officially overseen by Prof. Krste Asanovic, with funding provided by these sources: <BLOCKQUOTE> <TABLE> +<COL WIDTH=*> +<COL WIDTH=10> +<COL WIDTH=*> <TR> -<TD><NOBR>Par Lab:</NOBR></TD> +<TD VALIGN=TOP><NOBR>Par Lab:</NOBR></TD> +<TD></TD> <TD> Microsoft (Award #024263), Intel (Award #024894), and U.C. Discovery (Award #DIG07-10227), with additional support from Par Lab affiliates Nokia, @@ -135,7 +129,8 @@ NVIDIA, Oracle, and Samsung. 
</TD> </TR> <TR> -<TD><NOBR>ASPIRE Lab:</NOBR></TD> +<TD VALIGN=TOP><NOBR>ASPIRE Lab:</NOBR></TD> +<TD></TD> <TD> DARPA PERFECT program (Award #HR0011-12-2-0016), with additional support from ASPIRE industrial sponsor Intel and ASPIRE affiliates Google, Nokia, NVIDIA, @@ -191,8 +186,8 @@ ENHANCEMENTS, OR MODIFICATIONS. <P> TestFloat is designed to test a floating-point implementation by comparing its -behavior with that of TestFloat's own internal floating-point implemented in -software. +behavior with that of TestFloat’s own internal floating-point implemented +in software. For each operation to be tested, the TestFloat programs can generate a large number of test cases, made up of simple pattern tests intermixed with weighted random inputs. @@ -263,19 +258,20 @@ for programs <CODE>testfloat_ver</CODE> and <CODE>testfloat</CODE>. TestFloat normally compares an implementation of floating-point against the Berkeley SoftFloat software implementation of floating-point, also created by me. -The SoftFloat functions are linked into each TestFloat program's executable. +The SoftFloat functions are linked into each TestFloat program’s +executable. Information about SoftFloat can be found at the Web page <A HREF="http://www.jhauser.us/arithmetic/SoftFloat.html"><CODE>http://www.jhauser.us/arithmetic/SoftFloat.html</CODE></A>. </P> <P> For testing SoftFloat itself, the TestFloat package includes a -<CODE>testsoftfloat</CODE> program that compares SoftFloat's floating-point -against <EM>another</EM> software floating-point implementation. +<CODE>testsoftfloat</CODE> program that compares SoftFloat’s +floating-point against <EM>another</EM> software floating-point implementation. The second software floating-point is simpler and slower than SoftFloat, and is completely independent of SoftFloat. Although the second software floating-point cannot be guaranteed to be -bug-free, the chance that it would mimic any of SoftFloat's bugs is low. 
+bug-free, the chance that it would mimic any of SoftFloat’s bugs is low. Consequently, an error in one or the other floating-point version should appear as an unexpected difference between the two implementations. Note that testing SoftFloat should be necessary only when compiling a new @@ -347,9 +343,11 @@ These results can then be piped to <CODE>testfloat_ver</CODE> to be checked for correctness. Assuming a vertical bar (<CODE>|</CODE>) indicates a pipe between programs, the complete process could be written as a single command like so: +<BLOCKQUOTE> <PRE> - testfloat_gen ... <type> | <program-that-invokes-op> | testfloat_ver ... <function> +testfloat_gen ... <type> | <program-that-invokes-op> | testfloat_ver ... <function> </PRE> +</BLOCKQUOTE> The program in the middle is not supplied by TestFloat but must be created independently. If for some reason this program cannot take command-line arguments, the @@ -363,9 +361,11 @@ A second method for running TestFloat is similar but has expected results for each case. With this additional information, the job done by <CODE>testfloat_ver</CODE> can be folded into the invoking program to give the following command: +<BLOCKQUOTE> <PRE> - testfloat_gen ... <function> | <program-that-invokes-op-and-compares-results> +testfloat_gen ... <function> | <program-that-invokes-op-and-compares-results> </PRE> +</BLOCKQUOTE> Again, the program that actually invokes the floating-point operation is not supplied by TestFloat but must be created independently. 
Depending on circumstance, it may be preferable either to let @@ -429,8 +429,8 @@ multiplication, division, and square root operations; for each format, the floating-point remainder operation defined by the IEEE Standard; <LI> -for each format, a ``round to integer'' operation that rounds to the nearest -integer value in the same format; and +for each format, a “round to integer” operation that rounds to the +nearest integer value in the same format; and <LI> comparisons between two values in the same floating-point format. </UL> @@ -451,8 +451,8 @@ is called <CODE>f32</CODE>, <NOBR>64-bit</NOBR> double-precision is <CODE>extF80</CODE>, and <NOBR>128-bit</NOBR> quadruple-precision is <CODE>f128</CODE>. TestFloat generally uses the same names for operations as Berkeley SoftFloat, -except that TestFloat's names never include the <CODE>M</CODE> that SoftFloat -uses to indicate that values are passed through pointers. +except that TestFloat’s names never include the <CODE>M</CODE> that +SoftFloat uses to indicate that values are passed through pointers. </P> <H3>6.1. Conversion Operations</H3> @@ -462,21 +462,23 @@ All conversions among the floating-point formats and all conversions between a floating-point format and <NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR> integers can be tested. 
The conversion operations are: +<BLOCKQUOTE> <PRE> - ui32_to_f32 ui64_to_f32 i32_to_f32 i64_to_f32 - ui32_to_f64 ui64_to_f64 i32_to_f64 i64_to_f64 - ui32_to_extF80 ui64_to_extF80 i32_to_extF80 i64_to_extF80 - ui32_to_f128 ui64_to_f128 i32_to_f128 i64_to_f128 - - f32_to_ui32 f64_to_ui32 extF80_to_ui32 f128_to_ui32 - f32_to_ui64 f64_to_ui64 extF80_to_ui64 f128_to_ui64 - f32_to_i32 f64_to_i32 extF80_to_i32 f128_to_i32 - f32_to_i64 f64_to_i64 extF80_to_i64 f128_to_i64 - - f32_to_f64 f64_to_f32 extF80_to_f32 f128_to_f32 - f32_to_extF80 f64_to_extF80 extF80_to_f64 f128_to_f64 - f32_to_f128 f64_to_f128 extF80_to_f128 f128_to_extF80 +ui32_to_f32 ui64_to_f32 i32_to_f32 i64_to_f32 +ui32_to_f64 ui64_to_f64 i32_to_f64 i64_to_f64 +ui32_to_extF80 ui64_to_extF80 i32_to_extF80 i64_to_extF80 +ui32_to_f128 ui64_to_f128 i32_to_f128 i64_to_f128 + +f32_to_ui32 f64_to_ui32 extF80_to_ui32 f128_to_ui32 +f32_to_ui64 f64_to_ui64 extF80_to_ui64 f128_to_ui64 +f32_to_i32 f64_to_i32 extF80_to_i32 f128_to_i32 +f32_to_i64 f64_to_i64 extF80_to_i64 f128_to_i64 + +f32_to_f64 f64_to_f32 extF80_to_f32 f128_to_f32 +f32_to_extF80 f64_to_extF80 extF80_to_f64 f128_to_f64 +f32_to_f128 f64_to_f128 extF80_to_f128 f128_to_extF80 </PRE> +</BLOCKQUOTE> Abbreviations <CODE>ui32</CODE> and <CODE>ui64</CODE> indicate <NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR> unsigned integer types, while <CODE>i32</CODE> and <CODE>i64</CODE> indicate their signed counterparts. @@ -495,22 +497,27 @@ operations requires amendment. For <CODE>testfloat</CODE> only, conversions to an integer type have names that explicitly specify the rounding mode and treatment of inexactness. 
Thus, instead of +<BLOCKQUOTE> <PRE> - <float>_to_<int> +<float>_to_<int> </PRE> +</BLOCKQUOTE> as listed above, operations converting to integer type have names of these forms: +<BLOCKQUOTE> <PRE> - <float>_to_<int>_r_<round> - <float>_to_<int>_rx_<round> +<float>_to_<int>_r_<round> +<float>_to_<int>_rx_<round> </PRE> -The <CODE><round></CODE> component is one of `<CODE>near_even</CODE>', -`<CODE>near_maxMag</CODE>', `<CODE>minMag</CODE>', `<CODE>min</CODE>', or -`<CODE>max</CODE>', choosing the rounding mode. +</BLOCKQUOTE> +The <CODE><round></CODE> component is one of +‘<CODE>near_even</CODE>’, ‘<CODE>near_maxMag</CODE>’, +‘<CODE>minMag</CODE>’, ‘<CODE>min</CODE>’, or +‘<CODE>max</CODE>’, choosing the rounding mode. Any other indication of rounding mode is ignored. -The operations with `<CODE>_r_</CODE>' in their names never raise the -<I>inexact</I> exception, while those with `<CODE>_rx_</CODE>' raise the -<I>inexact</I> exception whenever the result is not exact. +The operations with ‘<CODE>_r_</CODE>’ in their names never raise +the <I>inexact</I> exception, while those with ‘<CODE>_rx_</CODE>’ +raise the <I>inexact</I> exception whenever the result is not exact. </P> <P> @@ -518,7 +525,8 @@ TestFloat assumes that conversions from floating-point to an integer type should raise the <I>invalid</I> exception if the input cannot be rounded to an integer representable by the result format. In such a circumstance, if the result type is an unsigned integer, TestFloat -expects the result of the operation to be the type's largest integer value. +expects the result of the operation to be the type’s largest integer +value. If the result type is a signed integer and conversion overflows, TestFloat expects the result to be the largest-magnitude integer with the same sign as the input. @@ -533,12 +541,14 @@ exception. 
<P> The following standard arithmetic operations can be tested: +<BLOCKQUOTE> <PRE> - f32_add f32_sub f32_mul f32_div f32_sqrt - f64_add f64_sub f64_mul f64_div f64_sqrt - extF80_add extF80_sub extF80_mul extF80_div extF80_sqrt - f128_add f128_sub f128_mul f128_div f128_sqrt +f32_add f32_sub f32_mul f32_div f32_sqrt +f64_add f64_sub f64_mul f64_div f64_sqrt +extF80_add extF80_sub extF80_mul extF80_div extF80_sqrt +f128_add f128_sub f128_mul f128_div f128_sqrt </PRE> +</BLOCKQUOTE> The double-extended-precision (<CODE>extF80</CODE>) operations can be rounded to reduced precision under rounding precision control. </P> @@ -550,11 +560,13 @@ For all floating-point formats except <NOBR>80-bit</NOBR> double-extended-precision, TestFloat can test the fused multiply-add operation defined by the 2008 IEEE Floating-Point Standard. The fused multiply-add operations are: +<BLOCKQUOTE> <PRE> - f32_mulAdd - f64_mulAdd - f128_mulAdd +f32_mulAdd +f64_mulAdd +f128_mulAdd </PRE> +</BLOCKQUOTE> </P> <P> @@ -566,29 +578,34 @@ exception even if the third operand is a NaN. <H3>6.4. Remainder Operations</H3> <P> -For each format, TestFloat can test the IEEE Standard's remainder operation. +For each format, TestFloat can test the IEEE Standard’s remainder +operation. These operations are: +<BLOCKQUOTE> <PRE> - f32_rem - f64_rem - extF80_rem - f128_rem +f32_rem +f64_rem +extF80_rem +f128_rem </PRE> +</BLOCKQUOTE> The remainder operations are always exact and so require no rounding. </P> <H3>6.5. Round-to-Integer Operations</H3> <P> -For each format, TestFloat can test the IEEE Standard's round-to-integer +For each format, TestFloat can test the IEEE Standard’s round-to-integer operation. 
For most TestFloat programs, these operations are: +<BLOCKQUOTE> <PRE> - f32_roundToInt - f64_roundToInt - extF80_roundToInt - f128_roundToInt +f32_roundToInt +f64_roundToInt +extF80_roundToInt +f128_roundToInt </PRE> +</BLOCKQUOTE> </P> <P> @@ -596,35 +613,40 @@ Just as for conversions to integer types (<NOBR>section 6.1</NOBR> above), the all-in-one <CODE>testfloat</CODE> program is again an exception. For <CODE>testfloat</CODE> only, the round-to-integer operations have names of these forms: +<BLOCKQUOTE> <PRE> - <float>_roundToInt_r_<round> - <float>_roundToInt_x +<float>_roundToInt_r_<round> +<float>_roundToInt_x </PRE> -For the `<CODE>_r_</CODE>' versions, the <I>inexact</I> exception is never -raised, and the <CODE><round></CODE> component specifies the rounding -mode as one of `<CODE>near_even</CODE>', `<CODE>near_maxMag</CODE>', -`<CODE>minMag</CODE>', `<CODE>min</CODE>', or `<CODE>max</CODE>'. +</BLOCKQUOTE> +For the ‘<CODE>_r_</CODE>’ versions, the <I>inexact</I> exception +is never raised, and the <CODE><round></CODE> component specifies the +rounding mode as one of ‘<CODE>near_even</CODE>’, +‘<CODE>near_maxMag</CODE>’, ‘<CODE>minMag</CODE>’, +‘<CODE>min</CODE>’, or ‘<CODE>max</CODE>’. The usual indication of rounding mode is ignored. -In contrast, the `<CODE>_x</CODE>' versions accept the usual indication of -rounding mode and raise the <I>inexact</I> exception whenever the result is not -exact. -This irregular system follows the IEEE Standard's precise specification for the -round-to-integer operations. +In contrast, the ‘<CODE>_x</CODE>’ versions accept the usual +indication of rounding mode and raise the <I>inexact</I> exception whenever the +result is not exact. +This irregular system follows the IEEE Standard’s precise specification +for the round-to-integer operations. </P> <H3>6.6. 
Comparison Operations</H3> <P> The following floating-point comparison operations can be tested: +<BLOCKQUOTE> <PRE> - f32_eq f32_le f32_lt - f64_eq f64_le f64_lt - extF80_eq extF80_le extF80_lt - f128_eq f128_le f128_lt +f32_eq f32_le f32_lt +f64_eq f64_le f64_lt +extF80_eq extF80_le extF80_lt +f128_eq f128_le f128_lt </PRE> -The abbreviation <CODE>eq</CODE> stands for ``equal'' (=), <CODE>le</CODE> -stands for ``less than or equal'' (≤), and <CODE>lt</CODE> stands for -``less than'' (<). +</BLOCKQUOTE> +The abbreviation <CODE>eq</CODE> stands for “equal” (=), +<CODE>le</CODE> stands for “less than or equal” (≤), and +<CODE>lt</CODE> stands for “less than” (<). </P> <P> @@ -635,12 +657,14 @@ The equality comparisons, on the other hand, are defined by default to raise the <I>invalid</I> exception only for signaling NaNs, not for quiet NaNs. For completeness, the following additional operations can be tested if supported: +<BLOCKQUOTE> <PRE> - f32_eq_signaling f32_le_quiet f32_lt_quiet - f64_eq_signaling f64_le_quiet f64_lt_quiet - extF80_eq_signaling extF80_le_quiet extF80_lt_quiet - f128_eq_signaling f128_le_quiet f128_lt_quiet +f32_eq_signaling f32_le_quiet f32_lt_quiet +f64_eq_signaling f64_le_quiet f64_lt_quiet +extF80_eq_signaling extF80_le_quiet extF80_lt_quiet +f128_eq_signaling f128_le_quiet f128_lt_quiet </PRE> +</BLOCKQUOTE> The <CODE>signaling</CODE> equality comparisons are identical to the standard operations except that the <I>invalid</I> exception should be raised for any NaN input. @@ -658,8 +682,8 @@ Any rounding mode is ignored. <H2>7. Interpreting TestFloat Output</H2> <P> -The ``errors'' reported by TestFloat programs may or may not really represent -errors in the system being tested. +The “errors” reported by TestFloat programs may or may not really +represent errors in the system being tested. 
For each test case tried, the results from the floating-point implementation being tested could differ from the expected results for several reasons: <UL> @@ -694,14 +718,16 @@ For each reported error (or apparent error), a line of text is written to the default output. If a line would be longer than 79 characters, it is divided. The first part of each error line begins in the leftmost column, and any -subsequent ``continuation'' lines are indented with a tab. +subsequent “continuation” lines are indented with a tab. </P> <P> Each error reported is of the form: +<BLOCKQUOTE> <PRE> - <inputs> => <observed-output> expected: <expected-output> +<inputs> => <observed-output> expected: <expected-output> </PRE> +</BLOCKQUOTE> The <CODE><inputs></CODE> are the inputs to the operation. Each output (observed and expected) is shown as a pair: the result value first, followed by the exception flags. @@ -709,10 +735,12 @@ first, followed by the exception flags. <P> For example, two typical error lines could be +<BLOCKQUOTE> <PRE> - 800.7FFF00 87F.000100 => 001.000000 ...ux expected: 001.000000 ....x - 081.000004 000.1FFFFF => 001.000000 ...ux expected: 001.000000 ....x +800.7FFF00 87F.000100 => 001.000000 ...ux expected: 001.000000 ....x +081.000004 000.1FFFFF => 001.000000 ...ux expected: 001.000000 ....x </PRE> +</BLOCKQUOTE> In the first line, the inputs are <CODE>800.7FFF00</CODE> and <CODE>87F.000100</CODE>, and the observed result is <CODE>001.000000</CODE> with flags <CODE>...ux</CODE>. @@ -732,8 +760,9 @@ Four are floating-point types: <NOBR>32-bit</NOBR> single-precision, <NOBR>64-bit</NOBR> double-precision, <NOBR>80-bit</NOBR> double-extended-precision, and <NOBR>128-bit</NOBR> quadruple-precision. The remaining five types are <NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR> -unsigned integers, <NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR> two's-complement -signed integers, and Boolean values (the results of comparison operations). 
+unsigned integers, <NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR> +two’s-complement signed integers, and Boolean values (the results of +comparison operations). Boolean values are represented as a single character, either a <CODE>0</CODE> or a <CODE>1</CODE>. <NOBR>32-bit</NOBR> integers are represented as 8 hexadecimal digits. @@ -749,47 +778,93 @@ hexadecimal digits that give the raw bits of the floating-point encoding. A period separates the 3rd and 4th hexadecimal digits to mark the division between the exponent bits and fraction bits. Some notable <NOBR>64-bit</NOBR> double-precision values include: -<PRE> - 000.0000000000000 +0 - 3FF.0000000000000 1 - 400.0000000000000 2 - 7FF.0000000000000 +infinity - - 800.0000000000000 -0 - BFF.0000000000000 -1 - C00.0000000000000 -2 - FFF.0000000000000 -infinity - - 3FE.FFFFFFFFFFFFF largest representable number less than +1 -</PRE> +<BLOCKQUOTE> +<TABLE CELLSPACING=0 CELLPADDING=0> +<TR> + <TD><CODE>000.0000000000000 </CODE></TD> + <TD>+0</TD> +</TR> +<TR><TD><CODE>3FF.0000000000000</CODE></TD><TD> 1</TD></TR> +<TR><TD><CODE>400.0000000000000</CODE></TD><TD> 2</TD></TR> +<TR><TD><CODE>7FF.0000000000000</CODE></TD><TD>+infinity</TD></TR> +<TR><TD> </TD></TR> +<TR><TD><CODE>800.0000000000000</CODE></TD><TD>−0</TD></TR> +<TR><TD><CODE>BFF.0000000000000</CODE></TD><TD>−1</TD></TR> +<TR><TD><CODE>C00.0000000000000</CODE></TD><TD>−2</TD></TR> +<TR><TD><CODE>FFF.0000000000000</CODE></TD><TD>−infinity</TD></TR> +<TR><TD> </TD></TR> +<TR> + <TD><CODE>3FE.FFFFFFFFFFFFF</CODE></TD> + <TD>largest representable number less than +1</TD> +</TR> +</TABLE> +</BLOCKQUOTE> The following categories are easily distinguished (assuming the <CODE>x</CODE>s are not all 0): -<PRE> - 000.xxxxxxxxxxxxx positive subnormal (denormalized) numbers - 7FF.xxxxxxxxxxxxx positive NaNs - 800.xxxxxxxxxxxxx negative subnormal numbers - FFF.xxxxxxxxxxxxx negative NaNs -</PRE> +<BLOCKQUOTE> +<TABLE CELLSPACING=0 CELLPADDING=0> +<TR> + <TD><CODE>000.xxxxxxxxxxxxx 
</CODE></TD> + <TD>positive subnormal (denormalized) numbers</TD> +</TR> +<TR><TD><CODE>7FF.xxxxxxxxxxxxx</CODE></TD><TD>positive NaNs</TD></TR> +<TR> + <TD><CODE>800.xxxxxxxxxxxxx</CODE></TD> + <TD>negative subnormal numbers</TD> +</TR> +<TR><TD><CODE>FFF.xxxxxxxxxxxxx</CODE></TD><TD>negative NaNs</TD></TR> +</TABLE> +</BLOCKQUOTE> </P> <P> <NOBR>128-bit</NOBR> quadruple-precision values are written the same except with 4 hexadecimal digits for the sign and exponent and 28 for the fraction. Notable values include: -<PRE> - 0000.0000000000000000000000000000 +0 - 3FFF.0000000000000000000000000000 1 - 4000.0000000000000000000000000000 2 - 7FFF.0000000000000000000000000000 +infinity - - 8000.0000000000000000000000000000 -0 - BFFF.0000000000000000000000000000 -1 - C000.0000000000000000000000000000 -2 - FFFF.0000000000000000000000000000 -infinity - - 3FFE.FFFFFFFFFFFFFFFFFFFFFFFFFFFF largest representable number - less than +1 -</PRE> +<BLOCKQUOTE> +<TABLE CELLSPACING=0 CELLPADDING=0> +<TR> + <TD> + <CODE>0000.0000000000000000000000000000 </CODE> + </TD> + <TD>+0</TD> +</TR> +<TR> + <TD><CODE>3FFF.0000000000000000000000000000</CODE></TD> + <TD> 1</TD> +</TR> +<TR> + <TD><CODE>4000.0000000000000000000000000000</CODE></TD> + <TD> 2</TD> +</TR> +<TR> + <TD><CODE>7FFF.0000000000000000000000000000</CODE></TD> + <TD>+infinity</TD> +</TR> +<TR><TD> </TD></TR> +<TR> + <TD><CODE>8000.0000000000000000000000000000</CODE></TD> + <TD>−0</TD> +</TR> +<TR> + <TD><CODE>BFFF.0000000000000000000000000000</CODE></TD> + <TD>−1</TD> +</TR> +<TR> + <TD><CODE>C000.0000000000000000000000000000</CODE></TD> + <TD>−2</TD> +</TR> +<TR> + <TD><CODE>FFFF.0000000000000000000000000000</CODE></TD> + <TD>−infinity</TD> +</TR> +<TR><TD> </TD></TR> +<TR> + <TD><CODE>3FFE.FFFFFFFFFFFFFFFFFFFFFFFFFFFF</CODE></TD> + <TD>largest representable number less than +1</TD> +</TR> +</TABLE> +</BLOCKQUOTE> </P> <P> @@ -801,19 +876,27 @@ and will be 1 otherwise. 
Hence, the same values listed above appear in <NOBR>80-bit</NOBR> double-extended-precision as follows (note the leading <CODE>8</CODE> digit in the significands): -<PRE> - 0000.0000000000000000 +0 - 3FFF.8000000000000000 1 - 4000.8000000000000000 2 - 7FFF.8000000000000000 +infinity - - 8000.0000000000000000 -0 - BFFF.8000000000000000 -1 - C000.8000000000000000 -2 - FFFF.8000000000000000 -infinity - - 3FFE.FFFFFFFFFFFFFFFF largest representable number less than +1 -</PRE> +<BLOCKQUOTE> +<TABLE CELLSPACING=0 CELLPADDING=0> +<TR> + <TD><CODE>0000.0000000000000000 </CODE></TD> + <TD>+0</TD> +</TR> +<TR><TD><CODE>3FFF.8000000000000000</CODE></TD><TD> 1</TD></TR> +<TR><TD><CODE>4000.8000000000000000</CODE></TD><TD> 2</TD></TR> +<TR><TD><CODE>7FFF.8000000000000000</CODE></TD><TD>+infinity</TD></TR> +<TR><TD> </TD></TR> +<TR><TD><CODE>8000.0000000000000000</CODE></TD><TD>−0</TD></TR> +<TR><TD><CODE>BFFF.8000000000000000</CODE></TD><TD>−1</TD></TR> +<TR><TD><CODE>C000.8000000000000000</CODE></TD><TD>−2</TD></TR> +<TR><TD><CODE>FFFF.8000000000000000</CODE></TD><TD>−infinity</TD></TR> +<TR><TD> </TD></TR> +<TR> + <TD><CODE>3FFE.FFFFFFFFFFFFFFFF</CODE></TD> + <TD>largest representable number less than +1</TD> +</TR> +</TABLE> +</BLOCKQUOTE> </P> <P> @@ -826,11 +909,13 @@ These are written as 9 hexadecimal digits, with a period separating the 3rd and 4th hexadecimal digits. Broken out into bits, the 9 hexademical digits cover the <NOBR>32-bit</NOBR> single-precision subfields as follows: +<BLOCKQUOTE> <PRE> - x000 .... .... . .... .... .... .... .... .... sign (1 bit) - .... xxxx xxxx . .... .... .... .... .... .... exponent (8 bits) - .... .... .... . 0xxx xxxx xxxx xxxx xxxx xxxx fraction (23 bits) +x000 .... .... . .... .... .... .... .... .... sign (1 bit) +.... xxxx xxxx . .... .... .... .... .... .... exponent (8 bits) +.... .... .... . 
0xxx xxxx xxxx xxxx xxxx xxxx fraction (23 bits) </PRE> +</BLOCKQUOTE> As shown in this schematic, the first hexadecimal digit contains only the sign, and will be either <CODE>0</CODE> <NOBR>or <CODE>8</CODE></NOBR>. The next two digits give the biased exponent as an <NOBR>8-bit</NOBR> integer. @@ -841,27 +926,37 @@ The most significant hexadecimal digit of the fraction can be at most <P> Notable single-precision values include: -<PRE> - 000.000000 +0 - 07F.000000 1 - 080.000000 2 - 0FF.000000 +infinity - - 800.000000 -0 - 87F.000000 -1 - 880.000000 -2 - 8FF.000000 -infinity - - 07E.7FFFFF largest representable number less than +1 -</PRE> +<BLOCKQUOTE> +<TABLE CELLSPACING=0 CELLPADDING=0> +<TR><TD><CODE>000.000000 </CODE></TD><TD>+0</TD></TR> +<TR><TD><CODE>07F.000000</CODE></TD><TD> 1</TD></TR> +<TR><TD><CODE>080.000000</CODE></TD><TD> 2</TD></TR> +<TR><TD><CODE>0FF.000000</CODE></TD><TD>+infinity</TD></TR> +<TR><TD> </TD></TR> +<TR><TD><CODE>800.000000</CODE></TD><TD>−0</TD></TR> +<TR><TD><CODE>87F.000000</CODE></TD><TD>−1</TD></TR> +<TR><TD><CODE>880.000000</CODE></TD><TD>−2</TD></TR> +<TR><TD><CODE>8FF.000000</CODE></TD><TD>−infinity</TD></TR> +<TR><TD> </TD></TR> +<TR> + <TD><CODE>07E.7FFFFF</CODE></TD> + <TD>largest representable number less than +1</TD> +</TR> +</TABLE> +</BLOCKQUOTE> Again, certain categories are easily distinguished (assuming the <CODE>x</CODE>s are not all 0): -<PRE> - 000.xxxxxx positive subnormal (denormalized) numbers - 0FF.xxxxxx positive NaNs - 800.xxxxxx negative subnormal numbers - 8FF.xxxxxx negative NaNs -</PRE> +<BLOCKQUOTE> +<TABLE CELLSPACING=0 CELLPADDING=0> +<TR> + <TD><CODE>000.xxxxxx </CODE></TD> + <TD>positive subnormal (denormalized) numbers</TD> +</TR> +<TR><TD><CODE>0FF.xxxxxx</CODE></TD><TD>positive NaNs</TD></TR> +<TR><TD><CODE>800.xxxxxx</CODE></TD><TD>negative subnormal numbers</TD></TR> +<TR><TD><CODE>8FF.xxxxxx</CODE></TD><TD>negative NaNs</TD></TR> +</TABLE> +</BLOCKQUOTE> </P> <P> @@ -871,13 +966,21 @@ Each 
flag is written as either a letter or a period (<CODE>.</CODE>) according to whether the flag was set or not by the operation. A period indicates the flag was not set. The letter used to indicate a set flag depends on the flag: -<PRE> - v invalid exception - i infinite exception ("divide by zero") - o overflow exception - u underflow exception - x inexact exception -</PRE> +<BLOCKQUOTE> +<TABLE CELLSPACING=0 CELLPADDING=0> +<TR> + <TD><CODE>v </CODE></TD> + <TD>invalid exception</TD> +</TR> +<TR> + <TD><CODE>i</CODE></TD> + <TD>infinite exception (“divide by zero”)</TD> +</TR> +<TR><TD><CODE>o</CODE></TD><TD>overflow exception</TD></TR> +<TR><TD><CODE>u</CODE></TD><TD>underflow exception</TD></TR> +<TR><TD><CODE>x</CODE></TD><TD>inexact exception</TD></TR> +</TABLE> +</BLOCKQUOTE> For example, the notation <CODE>...ux</CODE> indicates that the <I>underflow</I> and <I>inexact</I> exception flags were set and that the other three flags (<I>invalid</I>, <I>infinite</I>, and <I>overflow</I>) were not |