path: root/doc/TestFloat-general.html
author John Hauser <jhauser@eecs.berkeley.edu> 2014-12-17 19:09:39 -0800
committer John Hauser <jhauser@eecs.berkeley.edu> 2014-12-17 19:09:39 -0800
commit cec54960bbbfa351cab7dab75eb1418585e4fe64
tree 8c606f0c513bc0ef9582795bd159be8dcffaf565 /doc/TestFloat-general.html
parent 86cdc156a7c1bb471c11b14d65b9d2b48b714935
Finalized documentation for TestFloat Release 3.
Diffstat (limited to 'doc/TestFloat-general.html')
-rw-r--r-- doc/TestFloat-general.html 507
1 files changed, 305 insertions, 202 deletions
diff --git a/doc/TestFloat-general.html b/doc/TestFloat-general.html
index 1618d4a..d72807e 100644
--- a/doc/TestFloat-general.html
+++ b/doc/TestFloat-general.html
@@ -11,49 +11,38 @@
<P>
John R. Hauser<BR>
-2014 ______<BR>
-</P>
-
-<P>
-*** CONTENT DONE.
-</P>
-
-<P>
-*** REPLACE QUOTATION MARKS.
-<BR>
-*** REPLACE APOSTROPHES.
-<BR>
-*** REPLACE EM DASH.
+2014 Dec 17<BR>
</P>
<H2>Contents</H2>
-<P>
-*** CHECK.<BR>
-*** FIX FORMATTING.
-</P>
-
-<PRE>
- Introduction
- Limitations
- Acknowledgments and License
- What TestFloat Does
- Executing TestFloat
- Operations Tested by TestFloat
- Conversion Operations
- Basic Arithmetic Operations
- Fused Multiply-Add Operations
- Remainder Operations
- Round-to-Integer Operations
- Comparison Operations
- Interpreting TestFloat Output
- Variations Allowed by the IEEE Floating-Point Standard
- Underflow
- NaNs
- Conversions to Integer
- Contact Information
-</PRE>
+<BLOCKQUOTE>
+<TABLE BORDER=0 CELLSPACING=0 CELLPADDING=0>
+<COL WIDTH=25>
+<COL WIDTH=*>
+<TR><TD COLSPAN=2>1. Introduction</TD></TR>
+<TR><TD COLSPAN=2>2. Limitations</TD></TR>
+<TR><TD COLSPAN=2>3. Acknowledgments and License</TD></TR>
+<TR><TD COLSPAN=2>4. What TestFloat Does</TD></TR>
+<TR><TD COLSPAN=2>5. Executing TestFloat</TD></TR>
+<TR><TD COLSPAN=2>6. Operations Tested by TestFloat</TD></TR>
+<TR><TD></TD><TD>6.1. Conversion Operations</TD></TR>
+<TR><TD></TD><TD>6.2. Basic Arithmetic Operations</TD></TR>
+<TR><TD></TD><TD>6.3. Fused Multiply-Add Operations</TD></TR>
+<TR><TD></TD><TD>6.4. Remainder Operations</TD></TR>
+<TR><TD></TD><TD>6.5. Round-to-Integer Operations</TD></TR>
+<TR><TD></TD><TD>6.6. Comparison Operations</TD></TR>
+<TR><TD COLSPAN=2>7. Interpreting TestFloat Output</TD></TR>
+<TR>
+ <TD COLSPAN=2>8. Variations Allowed by the IEEE Floating-Point Standard</TD>
+</TR>
+<TR><TD></TD><TD>8.1. Underflow</TD></TR>
+<TR><TD></TD><TD>8.2. NaNs</TD></TR>
+<TR><TD></TD><TD>8.3. Conversions to Integer</TD></TR>
+<TR><TD COLSPAN=2>9. Contact Information</TD></TR>
+</TABLE>
+</BLOCKQUOTE>
<H2>1. Introduction</H2>
@@ -89,8 +78,8 @@ Details about the standard are available elsewhere.
<P>
The current version of TestFloat is <NOBR>Release 3</NOBR>.
-The set of TestFloat programs as well as the programs' arguments and behavior
-have changed some compared to earlier TestFloat releases.
+The set of TestFloat programs as well as the programs&rsquo; arguments and
+behavior have changed some compared to earlier TestFloat releases.
</P>
@@ -119,15 +108,20 @@ bugs can be found through links posted on the TestFloat Web page
The TestFloat package was written by me, <NOBR>John R.</NOBR> Hauser.
<NOBR>Release 3</NOBR> of TestFloat is a completely new implementation
supplanting earlier releases.
-This project was done in the employ of the University of California, Berkeley,
-within the Department of Electrical Engineering and Computer Sciences, first
-for the Parallel Computing Laboratory (Par Lab) and then for the ASPIRE Lab.
+This project (<NOBR>Release 3</NOBR> only, not earlier releases) was done in
+the employ of the University of California, Berkeley, within the Department of
+Electrical Engineering and Computer Sciences, first for the Parallel Computing
+Laboratory (Par Lab) and then for the ASPIRE Lab.
The work was officially overseen by Prof. Krste Asanovic, with funding provided
by these sources:
<BLOCKQUOTE>
<TABLE>
+<COL WIDTH=*>
+<COL WIDTH=10>
+<COL WIDTH=*>
<TR>
-<TD><NOBR>Par Lab:</NOBR></TD>
+<TD VALIGN=TOP><NOBR>Par Lab:</NOBR></TD>
+<TD></TD>
<TD>
Microsoft (Award #024263), Intel (Award #024894), and U.C. Discovery
(Award #DIG07-10227), with additional support from Par Lab affiliates Nokia,
@@ -135,7 +129,8 @@ NVIDIA, Oracle, and Samsung.
</TD>
</TR>
<TR>
-<TD><NOBR>ASPIRE Lab:</NOBR></TD>
+<TD VALIGN=TOP><NOBR>ASPIRE Lab:</NOBR></TD>
+<TD></TD>
<TD>
DARPA PERFECT program (Award #HR0011-12-2-0016), with additional support from
ASPIRE industrial sponsor Intel and ASPIRE affiliates Google, Nokia, NVIDIA,
@@ -191,8 +186,8 @@ ENHANCEMENTS, OR MODIFICATIONS.
<P>
TestFloat is designed to test a floating-point implementation by comparing its
-behavior with that of TestFloat's own internal floating-point implemented in
-software.
+behavior with that of TestFloat&rsquo;s own internal floating-point implemented
+in software.
For each operation to be tested, the TestFloat programs can generate a large
number of test cases, made up of simple pattern tests intermixed with weighted
random inputs.
@@ -263,19 +258,20 @@ for programs <CODE>testfloat_ver</CODE> and <CODE>testfloat</CODE>.
TestFloat normally compares an implementation of floating-point against the
Berkeley SoftFloat software implementation of floating-point, also created by
me.
-The SoftFloat functions are linked into each TestFloat program's executable.
+The SoftFloat functions are linked into each TestFloat program&rsquo;s
+executable.
Information about SoftFloat can be found at the Web page
<A HREF="http://www.jhauser.us/arithmetic/SoftFloat.html"><CODE>http://www.jhauser.us/arithmetic/SoftFloat.html</CODE></A>.
</P>
<P>
For testing SoftFloat itself, the TestFloat package includes a
-<CODE>testsoftfloat</CODE> program that compares SoftFloat's floating-point
-against <EM>another</EM> software floating-point implementation.
+<CODE>testsoftfloat</CODE> program that compares SoftFloat&rsquo;s
+floating-point against <EM>another</EM> software floating-point implementation.
The second software floating-point is simpler and slower than SoftFloat, and is
completely independent of SoftFloat.
Although the second software floating-point cannot be guaranteed to be
-bug-free, the chance that it would mimic any of SoftFloat's bugs is low.
+bug-free, the chance that it would mimic any of SoftFloat&rsquo;s bugs is low.
Consequently, an error in one or the other floating-point version should appear
as an unexpected difference between the two implementations.
Note that testing SoftFloat should be necessary only when compiling a new
@@ -347,9 +343,11 @@ These results can then be piped to <CODE>testfloat_ver</CODE> to be checked for
correctness.
Assuming a vertical bar (<CODE>|</CODE>) indicates a pipe between programs, the
complete process could be written as a single command like so:
+<BLOCKQUOTE>
<PRE>
- testfloat_gen ... &lt;type&gt; | &lt;program-that-invokes-op&gt; | testfloat_ver ... &lt;function&gt;
+testfloat_gen ... &lt;type&gt; | &lt;program-that-invokes-op&gt; | testfloat_ver ... &lt;function&gt;
</PRE>
+</BLOCKQUOTE>
The program in the middle is not supplied by TestFloat but must be created
independently.
If for some reason this program cannot take command-line arguments, the
@@ -363,9 +361,11 @@ A second method for running TestFloat is similar but has
expected results for each case.
With this additional information, the job done by <CODE>testfloat_ver</CODE>
can be folded into the invoking program to give the following command:
+<BLOCKQUOTE>
<PRE>
- testfloat_gen ... &lt;function&gt; | &lt;program-that-invokes-op-and-compares-results&gt;
+testfloat_gen ... &lt;function&gt; | &lt;program-that-invokes-op-and-compares-results&gt;
</PRE>
+</BLOCKQUOTE>
Again, the program that actually invokes the floating-point operation is not
supplied by TestFloat but must be created independently.
Depending on circumstance, it may be preferable either to let
@@ -429,8 +429,8 @@ multiplication, division, and square root operations;
for each format, the floating-point remainder operation defined by the IEEE
Standard;
<LI>
-for each format, a ``round to integer'' operation that rounds to the nearest
-integer value in the same format; and
+for each format, a &ldquo;round to integer&rdquo; operation that rounds to the
+nearest integer value in the same format; and
<LI>
comparisons between two values in the same floating-point format.
</UL>
@@ -451,8 +451,8 @@ is called <CODE>f32</CODE>, <NOBR>64-bit</NOBR> double-precision is
<CODE>extF80</CODE>, and <NOBR>128-bit</NOBR> quadruple-precision is
<CODE>f128</CODE>.
TestFloat generally uses the same names for operations as Berkeley SoftFloat,
-except that TestFloat's names never include the <CODE>M</CODE> that SoftFloat
-uses to indicate that values are passed through pointers.
+except that TestFloat&rsquo;s names never include the <CODE>M</CODE> that
+SoftFloat uses to indicate that values are passed through pointers.
</P>
<H3>6.1. Conversion Operations</H3>
@@ -462,21 +462,23 @@ All conversions among the floating-point formats and all conversions between a
floating-point format and <NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR> integers
can be tested.
The conversion operations are:
+<BLOCKQUOTE>
<PRE>
- ui32_to_f32 ui64_to_f32 i32_to_f32 i64_to_f32
- ui32_to_f64 ui64_to_f64 i32_to_f64 i64_to_f64
- ui32_to_extF80 ui64_to_extF80 i32_to_extF80 i64_to_extF80
- ui32_to_f128 ui64_to_f128 i32_to_f128 i64_to_f128
-
- f32_to_ui32 f64_to_ui32 extF80_to_ui32 f128_to_ui32
- f32_to_ui64 f64_to_ui64 extF80_to_ui64 f128_to_ui64
- f32_to_i32 f64_to_i32 extF80_to_i32 f128_to_i32
- f32_to_i64 f64_to_i64 extF80_to_i64 f128_to_i64
-
- f32_to_f64 f64_to_f32 extF80_to_f32 f128_to_f32
- f32_to_extF80 f64_to_extF80 extF80_to_f64 f128_to_f64
- f32_to_f128 f64_to_f128 extF80_to_f128 f128_to_extF80
+ui32_to_f32 ui64_to_f32 i32_to_f32 i64_to_f32
+ui32_to_f64 ui64_to_f64 i32_to_f64 i64_to_f64
+ui32_to_extF80 ui64_to_extF80 i32_to_extF80 i64_to_extF80
+ui32_to_f128 ui64_to_f128 i32_to_f128 i64_to_f128
+
+f32_to_ui32 f64_to_ui32 extF80_to_ui32 f128_to_ui32
+f32_to_ui64 f64_to_ui64 extF80_to_ui64 f128_to_ui64
+f32_to_i32 f64_to_i32 extF80_to_i32 f128_to_i32
+f32_to_i64 f64_to_i64 extF80_to_i64 f128_to_i64
+
+f32_to_f64 f64_to_f32 extF80_to_f32 f128_to_f32
+f32_to_extF80 f64_to_extF80 extF80_to_f64 f128_to_f64
+f32_to_f128 f64_to_f128 extF80_to_f128 f128_to_extF80
</PRE>
+</BLOCKQUOTE>
Abbreviations <CODE>ui32</CODE> and <CODE>ui64</CODE> indicate
<NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR> unsigned integer types, while
<CODE>i32</CODE> and <CODE>i64</CODE> indicate their signed counterparts.
@@ -495,22 +497,27 @@ operations requires amendment.
For <CODE>testfloat</CODE> only, conversions to an integer type have names that
explicitly specify the rounding mode and treatment of inexactness.
Thus, instead of
+<BLOCKQUOTE>
<PRE>
- &lt;float&gt;_to_&lt;int&gt;
+&lt;float&gt;_to_&lt;int&gt;
</PRE>
+</BLOCKQUOTE>
as listed above, operations converting to integer type have names of these
forms:
+<BLOCKQUOTE>
<PRE>
- &lt;float&gt;_to_&lt;int&gt;_r_&lt;round&gt;
- &lt;float&gt;_to_&lt;int&gt;_rx_&lt;round&gt;
+&lt;float&gt;_to_&lt;int&gt;_r_&lt;round&gt;
+&lt;float&gt;_to_&lt;int&gt;_rx_&lt;round&gt;
</PRE>
-The <CODE>&lt;round&gt;</CODE> component is one of `<CODE>near_even</CODE>',
-`<CODE>near_maxMag</CODE>', `<CODE>minMag</CODE>', `<CODE>min</CODE>', or
-`<CODE>max</CODE>', choosing the rounding mode.
+</BLOCKQUOTE>
+The <CODE>&lt;round&gt;</CODE> component is one of
+&lsquo;<CODE>near_even</CODE>&rsquo;, &lsquo;<CODE>near_maxMag</CODE>&rsquo;,
+&lsquo;<CODE>minMag</CODE>&rsquo;, &lsquo;<CODE>min</CODE>&rsquo;, or
+&lsquo;<CODE>max</CODE>&rsquo;, choosing the rounding mode.
Any other indication of rounding mode is ignored.
-The operations with `<CODE>_r_</CODE>' in their names never raise the
-<I>inexact</I> exception, while those with `<CODE>_rx_</CODE>' raise the
-<I>inexact</I> exception whenever the result is not exact.
+The operations with &lsquo;<CODE>_r_</CODE>&rsquo; in their names never raise
+the <I>inexact</I> exception, while those with &lsquo;<CODE>_rx_</CODE>&rsquo;
+raise the <I>inexact</I> exception whenever the result is not exact.
</P>
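As a rough illustration of the naming scheme described above, the following Python sketch enumerates the integer-conversion names that testfloat would use. The helper name to_int_operation_names is hypothetical, and the sketch assumes that every combination of format, integer type, variant, and rounding mode listed in the text is accepted.

```python
from itertools import product

# All identifiers below come from the documentation text; only the helper
# function itself is an illustrative assumption.
FLOAT_FORMATS = ["f32", "f64", "extF80", "f128"]
INT_TYPES = ["ui32", "ui64", "i32", "i64"]
VARIANTS = ["r", "rx"]  # "r": never raises inexact; "rx": raises it when inexact
ROUNDING_MODES = ["near_even", "near_maxMag", "minMag", "min", "max"]


def to_int_operation_names():
    """Return every <float>_to_<int>_r_<round> and _rx_<round> name."""
    return [
        f"{flt}_to_{integer}_{variant}_{mode}"
        for flt, integer, variant, mode in product(
            FLOAT_FORMATS, INT_TYPES, VARIANTS, ROUNDING_MODES
        )
    ]


names = to_int_operation_names()
print(len(names))  # 4 formats * 4 integer types * 2 variants * 5 modes
print(names[0])
```

A list like this can be handy for scripting a loop over operations, though the authoritative set of names is whatever the testfloat program itself reports.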
<P>
@@ -518,7 +525,8 @@ TestFloat assumes that conversions from floating-point to an integer type
should raise the <I>invalid</I> exception if the input cannot be rounded to an
integer representable by the result format.
In such a circumstance, if the result type is an unsigned integer, TestFloat
-expects the result of the operation to be the type's largest integer value.
+expects the result of the operation to be the type&rsquo;s largest integer
+value.
If the result type is a signed integer and conversion overflows, TestFloat
expects the result to be the largest-magnitude integer with the same sign as
the input.
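The expectation stated above for conversions that raise the invalid exception can be written down as a small Python sketch. The function name expected_invalid_result is hypothetical, and the signed branch covers only the overflow case described in the text; NaN inputs to signed conversions are not handled here.

```python
def expected_invalid_result(int_type: str, input_is_negative: bool) -> int:
    """Integer result TestFloat expects when a conversion raises invalid.

    int_type is one of "ui32", "ui64", "i32", "i64" (names from the text).
    For signed types this models only the overflow case.
    """
    bits = int(int_type[-2:])  # 32 or 64
    if int_type.startswith("u"):
        # Unsigned result type: the type's largest integer value.
        return 2**bits - 1
    # Signed overflow: largest-magnitude integer with the input's sign.
    if input_is_negative:
        return -(2 ** (bits - 1))
    return 2 ** (bits - 1) - 1


print(hex(expected_invalid_result("ui32", False)))
print(expected_invalid_result("i32", True))
```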
@@ -533,12 +541,14 @@ exception.
<P>
The following standard arithmetic operations can be tested:
+<BLOCKQUOTE>
<PRE>
- f32_add f32_sub f32_mul f32_div f32_sqrt
- f64_add f64_sub f64_mul f64_div f64_sqrt
- extF80_add extF80_sub extF80_mul extF80_div extF80_sqrt
- f128_add f128_sub f128_mul f128_div f128_sqrt
+f32_add f32_sub f32_mul f32_div f32_sqrt
+f64_add f64_sub f64_mul f64_div f64_sqrt
+extF80_add extF80_sub extF80_mul extF80_div extF80_sqrt
+f128_add f128_sub f128_mul f128_div f128_sqrt
</PRE>
+</BLOCKQUOTE>
The double-extended-precision (<CODE>extF80</CODE>) operations can be rounded
to reduced precision under rounding precision control.
</P>
@@ -550,11 +560,13 @@ For all floating-point formats except <NOBR>80-bit</NOBR>
double-extended-precision, TestFloat can test the fused multiply-add operation
defined by the 2008 IEEE Floating-Point Standard.
The fused multiply-add operations are:
+<BLOCKQUOTE>
<PRE>
- f32_mulAdd
- f64_mulAdd
- f128_mulAdd
+f32_mulAdd
+f64_mulAdd
+f128_mulAdd
</PRE>
+</BLOCKQUOTE>
</P>
<P>
@@ -566,29 +578,34 @@ exception even if the third operand is a NaN.
<H3>6.4. Remainder Operations</H3>
<P>
-For each format, TestFloat can test the IEEE Standard's remainder operation.
+For each format, TestFloat can test the IEEE Standard&rsquo;s remainder
+operation.
These operations are:
+<BLOCKQUOTE>
<PRE>
- f32_rem
- f64_rem
- extF80_rem
- f128_rem
+f32_rem
+f64_rem
+extF80_rem
+f128_rem
</PRE>
+</BLOCKQUOTE>
The remainder operations are always exact and so require no rounding.
</P>
<H3>6.5. Round-to-Integer Operations</H3>
<P>
-For each format, TestFloat can test the IEEE Standard's round-to-integer
+For each format, TestFloat can test the IEEE Standard&rsquo;s round-to-integer
operation.
For most TestFloat programs, these operations are:
+<BLOCKQUOTE>
<PRE>
- f32_roundToInt
- f64_roundToInt
- extF80_roundToInt
- f128_roundToInt
+f32_roundToInt
+f64_roundToInt
+extF80_roundToInt
+f128_roundToInt
</PRE>
+</BLOCKQUOTE>
</P>
<P>
@@ -596,35 +613,40 @@ Just as for conversions to integer types (<NOBR>section 6.1</NOBR> above), the
all-in-one <CODE>testfloat</CODE> program is again an exception.
For <CODE>testfloat</CODE> only, the round-to-integer operations have names of
these forms:
+<BLOCKQUOTE>
<PRE>
- &lt;float&gt;_roundToInt_r_&lt;round&gt;
- &lt;float&gt;_roundToInt_x
+&lt;float&gt;_roundToInt_r_&lt;round&gt;
+&lt;float&gt;_roundToInt_x
</PRE>
-For the `<CODE>_r_</CODE>' versions, the <I>inexact</I> exception is never
-raised, and the <CODE>&lt;round&gt;</CODE> component specifies the rounding
-mode as one of `<CODE>near_even</CODE>', `<CODE>near_maxMag</CODE>',
-`<CODE>minMag</CODE>', `<CODE>min</CODE>', or `<CODE>max</CODE>'.
+</BLOCKQUOTE>
+For the &lsquo;<CODE>_r_</CODE>&rsquo; versions, the <I>inexact</I> exception
+is never raised, and the <CODE>&lt;round&gt;</CODE> component specifies the
+rounding mode as one of &lsquo;<CODE>near_even</CODE>&rsquo;,
+&lsquo;<CODE>near_maxMag</CODE>&rsquo;, &lsquo;<CODE>minMag</CODE>&rsquo;,
+&lsquo;<CODE>min</CODE>&rsquo;, or &lsquo;<CODE>max</CODE>&rsquo;.
The usual indication of rounding mode is ignored.
-In contrast, the `<CODE>_x</CODE>' versions accept the usual indication of
-rounding mode and raise the <I>inexact</I> exception whenever the result is not
-exact.
-This irregular system follows the IEEE Standard's precise specification for the
-round-to-integer operations.
+In contrast, the &lsquo;<CODE>_x</CODE>&rsquo; versions accept the usual
+indication of rounding mode and raise the <I>inexact</I> exception whenever the
+result is not exact.
+This irregular system follows the IEEE Standard&rsquo;s precise specification
+for the round-to-integer operations.
</P>
<H3>6.6. Comparison Operations</H3>
<P>
The following floating-point comparison operations can be tested:
+<BLOCKQUOTE>
<PRE>
- f32_eq f32_le f32_lt
- f64_eq f64_le f64_lt
- extF80_eq extF80_le extF80_lt
- f128_eq f128_le f128_lt
+f32_eq f32_le f32_lt
+f64_eq f64_le f64_lt
+extF80_eq extF80_le extF80_lt
+f128_eq f128_le f128_lt
</PRE>
-The abbreviation <CODE>eq</CODE> stands for ``equal'' (=), <CODE>le</CODE>
-stands for ``less than or equal'' (&le;), and <CODE>lt</CODE> stands for
-``less than'' (&lt;).
+</BLOCKQUOTE>
+The abbreviation <CODE>eq</CODE> stands for &ldquo;equal&rdquo; (=),
+<CODE>le</CODE> stands for &ldquo;less than or equal&rdquo; (&le;), and
+<CODE>lt</CODE> stands for &ldquo;less than&rdquo; (&lt;).
</P>
<P>
@@ -635,12 +657,14 @@ The equality comparisons, on the other hand, are defined by default to raise
the <I>invalid</I> exception only for signaling NaNs, not for quiet NaNs.
For completeness, the following additional operations can be tested if
supported:
+<BLOCKQUOTE>
<PRE>
- f32_eq_signaling f32_le_quiet f32_lt_quiet
- f64_eq_signaling f64_le_quiet f64_lt_quiet
- extF80_eq_signaling extF80_le_quiet extF80_lt_quiet
- f128_eq_signaling f128_le_quiet f128_lt_quiet
+f32_eq_signaling f32_le_quiet f32_lt_quiet
+f64_eq_signaling f64_le_quiet f64_lt_quiet
+extF80_eq_signaling extF80_le_quiet extF80_lt_quiet
+f128_eq_signaling f128_le_quiet f128_lt_quiet
</PRE>
+</BLOCKQUOTE>
The <CODE>signaling</CODE> equality comparisons are identical to the standard
operations except that the <I>invalid</I> exception should be raised for any
NaN input.
@@ -658,8 +682,8 @@ Any rounding mode is ignored.
<H2>7. Interpreting TestFloat Output</H2>
<P>
-The ``errors'' reported by TestFloat programs may or may not really represent
-errors in the system being tested.
+The &ldquo;errors&rdquo; reported by TestFloat programs may or may not really
+represent errors in the system being tested.
For each test case tried, the results from the floating-point implementation
being tested could differ from the expected results for several reasons:
<UL>
@@ -694,14 +718,16 @@ For each reported error (or apparent error), a line of text is written to the
default output.
If a line would be longer than 79 characters, it is divided.
The first part of each error line begins in the leftmost column, and any
-subsequent ``continuation'' lines are indented with a tab.
+subsequent &ldquo;continuation&rdquo; lines are indented with a tab.
</P>
<P>
Each error reported is of the form:
+<BLOCKQUOTE>
<PRE>
- &lt;inputs&gt; => &lt;observed-output&gt; expected: &lt;expected-output&gt;
+&lt;inputs&gt; => &lt;observed-output&gt; expected: &lt;expected-output&gt;
</PRE>
+</BLOCKQUOTE>
The <CODE>&lt;inputs&gt;</CODE> are the inputs to the operation.
Each output (observed and expected) is shown as a pair: the result value
first, followed by the exception flags.
@@ -709,10 +735,12 @@ first, followed by the exception flags.
<P>
For example, two typical error lines could be
+<BLOCKQUOTE>
<PRE>
- 800.7FFF00 87F.000100 => 001.000000 ...ux expected: 001.000000 ....x
- 081.000004 000.1FFFFF => 001.000000 ...ux expected: 001.000000 ....x
+800.7FFF00 87F.000100 => 001.000000 ...ux expected: 001.000000 ....x
+081.000004 000.1FFFFF => 001.000000 ...ux expected: 001.000000 ....x
</PRE>
+</BLOCKQUOTE>
In the first line, the inputs are <CODE>800.7FFF00</CODE> and
<CODE>87F.000100</CODE>, and the observed result is <CODE>001.000000</CODE>
with flags <CODE>...ux</CODE>.
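For post-processing a log of such lines, a minimal Python sketch of a parser might look as follows. The function name parse_error_line and the returned dictionary keys are illustrative assumptions, and the sketch expects any tab-indented continuation lines to have been joined back onto their first line beforehand.

```python
def parse_error_line(line: str) -> dict:
    """Split '<inputs> => <observed> expected: <expected>' into its parts.

    Each output is a (value, flags) pair, as described in the text.
    """
    inputs_part, rest = line.split("=>", 1)
    observed_part, expected_part = rest.split("expected:", 1)
    observed_value, observed_flags = observed_part.split()
    expected_value, expected_flags = expected_part.split()
    return {
        "inputs": inputs_part.split(),
        "observed": (observed_value, observed_flags),
        "expected": (expected_value, expected_flags),
    }


def set_flags(flags: str) -> set:
    """Return the letters of the flags that are set (periods mean 'not set')."""
    return {ch for ch in flags if ch != "."}


parsed = parse_error_line(
    "800.7FFF00  87F.000100  => 001.000000 ...ux  expected: 001.000000 ....x"
)
extra = set_flags(parsed["observed"][1]) - set_flags(parsed["expected"][1])
print(parsed["inputs"], sorted(extra))
```

Comparing the two flag sets, as in the last lines, shows at a glance which flags the tested implementation raised beyond the expected ones.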
@@ -732,8 +760,9 @@ Four are floating-point types: <NOBR>32-bit</NOBR> single-precision,
<NOBR>64-bit</NOBR> double-precision, <NOBR>80-bit</NOBR>
double-extended-precision, and <NOBR>128-bit</NOBR> quadruple-precision.
The remaining five types are <NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR>
-unsigned integers, <NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR> two's-complement
-signed integers, and Boolean values (the results of comparison operations).
+unsigned integers, <NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR>
+two&rsquo;s-complement signed integers, and Boolean values (the results of
+comparison operations).
Boolean values are represented as a single character, either a <CODE>0</CODE>
or a <CODE>1</CODE>.
<NOBR>32-bit</NOBR> integers are represented as 8 hexadecimal digits.
@@ -749,47 +778,93 @@ hexadecimal digits that give the raw bits of the floating-point encoding.
A period separates the 3rd and 4th hexadecimal digits to mark the division
between the exponent bits and fraction bits.
Some notable <NOBR>64-bit</NOBR> double-precision values include:
-<PRE>
- 000.0000000000000 +0
- 3FF.0000000000000 1
- 400.0000000000000 2
- 7FF.0000000000000 +infinity
-
- 800.0000000000000 -0
- BFF.0000000000000 -1
- C00.0000000000000 -2
- FFF.0000000000000 -infinity
-
- 3FE.FFFFFFFFFFFFF largest representable number less than +1
-</PRE>
+<BLOCKQUOTE>
+<TABLE CELLSPACING=0 CELLPADDING=0>
+<TR>
+ <TD><CODE>000.0000000000000&nbsp;&nbsp;&nbsp;&nbsp;</CODE></TD>
+ <TD>+0</TD>
+</TR>
+<TR><TD><CODE>3FF.0000000000000</CODE></TD><TD>&nbsp;1</TD></TR>
+<TR><TD><CODE>400.0000000000000</CODE></TD><TD>&nbsp;2</TD></TR>
+<TR><TD><CODE>7FF.0000000000000</CODE></TD><TD>+infinity</TD></TR>
+<TR><TD>&nbsp;</TD></TR>
+<TR><TD><CODE>800.0000000000000</CODE></TD><TD>&minus;0</TD></TR>
+<TR><TD><CODE>BFF.0000000000000</CODE></TD><TD>&minus;1</TD></TR>
+<TR><TD><CODE>C00.0000000000000</CODE></TD><TD>&minus;2</TD></TR>
+<TR><TD><CODE>FFF.0000000000000</CODE></TD><TD>&minus;infinity</TD></TR>
+<TR><TD>&nbsp;</TD></TR>
+<TR>
+ <TD><CODE>3FE.FFFFFFFFFFFFF</CODE></TD>
+ <TD>largest representable number less than +1</TD>
+</TR>
+</TABLE>
+</BLOCKQUOTE>
The following categories are easily distinguished (assuming the
<CODE>x</CODE>s are not all 0):
-<PRE>
- 000.xxxxxxxxxxxxx positive subnormal (denormalized) numbers
- 7FF.xxxxxxxxxxxxx positive NaNs
- 800.xxxxxxxxxxxxx negative subnormal numbers
- FFF.xxxxxxxxxxxxx negative NaNs
-</PRE>
+<BLOCKQUOTE>
+<TABLE CELLSPACING=0 CELLPADDING=0>
+<TR>
+ <TD><CODE>000.xxxxxxxxxxxxx&nbsp;&nbsp;&nbsp;&nbsp;</CODE></TD>
+ <TD>positive subnormal (denormalized) numbers</TD>
+</TR>
+<TR><TD><CODE>7FF.xxxxxxxxxxxxx</CODE></TD><TD>positive NaNs</TD></TR>
+<TR>
+ <TD><CODE>800.xxxxxxxxxxxxx</CODE></TD>
+ <TD>negative subnormal numbers</TD>
+</TR>
+<TR><TD><CODE>FFF.xxxxxxxxxxxxx</CODE></TD><TD>negative NaNs</TD></TR>
+</TABLE>
+</BLOCKQUOTE>
</P>
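To relate these strings to ordinary values, here is a small Python sketch that renders a double in the notation above and sorts a rendered string into the categories just listed. The function names f64_to_testfloat and classify_f64 are hypothetical; only the standard struct module is used.

```python
import struct


def f64_to_testfloat(x: float) -> str:
    """Render a double as 3 hex digits (sign and exponent), '.', 13 hex digits."""
    raw = struct.pack(">d", x).hex().upper()  # 16 hex digits, big-endian
    return raw[:3] + "." + raw[3:]


def classify_f64(text: str) -> str:
    """Classify a rendered double by its exponent and fraction fields."""
    head, fraction = text.split(".")
    exponent = int(head, 16) & 0x7FF  # drop the sign bit
    fraction_nonzero = int(fraction, 16) != 0
    if exponent == 0:
        return "subnormal" if fraction_nonzero else "zero"
    if exponent == 0x7FF:
        return "NaN" if fraction_nonzero else "infinity"
    return "normal"


print(f64_to_testfloat(1.0))
print(f64_to_testfloat(1.0 - 2.0**-53))
print(classify_f64("800.0000000000001"))
```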
<P>
<NOBR>128-bit</NOBR> quadruple-precision values are written the same except
with 4 hexadecimal digits for the sign and exponent and 28 for the fraction.
Notable values include:
-<PRE>
- 0000.0000000000000000000000000000 +0
- 3FFF.0000000000000000000000000000 1
- 4000.0000000000000000000000000000 2
- 7FFF.0000000000000000000000000000 +infinity
-
- 8000.0000000000000000000000000000 -0
- BFFF.0000000000000000000000000000 -1
- C000.0000000000000000000000000000 -2
- FFFF.0000000000000000000000000000 -infinity
-
- 3FFE.FFFFFFFFFFFFFFFFFFFFFFFFFFFF largest representable number
- less than +1
-</PRE>
+<BLOCKQUOTE>
+<TABLE CELLSPACING=0 CELLPADDING=0>
+<TR>
+ <TD>
+ <CODE>0000.0000000000000000000000000000&nbsp;&nbsp;&nbsp;&nbsp;</CODE>
+ </TD>
+ <TD>+0</TD>
+</TR>
+<TR>
+ <TD><CODE>3FFF.0000000000000000000000000000</CODE></TD>
+ <TD>&nbsp;1</TD>
+</TR>
+<TR>
+ <TD><CODE>4000.0000000000000000000000000000</CODE></TD>
+ <TD>&nbsp;2</TD>
+</TR>
+<TR>
+ <TD><CODE>7FFF.0000000000000000000000000000</CODE></TD>
+ <TD>+infinity</TD>
+</TR>
+<TR><TD>&nbsp;</TD></TR>
+<TR>
+ <TD><CODE>8000.0000000000000000000000000000</CODE></TD>
+ <TD>&minus;0</TD>
+</TR>
+<TR>
+ <TD><CODE>BFFF.0000000000000000000000000000</CODE></TD>
+ <TD>&minus;1</TD>
+</TR>
+<TR>
+ <TD><CODE>C000.0000000000000000000000000000</CODE></TD>
+ <TD>&minus;2</TD>
+</TR>
+<TR>
+ <TD><CODE>FFFF.0000000000000000000000000000</CODE></TD>
+ <TD>&minus;infinity</TD>
+</TR>
+<TR><TD>&nbsp;</TD></TR>
+<TR>
+ <TD><CODE>3FFE.FFFFFFFFFFFFFFFFFFFFFFFFFFFF</CODE></TD>
+ <TD>largest representable number less than +1</TD>
+</TR>
+</TABLE>
+</BLOCKQUOTE>
</P>
<P>
@@ -801,19 +876,27 @@ and will be 1 otherwise.
Hence, the same values listed above appear in <NOBR>80-bit</NOBR>
double-extended-precision as follows (note the leading <CODE>8</CODE> digit in
the significands):
-<PRE>
- 0000.0000000000000000 +0
- 3FFF.8000000000000000 1
- 4000.8000000000000000 2
- 7FFF.8000000000000000 +infinity
-
- 8000.0000000000000000 -0
- BFFF.8000000000000000 -1
- C000.8000000000000000 -2
- FFFF.8000000000000000 -infinity
-
- 3FFE.FFFFFFFFFFFFFFFF largest representable number less than +1
-</PRE>
+<BLOCKQUOTE>
+<TABLE CELLSPACING=0 CELLPADDING=0>
+<TR>
+ <TD><CODE>0000.0000000000000000&nbsp;&nbsp;&nbsp;&nbsp;</CODE></TD>
+ <TD>+0</TD>
+</TR>
+<TR><TD><CODE>3FFF.8000000000000000</CODE></TD><TD>&nbsp;1</TD></TR>
+<TR><TD><CODE>4000.8000000000000000</CODE></TD><TD>&nbsp;2</TD></TR>
+<TR><TD><CODE>7FFF.8000000000000000</CODE></TD><TD>+infinity</TD></TR>
+<TR><TD>&nbsp;</TD></TR>
+<TR><TD><CODE>8000.0000000000000000</CODE></TD><TD>&minus;0</TD></TR>
+<TR><TD><CODE>BFFF.8000000000000000</CODE></TD><TD>&minus;1</TD></TR>
+<TR><TD><CODE>C000.8000000000000000</CODE></TD><TD>&minus;2</TD></TR>
+<TR><TD><CODE>FFFF.8000000000000000</CODE></TD><TD>&minus;infinity</TD></TR>
+<TR><TD>&nbsp;</TD></TR>
+<TR>
+ <TD><CODE>3FFE.FFFFFFFFFFFFFFFF</CODE></TD>
+ <TD>largest representable number less than +1</TD>
+</TR>
+</TABLE>
+</BLOCKQUOTE>
</P>
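The table above can be reproduced with a short Python sketch that widens an ordinary double into this notation. The function name float_to_extf80_text is hypothetical, and the layout (4 hexadecimal digits for sign and exponent, a period, then 16 digits for the significand with an explicit leading bit) is inferred from the table rather than taken from any TestFloat source. The exponent bias of 16383 is the one defined for the 80-bit format.

```python
import math


def float_to_extf80_text(x: float) -> str:
    """Render a Python float in the 80-bit double-extended notation."""
    if math.isnan(x):
        raise ValueError("NaN encodings are not covered by this sketch")
    sign = 0x8000 if math.copysign(1.0, x) < 0 else 0
    if x == 0:
        return f"{sign:04X}.{0:016X}"
    if math.isinf(x):
        # Maximum exponent, with only the explicit leading bit set.
        return f"{sign | 0x7FFF:04X}.{1 << 63:016X}"
    # abs(x) == mantissa * 2**exponent with 0.5 <= mantissa < 1.
    mantissa, exponent = math.frexp(abs(x))
    significand = int(mantissa * 2**64)  # exact: scaling by a power of two
    biased_exponent = exponent - 1 + 16383
    return f"{sign | biased_exponent:04X}.{significand:016X}"


for value in (0.0, 1.0, 2.0, float("inf"), -1.0):
    print(float_to_extf80_text(value))
```

Because the input is a double, values that need more than 53 significand bits, such as the largest 80-bit number less than +1, cannot be produced this way.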
<P>
@@ -826,11 +909,13 @@ These are written as 9 hexadecimal digits, with a period separating the 3rd and
4th hexadecimal digits.
Broken out into bits, the 9 hexadecimal digits cover the <NOBR>32-bit</NOBR>
single-precision subfields as follows:
+<BLOCKQUOTE>
<PRE>
- x000 .... .... . .... .... .... .... .... .... sign (1 bit)
- .... xxxx xxxx . .... .... .... .... .... .... exponent (8 bits)
- .... .... .... . 0xxx xxxx xxxx xxxx xxxx xxxx fraction (23 bits)
+x000 .... .... . .... .... .... .... .... .... sign (1 bit)
+.... xxxx xxxx . .... .... .... .... .... .... exponent (8 bits)
+.... .... .... . 0xxx xxxx xxxx xxxx xxxx xxxx fraction (23 bits)
</PRE>
+</BLOCKQUOTE>
As shown in this schematic, the first hexadecimal digit contains only the sign,
and will be either <CODE>0</CODE> <NOBR>or <CODE>8</CODE></NOBR>.
The next two digits give the biased exponent as an <NOBR>8-bit</NOBR> integer.
@@ -841,27 +926,37 @@ The most significant hexadecimal digit of the fraction can be at most
<P>
Notable single-precision values include:
-<PRE>
- 000.000000 +0
- 07F.000000 1
- 080.000000 2
- 0FF.000000 +infinity
-
- 800.000000 -0
- 87F.000000 -1
- 880.000000 -2
- 8FF.000000 -infinity
-
- 07E.7FFFFF largest representable number less than +1
-</PRE>
+<BLOCKQUOTE>
+<TABLE CELLSPACING=0 CELLPADDING=0>
+<TR><TD><CODE>000.000000&nbsp;&nbsp;&nbsp;&nbsp;</CODE></TD><TD>+0</TD></TR>
+<TR><TD><CODE>07F.000000</CODE></TD><TD>&nbsp;1</TD></TR>
+<TR><TD><CODE>080.000000</CODE></TD><TD>&nbsp;2</TD></TR>
+<TR><TD><CODE>0FF.000000</CODE></TD><TD>+infinity</TD></TR>
+<TR><TD>&nbsp;</TD></TR>
+<TR><TD><CODE>800.000000</CODE></TD><TD>&minus;0</TD></TR>
+<TR><TD><CODE>87F.000000</CODE></TD><TD>&minus;1</TD></TR>
+<TR><TD><CODE>880.000000</CODE></TD><TD>&minus;2</TD></TR>
+<TR><TD><CODE>8FF.000000</CODE></TD><TD>&minus;infinity</TD></TR>
+<TR><TD>&nbsp;</TD></TR>
+<TR>
+ <TD><CODE>07E.7FFFFF</CODE></TD>
+ <TD>largest representable number less than +1</TD>
+</TR>
+</TABLE>
+</BLOCKQUOTE>
Again, certain categories are easily distinguished (assuming the
<CODE>x</CODE>s are not all 0):
-<PRE>
- 000.xxxxxx positive subnormal (denormalized) numbers
- 0FF.xxxxxx positive NaNs
- 800.xxxxxx negative subnormal numbers
- 8FF.xxxxxx negative NaNs
-</PRE>
+<BLOCKQUOTE>
+<TABLE CELLSPACING=0 CELLPADDING=0>
+<TR>
+ <TD><CODE>000.xxxxxx&nbsp;&nbsp;&nbsp;&nbsp;</CODE></TD>
+ <TD>positive subnormal (denormalized) numbers</TD>
+</TR>
+<TR><TD><CODE>0FF.xxxxxx</CODE></TD><TD>positive NaNs</TD></TR>
+<TR><TD><CODE>800.xxxxxx</CODE></TD><TD>negative subnormal numbers</TD></TR>
+<TR><TD><CODE>8FF.xxxxxx</CODE></TD><TD>negative NaNs</TD></TR>
+</TABLE>
+</BLOCKQUOTE>
</P>
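Because the single-precision layout does not follow simple hexadecimal digit boundaries, a sketch of the field extraction may help. The Python function below, with the hypothetical name f32_to_testfloat, rounds a value to single precision and prints it in the 9-digit form described above.

```python
import struct


def f32_to_testfloat(x: float) -> str:
    """Render a value as single precision: sign digit, 2 exponent digits,
    a period, then 6 fraction digits."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31               # 1 bit, shown alone as digit 0 or 8
    exponent = (bits >> 23) & 0xFF  # 8-bit biased exponent
    fraction = bits & 0x7FFFFF      # 23 bits, so the top digit is at most 7
    return f"{sign << 3:X}{exponent:02X}.{fraction:06X}"


print(f32_to_testfloat(1.0))
print(f32_to_testfloat(-2.0))
print(f32_to_testfloat(1.0 - 2.0**-24))
```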
<P>
@@ -871,13 +966,21 @@ Each flag is written as either a letter or a period (<CODE>.</CODE>) according
to whether the flag was set or not by the operation.
A period indicates the flag was not set.
The letter used to indicate a set flag depends on the flag:
-<PRE>
- v invalid exception
- i infinite exception ("divide by zero")
- o overflow exception
- u underflow exception
- x inexact exception
-</PRE>
+<BLOCKQUOTE>
+<TABLE CELLSPACING=0 CELLPADDING=0>
+<TR>
+ <TD><CODE>v&nbsp;&nbsp;&nbsp;&nbsp;</CODE></TD>
+ <TD>invalid exception</TD>
+</TR>
+<TR>
+ <TD><CODE>i</CODE></TD>
+ <TD>infinite exception (&ldquo;divide by zero&rdquo;)</TD>
+</TR>
+<TR><TD><CODE>o</CODE></TD><TD>overflow exception</TD></TR>
+<TR><TD><CODE>u</CODE></TD><TD>underflow exception</TD></TR>
+<TR><TD><CODE>x</CODE></TD><TD>inexact exception</TD></TR>
+</TABLE>
+</BLOCKQUOTE>
For example, the notation <CODE>...ux</CODE> indicates that the
<I>underflow</I> and <I>inexact</I> exception flags were set and that the other
three flags (<I>invalid</I>, <I>infinite</I>, and <I>overflow</I>) were not