Diffstat (limited to 'manual/=float.texinfo')
-rw-r--r-- | manual/=float.texinfo | 414 |
1 files changed, 0 insertions, 414 deletions
diff --git a/manual/=float.texinfo b/manual/=float.texinfo
deleted file mode 100644
index d4e3920..0000000
--- a/manual/=float.texinfo
+++ /dev/null
@@ -1,414 +0,0 @@
-@node Floating-Point Limits
-@chapter Floating-Point Limits
-@pindex <float.h>
-@cindex floating-point number representation
-@cindex representation of floating-point numbers
-
-Because floating-point numbers are represented internally as approximate
-quantities, algorithms for manipulating floating-point data often need
-to be parameterized in terms of the accuracy of the representation.
-Some of the functions in the C library itself need this information; for
-example, the algorithms for printing and reading floating-point numbers
-(@pxref{I/O on Streams}) and for calculating trigonometric and
-irrational functions (@pxref{Mathematics}) use information about the
-underlying floating-point representation to avoid round-off error and
-loss of accuracy. User programs that implement numerical analysis
-techniques also often need to be parameterized in this way in order to
-minimize or compute error bounds.
-
-The specific representation of floating-point numbers varies from
-machine to machine. The GNU C Library defines a set of parameters which
-characterize each of the supported floating-point representations on a
-particular system.
-
-@menu
-* Floating-Point Representation:: Definitions of terminology.
-* Floating-Point Parameters:: Descriptions of the library facilities.
-* IEEE Floating-Point:: An example of a common representation.
-@end menu
-
-@node Floating-Point Representation
-@section Floating-Point Representation
-
-This section introduces the terminology used to characterize the
-representation of floating-point numbers.
-
-You are probably already familiar with most of these concepts in terms
-of scientific or exponential notation for floating-point numbers. For
-example, the number @code{123456.0} could be expressed in exponential
-notation as @code{1.23456e+05}, a shorthand notation indicating that the
-mantissa @code{1.23456} is multiplied by the base @code{10} raised to
-power @code{5}.
-
-More formally, the internal representation of a floating-point number
-can be characterized in terms of the following parameters:
-
-@itemize @bullet
-@item
-The @dfn{sign} is either @code{-1} or @code{1}.
-@cindex sign (of floating-point number)
-
-@item
-The @dfn{base} or @dfn{radix} for exponentiation; an integer greater
-than @code{1}. This is a constant for the particular representation.
-@cindex base (of floating-point number)
-@cindex radix (of floating-point number)
-
-@item
-The @dfn{exponent} to which the base is raised. The upper and lower
-bounds of the exponent value are constants for the particular
-representation.
-@cindex exponent (of floating-point number)
-
-Sometimes, in the actual bits representing the floating-point number,
-the exponent is @dfn{biased} by adding a constant to it, to make it
-always be represented as an unsigned quantity. This is only important
-if you have some reason to pick apart the bit fields making up the
-floating-point number by hand, which is something for which the GNU
-library provides no support. So this is ignored in the discussion that
-follows.
-@cindex bias, in exponent (of floating-point number)
-
-@item
-The value of the @dfn{mantissa} or @dfn{significand}, which is an
-unsigned quantity.
-@cindex mantissa (of floating-point number)
-@cindex significand (of floating-point number)
-
-@item
-The @dfn{precision} of the mantissa. If the base of the representation
-is @var{b}, then the precision is the number of base-@var{b} digits in
-the mantissa. This is a constant for the particular representation.
-
-Many floating-point representations have an implicit @dfn{hidden bit} in
-the mantissa. Any such hidden bits are counted in the precision.
-Again, the GNU library provides no facilities for dealing with such low-level
-aspects of the representation.
-@cindex precision (of floating-point number)
-@cindex hidden bit, in mantissa (of floating-point number)
-@end itemize
-
-The mantissa of a floating-point number actually represents an implicit
-fraction whose denominator is the base raised to the power of the
-precision. Since the largest representable mantissa is one less than
-this denominator, the value of the fraction is always strictly less than
-@code{1}. The mathematical value of a floating-point number is then the
-product of this fraction; the sign; and the base raised to the exponent.
-
-If the floating-point number is @dfn{normalized}, the mantissa is also
-greater than or equal to the base raised to the power of one less
-than the precision (unless the number represents a floating-point zero,
-in which case the mantissa is zero). The fractional quantity is
-therefore greater than or equal to @code{1/@var{b}}, where @var{b} is
-the base.
-@cindex normalized floating-point number
-
-@node Floating-Point Parameters
-@section Floating-Point Parameters
-
-@strong{Incomplete:} This section needs some more concrete examples
-of what these parameters mean and how to use them in a program.
-
-These macro definitions can be accessed by including the header file
-@file{<float.h>} in your program.
-
-Macro names starting with @samp{FLT_} refer to the @code{float} type,
-while names beginning with @samp{DBL_} refer to the @code{double} type
-and names beginning with @samp{LDBL_} refer to the @code{long double}
-type. (In implementations that do not support @code{long double} as
-a distinct data type, the values for those constants are the same
-as the corresponding constants for the @code{double} type.)@refill
-
-Note that only @code{FLT_RADIX} is guaranteed to be a constant
-expression, so the other macros listed here cannot be reliably used in
-places that require constant expressions, such as @samp{#if}
-preprocessing directives and array size specifications.
-
-Although the @w{ISO C} standard specifies minimum and maximum values for
-most of these parameters, the GNU C implementation uses whatever
-floating-point representations are supported by the underlying hardware.
-So whether GNU C actually satisfies the @w{ISO C} requirements depends on
-what machine it is running on.
-
-@comment float.h
-@comment ISO
-@defvr Macro FLT_ROUNDS
-This value characterizes the rounding mode for floating-point addition.
-The following values indicate standard rounding modes:
-
-@table @code
-@item -1
-The mode is indeterminable.
-@item 0
-Rounding is towards zero.
-@item 1
-Rounding is to the nearest number.
-@item 2
-Rounding is towards positive infinity.
-@item 3
-Rounding is towards negative infinity.
-@end table
-
-@noindent
-Any other value represents a machine-dependent nonstandard rounding
-mode.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro FLT_RADIX
-This is the value of the base, or radix, of exponent representation.
-This is guaranteed to be a constant expression, unlike the other macros
-described in this section.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro FLT_MANT_DIG
-This is the number of base-@code{FLT_RADIX} digits in the floating-point
-mantissa for the @code{float} data type.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro DBL_MANT_DIG
-This is the number of base-@code{FLT_RADIX} digits in the floating-point
-mantissa for the @code{double} data type.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro LDBL_MANT_DIG
-This is the number of base-@code{FLT_RADIX} digits in the floating-point
-mantissa for the @code{long double} data type.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro FLT_DIG
-This is the number of decimal digits of precision for the @code{float}
-data type. Technically, if @var{p} and @var{b} are the precision and
-base (respectively) for the representation, then the decimal precision
-@var{q} is the maximum number of decimal digits such that any floating
-point number with @var{q} base 10 digits can be rounded to a floating
-point number with @var{p} base @var{b} digits and back again, without
-change to the @var{q} decimal digits.
-
-The value of this macro is guaranteed to be at least @code{6}.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro DBL_DIG
-This is similar to @code{FLT_DIG}, but is for the @code{double} data
-type. The value of this macro is guaranteed to be at least @code{10}.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro LDBL_DIG
-This is similar to @code{FLT_DIG}, but is for the @code{long double}
-data type. The value of this macro is guaranteed to be at least
-@code{10}.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro FLT_MIN_EXP
-This is the minimum negative integer such that the mathematical value
-@code{FLT_RADIX} raised to this power minus 1 can be represented as a
-normalized floating-point number of type @code{float}. In terms of the
-actual implementation, this is just the smallest value that can be
-represented in the exponent field of the number.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro DBL_MIN_EXP
-This is similar to @code{FLT_MIN_EXP}, but is for the @code{double} data
-type.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro LDBL_MIN_EXP
-This is similar to @code{FLT_MIN_EXP}, but is for the @code{long double}
-data type.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro FLT_MIN_10_EXP
-This is the minimum negative integer such that the mathematical value
-@code{10} raised to this power minus 1 can be represented as a
-normalized floating-point number of type @code{float}. This is
-guaranteed to be no greater than @code{-37}.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro DBL_MIN_10_EXP
-This is similar to @code{FLT_MIN_10_EXP}, but is for the @code{double}
-data type.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro LDBL_MIN_10_EXP
-This is similar to @code{FLT_MIN_10_EXP}, but is for the @code{long
-double} data type.
-@end defvr
-
-
-
-@comment float.h
-@comment ISO
-@defvr Macro FLT_MAX_EXP
-This is the maximum negative integer such that the mathematical value
-@code{FLT_RADIX} raised to this power minus 1 can be represented as a
-floating-point number of type @code{float}. In terms of the actual
-implementation, this is just the largest value that can be represented
-in the exponent field of the number.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro DBL_MAX_EXP
-This is similar to @code{FLT_MAX_EXP}, but is for the @code{double} data
-type.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro LDBL_MAX_EXP
-This is similar to @code{FLT_MAX_EXP}, but is for the @code{long double}
-data type.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro FLT_MAX_10_EXP
-This is the maximum negative integer such that the mathematical value
-@code{10} raised to this power minus 1 can be represented as a
-normalized floating-point number of type @code{float}. This is
-guaranteed to be at least @code{37}.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro DBL_MAX_10_EXP
-This is similar to @code{FLT_MAX_10_EXP}, but is for the @code{double}
-data type.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro LDBL_MAX_10_EXP
-This is similar to @code{FLT_MAX_10_EXP}, but is for the @code{long
-double} data type.
-@end defvr
-
-
-@comment float.h
-@comment ISO
-@defvr Macro FLT_MAX
-The value of this macro is the maximum representable floating-point
-number of type @code{float}, and is guaranteed to be at least
-@code{1E+37}.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro DBL_MAX
-The value of this macro is the maximum representable floating-point
-number of type @code{double}, and is guaranteed to be at least
-@code{1E+37}.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro LDBL_MAX
-The value of this macro is the maximum representable floating-point
-number of type @code{long double}, and is guaranteed to be at least
-@code{1E+37}.
-@end defvr
-
-
-@comment float.h
-@comment ISO
-@defvr Macro FLT_MIN
-The value of this macro is the minimum normalized positive
-floating-point number that is representable by type @code{float}, and is
-guaranteed to be no more than @code{1E-37}.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro DBL_MIN
-The value of this macro is the minimum normalized positive
-floating-point number that is representable by type @code{double}, and
-is guaranteed to be no more than @code{1E-37}.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro LDBL_MIN
-The value of this macro is the minimum normalized positive
-floating-point number that is representable by type @code{long double},
-and is guaranteed to be no more than @code{1E-37}.
-@end defvr
-
-
-@comment float.h
-@comment ISO
-@defvr Macro FLT_EPSILON
-This is the minimum positive floating-point number of type @code{float}
-such that @code{1.0 + FLT_EPSILON != 1.0} is true. It's guaranteed to
-be no greater than @code{1E-5}.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro DBL_EPSILON
-This is similar to @code{FLT_EPSILON}, but is for the @code{double}
-type. The maximum value is @code{1E-9}.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro LDBL_EPSILON
-This is similar to @code{FLT_EPSILON}, but is for the @code{long double}
-type. The maximum value is @code{1E-9}.
-@end defvr
-
-
-
-@node IEEE Floating Point
-@section IEEE Floating Point
-
-Here is an example showing how these parameters work for a common
-floating point representation, specified by the @cite{IEEE Standard for
-Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985 or ANSI/IEEE
-Std 854-1987)}.
-
-The IEEE single-precision float representation uses a base of 2. There
-is a sign bit, a mantissa with 23 bits plus one hidden bit (so the total
-precision is 24 base-2 digits), and an 8-bit exponent that can represent
-values in the range -125 to 128, inclusive.
-
-So, for an implementation that uses this representation for the
-@code{float} data type, appropriate values for the corresponding
-parameters are:
-
-@example
-FLT_RADIX 2
-FLT_MANT_DIG 24
-FLT_DIG 6
-FLT_MIN_EXP -125
-FLT_MIN_10_EXP -37
-FLT_MAX_EXP 128
-FLT_MAX_10_EXP +38
-FLT_MIN 1.17549435E-38F
-FLT_MAX 3.40282347E+38F
-FLT_EPSILON 1.19209290E-07F
-@end example