@c Copyright @copyright{} 2022, 2023, 2025 Richard Stallman
@c and Free Software Foundation, Inc.
@c This is part of the GNU C Intro and Reference Manual
@c and covered by its license.
@node Floating Point in Depth
@chapter Floating Point in Depth
@menu
* Floating Representations::
* Floating Type Specs::
* Special Float Values::
* Invalid Optimizations::
* Exception Flags::
* Exact Floating-Point::
* Rounding::
* Rounding Issues::
* Significance Loss::
* Fused Multiply-Add::
* Error Recovery::
@c * Double-Rounding Problems::
* Exact Floating Constants::
* Handling Infinity::
* Handling NaN::
* Signed Zeros::
* Scaling by the Base::
* Rounding Control::
* Machine Epsilon::
* Complex Arithmetic::
* Round-Trip Base Conversion::
* Further Reading::
@end menu
@node Floating Representations
@section Floating-Point Representations
@cindex floating-point representations
@cindex representation of floating-point numbers
@cindex IEEE 754-2008 Standard
Storing numbers as @dfn{floating point} allows representation of
numbers with fractional values, in a range larger than that of
hardware integers. A floating-point number consists of a sign bit, a
@dfn{significand} (also called the @dfn{mantissa}), and a power of a
fixed base. GNU C uses the floating-point representations specified by
the @cite{IEEE 754-2008 Standard for Floating-Point Arithmetic}.
The IEEE 754-2008 specification defines basic binary floating-point
formats of five different sizes: 16-bit, 32-bit, 64-bit, 128-bit, and
256-bit. The formats of 32, 64, and 128 bits are used for the
standard C types @code{float}, @code{double}, and @code{long double}.
GNU C supports the 16-bit floating-point type @code{_Float16} on some
platforms, but does not support the 256-bit floating-point type.
Each of the formats encodes the floating-point number as a sign bit.
After this comes an exponent that specifies a power of 2 (with a fixed
offset). Then comes the significand.
The first bit of the significand, before the binary point, is always
1, so there is no need to store it in memory. It is called the
@dfn{hidden bit} because it doesn't appear in the floating-point
number as used in the computer itself.
All of those floating-point formats are sign-magnitude representations,
so +0 and @minus{}0 are different values.
Besides the IEEE 754 format 128-bit float, GNU C also offers a format
consisting of a pair of 64-bit floating point numbers. This lacks the
full exponent range of the IEEE 128-bit format, but is useful when the
underlying hardware platform does not support that.
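The encoding described above can be examined directly.  Here is a
small sketch (assuming the common IEEE 754 binary64 layout for
@code{double} and a C99 compiler) that extracts the sign bit, the
biased exponent, and the stored significand:

@example
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main (void)
@{
  double x = -6.25;                /* equals -1.5625 * 2^2 */
  uint64_t bits;

  memcpy (&bits, &x, sizeof bits); /* view the same bytes as an integer */

  unsigned sign = bits >> 63;
  int exponent = (int) ((bits >> 52) & 0x7ff) - 1023; /* remove the offset */
  uint64_t significand = bits & 0xfffffffffffffULL;   /* hidden bit absent */

  printf ("sign %u, exponent %d, significand 0x%013llx\n",
          sign, exponent, (unsigned long long) significand);
  return 0;
@}
@end example

@noindent
For @code{-6.25} this prints sign @code{1}, exponent @code{2}, and
significand @code{0x9000000000000}: the fraction bits of 1.5625, with
the leading hidden bit omitted from storage.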
@node Floating Type Specs
@section Floating-Point Type Specifications
The standard library header file @file{float.h} defines a number of
constants that describe the platform's implementation of
floating-point types @code{float}, @code{double} and @code{long
double}. They include:
@findex FLT_MIN
@findex DBL_MIN
@findex LDBL_MIN
@findex FLT_HAS_SUBNORM
@findex DBL_HAS_SUBNORM
@findex LDBL_HAS_SUBNORM
@findex FLT_TRUE_MIN
@findex DBL_TRUE_MIN
@findex LDBL_TRUE_MIN
@findex FLT_MAX
@findex DBL_MAX
@findex LDBL_MAX
@findex FLT_DECIMAL_DIG
@findex DBL_DECIMAL_DIG
@findex LDBL_DECIMAL_DIG
@table @code
@item FLT_MIN
@itemx DBL_MIN
@itemx LDBL_MIN
Defines the minimum normalized positive floating-point values that can
be represented with the type.
@item FLT_HAS_SUBNORM
@itemx DBL_HAS_SUBNORM
@itemx LDBL_HAS_SUBNORM
Defines if the floating-point type supports subnormal (or ``denormalized'')
numbers or not (@pxref{subnormal numbers}).
@item FLT_TRUE_MIN
@itemx DBL_TRUE_MIN
@itemx LDBL_TRUE_MIN
Defines the minimum positive values (including subnormal values) that
can be represented with the type.
@item FLT_MAX
@itemx DBL_MAX
@itemx LDBL_MAX
Defines the largest values that can be represented with the type.
@item FLT_DECIMAL_DIG
@itemx DBL_DECIMAL_DIG
@itemx LDBL_DECIMAL_DIG
Defines the number of decimal digits @code{n} such that any
floating-point number that can be represented in the type can be
rounded to a floating-point number with @code{n} decimal digits, and
back again, without losing any precision of the value.
@end table
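A short program can report the values of these constants on a given
platform.  Here is a sketch (assuming a C11 @file{float.h}, since
@code{FLT_TRUE_MIN} and @code{DBL_DECIMAL_DIG} are C11 additions):

@example
#include <stdio.h>
#include <float.h>

int main (void)
@{
  printf ("FLT_MIN         = %g\n", FLT_MIN);
  printf ("FLT_TRUE_MIN    = %g\n", FLT_TRUE_MIN);
  printf ("FLT_MAX         = %g\n", FLT_MAX);
  printf ("DBL_MIN         = %g\n", DBL_MIN);
  printf ("DBL_TRUE_MIN    = %g\n", DBL_TRUE_MIN);
  printf ("DBL_MAX         = %g\n", DBL_MAX);
  printf ("DBL_DECIMAL_DIG = %d\n", DBL_DECIMAL_DIG);
  return 0;
@}
@end example

@noindent
On machines that use IEEE 754 binary64 for @code{double}, the last
line typically prints 17.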
@node Special Float Values
@section Special Floating-Point Values
@cindex special floating-point values
@cindex floating-point values, special
IEEE floating point provides for special values that are not ordinary
numbers.
@table @asis
@item infinities
@code{+Infinity} and @code{-Infinity} are two different infinite
values, one positive and one negative. These result from
operations such as @code{1 / 0}, @code{Infinity + Infinity},
@code{Infinity * Infinity}, and @code{Infinity + @var{finite}}, and also
from a result that is finite, but larger than the most positive possible
value or smaller than the most negative possible value.
@xref{Handling Infinity}, for more about working with infinities.
@item NaNs (not a number)
@cindex QNaN
@cindex SNaN
There are two special values, called Not-a-Number (NaN): a quiet
NaN (QNaN), and a signaling NaN (SNaN).
A QNaN is produced by operations for which the value is undefined
in real arithmetic, such as @code{0 / 0}, @code{sqrt (-1)},
@code{Infinity - Infinity}, and any basic operation in which an
operand is a QNaN.
The signaling NaN is intended for initializing
otherwise-unassigned storage, and the goal is that unlike a
QNaN, an SNaN @emph{does} cause an interrupt that can be caught
by a software handler, diagnosed, and reported. In practice,
little use has been made of signaling NaNs, because the most
common CPUs in desktop and portable computers fail to implement
the full IEEE 754 Standard, and supply only one kind of NaN, the
quiet one. Also, programming-language standards have taken
decades to catch up to the IEEE 754 standard, and implementations
of those language standards make an additional delay before
programmers become willing to use these features.
To enable support for signaling NaNs, use the GCC command-line option
@option{-fsignaling-nans}, but this is an experimental feature and may
not work as expected in every situation.
A NaN has a sign bit, but its value means nothing.
@xref{Handling NaN}, for more about working with NaNs.
@item subnormal numbers
@cindex subnormal numbers
@cindex underflow, floating
@cindex floating underflow
@anchor{subnormal numbers}
It can happen that a computed floating-point value is too small to
represent, such as when two tiny numbers are multiplied. The result
is then said to @dfn{underflow}. The traditional behavior before
the IEEE 754 Standard was to use zero as the result, and possibly to report
the underflow in some sort of program output.
The IEEE 754 Standard is vague about whether rounding happens
before detection of floating underflow and overflow, or after, and CPU
designers may choose either.
However, the Standard does something unusual compared to earlier
designs, and that is that when the result is smaller than the
smallest @dfn{normalized} representable value (i.e., one in
which the leading significand bit is @code{1}), the normalization
requirement is relaxed, leading zero bits are permitted, and
precision is gradually lost until there are no more bits in the
significand. That phenomenon is called @dfn{gradual underflow},
and it serves important numerical purposes, although it does
reduce the precision of the final result. Some floating-point
designs allow you to choose at compile time, or even at
run time, whether underflows are gradual, or are flushed abruptly
to zero. Numbers that have entered the region of gradual
underflow are called @dfn{subnormal}.
You can use the library functions @code{fesetround} and
@code{fegetround} to set and get the rounding mode. Rounding modes
are defined (if supported by the platform) in @code{fenv.h} as:
@code{FE_UPWARD} to round toward positive infinity; @code{FE_DOWNWARD}
to round toward negative infinity; @code{FE_TOWARDZERO} to round
toward zero; and @code{FE_TONEAREST} to round to the nearest
representable value, the default mode. It is best to use
@code{FE_TONEAREST} except when there is a special need for some other
mode.
@end table
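Here is a brief illustration of how these special values arise at run
time (a sketch assuming IEEE 754 arithmetic; @code{isinf},
@code{isnan}, and @code{fpclassify} come from @file{math.h}):

@example
#include <stdio.h>
#include <math.h>

int main (void)
@{
  volatile double zero = 0.0;  /* volatile: do the divisions at run time */
  double inf = 1.0 / zero;     /* overflow of real arithmetic: +Infinity */
  double nan_value = zero / zero;  /* undefined in real arithmetic: a QNaN */
  double tiny = 0x1p-1060;     /* far below DBL_MIN: a subnormal */

  printf ("%d %d %d\n",
          isinf (inf) != 0,
          isnan (nan_value) != 0,
          fpclassify (tiny) == FP_SUBNORMAL);   /* prints 1 1 1 */
  return 0;
@}
@end example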
@node Invalid Optimizations
@section Invalid Optimizations
@cindex invalid optimizations in floating-point arithmetic
@cindex floating-point arithmetic invalid optimizations
Signed zeros, Infinity, and NaN invalidate some optimizations by
programmers and compilers that might otherwise have seemed obvious:
@itemize @bullet
@item
@code{x + 0} and @code{x - 0} are not the same as @code{x} when
@code{x} is zero, because the result depends on the rounding rule.
@xref{Rounding}, for more about rounding rules.
@item
@code{x * 0.0} is not the same as @code{0.0} when @code{x} is
Infinity, a NaN, or negative zero.
@item
@code{x / x} is not the same as @code{1.0} when @code{x} is Infinity,
a NaN, or zero.
@item
@code{(x - y)} is not the same as @code{-(y - x)} because when the
operands are finite and equal, one evaluates to @code{+0} and the
other to @code{-0}.
@item
@code{x - x} is not the same as @code{0.0} when @code{x} is Infinity or
a NaN.
@item
@code{x == x} and @code{x != x} are not equivalent to @code{1} and
@code{0} when @code{x} is a NaN.
@item
@code{x < y} and @code{isless (x, y)} are not equivalent, because the
first sets a sticky exception flag (@pxref{Exception Flags}) when an
operand is a NaN, whereas the second does not affect that flag. The
same holds for the other @code{isxxx} functions that are companions to
relational operators. @xref{FP Comparison Functions, , , libc, The
GNU C Library Reference Manual}.
@end itemize
The @option{-funsafe-math-optimizations} option enables
these optimizations.
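The following sketch (assuming IEEE 754 semantics) demonstrates a few
of the cases above at run time:

@example
#include <stdio.h>
#include <math.h>

int main (void)
@{
  volatile double zero = 0.0;  /* volatile: do the divisions at run time */
  double inf = 1.0 / zero;
  double qnan = zero / zero;

  printf ("%d\n", isnan (inf * 0.0));  /* 1: not the same as 0.0 */
  printf ("%d\n", isnan (inf - inf));  /* 1: not the same as 0.0 */
  printf ("%d\n", qnan == qnan);       /* 0: a NaN never compares equal */
  printf ("%d\n", 0.0 == -0.0);        /* 1: yet +0 and -0 compare equal */
  return 0;
@}
@end example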
@node Exception Flags
@section Floating Arithmetic Exception Flags
@cindex floating arithmetic exception flags
@cindex exception flags (floating point)
@cindex sticky exception flags (floating point)
@cindex floating overflow
@cindex overflow, floating
@cindex floating underflow
@cindex underflow, floating
@dfn{Sticky exception flags} record the occurrence of particular
conditions: once set, they remain set until the program explicitly
clears them.
The conditions include @emph{invalid operand},
@emph{division by zero}, @emph{inexact result} (i.e., one that
required rounding), @emph{underflow}, and @emph{overflow}. Some
extended floating-point designs offer several additional exception
flags. The functions @code{feclearexcept}, @code{feraiseexcept},
@code{fetestexcept}, @code{fegetexceptflags}, and
@code{fesetexceptflags} provide a standardized interface to those
flags. @xref{Status bit operations, , , libc, The GNU C Library
Reference Manual}.
One important use of those @anchor{fetestexcept}flags is to do a
computation that is normally expected to be exact in floating-point
arithmetic, but occasionally might not be, in which case, corrective
action is needed. You can clear the @emph{inexact result} flag with a
call to @code{feclearexcept (FE_INEXACT)}, do the computation, and
then test the flag with @code{fetestexcept (FE_INEXACT)}; the result
of that call is 0 if the flag is not set (there was no rounding), and
nonzero when the flag is set (there was rounding, which, we presume,
implies the program has to correct for that).
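Here is a sketch of that technique (it assumes a platform where
@code{FE_INEXACT} is defined; the @code{volatile} qualifiers keep the
compiler from doing the divisions at compile time):

@example
#include <stdio.h>
#include <fenv.h>

int main (void)
@{
  volatile double num = 1.0, den, quot;

  den = 4.0;                       /* power-of-2 denominator: exact */
  feclearexcept (FE_INEXACT);
  quot = num / den;
  printf ("1/4 inexact? %d\n", fetestexcept (FE_INEXACT) != 0); /* 0 */

  den = 3.0;                       /* 1/3 has no exact binary form */
  feclearexcept (FE_INEXACT);
  quot = num / den;
  printf ("1/3 inexact? %d\n", fetestexcept (FE_INEXACT) != 0); /* 1 */

  (void) quot;
  return 0;
@}
@end example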
@c =====================================================================
@ignore
@node IEEE 754 Decimal Arithmetic
@section IEEE 754 Decimal Arithmetic
@cindex IEEE 754 decimal arithmetic
One of the difficulties that users of computers for numerical
work face, whether they realize it or not, is that the computer
does not operate in the number base that most people are familiar
with. As a result, input decimal fractions must be converted to
binary floating-point values for use in computations, and then
the final results converted back to decimal for humans. Because the
precision is finite and limited, and because algorithms for correct
round-trip conversion between number bases were not known until the
1990s, and are still not implemented on most systems and most
programming languages, the result is frequent confusion for users,
when a simple expression like @code{3.0*(1.0/3.0)} evaluates to
0.999999 instead of the expected 1.0. Here is an
example from a floating-point calculator that allows rounding-mode
control, with the mode set to @emph{round-towards-zero}:
@example
for (k = 1; k <= 10; ++k)
  (void)printf ("%2d\t%10.6f\n", k, k*(1.0/k))

 1      1.000000
 2      1.000000
 3      0.999999
 4      1.000000
 5      0.999999
 6      0.999999
 7      0.999999
 8      1.000000
 9      0.999999
10      0.999999
@end example
Increasing working precision can sometimes help by reducing
intermediate rounding errors, but the reality is that when
fractional values are involved, @emph{no amount} of extra
precision can suffice for some computations. For example, the
nice decimal value @code{1/10} in C99-style binary representation
is @code{+0x1.999999999999ap-4}; that final digit @code{a} is the
rounding of an infinite string of @code{9}'s.
Financial computations in particular depend critically on correct
arithmetic, and the losses due to rounding errors can be
large, especially for businesses with large numbers of small
transactions, such as grocery stores and telephone companies.
Tax authorities are particularly picky, and demand specific
rounding rules, including one that instead of rounding ties to
the nearest number, rounds instead in the direction that favors
the taxman.
Programming languages used for business applications, notably the
venerable Cobol language, have therefore always implemented
financial computations in @emph{fixed-point decimal arithmetic}
in software, and because of the large monetary amounts that must be
processed, successive Cobol standards have increased the minimum
number size from 18 to 32 decimal digits, and the most recent one
requires a decimal exponent range of at least @code{[-999, +999]}.
The revised IEEE 754-2008 standard therefore requires decimal
floating-point arithmetic, as well as the now-widely used binary
formats from 1985. Like the binary formats, the decimal formats
also support Infinity, NaN, and signed zero, and the five basic
operations are also required to produce correctly rounded
representations of infinitely precise exact results.
However, the financial applications of decimal arithmetic
introduce some new features:
@itemize @bullet
@item
There are three decimal formats occupying 32, 64, and 128 bits of
storage, and offering exactly 7, 16, and 34 decimal digits of
precision. If one size has @code{n} digits, the next larger size
has @code{2 n + 2} digits. Thus, a future 256-bit format would
supply 70 decimal digits, and at least one library already
supports the 256-bit binary and decimal formats.
@item
Decimal arithmetic has an additional rounding mode, called
@emph{round-ties-to-away-from-zero}, meaning that a four-digit
rounding of @code{1.2345} is @code{1.235}, and @code{-1.2345}
becomes @code{-1.235}. That rounding mode is mandated by
financial laws in several countries.
@item
The decimal significand is an
@anchor{decimal-significand}@emph{integer}, instead of a fractional
value, and trailing zeros are only removed at user request. That
feature allows floating-point arithmetic to emulate the
@emph{fixed-point arithmetic} traditionally used in financial
computations.
@end itemize
@noindent
We can easily estimate how many digits are likely to be needed for
financial work: seven billion people on Earth, with an average annual
income of less than US$10,000, means a world financial base that can
be represented in just 15 decimal digits. Even allowing for alternate
currencies, future growth, multiyear accounting, and intermediate
computations, the 34 digits supplied by the 128-bit format are more
than enough for financial purposes.
We return to decimal arithmetic later in this chapter
(@pxref{More on decimal floating-point arithmetic}), after we have
covered more about floating-point arithmetic in general.
@c =====================================================================
@end ignore
@node Exact Floating-Point
@section Exact Floating-Point Arithmetic
@cindex exact floating-point arithmetic
@cindex floating-point arithmetic, exact
As long as the numbers are exactly representable (fractions whose
denominator is a power of 2), and intermediate results do not require
rounding, then floating-point arithmetic is @emph{exact}. It is easy
to predict how many digits are needed for the results of arithmetic
operations:
@itemize @bullet
@item
addition and subtraction of two @var{n}-digit values with the
@emph{same} exponent require at most @code{@var{n} + 1} digits, but
when the exponents differ, many more digits may be needed;
@item
multiplication of two @var{n}-digit values requires exactly
2 @var{n} digits;
@item
although integer division produces a quotient and a remainder of
no more than @var{n}-digits, floating-point remainder and square
root may require an unbounded number of digits, and the quotient
can need many more digits than can be stored.
@end itemize
Whenever a result requires more than @var{n} digits, rounding
is needed.
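Here is a small demonstration of the boundary between exact and
rounded results (a sketch assuming IEEE 754 binary floating point):

@example
#include <stdio.h>

int main (void)
@{
  /* 0.5 and 0.375 have power-of-2 denominators, so these operations
     are exact and the comparisons are reliable.  */
  double a = 0.5, b = 0.375;
  printf ("%d\n", a + b == 0.875);   /* 1 */
  printf ("%d\n", a * b == 0.1875);  /* 1 */

  /* 0.1 is not representable in binary, so repeated addition drifts
     away from the exact decimal result.  */
  double sum = 0.0;
  for (int i = 0; i < 10; ++i)
    sum += 0.1;
  printf ("%d\n", sum == 1.0);       /* 0 on IEEE 754 machines */
  return 0;
@}
@end example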
@c =====================================================================
@node Rounding
@section Rounding
@cindex rounding
When floating-point arithmetic produces a result that can't fit
exactly in the significand of the type that's in use, it has to
@dfn{round} the value. The basic arithmetic operations---addition,
subtraction, multiplication, division, and square root---always produce
a result that is equivalent to the exact, possibly infinite-precision
result rounded to storage precision according to the current rounding
rule.
Rounding sets the @code{FE_INEXACT} exception flag (@pxref{Exception
Flags}). This enables programs to determine that rounding has
occurred.
Rounding consists of adjusting the exponent to bring the significand
back to the required base-point alignment, then applying the current
@dfn{rounding rule} to squeeze the significand into the fixed
available size.
The current rule is selected at run time from four options. Here they
are:
@itemize *
@item
@emph{round-to-nearest}, with ties rounded to an even integer;
@item
@emph{round-up}, towards @code{+Infinity};
@item
@emph{round-down}, towards @code{-Infinity};
@item
@emph{round-towards-zero}.
@end itemize
Under those four rounding rules, a decimal value
@code{-1.2345} that is to be rounded to a four-digit result would
become @code{-1.234}, @code{-1.234}, @code{-1.235}, and
@code{-1.234}, respectively.
The default rounding rule is @emph{round-to-nearest}, because that has
the least bias, and produces the lowest average error. When the true
result lies exactly halfway between two representable machine numbers,
the result is rounded to the one that ends with an even digit.
The @emph{round-towards-zero} rule was common on many early computer
designs, because it is the easiest to implement: it just requires
silent truncation of all extra bits.
The two other rules, @emph{round-up} and @emph{round-down}, are
essential for implementing @dfn{interval arithmetic}, whereby
each arithmetic operation produces lower and upper bounds that
are guaranteed to enclose the exact result.
@xref{Rounding Control}, for details on getting and setting the
current rounding mode.
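As a preview of rounding control, here is a sketch that makes the four
rules visible via @code{nearbyint}, which rounds its argument to an
integer using the current rule.  It assumes the four @code{FE_*}
macros are defined on the platform; compiling with
@option{-frounding-math} keeps the compiler from assuming the default
mode.

@example
#include <stdio.h>
#include <fenv.h>
#include <math.h>

static void show (int mode, const char *name)
@{
  volatile double pos = 2.5, neg = -2.5;  /* defeat constant folding */
  fesetround (mode);
  printf ("%-11s  2.5 -> %g  -2.5 -> %g\n",
          name, nearbyint (pos), nearbyint (neg));
@}

int main (void)
@{
  show (FE_TONEAREST,  "to-nearest");   /* 2 and -2 (ties go to even) */
  show (FE_UPWARD,     "up");           /* 3 and -2 */
  show (FE_DOWNWARD,   "down");         /* 2 and -3 */
  show (FE_TOWARDZERO, "toward-zero");  /* 2 and -2 */
  fesetround (FE_TONEAREST);            /* restore the default */
  return 0;
@}
@end example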
@node Rounding Issues
@section Rounding Issues
@cindex rounding issues (floating point)
@cindex floating-point rounding issues
The default IEEE 754 rounding mode minimizes errors, and most
normal computations should not suffer any serious accumulation of
errors from rounding.
Of course, you can contrive examples where that is not so. Here
is one: iterate a square root, then attempt to recover the
original value by repeated squaring.
@example
#include <stdio.h>
#include <math.h>

int main (void)
@{
  double x = 100.0;
  double y;
  int n, k;

  for (n = 10; n <= 100; n += 10)
    @{
      y = x;
      for (k = 0; k < n; ++k) y = sqrt (y);
      for (k = 0; k < n; ++k) y *= y;
      printf ("n = %3d; x = %.0f\ty = %.6f\n", n, x, y);
    @}
  return 0;
@}
@end example
@noindent
Here is the output:
@example
n =  10; x = 100        y = 100.000000
n =  20; x = 100        y = 100.000000
n =  30; x = 100        y = 99.999977
n =  40; x = 100        y = 99.981025
n =  50; x = 100        y = 90.017127
n =  60; x = 100        y = 1.000000
n =  70; x = 100        y = 1.000000
n =  80; x = 100        y = 1.000000
n =  90; x = 100        y = 1.000000
n = 100; x = 100        y = 1.000000
@end example
After 50 iterations, @code{y} has barely one correct digit, and
soon after, there are no correct digits.
@c =====================================================================
@node Significance Loss
@section Significance Loss
@cindex significance loss (floating point)
@cindex floating-point significance loss
A much more serious source of error in floating-point computation is
@dfn{significance loss} from subtraction of nearly equal values. This
means that the number of bits in the significand of the result is
fewer than the size of the value would permit. If the values being
subtracted are close enough, but still not equal, a @emph{single
subtraction} can wipe out all correct digits, possibly contaminating
all future computations.
Floating-point calculations can sometimes be carefully designed so
that significance loss is not possible, such as summing a series where
all terms have the same sign. For example, the Taylor series
expansions of the trigonometric and hyperbolic sines have terms of
identical magnitude, of the general form @code{@var{x}**(2*@var{n} +
1) / (2*@var{n} + 1)!}. However, those in the trigonometric sine series
alternate in sign, while those in the hyperbolic sine series are all
positive. Here is the output of two small programs that sum @var{k}
terms of the series for @code{sin (@var{x})}, and compare the computed
sums with known-to-be-accurate library functions:
@example
x = 10 k = 51
s (x) = -0.544_021_110_889_270
sin (x) = -0.544_021_110_889_370
x = 20 k = 81
s (x) = 0.912_945_250_749_573
sin (x) = 0.912_945_250_727_628
x = 30 k = 109
s (x) = -0.987_813_746_058_855
sin (x) = -0.988_031_624_092_862
x = 40 k = 137
s (x) = 0.617_400_430_980_474
sin (x) = 0.745_113_160_479_349
x = 50 k = 159
s (x) = 57_105.187_673_745_720_532
sin (x) = -0.262_374_853_703_929
// sinh(x) series summation with positive signs
// with k terms needed to converge to machine precision
x = 10 k = 47
t (x) = 1.101_323_287_470_340e+04
sinh (x) = 1.101_323_287_470_339e+04
x = 20 k = 69
t (x) = 2.425_825_977_048_951e+08
sinh (x) = 2.425_825_977_048_951e+08
x = 30 k = 87
t (x) = 5.343_237_290_762_229e+12
sinh (x) = 5.343_237_290_762_231e+12
x = 40 k = 105
t (x) = 1.176_926_334_185_100e+17
sinh (x) = 1.176_926_334_185_100e+17
x = 50 k = 121
t (x) = 2.592_352_764_293_534e+21
sinh (x) = 2.592_352_764_293_536e+21
@end example
@noindent
We have added underscores to the numbers to enhance readability.
The @code{sinh (@var{x})} series with positive terms can be summed to
high accuracy. By contrast, the series for @code{sin (@var{x})}
suffers increasing significance loss, so that when @var{x} = 30 only
two correct digits remain. Soon after, all digits are wrong, and the
answers are complete nonsense.
An important skill in numerical programming is to recognize when
significance loss is likely to contaminate a computation, and revise
the algorithm to reduce this problem. Sometimes, the only practical
way to do so is to compute in higher intermediate precision, which is
why the extended types like @code{long double} are important.
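Here is a classic instance of this problem and its cure (a sketch;
@code{1.0e-8} is small enough that @code{cos (x)} rounds to exactly 1
in IEEE 754 @code{double}):

@example
#include <stdio.h>
#include <math.h>

int main (void)
@{
  double x = 1.0e-8;

  /* cos (x) is 1 to within rounding, so the subtraction cancels
     every significant bit: the result is 0 instead of about 5e-17.  */
  double direct = 1.0 - cos (x);

  /* The identity 1 - cos(x) = 2*sin(x/2)**2 avoids subtracting
     nearly equal values, so full accuracy is preserved.  */
  double s = sin (x / 2.0);
  double rewritten = 2.0 * s * s;

  printf ("direct    = %.17g\n", direct);
  printf ("rewritten = %.17g\n", rewritten);
  return 0;
@}
@end example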
@c Formerly mentioned @code{__float128}
@c =====================================================================
@node Fused Multiply-Add
@section Fused Multiply-Add
@cindex fused multiply-add in floating-point computations
@cindex floating-point fused multiply-add
In 1990, when IBM introduced the POWER architecture, the CPU
provided a previously unknown instruction, the @dfn{fused
multiply-add} (FMA). It computes the value @code{x * y + z} with
an @strong{exact} double-length product, followed by an addition with a
@emph{single} rounding. Numerical computation often needs pairs of
multiply and add operations, for which the FMA is well-suited.
On the POWER architecture, there are two dedicated registers that
hold permanent values of @code{0.0} and @code{1.0}, and the
normal @emph{multiply} and @emph{add} instructions are just
wrappers around the FMA that compute @code{x * y + 0.0} and
@code{x * 1.0 + z}, respectively.
In the early days, it appeared that the main benefit of the FMA
was getting two floating-point operations for the price of one,
almost doubling the performance of some algorithms. However,
numerical analysts have since shown numerous uses of the FMA for
significantly enhancing accuracy. We discuss one of the most
important ones in the next section.
A few other architectures have since included the FMA, and most
provide variants for the related operations @code{x * y - z}
(FMS), @code{-x * y + z} (FNMA), and @code{-x * y - z} (FNMS).
@c The IEEE 754-2008 revision requires implementations to provide
@c the FMA, as a sixth basic operation.
The functions @code{fmaf}, @code{fma}, and @code{fmal} implement fused
multiply-add for the @code{float}, @code{double}, and @code{long
double} data types. Correct implementation of the FMA in software is
difficult, and some systems that appear to provide those functions do
not satisfy the single-rounding requirement. That situation should
change as more programmers use the FMA operation, and more CPUs
provide FMA in hardware.
Use the @option{-ffp-contract=fast} option to allow generation of FMA
instructions, or @option{-ffp-contract=off} to disallow it.
@c =====================================================================
@node Error Recovery
@section Error Recovery
@cindex error recovery (floating point)
@cindex floating-point error recovery
When two numbers are combined by one of the four basic
operations, the result often requires rounding to storage
precision. For accurate computation, one would like to be able
to recover that rounding error. With historical floating-point
designs, it was difficult to do so portably, but now that IEEE
754 arithmetic is almost universal, the job is much easier.
For addition with the default @emph{round-to-nearest} rounding
mode, we can determine the error in a sum like this:
@example
volatile double err, sum, tmp, x, y;

if (fabs (x) >= fabs (y))
  @{
    sum = x + y;
    tmp = sum - x;
    err = y - tmp;
  @}
else                            /* fabs (x) < fabs (y) */
  @{
    sum = x + y;
    tmp = sum - y;
    err = x - tmp;
  @}
@end example
@noindent
@cindex twosum
Now, @code{x + y} is @emph{exactly} represented by @code{sum + err}.
This basic operation, which has come to be called @dfn{twosum}
in the numerical-analysis literature, is the first key to tracking,
and accounting for, rounding error.
To determine the error in subtraction, just swap the @code{+} and
@code{-} operators.
We used the @code{volatile} qualifier (@pxref{volatile}) in the
declaration of the variables, which forces the compiler to store and
retrieve them from memory, and prevents the compiler from optimizing
@code{err = y - ((x + y) - x)} into @code{err = 0}.
For multiplication, we can compute the rounding error without
magnitude tests with the FMA operation (@pxref{Fused Multiply-Add}),
like this:
@example
volatile double err, prod, x, y;
prod = x * y; /* @r{rounded product} */
err = fma (x, y, -prod); /* @r{exact product @code{= @var{prod} + @var{err}}} */
@end example
For addition, subtraction, and multiplication, we can represent the
exact result with the notional sum of two values. However, the exact
result of division, remainder, or square root potentially requires an
infinite number of digits, so we can at best approximate it.
Nevertheless, we can compute an error term that is close to the true
error: it is just that error value, rounded to machine precision.
For division, you can approximate @code{x / y} with @code{quo + err}
like this:
@example
volatile double err, quo, x, y;
quo = x / y;
err = fma (-quo, y, x) / y;
@end example
For square root, we can approximate @code{sqrt (x)} with @code{root +
err} like this:
@example
volatile double err, root, x;
root = sqrt (x);
err = fma (-root, root, x) / (root + root);
@end example
With the reliable and predictable floating-point design provided
by IEEE 754 arithmetic, we now have the tools we need to track
errors in the five basic floating-point operations, and we can
effectively simulate computing in twice working precision, which
is sometimes sufficient to remove almost all traces of arithmetic
errors.
@c =====================================================================
@ignore
@node Double-Rounding Problems
@section Double-Rounding Problems
@cindex double-rounding problems (floating point)
@cindex floating-point double-rounding problems
Most developers today use 64-bit x86 processors with a 64-bit
operating system, with a Streaming SIMD Extensions (SSE) instruction
set. In the past, using a 32-bit x87 instruction set was common,
leading to issues described in this section.
To offer a few more digits of precision and a wider exponent range,
the IEEE 754 Standard included an optional @emph{temporary real}
format, with 11 more bits in the significand, and 4 more bits in the
biased exponent.
Compilers are free to exploit the longer format, and most do so.
That is usually a @emph{good thing}, such as in computing a
lengthy sum or product, or in implementing the computation of the
hypotenuse of a right-triangle as @code{sqrt (x*x + y*y)}: the
wider exponent range is critical there for avoiding disastrous
overflow or underflow.
@findex fesetprec
@findex fegetprec
However, sometimes it is critical to know what the intermediate
precision and rounding mode are, such as in tracking errors with
the techniques of the preceding section. Some compilers provide
options to prevent the use of the 80-bit format in computations
with 64-bit @code{double}, but ensuring that code is always
compiled that way may be difficult. The x86 architecture has the
ability to force rounding of all operations in the 80-bit
registers to the 64-bit storage format, and some systems provide
a software interface with the functions @code{fesetprec} and
@code{fegetprec}. Unfortunately, neither of those functions is
defined by the ISO Standards for C and C++, and consequently,
they are not standardly available on all platforms that use
the x86 floating-point design.
When @code{double} computations are done in the 80-bit format,
results necessarily involve a @dfn{double rounding}: first to the
64-bit significand in intermediate operations in registers, and
then to the 53-bit significand when the register contents are
stored to memory. Here is an example in decimal arithmetic where
such a double rounding results in the wrong answer: round
@code{1_234_999} from seven to five to three digits. The result is
@code{1_240_000}, whereas the correct representation to three
significant digits is @code{1_230_000}.
@cindex -ffloat-store
One way to reduce the use of the 80-bit format is to declare variables
as @code{volatile double}: that way, the compiler is required to store
and load intermediates from memory, rather than keeping them in 80-bit
registers over long sequences of floating-point instructions. Doing
so does not, however, eliminate double rounding. The now-common
x86-64 architecture normally performs scalar floating-point
arithmetic in SSE registers, which hold 32-bit and 64-bit values at
their storage precision. The option @option{-mfpmath=sse} says that
floating-point computation should use only those registers, thus
eliminating the possibility of double rounding.
@end ignore
@c =====================================================================
@node Exact Floating Constants
@section Exact Floating-Point Constants
@cindex exact specification of floating-point constants
@cindex floating-point constants, exact specification of
One of the frustrations that numerical programmers have suffered
with since the dawn of digital computers is the inability to
precisely specify numbers in their programs. On the early
decimal machines, that was not an issue: you could write a
constant @code{1e-30} and be confident of that exact value
being used in floating-point operations. However, when the
hardware works in a base other than 10, then human-specified
numbers have to be converted to that base, and then converted
back again at output time. The two base conversions are rarely
exact, and unwanted rounding errors are introduced.
@cindex hexadecimal floating-point constants
C now solves this problem with @dfn{hexadecimal floating-point
constants}, which let you specify the stored binary value exactly.
For example, @code{+0x1.fffffcp-1} is the IEEE 754 32-bit value
closest to, but below, @code{1.0}. The significand is written as a
hexadecimal fraction, and the @emph{power of two} is written in
decimal following the exponent letter @code{p} (the traditional
exponent letter @code{e} is not possible, because it is a hexadecimal
digit).
In @code{printf} and @code{scanf} and related functions, you can use
the @samp{%a} and @samp{%A} format specifiers for writing and reading
hexadecimal floating-point values. @samp{%a} writes them with lower
case letters and @samp{%A} writes them with upper case letters. For
instance, this code reproduces our sample number:
@example
printf ("%a\n", 1.0 - pow (2.0, -23));
@print{} 0x1.fffffcp-1
@end example
@noindent
The @code{strtod} family was similarly extended to recognize
numbers in that new format.
If you want to ensure exact data representation for transfer of
floating-point numbers between C programs on different
computers, then hexadecimal constants are an optimum choice.
@c =====================================================================
@node Handling Infinity
@section Handling Infinity
@cindex infinity in floating-point arithmetic
@cindex floating-point infinity
As we noted earlier, the IEEE 754 model of computing is not to stop
the program when exceptional conditions occur. It takes note of
exceptional values or conditions by setting sticky @dfn{exception
flags}, or by producing results with the special values Infinity and
QNaN. In this section, we discuss Infinity; NaNs are covered in the
next section (@pxref{Handling NaN}).
In GNU C, you can create a value of negative Infinity in software like
this:
@example
double x;
x = -1.0 / 0.0;
@end example
GNU C supplies the @code{__builtin_inf}, @code{__builtin_inff}, and
@code{__builtin_infl} macros, and the GNU C Library provides the
@code{INFINITY} macro, all of which are compile-time constants for
positive infinity.
GNU C also provides a standard function to test for an Infinity:
@code{isinf (x)} returns @code{1} if the argument is a signed
infinity, and @code{0} if not.
Infinities can be compared, and all Infinities of the same sign are
equal: there is no notion in IEEE 754 arithmetic of different kinds of
Infinities, as there are in some areas of mathematics. Positive
Infinity is larger than any finite value, and negative Infinity is
smaller than any finite value.
Infinities propagate in addition, subtraction, multiplication,
and square root, but in division, they disappear, because of the
rule that @code{finite / Infinity} is @code{0.0}. Thus, an
overflow in an intermediate computation that produces an Infinity
is likely to be noticed later in the final results. The programmer can
then decide whether the overflow is expected, and acceptable, or whether
the code possibly has a bug, or needs to be run in higher
precision, or redesigned to avoid the production of the Infinity.
@c =====================================================================
@node Handling NaN
@section Handling NaN
@cindex NaN in floating-point arithmetic
@cindex not a number
@cindex floating-point NaN
NaNs are not numbers: they represent values from computations that
produce undefined results. They have a distinctive property that
makes them unlike any other floating-point value: they are
@emph{unequal to everything, including themselves}! Thus, you can
write a test for a NaN like this:
@example
if (x != x)
printf ("x is a NaN\n");
@end example
@noindent
This test works in GNU C, but some compilers might evaluate that test
expression as false without properly checking for the NaN value.
A more portable way to test for NaN is to use the @code{isnan}
function declared in @code{math.h}:
@example
if (isnan (x))
printf ("x is a NaN\n");
@end example
@noindent
@xref{Floating Point Classes, , , libc, The GNU C Library Reference Manual}.
One important use of NaNs is marking of missing data. For
example, in statistics, such data must be omitted from
computations. Use of any particular finite value for missing
data would eventually collide with real data, whereas such data
could never be a NaN, so it is an ideal marker. Functions that
deal with collections of data that may have holes can be written
to test for, and ignore, NaN values.
It is easy to generate a NaN in computations: evaluating @code{0.0 /
0.0} is the commonest way, but @code{Infinity - Infinity},
@code{Infinity / Infinity}, and @code{sqrt (-1.0)} also work.
Functions that receive out-of-bounds arguments can choose to return a
stored NaN value, such as with the @code{NAN} macro defined in
@code{math.h}, but that does not set the @emph{invalid operand}
exception flag, and that can fool some programs.
@cindex NaNs-always-propagate rule
Like Infinity, NaNs propagate in computations, but they are even
stickier, because they never disappear in division. Thus, once a
NaN appears in a chain of numerical operations, it is almost
certain to pop out into the final results. The programmer
has to decide whether that is expected, or whether there is a
coding or algorithmic error that needs repair.
In general, when a function receives a NaN argument, it returns a
NaN. However, there are some exceptions in the math-library functions
that you need to be aware of, because they violate the
@emph{NaNs-always-propagate} rule:
@itemize @bullet
@item
@code{pow (x, 0.0)} always returns @code{1.0}, even if @code{x} is
0.0, Infinity, or a NaN.
@item
@code{pow (1.0, y)} always returns @code{1.0}, even if @code{y} is a
NaN.
@item
@code{hypot (INFINITY, y)} and @code{hypot (-INFINITY, y)} both
always return @code{INFINITY}, even if @code{y} is a NaN.
@item
If just one of the arguments to @code{fmax (x, y)} or
@code{fmin (x, y)} is a NaN, it returns the other argument. If
both arguments are NaNs, it returns a NaN, but there is no
requirement about where it comes from: it could be @code{x}, or
@code{y}, or some other quiet NaN.
@end itemize
NaNs are also used for the return values of math-library
functions where the result is not representable in real
arithmetic, or is mathematically undefined or uncertain, such as
@code{sqrt (-1.0)} and @code{sin (Infinity)}. However, note that a
result that is merely too big to represent should always produce
an Infinity, such as with @code{exp (1000.0)} (too big) and
@code{exp (Infinity)} (truly infinite).
@c =====================================================================
@node Signed Zeros
@section Signed Zeros
@cindex signed zeros in floating-point arithmetic
@cindex floating-point signed zeros
The sign of zero is significant, and important, because it records the
creation of a value that is too small to represent, but came from
either the negative axis, or from the positive axis. Such fine
distinctions are essential for proper handling of @dfn{branch cuts}
in complex arithmetic (@pxref{Complex Arithmetic}).
The key point about signed zeros is that in comparisons, their sign
does not matter: @code{0.0 == -0.0} must @emph{always} evaluate to
@code{1} (true). However, they are not @emph{the same number}, and
@code{-0.0} in C code stands for a negative zero.
@c =====================================================================
@node Scaling by the Base
@section Scaling by Powers of the Base
@cindex scaling floating point by powers of the base
@cindex floating-point scaling by powers of the base
We have discussed rounding errors several times in this chapter,
but it is important to remember that when results require no more
bits than the exponent and significand bits can represent, those results
are @emph{exact}.
One particularly useful exact operation is scaling by a power of
the base. While one, in principle, could do that with code like
this:
@example
y = x * pow (2.0, (double)k); /* @r{Undesirable scaling: avoid!} */
@end example
@noindent
that is not advisable, because it relies on the quality of the
math-library power function, and that happens to be one of the
most difficult functions in the C math library to make accurate.
What is likely to happen on many systems is that the returned
value from @code{pow} will be close to a power of two, but
slightly different, so the subsequent multiplication introduces
rounding error.
The correct, and fastest, way to do the scaling is either via the
traditional C library function, or with its C99 equivalent:
@example
y = ldexp (x, k); /* @r{Traditional pre-C99 style.} */
y = scalbn (x, k); /* @r{C99 style.} */
@end example
@noindent
Both functions return @code{x * 2**k}.
@xref{Normalization Functions, , , libc, The GNU C Library Reference Manual}.
@c =====================================================================
@node Rounding Control
@section Rounding Control
@cindex rounding control (floating point)
@cindex floating-point rounding control
Here we describe how to specify the rounding mode at run time. System
header file @file{fenv.h} provides the prototypes for these functions.
@xref{Rounding, , , libc, The GNU C Library Reference Manual}.
@noindent
That header file also provides constant names for the four rounding modes:
@code{FE_DOWNWARD}, @code{FE_TONEAREST}, @code{FE_TOWARDZERO}, and
@code{FE_UPWARD}.
The function @code{fegetround} examines and returns the current
rounding mode. On a platform with IEEE 754 floating point,
the value will always equal one of those four constants.
On other platforms, it may return a negative value. The function
@code{fesetround} sets the current rounding mode.
Changing the rounding mode can be slow, so it is useful to minimize
the number of changes. For interval arithmetic, we seem to need three
changes for each operation, but we really only need two, because we
can write code like this example for interval addition of two reals:
@example
@{
  struct interval_double
  @{
    double hi, lo;
  @} v;
  extern volatile double x, y;
  int rule;

  rule = fegetround ();

  if (fesetround (FE_UPWARD) == 0)
    @{
      v.hi = x + y;
      v.lo = -(-x - y);
    @}
  else
    fatal ("ERROR: failed to change rounding rule");

  if (fesetround (rule) != 0)
    fatal ("ERROR: failed to restore rounding rule");
@}
@end example
@noindent
The @code{volatile} qualifier (@pxref{volatile}) is essential on x86
platforms to prevent an optimizing compiler from producing the same
value for both bounds.
@ignore
We no longer discuss the double rounding issue.
The code also needs to be compiled with the
option @option{-ffloat-store} that prevents use of higher precision
for the basic operations, because that would introduce double rounding
that could spoil the bounds guarantee of interval arithmetic.
@end ignore
@c =====================================================================
@node Machine Epsilon
@section Machine Epsilon
@cindex machine epsilon (floating point)
@cindex floating-point machine epsilon
In any floating-point system, three attributes are particularly
important to know: @dfn{base} (the number that the exponent specifies
a power of), @dfn{precision} (number of digits in the significand),
and @dfn{range} (difference between most positive and most negative
values). The allocation of bits between exponent and significand
decides the answers to those questions.
A measure of the precision is the answer to the question: what is
the smallest number that can be added to @code{1.0} such that the
sum differs from @code{1.0}? That number is called the
@dfn{machine epsilon}.
We could define the needed machine-epsilon constants for @code{float},
@code{double}, and @code{long double} like this:
@example
static const float epsf = 0x1p-23; /* @r{about 1.192e-07} */
static const double eps = 0x1p-52; /* @r{about 2.220e-16} */
static const long double epsl = 0x1p-63; /* @r{about 1.084e-19} */
@end example
@noindent
Instead of the hexadecimal constants, we could also have used the
Standard C macros, @code{FLT_EPSILON}, @code{DBL_EPSILON}, and
@code{LDBL_EPSILON}.
It is useful to be able to compute the machine epsilons at
run time, and we can easily generalize the operation by replacing
the constant @code{1.0} with a user-supplied value:
@example
double
macheps (double x)
@{ /* @r{Return machine epsilon for @var{x},} */
  /* @r{such that @var{x} + macheps (@var{x}) > @var{x}.} */
  static const double base = 2.0;
  double eps;

  if (isnan (x))
    eps = x;
  else
    @{
      eps = (x == 0.0) ? 1.0 : x;

      while ((x + eps / base) != x)
        eps /= base;            /* @r{Always exact!} */
    @}

  return (eps);
@}
@end example
@noindent
If we call that function with arguments from @code{0} to
@code{10}, as well as Infinity and NaN, and print the returned
values in hexadecimal, we get output like this:
@example
macheps ( 0) = 0x1.0000000000000p-1074
macheps ( 1) = 0x1.0000000000000p-52
macheps ( 2) = 0x1.0000000000000p-51
macheps ( 3) = 0x1.8000000000000p-52
macheps ( 4) = 0x1.0000000000000p-50
macheps ( 5) = 0x1.4000000000000p-51
macheps ( 6) = 0x1.8000000000000p-51
macheps ( 7) = 0x1.c000000000000p-51
macheps ( 8) = 0x1.0000000000000p-49
macheps ( 9) = 0x1.2000000000000p-50
macheps ( 10) = 0x1.4000000000000p-50
macheps (Inf) = infinity
macheps (NaN) = nan
@end example
@noindent
Notice that @code{macheps} has a special test for a NaN to prevent an
infinite loop.
@ignore
We no longer discuss double rounding.
To ensure that no expressions are evaluated with an intermediate higher
precision, we can compile with the @option{-fexcess-precision=standard}
option, which tells the compiler that all calculation results, including
intermediate results, are to be put on the stack, forcing rounding.
@end ignore
Our code made another test for a zero argument to avoid getting a
zero return. The returned value in that case is the smallest
representable floating-point number, here the subnormal value
@code{2**(-1074)}, which is about @code{4.941e-324}.
No special test is needed for an Infinity, because the
@code{eps}-reduction loop then terminates at the first iteration.
Our @code{macheps} function here assumes binary floating point; some
architectures may differ.
The C library includes some related functions that can also be used to
determine machine epsilons at run time:
@example
#include <math.h> /* @r{Include for these prototypes.} */
double nextafter (double x, double y);
float nextafterf (float x, float y);
long double nextafterl (long double x, long double y);
@end example
@noindent
These return the machine number nearest @var{x} in the direction of
@var{y}. For example, @code{nextafter (1.0, 2.0)} produces the same
result as @code{1.0 + macheps (1.0)} and @code{1.0 + DBL_EPSILON}.
@xref{FP Bit Twiddling, , , libc, The GNU C Library Reference Manual}.
It is important to know that the machine epsilon is not symmetric
about all numbers. At the boundaries where normalization changes the
exponent, the epsilon below @var{x} is smaller than that just above
@var{x} by a factor @code{1 / base}. For example, @code{macheps
(1.0)} returns @code{+0x1p-52}, whereas @code{macheps (-1.0)} returns
@code{+0x1p-53}. Some authors distinguish those cases by calling them
the @emph{positive} and @emph{negative}, or @emph{big} and
@emph{small}, machine epsilons. You can produce their values like
this:
@example
eps_neg = 1.0 - nextafter (1.0, -1.0);
eps_pos = nextafter (1.0, +2.0) - 1.0;
@end example
If @var{x} is a variable, such that you do not know its value at
compile time, then you can substitute literal @var{y} values with
either @code{-inf()} or @code{+inf()}, like this:
@example
eps_neg = x - nextafter (x, -inf ());
eps_pos = nextafter (x, +inf ()) - x;
@end example
@noindent
In such cases, if @var{x} is Infinity, then @emph{the @code{nextafter}
functions return @var{y} if @var{x} equals @var{y}}. Our two
assignments then produce @code{+0x1.fffffffffffffp+1023} (that is a
hexadecimal floating point constant and its value is around
1.798e+308; see @ref{Floating Constants}) for @var{eps_neg}, and
Infinity for @var{eps_pos}. Thus, the call @code{nextafter (INFINITY,
-INFINITY)} can be used to find the largest representable finite
number, and the call @code{nextafter (0.0, 1.0)} the smallest
representable number (here, @code{0x1p-1074}, about @code{4.941e-324},
a number that we saw before as the output from @code{macheps (0.0)}).
@c =====================================================================
@node Complex Arithmetic
@section Complex Arithmetic
@cindex complex arithmetic in floating-point calculations
@cindex floating-point arithmetic with complex numbers
We've already looked at defining and referring to complex numbers
(@pxref{Complex Data Types}). What is important to discuss here are
some issues that are unlikely to be obvious to programmers without
extensive experience in both numerical computing, and in complex
arithmetic in mathematics.
The first important point is that, unlike real arithmetic, in complex
arithmetic, the danger of significance loss is @emph{pervasive}, and
affects @emph{every one} of the basic operations, and @emph{almost
all} of the math-library functions. To understand why, recall the
rules for complex multiplication and division:
@example
a = u + I*v /* @r{First operand.} */
b = x + I*y /* @r{Second operand.} */
prod = a * b
= (u + I*v) * (x + I*y)
= (u * x - v * y) + I*(v * x + u * y)
quo = a / b
= (u + I*v) / (x + I*y)
= [(u + I*v) * (x - I*y)] / [(x + I*y) * (x - I*y)]
= [(u * x + v * y) + I*(v * x - u * y)] / (x**2 + y**2)
@end example
@noindent
There are four critical observations about those formulas:
@itemize @bullet
@item
the multiplications on the right-hand side introduce the
possibility of premature underflow or overflow;
@item
the products must be accurate to twice working precision;
@item
there is @emph{always} one subtraction on the right-hand sides
that is subject to catastrophic significance loss; and
@item
complex multiplication has up to @emph{six} rounding errors, and
complex division has @emph{ten} rounding errors.
@end itemize
@cindex branch cuts
Another point that needs careful study is the fact that many functions
in complex arithmetic have @dfn{branch cuts}. You can view a
function with a complex argument, @code{f (z)}, as @code{f (x + I*y)},
and thus, it defines a relation between a point @code{(x, y)} on the
complex plane with an elevation value on a surface. A branch cut
looks like a tear in that surface, so approaching the cut from one
side produces a particular value, and from the other side, a quite
different value. Great care is needed to handle branch cuts properly,
and even small numerical errors can push a result from one side to the
other, radically changing the returned value. As we reported earlier,
correct handling of the sign of zero is critically important for
computing near branch cuts.
The best advice that we can give to programmers who need complex
arithmetic is to always use the @emph{highest precision available},
and then to carefully check the results of test calculations to gauge
the likely accuracy of the computed results. It is easy to supply
test values of real and imaginary parts where all five basic
operations in complex arithmetic, and almost all of the complex math
functions, lose @emph{all} significance, and fail to produce even a
single correct digit.
Even though complex arithmetic makes some programming tasks
easier, it may be numerically preferable to rework the algorithm
so that it can be carried out in real arithmetic. That is
commonly possible in matrix algebra.
GNU C can optimize complex multiplication and division when certain
boundary checks are known to be unnecessary. The command-line option
@option{-fcx-limited-range} tells the compiler that a range-reduction
step is not needed when performing complex division, and that there is
no need to check whether a complex multiplication or division produces
the value @code{NaN + I*NaN}. These checks are enabled by default;
@option{-fno-cx-limited-range} states that default explicitly.
@ignore
@c =====================================================================
@node More on Decimal Floating-Point Arithmetic
@section More on Decimal Floating-Point Arithmetic
@cindex decimal floating-point arithmetic
@cindex floating-point arithmetic, decimal
Proposed extensions to the C programming language call for the
inclusion of decimal floating-point arithmetic, which handles
floating-point numbers with a specified radix of 10, instead of the
unspecified traditional radix of 2.
The proposed new types are @code{_Decimal32}, @code{_Decimal64}, and
@code{_Decimal128}, with corresponding literal constant suffixes of
@code{df}, @code{dd}, and @code{dl}, respectively. For example, a
32-bit decimal floating-point variable could be defined as:
@example
_Decimal32 foo = 42.123df;
@end example
We stated earlier (@pxref{decimal-significand}) that the significand
in decimal floating-point arithmetic is an integer, rather than
fractional, value. Decimal instructions do not normally alter the
exponent by normalizing nonzero significands to remove trailing zeros.
That design feature is intentional: it allows emulation of the
fixed-point arithmetic that has commonly been used for financial
computations.
One consequence of the lack of normalization is that there are
multiple representations of any number that does not use all of the
significand digits. Thus, in the 32-bit format, the values
@code{1.DF}, @code{1.0DF}, @code{1.00DF}, @dots{}, @code{1.000_000DF},
all have different bit patterns in storage, even though they compare
equal. Thus, programmers need to be careful about trailing zero
digits, because they appear in the results, and affect scaling. For
example, @code{1.0DF * 1.0DF} evaluates to @code{1.00DF}, which is
stored as @code{100 * 10**(-2)}.
In general, when you look at a decimal expression with fractional
digits, you should mentally rewrite it in integer form with suitable
powers of ten. Thus, a multiplication like @code{1.23 * 4.56} really
means @code{123 * 10**(-2) * 456 * 10**(-2)}, which evaluates to
@code{56088 * 10**(-4)}, and would be output as @code{5.6088}.
Another consequence of the decimal significand choice is that
initializing decimal floating-point values to a pattern of
all-bits-zero does not produce the expected value @code{0.}: instead,
the results are the subnormal values @code{0.e-101}, @code{0.e-398},
and @code{0.e-6176} in the three storage formats.
GNU C currently supports basic storage and manipulation of decimal
floating-point values on some platforms, and support is expected to
grow in future releases.
@c ??? Suggest chopping the rest of this section, at least for the time
@c ??? being. Decimal floating-point support in GNU C is not yet complete,
@c ??? and functionality discussed appears to not be available on all
@c ??? platforms, and is not obviously documented for end users of GNU C. --TJR
The exponent in decimal arithmetic is called the @emph{quantum}, and
financial computations require that the quantum always be preserved.
If it is not, then rounding may have happened, and additional scaling
is required.
The function @code{samequantumd (x,y)} for 64-bit decimal arithmetic
returns @code{1} if the arguments have the same exponent, and @code{0}
otherwise.
The function @code{quantized (x,y)} returns a value of @var{x} that has
been adjusted to have the same quantum as @var{y}; that adjustment
could require rounding of the significand. For example,
@code{quantized (5.dd, 1.00dd)} returns the value @code{5.00dd}, which
is stored as @code{500 * 10**(-2)}. As another example, a sales-tax
computation might be carried out like this:
@example
decimal_long_double amount, rate, total;
amount = 0.70DL;
rate = 1.05DL;
total = quantizedl (amount * rate, 1.00DL);
@end example
@noindent
Without the call to @code{quantizedl}, the result would have been
@code{0.7350}, instead of the correctly rounded @code{0.74}. That
particular example was chosen because it illustrates yet another
difference between decimal and binary arithmetic: in the latter, the
factors each require an infinite number of bits, and their product,
when converted to decimal, looks like @code{0.734_999_999@dots{}}.
Thus, rounding the product in binary format to two decimal places
always gets @code{0.73}, which is the @emph{wrong} answer for tax
laws.
In financial computations in decimal floating-point arithmetic, the
@code{quantized} function family is expected to receive wide use
whenever multiplication or division changes the desired quantum of the
result.
The function call @code{normalized (x)} returns a value that is
numerically equal to @var{x}, but with trailing zeros trimmed.
Here are some examples of its operation:
@multitable @columnfractions .5 .5
@headitem Function Call @tab Result
@item normalized (+0.00100DD) @tab +0.001DD
@item normalized (+1.00DD) @tab +1.DD
@item normalized (+1.E2DD) @tab +1E+2DD
@item normalized (+100.DD) @tab +1E+2DD
@item normalized (+100.00DD) @tab +1E+2DD
@item normalized (+NaN(0x1234)) @tab +NaN(0x1234)
@item normalized (-NaN(0x1234)) @tab -NaN(0x1234)
@item normalized (+Infinity) @tab +Infinity
@item normalized (-Infinity) @tab -Infinity
@end multitable
@noindent
The NaN examples show that payloads are preserved.
Because the @code{printf} and @code{scanf} families were designed long
before IEEE 754 decimal arithmetic, their format items cannot
distinguish between numbers that have identical values but different
quanta; yet that distinction is likely to be needed in output.
The solution adopted by one early library for decimal arithmetic is to
provide a family of number-to-string conversion functions that
preserve quantization. Here is a code fragment and its output that
shows how they work.
@example
decimal_float x;
x = 123.000DF;
printf ("%%He: x = %He\n", x);
printf ("%%Hf: x = %Hf\n", x);
printf ("%%Hg: x = %Hg\n", x);
printf ("ntosdf (x) = %s\n", ntosdf (x));

%He: x = 1.230000e+02
%Hf: x = 123.000000
%Hg: x = 123
ntosdf (x) = +123.000
@end example
@noindent
The format modifier letter @code{H} indicates a 32-bit decimal value,
and the modifiers @code{DD} and @code{DL} correspond to the two other
formats.
@c =====================================================================
@end ignore
@node Round-Trip Base Conversion
@section Round-Trip Base Conversion
@cindex round-trip base conversion
@cindex base conversion (floating point)
@cindex floating-point round-trip base conversion
Most numeric programs involve converting between base-2 floating-point
numbers, as represented by the computer, and base-10 floating-point
numbers, as entered and handled by the programmer. What might not be
obvious is how many base-2 bits correspond to a given number of
base-10 digits, and vice versa. The following tables show the number
of decimal digits needed to recover a value stored in a given number
of bits, and conversely:
@multitable @columnfractions .5 .1 .1 .1 .1 .1
@item binary in @tab 24 @tab 53 @tab 64 @tab 113 @tab 237
@item decimal out @tab 9 @tab 17 @tab 21 @tab 36 @tab 73
@end multitable
@multitable @columnfractions .5 .1 .1 .1 .1
@item decimal in @tab 7 @tab 16 @tab 34 @tab 70
@item binary out @tab 25 @tab 55 @tab 114 @tab 234
@end multitable
We can compute the table numbers with these two functions:
@example
#include <math.h>  /* @r{for @code{ceil}, @code{log2}, and @code{log10}} */

int
matula (int nbits)
@{ /* @r{Return output decimal digits needed for nbits-bits input.} */
  return ((int) ceil ((double) nbits / log2 (10.0) + 1.0));
@}

int
goldberg (int ndec)
@{ /* @r{Return output bits needed for ndec-digits input.} */
  return ((int) ceil ((double) ndec / log10 (2.0) + 1.0));
@}
@end example
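
Wrapped in a small test program (repeating those functions so the
fragment stands alone), they reproduce the table entries; here, those
for the 24-bit and 53-bit binary formats and the 7-digit and 16-digit
decimal formats:

@example
#include <math.h>
#include <stdio.h>

int
matula (int nbits)
@{
  return ((int) ceil ((double) nbits / log2 (10.0) + 1.0));
@}

int
goldberg (int ndec)
@{
  return ((int) ceil ((double) ndec / log10 (2.0) + 1.0));
@}

int
main (void)
@{
  printf ("%d %d\n", matula (24), matula (53));     /* @r{prints 9 17} */
  printf ("%d %d\n", goldberg (7), goldberg (16));  /* @r{prints 25 55} */
  return 0;
@}
@end example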
One significant observation from those numbers is that we cannot
achieve correct round-trip conversion between the decimal and
binary formats in the same storage size! For example, we need 25
bits to represent a 7-digit value from the 32-bit decimal format,
but the 32-bit binary format has only 24 significand bits
available. Similar
observations hold for each of the other conversion pairs.
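
These limits are visible in the standard @code{<float.h>} header.
Since C11, @code{FLT_DECIMAL_DIG} and @code{DBL_DECIMAL_DIG} give the
decimal digits needed to round-trip each binary format, matching the
first table, while the older @code{FLT_DIG} and @code{DBL_DIG} give
the decimal digits guaranteed to survive a decimal-binary-decimal
round trip:

@example
#include <float.h>
#include <stdio.h>

int
main (void)
@{
  /* @r{Binary-to-decimal round trip (first table): 9 and 17 for IEEE.} */
  printf ("%d %d\n", FLT_DECIMAL_DIG, DBL_DECIMAL_DIG);
  /* @r{Decimal digits that survive a trip through binary: 6 and 15.} */
  printf ("%d %d\n", FLT_DIG, DBL_DIG);
  return 0;
@}
@end example

@noindent
On IEEE 754 systems the first line prints @code{9 17}; a 7-digit
decimal value therefore cannot be trusted to survive a trip through
the 24-bit @code{float} significand.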
The general input/output base-conversion problem is astonishingly
complicated, and solutions were not generally known until the
publication of two papers in 1990, listed near the end
of this chapter. For the 128-bit formats, the worst case needs more
than 11,500 decimal digits of precision to guarantee correct rounding
in a binary-to-decimal conversion!
For further details see the references for Bennett Goldberg and David
Matula.
@c =====================================================================
@node Further Reading
@section Further Reading
The subject of floating-point arithmetic is much more complex
than many programmers seem to think, and few books on programming
languages spend much time in that area. In this chapter, we have
tried to expose the reader to some of the key ideas, and to warn
of easily overlooked pitfalls that can soon lead to nonsensical
results. There are a few good references that we recommend
for further reading, and for finding other important material
about computer arithmetic.
We include URLs for these references when we were given them, when
they are morally legitimate to recommend; we have omitted the URLs
that are paywalled or that require running nonfree JavaScript code in
order to function. We hope you can find morally legitimate sites
where you can access these works.
@c =====================================================================
@c Each bibliography item has a sort key, so the bibliography can be
@c sorted in emacs with M-x sort-paragraphs on the region with the items.
@c =====================================================================
@itemize @bullet
@c Paywalled.
@item @c sort-key: Abbott
Paul H. Abbott and 15 others, @cite{Architecture and software support
in IBM S/390 Parallel Enterprise Servers for IEEE Floating-Point
arithmetic}, IBM Journal of Research and Development @b{43}(5/6)
723--760 (1999). This article gives a good description of IBM's
algorithm for exact decimal-to-binary conversion, complementing
earlier ones by Clinger and others.
@c Paywalled.
@item @c sort-key: Beebe
Nelson H. F. Beebe, @cite{The Mathematical-Function Computation Handbook:
Programming Using the MathCW Portable Software Library},
Springer (2017).
This book describes portable implementations of a large superset
of the mathematical functions available in many programming
languages, extended to a future 256-bit format (70 decimal
digits), for both binary and decimal floating point. It includes
a substantial portion of the functions described in the famous
@cite{NIST Handbook of Mathematical Functions}, Cambridge (2018),
ISBN 0-521-19225-0.
See
@uref{https://www.math.utah.edu/pub/mathcw/}
for compilers and libraries.
@c URL ok
@item @c sort-key: Clinger-1990
William D. Clinger, @cite{How to Read Floating Point Numbers
Accurately}, ACM SIGPLAN Notices @b{25}(6) 92--101 (June 1990),
@uref{https://doi.org/10.1145/93548.93557}.
See also the papers by Steele & White.
@c Paywalled.
@c @item @c sort-key: Clinger-2004
@c William D. Clinger, @cite{Retrospective: How to read floating
@c point numbers accurately}, ACM SIGPLAN Notices @b{39}(4) 360--371 (April 2004),
@c Reprint of 1990 paper, with additional commentary.
@c URL ok
@item @c sort-key: Goldberg-1967
I. Bennett Goldberg, @cite{27 Bits Are Not Enough For 8-Digit Accuracy},
Communications of the ACM @b{10}(2) 105--106 (February 1967),
@uref{https://doi.org/10.1145/363067.363112}. This paper,
and its companions by David Matula, address the base-conversion
problem, and show that the naive formulas are wrong by one or
two digits.
@c URL ok
@item @c sort-key: Goldberg-1991
David Goldberg, @cite{What Every Computer Scientist Should Know
About Floating-Point Arithmetic}, ACM Computing Surveys @b{23}(1)
5--58 (March 1991), corrigendum @b{23}(3) 413 (September 1991),
@uref{https://doi.org/10.1145/103162.103163}.
This paper has been widely distributed, and reissued in vendor
programming-language documentation. It is well worth reading,
and then rereading from time to time.
@c ??? site does not respond, 20 Sep 2022
@item @c sort-key: Juffa
Norbert Juffa and Nelson H. F. Beebe, @cite{A Bibliography of
Publications on Floating-Point Arithmetic},
@uref{https://www.math.utah.edu/pub/tex/bib/fparith.bib}.
This is the largest known bibliography of publications about
floating-point, and also integer, arithmetic. It is actively
maintained, and as of mid-2019 contains more than 6400 references to
original research papers, reports, theses, books, and Web sites on the
subject matter. It can be used to locate the latest research in the
field, and the historical coverage dates back to a 1726 paper on
signed-digit arithmetic, and an 1837 paper by Charles Babbage, the
intellectual father of mechanical computers. The entries for the
Abbott, Clinger, and Steele & White papers cited earlier contain
pointers to several other important related papers on the
base-conversion problem.
@c URL ok
@item @c sort-key: Kahan
William Kahan, @cite{Branch Cuts for Complex Elementary Functions, or
Much Ado About Nothing's Sign Bit}, (1987),
@uref{https://people.freebsd.org/~das/kahan86branch.pdf}.
This Web document about the fine points of complex arithmetic
also appears in the volume edited by A. Iserles and
M. J. D. Powell, @cite{The State of the Art in Numerical
Analysis: Proceedings of the Joint IMA/SIAM Conference on the
State of the Art in Numerical Analysis held at the University of
Birmingham, 14--18 April 1986}, Oxford University Press (1987),
ISBN 0-19-853614-3 (xiv + 719 pages). Its author is the famous
chief architect of the IEEE 754 arithmetic system, and one of the
world's greatest experts in the field of floating-point
arithmetic. An entire generation of his students at the
University of California, Berkeley, have gone on to careers in
academia and industry, spreading the knowledge of how to do
floating-point arithmetic right.
@c paywalled
@item @c sort-key: Knuth
Donald E. Knuth, @cite{A Simple Program Whose Proof Isn't},
in @cite{Beauty is our business: a birthday salute to Edsger
W. Dijkstra}, W. H. J. Feijen, A. J. M. van Gasteren,
D. Gries, and J. Misra (eds.), Springer (1990), ISBN
1-4612-8792-8. This book chapter supplies a correctness proof
of the decimal to binary, and binary to decimal, conversions in
fixed-point arithmetic in the @TeX{} typesetting system. The
proof evaded its author for a dozen years.
@c URL ok
@item @c sort-key: Matula-1968a
David W. Matula, @cite{In-and-out conversions},
Communications of the ACM @b{11}(1) 47--50 (January 1968),
@uref{https://doi.org/10.1145/362851.362887}.
@item @c sort-key: Matula-1968b
David W. Matula, @cite{The Base Conversion Theorem},
Proceedings of the American Mathematical Society @b{19}(3)
716--723 (June 1968). See also other papers here by this author,
and by I. Bennett Goldberg.
@c page is blank without running JS.
@item @c sort-key: Matula-1970
David W. Matula, @cite{A Formalization of Floating-Point Numeric
Base Conversion}, IEEE Transactions on Computers @b{C-19}(8)
681--692 (August 1970).
@c page is blank without running JS.
@item @c sort-key: Muller-2010
Jean-Michel Muller and eight others, @cite{Handbook of
Floating-Point Arithmetic}, Birkh@"auser-Boston (2010), ISBN
0-8176-4704-X (xxiii + 572 pages). This is a
comprehensive treatise from a French team who are among the
world's greatest experts in floating-point arithmetic, and among
the most prolific writers of research papers in that field. They
have much to teach, and their book deserves a place on the
shelves of every serious numerical programmer.
@c paywalled
@item @c sort-key: Muller-2018
Jean-Michel Muller and eight others, @cite{Handbook of
Floating-Point Arithmetic}, Second edition, Birkh@"auser-Boston (2018), ISBN
3-319-76525-6 (xxv + 627 pages). This is a new
edition of the preceding entry.
@c URL given is obsolete
@item @c sort-key: Overton
Michael Overton, @cite{Numerical Computing with IEEE Floating
Point Arithmetic, Including One Theorem, One Rule of Thumb, and
One Hundred and One Exercises}, SIAM (2001), ISBN 0-89871-482-6
(xiv + 104 pages).
This is a small volume that can be covered in a few hours.
@c URL ok
@item @c sort-key: Steele-1990
Guy L. Steele Jr. and Jon L. White, @cite{How to Print
Floating-Point Numbers Accurately}, ACM SIGPLAN Notices
@b{25}(6) 112--126 (June 1990),
@uref{https://doi.org/10.1145/93548.93559}.
See also the papers by Clinger.
@c paywalled
@item @c sort-key: Steele-2004
Guy L. Steele Jr. and Jon L. White, @cite{Retrospective: How to
Print Floating-Point Numbers Accurately}, ACM SIGPLAN Notices
@b{39}(4) 372--389 (April 2004). Reprint of the 1990
paper, with additional commentary.
@item @c sort-key: Sterbenz
Pat H. Sterbenz, @cite{Floating Point Computation}, Prentice-Hall
(1974), ISBN 0-13-322495-3 (xiv + 316 pages). This often-cited book
provides solid coverage of what floating-point arithmetic was like
@emph{before} the introduction of IEEE 754 arithmetic.
@end itemize