@c Copyright @copyright{} 2022, 2023, 2025 Richard Stallman
@c and Free Software Foundation, Inc.
@c This is part of the GNU C Intro and Reference Manual
@c and covered by its license.
@node Floating Point in Depth
@chapter Floating Point in Depth
@menu
* Floating Representations::
* Floating Type Specs::
* Special Float Values::
* Invalid Optimizations::
* Exception Flags::
* Exact Floating-Point::
* Rounding::
* Rounding Issues::
* Significance Loss::
* Fused Multiply-Add::
* Error Recovery::
@c * Double-Rounding Problems::
* Exact Floating Constants::
* Handling Infinity::
* Handling NaN::
* Signed Zeros::
* Scaling by the Base::
* Rounding Control::
* Machine Epsilon::
* Complex Arithmetic::
* Round-Trip Base Conversion::
* Further Reading::
@end menu
@node Floating Representations
@section Floating-Point Representations
@cindex floating-point representations
@cindex representation of floating-point numbers
@cindex IEEE 754-2008 Standard
Storing numbers as @dfn{floating point} allows representation of
numbers with fractional values, in a range larger than that of
hardware integers. A floating-point number consists of a sign bit, a
@dfn{significand} (also called the @dfn{mantissa}), and a power of a
fixed base. GNU C uses the floating-point representations specified by
the @cite{IEEE 754-2008 Standard for Floating-Point Arithmetic}.
The IEEE 754-2008 specification defines basic binary floating-point
formats of five different sizes: 16-bit, 32-bit, 64-bit, 128-bit, and
256-bit. The formats of 32, 64, and 128 bits are used for the
standard C types @code{float}, @code{double}, and @code{long double}.
GNU C supports the 16-bit floating-point type @code{_Float16} on some
platforms, but does not support the 256-bit floating-point type.
Each of the formats encodes the floating-point number as a sign bit.
After this comes an exponent that specifies a power of 2 (with a fixed
offset). Then comes the significand.
The first bit of the significand, before the binary point, is always
1, so there is no need to store it in memory. It is called the
@dfn{hidden bit} because it doesn't appear in the floating-point
number as used in the computer itself.
All of those floating-point formats are sign-magnitude representations,
so +0 and @minus{}0 are different values.
Besides the IEEE 754 format 128-bit float, GNU C also offers a format
consisting of a pair of 64-bit floating point numbers. This lacks the
full exponent range of the IEEE 128-bit format, but is useful when the
underlying hardware platform does not support that.
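The encoding described above can be examined directly.  Here is a
small sketch (assuming the common IEEE 754 binary64 layout for
@code{double} and a C99 compiler) that extracts the sign bit, the
biased exponent, and the stored significand:

@example
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main (void)
@{
  double x = -6.25;                /* equals -1.5625 * 2^2 */
  uint64_t bits;

  memcpy (&bits, &x, sizeof bits); /* view the same bytes as an integer */

  unsigned sign = bits >> 63;
  int exponent = (int) ((bits >> 52) & 0x7ff) - 1023; /* remove the offset */
  uint64_t significand = bits & 0xfffffffffffffULL;   /* hidden bit absent */

  printf ("sign %u, exponent %d, significand 0x%013llx\n",
          sign, exponent, (unsigned long long) significand);
  return 0;
@}
@end example

@noindent
For @code{-6.25} this prints sign @code{1}, exponent @code{2}, and
significand @code{0x9000000000000}: the fraction bits of 1.5625, with
the leading hidden bit omitted from storage.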
@node Floating Type Specs
@section Floating-Point Type Specifications
The standard library header file @file{float.h} defines a number of
constants that describe the platform's implementation of
floating-point types @code{float}, @code{double} and @code{long
double}. They include:
@findex FLT_MIN
@findex DBL_MIN
@findex LDBL_MIN
@findex FLT_HAS_SUBNORM
@findex DBL_HAS_SUBNORM
@findex LDBL_HAS_SUBNORM
@findex FLT_TRUE_MIN
@findex DBL_TRUE_MIN
@findex LDBL_TRUE_MIN
@findex FLT_MAX
@findex DBL_MAX
@findex LDBL_MAX
@findex FLT_DECIMAL_DIG
@findex DBL_DECIMAL_DIG
@findex LDBL_DECIMAL_DIG
@table @code
@item FLT_MIN
@itemx DBL_MIN
@itemx LDBL_MIN
Defines the minimum normalized positive floating-point values that can
be represented with the type.
@item FLT_HAS_SUBNORM
@itemx DBL_HAS_SUBNORM
@itemx LDBL_HAS_SUBNORM
Defines if the floating-point type supports subnormal (or ``denormalized'')
numbers or not (@pxref{subnormal numbers}).
@item FLT_TRUE_MIN
@itemx DBL_TRUE_MIN
@itemx LDBL_TRUE_MIN
Defines the minimum positive values (including subnormal values) that
can be represented with the type.
@item FLT_MAX
@itemx DBL_MAX
@itemx LDBL_MAX
Defines the largest values that can be represented with the type.
@item FLT_DECIMAL_DIG
@itemx DBL_DECIMAL_DIG
@itemx LDBL_DECIMAL_DIG
Defines the number of decimal digits @code{n} such that any
floating-point number that can be represented in the type can be
rounded to a floating-point number with @code{n} decimal digits, and
back again, without losing any precision of the value.
@end table
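A short program can report the values of these constants on a given
platform.  Here is a sketch (assuming a C11 @file{float.h}, since
@code{FLT_TRUE_MIN} and @code{DBL_DECIMAL_DIG} are C11 additions):

@example
#include <stdio.h>
#include <float.h>

int main (void)
@{
  printf ("FLT_MIN         = %g\n", FLT_MIN);
  printf ("FLT_TRUE_MIN    = %g\n", FLT_TRUE_MIN);
  printf ("FLT_MAX         = %g\n", FLT_MAX);
  printf ("DBL_MIN         = %g\n", DBL_MIN);
  printf ("DBL_TRUE_MIN    = %g\n", DBL_TRUE_MIN);
  printf ("DBL_MAX         = %g\n", DBL_MAX);
  printf ("DBL_DECIMAL_DIG = %d\n", DBL_DECIMAL_DIG);
  return 0;
@}
@end example

@noindent
On machines that use IEEE 754 binary64 for @code{double}, the last
line typically prints 17.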
@node Special Float Values
@section Special Floating-Point Values
@cindex special floating-point values
@cindex floating-point values, special
IEEE floating point provides for special values that are not ordinary
numbers.
@table @asis
@item infinities
@code{+Infinity} and @code{-Infinity} are two different infinite
values, one positive and one negative. These result from
operations such as @code{1 / 0}, @code{Infinity + Infinity},
@code{Infinity * Infinity}, and @code{Infinity + @var{finite}}, and also
from a result that is finite, but larger than the most positive possible
value or smaller than the most negative possible value.
@xref{Handling Infinity}, for more about working with infinities.
@item NaNs (not a number)
@cindex QNaN
@cindex SNaN
There are two special values, called Not-a-Number (NaN): a quiet
NaN (QNaN), and a signaling NaN (SNaN).
A QNaN is produced by operations for which the value is undefined
in real arithmetic, such as @code{0 / 0}, @code{sqrt (-1)},
@code{Infinity - Infinity}, and any basic operation in which an
operand is a QNaN.
The signaling NaN is intended for initializing
otherwise-unassigned storage, and the goal is that unlike a
QNaN, an SNaN @emph{does} cause an interrupt that can be caught
by a software handler, diagnosed, and reported. In practice,
little use has been made of signaling NaNs, because the most
common CPUs in desktop and portable computers fail to implement
the full IEEE 754 Standard, and supply only one kind of NaN, the
quiet one. Also, programming-language standards have taken
decades to catch up to the IEEE 754 standard, and implementations
of those language standards make an additional delay before
programmers become willing to use these features.
To enable support for signaling NaNs, use the GCC command-line option
@option{-fsignaling-nans}, but this is an experimental feature and may
not work as expected in every situation.
A NaN has a sign bit, but its value means nothing.
@xref{Handling NaN}, for more about working with NaNs.
@item subnormal numbers
@cindex subnormal numbers
@cindex underflow, floating
@cindex floating underflow
@anchor{subnormal numbers}
It can happen that a computed floating-point value is too small to
represent, such as when two tiny numbers are multiplied. The result
is then said to @dfn{underflow}. The traditional behavior before
the IEEE 754 Standard was to use zero as the result, and possibly to report
the underflow in some sort of program output.
The IEEE 754 Standard is vague about whether rounding happens
before detection of floating underflow and overflow, or after, and CPU
designers may choose either.
However, the Standard does something unusual compared to earlier
designs, and that is that when the result is smaller than the
smallest @dfn{normalized} representable value (i.e., one in
which the leading significand bit is @code{1}), the normalization
requirement is relaxed, leading zero bits are permitted, and
precision is gradually lost until there are no more bits in the
significand. That phenomenon is called @dfn{gradual underflow},
and it serves important numerical purposes, although it does
reduce the precision of the final result. Some floating-point
designs allow you to choose at compile time, or even at
run time, whether underflows are gradual, or are flushed abruptly
to zero. Numbers that have entered the region of gradual
underflow are called @dfn{subnormal}.
You can use the library functions @code{fesetround} and
@code{fegetround} to set and get the rounding mode. Rounding modes
are defined (if supported by the platform) in @code{fenv.h} as:
@code{FE_UPWARD} to round toward positive infinity; @code{FE_DOWNWARD}
to round toward negative infinity; @code{FE_TOWARDZERO} to round
toward zero; and @code{FE_TONEAREST} to round to the nearest
representable value, the default mode. It is best to use
@code{FE_TONEAREST} except when there is a special need for some other
mode.
@end table
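Here is a brief illustration of how these special values arise at run
time (a sketch assuming IEEE 754 arithmetic; @code{isinf},
@code{isnan}, and @code{fpclassify} come from @file{math.h}):

@example
#include <stdio.h>
#include <math.h>

int main (void)
@{
  volatile double zero = 0.0;  /* volatile: do the divisions at run time */
  double inf = 1.0 / zero;     /* overflow of real arithmetic: +Infinity */
  double nan_value = zero / zero;  /* undefined in real arithmetic: a QNaN */
  double tiny = 0x1p-1060;     /* far below DBL_MIN: a subnormal */

  printf ("%d %d %d\n",
          isinf (inf) != 0,
          isnan (nan_value) != 0,
          fpclassify (tiny) == FP_SUBNORMAL);   /* prints 1 1 1 */
  return 0;
@}
@end example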
@node Invalid Optimizations
@section Invalid Optimizations
@cindex invalid optimizations in floating-point arithmetic
@cindex floating-point arithmetic invalid optimizations
Signed zeros, Infinity, and NaN invalidate some optimizations by
programmers and compilers that might otherwise have seemed obvious:
@itemize @bullet
@item
@code{x + 0} and @code{x - 0} are not the same as @code{x} when
@code{x} is zero, because the result depends on the rounding rule.
@xref{Rounding}, for more about rounding rules.
@item
@code{x * 0.0} is not the same as @code{0.0} when @code{x} is
Infinity, a NaN, or negative zero.
@item
@code{x / x} is not the same as @code{1.0} when @code{x} is Infinity,
a NaN, or zero.
@item
@code{(x - y)} is not the same as @code{-(y - x)} because when the
operands are finite and equal, one evaluates to @code{+0} and the
other to @code{-0}.
@item
@code{x - x} is not the same as @code{0.0} when @code{x} is Infinity or
a NaN.
@item
@code{x == x} and @code{x != x} are not equivalent to @code{1} and
@code{0} when @code{x} is a NaN.
@item
@code{x < y} and @code{isless (x, y)} are not equivalent, because the
first sets a sticky exception flag (@pxref{Exception Flags}) when an
operand is a NaN, whereas the second does not affect that flag. The
same holds for the other @code{isxxx} functions that are companions to
relational operators. @xref{FP Comparison Functions, , , libc, The
GNU C Library Reference Manual}.
@end itemize
The @option{-funsafe-math-optimizations} option enables
these optimizations.
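The following sketch (assuming IEEE 754 semantics) demonstrates a few
of the cases above at run time:

@example
#include <stdio.h>
#include <math.h>

int main (void)
@{
  volatile double zero = 0.0;  /* volatile: do the divisions at run time */
  double inf = 1.0 / zero;
  double qnan = zero / zero;

  printf ("%d\n", isnan (inf * 0.0));  /* 1: not the same as 0.0 */
  printf ("%d\n", isnan (inf - inf));  /* 1: not the same as 0.0 */
  printf ("%d\n", qnan == qnan);       /* 0: a NaN never compares equal */
  printf ("%d\n", 0.0 == -0.0);        /* 1: yet +0 and -0 compare equal */
  return 0;
@}
@end example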
@node Exception Flags
@section Floating Arithmetic Exception Flags
@cindex floating arithmetic exception flags
@cindex exception flags (floating point)
@cindex sticky exception flags (floating point)
@cindex floating overflow
@cindex overflow, floating
@cindex floating underflow
@cindex underflow, floating
@dfn{Sticky exception flags} record the occurrence of particular
conditions: once set, they remain set until the program explicitly
clears them.
The conditions include @emph{invalid operand},
@emph{division by zero}, @emph{inexact result} (i.e., one that
required rounding), @emph{underflow}, and @emph{overflow}. Some
extended floating-point designs offer several additional exception
flags. The functions @code{feclearexcept}, @code{feraiseexcept},
@code{fetestexcept}, @code{fegetexceptflags}, and
@code{fesetexceptflags} provide a standardized interface to those
flags. @xref{Status bit operations, , , libc, The GNU C Library
Reference Manual}.
One important use of those @anchor{fetestexcept}flags is to do a
computation that is normally expected to be exact in floating-point
arithmetic, but occasionally might not be, in which case, corrective
action is needed. You can clear the @emph{inexact result} flag with a
call to @code{feclearexcept (FE_INEXACT)}, do the computation, and
then test the flag with @code{fetestexcept (FE_INEXACT)}; the result
of that call is 0 if the flag is not set (there was no rounding), and
nonzero when the flag is set (there was rounding, which, we presume,
implies the program has to correct for that).
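Here is a sketch of that technique (it assumes a platform where
@code{FE_INEXACT} is defined; the @code{volatile} qualifiers keep the
compiler from doing the divisions at compile time):

@example
#include <stdio.h>
#include <fenv.h>

int main (void)
@{
  volatile double num = 1.0, den, quot;

  den = 4.0;                       /* power-of-2 denominator: exact */
  feclearexcept (FE_INEXACT);
  quot = num / den;
  printf ("1/4 inexact? %d\n", fetestexcept (FE_INEXACT) != 0); /* 0 */

  den = 3.0;                       /* 1/3 has no exact binary form */
  feclearexcept (FE_INEXACT);
  quot = num / den;
  printf ("1/3 inexact? %d\n", fetestexcept (FE_INEXACT) != 0); /* 1 */

  (void) quot;
  return 0;
@}
@end example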
@c =====================================================================
@ignore
@node IEEE 754 Decimal Arithmetic
@section IEEE 754 Decimal Arithmetic
@cindex IEEE 754 decimal arithmetic
One of the difficulties that users of computers for numerical
work face, whether they realize it or not, is that the computer
does not operate in the number base that most people are familiar
with. As a result, input decimal fractions must be converted to
binary floating-point values for use in computations, and then
the final results converted back to decimal for humans. Because the
precision is finite and limited, and because algorithms for correct
round-trip conversion between number bases were not known until the
1990s, and are still not implemented on most systems and most
programming languages, the result is frequent confusion for users,
when a simple expression like @code{3.0*(1.0/3.0)} evaluates to
0.999999 instead of the expected 1.0. Here is an
example from a floating-point calculator that allows rounding-mode
control, with the mode set to @emph{round-towards-zero}:
@example
for (k = 1; k <= 10; ++k)
  (void)printf ("%2d\t%10.6f\n", k, k*(1.0/k))

 1      1.000000
 2      1.000000
 3      0.999999
 4      1.000000
 5      0.999999
 6      0.999999
 7      0.999999
 8      1.000000
 9      0.999999
10      0.999999
@end example
Increasing working precision can sometimes help by reducing
intermediate rounding errors, but the reality is that when
fractional values are involved, @emph{no amount} of extra
precision can suffice for some computations. For example, the
nice decimal value @code{1/10} in C99-style binary representation
is @code{+0x1.999999999999ap-4}; that final digit @code{a} is the
rounding of an infinite string of @code{9}'s.
Financial computations in particular depend critically on correct
arithmetic, and the losses due to rounding errors can be
large, especially for businesses with large numbers of small
transactions, such as grocery stores and telephone companies.
Tax authorities are particularly picky, and demand specific
rounding rules, including one that instead of rounding ties to
the nearest number, rounds instead in the direction that favors
the taxman.
Programming languages used for business applications, notably the
venerable Cobol language, have therefore always implemented
financial computations in @emph{fixed-point decimal arithmetic}
in software, and because of the large monetary amounts that must be
processed, successive Cobol standards have increased the minimum
number size from 18 to 32 decimal digits, and the most recent one
requires a decimal exponent range of at least @code{[-999, +999]}.
The revised IEEE 754-2008 standard therefore requires decimal
floating-point arithmetic, as well as the now-widely used binary
formats from 1985. Like the binary formats, the decimal formats
also support Infinity, NaN, and signed zero, and the five basic
operations are also required to produce correctly rounded
representations of infinitely precise exact results.
However, the financial applications of decimal arithmetic
introduce some new features:
@itemize @bullet
@item
There are three decimal formats occupying 32, 64, and 128 bits of
storage, and offering exactly 7, 16, and 34 decimal digits of
precision. If one size has @code{n} digits, the next larger size
has @code{2 n + 2} digits. Thus, a future 256-bit format would
supply 70 decimal digits, and at least one library already
supports the 256-bit binary and decimal formats.
@item
Decimal arithmetic has an additional rounding mode, called
@emph{round-ties-to-away-from-zero}, meaning that a four-digit
rounding of @code{1.2345} is @code{1.235}, and @code{-1.2345}
becomes @code{-1.235}. That rounding mode is mandated by
financial laws in several countries.
@item
The decimal significand is an
@anchor{decimal-significand}@emph{integer}, instead of a fractional
value, and trailing zeros are only removed at user request. That
feature allows floating-point arithmetic to emulate the
@emph{fixed-point arithmetic} traditionally used in financial
computations.
@end itemize
@noindent
We can easily estimate how many digits are likely to be needed for
financial work: seven billion people on Earth, with an average annual
income of less than US$10,000, means a world financial base that can
be represented in just 15 decimal digits. Even allowing for alternate
currencies, future growth, multiyear accounting, and intermediate
computations, the 34 digits supplied by the 128-bit format are more
than enough for financial purposes.
We return to decimal arithmetic later in this chapter
(@pxref{More on decimal floating-point arithmetic}), after we have
covered more about floating-point arithmetic in general.
@c =====================================================================
@end ignore
@node Exact Floating-Point
@section Exact Floating-Point Arithmetic
@cindex exact floating-point arithmetic
@cindex floating-point arithmetic, exact
As long as the numbers are exactly representable (fractions whose
denominator is a power of 2), and intermediate results do not require
rounding, then floating-point arithmetic is @emph{exact}. It is easy
to predict how many digits are needed for the results of arithmetic
operations:
@itemize @bullet
@item
addition and subtraction of two @var{n}-digit values with the
@emph{same} exponent require at most @code{@var{n} + 1} digits, but
when the exponents differ, many more digits may be needed;
@item
multiplication of two @var{n}-digit values requires exactly
2 @var{n} digits;
@item
although integer division produces a quotient and a remainder of
no more than @var{n}-digits, floating-point remainder and square
root may require an unbounded number of digits, and the quotient
can need many more digits than can be stored.
@end itemize
Whenever a result requires more than @var{n} digits, rounding
is needed.
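Here is a small demonstration of the boundary between exact and
rounded results (a sketch assuming IEEE 754 binary floating point):

@example
#include <stdio.h>

int main (void)
@{
  /* 0.5 and 0.375 have power-of-2 denominators, so these operations
     are exact and the comparisons are reliable.  */
  double a = 0.5, b = 0.375;
  printf ("%d\n", a + b == 0.875);   /* 1 */
  printf ("%d\n", a * b == 0.1875);  /* 1 */

  /* 0.1 is not representable in binary, so repeated addition drifts
     away from the exact decimal result.  */
  double sum = 0.0;
  for (int i = 0; i < 10; ++i)
    sum += 0.1;
  printf ("%d\n", sum == 1.0);       /* 0 on IEEE 754 machines */
  return 0;
@}
@end example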
@c =====================================================================
@node Rounding
@section Rounding
@cindex rounding
When floating-point arithmetic produces a result that can't fit
exactly in the significand of the type that's in use, it has to
@dfn{round} the value. The basic arithmetic operations---addition,
subtraction, multiplication, division, and square root---always produce
a result that is equivalent to the exact, possibly infinite-precision
result rounded to storage precision according to the current rounding
rule.
Rounding sets the @code{FE_INEXACT} exception flag (@pxref{Exception
Flags}). This enables programs to determine that rounding has
occurred.
Rounding consists of adjusting the exponent to bring the significand
back to the required base-point alignment, then applying the current
@dfn{rounding rule} to squeeze the significand into the fixed
available size.
The current rule is selected at run time from four options. Here they
are:
@itemize *
@item
@emph{round-to-nearest}, with ties rounded to an even integer;
@item
@emph{round-up}, towards @code{+Infinity};
@item
@emph{round-down}, towards @code{-Infinity};
@item
@emph{round-towards-zero}.
@end itemize
Under those four rounding rules, a decimal value
@code{-1.2345} that is to be rounded to a four-digit result would
become @code{-1.234}, @code{-1.234}, @code{-1.235}, and
@code{-1.234}, respectively.
The default rounding rule is @emph{round-to-nearest}, because that has
the least bias, and produces the lowest average error. When the true
result lies exactly halfway between two representable machine numbers,
the result is rounded to the one that ends with an even digit.
The @emph{round-towards-zero} rule was common on many early computer
designs, because it is the easiest to implement: it just requires
silent truncation of all extra bits.
The two other rules, @emph{round-up} and @emph{round-down}, are
essential for implementing @dfn{interval arithmetic}, whereby
each arithmetic operation produces lower and upper bounds that
are guaranteed to enclose the exact result.
@xref{Rounding Control}, for details on getting and setting the
current rounding mode.
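As a preview of rounding control, here is a sketch that makes the four
rules visible via @code{nearbyint}, which rounds its argument to an
integer using the current rule.  It assumes the four @code{FE_*}
macros are defined on the platform; compiling with
@option{-frounding-math} keeps the compiler from assuming the default
mode.

@example
#include <stdio.h>
#include <fenv.h>
#include <math.h>

static void show (int mode, const char *name)
@{
  volatile double pos = 2.5, neg = -2.5;  /* defeat constant folding */
  fesetround (mode);
  printf ("%-11s  2.5 -> %g  -2.5 -> %g\n",
          name, nearbyint (pos), nearbyint (neg));
@}

int main (void)
@{
  show (FE_TONEAREST,  "to-nearest");   /* 2 and -2 (ties go to even) */
  show (FE_UPWARD,     "up");           /* 3 and -2 */
  show (FE_DOWNWARD,   "down");         /* 2 and -3 */
  show (FE_TOWARDZERO, "toward-zero");  /* 2 and -2 */
  fesetround (FE_TONEAREST);            /* restore the default */
  return 0;
@}
@end example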
@node Rounding Issues
@section Rounding Issues
@cindex rounding issues (floating point)
@cindex floating-point rounding issues
The default IEEE 754 rounding mode minimizes errors, and most
normal computations should not suffer any serious accumulation of
errors from rounding.
Of course, you can contrive examples where that is not so. Here
is one: iterate a square root, then attempt to recover the
original value by repeated squaring.
@example
#include <stdio.h>
#include <math.h>

int main (void)
@{
  double x = 100.0;
  double y;
  int n, k;

  for (n = 10; n <= 100; n += 10)
    @{
      y = x;
      for (k = 0; k < n; ++k) y = sqrt (y);
      for (k = 0; k < n; ++k) y *= y;
      printf ("n = %3d; x = %.0f\ty = %.6f\n", n, x, y);
    @}
  return 0;
@}
@end example
@noindent
Here is the output:
@example
n =  10; x = 100        y = 100.000000
n =  20; x = 100        y = 100.000000
n =  30; x = 100        y = 99.999977
n =  40; x = 100        y = 99.981025
n =  50; x = 100        y = 90.017127
n =  60; x = 100        y = 1.000000
n =  70; x = 100        y = 1.000000
n =  80; x = 100        y = 1.000000
n =  90; x = 100        y = 1.000000
n = 100; x = 100        y = 1.000000
@end example
After 50 iterations, @code{y} has barely one correct digit, and
soon after, there are no correct digits.
@c =====================================================================
@node Significance Loss
@section Significance Loss
@cindex significance loss (floating point)
@cindex floating-point significance loss
A much more serious source of error in floating-point computation is
@dfn{significance loss} from subtraction of nearly equal values. This
means that the number of bits in the significand of the result is
fewer than the size of the value would permit. If the values being
subtracted are close enough, but still not equal, a @emph{single
subtraction} can wipe out all correct digits, possibly contaminating
all future computations.
Floating-point calculations can sometimes be carefully designed so
that significance loss is not possible, such as summing a series where
all terms have the same sign. For example, the Taylor series
expansions of the trigonometric and hyperbolic sines have terms of
identical magnitude, of the general form @code{@var{x}**(2*@var{n} +
1) / (2*@var{n} + 1)!}. However, those in the trigonometric sine series
alternate in sign, while those in the hyperbolic sine series are all
positive. Here is the output of two small programs that sum @var{k}
terms of the series for @code{sin (@var{x})}, and compare the computed
sums with known-to-be-accurate library functions:
@example
x = 10 k = 51
s (x) = -0.544_021_110_889_270
sin (x) = -0.544_021_110_889_370
x = 20 k = 81
s (x) = 0.912_945_250_749_573
sin (x) = 0.912_945_250_727_628
x = 30 k = 109
s (x) = -0.987_813_746_058_855
sin (x) = -0.988_031_624_092_862
x = 40 k = 137
s (x) = 0.617_400_430_980_474
sin (x) = 0.745_113_160_479_349
x = 50 k = 159
s (x) = 57_105.187_673_745_720_532
sin (x) = -0.262_374_853_703_929
// sinh(x) series summation with positive signs
// with k terms needed to converge to machine precision
x = 10 k = 47
t (x) = 1.101_323_287_470_340e+04
sinh (x) = 1.101_323_287_470_339e+04
x = 20 k = 69
t (x) = 2.425_825_977_048_951e+08
sinh (x) = 2.425_825_977_048_951e+08
x = 30 k = 87
t (x) = 5.343_237_290_762_229e+12
sinh (x) = 5.343_237_290_762_231e+12
x = 40 k = 105
t (x) = 1.176_926_334_185_100e+17
sinh (x) = 1.176_926_334_185_100e+17
x = 50 k = 121
t (x) = 2.592_352_764_293_534e+21
sinh (x) = 2.592_352_764_293_536e+21
@end example
@noindent
We have added underscores to the numbers to enhance readability.
The @code{sinh (@var{x})} series with positive terms can be summed to
high accuracy. By contrast, the series for @code{sin (@var{x})}
suffers increasing significance loss, so that when @var{x} = 30 only
two correct digits remain. Soon after, all digits are wrong, and the
answers are complete nonsense.
An important skill in numerical programming is to recognize when
significance loss is likely to contaminate a computation, and revise
the algorithm to reduce this problem. Sometimes, the only practical
way to do so is to compute in higher intermediate precision, which is
why the extended types like @code{long double} are important.
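Here is a classic instance of this problem and its cure (a sketch;
@code{1.0e-8} is small enough that @code{cos (x)} rounds to exactly 1
in IEEE 754 @code{double}):

@example
#include <stdio.h>
#include <math.h>

int main (void)
@{
  double x = 1.0e-8;

  /* cos (x) is 1 to within rounding, so the subtraction cancels
     every significant bit: the result is 0 instead of about 5e-17.  */
  double direct = 1.0 - cos (x);

  /* The identity 1 - cos(x) = 2*sin(x/2)**2 avoids subtracting
     nearly equal values, so full accuracy is preserved.  */
  double s = sin (x / 2.0);
  double rewritten = 2.0 * s * s;

  printf ("direct    = %.17g\n", direct);
  printf ("rewritten = %.17g\n", rewritten);
  return 0;
@}
@end example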
@c Formerly mentioned @code{__float128}
@c =====================================================================
@node Fused Multiply-Add
@section Fused Multiply-Add
@cindex fused multiply-add in floating-point computations
@cindex floating-point fused multiply-add
In 1990, when IBM introduced the POWER architecture, the CPU
provided a previously unknown instruction, the @dfn{fused
multiply-add} (FMA). It computes the value @code{x * y + z} with
an @strong{exact} double-length product, followed by an addition with a
@emph{single} rounding. Numerical computation often needs pairs of
multiply and add operations, for which the FMA is well-suited.
On the POWER architecture, there are two dedicated registers that
hold permanent values of @code{0.0} and @code{1.0}, and the
normal @emph{multiply} and @emph{add} instructions are just
wrappers around the FMA that compute @code{x * y + 0.0} and
@code{x * 1.0 + z}, respectively.
In the early days, it appeared that the main benefit of the FMA
was getting two floating-point operations for the price of one,
almost doubling the performance of some algorithms. However,
numerical analysts have since shown numerous uses of the FMA for
significantly enhancing accuracy. We discuss one of the most
important ones in the next section.
A few other architectures have since included the FMA, and most
provide variants for the related operations @code{x * y - z}
(FMS), @code{-x * y + z} (FNMA), and @code{-x * y - z} (FNMS).
@c The IEEE 754-2008 revision requires implementations to provide
@c the FMA, as a sixth basic operation.
The functions @code{fmaf}, @code{fma}, and @code{fmal} implement fused
multiply-add for the @code{float}, @code{double}, and @code{long
double} data types. Correct implementation of the FMA in software is
difficult, and some systems that appear to provide those functions do
not satisfy the single-rounding requirement. That situation should
change as more programmers use the FMA operation, and more CPUs
provide FMA in hardware.
Use the @option{-ffp-contract=fast} option to allow generation of FMA
instructions, or @option{-ffp-contract=off} to disallow it.
@c =====================================================================
@node Error Recovery
@section Error Recovery
@cindex error recovery (floating point)
@cindex floating-point error recovery
When two numbers are combined by one of the four basic
operations, the result often requires rounding to storage
precision. For accurate computation, one would like to be able
to recover that rounding error. With historical floating-point
designs, it was difficult to do so portably, but now that IEEE
754 arithmetic is almost universal, the job is much easier.
For addition with the default @emph{round-to-nearest} rounding
mode, we can determine the error in a sum like this:
@example
volatile double err, sum, tmp, x, y;

if (fabs (x) >= fabs (y))
  @{
    sum = x + y;
    tmp = sum - x;
    err = y - tmp;
  @}
else                            /* fabs (x) < fabs (y) */
  @{
    sum = x + y;
    tmp = sum - y;
    err = x - tmp;
  @}
@end example
@noindent
@cindex twosum
Now, @code{x + y} is @emph{exactly} represented by @code{sum + err}.
This basic operation, which has come to be called @dfn{twosum}
in the numerical-analysis literature, is the first key to tracking,
and accounting for, rounding error.
To determine the error in subtraction, just swap the @code{+} and
@code{-} operators.
We used the @code{volatile} qualifier (@pxref{volatile}) in the
declaration of the variables, which forces the compiler to store and
retrieve them from memory, and prevents the compiler from optimizing
@code{err = y - ((x + y) - x)} into @code{err = 0}.
For multiplication, we can compute the rounding error without
magnitude tests with the FMA operation (@pxref{Fused Multiply-Add}),
like this:
@example
volatile double err, prod, x, y;
prod = x * y; /* @r{rounded product} */
err = fma (x, y, -prod); /* @r{exact product @code{= @var{prod} + @var{err}}} */
@end example
For addition, subtraction, and multiplication, we can represent the
exact result with the notional sum of two values. However, the exact
result of division, remainder, or square root potentially requires an
infinite number of digits, so we can at best approximate it.
Nevertheless, we can compute an error term that is close to the true
error: it is just that error value, rounded to machine precision.
For division, you can approximate @code{x / y} with @code{quo + err}
like this:
@example
volatile double err, quo, x, y;
quo = x / y;
err = fma (-quo, y, x) / y;
@end example
For square root, we can approximate @code{sqrt (x)} with @code{root +
err} like this:
@example
volatile double err, root, x;
root = sqrt (x);
err = fma (-root, root, x) / (root + root);
@end example
With the reliable and predictable floating-point design provided
by IEEE 754 arithmetic, we now have the tools we need to track
errors in the five basic floating-point operations, and we can
effectively simulate computing in twice working precision, which
is sometimes sufficient to remove almost all traces of arithmetic
errors.
@c =====================================================================
@ignore
@node Double-Rounding Problems
@section Double-Rounding Problems
@cindex double-rounding problems (floating point)
@cindex floating-point double-rounding problems
Most developers today use 64-bit x86 processors with a 64-bit
operating system, with a Streaming SIMD Extensions (SSE) instruction
set. In the past, using a 32-bit x87 instruction set was common,
leading to issues described in this section.
To offer a few more digits of precision and a wider exponent range,
the IEEE 754 Standard included an optional @emph{temporary real}
format, with 11 more bits in the significand, and 4 more bits in the
biased exponent.
Compilers are free to exploit the longer format, and most do so.
That is usually a @emph{good thing}, such as in computing a
lengthy sum or product, or in implementing the computation of the
hypotenuse of a right-triangle as @code{sqrt (x*x + y*y)}: the
wider exponent range is critical there for avoiding disastrous
overflow or underflow.
@findex fesetprec
@findex fegetprec
However, sometimes it is critical to know what the intermediate
precision and rounding mode are, such as in tracking errors with
the techniques of the preceding section. Some compilers provide
options to prevent the use of the 80-bit format in computations
with 64-bit @code{double}, but ensuring that code is always
compiled that way may be difficult. The x86 architecture has the
ability to force rounding of all operations in the 80-bit
registers to the 64-bit storage format, and some systems provide
a software interface with the functions @code{fesetprec} and
@code{fegetprec}. Unfortunately, neither of those functions is
defined by the ISO Standards for C and C++, and consequently,
they are not standardly available on all platforms that use
the x86 floating-point design.
When @code{double} computations are done in the 80-bit format,
results necessarily involve a @dfn{double rounding}: first to the
64-bit significand in intermediate operations in registers, and
then to the 53-bit significand when the register contents are
stored to memory. Here is an example in decimal arithmetic where
such a double rounding results in the wrong answer: round
@code{1_234_999} from seven to five to three digits. The result is
@code{1_240_000}, whereas the correct representation to three
significant digits is @code{1_230_000}.
@cindex -ffloat-store
One way to reduce the use of the 80-bit format is to declare variables
as @code{volatile double}: that way, the compiler is required to store
and load intermediates from memory, rather than keeping them in 80-bit
registers over long sequences of floating-point instructions. Doing
so does not, however, eliminate double rounding. The now-common
x86-64 architecture normally performs scalar floating-point
arithmetic in SSE registers, which hold 32-bit and 64-bit values at
their storage precision. The option @option{-mfpmath=sse} says that
floating-point computation should use only those registers, thus
eliminating the possibility of double rounding.
@end ignore
@c =====================================================================
@node Exact Floating Constants
@section Exact Floating-Point Constants
@cindex exact specification of floating-point constants
@cindex floating-point constants, exact specification of
One of the frustrations that numerical programmers have suffered
with since the dawn of digital computers is the inability to
precisely specify numbers in their programs. On the early
decimal machines, that was not an issue: you could write a
constant @code{1e-30} and be confident of that exact value
being used in floating-point operations. However, when the
hardware works in a base other than 10, then human-specified
numbers have to be converted to that base, and then converted
back again at output time. The two base conversions are rarely
exact, and unwanted rounding errors are introduced.
@cindex hexadecimal floating-point constants
C now solves this problem with @dfn{hexadecimal floating-point
constants}, which let you specify the stored binary value exactly.
For example, @code{+0x1.fffffcp-1} is the IEEE 754 32-bit value
closest to, but below, @code{1.0}. The significand is written as a
hexadecimal fraction, and the @emph{power of two} is written in
decimal following the exponent letter @code{p} (the traditional
exponent letter @code{e} is not possible, because it is a hexadecimal
digit).
In @code{printf} and @code{scanf} and related functions, you can use
the @samp{%a} and @samp{%A} format specifiers for writing and reading
hexadecimal floating-point values. @samp{%a} writes them with lower
case letters and @samp{%A} writes them with upper case letters. For
instance, this code reproduces our sample number:
@example
printf ("%a\n", 1.0 - pow (2.0, -23));
@print{} 0x1.fffffcp-1
@end example
@noindent
The @code{strtod} family was similarly extended to recognize
numbers in that new format.
If you want to ensure exact data representation for transfer of
floating-point numbers between C programs on different
computers, then hexadecimal constants are an optimum choice.
@c =====================================================================
@node Handling Infinity
@section Handling Infinity
@cindex infinity in floating-point arithmetic
@cindex floating-point infinity
As we noted earlier, the IEEE 754 model of computing is not to stop
the program when exceptional conditions occur. It takes note of
exceptional values or conditions by setting sticky @dfn{exception
flags}, or by producing results with the special values Infinity and
QNaN. In this section, we discuss Infinity; NaNs are covered in the
next section (@pxref{Handling NaN}).
In GNU C, you can create a value of negative Infinity in software like
this:
@example
double x;
x = -1.0 / 0.0;
@end example
GNU C supplies the @code{__builtin_inf}, @code{__builtin_inff}, and
@code{__builtin_infl} macros, and the GNU C Library provides the
@code{INFINITY} macro, all of which are compile-time constants for
positive infinity.
GNU C also provides a standard function to test for an Infinity:
@code{isinf (x)} returns @code{1} if the argument is a signed
infinity, and @code{0} if not.
Infinities can be compared, and all Infinities of the same sign are
equal: there is no notion in IEEE 754 arithmetic of different kinds of
Infinities, as there are in some areas of mathematics. Positive
Infinity is larger than any finite value, and negative Infinity is
smaller than any finite value.
Infinities propagate in addition, subtraction, multiplication,
and square root, but in division, they disappear, because of the
rule that @code{finite / Infinity} is @code{0.0}. Thus, an
overflow in an intermediate computation that produces an Infinity
is likely to be noticed later in the final results. The programmer can
then decide whether the overflow is expected, and acceptable, or whether
the code possibly has a bug, or needs to be run in higher
precision, or redesigned to avoid the production of the Infinity.
@c =====================================================================
@node Handling NaN
@section Handling NaN
@cindex NaN in floating-point arithmetic
@cindex not a number
@cindex floating-point NaN
NaNs are not numbers: they represent values from computations that
produce undefined results. They have a distinctive property that
makes them unlike any other floating-point value: they are
@emph{unequal to everything, including themselves}! Thus, you can
write a test for a NaN like this:
@example
if (x != x)
printf ("x is a NaN\n");
@end example
@noindent
This test works in GNU C, but some compilers might evaluate that test
expression as false without properly checking for the NaN value.
A more portable way to test for NaN is to use the @code{isnan}
function declared in @code{math.h}:
@example
if (isnan (x))
printf ("x is a NaN\n");
@end example
@noindent
@xref{Floating Point Classes, , , libc, The GNU C Library Reference Manual}.
One important use of NaNs is marking of missing data. For
example, in statistics, such data must be omitted from
computations. Use of any particular finite value for missing
data would eventually collide with real data, whereas such data
could never be a NaN, so it is an ideal marker. Functions that
deal with collections of data that may have holes can be written
to test for, and ignore, NaN values.
It is easy to generate a NaN in computations: evaluating @code{0.0 /
0.0} is the commonest way, but @code{Infinity - Infinity},
@code{Infinity / Infinity}, and @code{sqrt (-1.0)} also work.
Functions that receive out-of-bounds arguments can choose to return a
stored NaN value, such as with the @code{NAN} macro defined in
@code{math.h}, but that does not set the @emph{invalid operand}
exception flag, and that can fool some programs.
@cindex NaNs-always-propagate rule
Like Infinity, NaNs propagate in computations, but they are even
stickier, because they never disappear in division. Thus, once a
NaN appears in a chain of numerical operations, it is almost
certain to pop out into the final results. The programmer
has to decide whether that is expected, or whether there is a
coding or algorithmic error that needs repair.
In general, when a function receives a NaN argument, it returns a
NaN. However, there are some exceptions in the math-library functions
that you need to be aware of, because they violate the
@emph{NaNs-always-propagate} rule:
@itemize @bullet
@item
@code{pow (x, 0.0)} always returns @code{1.0}, even if @code{x} is
0.0, Infinity, or a NaN.
@item
@code{pow (1.0, y)} always returns @code{1.0}, even if @code{y} is a
NaN.
@item
@code{hypot (INFINITY, y)} and @code{hypot (-INFINITY, y)} both
always return @code{INFINITY}, even if @code{y} is a NaN.
@item
If just one of the arguments to @code{fmax (x, y)} or
@code{fmin (x, y)} is a NaN, it returns the other argument. If
both arguments are NaNs, it returns a NaN, but there is no
requirement about where it comes from: it could be @code{x}, or
@code{y}, or some other quiet NaN.
@end itemize
NaNs are also used for the return values of math-library
functions where the result is not representable in real
arithmetic, or is mathematically undefined or uncertain, such as
@code{sqrt (-1.0)} and @code{sin (Infinity)}. However, note that a
result that is merely too big to represent should always produce
an Infinity, such as with @code{exp (1000.0)} (too big) and
@code{exp (Infinity)} (truly infinite).
@c =====================================================================
@node Signed Zeros
@section Signed Zeros
@cindex signed zeros in floating-point arithmetic
@cindex floating-point signed zeros
The sign of zero is significant, and important, because it records the
creation of a value that is too small to represent, but came from
either the negative axis, or from the positive axis. Such fine
distinctions are essential for proper handling of @dfn{branch cuts}
in complex arithmetic (@pxref{Complex Arithmetic}).
The key point about signed zeros is that in comparisons, their sign
does not matter: @code{0.0 == -0.0} must @emph{always} evaluate to
@code{1} (true). However, they are not @emph{the same number}, and
@code{-0.0} in C code stands for a negative zero.
@c =====================================================================
@node Scaling by the Base
@section Scaling by Powers of the Base
@cindex scaling floating point by powers of the base
@cindex floating-point scaling by powers of the base
We have discussed rounding errors several times in this chapter,
but it is important to remember that when results require no more
bits than the exponent and significand bits can represent, those results
are @emph{exact}.
One particularly useful exact operation is scaling by a power of
the base. While one, in principle, could do that with code like
this:
@example
y = x * pow (2.0, (double)k); /* @r{Undesirable scaling: avoid!} */
@end example
@noindent
that is not advisable, because it relies on the quality of the
math-library power function, and that happens to be one of the
most difficult functions in the C math library to make accurate.
What is likely to happen on many systems is that the returned
value from @code{pow} will be close to a power of two, but
slightly different, so the subsequent multiplication introduces
rounding error.
The correct, and fastest, way to do the scaling is either via the
traditional C library function, or with its C99 equivalent:
@example
y = ldexp (x, k); /* @r{Traditional pre-C99 style.} */
y = scalbn (x, k); /* @r{C99 style.} */
@end example
@noindent
Both functions return @code{x * 2**k}.
@xref{Normalization Functions, , , libc, The GNU C Library Reference Manual}.
@c =====================================================================
@node Rounding Control
@section Rounding Control
@cindex rounding control (floating point)
@cindex floating-point rounding control
Here we describe how to specify the rounding mode at run time. System
header file @file{fenv.h} provides the prototypes for these functions.
@xref{Rounding, , , libc, The GNU C Library Reference Manual}.
@noindent
That header file also provides constant names for the four rounding modes:
@code{FE_DOWNWARD}, @code{FE_TONEAREST}, @code{FE_TOWARDZERO}, and
@code{FE_UPWARD}.
The function @code{fegetround} examines and returns the current
rounding mode. On a platform with IEEE 754 floating point,
the value will always equal one of those four constants.
On other platforms, it may return a negative value. The function
@code{fesetround} sets the current rounding mode.
Changing the rounding mode can be slow, so it is useful to minimize
the number of changes. For interval arithmetic, we seem to need three
changes for each operation, but we really only need two, because we
can write code like this example for interval addition of two reals:
@example
@{
  struct interval_double
  @{
    double hi, lo;
  @} v;
  extern volatile double x, y;
  int rule;

  rule = fegetround ();

  if (fesetround (FE_UPWARD) == 0)
    @{
      v.hi = x + y;
      v.lo = -(-x - y);
    @}
  else
    fatal ("ERROR: failed to change rounding rule");

  if (fesetround (rule) != 0)
    fatal ("ERROR: failed to restore rounding rule");
@}
@end example
@noindent
The @code{volatile} qualifier (@pxref{volatile}) is essential on x86
platforms to prevent an optimizing compiler from producing the same
value for both bounds.
@ignore
We no longer discuss the double rounding issue.
The code also needs to be compiled with the
option @option{-ffloat-store} that prevents use of higher precision
for the basic operations, because that would introduce double rounding
that could spoil the bounds guarantee of interval arithmetic.
@end ignore
@c =====================================================================
@node Machine Epsilon
@section Machine Epsilon
@cindex machine epsilon (floating point)
@cindex floating-point machine epsilon
In any floating-point system, three attributes are particularly
important to know: @dfn{base} (the number that the exponent specifies
a power of), @dfn{precision} (number of digits in the significand),
and @dfn{range} (difference between most positive and most negative
values). The allocation of bits between exponent and significand
decides the answers to those questions.
A measure of the precision is the answer to the question: what is
the smallest number that can be added to @code{1.0} such that the
sum differs from @code{1.0}? That number is called the
@dfn{machine epsilon}.
We could define the needed machine-epsilon constants for @code{float},
@code{double}, and @code{long double} like this:
@example
static const float epsf = 0x1p-23; /* @r{about 1.192e-07} */
static const double eps = 0x1p-52; /* @r{about 2.220e-16} */
static const long double epsl = 0x1p-63; /* @r{about 1.084e-19} */
@end example
@noindent
Instead of the hexadecimal constants, we could also have used the
Standard C macros, @code{FLT_EPSILON}, @code{DBL_EPSILON}, and
@code{LDBL_EPSILON}.
It is useful to be able to compute the machine epsilons at
run time, and we can easily generalize the operation by replacing
the constant @code{1.0} with a user-supplied value:
@example
double
macheps (double x)
@{ /* @r{Return machine epsilon for @var{x},} */
  /* @r{such that @var{x} + macheps (@var{x}) > @var{x}.} */
  static const double base = 2.0;
  double eps;

  if (isnan (x))
    eps = x;
  else
    @{
      eps = (x == 0.0) ? 1.0 : x;

      while ((x + eps / base) != x)
        eps /= base;            /* @r{Always exact!} */
    @}

  return (eps);
@}
@end example
@noindent
If we call that function with arguments from @code{0} to
@code{10}, as well as Infinity and NaN, and print the returned
values in hexadecimal, we get output like this:
@example
macheps ( 0) = 0x1.0000000000000p-1074
macheps ( 1) = 0x1.0000000000000p-52
macheps ( 2) = 0x1.0000000000000p-51
macheps ( 3) = 0x1.8000000000000p-52
macheps ( 4) = 0x1.0000000000000p-50
macheps ( 5) = 0x1.4000000000000p-51
macheps ( 6) = 0x1.8000000000000p-51
macheps ( 7) = 0x1.c000000000000p-51
macheps ( 8) = 0x1.0000000000000p-49
macheps ( 9) = 0x1.2000000000000p-50
macheps ( 10) = 0x1.4000000000000p-50
macheps (Inf) = infinity
macheps (NaN) = nan
@end example
@noindent
Notice that @code{macheps} has a special test for a NaN to prevent an
infinite loop.
@ignore
We no longer discuss double rounding.
To ensure that no expressions are evaluated with an intermediate higher
precision, we can compile with the @option{-fexcess-precision=standard}
option, which tells the compiler that all calculation results, including
intermediate results, are to be put on the stack, forcing rounding.
@end ignore
Our code made another test for a zero argument to avoid getting a
zero return. The returned value in that case is the smallest
representable floating-point number, here the subnormal value
@code{2**(-1074)}, which is about @code{4.941e-324}.
No special test is needed for an Infinity, because the
@code{eps}-reduction loop then terminates at the first iteration.
Our @code{macheps} function here assumes binary floating point; some
architectures may differ.
The C library includes some related functions that can also be used to
determine machine epsilons at run time:
@example
#include <math.h> /* @r{Include for these prototypes.} */
double nextafter (double x, double y);
float nextafterf (float x, float y);
long double nextafterl (long double x, long double y);
@end example
@noindent
These return the machine number nearest @var{x} in the direction of
@var{y}. For example, @code{nextafter (1.0, 2.0)} produces the same
result as @code{1.0 + macheps (1.0)} and @code{1.0 + DBL_EPSILON}.
@xref{FP Bit Twiddling, , , libc, The GNU C Library Reference Manual}.
It is important to know that the machine epsilon is not symmetric
about all numbers. At the boundaries where normalization changes the
exponent, the epsilon below @var{x} is smaller than that just above
@var{x} by a factor @code{1 / base}. For example, @code{macheps
(1.0)} returns @code{+0x1p-52}, whereas @code{macheps (-1.0)} returns
@code{+0x1p-53}. Some authors distinguish those cases by calling them
the @emph{positive} and @emph{negative}, or @emph{big} and
@emph{small}, machine epsilons. You can produce their values like
this:
@example
eps_neg = 1.0 - nextafter (1.0, -1.0);
eps_pos = nextafter (1.0, +2.0) - 1.0;
@end example
If @var{x} is a variable, such that you do not know its value at
compile time, then you can substitute literal @var{y} values with
either @code{-inf()} or @code{+inf()}, like this:
@example
eps_neg = x - nextafter (x, -inf ());
eps_pos = nextafter (x, +inf ()) - x;
@end example
@noindent
In such cases, if @var{x} is Infinity, then @emph{the @code{nextafter}
functions return @var{y} if @var{x} equals @var{y}}. Our two
assignments then produce @code{+0x1.fffffffffffffp+1023} (that is a
hexadecimal floating point constant and its value is around
1.798e+308; see @ref{Floating Constants}) for @var{eps_neg}, and
Infinity for @var{eps_pos}. Thus, the call @code{nextafter (INFINITY,
-INFINITY)} can be used to find the largest representable finite
number, and the call @code{nextafter (0.0, 1.0)} the smallest
representable number (here, @code{0x1p-1074}, about @code{4.941e-324},
a number that we saw before as the output from @code{macheps (0.0)}).
@c =====================================================================
@node Complex Arithmetic
@section Complex Arithmetic
@cindex complex arithmetic in floating-point calculations
@cindex floating-point arithmetic with complex numbers
We've already looked at defining and referring to complex numbers
(@pxref{Complex Data Types}). What is important to discuss here are
some issues that are unlikely to be obvious to programmers without
extensive experience in both numerical computing, and in complex
arithmetic in mathematics.
The first important point is that, unlike real arithmetic, in complex
arithmetic, the danger of significance loss is @emph{pervasive}, and
affects @emph{every one} of the basic operations, and @emph{almost
all} of the math-library functions. To understand why, recall the
rules for complex multiplication and division:
@example
a = u + I*v /* @r{First operand.} */
b = x + I*y /* @r{Second operand.} */
prod = a * b
= (u + I*v) * (x + I*y)
= (u * x - v * y) + I*(v * x + u * y)
quo = a / b
= (u + I*v) / (x + I*y)
= [(u + I*v) * (x - I*y)] / [(x + I*y) * (x - I*y)]
= [(u * x + v * y) + I*(v * x - u * y)] / (x**2 + y**2)
@end example
@noindent
There are four critical observations about those formulas:
@itemize @bullet
@item
the multiplications on the right-hand side introduce the
possibility of premature underflow or overflow;
@item
the products must be accurate to twice working precision;
@item
there is @emph{always} one subtraction on the right-hand sides
that is subject to catastrophic significance loss; and
@item
complex multiplication has up to @emph{six} rounding errors, and
complex division has @emph{ten} rounding errors.
@end itemize
@cindex branch cuts
Another point that needs careful study is the fact that many functions
in complex arithmetic have @dfn{branch cuts}. You can view a
function with a complex argument, @code{f (z)}, as @code{f (x + I*y)},
and thus, it defines a relation between a point @code{(x, y)} on the
complex plane with an elevation value on a surface. A branch cut
looks like a tear in that surface, so approaching the cut from one
side produces a particular value, and from the other side, a quite
different value. Great care is needed to handle branch cuts properly,
and even small numerical errors can push a result from one side to the
other, radically changing the returned value. As we reported earlier,
correct handling of the sign of zero is critically important for
computing near branch cuts.
The best advice that we can give to programmers who need complex
arithmetic is to always use the @emph{highest precision available},
and then to carefully check the results of test calculations to gauge
the likely accuracy of the computed results. It is easy to supply
test values of real and imaginary parts where all five basic
operations in complex arithmetic, and almost all of the complex math
functions, lose @emph{all} significance, and fail to produce even a
single correct digit.
Even though complex arithmetic makes some programming tasks
easier, it may be numerically preferable to rework the algorithm
so that it can be carried out in real arithmetic. That is
commonly possible in matrix algebra.
GNU C can optimize complex multiplication and division when certain
boundary checks are known to be unnecessary. The command-line option
@option{-fcx-limited-range} tells the compiler that a range-reduction
step is not needed when performing complex division, and that there is
no need to check whether a complex multiplication or division produces
the value @code{NaN + I*NaN}. These checks are enabled by default;
@option{-fno-cx-limited-range} states that default explicitly.
@ignore
@c =====================================================================
@node More on Decimal Floating-Point Arithmetic
@section More on Decimal Floating-Point Arithmetic
@cindex decimal floating-point arithmetic
@cindex floating-point arithmetic, decimal
Proposed extensions to the C programming language call for the
inclusion of decimal floating-point arithmetic, which handles
floating-point numbers with a specified radix of 10, instead of the
unspecified traditional radix of 2.
The proposed new types are @code{_Decimal32}, @code{_Decimal64}, and
@code{_Decimal128}, with corresponding literal constant suffixes of
@code{df}, @code{dd}, and @code{dl}, respectively. For example, a
32-bit decimal floating-point variable could be defined as:
@example
_Decimal32 foo = 42.123df;
@end example
We stated earlier (@pxref{decimal-significand}) that the significand
in decimal floating-point arithmetic is an integer, rather than
fractional, value. Decimal instructions do not normally alter the
exponent by normalizing nonzero significands to remove trailing zeros.
That design feature is intentional: it allows emulation of the
fixed-point arithmetic that has commonly been used for financial
computations.
One consequence of the lack of normalization is that there are
multiple representations of any number that does not use all of the
significand digits. Thus, in the 32-bit format, the values
@code{1.DF}, @code{1.0DF}, @code{1.00DF}, @dots{}, @code{1.000_000DF},
all have different bit patterns in storage, even though they compare
equal. Thus, programmers need to be careful about trailing zero
digits, because they appear in the results, and affect scaling. For
example, @code{1.0DF * 1.0DF} evaluates to @code{1.00DF}, which is
stored as @code{100 * 10**(-2)}.
In general, when you look at a decimal expression with fractional
digits, you should mentally rewrite it in integer form with suitable
powers of ten. Thus, a multiplication like @code{1.23 * 4.56} really
means @code{123 * 10**(-2) * 456 * 10**(-2)}, which evaluates to
@code{56088 * 10**(-4)}, and would be output as @code{5.6088}.
Another consequence of the decimal significand choice is that
initializing decimal floating-point values to a pattern of
all-bits-zero does not produce the expected value @code{0.}: instead,
the results are the subnormal values @code{0.e-101}, @code{0.e-398},
and @code{0.e-6176} in the three storage formats.
GNU C currently supports basic storage and manipulation of decimal
floating-point values on some platforms, and support is expected to
grow in future releases.
@c ??? Suggest chopping the rest of this section, at least for the time
@c ??? being. Decimal floating-point support in GNU C is not yet complete,
@c ??? and functionality discussed appears to not be available on all
@c ??? platforms, and is not obviously documented for end users of GNU C. --TJR
The exponent in decimal arithmetic is called the @emph{quantum}, and
financial computations require that the quantum always be preserved.
If it is not, then rounding may have happened, and additional scaling
is required.
The function @code{samequantumd (x,y)} for 64-bit decimal arithmetic
returns @code{1} if the arguments have the same exponent, and @code{0}
otherwise.
The function @code{quantized (x,y)} returns a value of @var{x} that has
been adjusted to have the same quantum as @var{y}; that adjustment
could require rounding of the significand. For example,
@code{quantized (5.dd, 1.00dd)} returns the value @code{5.00dd}, which
is stored as @code{500 * 10**(-2)}. As another example, a sales-tax
computation might be carried out like this:
@example
decimal_long_double amount, rate, total;
amount = 0.70DL;
rate = 1.05DL;
total = quantizedl (amount * rate, 1.00DL);
@end example
@noindent
Without the call to @code{quantizedl}, the result would have been
@code{0.7350}, instead of the correctly rounded @code{0.74}. That
particular example was chosen because it illustrates yet another
difference between decimal and binary arithmetic: in the latter, the
factors each require an infinite number of bits, and their product,
when converted to decimal, looks like @code{0.734_999_999@dots{}}.
Thus, rounding the product in binary format to two decimal places
always gets @code{0.73}, which is the @emph{wrong} answer for tax
laws.
In financial computations in decimal floating-point arithmetic, the
@code{quantized} function family is expected to receive wide use
whenever multiplication or division changes the desired quantum of the
result.
The function call @code{normalized (x)} returns a value that is
numerically equal to @var{x}, but with trailing zeros trimmed.
Here are some examples of its operation:
@multitable @columnfractions .5 .5
@headitem Function Call @tab Result
@item normalized (+0.00100DD) @tab +0.001DD
@item normalized (+1.00DD) @tab +1.DD
@item normalized (+1.E2DD) @tab +1E+2DD
@item normalized (+100.DD) @tab +1E+2DD
@item normalized (+100.00DD) @tab +1E+2DD
@item normalized (+NaN(0x1234)) @tab +NaN(0x1234)
@item normalized (-NaN(0x1234)) @tab -NaN(0x1234)
@item normalized (+Infinity) @tab +Infinity
@item normalized (-Infinity) @tab -Infinity
@end multitable
@noindent
The NaN examples show that payloads are preserved.
Because the @code{printf} and @code{scanf} families were designed long
before IEEE 754 decimal arithmetic, their format items cannot
distinguish between numbers that have identical values but different
quanta; yet that distinction is likely to be needed in output.
The solution adopted by one early library for decimal arithmetic is to
provide a family of number-to-string conversion functions that
preserve quantization. Here is a code fragment and its output that
shows how they work.
@example
decimal_float x;
x = 123.000DF;
printf ("%%He: x = %He\n", x);
printf ("%%Hf: x = %Hf\n", x);
printf ("%%Hg: x = %Hg\n", x);
printf ("ntosdf (x) = %s\n", ntosdf (x));

%He: x = 1.230000e+02
%Hf: x = 123.000000
%Hg: x = 123
ntosdf (x) = +123.000
@end example
@noindent
The format modifier letter @code{H} indicates a 32-bit decimal value,
and the modifiers @code{DD} and @code{DL} correspond to the two other
formats.
@c =====================================================================
@end ignore
@node Round-Trip Base Conversion
@section Round-Trip Base Conversion
@cindex round-trip base conversion
@cindex base conversion (floating point)
@cindex floating-point round-trip base conversion
Most numeric programs involve converting between base-2 floating-point
numbers, as represented by the computer, and base-10 floating-point
numbers, as entered and handled by the programmer. What might not be
obvious is how many base-2 bits correspond to a given number of
base-10 digits, and vice versa. The following tables show the number
of decimal digits needed to recover a value stored in a given number
of bits, and conversely:
@multitable @columnfractions .5 .1 .1 .1 .1 .1
@item binary in @tab 24 @tab 53 @tab 64 @tab 113 @tab 237
@item decimal out @tab 9 @tab 17 @tab 21 @tab 36 @tab 73
@end multitable
@multitable @columnfractions .5 .1 .1 .1 .1
@item decimal in @tab 7 @tab 16 @tab 34 @tab 70
@item binary out @tab 25 @tab 55 @tab 114 @tab 234
@end multitable
We can compute the table numbers with these two functions:
@example
#include <math.h>  /* @r{for @code{ceil}, @code{log2}, and @code{log10}} */

int
matula (int nbits)
@{ /* @r{Return output decimal digits needed for nbits-bits input.} */
  return ((int) ceil ((double) nbits / log2 (10.0) + 1.0));
@}

int
goldberg (int ndec)
@{ /* @r{Return output bits needed for ndec-digits input.} */
  return ((int) ceil ((double) ndec / log10 (2.0) + 1.0));
@}
@end example
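
Wrapped in a small test program (repeating those functions so the
fragment stands alone), they reproduce the table entries; here, those
for the 24-bit and 53-bit binary formats and the 7-digit and 16-digit
decimal formats:

@example
#include <math.h>
#include <stdio.h>

int
matula (int nbits)
@{
  return ((int) ceil ((double) nbits / log2 (10.0) + 1.0));
@}

int
goldberg (int ndec)
@{
  return ((int) ceil ((double) ndec / log10 (2.0) + 1.0));
@}

int
main (void)
@{
  printf ("%d %d\n", matula (24), matula (53));     /* @r{prints 9 17} */
  printf ("%d %d\n", goldberg (7), goldberg (16));  /* @r{prints 25 55} */
  return 0;
@}
@end example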
One significant observation from those numbers is that we cannot
achieve correct round-trip conversion between the decimal and
binary formats in the same storage size! For example, we need 25
bits to represent a 7-digit value from the 32-bit decimal format,
but the 32-bit binary format has only 24 significand bits
available. Similar
observations hold for each of the other conversion pairs.
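
These limits are visible in the standard @code{<float.h>} header.
Since C11, @code{FLT_DECIMAL_DIG} and @code{DBL_DECIMAL_DIG} give the
decimal digits needed to round-trip each binary format, matching the
first table, while the older @code{FLT_DIG} and @code{DBL_DIG} give
the decimal digits guaranteed to survive a decimal-binary-decimal
round trip:

@example
#include <float.h>
#include <stdio.h>

int
main (void)
@{
  /* @r{Binary-to-decimal round trip (first table): 9 and 17 for IEEE.} */
  printf ("%d %d\n", FLT_DECIMAL_DIG, DBL_DECIMAL_DIG);
  /* @r{Decimal digits that survive a trip through binary: 6 and 15.} */
  printf ("%d %d\n", FLT_DIG, DBL_DIG);
  return 0;
@}
@end example

@noindent
On IEEE 754 systems the first line prints @code{9 17}; a 7-digit
decimal value therefore cannot be trusted to survive a trip through
the 24-bit @code{float} significand.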
The general input/output base-conversion problem is astonishingly
complicated, and solutions were not generally known until the
publication of two papers in 1990, listed near the end
of this chapter. For the 128-bit formats, the worst case needs more
than 11,500 decimal digits of precision to guarantee correct rounding
in a binary-to-decimal conversion!
For further details see the references for Bennett Goldberg and David
Matula.
@c =====================================================================
@node Further Reading
@section Further Reading
The subject of floating-point arithmetic is much more complex
than many programmers seem to think, and few books on programming
languages spend much time in that area. In this chapter, we have
tried to expose the reader to some of the key ideas, and to warn
of easily overlooked pitfalls that can soon lead to nonsensical
results. There are a few good references that we recommend
for further reading, and for finding other important material
about computer arithmetic.
We include URLs for these references when we were given them, when
they are morally legitimate to recommend; we have omitted the URLs
that are paywalled or that require running nonfree JavaScript code in
order to function. We hope you can find morally legitimate sites
where you can access these works.
@c =====================================================================
@c Each bibliography item has a sort key, so the bibliography can be
@c sorted in emacs with M-x sort-paragraphs on the region with the items.
@c =====================================================================
@itemize @bullet
@c Paywalled.
@item @c sort-key: Abbott
Paul H. Abbott and 15 others, @cite{Architecture and software support
in IBM S/390 Parallel Enterprise Servers for IEEE Floating-Point
arithmetic}, IBM Journal of Research and Development @b{43}(5/6)
723--760 (1999). This article gives a good description of IBM's
algorithm for exact decimal-to-binary conversion, complementing
earlier ones by Clinger and others.
@c Paywalled.
@item @c sort-key: Beebe
Nelson H. F. Beebe, @cite{The Mathematical-Function Computation Handbook:
Programming Using the MathCW Portable Software Library},
Springer (2017).
This book describes portable implementations of a large superset
of the mathematical functions available in many programming
languages, extended to a future 256-bit format (70 decimal
digits), for both binary and decimal floating point. It includes
a substantial portion of the functions described in the famous
@cite{NIST Handbook of Mathematical Functions}, Cambridge (2018),
ISBN 0-521-19225-0.
See
@uref{https://www.math.utah.edu/pub/mathcw/}
for compilers and libraries.
@c URL ok
@item @c sort-key: Clinger-1990
William D. Clinger, @cite{How to Read Floating Point Numbers
Accurately}, ACM SIGPLAN Notices @b{25}(6) 92--101 (June 1990),
@uref{https://doi.org/10.1145/93548.93557}.
See also the papers by Steele & White.
@c Paywalled.
@c @item @c sort-key: Clinger-2004
@c William D. Clinger, @cite{Retrospective: How to read floating
@c point numbers accurately}, ACM SIGPLAN Notices @b{39}(4) 360--371 (April 2004),
@c Reprint of 1990 paper, with additional commentary.
@c URL ok
@item @c sort-key: Goldberg-1967
I. Bennett Goldberg, @cite{27 Bits Are Not Enough For 8-Digit Accuracy},
Communications of the ACM @b{10}(2) 105--106 (February 1967),
@uref{https://doi.org/10.1145/363067.363112}. This paper,
and its companions by David Matula, address the base-conversion
problem, and show that the naive formulas are wrong by one or
two digits.
@c URL ok
@item @c sort-key: Goldberg-1991
David Goldberg, @cite{What Every Computer Scientist Should Know
About Floating-Point Arithmetic}, ACM Computing Surveys @b{23}(1)
5--58 (March 1991), corrigendum @b{23}(3) 413 (September 1991),
@uref{https://doi.org/10.1145/103162.103163}.
This paper has been widely distributed, and reissued in vendor
programming-language documentation. It is well worth reading,
and then rereading from time to time.
@c ??? site does not respond, 20 Sep 2022
@item @c sort-key: Juffa
Norbert Juffa and Nelson H. F. Beebe, @cite{A Bibliography of
Publications on Floating-Point Arithmetic},
@uref{https://www.math.utah.edu/pub/tex/bib/fparith.bib}.
This is the largest known bibliography of publications about
floating-point, and also integer, arithmetic. It is actively
maintained, and as of mid-2019 contains more than 6400 references to
original research papers, reports, theses, books, and Web sites on the
subject matter. It can be used to locate the latest research in the
field, and the historical coverage dates back to a 1726 paper on
signed-digit arithmetic, and an 1837 paper by Charles Babbage, the
intellectual father of mechanical computers. The entries for the
Abbott, Clinger, and Steele & White papers cited earlier contain
pointers to several other important related papers on the
base-conversion problem.
@c URL ok
@item @c sort-key: Kahan
William Kahan, @cite{Branch Cuts for Complex Elementary Functions, or
Much Ado About Nothing's Sign Bit}, (1987),
@uref{https://people.freebsd.org/~das/kahan86branch.pdf}.
This Web document about the fine points of complex arithmetic
also appears in the volume edited by A. Iserles and
M. J. D. Powell, @cite{The State of the Art in Numerical
Analysis: Proceedings of the Joint IMA/SIAM Conference on the
State of the Art in Numerical Analysis held at the University of
Birmingham, 14--18 April 1986}, Oxford University Press (1987),
ISBN 0-19-853614-3 (xiv + 719 pages). Its author is the famous
chief architect of the IEEE 754 arithmetic system, and one of the
world's greatest experts in the field of floating-point
arithmetic. An entire generation of his students at the
University of California, Berkeley, have gone on to careers in
academia and industry, spreading the knowledge of how to do
floating-point arithmetic right.
@c paywalled
@item @c sort-key: Knuth
Donald E. Knuth, @cite{A Simple Program Whose Proof Isn't},
in @cite{Beauty is our business: a birthday salute to Edsger
W. Dijkstra}, W. H. J. Feijen, A. J. M. van Gasteren,
D. Gries, and J. Misra (eds.), Springer (1990), ISBN
1-4612-8792-8. This book chapter supplies a correctness proof
of the decimal to binary, and binary to decimal, conversions in
fixed-point arithmetic in the @TeX{} typesetting system. The
proof evaded its author for a dozen years.
@c URL ok
@item @c sort-key: Matula-1968a
David W. Matula, @cite{In-and-out conversions},
Communications of the ACM @b{11}(1) 47--50 (January 1968),
@uref{https://doi.org/10.1145/362851.362887}.
@item @c sort-key: Matula-1968b
David W. Matula, @cite{The Base Conversion Theorem},
Proceedings of the American Mathematical Society @b{19}(3)
716--723 (June 1968). See also other papers here by this author,
and by I. Bennett Goldberg.
@c page is blank without running JS.
@item @c sort-key: Matula-1970
David W. Matula, @cite{A Formalization of Floating-Point Numeric
Base Conversion}, IEEE Transactions on Computers @b{C-19}(8)
681--692 (August 1970).
@c page is blank without running JS.
@item @c sort-key: Muller-2010
Jean-Michel Muller and eight others, @cite{Handbook of
Floating-Point Arithmetic}, Birkh@"auser-Boston (2010), ISBN
0-8176-4704-X (xxiii + 572 pages). This is a
comprehensive treatise from a French team who are among the
world's greatest experts in floating-point arithmetic, and among
the most prolific writers of research papers in that field. They
have much to teach, and their book deserves a place on the
shelves of every serious numerical programmer.
@c paywalled
@item @c sort-key: Muller-2018
Jean-Michel Muller and eight others, @cite{Handbook of
Floating-Point Arithmetic}, Second edition, Birkh@"auser-Boston (2018), ISBN
3-319-76525-6 (xxv + 627 pages). This is a new
edition of the preceding entry.
@c URL given is obsolete
@item @c sort-key: Overton
Michael Overton, @cite{Numerical Computing with IEEE Floating
Point Arithmetic, Including One Theorem, One Rule of Thumb, and
One Hundred and One Exercises}, SIAM (2001), ISBN 0-89871-482-6
(xiv + 104 pages).
This is a small volume that can be covered in a few hours.
@c URL ok
@item @c sort-key: Steele-1990
Guy L. Steele Jr. and Jon L. White, @cite{How to Print
Floating-Point Numbers Accurately}, ACM SIGPLAN Notices
@b{25}(6) 112--126 (June 1990),
@uref{https://doi.org/10.1145/93548.93559}.
See also the papers by Clinger.
@c paywalled
@item @c sort-key: Steele-2004
Guy L. Steele Jr. and Jon L. White, @cite{Retrospective: How to
Print Floating-Point Numbers Accurately}, ACM SIGPLAN Notices
@b{39}(4) 372--389 (April 2004). Reprint of the 1990
paper, with additional commentary.
@item @c sort-key: Sterbenz
Pat H. Sterbenz, @cite{Floating Point Computation}, Prentice-Hall
(1974), ISBN 0-13-322495-3 (xiv + 316 pages). This often-cited book
provides solid coverage of what floating-point arithmetic was like
@emph{before} the introduction of IEEE 754 arithmetic.
@end itemize