1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305
|
\chapter{Floating point arithmetic}
\label{floatingpoint}
Starting with version 5.0, \FORM{} is equiped with arbitrary precision floating point
arithmetic. The low level routines are handled by the GMP and MPFR libraries,
which are available on most systems and if missing can be easily picked up
from the internet. This chapter describes the commands, functions, and behaviour
of \FORM's floating point sytem.
\section{Initializing and closing the floating point system}
Before any floating-point operations can be performed, \FORM{} must activate the
floating point system and set the working precision. This initialization allocates
the internal data structures used by the GMP and MPFR libraries. The system remains
active until the end of the program, or until it is explicitly closed.
The two statements that control these operations are:
\begin{description}
\item[\#StartFloat] This instruction initializes the floating
point system and allocates the necessary internal arrays.
It takes either one or two arguments:
\begin{verbatim}
#StartFloat <precision> [,MZV=<maximumweight>]
\end{verbatim}
The first argument is mandatory and specifies the desired precision. It must
be a positive integer followed by either a \texttt{b} (for precision in bits)
or \texttt{d} (for precision in decimal digits).
\FORM{} will round to at least this precision.
The second argument is optional and only needed when working with multiple
zeta values (MZVs) or Euler sums. It specifies the maximum weight
that will be used. The evaluation of the sums requires a
number of auxiliary arrays that depend on this weight. The default weight is zero.
\item[\#EndFloat] This instruction releases all arrays allocated for the
floating point system. Note that if one would like to change the precision during a run,
this is now possible with a new \texttt{\#StartFloat} instruction.
\end{description}
Example programs that illustrate the use of these statements and the
functionality of \FORM's floating point system are given below.
\section{Conversion between rational and floating point coefficients}
A term in an expression can have a rational or floating point coefficient.
The following statements convert between the two.
\begin{description}
\item[ToFloat] Converts rational coefficients to
floating point numbers in the precision specified by \texttt{\#StartFloat}.
From this point on, the coefficient will be floating point.
\item[ToRational] Attempts to convert floating point coefficients to
rational numbers. To this end it uses continued fractions as in
\begin{eqnarray}
x \;\rightarrow\; n_0 + \frac{1}{\,n_1 + \frac{1}{\,n_2 + \frac{1}{\,n_3 + \cdots}}}\;,
\nonumber
\end{eqnarray}
with $x$ a floating point number. The algorithm keeps track of the
remaining precision and if $1/n_i$ is close to this precision it truncates
the sequence at $n_{i-1}$. After that it works out the corresponding fraction.
It could be that $x$ cannot be expressed as a fraction within the given precision.
This can usually be seen by that the fractions are `rather wild', or that
the result changes when the precision is increased. This statement can also
be abbreviated as \texttt{ToRat}.
\end{description}
The above statements operate on ground level coefficient only. To convert numbers
inside a function argument, one must use the \texttt{Argument} environment.
For example:
\begin{verbatim}
CFunction f;
#StartFloat 10d
Local F = 0.1666666666*f(0.1428571429);
ToRat;
Print "<1> %t";
Argument f;
ToRat;
EndArgument;
Print "<2> %t";
.end
<1> + 1/6*f(1.428571429e-01)
<2> + 1/6*f(1/7)
\end{verbatim}
The argument environment may be nested.
Similarly, the statements \texttt{Evaluate}, \texttt{StrictRounding} and \texttt{Chop} act at
the ground level. To have them act on function argument, one uses the \texttt{Argument} environment.
These statements are explained further below.
\section{Evaluation of functions and symbols}
Before version 5.0, \FORM{} already reserved function names for many common mathematical
functions. These functions can now be evaluated numerically using:
\begin{description}
\item[Evaluate] This statement evaluates the mathematical functions and or symbols numerically:
\begin{verbatim}
Evaluate [function(s)],[symbol(s)];
\end{verbatim}
where the argument specifies the function(s) and/or symbol(s) to evaluate.
More than one function and/or symbol may be listed.
If this statement is used without arguments, all floating point functions and symbols that \FORM{}
knows will be evaluated. Currently, the full list of functions that can be evaluated numerically reads
\begin{verbatim}
sqrt_, ln_, eexp_, li2_, gamma_, agm_,
sin_, cos_, tan_, asin_, acos_, atan_, atan2_,
sinh_, cosh_, tanh_, asinh_, acosh_, atanh_,
mzv_, euler_, mzvhalf_,
\end{verbatim}
where the functions on the last line denote the multiple zeta values, Euler sums and
harmonic polylogarithms of argument $1/2$ respectively.
The list of symbols/constants that can be evaluated is
\begin{verbatim}
pi_, ee_, em_,
\end{verbatim}
where \texttt{ee\_}\index{ee\_} denotes the basis of the natural logarithm
and \texttt{em\_}\index{em\_} the Euler-Mascheroni constant.
In addition, the functions \texttt{lin\_}, \texttt{hpl\_} and \texttt{mpl\_} are reserved function names,
but currently have no numerical evaluation.
\end{description}
\section{Rounding behaviour}
\begin{description}
\item[StrictRounding] This statement rounds floating point numbers to a
given precision:
\begin{verbatim}
StrictRounding [<precision>];
\end{verbatim}
where \texttt{<precision>} is an optional argument that specifies the rounding
precision in either digits or bits, using the same syntax as
\texttt{\#startfloat}. If omitted, the default precision is used.
Internally, the GMP and MPFR libraries may use extra precision beyond that set by
\texttt{\#startfloat}. As a result, terms that print the same may still differ slightly
due to this extra precision and therefore fail to merge. For example:
\begin{verbatim}
#startfloat 6d
CFunction f;
Local F = f(1.0)+f(1.0000001);
Print;
.sort
\end{verbatim}
results in \texttt{F = f(1.0e+00)+f(1.0e+00);}. Although it may appear
that the terms should merge, the extra precision maintained by GMP
prevents this, even though it is not displayed at 6 digits of
precision. Using the strictrounding statement, one can force rounding
to exactly 6 digits. Indeed:
\begin{verbatim}
Argument f;
strictrounding;
Argument;
Print;
.end
\end{verbatim}
results in \texttt{F = 2*f(1.0e+00);}.
Notice that rounding in bits may produce unexpected results when viewed
in decimal digits. For example, the decimal number 1.1e-4 cannot be
represented exactly in binary. Its binary representation, up to 20 bits of precision, is
$1.1100110101011111101*2^{-14}$. When rounded to 5 bits, this becomes
$1.1101*2^{-14}$, which in decimal digits appears as
1.10626220703125e-04.
\item[Chop] This statement removes floating point numbers that are {\em smaller}
in absolute magnitude than a specified threshold. It takes one argument:
\begin{verbatim}
Chop <delta>;
\end{verbatim}
All floating point numbers with absolute value {\em less} than \texttt{<delta>} are replaced by 0.
Terms with no floating point coefficient are left untouched. The threshold \texttt{<delta>}
can be a floating point number, integer, rational number, or power. Because
statements in \FORM{} act term by term, it is often important to sort before invoking the
chop statement. Otherwise, terms might be removed individually, while after
sorting and combining, their combined floating point coefficient could exceed
the specified chop threshold.
\item[Format floatprecision] This instruction controls how many digits are
displayed when printing floating point numbers. It only affects output
formatting and does not influence the internal precision or accuracy of
computations. It can be used in three ways: in case of
\begin{verbatim}
Format floatprecision;
\end{verbatim}
\FORM{} prints floats with the number of digits specified by the current
\texttt{\#startfloat} instruction. With
\begin{verbatim}
Format floatprecision <precision>;
\end{verbatim}
\FORM{} prints the number of digits specified by \texttt{<precision>}.
The syntax is the same as for the precision in \texttt{\#startfloat}.
If the requested precision exceeds the precision specified by
\texttt{\#startfloat}, only the available digits are printed. Finally, with
\begin{verbatim}
Format floatprecision off;
\end{verbatim}
the floating point numbers are printed in raw internal format, see also section \ref{sec:float_raw}.
\end{description}
\section{Examples}
The following example shows some work with Multiple Zeta Values (MZV's):
\begin{verbatim}
#StartFloat 500b, MZV=15
L F1 =
-mzv_(8,1,1,5)
+29056868/39414375*mzv_(2)^6*mzv_(3)
-47576/40425*mzv_(2)^5*mzv_(5)
-163291/18375*mzv_(2)^4*mzv_(7)
-4/105*mzv_(2)^3*mzv_(3)^3
-450797/11025*mzv_(2)^3*mzv_(9)
+7/5*mzv_(2)^2*mzv_(3)^2*mzv_(5)
+16/25*mzv_(2)^2*mzv_(3)*mzv_(5,3)
+454049/1400*mzv_(2)^2*mzv_(11)
-16/25*mzv_(2)^2*mzv_(5,3,3)
+3*mzv_(2)*mzv_(3)^2*mzv_(7)
+61/14*mzv_(2)*mzv_(3)*mzv_(5)^2
+2/7*mzv_(2)*mzv_(3)*mzv_(7,3)
+2172853/420*mzv_(2)*mzv_(13)
-2/7*mzv_(2)*mzv_(7,3,3)
+1/7*mzv_(2)*mzv_(5,5,3)
-33/4*mzv_(3)^2*mzv_(9)
-133/6*mzv_(3)*mzv_(5)*mzv_(7)
-25/9*mzv_(3)*mzv_(9,3)
-244/105*mzv_(5)^3
-359/105*mzv_(5)*mzv_(7,3)
+3/10*mzv_(7)*mzv_(5,3)
+89/18*mzv_(9,3,3)
+569/105*mzv_(7,3,5);
L F2 = mzv_(15);
Evaluate mzv_;
Print +f;
.sort
F1 =
9.1234206877960755900164875575406726239325002222490534540605137258846994\
916348297032751308227224952419629422497720599224543719959652966613231560\
6913926e+03;
F2 =
1.0000305882363070204935517285106450625876279487068581775065699328933322\
671563422795730723343470175484943669684442492832530297757588781904321794\
40477e+00;
Skip F1,F2;
L X = F1/F2;
Torational;
Print +f;
.end
X =
229903169/25200;
0.08 sec out of 0.09 sec
\end{verbatim}
In the first module, \texttt{\#startfloat} initializes the floating point system with
500 bits of precision and a maximum weight for the MZVs and Euler sums of 15.
The \texttt{mzv\_} functions are then evaluated with the \texttt{Evaluate}
statement. In the second module we divide the
numbers and convert the result to a rational. It is a good idea to try this
with various precisions to see whether this is stable. With 60 bits the
final answer would be
\begin{verbatim}
145537942402475031/15952490572234;
\end{verbatim}
while at 150 bits we have already the same answer as with 500 bits. The
fraction that is obtained by this program can be proven to be correct.
\section{Raw form}
\label{sec:float_raw}
Internally, floating point numbers are represented by the function \texttt{float\_},
i.e. \texttt{float\_(prec, size, exp, limbs)}. The integer arguments encode the
internal representation of the floating point number as in the GMP library:
\begin{description}
\item[prec] The precision of the mantissa in limbs.
\item[size] The number of limbs currently in use.
\item[exp] The exponent, determining the location of the implied radix point.
\item[limbs] The limbs packed as the numerator of a \FORM{} rational.
\end{description}
In a normalized term containing \texttt{float\_}, the rational coefficient must
be either $1/1$ or $-1/1$, where the sign of the term is absorbed into the rational
coefficient.
Furthermore, the \texttt{float\_} is protected from the pattern matcher and from
statements that act on functions -- such as \texttt{Transform}, \texttt{Argument},
\texttt{Normalize} etc.
The following program illustrates this:
%
\begin{verbatim}
CFunction f;
#StartFloat 10d
Local F = 1.23456789 + f(1,2);
Identify f?(?a) = f(10);
Print "<1> %t";
.sort
<1> + 1.23456789e+00
<1> + f(10)
#EndFloat
Normalize;
Print "<2> %t";
.sort
<2> + float_(2,3,1,420101683733788795657820481376616399786)
<2> + 10*f(1)
#StartFloat 5d
Print "<3> %t";
.end
<3> + 1.2346e+00
<3> + 10*f(1)
\end{verbatim}
%
As shown, the \texttt{id}-statement does not effect the \texttt{float\_} function.
Here we also see the use of the preprocessor statement \texttt{\#EndFloat} which closes
the floating point system. After this statement, the \texttt{float\_} function becomes a
regular function. Its protected status, however, persists so that \texttt{id}-statements
or statements like \texttt{Normalize} still do not modify it.
|