1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285
|
LIBFFM Ver. 0.28 29.11.1998
libffm - Free, pretty fast replacement for some math (libm) routines
on Linux/AXP, optimized for the 21164
By
Joachim Wesner and Kazushige Goto
<joachim.wesner@frankfurt.netsurf.de> <goto@statabo.rim.or.jp>
Algorithms by Joachim Wesner, code optimization, instruction
rescheduling and "vectorization" by Kazushige Goto.
*** NOTE ***
This is a preliminary version and does not yet contain replacements
for all libm routines.
Also, there are no special precautions taken regarding "precise"
trap handling. Exceptions inside these routine can show up as traps
several instructions later within the so-called "trap shadow", probably
even somewhere in user code. This is not a special property of these
routines, but of the Alpha hardware architecture and also applies to
the ported cray routines or the original libm, when not compiled using
-mieee. This will likely *NOT* be "fixed", as such trap handling will
invariably make the code much more complex, larger and slower.
HOWEVER, most of the routines (see below) now include full illegal
argument handling. A trap inside those routines should no longer
occur for *ANY* input value. Instead, depending on the respective
function, the (illegal) argument itself or the bit combinations
for NaN or +/- Inf will be returned.
More of the other missing libm functions will be added soon.
*** LICENSE ***
Copyright (C) 1998 Joachim Wesner <joachim.wesner@frankfurt.netsurf.de>
and Kazushige Goto <goto@statabo.rim.or.jp>
This library is free software; you can redistribute it and/or
modify it under the terms of the GNU Library General Public
License as published by the Free Software Foundation; either
version 2 of the License, or (at your option) any later version.
This library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Library General Public License for more details.
You should have received a copy of the GNU Library General Public
License along with this library (see file COPYING.LIB); if not, write
to the Free Software Foundation, Inc., 59 Temple Place - Suite 330,
Boston, MA 02111-1037, USA.
*** For the impatient, HOW TO MAKE AND USE IT ***
To compile the routines and generate the libraries libffm.a and libffm_p.a
simply do the following:
In the subdirectory libffm/ do a 'make'. This compiles/assembles all files
there and generates the (for the moment only) static libraries libffm*.a.
This also compiles the pow helper function pow.o. See below. You can now avoid
the use of the helper function by doing a 'make COPTS=POW' (or by commenting
out a line in the makefile, see there). This way, a library is generated that
always replaces the pow() function by the somewhat less accurate powr()
function.
To have the faster routines replace the routines in the standard libm:
1) Simply copy libffm.a to the working directory of your project and specify
libffm.a as additional file before specifying the math library -lm. I.e. :
gcc myapp.c suba.c subb.c subc.c -o myapp libffm.a .... -lm
Or, even better:
2) Copy libffm.a (as root) to /usr/lib. Now specify libffm.a as an additional
library using -lffm *before* -lm ! I.e.
gcc myapp.c suba.c subb.c subc.c -o myapp .... -lffm -lm
(When using profiling, replace all references to libffm* resp. -lffm above by
references to libffm* resp. -lffm_p)
Include pow.o, when you want to replace calls to pow with calls to powr
without source changes. (Might be especially helpful with FORTRAN, where you
can't redefine intrinsic x**y). See further comments in pow.c. This will
be no longer necessary in a later version of libffm, that will have a fast
full precision pow function replacement.
Change your makefiles accordingly !!
*** CHANGES FROM LIBFFM VERSION 0.21 ***
Functions sinh, cosh, tanh and a fast 'inverse' square root sqrti(x) =
1./sqrt(x) were added. Very fast (up to 4.5 x) vectorized sqrt and sqrti
functions by K. Goto were added. Nearly all functions now have full handling
of illegal (NaN, Inf), denormalized or simply out of domain arguments.
(The only exceptions being - as of 0.28 - the functions pow(r) and atan2, but
this will be fixed soon in version 0.30) This could be implemented with
no or only a very small (< 5 clocks) increase in execution time.
The file 'mk' was replaced by a regular 'Makefile'. You can now optionally
generate a library that permanently uses the somewhat less accurate
function powr() in all pow() calls. (For reason of backward compatibility,
this is not yet the default behaviour).
*** CHANGES FROM LIBFFM VERSION 0.10 ***
More functions where added. Licensing terms have changed. A new fast sqrt
was added and inlined into asin/acos, so those functions speeded up
accordingly. Support for ECOFF and profiling was added. A simple pow helper
function was added to allow using faster powr instead pow without changing
sourcecode (Might be especially valuable fro FORTRAN). A header file
"libffm.h" was added. Timing estimates should now be more accurate (assuming
all data are in cache). Precision is now specified more thoroughly.
*** CAN IT BE USED WITH LIBFM ? ***
Yes, but as libffm.a is becoming more complete, you might be only interested
in the somewhat faster 32-bit accuracy routines and the function cossin() in
libfm. Nevertheless, you can savely use libffm it with libfm and libm. To make
sure that the "right" combination of routines is selected from the libraries
use the following order or -l arguments when linking:
gcc myapp.c suba.c subb.c subc.c -o myapp .... -lffm -lfm -lm
*** WHAT IT DOES ***
libffm V0.28 contains the following functions:
double sin(double x);
double cos(double x);
double tan(double x);
double cotan(double x);
double asin(double x);
double acos(double x);
double atan(double x);
double atan2(double y, double x);
double log2(double x); /* Computes ld(x) */
double log(double x);
double log10(double x);
double exp2(double x); /* Computes 2^x */
double exp(double x);
double exp10(double x); /* Computes 10^x */
double sinh(double x);
double cosh(double x);
double tanh(double x);
double powr(double x, double y); /* Fast pow(x, y) */
double sqrt(double x);
double sqrti(double x); /* 1./sqrt(x) */
void dsqrtv(double *x, double *y, int size);
/* Computes y[i] = sqrt(x[i]), i=0..size-1 */
void dsqrtiv(double *x, double *y, int size);
/* Computes y[i] = 1./sqrt(x[i]), i=0..size-1 */
libffm.h defines the functions not included in math.h .
*** SPEED ?! ***
Some execution times increased *very* slightly due to the argument checking,
but differences from version 0.21 are generally less than 5 clocks. Some
routines could even be speeded up slightly due to better optimization/
scheduling.
This time a different timing routine was used, which tries to avoid any
cache trashing and more exactly estimates the influence of the random number
routine, so somewhat different results from the ones published in version 0.10
appear below. asin/acos are faster now because of inlining our new fast sqrt.
Cray still "wins" slightly on the log and is "par" on the powr routine,
because our approximation requires a divide operation, however it needs much
less table space than Crays routine. Kazushige Goto tried to optimize all
routines with respect to first call, i.e. when required constants or code are
not yet in a high speed cache. However, I will also try to develop an
alternative log version that does not need division and we will compare it's
performance on "real" problems.
Approximate running times in us in a tight loop for random arguments 0..10
(0..1 for asin/acos) on a 533MHz LX 21164 (n = 63).
libffm libfm libm
sin 0.14 0.21 0.43
cos 0.15 0.21 0.43
tan 0.15 0.27 0.54
cotan 0.15 ---- ----
asin 0.22 ---- 1.33
acos 0.22 ---- 1.27
atan 0.19 ---- 0.58
atan2 0.20 ---- 0.72
log2 0.17 ---- ----
log 0.17 0.16 ! 0.44
log10 0.17 0.22 0.53
exp2 0.11 ---- ----
exp 0.15 0.17 0.50
exp10 0.15 ---- ----
sinh 0.23 ---- 0.71
cosh 0.23 ---- 0.67
tanh 0.26 ---- 0.67
powr(x,n) 0.10 0.35 1.67 (pow)
powr(x,y) 0.36 0.35 ! 1.67 (pow)
sqrt 0.13 0.13 0.19
sqrti 0.13 ---- ----
dsqrtv 0.030*n + 0.14 (n = vector length)
dsqrtiv 0.028*n + 0.12
I would be interested to hear how these routines run on a 2106x, compared to
the original libm or libfm !! (However, Kazushige Goto is already planning to
release a version especially optimized for 2106x)
*** ACCURACY ***
Ideal function routines should have a *relative* error of less than 0.5 ulp
(ulp = units of the last place), this means that the result (mantissa) is
always a perfectly rounded, 52 bit version of the infinite precision,
theoretical result.
The "old" libm library achieves this goal many times (at the expense of much
increased running time) and tends to typically show relative errors from < 0.5
to approx. 1 ulp.
The above routines achieve about the same accuracy as the "Cray" routines, and
typically (with the exception of powr) show rel. errors in the range from
< 0.7 to 1.5 ulp and very few times slightly larger than 2 ulp. RMS rel. error
for "random" arguments is typically ~ 0.5 ulp.
The new fast sqrt is a stripped down, rescheduled version of the libm one,
that does no final rounding of the last bit. Error is guaranteed to be within
-1 ulp < err < +1.0625 ulp (according to libdm documentation). This routine
is now as fast as Crays sqrt, but much more accurate.
It is believed (hoped ?) that the relative error *never* exceeds 3 ulp, execpt
probably VERY NEAR to the zeroes of the trig functions outside the main period
and except for the powr function. I would like to hear of any other occurences
and such "behaviour" will be regarded as a bug and will (hopefully) be fixed
in future releases.
The larger errors at the zeroes of the trig functions are *NOT* caused
by imprecise range reduction per se. In fact, a 92 significant bit version of
PI is used in libffm when reducing large arguments, which should avoid any
noticeable increase of overall trig error even for very large arguments.
However, also due to the irrational nature of PI, when reducing a large
argument (that was exactly representable as double initially), the range
reduced argument *cannot* be represented exactly as double and bits beyound
the 52th of the mantissa will be lost, causing the above mentioned increase of
error in the immediate vicinity of the trig function zeroes. lib(d)m uses
interpolation for the chopped bits, which we did not include, because it would
have increased running times. One could probably redefine rel. error here in
the following way: The result y=f(x) of a libffm function is *always*
within 3 ulp of f(x') for an argument x', x-0.5 ulp < x' < x+0.5 ulp.
The powr routine is "quick&dirty" as in the Cray powr function, i.e. NO
extended precision is used in calculating log(x)*y, so the relative error in
the *final result* will grow to ~ 0.4*|ld(x)|*|y| ulp.
However, for integral values of y, repeated multiplications will be used to
calculate the result. So, for "random" x, the rel. error seems to be in the
order of < 0.4*|y| ulp and for (not too large) integral values of x, the final
result should be EXACT.
In contrast to the Cray powr function, a different multiplication scheme is
used, where the number of multiplications only increases with log2(|y|).
A < 3 ulp pow function (necessarily SLOWER than powr) will be added
(hopefully) soon.
*** BUGS, MISSING, DOESN'T WORK ? ***
Please send any bug reports, comments, interesting timing results etc. to
joachim.wesner@frankfurt.netsurf.de
|