File: README

package info (click to toggle)
libffm 0.28-6
  • links: PTS
  • area: main
  • in suites: etch, etch-m68k
  • size: 216 kB
  • ctags: 184
  • sloc: asm: 3,028; makefile: 94; ansic: 12; sh: 2
file content (285 lines) | stat: -rw-r--r-- 11,915 bytes parent folder | download | duplicates (4)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
						    LIBFFM Ver. 0.28 29.11.1998

libffm	- Free, pretty fast replacement for some math (libm) routines 
		on Linux/AXP, optimized for the 21164

					By

	 	Joachim Wesner 		and 	Kazushige Goto
	<joachim.wesner@frankfurt.netsurf.de> <goto@statabo.rim.or.jp>

	Algorithms by Joachim Wesner, code optimization, instruction
		rescheduling and "vectorization" by Kazushige Goto.

*** NOTE ***

	This is a preliminary version and does not yet contain replacements 
        for all libm routines.

	Also, there are no special precautions taken regarding "precise" 
	trap handling. Exceptions inside these routine can show up as traps 
	several instructions later within the so-called "trap shadow", probably 
	even somewhere in user code. This is not a special property of these
	routines, but of the Alpha hardware architecture and also applies to 
	the ported cray routines or the original libm, when not compiled using 
	-mieee. This will likely *NOT* be "fixed", as such trap handling will 
	invariably make the code much more complex, larger and slower.

	HOWEVER, most of the routines (see below) now include full illegal
	argument handling. A trap inside those routines should no longer
	occur for *ANY* input value. Instead, depending on the respective
	function, the (illegal) argument itself or the bit combinations
	for NaN or +/- Inf will be returned.

	More of the other missing libm functions will be added soon.


*** LICENSE ***

   Copyright (C) 1998  Joachim Wesner <joachim.wesner@frankfurt.netsurf.de>
                  and  Kazushige Goto <goto@statabo.rim.or.jp>

   This library is free software; you can redistribute it and/or
   modify it under the terms of the GNU Library General Public
   License as published by the Free Software Foundation; either
   version 2 of the License, or (at your option) any later version.

   This library is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   Library General Public License for more details.

   You should have received a copy of the GNU Library General Public
   License along with this library (see file COPYING.LIB); if not, write 
   to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, 
   Boston, MA 02111-1037, USA.


*** For the impatient, HOW TO MAKE AND USE IT ***

To compile the routines and generate the libraries libffm.a and libffm_p.a 
simply do the following:

In the subdirectory libffm/ do a 'make'. This compiles/assembles all files 
there and generates the (for the moment only) static libraries libffm*.a.

This also compiles the pow helper function pow.o. See below. You can now avoid 
the use of the helper function by doing a 'make COPTS=POW' (or by commenting 
out a line in the makefile, see there). This way, a library is generated that 
always replaces the pow() function by the somewhat less accurate powr() 
function.

To have the faster routines replace the routines in the standard libm:

1) Simply copy libffm.a to the working directory of your project and specify
   libffm.a as additional file before specifying the math library -lm. I.e. :

   gcc myapp.c suba.c subb.c subc.c -o myapp libffm.a .... -lm

Or, even better:

2) Copy libffm.a (as root) to /usr/lib. Now specify libffm.a as an additional 
   library using -lffm *before* -lm ! I.e.

   gcc myapp.c suba.c subb.c subc.c -o myapp .... -lffm -lm

(When using profiling, replace all references to libffm* resp. -lffm above by
references to libffm* resp. -lffm_p)

Include pow.o, when you want to replace calls to pow with calls to powr
without source changes. (Might be especially helpful with FORTRAN, where you
can't redefine intrinsic x**y). See further comments in pow.c. This will
be no longer necessary in a later version of libffm, that will have a fast 
full precision pow function replacement.

Change your makefiles accordingly !!


*** CHANGES FROM LIBFFM VERSION 0.21 ***

Functions sinh, cosh, tanh and a fast 'inverse' square root sqrti(x) = 
1./sqrt(x) were added. Very fast (up to 4.5 x) vectorized sqrt and sqrti 
functions by K. Goto were added. Nearly all functions now have full handling
of illegal (NaN, Inf), denormalized or simply out of domain arguments.
(The only exceptions being - as of 0.28 - the functions pow(r) and atan2, but
this will be fixed soon in version 0.30) This could be implemented with
no or only a very small (< 5 clocks) increase in execution time.
The file 'mk' was replaced by a regular 'Makefile'. You can now optionally
generate a library that permanently uses the somewhat less accurate
function powr() in all pow() calls. (For reason of backward compatibility, 
this is not yet the default behaviour).


*** CHANGES FROM LIBFFM VERSION 0.10 ***

More functions where added. Licensing terms have changed. A new fast sqrt
was added and inlined into asin/acos, so those functions speeded up
accordingly. Support for ECOFF and profiling was added. A simple pow helper
function was added to allow using faster powr instead pow without changing
sourcecode (Might be especially valuable fro FORTRAN). A header file
"libffm.h" was added. Timing estimates should now be more accurate (assuming
all data are in cache). Precision is now specified more thoroughly.


*** CAN IT BE USED WITH LIBFM ? *** 

Yes, but as libffm.a is becoming more complete, you might be only interested
in the somewhat faster 32-bit accuracy routines and the function cossin() in
libfm. Nevertheless, you can savely use libffm it with libfm and libm. To make 
sure that the "right" combination of routines is selected from the libraries 
use the following order or -l arguments when linking:

   gcc myapp.c suba.c subb.c subc.c -o myapp .... -lffm -lfm -lm


*** WHAT IT DOES ***

libffm V0.28 contains the following functions:

double sin(double x);
double cos(double x);
double tan(double x);
double cotan(double x);
double asin(double x);
double acos(double x);
double atan(double x);
double atan2(double y, double x);
double log2(double x);			/* Computes ld(x) */
double log(double x);
double log10(double x);
double exp2(double x);			/* Computes 2^x */
double exp(double x);
double exp10(double x);			/* Computes 10^x */
double sinh(double x);
double cosh(double x);
double tanh(double x);
double powr(double x, double y);	/* Fast pow(x, y) */
double sqrt(double x);
double sqrti(double x);			/* 1./sqrt(x) */

void dsqrtv(double *x, double *y, int size);
			/* Computes y[i] = sqrt(x[i]), i=0..size-1 */
void dsqrtiv(double *x, double *y, int size);
			/* Computes y[i] = 1./sqrt(x[i]), i=0..size-1 */

libffm.h defines the functions not included in math.h .


*** SPEED ?! ***

Some execution times increased *very* slightly due to the argument checking,
but differences from version 0.21 are generally less than 5 clocks. Some 
routines could even be speeded up slightly due to better optimization/
scheduling.

This time a different timing routine was used, which tries to avoid any
cache trashing and more exactly estimates the influence of the random number
routine, so somewhat different results from the ones published in version 0.10
appear below. asin/acos are faster now because of inlining our new fast sqrt.

Cray still "wins" slightly on the log and is "par" on the powr routine,
because our approximation requires a divide operation, however it needs much
less table space than Crays routine. Kazushige Goto tried to optimize all
routines with respect to first call, i.e. when required constants or code are 
not yet in a high speed cache. However, I will also try to develop an 
alternative log version that does not need division and we will compare it's 
performance on "real" problems.

Approximate running times in us in a tight loop for random arguments 0..10 
(0..1 for asin/acos) on a 533MHz LX 21164 (n = 63).

	 libffm		libfm		libm

sin	  0.14		0.21		0.43
cos	  0.15		0.21		0.43
tan	  0.15		0.27		0.54
cotan     0.15		----		----
asin	  0.22		----		1.33
acos	  0.22		----		1.27
atan	  0.19		----		0.58
atan2	  0.20		----		0.72
log2      0.17		----		----
log       0.17		0.16 !		0.44
log10     0.17		0.22		0.53
exp2      0.11		----		----
exp       0.15		0.17		0.50
exp10     0.15		----		----
sinh	  0.23		----		0.71
cosh	  0.23		----		0.67
tanh	  0.26		----		0.67
powr(x,n) 0.10		0.35		1.67 (pow)
powr(x,y) 0.36		0.35 !		1.67 (pow)
sqrt	  0.13		0.13		0.19
sqrti	  0.13		----		----		

dsqrtv    0.030*n + 0.14	(n = vector length)
dsqrtiv   0.028*n + 0.12

I would be interested to hear how these routines run on a 2106x, compared to 
the original libm or libfm !! (However, Kazushige Goto is already planning to
release a version especially optimized for 2106x)


*** ACCURACY ***

Ideal function routines should have a *relative* error of less than 0.5 ulp
(ulp = units of the last place), this means that the result (mantissa) is 
always a perfectly rounded, 52 bit version of the infinite precision, 
theoretical result.

The "old" libm library achieves this goal many times (at the expense of much
increased running time) and tends to typically show relative errors from < 0.5
to approx. 1 ulp.

The above routines achieve about the same accuracy as the "Cray" routines, and 
typically (with the exception of powr) show rel. errors in the range from 
< 0.7 to 1.5 ulp and very few times slightly larger than 2 ulp. RMS rel. error
for "random" arguments is typically ~ 0.5 ulp.

The new fast sqrt is a stripped down, rescheduled version of the libm one, 
that does no final rounding of the last bit. Error is guaranteed to be within 
-1 ulp < err < +1.0625 ulp (according to libdm documentation). This routine
is now as fast as Crays sqrt, but much more accurate.

It is believed (hoped ?) that the relative error *never* exceeds 3 ulp, execpt
probably VERY NEAR to the zeroes of the trig functions outside the main period 
and except for the powr function. I would like to hear of any other occurences 
and such "behaviour" will be regarded as a bug and will (hopefully) be fixed 
in future releases.

The larger errors at the zeroes of the trig functions are *NOT* caused
by imprecise range reduction per se. In fact, a 92 significant bit version of 
PI is used in libffm when reducing large arguments, which should avoid any 
noticeable increase of overall trig error even for very large arguments. 
However, also due to the irrational nature of PI, when reducing a large 
argument (that was exactly representable as double initially), the range 
reduced argument *cannot* be represented exactly as double and bits beyound 
the 52th of the mantissa will be lost, causing the above mentioned increase of 
error in the immediate vicinity of the trig function zeroes. lib(d)m uses 
interpolation for the chopped bits, which we did not include, because it would 
have increased running times. One could probably redefine rel. error here in
the following way: The result y=f(x) of a libffm function is *always* 
within 3 ulp of f(x') for an argument x', x-0.5 ulp < x' < x+0.5 ulp.

The powr routine is "quick&dirty" as in the Cray powr function, i.e. NO 
extended precision is used in calculating log(x)*y, so the relative error in 
the *final result* will grow to ~ 0.4*|ld(x)|*|y| ulp.
    
However, for integral values of y, repeated multiplications will be used to 
calculate the result. So, for "random" x, the rel. error seems to be in the 
order of < 0.4*|y| ulp and for (not too large) integral values of x, the final 
result should be EXACT.
    
In contrast to the Cray powr function, a different multiplication scheme is 
used, where the number of multiplications only increases with log2(|y|).

A < 3 ulp pow function (necessarily SLOWER than powr) will be added
(hopefully) soon.


*** BUGS, MISSING, DOESN'T WORK ? ***


Please send any bug reports, comments, interesting timing results etc. to

	joachim.wesner@frankfurt.netsurf.de