SLEEF Documentation - Additional Notes

Table of contents

Frequently asked questions

Q1: Is the scalar functions in SLEEF faster than the corresponding functions in the standard C library?


A1: No. Todays standard C libraries are very well optimized, and there is small room for further optimization. The reason why SLEEF is fast is that it carries out computation directly on SIMD registers and ALUs. This is not simple as it sounds, because conditional branches have to be eliminated in order to take full advantage of SIMD computation. If the algorithm requires conditional branches according to the argument, it must prepare for the case where the elements in the input vector contain both values that would make a branch happen and not happen. This would spoil the advantage of SIMD computation, because each element in a vector would require a different code path.



Q2: Do the trigonometric functions (e.g. sin) in SLEEF return correct values for the whole range of inputs?


A2: Yes. SLEEF does implement a vectorized version of Payne Hanek range reduction, and all the trigonometric functions return a correct value with the specified accuracy.

About the GNUABI version of the library

The GNUABI version of the library (libsleefgnuabi.so) is built for x86 and aarch64 architectectures. This library provides an API compatible with libmvec in glibc, and the API comforms to the x86 vector ABI, AArch64 vector ABI and Power Vector ABI. This library is built and installed by default, and certain compilers call the functions in this library.

How the dispatchers work

The dispatchers in SLEEF are designed to have very low overhead. This overhead is so small and cannot be observed by microbenchmarking.

Fig. 7.1 shows a simplified code of our dispatcher. There is only one exported function mainFunc. When mainFunc is called for the first time, dispatcherMain is called internally, since funcPtr is initialized to the pointer to dispatcherMain (line 14). It then detects if the CPU supports SSE 4.1 (line 7), and rewrites funcPtr to a pointer to the function that utilizes SSE 4.1 or SSE 2, depending on the result of CPU feature detection (line 10). When mainFunc is called for the second time, it does not execute the dispatcherMain. It just executes the function pointed by the pointer stored in funcPtr during the execution of dispatcherMain.

There are advantages in our dispatcher. The first advantage is that it does not require any compiler-specific extension. The second advantage is simplicity. There are only 18 lines of simple code. Since the dispatchers are completely separated for each function, there is not much room for bugs to get in.

The third advantage is low overhead. You might think that the overhead is one function call including execution of the prologue and the epilogue. However, modern compilers are smart enough to eliminate redundant execution of the prologue, epilogue and return instruction. The actual overhead is just one jmp instruction, which has very small overhead since it is not conditional. This overhead is likely hidden by out-of-order execution.

The fourth advantage is thread safety. There is only one variable shared among threads, which is funcPtr. There are only two possible values for this pointer variable. The first value is the pointer to the dispatcherMain, and the second value is the pointer to either funcSSE2 or funcSSE4, depending on the availability of extensions. Once funcPtr is substituted with the pointer to funcSSE2 or funcSSE4, it will not be changed in the future. It should be easy to confirm that the code works in all the cases.

static double (*funcPtr)(double arg);

static double dispatcherMain(double arg) {
    double (*p)(double arg) = funcSSE2;

#if the compiler supports SSE4.1
    if (SSE4.1 is available on the CPU) p = funcSSE4;
#endif

    funcPtr = p;
    return (*funcPtr)(arg);
}

static double (*funcPtr)(double arg) = dispatcherMain;

double mainFunc(double arg) {
    return (*funcPtr)(arg);
}

Fig. 7.1: Simplified code of our dispatcher

ULP, gradual underflow and flush-to-zero mode

ULP stands for "unit in the last place", which is sometimes used for representing accuracy of calculation. 1 ULP is the distance between the two closest floating point number, which depends on the exponent of the FP number. The accuracy of calculation by reputable math libraries is usually between 0.5 and 1 ULP. Here, the accuracy means the largest error of calculation. SLEEF math library provides multiple accuracy choices for most of the math functions. Many functions have 3.5-ULP and 1-ULP versions, and 3.5-ULP versions are faster than 1-ULP versions. If you care more about execution speed than accuracy, it is advised to use the 3.5-ULP versions along with -ffast-math or "unsafe math optimization" options for the compiler.

Note that 3.5 ULPs of error is small enough in many applications. If you do not manage the error of computation by carefully ordering floating point operations in your code, you would easily have that amount of error in the computation results.

In IEEE 754 standard, underflow does not happen abruptly when the exponent becomes zero. Instead, when a number to be represented is smaller than a certain value, a denormal number is produced which has less precision. This is sometimes called gradual underflow. On some processor implementation, a flush-to-zero mode is used since it is easier to implement by hardware. In flush-to-zero mode, numbers smaller than the smallest normalized number are replaced with zero. FP operations are not IEEE-754 conformant if a flush-to-zero mode is used. A flush-to-zero mode influences the accuracy of calculation in some cases. The smallest normalized precision number can be referred with DBL_MIN for double precision, and FLT_MIN for single precision. The naming of these macros is a little bit confusing because DBL_MIN is not the smallest double precision number.

You can see known maximum errors in math functions in glibc at this page.

Explanatory source code for our modified Payne Hanek reduction method

In order to evaluate a trigonometric function with a large argument, an argument reduction method is used to find an FP remainder of dividing the argument x by π. We devised a variation of the Payne-Hanek argument reduction method which is suitable for vector computation. Fig. 7.2 shows an explanatory source code for this method. See our paper for the details.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <mpfr.h>

typedef struct { double x, y; } double2;
double2 dd(double d) { double2 r = { d, 0 }; return r; }
int64_t d2i(double d) { union { double f; int64_t i; } tmp = {.f = d }; return tmp.i; }
double i2d(int64_t i) { union { double f; int64_t i; } tmp = {.i = i }; return tmp.f; }
double upper(double d) { return i2d(d2i(d) & 0xfffffffff8000000LL); }
double clearlsb(double d) { return i2d(d2i(d) & 0xfffffffffffffffeLL); }

double2 ddrenormalize(double2 t) {
  double2 s = dd(t.x + t.y);
  s.y = t.x - s.x + t.y;
  return s;
}

double2 ddadd(double2 x, double2 y) {
  double2 r = dd(x.x + y.x);
  double v = r.x - x.x;
  r.y = (x.x - (r.x - v)) + (y.x - v) + (x.y + y.y);
  return r;
}

double2 ddmul(double x, double y) {
  double2 r = dd(x * y);
  r.y = fma(x, y, -r.x);
  return r;
}

double2 ddmul2(double2 x, double2 y) {
  double2 r = ddmul(x.x, y.x);
  r.y += x.x * y.y + x.y * y.x;
  return r;
}

// This function computes remainder(a, PI/2)
double2 modifiedPayneHanek(double a) {
  double table[4];
  int scale = fabs(a) > 1e+200 ? -128 : 0;
  a = ldexp(a, scale);

  // Table genration

  mpfr_set_default_prec(2048);
  mpfr_t pi, m;
  mpfr_inits(pi, m, NULL);
  mpfr_const_pi(pi, GMP_RNDN);

  mpfr_d_div(m, 2, pi, GMP_RNDN);
  mpfr_set_exp(m, mpfr_get_exp(m) + (ilogb(a) - 53 - scale));
  mpfr_frac(m, m, GMP_RNDN);
  mpfr_set_exp(m, mpfr_get_exp(m) - (ilogb(a) - 53));

  for(int i=0;i<4;i++) {
    table[i] = clearlsb(mpfr_get_d(m, GMP_RNDN));
    mpfr_sub_d(m, m, table[i], GMP_RNDN);
  }

  mpfr_clears(pi, m, NULL);

  // Main computation

  double2 x = dd(0);
  for(int i=0;i<4;i++) {
    x = ddadd(x, ddmul(a, table[i]));
    x.x = x.x - round(x.x);
    x = ddrenormalize(x);
  }

  double2 pio2 = { 3.141592653589793*0.5, 1.2246467991473532e-16*0.5 };
  x = ddmul2(x, pio2);
  return fabs(a) < 0.785398163397448279 ? dd(a) : x;
}

Fig. 7.2: Explanatory source code for our modified Payne Hanek reduction method

It is a soup ladle. A sleef means a soup ladle in Dutch.


logo
Fig. 7.2: SLEEF logo