File: README.md

package info (click to toggle)
libdivide 3.0-1
  • links: PTS, VCS
  • area: main
  • in suites: bookworm, bullseye, sid
  • size: 248 kB
  • sloc: cpp: 1,885; ansic: 814; makefile: 2
file content (230 lines) | stat: -rw-r--r-- 8,928 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
# libdivide
[![Build Status](https://ci.appveyor.com/api/projects/status/github/ridiculousfish/libdivide?branch=master&svg=true)](https://ci.appveyor.com/project/kimwalisch/libdivide)
[![Github Releases](https://img.shields.io/github/release/ridiculousfish/libdivide.svg)](https://github.com/ridiculousfish/libdivide/releases)

```libdivide.h```  is a header-only C/C++ library for optimizing integer division.
Integer division is one of the slowest instructions on most CPUs e.g. on
current x64 CPUs a 64-bit integer division has a latency of up to 90 clock
cycles whereas a multiplication has a latency of only 3 clock cycles.
libdivide allows you to replace expensive integer divsion instructions by
a sequence of shift, add and multiply instructions that will calculate
the integer division much faster.

On current CPUs you can get a **speedup of up to 10x** for 64-bit integer division
and a speedup of up to to 5x for 32-bit integer division when using libdivide.
libdivide also supports [SSE2](https://en.wikipedia.org/wiki/SSE2),
[AVX2](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions) and
[AVX512](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions)
vector division which provides an even larger speedup. You can test how much
speedup you can achieve on your CPU using the [benchmark](#benchmark-program)
program.

See https://libdivide.com for more information on libdivide.

# C++ example

The first code snippet divides all integers in a vector using integer division.
This is slow as integer division is at least one order of magnitude slower than
any other integer arithmetic operation on current CPUs.

```C++
void divide(std::vector<int64_t>& vect, int64_t divisor)
{
    // Slow, uses integer division
    for (auto& n : vect)
        n /= divisor;
}
```

The second code snippet runs much faster, it uses libdivide to compute the
integer division using a sequence of shift, add and multiply instructions hence
avoiding the slow integer divison operation.

```C++
#include "libdivide.h"

void divide(std::vector<int64_t>& vect, int64_t divisor)
{
    libdivide::divider<int64_t> fast_d(divisor);

    // Fast, computes division using libdivide
    for (auto& n : vect)
        n /= fast_d;
}
```

Generally libdivide will give at significant speedup if:

* The divisor is only known at runtime
* The divisor is reused multiple times e.g. in a loop

# C example

You first need to generate a libdivide divider using one of the ```libdivide_*_gen``` functions (```*```:&nbsp;```s32```,&nbsp;```u32```,&nbsp;```s64```,&nbsp;```u64```)
which can then be used to compute the actual integer division using the
corresponding ```libdivide_*_do``` function.

```C
#include "libdivide.h"

void divide(int64_t *array, size_t size, int64_t divisor)
{
    struct libdivide_s64_t fast_d = libdivide_s64_gen(divisor);

    // Fast, computes division using libdivide
    for (size_t i = 0; i < size; i++)
        array[i] = libdivide_s64_do(array[i], &fast_d);
}
```

# API reference

* [C API](https://github.com/ridiculousfish/libdivide/blob/master/doc/C-API.md)
* [C++ API](https://github.com/ridiculousfish/libdivide/blob/master/doc/CPP-API.md)

# Branchfull vs branchfree

The default libdivide divider makes use of
[branches](https://en.wikipedia.org/wiki/Branch_(computer_science)) to compute the integer
division. When the same divider is used inside a hot loop as in the C++ example section the
CPU will accurately predict the branches and there will be no performance slowdown. Often
the compiler is even able to move the branches outside the body of the loop hence
completely eliminating the branches, this is called loop-invariant code motion.

libdivide also has a branchfree divider type which computes the integer division without
using any branch instructions. The branchfree divider generally uses a few more instructions
than the default branchfull divider. The main use case for the branchfree divider is when
you have an array of different divisors and you need to iterate over the divisors. In this
case the default branchfull divider would exhibit poor performance as the CPU won't be
able to correctly predict the branches.

```C++
#include "libdivide.h"

// 64-bit branchfree divider type
using branchfree_t = libdivide::branchfree_divider<uint64_t>;

uint64_t divide(uint64_t x, std::vector<branchfree_t>& vect)
{
    uint64_t sum = 0;

    for (auto& fast_d : vect)
        sum += x / fast_d;

    return sum;
}
```

Caveats of branchfree divider:

* Unsigned branchfree divider cannot be ```1```
* Faster for unsigned types than for signed types

# Vector division

libdivide supports [SSE2](https://en.wikipedia.org/wiki/SSE2),
[AVX2](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions) and
[AVX512](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions)
vector division on x86 and x64 CPUs. In the example below we divide the packed 32-bit
integers inside an AVX512 vector using libdivide. libdivide supports 32-bit and 64-bit
vector division for both signed and unsigned integers.

```C++
#include "libdivide.h"

void divide(std::vector<__m512i>& vect, uint32_t divisor)
{
    libdivide::divider<uint32_t> fast_d(divisor);

    // AVX512 vector division
    for (auto& n : vect)
        n /= fast_d;
}
```

Note that you need to define one of macros below to enable vector division:

* ```LIBDIVIDE_SSE2```
* ```LIBDIVIDE_AVX2```
* ```LIBDIVIDE_AVX512```

# Performance tips

* If possible use unsigned integer types because libdivide's unsigned division is measurably
  faster than its signed division. This is especially true for the branchfree divider.
* Try both the default branchfull divider and the branchfree divider in your program and
  choose the one that performs best. The branchfree divider is more likely to get auto
  vectorized by the compiler (if you compile with e.g. ```-march=native```). But don't forget
  that the unsigned branchfree divider cannot be 1.
* Vector division is much faster for 32-bit than for 64-bit. This is because there are
  currently no vector multiplication instructions on x86 to efficiently calculate
  64-bit * 64-bit to 128-bit. 

# Build instructions

libdivide has one test program and two benchmark programs which can be built using cmake and
a recent C++ compiler that supports C++11 or later. Optionally ```libdivide.h``` can also be
installed to ```/usr/local/include```.

```bash
cmake .
make -j
sudo make install
```

# Tester program

You can pass the **tester** program one or more of the following arguments: ```u32```,
```s32```, ```u64```, ```s64``` to test the four cases (signed, unsigned, 32-bit, or 64-bit), or
run it with no arguments to test all four. The tester will verify the correctness of libdivide
via a set of randomly chosen numerators and denominators, by comparing the result of libdivide's
division to hardware division. It will stop with an error message as soon as it finds a
discrepancy.

# Benchmark program

You can pass the **benchmark** program one or more of the following arguments: ```u32```,
```s32```, ```u64```, ```s64``` to compare libdivide's speed against hardware division.
**benchmark** tests a simple function that inputs an array of random numerators and a single
divisor, and returns the sum of their quotients. It tests this using both hardware division, and
the various division approaches supported by libdivide, including vector division.

It will output data like this:

```bash
 #   system  scalar  scl_bf  vector  vec_bf   gener   algo
 1   9.684   0.792   0.783   0.426   0.426    1.346   0
 2   9.078   0.781   1.609   0.426   1.529    1.346   0
 3   9.078   1.355   1.609   1.334   1.531   29.045   1
 4   9.076   0.787   1.609   0.426   1.529    1.346   0
 5   9.074   1.349   1.609   1.334   1.531   29.045   1
 6   9.078   1.349   1.609   1.334   1.531   29.045   1
...
```

It will keep going as long as you let it, so it's best to stop it when you are happy with the
denominators tested. These columns have the following significance. All times are in
nanoseconds, lower is better.

```bash
     #:  The divisor that is tested
system:  Hardware divide time
scalar:  libdivide time, using scalar division
scl_bf:  libdivide time, using scalar branchfree division
vector:  libdivide time, using vector division
vec_bf:  libdivide time, using vector branchfree division
 gener:  Time taken to generate the divider struct
  algo:  The algorithm used.
```

The **benchmark** program will also verify that each function returns the same value,
so benchmark is valuable for its verification as well.

# Contributing

We currently do not have automated testing! Hence, before sending in patches, it would be nice
if you compiled your new code at high warning levels using at least MSVC and GCC (or Clang).
Also run the tester program to verify correctness and the benchmark programs to ensure you have
not introduced any performance regressions.

### Happy hacking!