NumExpr with Intel MKL
======================
Numexpr supports Intel's VML (included in Intel's MKL) in order to
accelerate the evaluation of transcendental functions on Intel CPUs. Here
is a small example of the kind of improvement you may get by using it.

A first benchmark
-----------------
First, we are going to see how MKL performs when computing a couple of
simple expressions. One is purely algebraic, :code:`2*y + 4*x`, and the other
contains transcendental functions, :code:`sin(x)**2 + cos(y)**2`.

For this, we are going to use this worksheet_. I (Francesc Alted) ran this
benchmark on an Intel Xeon E3-1245 v5 @ 3.50GHz. Here are the results when
not using MKL::

    NumPy version: 1.11.1
    Time for an algebraic expression: 0.168 s / 6.641 GB/s
    Time for a transcendental expression: 1.945 s / 0.575 GB/s
    Numexpr version: 2.6.1. Using MKL: False
    Time for an algebraic expression: 0.058 s / 19.116 GB/s
    Time for a transcendental expression: 0.283 s / 3.950 GB/s

And now, using MKL::

    NumPy version: 1.11.1
    Time for an algebraic expression: 0.169 s / 6.606 GB/s
    Time for a transcendental expression: 1.943 s / 0.575 GB/s
    Numexpr version: 2.6.1. Using MKL: True
    Time for an algebraic expression: 0.058 s / 19.153 GB/s
    Time for a transcendental expression: 0.075 s / 14.975 GB/s

As you can see, numexpr using MKL can be up to 3.8x faster than numexpr
without MKL for the transcendental expression. Note also that the pure
algebraic expression is not accelerated at all. This is completely expected,
as MKL only accelerates CPU-bound functions (sin, cos, tan, exp, log,
sinh...) and not plain multiplications or additions.

Finally, note how numexpr+MKL can be up to 26x faster than a pure NumPy
solution. And this was on a processor with just four physical cores; you
should expect more speedup as you throw more cores at it.

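The speedup figures quoted above follow directly from the reported timings.
As a quick check, the short sketch below recomputes them. The dictionary keys
and the ``speedup`` helper are illustrative names introduced here, not part of
the benchmark script.

.. code-block:: python

    # Timings (in seconds) for the transcendental expression, copied from
    # the benchmark output above (NumPy figure taken from the MKL run).
    timings = {
        "numpy": 1.943,
        "numexpr": 0.283,
        "numexpr_mkl": 0.075,
    }


    def speedup(slow, fast):
        """Return how many times faster `fast` is compared to `slow`."""
        return slow / fast


    mkl_vs_numexpr = speedup(timings["numexpr"], timings["numexpr_mkl"])
    mkl_vs_numpy = speedup(timings["numpy"], timings["numexpr_mkl"])

    print(f"numexpr+MKL vs. numexpr: {mkl_vs_numexpr:.1f}x")  # 3.8x
    print(f"numexpr+MKL vs. NumPy:   {mkl_vs_numpy:.1f}x")    # 25.9x
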
.. _worksheet: https://github.com/pydata/numexpr/blob/master/bench/vml_timing2.py

More benchmarks (older)
-----------------------
Both numexpr and VML can use several threads for their computations. Let's
see how performance improves by using 1 or 2 threads on a 2-core Intel CPU
(Core2 E8400 @ 3.00GHz).

Using 1 thread
^^^^^^^^^^^^^^
Here are some benchmarks showing the speed improvement that Intel's VML can
achieve. First, look at the timings for a simple expression containing sine
and cosine operations *without* using VML::

    In [17]: ne.use_vml
    Out[17]: False
    In [18]: x = np.linspace(-1, 1, 10**6)
    In [19]: timeit np.sin(x)**2+np.cos(x)**2
    10 loops, best of 3: 43.1 ms per loop
    In [20]: ne.set_num_threads(1)
    Out[20]: 2
    In [21]: timeit ne.evaluate('sin(x)**2+cos(x)**2')
    10 loops, best of 3: 29.5 ms per loop

and now using VML::

    In [37]: ne.use_vml
    Out[37]: True
    In [38]: x = np.linspace(-1, 1, 10**6)
    In [39]: timeit np.sin(x)**2+np.cos(x)**2
    10 loops, best of 3: 42.8 ms per loop
    In [40]: ne.set_num_threads(1)
    Out[40]: 2
    In [41]: timeit ne.evaluate('sin(x)**2+cos(x)**2')
    100 loops, best of 3: 19.8 ms per loop

Hey, VML can accelerate computations by about 50% (29.5 ms vs. 19.8 ms) using
a single thread. That's great!

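To repeat this comparison on your own machine, something like the following
sketch could be used. It assumes NumPy and numexpr are installed; the helper
``best_time`` is a hypothetical name introduced here, and the timings you get
will differ from the ones above.

.. code-block:: python

    import timeit

    import numexpr as ne
    import numpy as np

    # Same input as in the sessions above: one million points in [-1, 1].
    x = np.linspace(-1, 1, 10**6)


    def best_time(func, number=10, repeat=3):
        """Return the best average time per call of func(), in seconds.

        Hypothetical helper; it mimics the "best of 3" report of timeit.
        """
        return min(timeit.repeat(func, number=number, repeat=repeat)) / number


    print("Using VML:", ne.use_vml)

    # set_num_threads() returns the previous setting, so it can be restored.
    previous = ne.set_num_threads(1)
    try:
        t_numpy = best_time(lambda: np.sin(x)**2 + np.cos(x)**2)
        t_numexpr = best_time(
            lambda: ne.evaluate("sin(x)**2 + cos(x)**2", local_dict={"x": x}))
    finally:
        ne.set_num_threads(previous)

    print(f"NumPy:   {t_numpy * 1e3:.1f} ms per call")
    print(f"numexpr: {t_numexpr * 1e3:.1f} ms per call")

Passing ``local_dict`` explicitly avoids relying on numexpr looking up ``x`` in
the calling frame, which keeps the sketch working when it is wrapped in a
function.
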
Using 2 threads
^^^^^^^^^^^^^^^
First, look at the time of the non-VML numexpr when using 2 threads::

    In [22]: ne.set_num_threads(2)
    Out[22]: 1
    In [23]: timeit ne.evaluate('sin(x)**2+cos(x)**2')
    100 loops, best of 3: 15.3 ms per loop

OK. We've got an almost perfect 2x improvement in speed with respect to the
1-thread case. Let's see about the VML-powered numexpr version::

    In [43]: ne.set_num_threads(2)
    Out[43]: 1
    In [44]: timeit ne.evaluate('sin(x)**2+cos(x)**2')
    100 loops, best of 3: 12.2 ms per loop

OK, that's about a 1.6x improvement over the 1-thread VML computation, and
still a 25% improvement over the non-VML version. Good, the native numexpr
multithreading code really looks very efficient!

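The scaling figures above can be recomputed from the reported timings. The
sketch below does so and also derives the parallel efficiency (the achieved
speedup divided by the number of threads), a quantity the text does not
quote. The labels used as dictionary keys are illustrative only.

.. code-block:: python

    # Timings in milliseconds, copied from the sessions above,
    # keyed by (engine, number of numexpr threads).
    timings_ms = {
        ("numexpr", 1): 29.5,
        ("numexpr", 2): 15.3,
        ("numexpr+VML", 1): 19.8,
        ("numexpr+VML", 2): 12.2,
    }

    scaling = {}
    for engine in ("numexpr", "numexpr+VML"):
        # Speedup obtained when going from 1 to 2 threads.
        scaling[engine] = timings_ms[(engine, 1)] / timings_ms[(engine, 2)]
        efficiency = scaling[engine] / 2  # fraction of the ideal 2x speedup
        print(f"{engine}: {scaling[engine]:.2f}x with 2 threads "
              f"({efficiency:.0%} parallel efficiency)")

    # Gain of VML over plain numexpr when both use 2 threads.
    vml_gain = timings_ms[("numexpr", 2)] / timings_ms[("numexpr+VML", 2)]
    print(f"VML gain at 2 threads: {vml_gain:.2f}x")
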
Numexpr native threading code vs VML's one
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You may already know that both numexpr and Intel's VML support
multithreaded computations, but you might be curious about which one is more
efficient, so here is a hint. First, using the VML multithreaded
implementation::

    In [49]: ne.set_vml_num_threads(2)
    In [50]: ne.set_num_threads(1)
    Out[50]: 1
    In [51]: ne.set_vml_num_threads(2)
    In [52]: timeit ne.evaluate('sin(x)**2+cos(x)**2')
    100 loops, best of 3: 16.8 ms per loop

and now, using the native numexpr threading code::

    In [53]: ne.set_num_threads(2)
    Out[53]: 1
    In [54]: ne.set_vml_num_threads(1)
    In [55]: timeit ne.evaluate('sin(x)**2+cos(x)**2')
    100 loops, best of 3: 12 ms per loop

This means that numexpr's native multithreaded code is about 40% faster than
VML's for this case. So, in general, you should use the former with numexpr
(this is actually the default).

Mixing numexpr's and VML multithreading capabilities
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Finally, you might be tempted to use both multithreading codes at the same
time, but you will be disappointed by the resulting performance::

    In [57]: ne.set_vml_num_threads(2)
    In [58]: timeit ne.evaluate('sin(x)**2+cos(x)**2')
    100 loops, best of 3: 17.7 ms per loop

Your code actually performs much worse. That's to be expected, because you
are trying to run 4 threads on a 2-core CPU. For CPUs with many cores, you
may want to try different threading configurations, but as a rule of thumb,
numexpr's native threading will generally win.

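For machines with more cores, one way to explore threading configurations is
to time the same expression over a small grid of numexpr and VML thread
counts. The following is only a sketch under a few assumptions: NumPy and
numexpr are installed, the candidate counts ``(1, 2, 4)`` and the helper
``best_time`` are arbitrary choices made here, and resetting VML to one thread
at the end is an assumption about the default rather than a value read back
from the library.

.. code-block:: python

    import itertools
    import timeit

    import numexpr as ne
    import numpy as np

    x = np.linspace(-1, 1, 10**6)
    expr = "sin(x)**2 + cos(x)**2"


    def best_time(n_calls=10, repeat=3):
        """Best average time per call of ne.evaluate(expr), in seconds."""
        timer = timeit.Timer(lambda: ne.evaluate(expr, local_dict={"x": x}))
        return min(timer.repeat(repeat=repeat, number=n_calls)) / n_calls


    # set_num_threads() returns the previous setting; use it both as an upper
    # bound for the sweep and to restore the configuration afterwards.
    previous = ne.set_num_threads(1)
    counts = [n for n in (1, 2, 4) if n <= previous]
    # VML thread counts only matter when numexpr was built against VML.
    vml_counts = counts if ne.use_vml else [1]

    results = {}
    try:
        for ne_threads, vml_threads in itertools.product(counts, vml_counts):
            ne.set_num_threads(ne_threads)
            if ne.use_vml:
                ne.set_vml_num_threads(vml_threads)
            results[(ne_threads, vml_threads)] = best_time()
    finally:
        ne.set_num_threads(previous)
        if ne.use_vml:
            ne.set_vml_num_threads(1)  # assumption: one VML thread by default

    # Fastest configuration first.
    for (ne_threads, vml_threads), seconds in sorted(
            results.items(), key=lambda item: item[1]):
        print(f"numexpr threads={ne_threads}, VML threads={vml_threads}: "
              f"{seconds * 1e3:.1f} ms per call")
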