File: performance.rst

.. _tutorial-performance:

*******************************
Optimizer and model performance
*******************************

.. note::
    This page is still under development. Questions and comments are welcome to improve usability.

Introduction
============
Computing performance is a complex topic, and in a modeling suite like GenX,
where the user has a large amount of flexibility within the model definition,
no generic one-size-fits-all solution is possible.

In this chapter I will therefore explain what has been implemented in GenX to try to
optimize the general model performance (mostly for reflectivity) and the different
options in the optimizer settings you can use to tune the calculation to your
specific model.

Main factors determining performance
====================================

Single model execution
----------------------
When simulating a single model, no generic solution for parallel processing can be applied.
In particular, Python's so-called "Global Interpreter Lock" (GIL) prevents threads
from executing Python code concurrently on different cores.

While GenX tries to optimize the time-consuming parts of a model execution using
algorithms built on numpy and scipy functions, the resulting speed of models
is typically 5-10x slower than implementations in compiled languages like C++.

Since version 3.0.2, the core functions for reflectivity have been ported to the
just-in-time (JIT) compiler package numba, which leads to single-thread speedups
of 2-5x. In addition, numba allows parallelization of these functions, which can
circumvent the GIL. Depending on the CPU used, complex models can gain another
2-10x speed improvement.
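
For illustration, the parallel loop pattern enabled by numba can be sketched as
follows. This is not GenX's actual reflectivity kernel, only a minimal example of
how ``prange`` distributes independent loop iterations over CPU cores outside the
GIL (the import fallback is a hypothetical convenience so the snippet also runs
without numba installed):

.. code-block:: python

    import numpy as np

    # Optional dependency: fall back to plain Python if numba is missing,
    # so the sketch still runs (just without the JIT speedup).
    try:
        from numba import njit, prange
    except ImportError:
        prange = range

        def njit(*args, **kwargs):
            def wrap(func):
                return func
            return wrap

    @njit(parallel=True)
    def abs_squared(amplitudes):
        # Element-wise |r|^2; the iterations are independent, so numba
        # can split them across threads without holding the GIL.
        out = np.empty(amplitudes.shape[0])
        for i in prange(amplitudes.shape[0]):
            out[i] = amplitudes[i].real ** 2 + amplitudes[i].imag ** 2
        return out

    r = np.array([1 + 1j, 0.5 + 0j, 0 + 2j])
    print(abs_squared(r))  # element-wise |r|^2: 2.0, 0.25 and 4.0

With numba present, the first call triggers JIT compilation (a one-time cost),
after which the loop runs as compiled, multithreaded machine code.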

For the single calculations used during simulation, GenX does not provide any
further flexibility to alter the computation. The only exception is the
use of CUDA (the NVidia GPU computation framework), which can be activated
in the GUI "Fit" menu.
The impact of this setting is strongly model dependent. With many
datapoints (resolution convolution included) and a large number of layers,
the speed can be comparable to strong multicore CPUs while requiring fewer system resources.
Because of several caveats and the need to re-compile the JIT code every time GenX
is started, I would not recommend this in most cases at the moment.

Fitting the model
-----------------
One advantage of the Differential Evolution algorithm is
that a large number of parameter sets are calculated for every
generation without any interdependence. This allows a relatively simple
way of parallelizing computations: a pool of processes can be used,
with model parameters being passed to them every generation.
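
Such a generation-by-generation process pool can be sketched with the standard
library. The figure-of-merit function here is a hypothetical stand-in; GenX
would evaluate the full reflectivity model and compare it to the data at that
point:

.. code-block:: python

    from multiprocessing import Pool

    import numpy as np

    def figure_of_merit(params):
        # Hypothetical stand-in for a model evaluation: any function
        # mapping one trial parameter vector to a scalar FOM.
        return float(np.sum(np.asarray(params) ** 2))

    if __name__ == "__main__":
        # One DE generation: each row is an independent trial parameter
        # set, so the pool may evaluate rows in any order on separate
        # processes.
        rng = np.random.default_rng(0)
        population = rng.normal(size=(16, 4))
        with Pool(processes=4) as pool:
            foms = pool.map(figure_of_merit, list(population), chunksize=4)
        print("best trial:", int(np.argmin(foms)))

The ``chunksize`` argument plays the same role as the "items/chunk" setting
discussed below: larger chunks mean fewer, bigger transfers to the workers.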

Any parallel computing solution involves overhead for setting up
and communicating between parallel workers. The optimal settings will
therefore depend on the complexity of the model.

A general rule of thumb is that the more complex a computation within a
thread, the lower the influence of the overhead of setting it up. At the
same time, the more data is needed for a computation, the more overhead
is produced.

For GenX this means that the parallel computation provided by the
numba JIT functions is less effective if the model has a small number of
layers and if the number of datapoints is small. In the case of
simple and fast models, the multiprocessing of the differential evolution
optimizer can therefore lead to much higher performance.
GenX automatically reduces the number of cores used by numba functions
when the process number in the optimizer settings is increased (simulations
still use the maximum available cores). Transferring data to the processes also has
its overhead, which can be influenced by the "items/chunk" parameter as well
as the population size.

Tips to optimize your model performance
=======================================
    * Always use the "Speed" indication in the status bar at the bottom of the window. It shows how many
      function evaluations per second are being calculated during the fit.
    * If you do not need them, you can disable "ignore fom nan" and "ignore fom inf", which can slightly improve performance.
    * Try out different settings of population size, items/chunk, parallel on/off, and number of processes.

Simple models
-------------
    * Models with <20 layers and a few hundred datapoints
    * Expected computation speed >300 (can reach >10000 if optimized)
    * Use parallel processing with processes = cores/threads.
    * Use a large population size (100-1000).
    * Set items/chunk to (population size)/(processes).
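
The last rule of thumb is simple arithmetic; as a sketch (the function name is
illustrative, not a GenX setting):

.. code-block:: python

    def items_per_chunk(population_size, processes):
        # One chunk per process per generation minimizes the number of
        # transfers; the true optimum remains model dependent.
        return max(1, population_size // processes)

    print(items_per_chunk(400, 8))  # 50

With a population of 400 and 8 processes, each process receives one chunk of
50 parameter sets per generation instead of many small transfers.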

Complex models
--------------
    * These can have 100 or more layers and resolution convolution that leads to >1000 datapoints
    * Expected computation speed <150
    * Try without parallel processing or with a small number of parallel threads (2-4).
    * If CUDA is available, especially for neutron spin-flip calculations, try using CUDA in conjunction with
      2-8 parallel threads. In this case one of the threads will run on the GPU. In tests this could lead
      to a 1.5x to 2.0x improvement, even on a system with a powerful 16-core CPU.