An overview
==============

Compyle allows users to execute a restricted subset of Python (quite similar
to C) on a variety of HPC platforms. Currently we support multi-core execution
using Cython, and execution on GPU devices via OpenCL and CUDA.

An introduction to compyle in the context of writing a molecular dynamics
simulator is available in our `SciPy 2020 paper`_. You may also `try Compyle`_
online on a Google Colab notebook if you wish.

Users start with code implemented in a very restricted Python syntax; this
code is then automatically transpiled, compiled, and executed on a single CPU
core, on multiple CPU cores, or on a GPU. Compyle offers source-to-source
transpilation, making it a very convenient tool for writing HPC libraries.

Compyle is not a magic bullet:

- Do not expect a tremendous speedup merely from using Compyle.
- Performance optimization can be hard and is platform specific. What works on
  the CPU may not work on the GPU and vice-versa. Compyle does not do anything
  to make this aspect easier. All the issues with memory bandwidth, cache,
  false sharing, etc. still remain. Differences between the memory
  architectures of CPUs and GPUs are not hidden at all -- you still have to
  deal with them. But you can do so from the comfort of one simple programming
  language, Python.
- Compyle makes it easy to write everything in pure Python and generate the
  platform-specific code from Python. It provides a low-level tool that makes
  it easy for you to generate whatever code is appropriate.
- The restrictions Compyle imposes make it easy for you to think about your
  algorithms in that context and thereby allow you to build functionality that
  exploits the hardware as you see fit.
- Compyle hides the details of the backend to the extent possible. You can write
  your code in Python, you can reuse your functions and decompose your problem
  to maximize reuse. Traditionally you would end up implementing some code in C,
  some in Python, some in OpenCL/CUDA, some in string fragments that you put
  together. Then you'd have to manage each of the runtimes yourself, worry about
  compilation etc. Compyle minimizes that pain.
- By being written in Python, we make it easy to assemble these building blocks
  together to do fairly sophisticated things relatively easily from the same
  language.
- Compyle is fairly simple and does source translation, making it generally
  easier to understand and debug. The core code-base is under 7k lines of
  code.
- Compyle has relatively simple dependencies: for CPU support it requires
  Cython_ and a C-compiler that supports OpenMP_. On the GPU you need either
  PyOpenCL_ or PyCUDA_. In addition, it depends on NumPy_ and Mako_.


.. _Cython: http://www.cython.org
.. _OpenMP: http://openmp.org/
.. _PyOpenCL: https://documen.tician.de/pyopencl/
.. _PyCUDA: https://documen.tician.de/pycuda/
.. _OpenCL: https://www.khronos.org/opencl/
.. _NumPy: http://numpy.scipy.org
.. _Mako: https://pypi.python.org/pypi/Mako
.. _SciPy 2020 paper: http://conference.scipy.org/proceedings/scipy2020/compyle_pr_ab.html
.. _try Compyle: https://colab.research.google.com/drive/1SGRiArYXV1LEkZtUeg9j0qQ21MDqQR2U?usp=sharing

While Compyle is simple and modest, it is quite powerful and convenient. In
fact, Compyle has its origins in PySPH_ which is a powerful Python package
supporting SPH, molecular dynamics, and other particle-based algorithms. The
basic elements of Compyle are used in PySPH_ to automatically generate HPC code
from code written in pure Python and execute it on multiple cores, and on GPUs
without the user having to change any of their code. Compyle extracts this
code generation machinery and makes it available as a general-purpose tool.

.. _PySPH: http://pysph.readthedocs.io


These are the restrictions that Compyle imposes on the Python language:

- Functions must be written with a C-like syntax.
- Function arguments must be declared using type annotations, a decorator, or
  default arguments.
- No Python data structures, i.e. no lists, tuples, or dictionaries.
- Contiguous NumPy arrays are supported but must be one-dimensional.
- No memory allocation is allowed inside these functions.
- Recursion is not supported on OpenCL.
- Function calls must not use dotted names, i.e. don't use ``math.sin``; just
  use ``sin``. This is because we do not perform any name mangling, which
  keeps the generated code easier to read.

Basically think of it as good old FORTRAN.
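
For instance, a minimal sketch of a function that respects these restrictions
might look like the following (the function name and body are purely
illustrative; ``annotate`` and ``declare`` come from ``compyle.api``)::

   from compyle.api import annotate, declare

   @annotate(i='int', x='doublep', y='doublep')
   def scale_and_shift(i, x, y):
       # Local variables are declared explicitly with a C type.
       tmp = declare('double')
       tmp = 2.0*x[i]
       # Only plain, C-like expressions: no lists, tuples, or dicts,
       # and no dotted calls like math.sqrt -- just sqrt.
       y[i] = tmp + 1.0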

Technically we do support structs internally (we use them heavily in PySPH_),
but this is not yet exposed at the high level and is very likely to be
supported in the future.


Simple example
--------------

Enough talk, let's look at some code.  Here is a very simple example::

   from compyle.api import Elementwise, annotate, wrap, get_config
   import numpy as np

   # 'i' is the loop index, 'doublep' marks double pointers (arrays), and
   # double='a,b' declares a and b as double scalars.
   @annotate(i='int', x='doublep', y='doublep', double='a,b')
   def axpb(i, x, y, a, b):
       # sin maps to the underlying C/OpenCL/CUDA sin function.
       y[i] = a*sin(x[i]) + b

   x = np.linspace(0, 1, 10000)
   y = np.zeros_like(x)
   a = 2.0
   b = 3.0

   backend = 'cython'
   get_config().use_openmp = True
   # wrap() makes the arrays available to the chosen backend.
   x, y = wrap(x, y, backend=backend)
   e = Elementwise(axpb, backend=backend)
   e(x, y, a, b)

This will execute the elementwise operation in parallel using OpenMP with
Cython. The code is auto-generated, compiled, and called for you
transparently. The first time this runs it will take a bit of time to compile
everything, but the result is cached, so subsequent runs are much faster.

If you just change the backend to ``'opencl'``, the exact same code will be
executed using PyOpenCL_, and if you change the backend to ``'cuda'``, it
will execute via CUDA without any other changes to your code. This is
obviously a very trivial example; there are more complex examples available
as well.
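
As a minimal sketch, and assuming PyOpenCL_ and a working OpenCL runtime are
available, switching the example above to OpenCL looks like this (``pull``
copies device data back to the host)::

   backend = 'opencl'
   x = np.linspace(0, 1, 10000)
   y = np.zeros_like(x)
   x, y = wrap(x, y, backend=backend)
   e = Elementwise(axpb, backend=backend)
   e(x, y, a, b)
   # Copy the result back from the device; it is then available in y.data.
   y.pull()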

To see the source code that is automatically generated for the above
elementwise operation example, use::

  e.source

This will contain the sources that are generated based on the user code alone.
To see all the sources created, use::

  e.all_source

A word of warning though: this can be fairly long, especially on a GPU, and
for other kinds of operations it may actually include multiple GPU kernels.
This is largely for reference and debugging.
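
For example, continuing the session above, you can print these strings to
read the generated code::

   print(e.source)      # the sources generated from your code alone
   print(e.all_source)  # all the sources created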


More examples
--------------

More complex examples (but still fairly simple) are available in the `examples
<https://github.com/pypr/compyle/tree/master/examples>`_ directory.

- `axpb.py <https://github.com/pypr/compyle/tree/master/examples/axpb.py>`_:
  the above example, but run with OpenMP and OpenCL and compared with a serial
  version, showing that in some cases serial is actually faster than parallel!

- `vm_elementwise.py
  <https://github.com/pypr/compyle/tree/master/examples/vm_elementwise.py>`_:
  shows a simple N-body code with two-dimensional point vortices. The code uses
  a simple elementwise operation and works with OpenMP and OpenCL.

- `vm_numba.py
  <https://github.com/pypr/compyle/tree/master/examples/vm_numba.py>`_: shows
  the same code written in Numba for comparison. In our benchmarks, Compyle is
  actually faster even in serial, and in parallel it can be much faster when
  you use all cores.

- `vm_kernel.py
  <https://github.com/pypr/compyle/tree/master/examples/vm_kernel.py>`_: shows
  how one can write a low-level OpenCL kernel in pure Python and use that. This
  also shows how you can allocate and use local (or shared) memory which is
  often very important for performance on GPGPUs. This code will only run via
  PyOpenCL.

- `bench_vm.py
  <https://github.com/pypr/compyle/tree/master/examples/bench_vm.py>`_:
  benchmarks the various vortex method implementations above for a comparison
  with Numba.


Read on for more details about Compyle.


Citing Compyle
---------------

If you find Compyle useful or just want to read a paper on it, please see:

- Aditya Bhosale and Prabhu Ramachandran, "Compyle: Python once, parallel
  computing anywhere", Proceedings of the 19th Python in Science Conference
  (SciPy 2020), July, 2020, Austin, Texas, USA.
  `doi:10.25080/Majora-342d178e-005
  <https://doi.org/10.25080/Majora-342d178e-005>`_ (`SciPy 2020 paper`_).
  **Won the best poster award.**

Accompanying the paper are:

- the `Compyle poster presentation <https://docs.google.com/presentation/d/1LS9XO5pQXz8G5d27RP5oWLFxUA-Fr5OvfVUGsgg86TQ/edit#slide=id.p>`_
- and the `Compyle poster video <https://www.youtube.com/watch?v=h2YpPPL6nEY>`_