Tutorial
========

Getting started
---------------

Before you can use PyCuda, you have to import and initialize it::

  import pycuda.driver as cuda
  import pycuda.autoinit
  from pycuda.compiler import SourceModule

Note that you do not *have* to use :mod:`pycuda.autoinit`--
initialization, context creation, and cleanup can also be performed
manually, if desired.
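
For instance, a minimal manual setup might look like the following
sketch (assuming a single GPU at index 0; only documented driver calls
are used)::

  import pycuda.driver as cuda

  cuda.init()               # initialize the driver API
  dev = cuda.Device(0)      # pick the first GPU
  ctx = dev.make_context()  # create and activate a context on it

  # ... work with the device ...

  ctx.pop()                 # deactivate the context before exiting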

Transferring Data
-----------------

The next step in most programs is to transfer data onto the device.
In PyCuda, you will mostly transfer data from :mod:`numpy` arrays
on the host. (In fact, anything that satisfies the Python buffer
interface will work, even :class:`bytes`.) Let's make a 4x4 array
of random numbers::

  import numpy
  a = numpy.random.randn(4,4)

But wait--*a* consists of double precision numbers, and many nVidia
devices only support single precision::

  a = a.astype(numpy.float32)

Finally, we need somewhere to transfer data to, so we need to
allocate memory on the device::

  a_gpu = cuda.mem_alloc(a.nbytes)

As a last step, we need to transfer the data to the GPU::

  cuda.memcpy_htod(a_gpu, a)

Executing a Kernel
------------------

For this tutorial, we'll stick to something simple: We will write code to
double each entry in *a_gpu*. To this end, we write the corresponding CUDA C
code, and feed it into the constructor of a
:class:`pycuda.compiler.SourceModule`::

  mod = SourceModule("""
    __global__ void doublify(float *a)
    {
      int idx = threadIdx.x + threadIdx.y*4;
      a[idx] *= 2;
    }
    """)

If there aren't any errors, the code is now compiled and loaded onto the
device. We find a reference to our :class:`pycuda.driver.Function` and call
it, specifying *a_gpu* as the argument, and a block size of 4x4::

  func = mod.get_function("doublify")
  func(a_gpu, block=(4,4,1))

Finally, we fetch the data back from the GPU and display it, together with the
original *a*::

  a_doubled = numpy.empty_like(a)
  cuda.memcpy_dtoh(a_doubled, a_gpu)
  print(a_doubled)
  print(a)

This will print something like the following::

  [[ 0.51360393  1.40589952  2.25009012  3.02563429]
   [-0.75841576 -1.18757617  2.72269917  3.12156057]
   [ 0.28826082 -2.92448163  1.21624792  2.86353827]
   [ 1.57651746  0.63500965  2.21570683 -0.44537592]]
  [[ 0.25680196  0.70294976  1.12504506  1.51281714]
   [-0.37920788 -0.59378809  1.36134958  1.56078029]
   [ 0.14413041 -1.46224082  0.60812396  1.43176913]
   [ 0.78825873  0.31750482  1.10785341 -0.22268796]]

It worked! That completes our walkthrough. Thankfully, PyCuda takes
over from here and does all the cleanup for you, so you're done.
Stick around for some bonus material in the next section, though.

(You can find the code for this demo as :file:`examples/demo.py` in the PyCuda
source distribution.)
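
A note on sizing: the 4x4 block above matches the shape of *a*, so each
of the 16 threads handles exactly one entry. For arrays larger than a
single block, the kernel would fold *blockIdx* into its index as well.
A rough sketch (the kernel *doublify_big* and its length argument are
hypothetical, not part of the demo)::

  mod_big = SourceModule("""
    __global__ void doublify_big(float *a, int n)
    {
      // global index across all blocks of the grid
      int idx = threadIdx.x + blockIdx.x * blockDim.x;
      if (idx < n)  // surplus threads in the last block do nothing
        a[idx] *= 2;
    }
    """)

  n = 16  # a_gpu holds 4*4 floats
  func_big = mod_big.get_function("doublify_big")
  func_big(a_gpu, numpy.int32(n),
           block=(256, 1, 1), grid=((n + 255) // 256, 1))

(Running this doubles *a_gpu* a second time, of course.)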

Shortcuts for Explicit Memory Copies
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :class:`pycuda.driver.In`, :class:`pycuda.driver.Out`, and
:class:`pycuda.driver.InOut` argument handlers can simplify some of the memory
transfers. For example, instead of creating *a_gpu*, if overwriting *a*
with the result is acceptable, the following code can be used::

  func(cuda.InOut(a), block=(4, 4, 1))
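
Likewise, :class:`pycuda.driver.In` copies an argument to the device
only, and :class:`pycuda.driver.Out` copies it back only. A small
sketch, using a hypothetical two-argument kernel that is not part of
the demo::

  mod2 = SourceModule("""
    __global__ void copy_doubled(float *dest, float *src)
    {
      int idx = threadIdx.x + threadIdx.y*4;
      dest[idx] = 2*src[idx];
    }
    """)
  copy_doubled = mod2.get_function("copy_doubled")

  src = numpy.random.randn(4, 4).astype(numpy.float32)
  dest = numpy.zeros_like(src)
  # src is only copied in; dest is only copied back out
  copy_doubled(cuda.Out(dest), cuda.In(src), block=(4, 4, 1))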

Prepared Invocations
^^^^^^^^^^^^^^^^^^^^

Function invocation using the built-in :meth:`pycuda.driver.Function.__call__`
method incurs overhead for type identification (see :ref:`reference-doc`). To
achieve the same effect as above without this overhead, the function is bound
to argument types (as designated by Python's standard library :mod:`struct`
module), and then called. This also avoids having to assign explicit argument
sizes using the `numpy.number` classes::

  grid = (1, 1)
  block = (4, 4, 1)
  func.prepare("P")
  func.prepared_call(grid, block, a_gpu)
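
The format string follows the conventions of the :mod:`struct` module:
``"P"`` stands for a pointer. If the kernel also took, say, a scale
factor as a 32-bit float (a hypothetical variant of *doublify*, not the
kernel above), one might write::

  func.prepare("Pf")  # a pointer, then a 32-bit float
  func.prepared_call(grid, block, a_gpu, 2.0)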

Bonus: Abstracting Away the Complications
-----------------------------------------

Using a :class:`pycuda.gpuarray.GPUArray`, the same effect can be
achieved with much less writing::

  import pycuda.gpuarray as gpuarray
  import pycuda.driver as cuda
  import pycuda.autoinit
  import numpy

  a_gpu = gpuarray.to_gpu(numpy.random.randn(4,4).astype(numpy.float32))
  a_doubled = (2*a_gpu).get()
  print(a_doubled)
  print(a_gpu)
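
:class:`pycuda.gpuarray.GPUArray` supports a good part of the usual
:mod:`numpy`-style arithmetic directly on the device. A brief sketch::

  b_gpu = gpuarray.zeros((4, 4), dtype=numpy.float32)  # allocated on the GPU
  c_gpu = a_gpu + b_gpu                                # elementwise, on the device
  print((3 * c_gpu).get())                             # compute, then copy back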

Advanced Topics
---------------

Structures
^^^^^^^^^^

(contributed by Nicholas Tung, find the code in :file:`examples/demo_struct.py`)

Suppose we have the following structure, for doubling a number of variable
length arrays::

  mod = SourceModule("""
      struct DoubleOperation {
          int datalen, __padding; // so 64-bit ptrs can be aligned
          float *ptr;
      };

      __global__ void double_array(DoubleOperation *a) {
          a = &a[blockIdx.x];
          for (int idx = threadIdx.x; idx < a->datalen; idx += blockDim.x) {
              a->ptr[idx] *= 2;
          }
      }
      """)

Each block in the grid (see the CUDA documentation) will double one of the
arrays. The `for` loop allows more data elements than threads to be
doubled, though it is not efficient if one can guarantee a sufficient
number of threads. Next, a wrapper class for the structure is created,
and two arrays are instantiated::

  class DoubleOpStruct:
      mem_size = 8 + numpy.intp(0).nbytes
      def __init__(self, array, struct_arr_ptr):
          self.data = cuda.to_device(array)
          self.shape, self.dtype = array.shape, array.dtype
          # fill in the datalen field ...
          cuda.memcpy_htod(int(struct_arr_ptr),
                           numpy.int32(array.size).tobytes())
          # ... and the device pointer, after the 8-byte int+padding prefix
          cuda.memcpy_htod(int(struct_arr_ptr) + 8,
                           numpy.intp(int(self.data)).tobytes())
      def __str__(self):
          return str(cuda.from_device(self.data, self.shape, self.dtype))

  struct_arr = cuda.mem_alloc(2 * DoubleOpStruct.mem_size)
  do2_ptr = int(struct_arr) + DoubleOpStruct.mem_size

  array1 = DoubleOpStruct(numpy.array([1, 2, 3], dtype=numpy.float32), struct_arr)
  array2 = DoubleOpStruct(numpy.array([0, 4], dtype=numpy.float32), do2_ptr)
  print("original arrays", array1, array2)

This code uses the :func:`pycuda.driver.to_device` and
:func:`pycuda.driver.from_device` functions to allocate and copy values, and
demonstrates how offsets to an allocated block of memory can be used. Finally,
the code can be executed; the following demonstrates doubling both arrays, then
only the second::

  func = mod.get_function("double_array")
  func(struct_arr, block=(32, 1, 1), grid=(2, 1))
  print("doubled arrays", array1, array2)

  func(numpy.intp(do2_ptr), block=(32, 1, 1), grid=(1, 1))
  print("doubled second only", array1, array2, "\n")

Where to go from here
---------------------

Once you feel sufficiently familiar with the basics, feel free to dig into the
:ref:`reference-doc`. For more examples, check the :file:`examples/`
subdirectory of the distribution. This folder also contains several benchmarks
that show the difference between GPU- and CPU-based calculations. As a
reference for how things are done, PyCuda's test suite in the :file:`test/`
subdirectory of the distribution may also be of help.