1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166
|
=head1 NAME
PDL::ParallelCPU - Parallel processor multi-threading support in PDL
=head1 DESCRIPTION
PDL has support for splitting up numerical processing
between multiple parallel processor threads (or pthreads) using the I<set_autopthread_targ>
and I<set_autopthread_size> functions.
This can improve processing performance (by greater than 2-4X in most cases)
by taking advantage of multi-core and/or multi-processor machines.
As of 2.059, L<PDL::Core/online_cpus> is used to set the number of
threads used if C<PDL_AUTOPTHREAD_TARG> is not set.
=head1 SYNOPSIS
use PDL;
# Set target of 4 parallel pthreads to create, with a lower limit of
# 5Meg elements for splitting processing into parallel pthreads.
set_autopthread_targ(4);
set_autopthread_size(5);
$x = zeroes(5000,5000); # Create 25Meg element array
$y = $x + 5; # Processing will be split up into multiple pthreads
# Get the actual number of pthreads for the last
# processing operation.
$actualPthreads = get_autopthread_actual();
# Or compare these to see CPU usage (first one only 1 pthread, second one 10)
# in the PDL shell:
$x = ones(10,1000,10000); set_autopthread_targ(1); $y = sin($x)*cos($x); p get_autopthread_actual;
$x = ones(10,1000,10000); set_autopthread_targ(10); $y = sin($x)*cos($x); p get_autopthread_actual;
=head1 Terminology
To reduce the confusion that existed in PDL before 2.075, this document uses
B<pthreading> to refer to I<processor multi-threading>, which is the use of multiple processor threads
to split up numerical processing into parallel operations.
=head1 Functions that control PDL pthreads
This is a brief listing and description of the PDL pthreading functions, see the L<PDL::Core> docs
for detailed information.
=over 5
=item set_autopthread_targ
Set the target number of processor-threads (pthreads) for multi-threaded processing. Setting auto_pthread_targ
to 0 means that no pthreading will occur.
See L<PDL::Core|set_autopthread_targ> for details.
=item set_autopthread_size
Set the minimum size (in Meg-elements or 2**20 elements) of the largest PDL involved in a function where auto-pthreading will
be performed. For small PDLs, it probably isn't worth starting multiple pthreads, so this function
is used to define a minimum threshold where auto-pthreading won't be attempted.
See L<PDL::Core|set_autopthread_size> for details.
=item get_autopthread_actual
Get the actual number of pthreads executed for the last pdl processing function.
See L<PDL::get_autopthread_actual> for details.
=back
=head1 Global Control of PDL pthreading using Environment Variables
PDL pthreading can be globally turned on, without modifying existing code by setting
environment variables B<PDL_AUTOPTHREAD_TARG> and B<PDL_AUTOPTHREAD_SIZE> before running a PDL script.
These environment variables are checked when PDL starts up and calls to I<set_autopthread_targ> and
I<set_autopthread_size> functions made with the environment variable's values.
For example, if the environment var B<PDL_AUTOPTHREAD_TARG> is set to 3, and B<PDL_AUTOPTHREAD_SIZE> is
set to 10, then any pdl script will run as if the following lines were at the top of the file:
set_autopthread_targ(3);
set_autopthread_size(10);
=head1 How It Works
The auto-pthreading process works by analyzing broadcast array dimensions in PDL operations (those above the operation's "signature" dimensions)
and splitting up processing according to those and the desired number of
pthreads (i.e. the pthread target or pthread_targ). The offsets,
increments, and dimension-sizes (in case the whole dimension does
not divide neatly by the number of pthreads) that PDL uses to step
thru the data in memory are modified for each pthread so each one sees a different set of data when
performing processing.
B<Example>
$x = sequence(20,4,3); # Small 3-D Array, size 20,4,3
# Setup auto-pthreading:
set_autopthread_targ(2); # Target of 2 pthreads
set_autopthread_size(0); # Zero so that the small PDLs in this example will be pthreaded
# This will be split up into 2 pthreads
$c = maximum($x);
For the above example, the I<maximum> function has a signature of C<(a(n); [o]c())>, which means that the first
dimension of $x (size 20) is a I<Core> dimension of the I<maximum> function. The other dimensions of $x (size 4,3)
are I<broadcast> dimensions (i.e. will be broadcasted-over in the I<maximum> function.
The auto-pthreading algorithm examines the broadcasted dims of size (4,3) and picks the 4 dimension,
since it is evenly divisible by the autopthread_targ of 2. The processing of the maximum function is then
split into two pthreads on the size-4 dimension, with dim indexes 0,2 processed by one pthread
and dim indexes 1,3 processed by the other pthread.
=head1 Limitations
=head2 Must have POSIX Threads Enabled
Auto-pthreading only works if your PDL installation was compiled with POSIX threads enabled. This is normally
the case if you are running on Windows, Linux, MacOS X, or other unix variants.
=head2 Non-Threadsafe Code
Not all the libraries that PDL interfaces to are thread-safe, i.e. they aren't written to operate
in a multi-threaded environment without crashing or causing side-effects. Some examples in the PDL
core is the I<fft> function and the I<pnmout> functions.
To operate properly with these types of functions, the PPCode flag B<NoPthread> has been introduced to indicate
a function as I<not> being pthread-safe. See L<PDL::PP> docs for details.
=head2 Size of PDL Dimensions and pthread Target
As of PDL 2.058, the broadcasted dimension sizes do not need to divide
exactly by the pthread target, although if one does, it will be
used.
If no dimension is as large as the pthread target, the number of
pthreads will be the size of the largest broadcasted dimension.
In order to minimise idle CPUs on the last iteration at the end of
the broadcasted dimension, the algorithm that picks the dimension to
pthread on aims for the largest remainder in dividing the pthread
target into the sizes of the broadcasted dimensions. For example, if
a PDL has broadcasted dimension sizes of (9,6,2) and the I<auto_pthread_targ>
is 4, the algorithm will pick the 1-th (size 6), as that will leave
a remainder of 2 (leaving 2 idle at the end) in preference to one
with size 9, which would leave 3 idle.
=head2 Speed improvement might be less than you expect.
If you have an 8-core machine and call I<auto_pthread_targ> with 8
to generate 8 parallel pthreads, you
probably won't get a 8X improvement in speed, due to memory bandwidth issues. Even though you have 8 separate
CPUs crunching away on data, you will have (for most common machine architectures) common RAM that now becomes
your bottleneck. For simple calculations (e.g simple additions) you can run into a performance limit at about
4 pthreads. For more CPU-bound calculations the limit will be higher.
=head1 COPYRIGHT
Copyright 2011 John Cerney. You can distribute and/or
modify this document under the same terms as the current Perl license.
See: L<http://dev.perl.org/licenses/>
|