/* StarPU --- Runtime system for heterogeneous multicore architectures.
 *
 * Copyright (C) 2009-2021  Université de Bordeaux, CNRS (LaBRI UMR 5800), Inria
 *
 * StarPU is free software; you can redistribute it and/or modify
 * it under the terms of the GNU Lesser General Public License as published by
 * the Free Software Foundation; either version 2.1 of the License, or (at
 * your option) any later version.
 *
 * StarPU is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 *
 * See the GNU Lesser General Public License in COPYING.LGPL for more details.
 */
/*! \page TasksInStarPU Tasks In StarPU
\section TaskGranularity Task Granularity
Like any other runtime, StarPU has some overhead to manage tasks. Since
it does smart scheduling and data management, this overhead is not always
negligible. The order of magnitude of the overhead is typically a couple of
microseconds, which is actually considerably smaller than the CUDA overhead itself. The
amount of work that a task does should thus be somewhat
bigger, to make sure that the overhead becomes negligible. The offline
performance feedback can provide a measure of task length, which should thus be
checked if poor performance is observed. To get a sense of how scalability
varies with task size, one can run
<c>tests/microbenchs/tasks_size_overhead.sh</c> which draws curves of the
speedup of independent tasks of very small sizes.
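As a rough illustration of why task length matters, the sketch below models the
fraction of time spent on useful work when each task pays a fixed overhead. The
function names and the fixed-overhead model are assumptions made for this
example; they are not part of the StarPU API, and the real overhead depends on
the scheduler and the machine.
\code{.c}
#include <assert.h>

/* Fraction of the time spent on useful work when every task pays a fixed
 * management overhead (simplified model, not a StarPU function). */
double task_efficiency(double task_us, double overhead_us)
{
    assert(task_us > 0.0 && overhead_us >= 0.0);
    return task_us / (task_us + overhead_us);
}

/* Smallest task duration, in microseconds, that reaches the target
 * efficiency (0 < target < 1) under the same model. */
double min_task_us(double overhead_us, double target)
{
    assert(overhead_us >= 0.0 && target > 0.0 && target < 1.0);
    return overhead_us * target / (1.0 - target);
}
\endcode
Under this model, an overhead of 2 microseconds calls for tasks of about 198
microseconds to keep 99% of the time for useful work.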
To determine what task size your application is actually using, one can use
<c>starpu_fxt_data_trace</c>, see \ref DataTrace .
The choice of scheduler also has an impact on the overhead: for instance, the
scheduler <c>dmda</c> takes time to make a decision, while <c>eager</c> does
not. <c>tasks_size_overhead.sh</c> can again be used to get a sense of how much
impact that has on the target machine.
\section TaskSubmission Task Submission
To let StarPU make online optimizations, tasks should be submitted
asynchronously as much as possible. Ideally, all tasks should be
submitted first, and the application should then only call
starpu_task_wait_for_all() or starpu_data_unregister() to wait for
termination. StarPU will then be able to rework the whole schedule, overlap
computation with communication, manage accelerator local memory usage, etc.
\section TaskPriorities Task Priorities
By default, StarPU will consider the tasks in the order they are submitted by
the application. If the application programmer knows that some tasks should
be performed in priority (for instance because their output is needed by many
other tasks and may thus be a bottleneck if not executed early
enough), the field starpu_task::priority should be set to provide the
priority information to StarPU.
\section TaskDependencies Task Dependencies
\subsection SequentialConsistency Sequential Consistency
By default, task dependencies are inferred from data dependencies (sequential
consistency) by StarPU. The application can however disable sequential consistency
for some data, and dependencies can be specifically expressed.
Setting (or unsetting) sequential consistency can be done at the data
level by calling starpu_data_set_sequential_consistency_flag() for a
specific data or starpu_data_set_default_sequential_consistency_flag()
for all data.
Sequential consistency can also be disabled at the task
level by setting the field starpu_task::sequential_consistency to \c 0.
Sequential consistency can also be set (or unset) for each handle of a
specific task by using the field
starpu_task::handles_sequential_consistency. When set, its value
should be an array whose number of elements is the number of
handles of the task, each element of the array being the sequential
consistency flag for the \c i-th handle of the task. The field can easily be
set when calling starpu_task_insert() with the flag
::STARPU_HANDLES_SEQUENTIAL_CONSISTENCY:
\code{.c}
char *seq_consistency = malloc(cl.nbuffers * sizeof(char));
seq_consistency[0] = 1;
seq_consistency[1] = 1;
seq_consistency[2] = 0;
ret = starpu_task_insert(&cl,
	STARPU_RW, handleA, STARPU_RW, handleB, STARPU_RW, handleC,
	STARPU_HANDLES_SEQUENTIAL_CONSISTENCY, seq_consistency,
	0);
free(seq_consistency);
\endcode
The internal algorithm used by StarPU to set up implicit dependency is
as follows:
\code{.c}
if (sequential_consistency(task) == 1)
    for(i=0 ; i<STARPU_TASK_GET_NBUFFERS(task) ; i++)
      if (sequential_consistency(i-th data, task) == 1)
        if (sequential_consistency(i-th data) == 1)
           create_implicit_dependency(...)
\endcode
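The pseudocode above can be turned into a small stand-alone model. The structure
<c>model_task</c> and the function <c>count_implicit_deps</c> below are
hypothetical names introduced for this sketch only; they mimic the three levels
of flags (task, handle within the task, data) and do not reproduce StarPU's
internal data structures.
\code{.c}
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the three levels of flags. */
struct model_task
{
    int sequential_consistency;            /* task-level flag */
    size_t nbuffers;                       /* number of handles */
    const unsigned char *handles_seq_cons; /* per-handle flags, or NULL */
    const int *data_seq_cons;              /* data-level flag of each handle */
};

/* Count the handles for which an implicit dependency would be created. */
size_t count_implicit_deps(const struct model_task *t)
{
    size_t i, n = 0;
    assert(t != NULL);
    if (!t->sequential_consistency)
        return 0;
    for (i = 0; i < t->nbuffers; i++)
    {
        /* Assumption of this model: no per-handle array means "all enabled". */
        int per_handle = t->handles_seq_cons ? t->handles_seq_cons[i] : 1;
        if (per_handle && t->data_seq_cons[i])
            n++;
    }
    return n;
}
\endcode
A dependency is thus only created for a handle when all three flags are enabled;
clearing any one of them is enough to suppress it.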
\subsection TasksAndTagsDependencies Tasks And Tags Dependencies
One can explicitly set dependencies between tasks using
starpu_task_declare_deps() or starpu_task_declare_deps_array(). Dependencies between tasks can also be
expressed through tags, by associating a tag to a task with the field
starpu_task::tag_id and using the function starpu_tag_declare_deps()
or starpu_tag_declare_deps_array().
The termination of a task can be delayed through the function
starpu_task_end_dep_add() which specifies the number of calls to the function
starpu_task_end_dep_release() needed to trigger the task termination. One can
also use starpu_task_declare_end_deps() or starpu_task_declare_end_deps_array()
to delay the termination of a task until the termination of other tasks.
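The counting behaviour of delayed termination can be pictured with the following
stand-alone sketch. The names <c>model_end_dep</c>, <c>model_end_dep_add</c> and
<c>model_end_dep_release</c> are hypothetical; they only mirror the roles of
starpu_task_end_dep_add() and starpu_task_end_dep_release(), and the model
ignores the execution of the task itself.
\code{.c}
#include <assert.h>

/* Hypothetical model of a task whose termination is being delayed. */
struct model_end_dep
{
    unsigned pending; /* releases still needed before termination */
};

/* Plays the role of starpu_task_end_dep_add(): require nb more releases. */
void model_end_dep_add(struct model_end_dep *d, unsigned nb)
{
    assert(d != 0);
    d->pending += nb;
}

/* Plays the role of starpu_task_end_dep_release().
 * Returns 1 when this call was the last one needed, 0 otherwise. */
int model_end_dep_release(struct model_end_dep *d)
{
    assert(d != 0 && d->pending > 0);
    d->pending--;
    return d->pending == 0;
}
\endcode
After adding two end dependencies, the first release leaves the task pending and
only the second one lets it terminate.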
\section SettingManyDataHandlesForATask Setting Many Data Handles For a Task
The maximum number of data a task can manage is fixed by the macro
\ref STARPU_NMAXBUFS which has a default value which can be changed
through the \c configure option \ref enable-maxbuffers "--enable-maxbuffers".
However, it is possible to define tasks managing more data by using
the field starpu_task::dyn_handles when defining a task and the field
starpu_codelet::dyn_modes when defining the corresponding codelet.
\code{.c}
enum starpu_data_access_mode modes[STARPU_NMAXBUFS+1] =
{
	STARPU_R, STARPU_R, ...
};
struct starpu_codelet dummy_big_cl =
{
	.cuda_funcs = { dummy_big_kernel },
	.opencl_funcs = { dummy_big_kernel },
	.cpu_funcs = { dummy_big_kernel },
	.cpu_funcs_name = { "dummy_big_kernel" },
	.nbuffers = STARPU_NMAXBUFS+1,
	.dyn_modes = modes
};
task = starpu_task_create();
task->cl = &dummy_big_cl;
task->dyn_handles = malloc(task->cl->nbuffers * sizeof(starpu_data_handle_t));
for(i=0 ; i<task->cl->nbuffers ; i++)
{
	task->dyn_handles[i] = handle;
}
starpu_task_submit(task);
\endcode
\code{.c}
starpu_data_handle_t *handles = malloc(dummy_big_cl.nbuffers * sizeof(starpu_data_handle_t));
for(i=0 ; i<dummy_big_cl.nbuffers ; i++)
{
	handles[i] = handle;
}
starpu_task_insert(&dummy_big_cl,
         	  STARPU_VALUE, &dummy_big_cl.nbuffers, sizeof(dummy_big_cl.nbuffers),
		  STARPU_DATA_ARRAY, handles, dummy_big_cl.nbuffers,
		  0);
\endcode
The whole code for this example is available in the
file <c>examples/basic_examples/dynamic_handles.c</c>.
\section SettingVariableDataHandlesForATask Setting a Variable Number Of Data Handles For a Task
Normally, the number of data handles given to a task is set with
starpu_codelet::nbuffers. This field can however be set to
\ref STARPU_VARIABLE_NBUFFERS, in which case starpu_task::nbuffers
must be set, and starpu_task::modes (or starpu_task::dyn_modes,
see \ref SettingManyDataHandlesForATask) should be used to specify the modes for
the handles.
\section UsingMultipleImplementationsOfACodelet Using Multiple Implementations Of A Codelet
One may want to write multiple implementations of a codelet for a single type of
device and let StarPU choose which one to run. As an example, we will show how
to use SSE to scale a vector. The codelet can be written as follows:
\code{.c}
#include <xmmintrin.h>
void scal_sse_func(void *buffers[], void *cl_arg)
{
    float *vector = (float *) STARPU_VECTOR_GET_PTR(buffers[0]);
    unsigned int n = STARPU_VECTOR_GET_NX(buffers[0]);
    unsigned int n_iterations = n/4;
    if (n % 4 != 0)
        n_iterations++;
    __m128 *VECTOR = (__m128*) vector;
    __m128 factor __attribute__((aligned(16)));
    factor = _mm_set1_ps(*(float *) cl_arg);
    unsigned int i;
    for (i = 0; i < n_iterations; i++)
        VECTOR[i] = _mm_mul_ps(factor, VECTOR[i]);
}
\endcode
\code{.c}
struct starpu_codelet cl =
{
    .cpu_funcs = { scal_cpu_func, scal_sse_func },
    .cpu_funcs_name = { "scal_cpu_func", "scal_sse_func" },
    .nbuffers = 1,
    .modes = { STARPU_RW }
};
\endcode
Schedulers which are multi-implementation aware (only <c>dmda</c> and
<c>pheft</c> for now) will use the performance models of all the
provided implementations, and pick the one which seems to be the fastest.
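Note that the SSE kernel above rounds the iteration count up, so it processes
the vector in whole groups of four floats and assumes the buffer is aligned and
padded accordingly. As a point of comparison, here is a portable sketch without
intrinsics that handles full blocks of four elements and then finishes the
remaining elements one by one, so that it never touches memory past <c>n</c>
elements. The function name <c>scal_blocked</c> is hypothetical and the function
takes plain arguments rather than StarPU buffers.
\code{.c}
#include <assert.h>
#include <stddef.h>

/* Scale n floats by factor: blocks of four first, then the scalar tail. */
void scal_blocked(float *vector, size_t n, float factor)
{
    size_t i;
    size_t full = n - (n % 4); /* number of elements in complete blocks */
    assert(vector != NULL || n == 0);
    for (i = 0; i < full; i += 4)
    {
        vector[i]     *= factor;
        vector[i + 1] *= factor;
        vector[i + 2] *= factor;
        vector[i + 3] *= factor;
    }
    /* Remaining 0 to 3 elements */
    for (; i < n; i++)
        vector[i] *= factor;
}
\endcode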
\section EnablingImplementationAccordingToCapabilities Enabling Implementation According To Capabilities
Some implementations may not run on some devices. For instance, some CUDA
devices do not support double floating point precision, and thus the kernel
execution would just fail; or the device may not have enough shared memory for
the implementation being used. The field starpu_codelet::can_execute
makes it possible to express this. For instance:
\code{.c}
static int can_execute(unsigned workerid, struct starpu_task *task, unsigned nimpl)
{
  const struct cudaDeviceProp *props;
  if (starpu_worker_get_type(workerid) == STARPU_CPU_WORKER)
    return 1;
  /* Cuda device */
  props = starpu_cuda_get_device_properties(workerid);
  if (props->major >= 2 || props->minor >= 3)
    /* At least compute capability 1.3, supports doubles */
    return 1;
  /* Old card, does not support doubles */
  return 0;
}
struct starpu_codelet cl =
{
    .can_execute = can_execute,
    .cpu_funcs = { cpu_func },
    .cpu_funcs_name = { "cpu_func" },
    .cuda_funcs = { gpu_func },
    .nbuffers = 1,
    .modes = { STARPU_RW }
};
\endcode
This can be essential e.g. when running on a machine which mixes various models
of CUDA devices, to benefit from the new models without crashing on old models.
Note: the function starpu_codelet::can_execute is called by the
scheduler each time it tries to match a task with a worker, and should
thus be very fast. The function starpu_cuda_get_device_properties()
provides quick access to the CUDA properties of CUDA devices to achieve
such efficiency.
Another example is to compile CUDA code for various compute capabilities,
resulting in two CUDA functions, e.g. <c>scal_gpu_13</c> for compute capability
1.3, and <c>scal_gpu_20</c> for compute capability 2.0. Both functions can be
provided to StarPU by using starpu_codelet::cuda_funcs, and
starpu_codelet::can_execute can then be used to rule out the
<c>scal_gpu_20</c> variant on a CUDA device which will not be able to execute it:
\code{.c}
static int can_execute(unsigned workerid, struct starpu_task *task, unsigned nimpl)
{
  const struct cudaDeviceProp *props;
  if (starpu_worker_get_type(workerid) == STARPU_CPU_WORKER)
    return 1;
  /* Cuda device */
  if (nimpl == 0)
    /* Trying to execute the 1.3 capability variant, we assume it is ok in all cases.  */
    return 1;
  /* Trying to execute the 2.0 capability variant, check that the card can do it.  */
  props = starpu_cuda_get_device_properties(workerid);
  if (props->major >= 2)
    /* At least compute capability 2.0, can run it */
    return 1;
  /* Old card, does not support 2.0, will not be able to execute the 2.0 variant.  */
  return 0;
}
struct starpu_codelet cl =
{
    .can_execute = can_execute,
    .cpu_funcs = { cpu_func },
    .cpu_funcs_name = { "cpu_func" },
    .cuda_funcs = { scal_gpu_13, scal_gpu_20 },
    .nbuffers = 1,
    .modes = { STARPU_RW }
};
\endcode
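Compute capability checks such as the ones above compare a (major, minor) pair
against a required version. A general way to write this comparison is shown in
the sketch below; the helper name <c>capability_at_least</c> is hypothetical and
the function takes plain integers, so it can be tried without CUDA or StarPU.
\code{.c}
#include <assert.h>

/* Return 1 if compute capability major.minor is at least
 * req_major.req_minor, 0 otherwise. */
int capability_at_least(int major, int minor, int req_major, int req_minor)
{
    assert(major >= 0 && minor >= 0 && req_major >= 0 && req_minor >= 0);
    /* The major number decides unless both are equal. */
    if (major != req_major)
        return major > req_major;
    return minor >= req_minor;
}
\endcode
In a <c>can_execute</c> function, the arguments would typically come from the
<c>major</c> and <c>minor</c> fields of the <c>cudaDeviceProp</c> structure.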
Another example is having specialized implementations for some given common
sizes, for instance here we have a specialized implementation for 1024x1024
matrices:
\code{.c}
static int can_execute(unsigned workerid, struct starpu_task *task, unsigned nimpl)
{
  const struct cudaDeviceProp *props;
  if (starpu_worker_get_type(workerid) == STARPU_CPU_WORKER)
    return 1;
  /* Cuda device */
  switch (nimpl)
  {
    case 0:
      /* Trying to execute the generic capability variant.  */
      return 1;
    case 1:
    {
      /* Trying to execute the size == 1024 specific variant.  */
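      /* The dimensions are the same on every memory node, main RAM is used here. */
      struct starpu_matrix_interface *interface = starpu_data_get_interface_on_node(task->handles[0], STARPU_MAIN_RAM);
      return STARPU_MATRIX_GET_NX(interface) == 1024 && STARPU_MATRIX_GET_NY(interface) == 1024;
    }
  }
  /* Unknown variant */
  return 0;
}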
struct starpu_codelet cl =
{
    .can_execute = can_execute,
    .cpu_funcs = { cpu_func },
    .cpu_funcs_name = { "cpu_func" },
    .cuda_funcs = { potrf_gpu_generic, potrf_gpu_1024 },
    .nbuffers = 1,
    .modes = { STARPU_RW }
};
\endcode
Note that the most generic variant should be provided first, as some schedulers are
not able to try the different variants.
\section InsertTaskUtility Insert Task Utility
StarPU provides the wrapper function starpu_task_insert() to ease
the creation and submission of tasks.
Here is the implementation of a codelet:
\code{.c}
void func_cpu(void *descr[], void *_args)
{
        int *x0 = (int *)STARPU_VARIABLE_GET_PTR(descr[0]);
        float *x1 = (float *)STARPU_VARIABLE_GET_PTR(descr[1]);
        int ifactor;
        float ffactor;
        starpu_codelet_unpack_args(_args, &ifactor, &ffactor);
        *x0 = *x0 * ifactor;
        *x1 = *x1 * ffactor;
}
struct starpu_codelet mycodelet =
{
        .cpu_funcs = { func_cpu },
        .cpu_funcs_name = { "func_cpu" },
        .nbuffers = 2,
        .modes = { STARPU_RW, STARPU_RW }
};
\endcode
And the call to the function starpu_task_insert():
\code{.c}
starpu_task_insert(&mycodelet,
                   STARPU_VALUE, &ifactor, sizeof(ifactor),
                   STARPU_VALUE, &ffactor, sizeof(ffactor),
                   STARPU_RW, data_handles[0],
                   STARPU_RW, data_handles[1],
                   0);
\endcode
The call to starpu_task_insert() is equivalent to the following
code:
\code{.c}
struct starpu_task *task = starpu_task_create();
task->cl = &mycodelet;
task->handles[0] = data_handles[0];
task->handles[1] = data_handles[1];
char *arg_buffer;
size_t arg_buffer_size;
starpu_codelet_pack_args(&arg_buffer, &arg_buffer_size,
                    STARPU_VALUE, &ifactor, sizeof(ifactor),
                    STARPU_VALUE, &ffactor, sizeof(ffactor),
                    0);
task->cl_arg = arg_buffer;
task->cl_arg_size = arg_buffer_size;
int ret = starpu_task_submit(task);
\endcode
Here is a similar call using ::STARPU_DATA_ARRAY:
\code{.c}
starpu_task_insert(&mycodelet,
                   STARPU_DATA_ARRAY, data_handles, 2,
                   STARPU_VALUE, &ifactor, sizeof(ifactor),
                   STARPU_VALUE, &ffactor, sizeof(ffactor),
                   0);
\endcode
If some part of the task insertion depends on the value of some computation,
the macro ::STARPU_DATA_ACQUIRE_CB can be very convenient. For
instance, assuming that the index variable <c>i</c> was registered as handle
<c>i_handle</c>:
\code{.c}
/* Compute which portion we will work on, e.g. pivot */
starpu_task_insert(&which_index, STARPU_W, i_handle, 0);
/* And submit the corresponding task */
STARPU_DATA_ACQUIRE_CB(i_handle, STARPU_R,
                       starpu_task_insert(&work, STARPU_RW, A_handle[i], 0));
\endcode
The macro ::STARPU_DATA_ACQUIRE_CB submits an asynchronous request for
acquiring data <c>i</c> for the main application, and will execute the code
given as third parameter when it is acquired. In other words, as soon as the
value of <c>i</c> computed by the codelet <c>which_index</c> can be read, the
portion of code passed as third parameter of ::STARPU_DATA_ACQUIRE_CB will
be executed, and is allowed to read from <c>i</c> to use it e.g. as an
index. Note that this macro is only available when compiling StarPU with
the compiler <c>gcc</c>.
StarPU also provides a utility function starpu_codelet_unpack_args() to retrieve the ::STARPU_VALUE arguments passed to the task. There are several ways of calling starpu_codelet_unpack_args().
\code{.c}
void func_cpu(void *descr[], void *_args)
{
        int ifactor;
        float ffactor;
        starpu_codelet_unpack_args(_args, &ifactor, &ffactor);
}
\endcode
\code{.c}
void func_cpu(void *descr[], void *_args)
{
        int ifactor;
        float ffactor;
        starpu_codelet_unpack_args(_args, &ifactor, 0);
        starpu_codelet_unpack_args(_args, &ifactor, &ffactor);
}
\endcode
\code{.c}
void func_cpu(void *descr[], void *_args)
{
        int ifactor;
        float ffactor;
        char buffer[100];
        starpu_codelet_unpack_args_and_copyleft(_args, buffer, 100, &ifactor, 0);
        starpu_codelet_unpack_args(buffer, &ffactor);
}
\endcode
\section GettingTaskChildren Getting Task Children
It may be interesting to get the list of tasks which depend on a given task,
notably when using implicit dependencies, since this list is computed by StarPU.
starpu_task_get_task_succs() provides it. For instance:
\code{.c}
struct starpu_task *tasks[4];
ret = starpu_task_get_task_succs(task, sizeof(tasks)/sizeof(*tasks), tasks);
\endcode
\section ParallelTasks Parallel Tasks
StarPU can leverage existing parallel computation libraries by means of
parallel tasks. A parallel task is a task which is run by a set of CPUs
(called a parallel or combined worker) at the same time, by using an existing
parallel CPU implementation of the computation to be achieved. This can also be
useful to improve the load balance between slow CPUs and fast GPUs: since CPUs
work collectively on a single task, the completion time of tasks on CPUs becomes
comparable to the completion time on GPUs, thus relieving the application from granularity
discrepancy concerns. <c>hwloc</c> support needs to be enabled to get
good performance, otherwise StarPU will not know how best to group
cores.
Two modes of execution exist to accommodate existing usages.
\subsection Fork-modeParallelTasks Fork-mode Parallel Tasks
In the Fork mode, StarPU will call the codelet function on one
of the CPUs of the combined worker. The codelet function can use
starpu_combined_worker_get_size() to get the number of threads it is
allowed to start to achieve the computation. The CPU binding mask for the whole
set of CPUs is already enforced, so that threads created by the function will
inherit the mask, and thus execute where StarPU expected, the OS being in charge
of choosing how to schedule threads on the corresponding CPUs. The application
can also choose to bind threads by hand, using e.g. <c>sched_getaffinity</c> to know
the CPU binding mask that StarPU chose.
For instance, using OpenMP (full source is available in
<c>examples/openmp/vector_scal.c</c>):
\snippet forkmode.c To be included. You should update doxygen if you see this text.
Other examples include for instance calling a BLAS parallel CPU implementation
(see <c>examples/mult/xgemm.c</c>).
\subsection SPMD-modeParallelTasks SPMD-mode Parallel Tasks
In the SPMD mode, StarPU will call the codelet function on
each CPU of the combined worker. The codelet function can use
starpu_combined_worker_get_size() to get the total number of CPUs
involved in the combined worker, and thus the number of calls that are made in
parallel to the function, and starpu_combined_worker_get_rank() to get
the rank of the current CPU within the combined worker. For instance:
\code{.c}
static void func(void *buffers[], void *_args)
{
    unsigned i;
    float *factor = _args;
    struct starpu_vector_interface *vector = buffers[0];
    unsigned n = STARPU_VECTOR_GET_NX(vector);
    float *val = (float *)STARPU_VECTOR_GET_PTR(vector);
    /* Compute slice to compute */
    unsigned m = starpu_combined_worker_get_size();
    unsigned j = starpu_combined_worker_get_rank();
    unsigned slice = (n+m-1)/m;
    for (i = j * slice; i < (j+1) * slice && i < n; i++)
        val[i] *= *factor;
}
static struct starpu_codelet cl =
{
    .modes = { STARPU_RW },
    .type = STARPU_SPMD,
    .max_parallelism = INT_MAX,
    .cpu_funcs = { func },
    .cpu_funcs_name = { "func" },
    .nbuffers = 1,
};
\endcode
Of course, this trivial example will not really benefit from parallel task
execution, and was only meant to be simple to understand.  The benefit comes
when the computation to be done is such that threads have to e.g. exchange
intermediate results, or write to the data in a complex but safe way in the same
buffer.
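The slice arithmetic of the SPMD example can be checked in isolation. The sketch
below is plain C with no StarPU dependency; the names <c>struct slice</c> and
<c>spmd_slice</c> are hypothetical, and <c>size</c> and <c>rank</c> stand for
the values that starpu_combined_worker_get_size() and
starpu_combined_worker_get_rank() would return.
\code{.c}
#include <assert.h>

/* Half-open range [begin, end) of elements handled by one CPU. */
struct slice
{
    unsigned begin;
    unsigned end;
};

/* Split n elements between `size` CPUs and return the part of `rank`. */
struct slice spmd_slice(unsigned n, unsigned size, unsigned rank)
{
    struct slice s;
    unsigned len;
    assert(size > 0 && rank < size);
    len = (n + size - 1) / size; /* ceil(n / size) */
    s.begin = rank * len;
    s.end = s.begin + len;
    /* Clamp the last slices so that they do not run past the vector. */
    if (s.begin > n)
        s.begin = n;
    if (s.end > n)
        s.end = n;
    return s;
}
\endcode
For a vector of 10 elements and a combined worker of 4 CPUs, the ranks get the
ranges [0,3), [3,6), [6,9) and [9,10). When the vector is shorter than the
number of CPUs times the slice length, the last ranks may get an empty range.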
\subsection ParallelTasksPerformance Parallel Tasks Performance
To benefit from parallel tasks, a parallel-task-aware StarPU scheduler has to
be used. When exposed to codelets with the flag ::STARPU_FORKJOIN or
::STARPU_SPMD, the schedulers <c>pheft</c> (parallel-heft) and <c>peager</c>
(parallel eager) will indeed also try to execute tasks with
several CPUs. They will automatically try the various available combined
worker sizes (making several measurements for each worker size) and
thus be able to avoid choosing a large combined worker if the codelet
does not actually scale well.
This is however for now only a proof of concept, and has not really been optimized yet.
\subsection CombinedWorkers Combined Workers
By default, StarPU creates combined workers according to the architecture
structure as detected by <c>hwloc</c>. It means that for each object of the <c>hwloc</c>
topology (NUMA node, socket, cache, ...) a combined worker will be created. If
some nodes of the hierarchy have a large arity (e.g. many cores in a socket
without a hierarchy of shared caches), StarPU will create combined workers of
intermediate sizes. The variable \ref STARPU_SYNTHESIZE_ARITY_COMBINED_WORKER
makes it possible to tune the maximum arity between levels of combined workers.
The combined workers actually produced can be seen in the output of the
tool <c>starpu_machine_display</c> (the environment variable
\ref STARPU_SCHED has to be set to a combined worker-aware scheduler such
as <c>pheft</c> or <c>peager</c>).
\subsection ConcurrentParallelTasks Concurrent Parallel Tasks
Unfortunately, many environments and libraries do not support concurrent
calls.
For instance, most OpenMP implementations (including the main ones) do not
support concurrent <c>pragma omp parallel</c> statements without nesting them in
another <c>pragma omp parallel</c> statement, but StarPU does not yet support
creating its CPU workers by using such a pragma.
Other parallel libraries are also not safe when being invoked concurrently
from different threads, due to the use of global variables in their sequential
sections for instance.
The solution is then to use only one combined worker at a time.  This can be
done by setting the field starpu_conf::single_combined_worker to <c>1</c>, or
setting the environment variable \ref STARPU_SINGLE_COMBINED_WORKER
to <c>1</c>. StarPU will then run only one parallel task at a time (but other
CPU and GPU tasks are not affected and can be run concurrently). The parallel
task scheduler will however still try varying combined worker
sizes to look for the most efficient ones.
\subsection SynchronizationTasks Synchronization Tasks
For the convenience of the application, it may be useful to define tasks which do not
actually perform any computation, but for instance carry dependencies between other
tasks or tags, or are submitted in callbacks, etc.
The obvious way is of course to make kernel functions empty, but such a task will
then have to wait for a worker to become ready, transfer data, etc.
A much lighter way to define a synchronization task is to set its starpu_task::cl
field to <c>NULL</c>. The task will thus be a mere synchronization point,
without any data access or execution content: as soon as its dependencies become
available, it will terminate, call the callbacks, and release dependencies.
An intermediate solution is to define a codelet with its
starpu_codelet::where field set to \ref STARPU_NOWHERE, for instance:
\code{.c}
struct starpu_codelet cl =
{
	.where = STARPU_NOWHERE,
	.nbuffers = 1,
	.modes = { STARPU_R },
};
task = starpu_task_create();
task->cl = &cl;
task->handles[0] = handle;
starpu_task_submit(task);
\endcode
will create a task which simply waits for the value of <c>handle</c> to be
available for read. This task can then be depended on, etc.
*/