# HPCToolkit By Example

**HPCToolkit** is a suite of tools for tracing, profiling and analyzing
parallel programs. It can accurately measure a program's amount of work
and resource consumption, as well as user-defined derived metrics such
as FLOPS inefficiency and (lack of) scaling behavior. These metrics can
then be correlated with source code to pinpoint hotspots.
Being exclusively based on statistical sampling (instead of annotating
source code or intercepting MPI calls), it typically adds an overhead of
only 1-3% for reasonable choices of sampling periods, and can work with
fully optimized binaries. An optional (but recommended) pre-analysis of
the static program structure accounts for compiler transformations such
as inlining and pipelining, so that those performance metrics can be
more accurately associated with loops and functions even in a
highly-optimized binary.
Thanks to accurate call-stack unwinding, it can determine which
particular call path leads to a given performance
behavior, shifting blame from symptoms to causes. It also enables
fine-grained tracing, which helps to identify load imbalance and to
understand program behavior across nodes and over time.
## Installation
The toolkit may be downloaded from the
[official website](http://hpctoolkit.org/software.html),
along with its dependencies, which, for user convenience, have been
packaged in a single tarball, denoted `hpctoolkit-externals`. Binaries
for the visualization tools `hpcviewer` (for profiling) and `hpctraceviewer`
(for tracing) may also be downloaded from the same page; these need a
reasonably recent version of the Java runtime (1.5.\* will suffice).
In this tutorial, we will be using all four packages from the website
above. Furthermore, we will make use of
[PAPI 4.4.0](http://icl.cs.utk.edu/projects/papi/downloads/papi-4.4.0.tar.gz)
(see the section *Hardware counters* below).
First, build the `externals` package:
```
tar xzf hpctoolkit-externals-5.3.2-r3950.tar.gz
cd hpctoolkit-externals-5.3.2-r3950
mkdir build && cd build
../configure --with-mpi=$MPI_PATH
make install
```
Here we assume that MPI was installed in `$MPI_PATH`.
Finally, we build the main toolkit. Let `$EXT_PATH` be the path to
the externals' `build` folder, and `$PAPI_PATH` the path to the PAPI
installation. The procedure is then similar to before:
```
tar xzf hpctoolkit-5.3.2-r3950.tar.gz
cd hpctoolkit-5.3.2-r3950
mkdir build && cd build
../configure --with-externals=$EXT_PATH --with-papi=$PAPI_PATH [--with-mpi=$MPI_PATH]
make install
```
If MPICH2 is not in the `$PATH`, the additional flag
`--with-mpi=$MPI_PATH` must be passed (otherwise HPCToolkit will not be
built with MPI support). Before continuing, ensure that the following
line is in the configuration summary:
```
configure: mpi support?: yes
```
Optionally the two visualizers may also be installed (on development
machines). In either case, the following command should be executed in
the uncompressed folder:
```
./install $HPCTOOLKIT_PATH
```
where `$HPCTOOLKIT_PATH` is the folder where the main
toolkit above was installed.
More information, especially concerning different platforms, may be
found in the
[official instructions page](http://hpctoolkit.org/software-instructions.html).
## Numerical integration
This very short program estimates $\pi$ by using the method of
trapezoids with the following identity:
$\int_0^1\frac{4}{1+x^2}\ dx = \pi$
It is part of the MPICH2 source code distribution, and may be found at
`examples/cpi.c`.
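As a rough serial illustration of the identity above (this is not the MPICH2 source, and the names `integrand` and `estimate_pi` are made up for this sketch), the trapezoid rule on $[0, 1]$ can be written as follows:

```c
#include <assert.h>

/* Integrand of the identity above: 4 / (1 + x^2). */
static double integrand(double x)
{
    return 4.0 / (1.0 + x * x);
}

/* Composite trapezoid rule on [0, 1] with n equal subintervals.
 * Hypothetical helper, not taken from cpi.c. */
double estimate_pi(int n)
{
    assert(n > 0);
    const double h = 1.0 / n;                 /* width of one subinterval */
    double sum = 0.5 * (integrand(0.0) + integrand(1.0));
    for (int i = 1; i < n; ++i)
        sum += integrand(i * h);              /* interior points have weight 1 */
    return h * sum;
}
```

In a parallel version, each rank would typically sum a subset of the interior points and the partial sums would be combined with a reduction; the sketch above leaves that out to keep the numerical part visible.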
As HPCToolkit is based on sampling, there is no need for manual source
code instrumentation. Compilation remains mostly unchanged; more
importantly, it is highly recommended to compile the target program with
debugging information and optimization turned on:
```
mpicc -g -O3 cpi.c -o cpi -lm
```
However, should the application be statically linked (such as on Compute
Node Linux or BlueGene/P), there is the extra step of linking with
`hpclink`.
```
hpclink <regular-linker> <regular-linker-arguments>
# For example:
hpclink mpicc -o cpi cpi.o -lm
```
(More information about static linking can be found in chapter 9 of the
[users manual](http://hpctoolkit.org/manual/HPCToolkit-users-manual.pdf).)
Next, we must recover the static program structure from the linked
binary, for which there is a tool named `hpcstruct`, typically launched
with no extra arguments:
```
hpcstruct ./cpi
```
This will build a representation of the program's structure in
`cpi.hpcstruct` (e.g. loop nesting, inlining) to be used later when
profiling/tracing, so that performance metrics may accurately be
associated with the correct code construct (be it a loop or a
procedure).

Execution differs in that `hpcrun` should be used to launch the
executable (in addition to mpirun):
```
mpirun -np 8 hpcrun <hpcrun-args> ./cpi
```
The argument for `hpcrun` will define which measurements will be made,
and how often. By default, `HPCToolkit` comes with a handful of events;
a list may be obtained via the following command:
```
hpcrun -L ./cpi
```
Events are passed as arguments of the form `--event` $e_i$`@`$p_i$, where:
- $e_i$ is the event identifier (`WALLCLOCK`, `MEMLEAK`, etc.);
- $p_i$ is the sampling period, in units meaningful to the event: microseconds for
  `WALLCLOCK`, cycles for `PAPI_TOT_CYC`, cache accesses for `PAPI_L2_DCA`, etc.

(Statically linked applications set the environment variable
`HPCRUN_EVENT_LIST` instead, which uses the same format and separates
events with a ';'. Please check the chapter referred to above.)
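To make the `EVENT@PERIOD` format concrete, here is a small sketch that assembles a `;`-separated list of the kind described for `HPCRUN_EVENT_LIST`. The `struct hpc_event` type and the `format_event_list` function are hypothetical names introduced for illustration only; they are not part of HPCToolkit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical pair of event identifier and sampling period. */
struct hpc_event {
    const char   *name;    /* e.g. "WALLCLOCK" */
    unsigned long period;  /* in units meaningful to the event */
};

/* Writes "NAME@PERIOD;NAME@PERIOD;..." into buf.
 * Returns the string length, or -1 if buf is too small. */
int format_event_list(const struct hpc_event *events, size_t n,
                      char *buf, size_t size)
{
    assert(events && buf && size > 0);
    size_t used = 0;
    buf[0] = '\0';
    for (size_t i = 0; i < n; ++i) {
        int w = snprintf(buf + used, size - used, "%s%s@%lu",
                         i ? ";" : "", events[i].name, events[i].period);
        if (w < 0 || (size_t)w >= size - used)
            return -1;                        /* output would be truncated */
        used += (size_t)w;
    }
    return (int)used;
}
```

The resulting string could then be exported in the job script before launching a statically linked binary; for dynamically linked programs the same pairs are simply passed as repeated `--event` arguments to `hpcrun`.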
Now we pick an arbitrary metric to use, such as the number of
floating-point division instructions:
```
mpirun -np 12 hpcrun --event WALLCLOCK@400000 --event PAPI_FDV_INS@10000 ./cpi
```
This will create a folder named `hpctoolkit-cpi-measurements`, with
entries for every rank used during runtime.
Finally, we combine the measurements with the program structure,
obtaining the final profiling database, using the command `hpcprof`:
```
hpcprof -S cpi.hpcstruct -I ./'*' hpctoolkit-cpi-measurements
```
The parameter `-I` should point to the folder containing the program's
source code. If the code is distributed among several directories, the `*`
wildcard indicates that the directory should be searched recursively.
Note the single quotes around `*`: they prevent the
shell from expanding it.
Much more information about any of these topics can be found in chapter
3 of the users manual.
## Hardware counters
If compiled with [PAPI](http://icl.cs.utk.edu/papi/) support, HPCToolkit
can record low-level events such as the number of mispredicted branches,
cache hits/misses and so on (to the extent supported by the hardware),
on a function/loop granularity. This information can help developers
find subtle, possibly architecture-specific performance bugs.
Additionally, HPCToolkit supports creating derived metrics from those
provided by PAPI, such as the difference between the FLOPS/cycle ratio
of a given loop, and the peak one seen during execution.
No modifications of the code (nor of the build process) are required to
support this; like any other event, those from PAPI are enabled at
run-time through the command-line argument `--event COUNTER_NAME@PERIOD`
to `hpcrun`. The command `papi_avail` lists all hardware counters
available on the processor, but only those also listed by the `hpcrun -L`
command above can be used, since some of the counters reported by
`papi_avail` are derived metrics (i.e. exposed as a convenience by PAPI,
but not part of the processor's counter interface).
For example, to sample floating point operations once every `400000`
operations, as well as the number of L2 cache misses once every `100000`
misses, when spawning a job with 12 ranks, the following command could
be used:
```
mpirun -np 12 hpcrun --event PAPI_FP_OPS@400000 --event PAPI_L2_TCM@100000 ./program
```
Derived metrics are defined at analysis time (i.e. using the visualizer
`hpcviewer`).
### Matrix-Matrix multiplication
In this example, we show how PAPI can be used to find an inefficient use
of cache in the context of matrix-matrix multiplications. The attached
[code](../text/Mmult.c) implements the
[naïve textbook algorithm](http://en.wikipedia.org/wiki/Matrix_multiplication#Matrix_product_.28two_matrices.29),
which is inefficient in the following sense. Since C/C++ are row-major,
when calculating $X = AB$, although we conceptually use $B$
column-wise, the memory system is in fact fetching a rectangular block
(whose height is that of the column, and whose length is determined by
the cache block size), of which we only use the first column. Since this
column is reused for every row of $A$, both of which may be very
large, by the time we actually start using the rest of said block it
will likely have been evicted to make space for $A$'s rows.
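For reference, the textbook algorithm has roughly the following shape. This is a serial sketch under the assumption of contiguous row-major storage, not the attached code itself, and `matmul_naive` is a name chosen for this example.

```c
#include <assert.h>
#include <stddef.h>

/* X (n x p) = A (n x m) * B (m x p); all matrices row-major and contiguous.
 * Hypothetical serial sketch of the textbook algorithm. */
void matmul_naive(size_t n, size_t m, size_t p,
                  const double *A, const double *B, double *X)
{
    assert(A && B && X);
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < p; ++j) {
            double sum = 0.0;
            for (size_t k = 0; k < m; ++k) {
                /* A is read along a row (stride 1), but B is read down
                 * column j, i.e. with a stride of p doubles per step. */
                sum += A[i * m + k] * B[k * p + j];
            }
            X[i * p + j] = sum;
        }
    }
}
```

The innermost loop touches a different cache line of `B` on almost every iteration once `p` exceeds the number of doubles per cache line, which is the access pattern described above.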
To measure this effect, the following counters will be used:
| Counter name | Description | Period |
| ------------- | ---------------------- | ------ |
| `PAPI_L2_TCM` | L2 cache misses | 2500 |
| `PAPI_L2_DCA` | L2 data cache accesses | 2500 |
The period was chosen somewhat arbitrarily; short runs should use small
periods, whereas longer runs may increase it (though that might add
blindspots). We now compile, analyse, and execute the code as described
above:
```
mpicc -g -O3 -std=c99 -o mmult mmult.c -lm
hpcstruct ./mmult
mpirun -np 4 hpcrun --event PAPI_L2_TCM@10000 --event PAPI_L2_DCA@10000 ./mmult 32 64 128
```

This will randomly generate two `double` matrices, one of dimension
$32 \times 64$, and the other of dimension $64 \times 128$. It will then multiply
them using the naïve algorithm, where each rank is responsible for a
range of rows of the final product, and gather the result on rank 0.
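How the rows are divided among ranks is not shown here, so the following is only one plausible scheme: a block distribution in which the first few ranks take one extra row when the row count is not divisible by the number of ranks. The function name `row_range` is an assumption made for this sketch.

```c
#include <assert.h>

/* Half-open range [*begin, *end) of product rows owned by `rank`
 * under a block distribution. Hypothetical helper; the attached
 * code may partition the rows differently. */
void row_range(int rows, int nranks, int rank, int *begin, int *end)
{
    assert(rows >= 0 && nranks > 0 && rank >= 0 && rank < nranks);
    assert(begin && end);
    const int base  = rows / nranks;   /* rows every rank gets */
    const int extra = rows % nranks;   /* leftover rows, one each for the first ranks */
    *begin = rank * base + (rank < extra ? rank : extra);
    *end   = *begin + base + (rank < extra ? 1 : 0);
}
```

With 32 rows and 4 ranks, as in the command above, every rank would own 8 consecutive rows under this scheme.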
The measurements made by `hpcrun` must now be combined with the source
code analysis to make the database, as described above. We then launch
the visualizer on the database.
```
hpcprof -S mmult.hpcstruct -I ./'*' hpctoolkit-mmult-measurements
...
hpcviewer hpctoolkit-mmult-database
```

We are interested in a new custom metric, the L2 cache miss rate, which
may be added via the following highlighted button:
![The (somewhat hidden) derived metric button.](../images/HpctkDerivedMetricButton.png "The (somewhat hidden) derived metric button.")
The menu is (literally) self-explanatory; we then add the desired
metric:
$\text{miss rate} = \frac{\text{number of L2 misses}}{\text{number of L2 accesses}}$.
As we ran with very small matrices, which may fit even in L1, the cache
miss rate is very small (5.56%). Increasing the size to something
larger, such as $2048 \times 2048$, clearly shows the behavior described in
the beginning of this section: the miss rate is 94.62%.
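The arithmetic behind such a derived metric can be sketched by hand. The code below assumes that each recorded sample stands for roughly `period` occurrences of the event, which is the usual interpretation in sampling-based profiling; the function names are hypothetical and the sample counts in the test are made up.

```c
#include <assert.h>

/* Estimated number of events for a sampled counter, assuming each
 * sample represents `period` events. */
double estimated_events(unsigned long samples, unsigned long period)
{
    return (double)samples * (double)period;
}

/* Miss rate = estimated L2 misses / estimated L2 accesses. */
double l2_miss_rate(unsigned long miss_samples, unsigned long miss_period,
                    unsigned long access_samples, unsigned long access_period)
{
    assert(access_samples > 0 && access_period > 0);
    return estimated_events(miss_samples, miss_period)
         / estimated_events(access_samples, access_period);
}
```

When both counters use the same period, the periods cancel and the miss rate is simply the ratio of sample counts; with different periods, the scaling matters.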
So we should now try something slightly less naive:
1. For every element $a_{ij}$ of $A$:
    1. Load row $B_j$ of $B$;
    2. Add the product $a_{ij}B_j$ to row $X_i$ of the result.

This makes better use of the cache because the rows of $B$ are stored
(and accessed) sequentially, which is corroborated by the lower cache
miss rate of 54.83%.
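The row-oriented scheme above might look as follows in serial form. As before, this is a sketch assuming contiguous row-major storage, and `matmul_rowwise` is a name chosen for this example rather than one taken from the attached code.

```c
#include <assert.h>
#include <stddef.h>

/* X (n x p) = A (n x m) * B (m x p); all matrices row-major and contiguous.
 * Hypothetical serial sketch of the row-oriented variant. */
void matmul_rowwise(size_t n, size_t m, size_t p,
                    const double *A, const double *B, double *X)
{
    assert(A && B && X);
    for (size_t i = 0; i < n; ++i) {
        for (size_t c = 0; c < p; ++c)
            X[i * p + c] = 0.0;               /* clear row i of the result */
        for (size_t j = 0; j < m; ++j) {
            const double a_ij = A[i * m + j];
            /* Row j of B and row i of X are both traversed with stride 1. */
            for (size_t c = 0; c < p; ++c)
                X[i * p + c] += a_ij * B[j * p + c];
        }
    }
}
```

Compared with the textbook loop order, only the two inner loops are interchanged; the arithmetic is the same, but every array is now walked sequentially in the innermost loop.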
Thanks to HPCToolkit's support for profiling hardware counters, such
techniques can easily be evaluated with no manual bookkeeping whatsoever
by the developer. Further, it can also help determine the impact of
these changes on different architectures.
## Weak/strong scaling
Another useful application of HPCToolkit is in determining scalability
of programs (and assigning blame on a function to function basis). We
reuse the more efficient matrix multiplication code above, first to
investigate weak scaling, and then strong scaling. Since we are now only
interested in execution time and number of cycles taken, we use the
following counters instead:
| Counter name | Description | Period |
| -------------- | --------------------------------------------------- | ------ |
| `PAPI_TOT_CYC` | Total cycles | 10000 |
| `WALLCLOCK` | Wall clock time used by the process in microseconds | 100000 |
The mathematics of the problem presents no obvious constraint on either
weak or strong scaling, but HPCToolkit provides a very
straightforward way of verifying (or, in this case, refuting) intuition.
Starting with weak scaling (with 2 and 8 cores), we now define the
following derived metric for every context/function:
$\frac{\text{number of cycles taken for 8 cores} - \text{number of cycles taken for 2 cores}}{\text{total number of cycles taken for 8 cores}}$
This metric associates with every scope its contribution to the overall
scaling loss, i.e. how much longer the execution takes on 8 cores than
on 2 cores (despite the fact that every core does exactly the same
amount of work).
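To illustrate the arithmetic of this metric outside the viewer, the sketch below applies the formula to per-scope cycle counts. It assumes the counts are exclusive (each cycle is attributed to exactly one scope), so that they sum to the total; the function name `scaling_loss` and the numbers in any usage are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* For each scope s, stores (cycles_8[s] - cycles_2[s]) / total_8 in loss[s],
 * where total_8 is the sum of cycles_8 over all scopes.
 * Returns the sum of all entries, i.e. the overall scaling loss.
 * Hypothetical helper, assuming exclusive per-scope counts. */
double scaling_loss(size_t nscopes, const double *cycles_2,
                    const double *cycles_8, double *loss)
{
    assert(cycles_2 && cycles_8 && loss);
    double total_8 = 0.0;
    for (size_t s = 0; s < nscopes; ++s)
        total_8 += cycles_8[s];
    assert(total_8 > 0.0);

    double overall = 0.0;
    for (size_t s = 0; s < nscopes; ++s) {
        loss[s] = (cycles_8[s] - cycles_2[s]) / total_8;
        overall += loss[s];
    }
    return overall;
}
```

A scope whose cost is unchanged between the two runs contributes zero, so the scopes with the largest values are the ones to inspect first.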
```
mpirun -np 2 hpcrun --event PAPI_TOT_CYC@10000 --event WALLCLOCK@100000 ./mmult 1024 1024 1024
hpcprof -S mmult.hpcstruct -I ./'*' hpctoolkit-mmult-measurements
```
After running `mmult` and `hpcprof` a second time (with 8 ranks), a
folder named `hpctoolkit-mmult-database-<PID>` will be created, where
`<PID>` is the process ID of the second `hpcprof` run.

Both databases for the 2- and 8-core runs must now be merged so that
their components may be used to make new metrics. This is done by
opening them with the **File** menu, and then, on the same menu,
pressing **Merge databases**. Meanwhile, the main `hpcviewer` screen
should look as on the right. The rightmost view is of the merged
databases, so the previous two may safely be closed. We now proceed as
usual.
The new metric will use what is called an *aggregate metric*, which is
the sum of all elements in the column of a given existing metric,
(`PAPI_TOT_CYC` in our case); it is obtained, as described in the menu,
by replacing the `$` preceding the metric ID by a `@`.
As seen on the left, the program is not scaling well on a single node:
running a problem 4 times as big on 8 cores takes 3x as long. We suspect
the problem may lie in memory access contention among the cores.
## Tracing
To get tracing information, the flag `-t` should be passed to `hpcrun`.
To illustrate this, we use the
[example trace data](http://hpctoolkit.org/examples/hpctoolkit-chombo-crayxe6-1024pe-trace.tgz)
from the official website (which is much richer than any of our examples
here). We now point `hpctraceviewer` to the uncompressed folder:
```
hpctraceviewer hpctoolkit-chombo-crayxe6-1024pe-trace
```
The interface is relatively straightforward:

1. **Trace view**: a regular timeline; whereas in
[jumpshot](MPE_by_example.md) the nested functions are
represented as nested rectangles, here they are merely represented
with different colors, depending on the level selected on the...
2. **Call path**: this indicates the maximum depth of the call path for
every point in the timeline. Notice that, unlike jumpshot's *legend
window*, this is not indexed by function but merely by call-stack
depth; in fact, some functions may appear more than once, with
different colors. The call stack shown is determined by the position of the
crosshair.
3. **Depth view**: this is a plot of the above call path, per unit of
time.
4. **Summary view**: a collapsed view of the **trace view**,
column-wise; this illustrates how much time is spent on each
callpath depth.
5. **Mini map**: a faster way of moving around/zooming in and out in
the trace view.
## Platforms
### Blues
- install hpctoolkit
```
# install papi
./configure --prefix=$PWD/install-{blues,fusion} 2>&1 | tee config.txt
make -j8 2>&1 | tee m.txt
make install
# install hpctoolkit-externals
./autogen
mkdir build-{blues,fusion} ; cd build-{blues,fusion}
../configure --prefix=$PWD/install 2>&1 | tee config.txt
make -j8 2>&1 | tee m.txt
make install
# install hpctoolkit
./autogen
mkdir build-{blues,fusion} ; cd build-{blues,fusion}
../configure --prefix=$PWD/install --with-externals=PATH --with-papi=PATH 2>&1 | tee config.txt
make -j8 2>&1 | tee m.txt
make install
```
- use hpctoolkit
```
# Dynamically linked applications
[mpi-launcher] hpcrun [hpcrun-options] app [app-arguments]
# recover static program structure
hpcstruct app
# analyze measurements
hpcprof -S app.hpcstruct -I <app-src>/'*' hpctoolkit-app-measurements1 [hpctoolkit-app-measurements2 ...]
# or
<mpi-launcher> hpcprof-mpi -S app.hpcstruct -I <app-src>/'*' hpctoolkit-app-measurements1 [hpctoolkit-app-measurements2 ...]
hpcviewer hpctoolkit-app-database
```
### Mira
<https://www.alcf.anl.gov/user-guides/hpctoolkit>
## Notes
- We haven't thoroughly tested HPCToolkit 2.21.4 with PAPI-V; use 4.4.0 if
  possible.
- This may be found by typing `which mpicc` and removing `bin/mpicc` from
  the resulting string.
- If you're wondering why HPCToolkit doesn't use `-R` instead, it is
  because it already has another meaning; check chapter 3 of the users
  manual.
- For some reason, HPCToolkit cannot use derived metrics.
- The machine used to develop this example (with a Core i7-2630QM) cannot
  provide information on L1 cache accesses.
- Notice that we count the total number of L2 cache misses, but only the
  L2 data cache accesses. HPCToolkit unfortunately cannot profile using
  derived metrics (such as L2 *data* cache misses); these are listed by
  `papi_avail`. Thankfully, the most time-consuming computation in this
  case is a very short loop, so L2 instruction cache misses constitute
  around 1% of the total L2 cache misses.
- Arguably, this is still high, but the point is to illustrate how to go
  about measuring these things.
- Taken from the HPCToolkit users manual.
- Unfortunately the machine on which this was tested did not support
  `PAPI_MEM_SCY`, which measures the number of cycles stalled waiting for
  memory accesses.
## External links
- [Main website.](http://hpctoolkit.org/)
- [Users
manual.](http://hpctoolkit.org/manual/HPCToolkit-users-manual.pdf)
(Much of this text was inspired by, or outright hoisted from it)
- [John Mellor-Crummey, Gaining Insight into Parallel Program
Performance using
HPCToolkit](https://computing.llnl.gov/tutorials/HPCToolkit.MellorCrummey.2012.08.07.pdf)
- [Building and Installing
HPCToolkit](http://hpctoolkit.org/software-instructions.html)