1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183
|
.TH LIKWID-BENCH 1 <DATE> likwid\-<VERSION>
.SH NAME
likwid-bench \- low-level benchmark suite and microbenchmarking framework
.SH SYNOPSIS
.B likwid-bench
.RB [\-hap]
.RB [ \-t
.IR <testname> ]
.RB [ \-s
.IR <min_time> ]
.RB [ \-w
.IR <workgroup_expression> ]
.RB [ \-W
.IR <workgroup_expression_short> ]
.RB [ \-l
.IR <testname> ]
.RB [ \-d
.IR <delimiter> ]
.RB [ \-i
.IR <iterations> ]
.RB [ \-f
.IR <filepath> ]
.SH DESCRIPTION
.B likwid-bench
is a benchmark suite for low-level (assembly) benchmarks to measure bandwidths and instruction throughput for specific instruction code on x86 systems. The currently included benchmark codes include common data access patterns like load and store but also calculations like vector triad and sum.
.B likwid-bench
includes architecture specific benchmarks for x86, x86_64 and x86 for Intel Xeon Phi coprocessors. With LIKWID 5 also ARM and POWER benchmarks are supported. The performance values can either be calculated by
.B likwid-bench
or measured using performance counters by using
.B likwid-perfctr
as a wrapper to
.B likwid-bench.
This requires to build
.B likwid-bench
with instrumentation enabled in config.mk. Benchmarks can be dynamically added when a proper ptt file is present at $HOME/.likwid/bench/<arch>/<testname>.ptt . The files are compiled to a .S file and compiled using either gcc, icc or pgcc (searched in $PATH). The default folder is /tmp/<PID>. Possible values for <arch> are 'x86', 'x86-64', 'phi', armv7', 'armv8' and 'power'.
.SH OPTIONS
.TP
.B \-\^h
prints a help message to standard output, then exits.
.TP
.B \-\^a
list available benchmark codes for the current system.
.TP
.B \-\^p
list available thread domains.
.TP
.B \-\^s <min_time>
Run the benchmark for at least
.B <min_time> seconds.
The amount of iterations is determined using this value. Default: 1 second.
.TP
.B \-\^t <testname>
Name of the benchmark code to run (mandatory).
.TP
.B \-\^w <workgroup_expression>
Specify the affinity domain, thread count and data set size for the current benchmarking run (-w or -W mandatory). First thread in thread domain initializes the stream.
.TP
.B \-\^W <workgroup_expression_short>
Specify the affinity domain, thread count and data set size for the current benchmarking run (-w or -W mandatory). Each thread in the workgroup initializes its own chunk of the stream.
.TP
.B \-\^l <testname>
list properties of a benchmark code.
.TP
.B \-\^i <iterations>
Set the number of iterations per thread (optional)
.TP
.B \-\^f <filepath>
Filepath for the dynamic generation of benchmarks. Default /tmp/. <PID> is always attached
.SH WORKGROUP SYNTAX
.B <thread_domain>:<size> [:<num_threads>[:<chunk_size>:<stride>]] [-<streamId>:<domain_id>]
with size in kB, MB or GB. The
.B <thread_domain>
defines where the threads are placed.
.B <size>
is the total data set size for the benchmark, the allocated vectors in memory sum up to this size.
.B <num_threads>
specifies how many threads are used in the
.B <thread_domain>.
Threads are always placed using a compact policy in
.B likwid-bench.
This means that per default all SMT threads are used. Optionally similar a the expression based syntax in
.B likwid-pin
a
.B <chunk_size>
and
.B <stride>
can be provided. Optionally for every stream (array, vector) the placement can be controlled. Per default all arrays are placed in the same
.B <thread_domain>
the threads are running in. To place the data in a different domain for every stream of a benchmark case (the total number of streams can be acquired by the
.B \-l
option) the domain to place the data in can be specified. Multiple streams are comma separated. Either the placement is provided or all streams have to be explicitly placed. Please refer to the Wiki pages on
.B https://github.com/RRZE-HPC/likwid/wiki/Likwid-Bench
for further details and examples on usage.
With -W each thread initializes its own chunk of the streams but pleacement of the streams is deactivated.
.SH EXAMPLE
.IP 1. 4
Run the
.B copy
benchmark on socket 0 (
.B S0
) with a total data set size of
.B 100kB.
.TP
.B likwid-bench -t copy -w S0:100kB
.PP
Since no
.B <num_threads>
is given in the workload expression, each hardware thread of socket 0 gets one application thread. The workload is split up between all threads and the number of iterations is determined automatically.
.IP 2. 4
Run the
.B triad
benchmark code with explicitly
.B 100
iterations per thread with
.B 2
threads on the socket 0 (
.B S0
) and a data size of
.B 1GB.
.TP
.B likwid-bench -t triad -i 100 -w S0:1GB:2:1:2
.PP
Assuming socket 0 (
.B S0
) has 2 physical hardware threads with SMT enabled, hence in total 4 hardware threads, one thread is assigned to each physical hardware thread of socket 0.
.IP 3. 4
Run the
.B update
benchmark on socket 0 (
.B S0
) with a workload of
.B 100kB
and on socket 1 (
.B S1
) with the same workload.
.TP
.B likwid-bench -t update -w S0:100kB -w S1:100kB
.PP
The results of both workgroups are combinded for the output. Hence the workload in each workgroup expression should have the same size.
.IP 4. 4
Run the
.B copy
benchmark but measure the memory traffic with
.B likwid-perfctr.
The option
.B INSTRUMENT_BENCH
in
.B config.mk
needs to be true at compile time to use that feature.
.TP
.B likwid-perfctr -c E:S0:4 -g MEM -m likwid-bench -t update -w S0:100kB
.PP
.B likwid-perfctr
will configure and start the performance counters on socket 0 (
.B S0
) with 4 threads prior to the execution of
.B likwid-bench.
The performance counters are read right before and after running the benchmarking code to minimize the interferences of the measurement.
.IP 5. 4
Run the
.B copy
benchmark and place the data on another socket
.TP
.B likwid-bench -t copy -w S0:1GB:10:1:2-0:S1,1:S1
.PP
Stream id 0 and 1 are placed in thread domains
.B S1,
which is socket 1. This can be verified as the initialization threads output where they are running.
.SH WARNING
Since LIKWID 5.0, it is possible to have different numbers of threads in workgroups. Also different sizes are allowed. Both features seem promising, but they show a range of problems. If you have a NUMA system and run with multiple threads on NUMA node 0 but with less on NUMA node 1, the threads on NUMA node 1 cause less preassure on the memory interface and consequently achieve higher throughput. They will finish early compared to the threads on NUMA node 0. The runtime used for caluclating the bandwidth and MFlops/s values use the maximal runtime of all threads, hence one of NUMA node 0.
Similar problems exist with different sizes. One workgroup might run in cache while the other waits for data from the memory interface.
.SH AUTHOR
Written by Thomas Gruber <thomas.roehl@googlemail.com>.
.SH BUGS
Report Bugs on <https://github.com/RRZE-HPC/likwid/issues>.
.SH SEE ALSO
likwid-perfctr(1), likwid-pin(1), likwid-topology(1), likwid-setFrequencies(1)
|