|
For more complete information on NetPIPE, visit the webpage at:
http://www.scl.ameslab.gov/Projects/NetPIPE/
NetPIPE was originally developed by Quinn Snell, Armin Mikler,
John Gustafson, and Guy Helmer.
It is currently being developed and maintained by Troy Benjegerdes, with
help from several graduate students (Veerendra Allada, Kyle Schochenmaier).
Release 3.7 adds support for the OpenFabrics Infiniband verbs module (NPibv),
and should work with the OFED-1.1 or OFED-1.2 release. It has been tested
on IBM eHCA hardware as well as Mellanox PCI Express Infiniband adapters.
The OpenFabrics verbs module does not yet support the connection manager;
all connections are set up via TCP sockets. This means that RDMA Ethernet
devices will not work. Patches to netpipe@lists.scl.ameslab.gov are
encouraged.
Release 3.6.2 mainly fixes some bugs. A number of portability issues
with 64-bit architectures were taken care of, especially in the Infiniband
module. A small typecasting error was fixed that caused segmentation faults
on Red Hat Enterprise and Fedora Core systems (and probably others...). The
bi-directional mode was also tested with the Infiniband module, and a subset
of NetPIPE options are now supported.
Release 3.6.1 adds a bi-directional (-2) mode to allow data to be sent
in both directions simultaneously. This has been tested with the
TCP, MPI, MPI-2, and GM modules. You can also now test
synchronous MPI communications (MPI_Ssend) using -S.
A launch utility (nplaunch) allows you to launch NPtcp, NPgm,
NPib, and NPpvm from one side using ssh to start the remote executable.
Version 3.6 adds the ability to test with and without cache effects,
and the ability to offset both the source and destination buffers.
A memcpy module has also been added.
Release 3.5 removes the CPU utilization measurements. Getrusage is
probably not very accurate, so a dummy workload will eventually be
used instead.
The streaming mode has also been fixed. When run at Gigabit speeds,
the TCP window size would collapse and limit the performance of subsequent
data points. Now we reset the sockets between trials to prevent this.
We have also added in a module to evaluate memory copy rates.
-n now sets a constant number of repeats for each trial.
-r resets the sockets between each trial (automatic for streaming).
Release 3.3 includes an Infiniband module for the Mellanox VAPI.
It also has an integrity check (-i), which is still being developed.
Version 3.2 includes additional modules to test
PVM, TCGMSG, SHMEM, and MPI-2, as well as the GM, GPSHMEM, ARMCI, and LAPI
software layers they run upon.
If you have problems or comments, please email netpipe@scl.ameslab.gov
____________________________________________________________________________
NetPIPE Network Protocol Independent Performance Evaluator, Release 2.3
Copyright 1997, 1998 Iowa State University Research Foundation, Inc.
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation. You should have received a copy of the
GNU General Public License along with this program; if not, write to the
Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
____________________________________________________________________________
Building NetPIPE
----------------
NetPIPE requires an ANSI C compiler. You are on your own for
installing the various libraries that NetPIPE can be used to
test.
Review the provided makefile and change any necessary settings, such
as the CC compiler or CFLAGS flags, required extra libraries, and PVM
library & include file pathnames if you have these communication
libraries. Alternatively, you can specify these changes on the
make command line. The line below would compile the NPtcp module
using the icc compiler instead of the default cc compiler.
make CC=icc tcp
Compile NetPIPE with the desired communication interface by using:
make mpi (this will use the default MPI on the system)
make pvm (you may need to set some paths in the makefile)
make tcgmsg (you will need to set some paths in the makefile)
make mpi2 (this will test 1-sided MPI_Put() functions)
make shmem (1-sided library for Cray and SGI systems)
make tcp
make tcp6 (for IPv6 enabled systems)
make ipx (for IPX enabled systems)
make sctp (for SCTP enabled systems)
make sctp6 (for SCTP6 enabled systems)
make gm (for Myrinet cards, you will need to set some paths)
make gpshmem (SHMEM interface for other machines)
make armci (still under development)
make lapi (for the IBM SP)
make ib (for Mellanox Infiniband adapters, uses VAPI layer)
make ibv (for OpenFabrics Infiniband devices)
make udapl (for OpenFabrics uDAPL)
make memcpy (uses memcpy to copy data between buffers in 1 process)
make MP_memcpy (uses an optimized copy in MP_memcpy.c to copy data between
buffers. This requires icc or gcc 3.x.)
Running NetPIPE
---------------
NetPIPE will dump its output to the screen by default and also
to the file np.out. The following parameters can be used to change how
NetPIPE is run, and are listed in order of their general usefulness.
-b: specify send and receive TCP buffer sizes e.g. "-b 32768"
This can make a huge difference for Gigabit Ethernet cards.
You may need to tune the OS to set a larger maximum TCP
buffer size for optimal performance.
-O: specify send and optionally receive buffer offsets, e.g. "-O 1,3"
-l: lower bound (start value for block size) e.g. "-l 1"
-u: upper bound (stop value for block size) e.g. "-u 1048576"
-o: specify output filename e.g. "-o output.txt"
-z: for MPI, receive messages using ANYSOURCE
-g: MPI-2: use MPI_Get() instead of MPI_Put()
-f: MPI-2: do not use a fence call (may not work for all packages)
-I: Invalidate cache: Take measures to eliminate the effects cache
has on performance.
-a: asynchronous receive (a.k.a. pre-posted receive)
May not have any effect, depending on your implementation
-B: burst all preposts before measuring performance
Normally only one receive is preposted at a time with -a
-p: set perturbation offset of buffer size, e.g. "-p 3"
-i: Integrity check: Check the integrity of data transfer instead
of performance
-s: stream option (default mode is "ping pong")
If this option is used, it must be specified on both
the sending and receiving processes
-S: Use synchronous sends/receives for MPI.
-2: Bi-directional communications. Transmit in both directions
simultaneously.
-P: Set the port number used by TCP to something other than
default.
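The note under -b about the OS limiting the maximum TCP buffer size can be
checked before running a benchmark. The sketch below is not part of NetPIPE;
it is written in Python and assumes that a buffer request like "-b 32768"
ends up as the standard SO_SNDBUF and SO_RCVBUF socket options. The function
name request_buffers is made up for this illustration.

```python
import socket

def request_buffers(size):
    """Ask the OS for `size`-byte TCP send/receive buffers and report
    what was actually granted. Assumption: this mirrors what a "-b size"
    request amounts to at the socket level."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
        granted_send = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
        granted_recv = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return granted_send, granted_recv

if __name__ == "__main__":
    for size in (32768, 1048576):
        snd, rcv = request_buffers(size)
        print(f"requested {size}: send={snd} recv={rcv}")
```

If the granted values stop growing as the requested size increases, the OS
maximum has probably been reached and needs to be raised before larger -b
values can take effect.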
TCP
---
Compile NetPIPE using 'make tcp'
remote_host> NPtcp [options]
local_host> NPtcp -h remote_host [options]
OR
local_host> nplaunch NPtcp -h remote_host [options]
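When scripting runs, it can help to build both command lines from one set of
options so the two sides stay in sync. The Python sketch below only builds
and prints the commands; it does not start anything. The helper name
nptcp_commands is hypothetical, and passing identical options to both sides
is an assumption based on the "[options]" placeholders above.

```python
import shlex

def nptcp_commands(remote_host, buffer_size=None, lower=None, upper=None):
    """Return (receiver_argv, transmitter_argv) for an NPtcp run.
    Only flags documented in this README (-h, -b, -l, -u) are used."""
    options = []
    if buffer_size is not None:
        options += ["-b", str(buffer_size)]
    if lower is not None:
        options += ["-l", str(lower)]
    if upper is not None:
        options += ["-u", str(upper)]
    receiver = ["NPtcp"] + options                        # run on remote_host
    transmitter = ["NPtcp", "-h", remote_host] + options  # run on local_host
    return receiver, transmitter

if __name__ == "__main__":
    rx, tx = nptcp_commands("remote_host", buffer_size=32768,
                            lower=1, upper=1048576)
    print("remote_host>", shlex.join(rx))
    print("local_host> ", shlex.join(tx))
```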
TCP6
----
Compile NetPIPE using 'make tcp6'
remote_host> NPtcp6 [options]
local_host> NPtcp6 -h remote_host [options]
OR
local_host> nplaunch NPtcp6 -h remote_host [options]
IPX
---
Compile NetPIPE using 'make ipx'
remote_host> NPipx [options]
local_host> NPipx -h remote_host [options]
OR
local_host> nplaunch NPipx -h remote_host [options]
SCTP
----
Compile NetPIPE using 'make sctp'
remote_host> NPsctp [options]
local_host> NPsctp -h remote_host [options]
OR
local_host> nplaunch NPsctp -h remote_host [options]
SCTP6
-----
Compile NetPIPE using 'make sctp6'
remote_host> NPsctp6 [options]
local_host> NPsctp6 -h remote_host [options]
OR
local_host> nplaunch NPsctp6 -h remote_host [options]
MPICH
-----
Install MPICH
Compile NetPIPE using 'make mpi'
use p4pg file or edit mpich/util/mach/mach.{ARCH} file
to specify the machines to run on
mpirun [-nolocal] -np 2 NPmpi [options]
'setenv P4_SOCKBUFSIZE 256000' can make a huge difference for
MPICH on Unix systems.
LAM/MPI (comes on the RedHat Linux distributions now)
-------
Install LAM
Compile NetPIPE using 'make mpi'
put the machine names into a lamhosts file
'lamboot -v -b lamhosts' to start the lamd daemons
mpirun -np 2 [-O] NPmpi [options]
The -O parameter avoids data translation for homogeneous systems.
MPI/Pro (commercial version)
-------
Install MPI/Pro
Compile NetPIPE using 'make mpi'
put the machine names into /etc/machines or a local machine file
mpirun -np 2 NPmpi [options]
MP_Lite (A lightweight version of MPI)
-------
Install MP_Lite (http://www.scl.ameslab.gov/Projects/MP_Lite/)
Compile NetPIPE using 'make MP_Lite'
mprun -np 2 -h {host1} {host2} NPmplite [options]
PVM
---
Install PVM (comes on the RedHat distributions now)
Set the PVM paths in the makefile if necessary.
Compile NetPIPE using 'make pvm'
use the 'pvm' utility to start the pvmd daemons
type 'pvm' to start it (this will also start pvmd on the local_host)
pvm> help --> lists all commands
pvm> add remote_host --> will start a pvmd on a machine called 'remote_host'
pvm> quit --> when you have all the pvmd machines started
remote_host> NPpvm [options]
local_host> NPpvm -h remote_host [options]
OR
local_host> nplaunch NPpvm -h remote_host [options]
Changing PVMDATA in netpipe.h and PvmRouteDirect in pvm.c can
affect the performance greatly.
TCGMSG (unlikely anyone will try this that doesn't know TCGMSG well)
-------
Install TCGMSG package
Set the TCGMSG paths in the makefile.
Compile NetPIPE using 'make tcgmsg'
create a NPtcgmsg.p file with hosts and paths (see hosts/NPtcgmsg.p)
parallel NPtcgmsg
(no options can be passed into this version)
MPI-2
-----
Install the MPI package
Compile NetPIPE using 'make mpi2'
Follow the directions for running the MPI package from above
The MPI_Put() function will be tested with fence calls by default.
Use -g to test MPI_Get() instead, or -f to do MPI_Put() without
fence calls (will not work with LAM).
SHMEM
-----
Must be run on a Cray or SGI system that supports SHMEM calls.
Compile NetPIPE using 'make shmem'
(Xuehua, fill out the rest)
GPSHMEM (a General Purpose SHMEM library) (gpshmem.c in development)
-------
Ask Ricky or Krzysztof for help :).
GM (test the raw performance of GM on Myrinet cards)
--
Install the GM package and configure the Myrinet cards
Compile NetPIPE using 'make gm'
remote_host> NPgm [options]
local_host> NPgm -h remote_host [options]
OR
local_host> nplaunch NPgm -h remote_host [options]
LAPI
----
Log into IBM SP machine at NERSC
Compile NetPIPE using 'make lapi'
To run interactively at NERSC:
Set environment variable MP_MSG_API to lapi
e.g. 'setenv MP_MSG_API lapi', 'export MP_MSG_API=lapi'
Run NPlapi with '-procs 2' to tell the parallel environment you
want 2 nodes. Use any other options that are applicable to
NetPIPE.
To submit a batch job at NERSC:
Copy the file batchLapi from the 'hosts' directory to the directory
containing NPlapi.
Edit the copy of batchLapi:
job_name: Identifying name of job, can be anything
output: File to send stdout to
error: File to send stderr to (most of NetPIPE's output
will go here)
tasks_per_node: Number of tasks to be run on each node
node: Number of nodes to run on
(Use a combination of the above two options to determine
how NetPIPE runs. Use 1 task per node and 2 nodes to run
benchmark between nodes. Use 2 tasks per node and 1 node
to run benchmark on single node)
Use whatever command-line options are appropriate for NetPIPE
Submit the job with the command 'llsubmit batchLapi'
Check status of all your jobs with 'llqs -u <user>'
You should receive an email when the job finishes. The resulting output
files will then be available.
ARMCI
-----
Install the ARMCI package
Compile NetPIPE using 'make armci'
Follow the directions for running the MPI package from above
If running on interfaces other than the default, create a file
called armci_hosts, containing two lines, one for each hostname,
then run the package.
Infiniband
----------
This test will only work on machines connected via TCP/IP as well
as Infiniband.
Install Infiniband adapters and Mellanox VAPI software
(OR OFED-1.1 or OFED-1.2)
Make sure the adapters are up and running (e.g. Check that the
Mellanox-supplied bandwidth/latency program, perf_main, works, if
you have it. For OFED, make sure ibv_rc_pingpong works)
Compile NetPIPE using 'make ib' for VAPI and 'make ibv' for OFED.
(The environment variable MTHOME needs to be set to the directory
containing the include and lib directories for the Mellanox software.)
OFED should build if the libraries are in the standard OFED install
locations; if not, edit the makefile.
remote_host> NPib [-options]
local_host> NPib -h remote_host [-options]
OR
local_host> nplaunch NPib -h remote_host [options]
(remote_host should be the ip address or hostname of the other host)
Other options: (this documentation is out of date for ibv)
Use -m to select mtu size for Infiniband adapter.
Valid values are 256, 512, 1024, 2048, 4096. Default is 1024.
Use -t to select the communications type.
Possible values are
send_recv: basic send and receive
send_recv_with_imm: send and receive with immediate data
rdma_write: one-sided remote dma write
rdma_write_with_imm: one-sided remote dma write with immediate data
Default is send_recv.
Use -c to select the message completion type.
Possible values are
local_poll: poll on last byte of receive buffer
vapi_poll: use VAPI polling mechanism
event: use VAPI event completion mechanism
Default is local_poll.
uDAPL
-----
Make sure the uDAPL interface adapters are up and running.
Compile NetPIPE using 'make udapl' (You may need to change the path to
the uDAPL include files and libraries in the makefile depending on
your uDAPL interface adapter. Defaults to /usr/local/ofed/*)
remote_host> NPudapl [-options]
local_host> NPudapl -h remote_host [-options]
OR
local_host> nplaunch NPudapl -h remote_host [options]
(remote_host should be the ip address or hostname of the other host)
Other options:
Use -t to select the communications type.
Possible values are
send_recv: basic send and receive
rdma_write: RDMA write in place of send/recv
Default is send_recv.
Use -c to select the message completion type.
Possible values are
local_poll: poll on last byte of receive buffer
dq_poll: use dat_evd_dequeue to poll for completion events
evd_wait: use dat_evd_wait to wait for completion events
cno_wait: use dat_cno_wait to wait for completions, then dequeue
Default is local_poll.
For the best latency & throughput numbers, use local_poll.
To demonstrate the CPU efficiency of uDAPL, use evd_wait or cno_wait.
Interpreting the Results
------------------------
NetPIPE generates a np.out file by default, which can be renamed using the
-o option. This file contains 3 columns: the number of bytes, the
throughput in Mbps, and the round trip time divided by two.
The first 2 columns can therefore be used to produce a throughput vs
message size graph.
The screen output contains this same information, plus the test number
and the number of ping-pongs involved in the test.
>more np.out
1 0.136403 0.00005593
2 0.274586 0.00005557
3 0.402104 0.00005692
4 0.545668 0.00005593
6 0.805053 0.00005686
8 1.039586 0.00005871
12 1.598912 0.00005726
13 1.700719 0.00005832
16 2.098007 0.00005818
19 2.340364 0.00006194
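The three columns can be read back with a few lines of code. The Python
sketch below is not part of NetPIPE; it parses the sample rows shown above
and checks how the throughput column relates to the other two. The formula in
implied_mbps is an inference from these sample numbers only (time taken to be
in seconds, and "Mbps" taken to mean 2**20 bits per second), so treat it as
an assumption rather than a statement of how NetPIPE computes its output.

```python
# Sample rows copied from the np.out listing above:
# bytes, throughput (Mbps), round trip time divided by two
SAMPLE = """\
1 0.136403 0.00005593
2 0.274586 0.00005557
3 0.402104 0.00005692
4 0.545668 0.00005593
6 0.805053 0.00005686
8 1.039586 0.00005871
12 1.598912 0.00005726
13 1.700719 0.00005832
16 2.098007 0.00005818
19 2.340364 0.00006194
"""

def parse_np_out(text):
    """Return a list of (bytes, mbps, half_round_trip) tuples."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        rows.append((int(fields[0]), float(fields[1]), float(fields[2])))
    return rows

def implied_mbps(nbytes, seconds):
    """Throughput implied by size and time.
    Assumption: time is in seconds and 1 Mbps = 2**20 bits per second."""
    return nbytes * 8 / seconds / 2**20

if __name__ == "__main__":
    rows = parse_np_out(SAMPLE)
    for nbytes, mbps, half_rtt in rows:
        print(nbytes, mbps, round(implied_mbps(nbytes, half_rtt), 6))
    peak = max(rows, key=lambda row: row[1])
    print("peak throughput in sample:", peak[1], "Mbps at", peak[0], "bytes")
```

To read a real results file, pass the contents of np.out (or the file named
with -o) to parse_np_out instead of SAMPLE.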
Invalidating Cache
------------------
The -I switch can be used to reduce the effects cache has on performance.
Without the switch, NetPIPE tests the performance of communicating
n-byte blocks by reading from an n-byte buffer on one node, sending data
over the communications link, and writing to an n-byte buffer on the other
node. For each block size, this trial will be repeated x times, where x
typically starts out very large for small block sizes, and decreases as the
block size grows. The same buffers on each node are used repeatedly, so
after the first transfer the entire buffer will be in cache on each node,
provided that the block size is smaller than the available cache. Thus each transfer
after the first will be read from cache on one end and written into cache on
the other. Depending on the cache architecture, a write to main memory may
not occur on the receiving end during the transfer loop.
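The default measurement loop described above can be sketched in a few lines.
This is an illustration in Python over a local socket pair, not NetPIPE's own
code; the function names and the use of a thread as the echoing peer are
assumptions made for the example. It reuses the same n-byte buffer for every
repeat, as described, and reports the round trip time divided by two.

```python
import socket
import threading
import time

def recv_exact(sock, n):
    """Receive exactly n bytes (a single recv may return fewer)."""
    chunks = []
    remaining = n
    while remaining:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

def echo_peer(sock, n, repeats):
    """The 'pong' side: receive an n-byte block and send it straight back."""
    for _ in range(repeats):
        sock.sendall(recv_exact(sock, n))

def ping_pong(n, repeats):
    """Return the average round trip time divided by two, in seconds."""
    a, b = socket.socketpair()
    peer = threading.Thread(target=echo_peer, args=(b, n, repeats))
    peer.start()
    buf = bytes(n)  # the same buffer is reused for every repeat
    start = time.perf_counter()
    for _ in range(repeats):
        a.sendall(buf)
        recv_exact(a, n)
    elapsed = time.perf_counter() - start
    peer.join()
    a.close()
    b.close()
    return elapsed / repeats / 2

if __name__ == "__main__":
    for n, repeats in ((1, 2000), (1024, 1000), (65536, 100)):
        print(n, ping_pong(n, repeats))
```

The repeat counts in the usage loop are arbitrary; they only mimic the idea
of using more repeats for small blocks and fewer for large ones.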
While the performance measurements obtained from this method are certainly
useful, it is also interesting to use the -I switch to measure performance
when data is read from and written to main memory. In order to facilitate
this, large pools of memory are allocated at startup, and each n-byte transfer
comes from a region of the pool not in cache. Before each series of n-byte
transfers, every byte of a large dummy buffer is individually accessed in
order to flush the data for the transfer out of cache. After this step, the
first n-byte transfer comes from the beginning of the large pool, the second
comes from n bytes after the beginning of the pool, and so on (note that the stride
between n-byte transfers will depend on the buffer alignment setting). In this
way we make sure each read is coming from main memory.
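The way successive transfers walk through the pool can be illustrated with
simple offset arithmetic. The Python sketch below is not taken from NetPIPE;
the function name, the stride parameter standing in for the alignment
setting, and the wrap back to the start when the pool is exhausted are all
assumptions made for illustration.

```python
def transfer_offsets(pool_size, n, repeats, stride=None):
    """Byte offset into the pool used by each of `repeats` n-byte transfers.

    stride defaults to n (back-to-back blocks); a larger stride stands in
    for the effect of a buffer alignment setting.
    """
    stride = n if stride is None else stride
    offsets = []
    offset = 0
    for _ in range(repeats):
        if offset + n > pool_size:
            offset = 0  # assumption: wrap around when the pool runs out
        offsets.append(offset)
        offset += stride
    return offsets

if __name__ == "__main__":
    print(transfer_offsets(pool_size=16, n=4, repeats=6))
    print(transfer_offsets(pool_size=32, n=4, repeats=3, stride=8))
```

As long as the pool is much larger than the cache, each offset refers to
memory that has not been touched recently, which is the point of the -I
switch.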
On the receiving end data is written into a large pool in the same fashion
that it was read on the transmitting end. Data will first be written into
cache. What happens next depends on the cache architecture, but one case is
that no transfer to main memory occurs, YET. For moderately large block
sizes, however, a large number of transfer iterations will cause reuse of
cache memory. As this occurs, data in the cache location to be replaced must
be written back to main memory, so we incur a performance penalty while we
wait for the write.
In summary, using the -I switch gives worst-case performance (i.e. all data
transfers involve reading from or writing to memory not in cache) and not
using the switch gives best-case performance (i.e. all data transfers involve
only reading from or writing to memory in cache). Note that other combinations,
such as reading from memory in cache and writing to memory not in cache, would
give intermediary results. We chose to implement the methods that will measure
the two extremes.
Changes needed
--------------
- we need to replace the getrusage stuff from version 2.4 with a dummy
workload ... We have a working version using DGEMM internally, email
netpipe@lists.scl.ameslab.gov for more info.
- the ibv module needs to have better documentation, and support for the CM
|