1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609
|
\documentclass[dvipdfm,11pt]{article}
\usepackage[dvipdfm]{hyperref} % Upgraded url package
\parskip=.1in
% Formatting conventions for contributors
%
% A quoting mechanism is needed to set off things like file names, command
% names, code fragments, and other strings that would confuse the flow of
% text if left undistinguished from preceding and following text. In this
% document we use the LaTeX macro '\texttt' to indicate such text in the
% source, which normally produces, when used as in '\texttt{special text}',
% the typewriter font.
% It is particularly easy to use this convention if one is using emacs as
% the editor and LaTeX mode within emacs for editing LaTeX documents. In
% such a case the key sequence ^C^F^T (hold down the control key and type
% 'cft') produces '\texttt{}' with the cursor positioned between the
% braces, ready for the special text to be typed. The closing brace can
% be skipped over by typing ^e (go to the end of the line) if entering
% text or ^C-} to just move the cursor past the brace.
%
% Please add index entries for important terms and keywords, including
% environment variables that may control the behavior of MPI or one of the
% tools and concepts such as line labeling from mpiexec.
% LaTeX mode is usually loaded automatically. At Argonne, one way to
% get several useful emacs tools working for you automatically is to put
% the following in your .emacs file.
% (require 'tex-site)
% (setq LaTeX-mode-hook '(lambda ()
% (auto-fill-mode 1)
% (flyspell-mode 1)
% (reftex-mode 1)
% (setq TeX-command "latex")))
\begin{document}
\markright{MPICH User's Guide}
\title{\textbf{MPICH User's Guide}\thanks{This work was supported by the Mathematical,
Information, and Computational Sciences Division subprogram of the
Office of Advanced Scientific Computing Research, SciDAC Program,
the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the
U.S.\ Department of Energy Office of Science and the National Nuclear Security
Administration, Office of Science, U.S. Department of Energy, under Contract
DE-AC02-06CH11357.}\\
Version %MPICH_VERSION%\\
Mathematics and Computer Science Division\\
Argonne National Laboratory}
\author{
William Gropp \and Ewing (Rusty) Lusk \and Rajeev Thakur \and Pavan Balaji
\and Thomas Gillis \and Yanfei Guo \and Rob Latham \and Ken Raffenetti \and Hui Zhou
}
\maketitle
\cleardoublepage
\pagenumbering{roman}
\tableofcontents
\clearpage
\pagenumbering{arabic}
\pagestyle{headings}
\section{Introduction}
\label{sec:introduction}
This manual assumes that MPICH has already been installed. For
instructions on how to install MPICH, see the \emph{MPICH
Installer's Guide}, or the \texttt{README} in the top-level MPICH
directory. This manual explains how to compile, link, and run MPI
applications, and use certain tools that come with MPICH. This is a
preliminary version and some sections are not complete yet. However,
there should be enough here to get you started with MPICH.
\section{Getting Started with MPICH}
\label{sec:migrating}
MPICH is a high-performance and widely portable implementation of the
MPI Standard, designed to implement all of MPI-1, MPI-2, MPI-3 and MPI-4
(including dynamic process management, one-sided operations, parallel
I/O, and other extensions). The \emph{MPICH Installer's Guide}
provides some information on MPICH with respect to configuring and
installing it. Details on compiling, linking, and running MPI programs
are described below.
\subsection{Default Runtime Environment}
\label{sec:default-environment}
MPICH provides a separation of process management and communication.
The default runtime environment in MPICH is called Hydra. Other
process managers are also available.
\subsection{Starting Parallel Jobs}
\label{sec:startup}
MPICH implements \texttt{mpiexec} and all of its standard arguments,
together with some extensions. See Section~\ref{sec:mpiexec-standard}
for standard arguments to \texttt{mpiexec} and various subsections of
Section~\ref{sec:mpiexec} for extensions particular to various process
management systems.
\subsection{Command-Line Arguments in Fortran}
\label{sec:fortran-command-line}
MPICH1 (more precisely MPICH1's \texttt{mpirun}) required access to
command line arguments in all application programs, including Fortran
ones, and MPICH1's \texttt{configure} devoted some effort to finding
the libraries that contained the right versions of \texttt{iargc} and
\texttt{getarg} and including those libraries with which the
\texttt{mpifort} script linked MPI programs. Since MPICH does not
require access to command line arguments to applications, these
functions are optional, and \texttt{configure} does nothing special
with them. If you need them in your applications, you will have to
ensure that they are available in the Fortran environment you are
using.
\section{Quick Start}
\label{sec:quickstart}
To use MPICH, you will have to know the directory where MPICH has
been installed. (Either you installed it there yourself, or your
systems administrator has installed it. One place to look in this
case might be \texttt{/usr/local}. If MPICH has not yet been
installed, see the \emph{MPICH Installer's Guide}.) We suggest that
you put the \texttt{bin} subdirectory of that directory into your
path. This will give you access to assorted MPICH commands to
compile, link, and run your programs conveniently. Other commands in
this directory manage parts of the run-time environment and execute
tools.
One of the first commands you might run is \texttt{mpichversion} to
find out the exact version and configuration of MPICH you are working
with. Some of the material in this manual depends on just what version
of MPICH you are using and how it was configured at installation
time.
You should now be able to run an MPI program. Let us assume that the
directory where MPICH has been installed is
\texttt{/home/you/mpich-installed}, and that you have added that directory to
your path, using
\begin{verbatim}
setenv PATH /home/you/mpich-installed/bin:$PATH
\end{verbatim}
for \texttt{tcsh} and \texttt{csh}, or
\begin{verbatim}
export PATH=/home/you/mpich-installed/bin:$PATH
\end{verbatim}
for \texttt{bash} or \texttt{sh}.
Then to run an MPI program, albeit only on one machine, you can do:
\begin{verbatim}
cd /home/you/mpich-installed/examples
mpiexec -n 3 ./cpi
\end{verbatim}
Details for these commands are provided below, but if you can
successfully execute them here, then you have a correctly installed
MPICH and have run an MPI program.
\section{Compiling and Linking}
\label{sec:compiling}
A convenient way to compile and link your program is by using scripts
that use the same compiler that MPICH was built with. These are
\texttt{mpicc}, \texttt{mpicxx}, and \texttt{mpifort},
for C, C++, and Fortran programs, respectively. If any
of these commands are missing, it means that MPICH was configured
without support for that particular language.
%% Pavan Balaji (12/27/2009): I'm commenting out this part as this is
%% broken in the current MPICH stack (see ticket #502).
%% \subsection{Specifying Compilers}
%% \label{sec:specifying-compilers}
%% You need not use the same compiler that MPICH was built with, but not
%% all compilers are compatible. You can also specify the compiler for
%% building MPICH itself, as reported by \texttt{mpichversion}, just by
%% using the compiling and linking commands from the previous section.
%% The environment variables \texttt{MPICH_CC}, \texttt{MPICH_CXX},
%% \texttt{MPICH_F77}, and \texttt{MPICH_F90} may be used to specify
%% alternate C, C++, Fortran 77, and Fortran 90 compilers, respectively.
\subsection{Special Issues for C++}
\label{sec:cxx}
Some users may get error messages such as
\begin{small}
\begin{verbatim}
SEEK_SET is #defined but must not be for the C++ binding of MPI
\end{verbatim}
\end{small}
The problem is that both \texttt{stdio.h} and the MPI C++ interface use
\texttt{SEEK\_SET}, \texttt{SEEK\_CUR}, and \texttt{SEEK\_END}. This is really a bug
in the MPI standard. You can try adding
\begin{verbatim}
#undef SEEK_SET
#undef SEEK_END
#undef SEEK_CUR
\end{verbatim}
before \texttt{mpi.h} is included, or add the definition
\begin{verbatim}
-DMPICH_IGNORE_CXX_SEEK
\end{verbatim}
to the command line (this will cause the MPI versions of \texttt{SEEK\_SET}
etc. to be skipped).
\subsection{Special Issues for Fortran}
\label{sec:fortran}
MPICH provides two kinds of support for Fortran programs. For
Fortran 77 programmers, the file \texttt{mpif.h} provides the
definitions of the MPI constants such as \texttt{MPI\_COMM\_WORLD}.
Fortran 90 programmers should use the \texttt{MPI} module instead;
this provides all of the definitions as well as interface definitions
for many of the MPI functions. However, this MPI module does not
provide full Fortran 90 support; in particular, interfaces for the
routines, such as \texttt{MPI\_Send}, that take ``choice'' arguments
are not provided.
\section{Running Programs with \texttt{mpiexec}}
\label{sec:mpiexec}
The MPI Standard describes \texttt{mpiexec} as a suggested way to run
MPI programs. MPICH implements the \texttt{mpiexec} standard, and also
provides some extensions.
\subsection{Standard \texttt{mpiexec}}
\label{sec:mpiexec-standard}
Here we describe the standard \texttt{mpiexec} arguments from the MPI
Standard~\cite{mpi-forum:mpi2-journal}. To run a program with 'n'
processes on your local machine, you can use:
\begin{verbatim}
mpiexec -n <number> ./a.out
\end{verbatim}
To test that you can run an 'n' process job on multiple nodes:
\begin{verbatim}
mpiexec -f machinefile -n <number> ./a.out
\end{verbatim}
The 'machinefile' is of the form:
\begin{verbatim}
host1
host2:2
host3:4 # Random comments
host4:1
\end{verbatim}
'host1', 'host2', 'host3' and 'host4' are the hostnames of the
machines you want to run the job on. The ':2', ':4', ':1' segments
depict the number of processes you want to run on each node. If
nothing is specified, ':1' is assumed.
More details on interacting with Hydra can be found at
\url{https://github.com/pmodels/mpich/blob/main/doc/wiki/how_to/Using_the_Hydra_Process_Manager.md}
\subsection{Extensions for All Process Management Environments}
\label{sec:extensions-uniform}
Some \texttt{mpiexec} arguments are specific to particular
communication subsystems (``devices'') or process management
environments (``process managers''). Our intention is to make all
arguments as uniform as possible across devices and process managers.
For the time being we will document these separately.
\subsection{\texttt{mpiexec} Extensions for the Hydra Process Manager}
MPICH provides a number of process management systems. Hydra is the
default process manager in MPICH. More details on Hydra and its
extensions to mpiexec can be found at
\url{https://github.com/pmodels/mpich/blob/main/doc/wiki/how_to/Using_the_Hydra_Process_Manager.md}
\subsection{Extensions for the gforker Process Management Environment}
\label{sec:extensions-forker}
\texttt{gforker} is a process management system for starting
processes on a single machine, so called because the MPI processes are
simply \texttt{fork}ed from the \texttt{mpiexec} process. This process
manager supports programs that use \texttt{MPI\_Comm\_spawn} and the other
dynamic process routines, but does not support the use of the dynamic process
routines from programs that are not started with \texttt{mpiexec}. The
\texttt{gforker} process manager is primarily intended as a debugging aid as
it simplifies development and testing of MPI programs on a single node or
processor.
\subsubsection{\texttt{mpiexec} arguments for gforker}
\label{sec:mpiexec-forker}
In addition to the standard \texttt{mpiexec} command-line arguments, the
\texttt{gforker} \texttt{mpiexec} supports the following options:
\begin{description}
\item[\texttt{-np <num>}]A synonym for the standard \texttt{-n} argument
\item[\texttt{-env <name> <value>}]Set the environment variable
\texttt{<name>} to \texttt{<value>} for the processes being run by
\texttt{mpiexec}.
\item[\texttt{-envnone}]Pass no environment variables (other than ones
specified with other \texttt{-env} or \texttt{-genv} arguments) to the
processes being run by \texttt{mpiexec}.
By default, all environment
variables are provided to each MPI process (rationale: principle of
least surprise for the user)
\item[\texttt{-envlist <list>}]Pass the listed environment variables (names
separated by commas), with their current values, to the processes being run by
\texttt{mpiexec}.
\item[\texttt{-genv <name> <value>}]The \item{-genv} options have the same
meaning as their corresponding \texttt{-env} version, except they apply to all
executables, not just the current executable (in the case that the colon
syntax is used to specify multiple executables).
\item[\texttt{-genvnone}]Like \texttt{-envnone}, but for all executables
\item[\texttt{-genvlist <list>}]Like \texttt{-envlist}, but for all executables
\item[\texttt{-usize <n>}]Specify the value returned for the value of the
attribute \texttt{MPI\_UNIVERSE\_SIZE}.
\item[\texttt{-l}]Label standard out and standard error (\texttt{stdout} and \texttt{stderr}) with
the rank of the process
\item[\texttt{-maxtime <n>}]Set a timelimit of \texttt{<n>} seconds.
\item[\texttt{-exitinfo}]Provide more information on the reason each process
exited if there is an abnormal exit
\end{description}
In addition to the commandline arguments, the \texttt{gforker} \texttt{mpiexec}
provides a number of environment variables that can be used to control the
behavior of \texttt{mpiexec}:
\begin{description}
\item[\texttt{MPIEXEC\_TIMEOUT}]Maximum running time in seconds.
\texttt{mpiexec} will terminate MPI programs that take longer than the value
specified by \texttt{MPIEXEC\_TIMEOUT}.
\item[\texttt{MPIEXEC\_UNIVERSE\_SIZE}]Set the universe size
\item[\texttt{MPIEXEC\_PORT\_RANGE}]Set the range of ports that
\texttt{mpiexec} will use
in communicating with the processes that it starts. The format of
this is \texttt{<low>:<high>}. For example, to specify any port between
10000 and 10100, use \texttt{10000:10100}.
\item[\texttt{MPICH\_PORT\_RANGE}]Has the same meaning as
\texttt{MPIEXEC\_PORT\_RANGE} and is used if \texttt{MPIEXEC\_PORT\_RANGE} is
not set.
\item[\texttt{MPIEXEC\_PREFIX\_DEFAULT}]If this environment variable is set,
output to standard output is prefixed by the rank in \texttt{MPI\_COMM\_WORLD}
of the process and output to standard error is prefixed by the rank and the
text \texttt{(err)}; both are followed by an angle bracket (\texttt{>}). If
this variable is not set, there is no prefix.
\item[\texttt{MPIEXEC\_PREFIX\_STDOUT}]Set the prefix used for lines sent to
standard output. A \texttt{\%d} is replaced with the rank in
\texttt{MPI\_COMM\_WORLD}; a \texttt{\%w} is replaced with an indication of
which \texttt{MPI\_COMM\_WORLD} in MPI jobs that involve multiple
\texttt{MPI\_COMM\_WORLD}s (e.g., ones that use \texttt{MPI\_Comm\_spawn} or
\texttt{MPI\_Comm\_connect}).
\item[\texttt{MPIEXEC\_PREFIX\_STDERR}]Like \texttt{MPIEXEC\_PREFIX\_STDOUT},
but for standard error.
\item[\texttt{MPIEXEC\_STDOUTBUF}]Sets the buffering mode for standard
output. Valid values are \texttt{NONE} (no buffering),
\texttt{LINE} (buffering by lines), and \texttt{BLOCK} (buffering by
blocks of characters; the size of the block is implementation
defined). The default is \texttt{NONE}.
\item[\texttt{MPIEXEC\_STDERRBUF}]Like \texttt{MPIEXEC\_STDOUTBUF},
but for standard error.
\end{description}
\subsection{Restrictions of the remshell Process Management Environment}
\label{sec:restrictions-remshell}
The \texttt{remshell} ``process manager'' provides a very simple version of
\texttt{mpiexec} that makes use of the secure shell command (\texttt{ssh}) to
start processes on a collection of machines. As this is intended primarily as
an illustration of how to build a version of \texttt{mpiexec} that works with
other process managers, it does not implement all of the features of the other
\texttt{mpiexec} programs described in this document. In particular, it
ignores the command line options that control the environment variables given
to the MPI programs. It does support the same output labeling features
provided by the \texttt{gforker} version of \texttt{mpiexec}.
However, this version of \texttt{mpiexec} can be used
much like the \texttt{mpirun} for the \texttt{ch\_p4} device in MPICH-1 to run
programs on a collection of machines that allow remote shells. A file by the
name of \texttt{machines} should contain the names of machines on which
processes can be run, one machine name per line. There must be enough
machines listed to satisfy the requested number of processes; you can list the
same machine name multiple times if necessary.
\subsection{Using MPICH with Slurm and PBS}
\label{sec:external_pm}
There are multiple ways of using MPICH with Slurm or PBS. Hydra
provides native support for both Slurm and PBS, and is likely the
easiest way to use MPICH on these systems (see the Hydra
documentation above for more details).
Alternatively, Slurm also provides compatibility with MPICH's
internal process management interface. To use this, you need to
configure MPICH with Slurm support, and then use the {\texttt srun}
job launching utility provided by Slurm.
For PBS, MPICH jobs can be launched in two ways: (i) use Hydra's
mpiexec with the appropriate options corresponding to PBS, or (ii)
using the OSC mpiexec.
\subsubsection{OSC mpiexec}
\label{sec:osc_mpiexec}
Pete Wyckoff from the Ohio Supercomputer Center provides a alternate
utility called OSC mpiexec to launch MPICH jobs on PBS systems. More
information about this can be found here:
\url{http://www.osc.edu/~pw/mpiexec}
\section{Specification of Implementation Details}
\label{sec:specification}
The MPI Standard defines a number of areas where a library is free to
define its own specific behavior as long as such behavior is documented
appropriately. This section provides that documentation for MPICH where
necessary.
\subsection{MPI Error Handlers for Communicators}
\label{sec:errhandler}
In Section 8.3.1 (Error Handlers for Communicators) of the MPI-3.0
Standard~\cite{mpi-forum:mpi3},
MPI defines an error handler callback function as
\begin{verbatim}
typedef void MPI_Comm_errhandler_function(MPI_Comm *, int *, ...);
\end{verbatim}
Where the first argument is the communicator in use, the second argument is
the error code to be returned by the MPI routine that raised the error, and
the remaining arguments to be implementation specific ``{\texttt varargs}''.
MPICH does not provide any arguments as part of this list. So a callback
function being provided to MPICH is sufficient if the header is
\begin{verbatim}
typedef void MPI_Comm_errhandler_function(MPI_Comm *, int *);
\end{verbatim}
\section{Debugging}
\label{sec:debugging}
Debugging parallel programs is notoriously difficult. Here we describe
a number of approaches, some of which depend on the exact version of
MPICH you are using.
\subsection{TotalView}
\label{sec:totalview}
MPICH supports use of the TotalView debugger from Etnus. If MPICH
has been configured to enable debugging with TotalView then one can
debug an MPI program using
\begin{verbatim}
totalview -a mpiexec -a -n 3 cpi
\end{verbatim}
You will get a popup window from TotalView asking whether you want to
start the job in a stopped state. If so, when the TotalView window
appears, you may see assembly code in the source window. Click on
\texttt{main} in the stack window (upper left) to see the source of
the \texttt{main} function. TotalView will show that the program (all
processes) are stopped in the call to \texttt{MPI\_Init}.
If you have TotalView 8.1.0 or later, you can use a TotalView feature
called indirect launch with MPICH. Invoke TotalView as:
\begin{verbatim}
totalview <program> -a <program args>
\end{verbatim}
Then select the Process/Startup Parameters command. Choose the
Parallel tab in the resulting dialog box and choose MPICH as the
parallel system. Then set the number of tasks using the Tasks field
and enter other needed mpiexec arguments into the Additional Starter
Arguments field.
\section{Checkpointing}
\label{sec:checkpointing}
MPICH supports checkpoint/rollback fault tolerance when used with the
Hydra process manager. Currently only the BLCR checkpointing library
is supported. BLCR needs to be installed separately. Below we
describe how to enable the feature in MPICH and how to use it. This
information can also be found on the MPICH Wiki:
\url{https://github.com/pmodels/mpich/blob/main/doc/wiki/design/Checkpointing.md}
\subsection{Configuring for checkpointing}
\label{sec:conf-checkp}
First, you need to have BLCR version 0.8.2 installed on your
machine. If it's installed in the default system location, add the
following two options to your configure command:
\begin{small}
\begin{verbatim}
--enable-checkpointing --with-hydra-ckpointlib=blcr
\end{verbatim}
\end{small}
If BLCR is not installed in the default system location, you'll need
to tell MPICH's configure where to find it. You might also need to
set the \texttt{LD\_LIBRARY\_PATH} environment variable so that BLCR's shared
libraries can be found. In this case add the following options to your
configure command:
\begin{small}
\begin{verbatim}
--enable-checkpointing --with-hydra-ckpointlib=blcr
--with-blcr=BLCR_INSTALL_DIR LD_LIBRARY_PATH=BLCR_INSTALL_DIR/lib
\end{verbatim}
\end{small}
where \texttt{BLCR\_INSTALL\_DIR} is the directory where BLCR has been
installed (whatever was specified in \texttt{--prefix} when BLCR was
configured). Note, checkpointing is only supported with the Hydra
process manager. Hyrda will used by default, unless you choose
something else with the \texttt{--with-pm=} configure option.
After it's configured, compile as usual (e.g., \texttt{make; make install}).
\subsection{Taking checkpoints}
\label{sec:taking-checkpoints}
To use checkpointing, include the \texttt{-ckpointlib} option for
\texttt{mpiexec} to specify the checkpointing library to use and
\texttt{-ckpoint-prefix} to specify the directory where the checkpoint
images should be written:
\begin{small}
\begin{verbatim}
shell$ mpiexec -ckpointlib blcr \
-ckpoint-prefix /home/buntinas/ckpts/app.ckpoint \
-f hosts -n 4 ./app
\end{verbatim}
\end{small}
While the application is running, the user can request for a
checkpoint at any time by sending a \texttt{SIGUSR1} signal to
\texttt{mpiexec}. You can also automatically checkpoint the
application at regular intervals using the mpiexec option
\texttt{-ckpoint-interval} to specify the number of seconds between
checkpoints:
\begin{small}
\begin{verbatim}
shell$ mpiexec -ckpointlib blcr \
-ckpoint-prefix /home/buntinas/ckpts/app.ckpoint \
-ckpoint-interval 3600 -f hosts -n 4 ./app
\end{verbatim}
\end{small}
The checkpoint/restart parameters can also be controlled with the
environment variables \texttt{HYDRA\_\linebreak[0]CKPOINTLIB},
\texttt{HYDRA\_\linebreak[0]CKPOINT\_\linebreak[0]PREFIX} and
\texttt{HYDRA\_\linebreak[0]CKPOINT\_\linebreak[0]INTERVAL}.
Each checkpoint generates one file per node. Note that checkpoints
for all processes on a node will be stored in the same file. Each
time a new checkpoint is taken an additional set of files are created.
The files are numbered by the checkpoint number. This allows the
application to be restarted from checkpoints other than the most
recent. The checkpoint number can be specified with the
\texttt{-ckpoint-num} parameter. To restart a process:
\begin{small}
\begin{verbatim}
shell$ mpiexec -ckpointlib blcr \
-ckpoint-prefix /home/buntinas/ckpts/app.ckpoint \
-ckpoint-num 5 -f hosts -n 4
\end{verbatim}
\end{small}
Note that by default, the process will be restarted from the first
checkpoint, so in most cases, the checkpoint number should be
specified.
\section{Other Tools Provided with MPICH}
\label{sec:other-tools}
MPICH also includes a test suite for MPI functionality; this suite may
be found in the \texttt{mpich/test/mpi} source directory and can be
run with the command \texttt{make testing} after \texttt{make install}.
This test suite should work with any MPI implementation, not just MPICH.
To test a pre-installed MPI implementation:
\begin{small}
\begin{verbatim}
shell$ cd mpich/test/mpi
shell$ ./configure --with-mpi=/path/to/mpi
shell$ make testing
\end{verbatim}
\end{small}
\clearpage
\appendix
\section{Frequently Asked Questions}
The frequently asked questions are maintained online
here:\url{https://github.com/pmodels/mpich/blob/main/doc/wiki/faq/Frequently_Asked_Questions.md}
\bibliographystyle{plain}
\bibliography{user}
\end{document}
|