% -*- latex -*-
%
% Copyright (c) 2001-2004 The Trustees of Indiana University.
% All rights reserved.
% Copyright (c) 1998-2001 University of Notre Dame.
% All rights reserved.
% Copyright (c) 1994-1998 The Ohio State University.
% All rights reserved.
%
% This file is part of the LAM/MPI software package. For license
% information, see the LICENSE file in the top level directory of the
% LAM/MPI source distribution.
%
% $Id: debuggers.tex,v 1.11 2003/05/27 05:50:24 brbarret Exp $
%
\chapter{Debugging Parallel Programs}
\label{sec:debug}
\label{sec:debugging}
\index{debuggers|(}
LAM/MPI supports multiple methods of debugging parallel programs. The
following notes and observations generally apply to debugging in
parallel:
\begin{itemize}
\item Note that most debuggers require that MPI applications be
compiled with debugging support enabled. This typically entails
adding \cmdarg{-g} to the compile and link lines when building
your MPI application (see the example after this list).
\item Unless you specifically need it, it is not recommended to
compile LAM itself with \cmdarg{-g}. Leaving LAM without debugging
symbols allows you to treat MPI function calls as atomic
instructions.
\item Even when debugging in parallel, it is possible that not all MPI
processes will execute exactly the same code. For example, ``if''
statements that are based upon the calling process's rank in a
communicator, or other location-specific information, may cause
different execution paths in each MPI process.
\end{itemize}
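
For example, a typical debug build of an MPI application using LAM's
wrapper compiler might look like the following (the source file and
program names are placeholders):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpicc -g -o my_mpi_program my_mpi_program.c
\end{lstlisting}
% Stupid emacs mode: $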
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Naming MPI Objects}
LAM/MPI supports the MPI-2 functions {\sf
MPI\_\-$<$type$>$\_\-SET\_\-NAME} and {\sf
MPI\_\-$<$type$>$\_\-GET\_\-NAME}, where {\sf $<$type$>$} can be:
{\sf COMM}, {\sf WIN}, or {\sf TYPE}. Hence, you can associate
relevant text names with communicators, windows, and datatypes (e.g.,
``6x13x12 molecule datatype'', ``Local group reduction
intracommunicator'', ``Spawned worker intercommunicator''). The use
of these functions is strongly encouraged while debugging MPI
applications. Since they are constant-time, one-time setup functions,
using these functions likely does not impact performance, and may be
safe to use in production environments, too.
The rationale for using these functions is to allow LAM (and supported
debuggers, profilers, and other MPI diagnostic tools) to display
accurate information about MPI communicators, windows, and datatypes.
For example, whenever a communicator name is available, LAM will use
it in relevant error messages; when names are not available,
communicators (and windows and types) are identified by index number,
which -- depending on the application -- may vary between successive
runs. The TotalView parallel debugger will also show communicator
names (if available) when displaying the message queues.
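
For example, a communicator and a derived datatype might be named as
follows (a minimal sketch; the duplicated communicator, the datatype,
and the names shown are purely illustrative and not taken from any
particular application):

\lstset{style=lam-shell}
\begin{lstlisting}
/* Illustrative only: the duplicated communicator, the derived
   datatype, and the names are arbitrary examples. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm work_comm;
    MPI_Datatype row_type;
    char name[MPI_MAX_OBJECT_NAME];
    int len;

    MPI_Init(&argc, &argv);

    /* Name a duplicated communicator */
    MPI_Comm_dup(MPI_COMM_WORLD, &work_comm);
    MPI_Comm_set_name(work_comm, "Local group reduction intracommunicator");

    /* Name a derived datatype */
    MPI_Type_contiguous(12, MPI_DOUBLE, &row_type);
    MPI_Type_set_name(row_type, "12-double row datatype");

    /* Retrieve a name, e.g., for a diagnostic message */
    MPI_Comm_get_name(work_comm, name, &len);
    printf("Communicator name: %s\n", name);

    MPI_Type_free(&row_type);
    MPI_Comm_free(&work_comm);
    MPI_Finalize();
    return 0;
}
\end{lstlisting}

With such names set, LAM's error messages and TotalView's message
queue display (described below) can refer to these objects by name
rather than by index number.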
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{TotalView Parallel Debugger}
\label{sec:debug-totalview}
TotalView is a commercial debugger from Etnus that supports debugging
MPI programs in parallel. That is, with supported MPI
implementations, the TotalView debugger can automatically attach to
one or more MPI processes in a parallel application.

LAM now supports basic debugging functionality with the TotalView
debugger. Specifically, LAM supports TotalView attaching to one or
more MPI processes, as well as viewing the MPI message queues in
supported RPI modules.

This section provides some general tips and suggested use of TotalView
with LAM/MPI. It is {\em not} intended to replace the TotalView
documentation in any way. {\bf Be sure to consult the TotalView
documentation for more information and details than are provided
here.}

Note: TotalView is a licensed product provided by Etnus. You need to
have TotalView installed properly before you can use it with
LAM.\footnote{Refer to \url{http://www.etnus.com/} for more
information about TotalView.}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Attaching TotalView to MPI Processes}
\index{TotalView parallel debugger}
\index{debuggers!TotalView}
LAM/MPI does not need to be configured or compiled in any special way
to allow TotalView to attach to MPI processes.

You can attach TotalView to MPI processes started by \icmd{mpirun} /
\icmd{mpiexec} in the following ways:
\begin{enumerate}
\item Use the \cmdarg{-tv} convenience argument when running
\cmd{mpirun} or \cmd{mpiexec} (this is the preferred method):
\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun -tv [...other mpirun arguments...]
\end{lstlisting}
% Stupid emacs mode: $
For example:
\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun -tv C my_mpi_program arg1 arg2 arg3
\end{lstlisting}
% Stupid emacs mode: $
\item Directly launch \cmd{mpirun} in TotalView (you {\em cannot}
launch \cmd{mpiexec} in TotalView):
\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ totalview mpirun -a [...mpirun arguments...]
\end{lstlisting}
% Stupid emacs mode: $
For example:
\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ totalview mpirun -a C my_mpi_program arg1 arg2 arg3
\end{lstlisting}
% Stupid emacs mode: $
Note the \cmdarg{-a} argument after \cmd{mpirun}. This is necessary
to tell TotalView that arguments following ``\cmdarg{-a}'' belong to
\cmd{mpirun} and not TotalView.

Also note that the \cmdarg{-tv} convenience argument to \cmd{mpirun}
simply executes ``\cmd{totalview mpirun -a ...}'', so both methods
are essentially identical.
\end{enumerate}

TotalView can attach either to all MPI processes in
\mpiconst{MPI\_\-COMM\_\-WORLD} or to a subset of them. The controls
for ``partial attach'' are in TotalView, not LAM. In TotalView 6.0.0
(analogous methods may work for earlier versions of TotalView -- see
the TotalView documentation for more details), you need to set the
parallel launch preference to ``ask.'' In the root window menu:
\begin{enumerate}
\item Select File $\rightarrow$ Preferences
\item Select the Parallel tab
\item In the ``When a job goes parallel'' box, select ``Ask what to do''
\item Click on OK
\end{enumerate}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Suggested Use}
Since TotalView support is started with the \cmd{mpirun} command,
TotalView will, by default, start by debugging \cmd{mpirun} itself.
While this may seem to be an annoying drawback, there are actually
good reasons for this:
\begin{itemize}
\item While debugging the parallel program, if you need to re-run the
program, you can simply re-run the application from within TotalView
itself. There is no need to exit the debugger to run your parallel
application again.
\item TotalView can be configured to automatically skip displaying the
\cmd{mpirun} code. Specifically, instead of displaying the
\cmd{mpirun} code and enabling it for debugging, TotalView will
recognize the command named \cmd{mpirun} and start executing it
immediately upon load. See below for details.
\end{itemize}

\noindent There are two ways to start debugging the MPI application:
\begin{enumerate}
\item The preferred method is to have a \ifile{\$HOME/.tvdrc} file
that tells TotalView to skip past the \cmd{mpirun} code and
automatically start the parallel program. Create or edit your
\ifile{\$HOME/.tvdrc} file to include the following:
\lstset{style=lam-shell}
\begin{lstlisting}
# Set a variable to say what the MPI "starter" program is
set starter_program mpirun

# Check if the newly loaded image is the starter program
# and start it immediately if it is.
proc auto_run_starter {loaded_id} {
    global starter_program
    set executable_name [TV::symbol get $loaded_id full_pathname]
    set file_component [file tail $executable_name]

    if {[string compare $file_component $starter_program] == 0} {
        puts "Automatically starting $file_component"
        dgo
    }
}

# Append this function to TotalView's image load callbacks so that
# TotalView runs this program automatically.
dlappend TV::image_load_callbacks auto_run_starter
\end{lstlisting}
% Stupid emacs mode: $
Note that when using this method, \cmd{mpirun} is actually running in
the debugger while you are debugging your parallel application, even
though it may not be obvious. Hence, when the MPI job completes,
you'll be returned to viewing \cmd{mpirun} in the debugger. {\em This
is normal} -- all MPI processes have exited; the only process that
remains is \cmd{mpirun}. If you click ``Go'' again, \cmd{mpirun} will
launch the MPI job again.
\item Do not create the \file{\$HOME/.tvdrc} file with the ``auto
run'' functionality described in the previous item; instead, simply
click the ``Go'' button when TotalView launches. This runs the
\cmd{mpirun} command with its command line arguments, which will
eventually launch the MPI programs and allow attachment to the MPI
processes.
\end{enumerate}

When TotalView initially attaches to an MPI process, you will see the
code for \mpifunc{MPI\_\-INIT} or one of its sub-functions (which will
likely be assembly code, unless LAM itself was compiled with debugging
information).
%
You probably want to skip past the rest of \mpifunc{MPI\_\-INIT}. In
the Stack Trace window, click on the function that called
\mpifunc{MPI\_\-INIT} (e.g., \func{main}) and set a breakpoint on the
line following the call to \mpifunc{MPI\_\-INIT}. Then click ``Go''.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Limitations}
The following limitations are currently imposed when debugging LAM/MPI
jobs in TotalView:
\begin{enumerate}
\item Cannot attach to scripts: You cannot attach TotalView to MPI
processes if they were launched by scripts instead of \cmd{mpirun}.
Specifically, the following won't work:
\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun -tv C script_to_launch_foo
\end{lstlisting}
% Stupid emacs mode: $
But this will:
\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun -tv C foo
\end{lstlisting}
% Stupid emacs mode: $
For the same reason, although the \cmdarg{-tv} switch works with
\cmd{mpiexec} (because \cmd{mpiexec} eventually invokes \cmd{mpirun}),
you cannot launch \cmd{mpiexec} itself under TotalView, since
\cmd{mpiexec} is a script.
\item TotalView needs to launch the TotalView server on all remote
nodes in order to attach to remote processes.
The command that TotalView uses to launch remote executables might
be different from what LAM/MPI uses. You may have to set this
command explicitly and independently of LAM/MPI.
%
For example, if your local environment has \cmd{rsh} disabled and
only allows \cmd{ssh}, then you likely need to set the TotalView
remote server launch command to ``\cmd{ssh}''. You can set this
internally in TotalView or with the \ienvvar{TVDSVRLAUNCHCMD}
environment variable (see the TotalView documentation for more
information on this, and the example after this list).
\item The TotalView license must be available on all nodes where you
expect to attach the debugger.

Consult with your system administrator to ensure that this is set up
properly. You may need to edit your ``dot'' files (e.g.,
\file{.profile}, \file{.bashrc}, \file{.cshrc}, etc.) to ensure that
relevant environment variable settings exist on all nodes when you
\cmd{lamboot}.
\item It is always a good idea to let \cmd{mpirun} finish before you
rerun or exit TotalView.
\item TotalView will not be able to attach to MPI programs when you
execute \cmd{mpirun} with the \cmdarg{-s} option.

This is because TotalView will not get the source code of your
program on nodes other than the source node. We advise you to
either use a common filesystem or copy the source code and
executable to all nodes when using TotalView with LAM so that you
can avoid the use of \cmd{mpirun}'s \cmdarg{-s} flag.
\end{enumerate}
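
As noted in item 2 above, the TotalView remote server launch command
can be set with the \ienvvar{TVDSVRLAUNCHCMD} environment variable.
For example, with a \cmd{csh}-style shell, it might be set to
\cmd{ssh} before running \cmd{mpirun} (consult the TotalView
documentation for the authoritative mechanism):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ setenv TVDSVRLAUNCHCMD ssh
\end{lstlisting}
% Stupid emacs mode: $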
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Message Queue Debugging}
The TotalView debugger can show the sending, receiving, and
unexpected message queues for many parallel applications. Note the
following:
\begin{itemize}
\item Using the MPI-2 function for naming communicators
(\mpifunc{MPI\_\-COMM\_\-SET\_\-NAME}) is strongly recommended when
using the message queue debugging functionality.
\mpiconst{MPI\_\-COMM\_\-WORLD} and \mpiconst{MPI\_\-COMM\_\-SELF},
for example, are automatically named by LAM/MPI. Naming communicators
makes it significantly easier to identify communicators of interest
in the debugger.

Any communicator that is not named will be displayed as ``{\tt
--unnamed--}''.
\item Message queue debugging of applications is not currently
supported for 64-bit executables. If you attempt to use the message
queue debugging functionality on a 64-bit executable, TotalView will
display a warning before disabling the message queue options.
\item The \rpi{lamd} RPI does not support the message queue debugging
functionality.
\item LAM/MPI does not currently provide debugging support for dynamic
processes (e.g., \mpifunc{MPI\_\-COMM\_\-SPAWN}).
\end{itemize}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Serial Debuggers}
\label{sec:debug-serial}
\index{serial debuggers}
\index{debuggers!serial}
LAM also allows the use of one or more serial debuggers when debugging
a parallel program.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Launching Debuggers}
\index{debuggers!launching}
LAM allows the arbitrary execution of any executable in an MPI context
as long as an MPI executable is eventually launched. For example, it
is common to \icmd{mpirun} a debugger (or a script that launches a
debugger on some nodes, and directly runs the application on other
nodes) since the debugger will eventually launch the MPI process.

However, one must be careful when running programs on remote nodes
that expect to use \file{stdin} -- \file{stdin} on remote nodes is
redirected to \file{/dev/null}. For this reason, it is often
advantageous to export the \ienvvar{DISPLAY} environment variable and
run a shell script that invokes an \cmd{xterm} with a debugger (such
as \cmd{gdb}) running in it on each node. For example:
\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -x DISPLAY xterm-gdb.csh
\end{lstlisting}
% Stupid emacs mode: $
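
The \cmd{xterm-gdb.csh} script above is not provided by LAM/MPI; it is
simply a user-written script that opens an \cmd{xterm} running a
debugger. A minimal sketch of such a script is shown below, assuming
that the MPI executable is passed as the script's first argument
(e.g., ``\cmd{mpirun C -x DISPLAY xterm-gdb.csh my\_mpi\_program}'');
the script name and the choice of \cmd{gdb} are only examples:

\lstset{style=lam-shell}
\begin{lstlisting}
#!/bin/csh -f

# Hypothetical xterm-gdb.csh: open an xterm on this node and run gdb
# on the MPI executable, which is assumed to be this script's first
# argument.  Any arguments to the MPI program itself must be given to
# gdb's "run" command once the debugger starts.
xterm -e gdb $argv[1]

exit 0
\end{lstlisting}
% Stupid emacs mode: $
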
Additionally, it may be desirable to only run the debugger on certain
ranks in \mpiconst{MPI\_\-COMM\_\-WORLD}. For example, with parallel
jobs that include tens or hundreds of MPI processes, it is really only
feasible to attach debuggers to a small number of processes. In this
case, a script may be helpful to launch debuggers for some ranks in
\mpiconst{MPI\_\-COMM\_\-WORLD} and directly launch the application in
others.

The LAM environment variable \ienvvar{LAMRANK} can be helpful in this
situation. This variable is placed in the environment before the
target application is executed. Hence, it is visible to shell scripts
as well as the target MPI application. It is erroneous to alter the
value of this variable.

Consider the following script:
\lstset{style=lam-shell}
\begin{lstlisting}
#!/bin/csh -f

# Which debugger to run
set debugger=gdb

# On MPI_COMM_WORLD rank 0, launch the process in the debugger.
# Elsewhere, just launch the process directly.
if ("$LAMRANK" == "0") then
    echo Launching $debugger on MPI_COMM_WORLD rank $LAMRANK
    $debugger $*
else
    echo Launching MPI executable on MPI_COMM_WORLD rank $LAMRANK
    $*
endif

# All done
exit 0
\end{lstlisting}
% Stupid emacs mode: $
This script can be executed via \cmd{mpirun} to launch a debugger on
\mpiconst{MPI\_\-COMM\_\-WORLD} rank 0, and directly launch the MPI
process in all other cases.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Attaching Debuggers}
\index{debuggers!attaching}
In some cases, it is not possible or desirable to start debugging a
parallel application immediately. For example, it may only be
desirable to attach to certain MPI processes whose identity may not be
known until run-time.

In this case, the technique of attaching to a running process can be
used (this functionality is supported by many serial debuggers).
Specifically, determine which MPI process you want to attach to, log
in to the node where it is running, and use the debugger's
``attach'' functionality to latch on to the running process.
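
For example, with \cmd{gdb}, one might find the process ID of the
target MPI process on that node and then attach to it (the program
name and process ID shown are placeholders):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ ps -ef | grep my_mpi_program
shell$ gdb my_mpi_program 12345
\end{lstlisting}
% Stupid emacs mode: $

Once attached, the debugger stops the process, and it can be examined
just as if it had been started under the debugger.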
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Memory-Checking Debuggers}
\label{sec:debug-mem}
\index{debuggers!memory-checking}
Memory-checking debuggers are an invaluable tool when debugging
software (even parallel software). They can provide detailed reports
about memory leaks, bad memory accesses, duplicate/bad memory
management calls, etc. Some memory-checking debuggers include (but
are not limited to): the Solaris Forte debugger (including the
\cmd{bcheck} command-line memory checker), the Purify software
package, and the Valgrind software package.

LAM can be used with memory-checking debuggers. However, LAM should
be compiled with special support for such debuggers. This is because,
in an attempt to optimize performance, many structures used internally
by LAM do not always have all of their memory initialized. For
example, LAM's internal \type{struct nmsg} is one of the underlying
message constructs used to pass data between LAM processes. But since
the \type{struct nmsg} is used in so many places, it is a generalized
structure and contains fields that are not used in every situation.

By default, LAM only initializes relevant struct members before using
a structure. Using a structure may involve sending the entire
structure (including uninitialized members) to a remote host. This is
not a problem for LAM; the remote host will also ignore the irrelevant
struct members (depending on the specific function being invoked).
More to the point -- LAM was designed this way to avoid setting
variables that will not be used; this is a slight optimization in
run-time performance. Memory-checking debuggers, however, will flag
this behavior with ``read from uninitialized'' warnings.

The \confflag{with-purify} option can be given to LAM's
\cmd{configure} script to force LAM to zero out {\em all} memory
before it is used. This will eliminate the ``read from
uninitialized'' types of warnings that memory-checking debuggers will
identify deep inside LAM. This option can only be specified when LAM
is configured; it is not possible to enable or disable this behavior
at run-time. Since this option incurs a slight run-time performance
penalty, it is not the default.
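
For example, a representative \cmd{configure} invocation might be
(other options omitted):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ ./configure --with-purify [...other configure options...]
\end{lstlisting}
% Stupid emacs mode: $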
% Close out the debuggers index entry
\index{debuggers|)}