File: debuggers.tex

% -*- latex -*-
%
% Copyright (c) 2001-2004 The Trustees of Indiana University.  
%                         All rights reserved.
% Copyright (c) 1998-2001 University of Notre Dame. 
%                         All rights reserved.
% Copyright (c) 1994-1998 The Ohio State University.  
%                         All rights reserved.
% 
% This file is part of the LAM/MPI software package.  For license
% information, see the LICENSE file in the top level directory of the
% LAM/MPI source distribution.
%
% $Id: debuggers.tex,v 1.11 2003/05/27 05:50:24 brbarret Exp $
%

\chapter{Debugging Parallel Programs}
\label{sec:debug}
\label{sec:debugging}
\index{debuggers|(}

LAM/MPI supports multiple methods of debugging parallel programs.  The
following notes and observations generally apply to debugging in
parallel:

\begin{itemize}
\item Note that most debuggers require that MPI applications be
  compiled with debugging support enabled.  This typically entails
  adding \cmdarg{-g} to the compile and link lines when building
  your MPI application (see the example following this list).
  
\item Unless you specifically need it, we do not recommend compiling
  LAM itself with \cmdarg{-g}.  Leaving LAM uninstrumented allows you
  to treat MPI function calls as atomic instructions while debugging
  your application.

\item Even when debugging in parallel, it is possible that not all MPI
  processes will execute exactly the same code.  For example, ``if''
  statements that are based on the calling process's rank in a
  communicator, or on other location-specific information, may cause
  different execution paths in each MPI process.
\end{itemize}
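
For example, when building an MPI application with LAM's \cmd{mpicc}
wrapper compiler, debugging support is typically enabled as follows (a
minimal sketch; the source file and program names are hypothetical):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpicc -g my_mpi_program.c -o my_mpi_program
\end{lstlisting}
% Stupid emacs mode: $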


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Naming MPI Objects}

LAM/MPI supports the MPI-2 functions {\sf
  MPI\_\-$<$type$>$\_\-SET\_\-NAME} and {\sf
  MPI\_\-$<$type$>$\_\-GET\_\-NAME}, where {\sf $<$type$>$} can be:
{\sf COMM}, {\sf WIN}, or {\sf TYPE}.  Hence, you can associate
relevant text names with communicators, windows, and datatypes (e.g.,
``6x13x12 molecule datatype'', ``Local group reduction
intracommunicator'', ``Spawned worker intercommunicator'').  The use
of these functions is strongly encouraged while debugging MPI
applications.  Since they are constant-time, one-time setup functions,
they are unlikely to impact performance and are generally safe to use
in production environments as well.

The rationale for using these functions is to allow LAM (and supported
debuggers, profilers, and other MPI diagnostic tools) to display
accurate information about MPI communicators, windows, and datatypes.
For example, whenever a communicator name is available, LAM will use
it in relevant error messages; when names are not available,
communicators (and windows and types) are identified by index number,
which -- depending on the application -- may vary between successive
runs.  The TotalView parallel debugger will also show communicator
names (if available) when displaying the message queues.
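
For example, a duplicated communicator could be given a descriptive
name as in the following sketch (the program structure, variable
names, and name string are purely illustrative):

\begin{lstlisting}
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm worker_comm;
    char name[MPI_MAX_OBJECT_NAME];
    int len;

    MPI_Init(&argc, &argv);

    /* Duplicate MPI_COMM_WORLD and give the new communicator a
       descriptive name; LAM and supported debuggers can then display
       this name instead of an index number. */
    MPI_Comm_dup(MPI_COMM_WORLD, &worker_comm);
    MPI_Comm_set_name(worker_comm, "Worker duplicate of MPI_COMM_WORLD");

    /* The name can also be retrieved later, e.g., for logging. */
    MPI_Comm_get_name(worker_comm, name, &len);
    printf("Communicator name: %s\n", name);

    MPI_Comm_free(&worker_comm);
    MPI_Finalize();
    return 0;
}
\end{lstlisting}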

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{TotalView Parallel Debugger}
\label{sec:debug-totalview}

TotalView is a commercial debugger from Etnus that supports debugging
MPI programs in parallel.  That is, with supported MPI
implementations, the TotalView debugger can automatically attach to
one or more MPI processes in a parallel application.

LAM now supports basic debugging functionality with the TotalView
debugger.  Specifically, LAM supports TotalView attaching to one or
more MPI processes, as well as viewing the MPI message queues in
supported RPI modules.

This section provides some general tips and suggested use of TotalView
with LAM/MPI.  It is {\em not} intended to replace the TotalView
documentation in any way.  {\bf Be sure to consult the TotalView
  documentation for more information and details than are provided
  here.}

Note: TotalView is a licensed product provided by Etnus.  You need to
have TotalView installed properly before you can use it with
LAM.\footnote{Refer to \url{http://www.etnus.com/} for more
  information about TotalView.}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Attaching TotalView to MPI Processes}
\index{TotalView parallel debugger}
\index{debuggers!TotalView}

LAM/MPI does not need to be configured or compiled in any special way
to allow TotalView to attach to MPI processes.

You can attach TotalView to MPI processes started by \icmd{mpirun} /
\icmd{mpiexec} in the following ways:

\begin{enumerate}
\item Use the \cmdarg{-tv} convenience argument when running
  \cmd{mpirun} or \cmd{mpiexec} (this is the preferred method):

  \lstset{style=lam-cmdline}
  \begin{lstlisting}
shell$ mpirun -tv [...other mpirun arguments...]
  \end{lstlisting}
  % Stupid emacs mode: $

  For example:
  
  \lstset{style=lam-cmdline}
  \begin{lstlisting}
shell$ mpirun -tv C my_mpi_program arg1 arg2 arg3
  \end{lstlisting}
  % Stupid emacs mode: $

\item Directly launch \cmd{mpirun} in TotalView (you {\em cannot}
  launch \cmd{mpiexec} in TotalView):
  
  \lstset{style=lam-cmdline}
  \begin{lstlisting}
shell$ totalview mpirun -a [...mpirun arguments...]
  \end{lstlisting}
  % Stupid emacs mode: $
  
  For example:
  
  \lstset{style=lam-cmdline}
  \begin{lstlisting}
shell$ totalview mpirun -a C my_mpi_program arg1 arg2 arg3
  \end{lstlisting}
  % Stupid emacs mode: $
  
  Note the \cmdarg{-a} argument after \cmd{mpirun}.  This is necessary
  to tell TotalView that arguments following ``\cmdarg{-a}'' belong to
  \cmd{mpirun} and not TotalView.
  
  Also note that the \cmdarg{-tv} convenience argument to \cmd{mpirun}
  simply executes ``\cmd{totalview mpirun -a ...}''; so both methods
  are essentially identical.
\end{enumerate}
        
TotalView can attach either to all MPI processes in
\mpiconst{MPI\_\-COMM\_\-WORLD} or to a subset of them.  The controls for
``partial attach'' are in TotalView, not LAM.  In TotalView 6.0.0
(analogous methods may work for earlier versions of TotalView -- see
the TotalView documentation for more details), you need to set the
parallel launch preference to ``ask.''  In the root window menu:

\begin{enumerate}
\item Select File $\rightarrow$ Preferences
\item Select the Parallel tab
\item In the ``When a job goes parallel'' box, select ``Ask what to do''
\item Click on OK
\end{enumerate}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Suggested Use}

Since TotalView support is started with the \cmd{mpirun} command,
TotalView will, by default, start by debugging \cmd{mpirun} itself.
While this may seem to be an annoying drawback, there are actually 
good reasons for this:

\begin{itemize}
\item While debugging the parallel program, if you need to re-run the
  program, you can simply re-run the application from within TotalView
  itself.  There is no need to exit the debugger to run your parallel
  application again.
  
\item TotalView can be configured to automatically skip displaying the
  \cmd{mpirun} code.  Specifically, instead of displaying the
  \cmd{mpirun} code and enabling it for debugging, TotalView will
  recognize the command named \cmd{mpirun} and start executing it
  immediately upon load.  See below for details.
\end{itemize}

\noindent There are two ways to start debugging the MPI application:

\begin{enumerate}
\item The preferred method is to have a \ifile{\$HOME/.tvdrc} file
  that tells TotalView to skip past the \cmd{mpirun} code and
  automatically start the parallel program.  Create or edit your
  \ifile{\$HOME/.tvdrc} file to include the following:

\lstset{style=lam-shell}
\begin{lstlisting}
# Set a variable to say what the MPI ``starter'' program is
set starter_program mpirun

# Check if the newly loaded image is the starter program
# and start it immediately if it is.
proc auto_run_starter {loaded_id} {
    global starter_program
    set executable_name [TV::symbol get $loaded_id full_pathname]
    set file_component [file tail $executable_name]

    if {[string compare $file_component $starter_program] == 0} {
        puts "Automatically starting $file_component"
        dgo
    }
}

# Append this function to TotalView's image load callbacks so that
# TotalView runs this program automatically.
dlappend TV::image_load_callbacks auto_run_starter
\end{lstlisting}
% Stupid emacs mode: $

Note that when using this method, \cmd{mpirun} is actually running in
the debugger while you are debugging your parallel application, even
though it may not be obvious.  Hence, when the MPI job completes,
you'll be returned to viewing \cmd{mpirun} in the debugger.  {\em This
  is normal} -- all MPI processes have exited; the only process that
remains is \cmd{mpirun}.  If you click ``Go'' again, \cmd{mpirun} will
launch the MPI job again.

\item Do not create the \file{\$HOME/.tvdrc} file with the ``auto
  run'' functionality described in the previous item, but instead
  simply click the ``Go'' button when TotalView launches.  This runs
  the \cmd{mpirun} command with its command-line arguments, which will
  eventually launch the MPI programs and allow attaching to the MPI
  processes.
  
\end{enumerate}

When TotalView initially attaches to an MPI process, you will see the
code for \mpifunc{MPI\_\-INIT} or one of its sub-functions (which will
likely be assembly code, unless LAM itself was compiled with debugging
information).
%
You probably want to skip past the rest of \mpifunc{MPI\_\-INIT}.  In
the Stack Trace window, click on the function that called
\mpifunc{MPI\_\-INIT} (e.g., \func{main}) and set a breakpoint on the
line following the call to \mpifunc{MPI\_\-INIT}.  Then click ``Go''.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Limitations}

The following limitations are currently imposed when debugging LAM/MPI
jobs in TotalView:

\begin{enumerate}
\item Cannot attach to scripts: You cannot attach TotalView to MPI
  processes if they were launched by scripts instead of \cmd{mpirun}.
  Specifically, the following won't work:
                
  \lstset{style=lam-cmdline}
  \begin{lstlisting}
shell$ mpirun -tv C script_to_launch_foo
  \end{lstlisting}
  % Stupid emacs mode: $

  But this will:

  \lstset{style=lam-cmdline}
  \begin{lstlisting}
shell$ mpirun -tv C foo
  \end{lstlisting}
  % Stupid emacs mode: $
  
  Since \cmd{mpiexec} is itself a script, the \cmdarg{-tv} switch does
  work with \cmd{mpiexec} (because it eventually invokes \cmd{mpirun}),
  but you cannot launch \cmd{mpiexec} directly under TotalView.
  
\item TotalView needs to launch the TotalView server on all remote
  nodes in order to attach to remote processes.  
  
  The command that TotalView uses to launch remote executables might
  be different than what LAM/MPI uses.  You may have to set this
  command explicitly and independently of LAM/MPI.
%
  For example, if your local environment has \cmd{rsh} disabled and
  only allows \cmd{ssh}, then you likely need to set the TotalView
  remote server launch command to ``\cmd{ssh}''.  You can set this
  internally in TotalView or with the \ienvvar{TVDSVRLAUNCHCMD}
  environment variable (see the TotalView documentation for more
  information, and the example following this list).
  
\item The TotalView license must be accessible from all nodes where
  you expect to attach the debugger.
  
  Consult with your system administrator to ensure that this is set up
  properly.  You may need to edit your ``dot'' files (e.g.,
  \file{.profile}, \file{.bashrc}, \file{.cshrc}, etc.) to ensure that
  relevant environment variable settings exist on all nodes when you
  \cmd{lamboot}.
  
\item It is always a good idea to let \cmd{mpirun} finish before you
  rerun or exit TotalView.
  
\item TotalView will not be able to attach to MPI programs when you
  execute \cmd{mpirun} with the \cmdarg{-s} option.  
  
  This is because TotalView will not get the source code of your
  program on nodes other than the source node.  We advise you to
  either use a common filesystem or to copy the source code and
  executable to all nodes when using TotalView with LAM, so that you
  can avoid using \cmd{mpirun}'s \cmdarg{-s} flag.
\end{enumerate}
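
As a concrete illustration of the second limitation above, an
\cmd{ssh}-only site could set the TotalView remote server launch
command before starting the debugging session.  The following is only
a sketch for \cmd{csh}-style shells (Bourne-style shells would use
\cmd{export}); consult the TotalView documentation for the
authoritative description of \ienvvar{TVDSVRLAUNCHCMD}:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ setenv TVDSVRLAUNCHCMD ssh
shell$ mpirun -tv C my_mpi_program
\end{lstlisting}
% Stupid emacs mode: $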

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Message Queue Debugging}

The TotalView debugger can show the sending, receiving, and
unexpected message queues for many parallel applications.  Note the
following:

\begin{itemize}
\item Using the MPI-2 function for naming communicators
  (\mpifunc{MPI\_\-COMM\_\-SET\_\-NAME}) is strongly recommended when
  using the message queue debugging functionality.
  (\mpiconst{MPI\_\-COMM\_\-WORLD} and \mpiconst{MPI\_\-COMM\_\-SELF}
  are automatically named by LAM/MPI.)  Naming communicators makes it
  significantly easier to identify communicators of interest in the
  debugger.

  Any communicator that is not named will be displayed as ``{\tt
  --unnamed--}''.

\item Message queue debugging of applications is not currently
  supported for 64-bit executables.  If you attempt to use the message
  queue debugging functionality on a 64-bit executable, TotalView will
  display a warning before disabling the message queue options.
  
\item The \rpi{lamd} RPI does not support the message queue debugging
  functionality.
  
\item LAM/MPI does not currently provide debugging support for dynamic
  processes (e.g., \mpifunc{MPI\_\-COMM\_\-SPAWN}).
\end{itemize}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Serial Debuggers}
\label{sec:debug-serial}
\index{serial debuggers}
\index{debuggers!serial}

LAM also allows the use of one or more serial debuggers when debugging
a parallel program.  

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Launching Debuggers}
\index{debuggers!launching}

LAM allows the execution of arbitrary executables in an MPI context as
long as an MPI executable is eventually launched.  For example, it is
common to \icmd{mpirun} a debugger (or a script that launches a
debugger on some nodes and directly runs the application on other
nodes), since the debugger will eventually launch the MPI process.

However, one must be careful when running programs on remote nodes
that expect to use \file{stdin}: \file{stdin} on remote nodes is
redirected to \file{/dev/null}.  For this reason, it is often
advantageous to export the \ienvvar{DISPLAY} environment variable and
run a shell script that invokes an \cmd{xterm} with ``\cmd{gdb}'' (for
example) running in it on each node.  For example:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -x DISPLAY xterm-gdb.csh
\end{lstlisting}
% Stupid emacs mode: $
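
LAM does not provide such a script; a minimal, hypothetical
\file{xterm-gdb.csh} might look like the following, assuming the name
of the MPI executable is appended to the \cmd{mpirun} command line
above (and therefore arrives as the script's first argument):

\lstset{style=lam-shell}
\begin{lstlisting}
#!/bin/csh -f

# Hypothetical sketch: open an xterm on the exported DISPLAY and run
# gdb on the MPI executable inside it.  Program arguments (if any)
# can be set from within gdb itself.
xterm -e gdb $1
\end{lstlisting}
% Stupid emacs mode: $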

Additionally, it may be desirable to only run the debugger on certain
ranks in \mpiconst{MPI\_\-COMM\_\-WORLD}.  For example, with parallel
jobs that include tens or hundreds of MPI processes, it is really only
feasible to attach debuggers to a small number of processes.  In this
case, a script may be helpful to launch debuggers for some ranks in
\mpiconst{MPI\_\-COMM\_\-WORLD} and directly launch the application in
others.  

The LAM environment variable \ienvvar{LAMRANK} can be helpful in this
situation.  This variable is placed in the environment before the
target application is executed.  Hence, it is visible to shell scripts
as well as the target MPI application.  It is erroneous to alter the
value of this variable.

Consider the following script:

\lstset{style=lam-shell}
\begin{lstlisting}
#!/bin/csh -f

# Which debugger to run
set debugger=gdb

# On MPI_COMM_WORLD rank 0, launch the process in the debugger.
# Elsewhere, just launch the process directly.
if ("$LAMRANK" == "0") then
  echo Launching $debugger on MPI_COMM_WORLD rank $LAMRANK
  $debugger $*
else
  echo Launching MPI executable on MPI_COMM_WORLD rank $LAMRANK
  $*
endif

# All done
exit 0
\end{lstlisting}
% Stupid emacs mode: $

This script can be executed via \cmd{mpirun} to launch a debugger on
\mpiconst{MPI\_\-COMM\_\-WORLD} rank 0, and directly launch the MPI
process in all other cases.
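
For example, if the script above is saved as \file{gdb-rank0.csh} (a
hypothetical name) and made executable, it could be used as follows,
assuming that rank 0 runs on the node where \cmd{mpirun} is invoked so
that \cmd{gdb} can use \file{stdin}:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C gdb-rank0.csh my_mpi_program arg1 arg2
\end{lstlisting}
% Stupid emacs mode: $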

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Attaching Debuggers}
\index{debuggers!attaching}

In some cases, it is not possible or desirable to start debugging a
parallel application immediately.  For example, it may be desirable to
attach only to certain MPI processes whose identities are not known
until run-time.

In this case, the technique of attaching to a running process can be
used (this functionality is supported by many serial debuggers).
Specifically, determine which MPI process you want to attach to.  Then
log in to the node where it is running, and use the debugger's
``attach'' functionality to latch on to the running process.
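
For example, with \cmd{gdb}, the procedure might look like the
following sketch (the node name and process ID shown are hypothetical;
find the actual process ID with \cmd{ps}):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ ssh node3.example.com
shell$ ps -ef | grep my_mpi_program
shell$ gdb my_mpi_program 12345
\end{lstlisting}
% Stupid emacs mode: $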

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Memory-Checking Debuggers}
\label{sec:debug-mem}
\index{debuggers!memory-checking}

Memory-checking debuggers are an invaluable tool when debugging
software (even parallel software).  They can provide detailed reports
about memory leaks, bad memory accesses, duplicate/bad memory
management calls, etc.  Some memory-checking debuggers include (but
are not limited to): the Solaris Forte debugger (including the
\cmd{bcheck} command-line memory checker), the Purify software
package, and the Valgrind software package.

LAM can be used with memory-checking debuggers.  However, LAM should
be compiled with special support for such debuggers.  This is because,
in an attempt to optimize performance, many structures used internally
by LAM do not always have all of their memory initialized.  For
example, LAM's internal \type{struct nmsg} is one of
the underlying message constructs used to pass data between LAM
processes.  But since the \type{struct nmsg} is used in so many
places, it is a generalized structure and contains fields that are not
used in every situation.

By default, LAM only initializes relevant struct members before using
a structure.  Using a structure may involve sending the entire
structure (including uninitialized members) to a remote host.  This is
not a problem for LAM; the remote host will also ignore the irrelevant
struct members (depending on the specific function being invoked).
More to the point -- LAM was designed this way to avoid setting
variables that will not be used; this is a slight optimization in
run-time performance.  Memory-checking debuggers, however, will flag
this behavior with ``read from uninitialized'' warnings.

The \confflag{with-purify} option to LAM's \cmd{configure} script
forces LAM to zero out {\em all} memory before it is used.  This
eliminates the ``read from uninitialized'' types of warnings that
memory-checking debuggers would otherwise identify deep inside LAM.
This option can only be specified when LAM is configured; it is not
possible to enable or disable this behavior at run-time.  Since this
option incurs a slight overhead penalty in the run-time performance of
LAM, it is not the default.
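
For example, a LAM installation intended for memory-checking work
might be configured, built, and then used to run an application under
Valgrind roughly as follows (a sketch only, assuming the
\confflag{with-purify} flag is spelled as shown, that your usual
\cmd{configure} arguments are also supplied, and that Valgrind and the
application are available in the same locations on every node):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ ./configure --with-purify [...other configure options...]
shell$ make all install
shell$ mpirun C valgrind my_mpi_program
\end{lstlisting}
% Stupid emacs mode: $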

% Close out the debuggers index entry

\index{debuggers|)}