Application Output Lost with Abnormal Termination
=================================================

There may be many reasons why application output is lost when an
application terminates abnormally.  The Open MPI Team strongly
encourages the use of tools (such as debuggers) whenever possible.

One of the reasons, however, may come from inside Open MPI itself.  If
your application fails due to memory corruption, Open MPI may
subsequently fail to output an error message before terminating.
Open MPI attempts to aggregate error messages from multiple processes
so that each unique error message is shown only once (vs. once for
each MPI process |mdash| which can be unwieldy, especially when
running large MPI jobs).

However, this aggregation process requires allocating memory in the
MPI process when it displays the error message.  If the process's
memory is already corrupted, Open MPI's attempt to allocate memory may
fail and the process will simply terminate, possibly silently.  When Open
MPI does not attempt to aggregate error messages, most of its setup
work is done when the MPI library is initialized and no memory is allocated
during the "print the error" routine.  It therefore almost always successfully
outputs error messages in real time |mdash| but at the expense that you'll
potentially see the same error message for *each* MPI process that
encountered the error.

Hence, the error message aggregation is *usually* a good thing, but
sometimes it can mask a real error.  You can disable Open MPI's error
message aggregation with the ``opal_base_help_aggregate`` MCA
parameter.  For example:

.. code-block:: sh

   shell$ mpirun --mca opal_base_help_aggregate 0 ...
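
Alternatively, like any MCA parameter, ``opal_base_help_aggregate``
can be set through the environment instead of the ``mpirun`` command
line: Open MPI reads MCA parameters from environment variables of the
form ``OMPI_MCA_<param_name>``.  For example:

.. code-block:: sh

   shell$ export OMPI_MCA_opal_base_help_aggregate=0
   shell$ mpirun ...

This can be convenient when the ``mpirun`` invocation is buried inside
a script or batch job that you would rather not modify.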