# Debugging ARMCI-MPI

I hope you never need to read this, but obviously, debugging is sometimes necessary.

The reality is that MPI-3 RMA implementations still aren't perfect, 
and even if all the latest releases were bug-free, many users are using older builds.

This debugging guide will start with the easiest (least intrusive) options and progress
to the more complicated ones.

## Environment variables

ARMCI-MPI has a number of environment variables to work around known bugs in MPI libraries.
Some of these are so common that we change the settings automatically.

Always set `ARMCI_VERBOSE=1` when you want to know what options are used.
I almost always run with this, because there's really no reason not to use it.
It adds about 10 lines to the beginning of your output stream.
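
As a minimal sketch, this is how the variable might be set for a run. The launch line is left as a comment because `mpiexec -n 4 ./my_app` is a placeholder, not something taken from this guide; substitute your own job command.

```shell
# Print the ARMCI-MPI settings in effect at the start of the run.
export ARMCI_VERBOSE=1

# Placeholder launch line (hypothetical program name and process count):
# mpiexec -n 4 ./my_app
```

Whether exported variables reach ranks on remote nodes depends on your launcher, so check that the verbose output actually appears.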

### Allocation

`ARMCI_USE_WIN_ALLOCATE` and `ARMCI_USE_ALLOC_SHM` default to 1 but can be set to 0
to determine whether RMA allocation is the problem.
Leave them at the default unless you need to change them, because disabling them will hurt performance in some cases.

In my experience, setting `ARMCI_USE_WIN_ALLOCATE=0` is extremely useful for debugging, whereas
setting `ARMCI_USE_ALLOC_SHM=0` is not.  Many bugs have been mitigated by the former, but none by the latter.

`ARMCI_DEBUG_ALLOC=1` zeroes allocations on creation, so a bad allocation segfaults inside
the allocation routine rather than later in user code, which might be useful.
This option hasn't been useful in many years.
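
The allocation settings in this section are easiest to evaluate one at a time. The sketch below shows one possible order, starting with the setting that has helped most often; the order is a suggestion, not a documented procedure.

```shell
# Attempt 1: disable only ARMCI_USE_WIN_ALLOCATE, the setting that has
# mitigated many bugs in practice.
export ARMCI_USE_WIN_ALLOCATE=0
export ARMCI_VERBOSE=1   # lets you confirm which options were used

# Attempt 2, only if attempt 1 did not help:
# export ARMCI_USE_ALLOC_SHM=0

# Attempt 3, rarely useful: zero allocations on creation.
# export ARMCI_DEBUG_ALLOC=1
```

Unset these again once you are done debugging, since disabling the allocation options can cost performance.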

### Datatypes

`ARMCI_STRIDED_METHOD` and `ARMCI_IOV_METHOD` control how noncontiguous data is communicated.
These mitigate bugs in MPI datatypes.  NWChem makes heavier use of the former.

Bugs in datatypes lead to crashes during communication operations (`GA_Get`, `GA_Acc`, `GA_Put`)
rather than allocation functions.  They can also lead to incorrect results in especially bad
cases where the implementation is silently failing to do the right thing.

`ARMCI_STRIDED_METHOD` supports `DIRECT`, which uses datatypes, and `IOV`, which falls back on  
whatever `ARMCI_IOV_METHOD` is.  `ARMCI_IOV_METHOD` supports `AUTO` and `DIRECT`, which 
use datatypes, as well as `BATCHED` and `CONSRV` (conservative).  The latter two emit one
RMA operation for each contiguous chunk, with `CONSRV` synchronizing after each RMA operation.

The performance of `CONSRV` will be poor.  `BATCHED` is usually slower than `DIRECT` except
on platforms like Blue Gene, where the CPU was slow relative to the network.
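
To find out whether datatypes are the culprit, one option is to rerun the same job under each IOV method and see where the failure disappears. The helper below is a hypothetical sketch: `try_iov_methods` is not part of ARMCI-MPI, and it assumes the job exits with a nonzero status when it fails.

```shell
# Hypothetical helper: run the given command once per IOV method, routing
# strided operations through the IOV path so that ARMCI_IOV_METHOD decides
# how the data is moved.
try_iov_methods() {
    for method in DIRECT BATCHED CONSRV; do
        if ARMCI_STRIDED_METHOD=IOV ARMCI_IOV_METHOD="$method" "$@"; then
            echo "passed with ARMCI_IOV_METHOD=$method"
        else
            echo "failed with ARMCI_IOV_METHOD=$method"
        fi
    done
}
```

It would be called with your own launch command, for example `try_iov_methods mpiexec -n 4 ./my_app`, where the program name and process count are placeholders. A job that fails with `DIRECT` but passes with `BATCHED` or `CONSRV` points at the MPI datatype implementation.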

Because Open-MPI has had numerous bugs in its datatype implementation over the years,
ARMCI-MPI sets `BATCHED` as the default at compile time when it is built with Open-MPI.
Optimistic users can try datatypes by setting `DIRECT` explicitly.
For all other (usually MPICH-based) implementations, `DIRECT` is the default, because their
datatype implementations are assumed to be correct except with the Galaxy Express network (Tianhe-2).
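
For the optimistic case on Open-MPI, a sketch of opting back into datatypes might look like this. It assumes both variables need to be set to reach the datatype path; the `ARMCI_VERBOSE` output should show what was actually used.

```shell
# Opt in to MPI datatypes for both strided and IOV operations.
export ARMCI_STRIDED_METHOD=DIRECT
export ARMCI_IOV_METHOD=DIRECT

# Report the options in effect so the override can be verified.
export ARMCI_VERBOSE=1
```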

Enabling `ARMCI_IOV_CHECKS` will check the input for validity, which is rarely useful
when debugging GA programs.  It takes a rather perverse use case to trigger the conflicts
this detects; they were encountered during the development of ARMCI-MPI but not since.

## MPI libraries

If you are using Open-MPI and having problems, please try an MPICH-based implementation instead.
If this solves the problem, then file a bug with Open-MPI and tell them to fix their
RMA and/or datatypes implementation again.

Open-MPI does not have a monopoly on RMA bugs, and we have observed RMA bugs in MPICH itself and
Cray MPI over the years.  Because ARMCI-MPI is tested regularly at Argonne (or at least I hope it is),
the MPICH bugs should be less common now.  Cray MPI is not tested regularly anymore but we assume
the bugs of the past are resolved.

While Intel MPI used to be problematic, its RMA implementation has improved significantly
over the past five years, and we can recommend it in most cases.

MVAPICH2 is generally good at RMA and is the recommended MPI-3 RMA implementation on InfiniBand platforms.

## Performance

Debugging RMA performance is not a simple task and is not covered in this document.

If you are interested in this topic, email Jeff.