1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
|
# Authors
James Dinan (MPI-2 implementation)
Jeff Hammond (MPI-3 implementation)
# Introduction
This project provides a full, high performance, portable implementation of the
ARMCI runtime system using MPI's remote memory access (RMA) functionality.
# Installing Only ARMCI-MPI
ARMCI-MPI uses autoconf and must be configured before compiling:
```
$ ./configure
```
Many configure options are provided, run `configure --help` for details. After
configuring the source tree, the code can be built and installed by running:
$ make && make install
The quality of MPI-RMA implementations varies. As of August, 2011 the
following MPI-2 implementations are known to work correctly with ARMCI-MPI:
* MPICH2 and MPICH 3+
* MVAPICH2 1.6
* Cray MPI on Cray XE6
* IBM MPI on BG/P (set `ARMCI_STRIDED_METHOD=IOV` and `ARMCI_IOV_METHOD=BATCHED` for _performance reasons_)
* OpenMPI 1.5.4 (set `ARMCI_STRIDED_METHOD=IOV` and `ARMCI_IOV_METHOD=BATCHED` for _correctness reasons_)
The following MPI-2 implementations are known to fail with ARMCI-MPI:
* MVAPICH2 prior to 1.6
As of April 2014, the following MPI-3 implementations are know to work correctly
with ARMCI-MPI (`mpi3rma` branch):
* MPICH 3.0.4 and later on Mac, Linux SMPs and SGI SMPs.
* MVAPICH2 2.0a and later on Linux InfiniBand clusters.
* CrayMPI 6.1.0 and later on Cray XC30.
* SGI MPT 2.09 on SGI SMPs.
* OpenMPI development version on Mac (set `ARMCI_STRIDED_METHOD=IOV` and `ARMCI_IOV_METHOD=BATCHED`)
Note that a bug in MPICH that propagated to MVAPICH2, Cray MPI and Intel MPI affects correctness
when windows are backed by shared-memory. This bug affects `ARMCI_Rmw` and is avoided with the
default settings, which use `MPI_Win_create`. This may negatively affect performance in some
cases and prevents one from using Casper. To utilize `MPI_Win_allocate`, do one of the following:
* Build with the `--enable-win-allocate` option to `configure`.
* Set `ARMCI_USE_WIN_ALLOCATE=1` in your runtime environment.
# Installing Global Arrays with ARMCI-MPI
ARMCI-MPI2 has been tested with GA 5.0.2. To build GA with ARMCI-MPI2, rename
this directory to "armci" and substitute it for the "armci" directory in the GA
distribution. Configure and build GA as usual; no special flags are required.
To build GA 5.2 with ARMCI-MPI (any version), use the configure option
`--with-armci=$(PATH_TO_ARMCI_MPI)` and make sure that you use the same MPI
implementation with GA that was used to compile ARMCI-MPI.
ARMCI-MPI3 has been tested extensively with GA 5.2 and 5.3.
# The ARMCI-MPI Test Suite
ARMCI-MPI includes a set of testing and benchmark programs located under `tests/`
and `benchmarks/`. These programs can be compiled and run via:
```
$ make check MPIEXEC="mpiexec -n 4"
```
The `MPIEXEC` variable is optional and is used to override the default MPI launch
command. If you want only to build the test suite, the following target can be
used:
```
$ make checkprogs
```
# ARMCI-MPI Errata
Direct access to local buffers:
* Because of MPI-2's semantics, you are not allowed to access shared memory
directly, it must be through put/get. Alternatively you can use the
new ARMCI_Access_begin/end() functions.
* MPI-3 allows direct access provided one uses a synchronization operation
afterwards. The ARMCI_Access_begin/end() functions are also valid.
Progress semantics:
* On some MPI implementations and networks you may need to enable implicit
progress. In many cases this is done through an environment variable. For
MPICH2: set `MPICH_ASYNC_PROGRESS`; for MVAPICH2 recompile with
`--enable-async-progress` and set `MPICH_ASYNC_PROGRESS`; set `DCMF_INTERRUPTS=1`
for MPI on BGP; etc.
See [this page](https://github.com/jeffhammond/HPCInfo/tree/master/mpi#asynchronous-progress)
for more information on activating asynchronous progress in MPI. However, we find
that most platforms show no improvement and often a decrease in performance,
provided the application makes calls to GA/ARMCI/MPI frequently enough on all
processes.
# Environment Variables:
Boolean environment variables are enabled when set to a value beginning with
't', 'T', 'y', 'Y', or '1'; any other value is interpreted as false.
##Debugging Options
ARMCI_VERBOSE (boolean)
Enable extra status output from ARMCI-MPI.
ARMCI_DEBUG_ALLOC (boolean)
Turn on extra shared allocation debugging.
ARMCI_FLUSH_BARRIERS (boolean) (deprecated)
Enable/disable extra communication flushing in ARMCI_Barrier. Extra flushes
are present to help make unsafe DLA safer. (This option is deprecated with
the ARMCI-MPI3 implementation.)
## Performance Options
ARMCI_CACHE_RANK_TRANSLATION (boolean)
Create a table to more quickly translate between absolute and group ranks.
ARMCI_PROGRESS_THREAD (boolean)
Create a Pthread to poke the MPI progress engine.
ARMCI_PROGRESS_USLEEP (int)
Argument to `usleep()` to pause the progress polling loop.
## Noncollective Groups
ARMCI_NONCOLLECTIVE_GROUPS (boolean)
Enable noncollective ARMCI group formation; group creation is collective on
the output group rather than the parent group.
## Shared Buffer Protection
ARMCI_SHR_BUF_METHOD = { COPY (default), NOGUARD }
ARMCI policy for managing shared origin buffers in communication operations:
lock the buffer (unsafe, but fast), copy the buffer (safe), or don't guard
the buffer - assume that the system is cache coherent and MPI supports
unlocked load/store.
## Strided Options
ARMCI_STRIDED_METHOD = { DIRECT (default), IOV }
Select the method for processing strided operations.
## I/O Vector Options
ARMCI_IOV_METHOD = { AUTO (default), CONSRV, BATCHED, DIRECT }
Select the IO vector communication strategy: automatic; a "conservative"
implementation that does lock/unlock around each operation; an implementation
that issues batches of operations within a single lock/unlock epoch; and a
direct implementation that generates datatypes for the origin and target and
issues a single operation using them.
ARMCI_IOV_CHECKS (boolean)
Enable (expensive) IOV safety/debugging checks (not recommended for
performance runs).
ARMCI_IOV_BATCHED_LIMIT = { 0 (default), 1, ... }
Set the maximum number of one-sided operations per epoch for the BATCHED IOV
method. Zero (default) is unlimited.
# Status
Right now, we are seeing failures caused by OpenMPI bugs, not because ARMCI-MPI itself is broken.
* `master`: [](https://travis-ci.org/jeffhammond/armci-mpi)
* `mpi2rma`: [](https://travis-ci.org/jeffhammond/armci-mpi)
* `mpi2rma-plus-rmw`: [](https://travis-ci.org/jeffhammond/armci-mpi)
* `mpi3rma`: [](https://travis-ci.org/jeffhammond/armci-mpi)
The `master` branch is still only for MPI-2 RMA.
The `mpi2rma` branch is essentially the same, except it is ABI-compatible with the MPI-3 branch.
The `mpi2rma-plus-rmw` uses mostly MPI-2 features except for MPI-3 atomics. Because it uses
MPI-2 RMA synchronization, it is _conservative_ and thus may work with an otherwise buggy
implementation of MPI-3 RMA. It exists primarily for historical purposes.
The `mpi3rma` branch is what one should use on the majority of platforms, provided MPI-3 RMA is working.
|