# MPICH Developer Guide
This serves as a README for developers. It provides high-level descriptions of
the code architecture and pointers to other notes.
1. [mpi.h](#mpi_h)
1. [mpiimpl.h](#mpiimpl_h)
1. [MPI and PMPI functions](#mpi-and-pmpi-functions)
1. [MPIR impl functions](#mpir-impl-functions)
1. [Handle Objects](#handle-objects)
   1. Handle bit-layout
1. [MPID Interface](#mpid-interface)
1. [ch3](#ch3)
1. [ch4](#ch4)
1. [Active Message](#active-message)
1. [Shm](#shm)
1. [Netmod](#netmod)
1. [RMA](#rma)
1. [Datatype](#datatype)
1. Typerep
1. [Collectives](#collectives)
1. [Op](#op)
1. [Romio](#romio)
1. [MPI_T](#mpi_t)
1. [MPL](#mpl)
1. [CVAR](#cvar)
1. [PM and PMI](#pm-and-pmi)
1. Fortran binding
1. mpif.h
1. use mpi
1. use mpi_f08
1. [Test Suite](#test-suite)
1. [Build and Debug](#build-and-debug)
----------------------------------------
## `mpi.h` <a id="mpi_h"></a>
The top-most layer is `src/include/mpi.h`. This is the interface standardized by
the MPI Forum and exposed to users. It defines user-level constants, types, and
API prototypes.
The types include MPI objects, which we internally refer to as *handle objects*,
such as `MPI_Comm`, `MPI_Datatype`, `MPI_Request`, etc. In MPICH, all handle
objects are exposed to users as integer handles. Internally we have a custom
allocator system for these objects. How we manage handle objects is a crucial
piece in understanding the MPICH architecture; refer to the dedicated section
below.
The API prototypes are currently generated by `maint/gen_binding_c.py` into
`mpi_proto.h`, which gets included in `mpi.h`.
`mpi.h` is included in `mpiimpl.h`, which is included by nearly every source
file.
Reference:
* [mpi.h](../../src/include/mpi.h.in)
## `mpiimpl.h` <a id="mpiimpl_h"></a>
Nearly all internal source code includes `mpiimpl.h`, and often only
`mpiimpl.h`. Its all-encompassing nature makes this header quite complex and
delicate. The best way to understand it is to refer to the header itself
rather than repeat it here. It is crucial for understanding how we organize
types and inline functions.
The all-encompassing nature provides the convenience that most source files
only need to include this single header. On the other hand, it is also
responsible for slow compilation. There is a balance between spending time
maintaining headers and spending time waiting for compilations.
Reference:
* [mpiimpl.h](../../src/include/mpiimpl.h)
## MPI and PMPI functions
The top-level functions are MPI functions, e.g. `MPI_Send`. You will not find
these functions directly in our current code base. They are generated by
`maint/gen_binding_c.py` based on `src/binding/mpi_standard_api.txt` and other
meta files. MPI functions handle parameter validation, trivial-case early
returns, standard error behaviors, and calling internal implementation
routines with the necessary argument transformations. This functionality
consists largely of boilerplate and is thus well suited to script generation.
PMPI-prefixed function names are used to support the MPI profiling interface.
When a user calls an MPI function, e.g. MPI_Send, the symbol may link to a tool
or profiling library function, which intercepts the call, does its profiling or
analysis, and then calls PMPI_Send. Thus all top-level functions are defined
with PMPI_ names. This is why PMPI names often show up in traceback logs. To
work when there isn't a tool library (the common case), both PMPI and MPI
symbols are defined. If the compiler supports weak symbols, MPI names are weak
symbols that link to PMPI names. This is what we do on Linux. Without weak
symbol support, the top-level functions are compiled twice, once with MPI
names, and a second time with PMPI names. This is how it works on macOS.
Since this layer is mostly generated, we also refer to it as the binding
layer, a term that also broadly applies to the Fortran/CXX bindings.
Reference:
* [notes on binding generation](notes/binding_c.md)
* [mpi_standard_api.txt](../../src/binding/mpi_standard_api.txt)
[api_mapping.txt](../../src/binding/api_mapping.txt)
* [MPI Specification](https://www.mpi-forum.org/docs/drafts/mpi-2020-draft-report.pdf)
## MPIR impl functions
Nearly all MPI functions call MPIR_Xxx_impl functions, except for some trivial
functions that we handle directly in the binding layer.
MPIR impl functions use the same parameters as the corresponding MPI function,
with some exceptions. First, all handle objects are passed as pointers; e.g.
`MPI_Comm comm` in an MPI function becomes `MPIR_Comm * comm_ptr` internally.
Second, internally we use `MPI_Aint` as much as we can to avoid hazardous
incompatible integer conversions, while MPI functions may use `int` or
`MPI_Count`. These conversions are all handled by the binding layer.
MPI functions are grouped into "chapters" -- referring to the chapters of the
MPI specification. The MPIR implementation functions are placed in
`src/mpi/xxx`, where `xxx` refers to a chapter such as `pt2pt`, `coll`, `rma`,
etc.
Reference:
* [src/mpi/](../../src/mpi/)
## Handle Objects
Handle objects are crucial data structures in MPICH. Currently defined handle
objects are [here](../../src/include/mpir_objects.h#L138-L155).
Internally, handle objects are defined as structs, e.g. `MPIR_Comm`. The first
two fields of every object struct are:
```
int handle;
Handle_ref_count ref_count;
```
We use custom handle memory allocation for handle objects. There are 3 tiers:
1. built-in objects - [example](../../src/mpi/comm/commutil.c#L21)
2. direct objects - [example](../../src/mpi/comm/commutil.c#L22)
3. indirect objects - allocated by slabs
This system incurs no initialization or overhead for built-in objects, minimum
overhead when only a small number of objects are used, and managed overhead
when a large number of objects are needed.
### Handle bit-layout
Using integer handles provides better stability for bindings (where a pointer
type is not guaranteed) and better debuggability, since a handle contains more
semantic information than a pointer address. With our handle memory system, an
integer handle can be quickly converted to an internal pointer.
Current handle bits layout:
```
0 [2] Handle type (INVALID, BUILTIN, DIRECT, INDIRECT)
2 [4] Handle kind (MPIR_COMM, ...)
6 [26] Block and Index
```
The 26 bits of block and index are further divided into 14 bits for the slab
(block) index and 12 bits for the index within the block. For request objects,
the 14-bit block index is further divided into 6 bits for request pools and 8
bits for blocks. Consequently, the maximum number of *live* objects we can
have is limited by the handle bits.
Reference:
* [mpir_objects.h](../../src/include/mpir_objects.h)
* [mpir_handlemem.h](../../src/include/mpir_handlemem.h)
* [mpir_request.h](../../src/include/mpir_request.h)
## MPID Interface
A key design goal of MPICH is to allow downstream vendors to easily create
vendor-specific implementations. This is achieved by the Abstract Device
Interface (ADI). The ADI is a set of MPID-prefixed functions that implement
the functionality of MPI operations. For example, MPID_Send implements
MPI_Send. Nearly all MPI functions will call their MPID counterparts first,
allowing the device layer to either supply full functionality or simply fall
back by calling the `MPIR` implementations.
For performance-critical paths, e.g. the `pt2pt` and `rma` paths, we call the
`MPID` layer directly from the binding. This allows a fully inlined build to
achieve maximum compiler optimization. The other ADIs are not
performance-critical but are provided as hooks -- the MPIR layer calls them at
key points -- to allow the device to properly set up and control the
implementation behavior.
Note: all pt2pt communication in the ADI is nonblocking -- it only initiates
the communication, which is completed during `MPID_Progress_wait/test`.
Reference:
* [ch3/include/mpidpre.h](../../src/mpid/ch3/include/mpidpre.h#L532-L753)
* [ch4/include/mpidch4.h](../../src/mpid/ch4/include/mpidch4.h#L15-L313)
### ch3
Ch3 is currently in maintenance mode. It is still fully supported, since some
vendor implementations are still based on ch3.
There are two channels in ch3. `ch3:sock` is a pure socket implementation.
`ch3:nemesis` adds shared memory communication and also supports network
modules (netmod). We currently support `ch3:nemesis:tcp` and
`ch3:nemesis:ofi`.
### ch4
Ch4 is where currently active research and development take place. Many new
features, e.g. per-VCI threading, GPU IPC, partitioned communication, etc.,
are only available in ch4.
Ch4 adds an additional ADI-like interface, commonly referred to as the ch4 API
or shm/netmod API. In most MPID functions, the ch4 layer checks whether the
communication is local (i.e. can be carried out using shared memory) and calls
into either the shm API or the netmod API. It is possible to disable shm
entirely.
The framework for the ch4 API involves a large amount of boilerplate due to
the need to allow both fully inlined builds and non-inlined builds using
function tables. We use scripts to generate most of these API files.
Reference:
* [Notes on ch4 api autogen](notes/ch4_api.md)
* [Notes on ch4 inline mechanism](notes/ch4_inline.md)
* [Notes on ch4 namespace convention](notes/ch4_namespace.md)
* [ch4_api.txt](../../src/mpid/ch4/ch4_api.txt)
* [request.md](notes/request.md)
#### Active Message
Similar to MPIR, ch4 provides a default/fallback implementation based on
active messages. The ch3 implementation is largely an active-message
implementation. It is easiest to use an example to explain what an active
message is and how it works.
MPID_Send will call either `MPIDI_SHM_mpi_isend` or `MPIDI_NM_mpi_isend` based
on locality, which may call back to `MPIDIG_mpi_isend`. `MPIDIG_mpi_isend`
sends the message, along with an am header, to the target process using
`MPIDI_{SHM,NM}_am_isend`. It creates an MPIR_Request object and returns it to
the user.
During progress, the receiving process receives the message (inside the shm or
netmod progress), checks the am header, and calls a registered handler
function to process the message. The handler functions are the essential
pieces of an active-message protocol. In the pt2pt eager-send case, the
handler checks the posted receive-buffer list and either copies the message
data into the receive buffer (and completes the receive request) or enqueues
the message to an unexpected queue.
As a counterpart, `MPID_Recv` will at some point call `MPIDIG_mpi_irecv`,
which checks the unexpected queue and either copies the data over (and
completes the receive request) or enqueues the receive request to a posted
queue.
There are complications, where we may implement different protocols such as
the rendezvous protocol, IPC, or RDMA, as well as the various RMA operations.
Different protocols are all implemented by handler functions, which in turn
may send additional active messages with other handler functions to carry on
the protocol.
Reference:
* [src/mpid/ch4/src/mpidig_send.h](../../src/mpid/ch4/src/mpidig_send.h)
* [src/mpid/ch4/src/mpidig_recv.h](../../src/mpid/ch4/src/mpidig_recv.h)
* [src/mpid/ch4/src/mpidig_pt2pt_callbacks.h](../../src/mpid/ch4/src/mpidig_pt2pt_callbacks.h)
* [src/mpid/ch4/src/mpidig_recvq.h](../../src/mpid/ch4/src/mpidig_recvq.h)
* [src/mpid/ch4/src/mpidig_rma.h](../../src/mpid/ch4/src/mpidig_rma.h)
* [src/mpid/ch4/src/mpidig_rma_callbacks.h](../../src/mpid/ch4/src/mpidig_rma_callbacks.h)
Active message APIs as of MPICH 3.4.2:
```
/****************** Header and Data Movement APIs ******************/

/* blocking header send */
int MPIDI_[NM|SHM]_am_send_hdr(int rank, MPIR_Comm * comm, int handler_id,
                               const void *am_hdr, MPI_Aint am_hdr_sz)

/* nonblocking header + datatype send */
int MPIDI_[NM|SHM]_am_isend(int rank, MPIR_Comm * comm, int handler_id,
                            const void *am_hdr, MPI_Aint am_hdr_sz,
                            const void *data, MPI_Aint count,
                            MPI_Datatype datatype, MPIR_Request * sreq)

/* nonblocking headers + datatype send */
int MPIDI_[NM|SHM]_am_isendv(int rank, MPIR_Comm * comm, int handler_id,
                             struct iovec *am_hdrs, size_t iov_len,
                             const void *data, MPI_Aint count,
                             MPI_Datatype datatype, MPIR_Request * sreq)

/* blocking header send (callback safe) */
int MPIDI_[NM|SHM]_am_send_hdr_reply(MPIR_Comm * comm, int src_rank,
                                     int handler_id, const void *am_hdr,
                                     MPI_Aint am_hdr_sz)

/* nonblocking header + datatype send (callback safe) */
int MPIDI_[NM|SHM]_am_isend_reply(MPIR_Comm * comm, int src_rank, int handler_id,
                                  const void *am_hdr, MPI_Aint am_hdr_sz,
                                  const void *data, MPI_Aint count,
                                  MPI_Datatype datatype, MPIR_Request * sreq)

/* CTS for pt2pt messages */
int MPIDI_[NM|SHM]_am_recv(MPIR_Request * rreq)

/* largest header (in bytes) supported by the nm/shm */
MPI_Aint MPIDI_[NM|SHM]_am_hdr_max_sz(void)

/* eager size supported by transport (only used internally by nm/shm) */
MPI_Aint MPIDI_[NM|SHM]_am_eager_limit(void)

/* eager buffer size supported by transport, used to assert pipeline protocol will work(?) */
MPI_Aint MPIDI_[NM|SHM]_am_eager_buf_limit(void)

/* return true/false if pt2pt message can be sent eagerly */
bool MPIDI_[NM|SHM]_am_check_eager(MPI_Aint am_hdr_sz, MPI_Aint data_sz,
                                   const void *data, MPI_Aint count,
                                   MPI_Datatype datatype, MPIR_Request * sreq)

/****************** Callback APIs ******************/

/* target-side message callback */
typedef int (*MPIDIG_am_target_msg_cb) (int handler_id, void *am_hdr,
                                        void *data, MPI_Aint data_sz,
                                        int is_local, int is_async,
                                        MPIR_Request ** req);

/* target-side completion callback */
typedef int (*MPIDIG_am_target_cmpl_cb) (MPIR_Request * req);

/* origin-side completion callback */
typedef int (*MPIDIG_am_origin_cb) (MPIR_Request * req);
```
#### SHM
We currently support shared-memory communication based on POSIX shared memory.
#### Netmod
We currently support the `ch4:ofi` and `ch4:ucx` netmods. A minimum netmod can
be implemented by just implementing a minimum set of byte-transfer functions,
e.g. `MPIDI_NM_am_isend`, and the corresponding mechanism of calling back the
handler function upon receiving the byte payload. To achieve better
performance, both the libfabric and ucx libraries provide APIs to directly
send messages with tag matching and RDMA operations, which avoid extra data
copies and shorten latency. Thus both netmods have direct implementations of
the netmod API that skip active messages.
Both netmods provide direct RMA operations where they can, and fall back to
active messages where there are limitations. All window synchronizations are
based on active messages.
## Collectives
Collectives are implemented in `src/mpi/coll/`, with separate folders for each
collective function.
Take `MPI_Bcast` for example: `bcast/bcast.c` defines 3 functions, mostly
boilerplate. The binding layer calls **`MPIR_Bcast`**, which calls
`MPID_Bcast` unless the control variables (CVARs) direct otherwise.
Typically `MPID_Bcast` falls back to calling **`MPIR_Bcast_impl`**, which
checks CVARs and selects a particular algorithm. By default, it calls
**`MPIR_Bcast_allcomm_auto`**, which automatically chooses the best
algorithm based on selection logic defined by a runtime json file.
Each algorithm is typically implemented in a separate file.
## Romio
MPI-IO (MPI_File) APIs are implemented in the ROMIO library. ROMIO is
implemented entirely on top of other MPI functions and doesn't use MPICH
internals. This is similar to how we implement the Fortran bindings.
## MPI_T
The MPI tools information interface provides an orthogonal system for
controlling and monitoring implementation behavior by means of control
variables (CVARs), performance variables (PVARs), and the event callback
interface.
## MPL
All utilities whose functionality is independent of MPICH internals are
implemented inside MPL. All functions from MPL use the `MPL_` prefix.
Some essential MPICH functionality is provided by MPL, including:
* Debug logging
* Atomic variables
* Thread functions
* Memory tracing
* GPU support
Sometimes we use an additional MPIR wrapping layer on top of MPL functions to
provide additional control. For example, the threading interfaces are wrapped
in `src/mpid/common/thread/mpidu_thread_fallback.h` to allow device override
of the locking mechanism. As another example, the GPU interfaces are wrapped
to allow `MPIR_CVAR_ENABLE_GPU` to bypass GPU functions (when it is set to
disable).
## CVAR
Control variables (CVARs) are a mechanism for controlling library behavior
using environment variables. The MPI tool interface also allows control
variables to be used in code at runtime (however, it is rather clunky and
thus not widely used).
Reference:
* [notes on CVAR](notes/cvar.md)
## PM and PMI
References:
* [Notes on Process Manager](notes/pm.md)
* [Notes on Process Manager Interface](notes/pmi.md)
* [Notes on Namepub](notes/namepub.md)
* [Using the Hydra Process Manager](how_to/Using_the_Hydra_Process_Manager.md)
## Test Suite
We have an extensive test suite in `test/mpi/`. While it is currently
configured along with the mpich configure, it requires mpich to be installed
before running `make testing`.
Tests are controlled by `testlist` files.
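A `testlist` entry names a test executable and the number of processes to run it with, optionally followed by key=value settings. The entries below are illustrative only; see the actual `testlist` files for the supported options:
```
# <executable> <nprocs> [key=value ...]
sendrecv1 4
bcasttest 8 timeLimit=300
```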
## Build and Debug
References:
* [Debug Event Logging](design/Debug_Event_Logging.md)