# MPICH Developer Guide
This serves as a README for developers. It provides high-level descriptions of
the code architecture and pointers to other notes.
1. [mpi.h](#mpi_h)
1. [mpiimpl.h](#mpiimpl_h)
1. [MPI and PMPI functions](#mpi-and-pmpi-functions)
1. [MPIR impl functions](#mpir-impl-functions)
1. [Handle Objects](#handle-objects)
   1. Handle bit-layout
1. [MPID Interface](#mpid-interface)
1. [ch3](#ch3)
1. [ch4](#ch4)
1. [Active Message](#active-message)
1. [Shm](#shm)
1. [Netmod](#netmod)
1. [RMA](#rma)
1. [Datatype](#datatype)
1. Typerep
1. [Collectives](#collectives)
1. [Op](#op)
1. [Romio](#romio)
1. [MPI_T](#mpi_t)
1. [MPL](#mpl)
1. [CVAR](#cvar)
1. [PM and PMI](#pm-and-pmi)
1. Fortran binding
1. mpif.h
1. use mpi
1. use mpi_f08
1. [Test Suite](#test-suite)
1. [Build and Debug](#build-and-debug)
----------------------------------------
## `mpi.h` <a id="mpi_h"></a>
The top-most layer is `src/include/mpi.h`. This is the interface standardized by
the MPI Forum and exposed to users. It defines user-level constants, types, and
API prototypes.
The types include MPI objects, which we internally refer to as *handle objects*,
such as `MPI_Comm`, `MPI_Datatype`, `MPI_Request`, etc. In MPICH, all handle
objects are exposed to users as integer handles. Internally we have a custom
allocator system for these objects. How we manage handle objects is a crucial
piece in understanding the MPICH architecture; refer to the dedicated section
below.
The API prototypes are currently generated by `maint/gen_binding_c.py` into
`mpi_proto.h`, which gets included in `mpi.h`.
`mpi.h` is included in `mpiimpl.h`, which is included by nearly every source
file.
Reference:
* [mpi.h](../../src/include/mpi.h.in)
## `mpiimpl.h` <a id="mpiimpl_h"></a>
Nearly all internal source code includes `mpiimpl.h`, and often only
`mpiimpl.h`. Its all-encompassing nature makes this header quite complex and
delicate. The best way to understand it is to refer to the header itself
rather than repeat it here. It is crucial for understanding how we organize
types and inline functions.
The all-encompassing nature provides the convenience that most source files
only need to include this single header. On the other hand, it is also
responsible for slow compilation. There is a balance between spending time
maintaining headers and spending time waiting for compilations.
Reference:
* [mpiimpl.h](../../src/include/mpiimpl.h)
## MPI and PMPI functions
The top-level functions are MPI functions, e.g. `MPI_Send`. You will not find
these functions directly in our current code base. They are generated by
`maint/gen_binding_c.py` based on `src/binding/mpi_standard_api.txt` and other
meta files. MPI functions handle parameter validation, trivial-case early
returns, standard error behaviors, and calling internal implementation
routines with the necessary argument transformations. This functionality
consists largely of boilerplate and is thus well suited to script generation.
PMPI-prefixed function names are used to support the MPI profiling interface.
When a user calls an MPI function, e.g. MPI_Send, the symbol may link to a tool
or profiling library function, which intercepts the call, does its profiling or
analysis, and then calls PMPI_Send. Thus all top-level functions are defined
with PMPI_ names. This is why PMPI names often show up in traceback logs. To
work when there isn't a tool library (the common case), both PMPI and MPI
symbols are defined. If the compiler supports weak symbols, MPI names are weak
symbols that link to PMPI names. This is what we do on Linux. Without weak
symbol support, the top-level functions are compiled twice, once with MPI
names, and a second time with PMPI names. This is how it works on macOS.
Since this layer is mostly generated, we also refer to it as the binding
layer, a term that also broadly applies to the Fortran/CXX bindings.
Reference:
* [notes on binding generation](notes/binding_c.md)
* [mpi_standard_api.txt](../../src/binding/mpi_standard_api.txt)
[api_mapping.txt](../../src/binding/api_mapping.txt)
* [MPI Specification](https://www.mpi-forum.org/docs/drafts/mpi-2020-draft-report.pdf)
## MPIR impl functions
Nearly all MPI functions call MPIR_Xxx_impl functions, except for some trivial
functions that we handle directly in the binding layer.
MPIR impl functions use the same parameters as the corresponding MPI function,
with some exceptions. First, all handle objects are passed as pointers; e.g.
`MPI_Comm comm` in an MPI function becomes `MPIR_Comm * comm_ptr` internally.
Second, internally we use `MPI_Aint` as much as we can to avoid hazardous
incompatible integer conversions, while MPI functions may use `int` or
`MPI_Count`. These conversions are all handled by the binding layer.
MPI functions are grouped into "chapters" -- referring to the chapters of the
MPI specification. The MPIR implementation functions are placed in
`src/mpi/xxx`, where `xxx` refers to a chapter such as `pt2pt`, `coll`, `rma`,
etc.
Reference:
* [src/mpi/](../../src/mpi/)
## Handle Objects
Handle objects are crucial data structures in MPICH. Currently defined handle
objects are [here](../../src/include/mpir_objects.h#L138-L155).
Internally, handle objects are defined as structs, e.g. `MPIR_Comm`. The first
two fields of every object struct are:
```
int handle;
Handle_ref_count ref_count;
```
We use custom handle memory allocation for handle objects. There are 3 tiers:
1. built-in objects - [example](../../src/mpi/comm/commutil.c#L21)
2. direct objects - [example](../../src/mpi/comm/commutil.c#L22)
3. indirect objects - allocated by slabs
This system incurs no initialization or overhead for built-in objects, minimum
overhead when only a small number of objects are used, and managed overhead
when a large number of objects are needed.
### Handle bit-layout
Using integer handles provides better stability for bindings (where a pointer
type is not guaranteed) and better debuggability, since a handle contains more
semantic information than a pointer address. With our handle memory system, an
integer handle can be quickly converted to an internal pointer.
Current handle bits layout:
```
0 [2] Handle type (INVALID, BUILTIN, DIRECT, INDIRECT)
2 [4] Handle kind (MPIR_COMM, ...)
6 [26] Block and Index
```
The 26 bits of block and index are further divided into 14 bits for the slab
(block) index and 12 bits for the index within the block. For request objects,
the 14-bit block index is further divided into 6 bits for request pools and 8
bits for blocks. Consequently, the maximum number of *live* objects we can
have is limited by the handle bits.
Reference:
* [mpir_objects.h](../../src/include/mpir_objects.h)
* [mpir_handlemem.h](../../src/include/mpir_handlemem.h)
* [mpir_request.h](../../src/include/mpir_request.h)
## MPID Interface
A key design goal of MPICH is to allow downstream vendors to easily create
vendor-specific implementations. This is achieved by the Abstract Device
Interface (ADI). The ADI is a set of MPID-prefixed functions that implement
the functionality of MPI operations. For example, MPID_Send implements
MPI_Send. Nearly all MPI functions will call their MPID counterparts first,
allowing the device layer to either supply full functionality or simply fall
back by calling the `MPIR` implementations.
For performance-critical paths, e.g. the `pt2pt` and `rma` paths, we call the
`MPID` layer directly from the binding. This allows a fully inlined build to
achieve maximum compiler optimization. The other ADIs are not
performance-critical but are provided as hooks -- the MPIR layer calls them at
key points -- to allow the device to properly set up and control the
implementation behavior.
Note: all pt2pt communication in the ADI is nonblocking -- it only initiates
the communication, which is completed during `MPID_Progress_wait/test`.
Reference:
* [ch3/include/mpidpre.h](../../src/mpid/ch3/include/mpidpre.h#L532-L753)
* [ch4/include/mpidch4.h](../../src/mpid/ch4/include/mpidch4.h#L15-L313)
### ch3
Ch3 is currently in maintenance mode. It is still fully supported, since some
vendor implementations are still based on ch3.
There are two channels in ch3. `ch3:sock` is a pure socket implementation.
`ch3:nemesis` adds shared memory communication and also supports network
modules (netmod). We currently support `ch3:nemesis:tcp` and
`ch3:nemesis:ofi`.
### ch4
Ch4 is where currently active research and development take place. Many new
features, e.g. per-VCI threading, GPU IPC, partitioned communication, etc.,
are only available in ch4.
Ch4 adds an additional ADI-like interface, commonly referred to as the ch4 API
or shm/netmod API. In most MPID functions, the ch4 layer checks whether the
communication is local (i.e. can be carried out using shared memory) and calls
into either the shm API or the netmod API. It is possible to disable shm
entirely.
The framework for the ch4 API involves a large amount of boilerplate due to
the need to allow both fully inlined builds and non-inlined builds using
function tables. We use scripts to generate most of these API files.
Reference:
* [Notes on ch4 api autogen](notes/ch4_api.md)
* [Notes on ch4 inline mechanism](notes/ch4_inline.md)
* [Notes on ch4 namespace convention](notes/ch4_namespace.md)
* [ch4_api.txt](../../src/mpid/ch4/ch4_api.txt)
* [request.md](notes/request.md)
#### Active Message
Similar to MPIR, ch4 provides a default/fallback implementation based on
active messages. The ch3 implementation is largely an active-message
implementation. It is easiest to use an example to explain what an active
message is and how it works.
MPID_Send will call either `MPIDI_SHM_mpi_isend` or `MPIDI_NM_mpi_isend` based
on locality, which may call back to `MPIDIG_mpi_isend`. `MPIDIG_mpi_isend`
sends the message, along with an am header, to the target process using
`MPIDI_{SHM,NM}_am_isend`. It creates an MPIR_Request object and returns it to
the user.
During progress, the receiving process receives the message (inside the shm or
netmod progress), checks the am header, and calls a registered handler
function to process the message. The handler functions are the essential
pieces of an active-message protocol. In the pt2pt eager-send case, the
handler checks the posted receive-buffer list and either copies the message
data into the receive buffer (and completes the receive request) or enqueues
the message to an unexpected queue.
As a counterpart, `MPID_Recv` will at some point call `MPIDIG_mpi_irecv`,
which checks the unexpected queue and either copies the data over (and
completes the receive request) or enqueues the receive request to a posted
queue.
There are complications, where we may implement different protocols such as
the rendezvous protocol, IPC, or RDMA, as well as the various RMA operations.
Different protocols are all implemented by handler functions, which in turn
may send additional active messages with other handler functions to carry on
the protocol.
Reference:
* [src/mpid/ch4/src/mpidig_send.h](../../src/mpid/ch4/src/mpidig_send.h)
* [src/mpid/ch4/src/mpidig_recv.h](../../src/mpid/ch4/src/mpidig_recv.h)
* [src/mpid/ch4/src/mpidig_pt2pt_callbacks.h](../../src/mpid/ch4/src/mpidig_pt2pt_callbacks.h)
* [src/mpid/ch4/src/mpidig_recvq.h](../../src/mpid/ch4/src/mpidig_recvq.h)
* [src/mpid/ch4/src/mpidig_rma.h](../../src/mpid/ch4/src/mpidig_rma.h)
* [src/mpid/ch4/src/mpidig_rma_callbacks.h](../../src/mpid/ch4/src/mpidig_rma_callbacks.h)
Active message APIs as of MPICH 3.4.2:
```
/****************** Header and Data Movement APIs ******************/

/* blocking header send */
int MPIDI_[NM|SHM]_am_send_hdr(int rank, MPIR_Comm * comm, int handler_id,
                               const void *am_hdr, MPI_Aint am_hdr_sz)

/* nonblocking header + datatype send */
int MPIDI_[NM|SHM]_am_isend(int rank, MPIR_Comm * comm, int handler_id,
                            const void *am_hdr, MPI_Aint am_hdr_sz,
                            const void *data, MPI_Aint count,
                            MPI_Datatype datatype, MPIR_Request * sreq)

/* nonblocking headers + datatype send */
int MPIDI_[NM|SHM]_am_isendv(int rank, MPIR_Comm * comm, int handler_id,
                             struct iovec *am_hdrs, size_t iov_len,
                             const void *data, MPI_Aint count,
                             MPI_Datatype datatype, MPIR_Request * sreq)

/* blocking header send (callback safe) */
int MPIDI_[NM|SHM]_am_send_hdr_reply(MPIR_Comm * comm, int src_rank,
                                     int handler_id, const void *am_hdr,
                                     MPI_Aint am_hdr_sz)

/* nonblocking header + datatype send (callback safe) */
int MPIDI_[NM|SHM]_am_isend_reply(MPIR_Comm * comm, int src_rank, int handler_id,
                                  const void *am_hdr, MPI_Aint am_hdr_sz,
                                  const void *data, MPI_Aint count,
                                  MPI_Datatype datatype, MPIR_Request * sreq)

/* CTS for pt2pt messages */
int MPIDI_[NM|SHM]_am_recv(MPIR_Request * rreq)

/* largest header (in bytes) supported by the nm/shm */
MPI_Aint MPIDI_[NM|SHM]_am_hdr_max_sz(void)

/* eager size supported by transport (only used internally by nm/shm) */
MPI_Aint MPIDI_[NM|SHM]_am_eager_limit(void)

/* eager buffer size supported by transport, used to assert pipeline protocol will work(?) */
MPI_Aint MPIDI_[NM|SHM]_am_eager_buf_limit(void)

/* return true/false if pt2pt message can be sent eagerly */
bool MPIDI_[NM|SHM]_am_check_eager(MPI_Aint am_hdr_sz, MPI_Aint data_sz,
                                   const void *data, MPI_Aint count,
                                   MPI_Datatype datatype, MPIR_Request * sreq)

/****************** Callback APIs ******************/

/* target-side message callback */
typedef int (*MPIDIG_am_target_msg_cb) (int handler_id, void *am_hdr,
                                        void *data, MPI_Aint data_sz,
                                        int is_local, int is_async,
                                        MPIR_Request ** req);

/* target-side completion callback */
typedef int (*MPIDIG_am_target_cmpl_cb) (MPIR_Request * req);

/* origin-side completion callback */
typedef int (*MPIDIG_am_origin_cb) (MPIR_Request * req);
```
#### SHM
We currently support shared-memory communication based on POSIX shared memory.
#### Netmod
We currently support the `ch4:ofi` and `ch4:ucx` netmods. A minimum netmod can
be implemented by just implementing a minimum set of byte-transfer functions,
e.g. `MPIDI_NM_am_isend`, and the corresponding mechanism of calling back the
handler function upon receiving the byte payload. To achieve better
performance, both the libfabric and ucx libraries provide APIs to directly
send messages with tag matching and RDMA operations, which avoid extra data
copies and shorten latency. Thus both netmods have direct implementations of
the netmod API that skip active messages.
Both netmods provide direct RMA operations where they can, and fall back to
active messages where there are limitations. All window synchronizations are
based on active messages.
## Collectives
Collectives are implemented in `src/mpi/coll/`, with separate folders for each
collective function.
Take `MPI_Bcast` for example: `bcast/bcast.c` defines 3 functions, mostly
boilerplate. The binding layer calls **`MPIR_Bcast`**, which calls
`MPID_Bcast` unless the control variables (CVARs) direct otherwise.
Typically `MPID_Bcast` falls back to calling **`MPIR_Bcast_impl`**, which
checks CVARs and selects a particular algorithm. By default, it calls
**`MPIR_Bcast_allcomm_auto`**, which automatically chooses the best
algorithm based on selection logic defined by a runtime json file.
Each algorithm is typically implemented in a separate file.
## Romio
MPI-IO (MPI_File) APIs are implemented in the ROMIO library. ROMIO is
implemented entirely on top of other MPI functions and doesn't use MPICH
internals. This is similar to how we implement the Fortran bindings.
## MPI_T
The MPI tools information interface provides an orthogonal system for
controlling and monitoring implementation behavior by means of control
variables (CVARs), performance variables (PVARs), and the event callback
interface.
## MPL
All utilities whose functionality is independent of MPICH internals are
implemented inside MPL. All functions from MPL use the `MPL_` prefix.
Some essential MPICH functionality is provided by MPL, including:
* Debug logging
* Atomic variables
* Thread functions
* Memory tracing
* GPU support
Sometimes we use an additional MPIR wrapping layer on top of MPL functions to
provide additional control. For example, the threading interfaces are wrapped
in `src/mpid/common/thread/mpidu_thread_fallback.h` to allow device override
of the locking mechanism. As another example, the GPU interfaces are wrapped
to allow `MPIR_CVAR_ENABLE_GPU` to bypass GPU functions (when it is set to
disable).
## CVAR
Control variables (CVARs) are a mechanism for controlling library behavior
using environment variables. The MPI tool interface also allows control
variables to be used in code at runtime (however, it is rather clunky and
thus not widely used).
Reference:
* [notes on CVAR](notes/cvar.md)
## PM and PMI
References:
* [Notes on Process Manager](notes/pm.md)
* [Notes on Process Manager Interface](notes/pmi.md)
* [Notes on Namepub](notes/namepub.md)
* [Using the Hydra Process Manager](how_to/Using_the_Hydra_Process_Manager.md)
## Test Suite
We have an extensive test suite in `test/mpi/`. While it is currently
configured along with the mpich configure, it requires mpich to be installed
before running `make testing`.
Tests are controlled by `testlist` files.
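A `testlist` entry names a test executable and the number of processes to run it with, optionally followed by key=value settings. The entries below are illustrative only; see the actual `testlist` files for the supported options:
```
# <executable> <nprocs> [key=value ...]
sendrecv1 4
bcasttest 8 timeLimit=300
```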
## Build and Debug
References:
* [Debug Event Logging](design/Debug_Event_Logging.md)