Supported fault tolerance techniques
====================================

Open MPI is a vehicle for research in fault tolerance and, over the years,
has provided support for a wide range of resilience techniques:

* Currently supported
  
    * User Level Fault Mitigation (ULFM) techniques, similar to those
      defined in the context of the MPI Forum (this is the closest match
      when migrating from FT-MPI); :ref:`see its documentation section
      <ulfm-label>`.

* Only for research / non-production usage
      
    * Message logging techniques, similar to those implemented in
      MPICH-V.

* Deprecated / no longer available
  
    * Coordinated and uncoordinated process checkpoint and
      restart, similar to those implemented in LAM/MPI and MPICH-V,
      respectively.
    * Data reliability and network fault tolerance, similar to those
      implemented in LA-MPI.

Current fault tolerance development
-----------------------------------

The only active work in resilience in Open MPI targets the User Level Fault
Mitigation (ULFM) approach, a technique discussed in the context of the MPI
standardization body.

For information on the Fault Tolerant MPI prototype in Open MPI, see the
links below:

* :ref:`Open MPI's ULFM documentation section <ulfm-label>`
* `MPI Forum's Fault Tolerance Working Group <https://github.com/mpiwg-ft/ft-issues/wiki>`_
* `Information, examples, and support <https://fault-tolerance.org/>`_

Support for other types of resilience (e.g., :ref:`data reliability <ft-data-reliability-label>`,
:ref:`checkpoint <ft-checkpoint-restart-label>`) has been deprecated over the
years due to lack of adoption and lack of maintenance. If you are interested
in doing some archeological work, traces of these implementations are still
available in the main repository.