Fault tolerance
===============

The phrase "fault tolerance" means many things to many
people.  Typical definitions include user processes dumping vital
state to disk periodically, checkpoint/restart of running processes,
elaborate recreate-process-state-from-incremental-pieces schemes,
and many others.

In the scope of Open MPI, we typically define "fault tolerance" to
mean the ability to recover from one or more component failures in a
well-defined manner using either a transparent or an
application-directed mechanism.  Component failures may manifest as a
corrupted transmission over a faulty network interface or as the
failure of one or more serial or parallel processes due to a processor
or node failure.  Open MPI strives to provide the application with a
consistent system view while still providing a production-quality,
high-performance implementation.

Yes, that's pretty much as all-inclusive as possible |mdash| intentionally
so!  Remember that in addition to being a production-quality MPI
implementation, Open MPI is also a vehicle for research.  So while
some forms of "fault tolerance" are more widely accepted and used,
others are certainly of legitimate academic interest.


.. toctree::
   :maxdepth: 1

   supported
   checkpoint-restart
   data-reliability