.. _ft-checkpoint-restart-label:

Checkpoint and restart of parallel jobs
=======================================

Older versions of Open MPI (starting with the v1.3 series) supported
transparent, coordinated checkpointing and restarting of MPI
processes (similar to LAM/MPI).

Open MPI supported both the `BLCR <http://ftg.lbl.gov/checkpoint/>`_
checkpoint/restart system and a "self" checkpointer that allowed
applications to perform their own checkpoint/restart functionality while taking
advantage of the Open MPI checkpoint/restart infrastructure.
For both of these, Open MPI provided a coordinated checkpoint/restart protocol
and integration with a variety of network interconnects, including shared memory,
Ethernet, and InfiniBand.

The implementation introduced a series of new frameworks and
components designed to support a variety of checkpoint and restart
techniques. This enabled support for the methods described above
(application-directed, BLCR, etc.) as well as other kinds of
checkpoint/restart systems (e.g., Condor, libckpt) and protocols
(e.g., uncoordinated, message-induced).

.. note:: The checkpoint/restart support was last included as part of
          the v1.6 series.  It was removed from Open MPI because it
          failed to gain adoption and was not maintained.