1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
|
.. _ft-checkpoint-restart-label:
Checkpoint and restart of parallel jobs
=======================================
Old versions of Open MPI (starting from v1.3 series) had support for
the transparent, coordinated checkpointing and restarting of MPI
processes (similar to LAM/MPI).
Open MPI supported both the the `BLCR <http://ftg.lbl.gov/checkpoint/>`_
checkpoint/restart system and a "self" checkpointer that allows
applications to perform their own checkpoint/restart functionality while taking
advantage of the Open MPI checkpoint/restart infrastructure.
For both of these, Open MPI provides a coordinated checkpoint/restart protocol
and integration with a variety of network interconnects including shared memory,
Ethernet, and InfiniBand.
The implementation introduced a series of new frameworks and
components designed to support a variety of checkpoint and restart
techniques. This enabled support for the methods described above
(application-directed, BLCR, etc.) as well as other kinds of
checkpoint/restart systems (e.g., Condor, libckpt) and protocols
(e.g., uncoordinated, message induced).
.. note:: The checkpoint/restart support was last included as part of
the v1.6 series. It was removed from Open MPI after failing
to gain adoption and lack of maintenance.
|