<!DOCTYPE html>
<html class="writer-html5" lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>11.6.1. Supported fault tolerance techniques — Open MPI 5.0.9 documentation</title>
<link rel="stylesheet" type="text/css" href="../../_static/pygments.css" />
<link rel="stylesheet" type="text/css" href="../../_static/css/theme.css" />
<!--[if lt IE 9]>
<script src="../../_static/js/html5shiv.min.js"></script>
<![endif]-->
<script data-url_root="../../" id="documentation_options" src="../../_static/documentation_options.js"></script>
<script src="../../_static/jquery.js"></script>
<script src="../../_static/underscore.js"></script>
<script src="../../_static/_sphinx_javascript_frameworks_compat.js"></script>
<script src="../../_static/doctools.js"></script>
<script src="../../_static/sphinx_highlight.js"></script>
<script src="../../_static/js/theme.js"></script>
<link rel="index" title="Index" href="../../genindex.html" />
<link rel="search" title="Search" href="../../search.html" />
<link rel="next" title="11.6.2. Checkpoint and restart of parallel jobs" href="checkpoint-restart.html" />
<link rel="prev" title="11.6. Fault tolerance" href="index.html" />
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search" >
<a href="../../index.html" class="icon icon-home">
Open MPI
</a>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="../../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" aria-label="Search docs" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
</div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu">
<ul class="current">
<li class="toctree-l1"><a class="reference internal" href="../../quickstart.html">1. Quick start</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../getting-help.html">2. Getting help</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../release-notes/index.html">3. Release notes</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../installing-open-mpi/index.html">4. Building and installing Open MPI</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../features/index.html">5. Open MPI-specific features</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../validate.html">6. Validating your installation</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../version-numbering.html">7. Version numbers and compatibility</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../mca.html">8. The Modular Component Architecture (MCA)</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../building-apps/index.html">9. Building MPI applications</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../launching-apps/index.html">10. Launching MPI applications</a></li>
<li class="toctree-l1 current"><a class="reference internal" href="../index.html">11. Run-time operation and tuning MPI applications</a><ul class="current">
<li class="toctree-l2"><a class="reference internal" href="../environment-var.html">11.1. Environment variables set for MPI applications</a></li>
<li class="toctree-l2"><a class="reference internal" href="../networking/index.html">11.2. Networking support</a></li>
<li class="toctree-l2"><a class="reference internal" href="../multithreaded.html">11.3. Running multi-threaded MPI applications</a></li>
<li class="toctree-l2"><a class="reference internal" href="../dynamic-loading.html">11.4. Dynamically loading <code class="docutils literal notranslate"><span class="pre">libmpi</span></code> at runtime</a></li>
<li class="toctree-l2"><a class="reference internal" href="../fork-system-popen.html">11.5. Calling fork(), system(), or popen() in MPI processes</a></li>
<li class="toctree-l2 current"><a class="reference internal" href="index.html">11.6. Fault tolerance</a><ul class="current">
<li class="toctree-l3 current"><a class="current reference internal" href="#">11.6.1. Supported fault tolerance techniques</a><ul>
<li class="toctree-l4"><a class="reference internal" href="#current-fault-tolerance-development">11.6.1.1. Current fault tolerance development</a></li>
</ul>
</li>
<li class="toctree-l3"><a class="reference internal" href="checkpoint-restart.html">11.6.2. Checkpoint and restart of parallel jobs</a></li>
<li class="toctree-l3"><a class="reference internal" href="data-reliability.html">11.6.3. End-to-end data reliability for MPI messages</a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="../large-clusters/index.html">11.7. Large Clusters</a></li>
<li class="toctree-l2"><a class="reference internal" href="../affinity.html">11.8. Processor and memory affinity</a></li>
<li class="toctree-l2"><a class="reference internal" href="../mpi-io/index.html">11.9. MPI-IO tuning options</a></li>
<li class="toctree-l2"><a class="reference internal" href="../coll-tuned.html">11.10. Tuning Collectives</a></li>
<li class="toctree-l2"><a class="reference internal" href="../benchmarking.html">11.11. Benchmarking Open MPI applications</a></li>
<li class="toctree-l2"><a class="reference internal" href="../heterogeneity.html">11.12. Building heterogeneous MPI applications</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="../../app-debug/index.html">12. Debugging Open MPI Parallel Applications</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../developers/index.html">13. Developer’s guide</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../contributing.html">14. Contributing to Open MPI</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../license/index.html">15. License</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../history.html">16. History of Open MPI</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../man-openmpi/index.html">17. Open MPI manual pages</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../man-openshmem/index.html">18. OpenSHMEM manual pages</a></li>
</ul>
</div>
</div>
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" >
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="../../index.html">Open MPI</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
<div itemprop="articleBody">
<style>
.wy-table-responsive table td,.wy-table-responsive table th{white-space:normal}
</style><div class="section" id="supported-fault-tolerance-techniques">
<h1><span class="section-number">11.6.1. </span>Supported fault tolerance techniques<a class="headerlink" href="#supported-fault-tolerance-techniques" title="Permalink to this heading"></a></h1>
<p>Open MPI is a vehicle for research in fault tolerance and, over the years,
has provided support for a wide range of resilience techniques:</p>
<ul>
<li><p>Currently supported</p>
<blockquote>
<div><ul class="simple">
<li><p>User Level Fault Mitigation (ULFM) techniques similar to those defined
in the context of the MPI Forum (this is the closest match when
migrating from FT-MPI); <a class="reference internal" href="../../features/ulfm.html#ulfm-label"><span class="std std-ref">see the ULFM documentation section</span></a>.</p></li>
</ul>
</div></blockquote>
</li>
<li><p>Only for research / non-production usage</p>
<blockquote>
<div><ul class="simple">
<li><p>Message logging techniques, similar to those implemented in
MPICH-V.</p></li>
</ul>
</div></blockquote>
</li>
<li><p>Deprecated / no longer available</p>
<blockquote>
<div><ul class="simple">
<li><p>Coordinated and uncoordinated process checkpoint and
restart, similar to those implemented in LAM/MPI and MPICH-V,
respectively.</p></li>
<li><p>Data reliability and network fault tolerance, similar to those
implemented in LA-MPI.</p></li>
</ul>
</div></blockquote>
</li>
</ul>
<div class="section" id="current-fault-tolerance-development">
<h2><span class="section-number">11.6.1.1. </span>Current fault tolerance development<a class="headerlink" href="#current-fault-tolerance-development" title="Permalink to this heading"></a></h2>
<p>The only active resilience work in Open MPI targets the User Level Fault
Mitigation (ULFM) approach, a technique discussed within the MPI Forum, the
MPI standardization body.</p>
<p>For information on the Fault Tolerant MPI prototype in Open MPI, see the
links below:</p>
<ul class="simple">
<li><p><a class="reference internal" href="../../features/ulfm.html#ulfm-label"><span class="std std-ref">Open MPI’s ULFM documentation section</span></a></p></li>
<li><p><a class="reference external" href="https://github.com/mpiwg-ft/ft-issues/wiki">MPI Forum’s Fault Tolerance Working Group</a></p></li>
<li><p><a class="reference external" href="https://fault-tolerance.org/">Information, examples, and support</a></p></li>
</ul>
<p>Support for other types of resilience (e.g., <a class="reference internal" href="data-reliability.html#ft-data-reliability-label"><span class="std std-ref">data reliability</span></a>,
<a class="reference internal" href="checkpoint-restart.html#ft-checkpoint-restart-label"><span class="std std-ref">checkpoint</span></a>) has been deprecated over the
years due to lack of adoption and maintenance. If you are interested
in doing some archeological work, traces are still available in the main
repository.</p>
</div>
</div>
</div>
</div>
<footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
<a href="index.html" class="btn btn-neutral float-left" title="11.6. Fault tolerance" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
<a href="checkpoint-restart.html" class="btn btn-neutral float-right" title="11.6.2. Checkpoint and restart of parallel jobs" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
</div>
<hr/>
<div role="contentinfo">
<p>© Copyright 2003-2025, The Open MPI Community.
<span class="lastupdated">Last updated on 2025-10-30 22:49:30 UTC.
</span></p>
</div>
Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
<a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
provided by <a href="https://readthedocs.org">Read the Docs</a>.
</footer>
</div>
</div>
</section>
</div>
<script>
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>
</body>
</html>