1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631
|
<!--#include virtual="header.txt"-->
<h1>Upgrade Guide</h1>
<p>Slurm supports in-place upgrades between certain versions. This page provides
important details about the steps necessary to perform an upgrade and the
potential complications to prepare for.</p>
<p>See also <a href="quickstart_admin.html">Quick Start Administrator Guide</a></p>
<h2 id="contents">Contents<a class="slurm_link" href="#contents"></a></h2>
<ul>
<li><a href="#release_cycle">Release Cycle</a>
<ul>
<li><a href="#compatibility_window">Compatibility Window</a></li>
<li><a href="#epel_repository">EPEL Repository</a></li>
<li><a href="#prerelease">Pre-Release Versions</a></li>
</ul></li>
<li><a href="#revert">Reverting an Upgrade</a></li>
<li><a href="#minor_upgrades">Minor Upgrades</a></li>
<li><a href="#procedure">Upgrade Procedure</a>
<ul>
<li><a href="#preparation">Preparation</a></li>
<li><a href="#backups">Create Backups</a></li>
<li><a href="#slurmdbd">slurmdbd (Accounting)</a>
<ul>
<li><a href="#db_server">Database Server</a></li>
</ul></li>
<li><a href="#slurmctld">slurmctld (Controller)</a></li>
<li><a href="#slurmd">slurmd (Compute Nodes)</a></li>
<li><a href="#other_commands">Other Slurm Commands</a></li>
<li><a href="#custom_plugins">Customized Slurm Plugins</a></li>
</ul></li>
<li><a href="#seamless_upgrades">Seamless Upgrades</a></li>
</ul>
<h2 id="release_cycle">Release Cycle
<a class="slurm_link" href="#release_cycle"></a></h2>
<p>The Slurm version number contains three period-separated numbers that
represent both the major Slurm release and maintenance release level.
For example, Slurm 23.11.4:</p>
<ul>
<li><b>23.11</b> = major release
<ul>
<li>This matches the year and month of initial release (November 2023)</li>
<li>Major releases may contain changes to RPCs (remote procedure calls),
state files, configuration options, and core functionality</li>
</ul></li>
<li><b>.4</b> = maintenance version
<ul>
<li>Maintenance releases may contain bug fixes and performance improvements</li>
</ul></li>
</ul>
<p>Prior to the 24.05 release, Slurm operated on a 9-month release cycle for
major versions. Slurm 24.05 represents the first release on the
<a href="https://www.schedmd.com/slurm-releases-move-to-a-six-month-cycle/">
new 6-month cycle</a>.</p>
<h3 id="compatibility_window">Compatibility Window
<a class="slurm_link" href="#compatibility_window"></a></h3>
<p>Upgrades from the <b>previous two major releases</b> are compatible. For
example, slurmdbd 23.11.x is capable of accepting messages from slurmctld
daemons and commands with a version of 23.11.x, 23.02.x or 22.05.x. It is also
capable of updating the records in the database that were recorded by an
instance of slurmdbd running these versions.</p>
<p>The Slurm 24.11 release will introduce compatibility with three previous
major releases to provide a similar support duration with the more frequent
6-month release cycle:</p>
<table class="tlist">
<tbody>
<tr>
<td><strong>Slurm Release</strong></td>
<td><strong>Revised End of Support</strong><br>(total length)</td>
<td><strong>Compatible Prior Version</strong></td>
</tr>
<tr>
<td>23.02</td>
<td>November 2024 (21 months)</td>
<td>22.05, 21.08</td>
</tr>
<tr>
<td>23.11</td>
<td>May 2025 (18 months)</td>
<td>23.02, 22.05</td>
</tr>
<tr>
<td>24.05</td>
<td>November 2025 (18 months)</td>
<td>23.11, 23.02</td>
</tr>
<tr>
<td>24.11</td>
<td>May 2026 (18 months)</td>
<td>24.05, 23.11, 23.02</td>
</tr>
<tr>
<td>25.05</td>
<td>November 2026 (18 months)</td>
<td>24.11, 24.05, 23.11</td>
</tr>
<tr>
<td>25.11</td>
<td>May 2027 (18 months)</td>
<td>25.05, 24.11, 24.05</td>
</tr>
</tbody>
</table>
<br>
<p>Upgrades from incompatible versions will fail immediately upon startup.
It is required to perform upgrades from incompatible prior versions in steps,
going to newer versions compatible with the current running version. It may
take several steps to upgrade to a current release of Slurm. For example,
instead of upgrading directly from Slurm 20.11 to 23.11, first upgrade all
systems to Slurm 22.05 and verify functionality, then proceed to upgrade to
23.11. This ensures that each upgrade performed is tested and can be supported
by SchedMD. Compatibility requirements apply to running jobs and upgrading
outside of their compatibility window will result in the jobs being killed and
job accounting being lost.</p>
<h3 id="epel_repository">EPEL Repository
<a class="slurm_link" href="#epel_repository"></a></h3>
<p>In the beginning of 2021, a version of Slurm was added to the
EPEL repository. This version is not provided by or supported by SchedMD, and is
not currently supported for customer use. Unfortunately, this inclusion could
cause Slurm to be updated to a newer version outside of a planned maintenance
period or result in conflicting packages. In order to prevent Slurm from being
changed and broken unintentionally, we recommend you modify the EPEL Repository
configuration to exclude all Slurm packages from automatic updates.</p>
<p>Add the following under the <code>[epel]</code>
section of /etc/yum.repos.d/epel.repo:
<pre>exclude=slurm*</pre></p>
<h3 id="prerelease">Pre-Release Versions
<a class="slurm_link" href="#prerelease"></a></h3>
<p>When installing pre-release versions (e.g., 24.05.0rc1 or
<a href="https://github.com/SchedMD/slurm">master branch</a>), you should prepare
for unexpected crashes, bugs, and loss of state information. SchedMD aims to
use the NEWS file to indicate cases in which state information will be lost with
pre-release versions. However, these pre-release versions receive <b>limited
testing</b> and are not intended for production clusters. Sites are encouraged
to actively run pre-release versions on test machines before each major release.
</p>
<h2 id="revert">Reverting an Upgrade
<a class="slurm_link" href="#revert"></a></h2>
<p>Reverting an upgrade (or downgrading) is <b>not supported</b> once any of the
Slurm daemons have been started. When starting up after an upgrade, the Slurm
daemons (slurmctld, slurmdbd, and slurmd) will update their relevant state
files and databases to the structure used in the new version. If you revert to
an older version, the relevant Slurm daemon will not recognize the new state
file or database, resulting in loss or corruption of state information or job
accounting. The Slurm daemons will likely refuse to start unless configured to
start with the risk of possible data loss.</p>
<p>By using recovery tools, like comprehensive file backups, disk images, and
snapshots, it may be possible to revert components to the pre-upgrade state.
In particular, restoring the contents of <i>StateSaveLocation</i> (as defined in
<i>slurm.conf</i>) and (if configured) the accounting database will be required
if you wish to revert an upgrade. Reverting an upgrade will wipe out anything
that happened after the backups were created.</p>
<h2 id="minor_upgrades">Minor Upgrades
<a class="slurm_link" href="#minor_upgrades"></a></h2>
<p>When upgrading to a newer minor maintenance release (as
<a href="#release_cycle">defined above</a>), we recommend following the same
upgrade procedure as with major releases. You will find that the process takes
less time, and is more accommodating of mixed versions and in-place
downgrades. However, you should always have current backups to solidify your
recovery options.</p>
<h2 id="procedure">Upgrade Procedure
<a class="slurm_link" href="#procedure"></a></h2>
<p>The upgrades procedure can be summarized as follows. Note the specific order
in which the daemons should be upgraded:</p>
<ol>
<li><a href="#preparation">Prepare cluster for the upgrade</a></li>
<li><a href="#backups">Create backups</a></li>
<li>Upgrade <a href="#slurmdbd">slurmdbd</a></li>
<li>Upgrade <a href="#slurmctld">slurmctld</a></li>
<li>Upgrade <a href="#slurmd">slurmd</a> (preferably with slurmctld)</li>
<li>Upgrade <a href="#other_commands">login nodes and client commands</a></li>
<li>Recompile/upgrade <a href="#custom_plugins">customized Slurm plugins</a></li>
<li>Test key functionality</li>
<li>Archive backup data</li>
</ol>
<p>Before considering the upgrade complete, wait for all jobs that were already
running to finish. Any jobs started before the <b>slurmd</b> system was upgraded
will be running with the old version of <b>slurmstepd</b>, so starting another
upgrade or trying to use new features in the new version may cause problems.</p>
<p><b>NOTE</b>: If multiple daemons are present on the same system, they may
need to be upgraded at the same time due to dependencies to the general
<b>slurm</b> package. After upgrading, daemons should be started in the order
listed above. This is <b>not</b> a recommended setup for production; sites are
strongly advised to assign a single core Slurm daemon to each system.</p>
<h3 id="preparation">Preparation
<a class="slurm_link" href="#preparation"></a></h3>
<h4 id="release_notes">RELEASE_NOTES and NEWS
<a class="slurm_link" href="#release_notes"></a></h4>
<p>Review relevant release notes in the <b>RELEASE_NOTES</b> file in root of
Slurm source directory for the target release and any major versions between
what you're currently running and the target you are upgrading to. Pay
particular attention to any entries in which items are <b>removed</b> or
<b>changed</b>. These are particularly likely to require specific attention or
changes during the upgrade. Also look for changes in optional slurm components
that you are using. You may also notice new items added to Slurm that you wish
to start using after the upgrade.</p>
<p>Release notes for the latest major version are
available <a href="release_notes.html">here</a>. Release notes for other
versions can be found in the source, which can be viewed on
<a href="https://github.com/SchedMD/slurm/blob/master/RELEASE_NOTES">GitHub</a>
by selecting the branch or tag corresponding to the desired version. More
detailed changes, including minor release changes, can be found in the
<b>NEWS</b> file, but are usually not needed to prepare for upgrades.</p>
<h4 id="config_changes">Configuration Changes
<a class="slurm_link" href="#config_changes"></a></h4>
<p>Always prepare and test configuration changes in a test environment
before upgrading in production. Changes outlined in the release notes will need
to be looked up in the man pages (such as <a href="slurm.conf.html">slurm.conf
</a>) for details and new syntax. Certain options in your configuration files
may need to be changed as features and functionality are improved in every major
Slurm release. Typically, new naming and syntax conventions are introduced
several versions before the old ones are removed, so you may be able to make the
necessary changes before starting the upgrade process.</p>
<h4 id="downtime">Plan for Downtime
<a class="slurm_link" href="#downtime"></a></h4>
<p>Refer to the expected downtime guidance in the
following sections for each relevant Slurm daemon, particularly the
<a href="#slurmdbd">slurmdbd</a>. Notify affected users of the estimated
downtime for the relevant services and the potential impact on their jobs.
Whenever possible, try to plan upgrades during SchedMD's support hours.
If you encounter an issue outside of these hours there will be a delay before
assistance can be provided.</p>
<h4 id="openapi_changes">OpenAPI Changes
<a class="slurm_link" href="#openapi_changes"></a></h4>
<p>Sites using <code>--json</code> or <code>--yaml</code> arguments with any CLI
commands or running <code>slurmrestd</code> need to check for format
compatibility and data_parser plugin removals before upgrading. The formats for
the values parsed and dumped as JSON and YAML are handled by the data_parser
and openapi plugins. Changes to the formats are tracked in the
<a href="openapi_release_notes.html">OpenAPI release notes</a>.</p>
<table class="tlist">
<tbody>
<tr>
<td><strong>Release Notes</strong></td>
<td><strong>Added OpenAPI plugins</strong></td>
<td><strong>Added Data_Parser plugin</strong></td>
<td><strong>Removed in Release</strong></td>
</tr>
<tr>
<td><a href="openapi_release_notes.html#20020">20.02</a></td>
<td>v0.0.35,dbv0.0.35</td>
<td></td>
<td>22.05</td>
</tr>
<tr>
<td><a href="openapi_release_notes.html#20110">20.11</a></td>
<td>v0.0.36, dbv0.0.36</td>
<td></td>
<td>23.02</td>
</tr>
<tr>
<td><a href="openapi_release_notes.html#21080">21.08</a></td>
<td>v0.0.37, dbv0.0.37</td>
<td></td>
<td>23.11</td>
</tr>
<tr>
<td><a href="openapi_release_notes.html#22050">22.05</a></td>
<td>v0.0.38, dbv0.0.38</td>
<td></td>
<td>24.05</td>
</tr>
<tr>
<td><a href="openapi_release_notes.html#23020">23.02</a></td>
<td>v0.0.39, dbv0.0.39</td>
<td>v0.0.39</td>
<td>24.11</td>
</tr>
<tr>
<td><a href="openapi_release_notes.html#23110">23.11</a></td>
<td>slurmctld, slurmdbd</td>
<td>v0.0.40</td>
<td>25.11</td>
</tr>
<tr>
<td><a href="openapi_release_notes.html#24050">24.05</a></td>
<td></td>
<td>v0.0.41</td>
<td>26.05</td>
</tr>
<tr>
<td><a href="openapi_release_notes.html#24110">24.11</a></td>
<td></td>
<td>v0.0.42</td>
<td>26.11</td>
</tr>
</tbody>
</table>
<p><b>NOTE</b>: The unversioned openapi/slurmctld and openapi/slurmdbd plugins
have no planned removal release.</p>
<p>Any scripts or clients making use of <code>--json</code> or
<code>--yaml</code> arguments with any CLI commands may need to pass the
data_parser version explicitly to avoid issues after an upgrade. The default
data_parser used is the latest version which may not have a compatible format
with the prior versions. Sites can use the specification generation mode to
compare formatting differences.
<pre>
$CLI_COMMAND --json=v0.0.41+spec_only > /tmp/v41.json;
$CLI_COMMAND --json=v0.0.40+spec_only > /tmp/v40.json;
json_diff /tmp/v40.json /tmp/v41.json;
</pre></p>
<p>In the event of a format incompatibility, the preferred data_parser can be
requested explicitly starting with the v0.0.40 plugins in any release before
the plugin's removal.
<pre>
$CLI_COMMAND --json=v0.0.41 $OTHER_ARGS | $SITE_SCRIPT;
$CLI_COMMAND --json=v0.0.40 $OTHER_ARGS | $SITE_SCRIPT;
$CLI_COMMAND --yaml=v0.0.41 $OTHER_ARGS | $SITE_SCRIPT;
$CLI_COMMAND --yaml=v0.0.40 $OTHER_ARGS | $SITE_SCRIPT;
</pre></p>
<p>Any <code>slurmrestd</code> web clients can determine the relevant plugin
being used by looking at the URL being queried. Example URLs:
<pre>
http://$HOST/slurmdb/v0.0.40/jobs
http://$HOST/slurm/v0.0.40/jobs
</pre></p>
<p>The relevant data_parser plugin in the example URLs is "v0.0.40" which
matches the <code>data_parser/v0.0.40</code> plugin. Plugin naming follows the
naming schema of <code>vXX.XX.XX</code> where the XX are numbers. The naming
schema matches the internal naming schema for Slurm's packed binary RPC layer
but is not directly related. The URLs for each given data_parser plugins will
remain a valid query target until the plugin is removed as part of SchedMD's
commitment to ensure release limited backwards compatibility. While it should
be possible to continue using any client from a prior release while the plugins
are still supported, <b>sites should always recompile any generated OpenAPI
clients and test thoroughly before upgrading.</b></p>
<h3 id="backups">Create Backups
<a class="slurm_link" href="#backups"></a></h3>
<p><b>Always</b> create full backups to restore all parts of Slurm, including
the Mysql database, before upgrading in the event the upgrade must be reverted.
SchedMD aims to make supported upgrades a seamless process but it is possible
for unexpected issues to arise and <b>irreversibly corrupt</b> all of the data
kept by Slurm. If something like this happens, it will not be possible to
recover any corrupted data and you will be reliant on backed up data.</p>
<p>It is recommended to prepare recovery options (file backups, disk images,
snapshots, database dumps) that will take you back to a known working cluster
state. How backups are taken is specific to how the systems integrator
designed and setup the cluster and procedures are not provided here.</p>
<p>At a minimum, back up the following:
<ul>
<li><b>StateSaveLocation</b> as defined in
<a href="slurm.conf.html#OPT_StateSaveLocation">slurm.conf</a>, or it can be
queried by calling <pre>scontrol show config | grep StateSaveLocation</pre></li>
<li><b>Entire slurm configuration directory</b>, as defined by
<code>configure --sysconfdir=DIR</code> during compilation.
This is usually located in <code>/etc/slurm/</code></li>
<li><b>MySQL database</b> (if slurmdbd is configured). Usually done by calling
<pre>
mysqldump --databases slurm_acct_db > /path/to/offline/storage/backup.sql
</pre>
This assumes that <b>slurmdbd</b> is not running while the dump is running.
<br>If you wish to back it up while <b>slurmdbd</b> is running, you may use the
<code>--single-transaction</code> flag with the <b>following limitations</b>:
<ol>
<li>Database operations may be slower while the dump is running</li>
<li>Restoring this dump will restore the database at the time the dump was
<b>started</b>, losing any changes made during or after the dump</li>
<li>Certain cluster operations may lead to an incorrect or failed dump:
<ul>
<li>Creating a new database</li>
<li>Upgrading an existing database</li>
<li>Adding or Removing a cluster in the slurmdbd</li>
<li><a href="https://slurm.schedmd.com/accounting.html#slurmdbd-archive-purge">
Archiving or Purging</a> accounting data</li>
</ul>
</li>
</ol>
</li>
</ul></p>
<h3 id="slurmdbd">slurmdbd (Accounting)
<a class="slurm_link" href="#slurmdbd"></a></h3>
<p>If <b>slurmdbd</b> is used in your environment, it must be at the same or
higher major release number as the slurmctld daemon(s), and at a close enough
version for <a href="#compatibility_window">compatibility</a>. Thus, when
performing upgrades, it should be upgraded first. When a backup slurmdbd host
is in use, it should be upgraded at the same time as the primary.</p>
<p>Upgrades to the slurmdbd may require significant <b>downtime</b>.
With large accounting databases, the precautionary database dump will take some
time, and the upgraded daemon may be unresponsive for tens of minutes while it
updates the database to the new schema. Sites are encouraged to use the
<a href="slurmdbd.conf.html#OPT_PurgeJobAfter">purge functionality</a> if older
accounting data is not required for normal operations. Purging old records
before attempting to upgrade can significantly decrease outage time.</p>
<p>The non-slurmdbd functionality of the cluster will continue to operate while
the upgrade is in process, provided the activity does not fill up the slurmdbd
Agent queue on the slurmctld node. While slurmdbd is offline, you should
monitor the memory usage of slurmctld, and the <b>DBD Agent queue size</b>, as
reported by <b>sdiag</b>, to ensure it does not exceed the configured
<b>MaxDBDMsgs</b> in <a href="slurm.conf.html#OPT_MaxDBDMsgs">slurm.conf</a>.
Cli commands <a href="sacct.html">sacct</a> and <a href="sacctmgr.html">
sacctmgr</a> will not work while slurmdbd is down.
<code>slurmrestd</code> queries that include slurmdb in
the URL path will fail while slurmdbd is down.</p>
<p>It is preferred to create a backup of the database after shutting down the
<b>slurmdbd</b> daemon, when the MySQL database is no longer changing. If you
wish to take a backup with <b>mysqldump</b> while the slurmdbd is still
running, you can add <code>--single-transaction</code> to the mysqldump command.
Note that the slurmdbd will continue to execute operations that will not be
contained in the dump, which may cause complications if you need to restore
the database to this state.</p>
<p>The suggested upgrade procedure is as follows:</p>
<ol>
<li>Shutdown the slurmdbd daemon(s) gracefully:
<pre>sacctmgr shutdown</pre>or via systemd:
<pre>systemctl stop slurmdbd</pre> Wait until slurmdbd is fully down before
proceeding or there may be data loss from data that was not fully saved.
<pre>systemctl status slurmdbd</pre>
</li>
<li><a href="#backups">Backup the Slurm database</a></li>
<li>Verify that the innodb_buffer_pool_size in my.cnf is greater than the
default. See the recommendation in the
<a href="accounting.html#slurm-accounting-configuration-before-build">
accounting page</a>.</li>
<li>Upgrade the slurmdbd daemon binaries, libraries, and its systemd unit file
(if used). If using <a href="quickstart_admin.html#build_install">
RPM/DEB packages</a>, the package manager will take care of these,
although systemd overrides may prevent the new unit from taking effect.
<br>Only upgrade the slurmdbd system(s) at this time; other Slurm
systems should remain on the old version.</li>
<li>Start the primary slurmdbd daemon.
<br><b>NOTE</b>: If you typically use systemd, it is recommended to
initially start the daemon directly as the configured SlurmUser:
<br><code>sudo -u slurm slurmdbd -D</code>
<br>When the daemon starts up for the first time after upgrading, it
will take some extra time to update existing records in the database. If
it is started with systemd and reaches the configured timeout value, it
may be killed prematurely potentially causing data loss. After it
finishes starting up, you can use <code>Ctrl+C</code> to exit, then
start it normally with systemd.</li>
<li>Start the backup slurmdbd daemon (if applicable).</li>
<li>Validate accounting operation, such as retrieving data through
<code>sacct</code> or <code>sacctmgr</code>.</li>
</ol>
<h4 id="db_server"><b>Database Server</b>
<a class="slurm_link" href="#db_server"></a></h4>
<p>When upgrading the database server that is used by slurmdbd (e.g., MySQL or
MariaDB), usually no special procedures are required. It is recommended to use a
database server that is supported by the publisher (or that was at the time when
the chosen Slurm version was initially released). Database upgrades should be
performed while the slurmdbd is stopped and according to the recommended
procedure for the database used.</p>
<p>When upgrading an existing accounting database to <b>MariaDB 10.2.1</b> or
later from an older version of MariaDB or any version of MySQL, ensure you are
running <b>slurmdbd 22.05.7</b> or later. These versions will gracefully handle
changes to MariaDB default values that can cause problems for slurmdbd.</p>
<h3 id="slurmctld">slurmctld (Controller)
<a class="slurm_link" href="#slurmctld"></a></h3>
<p>It is preferred to upgrade the slurmctld system(s) at the same time as slurmd
on the compute nodes and other Slurm commands on client machines and login nodes.
The effects of downtime on slurmctld and slurmd daemons are largely the same,
so upgrading them all together minimizes the total duration of these effects.
Rolling upgrades are also possible if the slurmctld is upgraded first. When
multiple slurmctld hosts are used, all should be upgraded simultaneously.</p>
<p>Upgrading the slurmctld involves a brief period of <b>downtime</b> during
which job submissions are not accepted, queued jobs are not scheduled, and
information about completing jobs is held. These functions will resume once
the upgraded controller is started.</p>
<p>The recommended upgrade procedure is below, including optional steps for a
simultaneous upgrade of slurmd systems:</p>
<ol>
<li>Increase configured SlurmdTimeout and SlurmctldTimeout values and
execute <code>scontrol reconfig</code> for them to take effect.
<br>The new timeout should be long enough to perform the upgrade using
your preferred method. If the timeout is reached, nodes may be marked
DOWN and their jobs killed.</li>
<li>Shutdown the slurmctld daemon(s).</li>
<li>(opt.) Shutdown the slurmd daemons on the compute nodes.</li>
<li>Back up the contents of the configured StateSaveLocation.</li>
<li>Upgrade the slurmctld (and optionally slurmd) daemons and their systemd
service files (if used).</li>
<li>(opt.) Restart the slurmd daemons on the compute nodes.</li>
<li>Restart the slurmctld daemon(s).</li>
<li>Validate proper operation, such as communication with nodes and a job's
ability to successfully start and finish.</li>
<li>Restore the preferred SlurmdTimeout and SlurmctldTimeout values and
execute <code>scontrol reconfig</code> for them to take effect.</li>
</ol>
<h3 id="slurmd">slurmd (Compute Nodes)
<a class="slurm_link" href="#slurmd"></a></h3>
<p>It is preferred to upgrade all slurmd nodes at the same time as the slurmctld.
It is also possible to perform a rolling upgrade by upgrading the slurmd nodes
later in any number of groups. Sites are encouraged to minimize the amount of
time during which mixed versions are used in a cluster.</p>
<p>Upgrades will not interrupt running jobs as long as <b>SlurmdTimeout</b>
is not reached during the process. However, while the slurmd is down for
upgrades, new jobs will not be started and finishing jobs will wait to
report back to the controller until it comes back online.</p>
<p>If you are upgrading the slurmd nodes separately from the controller, the
following procedure can be followed:</p>
<ol>
<li>Increase the configured SlurmdTimeout value and execute
<code>scontrol reconfig</code> for it to take effect.
<br>The new timeout should be long enough to perform the upgrade using
your preferred method. If the timeout is reached, nodes may be marked
DOWN and their jobs killed.</li>
<li>Shutdown the slurmd daemons on the compute nodes.</li>
<li>Back up the contents of the configured StateSaveLocation.</li>
<li>Upgrade the slurmd daemons and their systemd unit files (if used).</li>
<li>Restart the slurmd daemons.</li>
<li>Validate proper operation, such as communication with the controller and a
job's ability to successfully start and finish.</li>
<li>Repeat for any other groups of nodes that need to be upgraded.</li>
<li>Restore the preferred SlurmdTimeout value and
execute <code>scontrol reconfig</code> for it to take effect.</li>
</ol>
<h3 id="other_commands">Other Slurm Commands
<a class="slurm_link" href="#other_commands"></a></h3>
<p>Other Slurm commands (including client commands) do not require special
attention when upgrading, except where specifically noted in the release notes.
You should also pay attention to any changes introduced in these additional
components. After core Slurm components have been upgraded, upgrade additional
components along with their systemd unit files (if used) and client commands
using the normal method for your system, then restart any affected daemons.</p>
<h3 id="custom_plugins">Customized Slurm Plugins
<a class="slurm_link" href="#custom_plugins"></a></h3>
<p>Slurm's main public API library (libslurm.so.X.0.0) increases its version
number with every major release, so any application linked against it should be
recompiled after an upgrade. This includes locally developed Slurm plugins.</p>
<p>If you have built your own version of Slurm plugins, besides having to
recompile them, they will likely need modification to support the new version
of Slurm. It is common for plugins to add new functions and function arguments
during major updates. See the RELEASE_NOTES file for details about these
changes.</p>
<p>Slurm's PMI-1 (libpmi.so.0.0.0) and PMI-2 (libpmi2.so.0.0.0) public API
libraries do not change between releases and are meant to be permanently
fixed. This means that linking against either of them will not require you
to recompile the application after a Slurm upgrade, except in the unlikely
event that one of them changes. It is unlikely because these libraries must
be compatible with any other PMI-1 and PMI-2 implementations. If there was a
change, it would be announced in the RELEASE_NOTES and would only happen on
a major release.</p>
<p>As an example, MPI stacks like OpenMPI and MVAPICH2 link against Slurm's
PMI-1 and/or PMI-2 API, but not against our main public API. This means that at
the time of writing this documentation, you don't need to recompile these
stacks after a Slurm upgrade. One known exception is MPICH. When MPICH is
compiled with Slurm support and with the Hydra Process Manager, it will use
the Slurm API to obtain job information. This link means you will need to
recompile the MPICH stack after an upgrade.</p>
<p>One easy way to know if an application requires a recompile is to inspect all
of its ELF files with 'ldd' and grep for 'slurm'. If you see a versioned
'libslurm.so.x.y.z' reference, then the application will likely need to be
recompiled.</p>
<h2 id="seamless_upgrades">Seamless Upgrades
<a class="slurm_link" href="#seamless_upgrades"></a></h2>
<p>In environments where the Slurm build process is customized, it is possible
to install a new version of Slurm to a unique directory and use a symbolic link
to point the directory in your PATH to the version of Slurm you would like to
use. This allows you to install the new version before you are in a maintenance
period as well as easily switch between versions should you need to roll
back for any reason. It also avoids potential problems with library conflicts
that might arise from installing different versions to the same directory.</p>
<p style="text-align:center;">Last modified 15 January 2025</p>
<!--#include virtual="footer.txt"-->
|