Slurm Power Saving Guide

Slurm provides an integrated power saving mechanism for powering down idle nodes. Nodes that remain idle for a configurable period of time can be placed in a power saving mode, which can reduce power consumption or fully power down the node. The nodes are restored to normal operation once work is assigned to them. For example, power saving can be accomplished using a cpufreq governor that changes CPU frequency and voltage (note that the cpufreq driver must be enabled in the Linux kernel configuration). Of particular note, Slurm can power nodes up or down at a configurable rate to prevent rapid changes in power demand. Without this gradual ramp-up, starting a 1000 node job on an idle cluster could cause an instantaneous surge in power demand of multiple megawatts.

Configuration

A great deal of flexibility is offered in terms of when and how idle nodes are put into or removed from power save mode. Note that the Slurm control daemon, slurmctld, must be restarted to initially enable power saving mode. Changes to the configuration parameters (e.g. SuspendTime) will take effect after modifying the slurm.conf configuration file and executing "scontrol reconfig". The relevant parameters, such as SuspendTime, SuspendProgram, ResumeProgram, SuspendTimeout and ResumeTimeout, are set in slurm.conf.

Note that SuspendProgram and ResumeProgram execute as SlurmUser on the node where the slurmctld daemon runs (primary and backup server nodes). Use of sudo may be required for SlurmUser to power down and restart nodes. If you need to convert Slurm's hostlist expression into individual node names, the "scontrol show hostnames" command may prove useful. The commands used to boot or shut down nodes will depend upon your cluster management tools.

Note that SuspendProgram and ResumeProgram are not subject to any time limits. They should perform the required action, ideally verify it (e.g. confirm that the node has booted and started the slurmd daemon, so that the node responds to slurmctld again) and terminate. Long running programs will be logged by slurmctld, but not aborted.

Also note that the stdout and stderr of the suspend and resume programs are not logged. If logging is desired, it should be added to the scripts.

#!/bin/bash
# Example SuspendProgram
# $1 is the hostlist expression naming the nodes to power down
echo "$(date) Suspend invoked $0 $*" >>/var/log/power_save.log
hosts=$(scontrol show hostnames "$1")
for host in $hosts
do
   sudo node_shutdown "$host"
done

#!/bin/bash
# Example ResumeProgram
# $1 is the hostlist expression naming the nodes to power up
echo "$(date) Resume invoked $0 $*" >>/var/log/power_save.log
hosts=$(scontrol show hostnames "$1")
for host in $hosts
do
   sudo node_startup "$host"
done
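
Because slurmctld does not capture the output of these programs, one option is to redirect stdout and stderr inside the script itself. The following is a minimal sketch of that idea, not a recommended implementation. POWER_SAVE_LOG and NODE_SHUTDOWN_CMD are hypothetical variables introduced here for illustration, and node_shutdown is the same placeholder command used in the examples above.

```shell
#!/bin/bash
# Sketch: a SuspendProgram that records its own stdout and stderr.
# POWER_SAVE_LOG and NODE_SHUTDOWN_CMD are hypothetical, illustrative names.
POWER_SAVE_LOG="${POWER_SAVE_LOG:-/var/log/power_save.log}"
NODE_SHUTDOWN_CMD="${NODE_SHUTDOWN_CMD:-sudo node_shutdown}"

suspend_nodes() {
    local hostlist="$1" host
    # Everything printed inside this group, including errors, goes to the log.
    {
        echo "$(date) Suspend invoked for $hostlist"
        for host in $(scontrol show hostnames "$hostlist"); do
            # Left unquoted on purpose so the command and its arguments split.
            $NODE_SHUTDOWN_CMD "$host" || echo "shutdown of $host failed"
        done
    } >>"$POWER_SAVE_LOG" 2>&1
}

# Run only when a hostlist argument is supplied, as slurmctld would do.
if [ "$#" -gt 0 ]; then
    suspend_nodes "$1"
fi
```

With this arrangement, error messages from the shutdown command end up next to the invocation record, which makes failed power-down attempts easier to trace.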

Subject to the various rates, limits and exclusions, the power save code follows this logic:

  1. Identify nodes which have been idle for at least SuspendTime.
  2. Execute SuspendProgram with an argument of the idle node names.
  3. Identify the nodes which are in power save mode (a flag in the node's state field), but have been allocated to jobs.
  4. Execute ResumeProgram with an argument of the allocated node names.
  5. Once the slurmd responds, initiate the job and/or job steps allocated to it.
  6. If the slurmd fails to respond within the value configured for SlurmdTimeout, the node will be marked DOWN and the job requeued if possible.
  7. Repeat indefinitely.
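
Step 1 of this logic can be illustrated with a toy model. The sketch below is not how slurmctld is implemented. It only shows the selection rule, assuming a hypothetical input of one "node_name idle_seconds" pair per line and a threshold that plays the role of SuspendTime.

```shell
#!/bin/bash
# Toy model of step 1 only: select nodes idle for at least SuspendTime seconds.
# Input on stdin: one "<node_name> <idle_seconds>" pair per line (hypothetical format).
select_idle_nodes() {
    local suspend_time="$1"
    # Print the node name when its idle time meets or exceeds the threshold.
    awk -v limit="$suspend_time" '$2 >= limit { print $1 }'
}
```

For example, "printf 'n1 30\nn2 900\n' | select_idle_nodes 600" prints n2, the only node idle for at least 600 seconds.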

The slurmctld daemon will periodically (every 10 minutes) log how many nodes are in power save mode using messages of this sort:

[May 02 15:31:25] Power save mode 0 nodes
...
[May 02 15:41:26] Power save mode 10 nodes
...
[May 02 15:51:28] Power save mode 22 nodes
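
Messages of this sort can be summarized with standard text tools. The sketch below assumes the exact message format shown above; the timestamp prefix in a real slurmctld log may differ, in which case the pattern needs adjusting. The function name power_save_counts is hypothetical.

```shell
#!/bin/bash
# Sketch: list the power save node counts recorded in a slurmctld log.
# Assumes lines of the form "[timestamp] Power save mode N nodes".
power_save_counts() {
    local logfile="$1"
    # Print "timestamp count" for each matching line and skip everything else.
    sed -n 's/^\[\(.*\)] Power save mode \([0-9][0-9]*\) nodes.*$/\1 \2/p' "$logfile"
}
```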

Using these logs you can easily see the effect of Slurm's power saving support. You can also configure Slurm with programs that perform no action as SuspendProgram and ResumeProgram to assess the potential impact of power saving mode before enabling it.
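
A program that performs no action can still record what it was asked to do, which makes the assessment easier to review afterwards. The following is a minimal sketch under stated assumptions: DRY_RUN_LOG is a hypothetical variable and its default path is illustrative. The same script could be configured as both SuspendProgram and ResumeProgram, installed under two names.

```shell
#!/bin/bash
# Sketch: a SuspendProgram/ResumeProgram stand-in that takes no power action.
# It only records which nodes would have been affected.
# DRY_RUN_LOG is a hypothetical variable; the default path is illustrative.
DRY_RUN_LOG="${DRY_RUN_LOG:-/tmp/power_save_dry_run.log}"

record_only() {
    local action="$1" hostlist="$2"
    echo "$(date '+%Y-%m-%dT%H:%M:%S') $action (dry run): $hostlist" >>"$DRY_RUN_LOG"
}

# Run only when slurmctld (or a user) passes a hostlist argument.
if [ "$#" -gt 0 ]; then
    # The script name tells the two roles apart when installed under two names.
    record_only "$(basename "$0")" "$1"
fi
```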

Use of Allocations

A resource allocation request will be granted as soon as resources are selected for use, possibly before the nodes are all available for use. The launching of job steps will be delayed until the required nodes have been restored to service. In the meantime, a warning about waiting for nodes to become available is printed and the launch is retried periodically until they are available.

In the case of an sbatch command, the batch program will start when node zero of the allocation is ready for use, and pre-processing can be performed as needed before using srun to launch job steps. The sbatch --wait-all-nodes=<value> option can be used to override this behavior on a per-job basis, and a system-wide default can be set with the SchedulerParameters=sbatch_wait_nodes option.
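
As an illustration, a batch script can request that execution begin only once every allocated node is ready. This is a sketch rather than a recommended template: the node count and the hostname step are arbitrary placeholders, and the run_steps wrapper is a hypothetical helper used only to keep the example self-contained. A value of 1 waits for all nodes, while 0 starts as soon as the allocation is made.

```shell
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --wait-all-nodes=1   # 1: do not begin until all nodes are ready for use

# Job steps to launch once the batch script starts (placeholder workload).
run_steps() {
    srun hostname
}

# SLURM_JOB_ID is set inside a job, so the steps run only under Slurm.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    run_steps
fi
```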

In the case of the salloc command, once the allocation is made a new shell will be created on the login node. The salloc --wait-all-nodes=<value> option can be used to override this behavior on a per-job basis, and a system-wide default can be set with the SchedulerParameters=salloc_wait_nodes option.

Fault Tolerance

If the slurmctld daemon is terminated gracefully, it will wait up to ten seconds (or the larger of SuspendTimeout and ResumeTimeout, if that is less than ten seconds) for any spawned SuspendProgram or ResumeProgram to terminate before the daemon terminates. If the spawned program does not terminate within that time period, the event will be logged and slurmctld will exit in order to permit another slurmctld daemon to be initiated. Any spawned SuspendProgram or ResumeProgram will continue to run.

When the slurmctld daemon shuts down, any SLURM_RESUME_FILE temporary files are no longer available, even once slurmctld restarts. Therefore, ResumeProgram should use SLURM_RESUME_FILE within ten seconds of starting to guarantee that it still exists.
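
One way to stay within that window is to copy the file as the very first action of ResumeProgram and work from the copy afterwards. The sketch below makes no assumption about the file's contents. RESUME_COPY_DIR and save_resume_file are hypothetical names introduced for illustration.

```shell
#!/bin/bash
# Sketch: preserve SLURM_RESUME_FILE at the very start of ResumeProgram.
# RESUME_COPY_DIR is a hypothetical variable; the default path is illustrative.
RESUME_COPY_DIR="${RESUME_COPY_DIR:-/tmp}"

save_resume_file() {
    local copy
    # Give up if the variable is unset or the file is already gone.
    [ -n "${SLURM_RESUME_FILE:-}" ] && [ -r "$SLURM_RESUME_FILE" ] || return 1
    copy="$(mktemp "$RESUME_COPY_DIR/resume_file.XXXXXX")" || return 1
    # Print the path of the private copy so the caller can use it later.
    cp "$SLURM_RESUME_FILE" "$copy" && echo "$copy"
}
```

A ResumeProgram could call "copy=$(save_resume_file)" before doing anything slow, then read from "$copy" for the rest of its run.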

Booting Different Images

If you want ResumeProgram to boot various images according to job specifications, it will need to be a fairly sophisticated program and perform the following actions:

  1. Determine which jobs are associated with the nodes to be booted.
  2. Determine which image is required for each job.
  3. Boot the appropriate image for each node.
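
The three actions above can be sketched as follows. This is an illustration under several assumptions, not a complete ResumeProgram: node_boot_image stands in for a site-specific boot tool, image_for_features is a made-up policy, and the convention that a job's requested features name the image is hypothetical. It also assumes that squeue already reports the job for each node when ResumeProgram runs.

```shell
#!/bin/bash
# Sketch of a ResumeProgram that picks a boot image per node.
# Hypothetical pieces: node_boot_image (site boot tool), image_for_features
# (site policy), and the idea that a job's feature request names the image.

# Step 2: map a job's requested features to an image name (site policy).
image_for_features() {
    case "$1" in
        *gpu*) echo "gpu-image" ;;
        *)     echo "default-image" ;;
    esac
}

resume_with_images() {
    local hostlist="$1" host line jobid features image
    for host in $(scontrol show hostnames "$hostlist"); do
        # Step 1: first job on this node, printed as "<jobid> <features>".
        line="$(squeue --noheader --nodelist="$host" --format="%i %f" | head -n 1)"
        jobid="${line%% *}"
        features="${line#* }"
        image="$(image_for_features "$features")"
        echo "booting $host with $image for job ${jobid:-unknown}" >&2
        # Step 3: boot the node with the selected image.
        node_boot_image "$host" "$image"
    done
}

# Run only when slurmctld passes a hostlist argument.
if [ "$#" -gt 0 ]; then
    resume_with_images "$1"
fi
```

A production version would also need to handle nodes shared by several jobs and failures of the boot tool.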

Last modified 11 October 2022