Slurm provides an integrated power saving mechanism for powering down idle nodes. Nodes that remain idle for a configurable period of time can be placed in a power saving mode, which can reduce power consumption or fully power down the node. The nodes will be restored to normal operation once work is assigned to them. For example, power saving can be accomplished using a cpufreq governor that can change CPU frequency and voltage (note that the cpufreq driver must be enabled in the Linux kernel configuration). Of particular note, Slurm can power nodes up or down at a configurable rate to prevent rapid changes in power demands. For example, starting a 1000 node job on an idle cluster could result in an instantaneous surge in power demand of multiple megawatts without Slurm's support to increase power demands in a gradual fashion.
A great deal of flexibility is offered in terms of when and how idle nodes are put into or removed from power save mode. Note that the Slurm control daemon, slurmctld, must be restarted to initially enable power saving mode. Changes in the configuration parameters (e.g. SuspendTime) will take effect after modifying the slurm.conf configuration file and executing "scontrol reconfig". The following configuration parameters are available:
Note that SuspendProgram and ResumeProgram execute as SlurmUser on the node where the slurmctld daemon runs (primary and backup server nodes). Use of sudo may be required for SlurmUser to power down and restart nodes. If you need to convert Slurm's hostlist expression into individual node names, the scontrol show hostnames command may prove useful. The commands used to boot or shut down nodes will depend upon your cluster management tools.
Note that SuspendProgram and ResumeProgram are not subject to any time limits. They should perform the required action, ideally verify it (e.g. confirm that the node has booted and started the slurmd daemon, so that it again responds to slurmctld), and then terminate. Long-running programs will be logged by slurmctld, but not aborted.
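The "perform, verify, terminate" pattern can be sketched as a bounded polling loop. This is a minimal illustration and not part of Slurm: the wait_until function, its arguments, and the node_is_up check mentioned afterwards are hypothetical names chosen for this example.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): run a check command repeatedly
# until it succeeds or a deadline passes.
# Usage: wait_until <timeout_seconds> <interval_seconds> <command> [args...]
wait_until() {
    local timeout=$1 interval=$2
    shift 2
    local deadline=$(( SECONDS + timeout ))
    while true; do
        # Success: the check passed, so the action is verified.
        if "$@"; then
            return 0
        fi
        # Give up once the deadline has passed so the program terminates.
        if (( SECONDS >= deadline )); then
            return 1
        fi
        sleep "$interval"
    done
}
```

A ResumeProgram could, for instance, call wait_until 300 10 node_is_up "$host", where node_is_up stands for whatever site-specific check confirms that the node responds again. Because Slurm does not abort long-running programs, bounding the wait inside the script is what ensures that it eventually terminates.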
Also note that the stdout and stderr of the suspend and resume programs are not logged. If logging is desired, it should be added to the scripts.
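One way to add such logging is to have the script append its own messages and the output of the commands it runs to a file. The following is a sketch under stated assumptions: the POWER_SAVE_LOG variable and the log_msg and run_logged functions are hypothetical names, and the default path simply matches the example scripts in this guide.

```shell
#!/bin/bash
# Hypothetical logging setup for a SuspendProgram or ResumeProgram.
# POWER_SAVE_LOG is an assumed variable name; the default path matches the
# example scripts in this guide.
POWER_SAVE_LOG=${POWER_SAVE_LOG:-/var/log/power_save.log}

# Write one timestamped line to the log file.
log_msg() {
    printf '%s %s\n' "$(date '+%Y-%m-%dT%H:%M:%S')" "$*" >>"$POWER_SAVE_LOG"
}

# Run a command, appending its stdout and stderr to the log file and
# recording a non-zero exit status.
run_logged() {
    local rc=0
    log_msg "running: $*"
    "$@" >>"$POWER_SAVE_LOG" 2>&1 || rc=$?
    if (( rc != 0 )); then
        log_msg "failed (exit $rc): $*"
    fi
    return "$rc"
}
```

Inside a suspend script this might be used as run_logged sudo node_shutdown "$host", so that any error text from the shutdown command ends up in the log rather than being discarded.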
#!/bin/bash
# Example SuspendProgram
echo "`date` Suspend invoked $0 $*" >>/var/log/power_save.log
hosts=`scontrol show hostnames $1`
for host in $hosts
do
   sudo node_shutdown $host
done

#!/bin/bash
# Example ResumeProgram
echo "`date` Resume invoked $0 $*" >>/var/log/power_save.log
hosts=`scontrol show hostnames $1`
for host in $hosts
do
   sudo node_startup $host
done
Subject to the various rates, limits and exclusions, the power save code follows this logic:
The slurmctld daemon will periodically (every 10 minutes) log how many nodes are in power save mode using messages of this sort:
[May 02 15:31:25] Power save mode 0 nodes
...
[May 02 15:41:26] Power save mode 10 nodes
...
[May 02 15:51:28] Power save mode 22 nodes
Using these logs you can easily see the effect of Slurm's power saving support. You can also configure Slurm with programs that perform no action as SuspendProgram and ResumeProgram to assess the potential impact of power saving mode before enabling it.
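To follow the trend over time, the counts can be pulled out of the log with a small filter. This is only a sketch: the power_save_counts function is a hypothetical name, and the pattern assumes the bracketed timestamp format shown in the sample messages, which may differ on your system.

```shell
#!/bin/bash
# Hypothetical helper: extract "Power save mode N nodes" entries from a
# slurmctld log read on stdin and print "<timestamp> <count>" per entry.
# Assumes the bracketed timestamp format shown in the sample messages.
power_save_counts() {
    sed -n 's/^\[\([^]]*\)] Power save mode \([0-9][0-9]*\) nodes.*$/\1 \2/p'
}
```

It reads the log on standard input, for example power_save_counts < slurmctld.log, where the actual log file location is site-specific.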
A resource allocation request will be granted as soon as resources are selected for use, possibly before the nodes are all available for use. The launching of job steps will be delayed until the required nodes have been restored to service (a warning about waiting for nodes to become available is printed, and the launch is retried periodically until they are available).
In the case of an sbatch command, the batch program will start when node zero of the allocation is ready for use, and pre-processing can be performed as needed before using srun to launch job steps. The sbatch --wait-all-nodes=<value> option can be used to override this behavior on a per-job basis, and a system-wide default can be set with the SchedulerParameters=sbatch_wait_nodes option.
In the case of the salloc command, once the allocation is made a new shell will be created on the login node. The salloc --wait-all-nodes=<value> option can be used to override this behavior on a per-job basis, and a system-wide default can be set with the SchedulerParameters=salloc_wait_nodes option.
If the slurmctld daemon is terminated gracefully, it will wait up to ten seconds (or the larger of SuspendTimeout and ResumeTimeout, if that is less than ten seconds) for any spawned SuspendProgram or ResumeProgram to terminate before the daemon terminates. If the spawned program does not terminate within that time period, the event will be logged and slurmctld will exit in order to permit another slurmctld daemon to be initiated. Any spawned SuspendProgram or ResumeProgram will continue to run.
When the slurmctld daemon shuts down, any SLURM_RESUME_FILE temporary files are no longer available, even once slurmctld restarts. Therefore, ResumeProgram should use SLURM_RESUME_FILE within ten seconds of starting to guarantee that it still exists.
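A simple way to stay within that window is to copy the file as the very first action of ResumeProgram and work from the copy afterwards. The following is a minimal sketch, assuming nothing about the file's contents; the snapshot_resume_file function is a hypothetical name, not something Slurm provides.

```shell
#!/bin/bash
# Hypothetical helper for a ResumeProgram: copy the file named by
# SLURM_RESUME_FILE to a private temporary file as the first action, so the
# data survives even if slurmctld shuts down and the original disappears.
# Prints the path of the copy on success.
snapshot_resume_file() {
    local src=${SLURM_RESUME_FILE:-}
    # Nothing to copy if the variable is unset or the file is already gone.
    if [[ -z $src || ! -r $src ]]; then
        return 1
    fi
    local copy
    copy=$(mktemp) || return 1
    if ! cp -- "$src" "$copy"; then
        rm -f -- "$copy"
        return 1
    fi
    printf '%s\n' "$copy"
}
```

A script would typically start with resume_data=$(snapshot_resume_file) and remove the copy once it has finished with it.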
If you want ResumeProgram to boot various images according to job specifications, it will need to be a fairly sophisticated program and perform the following actions:
Last modified 11 October 2022