<!--#include virtual="header.txt"-->
<h1>Slurm Power Saving Guide</h1>
<p>Slurm provides an integrated power saving mechanism for powering down
idle nodes.
Nodes that remain idle for a configurable period of time can be placed
in a power saving mode, which can reduce power consumption or fully power down
the node.
The nodes will be restored to normal operation once work is assigned to them.
For example, power saving can be accomplished using a <i>cpufreq</i> governor
that can change CPU frequency and voltage (note that the <i>cpufreq</i> driver
must be enabled in the Linux kernel configuration).
Of particular note, Slurm can power nodes up or down
at a configurable rate to prevent rapid changes in power demands.
For example, starting a 1000 node job on an idle cluster could result
in an instantaneous surge in power demand of multiple megawatts without
Slurm's support to increase power demands in a gradual fashion.</p>
<h2 id="config">Configuration<a class="slurm_link" href="#config"></a></h2>
<p>A great deal of flexibility is offered in terms of when and
how idle nodes are put into or removed from power save mode.
Note that the Slurm control daemon, <i>slurmctld</i>, must be
restarted to initially enable power saving mode.
Changes in the configuration parameters (e.g. <i>SuspendTime</i>)
will take effect after modifying the <i>slurm.conf</i> configuration
file and executing "<i>scontrol reconfig</i>".
The following configuration parameters are available (an example
configuration follows this list):
<ul>
<li><b>SuspendTime</b>:
Nodes become eligible for power saving mode after being idle or down
for this number of seconds.
A negative number disables power saving mode.
The default value is -1 (disabled).</li>
<li><b>SuspendRate</b>:
Maximum number of nodes to be placed into power saving mode
per minute.
A value of zero results in no limits being imposed.
The default value is 60.
Use this to prevent rapid drops in power consumption.</li>
<li><b>ResumeRate</b>:
Maximum number of nodes to be removed from power saving mode
per minute.
A value of zero results in no limits being imposed.
The default value is 300.
Use this to prevent rapid increases in power consumption.</li>
<li><b>SuspendProgram</b>:
Program to be executed to place nodes into power saving mode.
The program executes as <i>SlurmUser</i> (as configured in
<i>slurm.conf</i>).
The argument to the program will be the names of nodes to
be placed into power saving mode (using Slurm's hostlist
expression format).</li>
<li><b>ResumeProgram</b>:
Program to be executed to remove nodes from power saving mode.
The program executes as <i>SlurmUser</i> (as configured in
<i>slurm.conf</i>).
The argument to the program will be the names of nodes to
be removed from power saving mode (using Slurm's hostlist
expression format).
A job to node mapping is available in JSON format by reading the temporary file
specified by the <b>SLURM_RESUME_FILE</b> environment variable.
This file should be used at the beginning of <b>ResumeProgram</b> - see the
<a href=#tolerance>Fault Tolerance</a> section for more details.
This program may use the <i>scontrol show node</i> command
to ensure that a node has booted and the <i>slurmd</i>
daemon started.
If the <i>slurmd</i> daemon fails to respond within the
configured <b>ResumeTimeout</b> value with an updated BootTime,
the node will be placed in a DOWN state and the job requesting
the node will be requeued. If the node isn't actually rebooted
(i.e. when multiple-slurmd is configured) you can start slurmd
with the "-b" option to report the node boot time as now.
<b>NOTE</b>: The <b>SLURM_RESUME_FILE</b> will only exist and be usable if
Slurm was compiled with the <a href=download.html#json>JSON-C</a> serializer
library.
</li>
<li><b>SuspendTimeout</b>:
Maximum time permitted (in seconds) between when a node suspend request
is issued and when the node shutdown is complete.
At that time the node must be ready for a resume request to be issued
as needed for new workload.
The default value is 30 seconds.</li>
<li><b>ResumeTimeout</b>:
Maximum time permitted (in seconds) between when a node resume request
is issued and when the node is actually available for use.
Nodes which fail to respond in this time frame will be marked DOWN and
the jobs scheduled on the node requeued.
Nodes which reboot after this time frame will be marked DOWN with a reason of
"Node unexpectedly rebooted."
The default value is 60 seconds.</li>
<li><b>SuspendExcNodes</b>:
List of nodes to never place in power saving mode.
Use Slurm's hostlist expression format.
By default, no nodes are excluded.</li>
<li><b>SuspendExcParts</b>:
List of partitions with nodes to never place in power saving mode.
Multiple partitions may be specified using a comma separator.
By default, no nodes are excluded.</li>
<li><b>BatchStartTimeout</b>:
Specifies how long to wait after a batch job start request is issued
before we expect the batch job to be running on the compute node.
Depending upon how nodes are returned to service, this value may need to
be increased above its default value of 10 seconds.</li>
</ul></p>
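<p>A minimal <i>slurm.conf</i> sketch tying these parameters together is
shown below. The program paths, node names and partition name are
placeholders; adjust them for your site.</p>
<pre>
# Example power saving settings in slurm.conf
# (program paths, node and partition names are site-specific examples)
SuspendTime=600          # suspend nodes idle for 10 minutes
SuspendRate=60           # suspend at most 60 nodes per minute
ResumeRate=300           # resume at most 300 nodes per minute
SuspendProgram=/usr/local/sbin/slurm_suspend.sh
ResumeProgram=/usr/local/sbin/slurm_resume.sh
SuspendTimeout=30
ResumeTimeout=300
SuspendExcNodes=login[1-2]
SuspendExcParts=debug
BatchStartTimeout=60
</pre>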
<p>Note that <i>SuspendProgram</i> and <i>ResumeProgram</i> execute as
<i>SlurmUser</i> on the node where the <i>slurmctld</i> daemon runs
(primary and backup server nodes).
Use of <i>sudo</i> may be required for <i>SlurmUser</i> to power down
and restart nodes.
If you need to convert Slurm's hostlist expression into individual node
names, the <i>scontrol show hostnames</i> command may prove useful.
The commands used to boot or shut down nodes will depend upon your
cluster management tools.</p>
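<p>One way to grant <i>SlurmUser</i> the needed privileges is a
<i>sudoers</i> entry restricted to the power control commands. A minimal
sketch, assuming <i>SlurmUser</i> is "slurm" and the hypothetical
<i>node_shutdown</i> and <i>node_startup</i> commands used in the example
scripts below:</p>
<pre>
# /etc/sudoers.d/slurm_power_save
# Assumes SlurmUser=slurm; the command paths are site-specific examples
slurm ALL=(root) NOPASSWD: /usr/local/sbin/node_shutdown, /usr/local/sbin/node_startup
</pre>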
<p>Note that <i>SuspendProgram</i> and <i>ResumeProgram</i> are not
subject to any time limits.
They should perform the required action, ideally verify that the action
completed (e.g. that the node has booted and the <i>slurmd</i> daemon has
started, so the node is no longer non-responsive to <i>slurmctld</i>), and terminate.
Long running programs will be logged by <i>slurmctld</i>, but not
aborted.</p>
<p>Also note that the stderr/out of the suspend and resume programs
are not logged. If logging is desired it should be added to the
scripts.</p>
<pre>
#!/bin/bash
# Example SuspendProgram
echo "`date` Suspend invoked $0 $*" >>/var/log/power_save.log
hosts=`scontrol show hostnames $1`
for host in $hosts
do
   sudo node_shutdown $host
done

#!/bin/bash
# Example ResumeProgram
echo "`date` Resume invoked $0 $*" >>/var/log/power_save.log
hosts=`scontrol show hostnames $1`
for host in $hosts
do
   sudo node_startup $host
done
</pre>
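<p>As noted above, <b>ResumeProgram</b> may also use
<i>scontrol show node</i> to verify that a node has booted and its
<i>slurmd</i> daemon has started. A rough sketch of such a check is shown
below; the exact state flags to test and the polling interval depend on
your Slurm version and site, so treat it as an illustration rather than a
drop-in script:</p>
<pre>
#!/bin/bash
# Example ResumeProgram with a simple boot check
# (node_startup and the log path are site-specific examples)
echo "`date` Resume invoked $0 $*" >>/var/log/power_save.log
hosts=`scontrol show hostnames $1`
for host in $hosts
do
   sudo node_startup $host
done
# Poll each node until slurmd responds again (give up after about 5 minutes)
for host in $hosts
do
   for i in `seq 1 60`
   do
      scontrol show node $host | grep -qE 'NOT_RESPONDING|POWERING_UP' || break
      sleep 5
   done
done
</pre>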
<p>Subject to the various rates, limits and exclusions, the power save
code follows this logic:
<ol>
<li>Identify nodes which have been idle for at least <b>SuspendTime</b>.</li>
<li>Execute <b>SuspendProgram</b> with an argument of the idle node names.</li>
<li>Identify the nodes which are in power save mode (a flag in the node's
state field), but have been allocated to jobs.</li>
<li>Execute <b>ResumeProgram</b> with an argument of the allocated node names.</li>
<li>Once the <i>slurmd</i> responds, initiate the job and/or job steps
allocated to it.</li>
<li>If the <i>slurmd</i> fails to respond within the value configured for
<b>ResumeTimeout</b>, the node will be marked DOWN and the job requeued
if possible.</li>
<li>Repeat indefinitely.</li>
</ol></p>
<p>The slurmctld daemon will periodically (every 10 minutes) log how many
nodes are in power save mode using messages of this sort:
<pre>
[May 02 15:31:25] Power save mode 0 nodes
...
[May 02 15:41:26] Power save mode 10 nodes
...
[May 02 15:51:28] Power save mode 22 nodes
</pre>
<p>Using these logs you can easily see the effect of Slurm's power saving
support.
You can also configure Slurm with programs that perform no action as
<b>SuspendProgram</b> and <b>ResumeProgram</b> to assess the potential
impact of power saving mode before enabling it.</p>
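<p>For such an assessment, the "no action" programs can simply record what
they would have done. A minimal sketch (the log path is a placeholder):</p>
<pre>
#!/bin/bash
# Dry-run SuspendProgram/ResumeProgram: log the request, change nothing
echo "`date` $0 called with: $*" >>/var/log/power_save_dryrun.log
exit 0
</pre>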
<h2 id="allocations">Use of Allocations
<a class="slurm_link" href="#allocations"></a>
</h2>
<p>A resource allocation request will be granted as soon as resources
are selected for use, possibly before the nodes are all available
for use.
The launching of job steps will be delayed until the required nodes
have been restored to service (<i>srun</i> prints a warning about waiting
for the nodes to become available and periodically retries until they
are).</p>
<p>In the case of an <i>sbatch</i> command, the batch program will start
when node zero of the allocation is ready for use and pre-processing can
be performed as needed before using <i>srun</i> to launch job steps.
The sbatch <i>--wait-all-nodes=&lt;value&gt;</i> option can be used to
override this behavior on a per-job basis and a system-wide default can be set
with the <i>SchedulerParameters=sbatch_wait_nodes</i> option.
</p>
<p>In the case of the <i>salloc</i> command, once the allocation is made a new
shell will be created on the login node. The salloc
<i>--wait-all-nodes=&lt;value&gt;</i> option can be used to override this
behavior on a per-job basis and a system-wide default can be set with the
<i>SchedulerParameters=salloc_wait_nodes</i> option.
</p>
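<p>For example, a job that should not start its batch script until every
allocated node has powered up could be submitted as shown below (the script
name is a placeholder), and the same behavior can be made the cluster-wide
default in <i>slurm.conf</i>:</p>
<pre>
# Per-job: wait for all nodes in the allocation before starting the batch script
sbatch --wait-all-nodes=1 my_job.sh

# Cluster-wide default in slurm.conf
SchedulerParameters=sbatch_wait_nodes,salloc_wait_nodes
</pre>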
<h2 id="tolerance">Fault Tolerance
<a class="slurm_link" href="#tolerance"></a>
</h2>
<p>If the <i>slurmctld</i> daemon is terminated gracefully, it will
wait up to ten seconds (or the larger of <b>SuspendTimeout</b> and
<b>ResumeTimeout</b>, if that is less than ten seconds) for any spawned
<b>SuspendProgram</b> or <b>ResumeProgram</b> to terminate before the daemon
exits.
If the spawned program does not terminate within that time period,
the event will be logged and <i>slurmctld</i> will exit in order to
permit another <i>slurmctld</i> daemon to be initiated.
Any spawned <b>SuspendProgram</b> or <b>ResumeProgram</b> will continue to run.
</p>
<p>
When the slurmctld daemon shuts down, any <b>SLURM_RESUME_FILE</b>
temporary files are no longer available, even after slurmctld restarts.
Therefore, <b>ResumeProgram</b> should use <b>SLURM_RESUME_FILE</b> within
ten seconds of starting to guarantee that it still exists.
</p>
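<p>A simple way to satisfy this is to snapshot the file as the very first
step of <b>ResumeProgram</b>, before doing anything slow. A minimal sketch
(the destination directory is a placeholder):</p>
<pre>
#!/bin/bash
# First step of ResumeProgram: preserve the job-to-node mapping immediately,
# since the temporary file may disappear if slurmctld shuts down
if [ -n "$SLURM_RESUME_FILE" ]; then
   cp "$SLURM_RESUME_FILE" /var/spool/slurm/resume_$$.json
fi
# ... the rest of the resume logic can consult the saved copy ...
</pre>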
<h2 id="images">Booting Different Images
<a class="slurm_link" href="#images"></a>
</h2>
<p>If you want <b>ResumeProgram</b> to boot various images according to
job specifications, it will need to be a fairly sophisticated program
and perform the following actions:
<ol>
<li>Determine which jobs are associated with the nodes to be booted</li>
<li>Determine which image is required for each job and</li>
<li>Boot the appropriate image for each node</li>
</ol>
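<p>A rough sketch of that flow is shown below. It assumes the required
image can be inferred from each job's requested features and that a
hypothetical <i>boot_node_with_image</i> command performs the actual
provisioning; real implementations will differ considerably:</p>
<pre>
#!/bin/bash
# Example ResumeProgram that selects a boot image per node
# (feature names and boot_node_with_image are site-specific assumptions)
hosts=`scontrol show hostnames $1`
for host in $hosts
do
   # Find jobs allocated to this node and the features they requested
   features=`squeue --nodelist=$host --noheader --format="%f" | sort -u`
   case "$features" in
      *el9*)  image=rocky9  ;;
      *)      image=default ;;
   esac
   sudo boot_node_with_image $host $image
done
</pre>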
<p style="text-align:center;">Last modified 11 October 2022</p>
<!--#include virtual="footer.txt"-->