<!--#include virtual="header.txt"-->

<h1>Slurm Power Saving Guide</h1>

<p>Slurm provides an integrated power saving mechanism for powering down
idle nodes.
Nodes that remain idle for a configurable period of time can be placed
in a power saving mode, which can reduce power consumption or fully power down
the node.
The nodes will be restored to normal operation once work is assigned to them.
For example, power saving can be accomplished using a <i>cpufreq</i> governor
that can change CPU frequency and voltage (note that the <i>cpufreq</i> driver
must be enabled in the Linux kernel configuration).
Of particular note, Slurm can power nodes up or down
at a configurable rate to prevent rapid changes in power demands.
For example, starting a 1000 node job on an idle cluster could result
in an instantaneous surge in power demand of multiple megawatts without
Slurm's support to increase power demands in a gradual fashion.</p>


<h2 id="config">Configuration<a class="slurm_link" href="#config"></a></h2>

<p>A great deal of flexibility is offered in terms of when and
how idle nodes are put into or removed from power save mode.
Note that the Slurm control daemon, <i>slurmctld</i>, must be
restarted to initially enable power saving mode.
Changes in the configuration parameters (e.g. <i>SuspendTime</i>)
will take effect after modifying the <i>slurm.conf</i> configuration
file and executing "<i>scontrol reconfig</i>".
The following configuration parameters are available (a combined example
is shown after this list):
<ul>

<li><b>SuspendTime</b>:
Nodes become eligible for power saving mode after being idle or down
for this number of seconds.
A negative number disables power saving mode.
The default value is -1 (disabled).</li>

<li><b>SuspendRate</b>:
Maximum number of nodes to be placed into power saving mode
per minute.
A value of zero results in no limits being imposed.
The default value is 60.
Use this to prevent rapid drops in power consumption.</li>

<li><b>ResumeRate</b>:
Maximum number of nodes to be removed from power saving mode
per minute.
A value of zero results in no limits being imposed.
The default value is 300.
Use this to prevent rapid increases in power consumption.</li>

<li><b>SuspendProgram</b>:
Program to be executed to place nodes into power saving mode.
The program executes as <i>SlurmUser</i> (as configured in
<i>slurm.conf</i>).
The argument to the program will be the names of nodes to
be placed into power saving mode (using Slurm's hostlist
expression format).</li>

<li><b>ResumeProgram</b>:
Program to be executed to remove nodes from power saving mode.
The program executes as <i>SlurmUser</i> (as configured in
<i>slurm.conf</i>).
The argument to the program will be the names of nodes to
be removed from power saving mode (using Slurm's hostlist
expression format).
A job to node mapping is available in JSON format by reading the temporary file
specified by the <b>SLURM_RESUME_FILE</b> environment variable.
This file should be used at the beginning of <b>ResumeProgram</b> - see the
<a href=#tolerance>Fault Tolerance</a> section for more details.
This program may use the <i>scontrol show node</i> command
to ensure that a node has booted and the <i>slurmd</i>
daemon started.
If the <i>slurmd</i> daemon fails to respond within the
configured <b>ResumeTimeout</b> value with an updated BootTime,
the node will be placed in a DOWN state and the job requesting
the node will be requeued. If the node isn't actually rebooted
(e.g. when multiple-slurmd is configured), you can start slurmd
with the "-b" option to report the node boot time as now.

<b>NOTE</b>: The <b>SLURM_RESUME_FILE</b> will only exist and be usable if
Slurm was compiled with the <a href=download.html#json>JSON-C</a> serializer
library.
</li>

<li><b>SuspendTimeout</b>:
Maximum time permitted (in seconds) between when a node suspend request
is issued and when the node shutdown is complete.
At that time the node must be ready for a resume request to be issued
as needed for new workload.
The default value is 30 seconds.</li>

<li><b>ResumeTimeout</b>:
Maximum time permitted (in seconds) between when a node resume request
is issued and when the node is actually available for use.
Nodes which fail to respond in this time frame will be marked DOWN and
the jobs scheduled on the node requeued.
Nodes which reboot after this time frame will be marked DOWN with a reason of
"Node unexpectedly rebooted."
The default value is 60 seconds.</li>

<li><b>SuspendExcNodes</b>:
List of nodes to never place in power saving mode.
Use Slurm's hostlist expression format.
By default, no nodes are excluded.</li>

<li><b>SuspendExcParts</b>:
List of partitions with nodes to never place in power saving mode.
Multiple partitions may be specified using a comma separator.
By default, no nodes are excluded.</li>

<li><b>BatchStartTimeout</b>:
Specifies how long to wait after a batch job start request is issued
before we expect the batch job to be running on the compute node.
Depending upon how nodes are returned to service, this value may need to
be increased above its default value of 10 seconds.</li>
</ul></p>
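<p>As an illustration only, the parameters above might be combined in
<i>slurm.conf</i> as in the sketch below. The script paths, node names and
timing values are arbitrary and should be adapted to the local site.</p>

<pre>
# Power saving excerpt from slurm.conf (illustrative values)
SuspendTime=300                # node eligible after 5 idle minutes
SuspendRate=10                 # suspend at most 10 nodes per minute
ResumeRate=20                  # resume at most 20 nodes per minute
SuspendProgram=/usr/local/sbin/slurm_suspend.sh
ResumeProgram=/usr/local/sbin/slurm_resume.sh
SuspendTimeout=120
ResumeTimeout=600
SuspendExcNodes=node[001-002]
SuspendExcParts=debug
BatchStartTimeout=60
</pre>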

<p>Note that <i>SuspendProgram</i> and <i>ResumeProgram</i> execute as
<i>SlurmUser</i> on the node where the <i>slurmctld</i> daemon runs
(primary and backup server nodes).
Use of <i>sudo</i> may be required for <i>SlurmUser</i> to power down
and restart nodes.
If you need to convert Slurm's hostlist expression into individual node
names, the <i>scontrol show hostnames</i> command may prove useful.
The commands used to boot or shut down nodes will depend upon your
cluster management tools.</p>
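<p>For example, the hostlist expression passed to these programs can be
expanded to one node name per line (node names shown are illustrative):</p>

<pre>
$ scontrol show hostnames node[10-12]
node10
node11
node12
</pre>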

<p>Note that <i>SuspendProgram</i> and <i>ResumeProgram</i> are not
subject to any time limits.
They should perform the required action, ideally verify the action
(e.g. confirm that the node has booted and the <i>slurmd</i> daemon has
started, so the node is no longer non-responsive to <i>slurmctld</i>),
and terminate.
Long running programs will be logged by <i>slurmctld</i>, but not
aborted.</p>

<p>Also note that the stderr/out of the suspend and resume programs
are not logged.  If logging is desired it should be added to the
scripts.</p>

<pre>
#!/bin/bash
# Example SuspendProgram
echo "`date` Suspend invoked $0 $*" >>/var/log/power_save.log
hosts=`scontrol show hostnames $1`
for host in $hosts
do
   sudo node_shutdown $host
done

#!/bin/bash
# Example ResumeProgram
echo "`date` Resume invoked $0 $*" >>/var/log/power_save.log
hosts=`scontrol show hostnames $1`
for host in $hosts
do
   sudo node_startup $host
done
</pre>
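<p>If Slurm was built with the JSON-C serializer, <b>ResumeProgram</b> can
also preserve the job to node mapping provided through
<b>SLURM_RESUME_FILE</b>. The sketch below copies the temporary file at the
very start of the script, as recommended in the
<a href=#tolerance>Fault Tolerance</a> section; the destination path is an
arbitrary choice for illustration.</p>

<pre>
#!/bin/bash
# Example ResumeProgram that also saves the job to node mapping.
# Copy SLURM_RESUME_FILE immediately; the temporary file disappears
# if slurmctld shuts down.
if [ -n "$SLURM_RESUME_FILE" ]; then
   cp "$SLURM_RESUME_FILE" /var/log/power_save_resume_$$.json
fi
echo "`date` Resume invoked $0 $*" >>/var/log/power_save.log
hosts=`scontrol show hostnames $1`
for host in $hosts
do
   sudo node_startup $host
done
</pre>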

<p>Subject to the various rates, limits and exclusions, the power save
code follows this logic:
<ol>
<li>Identify nodes which have been idle for at least <b>SuspendTime</b>.</li>
<li>Execute <b>SuspendProgram</b> with an argument of the idle node names.</li>
<li>Identify the nodes which are in power save mode (a flag in the node's
state field), but have been allocated to jobs.</li>
<li>Execute <b>ResumeProgram</b> with an argument of the allocated node names.</li>
<li>Once the <i>slurmd</i> responds, initiate the job and/or job steps
allocated to it.</li>
<li>If the <i>slurmd</i> fails to respond within the value configured for
<b>ResumeTimeout</b>, the node will be marked DOWN and the job requeued
if possible.</li>
<li>Repeat indefinitely.</li>
</ol></p>

<p>The slurmctld daemon will periodically (every 10 minutes) log how many
nodes are in power save mode using messages of this sort:
<pre>
[May 02 15:31:25] Power save mode 0 nodes
...
[May 02 15:41:26] Power save mode 10 nodes
...
[May 02 15:51:28] Power save mode 22 nodes
</pre>

<p>Using these logs you can easily see the effect of Slurm's power saving
support.
You can also configure Slurm with programs that perform no action as
<b>SuspendProgram</b> and <b>ResumeProgram</b> to assess the potential
impact of power saving mode before enabling it.</p>
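<p>A minimal no-op script for such an assessment might only record what Slurm
would have done; the log path below is illustrative, and the same script
could be configured as both <b>SuspendProgram</b> and <b>ResumeProgram</b>:</p>

<pre>
#!/bin/bash
# No-op SuspendProgram/ResumeProgram used only to observe behavior
echo "`date` $0 would act on nodes: $1" >>/var/log/power_save_dryrun.log
exit 0
</pre>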

<h2 id="allocations">Use of Allocations
<a class="slurm_link" href="#allocations"></a>
</h2>

<p>A resource allocation request will be granted as soon as resources
are selected for use, possibly before the nodes are all available
for use.
The launching of job steps will be delayed until the required nodes
have been restored to service (<i>srun</i> prints a warning about waiting for
nodes to become available and periodically retries until they are
available).</p>

<p>In the case of an <i>sbatch</i> command, the batch program will start
when node zero of the allocation is ready for use and pre-processing can
be performed as needed before using <i>srun</i> to launch job steps.
The sbatch <i>--wait-all-nodes=&lt;value&gt;</i> option can be used to
override this behavior on a per-job basis and a system-wide default can be set
with the <i>SchedulerParameters=sbatch_wait_nodes</i> option.
</p>
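<p>For example, to submit a batch job whose script should not start until
every allocated node is ready (the script name is illustrative):</p>

<pre>
$ sbatch --wait-all-nodes=1 my_batch_script.sh
</pre>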

<p>In the case of the <i>salloc</i> command, once the allocation is made a new
shell will be created on the login node. The salloc
<i>--wait-all-nodes=&lt;value&gt;</i> option can be used to override this
behavior on a per-job basis and a system-wide default can be set with the
<i>SchedulerParameters=salloc_wait_nodes</i> option.
</p>
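<p>Similarly, an interactive allocation can be made to block until all of its
nodes have been resumed (the node count shown is illustrative):</p>

<pre>
$ salloc -N4 --wait-all-nodes=1
</pre>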

<h2 id="tolerance">Fault Tolerance
<a class="slurm_link" href="#tolerance"></a>
</h2>

<p>If the <i>slurmctld</i> daemon is terminated gracefully, it will
wait up to ten seconds (or the maximum of <b>SuspendTimeout</b> or
<b>ResumeTimeout</b> if less than ten seconds) for any spawned
<b>SuspendProgram</b> or <b>ResumeProgram</b> to terminate before the daemon
terminates.
If the spawned program does not terminate within that time period,
the event will be logged and <i>slurmctld</i> will exit in order to
permit another <i>slurmctld</i> daemon to be initiated.
Any spawned <b>SuspendProgram</b> or <b>ResumeProgram</b> will continue to run.
</p>
<p>
When the slurmctld daemon shuts down, any <b>SLURM_RESUME_FILE</b>
temporary files are no longer available, even once slurmctld restarts.
Therefore, <b>ResumeProgram</b> should use <b>SLURM_RESUME_FILE</b> within
ten seconds of starting to guarantee that it still exists.
</p>

<h2 id="images">Booting Different Images
<a class="slurm_link" href="#images"></a>
</h2>

<p>If you want <b>ResumeProgram</b> to boot various images according to
job specifications, it will need to be a fairly sophisticated program
that performs the following actions (see the sketch after this list):
<ol>
<li>Determine which jobs are associated with the nodes to be booted</li>
<li>Determine which image is required for each job</li>
<li>Boot the appropriate image for each node</li>
</ol>
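<p>A rough sketch of such a <b>ResumeProgram</b> is shown below. Purely for
illustration, it assumes that the required image name is recorded in each
job's comment field and that a site-specific <i>boot_image</i> helper exists
to boot a node with a given image:</p>

<pre>
#!/bin/bash
# Illustrative ResumeProgram that boots a per-job image.
# Assumes the image name is stored in the job's Comment field and that
# a site-provided "boot_image &lt;image&gt; &lt;node&gt;" command performs the boot.
echo "`date` Resume invoked $0 $*" >>/var/log/power_save.log
for host in `scontrol show hostnames $1`
do
   # Find a job allocated to this node and read its Comment field
   jobid=`squeue -h -w $host -o %i | head -n 1`
   image=`scontrol show job $jobid | grep -o 'Comment=[^ ]*' | cut -d= -f2`
   sudo boot_image ${image:-default} $host
done
</pre>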

<p style="text-align:center;">Last modified 11 October 2022</p>

<!--#include virtual="footer.txt"-->