File: power_mgmt.shtml

<!--#include virtual="header.txt"-->

<h1>Slurm Power Management Guide</h1>

<p>Slurm provides an integrated power management system for power capping.
The mode of operation is to take the configured power cap for the system and
distribute it across the compute nodes under Slurm's control.
Initially that power is distributed evenly across all compute nodes.
Slurm then monitors actual power consumption and redistributes power as appropriate.
Specifically, Slurm lowers the power caps on nodes using less than their cap
and redistributes that power across the other nodes.
The thresholds at which a node's power cap is raised or lowered are configurable,
as is the rate at which the power cap may change.
In addition, starting a job on a node immediately resets the node's
power cap to a higher level.
Note that this functionality is distinct from Slurm's ability to
<a href="power_save.html">power down idle nodes</a>.</p>

<h2 id="config">Configuration<a class="slurm_link" href="#config"></a></h2>

<p>The following configuration parameters are available:
<ul>

<li><b>DebugFlags=power</b>:
Enable plugin-specific logging messages.</li>

<li><b>PowerParameters</b>:
Defines power management behavior.
Changes to this value take effect when the Slurm daemons are reconfigured.
Currently valid options are:
<ul>
<li><b>balance_interval=#</b> -
  Specifies the time interval, in seconds, between attempts to balance power
  caps across the nodes.
  This also controls the frequency at which Slurm attempts to collect current
  power consumption data (old data may be used until new data is available from
  the underlying infrastructure; values below 10 seconds are not recommended
  for Cray systems).
  The default value is 30 seconds.
  Supported by the power/cray_aries plugin.</li>
<li><b>capmc_path=/...</b> -
  Specifies the absolute path of the <b>capmc</b> command.
  The default value is "/opt/cray/capmc/default/bin/capmc".
  Supported by the power/cray_aries plugin.</li>
<li><b>cap_watts=#[KW|MW]</b> -
  Specifies the total power limit to be established across all compute nodes
  managed by Slurm.
  A value of 0 sets every compute node to have an unlimited cap.
  The default value is 0.
  Supported by the power/cray_aries plugin.</li>
<li><b>decrease_rate=#</b> -
  Specifies the maximum rate of change in the power cap for a node where the
  actual power usage is below the power cap by an amount greater than
  lower_threshold (see below).
  Value represents a percentage of the difference between a node's minimum and
  maximum power consumption.
  The default value is 50 percent.
  Supported by the power/cray_aries plugin.</li>
<li><b>increase_rate=#</b> -
  Specifies the maximum rate of change in the power cap for a node where the
  actual power usage is within upper_threshold (see below) of the power cap.
  Value represents a percentage of the difference between a node's minimum and
  maximum power consumption.
  The default value is 20 percent.
  Supported by the power/cray_aries plugin.</li>
<li><b>job_level</b> -
  All compute nodes associated with every job will be assigned the same power
  cap.
  Nodes shared by multiple jobs will have a power cap different from other
  nodes allocated to the individual jobs.
  By default, this is configurable by the user for each job.</li>
<li><b>job_no_level</b> -
  Power caps are established independently for each compute node.
  This disables the "--power=level" option available in the job submission
  commands.
  By default, this is configurable by the user for each job.</li>
<li><b>lower_threshold=#</b> -
  Specify a lower power consumption threshold.
  If a node's current power consumption is below this percentage of its current
  cap, then its power cap will be reduced.
  The default value is 90 percent.
  Supported by the power/cray_aries plugin.</li>
<li><b>recent_job=#</b> -
  If a job has started or resumed execution (from suspend) on a compute node
  within this number of seconds from the current time, the node's power cap will
  be increased to the maximum.
  The default value is 300 seconds.
  Supported by the power/cray_aries plugin.</li>
<li><b>set_watts=#</b> -
  Specifies the power limit to be set on every compute node managed by Slurm.
  Every node gets this same power cap and there is no variation over time
  based upon the actual power usage of the node.
  Supported by the power/cray_aries plugin.</li>
<li><b>upper_threshold=#</b> -
  Specify an upper power consumption threshold.
  If a node's current power consumption is above this percentage of its current
  cap, then its power cap will be increased to the extent possible.
  A node's power cap will also be increased if a job is newly started on it.
  The default value is 95 percent.
  Supported by the power/cray_aries plugin.</li>
</ul></li>

<li><b>PowerPlugin</b>:
Identifies the plugin used to manage system power consumption.
Changes to this value require restarting Slurm daemons to take effect.
By default, no power plugin is loaded.
Currently valid options are:
<ul>
<li><b>power/cray_aries</b> -
   Used for Cray XC systems with power monitoring and management
   functionality included as part of System Management Workstation (SMW)
   7.0.UP03.</li>
<li><b>power/none</b> - No power management support.</li>
</ul></li>
</ul>
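
<p>Putting these parameters together, a minimal slurm.conf excerpt enabling
power capping might look like the following. The values shown are purely
illustrative (they match the worked example later on this page) and should be
adapted to the local system.</p>
<pre>
# Illustrative slurm.conf excerpt
DebugFlags=power
PowerPlugin=power/cray_aries
PowerParameters=balance_interval=60,cap_watts=1800,decrease_rate=30,increase_rate=10,lower_threshold=90,upper_threshold=98
</pre>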

<p><b>Note for Cray systems:</b> The JSON-C library must be installed in order
to build Slurm's power/cray_aries plugin, which must parse JSON format data.
See Slurm's <a href="download.html#json">JSON installation information</a>
for details.</p>

<p><b>Note for Cray systems:</b> Power management is provided for native
Slurm configurations (i.e. without the ALPS resource manager).</p>

<p><b>Note for Cray systems:</b> Use of the capmc command requires either 
specifying its absolute path ("/opt/cray/capmc/default/bin/capmc" by default)
or loading the capmc module:</p>
<pre>
$ module load capmc
</pre>

<h2 id="commands">User and System Administrator Commands
<a class="slurm_link" href="#commands"></a>
</h2>

<p>Equal power caps for all nodes allocated to a job can be requested at job
submission time by using the "--power=level" option with the salloc, sbatch
or srun command.
The system administrator can override the user option with the PowerParameters
configuration parameter and the job_level or job_no_level option.</p>
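
<p>For example, level power caps across a job's allocated nodes might be
requested as follows (the node count and job script name are placeholders):</p>
<pre>
$ sbatch --power=level -N4 my_job.sh
</pre>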

<p>A specific minimum and maximum CPU frequency, as well as a CPU governor, may
be requested at job submission time using the "--cpu-freq" option with the salloc,
sbatch or srun command. The frequency requested may be "low", "medium",
"highm1" (second highest available frequency), "high" or a specific frequency
(expressed as a kHz value). The governor specification may be "conservative",
"ondemand", "performance" or "powersave". These values are user requests
subject to system constraints. Some examples follow.</p>
<pre>
$ sbatch --cpu-freq=2400000-3000000 ...
$ salloc --cpu-freq=powersave ...
$ srun --cpu-freq=highm1 ...
</pre>

<p>The power consumption and power cap data are available for all compute nodes
using either the "scontrol show node" command or sview.
Information available includes "CurrentWatts" and "CapWatts".</p>
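
<p>For example, the wattage fields for a single node can be extracted as shown
below (the node name "tux001" is a placeholder):</p>
<pre>
$ scontrol show node tux001 | grep -i watts
</pre>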

<h2 id="example">Example<a class="slurm_link" href="#example"></a></h2>

<h3 id="initial">Initial State<a class="slurm_link" href="#initial"></a></h3>
<p>In our example, assume the following configuration:
a 10 node compute cluster, where each node has a minimum power consumption of 100 watts
and a maximum power consumption of 200 watts.
PowerParameters is configured with the following values:
balance_interval=60,
cap_watts=1800,
decrease_rate=30, increase_rate=10,
lower_threshold=90, upper_threshold=98.
The initial state is simply based upon cap_watts divided by the number of
compute nodes: 1800 watts / 10 nodes = 180 watts per node.</p>

<h3 id="60">State in 60 Seconds<a class="slurm_link" href="#60"></a></h3>
<p>The power consumption is then examined balance_interval (60) seconds later.
Assume that one of those nodes is consuming 110 watts and the others are
using 180 watts.
First we identify which nodes are consuming less than their lower_threshold
of the power cap: 90% x 180 watts = 162 watts.
One node falls in this category with 110 watts of power consumption.
Its power cap is reduced by half of the difference between its current
power cap and power consumption ((180 watts - 110 watts) / 2 = 35 watts),
limited to at most decrease_rate, a percentage of the difference between the
node's maximum and minimum power consumption
((200 watts - 100 watts) x 30% = 30 watts); the smaller reduction applies.
So that node's power cap is reduced from 180 watts to 150 watts.
Ignoring the upper_threshold parameter for now, we now have 1650 watts available
to distribute to the remaining 9 compute nodes, or 183 watts per node
(1650 watts / 9 nodes = 183 watts per node).</p>
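
<p>The arithmetic of this rebalancing step can be checked with simple shell
arithmetic (illustrative only; this is not how Slurm performs the calculation
internally):</p>
<pre>
$ echo $(( (180 - 110) / 2 ))         # half the cap/usage gap: 35 watts
$ echo $(( (200 - 100) * 30 / 100 ))  # decrease_rate limit: 30 watts (smaller value applies)
$ echo $(( (1800 - 150) / 9 ))        # budget for the remaining 9 nodes: 183 watts each
</pre>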

<h3 id="120">State in 120 Seconds<a class="slurm_link" href="#120"></a></h3>
<p>The power consumption is then examined balance_interval (60) seconds later.
Assume that one of those nodes is still consuming 110 watts, a second node is
consuming 115 watts and the other eight are using 183 watts.
First we identify which nodes are consuming less than their lower_threshold.
Our node using 110 watts has its cap reduced by half of the difference between
its current power cap and power consumption
((150 watts - 110 watts) / 2 = 20 watts);
so that node's power cap is reduced from 150 watts to 130 watts.
The node consuming 115 watts has its power cap reduced by 30 watts based on
decrease_rate; so that node's power cap is reduced from 183 watts to 153 watts.
That leaves 1517 watts (1800 watts - 130 watts - 153 watts = 1517 watts) to
be distributed over 8 nodes or 189 watts per node.</p>

<h3 id="180">State in 180 Seconds<a class="slurm_link" href="#180"></a></h3>
<p>The power consumption is then examined balance_interval (60) seconds later.
Assume the node previously consuming 110 watts is now consuming 128 watts.
Since that is over upper_threshold of its power cap
(98% x 130 watts = 127 watts), its power cap is increased by increase_rate
((200 watts - 100 watts) x 10% = 10 watts), so its power cap goes from
130 watts to 140 watts.
Assume the node previously consuming 115 watts has been allocated a new job.
This causes the node to be assigned the same power cap as the nodes previously
running at their power cap.
Therefore we have 1660 watts available (1800 watts - 140 watts = 1660 watts)
to be distributed over 9 nodes or 184 watts per node.</p>
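
<p>Again, the numbers can be verified with shell arithmetic (illustrative only):</p>
<pre>
$ echo $(( (200 - 100) * 10 / 100 ))  # increase_rate step: 10 watts (cap goes from 130 to 140)
$ echo $(( (1800 - 140) / 9 ))        # budget for the other 9 nodes: 184 watts each
</pre>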

<h2 id="notes">Notes<a class="slurm_link" href="#notes"></a></h2>
<ul>
<li>Slurm's power management plugin can be used in conjunction with the
  <a href="power_save.html">power save mode</a>, where idle nodes are powered
  down and then powered back up as needed. On a Cray system, set each node's
  power cap to the minimum value before powering it down. Also set the default
  power cap of each node to the minimum value as that will be used at power up
  time.</li>
<li>Cray permits independent power capping for accelerators (GPUs or MICs),
  which is not currently used by Slurm.</li>
<li>Current default values for configuration parameters should probably be
  changed once we have a better understanding of the algorithm's behavior.</li>
<li>No integration of this logic with gang scheduling currently exists.
  It is not clear that such a configuration is practical to support, as gang
  scheduling time slices will typically be smaller than the power management
  balance_interval and synchronizing changes may be difficult.</li>
<li>There can be situations where the capmc program gets stuck for some reason and
  the node remains in the IDLE*+POWER state until ResumeTimeout is reached, even
  though it has been rebooted or manually cleaned.
  In this situation the node can be brought back into service by issuing
  'scontrol update nodename=xxx state=power_down', which will cancel the
  previous power_up request. Then the capmc program must be diagnosed and fixed.
</li>
</ul>

<p style="text-align:center;">Last modified 7 Mar 2018</p>

<!--#include virtual="footer.txt"-->