<!--#include virtual="header.txt"-->
<h1>Slurm Power Management Guide</h1>
<p>Slurm provides an integrated power management system for power capping.
The mode of operation is to take the configured power cap for the system and
distribute it across the compute nodes under Slurm control.
Initially that power is distributed evenly across all compute nodes.
Slurm then monitors actual power consumption and redistributes power as appropriate.
Specifically, Slurm lowers the power caps on nodes using less than their cap
and redistributes that power across the other nodes.
The thresholds at which a node's power cap is raised or lowered are
configurable, as are the rates at which the power cap may change.
In addition, starting a job on a node immediately triggers resetting the node's
power cap to a higher level.
Note this functionality is distinct from Slurm's ability to
<a href="power_save.html">power down idle nodes</a>.</p>
<h2 id="config">Configuration<a class="slurm_link" href="#config"></a></h2>
<p>The following configuration parameters are available (a sample slurm.conf
fragment follows this list):
<ul>
<li><b>DebugFlags=power</b>:
Enable plugin-specific logging messages.</li>
<li><b>PowerParameters</b>:
Defines power management behavior.
Changes to this value take effect when the Slurm daemons are reconfigured.
Currently valid options are:
<ul>
<li><b>balance_interval=#</b> -
Specifies the time interval, in seconds, between attempts to balance power
caps across the nodes.
This also controls the frequency at which Slurm attempts to collect current
power consumption data (old data may be used until new data is available from
the underlying infrastructure; values below 10 seconds are not recommended
for Cray systems).
The default value is 30 seconds.
Supported by the power/cray_aries plugin.</li>
<li><b>capmc_path=/...</b> -
Specifies the absolute path of the <b>capmc</b> command.
The default value is "/opt/cray/capmc/default/bin/capmc".
Supported by the power/cray_aries plugin.</li>
<li><b>cap_watts=#[KW|MW]</b> -
Specifies the total power limit to be established across all compute nodes
managed by Slurm.
A value of 0 sets every compute node to have an unlimited cap.
The default value is 0.
Supported by the power/cray_aries plugin.</li>
<li><b>decrease_rate=#</b> -
Specifies the maximum rate of change in the power cap for a node whose
actual power usage is below lower_threshold (see below) of its power cap.
Value represents a percentage of the difference between a node's minimum and
maximum power consumption.
The default value is 50 percent.
Supported by the power/cray_aries plugin.</li>
<li><b>increase_rate=#</b> -
Specifies the maximum rate of change in the power cap for a node whose
actual power usage is above upper_threshold (see below) of its power cap.
Value represents a percentage of the difference between a node's minimum and
maximum power consumption.
The default value is 20 percent.
Supported by the power/cray_aries plugin.</li>
<li><b>job_level</b> -
All compute nodes associated with every job will be assigned the same power
cap.
Nodes shared by multiple jobs will have a power cap different from other
nodes allocated to the individual jobs.
By default, this is configurable by the user for each job.</li>
<li><b>job_no_level</b> -
Power caps are established independently for each compute node.
This disables the "--power=level" option available in the job submission
commands.
By default, this is configurable by the user for each job.</li>
<li><b>lower_threshold=#</b> -
Specify a lower power consumption threshold.
If a node's current power consumption is below this percentage of its current
cap, then its power cap will be reduced.
The default value is 90 percent.
Supported by the power/cray_aries plugin.</li>
<li><b>recent_job=#</b> -
If a job has started or resumed execution (from suspend) on a compute node
within this number of seconds from the current time, the node's power cap will
be increased to the maximum.
The default value is 300 seconds.
Supported by the power/cray_aries plugin.</li>
<li><b>set_watts=#</b> -
Specifies the power limit to be set on every compute node managed by Slurm.
Every node gets this same power cap and there is no variation through time
based upon the node's actual power usage.
Supported by the power/cray_aries plugin.</li>
<li><b>upper_threshold=#</b> -
Specify an upper power consumption threshold.
If a node's current power consumption is above this percentage of its current
cap, then its power cap will be increased to the extent possible.
A node's power cap will also be increased if a job is newly started on it.
The default value is 95 percent.
Supported by the power/cray_aries plugin.</li>
</ul></li>
<li><b>PowerPlugin</b>:
Identifies the plugin used to manage system power consumption.
Changes to this value require restarting Slurm daemons to take effect.
By default, no power plugin is loaded.
Currently valid options are:
<ul>
<li><b>power/cray_aries</b> -
Used for Cray XC systems with power monitoring and management
functionality included as part of System Management Workstation (SMW)
7.0.UP03.</li>
<li><b>power/none</b> - No power management support.</li>
</ul></li>
</ul>
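<p>As an illustration of how these options fit together, a slurm.conf fragment
along the following lines might be used. The values shown here are illustrative
only (they match the worked example later on this page) and should be adjusted
for the site:</p>
<pre>
# Illustrative slurm.conf fragment for power capping on a Cray system
PowerPlugin=power/cray_aries
PowerParameters=balance_interval=60,cap_watts=1800,decrease_rate=30,increase_rate=10,lower_threshold=90,upper_threshold=98,capmc_path=/opt/cray/capmc/default/bin/capmc
DebugFlags=power
</pre>
<p>As noted above, changes to PowerParameters take effect when the Slurm
daemons are reconfigured, while a change to PowerPlugin requires restarting
the daemons.</p>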
<p><b>Note for Cray systems:</b> The JSON-C library must be installed in order
to build Slurm's power/cray_aries plugin, which must parse JSON format data.
See Slurm's <a href="download.html#json">JSON installation information</a>
for details.</p>
<p><b>Note for Cray systems:</b> Power management is provided for native
Slurm configurations (i.e. without the ALPS resource manager).</p>
<p><b>Note for Cray systems:</b> Use of the capmc command requires either
specifying its absolute path ("/opt/cray/capmc/default/bin/capmc" by default)
or loading the capmc module:</p>
<pre>
$ module load capmc
</pre>
<h2 id="commands">User and System Administrator Commands
<a class="slurm_link" href="#commands"></a>
</h2>
<p>Equal power caps for all nodes allocated to a job can be requested at job
submission time by using the "--power=level" option with the salloc, sbatch
or srun command.
The system administrator can override the user option with the PowerParameters
configuration parameter and the job_level or job_no_level option.</p>
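<p>For example, a uniform per-node power cap for a job can be requested as
follows:</p>
<pre>
$ sbatch --power=level ...
$ srun --power=level ...
</pre>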
<p>A specific minimum and maximum CPU frequency, as well as a CPU governor, may
be requested at job submission time using the "--cpu-freq" option with the
salloc, sbatch or srun command. The frequency requested may be "low", "medium",
"highm1" (second highest available frequency), "high" or a specific frequency
(expressed as a KHz value). The governor specification may be "conservative",
"ondemand", "performance" or "powersave". These values are user requests
subject to system constraints. Some examples follow.</p>
<pre>
$ sbatch --cpu-freq=2400000-3000000 ...
$ salloc --cpu-freq=powersave ...
$ srun --cpu-freq=highm1 ...
</pre>
<p>The power consumption and power cap data are available for all compute nodes
using either the "scontrol show node" or sview commands.
Information available includes "CurrentWatts" and "CapWatts".</p>
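<p>For example, assuming a node named nid00001 (a placeholder name), the power
values for that node can be checked from the command line; the matching output
should include the CurrentWatts and CapWatts values:</p>
<pre>
$ scontrol show node nid00001 | grep -i watts
</pre>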
<h2 id="example">Example<a class="slurm_link" href="#example"></a></h2>
<h3 id="initial">Initial State<a class="slurm_link" href="#initial"></a></h3>
<p>In our example, assume the following configuration:
a 10 compute node cluster, where each node has a minimum power consumption of
100 watts and a maximum power consumption of 200 watts.
PowerParameters is configured with the following values:
balance_interval=60,
cap_watts=1800,
decrease_rate=30, increase_rate=10,
lower_threshold=90, upper_threshold=98.
The initial state is simply based upon the cap_watts divided by the number of
compute nodes: 1800 watts / 10 nodes = 180 watts per node.</p>
<h3 id="60">State in 60 Seconds<a class="slurm_link" href="#60"></a></h3>
<p>The power consumption is then examined balance_interval (60) seconds later.
Assume that one of those nodes is consuming 110 watts and the others are
using 180 watts.
First we identify which nodes are consuming less than their lower_threshold
of the power cap: 90% x 180 watts = 162 watts.
One node falls in this category with 110 watts of power consumption.
Its power cap is reduced by the smaller of half the difference between its
current power cap and power consumption
((180 watts - 110 watts) / 2 = 35 watts) or
decrease_rate, which is a percentage of the difference between the node's
maximum and minimum power consumption ((200 watts - 100 watts) x 30% = 30 watts).
So that node's power cap is reduced from 180 watts to 150 watts.
Ignoring the upper_threshold parameter for now, we now have 1650 watts
(1800 watts - 150 watts) available to distribute across the remaining 9 compute
nodes, or 183 watts per node (1650 watts / 9 nodes = 183 watts per node).</p>
<h3 id="120">State in 120 Seconds<a class="slurm_link" href="#120"></a></h3>
<p>The power consumption is then examined balance_interval (60) seconds later.
Assume that one of those nodes is still consuming 110 watts, a second node is
consuming 115 watts and the other eight are using 183 watts.
First we identify which nodes are consuming less than their lower_threshold
of the power cap.
Our node using 110 watts has its cap reduced by half the difference between
its current power cap and power consumption
((150 watts - 110 watts) / 2 = 20 watts), since that is smaller than
decrease_rate (30 watts); so that node's power cap is reduced from 150 watts
to 130 watts.
The node consuming 115 watts has its power cap reduced by 30 watts based on
decrease_rate; so that node's power cap is reduced from 183 watts to 153 watts.
That leaves 1517 watts (1800 watts - 130 watts - 153 watts = 1517 watts) to
be distributed over the remaining 8 nodes, or 189 watts per node.</p>
<h3 id="180">State in 180 Seconds<a class="slurm_link" href="#180"></a></h3>
<p>The power consumption is then examined balance_interval (60) seconds later.
Assume the node previously consuming 110 watts is now consuming 128 watts.
Since that is over upper_threshold of its power cap
(98% x 130 watts = 127 watts), its power cap is increased by increase_rate
((200 watts - 100 watts) x 10% = 10 watts), so its power cap goes from
130 watts to 140 watts.
Assume the node previously consuming 115 watts has been allocated a new job.
This triggers the node to be allocated the same power cap as nodes previously
running at their power cap.
Therefore we have 1660 watts available (1800 watts - 140 watts = 1660 watts)
to be distributed over 9 nodes or 184 watts per node.</p>
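<p>For reference, the per-node power caps from this walkthrough evolve as
follows (all values in watts, taken from the steps above; totals fall slightly
below cap_watts because per-node caps are rounded down):</p>
<pre>
Time (seconds)  Power caps
      0         10 nodes at 180                    (1800 watts total)
     60         1 node at 150, 9 nodes at 183      (1797 watts total)
    120         130, 153, plus 8 nodes at 189      (1795 watts total)
    180         140, plus 9 nodes at 184           (1796 watts total)
</pre>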
<h2 id="notes">Notes<a class="slurm_link" href="#notes"></a></h2>
<ul>
<li>Slurm's power management plugin can be used in conjunction with the
<a href="power_save.html">power save mode</a>, where idle nodes are powered
down and then powered back up as needed. On a Cray system, set each node's
power cap to the minimum value before powering it down. Also set the default
power cap of each node to the minimum value as that will be used at power up
time.</li>
<li>Cray permits independent power capping for accelerators (GPUs or MICs),
which is not currently used by Slurm.</li>
<li>Current default values for configuration parameters should probably be
changed once we have a better understanding of the algorithm's behavior.</li>
<li>No integration of this logic with gang scheduling currently exists.
It is not clear that such a configuration is practical to support, as gang
scheduling time slices will typically be smaller than the power management
balance_interval and synchronizing the changes may be difficult.</li>
<li>There can be situations where the capmc program gets stuck for some reason
and the node remains in the IDLE*+POWER state until ResumeTimeout is reached,
even though it has been rebooted or manually cleaned.
In this situation the node can be brought back into service by issuing
'scontrol update nodename=xxx state=power_down', which will cancel the
previous power_up request (see the example below). The capmc program must then
be diagnosed and fixed.
</li>
</ul>
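<p>For example, assuming a stuck node named nid00012 (a placeholder name), the
stale power_up request can be cancelled with:</p>
<pre>
$ scontrol update nodename=nid00012 state=power_down
</pre>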
<p style="text-align:center;">Last modified 7 Mar 2018</p>
<!--#include virtual="footer.txt"-->