<!--#include virtual="header.txt"-->

<h1>Control Group in Slurm</h1>

<h2 id="contents">Contents
<a class="slurm_link" href="#contents"></a>
</h2>

<ul>
<li><a href="#overview">Control Group Overview</a></li>
<li><a href="#cgroup_design">Slurm cgroup plugins design</a></li>
<li><a href="#use">Use of cgroup in Slurm</a></li>
<li><a href="#configuration">Slurm Cgroup Configuration Overview</a></li>
<li><a href="#Plugins">Currently Available Cgroup Plugins</a>
    <ul>
        <li><a href="#proctrack">proctrack/cgroup plugin</a></li>
        <li><a href="#task">task/cgroup plugin</a></li>
        <li><a href="#jobacct_gather">jobacct_gather/cgroup plugin</a></li>
    </ul>
</li>
<li><a href="#Specialization">Use of cgroup for Resource Specialization</a></li>
<li><a href="#cgroupplugins">Slurm cgroup plugins</a>
    <ul>
        <li><a href="#differences">Main differences between cgroup/v1 and cgroup/v2</a></li>
        <li><a href="#interfaces">Main differences between controller interfaces</a></li>
        <li><a href="#generalities">Other generalities</a></li>
    </ul>
</li>
</ul>

<h2 id="overview">Control Group Overview
<a class="slurm_link" href="#overview"></a>
</h2>
<p>Control Group is a mechanism provided by the kernel to organize processes
hierarchically and distribute system resources along the hierarchy in a
controlled and configurable manner. Slurm can make use of cgroups to constrain
different resources to jobs, steps and tasks, and to gather accounting
information about these resources.</p>

<p>A cgroup provides different controllers (formerly "subsystems") for different
resources. Slurm plugins can use several of these controllers, e.g.: <i>memory,
cpu, devices, freezer, cpuset, cpuacct</i>. Each enabled controller
gives the ability to constrain resources to a set of processes. If one
controller is not available on the system, then Slurm cannot constrain the
associated resources through a cgroup.</p>

<p>"cgroup" stands for "control group" and is never capitalized. The singular
form is used to designate the whole feature and also as a qualifier as in
"cgroup controllers". When explicitly referring to multiple individual control
groups, the plural form "cgroups" is used.</p>

<p>Slurm supports two cgroup modes, Legacy mode (cgroup v1) and Unified mode
(cgroup v2). Hybrid mode, where controllers from both version 1 and version 2
are mixed on a system, is not supported.</p>
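<p>A quick way to check which mode a system is running is to look at the
filesystem type of the cgroup mount point. This is a sketch; the exact output
depends on the distribution:</p>

<pre>
$ stat -fc %T /sys/fs/cgroup/
cgroup2fs
</pre>

<p>A result of <i>cgroup2fs</i> indicates Unified mode (cgroup v2), while
<i>tmpfs</i> indicates a Legacy mode (cgroup v1) setup with per-controller
mounts below it.</p>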

<p>See the kernel.org documentation for a more comprehensive description of
cgroup:</p>

<ul>
<li><a href="https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt">
Kernel's Cgroup v1 documentation</a>
</li>
<li><a href="https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html">
Kernel's Cgroup v2 documentation </a>
</li>
</ul>

<h2 id="cgroup_design">Slurm cgroup plugins design
<a class="slurm_link" href="#cgroup_design"></a>
</h2>

For extended information on Slurm's internal cgroup plugins, read:
<ul>
<li>cgroup/v1 plugin documentation (not available yet)</li>
<li><a href="cgroup_v2.html">cgroup/v2 plugin documentation</a> </li>
</ul>

<h2 id="use">Use of cgroup in Slurm <a class="slurm_link" href="#use"></a> </h2>
<p>Slurm provides cgroup versions of a number of plugins.</p>
<ul>
<li>proctrack/cgroup (for process tracking and management)</li>
<li>task/cgroup (for constraining resources at step and task level)</li>
<li>jobacct_gather/cgroup (for gathering statistics)</li>
</ul>

<p>cgroups can also be used for resource specialization (constraining daemons to
cores or memory).</p>

<h2 id="configuration">Slurm Cgroup Configuration Overview
<a class="slurm_link" href="#configuration"></a>
</h2>

<p>There are several sets of configuration options for Slurm cgroups:</p>

<ul>
<li><a href="slurm.conf.html">slurm.conf</a> provides options to enable the
cgroup plugins. Each plugin may be enabled or disabled independently of the
others.
</li>
<li><a href="cgroup.conf.html">cgroup.conf</a> provides general options that are
common to all cgroup plugins, plus additional options that apply only to
specific plugins.
</li>
<li>System-level resource specialization is enabled using node configuration
parameters.
</li>
</ul>
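<p>Taken together, a minimal sketch of the slurm.conf lines enabling the three
cgroup plugins described below could look like this (each plugin can be
enabled independently):</p>

<pre>
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
JobacctGatherType=jobacct_gather/cgroup
</pre>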

<h2 id="Plugins">Currently Available Cgroup Plugins
<a class="slurm_link" href="#Plugins"></a>
</h2>

<h3 id="proctrack">proctrack/cgroup plugin
<a class="slurm_link" href="#proctrack"></a>
</h3>

<p>The proctrack/cgroup plugin is an alternative to other proctrack plugins such
as proctrack/linux for process tracking and suspend/resume capability.
</p>

<p>
proctrack/cgroup uses the freezer controller to keep track of all the pids of
a job. It stores the pids in a specific hierarchy in the cgroup tree and takes
care of signaling these pids when instructed. For example, if a user decides
to cancel a job, Slurm will execute this order internally by calling the
proctrack plugin and asking it to send a SIGTERM to the job. Since proctrack
maintains a hierarchy of all Slurm-related pids in cgroup, it knows exactly
which ones need to be signaled.
<br>
Proctrack can also respond to queries for the list of all the pids of a job or
a step.
<br>
In contrast with proctrack/linux, pids are stored by the kernel in a single
file (cgroup.procs) per cgroup, which is read by the plugin to get all the
pids of a part of the hierarchy. For example, when using proctrack/cgroup, a
single step has its own cgroup.procs file, so getting the pids of the step is
instantaneous. With proctrack/linux, /proc must instead be read recursively to
find all the descendants of a parent pid.
</p>
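<p>As a sketch, assuming the cgroup/v2 hierarchy shown later on this page (the
job and step ids here are hypothetical, and the exact level at which pids
appear depends on the configuration), the pids of a step can be listed with a
single read:</p>

<pre>
$ cat /sys/fs/cgroup/system.slice/slurmstepd.scope/job_1234/step_0/cgroup.procs
58231
58237
</pre>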

<p>To enable this plugin, configure the following option in slurm.conf:
<pre>ProctrackType=proctrack/cgroup</pre>
</p>

<p>There are no specific options for this plugin in cgroup.conf, but the general
options apply. See the <a href="cgroup.conf.html">cgroup.conf</a> man page for
details.</p>

<h3 id="task">task/cgroup plugin<a class="slurm_link" href="#task"></a></h3>

<p>The task/cgroup plugin allows constraining resources to a job, a step, or a
task. It is the only plugin that can ensure that the boundaries of an
allocation are not violated.
The jobacct_gather/linux plugin offers only a very simplistic mechanism for
constraining memory to a job, but it is not reliable (there is a window of
time in which a job can exceed its limits) and is intended only for the very
rare systems where cgroup is not available.</p>

<p>task/cgroup provides the following features:</p>

<ul>
<li>Confine jobs and steps to their allocated cpuset.</li>
<li>Confine jobs and steps to specific memory resources.</li>
<li>Confine jobs, steps and tasks to their allocated gres, including gpus.</li>
</ul>

<p>The task/cgroup plugin uses the cpuset, memory and devices controllers.</p>

<p>To enable this plugin, add <i>task/cgroup</i> to the TaskPlugin configuration
parameter in slurm.conf:</p>

<pre>TaskPlugin=task/cgroup</pre>

<p>There are many specific options for this plugin in cgroup.conf. The general
options also apply. See the <a href="cgroup.conf.html">cgroup.conf</a> man page
for details.</p>

<p>This plugin can be stacked with other task plugins, for example with
<i>task/affinity</i>. This allows it to constrain resources to a job while
also gaining the benefits of the affinity plugin (the order in which they are
listed does not matter):</p>

<pre>TaskPlugin=task/cgroup,task/affinity</pre>
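<p>A quick way to observe the confinement, as a sketch (the job id and the
exact hierarchy are illustrative and depend on the cgroup version in use):</p>

<pre>
$ srun -n1 cat /proc/self/cgroup
0::/system.slice/slurmstepd.scope/job_1234/step_0/user/task_0
</pre>

<p>On a cgroup/v2 system, every task of the step is placed in its own leaf of
the unified hierarchy, and the limits configured by task/cgroup apply to that
subtree.</p>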

<h3 id="jobacct_gather">jobacct_gather/cgroup plugin
<a class="slurm_link" href="#jobacct_gather"></a>
</h3>

<p>
The <i>jobacct_gather/cgroup</i> plugin is an alternative to the
<i>jobacct_gather/linux</i> plugin for the collection of accounting statistics
for jobs, steps and tasks.
<br>
<i>jobacct_gather/cgroup</i> uses the cpuacct and memory cgroup controllers.
</p>

<p>The cpu and memory statistics collected by this plugin do not represent the
same resources as the cpu and memory statistics collected by
<i>jobacct_gather/linux</i>. While the cgroup plugin just reads cgroup stat
files (such as memory.stat) containing the information for the entire subtree
of pids, the linux plugin gets information from /proc/pid/stat for every pid
and then does the calculations itself, making it a bit less efficient (though
not noticeably so in practice) than the cgroup one.</p>
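<p>As an illustrative sketch (paths and pids are hypothetical), the difference
amounts to reading one stat file per step versus one stat file per pid:</p>

<pre>
# cgroup plugin: one file covers the whole subtree of the step
$ cat /sys/fs/cgroup/system.slice/slurmstepd.scope/job_1234/step_0/memory.stat

# linux plugin: every pid of the step must be visited separately
$ cat /proc/58231/stat
$ cat /proc/58237/stat
</pre>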

<p>To enable this plugin, configure the following option in slurm.conf:
<pre>JobacctGatherType=jobacct_gather/cgroup</pre>
</p>

<p>There are no specific options for this plugin in cgroup.conf, but the general
options apply. See the <a href="cgroup.conf.html">cgroup.conf</a> man page for
details.</p>

<h2 id="Specialization">Use of cgroup for Resource Specialization
<a class="slurm_link" href="#Specialization"></a>
</h2>

<p>Resource Specialization may be used to reserve a subset of cores or a
specific amount of memory on each compute node for exclusive use by the Slurm
compute node daemon, slurmd.</p>

<p>If cgroup/v1 is used, the reserved resources will also be used by the
slurmstepd processes. If cgroup/v2 is used, slurmstepd is not constrained by
this resource specialization. Instead, slurmstepd is constrained to the
resources allocated to the job, since it is considered part of the job and its
consumption depends entirely on the topology of the job. For example, an MPI
job can initialize many ranks with PMI and make slurmstepd consume more
memory.</p>

<p>System-level resource specialization is enabled with special node
configuration parameters. Read <a href="slurm.conf.html">slurm.conf</a> and
<a href="core_spec.html">core specialization</a> for more information.</p>
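<p>As a minimal sketch, a node definition in slurm.conf using the
CoreSpecCount and MemSpecLimit parameters to reserve two cores and 2 GB of
memory for the system (all values here are hypothetical) could look like:</p>

<pre>
NodeName=node[01-10] CPUs=32 RealMemory=64000 CoreSpecCount=2 MemSpecLimit=2048
</pre>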

<h2 id="cgroupplugins">Slurm cgroup plugins
<a class="slurm_link" href="#cgroupplugins"></a>
</h2>

<p>
Since version 22.05, Slurm supports both cgroup/v1 and cgroup/v2. The two
plugins organize their hierarchies in very different ways and respond to
different design constraints; these designs are the responsibility of the
kernel maintainers, not of Slurm.
</p>

<h3 id="differences">Main differences between cgroup/v1 and cgroup/v2
<a class="slurm_link" href="#differences"></a>
</h3>

<p>The main differences between v1 and v2 are:</p>

<ul>
<li><b>Unified mode in v2</b><br>

<p>In <i>cgroup/v1</i> there is a separate hierarchy for each controller, which
means the job structure must be replicated and managed for every enabled
controller. For example, for the same job, if using the <i>memory</i> and
<i>freezer</i> controllers, the same slurm/uid/job_id/step_id/ hierarchy must
be created in both controllers' directories:</p>

<pre>/sys/fs/cgroup/memory/slurm/uid_1000/job_1/step_0/</pre>
<pre>/sys/fs/cgroup/freezer/slurm/uid_1000/job_1/step_0/</pre>

<p>In <i>cgroup/v2</i> we have a <i>Unified</i> hierarchy, where controllers
are enabled at the same level and presented to the user as different files:</p>

<pre>/sys/fs/cgroup/system.slice/slurmstepd.scope/job_1/step_0/</pre>

</li>

<li><b>Top-down constraint in v2</b><br>
<p>Resources are distributed top-down, and a cgroup can further distribute a
resource only if that resource has been distributed to it by its parent. The
controllers available to a cgroup are listed in its <i>cgroup.controllers</i>
file, and the controllers enabled for its subtree are listed in
<i>cgroup.subtree_control</i> (see the sketch after this list).</p>
</li>

<li><b>No-Internal-Process constraint in v2</b><br>
<p>In <i>cgroup/v1</i> the hierarchy is free-form, meaning one can create any
directory in the tree and put pids in it. In <i>cgroup/v2</i> a kernel
restriction prevents pids from being added to non-leaf directories.</p>
</li>

<li><b>Systemd dependency on cgroup/v2 - separation of slurmd and stepds
</b><p>This is not a kernel limitation but a systemd decision, which imposes an
important restriction on services that use <i>Delegate=yes</i>.
systemd, as pid 1, considers itself the sole owner of the cgroup hierarchy,
<i>/sys/fs/cgroup</i>, trying to impose a <i>single-writer</i> design. This
means that everything related to cgroup should be under the control of
systemd. If one manually modifies the cgroup tree, creating directories and
moving pids around, systemd may at some point decide to enable or disable
controllers on the entire tree, or to move pids around. It has been observed
that a

<pre>systemctl daemon-reload</pre>

or a

<pre>systemctl reset-failed</pre>

removed controllers, at any level and directory of the tree, if no systemd
unit was making use of them and no unit started with "Delegate=yes" existed on
the system. This happens because systemd wants to clean up the cgroup tree and
match it against its internal unit database. In fact, looking at the systemd
source code one can see how cgroup directories related to units with the
"Delegate=yes" flag are ignored, while any other cgroup directories are
modified. This makes it mandatory to start the slurmd and slurmstepd processes
under a unit with "Delegate=yes", which means slurmd must be started, stopped
and restarted with systemd. If we do that, though, since the tree where slurmd
lives may already have been modified (e.g. by adding job directories), systemd
will not be able to restart slurmd because of the
<i>No-Internal-Process constraint</i> mentioned earlier: it cannot put the new
slurmd pid into the root of the delegated cgroup, which is now a non-leaf.
This forces us to separate the slurmstepd cgroup hierarchy from the slurmd
one, and since systemd must be informed about it and slurmstepd must live in
its own unit, a dbus call is made to systemd to create a new scope for the
slurmstepd processes. See
<a href="https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/">
systemd ControlGroupInterface</a> for more information.</p>
</li>
</li>
</ul>
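<p>To make the Top-down constraint concrete, here is a minimal sketch of
inspecting and delegating controllers in a v2 subtree (paths are illustrative
and the available controllers vary between systems):</p>

<pre>
$ cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory pids
$ echo "+memory +cpuset" > /sys/fs/cgroup/system.slice/cgroup.subtree_control
</pre>

<p>A controller can only show up in a child's cgroup.controllers file if it
was first enabled in the parent's cgroup.subtree_control.</p>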

<p>The following differences should not affect how other plugins interact with
the cgroup plugins; they only reflect internal functional differences.</p>

<ul>
<li>A controller in <i>cgroup/v2</i> is enabled by writing to the parent's
<i>cgroup.subtree_control</i> file, while in <i>cgroup/v1</i> a separate mount
point must be created with filesystem type <i>"-t cgroup"</i> and the
corresponding option, e.g. <i>"-o freezer"</i> (see the sketch after this
list).
</li>

<li>In <i>cgroup/v2</i> the freezer controller is inherently present in the
<i>cgroup.freeze</i> interface. In <i>cgroup/v1</i> it is a specific and
separate controller which needs to be mounted.
</li>

<li>The devices controller does not exist in cgroup/v2, instead a new eBPF
program must be inserted in the kernel.
</li>

<li>In <i>cgroup/v2</i>, the memory.stat file has changed, and Slurm now sums
anon+swapcached+anon_thp to match the RSS concept from v1.
</li>

<li>In <i>cgroup/v2</i>, cpu.stat provides metrics in microseconds, while
cpuacct.stat in <i>cgroup/v1</i> provides metrics in USER_HZ.
</li>

</ul>
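<p>As a sketch of the freezer-related differences above (paths are
illustrative and the commands must be run as root):</p>

<pre>
# cgroup/v1: the freezer is a separate controller that must be mounted
mkdir -p /sys/fs/cgroup/freezer
mount -t cgroup -o freezer none /sys/fs/cgroup/freezer
echo FROZEN > /sys/fs/cgroup/freezer/slurm/freezer.state

# cgroup/v2: freezing is built into every cgroup via cgroup.freeze
echo 1 > /sys/fs/cgroup/system.slice/slurmstepd.scope/job_1/cgroup.freeze
</pre>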

<h3 id="interfaces">Main differences between controller interfaces
<a class="slurm_link" href="#interfaces"></a>
</h3>
<table style="page-break-inside: avoid; font-family: Arial,Helvetica,sans-serif;" border="1" bordercolor="#000000" cellpadding="3" cellspacing="0" width="100%">
<tr bgcolor="#e0e0e0">
<td><u><b>cgroup/v1</b></u></td>
<td><u><b>cgroup/v2</b></u></td>
</tr>
<tr>
<td>memory.limit_in_bytes</td>
<td>memory.max</td>
</tr>
<tr>
<td>memory.soft_limit_in_bytes</td>
<td>memory.high</td>
</tr>
<tr>
<td>memory.memsw_limit_in_bytes</td>
<td>memory.swap.max</td>
</tr>
<tr>
<td>memory.swappiness</td>
<td>none</td>
</tr>
<tr>
<td>freezer.state</td>
<td>cgroup.freeze</td>
</tr>
<tr>
<td>cpuset.cpus</td>
<td>cpuset.cpus.effective and cpuset.cpus</td>
</tr>
<tr>
<td>cpuset.mems</td>
<td>cpuset.mems.effective and cpuset.mems</td>
</tr>
<tr>
<td>cpuacct.stat</td>
<td>cpu.stat</td>
</tr>
<tr>
<td>devices.*</td>
<td>ebpf program</td>
</tr>
</table>


<h3 id="generalities">Other generalities
<a class="slurm_link" href="#generalities"></a>
</h3>

<ul>
<li>When using cgroup/v1, some configurations can exclude the swap cgroup
accounting, which is part of the features provided by the memory controller.
If this feature is disabled in the kernel or by boot parameters, trying to
enable swap constraints will produce an error. If swap accounting is required,
add the following parameters to the kernel command line:

<pre>cgroup_enable=memory swapaccount=1</pre>

This can usually be placed in /etc/default/grub inside
the <i>GRUB_CMDLINE_LINUX</i> variable. A command such as <i>update-grub</i>
must be run after updating the file. This feature can also be disabled at
kernel build time with the parameter:

<pre>CONFIG_MEMCG_SWAP=</pre></li>

<li>In some Linux distributions it was possible to use the systemd parameter
JoinControllers, which is now deprecated. This parameter allowed multiple
controllers to be mounted in a single hierarchy in <i>cgroup/v1</i>, more or
less emulating the behavior of <i>cgroup/v2</i> in "Unified" mode. However,
Slurm does not work correctly with this configuration, so please make sure
your system.conf does not use JoinControllers and that all your cgroup
controllers are mounted in separate directories when using <i>cgroup/v1</i>
legacy mode.
</li>
</ul>

<p style="text-align:center;">Last modified 4 April 2025</p>

<!--#include virtual="footer.txt"-->