1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305
|
.TH "cgroup.conf" "5" "Slurm Configuration File" "April 2022" "Slurm Configuration File"
.SH "NAME"
cgroup.conf \- Slurm configuration file for the cgroup support
.SH "DESCRIPTION"
\fBcgroup.conf\fP is an ASCII file which defines parameters used by
Slurm's Linux cgroup related plugins.
The file will always be located in the same directory as the \fBslurm.conf\fR.
.LP
Parameter names are case insensitive.
Any text following a "#" in the configuration file is treated
as a comment through the end of that line.
Changes to the configuration file take effect upon restart of
Slurm daemons, daemon receipt of the SIGHUP signal, or execution
of the command "scontrol reconfigure" unless otherwise noted.
.LP
For general Slurm cgroups information, see the Cgroups Guide at
<https://slurm.schedmd.com/cgroups.html>.
.LP
The following cgroup.conf parameters are defined to control the general behavior
of Slurm cgroup plugins.
.TP
\fBCgroupAutomount\fR=<yes|no>
In cgroup/v1 this parameter detects if /sys/fs/cgroup/<controller_name> is
available, and if not it tries to mount the filesystem.
In cgroup/v2 this parameter only takes effect if IgnoreSystemd is set, and
enables the required controllers on slurmd and slurmstepd cgroup directories.
This parameter is intended for development and testing with cgroup/v2.
.IP
.TP
\fBCgroupMountpoint\fR=\fIPATH\fR
Only intended for development and testing. Specifies the \fIPATH\fR under which
cgroup controllers should be mounted. The default \fIPATH\fR is /sys/fs/cgroup.
.IP
.TP
\fBCgroupPlugin\fR=\fI<cgroup/v1|cgroup/v2|autodetect>\fR
Specify the plugin to be used when interacting with the cgroup subsystem.
Supported values at the moment are "cgroup/v1" which supports the legacy
interface of cgroup v1, "cgroup/v2" for unified architecture, or "autodetect"
which tries to determine which cgroup version does your system provide.
This is useful if nodes have support for different cgroup versions.
The default value is "autodetect".
.IP
.TP
\fBIgnoreSystemd\fR=\fI<yes|no>\fR
Only for cgroup/v2 and for development and testing. It will avoid any call to
dbus and contact with systemd, and instead will prepare all the cgroup hierarchy
manually. This option is dangerous in systems with systemd since the cgroup
can be modified by systemd and cause issues to jobs.
.IP
.TP
\fBIgnoreSystemdOnFailure\fR=\fI<yes|no>\fR
Only for cgroup/v2 and for development and testing. It has similar functionality
to \fBIgnoreSystemd\fR but only in the case that a dbus call does not succeed.
.IP
.SH "TASK/CGROUP PLUGIN"
.LP
The following cgroup.conf parameters are defined to control the behavior
of this particular plugin:
.TP
\fBAllowedKmemSpace\fR=<number>
Only for cgroup/v1.
Constrain the job cgroup kernel memory to this amount of the allocated memory,
specified in bytes. The \fBAllowedKmemSpace\fR must be between the upper and
lower memory limits, specified by \fBMaxKmemPercent\fR and \fBMinKmemSpace\fR,
respectively. If \fBAllowedKmemSpace\fR goes beyond the upper or lower limit,
it will be reset to that upper or lower limit, whichever has been exceeded.
.IP
.TP
\fBAllowedRAMSpace\fR=<number>
Constrain the job/step cgroup RAM to this percentage of the allocated memory.
The percentage supplied may be expressed as floating point number, e.g. 101.5.
Sets the cgroup soft memory limit at the allocated memory size and then sets the
job/step hard memory limit at the (AllowedRAMSpace/100) * allocated memory. If
the job/step exceeds the hard limit, then it might trigger Out Of Memory (OOM)
events (including oom\-kill) which will be logged to kernel log ring buffer
(dmesg in Linux). Setting AllowedRAMSpace above 100 may cause system Out of
Memory (OOM) events as it allows job/step to allocate more memory than
configured to the nodes. Reducing configured node available memory to avoid
system OOM events is suggested. Setting AllowedRAMSpace below 100 will result
in jobs receiving less memory than allocated and soft memory limit will set to
the same value as the hard limit.
Also see \fBConstrainRAMSpace\fR.
The default value is 100.
.IP
.TP
\fBAllowedSwapSpace\fR=<number>
Constrain the job cgroup swap space to this percentage of the allocated
memory. The default value is 0, which means that RAM+Swap will be limited
to \fBAllowedRAMSpace\fR. The supplied percentage may be expressed as a
floating point number, e.g. 50.5. If the limit is exceeded, the job steps
will be killed and a warning message will be written to standard error.
Also see \fBConstrainSwapSpace\fR.
NOTE: Setting AllowedSwapSpace to 0 does not restrict the Linux kernel from
using swap space. To control how the kernel uses swap space, see
\fBMemorySwappiness\fR.
.IP
.TP
\fBConstrainCores\fR=<yes|no>
If configured to "yes" then constrain allowed cores to the subset of
allocated resources. This functionality makes use of the cpuset subsystem.
Due to a bug fixed in version 1.11.5 of HWLOC, the task/affinity plugin may be
required in addition to task/cgroup for this to function properly.
The default value is "no".
.IP
.TP
\fBConstrainDevices\fR=<yes|no>
If configured to "yes" then constrain the job's allowed devices based on GRES
allocated resources. It uses the devices subsystem for that.
The default value is "no".
.IP
.TP
\fBConstrainKmemSpace\fR=<yes|no>
Only for cgroup/v1.
If configured to "yes" then constrain the job's Kmem RAM usage in addition to
RAM usage. Only takes effect if \fBConstrainRAMSpace\fR is set to "yes". If
enabled, the job's Kmem limit will be assigned the value of
\fBAllowedKmemSpace\fR or the value coming from \fBMaxKmemPercent\fR.
The default value is "no" which will leave Kmem setting untouched by Slurm.
Also see \fBAllowedKmemSpace\fR, \fBMaxKmemPercent\fR.
.IP
.TP
\fBConstrainRAMSpace\fR=<yes|no>
If configured to "yes" then constrain the job's RAM usage by setting
the memory soft limit to the allocated memory and the hard limit to
the allocated memory * \fBAllowedRAMSpace\fR. The default value is "no", in
which case the job's RAM limit will be set to its swap space limit if
\fBConstrainSwapSpace\fR is set to "yes".
Also see \fBAllowedSwapSpace\fR, \fBAllowedRAMSpace\fR and
\fBConstrainSwapSpace\fR.
\fBNOTE\fR: When using \fBConstrainRAMSpace\fR, if the combined memory used
by all processes in a step is greater than the limit, then the kernel will
trigger an OOM event, killing one or more of the processes in the step. The
step state will be marked as OOM, but the step itself will keep running and
other processes in the step may continue to run as well.
This differs from the behavior of \fBOverMemoryKill\fR, where the whole step
will be killed/cancelled.
\fBNOTE\fR: When enabled, ConstrainRAMSpace can lead to a noticeable decline in
per\-node job throughout. Sites with high\-throughput requirements should
carefully weigh the tradeoff between per\-node throughput, versus potential
problems that can arise from unconstrained memory usage on the node. See
<https://slurm.schedmd.com/high_throughput.html> for further discussion.
.IP
.TP
\fBConstrainSwapSpace\fR=<yes|no>
If configured to "yes" then constrain the job's swap space usage.
The default value is "no". Note that when set to "yes" and
ConstrainRAMSpace is set to "no", \fBAllowedRAMSpace\fR is automatically set
to 100% in order to limit the RAM+Swap amount to 100% of job's requirement
plus the percent of allowed swap space. This amount is thus set to both
RAM and RAM+Swap limits. This means that in that particular case,
ConstrainRAMSpace is automatically enabled with the same limit as the one
used to constrain swap space.
Also see \fBAllowedSwapSpace\fR.
.IP
.TP
\fBMaxRAMPercent\fR=\fIPERCENT\fR
Set an upper bound in percent of total RAM on the RAM constraint for a job.
This will be the memory constraint applied to jobs that are not explicitly
allocated memory by Slurm (i.e. Slurm's select plugin is not configured to manage
memory allocations). The \fIPERCENT\fR may be an arbitrary floating
point number. The default value is 100.
.IP
.TP
\fBMaxSwapPercent\fR=\fIPERCENT\fR
Set an upper bound (in percent of total RAM) on the amount of RAM+Swap
that may be used for a job. This will be the swap limit applied to jobs
on systems where memory is not being explicitly allocated to job. The
\fIPERCENT\fR may be an arbitrary floating point number between 0 and 100.
The default value is 100.
.IP
.TP
\fBMaxKmemPercent\fR=\fIPERCENT\fR
Only for cgroup/v1.
Set an upper bound in percent of total RAM as the maximum Kmem for a job. The
\fIPERCENT\fR may be an arbitrary floating point number, however, the product
of \fBMaxKmemPercent\fR and job requested memory has to fall between
\fBMinKmemSpace\fR and job requested memory, otherwise the boundary value is
used. The default value is 100.
.IP
.TP
\fBMemorySwappiness\fR=<number>
Only for cgroup/v1.
Configure the kernel's priority for swapping out anonymous pages (such as
program data) verses file cache pages for the job cgroup. Valid values are
between 0 and 100, inclusive. A value of 0 prevents the kernel from swapping
out program data. A value of 100 gives equal priority to swapping out file
cache or anonymous pages. If not set, then the kernel's default swappiness
value will be used. \fBConstrainSwapSpace\fR
must be set to \fByes\fR in order for this parameter to be applied.
.IP
.TP
\fBMinKmemSpace\fR=<number>
Only for cgroup/v1.
Set a lower bound (in MB) on the memory limits defined by
\fBAllowedKmemSpace\fR. The default limit is 30M.
.IP
.TP
\fBMinRAMSpace\fR=<number>
Set a lower bound (in MB) on the memory limits defined by
\fBAllowedRAMSpace\fR and \fBAllowedSwapSpace\fR. This prevents
accidentally creating a memory cgroup with such a low limit that slurmstepd
is immediately killed due to lack of RAM. The default limit is 30M.
.IP
.SH "DISTRIBUTION\-SPECIFIC NOTES"
.LP
Debian and derivatives (e.g. Ubuntu) usually exclude the memory and memsw (swap)
cgroups by default. To include them, add the following parameters to the kernel
command line: \fBcgroup_enable=memory swapaccount=1\fR
.LP
This can usually be placed in /etc/default/grub inside the
\fBGRUB_CMDLINE_LINUX\fR variable. A command such as update\-grub must be run
after updating the file.
.SH "EXAMPLE"
.TP
\fB/etc/slurm/cgroup.conf\fR:
This example cgroup.conf file shows a configuration that enables the more
commonly used cgroup enforcement mechanisms.
.IP
.nf
###
# Slurm cgroup support configuration file.
###
CgroupAutomount=yes
CgroupMountpoint=/sys/fs/cgroup
ConstrainCores=yes
ConstrainDevices=yes
ConstrainKmemSpace=no #avoid known Kernel issues
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
.fi
.TP
\fB/etc/slurm/slurm.conf\fR:
These are the entries required in \fBslurm.conf\fR to activate the cgroup
enforcement mechanisms. Make sure that the node definitions in your
\fBslurm.conf\fR closely match the configuration as shown by "\fBslurmd \-C\fR".
Either MemSpecLimit should be set or RealMemory should be defined with less
than the actual amount of memory for a node to ensure that all system/non\-job
processes will have sufficient memory at all times. Sites should also configure
\fBpam_slurm_adopt\fR to ensure users can not escape the cgroups via \fBssh\fR.
.IP
.nf
###
# Slurm configuration entries for cgroups
###
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity
JobAcctGatherType=jobacct_gather/cgroup #optional for gathering metrics
PrologFlags=Contain #X11 flag is also suggested
.fi
.SH "COPYING"
Copyright (C) 2010\-2012 Lawrence Livermore National Security.
Produced at Lawrence Livermore National Laboratory (cf, DISCLAIMER).
.br
Copyright (C) 2010\-2022 SchedMD LLC.
.LP
This file is part of Slurm, a resource management program.
For details, see <https://slurm.schedmd.com/>.
.LP
Slurm is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free
Software Foundation; either version 2 of the License, or (at your option)
any later version.
.LP
Slurm is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more
details.
.SH "SEE ALSO"
.LP
\fBslurm.conf\fR(5)
|