Control Group in Slurm

Control Group Overview

Control Group is a mechanism provided by the kernel to organize processes hierarchically and distribute system resources along the hierarchy in a controlled and configurable manner. Slurm can use cgroups to constrain different resources to jobs, steps and tasks, and to collect accounting data about these resources.

A cgroup provides different controllers (formerly "subsystems") for different resources. Slurm plugins can use several of these controllers, e.g.: memory, cpu, devices, freezer, cpuset, cpuacct. Each enabled controller gives the ability to constrain resources to a set of processes. If one controller is not available on the system, then Slurm cannot constrain the associated resources through a cgroup.

"cgroup" stands for "control group" and is never capitalized. The singular form is used to designate the whole feature and also as a qualifier as in "cgroup controllers". When explicitly referring to multiple individual control groups, the plural form "cgroups" is used.

Slurm supports two cgroup modes: legacy mode (cgroup v1) and unified mode (cgroup v2). Hybrid mode, where controllers from both version 1 and version 2 are mixed in a system, is not supported.
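
As a rough illustration of how the three modes can be told apart on a node, the following Python sketch inspects the layout of the cgroup mount point. It is not how Slurm itself detects the mode. The function name detect_cgroup_mode is hypothetical, and the checks assume the common layout in which the cgroup tree is mounted at /sys/fs/cgroup and, on hybrid systems, systemd mounts the v2 tree at /sys/fs/cgroup/unified.

```python
from pathlib import Path

# Controller directory names typically found under a cgroup v1 mount point.
V1_CONTROLLER_DIRS = ("memory", "cpuset", "freezer", "devices", "cpuacct")


def detect_cgroup_mode(root="/sys/fs/cgroup"):
    """Guess the cgroup mode from the directory layout under `root`.

    Returns "unified", "hybrid", "legacy" or "unknown".
    """
    root = Path(root)
    # The root of a cgroup v2 hierarchy exposes a cgroup.controllers file.
    if (root / "cgroup.controllers").is_file():
        return "unified"
    # Assumption: a hybrid setup has a v2 tree at <root>/unified next to
    # the per-controller v1 mounts.
    if (root / "unified" / "cgroup.controllers").is_file():
        return "hybrid"
    # Per-controller directories indicate a pure v1 (legacy) layout.
    if any((root / name).is_dir() for name in V1_CONTROLLER_DIRS):
        return "legacy"
    return "unknown"


if __name__ == "__main__":
    print(detect_cgroup_mode())
```

A result of "hybrid" would indicate a configuration that Slurm does not support.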

See the kernel.org documentation for a more comprehensive description of cgroup:

Slurm cgroup plugins design

For extended information on Slurm's internal cgroup plugin, read:

Use of cgroup in Slurm

Slurm provides cgroup versions of a number of plugins.

cgroups can also be used for resource specialization (constraining daemons to cores or memory).

Slurm Cgroup Configuration Overview

There are several sets of configuration options for Slurm cgroups:

Currently Available Cgroup Plugins

proctrack/cgroup plugin

The proctrack/cgroup plugin is an alternative to other proctrack plugins such as proctrack/linux for process tracking and suspend/resume capability.

proctrack/cgroup uses the freezer controller to keep track of all pids of a job. It stores the pids in a specific hierarchy in the cgroup tree and takes care of signaling these pids when instructed. For example, if a user decides to cancel a job, Slurm will execute this order internally by calling the proctrack plugin and asking it to send a SIGTERM to the job. Since proctrack maintains a hierarchy of all Slurm-related pids in cgroup, it easily knows which ones need to be signaled.
Proctrack can also respond to queries for a list of all the pids of a job or a step.
When using proctrack/cgroup, pids are stored by cgroup in a single file (cgroup.procs), which is read by the plugin to get all the pids of a part of the hierarchy. For example, a single step has its own cgroup.procs file, so getting the pids of the step is immediate. With proctrack/linux, by contrast, /proc must be read recursively to find all the descendants of a parent pid.
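
The difference between the two lookup strategies can be sketched in Python. This is only an illustration under stated assumptions, not Slurm's implementation: pids_from_cgroup and descendants are hypothetical helper names, the cgroup directory passed in is whatever path the step's cgroup happens to live at, and the pid-to-parent map stands in for the data one would collect by scanning /proc.

```python
from pathlib import Path


def pids_from_cgroup(cgroup_dir):
    """Return the pids listed in <cgroup_dir>/cgroup.procs (one pid per line)."""
    text = (Path(cgroup_dir) / "cgroup.procs").read_text()
    return sorted(int(token) for token in text.split())


def descendants(parent_of, root_pid):
    """Return root_pid plus all of its descendants, sorted.

    `parent_of` maps pid -> parent pid, as could be built by reading the
    parent pid of every process under /proc.
    """
    # Invert the map so each pid points at its direct children.
    children = {}
    for pid, ppid in parent_of.items():
        children.setdefault(ppid, []).append(pid)
    # Walk the process tree starting at root_pid.
    seen, stack = set(), [root_pid]
    while stack:
        pid = stack.pop()
        if pid in seen:
            continue
        seen.add(pid)
        stack.extend(children.get(pid, []))
    return sorted(seen)
```

The first function needs a single file read, while the second needs the parent of every process on the node before it can answer for one step.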

To enable this plugin, configure the following option in slurm.conf:

ProctrackType=proctrack/cgroup

There are no specific options for this plugin in cgroup.conf, but the general options apply. See the cgroup.conf man page for details.

task/cgroup plugin

The task/cgroup plugin allows constraining resources to a job, a step, or a task. This is the only plugin that can ensure that the boundaries of an allocation are not violated. jobacct_gather/linux offers only a very simplistic mechanism for constraining memory to a job, but it is not reliable (there is a window of time in which a job can exceed its limits) and is intended only for the very rare systems where cgroup is not available.

task/cgroup provides the following features:

The task/cgroup plugin uses the cpuset, memory and devices controllers.

To enable this plugin, add task/cgroup to the TaskPlugin configuration parameter in slurm.conf:

TaskPlugin=task/cgroup

There are many specific options for this plugin in cgroup.conf. The general options also apply. See the cgroup.conf man page for details.

This plugin can be stacked with other task plugins, for example with task/affinity. Stacking constrains resources to a job while also providing the benefits of the affinity plugin (the order doesn't matter):

TaskPlugin=task/cgroup,task/affinity

jobacct_gather/cgroup plugin

The jobacct_gather/cgroup plugin is an alternative to the jobacct_gather/linux plugin for the collection of accounting statistics for jobs, steps and tasks.
jobacct_gather/cgroup uses the cpuacct and memory cgroup controllers.

The cpu and memory statistics collected by this plugin do not represent the same resources as the cpu and memory statistics collected by jobacct_gather/linux. While the cgroup plugin just reads a cgroup.stats file and similar files containing the information for the entire subtree of pids, the linux plugin gets information from /proc/pid/stat for every pid and then does the calculations, making it slightly less efficient than the cgroup plugin (though the difference is not noticeable in practice).
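
To illustrate why reading one aggregated file is cheap, here is a minimal Python sketch that parses a flat "key value" statistics file of the kind cgroup v2 exposes as cpu.stat. The function name parse_flat_keyed is hypothetical and the sample values are invented for the example. It is not the code Slurm uses.

```python
def parse_flat_keyed(text):
    """Parse lines of the form "<key> <integer>" into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank or malformed lines instead of failing.
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


# Invented sample in the layout of a cgroup v2 cpu.stat file.
sample = "usage_usec 1500000\nuser_usec 1000000\nsystem_usec 500000\n"
stats = parse_flat_keyed(sample)

# One read covers every process in the cgroup subtree; convert to seconds.
cpu_seconds = stats["usage_usec"] / 1_000_000
```

The per-pid approach would instead repeat a similar parse for every /proc/pid/stat file and sum the results.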

To enable this plugin, configure the following option in slurm.conf:

JobacctGatherType=jobacct_gather/cgroup

There are no specific options for this plugin in cgroup.conf, but the general options apply. See the cgroup.conf man page for details.
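
To check which of the three cgroup plugins described above a given slurm.conf enables, a small Python sketch such as the following could be used. The function name cgroup_plugins_enabled is hypothetical, and the parsing is deliberately simplified: it assumes one parameter per line and ignores Include directives, so treat it as a starting point rather than a faithful slurm.conf parser.

```python
def cgroup_plugins_enabled(conf_text):
    """Report which cgroup plugins are selected in slurm.conf text."""
    # slurm.conf parameter names are case insensitive, so compare in lowercase.
    wanted = {
        "proctracktype": "proctrack/cgroup",
        "taskplugin": "task/cgroup",
        "jobacctgathertype": "jobacct_gather/cgroup",
    }
    enabled = {plugin: False for plugin in wanted.values()}
    for raw in conf_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        plugin = wanted.get(key.strip().lower())
        if plugin is not None:
            # TaskPlugin may hold a comma-separated list of stacked plugins.
            values = [v.strip() for v in value.split(",")]
            enabled[plugin] = plugin in values
    return enabled
```

For example, passing the text of a file containing ProctrackType=proctrack/cgroup and TaskPlugin=task/cgroup,task/affinity marks the first two plugins as enabled and leaves jobacct_gather/cgroup disabled.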

Use of cgroup for Resource Specialization

Resource Specialization may be used to reserve a subset of cores or a specific amount of memory on each compute node for exclusive use by the Slurm compute node daemon, slurmd.

If cgroup/v1 is used, the reserved resources will also be used by the slurmstepd processes. If cgroup/v2 is used, slurmstepd is not constrained by this resource specialization. Instead, slurmstepd is constrained to the resources allocated to the job, since it is considered part of the job and its consumption depends entirely on the topology of the job. For example, an MPI job can initialize many ranks with PMI and make slurmstepd consume more memory.

System-level resource specialization is enabled with special node configuration parameters. Read slurm.conf and core specialization in core_spec.html for more information.

Slurm cgroup plugins

Since 22.05, Slurm supports cgroup/v1 and cgroup/v2. Both plugins have very different ways of organizing their hierarchies and respond to different design constraints. The design is the responsibility of the kernel maintainers.

Main differences between cgroup/v1 and cgroup/v2

The three main differences between v1 and v2 are:

The following differences shouldn't affect how other plugins interact with cgroup plugins, but instead they only show internal functional differences.

Main differences between controller interfaces

cgroup/v1                      cgroup/v2
memory.limit_in_bytes          memory.max
memory.soft_limit_in_bytes     memory.high
memory.memsw.limit_in_bytes    memory.swap.max
memory.swappiness              none
freezer.state                  cgroup.freeze
cpuset.cpus                    cpuset.cpus.effective and cpuset.cpus
cpuset.mems                    cpuset.mems.effective and cpuset.mems
cpuacct.stat                   cpu.stat
devices.*                      eBPF program
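
The correspondence in the table can be written down as a small lookup, shown here as a Python sketch. The names V1_TO_V2 and v2_equivalents are hypothetical, and the v1 file names use the kernel's spelling (memory.memsw.limit_in_bytes, devices.*). The mapping covers names only. Values are not always interchangeable: for instance, the v1 memsw limit counts memory plus swap, whereas memory.swap.max in v2 limits swap alone.

```python
# v1 interface file -> list of v2 interface files playing the same role.
V1_TO_V2 = {
    "memory.limit_in_bytes": ["memory.max"],
    "memory.soft_limit_in_bytes": ["memory.high"],
    "memory.memsw.limit_in_bytes": ["memory.swap.max"],
    "memory.swappiness": [],  # no v2 counterpart
    "freezer.state": ["cgroup.freeze"],
    "cpuset.cpus": ["cpuset.cpus.effective", "cpuset.cpus"],
    "cpuset.mems": ["cpuset.mems.effective", "cpuset.mems"],
    "cpuacct.stat": ["cpu.stat"],
}


def v2_equivalents(v1_name):
    """Return the v2 interface files corresponding to a v1 file name.

    An empty list means there is no file to look at in v2: either the
    setting has no counterpart, or (for devices.*) the restriction is
    enforced by an eBPF program rather than through interface files.
    """
    if v1_name.startswith("devices."):
        return []
    # Unknown names raise KeyError so that typos are not silently accepted.
    return list(V1_TO_V2[v1_name])
```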

Other generalities

Last modified 4 April 2025