Containers are being adopted in HPC workloads. Containers rely on existing kernel features to allow greater user control over what applications see and can interact with at any given time. For HPC workloads, these are usually restricted to the mount namespace. Slurm natively supports requesting unprivileged OCI containers for jobs and steps.
Setting up containers requires several steps, which are described in the sections below.
The Slurm OCI container implementation has a number of known limitations.
The host kernel must be configured to allow unprivileged (userland) containers:
sudo sysctl -w kernel.unprivileged_userns_clone=1
sudo sysctl -w kernel.apparmor_restrict_unprivileged_unconfined=0
sudo sysctl -w kernel.apparmor_restrict_unprivileged_userns=0
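These sysctl settings do not persist across reboots. As a sketch (the drop-in file name is arbitrary, and which keys exist depends on the distribution and kernel version), they can be made persistent with a sysctl drop-in file:

# Example only: persist the settings in a sysctl drop-in file
cat <<'EOF' | sudo tee /etc/sysctl.d/90-unprivileged-containers.conf
kernel.unprivileged_userns_clone=1
kernel.apparmor_restrict_unprivileged_unconfined=0
kernel.apparmor_restrict_unprivileged_userns=0
EOF
sudo sysctl --system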
Docker also provides a tool to verify the kernel configuration:
$ dockerd-rootless-setuptool.sh check --force
[INFO] Requirements are satisfied
The OCI Runtime Specification provides requirements for all compliant runtimes but does not expressly provide requirements on how runtimes will use arguments. In order to support as many runtimes as possible, Slurm provides pattern replacement for the commands issued for each OCI runtime operation. This allows a site to edit how the OCI runtimes are called as needed to ensure compatibility.
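As an illustration only, a query pattern such as the one used in the runc examples below:

RunTimeQuery="runc --rootless=true --root=/run/user/%U/ state %n.%u.%j.%s.%t"

is expanded by Slurm with job-specific values (node, user, job, step and task identifiers in this case) before execution, producing a call along the lines of (all values here are hypothetical):

runc --rootless=true --root=/run/user/1000/ state node1.alice.42.0.0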
For runc and crun, two sets of examples are provided. The OCI runtime specification only defines the create and start operation sequence, but these runtimes also provide a much more efficient run operation. Sites are strongly encouraged to use the run operation (if provided), since the create and start operations require Slurm to poll the OCI runtime to know when the containers have completed execution. While Slurm attempts to be as efficient as possible with polling, it results in a thread using CPU time inside of the job and a slower response from Slurm when container execution is complete.
The examples provided have been tested to work but are only suggestions. Sites are expected to ensure that the resultant root directory will be secure from cross-user viewing and modification. The examples point to "/run/user/%U", where %U will be replaced with the numeric user id. Systemd manages "/run/user/" (independently of Slurm) and will likely need additional configuration to ensure the directories exist on compute nodes when users do not log in to the nodes directly. This configuration is generally achieved by calling loginctl to enable lingering sessions. Be aware that the directory in this example will be cleaned up by systemd once the user session ends on the node.
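For example, lingering can be enabled per user on each compute node with loginctl (the user name is a placeholder):

sudo loginctl enable-linger <username>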
EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" RunTimeQuery="runc --rootless=true --root=/run/user/%U/ state %n.%u.%j.%s.%t" RunTimeCreate="runc --rootless=true --root=/run/user/%U/ create %n.%u.%j.%s.%t -b %b" RunTimeStart="runc --rootless=true --root=/run/user/%U/ start %n.%u.%j.%s.%t" RunTimeKill="runc --rootless=true --root=/run/user/%U/ kill -a %n.%u.%j.%s.%t" RunTimeDelete="runc --rootless=true --root=/run/user/%U/ delete --force %n.%u.%j.%s.%t"
EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" RunTimeQuery="runc --rootless=true --root=/run/user/%U/ state %n.%u.%j.%s.%t" RunTimeKill="runc --rootless=true --root=/run/user/%U/ kill -a %n.%u.%j.%s.%t" RunTimeDelete="runc --rootless=true --root=/run/user/%U/ delete --force %n.%u.%j.%s.%t" RunTimeRun="runc --rootless=true --root=/run/user/%U/ run %n.%u.%j.%s.%t -b %b"
EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" RunTimeQuery="crun --rootless=true --root=/run/user/%U/ state %n.%u.%j.%s.%t" RunTimeKill="crun --rootless=true --root=/run/user/%U/ kill -a %n.%u.%j.%s.%t" RunTimeDelete="crun --rootless=true --root=/run/user/%U/ delete --force %n.%u.%j.%s.%t" RunTimeCreate="crun --rootless=true --root=/run/user/%U/ create --bundle %b %n.%u.%j.%s.%t" RunTimeStart="crun --rootless=true --root=/run/user/%U/ start %n.%u.%j.%s.%t"
EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" RunTimeQuery="crun --rootless=true --root=/run/user/%U/ state %n.%u.%j.%s.%t" RunTimeKill="crun --rootless=true --root=/run/user/%U/ kill -a %n.%u.%j.%s.%t" RunTimeDelete="crun --rootless=true --root=/run/user/%U/ delete --force %n.%u.%j.%s.%t" RunTimeRun="crun --rootless=true --root=/run/user/%U/ run --bundle %b %n.%u.%j.%s.%t"
EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" RunTimeQuery="nvidia-container-runtime --rootless=true --root=/run/user/%U/ state %n.%u.%j.%s.%t" RunTimeCreate="nvidia-container-runtime --rootless=true --root=/run/user/%U/ create %n.%u.%j.%s.%t -b %b" RunTimeStart="nvidia-container-runtime --rootless=true --root=/run/user/%U/ start %n.%u.%j.%s.%t" RunTimeKill="nvidia-container-runtime --rootless=true --root=/run/user/%U/ kill -a %n.%u.%j.%s.%t" RunTimeDelete="nvidia-container-runtime --rootless=true --root=/run/user/%U/ delete --force %n.%u.%j.%s.%t"
EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" RunTimeQuery="nvidia-container-runtime --rootless=true --root=/run/user/%U/ state %n.%u.%j.%s.%t" RunTimeKill="nvidia-container-runtime --rootless=true --root=/run/user/%U/ kill -a %n.%u.%j.%s.%t" RunTimeDelete="nvidia-container-runtime --rootless=true --root=/run/user/%U/ delete --force %n.%u.%j.%s.%t" RunTimeRun="nvidia-container-runtime --rootless=true --root=/run/user/%U/ run %n.%u.%j.%s.%t -b %b"
oci.conf example for Singularity using user namespace support:

IgnoreFileConfigJson=true
EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
RunTimeRun="singularity exec --userns %r %@"
RunTimeKill="kill -s SIGTERM %p"
RunTimeDelete="kill -s SIGKILL %p"
Singularity v4.x requires setuid mode for OCI support.
EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" RunTimeQuery="sudo singularity oci state %n.%u.%j.%s.%t" RunTimeRun="sudo singularity oci run --bundle %b %n.%u.%j.%s.%t" RunTimeKill="sudo singularity oci kill %n.%u.%j.%s.%t" RunTimeDelete="sudo singularity oci delete %n.%u.%j.%s.%t"
WARNING: Singularity (v4.0.2) requires sudo or setuid binaries for OCI support, which is a security risk since the user is able to modify these calls. This example is only provided for testing purposes.
WARNING: Upstream singularity development of the OCI interface appears to have ceased and sites should use the user namespace support instead.
EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" OCIRunTimeQuery="sudo singularity oci state %n.%u.%j.%s.%t" OCIRunTimeCreate="sudo singularity oci create --bundle %b %n.%u.%j.%s.%t" OCIRunTimeStart="sudo singularity oci start %n.%u.%j.%s.%t" OCIRunTimeKill="sudo singularity oci kill %n.%u.%j.%s.%t" OCIRunTimeDelete="sudo singularity oci delete %n.%u.%j.%s.%t
WARNING: Singularity (v3.8.0) requires sudo or setuid binaries for OCI support, which is a security risk since the user is able to modify these calls. This example is only provided for testing purposes.
WARNING: Upstream singularity development of the OCI interface appears to have ceased and sites should use the user namespace support instead.
oci.conf example for Charliecloud (ch-run):

IgnoreFileConfigJson=true
CreateEnvFile=newline
EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
RunTimeRun="env -i PATH=/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin/:/sbin/ USER=$(whoami) HOME=/home/$(whoami)/ ch-run -w --bind /etc/group:/etc/group --bind /etc/passwd:/etc/passwd --bind /etc/slurm:/etc/slurm --bind %m:/var/run/slurm/ --bind /var/run/munge/:/var/run/munge/ --set-env=%e --no-passwd %r -- %@"
RunTimeKill="kill -s SIGTERM %p"
RunTimeDelete="kill -s SIGKILL %p"
oci.conf example for Enroot:

IgnoreFileConfigJson=true
CreateEnvFile=newline
EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
RunTimeRun="/usr/local/bin/enroot-start-wrapper %b %m %e -- %@"
RunTimeKill="kill -s SIGINT %p"
RunTimeDelete="kill -s SIGTERM %p"
/usr/local/bin/enroot-start-wrapper:
#!/bin/bash
BUNDLE="$1"
SPOOLDIR="$2"
ENVFILE="$3"
shift 4 # skip the first three arguments and the "--" separator; the rest is the command
IMAGE=
export USER=$(whoami)
export HOME="$BUNDLE/"
export TERM
export ENROOT_SQUASH_OPTIONS='-comp gzip -noD'
export ENROOT_ALLOW_SUPERUSER=n
export ENROOT_MOUNT_HOME=y
export ENROOT_REMAP_ROOT=y
export ENROOT_ROOTFS_WRITABLE=y
export ENROOT_LOGIN_SHELL=n
export ENROOT_TRANSFER_RETRIES=2
export ENROOT_CACHE_PATH="$SPOOLDIR/"
export ENROOT_DATA_PATH="$SPOOLDIR/"
export ENROOT_TEMP_PATH="$SPOOLDIR/"
export ENROOT_ENVIRON="$ENVFILE"
if [ ! -f "$BUNDLE" ]
then
	IMAGE="$SPOOLDIR/container.sqsh"
	enroot import -o "$IMAGE" -- "$BUNDLE" && \
		enroot create "$IMAGE"
	CONTAINER="container"
else
	CONTAINER="$BUNDLE"
fi

enroot start -- "$CONTAINER" "$@"
rc=$?

[ -n "$IMAGE" ] && unlink "$IMAGE"
exit $rc
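The wrapper must be installed and made executable on every compute node that will run these containers, e.g.:

sudo chmod 0755 /usr/local/bin/enroot-start-wrapper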
If you wish to accommodate multiple runtimes in your environment, it is possible to do so with a bit of extra setup. This section outlines one possible approach:
Generic oci.conf that delegates to a site wrapper script:

IgnoreFileConfigJson=true
RunTimeRun="/opt/slurm-oci/run %b %m %u %U %n %j %s %t %@"
RunTimeKill="kill -s SIGTERM %p"
RunTimeDelete="kill -s SIGKILL %p"
Create '/opt/slurm-oci/run' with the following contents:

#!/bin/bash
if [[ -e ~/.slurm-oci-run ]]; then
	~/.slurm-oci-run "$@"
else
	/opt/slurm-oci/slurm-oci-run-default "$@"
fi
Create '/opt/slurm-oci/slurm-oci-run-default' with the following contents:

#!/bin/bash --login
# Parse
CONTAINER="$1"
SPOOL_DIR="$2"
USER_NAME="$3"
USER_ID="$4"
NODE_NAME="$5"
JOB_ID="$6"
STEP_ID="$7"
TASK_ID="$8"
shift 8 # subsequent arguments are the command to run in the container
# Run
apptainer run --bind /var/spool --containall "$CONTAINER" "$@"
Make both scripts executable:

chmod +x /opt/slurm-oci/run /opt/slurm-oci/slurm-oci-run-default
Once this is done, users may create a script at '~/.slurm-oci-run' if they wish to customize the container run process, such as using a different container runtime. Users should model this file after the default '/opt/slurm-oci/slurm-oci-run-default'.
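A minimal sketch of such a user script, modeled on the default above (the extra bind mount is purely an example of a customization):

#!/bin/bash
# Hypothetical ~/.slurm-oci-run
CONTAINER="$1"
shift 8   # arguments 2-8 carry user/node/job/step/task details and are unused here
# Same behavior as the default, plus an additional (example) bind mount
apptainer run --bind /var/spool --bind /scratch --containall "$CONTAINER" "$@"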
Slurm calls the OCI runtime directly in the job step. If it fails, then the job will also fail. You can verify that your OCI runtime is configured correctly outside of Slurm with commands such as the following:
cd $ABS_PATH_TO_BUNDLE
$OCIRunTime $ARGS create test --bundle $PATH_TO_BUNDLE
$OCIRunTime $ARGS start test
$OCIRunTime $ARGS kill test
$OCIRunTime $ARGS delete test

If these commands succeed, then the OCI runtime is correctly configured and can be tested in Slurm.
salloc, srun and sbatch (in Slurm 21.08+) have the '--container' argument, which can be used to request container runtime execution. The requested job container will not be inherited by the steps called, excluding the batch and interactive steps.
sbatch --container $ABS_PATH_TO_BUNDLE --wrap 'bash -c "cat /etc/*rel*"'
sbatch --wrap 'srun --container $ABS_PATH_TO_BUNDLE bash -c "cat /etc/*rel*"'
salloc --container $ABS_PATH_TO_BUNDLE bash -c "cat /etc/*rel*"
salloc srun --container $ABS_PATH_TO_BUNDLE bash -c "cat /etc/*rel*"
srun --container $ABS_PATH_TO_BUNDLE bash -c "cat /etc/*rel*"
srun srun --container $ABS_PATH_TO_BUNDLE bash -c "cat /etc/*rel*"
Slurm's scrun can be directly integrated with Rootless Docker to run containers as jobs. No special user permissions are required, and none should be granted, to use this functionality.
AuthType=auth/munge
Docker calls should include the following options:

--security-opt label:disable
--security-opt seccomp=unconfined
--security-opt apparmor=unconfined
--net=none

Docker's built-in security functionality is not required (or wanted) for containers being run by Slurm. Docker is only acting as a container image lifecycle manager. The containers will be executed remotely via Slurm, following the existing security configuration in Slurm, outside of unprivileged user control.
docker exec command is not supported.
docker swarm command is not supported.
docker compose/docker-compose command is not supported.
docker pause command is not supported.
docker unpause command is not supported.
docker commands are not supported inside of containers.

export DOCKER_HOST=unix://$XDG_RUNTIME_DIR/docker.sock

All commands following this will expect this environment variable to be set.
systemctl --user stop docker
Configure the Docker daemon in either of the following locations:

/etc/docker/daemon.json
or
~/.config/docker/daemon.json
{
  "experimental": true,
  "iptables": false,
  "bridge": "none",
  "no-new-privileges": true,
  "rootless": true,
  "selinux-enabled": false,
  "default-runtime": "slurm",
  "runtimes": {
    "slurm": {
      "path": "/usr/local/bin/scrun"
    }
  },
  "data-root": "/run/user/${USER_ID}/docker/",
  "exec-root": "/run/user/${USER_ID}/docker-exec/"
}
Adjust the path to scrun to match your configured installation prefix. Replace ${USER_ID} with the numeric user id, or target a different directory with global write permissions and the sticky bit set. Rootless Docker requires a different root directory than the system's default to avoid permission errors.

If jobs may run on nodes other than the one running dockerd, the data-root and exec-root can instead be placed on a shared filesystem:

{
  "storage-driver": "vfs",
  "data-root": "/path/to/shared/filesystem/user_name/data/",
  "exec-root": "/path/to/shared/filesystem/user_name/exec/"
}
Any node expected to run containers from Docker must be able to at least read the filesystem used. Full write privileges are suggested and will be required if changes to the container filesystem are desired.

Override the rootless Docker service in either of the following locations:

/etc/systemd/user/docker.service.d/override.conf
or
~/.config/systemd/user/docker.service.d/override.conf
[Service]
Environment="DOCKERD_ROOTLESS_ROOTLESSKIT_PORT_DRIVER=none"
Environment="DOCKERD_ROOTLESS_ROOTLESSKIT_NET=host"
systemctl --user daemon-reload
systemctl --user start docker
export DOCKER_SECURITY="--security-opt label=disable --security-opt seccomp=unconfined --security-opt apparmor=unconfined --net=none"
docker run $DOCKER_SECURITY hello-world
docker run $DOCKER_SECURITY alpine /bin/printenv SLURM_JOB_ID
docker run $DOCKER_SECURITY alpine /bin/hostname
docker run $DOCKER_SECURITY -e SCRUN_JOB_NUM_NODES=10 alpine /bin/hostname
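While one of these containers is running, it should also be visible as a regular job in the queue; a quick way to confirm the integration (the sleep duration is arbitrary) could be:

docker run $DOCKER_SECURITY alpine sleep 60 &
squeue --me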
Slurm's scrun can be directly integrated with Podman to run containers as jobs. No special user permissions are required, and none should be granted, to use this functionality.
AuthType=auth/munge
podman exec command is not supported.
podman-compose command is not supported, due to only being partially implemented. Some compositions may work, but each container may be run on different nodes. The network for all containers must be the network_mode: host device.
podman kube command is not supported.
podman pod command is not supported.
podman farm command is not supported.
podman commands are not supported inside of containers.

Verify that Podman is running rootless:

$ podman info --format '{{.Host.Security.Rootless}}'
true

Compare the user and group ids inside and outside of a container:

$ id
$ podman run --userns keep-id alpine id

$ sudo id
$ podman run --userns nomap alpine id
Configure Podman in either of the following locations:

/etc/containers/containers.conf
$XDG_CONFIG_HOME/containers/containers.conf
or
~/.config/containers/containers.conf
(if $XDG_CONFIG_HOME is not defined)

[containers]
apparmor_profile = "unconfined"
cgroupns = "host"
cgroups = "enabled"
default_sysctls = []
label = false
netns = "host"
no_hosts = true
pidns = "host"
utsns = "host"
userns = "host"
log_driver = "journald"

[engine]
cgroup_manager = "systemd"
runtime = "slurm"
remote = false

[engine.runtimes]
slurm = [
    "/usr/local/bin/scrun",
    "/usr/bin/scrun"
]

Adjust the path to scrun to match your configured installation prefix.
/etc/containers/storage.conf
$XDG_CONFIG_HOME/containers/storage.conf
[storage]
driver = "vfs"
runroot = "$HOME/containers"
graphroot = "$HOME/containers"
[storage.options]
pull_options = {use_hard_links = "true", enable_partial_images = "true"}
[storage.options.vfs]
ignore_chown_errors = "true"
Any node expected to run containers from Podman must be able to at least read the filesystem used. Full write privileges are suggested and will be required if changes to the container filesystem are desired.

podman run hello-world
podman run alpine printenv SLURM_JOB_ID
podman run alpine hostname
podman run -e SCRUN_JOB_NUM_NODES=10 alpine hostname
salloc podman run --env-host=true alpine hostname
salloc sh -c 'podman run -e SLURM_JOB_ID=$SLURM_JOB_ID alpine hostname'
alias docker=podman
or
alias docker='podman --config=/some/path "$@"'
$ podman run alpine uptime
Error: allocating lock for new container: allocation failed; exceeded num_locks (2048)

If this error occurs, Podman's local state can be repaired or reset with:

podman system renumber
podman system reset
There are multiple ways to generate an OCI container bundle. The instructions below describe the methods we found easiest. The OCI standard defines the requirements for any given bundle in the Filesystem Bundle specification.
Here are instructions on how to generate a container using a few alternative container solutions:
sudo debootstrap stable /image/rootfs http://deb.debian.org/debian/
sudo yum --config /etc/yum.conf --installroot=/image/rootfs/ --nogpgcheck --releasever=${CENTOS_RELEASE} -y
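After populating a rootfs at /image/rootfs with either of the commands above, the bundle still needs a config.json; as a sketch (mirroring the runc example further below, and assuming ownership and permissions of the rootfs are adjusted for unprivileged use):

cd /image/
runc spec --rootless
srun --container /image/ uptime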
mkdir -p ~/oci_images/alpine/rootfs
cd ~/oci_images/
docker pull alpine
docker create --name alpine alpine
docker export alpine | tar -C ~/oci_images/alpine/rootfs -xf -
docker rm alpine
cd ~/oci_images/alpine
runc --rootless=true spec --rootless
srun --container ~/oci_images/alpine/ uptime
mkdir -p ~/oci_images/
cd ~/oci_images/
skopeo copy docker://alpine:latest oci:alpine:latest
umoci unpack --rootless --image alpine ~/oci_images/alpine
srun --container ~/oci_images/alpine uptime
mkdir -p ~/oci_images/alpine/
cd ~/oci_images/alpine/
singularity pull alpine
sudo singularity oci mount ~/oci_images/alpine/alpine_latest.sif ~/oci_images/alpine
mv config.json singularity_config.json
runc spec --rootless
srun --container ~/oci_images/alpine/ uptime
FROM almalinux:latest
RUN dnf -y update && dnf -y upgrade && dnf install -y epel-release && dnf -y update
RUN dnf -y install make automake gcc gcc-c++ kernel-devel bzip2 python3 wget libevent-devel hwloc-devel
WORKDIR /usr/local/src/
RUN wget --quiet 'https://github.com/openpmix/openpmix/releases/download/v5.0.7/pmix-5.0.7.tar.bz2' -O - | tar --no-same-owner -xvjf -
WORKDIR /usr/local/src/pmix-5.0.7/
RUN ./configure && make -j && make install
WORKDIR /usr/local/src/
RUN wget --quiet --inet4-only 'https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.7.tar.bz2' -O - | tar --no-same-owner -xvjf -
WORKDIR /usr/local/src/openmpi-5.0.7/
RUN ./configure --disable-pty-support --enable-ipv6 --without-slurm --with-pmix --enable-debug && make -j && make install
WORKDIR /usr/local/src/openmpi-5.0.7/examples
RUN make && cp -v hello_c ring_c connectivity_c spc_example /usr/local/bin
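The resulting image can be converted into an OCI bundle using the same docker export steps shown above (the image and directory names below are placeholders):

docker build -t openmpi_image .
mkdir -p ~/oci_images/openmpi/rootfs
docker create --name openmpi openmpi_image
docker export openmpi | tar -C ~/oci_images/openmpi/rootfs -xf -
docker rm openmpi
cd ~/oci_images/openmpi
runc spec --rootless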
Slurm allows container developers to create SPANK plugins that can be called at various points of job execution to support containers. Any site using one of these plugins to start containers should not have an "oci.conf" configuration file. The "oci.conf" file activates the built-in container functionality, which may conflict with the SPANK-based plugin functionality.
The following projects are third party container solutions that have been designed to work with Slurm, but they have not been tested or validated by SchedMD.
Shifter is a container project out of NERSC to provide HPC containers with full scheduler integration.
Enroot is a user namespace container system sponsored by NVIDIA that supports:
Sarus is a privileged container system sponsored by ETH Zurich CSCS that supports:
Last modified 27 November 2024