Prolog and Epilog Guide

Slurm supports a number of prolog and epilog programs. For security reasons, these programs run without a search path set: either use fully qualified path names inside the program or set the PATH environment variable explicitly. The first table below identifies which prologs and epilogs are available for job allocations, and when and where they run.
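The note on search paths can be illustrated with a minimal sketch, assuming the prolog or epilog is written in Python. The SAFE_PATH value and the helper names with_explicit_path and run_absolute are hypothetical and not part of Slurm; the sketch only shows the two options named above, setting PATH explicitly and calling helpers by fully qualified path.

```python
import os
import subprocess

# Assumed search path for illustration only; adjust to the site's layout.
SAFE_PATH = "/usr/sbin:/usr/bin:/sbin:/bin"


def with_explicit_path(env=None):
    """Return a copy of the environment with PATH set explicitly."""
    new_env = dict(os.environ if env is None else env)
    new_env["PATH"] = SAFE_PATH
    return new_env


def run_absolute(argv, env=None):
    """Run a helper program, refusing anything but a fully qualified path."""
    if not os.path.isabs(argv[0]):
        raise ValueError(f"program must be an absolute path: {argv[0]!r}")
    result = subprocess.run(
        argv,
        env=with_explicit_path(env),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.returncode, result.stdout
```

Rejecting relative program names up front makes a missing search path fail loudly rather than silently running nothing.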

| Parameter | Location | Invoked by | User | When executed |
|---|---|---|---|---|
| Prolog (from slurm.conf) | Compute or front end node | slurmd daemon | SlurmdUser (normally user root) | First job or job step initiation on that node (by default); PrologFlags=Alloc will force the script to be executed at job allocation |
| PrologSlurmctld (from slurm.conf) | Head node (where slurmctld daemon runs) | slurmctld daemon | SlurmctldUser | At job allocation |
| Epilog (from slurm.conf) | Compute or front end node | slurmd daemon | SlurmdUser (normally user root) | At job termination |
| EpilogSlurmctld (from slurm.conf) | Head node (where slurmctld daemon runs) | slurmctld daemon | SlurmctldUser | At job termination |


The second table below identifies which prologs and epilogs are available for job step allocations, and when and where they run.

| Parameter | Location | Invoked by | User | When executed |
|---|---|---|---|---|
| SrunProlog (from slurm.conf) or srun --prolog | srun invocation node | srun command | User invoking srun command | Prior to launching job step |
| TaskProlog (from slurm.conf) | Compute node | slurmstepd daemon | User invoking srun command | Prior to launching job step |
| srun --task-prolog | Compute node | slurmstepd daemon | User invoking srun command | Prior to launching job step |
| TaskEpilog (from slurm.conf) | Compute node | slurmstepd daemon | User invoking srun command | At job step completion |
| srun --task-epilog | Compute node | slurmstepd daemon | User invoking srun command | At job step completion |
| SrunEpilog (from slurm.conf) or srun --epilog | srun invocation node | srun command | User invoking srun command | At job step completion |

By default the Prolog script is only run on any individual node when it first sees a job step from a new allocation; it does not run the Prolog immediately when an allocation is granted. If no job steps from an allocation are run on a node, it will never run the Prolog for that allocation. This Prolog behaviour can be changed by the PrologFlags parameter. The Epilog, on the other hand, always runs on every node of an allocation when the allocation is released.

Prolog and Epilog scripts should be as short as possible and should not call Slurm commands (e.g. squeue, scontrol, sacctmgr). Long-running scripts can cause scheduling problems because jobs then take a long time to start or finish, and Slurm commands called from these scripts can lead to performance issues.
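One way to avoid calling Slurm commands is to read job details from the script's environment. The following is a minimal sketch, assuming a Python script and assuming that SLURM_JOB_ID and SLURM_JOB_USER are set in the script's environment; neither variable is named in this section, and the helper names job_context and log_line are hypothetical.

```python
import os


def job_context(env=None):
    """Collect job details from the environment instead of calling squeue or scontrol."""
    env = os.environ if env is None else env
    return {
        # Both variable names are assumptions; fall back to a placeholder if unset.
        "job_id": env.get("SLURM_JOB_ID", "unknown"),
        "user": env.get("SLURM_JOB_USER", "unknown"),
    }


def log_line(event, env=None):
    """Format a one-line log message without contacting any Slurm daemon."""
    ctx = job_context(env)
    return f"{event} job={ctx['job_id']} user={ctx['user']}"
```

Because the lookup is a plain dictionary read, it adds no load on the controller and returns immediately, which keeps the script short.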

The task prolog is executed with the same environment as the user tasks to be initiated. The standard output of that program is read and processed as follows:
export name=value sets an environment variable for the user task
unset name clears an environment variable from the user task
print ... writes to the task's standard output.
Special treatment is given to the SLURM_PROLOG_CPU_MASK variable when it is set in the task prolog. The variable is interpreted as a comma-separated list of hexadecimal CPU masks. It specifies the CPU(s) to which a task will be bound and is applied using sched_setaffinity. The above functionality is limited to the task prolog script.
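The standard-output protocol above can be sketched as a small Python task prolog. This is an illustration under stated assumptions, not a reference implementation: the variable names SCRATCH_DIR and DISPLAY, the path /tmp/scratch, and all helper names are hypothetical. The cpu_mask helper only shows the bit arithmetic behind a CPU mask; the "0x" prefix and the way the resulting value is handed to Slurm are assumptions that this section does not confirm.

```python
def export_line(name, value):
    """Line that sets an environment variable for the user task."""
    return f"export {name}={value}"


def unset_line(name):
    """Line that clears an environment variable from the user task."""
    return f"unset {name}"


def print_line(text):
    """Line whose text is written to the task's standard output."""
    return f"print {text}"


def cpu_mask(cpus):
    """Hexadecimal bitmask with bit i set for each CPU index i (prefix is an assumption)."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError(f"CPU index must be non-negative: {cpu}")
        mask |= 1 << cpu
    return f"0x{mask:x}"


def mask_list(cpu_sets):
    """Join one mask per CPU set into a comma-separated list."""
    return ",".join(cpu_mask(cpus) for cpus in cpu_sets)


def task_prolog_lines(scratch_dir):
    """Build the lines a task prolog would write to standard output."""
    return [
        export_line("SCRATCH_DIR", scratch_dir),  # hypothetical variable
        unset_line("DISPLAY"),                    # hypothetical choice
        print_line(f"scratch directory is {scratch_dir}"),
    ]


if __name__ == "__main__":
    for line in task_prolog_lines("/tmp/scratch"):
        print(line)
```

For example, mask_list([[0, 1], [2, 3]]) yields "0x3,0xc": CPUs 0 and 1 set the two lowest bits, and CPUs 2 and 3 set the next two.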

Unless otherwise specified, these environment variables are available to all of the programs.

Plugin functions may also be useful to execute logic at various well-defined points.

SPANK is another mechanism that may be useful to invoke logic in the user commands, slurmd daemon, and slurmstepd daemon.

Failure Handling

A script fails when it returns a non-zero exit code. The consequences depend on the script:
If the Epilog fails, the node is set to a DRAIN state.
If the EpilogSlurmctld fails, the failure is only logged.
If the Prolog fails, the node is set to a DRAIN state and the job is requeued in a held state (unless nohold_on_prolog_fail is configured in SchedulerParameters).
If the PrologSlurmctld fails, the job is requeued. Only batch jobs can be requeued; interactive jobs (salloc and srun) are canceled if the PrologSlurmctld fails.

If a task epilog or srun epilog fails (returns a non-zero exit code) this will only be logged. If a task prolog fails (returns a non-zero exit code), the task will be canceled. If the srun prolog fails (returns a non-zero exit code), the step will be canceled.
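Because a non-zero exit code from the Prolog drains the node, the exit code should reflect a real node problem. The following is a minimal sketch of such a check, assuming a Python Prolog and a hypothetical requirement that a scratch directory be writable; the helper names and the directory are illustrative only.

```python
import tempfile


def scratch_is_writable(path):
    """Return True if a temporary file can be created under path."""
    try:
        with tempfile.TemporaryFile(dir=path):
            return True
    except OSError:
        return False


def prolog_exit_code(path):
    """Return 0 to let the job proceed, or 1 to signal Prolog failure."""
    return 0 if scratch_is_writable(path) else 1


# In a real Prolog the script would end with something like:
#     raise SystemExit(prolog_exit_code("/scratch"))   # "/scratch" is hypothetical
```

Per the rules above, returning 1 from the Prolog would set the node to a DRAIN state and requeue the job in a held state, so the check is deliberately limited to a single fast test.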


Based upon work by Jason Sollom, Cray Inc. and used by permission.

Last modified 23 August 2022