Scheduling Configuration Guide

Overview

Slurm performs a quick and simple scheduling attempt on events such as job submission, job completion, and configuration changes. During these event-triggered scheduling attempts, at most default_queue_depth jobs (default 100) are considered.

At less frequent intervals, defined by sched_interval, the main scheduling loop will run, considering all jobs while still honoring the partition_job_depth limit.

In both cases, jobs are evaluated in strict priority order. Once any job or job array task in a partition is left pending, no other jobs in that partition are scheduled, so that resources are not taken from the higher-priority pending job.
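
The strict priority rule can be illustrated with a small sketch. This is not Slurm's implementation: the Job fields, the priority_pass function, and the per-partition idle node counts are hypothetical names chosen for illustration, and partitions are assumed not to share nodes.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    partition: str
    priority: int
    nodes: int  # number of nodes requested


def priority_pass(pending, free_nodes, depth_limit=None):
    """Return the ids of jobs started in one simplified scheduling pass.

    free_nodes maps partition name -> idle node count.
    depth_limit loosely mimics a queue depth limit (hypothetical simplification).
    """
    free = dict(free_nodes)
    blocked = set()  # partitions that already have a pending job left behind
    started = []
    ordered = sorted(pending, key=lambda j: -j.priority)
    if depth_limit is not None:
        ordered = ordered[:depth_limit]
    for job in ordered:
        if job.partition in blocked:
            continue  # a higher-priority job is pending in this partition
        if job.nodes <= free.get(job.partition, 0):
            free[job.partition] -= job.nodes
            started.append(job.job_id)
        else:
            blocked.add(job.partition)
    return started


jobs = [
    Job(1, "debug", 90, 4),
    Job(2, "debug", 50, 1),
    Job(3, "batch", 70, 2),
    Job(4, "batch", 10, 2),
]
free = {"debug": 2, "batch": 4}
print(priority_pass(jobs, free))
```

In this example job 1 cannot start, so job 2 is skipped even though a node is idle in its partition. Starting such jobs safely is what backfill scheduling adds.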

A more comprehensive scheduling attempt is typically done by the backfill scheduling plugin, which considers job run time and resources required to determine if lower-priority jobs would actually take resources needed by higher-priority jobs. This allows the backfill scheduler to assign more specific reasons to pending jobs, or to start jobs that were previously pending.

Scheduling Configuration

The SchedulerType configuration parameter specifies the scheduler plugin to use. Options are sched/backfill, which performs backfill scheduling, and sched/builtin, which attempts to schedule jobs in a strict priority order within each partition/queue.

There is also a SchedulerParameters configuration parameter which can specify a wide range of parameters as described below. This first set of parameters applies to all scheduling configurations. See the slurm.conf(5) man page for more details.
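
As a rough illustration of how these two parameters appear in slurm.conf, the sketch below renders them from a Python dictionary. The render_scheduler_conf helper is hypothetical and is not part of Slurm. The option names are taken from this guide, and the values shown are illustrative only; consult slurm.conf(5) before using them.

```python
def render_scheduler_conf(scheduler_type, params):
    """Render SchedulerType and SchedulerParameters lines for slurm.conf.

    params maps option name -> value; a value of None renders a bare flag.
    """
    parts = [
        name if value is None else f"{name}={value}"
        for name, value in params.items()
    ]
    lines = [f"SchedulerType={scheduler_type}"]
    if parts:
        # SchedulerParameters takes a comma-separated list of options.
        lines.append("SchedulerParameters=" + ",".join(parts))
    return "\n".join(lines)


print(render_scheduler_conf(
    "sched/backfill",
    {"default_queue_depth": 100, "sched_interval": 60, "bf_continue": None},
))
```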

Backfill Scheduling

The backfill scheduling plugin is loaded by default. Without backfill scheduling, each partition is scheduled strictly in priority order, which typically results in significantly lower system utilization and responsiveness than otherwise possible. Backfill scheduling will start lower priority jobs if doing so does not delay the expected start time of any higher priority jobs. Since the expected start time of pending jobs depends upon the expected completion time of running jobs, reasonably accurate time limits are important for backfill scheduling to work well.

Slurm's backfill scheduler takes into consideration every running job. It then considers pending jobs in priority order, determining when and where each will start, taking into consideration the possibility of job preemption, gang scheduling, generic resource (GRES) requirements, memory requirements, etc. If the job under consideration can start immediately without impacting the expected start time of any higher priority job, then it does so. Otherwise the resources required by the job will be reserved during the job's expected execution time. The backfill plugin sets the expected start time for pending jobs and places the reserved nodes into a 'Planned' state. A job's expected start time can be seen using the squeue --start command. For performance reasons, the backfill scheduler reserves whole nodes for jobs, even if jobs don't require whole nodes.
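
The start-or-reserve logic can be sketched as follows. This is a heavily simplified model, not Slurm's algorithm: it assumes identical nodes in a single partition, whole-node allocations, times in minutes, and it ignores preemption, gang scheduling, GRES, and memory. The Job fields and the backfill and node_free functions are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    nodes: int       # whole nodes requested
    time_limit: int  # minutes


def node_free(busy, start, end):
    """True if none of the node's busy intervals overlap [start, end)."""
    return all(end <= b_start or start >= b_end for b_start, b_end in busy)


def backfill(pending, running_until, now=0):
    """Return {job_id: expected start time}.

    pending: jobs in priority order, highest first.
    running_until: per node, when its running job is expected to end
                   (a value <= now means the node is idle).
    """
    busy = [[(now, end)] if end > now else [] for end in running_until]
    starts = {}
    for job in pending:
        # The earliest feasible start is either now or the end of an interval.
        candidates = sorted({now} | {e for node in busy for _, e in node})
        for t in candidates:
            fit = [i for i, node in enumerate(busy)
                   if node_free(node, t, t + job.time_limit)]
            if len(fit) >= job.nodes:
                # Start (t == now) or reserve (t > now) whole nodes.
                for i in fit[:job.nodes]:
                    busy[i].append((t, t + job.time_limit))
                starts[job.job_id] = t
                break
    return starts


# Nodes 0 and 1 are busy until minute 60; nodes 2 and 3 are idle.
pending = [Job(1, 4, 120), Job(2, 2, 30), Job(3, 2, 90)]
print(backfill(pending, [60, 60, 0, 0]))
```

Here job 1 needs all four nodes and is reserved at minute 60. Job 2 finishes before that reservation begins, so it starts immediately on the idle nodes. Job 3 would overlap the reservation, so it is planned after job 1. The example also shows why accurate time limits matter: an overstated limit on job 2 would prevent it from being backfilled.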

The scheduling logic builds a sorted list of job-partition pairs. Jobs submitted to multiple partitions will have as many entries in the list as requested partitions. By default, the backfill scheduler may evaluate all the job-partition pairs for a single job, potentially reserving resources for each pair, but only starting the job in the reservation offering the earliest start time.
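
A minimal sketch of the job-partition pair list follows. The sort key used here (job priority only) is an assumption made for illustration, since the actual ordering is not described in this guide. The Job fields, job_partition_pairs, and pick_partition are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    priority: int
    partitions: list  # names of the partitions the job was submitted to


def job_partition_pairs(jobs):
    """Expand each job into one (job_id, partition) entry per partition.

    Assumption: entries are ordered by job priority, highest first.
    """
    ordered = sorted(jobs, key=lambda j: -j.priority)
    return [(job.job_id, part) for job in ordered for part in job.partitions]


def pick_partition(estimated_starts):
    """Given {partition: expected start time} for one job, pick the earliest."""
    return min(estimated_starts, key=estimated_starts.get)


jobs = [Job(1, 50, ["batch"]), Job(2, 80, ["debug", "batch"])]
print(job_partition_pairs(jobs))
print(pick_partition({"debug": 120, "batch": 45}))
```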

Having a single job reserving resources for multiple partitions could impede other jobs (or hetjob components) from reserving resources already reserved for the partitions that don't offer the earliest start time. A single job that requests multiple partitions can also prevent itself from starting earlier in a lower priority partition if the partitions overlap nodes and a backfill reservation in the higher priority partition blocks nodes that are also in the lower priority partition.

Backfill scheduling is difficult without reasonable time limit estimates for jobs, but some configuration parameters can help.

Backfill scheduling is a time-consuming operation. Locks are released briefly every two seconds so that other operations can be processed, for example new job submission requests. Backfill scheduling can optionally continue execution after the lock release and ignore newly submitted jobs (SchedulerParameters=bf_continue). Doing so will permit consideration of more jobs, but may result in the delayed scheduling of newly submitted jobs. A partial list of SchedulerParameters configuration parameters related to backfill scheduling follows. For more details and a complete list of the backfill related SchedulerParameters see the slurm.conf(5) man page.

Last modified 04 June 2024