File: federation.shtml

<!--#include virtual="header.txt"-->

<h1>Slurm Federated Scheduling Guide</h1>

<ul>
<li><a href="#overview">Overview</a></li>
<li><a href="#configuration">Configuration</a></li>
<li><a href="#jobids">Federated Job IDs</a></li>
<li><a href="#jobsubmission">Job Submission</a></li>
<li><a href="#jobscheduling">Job Scheduling</a></li>
<li><a href="#jobrequeue">Job Requeue</a></li>
<li><a href="#interactivejobs">Interactive Jobs</a></li>
<li><a href="#jobcancel">Canceling Jobs</a></li>
<li><a href="#jobmodify">Job Modification</a></li>
<li><a href="#jobarrays">Job Arrays</a></li>
<li><a href="#statuscmds">Status Commands</a></li>
<li><a href="#glossary">Glossary</a></li>
<li><a href="#limitations">Limitations</a></li>
</ul>

<h2 id="overview">Overview<a class="slurm_link" href="#overview"></a></h2>
<p>Slurm includes support for creating a federation of clusters
and scheduling jobs in a peer-to-peer fashion between them. Jobs submitted to a
federation receive a job ID that is unique among all clusters in the
federation. A job is submitted to the local cluster (the cluster defined in the
slurm.conf) and is then replicated across the clusters in the federation. Each
cluster then independently attempts to schedule the job based on its own
scheduling policies. The clusters coordinate with the "origin" cluster (the
cluster the job was submitted to) to schedule the job.</p>

<p>
<b>NOTE:</b> This is not intended as a high-throughput environment. If
scheduling more than 50,000 jobs a day, consider configuring fewer clusters that
the sibling jobs can be submitted to, or directing load to the local cluster
only (e.g. the --cluster-constraint= or -M submission options could be used to
do this).
</p>
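<p>
As a minimal sketch of the second approach (the cluster name "cluster1" and the
script name are illustrative), the job can be pinned to the local cluster at
submission time:
</p>
<pre>
# keep the job, and its only sibling, on the local cluster
cluster1$ sbatch -M cluster1 script.sh
</pre>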


<h2 id="configuration">Configuration
<a class="slurm_link" href="#configuration"></a>
</h2>
<p>
A federation is created with the sacctmgr command by creating a federation in
the database and adding clusters to it.
</p>

<br>
To create a federation use:
<pre>
sacctmgr add federation &lt;federation_name&gt; [clusters=&lt;list_of_clusters&gt;]
</pre>

Clusters can be added or removed from a federation using:
<br>
<b>NOTE:</b> A cluster can only be a member of one federation at a time.
<pre>
sacctmgr modify federation &lt;federation_name&gt; set clusters[+-]=&lt;list_of_clusters&gt;
sacctmgr modify cluster &lt;cluster_name&gt; set federation=&lt;federation_name&gt;
sacctmgr modify federation &lt;federation_name&gt; set clusters=
sacctmgr modify cluster &lt;cluster_name&gt; set federation=
</pre>
<b>NOTE:</b> If a cluster is removed from a federation without first being
drained, running jobs on the removed cluster, or that originated from the
removed cluster, will continue to run as non-federated jobs. If a job is pending
on the origin cluster, the job will remain pending on the origin cluster as a
non-federated job and the remaining sibling jobs will be removed. If the origin
cluster is being removed and the job is pending and is only viable on one
cluster, then it will remain pending on the viable cluster as a non-federated
job. If the origin cluster is being removed and the job is pending and viable on
multiple clusters other than the origin cluster, then the remaining sibling jobs
will remain pending as federated jobs and the remaining sibling clusters will
schedule amongst themselves to start the job.

<br>
<br>
Federations can be deleted using:
<pre>
sacctmgr delete federation &lt;federation_name&gt;
</pre>


Generic features can be assigned to clusters and can be requested at submission
using the <b>--cluster-constraint=[!]&lt;feature_list&gt;</b> option:
<pre>
sacctmgr modify cluster &lt;cluster_name&gt; set features[+-]=&lt;feature_list&gt;
</pre>
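<p>
For example, a feature can be added to a cluster and later removed again (the
feature name "gpu" and cluster name are illustrative):
</p>
<pre>
# tag cluster1 with the "gpu" feature
sacctmgr modify cluster cluster1 set features+=gpu
# remove the feature again
sacctmgr modify cluster cluster1 set features-=gpu
</pre>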

A cluster's federated state can be set using:
<pre>
sacctmgr modify cluster &lt;cluster_name&gt; set fedstate=&lt;state&gt;
</pre>
where possible states are:
<ul>
	<li><b>ACTIVE:</b> Cluster will actively accept and schedule federated
		jobs</li>
	<li><b>INACTIVE:</b> Cluster will not schedule or accept any jobs</li>
	<li><b>DRAIN:</b> Cluster will not accept any new jobs and will let
		existing federated jobs complete</li>
	<li><b>DRAIN+REMOVE:</b> Cluster will not accept any new jobs and will
		remove itself from the federation once all federated jobs have
		completed. When removed from the federation, the cluster will
		accept jobs as a non-federated cluster</li>
</ul>
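<p>
For example, to stop a cluster from accepting new federated jobs while letting
its existing federated jobs finish (cluster name illustrative):
</p>
<pre>
sacctmgr modify cluster cluster2 set fedstate=DRAIN
</pre>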

Federation configuration can be viewed using:
<pre>
sacctmgr show federation [tree]
sacctmgr show cluster withfed
</pre>

After clusters are added to a federation and the controllers are started, their
status can be viewed from the controller using:
<pre>
scontrol show federation
</pre>


By default the status commands will show a local view. A default federated view
can be set by configuring the following parameter in the slurm.conf:
<pre>
FederationParameters=fed_display
</pre>


<h2 id="jobids">Federated Job IDs<a class="slurm_link" href="#jobids"></a></h2>
When a job is submitted to a federation it gets a federated job ID. Job IDs in
the federation are unique across all clusters in the federation. A federated
job ID is created by packing the origin cluster's ID and the cluster's local
job ID into a single unsigned 32-bit integer:

<pre>
Bits 0-25:  Local Job ID
Bits 26-31: Cluster Origin ID
</pre>

Federated job IDs allow the controllers to know which cluster the job was
submitted to by looking at the cluster origin id of the job.
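<p>
As a rough illustration of the bit layout above (the job ID 67117341 is just an
example value), the two fields can be recovered with simple shell arithmetic:
</p>
<pre>
$ JOBID=67117341
$ echo $(( JOBID >> 26 ))         # origin cluster ID -> 1
$ echo $(( JOBID & 0x3FFFFFF ))   # local job ID      -> 8477
</pre>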


<h2 id="jobsubmission">Job Submission
<a class="slurm_link" href="#jobsubmission"></a>
</h2>
<p>
When a federated cluster receives a job submission, it will submit copies of the
job (<b>sibling jobs</b>) to each eligible cluster. Each cluster will then
independently attempt to schedule the job.
</p>

<p>
Jobs can be directed to specific clusters in the federation using the
<b>-M,--clusters=&lt;cluster_list&gt;</b> and the new
<b>--cluster-constraint=[!]&lt;constraint_list&gt;</b> options.
</p>

<p>
Using the <b>-M,--clusters=&lt;cluster_list&gt;</b> option, the submission command
(sbatch, salloc, srun) will pick one cluster from the list of clusters to submit
the job to and will also pass along the list of clusters with the job. The
clusters in the list will be the only viable clusters that sibling jobs can be
submitted to. For example, the submission:
<pre>
cluster1$ sbatch -Mcluster2,cluster3 script.sh
</pre>
will submit the job to either cluster2 or cluster3 and will only submit sibling
jobs to cluster2 and cluster3 even if there are more clusters in the federation.
</p>

<p>
Using the <b>--cluster-constraint=[!]&lt;constraint_list&gt;</b> option will
submit sibling jobs to only the clusters that have the requested cluster
feature(s) -- or don't have the feature(s) if using <b>!</b>. Cluster features
are added using the <b>sacctmgr modify cluster &lt;cluster_name&gt; set
features[+-]=&lt;feature_list&gt;</b> option.
</p>
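<p>
A minimal sketch, assuming some clusters in the federation have been given a
"highmem" feature (an illustrative name):
</p>
<pre>
# sibling jobs go only to clusters tagged with the "highmem" feature
sbatch --cluster-constraint=highmem script.sh
# sibling jobs go only to clusters without the feature (note the quoting)
sbatch --cluster-constraint='!highmem' script.sh
</pre>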

<p>
<b>NOTE:</b> When using the <b>!</b> option, add quotes around the option to
prevent the shell from interpreting the <b>!</b> (e.g.
--cluster-constraint='!highmem').
</p>


<p>
When using both the <b>--cluster-constraint=</b> and
<b>--clusters=</b> options together, the origin cluster will only submit
sibling jobs to clusters that meet both requirements.
</p>
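<p>
As an illustrative combination of the two options (cluster and feature names
hypothetical), sibling jobs would only be submitted to whichever of cluster1
and cluster2 also has the "gpu" feature:
</p>
<pre>
sbatch -Mcluster1,cluster2 --cluster-constraint=gpu script.sh
</pre>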

<p>
Held or dependent jobs are kept on the origin cluster until they are released
or are no longer dependent, at which time they are submitted to other viable
clusters in the federation. If a job becomes held or dependent
after being submitted, the job is removed from every cluster but the origin.
</p>
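<p>
For example, a job submitted with a hold stays only on the origin cluster until
it is released, at which point sibling jobs are submitted to the other viable
clusters (the job ID shown is illustrative):
</p>
<pre>
cluster1$ sbatch -H script.sh
Submitted batch job 67117341
cluster1$ scontrol release 67117341
</pre>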

<h2 id="jobscheduling">Job Scheduling
<a class="slurm_link" href="#jobscheduling"></a>
</h2>
<p>
Each cluster in the federation independently attempts to schedule each job, with
the exception of coordinating with the <b>origin cluster</b> (the cluster the
job was submitted to) to allocate resources to a federated job. When a cluster
determines it can attempt to allocate resources for a job it communicates with
the origin cluster to verify that no other cluster is attempting to allocate
resources at the same time. If no other cluster is attempting to allocate
resources the cluster will attempt to allocate resources for the job. If it
succeeds then it will notify the origin cluster that it started the job and the
origin cluster will notify the clusters with sibling jobs to remove the sibling
jobs and put them in a <b>revoked</b> state. If the cluster was unable to
allocate resources to the job then it lets the origin cluster know so that other
clusters can attempt to schedule the job. If it was the main scheduler
attempting to allocate resources then the main scheduler will stop looking at
further jobs in the job's partition. If it was the backfill scheduler attempting
to allocate resources then the resources will be reserved for the job.
</p>

<p>
If an origin cluster is down, then the remote siblings will coordinate with a
job's viable siblings to schedule the job. When the origin cluster comes back
up, it will sync with the other siblings.
</p>

<h2 id="jobrequeue">Job Requeue
<a class="slurm_link" href="#jobrequeue"></a>
</h2>
<p>
When a federated job is requeued, the origin cluster is notified. The origin
cluster then submits new sibling jobs to the viable clusters, and the federated
job is eligible to start on a different cluster than the one it previously ran on.
</p>
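<p>
For example (job ID illustrative), a requeue issued from any cluster is handled
through the origin cluster, which re-submits the sibling jobs:
</p>
<pre>
scontrol requeue 67117341
</pre>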

<p>
slurm.conf options <b>RequeueExit</b> and <b>RequeueExitHold</b> are controlled
by the origin cluster.
</p>


<h2 id="interactivejobs">Interactive Jobs
<a class="slurm_link" href="#interactivejobs"></a>
</h2>
<p>
Interactive jobs -- jobs submitted with srun and salloc -- can be submitted to
the local cluster and get an allocation from a different cluster. When an salloc
job allocation is granted by a cluster other than the local cluster, a new
environment variable, SLURM_WORKING_CLUSTER, will be set with the remote sibling
cluster's IP address, port and RPC version so that any sruns will know which
cluster to communicate with.
</p>
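<p>
A hedged sketch (cluster names illustrative): request an allocation that may be
granted by a remote sibling, then check where it landed. SLURM_WORKING_CLUSTER
is only set when the allocation came from a cluster other than the local one:
</p>
<pre>
cluster1$ salloc -M cluster2 -N1
cluster1$ echo $SLURM_WORKING_CLUSTER
</pre>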

<p>
<b>NOTE:</b> All compute nodes must be accessible to all submission hosts for
this to work.
<br>
<b>NOTE:</b> The current implementation of the MPI interfaces in Slurm requires
SlurmdSpoolDir to be the same on the host where the srun is being run as it
is on the compute nodes in the allocation. If they aren't, a workaround is to
get an allocation that puts the user on the actual compute node. Then the sruns
on the compute nodes will be using the slurm.conf that corresponds to the
correct cluster. Setting <b>LaunchParameters=use_interactive_step</b> in
slurm.conf will put the user on an actual compute node when using salloc.
</p>


<h2 id="jobcancel">Canceling Jobs
<a class="slurm_link" href="#jobcancel"></a>
</h2>
<p>
Cancel requests in the federation will cancel the running sibling job or all
pending sibling jobs. Specific pending sibling jobs can be removed by using
<b>scancel</b>'s <b>--sibling=&lt;cluster_name&gt;</b> option to remove the
sibling job from the job's active sibling list.
</p>
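<p>
For example (cluster name and job ID illustrative), a single pending sibling can
be dropped without canceling the whole federated job:
</p>
<pre>
scancel --sibling=cluster3 67117341
</pre>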


<h2 id="jobmodify">Job Modification
<a class="slurm_link" href="#jobmodify"></a>
</h2>
Job modifications are routed to the origin cluster, which then pushes the
changes out to each sibling job.
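<p>
A minimal sketch (job ID and new limit illustrative); the update is routed
through the origin cluster and applied to every sibling job:
</p>
<pre>
scontrol update jobid=67117341 timelimit=2:00:00
</pre>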


<h2 id="jobarrays">Job Arrays<a class="slurm_link" href="#jobarrays"></a></h2>
Currently, job arrays only run on the origin cluster.


<h2 id="statuscmds">Status Commands
<a class="slurm_link" href="#statuscmds"></a>
</h2>
<p>
By default, status commands such as squeue, sinfo, sprio, sacct and sreport
show a view local to the local cluster. A unified view of the jobs in the
federation can be obtained using the <b>--federation</b> option to each status
command. The <b>--federation</b> option causes the status command to first
check whether the local cluster is part of a federation. If it is, then the command
will query each cluster in parallel for job info and will combine the
information into one unified view.
</p>

<p>
A new <b>FederationParameters=fed_display</b> slurm.conf parameter has been
added so that all status commands will present a federated view by default --
equivalent to setting the <b>--federation</b> option for each status command.
The federated view can be overridden using the <b>--local</b> option. Using the
<b>--clusters,-M</b> option will also override the federated view and give a
local view for the given cluster(s).
</p>

<p>
Using the existing <b>--clusters,-M</b> option, the status commands will output
the information in the same format as before, where each cluster's
information is listed separately.
</p>

<h3 id="squeue">squeue<a class="slurm_link" href="#squeue"></a></h3>
<p>
squeue also has a new <b>--sibling</b> option that will show each sibling job
rather than merge them into one.
</p>

<p>
Several new long format options have been added to display the job's federated
information:
<ul>
	<li>
		<b>cluster:</b> Name of the cluster that is running the job or
		job step.
	</li>
	<li>
		<b>siblingsactive:</b> Cluster names where federated sibling
		jobs exist.
	</li>
	<li>
		<b>siblingsactiveraw:</b> Cluster IDs where federated
		sibling jobs exist.
	</li>
	<li>
		<b>siblingsviable:</b> Cluster names where federated sibling
		jobs are viable to run.
	</li>
	<li>
		<b>siblingsviableraw:</b> Cluster IDs where federated
		sibling jobs are viable to run.
	</li>
</ul>
</p>

<p>
squeue output can be sorted using the <b>-S cluster</b> option.
</p>
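<p>
A hedged example combining these options (the output fields chosen are just one
possibility):
</p>
<pre>
# one merged view of the federation's jobs, sorted by cluster
squeue --federation -S cluster
# show every sibling job separately, with its sibling clusters listed
squeue --federation --sibling -O jobid,cluster,state,siblingsviable,siblingsactive
</pre>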


<h3 id="sinfo">sinfo<a class="slurm_link" href="#sinfo"></a></h3>
<p>
sinfo will show the partitions from each cluster in one view. In a federated
view, the cluster name is displayed with each partition. The cluster name can be
specified in the format options using the short format <b>%V</b> or the long
format <b>cluster</b> options. The output can be sorted by cluster names using
the <b>-S %[+-]V</b> option.
</p>
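<p>
For instance (the format string is illustrative), the cluster name can be
included alongside the usual partition columns:
</p>
<pre>
sinfo --federation -o "%V %P %a %D %t"
</pre>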

<h3 id="sprio">sprio<a class="slurm_link" href="#sprio"></a></h3>
<p>
In a federated view, sprio displays the job information from the local cluster
or from the first cluster to report the job. Since each sibling job could have a
different priority on each cluster it may be helpful to use the <b>--sibling</b>
option to show all records of a job to get a better picture of a job’s priority.
The name of the cluster reporting the job record can be displayed using the
<b>%c</b> format option. The cluster name is shown by default when using
<b>--sibling</b> option.
</p>

<h3 id="sacct">sacct<a class="slurm_link" href="#sacct"></a></h3>
<p>
By default, sacct will not display "revoked" jobs and will show the job from the
cluster that ran the job. However, "revoked" jobs can be viewed using the
<b>--duplicate/-D</b> option.
</p>
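<p>
For example (job ID illustrative), the revoked sibling records can be shown
alongside the record from the cluster that actually ran the job:
</p>
<pre>
sacct -D -j 67117341
</pre>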

<h3 id="sreport">sreport<a class="slurm_link" href="#sreport"></a></h3>
<p>
sreport will combine the reports from each cluster and display them as one.
</p>

<h3 id="scontrol">scontrol<a class="slurm_link" href="#scontrol"></a></h3>
<p>
The following scontrol options will display a federated view:
<ul>
	<li>show [--federation|--sibling] jobs</li>
	<li>show [--federation] steps</li>
	<li>completing</li>
</ul>
</p>

<p>
The following scontrol options are handled in a federation. If the command is
run from a cluster other than the origin cluster, it will be routed to the
origin cluster.
<ul>
	<li>hold</li>
	<li>uhold</li>
	<li>release</li>
	<li>requeue</li>
	<li>requeuehold</li>
	<li>suspend</li>
	<li>update job</li>
</ul>
</p>

<p>
All other scontrol options should be directed to the specific cluster either by
issuing the command on the cluster or using the <b>--cluster/-M</b> option.
</p>




<h2 id="glossary">Glossary<a class="slurm_link" href="#glossary"></a></h2>
<ul>
	<li>
		<b>Federated Job:</b> A job that is submitted to the federated
		cluster. It has a unique job ID across all clusters (Origin
		Cluster ID + Local Job ID).
	</li>
	<li>
		<b>Sibling Job:</b> A copy of the federated job that is
		submitted to other federated clusters.
	</li>
	<li>
		<b>Local Cluster:</b> The cluster found in the slurm.conf that
		the commands will talk to by default.
	</li>
	<li>
		<b>Origin Cluster:</b> The cluster that the federated job was
		originally submitted to. The origin cluster submits sibling jobs
		to other clusters in the federation. The origin cluster
		determines whether a sibling job can run or not. Communications
		for the federated job are routed through the origin cluster.
	</li>
	<li>
		<b>Sibling Cluster:</b> The cluster that is associated with a
		sibling job.
	</li>
	<li>
		<b>Origin Job:</b> The federated job that resides on the cluster
		that it was originally submitted to.
	</li>
	<li>
		<b>Revoked (RV) State:</b> The state that the origin job is in
		while the origin job is not actively being scheduled on the
		origin cluster (e.g. not a viable sibling or one of the sibling
		jobs is running on a remote cluster). Or the state that a remote
		sibling job is put in when another sibling is allocated nodes.
	</li>
	<li>
		<b>Viable Sibling:</b> a cluster that is eligible to run a
		sibling job based on the requested clusters, cluster
		features, and the state of the cluster (e.g. active, draining, etc.).
	</li>
	<li>
		<b>Active Sibling:</b> a cluster that has an active
		sibling job and is able to schedule the job.
	</li>
</ul>

<h2 id="limitations">Limitations
<a class="slurm_link" href="#limitations"></a>
</h2>
<ul>
	<li>A federated job that fails due to resources (partition, node counts,
		etc.) on the local cluster will be rejected and won't be
		submitted to other sibling clusters even if it could run on
		them.</li>
	<li>Job arrays only run on the cluster that they were submitted to.</li>
	<li>Job modification must succeed on the origin cluster for the changes
		to be pushed to the sibling jobs on remote clusters. </li>
	<li>Modifications to anything other than jobs are disabled in sview.</li>
	<li>sview grid is disabled in a federated view.</li>
</ul>

<p style="text-align:center;">Last modified 9 June 2021</p>

<!--#include virtual="footer.txt"-->