File: topology.shtml

<!--#include virtual="header.txt"-->

<h1>Topology Guide</h1>

<p>Slurm can be configured to support topology-aware resource
allocation to optimize job performance.
Slurm supports several modes of operation, one to optimize performance on
systems with a three-dimensional torus interconnect and another for
a hierarchical interconnect.
The hierarchical mode of operation supports both fat-tree and dragonfly networks,
using slightly different algorithms.</p>

<p>Slurm's native mode of resource selection is to consider the nodes
as a one-dimensional array.
Jobs are allocated resources on a best-fit basis.
For larger jobs, this minimizes the number of sets of consecutive nodes
allocated to the job.</p>

<h2 id="topo_3d">Three-dimension Topology
<a class="slurm_link" href="#topo_3d"></a>
</h2>

<p>Some larger computers rely upon a three-dimensional torus interconnect.
Cray XT and XE systems have three-dimensional torus interconnects, but do
not require that jobs execute on adjacent nodes.
On those systems, Slurm only needs to allocate resources to a job that are
nearby on the network.
Slurm accomplishes this using a
<a href="http://en.wikipedia.org/wiki/Hilbert_curve">Hilbert curve</a>
to map the nodes from a three-dimensional space into a one-dimensional
space.
Slurm's native best-fit algorithm is thus able to achieve a high degree
of locality for jobs.</p>

<h2 id="hierarchical">Hierarchical Networks
<a class="slurm_link" href="#hierarchical"></a>
</h2>

<p>Slurm can also be configured to allocate resources to jobs on a
hierarchical network to minimize network contention.
The basic algorithm is to identify the lowest level switch in the
hierarchy that can satisfy a job's request and then allocate resources
on its underlying leaf switches using a best-fit algorithm.
Use of this logic requires a configuration setting of
<i>TopologyPlugin=topology/tree</i>.</p>
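
<p>For example, enabling this logic requires only a single line in
<i>slurm.conf</i> (the rest of the file is unchanged):</p>
<pre>
# slurm.conf (excerpt)
TopologyPlugin=topology/tree
</pre>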

<p>Note that Slurm uses a best-fit algorithm on the currently
available resources. This may result in an allocation with
more than the optimum number of switches. The user can request
a maximum number of leaf switches for the job as well as the
maximum time the job is willing to wait for that number using the
<i>--switches</i> option with the salloc, sbatch and srun commands.
The parameters can also be changed for pending jobs using the scontrol
and squeue commands.</p>
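
<p>As a sketch of how this looks in practice (the node count, switch count,
wait time, job ID and script name are placeholders; the <b>count[@time]</b>
syntax is described under <a href="#user_opts">User Options</a> below):</p>
<pre>
# Ask for at most one leaf switch, waiting up to 60 minutes for it
sbatch --switches=1@60 -N 8 my_job.sh

# Relax the switch request on a pending job
scontrol update JobId=1234 Switches=2
</pre>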

<p>At some point in the future, Slurm code may be provided to
gather network topology information directly.
For now, the network topology information must be included
in a <i>topology.conf</i> configuration file as shown in the
examples below.
The first example describes a three-level switch hierarchy in which
each switch has two children.
Note that the <i>SwitchName</i> values are arbitrary and only
used for bookkeeping purposes, but a name must be specified on
each line.
The leaf switch descriptions contain a <i>SwitchName</i> field
plus a <i>Nodes</i> field to identify the nodes connected to the
switch.
Higher-level switch descriptions contain a <i>SwitchName</i> field
plus a <i>Switches</i> field to identify the child switches.
Slurm's hostlist expression parser is used, so the node and switch
names need not be consecutive (e.g. "Nodes=tux[0-3,12,18-20]"
and "Switches=s[0-2,4-8,12]" will parse fine).
</p>

<p>An optional <i>LinkSpeed</i> parameter can be used to indicate the
relative performance of the link.
The units are arbitrary and this information is currently not used.
It may be used in the future to optimize resource allocations.</p>

<p>The first example shows what a topology would look like for an
eight-node cluster in which all switches have only two children, as
shown in the diagram (not a very realistic configuration, but
useful for an example).</p>

<pre>
# topology.conf
# Switch Configuration
SwitchName=s0 Nodes=tux[0-1]
SwitchName=s1 Nodes=tux[2-3]
SwitchName=s2 Nodes=tux[4-5]
SwitchName=s3 Nodes=tux[6-7]
SwitchName=s4 Switches=s[0-1]
SwitchName=s5 Switches=s[2-3]
SwitchName=s6 Switches=s[4-5]
</pre>
<img src="topo_ex1.gif" width="600">

<p>The next example is for a network with two switch levels in which
each switch has four connections.</p>
<pre>
# topology.conf
# Switch Configuration
SwitchName=s0 Nodes=tux[0-3]   LinkSpeed=900
SwitchName=s1 Nodes=tux[4-7]   LinkSpeed=900
SwitchName=s2 Nodes=tux[8-11]  LinkSpeed=900
SwitchName=s3 Nodes=tux[12-15] LinkSpeed=1800
SwitchName=s4 Switches=s[0-3]  LinkSpeed=1800
SwitchName=s5 Switches=s[0-3]  LinkSpeed=1800
SwitchName=s6 Switches=s[0-3]  LinkSpeed=1800
SwitchName=s7 Switches=s[0-3]  LinkSpeed=1800
</pre>
<img src="topo_ex2.gif" width="600">

<p>As a practical matter, listing every switch connection
results in a slower scheduling algorithm as Slurm works
to optimize job placement, while application performance may gain
little benefit from that optimization.
Listing the leaf switches with their nodes plus one top-level switch
should result in good performance for both applications and Slurm.
The previous example might be configured as follows:</p>
<pre>
# topology.conf
# Switch Configuration
SwitchName=s0 Nodes=tux[0-3]
SwitchName=s1 Nodes=tux[4-7]
SwitchName=s2 Nodes=tux[8-11]
SwitchName=s3 Nodes=tux[12-15]
SwitchName=s4 Switches=s[0-3]
</pre>

<p>Note that compute nodes on switches that lack a common parent switch can
be used, but no job will span leaf switches without a common parent
(unless the <i>TopologyParam=TopoOptional</i> option is used).
For example, it is legal to remove the line "SwitchName=s4 Switches=s[0-3]"
from the above topology.conf file.
In that case, no job will span more than the four compute nodes connected
to any single leaf switch.
This configuration can be useful if one wants to schedule multiple physical
clusters as a single logical cluster under the control of a single slurmctld
daemon.</p>
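
<p>For reference, the resulting topology.conf would then contain only the
leaf switch definitions:</p>
<pre>
# topology.conf
# Leaf switches only; no job will span more than one of them
SwitchName=s0 Nodes=tux[0-3]
SwitchName=s1 Nodes=tux[4-7]
SwitchName=s2 Nodes=tux[8-11]
SwitchName=s3 Nodes=tux[12-15]
</pre>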

<p>If you have nodes that are in separate networks and are associated with
unique switches in your <b>topology.conf</b> file, it is possible to get into
a situation where a job is unable to run. If a job requests nodes that are in
different networks, either by requesting the nodes directly or by requesting
a feature, the job will fail because the requested nodes cannot communicate
with each other. We recommend placing nodes that are in separate network
segments into disjoint partitions.</p>
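
<p>A minimal sketch of that recommendation in <i>slurm.conf</i>, assuming for
illustration that the two network segments correspond to leaf switches s0 and
s1 above (the partition names are made up):</p>
<pre>
# slurm.conf (excerpt): one partition per network segment
PartitionName=net0 Nodes=tux[0-3] Default=YES
PartitionName=net1 Nodes=tux[4-7]
</pre>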

<p>For systems with a dragonfly network, configure Slurm with
<i>TopologyPlugin=topology/tree</i> plus <i>TopologyParam=dragonfly</i>.
If a job cannot be placed entirely within a single network leaf
switch, the job will be spread across as many leaf switches as possible
in order to optimize the job's network bandwidth.</p>
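
<p>For example, the relevant <i>slurm.conf</i> lines for a dragonfly system
would be:</p>
<pre>
# slurm.conf (excerpt)
TopologyPlugin=topology/tree
TopologyParam=dragonfly
</pre>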

<p><b>NOTE</b>: When using the <i>topology/tree</i> plugin, Slurm identifies
the network switches which provide the best fit for pending jobs. If nodes
have a <i>Weight</i> defined, this will override the resource selection based
on network topology. If optimizing resource selection by node weight is more
important than optimizing network topology, then do NOT use the
<i>topology/tree</i> plugin.</p>
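
<p>For reference, a node weight is assigned with the <i>Weight</i> parameter
on a node definition in <i>slurm.conf</i>, e.g. (node names and values are
purely illustrative):</p>
<pre>
# slurm.conf (excerpt); a defined Weight takes precedence over
# topology-based selection
NodeName=tux[0-15] CPUs=4 Weight=10
</pre>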

<h3 id="config_generators">Configuration Generators
<a class="slurm_link" href="#config_generators"></a></h3>

<p>The following independently maintained tools may be useful in generating the
<b>topology.conf</b> file for certain switch types:</p>

<ul>
<li>Infiniband switch - <b>slurmibtopology</b><br>
<a href="https://github.com/OleHolmNielsen/Slurm_tools/tree/master/slurmibtopology">
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/slurmibtopology</a></li>
<li>Omni-Path (OPA) switch - <b>opa2slurm</b><br>
<a href="https://gitlab.com/jtfrey/opa2slurm">
https://gitlab.com/jtfrey/opa2slurm</a></li>
<li>AWS Elastic Fabric Adapter (EFA) - <b>ec2-topology</b><br>
<a href="https://github.com/aws-samples/ec2-topology-aware-for-slurm">
https://github.com/aws-samples/ec2-topology-aware-for-slurm</a></li>
</ul>

<h2 id="user_opts">User Options<a class="slurm_link" href="#user_opts"></a></h2>

<p>For use with the <b>topology/tree</b> plugin, a user can also specify the
maximum number of leaf switches to be used for their job along with the
maximum time the job should wait for this optimized configuration. The syntax
for this option is <b>--switches=count[@time]</b>.
The system administrator can limit the maximum time that any job can
wait for this optimized configuration using the <b>SchedulerParameters</b>
configuration parameter with the
<a href="slurm.conf.html#OPT_max_switch_wait=#">max_switch_wait</a> option.</p>

<p>When <b>topology/tree</b> or <b>topology/block</b> is configured, hostlist
functions may be used in place of or alongside regular hostlist expressions
in commands or configuration files that interact with the slurmctld. Valid
topology functions include:</p>

<ul>
<li><b>block{blockX}</b> and <b>switch{switchY}</b> - expand to all nodes in
	the specified block/switch.</li>
<li><b>blockwith{nodeX}</b> and <b>switchwith{nodeY}</b> - expand to all nodes
	in the same block/switch as the specified node.</li>
</ul>

<p>For example:</p>

<pre>
scontrol update node=block{b1} state=resume
sbatch --nodelist=blockwith{node0} -N 10 program
PartitionName=Block10 Nodes=block{block10} ...
</pre>

<p>See also the hostlist function <b>feature{myfeature}</b>
<a href="slurm.conf.html#OPT_Features">here</a>.</p>

<h2 id="env_vars">Environment Variables
<a class="slurm_link" href="#env_vars"></a>
</h2>

<p>If the topology/tree plugin is used, two environment variables will be set
to describe the job's network topology. Note that these environment variables
will contain different data for the tasks launched on each node. Use of these
environment variables is at the discretion of the user.</p>

<p><b>SLURM_TOPOLOGY_ADDR</b>:
The value will be set to the names of the network switches that may be involved
in the job's communications, from the system's top-level switch down to the
leaf switch, ending with the node name. A period is used to separate each
hardware component name.</p>

<p><b>SLURM_TOPOLOGY_ADDR_PATTERN</b>:
This is set only if the system has the topology/tree plugin configured.
The value will be set to the component types listed in SLURM_TOPOLOGY_ADDR.
Each component will be identified as either "switch" or "node".
A period is used to separate each hardware component type.</p>
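
<p>As an illustration using the first example topology above, a task running
on tux1 (reached through switches s6, s4 and s0) might see values like:</p>
<pre>
SLURM_TOPOLOGY_ADDR=s6.s4.s0.tux1
SLURM_TOPOLOGY_ADDR_PATTERN=switch.switch.switch.node
</pre>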

<p style="text-align:center;">Last modified 13 November 2024</p>

<!--#include virtual="footer.txt"-->