<!--#include virtual="header.txt"-->
<h1>Topology Guide</h1>
<p>Slurm can be configured to support topology-aware resource
allocation to optimize job performance.
Slurm supports several modes of operation, one to optimize performance on
systems with a three-dimensional torus interconnect and another for
a hierarchical interconnect.
The hierarchical mode of operation supports both fat-tree and dragonfly networks,
using slightly different algorithms.</p>
<p>Slurm's native mode of resource selection is to consider the nodes
as a one-dimensional array.
Jobs are allocated resources on a best-fit basis.
For larger jobs, this minimizes the number of sets of consecutive nodes
allocated to the job.</p>
<h2 id="topo_3d">Three-dimension Topology
<a class="slurm_link" href="#topo_3d"></a>
</h2>
<p>Some larger computers rely upon a three-dimensional torus interconnect.
The Cray XT and XE systems, for example, have three-dimensional
torus interconnects, but do not require that jobs execute on adjacent nodes.
On those systems, Slurm only needs to allocate resources to a job which
are nearby on the network.
Slurm accomplishes this using a
<a href="http://en.wikipedia.org/wiki/Hilbert_curve">Hilbert curve</a>
to map the nodes from a three-dimensional space into a one-dimensional
space.
Slurm's native best-fit algorithm is thus able to achieve a high degree
of locality for jobs.</p>
<h2 id="hierarchical">Hierarchical Networks
<a class="slurm_link" href="#hierarchical"></a>
</h2>
<p>Slurm can also be configured to allocate resources to jobs on a
hierarchical network to minimize network contention.
The basic algorithm is to identify the lowest level switch in the
hierarchy that can satisfy a job's request and then allocate resources
on its underlying leaf switches using a best-fit algorithm.
Use of this logic requires a configuration setting of
<i>TopologyPlugin=topology/tree</i>.</p>
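<p>A minimal slurm.conf fragment enabling this mode might look like the
following sketch (only the topology line is required by the text above;
the comment is explanatory):</p>
<pre>
# slurm.conf (fragment)
# Enable topology-aware scheduling on a hierarchical (e.g. fat-tree) network
TopologyPlugin=topology/tree
</pre>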
<p>Note that Slurm uses a best-fit algorithm on the currently
available resources. This may result in an allocation with
more than the optimum number of switches. The user can request
a maximum number of leaf switches for the job, as well as a
maximum time the job is willing to wait for that number, using the
<i>--switches</i> option with the salloc, sbatch and srun commands.
These parameters can also be changed for pending jobs using the
scontrol command.</p>
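<p>As an illustrative sketch (the job ID, node count and time values are
hypothetical), a job might request at most two leaf switches and wait up to
60 seconds for such a placement, and the request could later be adjusted
while the job is still pending:</p>
<pre>
# Request at most 2 leaf switches, waiting up to 60 seconds for that layout
sbatch --switches=2@60 -N 16 my_job.sh

# Change the switch request of a pending job (job ID is hypothetical)
scontrol update JobId=1234 Switches=1@30
</pre>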
<p>At some point in the future, Slurm code may be provided to
gather network topology information directly.
For now, the network topology information must be included
in a <i>topology.conf</i> configuration file, as shown in the
examples below.
The first example describes a three-level switch hierarchy in which
each switch has two children.
Note that the <i>SwitchName</i> values are arbitrary and only
used for bookkeeping purposes, but a name must be specified on
each line.
The leaf switch descriptions contain a <i>SwitchName</i> field
plus a <i>Nodes</i> field to identify the nodes connected to the
switch.
Higher-level switch descriptions contain a <i>SwitchName</i> field
plus a <i>Switches</i> field to identify the child switches.
Slurm's hostlist expression parser is used, so the node and switch
names need not be consecutive (e.g. "Nodes=tux[0-3,12,18-20]"
and "Switches=s[0-2,4-8,12]" will parse fine).
</p>
<p>An optional <i>LinkSpeed</i> parameter can be used to indicate the
relative performance of the link.
The units used are arbitrary and this information is currently not used.
It may be used in the future to optimize resource allocations.</p>
<p>The first example shows what the topology would look like for an
eight-node cluster in which every switch has only two children, as
shown in the diagram (not a very realistic configuration, but
useful as an example).</p>
<pre>
# topology.conf
# Switch Configuration
SwitchName=s0 Nodes=tux[0-1]
SwitchName=s1 Nodes=tux[2-3]
SwitchName=s2 Nodes=tux[4-5]
SwitchName=s3 Nodes=tux[6-7]
SwitchName=s4 Switches=s[0-1]
SwitchName=s5 Switches=s[2-3]
SwitchName=s6 Switches=s[4-5]
</pre>
<img src=topo_ex1.gif width=600>
<p>The next example describes a network with two levels of switches in
which each switch has four connections.</p>
<pre>
# topology.conf
# Switch Configuration
SwitchName=s0 Nodes=tux[0-3] LinkSpeed=900
SwitchName=s1 Nodes=tux[4-7] LinkSpeed=900
SwitchName=s2 Nodes=tux[8-11] LinkSpeed=900
SwitchName=s3 Nodes=tux[12-15] LinkSpeed=1800
SwitchName=s4 Switches=s[0-3] LinkSpeed=1800
SwitchName=s5 Switches=s[0-3] LinkSpeed=1800
SwitchName=s6 Switches=s[0-3] LinkSpeed=1800
SwitchName=s7 Switches=s[0-3] LinkSpeed=1800
</pre>
<img src=topo_ex2.gif width=600>
<p>As a practical matter, listing every switch connection
results in a slower scheduling algorithm as Slurm
works to optimize job placement, and
application performance may gain little from such optimization.
Listing only the leaf switches with their nodes plus one top-level switch
should result in good performance for both applications and Slurm.
The previous example might be configured as follows:</p>
<pre>
# topology.conf
# Switch Configuration
SwitchName=s0 Nodes=tux[0-3]
SwitchName=s1 Nodes=tux[4-7]
SwitchName=s2 Nodes=tux[8-11]
SwitchName=s3 Nodes=tux[12-15]
SwitchName=s4 Switches=s[0-3]
</pre>
<p>Note that compute nodes on switches that lack a common parent switch can
be used, but no job will span leaf switches without a common parent
(unless the TopologyParam=TopoOptional option is used).
For example, it is legal to remove the line "SwitchName=s4 Switches=s[0-3]"
from the above topology.conf file.
In that case, no job will span more than one leaf switch, so each job is
limited to the four compute nodes connected to a single leaf switch.
This configuration can be useful if one wants to schedule multiple physical
clusters as a single logical cluster under the control of a single slurmctld
daemon.</p>
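<p>As noted above, jobs can be permitted to span leaf switches that lack a
common parent by setting <i>TopologyParam=TopoOptional</i>. A minimal
slurm.conf sketch:</p>
<pre>
# slurm.conf (fragment)
TopologyPlugin=topology/tree
# TopoOptional relaxes the requirement that a job's nodes share a
# common parent switch (see the note above)
TopologyParam=TopoOptional
</pre>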
<p>If you have nodes that are in separate networks and are associated with
unique switches in your <b>topology.conf</b> file, it's possible that you
could get in a situation where a job isn't able to run. If a job requests
nodes that are in different networks, either by requesting the nodes
directly or by requesting a feature, the job will fail because the requested
nodes can't communicate with each other. We recommend placing nodes in
separate network segments in disjoint partitions.</p>
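<p>For example, a sketch of how two separate network segments might be kept
in disjoint partitions (node and partition names are hypothetical):</p>
<pre>
# slurm.conf (fragment) -- hypothetical node and partition names
# tux[0-15] and frost[0-15] are on separate networks, so they are placed
# in disjoint partitions rather than mixed into a single partition.
PartitionName=tux_part   Nodes=tux[0-15]
PartitionName=frost_part Nodes=frost[0-15]
</pre>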
<p>For systems with a dragonfly network, configure Slurm with
<i>TopologyPlugin=topology/tree</i> plus <i>TopologyParam=dragonfly</i>.
If a single job can not be entirely placed within a single network leaf
switch, the job will be spread across as many leaf switches as possible
in order to optimize the job's network bandwidth.</p>
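<p>A corresponding slurm.conf fragment might look like the following
(a sketch; the two topology lines are the settings named above):</p>
<pre>
# slurm.conf (fragment)
TopologyPlugin=topology/tree
TopologyParam=dragonfly
</pre>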
<p><b>NOTE</b>: When using the <i>topology/tree</i> plugin, Slurm identifies
the network switches which provide the best fit for pending jobs. If nodes
have a <i>Weight</i> defined, this will override the resource selection based
on network topology. If optimizing resource selection by node weight is more
important than optimizing network topology then do NOT use the
<i>topology/tree</i> plugin.</p>
<h3 id="config_generators">Configuration Generators
<a class="slurm_link" href="#config_generators"></a></h3>
<p>The following independently maintained tools may be useful in generating the
<b>topology.conf</b> file for certain switch types:</p>
<ul>
<li>InfiniBand switch - <b>slurmibtopology</b><br>
<a href="https://github.com/OleHolmNielsen/Slurm_tools/tree/master/slurmibtopology">
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/slurmibtopology</a></li>
<li>Omni-Path (OPA) switch - <b>opa2slurm</b><br>
<a href="https://gitlab.com/jtfrey/opa2slurm">
https://gitlab.com/jtfrey/opa2slurm</a></li>
<li>AWS Elastic Fabric Adapter (EFA) - <b>ec2-topology</b><br>
<a href="https://github.com/aws-samples/ec2-topology-aware-for-slurm">
https://github.com/aws-samples/ec2-topology-aware-for-slurm</a></li>
</ul>
<h2 id="user_opts">User Options<a class="slurm_link" href="#user_opts"></a></h2>
<p>For use with the <b>topology/tree</b> plugin, the user can also specify the
maximum number of leaf switches to be used for their job, along with the
maximum time the job should wait for this optimized configuration. The syntax
for this option is <b>--switches=count[@time]</b>.
The system administrator can limit the maximum time that any job can
wait for this optimized configuration using the <b>SchedulerParameters</b>
configuration parameter with the
<a href="slurm.conf.html#OPT_max_switch_wait=#">max_switch_wait</a> option.</p>
<p>When <b>topology/tree</b> or <b>topology/block</b> is configured, hostlist
functions may be used in place of or alongside regular hostlist expressions
in commands or configuration files that interact with the slurmctld. Valid
topology functions include:</p>
<ul>
<li><b>block{blockX}</b> and <b>switch{switchY}</b> - expand to all nodes in
the specified block/switch.</li>
<li><b>blockwith{nodeX}</b> and <b>switchwith{nodeY}</b> - expand to all nodes
in the same block/switch as the specified node.</li>
</ul>
<p>For example:</p>
<pre>
scontrol update node=block{b1} state=resume
sbatch --nodelist=blockwith{node0} -N 10 program
PartitionName=Block10 Nodes=block{block10} ...
</pre>
<p>See also the hostlist function <b>feature{myfeature}</b>, described
<a href="slurm.conf.html#OPT_Features">here</a>.</p>
<h2 id="env_vars">Environment Variables
<a class="slurm_link" href="#env_vars"></a>
</h2>
<p>If the topology/tree plugin is used, two environment variables will be set
to describe that job's network topology. Note that these environment variables
will contain different data for the tasks launched on each node. Use of these
environment variables is at the discretion of the user.</p>
<p><b>SLURM_TOPOLOGY_ADDR</b>:
The value will be set to the names of the network switches which may be
involved in the job's communications, from the system's top-level switch
down to the leaf switch, and ending with the node name. A period is used
to separate each hardware component name.</p>
<p><b>SLURM_TOPOLOGY_ADDR_PATTERN</b>:
This is set only if the system has the topology/tree plugin configured.
The value will be set to the component types listed in SLURM_TOPOLOGY_ADDR.
Each component will be identified as either "switch" or "node".
A period is used to separate each hardware component type.</p>
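<p>For example, using the first topology.conf example above, a task launched
on node tux1 would see values along these lines (switch names taken from that
example):</p>
<pre>
SLURM_TOPOLOGY_ADDR=s6.s4.s0.tux1
SLURM_TOPOLOGY_ADDR_PATTERN=switch.switch.switch.node
</pre>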
<p style="text-align:center;">Last modified 13 November 2024</p>
<!--#include virtual="footer.txt"-->