<!--#include virtual="header.txt"-->
<h1>Network Configuration Guide</h1>
<h2 id="contents">Contents<a class="slurm_link" href="#contents"></a></h2>
<ul>
<li><a href="#overview">Overview</a></li>
<li><a href="#slurmctld">Communication for slurmctld</a></li>
<li><a href="#slurmdbd">Communication for slurmdbd</a></li>
<li><a href="#slurmd">Communication for slurmd</a></li>
<li><a href="#client">Communication for client commands</a></li>
<li><a href="#failover">Communication with multiple controllers</a></li>
<li><a href="#multi">Communication with multiple clusters</a></li>
<li><a href="#federation">Communication in a federation</a></li>
<li><a href="#ipv6">Communication with IPv6</a></li>
</ul>
<h2 id="overview">Overview<a class="slurm_link" href="#overview"></a></h2>
<p>A Slurm cluster has many components that need to be able to communicate
with each other. Some sites have security requirements that prevent them
from opening all communication between machines, so they need to
selectively open just the ports that are necessary.
This document describes what each component needs in order to
talk to the others.</p>
<p>Below is a diagram of a fairly typical cluster, with <b>slurmctld</b>
and <b>slurmdbd</b> on separate machines. In smaller clusters, MySQL can run
on the same machine as the <b>slurmdbd</b>, but in most cases it is preferable
to have it run on a dedicated machine. <b>slurmd</b> runs on the
compute nodes and the client commands can be installed and run from machines
of your choosing.</p>
<div class="figure">
<img src="network_standard.gif" width="550"><br>
Typical configuration
</div>
<h2 id="slurmctld">Communication for slurmctld
<a class="slurm_link" href="#slurmctld"></a>
</h2>
<p>The default port used by <b>slurmctld</b> to listen for incoming requests
is <u>6817</u>. This port can be changed with the
<a href="slurm.conf.html#OPT_SlurmctldPort">SlurmctldPort</a> slurm.conf
parameter. Slurmctld listens for incoming requests on that port and responds
back on the same connection opened by the requestor.</p>
<p>The machine running <b>slurmctld</b> needs to be able to establish
outbound connections as well. It needs to communicate with <b>slurmdbd</b>
on port <u>6819</u> by default (see the <a href="#slurmdbd">slurmdbd</a>
section for information on how to change this). It also needs to communicate
with <b>slurmd</b> on the compute nodes on port <u>6818</u> by default (see the
<a href="#slurmd">slurmd</a> section for information on how to change
this).</p>
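<p>As a rough sketch, the default ports described above would correspond to
slurm.conf entries like the ones below. The hostname <i>dbd-host</i> is a
placeholder, and the <i>AccountingStorage*</i> parameters are not covered in
this guide; check the slurm.conf documentation before relying on them.</p>
<pre>
# slurm.conf (illustrative fragment; dbd-host is a placeholder hostname)
SlurmctldPort=6817          # port slurmctld listens on
SlurmdPort=6818             # port slurmd listens on
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=dbd-host
AccountingStoragePort=6819  # should match DbdPort in slurmdbd.conf
</pre>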
<p>By default, the <b>slurmctld</b> will listen for IPv4 traffic. IPv6
communication can be enabled by adding <u>EnableIPv6</u> to the
<a href="slurm.conf.html#OPT_CommunicationParameters">
CommunicationParameters</a> in your slurm.conf. With IPv6 enabled, you can
disable IPv4 by adding <u>DisableIPv4</u> to the
<a href="slurm.conf.html#OPT_CommunicationParameters">
CommunicationParameters</a>. These settings must match in both slurmdbd.conf
and slurm.conf (see the <a href="#slurmdbd">slurmdbd</a> section).</p>
<h2 id="slurmdbd">Communication for slurmdbd
<a class="slurm_link" href="#slurmdbd"></a>
</h2>
<p>The default port used by <b>slurmdbd</b> to listen for incoming requests
is <u>6819</u>. This port can be changed with the
<a href="slurmdbd.conf.html#OPT_DbdPort">DbdPort</a> slurmdbd.conf parameter.
Slurmdbd listens for incoming requests on that port and responds back
on the same connection opened by the requestor.</p>
<p>The machine running <b>slurmdbd</b> needs to be able to reach the
MySQL or MariaDB server on port <u>3306</u> by default. The port the
server listens on is configured on the database side; the port that
<b>slurmdbd</b> connects to can be changed with the
<a href="slurmdbd.conf.html#OPT_StoragePort">StoragePort</a> slurmdbd.conf
parameter. It also needs to be able to initiate
a connection to <b>slurmctld</b> on port <u>6817</u> by default (see the
<a href="#slurmctld">slurmctld</a> section for information on how to
change this).</p>
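<p>For illustration, a slurmdbd.conf using the defaults above might contain
lines like the following. The hostname <i>db-host</i> is a placeholder, and
the <i>StorageType</i> and <i>StorageHost</i> values are assumptions about a
typical MySQL/MariaDB setup rather than requirements stated in this guide.</p>
<pre>
# slurmdbd.conf (illustrative fragment; db-host is a placeholder hostname)
DbdPort=6819        # port slurmdbd listens on
StorageType=accounting_storage/mysql
StorageHost=db-host
StoragePort=3306    # port the database server listens on
</pre>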
<p>By default, the <b>slurmdbd</b> will listen for IPv4 traffic. IPv6
communication can be enabled by adding <u>EnableIPv6</u> to the
<a href="slurmdbd.conf.html#OPT_CommunicationParameters">
CommunicationParameters</a> in your slurmdbd.conf. With IPv6 enabled, you can
disable IPv4 by adding <u>DisableIPv4</u> to the
<a href="slurmdbd.conf.html#OPT_CommunicationParameters">
CommunicationParameters</a>. These settings must match in both slurmdbd.conf
and slurm.conf (see the <a href="#slurmctld">slurmctld</a> section).</p>
<h2 id="slurmd">Communication for slurmd
<a class="slurm_link" href="#slurmd"></a>
</h2>
<p>The default port used by <b>slurmd</b> to listen for incoming requests
from <b>slurmctld</b> is <u>6818</u>. This port can be changed with the
<a href="slurm.conf.html#OPT_SlurmdPort">SlurmdPort</a> slurm.conf
parameter.</p>
<p>The machines running <b>srun</b> also use a range of ports to be able
to communicate with <b>slurmstepd</b>. By default these ports are chosen
at random from the ephemeral port range, but you can use the
<a href="slurm.conf.html#OPT_SrunPortRange">SrunPortRange</a> slurm.conf
parameter to specify a range of ports from which they can be chosen. This
is necessary for login nodes that are behind a firewall.</p>
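<p>A minimal example of restricting the ports used by <b>srun</b> is shown
below. The specific range is only an illustration; a site would pick a range
that suits its own firewall policy and open that range on the login nodes
for connections from the compute nodes.</p>
<pre>
# slurm.conf (illustrative fragment; the range itself is an example)
SrunPortRange=60001-63000
</pre>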
<p>The machines running <b>slurmd</b> need to be able to establish
connections with <b>slurmctld</b> on port <u>6817</u> by default (see
the <a href="#slurmctld">slurmctld</a> section for information on how to
change this).</p>
<p>By default, the <b>slurmd</b> communicates over IPv4. Please see the
<a href="#slurmctld">slurmctld</a> section for details on how to change this
as the slurm.conf parameter affects <b>slurmd</b> daemons as well.</p>
<h2 id="client">Communication for client commands
<a class="slurm_link" href="#client"></a>
</h2>
<p>The majority of the client commands will communicate with <b>slurmctld</b>
on port <u>6817</u> by default (see the <a href="#slurmctld">slurmctld</a>
section for information on how to change this) to get the information they
need. This includes the following commands:</p>
<dl>
<dd>salloc
<dd>sacctmgr
<dd>sbatch
<dd>sbcast
<dd>scancel
<dd>scontrol
<dd>sdiag
<dd>sinfo
<dd>sprio
<dd>squeue
<dd>sshare
<dd>sstat
<dd>strigger
<dd>sview
</dl>
<p>There are also commands that communicate directly with <b>slurmdbd</b> on
port <u>6819</u> by default (see the <a href="#slurmdbd">slurmdbd</a> section
for information on how to change this). The following commands get information
from <b>slurmdbd</b>:</p>
<dl>
<dd>sacct
<dd>sacctmgr
<dd>sreport
</dl>
<p>When a user starts a job using <b>srun</b> there has to be a communication
path from the machine where <b>srun</b> is called to the node(s) allocated
to the job. Communication follows the sequence outlined below:</p>
<dl>
<dd>1a. srun sends job allocation request to slurmctld
<dd>1b. slurmctld grants allocation and returns details
<dd>2a. srun sends step create request to slurmctld
<dd>2b. slurmctld responds with step credential
<dd>3. srun opens sockets for I/O
<dd>4. srun forwards credential with task info to slurmd
<dd>5. slurmd forwards request as needed (per fanout)
<dd>6. slurmd forks/execs slurmstepd
<dd>7. slurmstepd connects I/O and launches tasks
<dd>8. On task termination, slurmstepd notifies srun
<dd>9. srun notifies slurmctld of job termination
<dd>10. slurmctld verifies termination of all processes via slurmd and
releases resources for next job
</dl>
<div class="figure">
<img src="network_srun.gif" width="550"><br>
srun communication
</div>
<h2 id="failover">Communication with multiple controllers
<a class="slurm_link" href="#failover"></a>
</h2>
<p>You can configure a secondary <b>slurmctld</b> and/or <b>slurmdbd</b> to
serve as a fallback if the primary should go down. The ports involved don't
change, but there are additional communication paths that need to be taken
into consideration. The client commands need to be able to reach both
machines running <b>slurmctld</b> as well as both machines running
<b>slurmdbd</b>. Both instances of <b>slurmctld</b> need to be able to
reach both instances of <b>slurmdbd</b> and each <b>slurmdbd</b> needs
to be able to reach the MySQL server.</p>
<div class="figure">
<img src="network_failover.gif" width="550"><br>
Fallback slurmctld and slurmdbd
</div>
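<p>A configuration with fallback daemons could look roughly like the sketch
below. All hostnames are placeholders, and the parameters shown are not
described elsewhere in this guide, so consult the slurm.conf and
slurmdbd.conf documentation for their exact behavior.</p>
<pre>
# slurm.conf (illustrative fragment; all hostnames are placeholders)
SlurmctldHost=ctl-primary      # first entry is the primary controller
SlurmctldHost=ctl-backup       # later entries act as fallbacks
AccountingStorageHost=dbd-primary
AccountingStorageBackupHost=dbd-backup

# slurmdbd.conf (illustrative fragment)
DbdHost=dbd-primary
DbdBackupHost=dbd-backup
</pre>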
<h2 id="multi">Communication with multiple clusters
<a class="slurm_link" href="#multi"></a>
</h2>
<p>In environments where multiple <b>slurmctld</b> instances share the same
<b>slurmdbd</b> you can configure each cluster to stand on its own and allow
users to specify a cluster to submit their jobs to. Ports
used by the different daemons don't change, but all instances of
<b>slurmctld</b> need to be able to communicate with the same instance of
<b>slurmdbd</b>. You can read more about multi-cluster configurations in the
<a href="multi_cluster.html">Multi-Cluster Operation</a>
documentation.</p>
<div class="figure">
<img src="network_multi_cluster.gif" width="550"><br>
Multi-Cluster configuration
</div>
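<p>As a sketch of the arrangement above, each cluster keeps its own
slurm.conf with a distinct cluster name while pointing at the same
<b>slurmdbd</b> host. The cluster names and the hostname <i>dbd-host</i>
are placeholders chosen for this example.</p>
<pre>
# slurm.conf on the first cluster (names are placeholders)
ClusterName=alpha
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=dbd-host

# slurm.conf on the second cluster
ClusterName=beta
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=dbd-host
</pre>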
<h2 id="federation">Communication in a federation
<a class="slurm_link" href="#federation"></a>
</h2>
<p>Slurm also provides the ability to schedule jobs in a peer-to-peer fashion
between multiple clusters, allowing jobs to run on the cluster that has
available resources first. The difference in communication needs between this
and a multi-cluster configuration is that the instances of <b>slurmctld</b>
need to be able to communicate with each other. More details are available
in the
<a href="federation.html">Federation</a>
documentation.</p>
<div class="figure">
<img src="network_federation.gif" width="550"><br>
Federation configuration
</div>
<h2 id="ipv6">Communication with IPv6
<a class="slurm_link" href="#ipv6"></a>
</h2>
<p>The <b>slurmctld</b>, <b>slurmdbd</b>, and <b>slurmd</b> daemons will,
by default, communicate using IPv4, but they can be configured to use IPv6.
This is handled by setting <b>CommunicationParameters=EnableIPv6</b>
in your slurm.conf and slurmdbd.conf, then restarting all of the daemons.
The <b>slurmd</b> may operate over IPv4 or IPv6 in this mode. IPv4 can be
disabled by setting <b>CommunicationParameters=EnableIPv6,DisableIPv4</b>.
In this mode, everything must have a valid IPv6 address or the connection will
fail.</p>
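<p>The two modes described above can be written as follows. The same line
would go in both slurm.conf and slurmdbd.conf, since the settings must match;
the commented-out alternative is shown only for comparison.</p>
<pre>
# slurm.conf and slurmdbd.conf: enable IPv6 alongside IPv4
CommunicationParameters=EnableIPv6

# Alternative: IPv6 only (every host needs a valid IPv6 address)
#CommunicationParameters=EnableIPv6,DisableIPv4
</pre>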
<p>The <b>slurmctld</b> expects a node to map to a single IP address (which
will be the first address returned when looking up the IP of the node with
<b>getaddrinfo()</b>). If you enable IPv6 on an existing cluster and the
nodes have IPv6 addresses, you must restart the <b>slurmd</b> daemons for
communication over IPv6 to be established.</p>
<p>The presence of <span>precedence ::ffff:0:0/96 100</span> in /etc/gai.conf
will cause IPv4 addresses to be returned before an IPv6 address. This might
cause a situation where you have enabled IPv6 for Slurm, but are still seeing
nodes communicate with IPv4. If there is confusion as to which address is
being used, you can run <span>scontrol setdebugflags +NET</span> to enable
network-related debug logging in your slurmctld.log.</p>
<p>If IPv4 and IPv6 are enabled, the loopback interface may still resolve to
127.0.0.1. This is not necessarily an indication of a problem.</p>
<p style="text-align:center;">Last modified 25 November 2020</p>
<!--#include virtual="footer.txt"-->