<!--#include virtual="header.txt"-->

<h1><a name="top">Dynamic Nodes</a></h1>

<h2 id="overview">Overview<a class="slurm_link" href="#overview"></a></h2>

<p>Starting in Slurm 22.05, nodes can be dynamically added to and removed from
Slurm.
</p>

<h2 id="communications">Dynamic Node Communications
<a class="slurm_link" href="#communications"></a>
</h2>
<p>
For regular, non-dynamically created nodes, Slurm learns how to communicate
with each node by reading slurm.conf. This is why it is important in a
non-dynamic setup that slurm.conf is synchronized across the cluster. With
dynamically created nodes, only slurmctld knows about them; the other Slurm
components (e.g. srun, the slurmds) do not. So that srun and the slurmds can
communicate with the other nodes in a job allocation, slurmctld passes each
node's address information (NodeName, NodeAddr, NodeHostname), known as the
alias list, to srun, and srun forwards the list to the slurmds. This list
appears in the job's environment as the <b>SLURM_NODE_ALIASES</b> environment
variable.
</p>
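<p>
As a rough illustration of how the alias list might be consumed, the following
shell sketch splits <b>SLURM_NODE_ALIASES</b> into one line per node. It
assumes the value is a comma-separated list of
<i>NodeName:[NodeAddr]:NodeHostname</i> entries; that layout is not spelled
out on this page, so check it against your Slurm version. The function name
<i>parse_node_aliases</i> and the sample addresses are hypothetical.
</p>

```shell
#!/bin/sh
# Sketch only. ASSUMPTION: SLURM_NODE_ALIASES looks like
#   name1:[addr1]:host1,name2:[addr2]:host2
# parse_node_aliases is a hypothetical helper, not part of Slurm.
parse_node_aliases() {
    list=$1
    while [ -n "$list" ]; do
        entry=${list%%,*}              # first entry in the list
        case $list in
            *,*) list=${list#*,} ;;    # drop it and continue
            *)   list= ;;              # that was the last entry
        esac
        name=${entry%%:*}              # text before the first colon
        host=${entry##*:}              # text after the last colon
        addr=${entry#*:}               # remove the leading "name:"
        addr=${addr%:*}                # remove the trailing ":host"
        addr=${addr#\[}                # strip the optional brackets
        addr=${addr%\]}
        printf '%s %s %s\n' "$name" "$addr" "$host"
    done
}

# Inside a job the variable is set by Slurm; the fallback is made-up sample data.
sample='d1:[10.0.0.1]:host1,d2:[10.0.0.2]:host2'
parse_node_aliases "${SLURM_NODE_ALIASES:-$sample}"
```

<p>
Each output line has the form <i>name address hostname</i>. Taking the name
before the first colon and the hostname after the last colon keeps the sketch
usable when the bracketed address itself contains colons.
</p>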

<p>
The controller automatically picks up the node's <b>NodeAddr</b> and
<b>NodeHostname</b> for dynamic slurmd registrations. For cloud nodes created
with scontrol, if the node name is not resolvable, then either 1) update the
node's <b>NodeAddr</b> and <b>NodeHostname</b> with the
<b>scontrol update</b> command before the node registers, or 2) set the
<a href="slurm.conf.html#OPT_cloud_reg_addrs">cloud_reg_addrs</a>
option in <b>SlurmctldParameters</b>.
</p>

<h2 id="config">Slurm Configuration
<a class="slurm_link" href="#config"></a>
</h2>

<p>
<dl>
<dt><b>MaxNodeCount=#</b>
<dd>
Set to the maximum number of nodes that can be active in the system at one
time. See the slurm.conf <a href="slurm.conf.html#OPT_MaxNodeCount">man</a>
page for more details.

<dt><b>SelectType=select/cons_tres</b>
<dd>Dynamic nodes are only supported with cons_tres.

<dt><b>TreeWidth=65533</b>
<dd>
Fanning out controller pings and application launches through the slurmds is
not supported with dynamic nodes. TreeWidth must be disabled (i.e. set to
65533) in dynamic environments. However, the reverse fanout of step
completions through the slurmds does still happen, due to the job's alias list.
</dl>
</p>

<p>
<b>NOTE:</b> The <b>cloud_dns</b> option of <b>SlurmctldParameters</b> must not
be set, as it disables the alias list.
</p>
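<p>
The settings above can be checked mechanically. The following shell sketch
scans a single slurm.conf for them; <i>check_dynamic_conf</i> is a
hypothetical helper, not a Slurm tool. It assumes each setting is written as
<i>Key=Value</i> without spaces around the equals sign, and it does not follow
Include files, so treat a clean result as a hint rather than a guarantee.
</p>

```shell
#!/bin/sh
# Sketch only. check_dynamic_conf is a hypothetical helper, not part of Slurm.
# ASSUMPTIONS: one file, "Key=Value" with no spaces around "=", no Include files.
# Usage: check_dynamic_conf /path/to/slurm.conf
check_dynamic_conf() {
    conf=$1
    status=0
    # Drop comments so that commented-out settings are not counted.
    active=$(sed 's/#.*//' "$conf")

    printf '%s\n' "$active" |
        grep -Eiq '(^|[[:space:]])MaxNodeCount=[0-9]+' ||
        { echo "missing: MaxNodeCount"; status=1; }

    printf '%s\n' "$active" |
        grep -Eiq '(^|[[:space:]])SelectType=select/cons_tres([[:space:]]|$)' ||
        { echo "missing: SelectType=select/cons_tres"; status=1; }

    printf '%s\n' "$active" |
        grep -Eiq '(^|[[:space:]])TreeWidth=65533([[:space:]]|$)' ||
        { echo "missing: TreeWidth=65533"; status=1; }

    # cloud_dns disables the alias list, so it must not be present.
    if printf '%s\n' "$active" |
        grep -Ei '(^|[[:space:]])SlurmctldParameters=' |
        grep -iq 'cloud_dns'; then
        echo "conflict: cloud_dns is set in SlurmctldParameters"
        status=1
    fi

    return $status
}
```

<p>
The function prints one line per problem and returns a non-zero status if it
finds any, so it can be used in a deployment script before slurmctld is
restarted.
</p>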

<h3 id="partitions">Partition Assignment
<a class="slurm_link" href="#partitions"></a>
</h3>
<p>
Dynamic nodes can be automatically assigned to partitions at creation, either
by using the <a href="slurm.conf.html#OPT_Nodes_1">ALL</a> keyword in the
partition's <b>Nodes</b> option or by using
<a href="slurm.conf.html#SECTION_NODESET-CONFIGURATION">NodeSets</a> and
specifying a feature on the nodes.
</p>

<p>
e.g.
<pre>
Nodeset=ns1 Feature=f1
Nodeset=ns2 Feature=f2

PartitionName=all  Nodes=ALL Default=yes
PartitionName=dyn1 Nodes=ns1
PartitionName=dyn2 Nodes=ns2
PartitionName=dyn3 Nodes=ns1,ns2
</pre>
</p>

<h2 id="create">Creating Nodes
<a class="slurm_link" href="#create"></a>
</h2>

<p>
Nodes can be created in two ways:
<ol>
<li>
<dl>
<dt><b>Dynamic slurmd registration</b>
<dd>
<p>When started with the slurmd <a href="slurmd.html#OPT_-Z">-Z</a> and
<a href="slurmd.html#OPT_conf-<node-parameters>">--conf</a> options, a slurmd
registers with the controller and is automatically added to the system.
</p>

<p>
e.g.
<pre>
slurmd -Z --conf "RealMemory=80000 Gres=gpu:2 Feature=f1"
</pre>
</p>

</dl>
</li>
<li>
<dl>
<dt><b>scontrol create NodeName= ...</b>
<dd>
<p>Create nodes using scontrol by specifying the same <b>NodeName</b>
line that you would define in slurm.conf. See the slurm.conf
<a href="slurm.conf.html#SECTION_NODE-CONFIGURATION">man</a> page for node
options. Only <b>State=CLOUD</b> and <b>State=FUTURE</b> are supported. The
node configuration should match what the slurmd will register with
(e.g. the output of slurmd -C) plus any additional attributes.
</p>

<p>
e.g.
<pre>
scontrol create NodeName=d[1-100] CPUs=16 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=31848 Gres=gpu:2 Feature=f1 State=cloud
</pre>
</p>
</dl>
</li>
</ol>
</p>
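<p>
Because the scontrol node definition should match what the slurmd reports, it
can help to derive one from the other. The sketch below reads
<b>slurmd -C</b> output on standard input and prints, without running it, a
matching <b>scontrol create</b> command. <i>build_create_cmd</i> is a
hypothetical helper. It assumes the hardware description is the first output
line beginning with <i>NodeName=</i>, and it always appends
<i>State=cloud</i>.
</p>

```shell
#!/bin/sh
# Sketch only. build_create_cmd is a hypothetical helper, not part of Slurm.
# ASSUMPTION: `slurmd -C` prints the hardware line first, starting "NodeName=".
#   $1 = node name or range to create (e.g. 'd[1-100]')
#   $2 = optional extra attributes (e.g. 'Gres=gpu:2 Feature=f1')
# Example: slurmd -C | build_create_cmd 'd[1-100]' 'Gres=gpu:2 Feature=f1'
build_create_cmd() {
    # Keep only the hardware line and drop its NodeName= field, because the
    # caller supplies the name of the dynamic node.
    hw=$(grep '^NodeName=' | head -n 1 | sed 's/^NodeName=[^ ]* *//')
    # Print the command instead of running it so it can be reviewed first.
    printf 'scontrol create NodeName=%s %s %sState=cloud\n' \
        "$1" "$hw" "${2:+$2 }"
}
```

<p>
Printing the command rather than executing it leaves room to review the
attributes, or to change the state to <b>State=FUTURE</b>, before the node is
created.
</p>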

<h2 id="delete">Deleting Nodes
<a class="slurm_link" href="#delete"></a>
</h2>
<p>
Nodes can be deleted using <b>scontrol delete nodename=&lt;nodelist&gt;</b>.
Nodes can only be deleted if they have no jobs running on them and aren't part
of a reservation.
</p>

<h2 id="limitations">Limitations
<a class="slurm_link" href="#limitations"></a>
</h2>
<p>The following are not supported with dynamic nodes:
<ol>
<li>
Topology Plugins
</li>
</ol>
</p>


<p style="text-align:center;">Last modified 26 May 2022</p>

<!--#include virtual="footer.txt"-->