<!--#include virtual="header.txt"-->

<h1>Intel Knights Landing (KNL) User and Administrator Guide</h1>

<h2 id="overview">Overview<a class="slurm_link" href="#overview"></a></h2>

<p>This document describes the unique features of Slurm on computers with
Intel Knights Landing (KNL) processors.
You should be familiar with Slurm's mode of operation on Linux clusters
before studying the relatively few differences in Intel KNL system operation
described in this document.</p>

<h2 id="user_tools">User Tools
<a class="slurm_link" href="#user_tools"></a>
</h2>

<p>The desired NUMA and MCDRAM modes for a KNL processor should be specified
using the -C or --constraint option of Slurm's job submission commands: salloc,
sbatch, and srun. Currently available NUMA and MCDRAM modes are shown in the
table below. Each node's available and current NUMA and MCDRAM modes are
visible in the "available features" and "active features" fields respectively,
which may be seen using the scontrol, sinfo, or sview commands.
Note that a node may need to be rebooted to get the desired NUMA and MCDRAM
modes and nodes may only be rebooted when they contain no running jobs
(i.e. sufficient resources may be available to run a pending job, but until
the node is idle and can be rebooted, the pending job may not be allocated
resources). Also note that the job will be charged for resources from the time
of resource allocation, which may include time to reboot a node into the
desired NUMA and MCDRAM configuration.</p>

<p>Slurm supports a very rich set of options for the node constraint options
(exclusive OR, node counts for each constraint, etc.).
See the man pages for the salloc, sbatch and srun commands for more information
about the constraint syntax.
Jobs may specify their desired NUMA and/or MCDRAM configuration. If no
NUMA and/or MCDRAM configuration is specified, then a node with any possible
value for that configuration will be used.</p>
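<p>For example, the following illustrative commands use the bracket, count and
ampersand operators described in those man pages (the node counts and script
name are arbitrary):</p>
<pre>
# Request 2 nodes in flat mode and 2 nodes in cache mode within one job
$ sbatch --constraint="[flat*2&amp;cache*2]" -N4 my.script

# Request nodes in either quad or hemi mode, but not a mixture of the two
$ sbatch --constraint="[quad|hemi]" -N8 my.script
</pre>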

<table width="100%" border=1 cellspacing=0 cellpadding=4>
<tr>
  <th width="15%">Type</th>
  <th width="15%">Name</th>
  <th width="70%">Description</th>
</tr>
<tr><td>MCDRAM</td><td>cache</td><td>All of MCDRAM to be used as cache</td></tr>
<tr><td>MCDRAM</td><td>equal</td><td>MCDRAM to be used partly as cache and partly combined with primary memory</td></tr>
<tr><td>MCDRAM</td><td>flat</td><td>MCDRAM to be combined with primary memory into a "flat" memory space</td></tr>
<tr><td>NUMA</td><td>a2a</td><td>All to all</td></tr>
<tr><td>NUMA</td><td>hemi</td><td>Hemisphere</td></tr>
<tr><td>NUMA</td><td>snc2</td><td>Sub-NUMA cluster 2</td></tr>
<tr><td>NUMA</td><td>snc4</td><td>Sub-NUMA cluster 4 (<a href="#note">NOTE</a>)</td></tr>
<tr><td>NUMA</td><td>quad</td><td>Quadrant (<a href="#note2">NOTE</a>)</td></tr>
</table>

<p>Jobs requiring some or all of the KNL high bandwidth memory (HBM) should
explicitly request that memory using Slurm's Generic RESource (GRES) options.
The HBM will always be known by the Slurm GRES name "hbm".
Examples below demonstrate use of HBM.</p>

<p>Sorting of the free cache pages at step startup using Intel's zonesort
module can be configured as the default for all steps using the
"LaunchParameters=mem_sort" option in the slurm.conf file.
Individual job steps can enable or disable sorting using the "--mem-bind=sort"
or "--mem-bind=nosort" command line options for srun.
Sorting will be performed only for the NUMA nodes allocated to the job step.</p>
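<p>For example (illustrative snippets based upon the options named above):</p>
<pre>
# In slurm.conf: run zonesort by default at every job step startup
LaunchParameters=mem_sort

# Override the default for an individual job step
$ srun --mem-bind=nosort -n68 a.out
</pre>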

<p><a id="note"><b>NOTE</b></a>: Slurm provides limited support
for restricting use of HBM. At some point in the future, the amount of HBM
requested by the job will be compared with the total amount of HBM and number of
memory-containing NUMA nodes available on the KNL processor. The job will then
be bound to specific NUMA nodes in order to limit the total amount of HBM
available to the job, and thus reserve the remaining HBM for other jobs running
on that KNL processor.</p>

<p><a id="note2"><b>NOTE</b></a>: Slurm can only
support homogeneous nodes (e.g. the same number of cores per NUMA node).
KNL processors with <u>68 cores</u> (a subset of KNL models) will not have
homogeneous NUMA nodes in snc4 mode; each NUMA node will have
either 16 or 18 cores. This will result in Slurm using the lower core count,
finding a total of 256 threads rather than 272 threads and setting the node
to a DOWN state.</p>

<h3 id="accounting">Accounting<a class="slurm_link" href="#accounting"></a></h3>

<p>If a node requires rebooting for a job's required configuration, the job
will be charged for the resource allocation from the time of allocation through
the lifetime of the job, including the time consumed for booting the nodes.
The job's time limit will be calculated from the time that all nodes are ready
for use.
For example, a job with a 10 minute time limit may be allocated resources at
10:00:00.
If the nodes require rebooting, they might not be available for use until
10:20:00, 20 minutes after allocation, and the job will begin execution at
that time.
The job must complete no later than 10:30:00 in order to satisfy its time limit
(10 minutes after execution actually begins).
However, the job will be charged for 30 minutes of resource use, which includes
the boot time.</p>

<h3 id="use_case">Sample Use Cases
<a class="slurm_link" href="#use_case"></a>
</h3>

<pre>
$ sbatch -C flat,a2a -N2 --gres=hbm:8g --exclusive my.script
$ srun --constraint=hemi,cache -n36 a.out
$ srun --constraint=flat --gres=hbm:2g -n36 a.out

$ sinfo -o "%30N %20b %f"
NODELIST       ACTIVE_FEATURES  AVAIL_FEATURES
nid000[10-11]
nid000[12-35]  flat,a2a         flat,a2a,snc2,hemi
nid000[36-43]  cache,a2a        flat,equal,cache,a2a,hemi
</pre>

<h3 id="topology">Network Topology
<a class="slurm_link" href="#topology"></a>
</h3>

<p>Slurm will optimize performance using those resources available without
rebooting. If node rebooting is required, then it will optimize layout with
respect to network bandwidth using both nodes currently in the desired
configuration and those which can be made available after rebooting.
This can result in more nodes being rebooted than strictly needed, but will
improve application performance.</p>

<p>Users can specify they want all resources allocated on a specific count of
leaf switches (Dragonfly group) using Slurm's <b>--switches</b> option.
They can also specify how much additional time they are willing to wait for
such a configuration. If the desired configuration can not be made available
within the specified time interval, the job will be allocated nodes optimized
with respect to network bandwidth to the extent possible. On a Dragonfly
network, this means allocating resources either within a single group or
distributed evenly over as many groups as possible. For example:</p>
<pre>
srun --switches=1@10:00 -N16 a.out
</pre>
<p>Note that system administrators can disable use of the <b>--switches</b>
option or limit the amount of time the job can be deferred using the
<b>SchedulerParameters</b> option <b>max_switch_wait</b>.</p>
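<p>For example, the following illustrative slurm.conf line would limit the time
any job may be deferred waiting for its requested switch count to 10 minutes:</p>
<pre>
# slurm.conf (illustrative value, specified in seconds)
SchedulerParameters=max_switch_wait=600
</pre>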

<h3 id="boot_problems">Booting Problems
<a class="slurm_link" href="#boot_problems"></a>
</h3>

<p>If node boots fail, those nodes are drained and the job is requeued so that
it can be allocated a different set of nodes. The originally allocated nodes
which booted successfully will remain available to the job, so typically only
a small number of additional nodes will be required.</p>

<h2 id="administration">System Administration
<a class="slurm_link" href="#administration"></a>
</h2>

<p>Four important components are required to use Slurm on an Intel KNL system.</p>
<ol>
<li>Slurm needs a mechanism to determine the node's current topology (e.g.
how many NUMA nodes exist and which cores are associated with each one). Slurm
relies upon <a href="http://www.open-mpi.org/projects/hwloc/">
Portable Hardware Locality (HWLOC)</a> for this functionality. Please install
HWLOC before building Slurm.</li>

<li>The node features plugin manages the available and active features
information available for each KNL node.</li>

<li>A configuration file is used to define various timeouts, default
configuration, etc. The configuration file name and contents will depend upon
the node features plugins used. See the <a href="knl.conf.html">knl.conf</a>
man page for more information.</li>

<li>A mechanism is required to boot nodes in the desired configuration. This
mechanism must be integrated with existing Slurm infrastructure for
<a href="sbatch.html">rebooting nodes on user request (--reboot)</a> plus
<u>(for Cray systems only)</u>
<a href="power_save.html">power saving</a> (powering down idle nodes and
restarting them on demand).</li>
</ol>

<p>In addition, there is a DebugFlags option of "NodeFeatures" which will
generate detailed information about KNL operations.</p>

<p>The KNL-specific available and active features are configured differently
based upon the plugin configured.<br>
<u>For the knl_cray plugin</u>, KNL-specific available and active features are
not included in the "slurm.conf" configuration file, but are set and then
managed by the NodeFeatures plugin when the slurmctld daemon starts.<br>
<u>For the knl_generic plugin</u>, KNL-specific features should be defined
in the "slurm.conf" configuration file. When the slurmd daemon starts on each
compute node, it will update the available and active features as needed.<br>
Features which are not KNL-specific (e.g. rack number, "knl", etc.) will be
copied from the node's "Features" configuration in "slurm.conf" to both the
available and active feature fields and not modified by the NodeFeatures
plugin.</p>
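<p>For example, with the knl_generic plugin a node definition in slurm.conf
might list a non-KNL-specific feature together with the KNL modes the node
supports (the node name and memory size are hypothetical):</p>
<pre>
# slurm.conf: "knl" is copied through unmodified, while the MCDRAM and NUMA
# features are updated as needed by the NodeFeatures plugin
NodeName=nid00012 Sockets=1 CoresPerSocket=68 ThreadsPerCore=4 RealMemory=128000 Feature=knl,flat,cache,a2a,hemi
</pre>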

<p>NOTE: For Dell KNL systems you must also include the <i>SystemType=Dell</i>
option for successful operation and will likely need to increase the
<i>SyscfgTimeout</i> to allow enough time for the command to successfully
complete.  Experience at one site has shown that a 10 second timeout may
be necessary, configured as <i>SyscfgTimeout=10000</i>.</p>
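<p>For example, an illustrative knl_generic.conf fragment for a Dell KNL
system:</p>
<pre>
# knl_generic.conf fragment for Dell KNL systems
SystemType=Dell
SyscfgTimeout=10000     # msec (i.e. 10 seconds)
</pre>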

<p>Slurm does not support the concept of multiple NUMA nodes
within a single socket. If a KNL node is booted with multiple NUMA nodes, then
each NUMA node will appear in Slurm as a separate socket.
In the slurm.conf configuration file, set node socket and
core counts to values which are appropriate for some NUMA mode to be used on the
node. When the node boots and the slurmd daemon on the node starts, it will
report the node's actual socket (NUMA) and core counts to the slurmctld daemon,
which will update its data structures for the node to match the configuration
currently in effect.
Note that Slurm currently does not support the concept of
differing numbers of cores in each socket (or NUMA node). We are currently
working to address these issues.</p>
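<p>For example, a 64-core KNL node intended to run in snc4 mode could be
defined with four sockets of 16 cores each (the node name and memory size
are hypothetical):</p>
<pre>
# slurm.conf: in snc4 mode each of the 4 NUMA nodes appears as a socket
NodeName=nid00020 Sockets=4 CoresPerSocket=16 ThreadsPerCore=4 RealMemory=96000 Feature=knl
</pre>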

<h3 id="operation">Mode of Operation
<a class="slurm_link" href="#operation"></a>
</h3>

<ol>
<li>The node's configured "Features" are copied to the available and active
feature fields.</li>
<li>The node features plugin determines the node's current MCDRAM and NUMA
values as well as those which are available and adds those values to the node's
active and available feature fields respectively. Note that these values may
not be available until the node has booted and the slurmd daemon on the
compute node sends that information to the slurmctld daemon.</li>
<li>Jobs will be allocated nodes already in the requested MCDRAM and NUMA mode
if possible. If insufficient resources are available with the requested
configuration then other nodes will be selected and booted into the desired
configuration once no other jobs are active on the node. Until a node is idle,
its configuration can not be changed. Note that node reboot time is roughly
on the order of 20 minutes.</li>
</ol>
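<p>A node's available and active features can be checked with scontrol
(hypothetical node name and illustrative output):</p>
<pre>
$ scontrol show node nid00012 | grep Features
   AvailableFeatures=knl,flat,cache,a2a,hemi
   ActiveFeatures=knl,flat,a2a
</pre>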

<h3 id="cray">Cray Configuration<a class="slurm_link" href="#cray"></a></h3>

<p>On Cray systems, NodeFeaturesPlugins should be configured as "knl_cray".</p>

<p>The configuration file will be named "knl_cray.conf".
The file will include the path to the <i>capmc</i> program (CapmcPath),
which is used to get a node's available MCDRAM and NUMA modes, change the modes,
power the node down, reboot it, etc.
Note the "CapmcTimeout" parameter is the time required for the capmc program
to respond to a request and NOT the time for a boot operation to complete.</p>

<p>Power saving mode is integrated with rebooting nodes in the desired mode.
Programs named "capmc_resume" and "capmc_suspend" are provided to boot nodes in
the desired mode. The programs are included in the main Slurm RPM and
installed in the "sbin" directory and must be installed on the Cray "sdb" node.
If powering down of idle nodes is not desired, then configure "ResumeProgram"
in "slurm.conf" to the path of the "capmc_resume" file and configure
"SuspendTime" to a huge value (e.g. "SuspendTime=30000000" will only power
down a node which has been idle for about one year).</p>

<p>Note that getting a compute node's current MCDRAM and NUMA mode,
modifying its MCDRAM and NUMA mode, and rebooting it are operations performed
by the slurmctld daemon on the head node.</p>

<p>The GresTypes configuration parameter should include "hbm" to identify
High Bandwidth Memory (HBM) as a consumable resource on compute nodes.
Additional GresTypes can be specified as needed in a comma separated list.
The amount of HBM on each node should not be configured in a Slurm configuration
file, but that information will be loaded by the knl_cray plugin using
information provided by the capmc program.</p>

<h4>Sample knl_cray.conf file</h4>

<pre>
# Sample knl_cray.conf
CapmcPath=/opt/cray/capmc/default/bin/capmc
CapmcTimeout=2000	# msec
DefaultNUMA=a2a         # NUMA=all2all
AllowNUMA=a2a,snc2,hemi
DefaultMCDRAM=cache     # MCDRAM=cache
</pre>

<h4>Sample slurm.conf File</h4>

<pre>
# Sample slurm.conf
NodeFeaturesPlugins=knl_cray
DebugFlags=NodeFeatures
GresTypes=hbm
#
ResumeProgram=/opt/slurm/default/sbin/capmc_resume
SuspendProgram=/opt/slurm/default/sbin/capmc_suspend
SuspendTime=30000000
ResumeTimeout=1800
...
NodeName=DEFAULT Sockets=1 CoresPerSocket=68 ThreadsPerCore=4 RealMemory=128000 Feature=knl
NodeName=nid[00000-00127] State=UNKNOWN
</pre>


<h3 id="config">Generic Cluster Configuration
<a class="slurm_link" href="#config"></a>
</h3>

<p>All other clusters should have NodeFeaturesPlugins configured as "knl_generic".
This plugin performs all operations directly on the compute nodes using Intel's
<i>syscfg</i> program to get and modify the node's MCDRAM and NUMA mode and
uses the Linux <i>reboot</i> program to reboot the compute node in order for
modifications in MCDRAM and/or NUMA mode to take effect.
Make sure that RebootProgram is defined in the slurm.conf file.
This plugin currently does <u>not</u> permit the specification of ResumeProgram,
SuspendProgram, SuspendTime, etc. in slurm.conf; however, that limitation may
be removed in the future (the ResumeProgram currently has no means of changing
the node's MCDRAM and/or NUMA mode prior to reboot).</p>

<p><b>NOTE:</b> The syscfg program reports the MCDRAM and NUMA mode to be used
when the node is next booted. If the syscfg program is used to modify the MCDRAM
or NUMA mode of a node, but it is not rebooted, then Slurm will be making
scheduling decisions based upon incorrect state information. If you want to
change node state information outside of Slurm then use the following procedure:
<ol>
<li>Drain the nodes of interest</li>
<li>Change their MCDRAM and/or NUMA mode</li>
<li>Reboot the nodes</li>
<li>Restore them to service in Slurm</li>
</ol>
</p>
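<p>A minimal sketch of that procedure using scontrol (the node name and drain
reason are hypothetical, and the syscfg invocation is site-specific):</p>
<pre>
# 1. Drain the node
$ scontrol update NodeName=nid00012 State=DRAIN Reason="changing KNL mode"
# 2. Change the MCDRAM and/or NUMA mode using Intel's syscfg utility
#    (consult Intel's syscfg documentation for the exact invocation)
# 3. Reboot the node
$ ssh nid00012 reboot
# 4. Restore the node to service
$ scontrol update NodeName=nid00012 State=RESUME
</pre>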

<h4>Sample knl_generic.conf File</h4>

<pre>
# Sample knl_generic.conf
SyscfgPath=/usr/bin/syscfg
DefaultNUMA=a2a         # NUMA=all2all
AllowNUMA=a2a,snc2,hemi
DefaultMCDRAM=cache     # MCDRAM=cache
</pre>

<h4>Sample slurm.conf File</h4>

<pre>
# Sample slurm.conf
NodeFeaturesPlugins=knl_generic
DebugFlags=NodeFeatures
GresTypes=hbm
RebootProgram=/sbin/reboot
...
NodeName=DEFAULT Sockets=1 CoresPerSocket=68 ThreadsPerCore=4 RealMemory=128000 Feature=knl
NodeName=nid[00000-00127] State=UNKNOWN
</pre>


<p style="text-align:center;">Last modified 31 May 2022</p>

<!--#include virtual="footer.txt"-->