.\" Source: nonstop.conf.5 from slurm-wlm 22.05.8-4+deb12u3 (Debian bookworm, main)
.TH "nonstop.conf" "5" "Slurm Configuration File" "January 2022" "Slurm Configuration File"

.SH "NAME"
nonstop.conf \- Slurm configuration file for fault\-tolerant computing.

.SH "DESCRIPTION"
\fBnonstop.conf\fP is an ASCII file which describes the configuration
used for fault\-tolerant computing with Slurm using the optional
slurmctld/nonstop plugin.
This plugin provides a means for users to notify Slurm of nodes they believe
are suspect, replace a job's failing or failed nodes, and extend a job's
time limit in response to failures.
The file will always be located in the same directory as the \fBslurm.conf\fR
file.
.LP
Parameter names are case insensitive.
Any text following a "#" in the configuration file is treated
as a comment through the end of that line.
Changes to the configuration file take effect upon restart of
Slurm daemons, daemon receipt of the SIGHUP signal, or execution
of the command "scontrol reconfigure" unless otherwise noted.
The configuration parameters available include:

.TP
\fBBackupAddr\fR
Communications address used for the backup slurmctld daemon.
This can be either a hostname or an IP address.
This value would typically be the same as the secondary \fBSlurmctldHost\fR
in the slurm.conf file, when applicable.
.IP

.TP
\fBControlAddr\fR
Communications address used for the slurmctld daemon.
This can be either a hostname or an IP address.
This value would typically be the same as the \fBSlurmctldHost\fR
in the slurm.conf file.
.IP

.TP
\fBDebug\fR
A number indicating the level of additional logging desired for the plugin.
The default value is zero, which generates no additional logging.
.IP

.TP
\fBHotSpareCount\fR
This identifies how many nodes in each partition should be maintained as
spare resources.
When a job fails, this pool of resources will be depleted and then replenished
when possible using idle resources.
The value should be a comma\-delimited list of
partition and node count pairs separated by a colon.
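.IP
For instance, assuming hypothetical partitions named "batch" and
"interactive", the following line would keep six spare nodes in "batch" and
none in "interactive":
.nf
HotSpareCount=batch:6,interactive:0
.fi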
.IP

.TP
\fBMaxSpareNodeCount\fR
This identifies the maximum number of nodes any single job may replace through
the job's entire lifetime.
This could prevent a single job from causing all of the nodes in a cluster to
fail.
By default, there is no maximum node count.
.IP

.TP
\fBPort\fR
Port used for communications.
The default value is 6820.
.IP

.TP
\fBTimeLimitDelay\fR
If a job requires replacement resources and none are immediately available,
then permit a job to extend its time limit by the length of time required to
secure replacement resources up to the number of minutes specified by
\fBTimeLimitDelay\fR.
This option will only take effect if no hot spare resources are available at
the time replacement resources are requested.
This time limit extension is in addition to the value calculated using
\fBTimeLimitExtend\fR.
The default value is zero (no time limit extension).
The value may not exceed 65533.
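.IP
As an illustration only (the values are arbitrary), consider:
.nf
TimeLimitDelay=30
TimeLimitExtend=20
.fi
Reading the description above literally, a job that waits 12 minutes for one
replacement node could extend its time limit by up to 12 + 20 = 32 minutes,
while a job that waits 45 minutes would be capped at 30 + 20 = 50 minutes.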
.IP

.TP
\fBTimeLimitDrop\fR
Specifies the number of minutes that a job can extend its time limit for
each failed or failing node removed from the job's allocation.
The default value is zero (no time limit extension).
The value may not exceed 65533.
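.IP
As an illustration only (the value is arbitrary), with the setting
.nf
TimeLimitDrop=10
.fi
a job that removes three failed nodes from its allocation without replacing
them could extend its time limit by up to 3 x 10 = 30 minutes.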
.IP

.TP
\fBTimeLimitExtend\fR
Specifies the number of minutes that a job can extend its time limit for
each replaced node.
The default value is zero (no time limit extension).
The value may not exceed 65533.
.IP

.TP
\fBUserDrainAllow\fR
This identifies a comma\-delimited list of user names or user IDs of users who
are authorized to drain nodes they believe are failing.
Specify a value of "ALL" to permit any user to drain nodes.
By default, no users may drain nodes using this interface.
.IP

.TP
\fBUserDrainDeny\fR
This identifies a comma\-delimited list of user names or user IDs of users who
are NOT authorized to drain nodes they believe are failing.
Specifying a value for \fBUserDrainDeny\fR implicitly allows all other users
to drain nodes (sets the value of UserDrainAllow to "ALL").
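.IP
A sketch of the two styles, using hypothetical user names.
The lines are shown as alternatives rather than as a combined configuration:
.nf
# Only adam and brenda may drain nodes
UserDrainAllow=adam,brenda
# Alternatively: every user except carol may drain nodes
#UserDrainDeny=carol
.fi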
.IP

.SH "EXAMPLE"
.nf
#
# Sample nonstop.conf file
# Date: 12 Feb 2013
#
ControlAddr=12.34.56.78
BackupAddr=12.34.56.79
Port=1234
#
HotSpareCount=batch:6,interactive:0
MaxSpareNodeCount=4
TimeLimitDelay=30
TimeLimitExtend=20
TimeLimitDrop=10
UserDrainAllow=adam,brenda
.fi

.SH "COPYING"
Copyright (C) 2013\-2022 SchedMD LLC. All rights reserved.
.LP
Slurm is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more
details.

.SH "SEE ALSO"
.LP
\fBslurm.conf\fR(5)