.TH strigger "1" "Slurm Commands" "August 2016" "Slurm Commands"

.SH "NAME"
strigger \- Used to set, get or clear Slurm trigger information.

.SH "SYNOPSIS"
\fBstrigger \-\-set\fR   [\fIOPTIONS\fR...]
.br
\fBstrigger \-\-get\fR   [\fIOPTIONS\fR...]
.br
\fBstrigger \-\-clear\fR [\fIOPTIONS\fR...]

.SH "DESCRIPTION"
\fBstrigger\fR is used to set, get or clear Slurm trigger information.
Triggers include events such as a node failing, a job reaching its
time limit or a job terminating.
These events can cause actions such as the execution of an arbitrary
script.
Typical uses include notifying system administrators of node failures
and gracefully terminating a job when its time limit is approaching.
A hostlist expression for the nodelist or job ID is passed as an argument
to the program.

Trigger events are not processed instantly; instead, a check for
trigger events is performed periodically (currently every 15 seconds).
Any trigger events that occur within that interval are compared
against the trigger programs set at the end of the interval.
The trigger program is executed once for all matching events that
occurred in that interval, not once per event.
The record of those events (e.g. nodes which went DOWN in the previous
15 seconds) is then cleared.
To ensure that no trigger events are missed, either the trigger program
must set a new trigger before the end of the next interval, or the
trigger must be created with the argument "\-\-flags=PERM".
If desired, multiple trigger programs can be set for the same event.

\fBIMPORTANT NOTE:\fR This command can only set triggers if run by the
user \fISlurmUser\fR unless \fISlurmUser\fR is configured as user root.
This is required for the \fIslurmctld\fR daemon to set the appropriate
user and group IDs for the executed program.
Also note that the trigger program is executed on the same node that the
\fIslurmctld\fR daemon uses rather than on an allocated compute node.
To check the value of \fISlurmUser\fR, run the command:

\fIscontrol show config | grep SlurmUser\fR
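
The check can be automated with a short shell fragment such as the sketch
below. It assumes that \fIscontrol show config\fR prints the value in the
form "SlurmUser = name(uid)" and simply applies the rule stated above;
the variable name \fIslurm_user\fR is illustrative only.

.nf
#!/bin/bash
# Extract the user name from a line such as "SlurmUser = slurm(64030)"
slurm_user=$(scontrol show config | awk '$1 == "SlurmUser" {print $3}' |
             cut \-d'(' \-f1)
# Triggers may be set by SlurmUser, or by any user when SlurmUser is root
if [ "$(id \-un)" = "$slurm_user" ] || [ "$slurm_user" = root ]; then
    echo "This user may set triggers"
else
    echo "Triggers must be set by user $slurm_user"
fi
.fi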

.SH "ARGUMENTS"
.TP
\fB\-a\fR, \fB\-\-primary_slurmctld_failure\fR
Trigger an event when the primary slurmctld fails.

.TP
\fB\-A\fR, \fB\-\-primary_slurmctld_resumed_operation\fR
Trigger an event when the primary slurmctld resumes operation after failure.

.TP
\fB\-b\fR, \fB\-\-primary_slurmctld_resumed_control\fR
Trigger an event when the primary slurmctld resumes control.

.TP
\fB\-B\fR, \fB\-\-backup_slurmctld_failure\fR
Trigger an event when the backup slurmctld fails.

.TP
\fB\-c\fR, \fB\-\-backup_slurmctld_resumed_operation\fR
Trigger an event when the backup slurmctld resumes operation after failure.

.TP
\fB\-C\fR, \fB\-\-backup_slurmctld_assumed_control\fR
Trigger an event when the backup slurmctld assumes control.

.TP
\fB\-\-burst_buffer\fR
Trigger an event when a burst buffer error occurs.

.TP
\fB\-\-clear\fP
Clear or delete a previously defined event trigger.
The \fB\-\-id\fR, \fB\-\-jobid\fR or \fB\-\-user\fR
option must be specified to identify the trigger(s) to
be cleared.
Only user root or the trigger's creator can delete a trigger.
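
For example, the following would remove the triggers created by user joe
(an illustrative name), provided the command is run by joe or by root:

.nf
> strigger \-\-clear \-\-user=joe
.fi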

.TP
\fB\-d\fR, \fB\-\-down\fR
Trigger an event if the specified node goes into a DOWN state.

.TP
\fB\-D\fR, \fB\-\-drained\fR
Trigger an event if the specified node goes into a DRAINED state.

.TP
\fB\-e\fR, \fB\-\-primary_slurmctld_acct_buffer_full\fR
Trigger an event when the primary slurmctld accounting buffer is full.

.TP
\fB\-F\fR, \fB\-\-fail\fR
Trigger an event if the specified node goes into a FAILING state.

.TP
\fB\-f\fR, \fB\-\-fini\fR
Trigger an event when the specified job completes execution.

.TP
\fB\-\-flags\fR=\fItype\fR
Associate flags with the trigger. Multiple flags should be comma\-separated.
Valid flags include:
.RS
.TP
PERM
Make the trigger permanent. Do not purge it after the event occurs.
.RE
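
For instance, a permanent trigger for node DOWN events might be set as
shown below. The program \fI/usr/local/sbin/notify_down\fR is a hypothetical
script. Because the trigger is permanent, that script should not register a
new trigger itself, unlike the scripts in the EXAMPLES section, which
submit their own next trigger.

.nf
> strigger \-\-set \-\-node \-\-down \-\-flags=PERM \\
           \-\-program=/usr/local/sbin/notify_down
.fi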

.TP
\fB\-\-front_end\fR
Trigger events based upon changes in state of front end nodes rather than
compute nodes. Applies to Cray ALPS architectures only, where the
slurmd daemon executes on front end nodes rather than the compute nodes.
Use this option with either the \fB\-\-up\fR or \fB\-\-down\fR option.
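
A sketch for a Cray ALPS system, where \fI/usr/local/sbin/notify_fe_down\fR
is a hypothetical notification script:

.nf
> strigger \-\-set \-\-front_end \-\-down \\
           \-\-program=/usr/local/sbin/notify_fe_down
.fi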

.TP
\fB\-g\fR, \fB\-\-primary_slurmdbd_failure\fR
Trigger an event when the primary slurmdbd fails.

.TP
\fB\-G\fR, \fB\-\-primary_slurmdbd_resumed_operation\fR
Trigger an event when the primary slurmdbd resumes operation after failure.

.TP
\fB\-\-get\fP
Show registered event triggers.
Options can be used for filtering purposes.

.TP
\fB\-h\fR, \fB\-\-primary_database_failure\fR
Trigger an event when the primary database fails.

.TP
\fB\-H\fR, \fB\-\-primary_database_resumed_operation\fR
Trigger an event when the primary database resumes operation after failure.

.TP
\fB\-i\fR, \fB\-\-id\fR=\fIid\fR
Trigger ID number.

.TP
\fB\-I\fR, \fB\-\-idle\fR
Trigger an event if the specified node remains in an IDLE state
for at least the time period specified by the \fB\-\-offset\fR
option. This can be useful to hibernate a node that remains idle,
thus reducing power consumption.

.TP
\fB\-j\fR, \fB\-\-jobid\fR=\fIid\fR
Job ID of interest.
\fBNOTE:\fR The \fB\-\-jobid\fR option can not be used in conjunction
with the \fB\-\-node\fR option. When the \fB\-\-jobid\fR option is
used in conjunction with the \fB\-\-up\fR or \fB\-\-down\fR option,
all nodes allocated to that job will be considered for the trigger
event.

.TP
\fB\-M\fR, \fB\-\-clusters\fR=<\fIstring\fR>
Clusters to issue commands to.
Note that the SlurmDBD must be up for this option to work properly.
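
For example, assuming a second cluster named \fIcluster2\fR (a hypothetical
name) is registered in the SlurmDBD, its triggers could be listed with:

.nf
> strigger \-\-get \-\-clusters=cluster2
.fi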

.TP
\fB\-n\fR, \fB\-\-node\fR[=\fIhost\fR]
Host name(s) of interest.
By default, all nodes associated with the job (if \fB\-\-jobid\fR
is specified) or on the system are considered for event triggers.
\fBNOTE:\fR The \fB\-\-node\fR option can not be used in conjunction
with the \fB\-\-jobid\fR option. When the \fB\-\-jobid\fR option is
used in conjunction with the \fB\-\-up\fR, \fB\-\-down\fR or
\fB\-\-drained\fR option,
all nodes allocated to that job will be considered for the trigger
event. Since this option's argument is optional, for proper
parsing the single letter option must be followed immediately with
the value and not include a space between them. For example "\-ntux"
and not "\-n tux".
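
Both of the following forms would attach a DOWN trigger to nodes tux1
through tux4. The hostlist and the program path are illustrative. The
long form is written with \fB=\fR, and the quotes only keep the shell
from interpreting the brackets:

.nf
> strigger \-\-set \-\-down \-n'tux[1\-4]' \\
           \-\-program=/usr/local/sbin/notify_down
> strigger \-\-set \-\-down \-\-node='tux[1\-4]' \\
           \-\-program=/usr/local/sbin/notify_down
.fi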

.TP
\fB\-N\fR, \fB\-\-noheader\fR
Do not print the header when displaying a list of triggers.

.TP
\fB\-o\fR, \fB\-\-offset\fR=\fIseconds\fR
The specified action should follow the event by this time interval.
Specify a negative value if the action should precede the event.
The default value is zero if no \fB\-\-offset\fR option is specified.
The resolution of this time is about 20 seconds, so to execute
a script not less than five minutes prior to a job reaching its
time limit, specify \fB\-\-offset=\-320\fR (5 minutes plus 20 seconds).
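
Because a negative offset places the action before the event, the
five\-minute case works out to \-(5 \(mu 60 + 20) = \-320 seconds.
For an illustrative job 1234 and the clean\-up script used in the
EXAMPLES section:

.nf
> strigger \-\-set \-\-jobid=1234 \-\-time \-\-offset=\-320 \\
           \-\-program=/home/joe/clean_up
.fi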

.TP
\fB\-p\fR, \fB\-\-program\fR=\fIpath\fR
Execute the program at the specified fully qualified pathname
when the event occurs.
You may quote the path and include extra program arguments if desired.
The program will be executed as the user who sets the trigger.
If the program fails to terminate within 5 minutes, it will
be killed along with any spawned processes.
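
A sketch of passing an extra argument, where
\fI/usr/local/sbin/notify_down\fR and its argument \fIops\fR are
hypothetical. This page does not state where the Slurm\-supplied hostlist
or job ID appears relative to such extra arguments, so test the script
before relying on argument positions:

.nf
> strigger \-\-set \-\-node \-\-down \\
           \-\-program="/usr/local/sbin/notify_down ops"
.fi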

.TP
\fB\-Q\fR, \fB\-\-quiet\fR
Do not report non\-fatal errors.
This can be useful to clear triggers which may have already been purged.
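
For example, a clean\-up script might remove any remaining triggers on a
finished job without reporting those that were already purged (job 1234
is illustrative):

.nf
> strigger \-\-clear \-\-quiet \-\-jobid=1234
.fi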

.TP
\fB\-r\fR, \fB\-\-reconfig\fR
Trigger an event when the system configuration changes.
This is triggered when the slurmctld daemon reads its configuration file or
when a node state changes.
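
For example, with a hypothetical notification script. Because this event
also fires on node state changes, the program may run often on a busy
system:

.nf
> strigger \-\-set \-\-reconfig \\
           \-\-program=/usr/local/sbin/notify_reconfig
.fi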

.TP
\fB\-\-set\fP
Register an event trigger based upon the supplied options.
NOTE: An event is only triggered once. A new event trigger
must be established for future events of the same type
to be processed, unless the trigger was created with \fB\-\-flags=PERM\fR.
Triggers can only be set if the command is run by the user
\fISlurmUser\fR unless \fISlurmUser\fR is configured as user root.

.TP
\fB\-t\fR, \fB\-\-time\fR
Trigger an event when the specified job's time limit is reached.
This must be used in conjunction with the \fB\-\-jobid\fR option.

.TP
\fB\-u\fR, \fB\-\-up\fR
Trigger an event if the specified node is returned to service
from a DOWN state.
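
For example, to be notified when node tux5 (an illustrative name) returns
to service, using a hypothetical script:

.nf
> strigger \-\-set \-ntux5 \-\-up \\
           \-\-program=/usr/local/sbin/notify_up
.fi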

.TP
\fB\-\-user\fR=\fIuser_name_or_id\fR
Clear or get triggers created by the specified user.
For example, a trigger created by user \fIroot\fR for a job created by user
\fIadam\fR could be cleared with an option \fI\-\-user=root\fR.
Specify either a user name or user ID.

.TP
\fB\-v\fR, \fB\-\-verbose\fR
Print detailed event logging. This includes time\-stamps on data structures,
record counts, etc.

.TP
\fB\-V\fR, \fB\-\-version\fR
Print version information and exit.

.SH "OUTPUT FIELD DESCRIPTIONS"
.TP
\fBTRIG_ID\fP
Trigger ID number.

.TP
\fBRES_TYPE\fP
Resource type: \fIjob\fR or \fInode\fR

.TP
\fBRES_ID\fP
Resource ID: job ID or host names or "*" for any host

.TP
\fBTYPE\fP
Trigger type: \fItime\fR or \fIfini\fR (for jobs only),
\fIdown\fR or \fIup\fR (for jobs or nodes), or
\fIdrained\fR, \fIidle\fR or \fIreconfig\fR (for nodes only)

.TP
\fBOFFSET\fP
Time offset in seconds. Negative numbers indicate the action should
occur before the event (if possible)

.TP
\fBUSER\fP
Name of the user requesting the action

.TP
\fBPROGRAM\fP
Pathname of the program to execute when the event occurs
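.PP
The fields above can be used to script bulk operations. The sketch below
clears every \fIdown\fR trigger created by user joe (an illustrative name).
It assumes the whitespace\-separated column order shown in the EXAMPLES
section, with TRIG_ID first and TYPE fourth.

.nf
> strigger \-\-get \-\-noheader \-\-user=joe |
      awk '$4 == "down" {print $1}' |
      while read \-r id; do
          strigger \-\-clear \-\-id="$id"
      done
.fi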

.SH "ENVIRONMENT VARIABLES"
.PP
Some \fBstrigger\fR options may be set via environment variables. These
environment variables, along with their corresponding options, are listed below.
(Note: command line options always override these settings.)
.TP 20
\fBSLURM_CONF\fR
The location of the Slurm configuration file.
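
For example, to list triggers using an alternate configuration file
(the path is hypothetical):

.nf
> SLURM_CONF=/etc/slurm/test/slurm.conf strigger \-\-get
.fi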

.SH "EXAMPLES"
Execute the program "/usr/sbin/primary_slurmctld_failure" whenever the
primary slurmctld fails.

.nf
> cat /usr/sbin/primary_slurmctld_failure
#!/bin/bash
# Submit trigger for next primary slurmctld failure event
strigger \-\-set \-\-primary_slurmctld_failure \\
         \-\-program=/usr/sbin/primary_slurmctld_failure
# Notify the administrator of the failure by e\-mail
/usr/bin/mail \-s Primary_SLURMCTLD_FAILURE slurm_admin@site.com

> strigger \-\-set \-\-primary_slurmctld_failure \\
           \-\-program=/usr/sbin/primary_slurmctld_failure
.fi

.PP
Execute the program "/usr/sbin/slurm_admin_notify" whenever
any node in the cluster goes down. The subject line will include
the node names which have entered the down state (passed as an
argument to the script by Slurm).

.nf
> cat /usr/sbin/slurm_admin_notify
#!/bin/bash
# Submit trigger for next event
strigger \-\-set \-\-node \-\-down \\
         \-\-program=/usr/sbin/slurm_admin_notify
# Notify the administrator by e\-mail
/usr/bin/mail \-s "NodesDown:$*" slurm_admin@site.com

> strigger \-\-set \-\-node \-\-down \\
           \-\-program=/usr/sbin/slurm_admin_notify
.fi

.PP
Execute the program "/usr/sbin/slurm_suspend_node" whenever
any node in the cluster remains in the idle state for at least
600 seconds.

.nf
> strigger \-\-set \-\-node \-\-idle \-\-offset=600 \\
           \-\-program=/usr/sbin/slurm_suspend_node
.fi

.PP
Execute the program "/home/joe/clean_up" when job 1234 is within
10 minutes of reaching its time limit.

.nf
> strigger \-\-set \-\-jobid=1234 \-\-time \-\-offset=\-600 \\
           \-\-program=/home/joe/clean_up
.fi

.PP
Execute the program "/home/joe/node_died" when any node allocated to
job 1234 enters the DOWN state.

.nf
> strigger \-\-set \-\-jobid=1234 \-\-down \\
           \-\-program=/home/joe/node_died
.fi

.PP
Show all triggers associated with job 1235.

.nf
> strigger \-\-get \-\-jobid=1235
TRIG_ID RES_TYPE RES_ID TYPE OFFSET USER PROGRAM
    123      job   1235 time   \-600  joe /home/bob/clean_up
    125      job   1235 down      0  joe /home/bob/node_died
.fi

.PP
Delete event trigger 125.

.nf
> strigger \-\-clear \-\-id=125
.fi

.PP
Execute /home/joe/job_fini upon completion of job 1237.

.nf
> strigger \-\-set \-\-jobid=1237 \-\-fini \-\-program=/home/joe/job_fini
.fi

.SH "COPYING"
Copyright (C) 2007 The Regents of the University of California.
Produced at Lawrence Livermore National Laboratory (cf, DISCLAIMER).
.br
Copyright (C) 2008\-2010 Lawrence Livermore National Security.
.br
Copyright (C) 2010\-2013 SchedMD LLC.
.LP
This file is part of Slurm, a resource management program.
For details, see <https://slurm.schedmd.com/>.
.LP
Slurm is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free
Software Foundation; either version 2 of the License, or (at your option)
any later version.
.LP
Slurm is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more
details.

.SH "SEE ALSO"
\fBscontrol\fR(1), \fBsinfo\fR(1), \fBsqueue\fR(1)