File: installation-startup-shutdown-reconfiguration.rst

Starting Up, Shutting Down and Reconfiguring the System
========================================================

If you installed HTCondor with administrative privileges, HTCondor will
start up when the machine boots and shut down when the machine does, using
the usual mechanism for the machine's operating system.  You can generally
use those mechanisms in the usual way if you need to manually control
whether or not HTCondor is running.  There are two situations in
which you might want to run :doc:`../man-pages/condor_master`,
:doc:`../man-pages/condor_on`, or :doc:`../man-pages/condor_off` from the
command line.

#. If you installed HTCondor without administrative privileges, you'll
   have to run :tool:`condor_master` from the command line to turn on HTCondor:

    .. code-block:: console

        $ condor_master

   Then run the following command to turn HTCondor completely off:

    .. code-block:: console

        $ condor_off -master

#. If the usual OS-specific method of controlling HTCondor is inconvenient
   to use remotely, you may be able to use the :tool:`condor_on` and :tool:`condor_off`
   tools instead.

Daemons That Do Not Run as root
-------------------------------

:index:`running as root`
:index:`running as root<single: running as root; daemon>`

HTCondor is normally installed such that the HTCondor daemons have root
permission. This allows HTCondor to run the *condor_shadow*
daemon and the job with the submitting user's UID and file access
rights. When HTCondor is started as root, HTCondor jobs can access
whatever files the user that submits the jobs can.

However, the person installing HTCondor may not have root access, or may
have chosen not to run the daemons as root. That is unfortunate, since
HTCondor is designed to be run as root. To see if HTCondor is running as
root on a specific machine, use the command

.. code-block:: console

      $ condor_status -master -l <machine-name>

where <machine-name> is the name of the specified machine. This command
displays the full condor_master ClassAd; if the attribute :ad-attr:`RealUid`
equals zero, then the HTCondor daemons are indeed running with root
access. If the :ad-attr:`RealUid` attribute is not zero, then the HTCondor
daemons do not have root access.
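
As a rough illustration of this check, the following Python sketch scans
long-form ClassAd text for the :ad-attr:`RealUid` attribute. The function
names and the sample text are hypothetical, and the sketch assumes that each
attribute appears on its own line in the form ``Name = Value``; in practice
the text would be the output of the *condor_status* command shown above.

.. code-block:: python

    import re

    def real_uid(classad_text):
        """Return the integer RealUid found in ClassAd text, or None if absent."""
        match = re.search(r"^\s*RealUid\s*=\s*(-?\d+)\s*$", classad_text, re.MULTILINE)
        return int(match.group(1)) if match else None

    def runs_as_root(classad_text):
        """True only when RealUid is present and equal to zero."""
        return real_uid(classad_text) == 0

    # Hypothetical excerpt of a condor_master ClassAd, for illustration only.
    sample = 'Name = "node1.example.org"\nRealUid = 0\n'
    print(runs_as_root(sample))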

.. note::

   The Unix program *ps* is not an effective method of determining if HTCondor is
   running with root access. When using *ps*, it may often appear that the daemons
   are running as the condor user instead of root.  However, note that the *ps*
   command shows the current effective owner of the process, not the real owner.
   (See the *getuid* (2) and *geteuid* (2) Unix man pages for details.) In Unix, a
   process running under the real UID of root may switch its effective UID. (See
   the *seteuid* (2) man page.) For security reasons, the daemons only set the
   effective UID to root when absolutely necessary, such as when performing a
   privileged operation.
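
The distinction drawn in the note can be observed directly. This minimal
Python sketch (the function name is hypothetical) reports both IDs for the
current process using the standard ``os`` module. For an ordinary process the
two values are normally equal; a process started as root that has switched
its effective UID would report a real UID of zero and a nonzero effective UID.

.. code-block:: python

    import os

    def uid_report():
        """Return the real and effective UID of the calling process (Unix only)."""
        return {"real": os.getuid(), "effective": os.geteuid()}

    report = uid_report()
    print("real UID:", report["real"], "effective UID:", report["effective"])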

If daemons are not running with root access, make every file and
directory that the job will touch readable and/or writable by
the UID (user id) specified by the :ad-attr:`RealUid` attribute. Often this may
mean using the Unix command ``chmod 777`` on the directory from which the
HTCondor job is submitted.

Remote Management Features
--------------------------

:index:`shutting down HTCondor<single: shutting down HTCondor; pool management>`
:index:`restarting HTCondor<single: restarting HTCondor; pool management>`

All of the commands described in this section are subject to the
security policy chosen for the HTCondor pool.  As such, the commands must
be either run from a machine that has the proper authorization, or run
by a user that is authorized to issue the commands.
The :doc:`/admin-manual/security` section details the
implementation of security in HTCondor.

 Shutting Down HTCondor
    There are a variety of ways to shut down all or parts of an HTCondor
    pool. All utilize the :tool:`condor_off` tool.

    To stop a single execute machine from running jobs, pass the
    machine's host name to the :tool:`condor_off` command.

    .. code-block:: console

        $ condor_off -startd <hostname>

    Running jobs will be killed. To have the machine stop running jobs
    only after the currently executing job completes,
    the command is

    .. code-block:: console

        $ condor_off -startd -peaceful <hostname>

    Note that this waits indefinitely for the running job to finish,
    before the *condor_startd* daemon exits.

    To shut down all execute machines within the pool,

    .. code-block:: console

        $ condor_off -all -startd

    To wait indefinitely for each machine in the pool to finish its
    current HTCondor job, shutting down each execute machine once it
    no longer has a running job,

    .. code-block:: console

        $ condor_off -all -startd -peaceful

    To shut down HTCondor on a machine from which jobs are submitted,

    .. code-block:: console

        $ condor_off -schedd <hostname>

    If it is instead desired that the access point (which runs the
    *condor_schedd*) shuts down only after all jobs that are currently in the
    queue are finished, first disable new submissions to the queue by setting
    the configuration variable

    .. code-block:: condor-config

        MAX_JOBS_SUBMITTED = 0

    See instructions below in :ref:`Reconfiguring an HTCondor Pool <reconfiguring>`
    for how to reconfigure a pool. After the reconfiguration,
    the command to wait for all jobs to complete and shut down the submission of
    jobs is

    .. code-block:: console

        $ condor_off -schedd -peaceful <hostname>

    Substitute the option **-all** for the host name, if all submit
    machines in the pool are to be shut down.
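
    The variations above differ only in which options are combined. As a
    sketch, assuming only the options shown in this section (``-startd``,
    ``-schedd``, ``-peaceful`` and ``-all``), a hypothetical Python helper
    can assemble the argument list. The option order follows the
    *condor_startd* examples above, and actually running the command (for
    example with ``subprocess.run``) is left to the caller.

    .. code-block:: python

        def condor_off_args(daemon, hostname=None, peaceful=False):
            """Build a condor_off argument list.

            daemon:   "startd" or "schedd"
            hostname: target host name, or None to address the whole pool (-all)
            peaceful: wait for running jobs to finish before the daemon exits
            """
            if daemon not in ("startd", "schedd"):
                raise ValueError("daemon must be 'startd' or 'schedd'")
            args = ["condor_off"]
            if hostname is None:
                args.append("-all")
            args.append("-" + daemon)
            if peaceful:
                args.append("-peaceful")
            if hostname is not None:
                args.append(hostname)
            return args

        print(" ".join(condor_off_args("startd", peaceful=True)))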

 Restarting HTCondor, If HTCondor Daemons Are Not Running
    If HTCondor is not running, perhaps because one of the :tool:`condor_off`
    commands was used, then starting HTCondor daemons back up depends on
    which part of HTCondor is currently not running.

    If no HTCondor daemons are running, then starting HTCondor is a
    matter of executing the :tool:`condor_master` daemon. The
    :tool:`condor_master` daemon will then invoke all other specified daemons
    on that machine. The :tool:`condor_master` daemon executes on every
    machine that is to run HTCondor.

    If a specific daemon needs to be started up, and the
    :tool:`condor_master` daemon is already running, then issue the
    following command on that machine:

    .. code-block:: console

        $ condor_on -subsystem <subsystemname>

    where <subsystemname> is replaced by the daemon's subsystem name.
    Or, this command might be issued from another machine in the pool
    (which has administrative authority) with

    .. code-block:: console

        $ condor_on <hostname> -subsystem <subsystemname>

    where <subsystemname> is replaced by the daemon's subsystem name,
    and <hostname> is replaced by the host name of the machine where
    this :tool:`condor_on` command is to be directed.

 Restarting HTCondor, If HTCondor Daemons Are Running
    If HTCondor daemons are currently running, but need to be killed and
    newly invoked, the :tool:`condor_restart` tool does this. This would be
    the case for a new value of a configuration variable for which using
    :tool:`condor_reconfig` is inadequate.

    To restart all daemons on all machines in the pool,

    .. code-block:: console

        $ condor_restart -all

    To restart all daemons on a single machine in the pool,

    .. code-block:: console

        $ condor_restart <hostname>

    where <hostname> is replaced by the host name of the machine to be
    restarted.

.. _reconfiguring:

 Reconfiguring an HTCondor Pool
    :index:`reconfiguration<single: reconfiguration; pool management>`

    To change a global configuration variable and have all the machines
    start to use the new setting, change the value within the file, and send
    a :tool:`condor_reconfig` command to each host. Do this with a single
    command,

    .. code-block:: console

      $ condor_reconfig -all

    If the global configuration file is not shared among all the machines
    (as it would be when using a shared file system), the change must be
    made to each copy of the global configuration file before issuing the
    :tool:`condor_reconfig` command.

    Issuing a :tool:`condor_reconfig` command is inadequate for some
    configuration variables. For those, a restart of HTCondor is required.
    Those configuration variables that require a restart are listed in
    the :ref:`admin-manual/introduction-to-configuration:macros that will require a
    restart when changed` section.  You can also refer to the
    :doc:`/man-pages/condor_restart` manual page.

DaemonCore
----------

:index:`daemoncore`
:index:`shared functionality in daemons<single: shared functionality in daemons; HTCondor>`

This section is a brief description of DaemonCore. DaemonCore is a
library shared among most of the HTCondor daemons that provides
common functionality. Currently, the following daemons use DaemonCore:

-  *condor_master*
-  *condor_startd*
-  *condor_schedd*
-  *condor_collector*
-  *condor_negotiator*
-  *condor_kbdd*
-  *condor_gridmanager*
-  *condor_credd*
-  *condor_had*
-  *condor_replication*
-  *condor_transferer*
-  *condor_job_router*
-  *condor_lease_manager*
-  *condor_rooster*
-  *condor_shared_port*
-  *condor_defrag*
-  *condor_c-gahp*
-  *condor_c-gahp_worker_thread*
-  *condor_dagman*
-  *condor_ft-gahp*
-  *condor_shadow*
-  *condor_transferd*
-  *condor_vm-gahp*

Most of DaemonCore's details are not interesting for administrators.
However, DaemonCore does provide a uniform interface for the daemons to
various Unix signals, and provides a common set of command-line options
that can be used to start up each daemon.

DaemonCore and Unix signals
'''''''''''''''''''''''''''

:index:`Unix signals<single: Unix signals; daemoncore>`

One of the most visible features that DaemonCore provides for
administrators is that all daemons which use it behave the same way on
certain Unix signals. The signals and the behavior DaemonCore provides
are listed below:

SIGHUP
    Causes the daemon to reconfigure itself.
SIGTERM
    Causes the daemon to shut down gracefully.
SIGQUIT
    Causes the daemon to shut down quickly.

Exactly what "gracefully" and "quickly" mean varies from daemon to daemon.
For daemons with little or no state (the *condor_kbdd*,
*condor_collector* and *condor_negotiator*) there is no difference,
and both ``SIGTERM`` and ``SIGQUIT`` signals result in the daemon
shutting itself down quickly. For the :tool:`condor_master`, a graceful
shutdown causes the :tool:`condor_master` to ask all of its children to
perform their own graceful shutdown methods. The quick shutdown causes
the :tool:`condor_master` to ask all of its children to perform their own
quick shutdown methods. In both cases, the :tool:`condor_master` exits after
all its children have exited. In the *condor_startd*, if the machine is
not claimed and running a job, both the ``SIGTERM`` and ``SIGQUIT``
signals result in an immediate exit. In the *condor_schedd*, if
there are no jobs currently running, there will be no *condor_shadow*
processes, and both signals result in an immediate exit. However, with
jobs running, a graceful shutdown causes the *condor_schedd* to ask
each *condor_shadow* to gracefully vacate the job it is serving, while
a quick shutdown results in a hard kill of every *condor_shadow*.

For all daemons, a reconfigure results in the daemon re-reading its
configuration file(s), causing any settings that have changed to take
effect. See the :doc:`/admin-manual/introduction-to-configuration` section for
full details on what settings are in the configuration files and what they do.
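
Because the three signals mean the same thing to every DaemonCore daemon, an
administrator's script can treat them as a small lookup table. The following
Python sketch is illustrative only: the action labels and function names are
hypothetical, and it relies on nothing beyond the standard ``os.kill`` call.

.. code-block:: python

    import os
    import signal

    # Mapping taken from the signal list above; the keys are illustrative labels.
    ACTION_SIGNALS = {
        "reconfigure": signal.SIGHUP,
        "graceful-shutdown": signal.SIGTERM,
        "quick-shutdown": signal.SIGQUIT,
    }

    def signal_for(action):
        """Return the Unix signal that requests the given action."""
        try:
            return ACTION_SIGNALS[action]
        except KeyError:
            raise ValueError("unknown action: %r" % (action,)) from None

    def request(pid, action):
        """Send the signal for `action` to the process with the given PID."""
        os.kill(pid, signal_for(action))

    print(signal_for("reconfigure").name)

Sending a signal this way requires permission to signal the target process,
so it is subject to the usual Unix ownership rules.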

DaemonCore and Command-line Arguments
'''''''''''''''''''''''''''''''''''''

:index:`command line arguments<single: command line arguments; daemoncore>`
:index:`command line arguments<single: command line arguments; HTCondor daemon>`

The second visible feature that DaemonCore provides to administrators is
a common set of command-line arguments that all daemons understand.
These arguments and what they do are described below:

\-a string
    Append a period character ('.') concatenated with **string** to the
    file name of the log for this daemon, as specified in the
    configuration file.

\-b
    Causes the daemon to start up in the background. When a DaemonCore
    process starts up with this option, it disassociates itself from the
    terminal and forks itself, so that it runs in the background. This
    is the default behavior for the :tool:`condor_master`. Prior to 8.9.7 it
    was the default for all HTCondor daemons.

\-c filename
    Causes the daemon to use the specified **filename**, given as a full
    path and file name, as its global configuration file. This overrides the
    ``CONDOR_CONFIG`` environment variable and the regular locations
    that HTCondor checks for its configuration file.

\-d
    Use dynamic directories. The ``$(LOG)``, ``$(SPOOL)``, and
    ``$(EXECUTE)`` directories are all created by the daemon at run
    time, and they are named by appending the parent's IP address and
    PID to the value in the configuration file. These values are then
    inherited by all children of the daemon invoked with this **-d**
    argument. For the :tool:`condor_master`, all HTCondor processes will use
    the new directories. If a *condor_schedd* is invoked with the **-d**
    argument, then only the *condor_schedd* daemon and any
    *condor_shadow* daemons it spawns will use the dynamic directories
    (named with the *condor_schedd* daemon's PID).

    Note that by using a dynamically-created spool directory named by
    the IP address and PID, upon restarting daemons, jobs submitted to
    the original *condor_schedd* daemon that were stored in the old
    spool directory will not be noticed by the new *condor_schedd*
    daemon, unless you manually specify the old, dynamically-generated
    :macro:`SPOOL` directory path in the configuration of the new
    *condor_schedd* daemon.

\-f
    Causes the daemon to start up in the foreground. Instead of forking,
    the daemon runs in the foreground. Since 8.9.7, this has been the default
    for all daemons other than the :tool:`condor_master`.

\-k filename
    For non-Windows operating systems, causes the daemon to read out a
    PID from the specified **filename**, and send a SIGTERM to that
    process. The daemon started with this optional argument waits until
    the daemon it is attempting to kill has exited.
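
    The first half of this behavior can be sketched in Python as follows.
    This is an illustration rather than HTCondor's implementation: the
    function name is hypothetical, and the wait for the target process to
    exit is omitted.

    .. code-block:: python

        import os
        import signal

        def sigterm_from_pidfile(filename):
            """Read a PID from `filename` and send SIGTERM to that process.

            Returns the PID that was signalled. Raises ValueError if the file
            does not contain a positive integer.
            """
            with open(filename) as handle:
                pid = int(handle.read().strip())
            if pid <= 0:
                # Non-positive values have special meanings to kill(2); refuse them.
                raise ValueError("invalid PID in %s: %d" % (filename, pid))
            os.kill(pid, signal.SIGTERM)
            return pid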

\-l directory
    Overrides the value of :macro:`LOG` as specified in
    the configuration files. Primarily, this option is used with the
    *condor_kbdd* when it needs to run as the individual user logged
    into the machine, instead of running as root. Regular users would
    not normally have permission to write files into HTCondor's log
    directory. Using this option, they can override the value of :macro:`LOG`
    and have the *condor_kbdd* write its log file into a directory that
    the user has permission to write to.

\-local-name name
    Specify a local name for this instance of the daemon. This local
    name will be used to look up configuration parameters.
    The :ref:`admin-manual/introduction-to-configuration:configuration file
    macros` section contains details on how this local name will be used in the
    configuration.

\-p port
    Causes the daemon to bind to the specified port as its command
    socket. The :tool:`condor_master` daemon uses this option to ensure that
    the *condor_collector* and *condor_negotiator* start up using
    well-known ports that the rest of HTCondor depends upon them using.

\-pidfile filename
    Causes the daemon to write out its PID (process id number) to the
    specified **filename**. This file can be used to help shut down the
    daemon without first searching through the output of the Unix *ps*
    command.

    Since daemons run with their current working directory set to the
    value of :macro:`LOG`, if a full path (one that begins with a slash
    character, ``/``) is not specified, the file will be placed in the
    :macro:`LOG` directory.
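
    The path rule can be expressed in a few lines of Python. In this
    sketch the function name and both example paths are hypothetical;
    ``os.path.join`` already keeps an absolute **filename** unchanged and
    places a relative one under the given directory.

    .. code-block:: python

        import os.path

        def pidfile_location(filename, log_dir):
            """Return where a -pidfile argument ends up, given the LOG directory."""
            # os.path.join discards log_dir when filename is already absolute.
            return os.path.normpath(os.path.join(log_dir, filename))

        print(pidfile_location("master.pid", "/var/log/condor"))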

\-q
    Quiet output; write less verbose error messages to ``stderr`` when
    something goes wrong, and before regular logging can be initialized.

\-r minutes
    Causes the daemon to set a timer, upon expiration of which, it sends
    itself a SIGTERM for graceful shutdown.
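
    A comparable timer can be sketched in Python with ``threading.Timer``.
    This only illustrates the idea of a process sending itself a SIGTERM
    after a delay; the function name is hypothetical and HTCondor's own
    timer mechanism is not shown here.

    .. code-block:: python

        import os
        import signal
        import threading

        def shutdown_after(minutes):
            """Arrange for this process to receive SIGTERM after `minutes` minutes.

            Returns the started timer so the caller can cancel() it.
            """
            timer = threading.Timer(
                minutes * 60.0, os.kill, args=(os.getpid(), signal.SIGTERM)
            )
            timer.daemon = True  # do not keep the process alive just for the timer
            timer.start()
            return timer

    Unless the process installs its own SIGTERM handler, the default action
    terminates it when the timer fires.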

\-t
    Causes the daemon to print out its error message to ``stderr``
    instead of its specified log file. This option forces the **-f**
    option.

\-v
    Causes the daemon to print out version information and exit.