.. _admin_quick_start_guide:

Administrative Quick Start Guide
================================

This guide does not contain step-by-step instructions for
:doc:`getting HTCondor <index>`.  Rather, it is a guide to joining multiple
machines into a single pool of computational resources for use by HTCondor
jobs.

This guide begins by briefly describing the three roles required by every
HTCondor pool, as well as the resources and networking required by each
of those roles.  This information will enable you to choose which machine(s)
will perform which role(s).  This guide also includes instructions on how to
use the ``get_htcondor`` tool to install and configure Linux (or Mac) machines
to perform each of the roles.

If you're curious, are using Windows machines, or want to automate the
configuration of your pool using a tool like Puppet, the
:ref:`last section <the_details>` of this guide briefly describes what
the ``get_htcondor`` tool does and provides a link to the rest of the details.

.. sidebar:: Single-machine Installations

    If you just finished installing a single-machine ("mini") HTCondor
    using ``get_htcondor``, you can just run ``get_htcondor`` again (and
    follow its instructions) to reconfigure the machine to be one of
    these three roles; this may destroy any other configuration changes
    you've made.

    We don't recommend trying to add a machine configured as a "mini"
    HTCondor to the pool, or trying to add execute machines to an existing
    "mini" HTCondor pool.  We also don't recommend creating an entire
    pool out of unprivileged installations.

The Three Roles
---------------

Even a single-machine installation of HTCondor performs all three roles.

The Execute Role
################

The most common reason for adding a machine to an HTCondor pool is to make
another machine execute HTCondor jobs; the first major role, therefore, is
the execute role.  This role is responsible for the technical aspects of
actually running, monitoring, and managing the job's executable; transferring
the job's input and output; and advertising, monitoring, and managing the
resources of the execute machine.  HTCondor can manage pools containing
tens of thousands of execute machines, so this is by far the most common role.

The execute role itself uses very few resources, so almost any machine
can contribute to a pool.  The execute role can run on a machine with only
outbound network connectivity, but being able to accept inbound connections
from the machine(s) performing the submit role will simplify setup and reduce
overhead.  The execute machine does not need to allow user access, or
even share user IDs with other machines in the pool (although this may be
very convenient, especially on Windows).

The Submit Role
###############

We'll discuss what "advertising" a machine's resources means in the next
section, but the execute role leaves an obvious question unanswered: where
do the jobs come from?  The answer is the submit role.  This role is
responsible for accepting, monitoring, managing, and scheduling jobs on its
assigned resources; transferring the input and output of jobs; and requesting
and accepting resource assignments.  (A "resource" is some reserved fraction
of an execute machine.)  HTCondor allows arbitrarily many submit roles in a
pool, but for administrative convenience, most pools only have one, or a
small number, of machines acting in the submit role.

A submit-role machine, also called an access point, requires a bit under a
megabyte of RAM for each running job, and its ability to transfer data to and
from the execute-role machines may become a performance bottleneck.  We
typically recommend adding another access point for every twenty thousand
simultaneously running jobs.  An access point must have outbound network
connectivity, but an access point without inbound network connectivity can't
use execute-role machines that also lack inbound network connectivity.
Because execute machines are more numerous, it is typically the access points
that allow inbound connections.  Although you may allow users to submit jobs
over the network, we recommend allowing users SSH access to the access point.
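
As a rough sizing aid, the two rules of thumb above can be turned into a
small calculation.  This is only a sketch: the helper names
``access_points_needed`` and ``submit_ram_mb`` are hypothetical (they are not
HTCondor tools), and the one-megabyte figure is used as an upper bound on
the per-job memory, not a measured value.

.. code-block:: shell

    # Rules of thumb from the paragraph above; treat the results as estimates.
    jobs_per_access_point=20000   # recommended ceiling per access point
    mb_per_running_job=1          # upper bound ("a bit under a megabyte")

    # Hypothetical helper: access points for $1 running jobs, rounded up.
    access_points_needed() {
        echo $(( ($1 + jobs_per_access_point - 1) / jobs_per_access_point ))
    }

    # Hypothetical helper: upper bound, in MB, on RAM used to track $1 running jobs.
    submit_ram_mb() {
        echo $(( $1 * mb_per_running_job ))
    }

    access_points_needed 45000   # prints 3
    submit_ram_mb 20000          # prints 20000, i.e. roughly 20 GB

The memory figure covers only HTCondor's per-job bookkeeping; anything users
run interactively on the access point comes on top of it.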

The Central Manager Role
########################

Only one machine in each HTCondor pool can perform this role (barring
certain high-availability configurations, where only one machine can
perform this role at a time).  A central manager matches resource requests --
generated by the submit role based on its jobs -- with the resources described
by the execute machines.  We refer to sending these (automatically-generated)
descriptions to the central manager as "advertising" because it's the
primary way execute machines get jobs to run.

A central manager must accept connections from each execute machine and each
access point in a pool.  However, users should never need access to the
central manager.  Every machine in the pool updates the central manager every
few minutes, and it answers both system and user queries about the status of
the pool's resources, so a fast network is important.  For very large pools,
memory may become a limiting factor.

Assigning Roles to Machines
---------------------------

The easiest time to assign a role to a machine is when you initially
:doc:`get HTCondor <index>`.  You'll need to supply the same password for
each machine in the same pool; sharing that secret is how the machines
recognize each other as members of the same pool, and connections between
machines are encrypted with it.  (HTCondor uses port 9618 to communicate,
so make sure that the machines in your pool accept TCP connections on that
port from each other.)  In the command lines below, replace
``$htcondor_password`` with the password you want to use.  In addition to the
password, you must specify the name of the central manager, which may be a
host name (which must resolve on all machines in the pool) or an IP address.
In the command lines below, replace ``$central_manager_name`` with the host
name or IP address you want to use.

When you :doc:`get HTCondor <index>`, start with the central manager, then add
the access point(s), and then add the execute machine(s).  You may
not have ``sudo`` installed; you may omit it from the command lines below
if you run them as root.
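
Once the central manager is installed, and before adding the other machines,
it may be worth confirming that port 9618 is reachable from each of them.
The following is a minimal sketch, not something ``get_htcondor`` provides:
``port_open`` is a hypothetical helper name, and it assumes that ``bash``
(for its ``/dev/tcp`` pseudo-device) and the coreutils ``timeout`` command
are available, which may not be the case on a Mac.

.. code-block:: shell

    # Hypothetical helper (not an HTCondor tool): succeeds if a TCP connection
    # to host $1 on port $2 can be opened within five seconds.
    port_open() {
        timeout 5 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
    }

    # Replace $central_manager_name with your host name or IP address.
    if port_open "$central_manager_name" 9618; then
        echo "port 9618 reachable"
    else
        echo "port 9618 not reachable"
    fi

A failure here only says that no TCP connection could be made; it does not
distinguish a firewall rule from a central manager that isn't running yet.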

.. rubric:: Central Manager

.. code-block:: shell

    curl -fsSL https://get.htcondor.org | sudo GET_HTCONDOR_PASSWORD="$htcondor_password" /bin/bash -s -- --no-dry-run --central-manager $central_manager_name

.. rubric:: Submit

.. code-block:: shell

    curl -fsSL https://get.htcondor.org | sudo GET_HTCONDOR_PASSWORD="$htcondor_password" /bin/bash -s -- --no-dry-run --submit $central_manager_name

.. rubric:: Execute

.. code-block:: shell

    curl -fsSL https://get.htcondor.org | sudo GET_HTCONDOR_PASSWORD="$htcondor_password" /bin/bash -s -- --no-dry-run --execute $central_manager_name

At this point, users logged in on the access point should be able to see
execute machines in the pool (using ``condor_status``), submit jobs
(using ``condor_submit``), and see them run (using ``condor_q``).
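
For a quick end-to-end check, a trivial job is usually enough.  The submit
description below is a sketch under a few assumptions: the file name
``sleep.sub`` and the log name ``sleep.log`` are arbitrary choices, and
``/bin/sleep`` is assumed to exist on the access point and to run on your
execute machines.

.. code-block:: text

    # sleep.sub: a hypothetical test job that sleeps for 30 seconds
    executable = /bin/sleep
    arguments  = 30
    log        = sleep.log
    queue

Submitting it with ``condor_submit sleep.sub`` and then running ``condor_q``
should show the job first idle and then running; ``sleep.log`` records the
job's events as they happen.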

Creating a Multi-Machine Pool using Windows or Containers
#########################################################

If you are creating a multi-machine HTCondor pool on Windows computers or
using containerization, please see the "Setting Up a Whole Pool" section
of the relevant installation guide:

* :ref:`admin_install_windows_pool`
* :ref:`docker_image_pool`

Where to Go from Here
---------------------

There are two major directions you can go from here, but before we discuss
them, a warning.

.. admonition:: Making Configuration Changes
    :class: warning

    HTCondor configuration files should generally be owned by root
    (or Administrator, on Windows), but readable by all users.  We recommend
    that you don't make changes to the configuration files established by the
    installation procedure; this avoids conflicts between your changes and any
    changes we may have to make to the base configuration in future
    updates.  Instead, you should add (or edit) files in the configuration
    directory; its location can be determined on a given machine by running
    ``condor_config_val LOCAL_CONFIG_DIR`` there.  HTCondor will process files
    in this directory in lexicographic order, so we recommend naming files
    ``##-name.config`` so that, for example, a setting in ``00-base.config``
    will be overridden by a setting in ``99-specific.config``.
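
To illustrate the ordering, suppose the configuration directory contained
the two files below.  Both the file contents and the macro name
``EXAMPLE_SETTING`` are placeholders invented for this example; they are not
settings HTCondor itself uses.

.. code-block:: text

    # 00-base.config
    EXAMPLE_SETTING = base

    # 99-specific.config (processed later, so its value takes effect)
    EXAMPLE_SETTING = specific

With both files in place, ``condor_config_val EXAMPLE_SETTING`` should report
``specific``.  After changing real settings, ``condor_reconfig`` asks the
running daemons to re-read their configuration.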

.. rubric:: Enabling Features

Some features of HTCondor, for one reason or another, aren't (or can't be)
enabled by default.  Areas of potentially general interest include:

* :doc:`/admin-manual/ep-policy-configuration` (particularly
  :ref:`enabling_oauth_credentials` and :ref:`resource_limits_with_cgroups`)
* :ref:`admin-manual/ep-policy-configuration:docker universe`
* :ref:`admin-manual/ep-policy-configuration:Apptainer and Singularity support`

.. rubric:: Implementing Policies

Although your HTCondor pool should be fully functional at this point, it
may not be behaving precisely as you wish, particularly with respect to
resource allocation.  You can tune how HTCondor allocates resources to
users, or groups of users, using the user priority and group quota systems,
described in :doc:`../admin-manual/cm-configuration`.  You
can enforce machine-specific policies -- for instance, preferring GPU jobs
on machines with GPUs -- using the options described in
:doc:`../admin-manual/ep-policy-configuration`.

.. rubric:: Further Reading

* It may be helpful to at least skim the :doc:`../users-manual/index` to get
  an idea of what your users might want or expect, particularly the
  sections on :doc:`../automated-workflows/dagman-introduction`,
  :doc:`../users-manual/choosing-an-htcondor-universe`, and
  :doc:`../users-manual/self-checkpointing-applications`.
* Understanding :doc:`../classads/classad-mechanism` is essential for
  many administrative tasks.
* The rest of the :doc:`../admin-manual/index`, particularly the section on
  :ref:`admin-manual/cm-configuration:Monitoring with Ganglia, Elasticsearch, etc.`.
* Slides from
  `past HTCondor Weeks <https://htcondor.org/past_condor_weeks.html>`_
  -- our annual conference -- include a number of tutorials and talks on
  administrative topics, including monitoring and examples of policies and
  their implementations.

.. _the_details:

What ``get_htcondor`` Does to Configure a Role
----------------------------------------------

The configuration files generated by ``get_htcondor`` are very similar, and
only two lines long:

* set the HTCondor configuration variable :macro:`CONDOR_HOST` to the name
  (or IP address) of your central manager;
* add the appropriate metaknob: ``use role : get_htcondor_central_manager``,
  ``use role : get_htcondor_submit``, or ``use role : get_htcondor_execute``.
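
For an execute machine, for instance, the two lines would look roughly like
the following.  The host name ``cm.example.org`` is a placeholder, and the
file that ``get_htcondor`` actually writes may differ in comments or spacing.

.. code-block:: text

    # Placeholder host name; use the name or IP address of your central manager.
    CONDOR_HOST = cm.example.org
    use role : get_htcondor_execute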

Putting all of the pool-independent configuration into the metaknobs allows
us to change the metaknobs to fix problems or work with later versions of
HTCondor as you upgrade.

The ``get_htcondor`` :doc:`documentation <../man-pages/get_htcondor>`
describes what the configuration script does and how to determine the exact details.