.. _admin_quick_start_guide:

Administrative Quick Start Guide
================================

This guide does not contain step-by-step instructions for
:doc:`getting HTCondor <index>`. Rather, it is a guide to joining multiple
machines into a single pool of computational resources for use by HTCondor
jobs.

This guide begins by briefly describing the three roles required by every
HTCondor pool, as well as the resources and networking required by each
of those roles. This information will enable you to choose which machine(s)
will perform which role(s). This guide also includes instructions on how to
use the ``get_htcondor`` tool to install and configure Linux (or Mac) machines
to perform each of the roles.

If you're curious, are using Windows machines, or want to automate the
configuration of your pool using a tool like Puppet, the
:ref:`last section <the_details>` of this guide briefly describes what
the ``get_htcondor`` tool does and provides a link to the rest of the details.

.. sidebar:: Single-machine Installations

   If you just finished installing a single-machine ("mini") HTCondor
   using ``get_htcondor``, you can just run ``get_htcondor`` again (and
   follow its instructions) to reconfigure the machine to perform one of
   these three roles; note that this may destroy any other configuration
   changes you've made.

   We don't recommend trying to add a machine configured as a "mini"
   HTCondor to the pool, or trying to add execute machines to an existing
   "mini" HTCondor pool. We also don't recommend creating an entire
   pool out of unprivileged installations.

The Three Roles
---------------

Even a single-machine installation of HTCondor performs all three roles.

The Execute Role
################

The most common reason for adding a machine to an HTCondor pool is to make
another machine execute HTCondor jobs; the first major role, therefore, is
the execute role. This role is responsible for the technical aspects of
actually running, monitoring, and managing the job's executable; transferring
the job's input and output; and advertising, monitoring, and managing the
resources of the execute machine. HTCondor can manage pools containing
tens of thousands of execute machines, so this is by far the most common role.

The execute role itself uses very few resources, so almost any machine
can contribute to a pool. The execute role can run on a machine with only
outbound network connectivity, but being able to accept inbound connections
from the machine(s) performing the submit role will simplify setup and reduce
overhead. The execute machine does not need to allow user access, or
even share user IDs with other machines in the pool (although this may be
very convenient, especially on Windows).

The Submit Role
###############

We'll discuss what "advertising" a machine's resources means in the next
section, but the execute role leaves an obvious question unanswered: where
do the jobs come from? The answer is the submit role. This role is
responsible for accepting, monitoring, managing, and scheduling jobs on its
assigned resources; transferring the input and output of jobs; and requesting
and accepting resource assignments. (A "resource" is some reserved fraction
of an execute machine.) HTCondor allows arbitrarily many submit roles in a
pool, but for administrative convenience, most pools only have one, or a
small number, of machines acting in the submit role.

A submit-role machine requires a bit under a megabyte of RAM for each
running job, and its ability to transfer data to and from the execute-role
machines may become a performance bottleneck. We typically recommend adding
another access point for every twenty thousand simultaneously running
jobs. An access point must have outbound network connectivity, but a submit
machine without inbound network connectivity can't use execute-role machines
that also lack inbound network connectivity. Because execute machines are
more numerous, access points typically allow inbound connections. Although
you may allow users to submit jobs over the network, we recommend instead
allowing users SSH access to the access point.

The Central Manager Role
########################

Only one machine in each HTCondor pool can perform this role (barring
certain high-availability configurations, in which multiple machines are
configured for the role but only one performs it at a time). A central
manager matches resource requests --
generated by the submit role based on its jobs -- with the resources described
by the execute machines. We refer to sending these (automatically-generated)
descriptions to the central manager as "advertising" because it's the
primary way execute machines get jobs to run.

A central manager must accept connections from each execute machine and each
access point in a pool. However, users should never need access to the
central manager. Every machine in the pool updates the central manager every
few minutes, and it answers both system and user queries about the status of
the pool's resources, so a fast network is important. For very large pools,
memory may become a limiting factor.

Assigning Roles to Machines
---------------------------

The easiest way to assign a role to a machine is when you initially
:doc:`get HTCondor <index>`. You'll need to supply the same password for
each machine in the same pool; sharing that secret is how the machines
recognize each other as members of the same pool, and connections between
machines are encrypted with it. (HTCondor uses port 9618 to communicate,
so make sure that the machines in your pool accept TCP connections on that
port from each other.) In the command lines below, replace
``$htcondor_password`` with the password you want to use. In addition to the
password, you must specify the name of the central manager, which may be a
host name (which must resolve on all machines in the pool) or an IP address.
In the command lines below, replace ``$central_manager_name`` with the host
name or IP address you want to use.
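
Before running the installer, you may want to confirm that each machine can
actually reach the central manager on port 9618. The following is a minimal
sketch of such a check, assuming ``bash`` (for its ``/dev/tcp`` redirection);
the host name is a placeholder:

.. code-block:: shell

    # Hypothetical pre-flight check: can this machine reach the central
    # manager on HTCondor's port?  Replace the host name with your own.
    central_manager_name=cm.example.org
    if timeout 5 bash -c "exec 3<>/dev/tcp/${central_manager_name}/9618" 2>/dev/null; then
        echo "port 9618 on ${central_manager_name} is reachable"
    else
        echo "cannot reach ${central_manager_name}:9618" >&2
    fi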

When you :doc:`get HTCondor <index>`, start with the central manager, then add
the access point(s), and then add the execute machine(s). You may
not have ``sudo`` installed; you may omit it from the command lines below
if you run them as root.

.. rubric:: Central Manager

.. code-block:: shell

    curl -fsSL https://get.htcondor.org | sudo GET_HTCONDOR_PASSWORD="$htcondor_password" /bin/bash -s -- --no-dry-run --central-manager $central_manager_name

.. rubric:: Submit

.. code-block:: shell

    curl -fsSL https://get.htcondor.org | sudo GET_HTCONDOR_PASSWORD="$htcondor_password" /bin/bash -s -- --no-dry-run --submit $central_manager_name

.. rubric:: Execute

.. code-block:: shell

    curl -fsSL https://get.htcondor.org | sudo GET_HTCONDOR_PASSWORD="$htcondor_password" /bin/bash -s -- --no-dry-run --execute $central_manager_name

At this point, users logged in on the access point should be able to see
execute machines in the pool (using ``condor_status``), submit jobs
(using ``condor_submit``), and see them run (using ``condor_q``).
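
To try this out, a user on the access point can submit a trivial test job.
The submit file below is a minimal sketch; the file name ``sleep.sub`` and
the choice of ``/bin/sleep`` as the executable are illustrative, not part of
the installation:

.. code-block:: text

    # sleep.sub -- a hypothetical minimal submit file
    executable            = /bin/sleep
    arguments             = 60
    should_transfer_files = yes
    request_cpus          = 1
    request_memory        = 64M
    request_disk          = 64M
    queue

Running ``condor_submit sleep.sub`` should place one job in the queue, and
``condor_q`` should show it start running within a few minutes.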

Creating a Multi-Machine Pool using Windows or Containers
#########################################################

If you are creating a multi-machine HTCondor pool on Windows computers or
using containerization, please see the "Setting Up a Whole Pool" section
of the relevant installation guide:

* :ref:`admin_install_windows_pool`
* :ref:`docker_image_pool`

Where to Go from Here
---------------------

There are two major directions you can go from here, but before we discuss
them, a warning.

.. admonition:: Making Configuration Changes
   :class: warning

   HTCondor configuration files should generally be owned by root
   (or Administrator, on Windows), but readable by all users. We recommend
   that you don't change the configuration files established by the
   installation procedure; this avoids conflicts between your changes and
   any changes we may have to make to the base configuration in future
   updates. Instead, add (or edit) files in the local configuration
   directory; its location on a given machine can be determined by running
   ``condor_config_val LOCAL_CONFIG_DIR`` there. HTCondor processes files
   in this directory in lexicographic order, so we recommend naming files
   ``##-name.config`` so that, for example, a setting in ``00-base.config``
   is overridden by a setting in ``99-specific.config``.
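
Because the ordering is purely lexicographic, you can preview the order in
which HTCondor will read a configuration directory with a plain ``ls``. The
directory and file names below are illustrative only:

.. code-block:: shell

    # Hypothetical config directory; HTCondor reads these files in the
    # same order that ls displays them, so a setting in 99-specific.config
    # wins any conflict with 00-base.config.
    mkdir -p /tmp/demo-config.d
    touch /tmp/demo-config.d/00-base.config /tmp/demo-config.d/99-specific.config
    ls /tmp/demo-config.d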

.. rubric:: Enabling Features

Some features of HTCondor, for one reason or another, aren't (or can't be)
enabled by default. Areas of potentially general interest include:

* :doc:`/admin-manual/ep-policy-configuration` (particularly
  :ref:`enabling_oauth_credentials` and :ref:`resource_limits_with_cgroups`),
* :ref:`admin-manual/ep-policy-configuration:docker universe`
* :ref:`admin-manual/ep-policy-configuration:Apptainer and Singularity support`

.. rubric:: Implementing Policies

Although your HTCondor pool should be fully functional at this point, it
may not be behaving precisely as you wish, particularly with respect to
resource allocation. You can tune how HTCondor allocates resources to
users, or groups of users, using the user priority and group quota systems,
described in :doc:`../admin-manual/cm-configuration`. You
can enforce machine-specific policies -- for instance, preferring GPU jobs
on machines with GPUs -- using the options described in
:doc:`../admin-manual/ep-policy-configuration`.

.. rubric:: Further Reading

* It may be helpful to at least skim the :doc:`../users-manual/index` to get
  an idea of what your users might want or expect, particularly the
  sections on :doc:`../automated-workflows/dagman-introduction`,
  :doc:`../users-manual/choosing-an-htcondor-universe`, and
  :doc:`../users-manual/self-checkpointing-applications`.
* Understanding :doc:`../classads/classad-mechanism` is essential for
  many administrative tasks.
* The rest of the :doc:`../admin-manual/index`, particularly the section on
  :ref:`admin-manual/cm-configuration:Monitoring with Ganglia, Elasticsearch, etc.`.
* Slides from
  `past HTCondor Weeks <https://htcondor.org/past_condor_weeks.html>`_
  -- our annual conference -- include a number of tutorials and talks on
  administrative topics, including monitoring and examples of policies and
  their implementations.

.. _the_details:

What ``get_htcondor`` Does to Configure a Role
----------------------------------------------

The configuration files generated by ``get_htcondor`` are very similar, and
only two lines long:


* set the HTCondor configuration variable :macro:`CONDOR_HOST` to the name
  (or IP address) of your central manager;
* add the appropriate metaknob: ``use role : get_htcondor_central_manager``,
  ``use role : get_htcondor_submit``, or ``use role : get_htcondor_execute``.

Putting all of the pool-independent configuration into the metaknobs allows
us to change the metaknobs to fix problems or work with later versions of
HTCondor as you upgrade.

The ``get_htcondor`` :doc:`documentation <../man-pages/get_htcondor>`
describes what the configuration script does and how to determine the exact details.
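
For example, the file generated for an execute machine whose central manager
is ``cm.example.org`` (an illustrative host name) would contain just:

.. code-block:: text

    CONDOR_HOST = cm.example.org
    use role : get_htcondor_execute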