Configuration for Access Points
===============================
:index:`condor_schedd policy<single: condor_schedd policy; configuration>`
:index:`policy configuration<single: policy configuration; submit host>`
Schedd Job Transforms
---------------------
:index:`job transforms`
The *condor_schedd* can transform jobs as they are submitted.
Transformations can be used to guarantee the presence of required job
attributes, to set defaults for job attributes the user does not supply,
or to modify job attributes so that they conform to schedd policy; an
example of this might be to automatically set accounting attributes
based on the owner of the job while letting the job owner indicate a
preference.
There can be multiple job transforms. Each transform can have a
Requirements expression to indicate which jobs it should transform and
which it should ignore. Transforms without a Requirements expression
apply to all jobs. Job transforms are applied in order. The set of
transforms and their order are configured using the configuration
variable :macro:`JOB_TRANSFORM_NAMES`.
For each entry in this list there must be a corresponding
:macro:`JOB_TRANSFORM_<name>`
configuration variable that specifies the transform rules. Transforms
can use the same syntax as *condor_job_router* transforms, although unlike
the *condor_job_router*, there is no default transform, and all
matching transforms are applied, not just the first one. (See the
:doc:`/grid-computing/job-router` section for information on the
*condor_job_router*.)
When a submission is a late materialization job factory,
transforms that would match the first factory job will be applied to the Cluster ad at submit time.
When job ads are later materialized, attribute values set by the transform
will override values set by the job factory for those attributes.
The following example shows a set of two transforms: one that
automatically assigns an accounting group to jobs based on the
submitting user, and one that shows one possible way to transform
Vanilla jobs to Docker jobs.
.. code-block:: text
JOB_TRANSFORM_NAMES = AssignGroup, SL6ToDocker
JOB_TRANSFORM_AssignGroup @=end
# map Owner to group using the existing accounting group attribute as requested group
EVALSET AcctGroup = userMap("Groups",Owner,AcctGroup)
EVALSET AccountingGroup = join(".",AcctGroup,Owner)
@end
JOB_TRANSFORM_SL6ToDocker @=end
# match only vanilla jobs that have WantSL6 and do not already have a DockerImage
REQUIREMENTS JobUniverse==5 && WantSL6 && DockerImage =?= undefined
SET WantDocker = true
SET DockerImage = "SL6"
SET Requirements = TARGET.HasDocker && $(MY.Requirements)
@end
The AssignGroup transform above assumes that a mapfile that can map an
owner to one or more accounting groups has been configured via
:macro:`SCHEDD_CLASSAD_USER_MAP_NAMES`, and given the name "Groups".
The SL6ToDocker transform above is most likely incomplete, as it assumes
a custom attribute (``WantSL6``) that your pool may or may not use.
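The map itself is configured separately. A minimal sketch follows, in which
the map name ``Groups`` comes from the example above but the file path is an
assumption; each line of the assumed map file pairs an owner with one or more
accounting groups (for example, a line like ``* alice physics,chemistry``).

.. code-block:: condor-config

    # Assumed example: define a ClassAd user map named "Groups" for the schedd
    SCHEDD_CLASSAD_USER_MAP_NAMES = $(SCHEDD_CLASSAD_USER_MAP_NAMES) Groups
    CLASSAD_USER_MAPFILE_Groups = /etc/condor/groups.map
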
Submit Requirements
-------------------
:index:`submit requirements`
The *condor_schedd* may reject job submissions, such that rejected jobs
never enter the queue. Rejection may be best for the case in which there
are jobs that will never be able to run; for instance, a job specifying
an obsolete universe, like standard.
Another appropriate example might be to reject all jobs that
do not request a minimum amount of memory. Or, it may be appropriate to
prevent certain users from using a specific submit host.
Rejection criteria are configured. Configuration variable
:macro:`SUBMIT_REQUIREMENT_NAMES`
lists criteria, where each criterion is given a name. The chosen name is
a major component of the default error message output if a user attempts
to submit a job which fails to meet the requirements. Therefore, choose
a descriptive name. For the three example submit requirements described:
.. code-block:: text
SUBMIT_REQUIREMENT_NAMES = NotStandardUniverse, MinimalRequestMemory, NotChris
The criterion for each submit requirement is then specified in
configuration variable
:macro:`SUBMIT_REQUIREMENT_<Name>`, where ``<Name>`` matches the
chosen name listed in :macro:`SUBMIT_REQUIREMENT_NAMES`. The value is a
boolean ClassAd expression. The three example criteria result in these
configuration variable definitions:
.. code-block:: text
SUBMIT_REQUIREMENT_NotStandardUniverse = JobUniverse != 1
SUBMIT_REQUIREMENT_MinimalRequestMemory = RequestMemory > 512
SUBMIT_REQUIREMENT_NotChris = Owner != "chris"
Submit requirements are evaluated in the listed order; the first
requirement that evaluates to ``False`` causes rejection of the job,
terminates further evaluation of other submit requirements, and is the
only requirement reported. Each submit requirement is evaluated in the
context of the *condor_schedd* ClassAd, which is the ``MY.`` name space
and the job ClassAd, which is the ``TARGET.`` name space. Note that
:ad-attr:`JobUniverse` and :ad-attr:`RequestMemory` are both job ClassAd attributes.
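For example, a single requirement can reference both name spaces. The
attributes below are standard, but the policy and the threshold are only
illustrative assumptions:

.. code-block:: condor-config

    # Illustrative only: refuse new submissions while the schedd's idle backlog
    # (MY., the schedd ad) is very large, unless the job being submitted
    # (TARGET., the job ad) is a local universe job (universe 12).
    SUBMIT_REQUIREMENT_QueueNotFull = MY.TotalIdleJobs < 50000 || TARGET.JobUniverse == 12
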
Further configuration may associate a rejection reason with a submit
requirement via :macro:`SUBMIT_REQUIREMENT_<Name>_REASON`:
.. code-block:: text
SUBMIT_REQUIREMENT_NotStandardUniverse_REASON = "This pool does not accept standard universe jobs."
SUBMIT_REQUIREMENT_MinimalRequestMemory_REASON = strcat( "The job only requested ", \
RequestMemory, " Megabytes. If that small amount is really enough, please contact ..." )
SUBMIT_REQUIREMENT_NotChris_REASON = "Chris, you may only submit jobs to the instructional pool."
The value must be a ClassAd expression which evaluates to a string.
Thus, double quotes were required to make strings for both
``SUBMIT_REQUIREMENT_NotStandardUniverse_REASON`` and
``SUBMIT_REQUIREMENT_NotChris_REASON``. The ClassAd function strcat()
produces a string in the definition of
``SUBMIT_REQUIREMENT_MinimalRequestMemory_REASON``.
Rejection reasons are sent back to the submitting program and will
typically be immediately presented to the user. If an optional
:macro:`SUBMIT_REQUIREMENT_<Name>_REASON` is not defined, a default reason
will include the ``<Name>`` chosen for the submit requirement.
Completing the presentation of the example submit requirements, upon an
attempt to submit a standard universe job, :tool:`condor_submit` would print
.. code-block:: text
Submitting job(s).
ERROR: Failed to commit job submission into the queue.
ERROR: This pool does not accept standard universe jobs.
Where there are multiple jobs in a cluster, if any job within the
cluster is rejected due to a submit requirement, the entire cluster of
jobs will be rejected.
Submit Warnings
---------------
:index:`submit warnings`
Starting in HTCondor 8.7.4, you may instead configure submit warnings. A
submit warning is a submit requirement for which
:macro:`SUBMIT_REQUIREMENT_<Name>_IS_WARNING` is true. A submit
warning does not cause the submission to fail; instead, it returns a
warning to the user's console (when triggered via :tool:`condor_submit`) or
writes a message to the user log (always). Submit warnings are intended
to allow HTCondor administrators to provide their users with advance
warning of new submit requirements. For example, if you want to increase
the minimum request memory, you could use the following configuration.
.. code-block:: text
SUBMIT_REQUIREMENT_NAMES = OneGig $(SUBMIT_REQUIREMENT_NAMES)
SUBMIT_REQUIREMENT_OneGig = RequestMemory > 1024
SUBMIT_REQUIREMENT_OneGig_REASON = "As of <date>, the minimum requested memory will be 1024."
SUBMIT_REQUIREMENT_OneGig_IS_WARNING = TRUE
When a user runs :tool:`condor_submit` to submit a job with :ad-attr:`RequestMemory`
between 512 and 1024, they will see (something like) the following,
assuming that the job meets all the other requirements.
.. code-block:: text
Submitting job(s).
WARNING: Committed job submission into the queue with the following warning:
WARNING: As of <date>, the minimum requested memory will be 1024.
1 job(s) submitted to cluster 452.
The job will contain (something like) the following:
.. code-block:: text
000 (452.000.000) 10/06 13:40:45 Job submitted from host: <128.105.136.53:37317?addrs=128.105.136.53-37317+[fc00--1]-37317&noUDP&sock=19966_e869_5>
WARNING: Committed job submission into the queue with the following warning: As of <date>, the minimum requested memory will be 1024.
...
Marking a submit requirement as a warning does not change when or how it
is evaluated, only the result of doing so. In particular, failing a
submit warning does not terminate further evaluation of the submit
requirements list. Currently, only one (the most recent) problem is
reported for each submit attempt. This means users will see (as they
previously did) only the first failed requirement; if all requirements
passed, they will see the last failed warning, if any.
Working with Remote Job Submission
''''''''''''''''''''''''''''''''''
:index:`of job queue, with remote job submission<single: of job queue, with remote job submission; High Availability>`
Remote job submission requires identification of the job queue; jobs are
submitted with a command similar to:
.. code-block:: console
$ condor_submit -remote condor@example.com myjob.submit
This implies the identification of a single *condor_schedd* daemon,
running on a single machine. With the high availability of the job
queue, there are multiple *condor_schedd* daemons, of which only one at
a time is acting as the single submission point. To make remote
submission of jobs work properly, set the configuration variable
:macro:`SCHEDD_NAME` in the local configuration to
have the same value for each potentially running *condor_schedd*
daemon. In addition, the value chosen for the variable :macro:`SCHEDD_NAME`
will need to include the at symbol (@), such that HTCondor will not
modify the value set for this variable. See the description of
:macro:`MASTER_NAME` in the :ref:`admin-manual/configuration-macros:condor_master
configuration file macros` section for defaults and composition of valid values
for :macro:`SCHEDD_NAME`. As an example, include in each local configuration a value
similar to:
.. code-block:: condor-config
SCHEDD_NAME = had-schedd@
Then, with this sample configuration, the submit command appears as:
.. code-block:: console
$ condor_submit -remote had-schedd@ myjob.submit
Schedd Cron
-----------
:index:`Schedd Cron`
Just as an administrator can dynamically add new ClassAd attributes
and values to the startd's ads with a script, the
same can be done with the ClassAds the *condor_schedd* sends to the
collector. However, these are less generally useful, as there is
no matchmaking with the schedd ads. Administrators might want to
use this to advertise performance or resource usage metrics for
the machine the schedd is running on for further monitoring.
See the section in :ref:`admin-manual/ep-policy-configuration:Startd Cron`
for examples and information about this.
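As a sketch, a schedd cron job might be configured as follows; the job name
``APSTATS``, the script path, and the period are assumptions, and the script
is expected to print ``Attribute = value`` pairs on its standard output:

.. code-block:: condor-config

    # Assumed example: periodically run a script and merge its output
    # into the ClassAd the schedd sends to the collector
    SCHEDD_CRON_JOBLIST = $(SCHEDD_CRON_JOBLIST) APSTATS
    SCHEDD_CRON_APSTATS_EXECUTABLE = /usr/local/libexec/ap_stats.sh
    SCHEDD_CRON_APSTATS_PERIOD = 5m
    SCHEDD_CRON_APSTATS_MODE = Periodic
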
Dedicated Scheduling
--------------------
:index:`dedicated scheduling`
:index:`under the dedicated scheduler<single: under the dedicated scheduler; MPI application>`
The dedicated scheduler is a part of the *condor_schedd* that handles
the scheduling of parallel jobs that require more than one machine
concurrently running per job. MPI applications are a common use for the
dedicated scheduler, but parallel applications which do not require MPI
can also be run with the dedicated scheduler. All jobs which use the
parallel universe are routed to the dedicated scheduler within the
*condor_schedd* they were submitted to. A default HTCondor installation
does not configure a dedicated scheduler; the administrator must
designate one or more *condor_schedd* daemons to perform as dedicated
scheduler.
Selecting and Setting Up a Dedicated Scheduler
''''''''''''''''''''''''''''''''''''''''''''''
We recommend that you select a single machine within an HTCondor pool to
act as the dedicated scheduler. This becomes the machine from which
all users submit their parallel universe jobs. The perfect choice for
the dedicated scheduler is the single, front-end machine for a dedicated
cluster of compute nodes. For a pool without an obvious choice for an
access point, choose a machine that all users can log into, as well as
one that is likely to be up and running all the time. All of HTCondor's
other resource requirements for an access point apply to this machine,
such as having enough disk space in the spool directory to hold jobs.
See :ref:`admin-manual/logging:directories` for more information.
Configuration Examples for Dedicated Resources
''''''''''''''''''''''''''''''''''''''''''''''
Each execute machine may have its own policy for the execution of jobs,
as set by configuration. Each machine with aspects of its configuration
that are dedicated identifies the dedicated scheduler. And, the ClassAd
representing a job to be executed on one or more of these dedicated
machines includes an identifying attribute. An example configuration
file containing the various policy settings described below is
``/etc/examples/condor_config.local.dedicated.resource``.
Each execute machine defines the configuration variable
:macro:`DedicatedScheduler`, which identifies the dedicated scheduler it is
managed by. The local configuration file contains a modified form of
.. code-block:: text
DedicatedScheduler = "DedicatedScheduler@full.host.name"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
Substitute the host name of the dedicated scheduler machine for the
string "full.host.name".
If running personal HTCondor, the name of the scheduler includes the
user name it was started as, so the configuration appears as:
.. code-block:: text
DedicatedScheduler = "DedicatedScheduler@username@full.host.name"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
All dedicated execute machines must have policy expressions which allow
for jobs to always run, but not be preempted. The resource must also be
configured to prefer jobs from the dedicated scheduler over all other
jobs. Therefore, configuration gives the dedicated scheduler of choice
the highest rank. It is worth noting that HTCondor puts no other
requirements on a resource for it to be considered dedicated.
Job ClassAds from the dedicated scheduler contain the attribute
``Scheduler``. The attribute is defined by a string of the form
.. code-block:: text
Scheduler = "DedicatedScheduler@full.host.name"
The host name of the dedicated scheduler substitutes for the string
full.host.name.
Different resources in the pool may have different dedicated policies by
varying the local configuration.
Policy Scenario: Machine Runs Only Jobs That Require Dedicated Resources
One possible scenario for the use of a dedicated resource is to only
run jobs that require the dedicated resource. To enact this policy,
configure the following expressions:
.. code-block:: text
START = Scheduler =?= $(DedicatedScheduler)
SUSPEND = False
CONTINUE = True
PREEMPT = False
KILL = False
WANT_SUSPEND = False
WANT_VACATE = False
RANK = Scheduler =?= $(DedicatedScheduler)
The :macro:`START` expression specifies that a job's
``Scheduler`` attribute must match the string in the corresponding
:macro:`DedicatedScheduler` attribute in the machine ClassAd. The
:macro:`RANK` expression specifies that this same job
(with the ``Scheduler`` attribute) has the highest rank. This
prevents other jobs from preempting it based on user priorities. The
rest of the expressions disable any other of the *condor_startd*
daemon's pool-wide policies, such as those for evicting jobs when
keyboard and CPU activity is discovered on the machine.
Policy Scenario: Run Both Jobs That Do and Do Not Require Dedicated Resources
While the first example works nicely for jobs requiring dedicated
resources, it can lead to poor utilization of the dedicated
machines. A more sophisticated strategy allows the machines to run
other jobs, when no jobs that require dedicated resources exist. The
machine is configured to prefer jobs that require dedicated
resources, but not prevent others from running.
To implement this, configure the machine as a dedicated resource as
above, modifying only the :macro:`START` expression:
.. code-block:: text
START = True
Policy Scenario: Adding Desktop Resources To The Mix
A third policy example allows all jobs. These desktop machines use a
preexisting :macro:`START` expression that takes the machine owner's
usage into account for some jobs. The machine does not preempt jobs
that must run on dedicated resources, while it may preempt other
jobs as defined by policy. So, the default pool policy is used for
starting and stopping jobs, while jobs that require a dedicated
resource always start and are not preempted.
The :macro:`START`, :macro:`SUSPEND`, :macro:`PREEMPT`, and :macro:`RANK` policies are
set in the global configuration. Locally, the configuration is
modified to this hybrid policy by adding a second case.
.. code-block:: text
SUSPEND = Scheduler =!= $(DedicatedScheduler) && ($(SUSPEND))
PREEMPT = Scheduler =!= $(DedicatedScheduler) && ($(PREEMPT))
RANK_FACTOR = 1000000
RANK = ((Scheduler =?= $(DedicatedScheduler)) * $(RANK_FACTOR)) \
+ $(RANK)
START = (Scheduler =?= $(DedicatedScheduler)) || ($(START))
Define ``RANK_FACTOR`` to be a larger
value than the maximum value possible for the existing rank
expression. :macro:`RANK` is a floating point value,
so there is no harm in assigning a very large value.
Preemption with Dedicated Jobs
''''''''''''''''''''''''''''''
The dedicated scheduler can be configured to preempt running parallel
universe jobs in favor of higher priority parallel universe jobs. Note
that this is different from preemption in other universes, and parallel
universe jobs cannot be preempted either by a machine's user pressing a
key or by other means.
By default, the dedicated scheduler will never preempt running parallel
universe jobs. Two configuration variables control preemption of these
dedicated resources: :macro:`SCHEDD_PREEMPTION_REQUIREMENTS` and
:macro:`SCHEDD_PREEMPTION_RANK`. These
variables have no default value, so if either are not defined,
preemption will never occur. :macro:`SCHEDD_PREEMPTION_REQUIREMENTS` must
evaluate to ``True`` for a machine to be a candidate for this kind of
preemption. If more machines are candidates for preemption than needed
to satisfy a higher priority job, the machines are sorted by
:macro:`SCHEDD_PREEMPTION_RANK`, and only the highest ranked machines are
taken.
Note that preempting one node of a running parallel universe job
requires killing the entire job on all of its nodes. So, when preemption
occurs, it may end up freeing more machines than are needed for the new
job. Also, preempted jobs will be re-run, starting again from the
beginning. Thus, the administrator should be careful when enabling
preemption of these dedicated resources. Enable dedicated preemption
with the configuration:
.. code-block:: text
STARTD_JOB_ATTRS = JobPrio
SCHEDD_PREEMPTION_REQUIREMENTS = (My.JobPrio < Target.JobPrio)
SCHEDD_PREEMPTION_RANK = 0.0
In this example, preemption is enabled by user-defined job priority. If
a set of machines is running a job at job priority 5, and the user
submits a new job at job priority 10, the running job will be preempted
for the new job. The old job is put back in the queue, and will begin
again from the beginning when assigned to a newly acquired set of
machines.
Grouping Dedicated Nodes into Parallel Scheduling Groups
''''''''''''''''''''''''''''''''''''''''''''''''''''''''
:index:`parallel scheduling groups`
In some parallel environments, machines are divided into groups, and
jobs should not cross groups of machines. That is, all the nodes of a
parallel job should be allocated to machines within the same group. The
most common example is a pool of machines using InfiniBand switches. For
example, each switch might connect 16 machines, and a pool might have
160 machines on 10 switches. If the InfiniBand switches are not routed
to each other, each job must run on machines connected to the same
switch. The dedicated scheduler's Parallel Scheduling Groups feature
supports this operation.
Each *condor_startd* must define which group it belongs to by setting the
:macro:`ParallelSchedulingGroup` variable in the configuration file, and
advertising it into the machine ClassAd. The value of this variable is a
string, which should be the same for all *condor_startd* daemons within a given
group. The property must be advertised in the *condor_startd* ClassAd by
appending :macro:`ParallelSchedulingGroup` to the :macro:`STARTD_ATTRS`
configuration variable.
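For example, each *condor_startd* attached to a given switch might carry
local configuration like the following, where the group name is illustrative:

.. code-block:: condor-config

    # All machines on the same switch advertise the same group name
    ParallelSchedulingGroup = "switch-07"
    STARTD_ATTRS = $(STARTD_ATTRS) ParallelSchedulingGroup
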
The submit description file for a parallel universe job which must not
cross group boundaries contains
.. code-block:: text
+WantParallelSchedulingGroups = True
The dedicated scheduler enforces the allocation to within a group.
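Putting this together, a minimal parallel universe submit description file
that stays within one group might look like the following sketch; the
executable, node count, and log file name are assumptions:

.. code-block:: text

    universe = parallel
    executable = /usr/local/bin/my_mpi_wrapper
    machine_count = 16
    +WantParallelSchedulingGroups = True
    log = my_mpi_job.log
    queue
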
High Availability of the Job Queue
----------------------------------
:index:`of job queue<single: of job queue; High Availability>`
.. warning::
This High Availability configuration depends entirely on using
an extremely reliably shared file server. In our experience, only
expensive, proprietary file servers are suitable for this role.
Frequently, casual configuration of a Highly Available HTCondor
job queue will result in lowered reliability.
For a pool where all jobs are submitted through a single machine in the
pool, and there are lots of jobs, that machine becoming nonfunctional
means that jobs stop running. The *condor_schedd* daemon maintains the
job queue; if the machine running it becomes nonfunctional, there is no
job queue, and no jobs can be run. This situation is worsened by using
one machine as the single submission point. For each HTCondor job (taken
from the queue) that is executed, a *condor_shadow* process runs on the
machine where the job was submitted to handle input/output functionality.
If this machine becomes nonfunctional, none of the jobs can continue.
The entire pool stops running jobs.
The goal of High Availability in this special case is to transfer the
*condor_schedd* daemon to run on another designated machine. Jobs
caused to stop without finishing can be restarted from the beginning, or
can continue execution using the most recent checkpoint. New jobs can
enter the job queue. Without High Availability, the job queue would
remain intact, but further progress on jobs would wait until the machine
running the *condor_schedd* daemon became available (after fixing
whatever caused it to become unavailable).
HTCondor uses its flexible configuration mechanisms to allow the
transfer of the *condor_schedd* daemon from one machine to another. The
configuration specifies which machines are chosen to run the
*condor_schedd* daemon. To prevent multiple *condor_schedd* daemons
from running at the same time, a lock (semaphore-like) is held over the
job queue. This synchronizes the situation in which control is
transferred to a secondary machine, and the primary machine returns to
functionality. Configuration variables also determine time intervals at
which the lock expires, and periods of time that pass between polling to
check for expired locks.
To specify a single machine that would take over, if the machine running
the *condor_schedd* daemon stops working, the following additions are
made to the local configuration of any and all machines that are able to
run the *condor_schedd* daemon (becoming the single pool submission
point):
.. code-block:: condor-config
MASTER_HA_LIST = SCHEDD
SPOOL = /share/spool
HA_LOCK_URL = file:/share/spool
VALID_SPOOL_FILES = $(VALID_SPOOL_FILES) SCHEDD.lock
Configuration macro :macro:`MASTER_HA_LIST`
identifies the *condor_schedd* daemon as the daemon that is to be
watched to make sure that it is running. Each machine with this
configuration must have access to the lock (the job queue) which
synchronizes which single machine does run the *condor_schedd* daemon.
This lock and the job queue must both be located in a shared file space;
the lock's location is currently specified only with a file URL. The
configuration specifies the shared space (:macro:`SPOOL`) and the URL of
the lock.
:tool:`condor_preen` is not currently aware of the lock file and will delete
it if it is placed in the :macro:`SPOOL` directory, so be sure to add file
``SCHEDD.lock`` to :macro:`VALID_SPOOL_FILES[with HA Schedd]`.
As HTCondor starts on machines that are configured to run the single
*condor_schedd* daemon, the :tool:`condor_master` daemon of the first
machine to look at (poll) the lock notices that no lock is held. This
implies that no *condor_schedd* daemon is running. That
:tool:`condor_master` daemon acquires the lock and runs the *condor_schedd*
daemon. Other machines with this same capability to run the
*condor_schedd* daemon look at (poll) the lock, but do not run the
daemon, as the lock is held. The machine running the *condor_schedd*
daemon renews the lock periodically.
If the machine running the *condor_schedd* daemon fails to renew the
lock (because the machine is not functioning), the lock times out
(becomes stale). The lock is released by the :tool:`condor_master` daemon if
:tool:`condor_off` or *condor_off -schedd* is executed, or when the
:tool:`condor_master` daemon knows that the *condor_schedd* daemon is no
longer running. As other machines capable of running the
*condor_schedd* daemon look at the lock (poll), one machine will be the
first to notice that the lock has timed out or been released. This
machine (correctly) interprets this situation as the *condor_schedd*
daemon is no longer running. This machine's :tool:`condor_master` daemon then
acquires the lock and runs the *condor_schedd* daemon.
See the :ref:`admin-manual/configuration-macros:condor_master configuration
file macros` section for details relating to the configuration variables used
to set timing and polling intervals.
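As an illustration only, these timing variables can be set in the local
configuration of the machines listed above; the values shown are assumptions,
not recommendations:

.. code-block:: condor-config

    # How long (in seconds) an acquired lock is valid before it must be renewed
    HA_LOCK_HOLD_TIME = 300
    # How often (in seconds) each condor_master polls the lock
    HA_POLL_PERIOD = 60
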
Performance Tuning of the AP
----------------------------
Of the three roles (AP, CM, EP) in an HTCondor system, the AP is the most common
place performance tuning is done. The CM is mostly stateless, and can
typically scale out to very large pools without much additional work. The EP
daemons aren't resource intensive. However, because the AP stores the state of
all the jobs under its control, and persistently stores frequent updates to
those jobs, it is not uncommon for the AP to exhaust system resources such as
CPU, disk, or network bandwidth.
Monitoring AP Performance
'''''''''''''''''''''''''
The *condor_schedd* is single threaded. Practically, this means that it only
does one thing at a time, and often when it may be "busy" doing that one thing,
it is actually waiting on the system for some I/O to complete. As such, it
will rarely appear to use 100% of a CPU in any system monitoring tool. To help
gauge how busy the schedd is, it keeps track of a metric called
:ad-attr:`RecentDaemonCoreDutyCycle`. This is a floating point value that
ranges from 0.0 (completely idle) to 1.0 (completely busy). Values over 0.95
indicate the schedd is overloaded. In extreme cases :tool:`condor_q` and
:tool:`condor_submit` may time out and fail when trying to communicate with an
overloaded schedd. An administrator can see this attribute by running
.. code-block:: console
$ condor_status -direct -schedd name-of-schedd -af RecentDaemonCoreDutyCycle
Horizontal Scaling
''''''''''''''''''
While the *condor_schedd* and the machine it runs on can be tuned to handle a
greater rate of jobs, every machine has some limit on the number of jobs it can
support. The main strategy for supporting more jobs in the system as a whole is
simply to run more schedds, that is, horizontal scaling. This may require
partitioning users onto different submit machines, or submitting remotely, but
at the end of the day, the best way to scale out a very large HTCondor system
is by adding more *condor_schedd* daemons.
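For example, users assigned to a second AP could submit remotely from their
usual login host; the schedd name below is an assumption:

.. code-block:: console

    $ condor_submit -remote schedd2.example.org myjob.submit
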
Putting the schedd's database on the fastest disk
'''''''''''''''''''''''''''''''''''''''''''''''''
The *condor_schedd* frequently saves state to a file on disk, so that in the
event of a crash, no jobs will be lost on a restart. The cost of this
reliability, though, is relatively high. In addition to writing to the disk,
the schedd uses the fsync system call to force all the data onto the disk. By
default, this file, named ``job_queue.log``, is written to the :macro:`SPOOL`
directory. However, the configuration option :macro:`JOB_QUEUE_LOG` will
override this path. Setting :macro:`JOB_QUEUE_LOG` to point to a file on a
solid-state or NVMe drive will make the schedd faster. Ideally, this path
should be on a filesystem that only holds this file.
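For example, assuming an NVMe device is mounted at ``/nvme``, the following
illustrative setting relocates the job queue log:

.. code-block:: condor-config

    # Assumed mount point; ideally only the job queue lives on this filesystem
    JOB_QUEUE_LOG = /nvme/condor_schedd/job_queue.log
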
Avoiding shared filesystems for event logs
''''''''''''''''''''''''''''''''''''''''''
Other files the *condor_schedd* frequently writes to are job event
logs, those specified by the :subcom:`log` submit command. When these are on
NFS or other distributed or slow filesystems, the whole system can slow down
tremendously. If possible, encourage users not to put their event logs on such
slow filesystems.
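For example, in a submit description file, pointing the event log at a local
disk rather than an NFS-mounted home directory avoids this cost; the paths
below are illustrative:

.. code-block:: text

    # slower: event log on an NFS-mounted home directory
    # log = /home/alice/jobs/my_job.log
    # better: event log on a local filesystem
    log = /scratch/alice/my_job.log
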
Using third party (url / plugin) transfers when able
''''''''''''''''''''''''''''''''''''''''''''''''''''
HTCondor can transfer users' sandboxes to the EP in many ways. The default
method, called HTCondor file transfer, or "cedar" file transfer, copies files
from the AP to the EP. Obviously, this uses CPU, disk, and network bandwidth on
the AP. To the degree possible, changing large input file transfers from
cedar to HTTP transfers from some third-party server moves the load off of
the AP and onto an HTTP server. If one HTTP server isn't sufficient, there are
many methods for scaling HTTP servers to handle additional load.
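For example, a submit description file can fetch a large input directly from a
web server instead of transferring it from the AP; the URL is illustrative:

.. code-block:: text

    should_transfer_files = YES
    transfer_input_files = http://web.example.org/datasets/big_input.tar.gz
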
Limiting CPU or I/O bound processing on the AP
''''''''''''''''''''''''''''''''''''''''''''''''
The machine the *condor_schedd* runs on is typically a machine users can log
into to prepare and submit jobs. Sometimes, users will start long-running,
CPU- or I/O-heavy jobs on the submit machine, which can slow down the various
HTCondor services on that machine. We encourage admins to try to limit this,
either through social pressure or by enforcing system limits on user CPU
usage.
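One possible approach on Linux, outside of HTCondor itself, is a systemd
drop-in that caps the CPU available to interactive user sessions; this is only
a sketch and assumes systemd-managed user slices:

.. code-block:: text

    # /etc/systemd/system/user-.slice.d/50-cpu-limit.conf (illustrative)
    [Slice]
    CPUQuota=200%
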
.. _enabling htcondor annex:
Enabling ``htcondor annex`` on an AP
------------------------------------
The macro template :macro:`use feature:HPC_ANNEX` enables the ``annex``
noun of the :doc:`../man-pages/htcondor` command and configures HTCondor
to support it.
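For example, on the AP this is a single line of configuration:

.. code-block:: condor-config

    use feature : HPC_ANNEX
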
.. note::
The following section is not normative; it reflects the implementation
at the time of writing.
The annex pilot starts an EP which directly connects to the AP, using an
IDTOKEN generated by :doc:`../man-pages/condor_token_fetch`. The AP will
only run a job on a directly-connected EP if that token's identifier is
the same as that job's :ad-attr:`Owner` attribute. Obtaining this token will
naturally fail if the user running ``htcondor annex`` does not have
permission to submit jobs. (The signing key used to generate the token is
automatically created as a result of the macro template and is used for no
other purpose.)
However, because the schedd does not have a stable address by default (its
shared port ID changes), the annex pilot needs a collector to look the
schedd up in. So you will notice a small collector on the side; the
macro template calls it the "AP collector". This collector performs two
other tasks: it generates the random key used to sign the pilot job's
token, and it holds a copy of the slot ads generated by annex EPs so that
``htcondor annex`` does not have to query the schedd for them, reducing
the load on that daemon.
The configuration assumes that IDTOKENS are enabled for both the schedd
and the collector, rather than trying to modify the security configuration
of those two daemons.
Finally, the macro template adds a job transform so that jobs submitted
with a ``TargetAnnexName`` attribute -- as jobs submitted via
``htcondor job submit --annex-name`` will have -- will only run on resources
with the same annex name and with the same owner.