Running and Managing DAGMan
===========================
Once a workflow has been set up in a ``.dag`` file, all that
is left is to submit the prepared workflow. A key concept to understand
regarding the submission and management of a DAGMan workflow is
that the DAGMan process itself runs as an HTCondor scheduler universe
job under the *condor_schedd* on the Access Point (often referred to as the
DAGMan proper job), which in turn submits and manages all the various
jobs and scripts defined in the workflow.
.. _DAG controls:
Basic DAG Controls
------------------
:index:`DAG submission<single: DAGMan; DAG submission>`
DAG Submission
^^^^^^^^^^^^^^
.. sidebar:: Example: Submitting Diamond DAG
.. code-block:: console
:caption: Example DAG submission
$ condor_submit_dag diamond.dag
To submit a DAG, simply run :tool:`condor_submit_dag` or :tool:`htcondor dag submit`
with the DAG description file from the directory where that file is stored.
This automatically generates an HTCondor scheduler universe job
submit file to execute :tool:`condor_dagman` and submits it to HTCondor. The
generated DAG submit file is named ``<DAG Description File>.condor.sub``. If desired,
the generated submit description file can be modified prior to job submission
by doing the following:
.. code-block:: console
:caption: Example DAG submission with modification of DAGMan job
$ condor_submit_dag -no_submit diamond.dag
$ vim diamond.dag.condor.sub
$ condor_submit diamond.dag.condor.sub
Since the :tool:`condor_dagman` process is an actual HTCondor job, all jobs
managed by DAGMan are marked with the DAGMan proper job's :ad-attr:`ClusterId`.
This value is stored in each managed job's ClassAd attribute :ad-attr:`DAGManJobId`.
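This makes it easy to find all of the jobs belonging to a particular DAG. For
example, assuming the DAGMan proper job has the hypothetical cluster id ``1024``,
the managed jobs could be listed as follows:

.. code-block:: console
    :caption: Example query for all jobs managed by a specific DAGMan job

    $ condor_q -constraint 'DAGManJobId == 1024' -af ClusterId DAGNodeName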
.. warning::
Do not submit the same DAG, with the same DAG description file, from the same
working directory at the same time. This will cause unpredictable behavior
and failures since both DAGMan jobs will attempt to use the same files to
execute.
:index:`single submission of multiple, independent DAGs<single: DAGMan; Single submission of multiple, independent DAGs>`
Single Submission of Multiple, Independent DAGs
'''''''''''''''''''''''''''''''''''''''''''''''
.. sidebar:: Example: Submitting Multiple Independent DAGs
.. code-block:: console
:caption: Example multi-DAG submission at one time
$ condor_submit_dag A.dag B.dag C.dag
Multiple independent DAGs described in various DAG description files can be submitted
in a single instance of :tool:`condor_submit_dag` resulting in one :tool:`condor_dagman`
job managing all DAGs. This is done by internally combining all independent
DAGs into one large DAG with no inter-dependencies between the individual
DAGs. To avoid possible node name collisions when producing the large DAG,
DAGMan renames all the nodes. The renaming of nodes is controlled by
:macro:`DAGMAN_MUNGE_NODE_NAMES`.
When multiple DAGs are submitted like this, DAGMan treats the first DAG description
file provided on the command line as its primary DAG file, and uses the primary
DAG file name when writing various files such as the ``*.dagman.out``. In the case of
failure, DAGMan will produce a rescue file named ``<Primary DAG>_multi.rescue<XXX>``.
See the :ref:`Rescue DAG` section for more information.
The success or failure of the independent DAGs is well defined. When
multiple, independent DAGs are submitted with a single command, the
success of the composite DAG is defined as the logical AND of the
success of each independent DAG, and failure is defined as the logical
OR of the failure of any of the independent DAGs.
:index:`DAG monitoring<single: DAGMan; DAG monitoring>`
DAG Monitoring
^^^^^^^^^^^^^^
After submission, the progress of the DAG can be monitored by looking at
the job event log file(s), observing the e-mail that job submission to
HTCondor causes, or by using :tool:`condor_q`. Using just :tool:`condor_q`
while a DAGMan workflow is running will display condensed information
regarding the overall workflow progress under the DAGMan proper job as follows:
.. code-block:: console
:caption: Example condor_q DAG workflow output (condensed)
$ condor_q
OWNER  BATCH_NAME         SUBMITTED   DONE  RUN  IDLE  TOTAL  JOB_IDS
Cole   diamond.dag+1024   1/1 12:34   1     2    -     4      1025.0 ... 1026.0
Using :tool:`condor_q` with the *-dag* and *-nobatch* flags will display information
about the DAGMan proper job and all jobs currently submitted/running as
part of the DAGMan workflow as follows:
.. code-block:: console
:caption: Example condor_q DAG workflow output (uncondensed)
$ condor_q -dag -nobatch
ID      OWNER/NODENAME   SUBMITTED   RUN_TIME    ST PRI SIZE CMD
1024.0  Cole             1/1 12:34   0+01:13:19  R  0   0.4  condor_dagman ...
1025.0   |-Node_B        1/1 13:44   0+00:03:19  R  0   0.4  diamond.sh ...
1026.0   |-Node_C        1/1 13:45   0+00:02:19  R  0   0.4  diamond.sh ...
In addition to basic job management, the DAGMan proper job holds a lot of extra
information within its job ClassAd that can be queried with the *-l* or, preferably,
the *-af* *<Attributes>* flags for :tool:`condor_q` in association with the
DAGMan proper job's Id.
.. code-block:: console
:caption: Example condor_q of DAGMan proper job's information
$ condor_q <dagman-job-id> -af Attribute-1 ... Attribute-N
$ condor_q -l <dagman-job-id>
A large amount of information about DAG progress and errors can be found in
the debug log file named ``<DAG Description File>.dagman.out``. This file should
be saved if errors occur. This file is not removed between DAG
executions, and all logged messages are appended to it.
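For example, assuming the DAG was submitted as ``diamond.dag``, the debug log can
be followed while the DAG runs or searched after a failure as follows:

.. code-block:: console
    :caption: Example inspection of the DAGMan debug log

    $ tail -f diamond.dag.dagman.out
    $ grep -i error diamond.dag.dagman.out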
:index:`DAG status in a job ClassAd<single: DAGMan; DAG status in a job ClassAd>`
Status Information for the DAG in a ClassAd
'''''''''''''''''''''''''''''''''''''''''''
.. sidebar:: View DAG Progress
Get a detailed DAG status report via :tool:`htcondor dag status`:
.. code-block:: console
$ htcondor dag status <dagman-job-id>
.. code-block:: text
DAG 1024 [diamond.dag] has been running for 00:00:49
DAG has submitted 3 job(s), of which:
1 is submitted and waiting for resources.
1 is running.
1 has completed.
DAG contains 4 node(s) total, of which:
[#] 1 has completed.
[=] 2 are running: 2 jobs.
[-] 1 is waiting on other nodes to finish.
DAG is running normally.
[#########===================----------] DAG is 25.00% complete.
The :tool:`condor_dagman` job places information about its status in its ClassAd
as the following job ad attributes:
+-----------------+-----------------------------+-----------------------------+
| | :ad-attr:`DAG_Status` | :ad-attr:`DAG_InRecovery` |
| DAG Info +-----------------------------+-----------------------------+
| | :ad-attr:`DAG_AdUpdateTime` | |
+-----------------+-----------------------------+-----------------------------+
| | :ad-attr:`DAG_NodesTotal` | :ad-attr:`DAG_NodesDone` |
| +-----------------------------+-----------------------------+
| | :ad-attr:`DAG_NodesPrerun` | :ad-attr:`DAG_NodesPostrun` |
| +-----------------------------+-----------------------------+
| Node Info | :ad-attr:`DAG_NodesReady` | :ad-attr:`DAG_NodesUnready` |
| +-----------------------------+-----------------------------+
| | :ad-attr:`DAG_NodesFailed` | :ad-attr:`DAG_NodesFutile` |
| +-----------------------------+-----------------------------+
| | :ad-attr:`DAG_NodesQueued` | |
+-----------------+-----------------------------+-----------------------------+
| | :ad-attr:`DAG_JobsSubmitted`| :ad-attr:`DAG_JobsCompleted`|
| +-----------------------------+-----------------------------+
| DAG Process Info| :ad-attr:`DAG_JobsIdle` | :ad-attr:`DAG_JobsRunning` |
| +-----------------------------+-----------------------------+
| | :ad-attr:`DAG_JobsHeld` | |
+-----------------+-----------------------------+-----------------------------+
.. note::
Most of this information is also available in the ``dagman.out`` file, and
DAGMan updates these ClassAd attributes every 2 minutes.
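These attributes can be queried directly with :tool:`condor_q` for a quick
progress check. A minimal sketch, assuming ``1024`` is the DAGMan proper job's id:

.. code-block:: console
    :caption: Example query of DAG status attributes

    $ condor_q 1024 -af DAG_NodesTotal DAG_NodesDone DAG_NodesQueued DAG_NodesFailed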
:index:`DAG removal<single: DAGMan; DAG removal>`
Removing a DAG
^^^^^^^^^^^^^^
.. sidebar:: Removing a DAG
.. code-block:: console
:caption: Example removing a DAGMan workflow
$ condor_q -nobatch
-- Submitter: user.cs.wisc.edu : <128.105.175.125:36165> : user.cs.wisc.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
9.0 taylor 10/12 11:47 0+00:01:32 R 0 8.7 condor_dagman -f ...
11.0 taylor 10/12 11:48 0+00:00:00 I 0 3.6 B.exe
2 jobs; 1 idle, 1 running, 0 held
$ condor_rm 9.0
To remove a DAG, simply use :tool:`condor_rm[Removing a DAG]` on the
:tool:`condor_dagman` job. This will remove both the DAGMan proper job
and all node jobs, including sub-DAGs, from the HTCondor queue.
A removed DAG will be considered failed unless the DAG has a :dag-cmd:`FINAL` node
that succeeds.
In the case where a machine is scheduled to go down, DAGMan will clean
up memory and exit. However, it will leave any submitted jobs in the
HTCondor queue.
:index:`suspending a running DAG<single: DAGMan; Suspending a running DAG>`
.. _Suspending a DAG:
Suspending a Running DAG
^^^^^^^^^^^^^^^^^^^^^^^^
It may be desired to temporarily suspend a running DAG. For example, the
load may be high on the Access Point, and therefore it is desired to
prevent DAGMan from submitting any more jobs until the load goes down.
There are two ways to suspend (and resume) a running DAG (an example of each
appears at the end of this section):
- Use :tool:`condor_hold[with DAGMan]`/:tool:`condor_release` on the :tool:`condor_dagman` job.
After placing the :tool:`condor_dagman` job on hold, no new node jobs will
be submitted, and no scripts will be run. Any node jobs already in the
HTCondor queue will continue undisturbed. Any running PRE or POST scripts
will be killed. If the :tool:`condor_dagman` job is left on hold, it will
remain in the HTCondor queue after all of the currently running node jobs
are finished. To resume the DAG, use :tool:`condor_release` on the
:tool:`condor_dagman` job.
.. note::
While the :tool:`condor_dagman` job is on hold, no updates will
be made to the ``*.dagman.out`` file.
- Use a DAG halt file.
A DAG can be suspended by halting it with a halt file. This is a
special file named ``<DAG Description Filename>.halt`` whose existence DAGMan
periodically checks for. If found, the DAG enters the halted
state, where PRE scripts are not run and no new node jobs are
submitted. Running node jobs will continue undisturbed, POST scripts
will run, and the ``*.dagman.out`` log will still be updated.
Once all running node jobs and POST scripts have finished, DAGMan
will write a Rescue DAG and exit. Removing the halt file before then
returns the DAG to normal operation.
.. note::
If a halt file exists at DAG submission time, it is removed.
.. warning::
Neither :tool:`condor_hold` nor a DAG halt is propagated to sub-DAGs. In
other words, if a parent DAG is held or halted, any sub-DAGs will continue
to submit node jobs. However, these effects do apply to DAG splices
since they are merged into the parent DAG and are controlled by a single
:tool:`condor_dagman` instance.
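For reference, both suspension methods can be driven from the command line. The
sketch below assumes the DAG was submitted as ``diamond.dag`` and that the DAGMan
proper job has the hypothetical id ``1024.0``:

.. code-block:: console
    :caption: Example suspending and resuming a running DAG

    # Method 1: hold the DAGMan proper job, then release it to resume
    $ condor_hold 1024.0
    $ condor_release 1024.0

    # Method 2: create the halt file in the DAG's working directory, then
    # remove it to resume
    $ touch diamond.dag.halt
    $ rm diamond.dag.halt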
:index:`file paths in DAGs<single: DAGMan; File paths in DAGs>`
File Paths in DAGs
------------------
.. sidebar:: Example File Paths with DAGMan
A DAG and its node submit description file reside in the
same ``example`` directory. Once the DAG runs, ``A.out``
and ``A.log`` are expected to appear in that directory.
.. code-block:: condor-dagman
:caption: Example DAG description with single node
# sample.dag
JOB A A.sub
.. code-block:: condor-submit
:caption: Example simple job submit description
# A.sub
executable = programA
input = A.input
output = A.out
log = A.log
.. code-block:: text
:caption: Example DAGMan working directory tree
example/
├── A.input
├── A.sub
├── sample.dag
└── programA
:tool:`condor_dagman` assumes all relative paths in a DAG description file and its
node job submit descriptions are relative to the current working directory
where :tool:`condor_submit_dag` was run. This means all files declared in a DAG
or its jobs are expected to be found, or will be written, relative to the DAG's
working directory. All jobs will be submitted and all scripts will be run
from the DAG's working directory.
For simple DAG structures this may be fine, but not for complex DAGs.
To help reduce confusion about where things run and where files are written, the :dag-cmd:`JOB`
command takes an optional keyword **DIR <path>**. This will cause DAGMan to submit
the node's job(s) and run the node's scripts from the specified directory.
.. code-block:: condor-dagman
:caption: Example DAG description with single node specifying node DIR
JOB A A.submit DIR dirA
.. code-block:: text
:caption: Example DAGMan working directory tree
example/
├── sample.dag
└── dirA
├── A.input
├── A.submit
└── programA
When dealing with multiple independent DAGs separated into different directories,
as described below, a single :tool:`condor_submit_dag` submission from the
parent directory will fail to execute successfully, since all paths are resolved
relative to the parent directory.
.. sidebar:: Example Paths with Independent DAGs
Given the directory structure on the left, the following
will fail:
.. code-block:: console
$ cd parent
$ condor_submit_dag dag1/one.dag dag2/two.dag
But using *-UseDagDir* will execute each individual DAG
as intended:
.. code-block:: console
$ cd parent
$ condor_submit_dag -usedagdir dag1/one.dag dag2/two.dag
.. code-block:: text
:caption: Example multi-DAG DAGMan working directory tree with separate working directories
parent/
├── dag1
│ ├── A.input
│ ├── A.submit
│ ├── one.dag
│ └── programA
└── dag2
├── B.input
├── B.submit
├── programB
└── two.dag
Use the :tool:`condor_submit_dag` *-UseDagDir* flag to execute each individual
DAG from its own directory. For this example, ``one.dag`` would run from
the ``dag1`` directory and ``two.dag`` from ``dag2``. All produced
DAGMan files will be written relative to the primary DAG (the first DAG specified
on the command line).
.. warning::
Use of *-usedagdir* does not work in conjunction with a :dag-cmd:`JOB` command
that specifies a working directory via the **DIR** keyword. Using both will be
detected and generate an error.
:index:`large numbers of jobs<single: DAGMan; Large numbers of jobs>`
Managing Large Numbers of Jobs
------------------------------
DAGMan provides several useful mechanisms to help submit and manage large
numbers of jobs. These can be useful whether a DAG is structured via
dependencies or is just a bag of loose jobs. Notable features of DAGMan are:
* Throttling
Throttling limits the number of submitted jobs at any point in time.
* Retry failed nodes
Automatically re-run failed nodes to attempt a successful execution.
For more information visit :ref:`Retry DAG Nodes`.
* Scripts associated with node jobs
Perform simple tasks on the Access Point before and/or after a node's
job(s) execution. For more information visit DAGMan :ref:`DAG Node Scripts`.
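As a rough illustration of how these features appear in a DAG description
(``check_output.sh`` is a hypothetical script; see the linked sections for details):

.. code-block:: condor-dagman
    :caption: Example node using a retry and a POST script

    JOB A A.sub
    RETRY A 3
    SCRIPT POST A check_output.sh A.out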
.. sidebar:: Example Large DAG Unique Submit File
.. code-block:: condor-submit
:caption: Example automatically produced unique job description file
# Generated Submit: job2.sub
executable = /path/to/executable
log = job2.log
input = job2.in
output = job2.out
arguments = "-file job2.out"
request_cpus = 1
request_memory = 1024M
request_disk = 10240K
queue
It is common for a large group of similar jobs to be run under a DAG. It
is also very common for some external program or script to produce these
large DAGs and the files they need. There are generally two ways of organizing
DAGs with a large number of jobs to manage:
#. Using a unique submit description for each node in the DAG
In this setup, a single DAG description file contains ``n`` nodes, each with
its own unique submit description file (see right), such as:
.. code-block:: condor-dagman
:caption: Example large DAG description using unique job description files
# Large DAG Example: sweep.dag w/ unique submit files
JOB job0 job0.sub
JOB job1 job1.sub
JOB job2 job2.sub
...
JOB job999 job999.sub
The benefit of this method is that an individual node's job(s) can easily be
submitted separately at any time, but at the cost of producing ``n`` unique
files that need to be stored and managed.
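For example, a single node's job could be re-run by hand, outside of DAGMan,
simply by submitting its unique description file:

.. code-block:: console
    :caption: Example submitting one node's job separately

    $ condor_submit job2.sub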
.. sidebar:: Example Large DAG Shared Submit File
.. code-block:: condor-submit
:caption: Example shared job description file
# Generic Submit: common.sub
executable = /path/to/executable
log = job$(runnumber).log
input = job$(runnumber).in
output = job$(runnumber).out
arguments = "-file job$(runnumber).out"
request_cpus = 1
request_memory = 1024M
request_disk = 10240K
queue
#. Using a shared submit description file and :ref:`DAGMan VARS`
In this setup, all ``n`` nodes in a single DAG description file share
a single submit description (see right) and rely on custom macros that
DAGMan adds to each node's job(s) to provide per-node variation, such as:
.. code-block:: condor-dagman
:caption: Example large DAG using shared job description file for all nodes
# Large DAG example: sweep.dag w/ shared submit file
JOB job0 common.sub
VARS job0 runnumber="0"
JOB job1 common.sub
VARS job1 runnumber="1"
JOB job2 common.sub
VARS job2 runnumber="2"
...
JOB job999 common.sub
VARS job999 runnumber="999"
The benefit of this method is that fewer files need to be produced,
stored, and managed, at the cost of more complexity and a DAG description
file that is roughly double in size.
.. note::
Even though DAGMan can assist with the management of large numbers of jobs,
a DAG managing several thousand jobs will produce a large number of
files, making directory traversal difficult. Consider how the directory structure
should look for a large DAG prior to creating and running it.
.. _DAGMan throttling:
DAGMan Throttling
^^^^^^^^^^^^^^^^^
To prevent possible overloading of the *condor_schedd* and resources on the
Access Point where :tool:`condor_dagman` executes, DAGMan comes with built-in
capabilities to help throttle/limit the load on the Access Point.
:index:`throttling<single: DAGMan; Throttling>`
Throttling at DAG Submission
''''''''''''''''''''''''''''
#. Total nodes/clusters:
The total number of DAG nodes that can be submitted to the HTCondor queue at a time.
This is specified either at submit time via :tool:`condor_submit_dag`\s **-maxjobs**
option or via the configuration option :macro:`DAGMAN_MAX_JOBS_SUBMITTED`.
#. Idle Jobs:
The total number of idle jobs associated with nodes managed by DAGMan in the HTCondor
queue at a time. If DAGMan submits jobs and goes over this limit then DAGMan will
wait until the number of idle jobs under its management drops below this max value
prior to submitting ready nodes. This is specified either at submit time via
:tool:`condor_submit_dag`\s **-maxidle** option or via the configuration option
:macro:`DAGMAN_MAX_JOBS_IDLE`.
#. PRE/POST script:
The total number of PRE and POST scripts DAGMan will execute at a time on the
Access Point. These limits can either be specified via :tool:`condor_submit_dag`\s
**-maxpre** and **-maxpost** options or via the configuration options
:macro:`DAGMAN_MAX_PRE_SCRIPTS` and :macro:`DAGMAN_MAX_POST_SCRIPTS`.
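These limits can be combined on the command line. A minimal sketch, assuming the
``sweep.dag`` example from above:

.. code-block:: console
    :caption: Example DAG submission with throttling limits

    $ condor_submit_dag -maxjobs 100 -maxidle 500 -maxpre 5 -maxpost 5 sweep.dag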
:index:`editing DAG throttles<single: DAGMan; Editing DAG throttles>`
Editing DAG Throttles
'''''''''''''''''''''
The following throttling properties of a running DAG can be changed after the workflow
has been started. The values of these properties are published in the :tool:`condor_dagman`
job ad; changing any of these properties using :tool:`condor_qedit` will also update
the internal DAGMan value.
.. sidebar:: Edit DAGMan Limits
To edit one of these properties, use the :tool:`condor_qedit`
tool with the job ID of the :tool:`condor_dagman` job:
.. code-block:: console
$ condor_qedit <dagman-job-id> DAGMan_MaxJobs 1000
Currently, you can change the following attributes:
+----------------------------------+-----------------------------------------------------+
| **Attribute Name** | **Attribute Description** |
+----------------------------------+-----------------------------------------------------+
| :ad-attr:`DAGMan_MaxJobs` | Maximum number of running nodes |
+----------------------------------+-----------------------------------------------------+
| :ad-attr:`DAGMan_MaxIdle` | Maximum number of idle jobs |
+----------------------------------+-----------------------------------------------------+
| :ad-attr:`DAGMan_MaxPreScripts` | Maximum number of running PRE scripts |
+----------------------------------+-----------------------------------------------------+
| :ad-attr:`DAGMan_MaxPostScripts` | Maximum number of running POST scripts |
+----------------------------------+-----------------------------------------------------+
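Since these values are published in the DAGMan proper job's ClassAd, the current
settings can be checked with :tool:`condor_q` (a minimal sketch):

.. code-block:: console
    :caption: Example query of the current DAGMan throttle values

    $ condor_q <dagman-job-id> -af DAGMan_MaxJobs DAGMan_MaxIdle DAGMan_MaxPreScripts DAGMan_MaxPostScripts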
:index:`throttling nodes by category<single: DAGMan; Throttling nodes by category>`
.. _DAG throttling cmds:
Throttling Nodes by Category
''''''''''''''''''''''''''''
.. sidebar:: Throttling by Category
:dag-cmd:`CATEGORY` and :dag-cmd:`MAXJOBS` command syntax
.. code-block:: condor-dagman
CATEGORY <NodeName | ALL_NODES> CategoryName
.. code-block:: condor-dagman
MAXJOBS CategoryName MaxJobsValue
.. note::
Category names cannot contain white space.
Please see :ref:`DAG Splice Limitations` in association with categories.
DAGMan also provides finer-grained control over the number of running nodes (submitted job
clusters) within a DAG through the :dag-cmd:`CATEGORY[Usage]` and
:dag-cmd:`MAXJOBS[Usage]` commands. The :dag-cmd:`CATEGORY` command assigns a DAG node to a
category that can be referenced by the :dag-cmd:`MAXJOBS` command to limit the number
of submitted job clusters on a per-category basis.
If the number of submitted job clusters for a given category reaches the
limit, no further job clusters in that category will be submitted until
other job clusters within the category terminate. If :dag-cmd:`MAXJOBS` is not set
for a defined category, then there is no limit placed on the number of
submissions within that category.
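For illustration, the following hypothetical DAG snippet limits its simulation
nodes to two running job clusters at a time while leaving the analysis node
unthrottled:

.. code-block:: condor-dagman
    :caption: Example throttling nodes by category

    JOB sim0 sim.sub
    JOB sim1 sim.sub
    JOB sim2 sim.sub
    JOB analysis analysis.sub
    CATEGORY sim0 simulation
    CATEGORY sim1 simulation
    CATEGORY sim2 simulation
    MAXJOBS simulation 2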
The configuration variable :macro:`DAGMAN_MAX_JOBS_SUBMITTED` and the
:tool:`condor_submit_dag` *-maxjobs* command-line option are still enforced
if these *CATEGORY* and *MAXJOBS* throttles are used.