Recipe: Run a Job on the PATh Facility Using Credits
----------------------------------------------------

This recipe assumes that you have decided to use your credits for the
PATh Facility to run one of your HTCondor jobs.  It takes you step by
step through the process of Bringing Your Own Resources (BYOR) in the
form of an allocation to an OSG Portal access point and using that
resource to run your HTCondor job.  In what follows, we refer to the
named set of resources leased from that allocation as an *annex*.

In this recipe, we assume that the job has not yet been placed at an
OSG Portal access point when we begin.

Ingredients
===========

- An
  `OSG Portal account <https://portal.osg-htc.org/application>`_
  and password.
- An HTCondor job submit file (:doc:`example.submit <annex-example-job>`).
- Credits for the PATh Facility.
- Command-line login access to the PATh Facility (see
  `PATh's instructions for gaining access <https://path-cc.io/facility/registration.html#login>`_).
  We'll use ``LOGIN_NAME`` to refer to your login name at the PATh Facility.
- A name for your PATh Facility annex (``example`` in this recipe).  By convention,
  this is the name of the submit file you want to run, without its extension.
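
The naming convention in the last item can be sketched in Python.  This is
only an illustration: ``annex_name_for`` is a hypothetical helper, not part
of HTCondor, and nothing requires you to follow the convention.

.. code-block:: python

    from pathlib import Path


    def annex_name_for(submit_file: str) -> str:
        """Return the submit file's base name without its final extension."""
        # Path.stem drops the directory part and the last suffix only.
        return Path(submit_file).stem


    print(annex_name_for("example.submit"))  # prints: example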

Assumptions
===========

- You want to run the job described above on the PATh Facility.
- The job described above fits within the
  `capabilities <https://path-cc.io/facility/#facility-description>`_
  of the PATh Facility.

Preparation
===========

None!

(A note: when copying examples further down the page, don't copy the ``$``;
it just signifies something you type in, rather than something
that the computer prints out.)

Instructions
============

1. Log into the OSG Portal Access Point
'''''''''''''''''''''''''''''''''''''''

Log into an OSG Portal access point (e.g., ``ap20.uc.osg-htc.org`` or
``ap21.uc.osg-htc.org``) using your OSG Portal account and password.

2. Submit the Job
'''''''''''''''''

Submit the job on the access point, indicating that you want it to run
on your own resource (your PATh Facility credits, in this case) with the
``--annex-name`` option:

.. code-block:: text

    $ htcondor job submit example.submit --annex-name example
    Job 123 was submitted and will run only on the annex 'example'.

.. note::

    Notes on the output of this command:

    - ``123`` is the job ID assigned by the access point to the placed job.
    - Placing the job with the annex name specified means that the job
      won't run anywhere other than the annex.
    - The annex name does not say anything about the PATh Facility; it is simply
      a label for the PATh Facility resources we will be provisioning
      in the next step.
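
If you script this step, the job ID can be pulled out of the submit message.
The following is a minimal sketch that assumes the output has the wording
shown above, which may differ between HTCondor versions; ``parse_job_id``
and ``submit_to_annex`` are hypothetical helper names.

.. code-block:: python

    import re
    import subprocess

    # Assumption: the submit message starts with "Job <ID> was submitted",
    # as in the example output of this recipe.
    SUBMIT_PATTERN = re.compile(r"Job (\d+) was submitted")


    def parse_job_id(submit_output: str) -> int:
        """Extract the job ID from the text printed by the submit command."""
        match = SUBMIT_PATTERN.search(submit_output)
        if match is None:
            raise ValueError("no job ID found in submit output")
        return int(match.group(1))


    def submit_to_annex(submit_file: str, annex_name: str) -> int:
        """Run the submit command from this step and return the job ID."""
        result = subprocess.run(
            ["htcondor", "job", "submit", submit_file,
             "--annex-name", annex_name],
            capture_output=True, text=True, check=True,
        )
        return parse_job_id(result.stdout)

Calling ``submit_to_annex("example.submit", "example")`` on the access point
would then return the ID to use in the later steps.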

3. Lease the Resources
''''''''''''''''''''''

To run your job on the PATh Facility, you will need to create an *annex* there;
an annex is a named set of leased resources.  The following command will
submit a request to lease an annex named ``example`` from the PATh Facility.
In the output below, the **text in bold** is printed by the ``htcondor`` tool;
it is emphasized to distinguish it from the PATh Facility's log-in prompt.

.. parsed-literal::
    :class: highlight

    $ htcondor annex create example cpu\@path-facility --cpus 2 --login-name LOGIN_NAME
    **This command will access the system named 'PATh Facility' via SSH.  To proceed, follow the**
    **prompts from that system below; to cancel, hit CTRL-C.**

You will need to log into the PATh Facility at this prompt.

.. parsed-literal::
    :class: highlight

    **Thank you.**

    Requesting annex named 'example' from queue 'cpu' on the system named 'PATh Facility'...

Because the request may take a while, the tool displays an indented log of
its progress.  Once the request is done, it displays:

.. code-block:: text

    ... requested.

    It may take some time for the PATh Facility to establish the requested annex.

4. Confirm that the Resources are Available
'''''''''''''''''''''''''''''''''''''''''''

Check on the status of the annex to make sure it has started up correctly.

.. code-block:: text

	$ htcondor annex status example
	Annex 'example' is not established.
	You requested 2 nodes for this annex, of which 0 are in established
	annexes.
	There are 0 CPUs in the established nodes, of which 0 are busy.
	1 jobs must run on this annex, and 0 currently are.
	You made 1 resource request(s) for this annex, of which 1 are pending, 0
	are established, and 0 have retired.

Give the PATh Facility a few more minutes to grant your request and then check again.

.. code-block:: text

	$ htcondor annex status example
	Annex 'example' is established.
	Its oldest established request is about 0.29 hours old and will retire in
	0.71 hours.
	You requested 2 nodes for this annex, of which 2 are in established
	annexes.
	There are 136 CPUs in the established nodes, of which 0 are busy.
	1 jobs must run on this annex, and 0 currently are.
	You made 1 resource request(s) for this annex, of which 0 are pending, 1
	are established, and 0 have retired.
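
Rather than re-running the command by hand, you could poll it from a script.
The sketch below assumes that the first line of the status output reads
exactly as in the two examples above; ``annex_is_established`` and
``wait_for_annex`` are hypothetical helpers, and the polling interval is an
arbitrary choice.

.. code-block:: python

    import re
    import subprocess
    import time

    # Assumption: the status output contains a line of the form
    # "Annex '<name>' is established." or "Annex '<name>' is not established."
    STATUS_LINE = re.compile(
        r"^\s*Annex '[^']+' is (not )?established\.", re.MULTILINE
    )


    def annex_is_established(status_output: str) -> bool:
        """Return True if the status text reports an established annex."""
        match = STATUS_LINE.search(status_output)
        if match is None:
            raise ValueError("unrecognized annex status output")
        # Group 1 is "not " when the annex is not yet established.
        return match.group(1) is None


    def wait_for_annex(annex_name: str, poll_seconds: int = 60,
                       max_polls: int = 30) -> bool:
        """Poll the annex status until it is established or polls run out."""
        for _ in range(max_polls):
            result = subprocess.run(
                ["htcondor", "annex", "status", annex_name],
                capture_output=True, text=True, check=True,
            )
            if annex_is_established(result.stdout):
                return True
            time.sleep(poll_seconds)
        return False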

5. Confirm Job is Running on the Resources
''''''''''''''''''''''''''''''''''''''''''

After some time has passed, check the status of the job to make sure
that it started running.

.. code-block:: text

	$ htcondor job status 123
	Job will only run on your annex named 'example'.
	Job has been running for 0 hour(s), 2 minute(s), and 21 second(s).

We want to make sure the job is indeed running on the correct annex
resources.  There are two different ways we could do this.  We could ask
the annex itself:

.. code-block:: text

	$ htcondor annex status example
	Annex 'example' is established.
	Its oldest established request is about 0.69 hours old and will retire in
	0.31 hours.
	You requested 2 nodes for this annex, of which 2 are in established
	annexes.
	There are 136 CPUs in the established nodes, of which 1 are busy.
	1 jobs must run on this annex, and 1 currently are.
	You made 1 resource request(s) for this annex, of which 0 are pending,
	1 are established, and 0 have retired.

This indicates that the annex is running a job, but not necessarily the one
we just submitted.  To be sure, let's ask the job itself what resources it
is running on.

.. code-block:: text

	$ htcondor job resources 123
	Job is using annex 'example', resource 449_0@osgvo-docker-pilot-facility-74db64959b-q2mq.
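
The same check can be automated by comparing the annex named in that line
with the one you expect.  This is a sketch under the assumption that the
output keeps the format shown above; ``parse_job_resources`` and
``job_runs_on_annex`` are hypothetical helpers.

.. code-block:: python

    import re

    # Assumption: the output has the form
    # "Job is using annex '<annex>', resource <resource>."
    RESOURCE_LINE = re.compile(
        r"Job is using annex '(?P<annex>[^']+)', resource (?P<resource>\S+)"
    )


    def parse_job_resources(output: str) -> tuple[str, str]:
        """Return (annex name, resource name) from the resources output."""
        match = RESOURCE_LINE.search(output)
        if match is None:
            raise ValueError("unrecognized job resources output")
        # Drop the period that ends the sentence, not part of the name.
        return match.group("annex"), match.group("resource").rstrip(".")


    def job_runs_on_annex(output: str, expected_annex: str) -> bool:
        """Return True if the job reports running on the expected annex."""
        annex, _resource = parse_job_resources(output)
        return annex == expected_annex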

6. Terminate the Resource Lease
'''''''''''''''''''''''''''''''

At this point we know that our job is running on the correct resources,
so we can wait for it to finish running.  After some time has passed, we
ask for its status again:

.. code-block:: text

	$ htcondor job status 123
	Job is completed.

Now that the job has finished running, we want to shut down the annex.
When the annex finishes shutting down, the resource lease will be
terminated.  We could just wait for the annex to time out automatically
(after 20 minutes of being idle), but we would rather shut the annex down
explicitly to avoid wasting our allocation.

.. code-block:: text

	$ htcondor annex shutdown example
	Shutting down annex 'example'...
	... each resource in 'example' has been commanded to shut down.
	It may take some time for each resource to finish shutting down.
	Annex requests that are still in progress have not been affected.

At this point our workflow is completed, and our job has run
successfully on our allocation.

Reference
=========

Run either of the following commands for an up-to-date summary of its
options.

.. code-block:: text

	$ htcondor job --help
	$ htcondor annex --help