File: exceptions.rst

package info (click to toggle)
python-parsl 2025.01.13%2Bds-1
  • links: PTS, VCS
  • area: main
  • in suites: trixie
  • size: 12,072 kB
  • sloc: python: 23,817; makefile: 349; sh: 276; ansic: 45
file content (171 lines) | stat: -rw-r--r-- 6,014 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
.. _label-exceptions:

Error handling
==============

Parsl provides various mechanisms to add resiliency and robustness to programs.

Exceptions
----------

Parsl is designed to capture, track, and handle various errors occurring
during execution, including those related to the program, apps, execution 
environment, and Parsl itself. 
It also provides functionality to appropriately respond to failures during
execution.

Failures might occur for various reasons:

1. A task failed during execution.
2. A task failed to launch, for example, because an input dependency was not met.
3. There was a formatting error while formatting the command-line string in Bash apps.
4. A task completed execution but failed to produce one or more of its specified
   outputs.
5. Task exceeded the specified walltime.

Since Parsl tasks are executed asynchronously and remotely, it can be difficult to determine
when errors have occurred and to appropriately handle them in a Parsl program.

For errors occurring in Python code, Parsl captures Python exceptions and returns
them to the main Parsl program. For non-Python errors, for example when a node
or worker fails, Parsl imposes a timeout, and considers a task to have failed
if it has not heard from the task by that timeout. Parsl also considers a task to have failed
if it does not meet the contract stated by the user during invocation, such as failing
to produce the stated output files.

Parsl communicates these errors by associating Python exceptions with task futures.
These exceptions are raised only when a result is called on the future
of a failed task. For example:

.. code-block:: python

      @python_app
      def bad_divide(x):
          return 6 / x

      # Call bad divide with 0, to cause a divide by zero exception
      doubled_x = bad_divide(0)

      # Catch and handle the exception.
      try:
           doubled_x.result()
      except ZeroDivisionError as e:
           print('Oops! You tried to divide by 0.')
      except Exception as e:
           print('Oops! Something really bad happened.')


Retries
-------

Often errors in distributed/parallel environments are transient. 
In these cases, retrying failed tasks can be a simple way 
of overcoming transient (e.g., machine failure,
network failure) and intermittent failures.
When ``retries`` are enabled (and set to an integer > 0), Parsl will automatically
re-launch tasks that have failed until the retry limit is reached. 
By default, retries are disabled and exceptions will be communicated
to the Parsl program.

The following example shows how the number of retries can be set to 2:

.. code-block:: python

   import parsl
   from parsl.configs.htex_local import config
   
   config.retries = 2

   parsl.load(config)

More specific retry handling can be specified using retry handlers, documented
below.


Lazy fail
---------

Parsl implements a lazy failure model through which a workload will continue
to execute in the case that some tasks fail. That is, the program will not
halt as soon as it encounters a failure, rather it will continue to execute
unaffected apps.

The following example shows how lazy failures affect execution. In this
case, task C fails and therefore tasks E and F that depend on results from
C cannot be executed; however, Parsl will continue to execute tasks B and D
as they are unaffected by task C's failure.

.. code-block::

    Here's a workflow graph, where
         (X)  is runnable,
         [X]  is completed,
         (X*) is failed.
         (!X) is dependency failed

      (A)           [A]           (A)
      / \           / \           / \
    (B) (C)       [B] (C*)      [B] (C*)
     |   |   =>    |   |   =>    |   |
    (D) (E)       (D) (E)       [D] (!E)
      \ /           \ /           \ /
      (F)           (F)           (!F)

      time ----->


Retry handlers
--------------

The basic parsl retry mechanism keeps a count of the number of times a task
has been (re)tried, and will continue retrying that task until the configured
retry limit is reached.

Retry handlers generalize this to allow more expressive retry handling:
parsl keeps a retry cost for a task, and the task will be retried until the
configured retry limit is reached. Instead of the cost being 1 for each
failure, user-supplied code can examine the failure and compute a custom
cost.

This allows user knowledge about failures to influence the retry mechanism:
an exception which is almost definitely a non-recoverable failure (for example,
due to bad parameters) can be given a high retry cost (so that it will not
be retried many times, or at all), and exceptions which are likely to be
transient (for example, where a worker node has died) can be given a low
retry cost so they will be retried many times.

A retry handler can be specified in the parsl configuration like this:


.. code-block:: python

     Config(
          retries=2,
          retry_handler=example_retry_handler
          )


``example_retry_handler`` should be a function defined by the user that will
compute the retry cost for a particular failure, given some information about
the failure.

For example, the following handler will give a cost of 1 to all exceptions,
except when a bash app exits with unix exitcode 9, in which case the cost will
be 100. This will have the effect that retries will happen as normal for most
errors, but the bash app can indicate that there is little point in retrying
by exiting with exitcode 9.

.. code-block:: python

     def example_retry_handler(exception, task_record):
          if isinstance(exception, BashExitFailure) and exception.exitcode == 9:
               return 100
          else
               return 1

The retry handler is given two parameters: the exception from execution, and
the parsl internal task_record. The task record contains details such as the
app name, parameters and executor.

If a retry handler raises an exception itself, then the task will be aborted
and no further tries will be attempted.