File: checkpoints.rst

.. _label-memos:

Memoization and checkpointing
-----------------------------

When an app is invoked several times with the same parameters, Parsl can
reuse the result from the first invocation without executing the app again.

This can save time and computational resources.

The memoization and checkpointing system is pluggable, with basic behaviour
provided by the `BasicMemoizer`. The rest of this chapter refers to the
behaviour of the `BasicMemoizer`.

Memoization and checkpointing are provided in two ways:

* Firstly, *app caching* will allow reuse of results and exceptions within
  the same run. This is also referred to as *memoization*.

* Building on top of that, *checkpointing* will store results (but not
  exceptions) on the filesystem and reuse those results in later runs.

.. _label-appcaching:

App caching
===========


There are many situations in which a program may be re-executed
over time. Often, large fragments of the program will not have changed, and
re-executing those apps wastes valuable time and computational resources.
Parsl's app caching solves this problem by storing the results of apps that
have completed successfully so that they can be reused.

App caching is enabled by setting the ``cache``
argument in the :func:`~parsl.app.app.python_app` or :func:`~parsl.app.app.bash_app` 
decorator to ``True`` (by default it is ``False``). 

.. code-block:: python

   @bash_app(cache=True)
   def hello(msg, stdout=None):
       return 'echo {}'.format(msg)

App caching can be globally disabled by supplying a memoizer configured as
``BasicMemoizer(memoize=False)`` in :class:`~parsl.config.Config`, as sketched below.
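
For example, a run-wide override that turns off app caching might look like the
following sketch (assuming ``BasicMemoizer`` is importable from
``parsl.dataflow.memoization``, the module referenced elsewhere in this chapter):

.. code-block:: python

   from parsl.config import Config
   from parsl.dataflow.memoization import BasicMemoizer  # assumed import path

   # Disable memoization for every app in this run, regardless of each
   # app's cache= setting.
   config = Config(memoizer=BasicMemoizer(memoize=False))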

App caching can be particularly useful when developing interactive programs, such as in a
Jupyter notebook. In this case, cells containing apps are often re-executed during
development, and app caching ensures that only modified apps are re-executed.


App equivalence 
^^^^^^^^^^^^^^^

Parsl determines app equivalence using the name of the app function:
if two apps have the same name, then they are equivalent under this
relation.

Changes to the body of an app, or to functions that the app calls, will not
invalidate cached values, as illustrated below.
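
For example, the app below is identified only by its name, so editing its body
between runs would not, by itself, cause previously cached results to be
discarded (``double`` is an arbitrary illustrative app):

.. code-block:: python

   from parsl import python_app

   @python_app(cache=True)
   def double(x):
       # Changing this body (for example to ``return x * 3``) does not change
       # the app's name, so results previously cached for ``double`` would
       # still be reused under the default name-based app equivalence.
       return x * 2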

There are many other ways functions might be compared for equivalence, and
`parsl.dataflow.memoization.id_for_memo` provides a hook for plugging in
alternative, application-specific implementations.


Invocation equivalence 
^^^^^^^^^^^^^^^^^^^^^^

Two app invocations are determined to be equivalent if their
input arguments are identical.

In simple cases, this follows obvious rules:

.. code-block:: python

  # Assuming f is an app declared with cache=True, these two invocations are
  # equivalent and the second will reuse the cached result of the first
  x = 7
  f(x).result()

  y = 7
  f(y).result()


Internally, equivalence is determined by hashing the input arguments, and
comparing the hash to hashes from previous app executions.

This approach can only be applied to data types for which a deterministic hash
can be computed.

By default Parsl can compute sensible hashes for basic data types:
str, int, float and None, as well as some more complex types:
functions, and dictionaries and lists containing hashable types.

Attempting to cache apps invoked with other, non-hashable, data types will 
lead to an exception at invocation.

In that case, a program can register a hashing mechanism for a new type by
providing an implementation of `parsl.dataflow.memoization.id_for_memo` for
that type, as sketched below.
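
The sketch below registers a hash implementation for a hypothetical user-defined
``ParameterSweep`` type. It assumes that `parsl.dataflow.memoization.id_for_memo`
is a ``functools.singledispatch`` function whose registered implementations return
a deterministic ``bytes`` representation of their argument, as in the Parsl source:

.. code-block:: python

   import pickle

   from parsl.dataflow.memoization import id_for_memo


   class ParameterSweep:
       """Hypothetical user-defined type passed as an app argument."""
       def __init__(self, values):
           self.values = values


   @id_for_memo.register(ParameterSweep)
   def id_for_memo_parameter_sweep(value, output_ref=False):
       # Return a deterministic byte representation of the value so that
       # equal ParameterSweep objects produce the same invocation hash.
       return pickle.dumps(sorted(value.values))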

Ignoring arguments
^^^^^^^^^^^^^^^^^^

On occasion one may wish to ignore particular arguments when determining
app invocation equivalence - for example, when generating log file
names automatically based on time or run information. 
Parsl allows developers to list the arguments to be ignored
in the ``ignore_for_cache`` app decorator parameter:

.. code-block:: python

   @bash_app(cache=True, ignore_for_cache=['stdout'])
   def hello(msg, stdout=None):
       return 'echo {}'.format(msg)


Caveats
^^^^^^^

There are several important issues to consider when using app caching:

- Determinism: App caching is generally useful only when the apps are deterministic.
  If the outputs may differ for identical inputs, app caching will obscure
  this non-deterministic behavior. For instance, caching an app that returns
  a random number will result in every invocation returning the same result
  (see the sketch after this list).

- Timing: If several identical calls to an app are made concurrently, before
  any result has been cached, multiple instances of the app will be launched.
  Once one invocation completes and its result is cached,
  all subsequent calls will return immediately with the cached result.

- Performance: If app caching is enabled, there may be some performance
  overhead, especially if a large number of short-duration tasks are launched rapidly.
  This overhead has not been quantified.
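
As a small sketch of the determinism caveat above: with caching enabled, an app
that returns a random number only executes once, and every later invocation is
served the first cached value.

.. code-block:: python

   import parsl
   from parsl import python_app
   from parsl.configs.local_threads import config

   parsl.load(config)

   @python_app(cache=True)
   def roll():
       import random
       return random.random()

   first = roll().result()
   second = roll().result()
   # The second call is answered from the cache rather than re-executed,
   # so both futures resolve to the same value.
   assert first == second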
  
.. _label-checkpointing:

Checkpointing
=============

Large-scale Parsl programs are likely to encounter errors due to node failures, 
application or environment errors, and myriad other issues. Parsl offers an
application-level checkpointing model to improve resilience, fault tolerance, and
efficiency.

.. note::
   Checkpointing builds on top of app caching, and so app caching must be
   enabled. If app caching is disabled (via the ``Config.app_cache`` option),
   checkpointing will not work.

Parsl follows an incremental checkpointing model, where each checkpoint file contains
all results that have been updated since the last checkpoint.

When a Parsl program loads a checkpoint file and is executed, it will use 
checkpointed results for any apps that have been previously executed. 
Like app caching, checkpoints
use the hash of the app and the invocation input parameters to identify previously computed
results. If multiple checkpoint entries exist for an app invocation (with the same
hash), the most recent entry will be used.

Parsl provides four checkpointing modes, which can be selected using the ``checkpoint_mode``
parameter of the ``BasicMemoizer`` supplied as the configuration's ``memoizer``:

1. ``task_exit``: a checkpoint is created each time an app completes or fails
   (after retries if enabled). This mode minimizes the risk of losing information
   from completed tasks.

   .. code-block:: python

      BasicMemoizer(checkpoint_mode = 'task_exit')

2. ``periodic``: a checkpoint is created periodically using a user-specified
   checkpointing interval. Results will be saved to the checkpoint file for
   all tasks that have completed during this period.

   .. code-block:: python

      BasicMemoizer(checkpoint_mode = 'periodic',
                    checkpoint_period = '01:00:00')

3. ``dfk_exit``: checkpoints are created when Parsl is
   about to exit. This reduces the risk of losing results due to
   premature program termination from exceptions, terminate signals, etc. However,
   it is still possible that information might be lost if the program is
   terminated abruptly (machine failure, SIGKILL, etc.).

   .. code-block:: python

      BasicMemoizer(checkpoint_mode = 'dfk_exit')

4. ``manual``: in addition to these automated checkpointing modes, it is also possible
   to manually initiate a checkpoint by calling ``checkpoint()`` on the
   `BasicMemoizer` in the Parsl program code.

   .. code-block:: python

      m = BasicMemoizer(checkpoint_mode = 'manual')
      ...
      with parsl.load(Config(memoizer=m, ...)):
         ...
         m.checkpoint()

In all cases the checkpoint file is written out to the ``runinfo/RUN_ID/checkpoint/`` directory.
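
The checkpoint directories written by previous runs can be listed with standard
filesystem tools; the sketch below assumes the default ``runinfo`` base directory
in the current working directory:

.. code-block:: python

   from pathlib import Path

   # List the checkpoint directory of each previous run, sorted by run id,
   # assuming the default ``runinfo`` base directory.
   for checkpoint_dir in sorted(Path('runinfo').glob('*/checkpoint')):
       print(checkpoint_dir)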

.. Note:: Checkpoint modes ``periodic``, ``dfk_exit``, and ``manual`` can interfere with garbage collection.
          In these modes task information will be retained after completion, until checkpointing events are triggered.


Creating a checkpoint
^^^^^^^^^^^^^^^^^^^^^

Automated checkpointing must be explicitly enabled in the Parsl configuration.
There is no need to modify a Parsl program, as checkpointing will occur transparently.
In the following example, checkpointing is enabled at task exit. The results of
each invocation of the ``slow_double`` app will be stored in the checkpoint file.

.. code-block:: python

    import parsl
    from parsl.app.app import python_app
    from parsl.configs.local_threads import config
    from parsl.dataflow.memoization import BasicMemoizer

    # Enable checkpointing at task exit
    config.memoizer = BasicMemoizer(checkpoint_mode = 'task_exit')

    parsl.load(config)

    @python_app(cache=True)
    def slow_double(x):
        import time
        time.sleep(5)
        return x * 2

    d = []
    for i in range(5):
        d.append(slow_double(i))

    print([d[i].result() for i in range(5)])

Alternatively, manual checkpointing can be used to explicitly specify when the checkpoint
file should be saved. The following example shows how manual checkpointing can be used.
Here, the ``checkpoint()`` method will save the results of the prior invocations
of the ``slow_double`` app.

.. code-block:: python

    import parsl
    from parsl import python_app
    from parsl.configs.local_threads import config

    dfk = parsl.load(config)

    @python_app(cache=True)
    def slow_double(x, sleep_dur=1):
        import time
        time.sleep(sleep_dur)
        return x * 2

    N = 5   # Number of calls to slow_double
    d = []  # List to store the futures
    for i in range(0, N):
        d.append(slow_double(i))

    # Wait for the results
    [i.result() for i in d]

    # Explicitly write a checkpoint containing the results computed so far
    dfk.memoizer.checkpoint()


Resuming from a checkpoint
^^^^^^^^^^^^^^^^^^^^^^^^^^

When resuming a program from a checkpoint, Parsl allows the user to select
which checkpoint file(s) to use.
Checkpoint files are stored in the ``runinfo/RUN_ID/checkpoint`` directory.

The example below shows how to resume using all available checkpoints.
Here, the program re-executes the same calls to the ``slow_double`` app
as above but, instead of waiting for results to be computed, the values
from the checkpoint file are immediately returned.

.. code-block:: python

    import parsl
    from parsl import python_app
    from parsl.configs.local_threads import config
    from parsl.dataflow.memoization import BasicMemoizer
    from parsl.utils import get_all_checkpoints

    # Load the checkpointed results of all previous runs
    config.memoizer = BasicMemoizer(checkpoint_files = get_all_checkpoints())

    parsl.load(config)

    @python_app(cache=True)
    def slow_double(x):
        import time
        time.sleep(5)
        return x * 2

    # Rerun the same workflow; results are returned from the checkpoint
    d = []
    for i in range(5):
        d.append(slow_double(i))

    # wait for results
    print([d[i].result() for i in range(5)])