Using GNU Parallel with ia
==========================

`GNU Parallel <https://www.gnu.org/software/parallel/>`_ is a shell tool for executing jobs in parallel.
It pairs well with ``ia`` for bulk jobs, such as running the same command against a long list of items.
Most OS package managers provide it.

For example, it can be installed via `Homebrew <https://brew.sh/>`_ on macOS::

    brew install parallel

Refer to the `GNU Parallel homepage <https://www.gnu.org/software/parallel/>`_ for more details on available packages, source code, installation, and other documentation and tutorials.


Basic Usage
-----------

You can use ``parallel`` to retrieve metadata from archive.org items concurrently:

.. code:: bash

    $ cat itemlist.txt
    jj-test-2020-09-17-1
    jj-test-2020-09-17-2
    jj-test-2020-09-17-3
    $ cat itemlist.txt | parallel 'ia metadata {}' | jq .metadata.date
    "1999"
    "1999"
    "1999"

You can run ``parallel`` with ``--dry-run`` to check your commands before running them:

.. code:: bash

    $ cat itemlist.txt | parallel --dry-run 'ia metadata {}'
    ia metadata jj-test-2020-09-17-2
    ia metadata jj-test-2020-09-17-1
    ia metadata jj-test-2020-09-17-3

Logging and retrying with Parallel
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Parallel also offers an easy way to log and retry failed commands.

The following job retrieves metadata for every item listed in ``itemlist.txt`` and writes it to a file named ``output.jsonl``.
The ``--joblog`` option logs each command and its exit value to ``/tmp/my_ia_job.log``:

.. code:: bash

    $ cat itemlist.txt | parallel --joblog /tmp/my_ia_job.log 'ia metadata {}' > output.jsonl
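
The job log is a tab-separated file with a header row. As a minimal sketch, assuming the default column layout (``Exitval`` in column 7, ``Signal`` in column 8, ``Command`` in column 9), a small shell function can print the commands that did not succeed. ``list_failed_jobs`` is a hypothetical name, not part of ``ia`` or ``parallel``:

.. code:: bash

    # Hypothetical helper: print the commands from a GNU Parallel job log
    # that exited non-zero or were killed by a signal.
    # Assumes the default layout: Exitval in column 7, Signal in column 8,
    # Command in column 9, preceded by a single header row.
    list_failed_jobs() {
        awk -F'\t' 'NR > 1 && ($7 != 0 || $8 != 0) { print $9 }' "$1"
    }

For example, ``list_failed_jobs /tmp/my_ia_job.log`` prints one failed command per line, or nothing if every job succeeded.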

You can now retry any commands that failed by using the ``--retry-failed`` option. In this example, remember to switch ``>`` to ``>>`` so that the new output is appended to ``output.jsonl`` rather than overwriting it:

.. code:: bash

    $ parallel --retry-failed --joblog /tmp/my_ia_job.log 'ia metadata {}' >> output.jsonl

If no commands failed, nothing is rerun.
You can rerun this command until it exits with ``0``, which means every command succeeded.
To check the exit code, run ``echo $?`` directly after the ``parallel`` command finishes.
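
Rerunning until success can also be wrapped in a small loop. The following is only a sketch, not part of ``ia`` or ``parallel``: ``retry_until_success`` and the ``RETRY_DELAY`` variable are hypothetical names, and the attempt limit is an assumption added so that a permanently failing item cannot loop forever:

.. code:: bash

    # Hypothetical helper: rerun a command until it exits with 0,
    # giving up after a maximum number of attempts.
    # Usage: retry_until_success MAX_ATTEMPTS COMMAND [ARGS...]
    retry_until_success() {
        local max_attempts=$1
        shift
        local attempt=1
        until "$@"; do
            if [ "$attempt" -ge "$max_attempts" ]; then
                echo "giving up after $attempt attempts" >&2
                return 1
            fi
            attempt=$((attempt + 1))
            # Pause between attempts; RETRY_DELAY (seconds) is an assumed knob.
            sleep "${RETRY_DELAY:-5}"
        done
    }

For example, ``retry_until_success 5 parallel --retry-failed --joblog /tmp/my_ia_job.log >> output.jsonl`` would rerun the failed commands up to five times, appending to ``output.jsonl`` on every attempt.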

Resources
---------

- Intro videos: `https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1 <https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1>`_
- Cheat sheet: `https://www.gnu.org/software/parallel/parallel_cheat.pdf <https://www.gnu.org/software/parallel/parallel_cheat.pdf>`_
- Examples from the man page: `https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Working-as-xargs--n1.-Argument-appending <https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Working-as-xargs--n1.-Argument-appending>`_