.. highlight:: console

.. _autoDeploying:

Auto-Deployment
===============

If you want to run a Toil Python workflow in a distributed environment, on multiple worker machines, either in the cloud or on a
bare-metal cluster, the Python code needs to be made available to those other machines. If the workflow's main module imports other
modules, those modules also need to be made available on the workers. Toil can automatically do that for you, with a
little help on your part. We call this feature *auto-deployment* of a workflow.

Let's first examine various scenarios of auto-deploying a workflow. Lastly,
we'll deal with the issue of declaring :ref:`Toil as a dependency
<depending_on_toil>` of a workflow that is packaged as a setuptools distribution.

Toil can be easily deployed to a remote host. First, assuming you've followed our :ref:`prepareAWS` section to install Toil
and used it to create a remote leader node on (in this example) AWS, you can now log into this node using
:ref:`sshCluster`. Once on the remote host, create and activate a virtualenv (making sure to use the
``--system-site-packages`` option!)::

   $ virtualenv --system-site-packages venv
   $ . venv/bin/activate

Note the ``--system-site-packages`` option, which ensures that globally-installed packages are accessible inside the
virtualenv.  Do not (re)install Toil after this!  The ``--system-site-packages`` option already makes the host's Toil
installation and its dependencies available inside the virtualenv.
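
To confirm that the virtualenv can see the system-wide Toil installation (the
path printed will vary with your setup), you can run, for example::

   $ python3 -c 'import toil; print(toil.__file__)'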

From here, you can install a project and its dependencies::

   $ tree
   .
   ├── util
   │   ├── __init__.py
   │   └── sort
   │       ├── __init__.py
   │       └── quick.py
   └── workflow
       ├── __init__.py
       └── main.py

   3 directories, 5 files
   $ pip install matplotlib
   $ cp -R workflow util venv/lib/python3.9/site-packages

Ideally, your project would have a ``setup.py`` file (see `setuptools`_) which streamlines the installation process::

   $ tree
   .
   ├── util
   │   ├── __init__.py
   │   └── sort
   │       ├── __init__.py
   │       └── quick.py
   ├── workflow
   │   ├── __init__.py
   │   └── main.py
   └── setup.py

   3 directories, 6 files
   $ pip install .
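
A minimal ``setup.py`` for the layout above might look like the following
sketch (the project name, version, and dependency list are placeholders):

.. code-block:: python

   from setuptools import setup, find_packages

   setup(
       name='my-project',                # placeholder: the name used on PyPI
       version='0.1.0',                  # placeholder version
       packages=find_packages(),         # finds the workflow and util packages
       install_requires=['matplotlib'],  # the project's external dependencies
   )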

Or, if your project has been published to PyPI::

   $ pip install my-project

In each case, we have created a virtualenv with the ``--system-site-packages`` flag in the ``venv`` subdirectory, then
installed the ``matplotlib`` distribution from PyPI along with the two packages that our project consists of. (Again,
both Python and Toil are assumed to be present on the leader and all worker nodes.)

We can now run our workflow::

   $ python3 -m workflow.main --batchSystem=kubernetes …
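
For reference, a minimal ``workflow/main.py`` entry point might look like the
following sketch, built on Toil's Python API (the ``hello`` job is purely
illustrative):

.. code-block:: python

   from toil.common import Toil
   from toil.job import Job


   def hello(job, name):
       # A trivial job function; a real workflow would import helpers from
       # the sibling util package instead.
       return 'Hello, %s!' % name


   if __name__ == '__main__':
       # Parse the standard Toil options, such as the job store locator and
       # --batchSystem.
       parser = Job.Runner.getDefaultArgumentParser()
       options = parser.parse_args()
       with Toil(options) as toil:
           print(toil.start(Job.wrapJobFn(hello, 'world')))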

.. important::

   If the workflow's external dependencies contain native code (i.e. are not pure
   Python), then they must be manually installed on each worker.

.. warning::

   Neither ``python3 setup.py develop`` nor ``pip install -e .`` can be used in
   this process as, instead of copying the source files, they create ``.egg-link``
   files that Toil can't auto-deploy. Similarly, ``python3 setup.py install``
   doesn't work either as it installs the project as a Python ``.egg`` which is
   also not currently supported by Toil (though it `could be`_ in the future).

   Also note that using the
   ``--single-version-externally-managed`` flag with ``setup.py`` will
   prevent the installation of your package as an ``.egg``. It will also disable
   the automatic installation of your project's dependencies.

.. _setuptools: http://setuptools.readthedocs.io/en/latest/index.html
.. _could be: https://github.com/BD2KGenomics/toil/issues/1367

Auto Deployment with Sibling Python Files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This scenario applies if a Python workflow imports files that are its siblings::

   $ cd my_project
   $ ls
   userScript.py utilities.py
   $ ./userScript.py --batchSystem=kubernetes …

Here ``userScript.py`` imports additional functionality from ``utilities.py``.
Toil detects that ``userScript.py`` has sibling Python files and copies them to the
workers, alongside the main Python file. Note that sibling Python files will be
auto-deployed regardless of whether they are actually imported by the workflow:
all ``.py`` files residing in the same directory as the main workflow file will
be auto-deployed.
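
For example, ``userScript.py`` might pull a helper from its sibling like this
(``quicksort`` here is a hypothetical function defined in ``utilities.py``):

.. code-block:: python

   #!/usr/bin/env python3
   # userScript.py
   from utilities import quicksort  # utilities.py is auto-deployed with us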

This structure is a suitable method of organizing the source code of
reasonably complicated workflows.


Auto-Deploying a Package Hierarchy
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Recall that in Python, a `package`_ is a directory containing one or more
``.py`` files, one of which must be called ``__init__.py``, and optionally other
packages. For more involved workflows that contain a significant amount of
code, this is the recommended way of organizing the source code. Because we use
a package hierarchy, the main workflow file is just one module among the
others in the hierarchy. We need to inform Toil that we want to use a package hierarchy by
invoking Python's ``-m`` option. This enables Toil to identify the entire set
of modules belonging to the workflow and copy all of them to each worker. Note
that while using the ``-m`` option is optional in the scenarios above, it is
mandatory in this one.

The following shell session illustrates this::

   $ cd my_project
   $ tree
   .
   ├── util
   │   ├── __init__.py
   │   └── sort
   │       ├── __init__.py
   │       └── quick.py
   └── workflow
       ├── __init__.py
       └── main.py

   3 directories, 5 files
   $ python3 -m workflow.main --batchSystem=kubernetes …
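
Within the hierarchy, modules refer to each other with absolute imports; for
example, ``workflow/main.py`` might begin like this (``quicksort`` is a
hypothetical helper defined in ``util/sort/quick.py``):

.. code-block:: python

   # workflow/main.py
   from util.sort.quick import quicksort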

.. _package: https://docs.python.org/3/tutorial/modules.html#packages

Here the workflow entry point module ``main.py`` does not reside in the current directory, but
is part of a package called ``workflow``, in a subdirectory of the current
directory. Additional functionality is in a separate module called
``util.sort.quick`` which corresponds to ``util/sort/quick.py``. Because we
invoke the workflow via ``python3 -m workflow.main``, Toil can determine the
root directory of the hierarchy (``my_project`` in this case) and copy all Python
modules underneath it to each worker. The ``-m`` option is documented `here`_.

.. _here: https://docs.python.org/3/using/cmdline.html#cmdoption-m

When ``-m`` is passed, Python adds the current working directory to
``sys.path``, the list of root directories to be considered when resolving a
module name like ``workflow.main``. Without that added convenience we'd have to
run the workflow as ``PYTHONPATH="$PWD" python3 -m workflow.main``. This also
means that Toil can detect the root directory of the invoked module's package
hierarchy even if it isn't the current working directory. In other words we
could do this::

   $ cd my_project
   $ export PYTHONPATH="$PWD"
   $ cd /some/other/dir
   $ python3 -m workflow.main --batchSystem=kubernetes …

Also note that the root directory itself must not be a package, i.e. it must not
contain an ``__init__.py`` file.

Relying on Shared Filesystems
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Bare-metal clusters typically mount a shared file system like NFS on each node.
If every node has that file system mounted at the same path, you can place your
project on that shared filesystem and run your Python workflow from there.
Additionally, you can clone the Toil source tree into a directory on that
shared file system and you won't even need to install Toil on every worker. Be
sure to add both your project directory and the Toil clone to ``PYTHONPATH``. Toil
replicates ``PYTHONPATH`` from the leader to every worker.
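
For example, with hypothetical mount points, the leader's environment could be
prepared like this (in a Toil source checkout, the importable package lives
under ``src``)::

   $ export PYTHONPATH="/shared/my_project:/shared/toil/src:$PYTHONPATH"
   $ python3 -m workflow.main --batchSystem=slurm …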

.. admonition:: Using a shared filesystem

   Toil currently only supports a ``tempdir`` set to a local, non-shared directory.

.. _deploying_toil:

Toil Appliance
--------------

The term Toil Appliance refers to the Ubuntu-based Docker image that Toil uses
for the machines in Toil-managed clusters, and for executing jobs on Kubernetes.
It's easily deployed, only needs Docker, and
allows a consistent environment on all Toil clusters. To specify a different
image, see the Toil :ref:`envars` section.  For more information on the Toil
Appliance, see the :ref:`runningAWS` section.
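
For instance, the appliance image can be overridden through the
``TOIL_APPLIANCE_SELF`` environment variable (the tag below is only a
placeholder; see :ref:`envars` for details)::

   $ export TOIL_APPLIANCE_SELF=quay.io/ucsc_cgl/toil:9.1.2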