.. _running:

Quickstart Examples
===================

.. _quickstart:
.. _cwlquickstart:

Running a basic CWL workflow
----------------------------

The `Common Workflow Language`_ (CWL) is an emerging standard for writing
workflows that are portable across multiple workflow engines and platforms.
Running CWL workflows using Toil is easy.

#. Copy and paste the following code block into ``example.cwl``:

   .. code-block:: yaml

       cwlVersion: v1.0
       class: CommandLineTool
       baseCommand: echo
       stdout: output.txt
       inputs:
         message:
           type: string
           inputBinding:
             position: 1
       outputs:
         output:
           type: stdout

   and this code into ``example-job.yaml``:

   .. code-block:: yaml

        message: Hello world!

#. To run the workflow simply enter ::

        $ toil-cwl-runner example.cwl example-job.yaml

   Your output will be in ``output.txt``::

        $ cat output.txt
        Hello world!

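To make the mapping from the tool description to the executed command more concrete, here is a minimal Python
sketch (not Toil or ``cwltool`` code) of how ``baseCommand`` and positional ``inputBinding`` entries could be combined
into an argument list. The function name ``build_command`` and its simplified arguments are assumptions made for
illustration; real CWL runners support many more binding options.

.. code-block:: python

    def build_command(base_command, inputs, positions):
        """Return the argument list for a tool with positional input bindings.

        base_command: the CWL ``baseCommand`` (a single string, for simplicity).
        inputs: mapping of input name to value, as in ``example-job.yaml``.
        positions: mapping of input name to its ``inputBinding.position``.
        """
        # Order the bound inputs by their declared position.
        ordered = sorted(positions, key=lambda name: positions[name])
        # The base command comes first, followed by the input values.
        return [base_command] + [str(inputs[name]) for name in ordered]


    # Mirrors example.cwl and example-job.yaml from the steps above.
    print(build_command("echo", {"message": "Hello world!"}, {"message": 1}))

Running the resulting argument list and capturing its standard output is what fills ``output.txt`` in the example.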

Congratulations! You've run your first Toil workflow using the default :ref:`Batch System <batchsysteminterface>`, ``single_machine``,
and the default ``file`` job store (which was placed in a temporary directory for you by ``toil-cwl-runner``).

Toil uses batch systems to manage the jobs it creates.

The ``single_machine`` batch system is primarily used to prepare and debug workflows on a
local machine. Once validated, try running them on a full-fledged batch system (see :ref:`batchsysteminterface`).
Toil supports many different batch systems such as `Kubernetes`_ and Grid Engine; its versatility makes it
easy to run your workflow in all kinds of places.

.. _Kubernetes: https://kubernetes.io/

Toil's CWL runner is highly configurable. Run ``toil-cwl-runner --help`` to see a complete list of available options.

To learn more about CWL, see the `CWL User Guide`_ (from which this example was
shamelessly borrowed). For information on using CWL with Toil, see the section :ref:`cwl`.
And for an example of CWL on an AWS cluster, have a look at :ref:`awscwl`.

.. _CWL User Guide: https://www.commonwl.org/user_guide/

Running a basic WDL workflow
----------------------------

The `Workflow Description Language`_ (WDL) is another emerging language for writing workflows that are portable across multiple workflow engines and platforms.
Support for running WDL workflows in Toil is still in alpha and should be considered experimental.  Toil currently supports basic workflow syntax (see :ref:`wdl` for more details and examples).  Here we go over running a basic WDL hello-world workflow.

#. Copy and paste the following code block into ``wdl-helloworld.wdl``::

        workflow write_simple_file {
          call write_file
        }
        task write_file {
          String message
          command { echo ${message} > wdl-helloworld-output.txt }
          output { File test = "wdl-helloworld-output.txt" }
        }

   and this code into ``wdl-helloworld.json``::

        {
          "write_simple_file.write_file.message": "Hello world!"
        }

#. To run the workflow simply enter ::

        $ toil-wdl-runner wdl-helloworld.wdl wdl-helloworld.json

   Your output will be in ``wdl-helloworld-output.txt``::

        $ cat wdl-helloworld-output.txt
        Hello world!

This will, like the CWL example above, use the ``single_machine`` batch system
and an automatically-located ``file`` job store by default. You can customize
Toil's execution of the workflow with command-line options; run
``toil-wdl-runner --help`` to learn about them.

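The key in ``wdl-helloworld.json`` is namespaced as ``<workflow>.<call>.<input>``. As a minimal sketch, the inputs
file could also be generated from Python with the standard ``json`` module; the helper name ``make_inputs`` is
hypothetical and exists only for this illustration.

.. code-block:: python

    import json


    def make_inputs(workflow, call, **values):
        """Build a WDL inputs mapping with fully qualified key names."""
        return {f"{workflow}.{call}.{name}": value for name, value in values.items()}


    # Reproduces the contents of wdl-helloworld.json shown above.
    inputs = make_inputs("write_simple_file", "write_file", message="Hello world!")
    print(json.dumps(inputs, indent=2))

Writing the printed text to ``wdl-helloworld.json`` gives the same inputs file as in the first step.
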
To learn more about WDL in general, see the `Terra WDL documentation`_ . For more on using WDL in Toil, see :ref:`wdl`.

.. _Terra WDL documentation: https://support.terra.bio/hc/en-us/sections/360007274612-WDL-Documentation
.. _Workflow Description Language: https://software.broadinstitute.org/wdl/

.. _pyquickstart:

Running a basic Python workflow
-------------------------------

In addition to workflow languages like CWL and WDL, Toil supports running workflows written against its Python API.

An example Toil Python workflow can be run with just three steps:

1. Install Toil (see :ref:`installation-ref`)

2. Copy and paste the following code block into a new file called ``helloWorld.py``:

.. literalinclude:: ../../src/toil/test/docs/scripts/tutorial_helloworld.py

3. Specify the name of the :ref:`job store <jobStoreOverview>` and run the workflow::

       $ python3 helloWorld.py file:my-job-store

For something beyond a "Hello, world!" example, refer to :ref:`runningDetail`.

Toil's customization options are available in Python workflows. Run ``python3 helloWorld.py --help`` to see a complete list of available options.


.. _mitochondriaExample:

Example: mitochondrial variant calling
--------------------------------------

Let's run a more realistic workflow with Toil. This workflow performs mitochondrial variant calling: it takes the human reference genome and sequenced reads from an individual and determines how that person's mitochondrial DNA differs from the reference genome.

We will run an example `workflow from Dockstore <https://dockstore.org/workflows/github.com/broadinstitute/gatk/MitochondriaPipeline:master?tab=info>`_.
First, grab an example workflow input::

    (venv) $ wget https://toil-datasets.s3.us-west-2.amazonaws.com/MitochondriaInputs.zip && unzip MitochondriaInputs.zip

Then, change your current working directory::

    (venv) $ cd MitochondriaInputs

This workflow will take approximately 30 minutes to run.

Toil supports Dockstore TRS IDs, which lets the WDL runner run any workflow on Dockstore, so we can run the workflow directly::

      (venv) $ toil-wdl-runner '#workflow/github.com/broadinstitute/gatk/MitochondriaPipeline:master' -i ExampleInputsMitochondriaPipeline.json --logInfo --container docker --quantCheck false --outputFile mitochondria.json


.. note::
        * ``--logInfo`` runs the workflow with INFO level logging. For different levels of logging, see ``--logLevel``, ``--logCritical``, ``--logError``, ``--logWarning``, ``--logDebug``, and ``--logTrace``.
        * ``--container docker`` uses Docker as the container backend. By default, Toil will run with Singularity. To set explicitly, use ``--container singularity``.
        * ``--outputFile`` will put the workflow JSON outputs into a file. If omitted, Toil will print the workflow outputs to the command line.
        * ``--quantCheck false`` disables certain type checks. This is useful for Cromwell compatibility.

Unless fakeroot support is set up for Singularity, this particular workflow must be run with Docker because it assumes commands in the container will run as root.
Additionally, WDL workflows sometimes depend on non-spec-compliant behaviors. To see if Toil has a workaround option, see :ref:`wdlOptions`.

Once the workflow is done running, you can look at your JSON output with ``jq . mitochondria.json``. For example, if we want the ``out_vcf`` output from the workflow, we can run ``jq -r '.["MitochondriaPipeline.out_vcf"]' mitochondria.json`` to get its path::

    /private/groups/patenlab/toil-dev/mitochondria/wdl-out-c6o9mjop/MitochondriaPipeline.AlignAndCall.FilterContamination/HG02571.GRCh38.chrM.vcf

To open the VCF file, we can run ``less $(jq -r '.["MitochondriaPipeline.out_vcf"]' mitochondria.json)``

.. note::
        For outputs that aren't files, their values are directly in the JSON. For example, with ``jq '.["MitochondriaPipeline.median_coverage"]' mitochondria.json`` you can fetch the ``median_coverage`` output's value::

            4183.5

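The same lookups can be done from Python with the standard ``json`` module. This is a minimal sketch; the helper
names ``load_outputs`` and ``get_output`` are hypothetical, and the sample dictionary simply reuses the value shown
above instead of reading ``mitochondria.json``.

.. code-block:: python

    import json


    def load_outputs(path):
        """Read the JSON file written by ``--outputFile``."""
        with open(path) as handle:
            return json.load(handle)


    def get_output(outputs, workflow, name):
        """Look up one workflow output by its fully qualified name."""
        return outputs[f"{workflow}.{name}"]


    # Stand-in for load_outputs("mitochondria.json").
    sample = {"MitochondriaPipeline.median_coverage": 4183.5}
    print(get_output(sample, "MitochondriaPipeline", "median_coverage"))

For file outputs, the value returned this way is the path of the file, as with the ``jq -r`` command above.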

Toil uses a jobstore to store all of a workflow's files and to communicate between workers. If not specified, Toil will use an ephemeral directory that is deleted after Toil is done running.
To control where those files are placed or allow a workflow to be restarted, you can use the ``--jobStore`` option. If you specify a jobstore explicitly, the jobstore will stick around if the workflow fails. To keep the jobstore after a successful completion, use ``--clean never``. To remove the jobstore even after a failing run, use ``--clean always``.

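The cleanup rules described above can be summarized as a small decision function. This is only a sketch of the
behavior stated in this section for an explicitly specified jobstore, not Toil's implementation; the function name
``job_store_kept`` is hypothetical, and only the two ``--clean`` values mentioned above are modelled.

.. code-block:: python

    def job_store_kept(succeeded, clean=None):
        """Report whether an explicitly specified jobstore remains after a run.

        succeeded: True if the workflow finished successfully.
        clean: None for the default behavior, or "never" / "always".
        """
        if clean == "never":
            return True   # kept even after a successful run
        if clean == "always":
            return False  # removed even after a failed run
        if clean is None:
            return not succeeded  # default: kept only when the run failed
        raise ValueError(f"clean value not modelled in this sketch: {clean!r}")


    for succeeded in (True, False):
        for clean in (None, "never", "always"):
            print(succeeded, clean, job_store_kept(succeeded, clean))

The only combinations in which a restart remains possible are those where the function returns ``True``.
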
On a cluster, the jobstore must be somewhere accessible to all worker nodes. Here's an example of running the workflow with a specified jobstore::

      (venv) $ toil-wdl-runner MitochondriaPipeline.wdl -i ExampleInputsMitochondriaPipeline.json --logInfo --container docker --outputFile mitochondria.json --jobStore mitochondriaJobstore

Toil supports several batch systems. By default, Toil will use ``single_machine``, which will run everything on the local machine. Other batch systems are available. For example, you can use ``--batchSystem slurm`` to run on a Slurm cluster::

      (venv) $ toil-wdl-runner MitochondriaPipeline.wdl -i ExampleInputsMitochondriaPipeline.json --logInfo --container docker --outputFile mitochondria.json --jobStore mitochondriaJobstore --batchSystem slurm

See :ref:`runningSlurm` for more information, including how to specify time limits and partitions.

Sometimes, a workflow may fail. If this is the case, the workflow can be restarted from the point of failure with ``--restart``, as long as you still have the jobstore::

    (venv) $ toil-wdl-runner MitochondriaPipeline.wdl -i ExampleInputsMitochondriaPipeline.json --logInfo --container docker --outputFile mitochondria.json --jobStore mitochondriaJobstore --batchSystem slurm --restart


.. _runningDetail:

Example: sorting
----------------

For a more detailed example and explanation, we've developed a sample pipeline
that merge-sorts a temporary file. This is not supposed to be an efficient
sorting program, but rather a more fully worked example of what Toil is capable of.

.. _sortExample:

Running the example
~~~~~~~~~~~~~~~~~~~

#. Download :download:`the example code <../../src/toil/test/sort/sort.py>`


#. Run it with the default settings::

      $ python3 sort.py file:jobStore

   The workflow created a file called ``sortedFile.txt`` in your current directory.
   Have a look at it and notice that it contains a whole lot of sorted lines!

   This workflow does a smart merge sort on a file it generates, ``fileToSort.txt``. The sort is *smart*
   because each step of the process---splitting the file into separate chunks, sorting these chunks, and merging them
   back together---is compartmentalized into a **job**. Each job can specify its own resource requirements and will
   only be run after the jobs it depends upon have run. Jobs without dependencies will be run in parallel.

.. note::
        Delete ``fileToSort.txt`` before moving on to #3. This example introduces options that specify dimensions for
        ``fileToSort.txt``, if it does not already exist. If it exists, this workflow will use the existing file and
        the results will be the same as #2.

3. Run with custom options::

      $ python3 sort.py file:jobStore \
                   --numLines=5000 \
                   --lineLength=10 \
                   --overwriteOutput=True \
                   --workDir=/tmp/

   Here we see that we can add our own options to a Toil Python workflow. As noted above, the first two
   options, ``--numLines`` and ``--lineLength``, determine the number of lines and how many characters are in each line.
   ``--overwriteOutput`` causes the current contents of ``sortedFile.txt`` to be overwritten, if it already exists.
   The last option, ``--workDir``, is an option built into Toil to specify where temporary files unique to a job are kept.

Describing the source code
~~~~~~~~~~~~~~~~~~~~~~~~~~

To understand the details of what's going on inside, let's start with the ``main()`` function.
It looks like a lot of code, but don't worry---we'll break it down piece by
piece.

.. literalinclude:: ../../src/toil/test/sort/sort.py
    :pyobject: main

First we make a parser to process command line arguments using the `argparse`_ module. It's important that we add the
call to :func:`Job.Runner.addToilOptions` to initialize our parser with all of Toil's default options. Then we add
the command line arguments unique to this workflow, and parse the input. The help message listed with the arguments
should give you a pretty good idea of what they can do.

Next we do a little bit of verification of the input arguments. The option ``--fileToSort`` allows you to specify a file
that needs to be sorted. If this option isn't given, it's here that we make our own file with the call to
:func:`makeFileToSort`.

Finally we come to the context manager that initializes the workflow. We create a path to the input file prepended with
``'file://'`` as per the documentation for :func:`toil.common.Toil` when staging a file that is stored locally. Notice
that we have to check whether or not the workflow is restarting so that we don't import the file more than once.
Finally we can kick off the workflow by calling :func:`toil.common.Toil.start` on the job ``setup``. When the workflow
ends we capture its output (the sorted file's fileID) and use that in :func:`toil.common.Toil.exportFile` to move the
sorted file from the job store back into "userland".

Next let's look at the job that begins the actual workflow, ``setup``.

.. literalinclude:: ../../src/toil/test/sort/sort.py
    :pyobject: setup

``setup`` really only does two things. First it writes to the logs using :func:`Job.log` and then
calls :func:`addChildJobFn`. Child jobs run directly after the current job. This function turns the 'job function'
``down`` into an actual job and passes in the inputs including an optional resource requirement, ``memory``. The job
doesn't actually get run until the call to :func:`Job.rv`. Once the job ``down`` finishes, its output is returned here.

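The way a return value travels from a child job back to its parent can be illustrated with a toy model. The
``Promise`` and ``ToyJob`` classes and the ``run`` function below are hypothetical stand-ins written for this sketch;
they are not Toil's classes and leave out scheduling, resources, and the job store entirely.

.. code-block:: python

    class Promise:
        """Placeholder for a return value that exists only once a job has run."""

        def __init__(self):
            self._fulfilled = False
            self._value = None

        def fulfill(self, value):
            self._fulfilled = True
            self._value = value

        def get(self):
            if not self._fulfilled:
                raise RuntimeError("the job behind this promise has not run yet")
            return self._value


    class ToyJob:
        """A function plus its arguments, to be run later by ``run``."""

        def __init__(self, fn, *args):
            self.fn = fn
            self.args = args
            self.children = []
            self.promise = Promise()

        def add_child_fn(self, fn, *args):
            child = ToyJob(fn, *args)
            self.children.append(child)
            return child

        def rv(self):
            return self.promise


    def run(job):
        """Run ``job``, then its children, then fulfill the job's promise."""
        result = job.fn(job, *job.args)
        for child in job.children:
            run(child)
        # A job may return a child's promise; by now that child has run.
        if isinstance(result, Promise):
            result = result.get()
        job.promise.fulfill(result)
        return result


    def down(job, text):
        return text.upper()


    def setup(job, text):
        # Registers ``down`` as a child and returns a promise, not a value.
        return job.add_child_fn(down, text).rv()


    print(run(ToyJob(setup, "sorted file id")))

In this model, ``setup`` finishes before ``down`` has produced anything; the promise is what connects the two.
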
Now we can look at what ``down`` does.

.. literalinclude:: ../../src/toil/test/sort/sort.py
    :pyobject: down

``down`` is the recursive part of the workflow. First we read the file into the local filestore by calling
:func:`job.fileStore.readGlobalFile`. This puts a copy of the file in the temp directory for this particular job. This
storage will disappear once this job ends. For a detailed explanation of the filestore, job store, and their interfaces
have a look at :ref:`managingFiles`.

Next ``down`` checks the base case of the recursion: is the length of the input file less than ``N`` (remember ``N``
was an option we added to the workflow in ``main``)? In the base case, we just sort the file, and return the file ID
of this new sorted file.

If the base case fails, then the file is split into two new temporary files using :func:`job.fileStore.getLocalTempFile` and
the helper function ``copySubRangeOfFile``. Finally we add a follow-on job ``up`` with :func:`job.addFollowOnJobFn`.
We've already seen child jobs. A follow-on job is a job that runs after the current job and *all* of its children (and their children and follow-ons) have
completed. Using a follow-on makes sense because ``up`` is responsible for merging the files together and we don't want
to merge the files together until we *know* they are sorted. Again, the return value of the follow-on job is requested
using :func:`Job.rv`.

Looking at ``up``

.. literalinclude:: ../../src/toil/test/sort/sort.py
    :pyobject: up

we see that the two input files are merged together and the output is written to a new file using
:func:`job.fileStore.writeGlobalFileStream`. After a little cleanup, the output file is returned.

Once the final ``up`` finishes and all of the ``rv()`` promises are fulfilled, ``main`` receives the sorted file's ID
which it uses in ``exportFile`` to send it to the user.

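The recursion carried out by ``down`` and ``up`` can be sketched without Toil at all. The following is a simplified,
in-memory analogue that works on a list of lines rather than on files in the job store. It is not the code in
``sort.py``, and measuring the base case in lines is an assumption of the sketch.

.. code-block:: python

    import heapq


    def up(left_sorted, right_sorted):
        """Merge two already-sorted lists, as ``up`` merges two sorted files."""
        return list(heapq.merge(left_sorted, right_sorted))


    def down(lines, n):
        """Split ``lines`` recursively until a piece is small enough to sort."""
        # Base case: fewer than n lines, or nothing left to split.
        if len(lines) < n or len(lines) <= 1:
            return sorted(lines)
        middle = len(lines) // 2
        # In the Toil workflow the two halves are handled by child jobs,
        # and the merge happens in a follow-on job once both are done.
        return up(down(lines[:middle], n), down(lines[middle:], n))


    print(down(["pear", "apple", "fig", "kiwi", "date"], 2))

The merge is only correct because both halves are fully sorted first, which is the ordering a follow-on job guarantees.
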
There are other things in this example that we didn't go over such as :ref:`checkpoints` and the details of much of
the :ref:`api`.

.. _argparse: https://docs.python.org/3/library/argparse.html

At the end of the script the lines

.. code-block:: python

    if __name__ == '__main__':
        main()

are included to ensure that the main function is only run once in the ``'__main__'`` process
invoked by you, the user.
In Toil terms, by invoking the script you created the *leader process*
in which the ``main()``
function is run. A *worker process* is a separate process whose sole purpose
is to host the execution of one or more jobs defined in that script. In any Toil
workflow there is always one leader process, and potentially many worker processes.

When using the single-machine batch system (the default), the worker processes will be running
on the same machine as the leader process. With full-fledged batch systems like
Kubernetes the worker processes will typically be started on separate machines. The
boilerplate ensures that the pipeline is only started once---on the leader---but
not when its job functions are imported and executed on the individual workers.

Typing ``python3 sort.py --help`` will show the complete list of
arguments for the workflow which includes both Toil's and ones defined inside
``sort.py``. A complete explanation of Toil's arguments can be
found in :ref:`commandRef`.


Logging
~~~~~~~

By default, Toil logs a lot of information related to the current environment
in addition to messages from the batch system and jobs. This can be configured
with the ``--logLevel`` flag. For example, to only log ``CRITICAL`` level
messages to the screen::

   $ python3 sort.py file:jobStore \
                --logLevel=critical \
                --overwriteOutput=True

This hides most of the information we get from the Toil run. For more detail,
we can run the pipeline with ``--logLevel=debug`` to see a comprehensive
output. For more information, see :ref:`workflowOptions`.

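Assuming Toil's named log levels correspond to those of Python's standard ``logging`` module, the effect of
``--logLevel`` can be pictured as a threshold: only messages at or above the chosen level are shown. The helper
``visible_levels`` below is hypothetical and covers only the five standard level names.

.. code-block:: python

    import logging

    STANDARD_LEVELS = ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]


    def visible_levels(log_level):
        """List the standard levels that pass a given threshold."""
        threshold = getattr(logging, log_level.upper())
        return [name for name in STANDARD_LEVELS if getattr(logging, name) >= threshold]


    print(visible_levels("critical"))
    print(visible_levels("debug"))

With ``critical`` only one level remains visible, while ``debug`` lets every standard level through.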

Error Handling and Resuming Pipelines
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

With Toil, you can recover gracefully from a bug in your pipeline without losing
any progress from successfully completed jobs. To demonstrate this, let's add
a bug to our example code to see how Toil handles a failure and how we can
resume a pipeline after that happens. Add a bad assertion at line 52 of the
example (the first line of ``down()``):

.. code-block:: python

   def down(job, inputFileStoreID, N, downCheckpoints, memory=sortMemory):
       ...
       assert 1 == 2, "Test error!"

When we run the pipeline, Toil will show a detailed failure log with a traceback::

   $ python3 sort.py file:jobStore
   ...
   ---TOIL WORKER OUTPUT LOG---
   ...
   m/j/jobonrSMP    Traceback (most recent call last):
   m/j/jobonrSMP      File "toil/src/toil/worker.py", line 340, in main
   m/j/jobonrSMP        job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
   m/j/jobonrSMP      File "toil/src/toil/job.py", line 1270, in _runner
   m/j/jobonrSMP        returnValues = self._run(jobGraph, fileStore)
   m/j/jobonrSMP      File "toil/src/toil/job.py", line 1217, in _run
   m/j/jobonrSMP        return self.run(fileStore)
   m/j/jobonrSMP      File "toil/src/toil/job.py", line 1383, in run
   m/j/jobonrSMP        rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
   m/j/jobonrSMP      File "toil/example.py", line 30, in down
   m/j/jobonrSMP        assert 1 == 2, "Test error!"
   m/j/jobonrSMP    AssertionError: Test error!

If we try to run the pipeline again, Toil will give us an error message saying
that a job store of the same name already exists. By default, in the event of a
failure, the job store is preserved so that the workflow can be restarted,
starting from the previously failed jobs. We can restart the pipeline by running ::

   $ python3 sort.py file:jobStore \
                --restart \
                --overwriteOutput=True

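The interaction between an existing job store and ``--restart`` can be pictured with a toy check. This sketch uses a
plain local directory as the job store; the function and exception names are hypothetical rather than Toil's own,
and the rule that a restart requires an existing job store is an assumption of the sketch.

.. code-block:: python

    import os
    import tempfile


    class JobStoreExists(Exception):
        """Hypothetical error: a fresh run found an existing job store."""


    class JobStoreMissing(Exception):
        """Hypothetical error: a restart found no job store to resume from."""


    def check_job_store(path, restart):
        """Decide whether a run may proceed, and in which mode."""
        exists = os.path.isdir(path)
        if exists and not restart:
            raise JobStoreExists(f"{path} already exists; pass --restart to resume")
        if restart and not exists:
            raise JobStoreMissing(f"{path} does not exist; nothing to restart")
        return "resume" if restart else "fresh"


    with tempfile.TemporaryDirectory() as leftover:
        # A job store left behind by a failed run can only be resumed.
        print(check_job_store(leftover, restart=True))

Calling ``check_job_store(leftover, restart=False)`` inside the ``with`` block would raise ``JobStoreExists`` instead.
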
We can also change the number of times Toil will attempt to retry a failed job::

   $ python3 sort.py file:jobStore \
                --retryCount 2 \
                --restart \
                --overwriteOutput=True

You'll now see Toil attempt to rerun the failed job until it runs out of tries.
``--retryCount`` is useful for non-systemic errors, like downloading a file that
may experience a sporadic interruption, or some other non-deterministic failure.

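The idea behind retrying can be sketched in plain Python. This is an illustration and not Toil's scheduler; the
names ``run_with_retries`` and ``make_flaky`` are hypothetical, and treating the retry count as the number of
attempts allowed after the first one is an assumption of the sketch.

.. code-block:: python

    def run_with_retries(job_fn, retry_count):
        """Call ``job_fn`` until it succeeds or the retries are used up.

        Returns the result together with the number of attempts made.
        """
        attempts = 0
        while True:
            attempts += 1
            try:
                return job_fn(), attempts
            except Exception:
                # Give up once the first attempt plus all retries have failed.
                if attempts > retry_count:
                    raise


    def make_flaky(failures):
        """Build a job that fails ``failures`` times and then succeeds."""
        state = {"remaining": failures}

        def job():
            if state["remaining"] > 0:
                state["remaining"] -= 1
                raise IOError("sporadic download failure")
            return "done"

        return job


    print(run_with_retries(make_flaky(2), retry_count=2))

A job that always fails, like our bad assertion, exhausts its retries and the error is raised again.
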
To successfully restart our pipeline, we can edit our script to comment out
the bad assertion, or remove it, and then run

::

    $ python3 sort.py file:jobStore \
                 --restart \
                 --overwriteOutput=True

The pipeline will run successfully, and the job store will be removed on the
pipeline's completion.


Collecting Statistics
~~~~~~~~~~~~~~~~~~~~~

Please see the :ref:`cli_status` section for more on gathering runtime and resource info on jobs.


Launching a Toil Workflow in AWS
--------------------------------
After having installed the ``aws`` extra for Toil during the :ref:`installation-ref` and set up AWS
(see :ref:`prepareAWS`), the user can run the basic ``helloWorld.py`` script (:ref:`pyquickstart`)
on a VM in AWS just by modifying the run command.

Note that when running in AWS, users can either run the workflow on a single instance or run it on a
cluster (which is running across multiple containers on multiple AWS instances).  For more information
on running Toil workflows on a cluster, see :ref:`runningAWS`.

Remember to use the :ref:`destroyCluster` command to destroy the cluster when you are finished! Otherwise, resources may not be cleaned up properly.

#. Launch a cluster in AWS using the :ref:`launchCluster` command::

        $ toil launch-cluster <cluster-name> \
                     --clusterType kubernetes \
                     --keyPairName <AWS-key-pair-name> \
                     --leaderNodeType t2.medium \
                     --nodeTypes t2.medium -w 1 \
                     --zone us-west-2a

   The arguments ``keyPairName``, ``leaderNodeType``, and ``zone`` are required to launch a cluster.

#. Copy ``helloWorld.py`` to the ``/tmp`` directory on the leader node using the :ref:`rsyncCluster` command::

        $ toil rsync-cluster --zone us-west-2a <cluster-name> helloWorld.py :/tmp

   Note that the command requires defining the file to copy as well as the target location on the cluster leader node.

#. Log in to the cluster leader node using the :ref:`sshCluster` command::

         $ toil ssh-cluster --zone us-west-2a <cluster-name>

   Note that this command will log you in as the ``root`` user.

#. Run the workflow on the cluster::

        $ python3 /tmp/helloWorld.py aws:us-west-2:my-S3-bucket

   In this particular case, we create an S3 bucket called ``my-S3-bucket`` in
   the ``us-west-2`` region to store intermediate job results.

   Along with some other ``INFO`` log messages, you should get the following output in your terminal window:
   ``Hello, world!, here's a message: You did it!``.


#. Exit from the SSH connection. ::

        $ exit

#. Use the :ref:`destroyCluster` command to destroy the cluster::

        $ toil destroy-cluster --zone us-west-2a <cluster-name>

   Note that this command will destroy the cluster leader
   node and any resources created to run the job, including the S3 bucket.

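The job store locators used in this document, such as ``file:my-job-store`` and ``aws:us-west-2:my-S3-bucket``,
follow a colon-separated pattern. The following sketch splits the two forms that appear here into their parts; the
function ``parse_locator`` is hypothetical, is not part of Toil, and does not cover other job store types.

.. code-block:: python

    def parse_locator(locator):
        """Split a job store locator into its named parts."""
        scheme, _, rest = locator.partition(":")
        if scheme == "file":
            # Everything after the scheme is a path on the local file system.
            return {"type": "file", "path": rest}
        if scheme == "aws":
            # The AWS form carries a region and a name after the scheme.
            region, _, name = rest.partition(":")
            return {"type": "aws", "region": region, "name": name}
        raise ValueError(f"locator type not covered by this sketch: {locator!r}")


    print(parse_locator("file:my-job-store"))
    print(parse_locator("aws:us-west-2:my-S3-bucket"))

Note that the AWS locator names a region (``us-west-2``), whereas the cluster commands above take a zone (``us-west-2a``).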

.. _awscwl:

Running a CWL Workflow on AWS
-----------------------------
After having installed the ``aws`` and ``cwl`` extras for Toil during the :ref:`installation-ref` and set up AWS
(see :ref:`prepareAWS`), the user can run a CWL workflow with Toil on AWS.

Remember to use the :ref:`destroyCluster` command to destroy the cluster when you are finished! Otherwise, resources may not be cleaned up properly.


#. First launch a node in AWS using the :ref:`launchCluster` command::

      $ toil launch-cluster <cluster-name> \
                   --clusterType kubernetes \
                   --keyPairName <AWS-key-pair-name> \
                   --leaderNodeType t2.medium \
                   --nodeTypes t2.medium -w 1 \
                   --zone us-west-2a

#. Copy ``example.cwl`` and ``example-job.yaml`` from the :ref:`CWL example <cwlquickstart>` to the node using
   the :ref:`rsyncCluster` command::

       $ toil rsync-cluster --zone us-west-2a <cluster-name> example.cwl :/tmp
       $ toil rsync-cluster --zone us-west-2a <cluster-name> example-job.yaml :/tmp

#. SSH into the cluster's leader node using the :ref:`sshCluster` utility::

      $ toil ssh-cluster --zone us-west-2a <cluster-name>

#. Once on the leader node, command line tools such as ``kubectl`` will be available to you. It's also a good idea to
   update and install the following::

    sudo apt-get update
    sudo apt-get -y upgrade
    sudo apt-get -y dist-upgrade
    sudo apt-get -y install git

#. Now create a new ``virtualenv`` with the ``--system-site-packages`` option and activate it::

    virtualenv --system-site-packages venv
    source venv/bin/activate

#. Now run the CWL workflow with the Kubernetes batch system::

      (venv) $ toil-cwl-runner \
                   --provisioner aws \
                   --batchSystem kubernetes \
                   --jobStore aws:us-west-2:any-name \
                   /tmp/example.cwl /tmp/example-job.yaml

   ..  tip::

      When running a CWL workflow on AWS, input files can be provided either on the
      local file system or in S3 buckets using ``s3://`` URI references. Final output
      files will be copied to the local file system of the leader node.

#. Finally, log out of the leader node and, from your local computer, destroy the cluster::

      $ toil destroy-cluster --zone us-west-2a <cluster-name>


.. _awscactus:

Running a Workflow with Autoscaling - Cactus
--------------------------------------------

`Cactus <https://github.com/ComparativeGenomicsToolkit/cactus>`__ is a reference-free, whole-genome multiple alignment
program that can be run on any of the cloud platforms Toil supports.

.. note::

      **Cloud Independence**:

      This example provides a "cloud agnostic" view of running Cactus with Toil. Most options will not change between cloud providers.
      However, each provisioner has unique inputs for  ``--leaderNodeType``, ``--nodeType`` and ``--zone``.
      We recommend the following:

        +----------------------+----------------+------------+---------------+
        |        Option        | Used in        |  AWS       |     Google    |
        +----------------------+----------------+------------+---------------+
        | ``--leaderNodeType`` | launch-cluster | t2.medium  | n1-standard-1 |
        +----------------------+----------------+------------+---------------+
        | ``--zone``           | launch-cluster | us-west-2a |               |
        +----------------------+----------------+------------+   us-west1-a  +
        | ``--zone``           | cactus         | us-west-2  |               |
        +----------------------+----------------+------------+---------------+
        | ``--nodeType``       | cactus         | c3.4xlarge | n1-standard-8 |
        +----------------------+----------------+------------+---------------+

      When executing ``toil launch-cluster`` with ``gce`` specified for ``--provisioner``, the option ``--boto`` must
      be specified and given a path to your .boto file. See :ref:`runningGCE` for more information about the ``--boto`` option.

Remember to use the :ref:`destroyCluster` command to destroy the cluster when you are finished! Otherwise, resources may not be cleaned up properly.

#. Download :download:`pestis.tar.gz <../../src/toil/test/cactus/pestis.tar.gz>`

#. Launch a cluster using the :ref:`launchCluster` command::

        $ toil launch-cluster <cluster-name> \
                     --provisioner <aws, gce> \
                     --keyPairName <key-pair-name> \
                     --leaderNodeType <type> \
                     --nodeType <type> \
                     -w 1-2 \
                     --zone <zone>


   .. note::

        **A Helpful Tip**

        When using AWS, setting the ``TOIL_AWS_ZONE`` environment variable eliminates the need to specify the ``--zone`` option
        for each command. This will be supported for GCE in the future. ::

            $ export TOIL_AWS_ZONE=us-west-2c

#. Create an appropriate directory for uploading files::

        $ toil ssh-cluster --provisioner <aws, gce> <cluster-name>
        $ mkdir /root/cact_ex
        $ exit

#. Copy the required files, i.e., seqFile.txt (a text file containing the locations of the input sequences as
   well as their phylogenetic tree, see
   `here <https://github.com/ComparativeGenomicsToolkit/cactus#seqfile-the-input-file>`__), organisms' genome sequence
   files in FASTA format, and configuration files (e.g. blockTrim1.xml, if desired), up to the leader node::

      $ toil rsync-cluster --provisioner <aws, gce> <cluster-name> pestis-short-aws-seqFile.txt :/root/cact_ex
      $ toil rsync-cluster --provisioner <aws, gce> <cluster-name> GCF_000169655.1_ASM16965v1_genomic.fna :/root/cact_ex
      $ toil rsync-cluster --provisioner <aws, gce> <cluster-name> GCF_000006645.1_ASM664v1_genomic.fna :/root/cact_ex
      $ toil rsync-cluster --provisioner <aws, gce> <cluster-name> GCF_000182485.1_ASM18248v1_genomic.fna :/root/cact_ex
      $ toil rsync-cluster --provisioner <aws, gce> <cluster-name> GCF_000013805.1_ASM1380v1_genomic.fna :/root/cact_ex
      $ toil rsync-cluster --provisioner <aws, gce> <cluster-name> setup_leaderNode.sh :/root/cact_ex
      $ toil rsync-cluster --provisioner <aws, gce> <cluster-name> blockTrim1.xml :/root/cact_ex
      $ toil rsync-cluster --provisioner <aws, gce> <cluster-name> blockTrim3.xml :/root/cact_ex

#. Log in to the leader node::

        $ toil ssh-cluster --provisioner <aws, gce> <cluster-name>

#. Set up the environment of the leader node to run Cactus::

        $ bash /root/cact_ex/setup_leaderNode.sh
        $ source cact_venv/bin/activate
        (cact_venv) $ cd cactus
        (cact_venv) $ pip install --upgrade .

#. Run `Cactus <https://github.com/ComparativeGenomicsToolkit/cactus>`__ as an autoscaling workflow::

       (cact_venv) $ cactus \
                         --retry 10 \
                         --batchSystem kubernetes \
                         --logDebug \
                         --logFile /logFile_pestis3 \
                         --configFile \
                         /root/cact_ex/blockTrim3.xml <aws, google>:<zone>:cactus-pestis \
                         /root/cact_ex/pestis-short-aws-seqFile.txt \
                         /root/cact_ex/pestis_output3.hal

   .. note::

      **Pieces of the Puzzle**:

      ``--logDebug`` --- equivalent to ``--logLevel DEBUG``.

      ``--logFile /logFile_pestis3`` --- writes logs to a file named ``logFile_pestis3`` in the ``/`` directory.

      ``--configFile`` --- optional; only needed when the alignment should be run with a specific configuration
      file.

      ``<aws, google>:<zone>:cactus-pestis`` --- creates a bucket, named ``cactus-pestis``, with the specified cloud provider to store intermediate job files and metadata.
      **NOTE**: If you want to use a GCE-based jobstore, specify ``google`` here, not ``gce``.

      The result file, named ``pestis_output3.hal``, is stored in the ``/root/cact_ex`` directory on the leader node.

      Use ``cactus --help`` to see all the Cactus and Toil flags available.

#. Log out of the leader node::

        (cact_venv) $ exit

#. Download the resulting output to your local machine::

        (venv) $ toil rsync-cluster \
                     --provisioner <aws, gce> <cluster-name> \
                     :/root/cact_ex/pestis_output3.hal \
                     <path-of-folder-on-local-machine>

#. Destroy the cluster::

        (venv) $ toil destroy-cluster --provisioner <aws, gce> <cluster-name>