.. _tutorial-google-lifesciences:
Google Life Sciences Tutorial
------------------------------
.. _Snakemake: http://snakemake.readthedocs.io
.. _Snakemake Remotes: https://snakemake.readthedocs.io/en/stable/snakefiles/remote_files.html
.. _Python: https://www.python.org/
Setup
:::::
To go through this tutorial, you need the following software installed:
* Python_ ≥3.5
* Snakemake_ ≥5.16
* git
First, you have to install the Miniconda Python3 distribution.
See `here <https://conda.io/en/latest/miniconda.html>`_ for installation instructions.
Make sure to ...
* Install the **Python 3** version of Miniconda.
* Answer yes to the question whether conda shall be put into your PATH.
The default conda solver is a bit slow and sometimes has issues with `selecting the latest package releases <https://github.com/conda/conda/issues/9905>`_. Therefore, we recommend installing `Mamba <https://github.com/QuantStack/mamba>`_ as a drop-in replacement via
.. code-block:: console
$ conda install -c conda-forge mamba
Then, you can install Snakemake with
.. code-block:: console
$ mamba create -c conda-forge -c bioconda -n snakemake snakemake
from the `Bioconda <https://bioconda.github.io>`_ channel.
This installs Snakemake into an isolated software environment, which has to be activated with
.. code-block:: console
$ conda activate snakemake
$ snakemake --help
Credentials
:::::::::::
Google's `Application Default Credentials <https://cloud.google.com/docs/authentication/application-default-credentials>`_
automatically find credentials based on the application environment. Snakemake supports two approaches for running with
Application Default Credentials:
- The `GOOGLE_APPLICATION_CREDENTIALS` environment variable
- The service account attached to your Google Cloud Project
**`GOOGLE_APPLICATION_CREDENTIALS`**
For this approach, export the environment
variable `GOOGLE_APPLICATION_CREDENTIALS`, which should point to
the full path of your service account key file on your local machine. To generate this file,
go to the page under iam-admin to `download your service account <https://console.cloud.google.com/iam-admin/iam>`_ key, then export its path to the environment.
.. code:: console
export GOOGLE_APPLICATION_CREDENTIALS="/home/[username]/credentials.json"
The suggested minimal permissions for this service account include the following:
- Compute Storage Admin (can potentially be restricted further)
- Compute Viewer
- Service Account User
- Cloud Life Sciences Workflows Runner
- Service Usage Consumer
*Note*: This tutorial assumes you are using the `GOOGLE_APPLICATION_CREDENTIALS` approach.
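
Before launching a workflow, it can help to confirm that the variable is set and points to a readable key file. The following is a minimal sketch and not part of Snakemake: the helper name `check_credentials` is hypothetical, and the fields it looks for (`type`, `client_email`, `project_id`) are assumed to be present because they normally appear in a downloaded service account key.

.. code:: python

    import json
    import os


    def check_credentials(environ=os.environ):
        """Return the service account email from the key file, or raise ValueError.

        Hypothetical helper for illustration; Snakemake does not provide it.
        """
        path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
        if not path:
            raise ValueError("GOOGLE_APPLICATION_CREDENTIALS is not set")
        if not os.path.isfile(path):
            raise ValueError("credentials file not found: %s" % path)
        with open(path) as handle:
            key = json.load(handle)
        # Assumption: a service account key is a JSON object with these fields.
        missing = [f for f in ("type", "client_email", "project_id") if f not in key]
        if missing:
            raise ValueError("key file is missing fields: %s" % ", ".join(missing))
        return key["client_email"]

Calling `check_credentials()` in the shell where you exported the variable returns the account email on success, which you can compare against the account you granted the permissions above.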
**Service Account**
When running on Google Compute Engine Virtual Machine instances, it is preferable to use your project's
`service account <https://cloud.google.com/docs/authentication/application-default-credentials#attached-sa>`_ .
You can pass your service account's email address with the `--google-lifesciences-service-account-email` flag
when running Snakemake. If you do this, you do not need to set the `GOOGLE_APPLICATION_CREDENTIALS`
environment variable.
Step 1: Upload Your Data
::::::::::::::::::::::::
We will be obtaining inputs from Google Cloud Storage, as well as saving
outputs there. You should first clone the repository with the Snakemake tutorial data:
.. code:: console
git clone https://github.com/snakemake/snakemake-lsh-tutorial-data
cd snakemake-lsh-tutorial-data
Then either manually create a bucket and upload the data files there, or
use the `provided script and instructions <https://github.com/snakemake/snakemake-lsh-tutorial-data#google-cloud-storage>`_
to do it programmatically from the command line. The script is generally invoked like this:
.. code:: console
python upload_google_storage.py <bucket>/<subpath> <folder>
You aren't required to provide a subfolder path if you want to upload
to the root of a bucket. As an example, for this tutorial we upload the contents of
"data" to the root of the bucket `snakemake-testing-data`:
.. code:: console
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/credentials.json"
python upload_google_storage.py snakemake-testing-data data/
If you wanted to upload to a "subfolder" path in a bucket, you would do that as follows:
.. code:: console
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/credentials.json"
python upload_google_storage.py snakemake-testing-data/subfolder data/
Your bucket name (plus the folder prefix, if any) is what you pass as
`--default-remote-prefix` when you run Snakemake. You can visually
browse your data in the `storage browser <https://console.cloud.google.com/storage/>`_.
.. image:: workflow/upload-google-storage.png
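
To make the `<bucket>/<subpath>` convention concrete, here is a small sketch of how files in a local folder could map to object names in the bucket. It is not the code of `upload_google_storage.py`: the helper `plan_uploads` is hypothetical, it assumes that the contents of the folder land directly under the prefix (as described above for "data"), and it only computes names without contacting Google Cloud Storage.

.. code:: python

    import os


    def plan_uploads(destination, folder):
        """Map each file under `folder` to a (local_path, bucket, object_name) tuple.

        `destination` is "<bucket>" or "<bucket>/<subpath>".
        Hypothetical helper: it only computes names and uploads nothing.
        """
        bucket, _, subpath = destination.strip("/").partition("/")
        plan = []
        for root, _dirs, files in os.walk(folder):
            for name in files:
                local_path = os.path.join(root, name)
                # Path relative to the uploaded folder, with "/" separators.
                relative = os.path.relpath(local_path, folder).replace(os.sep, "/")
                object_name = "/".join(p for p in (subpath, relative) if p)
                plan.append((local_path, bucket, object_name))
        return sorted(plan)

With a folder containing `genome.fa` and `samples/A.fastq`, the destination `snakemake-testing-data/subfolder` would yield the object names `subfolder/genome.fa` and `subfolder/samples/A.fastq` in the bucket `snakemake-testing-data`.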
Step 2: Write your Snakefile, Environment File, and Scripts
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
Now that we've exported our credentials and have all dependencies installed, let's
get our workflow! This is the exact same workflow from the :ref:`basic tutorial<tutorial-basics>`,
so if you need a refresher on the design or basics, please see those pages.
You can find the Snakefile, the supporting plotting script, and the environment file in the `snakemake-lsh-tutorial-data <https://github.com/snakemake/snakemake-lsh-tutorial-data>`_ repository.
First, how does a working directory work for this executor? The present
working directory, as identified by Snakemake, is the one that contains the Snakefile.
A more advanced setup might also have a folder of environment specifications (env), a folder of scripts
(scripts), and a folder of rules (rules). This directory is the context of the build.
When the Google Life Sciences executor is used, it generates a build package of all
of the files here (within a reasonable size) and uploads it to storage. This
package includes the .snakemake folder that would have been generated locally.
The build package is then downloaded and extracted by each cloud worker, which
is a Google Compute instance.
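
The idea of a build package can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the executor's implementation: the function name `build_package` is hypothetical, the sketch archives every file it finds, and the choice to hash relative paths plus file contents with SHA-256 is an assumption made so that the archive name changes whenever the working directory changes.

.. code:: python

    import hashlib
    import os
    import tarfile


    def build_package(workdir, outdir):
        """Archive the files in `workdir` and name the archive by a content hash.

        Illustrative only: the real executor decides which files to include
        and how the hash is computed.
        """
        files = []
        for root, _dirs, names in os.walk(workdir):
            for name in names:
                files.append(os.path.relpath(os.path.join(root, name), workdir))
        files.sort()

        # Hash relative paths and file contents so the name tracks the content.
        digest = hashlib.sha256()
        for relative in files:
            digest.update(relative.encode())
            with open(os.path.join(workdir, relative), "rb") as handle:
                digest.update(handle.read())

        archive = os.path.join(outdir, "snakeworkdir-%s.tar.gz" % digest.hexdigest())
        with tarfile.open(archive, "w:gz") as tar:
            for relative in files:
                tar.add(os.path.join(workdir, relative), arcname=relative)
        return archive

A worker would then download such an archive and unpack it before calling Snakemake on the single target it is responsible for.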
We next need an `environment.yaml` file that will define the dependencies
that we want installed with conda for our job. If you cloned the "snakemake-lsh-tutorial-data"
repository you will already have this, and you are good to go. If not, save this to `environment.yaml`
in your working directory:
.. code:: yaml
channels:
- conda-forge
- bioconda
dependencies:
- python =3.6
- jinja2 =2.10
- networkx =2.1
- matplotlib =2.2.3
- graphviz =2.38.0
- bcftools =1.9
- samtools =1.9
- bwa =0.7.17
- pysam =0.15.0
Notice that we reference this `environment.yaml` file in the Snakefile below.
Importantly, if you were optimizing a pipeline, you would likely have a folder
"envs" with more than one environment specification, one for each step.
This workflow uses the same environment (with many dependencies) instead of
this strategy to minimize the number of files for you.
The Snakefile (also included in the repository) then has the following content. It's important to note
that we have not customized this file from the basic tutorial to hard code
any storage. We will be telling snakemake to use the remote bucket as
storage instead of the local filesystem.
.. code:: python

    SAMPLES = ["A", "B"]

    rule all:
        input:
            "plots/quals.svg"

    rule bwa_map:
        input:
            fastq="samples/{sample}.fastq",
            idx=multiext("genome.fa", ".amb", ".ann", ".bwt", ".pac", ".sa")
        conda:
            "environment.yaml"
        output:
            "mapped_reads/{sample}.bam"
        params:
            idx=lambda w, input: os.path.splitext(input.idx[0])[0]
        shell:
            "bwa mem {params.idx} {input.fastq} | samtools view -Sb - > {output}"

    rule samtools_sort:
        input:
            "mapped_reads/{sample}.bam"
        output:
            "sorted_reads/{sample}.bam"
        conda:
            "environment.yaml"
        shell:
            "samtools sort -T sorted_reads/{wildcards.sample} "
            "-O bam {input} > {output}"

    rule samtools_index:
        input:
            "sorted_reads/{sample}.bam"
        output:
            "sorted_reads/{sample}.bam.bai"
        conda:
            "environment.yaml"
        shell:
            "samtools index {input}"

    rule bcftools_call:
        input:
            fa="genome.fa",
            bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
            bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
        output:
            "calls/all.vcf"
        conda:
            "environment.yaml"
        shell:
            "samtools mpileup -g -f {input.fa} {input.bam} | "
            "bcftools call -mv - > {output}"

    rule plot_quals:
        input:
            "calls/all.vcf"
        output:
            "plots/quals.svg"
        conda:
            "environment.yaml"
        script:
            "plot-quals.py"

Make sure you also have the script `plot-quals.py` in your present working directory for the last step.
This script does the plotting, and is also included in the `snakemake-lsh-tutorial-data <https://github.com/snakemake/snakemake-lsh-tutorial-data>`_ repository.
.. code:: python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from pysam import VariantFile
quals = [record.qual for record in VariantFile(snakemake.input[0])]
plt.hist(quals)
plt.savefig(snakemake.output[0])
Step 3: Run Snakemake
:::::::::::::::::::::
Now let's run Snakemake with the Google Life Sciences Executor.
.. code:: console
snakemake --google-lifesciences --default-remote-prefix snakemake-testing-data --use-conda --google-lifesciences-region us-west1
The flags above refer to:
- `--google-lifesciences`: to indicate that we want to use the Google Life Sciences API
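- `--default-remote-prefix`: refers to the Google Storage bucket. Here the bucket name is "snakemake-testing-data". If your data lives under a "subfolder" (or path) in the bucket, append it to the bucket name (not done above).
- `--use-conda`: tells Snakemake to create and use the conda environments named in the `conda` directives of the rules.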
- `--google-lifesciences-region`: the region that you want the instances to deploy to. Your storage bucket should be accessible from here, and your selection can have a small influence on the machine type selected.
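
The effect of `--default-remote-prefix` on the paths in the Snakefile can be pictured as a simple prefixing step. The sketch below is a conceptual illustration only (the function `with_remote_prefix` is hypothetical and is not how Snakemake implements remote files), but the resulting paths match the form you will see in the console output of this tutorial.

.. code:: python

    def with_remote_prefix(path, prefix):
        """Return the storage path for a workflow-relative path."""
        return "%s/%s" % (prefix.rstrip("/"), path.lstrip("/"))


    SAMPLES = ["A", "B"]
    prefix = "snakemake-testing-data"

    # Inputs and the final target of the workflow, as they appear in storage.
    inputs = [with_remote_prefix("samples/%s.fastq" % s, prefix) for s in SAMPLES]
    target = with_remote_prefix("plots/quals.svg", prefix)

    print(inputs)
    print(target)

Because the prefix is supplied on the command line, the Snakefile itself stays free of any storage-specific paths.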
Once you submit the job, you'll immediately see the familiar Snakemake console output,
but with additional lines for inspecting Google Compute instances with gcloud:
.. code:: console
Building DAG of jobs...
Unable to retrieve additional files from git. This is not a git repository.
Using shell: /bin/bash
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 all
1 bcftools_call
2 bwa_map
1 plot_quals
2 samtools_index
2 samtools_sort
9
[Thu Apr 16 19:16:24 2020]
rule bwa_map:
input: snakemake-testing-data/genome.fa, snakemake-testing-data/samples/B.fastq
output: snakemake-testing-data/mapped_reads/B.bam
jobid: 8
wildcards: sample=B
resources: mem_mb=15360, disk_mb=128000
Get status with:
gcloud config set project snakemake-testing
gcloud beta lifesciences operations describe 13586583122112209762
gcloud beta lifesciences operations list
Take note of those last three lines to describe and list operations. This is how you
get complete error and output logs for the run, which we will demonstrate later.
You'll see a block like that for each rule. Here is what the entire workflow looks
like after completion:
.. code:: console
Building DAG of jobs...
Unable to retrieve additional files from git. This is not a git repository.
Using shell: /bin/bash
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 all
1 bcftools_call
2 bwa_map
1 plot_quals
2 samtools_index
2 samtools_sort
9
[Fri Apr 17 20:27:51 2020]
rule bwa_map:
input: snakemake-testing-data/samples/B.fastq, snakemake-testing-data/genome.fa.amb, snakemake-testing-data/genome.fa.ann, snakemake-testing-data/genome.fa.bwt, snakemake-testing-data/genome.fa.pac, snakemake-testing-data/genome.fa.sa
output: snakemake-testing-data/mapped_reads/B.bam
jobid: 8
wildcards: sample=B
resources: mem_mb=15360, disk_mb=128000
Get status with:
gcloud config set project snakemake-testing
gcloud beta lifesciences operations describe projects/snakemake-testing/locations/us-west2/operations/16135317625786219242
gcloud beta lifesciences operations list
[Fri Apr 17 20:31:16 2020]
Finished job 8.
1 of 9 steps (11%) done
[Fri Apr 17 20:31:16 2020]
rule bwa_map:
input: snakemake-testing-data/samples/A.fastq, snakemake-testing-data/genome.fa.amb, snakemake-testing-data/genome.fa.ann, snakemake-testing-data/genome.fa.bwt, snakemake-testing-data/genome.fa.pac, snakemake-testing-data/genome.fa.sa
output: snakemake-testing-data/mapped_reads/A.bam
jobid: 7
wildcards: sample=A
resources: mem_mb=15360, disk_mb=128000
Get status with:
gcloud config set project snakemake-testing
gcloud beta lifesciences operations describe projects/snakemake-testing/locations/us-west2/operations/5458247376121133509
gcloud beta lifesciences operations list
[Fri Apr 17 20:34:30 2020]
Finished job 7.
2 of 9 steps (22%) done
[Fri Apr 17 20:34:30 2020]
rule samtools_sort:
input: snakemake-testing-data/mapped_reads/B.bam
output: snakemake-testing-data/sorted_reads/B.bam
jobid: 4
wildcards: sample=B
resources: mem_mb=15360, disk_mb=128000
Get status with:
gcloud config set project snakemake-testing
gcloud beta lifesciences operations describe projects/snakemake-testing/locations/us-west2/operations/13750029425473765929
gcloud beta lifesciences operations list
[Fri Apr 17 20:37:34 2020]
Finished job 4.
3 of 9 steps (33%) done
[Fri Apr 17 20:37:35 2020]
rule samtools_sort:
input: snakemake-testing-data/mapped_reads/A.bam
output: snakemake-testing-data/sorted_reads/A.bam
jobid: 3
wildcards: sample=A
resources: mem_mb=15360, disk_mb=128000
Get status with:
gcloud config set project snakemake-testing
gcloud beta lifesciences operations describe projects/snakemake-testing/locations/us-west2/operations/15643873965497084056
gcloud beta lifesciences operations list
[Fri Apr 17 20:40:37 2020]
Finished job 3.
4 of 9 steps (44%) done
[Fri Apr 17 20:40:38 2020]
rule samtools_index:
input: snakemake-testing-data/sorted_reads/B.bam
output: snakemake-testing-data/sorted_reads/B.bam.bai
jobid: 6
wildcards: sample=B
resources: mem_mb=15360, disk_mb=128000
Get status with:
gcloud config set project snakemake-testing
gcloud beta lifesciences operations describe projects/snakemake-testing/locations/us-west2/operations/6525320566174651173
gcloud beta lifesciences operations list
[Fri Apr 17 20:43:41 2020]
Finished job 6.
5 of 9 steps (56%) done
[Fri Apr 17 20:43:41 2020]
rule samtools_index:
input: snakemake-testing-data/sorted_reads/A.bam
output: snakemake-testing-data/sorted_reads/A.bam.bai
jobid: 5
wildcards: sample=A
resources: mem_mb=15360, disk_mb=128000
Get status with:
gcloud config set project snakemake-testing
gcloud beta lifesciences operations describe projects/snakemake-testing/locations/us-west2/operations/9175497885319251567
gcloud beta lifesciences operations list
[Fri Apr 17 20:46:44 2020]
Finished job 5.
6 of 9 steps (67%) done
[Fri Apr 17 20:46:44 2020]
rule bcftools_call:
input: snakemake-testing-data/genome.fa, snakemake-testing-data/sorted_reads/A.bam, snakemake-testing-data/sorted_reads/B.bam, snakemake-testing-data/sorted_reads/A.bam.bai, snakemake-testing-data/sorted_reads/B.bam.bai
output: snakemake-testing-data/calls/all.vcf
jobid: 2
resources: mem_mb=15360, disk_mb=128000
Get status with:
gcloud config set project snakemake-testing
gcloud beta lifesciences operations describe projects/snakemake-testing/locations/us-west2/operations/622600526583374352
gcloud beta lifesciences operations list
[Fri Apr 17 20:49:57 2020]
Finished job 2.
7 of 9 steps (78%) done
[Fri Apr 17 20:49:57 2020]
rule plot_quals:
input: snakemake-testing-data/calls/all.vcf
output: snakemake-testing-data/plots/quals.svg
jobid: 1
resources: mem_mb=15360, disk_mb=128000
Get status with:
gcloud config set project snakemake-testing
gcloud beta lifesciences operations describe projects/snakemake-testing/locations/us-west2/operations/9350722561866518561
gcloud beta lifesciences operations list
[Fri Apr 17 20:53:10 2020]
Finished job 1.
8 of 9 steps (89%) done
[Fri Apr 17 20:53:10 2020]
localrule all:
input: snakemake-testing-data/plots/quals.svg
jobid: 0
resources: mem_mb=15360, disk_mb=128000
Downloading from remote: snakemake-testing-data/plots/quals.svg
Finished download.
[Fri Apr 17 20:53:10 2020]
Finished job 0.
9 of 9 steps (100%) done
Complete log: /home/vanessa/snakemake-work/tutorial/.snakemake/log/2020-04-17T202749.218820.snakemake.log
We've finished the run, great! Let's inspect our results.
Step 4: View Results
::::::::::::::::::::
The entirety of the log that was printed to the terminal will be available
on your local machine where you submit the job in the hidden `.snakemake`
folder under "log" and timestamped accordingly. If you look at the last line
in the output above, you'll see the full path to this file.
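
If you want to revisit the operations of a run later, you can pull their names out of that log file. The following is a minimal sketch that assumes the log contains the `gcloud beta lifesciences operations describe ...` lines exactly as printed in the output above. The helper `find_operations` is hypothetical.

.. code:: python

    import re

    # Matches the "describe" command that is printed for each submitted job.
    DESCRIBE = re.compile(r"gcloud beta lifesciences operations describe (\S+)")


    def find_operations(log_text):
        """Return operation names in the order they appear in the log."""
        return DESCRIBE.findall(log_text)


    example = """\
    Get status with:
    gcloud config set project snakemake-testing
    gcloud beta lifesciences operations describe projects/snakemake-testing/locations/us-west2/operations/16135317625786219242
    gcloud beta lifesciences operations list
    """
    print(find_operations(example))

To use it on a real run, read the log file named on the "Complete log:" line and pass its contents to `find_operations`.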
You also might notice a line about downloading results:
.. code:: console
Downloading from remote: snakemake-testing-data/plots/quals.svg
Since we defined this to be the target of our run
.. code:: console
rule all:
input:
"plots/quals.svg"
this file is downloaded to our host too. Actually, you'll notice
that paths in storage are mirrored on your filesystem (this is what the workers
do too):
.. code:: console
$ tree snakemake-testing-data/
snakemake-testing-data/
└── plots
└── quals.svg
We can see the result of our run, quals.svg, below:
.. image:: workflow/quals.svg
And if we look at the remote storage, we see that the result file (under plots) and intermediate
results (under sorted_reads and calls) are saved there too!
.. image:: workflow/results-google-storage.png
The source folder contains a cache folder with the archives of your working directory
that are extracted on the worker instances. You can safely delete this folder, or keep it if you want to reproduce
the run in the future.
Step 5: Debugging
:::::::::::::::::
Let's introduce an error (purposefully) into our Snakefile to practice debugging.
Let's remove the `conda` directive (which references environment.yaml) from the first rule, so we
expect that the job won't be able to find the executables for bwa and samtools.
In your Snakefile, change this:
.. code:: python

    rule bwa_map:
        input:
            fastq="samples/{sample}.fastq",
            idx=multiext("genome.fa", ".amb", ".ann", ".bwt", ".pac", ".sa")
        conda:
            "environment.yaml"
        output:
            "mapped_reads/{sample}.bam"
        params:
            idx=lambda w, input: os.path.splitext(input.idx[0])[0]
        shell:
            "bwa mem {params.idx} {input.fastq} | samtools view -Sb - > {output}"

to this:
.. code:: python

    rule bwa_map:
        input:
            fastq="samples/{sample}.fastq",
            idx=multiext("genome.fa", ".amb", ".ann", ".bwt", ".pac", ".sa")
        output:
            "mapped_reads/{sample}.bam"
        params:
            idx=lambda w, input: os.path.splitext(input.idx[0])[0]
        shell:
            "bwa mem {params.idx} {input.fastq} | samtools view -Sb - > {output}"

To make the same command run everything again, you would need to remove the
plots, mapped_reads, and calls folders. Instead, we can request this more easily
by adding the argument `--forceall`:
.. code:: console
snakemake --google-lifesciences --default-remote-prefix snakemake-testing-data --use-conda --google-lifesciences-region us-west1 --forceall
Everything will start out okay as it did before, and it will pause on the first
step when it's deploying the first container image. The last part of the
log will look something like this:
.. code:: console
[Fri Apr 17 22:01:38 2020]
rule bwa_map:
input: snakemake-testing-data/samples/B.fastq, snakemake-testing-data/genome.fa.amb, snakemake-testing-data/genome.fa.ann, snakemake-testing-data/genome.fa.bwt, snakemake-testing-data/genome.fa.pac, snakemake-testing-data/genome.fa.sa
output: snakemake-testing-data/mapped_reads/B.bam
jobid: 8
wildcards: sample=B
resources: mem_mb=15360, disk_mb=128000
Get status with:
gcloud config set project snakemake-testing
gcloud beta lifesciences operations describe projects/snakemake-testing/locations/us/operations/11698975339184312706
gcloud beta lifesciences operations list
Since we removed the conda environment that installs bwa and samtools,
we are definitely going to hit an error! That looks like this:
.. code:: console
[Fri Apr 17 22:03:08 2020]
Error in rule bwa_map:
jobid: 8
output: snakemake-testing-data/mapped_reads/B.bam
shell:
bwa mem snakemake-testing-data/genome.fa snakemake-testing-data/samples/B.fastq | samtools view -Sb - > snakemake-testing-data/mapped_reads/B.bam
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
jobid: 11698975339184312706
Shutting down, this might take some time.
Oh no! How do we debug it? The error above just indicates that "one of the commands
exited with non-zero exit code," and that isn't really enough to know what happened
and how to fix it. Debugging is actually quite simple: we can copy and paste the gcloud
command that describes our operation into the console. This will print an entire structure
that shows every step of the rule running, from pulling a container, to downloading
the working directory, to running the step.
.. code:: console
gcloud beta lifesciences operations describe projects/snakemake-testing/locations/us/operations/11698975339184312706
done: true
error:
code: 9
message: 'Execution failed: generic::failed_precondition: while running "snakejob-bwa_map-8":
unexpected exit status 1 was not ignored'
metadata:
'@type': type.googleapis.com/google.cloud.lifesciences.v2beta.Metadata
createTime: '2020-04-17T22:01:39.642966Z'
endTime: '2020-04-17T22:02:59.149914114Z'
events:
- description: Worker released
timestamp: '2020-04-17T22:02:59.149914114Z'
workerReleased:
instance: google-pipelines-worker-b1cdd36c743c3b477af8114d2511e37e
zone: us-west1-c
- description: 'Execution failed: generic::failed_precondition: while running "snakejob-bwa_map-8":
unexpected exit status 1 was not ignored'
failed:
cause: 'Execution failed: generic::failed_precondition: while running "snakejob-bwa_map-8":
unexpected exit status 1 was not ignored'
code: FAILED_PRECONDITION
timestamp: '2020-04-17T22:02:57.950752682Z'
- description: Unexpected exit status 1 while running "snakejob-bwa_map-8"
timestamp: '2020-04-17T22:02:57.842529458Z'
unexpectedExitStatus:
actionId: 1
exitStatus: 1
- containerStopped:
actionId: 1
exitStatus: 1
stderr: |
me.fa.bwt
Finished download.
/bin/bash: bwa: command not found
/bin/bash: samtools: command not found
[Fri Apr 17 22:02:57 2020]
Error in rule bwa_map:
jobid: 0
output: snakemake-testing-data/mapped_reads/B.bam
shell:
bwa mem snakemake-testing-data/genome.fa snakemake-testing-data/samples/B.fastq | samtools view -Sb - > snakemake-testing-data/mapped_reads/B.bam
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Removing output files of failed job bwa_map since they might be corrupted:
snakemake-testing-data/samples/B.fastq, snakemake-testing-data/genome.fa.amb, snakemake-testing-data/genome.fa.ann, snakemake-testing-data/genome.fa.bwt, snakemake-testing-data/genome.fa.pac, snakemake-testing-data/genome.fa.sa, snakemake-testing-data/mapped_reads/B.bam
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /workdir/.snakemake/log/2020-04-17T220254.129519.snakemake.log
description: |-
Stopped running "snakejob-bwa_map-8": exit status 1: me.fa.bwt
Finished download.
/bin/bash: bwa: command not found
/bin/bash: samtools: command not found
[Fri Apr 17 22:02:57 2020]
Error in rule bwa_map:
jobid: 0
output: snakemake-testing-data/mapped_reads/B.bam
shell:
bwa mem snakemake-testing-data/genome.fa snakemake-testing-data/samples/B.fastq | samtools view -Sb - > snakemake-testing-data/mapped_reads/B.bam
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Removing output files of failed job bwa_map since they might be corrupted:
snakemake-testing-data/samples/B.fastq, snakemake-testing-data/genome.fa.amb, snakemake-testing-data/genome.fa.ann, snakemake-testing-data/genome.fa.bwt, snakemake-testing-data/genome.fa.pac, snakemake-testing-data/genome.fa.sa, snakemake-testing-data/mapped_reads/B.bam
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /workdir/.snakemake/log/2020-04-17T220254.129519.snakemake.log
timestamp: '2020-04-17T22:02:57.842442588Z'
- containerStarted:
actionId: 1
description: Started running "snakejob-bwa_map-8"
timestamp: '2020-04-17T22:02:51.724433437Z'
- description: Stopped pulling "snakemake/snakemake:v5.10.0"
pullStopped:
imageUri: snakemake/snakemake:v5.10.0
timestamp: '2020-04-17T22:02:43.696978950Z'
- description: Started pulling "snakemake/snakemake:v5.10.0"
pullStarted:
imageUri: snakemake/snakemake:v5.10.0
timestamp: '2020-04-17T22:02:10.339950219Z'
- description: Worker "google-pipelines-worker-b1cdd36c743c3b477af8114d2511e37e"
assigned in "us-west1-c"
timestamp: '2020-04-17T22:01:43.232858222Z'
workerAssigned:
instance: google-pipelines-worker-b1cdd36c743c3b477af8114d2511e37e
machineType: n2-highmem-2
zone: us-west1-c
labels:
app: snakemake
name: snakejob-b346c449-9fd6-4f1e-8043-17c300cc9c0d-bwa_map-8
pipeline:
actions:
- commands:
- /bin/bash
- -c
- 'mkdir -p /workdir && cd /workdir && wget -O /download.py https://gist.githubusercontent.com/vsoch/84886ef6469bedeeb9a79a4eb7aec0d1/raw/181499f8f17163dcb2f89822079938cbfbd258cc/download.py
&& chmod +x /download.py && source activate snakemake || true && pip install
crc32c && python /download.py download snakemake-testing-data source/cache/snakeworkdir-5f4f325b9ddb188d5da8bfab49d915f023509c0b1986eb72cb4a2540d7991c12.tar.gz
/tmp/workdir.tar.gz && tar -xzvf /tmp/workdir.tar.gz && snakemake snakemake-testing-data/mapped_reads/B.bam
--snakefile Snakefile --force -j --keep-target-files --keep-remote --latency-wait
0 --attempt 1 --force-use-threads --allowed-rules bwa_map --nocolor --notemp
--no-hooks --nolock --use-conda --default-remote-provider GS --default-remote-prefix
snakemake-testing-data --default-resources "mem_mb=15360" "disk_mb=128000" '
containerName: snakejob-bwa_map-8
imageUri: snakemake/snakemake:v5.10.0
labels:
app: snakemake
name: snakejob-b346c449-9fd6-4f1e-8043-17c300cc9c0d-bwa_map-8
resources:
regions:
- us-west1
virtualMachine:
bootDiskSizeGb: 135
bootImage: projects/cos-cloud/global/images/family/cos-stable
labels:
app: snakemake
goog-pipelines-worker: 'true'
machineType: n2-highmem-2
serviceAccount:
email: default
scopes:
- https://www.googleapis.com/auth/cloud-platform
timeout: 604800s
startTime: '2020-04-17T22:01:43.232858222Z'
name: projects/411393320858/locations/us/operations/11698975339184312706
The log is hefty, so let's break it into pieces. First, it's
intended to be read from the bottom up if you want to see things in chronological order.
The very bottom line is the unique name of the operation, and this is what you used
(with the numeric project identifier after "projects" replaced by your project
name) to query for the log. Let's look at the next section, `pipeline`. This was
the specification built up by Snakemake and sent to the Google Life Sciences API
as a request:
.. code:: console
pipeline:
actions:
- commands:
- /bin/bash
- -c
- 'mkdir -p /workdir && cd /workdir && wget -O /download.py https://gist.githubusercontent.com/vsoch/84886ef6469bedeeb9a79a4eb7aec0d1/raw/181499f8f17163dcb2f89822079938cbfbd258cc/download.py
&& chmod +x /download.py && source activate snakemake || true && pip install
crc32c && python /download.py download snakemake-testing-data source/cache/snakeworkdir-5f4f325b9ddb188d5da8bfab49d915f023509c0b1986eb72cb4a2540d7991c12.tar.gz
/tmp/workdir.tar.gz && tar -xzvf /tmp/workdir.tar.gz && snakemake snakemake-testing-data/mapped_reads/B.bam
--snakefile Snakefile --force -j --keep-target-files --keep-remote --latency-wait
0 --attempt 1 --force-use-threads --allowed-rules bwa_map --nocolor --notemp
--no-hooks --nolock --use-conda --default-remote-provider GS --default-remote-prefix
snakemake-testing-data --default-resources "mem_mb=15360" "disk_mb=128000" '
containerName: snakejob-bwa_map-8
imageUri: snakemake/snakemake:v5.10.0
labels:
app: snakemake
name: snakejob-b346c449-9fd6-4f1e-8043-17c300cc9c0d-bwa_map-8
resources:
regions:
- us-west1
virtualMachine:
bootDiskSizeGb: 135
bootImage: projects/cos-cloud/global/images/family/cos-stable
labels:
app: snakemake
goog-pipelines-worker: 'true'
machineType: n2-highmem-2
serviceAccount:
email: default
scopes:
- https://www.googleapis.com/auth/cloud-platform
timeout: 604800s
startTime: '2020-04-17T22:01:43.232858222Z'
There is a lot of useful information here. Under *resources*:
- **virtualMachine** shows the **machineType** that should correspond to the instance type. You can specify a full name or prefix with `--machine-type-prefix` or "machine_type" defined under resources for a step. Since we didn't set any requirements, a reasonable default was chosen for us. This section also shows the size of the boot disk (in GB), and if you added hardware accelerators (GPU) they should show up here too.
- **regions** is the region that the instance was deployed in, which is important to know if you need to run from a particular region. This parameter defaults to regions in the US, and can be modified with the `--google-lifesciences-regions` parameter.
Under *actions* you'll find a few important fields:
- **imageUri** shows the version of Snakemake (or another container base) that was used. You can customize this with `--container-image`, and it will default to the latest snakemake.
- **commands** are the commands run to execute the container (also known as the entrypoint). For example, if you wanted to bring up your own instance manually and pull the container defined by `imageUri`, you could run these commands in the container (or shell inside and run them one at a time) to debug interactively. Notice that the commands end with a call to snakemake and show the arguments that are used. Make sure that these match your expectations.
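
The fields described above can also be pulled out programmatically. The following is a minimal sketch, assuming the operation description has already been loaded into a Python dictionary (for example from JSON output). The helper name `summarize_pipeline` and the trimmed `spec` dictionary are illustrative and are not part of Snakemake or the Google API.

```python
def summarize_pipeline(spec):
    """Collect the fields discussed above from a pipeline specification."""
    pipeline = spec["pipeline"]
    resources = pipeline["resources"]
    vm = resources["virtualMachine"]
    action = pipeline["actions"][0]
    return {
        "machine_type": vm["machineType"],
        "boot_disk_gb": vm["bootDiskSizeGb"],
        "regions": resources["regions"],
        "image": action["imageUri"],
        # The last command is the shell string that ends with the snakemake call.
        "shell_command": action["commands"][-1],
    }


# Hypothetical, trimmed copy of the specification shown above.
spec = {
    "pipeline": {
        "actions": [
            {
                "commands": [
                    "/bin/bash",
                    "-c",
                    "... && snakemake snakemake-testing-data/mapped_reads/B.bam --snakefile Snakefile",
                ],
                "imageUri": "snakemake/snakemake:v5.10.0",
            }
        ],
        "resources": {
            "regions": ["us-west1"],
            "virtualMachine": {"bootDiskSizeGb": 135, "machineType": "n2-highmem-2"},
        },
    }
}

summary = summarize_pipeline(spec)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Printing the summary next to the options you passed to snakemake is a quick way to confirm that the machine type, region, and container image match what you intended.
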
The next set of steps pertains to assigning the worker, pulling the container, and starting it.
It looks something like this, and it's fairly straightforward. You can again see
that earlier timestamps are at the bottom.
.. code:: console

    - containerStarted:
        actionId: 1
      description: Started running "snakejob-bwa_map-8"
      timestamp: '2020-04-17T22:02:51.724433437Z'
    - description: Stopped pulling "snakemake/snakemake:v5.10.0"
      pullStopped:
        imageUri: snakemake/snakemake:v5.10.0
      timestamp: '2020-04-17T22:02:43.696978950Z'
    - description: Started pulling "snakemake/snakemake:v5.10.0"
      pullStarted:
        imageUri: snakemake/snakemake:v5.10.0
      timestamp: '2020-04-17T22:02:10.339950219Z'
    - description: Worker "google-pipelines-worker-b1cdd36c743c3b477af8114d2511e37e"
        assigned in "us-west1-c"
      timestamp: '2020-04-17T22:01:43.232858222Z'
      workerAssigned:
        instance: google-pipelines-worker-b1cdd36c743c3b477af8114d2511e37e
        machineType: n2-highmem-2
        zone: us-west1-c

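
If you want to know how long each of these startup steps took, the timestamps can be compared directly. This is a minimal sketch using only the Python standard library. The `parse_timestamp` helper is an illustrative name, and it truncates the nanosecond timestamps to microseconds because `datetime` does not store finer precision.

```python
from datetime import datetime, timezone


def parse_timestamp(ts):
    """Parse a timestamp such as 2020-04-17T22:02:51.724433437Z, truncating to microseconds."""
    base, _, fraction = ts.rstrip("Z").partition(".")
    microseconds = int((fraction + "000000")[:6])
    parsed = datetime.strptime(base, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(microsecond=microseconds, tzinfo=timezone.utc)


# Timestamps copied from the events above, newest first.
events = [
    ("containerStarted", "2020-04-17T22:02:51.724433437Z"),
    ("pullStopped", "2020-04-17T22:02:43.696978950Z"),
    ("pullStarted", "2020-04-17T22:02:10.339950219Z"),
    ("workerAssigned", "2020-04-17T22:01:43.232858222Z"),
]

# Reverse into chronological order and report the time between consecutive events.
chronological = [(name, parse_timestamp(ts)) for name, ts in reversed(events)]
elapsed = {}
for (prev_name, prev_when), (name, when) in zip(chronological, chronological[1:]):
    elapsed[name] = (when - prev_when).total_seconds()
    print(f"{prev_name} -> {name}: {elapsed[name]:.1f}s")
```

For the events above, most of the startup time is spent between assigning the worker and finishing the container pull.
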
The next section, from when the container stopped, has the meat of the information
that we need to debug! This is the step where there was a non-zero exit code.
.. code:: console

    - containerStopped:
        actionId: 1
        exitStatus: 1
        stderr: |
          me.fa.bwt
          Finished download.
          /bin/bash: bwa: command not found
          /bin/bash: samtools: command not found
          [Fri Apr 17 22:02:57 2020]
          Error in rule bwa_map:
              jobid: 0
              output: snakemake-testing-data/mapped_reads/B.bam
              shell:
                  bwa mem snakemake-testing-data/genome.fa snakemake-testing-data/samples/B.fastq | samtools view -Sb - > snakemake-testing-data/mapped_reads/B.bam
                  (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
          Removing output files of failed job bwa_map since they might be corrupted:
          snakemake-testing-data/samples/B.fastq, snakemake-testing-data/genome.fa.amb, snakemake-testing-data/genome.fa.ann, snakemake-testing-data/genome.fa.bwt, snakemake-testing-data/genome.fa.pac, snakemake-testing-data/genome.fa.sa, snakemake-testing-data/mapped_reads/B.bam
          Shutting down, this might take some time.
          Exiting because a job execution failed. Look above for error message
          Complete log: /workdir/.snakemake/log/2020-04-17T220254.129519.snakemake.log
      description: |-
        Stopped running "snakejob-bwa_map-8": exit status 1: me.fa.bwt
        Finished download.
        /bin/bash: bwa: command not found
        /bin/bash: samtools: command not found
        [Fri Apr 17 22:02:57 2020]
        Error in rule bwa_map:
            jobid: 0
            output: snakemake-testing-data/mapped_reads/B.bam
            shell:
                bwa mem snakemake-testing-data/genome.fa snakemake-testing-data/samples/B.fastq | samtools view -Sb - > snakemake-testing-data/mapped_reads/B.bam
                (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
        Removing output files of failed job bwa_map since they might be corrupted:
        snakemake-testing-data/samples/B.fastq, snakemake-testing-data/genome.fa.amb, snakemake-testing-data/genome.fa.ann, snakemake-testing-data/genome.fa.bwt, snakemake-testing-data/genome.fa.pac, snakemake-testing-data/genome.fa.sa, snakemake-testing-data/mapped_reads/B.bam
        Shutting down, this might take some time.
        Exiting because a job execution failed. Look above for error message
        Complete log: /workdir/.snakemake/log/2020-04-17T220254.129519.snakemake.log
      timestamp: '2020-04-17T22:02:57.842442588Z'

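
When an operation has many events, it can help to filter for the ones that stopped with a non-zero exit status. The sketch below assumes the events have been loaded into a list of dictionaries with the same keys as the output above. The function name `failed_container_events` and the shortened `events` list are illustrative.

```python
def failed_container_events(events):
    """Return (actionId, exitStatus, stderr) for each container that stopped with a non-zero exit."""
    failures = []
    for event in events:
        stopped = event.get("containerStopped")
        # Treat a missing exitStatus as success; this is an assumption of the sketch.
        if stopped and stopped.get("exitStatus", 0) != 0:
            failures.append(
                (stopped["actionId"], stopped["exitStatus"], stopped.get("stderr", ""))
            )
    return failures


# Hypothetical, shortened event list modeled on the output above.
events = [
    {
        "containerStopped": {
            "actionId": 1,
            "exitStatus": 1,
            "stderr": "/bin/bash: bwa: command not found\n"
            "/bin/bash: samtools: command not found\n",
        },
        "timestamp": "2020-04-17T22:02:57.842442588Z",
    },
    {
        "containerStarted": {"actionId": 1},
        "timestamp": "2020-04-17T22:02:51.724433437Z",
    },
]

failures = failed_container_events(events)
for action_id, exit_status, stderr in failures:
    print(f"action {action_id} exited with status {exit_status}")
    print(stderr.strip())
```
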
Along with the error in `stderr`, the description key holds the same error. We see
what we would have seen had we run the bwa mem command on our own command line,
namely that the executables weren't found:
.. code:: console

    stderr: |
      me.fa.bwt
      Finished download.
      /bin/bash: bwa: command not found
      /bin/bash: samtools: command not found

But we shouldn't be surprised: we removed the environment file that installs
them on purpose! This is where you would read the error and respond by updating
your Snakefile with a fix.
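
For longer error output, the missing executables can be picked out automatically. This is a minimal sketch that matches the `/bin/bash: NAME: command not found` lines shown above. The pattern also tolerates an optional `line N:` prefix, which some bash versions print, and the function name `missing_commands` is illustrative.

```python
import re

# Matches "/bin/bash: bwa: command not found", optionally with a "line N:" prefix.
MISSING_COMMAND = re.compile(
    r"^/bin/bash: (?:line \d+: )?(?P<name>[^:\s]+): command not found$",
    re.MULTILINE,
)


def missing_commands(stderr):
    """Return the names of executables that bash could not find, in order of appearance."""
    names = []
    for match in MISSING_COMMAND.finditer(stderr):
        name = match.group("name")
        if name not in names:
            names.append(name)
    return names


stderr = """me.fa.bwt
Finished download.
/bin/bash: bwa: command not found
/bin/bash: samtools: command not found
"""

missing = missing_commands(stderr)
print(missing)
```

Each name in the result is a tool that the job expected but the container did not provide, which points back to the missing environment file.
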
Step 6: Adding a Log File
:::::::::::::::::::::::::
How might we do better at debugging in the future? The answer is to
add a log file for each step and redirect stderr to it, so that the error
is saved in the case of failure. For the same step above, we would update
the rule to look like this:

.. code:: python

    rule bwa_map:
        input:
            fastq="samples/{sample}.fastq",
            idx=multiext("genome.fa", ".amb", ".ann", ".bwt", ".pac", ".sa")
        output:
            "mapped_reads/{sample}.bam"
        params:
            # strip the extension of the first index file to get the index prefix
            idx=lambda w, input: os.path.splitext(input.idx[0])[0]
        log:
            "logs/bwa_map/{sample}.log"
        shell:
            # stderr of the whole pipe is redirected to the log file
            "(bwa mem {params.idx} {input.fastq} | samtools view -Sb - > {output}) 2> {log}"

In the above, we would write a log file to storage in a "subfolder" of the
snakemake prefix located at "logs/bwa_map". The log file will be named according
to the sample. You could also imagine a flattened structure with a path like
`logs/bwa_map-{sample}.log`. It's up to you how you want to organize your output.
This means that when you see the error appear in your terminal, you can quickly
look at this log file instead of resorting to using the gcloud tool. When debugging,
it's generally good to remember that:
- You should not make assumptions about anything's existence. Use print statements to verify.
- The biggest errors tend to be syntax and/or path errors.
- If you want to test a different snakemake container, you can use the `--container-image` flag.
- If the error is especially challenging, set up a small toy example that implements the most basic functionality that you want to achieve.
- If you need help, reach out to ask for it! If there is an issue with the Google Life Sciences workflow executor, please `open an issue <https://github.com/snakemake/snakemake/issues>`_.
- It also sometimes helps to take a break from working on something and come back with fresh eyes.
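
As an illustration of the first tip, the sketch below prints whether each expected path exists instead of assuming that it does. The helper name `report_paths` and the example paths are hypothetical, and the check only covers local paths (for example inside the worker's working directory), not objects in storage.

```python
import os


def report_paths(paths):
    """Print whether each local path exists and return the ones that are missing."""
    missing = []
    for path in paths:
        exists = os.path.exists(path)
        print(f"{path}: {'found' if exists else 'MISSING'}")
        if not exists:
            missing.append(path)
    return missing


# Hypothetical usage: the current directory exists, the second path should not.
expected = [os.getcwd(), os.path.join(os.getcwd(), "no-such-file.fastq")]
missing = report_paths(expected)
```

A helper like this could be called from a `run` block or a script step to confirm that inputs arrived where the rule expects them.
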
|