1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308
|
.. tutorial-additional_features:
Additional features
-------------------
.. _Snakemake: https://snakemake.readthedocs.io
.. _Snakemake homepage: https://snakemake.readthedocs.io
.. _GNU Make: https://www.gnu.org/software/make
.. _Python: https://www.python.org
.. _BWA: http://bio-bwa.sourceforge.net
.. _SAMtools: https://www.htslib.org
.. _BCFtools: https://www.htslib.org
.. _Pandas: https://pandas.pydata.org
.. _Miniconda: https://conda.pydata.org/miniconda.html
.. _Conda: https://conda.pydata.org
.. _Bash: https://www.tldp.org/LDP/Bash-Beginners-Guide/html
.. _Atom: https://atom.io
.. _Anaconda: https://anaconda.org
.. _Graphviz: https://www.graphviz.org
.. _RestructuredText: https://docutils.sourceforge.io/docs/user/rst/quickstart.html
.. _data URI: https://developer.mozilla.org/en-US/docs/Web/HTTP/data_URIs
.. _JSON: https://json.org
.. _YAML: https://yaml.org
.. _DRMAA: https://www.drmaa.org
.. _rpy2: https://rpy2.github.io
.. _R: https://www.r-project.org
.. _Rscript: https://stat.ethz.ch/R-manual/R-devel/library/utils/html/Rscript.html
.. _PyYAML: https://pyyaml.org
.. _Docutils: https://docutils.sourceforge.io
.. _Bioconda: https://bioconda.github.io
.. _Vagrant: https://www.vagrantup.com
.. _Vagrant Documentation: https://docs.vagrantup.com
.. _Blogpost: https://blog.osteel.me/posts/2015/01/25/how-to-use-vagrant-on-windows.html
.. _slides: https://slides.com/johanneskoester/deck-1
In the following, we introduce some features that are beyond the scope of above example workflow.
For details and even more features, see :ref:`user_manual-writing_snakefiles`, :ref:`project_info-faq` and the command line help (``snakemake --help``).
Benchmarking
::::::::::::
With the ``benchmark`` directive, Snakemake can be instructed to **measure the wall clock time of a job**.
We activate benchmarking for the rule ``bwa_map``:
.. code:: python
rule bwa_map:
input:
"data/genome.fa",
lambda wildcards: config["samples"][wildcards.sample]
output:
temp("mapped_reads/{sample}.bam")
params:
rg="@RG\tID:{sample}\tSM:{sample}"
log:
"logs/bwa_mem/{sample}.log"
benchmark:
"benchmarks/{sample}.bwa.benchmark.txt"
threads: 8
shell:
"(bwa mem -R '{params.rg}' -t {threads} {input} | "
"samtools view -Sb - > {output}) 2> {log}"
The ``benchmark`` directive takes a string that points to the file where benchmarking results shall be stored.
Similar to output files, the path can contain wildcards (it must be the same wildcards as in the output files).
When a job derived from the rule is executed, Snakemake will measure the wall clock time and memory usage (in MiB) and store it in the file in tab-delimited format.
It is possible to repeat a benchmark multiple times in order to get a sense for the variability of the measurements.
This can be done by annotating the benchmark file, e.g., with ``repeat("benchmarks/{sample}.bwa.benchmark.txt", 3)`` Snakemake can be told to run the job three times.
The repeated measurements occur as subsequent lines in the tab-delimited benchmark file.
Modularization
::::::::::::::
In order to re-use building blocks or simply to structure large workflows, it is sometimes reasonable to **split a workflow into modules**.
For this, Snakemake provides the ``include`` directive to include another Snakefile into the current one, e.g.:
.. code:: python
include: "path/to/other.snakefile"
Alternatively, Snakemake allows to **define sub-workflows**.
A sub-workflow refers to a working directory with a complete Snakemake workflow.
Output files of that sub-workflow can be used in the current Snakefile.
When executing, Snakemake ensures that the output files of the sub-workflow are up-to-date before executing the current workflow.
This mechanism is particularly useful when you want to extend a previous analysis without modifying it.
For details about sub-workflows, see the :ref:`documentation <snakefiles-sub_workflows>`.
Exercise
........
* Put the read mapping related rules into a separate Snakefile and use the ``include`` directive to make them available in our example workflow again.
.. _tutorial-conda:
Automatic deployment of software dependencies
:::::::::::::::::::::::::::::::::::::::::::::
In order to get a fully reproducible data analysis, it is not sufficient to
be able to execute each step and document all used parameters.
The used software tools and libraries have to be documented as well.
In this tutorial, you have already seen how Conda_ can be used to specify an
isolated software environment for a whole workflow. With Snakemake, you can
go one step further and specify Conda environments per rule.
This way, you can even make use of conflicting software versions (e.g. combine
Python 2 with Python 3).
In our example, instead of using an external environment we can specify
environments per rule, e.g.:
.. code:: python
rule samtools_index:
input:
"sorted_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam.bai"
conda:
"envs/samtools.yaml"
shell:
"samtools index {input}"
with ``envs/samtools.yaml`` defined as
.. code:: yaml
channels:
- bioconda
- conda-forge
dependencies:
- samtools =1.9
.. sidebar:: Note
The conda directive does not work in combination with ``run`` blocks, because
they have to share their Python environment with the surrounding snakefile.
When Snakemake is executed with
.. code:: console
snakemake --use-conda --cores 1
it will automatically create required environments and
activate them before a job is executed.
It is best practice to specify at least the `major and minor version <https://semver.org/>`_ of any packages
in the environment definition. Specifying environments per rule in this way has two
advantages.
First, the workflow definition also documents all used software versions.
Second, a workflow can be re-executed (without admin rights)
on a vanilla system, without installing any
prerequisites apart from Snakemake and Miniconda_.
Tool wrappers
:::::::::::::
In order to simplify the utilization of popular tools, Snakemake provides a
repository of so-called wrappers
(the `Snakemake wrapper repository <https://snakemake-wrappers.readthedocs.io>`_).
A wrapper is a short script that wraps (typically)
a command line application and makes it directly addressable from within Snakemake.
For this, Snakemake provides the ``wrapper`` directive that can be used instead of
``shell``, ``script``, or ``run``.
For example, the rule ``bwa_map`` could alternatively look like this:
.. code:: python
rule bwa_mem:
input:
ref="data/genome.fa",
sample=lambda wildcards: config["samples"][wildcards.sample]
output:
temp("mapped_reads/{sample}.bam")
log:
"logs/bwa_mem/{sample}.log"
params:
"-R '@RG\tID:{sample}\tSM:{sample}'"
threads: 8
wrapper:
"0.15.3/bio/bwa/mem"
.. sidebar:: Note
Updates to the Snakemake wrapper repository are automatically tested via
`continuous integration <https://en.wikipedia.org/wiki/Continuous_integration>`_.
The wrapper directive expects a (partial) URL that points to a wrapper in the repository.
These can be looked up in the corresponding `database <https://snakemake-wrappers.readthedocs.io>`_.
The first part of the URL is a Git version tag. Upon invocation, Snakemake
will automatically download the requested version of the wrapper.
Furthermore, in combination with ``--use-conda`` (see :ref:`tutorial-conda`),
the required software will be automatically deployed before execution.
Cluster execution
:::::::::::::::::
By default, Snakemake executes jobs on the local machine it is invoked on.
Alternatively, it can execute jobs in **distributed environments, e.g., compute clusters or batch systems**.
If the nodes share a common file system, Snakemake supports three alternative execution modes.
In cluster environments, compute jobs are usually submitted as shell scripts via commands like ``qsub``.
Snakemake provides a **generic mode** to execute on such clusters.
By invoking Snakemake with
.. code:: console
$ snakemake --cluster qsub --jobs 100
each job will be compiled into a shell script that is submitted with the given command (here ``qsub``).
The ``--jobs`` flag limits the number of concurrently submitted jobs to 100.
This basic mode assumes that the submission command returns immediately after submitting the job.
Some clusters allow to run the submission command in **synchronous mode**, such that it waits until the job has been executed.
In such cases, we can invoke e.g.
.. code:: console
$ snakemake --cluster-sync "qsub -sync yes" --jobs 100
The specified submission command can also be **decorated with additional parameters taken from the submitted job**.
For example, the number of used threads can be accessed in braces similarly to the formatting of shell commands, e.g.
.. code:: console
$ snakemake --cluster "qsub -pe threaded {threads}" --jobs 100
Alternatively, Snakemake can use the Distributed Resource Management Application API (DRMAA_).
This API provides a common interface to control various resource management systems.
The **DRMAA support** can be activated by invoking Snakemake as follows:
.. code:: console
$ snakemake --drmaa --jobs 100
If available, **DRMAA is preferable over the generic cluster modes** because it provides better control and error handling.
To support additional cluster specific parametrization, a Snakefile can be complemented by a :ref:`snakefiles-cluster_configuration` file.
Using --cluster-status
::::::::::::::::::::::
Sometimes you need specific detection to determine if a cluster job completed successfully, failed or is still running.
Error detection with ``--cluster`` can be improved for edge cases such as timeouts and jobs exceeding memory that are silently terminated by
the queueing system.
This can be achieved with the ``--cluster-status`` option. The value of this option should be a executable script which takes a job id as the first argument and prints to stdout only one of [running|success|failed]. Importantly, the job id snakemake passes on is captured from the stdout of the cluster submit tool. This string will often include more than the job id, but snakemake does not modify this string and will pass this string to the status script unchanged. In the situation where snakemake has received more than the job id these are 3 potential solutions to consider: parse the string received by the script and extract the job id within the script, wrap the submission tool to intercept its stdout and return just the job code, or ideally, the cluster may offer an option to only return the job id upon submission and you can instruct snakemake to use that option. For sge this would look like ``snakemake --cluster "qsub -terse"``.
The following (simplified) script detects the job status on a given SLURM cluster (>= 14.03.0rc1 is required for ``--parsable``).
.. code:: python
#!/usr/bin/env python
import subprocess
import sys
jobid = sys.argv[1]
output = str(subprocess.check_output("sacct -j %s --format State --noheader | head -1 | awk '{print $1}'" % jobid, shell=True).strip())
running_status=["PENDING", "CONFIGURING", "COMPLETING", "RUNNING", "SUSPENDED"]
if "COMPLETED" in output:
print("success")
elif any(r in output for r in running_status):
print("running")
else:
print("failed")
To use this script call snakemake similar to below, where ``status.py`` is the script above.
.. code:: console
$ snakemake all --jobs 100 --cluster "sbatch --cpus-per-task=1 --parsable" --cluster-status ./status.py
Using --cluster-cancel
::::::::::::::::::::::
When snakemake is terminated by pressing ``Ctrl-C``, it will cancel all currently running node when using ``--drmaa``.
You can get the same behaviour with ``--cluster`` by adding ``--cluster-cancel`` and passing a command to use for canceling jobs by their jobid (e.g., ``scancel`` for SLURM or ``qdel`` for SGE).
Most job schedulers can be passed multiple jobids and you can use ``--cluster-cancel-nargs`` to limit the number of arguments (default is 1000 which is reasonable for most schedulers).
Using --cluster-sidecar
:::::::::::::::::::::::
In certain situations, it is necessary to not perform calls to cluster commands directly and instead have a "sidecar" process, e.g., providing a REST API.
One example is when using SLURM where regular calls to ``scontrol show job JOBID`` or ``sacct -j JOBID`` puts a high load on the controller.
Rather, it is better to use the ``squeue`` command with the ``-i/--iterate`` option.
When using ``--cluster``, you can use ``--cluster-sidecar`` to pass in a command that starts a sidecar server.
The command should print one line to stdout and then block and accept connections.
The line will subsequently be available in the calls to ``--cluster``, ``--cluster-status``, and ``--cluster-cancel`` in the environment variable ``SNAKEMAKE_CLUSTER_SIDECAR_VARS``.
In the case of a REST server, you can use this to return the port that the server is listening on and credentials.
When the Snakemake process terminates, the sidecar process will be terminated as well.
Constraining wildcards
::::::::::::::::::::::
Snakemake uses regular expressions to match output files to input files and determine dependencies between the jobs.
Sometimes it is useful to constrain the values a wildcard can have.
This can be achieved by adding a regular expression that describes the set of allowed wildcard values.
For example, the wildcard ``sample`` in the output file ``"sorted_reads/{sample}.bam"`` can be constrained to only allow alphanumeric sample names as ``"sorted_reads/{sample,[A-Za-z0-9]+}.bam"``.
Constraints may be defined per rule or globally using the ``wildcard_constraints`` keyword, as demonstrated in :ref:`snakefiles-wildcards`.
This mechanism helps to solve two kinds of ambiguity.
* It can help to avoid ambiguous rules, i.e. two or more rules that can be applied to generate the same output file. Other ways of handling ambiguous rules are described in the Section :ref:`snakefiles-ambiguous-rules`.
* It can help to guide the regular expression based matching so that wildcards are assigned to the right parts of a file name. Consider the output file ``{sample}.{group}.txt`` and assume that the target file is ``A.1.normal.txt``. It is not clear whether ``dataset="A.1"`` and ``group="normal"`` or ``dataset="A"`` and ``group="1.normal"`` is the right assignment. Here, constraining the dataset wildcard by ``{sample,[A-Z]+}.{group}`` solves the problem.
When dealing with ambiguous rules, it is best practice to first try to solve the ambiguity by using a proper file structure, for example, by separating the output files of different steps in different directories.
|