<%@meta language="R-vignette" content="--------------------------------
%\VignetteIndexEntry{A Future for batchtools}
%\VignetteAuthor{Henrik Bengtsson}
%\VignetteKeyword{R}
%\VignetteKeyword{package}
%\VignetteKeyword{vignette}
%\VignetteKeyword{future}
%\VignetteKeyword{synchronous}
%\VignetteKeyword{asynchronous}
%\VignetteKeyword{parallel}
%\VignetteKeyword{cluster}
%\VignetteKeyword{HPC}
%\VignetteKeyword{batchtools}
%\VignetteEngine{R.rsp::rsp}
%\VignetteTangle{FALSE}
--------------------------------------------------------------------"%>
<%
options(mc.cores = 2L)
%>
# A Future for batchtools
## Introduction
The [future] package provides a generic API for using futures in R.
A future is a simple yet powerful mechanism to evaluate an R expression
and retrieve its value at some point in time. Futures can be resolved
in many different ways depending on which strategy is used.
There are various types of synchronous and asynchronous futures to
choose from in the [future] package.
This package, [future.batchtools], provides a type of futures that
utilizes the [batchtools] package. This means that _any_ type of
backend that the batchtools package supports can be used as a future.
More specifically, future.batchtools will allow you or users of your
package to leverage the compute power of high-performance computing
(HPC) clusters via a simple switch in settings - without having to
change any code at all.
For instance, if batchtools is properly configured, the two
expressions below for futures `x` and `y` will be processed on two different
compute nodes:
```r
> library("future.batchtools")
> plan(batchtools_torque)
>
> x %<-% { Sys.sleep(5); 3.14 }
> y %<-% { Sys.sleep(5); 2.71 }
> x + y
[1] 5.85
```
This is obviously a toy example to illustrate what futures look like
and how to work with them.
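The same toy computation can also be written with the explicit `future()` and `value()` functions of the [future] package instead of the `%<-%` operator. The sketch below assumes only that the future package is installed; it uses `plan(sequential)` and shorter sleeps so that it runs on any machine without a job scheduler. The variable names `fx`, `fy`, and `z` are chosen here for illustration.

```r
library("future")

## Sequential futures need no scheduler; with a configured cluster,
## plan(batchtools_torque) could be used here instead
plan(sequential)

## Create two futures explicitly ...
fx <- future({ Sys.sleep(0.1); 3.14 })
fy <- future({ Sys.sleep(0.1); 2.71 })

## ... and collect their values; value() blocks until a future is resolved
z <- value(fx) + value(fy)
print(z)
```

Only the `plan()` call decides where the two expressions are evaluated; the rest of the code stays the same.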
A more realistic example comes from the field of cancer research,
where very large FASTQ data files, which hold a large number of short
DNA sequence reads, are produced. The first step toward a biological
interpretation of these data is to align the reads in each sample
(one FASTQ file) to the human genome. In order to speed this up,
we can have each file be processed by a separate compute node, and on each
node we can use 24 parallel processes such that each process aligns a
separate chromosome. Here is an outline of how this nested parallelism
could be implemented using futures.
```r
library("future")
library("listenv")
## The first level of futures should be submitted to the
## cluster using batchtools. The second level of futures
## should be using multisession, where the number of
## parallel processes is automatically decided based on
## what the cluster grants to each compute node.
plan(list(batchtools_torque, multisession))
## Find all samples (one FASTQ file per sample)
fqs <- dir(pattern = "[.]fastq$")
## The aligned results are stored in BAM files
bams <- listenv()
## For all samples (FASTQ files) ...
for (ss in seq_along(fqs)) {
  fq <- fqs[ss]
  ## ... use futures to align them ...
  bams[[ss]] %<-% {
    bams_ss <- listenv()
    ## ... and for each FASTQ file use a second layer
    ## of futures to align the individual chromosomes
    for (cc in 1:24) {
      bams_ss[[cc]] %<-% htseq::align(fq, chr = cc)
    }
    ## Resolve the "chromosome" futures and return as a list
    as.list(bams_ss)
  }
}
## Resolve the "sample" futures and return as a list
bams <- as.list(bams)
```
Note that a user who does not have access to a cluster could use the same script, processing samples sequentially and chromosomes in parallel on a single machine, using:
```r
plan(list(sequential, multisession))
```
or samples in parallel and chromosomes sequentially using:
```r
plan(list(multisession, sequential))
```
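To try the nested structure without a cluster or an aligner, a scaled-down version of the outline can be run locally. This is only a sketch under stated assumptions: `align_chr()` is a hypothetical stand-in for `htseq::align()` that merely labels its input, the two file names are made up, and only three "chromosomes" are processed. Both levels use sequential futures so that the script runs anywhere the future and listenv packages are installed.

```r
library("future")
library("listenv")

## Hypothetical stand-in for a real aligner: returns a label only
align_chr <- function(fq, chr) sprintf("%s:chr%d", fq, chr)

## Two levels of futures, both resolved sequentially
plan(list(sequential, sequential))

## Made-up sample files; nothing is read from disk
fqs <- c("a.fastq", "b.fastq")

bams <- listenv()
for (ss in seq_along(fqs)) {
  fq <- fqs[ss]
  ## First level: one future per sample
  bams[[ss]] %<-% {
    bams_ss <- listenv()
    ## Second level: one future per "chromosome"
    for (cc in 1:3) {
      bams_ss[[cc]] %<-% align_chr(fq, chr = cc)
    }
    as.list(bams_ss)
  }
}

## Resolve the "sample" futures and return as a list
bams <- as.list(bams)
str(bams)
```

Replacing the `plan()` call with one of the nested plans shown above changes where each level is evaluated, while the loop itself is left untouched.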
For an introduction as well as full details on how to use futures,
please consult the package vignettes of the [future] package.
## Choosing batchtools backend
The future.batchtools package implements a generic future wrapper
for all batchtools backends. Below are the most common types of
batchtools backends.
| Backend | Description | Alternative in future package
|:-------------------------|:-------------------------------------------------------------------------|:------------------------------------
| `batchtools_torque` | Futures are evaluated via a [TORQUE] / PBS job scheduler | N/A
| `batchtools_slurm` | Futures are evaluated via a [Slurm] job scheduler | N/A
| `batchtools_sge` | Futures are evaluated via a [Sun/Oracle Grid Engine (SGE)] job scheduler | N/A
| `batchtools_lsf` | Futures are evaluated via a [Load Sharing Facility (LSF)] job scheduler | N/A
| `batchtools_openlava` | Futures are evaluated via an [OpenLava] job scheduler | N/A
<%-- | `batchtools_docker` | Futures are evaluated via a [Docker Swarm] cluster | N/A --%>
| `batchtools_custom` | Futures are evaluated via a custom batchtools configuration R script or via a set of cluster functions | N/A
| `batchtools_interactive` | sequential evaluation in the calling R environment | `plan(transparent)`
| `batchtools_multicore` | parallel evaluation by forking the current R process | `plan(multicore)`
| `batchtools_local` | sequential evaluation in a separate R process (on current machine) | `plan(cluster, workers = "localhost")`
### Examples
Below is an example illustrating how to use `batchtools_torque` to configure the batchtools backend. For further details and examples on how to configure batchtools, see the [batchtools configuration] wiki page.
To configure batchtools for job schedulers we need to set up a `*.tmpl` template
file that is used to generate the script used by the scheduler.
This is what a template file for TORQUE / PBS may look like:
```sh
#!/bin/bash
## Job name:
#PBS -N <%%= if (exists("job.name", mode = "character")) job.name else job.hash %%>
## Direct streams to logfile:
#PBS -o <%%= log.file %%>
## Merge standard error and output:
#PBS -j oe
## Email on abort (a) and termination (e), but not when starting (b)
#PBS -m ae
## Resources needed:
<%% if (length(resources) > 0) {
opts <- unlist(resources, use.names = TRUE)
opts <- sprintf("%s=%s", names(opts), opts)
opts <- paste(opts, collapse = ",") %%>
#PBS -l <%%= opts %%>
<%% } %%>
## Launch R and evaluate the batchtools R job
Rscript -e 'batchtools::doJobCollection("<%%= uri %%>")'
```
If this template is saved to file `batchtools.torque.tmpl` (without a leading
period) in the working directory or as `.batchtools.torque.tmpl` (with a leading
period) in the user's home directory, then it will be automatically located by the
batchtools framework and loaded when doing:
```r
> plan(batchtools_torque)
```
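Before switching plans, it can be handy to check whether a template file is present in either of the two locations just described. The helper below, `find_template_candidates()`, is a hypothetical convenience function written for this illustration; it is not part of batchtools or future.batchtools, and it only looks in the two places named above, whereas batchtools itself may consult additional locations.

```r
## Hypothetical helper (not part of batchtools): report which of the two
## template locations described above exist for a given scheduler
find_template_candidates <- function(scheduler,
                                     workdir = getwd(),
                                     home = path.expand("~")) {
  candidates <- c(
    ## No leading period in the working directory
    file.path(workdir, sprintf("batchtools.%s.tmpl", scheduler)),
    ## Leading period in the home directory
    file.path(home, sprintf(".batchtools.%s.tmpl", scheduler))
  )
  candidates[file.exists(candidates)]
}

## Returns a character vector of the files found (possibly empty)
find_template_candidates("torque")
```

An empty result suggests that the template should be passed explicitly or placed in one of the two locations.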
Resource parameters can be specified via the argument `resources`, which should be a named list and is passed as is to the template file. For example, to request that each job be allotted 12 cores (on a single machine) and up to 5 GiB of memory, use:
```r
> plan(batchtools_torque, resources = list(nodes = "1:ppn=12", vmem = "5gb"))
```
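To see what the scheduler receives for this `resources` list, the R statements from the "Resources needed" part of the TORQUE template above can be run on their own. The sketch below repeats those three statements in plain R, outside of any template; the final `cat()` call is added here only to display the resulting directive.

```r
## The same list as passed to plan() above
resources <- list(nodes = "1:ppn=12", vmem = "5gb")

## Same steps as in the template: flatten, format as name=value, join
opts <- unlist(resources, use.names = TRUE)
opts <- sprintf("%s=%s", names(opts), opts)
opts <- paste(opts, collapse = ",")

## The line that ends up in the generated job script
cat("#PBS -l", opts, "\n")
## #PBS -l nodes=1:ppn=12,vmem=5gb
```

Because the list is passed as is, the names and values must be ones that the scheduler's `-l` option understands.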
To specify the `resources` argument at the same time as using nested future strategies, one can use `tweak()` to tweak the default arguments. For instance,
```r
plan(list(
tweak(batchtools_torque, resources = list(nodes = "1:ppn=12", vmem = "5gb")),
multisession
))
```
causes the first level of futures to be submitted via the TORQUE job scheduler, requesting 12 cores and 5 GiB of memory per job. The second level of futures will be evaluated using multisession on the 12 cores given to each job by the scheduler.
A similar filename format is used for the other types of job schedulers supported. For instance, for Slurm the template file should be named `./batchtools.slurm.tmpl` or `~/.batchtools.slurm.tmpl` in order for
```r
> plan(batchtools_slurm)
```
to locate the file automatically. To specify this template file explicitly, use argument `template`, e.g.
```r
> plan(batchtools_slurm, template = "/path/to/batchtools.slurm.tmpl")
```
For further details and examples on how to configure batchtools per se, see the [batchtools configuration] wiki page.
## Demos
The [future] package provides a demo using futures for calculating a
set of Mandelbrot planes. The demo does not assume anything about
what type of futures are used.
_The user has full control of how futures are evaluated_.
For instance, to use local batchtools futures, run the demo as:
```r
library("future.batchtools")
plan(batchtools_local)
demo("mandelbrot", package = "future", ask = FALSE)
```
[batchtools]: https://cran.r-project.org/package=batchtools
[brew]: https://cran.r-project.org/package=brew
[future]: https://cran.r-project.org/package=future
[future.batchtools]: https://cran.r-project.org/package=future.batchtools
[batchtools configuration]: https://mllg.github.io/batchtools/articles/index.html
[TORQUE]: https://en.wikipedia.org/wiki/TORQUE
[Slurm]: https://en.wikipedia.org/wiki/Slurm_Workload_Manager
[Sun/Oracle Grid Engine (SGE)]: https://en.wikipedia.org/wiki/Oracle_Grid_Engine
[Load Sharing Facility (LSF)]: https://en.wikipedia.org/wiki/Platform_LSF
[OpenLava]: https://en.wikipedia.org/wiki/OpenLava
[Docker Swarm]: https://docs.docker.com/swarm/