%\VignetteIndexEntry{2. Introduction to BatchtoolsParam}
%\VignetteKeywords{parallel, Infrastructure}
%\VignettePackage{BiocParallel}
%\VignetteEngine{knitr::knitr}

\documentclass{article}

<<style, eval=TRUE, echo=FALSE, results="asis">>=
BiocStyle::latex()
@

<<setup, echo=FALSE>>=
suppressPackageStartupMessages({
    library(BiocParallel)
})
@

\newcommand{\BiocParallel}{\Biocpkg{BiocParallel}}

\title{Introduction to \emph{BatchtoolsParam}}
\author{
  Nitesh Turaga\footnote{\url{Nitesh.Turaga@RoswellPark.org}},
  Martin Morgan\footnote{\url{Martin.Morgan@RoswellPark.org}}
}
\date{Edited: March 22, 2018; Compiled: \today}

\begin{document}

\maketitle

\tableofcontents

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The \Rcode{BatchtoolsParam} class is an interface to the
\CRANpkg{batchtools} package from within \BiocParallel{}, for
computing on a high-performance cluster managed by a job scheduler
such as SGE, TORQUE, LSF, SLURM, or OpenLava.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Quick start}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

This example demonstrates the easiest way to launch jobs using
batchtools: 10 tasks, each approximating pi from 10e5 (one million)
random points. The first step is to create a \Rcode{BatchtoolsParam}
object. You then compute with \Rcode{bplapply}, which returns the
results as a list.

<<intro>>=
library(BiocParallel)

## Pi approximation
piApprox <- function(n) {
    nums <- matrix(runif(2 * n), ncol = 2)
    d <- sqrt(nums[, 1]^2 + nums[, 2]^2)
    4 * mean(d <= 1)
}

piApprox(1000)

## Apply piApprox to 10 inputs of 10e5 points each, then average
param <- BatchtoolsParam()
result <- bplapply(rep(10e5, 10), piApprox, BPPARAM=param)
mean(unlist(result))
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{\emph{BatchtoolsParam} interface}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The \Rcode{BatchtoolsParam} interface allows intuitive usage of your
high performance cluster with \BiocParallel{}.

The \Rcode{BatchtoolsParam} constructor accepts many arguments for
customizing jobs. These arguments are intended for clusters with
formal job schedulers.

\begin{itemize}

  \item{\Rcode{workers}} The number of workers used by the job.

  \item{\Rcode{cluster}}
    We currently support SGE, SLURM, LSF, TORQUE and OpenLava. The
    \Rcode{cluster} argument is supported only if the R environment
    knows how to find the job scheduler. Each cluster type uses a
    template to pass the job to the scheduler. If no template is
    given, the default template from the \CRANpkg{batchtools} package
    is used. The cluster can be accessed with
    \Rcode{bpbackend(param)}.

  \item{\Rcode{registryargs}}
    The \Rcode{registryargs} argument takes a list of arguments used
    to create a new job registry for your \Rcode{BatchtoolsParam}. The
    job registry is a data.table which stores all the required
    information to process your jobs. The arguments we support for
    \Rcode{registryargs} are:

    \begin{description}

      \item{\Rcode{file.dir}} Path where all files of the registry are
        saved. Note that some templates do not handle relative paths
        well. If nothing is given, a temporary directory will be used
        in your current working directory.

      \item{\Rcode{work.dir}} Working directory for R process for
        running jobs.

      \item{\Rcode{packages}} Packages that will be loaded on each node.

      \item{\Rcode{namespaces}} Namespaces that will be loaded on each
        node.

      \item{\Rcode{source}} Files that are sourced before executing a
        job.

      \item{\Rcode{load}} Files that are loaded before executing a job.

    \end{description}

<<>>=
registryargs <- batchtoolsRegistryargs(
    file.dir = "mytempreg",
    work.dir = getwd(),
    packages = character(0L),
    namespaces = character(0L),
    source = character(0L),
    load = character(0L)
)
param <- BatchtoolsParam(registryargs = registryargs)
param
@

  \item{\Rcode{resources}} A named list of key-value pairs to be
    substituted into the template file; see
    \Rcode{?batchtools::submitJobs}.

  \item{\Rcode{template}} The template argument is unique to the
    \Rcode{BatchtoolsParam} class. It is required by the job
    scheduler. It defines how the jobs are submitted to the job
    scheduler. If the template is not given and the cluster is chosen,
    a default template is selected from the batchtools package.

  \item{\Rcode{log}} A logical value, TRUE or FALSE. If it is set to
    TRUE, the logs stored in the registry are copied to the directory
    given by the \Rcode{logdir} argument.

  \item{\Rcode{logdir}} Path to the directory that receives the logs.
    It is given only if \Rcode{log=TRUE}.

  \item{\Rcode{resultdir}} Path to the directory in which results are
    saved. It is given when the job has files to be saved.

\end{itemize}
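
As a minimal sketch of how several of these arguments combine, the
chunk below builds a \Rcode{BatchtoolsParam} for SLURM with explicit
resources and log copying. The resource names (\Rcode{walltime},
\Rcode{memory}, \Rcode{ncpus}), their values, and the log directory
name are assumptions made for illustration; the names that actually
take effect are those used by the template you submit with. The
constructor is called only when the \Rcode{sbatch} command is found,
since the \Rcode{cluster} argument requires that R can locate the
scheduler.

<<param_arguments_sketch, eval=FALSE>>=
library(BiocParallel)

## Assumed resource names; they must match the names referenced in
## the template used for submission.
resources <- list(walltime = 3600, memory = 2048, ncpus = 1)

## Hypothetical log directory, created up front so that it exists.
logdir <- file.path(getwd(), "batchtools-logs")
dir.create(logdir, showWarnings = FALSE)

## Construct the param only where the SLURM 'sbatch' command is found.
if (nzchar(Sys.which("sbatch"))) {
    param <- BatchtoolsParam(
        workers = 5,
        cluster = "slurm",
        resources = resources,
        log = TRUE,
        logdir = logdir
    )
    param
}
@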

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Defining templates}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The job submission template controls how the job is processed by the
job scheduler on the cluster.  Obviously, the format of the template
will differ depending on the type of job scheduler.  Let's look at the
default SLURM template as an example:

<<>>=
fname <- batchtoolsTemplate("slurm")
cat(readLines(fname), sep="\n")
@

The \Rcode{<\%= \%>} blocks are automatically replaced by the values of
the elements in the \Rcode{resources} argument in the
\Rcode{BatchtoolsParam} constructor.  Failing to specify critical
parameters properly (e.g., wall time or memory limits too low) will
cause jobs to crash, usually rather cryptically.  We suggest setting
parameters explicitly to provide robustness to changes to system
defaults.  Note that the \Rcode{<\%= \%>} blocks themselves do not
usually need to be modified in the template.
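
The substitution can be illustrated in isolation. The sketch below
assumes the \CRANpkg{brew} package, which \CRANpkg{batchtools} uses
for template expansion, is installed. The two-line template and the
resource names \Rcode{walltime} and \Rcode{memory} are made up for
this illustration and are not the contents of any shipped template.

<<template_substitution_sketch, eval=FALSE>>=
library(brew)

## A made-up two-line template, not a shipped one.
tmpl <- paste(
    "#SBATCH --time=<%= resources$walltime %>",
    "#SBATCH --mem-per-cpu=<%= resources$memory %>",
    sep = "\n"
)

## The template is evaluated in an environment that holds 'resources'.
env <- new.env()
env$resources <- list(walltime = 60, memory = 1024)

## Expand the template and capture the text that brew writes.
out <- capture.output(brew(text = tmpl, envir = env))
cat(out, sep = "\n")
@

Each block is replaced by the value of the R expression it contains,
so a resource that is missing from the list leaves the corresponding
scheduler option without a value.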

The part of the template that is most likely to require explicit
customization is the last line containing the call to \Rcode{Rscript}.
A more customized call may be necessary if the R installation is not
standard, e.g., if multiple versions of R have been installed on a
cluster.  For example, one might use instead:

\begin{verbatim}
echo 'batchtools::doJobCollection("<%= uri %>")' |\
    ArbitraryRcommand --no-save --no-echo
\end{verbatim}

If such customization is necessary, we suggest making a local copy of
the template, modifying it as required, and then constructing a
\Rcode{BatchtoolsParam} object with the modified template using the
\Rcode{template} argument.  However, we find that the default
templates accessible with \Rcode{batchtoolsTemplate} are satisfactory
in most cases.
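
The copy-and-modify workflow might look like the following sketch. The
file name \Rcode{my-slurm.tmpl} is hypothetical, and the edit itself is
left to the user. The param is constructed only when the
\Rcode{sbatch} command is found, because the \Rcode{cluster} argument
requires that R can locate the scheduler.

<<custom_template_sketch, eval=FALSE>>=
library(BiocParallel)

## Copy the default SLURM template to a local, editable file.
## "my-slurm.tmpl" is a hypothetical file name.
default <- batchtoolsTemplate("slurm")
local <- file.path(getwd(), "my-slurm.tmpl")
copied <- file.copy(default, local, overwrite = TRUE)

## ... edit 'local' by hand, e.g., to change the call to Rscript ...

## Use the modified template where the SLURM 'sbatch' command is found.
if (nzchar(Sys.which("sbatch"))) {
    param <- BatchtoolsParam(
        workers = 5, cluster = "slurm", template = local
    )
    param
}
@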

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Use cases}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

As an example of a \Rcode{BatchtoolsParam} job being run on an SGE
cluster, we use the same \Rcode{piApprox} function as defined earlier.
The example runs the function on 5 workers and submits 100 jobs to the
SGE cluster.

Example of SGE with minimal code:

<<simple_sge_example, eval=FALSE>>=
library(BiocParallel)

## Pi approximation
piApprox <- function(n) {
    nums <- matrix(runif(2 * n), ncol = 2)
    d <- sqrt(nums[, 1]^2 + nums[, 2]^2)
    4 * mean(d <= 1)
}

template <- system.file(
    package = "BiocParallel",
    "unitTests", "test_script", "test-sge-template.tmpl"
)
param <- BatchtoolsParam(workers=5, cluster="sge", template=template)

## Run parallel job
result <- bplapply(rep(10e5, 100), piApprox, BPPARAM=param)
@

Example of SGE demonstrating some of the \Rcode{BatchtoolsParam} methods:

<<demo_sge, eval=FALSE>>=
library(BiocParallel)

## Pi approximation
piApprox <- function(n) {
    nums <- matrix(runif(2 * n), ncol = 2)
    d <- sqrt(nums[, 1]^2 + nums[, 2]^2)
    4 * mean(d <= 1)
}

template <- system.file(
    package = "BiocParallel",
    "unitTests", "test_script", "test-sge-template.tmpl"
)
param <- BatchtoolsParam(workers=5, cluster="sge", template=template)

## start param
bpstart(param)

## Display param
param

## To show the registered backend
bpbackend(param)

## Register the param
register(param)

## Check the registered param
registered()

## Run parallel job
result <- bplapply(rep(10e5, 100), piApprox)

bpstop(param)
@

\section{\Rcode{sessionInfo()}}

<<sessionInfo, results="asis">>=
toLatex(sessionInfo())
@

\end{document}