<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<style type="text/css">
.code_area {
font-family: monospace;
width: 98%;
margin: 1%;
border: solid 1px black;
}
.code_header {
text-decoration:underline;
background: #F4F4AE;
margin: 0%;
padding: 0.3%;
}
.code_snip {
background: #FFFFCC;
margin: 0%;
padding: 0.3%;
}
</style>
<title>Work Queue User's Manual</title>
</head>
<body>
<h1>Work Queue User's Manual</h1>
<b>Last Updated May 2012</b>
<p>
Work Queue is Copyright (C) 2009 The University of Notre Dame.
This software is distributed under the GNU General Public License.
See the file COPYING for details.
<p>
<h2>Overview</h2>
<p>
Work Queue is a framework for building master/worker applications.
In Work Queue, a Master process is a custom, application-specific program
that uses the Work Queue API to define and submit a large number
of small tasks. The tasks are executed by many Worker processes,
which can run on any available machine. A single Master may direct
hundreds to thousands of Workers, allowing users to easily construct
highly scalable programs.
<p>
Work Queue is a stable framework that has been used to create
highly scalable scientific applications in biometrics, bioinformatics,
economics, and other fields. It can also be used as an execution engine
for the <a href="http://www.cse.nd.edu/~ccl/software/makeflow">Makeflow</a> workflow engine.
<p>
<p>Work Queue is part of the <a
href="http://www.cse.nd.edu/~ccl/software">Cooperating Computing
Tools</a>. You can download the CCTools from <a
href="http://www.cse.nd.edu/~ccl/software/download">this web page</a>,
follow the <a href=install.html>installation instructions</a>, and you
are ready to go. From the same <a
href="http://www.cse.nd.edu/~ccl/software/manuals/api/html/work__queue_8h.html">website</a>,
or from within the CCTools, you can view documentation for the full set
of features of the Work Queue API.</p>
<h2>Building and Running Work Queue</h2>
Let's begin by running a simple but complete example of a master and a worker.
After trying it out, we will then show how to write a program from scratch.
<p>
We assume that you have already downloaded and installed the cctools in
the directory CCTOOLS. Next, download the example file for the language of your
choice:
<ul>
<li>C: <a href=work_queue_example.c>work_queue_example.c</a></li>
<li>Python: <a href=work_queue_example.py>work_queue_example.py</a></li>
<li>Perl: <a href=work_queue_example.pl>work_queue_example.pl</a></li>
</ul>
If you are using the C example, compile it like this:
<div class ="code_area"><pre class="code_snip">
gcc work_queue_example.c -o work_queue_example -I${CCTOOLS}/include/cctools -L${CCTOOLS}/lib -ldttools -lm
</pre></div>
If you are using the Python example, set PYTHONPATH to include the Python modules in cctools:
<div class ="code_area"><pre class="code_snip">
export PYTHONPATH=${PYTHONPATH}:${CCTOOLS}/lib/python2.6/site-packages
</pre></div>
If you are using the Perl example, set PERL5LIB to include the Perl modules in cctools:
<div class ="code_area"><pre class="code_snip">
export PERL5LIB=${PERL5LIB}:${CCTOOLS}/lib/perl5/site_perl
</pre></div>
This example program simply compresses a bunch of files in parallel. List the files to be
compressed on the command line. Each will be transmitted to a remote worker, compressed,
and then sent back to the master. (This isn't necessarily faster than doing it locally,
but it is easy to run.)
For example, to compress files <tt>a</tt>, <tt>b</tt>, and <tt>c</tt>, run this:
<div class ="code_area"><pre class="code_snip">
./work_queue_example a b c
</pre></div>
You will see this right away:
<div class ="code_area"><pre class="code_snip">
listening on port 9123...
submitted task: /usr/bin/gzip < a > a.gz
submitted task: /usr/bin/gzip < b > b.gz
submitted task: /usr/bin/gzip < c > c.gz
waiting for tasks to complete...
</pre></div>
The master is now waiting for workers to connect and begin requesting work.
(Without any workers, it will wait forever.) You can start one worker on the
same machine by opening a new shell and running:
<div class ="code_area"><pre class="code_snip">
work_queue_worker MACHINENAME 9123
</pre></div>
(Obviously, substitute the name of your machine for MACHINENAME.) If you have
access to other machines, you can <tt>ssh</tt> there and run workers as well.
In general, the more you start, the faster the work gets done.
If a worker fails, the Work Queue infrastructure will retry the work
elsewhere, so it is safe to submit many workers to an unreliable
system.
<p>
If you have access to a Condor pool, you can use this shortcut to submit
ten workers at once via Condor:
<div class ="code_area"><pre class="code_snip">
% condor_submit_workers MACHINENAME 9123 10
Submitting job(s)..........
Logging submit event(s)..........
10 job(s) submitted to cluster 298.
</pre></div>
Or, if you have access to an SGE cluster, do this:
<div class ="code_area"><pre class="code_snip">
% sge_submit_workers MACHINENAME 9123 10
Your job 153083 ("worker.sh") has been submitted
Your job 153084 ("worker.sh") has been submitted
Your job 153085 ("worker.sh") has been submitted
...
</pre></div>
<p>
When the master completes, any workers that were not shut down by the
master will still be running, so you can either run another master with
the same workers, or remove the workers with <tt>kill</tt>,
<tt>condor_rm</tt>, or <tt>qdel</tt> as appropriate.
If you forget to remove them, they will exit automatically after fifteen minutes.
(This can be adjusted with the <tt>-t</tt> option to <tt>work_queue_worker</tt>.)
<h2>Writing a Master Program</h2>
To write your own program using Work Queue, begin with <a
href=work_queue_example.c>C example</a> or <a
href=work_queue_example.py>Python example</a> or <a
href=work_queue_example.pl>Perl example</a>
as a starting point. Here is a basic outline for a Work Queue master:
<div class ="code_area"><pre class="code_snip">
q = work_queue_create(port);
for(all tasks) {
t = work_queue_task_create(command);
/* add to the task description */
work_queue_submit(q,t);
}
while(!work_queue_empty(q)) {
t = work_queue_wait(q);
work_queue_task_delete(t);
}
work_queue_delete(q);
</pre></div>
First create a queue that is listening on a particular TCP port:
<div class="code_area">
<h4 class="code_header">C/Perl</h4>
<pre class="code_snip">
q = work_queue_create(port);
</pre>
<h4 class="code_header">Python</h4>
<pre class="code_snip">
q = WorkQueue(port)
</pre>
</div>
The master then creates tasks to submit to the queue.
Each task consists of a command line to run, a statement of what input
data the command needs, and a statement of what output data it will produce.
Input data can be provided as a file or as a local memory buffer.
Output data can be captured as a file or as the standard output of the program.
You must also specify whether each piece of data, input or output, should be
cached at the worker site for later use. In the example, we specify a
command that takes a single input file, produces a single output file, and
requires both files to be cached:
<div class="code_area">
<h4 class="code_header">C/Perl</h4>
<pre class="code_snip">
t = work_queue_task_create(command);
work_queue_task_specify_file(t,infile,infile,WORK_QUEUE_INPUT,WORK_QUEUE_CACHE);
work_queue_task_specify_file(t,outfile,outfile,WORK_QUEUE_OUTPUT,WORK_QUEUE_CACHE);
</pre>
<h4 class="code_header">Python</h4>
<pre class="code_snip">
t = Task(command)
t.specify_file(infile,infile,WORK_QUEUE_INPUT,cache=True)
t.specify_file(outfile,outfile,WORK_QUEUE_OUTPUT,cache=True)
</pre>
</div>
If a file does not need to be cached at the execution site, mark it
accordingly to avoid wasting storage:
<div class="code_area">
<h4 class="code_header">C/Perl</h4>
<pre class="code_snip">
work_queue_task_specify_file(t,outfile,outfile,WORK_QUEUE_OUTPUT,WORK_QUEUE_NOCACHE);
</pre>
<h4 class="code_header">Python</h4>
<pre class="code_snip">
t.specify_file(outfile,outfile,WORK_QUEUE_OUTPUT,cache=False)
</pre>
</div>
You can also run a program that is not already installed at the remote
location by specifying it as an input file. Give the full local path to
the executable on the master, and a plain (relative) path for the remote
side. For example:
<div class="code_area">
<h4 class="code_header">C/Perl</h4>
<pre class="code_snip">
t = work_queue_task_create("./my_compress_program < a > a.gz");
work_queue_task_specify_file(t,"/usr/local/bin/my_compress_program","my_compress_program",WORK_QUEUE_INPUT,WORK_QUEUE_CACHE);
work_queue_task_specify_file(t,"a","a",WORK_QUEUE_INPUT,WORK_QUEUE_CACHE);
work_queue_task_specify_file(t,"a.gz","a.gz",WORK_QUEUE_OUTPUT,WORK_QUEUE_CACHE);
</pre>
<h4 class="code_header">Python</h4>
<pre class="code_snip">
t = Task("./my_compress_program < a > a.gz")
t.specify_file("/usr/local/bin/my_compress_program","my_compress_program",WORK_QUEUE_INPUT,cache=True)
t.specify_file("a","a",WORK_QUEUE_INPUT,cache=True)
t.specify_file("a.gz","a.gz",WORK_QUEUE_OUTPUT,cache=True)
</pre>
</div>
Once a task has been fully specified, it can be submitted to the queue where it
gets assigned a unique taskid:
<div class="code_area">
<h4 class="code_header">C/Perl</h4>
<pre class="code_snip">
taskid = work_queue_submit(q,t);
</pre>
<h4 class="code_header">Python</h4>
<pre class="code_snip">
taskid = q.submit(t)
</pre>
</div>
Next, wait for a task to complete, stating how long you are willing
to wait for a result, in seconds. (If no tasks have completed by the timeout,
<tt>work_queue_wait</tt> will return null.)
<div class="code_area">
<h4 class="code_header">C/Perl</h4>
<pre class="code_snip">
t = work_queue_wait(q,5);
</pre>
<h4 class="code_header">Python</h4>
<pre class="code_snip">
t = q.wait(5)
</pre>
</div>
A completed task will have its output files written to disk.
You may examine the standard output of the task in <tt>t->output</tt>
and the exit code in <tt>t->exit_status</tt>. When you are done
with the task, delete it:
<div class="code_area">
<h4 class="code_header">C/Perl</h4>
<pre class="code_snip">
work_queue_task_delete(t);
</pre>
<h4 class="code_header">Python</h4>
<pre class="code_snip">
Deleted automatically when task object goes out of scope
</pre>
</div>
Continue submitting and waiting for tasks until all work is complete.
You can check whether the queue is empty with <tt>work_queue_empty</tt>.
When all tasks are done, delete the queue:
<div class="code_area">
<h4 class="code_header">C/Perl</h4>
<pre class="code_snip">
work_queue_delete(q);
</pre>
<h4 class="code_header">Python</h4>
<pre class="code_snip">
Deleted automatically when work_queue object goes out of scope
</pre>
</div>
Full details of all of the Work Queue functions can be found in
the <a href="http://www.cse.nd.edu/~ccl/software/manuals/api/html/work__queue_8h.html">Work Queue API</a>.
<h2>Advanced Usage</h2>
The technique described above is suitable for distributed programs of
tens to hundreds of workers. As you scale your program up to larger sizes,
you may find the following features helpful. All are described in the
<a href="http://www.cse.nd.edu/~ccl/software/manuals/api/html/work__queue_8h.html">Work Queue API</a>.
<ul>
<li><b>Pipelined Submission.</b> If you have a <b>very</b> large number of tasks to run,
it may not be possible to submit all of the tasks up front and then wait
for all of them. Instead, submit a small number of tasks, then alternate
waiting and submitting to keep a steady number in the queue.
<tt>work_queue_hungry</tt> will tell you when more submissions are warranted.
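The submit/wait alternation can be sketched as follows. This is a minimal
Python illustration with a toy stand-in for the queue, so that only the
flow-control pattern is shown; a real master would call the corresponding
<tt>hungry</tt>, <tt>submit</tt>, and <tt>wait</tt> operations on an actual
Work Queue object rather than on the <tt>ToyQueue</tt> defined here:

```python
import collections

class ToyQueue:
    """Stand-in for a Work Queue: holds a bounded number of tasks in
    flight and 'completes' them in FIFO order when wait() is called."""
    CAPACITY = 10

    def __init__(self):
        self.in_flight = collections.deque()

    def hungry(self):
        # In real code, the queue reports whether it wants more tasks.
        return len(self.in_flight) < self.CAPACITY

    def submit(self, task):
        self.in_flight.append(task)

    def wait(self):
        return self.in_flight.popleft() if self.in_flight else None

    def empty(self):
        return not self.in_flight

pending = list(range(100))      # 100 tasks still to be submitted
done = []
q = ToyQueue()

while pending or not q.empty():
    # Top up the queue whenever it is hungry for more work...
    while pending and q.hungry():
        q.submit(pending.pop())
    # ...then retrieve one completed task before submitting again.
    t = q.wait()
    if t is not None:
        done.append(t)
```

This keeps at most <tt>CAPACITY</tt> tasks in the queue at any moment, so
the master's bookkeeping stays bounded no matter how many tasks remain.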
<p>
<li><b>Fast Abort.</b> A large computation can often be slowed down by stragglers. If you have
a large number of short-running tasks, Fast Abort can help.
The Fast Abort feature keeps statistics on task execution times and proactively aborts tasks
that are statistical outliers. See <tt>work_queue_activate_fast_abort</tt>.
<p>
<li><b>Immediate Data.</b> For a large number of tasks or workers, it may be impractical
to create local input files for each one. If the master already has the necessary input
data in memory, it can pass the data directly to the remote task with
<tt>work_queue_task_specify_buffer</tt>.
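As a sketch of this pattern, the stand-in task class below only records
what would be attached, so the example runs standalone; in a real master
the call is <tt>work_queue_task_specify_buffer</tt> in C, or the
corresponding <tt>specify_buffer</tt> method on a Task in Python, and the
exact argument order should be checked against the API documentation:

```python
class StubTask:
    """Stand-in for a Work Queue task: records the input buffers that
    would be shipped to the worker under given remote file names."""
    def __init__(self, command):
        self.command = command
        self.buffers = {}           # remote name -> in-memory data

    def specify_buffer(self, data, remote_name):
        self.buffers[remote_name] = data

# One task per in-memory record: no temporary input files are written
# on the master's disk.
records = ["alpha", "beta", "gamma"]
tasks = []
for rec in records:
    t = StubTask("wc -c < input.txt")
    t.specify_buffer(rec, "input.txt")
    tasks.append(t)
```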
<p>
<li><b>String Interpolation.</b> If you have workers distributed across
multiple operating systems (such as Linux, Cygwin, Solaris) and/or architectures (such
as i686, x86_64) and have files specific to each of these systems, this feature
will help. The strings $OS and $ARCH are available for use in the specification of input
file names. Work Queue will automatically resolve these strings to the operating system
and architecture of each connected worker and transfer the input file corresponding
to the resolved file name. For example:
<div class="code_area">
<h4 class="code_header">C/Perl</h4>
<pre class="code_snip">
work_queue_task_specify_file(t,"a.$OS.$ARCH","a",WORK_QUEUE_INPUT,WORK_QUEUE_CACHE);
</pre>
<h4 class="code_header">Python</h4>
<pre class="code_snip">
t.specify_file("a.$OS.$ARCH","a",WORK_QUEUE_INPUT,cache=True)
</pre>
</div>
This will transfer <tt>a.Linux.x86_64</tt> to workers running on a Linux system with an
x86_64 architecture and <tt>a.Cygwin.i686</tt> to workers on Cygwin with an i686
architecture.
<p>
<li><b>Cancel Task.</b> This feature is useful in workflows where there are redundant tasks
or tasks that become obsolete as other tasks finish. Tasks that have been submitted can be
cancelled and immediately retrieved without waiting for Work Queue to return them in
<tt>work_queue_wait</tt>. The tasks to cancel can be identified by either their
<tt>taskid</tt> or <tt>tag</tt>. For example:
<div class="code_area">
<h4 class="code_header">C/Perl</h4>
<pre class="code_snip">
t = work_queue_cancel_by_tasktag(q,"task3");
</pre>
<h4 class="code_header">Python</h4>
<pre class="code_snip">
t = q.cancel_by_tasktag("task3")
</pre>
</div>
This cancels the task whose <tt>tag</tt> is 'task3'. Note that if several tasks share
the same tag, <tt>work_queue_cancel_by_tasktag</tt> will cancel and retrieve only one of the
matching tasks.
<p>
<li><b>Statistics.</b> The queue tracks a fair number of statistics that count the number
of tasks, number of workers, number of failures, and so forth. Obtain this data with <tt>work_queue_get_stats</tt>
in order to make a progress bar or other user-visible information.
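For instance, a progress line can be formatted from those counts. The sketch
below assumes that counts of waiting, running, and complete tasks are
available from the statistics structure (the exact field names should be
checked against the API documentation); plain integers are used here so the
formatting logic runs standalone:

```python
def progress_line(tasks_waiting, tasks_running, tasks_complete):
    """Format a one-line progress summary from queue statistics."""
    total = tasks_waiting + tasks_running + tasks_complete
    pct = 100.0 * tasks_complete / total if total else 0.0
    return "%d/%d tasks done (%.0f%%), %d running" % (
        tasks_complete, total, pct, tasks_running)

# With hypothetical counts as they might be read from work_queue_get_stats:
print(progress_line(tasks_waiting=5, tasks_running=3, tasks_complete=12))
# prints: 12/20 tasks done (60%), 3 running
```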
</ul>
<h2>For More Information</h2>
For the latest information about Work Queue, please visit our <a
href="http://www.cse.nd.edu/~ccl/software/workqueue">web site</a> and
subscribe to our <a href="http://www.cse.nd.edu/~ccl/software/help.shtml">mailing
list</a>.
</body>
</html>