File: README.bw

The functions with prefix "bw_" (for "beowulf") have a specialized
purpose and a high-level user interface. They are made for clusters
with machines that may sometimes be unavailable, or become unavailable
during a job, typically because they also have Windows installed and
users sometimes restart them to use Windows temporarily. Temporary
unavailability of the central machine during the job is also allowed
for.

Prerequisites:

-- One central machine with a Unix-like OS which is running most of
   the time.

-- Some other machines which at least sometimes run a Unix-like OS and
   an SSH server.

-- Authentication to the central machine gives passwordless access to
   the other machines. To start the job while logged into a machine
   different from the central machine, there should be passwordless
   access to the central machine from there. Otherwise you might be
   prompted for a password and it might work, but this is not tested.

-- All Octave-related software used by the jobs is available on each
   machine. To start the job while logged into a machine different
   from the central machine, the used startup files and the file with
   arguments (see below) must be available there. Using a network
   filesystem for the home directories is recommended.

The user has to supply a function which will be called with different
sets of arguments. The supplied function is of the form

function result = f (args[, args_id])

i.e. it accepts an argument "args", which might be a structure or
cell-array to accommodate a set of arguments, and optionally "args_id",
which is the index of "args" within all its possible values (given in
a cell-array, see below). "result" may of course also be a structure
or cell-array to accommodate more than one value.
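
As an illustration of the form described above, here is a minimal
sketch of a user function. The name "my_function" and the fields "a",
"b", "sum" and "product" are hypothetical and chosen only for this
example; a real job would do its own computation.

```octave
1;  # marks this file as a script, so the function can be defined inline

# Hypothetical user function: takes one set of arguments (a structure)
# and, optionally, its index within the cell-array of all sets.
function result = my_function (args, args_id)
  if (nargin < 2)
    args_id = [];             # called without an index
  endif
  result.sum = args.a + args.b;
  result.product = args.a * args.b;
  result.args_id = args_id;   # keep the index for later bookkeeping
endfunction

# Local check with a single set of arguments.
r = my_function (struct ("a", 2, "b", 3), 1);
printf ("sum = %d, product = %d\n", r.sum, r.product);
```

Returning a structure keeps several values together in the single
result variable that is passed back for each set of arguments.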

For each set of arguments, execution of the function will be scheduled
to one of the currently available machines.  The user supplies a
one-dimensional cell-array with one set of arguments (i.e. the value
of "args") in each entry. The cell-array must be stored in a file
under the data directory (see below) and remain there until
computation is finished (in case the scheduler needs restarting).
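
The preparation of the argument file might look like the following
sketch. The file name "my_args.dat", the variable name "args", the
field names and the binary file format are assumptions made for this
example; the help text of bw_start should be checked for what is
actually expected.

```octave
# Build a one-dimensional cell-array with one set of arguments per entry.
# The fields "a" and "b" are hypothetical.
args = cell (1, 3);
for k = 1:3
  args{k} = struct ("a", k, "b", 10 * k);
endfor

# Default data directory named in this README; adjust if "data_dir"
# is set differently in the startup files.
data_dir = tilde_expand ("~/bw-data");
if (! exist (data_dir, "dir"))
  mkdir (data_dir);
endif

# Assumption: the file holds the cell-array as a single saved variable.
argfile = fullfile (data_dir, "my_args.dat");
save ("-binary", argfile, "args");
```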

The current state is kept in a variable "state" saved to a file whose
name is sprintf("%s-%s.state", functionname, argumentsfilename) within
a state directory.
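
For example, the state file name can be worked out as follows. The
function and file names are hypothetical, and the state directory is
the documented default.

```octave
functionname = "my_function";        # hypothetical user function
argumentsfilename = "my_args.dat";   # hypothetical argument file
state_dir = "~/.bw-state";           # default state directory

statefile = fullfile (state_dir, ...
                      sprintf ("%s-%s.state", functionname, argumentsfilename));
disp (statefile)   # ~/.bw-state/my_function-my_args.dat.state
```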

Some of the functions read the startup files fullfile(OCTAVE_HOME (),
"share/octave/site/m/startup/bwrc") and then "~/.bwrc", if it
exists. In these files, the following configuration variables can
be set:

-- "computing_machines": cell-array of addresses (strings),

-- "central_machine": single address (string),

-- "data_dir" (optional): data directory for argument files (default:
   "~/bw-data"),

-- "state_dir" (optional): state directory (default: "~/.bw-state"),

-- "min_save_interv" (optional): minimal time interval in seconds
   between saves of the state (default: 10). The state contains, among
   other things, all results of the user function computed so far. If
   saving these takes a long time (you can test this by saving some
   results with Octave's "save" function), set min_save_interv to a
   higher value.

-- "connect_timeout" (optional): timeout for connection attempts in
   seconds (default: 30). The scheduler waits at least this long
   before contacting the same machine again, even if the connection
   was refused before the timeout expired.
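
A "~/.bwrc" using the variables listed above might look like the
following sketch. It assumes the startup files are read as ordinary
Octave code; all host names are hypothetical, and the optional values
are only examples.

```octave
# ~/.bwrc: example configuration (host names are hypothetical)
computing_machines = {"node1.example.org", "node2.example.org", ...
                      "node3.example.org"};
central_machine = "head.example.org";

# Optional settings; the first two repeat the documented defaults.
data_dir = "~/bw-data";
state_dir = "~/.bw-state";
min_save_interv = 60;   # save the state at most once per minute
connect_timeout = 30;   # documented default, in seconds
```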



To start a job:

Prepare a user function for your job with the above properties,
prepare a cell-array of argument sets for the function, and save it in
the data directory. On any of the machines, run from Octave:

bw_start ("my_function", "argument_filename");

This starts the scheduler on the central machine in the background
(with nohup) and returns. You can then log out. If the job had been
running before, e.g. if the scheduler had been killed for some reason,
it is restarted.
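
Before submitting a job, it can be useful to run the user function
serially over all argument sets on one machine, to catch errors early.
The sketch below is not part of the bw_ interface; the stand-in
function "f" and the fields "a", "b", "sum" and "id" are hypothetical.

```octave
# Stand-in for the real user function: result = f (args, args_id)
f = @(args, args_id) struct ("sum", args.a + args.b, "id", args_id);

# Hypothetical argument sets, one per cell entry.
args = cell (1, 3);
for k = 1:3
  args{k} = struct ("a", k, "b", 10 * k);
endfor

# Serial dry run: call the function once per set, passing its index.
results = cell (size (args));
for id = 1:numel (args)
  results{id} = f (args{id}, id);
endfor

printf ("computed %d results locally\n", numel (results));
```

If this loop runs cleanly, the same function and argument file can be
handed to bw_start as described above.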


To inspect jobs:

bw_list ();


To retrieve results:

bw_retrieve (<arguments documented within the function>)


To restart all pending jobs:

bw_start () # without arguments

This may be necessary if the scheduler has been killed, the central
machine has been restarted, or the Kerberos tickets have expired.


To stop a job and/or remove the statefile:

bw_clear (<arguments documented within the function>)




Technical notes:

The scheduler forks a child process for each configured computing
machine; each child opens a persistent ssh connection to a persistent
Octave process running remotely. Different sets of arguments (each a
single variable) are sent over the connection and the respective
results (each a single variable) are sent back. If a connection
becomes unavailable, the child process tries to restart it. The
configured computing machines are continuously scanned for available
machines.

Advisory locking is used to avoid starting more than one scheduler for
a single combination of user_function/argument_file.



Olaf Till <olaf.till@uni-jena.de>, 2009-03-29