The functions with prefix "bw_" (for "beowulf") have a specialized
purpose and a high-level user interface. They are made for clusters
with machines that may sometimes be unavailable, or become unavailable
during a job, typically because they also have Windows installed and
users sometimes restart them to use Windows temporarily. Temporary
unavailability of the central machine during the job is also allowed
for.
Prerequisites:
-- One central machine with a Unix-like OS which is running most of
the time.
-- Some other machines which at least sometimes run a Unix-like OS and
an SSH server.
-- Authentication to the central machine gives passwordless access to
the other machines. To start the job while logged into a machine
different from the central machine, there should be passwordless
access to the central machine from there. Otherwise you might be
prompted for a password and it might work, but this is not tested.
-- All Octave-related software used by the jobs is available on each
machine. To start the job while logged into a machine different
from the central machine, the startup files used and the file with
arguments (see below) must be available there. Using a network
filesystem for the home directories is recommended.
The user has to supply a function which will be called with different
sets of arguments. The supplied function is of the form
function result = f (args[, args_id])
i.e. it accepts an argument "args", which may be a structure or
cell-array to accommodate a set of arguments, and optionally "args_id",
which is the index of "args" within all its possible values (given in
a cell-array, see below). "result" may of course be a structure or
cell-array too, to accommodate more than one value.
For each set of arguments, execution of the function will be scheduled
to one of the currently available machines. The user supplies a
one-dimensional cell-array with one set of arguments (i.e. the value
of "args") in each entry. The cell-array must be stored in a file
under the data directory (see below) and remain there until
computation is finished (in case the scheduler needs to be restarted).
The current state is kept in a variable "state" saved to a file whose
name is sprintf("%s-%s.state", functionname, argumentsfilename) within
a state directory.
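The preparation of the argument file might look like the following
sketch. The file name "my_args.dat", the variable name "args", the
binary file format, and the use of the default data directory are
assumptions; what bw_start actually expects is documented within that
function:

```octave
# Build a one-dimensional cell-array, one set of arguments per entry.
args = cell (1, 3);
for id = 1:3
  args{id} = struct ("a", id, "b", 10 * id);
endfor

# Assumption: the default data directory "~/bw-data" is in use.
data_dir = tilde_expand ("~/bw-data");
if (! exist (data_dir, "dir"))
  mkdir (data_dir);
endif

# Assumption: file name, variable name and format are illustrative.
save ("-binary", fullfile (data_dir, "my_args.dat"), "args");

# Name of the state file, following the pattern given above.
state_name = sprintf ("%s-%s.state", "my_function", "my_args.dat");
```

With these names, the state would be kept in a file called
"my_function-my_args.dat.state" within the state directory.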
Some of the functions read the startup files fullfile(OCTAVE_HOME (),
"share/octave/site/m/startup/bwrc") and then "~/.bwrc", if it
exists. In these files, the following configuration variables can
be set:
-- "computing_machines": cell-array of addresses (strings),
-- "central_machine": single address (string),
-- "data_dir" (optional): data directory for argument files (default:
"~/bw-data"),
-- "state_dir" (optional): state directory (default: "~/.bw-state"),
-- "min_save_interv" (optional): minimal time interval in seconds for
saving the state (default: 10). The state contains, among other
things, all results of the user function computed so far. If saving
these takes a long time (you can test this by saving some
results with Octave's "save" function), you can set min_save_interv
to a higher value.
-- "connect_timeout" (optional): timeout for connection attempts in
seconds (default: 30). The scheduler waits at least this long before
contacting the same machine again, even if the connection was refused
before the timeout expired.
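As an illustration, a "~/.bwrc" could look like the sketch below. It
assumes that the startup files are plain Octave scripts in which the
variables are set by assignment; all host names are hypothetical:

```octave
# Hypothetical ~/.bwrc; host names are placeholders.
computing_machines = {"node1.example.org", ...
                      "node2.example.org", ...
                      "node3.example.org"};
central_machine = "central.example.org";

# Optional settings, shown here with the documented defaults ...
data_dir = "~/bw-data";
state_dir = "~/.bw-state";
connect_timeout = 30;

# ... except this one, raised because saving large results is slow.
min_save_interv = 60;
```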
To start a job:
Prepare a user function for your job with the above properties, prepare
a cell-array of argument sets for the function, and save it in the
data directory. On any of the machines, run from Octave:
bw_start ("my_function", "argument_filename");
This starts the scheduler on the central machine in the background
(with nohup) and returns. You can then log out. If the job had been
running before, e.g. if the scheduler had been killed for some reason,
it is restarted.
To inspect jobs:
bw_list ();
To retrieve results:
bw_retrieve (<arguments documented within the function>)
To restart all pending jobs:
bw_start () # without arguments
This may be necessary if the scheduler has been killed, the central
machine has been restarted, or the Kerberos tickets have expired.
To stop a job and/or remove the state file:
bw_clear (<arguments documented within the function>)
Technical notes:
The scheduler forks one child process for each configured computing
machine; each child opens a permanent ssh connection with a permanent
Octave process running remotely. Each set of arguments is sent over
the connection as a single variable, and the respective result is sent
back as a single variable. If a connection becomes unavailable, the
child process tries to restart it. The configured computing machines
are continuously scanned for available machines.
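The effect of scheduling onto whichever machines are currently
available can be pictured with the toy model below. It is not the
package's scheduler: it runs in a single process without ssh, and the
availability pattern, the rounds, and the squaring stand-in for the
user function are all assumptions for illustration:

```octave
# Toy model only; nothing here is taken from the bw_ implementation.
n_sets = 5;                        # number of argument sets
results = cell (1, n_sets);        # one result slot per set
done = false (1, n_sets);
assigned_to = zeros (1, n_sets);   # machine that computed each set

# availability(r, m) is true if machine m is reachable in round r.
availability = logical ([1 0 1; 0 0 0; 1 1 1]);

for r = 1:rows (availability)
  for m = find (availability(r, :))
    id = find (! done, 1);         # next pending argument set
    if (isempty (id))
      break;                       # nothing left to compute
    endif
    results{id} = id ^ 2;          # stand-in for the user function
    assigned_to(id) = m;
    done(id) = true;
  endfor
endfor
```

In the second round no machine is reachable, so nothing is computed;
the pending sets are simply picked up once machines return.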
Advisory locking is used to avoid starting more than one scheduler for
a single combination of user_function/argument_file.
Olaf Till <olaf.till@uni-jena.de>, 2009-03-29