1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266
|
FAQ - ADMIN
===========
Release policy
--------------
Since the version 2.2, release numbers are divided into 3 parts:
- The first represents the design and the implementation used.
- The second represents a set of OAR functionalities.
- The third is incremented after bug fixes.
What means the error "Bad configuration option: PermitLocalCommand" when I am using oarsh?
------------------------------------------------------------------------------------------
For security reasons, on the latest OpenSSH releases you are able to execute
a local command when you are connecting to the remote host and we must
deactivate this option because the oarsh wrapper executes the *ssh* command
into the user oar.
So if you encounter this error message it means that your OpenSSH does
not know this option and you have to remove it from the oar.conf.
There is a variable named OARSH_OPENSSH_DEFAULT_OPTIONS_ in oar.conf used by oarsh.
So you have just to remove the not yet implemented option.
How to manage start/stop of the nodes?
--------------------------------------
You have to add a script in /etc/init.d which switches resources of the node
into the "Alive" or "Absent" state.
So when this script is called at boot time, it will change the state into
"Alive". And when it is called at halt time, it will change into "Absent".
There two ways to perform this action:
1. Install OAR "oar-libs" part on all nodes. Thus you will be able to launch
the command oarnodesetting_ (be careful to right configure "oar.conf" with
database login and password AND to allow network connections on this
database).
So you can execute::
oarnodesetting -s Alive -h node_hostname
or
oarnodesetting -s Absent -h node_hostname
2. You do not want to install anything else on each node. So you have to
enable oar user to connect to the server via ssh (for security you
can use another SSH key with restrictions on the command that oar can
launch with this one). Thus you will have in you init script
something like::
sudo -u oar ssh oar-server "oarnodesetting -s Alive -h node_hostname"
or
sudo -u oar ssh oar-server "oarnodesetting -s Absent -h node_hostname"
In this case, further OAR software upgrade will be more painless.
How can I manage scheduling queues?
-----------------------------------
see oarnotify_.
How can I handle licence tokens?
--------------------------------
OAR does not manage resources with an empty "network_address". So you can
define resources that are not linked with a real node.
So the steps to configure OAR with the possibility to reserve licences (or
whatever you want that are other notions):
1. Add a new field in the table resources_ to specify the licence name.
::
oarproperty -a licence -c
2. Add your licence name resources with oarnodesetting_.
::
oarnodesetting -a -h "" -p type=mathlab -p licence=l1
oarnodesetting -a -h "" -p type=mathlab -p licence=l2
oarnodesetting -a -h "" -p type=fluent -p licence=l1
...
After this configuration, users can perform submissions like
::
oarsub -I -l "/switch=2/nodes=10+{type = 'mathlab'}/licence=20"
So users ask OAR to give them some other resource types but nothing block
their program to take more licences than they asked.
You can resolve this problem with the SERVER_SCRIPT_EXEC_FILE_ configuration.
In these files you have to bind OAR allocated resources to the licence servers
to restrict user consumptions to what they asked. This is very dependant of
the licence management.
How can I handle multiple clusters with one OAR?
------------------------------------------------
These are the steps to follow:
1. create a resource property to identify the corresponding cluster (like "cluster")::
oarproperty -a cluster
(you can see this new property when you use oarnodes)
2. with oarnodesetting_ you have to fill this field for all resources; for example::
oarnodesetting -h node42.cluster1.com -p cluster=1
oarnodesetting -h node43.cluster1.com -p cluster=1
oarnodesetting -h node2.cluster2.com -p cluster=2
...
3. Then you have to restrict properties for new job type.
So an admission rule performs this job (this is a SQL syntax to use in a database interpreter)::
INSERT IGNORE INTO admission_rules (rule) VALUES ('
my $cluster_constraint = 0;
if (grep(/^cluster1$/, @{$type_list})){
$cluster_constraint = 1;
}elsif (grep(/^cluster2$/, @{$type_list})){
$cluster_constraint = 2;
}
if ($cluster_constraint > 0){
if ($jobproperties ne ""){
$jobproperties = "($jobproperties) AND cluster = $cluster_constraint";
}else{
$jobproperties = "cluster = $cluster_constraint";
}
print("[ADMISSION RULE] Added automatically cluster resource constraint\\n");
}
');
4. Edit the admission rule which checks the right job types and add
"cluster1" and "cluster2" in.
So when you will use oarsub to submit a "cluster2" job type only resources
with the property "cluster=2" is used. This is the same when you will use the
"cluster1" type.
How to configure a more ecological cluster (or how to make some power consumption economies)?
---------------------------------------------------------------------------------------------
This feature can be performed with the `Dynamic nodes coupling features`.
First you have to make sure that you have a command to wake up a computer
that is stopped. For example you can use the WoL (Wake on Lan) feature
(generally you have to right configure the BIOS and add right options to the
Linux Ethernet driver; see "ethtool").
If you want to enable a node to be woke up the next 12 hours::
((DATE=$(date +%s)+3600*12))
oarnodesetting -h host_name -p cm_availability=$DATE
Otherwise you can disable the wake up of nodes (but not the halt) by::
oarnodesetting -h host_name -p cm_availability=1
If you want to disable the halt on a node (but not the wakeup)::
oarnodesetting -h host_name -p cm_availability=2147483647
2147483647 = 2^31 - 1 : we take this value as infinite and it is used to
disable the halt mechanism.
And if you want to disable the halt and the wakeup::
oarnodesetting -h host_name -p cm_availability=0
Note: In the unstable 2.4 OAR version, cm_availability has been renamed
into available_upto.
Your `SCHEDULER_NODE_MANAGER_WAKE_UP_CMD`_ must be a script that read node
names and translate them into the right wake up command.
So with the right OAR and node configurations you can optimize the power
consumption of your cluster (and your air conditioning infrastructure)
without drawback for the users.
Take a look at your cluster occupation and your electricity bill to know if it
could be interesting for you ;-)
How to configure temporary UID for each job?
--------------------------------------------
For a better way to handle job processes we introduce the temporary user id
feature.
This feature creates a user for each job on assigned nodes. Hence it is
possible to clean temporary files, IPC, every generated processes, ...
Furthermore a lot of system features could be used like bandwidth management
(iptables rules on the user id).
To configure this feature, CPUSET must be activated and the tag
JOB_RESOURCE_MANAGER_JOB_UID_TYPE has to be configured in the oar.conf file.
The value is the content of the "type" field into the resources_ table. After
that you have to add resources in the database with this type and fill the
cpuset field with a unique UID (not used by real users). The maximum number of
concurrent jobs is the number of resources of this type.
For example, if you put this in your oar.onf::
JOB_RESOURCE_MANAGER_PROPERTY_DB_FIELD="cpuset"
JOB_RESOURCE_MANAGER_JOB_UID_TYPE="user"
Then you can add temporary UID::
oarnodesetting -a -h fake -p cpuset=23000 -p type=user
oarnodesetting -a -h fake -p cpuset=23001 -p type=user
oarnodesetting -a -h fake -p cpuset=23002 -p type=user
...
You can put what you want in the place of the hostname (here "fake").
The drawback of this feature is that users haven't their UID only their GID.
How to enable jobs to connect to the frontales from the nodes using oarsh?
--------------------------------------------------------------------------
First you have to install the node part of OAR on the wanted nodes.
After that you have to register the frontales into the database using
oarnodesetting with the "frontal" (for example) type and assigned the desired
cpus into the cpuset field; for example::
oarnodesetting -a -h frontal1 -p type=frontal -p cpuset=0
oarnodesetting -a -h frontal1 -p type=frontal -p cpuset=1
oarnodesetting -a -h frontal2 -p type=frontal -p cpuset=0
...
Thus you will be able to see resources identifier of these resources with
oarnodes; try to type::
oarnodes --sql "type='frontal'"
Then put this type name (here "frontal") into the *oar.conf* file on the OAR
server into the tag SCHEDULER_RESOURCES_ALWAYS_ASSIGNED_TYPE_.
Notes:
- if one of these resources become "Suspected" then the scheduling will
stop.
- you can disable this feature with oarnodesetting_ and put these resources
into the "Absent" state.
A job remains in the "Finishing" state, what can I do?
------------------------------------------------------
If you have waited more than a couple of minutes (10mn for example) then
something wrong occurred (frontal has crashed, out of memory, ...).
So you are able to turn manually a job into the "Error" state by typing with the root user (example with a bash shell)::
export OARCONFFILE=/etc/oar/oar.conf
perl -e 'use OAR::IO; $db = OAR::IO::connect(); OAR::IO::set_job_state($db,42,"Error")'
(Replace 42 by your job identifier)
How can I write my own scheduler?
---------------------------------
.. include:: scheduler/README
What is the syntax of this documentation?
-----------------------------------------
We are using the RST format from the `Docutils
<http://docutils.sourceforge.net/>`_ project. This syntax is easily readable
and can be converted into HTML, LaTex or XML.
You can find basic informations on
http://docutils.sourceforge.net/docs/user/rst/quickref.html
|