= Resource testing =
Never created a pacemaker cluster configuration before? Please
read on.
Ever created a pacemaker configuration without errors? All
resources worked from the get-go on all your nodes? Really? We
want a photo of you!
Seriously, it is so error prone to get a cluster resource
definition right that I think I only ever managed to do it with
`Dummy` resources. There are many intricate details that have to be
just right, and all of them are stuffed into a single place as simple
name-value attributes. Then there are multiple nodes, each of them a
complex system environment that is inevitably in flux (entropy,
anybody?).
Now, once you have defined your set of resources and are about to
_commit_ the configuration (at that point it usually takes a
deep breath to do so), be ready to meet an avalanche of error
messages, not all of which are easy to understand or follow. Not
to mention that you need to read the logs too. Even though we do
have a link:history-tutorial.html[tool] to help with digging through
the logs, it is going to be an interesting experience and not quite
recommended if you are just starting with pacemaker clusters. Even
experts can save a lot of time and headaches by following the advice
below.
== Basic usage ==
Enter resource testing. It is a special feature designed to help
users find problems in resource configurations.
The usage is very simple:
----
crm(live)configure# rsctest web-server
Probing resources ..
testing on xen-f: apache web-ip
testing on xen-g: apache web-ip
crm(live)configure#
----
What actually happened above and what is it good for? From the
output we can infer that the `web-server` resource is a group
comprising an apache web server and an IP address.
Indeed:
----
crm(live)configure# show web-server
group web-server apache web-ip \
        meta target-role="Stopped"
crm(live)configure#
----
The `rsctest` command first establishes that the resources are
stopped on all nodes in the cluster. Then it tests the resources
on all nodes, in the order defined by the resource group. It does
this by manually starting the resources, one by one, then running
a "monitor" operation for each resource to make sure that it is
healthy, and finally stopping the resources in reverse order.
Since there is no additional output, the test passed. It looks
like we have a properly defined web server group.
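By the way, node names may be listed after the resources to
restrict the test to just those nodes (we will use this form again
below). For example, to test the group on `xen-f` only:
----
crm(live)configure# rsctest web-server xen-f
----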
== Reporting problems ==
Now, the above run was not very interesting, so let's spoil the
idyll:
----
xen-f:~ # mv /etc/apache2/httpd.conf /tmp
----
We moved the apache configuration file away on node `xen-f`. The
`apache` resource should fail now:
----
crm(live)configure# rsctest web-server
Probing resources ..
testing on xen-f: apache
host xen-f (exit code 5)
xen-f stderr:
2013/10/17_16:51:26 ERROR: Configuration file /etc/apache2/httpd.conf not found!
2013/10/17_16:51:26 ERROR: environment is invalid, resource considered stopped
testing on xen-g: apache web-ip
crm(live)configure#
----
As expected, `apache` failed to start on node `xen-f`. When the
cluster resource manager runs an operation on a resource, all
messages go to the logs (there is no terminal attached to the
cluster, anyway). All one can see in the resource status is the
exit code of the failed operation. In this case, it indicates an
installation problem. For instance, the output could look like this:
----
xen-f:~ # crm status
Last updated: Thu Oct 17 19:21:44 2013
Last change: Thu Oct 17 19:21:28 2013 by root via crm_resource on xen-f
...
Failed actions:
apache_start_0 on xen-f 'not installed' (5): call=2074, status=complete,
last-rc-change='Thu Oct 17 19:21:31 2013', queued=164ms, exec=0ms
----
That does not look very informative. With `rsctest` we can
immediately see what the problem is, which saves us prowling the
logs for messages from the `apache` resource agent.
Note that the IP address is not tested, because the resource it
depends on could not be started.
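Once the cause is this obvious, the fix is usually quick. In our
example it is enough to move the configuration file back into place
and run the test again, which should now pass on both nodes:
----
xen-f:~ # mv /tmp/httpd.conf /etc/apache2/httpd.conf
xen-f:~ # crm configure rsctest web-server
----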
== What is tested? ==
The start, monitor, and stop operations, in exactly that order,
are tested for every resource specified. Note that the latter two
operations should normally never fail if the resource agent is
well implemented: under normal circumstances the RA should be
able to monitor or stop a started resource. However, this is
_not_ a replacement for resource agent testing. If that is what
you are looking for, see
http://www.linux-ha.org/doc/dev-guides/_testing_resource_agents.html[the
RA testing chapter] of the RA development guide.
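If you ever need to reproduce one of these steps by hand on a
single node, pacemaker's `crm_resource` tool can run individual RA
actions locally with its `--force-start`, `--force-check`, and
`--force-stop` options (availability and output depend on your
pacemaker version). A rough manual equivalent for the `apache`
resource would be:
----
xen-f:~ # crm_resource --resource apache --force-start
xen-f:~ # crm_resource --resource apache --force-check
xen-f:~ # crm_resource --resource apache --force-stop
----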
== Protecting resources ==
The `rsctest` command goes to great lengths to prevent starting a
resource on more than one node at the same time. For some resources
that would actually mean data corruption, and we certainly don't
want that to happen.
----
xen-f:~ # (echo start web-server; echo show web-server) | crm -w resource
resource web-server is running on: xen-g
xen-f:~ # crm configure rsctest web-server
Probing resources .WARNING: apache:probe: resource running at xen-g
.WARNING: web-ip:probe: resource running at xen-g
Stop all resources before testing!
xen-f:~ # crm configure rsctest web-server xen-f
Probing resources .WARNING: apache:probe: resource running at xen-g
.WARNING: web-ip:probe: resource running at xen-g
Stop all resources before testing!
xen-f:~ #
----
As you can see, if `rsctest` finds any of the resources running
on any node it refuses to run any tests.
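So, before testing, stop the resources and wait for the cluster to
settle; with its `-w` option, `crm` waits until the operation has
actually finished:
----
xen-f:~ # crm -w resource stop web-server
xen-f:~ # crm configure rsctest web-server
----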
== Multi-state and clone resources ==
Apart from groups, `rsctest` can also handle the other two
special kinds of resources. Let's take a look at one `drbd`-based
configuration:
----
crm(live)configure# show ms_drbd_nfs drbd0-vg
primitive drbd0-vg ocf:heartbeat:LVM \
        params volgrpname="drbd0-vg"
primitive p_drbd_nfs ocf:linbit:drbd \
        meta target-role="Stopped" \
        params drbd_resource="nfs" \
        op monitor interval="15" role="Master" \
        op monitor interval="30" role="Slave" \
        op start interval="0" timeout="300" \
        op stop interval="0" timeout="120"
ms ms_drbd_nfs p_drbd_nfs \
        meta notify="true" clone-max="2"
crm(live)configure#
----
The `nfs` drbd resource contains a volume group `drbd0-vg`.
----
crm(live)configure# rsctest ms_drbd_nfs drbd0-vg
Probing resources ..
testing on xen-f: p_drbd_nfs drbd0-vg
testing on xen-g: p_drbd_nfs drbd0-vg
crm(live)configure#
----
For multi-state (master-slave) resources, the involved
resource motions are somewhat more complex: the resource is first
started on both nodes and then promoted to master on the node where
the next resource is to be tested (in this case the volume group).
Afterwards it is demoted back to slave and promoted to master on the
other node, so that the dependent resources can be tested on
that node too.
Note that even though we asked for `ms_drbd_nfs` to be tested,
the output shows `p_drbd_nfs`, which is the primitive
encapsulated in the master-slave resource. You can specify either
one.
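Since either name is accepted, the following two invocations should
behave the same:
----
crm(live)configure# rsctest ms_drbd_nfs drbd0-vg
crm(live)configure# rsctest p_drbd_nfs drbd0-vg
----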
== Stonith resources ==
Stonith resources are also special and need particular
treatment: only the device status is tested, since actually
fencing nodes was deemed too drastic. Please use `node fence` to
test the effectiveness of a fencing device. It also does not matter
whether the stonith resource is "running" on any node: being
started is just something that happens virtually in the
`stonithd` process.
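For example, a real fencing test, which will actually reboot the
victim node (here we assume `xen-g` is the node you want to shoot),
could look like this:
----
xen-f:~ # crm node fence xen-g
----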
== Summary ==
- use `rsctest` to make sure that the resources can be started
correctly on all nodes
- `rsctest` protects resources by making sure beforehand that
none of them is currently running on any of the cluster nodes
- `rsctest` understands groups, master-slave (multi-state), and
clone resources, but nothing else of the configuration
(constraints or any other placement/order cluster configuration
elements)
- it is up to the user to test resources only on the nodes which are
really supposed to run them, and in the proper order (if that
order is expressed via constraints)
- `rsctest` cannot protect resources if they are running on
nodes which are not present in the cluster, nor from bad RA
implementations (but neither would a cluster resource manager)
- `rsctest` was designed as a debugging and configuration aid, and is
not intended to provide full resource agent test coverage.
== `crmsh` help and online resources (_sic!_) ==
- link:crm.8.html#topics_Testing[`crm help Testing`]
- link:crm.8.html#cmdhelp_configure_rsctest[`crm configure help
rsctest`]