.. index::
   pair: XML element; status

Status
------
Pacemaker automatically generates a ``status`` section in the CIB (inside the
``cib`` element, at the same level as ``configuration``). The status is
transient, and is not stored to disk with the rest of the CIB.

The section's structure and contents are internal to Pacemaker and subject to
change from release to release. Its often obscure element and attribute names
are kept for historical reasons, to maintain compatibility with older versions
during rolling upgrades.

Users should not modify the section directly, though various command-line tool
options affect it indirectly.


.. index::
   pair: XML element; node_state
   single: node; state
       
Node State
##########
   
The ``status`` element contains ``node_state`` elements for each node in the
cluster (and potentially nodes that have been removed from the configuration
since the cluster started). The ``node_state`` element has attributes that
allow the cluster to determine whether the node is healthy.

.. topic:: Example minimal node state entry

   .. code-block:: xml

      <node_state id="1" uname="cl-virt-1" in_ccm="1721760952" crmd="1721760952" crm-debug-origin="controld_update_resource_history" join="member" expected="member">
        <transient_attributes id="1"/>
        <lrm id="1"/>
      </node_state>
   
.. list-table:: **Attributes of a node_state Element**
   :class: longtable
   :widths: 20 20 60
   :header-rows: 1

   * - Name
     - Type
     - Description
   * - .. _node_state_id:

       .. index::
          pair: node_state; id

       id
     - :ref:`text <text>`
     - Node ID (identical to ``id`` of corresponding ``node`` element in the
       ``configuration`` section)
   * - .. _node_state_uname:

       .. index::
          pair: node_state; uname

       uname
     - :ref:`text <text>`
     - Node name (identical to ``uname`` of corresponding ``node`` element in the
       ``configuration`` section)
   * - .. _node_state_in_ccm:

       .. index::
          pair: node_state; in_ccm

       in_ccm
     - :ref:`epoch time <epoch_time>` *(since 2.1.7; previously boolean)*
     - If the node's controller is currently in the cluster layer's membership,
       this is the epoch time at which it joined (or 1 if the node is in the
       process of leaving the cluster), otherwise 0 *(since 2.1.7; previously,
       it was "true" or "false")*
   * - .. _node_state_crmd:

       .. index::
          pair: node_state; crmd

       crmd
     - :ref:`epoch time <epoch_time>` *(since 2.1.7; previously an enumeration)*
     - If the node's controller is currently in the cluster layer's controller
       messaging group, this is the epoch time at which it joined, otherwise 0
       *(since 2.1.7; previously, the value was either "online" or "offline")*
   * - .. _node_state_crm_debug_origin:

       .. index::
          pair: node_state; crm-debug-origin

       crm-debug-origin
     - :ref:`text <text>`
     - Name of the source code function that recorded this ``node_state``
       element (for debugging)
   * - .. _node_state_join:

       .. index::
          pair: node_state; join

       join
     - :ref:`enumeration <enumeration>`
     - Current status of the node's controller join sequence (and thus whether
       it is eligible to run resources). Allowed values:

       * ``down``: Not yet joined
       * ``pending``: In the process of joining or leaving
       * ``member``: Fully joined
       * ``banned``: Rejected by DC
   * - .. _node_state_expected:

       .. index::
          pair: node_state; expected

       expected
     - :ref:`enumeration <enumeration>`
     - What the cluster expects ``join`` to be in the immediate future. Allowed
       values are the same as for ``join``.
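
As an illustration only, the attributes above could be read from a saved copy
of the CIB as sketched below. The helper name ``summarize_node_state`` and the
keys of the returned dictionary are hypothetical, and the sketch assumes the
epoch-time format of ``in_ccm`` and ``crmd`` used since 2.1.7. Because the
``status`` section is internal to Pacemaker, treat this as a reading aid rather
than a supported interface.

.. code-block:: python

   import xml.etree.ElementTree as ET

   # Copy of the minimal node state entry shown earlier in this section
   NODE_STATE_XML = """
   <node_state id="1" uname="cl-virt-1" in_ccm="1721760952" crmd="1721760952"
               crm-debug-origin="controld_update_resource_history"
               join="member" expected="member">
     <transient_attributes id="1"/>
     <lrm id="1"/>
   </node_state>
   """

   def summarize_node_state(node_state):
       """Summarize a node_state element (hypothetical helper, 2.1.7+ format)."""
       in_ccm = int(node_state.get("in_ccm", "0"))
       crmd = int(node_state.get("crmd", "0"))
       return {
           "name": node_state.get("uname"),
           # 0 means not in the cluster layer's membership
           "in_membership": in_ccm != 0,
           # 1 means the node is in the process of leaving
           "leaving": in_ccm == 1,
           # 0 means not in the controller messaging group
           "in_controller_group": crmd != 0,
           "fully_joined": node_state.get("join") == "member",
           "expected": node_state.get("expected"),
       }

   summary = summarize_node_state(ET.fromstring(NODE_STATE_XML))
   print(summary)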


.. _transient_attributes:

.. index::
   pair: XML element; transient_attributes
   single: node; transient attribute
   single: node attribute; transient

Transient Node Attributes
#########################
   
The ``transient_attributes`` section specifies transient
:ref:`node_attributes`. In addition to any values set by the administrator or
resource agents using the ``attrd_updater`` or ``crm_attribute`` tools, the
cluster stores various state information here.
         
.. topic:: Example transient node attributes for a node

   .. code-block:: xml
   
      <transient_attributes id="cl-virt-1">
        <instance_attributes id="status-cl-virt-1">
          <nvpair id="status-cl-virt-1-pingd" name="pingd" value="3"/>
          <nvpair id="status-cl-virt-1-fail-count-pingd:0.monitor_30000" name="fail-count-pingd:0#monitor_30000" value="1"/>
          <nvpair id="status-cl-virt-1-last-failure-pingd:0" name="last-failure-pingd:0" value="1239009742"/>
        </instance_attributes>
      </transient_attributes>
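
The name/value pairs in such a section can be collected into a dictionary when
inspecting a saved CIB. The following is a minimal sketch; the function name
``read_transient_attributes`` is hypothetical, and the XML is a copy of the
example above. To change these values on a live cluster, use the tools
mentioned earlier rather than editing the section.

.. code-block:: python

   import xml.etree.ElementTree as ET

   TRANSIENT_XML = """
   <transient_attributes id="cl-virt-1">
     <instance_attributes id="status-cl-virt-1">
       <nvpair id="status-cl-virt-1-pingd" name="pingd" value="3"/>
       <nvpair id="status-cl-virt-1-fail-count-pingd:0.monitor_30000"
               name="fail-count-pingd:0#monitor_30000" value="1"/>
       <nvpair id="status-cl-virt-1-last-failure-pingd:0"
               name="last-failure-pingd:0" value="1239009742"/>
     </instance_attributes>
   </transient_attributes>
   """

   def read_transient_attributes(element):
       """Map each nvpair name to its (string) value."""
       return {nv.get("name"): nv.get("value") for nv in element.iter("nvpair")}

   attrs = read_transient_attributes(ET.fromstring(TRANSIENT_XML))
   for name, value in sorted(attrs.items()):
       print(name, "=", value)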
   

.. index::
   pair: XML element; lrm
   pair: XML element; lrm_resources
   pair: node; history

Node History
############

Each ``node_state`` element contains an ``lrm`` element with a history of
certain resource actions performed on the node. The ``lrm`` element contains an
``lrm_resources`` element.

.. index::
   pair: XML element; lrm_resource
   pair: resource; history

Resource History
________________

The ``lrm_resources`` element contains an ``lrm_resource`` element for each
resource that has had an action performed on the node.

An ``lrm_resource`` entry has attributes allowing the cluster to stop the
resource safely even if it is removed from the configuration. Specifically, the
resource's ``id``, ``class``, ``type`` and ``provider`` are recorded.

.. index::
   pair: XML element; lrm_rsc_op
   pair: action; history

Action History
______________

Each ``lrm_resource`` element contains an ``lrm_rsc_op`` element for each
recorded action performed for that resource on that node. (Not all actions are
recorded, just enough to determine the resource's state.)

.. list-table:: **Attributes of an lrm_rsc_op Element**
   :class: longtable
   :widths: 20 20 60
   :header-rows: 1

   * - Name
     - Type
     - Description
   * - .. _lrm_rsc_op_id:

       .. index::
          pair: lrm_rsc_op; id

       id
     - :ref:`text <text>`
     - Identifier for the history entry constructed from the resource ID,
       action name or history entry type, and action interval.
   * - .. _lrm_rsc_op_operation_key:

       .. index::
          pair: lrm_rsc_op; operation_key

       operation_key
     - :ref:`text <text>`
     - Identifier for the action that was executed, constructed from the
       resource ID, action name, and action interval.
   * - .. _lrm_rsc_op_operation:

       .. index::
          pair: lrm_rsc_op; operation

       operation
     - :ref:`text <text>`
     - The name of the action the history entry is for
   * - .. _lrm_rsc_op_crm_debug_origin:

       .. index::
          pair: lrm_rsc_op; crm-debug-origin

       crm-debug-origin
     - :ref:`text <text>`
     - Name of the source code function that recorded this entry (for
       debugging)
   * - .. _lrm_rsc_op_crm_feature_set:

       .. index::
          pair: lrm_rsc_op; crm_feature_set

       crm_feature_set
     - :ref:`version <version>`
     - The Pacemaker feature set used to record this entry.
   * - .. _lrm_rsc_op_transition_key:

       .. index::
          pair: lrm_rsc_op; transition-key

       transition-key
     - :ref:`text <text>`
     - A concatenation of the action's transition graph action number, the
       transition graph number, the action's expected result, and the UUID of
       the controller instance that scheduled it.
   * - .. _lrm_rsc_op_transition_magic:

       .. index::
          pair: lrm_rsc_op; transition-magic

       transition-magic
     - :ref:`text <text>`
     - A concatenation of ``op-status``, ``rc-code``, and ``transition-key``.
   * - .. _lrm_rsc_op_exit_reason:

       .. index::
          pair: lrm_rsc_op; exit-reason

       exit-reason
     - :ref:`text <text>`
     - An error message (if available) from the resource agent or Pacemaker if
       the action did not return success.
   * - .. _lrm_rsc_op_on_node:

       .. index::
          pair: lrm_rsc_op; on_node

       on_node
     - :ref:`text <text>`
     - The name of the node that executed the action (identical to the
       ``uname`` of the enclosing ``node_state`` element)
   * - .. _lrm_rsc_op_call_id:

       .. index::
          pair: lrm_rsc_op; call-id

       call-id
     - :ref:`integer <integer>`
     - A node-specific counter used to determine the order in which actions
       were executed.
   * - .. _lrm_rsc_op_rc_code:

       .. index::
          pair: lrm_rsc_op; rc-code

       rc-code
     - :ref:`integer <integer>`
     - The resource agent's exit status for this action. Refer to the *Resource
       Agents* chapter of *Pacemaker Administration* for how these values are
       interpreted.
   * - .. _lrm_rsc_op_op_status:

       .. index::
          pair: lrm_rsc_op; op-status

       op-status
     - :ref:`integer <integer>`
     - The execution status of this action. The meanings of these codes are
       internal to Pacemaker.
   * - .. _lrm_rsc_op_interval:

       .. index::
          pair: lrm_rsc_op; interval

       interval
     - :ref:`nonnegative integer <nonnegative_integer>`
     - If the action is recurring, its interval (in milliseconds), otherwise
       0.
   * - .. _lrm_rsc_op_last_rc_change:

       .. index::
          pair: lrm_rsc_op; last-rc-change

       last-rc-change
     - :ref:`epoch time <epoch_time>`
     - Node-local time at which the action first returned the current value of
       ``rc-code``.
   * - .. _lrm_rsc_op_exec_time:

       .. index::
          pair: lrm_rsc_op; exec-time

       exec-time
     - :ref:`integer <integer>`
     - Time (in seconds) that action execution took (if known)
   * - .. _lrm_rsc_op_queue_time:

       .. index::
          pair: lrm_rsc_op; queue-time

       queue-time
     - :ref:`integer <integer>`
     - Time (in seconds) that action was queued in the local executor (if known)
   * - .. _lrm_rsc_op_op_digest:

       .. index::
          pair: lrm_rsc_op; op-digest

       op-digest
     - :ref:`text <text>`
     - If present, this is a hash of the parameters passed to the action. If a
       hash of the currently configured parameters does not match this, that
       means the resource configuration changed since the action was performed,
       and the resource must be reloaded or restarted.
   * - .. _lrm_rsc_op_op_restart_digest:

       .. index::
          pair: lrm_rsc_op; op-restart-digest

       op-restart-digest
     - :ref:`text <text>`
     - If present, the resource agent supports reloadable parameters, and this
       is a hash of the non-reloadable parameters passed to the action. This
       allows the cluster to choose between reload and restart when one is
       needed.
   * - .. _lrm_rsc_op_op_secure_digest:

       .. index::
          pair: lrm_rsc_op; op-secure-digest

       op-secure-digest
     - :ref:`text <text>`
     - If present, the resource agent marks some parameters as sensitive, and
       this is a hash of the non-sensitive parameters passed to the action.
       This allows the value of sensitive parameters to be removed from a saved
       copy of the CIB while still allowing scheduler simulations to be
       performed on that copy.


Simple Operation History Example
________________________________

.. topic:: A monitor operation (determines current state of the ``apcstonith`` resource)

   .. code-block:: xml

      <lrm_resource id="apcstonith" type="fence_apc_snmp" class="stonith">
        <lrm_rsc_op id="apcstonith_monitor_0" operation="monitor" call-id="2"
          rc-code="7" op-status="0" interval="0"
          crm-debug-origin="do_update_resource" crm_feature_set="3.0.1"
          op-digest="2e3da9274d3550dc6526fb24bfcbcba0"
          transition-key="22:2:7:2668bbeb-06d5-40f9-936d-24cb7f87006a"
          transition-magic="0:7;22:2:7:2668bbeb-06d5-40f9-936d-24cb7f87006a"
          last-rc-change="1239008085" exec-time="10" queue-time="0"/>
      </lrm_resource>

The above example shows the history entry for a probe (non-recurring monitor
operation) for the ``apcstonith`` resource.

The cluster schedules probes for every configured resource on a node when
the node first starts, in order to determine the resource's current state
before it takes any further action.

From the ``transition-key``, we can see that this was the 22nd action of
the 2nd graph produced by this instance of the controller
(2668bbeb-06d5-40f9-936d-24cb7f87006a).

The third field of the ``transition-key`` contains a 7, which indicates
that the cluster expects to find the resource inactive. By looking at the
``rc-code`` property, we see that this was the case.
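
The field layout described above can be made concrete with a small parser.
This is a sketch under the assumption that the fields are colon-separated as in
the example, with ``transition-magic`` prefixing ``op-status`` and ``rc-code``
before a semicolon. The class and function names are hypothetical, and the
formats are internal to Pacemaker, so they may change between releases.

.. code-block:: python

   from typing import NamedTuple

   class TransitionKey(NamedTuple):
       action_id: int        # action number within the transition graph
       transition_id: int    # transition graph number
       target_rc: int        # result the cluster expected
       controller_uuid: str  # controller instance that scheduled the action

   def parse_transition_key(key):
       """Split 'action:graph:expected-rc:uuid' into typed fields."""
       action_id, transition_id, target_rc, uuid = key.split(":", 3)
       return TransitionKey(int(action_id), int(transition_id),
                            int(target_rc), uuid)

   def parse_transition_magic(magic):
       """Split 'op-status:rc-code;transition-key' into its three parts."""
       status_part, key = magic.split(";", 1)
       op_status, rc_code = status_part.split(":")
       return int(op_status), int(rc_code), parse_transition_key(key)

   key = parse_transition_key("22:2:7:2668bbeb-06d5-40f9-936d-24cb7f87006a")
   op_status, rc_code, magic_key = parse_transition_magic(
       "0:7;22:2:7:2668bbeb-06d5-40f9-936d-24cb7f87006a")

   # The probe found what the cluster expected when rc-code equals target_rc
   as_expected = (rc_code == key.target_rc)
   print(key, as_expected)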

As that is the only action recorded for this resource on this node, we can
conclude that the cluster did not start it here; if the resource is active, it
was started elsewhere.

Complex Operation History Example
_________________________________

.. topic:: Resource history of a ``pingd`` clone with multiple entries

   .. code-block:: xml

      <lrm_resource id="pingd:0" type="pingd" class="ocf" provider="pacemaker">
        <lrm_rsc_op id="pingd:0_monitor_30000" operation="monitor" call-id="34"
          rc-code="0" op-status="0" interval="30000"
          crm-debug-origin="do_update_resource" crm_feature_set="3.0.1"
          transition-key="10:11:0:2668bbeb-06d5-40f9-936d-24cb7f87006a"
          last-rc-change="1239009741" exec-time="10" queue-time="0"/>
        <lrm_rsc_op id="pingd:0_stop_0" operation="stop"
          crm-debug-origin="do_update_resource" crm_feature_set="3.0.1" call-id="32"
          rc-code="0" op-status="0" interval="0"
          transition-key="11:11:0:2668bbeb-06d5-40f9-936d-24cb7f87006a"
          last-rc-change="1239009741" exec-time="10" queue-time="0"/>
        <lrm_rsc_op id="pingd:0_start_0" operation="start" call-id="33"
          rc-code="0" op-status="0" interval="0"
          crm-debug-origin="do_update_resource" crm_feature_set="3.0.1"
          transition-key="31:11:0:2668bbeb-06d5-40f9-936d-24cb7f87006a"
          last-rc-change="1239009741" exec-time="10" queue-time="0" />
        <lrm_rsc_op id="pingd:0_monitor_0" operation="monitor" call-id="3"
          rc-code="7" op-status="0" interval="0"
          crm-debug-origin="do_update_resource" crm_feature_set="3.0.1"
          transition-key="23:2:7:2668bbeb-06d5-40f9-936d-24cb7f87006a"
          last-rc-change="1239008085" exec-time="20" queue-time="0"/>
      </lrm_resource>

When more than one history entry exists, it is important to first sort
them by ``call-id`` before interpreting them.

Once sorted, the above example can be summarized as:

#. A non-recurring monitor operation returning 7 (not running), with a
   ``call-id`` of 3
#. A stop operation returning 0 (success), with a ``call-id`` of 32
#. A start operation returning 0 (success), with a ``call-id`` of 33
#. A recurring monitor returning 0 (success), with a ``call-id`` of 34

The cluster processes each history entry to build up a picture of the
resource's state.  After the first and second entries, it is
considered stopped, and after the third it is considered active.

Based on the last operation, we can tell that the resource is
currently active.
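
The sort-then-replay reasoning can be sketched as follows. This is a
deliberately simplified illustration, not the scheduler's actual logic: it
looks only at ``operation`` and ``rc-code``, and ignores ``op-status``,
failures, promotion, and migration. The function name ``replay_history`` is
hypothetical, and the XML is an abbreviated copy of the four entries summarized
above.

.. code-block:: python

   import xml.etree.ElementTree as ET

   HISTORY_XML = """
   <lrm_resource id="pingd:0" type="pingd" class="ocf" provider="pacemaker">
     <lrm_rsc_op id="pingd:0_monitor_30000" operation="monitor"
                 call-id="34" rc-code="0" interval="30000"/>
     <lrm_rsc_op id="pingd:0_stop_0" operation="stop"
                 call-id="32" rc-code="0" interval="0"/>
     <lrm_rsc_op id="pingd:0_start_0" operation="start"
                 call-id="33" rc-code="0" interval="0"/>
     <lrm_rsc_op id="pingd:0_monitor_0" operation="monitor"
                 call-id="3" rc-code="7" interval="0"/>
   </lrm_resource>
   """

   def replay_history(lrm_resource):
       """Return (entry IDs in call-id order, whether the resource ends active)."""
       ops = sorted(lrm_resource.findall("lrm_rsc_op"),
                    key=lambda op: int(op.get("call-id")))
       active = False
       order = []
       for op in ops:
           name = op.get("operation")
           rc = int(op.get("rc-code"))
           order.append(op.get("id"))
           if name == "start" and rc == 0:
               active = True
           elif name == "stop" and rc == 0:
               active = False
           elif name == "monitor" and rc == 0:
               active = True    # monitor found the resource running
           elif name == "monitor" and rc == 7:
               active = False   # monitor found the resource not running
       return order, active

   order, active = replay_history(ET.fromstring(HISTORY_XML))
   print(order, active)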

Additionally, from the presence of a ``stop`` operation with a lower
``call-id`` than that of the ``start`` operation, we can conclude that the
resource has been restarted.  Specifically, this occurred as part of
actions 11 and 31 of transition 11 from the controller instance with the key
``2668bbeb...``.  This information can be helpful for locating the
relevant section of the logs when looking for the source of a failure.
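
A restart of this kind could be spotted programmatically along the following
lines. The sketch assumes at most one ``stop`` and one ``start`` entry per
resource, as in the example, and takes the first two fields of
``transition-key`` as the action and transition numbers. The function name
``find_restart`` and the shape of its return value are hypothetical.

.. code-block:: python

   import xml.etree.ElementTree as ET

   # Abbreviated copy of the stop and start entries from the pingd example
   HISTORY_XML = """
   <lrm_resource id="pingd:0" type="pingd" class="ocf" provider="pacemaker">
     <lrm_rsc_op id="pingd:0_stop_0" operation="stop" call-id="32"
       transition-key="11:11:0:2668bbeb-06d5-40f9-936d-24cb7f87006a"/>
     <lrm_rsc_op id="pingd:0_start_0" operation="start" call-id="33"
       transition-key="31:11:0:2668bbeb-06d5-40f9-936d-24cb7f87006a"/>
   </lrm_resource>
   """

   def find_restart(lrm_resource):
       """Return restart details if a stop precedes a start, else None."""
       ops = {op.get("operation"): op
              for op in lrm_resource.findall("lrm_rsc_op")}
       stop, start = ops.get("stop"), ops.get("start")
       if stop is None or start is None:
           return None
       if int(stop.get("call-id")) >= int(start.get("call-id")):
           return None
       stop_fields = stop.get("transition-key").split(":", 3)
       start_fields = start.get("transition-key").split(":", 3)
       return {
           # (transition number, action number) pairs to search for in logs
           "stop": (int(stop_fields[1]), int(stop_fields[0])),
           "start": (int(start_fields[1]), int(start_fields[0])),
           "controller": stop_fields[3],
       }

   restart = find_restart(ET.fromstring(HISTORY_XML))
   print(restart)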