# check_patroni

A Nagios plugin for Patroni.

## Features

- Check presence of leader, replicas, node counts.
- Check each node for replication status.


```
Usage: check_patroni [OPTIONS] COMMAND [ARGS]...

  Nagios plugin that uses Patroni's REST API to monitor a Patroni cluster.

Options:
  --config FILE         Read option defaults from the specified INI file
                        [default: config.ini]
  -e, --endpoints TEXT  Patroni API endpoint. Can be specified multiple times
                        or as a list of comma separated addresses. The node
                        services check the status of one node, therefore if
                        several addresses are specified they should point to
                        different interfaces on the same node. The cluster
                        services check the status of the cluster, therefore
                        it's better to give a list of all Patroni node
                        addresses.  [default: http://127.0.0.1:8008]
  --cert_file PATH      File with the client certificate.
  --key_file PATH       File with the client key.
  --ca_file PATH        The CA certificate.
  -v, --verbose         Increase verbosity -v (info)/-vv (warning)/-vvv
                        (debug)
  --version
  --timeout INTEGER     Timeout in seconds for the API queries (0 to disable)
                        [default: 2]
  --help                Show this message and exit.

Commands:
  cluster_config_has_changed    Check if the hash of the configuration...
  cluster_has_leader            Check if the cluster has a leader.
  cluster_has_replica           Check if the cluster has healthy replicas...
  cluster_has_scheduled_action  Check if the cluster has a scheduled...
  cluster_is_in_maintenance     Check if the cluster is in maintenance...
  cluster_node_count            Count the number of nodes in the cluster.
  node_is_alive                 Check if the node is alive ie patroni is...
  node_is_leader                Check if the node is a leader node.
  node_is_pending_restart       Check if the node is in pending restart...
  node_is_primary               Check if the node is the primary with the...
  node_is_replica               Check if the node is a replica with no...
  node_patroni_version          Check if the version is equal to the input
  node_tl_has_changed           Check if the timeline has changed.
```

## Install

check_patroni is licensed under the PostgreSQL license.

```
$ pip install git+https://github.com/dalibo/check_patroni.git
```

check_patroni works on Python 3.6. We keep it that way because Patroni also
supports it and there are still lots of RH 7 variants around. That being said,
Python 3.6 has been EOL for ages and there is no support for it in the GitHub
CI.

## Support

If you hit a bug or need help, open a [GitHub
issue](https://github.com/dalibo/check_patroni/issues/new). Dalibo has no
commitment on response time for public free support. Thanks for your
contribution!

## Config file

All global and service-specific parameters can be specified via a config file as follows:

```
[options]
endpoints = https://10.20.199.3:8008, https://10.20.199.4:8008,https://10.20.199.5:8008
cert_file = ./ssl/my-cert.pem
key_file = ./ssl/my-key.pem
ca_file = ./ssl/CA-cert.pem
timeout = 0

[options.node_is_replica]
lag=100
```
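
To illustrate how such a file is laid out, here is a minimal sketch that parses a shortened copy of the example with Python's standard `configparser` and splits the comma-separated `endpoints` value. It is not check_patroni's own loader: the `load_options` helper and the way it merges a `[options.<service>]` section over `[options]` are assumptions made for the illustration.

```python
import configparser

EXAMPLE = """
[options]
endpoints = https://10.20.199.3:8008, https://10.20.199.4:8008,https://10.20.199.5:8008
timeout = 0

[options.node_is_replica]
lag=100
"""


def load_options(text, service=None):
    """Return the [options] values, merged with [options.<service>] if present.

    Hypothetical helper: the merge order is an assumption, not the tool's code.
    """
    parser = configparser.ConfigParser()
    parser.read_string(text)
    options = dict(parser["options"]) if parser.has_section("options") else {}
    section = "options.{}".format(service)
    if service and parser.has_section(section):
        options.update(parser[section])
    # Endpoints are a comma separated list; tolerate spaces around the commas.
    if "endpoints" in options:
        options["endpoints"] = [
            item.strip() for item in options["endpoints"].split(",") if item.strip()
        ]
    return options


opts = load_options(EXAMPLE, service="node_is_replica")
print(opts["endpoints"])
print(opts["timeout"], opts["lag"])
```

Values come back as strings, so a real loader would still have to convert `timeout` to an integer.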

## Thresholds

The format for the threshold parameters is `[@][start:][end]`.

* `start:` may be omitted if `start == 0`
* `~:` means that start is negative infinity
* If `end` is omitted, infinity is assumed
* To invert the match condition, prefix the range expression with `@`.

A match is found when: `start <= VALUE <= end`.

For example, the following command will raise:

* a warning if there are fewer than 2 healthy replicas, i.e. the value is outside of range [2;+INF[
* a critical if there are no healthy replicas, i.e. the value is outside of range [1;+INF[

```
check_patroni -e https://10.20.199.3:8008 cluster_has_replica --warning 2: --critical 1:
```
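
The range rules above can be sketched in a few lines of Python. This is only an illustration of the `[@][start:][end]` semantics as described in this section, not the parser the plugin actually uses; `parse_range` and `raises_alert` are hypothetical names.

```python
import math


def parse_range(spec):
    """Parse a '[@][start:][end]' threshold into (start, end, inverted)."""
    inverted = spec.startswith("@")
    if inverted:
        spec = spec[1:]
    # Without a colon the whole expression is the end and start defaults to 0.
    start_text, _, end_text = spec.rpartition(":")
    if start_text == "~":
        start = -math.inf
    else:
        start = float(start_text) if start_text else 0.0
    end = float(end_text) if end_text else math.inf
    return start, end, inverted


def raises_alert(spec, value):
    """Return True when `value` should trigger the alert tied to `spec`."""
    start, end, inverted = parse_range(spec)
    matches = start <= value <= end
    # Plain ranges alert outside the range, '@' ranges alert inside it.
    return matches if inverted else not matches


for replicas in (0, 1, 2):
    print(replicas, raises_alert("2:", replicas), raises_alert("1:", replicas))
```

With `--warning 2: --critical 1:`, a count of 0 trips both thresholds, a count of 1 trips only the warning, and a count of 2 or more trips neither.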

## SSL

Several options are available:

* the server's CA certificate is not available or trusted by the client system:
  * `--ca_file`: your certification chain `cat CA-certificate server-certificate > cabundle`
* you have a client certificate for authenticating with Patroni's REST API:
  * `--cert_file`: your certificate or the concatenation of your certificate and private key
  * `--key_file`: your private key (optional)

## Shell completion

We use the [click] library which supports shell completion natively.

Shell completion can be added by typing the following command or adding it to
a file specific to your shell of choice.

* for Bash (add to `~/.bashrc`):
  ```
  eval "$(_CHECK_PATRONI_COMPLETE=bash_source check_patroni)"
  ```
* for Zsh  (add to `~/.zshrc`):
  ```
  eval "$(_CHECK_PATRONI_COMPLETE=zsh_source check_patroni)"
  ```
* for Fish (add to `~/.config/fish/completions/check_patroni.fish`):
  ```
  _CHECK_PATRONI_COMPLETE=fish_source check_patroni | source
  ```

Please note that shell completion is not supported for all shell versions; for
example, only Bash versions 4.4 and newer are supported.

[click]: https://click.palletsprojects.com/en/8.1.x/shell-completion/

## Connection errors and service status

If patroni is not running, we have no way to know if the provided endpoint is
valid, therefore the check returns UNKNOWN.

## Cluster services

### cluster_config_has_changed

```
Usage: check_patroni cluster_config_has_changed [OPTIONS]

  Check if the hash of the configuration has changed.

  Note: either a hash or a state file must be provided for this service to
  work.

  Check:
  * `OK`: The hash didn't change
  * `CRITICAL`: The hash of the configuration has changed compared to the input (`--hash`) or last time (`--state-file`)

  Perfdata:
  * `is_configuration_changed` is 1 if the configuration has changed

Options:
  --hash TEXT            A hash to compare with.
  -s, --state-file TEXT  A state file to store the hash of the configuration.
  --save                 Set the current configuration hash as the reference
                         for future calls.
  --help                 Show this message and exit.
```
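
The state file mode can be pictured with the following sketch. It is an assumption-laden illustration rather than the plugin's code: the SHA-256 digest, the JSON state file layout, the decision to report "unchanged" when no reference exists yet, and the names `config_hash` and `config_has_changed` are all hypothetical.

```python
import hashlib
import json
import os
import tempfile


def config_hash(config):
    """Hash a configuration dict independently of key order (SHA-256 is an assumption)."""
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def config_has_changed(config, state_file, save=False):
    """Compare the current hash with the stored one, optionally saving the new one."""
    current = config_hash(config)
    reference = None
    if os.path.exists(state_file):
        with open(state_file, encoding="utf-8") as handle:
            reference = json.load(handle).get("hash")
    if save:
        # Mirrors the idea of --save: the current hash becomes the reference.
        with open(state_file, "w", encoding="utf-8") as handle:
            json.dump({"hash": current}, handle)
    # Assumption: with no stored reference there is nothing to compare against.
    return reference is not None and reference != current


with tempfile.TemporaryDirectory() as tmp:
    state = os.path.join(tmp, "config.state")
    first = config_has_changed({"ttl": 30}, state, save=True)  # stores the reference
    same = config_has_changed({"ttl": 30}, state)
    changed = config_has_changed({"ttl": 60}, state)
print(first, same, changed)
```

Only the last call reports a change, because the state file still holds the hash of the first configuration.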

### cluster_has_leader

```
Usage: check_patroni cluster_has_leader [OPTIONS]

  Check if the cluster has a leader.

  This check applies to any kind of leaders including standby leaders.

  A leader is a node with the "leader" role and a "running" state.

  A standby leader is a node with a "standby_leader" role and a "streaming" or
  "in archive recovery" state. Please note that log shipping could be stuck
  because the WAL are not available or applicable. Patroni doesn't provide
  information about the origin cluster (timeline or lag), so we cannot check
  if there is a problem in that particular case. That's why we issue a warning
  when the node is "in archive recovery". We suggest using other supervision
  tools to do this (e.g. check_pgactivity).

  Check:
  * `OK`: if there is a leader node.
  * `WARNING`: if there is a standby leader in archive recovery.
  * `CRITICAL`: otherwise.

  Perfdata:
  * `has_leader` is 1 if there is any kind of leader node, 0 otherwise
  * `is_standby_leader_in_arc_rec` is 1 if the standby leader node is "in
     archive recovery", 0 otherwise
  * `is_standby_leader` is 1 if there is a standby leader node, 0 otherwise
  * `is_leader` is 1 if there is a "classical" leader node, 0 otherwise

Options:
  --help  Show this message and exit.
```
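
The decision described above can be summarised by the following sketch. It assumes member dictionaries carrying the `role` and `state` fields, as in the entries of Patroni's `/cluster` response; `leader_status` is a hypothetical helper and not the plugin's implementation.

```python
def leader_status(members):
    """Map a list of cluster members to the status described for cluster_has_leader."""
    for member in members:
        role, state = member.get("role"), member.get("state")
        if role == "leader" and state == "running":
            return "OK"
        if role == "standby_leader" and state == "streaming":
            return "OK"
        if role == "standby_leader" and state == "in archive recovery":
            # Log shipping may be stuck and cannot be verified from Patroni alone.
            return "WARNING"
    return "CRITICAL"


cluster = [
    {"name": "srv1", "role": "standby_leader", "state": "in archive recovery"},
    {"name": "srv2", "role": "replica", "state": "streaming"},
]
print(leader_status(cluster))
```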

### cluster_has_replica

```
Usage: check_patroni cluster_has_replica [OPTIONS]

  Check if the cluster has healthy replicas and/or if some are sync or quorum
  standbies

  For patroni (and this check):
  * a replica is `streaming` if the `pg_stat_wal_receiver` says so.
  * a replica is `in archive recovery`, if it's not `streaming` and has a `restore_command`.

  A healthy replica:
  * has a `replica`, `quorum_standby` or `sync_standby` role
  * has the same timeline as the leader and
    * is in `running` state (patroni < V3.0.4)
    * is in `streaming` or `in archive recovery` state (patroni >= V3.0.4)
  * has a lag lower or equal to `max_lag`

  Please note that a replica `in archive recovery` could be stuck because the
  WAL are not available or applicable (the server's timeline has diverged from
  the leader's). We already detect the latter but we will miss the former.
  Therefore, it's preferable to check for the lag in addition to the healthy
  state if you rely on log shipping to help lagging standbies to catch up.

  Since we require a healthy replica to have the same timeline as the leader,
  it's possible that we raise alerts when the cluster is performing a
  switchover or failover and the standbies are in the process of catching up
  with the new leader. The alert shouldn't last long.

  In PostgreSQL, synchronous replication has two modes, on and quorum, and is
  configured with the GUCs `synchronous_standby_names` and
  `synchronous_commit`. Patroni uses the parameter `synchronous_mode`, which
  can be set to `on`, `quorum` and `off`, and has `synchronous_node_count` to
  configure the synchronous replication factor. Please note that, in
  synchronous replication, the number of servers tagged as
  "{sync|quorum}_standby" (what we measure) is not always equal to
  `synchronous_node_count`.

  Check:
  * `OK`: if the healthy_replica count and their lag are compatible with the replica count threshold,
          and if the synchronous replica count is compatible with the sync replica count threshold.
  * `WARNING` / `CRITICAL`: otherwise

  Perfdata:
  * healthy_replica & unhealthy_replica count
  * the number of sync_replica (sync or quorum depending on `--sync-type`), they are included
    in the previous count
  * the lag of each replica labelled with "member name"_lag
  * the timeline of each replica labelled with "member name"_timeline
  * a boolean to tell if the node is a sync standby labelled with "member name"_sync

Options:
  -w, --warning TEXT             Warning threshold for the number of healthy
                                 replica nodes.
  -c, --critical TEXT            Critical threshold for the number of healthy
                                 replica nodes.
  --sync-warning TEXT            Warning threshold for the number of sync
                                 replica.
  --sync-critical TEXT           Critical threshold for the number of sync
                                 replica.
  --sync-type [any|sync|quorum]  Synchronous replication mode used to filter
                                 and count sync standbies.  [default: any]
  --max-lag TEXT                 maximum allowed lag
  --help                         Show this message and exit.
```
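
As a rough illustration of the "healthy replica" definition above (for Patroni >= 3.0.4), the sketch below classifies members shaped like the entries of Patroni's `/cluster` response (`role`, `state`, `timeline`, `lag`). The helper names are hypothetical, and `max_lag` is assumed to be a plain number of bytes, which is simpler than what `--max-lag` may accept.

```python
REPLICA_ROLES = {"replica", "quorum_standby", "sync_standby"}
HEALTHY_STATES = {"streaming", "in archive recovery"}  # Patroni >= 3.0.4


def is_healthy_replica(member, leader_timeline, max_lag=None):
    """Apply the role, state, timeline and lag conditions to one member."""
    if member.get("role") not in REPLICA_ROLES:
        return False
    if member.get("state") not in HEALTHY_STATES:
        return False
    if member.get("timeline") != leader_timeline:
        return False
    if max_lag is not None:
        lag = member.get("lag")
        # A missing or non-numeric lag cannot satisfy the lag condition.
        if not isinstance(lag, int) or lag > max_lag:
            return False
    return True


def count_replicas(members, max_lag=None):
    """Return (healthy, unhealthy) counts among the replica-like members."""
    leader_timeline = next(
        (m.get("timeline") for m in members
         if m.get("role") in ("leader", "standby_leader")),
        None,
    )
    replicas = [m for m in members if m.get("role") in REPLICA_ROLES]
    healthy = sum(
        1 for m in replicas if is_healthy_replica(m, leader_timeline, max_lag)
    )
    return healthy, len(replicas) - healthy


cluster = [
    {"name": "srv1", "role": "leader", "state": "running", "timeline": 5},
    {"name": "srv2", "role": "replica", "state": "streaming", "timeline": 5, "lag": 0},
    {"name": "srv3", "role": "replica", "state": "streaming", "timeline": 4, "lag": 0},
    {"name": "srv4", "role": "sync_standby", "state": "in archive recovery",
     "timeline": 5, "lag": 2048},
]
print(count_replicas(cluster))
print(count_replicas(cluster, max_lag=1024))
```

Here `srv3` is never healthy because its timeline differs from the leader's, and `srv4` stops being healthy once a `max_lag` below its lag is applied.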

### cluster_has_scheduled_action

```
Usage: check_patroni cluster_has_scheduled_action [OPTIONS]

  Check if the cluster has a scheduled action (switchover or restart)

  Check:
  * `OK`: If the cluster has no scheduled action
  * `CRITICAL`: otherwise.

  Perfdata:
  * `scheduled_actions` is 1 if the cluster has scheduled actions.
  * `scheduled_switchover` is 1 if the cluster has a scheduled switchover.
  * `scheduled_restart` counts the number of scheduled restarts in the cluster.

Options:
  --help  Show this message and exit.
```

### cluster_is_in_maintenance

```
Usage: check_patroni cluster_is_in_maintenance [OPTIONS]

  Check if the cluster is in maintenance mode or paused.

  Check:
  * `OK`: If the cluster is not in maintenance mode.
  * `CRITICAL`: otherwise.

  Perfdata:
  * `is_in_maintenance` is 1 if the cluster is in maintenance mode, 0 otherwise

Options:
  --help  Show this message and exit.
```

### cluster_node_count

```
Usage: check_patroni cluster_node_count [OPTIONS]

  Count the number of nodes in the cluster.

  The role refers to the role of the server in the cluster. Possible values
  are:
  * leader (master was removed in patroni 4.0.0)
  * replica
  * standby_leader
  * sync_standby
  * quorum_standby
  * demoted
  * promoted
  * uninitialized

  The state refers to the state of PostgreSQL. Possible values are:
  * initializing new cluster, initdb failed
  * running custom bootstrap script, custom bootstrap failed
  * starting, start failed
  * restarting, restart failed
  * running, streaming, in archive recovery
  * stopping, stopped, stop failed
  * creating replica
  * crashed

  The "healthy" check only ensures that:
  * a leader has the running state
  * a standby_leader has the running or streaming (V3.0.4) state
  * a replica, quorum_standby or sync_standby has the running or streaming (V3.0.4) state

  Since we don't check the lag or timeline, "in archive recovery" is not
  considered a valid state for this service. See cluster_has_leader and
  cluster_has_replica for specialized checks.

  Check:
  * Compares the number of nodes against the normal and healthy nodes warning and critical thresholds.
  * `OK`:  If they are not provided.

  Perfdata:
  * `members`: the member count.
  * `healthy_members`: the running and streaming member count.
  * all the roles of the nodes in the cluster with their count (start with "role_").
  * all the statuses of the nodes in the cluster with their count (start with "state_").

Options:
  -w, --warning TEXT       Warning threshold for the number of nodes.
  -c, --critical TEXT      Critical threshold for the number of nodes.
  --healthy-warning TEXT   Warning threshold for the number of healthy nodes
                           (running + streaming).
  --healthy-critical TEXT  Critical threshold for the number of healthy nodes
                           (running + streaming).
  --help                   Show this message and exit.
```
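
The counting rules above can be sketched as follows, again assuming member dictionaries with `role` and `state` fields as in Patroni's `/cluster` response. The helper names are hypothetical, and the exact perfdata label format (for instance how spaces in state names are rendered) is an assumption.

```python
from collections import Counter

STANDBY_ROLES = {"standby_leader", "replica", "quorum_standby", "sync_standby"}


def is_healthy(member):
    """Healthy means running for a leader, running or streaming for the others."""
    role, state = member.get("role"), member.get("state")
    if role == "leader":
        return state == "running"
    if role in STANDBY_ROLES:
        return state in ("running", "streaming")
    return False


def node_counts(members):
    """Build a perfdata-like dict: totals plus one counter per role and state."""
    perfdata = Counter()
    perfdata["members"] = len(members)
    perfdata["healthy_members"] = sum(1 for m in members if is_healthy(m))
    for member in members:
        perfdata["role_" + member["role"]] += 1
        perfdata["state_" + member["state"]] += 1
    return dict(perfdata)


cluster = [
    {"name": "srv1", "role": "leader", "state": "running"},
    {"name": "srv2", "role": "replica", "state": "streaming"},
    {"name": "srv3", "role": "replica", "state": "in archive recovery"},
    {"name": "srv4", "role": "sync_standby", "state": "start failed"},
]
print(node_counts(cluster))
```

In this sample only `srv1` and `srv2` count as healthy, since "in archive recovery" is not accepted by this service.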

## Node services

### node_is_alive

```
Usage: check_patroni node_is_alive [OPTIONS]

  Check if the node is alive, i.e. patroni is running. This is a liveness check
  as defined in Patroni's documentation. If patroni is not running, we have no
  way to know if the provided endpoint is valid, therefore the check returns
  UNKNOWN.

  Check:
  * `OK`: if patroni's liveness check returns with HTTP status 200.
  * `CRITICAL`: if patroni's liveness check returns with an HTTP status
     other than 200.

  Perfdata:
  * `is_running` is 1 if patroni is running, 0 otherwise

Options:
  --help  Show this message and exit.
```

### node_is_pending_restart

```
Usage: check_patroni node_is_pending_restart [OPTIONS]

  Check if the node is in pending restart state.

  This situation can arise if the configuration has been modified but requires
  a restart of PostgreSQL to take effect.

  Check:
  * `OK`: if the node has no pending restart tag.
  * `CRITICAL`: otherwise

  Perfdata: `is_pending_restart` is 1 if the node has pending restart tag, 0
  otherwise.

Options:
  --help  Show this message and exit.
```

### node_is_leader

```
Usage: check_patroni node_is_leader [OPTIONS]

  Check if the node is a leader node.

  This check applies to any kind of leaders including standby leaders. To
  check explicitly for a standby leader use the `--is-standby-leader` option.

  Check:
  * `OK`: if the node is a leader.
  * `CRITICAL`: otherwise

  Perfdata: `is_leader` is 1 if the node is a leader node, 0 otherwise.

Options:
  --is-standby-leader  Check for a standby leader
  --help               Show this message and exit.
```

### node_is_primary

```
Usage: check_patroni node_is_primary [OPTIONS]

  Check if the node is the primary with the leader lock.

  This service is not valid for a standby leader, because this kind of node is
  not a primary.

  Check:
  * `OK`: if the node is a primary with the leader lock.
  * `CRITICAL`: otherwise

  Perfdata: `is_primary` is 1 if the node is a primary with the leader lock, 0
  otherwise.

Options:
  --help  Show this message and exit.
```

### node_is_replica

```
Usage: check_patroni node_is_replica [OPTIONS]

  Check if the node is a replica with no noloadbalance tag.

  It is possible to check if the node is synchronous or asynchronous. If
  nothing is specified any kind of replica is accepted.  When checking for a
  synchronous replica, it's not possible to specify a lag.

  This service is using the following Patroni endpoints: replica, asynchronous
  and synchronous. The first two implement the `lag` tag. For these endpoints
  the state of a replica node doesn't reflect the replication state
  (`streaming` or `in archive recovery`), we only know if it's `running`. The
  timeline is also not checked.

  Therefore, if a cluster is using asynchronous replication, it is recommended
  to check for the lag to detect a divergence as soon as possible.

  Check:
  * `OK`: if the node is a running replica with no noloadbalance tag and the lag is under the maximum threshold.
  * `CRITICAL`:  otherwise

  Perfdata: `is_replica` is 1 if the node is a running replica with no
  noloadbalance tag and the lag is under the maximum threshold, 0 otherwise.

Options:
  --max-lag TEXT                 maximum allowed lag
  --is-sync                      check if the replica is synchronous
  --sync-type [any|sync|quorum]  Synchronous replication mode.  [default: any]
  --is-async                     check if the replica is asynchronous
  --help                         Show this message and exit.
```

### node_patroni_version

```
Usage: check_patroni node_patroni_version [OPTIONS]

  Check if the version is equal to the input

  Check:
  * `OK`: The version is the same as the input `--patroni-version`
  * `CRITICAL`: otherwise.

  Perfdata:
  * `is_version_ok` is 1 if version is ok, 0 otherwise

Options:
  --patroni-version TEXT  Patroni version to compare to  [required]
  --help                  Show this message and exit.
```

### node_tl_has_changed

```
Usage: check_patroni node_tl_has_changed [OPTIONS]

  Check if the timeline has changed.

  Note: either a timeline or a state file must be provided for this service to
  work.

  Check:
  * `OK`: The timeline is the same as last time (`--state-file`) or as the timeline given as input (`--timeline`)
  * `CRITICAL`: The timeline is not the same.

  Perfdata:
  * `is_timeline_changed` is 1 if the timeline has changed, 0 otherwise
  * the timeline

Options:
  --timeline TEXT        A timeline number to compare with.
  -s, --state-file TEXT  A state file to store the last tl number into.
  --save                 Set the current timeline number as the reference for
                         future calls.
  --help                 Show this message and exit.
```