1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539
|
# check_patroni
A nagios plugin for patroni.
## Features
- Check presence of leader, replicas, node counts.
- Check each node for replication status.
```
Usage: check_patroni [OPTIONS] COMMAND [ARGS]...
Nagios plugin that uses Patroni's REST API to monitor a Patroni cluster.
Options:
--config FILE Read option defaults from the specified INI file
[default: config.ini]
-e, --endpoints TEXT Patroni API endpoint. Can be specified multiple times
or as a list of comma separated addresses. The node
services checks the status of one node, therefore if
several addresses are specified they should point to
different interfaces on the same node. The cluster
services check the status of the cluster, therefore
it's better to give a list of all Patroni node
addresses. [default: http://127.0.0.1:8008]
--cert_file PATH File with the client certificate.
--key_file PATH File with the client key.
--ca_file PATH The CA certificate.
-v, --verbose Increase verbosity -v (info)/-vv (warning)/-vvv
(debug)
--version
--timeout INTEGER Timeout in seconds for the API queries (0 to disable)
[default: 2]
--help Show this message and exit.
Commands:
cluster_config_has_changed Check if the hash of the configuration...
cluster_has_leader Check if the cluster has a leader.
cluster_has_replica Check if the cluster has healthy replicas...
cluster_has_scheduled_action Check if the cluster has a scheduled...
cluster_is_in_maintenance Check if the cluster is in maintenance...
cluster_node_count Count the number of nodes in the cluster.
node_is_alive Check if the node is alive ie patroni is...
node_is_leader Check if the node is a leader node.
node_is_pending_restart Check if the node is in pending restart...
node_is_primary Check if the node is the primary with the...
node_is_replica Check if the node is a replica with no...
node_patroni_version Check if the version is equal to the input
node_tl_has_changed Check if the timeline has changed.
```
## Install
check_patroni is licensed under PostgreSQL license.
```
$ pip install git+https://github.com/dalibo/check_patroni.git
```
check_patroni works on python 3.6, we keep it that way because patroni also
supports it and there are still lots of RH 7 variants around. That being said
python 3.6 has been EOL for ages and there is no support for it in the github
CI.
## Support
If you hit a bug or need help, open a [GitHub
issue](https://github.com/dalibo/check_patroni/issues/new). Dalibo has no
commitment on response time for public free support. Thanks for you
contribution !
## Config file
All global and service specific parameters can be specified via a config file has follows:
```
[options]
endpoints = https://10.20.199.3:8008, https://10.20.199.4:8008,https://10.20.199.5:8008
cert_file = ./ssl/my-cert.pem
key_file = ./ssl/my-key.pem
ca_file = ./ssl/CA-cert.pem
timeout = 0
[options.node_is_replica]
lag=100
```
## Thresholds
The format for the threshold parameters is `[@][start:][end]`.
* `start:` may be omitted if `start == 0`
* `~:` means that start is negative infinity
* If `end` is omitted, infinity is assumed
* To invert the match condition, prefix the range expression with `@`.
A match is found when: `start <= VALUE <= end`.
For example, the following command will raise:
* a warning if there is less than 1 nodes, which can be translated to outside of range [2;+INF[
* a critical if there are no nodes, which can be translated to outside of range [1;+INF[
```
check_patroni -e https://10.20.199.3:8008 cluster_has_replica --warning 2: --critical 1:
```
## SSL
Several options are available:
* the server's CA certificate is not available or trusted by the client system:
* `--ca_cert`: your certification chain `cat CA-certificate server-certificate > cabundle`
* you have a client certificate for authenticating with Patroni's REST API:
* `--cert_file`: your certificate or the concatenation of your certificate and private key
* `--key_file`: your private key (optional)
## Shell completion
We use the [click] library which supports shell completion natively.
Shell completion can be added by typing the following command or adding it to
a file spécific to your shell of choice.
* for Bash (add to `~/.bashrc`):
```
eval "$(_CHECK_PATRONI_COMPLETE=bash_source check_patroni)"
```
* for Zsh (add to `~/.zshrc`):
```
eval "$(_CHECK_PATRONI_COMPLETE=zsh_source check_patroni)"
```
* for Fish (add to `~/.config/fish/completions/check_patroni.fish`):
```
eval "$(_CHECK_PATRONI_COMPLETE=fish_source check_patroni)"
```
Please note that shell completion is not supported far all shell versions, for
example only Bash versions older than 4.4 are supported.
[click]: https://click.palletsprojects.com/en/8.1.x/shell-completion/
## Connection errors and service status
If patroni is not running, we have no way to know if the provided endpoint is
valid, therefore the check returns UNKNOWN.
## Cluster services
### cluster_config_has_changed
```
Usage: check_patroni cluster_config_has_changed [OPTIONS]
Check if the hash of the configuration has changed.
Note: either a hash or a state file must be provided for this service to
work.
Check:
* `OK`: The hash didn't change
* `CRITICAL`: The hash of the configuration has changed compared to the input (`--hash`) or last time (`--state_file`)
Perfdata:
* `is_configuration_changed` is 1 if the configuration has changed
Options:
--hash TEXT A hash to compare with.
-s, --state-file TEXT A state file to store the hash of the configuration.
--save Set the current configuration hash as the reference
for future calls.
--help Show this message and exit.
```
### cluster_has_leader
```
Usage: check_patroni cluster_has_leader [OPTIONS]
Check if the cluster has a leader.
This check applies to any kind of leaders including standby leaders.
A leader is a node with the "leader" role and a "running" state.
A standby leader is a node with a "standby_leader" role and a "streaming" or
"in archive recovery" state. Please note that log shipping could be stuck
because the WAL are not available or applicable. Patroni doesn't provide
information about the origin cluster (timeline or lag), so we cannot check
if there is a problem in that particular case. That's why we issue a warning
when the node is "in archive recovery". We suggest using other supervision
tools to do this (eg. check_pgactivity).
Check:
* `OK`: if there is a leader node.
* 'WARNING': if there is a stanby leader in archive mode.
* `CRITICAL`: otherwise.
Perfdata:
* `has_leader` is 1 if there is any kind of leader node, 0 otherwise
* `is_standby_leader_in_arc_rec` is 1 if the standby leader node is "in
archive recovery", 0 otherwise
* `is_standby_leader` is 1 if there is a standby leader node, 0 otherwise
* `is_leader` is 1 if there is a "classical" leader node, 0 otherwise
Options:
--help Show this message and exit.
```
### cluster_has_replica
```
Usage: check_patroni cluster_has_replica [OPTIONS]
Check if the cluster has healthy replicas and/or if some are sync or quorum
standbies
For patroni (and this check):
* a replica is `streaming` if the `pg_stat_wal_receiver` say's so.
* a replica is `in archive recovery`, if it's not `streaming` and has a `restore_command`.
A healthy replica:
* has a `replica`, `quorum_standby` or `sync_standby` role
* has the same timeline as the leader and
* is in `running` state (patroni < V3.0.4)
* is in `streaming` or `in archive recovery` state (patroni >= V3.0.4)
* has a lag lower or equal to `max_lag`
Please note that replica `in archive recovery` could be stuck because the
WAL are not available or applicable (the server's timeline has diverged for
the leader's). We already detect the latter but we will miss the former.
Therefore, it's preferable to check for the lag in addition to the healthy
state if you rely on log shipping to help lagging standbies to catch up.
Since we require a healthy replica to have the same timeline as the leader,
it's possible that we raise alerts when the cluster is performing a
switchover or failover and the standbies are in the process of catching up
with the new leader. The alert shouldn't last long.
In PostgreSQL, synchronous replication has two modes: on and quorum and is
configured with the gucs `synchronous_standby_names` and
`synchronous_commit`. Patroni uses the parameter `synchronous_mode`, which
can be set to `on`, `quorum` and `off`, and has `synchronous_node_count` to
configure the synchronous replication factor. Please note that, in
synchronous replication, the number of servers tagged as
"{sync|quorum}_standby" (what we measure) is not always equal tot
`synchronous_node_count`.
Check:
* `OK`: if the healthy_replica count and their lag are compatible with the replica count threshold.
and if the synchronous replica count is compatible with the sync replica count threshold.
* `WARNING` / `CRITICAL`: otherwise
Perfdata:
* healthy_replica & unhealthy_replica count
* the number of sync_replica (sync or quorum depending on `--sync-type`), they are included
in the previous count
* the lag of each replica labelled with "member name"_lag
* the timeline of each replica labelled with "member name"_timeline
* a boolean to tell if the node is a sync stanbdy labelled with "member name"_sync
Options:
-w, --warning TEXT Warning threshold for the number of healthy
replica nodes.
-c, --critical TEXT Critical threshold for the number of healthy
replica nodes.
--sync-warning TEXT Warning threshold for the number of sync
replica.
--sync-critical TEXT Critical threshold for the number of sync
replica.
--sync-type [any|sync|quorum] Synchronous replication mode used to filter
and count sync standbies. [default: any]
--max-lag TEXT maximum allowed lag
--help Show this message and exit.
```
### cluster_has_scheduled_action
```
Usage: check_patroni cluster_has_scheduled_action [OPTIONS]
Check if the cluster has a scheduled action (switchover or restart)
Check:
* `OK`: If the cluster has no scheduled action
* `CRITICAL`: otherwise.
Perfdata:
* `scheduled_actions` is 1 if the cluster has scheduled actions.
* `scheduled_switchover` is 1 if the cluster has a scheduled switchover.
* `scheduled_restart` counts the number of scheduled restart in the cluster.
Options:
--help Show this message and exit.
```
### cluster_is_in_maintenance
```
Usage: check_patroni cluster_is_in_maintenance [OPTIONS]
Check if the cluster is in maintenance mode or paused.
Check:
* `OK`: If the cluster is in maintenance mode.
* `CRITICAL`: otherwise.
Perfdata:
* `is_in_maintenance` is 1 the cluster is in maintenance mode, 0 otherwise
Options:
--help Show this message and exit.
```
### cluster_node_count
```
Usage: check_patroni cluster_node_count [OPTIONS]
Count the number of nodes in the cluster.
The role refers to the role of the server in the cluster. Possible values
are:
* leader (master was removed in patroni 4.0.0)
* replica
* standby_leader
* sync_standby
* quorum_standby
* demoted
* promoted
* uninitialized
The state refers to the state of PostgreSQL. Possible values are:
* initializing new cluster, initdb failed
* running custom bootstrap script, custom bootstrap failed
* starting, start failed
* restarting, restart failed
* running, streaming, in archive recovery
* stopping, stopped, stop failed
* creating replica
* crashed
The "healthy" checks only ensures that:
* a leader has the running state
* a standby_leader has the running or streaming (V3.0.4) state
* a replica, quorum_standby or sync_standby has the running or streaming (V3.0.4) state
Since we dont check the lag or timeline, "in archive recovery" is not
considered a valid state for this service. See cluster_has_leader and
cluster_has_replica for specialized checks.
Check:
* Compares the number of nodes against the normal and healthy nodes warning and critical thresholds.
* `OK`: If they are not provided.
Perfdata:
* `members`: the member count.
* `healthy_members`: the running and streaming member count.
* all the roles of the nodes in the cluster with their count (start with "role_").
* all the statuses of the nodes in the cluster with their count (start with "state_").
Options:
-w, --warning TEXT Warning threshold for the number of nodes.
-c, --critical TEXT Critical threshold for the number of nodes.
--healthy-warning TEXT Warning threshold for the number of healthy nodes
(running + streaming).
--healthy-critical TEXT Critical threshold for the number of healthy nodes
(running + streaming).
--help Show this message and exit.
```
## Node services
### node_is_alive
```
Usage: check_patroni node_is_alive [OPTIONS]
Check if the node is alive ie patroni is running. This is a liveness check
as defined in Patroni's documentation. If patroni is not running, we have no
way to know if the provided endpoint is valid, therefore the check returns
UNKNOWN.
Check:
* `OK`: If patroni the liveness check returns with HTTP status 200.
* `CRITICAL`: if partoni's liveness check returns with an HTTP status
other than 200.
Perfdata:
* `is_running` is 1 if patroni is running, 0 otherwise
Options:
--help Show this message and exit.
```
### node_is_pending_restart
```
Usage: check_patroni node_is_pending_restart [OPTIONS]
Check if the node is in pending restart state.
This situation can arise if the configuration has been modified but requires
a restart of PostgreSQL to take effect.
Check:
* `OK`: if the node has no pending restart tag.
* `CRITICAL`: otherwise
Perfdata: `is_pending_restart` is 1 if the node has pending restart tag, 0
otherwise.
Options:
--help Show this message and exit.
```
### node_is_leader
```
Usage: check_patroni node_is_leader [OPTIONS]
Check if the node is a leader node.
This check applies to any kind of leaders including standby leaders. To
check explicitly for a standby leader use the `--is-standby-leader` option.
Check:
* `OK`: if the node is a leader.
* `CRITICAL:` otherwise
Perfdata: `is_leader` is 1 if the node is a leader node, 0 otherwise.
Options:
--is-standby-leader Check for a standby leader
--help Show this message and exit.
```
### node_is_primary
```
Usage: check_patroni node_is_primary [OPTIONS]
Check if the node is the primary with the leader lock.
This service is not valid for a standby leader, because this kind of node is
not a primary.
Check:
* `OK`: if the node is a primary with the leader lock.
* `CRITICAL:` otherwise
Perfdata: `is_primary` is 1 if the node is a primary with the leader lock, 0
otherwise.
Options:
--help Show this message and exit.
```
### node_is_replica
```
Usage: check_patroni node_is_replica [OPTIONS]
Check if the node is a replica with no noloadbalance tag.
It is possible to check if the node is synchronous or asynchronous. If
nothing is specified any kind of replica is accepted. When checking for a
synchronous replica, it's not possible to specify a lag.
This service is using the following Patroni endpoints: replica, asynchronous
and synchronous. The first two implement the `lag` tag. For these endpoints
the state of a replica node doesn't reflect the replication state
(`streaming` or `in archive recovery`), we only know if it's `running`. The
timeline is also not checked.
Therefore, if a cluster is using asynchronous replication, it is recommended
to check for the lag to detect a divegence as soon as possible.
Check:
* `OK`: if the node is a running replica with noloadbalance tag and the lag is under the maximum threshold.
* `CRITICAL`: otherwise
Perfdata: `is_replica` is 1 if the node is a running replica with
noloadbalance tag and the lag is under the maximum threshold, 0 otherwise.
Options:
--max-lag TEXT maximum allowed lag
--is-sync check if the replica is synchronous
--sync-type [any|sync|quorum] Synchronous replication mode. [default: any]
--is-async check if the replica is asynchronous
--help Show this message and exit.
```
### node_patroni_version
```
Usage: check_patroni node_patroni_version [OPTIONS]
Check if the version is equal to the input
Check:
* `OK`: The version is the same as the input `--patroni-version`
* `CRITICAL`: otherwise.
Perfdata:
* `is_version_ok` is 1 if version is ok, 0 otherwise
Options:
--patroni-version TEXT Patroni version to compare to [required]
--help Show this message and exit.
```
### node_tl_has_changed
```
Usage: check_patroni node_tl_has_changed [OPTIONS]
Check if the timeline has changed.
Note: either a timeline or a state file must be provided for this service to
work.
Check:
* `OK`: The timeline is the same as last time (`--state_file`) or the inputted timeline (`--timeline`)
* `CRITICAL`: The tl is not the same.
Perfdata:
* `is_timeline_changed` is 1 if the tl has changed, 0 otherwise
* the timeline
Options:
--timeline TEXT A timeline number to compare with.
-s, --state-file TEXT A state file to store the last tl number into.
--save Set the current timeline number as the reference for
future calls.
--help Show this message and exit.
```
|