---
stage: Data Stores
group: Database
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://handbook.gitlab.com/handbook/product/ux/technical-writing/#assignments
---
# Troubleshooting PostgreSQL replication and failover for Linux package installations
DETAILS:
**Tier:** Premium, Ultimate
**Offering:** Self-managed
When working with PostgreSQL replication and failover, you might encounter the following issues.
## Consul and PostgreSQL changes not taking effect
Due to the potential impacts, `gitlab-ctl reconfigure` only reloads Consul and PostgreSQL; it does not restart the services. However, not all changes can be activated by reloading.
To restart either service, run `gitlab-ctl restart SERVICE`.
For PostgreSQL, it is usually safe to restart the leader node. Automatic failover defaults to a one-minute timeout, so provided the database returns before then, nothing else needs to be done.
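For example, on a database node where Patroni manages PostgreSQL, the restart might look like this (a sketch; substitute the service that applies to your node):

```shell
# On Patroni-managed database nodes, restarting the patroni service also restarts
# the PostgreSQL instance it manages.
sudo gitlab-ctl restart patroni
```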
On the Consul server nodes, it is important to [restart the Consul service](../consul.md#restart-consul) in a controlled manner.
## PgBouncer error `ERROR: pgbouncer cannot connect to server`
You may get this error when running `gitlab-rake gitlab:db:configure` or you
may see the error in the PgBouncer log file.
```plaintext
PG::ConnectionBad: ERROR: pgbouncer cannot connect to server
```
The problem may be that your PgBouncer node's IP address is not included in the
`trust_auth_cidr_addresses` setting in `/etc/gitlab/gitlab.rb` on the database nodes.
You can confirm that this is the issue by checking the PostgreSQL log on the leader
database node. If you see the following error then `trust_auth_cidr_addresses`
is the problem.
```plaintext
2018-03-29_13:59:12.11776 FATAL: no pg_hba.conf entry for host "123.123.123.123", user "pgbouncer", database "gitlabhq_production", SSL off
```
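One way to check the log on the leader database node is to tail it with `gitlab-ctl`, for example:

```shell
# On the leader database node: follow the PostgreSQL log and watch for pg_hba.conf rejections.
sudo gitlab-ctl tail postgresql
```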
To fix the problem, add the IP address to `/etc/gitlab/gitlab.rb`.
```ruby
postgresql['trust_auth_cidr_addresses'] = %w(123.123.123.123/32 <other_cidrs>)
```
[Reconfigure GitLab](../restart_gitlab.md#reconfigure-a-linux-package-installation) for the changes to take effect.
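After reconfiguring, you can re-run the command that originally surfaced the error to confirm PgBouncer can now reach the database:

```shell
sudo gitlab-rake gitlab:db:configure
```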
## PgBouncer nodes don't fail over after Patroni switchover
Due to a [known issue](https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/8166) that
affects versions of GitLab prior to 16.5.0, the automatic failover of PgBouncer nodes does not
happen after a [Patroni switchover](../../administration/postgresql/replication_and_failover.md#manual-failover-procedure-for-patroni). In this
example, GitLab failed to detect a paused database, then attempted to `RESUME` a
not-paused database:
```plaintext
INFO -- : Running: gitlab-ctl pgb-notify --pg-database gitlabhq_production --newhost database7.example.com --user pgbouncer --hostuser gitlab-consul
ERROR -- : STDERR: Error running command: GitlabCtl::Errors::ExecutionError
ERROR -- : STDERR: ERROR: ERROR: database gitlabhq_production is not paused
```
To ensure a [Patroni switchover](../../administration/postgresql/replication_and_failover.md#manual-failover-procedure-for-patroni) succeeds,
you must manually restart the PgBouncer service on all PgBouncer nodes with this command:
```shell
gitlab-ctl restart pgbouncer
```
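To verify that PgBouncer is routing to the new leader after the restart, you can open the PgBouncer administrative console and run `SHOW DATABASES;` (a sketch; the console prompts for the `pgbouncer` user's password):

```shell
# Once connected, `SHOW DATABASES;` lists the backend host for gitlabhq_production.
sudo gitlab-ctl pgb-console
```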
## Reinitialize a replica
If a replica cannot start or rejoin the cluster, or if it lags behind and cannot catch up, it might be necessary to reinitialize the replica:
1. [Check the replication status](../../administration/postgresql/replication_and_failover.md#check-replication-status) to confirm which server
needs to be reinitialized. For example:
```plaintext
+ Cluster: postgresql-ha (6970678148837286213) ------+---------+--------------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-------------------------------------+--------------+---------+--------------+----+-----------+
| gitlab-database-1.example.com | 172.18.0.111 | Replica | running | 55 | 0 |
| gitlab-database-2.example.com | 172.18.0.112 | Replica | start failed | | unknown |
| gitlab-database-3.example.com | 172.18.0.113 | Leader | running | 55 | |
+-------------------------------------+--------------+---------+--------------+----+-----------+
```
1. Sign in to the broken server and reinitialize the database and replication. Patroni shuts
down PostgreSQL on that server, removes the data directory, and reinitializes it from scratch:
```shell
sudo gitlab-ctl patroni reinitialize-replica --member gitlab-database-2.example.com
```
This can be run on any Patroni node, but be aware that `sudo gitlab-ctl patroni reinitialize-replica`
without `--member` restarts the server it is run on.
You should run it locally on the broken server to reduce the risk of
unintended data loss.
1. Monitor the logs:
```shell
sudo gitlab-ctl tail patroni
```
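Once the reinitialization finishes, you can confirm the replica has rejoined the cluster and is streaming again:

```shell
sudo gitlab-ctl patroni members
```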
## Reset the Patroni state in Consul
WARNING:
Resetting the Patroni state in Consul is a potentially destructive process. Make sure that you have a healthy database backup first.
As a last resort you can reset the Patroni state in Consul completely.
This may be required if your Patroni cluster is in an unknown or bad state and no node can start:
```plaintext
+ Cluster: postgresql-ha (6970678148837286213) ------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-------------------------------------+--------------+---------+---------+----+-----------+
| gitlab-database-1.example.com | 172.18.0.111 | Replica | stopped | | unknown |
| gitlab-database-2.example.com | 172.18.0.112 | Replica | stopped | | unknown |
| gitlab-database-3.example.com | 172.18.0.113 | Replica | stopped | | unknown |
+-------------------------------------+--------------+---------+---------+----+-----------+
```
**Before deleting the Patroni state in Consul**,
[try to resolve the `gitlab-ctl` errors](#errors-running-gitlab-ctl) on the Patroni nodes.
This process results in a reinitialized Patroni cluster when
the first Patroni node starts.
To reset the Patroni state in Consul:
1. Take note of the Patroni node that was the leader, or that the application thinks is the current leader,
if the current state shows more than one leader, or none:
- Look on the PgBouncer nodes in `/var/opt/gitlab/consul/databases.ini`,
which contains the hostname of the current leader.
- Look in the Patroni logs `/var/log/gitlab/patroni/current` (or the older rotated and
compressed logs `/var/log/gitlab/patroni/@40000*`) on **all** database nodes to see
which server was most recently identified as the leader by the cluster:
```plaintext
INFO: no action. I am a secondary (database1.local) and following a leader (database2.local)
```
1. Stop Patroni on all nodes:
```shell
sudo gitlab-ctl stop patroni
```
1. Reset the state in Consul:
```shell
/opt/gitlab/embedded/bin/consul kv delete -recurse /service/postgresql-ha/
```
1. Start one Patroni node, which initializes the Patroni cluster and elects itself as leader.
It's highly recommended to start the previous leader (noted in the first step),
so as to not lose existing writes that may not have been replicated because
of the broken cluster state:
```shell
sudo gitlab-ctl start patroni
```
1. Start all other Patroni nodes that join the Patroni cluster as replicas:
```shell
sudo gitlab-ctl start patroni
```
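After all nodes have started, confirm that a leader was elected and the replicas have rejoined:

```shell
# Run on any database node; the output should show one leader and the remaining nodes as running replicas.
sudo gitlab-ctl patroni members
```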
If you are still seeing issues, the next step is restoring the last healthy backup.
## Errors in the Patroni log about a `pg_hba.conf` entry for `127.0.0.1`
The following log entry in the Patroni log indicates the replication is not working
and a configuration change is needed:
```plaintext
FATAL: no pg_hba.conf entry for replication connection from host "127.0.0.1", user "gitlab_replicator"
```
To fix the problem, ensure the loopback interface is included in the CIDR addresses list:
1. Edit `/etc/gitlab/gitlab.rb`:
```ruby
postgresql['trust_auth_cidr_addresses'] = %w(<other_cidrs> 127.0.0.1/32)
```
1. [Reconfigure GitLab](../restart_gitlab.md#reconfigure-a-linux-package-installation) for the changes to take effect.
1. Check that [all the replicas are synchronized](../../administration/postgresql/replication_and_failover.md#check-replication-status).
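If replication still fails, you can check that the generated `pg_hba.conf` now includes the loopback entry (the data directory path shown is the Linux package default):

```shell
sudo grep 127.0.0.1 /var/opt/gitlab/postgresql/data/pg_hba.conf
```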
## Error: requested start point is ahead of the Write Ahead Log (WAL) flush position
This error in Patroni logs indicates that the database is not replicating:
```plaintext
FATAL: could not receive data from WAL stream:
ERROR: requested starting point 0/5000000 is ahead of the WAL flush position of this server 0/4000388
```
This example error is from a replica that was initially misconfigured, and had never replicated.
Fix it [by reinitializing the replica](#reinitialize-a-replica).
## Patroni fails to start with `MemoryError`
Patroni may fail to start, logging an error and stack trace:
```plaintext
MemoryError
Traceback (most recent call last):
File "/opt/gitlab/embedded/bin/patroni", line 8, in <module>
sys.exit(main())
[..]
File "/opt/gitlab/embedded/lib/python3.7/ctypes/__init__.py", line 273, in _reset_cache
CFUNCTYPE(c_int)(lambda: None)
```
If the stack trace ends with `CFUNCTYPE(c_int)(lambda: None)`, this code triggers `MemoryError`
if the Linux server has been hardened for security.
The code causes Python to write temporary executable files. If it cannot find a file system in which to do this, for example because `noexec` is set on the `/tmp` file system, it fails with `MemoryError` ([read more in the issue](https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/6184)).
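To check whether `/tmp` is mounted with `noexec` (a quick diagnostic; other mount points Python uses for temporary files can be affected in the same way):

```shell
# 'noexec' in the output indicates the hardening that can trigger this MemoryError.
findmnt -no OPTIONS /tmp
```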
## Errors running `gitlab-ctl`
Patroni nodes can get into a state where `gitlab-ctl` commands fail
and `gitlab-ctl reconfigure` cannot fix the node.
If this coincides with a version upgrade of PostgreSQL, [follow a different procedure](#postgresql-major-version-upgrade-fails-on-a-patroni-replica).
One common symptom is that `gitlab-ctl` cannot determine
information it needs about the installation if the database server is failing to start:
```plaintext
Malformed configuration JSON file found at /opt/gitlab/embedded/nodes/<HOSTNAME>.json.
This usually happens when your last run of `gitlab-ctl reconfigure` didn't complete successfully.
```
```plaintext
Error while reinitializing replica on the current node: Attributes not found in
/opt/gitlab/embedded/nodes/<HOSTNAME>.json, has reconfigure been run yet?
```
Similarly, the nodes file (`/opt/gitlab/embedded/nodes/<HOSTNAME>.json`) should contain a lot of information,
but might get created with only:
```json
{
"name": "<HOSTNAME>"
}
```
The following process fixes this by reinitializing the replica;
the current state of PostgreSQL on this node is discarded:
1. Shut down the Patroni and (if present) PostgreSQL services:
```shell
sudo gitlab-ctl status
sudo gitlab-ctl stop patroni
sudo gitlab-ctl stop postgresql
```
1. Remove `/var/opt/gitlab/postgresql/data` in case its state prevents
PostgreSQL from starting:
```shell
cd /var/opt/gitlab/postgresql
sudo rm -rf data
```
**Take care with this step to avoid data loss**.
This step can also be achieved by renaming `data/` instead (see the sketch after this procedure):
make sure there's enough free disk space for a new copy of the primary database,
and remove the extra directory when the replica is fixed.
1. With PostgreSQL not running, run `reconfigure` so that the nodes file is created successfully:
```shell
sudo gitlab-ctl reconfigure
```
1. Start Patroni:
```shell
sudo gitlab-ctl start patroni
```
1. Monitor the logs and check the cluster state:
```shell
sudo gitlab-ctl tail patroni
sudo gitlab-ctl patroni members
```
1. Run `reconfigure` again:
```shell
sudo gitlab-ctl reconfigure
```
1. Reinitialize the replica if `gitlab-ctl patroni members` indicates this is needed:
```shell
sudo gitlab-ctl patroni reinitialize-replica
```
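If you prefer to rename the old data directory instead of removing it in step 2, a minimal sketch of that approach (the `data.old` name is only an example):

```shell
cd /var/opt/gitlab/postgresql
# Keep the old directory as a fallback; delete it once the replica is confirmed healthy.
sudo mv data data.old
```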
If this procedure doesn't work **and** the cluster is unable to elect a leader,
[there is another fix](#reset-the-patroni-state-in-consul) which should only be
used as a last resort.
## PostgreSQL major version upgrade fails on a Patroni replica
A Patroni **replica** can get stuck in a loop during `gitlab-ctl pg-upgrade`, and
the upgrade fails.
An example set of symptoms is as follows:
1. A `postgresql` service is defined,
which shouldn't usually be present on a Patroni node. It is present because
`gitlab-ctl pg-upgrade` adds it to create a new empty database:
```plaintext
run: patroni: (pid 1972) 1919s; run: log: (pid 1971) 1919s
down: postgresql: 1s, normally up, want up; run: log: (pid 1973) 1919s
```
1. PostgreSQL generates `PANIC` log entries in
`/var/log/gitlab/postgresql/current` as Patroni is removing
`/var/opt/gitlab/postgresql/data` as part of reinitializing the replica:
```plaintext
DETAIL: Could not open file "pg_xact/0000": No such file or directory.
WARNING: terminating connection because of crash of another server process
LOG: all server processes terminated; reinitializing
PANIC: could not open file "global/pg_control": No such file or directory
```
1. In `/var/log/gitlab/patroni/current`, Patroni logs the following.
The local PostgreSQL version is different from the cluster leader's:
```plaintext
INFO: trying to bootstrap from leader 'HOSTNAME'
pg_basebackup: incompatible server version 12.6
pg_basebackup: removing data directory "/var/opt/gitlab/postgresql/data"
ERROR: Error when fetching backup: pg_basebackup exited with code=1
```
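To confirm the version mismatch, you can compare the PostgreSQL server binary linked on the replica with the one on the leader (a quick check; the path is the standard Linux package location):

```shell
# Run on both the replica and the leader, then compare the reported versions.
sudo /opt/gitlab/embedded/bin/postgres --version
```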
**Important**: This workaround applies when the Patroni cluster is in the following state:
- The [leader has been successfully upgraded to the new major version](../../administration/postgresql/replication_and_failover.md#upgrading-postgresql-major-version-in-a-patroni-cluster).
- The step to upgrade PostgreSQL on replicas is failing.
This workaround completes the PostgreSQL upgrade on a Patroni replica
by setting the node to use the new PostgreSQL version, and then reinitializing
it as a replica in the new cluster that was created
when the leader was upgraded:
1. Check the cluster status on all nodes to confirm which is the leader
and what state the replicas are in:
```shell
sudo gitlab-ctl patroni members
```
1. Replica: check which version of PostgreSQL is active:
```shell
sudo ls -al /opt/gitlab/embedded/bin | grep postgres
```
1. Replica: ensure the nodes file is correct and `gitlab-ctl` can run. This resolves
the [errors running `gitlab-ctl`](#errors-running-gitlab-ctl) issue if the replica
has any of those errors as well:
```shell
sudo gitlab-ctl stop patroni
sudo gitlab-ctl reconfigure
```
1. Replica: relink the PostgreSQL binaries to the required version
to fix the `incompatible server version` error:
1. Edit `/etc/gitlab/gitlab.rb` and specify the required version:
```ruby
postgresql['version'] = 13
```
1. Reconfigure GitLab:
```shell
sudo gitlab-ctl reconfigure
```
1. Check that the binaries are relinked. The binaries distributed for
PostgreSQL vary between major releases, so it's typical to
have a small number of incorrect symbolic links:
```shell
sudo ls -al /opt/gitlab/embedded/bin | grep postgres
```
1. Replica: ensure PostgreSQL is fully reinitialized for the specified version:
```shell
cd /var/opt/gitlab/postgresql
sudo rm -rf data
sudo gitlab-ctl reconfigure
```
1. Replica: optionally monitor the database in two additional terminal sessions:
- Disk use increases as `pg_basebackup` runs. Track progress of the
replica initialization with:
```shell
cd /var/opt/gitlab/postgresql
watch du -sh data
```
- Monitor the process in the logs:
```shell
sudo gitlab-ctl tail patroni
```
1. Replica: Start Patroni to reinitialize the replica:
```shell
sudo gitlab-ctl start patroni
```
1. Replica: After it completes, remove the hardcoded version from `/etc/gitlab/gitlab.rb`:
1. Edit `/etc/gitlab/gitlab.rb` and remove `postgresql['version']`.
1. Reconfigure GitLab:
```shell
sudo gitlab-ctl reconfigure
```
1. Check the correct binaries are linked:
```shell
sudo ls -al /opt/gitlab/embedded/bin | grep postgres
```
1. Check the cluster status on all nodes:
```shell
sudo gitlab-ctl patroni members
```
Repeat this procedure on the other replica if required.
## Issues with other components
If you're running into an issue with a component not outlined here, be sure to check the troubleshooting section of its specific documentation page:
- [Consul](../consul.md#troubleshooting-consul)
- [PostgreSQL](https://docs.gitlab.com/omnibus/settings/database.html#troubleshooting)