pgBalancer Documentation

High Availability & Failover

Configure Health Checks

pgBalancer performs continuous health monitoring of backend PostgreSQL servers:

pgbalancer.conf Health Check Settings

# Health Check Configuration
health_check_period = 5             # Run every 5 seconds
health_check_timeout = 3            # Timeout per check
health_check_user = 'pgbalancer'
health_check_password = 'HealthPass123'
health_check_database = 'postgres'

# Connection retry policy
connect_timeout = 3
connect_retries = 5
retry_interval = 2

# Health check query
health_check_query = 'SELECT 1'

# Enable TCP keepalives
backend_tcp_keepalive = on
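
To get a feel for how the timing values above interact, the sketch below estimates a rough upper bound on how long a dead backend may go undetected. The formula is an assumption about how the settings combine (one initial attempt plus connect_retries retries, each bounded by health_check_timeout, with retry_interval between attempts, after waiting up to one health_check_period). It is not confirmed pgBalancer behavior, and worst_case_detection_seconds is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Rough upper bound, in seconds, before a dead backend is marked down.
# ASSUMPTION (not confirmed by the pgBalancer docs): one initial attempt plus
# <retries> retries, each bounded by <timeout>, with <interval> seconds between
# attempts, after waiting up to one full <period> for the check to start.
worst_case_detection_seconds() {
  local period=$1 timeout=$2 retries=$3 interval=$4
  echo $(( period + (retries + 1) * timeout + retries * interval ))
}

# Values from the configuration above: 5 + 6*3 + 5*2
worst_case_detection_seconds 5 3 5 2   # prints 33
```

If that bound is longer than the outage you can tolerate, lowering connect_retries or retry_interval is the most direct lever under this model.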

Configure Watchdog Failover

Watchdog coordinates pgBalancer instances for cluster-wide failover:

Watchdog settings

# Watchdog configuration
watchdog_enabled = on
watchdog_port = 9996
watchdog_monitoring_user = 'pgbalancer_watchdog'
watchdog_monitoring_password = 'WatchdogPass!'

# Node configuration
wd_hostname = 'balancer-01'
wd_port = 9000
wd_priority = 100
wd_authkey = 'supersecret'

# Partner nodes
other_wd_nodes = 'balancer-02:9000,balancer-03:9000'
quorum = 2

# Virtual IP management
delegate_ip = '10.0.0.50'
delegate_ip_interface = 'eth0'

Tip: Set wd_priority higher on the node that should become the primary during normal operation.
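
The quorum value above lines up with a strict majority of a three-node watchdog cluster. As a sanity check when adding or removing nodes, the sketch below derives the majority from an other_wd_nodes string. It assumes quorum is meant to be a simple majority of all watchdog nodes (this node plus its partners); majority_quorum is a hypothetical helper, not part of pgBalancer.

```bash
#!/usr/bin/env bash
# Majority quorum for a watchdog cluster: floor(n / 2) + 1, where n counts
# this node plus every entry in the comma-separated other_wd_nodes value.
# ASSUMPTION: pgBalancer's quorum setting is intended as a strict majority.
majority_quorum() {
  local other_nodes=$1
  local -a peers=()
  # Split the peer list on commas; an empty string means no partner nodes.
  if [[ -n "$other_nodes" ]]; then
    IFS=',' read -r -a peers <<< "$other_nodes"
  fi
  local total=$(( ${#peers[@]} + 1 ))
  echo $(( total / 2 + 1 ))
}

majority_quorum 'balancer-02:9000,balancer-03:9000'   # prints 2
```

An odd number of watchdog nodes is the usual choice, since adding a fourth node raises the majority to 3 without tolerating any additional failures.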

Automatic Failover

Customize automatic failover thresholds and promotion strategy:

Failover policy

# Automatic failover
failover_on_backend_error = on
failover_timeout = 10
failover_wait_time = 5
failover_when_quorum_lost = off

# Promotion script
failover_command = '/usr/local/bin/pgbalancer-promote.sh %d'

# Failback script for the demoted primary
failback_command = '/usr/local/bin/pgbalancer-failback.sh %d'

Failover script example

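#!/bin/bash
# Promote the PostgreSQL cluster on the node whose ID is passed as the first argument.
set -euo pipefail
NODE_ID="${1:?usage: $0 <node-id>}"
# Run the promotion on the database host itself (not on the balancer), then
# leave a marker file. Assumes hosts are named db<node-id> with SSH key access.
/usr/bin/ssh "postgres@db${NODE_ID}" \
  "/usr/local/bin/pg_ctlcluster 16 main promote && touch /tmp/promoted"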

Replication Lag Monitoring

Monitor downstream replication lag to avoid cascading failures:

Lag monitoring query

SELECT
  backend_hostname,
  state,
  sent_lsn::text,
  write_lsn::text,
  flush_lsn::text,
  replay_lag
FROM pgbalancer_replication_lag
ORDER BY replay_lag DESC;
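
To turn the query above into an alert, its output can be filtered against a threshold. The sketch below is one way to do that in a shell script. It assumes replay_lag is an interval shorter than one day that psql renders as HH:MM:SS with optional fractional seconds; lagging_backends is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Print the hosts whose replay lag exceeds a threshold given in whole seconds.
# Reads "hostname|HH:MM:SS[.ffffff]" lines on stdin.
# ASSUMPTION: replay_lag is an interval under one day, rendered as HH:MM:SS.
lagging_backends() {
  local threshold=$1 host lag h m s
  while IFS='|' read -r host lag; do
    [[ -z "$lag" ]] && continue          # NULL lag: nothing to compare
    IFS=':' read -r h m s <<< "$lag"
    s=${s%%.*}                           # drop fractional seconds
    # 10# forces base 10 so values such as "08" are not parsed as octal
    if (( 10#$h * 3600 + 10#$m * 60 + 10#$s > threshold )); then
      echo "$host"
    fi
  done
}
```

A possible invocation pipes unaligned, tuples-only psql output into the function, for example: psql -At -F'|' -c "SELECT backend_hostname, replay_lag FROM pgbalancer_replication_lag" | lagging_backends 30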

Chaos Testing & Validation

Verify failover behavior by deliberately taking the primary down and watching how pgBalancer responds:

Simulate failure

# Simulate primary outage
sudo systemctl stop postgresql@16-main

# Observe failover logs
journalctl -u pgbalancer -f
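
Watching the logs shows that failover happened, but not how long clients were locked out. The sketch below polls a probe command once per second and reports the elapsed time until it succeeds. time_until_ready is a hypothetical helper, and the port in the usage comment is an assumption; substitute whatever port pgBalancer listens on in your setup.

```bash
#!/usr/bin/env bash
# Poll a probe command once per second and print how many seconds passed
# before it first succeeded. Returns 1 if it never succeeds within the limit.
# Usage: time_until_ready <max_seconds> <command> [args...]
time_until_ready() {
  local max=$1; shift
  local start=$SECONDS
  until "$@" >/dev/null 2>&1; do
    if (( SECONDS - start >= max )); then
      echo "timeout after ${max}s" >&2
      return 1
    fi
    sleep 1
  done
  echo $(( SECONDS - start ))
}

# Example (port 5432 is an assumption): start this right after stopping the
# primary, and wait up to 60s for the virtual IP to accept connections again.
# time_until_ready 60 pg_isready -h 10.0.0.50 -p 5432 -t 1
```

Recording this number for each rehearsal gives a baseline to compare against after tuning the health check and failover settings.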

Operational Checklist

  • Confirm health check users exist on every backend
  • Validate virtual IP failover works across nodes
  • Monitor watchdog logs for quorum changes
  • Document manual failover procedure for on-call teams
  • Rehearse failback after maintenance