DocumentationpgBalancer Documentation
High Availability & Failover
Configure Health Checks
pgBalancer performs continuous health monitoring of backend PostgreSQL servers:
pgbalancer.conf Health Check Settings
# Health Check Configuration
health_check_period = 5 # Run every 5 seconds
health_check_timeout = 3 # Timeout per check
health_check_user = 'pgbalancer'
health_check_password = 'HealthPass123'
health_check_database = 'postgres'
# Connection retry policy
connect_timeout = 3
connect_retries = 5
retry_interval = 2
# Health check query
health_check_query = 'SELECT 1'
# Enable TCP keepalives
backend_tcp_keepalive = onConfigure Watchdog Failover
Watchdog coordinates pgBalancer instances for cluster-wide failover:
Watchdog settings
# Watchdog configuration
watchdog_enabled = on
watchdog_port = 9996
watchdog_monitoring_user = 'pgbalancer_watchdog'
watchdog_monitoring_password = 'WatchdogPass!'
# Node configuration
wd_hostname = 'balancer-01'
wd_port = 9000
wd_priority = 100
wd_authkey = 'supersecret'
# Partner nodes
other_wd_nodes = 'balancer-02:9000,balancer-03:9000'
quorum = 2
# Virtual IP management
delegate_ip = '10.0.0.50'
delegate_ip_interface = 'eth0'Tip: Set
wd_priority higher on the node that should become the primary during normal operation.Automatic Failover
Customize automatic failover thresholds and promotion strategy:
Failover policy
# Automatic failover
failover_on_backend_error = on
failover_timeout = 10
failover_wait_time = 5
failover_when_quorum_lost = off
# Promotion script
failover_command = '/usr/local/bin/pgbalancer-promote.sh %d'
# Fallback script for demoted primary
failback_command = '/usr/local/bin/pgbalancer-failback.sh %d'Failover script example
#!/bin/bash
NODE_ID=$1
/usr/bin/ssh postgres@db${NODE_ID} "touch /tmp/promoted"
/usr/local/bin/pg_ctlcluster 16 main promoteReplication Lag Monitoring
Monitor downstream replication lag to avoid cascading failures:
Lag monitoring query
SELECT
backend_hostname,
state,
sent_lsn::text,
write_lsn::text,
flush_lsn::text,
replay_lag
FROM pgbalancer_replication_lag
ORDER BY replay_lag DESC;Chaos Testing & Validation
Verify failover behavior with automated chaos testing:
Simulate failure
# Simulate primary outage
sudo systemctl stop postgresql@16-main
# Observe failover logs
journalctl -u pgbalancer -fOperational Checklist
- • Confirm health check users exist on every backend
- • Validate virtual IP failover works across nodes
- • Monitor watchdog logs for quorum changes
- • Document manual failover procedure for on-call teams
- • Rehearse failback after maintenance