pgraft Raft Protocol
Consensus Overview
Raft keeps pgraft clusters converged on the same WAL-derived state, even when nodes fail. One leader accepts client writes, followers replicate its entries, and terms number successive leadership epochs. When leadership changes, only a candidate whose log is at least as up to date as a majority of the cluster can win, so committed entries are never lost or committed twice.
Leader election
Randomized election timeouts make it likely that a single candidate wins each term. Heartbeats advertise leadership and reset follower timers.
Log replication
Write-ahead log summaries propagate via AppendEntries RPCs. Followers persist and acknowledge entries sequentially.
Consensus safety
Quorum commits, term tracking, and log matching prevent divergent history while keeping read replicas available.
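As a worked example of the quorum arithmetic behind these guarantees, the sketch below computes the majority size and the number of node failures a cluster can absorb. The function names are illustrative and not part of pgraft.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """Nodes that can be down while a quorum is still reachable."""
    return cluster_size - quorum_size(cluster_size)


for n in (3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
# 3 2 1
# 4 3 1
# 5 3 2
```

A four-node cluster tolerates no more failures than a three-node one, which is why odd cluster sizes are the usual choice.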
Node Roles
Leader
Accepts SQL writes, appends them to the Raft log, and streams AppendEntries to followers until a quorum confirms.
Leadership probe
SELECT pgraft_is_leader();
Follower
Receives AppendEntries, persists entries that are not yet committed, and applies changes once commit_index advances.
Follower heartbeat lag
SELECT node_id,
       last_heartbeat_ms
FROM pgraft_get_nodes()
WHERE node_id = 2;  -- example follower node ID; pg_backend_pid() returns a backend process ID, not a node ID
Candidate
Transitional role triggered when heartbeats stop. Requests votes and upgrades to leader after majority approval.
Election metrics
SELECT current_term,
       elections_triggered
FROM pgraft_get_cluster_status()
LIMIT 1;
Leader Election Flow
Each node runs an independent timer between heartbeats. Randomized offsets avoid simultaneous elections, while vote rules enforce log freshness.
Election timeout
Follower receives no heartbeat within the configured election timeout (default 1000 ms).
Candidate declaration
Follower increments the term, becomes candidate, and votes for itself.
Vote solicitation
Candidate issues RequestVote RPCs to the rest of the cluster. Each node grants its vote only if the candidate's log is at least as up to date as its own.
Leader promotion
Candidate becomes leader after receiving a majority of votes and immediately begins heartbeats.
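The steps above can be sketched in a few lines of Python. This is a minimal illustration of standard Raft election rules, not pgraft's implementation: the `VoterState` fields, the function names, and the 500 ms jitter window are assumptions made for the example. Only the 1000 ms default comes from the text above.

```python
import random
from dataclasses import dataclass
from typing import Optional


def next_election_timeout_ms(base_ms: int = 1000, jitter_ms: int = 500) -> int:
    """Pick a randomized timeout in [base_ms, base_ms + jitter_ms].

    jitter_ms is a hypothetical value; randomization keeps nodes from
    timing out (and starting elections) at the same instant.
    """
    return base_ms + random.randint(0, jitter_ms)


@dataclass
class VoterState:
    current_term: int
    voted_for: Optional[int]
    last_log_term: int
    last_log_index: int


def grant_vote(voter: VoterState, candidate_id: int, candidate_term: int,
               cand_last_log_term: int, cand_last_log_index: int) -> bool:
    """Decide whether a node grants its vote to a candidate."""
    if candidate_term < voter.current_term:
        return False  # stale candidate
    if candidate_term > voter.current_term:
        # A newer term resets any vote cast in an older term.
        voter.current_term = candidate_term
        voter.voted_for = None
    if voter.voted_for not in (None, candidate_id):
        return False  # at most one vote per term
    # Log freshness: compare the last entry's term first, then its index.
    log_ok = (cand_last_log_term, cand_last_log_index) >= (
        voter.last_log_term, voter.last_log_index)
    if log_ok:
        voter.voted_for = candidate_id
    return log_ok


def wins_election(votes_received: int, cluster_size: int) -> bool:
    """A candidate needs a strict majority, counting its own vote."""
    return votes_received > cluster_size // 2
```

Comparing `(term, index)` tuples mirrors the freshness rule: a log ending in a later term wins, and within the same term the longer log wins.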
Manual leadership transfer
-- Redirect leadership before maintenance
SELECT pgraft_transfer_leadership(2);
SELECT pgraft_get_leader();
Log Replication Lifecycle
Leaders batch WAL summaries into Raft entries, replicate them concurrently, and advance commit indexes once a quorum acknowledges.
Replication vitals
SELECT node_id,
       match_index,
       next_index,
       commit_index,
       lag_entries,
       state
FROM pgraft_log_get_replication_status()
ORDER BY lag_entries DESC;
Commit index
Highest log entry known to be replicated on a majority of nodes. Entries at or below this index are durable.
Last applied
Highest log entry applied to the PostgreSQL state machine. Followers eventually converge to the leader’s commit index.
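To make the relationship between the two indexes concrete, the following sketch derives a commit index from per-follower match indexes and lists the entries a node still has to apply. It is an illustration of the general Raft rule under simplifying assumptions, not pgraft code: the function names are hypothetical, and the additional Raft requirement that the entry at the new commit index belongs to the leader's current term is left out for brevity.

```python
from typing import Dict


def quorum_commit_index(leader_last_index: int,
                        follower_match_index: Dict[int, int]) -> int:
    """Highest index stored on a majority of nodes, leader included."""
    replicated = sorted(
        [leader_last_index, *follower_match_index.values()], reverse=True)
    majority = len(replicated) // 2 + 1
    # The majority-th largest value is held by at least `majority` nodes.
    return replicated[majority - 1]


def entries_to_apply(last_applied: int, commit_index: int) -> range:
    """Indexes that are committed but not yet applied to the state machine."""
    return range(last_applied + 1, commit_index + 1)


commit = quorum_commit_index(10, {2: 9, 3: 4})
print(commit)                             # 9
print(list(entries_to_apply(7, commit)))  # [8, 9]
```

In the three-node example, index 9 is on the leader and on node 2, so it is committed even though node 3 is still at index 4.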
Consensus Safety
Raft’s invariants guarantee that clients never observe divergent history, even if the cluster re-elects multiple leaders during failover or partition events.
Election safety
At most one leader can be elected in a given term, preventing split-brain.
Leader append-only
A leader never overwrites or deletes entries in its own log; it only appends new ones, so committed history stays immutable.
Log matching
If two logs contain an entry with the same index and term, the logs are identical in all entries up through that index.
Leader completeness
An entry committed in a given term is present in the log of every leader of all higher-numbered terms.
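The log matching property is enforced by the consistency check a follower runs on each AppendEntries request. The sketch below shows that check as described in the Raft algorithm, using 1-based log indexes. The `Entry` type and function signature are assumptions for illustration and do not reflect pgraft internals.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Entry:
    term: int
    payload: str


def append_entries(log: List[Entry], prev_index: int, prev_term: int,
                   entries: List[Entry]) -> bool:
    """Apply an AppendEntries request to a follower's log in place.

    prev_index is 1-based; 0 means "before the first entry".
    Returns False when the follower must reject the request.
    """
    if prev_index > len(log):
        return False  # follower is missing the preceding entry
    if prev_index > 0 and log[prev_index - 1].term != prev_term:
        return False  # preceding entry disagrees with the leader
    for offset, entry in enumerate(entries):
        pos = prev_index + offset  # 0-based slot for this entry
        if pos < len(log):
            if log[pos].term != entry.term:
                del log[pos:]      # drop the conflicting suffix
                log.append(entry)
            # same term at same index: entry already present, keep it
        else:
            log.append(entry)
    return True
```

A rejected request leaves the follower's log untouched; the leader then retries with an earlier `prev_index` until the logs agree.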
Tuning Consensus Timing
Adjust timeouts and batching to match latency budgets. Keep election_timeout roughly 10× the heartbeat interval to avoid premature elections under load.
postgresql.conf
# postgresql.conf (excerpt)
pgraft.heartbeat_interval = '75ms'
pgraft.election_timeout = '900ms'
pgraft.append_batch_size = 512
pgraft.snapshot_threshold = 8000
Verify effective values
SELECT name,
       setting,
       source
FROM pgraft_get_config()
WHERE name IN ('heartbeat_interval', 'election_timeout', 'append_batch_size', 'snapshot_threshold');
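The 10× guideline can be checked with a small helper before rolling out new values. This is a hypothetical sketch: the parser handles only the `ms`, `s`, and `min` suffixes, and it assumes the settings are written as duration strings like those in the excerpt above.

```python
import re

_UNITS_MS = {"ms": 1, "s": 1000, "min": 60_000}


def to_ms(value: str) -> int:
    """Convert a duration such as '75ms' or '2s' to milliseconds."""
    match = re.fullmatch(r"\s*(\d+)\s*(ms|s|min)\s*", value)
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    return int(match.group(1)) * _UNITS_MS[match.group(2)]


def timeout_ratio(heartbeat_interval: str, election_timeout: str) -> float:
    """How many heartbeat intervals fit into one election timeout."""
    return to_ms(election_timeout) / to_ms(heartbeat_interval)


print(timeout_ratio("75ms", "900ms"))  # 12.0
```

With the values from the excerpt, a follower would have to miss about twelve consecutive heartbeats before starting an election.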