pgraft Architecture

Core Components

Extension Layer (C)

Hooks into PostgreSQL shared memory, background worker registration, and WAL APIs. Exposes SQL functions and views for cluster management. Responsible for bridging the executor with Raft state transitions.

Raft Worker (Go)

Compiled as pgraft_go.so, it runs inside PostgreSQL's background worker framework. Implements Raft log replication, leader election, snapshotting, and network RPCs. Communicates with the C layer through shared memory rings.

Shared Metadata Store

Stores durable Raft metadata in PostgreSQL catalog tables (pg_catalog.pgraft_*): node identity, membership, and configuration parameters. Because this state lives in catalogs, it persists across restarts the same way DDL changes do.

SQL Management API

Functions such as pgraft_add_node, pgraft_get_cluster_status, and pgraft_log_get_stats provide a single control surface for DBAs and automation pipelines.

Replication & Consensus Flow

1. Command submission. Application writes are accepted by the leader through standard SQL. pgraft intercepts commit records and enqueues them for Raft replication before acknowledging the transaction.
2. Raft log append. The Go worker packages WAL descriptions into Raft log entries and sends AppendEntries RPCs to followers. Followers persist entries to disk and respond with acknowledgement and match indexes.
3. Commit acknowledgement. Once a quorum confirms persistence, the leader marks the entry as committed and unblocks SQL backends. Followers apply committed entries to the local database.
4. Snapshotting. When the log reaches the configured threshold, pgraft creates base backups of Raft state to allow log truncation. Followers that fall far behind can install snapshots instead of replaying every entry.
5. Failure detection. Followers monitor leader heartbeats. If a follower receives no heartbeat within pgraft.election_timeout, it starts a new election, and leadership passes to the winner.
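
Step 4 can be illustrated with a toy in-memory log. This is a minimal sketch of generic Raft log compaction, not pgraft's implementation: the RaftLog class, its method names, and SNAPSHOT_THRESHOLD are hypothetical and only stand in for the configured threshold mentioned above.

```python
from dataclasses import dataclass, field


@dataclass
class RaftLog:
    """Toy Raft log; entry indexes are 1-based, as in the Raft paper."""
    snapshot_index: int = 0                      # last index covered by the snapshot
    entries: list = field(default_factory=list)  # entries after snapshot_index

    @property
    def last_index(self) -> int:
        return self.snapshot_index + len(self.entries)

    def append(self, command: str) -> int:
        self.entries.append(command)
        return self.last_index

    def compact(self, applied_index: int) -> None:
        """Fold every entry up to applied_index into the snapshot."""
        if not self.snapshot_index < applied_index <= self.last_index:
            return  # nothing new to compact, or the index is not in the log
        drop = applied_index - self.snapshot_index
        self.entries = self.entries[drop:]
        self.snapshot_index = applied_index

    def needs_snapshot(self, follower_next_index: int) -> bool:
        """True when the entry the follower needs next was already truncated."""
        return follower_next_index <= self.snapshot_index


SNAPSHOT_THRESHOLD = 4  # hypothetical value standing in for the configured threshold

log = RaftLog()
for i in range(6):
    log.append(f"cmd-{i + 1}")

if len(log.entries) >= SNAPSHOT_THRESHOLD:
    # Assumption for the example: entries 1..5 are committed and applied.
    log.compact(applied_index=5)
```

After compaction only the sixth entry remains in the log. A follower whose next index is 3 would have to install the snapshot, while a follower at index 6 can still be served by ordinary AppendEntries.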

Background Workers

Raft Coordinator

Primary Go worker that hosts the Raft state machine, handles RPC scheduling, and drives elections. Exactly one coordinator runs per PostgreSQL instance.

Apply Dispatcher

Lightweight C worker that reads committed log entries from shared queues and applies them to the local database using SPI.

Health Monitor

Periodically samples metrics (latency, lag, quorum) and refreshes the data returned by pgraft_get_cluster_status(). Emits NOTIFY events for alerting systems.

Metadata & Persistence

pgraft stores cluster metadata inside PostgreSQL to remain consistent with regular backups and point-in-time recovery:

pg_catalog.pgraft_nodes: Node identity, current state, Raft match indexes, and networking endpoints.
pg_catalog.pgraft_config: Persisted configuration overrides saved via pgraft_set_config.
pg_catalog.pgraft_events: Append-only log of elections, membership changes, and snapshots for audit trails.
pg_catalog.pgraft_stats: Rolling aggregates for log throughput, RPC success rates, and latency histograms.

Networking Model

Every node listens on a dedicated Raft port specified by pgraft.port. Nodes exchange protobuf-encoded RPCs on this port, optionally over TLS, independent of client connections on the PostgreSQL port (5432 by default). This separation keeps OLTP traffic and consensus messages from interfering with each other.

Leader Responsibilities

Accepts SQL writes and appends Raft entries.
Broadcasts heartbeats every pgraft.heartbeat_interval ms.
Tracks follower match indexes to decide commit progress.
Serves pgraft_get_cluster_status() responses to operators.
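
How match indexes translate into commit progress can be sketched as follows. This is the standard Raft majority rule written as a standalone illustration; the function names and the dictionary shape are assumptions, not pgraft's API. Full Raft additionally requires that the entry at the resulting index belongs to the leader's current term, which this sketch omits.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of nodes that forms a majority."""
    return cluster_size // 2 + 1


def commit_index(leader_last_index: int, follower_match: dict) -> int:
    """Highest log index that is persisted on a majority of nodes.

    follower_match maps a follower name to its match index. The leader
    counts itself using its own last log index.
    """
    persisted = sorted([leader_last_index, *follower_match.values()], reverse=True)
    # After sorting in descending order, the value at position quorum-1
    # is stored on at least `quorum` nodes.
    return persisted[quorum_size(len(persisted)) - 1]


# Three nodes: the leader and one follower hold index 10, so 10 is committed.
three_node = commit_index(10, {"node2": 10, "node3": 7})

# Five nodes: only index 9 is present on three of the five nodes.
five_node = commit_index(12, {"node2": 12, "node3": 9, "node4": 9, "node5": 4})
```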

Follower Responsibilities

Persist incoming entries to the Raft log and acknowledge to the leader.
Apply committed entries to the local database through the dispatcher worker.
Participate in elections and campaign when leadership is lost.
Optionally accept read traffic when pgraft.read_consistency permits.
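
The point at which a follower decides that leadership is lost can be sketched with a small timer. This is an illustrative model only: the ElectionTimer class is hypothetical, the base value stands in for pgraft.election_timeout, and the choice of randomizing between one and two times the base is an assumption borrowed from common Raft practice rather than documented pgraft behavior.

```python
import random


class ElectionTimer:
    """Tracks when a follower should give up on the leader and campaign."""

    def __init__(self, base_timeout_ms: float, now_ms: float, rng=None):
        self._base = base_timeout_ms
        self._rng = rng or random.Random()
        self.reset(now_ms)

    def reset(self, now_ms: float) -> None:
        """Call on every heartbeat; picks a fresh randomized deadline."""
        # Deadline lies between base and 2 * base after the heartbeat
        # (assumed range), so followers rarely time out at the same moment.
        self._deadline = now_ms + self._base + self._rng.uniform(0, self._base)

    def expired(self, now_ms: float) -> bool:
        """True once no heartbeat has arrived before the deadline."""
        return now_ms >= self._deadline


timer = ElectionTimer(base_timeout_ms=1000, now_ms=0, rng=random.Random(42))
still_following = not timer.expired(now_ms=999)   # inside the base timeout
timer.reset(now_ms=1500)                          # heartbeat received at t=1500
would_campaign = timer.expired(now_ms=3500)       # no heartbeat for 2 * base
```

Time is passed in explicitly so the sketch stays deterministic and easy to test; a real worker would read a monotonic clock instead.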

Observability & Alerting

pgraft emits metrics via SQL views and optional Prometheus exporters. Recommended alerts include:

Leader loss: pgraft_get_leader() returns NULL for more than 15 seconds.
Replication lag: replication_lag_bytes exceeds 64 MB for any node.
Election churn: More than three leadership changes within 10 minutes.
Snapshot backlog: pgraft_log_get_stats().pending_snapshots stays non-zero for 5 minutes.
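
The recommended alerts above can be expressed as simple rules over sampled values. The sketch below is a hedged illustration: the sample shapes (lists of timestamped values, a dictionary of per-node lag) and all function names are assumptions, it does not query pgraft itself, and it interprets 64 MB as 64 * 1024 * 1024 bytes.

```python
LAG_LIMIT_BYTES = 64 * 1024 * 1024  # assumption: "64 MB" read as 64 MiB


def sustained_for(samples, is_bad, now_s, limit_s):
    """True if is_bad has held continuously for more than limit_s seconds.

    samples is a list of (timestamp_s, value) pairs, oldest first.
    """
    bad_since = None
    for ts, value in samples:
        if is_bad(value):
            if bad_since is None:
                bad_since = ts
        else:
            bad_since = None  # condition cleared, restart the clock
    return bad_since is not None and now_s - bad_since > limit_s


def leader_lost(leader_samples, now_s):
    """Leader loss: sampled leader is None for more than 15 seconds."""
    return sustained_for(leader_samples, lambda v: v is None, now_s, 15)


def snapshot_backlog(pending_samples, now_s):
    """Snapshot backlog: pending snapshots stay non-zero for 5 minutes."""
    return sustained_for(pending_samples, lambda v: v > 0, now_s, 300)


def lagging_nodes(lag_by_node):
    """Replication lag: nodes whose lag exceeds the limit."""
    return sorted(n for n, lag in lag_by_node.items() if lag > LAG_LIMIT_BYTES)


def election_churn(change_times_s, now_s, window_s=600, limit=3):
    """Election churn: more than `limit` leadership changes in the window."""
    recent = [t for t in change_times_s if now_s - window_s < t <= now_s]
    return len(recent) > limit


# Hypothetical samples: the leader disappeared at t=10 and is still missing at t=30.
alert = leader_lost([(0, "node1"), (10, None), (20, None), (30, None)], now_s=30)
```

Each rule takes the current time as an argument, which keeps the logic independent of how the samples are collected, whether from the SQL views or from a Prometheus exporter.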

Failure Handling

pgraft adheres to Raft's safety properties:

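Election safety: A candidate needs votes from a majority of nodes, so at most one leader can be elected per term, even under network partitions. Randomized election timeouts make split votes rare.
Log matching: Followers reject entries whose preceding index and term do not match their own log, so two logs that agree on an entry also agree on every entry before it.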
Commit safety: Only log entries replicated to a majority are acknowledged to clients.
Snapshot installation: Slow nodes receive state snapshots to rejoin without replaying the full log.
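
The log-matching rule can be sketched as the consistency check a follower runs on each AppendEntries request. This follows the generic rule from the Raft paper and is not taken from pgraft's source; the function name and the (term, command) tuple layout are assumptions made for the example.

```python
def append_entries(log, prev_index, prev_term, entries):
    """Apply an AppendEntries request to a follower log in place.

    log and entries are lists of (term, command) tuples; indexes are
    1-based, and prev_index 0 means "append from the start of the log".
    Returns True if the request was accepted, False if it was rejected.
    """
    # Reject if the follower has no entry at prev_index ...
    if prev_index > len(log):
        return False
    # ... or if its entry at prev_index was written in a different term.
    if prev_index > 0 and log[prev_index - 1][0] != prev_term:
        return False

    for offset, entry in enumerate(entries):
        pos = prev_index + offset  # 0-based list position of this entry
        if pos < len(log):
            if log[pos][0] != entry[0]:
                # Conflict: drop this entry and everything after it.
                del log[pos:]
                log.append(entry)
            # Same term at the same index: entry already present, keep it.
        else:
            log.append(entry)
    return True


follower = [(1, "a"), (1, "b"), (2, "c")]

# The leader's log diverges after index 2, so the stale entry (2, "c") is replaced.
accepted = append_entries(follower, prev_index=2, prev_term=1,
                          entries=[(3, "x"), (3, "y")])

# A request that assumes entries the follower does not have is rejected.
rejected = append_entries(follower, prev_index=9, prev_term=3, entries=[(3, "z")])
```

On rejection the leader would retry with an earlier prev_index until the logs agree, or fall back to a snapshot if the needed entries have been truncated.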
