This template deploys a production-ready, highly-available PostgreSQL cluster on Railway with automatic failover, read replicas, and transparent connection routing.
- 3-node PostgreSQL cluster with streaming replication
- Automatic failover in <10 seconds using Patroni
- 2 standby replicas for automatic failover and read scaling
- HAProxy load balancer with separate read-write and read-only endpoints
- Transparent routing - single endpoint for all database connections
- Automatic recovery - failed nodes rejoin as replicas
- Health monitoring - Patroni REST API for cluster status
- Multi-version support - PostgreSQL 14, 15, 16, 17, and 18
Application
↓
HAProxy ← Load balanced endpoint
├─ :5432 (read-write) → Primary only
└─ :5433 (read-only) → Replicas (round-robin)
↓
PostgreSQL Cluster
├─ postgres-1 (Leader) ← Writes + Reads
├─ postgres-2 (Standby) ← Reads + Failover ready
└─ postgres-3 (Standby) ← Reads + Failover ready
↓
etcd (3 nodes) ← Distributed consensus
- postgres-1, postgres-2, postgres-3 - PostgreSQL with Patroni orchestration
- etcd-1, etcd-2, etcd-3 - Distributed key-value store for leader election
- haproxy - Load balancer with automatic primary detection
Total: 7 services
The postgres-patroni image ships with pgBackRest
installed but dormant. When WAL_ARCHIVE_BUCKET is unset, the image
behaves identically to a vanilla HA Postgres image — no archiving, no
config mutation. When set, the current Patroni leader archives WAL
continuously to S3-compatible storage in async mode
(archive_mode=on); standbys carry the same config + binary so promotion
instantly enables archiving with no config change.
The image reads a tool-agnostic WAL_ARCHIVE_* / WAL_RECOVER_FROM_*
env contract. patroni-runner translates WAL_ARCHIVE_* to pgBackRest's
native PGBACKREST_REPO1_S3_* so the main pgbackrest.conf's repo1
always = the cluster's own archive bucket. WAL_RECOVER_FROM_* is
materialised in a separate /etc/pgbackrest/pgbackrest-recovery-source.conf
(only that file references the source bucket) and is NEVER exported as
env vars — pgBackRest's option resolution is command-line > env > config
default, so a global env export would silently override the
--configwe pass during recovery and could leak archive-push to the source bucket. With this isolation:
WAL_ARCHIVE_*only → standalone archiving cluster. Main conf's repo1 = archive bucket.WAL_RECOVER_FROM_*only → cluster restoring from a source bucket; recovery-source conf has the source. After promote the cluster runs as plain non-archiving Postgres.WAL_ARCHIVE_*+WAL_RECOVER_FROM_*→ restored cluster with PITR re-enabled. Main conf's repo1 = own archive bucket (post-promote archive-push); recovery-source conf has source bucket (archive-get during replay only). Source and own bucket never appear in the same config file, so archive-push from the fork can never spray into the source bucket.
Operator-facing env contract:
| Env var | Purpose |
|---|---|
WAL_ARCHIVE_BUCKET |
bucket name — gates archiving on this cluster |
WAL_ARCHIVE_ENDPOINT |
S3-compatible endpoint (e.g. fly.storage.tigris.dev) |
WAL_ARCHIVE_REGION |
bucket region |
WAL_ARCHIVE_KEY / WAL_ARCHIVE_SECRET |
bucket credentials |
WAL_ARCHIVE_PATH |
path prefix where archive-push writes (default /pgbackrest) |
WAL_RECOVER_FROM_BUCKET / _ENDPOINT / _REGION / _KEY / _SECRET / _PATH |
source-bucket coordinates on a PITR-restored cluster; archive-get reads source WAL from here during replay. Set by backboard on restore; not normally a manual knob. |
POSTGRES_RECOVERY_TARGET_TIME |
ISO 8601 timestamp; stages archive-recovery replay on next start |
POSTGRES_ARCHIVE_TIMEOUT |
seconds Postgres waits before forcing a WAL switch (default 60) |
WAL_BACKUP_FULL_INTERVAL_HOURS |
image-owned full base-backup cadence (default 168 = weekly; 0 disables periodic fulls). Initial / gap-recovery fulls fire regardless. |
WAL_BACKUP_DIFF_INTERVAL_HOURS |
image-owned differential base-backup cadence (default 24; 0 disables) |
WAL_BACKUP_RETENTION_FULL |
full backups kept by pgbackrest expire (default 4) |
WAL_BACKUP_RETENTION_DIFF |
differentials kept by pgbackrest expire (default 14) |
WAL_HEARTBEAT_DISABLED |
set to 1 to disable the idle-DB WAL heartbeat (advanced; reduces archive cost on quiet DBs at the price of stale PITR ceiling) |
Image-level tuning knobs (pgBackRest-native, internal):
| Env var | Purpose |
|---|---|
WAL_DROP_THRESHOLD_MB |
pg_wal/ size at which the archive-push wrapper drops failing segments to keep Postgres running (default 500). Outside the PGBACKREST_* namespace because pgBackRest treats unknown PGBACKREST_* vars as config options and warns about them on every push. |
PGBACKREST_ARCHIVE_PUSH_PROCESS_MAX |
parallel workers for archive-push. Default auto-sized as clamp(cpus/8, 2, 8). |
PGBACKREST_ARCHIVE_GET_PROCESS_MAX |
parallel workers for archive-get. Default 1 (WAL replay is serial). |
PGBACKREST_BACKUP_PROCESS_MAX |
parallel workers for backup. Default auto-sized as clamp(cpus/4, 1, 16). |
PGBACKREST_RESTORE_PROCESS_MAX |
parallel workers for restore. Default auto-sized as clamp(cpus, 1, 32). |
When WAL_ARCHIVE_BUCKET is set, patroni-runner writes
archive_mode=on, archive_command='/usr/local/bin/pgbackrest-archive-push-wrapper.sh %p',
and archive_timeout (default 60, override via POSTGRES_ARCHIVE_TIMEOUT)
into the Patroni-generated cluster config, and renders
/etc/pgbackrest/pgbackrest.conf with operator-policy defaults
(archive-async=y, archive-push-queue-max=5GiB, per-command
process-max sized off cgroup-detected vCPU, compress-type=zst,
spool-path=$PGDATA/pgbackrest-spool so segments staged but not yet
pushed survive container restarts).
Patroni propagates the same config to every node, but under
archive_mode=on only the current leader fires archive_command. Standbys
hold the pgBackRest binary and config so promotion instantly enables
archiving with no config re-push. Residual failover RPO is archive_timeout
(60s) plus failover-detection time; Patroni's promotion logic flushes any
pending archive_status/.ready markers on the new leader to close most of
that gap on planned switchover.
The never-halt guarantee comes from two orthogonal thresholds:
archive-push-queue-max=5GiB(image-baked) governs the spool. Trips on transient S3 stalls — async worker keeps retrying and most segments eventually land. Generous buffer to absorb multi-hour outages.WAL_DROP_THRESHOLD_MB=500(default) governspg_wal/when pgbackrest's foreground returns non-zero. Trips on hard failures (bad creds, deleted bucket) where retrying without operator intervention has zero chance of success. Thearchive_commandis the wrapper script (not pgbackrest directly), which measurespg_wal/on failure and drops segments past the threshold to keep Postgres up.
Either threshold tripping truncates the PITR window; the database keeps
running. This is the explicit reason this image uses pgBackRest instead
of wal-g — synchronous archive_command shapes halt Postgres on full
pg_wal, requiring manual intervention to recover.
patroni-runner runs pgbackrest stanza-create automatically once
Postgres becomes reachable — idempotent, runs from every node, gated on
WAL_ARCHIVE_BUCKET being set (skipped in dual-repo mode so we never
mutate the source's stanza).
Env-var changes on existing clusters: patroni-runner includes a
one-shot DCS reconcile that runs after Patroni's REST API comes up. When
WAL_ARCHIVE_BUCKET is set but DCS lacks archive params (or vice versa),
the reconcile patches DCS via PATCH /config so the env-var intent is
authoritative. Because archive_mode is PGC_POSTMASTER, applying the
patch flags pending_restart on the node — operators must run
patronictl restart <cluster> <node> (or trigger one redeploy per node
through Railway) for archiving to actually start or stop. The dashboard
PITR enable/disable flow handles the rolling restart automatically; raw
env-var users need one extra restart per node.
Per RFC, PITR restore creates a brand-new Postgres service in the project
(typically a single-node SSL Postgres, not a fresh HA cluster); the
source HA cluster stays online and untouched. The restored service's
volume is populated from the source's snapshot, then booted with
WAL_RECOVER_FROM_* pointing at the source's bucket and
POSTGRES_RECOVERY_TARGET_TIME set — archive-get reads source WAL
during replay. After promote, the restored service has no archive bucket
of its own and runs as plain non-archiving Postgres until the operator
opts in via the standard PITR-enable flow.
Source and restored services therefore never share a write path: there
is no risk of the recovered timeline overwriting the source's ongoing
WAL chain. The previous "mandatory repo-path divergence" guard and the
.pgbackrest_source_path sentinel are gone.
For backwards compatibility, when POSTGRES_RECOVERY_TARGET_TIME is set
on an HA primary volume directly, patroni-runner still creates
recovery.signal and writes the matching recovery settings into
postgresql.auto.conf before starting Patroni. The restore_command
references --config=/etc/pgbackrest/pgbackrest-recovery-source.conf
so archive-get during replay reads only the source bucket. Two
filesystem stamps coordinate "exactly once per successful promote":
.pitr_staging is written when replay is handed to Postgres,
.pitr_configured is written on the boot AFTER Postgres consumes
recovery.signal (which it only removes on successful promote). A
failed replay leaves .pitr_staging behind WITHOUT .pitr_configured —
fix env vars and restart, the next boot re-stages cleanly.
When WAL_ARCHIVE_BUCKET is set, patroni-runner spawns a leader-only
backup-watcher task that polls every 60s. Every iteration the watcher
asks the local Patroni REST API (GET /leader, 200 = leader, 503 =
replica) — replicas no-op the iteration, so only the current leader
runs backups. After failover the new leader's watcher takes over within
one poll cycle. pgBackRest's stanza locks are the second-line guarantee
against concurrent backups across cluster nodes.
The watcher runs pgbackrest backup against the archive bucket when
one of these conditions holds:
- Initial backup — no full has been recorded on this volume.
Triggers a
--type=fullimmediately; pgBackRest brackets the base inpg_backup_start/pg_backup_stopand waits for the closing WAL to archive before declaring success, so a brokenarchive_commandfails the backup loudly instead of producing an unrestorable base. - Gap recovery — either the archive-push wrapper dropped a segment
(touches
$PGDATA/.pgbackrest_gap_pending) orpg_stat_archiver.failed_countgrew since the last full. Once archive failures have been quiescent forWAL_BACKUP_GAP_RESOLVED_GRACE_SECONDS(default 300s), runs a fresh full so the PITR window resumes from the new base. The dropped segment itself remains unrestorable; everything from the new base forward is. - Periodic —
WAL_BACKUP_FULL_INTERVAL_HOURS(default 168 h / weekly) for fulls,WAL_BACKUP_DIFF_INTERVAL_HOURS(default 24 h) for differentials. Set either to0to disable that schedule.
Every iteration also emits a tiny non-transactional WAL record
(pg_logical_emit_message) so archive_timeout=60 flushes a segment
on idle DBs — without it the picker's "latest restorable" lags
wall-clock by the checkpoint interval (default 5 min) on quiet
services. Disable via WAL_HEARTBEAT_DISABLED=1.
State persists at $PGDATA/.pgbackrest_backup_state (key=value lines:
last_full_at, last_diff_at, last_full_failed_count). The
bucket-side pgbackrest --stanza=main info --output=json is the
canonical source of truth; the local file is a cache that survives
restarts.
Known gap: pgBackRest's archive-push-queue-max trip drops segments
without going through the archive-push wrapper and without
incrementing failed_count, so neither gap signal fires. Until log
parsing or LSN-lag detection lands, queue-max-trip gaps are sealed by
the next periodic full rather than promptly.
Each cluster archives under a sub-prefix derived from its
system_identifier:
${WAL_ARCHIVE_PATH}/cluster-<system_identifier>. The path is
persisted in $PGDATA/.pgbackrest_repo_path so the archive-push
wrapper, the backup watcher, pgbackrest stanza-create, and out-of-
band invocations (e.g. mono's PITR coverage probe over SSH) all
converge on the same value. patroni-runner rewrites the rendered
/etc/pgbackrest/pgbackrest.conf's repo1-path= line once
pg_control is on disk, so an SSH-driven pgbackrest info (which
inherits no shell env from the container) reads the per-cluster path
correctly.
Why per-cluster: a wipe-and-reuse-bucket cycle (operator drops the
data volume, redeploys against the same WAL_ARCHIVE_BUCKET) produces
a brand-new system_identifier from initdb. Without discrimination,
pgBackRest's stanza-create would refuse the new cluster on system-id
mismatch and the new cluster's WAL would never land — silent data loss
for any operator who didn't notice. With per-cluster paths, the new
cluster lands at cluster-<new_sysid>, the previous cluster's archive
stays at cluster-<old_sysid>, and both histories coexist.
WAL_RECOVER_FROM_PATH on a restored cluster must point at the
specific source-side cluster-<sysid> sub-prefix the user wants to
restore from — pgbackrest restore reads from one path. Backboard
discovers per-cluster sub-prefixes by listing the bucket and surfaces
them as separate "histories" in the restore UI.
For PITR-enabled clusters, pgbackrest expire is the sole WAL
retention authority — no Postgres GUCs, no bucket-side lifecycle
policy. Backup manifests pin the WAL needed to make each backup
restorable; expire releases both together when a backup ages out.
Earlier iterations proposed a bucket-side TTL as a safety net but it's
superfluous: any TTL shorter than expire's horizon would yank WAL out
from under live manifests, and any TTL ≥ that horizon is redundant.
pgbackrest expire runs automatically after each pgbackrest backup
the watcher invokes, removing fulls/diffs beyond
WAL_BACKUP_RETENTION_FULL / WAL_BACKUP_RETENTION_DIFF, plus the WAL
their manifests no longer pin. The default retention (full=4, diff=14,
weekly fulls + daily diffs) covers approximately a four-week PITR
window before the oldest full ages out.
- Click "Deploy Template" in Railway marketplace
- Configure variables (or use defaults):
POSTGRES_USER- Database username (default:railway)POSTGRES_PASSWORD- Auto-generated secure passwordPOSTGRES_DB- Database name (default:railway)
- Wait for deployment (~2-3 minutes)
- Connect using the provided
DATABASE_URL
Once deployed, connect to your database via HAProxy:
# Primary (read-write) - From Railway private network
postgresql://railway:${POSTGRES_PASSWORD}@haproxy.railway.internal:5432/railway
# Replicas (read-only) - For read scaling
postgresql://railway:${POSTGRES_PASSWORD}@haproxy.railway.internal:5433/railway
# From external (via TCP proxy)
postgresql://railway:${POSTGRES_PASSWORD}@${HAPROXY_TCP_PROXY_DOMAIN}/railwayHAProxy automatically routes connections:
- Port 5432: Routes to current Patroni leader (read-write)
- Port 5433: Load-balances across healthy replicas (read-only)
| Variable | Default | Description |
|---|---|---|
POSTGRES_USER |
railway |
Database superuser |
POSTGRES_PASSWORD |
Auto-generated | Superuser password |
POSTGRES_DB |
railway |
Default database name |
PATRONI_SCOPE |
pg-ha-cluster |
Cluster identifier |
PATRONI_TTL |
30 |
Leader lease TTL (seconds) |
PATRONI_LOOP_WAIT |
10 |
Health check interval |
POSTGRES_NODES |
(required for HAProxy) | Node list in format host:pgport:patroniport,... |
HAPROXY_MAX_CONN |
1000 |
Maximum concurrent connections |
HAPROXY_CHECK_INTERVAL |
3s |
Backend health check interval |
HAProxy is stateless and can be scaled horizontally via Railway replicas:
# In haproxy/railway.toml
[deploy]
numReplicas = 3Each service exposes health endpoints:
# PostgreSQL + Patroni
curl http://postgres-1.railway.internal:8008/health
curl http://postgres-1.railway.internal:8008/cluster # Full cluster status
curl http://postgres-1.railway.internal:8008/primary # 200 if primary
curl http://postgres-1.railway.internal:8008/replica # 200 if replica
# etcd
curl http://etcd-1.railway.internal:2379/health
# HAProxy stats dashboard
curl http://haproxy.railway.internal:8404/statsCheck Patroni cluster status from any PostgreSQL node:
curl http://postgres-1.railway.internal:8008/clusterResponse:
{
"members": [
{
"name": "postgres-1",
"role": "leader",
"state": "running",
"timeline": 1,
"lag": 0
},
{
"name": "postgres-2",
"role": "replica",
"state": "streaming",
"timeline": 1,
"lag": 0
},
{
"name": "postgres-3",
"role": "replica",
"state": "streaming",
"timeline": 1,
"lag": 0
}
]
}Access the HAProxy stats dashboard at http://haproxy.railway.internal:8404/stats for real-time backend health and connection metrics.
Timeline:
T+0s postgres-1 (leader) crashes
T+3s HAProxy health check fails (3s interval)
T+6s HAProxy marks backend DOWN (fall 3)
T+8s Patroni elects new leader via etcd
T+10s HAProxy routes to new primary
Impact:
- Existing write connections: Dropped (apps retry)
- Existing read connections: Unaffected (if using :5433)
- New connections: Routed to new primary
- Total downtime: ~10 seconds
When postgres-1 recovers:
T+0s postgres-1 restarts
T+3s Patroni registers with etcd
T+4s Discovers postgres-2 is leader
T+5s Rejoins as replica
T+10s Begins streaming replication
T+12s HAProxy adds to replica pool
Result: Original primary rejoins as replica, not leader.
To manually switch leaders (zero-downtime):
# From inside any PostgreSQL container
patronictl -c /etc/patroni/patroni.yml switchover
# Follow prompts to select new leaderTest the cluster locally with Docker Compose:
docker-compose up -dThis starts all 7 services on your local machine.
Connect to the cluster:
# Read-write (primary)
psql postgresql://railway:railway@localhost:5432/railway
# Read-only (replicas)
psql postgresql://railway:railway@localhost:5433/railwayView HAProxy stats:
open http://localhost:8404/stats-
Check etcd is healthy:
curl http://etcd-1.railway.internal:2379/health
-
Check Patroni logs for PostgreSQL services:
Railway Dashboard → postgres-1 → Logs -
Verify private networking is enabled in Railway project settings
This should never happen due to etcd quorum, but if it does:
- Stop all PostgreSQL services
- Clear etcd data:
etcdctl del --prefix /service/ - Restart services in order: etcd → postgres-1 → postgres-2 → postgres-3
Check lag from Patroni API:
curl http://postgres-2.railway.internal:8008/ | jq '.replication[0].lag'If lag is high (>1GB):
- Check network connectivity between nodes
- Increase
wal_keep_sizein patroni configuration - Consider increasing PostgreSQL resources
-
Check HAProxy stats:
curl http://haproxy.railway.internal:8404/stats
-
Verify backends are healthy:
curl http://postgres-1.railway.internal:8008/health
-
Check HAProxy logs for backend failures
Resource allocation:
- 3 PostgreSQL: 2 vCPU, 2GB RAM each + 10GB volume
- 3 etcd: 0.5 vCPU, 512MB RAM each
- 1 HAProxy: 0.5 vCPU, 512MB RAM
Estimated cost (Railway Pro):
- Compute: ~$60-120/month
- Storage: ~$7.50/month (30GB)
- Total: ~$70-130/month
Comparison:
- AWS RDS Multi-AZ (db.t4g.small): ~$120/month
- GCP Cloud SQL HA: ~$80/month
Railway Pro includes automatic volume snapshots:
- Frequency: Daily
- Retention: 6 days (daily), 27 days (weekly), 89 days (monthly)
- Max 10 backups per volume
Manual backup:
pg_dump -h haproxy.railway.internal -U railway railway > backup.sqlRestore from snapshot:
- Railway Dashboard → postgres-1 → Volumes → Snapshots
- Click "Restore" on desired snapshot
- Create new service from snapshot
This template supports PostgreSQL 14, 15, 16, 17, and 18. To upgrade:
-
Create a logical backup:
pg_dumpall -h haproxy.railway.internal -U railway > cluster_backup.sql -
Deploy new template with updated PostgreSQL version
-
Restore data:
psql -h new-haproxy.railway.internal -U railway < cluster_backup.sql -
Update application
DATABASE_URLto new cluster -
Delete old cluster after verification
- All passwords are auto-generated and encrypted at rest
- Private networking isolates cluster from public internet
- SSL enabled by default for PostgreSQL connections
- Recommend enabling Railway's 2FA for project access
Patroni dynamically generates PostgreSQL configuration. Override via environment variables:
postgresql:
parameters:
shared_buffers: 512MB # 25% of RAM
effective_cache_size: 2GB # 50-75% of RAM
max_connections: 300 # Increase for high concurrency
work_mem: 4MB # Per-query memoryAdjust via environment variables:
HAPROXY_MAX_CONN=2000 # More concurrent connections
HAPROXY_CHECK_INTERVAL=1s # Faster failover detection
HAPROXY_TIMEOUT_CLIENT=60m # Longer idle connectionsPre-built images are published to GitHub Container Registry:
# PostgreSQL + Patroni (multiple versions)
ghcr.io/railwayapp/postgres-ha/postgres-patroni:18
ghcr.io/railwayapp/postgres-ha/postgres-patroni:17
ghcr.io/railwayapp/postgres-ha/postgres-patroni:16
ghcr.io/railwayapp/postgres-ha/postgres-patroni:15
ghcr.io/railwayapp/postgres-ha/postgres-patroni:14
# etcd
ghcr.io/railwayapp/postgres-ha/etcd:3.5.16
# HAProxy
ghcr.io/railwayapp/postgres-ha/haproxy:3.2- Issues: https://github.com/railwayapp/postgres-ha/issues
- Documentation: https://docs.railway.app/
- Community: https://discord.gg/railway
MIT License - See LICENSE file for details
Built with:
- PostgreSQL - World's most advanced open source database
- Patroni - Template for PostgreSQL HA with Python
- etcd - Distributed reliable key-value store
- HAProxy - Reliable, high-performance TCP/HTTP load balancer
- Railway - Infrastructure, instantly