PostgreSQL Connection Scalability Limits with LISTEN/NOTIFY Architecture (75-100 VM Threshold) #240

@eberrigan

Description

PostgreSQL Connection Scalability Analysis

Executive Summary

LabLink's current architecture uses PostgreSQL LISTEN/NOTIFY with persistent connections for real-time VM assignment notifications. While this works well at our current scale of 50 simultaneous VMs, research reveals critical scalability limits between 75-100 VMs due to:

  1. PostgreSQL default max_connections (100) - Hard limit without configuration changes
  2. LISTEN/NOTIFY global commit lock - Serializes all COMMIT operations under high concurrency
  3. Memory constraints on t3.large - 8GB RAM limits ~100-150 total connections
  4. File descriptor limits - Linux kernel constraints around 100-200 connections

This issue tracks the scalability bottleneck, provides concrete thresholds, and recommends mitigation strategies.


Current Architecture

Client Side

File: packages/client/src/lablink_client_service/subscribe.py

# Client makes ONE blocking POST request
import requests

response = requests.post(
    url,
    json={"hostname": hostname},
    timeout=(30, 604800),  # (connect, read) timeouts; read timeout = 7 days
)
# Blocks until assignment OR timeout

Server Side

File: packages/allocator/src/lablink_allocator_service/database.py:182-267

def listen_for_notifications(self, channel, target_hostname):
    # Creates a NEW dedicated PostgreSQL connection per VM
    listen_conn = psycopg2.connect(...)
    # LISTEN requires autocommit: notifications are delivered outside transactions
    listen_conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
    listen_cursor = listen_conn.cursor()
    listen_cursor.execute(f"LISTEN {channel};")

    while True:
        # Server-side polling with a 10s select() timeout
        if select.select([listen_conn], [], [], 10) == ([], [], []):
            continue
        listen_conn.poll()  # required so psycopg2 populates listen_conn.notifies
        # Process notifications in listen_conn.notifies when received

Resource Usage Per VM

  • 1 HTTP connection (client → allocator, held open up to 7 days)
  • 1 PostgreSQL connection (allocator → DB, in LISTEN state)
  • ~10-14 MB RAM per PostgreSQL connection (typical)
  • 1 file descriptor per connection

Critical Research Findings

1. LISTEN/NOTIFY Global Commit Lock

Source: Recall.ai: "Postgres LISTEN/NOTIFY does not scale"

"When a NOTIFY query is issued during a transaction, it acquires a global lock on the entire database during the commit phase, effectively serializing all commits."

Impact:

  • Under high concurrency: database CPU and I/O utilization drop sharply while commits queue behind the lock (mutex contention)
  • Real-world incident: Recall.ai experienced three periods of downtime (March 2025)
  • Symptom: Only one COMMIT can execute at a time across entire database

LabLink Implication:

  • Current (50 VMs): Operating within safe limits ✅
  • 75-100 VMs: Risk of global lock contention increases ⚠️
  • 100+ VMs: High risk of database stalls 🔴

2. PostgreSQL Default Connection Limits

Current Configuration (no tuning):

  • max_connections: 100 (default)
  • shared_buffers: 128 MB (default, too low)
  • work_mem: 4 MB (default)

Memory Formula:

Max RAM = shared_buffers + (temp_buffers + work_mem) × max_connections

For t3.large (8GB RAM):

| max_connections | shared_buffers | work_mem | Total RAM | Status |
|---|---|---|---|---|
| 100 | 2GB | 5MB | ~2.5GB | Safe ✅ |
| 150 | 2GB | 3.3MB | ~3.0GB | Acceptable ⚠️ |
| 200 | 2GB | 2.5MB | ~3.5GB | Risky 🔴 |
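The formula above can be sanity-checked with a few lines of arithmetic. This is a rough estimator, not exact PostgreSQL accounting: it follows the formula with temp_buffers treated as negligible, so it yields slightly lower totals than the table, which presumably budgets extra per-connection overhead.

```python
# Rough RAM budget per the formula above (all values in MB).
# Upper bound: assumes every connection uses its full work_mem at once.

def postgres_ram_mb(max_connections, shared_buffers_mb=2048, work_mem_mb=5):
    """Estimated peak RAM for a given connection limit."""
    return shared_buffers_mb + work_mem_mb * max_connections

for conns, wm in [(100, 5), (150, 3.3), (200, 2.5)]:
    print(conns, "connections ->", round(postgres_ram_mb(conns, work_mem_mb=wm) / 1024, 1), "GB")
```

Lowering work_mem as max_connections grows (as the table does) keeps the worst-case total roughly flat, at the cost of slower sorts and hash joins per query.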

3. Connection Pool Exhaustion

Current allocator setup:

  • 50 VM LISTEN connections (persistent, outside SQLAlchemy pool)
  • 1 main database connection (allocator instance)
  • ~10 Flask worker connections (SQLAlchemy pool: pool_size=5, max_overflow=10)

Total connections at 50 VMs: ~60 (60% of default limit)
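The connection arithmetic above can be sketched as a small helper, assuming one LISTEN connection per VM plus the fixed allocator overhead (the parameter names here are illustrative, not existing LabLink code):

```python
# Connection budget: one LISTEN connection per VM, plus the allocator's
# main connection and the Flask/SQLAlchemy worker connections.

def total_connections(vm_count, main=1, pool_workers=10):
    """Estimated PostgreSQL connections for a given VM count."""
    return vm_count + main + pool_workers

def utilization(vm_count, max_connections=100):
    """Fraction of the PostgreSQL connection limit in use."""
    return total_connections(vm_count) / max_connections

# 50 VMs -> 61 connections, ~61% of the default limit of 100
```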


Scaling Thresholds

Connection Limit Analysis

| VM Count | Total Connections | Memory Usage | PostgreSQL Capacity | Status |
|---|---|---|---|---|
| 50 | ~60 | ~600MB - 840MB | 60% | Current (Safe) ✅ |
| 75 | ~85 | ~850MB - 1.2GB | 85% | Approaching limit ⚠️ |
| 100 | ~110 | ~1.1GB - 1.5GB | 110% (exceeds default!) | Critical 🔴 |
| 150 | ~160 | ~1.6GB - 2.2GB | 160% (requires config change) | Requires tuning ⚠️ |
| 200 | ~210 | ~2.1GB - 2.9GB | 210% (requires instance upgrade) | Unsafe 🔴 |

LISTEN/NOTIFY Performance Degradation

Based on Recall.ai research + AWS recommendations:

| Concurrent LISTEN Connections | Write Frequency | Expected Performance |
|---|---|---|
| 0-50 | Any | Normal operation ✅ |
| 50-75 | Low (< 10/min) | Normal operation ✅ |
| 50-75 | Moderate (10-60/min) | Minor slowdown ⚠️ |
| 75-100 | Low | Minor slowdown ⚠️ |
| 75-100 | Moderate | Noticeable contention ⚠️ |
| 75-100 | High (> 60/min) | Severe contention 🔴 |
| 100-150 | Any | Significant issues 🔴 |
| 150+ | Any | Critical issues, downtime risk 🔴 |

LabLink write frequency (per VM):

  • VM startup NOTIFY: Once per assignment
  • Health updates: Every 20 seconds (3/min × 50 VMs = 150/min)
  • Status updates: Every 10 seconds (6/min × 50 VMs = 300/min)
  • Admin operations: Manual (infrequent)

Current total writes: ~450/min (moderate-to-high frequency)

Estimated safe limit: 75-100 VMs before global lock becomes problematic
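The write-rate figures above follow directly from the per-VM update intervals, and scale linearly with VM count:

```python
# Aggregate write rate from the per-VM frequencies listed above.
HEALTH_PER_MIN = 60 / 20   # health update every 20 seconds -> 3/min
STATUS_PER_MIN = 60 / 10   # status update every 10 seconds -> 6/min

def writes_per_minute(vm_count):
    """Total DB writes per minute, ignoring one-off assignment NOTIFYs."""
    return vm_count * (HEALTH_PER_MIN + STATUS_PER_MIN)

# 50 VMs -> 450 writes/min; 100 VMs -> 900 writes/min
```

At 100 VMs the write rate doubles to ~900/min, landing squarely in the "High" frequency band of the table above.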


Immediate Actions Required

1. Monitor Current Connection Count

Add to allocator health check:

-- Check active connections
SELECT count(*) FROM pg_stat_activity;

-- Group by state
SELECT state, count(*) FROM pg_stat_activity GROUP BY state;

-- Check LISTEN connections specifically
SELECT count(*) FROM pg_stat_activity WHERE query LIKE '%LISTEN%';

2. Set Explicit max_connections

Update PostgreSQL configuration to avoid hitting default limit:

# docker-compose.yml or Dockerfile
postgres:
  command:
    - "postgres"
    - "-c"
    - "max_connections=150"
    - "-c"
    - "shared_buffers=2GB"
    - "-c"
    - "work_mem=3MB"

3. Add Connection Health Endpoint

@app.route("/api/health/connections", methods=["GET"])
def connection_health():
    """Monitor PostgreSQL connection usage."""
    try:
        cursor = database.cursor
        cursor.execute("SELECT count(*) FROM pg_stat_activity")
        active_connections = cursor.fetchone()[0]

        cursor.execute(
            "SELECT count(*) FROM pg_stat_activity WHERE query LIKE '%LISTEN%'"
        )
        listen_connections = cursor.fetchone()[0]

        # Read the configured limit instead of hard-coding it
        cursor.execute("SHOW max_connections")
        max_connections = int(cursor.fetchone()[0])

        utilization = active_connections / max_connections

        return jsonify({
            "status": "healthy" if utilization < 0.8 else "warning",
            "active_connections": active_connections,
            "listen_connections": listen_connections,
            "max_connections": max_connections,
            "utilization_percent": round(utilization * 100, 1),
            "warning": utilization > 0.8,
            "critical": utilization > 0.9,
        }), 200 if utilization < 0.9 else 503
    except Exception as e:
        return jsonify({"status": "unhealthy", "error": str(e)}), 503

Medium-Term Solutions (100-150 VMs)

4. Upgrade EC2 Instance

Current: t3.large (2 vCPU, 8GB RAM)
Recommended: t3.xlarge (4 vCPU, 16GB RAM)

5. Upgrade to Production WSGI Server

Current: Flask development server (threaded=True)
Recommended: Gunicorn with gevent workers

CMD ["gunicorn", "-w", "4", "-k", "gevent", \
     "--worker-connections", "1000", \
     "--bind", "0.0.0.0:5000", \
     "lablink_allocator_service.main:app"]

Long-Term Architecture Changes (150+ VMs)

Option A: Redis Pub/Sub Migration (Recommended)

Architecture:

Client VMs → Redis Pub/Sub ← Allocator
                ↓
          PostgreSQL (state only)

Benefits:

  • No global commit lock (Redis designed for pub/sub)
  • 2KB overhead per connection (vs 10MB PostgreSQL)
  • Sub-millisecond latency
  • Can handle 10,000+ connections

Migration Strategy:

  1. Add Redis container to allocator stack
  2. Implement parallel notification system (PostgreSQL + Redis)
  3. Gradually migrate VMs to Redis subscribers
  4. Deprecate PostgreSQL LISTEN/NOTIFY
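The parallel notification path (step 2) could look like the sketch below. The channel naming and payload schema are assumptions, not existing LabLink code; the client is injected so the same function works with a real `redis.Redis(...)` instance or a test double.

```python
# Hypothetical Redis pub/sub notification path for Option A.
import json

def vm_channel(hostname):
    """One pub/sub channel per VM, e.g. 'vm_assignment:host-01' (naming assumed)."""
    return f"vm_assignment:{hostname}"

def notify_assignment(client, hostname, assignment):
    """Publish an assignment; with redis-py, client = redis.Redis(...)."""
    payload = json.dumps({"hostname": hostname, "assignment": assignment})
    # redis-py: publish() returns the number of subscribers that received it
    return client.publish(vm_channel(hostname), payload)

# Subscriber side (redis-py), blocking like the current long-poll:
#   pubsub = client.pubsub()
#   pubsub.subscribe(vm_channel(hostname))
#   for message in pubsub.listen():
#       if message["type"] == "message":
#           handle(json.loads(message["data"]))
```

During the migration window the allocator would call both `NOTIFY` and `notify_assignment`, letting each VM listen on whichever transport it has been migrated to.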

Option B: Short Polling (Simpler Alternative)

Change from: Long-polling with LISTEN/NOTIFY
Change to: Simple periodic polling (15-30 seconds)

Trade-offs:

  • Pros: Zero persistent connections, simple implementation, works at any scale
  • Cons: 15-30s latency for student (vs instant), higher request volume
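A minimal sketch of Option B on the client side, assuming a hypothetical `fetch_assignment` callable (e.g. a GET against the allocator) that returns the assignment dict or None:

```python
# Short polling replaces the 7-day blocking POST: no persistent HTTP
# connection, no LISTEN connection, just a periodic check.
import time

def poll_for_assignment(fetch_assignment, interval_s=20, max_wait_s=604800,
                        sleep=time.sleep):
    """Poll every interval_s seconds until assigned or max_wait_s elapses."""
    waited = 0
    while waited <= max_wait_s:
        assignment = fetch_assignment()
        if assignment is not None:
            return assignment
        sleep(interval_s)
        waited += interval_s
    return None  # timed out without an assignment
```

Worst-case notification latency is one polling interval; request volume is one cheap GET per VM per interval instead of one held-open connection per VM.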

Recommended Action Plan

Phase 1: Immediate (Current: 50 VMs)

  • Add connection monitoring (/api/health/connections)
  • Set max_connections=150 in PostgreSQL config
  • Document scaling limits in config.yaml
  • Add CloudWatch alarm for connection utilization > 80%

Phase 2: Before Reaching 75 VMs

  • Upgrade to Gunicorn WSGI server
  • Increase max_connections=200
  • Add LISTEN/NOTIFY performance monitoring
  • Test Redis Pub/Sub proof-of-concept

Phase 3: Before Reaching 100 VMs

  • Upgrade to t3.xlarge (16GB RAM)
  • Implement adaptive health check intervals
  • Deploy Redis Pub/Sub in parallel with LISTEN/NOTIFY
  • Migrate 10% of VMs to Redis (canary testing)

Phase 4: Before Reaching 150 VMs

  • Migrate all VMs to Redis Pub/Sub
  • Deprecate PostgreSQL LISTEN/NOTIFY
  • Remove persistent connection architecture
  • Horizontal scaling with ALB (if needed)

Success Metrics

Operational Health

  • Connection utilization < 80% of max_connections
  • PostgreSQL CPU < 60% under normal load
  • LISTEN connection count == active VM count
  • Memory usage < 70% of instance RAM

Scaling Goals

  • Support 100 VMs without configuration changes
  • Support 150 VMs with configuration tuning only
  • Support 200+ VMs with Redis migration

Priority: Medium (not blocking at current 50 VM scale, but critical for growth)
Effort: Medium (monitoring = 1 day, config changes = 1 day, Redis migration = 1 week)
Impact: High (determines maximum scalability of LabLink platform)
