PostgreSQL Connection Scalability Limits with LISTEN/NOTIFY Architecture (75-100 VM Threshold) #240

@eberrigan

Description

PostgreSQL Connection Scalability Analysis

Executive Summary

LabLink's current architecture uses PostgreSQL LISTEN/NOTIFY with persistent connections for real-time VM assignment notifications. While this works well at our current scale of 50 simultaneous VMs, research reveals critical scalability limits between 75-100 VMs due to:

  1. PostgreSQL default max_connections (100) - Hard limit without configuration changes
  2. LISTEN/NOTIFY global commit lock - Serializes all COMMIT operations under high concurrency
  3. Memory constraints on t3.large - 8GB RAM limits ~100-150 total connections
  4. File descriptor limits - Linux kernel constraints around 100-200 connections

This issue tracks the scalability bottleneck, provides concrete thresholds, and recommends mitigation strategies.


Current Architecture

Client Side

File: packages/client/src/lablink_client_service/subscribe.py

# Client makes ONE blocking POST request
import requests

response = requests.post(
    url,
    json={"hostname": hostname},
    timeout=(30, 604800),  # (connect, read) timeouts; read timeout = 7 days
)
# Blocks until assignment OR timeout

Server Side

File: packages/allocator/src/lablink_allocator_service/database.py:182-267

def listen_for_notifications(self, channel, target_hostname):
    # Creates a NEW dedicated PostgreSQL connection per VM
    listen_conn = psycopg2.connect(...)
    # LISTEN requires autocommit: notifications are delivered outside transactions
    listen_conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
    listen_cursor = listen_conn.cursor()
    listen_cursor.execute(f"LISTEN {channel};")

    while True:
        # Server-side polling with a 10s select() timeout
        if select.select([listen_conn], [], [], 10) == ([], [], []):
            continue
        listen_conn.poll()  # required so psycopg2 populates listen_conn.notifies
        # Process notifications in listen_conn.notifies when received

Resource Usage Per VM

  • 1 HTTP connection (client → allocator, held open up to 7 days)
  • 1 PostgreSQL connection (allocator → DB, in LISTEN state)
  • ~10-14 MB RAM per PostgreSQL connection (typical)
  • 1 file descriptor per connection

Critical Research Findings

1. LISTEN/NOTIFY Global Commit Lock

Source: Recall.ai: "Postgres LISTEN/NOTIFY does not scale"

"When a NOTIFY query is issued during a transaction, it acquires a global lock on the entire database during the commit phase, effectively serializing all commits."

Impact:

  • Under high concurrency: database CPU and I/O utilization drop sharply while commits queue behind the lock (mutex contention)
  • Real-world incident: Recall.ai experienced three periods of downtime (March 2025)
  • Symptom: Only one COMMIT can execute at a time across entire database

LabLink Implication:

  • Current (50 VMs): Operating within safe limits ✅
  • 75-100 VMs: Risk of global lock contention increases ⚠️
  • 100+ VMs: High risk of database stalls 🔴

2. PostgreSQL Default Connection Limits

Current Configuration (no tuning):

  • max_connections: 100 (default)
  • shared_buffers: 128 MB (default, too low)
  • work_mem: 4 MB (default)

Memory Formula:

Max RAM = shared_buffers + (temp_buffers + work_mem) × max_connections

For t3.large (8GB RAM):

| max_connections | shared_buffers | work_mem | Total RAM | Status |
|---|---|---|---|---|
| 100 | 2GB | 5MB | ~2.5GB | Safe ✅ |
| 150 | 2GB | 3.3MB | ~3.0GB | Acceptable ⚠️ |
| 200 | 2GB | 2.5MB | ~3.5GB | Risky 🔴 |
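The formula above can be sanity-checked with a few lines of arithmetic. This is a rough estimator, not exact PostgreSQL accounting: it follows the formula with temp_buffers treated as negligible, so it yields slightly lower totals than the table, which presumably budgets extra per-connection overhead.

```python
# Rough RAM budget per the formula above (all values in MB).
# Upper bound: assumes every connection uses its full work_mem at once.

def postgres_ram_mb(max_connections, shared_buffers_mb=2048, work_mem_mb=5):
    """Estimated peak RAM for a given connection limit."""
    return shared_buffers_mb + work_mem_mb * max_connections

for conns, wm in [(100, 5), (150, 3.3), (200, 2.5)]:
    print(conns, "connections ->", round(postgres_ram_mb(conns, work_mem_mb=wm) / 1024, 1), "GB")
```

Lowering work_mem as max_connections grows (as the table does) keeps the worst-case total roughly flat, at the cost of slower sorts and hash joins per query.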

3. Connection Pool Exhaustion

Current allocator setup:

  • 50 VM LISTEN connections (persistent, outside SQLAlchemy pool)
  • 1 main database connection (allocator instance)
  • ~10 Flask worker connections (SQLAlchemy pool: pool_size=5, max_overflow=10)

Total connections at 50 VMs: ~60 (60% of default limit)
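The connection arithmetic above can be sketched as a small helper, assuming one LISTEN connection per VM plus the fixed allocator overhead (the parameter names here are illustrative, not existing LabLink code):

```python
# Connection budget: one LISTEN connection per VM, plus the allocator's
# main connection and the Flask/SQLAlchemy worker connections.

def total_connections(vm_count, main=1, pool_workers=10):
    """Estimated PostgreSQL connections for a given VM count."""
    return vm_count + main + pool_workers

def utilization(vm_count, max_connections=100):
    """Fraction of the PostgreSQL connection limit in use."""
    return total_connections(vm_count) / max_connections

# 50 VMs -> 61 connections, ~61% of the default limit of 100
```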


Scaling Thresholds

Connection Limit Analysis

| VM Count | Total Connections | Memory Usage | PostgreSQL Capacity | Status |
|---|---|---|---|---|
| 50 | ~60 | ~600MB - 840MB | 60% | Current (Safe) ✅ |
| 75 | ~85 | ~850MB - 1.2GB | 85% | Approaching limit ⚠️ |
| 100 | ~110 | ~1.1GB - 1.5GB | 110% (exceeds default!) | Critical 🔴 |
| 150 | ~160 | ~1.6GB - 2.2GB | 160% (requires config change) | Requires tuning ⚠️ |
| 200 | ~210 | ~2.1GB - 2.9GB | 210% (requires instance upgrade) | Unsafe 🔴 |

LISTEN/NOTIFY Performance Degradation

Based on Recall.ai research + AWS recommendations:

| Concurrent LISTEN Connections | Write Frequency | Expected Performance |
|---|---|---|
| 0-50 | Any | Normal operation ✅ |
| 50-75 | Low (< 10/min) | Normal operation ✅ |
| 50-75 | Moderate (10-60/min) | Minor slowdown ⚠️ |
| 75-100 | Low | Minor slowdown ⚠️ |
| 75-100 | Moderate | Noticeable contention ⚠️ |
| 75-100 | High (> 60/min) | Severe contention 🔴 |
| 100-150 | Any | Significant issues 🔴 |
| 150+ | Any | Critical issues, downtime risk 🔴 |

LabLink write frequency (per VM):

  • VM startup NOTIFY: Once per assignment
  • Health updates: Every 20 seconds (3/min × 50 VMs = 150/min)
  • Status updates: Every 10 seconds (6/min × 50 VMs = 300/min)
  • Admin operations: Manual (infrequent)

Current total writes: ~450/min (moderate-to-high frequency)

Estimated safe limit: 75-100 VMs before global lock becomes problematic
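The write-rate figures above follow directly from the per-VM update intervals, and scale linearly with VM count:

```python
# Aggregate write rate from the per-VM frequencies listed above.
HEALTH_PER_MIN = 60 / 20   # health update every 20 seconds -> 3/min
STATUS_PER_MIN = 60 / 10   # status update every 10 seconds -> 6/min

def writes_per_minute(vm_count):
    """Total DB writes per minute, ignoring one-off assignment NOTIFYs."""
    return vm_count * (HEALTH_PER_MIN + STATUS_PER_MIN)

# 50 VMs -> 450 writes/min; 100 VMs -> 900 writes/min
```

At 100 VMs the write rate doubles to ~900/min, landing squarely in the "High" frequency band of the table above.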


Immediate Actions Required

1. Monitor Current Connection Count

Add to allocator health check:

-- Check active connections
SELECT count(*) FROM pg_stat_activity;

-- Group by state
SELECT state, count(*) FROM pg_stat_activity GROUP BY state;

-- Check LISTEN connections specifically
SELECT count(*) FROM pg_stat_activity WHERE query LIKE '%LISTEN%';

2. Set Explicit max_connections

Update PostgreSQL configuration to avoid hitting default limit:

# docker-compose.yml or Dockerfile
postgres:
  command:
    - "postgres"
    - "-c"
    - "max_connections=150"
    - "-c"
    - "shared_buffers=2GB"
    - "-c"
    - "work_mem=3MB"

3. Add Connection Health Endpoint

@app.route("/api/health/connections", methods=["GET"])
def connection_health():
    """Monitor PostgreSQL connection usage."""
    try:
        cursor = database.cursor
        cursor.execute("SELECT count(*) FROM pg_stat_activity")
        active_connections = cursor.fetchone()[0]

        cursor.execute(
            "SELECT count(*) FROM pg_stat_activity WHERE query LIKE '%LISTEN%'"
        )
        listen_connections = cursor.fetchone()[0]

        # Read the configured limit instead of hard-coding it
        cursor.execute("SHOW max_connections")
        max_connections = int(cursor.fetchone()[0])

        utilization = active_connections / max_connections

        return jsonify({
            "status": "healthy" if utilization < 0.8 else "warning",
            "active_connections": active_connections,
            "listen_connections": listen_connections,
            "max_connections": max_connections,
            "utilization_percent": round(utilization * 100, 1),
            "warning": utilization > 0.8,
            "critical": utilization > 0.9,
        }), 200 if utilization < 0.9 else 503
    except Exception as e:
        return jsonify({"status": "unhealthy", "error": str(e)}), 503

Medium-Term Solutions (100-150 VMs)

4. Upgrade EC2 Instance

Current: t3.large (2 vCPU, 8GB RAM)
Recommended: t3.xlarge (4 vCPU, 16GB RAM)

5. Upgrade to Production WSGI Server

Current: Flask development server (threaded=True)
Recommended: Gunicorn with gevent workers

CMD ["gunicorn", "-w", "4", "-k", "gevent", \
     "--worker-connections", "1000", \
     "--bind", "0.0.0.0:5000", \
     "lablink_allocator_service.main:app"]

Long-Term Architecture Changes (150+ VMs)

Option A: Redis Pub/Sub Migration (Recommended)

Architecture:

Client VMs → Redis Pub/Sub ← Allocator
                ↓
          PostgreSQL (state only)

Benefits:

  • No global commit lock (Redis designed for pub/sub)
  • 2KB overhead per connection (vs 10MB PostgreSQL)
  • Sub-millisecond latency
  • Can handle 10,000+ connections

Migration Strategy:

  1. Add Redis container to allocator stack
  2. Implement parallel notification system (PostgreSQL + Redis)
  3. Gradually migrate VMs to Redis subscribers
  4. Deprecate PostgreSQL LISTEN/NOTIFY
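The parallel notification path (step 2) could look like the sketch below. The channel naming and payload schema are assumptions, not existing LabLink code; the client is injected so the same function works with a real `redis.Redis(...)` instance or a test double.

```python
# Hypothetical Redis pub/sub notification path for Option A.
import json

def vm_channel(hostname):
    """One pub/sub channel per VM, e.g. 'vm_assignment:host-01' (naming assumed)."""
    return f"vm_assignment:{hostname}"

def notify_assignment(client, hostname, assignment):
    """Publish an assignment; with redis-py, client = redis.Redis(...)."""
    payload = json.dumps({"hostname": hostname, "assignment": assignment})
    # redis-py: publish() returns the number of subscribers that received it
    return client.publish(vm_channel(hostname), payload)

# Subscriber side (redis-py), blocking like the current long-poll:
#   pubsub = client.pubsub()
#   pubsub.subscribe(vm_channel(hostname))
#   for message in pubsub.listen():
#       if message["type"] == "message":
#           handle(json.loads(message["data"]))
```

During the migration window the allocator would call both `NOTIFY` and `notify_assignment`, letting each VM listen on whichever transport it has been migrated to.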

Option B: Short Polling (Simpler Alternative)

Change from: Long-polling with LISTEN/NOTIFY
Change to: Simple periodic polling (15-30 seconds)

Trade-offs:

  • Pros: Zero persistent connections, simple implementation, works at any scale
  • Cons: 15-30s latency for student (vs instant), higher request volume
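A minimal sketch of Option B on the client side, assuming a hypothetical `fetch_assignment` callable (e.g. a GET against the allocator) that returns the assignment dict or None:

```python
# Short polling replaces the 7-day blocking POST: no persistent HTTP
# connection, no LISTEN connection, just a periodic check.
import time

def poll_for_assignment(fetch_assignment, interval_s=20, max_wait_s=604800,
                        sleep=time.sleep):
    """Poll every interval_s seconds until assigned or max_wait_s elapses."""
    waited = 0
    while waited <= max_wait_s:
        assignment = fetch_assignment()
        if assignment is not None:
            return assignment
        sleep(interval_s)
        waited += interval_s
    return None  # timed out without an assignment
```

Worst-case notification latency is one polling interval; request volume is one cheap GET per VM per interval instead of one held-open connection per VM.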

Recommended Action Plan

Phase 1: Immediate (Current: 50 VMs)

  • Add connection monitoring (/api/health/connections)
  • Set max_connections=150 in PostgreSQL config
  • Document scaling limits in config.yaml
  • Add CloudWatch alarm for connection utilization > 80%

Phase 2: Before Reaching 75 VMs

  • Upgrade to Gunicorn WSGI server
  • Increase max_connections=200
  • Add LISTEN/NOTIFY performance monitoring
  • Test Redis Pub/Sub proof-of-concept

Phase 3: Before Reaching 100 VMs

  • Upgrade to t3.xlarge (16GB RAM)
  • Implement adaptive health check intervals
  • Deploy Redis Pub/Sub in parallel with LISTEN/NOTIFY
  • Migrate 10% of VMs to Redis (canary testing)

Phase 4: Before Reaching 150 VMs

  • Migrate all VMs to Redis Pub/Sub
  • Deprecate PostgreSQL LISTEN/NOTIFY
  • Remove persistent connection architecture
  • Horizontal scaling with ALB (if needed)

Success Metrics

Operational Health

  • Connection utilization < 80% of max_connections
  • PostgreSQL CPU < 60% under normal load
  • LISTEN connection count == active VM count
  • Memory usage < 70% of instance RAM

Scaling Goals

  • Support 100 VMs without configuration changes
  • Support 150 VMs with configuration tuning only
  • Support 200+ VMs with Redis migration

Priority: Medium (not blocking at current 50 VM scale, but critical for growth)
Effort: Medium (monitoring = 1 day, config changes = 1 day, Redis migration = 1 week)
Impact: High (determines maximum scalability of LabLink platform)
