PostgreSQL Connection Scalability Analysis
Executive Summary
LabLink's current architecture uses PostgreSQL LISTEN/NOTIFY with persistent connections for real-time VM assignment notifications. While this works well at our current scale of 50 simultaneous VMs, research reveals critical scalability limits between 75-100 VMs due to:
- PostgreSQL default max_connections (100) - Hard limit without configuration changes
- LISTEN/NOTIFY global commit lock - Serializes all COMMIT operations under high concurrency
- Memory constraints on t3.large - 8GB RAM supports only ~100-150 total connections
- File descriptor limits - Linux kernel constraints around 100-200 connections
This issue tracks the scalability bottleneck, provides concrete thresholds, and recommends mitigation strategies.
Current Architecture
Client Side
File: packages/client/src/lablink_client_service/subscribe.py
```python
import requests

# Client makes ONE blocking POST request and blocks until it
# receives an assignment or the timeout expires
response = requests.post(
    url,
    json={"hostname": hostname},
    timeout=(30, 604800),  # (connect, read): 30 s connect, 7-day read
)
```
Server Side
File: packages/allocator/src/lablink_allocator_service/database.py:182-267
```python
import select
import psycopg2

def listen_for_notifications(self, channel, target_hostname):
    # Creates NEW dedicated PostgreSQL connection per listener
    listen_conn = psycopg2.connect(...)
    listen_conn.autocommit = True  # so LISTEN takes effect immediately
    listen_cursor = listen_conn.cursor()
    listen_cursor.execute(f"LISTEN {channel};")
    while True:
        # Server-side polling with 10 s timeout
        if select.select([listen_conn], [], [], 10) == ([], [], []):
            continue  # timed out; poll again
        listen_conn.poll()
        while listen_conn.notifies:
            notify = listen_conn.notifies.pop(0)
            # Process notification; match payload against target_hostname
```
Resource Usage Per VM
- 1 HTTP connection (client → allocator, held open up to 7 days)
- 1 PostgreSQL connection (allocator → DB, in LISTEN state)
- ~10-14 MB RAM per PostgreSQL connection (typical)
- 1 file descriptor per connection
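These per-VM figures drive the memory column in the scaling table further below. A quick back-of-envelope sketch (assuming the 10-14 MB per-connection range above plus ~10 allocator/pool connections):

```python
# Back-of-envelope: PostgreSQL connection memory as VM count grows.
# Assumes one LISTEN connection per VM plus ~10 allocator/pool
# connections, at 10-14 MB each (figures from this document).
def connection_memory_mb(vm_count, overhead_conns=10):
    total = vm_count + overhead_conns
    return total * 10, total * 14  # (low, high) estimate in MB

for vms in (50, 75, 100, 150, 200):
    low, high = connection_memory_mb(vms)
    print(f"{vms} VMs: ~{low}-{high} MB")
# 50 VMs: ~600-840 MB, matching the scaling table below
```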
Critical Research Findings
1. LISTEN/NOTIFY Global Commit Lock
Source: Recall.ai: "Postgres LISTEN/NOTIFY does not scale"
"When a NOTIFY query is issued during a transaction, it acquires a global lock on the entire database during the commit phase, effectively serializing all commits."
Impact:
- Under high concurrency: Database CPU/IO plummet due to mutex contention
- Real-world incident: Recall.ai experienced three periods of downtime (March 2025)
- Symptom: Only one COMMIT can execute at a time across entire database
LabLink Implication:
- Current (50 VMs): Operating within safe limits ✅
- 75-100 VMs: Risk of global lock contention increases ⚠️
- 100+ VMs: High risk of database stalls 🔴
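For context, a minimal sketch of the notifying side in psycopg2 (hypothetical table, column, channel, and DSN; not LabLink's actual code) shows why the lock matters: the notification is queued inside the transaction and only delivered at COMMIT, which is exactly the phase the global lock serializes.

```python
import psycopg2

conn = psycopg2.connect("dbname=lablink")  # assumed DSN
with conn:  # opens a transaction; commits on exit
    with conn.cursor() as cur:
        cur.execute(
            "UPDATE vms SET assigned_to = %s WHERE hostname = %s",
            ("student-42", "vm-001"),  # hypothetical row
        )
        # NOTIFY is queued here but delivered at COMMIT
        cur.execute("SELECT pg_notify(%s, %s)", ("vm_updates", "vm-001"))
# COMMIT happens here; with many concurrent notifying transactions,
# these commits queue up behind one another on the global lock.
```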
2. PostgreSQL Default Connection Limits
Current Configuration (no tuning):
```
max_connections: 100    (default)
shared_buffers: 128 MB  (default, too low)
work_mem: 4 MB          (default)
```
Memory Formula:
Max RAM ≈ shared_buffers + (temp_buffers + work_mem) × max_connections
For example, with shared_buffers = 2GB, work_mem = 5MB, and max_connections = 100, the work_mem term adds ~0.5GB, for roughly 2.5GB total (first row of the table below).
For t3.large (8GB RAM):
| max_connections | shared_buffers | work_mem | Total RAM | Status |
| --- | --- | --- | --- | --- |
| 100 | 2GB | 5MB | ~2.5GB | Safe ✅ |
| 150 | 2GB | 3.3MB | ~3.0GB | Acceptable ⚠️ |
| 200 | 2GB | 2.5MB | ~3.5GB | Risky 🔴 |
3. Connection Pool Exhaustion
Current allocator setup:
- 50 VM LISTEN connections (persistent, outside SQLAlchemy pool)
- 1 main database connection (allocator instance)
- ~10 Flask worker connections (SQLAlchemy pool: pool_size=5, max_overflow=10)
Total connections at 50 VMs: ~60 (60% of default limit)
Scaling Thresholds
Connection Limit Analysis
| VM Count | Total Connections | Memory Usage | PostgreSQL Capacity | Status |
| --- | --- | --- | --- | --- |
| 50 | ~60 | ~600MB - 840MB | 60% | Current (Safe) ✅ |
| 75 | ~85 | ~850MB - 1.2GB | 85% | Approaching limit ⚠️ |
| 100 | ~110 | ~1.1GB - 1.5GB | 110% (exceeds default!) | Critical 🔴 |
| 150 | ~160 | ~1.6GB - 2.2GB | 160% (requires config change) | Requires tuning ⚠️ |
| 200 | ~210 | ~2.1GB - 2.9GB | 210% (requires instance upgrade) | Unsafe 🔴 |
LISTEN/NOTIFY Performance Degradation
Based on Recall.ai research + AWS recommendations:
| Concurrent LISTEN Connections | Write Frequency | Expected Performance |
| --- | --- | --- |
| 0-50 | Any | Normal operation ✅ |
| 50-75 | Low (< 10/min) | Normal operation ✅ |
| 50-75 | Moderate (10-60/min) | Minor slowdown ⚠️ |
| 75-100 | Low | Minor slowdown ⚠️ |
| 75-100 | Moderate | Noticeable contention ⚠️ |
| 75-100 | High (> 60/min) | Severe contention 🔴 |
| 100-150 | Any | Significant issues 🔴 |
| 150+ | Any | Critical issues, downtime risk 🔴 |
LabLink write frequency (per VM):
- VM startup NOTIFY: Once per assignment
- Health updates: Every 20 seconds (3/min × 50 VMs = 150/min)
- Status updates: Every 10 seconds (6/min × 50 VMs = 300/min)
- Admin operations: Manual (infrequent)
Current total writes: ~450/min (moderate-to-high frequency)
Estimated safe limit: 75-100 VMs before global lock becomes problematic
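Projecting those per-VM rates forward makes the threshold concrete (a simple sketch using only the rates listed above):

```python
# Per-VM write rate from the figures above: health every 20 s (3/min)
# plus status every 10 s (6/min) = 9 writes/min per VM.
PER_VM_WRITES_PER_MIN = 3 + 6

for vms in (50, 75, 100, 150):
    print(f"{vms} VMs: ~{vms * PER_VM_WRITES_PER_MIN} writes/min")
# 50 VMs: ~450 writes/min (the current total); 100 VMs roughly doubles
# it, pushing notifying commits squarely into the contention zone above.
```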
Immediate Actions Required
1. Monitor Current Connection Count
Add to allocator health check:
```sql
-- Check active connections
SELECT count(*) FROM pg_stat_activity;

-- Group by state
SELECT state, count(*) FROM pg_stat_activity GROUP BY state;

-- Check LISTEN connections specifically (match the command prefix so
-- this query does not count itself)
SELECT count(*) FROM pg_stat_activity WHERE query LIKE 'LISTEN %';
```
2. Set Explicit max_connections
Update PostgreSQL configuration to avoid hitting default limit:
```yaml
# docker-compose.yml (or equivalent flags in the Dockerfile CMD)
postgres:
  command:
    - "postgres"
    - "-c"
    - "max_connections=150"
    - "-c"
    - "shared_buffers=2GB"
    - "-c"
    - "work_mem=3MB"
```
3. Add Connection Health Endpoint
```python
@app.route("/api/health/connections", methods=["GET"])
def connection_health():
    """Monitor PostgreSQL connection usage."""
    try:
        cursor = database.cursor
        cursor.execute("SELECT count(*) FROM pg_stat_activity")
        active_connections = cursor.fetchone()[0]
        # Prefix match so this query does not count itself
        cursor.execute(
            "SELECT count(*) FROM pg_stat_activity WHERE query LIKE 'LISTEN %'"
        )
        listen_connections = cursor.fetchone()[0]
        # Read the live setting rather than hardcoding it
        cursor.execute("SHOW max_connections")
        max_connections = int(cursor.fetchone()[0])
        utilization = active_connections / max_connections
        return jsonify({
            "status": "healthy" if utilization < 0.8 else "warning",
            "active_connections": active_connections,
            "listen_connections": listen_connections,
            "max_connections": max_connections,
            "utilization_percent": round(utilization * 100, 1),
            "warning": utilization > 0.8,
            "critical": utilization > 0.9,
        }), 200 if utilization < 0.9 else 503
    except Exception as e:
        return jsonify({"status": "unhealthy", "error": str(e)}), 503
```
Medium-Term Solutions (100-150 VMs)
4. Upgrade EC2 Instance
Current: t3.large (2 vCPU, 8GB RAM)
Recommended: t3.xlarge (4 vCPU, 16GB RAM)
5. Upgrade to Production WSGI Server
Current: Flask development server (threaded=True)
Recommended: Gunicorn with gevent workers
CMD ["gunicorn", "-w", "4", "-k", "gevent", \
"--worker-connections", "1000", \
"--bind", "0.0.0.0:5000", \
"lablink_allocator_service.main:app"]
Long-Term Architecture Changes (150+ VMs)
Option A: Redis Pub/Sub Migration (Recommended)
Architecture:
```
Client VMs → Redis Pub/Sub ← Allocator
                                 ↓
                     PostgreSQL (state only)
```
Benefits:
- No global commit lock (Redis designed for pub/sub)
- ~2KB overhead per connection (vs ~10MB per PostgreSQL connection)
- Sub-millisecond latency
- Can handle 10,000+ connections
Migration Strategy:
- Add Redis container to allocator stack
- Implement parallel notification system (PostgreSQL + Redis)
- Gradually migrate VMs to Redis subscribers
- Deprecate PostgreSQL LISTEN/NOTIFY
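A minimal redis-py sketch of the target pattern (hypothetical channel name, payload format, and Redis location; the real migration would wrap this inside the client service):

```python
import redis

r = redis.Redis(host="localhost", port=6379)  # assumed Redis location

# Allocator side: publish an assignment event (hypothetical payload).
r.publish("vm_assignments", "vm-001:assigned")

# Client side: block until a message arrives; no PostgreSQL
# connection is held while waiting.
pubsub = r.pubsub()
pubsub.subscribe("vm_assignments")
for message in pubsub.listen():
    if message["type"] == "message":
        print("assignment event:", message["data"])
        break
```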
Option B: Short Polling (Simpler Alternative)
Change from: Long-polling with LISTEN/NOTIFY
Change to: Simple periodic polling (15-30 seconds)
Trade-offs:
- ✅ Pros: Zero persistent connections, simple implementation, works at any scale
- ❌ Cons: 15-30 s added latency for students (vs near-instant), higher request volume
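A sketch of what the client loop could become under Option B (hypothetical endpoint that returns 204 until an assignment exists):

```python
import time
import requests

def wait_for_assignment(url, hostname, interval=20):
    """Poll the allocator every `interval` seconds until assigned."""
    while True:
        resp = requests.get(url, params={"hostname": hostname}, timeout=30)
        if resp.status_code == 200:
            return resp.json()  # assignment details
        time.sleep(interval)    # 204 (no assignment yet): poll again
```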
Recommended Action Plan
Phase 1: Immediate (Current: 50 VMs)
- Add the connection health endpoint (/api/health/connections)
- Set an explicit max_connections=150 in the PostgreSQL config
- Expose the connection settings through config.yaml
Phase 2: Before Reaching 75 VMs
- Raise max_connections=200
Phase 3: Before Reaching 100 VMs
- Upgrade the EC2 instance from t3.large to t3.xlarge (item 4 above)
- Replace the Flask development server with Gunicorn + gevent (item 5 above)
Phase 4: Before Reaching 150 VMs
- Begin the Redis Pub/Sub migration (Option A above)
Success Metrics
Operational Health
- Connection utilization < 80% of max_connections
- PostgreSQL CPU < 60% under normal load
- LISTEN connection count == active VM count
- Memory usage < 70% of instance RAM
Scaling Goals
- Support 100 VMs without configuration changes
- Support 150 VMs with configuration tuning only
- Support 200+ VMs with Redis migration
Related Issues
References
- Recall.ai, "Postgres LISTEN/NOTIFY does not scale"
Priority: Medium (not blocking at current 50 VM scale, but critical for growth)
Effort: Medium (monitoring = 1 day, config changes = 1 day, Redis migration = 1 week)
Impact: High (determines maximum scalability of LabLink platform)