diff --git a/docs/RUNBOOK.md b/docs/RUNBOOK.md
new file mode 100644
index 0000000..676d098
--- /dev/null
+++ b/docs/RUNBOOK.md
@@ -0,0 +1,417 @@
+# Production Runbook — Stellar Mainnet
+
+## 1. Network & Environment Alignment
+
+### Verify deployment target
+
+```bash
+# Current env values (must match mainnet)
+echo "STELLAR_NETWORK=$STELLAR_NETWORK"
+echo "STELLAR_RPC_URL=$STELLAR_RPC_URL"
+echo "VAULT_CONTRACT_ID=$VAULT_CONTRACT_ID"
+```
+
+| Variable | Mainnet value |
+|---|---|
+| `STELLAR_NETWORK` | `mainnet` |
+| `STELLAR_RPC_URL` | `https://soroban-mainnet.stellar.org` |
+| Network passphrase | `Public Global Stellar Network ; September 2015` |
+| `NODE_ENV` | `production` |
+
+Contract IDs, token addresses, and the `STELLAR_AGENT_SECRET_KEY` **must** be mainnet instances. A testnet key used on mainnet signs for an account that is typically unfunded and unauthorized there, so its transactions fail.
+
+### Pre-flight alignment checks
+
+```bash
+# 1. Confirm network in env matches deployment context
+grep STELLAR_NETWORK .env | grep -q mainnet || echo "WARN: not mainnet"
+
+# 2. Verify the RPC endpoint responds with a ledger sequence (compare against a mainnet explorer)
+curl -s -X POST "$STELLAR_RPC_URL" \
+  -H "Content-Type: application/json" \
+  -d '{"jsonrpc":"2.0","id":1,"method":"getLatestLedger"}' | \
+  jq '.result.sequence'
+
+# 3. Confirm agent key controls the vault on this network
+# (validate via a read-only contract call — getVaultInfo or similar)
+
+# 4. Verify Prisma migration status matches schema
+npx prisma migrate status
+```
+
+### Boot sequence validation
+
+On startup, `src/config/env.ts` validates:
+- `STELLAR_NETWORK` ∈ {mainnet, testnet, futurenet}
+- `STELLAR_AGENT_SECRET_KEY` starts with `S`, length 56
+- `WALLET_ENCRYPTION_KEY` is exactly 64 hex chars
+- All required env vars are set (throws if missing)
+
+The `GET /health/ready` endpoint reports three subsystems: `database`, `eventListener`, `agentLoop`. All must be `ready: true` before the load balancer marks the instance healthy.
+
+---
+
+## 2. Key Custody
+
+### Secrets under management
+
+| Secret | Source | Purpose | Rotation |
+|---|---|---|---|
+| `STELLAR_AGENT_SECRET_KEY` | env var | Signs Soroban contract calls (rebalance, update total assets) | On key compromise or quarterly |
+| `WALLET_ENCRYPTION_KEY` | env var | AES-256-GCM key encrypting custodial wallet secrets in `custodial_wallets` table | Coordinated re-encryption migration |
+| `JWT_SEED` | env var | Signs session JWTs | Every 90 days (invalidates all sessions) |
+| `DATABASE_URL` | env var/env file | PostgreSQL connection | DB password rotation per provider policy |
+
+### Agent key rotation
+
+1. Generate new Stellar keypair:
+   ```bash
+   stellar keys generate neurowealth-agent-v2 # or via SDK
+   ```
+2. Fund the new public key with XLM on mainnet.
+3. If the vault contract maintains an operator allowlist, update it to include the new key.
+4. Set `STELLAR_AGENT_SECRET_KEY` in your secret manager to the new **secret**.
+5. Redeploy all instances (rolling update).
+6. Verify agent loop health: `GET /health/ready` → `agentLoop: ready`.
+7. **Keep the old key funded for 30 days** in case a rollback is needed.
+8. Drain and discard the old key after the rollback window.
+
+### Wallet encryption key rotation
+
+1. Provision `WALLET_ENCRYPTION_KEY_NEW` in the secret manager alongside the current key.
+2. Run a one-off migration script that:
+   - Reads every row from `custodial_wallets`
+   - Decrypts `encryptedSecret` with the old key
+   - Re-encrypts with the new key and a fresh IV
+   - Writes back the new `encryptedSecret`, `iv`, `authTag`
+3. Swap the env var to the new key.
+4. Verify a sample of users can still sign operations.
+5. Remove the old key from the secret store.
+
+### Custodial wallet recovery
+
+Losing `WALLET_ENCRYPTION_KEY` makes every custodial wallet secret **permanently** unrecoverable.
+- **Backup**: Regular DB snapshots preserve encrypted key material.
+- **Audit**: The `custodial_wallets` table stores (`publicKey`, `encryptedSecret`, `iv`, `authTag`) — never plaintext secrets.
+- **Disaster**: If the DB is restored from a backup, the encryption key that was in use at backup time must still be available.
+
+### Secret storage policies
+
+- **Never** commit secrets to git. Use `.env.example` as a template.
+- **Production**: AWS Secrets Manager / HashiCorp Vault with access audit logging.
+- **CI/CD**: GitHub Environments secrets, injected as env vars in deploy workflows.
+- **Local dev**: `.env` file (gitignored).
+
+---
+
+## 3. RPC Failover
+
+### Current architecture
+
+`src/stellar/client.ts` creates a single `rpc.Server(STELLAR_RPC_URL)` singleton. There is **no built-in automatic failover**. A mainnet RPC outage halts event ingestion and agent operations.
+
+### Failover strategy
+
+#### Option A: Load-balanced endpoint (recommended)
+
+Point `STELLAR_RPC_URL` at a single URL that routes across multiple RPC providers. The default targets one provider only:
+
+```
+STELLAR_RPC_URL=https://soroban-mainnet.stellar.org
+```
+
+Replace it with the URL of a load balancer or provider that pools:
+- `https://soroban-mainnet.stellar.org` (SDF)
+- `https://mainnet.sorobanrpc.com` (public)
+- `https://rpc.stellar.org/mainnet` (alternative)
+
+#### Option B: Multi-provider fallback (not yet implemented)
+
+If you need resilience without a load balancer, wrap `getRpcServer()` to fall back:
+
+```typescript
+const RPC_URLS = [
+  'https://soroban-mainnet.stellar.org',
+  'https://mainnet.sorobanrpc.com',
+]
+let currentIndex = 0
+
+export function getRpcServer(): rpc.Server {
+  // Returns current server; call rotateRpc() on failure
+  if (!rpcServer) rpcServer = new rpc.Server(RPC_URLS[currentIndex])
+  return rpcServer
+}
+
+export function rotateRpc(): void {
+  currentIndex = (currentIndex + 1) % RPC_URLS.length
+  rpcServer = new rpc.Server(RPC_URLS[currentIndex])
+  logger.warn(`[RPC] Failed over to ${RPC_URLS[currentIndex]}`)
+}
+```
+
+Wire `rotateRpc()` into error handlers in `fetchEvents` and `submitTransaction`.
+
+### RPC outage playbook
+
+| Symptom | Action |
+|---|---|
+| `fetchEvents` fails with connection error | Rotate RPC URL (manual or automated) |
+| `sendTransaction` hangs or times out | Rotate RPC; check tx status via `getTransaction` before resubmitting |
+| Persistent RPC failures | Switch to backup RPC provider entirely |
+| All known RPCs down | Pause event listener; set `agentLoop` to degraded; page on-call |
+
+### Verify RPC health
+
+```bash
+# Check latest ledger via RPC
+curl -s -X POST "$STELLAR_RPC_URL" \
+  -H "Content-Type: application/json" \
+  -d '{"jsonrpc":"2.0","id":1,"method":"getLatestLedger"}' | \
+  jq '.result.sequence'
+
+# Monitor via /metrics
+curl -s http://localhost:3001/metrics | grep cursor_lag
+```
+
+---
+
+## 4. Ledger Lag Alerts
+
+### Metrics
+
+| Metric | Type | Description |
+|---|---|---|
+| `cursor_lag_ledgers` | Gauge | `latest_ledger - last_processed_ledger` |
+| `last_processed_ledger` | Gauge | Last ledger successfully processed |
+| `events_processed_total` | Counter | Events processed, labelled by type and status |
+
+Alert rules are defined in `docs/OBSERVABILITY.md` and deployed to Prometheus.
+
+### Alert thresholds
+
+| Severity | Lag | Action |
+|---|---|---|
+| Info | > 10 ledgers | Note — may be normal during low traffic |
+| Warning | > 50 ledgers for 5 min | Investigate within 1 hour |
+| Critical | > 100 ledgers for 2 min | Page immediately |
+
+### Investigation steps
+
+```bash
+# 1. Check current lag
+curl -s http://localhost:3001/metrics | grep cursor_lag
+
+# 2. Check last processed ledger in DB
+psql "$DATABASE_URL" -c "SELECT * FROM event_cursors WHERE \"contractId\" = '$VAULT_CONTRACT_ID';"
+
+# 3. Check listener logs for errors
+grep "Event Listener" /var/log/app/*.log | tail -50
+
+# 4. Check RPC connectivity
+curl -s -X POST "$STELLAR_RPC_URL" \
+  -H "Content-Type: application/json" \
+  -d '{"jsonrpc":"2.0","id":1,"method":"getLatestLedger"}' | \
+  jq '.result.sequence'
+
+# 5. Check for backpressure (DLQ growth)
+curl -s http://localhost:3001/metrics | grep dlq_size
+
+# 6. Check database connection pool
+psql "$DATABASE_URL" -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"
+```
+
+### Common causes & remediation
+
+| Cause | Signal | Fix |
+|---|---|---|
+| RPC outage | `fetchEvents` errors in logs | Rotate RPC endpoint (see §3) |
+| DB slow / locked | High `db_operation_duration_seconds` | Check locks, pool size, index usage |
+| Schema validation failures | DLQ growth, `event_validation` errors | Inspect DLQ, fix event format or validator |
+| Listener crashed | `cursor_lag` rising, `agent_loop_status == 0` | Restart the container; check for OOM kills |
+| Network partition | RPC timeouts | Check DNS, firewall, egress rules |
+
+### Recover from lag
+
+```bash
+# If lag < 1000 ledgers — automatic backfill runs on restart
+# If lag > 1000 ledgers — manual backfill recommended via admin endpoint
+
+# Manual backfill (from a specific ledger)
+# Restart the service; backfill runs automatically up to latest
+# If auto-backfill is too slow, consider:
+# 1. Stop the listener
+# 2. Update event_cursors to an earlier ledger
+# 3. Restart the listener to trigger backfill
+psql "$DATABASE_URL" -c "UPDATE event_cursors SET \"lastProcessedLedger\" = $EARLIER_LEDGER WHERE \"contractId\" = '$VAULT_CONTRACT_ID';"
+```
+
+---
+
+## 5. DLQ Replay Procedure
+
+### Overview
+
+Events that fail processing (validation error, DB error, missing user) are stored in the `dead_letter_events` table with status `PENDING`. The DLQ module (`src/stellar/dlq.ts`) manages retries through three admin API endpoints.
+
+### Inspect DLQ
+
+```bash
+# Via admin API (requires ADMIN_API_TOKEN)
+curl -s -H "Authorization: Bearer $ADMIN_API_TOKEN" \
+  http://localhost:3001/api/admin/dlq/inspect | jq
+
+# Filter by status
+curl -s -H "Authorization: Bearer $ADMIN_API_TOKEN" \
+  "http://localhost:3001/api/admin/dlq/inspect?status=PENDING" | jq
+
+# Via direct DB query
+psql "$DATABASE_URL" -c "
+  SELECT id, \"eventType\", \"txHash\", ledger, status, \"retryCount\", error, \"createdAt\"
+  FROM dead_letter_events
+  ORDER BY \"createdAt\" DESC
+  LIMIT 50;
+"
+```
+
+### Dry-run retry
+
+```bash
+curl -s -X POST -H "Authorization: Bearer $ADMIN_API_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{"dryRun": true}' \
+  http://localhost:3001/api/admin/dlq/retry | jq
+```
+
+Dry run simulates the retry loop without persisting status changes.
+
+### Full retry
+
+```bash
+curl -s -X POST -H "Authorization: Bearer $ADMIN_API_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{}' \
+  http://localhost:3001/api/admin/dlq/retry | jq
+```
+
+Returns:
+```json
+{
+  "resolved": 5,
+  "failed": 2,
+  "totalRemaining": 2
+}
+```
+
+### Resolve a specific event
+
+If an event cannot be processed (e.g. user deleted), manually resolve it:
+
+```bash
+curl -s -X POST -H "Authorization: Bearer $ADMIN_API_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{"id": "uuid-of-event"}' \
+  http://localhost:3001/api/admin/dlq/resolve | jq
+```
+
+### Automatic retry behavior
+
+- `retryAll()` processes all `PENDING` and `RETRIED` events sequentially.
+- Success → status set to `RESOLVED`, count +1.
+- Failure → status set to `RETRIED`, count +1, logged.
+- There is **no automatic scheduled retry**. All retries are manual via the admin API.
+- When DLQ size reaches 50, a critical log line is emitted and the Prometheus `dlq_size` gauge crosses the critical threshold.
+
+### DLQ replay decision matrix
+
+| Event type | Common failure | Retry likely? | Notes |
+|---|---|---|---|
+| `deposit` | User not found | No until user exists | Resolve after user registers |
+| `deposit` | Schema validation | Depends | Fix validator or event source |
+| `withdraw` | Position not found | No | May indicate data integrity issue |
+| `rebalance` | DB constraint | Yes | Transient — retry typically succeeds |
+| Any | RPC/DB timeout | Yes | Transient — retry typically succeeds |
+
+---
+
+## 6. Incident Contacts
+
+### Escalation tiers
+
+| Tier | Role | Responsibility | Contact |
+|---|---|---|---|
+| T1 | On-call engineer | Triage, restart, DLQ retry, RPC rotation | PagerDuty / Opsgenie |
+| T2 | Backend lead | Code fix, data reconciliation, migration rollback | Slack @backend-lead |
+| T3 | Engineering manager | Stakeholder comms, post-mortem, priority decisions | Slack @eng-mgr |
+| T4 | Security officer | Key compromise, wallet recovery, audit | Slack @sec-officer |
+
+### Communication channels
+
+| Channel | Purpose |
+|---|---|
+| `#neurowealth-alerts` | Prometheus alert notifications |
+| `#neurowealth-incidents` | Incident coordination thread |
+| PagerDuty | T1 on-call escalation |
+| Email: `ops@neurowealth.io` | Backup contact for critical outages |
+
+### Incident severity definitions
+
+| Severity | Definition | Response time | Escalation |
+|---|---|---|---|
+| **SEV1** | Event processing halted, funds at risk, data loss | < 15 min | T1 → T2 → T3 |
+| **SEV2** | Lag > 100 ledgers, DLQ > 50, agent loop degraded | < 1 hour | T1 → T2 |
+| **SEV3** | Lag > 50 ledgers, DLQ > 20, elevated error rate | < 8 hours | T1 |
+| **SEV4** | Minor anomalies, informational alerts | Next business day | None |
+
+### Post-incident checklist
+
+- [ ] Root cause identified and documented
+- [ ] Fix deployed (or rollback executed)
+- [ ] DLQ resolved and lag cleared
+- [ ] Alert thresholds adjusted if needed
+- [ ] Post-mortem filed in `docs/post-mortems/`
+- [ ] Runbook updated with lessons learned
+
+---
+
+## Quick Reference Commands
+
+```bash
+# Health
+curl http://localhost:3001/health/live
+curl http://localhost:3001/health/ready
+curl http://localhost:3001/health
+
+# Metrics
+curl http://localhost:3001/metrics | grep -E "(cursor_lag|dlq_size|events_processed|agent_loop)"
+
+# DLQ inspect
+curl -s -H "Authorization: Bearer $ADMIN_API_TOKEN" \
+  http://localhost:3001/api/admin/dlq/inspect | jq '. | length'
+
+# DLQ retry (dry run)
+curl -s -X POST -H "Authorization: Bearer $ADMIN_API_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{"dryRun": true}' \
+  http://localhost:3001/api/admin/dlq/retry
+
+# DLQ retry (live)
+curl -s -X POST -H "Authorization: Bearer $ADMIN_API_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{}' \
+  http://localhost:3001/api/admin/dlq/retry
+
+# DB — cursor status
+psql "$DATABASE_URL" -c "SELECT * FROM event_cursors;"
+
+# DB — DLQ count by status
+psql "$DATABASE_URL" -c "
+  SELECT status, count(*) FROM dead_letter_events GROUP BY status;
+"
+
+# DB — recent processed events
+psql "$DATABASE_URL" -c "
+  SELECT \"eventType\", ledger, \"txHash\", \"createdAt\"
+  FROM processed_events
+  ORDER BY ledger DESC LIMIT 10;
+"
+```
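
Review note on §3, Option B: the runbook marks multi-provider fallback as not yet implemented and asks for `rotateRpc()` to be wired into error handlers. One possible shape for that wiring is sketched below, kept independent of the Stellar SDK so it can be unit-tested without network access. `EndpointRotator` and `withFailover` are hypothetical names that do not exist in the codebase, and trying each endpoint at most once per call is an assumption, not a documented policy.

```typescript
// Hypothetical helper: cycles through a fixed list of RPC endpoint URLs.
class EndpointRotator {
  private readonly urls: readonly string[]
  private index = 0

  constructor(urls: readonly string[]) {
    if (urls.length === 0) throw new Error('at least one RPC URL is required')
    this.urls = urls
  }

  /** Number of configured endpoints. */
  get size(): number {
    return this.urls.length
  }

  /** URL currently in use. */
  current(): string {
    return this.urls[this.index]
  }

  /** Advance to the next URL (wrapping around) and return it. */
  rotate(): string {
    this.index = (this.index + 1) % this.urls.length
    return this.current()
  }
}

/**
 * Runs `op` against the current endpoint. On failure it rotates to the next
 * endpoint and retries, trying each endpoint at most once per call before
 * rethrowing the last error. A successful endpoint stays selected.
 */
async function withFailover<T>(
  rotator: EndpointRotator,
  op: (url: string) => Promise<T>,
): Promise<T> {
  let lastError: unknown
  for (let attempt = 0; attempt < rotator.size; attempt++) {
    try {
      return await op(rotator.current())
    } catch (err) {
      lastError = err
      rotator.rotate()
    }
  }
  throw lastError
}
```

In `fetchEvents`, the callback passed to `withFailover` would be the existing RPC call, with the server for the given URL obtained however `src/stellar/client.ts` prefers (for example by caching one `rpc.Server` per URL). For `sendTransaction`, blind resubmission after a timeout can duplicate a transaction, so the status check via `getTransaction` from the outage playbook should come before any retry.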