diff --git a/docs/NON_CUSTODIAL_ARCHITECTURE.md b/docs/NON_CUSTODIAL_ARCHITECTURE.md new file mode 100644 index 0000000..8c057b2 --- /dev/null +++ b/docs/NON_CUSTODIAL_ARCHITECTURE.md @@ -0,0 +1,298 @@ +# Non-Custodial Wallet Architecture — Evaluation & Migration Plan + +## Current Architecture (Custodial) + +The backend holds user Stellar private keys encrypted at rest (`aes-256-gcm`) in the `custodial_wallets` table. Every deposit and withdrawal goes through `executeCustodialVaultOperation()`, which decrypts the user's key inside the backend process, signs the transaction, and submits it to Stellar. + +``` +User → POST /api/deposit → transaction-controller.ts + → contract.depositForUser() → wallet.getKeypairForUser(userId) + → decrypt secret with WALLET_ENCRYPTION_KEY + → executeWriteContractCall(method, args, userKeypair) + → build → simulate → prepare → SIGN(userKeypair) → submit → wait +``` + +The agent loop (`rebalance`, `update_total_assets`) signs with `STELLAR_AGENT_SECRET_KEY` — this is a separate concern and remains backend-signed (see §3). + +## Target Architecture (Non-Custodial) + +**User keys never enter the backend.** Deposit and withdrawal transactions are built and simulated server-side, returned as unsigned XDR, signed by the user's Stellar wallet (Freighter), and submitted to Stellar — either directly by the client or through a relay endpoint. 
+ +``` +User → POST /api/vault/build-transaction + → contract.buildUnsignedVaultTransaction(method, userAddress, amount, asset) + → build → simulate → prepare → return unsigned XDR + ↓ +User signs XDR with Freighter (user's private key, client-side) + ↓ +Option A: User submits XDR directly to Stellar network + (backend event listener picks up the on-chain event) + +Option B: User → POST /api/vault/submit-transaction { signedXdr } + → backend submits to Stellar RPC → returns txHash + → backend creates Transaction record (status=PENDING) + → event listener confirms status=CONFIRMED on-chain +``` + +## 1. Tradeoffs + +### Security + +| Factor | Custodial (current) | Non-custodial (target) | +|---|---|---| +| Key storage | AES-256-GCM at rest in DB | User holds key in Freighter/extension | +| Key in memory | Decrypted key during every tx | Never in backend memory | +| Backend compromise | Attacker can drain all wallets | Attacker cannot drain — cannot sign | +| WALLET_ENCRYPTION_KEY loss | All wallet keys permanently lost | Not applicable (key not stored) | +| DB backup exposure | Encrypted keys + encryption key = compromise | Only public data | +| User error | Not possible (backend handles signing) | User can lose key / sign wrong tx | +| **Risk: replay** | Backend controls nonce/seq | User signs and submits — backend must verify idempotency via txHash dedup in event listener | +| **Risk: front-running** | Backend controls submission timing | User submits — race with other txs possible | + +### User Experience + +| Aspect | Custodial | Non-custodial | +|---|---|---| +| Onboarding | Backend generates key silently | User must install Freighter / wallet extension | +| Transaction flow | Single API call | 2-step: build → sign in wallet → submit | +| Mobile support | Works with any HTTP client | Requires wallet SDK (Freighter mobile, WalletConnect) | +| Gas fees | Backend pays (from agent key) | User pays Soroban fees (can be subsidized; see "Fee Subsidy Design" in §1) | +| Recovery | 
Backend can recover via encrypted backup | User must manage their own seed phrase | +| Speed | Single round-trip | Multi-round-trip with user confirmation | + +### Operational Complexity + +| Aspect | Custodial | Non-custodial | +|---|---|---| +| Key management | `WALLET_ENCRYPTION_KEY` rotation, backup, audit | Eliminated entirely | +| Compliance | Custodial license / custody obligations in many jurisdictions | Reduced or eliminated | +| Transaction tracking | Backend knows tx outcome synchronously | Must rely on event listener for confirmation | +| Error recovery | Backend can retry with same key | User must re-sign if tx fails (Freighter may not retain) | +| Testing | Single service to test | Needs wallet integration tests (or mock signing) | + +### Fee Subsidy Design + +For non-custodial flows, the user pays Soroban fees. If the product wants to subsidize fees: +- Build tx with the agent key as the **fee-bump source**, wrapping the user's inner tx in a fee bump transaction. +- Or refund the user out-of-band (e.g., send XLM to their wallet periodically). +- Simplest approach: return the unsigned XDR, let the user sign and submit, and **do not subsidize** — the vault contract already accounts for protocol yield. + +## 2. Migration Plan + +### Phase 0: Inventory & Safety (current state) + +- `buildUnsignedVaultTransaction()` exists and works (`POST /api/vault/build-transaction`). +- No endpoint accepts a signed XDR back for relay submission. +- Frontend does not use the build-transaction endpoint — all traffic goes through custodial `POST /api/deposit` and `POST /api/withdraw`. + +### Phase 1: Add Signed Transaction Relay + +**Goal**: Provide the missing half of the non-custodial flow — accept a user-signed XDR, submit it, and track the result. 
+ +Add `POST /api/vault/submit-transaction`: + +```typescript +// src/validators/vault-validators.ts +import { z } from 'zod' + +export const submitTransactionSchema = z.object({ + signedXdr: z.string().min(1, 'signedXdr is required'), + type: z.enum(['deposit', 'withdraw']), + amount: z.number().positive(), + // Optional; the route falls back to 'USDC' when omitted + assetSymbol: z.string().min(1).optional(), +}) + +// src/routes/vault.ts — new route +// Assumes the project imports the SDK as @stellar/stellar-sdk +import { Transaction, TransactionBuilder } from '@stellar/stellar-sdk' + +router.post('/submit-transaction', requireAuth, async (req, res) => { + const parsed = submitTransactionSchema.safeParse(req.body) + if (!parsed.success) { + return res.status(400).json({ error: 'Validation error', details: parsed.error.flatten() }) + } + + const { signedXdr, type, amount, assetSymbol } = parsed.data + const userId = req.auth!.userId + const walletAddress = req.auth!.walletAddress + + // Reconstruct the Transaction from XDR; fromXDR throws on malformed input + let tx: Transaction + try { + const decoded = TransactionBuilder.fromXDR(signedXdr, getNetworkPassphrase()) + // fromXDR can also return a FeeBumpTransaction; only plain transactions are relayed + if (!(decoded instanceof Transaction)) { + return res.status(400).json({ error: 'Fee-bump transactions are not accepted' }) + } + tx = decoded + } catch { + return res.status(400).json({ error: 'signedXdr is not a valid transaction envelope' }) + } + + // Verify source account matches the authenticated user (tx.source is the account ID string) + if (tx.source !== walletAddress) { + return res.status(403).json({ error: 'Transaction source does not match authenticated user' }) + } + + // Verify the transaction targets the vault contract with the correct method + const operation = tx.operations[0] + // (validate operation targets VAULT_CONTRACT_ID and method is deposit/withdraw) + + // Check it hasn't been submitted already (idempotency via the unique txHash) + const hash = tx.hash().toString('hex') + const existing = await db.transaction.findUnique({ where: { txHash: hash } }) + if (existing) { + return res.status(200).json({ txHash: hash, status: existing.status }) + } + + // Submit to Stellar RPC + const txHash = await submitTransaction(tx) + + // Create a PENDING transaction record so the frontend has immediate feedback + await db.transaction.create({ + data: { + userId, + txHash, + type: type === 'deposit' ? TransactionType.DEPOSIT : TransactionType.WITHDRAWAL, + status: TransactionStatus.PENDING, + amount: String(amount), + assetSymbol: assetSymbol ?? 'USDC', + network: extractNetwork(), + }, + }) + + // Fire-and-forget confirmation poll (or rely on event listener) + waitForConfirmation(txHash).then(result => { + // Update transaction status based on result + // The event listener will also catch this, but updating proactively + // reduces latency for the user + }).catch(err => { + logger.error(`[Relay] Confirmation polling failed for ${txHash}:`, err) + }) + + return res.status(200).json({ txHash, status: 'submitted' }) +}) +``` 
+ +**Key verifications on the relayed XDR**: +1. Source account === authenticated user's `walletAddress` +2. Operation targets the known `VAULT_CONTRACT_ID` +3. Method is one of `deposit` or `withdraw` +4. `txHash` not already processed (idempotency) +5. Transaction is fully signed (all signatures present) + +### Phase 2: Frontend Migration + +**Goal**: Move users from custodial API calls to Freighter-signed transactions. + +1. Feature-detect Freighter / wallet extension on the web client. +2. For users without a wallet: show onboarding flow (install Freighter, create wallet). +3. Modify the deposit/withdraw UI flow: + + ``` + Old flow: New flow (non-custodial): + Enter amount Enter amount + Click "Deposit" Click "Deposit" + (backend signs) → POST /api/vault/build-transaction + → Receive unsigned XDR + → signTransaction(xdr) from @stellar/freighter-api + → POST /api/vault/submit-transaction (signed XDR) + → Show confirmation + ``` + +4. Run both custodial and non-custodial paths in parallel during migration. Use a feature flag: + + ```typescript + const USE_NON_CUSTODIAL = process.env.FEATURE_NON_CUSTODIAL === 'true' + ``` + +5. Frontend routes: + - If `USE_NON_CUSTODIAL`: disable custodial deposit/withdraw buttons, route through build→sign→submit flow. + - If flag off: existing custodial flow unchanged. + +### Phase 3: Deprecate Custodial Deposit/Withdraw + +**Goal**: Remove backend user-key storage and signing. + +1. **Stop creating new custodial wallets**. 
Remove the `createCustodialWallet()` call from the registration flow. New users authenticate via Stellar challenge (SEP-10-like, already implemented) and use their existing Freighter wallet — no backend-held key needed. + +2. **Add a migration endpoint** for existing custodial users to "claim" their wallet: + ``` + POST /api/vault/claim-wallet + Body: { signedXdr: "" } + ``` + This proves the user controls the custodial key. On success, remove the `CustodialWallet` row. The user then uses their Freighter key for all future transactions. + + Alternatively, simply leave existing custodial wallets in place for legacy users and let them migrate at their own pace. New users are non-custodial from day one. + +3. **Remove custodial endpoints**: + - Remove `depositForUser()` / `withdrawForUser()` from `contract.ts` + - Remove `getKeypairForUser()` and `createCustodialWallet()` from `wallet.ts` + - Remove `POST /api/deposit` and `POST /api/withdraw` routes (or make them call the build→relay flow) + +4. **Drop `custodial_wallets` table** (after all users migrated or after a grace period). + +### Phase 4: Cleanup + +- Remove `WALLET_ENCRYPTION_KEY` from required env vars. +- Remove the `custodial_wallets` Prisma model and migration. +- Audit logs to ensure no secret material is logged. +- Update documentation and runbook. + +## 3. What Stays Backend-Signed + +| Operation | Signer | Reason | +|---|---|---| +| `rebalance(protocol, apy)` | `STELLAR_AGENT_SECRET_KEY` | Protocol-level operation, not user-scoped | +| `update_total_assets(amount)` | `STELLAR_AGENT_SECRET_KEY` | Protocol-level accounting, not user-scoped | +| Event listener | — | Read-only (polling RPC, no signing) | +| Auth challenge/verify | — | Only signature verification (user signs nonce) | + +These are **agent operations** that mutate vault state based on protocol conditions, not individual user actions. They will continue to use `STELLAR_AGENT_SECRET_KEY`. + +## 4. 
Transaction Confirmation (Event Listener) + +The event listener (`src/stellar/events.ts`) already confirms transactions by polling Soroban RPC for contract events. This is **submission-path agnostic** — whether the backend submits or the user submits directly to Stellar, the event listener will: + +1. Detect `deposit` / `withdraw` events on-chain. +2. Match by `txHash` against the `transactions` table. +3. Update `status → CONFIRMED` and update position balances. + +For the relay path (Phase 1), the backend also proactively polls `getTransaction()` for faster feedback, but the event listener is the source of truth and handles edge cases (e.g., user submits directly to Stellar bypassing the relay). + +## 5. Security Considerations (User-Signed Transactions) + +### Replay Protection + +- The event listener deduplicates by `(contractId, txHash, eventType, ledger)` (see `events.ts:352-368`). +- The `transactions` table has a unique constraint on `txHash`. +- Nonce/sequence number protection is built into Stellar transactions — a submitted transaction cannot be replayed if the sequence number advances. + +### Transaction Validation (Relay Path) + +Before submitting a user-provided signed XDR, the backend must validate: + +1. **Source account** matches `req.auth.walletAddress` — prevents a malicious user from submitting a transaction signed by another user's key. +2. **Contract ID** matches `VAULT_CONTRACT_ID` — prevents the relay from being used to submit arbitrary Stellar transactions. +3. **Method** is `deposit` or `withdraw` — prevents relay of agent-only operations. +4. **Signatures** are present and valid — `tx.signatures.length > 0`. (Full signature verification against the expected source account is ideal but adds complexity; the RPC node will reject invalid signatures on submission.) +5. **txHash** not already processed — prevents duplicate submissions. 
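
The five relay checks listed above can be expressed as one pure function, which keeps them unit-testable without a Stellar SDK or database. The following is a minimal sketch under stated assumptions: `DecodedRelayTx`, `RelayContext`, `RelayCheck`, and `validateRelayTx` are hypothetical names that do not exist in the codebase, and the fields of `DecodedRelayTx` are assumed to have been extracted from the parsed XDR by the route beforehand.

```typescript
// Hypothetical, SDK-independent view of a decoded transaction. In the real
// route these fields would be read from the parsed XDR envelope.
interface DecodedRelayTx {
  source: string         // source account ID of the transaction
  contractId: string     // contract invoked by the first operation
  method: string         // contract function name
  signatureCount: number // number of signatures on the envelope
  txHash: string         // hex-encoded transaction hash
}

// Hypothetical context supplied by the route handler.
interface RelayContext {
  walletAddress: string   // req.auth.walletAddress
  vaultContractId: string // VAULT_CONTRACT_ID
  alreadyProcessed: (txHash: string) => boolean // assumed txHash lookup
}

type RelayCheck =
  | 'ok'
  | 'source_mismatch'
  | 'wrong_contract'
  | 'method_not_allowed'
  | 'unsigned'
  | 'duplicate'

const ALLOWED_METHODS: readonly string[] = ['deposit', 'withdraw']

// Runs the checks in the order of the list above and returns the first failure.
function validateRelayTx(tx: DecodedRelayTx, ctx: RelayContext): RelayCheck {
  if (tx.source !== ctx.walletAddress) return 'source_mismatch'
  if (tx.contractId !== ctx.vaultContractId) return 'wrong_contract'
  if (ALLOWED_METHODS.indexOf(tx.method) === -1) return 'method_not_allowed'
  // Presence check only; the RPC node rejects invalid signatures on submission.
  if (tx.signatureCount < 1) return 'unsigned'
  if (ctx.alreadyProcessed(tx.txHash)) return 'duplicate'
  return 'ok'
}
```

The route would map each failure code to an HTTP response (for example `source_mismatch` to 403) and treat `duplicate` as the existing idempotent 200 path; that mapping is left out because it is a product decision rather than part of the validation itself.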
+ +### Fee Attack Mitigation + +A malicious user could submit a signed transaction with an extremely low fee, causing it to stall without being included in a ledger until it expires. Mitigations: +- The relay endpoint should enforce a minimum `fee` on the XDR. +- Or the relay submits with a fee-bump transaction (backed by the agent key) to improve the odds of inclusion. +- Simplest: accept the risk — the tx will either confirm or expire, and the event listener handles both outcomes cleanly. + +## 6. Rollback Plan + +If the non-custodial migration causes issues: + +| Phase | Rollback | +|---|---| +| Phase 1 (relay endpoint) | Remove the new endpoint; custodial paths continue working | +| Phase 2 (frontend migration) | Set the feature flag `FEATURE_NON_CUSTODIAL=false`; revert to custodial API calls | +| Phase 3 (deprecation) | If the `custodial_wallets` table still exists, re-enable custodial endpoints. If the table is dropped, restore from backup | +| Phase 4 (cleanup) | Re-add `WALLET_ENCRYPTION_KEY` and `custodial_wallets` model if needed (full schema revert) | + +## 7. 
Migration Checklist + +- [ ] Phase 1: `POST /api/vault/submit-transaction` endpoint implemented with XDR validation +- [ ] Phase 1: Idempotency check via `txHash` uniqueness +- [ ] Phase 1: Event listener confirms relayed transactions (already works) +- [ ] Phase 2: Frontend Freighter integration for signing +- [ ] Phase 2: Feature flag for parallel custodial/non-custodial paths +- [ ] Phase 2: Wallet onboarding flow for new users without Freighter +- [ ] Phase 3: Existing custodial user migration (claim-wallet or grace period) +- [ ] Phase 3: Remove `depositForUser()` / `withdrawForUser()` from contract.ts +- [ ] Phase 3: Remove `POST /api/deposit` and `POST /api/withdraw` routes +- [ ] Phase 3: Remove `createCustodialWallet()` from registration flow +- [ ] Phase 4: Drop `custodial_wallets` table +- [ ] Phase 4: Remove `WALLET_ENCRYPTION_KEY` env var +- [ ] Phase 4: Update docs/RUNBOOK.md — remove key custody section for user keys +- [ ] Phase 4: Audit logs for secret exposure diff --git a/docs/RUNBOOK.md b/docs/RUNBOOK.md new file mode 100644 index 0000000..676d098 --- /dev/null +++ b/docs/RUNBOOK.md @@ -0,0 +1,417 @@ +# Production Runbook — Stellar Mainnet + +## 1. Network & Environment Alignment + +### Verify deployment target + +```bash +# Current env values (must match mainnet) +echo "STELLAR_NETWORK=$STELLAR_NETWORK" +echo "STELLAR_RPC_URL=$STELLAR_RPC_URL" +echo "VAULT_CONTRACT_ID=$VAULT_CONTRACT_ID" +``` + +| Variable | Mainnet value | +|---|---| +| `STELLAR_NETWORK` | `mainnet` | +| `STELLAR_RPC_URL` | `https://soroban-mainnet.stellar.org` | +| Network passphrase | `Public Global Stellar Network ; September 2015` | +| `NODE_ENV` | `production` | + +Contract IDs, token addresses, and the `STELLAR_AGENT_SECRET_KEY` **must** be mainnet instances. A testnet key on mainnet will sign invalid operations. + +### Pre-flight alignment checks + +```bash +# 1. 
Confirm network in env matches deployment context +grep STELLAR_NETWORK .env | grep -q mainnet || echo "WARN: not mainnet" + +# 2. Verify RPC connection returns mainnet ledger +curl -s -X POST "$STELLAR_RPC_URL" \ + -H "Content-Type: application/json" \ + -d '{"jsonrpc":"2.0","id":1,"method":"getLatestLedger"}' | \ + jq '.result.sequence' + +# 3. Confirm agent key controls the vault on this network +# (validate via a read-only contract call — getVaultInfo or similar) + +# 4. Verify Prisma migration status matches schema +npx prisma migrate status +``` + +### Boot sequence validation + +On startup, `src/config/env.ts` validates: +- `STELLAR_NETWORK` ∈ {mainnet, testnet, futurenet} +- `STELLAR_AGENT_SECRET_KEY` starts with `S`, length 56 +- `WALLET_ENCRYPTION_KEY` is exactly 64 hex chars +- All required env vars are set (throws if missing) + +The `GET /health/ready` endpoint reports three subsystems: `database`, `eventListener`, `agentLoop`. All must be `ready: true` before the load balancer marks the instance healthy. + +--- + +## 2. Key Custody + +### Secrets under management + +| Secret | Source | Purpose | Rotation | +|---|---|---|---| +| `STELLAR_AGENT_SECRET_KEY` | env var | Signs Soroban contract calls (rebalance, update total assets) | On key compromise or quarterly | +| `WALLET_ENCRYPTION_KEY` | env var | AES-256-GCM key encrypting custodial wallet secrets in `custodial_wallets` table | Coordinated re-encryption migration | +| `JWT_SEED` | env var | Signs session JWTs | Every 90 days (invalidates all sessions) | +| `DATABASE_URL` | env var/env file | PostgreSQL connection | DB password rotation per provider policy | + +### Agent key rotation + +1. Generate new Stellar keypair: + ```bash + stellar keys generate neurowealth-agent-v2 # or via SDK + ``` +2. Fund the new public key with XLM on mainnet. +3. If the vault contract maintains an operator allowlist, update it to include the new key. +4. 
Set `STELLAR_AGENT_SECRET_KEY` in your secret manager to the new **secret**. +5. Redeploy all instances (rolling update). +6. Verify agent loop health: `GET /health/ready` → `agentLoop: ready`. +7. **Keep the old key funded for 30 days** in case a rollback is needed. +8. Drain and discard the old key after the rollback window. + +### Wallet encryption key rotation + +1. Provision `WALLET_ENCRYPTION_KEY_NEW` in the secret manager alongside the current key. +2. Run a one-off migration script that: + - Reads every row from `custodial_wallets` + - Decrypts `encryptedSecret` with the old key + - Re-encrypts with the new key + - Writes back the new `encryptedSecret`, `iv`, `authTag` +3. Swap the env var to the new key. +4. Verify a sample of users can still sign operations. +5. Remove the old key from the secret store. + +### Custodial wallet recovery + +Losing `WALLET_ENCRYPTION_KEY` **permanently** destroys all custodial wallet keys. +- **Backup**: Regular DB snapshots preserve encrypted key material. +- **Audit**: The `custodial_wallets` table stores (`publicKey`, `encryptedSecret`, `iv`, `authTag`) — never plaintext secrets. +- **Disaster**: If the DB is restored from a backup, the encryption key at backup time must still be available. + +### Secret storage policies + +- **Never** commit secrets to git. Use `.env.example` as a template. +- **Production**: AWS Secrets Manager / HashiCorp Vault with access audit logging. +- **CI/CD**: GitHub Environments secrets, injected as env vars in deploy workflows. +- **Local dev**: `.env` file (gitignored). + +--- + +## 3. RPC Failover + +### Current architecture + +`src/stellar/client.ts` creates a single `rpc.Server(STELLAR_RPC_URL)` singleton. There is **no built-in automatic failover**. A mainnet RPC outage halts event ingestion and agent operations. 
+ +### Failover strategy + +#### Option A: Load-balanced endpoint (recommended) + +Point `STELLAR_RPC_URL` at a single endpoint that routes across multiple RPC providers. The current value targets one provider: + +``` +STELLAR_RPC_URL=https://soroban-mainnet.stellar.org +``` + +Replace it with the URL of a load balancer or provider that pools: +- `https://soroban-mainnet.stellar.org` (SDF) +- `https://mainnet.sorobanrpc.com` (public) +- `https://rpc.stellar.org/mainnet` (alternative) + +#### Option B: Multi-provider fallback (not yet implemented) + +If you need resilience without a load balancer, wrap `getRpcServer()` to fall back: + +```typescript +// Assumes the project imports the SDK as @stellar/stellar-sdk; +// `logger` is the project's existing logger. +import { rpc } from '@stellar/stellar-sdk' + +const RPC_URLS = [ + 'https://soroban-mainnet.stellar.org', + 'https://mainnet.sorobanrpc.com', +] +let currentIndex = 0 +// Lazily created client for the currently selected URL +let rpcServer: rpc.Server | undefined + +export function getRpcServer(): rpc.Server { + // Returns current server; call rotateRpc() on failure + if (!rpcServer) rpcServer = new rpc.Server(RPC_URLS[currentIndex]) + return rpcServer +} + +export function rotateRpc(): void { + // Advance to the next URL, wrapping around at the end of the list + currentIndex = (currentIndex + 1) % RPC_URLS.length + rpcServer = new rpc.Server(RPC_URLS[currentIndex]) + logger.warn(`[RPC] Failed over to ${RPC_URLS[currentIndex]}`) +} +``` + +Wire `rotateRpc()` into error handlers in `fetchEvents` and `submitTransaction`. + +### RPC outage playbook + +| Symptom | Action | +|---|---| +| `fetchEvents` fails with connection error | Rotate RPC URL (manual or automated) | +| `sendTransaction` hangs or times out | Rotate RPC; check the tx status via `getTransaction` before resubmitting | +| Persistent RPC failures | Switch to backup RPC provider entirely | +| All known RPCs down | Pause event listener; set `agentLoop` to degraded; page on-call | + +### Verify RPC health + +```bash +# Check latest ledger via RPC +curl -s -X POST "$STELLAR_RPC_URL" \ + -H "Content-Type: application/json" \ + -d '{"jsonrpc":"2.0","id":1,"method":"getLatestLedger"}' | \ + jq '.result.sequence' + +# Monitor via /metrics +curl -s http://localhost:3001/metrics | grep cursor_lag +``` + +--- + +## 4. 
Ledger Lag Alerts + +### Metrics + +| Metric | Type | Description | +|---|---|---| +| `cursor_lag_ledgers` | Gauge | `latest_ledger - last_processed_ledger` | +| `last_processed_ledger` | Gauge | Last ledger successfully processed | +| `events_processed_total` | Counter | Events processed, labelled by type and status | + +Alert rules are defined in `docs/OBSERVABILITY.md` and deployed to Prometheus. + +### Alert thresholds + +| Severity | Lag | Action | +|---|---|---| +| Info | > 10 ledgers | Note — may be normal during low traffic | +| Warning | > 50 ledgers for 5 min | Investigate within 1 hour | +| Critical | > 100 ledgers for 2 min | Page immediately | + +### Investigation steps + +```bash +# 1. Check current lag +curl -s http://localhost:3001/metrics | grep cursor_lag + +# 2. Check last processed ledger in DB +psql "$DATABASE_URL" -c "SELECT * FROM event_cursors WHERE \"contractId\" = '$VAULT_CONTRACT_ID';" + +# 3. Check listener logs for errors +grep "Event Listener" /var/log/app/*.log | tail -50 + +# 4. Check RPC connectivity +curl -s -X POST "$STELLAR_RPC_URL" \ + -H "Content-Type: application/json" \ + -d '{"jsonrpc":"2.0","id":1,"method":"getLatestLedger"}' | \ + jq '.result.sequence' + +# 5. Check for backpressure (DLQ growth) +curl -s http://localhost:3001/metrics | grep dlq_size + +# 6. 
Check database connection pool +psql "$DATABASE_URL" -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';" +``` + +### Common causes & remediation + +| Cause | Signal | Fix | +|---|---|---| +| RPC outage | `fetchEvents` errors in logs | Rotate RPC endpoint (see §3) | +| DB slow / locked | High `db_operation_duration_seconds` | Check locks, pool size, index usage | +| Schema validation failures | DLQ growth, `event_validation` errors | Inspect DLQ, fix event format or validator | +| Listener crashed | `cursor_lag` rising, `agent_loop_status == 0` | Container restart, check OOM killer | +| Network partition | RPC timeouts | Check DNS, firewall, egress rules | + +### Recover from lag + +```bash +# If lag < 1000 ledgers — automatic backfill runs on restart +# If lag > 1000 ledgers — manual backfill recommended via admin endpoint + +# Manual backfill (from a specific ledger) +# Restart the service; backfill runs automatically up to latest +# If auto-backfill is too slow, consider: +# 1. Stop the listener +# 2. Update event_cursors to an earlier ledger +# 3. Restart the listener to trigger backfill +psql "$DATABASE_URL" -c "UPDATE event_cursors SET \"lastProcessedLedger\" = $EARLIER_LEDGER WHERE \"contractId\" = '$VAULT_CONTRACT_ID';" +``` + +--- + +## 5. DLQ Replay Procedure + +### Overview + +Events that fail processing (validation error, DB error, missing user) are stored in the `dead_letter_events` table with status `PENDING`. The DLQ module (`src/stellar/dlq.ts`) manages retries through three admin API endpoints. 
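
The status lifecycle that this section relies on (`PENDING` → `RETRIED` → `RESOLVED`) and the summary returned by a retry run can be sketched as follows. This is an in-memory illustration only, not the code in `src/stellar/dlq.ts`: `DeadLetterEvent`, `RetrySummary`, and `retryAllSketch` are hypothetical names, the handler is synchronous for brevity, and incrementing `retryCount` only on failure is an assumption about how the count is kept.

```typescript
type DlqStatus = 'PENDING' | 'RETRIED' | 'RESOLVED'

// Hypothetical in-memory shape of a dead_letter_events row.
interface DeadLetterEvent {
  id: string
  status: DlqStatus
  retryCount: number
}

interface RetrySummary {
  resolved: number
  failed: number
  totalRemaining: number
}

// Processes every unresolved event in order. With dryRun = true the outcome
// is counted but no status or retryCount is changed.
function retryAllSketch(
  events: DeadLetterEvent[],
  handler: (event: DeadLetterEvent) => boolean,
  dryRun = false,
): RetrySummary {
  let resolved = 0
  let failed = 0
  for (const event of events) {
    if (event.status === 'RESOLVED') continue
    let ok = false
    try {
      ok = handler(event)
    } catch (_err) {
      ok = false // a thrown error counts as a failed retry
    }
    if (ok) {
      resolved += 1
      if (!dryRun) event.status = 'RESOLVED'
    } else {
      failed += 1
      if (!dryRun) {
        event.status = 'RETRIED'
        event.retryCount += 1 // assumption: count tracks failed attempts
      }
    }
  }
  // Events still awaiting resolution after this run (unchanged in a dry run).
  const totalRemaining = events.filter(e => e.status !== 'RESOLVED').length
  return { resolved, failed, totalRemaining }
}
```

Separating the dry-run flag from the handler keeps the two admin retry modes on one code path, so a dry run exercises exactly the logic a live retry would.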
+ +### Inspect DLQ + +```bash +# Via admin API (requires ADMIN_API_TOKEN) +curl -s -H "Authorization: Bearer $ADMIN_API_TOKEN" \ + http://localhost:3001/api/admin/dlq/inspect | jq + +# Filter by status +curl -s -H "Authorization: Bearer $ADMIN_API_TOKEN" \ + "http://localhost:3001/api/admin/dlq/inspect?status=PENDING" | jq + +# Via direct DB query +psql "$DATABASE_URL" -c " + SELECT id, \"eventType\", \"txHash\", ledger, status, \"retryCount\", error, \"createdAt\" + FROM dead_letter_events + ORDER BY \"createdAt\" DESC + LIMIT 50; +" +``` + +### Dry-run retry + +```bash +curl -s -X POST -H "Authorization: Bearer $ADMIN_API_TOKEN" \ + -H "Content-Type: application/json" \ + -d '{"dryRun": true}' \ + http://localhost:3001/api/admin/dlq/retry | jq +``` + +Dry run simulates the retry loop without persisting status changes. + +### Full retry + +```bash +curl -s -X POST -H "Authorization: Bearer $ADMIN_API_TOKEN" \ + -H "Content-Type: application/json" \ + -d '{}' \ + http://localhost:3001/api/admin/dlq/retry | jq +``` + +Returns: +```json +{ + "resolved": 5, + "failed": 2, + "totalRemaining": 2 +} +``` + +### Resolve a specific event + +If an event cannot be processed (e.g. user deleted), manually resolve it: + +```bash +curl -s -X POST -H "Authorization: Bearer $ADMIN_API_TOKEN" \ + -H "Content-Type: application/json" \ + -d '{"id": "uuid-of-event"}' \ + http://localhost:3001/api/admin/dlq/resolve | jq +``` + +### Automatic retry behavior + +- `retryAll()` processes all `PENDING` and `RETRIED` events sequentially. +- Success → status set to `RESOLVED`, count +1. +- Failure → status set to `RETRIED`, count +1, logged. +- There is **no automatic scheduled retry**. All retries are manual via the admin API. +- When DLQ size reaches 50, a critical log line is emitted and the Prometheus `dlq_size` gauge crosses the critical threshold. + +### DLQ replay decision matrix + +| Event type | Common failure | Retry likely? 
| Notes | +|---|---|---|---| +| `deposit` | User not found | No until user exists | Resolve after user registers | +| `deposit` | Schema validation | Depends | Fix validator or event source | +| `withdraw` | Position not found | No | May indicate data integrity issue | +| `rebalance` | DB constraint | Yes | Transient — retry typically succeeds | +| Any | RPC/DB timeout | Yes | Transient — retry typically succeeds | + +--- + +## 6. Incident Contacts + +### Escalation tiers + +| Tier | Role | Responsibility | Contact | +|---|---|---|---| +| T1 | On-call engineer | Triage, restart, DLQ retry, RPC rotation | PagerDuty / Opsgenie | +| T2 | Backend lead | Code fix, data reconciliation, migration rollback | Slack @backend-lead | +| T3 | Engineering manager | Stakeholder comms, post-mortem, priority decisions | Slack @eng-mgr | +| T4 | Security officer | Key compromise, wallet recovery, audit | Slack @sec-officer | + +### Communication channels + +| Channel | Purpose | +|---|---| +| `#neurowealth-alerts` | Prometheus alert notifications | +| `#neurowealth-incidents` | Incident coordination thread | +| PagerDuty | T1 on-call escalation | +| Email: `ops@neurowealth.io` | Backup contact for critical outages | + +### Incident severity definitions + +| Severity | Definition | Response time | Escalation | +|---|---|---|---| +| **SEV1** | Event processing halted, funds at risk, data loss | < 15 min | T1 → T2 → T3 | +| **SEV2** | Lag > 100 ledgers, DLQ > 50, agent loop degraded | < 1 hour | T1 → T2 | +| **SEV3** | Lag > 50 ledgers, DLQ > 20, elevated error rate | < 8 hours | T1 | +| **SEV4** | Minor anomalies, informational alerts | Next business day | None | + +### Post-incident checklist + +- [ ] Root cause identified and documented +- [ ] Fix deployed (or rollback executed) +- [ ] DLQ resolved and lag cleared +- [ ] Alert thresholds adjusted if needed +- [ ] Post-mortem filed in `docs/post-mortems/` +- [ ] Runbook updated with lessons learned + +--- + +## Quick Reference 
Commands + +```bash +# Health +curl http://localhost:3001/health/live +curl http://localhost:3001/health/ready +curl http://localhost:3001/health + +# Metrics +curl http://localhost:3001/metrics | grep -E "(cursor_lag|dlq_size|events_processed|agent_loop)" + +# DLQ inspect +curl -s -H "Authorization: Bearer $ADMIN_API_TOKEN" \ + http://localhost:3001/api/admin/dlq/inspect | jq '. | length' + +# DLQ retry (dry run) +curl -s -X POST -H "Authorization: Bearer $ADMIN_API_TOKEN" \ + -H "Content-Type: application/json" \ + -d '{"dryRun": true}' \ + http://localhost:3001/api/admin/dlq/retry + +# DLQ retry (live) +curl -s -X POST -H "Authorization: Bearer $ADMIN_API_TOKEN" \ + -H "Content-Type: application/json" \ + -d '{}' \ + http://localhost:3001/api/admin/dlq/retry + +# DB — cursor status +psql "$DATABASE_URL" -c "SELECT * FROM event_cursors;" + +# DB — DLQ count by status +psql "$DATABASE_URL" -c " + SELECT status, count(*) FROM dead_letter_events GROUP BY status; +" + +# DB — recent processed events +psql "$DATABASE_URL" -c " + SELECT \"eventType\", ledger, \"txHash\", \"createdAt\" + FROM processed_events + ORDER BY ledger DESC LIMIT 10; +" +```