diff --git a/docs/NON_CUSTODIAL_ARCHITECTURE.md b/docs/NON_CUSTODIAL_ARCHITECTURE.md
new file mode 100644
index 0000000..8c057b2
--- /dev/null
+++ b/docs/NON_CUSTODIAL_ARCHITECTURE.md
@@ -0,0 +1,298 @@
+# Non-Custodial Wallet Architecture — Evaluation & Migration Plan
+
+## Current Architecture (Custodial)
+
+The backend holds user Stellar private keys encrypted at rest (`aes-256-gcm`) in the `custodial_wallets` table. Every deposit and withdrawal goes through `executeCustodialVaultOperation()`, which decrypts the user's key inside the backend process, signs the transaction, and submits it to Stellar.
+
+```
+User → POST /api/deposit → transaction-controller.ts
+  → contract.depositForUser() → wallet.getKeypairForUser(userId)
+    → decrypt secret with WALLET_ENCRYPTION_KEY
+    → executeWriteContractCall(method, args, userKeypair)
+      → build → simulate → prepare → SIGN(userKeypair) → submit → wait
+```
+
+The agent loop (`rebalance`, `update_total_assets`) signs with `STELLAR_AGENT_SECRET_KEY` — this is a separate concern and remains backend-signed (see §4).
+
+## Target Architecture (Non-Custodial)
+
+**User keys never enter the backend.** Deposit and withdrawal transactions are built and simulated server-side, returned as unsigned XDR, signed by the user's Stellar wallet (Freighter), and submitted to Stellar — either directly by the client or through a relay endpoint.
+
+```
+User → POST /api/vault/build-transaction
+  → contract.buildUnsignedVaultTransaction(method, userAddress, amount, asset)
+    → build → simulate → prepare → return unsigned XDR
+                                                         ↓
+User signs XDR with Freighter (user's private key, client-side)
+                                                         ↓
+Option A: User submits XDR directly to Stellar network
+          (backend event listener picks up the on-chain event)
+
+Option B: User → POST /api/vault/submit-signed-transaction { signedXdr }
+          → backend submits to Stellar RPC → returns txHash
+          → backend creates Transaction record (status=PENDING)
+          → event listener confirms status=CONFIRMED on-chain
+```
+
+## 1. Tradeoffs
+
+### Security
+
+| Factor | Custodial (current) | Non-custodial (target) |
+|---|---|---|
+| Key storage | AES-256-GCM at rest in DB | User holds key in Freighter/extension |
+| Key in memory | Decrypted key during every tx | Never in backend memory |
+| Backend compromise | Attacker can drain all wallets | Attacker cannot drain — cannot sign |
+| WALLET_ENCRYPTION_KEY loss | All wallet keys permanently lost | Not applicable (key not stored) |
+| DB backup exposure | Encrypted keys + encryption key = compromise | Only public data |
+| User error | Not possible (backend handles signing) | User can lose key / sign wrong tx |
+| **Risk: replay** | Backend controls nonce/seq | User signs and submits — backend must verify idempotency via txHash dedup in event listener |
+| **Risk: front-running** | Backend controls submission timing | User submits — race with other txs possible |
+
+### User Experience
+
+| Aspect | Custodial | Non-custodial |
+|---|---|---|
+| Onboarding | Backend generates key silently | User must install Freighter / wallet extension |
+| Transaction flow | Single API call | 2-step: build → sign in wallet → submit |
+| Mobile support | Works with any HTTP client | Requires wallet SDK (Freighter mobile, WalletConnect) |
+| Gas fees | Backend pays (from agent key) | User pays Soroban fees (can be subsidized — see §1c) |
+| Recovery | Backend can recover via encrypted backup | User must manage their own seed phrase |
+| Speed | Single round-trip | Multi-round-trip with user confirmation |
+
+### Operational Complexity
+
+| Aspect | Custodial | Non-custodial |
+|---|---|---|
+| Key management | `WALLET_ENCRYPTION_KEY` rotation, backup, audit | Eliminated entirely |
+| Compliance | Custodial license / custody obligations in many jurisdictions | Reduced or eliminated |
+| Transaction tracking | Backend knows tx outcome synchronously | Must rely on event listener for confirmation |
+| Error recovery | Backend can retry with same key | User must re-sign if tx fails (Freighter may not retain) |
+| Testing | Single service to test | Needs wallet integration tests (or mock signing) |
+
+### Fee Subsidy Design
+
+For non-custodial flows, the user pays Soroban fees. If the product wants to subsidize fees:
+- Build tx with the agent key as the **fee-bump source**, wrapping the user's inner tx in a fee bump transaction.
+- Or refund the user out-of-band (e.g., send XLM to their wallet periodically).
+- Simplest approach: return the unsigned XDR, let the user sign and submit, and **do not subsidize** — the vault contract already accounts for protocol yield.
+
+## 2. Migration Plan
+
+### Phase 0: Inventory & Safety (current state)
+
+- `buildUnsignedVaultTransaction()` exists and works (`POST /api/vault/build-transaction`).
+- No endpoint accepts a signed XDR back for relay submission.
+- Frontend does not use the build-transaction endpoint — all traffic goes through custodial `POST /api/deposit` and `POST /api/withdraw`.
+
+### Phase 1: Add Signed Transaction Relay
+
+**Goal**: Provide the missing half of the non-custodial flow — accept a user-signed XDR, submit it, and track the result.
+
+Add `POST /api/vault/submit-transaction`:
+
+```typescript
+// src/validators/vault-validators.ts
+export const submitTransactionSchema = z.object({
+  signedXdr: z.string().min(1, 'signedXdr is required'),
+  type: z.enum(['deposit', 'withdraw']),
+  amount: z.number().positive(),
+})
+
+// src/routes/vault.ts — new route
+router.post('/submit-transaction', requireAuth, async (req, res) => {
+  const parsed = submitTransactionSchema.safeParse(req.body)
+  if (!parsed.success) {
+    return res.status(400).json({ error: 'Validation error', details: parsed.error.flatten() })
+  }
+
+  const { signedXdr, type, amount } = parsed.data
+  const userId = req.auth!.userId
+  const walletAddress = req.auth!.walletAddress
+
+  // Reconstruct the Transaction from XDR
+  const tx = TransactionBuilder.fromXDR(signedXdr, getNetworkPassphrase())
+
+  // Verify source account matches the authenticated user
+  if (tx.source.accountId() !== walletAddress) {
+    return res.status(403).json({ error: 'Transaction source does not match authenticated user' })
+  }
+
+  // Verify the transaction targets the vault contract with the correct method
+  const operation = tx.operations[0]
+  // (validate operation targets VAULT_CONTRACT_ID and method is deposit/withdraw)
+
+  // Check it hasn't been submitted already (idempotency via the event listener)
+  const hash = tx.hash().toString('hex')
+  const existing = await db.transaction.findUnique({ where: { txHash: hash } })
+  if (existing) {
+    return res.status(200).json({ txHash: hash, status: existing.status })
+  }
+
+  // Submit to Stellar RPC
+  const txHash = await submitTransaction(tx)
+
+  // Create a PENDING transaction record so the frontend has immediate feedback
+  await db.transaction.create({
+    data: {
+      userId,
+      txHash,
+      type: type === 'deposit' ? TransactionType.DEPOSIT : TransactionType.WITHDRAWAL,
+      status: TransactionStatus.PENDING,
+      amount: String(amount),
+      assetSymbol: parsed.data.assetSymbol ?? 'USDC',
+      network: extractNetwork(),
+    },
+  })
+
+  // Fire-and-forget confirmation poll (or rely on event listener)
+  waitForConfirmation(txHash).then(result => {
+    // Update transaction status based on result
+    // The event listener will also catch this, but updating proactively
+    // reduces latency for the user
+  }).catch(err => {
+    logger.error(`[Relay] Confirmation polling failed for ${txHash}:`, err)
+  })
+
+  return res.status(200).json({ txHash, status: 'submitted' })
+})
+```
+
+**Key verifications on the relayed XDR**:
+1. Source account === authenticated user's `walletAddress`
+2. Operation targets the known `VAULT_CONTRACT_ID`
+3. Method is one of `deposit` or `withdraw`
+4. `txHash` not already processed (idempotency)
+5. Transaction is fully signed (all signatures present)
+
+### Phase 2: Frontend Migration
+
+**Goal**: Move users from custodial API calls to Freighter-signed transactions.
+
+1. Feature-detect Freighter / wallet extension on the web client.
+2. For users without a wallet: show onboarding flow (install Freighter, create wallet).
+3. Modify the deposit/withdraw UI flow:
+
+   ```
+   Old flow:           New flow (non-custodial):
+   Enter amount        Enter amount
+   Click "Deposit"     Click "Deposit"
+   (backend signs)     → POST /api/vault/build-transaction
+                       → Receive unsigned XDR
+                       → window.freighter.signTransaction(xdr)
+                       → POST /api/vault/submit-transaction (signed XDR)
+                       → Show confirmation
+   ```
+
+4. Run both custodial and non-custodial paths in parallel during migration. Use a feature flag:
+
+   ```typescript
+   const USE_NON_CUSTODIAL = process.env.FEATURE_NON_CUSTODIAL === 'true'
+   ```
+
+5. Frontend routes:
+   - If `USE_NON_CUSTODIAL`: disable custodial deposit/withdraw buttons, route through build→sign→submit flow.
+   - If flag off: existing custodial flow unchanged.
+
+### Phase 3: Deprecate Custodial Deposit/Withdraw
+
+**Goal**: Remove backend user-key storage and signing.
+
+1. **Stop creating new custodial wallets**. Remove the `createCustodialWallet()` call from the registration flow. New users authenticate via Stellar challenge (SEP-10-like, already implemented) and use their existing Freighter wallet — no backend-held key needed.
+
+2. **Add a migration endpoint** for existing custodial users to "claim" their wallet:
+   ```
+   POST /api/vault/claim-wallet
+   Body: { signedXdr: "<self-transfer of 0 XLM from custodial key to user's Freighter key>" }
+   ```
+   This proves the user controls the custodial key. On success, remove the `CustodialWallet` row. The user then uses their Freighter key for all future transactions.
+
+   Alternatively, simply leave existing custodial wallets in place for legacy users and let them migrate at their own pace. New users are non-custodial from day one.
+
+3. **Remove custodial endpoints**:
+   - Remove `depositForUser()` / `withdrawForUser()` from `contract.ts`
+   - Remove `getKeypairForUser()` and `createCustodialWallet()` from `wallet.ts`
+   - Remove `POST /api/deposit` and `POST /api/withdraw` routes (or make them call the build→relay flow)
+
+4. **Drop `custodial_wallets` table** (after all users migrated or after a grace period).
+
+### Phase 4: Cleanup
+
+- Remove `WALLET_ENCRYPTION_KEY` from required env vars.
+- Remove the `custodial_wallets` Prisma model and migration.
+- Audit logs to ensure no secret material is logged.
+- Update documentation and runbook.
+
+## 3. What Stays Backend-Signed
+
+| Operation | Signer | Reason |
+|---|---|---|
+| `rebalance(protocol, apy)` | `STELLAR_AGENT_SECRET_KEY` | Protocol-level operation, not user-scoped |
+| `update_total_assets(amount)` | `STELLAR_AGENT_SECRET_KEY` | Protocol-level accounting, not user-scoped |
+| Event listener | — | Read-only (polling RPC, no signing) |
+| Auth challenge/verify | — | Only signature verification (user signs nonce) |
+
+These are **agent operations** that mutate vault state based on protocol conditions, not individual user actions. They will continue to use `STELLAR_AGENT_SECRET_KEY`.
+
+## 4. Transaction Confirmation (Event Listener)
+
+The event listener (`src/stellar/events.ts`) already confirms transactions by polling Soroban RPC for contract events. This is **submission-path agnostic** — whether the backend submits or the user submits directly to Stellar, the event listener will:
+
+1. Detect `deposit` / `withdraw` events on-chain.
+2. Match by `txHash` against the `transactions` table.
+3. Update `status → CONFIRMED` and update position balances.
+
+For the relay path (Phase 1), the backend also proactively polls `getTransaction()` for faster feedback, but the event listener is the source of truth and handles edge cases (e.g., user submits directly to Stellar bypassing the relay).
+
+## 5. Security Considerations (User-Signed Transactions)
+
+### Replay Protection
+
+- The event listener deduplicates by `(contractId, txHash, eventType, ledger)` (see `events.ts:352-368`).
+- The `transactions` table has a unique constraint on `txHash`.
+- Nonce/sequence number protection is built into Stellar transactions — a submitted transaction cannot be replayed if the sequence number advances.
+
+### Transaction Validation (Relay Path)
+
+Before submitting a user-provided signed XDR, the backend must validate:
+
+1. **Source account** matches `req.auth.walletAddress` — prevents a malicious user from submitting a transaction signed by another user's key.
+2. **Contract ID** matches `VAULT_CONTRACT_ID` — prevents the relay from being used to submit arbitrary Stellar transactions.
+3. **Method** is `deposit` or `withdraw` — prevents relay of agent-only operations.
+4. **Signatures** are present and valid — `tx.signatures.length > 0`. (Full signature verification against the expected source account is ideal but adds complexity; the RPC node will reject invalid signatures on submission.)
+5. **txHash** not already processed — prevents duplicate submissions.
+
+### Fee Attack Mitigation
+
+A malicious user could submit a signed transaction with an extremely low fee, causing it to hang in the mempool. Mitigations:
+- The relay endpoint should enforce a minimum `fee` on the XDR.
+- Or the relay submits with a fee-bump transaction (backed by the agent key) to guarantee inclusion.
+- Simplest: accept the risk — the tx will either confirm or expire, and the event listener handles both outcomes cleanly.
+
+## 6. Rollback Plan
+
+If the non-custodial migration causes issues:
+
+| Phase | Rollback |
+|---|---|
+| Phase 1 (relay endpoint) | Remove the new endpoint; custodial paths continue working |
+| Phase 2 (frontend migration) | Flip feature flag `USE_NON_CUSTODIAL=false`; revert to custodial API calls |
+| Phase 3 (deprecation) | If `custodial_wallets` still exist, re-enable custodial endpoints. If table is dropped, restore from backup |
+| Phase 4 (cleanup) | Re-add `WALLET_ENCRYPTION_KEY` and `custodial_wallets` model if needed (full schema revert) |
+
+## 7. Migration Checklist
+
+- [ ] Phase 1: `POST /api/vault/submit-transaction` endpoint implemented with XDR validation
+- [ ] Phase 1: Idempotency check via `txHash` uniqueness
+- [ ] Phase 1: Event listener confirms relayed transactions (already works)
+- [ ] Phase 2: Frontend Freighter integration for signing
+- [ ] Phase 2: Feature flag for parallel custodial/non-custodial paths
+- [ ] Phase 2: Wallet onboarding flow for new users without Freighter
+- [ ] Phase 3: Existing custodial user migration (claim-wallet or grace period)
+- [ ] Phase 3: Remove `depositForUser()` / `withdrawForUser()` from contract.ts
+- [ ] Phase 3: Remove `POST /api/deposit` and `POST /api/withdraw` routes
+- [ ] Phase 3: Remove `createCustodialWallet()` from registration flow
+- [ ] Phase 4: Drop `custodial_wallets` table
+- [ ] Phase 4: Remove `WALLET_ENCRYPTION_KEY` env var
+- [ ] Phase 4: Update docs/RUNBOOK.md — remove key custody section for user keys
+- [ ] Phase 4: Audit logs for secret exposure
diff --git a/docs/RUNBOOK.md b/docs/RUNBOOK.md
new file mode 100644
index 0000000..676d098
--- /dev/null
+++ b/docs/RUNBOOK.md
@@ -0,0 +1,417 @@
+# Production Runbook — Stellar Mainnet
+
+## 1. Network & Environment Alignment
+
+### Verify deployment target
+
+```bash
+# Current env values (must match mainnet)
+echo "STELLAR_NETWORK=$STELLAR_NETWORK"
+echo "STELLAR_RPC_URL=$STELLAR_RPC_URL"
+echo "VAULT_CONTRACT_ID=$VAULT_CONTRACT_ID"
+```
+
+| Variable | Mainnet value |
+|---|---|
+| `STELLAR_NETWORK` | `mainnet` |
+| `STELLAR_RPC_URL` | `https://soroban-mainnet.stellar.org` |
+| Network passphrase | `Public Global Stellar Network ; September 2015` |
+| `NODE_ENV` | `production` |
+
+Contract IDs, token addresses, and the `STELLAR_AGENT_SECRET_KEY` **must** be mainnet instances. A testnet key on mainnet will sign invalid operations.
+
+### Pre-flight alignment checks
+
+```bash
+# 1. Confirm network in env matches deployment context
+grep STELLAR_NETWORK .env | grep -q mainnet || echo "WARN: not mainnet"
+
+# 2. Verify RPC connection returns mainnet ledger
+curl -s -X POST "$STELLAR_RPC_URL" \
+  -H "Content-Type: application/json" \
+  -d '{"jsonrpc":"2.0","id":1,"method":"getLatestLedger"}' | \
+  jq '.result.sequence'
+
+# 3. Confirm agent key controls the vault on this network
+#    (validate via a read-only contract call — getVaultInfo or similar)
+
+# 4. Verify Prisma migration status matches schema
+npx prisma migrate status
+```
+
+### Boot sequence validation
+
+On startup, `src/config/env.ts` validates:
+- `STELLAR_NETWORK` ∈ {mainnet, testnet, futurenet}
+- `STELLAR_AGENT_SECRET_KEY` starts with `S`, length 56
+- `WALLET_ENCRYPTION_KEY` is exactly 64 hex chars
+- All required env vars are set (throws if missing)
+
+The `GET /health/ready` endpoint reports three subsystems: `database`, `eventListener`, `agentLoop`. All must be `ready: true` before the load balancer marks the instance healthy.
+
+---
+
+## 2. Key Custody
+
+### Secrets under management
+
+| Secret | Source | Purpose | Rotation |
+|---|---|---|---|
+| `STELLAR_AGENT_SECRET_KEY` | env var | Signs Soroban contract calls (rebalance, update total assets) | On key compromise or quarterly |
+| `WALLET_ENCRYPTION_KEY` | env var | AES-256-GCM key encrypting custodial wallet secrets in `custodial_wallets` table | Coordinated re-encryption migration |
+| `JWT_SEED` | env var | Signs session JWTs | Every 90 days (invalidates all sessions) |
+| `DATABASE_URL` | env var/env file | PostgreSQL connection | DB password rotation per provider policy |
+
+### Agent key rotation
+
+1. Generate new Stellar keypair:
+   ```bash
+   stellar keys generate neurowealth-agent-v2  # or via SDK
+   ```
+2. Fund the new public key with XLM on mainnet.
+3. If the vault contract maintains an operator allowlist, update it to include the new key.
+4. Set `STELLAR_AGENT_SECRET_KEY` in your secret manager to the new **secret**.
+5. Redeploy all instances (rolling update).
+6. Verify agent loop health: `GET /health/ready` → `agentLoop: ready`.
+7. **Keep the old key funded for 30 days** in case a rollback is needed.
+8. Drain and discard the old key after the rollback window.
+
+### Wallet encryption key rotation
+
+1. Provision `WALLET_ENCRYPTION_KEY_NEW` in the secret manager alongside the current key.
+2. Run a one-off migration script that:
+   - Reads every row from `custodial_wallets`
+   - Decrypts `encryptedSecret` with the old key
+   - Re-encrypts with the new key
+   - Writes back the new `encryptedSecret`, `iv`, `authTag`
+3. Swap the env var to the new key.
+4. Verify a sample of users can still sign operations.
+5. Remove the old key from the secret store.
+
+### Custodial wallet recovery
+
+Losing `WALLET_ENCRYPTION_KEY` **permanently** destroys all custodial wallet keys.
+- **Backup**: Regular DB snapshots preserve encrypted key material.
+- **Audit**: The `custodial_wallets` table stores (`publicKey`, `encryptedSecret`, `iv`, `authTag`) — never plaintext secrets.
+- **Disaster**: If the DB is restored from a backup, the encryption key at backup time must still be available.
+
+### Secret storage policies
+
+- **Never** commit secrets to git. Use `.env.example` as a template.
+- **Production**: AWS Secrets Manager / HashiCorp Vault with access audit logging.
+- **CI/CD**: GitHub Environments secrets, injected as env vars in deploy workflows.
+- **Local dev**: `.env` file (gitignored).
+
+---
+
+## 3. RPC Failover
+
+### Current architecture
+
+`src/stellar/client.ts` creates a single `rpc.Server(STELLAR_RPC_URL)` singleton. There is **no built-in automatic failover**. A mainnet RPC outage halts event ingestion and agent operations.
+
+### Failover strategy
+
+#### Option A: Load-balanced endpoint (recommended)
+
+Configure a single URL that routes across multiple RPC providers:
+
+```
+STELLAR_RPC_URL=https://soroban-mainnet.stellar.org
+```
+
+Replace this with a load balancer or provider that pools:
+- `https://soroban-mainnet.stellar.org` (SDF)
+- `https://mainnet.sorobanrpc.com` (public)
+- `https://rpc.stellar.org/mainnet` (alternative)
+
+#### Option B: Multi-provider fallback (not yet implemented)
+
+If you need resilience without a LB, wrap `getRpcServer()` to fall back:
+
+```typescript
+const RPC_URLS = [
+  'https://soroban-mainnet.stellar.org',
+  'https://mainnet.sorobanrpc.com',
+]
+let currentIndex = 0
+
+export function getRpcServer(): rpc.Server {
+  // Returns current server; call rotateRpc() on failure
+  if (!rpcServer) rpcServer = new rpc.Server(RPC_URLS[currentIndex])
+  return rpcServer
+}
+
+export function rotateRpc(): void {
+  currentIndex = (currentIndex + 1) % RPC_URLS.length
+  rpcServer = new rpc.Server(RPC_URLS[currentIndex])
+  logger.warn(`[RPC] Failed over to ${RPC_URLS[currentIndex]}`)
+}
+```
+
+Wire `rotateRpc()` into error handlers in `fetchEvents` and `submitTransaction`.
+
+### RPC outage playbook
+
+| Symptom | Action |
+|---|---|
+| `fetchEvents` fails with connection error | Rotate RPC URL (manual or automated) |
+| `sendTransaction` hangs or times out | Rotate RPC; retry tx via `getTransaction` |
+| Persistent RPC failures | Switch to backup RPC provider entirely |
+| All known RPCs down | Pause event listener; set `agentLoop` to degraded; page on-call |
+
+### Verify RPC health
+
+```bash
+# Check latest ledger via RPC
+curl -s -X POST "$STELLAR_RPC_URL" \
+  -H "Content-Type: application/json" \
+  -d '{"jsonrpc":"2.0","id":1,"method":"getLatestLedger"}' | \
+  jq '.result.sequence'
+
+# Monitor via /metrics
+curl -s http://localhost:3001/metrics | grep cursor_lag
+```
+
+---
+
+## 4. Ledger Lag Alerts
+
+### Metrics
+
+| Metric | Type | Description |
+|---|---|---|
+| `cursor_lag_ledgers` | Gauge | `latest_ledger - last_processed_ledger` |
+| `last_processed_ledger` | Gauge | Last ledger successfully processed |
+| `events_processed_total` | Counter | Events processed, labelled by type and status |
+
+Alert rules are defined in `docs/OBSERVABILITY.md` and deployed to Prometheus.
+
+### Alert thresholds
+
+| Severity | Lag | Action |
+|---|---|---|
+| Info | > 10 ledgers | Note — may be normal during low traffic |
+| Warning | > 50 ledgers for 5 min | Investigate within 1 hour |
+| Critical | > 100 ledgers for 2 min | Page immediately |
+
+### Investigation steps
+
+```bash
+# 1. Check current lag
+curl -s http://localhost:3001/metrics | grep cursor_lag
+
+# 2. Check last processed ledger in DB
+psql "$DATABASE_URL" -c "SELECT * FROM event_cursors WHERE \"contractId\" = '$VAULT_CONTRACT_ID';"
+
+# 3. Check listener logs for errors
+grep "Event Listener" /var/log/app/*.log | tail -50
+
+# 4. Check RPC connectivity
+curl -s -X POST "$STELLAR_RPC_URL" \
+  -H "Content-Type: application/json" \
+  -d '{"jsonrpc":"2.0","id":1,"method":"getLatestLedger"}' | \
+  jq '.result.sequence'
+
+# 5. Check for backpressure (DLQ growth)
+curl -s http://localhost:3001/metrics | grep dlq_size
+
+# 6. Check database connection pool
+psql "$DATABASE_URL" -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"
+```
+
+### Common causes & remediation
+
+| Cause | Signal | Fix |
+|---|---|---|
+| RPC outage | `fetchEvents` errors in logs | Rotate RPC endpoint (see §3) |
+| DB slow / locked | High `db_operation_duration_seconds` | Check locks, pool size, index usage |
+| Schema validation failures | DLQ growth, `event_validation` errors | Inspect DLQ, fix event format or validator |
+| Listener crashed | `cursor_lag` rising, `agent_loop_status == 0` | Container restart, check OOM killer |
+| Network partition | RPC timeouts | Check DNS, firewall, egress rules |
+
+### Recover from lag
+
+```bash
+# If lag < 1000 ledgers — automatic backfill runs on restart
+# If lag > 1000 ledgers — manual backfill recommended via admin endpoint
+
+# Manual backfill (from a specific ledger)
+# Restart the service; backfill runs automatically up to latest
+# If auto-backfill is too slow, consider:
+#   1. Stop the listener
+#   2. Update event_cursors to an earlier ledger
+#   3. Restart the listener to trigger backfill
+psql "$DATABASE_URL" -c "UPDATE event_cursors SET \"lastProcessedLedger\" = $EARLIER_LEDGER WHERE \"contractId\" = '$VAULT_CONTRACT_ID';"
+```
+
+---
+
+## 5. DLQ Replay Procedure
+
+### Overview
+
+Events that fail processing (validation error, DB error, missing user) are stored in the `dead_letter_events` table with status `PENDING`. The DLQ module (`src/stellar/dlq.ts`) manages retries through three admin API endpoints.
+
+### Inspect DLQ
+
+```bash
+# Via admin API (requires ADMIN_API_TOKEN)
+curl -s -H "Authorization: Bearer $ADMIN_API_TOKEN" \
+  http://localhost:3001/api/admin/dlq/inspect | jq
+
+# Filter by status
+curl -s -H "Authorization: Bearer $ADMIN_API_TOKEN" \
+  "http://localhost:3001/api/admin/dlq/inspect?status=PENDING" | jq
+
+# Via direct DB query
+psql "$DATABASE_URL" -c "
+  SELECT id, \"eventType\", \"txHash\", ledger, status, \"retryCount\", error, \"createdAt\"
+  FROM dead_letter_events
+  ORDER BY \"createdAt\" DESC
+  LIMIT 50;
+"
+```
+
+### Dry-run retry
+
+```bash
+curl -s -X POST -H "Authorization: Bearer $ADMIN_API_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{"dryRun": true}' \
+  http://localhost:3001/api/admin/dlq/retry | jq
+```
+
+Dry run simulates the retry loop without persisting status changes.
+
+### Full retry
+
+```bash
+curl -s -X POST -H "Authorization: Bearer $ADMIN_API_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{}' \
+  http://localhost:3001/api/admin/dlq/retry | jq
+```
+
+Returns:
+```json
+{
+  "resolved": 5,
+  "failed": 2,
+  "totalRemaining": 2
+}
+```
+
+### Resolve a specific event
+
+If an event cannot be processed (e.g. user deleted), manually resolve it:
+
+```bash
+curl -s -X POST -H "Authorization: Bearer $ADMIN_API_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{"id": "uuid-of-event"}' \
+  http://localhost:3001/api/admin/dlq/resolve | jq
+```
+
+### Automatic retry behavior
+
+- `retryAll()` processes all `PENDING` and `RETRIED` events sequentially.
+- Success → status set to `RESOLVED`, count +1.
+- Failure → status set to `RETRIED`, count +1, logged.
+- There is **no automatic scheduled retry**. All retries are manual via the admin API.
+- When DLQ size reaches 50, a critical log line is emitted and the Prometheus `dlq_size` gauge crosses the critical threshold.
+
+### DLQ replay decision matrix
+
+| Event type | Common failure | Retry likely? | Notes |
+|---|---|---|---|
+| `deposit` | User not found | No until user exists | Resolve after user registers |
+| `deposit` | Schema validation | Depends | Fix validator or event source |
+| `withdraw` | Position not found | No | May indicate data integrity issue |
+| `rebalance` | DB constraint | Yes | Transient — retry typically succeeds |
+| Any | RPC/DB timeout | Yes | Transient — retry typically succeeds |
+
+---
+
+## 6. Incident Contacts
+
+### Escalation tiers
+
+| Tier | Role | Responsibility | Contact |
+|---|---|---|---|
+| T1 | On-call engineer | Triage, restart, DLQ retry, RPC rotation | PagerDuty / Opsgenie |
+| T2 | Backend lead | Code fix, data reconciliation, migration rollback | Slack @backend-lead |
+| T3 | Engineering manager | Stakeholder comms, post-mortem, priority decisions | Slack @eng-mgr |
+| T4 | Security officer | Key compromise, wallet recovery, audit | Slack @sec-officer |
+
+### Communication channels
+
+| Channel | Purpose |
+|---|---|
+| `#neurowealth-alerts` | Prometheus alert notifications |
+| `#neurowealth-incidents` | Incident coordination thread |
+| PagerDuty | T1 on-call escalation |
+| Email: `ops@neurowealth.io` | Backup contact for critical outages |
+
+### Incident severity definitions
+
+| Severity | Definition | Response time | Escalation |
+|---|---|---|---|
+| **SEV1** | Event processing halted, funds at risk, data loss | < 15 min | T1 → T2 → T3 |
+| **SEV2** | Lag > 100 ledgers, DLQ > 50, agent loop degraded | < 1 hour | T1 → T2 |
+| **SEV3** | Lag > 50 ledgers, DLQ > 20, elevated error rate | < 8 hours | T1 |
+| **SEV4** | Minor anomalies, informational alerts | Next business day | None |
+
+### Post-incident checklist
+
+- [ ] Root cause identified and documented
+- [ ] Fix deployed (or rollback executed)
+- [ ] DLQ resolved and lag cleared
+- [ ] Alert thresholds adjusted if needed
+- [ ] Post-mortem filed in `docs/post-mortems/`
+- [ ] Runbook updated with lessons learned
+
+---
+
+## Quick Reference Commands
+
+```bash
+# Health
+curl http://localhost:3001/health/live
+curl http://localhost:3001/health/ready
+curl http://localhost:3001/health
+
+# Metrics
+curl http://localhost:3001/metrics | grep -E "(cursor_lag|dlq_size|events_processed|agent_loop)"
+
+# DLQ inspect
+curl -s -H "Authorization: Bearer $ADMIN_API_TOKEN" \
+  http://localhost:3001/api/admin/dlq/inspect | jq '. | length'
+
+# DLQ retry (dry run)
+curl -s -X POST -H "Authorization: Bearer $ADMIN_API_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{"dryRun": true}' \
+  http://localhost:3001/api/admin/dlq/retry
+
+# DLQ retry (live)
+curl -s -X POST -H "Authorization: Bearer $ADMIN_API_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{}' \
+  http://localhost:3001/api/admin/dlq/retry
+
+# DB — cursor status
+psql "$DATABASE_URL" -c "SELECT * FROM event_cursors;"
+
+# DB — DLQ count by status
+psql "$DATABASE_URL" -c "
+  SELECT status, count(*) FROM dead_letter_events GROUP BY status;
+"
+
+# DB — recent processed events
+psql "$DATABASE_URL" -c "
+  SELECT \"eventType\", ledger, \"txHash\", \"createdAt\"
+  FROM processed_events
+  ORDER BY ledger DESC LIMIT 10;
+"
+```