diff --git a/CHANGELOG.md b/CHANGELOG.md index bfb1262..3b70e53 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -16,6 +16,8 @@ and this project follows [Semantic Versioning](https://semver.org/spec/v2.0.0.ht - **Portal CORS preflight gains a deny-cache.** `PortalCorsMiddleware.HandlePreflightAsync` now caches both allow and deny outcomes for the same TTL as the per-app signing-key lookup (default 60 s). Browsers don't cache rejected preflights, so before this change every `OPTIONS` against `/api/v1/portal/*` from a disallowed origin re-scanned the portal-enabled application set and the in-process JSON deserialize loop — a low-effort DB hammer vector. New regression test `Preflight_Deny_Decision_Is_Cached_Within_Ttl` pins the behaviour. ### Documentation +- **ADR-004 — Portal Signing Key Storage: Plaintext + Instant Invalidation, No Rotation Grace.** Locks in the v0.2.0 decisions around per-app `PortalSigningKey` at-rest storage (`varchar(64)` plaintext, mirroring the existing `SigningSecret` pattern), the no-grace rotation lifecycle (instant local invalidation, TTL-bounded multi-replica), and the one-shot reveal contract on `enable` / `rotate`. Documents the threat model, alternatives considered (`pgcrypto`, rotation grace, external KMS), and the explicit revisit triggers (compliance regimes, recurring host-integration friction, key-reuse beyond JWT verification). +- **ADR-005 — Portal CORS Preflight Deny-Cache TTL.** Locks in PR #104's choice to cache both allow and deny outcomes for `PortalAuth:LookupCacheTtlSeconds` (default 60s), reusing the existing per-app cache window for symmetry. Documents why no synchronous invalidation hook from `PUT /portal/origins` (the cache key is origin-scoped, the operator action is app-scoped — the bookkeeping cost outweighs the ≤60s staleness), and what would justify revisiting (operator UX complaints, memory-pressure attacks via origin enumeration, richer CORS semantics). - **`docs/API.md` §3.8 — Portal API (Customer-Facing JWT).** Replaces the doc-drift gap surfaced by the v0.2.0 audit (where the portal API surface existed in code and CHANGELOG but was absent from `docs/API.md`). Covers HS256 JWT contract, capability scopes, per-app CORS, rate-limit partition reuse, cross-tenant `404` semantics, every route under `/api/v1/portal/*`, the dashboard portal-admin routes, the portal-specific error code table, and an end-to-end Node.js (`jose`) + cURL probe. - **`docs/ARCHITECTURE.md` §4.3 — Portal Token Authentication.** Documents the load-bearing middleware ordering (`ApiKeyAuth` → `PortalTokenAuth` → `PortalCors` → `RateLimiter`), the three invariants it encodes, the `PortalLookupCache` TTL + atomic-CTS-swap behaviour, and the JWT validator's defense-in-depth choices (HS256 pin, 8 KiB token cap, `MapInboundClaims=false`, lifetime cap, opaque error bodies). diff --git a/docs/adr/adr-004-portal-signing-key-storage.md b/docs/adr/adr-004-portal-signing-key-storage.md new file mode 100644 index 0000000..75acd4e --- /dev/null +++ b/docs/adr/adr-004-portal-signing-key-storage.md @@ -0,0 +1,66 @@ +# ADR-004: Portal Signing Key Storage — Plaintext + Instant Invalidation, No Rotation Grace + +**Status:** Accepted +**Date:** 2026-05-11 +**Context:** B1 Embeddable Customer Portal (v0.2.0); v0.2.0 post-merge audit (PRs #101–#104) + +## Context + +The embeddable customer portal authenticates `/api/v1/portal/*` requests with HS256 JWTs minted by the host SaaS using a per-application signing key. The engine never mints these tokens — it only verifies them. Two storage and lifecycle decisions had to be locked in before v0.2.1: + +1. **At-rest storage** of the per-app `PortalSigningKey`. +2. **Rotation lifecycle** — whether a rotated key has a grace window during which both the old and new keys verify. + +The v0.2.0 audit (specifically the reviewer agent's P2 finding #10) flagged both as worth recording as ADRs because the consequences are not obvious from the code alone, and the choices are likely to be revisited. + +## Decision + +### 1. Plaintext storage in `applications.portal_signing_key` (`varchar(64)`) + +The signing key is stored in plaintext alongside the existing `signing_secret` column. No application-level encryption-at-rest, no `pgcrypto`, no envelope encryption. At-rest protection relies entirely on the operator's Postgres deployment posture (filesystem-level encryption, restricted backups, network isolation). + +### 2. Instant invalidation on rotate / disable, with **no grace period** + +When the operator calls `POST /portal/rotate` or `POST /portal/disable`: + +- The new (or null) value is written to `applications.portal_signing_key`. +- `PortalLookupCache.InvalidateApplication(appId)` is called synchronously on the local node — within milliseconds, the next portal request validates against the new key. +- On remote replicas, the change becomes effective within `PortalAuth:LookupCacheTtlSeconds` (default 60s). +- **Tokens minted with the old key fail verification immediately** (no overlapping validity window). + +### 3. One-shot reveal + +`POST /portal/enable` and `POST /portal/rotate` return the freshly generated key in the response body **once**. Subsequent reads (`GET /portal`) return only `portalEnabled: bool`, the rotated-at timestamp, and the allowed-origins list — never the key itself. The audit log records the rotate action with `portalEnabled: bool` instead of the literal secret. + +## Consequences + +### Positive + +- **Operational simplicity.** Existing `SigningSecret` column already follows the same pattern; a single storage convention covers both auth surfaces. No new key-management subsystem to maintain. +- **Instant revoke.** A leaked or suspected-leaked key can be killed in one click — no minutes-long window during which a stolen token still works. For a self-hosted webhook engine where the operator is the security boundary, this is the right default. +- **Audit clarity.** The audit log entry's `before/after` snapshot redacts the signing key to a boolean, so audit trails never carry secrets — operators can grant audit-log access broadly without leaking auth material. +- **Matches existing convention.** `SigningSecret` (per-app HMAC delivery secret) is also plaintext in `applications.signing_secret`. Diverging here would create asymmetry without solving the underlying threat (DB compromise compromises both). + +### Negative + +- **Database compromise = key exposure.** An attacker with read access to the `applications` table (e.g. backup leak, SQL injection, compromised replica) can mint valid portal tokens until the operator rotates. Mitigation is operational: backup encryption, restricted DB credentials, network isolation. +- **No rotation grace = host SaaS integration friction.** The host application must update its environment variable / secrets store and restart its token-mint workers (or accept a brief window of mint failures) the moment rotate is clicked. There is no "old + new both work for 5 minutes" cushion. Hosts that mint tokens on every page render absorb this gracefully (next render uses the new key); hosts that cache mints longer will see brief failures. + +### Neutral + +- **At-rest encryption is deferred, not refused.** If a future deployment posture mandates encryption-at-rest beyond filesystem-level protection (e.g. compliance regimes that require column-level encryption), `pgcrypto` or an envelope-encryption layer can be introduced without changing the wire contract. The plaintext column is the *initial* choice, not a permanent one. +- **Multi-replica eventual consistency on rotate is bounded by the cache TTL.** Same trade-off the engine already makes for `SigningSecret` — see ADR-005 for the parallel CORS-cache TTL discussion. + +## Alternatives considered + +- **`pgcrypto`-encrypted column.** Rejected for v0.2.0: adds a key-management problem (where does the encryption key live?) without removing the underlying compromise scenario (whoever holds the encryption key can decrypt). Defers, doesn't solve. +- **Rotation grace window (overlapping validity).** Rejected for v0.2.0: doubles the blast radius of a leaked key (now two keys are in flight), complicates the cache (two-key lookup per app), and conflicts with the "one-click revoke" UX. If host integration friction becomes a recurring complaint, this can be revisited as an opt-in. +- **External key store (Vault, KMS).** Out of scope for a self-hosted reference deployment. Operators with strict requirements can wrap the connection string in their KMS layer. + +## Revisit triggers + +This decision should be reopened if any of the following becomes true: + +- A production user reports that "no rotation grace" causes recurring outages they can't engineer around with idempotent retries. +- A compliance regime (SOC 2 Type II, HIPAA, PCI) is requested for the self-hosted distribution and requires column-level encryption. +- The portal signing key gets used for anything other than HS256 JWT verification (e.g. signing webhooks back to the host SaaS), changing the threat model. diff --git a/docs/adr/adr-005-portal-cors-deny-cache.md b/docs/adr/adr-005-portal-cors-deny-cache.md new file mode 100644 index 0000000..c647952 --- /dev/null +++ b/docs/adr/adr-005-portal-cors-deny-cache.md @@ -0,0 +1,65 @@ +# ADR-005: Portal CORS Preflight Deny-Cache — TTL Choice + +**Status:** Accepted +**Date:** 2026-05-11 +**Context:** v0.2.0 post-merge audit, P1 finding #4 (PR #104 / Tur 4) + +## Context + +`PortalCorsMiddleware.HandlePreflightAsync` resolves whether an `OPTIONS` request from a given `Origin` should be allowed by scanning the portal-enabled application set: + +```csharp +var allowed = await appRepo.AnyAllowsPortalOriginAsync(origin, ct); +``` + +`ApplicationRepository.AnyAllowsPortalOriginAsync` loads every portal-enabled app's `AllowedPortalOriginsJson` blob, deserializes each, and walks the lists for an exact (case-insensitive) match. + +The audit's P1 finding flagged that: + +1. **Browsers do not cache rejected preflight responses.** A `403` on `OPTIONS` triggers a fresh preflight on the next attempt — there is no `Access-Control-Max-Age` semantics for a denial. +2. **Therefore every disallowed `OPTIONS /api/v1/portal/*` triggered a fresh DB scan.** An attacker (or a misconfigured browser) hitting the portal prefix from an unauthorized origin in a tight loop would amplify into a per-app, per-origin DB sweep — a free DoS vector. + +The fix is to cache the preflight decision (allow **and** deny) server-side. The remaining question is the TTL. + +## Decision + +Cache both allow and deny outcomes via the existing `IMemoryCache` for **`PortalAuth:LookupCacheTtlSeconds`** (default **60s**). The cache key is the lowercased origin: + +```csharp +var cacheKey = $"portal:cors:{origin.ToLowerInvariant()}"; +``` + +No invalidation hook is wired from operator mutating actions (`PUT /portal/origins`, `POST /portal/disable`, etc.). Operators rely on TTL-bounded eventual consistency for CORS preflight decisions. + +## Consequences + +### Positive + +- **Closes the DB-hammer vector.** A flood of disallowed `OPTIONS` from a single origin now hits the DB at most once per TTL window per origin, instead of once per request. The candidate JSON deserialization loop runs accordingly less often. +- **Symmetry with the per-app signing-key cache.** Both portal-related caches share `PortalAuth:LookupCacheTtlSeconds`. Operators have one knob to turn if they need shorter freshness windows; the mental model stays simple. +- **Allow-cache also helps the legitimate case.** A real customer browsing the embedded portal performs at least one `OPTIONS` per cross-origin request the browser hasn't already cached for the route — caching the allow outcome reduces per-customer DB pressure too. + +### Negative + +- **Origin allowlist updates take up to 60s to take effect.** If an operator removes an origin from `PUT /portal/origins`, requests from that origin may continue to receive a `204` preflight response for up to a minute — a stale allow. This is the same eventual-consistency trade the per-app signing-key cache already makes; it is not a new behaviour, just a new surface. +- **Stale deny is also possible.** If an operator *adds* an origin, requests from it may continue to receive `403` for up to a minute. Same TTL bound. + +### Neutral + +- **Multi-replica deployments were already bounded by cache TTL** for the signing-key lookup. CORS now follows the same pattern; nothing about the multi-replica behaviour gets worse. +- **The cache is local-process.** A multi-replica deployment that rotates origins frequently will see per-replica drift up to 60s. This matches the rest of the portal cache surface — no additional invariant is introduced. + +## Alternatives considered + +- **Wire an invalidation hook from `PUT /portal/origins` and `POST /portal/disable`.** Rejected for v0.2.x: the cache key is *origin-scoped*, not app-scoped, but operator actions act on the *app's* allowlist. Invalidating "all CORS cache entries that might have referenced this app's allowlist" requires either a full-cache flush (noisy across all apps) or a secondary index (added complexity for a problem the TTL already bounds). The cost of being slightly stale for ≤60s is lower than the cost of either workaround. +- **Shorter TTL (e.g. 10s).** Rejected: the DB-hammer mitigation strength scales linearly with the TTL. 60s is already the same window we accept for signing-key freshness. Shorter TTLs reduce the security gain without buying meaningful operator visibility. +- **Longer TTL (e.g. 5 min).** Rejected: makes operator origin updates feel laggy in the dashboard, and the 60s symmetry with the signing-key cache becomes asymmetric noise. +- **Negative-only cache (cache deny, never allow).** Rejected: half-solves the problem. The deny-cache stops the abuse case, but the allow-cache helps the legitimate case symmetrically — there's no reason to skip it. + +## Revisit triggers + +This decision should be reopened if any of the following becomes true: + +- An operator reports that origin allowlist changes "don't take effect" because their support flow needs immediate consistency (e.g. a customer is locked out and waiting on hold while the cache TTL elapses). The fix would be the synchronous-invalidation hook described under Alternatives — moderately complex but well-scoped. +- The CORS deny-cache is observed leaking memory in long-running deployments (origin enumeration attack — attacker rotates `Origin` to balloon the cache). Mitigation: bounded-size cache entries + LRU eviction, or per-app CORS prefix scoping. +- The portal layer adds richer CORS semantics (multiple `Allow-Methods` per app, per-route allowlists, etc.) where origin-scoped caching no longer fits.