Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,8 @@ and this project follows [Semantic Versioning](https://semver.org/spec/v2.0.0.ht
- **Portal CORS preflight gains a deny-cache.** `PortalCorsMiddleware.HandlePreflightAsync` now caches both allow and deny outcomes for the same TTL as the per-app signing-key lookup (default 60 s). Browsers don't cache rejected preflights, so before this change every `OPTIONS` against `/api/v1/portal/*` from a disallowed origin re-scanned the portal-enabled application set and the in-process JSON deserialize loop — a low-effort DB hammer vector. New regression test `Preflight_Deny_Decision_Is_Cached_Within_Ttl` pins the behaviour.

### Documentation
- **ADR-004 — Portal Signing Key Storage: Plaintext + Instant Invalidation, No Rotation Grace.** Locks in the v0.2.0 decisions around per-app `PortalSigningKey` at-rest storage (`varchar(64)` plaintext, mirroring the existing `SigningSecret` pattern), the no-grace rotation lifecycle (instant local invalidation, TTL-bounded multi-replica), and the one-shot reveal contract on `enable` / `rotate`. Documents the threat model, alternatives considered (`pgcrypto`, rotation grace, external KMS), and the explicit revisit triggers (compliance regimes, recurring host-integration friction, key-reuse beyond JWT verification).
- **ADR-005 — Portal CORS Preflight Deny-Cache TTL.** Locks in PR #104's choice to cache both allow and deny outcomes for `PortalAuth:LookupCacheTtlSeconds` (default 60s), reusing the existing per-app cache window for symmetry. Documents why no synchronous invalidation hook from `PUT /portal/origins` (the cache key is origin-scoped, the operator action is app-scoped — the bookkeeping cost outweighs the ≤60s staleness), and what would justify revisiting (operator UX complaints, memory-pressure attacks via origin enumeration, richer CORS semantics).
- **`docs/API.md` §3.8 — Portal API (Customer-Facing JWT).** Replaces the doc-drift gap surfaced by the v0.2.0 audit (where the portal API surface existed in code and CHANGELOG but was absent from `docs/API.md`). Covers HS256 JWT contract, capability scopes, per-app CORS, rate-limit partition reuse, cross-tenant `404` semantics, every route under `/api/v1/portal/*`, the dashboard portal-admin routes, the portal-specific error code table, and an end-to-end Node.js (`jose`) + cURL probe.
- **`docs/ARCHITECTURE.md` §4.3 — Portal Token Authentication.** Documents the load-bearing middleware ordering (`ApiKeyAuth` → `PortalTokenAuth` → `PortalCors` → `RateLimiter`), the three invariants it encodes, the `PortalLookupCache` TTL + atomic-CTS-swap behaviour, and the JWT validator's defense-in-depth choices (HS256 pin, 8 KiB token cap, `MapInboundClaims=false`, lifetime cap, opaque error bodies).

Expand Down
66 changes: 66 additions & 0 deletions docs/adr/adr-004-portal-signing-key-storage.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,66 @@
# ADR-004: Portal Signing Key Storage — Plaintext + Instant Invalidation, No Rotation Grace

**Status:** Accepted
**Date:** 2026-05-11
**Context:** B1 Embeddable Customer Portal (v0.2.0); v0.2.0 post-merge audit (PRs #101–#104)

## Context

The embeddable customer portal authenticates `/api/v1/portal/*` requests with HS256 JWTs minted by the host SaaS using a per-application signing key. The engine never mints these tokens — it only verifies them. Two storage and lifecycle decisions had to be locked in before v0.2.1:

1. **At-rest storage** of the per-app `PortalSigningKey`.
2. **Rotation lifecycle** — whether a rotated key has a grace window during which both the old and new keys verify.

The v0.2.0 audit (specifically the reviewer agent's P2 finding #10) flagged both as worth recording as ADRs because the consequences are not obvious from the code alone, and the choices are likely to be revisited.

## Decision

### 1. Plaintext storage in `applications.portal_signing_key` (`varchar(64)`)

The signing key is stored in plaintext alongside the existing `signing_secret` column. No application-level encryption-at-rest, no `pgcrypto`, no envelope encryption. At-rest protection relies entirely on the operator's Postgres deployment posture (filesystem-level encryption, restricted backups, network isolation).

### 2. Instant invalidation on rotate / disable, with **no grace period**

When the operator calls `POST /portal/rotate` or `POST /portal/disable`:

- The new (or null) value is written to `applications.portal_signing_key`.
- `PortalLookupCache.InvalidateApplication(appId)` is called synchronously on the local node — within milliseconds, the next portal request validates against the new key.
- On remote replicas, the change becomes effective within `PortalAuth:LookupCacheTtlSeconds` (default 60s).
- **Tokens minted with the old key fail verification immediately** (no overlapping validity window).

### 3. One-shot reveal

`POST /portal/enable` and `POST /portal/rotate` return the freshly generated key in the response body **once**. Subsequent reads (`GET /portal`) return only `portalEnabled: bool`, the rotated-at timestamp, and the allowed-origins list — never the key itself. The audit log records the rotate action with `portalEnabled: bool` instead of the literal secret.

## Consequences

### Positive

- **Operational simplicity.** Existing `SigningSecret` column already follows the same pattern; a single storage convention covers both auth surfaces. No new key-management subsystem to maintain.
- **Instant revoke.** A leaked or suspected-leaked key can be killed in one click — no minutes-long window during which a stolen token still works. For a self-hosted webhook engine where the operator is the security boundary, this is the right default.
- **Audit clarity.** The audit log entry's `before/after` snapshot redacts the signing key to a boolean, so audit trails never carry secrets — operators can grant audit-log access broadly without leaking auth material.
- **Matches existing convention.** `SigningSecret` (per-app HMAC delivery secret) is also plaintext in `applications.signing_secret`. Diverging here would create asymmetry without solving the underlying threat (DB compromise compromises both).

### Negative

- **Database compromise = key exposure.** An attacker with read access to the `applications` table (e.g. backup leak, SQL injection, compromised replica) can mint valid portal tokens until the operator rotates. Mitigation is operational: backup encryption, restricted DB credentials, network isolation.
- **No rotation grace = host SaaS integration friction.** The host application must update its environment variable / secrets store and restart its token-mint workers (or accept a brief window of mint failures) the moment rotate is clicked. There is no "old + new both work for 5 minutes" cushion. Hosts that mint tokens on every page render absorb this gracefully (next render uses the new key); hosts that cache mints longer will see brief failures.

### Neutral

- **At-rest encryption is deferred, not refused.** If a future deployment posture mandates encryption-at-rest beyond filesystem-level protection (e.g. compliance regimes that require column-level encryption), `pgcrypto` or an envelope-encryption layer can be introduced without changing the wire contract. The plaintext column is the *initial* choice, not a permanent one.
- **Multi-replica eventual consistency on rotate is bounded by the cache TTL.** Same trade-off the engine already makes for `SigningSecret` — see ADR-005 for the parallel CORS-cache TTL discussion.

## Alternatives considered

- **`pgcrypto`-encrypted column.** Rejected for v0.2.0: adds a key-management problem (where does the encryption key live?) without removing the underlying compromise scenario (whoever holds the encryption key can decrypt). Defers, doesn't solve.
- **Rotation grace window (overlapping validity).** Rejected for v0.2.0: doubles the blast radius of a leaked key (now two keys are in flight), complicates the cache (two-key lookup per app), and conflicts with the "one-click revoke" UX. If host integration friction becomes a recurring complaint, this can be revisited as an opt-in.
- **External key store (Vault, KMS).** Out of scope for a self-hosted reference deployment. Operators with strict requirements can wrap the connection string in their KMS layer.

## Revisit triggers

This decision should be reopened if any of the following becomes true:

- A production user reports that "no rotation grace" causes recurring outages they can't engineer around with idempotent retries.
- A compliance regime (SOC 2 Type II, HIPAA, PCI) is requested for the self-hosted distribution and requires column-level encryption.
- The portal signing key gets used for anything other than HS256 JWT verification (e.g. signing webhooks back to the host SaaS), changing the threat model.
65 changes: 65 additions & 0 deletions docs/adr/adr-005-portal-cors-deny-cache.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,65 @@
# ADR-005: Portal CORS Preflight Deny-Cache — TTL Choice

**Status:** Accepted
**Date:** 2026-05-11
**Context:** v0.2.0 post-merge audit, P1 finding #4 (PR #104 / Tur 4)

## Context

`PortalCorsMiddleware.HandlePreflightAsync` resolves whether an `OPTIONS` request from a given `Origin` should be allowed by scanning the portal-enabled application set:

```csharp
var allowed = await appRepo.AnyAllowsPortalOriginAsync(origin, ct);
```

`ApplicationRepository.AnyAllowsPortalOriginAsync` loads every portal-enabled app's `AllowedPortalOriginsJson` blob, deserializes each, and walks the lists for an exact (case-insensitive) match.

The audit's P1 finding flagged that:

1. **Browsers do not cache rejected preflight responses.** A `403` on `OPTIONS` triggers a fresh preflight on the next attempt — there is no `Access-Control-Max-Age` semantics for a denial.
2. **Therefore every disallowed `OPTIONS /api/v1/portal/*` triggered a fresh DB scan.** An attacker (or a misconfigured browser) hitting the portal prefix from an unauthorized origin in a tight loop would amplify into a per-app, per-origin DB sweep — a free DoS vector.

The fix is to cache the preflight decision (allow **and** deny) server-side. The remaining question is the TTL.

## Decision

Cache both allow and deny outcomes via the existing `IMemoryCache` for **`PortalAuth:LookupCacheTtlSeconds`** (default **60s**). The cache key is the lowercased origin:

```csharp
var cacheKey = $"portal:cors:{origin.ToLowerInvariant()}";
```

No invalidation hook is wired from operator mutating actions (`PUT /portal/origins`, `POST /portal/disable`, etc.). Operators rely on TTL-bounded eventual consistency for CORS preflight decisions.

## Consequences

### Positive

- **Closes the DB-hammer vector.** A flood of disallowed `OPTIONS` from a single origin now hits the DB at most once per TTL window per origin, instead of once per request. The candidate JSON deserialization loop runs accordingly less often.
- **Symmetry with the per-app signing-key cache.** Both portal-related caches share `PortalAuth:LookupCacheTtlSeconds`. Operators have one knob to turn if they need shorter freshness windows; the mental model stays simple.
- **Allow-cache also helps the legitimate case.** A real customer browsing the embedded portal performs at least one `OPTIONS` per cross-origin request the browser hasn't already cached for the route — caching the allow outcome reduces per-customer DB pressure too.

### Negative

- **Origin allowlist updates take up to 60s to take effect.** If an operator removes an origin from `PUT /portal/origins`, requests from that origin may continue to receive a `204` preflight response for up to a minute — a stale allow. This is the same eventual-consistency trade the per-app signing-key cache already makes; it is not a new behaviour, just a new surface.
- **Stale deny is also possible.** If an operator *adds* an origin, requests from it may continue to receive `403` for up to a minute. Same TTL bound.

### Neutral

- **Multi-replica deployments were already bounded by cache TTL** for the signing-key lookup. CORS now follows the same pattern; nothing about the multi-replica behaviour gets worse.
- **The cache is local-process.** A multi-replica deployment that rotates origins frequently will see per-replica drift up to 60s. This matches the rest of the portal cache surface — no additional invariant is introduced.

## Alternatives considered

- **Wire an invalidation hook from `PUT /portal/origins` and `POST /portal/disable`.** Rejected for v0.2.x: the cache key is *origin-scoped*, not app-scoped, but operator actions act on the *app's* allowlist. Invalidating "all CORS cache entries that might have referenced this app's allowlist" requires either a full-cache flush (noisy across all apps) or a secondary index (added complexity for a problem the TTL already bounds). The cost of being slightly stale for ≤60s is lower than the cost of either workaround.
- **Shorter TTL (e.g. 10s).** Rejected: the DB-hammer mitigation strength scales linearly with the TTL. 60s is already the same window we accept for signing-key freshness. Shorter TTLs reduce the security gain without buying meaningful operator visibility.
- **Longer TTL (e.g. 5 min).** Rejected: makes operator origin updates feel laggy in the dashboard, and the 60s symmetry with the signing-key cache becomes asymmetric noise.
- **Negative-only cache (cache deny, never allow).** Rejected: half-solves the problem. The deny-cache stops the abuse case, but the allow-cache helps the legitimate case symmetrically — there's no reason to skip it.

## Revisit triggers

This decision should be reopened if any of the following becomes true:

- An operator reports that origin allowlist changes "don't take effect" because their support flow needs immediate consistency (e.g. a customer is locked out and waiting on hold while the cache TTL elapses). The fix would be the synchronous-invalidation hook described under Alternatives — moderately complex but well-scoped.
- The CORS deny-cache is observed leaking memory in long-running deployments (origin enumeration attack — attacker rotates `Origin` to balloon the cache). Mitigation: bounded-size cache entries + LRU eviction, or per-app CORS prefix scoping.
- The portal layer adds richer CORS semantics (multiple `Allow-Methods` per app, per-route allowlists, etc.) where origin-scoped caching no longer fits.
Loading