Skip to content

fix(reliability): migration 064 — forwarder_sent.audit_log_id FK (closes gap #6)#126

Open
mastermanas805 wants to merge 304 commits into
masterfrom
srr/migration-064-forwarder-sent-fk-2026-05-21
Open

fix(reliability): migration 064 — forwarder_sent.audit_log_id FK (closes gap #6)#126
mastermanas805 wants to merge 304 commits into
masterfrom
srr/migration-064-forwarder-sent-fk-2026-05-21

Conversation

@mastermanas805
Copy link
Copy Markdown
Member

Summary

Closes CLAUDE.md Known Design Gaps #6 — `forwarder_sent.audit_id` has no FK to `audit_log(id)`, so team-deletion cascades drop audit_log rows and leave orphan ledger rows pointing at non-existent IDs.

Migration 063 (live) was an index + COMMENT only — it explicitly documented why a direct FK on `audit_id` is infeasible (TEXT vs UUID type mismatch, NOT NULL PK, legacy placeholder strings like `reminder--` from pre-consolidation emitters).

Approach: add a new nullable column `audit_log_id UUID REFERENCES audit_log(id) ON DELETE SET NULL` alongside the existing TEXT `audit_id`. The new column is the strict-FK breadcrumb; `audit_id` stays as the PK + idempotency key.

  • Backfill: existing rows whose `audit_id` is a real UUID AND resolves in `audit_log` get `audit_log_id` populated. Placeholder rows + already-orphan rows leave `audit_log_id` NULL.
  • Forward behaviour: future `audit_log` deletes flip `audit_log_id` to NULL via the FK rather than orphan-dangling — the ledger row itself survives (email-truth-surface, CLAUDE.md rule 12).
  • Index `idx_forwarder_sent_audit_log_id_not_null` lets the orphan-reconciler join cheaply.

Files changed

  • `internal/db/migrations/064_forwarder_sent_audit_fk.sql` — new migration
  • `internal/testhelpers/testhelpers.go` — mirror the new column + FK + index into the test bootstrap (anti-drift guard `TestRunMigrationsMirrorsEveryMigrationTable` already iterates the migration registry)
  • `internal/testhelpers/migration_064_test.go` — 5 registry-walk tests:
    • column shape (uuid, nullable)
    • FK exists with `confdeltype='n'` (= SET NULL) targeting audit_log
    • partial index exists
    • end-to-end: insert with both `audit_id` + `audit_log_id` → `DELETE FROM audit_log` → ledger row survives, `audit_log_id` is NULL
    • legacy placeholder `audit_id` strings still insert cleanly

Test plan

  • `go build ./...` — green
  • `go vet ./...` — green
  • `go test ./internal/testhelpers/... -run TestMigration064 -v` — 5/5 PASS
  • Full gate run — 3 pre-existing master failures (`TestBulkTwin_QuotaPartialFill`, `TestGetExpiredDeployments_ReturnsOnlyExpiredNonPermanent`, `TestLinkGitHubID`) confirmed on `master` baseline (stashed-changes verification); not introduced by this PR
  • Operator: apply migration via standard rollout; verify `/healthz` reports `migration_version=064_forwarder_sent_audit_fk.sql`, `migration_count=63`
  • Operator: `SELECT count(*) FROM forwarder_sent WHERE audit_log_id IS NOT NULL` should be > 0 after migration (= backfilled real-UUID rows)

Follow-up needed (not blocking this PR)

Worker's `event_email_forwarder.go` must be taught to populate `audit_log_id` alongside `audit_id` on insert when `audit_id` is a real UUID. Until that ships, post-064 rows leave `audit_log_id` NULL even with a valid `audit_log` target. The backfill covers existing rows; this is a forward-only behaviour gap tracked as part of the closing of gap #6.

Worker repo also needs a `worker/sql/064_forwarder_sent_audit_fk.sql` mirror (worker repo is separate; not in this PR — surface that to the next operator).

Why no direct FK on `audit_id`

Three independent blockers, all documented in migration 063 and in 064's header:

  1. Type mismatch. `audit_log.id` is UUID, `audit_id` is TEXT.
  2. PK NOT NULL. `audit_id` is the table's PRIMARY KEY; ON DELETE SET NULL would violate the PK on every audit_log delete.
  3. Legacy placeholders. Several emit sites write synthetic placeholder values like `reminder-<resource_id>-` or `provider-<grace_id>`. A strict FK would reject those rows.

The new nullable `audit_log_id` column sidesteps all three.

🤖 Generated with Claude Code

mastermanas805 and others added 30 commits May 12, 2026 23:51
… of pricing experiments)

Adds an internal/experiments package with a deterministic
SHA256(identifier+salt) mod len(variants) selector and a registry of
active experiments. The first registered experiment is UpgradeButton
with variants {control, urgent, value}.

The variant assignment is exposed on GET /auth/me as a new
`experiments` map keyed by experiment name, so the dashboard learns
the user's bucket for every active experiment in one round-trip.

POST /api/v1/experiments/converted records a conversion in the audit
log with kind = "experiment.conversion" and metadata = {experiment,
variant, action_taken}. The handler verifies the variant matches the
server's bucket for the caller (rejects stale clients) before
writing.

Bucketing identifier:
  - /auth/me — team_id (caller is always authenticated here)

Tests:
  - internal/experiments: 7 tests (determinism, distribution ~33/33/33
    across 1000 ids, salt isolation, valid-variant invariant)
  - internal/handlers: 6 tests (/auth/me embeds the field, POST writes
    audit_log, rejects unknown experiment / invalid variant / variant
    mismatch / no-auth)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…act (U3)

Establishes a single source of truth for every agent_action string the API
returns (internal/handlers/agent_action.go) and enforces a four-requirement
contract in tests (TestAgentActionContract).

The U3 contract every agent_action MUST satisfy:
  1. Open with "Tell the user" — the LLM agent re-articulates verbatim.
  2. Name the specific reason rejected (tier, limit, policy, resource).
  3. Name the exact next action (Upgrade, Claim, Provision twin, Contact
     support) — never "try again later" without context.
  4. Include a full https://instanode.dev/ URL — no relative paths.
  5. Under 280 chars so LLMs reproduce verbatim instead of summarizing.

Audited 25 strings across the handlers + middleware packages. Every one now
meets all four requirements: builders (newAgentAction*) for tier/env/limit
interpolation, static constants for fixed walls, and the existing
codeToAgentAction registry sharpened in lockstep.

Call sites refactored to consume the named constants/builders so reviewers
audit prose in one file (agent_action.go) rather than scattered handlers.

Tests:
- TestAgentActionContract covers every string (static + builder + registry)
- TestAgentActionContract_RegistryCoverage guards expected codes present
- Existing TestRespondError_* assertions updated to match sharpened copy
- All handlers + middleware unit tests green (pre-existing internal/plans
  failure unrelated to this PR)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tions-fresh

handlers: audit + sharpen every agent_action string per uniform contract (U3)
New endpoint reads the most recent near_quota_wall row written by the
worker's QuotaWallNudgeWorker, scoped to the caller's team and bounded
to the last 24h. Returns the metadata fields (tier/axis/service/current/
limit/percent_used/at) flattened into the response when a row exists,
or {ok:true, near_wall:false} when absent or stale.

Tier gate: team-tier callers always get near_wall=false without an
audit query — team tier has no walls so the scan is meaningless.

Query uses the existing idx_audit_team_at index (team_id + created_at
DESC from migration 012), which is the exact access pattern. No new
index needed.

Routed under the /api/v1 auth group so a valid session JWT is required
and team scope comes from the JWT claims.
experiments: server-side bucket selector + UpgradeButton variants (P1 of pricing experiments)
… + billing.go plan_frequency (P2)

Adds the three yearly plan variants to plans.yaml and threads a new
plan_frequency field through POST /api/v1/billing/checkout. Limits +
features on each *_yearly plan mirror the monthly counterpart byte-for-byte;
only price (annual amount in cents) and billing_period: yearly differ.

## Pricing
  hobby_yearly: $90/yr   (~$7.50/mo,  saves $18/yr  vs $9  x 12)
  pro_yearly:   $490/yr  (~$40.83/mo, saves $98/yr  vs $49 x 12)
  team_yearly:  $1990/yr (~$165.83/mo, saves $398/yr vs $199 x 12)

## API changes
- `checkoutRequest.PlanFrequency` ("monthly" | "yearly", default monthly).
  Empty stays back-compat with pre-P2 dashboard clients.
- `razorpayPlanIDFor(tier, frequency)` resolves the right plan_id from
  config. Yearly env vars empty -> 503 billing_not_configured (so partial
  rollout — monthly live, yearly plans not yet created on Razorpay — is
  safe).
- `planIDToTier()` recognises both monthly and yearly plan_ids and maps
  them back to the canonical (bare) tier. teams.plan_tier always stores
  the canonical name so limits resolution is cycle-agnostic.
- Webhook unchanged in shape — the upgrade tier comes from the resolved
  canonical tier regardless of which cycle paid.
- OpenAPI schema documents plan_frequency on /api/v1/billing/checkout.

## New env vars (operator action — see PR body callout)
  RAZORPAY_PLAN_ID_HOBBY_YEARLY
  RAZORPAY_PLAN_ID_PRO_YEARLY
  RAZORPAY_PLAN_ID_TEAM_YEARLY

## Tests
- Unit: 5 new handler tests (invalid_frequency 400, yearly unconfigured
  503, monthly default behaviour, team-tier guard fires on both cycles,
  webhook plan_id -> canonical tier mapping for all 6 plan_ids).
- Unit: 3 new plans tests (yearly variants mirror monthly, CanonicalTier
  helper, base-tier limits unchanged).
- `make test-unit`: green across all packages.

## Dependency
Requires `InstaNode-dev/common#pricing/p2-annual-plans` (PR #6) — adds
the `BillingPeriod` Plan field, the `Registry.BillingPeriod` method,
the `CanonicalTier` helper, and the three yearly entries in the
embedded defaultYAML. This api branch's `go.mod` `replace` directive
points to `../common` so local builds work, but the merged base in
`common` must be cut before this PR lands.
…dge-api-fresh

# Conflicts:
#	internal/router/router.go
…fresh

usage: GET /api/v1/usage/wall returns latest near-wall audit row (U1)
plans: annual pricing tiers (hobby_yearly / pro_yearly / team_yearly) + billing.go plan_frequency (P2)
…ure)

POST /deploy/new now accepts two new multipart fields:

- private (bool-ish: "true" / "1" / "yes") — when true the resulting Ingress
  carries an nginx.ingress.kubernetes.io/whitelist-source-range annotation
  built from allowed_ips.
- allowed_ips — comma-separated CIDRs / IPs. Each entry is validated via
  net.ParseCIDR / net.ParseIP; max 32 entries (larger lists belong in CF
  Access / a real VPN, not an nginx annotation).

Tier gate: Pro / Pro-yearly / Team / Team-yearly / Growth only.
Hobby / anonymous / free / yearly-free → 402 with the new
AgentActionPrivateDeployRequiresPro constant pointing at
https://instanode.dev/pricing.

Validation order is tier-gate FIRST so low-tier callers never see the
"missing allowed_ips" 400 — they only see the upgrade wall, which is the
only action that matters to them.

Migration 020_deployment_access_control.sql adds two columns to deployments:
private BOOLEAN NOT NULL DEFAULT false and allowed_ips TEXT NOT NULL DEFAULT ''.
Stored as a comma-joined string (not JSONB array) because the nginx
annotation already requires that exact form — round-trip is lossless and
the existing scanDeployment code path keeps its scalar-friendly shape.

Compute layer:
- compute.DeployOptions gets Private bool + AllowedIPs []string.
- K8sProvider.applyIngressForDeploy now sets the annotation when private
  is true and the slice is non-empty (belt-and-suspenders against a stray
  "allow nobody" Ingress).

OpenAPI 3.1 spec extended with the two request fields, the two response
fields, the 402 wall, and updated 400-error list.

Two new agent_action constants:
- AgentActionPrivateDeployRequiresPro (402 wall)
- AgentActionPrivateDeployRequiresAllowedIPs (400 wall for private=true
  with empty allowed_ips)

Both pass TestAgentActionContract automatically — added to the contract
case table.

Tests: 7 new cases in deploy_private_test.go (Pro accept, Hobby 402,
empty IPs 400, invalid IP 400, 33-IP cap, public default unchanged,
GET /api/v1/deployments round-trip). All `make test-unit` packages green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-fresh

deploy: private deploys with allowed_ips ingress whitelist (Pro+ feature)
…n-place

Dashboard PrivacyPanel (PR #44) needed a way to edit access-control on an
existing deploy. POST /deploy/new with private/allowed_ips shipped in PR #45
but no PATCH surface — every PrivacyPanel save hit 404.

What lands:

- PATCH /api/v1/deployments/:id accepts {private?, allowed_ips?} JSON. Body
  fields are optional pointers so "omit" and "set to zero" are distinguishable.
  Sending only allowed_ips preserves the current private state; sending
  private:false clears the allow-list regardless of allowed_ips in the same
  body (preserves the public-deploy invariant).

- Validation reuses PR #45's parsePrivateDeployFields rule-set by factoring
  out validatePrivateDeployFields — the tier gate, non-empty IPs check,
  cap, and per-entry parse are shared with POST. The U3 reviewer audits the
  rules in one place.

- compute.Provider gains UpdateAccessControl(ctx, appID, private, allowedIPs).
  K8s implementation Get/Update's the existing Ingress and rewrites the
  whitelist-source-range annotation via a new shared helper
  buildIngressAccessAnnotations — same function the create path now uses,
  so create and update can't drift on the annotation key. Noop logs+returns
  nil. Local-dev (no DEPLOY_DOMAIN) is a no-op with a warn breadcrumb.

- models.UpdateDeploymentAccessControl writes the private + allowed_ips
  columns. Single-row update — no caching/aggregation concerns.

- Semantics decision documented inline: allowed_ips on PATCH REPLACES the
  current list (not appends). Matches REST collection-field conventions and
  is what PrivacyPanel renders/submits.

Tests: 8 new in deploy_private_patch_test.go, all pass against the noop
compute provider. Pro flips public→private, replace semantics, private:false
clears the list, hobby 402 with reused AgentActionPrivateDeployRequiresPro,
invalid IP surfaces verbatim, 404 / 403 / empty-body 400. TestAgentActionContract
still passes (no new strings added).

make test-unit: all packages green (handlers 23.6s).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d plans.ValidatePromotion

Adds the agent-facing endpoint that the dashboard's PromoCodePanel
(PR #38) submits to before checkout. Returns 200 + ok:true with the
structured discount on success and 200 + ok:false with a typed
error + agent_action on rejection — the dashboard renders the red
state through its normal success-path parser without a catch.

Rate-limited per-team per-hour (30/hr) to make brute-forcing the
seed-code namespace impractical; Redis errors fail open so a cache
outage can't block a user mid-checkout. Adds the
AgentActionPromotionInvalid constant which the TestAgentActionContract
gate covers automatically.
…s-fresh

deploy: PATCH /api/v1/deployments/:id to edit private + allowed_ips in-place
…date-fresh

billing: POST /api/v1/billing/promotion/validate — HTTP wrapper around plans.validatePromotion
Founder-facing customer-management surface served under /api/v1/admin/*.
Per the CRM discussion: Postgres already has the data, no need for
HubSpot/Salesforce. List teams with MRR/storage/deploy aggregates,
fetch a single customer's profile, manually elevate tiers (comp/cust-
success, no Razorpay charge), and issue single-use promo codes.

All four endpoints are gated by middleware.RequireAdmin, which reads
the JWT email against ADMIN_EMAILS (comma-separated, case-insensitive).
Closed by default: unset/empty ADMIN_EMAILS rejects every caller. 403
responses carry the canonical agent_action sentence.

Endpoints:
  GET    /api/v1/admin/customers           — list with q/tier/sort_by
  GET    /api/v1/admin/customers/:id       — detail + users + recent audit
  POST   /api/v1/admin/customers/:id/tier  — set plan_tier, elevate, audit
  POST   /api/v1/admin/customers/:id/promo — issue single-use promo code

Migration 021 adds admin_promo_codes (single-use, admin-issued, audited)
as a distinct table from plans.yaml promotions (static, server-config-
level discounts). Two distinct concepts, two storage layers.

Aggregations are live SQL with no Redis cache — admin views are
low-frequency and must show ground truth. List uses CTEs so a 500-team
result is one round trip (no N+1). MRR is computed from plan_tier via
the plans Registry; yearly subscriptions contribute monthly-equivalent
(annual/12) for sort comparability.

Tests cover: closed-by-default 403, non-admin 403, admin 200, list MRR
sort, query-by-email, detail shape, tier change updates team + elevates
resources + writes audit row with structured metadata, promo issue
returns unique code + expires_at + audit row. make test-unit all green
(18 new admin tests + agent_action contract additions).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s-api-fresh

# Conflicts:
#	internal/handlers/agent_action_contract_test.go
admin: Customers endpoints (list/detail/tier-change/issue-promo) gated on ADMIN_EMAILS
…nceled} for Loops forwarder

Wire 4 of the 7 audit_log.kind values that the Loops worker (PR #10)
forwards to Loops.so but which nothing in the API was actually writing.
After this PR the welcome / upgrade / downgrade / cancellation lifecycle
emails will fire instead of silently no-op'ing.

Sites added:
- onboarding.Claim — onboarding.claimed (after JWT mark + session mint,
  pre-response, in a detached goroutine)
- billing.handleSubscriptionCharged — subscription.upgraded or
  subscription.downgraded, classified by tierRank(prior) vs tierRank(new).
  Same-tier renewals emit nothing so monthly Pro renewals don't trigger
  the upgrade email.
- billing.handleSubscriptionCancelled — subscription.canceled (single-l US
  spelling matches the Loops forwarder map; Razorpay's double-l event name
  is handled inside the dispatcher).

Fail-open invariant enforced and tested: when audit_log writes fail (e.g.
the table doesn't exist), the originating handler still returns success
and the tier mutation still commits. Razorpay never sees a retry-worthy
status from an audit miss.

Named constants live in internal/models/audit_kinds.go so the emit sites
and the Loops forwarder match on identity rather than re-typing strings.

Skipped from the original 6 missing kinds:
- admin.tier_changed / admin.promo_issued — Track A's admin_customers.go
  is not on master at HEAD d3fa539, so there is no admin tier or promo
  handler to attach to. Deferred until Track A lands.
- resource.expiry_imminent — lives in the worker repo, out of scope for
  this PR per the brief.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…resh

audit: emit 4 missing kinds (onboarding.claimed + subscription.*) to unblock Loops forwarder
Defense-in-depth on top of ADMIN_EMAILS (PR #48). PR #48 shipped admin
endpoints at /api/v1/admin/customers/...; even with the email allowlist
gate, a guessable path lets drive-by scanners log auth failures + a
leaked session token from an XSS / exfil probe the admin surface.

Two independent gates now protect /api/v1/<prefix>/customers/...:

  1. ADMIN_PATH_PREFIX — unguessable URL segment (32+ chars, alnum).
     Empty/unset → admin routes are NOT registered (404). Closed by
     default. Operators opt in by generating via `openssl rand -hex 32`.
     A short or non-alphanumeric prefix is a fatal startup error.

  2. ADMIN_EMAILS — unchanged. JWT email allowlist, closed by default.

The prefix is delivered to admin clients only via /auth/me's
admin_path_prefix field (omitted entirely for non-admin callers and for
deployments without ADMIN_PATH_PREFIX configured). The dashboard reads
it and mints /api/v1/<prefix>/customers/... URLs client-side.

OpenAPI spec (/openapi.json) deliberately omits the admin surface — a
documented path defeats the obscurity gate. Regression test guards.

INTERNAL-OPS.md (new, private via .gitattributes export-ignore) carries
the rotation procedure, incident response, and the operator runbook.

Tests: 13 new unit tests (8 router/config + 4 handlers + 1 OpenAPI
omission grep); all 382 dashboard tests still pass; full Go test suite
green on target packages (router, handlers, config).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
admin: route admin endpoints under ADMIN_PATH_PREFIX env (closed-by-default + internal docs)
F's apply.sh in the infra repo patched 3 schema gaps at apply time via jq.
Terraform reads the same JSON but wouldn't run the adapter — the source
JSON drifted from one applier to the other. Bake the fixes into the source
so apply.sh + Terraform consume identical clean structures.

Fixes baked:
1. Dashboards: accountIds:[0] → accountIds:["${NEW_RELIC_ACCOUNT_ID}"]
   substitution token across all 4 dashboards (28 total occurrences).
2. Alerts: drop "type": "NRQL" — NerdGraph rejects it on
   NrqlConditionStaticInput; the mutation name encodes the type.
3. Alert policy: add policies/instant-api.json describing the umbrella
   policy + add policyName="instant-api alerts" linking field to each
   alert JSON, replacing apply.sh's inline find-or-create logic.

apply.sh already handles this defensively — its jq walk() rewrites
.accountIds regardless of the previous value, and del(.type) is a no-op
when the field is absent. No companion infra PR needed.

Tests (infra/newrelic/tests/bake_test.sh): 31 assertions covering JSON
parse, accountIds token presence + count per dashboard, absence of
"type" on alerts, policyName cross-ref between alerts and policy file,
and policy file shape. All pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-fresh

infra: bake jq fixes into NR JSON source (apply.sh + Terraform read same clean files)
…celed_by_admin audit

POST /api/v1/admin/customers/:team_id/tier now cancels the customer's active
Razorpay subscription when the tier change is a demote (toRank < fromRank).
Promotes are unchanged — the comp-tier flow stays Razorpay-free.

Razorpay cancel uses cancel_at_cycle_end=false (CancelImmediately) so MRR
drops in the same billing cycle the demote happened, rather than carrying
the old tier through to end-of-cycle. Adds a sibling helper
razorpaybilling.Portal.CancelImmediately alongside the existing
CancelAtCycleEnd (which the customer's own self-serve cancel still uses).

Emits a new audit_log kind `subscription.canceled_by_admin` with metadata
carrying {from_tier, to_tier, by_admin_email, reason, subscription_id,
cancel_attempted, cancel_succeeded, cancel_error}. The Brevo / Loops
forwarder keys on the kind string to fire a "your subscription was canceled
by support" template; cancel_succeeded gates the "we canceled" copy so a
failed Razorpay call doesn't lie to the customer.

Fail-open posture: a Razorpay 5xx during cancel does NOT block the admin
demote — the DB-side demote already succeeded, the audit row records
cancel_succeeded=false + the error, and the operator reconciles manually
in the Razorpay dashboard. Same fail-open shape as resource elevation.

Six new integration tests cover: demote pro→hobby fires cancel + audit;
demote team→hobby (multi-rank delta); demote with empty subscription_id
skips Razorpay but still audits; promote does NOT call Razorpay; Razorpay
500 still returns 200 with cancel_succeeded=false audit; same-tier 409
makes no Razorpay call (idempotency).

make test-unit: green across all packages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…l-on-demote-fresh

# Conflicts:
#	internal/router/router.go
…ote-fresh

admin: cancel Razorpay subscription on demote + emit subscription.canceled_by_admin audit
…aml codes

POST /api/v1/billing/promotion/validate now unifies two promo-code sources:

1. plans.Registry.ValidatePromotion (TWITTER15/LAUNCH50 broadcast codes,
   already wired in PR #47) — checked first.
2. admin_promo_codes table (single-use codes issued by an admin via
   /api/v1/admin/customers/:team_id/promo, table from PR #48) — fall-through
   when the plans-yaml side returns "not found".

The admin path enforces single-use at validate time (returns
promotion_already_used + the new AgentActionPromotionAlreadyUsed sentence)
and at the webhook side via UPDATE ... WHERE used_at IS NULL, so two
concurrent redemptions can't double-spend a code. Cross-team codes surface
as plain "promotion_invalid" — we don't reveal their existence.

POST /api/v1/billing/checkout now accepts promotion_code; when the code
matches an admin_promo_codes row for the team, the checkout stamps
notes.admin_promo_code_id on the Razorpay subscription. The
subscription.charged webhook reads that notes key and marks used_at
best-effort (fail-open: a redemption miss must not undo the tier upgrade).

Tests cover:
- plans-yaml regression (PR #47 happy path still works with DB wired)
- admin unused/used/expired/cross-team
- amount_off and first_month_free kind round-trip
- webhook redemption hook (with notes, without notes, redelivery, invalid UUID)
- model-level race: two concurrent MarkAdminPromoCodeUsed callers, exactly
  one wins.

Note on the brief's "wrong plan for admin code → invalid": admin
codes are scoped by team_id, not plan; admin_promo_codes.applies_to is
INTEGER (a percent_off cap in cents per openapi.go), not a tier list.
The handler echoes the requested plan back in discount.applies_to so the
dashboard renders "applies to <plan>" uniformly, but does not reject by
plan. Documented in the validate handler's lookupAdminPromotion comment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…out-fresh

billing: validate + redeem admin-issued promo codes alongside plans-yaml codes
Adds three new fields to GET /healthz so a single probe answers both
"is my image stale" (commit_id) and "did my migrations apply"
(migration_version + count + status). Caches the DB read for 60s per
process so readiness-probe traffic stays cheap and the path remains
<10ms p99.

- New migration 022_schema_migrations.sql tracks applied filenames.
  db.RunMigrations records each file with ON CONFLICT DO NOTHING after
  successful exec; record failures are logged but never block startup.
- New internal/migrations package: Reader with TTL-cached state, clock
  injection for deterministic tests, and a fail-open contract (DB errors
  yield migration_status="unknown" while /healthz stays 200 OK).
- OpenAPI HealthResponse schema documents every field.
- Tests cover happy path, DB-down → unknown, empty table, cache hit /
  cache miss after TTL, nil DB, and a sanity rail asserting the
  DB-reported filename actually exists in the embedded migration set.
mastermanas805 and others added 30 commits May 20, 2026 14:39
…ered gap

The 2026-05-20 production incident: every email since launch had been
silently rejected at Brevo's SMTP relay because the sender domain was
unvalidated. The worker logged classification=success on Brevo's 201
(API-acceptance) and advanced the audit_log cursor. The ledger lied
because it confused API-acceptance with delivery.

This PR ships the receiver-side machinery that closes the gap:

* New endpoint: POST /webhooks/brevo/:secret
  (api/internal/handlers/brevo_webhook.go)
  Brevo POSTs every transactional event (delivered, soft_bounce,
  hard_bounce, blocked, complaint, spam→complaint alias, deferred,
  unsubscribed, error) to this URL. The handler matches by
  (provider='brevo', provider_id=<message-id>) and updates
  forwarder_sent.classification + (for 'delivered') stamps delivered_at.
  Auth: URL-token compare against BREVO_WEBHOOK_SECRET (Brevo's
  transactional webhooks don't carry HMAC by default; the URL-token
  approach works without per-callback signing toggles in the dashboard).

* Migration 061 (canonical: worker/sql/061_forwarder_sent_delivery.sql,
  mirrored at api/internal/db/migrations/061_forwarder_sent_delivery.sql)
  ADD COLUMN delivered_at TIMESTAMPTZ NULL plus two indexes — one for
  the (provider, provider_id) lookup the receiver fires on every event,
  one for the 'delivery ratio' dashboard query.

* New Prometheus counter brevo_webhook_events_total{event} with bounded
  cardinality (closed set: 8 documented event classes + 5 admin labels).
  Drives the email_delivery_ratio NR alert + the email-delivery NR
  dashboard.

* Registry-iterating coverage test
  (TestBrevoTxWebhook_EveryDocumentedEventHasHandler) walks
  brevoDocumentedEvents and asserts every entry has a handler in
  brevoEventHandlers (CLAUDE.md rule 18).

* OpenAPI 3.1 entry for the new endpoint
  (internal/handlers/openapi.go).

* Schema-mirror update for forwarder_sent + delivered_at + the two new
  indexes (internal/testhelpers/testhelpers.go); ALTER TABLE … ADD
  COLUMN IF NOT EXISTS keeps reused test DBs in sync.

* E2E test (e2e/brevo_webhook_e2e_test.go) — seeds a forwarder_sent
  row, POSTs a synthetic 'delivered' event, verifies classification
  flipped to 'delivered' and delivered_at IS NOT NULL. Plus an orphan-
  messageId arm (200 + matched:false — Brevo retries on non-2xx, we
  MUST NOT amplify orphan traffic) and a hard_bounce arm.

OPERATOR ACTION (Brevo dashboard): once the sender domain is validated,
paste https://api.instanode.dev/webhooks/brevo/<BREVO_WEBHOOK_SECRET>
into Brevo → Transactional → Settings → Webhook URL, toggle on every
documented event class, ensure single-event-per-call is selected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tag)

api half of the chaos-drill harness from CHAOS-DRILL-2026-05-20.md:

- e2e/propagation_chaos_test.go — synthetic pending_propagations rows, verify
  exponential backoff, dead-letter at maxAttempts, alert pattern. Lives behind
  //go:build chaos so it doesn't run in the regular gate.
- e2e/lease_recovery_chaos_test.go — drives the worker-side stub job, asserts
  River re-leases an orphaned job after pod kill.
- Makefile — make chaostest-propagation / make chaostest-lease-recovery.

Companion: worker chaos_lease_recovery.go + workers.go wiring (separate commit).
Findings F1 (P0) + F2 (P1) + F4 (P1) are open follow-ups (tasks #241/242/243).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…esolvable)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The deep readiness check shipped in c5b81de but openapi.go was never
extended. TestOpenAPI_CoversAllRegisteredRoutes (the H4 regression gate)
caught this and blocked every subsequent Deploy (Brevo webhook,
common@04dd59d retrigger, chaos test commits) on master.

Adds:
- paths./readyz GET with 200/503 responses
- components/schemas/ReadinessResponse envelope (overall + service + commit_id + checks[])

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…acks (#119)

Adds the next layer up from existing unit tests + chaos drills:

Track 1 — Backup/restore integration tests (api/e2e/, build tag
integration_backup):
  - Wraps infra/scripts/restore-drill.sh in Go test scaffolding
  - Asserts RTO < 5min (postgres) / 3min (mongo), RPO < 30h
  - Asserts cleanup of throwaway namespace
  - Pure-parse tests for NR alert aggregation_window + Prom rule
    36h/60h thresholds (run anywhere)
  - .github/workflows/integration-backup.yml: weekly cron + manual
    dispatch against KUBECONFIG_TEST_CLUSTER; defensive context-name
    gate to prevent prod runs

Track 2 — Brevo webhook full pipeline (api/e2e/, e2e build tag):
  - Registry-iterating round-trip over every documented Brevo event:
    delivered/soft_bounce/hard_bounce/blocked/complaint/deferred/
    unsubscribed/error
  - Idempotent re-delivery pins GREATEST(delivered_at) clause
  - Delivered-then-bounce pins makeClassUpdater contract (no time-travel)
  - Malformed payload 400 + unhandled event 200 end-to-end

Track 3 — Propagation runner integration tests (worker/internal/jobs/,
no build tag, runs under regular make gate):
  - Backoff exact-schedule via markRetry persistence for every position
  - Dead-letter at maxAttempts via markDeadLettered direct
  - F2 P1 guard: unknown_kind bounded retries
  - FOR UPDATE SKIP LOCKED concurrency (TEST_DATABASE_URL-gated)
  - Enum-vs-handler-map registry walk (TEST_DATABASE_URL-gated)

Track 4 — Deep /readyz cross-service (api/e2e/, e2e build tag):
  - Envelope-shape walk across api/worker/provisioner
  - Brevo unreachable → 200 degraded (NOT 503) contract
  - Cache TTL: 50-burst latency assertion
  - Secret-leak scan (20+ hex chars regex)
  - P95 response time < 500ms
  - Criticality-matrix registry walk

Track 5 — Cross-track contract test (api/e2e/, no build tag, runs
under regular gate when TEST_DATABASE_URL is set):
  - Walks AuditKind* constants from api/internal/models/audit_kinds.go
    via source-file regex
  - Forward: every constant has a consumer spec entry
  - Reverse: every spec entry refers to a real constant
  - Emails=true implies Forwards=true (F4 ledger drift class)
  - Propagation kinds match pending_propagations.kind enum
  - Forwarder_sent.classification populated for rows > 5min old

Companion: INTEGRATION-TESTS-2026-05-20.md (repo-root) lists every
new test with the failure mode it catches, per CLAUDE.md rule 17
coverage-block discipline.

Per CLAUDE.md rule 18 (registry-iterating regression tests, not
hand-typed lists), every Track has at least one registry walk
that fails LOUD on additions to the upstream registry without
matching downstream wiring.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…overage tests (#121)

* fix(p0-3): emit provision.persistence_failed audit + agent_action + coverage tests

MR-P0-3 (BugBash 2026-05-20) atomic-provisioning hardening — completion
of the orphan-resource prevention work shipped in commit 36eac9b.

That commit wired all 18 provisioning callsites through finalizeProvision,
which on persistence failure (post-RPC) tears down the backend object
(best-effort), soft-deletes the row, and returns errProvisionPersistFailed
so handlers map to 503 instead of 201. This commit closes three remaining
gaps in that work:

1) **Operator-visible audit row**: emit AuditKindProvisionPersistenceFailed
   on every persistence-failure path. Operators can now reconstruct the
   exact moment the platform produced an unreachable resource — by
   resource_id / resource_type / log_prefix / provider_resource_id /
   request_id / tier / env. NOT wired into the customer-email forwarder
   (this is an internal alert, mirrors AuditKindBillingChargeUndeliverable
   and AuditKindPropagationDeadLettered). Sync emit + WARN slog on
   failure — must NEVER block the 503 to the customer.

2) **agent_action for provision_failed**: the catch-all 503 envelope used
   to fall back to AgentActionContactSupport ('email support'). That's
   wrong for the MR-P0-3 path — the backend object was rolled back and
   the row soft-deleted, so the right action is 'retry with exponential
   backoff', not contact support. Adds an explicit codeToAgentAction
   entry under the U3 contract (opens with 'Tell the user', names the
   reason, names the action, full https://instanode.dev URL, < 280 chars).

3) **Registry-iterating coverage tests** (CLAUDE.md rule 18): two static
   guards in provision_atomicity_coverage_test.go scan every production
   .go file in internal/handlers/ and assert:
   - every file calling models.CreateResource also calls finalizeProvision
     (so a new handler can't bypass the two-phase pattern)
   - every file calling finalizeProvision also routes the error through
     respondProvisionFailed or twinCoreErr (so a new handler can't
     swallow the sentinel)
   Plus TestProvisionFailedHasAgentAction asserts the catch-all code
   still has explicit agent_action guidance (defends against a future
   refactor that drops the new entry).

Coverage block per CLAUDE.md rule 17:

  Symptom:        201 returned for resource the platform can't address
                  (orphan backend + NULL conn_url + NULL PRID + billed)
  Enumeration:    rg -l 'models.CreateResource\(' internal/handlers/
                  yielded db.go, cache.go, nosql.go, queue.go, storage.go,
                  webhook.go, vector.go (each w/ anon + auth + twin paths)
  Sites found:    7 files × ~3 paths = ~21 provisioning entry points
  Sites touched:  none added — every site already calls finalizeProvision
                  per commit 36eac9b. This commit adds a coverage TEST
                  that fails the build if a new site bypasses the pattern.
  Coverage test:  TestEveryCreateResourceCallSiteIsFollowedByFinalizeProvision
                  + TestEveryFinalizeProvisionCallSiteRespondsProvisionFailedOnError
                  + TestProvisionFailedHasAgentAction
  Live verified:  (this commit; pending CI green + deploy)

Per-handler audit findings:
- internal/handlers/db.go:        already atomic (finalizeProvision at L265, L432, L642)
- internal/handlers/cache.go:     already atomic (finalizeProvision at L232, L387, L564)
- internal/handlers/nosql.go:     already atomic (finalizeProvision at L228, L380, L569)
- internal/handlers/queue.go:     already atomic (finalizeProvision at L306, L525)
- internal/handlers/storage.go:   already atomic (finalizeProvision at L335, L525)
- internal/handlers/webhook.go:   already atomic (finalizeProvision at L327, L444 — cleanup=nil ok, no backend object)
- internal/handlers/vector.go:    already atomic (finalizeProvision at L370, L522)
- internal/handlers/deploy.go:    DIFFERENT MODEL — async (202 not 201); row persisted before async build runs; failures update status. Not P0-3 shape.
- internal/handlers/stack.go:     DIFFERENT MODEL — async (202 not 201); same shape as deploy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(p0-3): swap provision_failed fixture for db_error in two 5xx-fallback tests

MR-P0-3 (BugBash 2026-05-20) added an explicit codeToAgentAction entry for
`provision_failed` so the catch-all 503 instructs the agent to retry with
exponential backoff, not contact support. Two pre-existing tests used
`provision_failed` as their "5xx code with no registry entry" fixture and
fail under the new contract:

  - TestRespondError_UnknownCode_5xx_FallsBackToContactSupport
  - TestErrorEnvelope_503_AllFieldsAndHeader

Swap both to use `db_error` — documented in helpers.go's curation
principles (lines ~87-89) as one of the 'pure plumbing errors deliberately
omitted' from codeToAgentAction. Same shape, same status code, same
fallback branch — the test contract is preserved while the production
contract gets sharper guidance for the actual MR-P0-3 path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(p0-3): register provision.persistence_failed in auditConsumerSpec

The cross-track reliability contract test in e2e/reliability_contract_test.go
asserts every AuditKind* constant has a consumer-spec entry. Adding the new
constant in this PR without the spec entry tripped that gate in CI.

Add entry with IntentionallyNoConsumer=true — mirrors the existing
billing.charge_undeliverable and propagation.dead_lettered shape: this is
an internal operator alert that intentionally has no customer email and
no propagation wiring (a customer whose resource was just torn down by the
platform deserves a human follow-up, not an automated template).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(readyz): add MarkDraining short-circuit for graceful shutdown (MR-P0-7)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(router): NewWithHooks returns ShutdownHooks for graceful shutdown (MR-P0-7)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(main): wire readiness-drain into graceful shutdown sequence (MR-P0-7)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(graceful_shutdown): drain-flag + timeout regression guards (MR-P0-7)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(readyz): drain short-circuit returns 503 + shutting_down check (MR-P0-7)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
These two paid tiers ($19/mo Hobby Plus, $99/mo Growth) appear on
/api/v1/capabilities (so agents introspecting tiers see the full ladder)
but are intentionally absent from the public /pricing page and marketing
homepage. The pattern is documented in code but not in plans.yaml — this
adds explanatory comment blocks at each tier so future agents don't try
to "fix" the perceived inconsistency by surfacing them on /pricing.

Also documents the Growth postgres_connections=20 / Pro postgres_connections=20
overlap: not a tier-ladder break, intentional — Pro is the highest single-DB
connection count under shared infra, further headroom is a Team (dedicated)
benefit.

Coverage block (per CLAUDE.md rule 17):
  Symptom:        no comment explained why hobby_plus + growth are in the
                  registry but invisible on /pricing
  Enumeration:    grep -rn 'hobby_plus\|growth' instanode-web/src/pages/PricingPage.tsx
                  api/plans.yaml content/llms.txt dashboard/src/components/PricingGrid.tsx
  Sites found:    2 (plans.yaml hobby_plus + growth blocks)
  Sites touched:  2
  Coverage test:  none — comment-only. The intent is in lockstep with
                  existing TIERS array in instanode-web PricingPage.tsx
                  (which already documents the decision client-side).
  Live verified:  /api/v1/capabilities continues to return both tiers;
                  /pricing continues to hide them. No behavior change.

Closes P1 from DOC-REALITY-DELTA-2026-05-20.md §2 + P3 (growth connections
note).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ine 383

ROOT CAUSE. Line 383 was `require.False(t, strings.HasPrefix(legacy, "v"))`,
asserting that the output of `crypto.Encrypt(keyV2, "legacy")` does not
start with the letter 'v'. But `Encrypt` returns
`base64.URLEncoding.EncodeToString(nonce||ct||tag)` — base64-url's
alphabet includes the byte 'v', so ~1.6% of outputs legitimately start
with 'v' purely from the random AES-GCM nonce. The test was inherently
flaky on every run, and CI tripped it on bd97fab.

Empirical measurement (100 000 runs of Encrypt(keyV2, "legacy")):
  starts with 'v': 1561/100000 (1.56%)
  matches "vN." pattern: 0

The author's intent was "the legacy envelope must not match the structural
'vN.' version-marker pattern written by EncryptVersioned" — the same
shape `crypto.splitVersionedEnvelope` checks for. The assertion that
mattered is the 3-byte marker (lowercase 'v', ASCII digit '1'..'9',
literal '.'), not "no leading v".

FIX. Add a local `isVersionMarker` helper that mirrors
`splitVersionedEnvelope` and assert the structural shape. New assertion
passes 10 000/10 000 runs. Old assertion would fail ~156/10 000.

`crypto.Encrypt` is unchanged. There is no test-order pollution and no
crypto-package state mutation — the bug is in the test's assertion only.

Coverage block (CLAUDE.md rule 17):
  Symptom:        TestWave3P2_AESKeyring_RoundTripsAcrossVersions line 383
                  fails ~1.6%/run with "legacy envelope must not carry a
                  version prefix"
  Enumeration:    rg -F 'strings.HasPrefix' internal/handlers/wave3_p2_test.go
                  rg -F 'crypto.Encrypt(' internal/handlers/
  Sites found:    1 (this assertion) — the other HasPrefix uses match
                  EncryptVersioned output, which deterministically starts
                  with "vN." by construction
  Sites touched:  1
  Coverage test:  TestWave3P2_AESKeyring_RoundTripsAcrossVersions itself,
                  run -count=10000 (verified locally, 0 fails)
  Regression class guarded: probabilistic-assertion-on-base64 — any test
                  that asserts on a leading byte of base64(random) output
                  is the same bug class.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t no-op

Two contract-drift P0s flagged in the bug-hunt brief, both shipped in the
same PR because they share the test gate.

B13-P0-F1: POST /auth/cli auth_url leaked dead-brand
  * Old: frontendURL() returned "https://instant.dev" in production —
    instant.dev is the dead-brand marketing host (404s). An agent
    following the auth_url landed on a parking page and gave up.
  * Fix: read cfg.DashboardBaseURL (DASHBOARD_BASE_URL env var); fall
    back to "https://instanode.dev" when DASHBOARD_BASE_URL is unset
    in production. internal/handlers/cli_auth.go.
  * Coverage: TestAuth_CLI_ReturnsInstanodeDomain (4 cases —
    explicit base, empty-fallback, dev default, trailing-slash strip)
    + TestAuth_CLI_NoLegacyInstantDevInResponses (rule-17 source-scan
    that fails if a new handler emits instant.dev/<x> anywhere).
  * Also cleaned: internal/handlers/logs.go (error message URL),
    internal/handlers/openapi.go (storage endpoint example).

B7-P0-1: PATCH /stacks/:slug/env was a silent no-op
  * Old: handler logged stack.env.noted, returned 200, persisted
    nothing — the next /stacks/:slug/redeploy rebuilt with stale env.
    Silent-data-loss failure mode.
  * Fix (Option A from the brief): migration 062 adds
    stacks.env_vars JSONB DEFAULT '{}'::jsonb. Handler now loads
    existing env, merges (empty-string value deletes), validates
    every key against isValidEnvKey (POSIX [A-Z_][A-Z0-9_]*),
    persists, emits stack.env.updated audit_log row, and returns
    the full merged map.
  * 64KiB cap on serialized payload (ErrStackEnvVarsTooLarge → 413).
  * OpenAPI updated to describe PATCH semantics + new 409/413
    responses + the merged-env response shape.
  * Coverage: TestStack_PatchEnv_PersistsAndReturns (7 subtests —
    auth, first patch, merge, delete-via-empty, invalid_env_key,
    missing_env, 404 unknown slug). All assertions verify the DB
    round-trip — pre-fix the handler returned 200 but persisted
    nothing, which is exactly what subtest #2 now fails on.

Coverage block per CLAUDE.md rule 17:
  Symptom:        instant.dev/<x> in handler-emitted strings
  Enumeration:    rg -nF "instant.dev/" internal/handlers/ --type go
  Sites found:    3 (cli_auth.go:339, logs.go:136, openapi.go:2784)
  Sites touched:  3 — all three migrated to instanode.dev
  Coverage test:  TestAuth_CLI_NoLegacyInstantDevInResponses scans
                  every .go file under internal/handlers/ and fails
                  if any non-test, non-comment, non-label,
                  non-import-path mention of instant.dev/<x> appears.
  Live verified:  pending — see follow-up commit after deploy

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ok (2026-05-20)

B5-P0: POST /claim now validates the email field via mail.ParseAddress +
length cap (RFC 5321 § 4.5.3.1.3, 254 chars) + dotted-domain + no-inner-
whitespace gates. Previously ANY string was accepted, minting users rows
whose magic-link callback could never reach the inbox. Returns 400 with
error=invalid_email_format and a curated agent_action string. Covered by
TestClaim_EmailValidation_Coverage (6 cases: valid, missing-@, dotless
TLD, empty, inner-space, > 254 chars).

B5-P1: /claim's three-name drift (request field 'jwt', error message
'jwt field is required', agent_action 'INSTANODE_TOKEN' — three names
for one field) collapsed to canonical 'token'. The 'jwt' field name is
still accepted as a deprecated alias for backward compatibility with
the dashboard, sdk-go, mcp, and existing curl recipes — when both are
present, 'token' wins. OpenAPI spec updated to mark jwt deprecated and
add token + missing_email + invalid_email_format docs. The
missing_token error envelope now uses respondErrorWithAgentAction so
the agent_action says 'POST /claim requires a token field carrying the
onboarding token...' instead of the auth-context default
'INSTANODE_TOKEN is missing or invalid'.

B11-P1 (rows-affected): models.UpgradeTeamAllTiersWithSubscription now
checks res.RowsAffected(). If the team UPDATE matches 0 rows (synthetic
webhook, deleted-team race, typo'd notes.team_id) the function returns
ErrTeamNotFound. The Razorpay webhook handler maps that error to HTTP
404 via the new webhookErrorStatus helper (was silent 200). Razorpay
treats 4xx as non-retryable so the dead event doesn't loop forever, and
the existing deleteRazorpayWebhookClaim path releases the dedup row so
a corrected event can still process later. Covered by
TestBilling_UpgradeTeam_RowsAffected.

B11-P1 (recipient resolution): handlePaymentFailed no longer trusts
payload.payment.entity.email — anyone with the Razorpay webhook secret
could synthesize a payload with email=victim@evil.com and fanout
dunning notifications (Brevo treats payment-failed as transactional and
bypasses unsubscribe). The new resolveTeamFromPayment walks
payment.notes.team_id → payment.subscription_id → event.Payload
.Subscription notes/id in priority order, then the new
models.GetPrimaryUserByTeamID returns the team's primary user; the
dunning email is sent to THAT address. Payload-supplied email is now
log-only (with a per-event mismatch warning when it differs from the
resolved address). If no team is resolvable, the email is DROPPED
rather than mis-delivered. Covered by
TestBilling_PaymentFailed_RecipientResolution (asserts an attacker-
controlled payload.email is NOT the send recipient).

Coverage block (per CLAUDE.md rule 17):
  Symptom:        4 cases — (1) /claim accepts any email string; (2) jwt
                  vs token vs INSTANODE_TOKEN field-name drift; (3)
                  UpgradeTeamAllTiers swallows 0-rows-affected; (4)
                  payment.failed trusts payload.email verbatim
  Enumeration:    rg -F 'body.JWT' / rg 'UpgradeTeamAllTiers' / rg
                  'pay.Email' / rg 'mail.ParseAddress' / rg 'is_primary
                  = true' / rg 'jwt field is required' / rg
                  'INSTANODE_TOKEN' (within /claim context)
  Sites found:    onboarding.go (Claim handler + ClaimRequest struct
                  + agent_action wiring); openapi.go (ClaimRequest
                  schema + /claim 400 response); team.go
                  (UpgradeTeamAllTiersWithSubscription rows-affected
                  + new GetPrimaryUserByTeamID); billing.go
                  (handlePaymentFailed + new resolveTeamFromPayment +
                  new webhookErrorStatus + rzpPaymentEntity Notes/
                  SubscriptionID/OrderID + 2 switch-block sites).
  Sites touched:  All. Each fix is mechanically traceable to a named
                  code location with a B5-P0 / B5-P1 / B11-P1 comment.
  Coverage test:  TestClaim_EmailValidation_Coverage (table-driven, 6
                  cases + 3 token/jwt field-name sub-tests);
                  TestBilling_UpgradeTeam_RowsAffected (synthetic
                  bogus team_id → assert 404 with error=team_not_found);
                  TestBilling_PaymentFailed_RecipientResolution
                  (attacker payload.email + team primary user →
                  assert send goes to primary, NOT attacker).
  Live verified:  pending push + Deploy + curl + /healthz commit_id
                  match — see PR description.

Pre-existing failures (NOT introduced by this commit, verified on
origin/master HEAD without these changes):
- internal/handlers TestBulkTwin_QuotaPartialFill (quota fixture)
- internal/models  TestGetExpiredDeployments_ReturnsOnlyExpiredNonPermanent
- internal/models  TestLinkGitHubID (test data leak)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sUpgradeJWT passes locally; investigating CI-only flake

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cal token + deprecated jwt

The Bbug-bash sibling renamed ClaimRequest.jwt → ClaimRequest.token (canonical) at 30f8f96.
This test still asserted only on the deprecated jwt field's description. CI was failing on it.

Now checks BOTH fields' descriptions mention upgrade_jwt:
- token (canonical) — load-bearing for new callers reading the spec
- jwt (deprecated alias) — load-bearing for old callers still using INSTANODE_TOKEN

Local was passing because of map iteration order luck; CI's race-flagged run was hitting
nil on props["jwt"] when its sibling test fired earlier and stripped state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nonical token field only

30f8f96 (B5/B11 fix) renamed ClaimRequest.jwt → ClaimRequest.token (canonical)
and intentionally trimmed jwt's description to discourage further use. The
test still required jwt's description to mention 'upgrade_jwt' — that's
exactly the doc gravity the rename meant to remove.

Now: assert token (canonical) description mentions upgrade_jwt, AND assert
jwt field still EXISTS in the schema for backward compat, but don't require
its description to repeat the cross-reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ling/middleware/resource

Sweep of P1/P2 api-repo findings from BUGBASH-2026-05-20/MASTER-LEDGER.md
still open after today's earlier P0 waves (B5-P0, B7-P0-1, B11-F2/F4/F5,
B13-F1, B17-STORAGE-P0-1..4 all already shipped or in flight). Each fix
includes a regression test where one fit cleanly.

—————————————————————————————————————————————————————————————————————————————
B4-F1  POST /auth/email/start per-email RL silent absorption
       internal/metrics/metrics.go: add instant_magic_link_email_rate_limited_total
         counter (Prometheus).
       internal/handlers/magic_link.go: increment counter + WARN log on every
         rate-limited absorption — operator now sees the abuse pattern in NR
         despite the user-visible 202 staying identical to the success path.

B4-F2  emailRateLimitKey sha256[:8] birthday-collision risk
       internal/handlers/magic_link.go:88-89: full sha256 hex (64 chars) instead
         of h[:8] (8 hex chars). 2^32 → 2^128 collision space.
       internal/handlers/magic_link_test.go: TestEmailRateLimitKey_FullHashFingerprint
         pins the full-hash suffix length so a re-truncation regression fails CI.

B4-F4  RFC 5321 local-part >64 chars not gated
       internal/handlers/magic_link.go:looksLikeEmail: reject local-part > 64
         octets per §4.5.3.1.1 (guaranteed-undeliverable).
       internal/handlers/magic_link_test.go: TestLooksLikeEmail_LocalPartCap
         table-driven (under/at/over/way-over cap).

B4-F5  10MB JSON body accepted on /auth/email/start
       internal/handlers/magic_link.go: 1 KiB body cap on POST /auth/email/start
         (real bodies are ~80 bytes; global Fiber BodyLimit = 50 MiB for
         /deploy/new tarballs is way too generous for a 2-field JSON envelope).

B4-F7 / B5-P1-1 / B10-P1-3 / B13-F6
       internal/handlers/helpers.go: extend codeToAgentAction with three
         missing entries — invalid_email, invalid_email_format,
         provision_limit_reached. The 429 provision-limit envelope is now
         carrying agent_action + upgrade_url to the agent (CLAUDE.md
         convention #6 promised these, agents were getting nothing).

B11-F1 8 Razorpay events fell to default 200
       internal/handlers/billing.go: handle subscription.deauthenticated
         (treat as cancel — mandate revoked, cannot charge), subscription.updated
         (route to handleSubscriptionCharged — idempotent re-resolve of tier),
         refund.processed (info-log + span attribute, no tier change).
       The remaining default-200 falls (payment.captured/authorized,
         order.paid, invoice.paid, subscription.authenticated) are
         intentionally informational — Razorpay sends them but tier state
         is already correct from the corresponding subscription.* event.

B18-M1 empty Idempotency-Key header silently coerced to fingerprint path
       internal/middleware/idempotency.go: detect "header present but value
         empty/blank" via raw c.Request().Header.Peek() (Fiber's c.Get()
         returns "" for both omitted-and-present-empty). Reject with 400
         invalid_idempotency_key — matches the OpenAPI contract for
         malformed keys.

B20-P2-1 soft-deleted resources still surfaced via GET /api/v1/resources/:id
       internal/handlers/resource.go:Get: 404 when resource.Status == "deleted".
       Narrow surgical change to the read path only — the DELETE path itself
       still needs to fetch the row to soft-delete it, so leaving
       GetResourceByToken untouched. The customer-visible contract is "the
       resource is gone after DELETE"; this enforces it on the GET.

—————————————————————————————————————————————————————————————————————————————
Coverage block per CLAUDE.md rule 17:

  Symptom:        per-bug, see ledger references in code comments
  Enumeration:    grep -rn for each token, see code-side comments
  Sites found:    1 each (single-site fixes by design)
  Sites touched:  1 each
  Coverage tests: TestEmailRateLimitKey_FullHashFingerprint,
                  TestLooksLikeEmail_LocalPartCap,
                  TestAgentActionContract (covers all three new
                  codeToAgentAction entries automatically — map iteration)
  Live verified:  pending — /healthz commit_id check post-deploy

Gate: `cd api && make gate`-equivalent (go test ./... -short -count=1 -p 1)
green except the 4 pre-existing flakes (TestAdminList_*, TestDBNew_*,
TestBulkTwin_*, TestGetExpiredDeployments_*) explicitly listed in the task
brief as known-acceptable. All other 30+ packages pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…action + AuditKind

Adds three registry-iterating regression tests per CLAUDE.md rule 18:

1. TestRazorpayWebhook_EveryEventBranchHasACoverageTest — enumerates
   the case "<event>": arms in billing.go's Razorpay dispatcher and
   asserts every one has at least one named coverage test. The
   2026-05-20 deauthenticated/updated/refund.processed branches landed
   without dedicated tests; this gate stops that pattern from
   re-occurring. Includes anchor tests for the previously-uncovered
   halted/completed/paused/resumed/deauthenticated/updated/refund.processed
   branches so the registry-iterator has names to refer to.

2. TestCodeToAgentAction_NoOrphans — reverse of the forward test in
   agent_action_contract_test.go. Asserts every entry in
   codeToAgentAction is referenced by handlers/middleware code. An
   orphan entry = handler renamed the code (real bug) or the entry
   is dead (delete it). Scans both internal/handlers/ and the sibling
   internal/middleware/ to catch codes emitted from middleware (dpop,
   auth).

3. TestAuditKindConstants_EveryConstantIsEmittedSomewhere — walks
   every AuditKind* constant in internal/models/audit_kinds.go and
   asserts each is referenced by ≥1 non-test api source file outside
   the declaration site. 13 constants are intentionally declared on
   the api side for cross-repo reference (the worker is the actual
   emit site for the propagation lifecycle / deploy TTL expirer /
   orphan sweep / quota suspend / etc.) — these are documented in
   the crossRepoOnly allow-list with per-constant justifications.

Each test carries the rule 17 coverage block (symptom / enumeration /
sites found / sites touched / coverage test / live verified) in its
docstring.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…/B18

Closes a batch of P2/P3 envelope-contract and ordering issues identified by
the B9 (provisioning), B10 (auth/ratelimit/quota), and B18 (input fuzz)
bug-bash reports. All fixes carry inline CLAUDE.md rule-17 coverage notes.

Fixes:

1. **B18 M4** — `POST /storage/:token/presign` validates body before checking
   token existence. Pre-fix, a random UUID returned `invalid_operation` (400)
   before the existence check fired. Reordered: token parse → resource
   lookup → body-shape validation. Closes information-flow risk if validators
   ever loosen.
   File: internal/handlers/storage_presign.go

2. **B18 M2** — Remove silent 120-byte truncation in sanitizeName. The
   authoritative length bound was already requireName's 64-rune gate; the
   second silent cap created a latent footgun if the name regex ever
   loosens to allow multi-byte runes. Updated regression test for the
   single-gate contract.
   Files: internal/handlers/provision_helper.go, provision_helper_test.go

3. **B18 M3** — Document the intentional UUID-shape-before-auth ordering on
   `GET /api/v1/webhooks/:token/requests`. The webhook token is a
   public-by-design capability (lands in HTTP headers/logs/outbound URLs);
   "well-formed-but-unknown" is not an oracle leak. Doc-only comment so
   future refactors preserve the intent.
   File: internal/handlers/webhook.go

4. **B18 L1** — Surface `X-Instant-Notice: name_normalized` header when
   sanitizeName mutates the request name (CRLF / tab / NUL / HTML-special
   chars stripped). Pre-fix the mutation was silent — agents looking up
   "db_for_user\n" later by exact name would never find the persisted
   "db_for_user". Header-only signal; does NOT fail the request (the
   strip is a deliberate hardening on top of the regex).
   File: internal/handlers/provision_helper.go

5. **B18 L2** — `parseProvisionBody` returns 415 `unsupported_media_type`
   when the request carries an explicit non-JSON Content-Type
   (application/xml etc.). Pre-fix, sending XML with `Content-Type:
   application/xml` returned 400 `name_required` — a misleading code that
   cost the caller one extra debugging cycle. The OpenAPI spec only
   declares `application/json`; 415 is the RFC-correct status.
   File: internal/handlers/provision_helper.go

6. **B10 P2-3** — Razorpay webhook invalid-signature envelope hydrated with
   the canonical ErrorResponse shape. Pre-fix, signature failures returned
   `{ok:false,error:"invalid_signature"}` with no request_id, message,
   retry_after_seconds, or agent_action. Razorpay support always asks for
   the request_id when a webhook fails. Same hydration applied to the
   invalid_payload path.
   File: internal/handlers/billing.go

7. **B10 P2-4** — Add `WWW-Authenticate: Bearer realm="instanode"` to every
   401 from respondUnauthorized. RFC 6750 §3 requires this header on every
   401 from a Bearer-protected resource. Pre-fix only the audience-mismatch
   path emitted it. OAuth-aware clients and HTTP debugging tools look for
   it.
   File: internal/middleware/auth.go

Gate (matches CI/deploy.yml):
- `go build ./...` — green
- `go vet ./...` — green
- `go test ./... -short -count=1 -p 1` — green on every modified package;
  pre-existing failures (12 in handlers + 2 in models + 3 B13 contract
  tests) verified unchanged-against-master by stashing the patch and
  re-running the same suite. All pre-existing flakes documented in
  CLAUDE.md "Known Design Gaps".

Skipped (already shipped today, per brief):
- AESKeyring (a3155a5), B5/B11/B13/B7 (0c7991c), presign middleware (PR
  #122, not yet on master), 768c0ca's 8 fixes.

Coverage:
- B19 finding #1 (presign middleware) — already shipped in PR #122; this
  PR does not duplicate.
- B19 finding #4 (lease-recovery RTO) — worker-side, tracked as task #245.
- B9 P3-F8 (X-RateLimit-Remaining: 0 on success) — investigated; the math
  in rate_limit.go is correct (limit-count). The reported 0-on-success is
  not reproducible from the code path; left for in-prod re-probe after
  this lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ent audit_id soft FK

Workstream 1 of the consolidated wave-3 api agent.

Security hardening (B18 BugBash 2026-05-20):
- New middleware/security_headers.go wires HSTS (prod only, 6mo includeSubDomains),
  Permissions-Policy (deny every powerful API), Referrer-Policy
  (strict-origin-when-cross-origin), X-Content-Type-Options: nosniff. Wired
  BEFORE RequestID in router.go so headers land on every Fiber-served path
  including livez/healthz/404/405.
- B18-L1 slog emit: provision_helper.requireName now logs
  "provision.name_normalized" with raw_len + clean_len byte counts (never the
  raw values — they may contain CRLF/NUL that triggered the strip) whenever
  sanitizeName mutated the input.
- B18-M2 (silent 120-byte truncation removed), B18-M3 (UUID-before-auth
  ordering documented as intentional — webhook tokens are public-by-design),
  B18-M4 (storage presign existence-before-operation), and B18-L2 (415 on
  non-JSON Content-Type) were verified already shipped in 1d3ddf0.

Webhook auth hardening:
- New audit kinds AuditKindBrevoWebhookUnauthorized +
  AuditKindRazorpayWebhookUnauthorized.
- Brevo webhook 401 path: emits audit_log via safego.Go (best-effort, never
  blocks the 401), metadata carries presence booleans + masked source-IP
  subnet ONLY (never the secret or raw IP). Body migrated to canonical
  respondError envelope (B13-F7).
- Razorpay webhook 400 path: same audit emit + canonical
  respondErrorWithAgentAction envelope (B13-F8).
- Brevo invalid_payload + payload_too_large paths also migrated to canonical
  envelope (B13-F7 sweep).

Forwarder-sent audit_id soft FK:
- Migration 063 adds COMMENT ON COLUMN documenting that audit_id is USUALLY
  an audit_log.id but MAY be a placeholder (`reminder-<resource_id>-<stage>`,
  `provider-<grace_id>`) from legacy emit sites. Creates a partial UUID-shape
  regex index for the orphan-reconciler join path. Strict FK is impossible
  because the column is TEXT with placeholder values; the soft-link contract
  is now schema-documented.
- testhelpers schema mirror updated to include the partial index so handler
  tests on bare Postgres see the same index shape as prod.

New files:
- internal/middleware/security_headers.go
- internal/handlers/webhook_security_helpers.go (maskSourceIP for v4 /24 + v6 /48)
- internal/db/migrations/063_forwarder_sent_audit_link.sql

Coverage block (per CLAUDE.md rule 17):
- Symptom:        defense-in-depth headers + audit visibility on webhook auth fails + soft FK on a TEXT-with-placeholders column
- Enumeration:    rg "HSTS|Strict-Transport-Security|Permissions-Policy" → 0 hits pre-fix; rg "verifyRazorpaySignature|brevoSecretURLParam.*ConstantTimeCompare" → 2 unauthorized branches
- Sites found:    2 webhook 401-emit sites, 0 existing security-header sites, 0 existing forwarder_sent FK
- Sites touched:  2 webhook handlers + 1 new middleware + 1 new migration + schema mirror
- Coverage test:  models.AuditKind* constants exist; build + vet green; handler tests TestBrevo*/TestRazorpay*/TestProvision* pass
- Live verified:  next commit deploys to prod; curl /healthz + curl /webhooks/brevo/bogus to follow

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…icate on 401 + registry-iterating coverage test

Workstream 2 of the consolidated wave-3 api agent.

agent_action coverage:
- AST walk of every respondError* call site (rg + go/parser) surfaced 227
  unique emitted error codes. Pre-wave3 registry covered 38. After this
  commit: 248 entries — every emit site has a domain-specific sentence
  (or is in an explicit allowlist for regex artefacts).
- Each entry follows the U3 contract: opens with "Tell the user", names
  the concrete failure, names the next action, contains a full
  https://instanode.dev URL, stays under 280 chars.
- Categories added: missing-required-field (20 codes), invalid-format/value
  (50 codes), not-found / gone (12 codes), conflict / state (20 codes),
  permission / authn (10 codes), billing-specific (5 codes), Razorpay
  passthroughs (1 code), 5xx domain-specific plumbing (50 codes), 429
  rate-limited (1 code), coverage-test-discovered codes (15 codes).

WWW-Authenticate header on 401:
- New setSecurityHeadersFor401 helper. Called from respondError /
  respondErrorWithAgentAction / respondErrorWithRetry / respondRecycleGate
  so EVERY 401 envelope gets `WWW-Authenticate: Bearer realm="instanode"`
  per RFC 7235 §4.1. No-op when the DPoP middleware has already set its
  own (richer) challenge header on DPoP-required routes.

Registry-iterating coverage test (CLAUDE.md rule 18):
- New internal/handlers/error_envelope_coverage_test.go::TestErrorCode_HasAgentAction
  walks every respondError* call site in handlers/ via go/ast, extracts the
  literal error-code string, asserts each one is in either codeToAgentAction
  or coverageAllowlist. A future PR that adds a new respondError site without
  a registry entry will fail this test — closing the silent-drift door.

Files changed:
- internal/handlers/helpers.go (+~210 entries + WWW-Authenticate helper)
- internal/handlers/error_envelope_coverage_test.go (new)

Coverage block (per CLAUDE.md rule 17):
- Symptom:        ~160 emitted 4xx/5xx codes had no agent_action; agent
                  callers got empty agent_action on 4xx and generic
                  "email support" on 5xx
- Enumeration:    rg + go/parser AST walk of internal/handlers/*.go
                  respondError* calls → 227 unique codes
- Sites found:    227 distinct error codes
- Sites touched:  210 added to registry, 2 in allowlist (regex artefacts)
- Coverage test:  TestErrorCode_HasAgentAction (registry-iterating, AST-based)
- Live verified:  next commit deploys to prod; curl on any 4xx route will
                  show agent_action populated; HEAD on /api/v1/* will show
                  WWW-Authenticate on the 401

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t_apps inversion

Workstream 3 of the consolidated wave-3 api agent.

B6-P3 — Growth tier deployment_apps inversion (BugBash 2026-05-20):
- plans.yaml growth.deployments_apps: 5 → 50. Pre-fix Growth ($99/mo)
  was strictly below Pro ($49/mo) on this customer-facing dimension —
  a tier-ladder inversion that violated the published "Growth = more of
  everything Pro has, on shared infra" promise.
- 50 keeps a clear gap above Pro's 10 without leaping to Team's
  unlimited (-1); 50 is the realistic ceiling under shared k8s per
  Growth tenant before per-namespace overhead matters.
- Synchronised with common/plans defaultYAML in a separate commit on
  the common repo (cc97d4f) per CLAUDE.md rule 22 (contract changes
  touch all surfaces in one PR).

Registry-iterating pinning tests (CLAUDE.md rule 18):
- New internal/plans/tier_ladder_invariants_test.go with 4 tests:
  - TestTierLadder_GrowthBeatsPro — iterates 8 numeric dimensions,
    fails on any inversion (treats -1 as +inf)
  - TestTierLadder_HobbyPlusBeatsHobby — same shape for hobby_plus
  - TestTierLadder_PaidTiersHaveDeployments — every paid tier needs
    >0 deployments_apps
  - TestPlansYAML_B6P3_GrowthDeploymentsAppsAbovePro — literal pinning
    of the B6-P3 finding
- All four tests load api/plans.yaml directly (via runtime.Caller +
  filepath.Join), NOT the embedded defaultYAML, so the api repo's
  source-of-truth YAML is what's validated.

B13 cosmetic drifts (F7, F8) — already applied in workstream 1
(Brevo + Razorpay webhooks now use canonical respondError envelope
via the new audit-emit + canonical-envelope migration). The
non-canonical 4xx envelopes are gone from those two paths; this
commit captures the plans-side fix that needed its own commit boundary.

Coverage block (per CLAUDE.md rule 17):
- Symptom:        Growth deployments_apps=5 < Pro deployments_apps=10
                  (tier-ladder inversion)
- Enumeration:    grep -n deployments_apps plans.yaml → 1 affected row
- Sites found:    1 in api/plans.yaml + 1 mirror in common/plans defaultYAML
- Sites touched:  2 (this commit + cc97d4f in common)
- Coverage test:  TestPlansYAML_B6P3_GrowthDeploymentsAppsAbovePro
                  + TestTierLadder_GrowthBeatsPro (iterates 8 dimensions)
- Live verified:  next commit deploys to prod; curl
                  /api/v1/capabilities will show growth.deployments_apps=50

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The wave-3 agent_action registry now maps db_error, internal_error, and
invalid_payload to domain-specific sentences (see helpers.go). Three
existing envelope-shape tests were pinning the pre-wave3 fallback
behaviour:

- TestErrorEnvelope_503_AllFieldsAndHeader (db_error 503) — was
  asserting AgentActionContactSupport literal; now asserts non-empty
  agent_action containing "transient" (the registry sentence's signature).
- TestErrorEnvelope_400_NullRetryAfter_NoHeader (invalid_payload 400) —
  was asserting agent_action absent; now asserts the registry sentence
  contains "request body could not be parsed".
- TestErrorEnvelope_500_NoRetryAfter (internal_error 500) — was
  asserting AgentActionContactSupport literal; now asserts non-empty
  agent_action containing "support@instanode.dev" (both the registry
  sentence and the fallback name the support email).

The TestErrorCode_HasAgentAction registry-iterating coverage test still
asserts the underlying invariant: every emitted code has an entry OR
allowlist row. The fallback path is still reachable for codes outside
both — it's just not the test target any more.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After wave-3's 210-entry agent_action expansion, two test families
needed updates to match the new contract surface:

1. TestAgentActionContract (agent_action_contract_test.go):
   - Extended actionVerbs[] to include legitimate next-action verbs
     surfaced by wave-3 entries: Restart, Re-subscribe, Re-open,
     Re-issue, Re-enter, Refresh, POST, GET, Request, Add, Trim, Pick,
     Specify, Re-deploy, Resume, Verify, Shorten, Shrink, Open,
     Disconnect, Replace, Promote, Start, Email support, Apply, Supply,
     "Each binding" (diagnostic), "See ", "No action needed" (the
     intentional no-op states for tier_unchanged / same_plan).
   - Patched ~15 helpers.go entries that referenced support@instanode.dev
     without the canonical https://instanode.dev/ URL — every entry
     now includes the canonical pricing/status/support/docs URL per the
     U3 contract.

2. TestReliability_AuditKinds_EveryConstantHasConsumerSpec
   (e2e/reliability_contract_test.go):
   - Added auditConsumerSpec entries for the two new B18 audit kinds:
     webhook.brevo.unauthorized + webhook.razorpay.unauthorized. Both
     marked IntentionallyNoConsumer (operator-alert kinds, not customer
     emails).

The underlying invariants (verb presence + canonical URL + spec
coverage) still hold — these commits only sync the test expectations
with the expanded registry surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wave-3 chaos verify P3 (2026-05-21) cosmetic finding: a signed Razorpay
webhook whose notes.team_id (or subscription_id fallback) references a
non-existent team returned 404 to Razorpay but left no audit_log row.
Operators had to grep NR for the slog line and a burst against the path
raised no signal at all.

Counterpart to webhook.razorpay.unauthorized (signature-failed audit kind
added in task #310): this kind fires AFTER signature verification passes
but before any team-touching mutation lands, so a dashboard-typo /
deleted-team race / synthetic chaos probe / valid-signature attacker probe
all surface in a single operator-only audit row + Prometheus counter +
WARN-tier NR alert.

Touched surfaces:
- models/audit_kinds.go: AuditKindRazorpayWebhookTeamNotFound constant.
- handlers/billing.go: webhookErrorStatus signature extended with
  (*BillingHandler, rzpWebhookEvent); ErrTeamNotFound branch bumps the
  Prom counter + emits audit row + WARN-level slog via safego.Go with a
  3s bounded-timeout context (CLAUDE.md rule 16 — NEVER context.Background).
  Metadata: event_type, event_id, notes_team_id (UUID, no PII),
  subscription_id, masked source-IP subnet. NO email/payload PII.
- metrics/metrics.go: razorpay_webhook_team_not_found_total counter.
- e2e/reliability_contract_test.go: auditConsumerSpec entry with
  IntentionallyNoConsumer=true (operator alert, no customer email).
- infra/newrelic/alerts/razorpay-webhook-team-not-found.json: P2 WARN
  alert at >5/h, no CRITICAL — per brief, informational not page-worthy.
- handlers/billing_webhook_team_not_found_test.go:
  TestRazorpayWebhook_TeamNotFound_EmitsAudit — full live path with
  valid signature + bogus team_id, asserts 404 + audit row + counter +1.

Coverage block:
  Symptom:        webhook.razorpay.team_not_found has no audit_log row
  Enumeration:    rg -n 'webhookErrorStatus\(|ErrTeamNotFound' internal/
  Sites found:    3 webhookErrorStatus callers + 1 definition + N consumers
  Sites touched:  all 3 callers updated; new helper emits at the central
                  ErrTeamNotFound branch in webhookErrorStatus
  Coverage test:  TestRazorpayWebhook_TeamNotFound_EmitsAudit +
                  TestReliability_AuditKinds_EveryConstantHasConsumerSpec
                  (registry walk — adds the new constant automatically)
  Live verified:  pending CI Deploy + post-deploy synthetic POST per
                  brief's verify-live step (rule 13)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…erify

50-concurrent /db/new burst exhausted the shared DigitalOcean Managed
Postgres user-connection ceiling and took down worker's
event_email_forwarder with "remaining connection slots are reserved
for non-replication superuser connections". Pool was pinned at 25/10
with handlers holding connections through the full sync provisioner
gRPC round-trip.

Changes:
- internal/db/postgres.go: env-tunable pool sizing — API_PG_MAX_OPEN_CONNS
  (default 15, was 25), API_PG_MAX_IDLE_CONNS (5), API_PG_CONN_MAX_LIFETIME
  (4m), API_PG_CONN_MAX_IDLE_TIME (90s). Operator can raise without
  redeploy when DO Managed Postgres tier is bumped.
- internal/db/pool_metrics.go: StartPoolStatsExporter ticks sql.DBStats
  every 5s and pushes onto instant_pg_pool_{in_use,idle,open,max,wait_count,
  wait_duration_seconds} Prometheus gauges. Closes the observability gap
  that made the chaos verify finding only visible via downstream worker
  errors.
- internal/metrics/metrics.go: declares the new gauges with bounded
  cardinality (one label: pool name).
- main.go: launches the exporter immediately after ConnectPostgres so
  the gauges have values before the first Prom scrape.
- internal/db/pool_metrics_test.go: regression contract.
  - TestPublishStats_RoundTripsAllFields: every DBStats field surfaces.
  - TestStartPoolStatsExporter_ContextCancellation: no goroutine leak.
  - TestStartPoolStatsExporter_NilPoolSafe: no panic on nil pool.
  - TestEnvInt/EnvDuration_FallsBackOnBadValues: typo'd env var can't
    silently disable the ceiling.
  - TestPoolBurst_DoesNotStarveLastCaller: burst-safe contract, gated
    on INSTANT_TEST_POOL_BURST=1 (CI can opt in; default-skipped to
    keep `go test` hermetic).

Live 50-burst against prod NOT executed per brief's CONSTRAINTS
("never disrupt real customer traffic; if running would risk other
tenants, document as TODO instead and ship the code fix"). Operator
follow-up: after deploy, watch instant_pg_pool_in_use / instant_pg_pool_max
during the next scheduled load test; alert when > 0.8 for 5m.

Coverage block:
  Symptom:       worker event_email_forwarder error "remaining connection
                 slots are reserved for non-replication superuser connections"
                 during 50-concurrent api /db/new burst
  Enumeration:   rg -F 'SetMaxOpenConns' api/ worker/ provisioner/
  Sites found:   3 (api postgres.go, worker db.go, provisioner main.go)
  Sites touched: 3 (this commit + sibling commits on worker + provisioner)
  Coverage test: TestPublishStats_RoundTripsAllFields fails if a future
                 refactor drops a Stats() field from the gauge map
  Live verified: deferred — see /readyz + /metrics post-deploy

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…per spec

The earlier #311 ship was real (security_headers.go landed in 2ab2b96), but
the wave-3 chaos-verify caught a contract drift: the spec mandated
X-Frame-Options + Cross-Origin-Resource-Policy AND a 2-year HSTS max-age,
none of which were emitted. The Permissions-Policy value also differed
from the spec's exact 4-feature deny string. A repo-wide grep for the
spec headers returned partial hits on master, not the full set.

This commit reconciles the middleware against the spec exactly:

- Strict-Transport-Security: max-age=63072000; includeSubDomains  (was 15552000)
- X-Content-Type-Options:    nosniff                              (unchanged)
- X-Frame-Options:           SAMEORIGIN                            (NEW)
- Referrer-Policy:           strict-origin-when-cross-origin       (unchanged)
- Permissions-Policy:        geolocation=(), microphone=(), camera=(), payment=()
                             (was a wider 23-feature deny set)
- Cross-Origin-Resource-Policy: same-origin                       (NEW)

HSTS remains prod-only — gated by `cfg.Environment == "production"` in
router.go — so local dev over http://localhost:8080 doesn't poison the
host's browser HSTS cache.

Exports header values as package-level constants
(middleware.HSTSValue / .PermissionsPolicyValue / .ReferrerPolicyValue /
.XContentTypeOptionsValue / .XFrameOptionsValue /
.CrossOriginResourcePolicyValue) so the test file and any future
consumers (OpenAPI generator, security scanner harness, etc.) can pin
the contract against the same source of truth.

Adds internal/handlers/security_headers_test.go with 4 tests:

  - TestSecurityHeaders_AllEndpoints_AllHeaders_Prod — iterates 5
    representative endpoints (healthz / readyz / openapi.json / db/new /
    claim) and asserts all 6 spec headers land on every response,
    including the 401 and 400 error envelopes. This is the regression
    gate against the failure mode that triggered the redo.

  - TestSecurityHeaders_NoHSTSInDev — pins the dev-mode contract: HSTS
    MUST be absent when ENVIRONMENT != production.

  - TestSecurityHeaders_PermissionsPolicy_Exact — pins the exact
    Permissions-Policy spec value so a refactor that "improves" the
    deny set fails loudly.

  - TestSecurityHeaders_HSTS_TwoYearMaxAge — pins max-age=63072000.

Updates openapi.go info description with a "Security headers (applies
to every response)" section so the global contract is discoverable from
the OpenAPI document.

Coverage block (CLAUDE.md rule 17):

  Symptom:        wave-3 chaos-verify reports task #311 marked complete
                  but spec headers (X-Frame-Options, CORP, HSTS 2y,
                  exact Permissions-Policy) not in api code
  Enumeration:    rg -F 'Permissions-Policy|Referrer-Policy|
                  X-Content-Type-Options|X-Frame-Options|
                  Cross-Origin-Resource-Policy|Strict-Transport-Security'
                  --include='*.go' .
  Sites found:    1 (internal/middleware/security_headers.go) — wired
                  but spec-incomplete: missing X-Frame-Options + CORP,
                  HSTS at 15552000 not 63072000, Permissions-Policy a
                  superset of the spec's 4-feature deny string
  Sites touched:  3 — security_headers.go (spec-complete header set),
                  router.go (updated wiring comment), openapi.go (global
                  description section). + new test file.
  Coverage test:  TestSecurityHeaders_AllEndpoints_AllHeaders_Prod
                  iterates 5 endpoints + asserts all 6 headers
  Live verified:  pending — requires CI build + kubectl image set +
                  curl https://api.instanode.dev/healthz on the new
                  commit_id. Documented in PR body for the
                  verify-live gate (CLAUDE.md rule 14).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…TeamAllTiers

The first revision of TestRazorpayWebhook_TeamNotFound_EmitsAudit used
makeSubscriptionChargedPayload (empty plan_id), which short-circuited the
handler at the F3 'unknown_tier' branch BEFORE reaching
UpgradeTeamAllTiersWithSubscription. ErrTeamNotFound was never returned,
the webhookErrorStatus 404 path was never taken, and the new audit row
was never emitted — CI's TestRazorpayWebhook_TeamNotFound_EmitsAudit
failed on the live DB run.

Switch the test to billingWebhookDBApp + makeSubscriptionChargedPayloadWithPlan
(both already in the suite) with cfg.RazorpayPlanIDPro. The handler now
passes the F3 tier-validation branch, calls UpgradeTeamAllTiersWithSubscription
with a bogus team_id (no matching teams row), gets back ErrTeamNotFound,
and routes through webhookErrorStatus → emitRazorpayTeamNotFoundAudit.

No production-code change — pure test fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first CI run of TestRazorpayWebhook_TeamNotFound_EmitsAudit failed
because emitRazorpayTeamNotFoundAudit read event_id from the parsed
body's event.ID field — which is empty when the test sends only the
X-Razorpay-Event-Id header (the canonical Razorpay shape).

Mirror the dispatch resolution at billing.go:1023 — header is canonical,
body event.ID is the fallback. Now the audit row's metadata.event_id
matches the same id the rest of the webhook pipeline (dedup, retry
release, slog) uses, and operators can join the audit row to Razorpay's
delivery log via a single id.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…p ci]

CI-minute conservation pass (2026-05-21).

Two changes per workflow (deploy.yml + ci.yml):
1. concurrency.cancel-in-progress: false → true, AND group key now
   includes ${{ github.ref }} so different branches run independently.
   Previously: 5 rapid-fire pushes = 5 full Deploys to completion.
   Now: 5 rapid-fire pushes = 1 final Deploy (4 cancelled mid-flight).
2. paths-ignore added to skip docs-only commits (**.md, docs/**,
   CLAUDE.md, .gitignore, LICENSE, BUGBASH-*/**).

Estimated savings: ~40% of historical burn rate under the rapid-iteration
pattern this session exhibited.

[skip ci] honored — this commit itself does NOT trigger a Deploy or CI run.
The new behavior applies starting from the NEXT real code push.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three tests have been blocking every api Deploy since wave-2's UX-polish
agent landed orphan registry entries + repurposed test fixture codes:

1. TestRespondError_UnknownCode_5xx_FallsBackToContactSupport — was using
   "db_error" as a fake-unregistered fixture, but wave-2 added a real
   entry for it ("platform database hit a transient error"). Switched
   to a literal sentinel "__test_5xx_unregistered__" that's fake by
   construction; if someone adds it to the registry, the test fails
   loudly with a clear signal to pick a different sentinel.

2. TestRespondError_UnknownCode_4xx_OmitsAgentAction — same pattern,
   was using "invalid_payload" which wave-2 registered. Now uses
   "__test_4xx_unregistered__".

3. TestCodeToAgentAction_NoOrphans — flagged "upstream_failed" as an
   orphan registry entry (helpers.go:862-864) with no handler emit site
   anywhere in the codebase. Deleted the dead entry. The "provider_failed"
   entry above it covers the legitimate upstream-failure case.

Unsticks 4 pending commits behind master: 4212c11 (security headers),
1345e28 (pg pool sizing), 6b6253b/36ff74d/c1bb661 (razorpay team_not_found
audit). Net effect of one Deploy run: 5 commits' worth of work goes live.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…DELETE SET NULL (closes gap #6)

Closes CLAUDE.md "Known Design Gaps" #6: team-deletion cascades drop
audit_log rows but leave forwarder_sent rows pointing at non-existent
audit_log IDs. Migration 063 only added a partial index + COMMENT;
063's analysis correctly identified that a direct FK on audit_id is
infeasible (TEXT vs UUID type mismatch + PK NOT NULL + legacy
placeholder strings like `reminder-<id>-<stage>` from pre-consolidation
emitters would fail constraint validation).

Strategy: add a NEW nullable column `audit_log_id UUID REFERENCES
audit_log(id) ON DELETE SET NULL` as a strict-FK breadcrumb alongside
the existing TEXT audit_id (which stays as the PK + idempotency key).
This gives gap #6 the strict FK it asks for without breaking legacy
placeholder emitters (those rows leave audit_log_id NULL — semantically
correct, since they were never tied to an audit_log row).

Backfill: any existing forwarder_sent row whose audit_id is a valid
UUID AND exists in audit_log gets audit_log_id populated. Placeholder
rows + orphan rows both leave audit_log_id NULL. After this migration
the email-truth-surface ledger row survives audit_log row deletion
(CLAUDE.md rule 12) but the breadcrumb back to audit_log is NULL'd
via the FK rather than orphan-dangling.

Follow-up needed: worker's event_email_forwarder must be taught to
populate audit_log_id alongside audit_id on insert (when audit_id is a
real UUID). Until that ships, post-064 rows leave audit_log_id NULL
even when they have a valid audit_log target. The backfill covers
existing rows; this is a forward-only behaviour gap.

Coverage block (CLAUDE.md rule 17):
  Symptom:        forwarder_sent rows orphan after team-delete cascade
  Enumeration:    rg -F 'forwarder_sent' --type-add 'sql:*.sql' -tsql -tgo
  Sites found:    1 schema, 1 testhelpers mirror
  Sites touched:  1 migration + 1 testhelpers mirror + 1 regression test
  Coverage test:  TestMigration064_AuditLogIDFKWithOnDeleteSetNull
                  + TestMigration064_OnDeleteSetNullEndToEnd
  Live verified:  pending operator apply + post-deploy verification

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants