fix(reliability): migration 064 — forwarder_sent.audit_log_id FK (closes gap #6) by mastermanas805 · Pull Request #126 · InstaNode-dev/api

mastermanas805 · 2026-05-21T16:26:54Z

Summary

Closes CLAUDE.md Known Design Gaps #6 — `forwarder_sent.audit_id` has no FK to `audit_log(id)`, so team-deletion cascades drop audit_log rows and leave orphan ledger rows pointing at non-existent IDs.

Migration 063 (live) was an index + COMMENT only — it explicitly documented why a direct FK on `audit_id` is infeasible (TEXT vs UUID type mismatch, NOT NULL PK, legacy placeholder strings like `reminder--` from pre-consolidation emitters).

Approach: add a new nullable column `audit_log_id UUID REFERENCES audit_log(id) ON DELETE SET NULL` alongside the existing TEXT `audit_id`. The new column is the strict-FK breadcrumb; `audit_id` stays as the PK + idempotency key.

Backfill: existing rows whose `audit_id` is a real UUID AND resolves in `audit_log` get `audit_log_id` populated. Placeholder rows + already-orphan rows leave `audit_log_id` NULL.
Forward behaviour: future `audit_log` deletes flip `audit_log_id` to NULL via the FK rather than orphan-dangling — the ledger row itself survives (email-truth-surface, CLAUDE.md rule 12).
Index `idx_forwarder_sent_audit_log_id_not_null` lets the orphan-reconciler join cheaply.

Files changed

`internal/db/migrations/064_forwarder_sent_audit_fk.sql` — new migration
`internal/testhelpers/testhelpers.go` — mirror the new column + FK + index into the test bootstrap (anti-drift guard `TestRunMigrationsMirrorsEveryMigrationTable` already iterates the migration registry)
`internal/testhelpers/migration_064_test.go` — 5 registry-walk tests:
- column shape (uuid, nullable)
- FK exists with `confdeltype='n'` (= SET NULL) targeting audit_log
- partial index exists
- end-to-end: insert with both `audit_id` + `audit_log_id` → `DELETE FROM audit_log` → ledger row survives, `audit_log_id` is NULL
- legacy placeholder `audit_id` strings still insert cleanly

Test plan

`go build ./...` — green
`go vet ./...` — green
`go test ./internal/testhelpers/... -run TestMigration064 -v` — 5/5 PASS
Full gate run — 3 pre-existing master failures (`TestBulkTwin_QuotaPartialFill`, `TestGetExpiredDeployments_ReturnsOnlyExpiredNonPermanent`, `TestLinkGitHubID`) confirmed on `master` baseline (stashed-changes verification); not introduced by this PR
Operator: apply migration via standard rollout; verify `/healthz` reports `migration_version=064_forwarder_sent_audit_fk.sql`, `migration_count=63`
Operator: `SELECT count(*) FROM forwarder_sent WHERE audit_log_id IS NOT NULL` should be > 0 after migration (= backfilled real-UUID rows)

Follow-up needed (not blocking this PR)

Worker's `event_email_forwarder.go` must be taught to populate `audit_log_id` alongside `audit_id` on insert when `audit_id` is a real UUID. Until that ships, post-064 rows leave `audit_log_id` NULL even with a valid `audit_log` target. The backfill covers existing rows; this is a forward-only behaviour gap tracked as part of the closing of gap #6.

Worker repo also needs a `worker/sql/064_forwarder_sent_audit_fk.sql` mirror (worker repo is separate; not in this PR — surface that to the next operator).

Why no direct FK on `audit_id`

Three independent blockers, all documented in migration 063 and in 064's header:

Type mismatch. `audit_log.id` is UUID, `audit_id` is TEXT.
PK NOT NULL. `audit_id` is the table's PRIMARY KEY; ON DELETE SET NULL would violate the PK on every audit_log delete.
Legacy placeholders. Several emit sites write synthetic placeholder values like `reminder-<resource_id>-` or `provider-<grace_id>`. A strict FK would reject those rows.

The new nullable `audit_log_id` column sidesteps all three.

🤖 Generated with Claude Code

… of pricing experiments) Adds an internal/experiments package with a deterministic SHA256(identifier+salt) mod len(variants) selector and a registry of active experiments. The first registered experiment is UpgradeButton with variants {control, urgent, value}. The variant assignment is exposed on GET /auth/me as a new `experiments` map keyed by experiment name, so the dashboard learns the user's bucket for every active experiment in one round-trip. POST /api/v1/experiments/converted records a conversion in the audit log with kind = "experiment.conversion" and metadata = {experiment, variant, action_taken}. The handler verifies the variant matches the server's bucket for the caller (rejects stale clients) before writing. Bucketing identifier: - /auth/me — team_id (caller is always authenticated here) Tests: - internal/experiments: 7 tests (determinism, distribution ~33/33/33 across 1000 ids, salt isolation, valid-variant invariant) - internal/handlers: 6 tests (/auth/me embeds the field, POST writes audit_log, rejects unknown experiment / invalid variant / variant mismatch / no-auth) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…act (U3) Establishes a single source of truth for every agent_action string the API returns (internal/handlers/agent_action.go) and enforces a four-requirement contract in tests (TestAgentActionContract). The U3 contract every agent_action MUST satisfy: 1. Open with "Tell the user" — the LLM agent re-articulates verbatim. 2. Name the specific reason rejected (tier, limit, policy, resource). 3. Name the exact next action (Upgrade, Claim, Provision twin, Contact support) — never "try again later" without context. 4. Include a full https://instanode.dev/ URL — no relative paths. 5. Under 280 chars so LLMs reproduce verbatim instead of summarizing. Audited 25 strings across the handlers + middleware packages. Every one now meets all four requirements: builders (newAgentAction*) for tier/env/limit interpolation, static constants for fixed walls, and the existing codeToAgentAction registry sharpened in lockstep. Call sites refactored to consume the named constants/builders so reviewers audit prose in one file (agent_action.go) rather than scattered handlers. Tests: - TestAgentActionContract covers every string (static + builder + registry) - TestAgentActionContract_RegistryCoverage guards expected codes present - Existing TestRespondError_* assertions updated to match sharpened copy - All handlers + middleware unit tests green (pre-existing internal/plans failure unrelated to this PR) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tions-fresh handlers: audit + sharpen every agent_action string per uniform contract (U3)

New endpoint reads the most recent near_quota_wall row written by the worker's QuotaWallNudgeWorker, scoped to the caller's team and bounded to the last 24h. Returns the metadata fields (tier/axis/service/current/ limit/percent_used/at) flattened into the response when a row exists, or {ok:true, near_wall:false} when absent or stale. Tier gate: team-tier callers always get near_wall=false without an audit query — team tier has no walls so the scan is meaningless. Query uses the existing idx_audit_team_at index (team_id + created_at DESC from migration 012), which is the exact access pattern. No new index needed. Routed under the /api/v1 auth group so a valid session JWT is required and team scope comes from the JWT claims.

experiments: server-side bucket selector + UpgradeButton variants (P1 of pricing experiments)

… + billing.go plan_frequency (P2) Adds the three yearly plan variants to plans.yaml and threads a new plan_frequency field through POST /api/v1/billing/checkout. Limits + features on each *_yearly plan mirror the monthly counterpart byte-for-byte; only price (annual amount in cents) and billing_period: yearly differ. ## Pricing hobby_yearly: $90/yr (~$7.50/mo, saves $18/yr vs $9 x 12) pro_yearly: $490/yr (~$40.83/mo, saves $98/yr vs $49 x 12) team_yearly: $1990/yr (~$165.83/mo, saves $398/yr vs $199 x 12) ## API changes - `checkoutRequest.PlanFrequency` ("monthly" | "yearly", default monthly). Empty stays back-compat with pre-P2 dashboard clients. - `razorpayPlanIDFor(tier, frequency)` resolves the right plan_id from config. Yearly env vars empty -> 503 billing_not_configured (so partial rollout — monthly live, yearly plans not yet created on Razorpay — is safe). - `planIDToTier()` recognises both monthly and yearly plan_ids and maps them back to the canonical (bare) tier. teams.plan_tier always stores the canonical name so limits resolution is cycle-agnostic. - Webhook unchanged in shape — the upgrade tier comes from the resolved canonical tier regardless of which cycle paid. - OpenAPI schema documents plan_frequency on /api/v1/billing/checkout. ## New env vars (operator action — see PR body callout) RAZORPAY_PLAN_ID_HOBBY_YEARLY RAZORPAY_PLAN_ID_PRO_YEARLY RAZORPAY_PLAN_ID_TEAM_YEARLY ## Tests - Unit: 5 new handler tests (invalid_frequency 400, yearly unconfigured 503, monthly default behaviour, team-tier guard fires on both cycles, webhook plan_id -> canonical tier mapping for all 6 plan_ids). - Unit: 3 new plans tests (yearly variants mirror monthly, CanonicalTier helper, base-tier limits unchanged). - `make test-unit`: green across all packages. ## Dependency Requires `InstaNode-dev/common#pricing/p2-annual-plans` (PR #6) — adds the `BillingPeriod` Plan field, the `Registry.BillingPeriod` method, the `CanonicalTier` helper, and the three yearly entries in the embedded defaultYAML. This api branch's `go.mod` `replace` directive points to `../common` so local builds work, but the merged base in `common` must be cut before this PR lands.

…dge-api-fresh # Conflicts: # internal/router/router.go

…fresh usage: GET /api/v1/usage/wall returns latest near-wall audit row (U1)

plans: annual pricing tiers (hobby_yearly / pro_yearly / team_yearly) + billing.go plan_frequency (P2)

…ure) POST /deploy/new now accepts two new multipart fields: - private (bool-ish: "true" / "1" / "yes") — when true the resulting Ingress carries an nginx.ingress.kubernetes.io/whitelist-source-range annotation built from allowed_ips. - allowed_ips — comma-separated CIDRs / IPs. Each entry is validated via net.ParseCIDR / net.ParseIP; max 32 entries (larger lists belong in CF Access / a real VPN, not an nginx annotation). Tier gate: Pro / Pro-yearly / Team / Team-yearly / Growth only. Hobby / anonymous / free / yearly-free → 402 with the new AgentActionPrivateDeployRequiresPro constant pointing at https://instanode.dev/pricing. Validation order is tier-gate FIRST so low-tier callers never see the "missing allowed_ips" 400 — they only see the upgrade wall, which is the only action that matters to them. Migration 020_deployment_access_control.sql adds two columns to deployments: private BOOLEAN NOT NULL DEFAULT false and allowed_ips TEXT NOT NULL DEFAULT ''. Stored as a comma-joined string (not JSONB array) because the nginx annotation already requires that exact form — round-trip is lossless and the existing scanDeployment code path keeps its scalar-friendly shape. Compute layer: - compute.DeployOptions gets Private bool + AllowedIPs []string. - K8sProvider.applyIngressForDeploy now sets the annotation when private is true and the slice is non-empty (belt-and-suspenders against a stray "allow nobody" Ingress). OpenAPI 3.1 spec extended with the two request fields, the two response fields, the 402 wall, and updated 400-error list. Two new agent_action constants: - AgentActionPrivateDeployRequiresPro (402 wall) - AgentActionPrivateDeployRequiresAllowedIPs (400 wall for private=true with empty allowed_ips) Both pass TestAgentActionContract automatically — added to the contract case table. Tests: 7 new cases in deploy_private_test.go (Pro accept, Hobby 402, empty IPs 400, invalid IP 400, 33-IP cap, public default unchanged, GET /api/v1/deployments round-trip). All `make test-unit` packages green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…-fresh deploy: private deploys with allowed_ips ingress whitelist (Pro+ feature)

…n-place Dashboard PrivacyPanel (PR #44) needed a way to edit access-control on an existing deploy. POST /deploy/new with private/allowed_ips shipped in PR #45 but no PATCH surface — every PrivacyPanel save hit 404. What lands: - PATCH /api/v1/deployments/:id accepts {private?, allowed_ips?} JSON. Body fields are optional pointers so "omit" and "set to zero" are distinguishable. Sending only allowed_ips preserves the current private state; sending private:false clears the allow-list regardless of allowed_ips in the same body (preserves the public-deploy invariant). - Validation reuses PR #45's parsePrivateDeployFields rule-set by factoring out validatePrivateDeployFields — the tier gate, non-empty IPs check, cap, and per-entry parse are shared with POST. The U3 reviewer audits the rules in one place. - compute.Provider gains UpdateAccessControl(ctx, appID, private, allowedIPs). K8s implementation Get/Update's the existing Ingress and rewrites the whitelist-source-range annotation via a new shared helper buildIngressAccessAnnotations — same function the create path now uses, so create and update can't drift on the annotation key. Noop logs+returns nil. Local-dev (no DEPLOY_DOMAIN) is a no-op with a warn breadcrumb. - models.UpdateDeploymentAccessControl writes the private + allowed_ips columns. Single-row update — no caching/aggregation concerns. - Semantics decision documented inline: allowed_ips on PATCH REPLACES the current list (not appends). Matches REST collection-field conventions and is what PrivacyPanel renders/submits. Tests: 8 new in deploy_private_patch_test.go, all pass against the noop compute provider. Pro flips public→private, replace semantics, private:false clears the list, hobby 402 with reused AgentActionPrivateDeployRequiresPro, invalid IP surfaces verbatim, 404 / 403 / empty-body 400. TestAgentActionContract still passes (no new strings added). make test-unit: all packages green (handlers 23.6s). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…d plans.ValidatePromotion Adds the agent-facing endpoint that the dashboard's PromoCodePanel (PR #38) submits to before checkout. Returns 200 + ok:true with the structured discount on success and 200 + ok:false with a typed error + agent_action on rejection — the dashboard renders the red state through its normal success-path parser without a catch. Rate-limited per-team per-hour (30/hr) to make brute-forcing the seed-code namespace impractical; Redis errors fail open so a cache outage can't block a user mid-checkout. Adds the AgentActionPromotionInvalid constant which the TestAgentActionContract gate covers automatically.

…s-fresh deploy: PATCH /api/v1/deployments/:id to edit private + allowed_ips in-place

…date-fresh billing: POST /api/v1/billing/promotion/validate — HTTP wrapper around plans.validatePromotion

Founder-facing customer-management surface served under /api/v1/admin/*. Per the CRM discussion: Postgres already has the data, no need for HubSpot/Salesforce. List teams with MRR/storage/deploy aggregates, fetch a single customer's profile, manually elevate tiers (comp/cust- success, no Razorpay charge), and issue single-use promo codes. All four endpoints are gated by middleware.RequireAdmin, which reads the JWT email against ADMIN_EMAILS (comma-separated, case-insensitive). Closed by default: unset/empty ADMIN_EMAILS rejects every caller. 403 responses carry the canonical agent_action sentence. Endpoints: GET /api/v1/admin/customers — list with q/tier/sort_by GET /api/v1/admin/customers/:id — detail + users + recent audit POST /api/v1/admin/customers/:id/tier — set plan_tier, elevate, audit POST /api/v1/admin/customers/:id/promo — issue single-use promo code Migration 021 adds admin_promo_codes (single-use, admin-issued, audited) as a distinct table from plans.yaml promotions (static, server-config- level discounts). Two distinct concepts, two storage layers. Aggregations are live SQL with no Redis cache — admin views are low-frequency and must show ground truth. List uses CTEs so a 500-team result is one round trip (no N+1). MRR is computed from plan_tier via the plans Registry; yearly subscriptions contribute monthly-equivalent (annual/12) for sort comparability. Tests cover: closed-by-default 403, non-admin 403, admin 200, list MRR sort, query-by-email, detail shape, tier change updates team + elevates resources + writes audit row with structured metadata, promo issue returns unique code + expires_at + audit row. make test-unit all green (18 new admin tests + agent_action contract additions). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…s-api-fresh # Conflicts: # internal/handlers/agent_action_contract_test.go

admin: Customers endpoints (list/detail/tier-change/issue-promo) gated on ADMIN_EMAILS

…nceled} for Loops forwarder Wire 4 of the 7 audit_log.kind values that the Loops worker (PR #10) forwards to Loops.so but which nothing in the API was actually writing. After this PR the welcome / upgrade / downgrade / cancellation lifecycle emails will fire instead of silently no-op'ing. Sites added: - onboarding.Claim — onboarding.claimed (after JWT mark + session mint, pre-response, in a detached goroutine) - billing.handleSubscriptionCharged — subscription.upgraded or subscription.downgraded, classified by tierRank(prior) vs tierRank(new). Same-tier renewals emit nothing so monthly Pro renewals don't trigger the upgrade email. - billing.handleSubscriptionCancelled — subscription.canceled (single-l US spelling matches the Loops forwarder map; Razorpay's double-l event name is handled inside the dispatcher). Fail-open invariant enforced and tested: when audit_log writes fail (e.g. the table doesn't exist), the originating handler still returns success and the tier mutation still commits. Razorpay never sees a retry-worthy status from an audit miss. Named constants live in internal/models/audit_kinds.go so the emit sites and the Loops forwarder match on identity rather than re-typing strings. Skipped from the original 6 missing kinds: - admin.tier_changed / admin.promo_issued — Track A's admin_customers.go is not on master at HEAD d3fa539, so there is no admin tier or promo handler to attach to. Deferred until Track A lands. - resource.expiry_imminent — lives in the worker repo, out of scope for this PR per the brief. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…resh audit: emit 4 missing kinds (onboarding.claimed + subscription.*) to unblock Loops forwarder

Defense-in-depth on top of ADMIN_EMAILS (PR #48). PR #48 shipped admin endpoints at /api/v1/admin/customers/...; even with the email allowlist gate, a guessable path lets drive-by scanners log auth failures + a leaked session token from an XSS / exfil probe the admin surface. Two independent gates now protect /api/v1/<prefix>/customers/...: 1. ADMIN_PATH_PREFIX — unguessable URL segment (32+ chars, alnum). Empty/unset → admin routes are NOT registered (404). Closed by default. Operators opt in by generating via `openssl rand -hex 32`. A short or non-alphanumeric prefix is a fatal startup error. 2. ADMIN_EMAILS — unchanged. JWT email allowlist, closed by default. The prefix is delivered to admin clients only via /auth/me's admin_path_prefix field (omitted entirely for non-admin callers and for deployments without ADMIN_PATH_PREFIX configured). The dashboard reads it and mints /api/v1/<prefix>/customers/... URLs client-side. OpenAPI spec (/openapi.json) deliberately omits the admin surface — a documented path defeats the obscurity gate. Regression test guards. INTERNAL-OPS.md (new, private via .gitattributes export-ignore) carries the rotation procedure, incident response, and the operator runbook. Tests: 13 new unit tests (8 router/config + 4 handlers + 1 OpenAPI omission grep); all 382 dashboard tests still pass; full Go test suite green on target packages (router, handlers, config). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

admin: route admin endpoints under ADMIN_PATH_PREFIX env (closed-by-default + internal docs)

F's apply.sh in the infra repo patched 3 schema gaps at apply time via jq. Terraform reads the same JSON but wouldn't run the adapter — the source JSON drifted from one applier to the other. Bake the fixes into the source so apply.sh + Terraform consume identical clean structures. Fixes baked: 1. Dashboards: accountIds:[0] → accountIds:["${NEW_RELIC_ACCOUNT_ID}"] substitution token across all 4 dashboards (28 total occurrences). 2. Alerts: drop "type": "NRQL" — NerdGraph rejects it on NrqlConditionStaticInput; the mutation name encodes the type. 3. Alert policy: add policies/instant-api.json describing the umbrella policy + add policyName="instant-api alerts" linking field to each alert JSON, replacing apply.sh's inline find-or-create logic. apply.sh already handles this defensively — its jq walk() rewrites .accountIds regardless of the previous value, and del(.type) is a no-op when the field is absent. No companion infra PR needed. Tests (infra/newrelic/tests/bake_test.sh): 31 assertions covering JSON parse, accountIds token presence + count per dashboard, absence of "type" on alerts, policyName cross-ref between alerts and policy file, and policy file shape. All pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…-fresh infra: bake jq fixes into NR JSON source (apply.sh + Terraform read same clean files)

…celed_by_admin audit POST /api/v1/admin/customers/:team_id/tier now cancels the customer's active Razorpay subscription when the tier change is a demote (toRank < fromRank). Promotes are unchanged — the comp-tier flow stays Razorpay-free. Razorpay cancel uses cancel_at_cycle_end=false (CancelImmediately) so MRR drops in the same billing cycle the demote happened, rather than carrying the old tier through to end-of-cycle. Adds a sibling helper razorpaybilling.Portal.CancelImmediately alongside the existing CancelAtCycleEnd (which the customer's own self-serve cancel still uses). Emits a new audit_log kind `subscription.canceled_by_admin` with metadata carrying {from_tier, to_tier, by_admin_email, reason, subscription_id, cancel_attempted, cancel_succeeded, cancel_error}. The Brevo / Loops forwarder keys on the kind string to fire a "your subscription was canceled by support" template; cancel_succeeded gates the "we canceled" copy so a failed Razorpay call doesn't lie to the customer. Fail-open posture: a Razorpay 5xx during cancel does NOT block the admin demote — the DB-side demote already succeeded, the audit row records cancel_succeeded=false + the error, and the operator reconciles manually in the Razorpay dashboard. Same fail-open shape as resource elevation. Six new integration tests cover: demote pro→hobby fires cancel + audit; demote team→hobby (multi-rank delta); demote with empty subscription_id skips Razorpay but still audits; promote does NOT call Razorpay; Razorpay 500 still returns 200 with cancel_succeeded=false audit; same-tier 409 makes no Razorpay call (idempotency). make test-unit: green across all packages. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…l-on-demote-fresh # Conflicts: # internal/router/router.go

…ote-fresh admin: cancel Razorpay subscription on demote + emit subscription.canceled_by_admin audit

…aml codes POST /api/v1/billing/promotion/validate now unifies two promo-code sources: 1. plans.Registry.ValidatePromotion (TWITTER15/LAUNCH50 broadcast codes, already wired in PR #47) — checked first. 2. admin_promo_codes table (single-use codes issued by an admin via /api/v1/admin/customers/:team_id/promo, table from PR #48) — fall-through when the plans-yaml side returns "not found". The admin path enforces single-use at validate time (returns promotion_already_used + the new AgentActionPromotionAlreadyUsed sentence) and at the webhook side via UPDATE ... WHERE used_at IS NULL, so two concurrent redemptions can't double-spend a code. Cross-team codes surface as plain "promotion_invalid" — we don't reveal their existence. POST /api/v1/billing/checkout now accepts promotion_code; when the code matches an admin_promo_codes row for the team, the checkout stamps notes.admin_promo_code_id on the Razorpay subscription. The subscription.charged webhook reads that notes key and marks used_at best-effort (fail-open: a redemption miss must not undo the tier upgrade). Tests cover: - plans-yaml regression (PR #47 happy path still works with DB wired) - admin unused/used/expired/cross-team - amount_off and first_month_free kind round-trip - webhook redemption hook (with notes, without notes, redelivery, invalid UUID) - model-level race: two concurrent MarkAdminPromoCodeUsed callers, exactly one wins. Note on the brief's "wrong plan for admin code → invalid": admin codes are scoped by team_id, not plan; admin_promo_codes.applies_to is INTEGER (a percent_off cap in cents per openapi.go), not a tier list. The handler echoes the requested plan back in discount.applies_to so the dashboard renders "applies to <plan>" uniformly, but does not reject by plan. Documented in the validate handler's lookupAdminPromotion comment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…out-fresh billing: validate + redeem admin-issued promo codes alongside plans-yaml codes

Adds three new fields to GET /healthz so a single probe answers both "is my image stale" (commit_id) and "did my migrations apply" (migration_version + count + status). Caches the DB read for 60s per process so readiness-probe traffic stays cheap and the path remains <10ms p99. - New migration 022_schema_migrations.sql tracks applied filenames. db.RunMigrations records each file with ON CONFLICT DO NOTHING after successful exec; record failures are logged but never block startup. - New internal/migrations package: Reader with TTL-cached state, clock injection for deterministic tests, and a fail-open contract (DB errors yield migration_status="unknown" while /healthz stays 200 OK). - OpenAPI HealthResponse schema documents every field. - Tests cover happy path, DB-down → unknown, empty table, cache hit / cache miss after TTL, nil DB, and a sanity rail asserting the DB-reported filename actually exists in the embedded migration set.

…ered gap The 2026-05-20 production incident: every email since launch had been silently rejected at Brevo's SMTP relay because the sender domain was unvalidated. The worker logged classification=success on Brevo's 201 (API-acceptance) and advanced the audit_log cursor. The ledger lied because it confused API-acceptance with delivery. This PR ships the receiver-side machinery that closes the gap: * New endpoint: POST /webhooks/brevo/:secret (api/internal/handlers/brevo_webhook.go) Brevo POSTs every transactional event (delivered, soft_bounce, hard_bounce, blocked, complaint, spam→complaint alias, deferred, unsubscribed, error) to this URL. The handler matches by (provider='brevo', provider_id=<message-id>) and updates forwarder_sent.classification + (for 'delivered') stamps delivered_at. Auth: URL-token compare against BREVO_WEBHOOK_SECRET (Brevo's transactional webhooks don't carry HMAC by default; the URL-token approach works without per-callback signing toggles in the dashboard). * Migration 061 (canonical: worker/sql/061_forwarder_sent_delivery.sql, mirrored at api/internal/db/migrations/061_forwarder_sent_delivery.sql) ADD COLUMN delivered_at TIMESTAMPTZ NULL plus two indexes — one for the (provider, provider_id) lookup the receiver fires on every event, one for the 'delivery ratio' dashboard query. * New Prometheus counter brevo_webhook_events_total{event} with bounded cardinality (closed set: 8 documented event classes + 5 admin labels). Drives the email_delivery_ratio NR alert + the email-delivery NR dashboard. * Registry-iterating coverage test (TestBrevoTxWebhook_EveryDocumentedEventHasHandler) walks brevoDocumentedEvents and asserts every entry has a handler in brevoEventHandlers (CLAUDE.md rule 18). * OpenAPI 3.1 entry for the new endpoint (internal/handlers/openapi.go). * Schema-mirror update for forwarder_sent + delivered_at + the two new indexes (internal/testhelpers/testhelpers.go); ALTER TABLE … ADD COLUMN IF NOT EXISTS keeps reused test DBs in sync. * E2E test (e2e/brevo_webhook_e2e_test.go) — seeds a forwarder_sent row, POSTs a synthetic 'delivered' event, verifies classification flipped to 'delivered' and delivered_at IS NOT NULL. Plus an orphan- messageId arm (200 + matched:false — Brevo retries on non-2xx, we MUST NOT amplify orphan traffic) and a hard_bounce arm. OPERATOR ACTION (Brevo dashboard): once the sender domain is validated, paste https://api.instanode.dev/webhooks/brevo/<BREVO_WEBHOOK_SECRET> into Brevo → Transactional → Settings → Webhook URL, toggle on every documented event class, ensure single-event-per-call is selected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tag) api half of the chaos-drill harness from CHAOS-DRILL-2026-05-20.md: - e2e/propagation_chaos_test.go — synthetic pending_propagations rows, verify exponential backoff, dead-letter at maxAttempts, alert pattern. Lives behind //go:build chaos so it doesn't run in the regular gate. - e2e/lease_recovery_chaos_test.go — drives the worker-side stub job, asserts River re-leases an orphaned job after pod kill. - Makefile — make chaostest-propagation / make chaostest-lease-recovery. Companion: worker chaos_lease_recovery.go + workers.go wiring (separate commit). Findings F1 (P0) + F2 (P1) + F4 (P1) are open follow-ups (tasks #241/242/243). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…esolvable) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The deep readiness check shipped in c5b81de but openapi.go was never extended. TestOpenAPI_CoversAllRegisteredRoutes (the H4 regression gate) caught this and blocked every subsequent Deploy (Brevo webhook, common@04dd59d retrigger, chaos test commits) on master. Adds: - paths./readyz GET with 200/503 responses - components/schemas/ReadinessResponse envelope (overall + service + commit_id + checks[]) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…acks (#119) Adds the next layer up from existing unit tests + chaos drills: Track 1 — Backup/restore integration tests (api/e2e/, build tag integration_backup): - Wraps infra/scripts/restore-drill.sh in Go test scaffolding - Asserts RTO < 5min (postgres) / 3min (mongo), RPO < 30h - Asserts cleanup of throwaway namespace - Pure-parse tests for NR alert aggregation_window + Prom rule 36h/60h thresholds (run anywhere) - .github/workflows/integration-backup.yml: weekly cron + manual dispatch against KUBECONFIG_TEST_CLUSTER; defensive context-name gate to prevent prod runs Track 2 — Brevo webhook full pipeline (api/e2e/, e2e build tag): - Registry-iterating round-trip over every documented Brevo event: delivered/soft_bounce/hard_bounce/blocked/complaint/deferred/ unsubscribed/error - Idempotent re-delivery pins GREATEST(delivered_at) clause - Delivered-then-bounce pins makeClassUpdater contract (no time-travel) - Malformed payload 400 + unhandled event 200 end-to-end Track 3 — Propagation runner integration tests (worker/internal/jobs/, no build tag, runs under regular make gate): - Backoff exact-schedule via markRetry persistence for every position - Dead-letter at maxAttempts via markDeadLettered direct - F2 P1 guard: unknown_kind bounded retries - FOR UPDATE SKIP LOCKED concurrency (TEST_DATABASE_URL-gated) - Enum-vs-handler-map registry walk (TEST_DATABASE_URL-gated) Track 4 — Deep /readyz cross-service (api/e2e/, e2e build tag): - Envelope-shape walk across api/worker/provisioner - Brevo unreachable → 200 degraded (NOT 503) contract - Cache TTL: 50-burst latency assertion - Secret-leak scan (20+ hex chars regex) - P95 response time < 500ms - Criticality-matrix registry walk Track 5 — Cross-track contract test (api/e2e/, no build tag, runs under regular gate when TEST_DATABASE_URL is set): - Walks AuditKind* constants from api/internal/models/audit_kinds.go via source-file regex - Forward: every constant has a consumer spec entry - Reverse: every spec entry refers to a real constant - Emails=true implies Forwards=true (F4 ledger drift class) - Propagation kinds match pending_propagations.kind enum - Forwarder_sent.classification populated for rows > 5min old Companion: INTEGRATION-TESTS-2026-05-20.md (repo-root) lists every new test with the failure mode it catches, per CLAUDE.md rule 17 coverage-block discipline. Per CLAUDE.md rule 18 (registry-iterating regression tests, not hand-typed lists), every Track has at least one registry walk that fails LOUD on additions to the upstream registry without matching downstream wiring. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…overage tests (#121) * fix(p0-3): emit provision.persistence_failed audit + agent_action + coverage tests MR-P0-3 (BugBash 2026-05-20) atomic-provisioning hardening — completion of the orphan-resource prevention work shipped in commit 36eac9b. That commit wired all 18 provisioning callsites through finalizeProvision, which on persistence failure (post-RPC) tears down the backend object (best-effort), soft-deletes the row, and returns errProvisionPersistFailed so handlers map to 503 instead of 201. This commit closes three remaining gaps in that work: 1) **Operator-visible audit row**: emit AuditKindProvisionPersistenceFailed on every persistence-failure path. Operators can now reconstruct the exact moment the platform produced an unreachable resource — by resource_id / resource_type / log_prefix / provider_resource_id / request_id / tier / env. NOT wired into the customer-email forwarder (this is an internal alert, mirrors AuditKindBillingChargeUndeliverable and AuditKindPropagationDeadLettered). Sync emit + WARN slog on failure — must NEVER block the 503 to the customer. 2) **agent_action for provision_failed**: the catch-all 503 envelope used to fall back to AgentActionContactSupport ('email support'). That's wrong for the MR-P0-3 path — the backend object was rolled back and the row soft-deleted, so the right action is 'retry with exponential backoff', not contact support. Adds an explicit codeToAgentAction entry under the U3 contract (opens with 'Tell the user', names the reason, names the action, full https://instanode.dev URL, < 280 chars). 3) **Registry-iterating coverage tests** (CLAUDE.md rule 18): two static guards in provision_atomicity_coverage_test.go scan every production .go file in internal/handlers/ and assert: - every file calling models.CreateResource also calls finalizeProvision (so a new handler can't bypass the two-phase pattern) - every file calling finalizeProvision also routes the error through respondProvisionFailed or twinCoreErr (so a new handler can't swallow the sentinel) Plus TestProvisionFailedHasAgentAction asserts the catch-all code still has explicit agent_action guidance (defends against a future refactor that drops the new entry). Coverage block per CLAUDE.md rule 17: Symptom: 201 returned for resource the platform can't address (orphan backend + NULL conn_url + NULL PRID + billed) Enumeration: rg -l 'models.CreateResource\(' internal/handlers/ yielded db.go, cache.go, nosql.go, queue.go, storage.go, webhook.go, vector.go (each w/ anon + auth + twin paths) Sites found: 7 files × ~3 paths = ~21 provisioning entry points Sites touched: none added — every site already calls finalizeProvision per commit 36eac9b. This commit adds a coverage TEST that fails the build if a new site bypasses the pattern. Coverage test: TestEveryCreateResourceCallSiteIsFollowedByFinalizeProvision + TestEveryFinalizeProvisionCallSiteRespondsProvisionFailedOnError + TestProvisionFailedHasAgentAction Live verified: (this commit; pending CI green + deploy) Per-handler audit findings: - internal/handlers/db.go: already atomic (finalizeProvision at L265, L432, L642) - internal/handlers/cache.go: already atomic (finalizeProvision at L232, L387, L564) - internal/handlers/nosql.go: already atomic (finalizeProvision at L228, L380, L569) - internal/handlers/queue.go: already atomic (finalizeProvision at L306, L525) - internal/handlers/storage.go: already atomic (finalizeProvision at L335, L525) - internal/handlers/webhook.go: already atomic (finalizeProvision at L327, L444 — cleanup=nil ok, no backend object) - internal/handlers/vector.go: already atomic (finalizeProvision at L370, L522) - internal/handlers/deploy.go: DIFFERENT MODEL — async (202 not 201); row persisted before async build runs; failures update status. Not P0-3 shape. - internal/handlers/stack.go: DIFFERENT MODEL — async (202 not 201); same shape as deploy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(p0-3): swap provision_failed fixture for db_error in two 5xx-fallback tests MR-P0-3 (BugBash 2026-05-20) added an explicit codeToAgentAction entry for `provision_failed` so the catch-all 503 instructs the agent to retry with exponential backoff, not contact support. Two pre-existing tests used `provision_failed` as their "5xx code with no registry entry" fixture and fail under the new contract: - TestRespondError_UnknownCode_5xx_FallsBackToContactSupport - TestErrorEnvelope_503_AllFieldsAndHeader Swap both to use `db_error` — documented in helpers.go's curation principles (lines ~87-89) as one of the 'pure plumbing errors deliberately omitted' from codeToAgentAction. Same shape, same status code, same fallback branch — the test contract is preserved while the production contract gets sharper guidance for the actual MR-P0-3 path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(p0-3): register provision.persistence_failed in auditConsumerSpec The cross-track reliability contract test in e2e/reliability_contract_test.go asserts every AuditKind* constant has a consumer-spec entry. Adding the new constant in this PR without the spec entry tripped that gate in CI. Add entry with IntentionallyNoConsumer=true — mirrors the existing billing.charge_undeliverable and propagation.dead_lettered shape: this is an internal operator alert that intentionally has no customer email and no propagation wiring (a customer whose resource was just torn down by the platform deserves a human follow-up, not an automated template). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(readyz): add MarkDraining short-circuit for graceful shutdown (MR-P0-7) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(router): NewWithHooks returns ShutdownHooks for graceful shutdown (MR-P0-7) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(main): wire readiness-drain into graceful shutdown sequence (MR-P0-7) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(graceful_shutdown): drain-flag + timeout regression guards (MR-P0-7) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(readyz): drain short-circuit returns 503 + shutting_down check (MR-P0-7) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

These two paid tiers ($19/mo Hobby Plus, $99/mo Growth) appear on /api/v1/capabilities (so agents introspecting tiers see the full ladder) but are intentionally absent from the public /pricing page and marketing homepage. The pattern is documented in code but not in plans.yaml — this adds explanatory comment blocks at each tier so future agents don't try to "fix" the perceived inconsistency by surfacing them on /pricing. Also documents the Growth postgres_connections=20 / Pro postgres_connections=20 overlap: not a tier-ladder break, intentional — Pro is the highest single-DB connection count under shared infra, further headroom is a Team (dedicated) benefit. Coverage block (per CLAUDE.md rule 17): Symptom: no comment explained why hobby_plus + growth are in the registry but invisible on /pricing Enumeration: grep -rn 'hobby_plus\|growth' instanode-web/src/pages/PricingPage.tsx api/plans.yaml content/llms.txt dashboard/src/components/PricingGrid.tsx Sites found: 2 (plans.yaml hobby_plus + growth blocks) Sites touched: 2 Coverage test: none — comment-only. The intent is in lockstep with existing TIERS array in instanode-web PricingPage.tsx (which already documents the decision client-side). Live verified: /api/v1/capabilities continues to return both tiers; /pricing continues to hide them. No behavior change. Closes P1 from DOC-REALITY-DELTA-2026-05-20.md §2 + P3 (growth connections note). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ine 383 ROOT CAUSE. Line 383 was `require.False(t, strings.HasPrefix(legacy, "v"))`, asserting that the output of `crypto.Encrypt(keyV2, "legacy")` does not start with the letter 'v'. But `Encrypt` returns `base64.URLEncoding.EncodeToString(nonce||ct||tag)` — base64-url's alphabet includes the byte 'v', so ~1.6% of outputs legitimately start with 'v' purely from the random AES-GCM nonce. The test was inherently flaky on every run, and CI tripped it on bd97fab. Empirical measurement (100 000 runs of Encrypt(keyV2, "legacy")): starts with 'v': 1561/100000 (1.56%) matches "vN." pattern: 0 The author's intent was "the legacy envelope must not match the structural 'vN.' version-marker pattern written by EncryptVersioned" — the same shape `crypto.splitVersionedEnvelope` checks for. The assertion that mattered is the 3-byte marker (lowercase 'v', ASCII digit '1'..'9', literal '.'), not "no leading v". FIX. Add a local `isVersionMarker` helper that mirrors `splitVersionedEnvelope` and assert the structural shape. New assertion passes 10 000/10 000 runs. Old assertion would fail ~156/10 000. `crypto.Encrypt` is unchanged. There is no test-order pollution and no crypto-package state mutation — the bug is in the test's assertion only. Coverage block (CLAUDE.md rule 17): Symptom: TestWave3P2_AESKeyring_RoundTripsAcrossVersions line 383 fails ~1.6%/run with "legacy envelope must not carry a version prefix" Enumeration: rg -F 'strings.HasPrefix' internal/handlers/wave3_p2_test.go rg -F 'crypto.Encrypt(' internal/handlers/ Sites found: 1 (this assertion) — the other HasPrefix uses match EncryptVersioned output, which deterministically starts with "vN." by construction Sites touched: 1 Coverage test: TestWave3P2_AESKeyring_RoundTripsAcrossVersions itself, run -count=10000 (verified locally, 0 fails) Regression class guarded: probabilistic-assertion-on-base64 — any test that asserts on a leading byte of base64(random) output is the same bug class. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…t no-op Two contract-drift P0s flagged in the bug-hunt brief, both shipped in the same PR because they share the test gate. B13-P0-F1: POST /auth/cli auth_url leaked dead-brand * Old: frontendURL() returned "https://instant.dev" in production — instant.dev is the dead-brand marketing host (404s). An agent following the auth_url landed on a parking page and gave up. * Fix: read cfg.DashboardBaseURL (DASHBOARD_BASE_URL env var); fall back to "https://instanode.dev" when DASHBOARD_BASE_URL is unset in production. internal/handlers/cli_auth.go. * Coverage: TestAuth_CLI_ReturnsInstanodeDomain (4 cases — explicit base, empty-fallback, dev default, trailing-slash strip) + TestAuth_CLI_NoLegacyInstantDevInResponses (rule-17 source-scan that fails if a new handler emits instant.dev/<x> anywhere). * Also cleaned: internal/handlers/logs.go (error message URL), internal/handlers/openapi.go (storage endpoint example). B7-P0-1: PATCH /stacks/:slug/env was a silent no-op * Old: handler logged stack.env.noted, returned 200, persisted nothing — the next /stacks/:slug/redeploy rebuilt with stale env. Silent-data-loss failure mode. * Fix (Option A from the brief): migration 062 adds stacks.env_vars JSONB DEFAULT '{}'::jsonb. Handler now loads existing env, merges (empty-string value deletes), validates every key against isValidEnvKey (POSIX [A-Z_][A-Z0-9_]*), persists, emits stack.env.updated audit_log row, and returns the full merged map. * 64KiB cap on serialized payload (ErrStackEnvVarsTooLarge → 413). * OpenAPI updated to describe PATCH semantics + new 409/413 responses + the merged-env response shape. * Coverage: TestStack_PatchEnv_PersistsAndReturns (7 subtests — auth, first patch, merge, delete-via-empty, invalid_env_key, missing_env, 404 unknown slug). All assertions verify the DB round-trip — pre-fix the handler returned 200 but persisted nothing, which is exactly what subtest #2 now fails on. Coverage block per CLAUDE.md rule 17: Symptom: instant.dev/<x> in handler-emitted strings Enumeration: rg -nF "instant.dev/" internal/handlers/ --type go Sites found: 3 (cli_auth.go:339, logs.go:136, openapi.go:2784) Sites touched: 3 — all three migrated to instanode.dev Coverage test: TestAuth_CLI_NoLegacyInstantDevInResponses scans every .go file under internal/handlers/ and fails if any non-test, non-comment, non-label, non-import-path mention of instant.dev/<x> appears. Live verified: pending — see follow-up commit after deploy Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ok (2026-05-20) B5-P0: POST /claim now validates the email field via mail.ParseAddress + length cap (RFC 5321 § 4.5.3.1.3, 254 chars) + dotted-domain + no-inner- whitespace gates. Previously ANY string was accepted, minting users rows whose magic-link callback could never reach the inbox. Returns 400 with error=invalid_email_format and a curated agent_action string. Covered by TestClaim_EmailValidation_Coverage (6 cases: valid, missing-@, dotless TLD, empty, inner-space, > 254 chars). B5-P1: /claim's three-name drift (request field 'jwt', error message 'jwt field is required', agent_action 'INSTANODE_TOKEN' — three names for one field) collapsed to canonical 'token'. The 'jwt' field name is still accepted as a deprecated alias for backward compatibility with the dashboard, sdk-go, mcp, and existing curl recipes — when both are present, 'token' wins. OpenAPI spec updated to mark jwt deprecated and add token + missing_email + invalid_email_format docs. The missing_token error envelope now uses respondErrorWithAgentAction so the agent_action says 'POST /claim requires a token field carrying the onboarding token...' instead of the auth-context default 'INSTANODE_TOKEN is missing or invalid'. B11-P1 (rows-affected): models.UpgradeTeamAllTiersWithSubscription now checks res.RowsAffected(). If the team UPDATE matches 0 rows (synthetic webhook, deleted-team race, typo'd notes.team_id) the function returns ErrTeamNotFound. The Razorpay webhook handler maps that error to HTTP 404 via the new webhookErrorStatus helper (was silent 200). Razorpay treats 4xx as non-retryable so the dead event doesn't loop forever, and the existing deleteRazorpayWebhookClaim path releases the dedup row so a corrected event can still process later. Covered by TestBilling_UpgradeTeam_RowsAffected. B11-P1 (recipient resolution): handlePaymentFailed no longer trusts payload.payment.entity.email — anyone with the Razorpay webhook secret could synthesize a payload with email=victim@evil.com and fanout dunning notifications (Brevo treats payment-failed as transactional and bypasses unsubscribe). The new resolveTeamFromPayment walks payment.notes.team_id → payment.subscription_id → event.Payload .Subscription notes/id in priority order, then the new models.GetPrimaryUserByTeamID returns the team's primary user; the dunning email is sent to THAT address. Payload-supplied email is now log-only (with a per-event mismatch warning when it differs from the resolved address). If no team is resolvable, the email is DROPPED rather than mis-delivered. Covered by TestBilling_PaymentFailed_RecipientResolution (asserts an attacker- controlled payload.email is NOT the send recipient). Coverage block (per CLAUDE.md rule 17): Symptom: 4 cases — (1) /claim accepts any email string; (2) jwt vs token vs INSTANODE_TOKEN field-name drift; (3) UpgradeTeamAllTiers swallows 0-rows-affected; (4) payment.failed trusts payload.email verbatim Enumeration: rg -F 'body.JWT' / rg 'UpgradeTeamAllTiers' / rg 'pay.Email' / rg 'mail.ParseAddress' / rg 'is_primary = true' / rg 'jwt field is required' / rg 'INSTANODE_TOKEN' (within /claim context) Sites found: onboarding.go (Claim handler + ClaimRequest struct + agent_action wiring); openapi.go (ClaimRequest schema + /claim 400 response); team.go (UpgradeTeamAllTiersWithSubscription rows-affected + new GetPrimaryUserByTeamID); billing.go (handlePaymentFailed + new resolveTeamFromPayment + new webhookErrorStatus + rzpPaymentEntity Notes/ SubscriptionID/OrderID + 2 switch-block sites). Sites touched: All. Each fix is mechanically traceable to a named code location with a B5-P0 / B5-P1 / B11-P1 comment. Coverage test: TestClaim_EmailValidation_Coverage (table-driven, 6 cases + 3 token/jwt field-name sub-tests); TestBilling_UpgradeTeam_RowsAffected (synthetic bogus team_id → assert 404 with error=team_not_found); TestBilling_PaymentFailed_RecipientResolution (attacker payload.email + team primary user → assert send goes to primary, NOT attacker). Live verified: pending push + Deploy + curl + /healthz commit_id match — see PR description. Pre-existing failures (NOT introduced by this commit, verified on origin/master HEAD without these changes): - internal/handlers TestBulkTwin_QuotaPartialFill (quota fixture) - internal/models TestGetExpiredDeployments_ReturnsOnlyExpiredNonPermanent - internal/models TestLinkGitHubID (test data leak) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…sUpgradeJWT passes locally; investigating CI-only flake Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…cal token + deprecated jwt The Bbug-bash sibling renamed ClaimRequest.jwt → ClaimRequest.token (canonical) at 30f8f96. This test still asserted only on the deprecated jwt field's description. CI was failing on it. Now checks BOTH fields' descriptions mention upgrade_jwt: - token (canonical) — load-bearing for new callers reading the spec - jwt (deprecated alias) — load-bearing for old callers still using INSTANODE_TOKEN Local was passing because of map iteration order luck; CI's race-flagged run was hitting nil on props["jwt"] when its sibling test fired earlier and stripped state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…nonical token field only 30f8f96 (B5/B11 fix) renamed ClaimRequest.jwt → ClaimRequest.token (canonical) and intentionally trimmed jwt's description to discourage further use. The test still required jwt's description to mention 'upgrade_jwt' — that's exactly the doc gravity the rename meant to remove. Now: assert token (canonical) description mentions upgrade_jwt, AND assert jwt field still EXISTS in the schema for backward compat, but don't require its description to repeat the cross-reference. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ling/middleware/resource Sweep of P1/P2 api-repo findings from BUGBASH-2026-05-20/MASTER-LEDGER.md still open after today's earlier P0 waves (B5-P0, B7-P0-1, B11-F2/F4/F5, B13-F1, B17-STORAGE-P0-1..4 all already shipped or in flight). Each fix includes a regression test where one fit cleanly. ————————————————————————————————————————————————————————————————————————————— B4-F1 POST /auth/email/start per-email RL silent absorption internal/metrics/metrics.go: add instant_magic_link_email_rate_limited_total counter (Prometheus). internal/handlers/magic_link.go: increment counter + WARN log on every rate-limited absorption — operator now sees the abuse pattern in NR despite the user-visible 202 staying identical to the success path. B4-F2 emailRateLimitKey sha256[:8] birthday-collision risk internal/handlers/magic_link.go:88-89: full sha256 hex (64 chars) instead of h[:8] (8 hex chars). 2^32 → 2^128 collision space. internal/handlers/magic_link_test.go: TestEmailRateLimitKey_FullHashFingerprint pins the full-hash suffix length so a re-truncation regression fails CI. B4-F4 RFC 5321 local-part >64 chars not gated internal/handlers/magic_link.go:looksLikeEmail: reject local-part > 64 octets per §4.5.3.1.1 (guaranteed-undeliverable). internal/handlers/magic_link_test.go: TestLooksLikeEmail_LocalPartCap table-driven (under/at/over/way-over cap). B4-F5 10MB JSON body accepted on /auth/email/start internal/handlers/magic_link.go: 1 KiB body cap on POST /auth/email/start (real bodies are ~80 bytes; global Fiber BodyLimit = 50 MiB for /deploy/new tarballs is way too generous for a 2-field JSON envelope). B4-F7 / B5-P1-1 / B10-P1-3 / B13-F6 internal/handlers/helpers.go: extend codeToAgentAction with three missing entries — invalid_email, invalid_email_format, provision_limit_reached. The 429 provision-limit envelope is now carrying agent_action + upgrade_url to the agent (CLAUDE.md convention #6 promised these, agents were getting nothing). B11-F1 8 Razorpay events fell to default 200 internal/handlers/billing.go: handle subscription.deauthenticated (treat as cancel — mandate revoked, cannot charge), subscription.updated (route to handleSubscriptionCharged — idempotent re-resolve of tier), refund.processed (info-log + span attribute, no tier change). The remaining default-200 falls (payment.captured/authorized, order.paid, invoice.paid, subscription.authenticated) are intentionally informational — Razorpay sends them but tier state is already correct from the corresponding subscription.* event. B18-M1 empty Idempotency-Key header silently coerced to fingerprint path internal/middleware/idempotency.go: detect "header present but value empty/blank" via raw c.Request().Header.Peek() (Fiber's c.Get() returns "" for both omitted-and-present-empty). Reject with 400 invalid_idempotency_key — matches the OpenAPI contract for malformed keys. B20-P2-1 soft-deleted resources still surfaced via GET /api/v1/resources/:id internal/handlers/resource.go:Get: 404 when resource.Status == "deleted". Narrow surgical change to the read path only — the DELETE path itself still needs to fetch the row to soft-delete it, so leaving GetResourceByToken untouched. The customer-visible contract is "the resource is gone after DELETE"; this enforces it on the GET. ————————————————————————————————————————————————————————————————————————————— Coverage block per CLAUDE.md rule 17: Symptom: per-bug, see ledger references in code comments Enumeration: grep -rn for each token, see code-side comments Sites found: 1 each (single-site fixes by design) Sites touched: 1 each Coverage tests: TestEmailRateLimitKey_FullHashFingerprint, TestLooksLikeEmail_LocalPartCap, TestAgentActionContract (covers all three new codeToAgentAction entries automatically — map iteration) Live verified: pending — /healthz commit_id check post-deploy Gate: `cd api && make gate`-equivalent (go test ./... -short -count=1 -p 1) green except the 4 pre-existing flakes (TestAdminList_*, TestDBNew_*, TestBulkTwin_*, TestGetExpiredDeployments_*) explicitly listed in the task brief as known-acceptable. All other 30+ packages pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…action + AuditKind Adds three registry-iterating regression tests per CLAUDE.md rule 18: 1. TestRazorpayWebhook_EveryEventBranchHasACoverageTest — enumerates the case "<event>": arms in billing.go's Razorpay dispatcher and asserts every one has at least one named coverage test. The 2026-05-20 deauthenticated/updated/refund.processed branches landed without dedicated tests; this gate stops that pattern from re-occurring. Includes anchor tests for the previously-uncovered halted/completed/paused/resumed/deauthenticated/updated/refund.processed branches so the registry-iterator has names to refer to. 2. TestCodeToAgentAction_NoOrphans — reverse of the forward test in agent_action_contract_test.go. Asserts every entry in codeToAgentAction is referenced by handlers/middleware code. An orphan entry = handler renamed the code (real bug) or the entry is dead (delete it). Scans both internal/handlers/ and the sibling internal/middleware/ to catch codes emitted from middleware (dpop, auth). 3. TestAuditKindConstants_EveryConstantIsEmittedSomewhere — walks every AuditKind* constant in internal/models/audit_kinds.go and asserts each is referenced by ≥1 non-test api source file outside the declaration site. 13 constants are intentionally declared on the api side for cross-repo reference (the worker is the actual emit site for the propagation lifecycle / deploy TTL expirer / orphan sweep / quota suspend / etc.) — these are documented in the crossRepoOnly allow-list with per-constant justifications. Each test carries the rule 17 coverage block (symptom / enumeration / sites found / sites touched / coverage test / live verified) in its docstring. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…/B18 Closes a batch of P2/P3 envelope-contract and ordering issues identified by the B9 (provisioning), B10 (auth/ratelimit/quota), and B18 (input fuzz) bug-bash reports. All fixes carry inline CLAUDE.md rule-17 coverage notes. Fixes: 1. **B18 M4** — `POST /storage/:token/presign` validates body before checking token existence. Pre-fix, a random UUID returned `invalid_operation` (400) before the existence check fired. Reordered: token parse → resource lookup → body-shape validation. Closes information-flow risk if validators ever loosen. File: internal/handlers/storage_presign.go 2. **B18 M2** — Remove silent 120-byte truncation in sanitizeName. The authoritative length bound was already requireName's 64-rune gate; the second silent cap created a latent footgun if the name regex ever loosens to allow multi-byte runes. Updated regression test for the single-gate contract. Files: internal/handlers/provision_helper.go, provision_helper_test.go 3. **B18 M3** — Document the intentional UUID-shape-before-auth ordering on `GET /api/v1/webhooks/:token/requests`. The webhook token is a public-by-design capability (lands in HTTP headers/logs/outbound URLs); "well-formed-but-unknown" is not an oracle leak. Doc-only comment so future refactors preserve the intent. File: internal/handlers/webhook.go 4. **B18 L1** — Surface `X-Instant-Notice: name_normalized` header when sanitizeName mutates the request name (CRLF / tab / NUL / HTML-special chars stripped). Pre-fix the mutation was silent — agents looking up "db_for_user\n" later by exact name would never find the persisted "db_for_user". Header-only signal; does NOT fail the request (the strip is a deliberate hardening on top of the regex). File: internal/handlers/provision_helper.go 5. **B18 L2** — `parseProvisionBody` returns 415 `unsupported_media_type` when the request carries an explicit non-JSON Content-Type (application/xml etc.). Pre-fix, sending XML with `Content-Type: application/xml` returned 400 `name_required` — a misleading code that cost the caller one extra debugging cycle. The OpenAPI spec only declares `application/json`; 415 is the RFC-correct status. File: internal/handlers/provision_helper.go 6. **B10 P2-3** — Razorpay webhook invalid-signature envelope hydrated with the canonical ErrorResponse shape. Pre-fix, signature failures returned `{ok:false,error:"invalid_signature"}` with no request_id, message, retry_after_seconds, or agent_action. Razorpay support always asks for the request_id when a webhook fails. Same hydration applied to the invalid_payload path. File: internal/handlers/billing.go 7. **B10 P2-4** — Add `WWW-Authenticate: Bearer realm="instanode"` to every 401 from respondUnauthorized. RFC 6750 §3 requires this header on every 401 from a Bearer-protected resource. Pre-fix only the audience-mismatch path emitted it. OAuth-aware clients and HTTP debugging tools look for it. File: internal/middleware/auth.go Gate (matches CI/deploy.yml): - `go build ./...` — green - `go vet ./...` — green - `go test ./... -short -count=1 -p 1` — green on every modified package; pre-existing failures (12 in handlers + 2 in models + 3 B13 contract tests) verified unchanged-against-master by stashing the patch and re-running the same suite. All pre-existing flakes documented in CLAUDE.md "Known Design Gaps". Skipped (already shipped today, per brief): - AESKeyring (a3155a5), B5/B11/B13/B7 (0c7991c), presign middleware (PR #122, not yet on master), 768c0ca's 8 fixes. Coverage: - B19 finding #1 (presign middleware) — already shipped in PR #122; this PR does not duplicate. - B19 finding #4 (lease-recovery RTO) — worker-side, tracked as task #245. - B9 P3-F8 (X-RateLimit-Remaining: 0 on success) — investigated; the math in rate_limit.go is correct (limit-count). The reported 0-on-success is not reproducible from the code path; left for in-prod re-probe after this lands. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ent audit_id soft FK Workstream 1 of the consolidated wave-3 api agent. Security hardening (B18 BugBash 2026-05-20): - New middleware/security_headers.go wires HSTS (prod only, 6mo includeSubDomains), Permissions-Policy (deny every powerful API), Referrer-Policy (strict-origin-when-cross-origin), X-Content-Type-Options: nosniff. Wired BEFORE RequestID in router.go so headers land on every Fiber-served path including livez/healthz/404/405. - B18-L1 slog emit: provision_helper.requireName now logs "provision.name_normalized" with raw_len + clean_len byte counts (never the raw values — they may contain CRLF/NUL that triggered the strip) whenever sanitizeName mutated the input. - B18-M2 (silent 120-byte truncation removed), B18-M3 (UUID-before-auth ordering documented as intentional — webhook tokens are public-by-design), B18-M4 (storage presign existence-before-operation), and B18-L2 (415 on non-JSON Content-Type) were verified already shipped in 1d3ddf0. Webhook auth hardening: - New audit kinds AuditKindBrevoWebhookUnauthorized + AuditKindRazorpayWebhookUnauthorized. - Brevo webhook 401 path: emits audit_log via safego.Go (best-effort, never blocks the 401), metadata carries presence booleans + masked source-IP subnet ONLY (never the secret or raw IP). Body migrated to canonical respondError envelope (B13-F7). - Razorpay webhook 400 path: same audit emit + canonical respondErrorWithAgentAction envelope (B13-F8). - Brevo invalid_payload + payload_too_large paths also migrated to canonical envelope (B13-F7 sweep). Forwarder-sent audit_id soft FK: - Migration 063 adds COMMENT ON COLUMN documenting that audit_id is USUALLY an audit_log.id but MAY be a placeholder (`reminder-<resource_id>-<stage>`, `provider-<grace_id>`) from legacy emit sites. Creates a partial UUID-shape regex index for the orphan-reconciler join path. Strict FK is impossible because the column is TEXT with placeholder values; the soft-link contract is now schema-documented. - testhelpers schema mirror updated to include the partial index so handler tests on bare Postgres see the same index shape as prod. New files: - internal/middleware/security_headers.go - internal/handlers/webhook_security_helpers.go (maskSourceIP for v4 /24 + v6 /48) - internal/db/migrations/063_forwarder_sent_audit_link.sql Coverage block (per CLAUDE.md rule 17): - Symptom: defense-in-depth headers + audit visibility on webhook auth fails + soft FK on a TEXT-with-placeholders column - Enumeration: rg "HSTS|Strict-Transport-Security|Permissions-Policy" → 0 hits pre-fix; rg "verifyRazorpaySignature|brevoSecretURLParam.*ConstantTimeCompare" → 2 unauthorized branches - Sites found: 2 webhook 401-emit sites, 0 existing security-header sites, 0 existing forwarder_sent FK - Sites touched: 2 webhook handlers + 1 new middleware + 1 new migration + schema mirror - Coverage test: models.AuditKind* constants exist; build + vet green; handler tests TestBrevo*/TestRazorpay*/TestProvision* pass - Live verified: next commit deploys to prod; curl /healthz + curl /webhooks/brevo/bogus to follow Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…icate on 401 + registry-iterating coverage test Workstream 2 of the consolidated wave-3 api agent. agent_action coverage: - AST walk of every respondError* call site (rg + go/parser) surfaced 227 unique emitted error codes. Pre-wave3 registry covered 38. After this commit: 248 entries — every emit site has a domain-specific sentence (or is in an explicit allowlist for regex artefacts). - Each entry follows the U3 contract: opens with "Tell the user", names the concrete failure, names the next action, contains a full https://instanode.dev URL, stays under 280 chars. - Categories added: missing-required-field (20 codes), invalid-format/value (50 codes), not-found / gone (12 codes), conflict / state (20 codes), permission / authn (10 codes), billing-specific (5 codes), Razorpay passthroughs (1 code), 5xx domain-specific plumbing (50 codes), 429 rate-limited (1 code), coverage-test-discovered codes (15 codes). WWW-Authenticate header on 401: - New setSecurityHeadersFor401 helper. Called from respondError / respondErrorWithAgentAction / respondErrorWithRetry / respondRecycleGate so EVERY 401 envelope gets `WWW-Authenticate: Bearer realm="instanode"` per RFC 7235 §4.1. No-op when the DPoP middleware has already set its own (richer) challenge header on DPoP-required routes. Registry-iterating coverage test (CLAUDE.md rule 18): - New internal/handlers/error_envelope_coverage_test.go::TestErrorCode_HasAgentAction walks every respondError* call site in handlers/ via go/ast, extracts the literal error-code string, asserts each one is in either codeToAgentAction or coverageAllowlist. A future PR that adds a new respondError site without a registry entry will fail this test — closing the silent-drift door. Files changed: - internal/handlers/helpers.go (+~210 entries + WWW-Authenticate helper) - internal/handlers/error_envelope_coverage_test.go (new) Coverage block (per CLAUDE.md rule 17): - Symptom: ~160 emitted 4xx/5xx codes had no agent_action; agent callers got empty agent_action on 4xx and generic "email support" on 5xx - Enumeration: rg + go/parser AST walk of internal/handlers/*.go respondError* calls → 227 unique codes - Sites found: 227 distinct error codes - Sites touched: 210 added to registry, 2 in allowlist (regex artefacts) - Coverage test: TestErrorCode_HasAgentAction (registry-iterating, AST-based) - Live verified: next commit deploys to prod; curl on any 4xx route will show agent_action populated; HEAD on /api/v1/* will show WWW-Authenticate on the 401 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…t_apps inversion Workstream 3 of the consolidated wave-3 api agent. B6-P3 — Growth tier deployment_apps inversion (BugBash 2026-05-20): - plans.yaml growth.deployments_apps: 5 → 50. Pre-fix Growth ($99/mo) was strictly below Pro ($49/mo) on this customer-facing dimension — a tier-ladder inversion that violated the published "Growth = more of everything Pro has, on shared infra" promise. - 50 keeps a clear gap above Pro's 10 without leaping to Team's unlimited (-1); 50 is the realistic ceiling under shared k8s per Growth tenant before per-namespace overhead matters. - Synchronised with common/plans defaultYAML in a separate commit on the common repo (cc97d4f) per CLAUDE.md rule 22 (contract changes touch all surfaces in one PR). Registry-iterating pinning tests (CLAUDE.md rule 18): - New internal/plans/tier_ladder_invariants_test.go with 4 tests: - TestTierLadder_GrowthBeatsPro — iterates 8 numeric dimensions, fails on any inversion (treats -1 as +inf) - TestTierLadder_HobbyPlusBeatsHobby — same shape for hobby_plus - TestTierLadder_PaidTiersHaveDeployments — every paid tier needs >0 deployments_apps - TestPlansYAML_B6P3_GrowthDeploymentsAppsAbovePro — literal pinning of the B6-P3 finding - All four tests load api/plans.yaml directly (via runtime.Caller + filepath.Join), NOT the embedded defaultYAML, so the api repo's source-of-truth YAML is what's validated. B13 cosmetic drifts (F7, F8) — already applied in workstream 1 (Brevo + Razorpay webhooks now use canonical respondError envelope via the new audit-emit + canonical-envelope migration). The non-canonical 4xx envelopes are gone from those two paths; this commit captures the plans-side fix that needed its own commit boundary. Coverage block (per CLAUDE.md rule 17): - Symptom: Growth deployments_apps=5 < Pro deployments_apps=10 (tier-ladder inversion) - Enumeration: grep -n deployments_apps plans.yaml → 1 affected row - Sites found: 1 in api/plans.yaml + 1 mirror in common/plans defaultYAML - Sites touched: 2 (this commit + cc97d4f in common) - Coverage test: TestPlansYAML_B6P3_GrowthDeploymentsAppsAbovePro + TestTierLadder_GrowthBeatsPro (iterates 8 dimensions) - Live verified: next commit deploys to prod; curl /api/v1/capabilities will show growth.deployments_apps=50 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The wave-3 agent_action registry now maps db_error, internal_error, and invalid_payload to domain-specific sentences (see helpers.go). Three existing envelope-shape tests were pinning the pre-wave3 fallback behaviour: - TestErrorEnvelope_503_AllFieldsAndHeader (db_error 503) — was asserting AgentActionContactSupport literal; now asserts non-empty agent_action containing "transient" (the registry sentence's signature). - TestErrorEnvelope_400_NullRetryAfter_NoHeader (invalid_payload 400) — was asserting agent_action absent; now asserts the registry sentence contains "request body could not be parsed". - TestErrorEnvelope_500_NoRetryAfter (internal_error 500) — was asserting AgentActionContactSupport literal; now asserts non-empty agent_action containing "support@instanode.dev" (both the registry sentence and the fallback name the support email). The TestErrorCode_HasAgentAction registry-iterating coverage test still asserts the underlying invariant: every emitted code has an entry OR allowlist row. The fallback path is still reachable for codes outside both — it's just not the test target any more. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

After wave-3's 210-entry agent_action expansion, two test families needed updates to match the new contract surface: 1. TestAgentActionContract (agent_action_contract_test.go): - Extended actionVerbs[] to include legitimate next-action verbs surfaced by wave-3 entries: Restart, Re-subscribe, Re-open, Re-issue, Re-enter, Refresh, POST, GET, Request, Add, Trim, Pick, Specify, Re-deploy, Resume, Verify, Shorten, Shrink, Open, Disconnect, Replace, Promote, Start, Email support, Apply, Supply, "Each binding" (diagnostic), "See ", "No action needed" (the intentional no-op states for tier_unchanged / same_plan). - Patched ~15 helpers.go entries that referenced support@instanode.dev without the canonical https://instanode.dev/ URL — every entry now includes the canonical pricing/status/support/docs URL per the U3 contract. 2. TestReliability_AuditKinds_EveryConstantHasConsumerSpec (e2e/reliability_contract_test.go): - Added auditConsumerSpec entries for the two new B18 audit kinds: webhook.brevo.unauthorized + webhook.razorpay.unauthorized. Both marked IntentionallyNoConsumer (operator-alert kinds, not customer emails). The underlying invariants (verb presence + canonical URL + spec coverage) still hold — these commits only sync the test expectations with the expanded registry surface. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Wave-3 chaos verify P3 (2026-05-21) cosmetic finding: a signed Razorpay webhook whose notes.team_id (or subscription_id fallback) references a non-existent team returned 404 to Razorpay but left no audit_log row. Operators had to grep NR for the slog line and a burst against the path raised no signal at all. Counterpart to webhook.razorpay.unauthorized (signature-failed audit kind added in task #310): this kind fires AFTER signature verification passes but before any team-touching mutation lands, so a dashboard-typo / deleted-team race / synthetic chaos probe / valid-signature attacker probe all surface in a single operator-only audit row + Prometheus counter + WARN-tier NR alert. Touched surfaces: - models/audit_kinds.go: AuditKindRazorpayWebhookTeamNotFound constant. - handlers/billing.go: webhookErrorStatus signature extended with (*BillingHandler, rzpWebhookEvent); ErrTeamNotFound branch bumps the Prom counter + emits audit row + WARN-level slog via safego.Go with a 3s bounded-timeout context (CLAUDE.md rule 16 — NEVER context.Background). Metadata: event_type, event_id, notes_team_id (UUID, no PII), subscription_id, masked source-IP subnet. NO email/payload PII. - metrics/metrics.go: razorpay_webhook_team_not_found_total counter. - e2e/reliability_contract_test.go: auditConsumerSpec entry with IntentionallyNoConsumer=true (operator alert, no customer email). - infra/newrelic/alerts/razorpay-webhook-team-not-found.json: P2 WARN alert at >5/h, no CRITICAL — per brief, informational not page-worthy. - handlers/billing_webhook_team_not_found_test.go: TestRazorpayWebhook_TeamNotFound_EmitsAudit — full live path with valid signature + bogus team_id, asserts 404 + audit row + counter +1. Coverage block: Symptom: webhook.razorpay.team_not_found has no audit_log row Enumeration: rg -n 'webhookErrorStatus\(|ErrTeamNotFound' internal/ Sites found: 3 webhookErrorStatus callers + 1 definition + N consumers Sites touched: all 3 callers updated; new helper emits at the central ErrTeamNotFound branch in webhookErrorStatus Coverage test: TestRazorpayWebhook_TeamNotFound_EmitsAudit + TestReliability_AuditKinds_EveryConstantHasConsumerSpec (registry walk — adds the new constant automatically) Live verified: pending CI Deploy + post-deploy synthetic POST per brief's verify-live step (rule 13) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…erify 50-concurrent /db/new burst exhausted the shared DigitalOcean Managed Postgres user-connection ceiling and took down worker's event_email_forwarder with "remaining connection slots are reserved for non-replication superuser connections". Pool was pinned at 25/10 with handlers holding connections through the full sync provisioner gRPC round-trip. Changes: - internal/db/postgres.go: env-tunable pool sizing — API_PG_MAX_OPEN_CONNS (default 15, was 25), API_PG_MAX_IDLE_CONNS (5), API_PG_CONN_MAX_LIFETIME (4m), API_PG_CONN_MAX_IDLE_TIME (90s). Operator can raise without redeploy when DO Managed Postgres tier is bumped. - internal/db/pool_metrics.go: StartPoolStatsExporter ticks sql.DBStats every 5s and pushes onto instant_pg_pool_{in_use,idle,open,max,wait_count, wait_duration_seconds} Prometheus gauges. Closes the observability gap that made the chaos verify finding only visible via downstream worker errors. - internal/metrics/metrics.go: declares the new gauges with bounded cardinality (one label: pool name). - main.go: launches the exporter immediately after ConnectPostgres so the gauges have values before the first Prom scrape. - internal/db/pool_metrics_test.go: regression contract. - TestPublishStats_RoundTripsAllFields: every DBStats field surfaces. - TestStartPoolStatsExporter_ContextCancellation: no goroutine leak. - TestStartPoolStatsExporter_NilPoolSafe: no panic on nil pool. - TestEnvInt/EnvDuration_FallsBackOnBadValues: typo'd env var can't silently disable the ceiling. - TestPoolBurst_DoesNotStarveLastCaller: burst-safe contract, gated on INSTANT_TEST_POOL_BURST=1 (CI can opt in; default-skipped to keep `go test` hermetic). Live 50-burst against prod NOT executed per brief's CONSTRAINTS ("never disrupt real customer traffic; if running would risk other tenants, document as TODO instead and ship the code fix"). Operator follow-up: after deploy, watch instant_pg_pool_in_use / instant_pg_pool_max during the next scheduled load test; alert when > 0.8 for 5m. Coverage block: Symptom: worker event_email_forwarder error "remaining connection slots are reserved for non-replication superuser connections" during 50-concurrent api /db/new burst Enumeration: rg -F 'SetMaxOpenConns' api/ worker/ provisioner/ Sites found: 3 (api postgres.go, worker db.go, provisioner main.go) Sites touched: 3 (this commit + sibling commits on worker + provisioner) Coverage test: TestPublishStats_RoundTripsAllFields fails if a future refactor drops a Stats() field from the gauge map Live verified: deferred — see /readyz + /metrics post-deploy Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…per spec The earlier #311 ship was real (security_headers.go landed in 2ab2b96), but the wave-3 chaos-verify caught a contract drift: the spec mandated X-Frame-Options + Cross-Origin-Resource-Policy AND a 2-year HSTS max-age, none of which were emitted. The Permissions-Policy value also differed from the spec's exact 4-feature deny string. A repo-wide grep for the spec headers returned partial hits on master, not the full set. This commit reconciles the middleware against the spec exactly: - Strict-Transport-Security: max-age=63072000; includeSubDomains (was 15552000) - X-Content-Type-Options: nosniff (unchanged) - X-Frame-Options: SAMEORIGIN (NEW) - Referrer-Policy: strict-origin-when-cross-origin (unchanged) - Permissions-Policy: geolocation=(), microphone=(), camera=(), payment=() (was a wider 23-feature deny set) - Cross-Origin-Resource-Policy: same-origin (NEW) HSTS remains prod-only — gated by `cfg.Environment == "production"` in router.go — so local dev over http://localhost:8080 doesn't poison the host's browser HSTS cache. Exports header values as package-level constants (middleware.HSTSValue / .PermissionsPolicyValue / .ReferrerPolicyValue / .XContentTypeOptionsValue / .XFrameOptionsValue / .CrossOriginResourcePolicyValue) so the test file and any future consumers (OpenAPI generator, security scanner harness, etc.) can pin the contract against the same source of truth. Adds internal/handlers/security_headers_test.go with 4 tests: - TestSecurityHeaders_AllEndpoints_AllHeaders_Prod — iterates 5 representative endpoints (healthz / readyz / openapi.json / db/new / claim) and asserts all 6 spec headers land on every response, including the 401 and 400 error envelopes. This is the regression gate against the failure mode that triggered the redo. - TestSecurityHeaders_NoHSTSInDev — pins the dev-mode contract: HSTS MUST be absent when ENVIRONMENT != production. - TestSecurityHeaders_PermissionsPolicy_Exact — pins the exact Permissions-Policy spec value so a refactor that "improves" the deny set fails loudly. - TestSecurityHeaders_HSTS_TwoYearMaxAge — pins max-age=63072000. Updates openapi.go info description with a "Security headers (applies to every response)" section so the global contract is discoverable from the OpenAPI document. Coverage block (CLAUDE.md rule 17): Symptom: wave-3 chaos-verify reports task #311 marked complete but spec headers (X-Frame-Options, CORP, HSTS 2y, exact Permissions-Policy) not in api code Enumeration: rg -F 'Permissions-Policy|Referrer-Policy| X-Content-Type-Options|X-Frame-Options| Cross-Origin-Resource-Policy|Strict-Transport-Security' --include='*.go' . Sites found: 1 (internal/middleware/security_headers.go) — wired but spec-incomplete: missing X-Frame-Options + CORP, HSTS at 15552000 not 63072000, Permissions-Policy a superset of the spec's 4-feature deny string Sites touched: 3 — security_headers.go (spec-complete header set), router.go (updated wiring comment), openapi.go (global description section). + new test file. Coverage test: TestSecurityHeaders_AllEndpoints_AllHeaders_Prod iterates 5 endpoints + asserts all 6 headers Live verified: pending — requires CI build + kubectl image set + curl https://api.instanode.dev/healthz on the new commit_id. Documented in PR body for the verify-live gate (CLAUDE.md rule 14). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…TeamAllTiers The first revision of TestRazorpayWebhook_TeamNotFound_EmitsAudit used makeSubscriptionChargedPayload (empty plan_id), which short-circuited the handler at the F3 'unknown_tier' branch BEFORE reaching UpgradeTeamAllTiersWithSubscription. ErrTeamNotFound was never returned, the webhookErrorStatus 404 path was never taken, and the new audit row was never emitted — CI's TestRazorpayWebhook_TeamNotFound_EmitsAudit failed on the live DB run. Switch the test to billingWebhookDBApp + makeSubscriptionChargedPayloadWithPlan (both already in the suite) with cfg.RazorpayPlanIDPro. The handler now passes the F3 tier-validation branch, calls UpgradeTeamAllTiersWithSubscription with a bogus team_id (no matching teams row), gets back ErrTeamNotFound, and routes through webhookErrorStatus → emitRazorpayTeamNotFoundAudit. No production-code change — pure test fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The first CI run of TestRazorpayWebhook_TeamNotFound_EmitsAudit failed because emitRazorpayTeamNotFoundAudit read event_id from the parsed body's event.ID field — which is empty when the test sends only the X-Razorpay-Event-Id header (the canonical Razorpay shape). Mirror the dispatch resolution at billing.go:1023 — header is canonical, body event.ID is the fallback. Now the audit row's metadata.event_id matches the same id the rest of the webhook pipeline (dedup, retry release, slog) uses, and operators can join the audit row to Razorpay's delivery log via a single id. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…p ci] CI-minute conservation pass (2026-05-21). Two changes per workflow (deploy.yml + ci.yml): 1. concurrency.cancel-in-progress: false → true, AND group key now includes ${{ github.ref }} so different branches run independently. Previously: 5 rapid-fire pushes = 5 full Deploys to completion. Now: 5 rapid-fire pushes = 1 final Deploy (4 cancelled mid-flight). 2. paths-ignore added to skip docs-only commits (**.md, docs/**, CLAUDE.md, .gitignore, LICENSE, BUGBASH-*/**). Estimated savings: ~40% of historical burn rate under the rapid-iteration pattern this session exhibited. [skip ci] honored — this commit itself does NOT trigger a Deploy or CI run. The new behavior applies starting from the NEXT real code push. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three tests have been blocking every api Deploy since wave-2's UX-polish agent landed orphan registry entries + repurposed test fixture codes: 1. TestRespondError_UnknownCode_5xx_FallsBackToContactSupport — was using "db_error" as a fake-unregistered fixture, but wave-2 added a real entry for it ("platform database hit a transient error"). Switched to a literal sentinel "__test_5xx_unregistered__" that's fake by construction; if someone adds it to the registry, the test fails loudly with a clear signal to pick a different sentinel. 2. TestRespondError_UnknownCode_4xx_OmitsAgentAction — same pattern, was using "invalid_payload" which wave-2 registered. Now uses "__test_4xx_unregistered__". 3. TestCodeToAgentAction_NoOrphans — flagged "upstream_failed" as an orphan registry entry (helpers.go:862-864) with no handler emit site anywhere in the codebase. Deleted the dead entry. The "provider_failed" entry above it covers the legitimate upstream-failure case. Unsticks 4 pending commits behind master: 4212c11 (security headers), 1345e28 (pg pool sizing), 6b6253b/36ff74d/c1bb661 (razorpay team_not_found audit). Net effect of one Deploy run: 5 commits' worth of work goes live. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…DELETE SET NULL (closes gap #6) Closes CLAUDE.md "Known Design Gaps" #6: team-deletion cascades drop audit_log rows but leave forwarder_sent rows pointing at non-existent audit_log IDs. Migration 063 only added a partial index + COMMENT; 063's analysis correctly identified that a direct FK on audit_id is infeasible (TEXT vs UUID type mismatch + PK NOT NULL + legacy placeholder strings like `reminder-<id>-<stage>` from pre-consolidation emitters would fail constraint validation). Strategy: add a NEW nullable column `audit_log_id UUID REFERENCES audit_log(id) ON DELETE SET NULL` as a strict-FK breadcrumb alongside the existing TEXT audit_id (which stays as the PK + idempotency key). This gives gap #6 the strict FK it asks for without breaking legacy placeholder emitters (those rows leave audit_log_id NULL — semantically correct, since they were never tied to an audit_log row). Backfill: any existing forwarder_sent row whose audit_id is a valid UUID AND exists in audit_log gets audit_log_id populated. Placeholder rows + orphan rows both leave audit_log_id NULL. After this migration the email-truth-surface ledger row survives audit_log row deletion (CLAUDE.md rule 12) but the breadcrumb back to audit_log is NULL'd via the FK rather than orphan-dangling. Follow-up needed: worker's event_email_forwarder must be taught to populate audit_log_id alongside audit_id on insert (when audit_id is a real UUID). Until that ships, post-064 rows leave audit_log_id NULL even when they have a valid audit_log target. The backfill covers existing rows; this is a forward-only behaviour gap. Coverage block (CLAUDE.md rule 17): Symptom: forwarder_sent rows orphan after team-delete cascade Enumeration: rg -F 'forwarder_sent' --type-add 'sql:*.sql' -tsql -tgo Sites found: 1 schema, 1 testhelpers mirror Sites touched: 1 migration + 1 testhelpers mirror + 1 regression test Coverage test: TestMigration064_AuditLogIDFKWithOnDeleteSetNull + TestMigration064_OnDeleteSetNullEndToEnd Live verified: pending operator apply + post-deploy verification Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mastermanas805 and others added 30 commits May 12, 2026 23:51

Merge pull request #42 from InstaNode-dev/pricing/u3-sharpen-agent-ac…

2273866

…tions-fresh handlers: audit + sharpen every agent_action string per uniform contract (U3)

Merge pull request #41 from InstaNode-dev/pricing/p1-ab-server-fresh

2c6c355

experiments: server-side bucket selector + UpgradeButton variants (P1 of pricing experiments)

Merge remote-tracking branch 'origin/master' into pricing/u1-quota-nu…

9a9979c

…dge-api-fresh # Conflicts: # internal/router/router.go

Merge pull request #43 from InstaNode-dev/pricing/u1-quota-nudge-api-…

12feb4a

…fresh usage: GET /api/v1/usage/wall returns latest near-wall audit row (U1)

Merge pull request #44 from InstaNode-dev/pricing/p2-annual-server-fresh

7da67c2

plans: annual pricing tiers (hobby_yearly / pro_yearly / team_yearly) + billing.go plan_frequency (P2)

Merge pull request #45 from InstaNode-dev/feat/private-deploy-backend…

d3fa539

…-fresh deploy: private deploys with allowed_ips ingress whitelist (Pro+ feature)

Merge pull request #46 from InstaNode-dev/feat/patch-deployment-acces…

0e02996

…s-fresh deploy: PATCH /api/v1/deployments/:id to edit private + allowed_ips in-place

Merge pull request #47 from InstaNode-dev/feat/billing-promotion-vali…

0022b19

…date-fresh billing: POST /api/v1/billing/promotion/validate — HTTP wrapper around plans.validatePromotion

Merge remote-tracking branch 'origin/master' into feat/admin-customer…

0932290

…s-api-fresh # Conflicts: # internal/handlers/agent_action_contract_test.go

Merge pull request #48 from InstaNode-dev/feat/admin-customers-api-fresh

2028700

admin: Customers endpoints (list/detail/tier-change/issue-promo) gated on ADMIN_EMAILS

Merge pull request #49 from InstaNode-dev/feat/audit-log-emit-sites-f…

114b6d6

…resh audit: emit 4 missing kinds (onboarding.claimed + subscription.*) to unblock Loops forwarder

Merge pull request #50 from InstaNode-dev/feat/admin-path-prefix-fresh

c57d012

admin: route admin endpoints under ADMIN_PATH_PREFIX env (closed-by-default + internal docs)

Merge pull request #51 from InstaNode-dev/chore/nr-json-bake-jq-fixes…

7e54ce6

…-fresh infra: bake jq fixes into NR JSON source (apply.sh + Terraform read same clean files)

Merge remote-tracking branch 'origin/master' into feat/razorpay-cance…

9626451

…l-on-demote-fresh # Conflicts: # internal/router/router.go

Merge pull request #52 from InstaNode-dev/feat/razorpay-cancel-on-dem…

f3b8b87

…ote-fresh admin: cancel Razorpay subscription on demote + emit subscription.canceled_by_admin audit

Merge pull request #53 from InstaNode-dev/feat/promo-redemption-check…

802e0a4

…out-fresh billing: validate + redeem admin-issued promo codes alongside plans-yaml codes

mastermanas805 and others added 30 commits May 20, 2026 14:39

ci: retrigger Deploy after common@04dd59d pushed (readiness pkg now r…

a8a8120

…esolvable) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ci: retrigger Deploy after 30f8f96 — TestOpenAPI_ClaimRequestDocument…

03e2481

…sUpgradeJWT passes locally; investigating CI-only flake Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(reliability): migration 064 — forwarder_sent.audit_log_id FK (closes gap #6)#126

fix(reliability): migration 064 — forwarder_sent.audit_log_id FK (closes gap #6)#126
mastermanas805 wants to merge 304 commits into
masterfrom
srr/migration-064-forwarder-sent-fk-2026-05-21

mastermanas805 commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mastermanas805 commented May 21, 2026

Summary

Files changed

Test plan

Follow-up needed (not blocking this PR)

Why no direct FK on `audit_id`

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants