alerts(NR+Prom): brevo_send_errors + goroutine_panics + entitlement_regrade_failed + drift fixes #17

Merged: mastermanas805 merged 1 commit into master from bugbash-infra-content-prov-2026-05-20 on May 21, 2026

Conversation

@mastermanas805
Member

BugBash 2026-05-20 sweep — adds NR alert JSON + Prometheus rules for production metrics that had zero alert coverage, plus a NodePort drift fix.

New NR alerts (newrelic/alerts/)

  • brevo-send-errors-spike.json — SEND-side counter on worker; complements the existing email-delivery-ratio alert (RECEIVE-side). Send errors mean messages never queue at Brevo, so the ratio alert is silent. Page > 5 events in 10m.
  • goroutine-panics-recovered.json — `instant_goroutine_panics_total` (api) + `instant_worker_goroutine_panics_recovered_total` (worker). Recovered panics keep the pod up but indicate a real code defect that escaped tests. Any panic pages.
  • entitlement-regrade-failed.json — Worker called provisioner.RegradeResource() and it failed. A paying customer is now on lower-tier backend limits despite paying. Inverse failure mode to billing-charge-undeliverable.
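
The thresholds above ("> 5 events in 10m", "any panic pages") are conditions on how much a cumulative counter grew inside a trailing window. The sketch below illustrates that intent only; it is not how New Relic evaluates a condition, and `Sample`, `counter_increase`, and `should_page` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Sample:
    timestamp: float  # seconds since some fixed origin
    value: float      # cumulative counter value at that time


def counter_increase(samples: Iterable[Sample], window_seconds: float, now: float) -> float:
    """Return how much a cumulative counter grew within the trailing window.

    A drop between two consecutive samples is treated as a counter reset
    (for example a pod restart), so the post-reset value counts as new growth.
    """
    window = sorted(
        (s for s in samples if now - window_seconds <= s.timestamp <= now),
        key=lambda s: s.timestamp,
    )
    increase = 0.0
    for prev, cur in zip(window, window[1:]):
        delta = cur.value - prev.value
        increase += delta if delta >= 0 else cur.value
    return increase


def should_page(samples: Iterable[Sample], threshold: float, window_seconds: float, now: float) -> bool:
    """True when the counter grew by strictly more than `threshold` in the window."""
    return counter_increase(samples, window_seconds, now) > threshold
```

With these helpers, the send-error alert corresponds to `should_page(samples, threshold=5, window_seconds=600, now=...)`, and the panic alert to a threshold of 0 over 300 seconds.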

New Prometheus rules (parallel groups in k8s/prometheus-rules.yaml + prometheus/alert-rules.yml)

Same three signals, file-format mirrored for the two apply paths (PrometheusRule CRD + standalone Prom rule file).

Drift fix

  • k8s/website.yaml — ConfigMap claimed the dev API URL was `http://localhost:30080` (NodePort), but NodePort was retired 2026-05-11 (CLAUDE.md confirms the instant-api Service is ClusterIP only). Switched to `http://localhost:8080` + documented the port-forward command.

Validation

  • All 3 NR JSON files `jq empty` clean
  • Both Prom rule files `yaml.safe_load_all` clean
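
For readers without `jq`, a rough stdlib equivalent of the JSON syntax check might look like the sketch below. The function names and the directory argument are assumptions made for illustration. Note that `json.loads` expects exactly one JSON document per file, whereas `jq empty` also accepts a stream of several values, so this check is slightly stricter.

```python
import json
from pathlib import Path
from typing import Dict, Optional


def json_error(text: str) -> Optional[str]:
    """Return None if `text` parses as a single JSON document, else a short message."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return f"line {exc.lineno}, column {exc.colno}: {exc.msg}"
    return None


def check_alert_files(directory: str) -> Dict[str, str]:
    """Map each *.json file name in `directory` that fails to parse to its error."""
    failures = {}
    for path in sorted(Path(directory).glob("*.json")):
        error = json_error(path.read_text(encoding="utf-8"))
        if error is not None:
            failures[path.name] = error
    return failures
```

An empty result from `check_alert_files("newrelic/alerts")` would correspond to the "clean" outcome reported above. The YAML check is analogous if PyYAML is installed, using `yaml.safe_load_all` as the PR does.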

Apply

  • NR alerts: `infra/newrelic/apply.sh` (idempotent)
  • Prom rules: `kubectl apply -f k8s/prometheus-rules.yaml` OR reload Prom with file_sd alert-rules.yml

🤖 Generated with Claude Code

…titlement_regrade_failed + drift fixes

Adds three NR alert JSON files + parallel Prometheus rules for metric
counters that exist in production but had NO alert coverage. Sweep of
BugBash 2026-05-20 master-ledger entries against the actual emitted
metrics in api + worker /metrics endpoints.

newrelic/alerts/brevo-send-errors-spike.json
  brevo_send_errors_total > 5 over 10m → CRITICAL.
  Worker-side counter. SEND-side of the email pipeline (worker → Brevo
  /v3/smtp/email). Complements the existing email-delivery-ratio alert
  which is RECEIVE-side (Brevo → /webhooks/brevo). Send errors mean
  messages never queue at Brevo, so the ratio alert is silent.
  Source: worker/internal/email/brevo/brevo.go, counter in metrics.go.

newrelic/alerts/goroutine-panics-recovered.json
  instant_goroutine_panics_total + instant_worker_goroutine_panics_recovered_total > 0
  over 5m → CRITICAL.
  Both counters tick when safego.Go's deferred recover() catches a panic
  that would otherwise crash a background goroutine. Pod stays up, but
  the panic almost always indicates a code defect that escaped tests.
  Source: common/safego.Go.
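
The recover-and-count pattern described above can be illustrated with a Python analogue. This is not the repository's Go helper, whose source is not shown here; `PanicCounter` and `safe_go` are hypothetical names, and a thread stands in for a goroutine.

```python
import threading
from typing import Callable


class PanicCounter:
    """Thread-safe counter standing in for a *_panics_total metric."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._count = 0

    def inc(self) -> None:
        with self._lock:
            self._count += 1

    @property
    def value(self) -> int:
        with self._lock:
            return self._count


def safe_go(fn: Callable[[], None], counter: PanicCounter) -> threading.Thread:
    """Run `fn` in a background thread; count, rather than propagate, any exception."""

    def runner() -> None:
        try:
            fn()
        except Exception:
            # The process keeps running, but the failure is recorded so an
            # alert on the counter can surface the underlying defect.
            counter.inc()

    thread = threading.Thread(target=runner, daemon=True)
    thread.start()
    return thread
```

The point of the alert is visible here: because the exception is swallowed, the counter is the only signal that something went wrong.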

newrelic/alerts/entitlement-regrade-failed.json
  instant_entitlement_regrade_failed_total > 0 over 10m, FACET service →
  CRITICAL.
  Worker calls provisioner.RegradeResource() when resource.tier <
  team.plan_tier (post-upgrade drift). A failure here = paying customer
  still on lower-tier backend limits despite paying for the higher tier.
  Inverse failure mode to billing-charge-undeliverable. Source:
  worker/internal/jobs/entitlement_reconciler.go.
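
As a rough sketch of the condition described above, the reconcile step can be modelled as follows. The actual reconciler is Go code not shown in this PR; `Resource`, `reconcile_entitlements`, the integer tier encoding, and the `regrade` callback are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Resource:
    id: str
    tier: int  # assumed encoding: a higher number means a higher tier


def reconcile_entitlements(
    resources: Iterable[Resource],
    plan_tier: int,
    regrade: Callable[[Resource, int], None],
) -> List[str]:
    """Regrade every resource below the team's plan tier.

    Returns the ids of resources whose regrade call raised. Each entry
    corresponds to one increment of a regrade-failed counter.
    """
    failed = []
    for resource in resources:
        if resource.tier >= plan_tier:
            continue  # no post-upgrade drift for this resource
        try:
            regrade(resource, plan_tier)
        except Exception:
            failed.append(resource.id)  # resource stays on the lower tier
        else:
            resource.tier = plan_tier
    return failed
```

Any non-empty return value is the situation the alert targets: a resource that remains on lower-tier limits after the team upgraded.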

k8s/prometheus-rules.yaml — adds three groups mirroring the NR alerts
  for environments where Prometheus + Alertmanager are the primary path
  (instant-code-defects, instant-worker-entitlements,
  instant-worker-email-send).

prometheus/alert-rules.yml — same three rules added to the standalone-
  Prometheus format file so both apply paths converge.

k8s/website.yaml — NodePort retirement drift fix. The ConfigMap claimed
  http://localhost:30080 (NodePort) was the dev API URL, but NodePort
  was retired 2026-05-11 (CLAUDE.md confirms — instant-api Service is
  ClusterIP only). Switched to http://localhost:8080 + documented the
  port-forward command. Without this fix any operator copying the
  ConfigMap value into a dev shell sees connection-refused.

Validation:
  - All three NR JSON files: `jq empty` clean.
  - k8s/prometheus-rules.yaml + prometheus/alert-rules.yml: `python3 -c "import yaml; list(yaml.safe_load_all(...))"` clean.

Apply path:
  - NR alerts: operator runs `infra/newrelic/apply.sh` (idempotent
    delete-then-create).
  - Prometheus rules: operator applies k8s/prometheus-rules.yaml via
    kubectl OR reloads Prometheus with the file_sd alert-rules.yml.

Coverage block per CLAUDE.md rule 17:
  Symptom:        prod metrics with no alert (brevo_send_errors_total,
                  instant_goroutine_panics_total,
                  instant_worker_goroutine_panics_recovered_total,
                  instant_entitlement_regrade_failed_total)
  Enumeration:    grep 'Name: "' api/internal/metrics/metrics.go
                  worker/internal/metrics/metrics.go, then grep -rn
                  '<name>' infra/newrelic/ infra/prometheus/ infra/k8s/
  Sites found:    4 metrics with zero alert coverage in either NR or Prom
  Sites touched:  4 metrics, covered by three alert files (the two
                  goroutine-panic counters share one), mirrored to both
                  apply paths (NR json + k8s/prometheus-rules.yaml +
                  prometheus/alert-rules.yml)
  Coverage test:  Existing `for f in newrelic/alerts/*.json; do jq empty
                  "$f"; done` + python3 yaml.safe_load_all over the two
                  Prom rule files (already part of apply.sh validation).
  Live verified:  Not applicable — alert files codify intent; the
                  operator must run apply.sh + reload Prometheus to
                  activate. Verify-live: trigger a synthetic panic via
                  /admin/test-panic (if exposed) OR wait for a real
                  occurrence, confirm NR violation opens within 5m.
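
The enumeration step in the coverage block (collect metric names, then search the alert files for each) can be sketched in Python as below. The function names are hypothetical, and unlike plain `grep` the lookup requires the name to appear as a whole identifier, so `errors_total` would not count as covered by `brevo_send_errors_total`.

```python
import re
from typing import Iterable, List

# Mirrors the grep pattern 'Name: "' from the coverage block.
METRIC_NAME = re.compile(r'Name: "([^"]+)"')


def metric_names(go_source: str) -> List[str]:
    """Extract the metric names declared in a Go metrics file, sorted and de-duplicated."""
    return sorted(set(METRIC_NAME.findall(go_source)))


def uncovered(metrics: Iterable[str], alert_texts: Iterable[str]) -> List[str]:
    """Return the metrics that no alert text references as a whole identifier."""
    texts = list(alert_texts)
    missing = []
    for name in metrics:
        pattern = re.compile(r"(?<![A-Za-z0-9_])" + re.escape(name) + r"(?![A-Za-z0-9_])")
        if not any(pattern.search(text) for text in texts):
            missing.append(name)
    return missing
```

Feeding it the contents of both metrics.go files and every file under the alert directories would reproduce the "Sites found" figure, assuming metric names only appear in `Name: "..."` declarations.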

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mastermanas805 force-pushed the bugbash-infra-content-prov-2026-05-20 branch from 875e555 to e70d813 on May 21, 2026 16:08
@mastermanas805 merged commit 37e86c9 into master on May 21, 2026