alerts(NR+Prom): brevo_send_errors + goroutine_panics + entitlement_regrade_failed + drift fixes #17
Merged
Adds three NR alert JSON files + parallel Prometheus rules for metric
counters that exist in production but had NO alert coverage. The gaps
came from a sweep of BugBash 2026-05-20 master-ledger entries against
the metrics actually emitted by the api + worker /metrics endpoints.
newrelic/alerts/brevo-send-errors-spike.json
brevo_send_errors_total > 5 over 10m → CRITICAL.
Worker-side counter on the SEND side of the email pipeline (worker →
Brevo /v3/smtp/email). Complements the existing email-delivery-ratio
alert, which is RECEIVE-side (Brevo → /webhooks/brevo). Send errors mean
messages never queue at Brevo, so Brevo emits no delivery webhooks for
them and the ratio alert stays silent.
Source: worker/internal/email/brevo/brevo.go, counter in metrics.go.
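To make the "> 5 over 10m" condition concrete, here is a minimal sketch of how an increase-over-window check on a monotonic counter can be evaluated. It is illustrative only: the function names and the (timestamp, value) sample shape are assumptions, and neither New Relic nor Prometheus is claimed to compute it exactly this way.

```python
from typing import List, Tuple

# Hypothetical sample shape: (unix_seconds, counter_value).
Sample = Tuple[float, float]


def increase_over_window(samples: List[Sample], now: float, window_s: float) -> float:
    """Sum the increases between consecutive samples inside the window.

    A drop in value is treated as a counter reset (e.g. a worker restart),
    so the post-reset value is counted as fresh increase.
    """
    in_window = [s for s in samples if now - window_s <= s[0] <= now]
    total = 0.0
    for (_, prev), (_, cur) in zip(in_window, in_window[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def brevo_send_errors_critical(samples: List[Sample], now: float,
                               threshold: float = 5, window_s: float = 600) -> bool:
    """True when the counter grew by more than `threshold` in the window."""
    return increase_over_window(samples, now, window_s) > threshold
```

The comparison is strict, so exactly 5 errors in 10 minutes would not fire under this reading of the threshold.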
newrelic/alerts/goroutine-panics-recovered.json
instant_goroutine_panics_total + instant_worker_goroutine_panics_recovered_total > 0
over 5m → CRITICAL.
Both counters tick when safego.Go's deferred recover() catches a panic
that would otherwise crash a background goroutine. Pod stays up, but
the panic almost always indicates a code defect that escaped tests.
Source: common/safego.Go.
newrelic/alerts/entitlement-regrade-failed.json
instant_entitlement_regrade_failed_total > 0 over 10m, FACET service →
CRITICAL.
Worker calls provisioner.RegradeResource() when resource.tier <
team.plan_tier (post-upgrade drift). A failure here means a customer
who has paid for the higher tier is still held to lower-tier backend
limits. Inverse failure mode to billing-charge-undeliverable. Source:
worker/internal/jobs/entitlement_reconciler.go.
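The drift check described above can be sketched as follows. This is a Python illustration of the logic only, not the Go reconciler itself: the `Resource` type, integer tiers, the `regrade` callable, and the returned failure count (standing in for instant_entitlement_regrade_failed_total) are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass
class Resource:  # hypothetical shape; tiers are assumed to be orderable ints
    id: str
    team_id: str
    tier: int


def reconcile(resources: Iterable[Resource],
              plan_tier_by_team: Dict[str, int],
              regrade: Callable[[Resource, int], None]) -> int:
    """Regrade every resource whose tier lags its team's plan tier.

    Returns the number of failed regrades, which is what the alert
    would fire on when it is greater than zero.
    """
    failed = 0
    for res in resources:
        target = plan_tier_by_team[res.team_id]
        if res.tier >= target:
            continue  # no post-upgrade drift for this resource
        try:
            regrade(res, target)
        except Exception:
            failed += 1  # customer stays on lower-tier limits
    return failed
```

Counting failures rather than raising keeps one broken resource from blocking the rest of the sweep, which is the behavior the "> 0" threshold presumes.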
k8s/prometheus-rules.yaml — adds three groups mirroring the NR alerts
for environments where Prometheus + Alertmanager are the primary path
(instant-code-defects, instant-worker-entitlements,
instant-worker-email-send).
prometheus/alert-rules.yml — same three rules added to the standalone-
Prometheus format file so both apply paths converge.
k8s/website.yaml — NodePort retirement drift fix. The ConfigMap claimed
http://localhost:30080 (NodePort) was the dev API URL, but NodePort
was retired 2026-05-11 (CLAUDE.md confirms — instant-api Service is
ClusterIP only). Switched to http://localhost:8080 + documented the
port-forward command. Without this fix any operator copying the
ConfigMap value into a dev shell sees connection-refused.
Validation:
- All three NR JSON files: `jq empty` clean.
- k8s/prometheus-rules.yaml + prometheus/alert-rules.yml: `python3 -c "import yaml; list(yaml.safe_load_all(...))"` clean.
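For environments without jq, the JSON syntax check can be approximated with the Python standard library. This is a sketch, and the helper name is hypothetical. It is slightly stricter than `jq empty`: it rejects empty files and files holding more than one JSON value.

```python
import json
from pathlib import Path
from typing import Dict, Iterable


def invalid_json_files(paths: Iterable[Path]) -> Dict[str, str]:
    """Return {path: error message} for every file that fails to parse."""
    errors: Dict[str, str] = {}
    for path in paths:
        try:
            json.loads(Path(path).read_text(encoding="utf-8"))
        except (OSError, ValueError) as exc:  # JSONDecodeError is a ValueError
            errors[str(path)] = str(exc)
    return errors
```

A possible invocation is `invalid_json_files(Path("newrelic/alerts").glob("*.json"))`, where an empty result means every file parsed.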
Apply path:
- NR alerts: operator runs `infra/newrelic/apply.sh` (idempotent
delete-then-create).
- Prometheus rules: operator applies k8s/prometheus-rules.yaml via
kubectl OR reloads Prometheus with prometheus/alert-rules.yml listed
under rule_files.
Coverage block per CLAUDE.md rule 17:
Symptom: prod metrics with no alert (brevo_send_errors_total,
instant_goroutine_panics_total,
instant_worker_goroutine_panics_recovered_total,
instant_entitlement_regrade_failed_total)
Enumeration: grep 'Name: "' api/internal/metrics/metrics.go
worker/internal/metrics/metrics.go, then grep -rn
'<name>' infra/newrelic/ infra/prometheus/ infra/k8s/
Sites found: 4 metrics with zero alert coverage in either NR or Prom
Sites touched: 4 metrics across three alerts (the two goroutine panic
counters share one alert), mirrored to both apply
paths (NR json + k8s/prometheus-rules.yaml +
prometheus/alert-rules.yml)
Coverage test: Existing `for f in newrelic/alerts/*.json; do jq empty
"$f"; done` + python3 yaml.safe_load_all over the two
Prom rule files (already part of apply.sh validation).
Live verified: Not applicable — alert files codify intent; the
operator must run apply.sh + reload Prometheus to
activate. Verify-live: trigger a synthetic panic via
/admin/test-panic (if exposed) OR wait for a real
occurrence, confirm NR violation opens within 5m.
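The Enumeration step above (grep for metric names, then grep the alert files for each one) can be sketched in Python as below. The function names are hypothetical, and like grep it does plain substring matching, so a metric whose name is a substring of another could be reported as covered when it is not.

```python
import re
from typing import Iterable, List

# Matches the `Name: "..."` fields the grep in the Enumeration step targets.
NAME_RE = re.compile(r'Name:\s*"([^"]+)"')


def metric_names(go_source: str) -> List[str]:
    """Extract metric names from the text of a metrics.go file."""
    return NAME_RE.findall(go_source)


def uncovered(metrics: Iterable[str], alert_texts: Iterable[str]) -> List[str]:
    """Return metrics that appear in none of the alert or rule file texts."""
    texts = list(alert_texts)
    return [m for m in metrics if not any(m in t for t in texts)]
```

Feeding it the contents of both metrics.go files and every file under the alert directories would reproduce the "Sites found" list, assuming each metric's full name appears in its `Name` field.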
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
875e555 to e70d813
BugBash 2026-05-20 sweep — adds NR alert JSON + Prometheus rules for production metrics that had zero alert coverage, plus a NodePort drift fix.
New NR alerts (newrelic/alerts/)
New Prometheus rules (parallel groups in k8s/prometheus-rules.yaml + prometheus/alert-rules.yml)
Same three signals, file-format mirrored for the two apply paths (PrometheusRule CRD + standalone Prom rule file).
Drift fix
Validation
Apply
🤖 Generated with Claude Code