observability: add the kube-state-metrics addon (+ operator CR-state metrics) #44
Merged
…metrics)

grafana-agent statically scrapes kube-state-metrics.kube-system.svc:8080, but no addon ever shipped kube-state-metrics, so that scrape hit nothing and every kube_* / kube_customresource_* panel silently no-data'd: the cilium/cert-manager/eso dashboards on the standard metrics, and the seven agent persona dashboards + operator-slo CR-status alerts on the custom-resource ones.

Add the addon (prometheus-community/kube-state-metrics 7.5.1, app 2.19.1) in kube-system with fullnameOverride: kube-state-metrics so the existing scrape target resolves; grafana-agent is unchanged. customResourceState.config carries the eks-agent-platform operator's CR-state definitions (Platform / Tenant / BudgetPolicy / AgentFleet / EvalSuite status_phase + status_field + condition), and rbac.extraRules grants KSM list/watch on those CRDs so kube_customresource_* series actually emit.

The CR-state config is inlined here rather than mounted from the operator chart's ConfigMap on purpose: the operator runs at a later sync wave in a different namespace, so mounting it would couple KSM's startup to the operator and need a restart once the ConfigMap appeared. Observability scrape config belongs in the observability repo; keep it in step with the operator chart's files/slo/customresourcestatemetrics.yaml when its CRD status surface changes.

Part of #33. The second half (flip slo.alerting in production) stays blocked on the pagerduty-platform + slack-webhook-* Secrets the AlertmanagerConfig receivers reference.
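For orientation, the values described above might look roughly like the sketch below. It is an illustration only, not the file from this PR: the API group `platform.example.com`, the version `v1alpha1`, and the phase list are placeholders, and only one of the five kinds is shown. The top-level keys (`fullnameOverride`, `customResourceState`, `rbac.extraRules`) are the ones named in the commit message.

```yaml
# Sketch of addon values for the prometheus-community/kube-state-metrics chart.
# Group, version, and phase names are hypothetical placeholders.

# Pin the Service name so the existing scrape target
# kube-state-metrics.kube-system.svc:8080 resolves.
fullnameOverride: kube-state-metrics

customResourceState:
  enabled: true
  config:
    kind: CustomResourceStateMetrics
    spec:
      resources:
        - groupVersionKind:
            group: platform.example.com   # hypothetical API group
            version: v1alpha1             # hypothetical version
            kind: Tenant
          metricNamePrefix: kube_customresource
          labelsFromPath:
            name: [metadata, name]
            namespace: [metadata, namespace]
          metrics:
            # Emits kube_customresource_status_phase with one series per phase.
            - name: status_phase
              help: Current phase of the Tenant
              each:
                type: StateSet
                stateSet:
                  labelName: phase
                  path: [status, phase]
                  list: [Pending, Ready, Failed]   # hypothetical phases

rbac:
  extraRules:
    # Without list/watch on the CRD, KSM cannot read the objects and emits nothing.
    - apiGroups: ["platform.example.com"]   # hypothetical API group
      resources: ["tenants"]
      verbs: ["list", "watch"]
```

Each additional kind would need its own `resources` entry and a matching RBAC rule for its API group.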
CI Results
All validations passed.
grafana-agent statically scrapes `kube-state-metrics.kube-system.svc:8080`, but no addon ever shipped kube-state-metrics, so that scrape hit nothing and every `kube_*` / `kube_customresource_*` panel silently no-data'd: the cilium/cert-manager/eso dashboards and the seven agent persona dashboards + operator-slo CR-status alerts.

Adds the addon (`prometheus-community/kube-state-metrics` 7.5.1) in `kube-system` with `fullnameOverride: kube-state-metrics` so the existing scrape target resolves; grafana-agent is unchanged. `customResourceState.config` carries the operator's CR-state definitions (Platform / Tenant / BudgetPolicy / AgentFleet / EvalSuite); `rbac.extraRules` grants list/watch on those CRDs.

Inlined, not mounted (chosen design): the operator's ConfigMap is in another namespace at a later sync wave, so mounting it would couple KSM's startup to it and need a restart. It mirrors the operator chart's `files/slo/customresourcestatemetrics.yaml`.

Validated: `helm template` renders clean (service name, the flag, all 5 GVKs, RBAC for 3 API groups); `task validate` passes.

Part of #33. The prod-alerting flip stays blocked on the pagerduty/slack Secrets.
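To show why the Service name matters, here is a minimal sketch of a static scrape job pointing at the target quoted above. It assumes a Prometheus-style `scrape_configs` block; the actual grafana-agent configuration is not shown in this PR, so the layout and the `job_name` are assumptions.

```yaml
# Hypothetical Prometheus-style scrape job; only the target address comes from the PR.
scrape_configs:
  - job_name: kube-state-metrics   # hypothetical job name
    static_configs:
      - targets:
          # <service>.<namespace>.svc:<port>; this resolves only if the chart's
          # Service is literally named kube-state-metrics in kube-system.
          - kube-state-metrics.kube-system.svc:8080
```

Because the target is a fixed DNS name rather than a discovered endpoint, a release-prefixed Service name would leave the job scraping nothing, which is what `fullnameOverride` avoids.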