[design-spec] k8s-karpenter-control-plane-health #91

Description

Design Spec: k8s-karpenter-control-plane-health

Parent: #90
Target: rw-cli-codecollection

Spec

codebundle_name: "k8s-karpenter-control-plane-health"
target_collection: "rw-cli-codecollection"
display_name: "Kubernetes Karpenter Control Plane Health"
author: "rw-codebundle-agent"

purpose: |
  Monitors the health of the Karpenter controller installation in a cluster:
  workload readiness, admission webhooks, and high-signal Kubernetes events in
  the Karpenter namespace. This bundle answers "Is Karpenter itself running and
  wired correctly?" before investigating provisioning behavior.

tasks:
  - name: "Check Karpenter Controller Workload Health in Cluster `${CONTEXT}`"
    description: "Verifies that Karpenter controller pods (typically labeled app.kubernetes.io/name=karpenter or by Helm release name) are Ready, counts container restarts, and surfaces CrashLoopBackOff or missing replicas."
    script_name: "check-karpenter-controller-pods.sh"
    expected_issue_severity: [2, 3]
    access_level: "read-only"
    data_type: "config"

  - name: "Verify Karpenter Admission Webhooks in Cluster `${CONTEXT}`"
    description: "Lists ValidatingWebhookConfiguration and MutatingWebhookConfiguration objects that reference Karpenter and checks for misconfigured caBundle, missing endpoints, or webhook failures in recent events."
    script_name: "check-karpenter-webhooks.sh"
    expected_issue_severity: [2, 3]
    access_level: "read-only"
    data_type: "config"

  - name: "Inspect Warning Events in Karpenter Namespace `${KARPENTER_NAMESPACE}`"
    description: "Aggregates Warning-type events involving Karpenter pods, services, or webhook resources within RW_LOOKBACK_WINDOW; groups by involved object for triage."
    script_name: "karpenter-namespace-warning-events.sh"
    expected_issue_severity: [3, 4]
    access_level: "read-only"
    data_type: "events"

  - name: "Summarize Installed Karpenter API Versions and CRDs in Cluster `${CONTEXT}`"
    description: "Detects installed Karpenter CRD groups/versions (e.g. karpenter.sh, karpenter.k8s.aws) to support version-aware checks and to flag mixed or deprecated APIs."
    script_name: "check-karpenter-crds.sh"
    expected_issue_severity: [4]
    access_level: "read-only"
    data_type: "config"

  - name: "Check Karpenter Service and Metrics Endpoints in Namespace `${KARPENTER_NAMESPACE}`"
    description: "Validates that a Service exists for the metrics/monitoring port and optionally probes /metrics or health endpoints where exposed, surfacing misconfigurations that break observability integrations."
    script_name: "check-karpenter-service-metrics.sh"
    expected_issue_severity: [3, 4]
    access_level: "read-only"
    data_type: "metrics"
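The readiness/restart extraction behind check-karpenter-controller-pods.sh could center on a jq filter like the sketch below. The label selector and the kubectl invocation in the comment are assumptions about a typical Helm install, and a canned CrashLoopBackOff pod stands in for live output so the filter can be exercised offline:

```shell
# In the real script the JSON would come from something like:
#   "${KUBERNETES_DISTRIBUTION_BINARY:-kubectl}" --context "${CONTEXT}" \
#     get pods -n "${KARPENTER_NAMESPACE:-karpenter}" \
#     -l app.kubernetes.io/name=karpenter -o json
pods_json='{"items":[{"metadata":{"name":"karpenter-abc"},"status":{"containerStatuses":[{"ready":false,"restartCount":7,"state":{"waiting":{"reason":"CrashLoopBackOff"}}}]}}]}'

# Emit one line per container that is not Ready or has restarted.
echo "$pods_json" | jq -r '
  .items[]
  | . as $pod
  | .status.containerStatuses[]?
  | select((.ready | not) or .restartCount > 0)
  | "\($pod.metadata.name): ready=\(.ready) restarts=\(.restartCount) reason=\(.state.waiting.reason // "n/a")"'
```

Against the sample this prints `karpenter-abc: ready=false restarts=7 reason=CrashLoopBackOff`; any such line would map onto the task's severity-2/3 issue range.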

scope:
  level: "Resource"
  qualifiers:
    - CONTEXT
    - KARPENTER_NAMESPACE
  iteration_pattern: |
    One taskset per Kubernetes cluster context. KARPENTER_NAMESPACE defaults to
    `karpenter` but must be overridable for custom installs.

resource_types:
  - "kubernetes_cluster"
generation_strategy: |
  Generation rule matches cluster-level SLXs where Karpenter is installed.
  Qualifier: cluster context name. Optional label or annotation on the workspace
  can indicate Karpenter presence to avoid generating for clusters without it.

env_vars:
  - name: CONTEXT
    description: "kubectl context name for the target cluster"
    required: true

  - name: KARPENTER_NAMESPACE
    description: "Namespace where Karpenter controller runs"
    required: false
    default: "karpenter"

  - name: KUBERNETES_DISTRIBUTION_BINARY
    description: "kubectl-compatible binary"
    required: false
    default: "kubectl"

  - name: RW_LOOKBACK_WINDOW
    description: "Time window for events (inherits platform convention)"
    required: false
    default: "30m"
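The RW_LOOKBACK_WINDOW-scoped event aggregation in karpenter-namespace-warning-events.sh could reduce to a jq grouping like this sketch. The two events are invented stand-ins for live `kubectl get events -o json` output, and the real script would additionally filter event timestamps against the lookback window:

```shell
# Live data would come from something like:
#   "${KUBERNETES_DISTRIBUTION_BINARY:-kubectl}" --context "${CONTEXT}" \
#     get events -n "${KARPENTER_NAMESPACE:-karpenter}" -o json
events_json='{"items":[{"type":"Warning","reason":"Unhealthy","message":"Readiness probe failed","involvedObject":{"kind":"Pod","name":"karpenter-abc"}},{"type":"Normal","reason":"Scheduled","message":"ok","involvedObject":{"kind":"Pod","name":"karpenter-abc"}}]}'

# Keep only Warning events and group them by the object they involve.
echo "$events_json" | jq -r '
  [.items[] | select(.type == "Warning")]
  | group_by(.involvedObject.kind + "/" + .involvedObject.name)
  | .[]
  | "\(.[0].involvedObject.kind)/\(.[0].involvedObject.name): \(length) warning(s), latest reason: \(.[-1].reason)"'
```

Against the sample this prints `Pod/karpenter-abc: 1 warning(s), latest reason: Unhealthy`, giving the per-object triage grouping the task describes.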

secrets:
  - name: kubeconfig
    description: "Kubeconfig with read-only cluster access"
    format: "Standard kubeconfig file"

platform:
  name: "kubernetes"
  cli_tools:
    - "kubectl"
    - "jq"
  auth_methods:
    - "kubeconfig"
  api_docs: "https://karpenter.sh/docs/"

related_bundles:
  - name: "k8s-namespace-healthcheck"
    relationship: "complements"
    notes: "General namespace triage; this bundle is Karpenter-scoped and adds webhook and controller-specific checks."
  - name: "k8s-karpenter-autoscaling-health"
    relationship: "complements"
    notes: "Provisioning and log-level autoscaling signals live in the sibling bundle; run control-plane checks first."

test_scenarios:
  - name: "healthy_karpenter_install"
    description: "Controller pods ready, webhooks configured, no warning events"
    expected_issues: 0

  - name: "controller_crashloop"
    description: "Karpenter pod in CrashLoopBackOff"
    expected_issues: 1
    expected_severities: [2]

notes: |
  Karpenter Helm chart layouts differ; implementers should discover controller
  workloads via common labels (app.kubernetes.io/name, app.kubernetes.io/instance)
  and document any required overrides. Do not assume an EKS-only install; keep
  checks portable. Cloud-specific CRDs are covered in the autoscaling-health bundle.
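Detecting installed CRD groups for check-karpenter-crds.sh can reduce to string handling on CRD names, as in this sketch. The CRD list is canned, and treating provisioners.karpenter.sh as the legacy v1alpha5 API (superseded by NodePool in v1beta1) is background knowledge about Karpenter rather than something this spec states:

```shell
# Live data would come from something like:
#   "${KUBERNETES_DISTRIBUTION_BINARY:-kubectl}" --context "${CONTEXT}" \
#     get crds -o custom-columns=NAME:.metadata.name --no-headers
crds='nodepools.karpenter.sh
ec2nodeclasses.karpenter.k8s.aws
provisioners.karpenter.sh'

# The API group is everything after the first dot of the CRD name.
printf '%s\n' "$crds" | cut -d. -f2- | sort -u

# A legacy Provisioner CRD next to newer APIs suggests a mixed install.
if printf '%s\n' "$crds" | grep -q '^provisioners\.karpenter\.sh$'; then
  echo "WARN: legacy Provisioner CRD present; deprecated and current APIs may be mixed"
fi
```

Here the group listing prints `karpenter.k8s.aws` and `karpenter.sh`, and the mixed-install warning fires, matching the task's severity-4 "flag mixed or deprecated APIs" intent.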

Metadata

Labels: completed (Agent work completed), design-spec (Architect has produced a design spec), new-codebundle (Scoped issue for SRE to implement a new CodeBundle)
