[draft] untaint controller #2753

Draft
levan-m wants to merge 1 commit into main from levan-m/untaint-controller

levan-m (Collaborator) commented Mar 12, 2026

What does this PR do?

Prototype an untaint controller that prevents workloads from being scheduled ahead of the Agent when nodes join the cluster.

When enabled, the untaint controller watches nodes and removes the agent.datadoghq.com/not-ready=presence:NoSchedule taint once the Agent on that node reaches the ready state. Until then, pods can't be scheduled on the node unless they tolerate this taint. This is useful during cluster scale-out, when new nodes join with the taint. Once the taint is removed, the Operator will not add it back for any reason.
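The removal rule described here can be sketched in a few lines. This is a minimal illustration only, not the controller's actual code: the Taint and Node classes are hypothetical stand-ins for the Kubernetes objects, agent_ready stands for however the controller decides the Agent is ready, and matching on key, value, and effect together is an assumption.

```python
from dataclasses import dataclass, field

# Taint key, value, and effect as given in the PR description.
NOT_READY_KEY = "agent.datadoghq.com/not-ready"
NOT_READY_VALUE = "presence"
NOT_READY_EFFECT = "NoSchedule"


@dataclass(frozen=True)
class Taint:
    # Hypothetical stand-in for a Kubernetes node taint.
    key: str
    value: str
    effect: str


@dataclass
class Node:
    # Hypothetical stand-in for a Kubernetes Node object.
    name: str
    taints: list = field(default_factory=list)


def is_not_ready_taint(taint: Taint) -> bool:
    # Assumption: the taint is matched on key, value, and effect together.
    return (
        taint.key == NOT_READY_KEY
        and taint.value == NOT_READY_VALUE
        and taint.effect == NOT_READY_EFFECT
    )


def reconcile(node: Node, agent_ready: bool) -> bool:
    """Drop the not-ready taint once the Agent is ready.

    Returns True if the node was changed. The taint is never re-added,
    so a later call on an untainted node is a no-op.
    """
    if not agent_ready:
        return False
    remaining = [t for t in node.taints if not is_not_ready_taint(t)]
    if len(remaining) == len(node.taints):
        return False
    node.taints = remaining
    return True
```

Unrelated taints on the node are left untouched, and calling reconcile again after the taint is gone changes nothing, which mirrors the "will not add it back" behavior.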

The feature can be enabled via the --untaintControllerEnabled flag argument. Optionally, taint removal events can be enabled with the DD_UNTAINT_CONTROLLER_EVENTS_ENABLED=true environment variable.

Motivation

#2052 and similar requests through other channels.

Additional Notes

Anything else we should know when reviewing?

Minimum Agent Versions

Are there minimum versions of the Datadog Agent and/or Cluster Agent required?

  • Agent: vX.Y.Z
  • Cluster Agent: vX.Y.Z

Describe your test plan

Can be tested locally on a kind cluster:

  1. Create a kind cluster with multiple nodes (assume 1 control plane, 3 workers).
  2. Deploy the Operator with the feature enabled.
  3. Taint two worker nodes:
kubectl taint nodes tainted-worker agent.datadoghq.com/not-ready=presence:NoSchedule
kubectl taint nodes tainted-worker2 agent.datadoghq.com/not-ready=presence:NoSchedule
  4. Create an nginx Deployment with 3 replicas and pod anti-affinity so that no two replicas are scheduled on the same node:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-test
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx-test
  template:
    metadata:
      labels:
        app: nginx-test
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - nginx-test
            topologyKey: kubernetes.io/hostname
      containers:
      - name: nginx
        image: nginx
  5. One replica should be scheduled on tainted-worker3; the other two will remain pending.
  6. Apply the DatadogAgent manifest (it adds a toleration to the Agent pods):
apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog-agent
spec:
  global:
    credentials:
      apiSecret:
        secretName: datadog-secret
        keyName: api-key
    kubelet:
      tlsVerify: false
  features:
    logCollection:
      enabled: true
      containerCollectUsingFiles: true
      containerCollectAll: true
  override:
    nodeAgent:
      tolerations:
        - key: agent.datadoghq.com/not-ready
          operator: Exists
          effect: NoSchedule
  7. Agents will be scheduled on all nodes. Once the Agents become ready on tainted-worker and tainted-worker2, the nginx pods will be scheduled there too.
# monitor events and taint removal
kubectl get event -w -n default | grep -i taint
kubectl get nodes -w -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
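
The test plan relies on the Agent pods tolerating the taint while the nginx pods do not. The sketch below illustrates that matching rule in simplified form. It follows the standard Kubernetes toleration semantics for the Exists and Equal operators, but the classes and function names are hypothetical and it is not the scheduler's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    # Hypothetical stand-in for a Kubernetes node taint.
    key: str
    value: str
    effect: str


@dataclass(frozen=True)
class Toleration:
    # Hypothetical stand-in for a pod toleration.
    key: str = ""
    operator: str = "Equal"  # Kubernetes treats an unset operator as Equal
    value: str = ""
    effect: str = ""  # an empty effect matches every effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if a single toleration matches a single taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.key and toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True
    if toleration.operator in ("", "Equal"):
        return toleration.value == taint.value
    return False


def can_schedule(tolerations: list, taints: list) -> bool:
    """A pod fits a node only if every hard taint is tolerated.

    PreferNoSchedule is a soft preference, so it is ignored here.
    """
    hard = [t for t in taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in hard)


not_ready = Taint("agent.datadoghq.com/not-ready", "presence", "NoSchedule")

# Toleration from the DatadogAgent manifest in the test plan.
agent_tolerations = [
    Toleration(
        key="agent.datadoghq.com/not-ready",
        operator="Exists",
        effect="NoSchedule",
    )
]
nginx_tolerations = []  # the nginx Deployment declares no tolerations

print(can_schedule(agent_tolerations, [not_ready]))
print(can_schedule(nginx_tolerations, [not_ready]))
print(can_schedule(nginx_tolerations, []))
```

With the taint present, only the Agent pod fits the node; after the taint is removed, the nginx pod fits as well, which is the transition the two watch commands should show.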

Checklist

  • PR has at least one valid label: bug, enhancement, refactoring, documentation, tooling, and/or dependencies
  • PR has a milestone or the qa/skip-qa label
  • All commits are signed (see: signing commits)

@levan-m levan-m force-pushed the levan-m/untaint-controller branch from 2dc7aa1 to 0b567c6 Compare March 12, 2026 21:45
@levan-m levan-m added the enhancement New feature or request label Mar 12, 2026

Labels

do-not-merge enhancement New feature or request
