Add autoscaling failover config to DatadogAgent CR #2723

Open
clamoriniere wants to merge 2 commits into main from dd/add-autoscaling-failover-config

@clamoriniere
Collaborator

What does this PR do?

Adds a new failover.enabled boolean option under spec.features.autoscaling.workload in the DatadogAgent CR to control the autoscaling failover mechanism. When the option is enabled (the default), the operator automatically sets the DD_AUTOSCALING_FAILOVER_ENABLED and DD_AUTOSCALING_FAILOVER_METRICS environment variables on the Cluster Agent and Node Agent components.
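
As a rough illustration of the defaulting behavior described above (not the operator's actual code), the sketch below treats an unset failover.enabled as true whenever workload autoscaling is on. The type and function names (FailoverConfig, WorkloadAutoscaling, failoverEnabled, failoverEnvVarNames) are hypothetical; only the two environment variable names come from this PR, and their values are left out because the description does not specify them.

```go
package main

import "fmt"

// FailoverConfig is a hypothetical stand-in for the new failover block.
// The real CRD type and field names may differ.
type FailoverConfig struct {
	Enabled *bool // nil means the user did not set the field
}

// WorkloadAutoscaling is a hypothetical stand-in for
// spec.features.autoscaling.workload.
type WorkloadAutoscaling struct {
	Enabled  bool
	Failover *FailoverConfig
}

// failoverEnabled applies the default described in the PR: failover is on
// whenever workload autoscaling is on, unless it is explicitly disabled.
func failoverEnabled(w WorkloadAutoscaling) bool {
	if !w.Enabled {
		return false
	}
	if w.Failover == nil || w.Failover.Enabled == nil {
		return true // unset: fall back to the default
	}
	return *w.Failover.Enabled
}

// failoverEnvVarNames lists the environment variables named in the PR
// description, or nil when failover is off. Values are intentionally omitted.
func failoverEnvVarNames(w WorkloadAutoscaling) []string {
	if !failoverEnabled(w) {
		return nil
	}
	return []string{
		"DD_AUTOSCALING_FAILOVER_ENABLED",
		"DD_AUTOSCALING_FAILOVER_METRICS",
	}
}

func main() {
	off := false
	byDefault := WorkloadAutoscaling{Enabled: true}
	disabled := WorkloadAutoscaling{Enabled: true, Failover: &FailoverConfig{Enabled: &off}}

	fmt.Println(failoverEnabled(byDefault), failoverEnvVarNames(byDefault))
	fmt.Println(failoverEnabled(disabled), failoverEnvVarNames(disabled))
}
```

A pointer is used for the boolean so that "not set" can be told apart from an explicit false, which is what lets the default be true.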

Motivation

Previously, users who enabled workload autoscaling had to manually add DD_AUTOSCALING_FAILOVER_ENABLED environment variable overrides to both clusterAgent and nodeAgent sections. This change makes failover a first-class configuration option that is enabled by default when workload autoscaling is active, removing the need for manual env var overrides.

Additional Notes

Failover is enabled by default (true) when workload autoscaling is enabled. Users can disable it explicitly:

spec:
  features:
    autoscaling:
      workload:
        enabled: true
        failover:
          enabled: false

Minimum Agent Versions

  • Agent: N/A
  • Cluster Agent: N/A

Describe your test plan

  • Unit tests updated and extended with two new test cases: explicit failover enabled and failover disabled
  • All 7 tests pass
  • go vet and gofmt clean
  • CRD, deepcopy, and OpenAPI schemas regenerated

Checklist

  • PR has at least one valid label: bug, enhancement, refactoring, documentation, tooling, and/or dependencies
  • PR has a milestone or the qa/skip-qa label
  • All commits are signed (see: signing commits)

PR by Bits

Co-authored-by: clamoriniere <cedric.lamoriniere@datadoghq.com>
@datadog-prod-us1-3

Bits Dev status: ✅ Done

@datadog-prod-us1-3

I can only run on private repositories.

@clamoriniere added this to the v1.25.0 milestone Mar 9, 2026

codecov-commenter commented Mar 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 38.81%. Comparing base (a2fb04f) to head (0c544aa).

@@           Coverage Diff           @@
##             main    #2723   +/-   ##
=======================================
  Coverage   38.81%   38.81%           
=======================================
  Files         308      308           
  Lines       26705    26707    +2     
=======================================
+ Hits        10365    10367    +2     
  Misses      15561    15561           
  Partials      779      779           
Flag Coverage Δ
unittests 38.81% <100.00%> (+<0.01%) ⬆️

Files with missing lines Coverage Δ
...roller/datadogagent/feature/autoscaling/feature.go 84.61% <100.00%> (+0.40%) ⬆️

@clamoriniere marked this pull request as ready for review March 12, 2026 21:41
@clamoriniere requested a review from a team March 12, 2026 21:41
@clamoriniere requested review from a team as code owners March 12, 2026 21:41
Member

@tbavelier left a comment

Discussing offline, so "blocking" the PR until then: failover was already always enabled when configured with the operator. The metrics env var was only added on the node Agent, not on the DCA.

@clamoriniere
Collaborator Author

Ok, when I checked that it was missing, it was indeed in the cluster-agent.
Still, we need to give a way to disable it in some cases.
I will update the PR.
