Add Fleet Automation experiment lifecycle for remote agent management by zhuminyi · Pull Request #2760 · DataDog/datadog-operator

zhuminyi · 2026-03-16T13:19:02Z

Implement the full experiment lifecycle for Fleet Automation, enabling the operator to receive experiment signals (start/stop/promote) via Remote Config and manage ControllerRevision-based spec snapshots for rollback.

CRD changes

Add to DatadogAgentStatus:

CurrentRevision / PreviousRevision: track ControllerRevision pointers
Experiment: ExperimentStatus with phase, startedAt, baselineRevision, id, and expectedSpecHash fields
ExperimentPhase enum: running, rollback, promoted, aborted, timeout
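
As a rough illustration, the status additions listed above might look like the following Go types. This is a minimal sketch, not the PR's actual code: the field and phase names come from the description, while the Go identifiers, JSON tags, the use of `time.Time` in place of a Kubernetes time type, and the sample values in `main` are all assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// ExperimentPhase mirrors the enum in the description. The string values are
// assumed to be the lowercase names listed there.
type ExperimentPhase string

const (
	ExperimentPhaseRunning  ExperimentPhase = "running"
	ExperimentPhaseRollback ExperimentPhase = "rollback"
	ExperimentPhasePromoted ExperimentPhase = "promoted"
	ExperimentPhaseAborted  ExperimentPhase = "aborted"
	ExperimentPhaseTimeout  ExperimentPhase = "timeout"
)

// ExperimentStatus holds the five fields named in the description.
// JSON tags are hypothetical.
type ExperimentStatus struct {
	Phase            ExperimentPhase `json:"phase,omitempty"`
	StartedAt        *time.Time      `json:"startedAt,omitempty"`
	BaselineRevision string          `json:"baselineRevision,omitempty"`
	ID               string          `json:"id,omitempty"`
	ExpectedSpecHash string          `json:"expectedSpecHash,omitempty"`
}

// DatadogAgentStatus shows only the fields this PR adds.
type DatadogAgentStatus struct {
	CurrentRevision  string            `json:"currentRevision,omitempty"`
	PreviousRevision string            `json:"previousRevision,omitempty"`
	Experiment       *ExperimentStatus `json:"experiment,omitempty"`
}

func main() {
	// All values below are made up for illustration.
	started := time.Date(2026, 3, 16, 13, 0, 0, 0, time.UTC)
	status := DatadogAgentStatus{
		CurrentRevision:  "datadog-0123456789",
		PreviousRevision: "datadog-abcdef0123",
		Experiment: &ExperimentStatus{
			Phase:            ExperimentPhaseRunning,
			StartedAt:        &started,
			BaselineRevision: "datadog-0123456789",
			ID:               "exp-1",
		},
	}
	out, err := json.MarshalIndent(status, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```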

Reconciler side (experiment package)

New package: internal/controller/datadogagent/experiment/

ControllerRevision creation with hash-based naming ({dda}-{md5[:10]})
Revision pointer tracking on every spec change
Experiment phase handling hooked into internalReconcileV2 after defaults
Timeout detection: auto-rollback when now - startedAt >= 30min
Conflict detection: abort on external spec edits during experiment
ExpectedSpecHash validation: RC computes hash of defaulted FA config; reconciler verifies spec matches on first reconcile, catching user edits between RC patch and first reconcile. Hash survives status-before-spec race (RC status update arrives before spec patch).
Spec restoration from ControllerRevision baseline
GC of old ControllerRevisions (keep current + previous + baseline)
Status fields preserved in generateNewStatusFromDDA
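
Two of the reconciler rules above are simple enough to sketch: the `{dda}-{md5[:10]}` revision name and the 30-minute timeout check. The sketch below is an illustration under assumptions, not the package's real API: `RevisionName`, `Experiment`, and `TimedOut` are hypothetical names, and hashing the serialized spec bytes is a guess at what the hash covers.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"time"
)

// ExperimentTimeout is the 30-minute limit stated in the description.
const ExperimentTimeout = 30 * time.Minute

// RevisionName builds "{dda}-{md5[:10]}". Hashing the serialized spec bytes is
// an assumption; the description only gives the name format.
func RevisionName(ddaName string, specBytes []byte) string {
	sum := md5.Sum(specBytes)
	return fmt.Sprintf("%s-%s", ddaName, hex.EncodeToString(sum[:])[:10])
}

// Experiment carries only the field needed for the timeout check.
type Experiment struct {
	StartedAt time.Time
}

// TimedOut reports whether now - startedAt >= 30min.
func (e Experiment) TimedOut(now time.Time) bool {
	return now.Sub(e.StartedAt) >= ExperimentTimeout
}

func main() {
	// Made-up spec payload; the same bytes always yield the same name.
	spec := []byte(`{"global":{"site":"datadoghq.com"}}`)
	fmt.Println(RevisionName("datadog", spec))

	exp := Experiment{StartedAt: time.Now().Add(-31 * time.Minute)}
	fmt.Println("timed out:", exp.TimedOut(time.Now()))
}
```

Passing `now` in explicitly, rather than calling `time.Now()` inside the check, keeps the boundary case (exactly 30 minutes) easy to test.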

RC callback side (remoteconfig package)

New file: pkg/remoteconfig/experiment.go

ExperimentSignal type for FA payloads (action + experiment_id + config)
parseExperimentSignal: detects experiment signals vs regular agent configs
handleStartExperiment: sets phase=running, locks baselineRevision, computes expectedSpecHash with defaults applied, patches spec
handleStopExperiment: sets phase=rollback (only from running, rejects aborted)
handlePromoteExperiment: sets phase=promoted (only from running)
Re-fetches DDA after status update to avoid resourceVersion conflicts
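
To make the signal detection concrete, here is a minimal sketch of how an experiment signal could be told apart from a regular agent config. The JSON keys (`action`, `experiment_id`, `config`) follow the description; the action strings `start`, `stop`, and `promote`, the exported function name, and the rule "a recognized action means experiment signal" are assumptions, not the PR's confirmed behavior.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ExperimentSignal models the FA payload: action + experiment_id + config.
type ExperimentSignal struct {
	Action       string          `json:"action"`
	ExperimentID string          `json:"experiment_id"`
	Config       json.RawMessage `json:"config,omitempty"`
}

// ParseExperimentSignal returns (signal, true) when raw carries a recognized
// action, and (nil, false) when it should follow the regular config path.
// Treating invalid JSON as "not a signal" is a simplification.
func ParseExperimentSignal(raw []byte) (*ExperimentSignal, bool) {
	var sig ExperimentSignal
	if err := json.Unmarshal(raw, &sig); err != nil {
		return nil, false
	}
	switch sig.Action {
	case "start", "stop", "promote":
		return &sig, true
	default:
		return nil, false
	}
}

func main() {
	payloads := [][]byte{
		[]byte(`{"action":"start","experiment_id":"exp-1","config":{"features":{}}}`),
		[]byte(`{"config":{"features":{}}}`), // no action: regular config
	}
	for _, p := range payloads {
		if sig, ok := ParseExperimentSignal(p); ok {
			fmt.Printf("experiment signal: action=%s id=%s\n", sig.Action, sig.ExperimentID)
		} else {
			fmt.Println("regular agent config")
		}
	}
}
```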

RBAC

Updated controllerrevisions verbs from list;watch to get;list;watch;create;update;delete across all manifests:

config/rbac/role.yaml (via kubebuilder marker)
bundle/manifests/datadog-operator.clusterserviceversion.yaml
marketplaces/addon_manifest.yaml
marketplaces/charts/google-marketplace/schema.yaml

Tests

59 unit tests covering:

ControllerRevision CRUD, naming, ownerRef, GC (17 tests)
Experiment phase transitions: start, promote, rollback, abort (8 tests)
Timeout detection and post-restart recovery (5 tests)
Conflict detection with specChanged flag (5 tests)
First-reconcile validation via ExpectedSpecHash (3 tests)
Status-before-spec race: hash survives until spec arrives (1 test)
Revision pointer tracking (3 tests)
RC signal parsing and handlers (16 tests)
BuildKeepSet (3 tests)
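
For readers unfamiliar with the keep-set idea behind the GC rule (keep current + previous + baseline), a minimal sketch follows. `BuildKeepSet` is named in the test list, but its real signature is not shown in this PR page, so the signature here and the `StaleRevisions` helper are hypothetical.

```go
package main

import "fmt"

// BuildKeepSet returns the revision names GC must not delete.
// Empty names are skipped, since a pointer may be unset.
func BuildKeepSet(current, previous, baseline string) map[string]struct{} {
	keep := make(map[string]struct{}, 3)
	for _, name := range []string{current, previous, baseline} {
		if name != "" {
			keep[name] = struct{}{}
		}
	}
	return keep
}

// StaleRevisions filters existing revision names down to those not in keep.
func StaleRevisions(existing []string, keep map[string]struct{}) []string {
	var stale []string
	for _, name := range existing {
		if _, ok := keep[name]; !ok {
			stale = append(stale, name)
		}
	}
	return stale
}

func main() {
	// Made-up revision names; no experiment is running, so baseline is empty.
	keep := BuildKeepSet("datadog-ccc", "datadog-bbb", "")
	existing := []string{"datadog-aaa", "datadog-bbb", "datadog-ccc"}
	fmt.Println("delete:", StaleRevisions(existing, keep))
}
```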


zhuminyi · 2026-03-16T13:38:40Z

@codex review

codecov-commenter · 2026-03-16T14:34:50Z

Codecov Report

❌ Patch coverage is 64.85014% with 129 lines in your changes missing coverage. Please review.
✅ Project coverage is 39.15%. Comparing base (8669701) to head (72fd331).
⚠️ Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
pkg/remoteconfig/experiment.go	69.59%	27 Missing and 18 partials ⚠️
...nal/controller/datadogagent/experiment/revision.go	68.67%	14 Missing and 12 partials ⚠️
pkg/remoteconfig/updater.go	0.00%	24 Missing ⚠️
...al/controller/datadogagent/experiment/lifecycle.go	76.84%	14 Missing and 8 partials ⚠️
...controller/datadogagent/controller_reconcile_v2.go	33.33%	2 Missing and 2 partials ⚠️
...ler/datadogagent/controller_reconcile_v2_common.go	20.00%	2 Missing and 2 partials ⚠️
...er/datadogagent/controller_reconcile_v2_helpers.go	50.00%	1 Missing and 1 partial ⚠️
.../controller/datadogagent/testutils/client_utils.go	0.00%	2 Missing ⚠️

❌ Your patch status has failed because the patch coverage (64.85%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2760      +/-   ##
==========================================
+ Coverage   38.80%   39.15%   +0.35%     
==========================================
  Files         309      312       +3     
  Lines       26750    27114     +364     
==========================================
+ Hits        10379    10617     +238     
- Misses      15592    15675      +83     
- Partials      779      822      +43

Flag	Coverage Δ
unittests	`39.15% <64.85%> (+0.35%)`	⬆️

Files with missing lines	Coverage Δ
internal/controller/datadogagent_controller.go	`66.66% <ø> (ø)`
...er/datadogagent/controller_reconcile_v2_helpers.go	`64.53% <50.00%> (-0.30%)`	⬇️
.../controller/datadogagent/testutils/client_utils.go	`0.00% <0.00%> (ø)`
...controller/datadogagent/controller_reconcile_v2.go	`59.59% <33.33%> (-0.83%)`	⬇️
...ler/datadogagent/controller_reconcile_v2_common.go	`32.83% <20.00%> (-0.14%)`	⬇️
...al/controller/datadogagent/experiment/lifecycle.go	`76.84% <76.84%> (ø)`
pkg/remoteconfig/updater.go	`0.00% <0.00%> (ø)`
...nal/controller/datadogagent/experiment/revision.go	`68.67% <68.67%> (ø)`
pkg/remoteconfig/experiment.go	`69.59% <69.59%> (ø)`

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f1d56e2aaa

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

pkg/remoteconfig/experiment.go

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eef163f2a8

pkg/remoteconfig/experiment.go

Implement the full experiment lifecycle for Fleet Automation, enabling the operator to receive experiment signals (start/stop/promote) via Remote Config and manage ControllerRevision-based spec snapshots for rollback.

## CRD changes

Add to DatadogAgentStatus:

- CurrentRevision / PreviousRevision: track ControllerRevision pointers
- Experiment: ExperimentStatus with phase, startedAt, baselineRevision, id, and expectedSpecHash fields
- ExperimentPhase enum: running, rollback, promoted, aborted, timeout

## Reconciler side (experiment package)

New package: internal/controller/datadogagent/experiment/

- ControllerRevision creation with hash-based naming ({dda}-{md5[:10]})
- Revision pointer tracking on every spec change
- Experiment phase handling hooked into internalReconcileV2 after defaults
- Timeout detection: auto-rollback when now - startedAt >= 30min
- Conflict detection via ExpectedSpecHash: RC computes hash of defaulted FA config; reconciler verifies spec matches on first reconcile, catching user edits between RC patch and first reconcile. Hash survives status-before-spec race.
- Spec restoration from ControllerRevision baseline with direct status persistence (re-fetch after spec update to avoid resourceVersion conflict)
- Two-reconcile terminal phase pattern: rollback/timeout phase is persisted and observable before being cleared on the next reconcile
- GC of old ControllerRevisions (keep current + previous + baseline)
- Status fields preserved in generateNewStatusFromDDA

## RC callback side (remoteconfig package)

New file: pkg/remoteconfig/experiment.go

- ExperimentSignal type for FA payloads (action + experiment_id + config)
- parseExperimentSignal: detects experiment signals vs regular agent configs
- Signal routing in agentConfigUpdateCallback: experiment signals handled separately, regular configs continue through normal path. Mixed batches handled correctly with per-update ACK/error reporting.
- handleStartExperiment: guards (no config, no baseline, active experiment in any phase). Allows same-ID retry if ExpectedSpecHash still set (partial failure recovery). Sets phase=running, locks baselineRevision, computes expectedSpecHash with defaults applied, patches spec.
- handleStopExperiment: validates phase=running and signal ID matches running experiment ID. Sets phase=rollback.
- handlePromoteExperiment: same validation. Sets phase=promoted.

## RBAC

Updated controllerrevisions verbs from list;watch to get;list;watch;create;update;delete across all manifests:

- config/rbac/role.yaml (via kubebuilder marker)
- bundle/manifests/datadog-operator.clusterserviceversion.yaml
- marketplaces/addon_manifest.yaml
- marketplaces/charts/google-marketplace/schema.yaml

## Tests (68 unit tests)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
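
The stop and promote guards described in this commit message (phase must be running, and the signal ID must match the running experiment's ID) can be sketched as follows. This is an illustration only: the type names, error values, and in-memory mutation are assumptions, and the real handlers also persist status through the Kubernetes API, which is omitted here.

```go
package main

import (
	"errors"
	"fmt"
)

type Phase string

const (
	PhaseRunning  Phase = "running"
	PhaseRollback Phase = "rollback"
	PhasePromoted Phase = "promoted"
)

// Experiment is a stripped-down stand-in for the experiment status.
type Experiment struct {
	ID    string
	Phase Phase
}

// Hypothetical error values for the two rejection cases.
var (
	ErrNotRunning = errors.New("no running experiment")
	ErrIDMismatch = errors.New("signal ID does not match running experiment")
)

// validate applies the shared checks for stop and promote signals.
func validate(exp *Experiment, signalID string) error {
	if exp == nil || exp.Phase != PhaseRunning {
		return ErrNotRunning
	}
	if exp.ID != signalID {
		return ErrIDMismatch
	}
	return nil
}

// Stop moves a running experiment to rollback.
func Stop(exp *Experiment, signalID string) error {
	if err := validate(exp, signalID); err != nil {
		return err
	}
	exp.Phase = PhaseRollback
	return nil
}

// Promote moves a running experiment to promoted.
func Promote(exp *Experiment, signalID string) error {
	if err := validate(exp, signalID); err != nil {
		return err
	}
	exp.Phase = PhasePromoted
	return nil
}

func main() {
	exp := &Experiment{ID: "exp-1", Phase: PhaseRunning}
	fmt.Println(Stop(exp, "exp-0"), exp.Phase) // stale signal is rejected
	fmt.Println(Stop(exp, "exp-1"), exp.Phase)
	fmt.Println(Promote(exp, "exp-1"), exp.Phase) // no longer running
}
```

Sharing one validation function keeps the two handlers from drifting apart, which matters because a stale stop or promote signal from an earlier experiment must be rejected the same way in both.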

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 17e34e1f6e

pkg/remoteconfig/experiment.go

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 782e64e423

pkg/remoteconfig/experiment.go

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: df3d50dab4

pkg/remoteconfig/experiment.go

zhuminyi force-pushed the minyi/fleet-automation-experiment-lifecycle branch from cd53bea to 8c8ddc0 Compare March 16, 2026 13:33

zhuminyi force-pushed the minyi/fleet-automation-experiment-lifecycle branch from 8c8ddc0 to 0c8d2d5 Compare March 16, 2026 13:49

zhuminyi force-pushed the minyi/fleet-automation-experiment-lifecycle branch from 0c8d2d5 to b2f2478 Compare March 16, 2026 14:00

zhuminyi force-pushed the minyi/fleet-automation-experiment-lifecycle branch from ef622c5 to f1d56e2 Compare March 16, 2026 15:45

chatgpt-codex-connector bot reviewed Mar 16, 2026

zhuminyi force-pushed the minyi/fleet-automation-experiment-lifecycle branch from f1d56e2 to eef163f Compare March 17, 2026 09:19

chatgpt-codex-connector bot reviewed Mar 17, 2026

zhuminyi added this to the v1.26.0 milestone Mar 17, 2026

zhuminyi added the enhancement New feature or request label Mar 17, 2026

zhuminyi force-pushed the minyi/fleet-automation-experiment-lifecycle branch 2 times, most recently from 118fb27 to 17e34e1 Compare March 17, 2026 13:22

chatgpt-codex-connector bot reviewed Mar 17, 2026

Merge FA patch

e86d076

zhuminyi force-pushed the minyi/fleet-automation-experiment-lifecycle branch from 17e34e1 to e86d076 Compare March 17, 2026 13:40

fix for stale stop/promote signal scenario

782e64e

chatgpt-codex-connector bot reviewed Mar 17, 2026

zhuminyi force-pushed the minyi/fleet-automation-experiment-lifecycle branch from 8133090 to df3d50d Compare March 17, 2026 14:07

chatgpt-codex-connector bot reviewed Mar 17, 2026

Experiment ID is enforced

72fd331

zhuminyi force-pushed the minyi/fleet-automation-experiment-lifecycle branch from df3d50d to 72fd331 Compare March 17, 2026 16:12

2 participants
