Add Fleet Automation experiment lifecycle for remote agent management #2760

Draft
zhuminyi wants to merge 4 commits into main from minyi/fleet-automation-experiment-lifecycle

@zhuminyi
Contributor

Implement the full experiment lifecycle for Fleet Automation, enabling the operator to receive experiment signals (start/stop/promote) via Remote Config and manage ControllerRevision-based spec snapshots for rollback.

CRD changes

Add to DatadogAgentStatus:

  • CurrentRevision / PreviousRevision: track ControllerRevision pointers
  • Experiment: ExperimentStatus with phase, startedAt, baselineRevision, id, and expectedSpecHash fields
  • ExperimentPhase enum: running, rollback, promoted, aborted, timeout
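As a rough illustration of the fields listed above, the status additions might look like the Go types below. This is a sketch under assumptions, not the PR's actual definitions: `StatusSketch`, `Render`, the JSON tags, and the use of `*time.Time` (the real CRD would more likely use `metav1.Time`) are all hypothetical and chosen only to keep the example self-contained.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// ExperimentPhase mirrors the enum values named in the description.
type ExperimentPhase string

const (
	PhaseRunning  ExperimentPhase = "running"
	PhaseRollback ExperimentPhase = "rollback"
	PhasePromoted ExperimentPhase = "promoted"
	PhaseAborted  ExperimentPhase = "aborted"
	PhaseTimeout  ExperimentPhase = "timeout"
)

// ExperimentStatus is an assumed shape: field names follow the description,
// JSON tags and Go types are guesses.
type ExperimentStatus struct {
	Phase            ExperimentPhase `json:"phase,omitempty"`
	StartedAt        *time.Time      `json:"startedAt,omitempty"`
	BaselineRevision string          `json:"baselineRevision,omitempty"`
	ID               string          `json:"id,omitempty"`
	ExpectedSpecHash string          `json:"expectedSpecHash,omitempty"`
}

// StatusSketch stands in for the new DatadogAgentStatus fields only (hypothetical type).
type StatusSketch struct {
	CurrentRevision  string            `json:"currentRevision,omitempty"`
	PreviousRevision string            `json:"previousRevision,omitempty"`
	Experiment       *ExperimentStatus `json:"experiment,omitempty"`
}

// Render serializes the sketch roughly as it would appear under .status.
func Render(s StatusSketch) string {
	b, err := json.Marshal(s)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	started := time.Date(2026, 3, 17, 9, 0, 0, 0, time.UTC)
	fmt.Println(Render(StatusSketch{
		CurrentRevision:  "datadog-0123456789",
		PreviousRevision: "datadog-abcdef0123",
		Experiment: &ExperimentStatus{
			Phase:            PhaseRunning,
			StartedAt:        &started,
			BaselineRevision: "datadog-abcdef0123",
			ID:               "exp-1",
		},
	}))
}
```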

Reconciler side (experiment package)

New package: internal/controller/datadogagent/experiment/

  • ControllerRevision creation with hash-based naming ({dda}-{md5[:10]})
  • Revision pointer tracking on every spec change
  • Experiment phase handling hooked into internalReconcileV2 after defaults
  • Timeout detection: auto-rollback when now - startedAt >= 30min
  • Conflict detection: abort on external spec edits during experiment
  • ExpectedSpecHash validation: RC computes a hash of the FA config with defaults applied; on the first reconcile, the reconciler verifies that the spec matches it, which catches user edits made between the RC patch and that reconcile. The hash is kept through the status-before-spec race (the RC status update arriving before the spec patch).
  • Spec restoration from ControllerRevision baseline
  • GC of old ControllerRevisions (keep current + previous + baseline)
  • Status fields preserved in generateNewStatusFromDDA
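Two of the rules in the list above, the `{dda}-{md5[:10]}` naming scheme and the 30-minute timeout, can be sketched as follows. The function names `RevisionName` and `TimedOut`, and the choice to hash the raw serialized spec bytes, are assumptions for illustration; the PR may hash a different representation.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"time"
)

// ExperimentTimeout is the 30-minute limit stated in the description.
const ExperimentTimeout = 30 * time.Minute

// exampleStart is a fixed timestamp used only by the demo in main.
var exampleStart = time.Date(2026, 3, 17, 9, 0, 0, 0, time.UTC)

// RevisionName builds "{dda}-{md5[:10]}": the DDA name, a dash, and the
// first 10 hex characters of the MD5 digest of the serialized spec.
// Which bytes get hashed is an assumption of this sketch.
func RevisionName(ddaName string, specBytes []byte) string {
	sum := md5.Sum(specBytes)
	return ddaName + "-" + hex.EncodeToString(sum[:])[:10]
}

// TimedOut reports whether now - startedAt >= 30min.
func TimedOut(startedAt, now time.Time) bool {
	return now.Sub(startedAt) >= ExperimentTimeout
}

func main() {
	spec := []byte(`{"global":{"site":"datadoghq.com"}}`)
	fmt.Println(RevisionName("datadog", spec))
	fmt.Println(TimedOut(exampleStart, exampleStart.Add(29*time.Minute)))
	fmt.Println(TimedOut(exampleStart, exampleStart.Add(30*time.Minute)))
}
```

Because the name is derived from the content, creating a revision for an unchanged spec resolves to the same name, which is presumably what makes the create step idempotent.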

RC callback side (remoteconfig package)

New file: pkg/remoteconfig/experiment.go

  • ExperimentSignal type for FA payloads (action + experiment_id + config)
  • parseExperimentSignal: detects experiment signals vs regular agent configs
  • handleStartExperiment: sets phase=running, locks baselineRevision, computes expectedSpecHash with defaults applied, patches spec
  • handleStopExperiment: sets phase=rollback (only from running, rejects aborted)
  • handlePromoteExperiment: sets phase=promoted (only from running)
  • Re-fetches DDA after status update to avoid resourceVersion conflicts
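To make the signal detection described above concrete, here is a minimal sketch of how a payload could be classified. The JSON keys `action`, `experiment_id`, and `config` and the actions start/stop/promote come from the description; the function name `ParseSignal`, its return shape, and the rule that any other action means "regular agent config" are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ExperimentSignal models the FA payload: action + experiment_id + config.
type ExperimentSignal struct {
	Action       string          `json:"action"`
	ExperimentID string          `json:"experiment_id"`
	Config       json.RawMessage `json:"config,omitempty"`
}

// ParseSignal returns (signal, true, nil) for an experiment signal and
// (nil, false, nil) for any other valid JSON object, which the caller
// would route through the regular agent-config path (assumed behavior).
func ParseSignal(raw []byte) (*ExperimentSignal, bool, error) {
	var sig ExperimentSignal
	if err := json.Unmarshal(raw, &sig); err != nil {
		return nil, false, fmt.Errorf("decode payload: %w", err)
	}
	switch sig.Action {
	case "start", "stop", "promote":
		return &sig, true, nil
	default:
		return nil, false, nil
	}
}

func main() {
	payloads := []string{
		`{"action":"start","experiment_id":"exp-1","config":{"features":{}}}`,
		`{"id":"regular-config"}`,
	}
	for _, p := range payloads {
		sig, isExperiment, err := ParseSignal([]byte(p))
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		if isExperiment {
			fmt.Println("experiment signal:", sig.Action, sig.ExperimentID)
		} else {
			fmt.Println("regular agent config")
		}
	}
}
```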

RBAC

Updated controllerrevisions verbs from list;watch to get;list;watch;create;update;delete across all manifests:

  • config/rbac/role.yaml (via kubebuilder marker)
  • bundle/manifests/datadog-operator.clusterserviceversion.yaml
  • marketplaces/addon_manifest.yaml
  • marketplaces/charts/google-marketplace/schema.yaml

Tests

59 unit tests covering:

  • ControllerRevision CRUD, naming, ownerRef, GC (17 tests)
  • Experiment phase transitions: start, promote, rollback, abort (8 tests)
  • Timeout detection and post-restart recovery (5 tests)
  • Conflict detection with specChanged flag (5 tests)
  • First-reconcile validation via ExpectedSpecHash (3 tests)
  • Status-before-spec race: hash survives until spec arrives (1 test)
  • Revision pointer tracking (3 tests)
  • RC signal parsing and handlers (16 tests)
  • BuildKeepSet (3 tests)
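The GC rule exercised by the BuildKeepSet tests (keep current + previous + baseline) can be sketched as below. `BuildKeepSet` is named in the PR, but its signature here is a guess, and `RevisionsToDelete` is a hypothetical helper added for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// BuildKeepSet collects the revision names that GC must not delete.
// Empty names (for example, no previous revision yet) are skipped, and
// duplicates collapse naturally because the result is a set.
func BuildKeepSet(current, previous, baseline string) map[string]struct{} {
	keep := make(map[string]struct{}, 3)
	for _, name := range []string{current, previous, baseline} {
		if name != "" {
			keep[name] = struct{}{}
		}
	}
	return keep
}

// RevisionsToDelete returns, in sorted order, every revision name that is
// not in the keep set. Hypothetical helper, not taken from the PR.
func RevisionsToDelete(all []string, keep map[string]struct{}) []string {
	var stale []string
	for _, name := range all {
		if _, ok := keep[name]; !ok {
			stale = append(stale, name)
		}
	}
	sort.Strings(stale)
	return stale
}

func main() {
	// During an experiment the baseline is often also the previous revision.
	keep := BuildKeepSet("dda-c", "dda-b", "dda-b")
	fmt.Println(RevisionsToDelete([]string{"dda-a", "dda-b", "dda-c", "dda-0"}, keep))
}
```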


@zhuminyi zhuminyi force-pushed the minyi/fleet-automation-experiment-lifecycle branch from cd53bea to 8c8ddc0 Compare March 16, 2026 13:33
@zhuminyi
Contributor Author

@codex review


@zhuminyi zhuminyi force-pushed the minyi/fleet-automation-experiment-lifecycle branch from 8c8ddc0 to 0c8d2d5 Compare March 16, 2026 13:49

@zhuminyi zhuminyi force-pushed the minyi/fleet-automation-experiment-lifecycle branch from 0c8d2d5 to b2f2478 Compare March 16, 2026 14:00

codecov-commenter commented Mar 16, 2026

Codecov Report

❌ Patch coverage is 64.85014% with 129 lines in your changes missing coverage. Please review.
✅ Project coverage is 39.15%. Comparing base (8669701) to head (72fd331).
⚠️ Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/remoteconfig/experiment.go | 69.59% | 27 Missing and 18 partials ⚠️ |
| ...nal/controller/datadogagent/experiment/revision.go | 68.67% | 14 Missing and 12 partials ⚠️ |
| pkg/remoteconfig/updater.go | 0.00% | 24 Missing ⚠️ |
| ...al/controller/datadogagent/experiment/lifecycle.go | 76.84% | 14 Missing and 8 partials ⚠️ |
| ...controller/datadogagent/controller_reconcile_v2.go | 33.33% | 2 Missing and 2 partials ⚠️ |
| ...ler/datadogagent/controller_reconcile_v2_common.go | 20.00% | 2 Missing and 2 partials ⚠️ |
| ...er/datadogagent/controller_reconcile_v2_helpers.go | 50.00% | 1 Missing and 1 partial ⚠️ |
| .../controller/datadogagent/testutils/client_utils.go | 0.00% | 2 Missing ⚠️ |

❌ Your patch status has failed because the patch coverage (64.85%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2760      +/-   ##
==========================================
+ Coverage   38.80%   39.15%   +0.35%     
==========================================
  Files         309      312       +3     
  Lines       26750    27114     +364     
==========================================
+ Hits        10379    10617     +238     
- Misses      15592    15675      +83     
- Partials      779      822      +43     
| Flag | Coverage Δ |
|---|---|
| unittests | 39.15% <64.85%> (+0.35%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|---|---|
| internal/controller/datadogagent_controller.go | 66.66% <ø> (ø) |
| ...er/datadogagent/controller_reconcile_v2_helpers.go | 64.53% <50.00%> (-0.30%) ⬇️ |
| .../controller/datadogagent/testutils/client_utils.go | 0.00% <0.00%> (ø) |
| ...controller/datadogagent/controller_reconcile_v2.go | 59.59% <33.33%> (-0.83%) ⬇️ |
| ...ler/datadogagent/controller_reconcile_v2_common.go | 32.83% <20.00%> (-0.14%) ⬇️ |
| ...al/controller/datadogagent/experiment/lifecycle.go | 76.84% <76.84%> (ø) |
| pkg/remoteconfig/updater.go | 0.00% <0.00%> (ø) |
| ...nal/controller/datadogagent/experiment/revision.go | 68.67% <68.67%> (ø) |
| pkg/remoteconfig/experiment.go | 69.59% <69.59%> (ø) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8669701...72fd331.


@zhuminyi zhuminyi force-pushed the minyi/fleet-automation-experiment-lifecycle branch from ef622c5 to f1d56e2 Compare March 16, 2026 15:45
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f1d56e2aaa

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@zhuminyi zhuminyi force-pushed the minyi/fleet-automation-experiment-lifecycle branch from f1d56e2 to eef163f Compare March 17, 2026 09:19
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Reviewed commit: eef163f2a8

@zhuminyi zhuminyi added this to the v1.26.0 milestone Mar 17, 2026
@zhuminyi zhuminyi added the enhancement New feature or request label Mar 17, 2026
Implement the full experiment lifecycle for Fleet Automation, enabling
the operator to receive experiment signals (start/stop/promote) via
Remote Config and manage ControllerRevision-based spec snapshots for
rollback.

## CRD changes

Add to DatadogAgentStatus:
- CurrentRevision / PreviousRevision: track ControllerRevision pointers
- Experiment: ExperimentStatus with phase, startedAt, baselineRevision,
  id, and expectedSpecHash fields
- ExperimentPhase enum: running, rollback, promoted, aborted, timeout

## Reconciler side (experiment package)

New package: internal/controller/datadogagent/experiment/

- ControllerRevision creation with hash-based naming ({dda}-{md5[:10]})
- Revision pointer tracking on every spec change
- Experiment phase handling hooked into internalReconcileV2 after defaults
- Timeout detection: auto-rollback when now - startedAt >= 30min
- Conflict detection via ExpectedSpecHash: RC computes hash of defaulted
  FA config; reconciler verifies spec matches on first reconcile, catching
  user edits between RC patch and first reconcile. Hash survives
  status-before-spec race.
- Spec restoration from ControllerRevision baseline with direct status
  persistence (re-fetch after spec update to avoid resourceVersion conflict)
- Two-reconcile terminal phase pattern: rollback/timeout phase is persisted
  and observable before being cleared on the next reconcile
- GC of old ControllerRevisions (keep current + previous + baseline)
- Status fields preserved in generateNewStatusFromDDA
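One possible reading of the ExpectedSpecHash check described above is a three-way decision on the first reconcile of an experiment. The sketch below is an interpretation, not the PR's code: the `Verdict` type, `CheckFirstReconcile`, the use of SHA-256 in `HashSpec`, and the idea of comparing against a baseline hash to recognize the status-before-spec race are all assumptions.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Verdict is a hypothetical outcome of the first-reconcile check.
type Verdict string

const (
	VerdictProceed Verdict = "proceed" // spec is the one RC patched in
	VerdictWait    Verdict = "wait"    // status arrived before the spec patch
	VerdictAbort   Verdict = "abort"   // spec was edited by someone else
)

// HashSpec hashes a serialized, defaulted spec. SHA-256 is an assumption;
// the description does not name the hash function.
func HashSpec(spec []byte) string {
	sum := sha256.Sum256(spec)
	return hex.EncodeToString(sum[:])
}

// CheckFirstReconcile compares the hash of the spec currently on the object
// with the hash RC recorded in status and with the baseline's hash.
func CheckFirstReconcile(expectedHash, baselineHash, currentHash string) Verdict {
	switch {
	case expectedHash == "":
		// Already validated on an earlier reconcile; nothing to check here.
		return VerdictProceed
	case currentHash == expectedHash:
		return VerdictProceed
	case currentHash == baselineHash:
		// Spec still equals the baseline: keep the hash and check again later.
		return VerdictWait
	default:
		return VerdictAbort
	}
}

func main() {
	base := HashSpec([]byte(`{"logs":false}`))
	want := HashSpec([]byte(`{"logs":true}`))
	edited := HashSpec([]byte(`{"logs":true,"apm":true}`))

	fmt.Println(CheckFirstReconcile(want, base, base))
	fmt.Println(CheckFirstReconcile(want, base, want))
	fmt.Println(CheckFirstReconcile(want, base, edited))
}
```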

## RC callback side (remoteconfig package)

New file: pkg/remoteconfig/experiment.go

- ExperimentSignal type for FA payloads (action + experiment_id + config)
- parseExperimentSignal: detects experiment signals vs regular agent configs
- Signal routing in agentConfigUpdateCallback: experiment signals handled
  separately, regular configs continue through normal path. Mixed batches
  handled correctly with per-update ACK/error reporting.
- handleStartExperiment: rejects the signal when there is no config, no
  baseline, or an active experiment in any phase. Allows a same-ID retry if
  ExpectedSpecHash is still set (partial failure recovery). Sets
  phase=running, locks baselineRevision, computes expectedSpecHash with
  defaults applied, patches spec.
- handleStopExperiment: validates phase=running and signal ID matches
  running experiment ID. Sets phase=rollback.
- handlePromoteExperiment: same validation. Sets phase=promoted.
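The stop and promote guards described above share one rule: the experiment must be in phase running and the signal ID must match the running experiment's ID. A minimal sketch, assuming hypothetical names (`Experiment`, `Stop`, `Promote`, and the two sentinel errors) and an in-memory state instead of the real status update:

```go
package main

import (
	"errors"
	"fmt"
)

type Phase string

const (
	PhaseRunning  Phase = "running"
	PhaseRollback Phase = "rollback"
	PhasePromoted Phase = "promoted"
)

// Experiment is a stand-in for the experiment block in the DDA status.
type Experiment struct {
	ID    string
	Phase Phase
}

var (
	ErrNotRunning = errors.New("no running experiment")
	ErrIDMismatch = errors.New("signal ID does not match running experiment")
)

// transition applies the shared validation, then moves to the next phase.
// On error the experiment is left unchanged.
func transition(exp *Experiment, signalID string, next Phase) error {
	if exp == nil || exp.Phase != PhaseRunning {
		return ErrNotRunning
	}
	if exp.ID != signalID {
		return ErrIDMismatch
	}
	exp.Phase = next
	return nil
}

// Stop requests a rollback of the running experiment.
func Stop(exp *Experiment, signalID string) error {
	return transition(exp, signalID, PhaseRollback)
}

// Promote marks the running experiment as promoted.
func Promote(exp *Experiment, signalID string) error {
	return transition(exp, signalID, PhasePromoted)
}

func main() {
	exp := &Experiment{ID: "exp-1", Phase: PhaseRunning}
	fmt.Println(Stop(exp, "exp-2"), exp.Phase)    // rejected: wrong ID
	fmt.Println(Stop(exp, "exp-1"), exp.Phase)    // accepted
	fmt.Println(Promote(exp, "exp-1"), exp.Phase) // rejected: no longer running
}
```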

## RBAC

Updated controllerrevisions verbs from list;watch to
get;list;watch;create;update;delete across all manifests:
- config/rbac/role.yaml (via kubebuilder marker)
- bundle/manifests/datadog-operator.clusterserviceversion.yaml
- marketplaces/addon_manifest.yaml
- marketplaces/charts/google-marketplace/schema.yaml

## Tests (68 unit tests)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@zhuminyi zhuminyi force-pushed the minyi/fleet-automation-experiment-lifecycle branch 2 times, most recently from 118fb27 to 17e34e1 Compare March 17, 2026 13:22
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Reviewed commit: 17e34e1f6e

@zhuminyi zhuminyi force-pushed the minyi/fleet-automation-experiment-lifecycle branch from 17e34e1 to e86d076 Compare March 17, 2026 13:40
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Reviewed commit: 782e64e423

@zhuminyi zhuminyi force-pushed the minyi/fleet-automation-experiment-lifecycle branch from 8133090 to df3d50d Compare March 17, 2026 14:07
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Reviewed commit: df3d50dab4

@zhuminyi zhuminyi force-pushed the minyi/fleet-automation-experiment-lifecycle branch from df3d50d to 72fd331 Compare March 17, 2026 16:12