Add Fleet Automation experiment lifecycle for remote agent management #2760
Codecov Report
❌ Your patch status has failed because the patch coverage (64.85%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2760      +/-   ##
==========================================
+ Coverage   38.80%   39.15%   +0.35%
==========================================
  Files         309      312       +3
  Lines       26750    27114     +364
==========================================
+ Hits        10379    10617     +238
- Misses      15592    15675      +83
- Partials      779      822      +43
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f1d56e2aaa
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
💡 Codex Review
Reviewed commit: eef163f2a8
Implement the full experiment lifecycle for Fleet Automation, enabling
the operator to receive experiment signals (start/stop/promote) via
Remote Config and manage ControllerRevision-based spec snapshots for
rollback.
## CRD changes
Add to DatadogAgentStatus:
- CurrentRevision / PreviousRevision: track ControllerRevision pointers
- Experiment: ExperimentStatus with phase, startedAt, baselineRevision,
id, and expectedSpecHash fields
- ExperimentPhase enum: running, rollback, promoted, aborted, timeout
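As a rough illustration of the status additions listed above, the Go sketch below models them with plain types. The experiment field names and phase values come from the list; the Go types (plain strings, a `*time.Time` where the real CRD would more likely use `metav1.Time`), the `currentRevision`/`previousRevision` JSON names, the placeholder revision names, and the `ExperimentRunning` helper are assumptions made for illustration, not the operator's actual definitions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// ExperimentPhase enumerates the phases named in the description.
type ExperimentPhase string

const (
	ExperimentPhaseRunning  ExperimentPhase = "running"
	ExperimentPhaseRollback ExperimentPhase = "rollback"
	ExperimentPhasePromoted ExperimentPhase = "promoted"
	ExperimentPhaseAborted  ExperimentPhase = "aborted"
	ExperimentPhaseTimeout  ExperimentPhase = "timeout"
)

// ExperimentStatus holds the experiment bookkeeping. Field types are assumed.
type ExperimentStatus struct {
	Phase            ExperimentPhase `json:"phase,omitempty"`
	StartedAt        *time.Time      `json:"startedAt,omitempty"`
	BaselineRevision string          `json:"baselineRevision,omitempty"`
	ID               string          `json:"id,omitempty"`
	ExpectedSpecHash string          `json:"expectedSpecHash,omitempty"`
}

// DatadogAgentStatus shows only the fields this change adds.
type DatadogAgentStatus struct {
	CurrentRevision  string            `json:"currentRevision,omitempty"`
	PreviousRevision string            `json:"previousRevision,omitempty"`
	Experiment       *ExperimentStatus `json:"experiment,omitempty"`
}

// ExperimentRunning is a hypothetical nil-safe helper: true only while an
// experiment exists and is in the running phase.
func (s *DatadogAgentStatus) ExperimentRunning() bool {
	return s.Experiment != nil && s.Experiment.Phase == ExperimentPhaseRunning
}

func main() {
	now := time.Now().UTC()
	status := DatadogAgentStatus{
		CurrentRevision:  "datadog-0123456789", // placeholder revision names
		PreviousRevision: "datadog-abcdef0123",
		Experiment: &ExperimentStatus{
			Phase:            ExperimentPhaseRunning,
			StartedAt:        &now,
			BaselineRevision: "datadog-abcdef0123",
			ID:               "exp-1",
		},
	}
	out, err := json.MarshalIndent(status, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```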
## Reconciler side (experiment package)
New package: internal/controller/datadogagent/experiment/
- ControllerRevision creation with hash-based naming ({dda}-{md5[:10]})
- Revision pointer tracking on every spec change
- Experiment phase handling hooked into internalReconcileV2 after defaults
- Timeout detection: auto-rollback when now - startedAt >= 30min
- Conflict detection via ExpectedSpecHash: RC computes a hash of the
defaulted FA config; on the first reconcile the reconciler verifies that
the spec still matches, catching user edits made between the RC patch and
that reconcile. The hash survives the status-before-spec race.
- Spec restoration from ControllerRevision baseline with direct status
persistence (re-fetch after spec update to avoid resourceVersion conflict)
- Two-reconcile terminal phase pattern: rollback/timeout phase is persisted
and observable before being cleared on the next reconcile
- GC of old ControllerRevisions (keep current + previous + baseline)
- Status fields preserved in generateNewStatusFromDDA
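Three of the reconciler rules above (hash-based naming, the 30-minute timeout, and the GC keep-set) are small enough to sketch in isolation. The snippet below is a minimal illustration, not the package's code: all function names are hypothetical, and it assumes the MD5 is taken over some serialized form of the spec, which the description does not specify.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"time"
)

// experimentTimeout is the 30-minute limit stated in the description.
const experimentTimeout = 30 * time.Minute

// revisionName builds "{dda}-{md5[:10]}". Hashing serialized spec bytes is
// an assumption; the real input to the hash is not shown in this PR text.
func revisionName(ddaName string, specBytes []byte) string {
	sum := md5.Sum(specBytes)
	return fmt.Sprintf("%s-%s", ddaName, hex.EncodeToString(sum[:])[:10])
}

// elapsedExceedsTimeout applies the ">= 30min" rule to an elapsed duration.
func elapsedExceedsTimeout(elapsed time.Duration) bool {
	return elapsed >= experimentTimeout
}

// timedOut reports whether now - startedAt has reached the timeout.
func timedOut(startedAt, now time.Time) bool {
	return elapsedExceedsTimeout(now.Sub(startedAt))
}

// revisionsToDelete returns every revision name that is not the current,
// previous, or baseline revision (the keep-set described above).
func revisionsToDelete(all []string, current, previous, baseline string) []string {
	keep := map[string]bool{current: true, previous: true, baseline: true}
	var doomed []string
	for _, name := range all {
		if !keep[name] {
			doomed = append(doomed, name)
		}
	}
	return doomed
}

func main() {
	spec := []byte(`{"features":{"apm":{"enabled":true}}}`) // made-up spec
	fmt.Println(revisionName("datadog", spec))

	start := time.Now().Add(-31 * time.Minute)
	fmt.Println(timedOut(start, time.Now()))

	fmt.Println(revisionsToDelete([]string{"r1", "r2", "r3", "r4"}, "r4", "r3", "r1"))
}
```

Keeping the baseline in the keep-set matters because a running experiment may have moved both current and previous pointers past the revision it must roll back to.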
## RC callback side (remoteconfig package)
New file: pkg/remoteconfig/experiment.go
- ExperimentSignal type for FA payloads (action + experiment_id + config)
- parseExperimentSignal: detects experiment signals vs regular agent configs
- Signal routing in agentConfigUpdateCallback: experiment signals handled
separately, regular configs continue through normal path. Mixed batches
handled correctly with per-update ACK/error reporting.
- handleStartExperiment: rejects the signal when there is no config, no
baseline, or an active experiment in any phase. Allows a same-ID retry if
ExpectedSpecHash is still set (partial failure recovery). Sets
phase=running, locks baselineRevision, computes expectedSpecHash with
defaults applied, and patches the spec.
- handleStopExperiment: validates that phase=running and that the signal ID
matches the running experiment's ID. Sets phase=rollback.
- handlePromoteExperiment: same validation. Sets phase=promoted.
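To make the signal routing and the stop/promote validation above concrete, here is a hedged Go sketch. The JSON keys (`action`, `experiment_id`, `config`) follow the payload description; the detection rule (a recognized action plus a non-empty ID), the `experimentState` type, and `validateStopOrPromote` are assumptions for illustration and may differ from what pkg/remoteconfig/experiment.go actually does.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// ExperimentSignal mirrors the FA payload shape: action + experiment_id + config.
type ExperimentSignal struct {
	Action       string          `json:"action"`
	ExperimentID string          `json:"experiment_id"`
	Config       json.RawMessage `json:"config,omitempty"`
}

// parseExperimentSignal returns (signal, true) when the payload looks like an
// experiment signal, and (nil, false) for anything else, such as a regular
// agent config. The detection rule here is an assumption.
func parseExperimentSignal(raw []byte) (*ExperimentSignal, bool) {
	var sig ExperimentSignal
	if err := json.Unmarshal(raw, &sig); err != nil {
		return nil, false
	}
	switch sig.Action {
	case "start", "stop", "promote":
		if sig.ExperimentID != "" {
			return &sig, true
		}
	}
	return nil, false
}

// experimentState is a hypothetical stand-in for the status fields consulted.
type experimentState struct {
	Phase string
	ID    string
}

var (
	errNotRunning = errors.New("no experiment in phase=running")
	errIDMismatch = errors.New("signal ID does not match running experiment")
)

// validateStopOrPromote applies the shared check: the experiment must be
// running and the signal must name that same experiment.
func validateStopOrPromote(cur *experimentState, sig *ExperimentSignal) error {
	if cur == nil || cur.Phase != "running" {
		return errNotRunning
	}
	if cur.ID != sig.ExperimentID {
		return errIDMismatch
	}
	return nil
}

func main() {
	payloads := [][]byte{
		[]byte(`{"action":"stop","experiment_id":"exp-1"}`),
		[]byte(`{"config":{"log_level":"debug"}}`), // regular config, not a signal
	}
	cur := &experimentState{Phase: "running", ID: "exp-1"}
	for _, p := range payloads {
		sig, ok := parseExperimentSignal(p)
		if !ok {
			fmt.Println("regular config: normal path")
			continue
		}
		fmt.Println(sig.Action, validateStopOrPromote(cur, sig))
	}
}
```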
## RBAC
Updated controllerrevisions verbs from list;watch to
get;list;watch;create;update;delete across all manifests:
- config/rbac/role.yaml (via kubebuilder marker)
- bundle/manifests/datadog-operator.clusterserviceversion.yaml
- marketplaces/addon_manifest.yaml
- marketplaces/charts/google-marketplace/schema.yaml
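For the kubebuilder-generated role, a verb change like this would typically come from an RBAC marker comment such as the one sketched below. ControllerRevision belongs to the `apps` API group; where the marker sits in the operator's source is not shown in this description, so treat the placement as an assumption.

```go
// +kubebuilder:rbac:groups=apps,resources=controllerrevisions,verbs=get;list;watch;create;update;delete
```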
## Tests (68 unit tests)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
💡 Codex Review
Reviewed commit: 17e34e1f6e
💡 Codex Review
Reviewed commit: 782e64e423
💡 Codex Review
Reviewed commit: df3d50dab4
Implement the full experiment lifecycle for Fleet Automation, enabling the operator to receive experiment signals (start/stop/promote) via Remote Config and manage ControllerRevision-based spec snapshots for rollback.
## CRD changes
Add to DatadogAgentStatus:
## Reconciler side (experiment package)
New package: internal/controller/datadogagent/experiment/
## RC callback side (remoteconfig package)
New file: pkg/remoteconfig/experiment.go
## RBAC
Updated controllerrevisions verbs from list;watch to get;list;watch;create;update;delete across all manifests:
## Tests
59 unit tests covering: