[CONTINT-5126] DPA and autoscaling RBAC updates#2743

Merged
gh-worker-dd-mergequeue-cf854d[bot] merged 8 commits into main from
triviajon/proto/ipdpa
Mar 19, 2026

@triviajon (Contributor) commented Mar 11, 2026

What does this PR do?

Adds support for IPVPA for the agent:

  • Adds PATCH pods/resize and CREATE pods/eviction permissions to the autoscaling cluster role.
  • Adds new fields to the DatadogPodAutoscaler (DPA) structs DatadogPodAutoscalerUpdatePolicy and DatadogPodAutoscalerVerticalTargetStatus.
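
As an illustration of the first bullet, the two permissions can be pictured as RBAC policy rules. This is a minimal sketch, not the operator's actual rbac.go: `PolicyRule` is a local stand-in that mirrors the `APIGroups`, `Resources`, and `Verbs` fields of `rbacv1.PolicyRule` from k8s.io/api so the snippet compiles without dependencies, and `autoscalingResizeRules` is a hypothetical helper name.

```go
package main

import "fmt"

// PolicyRule is a local stand-in mirroring the APIGroups/Resources/Verbs
// fields of rbacv1.PolicyRule (k8s.io/api/rbac/v1), so this sketch has no
// external dependencies.
type PolicyRule struct {
	APIGroups []string
	Resources []string
	Verbs     []string
}

// autoscalingResizeRules is a hypothetical helper returning the two rules
// described in the PR: PATCH on pods/resize and CREATE on pods/eviction.
// Pods live in the core API group, which RBAC writes as the empty string.
func autoscalingResizeRules() []PolicyRule {
	return []PolicyRule{
		{APIGroups: []string{""}, Resources: []string{"pods/resize"}, Verbs: []string{"patch"}},
		{APIGroups: []string{""}, Resources: []string{"pods/eviction"}, Verbs: []string{"create"}},
	}
}

func main() {
	for _, r := range autoscalingResizeRules() {
		fmt.Printf("%v on %v\n", r.Verbs, r.Resources)
	}
}
```

Subresources such as pods/resize and pods/eviction need their own rule entries; a grant on pods alone does not cover them.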

Motivation

https://datadoghq.atlassian.net/browse/CONTINT-5187

Additional Notes

  • Should be paired with agent changes in this PR.
  • Equivalent helm-chart update adding the permissions to the autoscaling cluster role can be found here.

Minimum Agent Versions

Are there minimum versions of the Datadog Agent and/or Cluster Agent required?

  • Agent: vX.Y.Z
  • Cluster Agent: vX.Y.Z

Describe your test plan

See agent PR.

Checklist

  • PR has at least one valid label: bug, enhancement, refactoring, documentation, tooling, and/or dependencies
  • PR has a milestone or the qa/skip-qa label
  • All commits are signed (see: signing commits)

@codecov-commenter commented Mar 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 38.80%. Comparing base (6a0f976) to head (6c4043b).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2743      +/-   ##
==========================================
+ Coverage   38.78%   38.80%   +0.01%     
==========================================
  Files         309      309              
  Lines       26839    26847       +8     
==========================================
+ Hits        10409    10417       +8     
  Misses      15650    15650              
  Partials      780      780              

Flag       Coverage Δ
unittests  38.80% <100.00%> (+0.01%) ⬆️

Files with missing lines                               Coverage Δ
...ontroller/datadogagent/feature/autoscaling/rbac.go  100.00% <100.00%> (ø)
internal/controller/datadogagent_controller.go         66.66% <ø> (ø)

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@triviajon triviajon added the enhancement label Mar 18, 2026
@triviajon triviajon changed the title from "[IPVPA] DPA and autoscaling RBAC updates" to "[CONTINT-5126] DPA and autoscaling RBAC updates" Mar 18, 2026
@triviajon triviajon marked this pull request as ready for review March 18, 2026 16:37
@triviajon triviajon requested a review from a team March 18, 2026 16:37
@triviajon triviajon requested a review from a team as a code owner March 18, 2026 16:37

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 29f27ea47a

Comment on lines +255 to +259
mode:
  description: Mode controls the ability to trigger rollouts.
  enum:
  - Auto
  - TriggerRollout
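
For readers who want to see what the enum above implies on the Go side, here is a minimal sketch. The type name `UpdatePolicyMode`, the constant names, and `validateMode` are hypothetical and not taken from the operator's source; only the two string values `Auto` and `TriggerRollout` come from the CRD snippet. The API server enforces the enum from the CRD schema, so this function only mirrors that check client-side. Treating the empty string as invalid is an assumption.

```go
package main

import "fmt"

// UpdatePolicyMode is a hypothetical name for the Go type behind the
// CRD's `mode` field.
type UpdatePolicyMode string

const (
	// The two values below are the ones listed in the CRD enum.
	ModeAuto           UpdatePolicyMode = "Auto"
	ModeTriggerRollout UpdatePolicyMode = "TriggerRollout"
)

// validateMode mirrors the CRD enum check: any value other than the two
// listed ones is rejected (including the empty string, by assumption).
func validateMode(m UpdatePolicyMode) error {
	switch m {
	case ModeAuto, ModeTriggerRollout:
		return nil
	default:
		return fmt.Errorf("unsupported mode %q: must be Auto or TriggerRollout", m)
	}
}

func main() {
	for _, m := range []UpdatePolicyMode{"Auto", "TriggerRollout", "Manual"} {
		fmt.Printf("%s -> %v\n", m, validateMode(m))
	}
}
```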

[P1] Regenerate the published bundle after extending the DPA CRD

The release pipeline publishes the committed bundle/ tree, not a freshly generated one (.gitlab-ci.yml:754-766, hack/redhat-bundle.sh:16, hack/publish-community-bundles.sh:33-38), but this commit only updates config/crd/.... bundle/manifests/datadoghq.com_datadogpodautoscalers.yaml still lacks mode, resizePendingPeriod, rolloutFallbackDelay, and evicted, so OperatorHub/OLM installs would continue serving the old CRD and reject these new spec/status fields until the bundle is regenerated and committed.

Contributor Author

My understanding is that the bundle is regenerated as part of the Operator release cycle?

@triviajon triviajon added this to the v1.25.0 milestone Mar 19, 2026

@NoelM NoelM left a comment

This CRD looks fine

Contributor

@gabedos gabedos left a comment

Updates to Operator + DCA RBAC lgtm

gh-worker-dd-mergequeue-cf854d bot pushed a commit to DataDog/datadog-agent that referenced this pull request Mar 19, 2026
…47998)

### What does this PR do?
This PR implements IPVPA in the autoscaling vertical controller according to the [RFC](https://datadoghq.atlassian.net/wiki/spaces/CONT/pages/6246498427/In-Place+Vertical+Pod+Resizing+for+Workload+Autoscaling).

See the RFC for the full specification, but key components are:
- In-place resize via pods/resize subresource, with eviction fallback (PDB-aware) and rollout fallback
- API server feature gate check (pods/resize discovery, cached 15min)
- ResizeSuccessful event emitted once

### Motivation
https://datadoghq.atlassian.net/browse/CONTINT-5126

### Describe how you validated your changes
Deployed several workloads and DPAs on an EKS cluster to dddev. 
1. Happy path (i.e., in-place resize with no restarts) -> ResizeSuccessful event emitted exactly once and restartCount=0.
2. Trigger rollout (i.e., using `mode:TriggerRollout` on the DPA forces the legacy rollout path): works as expected
3. Memory restart policy (i.e., container has a resizePolicy requiring restart on memory limit/request changes): verified restartCount > 0 on pods after a memory recommendation change.
4. Sidecar (i.e., DPA with `constraints.containers: [{name: server}]`). Only the server container is resized.

Cluster/workloads are still available for inspection: https://dddev.datadoghq.com/orchestration/scaling/workload?query=kube_cluster_name%3Ajrosario-ipvpa-final%20-kube_cluster_name%3Ajrosario-ipvpa3-mar18&workload_scaling_tab=optimized-workloads

### Additional Notes
This change is also related to/relies on:
- [datadog-operator](DataDog/datadog-operator#2743). For local testing I used a `go.work` entry to point to a local operator checkout.
- helm-charts [RBAC for pods/resize](DataDog/helm-charts#2493) (patch verb on pods subresource). 

Co-authored-by: cedric.lamoriniere <cedric.lamoriniere@datadoghq.com>
4 participants