[CONTINT-5126] DPA and autoscaling RBAC updates #2743
gh-worker-dd-mergequeue-cf854d[bot] merged 8 commits into main
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2743      +/-   ##
==========================================
+ Coverage   38.78%   38.80%   +0.01%
==========================================
  Files         309      309
  Lines       26839    26847       +8
==========================================
+ Hits        10409    10417       +8
  Misses      15650    15650
  Partials      780      780
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 29f27ea47a
mode:
  description: Mode controls the ability to trigger rollouts.
  enum:
  - Auto
  - TriggerRollout
Regenerate the published bundle after extending the DPA CRD
The release pipeline publishes the committed bundle/ tree, not a freshly generated one (.gitlab-ci.yml:754-766, hack/redhat-bundle.sh:16, hack/publish-community-bundles.sh:33-38), but this commit only updates config/crd/.... bundle/manifests/datadoghq.com_datadogpodautoscalers.yaml still lacks mode, resizePendingPeriod, rolloutFallbackDelay, and evicted, so OperatorHub/OLM installs would continue serving the old CRD and reject these new spec/status fields until the bundle is regenerated and committed.
My understanding is that this is done with the Operator release cycle?
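One way to make the concern in this thread checkable is a quick text scan of the committed bundle manifest for the field names the review lists. The following is a minimal sketch, not part of the operator's tooling: `missingFields` is a hypothetical helper, the manifest excerpt in `main` is made up for illustration, and the scan only proves that a key is absent, not that a present key sits at the right place in the schema.

```go
package main

import (
	"fmt"
	"strings"
)

// missingFields returns the entries of fields that never appear as a YAML key
// ("name:") at the start of any line in the manifest text.
// It is a plain text scan, not a schema-aware check.
func missingFields(manifest string, fields []string) []string {
	var missing []string
	lines := strings.Split(manifest, "\n")
	for _, f := range fields {
		found := false
		for _, line := range lines {
			if strings.HasPrefix(strings.TrimSpace(line), f+":") {
				found = true
				break
			}
		}
		if !found {
			missing = append(missing, f)
		}
	}
	return missing
}

func main() {
	// Hypothetical excerpt standing in for the committed bundle CRD.
	bundle := "mode:\n  enum:\n  - Auto\n  - TriggerRollout\nevicted:\n  description: placeholder\n"
	// Field names taken from the review comment above.
	want := []string{"mode", "resizePendingPeriod", "rolloutFallbackDelay", "evicted"}
	fmt.Println(missingFields(bundle, want))
}
```

In practice the manifest text would be read from the bundle file named in the review rather than from an inline string.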
…erator into triviajon/proto/ipdpa
gabedos left a comment
Updates to Operator + DCA RBAC lgtm
…47998)
### What does this PR do?
This PR implements IPVPA in the autoscaling vertical controller according to the [RFC](https://datadoghq.atlassian.net/wiki/spaces/CONT/pages/6246498427/In-Place+Vertical+Pod+Resizing+for+Workload+Autoscaling)
See the RFC for the full specification, but key components are:
- In-place resize via pods/resize subresource, with eviction fallback (PDB-aware) and rollout fallback
- API server feature gate check (pods/resize discovery, cached 15min)
- ResizeSuccessful event emitted once
### Motivation
https://datadoghq.atlassian.net/browse/CONTINT-5126
### Describe how you validated your changes
Deployed several workloads and DPAs on an EKS cluster to dddev.
1. Happy path (i.e., in-place resize with no restarts) -> ResizeSuccessful event emitted exactly once and restartCount=0.
2. Trigger rollout (i.e., using `mode:TriggerRollout` on the DPA forces the legacy rollout path): works as expected
3. Memory restart policy (i.e., container has resizePolicy requiring restart on memory limit/req changes): Verified restartCount > 0 on pods after a memory recommendation change.
4. Sidecar (i.e., DPA with `constraints.containers: [{name: server}]`). Only the server container is resized.
Cluster/workloads are still available for inspection: https://dddev.datadoghq.com/orchestration/scaling/workload?query=kube_cluster_name%3Ajrosario-ipvpa-final%20-kube_cluster_name%3Ajrosario-ipvpa3-mar18&workload_scaling_tab=optimized-workloads
### Additional Notes
This change is also related to/relies on:
- [datadog-operator](DataDog/datadog-operator#2743). For local testing I used `go.work` entry to point to local operator.
- helm-charts [RBAC for pods/resize](DataDog/helm-charts#2493) (patch verb on pods subresource).
Co-authored-by: cedric.lamoriniere <cedric.lamoriniere@datadoghq.com>
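The fallback chain named in the commit message (in-place resize, then eviction, then rollout) can be pictured as a small decision function. This is a sketch under stated assumptions, not the controller's implementation: the types `Policy` and `PodState`, the function `NextAction`, and the values in `demoPolicy` are hypothetical, and the way `resizePendingPeriod` and `rolloutFallbackDelay` are used as consecutive timeouts is a guess at their meaning; the RFC is the authority on the real semantics. Falling back to a rollout when the API server does not expose pods/resize is likewise an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// Action is the next step a (hypothetical) vertical controller takes for a pod.
type Action string

const (
	ResizeInPlace  Action = "ResizeInPlace"
	Wait           Action = "Wait"
	Evict          Action = "Evict"
	TriggerRollout Action = "TriggerRollout"
)

// Policy mirrors the CRD field names from this PR; the semantics are assumed.
type Policy struct {
	Mode                 string        // "Auto" or "TriggerRollout"
	ResizePendingPeriod  time.Duration // assumed: how long a resize may stay pending
	RolloutFallbackDelay time.Duration // assumed: extra wait before a rollout
}

// PodState is a simplified, hypothetical view of one pod.
type PodState struct {
	ResizeSupported bool // API server exposes the pods/resize subresource
	ResizeRequested bool // a resize patch has already been sent
	EvictionAllowed bool // false when a PodDisruptionBudget blocks eviction
}

// demoPolicy uses illustrative durations, not defaults from the operator.
var demoPolicy = Policy{Mode: "Auto", ResizePendingPeriod: 5 * time.Minute, RolloutFallbackDelay: 10 * time.Minute}

// NextAction picks the next step given how long the resize has been pending.
func NextAction(p Policy, s PodState, pending time.Duration) Action {
	if p.Mode == "TriggerRollout" || !s.ResizeSupported {
		return TriggerRollout // legacy rollout path
	}
	if !s.ResizeRequested {
		return ResizeInPlace
	}
	if pending < p.ResizePendingPeriod {
		return Wait // give the in-place resize time to complete
	}
	if s.EvictionAllowed {
		return Evict
	}
	if pending >= p.ResizePendingPeriod+p.RolloutFallbackDelay {
		return TriggerRollout // eviction stayed blocked for too long
	}
	return Wait
}

func main() {
	stuck := PodState{ResizeSupported: true, ResizeRequested: true}
	fmt.Println(NextAction(demoPolicy, stuck, time.Minute))    // Wait
	fmt.Println(NextAction(demoPolicy, stuck, 20*time.Minute)) // TriggerRollout
}
```

Keeping the decision a pure function of policy, pod state, and elapsed time makes each branch easy to unit test without a cluster.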
What does this PR do?
Adds support for IPVPA for the agent:
- `PATCH pods/resize` and `CREATE pods/eviction` permissions to the autoscaling cluster role.
- `DatadogPodAutoscalerUpdatePolicy` and `DatadogPodAutoscalerVerticalTargetStatus`
Motivation
https://datadoghq.atlassian.net/browse/CONTINT-5187
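The two permissions this PR adds to the autoscaling cluster role (patch on pods/resize, create on pods/eviction) might look roughly like the rules below. This is an illustrative sketch only: `PolicyRule` is a local stand-in that copies three field names from `k8s.io/api/rbac/v1.PolicyRule` so the example needs no external modules, and `ipvpaRules` and `allowed` are hypothetical helpers, not functions from the operator.

```go
package main

import "fmt"

// PolicyRule is a local stand-in mirroring three fields of
// k8s.io/api/rbac/v1.PolicyRule.
type PolicyRule struct {
	APIGroups []string
	Resources []string
	Verbs     []string
}

// ipvpaRules lists the two subresource permissions described in this PR.
// Pods live in the core API group, which RBAC writes as "".
func ipvpaRules() []PolicyRule {
	return []PolicyRule{
		{APIGroups: []string{""}, Resources: []string{"pods/resize"}, Verbs: []string{"patch"}},
		{APIGroups: []string{""}, Resources: []string{"pods/eviction"}, Verbs: []string{"create"}},
	}
}

// contains reports whether v is present in list.
func contains(list []string, v string) bool {
	for _, item := range list {
		if item == v {
			return true
		}
	}
	return false
}

// allowed reports whether any single rule grants verb on resource.
// It ignores API groups and wildcards to keep the sketch short.
func allowed(rules []PolicyRule, resource, verb string) bool {
	for _, r := range rules {
		if contains(r.Resources, resource) && contains(r.Verbs, verb) {
			return true
		}
	}
	return false
}

func main() {
	rules := ipvpaRules()
	fmt.Println(allowed(rules, "pods/resize", "patch"))    // true
	fmt.Println(allowed(rules, "pods/eviction", "delete")) // false
}
```

Granting verbs on the subresources rather than on pods keeps the role narrow: the rules above do not by themselves allow patching or deleting pods directly.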
Additional Notes
Minimum Agent Versions
Are there minimum versions of the Datadog Agent and/or Cluster Agent required?
Describe your test plan
See agent PR.
Checklist
- `bug`, `enhancement`, `refactoring`, `documentation`, `tooling`, and/or `dependencies`
- `qa/skip-qa` label