
[feature][autoscaling] add Spot scheduling feature and update RBAC #2957

Open
AlexanderYastrebov wants to merge 1 commit into main from alexander.yastrebov/spot-statefulset-patch-rbac


@AlexanderYastrebov AlexanderYastrebov commented Apr 29, 2026

What does this PR do?

Adds features.autoscaling.cluster.spot.enabled to the DatadogAgent API, which grants the cluster-agent permission to patch Deployments/StatefulSets with the spot-disabled-until annotation and evict pending spot pods during on-demand fallback. Spot is enforced as a sub-feature of cluster autoscaling (requires cluster.enabled=true).
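The sub-feature gating described above could be sketched as follows. This is a minimal illustration, not the operator's code: `autoscalingSpec`, `boolValue`, and `spotActive` are hypothetical names, and the struct is a simplified stand-in for the real DatadogAgent API types.

```go
package main

import "fmt"

// autoscalingSpec is a hypothetical, simplified stand-in for the relevant
// part of the DatadogAgent spec; the real API types are shaped differently.
type autoscalingSpec struct {
	ClusterEnabled *bool // features.autoscaling.cluster.enabled
	SpotEnabled    *bool // features.autoscaling.cluster.spot.enabled
}

// boolValue treats an unset (nil) flag as false.
func boolValue(b *bool) bool {
	return b != nil && *b
}

// spotActive reports whether spot scheduling should take effect: the
// sub-feature only applies when cluster autoscaling is also enabled.
func spotActive(s autoscalingSpec) bool {
	return boolValue(s.ClusterEnabled) && boolValue(s.SpotEnabled)
}

func main() {
	on, off := true, false
	fmt.Println(spotActive(autoscalingSpec{ClusterEnabled: &on, SpotEnabled: &on}))
	fmt.Println(spotActive(autoscalingSpec{ClusterEnabled: &off, SpotEnabled: &on}))
	fmt.Println(spotActive(autoscalingSpec{SpotEnabled: &on}))
}
```

Under these assumptions, setting only `spot.enabled` has no effect unless `cluster.enabled` is also true.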

Motivation

Support spot instance scheduling in the operator so that users can enable the spot scheduler via the DatadogAgent CR without manually managing RBAC.

Additional Notes

Spot scheduler implementation: DataDog/datadog-agent#47429

Updates https://datadoghq.atlassian.net/browse/CASCL-1312

Minimum Agent Versions

  • Cluster Agent: v7.79.0

Describe your test plan

Unit tests cover the new RBAC rules for spot scheduling. E2E tests in DataDog/datadog-agent validate spot scheduling behaviour end-to-end.

Checklist

  • PR has at least one valid label: bug, enhancement, refactoring, documentation, tooling, and/or dependencies
  • PR has a milestone or the qa/skip-qa label
  • All commits are signed (see: signing commits)


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b87a451c73


Comment on lines +116 to +120
if clusterEnabled {
	pr = append(pr, []rbacv1.PolicyRule{
		{
			// Patch workloads to write spot-disabled-until annotation during on-demand fallback
			APIGroups: []string{rbac.AppsAPIGroup},

P2: Scope spot RBAC rule behind spot feature flag

This rule is added whenever clusterEnabled is true, so users who enable only features.autoscaling.cluster.enabled now get extra apps patch access even though spot scheduling is documented as a separate sub-feature. That broadens cluster-agent privileges for non-spot deployments and breaks the intended least-privilege gating of cluster.spot.enabled; moving this rule to the clusterSpotEnabled block would keep permissions aligned with the enabled feature set.
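The suggested gating could look roughly like the sketch below. It is illustrative only: `policyRule` is a local stand-in mirroring the field names of `rbacv1.PolicyRule` so the example compiles without external modules, `clusterAgentRules` is a hypothetical helper, and the resource and verb lists are assumptions drawn from the PR description rather than from the actual diff.

```go
package main

import "fmt"

// policyRule is a local stand-in for rbacv1.PolicyRule (k8s.io/api/rbac/v1),
// reduced to the fields this sketch needs.
type policyRule struct {
	APIGroups []string
	Resources []string
	Verbs     []string
}

// clusterAgentRules is a hypothetical helper, not the operator's function.
// It adds the workload patch rule only when the spot sub-feature is on.
func clusterAgentRules(clusterEnabled, clusterSpotEnabled bool) []policyRule {
	var rules []policyRule
	if !clusterEnabled {
		return rules
	}
	// Rules required by cluster autoscaling itself would be appended here.
	if clusterSpotEnabled {
		// Assumed rule: patch workloads to write the spot-disabled-until
		// annotation during on-demand fallback.
		rules = append(rules, policyRule{
			APIGroups: []string{"apps"},
			Resources: []string{"deployments", "statefulsets"},
			Verbs:     []string{"patch"},
		})
	}
	return rules
}

func main() {
	fmt.Println(len(clusterAgentRules(true, false))) // cluster only: no spot rule
	fmt.Println(len(clusterAgentRules(true, true)))  // cluster + spot: spot rule added
}
```

With this shape, enabling only `cluster.enabled` does not grant patch access on `apps` resources.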


@AlexanderYastrebov AlexanderYastrebov force-pushed the alexander.yastrebov/spot-statefulset-patch-rbac branch from b87a451 to 7852bcf Compare April 29, 2026 10:11

codecov-commenter commented Apr 29, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 41.40%. Comparing base (099f33f) to head (58dacda).


@@            Coverage Diff             @@
##             main    #2957      +/-   ##
==========================================
+ Coverage   41.38%   41.40%   +0.02%     
==========================================
  Files         327      327              
  Lines       28952    28962      +10     
==========================================
+ Hits        11982    11992      +10     
  Misses      16109    16109              
  Partials      861      861              
| Flag | Coverage Δ |
|---|---|
| unittests | 41.40% <100.00%> (+0.02%) ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Files with missing lines | Coverage Δ |
|---|---|
| ...roller/datadogagent/feature/autoscaling/feature.go | 85.54% <100.00%> (+1.33%) ⬆️ |
| ...ontroller/datadogagent/feature/autoscaling/rbac.go | 100.00% <100.00%> (ø) |


Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data


@AlexanderYastrebov AlexanderYastrebov marked this pull request as draft April 29, 2026 10:30
@AlexanderYastrebov AlexanderYastrebov force-pushed the alexander.yastrebov/spot-statefulset-patch-rbac branch 3 times, most recently from 207ec23 to 17f0f7e Compare April 29, 2026 10:59
@AlexanderYastrebov AlexanderYastrebov marked this pull request as ready for review April 29, 2026 11:01
Enabled *bool `json:"enabled,omitempty"`

// Spot contains the configuration for the spot instance scheduling sub-feature.
// Requires cluster autoscaling to be enabled.
Contributor Author


Do we want this condition? In the agent, spot scheduling does not depend on cluster scaling, although it's a nested feature.

What does this PR do?
---------------------
Adds `features.autoscaling.cluster.spot.enabled` to the DatadogAgent API,
which grants the cluster-agent permission to patch Deployments/StatefulSets
with the `spot-disabled-until` annotation and evict pending spot pods during
on-demand fallback. Spot is enforced as a sub-feature of cluster autoscaling
(requires `cluster.enabled=true`).

Motivation
----------
Support spot instance scheduling in the operator so that users can enable
the spot scheduler via the DatadogAgent CR without manually managing RBAC.

Additional Notes
----------------
Spot scheduler implementation: DataDog/datadog-agent#47429

Updates https://datadoghq.atlassian.net/browse/CASCL-1312

Minimum Agent Versions
----------------------
* Cluster Agent: v7.79.0

Describe your test plan
-----------------------
Unit tests cover the new RBAC rules for spot scheduling.
E2E tests in DataDog/datadog-agent validate spot scheduling behaviour end-to-end.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@AlexanderYastrebov AlexanderYastrebov force-pushed the alexander.yastrebov/spot-statefulset-patch-rbac branch from 17f0f7e to 58dacda Compare April 30, 2026 09:47
}

if clusterEnabled {
	if f.workloadEnabled || f.clusterSpotEnabled {
Contributor Author


We could split this condition and emit duplicate rules, since k8s merges them.
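A small sketch of why duplicates would be harmless: Kubernetes RBAC is additive, so a request is allowed if any rule matches, and a repeated rule does not change the outcome. The types and the `allowed` helper below are hypothetical simplifications (no wildcard, resource name, or non-resource URL handling), not the Kubernetes authorizer.

```go
package main

import "fmt"

// policyRule is a local stand-in for rbacv1.PolicyRule, reduced to three fields.
type policyRule struct {
	APIGroups []string
	Resources []string
	Verbs     []string
}

// contains reports whether x is present in xs.
func contains(xs []string, x string) bool {
	for _, v := range xs {
		if v == x {
			return true
		}
	}
	return false
}

// allowed is a simplified additive check: access is granted if any single
// rule matches the group, resource, and verb. Wildcards are ignored here.
func allowed(rules []policyRule, group, resource, verb string) bool {
	for _, r := range rules {
		if contains(r.APIGroups, group) && contains(r.Resources, resource) && contains(r.Verbs, verb) {
			return true
		}
	}
	return false
}

func main() {
	patchApps := policyRule{
		APIGroups: []string{"apps"},
		Resources: []string{"deployments", "statefulsets"},
		Verbs:     []string{"patch"},
	}
	single := []policyRule{patchApps}
	duplicated := []policyRule{patchApps, patchApps}

	// Both rule sets grant exactly the same access.
	fmt.Println(allowed(single, "apps", "deployments", "patch") ==
		allowed(duplicated, "apps", "deployments", "patch"))
	fmt.Println(allowed(single, "apps", "deployments", "delete") ==
		allowed(duplicated, "apps", "deployments", "delete"))
}
```

The trade-off is readability of the generated Role versus simpler per-feature conditions in the code.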


@clamoriniere clamoriniere left a comment


:lgtm 👍

5 participants