
Add DDA rollback functionality #2838

Merged
khewonc merged 9 commits into main from khewonc/rollback
Apr 7, 2026


@khewonc khewonc commented Mar 26, 2026

What does this PR do?

Adds rollback functionality for DDAs with fleet experiments. Includes:

  • stop: trigger rollback (for eventual stopExperiment)
  • timeout: rollback after 15min of a running experiment
  • abort: the user makes a manual change during the experiment (a manual change made at the same time as the timeout is ignored, to avoid extra complexity)

Motivation

https://datadoghq.atlassian.net/browse/CONTP-1404

Additional Notes

Anything else we should know when reviewing?

Minimum Agent Versions

Are there minimum versions of the Datadog Agent and/or Cluster Agent required?

  • Agent: vX.Y.Z
  • Cluster Agent: vX.Y.Z

Describe your test plan

Setup:

  • Deploy the operator with createControllerRevisions and datadogAgentInternalEnabled set to true:
    flag.BoolVar(&opts.datadogAgentInternalEnabled, "datadogAgentInternalEnabled", true, "Enable the DatadogAgentInternal controller")
    flag.BoolVar(&opts.createControllerRevisions, "createControllerRevisions", false, "Enable creation of ControllerRevision snapshots on each DDA spec change")
  • Create a DDA

Stop signal:

  • Start an experiment (see below for how to mock)
  • Mock stopping the experiment by patching the status:
kubectl patch dd <name> --type=merge --subresource status --patch 'status: {experiment: {phase: stopped}}'
  • Check that the DDA spec is rolled back to the initial pre-experiment state
  • Check that the experiment phase is rollback (kubectl describe dd <name>)

Timeout

  • Start an experiment (see below for how to mock)
  • Wait 15 minutes (not configurable, sorry)
  • Check that the DDA spec is rolled back to the initial pre-experiment state
  • Check that the experiment phase is timeout (kubectl describe dd <name>)

Abort

  • Start an experiment (see below for how to mock)
  • Patch the spec again
kubectl patch datadogagent <name> -n <namespace> --type merge -p '{"spec":{"global":{"tags":["experiment:true","manual-change:true"]}}}'
  • Check that the experiment phase is aborted (kubectl describe dd <name>)
  • Check that the DDA spec matches the change you just made (no rollback to pre-experiment or experiment spec)

Start an experiment (mock the start experiment signal):

# patch dda spec
# note: this can be any spec change
kubectl patch datadogagent <name> -n <namespace> --type merge -p '{"spec":{"global":{"tags":["experiment:true"]}}}'
# patch dda status
# generation should match the current (post spec patch) dda's generation
kubectl patch datadogagent <name> -n <namespace> --type merge --subresource=status -p "{\"status\":{\"experiment\":{\"phase\":\"running\",\"id\":\"test-exp-1\",\"generation\":$(kubectl get datadogagent <name> -n <namespace> -o jsonpath='{.metadata.generation}')}}}"
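For anyone scripting this mock instead of typing it by hand, the status merge patch above can be built with `encoding/json`. This is only a sketch: the `experimentStatus` struct and `buildStartPatch` helper are hypothetical names, and the field names are taken from the kubectl command above rather than from the operator's API types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// experimentStatus is a hypothetical struct whose JSON field names are copied
// from the kubectl patch in the test plan, not from the operator's API types.
type experimentStatus struct {
	Phase      string `json:"phase"`
	ID         string `json:"id"`
	Generation int64  `json:"generation"`
}

// buildStartPatch returns a JSON merge patch that marks an experiment as running.
// generation should be the DDA's metadata.generation after the spec patch.
func buildStartPatch(id string, generation int64) ([]byte, error) {
	patch := map[string]any{
		"status": map[string]any{
			"experiment": experimentStatus{Phase: "running", ID: id, Generation: generation},
		},
	}
	return json.Marshal(patch)
}

func main() {
	b, err := buildStartPatch("test-exp-1", 4)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
	// prints {"status":{"experiment":{"phase":"running","id":"test-exp-1","generation":4}}}
}
```

The resulting bytes could be passed to `kubectl patch ... --subresource=status -p` or to a Kubernetes client's merge-patch call.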

Checklist

  • PR has at least one valid label: bug, enhancement, refactoring, documentation, tooling, and/or dependencies
  • PR has a milestone or the qa/skip-qa label
  • All commits are signed (see: signing commits)

codecov-commenter commented Mar 26, 2026

Codecov Report

❌ Patch coverage is 75.72254% with 42 lines in your changes missing coverage. Please review.
✅ Project coverage is 39.40%. Comparing base (2b3594e) to head (d4b5a04).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| internal/controller/datadogagent_controller.go | 41.02% | 17 Missing and 6 partials ⚠️ |
| internal/controller/datadogagent/experiment.go | 84.78% | 7 Missing and 7 partials ⚠️ |
| ...controller/datadogagent/controller_reconcile_v2.go | 16.66% | 2 Missing and 3 partials ⚠️ |

❌ Your patch status has failed because the patch coverage (75.72%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files


@@            Coverage Diff             @@
##             main    #2838      +/-   ##
==========================================
+ Coverage   39.21%   39.40%   +0.18%     
==========================================
  Files         314      315       +1     
  Lines       27301    27449     +148     
==========================================
+ Hits        10707    10817     +110     
- Misses      15803    15828      +25     
- Partials      791      804      +13     
| Flag | Coverage Δ |
|---|---|
| unittests | 39.40% <75.72%> (+0.18%) ⬆️ |


| Files with missing lines | Coverage Δ |
|---|---|
| internal/controller/datadogagent/controller.go | 92.85% <ø> (ø) |
| ...ler/datadogagent/controller_reconcile_v2_common.go | 33.90% <100.00%> (+0.28%) ⬆️ |
| ...er/datadogagent/controller_reconcile_v2_helpers.go | 65.00% <100.00%> (+0.17%) ⬆️ |
| internal/controller/datadogagent/revision.go | 81.51% <100.00%> (+3.53%) ⬆️ |
| ...controller/datadogagent/controller_reconcile_v2.go | 61.00% <16.66%> (-1.06%) ⬇️ |
| internal/controller/datadogagent/experiment.go | 84.78% <84.78%> (ø) |
| internal/controller/datadogagent_controller.go | 59.54% <41.02%> (-7.13%) ⬇️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
@khewonc khewonc marked this pull request as ready for review March 27, 2026 16:31
@khewonc khewonc requested a review from a team March 27, 2026 16:31
@khewonc khewonc requested a review from a team as a code owner March 27, 2026 16:31
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d358c9c2bd


Comment thread internal/controller/datadogagent/experiment.go Outdated
Comment thread internal/controller/datadogagent/experiment.go
Comment thread internal/controller/datadogagent/controller_reconcile_v2.go Outdated

ctx = ctrl.LoggerInto(ctx, ctrl.LoggerFrom(ctx).WithValues("experimentID", experiment.ID))

if err := r.handleRollback(ctx, instance, newStatus, now, revList); err != nil {
Contributor


nit: It is better to call abortExperiment first to detect user edits. If there is a user edit, the phase will be changed to aborted and the user's edit will be preserved (this is a narrow race-condition window).

Collaborator Author


I initially had it called first as an early return, but calling abortExperiment first right now makes the operator logs look like it's aborting when it's actually in timeout. I ended up deciding to reorder rather than complicate the function.

Comment on lines +79 to +84
if err := r.manageExperiment(ctx, instance, newDDAStatus, now, revList); err != nil {
	return r.updateStatusIfNeededV2(logger, instance, newDDAStatus, result, err, now)
}
if err := r.manageRevision(ctx, instance, revList, newDDAStatus); err != nil {
	return r.updateStatusIfNeededV2(logger, instance, newDDAStatus, result, err, now)
}
Member


Is this safe to do in two steps? What if the second step fails after the first step succeeded?

Member


Like, won't this prevent rollbacks after apply?

Collaborator Author


It should be fine to do in two steps. The actual rollback is handled in manageExperiment, so it won't prevent experiment rollbacks. We don't allow user-initiated manual rollbacks, so no issues there. There is one bug, though: after a rollback, if the user tries to apply the same change again, the operator will immediately roll back, so it looks like there was no change. I'll add a fix for that.

@khewonc khewonc merged commit 3d96136 into main Apr 7, 2026
37 checks passed
@khewonc khewonc deleted the khewonc/rollback branch April 7, 2026 13:47
4 participants