Add DDA rollback functionality by khewonc · Pull Request #2838 · DataDog/datadog-operator

khewonc · 2026-03-26T20:43:53Z

What does this PR do?

Adds rollback functionality for DDAs with fleet experiments. Includes:

stop: trigger a rollback (for the eventual stopExperiment signal)
timeout: roll back after an experiment has been running for 15 minutes
abort: the user makes a manual change during an experiment (a manual change made at the same time as the timeout is ignored because of the added complexity)

Motivation

https://datadoghq.atlassian.net/browse/CONTP-1404

Additional Notes

Anything else we should know when reviewing?

Minimum Agent Versions

Are there minimum versions of the Datadog Agent and/or Cluster Agent required?

Agent: vX.Y.Z
Cluster Agent: vX.Y.Z

Describe your test plan

Setup:

Deploy the operator with createControllerRevisions and datadogAgentInternalEnabled set to true:

datadog-operator/cmd/main.go (lines 188 to 189 in 62fdb06):

	flag.BoolVar(&opts.datadogAgentInternalEnabled, "datadogAgentInternalEnabled", true, "Enable the DatadogAgentInternal controller")
	flag.BoolVar(&opts.createControllerRevisions, "createControllerRevisions", false, "Enable creation of ControllerRevision snapshots on each DDA spec change")

Create a DDA

Stop signal:

Start an experiment (see below for how to mock)
Mock stopping the experiment by patching the status:

kubectl patch dd <name> --type=merge --subresource status --patch 'status: {experiment: {phase: stopped}}'

Check that the DDA spec is rolled back to the initial pre-experiment state
Check that the experiment phase is rollback (kubectl describe dd <name>)

Timeout

Start an experiment (see below for how to mock)
Wait 15 minutes (not configurable, sorry)
Check that the DDA spec is rolled back to the initial pre-experiment state
Check that the experiment phase is timeout (kubectl describe dd <name>)

Abort

Start an experiment (see below for how to mock)
Patch the spec again

kubectl patch datadogagent <name> -n <namespace> --type merge -p '{"spec":{"global":{"tags":["experiment:true","manual-change:true"]}}}'

Check that the experiment phase is aborted (kubectl describe dd <name>)
Check that the DDA spec matches the change you just made (no rollback to pre-experiment or experiment spec)

Start an experiment (mock the start experiment signal):

# patch dda spec
# note: this can be any spec change
kubectl patch datadogagent <name> -n <namespace> --type merge -p '{"spec":{"global":{"tags":["experiment:true"]}}}'
# patch dda status
# generation should match the current (post spec patch) dda's generation
kubectl patch datadogagent <name> -n <namespace> --type merge --subresource=status -p "{\"status\":{\"experiment\":{\"phase\":\"running\",\"id\":\"test-exp-1\",\"generation\":$(kubectl get datadogagent <name> -n <namespace> -o jsonpath='{.metadata.generation}')}}}"
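
For anyone scripting the mock from Go instead of shell quoting, the status payload used above could be built roughly like this. This is a sketch only: the `experimentStatus` and `statusPatch` types and the `buildStartPatch` helper are hypothetical names for this example and only cover the three fields used in the patch command.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// experimentStatus covers only the fields set by the mock patch above.
type experimentStatus struct {
	Phase      string `json:"phase"`
	ID         string `json:"id"`
	Generation int64  `json:"generation"`
}

// statusPatch is the merge-patch body: {"status":{"experiment":{...}}}.
type statusPatch struct {
	Status struct {
		Experiment experimentStatus `json:"experiment"`
	} `json:"status"`
}

// buildStartPatch returns the JSON body that marks an experiment as running
// for the given DDA generation (hypothetical helper).
func buildStartPatch(id string, generation int64) ([]byte, error) {
	var p statusPatch
	p.Status.Experiment = experimentStatus{Phase: "running", ID: id, Generation: generation}
	return json.Marshal(p)
}

func main() {
	body, err := buildStartPatch("test-exp-1", 3)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
	// prints {"status":{"experiment":{"phase":"running","id":"test-exp-1","generation":3}}}
}
```

The generation passed in should be the DDA's metadata.generation after the spec patch, as noted in the comments above.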

Checklist

PR has at least one valid label: bug, enhancement, refactoring, documentation, tooling, and/or dependencies
PR has a milestone or the qa/skip-qa label
All commits are signed (see: signing commits)

codecov-commenter · 2026-03-26T20:52:11Z

Codecov Report

❌ Patch coverage is 75.72254% with 42 lines in your changes missing coverage. Please review.
✅ Project coverage is 39.40%. Comparing base (2b3594e) to head (d4b5a04).

Files with missing lines	Patch %	Lines
internal/controller/datadogagent_controller.go	41.02%	17 Missing and 6 partials ⚠️
internal/controller/datadogagent/experiment.go	84.78%	7 Missing and 7 partials ⚠️
...controller/datadogagent/controller_reconcile_v2.go	16.66%	2 Missing and 3 partials ⚠️

❌ Your patch status has failed because the patch coverage (75.72%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2838      +/-   ##
==========================================
+ Coverage   39.21%   39.40%   +0.18%     
==========================================
  Files         314      315       +1     
  Lines       27301    27449     +148     
==========================================
+ Hits        10707    10817     +110     
- Misses      15803    15828      +25     
- Partials      791      804      +13

Flag	Coverage Δ
unittests	`39.40% <75.72%> (+0.18%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
internal/controller/datadogagent/controller.go	`92.85% <ø> (ø)`
...ler/datadogagent/controller_reconcile_v2_common.go	`33.90% <100.00%> (+0.28%)`	⬆️
...er/datadogagent/controller_reconcile_v2_helpers.go	`65.00% <100.00%> (+0.17%)`	⬆️
internal/controller/datadogagent/revision.go	`81.51% <100.00%> (+3.53%)`	⬆️
...controller/datadogagent/controller_reconcile_v2.go	`61.00% <16.66%> (-1.06%)`	⬇️
internal/controller/datadogagent/experiment.go	`84.78% <84.78%> (ø)`
internal/controller/datadogagent_controller.go	`59.54% <41.02%> (-7.13%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d358c9c2bd

zhuminyi · 2026-03-27T19:08:00Z

+
+	ctx = ctrl.LoggerInto(ctx, ctrl.LoggerFrom(ctx).WithValues("experimentID", experiment.ID))
+
+	if err := r.handleRollback(ctx, instance, newStatus, now, revList); err != nil {


nit: It is better to call abortExperiment first to detect a user edit. If there is a user edit, the phase will be changed to aborted and the user's edit will be preserved (this is a narrow race-condition window).

khewonc · 2026-03-31

I initially had it called first as an early return, but calling abortExperiment first currently makes the operator logs look like it is aborting when it is actually timing out. I ended up reordering rather than overcomplicating the function.

arbll · 2026-04-01T14:15:20Z

+		if err := r.manageExperiment(ctx, instance, newDDAStatus, now, revList); err != nil {
+			return r.updateStatusIfNeededV2(logger, instance, newDDAStatus, result, err, now)
+		}
+		if err := r.manageRevision(ctx, instance, revList, newDDAStatus); err != nil {
+			return r.updateStatusIfNeededV2(logger, instance, newDDAStatus, result, err, now)
+		}


Is this safe to do in two steps? What if the second step fails after the first step succeeds?

Like, won't this prevent rollbacks after apply?

khewonc · 2026-04-01

It should be fine to do in two steps. The actual rollback is handled in manageExperiment, so a failure in the second step won't prevent experiment rollbacks. We don't allow user-initiated manual rollbacks, so there are no issues there. There is one bug, though: after a rollback, if the user tries to apply the same change again, the operator immediately rolls back, so it looks like there was no change. I'll add a fix for that.

khewonc added 4 commits March 26, 2026 16:37

Add dd annotations and experiment phase as predicates

e9a4fa4

Add experiment status

5c6deda

Pass revision list instead of entire object

7474895

Add rollback functionality

b6980bf

khewonc added this to the v1.26.0 milestone Mar 26, 2026

khewonc added the enhancement New feature or request label Mar 26, 2026

github-actions Bot added team/container-platform team/container-autoscaling labels Mar 26, 2026

make generate manifests

d358c9c

khewonc marked this pull request as ready for review March 27, 2026 16:31

khewonc requested a review from a team March 27, 2026 16:31

khewonc requested a review from a team as a code owner March 27, 2026 16:31

chatgpt-codex-connector Bot reviewed Mar 27, 2026

Comment thread internal/controller/datadogagent/experiment.go Outdated

Comment thread internal/controller/datadogagent/experiment.go

Comment thread internal/controller/datadogagent/controller_reconcile_v2.go Outdated

zhuminyi reviewed Mar 27, 2026

zhuminyi approved these changes Mar 30, 2026

Address review suggestions

7814314

arbll reviewed Apr 1, 2026

khewonc added 3 commits April 1, 2026 13:13

Allow applying same change after rollback

6106687

Merge branch 'main' into khewonc/rollback

1503e74

Fix build

d4b5a04

khewonc mentioned this pull request Apr 7, 2026

Add experiment signals to fleet remote config #2872

Merged

3 tasks

zhuminyi approved these changes Apr 7, 2026

khewonc merged commit 3d96136 into main Apr 7, 2026
37 checks passed

khewonc deleted the khewonc/rollback branch April 7, 2026 13:47

dd-octo-sts Bot mentioned this pull request Apr 23, 2026

[Backport v1.26] Add experiment signals to fleet remote config #2920

Merged

3 tasks

khewonc merged 9 commits into main from khewonc/rollback

4 participants
