Skip to content

fix: handle not-found error in nodeagent DaemonSet deletion#2198

Open
kaovilai wants to merge 1 commit into
openshift:oadp-devfrom
kaovilai:feature/fix-nodeagent-toctou-race
Open

fix: handle not-found error in nodeagent DaemonSet deletion#2198
kaovilai wants to merge 1 commit into
openshift:oadp-devfrom
kaovilai:feature/fix-nodeagent-toctou-race

Conversation

@kaovilai
Copy link
Copy Markdown
Member

@kaovilai kaovilai commented May 7, 2026

Summary

  • Fix TOCTOU race condition in ReconcileNodeAgentDaemonset when NodeAgent is disabled
  • Treat "not found" errors during DaemonSet deletion as success, since the desired state (DaemonSet absent) is already achieved
  • Matches the pattern already used by other Delete calls in the same file (ConfigMap deletion at lines 232 and 770)
    Fixes flakes from feat: OLMv1 lifecycle — fresh install tests + OLMv0→OLMv1 migration target #2160

Problem

When NodeAgent is disabled, concurrent reconciliation loops can race between Get() and Delete():

  1. Loop A checks DaemonSet exists via Get() → found
  2. Loop B deletes the DaemonSet
  3. Loop A calls Delete() → fails with "not found"
  4. Error causes DPA to transition to Reconciled=False

Test plan

  • Existing unit tests pass
  • E2E tests pass with NodeAgent disable/enable scenarios
  • Verify Reconciled=True remains stable when disabling NodeAgent under concurrent reconciliation

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Improved error handling when disabling NodeAgent to gracefully handle scenarios where underlying system components don't exist, enhancing overall reliability during cleanup operations.

When NodeAgent is disabled, a TOCTOU race condition can occur between
concurrent reconciliation loops: one loop deletes the DaemonSet while
another is between Get() and Delete(), causing Delete() to fail with
"not found". This error was incorrectly treated as a failure, causing
DPA to transition to Reconciled=False.

Treat "not found" errors during DaemonSet deletion as success since the
desired state (DaemonSet absent) is already achieved. This matches the
pattern already used by other Delete calls in the same file (lines 232
and 770).

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
Copilot AI review requested due to automatic review settings May 7, 2026 22:19
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 7, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 91a4bb87-ed0d-4ea0-9af4-e176289e0e9f

📥 Commits

Reviewing files that changed from the base of the PR and between 282c54a and 1380ec6.

📒 Files selected for processing (1)
  • internal/controller/nodeagent.go

Walkthrough

The NodeAgent DaemonSet reconciliation now explicitly handles "not found" errors during the deletion path when NodeAgent is disabled. If a DaemonSet does not exist, the reconciliation returns success instead of propagating the error.

Changes

NodeAgent DaemonSet Deletion Error Handling

Layer / File(s) Summary
Error Handling
internal/controller/nodeagent.go
When deleting a DaemonSet in ReconcileNodeAgentDaemonset, the method now returns (true, nil) if the target DaemonSet is not found, treating this case as a successful reconciliation rather than an error.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~5 minutes

🚥 Pre-merge checks | ✅ 9 | ❌ 3

❌ Failed checks (3 warnings)

Check name Status Explanation Resolution
Test Structure And Quality ⚠️ Warning Tests don't cover PR fix for NodeAgent disabled with "not found" errors. All 4 entries have Enable: true. Assertions lack meaningful failure messages per criterion 4. Add test cases for NodeAgent disabled scenario to verify "not found" error handling. Add meaningful messages to Expect() assertions for diagnostic clarity.
Microshift Test Compatibility ⚠️ Warning New Ginkgo e2e tests in nodeagent_test.go use configv1.Infrastructure API (config.openshift.io/v1) which is unavailable on MicroShift. No skip conditions or apigroup tags protect these tests. Add [apigroup:config.openshift.io] tag to test name or guard with exutil.IsMicroShiftCluster() check, since Infrastructure API is unavailable on MicroShift.
Ipv6 And Disconnected Network Test Compatibility ⚠️ Warning New virt_backup_restore_suite_test.go contains getLatestCirrosImageURL() function requiring external HTTP connectivity to download.cirros-cloud.net. Fails in disconnected IPv6-only environments. Replace getLatestCirrosImageURL() to use internal/cached images. Add [Skipped:Disconnected] to test entries requiring external downloads, or use cluster's internal registry for VM image sources.
✅ Passed checks (9 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: handling not-found errors in nodeagent DaemonSet deletion, which directly matches the primary fix in the changeset.
Description check ✅ Passed The PR description exceeds template requirements with detailed problem explanation, solution rationale, and a test plan, though the template required sections are addressed.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All Ginkgo test names are stable. Test titles use static strings with no dynamic content (UUIDs, timestamps, pod/node names, IPs). Test data correctly placed in bodies.
Single Node Openshift (Sno) Test Compatibility ✅ Passed Ginkgo e2e tests in dpa_deployment_suite_test.go do not make multi-node assumptions. Tests verify component enable/disable and DaemonSet config - all compatible with SNO.
Topology-Aware Scheduling Compatibility ✅ Passed PR adds error handling for TOCTOU race in NodeAgent DaemonSet deletion. No topology-breaking constraints are introduced. All affinity, nodeSelector settings are user-configurable, not hardcoded.
Ote Binary Stdout Contract ✅ Passed PR adds error handling to nodeagent.go method. No process-level code, stdout writes, or logging configuration changes introduced.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented May 7, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kaovilai

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 7, 2026
@kaovilai
Copy link
Copy Markdown
Member Author

kaovilai commented May 7, 2026

/cherry-pick oadp-1.5

@openshift-cherrypick-robot
Copy link
Copy Markdown
Contributor

@kaovilai: once the present PR merges, I will cherry-pick it on top of oadp-1.5 in a new PR and assign it to you.

Details

In response to this:

/cherry-pick oadp-1.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@kaovilai
Copy link
Copy Markdown
Member Author

kaovilai commented May 7, 2026

/cherry-pick oadp-1.6

@openshift-cherrypick-robot
Copy link
Copy Markdown
Contributor

@kaovilai: once the present PR merges, I will cherry-pick it on top of oadp-1.6 in a new PR and assign it to you.

Details

In response to this:

/cherry-pick oadp-1.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adjusts the NodeAgent DaemonSet reconciliation logic so that disabling NodeAgent is resilient to concurrent reconciles where the DaemonSet may already have been deleted, preventing spurious reconcile failures.

Changes:

  • Treat NotFound errors from DaemonSet Delete() as success when NodeAgent is disabled.
  • Prevent transient Reconciled=False status caused by TOCTOU delete races during disable flows.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 278 to +281
if err := r.Delete(deleteContext, ds, &client.DeleteOptions{PropagationPolicy: ptr.To(metav1.DeletePropagationForeground)}); err != nil {
if errors.IsNotFound(err) {
return true, nil
}
@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented May 8, 2026

@kaovilai: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/5.0-e2e-test-aws 1380ec6 link true /test 5.0-e2e-test-aws

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@kaovilai
Copy link
Copy Markdown
Member Author

kaovilai commented May 8, 2026

/hold

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 8, 2026
@kaovilai
Copy link
Copy Markdown
Member Author

kaovilai commented May 8, 2026

/hold cancel

@openshift-ci openshift-ci Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants