Skip to content

fix(operator): reconcile DGD on PodClique readiness-field status changes#8423

Merged
jthomson04 merged 1 commit into
mainfrom
jthomson04/reconcile-dgd-on-podclique-status
May 7, 2026
Merged

fix(operator): reconcile DGD on PodClique readiness-field status changes#8423
jthomson04 merged 1 commit into
mainfrom
jthomson04/reconcile-dgd-on-podclique-status

Conversation

@jthomson04
Copy link
Copy Markdown
Contributor

@jthomson04 jthomson04 commented Apr 21, 2026

Summary

Companion to #8328 for the single-node PodClique branch of the readiness check at dynamographdeployment_controller.go:419-423. #8328 fixed the multi-node path (CheckPCSGReady) by adding a PCSG watch; this PR closes the analogous gap on the PodClique path (CheckPodCliqueReady).

Root cause

The existing PodClique watch predicate only triggered on Status.ReadyReplicas or Spec.Replicas diffs, but CheckPodCliqueReady (in deploy/operator/internal/dynamo/grove.go) gates readiness on more than that:

Gate Field
observedGeneration == nil → not ready Status.ObservedGeneration
observedGeneration < generation → not ready Status.ObservedGeneration
desiredReplicas != readyReplicas → not ready Status.ReadyReplicas ✅ already watched
desiredReplicas != updatedReplicas → not ready Status.UpdatedReplicas
replicas != desiredReplicas → "performing rolling update" Status.Replicas

At the tail of a rolling update, UpdatedReplicas / Status.Replicas / ObservedGeneration can transition without ReadyReplicas moving. No PodClique event fires that the predicate accepts, so the DGD keeps its stale aggregate until something else triggers a reconcile.

Changes

🤖 Generated with Claude Code


Open in Devin Review

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced controller event handling to trigger reconciliation on additional status change events, improving system responsiveness and ensuring more reliable state synchronization during deployment operations.

@jthomson04 jthomson04 requested a review from a team as a code owner April 21, 2026 03:15
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Apr 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added fix deployment::k8s Relates to dynamo deployment in kubernetes labels Apr 21, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 21, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 49abf0ae-f09c-4ea9-ac71-69388ddde843

📥 Commits

Reviewing files that changed from the base of the PR and between 781e410 and efc6f4a.

📒 Files selected for processing (1)
  • deploy/operator/internal/controller/dynamographdeployment_controller.go

Walkthrough

Updated PodClique UpdateFunc event predicate to trigger reconciliation on additional status fields: Status.UpdatedReplicas, Status.Replicas, and Status.ObservedGeneration, expanding trigger conditions beyond Status.ReadyReplicas and Spec.Replicas.

Changes

Cohort / File(s) Summary
Event Predicate Expansion
deploy/operator/internal/controller/dynamographdeployment_controller.go
Broadened UpdateFunc event predicate conditions in SetupWithManager to include Status.UpdatedReplicas, Status.Replicas, and Status.ObservedGeneration checks, aligning trigger logic with CheckPodCliqueReady readiness gating.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: expanding PodClique watch predicates to trigger reconciliation on readiness-field status changes.
Description check ✅ Passed The description covers all required template sections (Overview, Details, Where to start) with comprehensive context, root cause analysis, and test plan. However, the Related Issues section does not use action keywords as required by the template.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.4)

level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@jthomson04 jthomson04 force-pushed the jthomson04/reconcile-dgd-on-podclique-status branch from efc6f4a to 3fc6f33 Compare April 21, 2026 03:52
@jthomson04
Copy link
Copy Markdown
Contributor Author

/ok to test 3fc6f33

Companion to #8328 for the single-node PodClique branch of
CheckPodCliqueReady. The PodClique watch predicate only triggered on
ReadyReplicas and Spec.Replicas, but CheckPodCliqueReady also gates
readiness on UpdatedReplicas, Status.Replicas, and ObservedGeneration.
At the tail of a rolling update those fields can transition without
ReadyReplicas moving, leaving the DGD stuck in a stale NotReady state
until something else triggered a reconcile.

Expand the predicate to match the full set of fields CheckPodCliqueReady
consumes, using the ptrInt64Equal helper introduced in #8328.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jothomson@nvidia.com>
@jthomson04 jthomson04 force-pushed the jthomson04/reconcile-dgd-on-podclique-status branch from 3fc6f33 to bcc8e93 Compare May 7, 2026 06:50
@jthomson04
Copy link
Copy Markdown
Contributor Author

/ok to test bcc8e93

@jthomson04 jthomson04 merged commit b755a82 into main May 7, 2026
58 checks passed
@jthomson04 jthomson04 deleted the jthomson04/reconcile-dgd-on-podclique-status branch May 7, 2026 17:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

deployment::k8s Relates to dynamo deployment in kubernetes fix size/S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants