fix(operator): reconcile DGD on PodClique readiness-field status changes #8423
No actionable comments were generated in the recent review. 🎉
Review info: configuration path .coderabbit.yaml, review profile CHILL, plan Pro, 1 file selected for processing.
Estimated code review effort: 2 (Simple), ~8 minutes. Pre-merge checks: 5 passed.
Warning: There were issues while running some tools. golangci-lint (2.11.4): level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"
efc6f4a to 3fc6f33
/ok to test 3fc6f33
Companion to #8328 for the single-node PodClique branch of CheckPodCliqueReady.

The PodClique watch predicate only triggered on ReadyReplicas and Spec.Replicas, but CheckPodCliqueReady also gates readiness on UpdatedReplicas, Status.Replicas, and ObservedGeneration. At the tail of a rolling update those fields can transition without ReadyReplicas moving, leaving the DGD stuck in a stale NotReady state until something else triggered a reconcile.

Expand the predicate to match the full set of fields CheckPodCliqueReady consumes, using the ptrInt64Equal helper introduced in #8328.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jothomson@nvidia.com>
3fc6f33 to bcc8e93
/ok to test bcc8e93
Summary

Companion to #8328 for the single-node PodClique branch of the readiness check at `dynamographdeployment_controller.go:419-423`. #8328 fixed the multi-node path (`CheckPCSGReady`) by adding a PCSG watch; this PR closes the analogous gap on the PodClique path (`CheckPodCliqueReady`).

Root cause

The existing PodClique watch predicate only triggered on `Status.ReadyReplicas` or `Spec.Replicas` diffs, but `CheckPodCliqueReady` (in `deploy/operator/internal/dynamo/grove.go`) gates readiness on more than that:

| Check | Field read |
| --- | --- |
| `observedGeneration == nil` → not ready | `Status.ObservedGeneration` |
| `observedGeneration < generation` → not ready | `Status.ObservedGeneration` |
| `desiredReplicas != readyReplicas` → not ready | `Status.ReadyReplicas` ✅ already watched |
| `desiredReplicas != updatedReplicas` → not ready | `Status.UpdatedReplicas` |
| `replicas != desiredReplicas` → "performing rolling update" | `Status.Replicas` |

At the tail of a rolling update, `UpdatedReplicas` / `Status.Replicas` / `ObservedGeneration` can transition without `ReadyReplicas` moving. No PodClique event fires that the predicate accepts, so the DGD keeps its stale aggregate until something else triggers a reconcile.

Changes

- Expand the `UpdateFunc` predicate on the `&grovev1alpha1.PodClique{}` watch to also fire on `Status.UpdatedReplicas`, `Status.Replicas`, and `Status.ObservedGeneration` transitions.
- Use the `ptrInt64Equal` helper introduced in #8328 (fix(operator): reconcile DynamoGraphDeployment on PodCliqueScalingGroup status changes) for the optional `ObservedGeneration` (`*int64`) comparison, matching the pattern already used for the sibling PCSG predicate.
- `grove.io/podcliques` `get;list;watch` is already declared (added in #8328).

🤖 Generated with Claude Code
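The gap described above can be illustrated with a small standalone sketch. It is not the operator's code: `podClique`, `isReady`, `oldPredicate`, and `newPredicate` are hypothetical stand-ins that model only the fields named in this PR, and the body of `ptrInt64Equal` is an assumption about how the helper from #8328 behaves (two nil pointers compare equal). The scenario in `main` models the tail of a rolling update, where `ReadyReplicas` stays constant while `UpdatedReplicas` and `Status.Replicas` settle.

```go
package main

import "fmt"

// podClique is a hypothetical stand-in for the PodClique fields that the
// readiness check reads; it is not the real grovev1alpha1.PodClique type.
type podClique struct {
	Generation         int64
	SpecReplicas       int32
	ReadyReplicas      int32
	UpdatedReplicas    int32
	StatusReplicas     int32
	ObservedGeneration *int64
}

// ptrInt64Equal compares two optional int64 values. Assumed behavior of the
// helper from #8328: two nil pointers are equal, nil and non-nil are not.
func ptrInt64Equal(a, b *int64) bool {
	if a == nil || b == nil {
		return a == b
	}
	return *a == *b
}

// isReady models the five checks listed under "Root cause".
func isReady(pc podClique) bool {
	switch {
	case pc.ObservedGeneration == nil:
		return false
	case *pc.ObservedGeneration < pc.Generation:
		return false
	case pc.SpecReplicas != pc.ReadyReplicas:
		return false
	case pc.SpecReplicas != pc.UpdatedReplicas:
		return false
	case pc.StatusReplicas != pc.SpecReplicas:
		return false // "performing rolling update"
	}
	return true
}

// oldPredicate models the previous watch filter: only ReadyReplicas and
// Spec.Replicas diffs trigger a reconcile.
func oldPredicate(oldPC, newPC podClique) bool {
	return oldPC.ReadyReplicas != newPC.ReadyReplicas ||
		oldPC.SpecReplicas != newPC.SpecReplicas
}

// newPredicate models the expanded filter: every field isReady reads.
func newPredicate(oldPC, newPC podClique) bool {
	return oldPredicate(oldPC, newPC) ||
		oldPC.UpdatedReplicas != newPC.UpdatedReplicas ||
		oldPC.StatusReplicas != newPC.StatusReplicas ||
		!ptrInt64Equal(oldPC.ObservedGeneration, newPC.ObservedGeneration)
}

func main() {
	gen := int64(2)
	// Tail of a rolling update: three pods ready, one old pod still present.
	before := podClique{Generation: 2, SpecReplicas: 3, ReadyReplicas: 3,
		UpdatedReplicas: 2, StatusReplicas: 4, ObservedGeneration: &gen}
	// The rollout finishes; ReadyReplicas does not move.
	after := before
	after.UpdatedReplicas = 3
	after.StatusReplicas = 3

	fmt.Println("ready before:", isReady(before))
	fmt.Println("ready after:", isReady(after))
	fmt.Println("old predicate fires:", oldPredicate(before, after))
	fmt.Println("new predicate fires:", newPredicate(before, after))
}
```

In this scenario readiness flips from false to true while the old filter reports no relevant change, which is the stale NotReady state the PR describes; the expanded filter accepts the same update event.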