Skip to content

fix(operator): reconcile DynamoGraphDeployment on PodCliqueScalingGroup status changes#8328

Merged
julienmancuso merged 3 commits into
mainfrom
jsm/watch-pcsg
Apr 20, 2026
Merged

fix(operator): reconcile DynamoGraphDeployment on PodCliqueScalingGroup status changes#8328
julienmancuso merged 3 commits into
mainfrom
jsm/watch-pcsg

Conversation

@julienmancuso
Copy link
Copy Markdown
Contributor

@julienmancuso julienmancuso commented Apr 17, 2026

Summary

The DGD controller does not watch PodCliqueScalingGroup, so a DGD whose readiness depends on PCSG.Status.AvailableReplicas can remain stuck at Ready=False even after the underlying PCSG becomes fully available.

Observed on dynamo-gcp-dev-01 (Grove v0.1.0-alpha.6, Dynamo operator v1.0.1):

  • DGD Ready=False with message pcsg/dynamo-recipe-0-worker: desired=4, available=3
  • PCSG itself reports status.availableReplicas=4, status.replicas=4, status.scheduledReplicas=4, status.updatedReplicas=4
  • DGD and PCSG lastTransitionTime both match the exact second the last worker pod went Ready

Root cause

CheckPCSGReady reads pcsg.Status.AvailableReplicas directly and compares to pcsg.Spec.Replicas. At the moment the last PodClique's ReadyReplicas flipped to 1, the DGD reconciled, read a pre-update PCSG snapshot (AvailableReplicas=3), and wrote available=3 into the DGD aggregate. Immediately after, the PCSG controller recomputed AvailableReplicas to 4, but:

  • The DGD controller only watches PodClique (on ReadyReplicas / Spec.Replicas diffs) and PodCliqueSet.
  • No PodCliqueScalingGroup watch exists (there was even a comment claiming it wasn't needed).
  • No PodClique event fires after the last Ready transition, so the DGD never reconciles again and stays stuck on its cached available=3 indefinitely.

Changes

  • Add a Watches(&grovev1alpha1.PodCliqueScalingGroup{}, ...) on the DGD controller (Grove-enabled path only), firing on status-replica or Spec.Replicas changes.
  • New mapPodCliqueScalingGroupToRequests map func. PCSGs are controller-owned by PodCliqueSet, which Dynamo creates with the same name as the DGD (graph.go: gangSet.Name = dynamoDeployment.Name), so the controller ownerRef.Name is the DGD name directly, no extra API call.
  • Add missing kubebuilder RBAC markers for grove.io/podcliques and grove.io/podcliquescalinggroups with get;list;watch. These were never declared even though CheckPodCliqueReady and CheckPCSGReady already do client.Get() on both, and the new watch requires list/watch.
  • Regenerated config/rbac/role.yaml via make manifests.
  • Manually synced the hand-maintained Helm chart template deploy/helm/charts/platform/components/operator/templates/manager-rbac.yaml (it is not auto-generated from controller-gen).
  • New table-driven unit test TestMapPodCliqueScalingGroupToRequests covering: PCSG with PodCliqueSet controller ownerRef → DGD request; no ownerRef → none; non-PodCliqueSet ownerRef → none; non-PCSG object → none.

Test plan

  • go build ./... clean
  • go test ./internal/controller/... -run TestMapPodCliqueScalingGroupToRequests passes
  • helm lint deploy/helm/charts/platform/components/operator passes
  • make manifests produces only the intended RBAC diff

@julienmancuso julienmancuso requested a review from a team as a code owner April 17, 2026 21:30
@github-actions github-actions Bot added fix deployment::k8s Relates to dynamo deployment in kubernetes labels Apr 17, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 17, 2026

Walkthrough

The changes extend RBAC permissions to grant read access (get, list, watch) to podcliques and podcliquescalinggroups resources in the grove.io API group across Helm and operator configurations. Additionally, the DynamoGraphDeployment controller is enhanced to watch PodCliqueScalingGroup changes and map them to reconciliation requests via owner references, with corresponding test coverage.

Changes

Cohort / File(s) Summary
RBAC Configuration
deploy/helm/charts/platform/components/operator/templates/manager-rbac.yaml, deploy/operator/config/rbac/role.yaml
Added explicit read-only (get, list, watch) permissions for podcliques and podcliquescalinggroups under grove.io API group in ClusterRole and Helm template.
Controller Implementation
deploy/operator/internal/controller/dynamographdeployment_controller.go
Added kubebuilder RBAC annotations for podcliques and podcliquescalinggroups read permissions. Registered new Watches() for PodCliqueScalingGroup with PredicateFuncs filtering replica-related field changes. Introduced mapPodCliqueScalingGroupToRequests() method to resolve scaling group changes to owning PodCliqueSet reconciliation requests via owner references.
Controller Tests
deploy/operator/internal/controller/dynamographdeployment_controller_test.go
Added TestMapPodCliqueScalingGroupToRequests() covering mapping validation with valid owner references, missing/mismatched owner references, and non-PCSG input objects.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: adding reconciliation of DynamoGraphDeployment triggered by PodCliqueScalingGroup status changes.
Description check ✅ Passed The PR description comprehensively covers all required template sections with detailed technical context, root cause analysis, specific changes, and test plan.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (3)
deploy/operator/internal/controller/dynamographdeployment_controller.go (1)

1760-1782: Mapper ignores the Controller flag on the ownerRef.

The function matches any PodCliqueSet ownerRef (regardless of Controller: true/false). In practice Grove sets a single controller ownerRef pointing at the parent PCS, so this is benign today, but if a PCSG ever has a non-controller PodCliqueSet ownerRef (e.g., via external tooling) this would still enqueue a reconcile for that name. Consider tightening to the controller ownerRef to make the intent explicit and match the Dynamo-owns-PCS invariant noted in the comment:

Proposed tightening
-	groveAPIVersion := grovev1alpha1.SchemeGroupVersion.String()
-	for _, ownerRef := range pcsg.GetOwnerReferences() {
-		if ownerRef.Kind == "PodCliqueSet" && ownerRef.APIVersion == groveAPIVersion {
+	groveAPIVersion := grovev1alpha1.SchemeGroupVersion.String()
+	for _, ownerRef := range pcsg.GetOwnerReferences() {
+		if ownerRef.Kind == "PodCliqueSet" && ownerRef.APIVersion == groveAPIVersion &&
+			ownerRef.Controller != nil && *ownerRef.Controller {
 			return []ctrl.Request{{

You could equivalently use metav1.GetControllerOf(pcsg) and then check Kind/APIVersion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/operator/internal/controller/dynamographdeployment_controller.go`
around lines 1760 - 1782, The mapper mapPodCliqueScalingGroupToRequests
currently enqueues a PodCliqueSet reconcile for any ownerRef with Kind
"PodCliqueSet" ignoring the ownerRef.Controller flag; update the logic to only
consider the controller owner reference (either by checking ownerRef.Controller
== true when iterating pcsg.GetOwnerReferences() or by using
metav1.GetControllerOf(pcsg) and validating its Kind == "PodCliqueSet" and
APIVersion matches grovev1alpha1.SchemeGroupVersion.String()), then return the
ctrl.Request for that controller-only owner; ensure the function still returns
nil and logs when no controller PodCliqueSet owner is found.
deploy/operator/internal/controller/dynamographdeployment_controller_test.go (1)

2700-2775: Consider adding a case for a PodCliqueSet ownerRef with a mismatched APIVersion.

mapPodCliqueScalingGroupToRequests matches on both Kind == "PodCliqueSet" and APIVersion == grovev1alpha1.SchemeGroupVersion.String(). The negative cases currently cover "no ownerRefs", "different Kind", and "non-PCSG object", but not the branch where Kind is PodCliqueSet but APIVersion is something else (e.g., a future v1beta1 or a hand-crafted ref). Adding that case would lock in the APIVersion check.

Also, the happy-path fixture sets Controller: ptr.To(true) but the production code does not actually require the ownerRef to be the controller — so the flag has no effect on the assertion. Either assert the behavior you want (e.g., require Controller=true in the mapper) or drop the field from the fixture to avoid implying coverage you don't have.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/operator/internal/controller/dynamographdeployment_controller_test.go`
around lines 2700 - 2775, Add a test case to
TestMapPodCliqueScalingGroupToRequests that covers a PodCliqueScalingGroup whose
OwnerReference has Kind == "PodCliqueSet" but APIVersion !=
grovev1alpha1.SchemeGroupVersion.String() to ensure
mapPodCliqueScalingGroupToRequests ignores mismatched APIVersion; locate the
existing table-driven tests and add an entry similar to the happy-path but with
a different APIVersion and expect 0 requests. Also make the fixture consistent
with actual mapper behavior: either remove the Controller: ptr.To(true) field
from the happy-path PodCliqueScalingGroup or change the assertions to require
Controller==true in mapPodCliqueScalingGroupToRequests (choose one and update
the test accordingly) so the test intent matches the implementation.
deploy/helm/charts/platform/components/operator/templates/manager-rbac.yaml (1)

97-102: Pre-existing wildcard rule makes the new grove.io rule redundant (not introduced by this PR).

The existing rule granting * verbs on * resources in * apiGroups already covers everything, so the new podcliques/podcliquescalinggroups rule added at lines 139-147 has no runtime effect in this Helm chart. This is not a blocker for this PR — just flagging that the explicit grant here is cosmetic while the wildcard exists. Consider tightening the wildcard as a separate cleanup.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/helm/charts/platform/components/operator/templates/manager-rbac.yaml`
around lines 97 - 102, The RBAC manifest contains a global wildcard rule
(apiGroups: '*', resources: '*', verbs: '*') that already grants access to
grove.io resources, so the explicit grove.io rules for podcliques and
podcliquescalinggroups are redundant; either remove the explicit
podcliques/podcliquescalinggroups rule (the entries that reference apiGroup
grove.io and resources podcliques/podcliquescalinggroups) to avoid cosmetic
duplication, or (preferred as a separate cleanup) tighten the global wildcard
rule (replace apiGroups: '*', resources: '*', verbs: '*' with a minimal set) so
the grove.io stanza becomes necessary.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@deploy/helm/charts/platform/components/operator/templates/manager-rbac.yaml`:
- Around line 97-102: The RBAC manifest contains a global wildcard rule
(apiGroups: '*', resources: '*', verbs: '*') that already grants access to
grove.io resources, so the explicit grove.io rules for podcliques and
podcliquescalinggroups are redundant; either remove the explicit
podcliques/podcliquescalinggroups rule (the entries that reference apiGroup
grove.io and resources podcliques/podcliquescalinggroups) to avoid cosmetic
duplication, or (preferred as a separate cleanup) tighten the global wildcard
rule (replace apiGroups: '*', resources: '*', verbs: '*' with a minimal set) so
the grove.io stanza becomes necessary.

In
`@deploy/operator/internal/controller/dynamographdeployment_controller_test.go`:
- Around line 2700-2775: Add a test case to
TestMapPodCliqueScalingGroupToRequests that covers a PodCliqueScalingGroup whose
OwnerReference has Kind == "PodCliqueSet" but APIVersion !=
grovev1alpha1.SchemeGroupVersion.String() to ensure
mapPodCliqueScalingGroupToRequests ignores mismatched APIVersion; locate the
existing table-driven tests and add an entry similar to the happy-path but with
a different APIVersion and expect 0 requests. Also make the fixture consistent
with actual mapper behavior: either remove the Controller: ptr.To(true) field
from the happy-path PodCliqueScalingGroup or change the assertions to require
Controller==true in mapPodCliqueScalingGroupToRequests (choose one and update
the test accordingly) so the test intent matches the implementation.

In `@deploy/operator/internal/controller/dynamographdeployment_controller.go`:
- Around line 1760-1782: The mapper mapPodCliqueScalingGroupToRequests currently
enqueues a PodCliqueSet reconcile for any ownerRef with Kind "PodCliqueSet"
ignoring the ownerRef.Controller flag; update the logic to only consider the
controller owner reference (either by checking ownerRef.Controller == true when
iterating pcsg.GetOwnerReferences() or by using metav1.GetControllerOf(pcsg) and
validating its Kind == "PodCliqueSet" and APIVersion matches
grovev1alpha1.SchemeGroupVersion.String()), then return the ctrl.Request for
that controller-only owner; ensure the function still returns nil and logs when
no controller PodCliqueSet owner is found.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 44e72a67-442c-4bd2-a391-a810a90585f7

📥 Commits

Reviewing files that changed from the base of the PR and between 8428c65 and f3a51fe.

📒 Files selected for processing (4)
  • deploy/helm/charts/platform/components/operator/templates/manager-rbac.yaml
  • deploy/operator/config/rbac/role.yaml
  • deploy/operator/internal/controller/dynamographdeployment_controller.go
  • deploy/operator/internal/controller/dynamographdeployment_controller_test.go

@julienmancuso julienmancuso merged commit c68d159 into main Apr 20, 2026
85 checks passed
@julienmancuso julienmancuso deleted the jsm/watch-pcsg branch April 20, 2026 20:04
jthomson04 added a commit that referenced this pull request Apr 21, 2026
Companion to #8328 for the single-node PodClique branch of
CheckPodCliqueReady. The PodClique watch predicate only triggered on
ReadyReplicas and Spec.Replicas, but CheckPodCliqueReady also gates
readiness on UpdatedReplicas, Status.Replicas, and ObservedGeneration.
At the tail of a rolling update those fields can transition without
ReadyReplicas moving, leaving the DGD stuck in a stale NotReady state
until something else triggered a reconcile.

Expand the predicate to match the full set of fields CheckPodCliqueReady
consumes, using the ptrInt64Equal helper introduced in #8328.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jothomson@nvidia.com>
jthomson04 added a commit that referenced this pull request May 7, 2026
Companion to #8328 for the single-node PodClique branch of
CheckPodCliqueReady. The PodClique watch predicate only triggered on
ReadyReplicas and Spec.Replicas, but CheckPodCliqueReady also gates
readiness on UpdatedReplicas, Status.Replicas, and ObservedGeneration.
At the tail of a rolling update those fields can transition without
ReadyReplicas moving, leaving the DGD stuck in a stale NotReady state
until something else triggered a reconcile.

Expand the predicate to match the full set of fields CheckPodCliqueReady
consumes, using the ptrInt64Equal helper introduced in #8328.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jothomson@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

deployment::k8s Relates to dynamo deployment in kubernetes fix size/L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants