fix: deterministic rollout order for multiple nodes per NodeType #15

Open

aruraghuwanshi wants to merge 2 commits into apache:master from aruraghuwanshi:deterministic-rollout-ordering

Conversation


@aruraghuwanshi commented Apr 23, 2026

Summary

getNodeSpecsByOrder groups node specs by NodeType, but previously appended them in whatever order for key, nodeSpec := range m.Spec.Nodes happened to produce. In Go, map iteration order is not stable, so the relative order of multiple specs with the same NodeType could change between reconciles.

With rollingDeploy: true, the handler walks that ordered list and may return early while a workload is still rolling. If the intra–NodeType order flips between calls, the operator can effectively start or advance rollouts for more than one StatefulSet/Deployment of the same NodeType at a time, instead of finishing one before the next.

This PR sorts specs by their map key (ascending) within each NodeType before building the final list, while keeping the existing cross–NodeType order from druidServicesOrder unchanged.
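Roughly, the change looks like the sketch below. The types are simplified stand-ins for the operator's internals (nodeSpec here is illustrative, not the real struct), but the group-then-sort shape matches the fix:

```go
package main

import (
	"fmt"
	"sort"
)

// nodeSpec is a stand-in for the operator's node spec; only the fields
// needed to illustrate the ordering are kept.
type nodeSpec struct {
	key      string
	nodeType string
}

// getNodeSpecsByOrder groups specs by NodeType in serviceOrder, sorting
// within each NodeType by map key so the result is stable across reconciles.
func getNodeSpecsByOrder(nodes map[string]nodeSpec, serviceOrder []string) []nodeSpec {
	byType := map[string][]nodeSpec{}
	for key, spec := range nodes { // map iteration: order randomized per run
		spec.key = key
		byType[spec.nodeType] = append(byType[spec.nodeType], spec)
	}
	var ordered []nodeSpec
	for _, nt := range serviceOrder { // cross-NodeType order: unchanged
		specs := byType[nt]
		// The fix: deterministic intra-NodeType order, sorted by spec key.
		sort.Slice(specs, func(i, j int) bool { return specs[i].key < specs[j].key })
		ordered = append(ordered, specs...)
	}
	return ordered
}

func main() {
	nodes := map[string]nodeSpec{
		"historicalstier2": {nodeType: "historicals"},
		"historicalstier1": {nodeType: "historicals"},
		"brokers":          {nodeType: "brokers"},
	}
	for _, s := range getNodeSpecsByOrder(nodes, []string{"historicals", "brokers"}) {
		fmt.Println(s.key) // historicalstier1, historicalstier2, brokers on every run
	}
}
```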


What changed

  • getNodeSpecsByOrder (controllers/druid/ordering.go): sort.Slice on each per–NodeType slice by ServiceGroup.key.
  • Unit tests (controllers/druid/ordering_test.go): the Ginkgo test data uses multiple historical tiers (historicalstier1 through historicalstier3); added testing.T tests that fail under the pre-fix map-only ordering and pass with stable sorting, plus a guard for the cross–NodeType order (see the test sketch after this list).
  • E2E (e2e/configs/druid-rolling-deploy-cr.yaml, e2e/test-rolling-deploy-ordering.sh, wired in from e2e/e2e.sh): two historical tiers (historicalstier1 / historicalstier2) with rollingDeploy: true, a patch to trigger a rollout, and checks that only one of the two historical StatefulSets is mid-update at a time, with the lexicographically first tier finishing before the second starts (when transitions are observable at the poll interval).
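For reference, the testing.T coverage has roughly this shape (hypothetical test name, reusing the simplified types from the sketch above): the same spec map is ordered repeatedly, and any run-to-run difference fails the test.

```go
import (
	"reflect"
	"testing"
)

// TestNodeSpecOrderIsStable: pre-fix, Go's per-run map iteration
// randomization makes some of these iterations differ; post-fix the
// order is identical every time.
func TestNodeSpecOrderIsStable(t *testing.T) {
	nodes := map[string]nodeSpec{
		"historicalstier1": {nodeType: "historicals"},
		"historicalstier2": {nodeType: "historicals"},
		"historicalstier3": {nodeType: "historicals"},
	}
	want := getNodeSpecsByOrder(nodes, []string{"historicals"})
	for i := 0; i < 100; i++ {
		got := getNodeSpecsByOrder(nodes, []string{"historicals"})
		if !reflect.DeepEqual(got, want) {
			t.Fatalf("iteration %d: order changed: got %v, want %v", i, got, want)
		}
	}
}
```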

Testing

  • Tested on a real Kubernetes cluster
  • Tested for backward compatibility on an existing cluster (no API/schema change; behavior change is ordering only, which is the intended fix for rollingDeploy with multiple nodes per NodeType).
  • Comments in getNodeSpecsByOrder document why sorting is required (kept short).
  • User-facing docs (e.g. operator docs): not added. The suggested text below can be used if the project publishes release notes.

Release note (suggested)

Druid Operator: When rollingDeploy is enabled, rollout order for multiple StatefulSets/Deployments that share the same NodeType (e.g. historicalstier1 and historicalstier2) is now stable (sorted by node spec key). That avoids concurrent rollouts within the same NodeType caused by non-deterministic map iteration.

This is especially helpful when the two tiers split segment replicas across them (one replica in each tier): today, both historicals rolling out at once leaves the Druid cluster partially unavailable.


Key changed/added files

  • controllers/druid/ordering.go — sort within each NodeType by spec key
  • controllers/druid/ordering_test.go — Ginkgo + testing.T coverage
  • controllers/druid/testdata/ordering.yaml — fixture with multiple historical tiers
  • e2e/configs/druid-rolling-deploy-cr.yaml — rolling-deploy E2E CR
  • e2e/test-rolling-deploy-ordering.sh — E2E script
  • e2e/e2e.sh — invoke the new E2E test

Fixes #XXXX.

Sort node specs by map key within each NodeType in getNodeSpecsByOrder
so rollingDeploy does not flap on Go map iteration. Add unit tests
(prove non-determinism pre-fix) and an E2E check for two historical
tiers (historicalstier1/2).
@razinbouzar
Contributor

@aruraghuwanshi I believe the failing 900s timeout is caused by the test’s rollout trigger, not by the deterministic ordering change itself.

e2e/test-rolling-deploy-ordering.sh currently patches spec.nodes.historicalstier{1,2}.workloadAnnotations and assumes that this forces both StatefulSets to roll. In practice, that only changes StatefulSet object metadata, not the pod template, so the StatefulSet updateRevision never changes and the script eventually times out.
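To make the distinction concrete, here is a sketch (not the operator's exact code) of where each field plausibly lands on the StatefulSet; only the second assignment affects the pod template:

```go
package sketch

import appsv1 "k8s.io/api/apps/v1"

// applyAnnotations illustrates the two annotation targets.
func applyAnnotations(sts *appsv1.StatefulSet, workloadAnn, podAnn map[string]string) {
	// StatefulSet object metadata: changing this never touches the pod
	// template, so status.updateRevision is unchanged and no pods roll.
	sts.ObjectMeta.Annotations = workloadAnn

	// Pod template metadata: any change here produces a new
	// ControllerRevision, bumps status.updateRevision, and triggers the
	// rolling update the E2E test needs to observe.
	sts.Spec.Template.ObjectMeta.Annotations = podAnn
}
```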

I’d recommend:

  • patch podAnnotations instead of workloadAnnotations to force a real StatefulSet revision change
  • fail early if tier1 never picks up a new revision
  • assert the ordering directly via revision progression:
    • tier1 revision changes first
    • tier2 revision is still unchanged at that moment
    • tier1 rollout completes
    • tier2 revision changes only afterwards
  • use trap-based cleanup so failed runs do not leak test resources

I tested this locally and the revised flow behaves as expected: the rollout is triggered, tier1 updates first, tier2 waits, and the test completes successfully without hitting the 900s timeout.
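The revision check itself is simple. The E2E script does it with kubectl, but the same assertion in Go (a sketch using client-go, with hypothetical helper names) would be:

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// revisions returns a StatefulSet's currentRevision and updateRevision;
// the two differ exactly while a rolling update is in progress.
func revisions(ctx context.Context, cs kubernetes.Interface, ns, name string) (current, update string, err error) {
	sts, err := cs.AppsV1().StatefulSets(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return "", "", err
	}
	return sts.Status.CurrentRevision, sts.Status.UpdateRevision, nil
}

// midRollout reports whether the StatefulSet is between revisions, i.e. the
// condition the test asserts holds for at most one tier at any given poll.
func midRollout(ctx context.Context, cs kubernetes.Interface, ns, name string) (bool, error) {
	cur, upd, err := revisions(ctx, cs, ns, name)
	if err != nil {
		return false, err
	}
	return cur != upd, nil
}
```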

@razinbouzar
Contributor

@AdheipSingh can you take a look at this PR?

workloadAnnotations only touch StatefulSet object metadata, not the pod
template, so updateRevision never changes and the test times out at 900s.

Switch to podAnnotations (which flow into PodTemplateSpec), add trap-based
cleanup, and fail fast if tier1 never picks up a new revision.
@aruraghuwanshi
Author

Thanks @razinbouzar for the insights. That does seem to be the core issue. I've pushed another commit fixing it. Let's see.

@razinbouzar
Contributor

@abhishekrb19 or @AdheipSingh can you kick off the test workflow and review?
