Skip to content

fix: batch stickiness — nodes in NodePriority finish all packages before new nodes are picked#183

Merged
lockwobr merged 2 commits intomainfrom
ordered-and-sticky
Mar 17, 2026
Merged

fix: batch stickiness — nodes in NodePriority finish all packages before new nodes are picked#183
lockwobr merged 2 commits intomainfrom
ordered-and-sticky

Conversation

@lockwobr
Copy link
Collaborator

@lockwobr lockwobr commented Mar 17, 2026

Summary

  • Batch stickiness fix: Nodes picked for a batch now finish all their packages before new nodes are selected. Previously, IntrospectNode transitioning nodes from InProgress → Waiting between packages caused GetNodesForNextBatch() to pick new nodes prematurely — a Fixed 1-by-1 strategy would process all nodes in parallel instead of sequentially.

  • Monotonic node ordering: Pods now receive SKYHOOK_NODE_ORDER env var so packages can know their position (e.g., first node runs kubeadm upgrade apply, subsequent nodes run kubeadm upgrade node). Order is computed from NodeOrderOffset + position in the sorted NodePriority map, and all pruning goes through RemoveNodePriority to keep the offset in sync.

  • CLI reset: skyhook reset now clears NodeOrderOffset and NodePriority so ordering starts fresh. Removed omitempty from both fields so nil/0 values serialize correctly in merge patches.

Test plan

  • Unit tests: RemoveNodePriority (offset increment, no-op on missing/nil), NodeOrder (offset+position, name tiebreaker), CleanupRemovedNodes (offset assertion), createPodFromPackage (monotonic order across batches), getStickyBatchNodes (sticky returned, complete excluded, anti-leak tests)
  • E2E strict-order: Assert SKYHOOK_NODE_ORDER=0 on kind-worker config pod, SKYHOOK_NODE_ORDER=1 on kind-worker2 config pod, nodeOrderOffset: 2 on completed Skyhook status
  • E2E CLI reset: Assert nodeOrderOffset: 0 after skyhook reset

…ore new nodes are picked

Between packages, IntrospectNode transitions nodes from InProgress → Waiting,
causing getInProgressCount() to return 0 and createNewBatch() to pick new nodes
prematurely. With a Fixed 1-by-1 strategy, this meant all nodes processed in
parallel instead of sequentially.

GetNodesForNextBatch now checks NodePriority (via getStickyBatchNodes) before
falling through to createNewBatch. Nodes already in NodePriority that aren't
Complete are returned as the current sticky batch.
@lockwobr lockwobr force-pushed the ordered-and-sticky branch from 043a053 to f112ddd Compare March 17, 2026 05:03
Packages need to know their position for kubeadm upgrades (first node
runs "kubeadm upgrade apply", subsequent nodes run "kubeadm upgrade node").

- Add NodeOrderOffset to SkyhookStatus, incremented via RemoveNodePriority
- Add RemoveNodePriority/NodeOrder methods on *Skyhook wrapper
- Replace 3 raw delete(NodePriority) sites with RemoveNodePriority
- Add SKYHOOK_NODE_ORDER to getAgentConfigEnvVars for all pods
- Add name tiebreaker to node sort for deterministic ordering
- Reset NodeOrderOffset and NodePriority on CLI reset
- Remove omitempty from NodePriority/NodeOrderOffset so nil/0 serialize in merge patches
@lockwobr lockwobr force-pushed the ordered-and-sticky branch from f112ddd to 641a42d Compare March 17, 2026 06:37
@lockwobr lockwobr merged commit 9f439c6 into main Mar 17, 2026
22 of 27 checks passed
@lockwobr lockwobr deleted the ordered-and-sticky branch March 17, 2026 21:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants