Do not clear pendingCompletionTaskGroups in clearAllocationInfo #18715
kfaraz merged 6 commits into apache:master
kfaraz
left a comment
Thanks for the fix, @amaechler !
I have left a couple of minor suggestions.
We might want to add an embedded test for this race condition. But that need not block this bugfix.
Thanks @kfaraz for taking the time to review! I updated the wording a bit based on your suggestions. I'm not sure how I could rewrite the test to be more high-level, so I kept the actual test for now.
kfaraz
left a comment
  partitionOffsets.clear();
- pendingCompletionTaskGroups.clear();
+ // Note: We intentionally do NOT clear pendingCompletionTaskGroups here.
Since the original line of code has already been removed, this comment seems out of place here. It has already been called out in the javadoc anyway.
Fair enough, we can address that and add an embedded test in follow-up PRs.
Changes:
- Do not clear `pendingCompletionTaskGroups` in `clearAllocationInfo`
- Add unit test
Description
Fixes a bug where the SeekableStream supervisor autoscaler creates duplicate history entries every `minTriggerScaleActionFrequencyMillis` (default: 10 minutes) during scale-down operations, causing database pollution and preventing scale-down from completing.
Lots of help from Claude.
Problem
When the autoscaler scales down tasks, `clearAllocationInfo()` prematurely clears `pendingCompletionTaskGroups`, causing the supervisor to "forget" about tasks transitioning from READING to PUBLISHING state. On the next supervisor cycle, these tasks are rediscovered and re-added to `activelyReadingTaskGroups`, triggering another scale-down attempt and creating a duplicate history entry. This repeats every `minTriggerScaleActionFrequencyMillis` (default: 10 minutes). I saw hundreds of duplicate history entries, with entries created at exact 10-minute intervals.
The root cause is that the autoscaler has a built-in safeguard (lines 480-496) to skip scale actions when `pendingCompletionTaskGroups` is non-empty, but this check is ineffective because `clearAllocationInfo()` clears the map immediately after tasks were moved there.
Solution
Preserve `pendingCompletionTaskGroups` in `clearAllocationInfo()`. This allows the autoscaler's existing skip logic to function correctly, preventing duplicate scale attempts until tasks naturally complete (removed by `checkPendingCompletionTasks()` every supervisor cycle).
Release note
Fixed a bug in the SeekableStream supervisor autoscaler where scale-down operations would create duplicate supervisor history entries. The autoscaler now correctly waits for tasks to complete before attempting subsequent scale operations.
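For readers who want to see the loop in isolation, here is a minimal toy model of the behaviour described under Problem and Solution. It is a sketch under simplifying assumptions, not the actual `SeekableStreamSupervisor` code: only `clearAllocationInfo`, `activelyReadingTaskGroups` and `pendingCompletionTaskGroups` are names taken from this PR, while `SupervisorSketch`, `discoverTasks`, `tryScaleDown`, `historyEntriesAfter` and the `clearPending` flag are hypothetical, and the maps hold plain strings instead of real task groups.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: everything except clearAllocationInfo, activelyReadingTaskGroups
// and pendingCompletionTaskGroups is a hypothetical name, not a Druid API.
class SupervisorSketch {
    final Map<Integer, String> activelyReadingTaskGroups = new HashMap<>();
    final Map<Integer, String> pendingCompletionTaskGroups = new HashMap<>();
    final List<String> history = new ArrayList<>();
    private final boolean clearPending; // true models the behaviour before this PR

    SupervisorSketch(boolean clearPending) {
        this.clearPending = clearPending;
    }

    void clearAllocationInfo() {
        activelyReadingTaskGroups.clear();
        if (clearPending) {
            pendingCompletionTaskGroups.clear(); // the call this PR removes
        }
    }

    // Supervisor cycle: a running task that is not tracked as pending
    // completion is (re)registered as actively reading.
    void discoverTasks(Map<Integer, String> runningTasks) {
        runningTasks.forEach((groupId, taskId) -> {
            if (!pendingCompletionTaskGroups.containsKey(groupId)) {
                activelyReadingTaskGroups.put(groupId, taskId);
            }
        });
    }

    // Autoscaler tick: skipped while any task group is still pending completion.
    boolean tryScaleDown() {
        if (!pendingCompletionTaskGroups.isEmpty() || activelyReadingTaskGroups.isEmpty()) {
            return false;
        }
        // READING -> PUBLISHING: move the groups, record the action, reset allocation.
        pendingCompletionTaskGroups.putAll(activelyReadingTaskGroups);
        history.add("scale-down");
        clearAllocationInfo();
        return true;
    }
}

class ScaleDownSketch {
    // Runs `ticks` supervisor/autoscaler cycles while the same tasks keep publishing.
    static int historyEntriesAfter(boolean clearPending, int ticks) {
        SupervisorSketch supervisor = new SupervisorSketch(clearPending);
        Map<Integer, String> stillPublishing = Map.of(0, "task-0", 1, "task-1");
        for (int i = 0; i < ticks; i++) {
            supervisor.discoverTasks(stillPublishing);
            supervisor.tryScaleDown();
        }
        return supervisor.history.size();
    }

    public static void main(String[] args) {
        System.out.println("clearing pending groups:   " + historyEntriesAfter(true, 3));
        System.out.println("preserving pending groups: " + historyEntriesAfter(false, 3));
    }
}
```

In this model, three cycles with the old clearing behaviour record three history entries, because each cycle rediscovers the publishing tasks as readers. With the pending groups preserved, only the first cycle records an entry and the later ones are skipped by the safeguard.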
Key changed/added classes in this PR
`SeekableStreamSupervisor` - Modified `clearAllocationInfo()` to preserve `pendingCompletionTaskGroups`