Fix flaky testRealtimeTableProcessAllModeMultiLevelConcat #18253
Merged
xiangfu0 merged 1 commit into apache:master, Apr 19, 2026
…ocessAllModeMultiLevelConcat

The per-iteration gauge assertions on `mergeRollupTaskNumBucketsToProcess.*` race with the scheduler: `waitForTaskToComplete()` only waits for Helix `COMPLETED` state, but the gauge is (re)registered and updated by `PinotTaskManager.scheduleTasks`, which can see an in-flight task or a pending segment-lineage commit and either skip the merge level entirely (leaving the gauge value stale) or briefly miss the re-registration (causing `gaugeExists` to return false).

Extract the gauge check into a helper that polls via `TestUtils.waitForCondition` until both gauges exist and match the expected values, replacing the duplicated inline assertions in both for-loops.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
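The stale-value half of the race described above can be illustrated with a deliberately simplified toy model. This is not Pinot code: `ToyGaugeScheduler`, its fields, and the gauge name are hypothetical, and the only behavior borrowed from the commit message is that a scheduling pass which sees an in-flight task skips the level and leaves the gauge untouched.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model (not Pinot code) of a scheduler that skips a merge level while a task is in flight. */
final class ToyGaugeScheduler {
  private final Map<String, Long> gauges = new HashMap<>();
  private boolean taskInFlight;
  private long bucketsToProcess;

  void setBucketsToProcess(long buckets) {
    bucketsToProcess = buckets;
  }

  void setTaskInFlight(boolean inFlight) {
    taskInFlight = inFlight;
  }

  /** Updates the gauge only when no task is in flight; otherwise the previous value survives. */
  void scheduleTasks(String gaugeName) {
    if (taskInFlight) {
      return; // level skipped: the gauge keeps its previous (now stale) value
    }
    gauges.put(gaugeName, bucketsToProcess);
  }

  /** Returns the current gauge value, or null if the gauge was never registered. */
  Long gaugeValue(String gaugeName) {
    return gauges.get(gaugeName);
  }

  public static void main(String[] args) {
    String gauge = "numBucketsToProcess.level1"; // hypothetical gauge name
    ToyGaugeScheduler scheduler = new ToyGaugeScheduler();

    scheduler.setBucketsToProcess(3);
    scheduler.scheduleTasks(gauge); // gauge registered with value 3

    scheduler.setTaskInFlight(true);
    scheduler.setBucketsToProcess(2); // real state moved on, task not yet settled
    scheduler.scheduleTasks(gauge); // skipped, gauge still reports 3
    System.out.println("while in flight: " + scheduler.gaugeValue(gauge)); // prints 3

    scheduler.setTaskInFlight(false);
    scheduler.scheduleTasks(gauge); // gauge refreshed
    System.out.println("after settle: " + scheduler.gaugeValue(gauge)); // prints 2
  }
}
```

In this model a one-shot assertion made between the second and third `scheduleTasks` calls would read 3 instead of 2, which is the kind of window a polling check is meant to ride out.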
92c798c to 6c191cc
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@             Coverage Diff              @@
##             master   #18253      +/-   ##
============================================
+ Coverage     63.48%   63.49%   +0.01%
  Complexity     1627     1627
============================================
  Files          3244     3244
  Lines        197342   197342
  Branches      30529    30529
============================================
+ Hits         125285   125312      +27
+ Misses        62014    62001      -13
+ Partials      10043    10029      -14
Pull request overview
Improves stability of MergeRollupMinionClusterIntegrationTest.testRealtimeTableProcessAllModeMultiLevelConcat by replacing one-shot metric assertions with a polling helper that waits until the expected mergeRollupTaskNumBucketsToProcess.* gauges both exist and match expected values, reducing flakiness from short scheduling/Helix/lineage race windows.
Changes:
- Add `waitForExpectedNumBucketsToProcess(...)` helper that polls controller gauges via `TestUtils.waitForCondition`.
- Replace duplicated inline metric assertions in both scheduling loops with the new helper.
- Add `ControllerMetrics` import to avoid repeated `_controllerStarter.getControllerMetrics()` calls in the helper.
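A minimal, self-contained sketch of the poll-until-match idea follows. It does not use Pinot's `TestUtils` or `ControllerMetrics`: the `waitForCondition` method below is a local stand-in whose signature is an assumption, the gauge registry is a plain map, and the gauge names are made up for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BooleanSupplier;

/** Sketch of a poll-until-match gauge check; every name here is an illustrative stand-in. */
final class GaugePollingSketch {
  /** Local stand-in for a polling utility: re-checks until the condition holds or time runs out. */
  static boolean waitForCondition(BooleanSupplier condition, long checkIntervalMs, long timeoutMs) {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (true) {
      if (condition.getAsBoolean()) {
        return true;
      }
      if (System.nanoTime() >= deadline) {
        return false;
      }
      try {
        Thread.sleep(checkIntervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt and give up
        return false;
      }
    }
  }

  /** True only when every expected gauge exists and carries the expected value. */
  static boolean gaugesMatch(Map<String, Long> registry, Map<String, Long> expected) {
    for (Map.Entry<String, Long> entry : expected.entrySet()) {
      Long actual = registry.get(entry.getKey());
      if (actual == null || !actual.equals(entry.getValue())) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) throws Exception {
    Map<String, Long> registry = new ConcurrentHashMap<>();
    Map<String, Long> expected = Map.of("buckets.level1", 2L, "buckets.level2", 1L);

    // Simulated scheduler: publishes the gauges shortly after the "task" completes.
    Thread scheduler = new Thread(() -> {
      try {
        Thread.sleep(100);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      registry.putAll(expected);
    });
    scheduler.start();

    boolean oneShot = gaugesMatch(registry, expected); // checked once, right away
    boolean polled = waitForCondition(() -> gaugesMatch(registry, expected), 10, 5_000);
    scheduler.join();
    System.out.println("one-shot=" + oneShot + ", polled=" + polled);
  }
}
```

Checking existence and value in the same predicate means a half-updated registry simply fails that round and is retried, rather than producing a misleading assertion message.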
Jackie-Jiang approved these changes Apr 19, 2026
xiangfu0 added a commit to xiangfu0/pinot that referenced this pull request Apr 20, 2026
Extends the polling pattern introduced in apache#18253 (for `mergeRollupTaskNumBucketsToProcess`) to the remaining five `mergeRollupTaskDelayInNumBuckets.*` `gaugeExists` checks in the same test class.

The gauge is registered by `MergeRollupTaskGenerator.createOrUpdateDelayMetrics` and removed by `resetDelayMetrics` when a `scheduleTasks` call observes no eligible segments. The per-iteration body's `assertNull(scheduleTasks(context).get(RealtimeToOfflineSegmentsTask))` probe triggers an extra synchronized `scheduleTasks` that can race with the previous merge task's segment-lineage commit, transiently resetting the gauge and causing the post-loop `assertTrue(gaugeExists(...))` to flake on the same window that apache#18253 addressed.

A new `waitForGaugesToExist(String...)` helper polls via `TestUtils.waitForCondition` with the existing `TIMEOUT_IN_MS`, and is used in `testOfflineTableSingleLevelConcat`, `testOfflineTableSingleLevelConcatWithMetadataPush`, `testOfflineTableSingleLevelRollup`, `testOfflineTableMultiLevelConcat` (both 45days + 90days atomically), and `testRealtimeTableSingleLevelConcat`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
xiangfu0 added a commit that referenced this pull request Apr 20, 2026
…18260)

Extends the polling pattern introduced in #18253 (for `mergeRollupTaskNumBucketsToProcess`) to the remaining five `mergeRollupTaskDelayInNumBuckets.*` `gaugeExists` checks in the same test class.

The gauge is registered by `MergeRollupTaskGenerator.createOrUpdateDelayMetrics` and removed by `resetDelayMetrics` when a `scheduleTasks` call observes no eligible segments. The per-iteration body's `assertNull(scheduleTasks(context).get(RealtimeToOfflineSegmentsTask))` probe triggers an extra synchronized `scheduleTasks` that can race with the previous merge task's segment-lineage commit, transiently resetting the gauge and causing the post-loop `assertTrue(gaugeExists(...))` to flake on the same window that #18253 addressed.

A new `waitForGaugesToExist(String...)` helper polls via `TestUtils.waitForCondition` with the existing `TIMEOUT_IN_MS`, and is used in `testOfflineTableSingleLevelConcat`, `testOfflineTableSingleLevelConcatWithMetadataPush`, `testOfflineTableSingleLevelRollup`, `testOfflineTableMultiLevelConcat` (both 45days + 90days atomically), and `testRealtimeTableSingleLevelConcat`.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
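The existence-only variant described in this commit message might look roughly like the sketch below. It is an assumption-laden illustration, not the helper from the follow-up commit: `pollUntilAllGaugesExist` is a hypothetical name, the existence check is passed in as a `Predicate` instead of calling `MetricValueUtils.gaugeExists`, and the interval and timeout are explicit parameters rather than `TIMEOUT_IN_MS`.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

/** Sketch of a varargs "all gauges exist" poll; names and signature are hypothetical. */
final class GaugeExistenceSketch {
  /** Polls until every named gauge exists within a single check, or the timeout elapses. */
  static boolean pollUntilAllGaugesExist(Predicate<String> gaugeExists, long checkIntervalMs,
      long timeoutMs, String... gaugeNames) {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (true) {
      // All names are evaluated in the same round, so a transient reset of one gauge fails the round.
      if (Arrays.stream(gaugeNames).allMatch(gaugeExists)) {
        return true;
      }
      if (System.nanoTime() >= deadline) {
        return false;
      }
      try {
        Thread.sleep(checkIntervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt and give up
        return false;
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Set<String> registered = ConcurrentHashMap.newKeySet();
    registered.add("delay.45days"); // hypothetical gauge names

    // Simulated scheduler pass that registers the second gauge a little later.
    Thread scheduler = new Thread(() -> {
      try {
        Thread.sleep(50);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      registered.add("delay.90days");
    });
    scheduler.start();

    boolean ok = pollUntilAllGaugesExist(registered::contains, 10, 5_000, "delay.45days", "delay.90days");
    scheduler.join();
    System.out.println("all gauges exist: " + ok);
  }
}
```

Passing both names to one call, as the commit message notes for the 45days and 90days gauges, avoids the case where the first gauge is confirmed and then reset while the second is still being awaited.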
Summary
- Fixes flaky `MergeRollupMinionClusterIntegrationTest.testRealtimeTableProcessAllModeMultiLevelConcat`, which occasionally fails at `assertTrue(MetricValueUtils.gaugeExists(...))` for `mergeRollupTaskNumBucketsToProcess.myTable6_REALTIME.100days`.
- `mergeRollupTaskNumBucketsToProcess.*` gauges are (re)registered and updated only when `PinotTaskManager.scheduleTasks` runs for a merge level with no in-flight task. The per-iteration check in the test races with (a) the in-flight task's Helix `COMPLETED` transition and (b) the segment-lineage commit that follows, producing either a stale value or, more rarely, a missed gauge registration.
- Adds `waitForExpectedNumBucketsToProcess`, which polls via `TestUtils.waitForCondition` until both gauges exist and their values match the expected tuple. This absorbs the short race window and replaces the duplicated inline assertions in both for-loops.

Test plan
- `./mvnw test-compile -pl pinot-integration-tests` passes.
- `./mvnw spotless:apply checkstyle:check license:format license:check -pl pinot-integration-tests` clean.
- `MergeRollupMinionClusterIntegrationTest.testRealtimeTableProcessAllModeMultiLevelConcat` ran successfully across multiple runs.

🤖 Generated with Claude Code