Fix flaky testRealtimeTableProcessAllModeMultiLevelConcat by xiangfu0 · Pull Request #18253 · apache/pinot

xiangfu0 · 2026-04-18T08:45:44Z

Summary

Fixes flaky MergeRollupMinionClusterIntegrationTest.testRealtimeTableProcessAllModeMultiLevelConcat, which occasionally fails at assertTrue(MetricValueUtils.gaugeExists(...)) for mergeRollupTaskNumBucketsToProcess.myTable6_REALTIME.100days.
Root cause: the mergeRollupTaskNumBucketsToProcess.* gauges are (re)registered and updated only when PinotTaskManager.scheduleTasks runs for a merge level with no in-flight task. The per-iteration check in the test races with (a) the in-flight task's Helix COMPLETED transition and (b) the segment-lineage commit that follows — producing either a stale value or, more rarely, a missed gauge registration.
Fix: extract the gauge check into waitForExpectedNumBucketsToProcess, which polls via TestUtils.waitForCondition until both gauges exist and their values match the expected tuple. This absorbs the short race window and replaces the duplicated inline assertions in both for-loops.

Test plan

./mvnw test-compile -pl pinot-integration-tests passes.
./mvnw spotless:apply checkstyle:check license:format license:check -pl pinot-integration-tests clean.
CI runs MergeRollupMinionClusterIntegrationTest.testRealtimeTableProcessAllModeMultiLevelConcat successfully across multiple runs.

🤖 Generated with Claude Code

…ocessAllModeMultiLevelConcat The per-iteration gauge assertions on `mergeRollupTaskNumBucketsToProcess.*` race with the scheduler: `waitForTaskToComplete()` only waits for Helix `COMPLETED` state, but the gauge is (re)registered and updated by `PinotTaskManager.scheduleTasks`, which can see an in-flight task or a pending segment-lineage commit and either skip the merge level entirely (leaving the gauge value stale) or briefly miss the re-registration (causing `gaugeExists` to return false). Extract the gauge check into a helper that polls via `TestUtils.waitForCondition` until both gauges exist and match the expected values, replacing the duplicated inline assertions in both for-loops. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codecov-commenter · 2026-04-18T09:43:21Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 63.49%. Comparing base (ade9f6c) to head (6c191cc).

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18253      +/-   ##
============================================
+ Coverage     63.48%   63.49%   +0.01%     
  Complexity     1627     1627              
============================================
  Files          3244     3244              
  Lines        197342   197342              
  Branches      30529    30529              
============================================
+ Hits         125285   125312      +27     
+ Misses        62014    62001      -13     
+ Partials      10043    10029      -14

Flag	Coverage Δ
custom-integration1	`100.00% <ø> (ø)`
integration	`100.00% <ø> (ø)`
integration1	`100.00% <ø> (ø)`
integration2	`0.00% <ø> (ø)`
java-11	`63.48% <ø> (+0.01%)`	⬆️
java-21	`63.44% <ø> (+<0.01%)`	⬆️
temurin	`63.49% <ø> (+0.01%)`	⬆️
unittests	`63.49% <ø> (+0.01%)`	⬆️
unittests1	`55.47% <ø> (+<0.01%)`	⬆️
unittests2	`34.98% <ø> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copilot

Pull request overview

Improves stability of MergeRollupMinionClusterIntegrationTest.testRealtimeTableProcessAllModeMultiLevelConcat by replacing one-shot metric assertions with a polling helper that waits until the expected mergeRollupTaskNumBucketsToProcess.* gauges both exist and match expected values, reducing flakiness from short scheduling/Helix/lineage race windows.

Changes:

Add waitForExpectedNumBucketsToProcess(...) helper that polls controller gauges via TestUtils.waitForCondition.
Replace duplicated inline metric assertions in both scheduling loops with the new helper.
Add ControllerMetrics import to avoid repeated _controllerStarter.getControllerMetrics() calls in the helper.

Extends the polling pattern introduced in apache#18253 (for mergeRollupTaskNumBucketsToProcess) to the remaining five mergeRollupTaskDelayInNumBuckets.* gaugeExists checks in the same test class. The gauge is registered by MergeRollupTaskGenerator.createOrUpdateDelayMetrics and removed by resetDelayMetrics when a scheduleTasks call observes no eligible segments. The per-iteration body's assertNull(scheduleTasks(context).get(RealtimeToOfflineSegmentsTask)) probe triggers an extra synchronized scheduleTasks that can race with the previous merge task's segment-lineage commit, transiently resetting the gauge and causing the post-loop assertTrue(gaugeExists(...)) to flake on the same window that apache#18253 addressed. A new waitForGaugesToExist(String...) helper polls via TestUtils.waitForCondition with the existing TIMEOUT_IN_MS, and is used in testOfflineTableSingleLevelConcat, testOfflineTableSingleLevelConcatWithMetadataPush, testOfflineTableSingleLevelRollup, testOfflineTableMultiLevelConcat (both 45days + 90days atomically), and testRealtimeTableSingleLevelConcat. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…18260) Extends the polling pattern introduced in #18253 (for mergeRollupTaskNumBucketsToProcess) to the remaining five mergeRollupTaskDelayInNumBuckets.* gaugeExists checks in the same test class. The gauge is registered by MergeRollupTaskGenerator.createOrUpdateDelayMetrics and removed by resetDelayMetrics when a scheduleTasks call observes no eligible segments. The per-iteration body's assertNull(scheduleTasks(context).get(RealtimeToOfflineSegmentsTask)) probe triggers an extra synchronized scheduleTasks that can race with the previous merge task's segment-lineage commit, transiently resetting the gauge and causing the post-loop assertTrue(gaugeExists(...)) to flake on the same window that #18253 addressed. A new waitForGaugesToExist(String...) helper polls via TestUtils.waitForCondition with the existing TIMEOUT_IN_MS, and is used in testOfflineTableSingleLevelConcat, testOfflineTableSingleLevelConcatWithMetadataPush, testOfflineTableSingleLevelRollup, testOfflineTableMultiLevelConcat (both 45days + 90days atomically), and testRealtimeTableSingleLevelConcat. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

xiangfu0 force-pushed the claude/dazzling-banach-52a301 branch from 92c798c to 6c191cc Compare April 18, 2026 08:54

xiangfu0 requested a review from Copilot April 18, 2026 09:54

Copilot started reviewing on behalf of xiangfu0 April 18, 2026 09:55 View session

Copilot AI reviewed Apr 18, 2026

View reviewed changes

Jackie-Jiang added the flaky-test Tracks a test that intermittently fails label Apr 19, 2026

Jackie-Jiang approved these changes Apr 19, 2026

View reviewed changes

xiangfu0 merged commit 5af58ec into apache:master Apr 19, 2026
35 of 36 checks passed

xiangfu0 deleted the claude/dazzling-banach-52a301 branch April 19, 2026 00:29

xiangfu0 mentioned this pull request Apr 20, 2026

Fix flaky MergeRollupMinionClusterIntegrationTest gauge assertions #18260

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix flaky testRealtimeTableProcessAllModeMultiLevelConcat#18253

Fix flaky testRealtimeTableProcessAllModeMultiLevelConcat#18253
xiangfu0 merged 1 commit intoapache:masterfrom
xiangfu0:claude/dazzling-banach-52a301

xiangfu0 commented Apr 18, 2026

Uh oh!

codecov-commenter commented Apr 18, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

xiangfu0 commented Apr 18, 2026

Summary

Test plan

Uh oh!

codecov-commenter commented Apr 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov-commenter commented Apr 18, 2026 •

edited

Loading