Skip to content

Fix flaky testRealtimeTableProcessAllModeMultiLevelConcat#18253

Merged
xiangfu0 merged 1 commit intoapache:masterfrom
xiangfu0:claude/dazzling-banach-52a301
Apr 19, 2026
Merged

Fix flaky testRealtimeTableProcessAllModeMultiLevelConcat#18253
xiangfu0 merged 1 commit intoapache:masterfrom
xiangfu0:claude/dazzling-banach-52a301

Conversation

@xiangfu0
Copy link
Copy Markdown
Contributor

Summary

  • Fixes flaky MergeRollupMinionClusterIntegrationTest.testRealtimeTableProcessAllModeMultiLevelConcat, which occasionally fails at assertTrue(MetricValueUtils.gaugeExists(...)) for mergeRollupTaskNumBucketsToProcess.myTable6_REALTIME.100days.
  • Root cause: the mergeRollupTaskNumBucketsToProcess.* gauges are (re)registered and updated only when PinotTaskManager.scheduleTasks runs for a merge level with no in-flight task. The per-iteration check in the test races with (a) the in-flight task's Helix COMPLETED transition and (b) the segment-lineage commit that follows — producing either a stale value or, more rarely, a missed gauge registration.
  • Fix: extract the gauge check into waitForExpectedNumBucketsToProcess, which polls via TestUtils.waitForCondition until both gauges exist and their values match the expected tuple. This absorbs the short race window and replaces the duplicated inline assertions in both for-loops.

Test plan

  • ./mvnw test-compile -pl pinot-integration-tests passes.
  • ./mvnw spotless:apply checkstyle:check license:format license:check -pl pinot-integration-tests clean.
  • CI runs MergeRollupMinionClusterIntegrationTest.testRealtimeTableProcessAllModeMultiLevelConcat successfully across multiple runs.

🤖 Generated with Claude Code

…ocessAllModeMultiLevelConcat

The per-iteration gauge assertions on `mergeRollupTaskNumBucketsToProcess.*`
race with the scheduler: `waitForTaskToComplete()` only waits for Helix
`COMPLETED` state, but the gauge is (re)registered and updated by
`PinotTaskManager.scheduleTasks`, which can see an in-flight task or a
pending segment-lineage commit and either skip the merge level entirely
(leaving the gauge value stale) or briefly miss the re-registration
(causing `gaugeExists` to return false).

Extract the gauge check into a helper that polls via
`TestUtils.waitForCondition` until both gauges exist and match the
expected values, replacing the duplicated inline assertions in both
for-loops.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@xiangfu0 xiangfu0 force-pushed the claude/dazzling-banach-52a301 branch from 92c798c to 6c191cc Compare April 18, 2026 08:54
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 63.49%. Comparing base (ade9f6c) to head (6c191cc).

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18253      +/-   ##
============================================
+ Coverage     63.48%   63.49%   +0.01%     
  Complexity     1627     1627              
============================================
  Files          3244     3244              
  Lines        197342   197342              
  Branches      30529    30529              
============================================
+ Hits         125285   125312      +27     
+ Misses        62014    62001      -13     
+ Partials      10043    10029      -14     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.48% <ø> (+0.01%) ⬆️
java-21 63.44% <ø> (+<0.01%) ⬆️
temurin 63.49% <ø> (+0.01%) ⬆️
unittests 63.49% <ø> (+0.01%) ⬆️
unittests1 55.47% <ø> (+<0.01%) ⬆️
unittests2 34.98% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Improves stability of MergeRollupMinionClusterIntegrationTest.testRealtimeTableProcessAllModeMultiLevelConcat by replacing one-shot metric assertions with a polling helper that waits until the expected mergeRollupTaskNumBucketsToProcess.* gauges both exist and match expected values, reducing flakiness from short scheduling/Helix/lineage race windows.

Changes:

  • Add waitForExpectedNumBucketsToProcess(...) helper that polls controller gauges via TestUtils.waitForCondition.
  • Replace duplicated inline metric assertions in both scheduling loops with the new helper.
  • Add ControllerMetrics import to avoid repeated _controllerStarter.getControllerMetrics() calls in the helper.

@Jackie-Jiang Jackie-Jiang added the flaky-test Tracks a test that intermittently fails label Apr 19, 2026
@xiangfu0 xiangfu0 merged commit 5af58ec into apache:master Apr 19, 2026
35 of 36 checks passed
@xiangfu0 xiangfu0 deleted the claude/dazzling-banach-52a301 branch April 19, 2026 00:29
xiangfu0 added a commit to xiangfu0/pinot that referenced this pull request Apr 20, 2026
Extends the polling pattern introduced in apache#18253 (for
mergeRollupTaskNumBucketsToProcess) to the remaining five
mergeRollupTaskDelayInNumBuckets.* gaugeExists checks in the same test
class.

The gauge is registered by MergeRollupTaskGenerator.createOrUpdateDelayMetrics
and removed by resetDelayMetrics when a scheduleTasks call observes no
eligible segments. The per-iteration body's
  assertNull(scheduleTasks(context).get(RealtimeToOfflineSegmentsTask))
probe triggers an extra synchronized scheduleTasks that can race with
the previous merge task's segment-lineage commit, transiently resetting
the gauge and causing the post-loop assertTrue(gaugeExists(...)) to flake
on the same window that apache#18253 addressed.

A new waitForGaugesToExist(String...) helper polls via
TestUtils.waitForCondition with the existing TIMEOUT_IN_MS, and is used
in testOfflineTableSingleLevelConcat,
testOfflineTableSingleLevelConcatWithMetadataPush,
testOfflineTableSingleLevelRollup, testOfflineTableMultiLevelConcat (both
45days + 90days atomically), and testRealtimeTableSingleLevelConcat.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
xiangfu0 added a commit that referenced this pull request Apr 20, 2026
…18260)

Extends the polling pattern introduced in #18253 (for
mergeRollupTaskNumBucketsToProcess) to the remaining five
mergeRollupTaskDelayInNumBuckets.* gaugeExists checks in the same test
class.

The gauge is registered by MergeRollupTaskGenerator.createOrUpdateDelayMetrics
and removed by resetDelayMetrics when a scheduleTasks call observes no
eligible segments. The per-iteration body's
  assertNull(scheduleTasks(context).get(RealtimeToOfflineSegmentsTask))
probe triggers an extra synchronized scheduleTasks that can race with
the previous merge task's segment-lineage commit, transiently resetting
the gauge and causing the post-loop assertTrue(gaugeExists(...)) to flake
on the same window that #18253 addressed.

A new waitForGaugesToExist(String...) helper polls via
TestUtils.waitForCondition with the existing TIMEOUT_IN_MS, and is used
in testOfflineTableSingleLevelConcat,
testOfflineTableSingleLevelConcatWithMetadataPush,
testOfflineTableSingleLevelRollup, testOfflineTableMultiLevelConcat (both
45days + 90days atomically), and testRealtimeTableSingleLevelConcat.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

flaky-test Tracks a test that intermittently fails

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants