Skip to content

Fix flaky MergeRollupMinionClusterIntegrationTest gauge assertions#18260

Merged
xiangfu0 merged 1 commit intoapache:masterfrom
xiangfu0:claude/eager-pascal-ae2a1c
Apr 20, 2026
Merged

Fix flaky MergeRollupMinionClusterIntegrationTest gauge assertions#18260
xiangfu0 merged 1 commit intoapache:masterfrom
xiangfu0:claude/eager-pascal-ae2a1c

Conversation

@xiangfu0
Copy link
Copy Markdown
Contributor

Summary

Extends the polling pattern introduced in #18253 (for mergeRollupTaskNumBucketsToProcess) to the remaining five mergeRollupTaskDelayInNumBuckets.* gaugeExists checks in the same test class.

Root cause

The gauge is registered by MergeRollupTaskGenerator.createOrUpdateDelayMetrics and removed by resetDelayMetrics when a scheduleTasks call observes no eligible segments for the table. Each per-iteration body's

assertNull(_taskManager.scheduleTasks(context).get(MinionConstants.RealtimeToOfflineSegmentsTask.TASK_TYPE));

probe triggers an extra synchronized scheduleTasks that can race with the previous merge task's segment-lineage commit — transiently reaching the resetDelayMetrics branch and removing the gauge. The post-loop assertTrue(MetricValueUtils.gaugeExists(...)) then flakes on the same window that #18253 addressed for the processAll-mode test.

Fix

Added a waitForGaugesToExist(String...) helper that polls via TestUtils.waitForCondition with the existing TIMEOUT_IN_MS, mirroring waitForExpectedNumBucketsToProcess introduced in #18253. Replaced the assertTrue(gaugeExists(...)) calls in:

  • testOfflineTableSingleLevelConcat
  • testOfflineTableSingleLevelConcatWithMetadataPush
  • testOfflineTableSingleLevelRollup
  • testOfflineTableMultiLevelConcat (polls 45days + 90days atomically)
  • testRealtimeTableSingleLevelConcat

Test plan

  • ./mvnw test-compile -pl pinot-integration-tests -am passes.
  • ./mvnw spotless:apply checkstyle:check license:format license:check -pl pinot-integration-tests clean.
  • CI runs MergeRollupMinionClusterIntegrationTest successfully across multiple runs.

🤖 Generated with Claude Code

Extends the polling pattern introduced in apache#18253 (for
mergeRollupTaskNumBucketsToProcess) to the remaining five
mergeRollupTaskDelayInNumBuckets.* gaugeExists checks in the same test
class.

The gauge is registered by MergeRollupTaskGenerator.createOrUpdateDelayMetrics
and removed by resetDelayMetrics when a scheduleTasks call observes no
eligible segments. The per-iteration body's
  assertNull(scheduleTasks(context).get(RealtimeToOfflineSegmentsTask))
probe triggers an extra synchronized scheduleTasks that can race with
the previous merge task's segment-lineage commit, transiently resetting
the gauge and causing the post-loop assertTrue(gaugeExists(...)) to flake
on the same window that apache#18253 addressed.

A new waitForGaugesToExist(String...) helper polls via
TestUtils.waitForCondition with the existing TIMEOUT_IN_MS, and is used
in testOfflineTableSingleLevelConcat,
testOfflineTableSingleLevelConcatWithMetadataPush,
testOfflineTableSingleLevelRollup, testOfflineTableMultiLevelConcat (both
45days + 90days atomically), and testRealtimeTableSingleLevelConcat.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@xiangfu0 xiangfu0 added the flaky-test Tracks a test that intermittently fails label Apr 20, 2026
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 63.51%. Comparing base (3eff094) to head (bdabcf9).

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18260      +/-   ##
============================================
+ Coverage     63.48%   63.51%   +0.02%     
  Complexity     1627     1627              
============================================
  Files          3244     3244              
  Lines        197365   197365              
  Branches      30540    30540              
============================================
+ Hits         125306   125351      +45     
+ Misses        62019    61975      -44     
+ Partials      10040    10039       -1     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.48% <ø> (+0.02%) ⬆️
java-21 63.47% <ø> (+<0.01%) ⬆️
temurin 63.51% <ø> (+0.02%) ⬆️
unittests 63.50% <ø> (+0.02%) ⬆️
unittests1 55.48% <ø> (+0.02%) ⬆️
unittests2 34.98% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reduces flakiness in MergeRollupMinionClusterIntegrationTest by replacing one-shot gauge existence assertions with a polling helper, aligning the remaining mergeRollupTaskDelayInNumBuckets.* checks with the polling approach previously introduced for mergeRollupTaskNumBucketsToProcess.*.

Changes:

  • Replaced direct assertTrue(MetricValueUtils.gaugeExists(...)) checks with a new polling helper in 5 test cases.
  • Added waitForGaugesToExist(String... metricNames) that uses TestUtils.waitForCondition and the existing TIMEOUT_IN_MS.

@xiangfu0 xiangfu0 merged commit 159f046 into apache:master Apr 20, 2026
20 checks passed
@xiangfu0 xiangfu0 deleted the claude/eager-pascal-ae2a1c branch April 20, 2026 21:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

flaky-test Tracks a test that intermittently fails

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants