Conversation
A flake on this same test happened in the checks for this PR:
Yes, this is due to
If the test is slow enough to hit the

Should we just use batch append to simplify the test and make it more deterministic (and faster)?
Actually, it's not due to this.
If that is the case, you could try increasing the

You could either just use batch append instead of a Kafka supervisor.

FYI, #19151 updates the
Updated to use an index task instead of Kafka, PTAL!
kfaraz left a comment:
Minor non-blocking suggestions.
.map(DataSegment::getId)
.collect(Collectors.toSet());
ITRetryUtil.retryUntilEquals(
We should wait for a Broker metric instead.
Does cluster.callApi().waitForSegmentsToBeAvailable() not work for this case?
() ->
    broker.bindings()
          .getInstance(BrokerServerView.class)
          .getTimeline(TableDataSource.create(dataSource))
Instead of querying the timeline directly, please use SELECT id FROM sys.segments.
.map(DataSegment::getId)
.collect(Collectors.toSet());
ITRetryUtil.retryUntilEquals(
Instead of ITRetryUtil, try using cluster.callApi().waitForResult().
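For context, the generic poll-until-equal loop that a retryUntilEquals-style helper implements is roughly the following (a minimal self-contained sketch; the class name, method signature, and timings here are illustrative, not the actual ITRetryUtil or cluster.callApi() APIs from the Druid test framework):

```java
import java.util.function.Supplier;

public class RetryUntilEquals
{
  // Polls the supplier until its result equals the expected value,
  // sleeping between attempts; gives up after maxAttempts tries.
  static <T> boolean retryUntilEquals(
      Supplier<T> supplier,
      T expected,
      int maxAttempts,
      long sleepMillis
  ) throws InterruptedException
  {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      if (expected.equals(supplier.get())) {
        return true;
      }
      Thread.sleep(sleepMillis);
    }
    return false;
  }

  public static void main(String[] args) throws InterruptedException
  {
    // Simulated condition that becomes true on the third poll.
    final int[] polls = {0};
    final boolean reached = retryUntilEquals(() -> ++polls[0], 3, 10, 1);
    System.out.println(reached); // prints "true"
  }
}
```

The framework helper suggested above would encapsulate this same loop along with the cluster-specific wiring, which is why it is preferred over open-coding the retry in each test.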
kafkaServer.produceRecordsWithoutTransaction(producerRecords);
}
return producerRecords.size();
final StreamGenerator streamGenerator = new WikipediaStreamEventStreamGenerator(serializer, 500, 100);
Is the large number of records crucial for this test?
If not, you could try using some of the templates from MoreResources, such as MoreResources.Task.BASIC_INDEX, MoreResources.Task.INDEX_TASK_WITH_AGGREGATORS, or MoreResources.MSQ.INSERT_TINY_WIKI_JSON.
For a large dataset (wikipedia 1 day = 24k rows), you could also try the following (from IngestionSmokeTest.test_runIndexParallelTask_andCompactData()):
final String taskId = IdUtils.getRandomId();
final ParallelIndexSupervisorTask task = TaskBuilder
    .ofTypeIndexParallel()
    .timestampColumn("timestamp")
    .jsonInputFormat()
    .inputSource(Resources.HttpData.wikipedia1Day())
    .dimensions()
    .tuningConfig(t -> t.withMaxNumConcurrentSubTasks(1))
    .dataSource(dataSource)
    .withId(taskId);
cluster.callApi().onLeaderOverlord(o -> o.runTask(taskId, task));
cluster.callApi().waitForTaskToSucceed(taskId, eventCollector.latchableEmitter());
fix flaky test: BrokerServerView before querying for total rows

This PR has: