Ingestion failure when number of segments in a time chunk exceeds Short.MAX_VALUE #15091

@dulu98Kurz

Affected Version

26.0.0, 27.0.0, master; possibly also pre-26.0.0

Description


  • Cluster size
    15 × i3.4xlarge, 130 CPUs allocated to Druid
    ~500K segments loaded in the cluster

  • Configurations in use
    General configurations

  • Steps to reproduce the problem
    When there are more than 32,767 segments in a single time chunk, ingestion starts to fail.

  • The error message or stack traces encountered. Providing more context, such as nearby log messages or even entire logs, can be helpful.

2023-09-29T02:45:16,308 ERROR [task-runner-0-priority-0] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner - Encountered exception while running task.
java.lang.IllegalArgumentException: fromKey > toKey
	at java.util.TreeMap$NavigableSubMap.<init>(TreeMap.java:1368) ~[?:1.8.0_302]
	at java.util.TreeMap$AscendingSubMap.<init>(TreeMap.java:1855) ~[?:1.8.0_302]
	at java.util.TreeMap.subMap(TreeMap.java:913) ~[?:1.8.0_302]
	at org.apache.druid.timeline.partition.OvershadowableManager.entryIteratorGreaterThan(OvershadowableManager.java:423) ~[druid-processing-2023.03.1-iap.jar:2023.03.1-iap]
	at org.apache.druid.timeline.partition.OvershadowableManager.findOvershadowedBy(OvershadowableManager.java:299) ~[druid-processing-2023.03.1-iap.jar:2023.03.1-iap]
	at org.apache.druid.timeline.partition.OvershadowableManager.findOvershadowedBy(OvershadowableManager.java:275) ~[druid-processing-2023.03.1-iap.jar:2023.03.1-iap]
	at org.apache.druid.timeline.partition.OvershadowableManager.moveNewStandbyToVisibleIfNecessary(OvershadowableManager.java:456) ~[druid-processing-2023.03.1-iap.jar:2023.03.1-iap]
	at org.apache.druid.timeline.partition.OvershadowableManager.determineVisibleGroupAfterAdd(OvershadowableManager.java:432) ~[druid-processing-2023.03.1-iap.jar:2023.03.1-iap]
	at org.apache.druid.timeline.partition.OvershadowableManager.addAtomicUpdateGroupWithState(OvershadowableManager.java:629) ~[druid-processing-2023.03.1-iap.jar:2023.03.1-iap]
	at org.apache.druid.timeline.partition.OvershadowableManager.addChunk(OvershadowableManager.java:699) ~[druid-processing-2023.03.1-iap.jar:2023.03.1-iap]
	at org.apache.druid.timeline.partition.PartitionHolder.add(PartitionHolder.java:70) ~[druid-processing-2023.03.1-iap.jar:2023.03.1-iap]
	at org.apache.druid.timeline.partition.PartitionHolder.<init>(PartitionHolder.java:52) ~[druid-processing-2023.03.1-iap.jar:2023.03.1-iap]
	at org.apache.druid.timeline.VersionedIntervalTimeline.addAll(VersionedIntervalTimeline.java:201) ~[druid-processing-2023.03.1-iap.jar:2023.03.1-iap]
	at org.apache.druid.timeline.VersionedIntervalTimeline.add(VersionedIntervalTimeline.java:180) ~[druid-processing-2023.03.1-iap.jar:2023.03.1-iap]
	at org.apache.druid.segment.realtime.appenderator.StreamAppenderator.getOrCreateSink(StreamAppenderator.java:486) ~[druid-server-2023.03.1-iap.jar:2023.03.1-iap]
	at org.apache.druid.segment.realtime.appenderator.StreamAppenderator.add(StreamAppenderator.java:267) ~[druid-server-2023.03.1-iap.jar:2023.03.1-iap]
	at org.apache.druid.segment.realtime.appenderator.BaseAppenderatorDriver.append(BaseAppenderatorDriver.java:411) ~[druid-server-2023.03.1-iap.jar:2023.03.1-iap]
	at org.apache.druid.segment.realtime.appenderator.StreamAppenderatorDriver.add(StreamAppenderatorDriver.java:191) ~[druid-server-2023.03.1-iap.jar:2023.03.1-iap]
	at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.runInternal(SeekableStreamIndexTaskRunner.java:654) ~[druid-indexing-service-2023.03.1-iap.jar:2023.03.1-iap]
	at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.run(SeekableStreamIndexTaskRunner.java:266) ~[druid-indexing-service-2023.03.1-iap.jar:2023.03.1-iap]
	at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTask.runTask(SeekableStreamIndexTask.java:151) ~[druid-indexing-service-2023.03.1-iap.jar:2023.03.1-iap]
	at org.apache.druid.indexing.common.task.AbstractTask.run(AbstractTask.java:169) ~[druid-indexing-service-2023.03.1-iap.jar:2023.03.1-iap]
	at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:477) ~[druid-indexing-service-2023.03.1-iap.jar:2023.03.1-iap]
	at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:449) ~[druid-indexing-service-2023.03.1-iap.jar:2023.03.1-iap]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_302]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_302]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_302]
	at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_302]
  • Any debugging that you have already done
    Unloading segments and deleting them from deep storage for the affected time chunk temporarily remediated the problem.

Background

This issue only happens on our Kafka ingestion datasources, where too many small segments were created due to compaction failures, backfilling, and late messages. The datasource was configured with DAY segment granularity; when the number of segments for a single day exceeded Short.MAX_VALUE (32,767), the ingestion task failed with the error above.

It seemed strange to me when I realized Druid assumes/limits a single time chunk to at most 32,767 segments. I hope someone can share context on why this assumption/limitation exists, so I can better understand how to fix the issue.
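For context, the snippet below is a minimal sketch (not Druid code) of how a partition ID stored as a Java short wraps negative once it passes Short.MAX_VALUE, and how a ranged TreeMap lookup keyed on such a value then reproduces the exact IllegalArgumentException in the stack trace above. The class and variable names are illustrative only.

```java
import java.util.TreeMap;

public class ShortOverflowDemo {
    public static void main(String[] args) {
        // A partition ID past Short.MAX_VALUE (32767) overflows when
        // narrowed to short: 32768 wraps around to -32768.
        int partitionId = 32768; // the 32768th segment in the time chunk
        short truncated = (short) partitionId;
        System.out.println(truncated); // prints -32768

        // A range query keyed on the wrapped value violates TreeMap's
        // ordering contract, mirroring the error from OvershadowableManager:
        TreeMap<Short, String> partitions = new TreeMap<>();
        try {
            partitions.subMap((short) 100, truncated); // 100 > -32768
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // prints "fromKey > toKey"
        }
    }
}
```

This is consistent with `fromKey > toKey` being thrown from `TreeMap.subMap` via `OvershadowableManager.entryIteratorGreaterThan` in the trace: once a partition ID wraps negative, a sub-map range that should be ascending becomes inverted.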

I have a PR ready for review that remediates this issue temporarily, at least keeping ingestion happy. I believe a more complete fix would be to widen the partition ID range from Short to Integer, but without context on where the assumption/limitation comes from I did not proceed with that route. Please advise.
