Affected Version
26.0.0, 27.0.0, master, possibly pre-26.0.0
Description
- Cluster size
15 * i3.4xlarge, 130 CPUs for Druid
500K segments loaded in the cluster
- Configurations in use
General configurations
- Steps to reproduce the problem
When there are more than 32767 segments in a single time period, ingestion starts to fail
- The error message or stack traces encountered
2023-09-29T02:45:16,308 ERROR [task-runner-0-priority-0] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner - Encountered exception while running task.
java.lang.IllegalArgumentException: fromKey > toKey
at java.util.TreeMap$NavigableSubMap.<init>(TreeMap.java:1368) ~[?:1.8.0_302]
at java.util.TreeMap$AscendingSubMap.<init>(TreeMap.java:1855) ~[?:1.8.0_302]
at java.util.TreeMap.subMap(TreeMap.java:913) ~[?:1.8.0_302]
at org.apache.druid.timeline.partition.OvershadowableManager.entryIteratorGreaterThan(OvershadowableManager.java:423) ~[druid-processing-2023.03.1-iap.jar:2023.03.1-iap]
at org.apache.druid.timeline.partition.OvershadowableManager.findOvershadowedBy(OvershadowableManager.java:299) ~[druid-processing-2023.03.1-iap.jar:2023.03.1-iap]
at org.apache.druid.timeline.partition.OvershadowableManager.findOvershadowedBy(OvershadowableManager.java:275) ~[druid-processing-2023.03.1-iap.jar:2023.03.1-iap]
at org.apache.druid.timeline.partition.OvershadowableManager.moveNewStandbyToVisibleIfNecessary(OvershadowableManager.java:456) ~[druid-processing-2023.03.1-iap.jar:2023.03.1-iap]
at org.apache.druid.timeline.partition.OvershadowableManager.determineVisibleGroupAfterAdd(OvershadowableManager.java:432) ~[druid-processing-2023.03.1-iap.jar:2023.03.1-iap]
at org.apache.druid.timeline.partition.OvershadowableManager.addAtomicUpdateGroupWithState(OvershadowableManager.java:629) ~[druid-processing-2023.03.1-iap.jar:2023.03.1-iap]
at org.apache.druid.timeline.partition.OvershadowableManager.addChunk(OvershadowableManager.java:699) ~[druid-processing-2023.03.1-iap.jar:2023.03.1-iap]
at org.apache.druid.timeline.partition.PartitionHolder.add(PartitionHolder.java:70) ~[druid-processing-2023.03.1-iap.jar:2023.03.1-iap]
at org.apache.druid.timeline.partition.PartitionHolder.<init>(PartitionHolder.java:52) ~[druid-processing-2023.03.1-iap.jar:2023.03.1-iap]
at org.apache.druid.timeline.VersionedIntervalTimeline.addAll(VersionedIntervalTimeline.java:201) ~[druid-processing-2023.03.1-iap.jar:2023.03.1-iap]
at org.apache.druid.timeline.VersionedIntervalTimeline.add(VersionedIntervalTimeline.java:180) ~[druid-processing-2023.03.1-iap.jar:2023.03.1-iap]
at org.apache.druid.segment.realtime.appenderator.StreamAppenderator.getOrCreateSink(StreamAppenderator.java:486) ~[druid-server-2023.03.1-iap.jar:2023.03.1-iap]
at org.apache.druid.segment.realtime.appenderator.StreamAppenderator.add(StreamAppenderator.java:267) ~[druid-server-2023.03.1-iap.jar:2023.03.1-iap]
at org.apache.druid.segment.realtime.appenderator.BaseAppenderatorDriver.append(BaseAppenderatorDriver.java:411) ~[druid-server-2023.03.1-iap.jar:2023.03.1-iap]
at org.apache.druid.segment.realtime.appenderator.StreamAppenderatorDriver.add(StreamAppenderatorDriver.java:191) ~[druid-server-2023.03.1-iap.jar:2023.03.1-iap]
at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.runInternal(SeekableStreamIndexTaskRunner.java:654) ~[druid-indexing-service-2023.03.1-iap.jar:2023.03.1-iap]
at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.run(SeekableStreamIndexTaskRunner.java:266) ~[druid-indexing-service-2023.03.1-iap.jar:2023.03.1-iap]
at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTask.runTask(SeekableStreamIndexTask.java:151) ~[druid-indexing-service-2023.03.1-iap.jar:2023.03.1-iap]
at org.apache.druid.indexing.common.task.AbstractTask.run(AbstractTask.java:169) ~[druid-indexing-service-2023.03.1-iap.jar:2023.03.1-iap]
at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:477) ~[druid-indexing-service-2023.03.1-iap.jar:2023.03.1-iap]
at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:449) ~[druid-indexing-service-2023.03.1-iap.jar:2023.03.1-iap]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_302]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_302]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_302]
at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_302]
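The `fromKey > toKey` failure is consistent with a partition id wrapping negative when it passes `Short.MAX_VALUE`. The snippet below is a hypothetical minimal sketch of that failure mode (the map contents and key choices are illustrative, not Druid's actual internal state): once an `int` partition id above 32767 is narrowed to `short`, it wraps to a negative value, and handing that wrapped value to `TreeMap.subMap` as the upper bound reproduces the same `IllegalArgumentException` seen in the stack trace.

```java
import java.util.TreeMap;

public class ShortOverflowDemo {
    public static void main(String[] args) {
        // Illustrative map keyed by short partition ids, as a stand-in for the
        // short-keyed lookups inside OvershadowableManager.
        TreeMap<Short, String> states = new TreeMap<>();
        states.put((short) 0, "visible");
        states.put((short) 32767, "standby");

        int partitionId = 32768;             // one past Short.MAX_VALUE
        short wrapped = (short) partitionId; // narrowing conversion wraps to -32768
        System.out.println("wrapped id = " + wrapped); // prints: wrapped id = -32768

        try {
            // fromKey (32767) > toKey (-32768), so TreeMap$NavigableSubMap's
            // constructor throws -- the same frame as in the stack trace above.
            states.subMap((short) 32767, false, wrapped, true);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // prints: fromKey > toKey
        }
    }
}
```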
- Any debugging that you have already done
Unloading and deleting the segments in deep storage for that time period temporarily remediated the problem
Background
This issue only happens on our Kafka-ingesting datasources, where too many small segments were created due to compaction failures, backfilling, and late messages. The datasource was configured with DAY segment granularity; when the number of segments for a single day grew too large (exceeding Short.MAX_VALUE, 32767), the ingestion task failed with the error above.
It seemed strange to me when I realized Druid assumes/limits that a single time period can hold no more than 32767 segments. I would really appreciate some context on why this assumption/limitation exists, to better understand how to fix the issue.
I have a PR ready for review that remediates this issue temporarily, at least keeping ingestion happy. I believe a more complete fix would be to expand the range from Short to Integer, but without context on where the assumption/limitation comes from I did not proceed with that route. Please suggest.
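As a hedged sketch of what a temporary remediation could look like (this is not the actual PR; the method name and exception choice are hypothetical): until the type is widened to `int`, any narrowing of a partition id to `short` could be guarded so ingestion fails fast with a clear message instead of silently wrapping negative and corrupting `TreeMap` range queries.

```java
public class PartitionIdGuard {
    /**
     * Hypothetical guard: reject partition ids outside the short range with a
     * descriptive error rather than letting the (short) cast wrap negative.
     */
    static short checkedShortPartitionId(int partitionId) {
        if (partitionId < 0 || partitionId > Short.MAX_VALUE) {
            throw new IllegalStateException(
                "Partition id " + partitionId + " exceeds Short.MAX_VALUE (32767); "
                + "too many segments in a single time chunk"
            );
        }
        return (short) partitionId;
    }

    public static void main(String[] args) {
        System.out.println(checkedShortPartitionId(32767)); // prints: 32767
        checkedShortPartitionId(32768); // throws IllegalStateException with a clear message
    }
}
```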