As discussed here, the current implementation of IndexTask has a potential out of disk problem, which is caused by keeping all segments in a local disk until publishing them at the very end of IndexTask.
The solution I'm considering is to publish segments early whenever the size of the generated segments exceeds a predefined threshold in the middle of segment generation. Thus, the behavior of FiniteAppenderatorDriver.add() will change to:
1) Add a row to the appenderator.
2) Check the total size of the segments generated so far.
3) If it exceeds the predefined threshold:
3.1) Persist all data indexed through FiniteAppenderatorDriver so far.
3.2) Publish all data indexed through FiniteAppenderatorDriver so far, and wait for segment hand-off. This is needed because data stored on local disk are removed only after hand-off.
4) Repeat steps 1 to 3 until all data are read from the firehose.
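The loop above can be sketched as follows. This is a minimal, self-contained illustration of the incremental publishing idea; the names (MAX_SEGMENT_BYTES, addRow, publishAndWaitForHandoff) are hypothetical stand-ins, not the real FiniteAppenderatorDriver API.

```java
import java.util.ArrayList;
import java.util.List;

public class IncrementalPublishSketch {
    // predefined threshold (assumed value, for illustration only)
    static final long MAX_SEGMENT_BYTES = 5_000_000_000L;

    // stand-ins for appenderator state
    static long segmentBytesSoFar = 0;
    static final List<String> publishCycles = new ArrayList<>();

    static void addRow(long rowBytes) {
        segmentBytesSoFar += rowBytes;                 // 1) add a row
        if (segmentBytesSoFar >= MAX_SEGMENT_BYTES) {  // 2) check size vs. threshold
            persistAll();                              // 3.1) persist to local disk
            publishAndWaitForHandoff();                // 3.2) publish + wait for hand-off
        }
    }

    static void persistAll() {
        // flush in-memory rows to local disk (no-op in this sketch)
    }

    static void publishAndWaitForHandoff() {
        publishCycles.add("published " + segmentBytesSoFar + " bytes");
        segmentBytesSoFar = 0; // local segments are removed after hand-off
    }

    public static void main(String[] args) {
        // 4) repeat until the firehose is exhausted; each "row" is 1 GB here
        for (int i = 0; i < 10; i++) {
            addRow(1_000_000_000L);
        }
        System.out.println(publishCycles.size()); // two publish cycles of 5 GB each
    }
}
```

Because local disk usage is bounded by the threshold rather than by the total input size, the task no longer accumulates every segment until the end.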
Also, this proposal requires using LinearShardSpec instead of HashBasedNumberedShardSpec when an interval needs to be partitioned. Currently, shardSpecs must be determined before ingestion starts, but with this proposal it is very difficult to determine the required shardSpecs in advance. Instead, LinearShardSpec can be used for incremental sharding, i.e., creating a new shard whenever a segment is published.
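A rough illustration of why linear sharding fits incremental publishing: each publish cycle simply takes the next partition number, so the total number of shards never has to be decided up front. The class below is a hypothetical stand-in, not the real LinearShardSpec.

```java
import java.util.ArrayList;
import java.util.List;

public class LinearShardingSketch {
    // partition numbers allocated so far for one interval
    static final List<Integer> partitions = new ArrayList<>();

    // allocate the next shard at publish time: 0, 1, 2, ... in publish order
    static int nextPartition() {
        int partitionNum = partitions.size();
        partitions.add(partitionNum);
        return partitionNum;
    }

    public static void main(String[] args) {
        // three publish cycles for the same interval, numbered as they happen
        System.out.println(nextPartition());
        System.out.println(nextPartition());
        System.out.println(nextPartition());
    }
}
```

By contrast, hash-based numbered sharding needs the shard count fixed before any row is ingested, which is exactly what early publishing makes unknowable.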
This proposal changes the rollup mode of IndexTask from guaranteed rollup to non-guaranteed rollup when rollup is enabled, because two rows that should be rolled up together might fall into different publish cycles. I think this is acceptable for now because non-guaranteed rollup does not affect query result accuracy, even though it can increase space usage. However, we need to devise another solution to support guaranteed rollup in the near future.
I think this will also work well with KafkaIndexTask.