Early publishing segments in IndexTask

As discussed [here](https://github.com/druid-io/druid/pull/4193#pullrequestreview-34456242), the current implementation of IndexTask has a potential out of disk problem, which is caused by keeping all segments in a local disk until publishing them at the very end of IndexTask. 
The solution what I'm considering is publishing segments early when the size of generated segments exceeds a predefined threshold in the middle of segment generation. Thus, the behavior of ```FiniteAppenderatorDriver.add()``` will be changed to
1) Add a row to appenderator.
2) Check the size of segments generated so far.
3) If it exceeds the predefined threshold, 
3.1) Persist all data indexed through FiniteAppenderatorDriver so far.
3.2) Publish all data indexed through FiniteAppenderatorDriver so far, and waits for segment hand-off. This is because data stored in a local disk are removed after hand-off.
4) Repeat from 1 to 3 until all data are read from the firehose.

Also, this proposal requires to use LinearShardSpec instead of HashBasedNumberedShardSpec when an interval needs to be partitioned. Currently, ```shardSpecs``` must be determined before starting ingestion. However, with this proposal, it is very difficult to determine required ```shardSpecs``` in advance. Instead, LinearShardSpec can be used for incremental sharding, i.e., creating a new shard whenever a segment is published.

This proposal changes the rollup mode of IndexTask from guaranteed rollup to _non-guaranteed rollup_ when the rollup is enabled because two rows that should roll up might be in different publish cycles. I think it's ok for now because non-guaranteed rollup does not affect to the query result accuracy even though it can increase the space usage. However, we need to devise another solution to support guaranteed rollup in the near future.

I think this will also work well with KafkaIndexTask.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Early publishing segments in IndexTask #4227

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Early publishing segments in IndexTask #4227

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions