Skip to content

Early publishing segments in IndexTask #4227

@jihoonson

Description

@jihoonson

As discussed here, the current implementation of IndexTask has a potential out of disk problem, which is caused by keeping all segments in a local disk until publishing them at the very end of IndexTask.
The solution what I'm considering is publishing segments early when the size of generated segments exceeds a predefined threshold in the middle of segment generation. Thus, the behavior of FiniteAppenderatorDriver.add() will be changed to

  1. Add a row to appenderator.
  2. Check the size of segments generated so far.
  3. If it exceeds the predefined threshold,
    3.1) Persist all data indexed through FiniteAppenderatorDriver so far.
    3.2) Publish all data indexed through FiniteAppenderatorDriver so far, and waits for segment hand-off. This is because data stored in a local disk are removed after hand-off.
  4. Repeat from 1 to 3 until all data are read from the firehose.

Also, this proposal requires to use LinearShardSpec instead of HashBasedNumberedShardSpec when an interval needs to be partitioned. Currently, shardSpecs must be determined before starting ingestion. However, with this proposal, it is very difficult to determine required shardSpecs in advance. Instead, LinearShardSpec can be used for incremental sharding, i.e., creating a new shard whenever a segment is published.

This proposal changes the rollup mode of IndexTask from guaranteed rollup to non-guaranteed rollup when the rollup is enabled because two rows that should roll up might be in different publish cycles. I think it's ok for now because non-guaranteed rollup does not affect to the query result accuracy even though it can increase the space usage. However, we need to devise another solution to support guaranteed rollup in the near future.

I think this will also work well with KafkaIndexTask.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions