Behavior of index_parallel with appendToExisting=false and no bucketIntervals in GranularitySpec is surprising #6989

@glasser

Description

@glasser

We're experimenting with native batch ingestion on our 0.13-incubating cluster for the first time (with a custom firehose reading from files saved to GCS by Secor, with a custom InputRowParser).

There was a period of a week where the data source had no data. We ran batch ingestion (index_parallel) over one particular hour (5am-6am on December 16th) and it successfully ingested that hour — a segment showed up in the coordinator, it could be queried, etc. (Our segment granularity is HOUR.)

Then we ran it again on the entire 24 hours of December 16th. It ran 24 subtasks (our firehose divides up by hour) and ingested the full day, yay!

Except that when we look in the coordinator, it now lists 2 segments with identical sizes for the 5am hour that we first tested with. Both of them also carry the version from the first batch ingestion, not the version that the other 23 segments have from the second batch ingestion.

We did not explicitly specify appendToExisting in our ioConfig, but I believe the default is false, and the task payload confirms it is expanded to false.

Are we doing something wrong if our goal is to replace existing segments? Isn't that what appendToExisting: false should do?
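For reference, this is roughly the shape of the spec we're running. The field names follow the Druid 0.13 native batch ingestion spec; the interval value is illustrative, and note that we did not set intervals in the granularitySpec (which the title refers to via the internal bucketIntervals):

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR"
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "appendToExisting": false
    }
  }
}
```

Our assumption was that with appendToExisting: false, a second run over the same interval would produce segments that overshadow the earlier ones rather than sitting alongside them.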

The bad hour in the coordinator:
[screenshot]

The good hour:
[screenshot]
