S3 InputSource issue when using prefix mode if a directory contains _SUCCESS files #9411

@sascha-coenen

Affected Version

0.17.0

Description

We set up Druid Indexer nodes to test the new native parallel ingestion.
We then used the following inputSource section in an index_parallel spec to point to a "directory" (prefix) in S3 that contains a _SUCCESS file along with a number of data files:

      "inputSource": {
        "type": "s3",
        "prefixes": ["s3://smt-druid-ingestion-stage/SI-835/year=2020/month=01/day=20/hour=00/1580297687716/auction"]
      }

The index_parallel task failed, and the logs show that the above section was rewritten to the following:

      "inputSource": {
        "type": "s3",
        "uris": null,
        "prefixes": null,
        "objects": [
          {
            "bucket": "smt-druid-ingestion-stage",
            "path": "SI-835/year=2020/month=01/day=20/hour=00/1580297687716/auction/_SUCCESS"
          }
        ]
      }

It looks to me as though an attempt was made to filter _SUCCESS files out of the file list, and the filter condition inadvertently does the opposite: the _SUCCESS file is the only object listed in the rewritten spec.
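
To make the suspected behavior concrete, here is a minimal Python sketch of the hypothesis. It is not Druid's actual code: the function names are made up for illustration, and the `part-0000N.gz` object names are hypothetical stand-ins for the data files under the prefix. It only shows how a negated versus non-negated predicate would produce the two different object lists.

```python
def is_success_marker(key: str) -> bool:
    """Return True if the last path segment of the object key is _SUCCESS."""
    return key.rsplit("/", 1)[-1] == "_SUCCESS"


def intended_filter(keys):
    """What one would expect: drop _SUCCESS markers, keep the data files."""
    return [k for k in keys if not is_success_marker(k)]


def inverted_filter(keys):
    """The suspected bug: the condition is flipped, so only the marker survives."""
    return [k for k in keys if is_success_marker(k)]


# Prefix taken from the spec above; the part-* names are hypothetical.
prefix = "SI-835/year=2020/month=01/day=20/hour=00/1580297687716/auction"
keys = [
    f"{prefix}/_SUCCESS",
    f"{prefix}/part-00000.gz",
    f"{prefix}/part-00001.gz",
]

print(inverted_filter(keys))  # only the _SUCCESS key, matching the rewritten spec
print(intended_filter(keys))  # only the data files
```

If the hypothesis holds, the rewritten "objects" list in the log corresponds to the output of `inverted_filter`, while the expected behavior corresponds to `intended_filter`.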
