Skip to content

add google cloud storage InputSource for native batch#8907

Merged
gianm merged 6 commits intoapache:masterfrom
clintropolis:google-input-source
Nov 20, 2019
Merged

add google cloud storage InputSource for native batch#8907
gianm merged 6 commits intoapache:masterfrom
clintropolis:google-input-source

Conversation

@clintropolis
Copy link
Copy Markdown
Member

@clintropolis clintropolis commented Nov 19, 2019

Description

Another follow up to #8823, this PR adds a Google CloudStorage InputSource and InputEntity implementation allowing it to be used with the new native batch indexing interfaces. This implementation differs from the StaticGoogleBlobStoreFirehoseFactory in that it uses a uris list like the S3InputSource rather than a blobs list.

...
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "google",
        "uris": ["gs://some/path/file.json", "gs://some/other/path/file2.json"]
      },
      "inputFormat": {
        "type": "json"
      },
      "appendToExisting": false
    },
...

In a later PR, I think it would be nice to add a prefixes option that lists bucket path contents similar to the s3 extension, but I think out of scope of this PR.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths.
  • added integration tests.
  • been tested in a test Druid cluster.

Key changed/added classes in this PR
  • GoogleCloudStorageInputSource
  • GoogleCloudStorageEntity

Copy link
Copy Markdown
Contributor

@jon-wei jon-wei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Patch LGTM aside from one minor comment, would be nice to support some kind of wildcarding in a later patch, but this has parity with the existing firehose impl

}

@Provides
public GoogleStorage getRestS3Service()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably have a google-related name instead

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh i see, you fixed already, cool

final String bucket = uri.getAuthority();
final String key = GoogleUtils.extractGoogleCloudStorageObjectKey(uri);
final GoogleByteSource byteSource = new GoogleByteSource(storage, bucket, key);
return CompressionUtils.decompress(byteSource.openStream(), uri.toString());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: uri.getPath() is more appropriate here, since it will return the decoded version, whereas uri.toString() URI-encodes everything. It probably won't matter for this particular call site, but still seems like good form to me.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to getPath, tested that the extension works as expected with this change in the actual google clouds.


@JsonCreator
public GoogleCloudStorageInputSource(
@JacksonInject("googleStorage") GoogleStorage storage,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There isn't generally a need for a string here. The type is enough.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@gianm gianm merged commit 074a452 into apache:master Nov 20, 2019
@clintropolis clintropolis deleted the google-input-source branch November 20, 2019 05:22
jon-wei pushed a commit to jon-wei/druid that referenced this pull request Nov 26, 2019
* add google cloud storage InputSource for native batch

* rename

* checkstyle

* fix

* fix spelling

* review comments
@jon-wei jon-wei added this to the 0.17.0 milestone Dec 17, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants