Skip to content

add 'prefixes' support to google input source#8930

Merged
clintropolis merged 11 commits intoapache:masterfrom
clintropolis:google-input-source-list-objects
Dec 5, 2019
Merged

add 'prefixes' support to google input source#8930
clintropolis merged 11 commits intoapache:masterfrom
clintropolis:google-input-source-list-objects

Conversation

@clintropolis
Copy link
Copy Markdown
Member

@clintropolis clintropolis commented Nov 22, 2019

Description

Adds prefixes to the google storage input source added in #8907, making this extension symmetrical with the s3 extension in #8903, after which this implementation is modeled, except using the google storage object list API https://cloud.google.com/storage/docs/json_api/v1/objects/list instead of the s3 list objects API (I guess obviously).

Additionally, this refactors common code between google and s3 input sources into a new abstract class,CloudObjectInputSource.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added unit tests or modified existing tests to cover new code paths.
  • added integration tests.
  • been tested in a test Druid cluster.

@clintropolis
Copy link
Copy Markdown
Member Author

Marked with WIP label until I try to add some additional unit tests since this stuff barely has any coverage.

@clintropolis clintropolis removed the WIP label Nov 27, 2019
|--------|-----------|-------|---------|
|type|This should be `google`.|N/A|yes|
|uris|JSON array of URIs where Google Cloud Storage files to be ingested are located.|N/A|yes|
|uris|JSON array of URIs where Google Cloud Storage objects to be ingested are located.|N/A|`uris` or `prefixes` must be set|
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "required?" field here should mention objects as well

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, added

public URI getUri()
{
return uri;
return null;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't look like the method is called in the codebase currently but this could construct a URI from the bucket/path if needed.

S3Entity currently returns null for this as well, could probably have a comment on why in both locations

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I modified GoogleCloudStorageEntity and S3Entity to implement getUri using the CloudObjectLocation.toUri method


protected abstract T createEntity(InputSplit<CloudObjectLocation> split);

protected abstract Stream<InputSplit<CloudObjectLocation>> getPrefixesSplitStream();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add javadocs for the new protected methods?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if prefixes aren't specified for the input source?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated javadoc to mention that this method is called internally by createSplits and estimateNumSplits

}


public static Iterator<StorageObject> lazyFetchingStorageObjectsIterator(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't have to be done now, but it may be worth building a common abstraction for this and the equivalent in S3Utils if we end up doing similar stuff for all the cloud object stores.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this would be nice to have, I will revisit this in a future PR, especially if another cloud (azure or whatever) has a similar API.

Copy link
Copy Markdown
Contributor

@jon-wei jon-wei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after CI

@clintropolis clintropolis merged commit 5ecdf94 into apache:master Dec 5, 2019
@clintropolis clintropolis deleted the google-input-source-list-objects branch December 5, 2019 05:01
@jon-wei jon-wei added this to the 0.17.0 milestone Dec 17, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants