add 'prefixes' support to google input source #8930
clintropolis merged 11 commits into apache:master
Marked with the WIP label until I add some additional unit tests, since this code barely has any coverage.
  |--------|-----------|-------|---------|
  |type|This should be `google`.|N/A|yes|
- |uris|JSON array of URIs where Google Cloud Storage files to be ingested are located.|N/A|yes|
+ |uris|JSON array of URIs where Google Cloud Storage objects to be ingested are located.|N/A|`uris` or `prefixes` must be set|
The "required?" field here should mention objects as well
  public URI getUri()
  {
-   return uri;
+   return null;
It doesn't look like this method is called anywhere in the codebase currently, but it could construct a URI from the bucket/path if needed.
`S3Entity` currently returns null for this as well; it would probably be worth adding a comment explaining why in both locations.
I modified `GoogleCloudStorageEntity` and `S3Entity` to implement `getUri` using the `CloudObjectLocation.toUri` method.
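For readers following the thread, here is a minimal sketch of what building a URI from a bucket and path could look like. `ObjectLocationSketch` is a hypothetical stand-in, not Druid's actual `CloudObjectLocation`, and the real `toUri` may encode paths differently.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical stand-in for a cloud object location; only bucket and path are modeled.
final class ObjectLocationSketch
{
  private final String bucket;
  private final String path;

  ObjectLocationSketch(String bucket, String path)
  {
    this.bucket = bucket;
    // Drop a leading slash so it is not doubled when the URI is assembled.
    this.path = path.startsWith("/") ? path.substring(1) : path;
  }

  // Builds scheme://bucket/path. The multi-argument URI constructor quotes
  // characters that are illegal in a path (for example, spaces).
  // Assumption: the bucket name is a valid hostname; other names may be rejected.
  URI toUri(String scheme)
  {
    try {
      return new URI(scheme, bucket, "/" + path, null);
    }
    catch (URISyntaxException e) {
      throw new IllegalArgumentException(e);
    }
  }
}
```

With something like this in place, an entity's `getUri` could delegate to `toUri("gs")` or `toUri("s3")` instead of returning null.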
  protected abstract T createEntity(InputSplit<CloudObjectLocation> split);

  protected abstract Stream<InputSplit<CloudObjectLocation>> getPrefixesSplitStream();
Can you add javadocs for the new protected methods?
What happens if prefixes aren't specified for the input source?
Updated the javadoc to mention that this method is called internally by `createSplits` and `estimateNumSplits`.
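A simplified sketch of the pattern under discussion: a shared base class decides between explicit `uris` and `prefixes`, and only calls the store-specific listing hook when prefixes are in use. The class name `CloudObjectSourceSketch`, the use of plain strings instead of `InputSplit<CloudObjectLocation>`, and the "exactly one of the two must be set" validation are assumptions made for illustration, not the code in this PR.

```java
import java.util.List;
import java.util.stream.Stream;

// Hypothetical, simplified model of a shared cloud input source base class.
abstract class CloudObjectSourceSketch
{
  private final List<String> uris;     // explicit object URIs, may be null
  private final List<String> prefixes; // prefixes to list, may be null

  CloudObjectSourceSketch(List<String> uris, List<String> prefixes)
  {
    boolean hasUris = uris != null && !uris.isEmpty();
    boolean hasPrefixes = prefixes != null && !prefixes.isEmpty();
    // Assumption: exactly one of the two properties must be provided.
    if (hasUris == hasPrefixes) {
      throw new IllegalArgumentException("exactly one of uris or prefixes must be set");
    }
    this.uris = uris;
    this.prefixes = prefixes;
  }

  /**
   * Lists the objects under the given prefixes using the store's list API.
   * Called internally by createSplits, and only when prefixes are set.
   */
  protected abstract Stream<String> getPrefixesSplitStream(List<String> prefixes);

  Stream<String> createSplits()
  {
    if (uris != null && !uris.isEmpty()) {
      // Explicit URIs need no listing call: one split per URI.
      return uris.stream();
    }
    return getPrefixesSplitStream(prefixes);
  }
}
```

Under these assumptions, a subclass never sees a call to `getPrefixesSplitStream` with missing prefixes, which is one way to answer the question above.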
  }

  public static Iterator<StorageObject> lazyFetchingStorageObjectsIterator(
Doesn't have to be done now, but it may be worth building a common abstraction for this and the equivalent in S3Utils if we end up doing similar stuff for all the cloud object stores.
I agree this would be nice to have. I will revisit this in a future PR, especially if another cloud (Azure or whatever) has a similar API.
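As a rough idea of what such a common abstraction might look like, the sketch below wraps any token-paginated listing call in a lazy iterator. `ListingPage` and `LazyPagedIterator` are hypothetical names, and the page-fetching function stands in for the store-specific list request; none of this is taken from the PR or from `S3Utils`.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;

// Hypothetical page type: one batch of results plus the token for the next batch.
final class ListingPage<T>
{
  final List<T> items;
  final String nextPageToken; // null when there are no more pages

  ListingPage(List<T> items, String nextPageToken)
  {
    this.items = items;
    this.nextPageToken = nextPageToken;
  }
}

// Fetches pages only when the caller has consumed everything seen so far.
final class LazyPagedIterator<T> implements Iterator<T>
{
  private final Function<String, ListingPage<T>> fetchPage; // token (null for first page) -> page
  private Iterator<T> current = Collections.emptyIterator();
  private String nextToken = null;
  private boolean started = false;

  LazyPagedIterator(Function<String, ListingPage<T>> fetchPage)
  {
    this.fetchPage = fetchPage;
  }

  @Override
  public boolean hasNext()
  {
    // Loop so that empty intermediate pages are skipped rather than ending iteration.
    while (!current.hasNext()) {
      if (started && nextToken == null) {
        return false; // the last page has been consumed
      }
      ListingPage<T> page = fetchPage.apply(nextToken);
      started = true;
      current = page.items.iterator();
      nextToken = page.nextPageToken;
    }
    return true;
  }

  @Override
  public T next()
  {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    return current.next();
  }
}
```

Each cloud extension would then only supply the function that turns a page token into one list request, while the iteration logic stays shared.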
Description
Adds `prefixes` to the google storage input source added in #8907, making this extension symmetrical with the s3 extension in #8903, after which this implementation is modeled, except using the google storage object list API https://cloud.google.com/storage/docs/json_api/v1/objects/list instead of the s3 list objects API (I guess obviously).

Additionally, this refactors common code between google and s3 input sources into a new abstract class, `CloudObjectInputSource`.

This PR has: