Introduce StorageConnector for GCS #14611

Closed
LakshSingla wants to merge 9 commits into apache:master from LakshSingla:gcs-storage-connector

@LakshSingla
Contributor

Description

This PR adds a storage connector that interacts with GCS through the API functions exposed by google-api-services-storage. It allows durable storage and MSQ's interactive APIs to work with GCS.

This also refactors the existing S3 connector so that its chunked-download logic can be extended to other connectors.
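
As a rough illustration of what chunked downloading involves, the sketch below splits an object of known size into inclusive byte ranges, formatted as HTTP Range header values. The class name `ChunkRanges` and its method are hypothetical and are not taken from this PR or from Druid; the sketch only shows the range arithmetic a connector-independent chunking layer would need.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not a class from Druid or this PR.
final class ChunkRanges {
  /**
   * Returns HTTP Range header values ("bytes=start-end", both ends inclusive)
   * that together cover the bytes [0, objectSize) in chunks of at most chunkSize.
   */
  static List<String> rangeHeaders(long objectSize, long chunkSize) {
    if (objectSize < 0 || chunkSize <= 0) {
      throw new IllegalArgumentException("objectSize must be >= 0 and chunkSize must be > 0");
    }
    List<String> headers = new ArrayList<>();
    for (long start = 0; start < objectSize; start += chunkSize) {
      // The last chunk may be shorter than chunkSize.
      long end = start + Math.min(chunkSize, objectSize - start) - 1;
      headers.add("bytes=" + start + "-" + end);
    }
    return headers;
  }
}
```

For a 10-byte object and a chunk size of 4, this yields three ranges: `bytes=0-3`, `bytes=4-7`, and `bytes=8-9`. Each connector would then only need to supply its own way of fetching one such range.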

Due to the current versions of the libraries used, the connector has the following three areas for improvement:

  1. Due to the limitations of google-api-services-storage and the version it uses, we can't use multipart or streaming uploads. The GCS connector therefore writes the intermediate contents to a file and uploads them in a single go. Composite objects exist, but that functionality seems incorrect. This can be improved once we upgrade the libraries.

  2. For fetching a file, an isChunkedDownloads flag controls whether to download in chunks using the Range header (https://cloud.google.com/storage/docs/xml-api/reference-headers#range). Because that header can be ignored, the functionality is kept behind the flag for now. Fetching by range isn't currently supported in the library.

  3. All delete requests are done individually and not in a batched manner.
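
The first limitation above (buffer to a local file, then upload in one request) can be sketched as an `OutputStream` that spools writes to a temporary file and hands the finished file to an uploader when closed. This is a minimal sketch under stated assumptions, not the code in this PR: `BufferedFileUploadStream` and its `Uploader` interface are hypothetical names, and `Uploader` stands in for whatever single-request upload call the storage client provides.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch; neither this class nor Uploader exists in Druid.
final class BufferedFileUploadStream extends OutputStream {
  /** Assumed stand-in for a storage client's single-request upload call. */
  interface Uploader {
    void upload(Path file) throws IOException;
  }

  private final Path tempFile;
  private final OutputStream delegate;
  private final Uploader uploader;
  private boolean closed = false;

  BufferedFileUploadStream(Uploader uploader) throws IOException {
    this.uploader = uploader;
    this.tempFile = Files.createTempFile("gcs-upload-", ".tmp");
    this.delegate = Files.newOutputStream(tempFile);
  }

  @Override
  public void write(int b) throws IOException {
    delegate.write(b);
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    delegate.write(b, off, len);
  }

  @Override
  public void close() throws IOException {
    if (closed) {
      return; // closing twice must not upload twice
    }
    closed = true;
    try {
      delegate.close();          // make sure all bytes are in the local file
      uploader.upload(tempFile); // upload the complete file in one go
    } finally {
      Files.deleteIfExists(tempFile); // remove the temporary file either way
    }
  }
}
```

The trade-off is local disk usage proportional to the object size, which is what multipart or streaming uploads would avoid.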

This implementation could be improved by using the google-cloud-storage library instead of the google-api-services-storage library, though that would require an overhaul of the existing Google functions.

Release note

To be added


Key changed/added classes in this PR
  • GoogleStorageConnector
  • OurBar
  • TheirBaz

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@cryptoe added the "Area - MSQ" label (For multi stage queries - https://github.com/apache/druid/issues/12262) on Jul 19, 2023
@LakshSingla
Contributor Author

Parking this for now, since the current library doesn't support chunked downloads or uploads, and Druid is bound to that library because Guava cannot be updated for a while.

Will update the PR with a list of requirements and the library versions required to enable this connector. Working on the Azure connector in the meantime.

@LakshSingla deleted the gcs-storage-connector branch on January 15, 2024 at 09:24
@LakshSingla
Contributor Author

Closed in favor of #15398
