Skip to content

Conversation

@Buktoria
Copy link
Contributor

@Buktoria Buktoria commented Feb 22, 2023

Adding Google Cloud Storage support to the Python Iceberg CLI

  • Added GCSFileIO support through the gcsfs python lib
  • Added PyArrow GCS support with PyArrows GcsFileSystem class
  • Added GCS test suite using a GCS emulator Docker Container (https://github.com/fsouza/fake-gcs-server)
  • Updated docs to include GCSFileIO configs options (full list of possible options can be found here)

@Buktoria Buktoria force-pushed the pyiceberg/accpet-file-paths-with-gs-prefix branch from 68cc990 to 5345bb9 Compare February 25, 2023 01:42
@Fokko
Copy link
Contributor

Fokko commented Feb 27, 2023

@Buktoria this is awesome, thanks for working on this. Let me know when it is ready for review!

@Buktoria Buktoria force-pushed the pyiceberg/accpet-file-paths-with-gs-prefix branch 2 times, most recently from 2e59713 to aa58ac2 Compare March 8, 2023 01:48
@Buktoria Buktoria marked this pull request as ready for review March 8, 2023 01:59
@Buktoria Buktoria force-pushed the pyiceberg/accpet-file-paths-with-gs-prefix branch from aa58ac2 to 626aa7e Compare March 8, 2023 02:33
python/Makefile Outdated

test:
poetry run coverage run --source=pyiceberg/ -m pytest tests/ -m "not s3 and not adlfs" ${PYTEST_ARGS}
poetry run coverage run --source=pyiceberg/ -m pytest tests/ -m "not s3 and not adlfs and not gcs" ${PYTEST_ARGS}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will get prettier once #6398 has been merged

@Buktoria Buktoria force-pushed the pyiceberg/accpet-file-paths-with-gs-prefix branch 2 times, most recently from 9c5a826 to a3d430b Compare March 13, 2023 00:12
@Buktoria
Copy link
Contributor Author

@Fokko This PR is ready to be reviewed

Copy link
Contributor

@JonasJ-ap JonasJ-ap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your great work!

}
return GcsFileSystem(**client_kwargs)
else:
return GcsFileSystem()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Question] Is it possible to take out this if...else..., like:

        elif scheme in {"gs", "gcs"}:
            client_kwargs = {
                "access_token": self.properties.get("gs.access-token"),
                "credential_token_expiration": self.properties.get("gs.credential-token-expiration"),
            }
            return GcsFileSystem(**client_kwargs)

?
Is there any problem if access_token is not given but credential_token_expiration is provided? Thank you in advance for your answer.

@Fokko Fokko self-requested a review April 10, 2023 22:28
@Fokko
Copy link
Contributor

Fokko commented Apr 10, 2023

@Buktoria thanks for raising this. I was out for a couple of weeks, could you resolve the merge conflicts? I'll review the PR asap.

@Fokko Fokko added this to the PyIceberg 0.4.0 release milestone Apr 18, 2023
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Buktoria Sorry for pinging you again? Would you have any time to fix the merge conflicts? It would be great to get this into 0.4.0

@Fokko Fokko changed the title pyiceberg: Add Google Cloud Storage support Python: Add Google Cloud Storage support Apr 24, 2023
@Fokko
Copy link
Contributor

Fokko commented May 14, 2023

@Buktoria Gentle ping! :)

@Buktoria
Copy link
Contributor Author

Hey Fokko. I have been away, and just got back over the weekend. I will make sure to wrap this up this week. So sorry for the delay.

@Fokko
Copy link
Contributor

Fokko commented Jun 13, 2023

@Buktoria No problem, just checking if it is still on your list. Would be really cool to get this in.

@Fokko
Copy link
Contributor

Fokko commented Jul 18, 2023

@Buktoria Gente ping from my side. Would be great to get this in.

}
return S3FileSystem(**client_kwargs)
elif scheme in {"gs", "gcs"}:
if access_token := self.properties.get("gs.access-token"):
Copy link
Contributor

@danielcweeks danielcweeks Jul 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this property should be gs.token based on the properties in the readme

}
return S3FileSystem(**client_kwargs)
elif scheme in {"gs", "gcs"}:
if access_token := self.properties.get("gs.access-token"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add gs.credential-token-expiration to the docs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed it. I've added it to the docs now.

| adlfs.client-secret | oCA3R6P\*ka#oa1Sms2J74z... | The client-secret |
| adlfs.client-secret | oCA3R6P\*ka#oa1Sms2J74z... | The client-secret |

### Google Cloud Storage
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if the Python GCS properties could be consistent with the Java properties where possible. Java already has some properties defined so starting with those would help avoid any backwards compatibility issues (rather than changing them). There is also a PR to add OAuth2 access token properties on the Java side.

@Buktoria Buktoria force-pushed the pyiceberg/accpet-file-paths-with-gs-prefix branch 2 times, most recently from 71e7135 to d17d112 Compare July 24, 2023 00:05
@Buktoria Buktoria force-pushed the pyiceberg/accpet-file-paths-with-gs-prefix branch from d17d112 to 8b2ba33 Compare July 24, 2023 00:08
Co-authored-by: Fokko Driesprong <fokko@apache.org>
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for picking this up, it looks like the CI is failing. Can you run pip3 install pre-commit && pre-commit run --all-files?

| Key | Example | Description |
| ----------------------- | ----------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| adlfs.connection-string | AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqF...;BlobEndpoint=http://localhost/ | A [connection string](https://learn.microsoft.com/en-us/azure/storage/common/storage-configure-connection-string). This could be used to use FileIO with any adlfs-compatible object storage service that has a different endpoint (like [azurite](https://github.com/azure/azurite)). |

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remove this unrelated change?

| s3fs | S3FS as a FileIO implementation to interact with the object store |
| adlfs | ADLFS as a FileIO implementation to interact with the object store |
| snappy | Support for snappy Avro compression |
| gcs | GCS as the FileIO implementation to interact with the object store |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| gcs | GCS as the FileIO implementation to interact with the object store |
| gcsfs | GCS as the FileIO implementation to interact with the object store |

@Fokko Fokko mentioned this pull request Aug 2, 2023
@Fokko
Copy link
Contributor

Fokko commented Aug 14, 2023

Closing this one, since #8207 is in! Thanks for the PR @Buktoria 🙌🏻

@Fokko Fokko closed this Aug 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants