Set Glue Table Information when creating/updating tables #288
Resolves apache#216. This PR adds information about the schema (on update/create) and location (create) of the table to Glue, enabling both an improved UI experience as well as querying with Athena.
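To make the description above concrete, here is a minimal sketch of the kind of payload involved. It assumes the Glue `TableInput` shape (`Name`, `TableType`, `Parameters`, `StorageDescriptor` with `Columns` and `Location`); the helper name `build_table_input` and the example values are hypothetical and are not taken from this PR.

```python
from typing import Any, Dict, List, Tuple


def build_table_input(
    table_name: str,
    metadata_location: str,
    table_location: str,
    columns: List[Tuple[str, str]],
) -> Dict[str, Any]:
    """Build a Glue TableInput-style dict (hypothetical helper, not the PR's code)."""
    return {
        "Name": table_name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {
            "table_type": "ICEBERG",
            "metadata_location": metadata_location,
        },
        # The StorageDescriptor carries the schema and location that the
        # Glue UI and Athena read.
        "StorageDescriptor": {
            "Columns": [{"Name": name, "Type": glue_type} for name, glue_type in columns],
            "Location": table_location,
        },
    }


# Example values are made up for illustration.
table_input = build_table_input(
    table_name="my_table",
    metadata_location="s3://bucket/my_db/my_table/metadata/00000.metadata.json",
    table_location="s3://bucket/my_db/my_table",
    columns=[("x", "string"), ("y", "bigint")],
)
```

A dict like this would typically be passed as the `TableInput` argument of the Glue `create_table` or `update_table` call.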
Thanks for the great contribution, @mgmarino. Also, thanks to both you and @nicor88 for the tests on Athena. I didn't initially realize the importance of StorageDescriptor. However, I've now reproduced the issue on Athena and confirmed that adding StorageDescriptor resolves the issue with the SELECT query.
I've left some comments below. Please let me know your thoughts. 😊
Also, if possible, could you please add some tests in integration_test_glue.py? This file contains similar tests. Reference: iceberg-python/tests/conftest.py, lines 1701 to 1706 in 3085c40.
Hi @HonahX thanks for the review! I’ll have a look soon, but I just wanted to get a feeling about what we should test in the integration test that’s not covered here. IMHO, for this update I think a full test that looks like:
would make sense, but it would possibly require a little more than just access to a bucket. (I’m just not sure whether there are any limitations on the CI AWS account.) I will go in that direction anyway; let me know if you had something else in mind.
Thanks @mgmarino / @HonahX. Something else that comes to mind when working with Glue/Athena (valid for other engines too).
@mgmarino @nicor88 Thanks for your input.
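I did a simple test:

    import pandas as pd
    import pyarrow as pa
    from pyiceberg.types import StringType

    # `catalog` and `table_identifier` are assumed to be defined already.
    table = catalog.load_table(table_identifier)
    # Evolve the schema by adding a new string column "y".
    table.update_schema().add_column("y", StringType()).commit()
    to_append2 = pa.Table.from_pandas(
        pd.DataFrame([dict(x="hello!", y="world!")])
    )
    table.append(to_append2)

and Athena could successfully run the query. For testing, I think it would be nice to add one or two Athena-related checks to both unit tests and integration tests. However, I think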
We can drop the additional argument because it is always there in the metadata.
Yes, I verified the correct behavior as well (creation of table, schema change, etc.) before pushing, but I am happy to formalize this in an integration test and will do so. :-) Re moto: for me, the only real value of moto is in ensuring that we call the AWS API correctly. I think we can't rely on it to reproduce the internal functionality of AWS services, and certainly not an Athena query, so I would simply go in the direction of doing this as an integration test.
Regarding moto: I totally agree with you @mgmarino. It only helps to test API specs, and since the main catalog source of truth is Glue (also for Athena), mocking Athena via moto is not enough.
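The point about mocks only checking the shape of the API call can be illustrated with a small sketch. It uses `unittest.mock.MagicMock` from the standard library as a stand-in for a Glue client rather than moto itself, and the wrapper `create_glue_table` is a hypothetical name introduced for illustration. A test like this confirms which arguments were sent; it says nothing about whether Athena can read the resulting table.

```python
from unittest.mock import MagicMock


def create_glue_table(glue_client, database: str, table_input: dict) -> None:
    """Hypothetical thin wrapper around the Glue CreateTable call."""
    glue_client.create_table(DatabaseName=database, TableInput=table_input)


# A MagicMock records calls without contacting AWS (stand-in for a real client).
fake_glue = MagicMock()
create_glue_table(
    fake_glue,
    "my_db",
    {
        "Name": "my_table",
        "StorageDescriptor": {
            "Columns": [{"Name": "x", "Type": "string"}],
            "Location": "s3://bucket/my_db/my_table",
        },
    },
)

# Verify only the request shape: one call, with the expected keyword arguments.
fake_glue.create_table.assert_called_once()
_, kwargs = fake_glue.create_table.call_args
sent_table_input = kwargs["TableInput"]
```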
@HonahX I just realized that the integration tests for Glue are not automatically collected/run in CI, so I guess these are just up to "us" to run by hand? That makes it a bit easier; I thought I might need to do something else. Let me know if I missed anything here.
@mgmarino You are correct. We need to use our own AWS accounts to run them.
@mgmarino @HonahX - I was testing this, and after the change I can confirm that I can query the table in Athena (I'm still digging into why the table is not droppable in Athena). However, I see some weird behavior: in the final table I expect to have multiple records with x=Alice, and that's not the case; I only have 2 records: 1 Alice and 1 Bob. Then I tried to run: as I was expecting an overwrite behaviour, but I still get the first snapshot data, so Alice and Bob again. Checking the data folder, it seems that the Parquet files are written, and new metadata files are also popping up. To me this seems to be an issue on the Glue catalog side, as it still points to the first snapshot, and that's definitely not right to me. Am I missing something here?
This covers a few use cases, including updating the schema and reading real data back from Athena.
Integration tests added in 40ab6e6
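For readers who want to run a similar read-back check by hand, the following is a rough sketch of an Athena query helper. It is not the code from 40ab6e6. It assumes a client exposing the Athena operations `start_query_execution`, `get_query_execution`, and `get_query_results` (for example one created with `boto3.client("athena")`); the function name, its parameters, and the polling limits are illustrative assumptions.

```python
import time
from typing import Any, List, Optional


def run_athena_query(
    athena: Any,
    sql: str,
    database: str,
    output_location: str,
    poll_seconds: float = 1.0,
    max_polls: int = 60,
) -> List[List[Optional[str]]]:
    """Run a query and return the first page of result rows as lists of strings.

    Hypothetical helper: `athena` is any object with the Athena client methods.
    """
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Poll until the query reaches a terminal state or we give up.
    for _ in range(max_polls):
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            break
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Athena query {query_id} ended in state {state}")
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Athena query {query_id} did not finish in time")

    # Only the first page of results is fetched; for SELECT statements the
    # first row usually holds the column headers.
    result = athena.get_query_results(QueryExecutionId=query_id)
    return [
        [cell.get("VarCharValue") for cell in row["Data"]]
        for row in result["ResultSet"]["Rows"]
    ]
```

Because the client is passed in as an argument, the helper can be exercised with a fake object in unit tests and with a real client in manually run integration tests.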
HonahX
left a comment
Thanks for adding the integration test!
@jackye1995 Could you please take a look at this when you have a chance? This is related to both Glue and Athena
@nicor88, I tried the same code you posted along with this PR, and it seemed to work on my side. Each time I ran the code, a new "Alice" and a new "Bob" were appended to the table. Do you still have the same issue on your side?
I also tried this and could not reproduce the problem (i.e. I saw the additional rows).
HonahX
left a comment
Overall LGTM! Thank you very much @mgmarino! Just adding one final request: make IcebergSchemaToGlueType internal by adding a leading _. Sorry I missed that earlier.
I've added this to the 0.6.0 milestone to ensure that Athena users can query the table written by pyiceberg after releasing write support.
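As a rough illustration of what a schema-to-Glue-type conversion like the visitor mentioned above has to do, here is a self-contained sketch. It does not use pyiceberg's `SchemaVisitor`; the small dataclasses and the `to_glue_type` function are hypothetical, and the primitive mapping is an assumption based on Hive-style type names, not a copy of the PR's table.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Primitive:
    name: str  # e.g. "long", "string", "decimal(10, 2)", "fixed[16]"


@dataclass
class ListOf:
    element: "IcebergType"


@dataclass
class MapOf:
    key: "IcebergType"
    value: "IcebergType"


@dataclass
class StructOf:
    fields: List[Tuple[str, "IcebergType"]]


IcebergType = Union[Primitive, ListOf, MapOf, StructOf]

# Assumed mapping from Iceberg primitive names to Hive-style Glue type names.
_PRIMITIVES = {
    "boolean": "boolean", "int": "int", "long": "bigint",
    "float": "float", "double": "double", "date": "date",
    "time": "string", "timestamp": "timestamp", "timestamptz": "timestamp",
    "string": "string", "uuid": "string", "binary": "binary",
}


def to_glue_type(t: IcebergType) -> str:
    """Recursively render a type as a Glue/Hive type string (illustrative only)."""
    if isinstance(t, Primitive):
        if t.name.startswith("decimal("):
            return t.name.replace(" ", "")  # decimal(p,s) keeps its parameters
        if t.name.startswith("fixed"):
            return "binary"
        return _PRIMITIVES[t.name]
    if isinstance(t, ListOf):
        return f"array<{to_glue_type(t.element)}>"
    if isinstance(t, MapOf):
        return f"map<{to_glue_type(t.key)},{to_glue_type(t.value)}>"
    if isinstance(t, StructOf):
        inner = ",".join(f"{name}:{to_glue_type(ft)}" for name, ft in t.fields)
        return f"struct<{inner}>"
    raise TypeError(f"Unsupported type: {t!r}")


example = to_glue_type(
    StructOf([("x", Primitive("string")), ("ids", ListOf(Primitive("long")))])
)
# example == "struct<x:string,ids:array<bigint>>"
```

Nested types are handled by recursion, which is the same role the visitor pattern plays in the real implementation.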
pyiceberg/catalog/glue.py
Outdated
Suggested change:
    - class IcebergSchemaToGlueType(SchemaVisitor[str]):
    + class _IcebergSchemaToGlueType(SchemaVisitor[str]):
pyiceberg/catalog/glue.py
Outdated
    ColumnTypeDef,
    {
        "Name": field.name,
        "Type": visit(field.field_type, IcebergSchemaToGlueType()),
Suggested change:
    - "Type": visit(field.field_type, IcebergSchemaToGlueType()),
    + "Type": visit(field.field_type, _IcebergSchemaToGlueType()),
Sure, np!
🎉
Resolves #216.
This PR adds information about the schema (on update/create) and
location (create) of the table to Glue, enabling both an improved UI
experience as well as querying with Athena.
It follows mainly the behavior in the Java library