Set Glue Table Information when creating/updating tables #288

mgmarino · 2024-01-20T10:05:40Z

Resolves #216.

This PR adds information about the schema (on update/create) and
location (create) of the table to Glue, enabling both an improved UI
experience as well as querying with Athena.

It mainly follows the behavior of the Java library.
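As a rough illustration of what "schema and location" mean on the Glue side, the sketch below builds a dict in the shape of Glue's `TableInput`, with the columns and location placed under `StorageDescriptor`. The helper name `build_glue_table_input`, the bucket, and the paths are assumptions made up for this example; this is not the code from the PR.

```python
from typing import Any, Dict, List, Tuple


def build_glue_table_input(
    table_name: str,
    metadata_location: str,
    table_location: str,
    columns: List[Tuple[str, str]],
) -> Dict[str, Any]:
    """Build a Glue TableInput-shaped dict (hypothetical helper, not PyIceberg's code)."""
    return {
        "Name": table_name,
        "TableType": "EXTERNAL_TABLE",
        # Parameters that mark the table as Iceberg and point at its metadata file.
        "Parameters": {
            "table_type": "ICEBERG",
            "metadata_location": metadata_location,
        },
        # Columns and location, which the Glue UI and Athena read.
        "StorageDescriptor": {
            "Columns": [{"Name": name, "Type": glue_type} for name, glue_type in columns],
            "Location": table_location,
        },
    }


# Example values are placeholders.
table_input = build_glue_table_input(
    table_name="my_table",
    metadata_location="s3://my-bucket/my_table/metadata/00000-example.metadata.json",
    table_location="s3://my-bucket/my_table",
    columns=[("x", "string"), ("y", "bigint")],
)
```

A dict like this would be passed as the `TableInput` argument of the boto3 Glue client's `create_table` or `update_table` call.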


HonahX

Thanks for the great contribution, @mgmarino. Also, thanks to both you and @nicor88 for the tests on Athena. I didn't initially realize the importance of StorageDescriptor. However, I've now reproduced the issue on Athena and confirmed that adding StorageDescriptor resolves the issue with the SELECT query.

I've left some comments below. Please let me know your thoughts. 😊


HonahX · 2024-01-21T06:50:26Z

Also, if possible, could you please add some tests in integration_test_glue.py? This file contains tests similar to those in test_glue.py, but it interacts with the real Glue service. To run the tests, you will need AWS credentials and must specify a bucket via the environment variable AWS_TEST_BUCKET:

iceberg-python/tests/conftest.py, lines 1701 to 1706 in 3085c40:

    def get_bucket_name() -> str:
        """Set the environment variable AWS_TEST_BUCKET for a default bucket to test."""
        bucket_name = os.getenv("AWS_TEST_BUCKET")
        if bucket_name is None:
            raise ValueError("Please specify a bucket to run the test by setting environment variable AWS_TEST_BUCKET")
        return bucket_name

Thanks!

mgmarino · 2024-01-21T13:50:49Z


Hi @HonahX, thanks for the review! I'll have a look soon, but I wanted to get a feeling for what we should test in the integration test that isn't covered here. IMHO, for this update a full test that looks like:

make a table with pyiceberg
verify a subsequent query with Athena

would make sense, but it would possibly require a little more than just access to a bucket. (I'm just not sure whether there are any limitations on the CI AWS account.)

I will go in that direction anyway; let me know if you had something else in mind.
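For the "verify a subsequent query with Athena" step, a minimal sketch of running a query through an Athena client and collecting the rows could look like the following. The function name `run_athena_query` and its parameters are assumptions for illustration; the client is passed in so that a real `boto3.client("athena")` or a test double can be used. Only the first page of results is read.

```python
import time
from typing import Any, List


def run_athena_query(
    athena: Any,
    sql: str,
    database: str,
    output_location: str,
    poll_seconds: float = 1.0,
    max_polls: int = 60,
) -> List[List[str]]:
    """Run `sql` on Athena and return the first page of rows as lists of strings."""
    execution_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    for _ in range(max_polls):
        status = athena.get_query_execution(QueryExecutionId=execution_id)["QueryExecution"]["Status"]
        state = status["State"]
        if state == "SUCCEEDED":
            break
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Athena query {state}: {status.get('StateChangeReason')}")
        time.sleep(poll_seconds)
    else:
        raise TimeoutError("Athena query did not finish in time")

    result = athena.get_query_results(QueryExecutionId=execution_id)
    # For SELECT queries the first row holds the column names.
    return [
        [cell.get("VarCharValue", "") for cell in row["Data"]]
        for row in result["ResultSet"]["Rows"]
    ]
```

In an integration test, the returned rows can be compared against the data previously appended with pyiceberg. Athena also needs an S3 output location, which is part of the "little more than just access to a bucket" mentioned above.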

nicor88 · 2024-01-21T18:44:01Z

Thanks @mgmarino / @HonahX. Something else that comes to mind when working with Glue/Athena (and valid for other engines too):
Did you try to evolve the table schema and check that the changes are properly propagated to Glue and usable in Athena? I'm also wondering whether such a check should be a dedicated test case: unit tests (moto mocking) plus an integration test?

HonahX · 2024-01-22T01:54:35Z

@mgmarino @nicor88 Thanks for your input.

Did you try to evolve the table schema and see if the changes are properly updated in glue and usable in Athena?

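I did a simple test:

import pandas as pd
import pyarrow as pa
from pyiceberg.types import StringType

# `catalog` and `table_identifier` refer to an existing Glue catalog and table
table = catalog.load_table(table_identifier)
# Schema evolution: add a new optional string column "y"
table.update_schema().add_column("y", StringType()).commit()
to_append2 = pa.Table.from_pandas(
    pd.DataFrame([dict(x="hello!", y="world!")])
)
table.append(to_append2)

Athena could then successfully run SELECT * FROM on the updated table: both the original column and the new column show up in the result. I haven't tried other schema update operations, though.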

For testing, I think it would be nice to add one or two Athena-related checks to both unit tests and integration tests. However, I think moto does not actually run Athena queries by default, so it cannot help us verify the Athena behavior in unit tests. So maybe adding such a check in integration tests is good enough. (Please correct me if I am wrong about moto's Athena support.)


mgmarino · 2024-01-22T08:15:54Z

Thanks, @HonahX, @nicor88.

Yes, I verified the correct behavior as well (creation of table, schema change, etc.) before pushing, but I am happy to formalize this in an integration test and will do so. :-)

Re moto: for me, the only real value of moto is in ensuring that we call the AWS API correctly. I think we can't rely on it to reproduce the internal functionality of AWS services, and certainly not an Athena query, so I would simply go in the direction of doing this in an integration test.

nicor88 · 2024-01-22T08:34:50Z

Regarding moto - I totally agree with you @mgmarino, it only helps to test API specs, and as the main catalog source of truth is Glue (also for Athena), mocking Athena via moto is not enough.
Therefore, to support both of your comments, I believe that integration tests for Athena/Glue/S3 are the only way to fully test and ensure that everything works as expected. This is the same path we took in https://github.com/dbt-athena/dbt-athena, where we use moto for unit tests, but then have end-to-end tests where we test the full Athena/Glue behaviour (Iceberg tables included).

mgmarino · 2024-01-22T09:00:14Z

@HonahX I just realized that the integration tests for glue are not automatically collected/run in CI, so I guess these are just up to "us" to run by hand? That makes it a bit easier, I thought I might need to do something else, let me know if I missed anything here.

HonahX · 2024-01-22T09:04:43Z

@mgmarino You are correct. We need to use our own AWS accounts to run them.
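Since these tests only run against a contributor's own AWS account, one way to keep them from failing elsewhere is to skip them when `AWS_TEST_BUCKET` is not set. The sketch below uses the standard library's `unittest` only to stay self-contained; the class and test names are hypothetical, and the project's own tests are organized differently.

```python
import os
import unittest

# Bucket used by the integration tests; unset on machines without AWS access.
AWS_TEST_BUCKET = os.getenv("AWS_TEST_BUCKET")


@unittest.skipUnless(AWS_TEST_BUCKET, "set AWS_TEST_BUCKET to run Glue integration tests")
class GlueIntegrationTest(unittest.TestCase):
    def test_bucket_is_configured(self) -> None:
        # Placeholder body: a real test would create a table and query it.
        self.assertTrue(AWS_TEST_BUCKET)
```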

nicor88 · 2024-01-22T09:51:28Z

@mgmarino @HonahX - I was testing this, and after the change I can confirm that I can query the table in Athena (I'm still digging into why the table cannot be dropped in Athena). However, I see some weird behavior
if I run this multiple times:

import pandas as pd
import pyarrow as pa

# `t` is the table loaded from the Glue catalog
data = [
    {"x": "Alice"},
    {"x": "Bob"}
]
df = pd.DataFrame(data)

to_append = pa.Table.from_pandas(df)

t.append(to_append)

In the final table I expect to have multiple records x=Alice, and it's not the case - I only have 2 records: 1 Alice and 1 Bob.

Then I tried to run:

data = [
    {"x": "Alice v1"},
    {"x": "Bob v1"}
]
df = pd.DataFrame(data)

to_append = pa.Table.from_pandas(df)

t.append(to_append)

I was then expecting overwrite behaviour, but I still get the first snapshot's data, so Alice and Bob again.

Checking the data folder, it seems that the Parquet files are written, and new metadata files are popping up as well. To me this looks like an issue on the Glue catalog side, as it still points to the first snapshot, and that's definitely not right. Am I missing something here?
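One way to check whether the Glue entry still points at an old metadata file is to read the table parameters back from Glue before and after an append. The sketch below assumes the catalog records the current and previous metadata files under the `metadata_location` and `previous_metadata_location` table parameters; the helper name `glue_metadata_pointers` is made up for this example, and the client is passed in so a real `boto3.client("glue")` or a stub can be used.

```python
from typing import Any, Dict, Optional


def glue_metadata_pointers(glue: Any, database: str, table: str) -> Dict[str, Optional[str]]:
    """Return the metadata pointers stored in the Glue table parameters."""
    parameters = glue.get_table(DatabaseName=database, Name=table)["Table"].get("Parameters", {})
    return {
        # Assumed parameter keys; a missing key is returned as None.
        "metadata_location": parameters.get("metadata_location"),
        "previous_metadata_location": parameters.get("previous_metadata_location"),
    }
```

If `metadata_location` is unchanged after `t.append(...)`, the commit did not reach the catalog; if it has moved, the stale result is more likely on the reading side.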


mgmarino · 2024-01-22T10:36:44Z

Integration tests added in 40ab6e6

HonahX

Thanks for adding the integration test!

@jackye1995 Could you please take a look at this when you have a chance? This is related to both Glue and Athena


HonahX · 2024-01-23T06:06:50Z


@nicor88. I tried the same code you posted along with this PR, and it seemed to work on my side. Each time I ran the code, a new "Alice" and a new "Bob" were appended to the table. Do you still have the same issue on your side?

mgmarino · 2024-01-23T07:02:50Z


I also tried this and could not reproduce the problem (i.e. I saw the additional rows).

nicor88 · 2024-01-23T10:33:22Z

@HonahX @mgmarino all good on my side, all worked with a new table and with the last commit 💯
great work @mgmarino !

HonahX

Overall LGTM! Thank you very much @mgmarino! Just adding one final request: make IcebergSchemaToGlueType internal by adding a leading underscore (_). Sorry I missed that earlier.

I've added this to 0.6.0 milestone to ensure that Athena users can query the table written by pyiceberg after releasing write support.

HonahX · 2024-01-24T06:03:33Z

pyiceberg/catalog/glue.py

+}
+
+
+class IcebergSchemaToGlueType(SchemaVisitor[str]):


Suggested change

-class IcebergSchemaToGlueType(SchemaVisitor[str]):
+class _IcebergSchemaToGlueType(SchemaVisitor[str]):
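To illustrate what a schema-to-Glue-type conversion does, here is a simplified sketch that turns nested types into Glue (Hive-style) type strings. It deliberately does not use pyiceberg's `SchemaVisitor`: the dataclasses `Primitive`, `ListOf`, `MapOf`, and `StructOf`, the function `to_glue_type`, and the primitive mapping table are stand-ins assumed for this example, and plain recursion replaces the visitor.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Primitive:
    name: str  # e.g. "string", "long", "boolean"


@dataclass(frozen=True)
class ListOf:
    element: "IcebergLikeType"


@dataclass(frozen=True)
class MapOf:
    key: "IcebergLikeType"
    value: "IcebergLikeType"


@dataclass(frozen=True)
class StructOf:
    fields: Tuple[Tuple[str, "IcebergLikeType"], ...]


IcebergLikeType = Union[Primitive, ListOf, MapOf, StructOf]

# Assumed mapping from primitive type names to Glue (Hive-style) type names.
_PRIMITIVE_TO_GLUE = {
    "boolean": "boolean",
    "int": "int",
    "long": "bigint",
    "float": "float",
    "double": "double",
    "date": "date",
    "timestamp": "timestamp",
    "string": "string",
    "binary": "binary",
}


def to_glue_type(t: IcebergLikeType) -> str:
    """Recursively render a type as a Glue type string."""
    if isinstance(t, Primitive):
        if t.name not in _PRIMITIVE_TO_GLUE:
            raise ValueError(f"Unsupported primitive type: {t.name}")
        return _PRIMITIVE_TO_GLUE[t.name]
    if isinstance(t, ListOf):
        return f"array<{to_glue_type(t.element)}>"
    if isinstance(t, MapOf):
        return f"map<{to_glue_type(t.key)},{to_glue_type(t.value)}>"
    if isinstance(t, StructOf):
        inner = ",".join(f"{name}:{to_glue_type(ft)}" for name, ft in t.fields)
        return f"struct<{inner}>"
    raise TypeError(f"Unknown type node: {t!r}")
```

Each resulting string is the kind of value that ends up in the `Type` field of a Glue column definition.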

HonahX · 2024-01-24T06:04:10Z

pyiceberg/catalog/glue.py

+            ColumnTypeDef,
+            {
+                "Name": field.name,
+                "Type": visit(field.field_type, IcebergSchemaToGlueType()),


Suggested change

-                "Type": visit(field.field_type, IcebergSchemaToGlueType()),
+                "Type": visit(field.field_type, _IcebergSchemaToGlueType()),

mgmarino · 2024-01-24T07:03:58Z

Overall LGTM! Thank you very much @mgmarino! Just adding one final request: make IcebergSchemaToGlueType internal by adding a leading underscore (_). Sorry I missed that earlier.

Sure, np!

I've added this to 0.6.0 milestone to ensure that Athena users can query the table written by pyiceberg after releasing write support.

🎉

Thanks, @HonahX and @nicor88, for the input and reviews!

Set Glue Table Information when creating/updating tables

89cecbe

Resolves apache#216. This PR adds information about the schema (on update/create) and location (create) of the table to Glue, enabling both an improved UI experience as well as querying with Athena.

mgmarino marked this pull request as ready for review January 20, 2024 10:07

mgmarino mentioned this pull request Jan 20, 2024

GlueCatalog: Set Glue table input information based on Iceberg table metadata #216

Closed

HonahX reviewed Jan 21, 2024


mgmarino added 3 commits January 22, 2024 08:54

Always include Location from metadata

0579208

We can drop the additional argument because it is always there in the metadata.

Change typing to TableMetadata

5e37da9

Use SchemaVisitor to traverse Glue types/schema

defdac4

Add integration tests for glue/Athena

40ab6e6

This covers a few use cases, including updating the schema and reading real data back from Athena.

HonahX reviewed Jan 23, 2024


Rename schema visitor to indicate direction of conversion

2877c69

HonahX added this to the PyIceberg 0.6.0 release milestone Jan 24, 2024

HonahX approved these changes Jan 24, 2024


Make Schema Visitor internal

b3badb4

HonahX merged commit 4391919 into apache:main Jan 25, 2024

Fokko mentioned this pull request Feb 16, 2024

Empty iceberg table created with PyIceberg in AWS Glue misses location and schema #435

Closed

