Skip to content

Conversation

@mgmarino
Copy link
Contributor

@mgmarino mgmarino commented Jan 20, 2024

Resolves #216.

This PR adds information about the schema (on update/create) and
location (create) of the table to Glue, enabling both an improved UI
experience as well as querying with Athena.

It follows mainly the behavior in the Java library

Resolves apache#216.

This PR adds information about the schema (on update/create) and
location (create) of the table to Glue, enabling both an improved UI
experience as well as querying with Athena.
Copy link
Contributor

@HonahX HonahX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great contribution, @mgmarino. Also, thanks to both you and @nicor88 for the tests on Athena. I didn't initially realize the importance of StorageDescriptor. However, I've now reproduced the issue on Athena and confirmed that adding StorageDescriptor resolves the issue with the SELECT query.

I've left some comments below. Please let me know your thoughts. 😊

@HonahX
Copy link
Contributor

HonahX commented Jan 21, 2024

Also, if possible, could you please also add some tests in integration_test_glue.py? This file contains similar tests in test.glue.py but it interacts with the real Glue service. To run the test, you will need AWS credentials and specify a bucket via env variable AWS_TEST_BUCKET:

def get_bucket_name() -> str:
"""Set the environment variable AWS_TEST_BUCKET for a default bucket to test."""
bucket_name = os.getenv("AWS_TEST_BUCKET")
if bucket_name is None:
raise ValueError("Please specify a bucket to run the test by setting environment variable AWS_TEST_BUCKET")
return bucket_name
Thanks!

@mgmarino
Copy link
Contributor Author

Also, if possible, could you please also add some tests in integration_test_glue.py? This file contains similar tests in test.glue.py but it interacts with the real Glue service. To run the test, you will need AWS credentials and specify a bucket via env variable AWS_TEST_BUCKET:

def get_bucket_name() -> str:
"""Set the environment variable AWS_TEST_BUCKET for a default bucket to test."""
bucket_name = os.getenv("AWS_TEST_BUCKET")
if bucket_name is None:
raise ValueError("Please specify a bucket to run the test by setting environment variable AWS_TEST_BUCKET")
return bucket_name

Thanks!

Hi @HonahX thanks for the review! I’ll have a look soon, but I just wanted to get a feeling about what we should test in the integration test that’s not covered here. IMHO, for this update I think a full test that looks like:

  • make a table with pyiceberg
  • verify subsequent query with Athena

would make sense, but would possibly require a little more than just access to a bucket. (I’m just not sure about if there are any limitations on the CI AWS account.)

i will anyways go in that direction, let me know if you had something else in mind.

@nicor88
Copy link

nicor88 commented Jan 21, 2024

Thanks @mgmarino /@HonahX something else that comes to mind when working with glue/ Athena (valid for other engines too).
Did you try to evolve the table schema and see if the changes are properly updated in glue and usable in Athena? Also I'm wondering if such "check" should be a dedicated test case, unit tests (moto mocking) plus an integration test?

@HonahX
Copy link
Contributor

HonahX commented Jan 22, 2024

@mgmarino @nicor88 Thanks for your input.

Did you try to evolve the table schema and see if the changes are properly updated in glue and usable in Athena?

I did a simple test

table = catalog.load_table(table_identifier)
table.update_schema().add_column("y", StringType()).commit()
to_append2 = pa.Table.from_pandas(
    pd.DataFrame([dict(x="hello!", y="world!")])
)
table.append(to_append2)

and the Athena could successfully run SELECT * FROM on the updated table: both the original column and new column show up in the result. But I haven't tried other schema update options

For testing, I think it would be nice to add one or two Athena-related checks to both unit tests and integration tests. However , I think moto does not run Athena query by default, which cannot help us verify the Athena behavior in unit test. So, may be adding such check in integration tests is good enough. (Please correct me if I am wrong about moto athena)

@mgmarino
Copy link
Contributor Author

Thanks, @HonahX, @nicor88.

Yes, I verified the correct behavior as well (creation of table, schema change, etc.) before pushing, but I am happy to formalize this in an integration test and will do so. :-)

Re moto: for me, the only real value of moto is in ensuring that we call the aws API correctly. I think we can't rely on it to reproduce the internal functionality of AWS services, and certainly not an Athena query, so I would simply go the direction of doing this an integration test.

@nicor88
Copy link

nicor88 commented Jan 22, 2024

Regarding moto - I totally agree with you @mgmarino, it helps to test only API specs, and as the main catalog source of true is glue (also for athena), mocking athena via moto is not enough.
Therefore, to support both of your comments, I believe that integration testing for Athena/glue/s3 are the only way to fully test and assure that all works as expected - this is the same path that we took also in https://github.com/dbt-athena/dbt-athena, where we use moto for unit tests, but then we have end to end tests were we test full athena/glue behaviour (for iceberg tables included).

@mgmarino
Copy link
Contributor Author

@HonahX I just realized that the integration tests for glue are not automatically collected/run in CI, so I guess these are just up to "us" to run by hand? That makes it a bit easier, I thought I might need to do something else, let me know if I missed anything here.

@HonahX
Copy link
Contributor

HonahX commented Jan 22, 2024

@mgmarino You are correct. We need to use our own AWS accounts to run them.

@nicor88
Copy link

nicor88 commented Jan 22, 2024

@mgmarino @HonahX - I was testing this, and after the change I confirm that I can query the table in Athena (I'm still doing some deep dive on why the table is not droppable in athena), but anyhow, I have a weird behvior:
if I run this multiple times

data = [
    {"x": "Alice"},
    {"x": "Bob"}
]
df = pd.DataFrame(data)

to_append = pa.Table.from_pandas(df)

t.append(to_append)

In the final table I expect to have multiple records x=Alice, and it's not the case - I only have 2 records: 1 Alice and 1 Bob.

Then I tried to run:

data = [
    {"x": "Alice v1"},
    {"x": "Bob v1"}
]
df = pd.DataFrame(data)

to_append = pa.Table.from_pandas(df)

t.append(to_append)

as I was expecting an overwrite behaviour - but I still get the first snapshot data, so Alice and Bob again.

Checking the data folder seems that the parquet files are written, also new metadata files are popping up. To me seems an issue on the glue catalog side, as it still point to the first snapshot - and that's definitely not right to me. Am I missing something here?

This covers a few use cases, including updating the schema and reading
reading real data back from Athena.
@mgmarino
Copy link
Contributor Author

Integration tests added in 40ab6e6

Copy link
Contributor

@HonahX HonahX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the integration test!

@jackye1995 Could you please take a look at this when you have a chance? This is related to both Glue and Athena

@HonahX
Copy link
Contributor

HonahX commented Jan 23, 2024

@mgmarino @HonahX - I was testing this, and after the change I confirm that I can query the table in Athena (I'm still doing some deep dive on why the table is not droppable in athena), but anyhow, I have a weird behvior: if I run this multiple times

data = [
    {"x": "Alice"},
    {"x": "Bob"}
]
df = pd.DataFrame(data)

to_append = pa.Table.from_pandas(df)

t.append(to_append)

@nicor88. I tried the same code you posted along with this PR, and it seemed to work on my side. Each time I ran the code, a new "Alice" and a new "Bob" were appended to the table. Do you still have the same issue on your side?

@mgmarino
Copy link
Contributor Author

@nicor88. I tried the same code you posted along with this PR, and it seemed to work on my side. Each time I ran the code, a new "Alice" and a new "Bob" were appended to the table. Do you still have the same issue on your side?

I also tried this and could not reproduce the problem (i.e. I saw the additional rows).

@nicor88
Copy link

nicor88 commented Jan 23, 2024

@HonahX @mgmarino all good on my side, all worked with a new table and with the last commit 💯
great work @mgmarino !

@HonahX HonahX added this to the PyIceberg 0.6.0 release milestone Jan 24, 2024
Copy link
Contributor

@HonahX HonahX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM! Thank you very much @mgmarino! Just adding one final request to make IcebergSchemaToGlueType internal by adding _. Sorry I missed that early

I've added this to 0.6.0 milestone to ensure that Athena users can query the table written by pyiceberg after releasing write support.

}


class IcebergSchemaToGlueType(SchemaVisitor[str]):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
class IcebergSchemaToGlueType(SchemaVisitor[str]):
class _IcebergSchemaToGlueType(SchemaVisitor[str]):

ColumnTypeDef,
{
"Name": field.name,
"Type": visit(field.field_type, IcebergSchemaToGlueType()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Type": visit(field.field_type, IcebergSchemaToGlueType()),
"Type": visit(field.field_type, _IcebergSchemaToGlueType()),

@mgmarino
Copy link
Contributor Author

Overall LGTM! Thank you very much @mgmarino! Just adding one final request to make IcebergSchemaToGlueType internal by adding _. Sorry I missed that early

Sure, np!

I've added this to 0.6.0 milestone to ensure that Athena users can query the table written by pyiceberg after releasing write support.

🎉

Thanks, @HonahX and @nicor88, for the input and reviews!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

GlueCatalog: Set Glue table input information based on Iceberg table metadata

3 participants