-
Notifications
You must be signed in to change notification settings - Fork 3k
Python: Implement Support for DynamoDB Catalog #6646
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
c8e086e to
0f1983f
Compare
|
@Fokko @jackye1995 |
65fdfd9 to
9004de2
Compare
JonasJ-ap
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arminnajafi Thank you for the implementation of DynamoDB Catalog and the refactor to other catalogs.
|
@JonasJ-ap |
9004de2 to
ea453de
Compare
Fokko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @arminnajafi! First of all, sorry for my super late reply to this. I was traveling last week, and this PR needs some concentration. I also had to set up a dynamodb catalog, and got it working:
➜ arrow git:(fd-fix-drop-docstring) ✗ python3
Python 3.10.9 (main, Dec 15 2022, 17:11:09) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyiceberg.catalog import load_catalog
>>> from pyiceberg.expressions import GreaterThanOrEqual, And, LessThan
>>>
>>> catalog = load_catalog('dynamo')
>>>
>>> tbl = catalog.load_table('nyc.taxis')
>>>
>>> tbl.scan(row_filter=And(
... GreaterThanOrEqual("tpep_pickup_datetime", "2021-04-01T00:00:00.000000+00:00"),
... LessThan("tpep_pickup_datetime", "2021-05-01T00:00:00.000000+00:00")
... )).to_arrow().to_pandas()
Empty DataFrame
Columns: [VendorID, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, RatecodeID, store_and_fwd_flag, PULocationID, DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, congestion_surcharge]
Index: []
>>> tbl = catalog.load_table('nyc.taxis')
>>> tbl.scan().to_arrow().to_pandas()
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count ... tolls_amount improvement_surcharge total_amount congestion_surcharge
0 1 2020-04-01 00:41:22 2020-04-01 01:01:53 1 ... 0 0.3 6.8 0
1 1 2020-04-01 00:56:00 2020-04-01 01:09:25 1 ... 0 0.3 16.55 0
2 1 2020-04-01 00:00:26 2020-04-01 00:09:25 1 ... 0 0.3 14.8 2.5
3 1 2020-04-01 00:24:38 2020-04-01 00:34:38 0 ... 0 0.3 14.8 2.5
4 2 2020-04-01 00:13:24 2020-04-01 00:18:26 1 ... 0 0.3 13.3 2.5
... ... ... ... ... ... ... ... ... ...
237988 None 2020-04-30 19:47:00 2020-04-30 20:07:00 None ... 0 0.3 29.62 0
237989 None 2020-04-30 19:34:00 2020-04-30 19:39:00 None ... 0 0.3 5.85 0
237990 None 2020-04-30 19:58:14 2020-04-30 20:03:47 None ... 0 0.3 10.89 2.5
237991 None 2020-04-30 19:35:15 2020-04-30 19:56:51 None ... 6.12 0.3 31.84 2.5
237992 None 2020-04-30 19:27:47 2020-04-30 19:45:00 None ... 0 0.3 17.21 0
[237993 rows x 18 columns]
I've done some basic tests and will dig into the code in detail tomorrow. But it is looking great already
Fokko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few more comments, but it is looking great! Thanks @arminnajafi for working on this, and special thanks for consolidating overlapping logic between the different catalogs 👏🏻
ea453de to
306896f
Compare
|
Thanks for the review @Fokko ! I addressed most of the comments in the new commits but also asked a few questions. |
5002fa7 to
78cc8fc
Compare
Fokko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for working on this @arminnajafi
jackye1995
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good to me! Thanks for all the refactoring!
JonasJ-ap
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM! Thanks for all your work. I have a followup regarding to the changes made to the list_namespaces method
python/pyiceberg/catalog/glue.py
Outdated
| # Glue does not support hierarchical namespace, therefore return an empty list | ||
| if namespace: | ||
| tuple_identifier = Catalog.identifier_to_tuple(namespace) | ||
| if len(tuple_identifier) > 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry but I may not fully understand the change here. Why is the condition >1 instead of >0?
It seems now if I call GlueCatalog.list_namespaces("test_database") I will still get a list of databases instead of an empty list so that the CLI can switch to list_tables("test_database") .
Related discussion: #6646 (comment)
If this is intended, currently the test_list_namespaces in both integration_test_dynamodb.py and integration_test_glue.py may need some changes to reflect this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example:
If I run pyiceberg list default, now I will get all my databases:
(pyiceberg-py3.9) (base) ➜ python git:(support-ddb-catalog) pyiceberg list default
default
delta_lake_demo
delta_playgrounddb
hudi_playgrounddb
migrate_delta_to_iceberg_demo
migrated_to_iceberg_demo
mygluedatabase
mygluedatabase2
...
but previously I got all my tables in default database:
(pyiceberg-py3.9) (base) ➜ python git:(master) pyiceberg list default
default.cloudtrail_logs_aws_cloudtrail_logs_481640105715_2dd430ba
default.cloudtrail_logs_aws_cloudtrail_logs_481640105715_de0525cc
default.table100
default.test2
default.test21
default.test22
default.test23
default.test24
default.test27
default.test28
default.test29
default.test3
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Changed it to what it was before:
if namespace:
return []
| for database_name in database_list: | ||
| assert (database_name,) in db_list | ||
|
|
||
| with pytest.raises(ValidationError, match="This API is not supported for hierarchical namespaces."): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we change this test to accomodate the changes made in list_namespaces discussed :#6646 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Fixed. I completely missed checking integ tests for the previous push and GitHub actions didn't run them either.
78cc8fc to
9d7be15
Compare
|
Merging this, thanks for the awesome work @arminnajafi! And thanks @jackye1995 @JonasJ-ap for the reviews, much appreciated! |
* Implement support for DynamoDB catalog * Python: Support DynamoDB catalog apache#6541 * Fix licence * Remove BaseAwsCatalog class * Add custom DynamoDB table name * Remove __init__.py from tests/catalog/ * Apply comments from PR * Rebase and fix tests * Return empty list for hierarchical namespace * Return empty list for hierarchical namespace * Apply comments from PR * Rebase and fix circular import * Fix linter issue * Fix integ tests
Fixes #6541