Skip to content

Conversation

@arminnajafi
Copy link
Contributor

Fixes #6541

@arminnajafi arminnajafi marked this pull request as ready for review January 26, 2023 05:50
@arminnajafi
Copy link
Contributor Author

@Fokko @jackye1995
Ready for review

@arminnajafi arminnajafi force-pushed the support-ddb-catalog branch 2 times, most recently from 65fdfd9 to 9004de2 Compare January 27, 2023 09:25
Copy link
Contributor

@JonasJ-ap JonasJ-ap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arminnajafi Thank you for the implementation of DynamoDB Catalog and the refactor to other catalogs.

@Fokko Fokko added this to the Python 0.4.0 release milestone Jan 30, 2023
@arminnajafi
Copy link
Contributor Author

@JonasJ-ap
Thanks for reviewing this PR and also your Glue Catalog PR. This work is mostly based on that PR.

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @arminnajafi! First of all, sorry for my super late reply to this. I was traveling last week, and this PR needs some concentration. I also had to set up a dynamodb catalog, and got it working:

➜  arrow git:(fd-fix-drop-docstring) ✗ python3
Python 3.10.9 (main, Dec 15 2022, 17:11:09) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyiceberg.catalog import load_catalog
>>> from pyiceberg.expressions import GreaterThanOrEqual, And, LessThan
>>> 
>>> catalog = load_catalog('dynamo')
>>> 
>>> tbl = catalog.load_table('nyc.taxis')
>>> 
>>> tbl.scan(row_filter=And(
...     GreaterThanOrEqual("tpep_pickup_datetime", "2021-04-01T00:00:00.000000+00:00"),
...     LessThan("tpep_pickup_datetime", "2021-05-01T00:00:00.000000+00:00")
... )).to_arrow().to_pandas()
Empty DataFrame
Columns: [VendorID, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, RatecodeID, store_and_fwd_flag, PULocationID, DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, congestion_surcharge]
Index: []
>>> tbl = catalog.load_table('nyc.taxis')
>>> tbl.scan().to_arrow().to_pandas()
       VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count  ... tolls_amount improvement_surcharge total_amount congestion_surcharge
0             1  2020-04-01 00:41:22   2020-04-01 01:01:53               1  ...            0                   0.3          6.8                    0
1             1  2020-04-01 00:56:00   2020-04-01 01:09:25               1  ...            0                   0.3        16.55                    0
2             1  2020-04-01 00:00:26   2020-04-01 00:09:25               1  ...            0                   0.3         14.8                  2.5
3             1  2020-04-01 00:24:38   2020-04-01 00:34:38               0  ...            0                   0.3         14.8                  2.5
4             2  2020-04-01 00:13:24   2020-04-01 00:18:26               1  ...            0                   0.3         13.3                  2.5
...         ...                  ...                   ...             ...  ...          ...                   ...          ...                  ...
237988     None  2020-04-30 19:47:00   2020-04-30 20:07:00            None  ...            0                   0.3        29.62                    0
237989     None  2020-04-30 19:34:00   2020-04-30 19:39:00            None  ...            0                   0.3         5.85                    0
237990     None  2020-04-30 19:58:14   2020-04-30 20:03:47            None  ...            0                   0.3        10.89                  2.5
237991     None  2020-04-30 19:35:15   2020-04-30 19:56:51            None  ...         6.12                   0.3        31.84                  2.5
237992     None  2020-04-30 19:27:47   2020-04-30 19:45:00            None  ...            0                   0.3        17.21                    0

[237993 rows x 18 columns]

I've done some basic tests and will dig into the code in detail tomorrow. But it is looking great already

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few more comments, but it is looking great! Thanks @arminnajafi for working on this, and special thanks for consolidating overlapping logic between the different catalogs 👏🏻

@Fokko Fokko changed the title Implement Support for DynamoDB Catalog Python: Implement Support for DynamoDB Catalog Feb 7, 2023
@arminnajafi
Copy link
Contributor Author

Thanks for the review @Fokko !

I addressed most of the comments in the new commits but also asked a few questions.
Let me know whenever you get a chance to go over them.
It might be easier to discuss some of them in Slack.

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for working on this @arminnajafi

Copy link
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me! Thanks for all the refactoring!

Copy link
Contributor

@JonasJ-ap JonasJ-ap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM! Thanks for all your work. I have a followup regarding to the changes made to the list_namespaces method

# Glue does not support hierarchical namespace, therefore return an empty list
if namespace:
tuple_identifier = Catalog.identifier_to_tuple(namespace)
if len(tuple_identifier) > 1:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry but I may not fully understand the change here. Why is the condition >1 instead of >0?

It seems now if I call GlueCatalog.list_namespaces("test_database") I will still get a list of databases instead of an empty list so that the CLI can switch to list_tables("test_database") .

Related discussion: #6646 (comment)

If this is intended, currently the test_list_namespaces in both integration_test_dynamodb.py and integration_test_glue.py may need some changes to reflect this.

Copy link
Contributor

@JonasJ-ap JonasJ-ap Feb 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example:
If I run pyiceberg list default, now I will get all my databases:

(pyiceberg-py3.9) (base) ➜  python git:(support-ddb-catalog) pyiceberg list default
default                                   
delta_lake_demo                           
delta_playgrounddb                        
hudi_playgrounddb                         
migrate_delta_to_iceberg_demo             
migrated_to_iceberg_demo                  
mygluedatabase                            
mygluedatabase2                           
...

but previously I got all my tables in default database:

(pyiceberg-py3.9) (base) ➜  python git:(master) pyiceberg list default
default.cloudtrail_logs_aws_cloudtrail_logs_481640105715_2dd430ba
default.cloudtrail_logs_aws_cloudtrail_logs_481640105715_de0525cc
default.table100                                                 
default.test2                                                    
default.test21                                                   
default.test22                                                   
default.test23                                                   
default.test24                                                   
default.test27                                                   
default.test28                                                   
default.test29                                                   
default.test3     
...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Changed it to what it was before:

if namespace:
    return []

for database_name in database_list:
assert (database_name,) in db_list

with pytest.raises(ValidationError, match="This API is not supported for hierarchical namespaces."):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we change this test to accomodate the changes made in list_namespaces discussed :#6646 (comment)

Copy link
Contributor Author

@arminnajafi arminnajafi Feb 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Fixed. I completely missed checking integ tests for the previous push and GitHub actions didn't run them either.

@Fokko
Copy link
Contributor

Fokko commented Feb 15, 2023

Merging this, thanks for the awesome work @arminnajafi! And thanks @jackye1995 @JonasJ-ap for the reviews, much appreciated!

@Fokko Fokko merged commit 18d45b4 into apache:master Feb 15, 2023
@arminnajafi arminnajafi deleted the support-ddb-catalog branch February 16, 2023 20:20
krvikash pushed a commit to krvikash/iceberg that referenced this pull request Mar 16, 2023
* Implement support for DynamoDB catalog

* Python: Support DynamoDB catalog apache#6541

* Fix licence

* Remove BaseAwsCatalog class

* Add custom DynamoDB table name

* Remove __init__.py from tests/catalog/

* Apply comments from PR

* Rebase and fix tests

* Return empty list for hierarchical namespace

* Return empty list for hierarchical namespace

* Apply comments from PR

* Rebase and fix circular import

* Fix linter issue

* Fix integ tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Python: Support DynamoDB catalog

4 participants