Python: Fine-tune the API #5672

Fokko · 2022-08-30T12:03:58Z

The API wasn't consistent everywhere. The IDs now initialize at 1 by default, so the user doesn't have to set them explicitly.

Fokko · 2022-08-30T19:58:21Z

~~Waiting for #5627~~


rdblue · 2022-09-02T22:05:12Z

python/pyiceberg/table/metadata.py

    """A list of schemas, stored as objects with schema-id."""

-    current_schema_id: int = Field(alias="current-schema-id", default=DEFAULT_SCHEMA_ID)
+    current_schema_id: int = Field(alias="current-schema-id", default=INITIAL_SCHEMA_ID)


These probably shouldn't have defaults because they need to be explicitly set to some ID that exists in the list of schemas, specs, or sort orders.

#5672 (comment)

I think that defaulting the ID when creating a new schema, spec, or order is fine. But I don't think it is a good idea to default it here. At this point, we no longer have users constructing metadata by hand and we want to make sure that we're setting the ID correctly. If we re-create a schema for a new table metadata object, then we should also set the current schema ID to that schema's ID rather than relying on the same default in two places. That way if we ever change the default assignment we don't break tables.

Fair point. I've removed the default; the ID is now set explicitly when we get v1 metadata.
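A minimal sketch of the point made in this thread, using hypothetical stand-in dataclasses (`Schema`, `TableMetadata`, and `new_metadata_for` are illustrative names, not the actual pyiceberg models): the current schema ID has no default, must match an entry in the list of schemas, and is derived from the schema itself when metadata is re-created.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Schema:
    """Hypothetical stand-in for a schema; only the ID matters here."""

    schema_id: int


@dataclass(frozen=True)
class TableMetadata:
    """Hypothetical metadata object with no default for current_schema_id."""

    schemas: Tuple[Schema, ...]
    current_schema_id: int  # required: must point at an entry in `schemas`

    def __post_init__(self) -> None:
        # Reject an ID that does not exist in the list of schemas.
        known_ids = {schema.schema_id for schema in self.schemas}
        if self.current_schema_id not in known_ids:
            raise ValueError(
                f"current-schema-id {self.current_schema_id} "
                f"not found in schemas {sorted(known_ids)}"
            )


def new_metadata_for(schema: Schema) -> TableMetadata:
    """Take the current schema ID from the schema instead of a second default."""
    return TableMetadata(schemas=(schema,), current_schema_id=schema.schema_id)
```

With this shape, changing how new schemas get their default ID cannot silently leave the metadata pointing at a schema that is not in the list.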

rdblue · 2022-09-02T22:06:56Z

python/pyiceberg/table/partitioning.py


-    spec_id: int = Field(alias="spec-id")
-    fields: Tuple[PartitionField, ...] = Field(default_factory=tuple)
+    spec_id: int = Field(alias="spec-id", default=INITIAL_PARTITION_SPEC_ID)


I think I'd prefer to handle ID assignment manually rather than defaulting. Defaulting seems to bring in complexity because if we forget to pass along an ID somewhere, it would cause problems.

I feel that we shouldn't really expose this to the user. For example, when we create a new table, we re-assign the IDs anyway (using the assign-fresh-IDs logic).
If we follow the Java API and have something similar to updateSpec (https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/Table.java#L165-L171), then we can just take the next ID. What do you think of this?

That sounds reasonable to me. I think we just need to make sure that reassignment is correct!

This will definitely involve a lot of testing 👍🏻
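One possible reading of "just take the next ID", sketched under assumptions: `next_spec_id` is a hypothetical helper, not an existing pyiceberg or Java function, and the starting ID is passed in rather than assumed.

```python
from typing import Iterable


def next_spec_id(existing_spec_ids: Iterable[int], initial_id: int) -> int:
    """Return one more than the highest existing ID, or initial_id if there are none.

    Hypothetical helper for illustration; the real assignment logic may differ,
    for example by reusing the ID of an equivalent existing spec.
    """
    ids = list(existing_spec_ids)
    return max(ids) + 1 if ids else initial_id
```

Keeping the assignment in one place like this means callers never pass an ID by hand, which is the failure mode raised earlier in the thread.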


Fokko · 2022-09-08T12:06:46Z

Split out the changes to the docs to #5727


Fokko · 2022-09-19T13:59:42Z

@rdblue I've resolved the merge conflicts, would you have time for another pass? Thanks!

rdblue · 2022-09-20T01:56:27Z

python/pyiceberg/table/sorting.py


    Args:
-      order_id (int): The id of the sort-order. To keep track of historical sorting
+      order_id (int): An unique id of the sort-order of a table.


I don't think we need "of a table" -- that assumes the context that uses the sort order.

rdblue · 2022-09-20T02:03:57Z

python/tests/catalog/test_rest.py

        location=None,
        partition_spec=PartitionSpec(
-            spec_id=1, fields=(PartitionField(source_id=1, field_id=1000, transform=TruncateTransform(width=3), name="id"),)
+            PartitionField(source_id=1, field_id=1000, transform=TruncateTransform(width=3), name="id"), spec_id=1


Looks like the mock causes the result to not match the request. We should start testing against the REST catalog servlet as soon as we can.
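For reference, the call shape in the diff above (fields passed positionally, `spec_id` passed by keyword) can be sketched with simplified stand-in classes. These are hypothetical and omit the transform and the real model machinery; they only illustrate the constructor signature.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PartitionField:
    """Simplified stand-in: the real field also carries a transform."""

    source_id: int
    field_id: int
    name: str


class PartitionSpec:
    """Simplified stand-in taking fields as varargs and spec_id as keyword-only."""

    def __init__(self, *fields: PartitionField, spec_id: int) -> None:
        self.fields: Tuple[PartitionField, ...] = fields
        self.spec_id = spec_id


# Usage mirroring the updated test: fields first, then spec_id by keyword.
spec = PartitionSpec(PartitionField(source_id=1, field_id=1000, name="id"), spec_id=1)
```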

rdblue · 2022-09-20T02:11:21Z

Looks good. There were a couple minor things, but those aren't blockers.

github-actions bot added docs python labels Aug 30, 2022

Fokko marked this pull request as draft August 30, 2022 19:58

Fokko added 2 commits September 1, 2022 16:59

Python: Update docs and fine-tune the API

83ae7aa

The API wasn't consistent everywhere. Now the ids will just initialize at 1, so the user doesn't have to do this.

Cleanup

dd28be8

Fokko force-pushed the fd-update-python-docs branch from 75d7e3a to dd28be8 Compare September 1, 2022 20:02

Fokko marked this pull request as ready for review September 1, 2022 20:03