-
Notifications
You must be signed in to change notification settings - Fork 3k
Python: Fine-tune the API #5672
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
The API wasn't consistent everywhere. Now the ids will just initialize at 1, so the user doesn't have to do this.
75d7e3a to
dd28be8
Compare
0f32ba3 to
a0806cd
Compare
python/pyiceberg/table/metadata.py
Outdated
| """A list of schemas, stored as objects with schema-id.""" | ||
|
|
||
| current_schema_id: int = Field(alias="current-schema-id", default=DEFAULT_SCHEMA_ID) | ||
| current_schema_id: int = Field(alias="current-schema-id", default=INITIAL_SCHEMA_ID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These probably shouldn't have defaults because they need to be explicitly set to some ID that exists in the list of schemas, specs, or sort orders.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that defaulting the ID when creating a new schema, spec, or order is fine. But I don't think it is a good idea to default it here. At this point, we no longer have users constructing metadata by hand and we want to make sure that we're setting the ID correctly. If we re-create a schema for a new table metadata object, then we should also set the current schema ID to that schema's ID rather than relying on the same default in two places. That way if we ever change the default assignment we don't break tables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point. I've removed this set explicitly when we get a v1 metadata.
|
|
||
| spec_id: int = Field(alias="spec-id") | ||
| fields: Tuple[PartitionField, ...] = Field(default_factory=tuple) | ||
| spec_id: int = Field(alias="spec-id", default=INITIAL_PARTITION_SPEC_ID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'd prefer to handle ID assignment manually rather than defaulting. Defaulting seems to bring in complexity because if we forget to pass along an ID somewhere, it would cause problems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel that we don't should really expose this to the user. For example, when we create a new table, we re-assign the IDs anyway (using the assign fresh IDs logic).
If we follow the Java API, and we have something similar to updateSpec: https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/Table.java#L165-L171 Then we can just take the next ID. What do you think of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds reasonable to me. I think we just need to make sure that reassignment is correct!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will definitely involve a lot of testing 👍🏻
61147bd to
8fa8d88
Compare
|
Split out the changes to the docs to #5727 |
446d94e to
6f49ae3
Compare
6f49ae3 to
0614692
Compare
|
@rdblue I've resolved the merge conflicts, would you have time for another pass? Thanks! |
| Args: | ||
| order_id (int): The id of the sort-order. To keep track of historical sorting | ||
| order_id (int): An unique id of the sort-order of a table. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need "of a table" -- that assumes the context that uses the sort order.
| location=None, | ||
| partition_spec=PartitionSpec( | ||
| spec_id=1, fields=(PartitionField(source_id=1, field_id=1000, transform=TruncateTransform(width=3), name="id"),) | ||
| PartitionField(source_id=1, field_id=1000, transform=TruncateTransform(width=3), name="id"), spec_id=1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the mock causes the result to not match the request. We should start testing against the REST catalog servlet as soon as we can.
|
Looks good. There were a couple minor things, but those aren't blockers. |
The API wasn't consistent everywhere. Now the ids will just initialize at 1, so the user doesn't have to do this.