-
Notifications
You must be signed in to change notification settings - Fork 3k
Python: ManifestWriter and ManifestListWriter #8622
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Python: ManifestWriter and ManifestListWriter #8622
Conversation
…_manifest_list_writers
…_manifest_list_writers
python/pyiceberg/manifest.py
Outdated
| ), | ||
| NestedField( | ||
| field_id=105, | ||
| name="block_size_in_bytes", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am still thinking of some better ways to handle this block_size_in_bytes. This field is required in V1 but should not write in V2. However, with the current approach, we are writing a None for this field to the v2 manifest file. I was trying to find a way to make this only exist in V1 manifest file's schema. But it seemed the only way to achieve this is to have separated DataFile class declarations for v1 and v2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could create two versions, where we filter out the field for V2 (and re-use the objects from V1):
v2 = StructType(*[field for field in v1.fields if field.field_id != 105])There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion. Based on my understanding not only do we need two version of data_file_type, but we may also need two "versions" of DataFile(Record) such that the field block_size_in_bytes is skipped when written to the v2 manifest file. Specifically, under current Avro write framework, when handling v2 case, we need to ensure that block_size_in_bytes does not exist in self._position_to_field_name of the Record class. I've come up with a possible solution which I will explain below.
| @singledispatch | ||
| def partition_field_to_data_file_partition_field(partition_field_type: IcebergType) -> PrimitiveType: | ||
| raise TypeError(f"Unsupported partition field type: {partition_field_type}") | ||
|
|
||
|
|
||
| @partition_field_to_data_file_partition_field.register(LongType) | ||
| @partition_field_to_data_file_partition_field.register(DateType) | ||
| @partition_field_to_data_file_partition_field.register(TimeType) | ||
| @partition_field_to_data_file_partition_field.register(TimestampType) | ||
| @partition_field_to_data_file_partition_field.register(TimestamptzType) | ||
| def _(partition_field_type: PrimitiveType) -> IntegerType: | ||
| return IntegerType() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The partition_types got from PartitionSpec contains the type of the field in the table schema. Some of these are different from the actual types in data_file.partition. For example,
{
"name": "partition",
"type": {
"type": "record",
"name": "r102",
"fields": [{
"name": "tpep_pickup_datetime_day",
"type": ["null", {
"type": "int",
"logicalType": "date"
}],
"default": null,
"field-id": 1000
}]
},
"field-id": 102
}Hence, we need some measure to convert these to the right types for Avro Writer
python/pyiceberg/manifest.py
Outdated
| def __init__(self, format_version: Literal[1, 2] = 1, *data: Any, **named_data: Any) -> None: | ||
| super().__init__( | ||
| *data, | ||
| **{"struct": DATA_FILE_TYPE if format_version == 1 else data_file_with_partition(StructType(), 2), **named_data}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The additional format_version is added to handle the v2 case, where the field block_size_in_bytes should be skipped when written to the manifest file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move the creation of the data_file_with_partition(StructType(), 2) to a constant as well? I liked how you had the DATA_FILE_TYPE_V1 and DATA_FILE_TYPE_V2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion! I made two constants DATA_FILE_TYPE_V1 and DATA_FILE_TYPE_V2. I think in this way we can also simplify the data_file_with_partition a little bit.
|
|
||
|
|
||
| @pytest.mark.integration | ||
| def test_write_sample_manifest(table_test_all_types: Table) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add more integration tests here. The idea here is to use real manifest file/list as a reference so that we do not need to manually construct many data structures
Fokko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @HonahX for working on this. This looks great, let's get this in!
This PR is a continuation of #8012:
implements ManifestWriter and ManifestListWriter, which are part of the iceberg commit phase.
Based on: #7873
This PR currently includes prototypes of both writers, which are still subject to changes and improvements. I would greatly appreciate receiving some initial review and suggestions to foster the discussion around the development of the overall commit phase. Your insights and feedback would be invaluable. Thank you in advance for your kind assistance!