Python: ManifestWriter and ManifestListWriter #8622

HonahX · 2023-09-23T08:16:05Z

This PR is a continuation of #8012. It implements ManifestWriter and ManifestListWriter, which are part of the Iceberg commit phase.

Based on: #7873

This PR currently includes prototypes of both writers, which are still subject to changes and improvements. I would greatly appreciate receiving some initial review and suggestions to foster the discussion around the development of the overall commit phase. Your insights and feedback would be invaluable. Thank you in advance for your kind assistance!

HonahX · 2023-09-23T08:54:46Z

python/pyiceberg/manifest.py

+        ),
+        NestedField(
+            field_id=105,
+            name="block_size_in_bytes",


I am still thinking about a better way to handle block_size_in_bytes. This field is required in V1 but should not be written in V2. However, with the current approach, we write None for this field to the V2 manifest file. I was trying to find a way to make it exist only in the V1 manifest file's schema, but it seems the only way to achieve this is to have separate DataFile class declarations for V1 and V2?

We could create two versions, where we filter out the field for V2 (and re-use the objects from V1):

v2 = StructType(*[field for field in v1.fields if field.field_id != 105])

Thanks for the suggestion. Based on my understanding, not only do we need two versions of data_file_type, but we may also need two "versions" of DataFile(Record) so that the field block_size_in_bytes is skipped when writing to the v2 manifest file. Specifically, under the current Avro write framework, when handling the v2 case we need to ensure that block_size_in_bytes does not exist in self._position_to_field_name of the Record class. I've come up with a possible solution, which I will explain below.
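To make the two points in this thread concrete, here is a minimal sketch of deriving a v2 schema from v1 by dropping field 105, and of building the position-to-name mapping from whichever schema is chosen. NestedField and StructType below are small stand-ins written for this example, not the pyiceberg classes, and the field list is a trimmed, illustrative subset; position_to_field_name is a hypothetical helper, not the Record attribute itself.

```python
from typing import Dict, NamedTuple, Tuple


class NestedField(NamedTuple):
    """Stand-in for a schema field (assumption: not the pyiceberg class)."""

    field_id: int
    name: str


class StructType:
    """Stand-in for a struct schema that keeps its fields in order."""

    def __init__(self, *fields: NestedField) -> None:
        self.fields: Tuple[NestedField, ...] = tuple(fields)


BLOCK_SIZE_FIELD_ID = 105  # field id taken from the diff above

# Trimmed, illustrative subset of the data file fields.
DATA_FILE_TYPE_V1 = StructType(
    NestedField(100, "file_path"),
    NestedField(103, "record_count"),
    NestedField(BLOCK_SIZE_FIELD_ID, "block_size_in_bytes"),
)

# v2 re-uses the v1 field objects and only filters out field 105.
DATA_FILE_TYPE_V2 = StructType(
    *[field for field in DATA_FILE_TYPE_V1.fields if field.field_id != BLOCK_SIZE_FIELD_ID]
)


def position_to_field_name(struct: StructType) -> Dict[int, str]:
    """Hypothetical helper: map each field's position to its name."""
    return {position: field.name for position, field in enumerate(struct.fields)}


v2_positions = position_to_field_name(DATA_FILE_TYPE_V2)
```

With this shape, a record built against DATA_FILE_TYPE_V2 has no position for block_size_in_bytes at all, so nothing is written for it, rather than a None.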

HonahX · 2023-09-27T08:10:59Z

python/pyiceberg/manifest.py

+@singledispatch
+def partition_field_to_data_file_partition_field(partition_field_type: IcebergType) -> PrimitiveType:
+    raise TypeError(f"Unsupported partition field type: {partition_field_type}")
+
+
+@partition_field_to_data_file_partition_field.register(LongType)
+@partition_field_to_data_file_partition_field.register(DateType)
+@partition_field_to_data_file_partition_field.register(TimeType)
+@partition_field_to_data_file_partition_field.register(TimestampType)
+@partition_field_to_data_file_partition_field.register(TimestamptzType)
+def _(partition_field_type: PrimitiveType) -> IntegerType:
+    return IntegerType()


The partition_types obtained from PartitionSpec contain the types of the fields in the table schema. Some of these differ from the actual types in data_file.partition. For example,

{
  "name": "partition",
  "type": {
    "type": "record",
    "name": "r102",
    "fields": [
      {
        "name": "tpep_pickup_datetime_day",
        "type": ["null", { "type": "int", "logicalType": "date" }],
        "default": null,
        "field-id": 1000
      }
    ]
  },
  "field-id": 102
}

Hence, we need a way to convert these to the right types for the Avro writer.
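As a rough illustration of the dispatch pattern used in the diff above, the sketch below maps a declared type to the type that is physically written, using functools.singledispatch with stacked register decorators. The type classes are plain stand-ins defined for this example, not pyiceberg's, and to_physical_type is a hypothetical name; only the date-to-int case mirrors the Avro schema shown above.

```python
from functools import singledispatch


class IcebergType:
    """Stand-in base class (assumption: not the pyiceberg type hierarchy)."""


class PrimitiveType(IcebergType):
    pass


class IntegerType(PrimitiveType):
    pass


class DateType(PrimitiveType):
    pass


class StringType(PrimitiveType):
    pass


@singledispatch
def to_physical_type(declared_type: IcebergType) -> PrimitiveType:
    # Fallback for any type without a registered conversion.
    raise TypeError(f"Unsupported partition field type: {type(declared_type).__name__}")


# register(cls) returns the decorated function unchanged, so decorators can be
# stacked to share one implementation across several types.
@to_physical_type.register(IntegerType)
@to_physical_type.register(DateType)
def _(declared_type: PrimitiveType) -> IntegerType:
    # A date is stored as an int with a "date" logical type in the Avro file.
    return IntegerType()


physical = to_physical_type(DateType())
```

Dispatch happens on the runtime class of the first argument, so a type with no registration (StringType here) falls through to the base function and raises TypeError.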

HonahX · 2023-09-27T08:12:20Z

python/pyiceberg/manifest.py

+    def __init__(self, format_version: Literal[1, 2] = 1, *data: Any, **named_data: Any) -> None:
+        super().__init__(
+            *data,
+            **{"struct": DATA_FILE_TYPE if format_version == 1 else data_file_with_partition(StructType(), 2), **named_data},


The additional format_version parameter is added to handle the v2 case, where the field block_size_in_bytes should be skipped when writing to the manifest file.

Can you move the creation of data_file_with_partition(StructType(), 2) to a constant as well? I liked how you had DATA_FILE_TYPE_V1 and DATA_FILE_TYPE_V2.

Thanks for the suggestion! I made two constants DATA_FILE_TYPE_V1 and DATA_FILE_TYPE_V2. I think in this way we can also simplify the data_file_with_partition a little bit.
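One way to picture the constructor after this change is a lookup from format version to a prebuilt constant, sketched below. DataFileSketch, the field-name tuples, and the strict rejection of unknown fields are all assumptions made for illustration; the real DataFile class and the DATA_FILE_TYPE_V1 / DATA_FILE_TYPE_V2 constants in the PR may look different.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical, trimmed field-name lists standing in for the two schema constants.
DATA_FILE_FIELDS_V1: Tuple[str, ...] = ("file_path", "record_count", "block_size_in_bytes")
DATA_FILE_FIELDS_V2: Tuple[str, ...] = tuple(
    name for name in DATA_FILE_FIELDS_V1 if name != "block_size_in_bytes"
)

_FIELDS_BY_VERSION: Dict[int, Tuple[str, ...]] = {
    1: DATA_FILE_FIELDS_V1,
    2: DATA_FILE_FIELDS_V2,
}


class DataFileSketch:
    """Illustrative record whose layout depends on the format version."""

    def __init__(self, format_version: int = 1, **named_data: Any) -> None:
        if format_version not in _FIELDS_BY_VERSION:
            raise ValueError(f"Unsupported format version: {format_version}")
        self._field_names = _FIELDS_BY_VERSION[format_version]
        # Design choice of this sketch: reject fields the chosen version lacks.
        unknown = set(named_data) - set(self._field_names)
        if unknown:
            raise TypeError(f"Fields not in v{format_version} schema: {sorted(unknown)}")
        self._data: List[Any] = [named_data.get(name) for name in self._field_names]

    def to_row(self) -> List[Any]:
        """Return the values in schema order, as a writer would consume them."""
        return list(self._data)


v1_row = DataFileSketch(1, file_path="a.parquet", record_count=3, block_size_in_bytes=64).to_row()
v2_row = DataFileSketch(2, file_path="a.parquet", record_count=3).to_row()
```

Because both layouts are built once at import time, constructing many records does not rebuild the schema on every call.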

HonahX · 2023-09-27T08:15:32Z

python/tests/test_integration_manifest.py

+
+
+@pytest.mark.integration
+def test_write_sample_manifest(table_test_all_types: Table) -> None:


I will add more integration tests here. The idea is to use a real manifest file/list as a reference so that we do not need to manually construct many data structures.

Fokko

Thanks @HonahX for working on this. This looks great, let's get this in!

HonahX added 9 commits September 9, 2023 15:14

add ManifestWriter and ManifestListWriter

462635e

Merge remote-tracking branch 'origin/master' into python/manfiest_and…

4a3dd34

…_manifest_list_writers

fix lint issue

1dba142

remove assert, fix format issue

c8f0dee

add prepare to ManifestWriter, remove TODO, fix format issue

bceb0cf

fix some nit issue, add prepare... to ensure the correctness of data …

cf69b30

…written

Merge remote-tracking branch 'origin/master' into python/manfiest_and…

c352c41

…_manifest_list_writers

fix format issue

7cb046b

fix lint issue

ac9aa75

github-actions bot added the python label Sep 23, 2023

JonasJ-ap mentioned this pull request Sep 23, 2023

Python: ManifestWriter and ManifestListWriter #8012

Closed

HonahX commented Sep 23, 2023

HonahX commented Sep 23, 2023

HonahX commented Sep 23, 2023

Fokko reviewed Sep 25, 2023

Fokko reviewed Sep 25, 2023

Fokko reviewed Sep 25, 2023

HonahX added 3 commits September 25, 2023 23:57

avoid creating too much objects, handling v1, v2 data_file_type and D…

1c11232

…ataFile class properly.

modify tests

aa91f7b

refactor the way of handling two version of DataFile record

0be22c2

Fokko mentioned this pull request Sep 27, 2023

Phase 1 - New Docs Deployment #8659

Merged

HonahX added 2 commits September 27, 2023 00:45

add integration tests, fix bugs, change PartitionSummary to a function

2cb5ed5

fix format issue

621d438

HonahX commented Sep 27, 2023

HonahX added 2 commits September 29, 2023 01:21

make data_type_v2 constants

d6d7eb8

Merge remote-tracking branch 'origin/master' into python/manfiest_and…

adedb11

…_manifest_list_writers

HonahX marked this pull request as ready for review September 29, 2023 08:50

Fokko approved these changes Sep 29, 2023

Fokko merged commit 8062aef into apache:master Sep 29, 2023
