Skip to content

Conversation

@HonahX
Copy link
Contributor

@HonahX HonahX commented Sep 23, 2023

This PR is a continuation of #8012:

implements ManifestWriter and ManifestListWriter, which are part of the iceberg commit phase.

Based on: #7873

This PR currently includes prototypes of both writers, which are still subject to changes and improvements. I would greatly appreciate receiving some initial review and suggestions to foster the discussion around the development of the overall commit phase. Your insights and feedback would be invaluable. Thank you in advance for your kind assistance!

),
NestedField(
field_id=105,
name="block_size_in_bytes",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still thinking of some better ways to handle this block_size_in_bytes. This field is required in V1 but should not write in V2. However, with the current approach, we are writing a None for this field to the v2 manifest file. I was trying to find a way to make this only exist in V1 manifest file's schema. But it seemed the only way to achieve this is to have separated DataFile class declarations for v1 and v2?

Copy link
Contributor

@Fokko Fokko Sep 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could create two versions, where we filter out the field for V2 (and re-use the objects from V1):

v2 = StructType(*[field for field in v1.fields if field.field_id != 105])

Copy link
Contributor Author

@HonahX HonahX Sep 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion. Based on my understanding not only do we need two version of data_file_type, but we may also need two "versions" of DataFile(Record) such that the field block_size_in_bytes is skipped when written to the v2 manifest file. Specifically, under current Avro write framework, when handling v2 case, we need to ensure that block_size_in_bytes does not exist in self._position_to_field_name of the Record class. I've come up with a possible solution which I will explain below.

Comment on lines +197 to +208
@singledispatch
def partition_field_to_data_file_partition_field(partition_field_type: IcebergType) -> PrimitiveType:
raise TypeError(f"Unsupported partition field type: {partition_field_type}")


@partition_field_to_data_file_partition_field.register(LongType)
@partition_field_to_data_file_partition_field.register(DateType)
@partition_field_to_data_file_partition_field.register(TimeType)
@partition_field_to_data_file_partition_field.register(TimestampType)
@partition_field_to_data_file_partition_field.register(TimestamptzType)
def _(partition_field_type: PrimitiveType) -> IntegerType:
return IntegerType()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The partition_types got from PartitionSpec contains the type of the field in the table schema. Some of these are different from the actual types in data_file.partition. For example,

{
	"name": "partition",
	"type": {
		"type": "record",
		"name": "r102",
		"fields": [{
			"name": "tpep_pickup_datetime_day",
			"type": ["null", {
				"type": "int",
				"logicalType": "date"
			}],
			"default": null,
			"field-id": 1000
		}]
	},
	"field-id": 102
}

Hence, we need some measure to convert these to the right types for Avro Writer

def __init__(self, format_version: Literal[1, 2] = 1, *data: Any, **named_data: Any) -> None:
super().__init__(
*data,
**{"struct": DATA_FILE_TYPE if format_version == 1 else data_file_with_partition(StructType(), 2), **named_data},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The additional format_version is added to handle the v2 case, where the field block_size_in_bytes should be skipped when written to the manifest file

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move the creation of the data_file_with_partition(StructType(), 2) to a constant as well? I liked how you had the DATA_FILE_TYPE_V1 and DATA_FILE_TYPE_V2

Copy link
Contributor Author

@HonahX HonahX Sep 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion! I made two constants DATA_FILE_TYPE_V1 and DATA_FILE_TYPE_V2. I think in this way we can also simplify the data_file_with_partition a little bit.



@pytest.mark.integration
def test_write_sample_manifest(table_test_all_types: Table) -> None:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add more integration tests here. The idea here is to use real manifest file/list as a reference so that we do not need to manually construct many data structures

@HonahX HonahX marked this pull request as ready for review September 29, 2023 08:50
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @HonahX for working on this. This looks great, let's get this in!

@Fokko Fokko merged commit 8062aef into apache:master Sep 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants