Apache Iceberg version
main (development)
Please describe the bug 🐞
DecimalType
Currently, if we use AvroOutputFile to write decimal values to an Avro file, the resulting file sometimes cannot be read by other Avro readers such as fastavro.
The cause is that when encoding and writing a decimal value, we follow the Iceberg specification and write it as a fixed. However, in the Avro file's schema, the current implementation declares it as variable-length binary:
def visit_decimal(self, decimal_type: DecimalType) -> AvroType:
    return {"type": "bytes", "logicalType": "decimal", "precision": decimal_type.precision, "scale": decimal_type.scale}
I think this should be changed to
def visit_decimal(self, decimal_type: DecimalType) -> AvroType:
    return {
        "type": "fixed",
        "size": decimal_required_bytes(decimal_type.precision),
        "logicalType": "decimal",
        "precision": decimal_type.precision,
        "scale": decimal_type.scale,
        "name": f"decimal_{decimal_type.precision}_{decimal_type.scale}",
    }
With this change, other Avro readers can correctly interpret the encoded value as fixed-length bytes instead of trying to read a length prefix first.
I think this is also the root cause of the failure I observed when using ManifestWriter to write manifest entries for a table partitioned by a DecimalType column. I ran some local tests and verified that the above change fixes the issue.
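To make the mismatch concrete, here is a minimal, stdlib-only sketch of what a reader sees when the data is written as a fixed but the schema says "bytes". The helper names (`required_bytes`, `encode_fixed_decimal`, `read_zigzag_long`) are hypothetical and are not pyiceberg or fastavro APIs; `required_bytes` is only a stand-in for what `decimal_required_bytes` is assumed to compute.

```python
from decimal import Decimal
from typing import Tuple


def required_bytes(precision: int) -> int:
    """Smallest byte count whose signed range holds any value of this precision."""
    size = 1
    while 10**precision - 1 > 2 ** (8 * size - 1) - 1:
        size += 1
    return size


def encode_fixed_decimal(value: Decimal, precision: int, scale: int) -> bytes:
    """Encode the unscaled value as big-endian two's complement of fixed width."""
    unscaled = int(value.scaleb(scale))
    return unscaled.to_bytes(required_bytes(precision), "big", signed=True)


def read_zigzag_long(buf: bytes) -> Tuple[int, int]:
    """Decode an Avro zigzag varint long; return (value, bytes consumed)."""
    raw = 0
    shift = 0
    for pos, byte in enumerate(buf):
        raw |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of the varint
            return (raw >> 1) ^ -(raw & 1), pos + 1
        shift += 7
    raise ValueError("truncated varint")


# The writer emits 4 raw bytes for decimal(9, 2), with no length prefix.
payload = encode_fixed_decimal(Decimal("123.45"), precision=9, scale=2)

# A reader following a "bytes" schema decodes a length prefix first,
# so it consumes the first data byte as if it were the length.
misread_length, consumed = read_zigzag_long(payload)
```

In this case the reader takes the leading `0x00` as a length of 0, returns an empty value, and leaves the remaining three bytes to be misinterpreted as the next field.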
FixedType and UUIDType
For Fixed and UUID, I think the current conversion is missing the required name field: https://avro.apache.org/docs/1.11.1/specification/#fixed
def visit_fixed(self, fixed_type: FixedType) -> AvroType:
    return {"type": "fixed", "size": len(fixed_type)}

def visit_uuid(self, uuid_type: UUIDType) -> AvroType:
    return {"type": "fixed", "size": "16", "logicalType": "uuid"}
fastavro then complains:
fastavro._schema_common.SchemaParseException: "name" is a required field missing from the schema: {'type': 'fixed', 'size': 16}
Once we fix these, I think ManifestWriter should work with all types of partition values.
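One possible shape for the named schemas is sketched below. The functions are standalone illustrations, not the actual visitor methods, and the generated names (`fixed_<size>`, `uuid_fixed`) are assumptions; any valid Avro name would do. The sketch also uses an integer `size`, since the Avro specification defines `size` as an integer.

```python
from typing import Any, Dict, List


def fixed_schema(size: int) -> Dict[str, Any]:
    # Hypothetical naming scheme: derive the required name from the size.
    return {"type": "fixed", "size": size, "name": f"fixed_{size}"}


def uuid_schema() -> Dict[str, Any]:
    # Hypothetical name; a UUID is always 16 bytes.
    return {"type": "fixed", "size": 16, "logicalType": "uuid", "name": "uuid_fixed"}


def missing_fixed_fields(schema: Dict[str, Any]) -> List[str]:
    """List the attributes the Avro spec requires on a fixed that are absent."""
    return [key for key in ("name", "size") if key not in schema]


# The schema produced by the current conversion lacks "name".
before = missing_fixed_fields({"type": "fixed", "size": 16})
# The sketched schemas carry both required attributes.
after = missing_fixed_fields(fixed_schema(16))
```

Here `before` reports the missing `name`, the same gap fastavro rejects, while `after` is empty.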