AvroWriter Issue: Incorrect iceberg_to_avro Schema Conversion for Decimal, Fixed, and UUID #14

@HonahX

Apache Iceberg version

main (development)

Please describe the bug 🐞

DecimalType

Currently, if we use AvroOutputFile to write decimal values to an Avro file, the resulting file sometimes cannot be read by other Avro readers such as fastavro.

The cause is that when encoding and writing a decimal value, we follow the Iceberg specification and write it as a fixed. However, in the Avro file's schema, the current implementation declares it as variable-length bytes:

    def visit_decimal(self, decimal_type: DecimalType) -> AvroType:
        return {"type": "bytes", "logicalType": "decimal", "precision": decimal_type.precision, "scale": decimal_type.scale}

I think this should be changed to:

    def visit_decimal(self, decimal_type: DecimalType) -> AvroType:
        return {
            "type": "fixed",
            "size": decimal_required_bytes(decimal_type.precision),
            "logicalType": "decimal",
            "precision": decimal_type.precision,
            "scale": decimal_type.scale,
            "name": f"decimal_{decimal_type.precision}_{decimal_type.scale}",
        }

This way, other Avro readers can correctly interpret the encoded value as fixed-length bytes instead of trying to read a length prefix.

I think this is also the root cause of the failure I observed when using ManifestWriter to write manifest entries for a table partitioned by a DecimalType column. I ran some local tests and verified that the above change fixes this issue.
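To illustrate the mismatch, here is a minimal standard-library sketch. It is not the pyiceberg code: `required_bytes` is a hypothetical stand-in for `decimal_required_bytes`, and `read_avro_long` is a hand-written decoder for Avro's zigzag varint. It encodes a decimal as fixed-length two's-complement bytes, then shows what a reader sees if the schema claims `bytes` and it tries to read a length prefix first.

```python
import io
from decimal import Decimal


def required_bytes(precision: int) -> int:
    """Hypothetical stand-in for decimal_required_bytes: the smallest signed
    two's-complement width (in bytes) that can hold 10**precision - 1."""
    max_unscaled = 10**precision - 1
    size = 1
    while max_unscaled >= 1 << (8 * size - 1):
        size += 1
    return size


def encode_fixed_decimal(value: Decimal, precision: int, scale: int) -> bytes:
    """Encode the unscaled value as big-endian two's complement, with no length prefix."""
    unscaled = int(value.scaleb(scale))
    return unscaled.to_bytes(required_bytes(precision), "big", signed=True)


def read_avro_long(stream: io.BytesIO) -> int:
    """Decode an Avro long (zigzag varint), which is how a `bytes` length is stored."""
    shift = 0
    result = 0
    while True:
        byte = stream.read(1)[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)


payload = encode_fixed_decimal(Decimal("123.45"), precision=9, scale=2)

# A reader told the field is `fixed` with size 4 consumes exactly these bytes.
as_fixed = int.from_bytes(payload, "big", signed=True)

# A reader told the field is `bytes` interprets the leading byte(s) as a length.
bogus_length = read_avro_long(io.BytesIO(payload))
```

In this case the `bytes` reader decodes a length from the first byte of the value itself, so it consumes the wrong number of bytes and every field after it is misaligned.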

FixedType and UUIDType

For Fixed and UUID, I think the current conversion misses the required name field: https://avro.apache.org/docs/1.11.1/specification/#fixed

    def visit_fixed(self, fixed_type: FixedType) -> AvroType:
        return {"type": "fixed", "size": len(fixed_type)}

    def visit_uuid(self, uuid_type: UUIDType) -> AvroType:
        return {"type": "fixed", "size": "16", "logicalType": "uuid"}

fastavro will complain:

fastavro._schema_common.SchemaParseException: "name" is a required field missing from the schema: {'type': 'fixed', 'size': 16}
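A possible shape for the corrected conversions is sketched below as standalone functions. The names `fixed_{size}` and `uuid_fixed` are my own placeholders rather than anything decided, and `check_fixed` is a hypothetical helper that only mimics the two checks relevant here; it is not fastavro's validator. The sketch also writes `size` as an integer, which is what the Avro specification requires for fixed.

```python
from typing import Any, Dict

AvroType = Dict[str, Any]


def fixed_schema(size: int) -> AvroType:
    """Avro fixed schema with the required name (placeholder naming scheme)."""
    return {"type": "fixed", "size": size, "name": f"fixed_{size}"}


def uuid_schema() -> AvroType:
    """UUID as a named 16-byte fixed; size is an int, not the string "16"."""
    return {"type": "fixed", "size": 16, "logicalType": "uuid", "name": "uuid_fixed"}


def check_fixed(schema: AvroType) -> None:
    """Hypothetical check: a fixed schema needs a name and an integer size."""
    if schema.get("type") != "fixed":
        return
    if "name" not in schema:
        raise ValueError(f'"name" is a required field missing from the schema: {schema}')
    if not isinstance(schema.get("size"), int):
        raise ValueError(f'"size" must be an integer in the schema: {schema}')


check_fixed(fixed_schema(8))
check_fixed(uuid_schema())
```

With fastavro installed, the same dictionaries could instead be passed to `fastavro.parse_schema` to confirm that the parse error goes away.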

Once we fix these, I think the ManifestWriter should work with all types of partition values.
