Parsing of partition-spec JSON from Avro manifest files is not to spec, causing deserialization to fail on files written by pyiceberg #419

@sdd

Currently, when we deserialize an Avro manifest, we build the PartitionSpec in three steps: deserialize partition-spec into a Vec<PartitionField>, parse partition-spec-id as an integer (defaulting to 0 when it is absent), and then combine the two into a PartitionSpec:

let partition_spec = {
    let fields = {
        let bs = meta.get("partition-spec").ok_or_else(|| {
            Error::new(
                ErrorKind::DataInvalid,
                "partition-spec is required in manifest metadata but not found",
            )
        })?;
        serde_json::from_slice::<Vec<PartitionField>>(bs).map_err(|err| {
            Error::new(
                ErrorKind::DataInvalid,
                "Fail to parse partition spec in manifest metadata",
            )
            .with_source(err)
        })?
    };
    let spec_id = meta
        .get("partition-spec-id")
        .map(|bs| {
            String::from_utf8_lossy(bs).parse().map_err(|err| {
                Error::new(
                    ErrorKind::DataInvalid,
                    "Fail to parse partition spec id in manifest metadata",
                )
                .with_source(err)
            })
        })
        .transpose()?
        .unwrap_or(0);
    PartitionSpec { spec_id, fields }
};

However, the Iceberg spec expects partition-spec to be encoded as an object, as described here: https://iceberg.apache.org/spec/?h=avro#partition-specs

In reality, we need to deserialize partition-spec directly into a PartitionSpec (the struct declared as pub struct PartitionSpec), not a Vec<PartitionField>.

I can submit a PR to fix this. However, to keep reading existing files, we might want to:

  • First try to parse partition-spec into a PartitionSpec as per the spec,
  • If that fails, fall back to the current behaviour
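
The two steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: parse_with_fallback, ParsedSpec, parse_object_form, parse_field_list and parse_partition_spec are hypothetical names, and the two stand-in parsers only look at the first non-whitespace byte so that the sketch needs nothing beyond the standard library. In the crate they would presumably be serde_json::from_slice::<PartitionSpec> and serde_json::from_slice::<Vec<PartitionField>>, assuming PartitionSpec implements Deserialize.

```rust
/// Which path produced the spec (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
enum ParsedSpec {
    /// Parsed from the object form; the spec id would come from the JSON itself.
    FromObject,
    /// Parsed from a bare field list; the spec id comes from `partition-spec-id`.
    FromFieldList { spec_id: i32 },
}

/// Try `primary` first; if it fails, try `fallback` on the same bytes.
/// When both fail, the error from `primary` is the one returned.
fn parse_with_fallback<T, E>(
    bs: &[u8],
    primary: impl FnOnce(&[u8]) -> Result<T, E>,
    fallback: impl FnOnce(&[u8]) -> Result<T, E>,
) -> Result<T, E> {
    primary(bs).or_else(|primary_err| fallback(bs).map_err(|_| primary_err))
}

/// First byte that is not ASCII whitespace, if any.
fn first_non_ws(bs: &[u8]) -> Option<u8> {
    bs.iter().copied().find(|b| !b.is_ascii_whitespace())
}

/// Stand-in for deserializing the object form (assumption: not real JSON parsing).
fn parse_object_form(bs: &[u8]) -> Result<ParsedSpec, String> {
    match first_non_ws(bs) {
        Some(b'{') => Ok(ParsedSpec::FromObject),
        _ => Err("partition-spec is not a JSON object".to_string()),
    }
}

/// Stand-in for deserializing a bare list of fields (the current behaviour).
fn parse_field_list(bs: &[u8], spec_id: i32) -> Result<ParsedSpec, String> {
    match first_non_ws(bs) {
        Some(b'[') => Ok(ParsedSpec::FromFieldList { spec_id }),
        _ => Err("partition-spec is not a JSON array".to_string()),
    }
}

/// Object form first, then the field list combined with `partition-spec-id`.
fn parse_partition_spec(bs: &[u8], spec_id: i32) -> Result<ParsedSpec, String> {
    parse_with_fallback(bs, parse_object_form, |b| parse_field_list(b, spec_id))
}
```

With this shape, a value such as {"spec-id": 1, "fields": []} takes the first path and a bare [] takes the second. Returning the first error when both attempts fail is a design choice in this sketch, not something the issue prescribes.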

How should we do this?
