Skip to content

[Variant] Improve write API in Variant::Object#7741

Merged
alamb merged 7 commits intoapache:mainfrom
pydantic:friendlymatthew/duplicate
Jun 25, 2025
Merged

[Variant] Improve write API in Variant::Object#7741
alamb merged 7 commits intoapache:mainfrom
pydantic:friendlymatthew/duplicate

Conversation

@friendlymatthew
Copy link
Copy Markdown
Contributor

@friendlymatthew friendlymatthew commented Jun 23, 2025

Which issue does this PR close?

Rationale for this change

This commit changes the function name ObjectBuilder::append_value to ObjectBuilder::insert.

Right now, calling insert() with a duplicate key results in two fields with the same key in the object, which deviates from the Variant spec. This PR updates the logic such that the second insert() with a duplicate key will update the value. The old value still exists in the backing buffer, but is unreferenced. One side effect from this approach is a larger variant size.

The Parquet Variant spec states:

Field names are case-sensitive. Field names are required to be unique for each object. It is an error for an object to contain two fields with the same name, whether or not they have distinct dictionary IDs.

@github-actions github-actions Bot added the parquet Changes to the parquet crate label Jun 23, 2025
@friendlymatthew friendlymatthew changed the title [Variant] Error when inserting duplicate field name in Variant::Object [Variant] Improve write API in Variant::Object Jun 24, 2025
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/duplicate branch from e750ecc to 8b7abc9 Compare June 24, 2025 19:26
@friendlymatthew
Copy link
Copy Markdown
Contributor Author

friendlymatthew commented Jun 24, 2025

Hi, I wanted to get people's thoughts on the current ObjectBuilder::insert API design.

Right now, calling insert() with a duplicate key results in two fields with the same key in the object, which deviates from the Variant spec. We're considering changing this behavior to either return an Err on the second insert() or to update the existing value-- similar to how std::HashMap::insert works.

From a user standpoint, I'd prefer the latter approach. However, it's a relatively expensive operation. Since each insert() encodes the value directly into the backing buffer, updating a key would require rewriting not just the value for that key, but also everything that comes after it in the buffer.

Worst-case, we'd rewrite our entire fields: BTreeMap<u32, usize> since all of our value's start offsets would have differed.

cc/ @alamb @scovich

@alamb
Copy link
Copy Markdown
Contributor

alamb commented Jun 24, 2025

Right now, calling insert() with a duplicate key results in two fields with the same key in the object, which deviates from the Variant spec. We're considering changing this behavior to either return an Err on the second insert() or to update the existing value-- similar to how std::HashMap::insert works.

From a user standpoint, I'd prefer the latter approach. However, it's a relatively expensive operation. Since each insert() encodes the value directly into the backing buffer, updating a key would require rewriting not just the value for that key, but also everything that comes after it in the buffer.

I also agree updating the existing value is preferable.

My reading of the Variant spec didn't require all bytes in the variant's value to be used

So what i am saying is I think it would be correct for the VariantBuilder to just update the key and leave the old value there (but not referenced) 🤔 That would result in a larger final variant, but I think as long as we documented this behavior it would be ok from the user perspective (I am envisioning many different possible desired optimizations for variant creation)

Comment thread parquet-variant/src/builder.rs Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is much nicer

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/duplicate branch from cb89527 to 1de0779 Compare June 24, 2025 20:26
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/duplicate branch from 45655af to 057d736 Compare June 24, 2025 20:31
@friendlymatthew
Copy link
Copy Markdown
Contributor Author

So what i am saying is I think it would be correct for the VariantBuilder to just update the key and leave the old value there (but not referenced) 🤔 That would result in a larger final variant, but I think as long as we documented this behavior it would be ok from the user perspective (I am envisioning many different possible desired optimizations for variant creation)

I wonder if we can add a gc() method to VariantObject. Similar to a StringView, users can call VariantObject::gc to undergo this expensive operation of pruning stale references, etc...

@friendlymatthew friendlymatthew marked this pull request as ready for review June 24, 2025 20:37
@scovich
Copy link
Copy Markdown
Contributor

scovich commented Jun 24, 2025

Right now, calling insert() with a duplicate key results in two fields with the same key in the object, which deviates from the Variant spec. We're considering changing this behavior to either return an Err on the second insert() or to update the existing value-- similar to how std::HashMap::insert works.
From a user standpoint, I'd prefer the latter approach. However, it's a relatively expensive operation. Since each insert() encodes the value directly into the backing buffer, updating a key would require rewriting not just the value for that key, but also everything that comes after it in the buffer.

I also agree updating the existing value is preferable.

My reading of the Variant spec didn't require all bytes in the variant's value to be used

So what i am saying is I think it would be correct for the VariantBuilder to just update the key and leave the old value there (but not referenced) 🤔 That would result in a larger final variant, but I think as long as we documented this behavior it would be ok from the user perspective (I am envisioning many different possible desired optimizations for variant creation)

The above would certainly work, in the sense of producing a valid variant object. My only concern would be that the scenario almost certainly arises due to user error (which is quite different from a generic map or set), and silently tolerating that error isn't necessarily doing the user any favors in the long run. They'll just discover at read time that they lost data, instead of fast-failing at write time. We can probably get away with either approach -- silently replacing or loudly complaining -- I just want to be sure we make the choice intentionally.

Comment thread parquet-variant/src/builder.rs
@alamb
Copy link
Copy Markdown
Contributor

alamb commented Jun 24, 2025

I wonder if we can add a gc() method to VariantObject. Similar to a StringView, users can call VariantObject::gc to undergo this expensive operation of pruning stale references, etc...

I think maybe we could just show how to do this with an example (rather than having to make a function) prior to anyone actually encountering the error. I think it could be done by recursively rewriting the entire variant

The above would certainly work, in the sense of producing a valid variant object. My only concern would be that the scenario almost certainly arises due to user error (which is quite different from a generic map or set), and silently tolerating that error isn't necessarily doing the user any favors in the long run. They'll just discover at read time that they lost data, instead of fast-failing at write time. We can probably get away with either approach -- silently replacing or loudly complaining -- I just want to be sure we make the choice intentionally.

Maybe we could have some flag that controls the validation behavior? Something like

let mut builder = VariantBuilder::new();
let mut obj = builder.new_object()
  // specify that an error should be thrown on repeated fields
  .with_validate_unique_fields()
...
obj.finish()?; // this throws error if there were repeated fields

That way people could check for errors programatically if they wanted to and could disable the checking if they didn't care 🤔

This is all for a follow on PR I think

@alamb
Copy link
Copy Markdown
Contributor

alamb commented Jun 25, 2025

@friendlymatthew -- I think this PR has a logical merge conflct now

Here is a proposed fix:

@alamb
Copy link
Copy Markdown
Contributor

alamb commented Jun 25, 2025

close/reopen to rerun CI

@alamb alamb closed this Jun 25, 2025
@alamb alamb reopened this Jun 25, 2025
Fix logical conflict in Variant write API PR
@friendlymatthew
Copy link
Copy Markdown
Contributor Author

I wonder if we can add a gc() method to VariantObject. Similar to a StringView, users can call VariantObject::gc to undergo this expensive operation of pruning stale references, etc...

I think maybe we could just show how to do this with an example (rather than having to make a function) prior to anyone actually encountering the error. I think it could be done by recursively rewriting the entire variant

The above would certainly work, in the sense of producing a valid variant object. My only concern would be that the scenario almost certainly arises due to user error (which is quite different from a generic map or set), and silently tolerating that error isn't necessarily doing the user any favors in the long run. They'll just discover at read time that they lost data, instead of fast-failing at write time. We can probably get away with either approach -- silently replacing or loudly complaining -- I just want to be sure we make the choice intentionally.

Maybe we could have some flag that controls the validation behavior? Something like

let mut builder = VariantBuilder::new();
let mut obj = builder.new_object()
  // specify that an error should be thrown on repeated fields
  .with_validate_unique_fields()
...
obj.finish()?; // this throws error if there were repeated fields

That way people could check for errors programatically if they wanted to and could disable the checking if they didn't care 🤔

This is all for a follow on PR I think

Opened #7777

@friendlymatthew
Copy link
Copy Markdown
Contributor Author

@alamb this has been rebased onto main

@alamb alamb merged commit 340c7dc into apache:main Jun 25, 2025
12 checks passed
@alamb
Copy link
Copy Markdown
Contributor

alamb commented Jun 25, 2025

Thanks @friendlymatthew and @scovich

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Variant] Variant::Object can contain two fields with the same field name

3 participants