[Variant] Use BTreeMap for VariantBuilder.dict and ObjectBuilder.fields to maintain invariants upon entry writes#7720

Merged
alamb merged 6 commits into apache:main from pydantic:friendlymatthew/avoid-sort-on-finish
Jun 24, 2025

Conversation

@friendlymatthew
Contributor

@friendlymatthew friendlymatthew commented Jun 20, 2025

Which issue does this PR close?

Rationale for this change

This commit changes the dict field in VariantBuilder and the fields field in ObjectBuilder to be BTreeMaps, and checks for existing field names in an object before appending a new field.

These collections are often used in places where an already sorted structure would be more performant. ObjectBuilder::finish() needs the fields ordered by field_name; instead of sorting there, we can rely on VariantBuilder's dict, which maintains a mapping from field_name to field_id that is sorted by field_name.

Checking whether a field name already exists in an object takes two lookups: 1) find the unique field_name_id for the field_name: &str in the dict, and 2) check whether the ObjectBuilder's fields already has a key with that field_name_id.
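
The two-lookup check can be sketched with plain standard-library maps. This is a minimal illustration rather than the crate's code: `Builder`, `has_field`, and the reduction of a field value to a byte offset are assumptions made for the example.

```rust
use std::collections::BTreeMap;

/// Stand-in for the builder state; the names here are illustrative only.
struct Builder {
    /// field name -> field id, shared across the whole variant value
    dict: BTreeMap<String, u32>,
    /// field id -> value offset, for the object currently being built
    fields: BTreeMap<u32, usize>,
}

impl Builder {
    /// Returns true if `name` is already a field of the current object.
    fn has_field(&self, name: &str) -> bool {
        // Lookup 1: resolve the name to its id in the shared dictionary.
        // Lookup 2: check whether the current object already uses that id.
        self.dict
            .get(name)
            .is_some_and(|id| self.fields.contains_key(id))
    }
}
```

A name that is in the dictionary but not in `fields` belongs to some other object, so it does not count as a duplicate here.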

We make ObjectBuilder's fields a BTreeMap sorted by field_id. Since field_ids correlate to insertion order, we now have some notion of which fields were inserted first. This also speeds up looking up the max field id: the linear scan over the entire fields collection becomes a logarithmic call to fields.keys().last().
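
As a rough illustration of the max-id lookup, the sketch below contrasts a scan over an unsorted list with reading the last key of a `BTreeMap`. The function names and the `(id, offset)` representation are hypothetical; `last_key_value` is the standard-library method used here to reach the largest key.

```rust
use std::collections::BTreeMap;

/// Linear scan over an unsorted list of (field id, offset) pairs.
fn max_id_scan(fields: &[(u32, usize)]) -> Option<u32> {
    fields.iter().map(|(id, _)| *id).max()
}

/// With a BTreeMap keyed by field id, the largest id is the last entry.
fn max_id_btree(fields: &BTreeMap<u32, usize>) -> Option<u32> {
    fields.last_key_value().map(|(id, _)| *id)
}
```

Both return `None` for an empty collection, so callers handle the empty object the same way in either version.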

@github-actions bot added the parquet label (Changes to the parquet crate) on Jun 20, 2025
Comment thread parquet-variant/src/builder.rs Outdated
Comment on lines 538 to 547
Contributor

This doesn't look right. The dict is the entire metadata dictionary, shared by all (sub-)objects in the overall variant value -- parents, siblings, children, cousins, etc. It's a superset of the field names for this specific object we're finishing.

Unfortunately, I don't know a good way to build an "indirect" map in rust that allows the custom key comparator we'd need, to lexically sort field id keys according to the string values they represent.
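
To make the desired ordering concrete: the ids of one object need to be ordered by the names they index, not by their numeric value. The sketch below does this with an explicit sort over a slice, which is the step an "indirect" map with a custom comparator would make unnecessary. `sort_ids_by_name` and the `names` table (indexed by field id) are assumptions for illustration.

```rust
/// Sorts `ids` lexically by the field name each id refers to.
/// `names[id]` is assumed to hold the name for field id `id`.
fn sort_ids_by_name(ids: &mut [u32], names: &[&str]) {
    ids.sort_unstable_by_key(|&id| names[id as usize]);
}
```

For example, with `names = ["zebra", "apple", "mango"]`, the ids `[0, 1, 2]` are reordered to `[1, 2, 0]`.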

Contributor Author

Hi, yes. I figured that since the current VariantBuilder can only build 1 object as of now, it would be ok to assume the field names of the current object map 1:1 to the field names in the dict metadata dictionary.

I pushed up 22789798797b5b42950569ef6fdb720b1a256a68, which filters by the relative field ids within the current object. I thought it would make sense to update this logic when thinking about nested lists and objects.

Contributor

I'm not sure it's really helpful to optimize a known dead end path -- we have to figure out something that works for nested arrays and objects -- but maybe that's just me.

Contributor Author

@friendlymatthew friendlymatthew Jun 21, 2025

Yeah I agree and fwiw, I included a change in this PR to remove the assumption that the object builder's field names map 1:1 to the metadata dictionary.

We now do something like:

let field_ids_by_sorted_field_name = self
    .parent
    .dict
    .iter()
    .filter_map(|(_, id)| self.fields.contains_key(id).then_some(*id))
    .collect::<Vec<_>>();

which will work with nested objects.
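
The same filter can be tried out in isolation on a dictionary shared by more than one object. This is a sketch under stated assumptions: `ids_in_name_order` is a hypothetical free function, and field values are reduced to byte offsets.

```rust
use std::collections::BTreeMap;

/// Returns the ids in `fields` ordered by field name, by walking the shared
/// dictionary (already sorted by name) and keeping only this object's ids.
fn ids_in_name_order(dict: &BTreeMap<String, u32>, fields: &BTreeMap<u32, usize>) -> Vec<u32> {
    dict.iter()
        .filter_map(|(_, id)| fields.contains_key(id).then_some(*id))
        .collect()
}
```

If the dictionary holds `b=0, outer=1, a=2, inner=3` and the current object only has ids 0 and 2, the result is `[2, 0]`, i.e. `a` before `b`. Note that the walk visits every dictionary entry, so its cost grows with the size of the whole dictionary rather than with the size of the object.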

Contributor

@alamb alamb left a comment

Seems like an improvement to me

I wonder if there are some tests we could / should write for it?

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/avoid-sort-on-finish branch from 2278979 to 63e067d Compare June 21, 2025 13:38
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/avoid-sort-on-finish branch from 63e067d to 87b1ee7 Compare June 21, 2025 13:47
@friendlymatthew
Contributor Author

> Seems like an improvement to me
>
> I wonder if there are some tests we could / should write for it?

As of now, the VariantBuilder can only build 1 object. I pushed up a test that asserts the sorting invariant as we append new fields to the object.

But once I get to nested objects and lists, I think the tests there become a lot more interesting!
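
A check of this kind can be sketched against a bare `BTreeMap`. This is not the test added in the PR: `key_orders_after_each_insert` is a hypothetical helper that records the dictionary's key order after every insertion so that the ordering can be asserted at each step.

```rust
use std::collections::BTreeMap;

/// Inserts `names` one at a time and returns the dictionary's key order
/// observed after each insertion.
fn key_orders_after_each_insert(names: &[&str]) -> Vec<Vec<String>> {
    let mut dict: BTreeMap<String, u32> = BTreeMap::new();
    let mut snapshots = Vec::new();
    for name in names {
        // Ids are handed out in insertion order; re-adding a name keeps its id.
        let next_id = dict.len() as u32;
        dict.entry(name.to_string()).or_insert(next_id);
        snapshots.push(dict.keys().cloned().collect());
    }
    snapshots
}
```

A test can then assert that every snapshot is strictly increasing, for example with `snapshot.windows(2).all(|w| w[0] < w[1])`.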

Comment thread parquet-variant/src/builder.rs Outdated
  /// Add a field with key and value to the object
- pub fn append_value<'m, 'd, T: Into<Variant<'m, 'd>>>(&mut self, key: &str, value: T) {
-     let id = self.parent.add_key(key);
+ pub fn append_value<'m, 'd, T: Into<Variant<'m, 'd>>>(
Contributor

The alternative to erroring when adding a field that already exists would be to just overwrite the existing value, which I think is more in line with other Rust collection APIs such as https://doc.rust-lang.org/std/collections/struct.HashMap.html#method.insert
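
The two behaviours under discussion can be compared side by side on a plain `BTreeMap`. This is only a sketch: `insert_overwrite` and `insert_strict` are hypothetical helpers, and returning the offending id as the error is a placeholder for a real error type.

```rust
use std::collections::btree_map::Entry;
use std::collections::BTreeMap;

/// Overwrite semantics: inserting an existing key replaces the value and
/// hands back the previous one, mirroring `HashMap::insert`.
fn insert_overwrite(fields: &mut BTreeMap<u32, usize>, id: u32, offset: usize) -> Option<usize> {
    fields.insert(id, offset)
}

/// Strict semantics: refuse to replace an existing entry.
fn insert_strict(fields: &mut BTreeMap<u32, usize>, id: u32, offset: usize) -> Result<(), u32> {
    match fields.entry(id) {
        Entry::Occupied(_) => Err(id),
        Entry::Vacant(slot) => {
            slot.insert(offset);
            Ok(())
        }
    }
}
```

With overwrite semantics the caller can still detect a duplicate from the returned `Option`, so the choice is mostly about whether a duplicate should be an error by default.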

Contributor Author

Yes, I was considering that, but chose to take the Variant spec literally. I'm happy to change the implementation, however.

Comment thread parquet-variant/src/builder.rs Outdated
fn check_duplicate_field_name(&self, key: &str) -> Result<(), ArrowError> {
    if let Some(field_name_id) = self.parent.dict.get(key) {
        if self.fields.contains_key(field_name_id) {
            return Err(ArrowError::InvalidArgumentError(
Contributor

If we are going to make this an error, I think we should at least return the name of the field in the message to make it easier to debug
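
A sketch of what including the field name could look like, kept free of the arrow crates so it stands alone. `DuplicateField` is a local stand-in for the real error type, and the message wording is an assumption.

```rust
use std::collections::BTreeMap;

/// Local stand-in for an error type; the code under review uses `ArrowError`.
#[derive(Debug, PartialEq)]
struct DuplicateField(String);

/// Fails if `key` is already a field of the current object, naming the key.
fn check_duplicate(
    dict: &BTreeMap<String, u32>,
    fields: &BTreeMap<u32, usize>,
    key: &str,
) -> Result<(), DuplicateField> {
    match dict.get(key) {
        // The offending name is carried in the message to ease debugging.
        Some(id) if fields.contains_key(id) => {
            Err(DuplicateField(format!("duplicate field name: {key}")))
        }
        _ => Ok(()),
    }
}
```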

@alamb
Contributor

alamb commented Jun 22, 2025

I am not sure about returning an error on append_value

Also, while typing this it seems like ObjectBuilder::append_value is a somewhat strange name -- maybe it would be better to be called append_field or insert or something 🤔 (for a different PR)

@friendlymatthew
Contributor Author

> I am not sure about returning an error on append_value
>
> Also, while typing this it seems like ObjectBuilder::append_value is a somewhat strange name -- maybe it would be better to be called append_field or insert or something 🤔 (for a different PR)

I got too excited and decided to add the duplicate field name check to this PR. I'm happy to roll that commit back and merge this PR as strictly a BTreeMap change, and then push up a following PR with the method name change + the duplicate field name check

@alamb
Contributor

alamb commented Jun 22, 2025

> I got too excited and decided to add the duplicate field name check to this PR. I'm happy to roll that commit back and merge this PR as strictly a BTreeMap change, and then push up a following PR with the method name change + the duplicate field name check

I think that would be a good idea -- I'll plan to merge the BTree part and then we can iterate on other things in a follow on

@alamb alamb merged commit a795030 into apache:main Jun 24, 2025
12 checks passed
@alamb
Contributor

alamb commented Jun 24, 2025

Thanks again @friendlymatthew

@alamb
Contributor

alamb commented Jun 24, 2025

I was just distracted and missed merging this one yesterday -- sorry about that
