[Variant] Use BTreeMap for VariantBuilder.dict and ObjectBuilder.fields to maintain invariants upon entry writes#7720
Conversation
There was a problem hiding this comment.
This doesn't look right. The dict is the entire metadata dictionary, shared by all (sub-)objects in the overall variant value -- parents, siblings, children, cousins, etc. It's a superset of the field names for this specific object we're finishing.
Unfortunately, I don't know a good way to build an "indirect" map in rust that allows the custom key comparator we'd need, to lexically sort field id keys according to the string values they represent.
There was a problem hiding this comment.
Hi, yes. I figured since the current VariantBuilder can only build 1 object as of now, it would be ok to assume the field names of the current object maps 1:1 with the field names in the dict metadata dictionary.
I pushed up 22789798797b5b42950569ef6fdb720b1a256a68, which filters by the relative field ids within the current object. I thought it would make sense to update this logic when thinking about nested lists and objects.
There was a problem hiding this comment.
I'm not sure it's really helpful to optimize a known dead end path -- we have to figure out something that works for nested arrays and objects -- but maybe that's just me.
There was a problem hiding this comment.
Yeah I agree and fwiw, I included a change in this PR to remove this 1:1 object builder field names to metadata dictionary assumption.
We now do something like:
let field_ids_by_sorted_field_name = self
.parent
.dict
.iter()
.filter_map(|(_, id)| self.fields.contains_key(id).then_some(*id))
.collect::<Vec<_>>();which will work with nested objects
alamb
left a comment
There was a problem hiding this comment.
Seems like an improvement to me
I wonder if there are some tests we could / should write for it?
2278979 to
63e067d
Compare
63e067d to
87b1ee7
Compare
As of now, the But once I get to nested objects and lists, I think the tests there become a lot more interesting! |
| /// Add a field with key and value to the object | ||
| pub fn append_value<'m, 'd, T: Into<Variant<'m, 'd>>>(&mut self, key: &str, value: T) { | ||
| let id = self.parent.add_key(key); | ||
| pub fn append_value<'m, 'd, T: Into<Variant<'m, 'd>>>( |
There was a problem hiding this comment.
The other alternate to erroring on adding a new field would be to just overwrite the existing value, which I think is more inline with other Rust collection apis such as https://doc.rust-lang.org/std/collections/struct.HashMap.html#method.insert
There was a problem hiding this comment.
Yes, I was considering that, but chose to take the Variant spec literally. I'm happy to change the implementation however
| fn check_duplicate_field_name(&self, key: &str) -> Result<(), ArrowError> { | ||
| if let Some(field_name_id) = self.parent.dict.get(key) { | ||
| if self.fields.contains_key(field_name_id) { | ||
| return Err(ArrowError::InvalidArgumentError( |
There was a problem hiding this comment.
If we are going to make this an error, I think we should at least return the name of the field in the message to make it easier to debug
|
I am not sure about returning an error on Also, while typing this it seems like |
I got too excited and decided to add the duplicate field name check to this PR. I'm happy to roll that commit back and merge this PR as strictly a |
I think that would be a good idea -- I'll plan to merge the BTree part and then we can iterate on other things in a follow on |
… object" This reverts commit e8af7ff.
|
Thanks again @friendlymatthew |
|
I was just distracted and missed merging this one yesterday -- sorry about that |
Which issue does this PR close?
Rationale for this change
This commit changes the
dictfield inVariantBuilder+ thefieldsfield inObjectBuilderto beBTreeMaps, and checks for existing field names in a object before appending a new field.These collections are often used in places where having an already sorted structure would be more performant. Inside of
ObjectBuilder::finish(), we sort the fields byfield_nameand we can use the fact thatVariantBuilder'sdictmaintains a sorted mapping tofield_idbyfield_name.To check whether an existing field name exists in a object, it is simply two lookups: 1) to find the
field_name: &str's uniquefield_name_id, and 2) check if theObjectBuilderfieldsalready has a key with thatfield_name_id.We make
ObjectBuilderfieldsaBTreeMapsorted byfield_id. Sincefield_ids correlate to insertion order, we now have some notion of which fields were inserted first. This also improves the time to look up the max field id, as it changes the linear scan over the entirefieldscollection to a logarithmic call usingfields.keys().last().