[Variant] remove BorrowedShreddingState#9791

Open
sdf-jkl wants to merge 3 commits into apache:main from
sdf-jkl:remove-borrowedshreddingstate

Conversation

Contributor

sdf-jkl commented Apr 22, 2026

Which issue does this PR close?

Rationale for this change

Check issue

What changes are included in this PR?

  • Drop BorrowedShreddingState
  • Replace it with ShreddingState
  • Removed the lifetimes in unshred_variant as they required helpers to cover recursive ShreddingState handling.
  • Removing the lifetimes introduces a clone of NullBuffer (3 extra usize, 24 bytes, per array), which was only used in NullUnshredVariantBuilder. Removed the only place where NullBuffer was stored, so there is no regression.

Are these changes tested?

Yes, unit tests.

Are there any user-facing changes?

No.

github-actions bot added the parquet-variant (parquet-variant* crates) label on Apr 22, 2026

Contributor Author

sdf-jkl commented Apr 22, 2026

@alamb @scovich please take a look!

Contributor Author

sdf-jkl commented Apr 22, 2026

I also removed NullBuffer from NullUnshredVariantBuilder, since the arm where nulls are used is already covered by the outer unshred_variant at lines 96-97.

Contributor

scovich left a comment

Maybe I don't understand the change, but it seems like it introduces a lot of cloning of concrete array types? Not sure why that's needed when the [Borrowed]ShreddingState is only concerned about ArrayRef vs. &ArrayRef?

mod variant_get;
mod variant_to_arrow;

pub use variant_array::{BorrowedShreddingState, ShreddingState, VariantArray, VariantType};

Contributor

aside: Was this already an unused import? I wonder why clippy didn't flag it when it became unused?

  Self::$enum_variant(UnshredPrimitiveRowBuilder::new(
      value,
-     typed_value.$cast_fn(),
+     typed_value.$cast_fn().clone(),

Contributor

Not sure I understand -- why would a cast return a borrowed value that needs cloning?

- Self::Decimal32(DecimalUnshredRowBuilder::new(value, typed_value, *s as _))
+ Self::Decimal32(DecimalUnshredRowBuilder::new(
+     value,
+     typed_value.as_primitive().clone(),

Contributor

How expensive is it to clone a PrimitiveArray? It's not directly an Arc?

Contributor Author

We're cloning a &PrimitiveArray<T>

pub struct PrimitiveArray<T: ArrowPrimitiveType> {
    data_type: DataType, // 24 bytes 
    /// Values data
    values: ScalarBuffer<T::Native>, // 24 bytes
    nulls: Option<NullBuffer>, // 48 bytes if Some
}
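
To illustrate the cost question, here is a minimal std-only sketch of why cloning a struct whose buffers are reference-counted is cheap. `SharedBuffer` and `ToyPrimitiveArray` are hypothetical stand-ins, not arrow-rs types; the sketch assumes, as in Arrow, that the value storage sits behind an `Arc`, so a clone copies only the small handle fields and bumps a reference count.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for an Arrow buffer: shared storage plus a view into it.
#[derive(Clone)]
struct SharedBuffer {
    data: Arc<[i32]>, // reference-counted allocation, never copied on clone
    offset: usize,
    len: usize,
}

/// Hypothetical stand-in for a primitive array.
#[derive(Clone)]
struct ToyPrimitiveArray {
    values: SharedBuffer,
}

impl ToyPrimitiveArray {
    fn from_vec(v: Vec<i32>) -> Self {
        let len = v.len();
        Self {
            values: SharedBuffer { data: Arc::from(v), offset: 0, len },
        }
    }

    fn values(&self) -> &[i32] {
        let b = &self.values;
        &b.data[b.offset..b.offset + b.len]
    }
}

fn main() {
    let a = ToyPrimitiveArray::from_vec(vec![1, 2, 3]);
    let b = a.clone();
    // Both handles point at the same allocation; only the refcount changed.
    assert!(Arc::ptr_eq(&a.values.data, &b.values.data));
    assert_eq!(Arc::strong_count(&a.values.data), 2);
    assert_eq!(b.values(), &[1, 2, 3]);
}
```

Under that assumption the clone is a handful of word-sized copies plus an atomic increment per buffer, independent of the number of rows.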

  DataType::List(_) => Self::List(ListUnshredVariantBuilder::try_new(
      value,
-     typed_value.as_list(),
+     typed_value.as_list::<i32>().clone(),

Contributor

Similar question for these -- cost of cloning struct and list arrays?

Comment on lines +397 to +410
- struct UnshredPrimitiveRowBuilder<'a, T> {
-     value: Option<&'a ArrayRef>,
-     typed_value: &'a T,
+ struct UnshredPrimitiveRowBuilder<T> {
+     value: Option<ArrayRef>,
+     typed_value: T,

Contributor

Changes like this one seem unrelated to shared shredding state that needed an Option<&ArrayRef>? Even if value needs to be cloned, we could still keep a borrowed reference to typed_value which is anyway a bare type?

Contributor Author

Can't keep the reference without using lifetimes in builders. It seems like removing them was a bad idea. I'll switch it back.

/// Creates a new UnshredVariantRowBuilder from the `(value, typed_value)` pair of a shredded
/// variant struct. Returns None for the None/None case - caller decides how to handle based on
/// context.
fn try_new_opt(inner_struct: &'a StructArray) -> Result<Option<Self>> {

Contributor Author

This is the place where BorrowedShreddingState was so useful.

For nested builders (List/Struct) we used to recursively call try_new_opt(field_array.try_into()?): the try_into produced a BorrowedShreddingState<'a> whose ArrayRefs were borrowed from the source field_array. The wrapper got consumed, but the &'a ArrayRefs inside it carried 'a along and could be stored in the builder.

In ShreddingState the ArrayRefs are owned, so once the function returns, the ShreddingState goes out of scope and any references into it would dangle (the borrow checker rejects this).

A workaround is using &StructArray since it's a reference to the outer Array data.
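
The lifetime behavior described above can be sketched with toy types. Everything here is hypothetical and simplified: `Source` stands in for the outer struct array, `BorrowedState` for the borrowed shredding state, and `ArrayRef` is a plain `Arc<Vec<i32>>` rather than arrow's type. The toy `try_new_opt` also returns `None` whenever `typed_value` is missing, which is simpler than the real None/None rule.

```rust
use std::sync::Arc;

type ArrayRef = Arc<Vec<i32>>; // toy stand-in for an Arrow array handle

/// Owns the columns (plays the role of the outer struct array).
struct Source {
    value: Option<ArrayRef>,
    typed_value: Option<ArrayRef>,
}

/// Borrowed view whose references live as long as the `Source` they came from.
struct BorrowedState<'a> {
    value: Option<&'a ArrayRef>,
    typed_value: Option<&'a ArrayRef>,
}

impl<'a> From<&'a Source> for BorrowedState<'a> {
    fn from(s: &'a Source) -> Self {
        Self { value: s.value.as_ref(), typed_value: s.typed_value.as_ref() }
    }
}

/// Builder that stores borrowed columns.
struct Builder<'a> {
    value: Option<&'a ArrayRef>,
    typed_value: &'a ArrayRef,
}

impl<'a> Builder<'a> {
    /// Consumes the view by value; the `&'a` references inside outlive the view.
    fn try_new_opt(state: BorrowedState<'a>) -> Option<Self> {
        let typed_value = state.typed_value?;
        Some(Self { value: state.value, typed_value })
    }
    // With an owned state (fields of type `Option<ArrayRef>`), storing
    // `&state.typed_value` here would not compile: `state` is dropped when
    // the function returns, so no `&'a` reference into it can escape.
}

fn main() {
    let source = Source { value: None, typed_value: Some(Arc::new(vec![1, 2, 3])) };
    let builder = Builder::try_new_opt((&source).into()).expect("typed_value present");
    assert!(builder.value.is_none());
    assert_eq!(builder.typed_value.len(), 3);
    // No clone happened: the source still holds the only strong reference.
    assert_eq!(Arc::strong_count(source.typed_value.as_ref().unwrap()), 1);
}
```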

Contributor

Hmm. I wonder if we're going about this wrong.

First question: What problem are we actually trying to solve by eliminating BorrowedShreddingState? Is it just annoying to have two similar types? Or something else more serious?

Second question: What if (thought experiment) we standardized on BorrowedShreddingState everywhere instead?

  • Only use ShreddingState as an internal helper member of VariantArray, ShreddedVariantFieldArray, etc? (its job is to centralize the name-based lookup and validation code; we should probably push inner inside as well, since that's always there)
  • VariantArray::shredding_state() then returns self.shredding_state.borrow() (BorrowedShreddingState<'_> return type)
  • All functions that currently expect ShreddingState change to expect BorrowedShreddingState instead (I think this is already the case)
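
A rough sketch of what that thought experiment might look like, using toy types only. `OwnedState`, `BorrowedState`, `ToyVariantArray`, and `is_shredded` are hypothetical names, and `ArrayRef` is again a plain `Arc<Vec<i32>>`; the real types carry validation and name-based lookup that is omitted here.

```rust
use std::sync::Arc;

type ArrayRef = Arc<Vec<i32>>; // toy stand-in for an Arrow array handle

/// Internal helper that owns the columns.
struct OwnedState {
    value: Option<ArrayRef>,
    typed_value: Option<ArrayRef>,
}

/// Cheap, copyable view handed out to callers.
#[derive(Clone, Copy)]
struct BorrowedState<'a> {
    value: Option<&'a ArrayRef>,
    typed_value: Option<&'a ArrayRef>,
}

impl OwnedState {
    fn borrow(&self) -> BorrowedState<'_> {
        BorrowedState { value: self.value.as_ref(), typed_value: self.typed_value.as_ref() }
    }
}

/// The array type keeps the owned state private and exposes only the view.
struct ToyVariantArray {
    shredding_state: OwnedState,
}

impl ToyVariantArray {
    fn shredding_state(&self) -> BorrowedState<'_> {
        self.shredding_state.borrow()
    }
}

/// Downstream functions accept the borrowed view, never the owned state.
fn is_shredded(state: BorrowedState<'_>) -> bool {
    state.typed_value.is_some()
}

fn main() {
    let array = ToyVariantArray {
        shredding_state: OwnedState { value: Some(Arc::new(vec![7])), typed_value: None },
    };
    let state = array.shredding_state();
    assert!(!is_shredded(state));
    // The view is `Copy`, so it is still usable after being passed by value.
    assert!(state.value.is_some());
    // Handing out the view did not clone any array handle.
    assert_eq!(Arc::strong_count(array.shredding_state.value.as_ref().unwrap()), 1);
}
```

The trade-off in this shape is that every consumer of the view carries a lifetime parameter, which is the same cost the builders in this PR were trying to avoid.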

Third question: Where are we actually cloning StructArray, PrimitiveArray, etc today? Does the proposed change improve the situation, make it worse, or leave it unchanged? For example, VariantArray and ShreddedVariantFieldArray constructors both clone their input struct array today, and I don't think the current PR changes that.

Fourth question: Now that we know VariantArray is only a temporary helper that cannot actually impl Array, should we revisit the decision to make it an owned type? If VariantArray maintained references internally instead of owned values, then we could just use borrowed types everywhere and be done with it. Would the benefits be worth the headaches it causes code that uses VariantArray?

Labels

parquet-variant parquet-variant* crates

Development

Successfully merging this pull request may close these issues.

[Variant] Remove BorrowedShreddingState

2 participants