Skip to content

feat: add is_there_null_pointing_to_non_empty_value helper function in OffsetBuffer#9711

Open
rluvaton wants to merge 2 commits intoapache:mainfrom
rluvaton:add-is-there-null-pointing-to-non-empty-value
Open

feat: add is_there_null_pointing_to_non_empty_value helper function in OffsetBuffer#9711
rluvaton wants to merge 2 commits intoapache:mainfrom
rluvaton:add-is-there-null-pointing-to-non-empty-value

Conversation

@rluvaton
Copy link
Copy Markdown
Member

Which issue does this PR close?

N/A

Rationale for this change

In variable-length array types (e.g., StringArray, ListArray), null entries may have non-empty offset ranges, meaning the underlying data buffer contains data behind nulls. This matters when wanting to work on the underlying values of variable length data for example when unwrapping (flattening) a list array, as the child values are exposed, including those behind null entries. If null entries point to non-empty ranges, the unwrapped values will contain data that may not be
meaningful to operate on and could cause errors (e.g., division by zero in the child values).

Usages when this will be helpful:

  • flattening list array
  • casting lists/map - we don't wanna cast values that are not used so this is a check if there is one
  • explode on list - we don't want the null values behind it so this give us a check if it exists (will have another pr to cleanup empty values)
  • gc on lists/map/strings to remove unneeded data

What changes are included in this PR?

Add OffsetBuffer::is_there_null_pointing_to_non_empty_value method that checks if any null positions correspond to non-empty offset
ranges

Are these changes tested?

Yes

Are there any user-facing changes?

Yes, a new public method OffsetBuffer::is_there_null_pointing_to_non_empty_value is added.


Related to:

@github-actions github-actions Bot added the arrow Changes to the arrow crate label Apr 14, 2026
@rluvaton
Copy link
Copy Markdown
Member Author

Would appreciate a review

Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rluvaton -- i reviewed the code and it looks good to me overall

I think we should reconsider the super long name and the seemingly over-large number of tests, but otherwise 👍

return last_offset != initial_offset;
}

let mut valid_slices_iter = null_buffer.valid_slices();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code would be a lot simpler if there were a NullBuffer::invalid_slices 🤔

/// # Panics
///
/// Panics if the length of the `null_buffer` does not equal `self.len() - 1`.
pub fn is_there_null_pointing_to_non_empty_value(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this function name is a little excessively long and doesn't fit with the rest of this codebase. Can you please rename it to something shorter and more concise. Here are some ideas:

  1. has_null_data
  2. has_non_empty_nulls
  3. nulls_have_values
  4. has_populated_nulls

// ---------------------------------------------------------------

#[test]
fn null_pointing_none_null_buffer() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need all these tests? It seems like a single test for each representative case would be more than adequate and trim this PR down substantially

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants