Don't access and validate offset buffer in ListArray::from(ArrayData) #1602
.unwrap();

// Construct an empty offset buffer
let value_offsets = Buffer::from_iter(std::iter::empty::<i32>());
Buffer::from([]) might be slightly cleaner?
#[test]
#[should_panic(expected = "offsets do not start at zero")]
fn test_list_array_invalid_value_offset_start() {
fn test_list_array_offsets_need_not_start_at_zero() {
The spec doesn't explicitly say either way on this; however, for variable-length lists (e.g. UTF-8) it states:
Generally the first slot in the offsets array is 0, and the last slot is the length of the values array. When serializing this layout, we recommend normalizing the offsets to start at 0.
So I think this is likely correct
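To make the point concrete, here is a minimal sketch in plain Rust (no arrow crate) of how offsets that do not start at zero can still describe valid lists. `list_at` is a hypothetical helper written for this illustration, not an arrow API; it only assumes that list `i` spans `values[offsets[i]..offsets[i + 1]]`.

```rust
/// Returns the `i`-th list as a slice of `values`.
/// Hypothetical helper for illustration; not part of the arrow crate.
/// Returns `None` if `i` is out of range, an offset is negative,
/// the pair of offsets is decreasing, or the end exceeds `values.len()`.
fn list_at<'a, T>(offsets: &[i32], values: &'a [T], i: usize) -> Option<&'a [T]> {
    let start = *offsets.get(i)?;
    let end = *offsets.get(i.checked_add(1)?)?;
    if start < 0 || end < start {
        return None;
    }
    // `get` with a range also rejects an end past the values buffer.
    values.get(start as usize..end as usize)
}
```

With offsets `[2, 4, 7]` over seven values, `list_at` yields `values[2..4]` and `values[4..7]`; the first two values are simply never referenced, which is the situation a sliced array would be in.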
Codecov Report
All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## master #1602 +/- ##
==========================================
- Coverage 82.95% 82.90% -0.06%
==========================================
Files 193 193
Lines 55384 55499 +115
==========================================
+ Hits 45944 46010 +66
- Misses 9440 9489 +49
The CI failure looks unrelated; it may need to be re-triggered.
// Construct an empty value array
let value_data = ArrayData::builder(DataType::Int32)
    .len(0)
    .add_buffer(Buffer::from_iter(std::iter::empty::<i32>()))
Hmm, I don't see the C++ ListArray checking the first slot in the offsets, but it does check the length of the offsets:
Seems it requires offsets to have non-zero length?
Should the offsets for an empty ListArray be something like [0, 0]?
nvm: that would result in one zero-length list element.
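A small sketch of the arithmetic behind that correction, assuming only the usual convention that list `i` has length `offsets[i + 1] - offsets[i]`. The function name `list_lengths` is made up for this example.

```rust
/// Length of each list described by an offsets buffer.
/// Hypothetical helper for illustration; not part of the arrow crate.
/// `n` offsets describe `n - 1` lists, and zero or one offsets describe none.
fn list_lengths(offsets: &[i32]) -> Vec<i32> {
    offsets.windows(2).map(|pair| pair[1] - pair[0]).collect()
}
```

Under this convention `[0, 0]` describes an array of length one whose single element is an empty list, while `[0]` and an empty offsets buffer both describe an array with no elements at all.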
Good idea checking the C++ version. The docs for binary layout also mention
The offsets buffer contains length + 1 signed integers
...
and the last slot is the length of the values array
which would mean there has to be a single zero in the offsets buffer for an empty ListArray.
If this is a requirement, it would be better to validate it when creating ArrayData. The code in ArrayData::validate_each_offset explicitly allows this case, though:
// An empty binary-like array can have 0 offsets
if self.len == 0 && offset_buffer.is_empty() {
    return Ok(());
}
I'm more hesitant to change this now; maybe let's wait for some more eyes on this.
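The two readings discussed here can be compared side by side. The following is a sketch in plain Rust, not the actual arrow-rs validation code: `validate_offsets` and its `allow_no_offsets` flag are hypothetical names, and the checks are only the ones mentioned in this thread (`length + 1` offsets, non-decreasing, last offset within the values buffer).

```rust
/// Validates an offsets buffer for an array of `len` lists over `values_len` values.
/// Hypothetical sketch; not the arrow-rs implementation.
/// With `allow_no_offsets` set, an empty array may have zero offsets (lenient rule);
/// without it, exactly `len + 1` offsets are always required (strict rule).
fn validate_offsets(
    offsets: &[i32],
    len: usize,
    values_len: usize,
    allow_no_offsets: bool,
) -> Result<(), String> {
    // Lenient rule: an empty array can have 0 offsets.
    if allow_no_offsets && len == 0 && offsets.is_empty() {
        return Ok(());
    }
    if len.checked_add(1) != Some(offsets.len()) {
        return Err(format!("expected len + 1 offsets for len {}, got {}", len, offsets.len()));
    }
    // Note: the first offset is not required to be zero, only non-negative.
    if offsets[0] < 0 {
        return Err("first offset is negative".to_string());
    }
    if offsets.windows(2).any(|pair| pair[1] < pair[0]) {
        return Err("offsets are not monotonically non-decreasing".to_string());
    }
    // Non-negative because the first offset is and the sequence never decreases.
    let last = offsets[len] as usize;
    if last > values_len {
        return Err(format!("last offset {} exceeds values length {}", last, values_len));
    }
    Ok(())
}
```

An empty offsets buffer with `len == 0` passes only under the lenient rule, whereas a single `[0]` passes under both, which is the difference the thread is debating.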
Seems like arrow2 also requires at least one offset
Perhaps a ticket to change this?
nevi-me
left a comment
I presume the consensus is to settle separately whether an empty list array needs a single offset value.
I took the liberty of fixing the clippy error in 225d86c. I am also going to see if this change helps with #1545 at all. Thanks @jhorstmann and all reviewers.
Which issue does this PR close?
Closes #1601.
Rationale for this change
All required validation should already be done as part of the ArrayData creation.
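The rationale follows a general "validate once at construction, trust afterwards" pattern. The sketch below illustrates that pattern with stand-in types; `CheckedData` and `ListView` are hypothetical and deliberately much simpler than the real `ArrayData` and `ListArray`.

```rust
/// Hypothetical stand-in for validated array data; not the arrow `ArrayData` type.
#[derive(Debug)]
struct CheckedData {
    offsets: Vec<i32>,
    values: Vec<i32>,
}

impl CheckedData {
    /// All validation happens here, exactly once.
    fn try_new(offsets: Vec<i32>, values: Vec<i32>) -> Result<Self, String> {
        if offsets.first().map_or(false, |&first| first < 0) {
            return Err("first offset is negative".to_string());
        }
        if offsets.windows(2).any(|pair| pair[1] < pair[0]) {
            return Err("offsets are decreasing".to_string());
        }
        // The last offset is non-negative here because of the two checks above.
        if offsets.last().map_or(false, |&last| last as usize > values.len()) {
            return Err("last offset exceeds values length".to_string());
        }
        Ok(Self { offsets, values })
    }
}

/// Hypothetical list view built from already-validated data.
struct ListView {
    data: CheckedData,
}

impl From<CheckedData> for ListView {
    fn from(data: CheckedData) -> Self {
        // No re-validation and no reads of the offsets buffer here.
        Self { data }
    }
}

impl ListView {
    /// Number of lists; zero when the offsets buffer is empty.
    fn len(&self) -> usize {
        self.data.offsets.len().saturating_sub(1)
    }

    /// The `i`-th list. Panics if `i >= self.len()`.
    fn value(&self, i: usize) -> &[i32] {
        let start = self.data.offsets[i] as usize;
        let end = self.data.offsets[i + 1] as usize;
        &self.data.values[start..end]
    }
}
```

Because the conversion never touches the offsets, it works unchanged for empty data with no offsets, and any invalid input is rejected earlier by `try_new`.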
What changes are included in this PR?
Are there any user-facing changes?