Skip to content

Don't access and validate offset buffer in ListArray::from(ArrayData)#1602

Merged
alamb merged 3 commits intoapache:masterfrom
jhorstmann:remove-offset-check-in-listarray-from-data
Apr 25, 2022
Merged

Don't access and validate offset buffer in ListArray::from(ArrayData)#1602
alamb merged 3 commits intoapache:masterfrom
jhorstmann:remove-offset-check-in-listarray-from-data

Conversation

@jhorstmann
Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Closes #1601.

Rationale for this change

All required validation should already be done as part of the ArrayData creation.

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions Bot added the arrow Changes to the arrow crate label Apr 20, 2022
Comment thread arrow/src/array/array_list.rs Outdated
.unwrap();

// Construct an empty offset buffer
let value_offsets = Buffer::from_iter(std::iter::empty::<i32>());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Buffer::from([]) might be slightly cleaner?

#[test]
#[should_panic(expected = "offsets do not start at zero")]
fn test_list_array_invalid_value_offset_start() {
fn test_list_array_offsets_need_not_start_at_zero() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The spec doesn't explicitly state either way on this, however, for variable length lists (e.g. UTF-8) it states

Generally the first slot in the offsets array is 0, and the last slot is the length of the values array. When serializing this layout, we recommend normalizing the offsets to start at 0.

So I think this is likely correct

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 20, 2022

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.90%. Comparing base (b4642ec) to head (225d86c).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1602      +/-   ##
==========================================
- Coverage   82.95%   82.90%   -0.06%     
==========================================
  Files         193      193              
  Lines       55384    55499     +115     
==========================================
+ Hits        45944    46010      +66     
- Misses       9440     9489      +49     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@viirya
Copy link
Copy Markdown
Member

viirya commented Apr 20, 2022

The CI failure looks unrelated:

Error response from daemon: Head "https://registry-1.docker.io/v2/amd64/rust/manifests/latest": received unexpected HTTP status: 503 Service Unavailable
  Error: Docker pull failed with exit code 1

May need to re-trigger it.

Comment thread arrow/src/array/array_list.rs Outdated
// Construct an empty value array
let value_data = ArrayData::builder(DataType::Int32)
.len(0)
.add_buffer(Buffer::from_iter(std::iter::empty::<i32>()))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I don't see C++ ListArray checks first slot in offsets, but it checks the length of offsets:

https://github.com/apache/arrow/blob/c70426f73326b3852d1bd7c31d98be4743f3fcba/cpp/src/arrow/array/array_nested.cc#L111-L113

Seems it requires offsets to have non-zero length?

Copy link
Copy Markdown
Member

@viirya viirya Apr 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the offsets for an empty ListArray to be something like [0, 0]?

nvm: that will result in an zero-length list element.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea checking the C++ version. The docs for binary layout also mention

The offsets buffer contains length + 1 signed integers
...
and the last slot is the length of the values array

which would mean there has to be a single zero in the offsets buffer for an empty ListArray.

If this is a requirements it would be better to validate it when creating ArrayData. The code in ArrayData::validate_each_offset explicitly allow this case though:

        // An empty binary-like array can have 0 offsets
        if self.len == 0 && offset_buffer.is_empty() {
            return Ok(());
        }

I'm more hesitant to change this now, maybe let's wait for some more eyes on this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps a ticket to change this?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #1620 as the follow on work

Copy link
Copy Markdown
Contributor

@nevi-me nevi-me left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I presume that the consensus is to determine/address the 0-list 1 offset value item separately.

@alamb
Copy link
Copy Markdown
Contributor

alamb commented Apr 25, 2022

I took the liberty of fixing the clippy errror in 225d86c. I am also going to see if this change helps with #1545 at all

Thanks @jhorstmann and all reviewers

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ListArray::from(ArrayData) dereferences invalid pointer when offsets are empty

6 participants