ARROW-6157: [C++] Array data validation #5892
98953f6 to bc03343
cpp/src/arrow/array.h
Outdated
@wesm I was wondering... should we rename these methods to type_codes and raw_type_codes? There is a bit of a mixed terminology here (UnionType uses type_codes, also type_id is used for something different).
I'd be in favor of changing the names to be clearer and, where relevant, to conform to the specification.
Well, the specification uses "types" and "type ids" :-/
cc @wesm for the union changes.
wesm
left a comment
Thanks for doing all this, definitely of high value to provide deeper validation than we had previously.
I left a couple of comments about things that stood out, but everything else looked in line with expectations.
cpp/src/arrow/array/validate.cc
Outdated
Does this conform to the specification? My understanding is that we do not place any conditions on the value "underneath" a null.
This code is simply moved around from array.cc.
The spec does not spell it out explicitly but the examples are compliant with this expectation:
https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout
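As a rough illustration of the expectation discussed above, here is a minimal sketch, assuming the check in question is that list offsets never decrease, including at null slots. `CheckListOffsets` is a hypothetical helper and not the code in validate.cc. The offsets used in the note below are taken from the example in the linked section of the format documentation, where the null slot simply repeats the previous offset.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not Arrow's API): returns an empty string when the
// offsets are consistent, otherwise a description of the first problem found.
// The validity bitmap is deliberately ignored: null slots are held to the
// same rule as valid ones.
std::string CheckListOffsets(const std::vector<int32_t>& offsets,
                             int64_t values_length) {
  if (offsets.empty()) {
    return "offsets buffer must hold at least one entry";
  }
  if (offsets.front() < 0) {
    return "first offset is negative";
  }
  for (std::size_t i = 1; i < offsets.size(); ++i) {
    if (offsets[i] < offsets[i - 1]) {
      return "offset decreases at slot " + std::to_string(i - 1);
    }
  }
  if (offsets.back() > values_length) {
    return "last offset exceeds the length of the values array";
  }
  return "";
}
```

With the documented example `[[12, -7, 25], null, [0, -127, 127, 50], []]`, the offsets are `0, 3, 3, 7, 7` over 7 values, so the sketch accepts them. It would reject a null slot whose offset jumps backwards.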
cpp/src/arrow/type.cc
Outdated
Do I have it right that this is adding sizeof(int) * (kMaxTypeId + 1) (so 512 or 1024 bytes depending on the system) to the footprint of every UnionType instance?
Yes. I think this is simpler than having to lazily-initialize it.
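To make the trade-off concrete, here is a minimal sketch of an eagerly initialized lookup table of the kind being discussed. `UnionTypeSketch`, `child_id` and `kInvalidChildId` are hypothetical names, and `kMaxTypeId = 127` is an assumption inferred from the 512-byte figure with a 4-byte `int`. The actual member in type.cc may look different.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

constexpr int kMaxTypeId = 127;      // assumption: 512 bytes / 4-byte int = 128 slots
constexpr int kInvalidChildId = -1;  // hypothetical sentinel for unused type codes

// Hypothetical stand-in for a union type that maps type codes to child indices.
class UnionTypeSketch {
 public:
  explicit UnionTypeSketch(const std::vector<int8_t>& type_codes) {
    // Eager initialization: the whole table is filled in the constructor,
    // so lookups need no locking or lazy-initialization logic.
    std::fill(std::begin(child_ids_), std::end(child_ids_), kInvalidChildId);
    for (std::size_t i = 0; i < type_codes.size(); ++i) {
      if (type_codes[i] >= 0) {
        child_ids_[type_codes[i]] = static_cast<int>(i);
      }
    }
  }

  // Returns the child index for a type code, or kInvalidChildId if unused.
  int child_id(int8_t type_code) const {
    return type_code < 0 ? kInvalidChildId : child_ids_[type_code];
  }

 private:
  int child_ids_[kMaxTypeId + 1];  // the per-instance footprint in question
};

// The table alone accounts for sizeof(int) * (kMaxTypeId + 1) bytes per instance.
static_assert(sizeof(int[kMaxTypeId + 1]) == sizeof(int) * (kMaxTypeId + 1),
              "table size is fixed at compile time");
```

The cost is paid once per type instance rather than per array, which is presumably why the simpler eager version was considered acceptable here.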
Should the Python constructors do a cheap or a full validation? (since [...] So that means that the initial example in the issue (https://issues.apache.org/jira/browse/ARROW-6157) with [...]
Full validation is essentially O(n), so it sounds undesirable to do it by default in Python constructors. I don't know what @wesm thinks about it.
Add a method ValidateFull() on arrays, batches etc. which does O(N) data validation for a few types (list, union, dictionary). Also, fix the assumptions about union arrays to match official semantics.
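The difference between a cheap check and an O(N) check can be sketched as follows for a dictionary-like array. This is an illustration under stated assumptions, not the implementation in this PR: `DictionaryArraySketch`, `ValidateCheap` and `ValidateFullSketch` are hypothetical names, and the real methods return a status object rather than a `bool`.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for a dictionary-encoded array.
struct DictionaryArraySketch {
  int64_t length;                // logical number of slots
  std::vector<int32_t> indices;  // one dictionary index per slot
  int64_t dictionary_length;     // number of entries in the dictionary
};

// Cheap validation: only looks at sizes, so the cost does not depend on
// the number of slots.
bool ValidateCheap(const DictionaryArraySketch& arr) {
  return arr.length >= 0 && arr.dictionary_length >= 0 &&
         static_cast<int64_t>(arr.indices.size()) >= arr.length;
}

// Full validation: additionally visits every slot, hence O(N).
bool ValidateFullSketch(const DictionaryArraySketch& arr) {
  if (!ValidateCheap(arr)) {
    return false;
  }
  for (int64_t i = 0; i < arr.length; ++i) {
    const int32_t index = arr.indices[static_cast<std::size_t>(i)];
    if (index < 0 || index >= arr.dictionary_length) {
      return false;  // index points outside the dictionary
    }
  }
  return true;
}
```

An array whose buffers have the right sizes but which contains an out-of-bounds index passes the cheap check and fails the full one, which is the kind of error the O(N) pass is meant to catch.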
bc03343 to c8983f6
Rebased. Will merge when green.