Skip to content

Conversation

@pitrou
Copy link
Member

@pitrou pitrou commented Nov 25, 2019

Add a method ValidateFull() on arrays, batches etc. which does O(N) data validation
for a few types (list, union, dictionary).

Also, fix the assumptions about union arrays to match official semantics.

@github-actions
Copy link

@pitrou pitrou force-pushed the ARROW-6157-array-data-validation branch from 98953f6 to bc03343 Compare November 25, 2019 19:25
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wesm I was wondering... should we rename these methods to type_codes and raw_type_codes? There is a bit of a mixed terminology here (UnionType uses type_codes, also type_id is used for something different).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be in favor of changing the names to be more clear and where relevant, conforming to the specification

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the specification uses "types" and "type ids" :-/

@pitrou
Copy link
Member Author

pitrou commented Nov 25, 2019

cc @wesm For the union changes.

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing all this, definitely of high value to provide deeper validation than we had previously.

I left a couple comments about things that stood out but everything else looked in line with expectations

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this conforming to the specification? My understanding is that we do not place any conditions on the value "underneath" a null

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is simply moved around from array.cc.
The spec does not spell it out explicitly but the examples are compliant with this expectation:
https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I have it right that this is adding sizeof(int) * (kMaxTypeId + 1) (so 512 or 1024 bytes depending on the system) to the footprint of every UnionType instance?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I think this is simpler than having to lazily-initialize it.

@jorisvandenbossche
Copy link
Member

Should the python constructors do a cheap or a full validation? (since full=Falseis the default, I assume they will now do a cheap validation)

So that means that the initial example in the issue (https://issues.apache.org/jira/browse/ARROW-6157) with pa.UnionArray.from_dense can still generate a invalid array that will segfault on certain operations?
That might be fine (it should probably at least be possible to not do the full validation, in case you know you have valid array data), but maybe we should then update some of those docstrings to note that?

@pitrou
Copy link
Member Author

pitrou commented Nov 26, 2019

Full validation is essentially O(n), so it sounds undesirable to do it by default in Python constructors. I don't know what @wesm thinks about it.

Add a method ValidateFull() on arrays, batches etc. which does O(N) data validation
for a few types (list, union, dictionary).

Also, fix the assumptions about union arrays to match official semantics.
@pitrou pitrou force-pushed the ARROW-6157-array-data-validation branch from bc03343 to c8983f6 Compare November 28, 2019 15:47
@pitrou
Copy link
Member Author

pitrou commented Nov 28, 2019

Rebased. Will merge when green.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants