ARROW-16855: [C++] Adding Read Relation ToProto #13401
cc @westonpace added the initial PR to integrate The idea is to add part by part in smaller PRs. |
I'm out Monday & Tuesday. Maybe @jvanstraten can take a look? Otherwise I can get to this on Wednesday |
Wednesday works for me 👍 |
I don't feel qualified to comment on those design questions, but FWIW, I ran the serialized output of the test case through the validator and it's okay-ish (the validator doesn't like the lack of a |
Thanks a lot for the quick check on this. It's very interesting how you validated things using the tool. Do you think it would be wise to add a CI job that tests Substrait-related queries with this tool? Please feel free to add suggestions. @jvanstraten One thing I'm unsure about in serialization is how to tell whether a projection or filter expression was explicitly added or is just the default value. For instance, the filter expression defaults to a boolean literal of value true. cc @westonpace for future reference in the review. |
IMO every roundtripped plan in every Substrait consumer and/or producer should also be passed through the validator. Otherwise, how would you know for sure that the Substrait plan you've successfully roundtripped through is actually sensible in any way? It does always require a complete plan, though, so you'd need some or other function for each type of thing (expression, relation, etc) that surrounds the thing with a dummy plan. Arrow could hook into it via the C interface (it's not a very pleasant interface because it's intended to be compatible with any language that can call into C, so you might want to wrap it with some C++ stuff; also it will need a Rust compiler to build) or it could just execute the CLI on a generated file (more clunky, but that can just be pulled from PyPI in binary form, so it's probably a bit easier on CI). I'm sure I'm biased though, since I'm the one who made the validator. It's also starting to considerably lag behind Substrait; it doesn't seem like anyone is sufficiently interested to review/collaborate, so I can't get any PRs through. Link, just in case: https://github.com/substrait-io/substrait-validator
Assuming you mean that in Acero the filter expression is mandatory and is just set to literal true if there is none, IMO you could just do the same thing on the Substrait side, at least for now. Likewise for the projection. Or you could just leave it for a later PR and error out when presented with nontrivial values. I don't know how hard any of these things are; I've never done anything with the Acero representation. |
Interesting thoughts. I will take a look at the tool. It would be better if we could use it to validate things, but I am not sure whether it needs to live inside the Arrow source or should be a plugin for Apache Arrow. cc @westonpace
Here the issue is rather differentiating between a user-passed value and the default. We could assume the default and do a comparison to see whether an explicit value was passed. There are no API calls in Expression to check if it |
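The comparison idea discussed here could look roughly like the following. This is only a minimal sketch: `Expr` is a hypothetical stand-in type, not Acero's real `Expression` class, and `DefaultFilter` and `IsExplicitFilter` are illustrative names rather than Arrow APIs.

```cpp
#include <string>
#include <variant>

// Hypothetical stand-in for an expression: either a boolean literal or a
// named field reference. This is NOT arrow::compute::Expression.
struct Expr {
  std::variant<bool, std::string> value;
  bool operator==(const Expr& other) const { return value == other.value; }
};

// Assumed default: a scan without a filter carries the literal `true`.
inline Expr DefaultFilter() { return Expr{true}; }

// A filter is treated as explicit only when it differs from the default.
// Limitation: a user who explicitly passes literal `true` cannot be told
// apart from a user who passed no filter at all.
inline bool IsExplicitFilter(const Expr& filter) {
  return !(filter == DefaultFilter());
}
```

The limitation noted in the comment is the trade-off of this approach: it needs no extra API on the expression type, but it cannot distinguish an explicit `true` from an absent filter.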
westonpace
left a comment
Sorry for the long delay. I have a few suggestions but overall I think this looks good.
I think it would make a lot of sense for unit tests to bring in the validator as a C dependency. |
Should we create a JIRA for this? |
westonpace
left a comment
A few more thoughts. Using parquet is fine. My only concern was the test data directory.
/tmp/ will not be portable once we support Windows URIs. Can you use arrow::internal::TemporaryDir from arrow/util/io_util.h?
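To illustrate the portability concern, the sketch below asks the platform for its temporary directory instead of hard-coding `/tmp/`. It uses `std::filesystem` from the standard library as a stand-in so that it stays self-contained; in the actual test the suggested `arrow::internal::TemporaryDir` would play this role. `MakeScratchDir` is a hypothetical helper name.

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: creates a scratch directory under the platform's
// temporary location (for example /tmp on Linux or %TEMP% on Windows).
// Unlike a real temporary-directory utility, it does not generate a unique
// name and does not clean up after itself.
inline fs::path MakeScratchDir(const std::string& name) {
  fs::path dir = fs::temp_directory_path() / name;
  fs::create_directories(dir);  // no-op if the directory already exists
  return dir;
}
```

A test would then write its Parquet file under the returned path and remove the directory when done.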
Didn't we already hit an issue with Windows paths in Substrait? We added a skip for the Windows tests.
Path support is still missing there:
GTEST_SKIP() << "ARROW-16392: Substrait File URI not supported for Windows";
Yes, but that JIRA will eventually be fixed. We don't want to make it harder to support Windows.
This should be okay now. I updated the test case.
The formatting here seems off
ping: when I reformat it, it comes out in this shape. I tried a few times. Not sure what's wrong.
I'm not sure I understand this note.
Ah instead of just saying test.parquet we say /test.parquet
What is the des prefix for? Can you use a whole word?
des was meant to represent deserialized. I will update its usage.
We got rid of that. By the way, this code would remain very much the same, but the usage would be different once it goes through a registry in Substrait. |
** THIS PR IS UNDERGOING A REFACTOR ** |
e4dac09 to c1de2b8
@westonpace I added a fix for the path issue on Mac. I think now it is more generalized. Any other suggestions? |
westonpace
left a comment
This is close. Mostly just some cleanup comments at this point.
    }
    read_rel_lfs->mutable_items()->AddAllocated(read_rel_lfs_ffs.release());
  }
  read_rel->set_allocated_local_files(read_rel_lfs.release());
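The snippet under review relies on protobuf's ownership-transfer convention: `AddAllocated` and `set_allocated_*` take ownership of a raw pointer, so the caller hands it over with `release()`. A minimal sketch of that pattern follows, using hypothetical stand-in classes (`FileItem`, `LocalFiles`, `ReadRel`, `BuildReadRel`) instead of the generated Substrait messages so that it compiles without protobuf.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins for generated protobuf messages. They only mimic
// the "takes ownership of a raw pointer" convention.
struct FileItem {
  std::string uri;
};

class LocalFiles {
 public:
  // Takes ownership of `item`, similar to RepeatedPtrField::AddAllocated.
  void AddAllocated(FileItem* item) { items_.emplace_back(item); }
  std::size_t size() const { return items_.size(); }
  const FileItem& at(std::size_t i) const { return *items_[i]; }

 private:
  std::vector<std::unique_ptr<FileItem>> items_;
};

class ReadRel {
 public:
  // Takes ownership of `files`, similar to a generated set_allocated_* setter.
  void set_allocated_local_files(LocalFiles* files) { local_files_.reset(files); }
  const LocalFiles* local_files() const { return local_files_.get(); }

 private:
  std::unique_ptr<LocalFiles> local_files_;
};

inline std::unique_ptr<ReadRel> BuildReadRel(const std::vector<std::string>& uris) {
  auto read_rel = std::make_unique<ReadRel>();
  auto files = std::make_unique<LocalFiles>();
  for (const auto& uri : uris) {
    auto item = std::make_unique<FileItem>();
    item->uri = uri;
    files->AddAllocated(item.release());  // ownership moves into `files`
  }
  read_rel->set_allocated_local_files(files.release());  // then into `read_rel`
  return read_rel;
}
```

Holding each piece in a `std::unique_ptr` until the moment of hand-off means an early return before `release()` does not leak the partially built message.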
Can we have a follow-up JIRA to add support for scan options projection & filter? I don't think it should be done as part of this JIRA since it is changing.
Nice catch. Jira created: https://issues.apache.org/jira/browse/ARROW-17647
@westonpace I updated the PR. A few CI jobs are failing, but the failures do not seem to be related to the changes applied here. |
westonpace
left a comment
Thanks for sticking with this.
|
@westonpace Thanks a lot for keeping up with the major changes and a few rounds of reviews. 👍 |
|
Benchmark runs are scheduled for baseline = 8fe7e35 and contender = 7475605. 7475605 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
This is the initial PR to set the util functions and structure to include the `ToProto` functionality to relations. Here the objective is to create an ACERO relation by interpreting what is included in a Substrait-Relation. In this PR the `read` relation ToProto is added. Authored-by: Vibhatha Abeykoon <vibhatha@gmail.com> Signed-off-by: Weston Pace <weston.pace@gmail.com>