Skip to content

Conversation

@vibhatha
Copy link
Contributor

This is the initial PR to set the util functions and structure to include the ToProto functionality to relations.
Here the objective is to create an ACERO relation by interpretting what is included in a Substrait-Relation.
In this PR the read relation ToProto is added.

@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has no components in JIRA, make sure you assign one.

@vibhatha
Copy link
Contributor Author

cc @westonpace added the initial PR to integrate ToProto for relations. The detailed task breakdown for ToProto is documented in here: https://issues.apache.org/jira/browse/ARROW-16854

The idea is to add part by part in smaller PRs.

@westonpace
Copy link
Member

I'm out Monday & Tuesday. Maybe @jvanstraten can take a look? Otherwise I can get to this on Wednesday

@vibhatha
Copy link
Contributor Author

I'm out Monday & Tuesday. Maybe @jvanstraten can take a look? Otherwise I can get to this on Wednesday

Wednesday works for me 👍

@jvanstraten
Copy link
Contributor

I don't feel qualified to comment on those design questions, but FWIW, I ran the serialized output of the test case through the validator and it's okay-ish (the validator doesn't like the lack of a NULLABILITY_REQUIRED in the struct that represents the schema, but that's pretty pedantic I guess), and the code looks fine to me.

@vibhatha
Copy link
Contributor Author

I don't feel qualified to comment on those design questions, but FWIW, I ran the serialized output of the test case through the validator and it's okay-ish (the validator doesn't like the lack of a NULLABILITY_REQUIRED in the struct that represents the schema, but that's pretty pedantic I guess), and the code looks fine to me.

Thanks a lot for the quick check on this. It’s very interesting how you validated things using the tool. Do you think it’s wise to add a CI to test Substrait related queries using this tool?

Please feel free to add suggestions. @jvanstraten

One doubtful thing is to check in serialization is whether a projection or filter expression is added or not/ differentiation from default values. For instance filter expression defaults to a boolean literal of value true.

cc @westonpace For future reference in the review.

@jvanstraten
Copy link
Contributor

Do you think it’s wise to add a CI to test Substrait related queries using this tool?

IMO every roundtripped plan in every Substrait consumer and/or producer should also be passed through the validator. Otherwise, how would you know for sure that the Substrait plan you've successfully roundtripped through is actually sensible in any way? It does always require a complete plan, though, so you'd need some or other function for each type of thing (expression, relation, etc) that surrounds the thing with a dummy plan. Arrow could hook into it via the C interface (it's not a very pleasant interface because it's intended to be compatible with any language that can call into C, so you might want to wrap it with some C++ stuff; also it will need a Rust compiler to build) or it could just execute the CLI on a generated file (more clunky, but that can just be pulled from PyPI in binary form, so it's probably a bit easier on CI).

I'm sure I'm biased though, since I'm the one who made the validator. It's also starting to considerably lag behind Substrait; it doesn't seem like anyone is sufficiently interested to review/collaborate, so I can't get any PRs through.

Link, just in case: https://github.com/substrait-io/substrait-validator

One doubtful thing is to check in serialization is whether a projection or filter expression is added or not/ differentiation from default values. For instance filter expression defaults to a boolean literal of value true.

Assuming you mean that in Acero the filter expression is mandatory and is just set to literal true if there is none, IMO you could just do the same thing on the Substrait side, at least for now. Likewise for the projection. Or you could just leave it for a later PR and error out when presented with nontrivial values. I don't know how hard any of these things are; I've never done anything with the Acero representation.

@vibhatha
Copy link
Contributor Author

Do you think it’s wise to add a CI to test Substrait related queries using this tool?

IMO every roundtripped plan in every Substrait consumer and/or producer should also be passed through the validator. Otherwise, how would you know for sure that the Substrait plan you've successfully roundtripped through is actually sensible in any way? It does always require a complete plan, though, so you'd need some or other function for each type of thing (expression, relation, etc) that surrounds the thing with a dummy plan. Arrow could hook into it via the C interface (it's not a very pleasant interface because it's intended to be compatible with any language that can call into C, so you might want to wrap it with some C++ stuff; also it will need a Rust compiler to build) or it could just execute the CLI on a generated file (more clunky, but that can just be pulled from PyPI in binary form, so it's probably a bit easier on CI).

I'm sure I'm biased though, since I'm the one who made the validator. It's also starting to considerably lag behind Substrait; it doesn't seem like anyone is sufficiently interested to review/collaborate, so I can't get any PRs through.

Link, just in case: https://github.com/substrait-io/substrait-validator

Intersting thoughts. I will take a look at the tool. It would be better if we can use it to validate things. But I am not sure if it needs to be inside the Arrow source or should it be a plugin for Apache Arrow. cc @westonpace

One doubtful thing is to check in serialization is whether a projection or filter expression is added or not/ differentiation from default values. For instance filter expression defaults to a boolean literal of value true.

Assuming you mean that in Acero the filter expression is mandatory and is just set to literal true if there is none, IMO you could just do the same thing on the Substrait side, at least for now. Likewise for the projection. Or you could just leave it for a later PR and error out when presented with nontrivial values. I don't know how hard any of these things are; I've never done anything with the Acero representation.

Here it is rather, the differentiation between a user passed value vs the default. We could assume the default and do the comparison to see if an explicit value is passed. There is no API calls in Expression to check if it has_filter or has_projection. May be that kind of a function could be useful.

@vibhatha vibhatha marked this pull request as ready for review June 23, 2022 12:58
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the long delay. I have a few suggestions but overall I think this looks good.

@westonpace
Copy link
Member

Intersting thoughts. I will take a look at the tool. It would be better if we can use it to validate things. But I am not sure if it needs to be inside the Arrow source or should it be a plugin for Apache Arrow. cc @westonpace

I think it would make a lot of sense for unit tests to bring in the validator as a C dependency.

@vibhatha
Copy link
Contributor Author

vibhatha commented Jul 6, 2022

Intersting thoughts. I will take a look at the tool. It would be better if we can use it to validate things. But I am not sure if it needs to be inside the Arrow source or should it be a plugin for Apache Arrow. cc @westonpace

I think it would make a lot of sense for unit tests to bring in the validator as a C dependency.

Should we create a JIRA for this?

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more thoughts. Using parquet is fine. My only concern was the test data directory.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/tmp/ will not be portable once we support Windows URIs. Can you use arrow::internal::TemporaryDir from arrow/util/io_util.h?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't we got an issue with Windows paths already in Substrait? We have added a skip for Windows tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a missing path support.

GTEST_SKIP() << "ARROW-16392: Substrait File URI not supported for Windows";

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but that JIRA will eventually be fixed. We don't want to make it harder to support Windows.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be okay now. I updated the test case.

Comment on lines 1484 to 1571
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The formatting here seems off

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping: when I reformatted it comes to this shape. I tried a few times. Not sure what's wrong.

Comment on lines 1500 to 1501
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand this note.

Copy link
Contributor Author

@vibhatha vibhatha Jul 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah instead of just saying test.parquet we say /test.parquet

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the des prefix for? Can you use a whole word?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

des was meant to represent deserialized. I will update it's usage.

@vibhatha
Copy link
Contributor Author

A few more thoughts. Using parquet is fine. My only concern was the test data directory.

We got rid of that. By the way this could would remain very much same, but ths usage would be different once this is subjected to a registry usage in Substrait ToProto methods.

@vibhatha vibhatha marked this pull request as draft July 13, 2022 07:16
@vibhatha
Copy link
Contributor Author

vibhatha commented Jul 13, 2022

** THIS PR IS UNDERGOING A REFACTOR **

@vibhatha
Copy link
Contributor Author

vibhatha commented Sep 7, 2022

@westonpace I added a fix for the path issue on Mac. I think now it is more generalized.

Any other suggestions?

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is close. Mostly just some cleanup comments at this point.

}
read_rel_lfs->mutable_items()->AddAllocated(read_rel_lfs_ffs.release());
}
read_rel->set_allocated_local_files(read_rel_lfs.release());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have a follow-up JIRA to add support for scan options projection & filter? I don't think it should be done as part of this JIRA since it is changing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vibhatha
Copy link
Contributor Author

vibhatha commented Sep 8, 2022

@westonpace I updated the PR. Seems like a few CIs are failing. But, it seems like not related to the changes applied here.
WDYT?

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sticking with this.

@westonpace westonpace merged commit 7475605 into apache:master Sep 8, 2022
@vibhatha
Copy link
Contributor Author

vibhatha commented Sep 8, 2022

@westonpace Thanks a lot for keeping up with the major changes and a few rounds of reviews. 👍

@ursabot
Copy link

ursabot commented Sep 8, 2022

Benchmark runs are scheduled for baseline = 8fe7e35 and contender = 7475605. 7475605 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.31% ⬆️0.2%] test-mac-arm
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.46% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 74756051 ec2-t3-xlarge-us-east-2
[Finished] 74756051 test-mac-arm
[Failed] 74756051 ursa-i9-9960x
[Finished] 74756051 ursa-thinkcentre-m75q
[Finished] 8fe7e353 ec2-t3-xlarge-us-east-2
[Finished] 8fe7e353 test-mac-arm
[Failed] 8fe7e353 ursa-i9-9960x
[Finished] 8fe7e353 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

zagto pushed a commit to zagto/arrow that referenced this pull request Oct 7, 2022
This is the initial PR to set the util functions and structure to include the `ToProto` functionality to relations.
Here the objective is to create an ACERO relation by interpretting what is included in a Substrait-Relation. 
In this PR the `read` relation ToProto is added. 

Authored-by: Vibhatha Abeykoon <vibhatha@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants