ARROW-16855: [C++] Adding Read Relation ToProto #13401

vibhatha · 2022-06-20T01:26:09Z

This is the initial PR to set up the utility functions and structure needed to add ToProto functionality for relations.
Here the objective is to create an ACERO relation by interpreting what is included in a Substrait-Relation.
In this PR, the read relation ToProto is added.

github-actions · 2022-06-20T01:26:30Z

https://issues.apache.org/jira/browse/ARROW-16855

github-actions · 2022-06-20T01:26:31Z

⚠️ Ticket has no components in JIRA, make sure you assign one.

vibhatha · 2022-06-20T01:47:49Z

cc @westonpace I added the initial PR to integrate ToProto for relations. The detailed task breakdown for ToProto is documented here: https://issues.apache.org/jira/browse/ARROW-16854

The idea is to add part by part in smaller PRs.

cpp/src/arrow/engine/substrait/serde_test.cc

westonpace · 2022-06-20T07:34:01Z

I'm out Monday & Tuesday. Maybe @jvanstraten can take a look? Otherwise I can get to this on Wednesday

vibhatha · 2022-06-20T07:55:05Z

> I'm out Monday & Tuesday. Maybe @jvanstraten can take a look? Otherwise I can get to this on Wednesday

Wednesday works for me 👍

jvanstraten · 2022-06-20T09:53:15Z

I don't feel qualified to comment on those design questions, but FWIW, I ran the serialized output of the test case through the validator and it's okay-ish (the validator doesn't like the lack of a NULLABILITY_REQUIRED in the struct that represents the schema, but that's pretty pedantic I guess), and the code looks fine to me.

vibhatha · 2022-06-20T09:58:59Z

> I don't feel qualified to comment on those design questions, but FWIW, I ran the serialized output of the test case through the validator and it's okay-ish (the validator doesn't like the lack of a NULLABILITY_REQUIRED in the struct that represents the schema, but that's pretty pedantic I guess), and the code looks fine to me.

Thanks a lot for the quick check on this. It’s very interesting how you validated things using the tool. Do you think it’s wise to add a CI to test Substrait related queries using this tool?

Please feel free to add suggestions. @jvanstraten

One thing I'm unsure about is how serialization should check whether a projection or filter expression was explicitly set, as opposed to left at its default value. For instance, the filter expression defaults to a boolean literal of value true.

cc @westonpace For future reference in the review.

jvanstraten · 2022-06-20T10:41:44Z

> Do you think it’s wise to add a CI to test Substrait related queries using this tool?

IMO every roundtripped plan in every Substrait consumer and/or producer should also be passed through the validator. Otherwise, how would you know for sure that the Substrait plan you've successfully roundtripped through is actually sensible in any way? It does always require a complete plan, though, so you'd need some or other function for each type of thing (expression, relation, etc) that surrounds the thing with a dummy plan. Arrow could hook into it via the C interface (it's not a very pleasant interface because it's intended to be compatible with any language that can call into C, so you might want to wrap it with some C++ stuff; also it will need a Rust compiler to build) or it could just execute the CLI on a generated file (more clunky, but that can just be pulled from PyPI in binary form, so it's probably a bit easier on CI).

I'm sure I'm biased though, since I'm the one who made the validator. It's also starting to considerably lag behind Substrait; it doesn't seem like anyone is sufficiently interested to review/collaborate, so I can't get any PRs through.

Link, just in case: https://github.com/substrait-io/substrait-validator

> One thing I'm unsure about is how serialization should check whether a projection or filter expression was explicitly set, as opposed to left at its default value. For instance, the filter expression defaults to a boolean literal of value true.

Assuming you mean that in Acero the filter expression is mandatory and is just set to literal true if there is none, IMO you could just do the same thing on the Substrait side, at least for now. Likewise for the projection. Or you could just leave it for a later PR and error out when presented with nontrivial values. I don't know how hard any of these things are; I've never done anything with the Acero representation.

vibhatha · 2022-06-20T14:02:40Z

> > Do you think it’s wise to add a CI to test Substrait related queries using this tool?
>
> IMO every roundtripped plan in every Substrait consumer and/or producer should also be passed through the validator. Otherwise, how would you know for sure that the Substrait plan you've successfully roundtripped through is actually sensible in any way? It does always require a complete plan, though, so you'd need some or other function for each type of thing (expression, relation, etc) that surrounds the thing with a dummy plan. Arrow could hook into it via the C interface (it's not a very pleasant interface because it's intended to be compatible with any language that can call into C, so you might want to wrap it with some C++ stuff; also it will need a Rust compiler to build) or it could just execute the CLI on a generated file (more clunky, but that can just be pulled from PyPI in binary form, so it's probably a bit easier on CI).
>
> I'm sure I'm biased though, since I'm the one who made the validator. It's also starting to considerably lag behind Substrait; it doesn't seem like anyone is sufficiently interested to review/collaborate, so I can't get any PRs through.
>
> Link, just in case: https://github.com/substrait-io/substrait-validator

Interesting thoughts. I will take a look at the tool. It would be better if we can use it to validate things. But I am not sure whether it needs to be inside the Arrow source or whether it should be a plugin for Apache Arrow. cc @westonpace

> > One thing I'm unsure about is how serialization should check whether a projection or filter expression was explicitly set, as opposed to left at its default value. For instance, the filter expression defaults to a boolean literal of value true.
>
> Assuming you mean that in Acero the filter expression is mandatory and is just set to literal true if there is none, IMO you could just do the same thing on the Substrait side, at least for now. Likewise for the projection. Or you could just leave it for a later PR and error out when presented with nontrivial values. I don't know how hard any of these things are; I've never done anything with the Acero representation.

Rather, the issue here is distinguishing a user-passed value from the default. We could assume the default and compare against it to see whether an explicit value was passed. There are no API calls in Expression to check whether it has a filter or a projection (something like has_filter or has_projection). Maybe that kind of function would be useful.
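
The comparison-against-default idea can be sketched in isolation. This is a minimal standalone illustration, not Arrow code: `Expr`, `DefaultFilter`, and `HasExplicitFilter` are hypothetical names, and the real Acero Expression type is much richer than this stand-in. It assumes, as stated in this thread, that the default filter is a boolean literal of value true.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical stand-in for an expression: either a boolean literal or
// some opaque non-literal expression described by text.
struct Expr {
  std::optional<bool> literal;  // set when the expression is a bool literal
  std::string text;             // description of a non-literal expression

  bool operator==(const Expr& other) const {
    return literal == other.literal && text == other.text;
  }
};

// The assumed default filter: a boolean literal of value true.
inline Expr DefaultFilter() { return Expr{true, ""}; }

// Hypothetical helper: a filter counts as "explicit" only if it differs
// from the default.
inline bool HasExplicitFilter(const Expr& filter) {
  return !(filter == DefaultFilter());
}
```

One limitation of this approach: a user who explicitly passes literal true cannot be told apart from the default, which may be why a dedicated has_filter-style flag could still be worth having.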

westonpace

Sorry for the long delay. I have a few suggestions but overall I think this looks good.

cpp/src/arrow/engine/substrait/relation_internal.cc

cpp/src/arrow/engine/substrait/serde_test.cc

westonpace · 2022-07-06T00:17:42Z

> Interesting thoughts. I will take a look at the tool. It would be better if we can use it to validate things. But I am not sure whether it needs to be inside the Arrow source or whether it should be a plugin for Apache Arrow. cc @westonpace

I think it would make a lot of sense for unit tests to bring in the validator as a C dependency.

cpp/src/arrow/engine/substrait/relation_internal.cc

vibhatha · 2022-07-06T02:33:20Z

> > Interesting thoughts. I will take a look at the tool. It would be better if we can use it to validate things. But I am not sure whether it needs to be inside the Arrow source or whether it should be a plugin for Apache Arrow. cc @westonpace
>
> I think it would make a lot of sense for unit tests to bring in the validator as a C dependency.

Should we create a JIRA for this?

westonpace

A few more thoughts. Using parquet is fine. My only concern was the test data directory.

cpp/src/arrow/engine/substrait/relation_internal.cc

westonpace · 2022-07-11T23:57:34Z

cpp/src/arrow/engine/substrait/serde_test.cc

/tmp/ will not be portable once we support Windows URIs. Can you use arrow::internal::TemporaryDir from arrow/util/io_util.h?

Didn't we already run into an issue with Windows paths in Substrait? We added a skip for the Windows tests.

Path support is missing there:

GTEST_SKIP() << "ARROW-16392: Substrait File URI not supported for Windows";

Yes, but that JIRA will eventually be fixed. We don't want to make it harder to support Windows.

This should be okay now. I updated the test case.
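
A standalone sketch of the portability concern raised in this thread: rather than hardcoding /tmp/, ask the platform for its temporary directory. This uses only std::filesystem (C++17) as a stand-in. It is not the arrow::internal::TemporaryDir API mentioned above, and `MakeScratchDir` and its prefix argument are hypothetical.

```cpp
#include <cassert>
#include <filesystem>
#include <random>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: create a uniquely named scratch directory under the
// platform's temp location instead of assuming that /tmp/ exists.
inline fs::path MakeScratchDir(const std::string& prefix) {
  std::random_device rd;
  std::mt19937_64 gen(rd());
  for (int attempt = 0; attempt < 16; ++attempt) {
    fs::path candidate =
        fs::temp_directory_path() / (prefix + std::to_string(gen()));
    // create_directory returns false if the directory already exists,
    // so a true result means this call created it.
    if (fs::create_directory(candidate)) return candidate;
  }
  assert(false && "could not create a unique scratch directory");
  return {};
}
```

A test would then build its file name with `MakeScratchDir("serde-test-") / "test.parquet"` and remove the directory with `fs::remove_all` when done.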

westonpace · 2022-07-11T23:58:31Z

cpp/src/arrow/engine/substrait/serde_test.cc

The formatting here seems off

ping: when I reformat it, it comes out in this shape. I tried a few times. Not sure what's wrong.

westonpace · 2022-07-11T23:59:17Z

cpp/src/arrow/engine/substrait/serde_test.cc

I'm not sure I understand this note.

Ah, instead of just saying test.parquet, we say /test.parquet.

westonpace · 2022-07-12T00:00:16Z

cpp/src/arrow/engine/substrait/serde_test.cc

What is the des prefix for? Can you use a whole word?

des was meant to represent deserialized. I will update its usage.

cpp/src/arrow/engine/substrait/serde_test.cc

vibhatha · 2022-07-12T00:40:31Z

> A few more thoughts. Using parquet is fine. My only concern was the test data directory.

We got rid of that. By the way, this code would remain very much the same, but the usage would be different once it goes through a registry in the Substrait ToProto methods.

vibhatha · 2022-07-13T07:17:12Z

** THIS PR IS UNDERGOING A REFACTOR **

vibhatha · 2022-09-07T15:06:25Z

@westonpace I added a fix for the path issue on Mac. I think now it is more generalized.

Any other suggestions?

westonpace

This is close. Mostly just some cleanup comments at this point.

cpp/src/arrow/engine/substrait/plan_internal.h

cpp/src/arrow/engine/substrait/relation_internal.cc

westonpace · 2022-09-08T00:23:54Z

cpp/src/arrow/engine/substrait/relation_internal.cc

+    }
+    read_rel_lfs->mutable_items()->AddAllocated(read_rel_lfs_ffs.release());
+  }
+  read_rel->set_allocated_local_files(read_rel_lfs.release());


Can we have a follow-up JIRA to add support for scan options projection & filter? I don't think it should be done as part of this JIRA since it is changing.

Nice catch. Jira created: https://issues.apache.org/jira/browse/ARROW-17647
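
The diff fragment in this thread hands objects held in std::unique_ptr to protobuf setters via release(). Below is a minimal standalone sketch of that ownership handoff. It uses hypothetical stand-in types (`FileItem`, `LocalFiles`, `ReadRel`, `MakeReadRel`) rather than the generated Substrait classes, and it assumes, as protobuf's set_allocated_* and AddAllocated do, that the callee takes ownership of the raw pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins for generated protobuf messages.
struct FileItem {
  std::string uri_file;
};

class LocalFiles {
 public:
  // Takes ownership of `item`, mirroring RepeatedPtrField::AddAllocated.
  void AddAllocated(FileItem* item) { items_.emplace_back(item); }
  std::size_t size() const { return items_.size(); }
  const FileItem& at(std::size_t i) const { return *items_.at(i); }

 private:
  std::vector<std::unique_ptr<FileItem>> items_;
};

class ReadRel {
 public:
  // Takes ownership of `files`, mirroring set_allocated_local_files.
  void set_allocated_local_files(LocalFiles* files) { local_files_.reset(files); }
  const LocalFiles* local_files() const { return local_files_.get(); }

 private:
  std::unique_ptr<LocalFiles> local_files_;
};

// Build a ReadRel from a list of file URIs, releasing each unique_ptr
// exactly once, at the point where ownership is handed over.
inline std::unique_ptr<ReadRel> MakeReadRel(const std::vector<std::string>& uris) {
  auto read_rel = std::make_unique<ReadRel>();
  auto local_files = std::make_unique<LocalFiles>();
  for (const auto& uri : uris) {
    auto item = std::make_unique<FileItem>();
    item->uri_file = uri;
    local_files->AddAllocated(item.release());
  }
  read_rel->set_allocated_local_files(local_files.release());
  return read_rel;
}
```

Keeping each object in a unique_ptr until the handoff means an early return between allocation and release() does not leak.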

cpp/src/arrow/engine/substrait/relation_internal.h

cpp/src/arrow/util/io_util.cc

cpp/src/arrow/util/uri.h

vibhatha · 2022-09-08T11:16:36Z

@westonpace I updated the PR. A few CI jobs are failing, but the failures don't seem to be related to the changes applied here.
WDYT?

westonpace

Thanks for sticking with this.

vibhatha · 2022-09-08T16:01:14Z

@westonpace Thanks a lot for keeping up with the major changes and a few rounds of reviews. 👍

ursabot · 2022-09-08T21:22:56Z

Benchmark runs are scheduled for baseline = 8fe7e35 and contender = 7475605. 7475605 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.31% ⬆️0.2%] test-mac-arm
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.46% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 74756051 ec2-t3-xlarge-us-east-2
[Finished] 74756051 test-mac-arm
[Failed] 74756051 ursa-i9-9960x
[Finished] 74756051 ursa-thinkcentre-m75q
[Finished] 8fe7e353 ec2-t3-xlarge-us-east-2
[Finished] 8fe7e353 test-mac-arm
[Failed] 8fe7e353 ursa-i9-9960x
[Finished] 8fe7e353 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

This is the initial PR to set the util functions and structure to include the `ToProto` functionality to relations. Here the objective is to create an ACERO relation by interpreting what is included in a Substrait-Relation. In this PR the `read` relation ToProto is added.

Authored-by: Vibhatha Abeykoon <vibhatha@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>

github-actions bot added the Component: C++ label Jun 20, 2022

vibhatha commented Jun 20, 2022

cpp/src/arrow/engine/substrait/serde_test.cc (outdated)

vibhatha commented Jun 20, 2022

cpp/src/arrow/engine/substrait/serde_test.cc (outdated)

vibhatha marked this pull request as ready for review June 23, 2022 12:58

westonpace requested changes Jun 27, 2022

vibhatha requested review from lidavidm and westonpace June 27, 2022 14:36

vibhatha force-pushed the arrow-16855 branch from 9c34ff3 to 14cc638 Compare June 27, 2022 15:37

westonpace reviewed Jul 6, 2022

cpp/src/arrow/engine/substrait/relation_internal.cc

cpp/src/arrow/engine/substrait/relation_internal.cc (outdated)

vibhatha force-pushed the arrow-16855 branch from 14cc638 to dd78cfa Compare July 6, 2022 04:47

vibhatha requested a review from westonpace July 7, 2022 09:16

westonpace requested changes Jul 12, 2022

vibhatha force-pushed the arrow-16855 branch from 3c4e7f3 to d984627 Compare July 13, 2022 07:16

vibhatha marked this pull request as draft July 13, 2022 07:16

vibhatha force-pushed the arrow-16855 branch from d984627 to 61c1c55 Compare July 18, 2022 12:37

vibhatha closed this Jul 27, 2022

vibhatha force-pushed the arrow-16855 branch from bfa6efe to 49ae8fa Compare July 27, 2022 04:56

vibhatha added 14 commits September 6, 2022 11:16

fix(review): addressing a previous review comment

1a179a1

fix(review): addressing review comment

630524a

fix(code): missed move op added

ce13740

fix(path): using ToNative instead of ToString

e6abfc9

fix(docs): added conversion_options to docstring

f07de57

fix(rebase): rebasing with Substrait changes

ea878ea

fix(address_review): refactor

1daecba

fix(registry): cleaning up registry

ea8c557

fix(reviews): uri fix, remove SetRelation, simplify code

ef407b0

fix(cleanup): raddressing reviews

6571de2

fix(reviews): updated input handling

3841bfb

fix(native): updated the file_path method to check CI failure

b9d6f07

fix(ipc): adding ipc write replacing parquet

0479dac

fix(file_path_issue): temp commit

c1de2b8

vibhatha force-pushed the arrow-16855 branch from e4dac09 to c1de2b8 Compare September 6, 2022 05:56

vibhatha added 2 commits September 6, 2022 18:10

fix(temp): testing a fix for additional slash in file handling

3156fd2

fix(path-issue): fixed the path issue and updated the test cases

491a985

vibhatha mentioned this pull request Sep 7, 2022

ARROW-15584: [C++] Add support for Substrait's RelCommon::Emit #13914

Merged

fix(path): windows issue fixing

33d7753

westonpace requested changes Sep 8, 2022

fix(reviews): address reviews

616d6e5

westonpace approved these changes Sep 8, 2022

westonpace merged commit 7475605 into apache:master Sep 8, 2022

vibhatha mentioned this pull request Sep 13, 2022

ARROW-16856: [C++] Adding Filter Relation ToProto #13452

Closed
