ARROW-16033: [C++] Pass schema to consuming sink node #12721
There is an entirely different implementation we could take (pass the schema in an init method before anything runs): Pros:
Opinions? @lidavidm @vibhatha @paleolimbot |
|
I would probably prefer Init/Consume/Finish; we could default-implement Init if that's a concern |
|
Taking this with the grain of salt that I first heard of a SinkNodeConsumer a few weeks ago, passing the schema in |
|
This one is cleaner. I also prefer this method for consuming_sink. One question: does this take care of the `SinkNode` and `OrderBySink` nodes too? We may also need to know the schema of the batches we are going to get out of a given sink, am I right? I am merely considering the general case for sink nodes. |
|
Another advantage is that this may also solve this issue: https://issues.apache.org/jira/browse/ARROW-15297 @westonpace @lidavidm what do you think? |
It doesn't currently fix
Almost. It looks like the Python test that was failing on that issue is failing here too. I think we can move towards |
|
It looks like this PR can cover this ground as well if we can address these issues. |
I've updated this PR to use an I could add a default implementation that actually stored the schema (in a protected variable) but I kind of prefer pure-virtual interfaces for the public API, though maybe that is just my Java/C# background. Happy to change if anyone has any strong opinion on the matter.
This changes the
I've added a test to make sure we test the failure path (i.e. |
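To make the Init/Consume/Finish shape discussed above concrete, here is a minimal sketch. It does not use Arrow: `Status`, `Schema`, `Batch`, `SinkConsumer`, `RowCounter` and `Drive` are all hypothetical stand-ins, and the only detail taken from this thread is that `Init` receives the schema before any batch is consumed. The failure path shown (an `Init` that rejects a null schema) is an assumption chosen for illustration, not necessarily the case covered by the PR's test.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Stand-in types (NOT Arrow's): just enough to show the call order.
struct Status {
  std::string error;  // empty means OK
  bool ok() const { return error.empty(); }
  static Status OK() { return Status{}; }
  static Status Invalid(std::string msg) { return Status{std::move(msg)}; }
};
struct Schema { std::vector<std::string> field_names; };
struct Batch { int num_rows; };

// Pure-virtual consumer interface: Init sees the schema before any batch.
class SinkConsumer {
 public:
  virtual ~SinkConsumer() = default;
  virtual Status Init(const std::shared_ptr<Schema>& schema) = 0;
  virtual Status Consume(const Batch& batch) = 0;
  virtual Status Finish() = 0;
};

// Hypothetical consumer that keeps the schema itself and counts rows.
class RowCounter : public SinkConsumer {
 public:
  Status Init(const std::shared_ptr<Schema>& schema) override {
    if (!schema) return Status::Invalid("schema must not be null");
    schema_ = schema;
    return Status::OK();
  }
  Status Consume(const Batch& batch) override {
    rows_ += batch.num_rows;
    return Status::OK();
  }
  Status Finish() override {
    finished_ = true;
    return Status::OK();
  }

  const std::shared_ptr<Schema>& schema() const { return schema_; }
  long rows() const { return rows_; }
  bool finished() const { return finished_; }

 private:
  std::shared_ptr<Schema> schema_;
  long rows_ = 0;
  bool finished_ = false;
};

// Hypothetical driver: Init once, Consume per batch, then Finish.
// A failing Init stops the run before any batch is delivered.
Status Drive(SinkConsumer& consumer, const std::shared_ptr<Schema>& schema,
             const std::vector<Batch>& batches) {
  Status st = consumer.Init(schema);
  if (!st.ok()) return st;
  for (const Batch& batch : batches) {
    st = consumer.Consume(batch);
    if (!st.ok()) return st;
  }
  return consumer.Finish();
}
```

With this shape a consumer knows the output schema even when zero batches arrive. A default `Init` that stores the schema in a protected member would be the alternative mentioned above; the sketch keeps the interface pure-virtual.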
cpp/src/arrow/compute/exec/options.h
Outdated
@@ -304,10 +310,9 @@ class ARROW_EXPORT TableSinkNodeOptions : public ExecNodeOptions {
 public:
  TableSinkNodeOptions(std::shared_ptr<Table>* output_table,
                       std::shared_ptr<Schema> output_schema)
We can remove this second argument now right?
Good catch. I've removed it.
vibhatha
left a comment
Looks good to me @westonpace. This will help in exposing the Substrait plan very neatly. I cannot think of any additional test cases, since this basically provides a method to extract the schema in user code.
65d3a08 to 5f04ffd
|
It seems an R change managed to sneak in that exposes a little dilemma. The output schema is including the augmented fields, even though the Substrait plan didn't call for them. I'm still processing how we want to handle this but I'm thinking right now we just make sure not to include augmented fields when running against a Substrait plan. |
|
I did notice there were unexpected fields (although because there was no schema I had no idea what they were!). At least the ability to turn them off would be nice (or else we're still stuck having to calculate the column names in advance to know what columns not to pass on to the user). |
…s the schema along with each batch
…g schema to consume. Added back in support for custom metadata which is needed for python tests to pass.
5f04ffd to aad8039
|
I added an explicit selection step to drop those columns. I think the proper fix will have to be done in a follow-up. I added a comment to the R test to this effect. @paleolimbot mind taking a quick look for sanity's sake? Then I will merge. |
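As a rough illustration of what an explicit selection step amounts to, the sketch below keeps only the fields a plan asked for and drops everything else from the sink's output. It works on plain field names rather than Arrow schemas; `SelectRequested` is a hypothetical helper, and the `__`-prefixed names in the usage note are illustrative assumptions about what augmented fields might look like, not something taken from this PR.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Keep only the fields the plan requested, preserving the order in which
// they appear in the sink's output. Field names are plain strings here, a
// stand-in for a real schema.
std::vector<std::string> SelectRequested(
    const std::vector<std::string>& output_fields,
    const std::vector<std::string>& requested) {
  const std::unordered_set<std::string> wanted(requested.begin(),
                                               requested.end());
  std::vector<std::string> kept;
  for (const std::string& name : output_fields) {
    if (wanted.count(name) > 0) kept.push_back(name);
  }
  return kept;
}
```

For example, with output fields `x`, `y`, `__fragment_index`, `__batch_index` and a request for `x` and `y`, only `x` and `y` survive. The cost is that the caller must still know the requested names up front, which is the reason a proper fix is deferred to a follow-up.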
paleolimbot
left a comment
This is much easier than what I was doing before...thank you!
|
Benchmark runs are scheduled for baseline = 9e08c50 and contender = 45a97e1. 45a97e1 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
…ROW_R_WITH_ENGINE

After ARROW-16033 (#12721) we get this compiler warning when compiling with `ARROW_R_WITH_ENGINE`:

```
compute-exec.cpp:304:17: warning: 'Init' overrides a member function but is not marked 'override' [-Winconsistent-missing-override]
  arrow::Status Init(const std::shared_ptr<arrow::Schema>& schema) {
                ^
/Users/deweydunnington/.r-arrow-dev-build/dist/include/arrow/compute/exec/options.h:153:18: note: overridden virtual function is here
  virtual Status Init(const std::shared_ptr<Schema>& schema) = 0;
                 ^
1 warning generated.
```

This PR just adds the requisite `override`.

Closes #12823 from paleolimbot/r-minor-override

Authored-by: Dewey Dunnington <dewey@fishandwhistle.net>
Signed-off-by: Jonathan Keane <jkeane@gmail.com>
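For readers unfamiliar with the warning, the sketch below shows the pattern the fix applies. `Schema`, `Status`, `Consumer` and `SchemaRecorder` are hypothetical stand-ins, not the Arrow or R binding classes; only the `Init(const std::shared_ptr<Schema>&)` signature mirrors the one quoted in the warning.

```cpp
#include <cassert>
#include <memory>

// Stand-in types (NOT Arrow's).
struct Schema { int num_fields; };
struct Status { bool ok; };

class Consumer {
 public:
  virtual ~Consumer() = default;
  virtual Status Init(const std::shared_ptr<Schema>& schema) = 0;
};

class SchemaRecorder : public Consumer {
 public:
  // Marking the function `override` silences the warning and makes the
  // compiler reject this declaration if it ever stops matching a virtual
  // function in the base class.
  Status Init(const std::shared_ptr<Schema>& schema) override {
    num_fields_ = schema ? schema->num_fields : -1;
    return Status{schema != nullptr};
  }
  int num_fields() const { return num_fields_; }

 private:
  int num_fields_ = -1;
};
```

Without `override` the code still compiles and dispatches correctly; the keyword matters because a later signature change in the base class would otherwise silently turn the derived function into an unrelated overload.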