ARROW-14970: [C++] Make ExecNodes can generate/consume tasks #11923
Conversation
Force-pushed from 51e9d0d to f281e50
cc @bkietz for visibility
westonpace left a comment
I'll take a second look at this on Monday but I have some initial questions.
I would expect the signature to be:

```cpp
virtual void InputReceived(ExecNode* input,
                           std::function<Result<ExecBatch>(ExecBatch)> task) = 0;
```
I'm not sure what it means to have batch and task?
Is this some kind of intermediate step between the two models?
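For reference, a minimal sketch of the difference between the two shapes under discussion (all types here are illustrative stand-ins, not the real Arrow API):

```cpp
#include <functional>

// Illustrative stand-in only; the real ExecBatch/Result types live in Arrow.
struct ExecBatch {
  int length = 0;
};

// The PR's current shape: a deferred task that *produces* a batch.
using ProducerTask = std::function<ExecBatch()>;
// The shape suggested above: a task that *transforms* a batch it is given.
using TransformTask = std::function<ExecBatch(ExecBatch)>;

// Binding a batch to a transform yields a producer task, which is one way
// the two shapes relate to each other.
inline ProducerTask Bind(TransformTask transform, ExecBatch batch) {
  return [=] { return transform(batch); };
}
```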
This isn't really a task though is it. I was expecting something like...
```cpp
static inline std::function<Result<ExecBatch>(ExecBatch)> IdentityTask() {
  return [](ExecBatch batch) { return batch; };
}
```
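As a hedged sketch (stand-in types, `Result<>` dropped, helper names invented), such an identity task would compose with a node's own work like this:

```cpp
#include <functional>

// Stand-in for the real Arrow ExecBatch; for illustration only.
struct ExecBatch {
  int length = 0;
};

using BatchTask = std::function<ExecBatch(ExecBatch)>;

// The identity task from the comment above, with Result<> dropped for brevity.
inline BatchTask IdentityTask() {
  return [](ExecBatch batch) { return batch; };
}

// Fuse a node's own transform onto an upstream task (function composition).
inline BatchTask Compose(BatchTask upstream, BatchTask own) {
  return [=](ExecBatch batch) { return own(upstream(batch)); };
}
```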
Thanks for the feedback @westonpace!
Force-pushed from f281e50 to 3dfd2d3
@westonpace I've sent some changes, feel free to take another look!
westonpace left a comment
This is good. I think we've just got things a bit switched at the moment though. The non-pipeline breakers (filter, project) are submitting tasks and the pipeline breakers are not.
What we want is the pipeline breakers to submit tasks and the ordinary nodes to compose the task.
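A minimal sketch of that division of labor (invented names and an `int` stand-in for `ExecBatch`; "submitting" here just means running the task inline):

```cpp
#include <functional>

// Illustrative stand-ins; not the real Arrow types or API.
using ExecBatch = int;
using Task = std::function<ExecBatch()>;

// A non-pipeline-breaker (e.g. project) fuses its work onto the incoming
// task and forwards the fused task downstream instead of running it.
inline Task ProjectForward(Task upstream) {
  return [upstream] { return upstream() * 2; };  // "project": double the value
}

// A pipeline breaker (e.g. sink) is the one place the fused task is
// finally submitted; here "submitting" is simply running it.
inline ExecBatch SinkSubmit(Task task) { return task(); }
```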
```cpp
// Current code (runs the task inline):
auto prev = task();
if (!prev.ok()) {
  ErrorIfNotOk(prev.status());
  return;
}
if (ErrorIfNotOk(DoConsume(prev.MoveValueUnsafe(), thread_index))) return;

// Suggested shape (submits the work as a task instead; the lambda takes the
// task itself, not an already-computed Result):
auto func = [this](std::function<Result<ExecBatch>()> task) {
  ARROW_ASSIGN_OR_RAISE(auto prev, task());
  auto thread_index = get_thread_index_();
  return DoConsume(std::move(prev), thread_index);
};
plan_->scheduler()->SubmitTask(std::move(func));
```
This is what I'm thinking pipeline breakers would look like.
> `plan_->scheduler()->SubmitTask(std::move(func));`

Yes, that is the idea, but this PR only enables that construction later; it does not define any scheduler or task-submission logic.
If we aren't going to address this now, let's make another JIRA (taskify 3?), something like "Fix logic in existing nodes so that pipeline breakers submit and non-breakers forward", and then add a comment in all of these spots along the lines of:

```cpp
// This node should be forwarding the task downstream but that will be addressed in ARROW-XYZ
```
This is good 👍
This would actually not create a task but forward it downstream, like filter/project.
So what will this eventually look like? If we assume we don't know how many batches a scanner will emit then how many "scan tasks" do we submit individually? I suppose we can always "over-submit" and then the final tasks will just abandon themselves if the scanner is finished. Could this be another spot for backpressure? I don't think we have to solve all of these problems right now.
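A hypothetical sketch of the over-submit idea, where each scan task checks whether the scanner is exhausted and abandons itself; none of these names come from the Arrow codebase:

```cpp
#include <atomic>
#include <optional>

// Hypothetical scanner with an unknown-but-finite number of batches.
struct Scanner {
  std::atomic<int> remaining{3};

  // Each submitted "scan task" calls this; once the scanner is exhausted
  // the task receives std::nullopt and simply abandons itself.
  std::optional<int> NextBatch() {
    int before = remaining.fetch_sub(1);
    if (before <= 0) return std::nullopt;  // over-submitted: nothing left to do
    return before;                         // pretend this is a real batch
  }
};
```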
Force-pushed from 694d209 to b5dcaa4
We will want to update the docs in
Force-pushed from b5dcaa4 to 243cbb4
Commits in this push:
- minor changes
- format
- improve the task API for ExecNodes
- format
- fix guarantee issues with project and filter nodes
- minor format
- fix build dataset examples
- fix arrow compute docs
Force-pushed from 243cbb4 to 0a58b6f
Done! Good catch, thanks!
westonpace left a comment
Since we changed the interface there are a number of nodes that aren't really doing things the right way. I agree we don't need to convert all of them right away (since they still work correctly). As a compromise can we add comments in all the places we will need to change referencing a JIRA that will implement that change?
Once those are in place I think this is good to go.
```cpp
// by an input of this node to push a task here for processing.
// For non-terminating nodes (e.g. filter/project/etc.): the node can wrap
// its own work with the task (using function composition/fusing) and then
// call InputReceived on the downstream node.
// A "terminating node" (e.g. sink node / pipeline breaker) could then submit
// the task to a scheduler.
void InputReceived(ExecNode* input,
                   std::function<Result<ExecBatch>()> task) override {
```
```diff
 // by an input of this node to push a task here for processing.
-// For non-terminating nodes (e.g. filter/project/etc.): the node can wrap
-// its own work with the task (using function composition/fusing) and then
+// Non-terminating nodes (e.g. filter/project/etc.) should wrap
+// their own work with the task (using function composition/fusing) and then
 // call InputReceived on the downstream node.
-// A "terminating node" (e.g. sink node / pipeline breaker) could then submit
-// the task to a scheduler.
+// Terminating nodes (e.g. sink node / pipeline breaker) should submit
+// the task to an executor or task group.
 void InputReceived(ExecNode* input,
                    std::function<Result<ExecBatch>()> task) override {
```
Some minor wording changes, and removing the term "scheduler" since one doesn't exist yet.
Closing because it has been untouched for a while; if it's still relevant, feel free to reopen and move it forward 👍
https://issues.apache.org/jira/browse/ARROW-14970