Skip to content

Conversation

@aucahuasi
Copy link
Contributor

@github-actions
Copy link

github-actions bot commented Dec 9, 2021

@github-actions
Copy link

github-actions bot commented Dec 9, 2021

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@aucahuasi aucahuasi force-pushed the ARROW-14970-taskfy-part2 branch 3 times, most recently from 51e9d0d to f281e50 Compare December 10, 2021 01:56
@aucahuasi aucahuasi marked this pull request as ready for review December 10, 2021 14:17
@aucahuasi
Copy link
Contributor Author

cc @bkietz for visibility

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take a second look at this on Monday but I have some initial questions.

Comment on lines 137 to 139
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect the signature to be:

virtual void InputReceived(ExecNode* input, std::function<Result<ExecBatch>(ExecBatch)> task) = 0;

I'm not sure what it means to have batch and task?

Is this some kind of intermediate step between the two models?

Comment on lines 230 to 233
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't really a task though is it. I was expecting something like...

static inline std::function<Result<ExecBatch>(ExecBatch)> IdentityTask() { return [] (ExecBatch batch) { return batch; }; }

@westonpace westonpace self-requested a review December 11, 2021 03:02
@aucahuasi
Copy link
Contributor Author

Thanks for the feedback @westonpace !
I'm working now for improving all of this

@aucahuasi aucahuasi force-pushed the ARROW-14970-taskfy-part2 branch from f281e50 to 3dfd2d3 Compare December 13, 2021 16:09
@aucahuasi
Copy link
Contributor Author

@westonpace I've sent some changes, feel free to take another look!

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good. I think we've just got things a bit switched at the moment though. The non-pipeline breakers (filter, project) are submitting tasks and the pipeline breakers are not.

What we want is the pipeline breakers to submit tasks and the ordinary nodes to compose the task.

Comment on lines +182 to +187
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto prev = task();
if (!prev.ok()) {
ErrorIfNotOk(prev.status());
return;
}
if (ErrorIfNotOk(DoConsume(prev.MoveValueUnsafe(), thread_index))) return;
auto func = [this] (Result<ExecBatch> task) {
ARROW_ASSIGN_OR_RAISE(auto prev, task());
auto thread_index = get_thread_index_();
return DoConsume(prev.MoveValueUnsafe(), thread_index);
};
plan_->scheduler()->SubmitTask(std::move(func));

This is what I'm thinking pipeline breakers would look like.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plan_->scheduler()->SubmitTask(std::move(func));

Yes that is the idea, but this PR is to enable that construction later, this PR is not going to define any scheduler or submitting logic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we aren't going to address this now let's make another JIRA (taskify 3?) Something like, "Fix logic in existing nodes so that pipeline breakers submit and non-breakers forward" and then add a comment in all of these spots along the lines of...

// This node should be forwarding the task downstream but that will be addressed in ARROW-XYZ

Comment on lines +241 to +252
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good 👍

Comment on lines -110 to +125
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would actually not create a task but forward to downstream like filter/project.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what will this eventually look like? If we assume we don't know how many batches a scanner will emit then how many "scan tasks" do we submit individually? I suppose we can always "over-submit" and then the final tasks will just abandon themselves if the scanner is finished. Could this be another spot for backpressure? I don't think we have to solve all of these problems right now.

@aucahuasi aucahuasi force-pushed the ARROW-14970-taskfy-part2 branch 4 times, most recently from 694d209 to b5dcaa4 Compare December 14, 2021 16:48
@westonpace
Copy link
Member

We will want to update the docs in docs/source/cpp/streaming_execution.rst as well. I think it's ok in those docs to mention the concept of a scheduler / task submission if needed to explain the API.

@aucahuasi aucahuasi force-pushed the ARROW-14970-taskfy-part2 branch from b5dcaa4 to 243cbb4 Compare December 14, 2021 21:09
minor changes

format

improve the task API for ExecNodes

format

fix guarantee issues with project and filter nodes

minor format

fix build dataset examples

fix arrow compute docs
@aucahuasi aucahuasi force-pushed the ARROW-14970-taskfy-part2 branch from 243cbb4 to 0a58b6f Compare December 14, 2021 21:26
@aucahuasi
Copy link
Contributor Author

We will want to update the docs in docs/source/cpp/streaming_execution.rst

Done! Good catch! thanks.

@westonpace westonpace self-requested a review December 16, 2021 22:06
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we changed the interface there are a number of nodes that aren't really doing things the right way. I agree we don't need to convert all of them right away (since they still work correctly). As a compromise can we add comments in all the places we will need to change referencing a JIRA that will implement that change?

Once those are in place I think this is good to go.

Comment on lines +98 to +105
// by an input of this node to push a task here for processing.
// For non-terminating nodes (e.g. filter/project/etc.): the node can wrap
// its own work with the task (using function composition/fusing) and then
// call InputReceived on the downstream node.
// A "terminating node" (e.g. sink node / pipeline breaker) could then submit
// the task to a scheduler.
void InputReceived(ExecNode* input,
std::function<Result<ExecBatch>()> task) override {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// by an input of this node to push a task here for processing.
// For non-terminating nodes (e.g. filter/project/etc.): the node can wrap
// its own work with the task (using function composition/fusing) and then
// call InputReceived on the downstream node.
// A "terminating node" (e.g. sink node / pipeline breaker) could then submit
// the task to a scheduler.
void InputReceived(ExecNode* input,
std::function<Result<ExecBatch>()> task) override {
// by an input of this node to push a task here for processing.
// Non-terminating nodes (e.g. filter/project/etc.): should wrap
// their own work with the task (using function composition/fusing) and then
// call InputReceived on the downstream node.
// Terminating nodes (e.g. sink node / pipeline breaker) should submit
// the task to an executor or task group.
void InputReceived(ExecNode* input,
std::function<Result<ExecBatch>()> task) override {

Some minor wording and removing the term scheduler as one doesn't exist yet.

Comment on lines +182 to +187
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we aren't going to address this now let's make another JIRA (taskify 3?) Something like, "Fix logic in existing nodes so that pipeline breakers submit and non-breakers forward" and then add a comment in all of these spots along the lines of...

// This node should be forwarding the task downstream but that will be addressed in ARROW-XYZ

Comment on lines -110 to +125
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what will this eventually look like? If we assume we don't know how many batches a scanner will emit then how many "scan tasks" do we submit individually? I suppose we can always "over-submit" and then the final tasks will just abandon themselves if the scanner is finished. Could this be another spot for backpressure? I don't think we have to solve all of these problems right now.

@amol-
Copy link
Member

amol- commented Mar 30, 2023

Closing because it has been untouched for a while, in case it's still relevant feel free to reopen and move it forward 👍

@amol- amol- closed this Mar 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants