
Conversation

@bkietz (Member) commented May 25, 2021

So far this has involved a lot of refactoring of Expressions to make them compatible with ExecBatches. The next step is to add a ScanNode wrapping a ScannerBuilder.

@bkietz requested a review from pitrou on May 25, 2021 21:05

@lidavidm (Member) left a comment

This is a lot to digest, but I think it looks good overall, with much of it being refactoring or restructuring. (Thank you for taking care of cleaning things up all over!)

Member

Given that stopping a producer doesn't necessarily terminate everything immediately, the consumer needs to be prepared to receive and handle/ignore an error anyway.

Member Author

I'd agree that handling/ignoring trailing batches is necessary; the producer may take a while to stop. However, I wonder if it's reasonable to do the same for trailing errors. For example: say we have a plan where a LimitNode is taking the first 99 rows from EmitsErrorAfterHundredthRowNode. There's a race condition here (which also depends on chunking): the LimitNode will sometimes receive the trailing error before it can stop the producer, and will sometimes succeed in stopping its producer before it gets around to raising an error. I'm not sure what the correct answer is, but I lean toward: if any node emits any error, that always puts all subsequent nodes into an error state too (unless explicitly intercepted). The above example seems like a problem we need to fix in EmitsErrorAfterHundredthRowNode rather than requiring all consumers to ignore post-stop errors.

Member

Ah, I see. I think I have the same inclination then; except perhaps for a sink node that has already received all its results (in which case subsequent errors are probably irrelevant), propagating errors even when otherwise 'finished' makes sense.

Member

I would agree to keep the logic as simple as possible.

@westonpace (Member) left a comment

At first glance it seems like a good overall cleanup, thanks. How do you see things evolving? Do you think the various operations achieved by a scanner today will be achieved by an execution plan? For example, will ScanBatches, CountRows, etc. create and execute an execution plan instead of maintaining the dual paths?

Comment on lines +520 to +473
Member

I'm not sure what is going on here (though that is likely my own problem). If the value is a scalar record batch, you want to end up with each value being a scalar. Can you not just grab the first item from each column of partial_array? Why do you need to go back in and patch things up?

Member Author

This was as compact as I could write this case; if you see a way to compress/simplify it then I'll take it, but the scalar/array cases are really just for testing purposes.

@pitrou (Member) left a comment

I only took a partial look at this.

From a high-level point of view, it seems to me that we want an entire execution pipeline to be wrapped in a single exec node, rather than having individual project, filter... nodes?

Member

The problem is that we'll want to be able to apply backpressure at some point, but a generator doesn't allow for that. So it seems that, instead of wrapping a generator, you should really have an ExecNode that wraps a dataset scanner directly.

Member

Pull-based models (e.g. generators) apply backpressure by default: you have to specifically ask for each item of data that you want. (Although it's turned around, so maybe it's right to just call it pressure.) If we want to apply it here, it could be done by adding a flag in the loop (perhaps near if (finished_)) that looks something like...

if (pause_future_) {
  return pause_future_;
}

Then PauseProducing becomes:

pause_future_ = Future<>::Make();

and ResumeProducing becomes:

pause_future_.MarkFinished();
pause_future_ = Future<>(); // Maybe we need a `Reset` or `MakeInvalid`
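
For illustration, a self-contained sketch of that shape, using std::promise/std::shared_future as a stand-in for arrow::Future<> (the class and method names below are placeholders, not the actual ExecNode API):

#include <future>

// Sketch of the pause/resume gate described above: PauseProducing() creates a
// future, ResumeProducing() completes and resets it, and the production loop
// checks paused() before emitting (as in the `if (pause_future_)` fragment).
class PauseGuard {
 public:
  bool paused() const { return pause_future_.valid(); }
  std::shared_future<void> pause_future() const { return pause_future_; }

  void PauseProducing() {
    pause_promise_ = std::promise<void>();
    pause_future_ = pause_promise_.get_future().share();
  }

  void ResumeProducing() {
    // A real implementation would guard against resuming when not paused.
    pause_promise_.set_value();  // wake anything waiting on the future
    pause_future_ = {};          // back to the "not paused" state
  }

 private:
  std::promise<void> pause_promise_;
  std::shared_future<void> pause_future_;
};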

Member

Oh, you're right, my bad.

Member

For any kind of "map-like" node that has an input and an output, a call to PauseProducing should always call PauseProducing on the input. That's the only way to ensure that back pressure is properly channeled to the source (which can actually pause).
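
As a toy illustration of that forwarding (not the real ExecNode interface; the names and signatures below are simplified assumptions):

// Toy node hierarchy, only to show the pass-through of back pressure.
struct Node {
  virtual ~Node() = default;
  virtual void PauseProducing() {}
  virtual void ResumeProducing() {}
};

struct MapLikeNode : Node {
  explicit MapLikeNode(Node* input) : input_(input) {}
  // A map-like node can't buffer meaningfully, so it simply forwards the
  // calls until they reach a source that can actually pause.
  void PauseProducing() override { input_->PauseProducing(); }
  void ResumeProducing() override { input_->ResumeProducing(); }
  Node* input_;
};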

Member Author

These are currently stubs due to the lack of support for pausing in any source node. For now, I'll remove these and add a follow-up to support pause/resume.

Member

My vote is that ErrorReceived is sufficient. I think a node could recover from a failure but, if it does so, it shouldn't call ErrorReceived.

Member

I'm pondering how back pressure would be applied. I think there would be a new argument added to this SinkNode for max_items_queued or something like that. However, we could not naively apply that limit to received_batches_ because of the resequencing.

Since we are delivering to a pull-based model, I think the appropriate way to apply back pressure would be to have the PushGenerator keep track of how many undelivered items it has. Then there would need to be a check in this code: after pushing, if the PushGenerator is full, apply back pressure to the inputs. The PushGenerator would also need some way of signalling back into the SinkNode that the pressure has been relieved and it is ready for more items.

I don't think this has to be implemented now, but does that sound reasonable?
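
To make the counting idea concrete, a rough standalone sketch (the class, max_items_queued, and the pause/resume callbacks are placeholders, not existing Arrow APIs):

#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <utility>

// Toy stand-in for a push generator that tracks undelivered items and signals
// when back pressure should be applied or relieved.
template <typename T>
class CountingPushQueue {
 public:
  CountingPushQueue(std::size_t max_items_queued, std::function<void()> pause_inputs,
                    std::function<void()> resume_inputs)
      : max_(max_items_queued),
        pause_inputs_(std::move(pause_inputs)),
        resume_inputs_(std::move(resume_inputs)) {}

  // Called by the sink when a batch is delivered (after any resequencing).
  void Push(T item) {
    std::lock_guard<std::mutex> lock(mutex_);
    queue_.push(std::move(item));
    if (queue_.size() >= max_) pause_inputs_();  // apply back pressure
  }

  // Called by the consumer pulling from the sink's generator.
  bool Pop(T* out) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (queue_.empty()) return false;
    *out = std::move(queue_.front());
    queue_.pop();
    // A real implementation would only resume if it had actually paused.
    if (queue_.size() < max_) resume_inputs_();  // pressure relieved
    return true;
  }

 private:
  std::size_t max_;
  std::function<void()> pause_inputs_;
  std::function<void()> resume_inputs_;
  std::mutex mutex_;
  std::queue<T> queue_;
};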

Member Author

This sounds reasonable, but the prerequisite is to support pause/resume in a source node.

@bkietz (Member Author) commented Jun 2, 2021

> How do you see things evolving? Do you think the various operations achieved by a scanner today will be achieved by an execution plan? For example, will ScanBatches, CountRows, etc. create and execute an execution plan instead of maintaining the dual paths?

I'd like the ExecPlan to be usable enough to replace all filtering and projection currently in Scanner. So, for example, ScanBatches could assemble an ExecPlan to handle filtering and projection, then receive and reorder batches, never needing to explicitly evaluate an expression.

Ultimately, I'm not positive we'll keep Scanner. It's possible we could simplify the dataset module to a factory for source/sink nodes. In that case, anything which currently builds a Scanner would instead produce an ExecPlan. We'll see.

@bkietz force-pushed the 11930-Refactor-Dataset-scans-to branch from bb8a0a7 to 5fe4d10 on June 8, 2021 23:30
@bkietz force-pushed the 11930-Refactor-Dataset-scans-to branch from 10f3aea to 1dff3bf on June 16, 2021 20:49
@bkietz marked this pull request as ready for review on June 17, 2021 19:53
@bkietz force-pushed the 11930-Refactor-Dataset-scans-to branch 2 times, most recently from 47710c1 to 0791f80 on June 21, 2021 14:42
@bkietz force-pushed the 11930-Refactor-Dataset-scans-to branch from 05365cf to 371d5ca on June 21, 2021 16:41
@bkietz (Member Author) commented Jun 21, 2021

@pitrou PTAL

@pitrou (Member) left a comment

Some more comments.

under the License.
-->

# ExecNodes and logical operators

Member

I'm not sure I understand the status of this document. If this is meant to be a persistent document, then can it be part of the Sphinx development docs?

Member Author

I'll promote this to a Sphinx doc in a follow up. https://issues.apache.org/jira/browse/ARROW-13227


/// \brief Like MapVector, but where the function can fail.
template <typename Fn, typename From = internal::call_traits::argument_type<0, Fn>,
          typename To = typename internal::call_traits::return_type<Fn>::ValueType>

Member

Why not use the decltype(declval) pattern here as well?
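
For reference, a minimal sketch of that alternative (simplified: it deduces the result type by invoking Fn directly, and omits the Result<> unwrapping and error propagation that the real MaybeMapVector does):

#include <type_traits>
#include <utility>
#include <vector>

// `To` is deduced with decltype/declval instead of call_traits.
template <typename Fn, typename From,
          typename To = std::decay_t<decltype(std::declval<Fn>()(std::declval<From>()))>>
std::vector<To> MapVectorSketch(Fn&& fn, const std::vector<From>& source) {
  std::vector<To> out;
  out.reserve(source.size());
  for (const auto& v : source) {
    out.push_back(fn(v));
  }
  return out;
}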

Member Author

There's not a good reason; just uniformity with getting From from call_traits.

if (auto name = this->name()) {
  return internal::MapVector([](int i) { return FieldPath{i}; },
                             schema.GetAllFieldIndices(*name));
}

Member

Can you add a test for this?
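
(For reference, such a test might look roughly like the following, assuming the name-based path is reachable via FieldRef::FindAll against a schema with a duplicated field name; a real test would use the suite's GTest macros rather than assert.)

#include <cassert>
#include <vector>

#include <arrow/api.h>

int main() {
  // Two fields share the name "a"; a name-based FieldRef should match both.
  auto schema = arrow::schema({arrow::field("a", arrow::int32()),
                               arrow::field("b", arrow::utf8()),
                               arrow::field("a", arrow::float64())});
  std::vector<arrow::FieldPath> matches = arrow::FieldRef("a").FindAll(*schema);
  assert(matches.size() == 2);
  return 0;
}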

@bkietz force-pushed the 11930-Refactor-Dataset-scans-to branch 2 times, most recently from 529a805 to 828c188 on June 30, 2021 16:25
@bkietz force-pushed the 11930-Refactor-Dataset-scans-to branch from 11bc3c1 to e91ef9f on June 30, 2021 20:01
@bkietz (Member Author) commented Jul 1, 2021

@pitrou I think I've addressed your comments. Could we merge this and address anything else in a follow-up?

@bkietz (Member Author) commented Jul 1, 2021

+1, merging

@bkietz closed this in 1ae979d on Jul 1, 2021
@bkietz deleted the 11930-Refactor-Dataset-scans-to branch on July 1, 2021 12:15