
Conversation

@bkietz (Member) commented May 25, 2021

So far this has involved a lot of refactoring of Expressions to make them compatible with ExecBatches. The next step is to add a ScanNode wrapping a ScannerBuilder.

@bkietz requested a review from pitrou on May 25, 2021 21:05

@lidavidm (Member) left a comment

This is a lot to digest, but I think it looks good overall, with much of it being refactoring or restructuring. (Thank you for taking care of cleaning things up all over!)

Member

Given that stopping a producer doesn't necessarily terminate everything immediately, the consumer needs to be prepared to receive and handle/ignore an error anyway.

Member Author

I'd agree that handling/ignoring trailing batches is necessary; the producer may take a while to stop. However, I wonder if it's reasonable to do the same for trailing errors. For example: say we have a plan where a LimitNode is taking the first 99 rows from EmitsErrorAfterHundredthRowNode. There's a race condition here (which also depends on chunking): the LimitNode will sometimes receive the trailing error before it can stop the producer, and will sometimes succeed in stopping its producer before it gets around to raising an error. I'm not sure what the correct answer is, but I lean toward: if any node emits any error, that always puts all subsequent nodes into an error state too (unless explicitly intercepted). The above example seems like a problem we need to fix in EmitsErrorAfterHundredthRowNode rather than requiring all consumers to ignore post-stop errors.

Member

Ah, I see. I think I have the same inclination then; except perhaps for a sink node that has already received all its results (in which case subsequent errors are probably irrelevant), propagating errors even when otherwise 'finished' makes sense.

Member

I would agree to keep the logic as simple as possible.

@westonpace (Member) left a comment

At first glance it seems like a good overall cleanup, thanks. How do you see things evolving? Do you think the various operations achieved by a scanner today will be achieved by an execution plan? For example, will ScanBatches, CountRows, etc. create and execute an execution plan instead of maintaining the dual paths?

Comment on lines +520 to +473
Member

I'm not sure what is going on here (though that is likely my own problem). If the value is a scalar record batch, you want to end up with each value being a scalar. Can you not just grab the first item from each column of partial_array? Why do you need to go back in and patch things up?

Member Author

This was as compact as I could write this case; if you see a way to compress/simplify it then I'll take it, but the scalar/array cases are really just for testing purposes.

@pitrou (Member) left a comment

I only took a partial look at this.

From a high-level point of view, it seems to me that we want an entire execution pipeline to be wrapped in a single exec node, rather than having individual project, filter... nodes?

Member

The problem is that we'll want to be able to apply backpressure at some point, but a generator doesn't allow for that. So it seems that, instead of wrapping a generator, you should really have an ExecNode that wraps a dataset scanner directly.

Member

Pull-based models (e.g. generators) apply backpressure by default: you have to specifically ask for each item of data that you want. (Although it's turned around, so maybe it's right to just call it pressure.) If we want to apply it here, it could be done by adding a flag in the loop (perhaps near if (finished_)) that looks something like...

if (pause_future_) {
  return pause_future_;
}

Then PauseProducing becomes:

pause_future_ = Future<>::Make();

and ResumeProducing becomes:

pause_future_.MarkFinished();
pause_future_ = Future<>(); // Maybe we need a `Reset` or `MakeInvalid`
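
For illustration, a self-contained sketch of that shape, using std::promise/std::shared_future as a stand-in for arrow::Future<> (the class and method names below are placeholders, not the actual ExecNode API):

#include <future>

// Sketch of the pause/resume gate described above: PauseProducing() creates a
// future, ResumeProducing() completes and resets it, and the production loop
// checks paused() before emitting (as in the `if (pause_future_)` fragment).
class PauseGuard {
 public:
  bool paused() const { return pause_future_.valid(); }
  std::shared_future<void> pause_future() const { return pause_future_; }

  void PauseProducing() {
    pause_promise_ = std::promise<void>();
    pause_future_ = pause_promise_.get_future().share();
  }

  void ResumeProducing() {
    // A real implementation would guard against resuming when not paused.
    pause_promise_.set_value();  // wake anything waiting on the future
    pause_future_ = {};          // back to the "not paused" state
  }

 private:
  std::promise<void> pause_promise_;
  std::shared_future<void> pause_future_;
};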

Member

Oh, you're right, my bad.

Member

For any kind of "map-like" node that has an input and an output, a call to PauseProducing should always call PauseProducing on the input. That's the only way to ensure that back pressure is properly channeled to the source (which can actually pause).
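
As a toy illustration of that forwarding (not the real ExecNode interface; the names and signatures below are simplified assumptions):

// Toy node hierarchy, only to show the pass-through of back pressure.
struct Node {
  virtual ~Node() = default;
  virtual void PauseProducing() {}
  virtual void ResumeProducing() {}
};

struct MapLikeNode : Node {
  explicit MapLikeNode(Node* input) : input_(input) {}
  // A map-like node can't buffer meaningfully, so it simply forwards the
  // calls until they reach a source that can actually pause.
  void PauseProducing() override { input_->PauseProducing(); }
  void ResumeProducing() override { input_->ResumeProducing(); }
  Node* input_;
};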

Member Author

These are currently stubs due to the lack of support for pausing in any source node. For now, I'll remove these and add a follow-up to support pause/resume.

Member

My vote is that ErrorReceived is sufficient. I think a node could recover from a failure but, if it does so, it shouldn't call ErrorReceived.

Member

I'm pondering how back pressure would be applied. I think there would be a new argument added to this SinkNode for max_items_queued or something like that. However, we could not naively apply that limit to received_batches_ because of the resequencing.

Since we are delivering to a pull-based model, I think the appropriate way to apply back pressure would be to have the PushGenerator keep track of how many undelivered items it has. Then there would need to be a check in this code: after pushing, if the PushGenerator is full, apply back pressure to the inputs. The PushGenerator would also need some way of signalling back into the SinkNode that the pressure has been relieved and it is ready for more items.

I don't think this has to be implemented now, but does that sound reasonable?
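
To make the counting idea concrete, a rough standalone sketch (the class, max_items_queued, and the pause/resume callbacks are placeholders, not existing Arrow APIs):

#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <utility>

// Toy stand-in for a push generator that tracks undelivered items and signals
// when back pressure should be applied or relieved.
template <typename T>
class CountingPushQueue {
 public:
  CountingPushQueue(std::size_t max_items_queued, std::function<void()> pause_inputs,
                    std::function<void()> resume_inputs)
      : max_(max_items_queued),
        pause_inputs_(std::move(pause_inputs)),
        resume_inputs_(std::move(resume_inputs)) {}

  // Called by the sink when a batch is delivered (after any resequencing).
  void Push(T item) {
    std::lock_guard<std::mutex> lock(mutex_);
    queue_.push(std::move(item));
    if (queue_.size() >= max_) pause_inputs_();  // apply back pressure
  }

  // Called by the consumer pulling from the sink's generator.
  bool Pop(T* out) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (queue_.empty()) return false;
    *out = std::move(queue_.front());
    queue_.pop();
    // A real implementation would only resume if it had actually paused.
    if (queue_.size() < max_) resume_inputs_();  // pressure relieved
    return true;
  }

 private:
  std::size_t max_;
  std::function<void()> pause_inputs_;
  std::function<void()> resume_inputs_;
  std::mutex mutex_;
  std::queue<T> queue_;
};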

Member Author

This sounds reasonable, but the prerequisite is to support pause/resume in a source node.

@bkietz (Member Author) commented Jun 2, 2021

> How do you see things evolving? Do you think the various operations achieved by a scanner today will be achieved by an execution plan? For example, will ScanBatches, CountRows, etc. create and execute an execution plan instead of maintaining the dual paths?

I'd like the ExecPlan to be usable enough to replace all filtering and projection currently in Scanner. So, for example, ScanBatches could assemble an ExecPlan to handle filtering and projection, then receive and reorder batches, never needing to explicitly evaluate an expression.

Ultimately, I'm not positive we'll keep Scanner. It's possible we could simplify the dataset module to a factory for source/sink nodes. In that case, anything which currently builds a Scanner would instead produce an ExecPlan. We'll see.

@bkietz force-pushed the 11930-Refactor-Dataset-scans-to branch from bb8a0a7 to 5fe4d10 on June 8, 2021 23:30
@bkietz force-pushed the 11930-Refactor-Dataset-scans-to branch from 10f3aea to 1dff3bf on June 16, 2021 20:49
@bkietz marked this pull request as ready for review on June 17, 2021 19:53
@bkietz force-pushed the 11930-Refactor-Dataset-scans-to branch 2 times, most recently from 47710c1 to 0791f80 on June 21, 2021 14:42
@bkietz force-pushed the 11930-Refactor-Dataset-scans-to branch from 05365cf to 371d5ca on June 21, 2021 16:41
@bkietz (Member Author) commented Jun 21, 2021

@pitrou PTAL

@pitrou (Member) left a comment

Some more comments.

under the License.
-->

# ExecNodes and logical operators

Member

I'm not sure I understand the status of this document. If this is meant to be a persistent document, then can it be part of the Sphinx development docs?

Member Author

I'll promote this to a Sphinx doc in a follow up. https://issues.apache.org/jira/browse/ARROW-13227


/// \brief Like MapVector, but where the function can fail.
template <typename Fn, typename From = internal::call_traits::argument_type<0, Fn>,
          typename To = typename internal::call_traits::return_type<Fn>::ValueType>

Member

Why not use the decltype(declval) pattern here as well?
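
For reference, a minimal sketch of that alternative (simplified: it deduces the result type by invoking Fn directly, and omits the Result<> unwrapping and error propagation that the real MaybeMapVector does):

#include <type_traits>
#include <utility>
#include <vector>

// `To` is deduced with decltype/declval instead of call_traits.
template <typename Fn, typename From,
          typename To = std::decay_t<decltype(std::declval<Fn>()(std::declval<From>()))>>
std::vector<To> MapVectorSketch(Fn&& fn, const std::vector<From>& source) {
  std::vector<To> out;
  out.reserve(source.size());
  for (const auto& v : source) {
    out.push_back(fn(v));
  }
  return out;
}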

Member Author

There's not a good reason; just uniformity with getting From from call_traits.

if (auto name = this->name()) {
  return internal::MapVector([](int i) { return FieldPath{i}; },
                             schema.GetAllFieldIndices(*name));
}

Member

Can you add a test for this?
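
(For reference, such a test might look roughly like the following, assuming the name-based path is reachable via FieldRef::FindAll against a schema with a duplicated field name; a real test would use the suite's GTest macros rather than assert.)

#include <cassert>
#include <vector>

#include <arrow/api.h>

int main() {
  // Two fields share the name "a"; a name-based FieldRef should match both.
  auto schema = arrow::schema({arrow::field("a", arrow::int32()),
                               arrow::field("b", arrow::utf8()),
                               arrow::field("a", arrow::float64())});
  std::vector<arrow::FieldPath> matches = arrow::FieldRef("a").FindAll(*schema);
  assert(matches.size() == 2);
  return 0;
}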

@bkietz force-pushed the 11930-Refactor-Dataset-scans-to branch 2 times, most recently from 529a805 to 828c188 on June 30, 2021 16:25
@bkietz force-pushed the 11930-Refactor-Dataset-scans-to branch from 11bc3c1 to e91ef9f on June 30, 2021 20:01
@bkietz (Member Author) commented Jul 1, 2021

@pitrou I think I've addressed your comments. Could we merge this and address anything else in a follow-up?

@bkietz (Member Author) commented Jul 1, 2021

+1, merging

@bkietz closed this in 1ae979d on Jul 1, 2021
@bkietz deleted the 11930-Refactor-Dataset-scans-to branch on July 1, 2021 12:15