-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-7878: [C++][Compute] Draft LogicalPlan classes #6506
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-7878: [C++][Compute] Draft LogicalPlan classes #6506
Conversation
cpp/src/arrow/engine/expression.h
Outdated
| private: | ||
| explicit FieldRefExpr(std::shared_ptr<Field> field); | ||
|
|
||
| std::shared_ptr<Field> field_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be replaced with the class in ARROW-7421.
| field("bool", boolean()), | ||
| field("i32", int32()), | ||
| field("u64", uint64()), | ||
| field("f32", uint32()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a mismatch for the f32 type?
|
This is exciting! This may be the final motivation I need to start coding in C++ again. |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like the LogicalPlanBuilder isn't finished. Some random comments below.
cpp/src/arrow/engine/logical_plan.h
Outdated
| std::shared_ptr<Catalog> catalog; | ||
| }; | ||
|
|
||
| class LogicalPlanBuilder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... I suppose LogicalPlanBuilder will grow a Build method to build a LogicalPlan? How will it work? The builder doesn't seem to keep track of any state.
cpp/src/arrow/testing/gmock.h
Outdated
| @@ -0,0 +1,31 @@ | |||
| // Licensed to the Apache Software Foundation (ASF) under one | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Call it "gmock_util.h"?
bkietz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking great, thanks @fsaintjacques !
|
|
||
| ASSERT_OK_AND_ASSIGN(auto empty, EmptyRelExpr::Make(schema_1)); | ||
| EXPECT_THAT(empty->type(), ExprType::Table(schema_1)); | ||
| EXPECT_THAT(empty->schema(), PtrEquals(schema_1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For one thing, the console shows expression_test.cc:205 instead of gtest_util.cc:206.
cpp/src/arrow/engine/logical_plan.h
Outdated
| ResultExpr Filter(const std::shared_ptr<Expr>& input, | ||
| const std::shared_ptr<Expr>& predicate); | ||
|
|
||
| /// \brief Project (mutate) columns with given expressions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought the same when I read this. Could we leave the dplyr specific alias (Mutate method, comment is fine) to R?
cpp/src/arrow/engine/expression.h
Outdated
| static Result<std::shared_ptr<ProjectionRelExpr>> Make( | ||
| std::shared_ptr<Expr> input, std::vector<std::shared_ptr<Expr>> expressions); | ||
|
|
||
| const std::vector<std::shared_ptr<Expr>> expressions() const { return expressions_; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Preface: I am assuming that ProjectionRelExpr will always have shape TABLE, is that incorrect?
I would have expected ProjectionRelExpr to be parametrized not by a vector<Expr> but by ordered_map<string, Expr> or ordered_map<FieldRefExpr, Expr>. (Maybe you plan to have the key value pairs be implicit in an equality expression?)
For example
project(some_table, {
a: field("alpha"), // result's column "a" will be column "alpha" from some_table
b: scalar(3), // result's column "b" will be the scalar 3 broadcast with "a"
})There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I'll have an AliasExpr(expr, name) for this. I also intend to have each expression have the equivalent of python's repr which is a "short" print, I'll use this short print as the inferred field name. The AliasExpr overrides this and pass down the input expr.
sqlite> SELECT * FROM tmp;
a|b
1|2
sqlite> SELECT a+1, b FROM tmp;
a+1|b
2|2
sqlite> SELECT a+1, a+b FROM tmp;
a+1|a+b
2|3
sqlite> SELECT AVG(a) + 1, a+b FROM tmp;
AVG(a) + 1|a+b
2.0|3
sqlite> SELECT AVG(a) + 1 as mean_plus_one, a+b FROM tmp;
mean_plus_one|a+b
2.0|3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And to answer your question, all RelExpr will have Table shape (expect maybe Aggregate which can reduce to a single scalar), that's their common point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, please add AliasExpr (or a TODO in the doccomment here). Out of curiosity, what will the field name for a projected Expr be if it is neither an AliasExpr nor a FieldRefExpr?
|
Perhaps I'm just too used to the abstractions in ibis, but I'd separate the operations from the expressions like the basic implementation in the compute namespace. I'm afraid that if we need to encode the same functionality using inheritance will end up a less maintainable codebase. |
|
@kszucs could you give a concrete example of the flexibility we'd gain by separating those? |
42e014d to
60a4887
Compare
cpp/src/arrow/engine/expression.h
Outdated
| // The expression yields a Scalar, e.g. "1". | ||
| SCALAR, | ||
| // The expression yields an Array, e.g. "[1, 2, 3]". | ||
| ARRAY, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One unknown here is I don't know if we need to differentiate from Array and Column. For example, a user can still pass an "inline" array. Some operators (e.g. FilterRelExpr) requires that the predicate/mask has the same number of elements than the input table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an operator which doesn't require this? For example, I think In's argument will be ExprType::Scalar|Array(list(some_type())) rather than a top level array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What the difference between Array and Column, the latter one is named, right?
I don't think we need to differentiate because their shape is identical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kszucs no the main difference would be an implicit length constraint; an expression of Column shape would have equal length to all sibling expressions of Column shape.
@fsaintjacques what cases do you have in mind where an Array would have differing length from a Column?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the main difference is knowledge of the length of the array. Column has an implicit that the length is the same as the referenced table. Where a table may or not have this.
An example, a user could have computed the mask out-of-band in panda because we're missing a functionality. In this case, from the perspective of the LogicalPlan, this is a value array that does not come from an expression of a column. It needs to validate the length. The same goes for any operators, e.g. in projection I could provide an array literal to concatenate to an existing table as a new column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The case you raise is interesting, but sounds more like something we would support through a custom Expr subclass than through a (guessing) first class ArrayExpr.
WRT the implicit length assumption: we'll still need to insert validation for this on translation to a physical plan, since a non-custom expression could still result in Columns of differing length:
# example 1:
# no custom, but one column is filtered while the other isn't
# so they probably won't have the same length
And(
Filter(FieldRef("int32_column"), FieldRef("mask")),
FieldRef("bool_column"))
# example 2:
# custom; no guarantee that my_pandas_mask has equal size to int32_column
# physical plan must include a length check
Filter(FieldRef("int32_column"), CustomArrayExpr(my_pandas_mask))
The physical plan builder should recognize when the column length assumption is not guaranteed by an expression tree and insert length validation nodes. (Example 1 is a toy; rather than inserting a length validation node we should probably raise an error on translation to a physical plan since for non trivial mask it must break the column length assumption.)
|
Could you give me an example expression for a query like |
|
|
Ok, so you encode the operation with inheritance in the expression. This means that all expression type needs to have a corresponding type enum and the visitors will have the same amount of overloads? (as comparison in ibis we have ~250 operations and ~60 expression types)
|
c956347 to
d1f8c6d
Compare
|
Sorry I'm late to the party do I still have time to review this? I'll spend time on it today |
|
@wesm I'm still fixing portability issues and adding more operators, e.g. AggregateFn and AggregateRel. It's not too late and we should review thoroughly this one, especially someone with more data frame view than my database centric view. |
wesm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments (some cosmetic, some design questions) on some of the basic things that jumped out at me. I think we will want to invest some time in prototyping desired APIs based on more complex SQL queries. The document https://docs.ibis-project.org/sql.html could be useful to provide a list of different types of SQL concepts to try to capture.
I'd also like to make sure we're thinking about column expressions on nested types, consider for example the APIs in Ibis (as a strawman) for dealing with nested types in e.g. postgres
thanks again for getting this started!!
| @@ -173,6 +173,8 @@ if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}") | |||
|
|
|||
| define_option(ARROW_DATASET "Build the Arrow Dataset Modules" OFF) | |||
|
|
|||
| define_option(ARROW_ENGINE "Build the Arrow Query Engine Modules" OFF) | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't want to bikeshed about this but ARROW_QUERY_ENGINE or something more "obvious" might aid readability, unless we come up with some other name for the project
cpp/src/arrow/CMakeLists.txt
Outdated
| @@ -583,6 +583,10 @@ if(ARROW_DATASET) | |||
| add_subdirectory(dataset) | |||
| endif() | |||
|
|
|||
| if(ARROW_ENGINE) | |||
| add_subdirectory(engine) | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anyone have name opinions. Is arrow/query or arrow/query_engine better? More curious about opinions than anything
Some examples
- Impala uses
exec - MapD uses
QueryEngine - Clickhouse seems to be called "Interpreters"
🤷♂
cpp/src/arrow/engine/CMakeLists.txt
Outdated
| endforeach() | ||
|
|
||
| # Adding unit tests part of the "engine" portion of the test suite | ||
| function(ADD_ARROW_ENGINE_TEST REL_TEST_NAME) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to self and team, we might want to make a helper function to reduce boilerplate for component-specific unit tests
cpp/src/arrow/engine/CMakeLists.txt
Outdated
|
|
||
| add_arrow_engine_test(catalog_test PREFIX arrow-engine) | ||
| add_arrow_engine_test(expression_test PREFIX arrow-engine) | ||
| add_arrow_engine_test(logical_plan_test PREFIX arrow-engine) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: combine these into a single executable (until it can be shown that splitting them up improves parallel test performance)?
cpp/src/arrow/engine/catalog.h
Outdated
| class ARROW_EN_EXPORT Entry { | ||
| public: | ||
| Entry(std::shared_ptr<dataset::Dataset> dataset, std::string name); | ||
| Entry(std::shared_ptr<Table> table, std::string name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: name first?
cpp/src/arrow/util/compare.h
Outdated
| @@ -22,6 +22,7 @@ | |||
| #include <utility> | |||
|
|
|||
| #include "arrow/util/macros.h" | |||
| #include "arrow/util/visibility.h" | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IWYU or...?
cpp/src/arrow/engine/logical_plan.h
Outdated
| /// \brief References a field by index. | ||
| ResultExpr Field(const std::shared_ptr<Expr>& input, int field_index); | ||
| /// \brief References a field by name. | ||
| ResultExpr Field(const std::shared_ptr<Expr>& input, const std::string& field_name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Echoing comments about nested column paths
cpp/src/arrow/engine/logical_plan.h
Outdated
| @@ -0,0 +1,127 @@ | |||
| // Licensed to the Apache Software Foundation (ASF) under one | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with this file being somewhat WIP, a lot of stuff needs to be fleshed out and I expect refactoring. There's some other stuff that needs to be resolved either here or in follow up PRs
cpp/src/arrow/engine/expression.cc
Outdated
| " should not be have a table shape"); | ||
| // TODO(fsaintjacques): better name handling. Callers should be able to | ||
| // pass a vector of names. | ||
| fields.push_back(field("expr", expr_type.type())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think names need to be a property of the expr
254677a to
0423241
Compare
|
|
||
| class ARROW_EN_EXPORT LogicalPlanBuilder { | ||
| public: | ||
| using ResultExpr = Result<std::shared_ptr<Expr>>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| using ResultExpr = Result<std::shared_ptr<Expr>>; | |
| using MaybeExpr = Result<std::shared_ptr<Expr>>; |
| } | ||
|
|
||
| template <typename E> | ||
| bool IsA(const std::shared_ptr<Expr>& expr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the const& overloads are enabled, it'd be best to forward that to this overload with a trailing return type:
| bool IsA(const std::shared_ptr<Expr>& expr) { | |
| auto IsA(const std::shared_ptr<Expr>& expr) -> decltype(IsA<E>(*expr)) { |
| if (!expr) return false; | ||
| return IsA<E>(*expr); | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| template <typename E, typename Enable = decltype(IsA<E>(std::declval<const Expr&>()))> | |
| const E* DynCast(const Expr& expr) { | |
| return IsA<E>(expr) ? &expr : NULLPTR; | |
| } | |
| template <typename E, typename Enable = decltype(IsA<E>(std::declval<const Expr&>()))> | |
| const E* DynCast(const std::shared_ptr<Expr>& expr) { | |
| return IsA<E>(expr) ? expr.get() : NULLPTR; | |
| } |
|
@fsaintjacques where do we stand on this? |
|
I'm going to review this again in the near future. I don't think this is blocking anything at the moment? |
|
@wesm @fsaintjacques @kszucs do you want to keep this PR open or resurrect it at some point in the future? |
|
I'd assume so, but not sure when are we going to have the time for it. |
|
Will close for now until there is bandwidth to make a push on this project. |
Draft of the LogicalPlan basic classes for the query engine. One should read the classes in this order for review: