Skip to content

Conversation

@fsaintjacques
Copy link
Contributor

@fsaintjacques fsaintjacques commented Feb 28, 2020

Draft of the LogicalPlan basic classes for the query engine. One should read the classes in this order for review:

  • ExprType/Expr
  • LogicalPlanBuilder/LogicalPlan
  • Catalog

@github-actions
Copy link

private:
explicit FieldRefExpr(std::shared_ptr<Field> field);

std::shared_ptr<Field> field_;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be replaced with the class in ARROW-7421.

field("bool", boolean()),
field("i32", int32()),
field("u64", uint64()),
field("f32", uint32()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a mismatch for the f32 type?

@andygrove
Copy link
Member

This is exciting! This may be the final motivation I need to start coding in C++ again.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the LogicalPlanBuilder isn't finished. Some random comments below.

std::shared_ptr<Catalog> catalog;
};

class LogicalPlanBuilder {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... I suppose LogicalPlanBuilder will grow a Build method to build a LogicalPlan? How will it work? The builder doesn't seem to keep track of any state.

@@ -0,0 +1,31 @@
// Licensed to the Apache Software Foundation (ASF) under one
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call it "gmock_util.h"?

@wesm wesm self-requested a review March 2, 2020 17:02
@bkietz bkietz self-requested a review March 3, 2020 19:11
Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking great, thanks @fsaintjacques !


ASSERT_OK_AND_ASSIGN(auto empty, EmptyRelExpr::Make(schema_1));
EXPECT_THAT(empty->type(), ExprType::Table(schema_1));
EXPECT_THAT(empty->schema(), PtrEquals(schema_1));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For one thing, the console shows expression_test.cc:205 instead of gtest_util.cc:206.

ResultExpr Filter(const std::shared_ptr<Expr>& input,
const std::shared_ptr<Expr>& predicate);

/// \brief Project (mutate) columns with given expressions.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the same when I read this. Could we leave the dplyr specific alias (Mutate method, comment is fine) to R?

static Result<std::shared_ptr<ProjectionRelExpr>> Make(
std::shared_ptr<Expr> input, std::vector<std::shared_ptr<Expr>> expressions);

const std::vector<std::shared_ptr<Expr>> expressions() const { return expressions_; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preface: I am assuming that ProjectionRelExpr will always have shape TABLE, is that incorrect?

I would have expected ProjectionRelExpr to be parametrized not by a vector<Expr> but by ordered_map<string, Expr> or ordered_map<FieldRefExpr, Expr>. (Maybe you plan to have the key value pairs be implicit in an equality expression?)

For example

project(some_table, {
  a: field("alpha"), // result's column "a" will be column "alpha" from some_table
  b: scalar(3), // result's column "b" will be the scalar 3 broadcast with "a"
})

Copy link
Contributor Author

@fsaintjacques fsaintjacques Mar 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I'll have an AliasExpr(expr, name) for this. I also intend to have each expression have the equivalent of python's repr which is a "short" print, I'll use this short print as the inferred field name. The AliasExpr overrides this and pass down the input expr.

sqlite> SELECT * FROM tmp;                                                                                                                                                                                                                                                                
a|b                                                                                                                                          
1|2                                                                                                                                                                                                                                                                  
sqlite> SELECT a+1, b FROM tmp;                                                                                                                                                                                                                                                           
a+1|b                                                                                                                                                                                                                                                                                     
2|2                                                                                                                                                                                                                                                                                       
sqlite> SELECT a+1, a+b FROM tmp;                                                                                                                                                                                                                                                         
a+1|a+b                                                                                                                                                                                                                                                                                   
2|3                                                                                                                                                                                                                                                                                                                                                                                                   
sqlite> SELECT AVG(a) + 1, a+b FROM tmp;                                                                                                                                                                                                                                                  
AVG(a) + 1|a+b                                                                                                                               
2.0|3                                                                                                                                        
sqlite> SELECT AVG(a) + 1 as mean_plus_one, a+b FROM tmp;                                                                                                                                                                                                                                 
mean_plus_one|a+b                                                                                                                            
2.0|3

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And to answer your question, all RelExpr will have Table shape (expect maybe Aggregate which can reduce to a single scalar), that's their common point.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, please add AliasExpr (or a TODO in the doccomment here). Out of curiosity, what will the field name for a projected Expr be if it is neither an AliasExpr nor a FieldRefExpr?

@kszucs
Copy link
Member

kszucs commented Mar 5, 2020

Perhaps I'm just too used to the abstractions in ibis, but I'd separate the operations from the expressions like the basic implementation in the compute namespace.

I'm afraid that if we need to encode the same functionality using inheritance will end up a less maintainable codebase.

@bkietz
Copy link
Member

bkietz commented Mar 5, 2020

@kszucs could you give a concrete example of the flexibility we'd gain by separating those?

@fsaintjacques fsaintjacques force-pushed the ARROW-7878-logical-plan branch from 42e014d to 60a4887 Compare March 9, 2020 13:55
// The expression yields a Scalar, e.g. "1".
SCALAR,
// The expression yields an Array, e.g. "[1, 2, 3]".
ARRAY,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One unknown here is I don't know if we need to differentiate from Array and Column. For example, a user can still pass an "inline" array. Some operators (e.g. FilterRelExpr) requires that the predicate/mask has the same number of elements than the input table.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an operator which doesn't require this? For example, I think In's argument will be ExprType::Scalar|Array(list(some_type())) rather than a top level array

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What the difference between Array and Column, the latter one is named, right?
I don't think we need to differentiate because their shape is identical.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kszucs no the main difference would be an implicit length constraint; an expression of Column shape would have equal length to all sibling expressions of Column shape.

@fsaintjacques what cases do you have in mind where an Array would have differing length from a Column?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the main difference is knowledge of the length of the array. Column has an implicit that the length is the same as the referenced table. Where a table may or not have this.

An example, a user could have computed the mask out-of-band in panda because we're missing a functionality. In this case, from the perspective of the LogicalPlan, this is a value array that does not come from an expression of a column. It needs to validate the length. The same goes for any operators, e.g. in projection I could provide an array literal to concatenate to an existing table as a new column.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The case you raise is interesting, but sounds more like something we would support through a custom Expr subclass than through a (guessing) first class ArrayExpr.

WRT the implicit length assumption: we'll still need to insert validation for this on translation to a physical plan, since a non-custom expression could still result in Columns of differing length:

# example 1:
# no custom, but one column is filtered while the other isn't
# so they probably won't have the same length
And(
  Filter(FieldRef("int32_column"), FieldRef("mask")),
  FieldRef("bool_column"))

# example 2:
# custom; no guarantee that my_pandas_mask has equal size to int32_column
# physical plan must include a length check
Filter(FieldRef("int32_column"), CustomArrayExpr(my_pandas_mask))

The physical plan builder should recognize when the column length assumption is not guaranteed by an expression tree and insert length validation nodes. (Example 1 is a toy; rather than inserting a length validation node we should probably raise an error on translation to a physical plan since for non trivial mask it must break the column length assumption.)

@kszucs
Copy link
Member

kszucs commented Mar 11, 2020

Could you give me an example expression for a query like

SELECT * FROM table WHERE table.a >= table.b 

@fsaintjacques
Copy link
Contributor Author

fsaintjacques commented Mar 11, 2020

Could you give me an example expression for a query like

SELECT * FROM table WHERE table.a >= table.b 
t = Scan(table)
a = FieldRefExpr(t, "a")
b = FieldRefExpr(t, "b")
return FilterRelExpr(t, GreaterEqualThanExpr(a, b))

@kszucs
Copy link
Member

kszucs commented Mar 11, 2020

Ok, so you encode the operation with inheritance in the expression. This means that all expression type needs to have a corresponding type enum and the visitors will have the same amount of overloads? (as comparison in ibis we have ~250 operations and ~60 expression types)

Sorry, I accidentally edited your comment.

@fsaintjacques fsaintjacques force-pushed the ARROW-7878-logical-plan branch 10 times, most recently from c956347 to d1f8c6d Compare March 13, 2020 11:48
@wesm
Copy link
Member

wesm commented Mar 13, 2020

Sorry I'm late to the party do I still have time to review this? I'll spend time on it today

@fsaintjacques
Copy link
Contributor Author

fsaintjacques commented Mar 13, 2020

@wesm I'm still fixing portability issues and adding more operators, e.g. AggregateFn and AggregateRel. It's not too late and we should review thoroughly this one, especially someone with more data frame view than my database centric view.

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments (some cosmetic, some design questions) on some of the basic things that jumped out at me. I think we will want to invest some time in prototyping desired APIs based on more complex SQL queries. The document https://docs.ibis-project.org/sql.html could be useful to provide a list of different types of SQL concepts to try to capture.

I'd also like to make sure we're thinking about column expressions on nested types, consider for example the APIs in Ibis (as a strawman) for dealing with nested types in e.g. postgres

https://github.com/ibis-project/ibis/blob/master/ibis/sql/postgres/tests/test_functions.py#L1062

thanks again for getting this started!!

@@ -173,6 +173,8 @@ if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}")

define_option(ARROW_DATASET "Build the Arrow Dataset Modules" OFF)

define_option(ARROW_ENGINE "Build the Arrow Query Engine Modules" OFF)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't want to bikeshed about this but ARROW_QUERY_ENGINE or something more "obvious" might aid readability, unless we come up with some other name for the project

@@ -583,6 +583,10 @@ if(ARROW_DATASET)
add_subdirectory(dataset)
endif()

if(ARROW_ENGINE)
add_subdirectory(engine)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyone have name opinions. Is arrow/query or arrow/query_engine better? More curious about opinions than anything

Some examples

  • Impala uses exec
  • MapD uses QueryEngine
  • Clickhouse seems to be called "Interpreters"

🤷‍♂

endforeach()

# Adding unit tests part of the "engine" portion of the test suite
function(ADD_ARROW_ENGINE_TEST REL_TEST_NAME)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self and team, we might want to make a helper function to reduce boilerplate for component-specific unit tests

https://issues.apache.org/jira/browse/ARROW-8116


add_arrow_engine_test(catalog_test PREFIX arrow-engine)
add_arrow_engine_test(expression_test PREFIX arrow-engine)
add_arrow_engine_test(logical_plan_test PREFIX arrow-engine)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: combine these into a single executable (until it can be shown that splitting them up improves parallel test performance)?

class ARROW_EN_EXPORT Entry {
public:
Entry(std::shared_ptr<dataset::Dataset> dataset, std::string name);
Entry(std::shared_ptr<Table> table, std::string name);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: name first?

@@ -22,6 +22,7 @@
#include <utility>

#include "arrow/util/macros.h"
#include "arrow/util/visibility.h"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IWYU or...?

/// \brief References a field by index.
ResultExpr Field(const std::shared_ptr<Expr>& input, int field_index);
/// \brief References a field by name.
ResultExpr Field(const std::shared_ptr<Expr>& input, const std::string& field_name);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Echoing comments about nested column paths

@@ -0,0 +1,127 @@
// Licensed to the Apache Software Foundation (ASF) under one
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with this file being somewhat WIP, a lot of stuff needs to be fleshed out and I expect refactoring. There's some other stuff that needs to be resolved either here or in follow up PRs

" should not be have a table shape");
// TODO(fsaintjacques): better name handling. Callers should be able to
// pass a vector of names.
fields.push_back(field("expr", expr_type.type()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think names need to be a property of the expr

@fsaintjacques fsaintjacques force-pushed the ARROW-7878-logical-plan branch from 254677a to 0423241 Compare March 16, 2020 15:00

class ARROW_EN_EXPORT LogicalPlanBuilder {
public:
using ResultExpr = Result<std::shared_ptr<Expr>>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
using ResultExpr = Result<std::shared_ptr<Expr>>;
using MaybeExpr = Result<std::shared_ptr<Expr>>;

}

template <typename E>
bool IsA(const std::shared_ptr<Expr>& expr) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the const& overloads are enabled, it'd be best to forward that to this overload with a trailing return type:

Suggested change
bool IsA(const std::shared_ptr<Expr>& expr) {
auto IsA(const std::shared_ptr<Expr>& expr) -> decltype(IsA<E>(*expr)) {

if (!expr) return false;
return IsA<E>(*expr);
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
template <typename E, typename Enable = decltype(IsA<E>(std::declval<const Expr&>()))>
const E* DynCast(const Expr& expr) {
return IsA<E>(expr) ? &expr : NULLPTR;
}
template <typename E, typename Enable = decltype(IsA<E>(std::declval<const Expr&>()))>
const E* DynCast(const std::shared_ptr<Expr>& expr) {
return IsA<E>(expr) ? expr.get() : NULLPTR;
}

@kszucs
Copy link
Member

kszucs commented Apr 29, 2020

@fsaintjacques where do we stand on this?

@wesm
Copy link
Member

wesm commented Apr 29, 2020

I'm going to review this again in the near future. I don't think this is blocking anything at the moment?

@emkornfield
Copy link
Contributor

@wesm @fsaintjacques @kszucs do you want to keep this PR open or resurrect it at some point in the future?

@kszucs
Copy link
Member

kszucs commented Aug 27, 2020

I'd assume so, but not sure when are we going to have the time for it.

@wesm
Copy link
Member

wesm commented Aug 27, 2020

Will close for now until there is bandwidth to make a push on this project.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants