-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-41094: [C++] Support scalar expression special form #42106
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
| // The kernel (if_else) is not selection-vector-aware, so the input should not have a | ||
| // selection vector. | ||
| DCHECK(!call.kernel->selection_vector_aware && !input.selection_vector); | ||
|
|
||
| std::vector<Datum> arguments(call.arguments.size()); | ||
| ARROW_ASSIGN_OR_RAISE(arguments[0], | ||
| ExecuteScalarExpression(call.arguments[0], input, exec_context)); | ||
| // Use cond as selection vector for IF branch. | ||
| // TODO: Consider chunked array for arguments[0]. | ||
| auto if_sel = std::make_shared<SelectionVector>(arguments[0].array()); | ||
| // Duplicate and invert cond as selection vector for ELSE branch. | ||
| ARROW_ASSIGN_OR_RAISE( | ||
| auto else_sel, | ||
| if_sel->Copy(CPUDevice::memory_manager(exec_context->memory_pool()))); | ||
| RETURN_NOT_OK(else_sel->Invert()); | ||
|
|
||
| ExecBatch if_input = input; | ||
| if_input.selection_vector = std::move(if_sel); | ||
| ARROW_ASSIGN_OR_RAISE( | ||
| arguments[1], ExecuteScalarExpression(call.arguments[1], if_input, exec_context)); | ||
| ExecBatch else_input = input; | ||
| else_input.selection_vector = std::move(else_sel); | ||
| ARROW_ASSIGN_OR_RAISE( | ||
| arguments[2], ExecuteScalarExpression(call.arguments[2], else_input, exec_context)); | ||
|
|
||
| // Leveraging if_else kernel with all arguments evaluated. | ||
| return ExecuteCallNonRecursive(call, input, arguments, exec_context); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@felipecrv This part may look quite different than you originally imagined/proposed. For example, you might wonder why no "scatter" at all? That's because scatter is handled more generically in ExecuteScalarExpression. And even why is that? First it takes less code. And more importantly, I would imagine not only special forms being the producers of selection vector, but also acero nodes like Filter: a Filter node produces a selection vector, i.e., late materializing the filtered rows, and all the subsequent nodes could keep working on this selection vector without even knowing its existence.
On the other hand, the current way loses the opportunity of optimization for out-of-order scatter all branches at once. That's true. But we can always improve it - all the information needed (kernel, arguments, etc.) is already there - it just takes more code.
| bool selection_vector_aware = false; | ||
| std::shared_ptr<SpecialForm> special_form = NULLPTR; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would selection_vector_aware defined in kernel?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, a kernel will claim itself as selection_vector_aware here: https://github.com/apache/arrow/pull/42106/files#diff-c275e8c64a3935e84eaf970584e61b0d3db88fa5b052965505edf8126526c80eR527-R532
if it has special execution path to deal with the given selection vector directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks nice as a POC! By the way, seems if_else and switch could be migrate to special_forms?
By the way, previously I didn't understand why we need a "MetaFunction", seems something like "cast" would also a "special form"? I've a issue here: #41926
| ARROW_ASSIGN_OR_RAISE(arguments[0], | ||
| ExecuteScalarExpression(call.arguments[0], input, exec_context)); | ||
| // Use cond as selection vector for IF branch. | ||
| // TODO: Consider chunked array for arguments[0]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe applying check here? Or could it be scalar?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, will do. Maybe later after the rest part of the sketch is good enough.
| if (value.is_scalar()) continue; | ||
| ARROW_ASSIGN_OR_RAISE( | ||
| value, Filter(value, input.selection_vector->data(), FilterOptions::Defaults())); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha, this can be implemented like this, funny
| namespace arrow { | ||
| namespace compute { | ||
|
|
||
| /// The concept "special form" is borrowed from Lisp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha, I didn't know that
Thanks for looking! As for migration to special forms, there are some points I want to make:
|
I'm not familiar with |
| "ExecuteScalarExpression cannot Execute non-scalar expression ", expr.ToString()); | ||
| } | ||
|
|
||
| if (!expr.selection_vector_aware() && input.selection_vector) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This actually implies a strategy (1) that is: if any subexpression (including the top-level expression itself) is not selection vector aware, we do the gather/scatter work only once for the top-level expression. This way all the subexpressions can work on a (gathered) continuous input.
An alternative strategy (2) could be: every expression that is not selection vector aware (not counting its subexpressions), does the gather/scatter itself. This way every expression itself maintains the input/output shape the same as the original input.
Given that for a long time (possibly forever), there will be much less selection vector aware kernels than the other, it's probably worth it to use 1 instead of 2, as the gather/scatter only happen once for the whole expression.
I tried my best answering your questions in #41926. And I don't think meta function has anything to do with special form :( |
| std::shared_ptr<FunctionOptions> options; | ||
| // Whether this call is a special form (e.g. if-else). If true, the `special_form` | ||
| // field will be resolved in binding. | ||
| bool is_special_form = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A special-form is not a call. A special-form is a special form of expression.
- using Impl = std::variant<Datum, Parameter, Call>;
+ using Impl = std::variant<Datum, Parameter, Call, Special>;This is fundamental to the approach. Special forms don't have names, they are virtual sub-classes of Special.
We can start with CondSpecial which can model if-else and case-when based on number of conditions and branches used to construct it.
selection_vector_aware is not a member variable, it's something resolved at evaluation time by traversing the expression.
| // XXX someday | ||
| // NullGeneralization::type nullable() const; | ||
|
|
||
| /// Whether the entire expression (including all its subexpressions) is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is only something you will discover during evaluation. The same function can have kernels that are aware of selection and some that are not. Another important aspect is the type-checking of the special forms. You need to unify [1] the output type of all the branches, so you can pre-allocate the output and introduce the appropriate casts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is only something you will discover during evaluation.
I don't think it's necessarily during evaluation. This can be done in binding.
The same function can have kernels that are aware of selection and some that are not.
Binding will resolve the function to concrete kernel, so the selection awareness is known once the kernel is resolved. Also note that I made selection aware flag a property of kernel rather than of function.
Another important aspect is the type-checking of the special forms. You need to unify [1] the output type of all the branches, so you can pre-allocate the output and introduce the appropriate casts.
(I don't think this is related to selection-awareness. In this proposal, special forms and expression's selection-awareness are relatively separated - they only correlate on the fact that a special form may use selection vector to mask the evaluation of its subexpressions. So I'm assuming this is a general comment about special form.)
Before responding to your comment about special forms, I want to raise some other discussion which is part of my reasons making this proposal the current way.
In arrow compute, there are "function"s, which users can directly invoke via CallFunction, and there are expressions, one of whose concrete type is a "call", which further leverages the "function" when evaluating. Now take "if_else" as an example. There already exists an "if_else" function, which makes prefect sense because users do need it for two-way branching on a triple of {condition vector, true branch vector, false branch vector}. For expression, "if_else", as a regular function, is represented as a call. This is fine in terms of lexical/grammatical aspects, type checking, kernel resolution, etc., except that "if_else" requires special argument evaluation rules other than what a call does (eagerly evaluating all its arguments) - the reason why we need "if_else" (or "cond") special form.
So my question is, can we assume that for every coming special form in arrow, there exists a function "companion" of it? For some common special forms like "and", "or", "if_else", "case_when", there already are. For other special forms in Lisp, like "quote", "let", "defunc", they don't have corresponding "function"s, but those special forms themselves are more like fundamental parts, like defining variable and assignments, for a Turing-complete language, which I believe arrow won't need to have. So my so-far conclusion is, yes, we can assume it.
Then the next question is, given that we'll have the function companion for every special form, can we leverage the function companion to do type-checking/resolution (the function need to do those anyway)? My feeling is yes, at least I can't think of a case that a special form has different type-checking/resolution rule than its function companion.
So to summarize, if a special form:
- Always has a function companion;
- Is type-checked/resolved the same as its function companion;
- Evaluates the same as its function companion - I mean the evaluation for the last kernel invocation after all its arguments properly evaluated;
- Only differs a regular call on the arguments evaluation.
Then we can think of special forms to be a special call with special argument evaluation rules.
Of course, if there are counterexamples for any of above assumptions, then my conclusion would be wrong. And I should take the way around. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So my question is, can we assume that for every coming special form in arrow, there exists a function "companion" of it? For some common special forms like "and", "or", "if_else", "case_when", there already are. For other special forms in Lisp, like "quote", "let", "defunc", they don't have corresponding "function"s, but those special forms themselves are more like fundamental parts, like defining variable and assignments, for a Turing-complete language, which I believe arrow won't need to have. So my so-far conclusion is, yes, we can assume it.
Special forms in Arrow won't be many and we don't have to store them in a table like we have to do for functions. My use of LISP for inspiration might have led you into a path of over-engineering. We don't need the full generality of LISP or languages where users can define functions with call-by-name parameters. Think more C-like (strict evaluation) and less LISP-like (quote, let, defunc...) because that is overkill. Hence my insistence on making special forms a well-defined set of C++ classes.
| Expression call(std::string function, std::vector<Expression> arguments, Options options, | ||
| bool is_special_form = false) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Special forms are not "called", they are evaluated. They are more general expressions than calls. if (true) f(x) is equivalent to a call to f with x that you discover while trying to evaluate the conditional.
| /// TODO: More formal factory, a registry maybe? | ||
| static Result<std::unique_ptr<SpecialForm>> Make(const std::string& name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The number of special-forms is very small. They should be sub-classes of the SpecialForm class. They will contain type-checking logic and other things that are very different from calls.
That duplication is a feature, not a bug. It's better to separate the cases than to have a And as I said above, special-forms don't need all the complex things that |
| SelectionVector::SelectionVector(std::shared_ptr<ArrayData> data) | ||
| : data_(std::move(data)) { | ||
| DCHECK_EQ(Type::INT32, data_->type->id()); | ||
| DCHECK_EQ(Type::BOOL, data_->type->id()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bitmaps are alternatives to selection-vectors, yes, but they have a disadvantage: they don't reduce in size as you process them — you start with length bits and end up with length bits and have to scan them over and over. Ideally we should have both forms and use bitmaps when the selection is small, but if we are going to simplify things and have a single form it should be selection vectors.
|
|
||
| Expression call(std::string function, std::vector<Expression> arguments, | ||
| std::shared_ptr<compute::FunctionOptions> options) { | ||
| std::shared_ptr<compute::FunctionOptions> options, bool is_special_form) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is exactly the kind of added complexity around calls that I'm trying to convince you is not a good idea. This boolean is effectively saying the arguments are passed in call-by-name fashion and not call-by-value [1].
We should keep the Arrow rewrite system a strict-evaluated system with just 1 or 2 special forms. That means calls are always called by value. We keep the if_else and case_when functions as they are and introduce a ConditionalSpecialForm in the context of compute::Expression.
C++ is a strictly evaluated rewrite system. You can't declare a function and say "all the parameters are call-by-name" like you're trying to do with this boolean parameter. So you can implement if_else just like Arrow does here and has to use special syntax to get a conditional that takes two expression in a call by name fashion.
template<typename T>
T if_else(bool cond, T then_case, T else_case) {
if (cond) { return then_case; } else { return else_case; }
}
// Non-strict evaluation with special form syntax. The branches of the
// if are non-strictly evaluated in a way that can't be achieved with
// the `if_else` Call.
if (den > 0) {
return num / den;
} else {
return 0;
}A general conditional special form can serve for if-else, and, and or. Probably the single special form we need.
https://en.wikipedia.org/wiki/Evaluation_strategy#Non-strict_binding_strategies
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your explanation is convincing and it's hard to not be convinced. I'll need some more time to digest your comments and get back to you.
I really appreciate your kind help and insightful comments, and the pointer to the interesting reading too :)
Thank you @felipecrv !
| /// selection-vector-aware. If true, then the expression can be executed using the "fast | ||
| /// path" - all kernels directly working on the selection vector. Otherwise the | ||
| /// execution takes the "slow path" - gathering the input and scattering the output. | ||
| bool selection_vector_aware() const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this should return the TypeMatcher or the DataType or be a bool MatchesSelectionVector(TypeHolder/DataType) that is derived from what KernelSignatures contain to let the evaluation system check which kinds of selection vectors they accept (INT32, INT64 or BOOL).
So you can extend KernelSignature to have a new type matcher (matches nothing by default) that can match selection vector types.
|
Is there any progress related to this PR? We also encountered similar problems during use. Hope there will be corresponding solutions |
I wasn't working on this for quite a while. And the current solution isn't good enough and needs to be re-worked. I do plan to resume this work recently, but considering the effort and the time I'll be able to spend, it won't come in a short time. Sorry about that. |
|
I'm ready to send out my rework on this during the past year. Let's close this and start a new one. See #47374 . |
Rationale for this change
@felipecrv gave a very descriptive rationale in comment #41094 (comment). This PR follows it mostly, with the following exceptions:
if_elseevaluates a boolean column, which is exactly the mask we need forifbranch and saves the indices collection. (but I guess the naming should be revised as I saw other usage of term "selection vector" is referring to indices, not mask?)Expressionas I found for most cases, e.g. creation and binding, a special form is just aCall- there would be too much duplicated code if we differentiateSpeicalFormfromCall. So I made it aCallwith a special flag, and did little specialization in binding and evaluation.Callcontains a post-bound polymorphicSpecialFormpointer, which will take over the evaluation of the whole expression.if_elseis implemented to show-case how a special form leverages selection vector to mask the evaluation of certain arguments. Another thing to note is thatif_elsespecial form still leveragesif_elsekernel to do the final function evaluation (with all arguments properly evaluated).Expressionevaluation will take care of the selection vector on their behalves, i.e., the "slow path".Functionstructure is not touched at all. Because when aFunctionis called, all arguments are already evaluated, so special forms don't come into play.What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?