Skip to content

Conversation

@zanmato1984
Copy link
Contributor

@zanmato1984 zanmato1984 commented Jun 11, 2024

Rationale for this change

@felipecrv gave a very descriptive rationale in comment #41094 (comment). This PR follows it mostly, with the following exceptions:

  1. I found bit mask is more preferable to be used as the "selection vector", e.g., the first argument of if_else evaluates a boolean column, which is exactly the mask we need for if branch and saves the indices collection. (but I guess the naming should be revised as I saw other usage of term "selection vector" is referring to indices, not mask?)
  2. I didn't make the special form the fourth impl of Expression as I found for most cases, e.g. creation and binding, a special form is just a Call - there would be too much duplicated code if we differentiate SpeicalForm from Call. So I made it a Call with a special flag, and did little specialization in binding and evaluation.
  3. A special form Call contains a post-bound polymorphic SpecialForm pointer, which will take over the evaluation of the whole expression.
  4. Special form if_else is implemented to show-case how a special form leverages selection vector to mask the evaluation of certain arguments. Another thing to note is that if_else special form still leverages if_else kernel to do the final function evaluation (with all arguments properly evaluated).
  5. The kernels are not bothered with the concept "special form", they only need to sense the existence of selection vector. If they support selection-vector-aware execution, they just claim this fact when registration. Otherwise they can keep being dumb and the Expression evaluation will take care of the selection vector on their behalves, i.e., the "slow path".
  6. The Function structure is not touched at all. Because when a Function is called, all arguments are already evaluated, so special forms don't come into play.
  7. I'm also working on a "scatter" operation, as well as piloting some simple kernels to be selection-vector-aware elsewhere. The code is not included in this PR because I don't want them to distract the attention on the main sketch. I'll send out those code separately, or include them in this PR in the future - I'm not sure yet.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions
Copy link

⚠️ GitHub issue #41094 has been automatically assigned in GitHub to PR creator.

Comment on lines +32 to +58
// The kernel (if_else) is not selection-vector-aware, so the input should not have a
// selection vector.
DCHECK(!call.kernel->selection_vector_aware && !input.selection_vector);

std::vector<Datum> arguments(call.arguments.size());
ARROW_ASSIGN_OR_RAISE(arguments[0],
ExecuteScalarExpression(call.arguments[0], input, exec_context));
// Use cond as selection vector for IF branch.
// TODO: Consider chunked array for arguments[0].
auto if_sel = std::make_shared<SelectionVector>(arguments[0].array());
// Duplicate and invert cond as selection vector for ELSE branch.
ARROW_ASSIGN_OR_RAISE(
auto else_sel,
if_sel->Copy(CPUDevice::memory_manager(exec_context->memory_pool())));
RETURN_NOT_OK(else_sel->Invert());

ExecBatch if_input = input;
if_input.selection_vector = std::move(if_sel);
ARROW_ASSIGN_OR_RAISE(
arguments[1], ExecuteScalarExpression(call.arguments[1], if_input, exec_context));
ExecBatch else_input = input;
else_input.selection_vector = std::move(else_sel);
ARROW_ASSIGN_OR_RAISE(
arguments[2], ExecuteScalarExpression(call.arguments[2], else_input, exec_context));

// Leveraging if_else kernel with all arguments evaluated.
return ExecuteCallNonRecursive(call, input, arguments, exec_context);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@felipecrv This part may look quite different than you originally imagined/proposed. For example, you might wonder why no "scatter" at all? That's because scatter is handled more generically in ExecuteScalarExpression. And even why is that? First it takes less code. And more importantly, I would imagine not only special forms being the producers of selection vector, but also acero nodes like Filter: a Filter node produces a selection vector, i.e., late materializing the filtered rows, and all the subsequent nodes could keep working on this selection vector without even knowing its existence.

On the other hand, the current way loses the opportunity of optimization for out-of-order scatter all branches at once. That's true. But we can always improve it - all the information needed (kernel, arguments, etc.) is already there - it just takes more code.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jun 11, 2024
Comment on lines +63 to +64
bool selection_vector_aware = false;
std::shared_ptr<SpecialForm> special_form = NULLPTR;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would selection_vector_aware defined in kernel?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, a kernel will claim itself as selection_vector_aware here: https://github.com/apache/arrow/pull/42106/files#diff-c275e8c64a3935e84eaf970584e61b0d3db88fa5b052965505edf8126526c80eR527-R532
if it has special execution path to deal with the given selection vector directly.

Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks nice as a POC! By the way, seems if_else and switch could be migrate to special_forms?

By the way, previously I didn't understand why we need a "MetaFunction", seems something like "cast" would also a "special form"? I've a issue here: #41926

ARROW_ASSIGN_OR_RAISE(arguments[0],
ExecuteScalarExpression(call.arguments[0], input, exec_context));
// Use cond as selection vector for IF branch.
// TODO: Consider chunked array for arguments[0].
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe applying check here? Or could it be scalar?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, will do. Maybe later after the rest part of the sketch is good enough.

if (value.is_scalar()) continue;
ARROW_ASSIGN_OR_RAISE(
value, Filter(value, input.selection_vector->data(), FilterOptions::Defaults()));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, this can be implemented like this, funny

namespace arrow {
namespace compute {

/// The concept "special form" is borrowed from Lisp
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, I didn't know that

@zanmato1984
Copy link
Contributor Author

zanmato1984 commented Jun 12, 2024

This looks nice as a POC! By the way, seems if_else and switch could be migrate to special_forms?

Thanks for looking! As for migration to special forms, there are some points I want to make:

  1. The way a special form is currently implemented doesn't simply "replace" its kernel counter-part. Instead, a special form takes care of the argument evaluation, and directly "invokes" the underlying kernel. I think the kernel will always be there to serve both direct function call (i.e., not via expression system) and expression evaluation.
  2. I'm exposing the special form creation via the is_special_form flag in Expression::Call so that user is free to use either "non-special-form" (defaulted) or special form function call.
  3. Whether to use special form or non-special-form in the implementation for a specific type of query plan, e.g., substrait, should be fixed though. This would happen mostly in the compiling phase: when compiling a function-call-like expression, always prefer special form whenever possible; or at least use an user option to guide that.

@zanmato1984
Copy link
Contributor Author

By the way, previously I didn't understand why we need a "MetaFunction", seems something like "cast" would also a "special form"? I've a issue here: #41926

I'm not familiar with MetaFunction either. I'll take a look and get back to you. Thanks.

"ExecuteScalarExpression cannot Execute non-scalar expression ", expr.ToString());
}

if (!expr.selection_vector_aware() && input.selection_vector) {
Copy link
Contributor Author

@zanmato1984 zanmato1984 Jun 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually implies a strategy (1) that is: if any subexpression (including the top-level expression itself) is not selection vector aware, we do the gather/scatter work only once for the top-level expression. This way all the subexpressions can work on a (gathered) continuous input.

An alternative strategy (2) could be: every expression that is not selection vector aware (not counting its subexpressions), does the gather/scatter itself. This way every expression itself maintains the input/output shape the same as the original input.

Given that for a long time (possibly forever), there will be much less selection vector aware kernels than the other, it's probably worth it to use 1 instead of 2, as the gather/scatter only happen once for the whole expression.

@zanmato1984
Copy link
Contributor Author

zanmato1984 commented Jun 12, 2024

By the way, previously I didn't understand why we need a "MetaFunction", seems something like "cast" would also a "special form"? I've a issue here: #41926

I'm not familiar with MetaFunction either. I'll take a look and get back to you. Thanks.

I tried my best answering your questions in #41926. And I don't think meta function has anything to do with special form :(

std::shared_ptr<FunctionOptions> options;
// Whether this call is a special form (e.g. if-else). If true, the `special_form`
// field will be resolved in binding.
bool is_special_form = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A special-form is not a call. A special-form is a special form of expression.

-  using Impl = std::variant<Datum, Parameter, Call>;
+  using Impl = std::variant<Datum, Parameter, Call, Special>;

This is fundamental to the approach. Special forms don't have names, they are virtual sub-classes of Special.

We can start with CondSpecial which can model if-else and case-when based on number of conditions and branches used to construct it.

selection_vector_aware is not a member variable, it's something resolved at evaluation time by traversing the expression.

// XXX someday
// NullGeneralization::type nullable() const;

/// Whether the entire expression (including all its subexpressions) is
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is only something you will discover during evaluation. The same function can have kernels that are aware of selection and some that are not. Another important aspect is the type-checking of the special forms. You need to unify [1] the output type of all the branches, so you can pre-allocate the output and introduce the appropriate casts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is only something you will discover during evaluation.

I don't think it's necessarily during evaluation. This can be done in binding.

The same function can have kernels that are aware of selection and some that are not.

Binding will resolve the function to concrete kernel, so the selection awareness is known once the kernel is resolved. Also note that I made selection aware flag a property of kernel rather than of function.

Another important aspect is the type-checking of the special forms. You need to unify [1] the output type of all the branches, so you can pre-allocate the output and introduce the appropriate casts.

(I don't think this is related to selection-awareness. In this proposal, special forms and expression's selection-awareness are relatively separated - they only correlate on the fact that a special form may use selection vector to mask the evaluation of its subexpressions. So I'm assuming this is a general comment about special form.)

Before responding to your comment about special forms, I want to raise some other discussion which is part of my reasons making this proposal the current way.

In arrow compute, there are "function"s, which users can directly invoke via CallFunction, and there are expressions, one of whose concrete type is a "call", which further leverages the "function" when evaluating. Now take "if_else" as an example. There already exists an "if_else" function, which makes prefect sense because users do need it for two-way branching on a triple of {condition vector, true branch vector, false branch vector}. For expression, "if_else", as a regular function, is represented as a call. This is fine in terms of lexical/grammatical aspects, type checking, kernel resolution, etc., except that "if_else" requires special argument evaluation rules other than what a call does (eagerly evaluating all its arguments) - the reason why we need "if_else" (or "cond") special form.

So my question is, can we assume that for every coming special form in arrow, there exists a function "companion" of it? For some common special forms like "and", "or", "if_else", "case_when", there already are. For other special forms in Lisp, like "quote", "let", "defunc", they don't have corresponding "function"s, but those special forms themselves are more like fundamental parts, like defining variable and assignments, for a Turing-complete language, which I believe arrow won't need to have. So my so-far conclusion is, yes, we can assume it.

Then the next question is, given that we'll have the function companion for every special form, can we leverage the function companion to do type-checking/resolution (the function need to do those anyway)? My feeling is yes, at least I can't think of a case that a special form has different type-checking/resolution rule than its function companion.

So to summarize, if a special form:

  1. Always has a function companion;
  2. Is type-checked/resolved the same as its function companion;
  3. Evaluates the same as its function companion - I mean the evaluation for the last kernel invocation after all its arguments properly evaluated;
  4. Only differs a regular call on the arguments evaluation.

Then we can think of special forms to be a special call with special argument evaluation rules.

Of course, if there are counterexamples for any of above assumptions, then my conclusion would be wrong. And I should take the way around. Thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So my question is, can we assume that for every coming special form in arrow, there exists a function "companion" of it? For some common special forms like "and", "or", "if_else", "case_when", there already are. For other special forms in Lisp, like "quote", "let", "defunc", they don't have corresponding "function"s, but those special forms themselves are more like fundamental parts, like defining variable and assignments, for a Turing-complete language, which I believe arrow won't need to have. So my so-far conclusion is, yes, we can assume it.

Special forms in Arrow won't be many and we don't have to store them in a table like we have to do for functions. My use of LISP for inspiration might have led you into a path of over-engineering. We don't need the full generality of LISP or languages where users can define functions with call-by-name parameters. Think more C-like (strict evaluation) and less LISP-like (quote, let, defunc...) because that is overkill. Hence my insistence on making special forms a well-defined set of C++ classes.

Comment on lines +179 to +180
Expression call(std::string function, std::vector<Expression> arguments, Options options,
bool is_special_form = false) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Special forms are not "called", they are evaluated. They are more general expressions than calls. if (true) f(x) is equivalent to a call to f with x that you discover while trying to evaluate the conditional.

Comment on lines +52 to +53
/// TODO: More formal factory, a registry maybe?
static Result<std::unique_ptr<SpecialForm>> Make(const std::string& name);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The number of special-forms is very small. They should be sub-classes of the SpecialForm class. They will contain type-checking logic and other things that are very different from calls.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Jun 13, 2024
@felipecrv
Copy link
Contributor

I didn't make the special form the fourth impl of Expression as I found for most cases, e.g. creation and binding, a special form is just a Call - there would be too much duplicated code if we differentiate SpeicalForm from Call. So I made it a Call with a special flag, and did little specialization in binding and evaluation.

That duplication is a feature, not a bug. It's better to separate the cases than to have a bool splitting Call into two types. Impl already contains the variant tag that will tell you if something is a Call or an SpecialForm.

And as I said above, special-forms don't need all the complex things that Calls need.

SelectionVector::SelectionVector(std::shared_ptr<ArrayData> data)
: data_(std::move(data)) {
DCHECK_EQ(Type::INT32, data_->type->id());
DCHECK_EQ(Type::BOOL, data_->type->id());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bitmaps are alternatives to selection-vectors, yes, but they have a disadvantage: they don't reduce in size as you process them — you start with length bits and end up with length bits and have to scan them over and over. Ideally we should have both forms and use bitmaps when the selection is small, but if we are going to simplify things and have a single form it should be selection vectors.


Expression call(std::string function, std::vector<Expression> arguments,
std::shared_ptr<compute::FunctionOptions> options) {
std::shared_ptr<compute::FunctionOptions> options, bool is_special_form) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is exactly the kind of added complexity around calls that I'm trying to convince you is not a good idea. This boolean is effectively saying the arguments are passed in call-by-name fashion and not call-by-value [1].

We should keep the Arrow rewrite system a strict-evaluated system with just 1 or 2 special forms. That means calls are always called by value. We keep the if_else and case_when functions as they are and introduce a ConditionalSpecialForm in the context of compute::Expression.

C++ is a strictly evaluated rewrite system. You can't declare a function and say "all the parameters are call-by-name" like you're trying to do with this boolean parameter. So you can implement if_else just like Arrow does here and has to use special syntax to get a conditional that takes two expression in a call by name fashion.

template<typename T>
T if_else(bool cond, T then_case, T else_case) {
  if (cond) { return then_case; } else { return else_case; }
}

// Non-strict evaluation with special form syntax. The branches of the
// if are non-strictly evaluated in a way that can't be achieved with
// the `if_else` Call.
if (den > 0) {
  return num / den;
} else {
  return 0;
}

A general conditional special form can serve for if-else, and, and or. Probably the single special form we need.

https://en.wikipedia.org/wiki/Evaluation_strategy#Non-strict_binding_strategies

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your explanation is convincing and it's hard to not be convinced. I'll need some more time to digest your comments and get back to you.

I really appreciate your kind help and insightful comments, and the pointer to the interesting reading too :)

Thank you @felipecrv !

/// selection-vector-aware. If true, then the expression can be executed using the "fast
/// path" - all kernels directly working on the selection vector. Otherwise the
/// execution takes the "slow path" - gathering the input and scattering the output.
bool selection_vector_aware() const;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this should return the TypeMatcher or the DataType or be a bool MatchesSelectionVector(TypeHolder/DataType) that is derived from what KernelSignatures contain to let the evaluation system check which kinds of selection vectors they accept (INT32, INT64 or BOOL).

So you can extend KernelSignature to have a new type matcher (matches nothing by default) that can match selection vector types.

@qinpengxiang001
Copy link

qinpengxiang001 commented Sep 18, 2024

Is there any progress related to this PR? We also encountered similar problems during use. Hope there will be corresponding solutions

@felipecrv @zanmato1984

@zanmato1984
Copy link
Contributor Author

Is there any progress related to this PR? We also encountered similar problems during use. Hope there will be corresponding solutions

@felipecrv @zanmato1984

I wasn't working on this for quite a while. And the current solution isn't good enough and needs to be re-worked.

I do plan to resume this work recently, but considering the effort and the time I'll be able to spend, it won't come in a short time. Sorry about that.

@zanmato1984
Copy link
Contributor Author

I'm ready to send out my rework on this during the past year. Let's close this and start a new one. See #47374 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants