GH-41094: [C++] Support scalar expression special form #42106

zanmato1984 · 2024-06-11T17:52:50Z

Rationale for this change

@felipecrv gave a very descriptive rationale in comment #41094 (comment). This PR follows it mostly, with the following exceptions:

I found bit mask is more preferable to be used as the "selection vector", e.g., the first argument of if_else evaluates a boolean column, which is exactly the mask we need for if branch and saves the indices collection. (but I guess the naming should be revised as I saw other usage of term "selection vector" is referring to indices, not mask?)
I didn't make the special form the fourth impl of Expression as I found for most cases, e.g. creation and binding, a special form is just a Call - there would be too much duplicated code if we differentiate SpeicalForm from Call. So I made it a Call with a special flag, and did little specialization in binding and evaluation.
A special form Call contains a post-bound polymorphic SpecialForm pointer, which will take over the evaluation of the whole expression.
Special form if_else is implemented to show-case how a special form leverages selection vector to mask the evaluation of certain arguments. Another thing to note is that if_else special form still leverages if_else kernel to do the final function evaluation (with all arguments properly evaluated).
The kernels are not bothered with the concept "special form", they only need to sense the existence of selection vector. If they support selection-vector-aware execution, they just claim this fact when registration. Otherwise they can keep being dumb and the Expression evaluation will take care of the selection vector on their behalves, i.e., the "slow path".
The Function structure is not touched at all. Because when a Function is called, all arguments are already evaluated, so special forms don't come into play.
I'm also working on a "scatter" operation, as well as piloting some simple kernels to be selection-vector-aware elsewhere. The code is not included in this PR because I don't want them to distract the attention on the main sketch. I'll send out those code separately, or include them in this PR in the future - I'm not sure yet.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

GitHub Issue: [C++] Wrong and low inefficient expression execution for [if/else, case/when ... etc] expression #41094

github-actions · 2024-06-11T17:53:25Z

⚠️ GitHub issue #41094 has been automatically assigned in GitHub to PR creator.

zanmato1984 · 2024-06-11T19:08:30Z

cpp/src/arrow/compute/special_forms/if_else_special_form.cc

+  // The kernel (if_else) is not selection-vector-aware, so the input should not have a
+  // selection vector.
+  DCHECK(!call.kernel->selection_vector_aware && !input.selection_vector);
+
+  std::vector<Datum> arguments(call.arguments.size());
+  ARROW_ASSIGN_OR_RAISE(arguments[0],
+                        ExecuteScalarExpression(call.arguments[0], input, exec_context));
+  // Use cond as selection vector for IF branch.
+  // TODO: Consider chunked array for arguments[0].
+  auto if_sel = std::make_shared<SelectionVector>(arguments[0].array());
+  // Duplicate and invert cond as selection vector for ELSE branch.
+  ARROW_ASSIGN_OR_RAISE(
+      auto else_sel,
+      if_sel->Copy(CPUDevice::memory_manager(exec_context->memory_pool())));
+  RETURN_NOT_OK(else_sel->Invert());
+
+  ExecBatch if_input = input;
+  if_input.selection_vector = std::move(if_sel);
+  ARROW_ASSIGN_OR_RAISE(
+      arguments[1], ExecuteScalarExpression(call.arguments[1], if_input, exec_context));
+  ExecBatch else_input = input;
+  else_input.selection_vector = std::move(else_sel);
+  ARROW_ASSIGN_OR_RAISE(
+      arguments[2], ExecuteScalarExpression(call.arguments[2], else_input, exec_context));
+
+  // Leveraging if_else kernel with all arguments evaluated.
+  return ExecuteCallNonRecursive(call, input, arguments, exec_context);


@felipecrv This part may look quite different than you originally imagined/proposed. For example, you might wonder why no "scatter" at all? That's because scatter is handled more generically in ExecuteScalarExpression. And even why is that? First it takes less code. And more importantly, I would imagine not only special forms being the producers of selection vector, but also acero nodes like Filter: a Filter node produces a selection vector, i.e., late materializing the filtered rows, and all the subsequent nodes could keep working on this selection vector without even knowing its existence.

On the other hand, the current way loses the opportunity of optimization for out-of-order scatter all branches at once. That's true. But we can always improve it - all the information needed (kernel, arguments, etc.) is already there - it just takes more code.

mapleFU · 2024-06-11T19:14:54Z

cpp/src/arrow/compute/expression.h

+    bool selection_vector_aware = false;
+    std::shared_ptr<SpecialForm> special_form = NULLPTR;


Would selection_vector_aware defined in kernel?

Yeah, a kernel will claim itself as selection_vector_aware here: https://github.com/apache/arrow/pull/42106/files#diff-c275e8c64a3935e84eaf970584e61b0d3db88fa5b052965505edf8126526c80eR527-R532
if it has special execution path to deal with the given selection vector directly.

mapleFU

This looks nice as a POC! By the way, seems if_else and switch could be migrate to special_forms?

By the way, previously I didn't understand why we need a "MetaFunction", seems something like "cast" would also a "special form"? I've a issue here: #41926

mapleFU · 2024-06-11T19:19:14Z

cpp/src/arrow/compute/special_forms/if_else_special_form.cc

+  ARROW_ASSIGN_OR_RAISE(arguments[0],
+                        ExecuteScalarExpression(call.arguments[0], input, exec_context));
+  // Use cond as selection vector for IF branch.
+  // TODO: Consider chunked array for arguments[0].


Maybe applying check here? Or could it be scalar?

Yeah, will do. Maybe later after the rest part of the sketch is good enough.

mapleFU · 2024-06-11T19:20:57Z

cpp/src/arrow/compute/expression.cc

+    if (value.is_scalar()) continue;
+    ARROW_ASSIGN_OR_RAISE(
+        value, Filter(value, input.selection_vector->data(), FilterOptions::Defaults()));
+  }


Aha, this can be implemented like this, funny

mapleFU · 2024-06-11T19:22:50Z

cpp/src/arrow/compute/special_form.h

+namespace arrow {
+namespace compute {
+
+/// The concept "special form" is borrowed from Lisp


Aha, I didn't know that

zanmato1984 · 2024-06-12T02:44:31Z

This looks nice as a POC! By the way, seems if_else and switch could be migrate to special_forms?

Thanks for looking! As for migration to special forms, there are some points I want to make:

The way a special form is currently implemented doesn't simply "replace" its kernel counter-part. Instead, a special form takes care of the argument evaluation, and directly "invokes" the underlying kernel. I think the kernel will always be there to serve both direct function call (i.e., not via expression system) and expression evaluation.
I'm exposing the special form creation via the is_special_form flag in Expression::Call so that user is free to use either "non-special-form" (defaulted) or special form function call.
Whether to use special form or non-special-form in the implementation for a specific type of query plan, e.g., substrait, should be fixed though. This would happen mostly in the compiling phase: when compiling a function-call-like expression, always prefer special form whenever possible; or at least use an user option to guide that.

zanmato1984 · 2024-06-12T02:47:20Z

By the way, previously I didn't understand why we need a "MetaFunction", seems something like "cast" would also a "special form"? I've a issue here: #41926

I'm not familiar with MetaFunction either. I'll take a look and get back to you. Thanks.

zanmato1984 · 2024-06-12T04:44:19Z

cpp/src/arrow/compute/expression.cc

        "ExecuteScalarExpression cannot Execute non-scalar expression ", expr.ToString());
  }

+  if (!expr.selection_vector_aware() && input.selection_vector) {


This actually implies a strategy (1) that is: if any subexpression (including the top-level expression itself) is not selection vector aware, we do the gather/scatter work only once for the top-level expression. This way all the subexpressions can work on a (gathered) continuous input.

An alternative strategy (2) could be: every expression that is not selection vector aware (not counting its subexpressions), does the gather/scatter itself. This way every expression itself maintains the input/output shape the same as the original input.

Given that for a long time (possibly forever), there will be much less selection vector aware kernels than the other, it's probably worth it to use 1 instead of 2, as the gather/scatter only happen once for the whole expression.

zanmato1984 · 2024-06-12T12:59:39Z

By the way, previously I didn't understand why we need a "MetaFunction", seems something like "cast" would also a "special form"? I've a issue here: #41926

I'm not familiar with MetaFunction either. I'll take a look and get back to you. Thanks.

I tried my best answering your questions in #41926. And I don't think meta function has anything to do with special form :(

felipecrv · 2024-06-13T01:57:06Z

cpp/src/arrow/compute/expression.h

    std::shared_ptr<FunctionOptions> options;
+    // Whether this call is a special form (e.g. if-else). If true, the `special_form`
+    // field will be resolved in binding.
+    bool is_special_form = false;


A special-form is not a call. A special-form is a special form of expression.

- using Impl = std::variant<Datum, Parameter, Call>; + using Impl = std::variant<Datum, Parameter, Call, Special>;

This is fundamental to the approach. Special forms don't have names, they are virtual sub-classes of Special.

We can start with CondSpecial which can model if-else and case-when based on number of conditions and branches used to construct it.

selection_vector_aware is not a member variable, it's something resolved at evaluation time by traversing the expression.

felipecrv · 2024-06-13T02:08:39Z

cpp/src/arrow/compute/expression.h

  // XXX someday
  // NullGeneralization::type nullable() const;

+  /// Whether the entire expression (including all its subexpressions) is


I think this is only something you will discover during evaluation. The same function can have kernels that are aware of selection and some that are not. Another important aspect is the type-checking of the special forms. You need to unify [1] the output type of all the branches, so you can pre-allocate the output and introduce the appropriate casts.

I think this is only something you will discover during evaluation.

I don't think it's necessarily during evaluation. This can be done in binding.

The same function can have kernels that are aware of selection and some that are not.

Binding will resolve the function to concrete kernel, so the selection awareness is known once the kernel is resolved. Also note that I made selection aware flag a property of kernel rather than of function.

Another important aspect is the type-checking of the special forms. You need to unify [1] the output type of all the branches, so you can pre-allocate the output and introduce the appropriate casts.

(I don't think this is related to selection-awareness. In this proposal, special forms and expression's selection-awareness are relatively separated - they only correlate on the fact that a special form may use selection vector to mask the evaluation of its subexpressions. So I'm assuming this is a general comment about special form.)

Before responding to your comment about special forms, I want to raise some other discussion which is part of my reasons making this proposal the current way.

In arrow compute, there are "function"s, which users can directly invoke via CallFunction, and there are expressions, one of whose concrete type is a "call", which further leverages the "function" when evaluating. Now take "if_else" as an example. There already exists an "if_else" function, which makes prefect sense because users do need it for two-way branching on a triple of {condition vector, true branch vector, false branch vector}. For expression, "if_else", as a regular function, is represented as a call. This is fine in terms of lexical/grammatical aspects, type checking, kernel resolution, etc., except that "if_else" requires special argument evaluation rules other than what a call does (eagerly evaluating all its arguments) - the reason why we need "if_else" (or "cond") special form.

So my question is, can we assume that for every coming special form in arrow, there exists a function "companion" of it? For some common special forms like "and", "or", "if_else", "case_when", there already are. For other special forms in Lisp, like "quote", "let", "defunc", they don't have corresponding "function"s, but those special forms themselves are more like fundamental parts, like defining variable and assignments, for a Turing-complete language, which I believe arrow won't need to have. So my so-far conclusion is, yes, we can assume it.

Then the next question is, given that we'll have the function companion for every special form, can we leverage the function companion to do type-checking/resolution (the function need to do those anyway)? My feeling is yes, at least I can't think of a case that a special form has different type-checking/resolution rule than its function companion.

So to summarize, if a special form:

Always has a function companion;

Is type-checked/resolved the same as its function companion;

Evaluates the same as its function companion - I mean the evaluation for the last kernel invocation after all its arguments properly evaluated;

Only differs a regular call on the arguments evaluation.

Then we can think of special forms to be a special call with special argument evaluation rules.

Of course, if there are counterexamples for any of above assumptions, then my conclusion would be wrong. And I should take the way around. Thanks.

So my question is, can we assume that for every coming special form in arrow, there exists a function "companion" of it? For some common special forms like "and", "or", "if_else", "case_when", there already are. For other special forms in Lisp, like "quote", "let", "defunc", they don't have corresponding "function"s, but those special forms themselves are more like fundamental parts, like defining variable and assignments, for a Turing-complete language, which I believe arrow won't need to have. So my so-far conclusion is, yes, we can assume it.

Special forms in Arrow won't be many and we don't have to store them in a table like we have to do for functions. My use of LISP for inspiration might have led you into a path of over-engineering. We don't need the full generality of LISP or languages where users can define functions with call-by-name parameters. Think more C-like (strict evaluation) and less LISP-like (quote, let, defunc...) because that is overkill. Hence my insistence on making special forms a well-defined set of C++ classes.

felipecrv · 2024-06-13T02:09:56Z

cpp/src/arrow/compute/expression.h

+Expression call(std::string function, std::vector<Expression> arguments, Options options,
+                bool is_special_form = false) {


Special forms are not "called", they are evaluated. They are more general expressions than calls. if (true) f(x) is equivalent to a call to f with x that you discover while trying to evaluate the conditional.

felipecrv · 2024-06-13T02:11:09Z

cpp/src/arrow/compute/special_form.h

+  /// TODO: More formal factory, a registry maybe?
+  static Result<std::unique_ptr<SpecialForm>> Make(const std::string& name);


The number of special-forms is very small. They should be sub-classes of the SpecialForm class. They will contain type-checking logic and other things that are very different from calls.

felipecrv · 2024-06-13T02:28:24Z

I didn't make the special form the fourth impl of Expression as I found for most cases, e.g. creation and binding, a special form is just a Call - there would be too much duplicated code if we differentiate SpeicalForm from Call. So I made it a Call with a special flag, and did little specialization in binding and evaluation.

That duplication is a feature, not a bug. It's better to separate the cases than to have a bool splitting Call into two types. Impl already contains the variant tag that will tell you if something is a Call or an SpecialForm.

And as I said above, special-forms don't need all the complex things that Calls need.

felipecrv · 2024-06-13T14:37:06Z

cpp/src/arrow/compute/exec.cc

 SelectionVector::SelectionVector(std::shared_ptr<ArrayData> data)
    : data_(std::move(data)) {
-  DCHECK_EQ(Type::INT32, data_->type->id());
+  DCHECK_EQ(Type::BOOL, data_->type->id());


Bitmaps are alternatives to selection-vectors, yes, but they have a disadvantage: they don't reduce in size as you process them — you start with length bits and end up with length bits and have to scan them over and over. Ideally we should have both forms and use bitmaps when the selection is small, but if we are going to simplify things and have a single form it should be selection vectors.

felipecrv · 2024-06-13T14:48:29Z

cpp/src/arrow/compute/expression.cc


 Expression call(std::string function, std::vector<Expression> arguments,
-                std::shared_ptr<compute::FunctionOptions> options) {
+                std::shared_ptr<compute::FunctionOptions> options, bool is_special_form) {


This is exactly the kind of added complexity around calls that I'm trying to convince you is not a good idea. This boolean is effectively saying the arguments are passed in call-by-name fashion and not call-by-value [1].

We should keep the Arrow rewrite system a strict-evaluated system with just 1 or 2 special forms. That means calls are always called by value. We keep the if_else and case_when functions as they are and introduce a ConditionalSpecialForm in the context of compute::Expression.

C++ is a strictly evaluated rewrite system. You can't declare a function and say "all the parameters are call-by-name" like you're trying to do with this boolean parameter. So you can implement if_else just like Arrow does here and has to use special syntax to get a conditional that takes two expression in a call by name fashion.

template<typename T> T if_else(bool cond, T then_case, T else_case) { if (cond) { return then_case; } else { return else_case; } } // Non-strict evaluation with special form syntax. The branches of the // if are non-strictly evaluated in a way that can't be achieved with // the `if_else` Call. if (den > 0) { return num / den; } else { return 0; }

A general conditional special form can serve for if-else, and, and or. Probably the single special form we need.

https://en.wikipedia.org/wiki/Evaluation_strategy#Non-strict_binding_strategies

Your explanation is convincing and it's hard to not be convinced. I'll need some more time to digest your comments and get back to you.

I really appreciate your kind help and insightful comments, and the pointer to the interesting reading too :)

Thank you @felipecrv !

felipecrv · 2024-06-13T16:04:25Z

cpp/src/arrow/compute/expression.h

+  /// selection-vector-aware. If true, then the expression can be executed using the "fast
+  /// path" - all kernels directly working on the selection vector. Otherwise the
+  /// execution takes the "slow path" - gathering the input and scattering the output.
+  bool selection_vector_aware() const;


I think that this should return the TypeMatcher or the DataType or be a bool MatchesSelectionVector(TypeHolder/DataType) that is derived from what KernelSignatures contain to let the evaluation system check which kinds of selection vectors they accept (INT32, INT64 or BOOL).

So you can extend KernelSignature to have a new type matcher (matches nothing by default) that can match selection vector types.

qinpengxiang001 · 2024-09-18T08:31:40Z

Is there any progress related to this PR? We also encountered similar problems during use. Hope there will be corresponding solutions

@felipecrv @zanmato1984

zanmato1984 · 2024-09-18T09:21:54Z

Is there any progress related to this PR? We also encountered similar problems during use. Hope there will be corresponding solutions

@felipecrv @zanmato1984

I wasn't working on this for quite a while. And the current solution isn't good enough and needs to be re-worked.

I do plan to resume this work recently, but considering the effort and the time I'll be able to spend, it won't come in a short time. Sorry about that.

zanmato1984 · 2025-08-20T03:35:29Z

I'm ready to send out my rework on this during the past year. Let's close this and start a new one. See #47374 .

zanmato1984 added 11 commits May 24, 2024 18:14

WIP

49950d6

Some framework change

fb7b936

Use bit mask as selection vector

8047d0a

Add test

d10d34a

Add basic selection vector operations

49fb971

Support expression level selection vector aware inquery

cf33b98

Framework mostly done

6c6ed3b

Revert WIP sel-aware kernel change

0112889

Doc

42d8c87

Doc

8bb5af3

More doc

b382a67

github-actions bot added Component: C++ awaiting review Awaiting review labels Jun 11, 2024

zanmato1984 requested a review from felipecrv June 11, 2024 18:31

Lint

2bca08c

zanmato1984 mentioned this pull request Jun 11, 2024

[C++] Wrong and low inefficient expression execution for [if/else, case/when ... etc] expression #41094

Open

zanmato1984 commented Jun 11, 2024

View reviewed changes

github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jun 11, 2024

mapleFU reviewed Jun 11, 2024

View reviewed changes

zanmato1984 commented Jun 12, 2024

View reviewed changes

felipecrv reviewed Jun 13, 2024

View reviewed changes

github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Jun 13, 2024

felipecrv reviewed Jun 13, 2024

View reviewed changes

zanmato1984 mentioned this pull request Dec 11, 2024

GH-44393: [C++][Compute] Vector selection functions inverse_permutation and scatter #44394

Merged

zanmato1984 mentioned this pull request Aug 20, 2025

[C++][Compute] Support special form in expression system #47374

Open

8 tasks

zanmato1984 closed this Aug 20, 2025

		bool selection_vector_aware = false;
		std::shared_ptr<SpecialForm> special_form = NULLPTR;

		Expression call(std::string function, std::vector<Expression> arguments, Options options,
		bool is_special_form = false) {

		/// TODO: More formal factory, a registry maybe?
		static Result<std::unique_ptr<SpecialForm>> Make(const std::string& name);

GH-41094: [C++] Support scalar expression special form #42106

GH-41094: [C++] Support scalar expression special form #42106

Uh oh!

Conversation

zanmato1984 commented Jun 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

github-actions bot commented Jun 11, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mapleFU left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zanmato1984 commented Jun 12, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zanmato1984 commented Jun 12, 2024

Uh oh!

zanmato1984 Jun 12, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zanmato1984 commented Jun 12, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

felipecrv commented Jun 13, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

qinpengxiang001 commented Sep 18, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zanmato1984 commented Sep 18, 2024

Uh oh!

zanmato1984 commented Aug 20, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

zanmato1984 commented Jun 11, 2024 •

edited

Loading

mapleFU left a comment •

edited

Loading

zanmato1984 commented Jun 12, 2024 •

edited

Loading

zanmato1984 Jun 12, 2024 •

edited

Loading

zanmato1984 commented Jun 12, 2024 •

edited

Loading

qinpengxiang001 commented Sep 18, 2024 •

edited

Loading