Member

@bkietz bkietz commented Aug 21, 2019

Adds the Expression class which is used to represent an arbitrarily complex filter expression. Expressions can be constructed using factory functions, for example:

and_(
  equal(field_ref("a"), scalar<int16_t>(5)),   // column 'a' is equal to 5
  greater(field_ref("b"), scalar<double>(0.0)) // column 'b' is greater than 0.0
)

Operator overloads are also provided, so the above could also be written as

"a"_ == int16_t(5) and "b"_ > 0.0
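
For readers wondering how a line like this can compile, here is a minimal sketch of the mechanism: a user-defined string literal plus overloaded comparison and logical operators. It is not the code in this PR. `ToyExpr`, `ToyField`, `MakeToy`, `Render` and `ExampleFilter` are hypothetical names, and the toy only builds a printable string rather than a real expression tree.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <sstream>
#include <string>
#include <utility>

namespace toy_filter {

// Hypothetical toy expression node; it only records a printable form.
struct ToyExpr {
  std::string text;
};
using ToyExprPtr = std::shared_ptr<ToyExpr>;

inline ToyExprPtr MakeToy(std::string text) {
  return std::make_shared<ToyExpr>(ToyExpr{std::move(text)});
}

// Hypothetical wrapper so that the operators below apply only to field
// references and not to arbitrary strings.
struct ToyField {
  std::string name;
};

// User-defined literal: "a"_ yields a reference to the field named "a".
inline ToyField operator""_(const char* name, std::size_t length) {
  return ToyField{std::string(name, length)};
}

// Render a scalar operand through a stream (so 0.0 prints as "0").
template <typename T>
std::string Render(const T& value) {
  std::ostringstream out;
  out << value;
  return out.str();
}

template <typename T>
ToyExprPtr operator==(const ToyField& field, const T& value) {
  return MakeToy("(" + field.name + " == " + Render(value) + ")");
}

template <typename T>
ToyExprPtr operator>(const ToyField& field, const T& value) {
  return MakeToy("(" + field.name + " > " + Render(value) + ")");
}

// Declared as operator&&; callers may spell it `and`.
inline ToyExprPtr operator&&(const ToyExprPtr& lhs, const ToyExprPtr& rhs) {
  return MakeToy("(" + lhs->text + " and " + rhs->text + ")");
}

// The filter from the description, written with the toy operators.
inline ToyExprPtr ExampleFilter() {
  return "a"_ == std::int16_t(5) and "b"_ > 0.0;
}

inline void CheckExample() {
  assert(ExampleFilter()->text == "((a == 5) and (b > 0))");
}

}  // namespace toy_filter
```

Because `==` and `>` bind more tightly than `&&`, no extra parentheses are needed in the filter. Note that an overloaded `&&` evaluates both operands and does not short-circuit, which is harmless when the operands only build an expression.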

These can be executed against a single record batch (using the arrow::compute:: kernels).

Additionally, expressions may be simplified or even elided given partition information. For example, given a partition where column 'a' is equal to 5 the above query could be simplified to "b"_ > 0.0 (since the condition on column 'a' is satisfied by the entire partition) and given a partition where column 'b' is between -1.0 and 0.0 the query simplifies to false (since no record in the partition will satisfy the condition on column 'b'). This can be used to support arbitrary partitioning schemes and do the least kernel work possible on each record batch.
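
The two partition examples above can be sketched as follows. This is a simplified model under stated assumptions, not the `Expression::Assume` implementation: the filter is restricted to a conjunction of comparisons against constants, a partition is described by a closed value range per column, and every name (`Comparison`, `Range`, `Partition`, `Judge`, `Simplify`) is hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

namespace toy_partition {

enum class Op { kEqual, kGreater };

// One comparison of a column against a constant, e.g. a == 5 or b > 0.0.
struct Comparison {
  std::string column;
  Op op;
  double value;
};

// Closed interval [min, max] of the values a column takes in a partition.
struct Range {
  double min;
  double max;
};

// Columns absent from the map are unconstrained by the partition.
using Partition = std::map<std::string, Range>;

enum class Verdict { kAlways, kNever, kUnknown };

// Decide whether a comparison is satisfied by every row, by no row, or
// cannot be decided from the partition information alone.
inline Verdict Judge(const Comparison& cmp, const Partition& partition) {
  auto it = partition.find(cmp.column);
  if (it == partition.end()) return Verdict::kUnknown;
  const Range& range = it->second;
  switch (cmp.op) {
    case Op::kEqual:
      if (range.min == cmp.value && range.max == cmp.value) return Verdict::kAlways;
      if (cmp.value < range.min || cmp.value > range.max) return Verdict::kNever;
      return Verdict::kUnknown;
    case Op::kGreater:
      if (range.min > cmp.value) return Verdict::kAlways;
      if (range.max <= cmp.value) return Verdict::kNever;
      return Verdict::kUnknown;
  }
  return Verdict::kUnknown;
}

struct Simplified {
  bool always_false;                  // no row in the partition can match
  std::vector<Comparison> remaining;  // comparisons still needing a kernel
};

// Simplify a conjunction: drop terms the partition guarantees, and
// collapse to false as soon as one term is impossible.
inline Simplified Simplify(const std::vector<Comparison>& conjunction,
                           const Partition& partition) {
  Simplified out{false, {}};
  for (const Comparison& cmp : conjunction) {
    switch (Judge(cmp, partition)) {
      case Verdict::kNever:
        return Simplified{true, {}};
      case Verdict::kAlways:
        break;
      case Verdict::kUnknown:
        out.remaining.push_back(cmp);
        break;
    }
  }
  return out;
}

inline void CheckExample() {
  const std::vector<Comparison> filter = {{"a", Op::kEqual, 5.0},
                                          {"b", Op::kGreater, 0.0}};
  // Partition where a is always 5: only the condition on b remains.
  const Simplified on_a = Simplify(filter, Partition{{"a", Range{5.0, 5.0}}});
  assert(!on_a.always_false && on_a.remaining.size() == 1);
  // Partition where b lies in [-1.0, 0.0]: the filter can never match.
  const Simplified on_b = Simplify(filter, Partition{{"b", Range{-1.0, 0.0}}});
  assert(on_b.always_false);
}

}  // namespace toy_partition
```

When `always_false` is set the partition can be skipped entirely, and when `remaining` is empty every row passes without running any kernel.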

@bkietz bkietz force-pushed the 6243-Implement-basic-Filter-ex branch from c214abe to 5541946 Compare August 25, 2019 17:03
@bkietz bkietz marked this pull request as ready for review August 26, 2019 14:49
@bkietz bkietz force-pushed the 6243-Implement-basic-Filter-ex branch from 5541946 to 18cae8d Compare August 26, 2019 15:34
Member

@pitrou pitrou left a comment

The design looks neat. I haven't looked at the implementation in filter.cc.

One thing that surprises me is the ability for filter to return nulls rather than booleans. Isn't the use case to filter input from a dataset? Why would one want to generate null rows at this point?

Member

Is there a particular reason for using "and" rather than "&&"?

Member Author

Looks more like SQL. I expect somebody will ask me to change it, but I thought it was more readable

Member

Well, I learnt something. I didn't know that these "alternate operator spellings" existed in C++.
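
For anyone else who has not met them: `and`, `or` and `not` are standard alternative tokens for `&&`, `||` and `!`, so an overload declared under the symbolic name can be called with either spelling. A small illustration, with `Flag` and `Combine` as made-up names:

```cpp
#include <cassert>

namespace toy_tokens {

// Hypothetical wrapper type used only to show overload resolution.
struct Flag {
  bool value;
};

// Overloads are declared with the symbolic spellings...
constexpr Flag operator&&(Flag lhs, Flag rhs) { return Flag{lhs.value && rhs.value}; }
constexpr Flag operator||(Flag lhs, Flag rhs) { return Flag{lhs.value || rhs.value}; }
constexpr Flag operator!(Flag operand) { return Flag{!operand.value}; }

// ...and can be invoked with the alternative tokens.
constexpr Flag Combine(Flag a, Flag b) { return (a and b) or not a; }

inline void CheckExample() {
  // (true and false) or not true  ->  false
  assert(!Combine(Flag{true}, Flag{false}).value);
  // (false and true) or not false ->  true
  assert(Combine(Flag{false}, Flag{true}).value);
}

}  // namespace toy_tokens
```

As far as I know, MSVC only accepts these tokens in its conforming mode (`/permissive-`) or after including `<ciso646>`, so code relying on them may need extra care on that compiler.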

Member

Why can't *equal(fieldRef("b"), null32) be written as "b"_ == null32?

Member

Also, shouldn't it simplify to never instead?

Member Author

Re filters returning nulls:
If a comparison references a slot which is null, we RefuseToGuess and the result of the comparison is null. Here's a test illustrating this:

// filter expression: "a"_ != 0 and "b"_ > 0.1
// record batch:
      {"a": 0, "b": -0.1},  // filtered out because "a" == 0, return 0
      {"a": 1, "b":  0.2},  // included, return 1
      {"a": 2, "b": -0.1},  // filtered out because "b" is not greater than 0.1, return 0
      {"a": 0, "b": null}   // unknown because "b" is null, return null

RefuseToGuess is also implemented at the level of Expression::Assume. This is necessary when (for example) a filter expression references a column absent from some data fragment (perhaps it is an older file from before the referenced column was added). In that case we can't know whether rows in that fragment are relevant or not and we must yield them, but we can avoid the work of evaluating a kernel there.

If a user needs to query only rows where a field is defined, use a validity expression (which extracts the null bitmask of an array to a boolean expression).
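
To make the row-by-row results in the test above concrete, here is a minimal sketch of three-valued evaluation. It assumes a null-propagating `and` (any null operand yields null), which is the reading consistent with the fourth row; a Kleene-style `and`, where false dominates null, is included for contrast and would return 0 for that row. `Tri`, `Slot`, `IsValid`, `EvaluateRow` and the other names are hypothetical and do not come from filter.cc.

```cpp
#include <cassert>

namespace toy_nulls {

// Result of evaluating a filter on one row.
enum class Tri { kFalse, kTrue, kNull };

// A nullable slot: `valid == false` models a null value.
struct Slot {
  double value;
  bool valid;
};

inline Tri NotEqual(Slot lhs, double rhs) {
  if (!lhs.valid) return Tri::kNull;  // comparison against null is unknown
  return lhs.value != rhs ? Tri::kTrue : Tri::kFalse;
}

inline Tri Greater(Slot lhs, double rhs) {
  if (!lhs.valid) return Tri::kNull;
  return lhs.value > rhs ? Tri::kTrue : Tri::kFalse;
}

// Null-propagating AND: any null operand makes the result null.
inline Tri AndPropagate(Tri lhs, Tri rhs) {
  if (lhs == Tri::kNull || rhs == Tri::kNull) return Tri::kNull;
  return (lhs == Tri::kTrue && rhs == Tri::kTrue) ? Tri::kTrue : Tri::kFalse;
}

// Kleene AND, for contrast: a false operand decides the result.
inline Tri AndKleene(Tri lhs, Tri rhs) {
  if (lhs == Tri::kFalse || rhs == Tri::kFalse) return Tri::kFalse;
  if (lhs == Tri::kNull || rhs == Tri::kNull) return Tri::kNull;
  return Tri::kTrue;
}

// Validity check: always a definite boolean, never null.
inline bool IsValid(Slot slot) { return slot.valid; }

// The filter "a"_ != 0 and "b"_ > 0.1 applied to a single row.
inline Tri EvaluateRow(Slot a, Slot b) {
  return AndPropagate(NotEqual(a, 0.0), Greater(b, 0.1));
}

inline void CheckExample() {
  const Slot null_slot{0.0, false};
  assert(EvaluateRow(Slot{0.0, true}, Slot{-0.1, true}) == Tri::kFalse);
  assert(EvaluateRow(Slot{1.0, true}, Slot{0.2, true}) == Tri::kTrue);
  assert(EvaluateRow(Slot{2.0, true}, Slot{-0.1, true}) == Tri::kFalse);
  assert(EvaluateRow(Slot{0.0, true}, null_slot) == Tri::kNull);
  assert(!IsValid(null_slot));
}

}  // namespace toy_nulls
```

Which of the two `and` semantics the filter should ultimately use is exactly the kind of question the null handling here leaves open.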

Member

@pitrou pitrou Aug 27, 2019

Hmm, I see. I'll let others share their opinions about this. @wesm @fsaintjacques

Member Author

@bkietz bkietz Aug 29, 2019

I am not confident this handles null correctly, and to address this I've created:
https://issues.apache.org/jira/browse/ARROW-6386
Draft PR #5231

Member

In which context can it be a scalar? And if it's an array, what is the type?

Member Author

The most trivial case of evaluating to scalar would be evaluation of a ScalarExpression. I could have just refused the bequest of Evaluate and returned an error in that case, but it seemed more robust to do it this way.

all, any, not and the comparison operators will evaluate to Type::BOOL, whereas a field reference will evaluate to whatever type that column has in the record batch. In general, the evaluated type of an Expression can be examined using Expression::Validate()

I can add a comment describing as much of this as you like.

Member

Thanks for the explanation. We can probably refine this later.

@fsaintjacques fsaintjacques changed the title ARROW-6243: [C++] implement basic filter ex ARROW-6243: [C++][Dataset] Filter expressions Aug 27, 2019
Member

nit: fieldRef is not compliant with our style guide

Member Author

@bkietz bkietz Aug 27, 2019

Do you mean https://google.github.io/styleguide/cppguide.html#Function_Names ? I was following the factories for Field = field(), StructType = struct_(), etc. What should I have named them?

Member

field_ref should probably be it then.

Member Author

I'll rename it

@wesm
Member

wesm commented Aug 27, 2019

I'll review in more detail when I can, this week is tough

@bkietz bkietz force-pushed the 6243-Implement-basic-Filter-ex branch 2 times, most recently from 1a110b3 to 5416690 Compare September 2, 2019 00:05
@fsaintjacques
Contributor

The MSVC error seems legit.

@bkietz
Member Author

bkietz commented Sep 5, 2019

I'll try to reproduce locally

@bkietz bkietz force-pushed the 6243-Implement-basic-Filter-ex branch from bac51ba to 82da536 Compare September 6, 2019 16:10
@bkietz
Member Author

bkietz commented Sep 6, 2019

@kou
Member

kou commented Sep 6, 2019

Restarted.

@bkietz bkietz force-pushed the 6243-Implement-basic-Filter-ex branch from 82da536 to 539c287 Compare September 9, 2019 13:25
@fsaintjacques fsaintjacques self-assigned this Sep 9, 2019
@codecov-io

Codecov Report

Merging #5157 into master will increase coverage by 0.36%.
The diff coverage is 58.53%.

@@            Coverage Diff             @@
##           master    #5157      +/-   ##
==========================================
+ Coverage   88.75%   89.12%   +0.36%     
==========================================
  Files         946      757     -189     
  Lines      124312   109719   -14593     
  Branches     1437        0    -1437     
==========================================
- Hits       110336    97789   -12547     
+ Misses      13614    11930    -1684     
+ Partials      362        0     -362
| Impacted Files | Coverage Δ | |
|---|---|---|
| cpp/src/arrow/record_batch.h | 100% <ø> (ø) | ⬆️ |
| cpp/src/arrow/compute/kernels/compare.cc | 87.31% <33.33%> (-1.33%) | ⬇️ |
| cpp/src/arrow/scalar.h | 71.42% <33.33%> (-16.08%) | ⬇️ |
| cpp/src/arrow/dataset/filter.cc | 46.53% <46.53%> (ø) | |
| cpp/src/arrow/scalar.cc | 65.21% <77.77%> (+1.88%) | ⬆️ |
| cpp/src/arrow/record_batch.cc | 94.07% <83.33%> (+5.03%) | ⬆️ |
| cpp/src/arrow/dataset/filter_test.cc | 95.87% <95.87%> (ø) | |
| cpp/src/arrow/dataset/filter.h | 96.49% <96.49%> (ø) | |
| cpp/src/arrow/filesystem/s3_internal.h | 90.74% <0%> (-3.71%) | ⬇️ |
| cpp/src/plasma/thirdparty/ae/ae.c | 70.75% <0%> (-0.95%) | ⬇️ |
| ... and 203 more | | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 92f16e3...fda4742.

@fsaintjacques
Contributor

This is a +1 for me, since this is still experimental, I say we merge it.

pprudhvi pushed a commit to pprudhvi/arrow that referenced this pull request Sep 16, 2019
Adds the Expression class which is used to represent an arbitrarily complex filter expression. Expressions can be constructed using factory functions, for example:

```c++
and_(
  equal(field_ref("a"), scalar<int16_t>(5)),   // column 'a' is equal to 5
  greater(field_ref("b"), scalar<double>(0.0)) // column 'b' is greater than 0.0
)
```

Operator overloads are also provided, so the above could also be written as
```c++
"a"_ == int16_t(5) and "b"_ > 0.0
```

These can be executed against a single record batch (using the `arrow::compute::` kernels).

Additionally, expressions may be simplified or even elided given partition information. For example, given a partition where column 'a' is equal to 5 the above query could be simplified to `"b"_ > 0.0` (since the condition on column 'a' is satisfied by the entire partition) and given a partition where column 'b' is between -1.0 and 0.0 the query simplifies to `false` (since no record in the partition will satisfy the condition on column 'b'). This can be used to support arbitrary partitioning schemes and do the least kernel work possible on each record batch.

Closes apache#5157 from bkietz/6243-Implement-basic-Filter-ex and squashes the following commits:

fda4742 <Benjamin Kietzman> give MSVC a little help to avoid instantiating impossible constructors
539c287 <Benjamin Kietzman> rename fieldRef to field_ref, comments
9494bab <Benjamin Kietzman> refactor And, Or to binary
0e366bb <Benjamin Kietzman> rename all/any to and/or
ca155c6 <Benjamin Kietzman> add explicit std::move, msvc doesn't like defining operator and
7c01f76 <Benjamin Kietzman> construct correct scalartype
3ac299b <Benjamin Kietzman> use strongly typed nulls
cab17b1 <Benjamin Kietzman> amend doccomments
46fa3fb <Benjamin Kietzman> add Expression::Validate implementations
d54ca06 <Benjamin Kietzman> Expressions evaluate to Datums
24323fc <Benjamin Kietzman> implement NotExpression::ToString
d7f3ed2 <Benjamin Kietzman> simplify Expression::Equals
cb56906 <Benjamin Kietzman> rename FieldRef -> Field, and_ -> all, or_ -> any, add comments to ExpressionType
5bae590 <Benjamin Kietzman> re-enable Invert
f9e0c08 <Benjamin Kietzman> remove unused Empty() method
e9af935 <Benjamin Kietzman> lint fixes
5899067 <Benjamin Kietzman> break OperatorExpression into multiple classes
02d94a6 <Benjamin Kietzman> use explicit enumeration of comparison results
523186d <Benjamin Kietzman> add support for evaluation of trivial expressions, tests
60d9e08 <Benjamin Kietzman> fix factory linkage, factory fns deal in shared_ptrs exclusively
55aa452 <Benjamin Kietzman> implement more robust null handling
3d355ff <Benjamin Kietzman> break up assumption logic, add function expression factories
3499fd2 <Benjamin Kietzman> add an expression simplification test
06b25be <Benjamin Kietzman> move expression evaluation to a free function
a67b330 <Benjamin Kietzman> add comments, more tests, simplify operator overloads
9282284 <Benjamin Kietzman> simplify filter testing
6ccd636 <Benjamin Kietzman> add execution of filter expressions using compute kernels
f316578 <Benjamin Kietzman> add basic filter expressions

Authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
fsaintjacques pushed a commit that referenced this pull request Sep 19, 2019
The condition is an expression guaranteed to evaluate to true for all records in a DataSource. This provides some predicate pushdown functionality: DataSources whose condition precludes a filter expression will not yield any fragments (since those fragments would be filtered out anyway).

This patch does not implement evaluation of filter expressions against an in-memory RecordBatch. It makes a half-hearted attempt at API compatibility with #5157, which does implement this.

Closes #5221 from bkietz/6244-Implement-Partition-DataS and squashes the following commits:

142cc7b <Benjamin Kietzman> explicit move for Result returning functions
13b5948 <Benjamin Kietzman> add comment on motivation for type erasure approach
42e2ad3 <Benjamin Kietzman> clang-format
a9e5d7a <Benjamin Kietzman> bludgeon MSVC linker error with __forceinline
e8c8cd6 <Benjamin Kietzman> AssumePartitionExpression's inout argument is confusing
48b349f <Benjamin Kietzman> move overridable GetFragments to protected GetFragmentsImpl
19f26a0 <Benjamin Kietzman> DataSource::assume -> bool, remove partition_expr mutator
a651c65 <Benjamin Kietzman> rename DataSource::condition to partition_expression
949fa7a <Benjamin Kietzman> provide basic predicate pushdown to datasources
955cb56 <Benjamin Kietzman> flesh out shim Expression class
b1a6c54 <Benjamin Kietzman> remove unused FileSystemBasedDataSource::options_
4f5a8bc <Benjamin Kietzman> rename partitionner_
d66f159 <Benjamin Kietzman> add an Expression stub

Authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: François Saint-Jacques <fsaintjacques@gmail.com>
@bkietz bkietz deleted the 6243-Implement-basic-Filter-ex branch February 25, 2021 16:45