-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-6243: [C++][Dataset] Filter expressions #5157
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
c214abe to
5541946
Compare
5541946 to
18cae8d
Compare
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The design looks neat. I haven't looked at the implementation in filter.cc.
One thing that surprises me is the ability for filter to return nulls rather than booleans. Isn't the use case to filter input from a dataset? Why would one want to generate null rows at this point?
cpp/src/arrow/dataset/filter_test.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a particular reason for using "and" rather than "&&"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks more like SQL. I expect somebody will ask me to change it, but I thought it was more readable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I learnt something. I didn't know that these "alternate operator spellings" existed in C++.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Relevant change to cpplint exclusions:
https://github.com/apache/arrow/pull/5157/files#diff-ca3a12ce3f399d977c7dc7c19c893b65R38
cpp/src/arrow/dataset/filter_test.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't *equal(fieldRef("b"), null32) be written as "b"_ == null32?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, shouldn't it simplify to never instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re filters returning nulls:
If a comparison references a slot which is null, we RefuseToGuess and the result of the comparison is null. Here's a test illustrating this:
// filter expression: "a"_ != 0 and "b"_ > 0.1
// record batch:
{"a": 0, "b": -0.1}, // filtered out because "a" == 0, return 0
{"a": 1, "b": 0.2}, // included, return 1
{"a": 2, "b": -0.1}, // filtered out because "b" is not greater than 0.1, return 0
{"a": 0, "b": null} // unknown because "b" is null, return nullRefuseToGuess is also implemented at the level of Expression::Assume. This is necessary when (for example) a filter expression references a column absent from some data fragment (perhaps it is an older file from before the referenced column was added). In that case we can't know whether rows in that fragment are relevant or not and we must yield them, but we can avoid the work of evaluating a kernel there.
If a user needs to query only rows where a field is defined, use a validity expression (which extracts the null bitmask of an array to a boolean expression).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I see. I'll let others share their opinions about this. @wesm @fsaintjacques
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not confident this handles null correctly, and to address this I've created:
https://issues.apache.org/jira/browse/ARROW-6386
Draft PR #5231
cpp/src/arrow/dataset/filter.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In which context can it be a scalar? And if it's an array, what is the type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The most trivial case of evaluating to scalar would be evaluation of a ScalarExpression. I could have just refused the bequest of Evaluate and returned an error in that case, but it seemed more robust to do it this way.
all, any, not and the comparison operators will evaluate to Type::BOOL, whereas a field reference will evaluate to whatever type that column has in the record batch. In general, the evaluated type of an Expression can be examined using Expression::Validate()
I can add a comment describing as much of this as you like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation. We can probably refine this later.
cpp/src/arrow/dataset/filter.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: fieldRef is not compliant with our style guide
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean https://google.github.io/styleguide/cppguide.html#Function_Names ? I was following the factories for Field = field(), StructType = struct_(), etc. What should I have named them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
field_ref should probably be it then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll rename it
|
I'll review in more detail when I can, this week is tough |
1a110b3 to
5416690
Compare
|
The MSVC error seems legit. |
|
I'll try to reproduce locally |
bac51ba to
82da536
Compare
|
Travis timed out https://travis-ci.org/apache/arrow/jobs/581750838#L2081 |
|
Restarted. |
82da536 to
539c287
Compare
Codecov Report
@@ Coverage Diff @@
## master #5157 +/- ##
==========================================
+ Coverage 88.75% 89.12% +0.36%
==========================================
Files 946 757 -189
Lines 124312 109719 -14593
Branches 1437 0 -1437
==========================================
- Hits 110336 97789 -12547
+ Misses 13614 11930 -1684
+ Partials 362 0 -362
Continue to review full report at Codecov.
|
|
This is a +1 for me, since this is still experimental, I say we merge it. |
Adds the Expression class which is used to represent an arbitrarily complex filter expression. Expressions can be constructed using factory functions, for example:
```c++
and_(
equal(field_ref("a"), scalar<int16_t>(5)), // column 'a' is equal to 5
greater(field_ref("b"), scalar<double>(0.0)) // column 'b' is greater than 0.0
)
```
Operator overloads are also provided, so the above could also be written as
```c++
"a"_ == int16_t(5) and "b"_ > 0.0
```
These can be executed against a single record batch (using the `arrow::compute::` kernels).
Additionally, expressions may be simplified or even elided given partition information. For example, given a partition where column 'a' is equal to 5 the above query could be simplified to `"b"_ > 0.0` (since the condition on column 'a' is satisfied by the entire partition) and given a partition where column 'b' is between -1.0 and 0.0 the query simplifies to `false` (since no record in the partition will satisfy the condition on column 'b'). This can be used to support arbitrary partitioning schemes and do the least kernel work possible on each record batch.
Closes apache#5157 from bkietz/6243-Implement-basic-Filter-ex and squashes the following commits:
fda4742 <Benjamin Kietzman> give MSVC a little help to avoid instantiating impossible constructors
539c287 <Benjamin Kietzman> rename fieldRef to field_ref, comments
9494bab <Benjamin Kietzman> refactor And, Or to binary
0e366bb <Benjamin Kietzman> rename all/any to and/or
ca155c6 <Benjamin Kietzman> add explicit std::move, msvc doesn't like defining operator and
7c01f76 <Benjamin Kietzman> construct correct scalartype
3ac299b <Benjamin Kietzman> use strongly typed nulls
cab17b1 <Benjamin Kietzman> amend doccomments
46fa3fb <Benjamin Kietzman> add Expression::Validate implementations
d54ca06 <Benjamin Kietzman> Expressions evaluate to Datums
24323fc <Benjamin Kietzman> implement NotExpression::ToString
d7f3ed2 <Benjamin Kietzman> simplify Expression::Equals
cb56906 <Benjamin Kietzman> rename FieldRef -> Field, and_ -> all, or_ -> any, add comments to ExpressionType
5bae590 <Benjamin Kietzman> re-enable Invert
f9e0c08 <Benjamin Kietzman> remove unused Empty() method
e9af935 <Benjamin Kietzman> lint fixes
5899067 <Benjamin Kietzman> break OperatorExpression into multiple classes
02d94a6 <Benjamin Kietzman> use explicit enumeration of comparison results
523186d <Benjamin Kietzman> add support for evaluation of trivial expressions, tests
60d9e08 <Benjamin Kietzman> fix factory linkage, factory fns deal in shared_ptrs exclusively
55aa452 <Benjamin Kietzman> implement more robust null handling
3d355ff <Benjamin Kietzman> break up assumption logic, add function expression factories
3499fd2 <Benjamin Kietzman> add an expression simplification test
06b25be <Benjamin Kietzman> move expression evaluation to a free function
a67b330 <Benjamin Kietzman> add comments, more tests, simplify operator overloads
9282284 <Benjamin Kietzman> simplify filter testing
6ccd636 <Benjamin Kietzman> add execution of filter expressions using compute kernels
f316578 <Benjamin Kietzman> add basic filter expressions
Authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
The condition is an expression guaranteed to evaluate true for all records in a DataSource. This provides some predicate push down funcitonality: DataSources whose condition precludes a filter expression will not yield any fragments (since those fragments would be filtered out anyway). This patch does not implement evaluation of filter expressions against an in memory RecordBatch. It makes a half hearted attempt at API compatibility with #5157 which does implement this. Closes #5221 from bkietz/6244-Implement-Partition-DataS and squashes the following commits: 142cc7b <Benjamin Kietzman> explicit move for Result returning functions 13b5948 <Benjamin Kietzman> add comment on motivation for type erasure approach 42e2ad3 <Benjamin Kietzman> clang-format a9e5d7a <Benjamin Kietzman> bludgeon MSVC linker error with __forceinline e8c8cd6 <Benjamin Kietzman> AssumePartitionExpression's inout argument is confusing 48b349f <Benjamin Kietzman> move overridable GetFragments to protected GetFragmentsImpl 19f26a0 <Benjamin Kietzman> DataSource::assume -> bool, remove partition_expr mutator a651c65 <Benjamin Kietzman> rename DataSource::condition to partition_expression 949fa7a <Benjamin Kietzman> provide basic predicate pushdown to datasources 955cb56 <Benjamin Kietzman> flesh out shim Expression class b1a6c54 <Benjamin Kietzman> remove unused FileSystemBasedDataSource::options_ 4f5a8bc <Benjamin Kietzman> rename partitionner_ d66f159 <Benjamin Kietzman> add an Expression stub Authored-by: Benjamin Kietzman <bengilgit@gmail.com> Signed-off-by: François Saint-Jacques <fsaintjacques@gmail.com>
Adds the Expression class which is used to represent an arbitrarily complex filter expression. Expressions can be constructed using factory functions, for example:
Operator overloads are also provided, so the above could also be written as
These can be executed against a single record batch (using the
arrow::compute::kernels).Additionally, expressions may be simplified or even elided given partition information. For example, given a partition where column 'a' is equal to 5 the above query could be simplified to
"b"_ > 0.0(since the condition on column 'a' is satisfied by the entire partition) and given a partition where column 'b' is between -1.0 and 0.0 the query simplifies tofalse(since no record in the partition will satisfy the condition on column 'b'). This can be used to support arbitrary partitioning schemes and do the least kernel work possible on each record batch.