-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-7391: [C++][Dataset] Remove Expression subclasses from bindings #7026
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
nealrichardson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of questions on the R bindings. Also, IIUC you can delete https://github.com/apache/arrow/blob/master/r/R/expression.R#L122-L163 and refactor that file a bit to match how you're not really specifying the Expression subclasses anymore.
jorisvandenbossche
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally, I am not sure if I find the Expression.field, Expression.and_, etc API necessarily better than the current FieldExpression, AndExpression, etc.
Both equally tie us to a certain python API: whether we use static methods or subclass constructors to construct the Expression objects doesn't really matter in the end IMO (they still expose an Expression class with methods, and a constructor like AndExpression could also simply be a wrapper to construct a base expression object as is done now in Expression.and_).
To be clear, I am fully on board with the the general clean-up and reducing the C++ API used from the python side. So I am only talking about the Python (user-facing) API. But for that, I am wondering: do we actually need to expose those constructors publicly? I think that the current ds.field(..) and ds.scalar(..) functions should be sufficient, as all other expressions can be created based on those using methods or binary operations?
|
Agree that exposing |
c920613 to
0c9233e
Compare
|
@github-actions autotune everything |
jorisvandenbossche
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates!
22e238e to
3d4fd34
Compare
cpp/src/arrow/util/iterator.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This allows Iterator::RangeIterator to be copied, which is a requirement on iterators https://en.cppreference.com/w/cpp/named_req/Iterator
We haven't caught this before because we only use Iterator::begin and Iterator::end in the context of range-for loops which don't require it. However cython wraps the range-for loop does, however, which is why I noticed the concept error.
cpp/src/arrow/type.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this doing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is ensuring that /\w+Type::type_id/ is a defined symbol in libarrow.so. Previously these have only been used within a translation unit which links into libarrow.so, but now I'm using them in filter.cc which gets linked to libarrow_dataset.so. If you remove these you'll find the linker emits an error because they're undefined in libarrow.so
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a quirk due to the use of ARROW_EXPORT, would I be correct that this issue would not occur without the export/import business?
cpp/src/arrow/scalar.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use GetView since you copy it right after in a Buffer, that'll remove one memcpy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Buffer::FromString wraps the allocation of the std::string without copying
483e8a0 to
990de30
Compare
nealrichardson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bkietz please update your version of the R roxygen2 package and re-run make doc (the autotune bot is failing because of python 2.7 or something)
990de30 to
c01a435
Compare
wesm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, thanks for doing this -- I think this will save us some headache down the road
|
|
||
| // Serialization is accomplished by converting expressions to single element StructArrays | ||
| // then writing that to an IPC file. The last field is always an int32 column containing | ||
| // ExpressionType, the rest store the Expression's members. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems likely we'll want to define a (Flatbuffers?) serialization protocol protocol for expressions in the not-too-distant future, though this certainly works for now! =)
cpp/src/arrow/type.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a quirk due to the use of ARROW_EXPORT, would I be correct that this issue would not occur without the export/import business?
Serialization is implemented by converting Expressions to Arrays then writing a tiny IPC file. This is a ridiculous way to serialize Expressions but it should be acceptable since these classes are destined to be replaced by
arrow::engine::Expr