Skip to content

Conversation

@bkietz
Copy link
Member

@bkietz bkietz commented Apr 23, 2020

Serialization is implemented by converting Expressions to Arrays then writing a tiny IPC file. This is a ridiculous way to serialize Expressions but it should be acceptable since these classes are destined to be replaced by arrow::engine::Expr

@bkietz bkietz requested review from kszucs and nealrichardson April 23, 2020 20:27
@github-actions
Copy link

Copy link
Member

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of questions on the R bindings. Also, IIUC you can delete https://github.com/apache/arrow/blob/master/r/R/expression.R#L122-L163 and refactor that file a bit to match how you're not really specifying the Expression subclasses anymore.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I am not sure if I find the Expression.field, Expression.and_, etc API necessarily better than the current FieldExpression, AndExpression, etc.
Both equally tie us to a certain python API: whether we use static methods or subclass constructors to construct the Expression objects doesn't really matter in the end IMO (they still expose an Expression class with methods, and a constructor like AndExpression could also simply be a wrapper to construct a base expression object as is done now in Expression.and_).

To be clear, I am fully on board with the the general clean-up and reducing the C++ API used from the python side. So I am only talking about the Python (user-facing) API. But for that, I am wondering: do we actually need to expose those constructors publicly? I think that the current ds.field(..) and ds.scalar(..) functions should be sufficient, as all other expressions can be created based on those using methods or binary operations?

@kszucs
Copy link
Member

kszucs commented Apr 27, 2020

Agree that exposing ds.field and ds.scalar should be sufficient on the python side.

@bkietz bkietz force-pushed the 7391-Remove-unnecessary-classe branch from c920613 to 0c9233e Compare April 27, 2020 16:20
@bkietz
Copy link
Member Author

bkietz commented Apr 27, 2020

@kszucs @jorisvandenbossche PTAL

@nealrichardson
Copy link
Member

@github-actions autotune everything

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates!

@bkietz bkietz force-pushed the 7391-Remove-unnecessary-classe branch from 22e238e to 3d4fd34 Compare April 28, 2020 16:46
@fsaintjacques fsaintjacques self-requested a review April 28, 2020 19:07
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This allows Iterator::RangeIterator to be copied, which is a requirement on iterators https://en.cppreference.com/w/cpp/named_req/Iterator

We haven't caught this before because we only use Iterator::begin and Iterator::end in the context of range-for loops which don't require it. However cython wraps the range-for loop does, however, which is why I noticed the concept error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this doing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is ensuring that /\w+Type::type_id/ is a defined symbol in libarrow.so. Previously these have only been used within a translation unit which links into libarrow.so, but now I'm using them in filter.cc which gets linked to libarrow_dataset.so. If you remove these you'll find the linker emits an error because they're undefined in libarrow.so

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a quirk due to the use of ARROW_EXPORT, would I be correct that this issue would not occur without the export/import business?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use GetView since you copy it right after in a Buffer, that'll remove one memcpy.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Buffer::FromString wraps the allocation of the std::string without copying

@bkietz bkietz force-pushed the 7391-Remove-unnecessary-classe branch 2 times, most recently from 483e8a0 to 990de30 Compare May 1, 2020 00:18
Copy link
Member

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bkietz please update your version of the R roxygen2 package and re-run make doc (the autotune bot is failing because of python 2.7 or something)

@bkietz bkietz force-pushed the 7391-Remove-unnecessary-classe branch from 990de30 to c01a435 Compare May 4, 2020 15:19
Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thanks for doing this -- I think this will save us some headache down the road


// Serialization is accomplished by converting expressions to single element StructArrays
// then writing that to an IPC file. The last field is always an int32 column containing
// ExpressionType, the rest store the Expression's members.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems likely we'll want to define a (Flatbuffers?) serialization protocol protocol for expressions in the not-too-distant future, though this certainly works for now! =)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a quirk due to the use of ARROW_EXPORT, would I be correct that this issue would not occur without the export/import business?

@wesm wesm closed this in a66ab10 May 5, 2020
@bkietz bkietz deleted the 7391-Remove-unnecessary-classe branch February 25, 2021 16:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants