-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-17462: [R] Cast scalars to type of field in Expression building #13985
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
8 commits
Select commit
Hold shift + click to select a range
374402e
R scalars should be cast to match the expression type, where appropriate
nealrichardson eec7ab1
Add a few tests and clean up notes
nealrichardson 2339e27
Progress on string-datetime parsing
nealrichardson 6393bb5
Assume timezone
nealrichardson b554f10
Refactor and resolve some TODOs
nealrichardson 97bfc13
Don't auto-cast to Decimal types to avoid compute bug
nealrichardson 2b2e15b
Avoid some cpp calls
nealrichardson 30410fa
Take UDFs out of build_expr and have Expression ensure Expression inputs
nealrichardson File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to whitelist the functions this applies to? (Or maybe you already do this and I'm not reading this correctly?) This logic is awesome and very appropriate for most math functions but I wonder if there are some compute functions (maybe
binary_repeat) that will stop working when used withbuild_expr(). I think that user-defined functions also generate their bindings throughbuild_expr()(although they don't have to).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a blocklist rather than an allowlist, and binary_repeat is on it (L285, below). If there are compute functions that don't work with this change, we don't test them.
Do you think we should exclude UDFs from the type matching too?
For functions that do go through
build_expr(), the way to skip the type-matching logic is to wrap the value inExpression$scalar(). Only non-Expressions are cast.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really think you should whitelist here...in theory one can use
build_expr()for any compute function, although many bindings choose to go directly throughExpression$create()instead. Using a blocklist would mean you can only usebuild_expr()safely for specific functions (in which case you should probably compute what those functions are so that can be documented).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went through the function list on https://arrow.apache.org/docs/cpp/compute.html and evaluated whether you should try to cast scalar inputs to the type of the corresponding column (and remember, if you can't cast the scalar without loss of precision, it doesn't add the cast, so for
int + 4.2, 4.2 won't get cast to int so that will go tocast(int, float64) + 4.2in Acero). For the non-unary scalar functions, all but 4 make sense to try to convert scalars like this. The 4 functions that don't arebinary_repeat,list_element,binary_join(kind of an odd case, which we don't use, we use binary_join_element_wise instead), andmake_struct. It's around 40-50 functions on the allow side, so it seems that the "don't cast" functions are the exception.Does that persuade you in favor of blocklist instead of allowlist?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think a whitelist is safer, although feel free to make the change. The
build_expr()in the user-defined function code ( https://github.com/apache/arrow/blob/master/r/R/compute.R#L384 ) would have to change to something approaching the previous behaviour since we have no guarantees about those functions.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just pulled UDFs out of build_expr in d54de48, and in a followup I'll go further to reduce the usage of
build_exprto places where the type matching matters (more of an allowlist, in that sense), pull out the special cases inside of it, and rename it to something likebuild_simple_exprto make clear that it is a special case and not the default path you should choose.Sound ok to you?