-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-11656: [Rust][DataFusion] Remaining Postgres String functions #9654
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-11656: [Rust][DataFusion] Remaining Postgres String functions #9654
Conversation
|
Thanks @seddonm1 -- I'll plan to look at this later today or tomorrow |
| let length = length as usize; | ||
|
|
||
| if length == 0 { | ||
| Some("".to_string()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is String necessary or could we use &str in those iterators? (to avoid an extra allocation per item)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm hitting issues with the borrow checker so I think that the String allocation is required.
| .grapheme_indices(true) | ||
| .nth(n as usize) | ||
| .map_or(string, |(i, _)| { | ||
| &from_utf8(&string.as_bytes()[..i]).unwrap() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For any unwrap() I think it makes sense to put a comment why the unwrap is ok, and/or use .expect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I spent some time poking around in https://docs.rs/unicode-segmentation/1.7.1/unicode_segmentation/trait.UnicodeSegmentation.html#tymethod.graphemes
I couldn't find anything that would let you get back a &str from the grapheme indices. I wonder if it might be possible to use the graphemes() call itself and count up the lengths of the &str that came back
Something like (untested)
let i = strings
.graphemes()
.limit(n as usize)
.map(|s| s.len())
.sum()
string[..i]Or something like that. I am not sure how important this is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me have a look and see if we can do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb I have refactored just the left function to demonstrate how it could be done:
| let result = string_array |
I think this is nicer so if you agree I can apply to the rest of the code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it a lot 👍 Thanks @seddonm1
I am not sure why this clone is needed, but that is a minor point and can be cleaned up later
let len = graphemes.clone().count() as i64;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The borrow checker is identifying that the second use of graphemes is using a moved value without the clone. If you can help resolve that it would help me learn.
use of moved value: `graphemes` value used here after move
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see -- I took a closer look and I see now that graphemes is an iterator -- so we are clone() ing an iterator, which I guess feels right to me -- the code has to effectively scan the input to figure out how many graphemes long it is and then scan it again to create the output.
Sorry for my confusion!
|
|
||
| /// Replaces substring(s) matching a POSIX regular expression | ||
| /// regexp_replace('Thomas', '.[mN]a.', 'M') = 'ThM' | ||
| pub fn regexp_replace<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<ArrayRef> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think at some point this could be in arrow (and more parts of your string contributions).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah -- it might be worth a discussion on the mailing list of what functions belong in datafusion and what are more "core" and broadly applicable to bring them into the core arrow kernels
|
Truly looks great @seddonm1 I added some comments |
|
Integration test failure looked like https://issues.apache.org/jira/browse/ARROW-11908 so I retriggered the CI checks |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks great, as always @seddonm1 . 💯 🥇 for the test cases and feature flags.
I think @Dandandan and I had some style / improvement suggestions, but I think they could be done as follow on PRs as well. I'll wait to merge this PR for another day or so in case you want to respond, otherwise I'll plan to merge tomorrow.
👍
|
|
||
| /// Replaces substring(s) matching a POSIX regular expression | ||
| /// regexp_replace('Thomas', '.[mN]a.', 'M') = 'ThM' | ||
| pub fn regexp_replace<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<ArrayRef> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah -- it might be worth a discussion on the mailing list of what functions belong in datafusion and what are more "core" and broadly applicable to bring them into the core arrow kernels
| .grapheme_indices(true) | ||
| .nth(n as usize) | ||
| .map_or(string, |(i, _)| { | ||
| &from_utf8(&string.as_bytes()[..i]).unwrap() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I spent some time poking around in https://docs.rs/unicode-segmentation/1.7.1/unicode_segmentation/trait.UnicodeSegmentation.html#tymethod.graphemes
I couldn't find anything that would let you get back a &str from the grapheme indices. I wonder if it might be possible to use the graphemes() call itself and count up the lengths of the &str that came back
Something like (untested)
let i = strings
.graphemes()
.limit(n as usize)
.map(|s| s.len())
.sum()
string[..i]Or something like that. I am not sure how important this is
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking great @seddonm1 -- I think it is mergeable as soon as you are ready. Just let me know
|
Thanks @alamb. I will make some more tweaks to these unicode functions to try to remove |
|
@alamb I have removed the |
|
@alamb hitting some CICD issues - nothing to do with this change. |
|
The integration test looks like https://issues.apache.org/jira/browse/ARROW-11908 so merging this one in. THanks again @seddonm1 ! |
@alamb the last one (for now).
This PR does a few things:
regexp_replace,replace,split_part,starts_with,strposandtranslate.unicode_expressionsand moves anything that depends onunicode-segmentationcrate into it.regex_expressionsand addsregexandlazy_staticcrates to it.