-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [Splitting to separate PRs] #9243
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
6691eac to
32333bb
Compare
Codecov Report
@@ Coverage Diff @@
## master #9243 +/- ##
==========================================
+ Coverage 82.29% 82.64% +0.35%
==========================================
Files 244 245 +1
Lines 55616 57408 +1792
==========================================
+ Hits 45767 47443 +1676
- Misses 9849 9965 +116
Continue to review full report at Codecov.
|
jorgecarleitao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @seddonm1 , I went through this and it looks great so far. Impressive work 💯
I left some comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wont this coerce any type to the first variant, even if the latter variant is accepted?
I.e. if we use
Uniform(vec![
vec![vec![A]],
vec![vec![B]],
])
and pass arg types vec![B], I would expect that no coercion would happen, but I suspect that this will coerce B to A, because the first entry with the same number of arguments is vec![vec![A]].
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest that we PR this separately with a single function that requires this type of signature, as we need to get this requires much more care than the other parts of this PR as it affects all future functions that use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jorgecarleitao . Yes I will split this out.
A good example is lpad which is either:
[string, int] or [string, int, string]. I am away a couple of days but will split this out so we can work throught methodically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As @jorgecarleitao says, this is another change from this PR that would be great to break out into its own PR.
9cdf641 to
5a90cf8
Compare
799cf68 to
a69e099
Compare
|
@alamb @jorgecarleitao @andygrove I think these are mostly implemented now. Not sure how we want to do the merge given this change is so large. |
|
@andygrove @alamb @jorgecarleitao I think we need this Signature::OneOf. A good example is |
|
I plan to try and review this probably this weekend. I wonder if we should update the title to remove the |
|
I think the Clippy CI check on this PR is failing due to a new stable rust being released. I am working on a fix here #9476 |
|
Thanks @alamb . I know the prospect of doing a review like this is not something to look forward to. I will rebase and push soon. |
a69e099 to
b799b66
Compare
|
I am going in to review this PR -- I am getting second cup of ☕ and settling down for a good read 👓 ... |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First of all, THANK YOU so much @seddonm1 -- this is an Epic body of work and will really drive DataFusion forward and make it so much more useful. I found the code well structured, easy to follow, and well commented. ❤️ ❤️
I spent an hour reviewing this PR -- what I saw was great, but I will be honest that by the end of that hour my mind was quite exhausted and I am not sure I would have caught every little thing
My opinion is we should break the content of this PR into separate pieces (rather than try and merge the whole thing) and start putting them in piece by piece.
Here is one possible way to split it up:
bit_lengthkernelsSignature::Uniform- Length functions (BitLength, etc)
- Ascii/unicode functions
- Regex functions
- Pad/trim functions
If you would like help doing so / creating the tickets I think I can find the time to do so as I think this is a really important PR.
If you don't want to split it up or disagree, I think I would be ok with merging this in as is once @jorgecarleitao is satisfied with the changes to type signatures if we commit ourselves to some post-merge cleanup / splitting up some of this code into smaller modules.
rust/datafusion/Cargo.toml
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is something I have been thinking a lot about -- how can we keep DataFusion's dependency stack reasonable (it is already pretty large and it just keeps getting larger).
One thing I was thinking about was making some of these dependencies optional (so that we had features like regex and unicode and hash which would only pull in the dependencies / have those functions if the features were enabled.
What do you think @jorgecarleitao / @andygrove / @ovr ? If it is a reasonable idea (I think we mentioned it before) I will file a JIRA to track?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I was nervous about additional dependencies. Perhaps this topic can be raised at the next Arrow Rust call to agree some sort of assessment criteria.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, regex / lazy_static are already non-optional dependencies of arrow, so I think not that much can be gained there, unless we make it optional in Arrow as well.
I think it is a good idea to make some features optional, to reduce compile times whenever you are not working on them.
Another thing we can do to split benchmarks / examples / etc. out of the crate to make compile times a bit shorter, which I started doing hereL #9494 and #9493
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As @jorgecarleitao says, this is another change from this PR that would be great to break out into its own PR.
|
@alamb Thanks for your extreme attention to detail and yes it is absolutely IMPLEMENT ALL THE FUNCTIONS 😆 I have addressed and resolved most of the comments you have made. The remaining unresolved comments do require further discussion. I am happy to do the split based on your suggestions and I'm ok to raise the tickets:
Obviously this is a lot of work but this should allow us to split up the reviews more fairly. I will start the PR-mageddon. |
|
@seddonm1, If you need hands, just ping me with function names you would like me to work on and I will pick them up and PR then to this branch. |
…Length Functions Splitting up #9243 This implements the following functions: - String functions - [x] bit_Length - [x] char_length - [x] character_length - [x] length - [x] octet_length Closes #9509 from seddonm1/length-functions Lead-authored-by: Mike Seddon <seddonm1@gmail.com> Co-authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
e133704 to
145b108
Compare
|
@alamb FYI i have just rebased against master. @Dandandan has already added the |
This PR is a child of #9243 It does a few things that are hard to separate: - fixes the behavior of `concat` and `trim` functions to be in line with the Postgres implementations - restructures some of the code base (mainly sorting and adding tests) to facilitate easier testing and implementation of the remainder of #9243 @alamb @jorgecarleitao please review but merging will be dependent on #9507 Closes #9551 from seddonm1/concat Authored-by: Mike Seddon <seddonm1@gmail.com> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
…, right, rpad @alamb Another one. Please pay close attention to the type coercion. It does two things: - fixes the behavior of the **type coercion**. - adds the simple functions `left`, `lpad`, `right`, `rpad` following the Postgres style. This PR is a child of #9243 Closes #9565 from seddonm1/left_ltrim_right_rpad Authored-by: Mike Seddon <seddonm1@gmail.com> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
|
switching to draft so it is clear this is not being merged as is and instead is being broken up |
…, initcap, repeat, reverse, to_hex @alamb This is the second last of the current string functions but I think there may be one after that with new code. This implements some of the miscellaneous string functions `ascii`, `chr`, `initcap`, `repeat`, `reverse`, `to_hex`. The next PR will have more useful functions (including regex). A little bit of tidying for consistency to the other functions was applied. This PR is a child of #9243 Closes #9625 from seddonm1/ascii-chr-initcap-repeat-reverse-tohex Authored-by: Mike Seddon <seddonm1@gmail.com> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
|
Closed after splitting. |
…Length Functions Splitting up apache/arrow#9243 This implements the following functions: - String functions - [x] bit_Length - [x] char_length - [x] character_length - [x] length - [x] octet_length Closes #9509 from seddonm1/length-functions Lead-authored-by: Mike Seddon <seddonm1@gmail.com> Co-authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>

This PR starts the large work of implementing the Postgres String functions. Most of these are naive implementations but the tests should allow rapid performance enhancement without regressions.
sprintfimplementationChanges
I have had to make some changes to the existing implementations:
concathad the incorrect behavior for how to handle NULLs where any null would result in a NULL where the Postgres implementation documents:NULL arguments are ignored..ltrimandrtrimwere implemented to support only the default space character whereas Postgres supports an optional second parameter:ltrim('zzzytest', 'xyz')so that has been updated.lengthkernel returns bytes not characters.character_lengthhas been reimplemented but requires an import of theunicode-segmentationcrate. The comments have been updated forlength.Questions
Signature::OneOfvsSignature::Uniform. This came up with aleftfunction that takes a(utf8, int64)signature and it is not correct to try to cast both toutf8. You can see my implementation here but perhaps you have a better method.