ARROW-10354: [Rust][DataFusion] regexp_extract function to select regex groups from strings #9428
Conversation
|
@jorgecarleitao could I ask you for a review if you find the time? This is my first attempt to do something with datafusion, so there are probably some things I misunderstood. |
|
@sweb This is cool and useful. Given we are aiming for Postgres compatibility (in terms of syntax), do you think you could modify it to be the See: https://www.postgresql.org/docs/13/functions-string.html I have done a lot of work recently on Postgres functions, so there may be some useful work there: #9243
@seddonm1 Sure, I will try to change it accordingly!
What a lucky coincidence that you have not implemented |
jorgecarleitao
left a comment
Hi @sweb . Thank you for this PR! This is a really important kernel :)
I went through this. I think that the overall logic makes sense and the structure is correct, but there are some points that IMO should be addressed.
It is more efficient (and probably more idiomatic) to use collect::<GenericStringArray<OffsetSize>>() here, i.e.
    array
        .iter()
        .map(<logic here>)
        .collect::<GenericStringArray<OffsetSize>>()
Since this is a unary operation on a utf8 array, I would try to write a generic for it (like we do for primitives in arity.rs) and use it here. We may even be able to write it using trusted_len, which is the fastest option available atm.
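The map-and-collect pattern suggested here can be illustrated without Arrow. The following is a minimal, std-only sketch under stated assumptions: `Vec<Option<String>>` stands in for `GenericStringArray<OffsetSize>`, and the helper name `unary_utf8` is hypothetical, not an existing function in arity.rs.

```rust
/// Applies `op` to every non-null value of a nullable string column and
/// collects the results directly, instead of pushing into a builder.
/// `Vec<Option<String>>` is only a stand-in for an Arrow string array.
fn unary_utf8<F>(column: &[Option<String>], op: F) -> Vec<Option<String>>
where
    F: Fn(&str) -> Option<String>,
{
    column
        .iter()
        // Nulls stay null; non-null values go through `op`, which may
        // itself produce a null (for example "no match").
        .map(|value| value.as_deref().and_then(|s| op(s)))
        .collect()
}
```

Because the closure returns `Option<String>`, the same helper covers both "input is null" and "operation produced no value" without a separate validity buffer in the calling code.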
I did not find a way to do this for ListArray - I think I would have to implement the FromIterator trait for GenericListArray, correct?
I wonder if the signature shouldn't be idx: &[usize] and the result Vec<ArrayRef>. It would allow for optimizations where the user wants more than one group for the same regex (as regex is usually slow). Could be left out for now, just a thought.
Hey @jorgecarleitao thank you very much for your review. I will try to address your comments in the next few days.
Since @seddonm1 raised Postgres compatibility, I was thinking about changing the function signature of the kernel to:
    pub fn regexp_match(array: &Array, pattern: &str) -> Result<ArrayRef>
where the returned array is of type GenericListArray with values of type &str. A list is closer to the Postgres signature and would provide the flexibility to choose multiple groups. Would this be fine as well, or is Vec<ArrayRef> preferable to you?
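One way to compare the two result shapes discussed here: a list-valued result keeps every group of every row, so per-group columns can be derived from it afterwards without running the regex again. The sketch below is std-only and makes assumptions: `Column` stands in for `ArrayRef`, the list result is modelled as `Option<Vec<String>>` per row, and `select_groups` is a hypothetical helper name.

```rust
/// A nullable string column, standing in for an Arrow `ArrayRef`.
type Column = Vec<Option<String>>;

/// Derives one column per requested group index from a list-valued result.
/// `rows[r]` is `None` for a null row, otherwise the captured groups of row `r`.
fn select_groups(rows: &[Option<Vec<String>>], idx: &[usize]) -> Vec<Column> {
    idx.iter()
        .map(|&i| {
            rows.iter()
                // A null row, or a group index that does not exist, gives null.
                .map(|row| row.as_ref().and_then(|groups| groups.get(i).cloned()))
                .collect::<Column>()
        })
        .collect()
}
```

The reverse direction is not possible in general, since a fixed set of per-group columns no longer knows about groups that were not requested.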
If we want to mimic Spark, the last entry should result in an empty string, not a None. Otherwise it would be impossible to differentiate between "no match" and "input is null".
Done for regexp_match. This will now lead to a ListArray([StringArray([""])]), i.e. the group has a single entry with an empty string - even if there are multiple groups. I am not sure how Postgres behaves in this case... I will try to check.
Can we make one of these entries None, so that we also test the null entry case?
I added a None case for regexp_match - I am currently planning to remove regexp_extract if something like regexp_match is preferable.
I would be fine not adding this test. IMO this is covered on the test above.
Done
I am not sure this is correct: the function expects an array, but then only picks the first element of the array for the regex. Maybe this was used because ScalarFunctions did not support the ScalarValue variant?
Absolutely, I did this because I am not sure how I can get from ArrayRef to ScalarVariant - I am looking for something like the following:
    let pattern_expr = args[1].as_any().downcast_ref::<ScalarValue>().unwrap();
    if let ScalarValue::Utf8(Some(pattern)) = pattern_expr {
        compute::regexp_match(args[0].as_ref(), pattern)
            .map_err(DataFusionError::ArrowError)
    } else {
        Err(DataFusionError::Internal("This is wrong".to_string()))
    }
I am not sure this is correct; isn't the signature Variant([[Utf8, Utf8, UInt64], [LargeUtf8, Utf8, UInt64]]) or something like that?
Note that these signatures are very important because they are used for type validation during logical planning, as well as type coercion at physical planning. Whenever we write Any, the logical planning will accept any type. Worse, the type coercer will not perform any coercion.
In this case, because we downcast arg[2] to Int64Array, if the user passes an Int32Array, the execution panics.
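Declaring the exact signature is the fix asked for here. Independently of that, the failure mode itself can be illustrated: a downcast on a type-erased argument returns an `Option`, and calling `unwrap()` on it is what turns a type mismatch into a panic. The following std-only sketch uses `std::any::Any` with `Vec<i64>` as a stand-in for `Int64Array`; the function name `group_index` and the error strings are assumptions for illustration.

```rust
use std::any::Any;

/// Stand-in for a typed Arrow array; only the element type matters here.
type Int64Column = Vec<i64>;

/// Reads the group index from a type-erased argument.
/// Returns an error instead of panicking when the argument has another
/// type, which is what an `unwrap()` on the downcast would do.
fn group_index(arg: &dyn Any) -> Result<i64, String> {
    let column = arg
        .downcast_ref::<Int64Column>()
        .ok_or_else(|| "expected an Int64 argument".to_string())?;
    column
        .first()
        .copied()
        .ok_or_else(|| "group index argument is empty".to_string())
}
```

With a precise signature, the planner would reject or coerce a mismatched argument before execution, so the error branch above becomes a safety net rather than the primary check.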
Done - I had to use Int64 instead of UInt64 because I got errors from tests/sql.rs. I have to check how to define a literal as unsigned from within the query string.
|
@sweb what is the status of this PR? Are you blocked? If you just haven't had time or inclination to make changes, that is fine (I totally understand); I just wanted to try and clear the PR queue.
@alamb Sorry for the late response - I did not have enough time to continue with this and did not see your message. However, I am currently blocked on one front: I have created a new kernel function with the following signature: where If you have some pointers for this, I would be very grateful - I was not able to find something similar yet. I have added my current state, which kind of works but I am not too sure that what I am doing there is correct. Again, apologies for the late response. |
|
@sweb I can help on Monday. I'm planning to raise the PR for those other regexp functions then can help work through this? |
Hey @seddonm1 I have rebased my PR on the current master. My plan would be to only keep My main issues are:
|
|
Hi @sweb Yesterday I made my PR for The way I have done the I think the |
@sweb no worries! I totally understand. Thanks for sticking with it |
|
@seddonm1 I have pushed the following adjustment: This way, I define that the regular expression needs to come from a literal, so I do not have to take the first element of the array and hope for the best and do not have to check for multiple regular expressions. This comes with a change I am not sure is wanted: The string_expressions function I am just proposing this because I am not sure that we want to handle multiple regular expressions. If supporting multiple expressions is the way to go, I will adjust the kernel the way you implemented regexp_replace. Let me know what you think! |
|
@sweb I know you have spent a lot of time on this, so I know it's painful, but I do think that we should build to the Postgres spec, which allows the regex to be passed in via a columnar value. How we implement that behind the scenes allows optimisation, but the spec allows each row to be individually processed with a different regex. To clarify: I am not sure if anyone actually uses SQL like this, but it is a consistent implementation pattern in Postgres where a Scalar and a Columnar value are treated equally - which I think is an elegant design.
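The idea of treating a scalar and a columnar pattern the same way, while still optimising behind the scenes, can be sketched as follows. This is a std-only illustration under assumptions: `PatternArg` is a hypothetical type for this sketch, `Compiled` is a stand-in for a compiled regex that only does a substring test, and the cache is one possible optimisation, not necessarily what the PR ended up doing.

```rust
use std::collections::HashMap;

/// A pattern argument: one literal for all rows, or one value per row.
/// (Hypothetical type for this sketch.)
enum PatternArg {
    Scalar(Option<String>),
    Column(Vec<Option<String>>),
}

/// Stand-in for a compiled regex; "matching" is a plain substring test.
struct Compiled(String);

impl Compiled {
    fn is_match(&self, input: &str) -> bool {
        input.contains(&self.0)
    }
}

/// Evaluates the pattern row by row, compiling each distinct pattern once.
/// Returns the per-row results and the number of compilations performed.
fn eval(inputs: &[Option<String>], patterns: &PatternArg) -> (Vec<Option<bool>>, usize) {
    let mut cache: HashMap<String, Compiled> = HashMap::new();
    let mut compilations = 0;
    let results: Vec<Option<bool>> = inputs
        .iter()
        .enumerate()
        .map(|(row, input)| {
            let pattern = match patterns {
                PatternArg::Scalar(p) => p.as_ref(),
                PatternArg::Column(ps) => ps.get(row).and_then(|p| p.as_ref()),
            };
            // A null input or a null pattern yields a null result.
            let (input, pattern) = (input.as_ref()?, pattern?);
            let compiled = cache.entry(pattern.clone()).or_insert_with(|| {
                compilations += 1;
                Compiled(pattern.clone())
            });
            Some(compiled.is_match(input))
        })
        .collect();
    (results, compilations)
}
```

With a scalar pattern the cache holds a single entry, so the common case costs one compilation, while a column of patterns is still evaluated row by row as the Postgres semantics described above allow.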
|
@sweb I have had a read through the code and it looks good. Are you able to rebase so the tests will be able to run? Sorry, I missed your comment 10 days ago to have a look earlier 🤦
|
@seddonm1 I just merged master - there is currently a linting issue due to rust 1.51 but the tests are green. |
            list_builder.append(true)?
        }
        None => {
            list_builder.values().append_value("")?;
I am not sure this logic is correct based on Postgres behavior:
    SELECT regexp_match('foobarbequebaz', '(bar)(bequ1e)') = NULL
I changed the behavior - it returns NULL now. This was my half-thought-through attempt to address the review comment differentiating between matching on NULL and no match.
    use crate::array::{ListArray, StringArray};

    #[test]
    fn match_single_group() -> Result<()> {
Could you also add a test case for the following (and for regexp_match('foobarbequebaz', '(bar)(bequ1e)') above):
    SELECT regexp_match('foobarbequebaz', ''); = {""}
Some of these behaviors from Postgres don't really make sense to me.
@seddonm1 First: I am very impressed that you know of this case.
My original implementation returned an empty List, without an item. Do you know whether Postgres actually returns a quoted empty string? I am asking because
SELECT regexp_match('foobarbequebaz', '(bar)(beque)'); => {bar,beque}
so I am not sure what to make of the quotes, since strings are not returned with quotes. Or is this just a special case when the string is empty?
Regardless, I added a special case for the empty string pattern.
I did not know about this case. I just put a few scenarios into a Postgres instance running locally (via docker). Your implementation does make sense.
|
@sweb This looks good and I think it is almost ready. I think there are just a few cases to look into to match Postgres behavior then ready to merge. |
Codecov Report
@@ Coverage Diff @@
## master #9428 +/- ##
==========================================
+ Coverage 82.42% 82.45% +0.03%
==========================================
Files 252 253 +1
Lines 58977 59132 +155
==========================================
+ Hits 48609 48755 +146
- Misses 10368 10377 +9
|
rust/datafusion/tests/sql.rs
    #[tokio::test]
    #[cfg(feature = "regex_expressions")]
    async fn query_regexp_match() -> Result<()> {
I think we can remove this and just add these cases to the test_regex_expressions function above:
    test_expression!("regexp_match('foobarbequebaz', '')", "[]");
    test_expression!("regexp_match('foobarbequebaz', '(bar)(beque)')", "[bar, beque]");
    test_expression!("regexp_match('foobarbequebaz', '(ba3r)(bequ34e)')", "NULL");
    test_expression!("regexp_match('aaa-0', '.*-(\\d)')", "[0]");
    test_expression!("regexp_match('bb-1', '.*-(\\d)')", "[1]");
    test_expression!("regexp_match('aa', '.*-(\\d)')", "NULL");
    test_expression!("regexp_match(NULL, '.*-(\\d)')", "NULL");
    test_expression!("regexp_match('aaa-0', NULL)", "NULL");
done
No worries, thank you for your thorough review work. I learned a lot! |
|
I merged master, but I still get (as far as I can tell) unrelated linting errors. I assume #9867 solves this |
|
Checking this out @sweb |
alamb
left a comment
I went through this PR and it looks good to me. Thanks @sweb and @seddonm1 and @jorgecarleitao for your work here.
|
I also merged it with apache/master locally and ran the tests / clippy and everything passed for me. Thus merging this one in |
Adds a regexp_extract compute kernel to select a substring based on a regular expression.
Some things I did that I may be doing wrong:
GenericStringBuilderStringArrayand take the first record to compile the regex pattern from it and apply it to all values. Is there a way to define that an argument has to be a literal/scalar and cannot be filled by e.g. another column? I consider my current implementation quite error prone and would like to make this a bit more robust.