ARROW-10354: [Rust][DataFusion] regexp_extract function to select regex groups from strings #9428

sweb · 2021-02-05T17:16:58Z

Adds a regexp_extract compute kernel to select a substring based on a regular expression.

Some things I did that I may be doing wrong:

I exposed GenericStringBuilder
I build the resulting Array using a builder - this looks quite different from e.g. the substring kernel. Should I change it accordingly, e.g. because of performance considerations?
In order to apply the new function in datafusion, I did not see a better solution than to handle the pattern string as StringArray and take the first record to compile the regex pattern from it and apply it to all values. Is there a way to define that an argument has to be a literal/scalar and cannot be filled by e.g. another column? I consider my current implementation quite error prone and would like to make this a bit more robust.

github-actions · 2021-02-05T17:39:42Z

https://issues.apache.org/jira/browse/ARROW-10354

sweb · 2021-02-07T16:32:02Z

@jorgecarleitao could I ask you for a review if you find the time? This is my first attempt to do something with datafusion, so there are probably some things I misunderstood.

seddonm1 · 2021-02-10T21:20:57Z

@sweb This is cool and useful.

Given we are aiming for Postgres compatability (in terms of syntax) do you think you could modify it to be the regexp_match ( string text, pattern text [, flags text ] ) → text[] function instead? Then you could modify the parser to support extracting a value from the text[] array like Postgres.

See: https://www.postgresql.org/docs/13/functions-string.html

I have done a lot of work recently on Postgres functions so there may be some useful work there: #9243

sweb · 2021-02-10T21:44:08Z

Given we are aiming for Postgres compatability (in terms of syntax) do you think you could modify it to be the regexp_match ( string text, pattern text [, flags text ] ) → text[] function instead? Then you could modify the parser to support extracting a value from the text[] array like Postgres.

@seddonm1 Sure, I will try to change it accordingly!

I have done a lot of work recently on Postgres functions so there may be some useful work there: #9243

What a lucky coincidence that you have not implemented regexp_match yet :) I will check your PR for some pointers on how to provide things like flags, etc.

jorgecarleitao

Hi @sweb . Thank you for this PR! This is a really important kernel :)

I went through this. I think that the overall logic makes sense and the structure is correct, but there are some points that IMO should be addressed.

jorgecarleitao · 2021-02-17T03:16:12Z

rust/arrow/src/compute/kernels/regexp.rs

It is more efficient (and probably idiomatic) to use the collect::<GenericStringArray<OffsetSize>>() here.

I.e.

array .iter() .map(<logic here>) .collect::<GenericStringArray<OffsetSize>>()

Since this is an unary operation on a utf8 array, I would try to write a generic for it (like we do for primitives in arity.rs) and use it here. We may even be able to write it using the trusted_len, which is the faster option available atm.

I did not find a way to do this for ListArray - I think I would have to implement the FromIterator trait for GenericListArray, correct?

jorgecarleitao · 2021-02-17T03:22:03Z

rust/arrow/src/compute/kernels/regexp.rs

I wonder if the signature shouldn't be idx: &[usize] and the result Vec<ArrayRef>. It would allow for optimizations where the user wants more than one group for the same regex (as regex is usually slow). Could be left out for now, just a though.

Hey @jorgecarleitao thank you very much for your review. I will try to address your comments in the next days.

Since @seddonm1 remarked that Postgres compatibility, I was thinking about changing the function signature of the kernel to:

pub fn regexp_match(array: &Array, pattern: &str) -> Result<ArrayRef>

where the returned array is of type GenericListArray with values of type &str. A list is closer to the Postgres signature and would provide the flexibility to choose multiple groups. Would this be fine as well or is Vec<ArrayRef> preferable to you?

jorgecarleitao · 2021-02-17T03:23:55Z

rust/arrow/src/compute/kernels/regexp.rs

if we want to mimic spark, the last entry should result in an empty string, not a None. This is because it would be otherwise impossible to differentiate between a "no match" and a "input is null".

Done for regexp_match. This will now lead to a ListArray([StringArray([""]]), i.e. the group has a single entry with an empty string - even if there are multiple groups. I am not sure how Postgres behaves in this case... I will try to check

jorgecarleitao · 2021-02-17T03:24:22Z

rust/arrow/src/compute/kernels/regexp.rs

Can we make one of these entries None, so that we also test the null entry case?

I added a None case for regexp_match - I am currently planning to remove regexp_extract if something like regexp_match is preferable.

jorgecarleitao · 2021-02-17T03:25:21Z

rust/arrow/src/compute/kernels/regexp.rs

I would be fine not adding this test. IMO this is covered on the test above.

jorgecarleitao · 2021-02-17T03:26:58Z

rust/datafusion/src/physical_plan/string_expressions.rs

I am not sure this is correct: the function expects an array, but then only picks the first element of the array for the regex. Maybe this was used because ScalarFunctions did not support the ScalarValue variant?

Absolutely, I did this because I am not sure, how I can get from ArrayRef to ScalarVariant - I am looking for something like the following:

let pattern_expr = args[1].as_any().downcast_ref::<ScalarValue>().unwrap(); if let ScalarValue::Utf8(Some(pattern)) = pattern_expr { compute::regexp_match(args[0].as_ref(), pattern) .map_err(DataFusionError::ArrowError) } else { Err(DataFusionError::Internal("This is wrong".to_string())) }

jorgecarleitao · 2021-02-17T03:32:46Z

rust/datafusion/src/physical_plan/functions.rs

I am not sure this is correct; isn't the signature Variant([[Utf8, Utf8, UInt64], [LargeUtf8, Utf8, UInt64]]) or something like that?

Note that these signatures are very important because they are used for type validation during logical planning, as well as type coercion at physical planning. Whenever we write Any, the logical planning will accept any type. Worse, the type coercer will not perform any coercion.

In this case, because we downcast arg[2] to Int64Array, if the user passes a Int32Array, the execution panics.

Done - I had to use Int64 instead of UInt64 because I got errors from tests/sql.rs. I have to check how to define a literal as unsigned from within the query string.

alamb · 2021-02-26T22:06:39Z

@sweb what is the status of this PR? Are you blocked? If you just haven't had time or inclination to make changes, that is fine (I totally understand) I just wanted try and clear the PR queue

sweb · 2021-03-05T15:28:36Z

@sweb what is the status of this PR? Are you blocked? If you just haven't had time or inclination to make changes, that is fine (I totally understand) I just wanted try and clear the PR queue

@alamb Sorry for the late response - I did not have enough time to continue with this and did not see your message. However, I am currently blocked on one front:

I have created a new kernel function with the following signature:

fn regexp_match(array: &Array, pattern: &str) -> Result<ArrayRef>

where ArrayRef is a GenericListArray. This is my attempt to get closer to the Postgres functions. However, I do not seem to understand, how I can define a return type appropriate for ListArray for a DataFusion function. To be more specific: When I try to define the return type of DataType::List in functions.rs I need to create a Box<Field> and I am not sure whether it makes sense to create it here - e.g. what should be the name of the field?

If you have some pointers for this, I would be very grateful - I was not able to find something similar yet. I have added my current state, which kind of works but I am not too sure that what I am doing there is correct. Again, apologies for the late response.

seddonm1 · 2021-03-05T22:01:47Z

@sweb I can help on Monday. I'm planning to raise the PR for those other regexp functions then can help work through this?

sweb · 2021-03-08T17:43:36Z

@sweb I can help on Monday. I'm planning to raise the PR for those other regexp functions then can help work through this?

Hey @seddonm1 I have rebased my PR on the current master. My plan would be to only keep regexp_match and remove regexp_extract. However, since there are review comments that I did not address yet, I did not want to remove regexp_extract before I was sure that regexp_match is the way to go.

My main issues are:

What is the correct way to pass the regular expression into the DataFusion function? I have seen in your PR that you treat it as a normal column (ArrayRef) and compile all distinct regular expressions and apply the relevant one. I have to admit that I did not think of this use case, but of course it is much cleaner than my assumption that it is going to be a literal, fingers crossed. However, I am not sure that I would expect this API from a kernel function, i.e. passing the regex as a StringArray, instead of a &str.
What do you think concerning the usage of ListArray in regexp_match and just defining the return types of the corresponding DataFusion functions as List? I have no experience in using DataFusion (yet) so I am not sure whether this makes sense or if I have to add some kind of handling for accessing list elements to make this usable.

seddonm1 · 2021-03-08T21:01:08Z

Hi @sweb

Yesterday I made my PR for regexp_replace (and others) see https://github.com/apache/arrow/pull/9654/files#diff-c122a83600dc86aa69067fdfdca1e0349616dfae73eaad8a8c90d5e69dbf7a3c You can see how I am passing in the values. I think there is opportunity to use more lazy_static to precompile any standard regex in that file to reduce runtime cost - but we need to see all the functions before too much optimisation.

The way I have done the regexp_replace code (and I am not saying it is the best way) is that because potentially each row can be different (as any argument to a function in Postgres can actually be supplied by referencing column) I have tried to balance that cost by memoizing the Regex objects. I did write a lot of this code prior to @jorgecarleitao doing some large changes in the functions.rs relating to Scalar vs Columnar so there may be a second pass to optimise this once we get basic functionality working.

I think the regexp_match is the way to go (as per Postgres) which does return a list of string values. We will then need to look at the sqlparser to add the ability to 'extract' values from the list by id: [0] (I was thinking of doing this soon). I need to do some playing in Postgres to fully understand the behaivor (what happens if you reference a non-existent index).

alamb · 2021-03-08T21:56:32Z

Again, apologies for the late response.

@sweb no worries! I totally understand. Thanks for sticking with it

sweb · 2021-03-09T22:11:08Z

@seddonm1 I have pushed the following adjustment:

       // from functions.rs
        BuiltinScalarFunction::RegexpMatch => |args| match (&args[0], &args[1]) {
            (c, ColumnarValue::Scalar(ScalarValue::Utf8(Some(pattern))))
                if c.data_type() == DataType::Utf8
                    || c.data_type() == DataType::LargeUtf8 =>
            {
                let arr = match c {
                    ColumnarValue::Array(col) => col.clone(),
                    ColumnarValue::Scalar(_) => c.clone().into_array(1),
                };

                Ok(ColumnarValue::Array(string_expressions::regexp_match(
                    &arr,
                    pattern.as_str(),
                )?))
            }
   ...

This way, I define that the regular expression needs to come from a literal, so I do not have to take the first element of the array and hope for the best and do not have to check for multiple regular expressions. This comes with a change I am not sure is wanted: The string_expressions function regexp_match has two arguments and not a single slice of ArrayRefs.

I am just proposing this because I am not sure that we want to handle multiple regular expressions. If supporting multiple expressions is the way to go, I will adjust the kernel the way you implemented regexp_replace. Let me know what you think!

seddonm1 · 2021-03-10T07:22:57Z

@sweb I know you have spent a lot of time on this so I know its painful but I do think that we should built to the Postgres spec which will allow regex to be passed in via a columnar value. How we implement that behind the scene allows optimisation but the spec allows each row to be individually processed with a different regex.

To clarify: I am not sure if anyone actually uses SQL like this but it is a consistent implementation pattern in Postgres where a Scalar and Columnar value is treated equally - which I think is an elegant design.

seddonm1 · 2021-03-25T20:52:22Z

@sweb i have had a read through the code and it looks good. Are you able to rebase so the tests will be able to run? Sorry, I missed your comment 10 days ago to have a look earlier 🤦

sweb · 2021-03-26T20:40:24Z

@seddonm1 I just merged master - there is currently a linting issue due to rust 1.51 but the tests are green.

seddonm1 · 2021-03-26T22:07:20Z

rust/arrow/src/compute/kernels/regexp.rs

+                            list_builder.append(true)?
+                        }
+                        None => {
+                            list_builder.values().append_value("")?;


I am not sure this logic is correct based on Postgres behavior:

SELECT regexp_match('foobarbequebaz', '(bar)(bequ1e)') = NULL

I changed the behavior - it returns NULL now. This was my half-thought-through attempt to address the review comment differentiating between matching on NULL and no match.

seddonm1 · 2021-03-26T22:07:43Z

rust/arrow/src/compute/kernels/regexp.rs

+    use crate::array::{ListArray, StringArray};
+
+    #[test]
+    fn match_single_group() -> Result<()> {


Could you also add a test case for and (regexp_match('foobarbequebaz', '(bar)(bequ1e)') above):

SELECT regexp_match('foobarbequebaz', ''); = {""}

Some of these behaviors from Postgres don't really make sense to me.

@seddonm1 First: I am very impressed that you know of this case.

My original implementation returned an empty List, without an item. Do you know whether Postgres actually returns a quoted empty string? I am asking because

SELECT regexp_match('foobarbequebaz', '(bar)(beque)'); => {bar,beque}

so I am not sure what to make of the quotes, since strings are not returned with quotes or is this just a special case when the string is empty?

Regardless, I added special case for the empty string pattern

I did not know about this case. I just put a few scenarios into a Postgres instance running locally (via docker). Your implementation does make sense.

seddonm1 · 2021-03-26T22:11:34Z

@sweb This looks good and I think it is almost ready. I think there are just a few cases to look into to match Postgres behavior then ready to merge.

codecov-io · 2021-03-28T13:11:16Z

Codecov Report

Merging #9428 (da33017) into master (894dd17) will increase coverage by 0.03%.
The diff coverage is 92.90%.

@@            Coverage Diff             @@
##           master    #9428      +/-   ##
==========================================
+ Coverage   82.42%   82.45%   +0.03%     
==========================================
  Files         252      253       +1     
  Lines       58977    59132     +155     
==========================================
+ Hits        48609    48755     +146     
- Misses      10368    10377       +9

Impacted Files	Coverage Δ
rust/datafusion/src/logical_plan/expr.rs	`83.39% <ø> (ø)`
rust/datafusion/src/scalar.rs	`53.65% <33.33%> (-0.17%)`	⬇️
.../datafusion/src/physical_plan/regex_expressions.rs	`86.15% <60.00%> (-2.18%)`	⬇️
rust/datafusion/src/physical_plan/functions.rs	`89.48% <90.19%> (+0.02%)`	⬆️
rust/arrow/src/compute/kernels/regexp.rs	`97.59% <97.59%> (ø)`
rust/datafusion/tests/sql.rs	`98.47% <100.00%> (+0.01%)`	⬆️
rust/datafusion/examples/simple_udaf.rs
rust/datafusion/examples/flight_server.rs
rust/datafusion-examples/examples/flight_server.rs	`0.00% <0.00%> (ø)`
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 894dd17...da33017. Read the comment docs.

seddonm1 · 2021-03-29T22:32:38Z

rust/datafusion/tests/sql.rs

+
+#[tokio::test]
+#[cfg(feature = "regex_expressions")]
+async fn query_regexp_match() -> Result<()> {


It think we can remove this and just add these cases to the test_regex_expressions function above:

test_expression!("regexp_match('foobarbequebaz', '')", "[]"); test_expression!("regexp_match('foobarbequebaz', '(bar)(beque)')", "[bar, beque]"); test_expression!("regexp_match('foobarbequebaz', '(ba3r)(bequ34e)')", "NULL"); test_expression!("regexp_match('aaa-0', '.*-(\\d)')", "[0]"); test_expression!("regexp_match('bb-1', '.*-(\\d)')", "[1]"); test_expression!("regexp_match('aa', '.*-(\\d)')", "NULL"); test_expression!("regexp_match(NULL, '.*-(\\d)')", "NULL"); test_expression!("regexp_match('aaa-0', NULL)", "NULL");

seddonm1 · 2021-03-29T22:34:31Z

@sweb This is great and does work with all the cases I have run tests for 👍. You can see my slight suggested change above and once thats done @alamb this is ready for merge.

Sorry for being pedantic but I think this is really important functionality.

alamb · 2021-03-30T12:09:09Z

❤️ --- thanks @sweb and @seddonm1 . I'll keep an eye on this PR and merge when ready. I took a quick skim through this morning and it is looking very nice 👌

sweb · 2021-03-31T20:00:55Z

@sweb This is great and does work with all the cases I have run tests for 👍. You can see my slight suggested change above and once thats done @alamb this is ready for merge.

Sorry for being pedantic but I think this is really important functionality.

No worries, thank you for your thorough review work. I learned a lot!

sweb · 2021-04-01T08:38:39Z

I merged master, but I still get (as far as I can tell) unrelated linting errors. I assume #9867 solves this

alamb · 2021-04-01T11:28:53Z

Checking this out @sweb

Comments were addressed

alamb

I went through this PR and it looks good to me. Thanks @sweb and @seddonm1 and @jorgecarleitao for your work here.

alamb · 2021-04-01T11:58:02Z

I also merged it with apache/master locally and ran the tests / clippy and everything passed for me. Thus merging this one in

seddonm1 · 2021-04-02T02:26:45Z

@sweb Sorry I missed this in the review but raised #9874 to fix the location of the test.

Great work!

sweb marked this pull request as draft February 5, 2021 17:17

github-actions bot added Component: Rust Component: Rust - DataFusion labels Feb 5, 2021

sweb marked this pull request as ready for review February 7, 2021 16:27

sweb changed the title ~~ARROW-10354: [Rust] regexp_extract function to select regex groups from strings~~ ARROW-10354: [Rust][DataFusion] regexp_extract function to select regex groups from strings Feb 7, 2021

jorgecarleitao force-pushed the master branch from d4608a9 to 356c300 Compare February 14, 2021 12:09

jorgecarleitao previously requested changes Feb 17, 2021

View reviewed changes

sweb marked this pull request as draft March 5, 2021 15:29

github-actions bot added Component: FlightRPC Component: C++ labels Mar 5, 2021

sweb force-pushed the ARROW-10354/regexp_extract branch from 200204b to b5d0092 Compare March 5, 2021 16:29

sweb added 3 commits March 8, 2021 18:23

feat: regexp_extract and regexp_match

c38a5f8

cleanups

0065a10

fix: clean up after rebase

e9c4c5d

sweb force-pushed the ARROW-10354/regexp_extract branch from 4f4257c to e9c4c5d Compare March 8, 2021 17:31

sweb added 2 commits March 8, 2021 18:55

fix: formatting

2213696

fix: correct signature

a66b414

refactor: make usage of literal explicit

5f98c0e

sweb added 2 commits March 26, 2021 18:06

chore: merge master

5652b31

fix: sql test after merge

9b5c80d

seddonm1 reviewed Mar 26, 2021

View reviewed changes

sweb added 2 commits March 28, 2021 14:26

Merge branch 'master' into ARROW-10354/regexp_extract

09ee533

chore: let unmatching pattern return null

077a7dc

sweb added 2 commits March 28, 2021 17:18

feat: add special case for empty string pattern

f7e1745

fix: clippy

da33017

seddonm1 reviewed Mar 29, 2021

View reviewed changes

refactor: simplify tests for regexp_match

852ba38

sweb added 2 commits April 1, 2021 07:24

chore: formatting / linting

346227c

Merge branch 'master' into ARROW-10354/regexp_extract

b7bca91

alamb approved these changes Apr 1, 2021

View reviewed changes

alamb closed this in 090f11c Apr 1, 2021

asfimport mentioned this pull request Apr 2, 2021

[Rust] [DataFusion] Add support for regex extract #26340

Closed

ARROW-10354: [Rust][DataFusion] regexp_extract function to select regex groups from strings #9428

ARROW-10354: [Rust][DataFusion] regexp_extract function to select regex groups from strings #9428

Uh oh!

Conversation

sweb commented Feb 5, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Feb 5, 2021

Uh oh!

sweb commented Feb 7, 2021

Uh oh!

seddonm1 commented Feb 10, 2021

Uh oh!

sweb commented Feb 10, 2021

Uh oh!

jorgecarleitao left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jorgecarleitao Feb 17, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alamb commented Feb 26, 2021

Uh oh!

sweb commented Mar 5, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

seddonm1 commented Mar 5, 2021

Uh oh!

sweb commented Mar 8, 2021

Uh oh!

seddonm1 commented Mar 8, 2021

Uh oh!

alamb commented Mar 8, 2021

Uh oh!

sweb commented Mar 9, 2021

Uh oh!

seddonm1 commented Mar 10, 2021

Uh oh!

seddonm1 commented Mar 25, 2021

Uh oh!

sweb commented Mar 26, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sweb Mar 28, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

sweb commented Feb 5, 2021 •

edited

Loading

jorgecarleitao Feb 17, 2021 •

edited

Loading

sweb commented Mar 5, 2021 •

edited

Loading

sweb Mar 28, 2021 •

edited

Loading

codecov-io commented Mar 28, 2021 •

edited

Loading