Skip to content

Conversation

@sweb
Copy link
Contributor

@sweb sweb commented Feb 5, 2021

Adds a regexp_extract compute kernel to select a substring based on a regular expression.

Some things I did that I may be doing wrong:

  • I exposed GenericStringBuilder
  • I build the resulting Array using a builder - this looks quite different from e.g. the substring kernel. Should I change it accordingly, e.g. because of performance considerations?
  • In order to apply the new function in datafusion, I did not see a better solution than to handle the pattern string as StringArray and take the first record to compile the regex pattern from it and apply it to all values. Is there a way to define that an argument has to be a literal/scalar and cannot be filled by e.g. another column? I consider my current implementation quite error prone and would like to make this a bit more robust.

@sweb sweb marked this pull request as draft February 5, 2021 17:17
@github-actions
Copy link

github-actions bot commented Feb 5, 2021

@sweb
Copy link
Contributor Author

sweb commented Feb 7, 2021

@jorgecarleitao could I ask you for a review if you find the time? This is my first attempt to do something with datafusion, so there are probably some things I misunderstood.

@sweb sweb changed the title ARROW-10354: [Rust] regexp_extract function to select regex groups from strings ARROW-10354: [Rust][DataFusion] regexp_extract function to select regex groups from strings Feb 7, 2021
@seddonm1
Copy link
Contributor

@sweb This is cool and useful.

Given we are aiming for Postgres compatability (in terms of syntax) do you think you could modify it to be the regexp_match ( string text, pattern text [, flags text ] ) → text[] function instead? Then you could modify the parser to support extracting a value from the text[] array like Postgres.

See: https://www.postgresql.org/docs/13/functions-string.html

I have done a lot of work recently on Postgres functions so there may be some useful work there: #9243

@sweb
Copy link
Contributor Author

sweb commented Feb 10, 2021

Given we are aiming for Postgres compatability (in terms of syntax) do you think you could modify it to be the regexp_match ( string text, pattern text [, flags text ] ) → text[] function instead? Then you could modify the parser to support extracting a value from the text[] array like Postgres.

@seddonm1 Sure, I will try to change it accordingly!

I have done a lot of work recently on Postgres functions so there may be some useful work there: #9243

What a lucky coincidence that you have not implemented regexp_match yet :) I will check your PR for some pointers on how to provide things like flags, etc.

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @sweb . Thank you for this PR! This is a really important kernel :)

I went through this. I think that the overall logic makes sense and the structure is correct, but there are some points that IMO should be addressed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is more efficient (and probably idiomatic) to use the collect::<GenericStringArray<OffsetSize>>() here.

I.e.

array
    .iter()
    .map(<logic here>)
    .collect::<GenericStringArray<OffsetSize>>()

Since this is an unary operation on a utf8 array, I would try to write a generic for it (like we do for primitives in arity.rs) and use it here. We may even be able to write it using the trusted_len, which is the faster option available atm.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not find a way to do this for ListArray - I think I would have to implement the FromIterator trait for GenericListArray, correct?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if the signature shouldn't be idx: &[usize] and the result Vec<ArrayRef>. It would allow for optimizations where the user wants more than one group for the same regex (as regex is usually slow). Could be left out for now, just a though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @jorgecarleitao thank you very much for your review. I will try to address your comments in the next days.

Since @seddonm1 remarked that Postgres compatibility, I was thinking about changing the function signature of the kernel to:

pub fn regexp_match(array: &Array, pattern: &str) -> Result<ArrayRef>

where the returned array is of type GenericListArray with values of type &str. A list is closer to the Postgres signature and would provide the flexibility to choose multiple groups. Would this be fine as well or is Vec<ArrayRef> preferable to you?

Copy link
Member

@jorgecarleitao jorgecarleitao Feb 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we want to mimic spark, the last entry should result in an empty string, not a None. This is because it would be otherwise impossible to differentiate between a "no match" and a "input is null".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done for regexp_match. This will now lead to a ListArray([StringArray([""]]), i.e. the group has a single entry with an empty string - even if there are multiple groups. I am not sure how Postgres behaves in this case... I will try to check

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make one of these entries None, so that we also test the null entry case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a None case for regexp_match - I am currently planning to remove regexp_extract if something like regexp_match is preferable.

Comment on lines 107 to 168
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be fine not adding this test. IMO this is covered on the test above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure this is correct: the function expects an array, but then only picks the first element of the array for the regex. Maybe this was used because ScalarFunctions did not support the ScalarValue variant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely, I did this because I am not sure, how I can get from ArrayRef to ScalarVariant - I am looking for something like the following:

    let pattern_expr = args[1].as_any().downcast_ref::<ScalarValue>().unwrap();
    if let ScalarValue::Utf8(Some(pattern)) = pattern_expr {
        compute::regexp_match(args[0].as_ref(), pattern)
            .map_err(DataFusionError::ArrowError)
    } else {
        Err(DataFusionError::Internal("This is wrong".to_string()))
    }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure this is correct; isn't the signature Variant([[Utf8, Utf8, UInt64], [LargeUtf8, Utf8, UInt64]]) or something like that?

Note that these signatures are very important because they are used for type validation during logical planning, as well as type coercion at physical planning. Whenever we write Any, the logical planning will accept any type. Worse, the type coercer will not perform any coercion.

In this case, because we downcast arg[2] to Int64Array, if the user passes a Int32Array, the execution panics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done - I had to use Int64 instead of UInt64 because I got errors from tests/sql.rs. I have to check how to define a literal as unsigned from within the query string.

@alamb
Copy link
Contributor

alamb commented Feb 26, 2021

@sweb what is the status of this PR? Are you blocked? If you just haven't had time or inclination to make changes, that is fine (I totally understand) I just wanted try and clear the PR queue

@sweb
Copy link
Contributor Author

sweb commented Mar 5, 2021

@sweb what is the status of this PR? Are you blocked? If you just haven't had time or inclination to make changes, that is fine (I totally understand) I just wanted try and clear the PR queue

@alamb Sorry for the late response - I did not have enough time to continue with this and did not see your message. However, I am currently blocked on one front:

I have created a new kernel function with the following signature:

fn regexp_match(array: &Array, pattern: &str) -> Result<ArrayRef>

where ArrayRef is a GenericListArray. This is my attempt to get closer to the Postgres functions. However, I do not seem to understand, how I can define a return type appropriate for ListArray for a DataFusion function. To be more specific: When I try to define the return type of DataType::List in functions.rs I need to create a Box<Field> and I am not sure whether it makes sense to create it here - e.g. what should be the name of the field?

If you have some pointers for this, I would be very grateful - I was not able to find something similar yet. I have added my current state, which kind of works but I am not too sure that what I am doing there is correct. Again, apologies for the late response.

@sweb sweb marked this pull request as draft March 5, 2021 15:29
@sweb sweb force-pushed the ARROW-10354/regexp_extract branch from 200204b to b5d0092 Compare March 5, 2021 16:29
@seddonm1
Copy link
Contributor

seddonm1 commented Mar 5, 2021

@sweb I can help on Monday. I'm planning to raise the PR for those other regexp functions then can help work through this?

@sweb sweb force-pushed the ARROW-10354/regexp_extract branch from 4f4257c to e9c4c5d Compare March 8, 2021 17:31
@sweb
Copy link
Contributor Author

sweb commented Mar 8, 2021

@sweb I can help on Monday. I'm planning to raise the PR for those other regexp functions then can help work through this?

Hey @seddonm1 I have rebased my PR on the current master. My plan would be to only keep regexp_match and remove regexp_extract. However, since there are review comments that I did not address yet, I did not want to remove regexp_extract before I was sure that regexp_match is the way to go.

My main issues are:

  • What is the correct way to pass the regular expression into the DataFusion function? I have seen in your PR that you treat it as a normal column (ArrayRef) and compile all distinct regular expressions and apply the relevant one. I have to admit that I did not think of this use case, but of course it is much cleaner than my assumption that it is going to be a literal, fingers crossed. However, I am not sure that I would expect this API from a kernel function, i.e. passing the regex as a StringArray, instead of a &str.
  • What do you think concerning the usage of ListArray in regexp_match and just defining the return types of the corresponding DataFusion functions as List? I have no experience in using DataFusion (yet) so I am not sure whether this makes sense or if I have to add some kind of handling for accessing list elements to make this usable.

@seddonm1
Copy link
Contributor

seddonm1 commented Mar 8, 2021

Hi @sweb

Yesterday I made my PR for regexp_replace (and others) see https://github.com/apache/arrow/pull/9654/files#diff-c122a83600dc86aa69067fdfdca1e0349616dfae73eaad8a8c90d5e69dbf7a3c You can see how I am passing in the values. I think there is opportunity to use more lazy_static to precompile any standard regex in that file to reduce runtime cost - but we need to see all the functions before too much optimisation.

The way I have done the regexp_replace code (and I am not saying it is the best way) is that because potentially each row can be different (as any argument to a function in Postgres can actually be supplied by referencing column) I have tried to balance that cost by memoizing the Regex objects. I did write a lot of this code prior to @jorgecarleitao doing some large changes in the functions.rs relating to Scalar vs Columnar so there may be a second pass to optimise this once we get basic functionality working.

I think the regexp_match is the way to go (as per Postgres) which does return a list of string values. We will then need to look at the sqlparser to add the ability to 'extract' values from the list by id: [0] (I was thinking of doing this soon). I need to do some playing in Postgres to fully understand the behaivor (what happens if you reference a non-existent index).

@alamb
Copy link
Contributor

alamb commented Mar 8, 2021

Again, apologies for the late response.

@sweb no worries! I totally understand. Thanks for sticking with it

@sweb
Copy link
Contributor Author

sweb commented Mar 9, 2021

@seddonm1 I have pushed the following adjustment:

       // from functions.rs
        BuiltinScalarFunction::RegexpMatch => |args| match (&args[0], &args[1]) {
            (c, ColumnarValue::Scalar(ScalarValue::Utf8(Some(pattern))))
                if c.data_type() == DataType::Utf8
                    || c.data_type() == DataType::LargeUtf8 =>
            {
                let arr = match c {
                    ColumnarValue::Array(col) => col.clone(),
                    ColumnarValue::Scalar(_) => c.clone().into_array(1),
                };

                Ok(ColumnarValue::Array(string_expressions::regexp_match(
                    &arr,
                    pattern.as_str(),
                )?))
            }
   ...

This way, I define that the regular expression needs to come from a literal, so I do not have to take the first element of the array and hope for the best and do not have to check for multiple regular expressions. This comes with a change I am not sure is wanted: The string_expressions function regexp_match has two arguments and not a single slice of ArrayRefs.

I am just proposing this because I am not sure that we want to handle multiple regular expressions. If supporting multiple expressions is the way to go, I will adjust the kernel the way you implemented regexp_replace. Let me know what you think!

@seddonm1
Copy link
Contributor

@sweb I know you have spent a lot of time on this so I know its painful but I do think that we should built to the Postgres spec which will allow regex to be passed in via a columnar value. How we implement that behind the scene allows optimisation but the spec allows each row to be individually processed with a different regex.

To clarify: I am not sure if anyone actually uses SQL like this but it is a consistent implementation pattern in Postgres where a Scalar and Columnar value is treated equally - which I think is an elegant design.

@seddonm1
Copy link
Contributor

@sweb i have had a read through the code and it looks good. Are you able to rebase so the tests will be able to run? Sorry, I missed your comment 10 days ago to have a look earlier 🤦

@sweb
Copy link
Contributor Author

sweb commented Mar 26, 2021

@seddonm1 I just merged master - there is currently a linting issue due to rust 1.51 but the tests are green.

list_builder.append(true)?
}
None => {
list_builder.values().append_value("")?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure this logic is correct based on Postgres behavior:

SELECT regexp_match('foobarbequebaz', '(bar)(bequ1e)') = NULL

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the behavior - it returns NULL now. This was my half-thought-through attempt to address the review comment differentiating between matching on NULL and no match.

use crate::array::{ListArray, StringArray};

#[test]
fn match_single_group() -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also add a test case for and (regexp_match('foobarbequebaz', '(bar)(bequ1e)') above):

SELECT regexp_match('foobarbequebaz', ''); = {""}

Some of these behaviors from Postgres don't really make sense to me.

Copy link
Contributor Author

@sweb sweb Mar 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@seddonm1 First: I am very impressed that you know of this case.

My original implementation returned an empty List, without an item. Do you know whether Postgres actually returns a quoted empty string? I am asking because

SELECT regexp_match('foobarbequebaz', '(bar)(beque)'); => {bar,beque}

so I am not sure what to make of the quotes, since strings are not returned with quotes or is this just a special case when the string is empty?

Regardless, I added special case for the empty string pattern

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not know about this case. I just put a few scenarios into a Postgres instance running locally (via docker). Your implementation does make sense.

@seddonm1
Copy link
Contributor

@sweb This looks good and I think it is almost ready. I think there are just a few cases to look into to match Postgres behavior then ready to merge.

@codecov-io
Copy link

codecov-io commented Mar 28, 2021

Codecov Report

Merging #9428 (da33017) into master (894dd17) will increase coverage by 0.03%.
The diff coverage is 92.90%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #9428      +/-   ##
==========================================
+ Coverage   82.42%   82.45%   +0.03%     
==========================================
  Files         252      253       +1     
  Lines       58977    59132     +155     
==========================================
+ Hits        48609    48755     +146     
- Misses      10368    10377       +9     
Impacted Files Coverage Δ
rust/datafusion/src/logical_plan/expr.rs 83.39% <ø> (ø)
rust/datafusion/src/scalar.rs 53.65% <33.33%> (-0.17%) ⬇️
.../datafusion/src/physical_plan/regex_expressions.rs 86.15% <60.00%> (-2.18%) ⬇️
rust/datafusion/src/physical_plan/functions.rs 89.48% <90.19%> (+0.02%) ⬆️
rust/arrow/src/compute/kernels/regexp.rs 97.59% <97.59%> (ø)
rust/datafusion/tests/sql.rs 98.47% <100.00%> (+0.01%) ⬆️
rust/datafusion/examples/simple_udaf.rs
rust/datafusion/examples/flight_server.rs
rust/datafusion-examples/examples/flight_server.rs 0.00% <0.00%> (ø)
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 894dd17...da33017. Read the comment docs.


#[tokio::test]
#[cfg(feature = "regex_expressions")]
async fn query_regexp_match() -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It think we can remove this and just add these cases to the test_regex_expressions function above:

    test_expression!("regexp_match('foobarbequebaz', '')", "[]");
    test_expression!("regexp_match('foobarbequebaz', '(bar)(beque)')", "[bar, beque]");
    test_expression!("regexp_match('foobarbequebaz', '(ba3r)(bequ34e)')", "NULL");
    test_expression!("regexp_match('aaa-0', '.*-(\\d)')", "[0]");
    test_expression!("regexp_match('bb-1', '.*-(\\d)')", "[1]");
    test_expression!("regexp_match('aa', '.*-(\\d)')", "NULL");
    test_expression!("regexp_match(NULL, '.*-(\\d)')", "NULL");
    test_expression!("regexp_match('aaa-0', NULL)", "NULL");

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@seddonm1
Copy link
Contributor

@sweb This is great and does work with all the cases I have run tests for 👍. You can see my slight suggested change above and once thats done @alamb this is ready for merge.

Sorry for being pedantic but I think this is really important functionality.

@alamb
Copy link
Contributor

alamb commented Mar 30, 2021

❤️ --- thanks @sweb and @seddonm1 . I'll keep an eye on this PR and merge when ready. I took a quick skim through this morning and it is looking very nice 👌

@sweb
Copy link
Contributor Author

sweb commented Mar 31, 2021

@sweb This is great and does work with all the cases I have run tests for 👍. You can see my slight suggested change above and once thats done @alamb this is ready for merge.

Sorry for being pedantic but I think this is really important functionality.

No worries, thank you for your thorough review work. I learned a lot!

@sweb
Copy link
Contributor Author

sweb commented Apr 1, 2021

I merged master, but I still get (as far as I can tell) unrelated linting errors. I assume #9867 solves this

@alamb
Copy link
Contributor

alamb commented Apr 1, 2021

Checking this out @sweb

@alamb alamb dismissed jorgecarleitao’s stale review April 1, 2021 11:53

Comments were addressed

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through this PR and it looks good to me. Thanks @sweb and @seddonm1 and @jorgecarleitao for your work here.

@alamb
Copy link
Contributor

alamb commented Apr 1, 2021

I also merged it with apache/master locally and ran the tests / clippy and everything passed for me. Thus merging this one in

@alamb alamb closed this in 090f11c Apr 1, 2021
@seddonm1
Copy link
Contributor

seddonm1 commented Apr 2, 2021

@sweb Sorry I missed this in the review but raised #9874 to fix the location of the test.

Great work!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants