ARROW-10354: [Rust][DataFusion] regexp_extract function to select regex groups from strings #9428
Closed
22 commits
c38a5f8 feat: regexp_extract and regexp_match (sweb)
0065a10 cleanups (sweb)
e9c4c5d fix: clean up after rebase (sweb)
2213696 fix: formatting (sweb)
a66b414 fix: correct signature (sweb)
5f98c0e refactor: make usage of literal explicit (sweb)
dfe4f9f refactor: support regex pattern as own column (sweb)
476f167 feat: add flags (sweb)
d8393e1 chore: merge master (sweb)
1199c17 refactor: move regexp_match to regex expressions (sweb)
cac07f0 fix: add regex feature flag to regexp_match tests (sweb)
ec32f9a fix: add regex feature flag to regexp_match tests (sweb)
a7371fd Merge branch 'master' into ARROW-10354/regexp_extract (sweb)
5652b31 chore: merge master (sweb)
9b5c80d fix: sql test after merge (sweb)
09ee533 Merge branch 'master' into ARROW-10354/regexp_extract (sweb)
077a7dc chore: let unmatching pattern return null (sweb)
f7e1745 feat: add special case for empty string pattern (sweb)
da33017 fix: clippy (sweb)
852ba38 refactor: simplify tests for regexp_match (sweb)
346227c chore: formatting / linting (sweb)
b7bca91 Merge branch 'master' into ARROW-10354/regexp_extract (sweb)
@@ -0,0 +1,160 @@

```rust
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements.  See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership.  The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License.  You may obtain a copy of the License at
//
//   http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied.  See the License for the
// specific language governing permissions and limitations
// under the License.

//! Defines kernel to extract substrings based on a regular
//! expression of a \[Large\]StringArray

use crate::array::{
    ArrayRef, GenericStringArray, GenericStringBuilder, ListBuilder,
    StringOffsetSizeTrait,
};
use crate::error::{ArrowError, Result};
use std::collections::HashMap;

use std::sync::Arc;

use regex::Regex;

/// Extract all groups matched by a regular expression for a given String array.
pub fn regexp_match<OffsetSize: StringOffsetSizeTrait>(
    array: &GenericStringArray<OffsetSize>,
    regex_array: &GenericStringArray<OffsetSize>,
    flags_array: Option<&GenericStringArray<OffsetSize>>,
) -> Result<ArrayRef> {
    let mut patterns: HashMap<String, Regex> = HashMap::new();
    let builder: GenericStringBuilder<OffsetSize> = GenericStringBuilder::new(0);
    let mut list_builder = ListBuilder::new(builder);

    let complete_pattern = match flags_array {
        Some(flags) => Box::new(regex_array.iter().zip(flags.iter()).map(
            |(pattern, flags)| {
                pattern.map(|pattern| match flags {
                    Some(value) => format!("(?{}){}", value, pattern),
                    None => pattern.to_string(),
                })
            },
        )) as Box<dyn Iterator<Item = Option<String>>>,
        None => Box::new(
            regex_array
                .iter()
                .map(|pattern| pattern.map(|pattern| pattern.to_string())),
        ),
    };
    array
        .iter()
        .zip(complete_pattern)
        .map(|(value, pattern)| {
            match (value, pattern) {
                // Required for Postgres compatibility:
                // SELECT regexp_match('foobarbequebaz', ''); = {""}
                (Some(_), Some(pattern)) if pattern == *"" => {
                    list_builder.values().append_value("")?;
                    list_builder.append(true)?;
                }
                (Some(value), Some(pattern)) => {
                    let existing_pattern = patterns.get(&pattern);
                    let re = match existing_pattern {
                        Some(re) => re.clone(),
                        None => {
                            let re = Regex::new(pattern.as_str()).map_err(|e| {
                                ArrowError::ComputeError(format!(
                                    "Regular expression did not compile: {:?}",
                                    e
                                ))
                            })?;
                            patterns.insert(pattern, re.clone());
                            re
                        }
                    };
                    match re.captures(value) {
                        Some(caps) => {
                            for m in caps.iter().skip(1) {
                                if let Some(v) = m {
                                    list_builder.values().append_value(v.as_str())?;
                                }
                            }
                            list_builder.append(true)?
                        }
                        None => list_builder.append(false)?,
                    }
                }
                _ => list_builder.append(false)?,
            }
            Ok(())
        })
        .collect::<Result<Vec<()>>>()?;
    Ok(Arc::new(list_builder.finish()))
}

#[cfg(test)]
mod tests {
    use super::*;
    use crate::array::{ListArray, StringArray};

    #[test]
    fn match_single_group() -> Result<()> {
        let values = vec![
            Some("abc-005-def"),
            Some("X-7-5"),
            Some("X545"),
            None,
            Some("foobarbequebaz"),
            Some("foobarbequebaz"),
        ];
        let array = StringArray::from(values);
        let mut pattern_values = vec![r".*-(\d*)-.*"; 4];
        pattern_values.push(r"(bar)(bequ1e)");
        pattern_values.push("");
        let pattern = StringArray::from(pattern_values);
        let actual = regexp_match(&array, &pattern, None)?;
        let elem_builder: GenericStringBuilder<i32> = GenericStringBuilder::new(0);
        let mut expected_builder = ListBuilder::new(elem_builder);
        expected_builder.values().append_value("005")?;
        expected_builder.append(true)?;
        expected_builder.values().append_value("7")?;
        expected_builder.append(true)?;
        expected_builder.append(false)?;
        expected_builder.append(false)?;
        expected_builder.append(false)?;
        expected_builder.values().append_value("")?;
        expected_builder.append(true)?;
        let expected = expected_builder.finish();
        let result = actual.as_any().downcast_ref::<ListArray>().unwrap();
        assert_eq!(&expected, result);
        Ok(())
    }

    #[test]
    fn match_single_group_with_flags() -> Result<()> {
        let values = vec![Some("abc-005-def"), Some("X-7-5"), Some("X545"), None];
        let array = StringArray::from(values);
        let pattern = StringArray::from(vec![r"x.*-(\d*)-.*"; 4]);
        let flags = StringArray::from(vec!["i"; 4]);
        let actual = regexp_match(&array, &pattern, Some(&flags))?;
        let elem_builder: GenericStringBuilder<i32> = GenericStringBuilder::new(0);
        let mut expected_builder = ListBuilder::new(elem_builder);
        expected_builder.append(false)?;
        expected_builder.values().append_value("7")?;
        expected_builder.append(true)?;
        expected_builder.append(false)?;
        expected_builder.append(false)?;
        let expected = expected_builder.finish();
        let result = actual.as_any().downcast_ref::<ListArray>().unwrap();
        assert_eq!(&expected, result);
        Ok(())
    }
}
```
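The flag handling in `regexp_match` above comes down to string concatenation: each row's flags are prepended to its pattern as an inline flag group, so flags `i` and pattern `x.*-(\d*)-.*` become `(?i)x.*-(\d*)-.*`, and the combined string is also what the compiled-pattern cache is keyed on. The following is a minimal sketch of that step using only the standard library. `combine_pattern` and `PatternCache` are hypothetical names that do not appear in the PR, and the cache stores a plain counter in place of a compiled `Regex` so that the example has no external dependencies.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of the PR): builds the pattern string that
/// would be handed to the regex engine. A null pattern stays null; null flags
/// leave the pattern unchanged.
fn combine_pattern(pattern: Option<&str>, flags: Option<&str>) -> Option<String> {
    pattern.map(|p| match flags {
        Some(f) => format!("(?{}){}", f, p),
        None => p.to_string(),
    })
}

/// Hypothetical stand-in for the `HashMap<String, Regex>` cache in the kernel.
/// The stored `usize` is only a placeholder for a compiled pattern.
struct PatternCache {
    compiled: HashMap<String, usize>,
    compile_calls: usize,
}

impl PatternCache {
    fn new() -> Self {
        PatternCache {
            compiled: HashMap::new(),
            compile_calls: 0,
        }
    }

    /// Returns the cached entry for `pattern`, "compiling" it on first use only.
    fn get_or_compile(&mut self, pattern: &str) -> usize {
        if let Some(id) = self.compiled.get(pattern) {
            return *id;
        }
        // Assumption: this is where the real kernel would call `Regex::new`.
        self.compile_calls += 1;
        let id = self.compile_calls;
        self.compiled.insert(pattern.to_string(), id);
        id
    }
}
```

Because the flags are part of the cache key in this sketch, the same pattern text with different flags is treated as two distinct entries, while repeated rows with identical pattern and flags reuse one entry.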
Could you also add a test case for and (regexp_match('foobarbequebaz', '(bar)(bequ1e)') above):
Some of these behaviors from Postgres don't really make sense to me.
@seddonm1 First: I am very impressed that you know of this case.
My original implementation returned an empty List, without an item. Do you know whether Postgres actually returns a quoted empty string? I am asking because
so I am not sure what to make of the quotes, since strings are not returned with quotes. Or is this just a special case when the string is empty?
Regardless, I added a special case for the empty string pattern.
I did not know about this case. I just put a few scenarios into a Postgres instance running locally (via docker). Your implementation does make sense.
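The per-row rules settled on in this thread (a null value or null pattern gives a null list, an empty pattern gives a list holding one empty string, and a pattern that does not match gives null) can be sketched without Arrow or the `regex` crate. In the sketch below, `row_result` and `literal_captures` are hypothetical names; the `captures` callback stands in for `Regex::captures` and returns the captured groups, or `None` when nothing matches.

```rust
/// Hypothetical per-row decision (not part of the PR). `None` models a null
/// list entry, `Some(groups)` a valid list of captured strings.
fn row_result<F>(
    value: Option<&str>,
    pattern: Option<&str>,
    captures: F,
) -> Option<Vec<String>>
where
    F: Fn(&str, &str) -> Option<Vec<String>>,
{
    match (value, pattern) {
        // Postgres compatibility: an empty pattern yields {""}.
        (Some(_), Some("")) => Some(vec![String::new()]),
        // Delegate to the matcher; a non-matching pattern yields null.
        (Some(v), Some(p)) => captures(v, p),
        // A null value or a null pattern yields null.
        _ => None,
    }
}

/// Toy matcher used only for illustration: treats the pattern as a literal
/// and "captures" it when the value contains it. A real implementation
/// would run a regular expression here.
fn literal_captures(value: &str, pattern: &str) -> Option<Vec<String>> {
    if value.contains(pattern) {
        Some(vec![pattern.to_string()])
    } else {
        None
    }
}
```

Calling `row_result(Some("foobarbequebaz"), Some(""), literal_captures)` takes the empty-pattern branch and never reaches the matcher, which mirrors how the special case sits ahead of the general match arm in the kernel.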