ARROW-11656: [Rust][DataFusion] Remaining Postgres String functions #9654

seddonm1 · 2021-03-07T23:25:34Z

@alamb the last one (for now).

This PR does a few things:

adds regexp_replace, replace, split_part, starts_with, strpos and translate.
adds feature flag unicode_expressions and moves anything that depends on unicode-segmentation crate into it.
adds feature flag regex_expressions and adds regex and lazy_static crates to it.
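
For readers unfamiliar with the Postgres semantics, here is a minimal scalar sketch of three of the new functions. It is not the PR's implementation: the PR operates on Arrow arrays, whereas these hypothetical helpers work on plain `&str` values, count `char`s rather than grapheme clusters, and use only the standard library.

```rust
/// split_part('abc~@~def~@~ghi', '~@~', 2) = 'def'
/// `n` is 1-based. Postgres raises an error for n = 0, modelled here as `Err`.
/// Empty delimiters are not handled specially in this sketch.
fn split_part<'a>(s: &'a str, delimiter: &str, n: usize) -> Result<&'a str, String> {
    if n == 0 {
        return Err("field position must be greater than zero".to_string());
    }
    // Fields past the end of the input yield an empty string.
    Ok(s.split(delimiter).nth(n - 1).unwrap_or(""))
}

/// strpos('high', 'ig') = 2
/// Returns the 1-based character position of `substring`, or 0 if absent.
fn strpos(s: &str, substring: &str) -> usize {
    match s.find(substring) {
        // `find` returns a byte offset; convert it to a character position.
        Some(byte_idx) => s[..byte_idx].chars().count() + 1,
        None => 0,
    }
}

/// translate('12345', '143', 'ax') = 'a2x5'
/// Characters in `from` map to the character at the same index in `to`;
/// characters in `from` with no counterpart in `to` are dropped.
fn translate(s: &str, from: &str, to: &str) -> String {
    let from: Vec<char> = from.chars().collect();
    let to: Vec<char> = to.chars().collect();
    s.chars()
        .filter_map(|c| match from.iter().position(|&f| f == c) {
            Some(i) => to.get(i).copied(), // `None` drops the character
            None => Some(c),               // not in `from`: keep unchanged
        })
        .collect()
}
```

The doc-comment examples are the ones given in the Postgres string-function documentation.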

github-actions · 2021-03-07T23:25:53Z

https://issues.apache.org/jira/browse/ARROW-11656

alamb · 2021-03-08T15:30:15Z

Thanks @seddonm1 -- I'll plan to look at this later today or tomorrow

Dandandan · 2021-03-08T21:22:47Z

rust/datafusion/src/physical_plan/unicode_expressions.rs

+                        let length = length as usize;
+
+                        if length == 0 {
+                            Some("".to_string())


Is String necessary or could we use &str in those iterators? (to avoid an extra allocation per item)

I'm hitting issues with the borrow checker so I think that the String allocation is required.

Dandandan · 2021-03-08T21:26:47Z

rust/datafusion/src/physical_plan/unicode_expressions.rs

+                        .grapheme_indices(true)
+                        .nth(n as usize)
+                        .map_or(string, |(i, _)| {
+                            &from_utf8(&string.as_bytes()[..i]).unwrap()


For any unwrap() I think it makes sense to put a comment why the unwrap is ok, and/or use .expect
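
A rough sketch of what that could look like, using the standard library's `char_indices` as a stand-in for `grapheme_indices` so the example has no dependencies (the function name `prefix_with_expect` is hypothetical):

```rust
use std::str::from_utf8;

/// Returns the first `n` characters of `string`, or all of it if shorter.
fn prefix_with_expect(string: &str, n: usize) -> &str {
    string.char_indices().nth(n).map_or(string, |(i, _)| {
        // `i` comes from `char_indices`, so it always lies on a UTF-8
        // character boundary and the byte prefix is itself valid UTF-8.
        from_utf8(&string.as_bytes()[..i])
            .expect("prefix ending on a char boundary is valid UTF-8")
    })
}
```

The comment records why the conversion cannot fail, and the `expect` message makes the cause obvious if that assumption is ever broken.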

I spent some time poking around in https://docs.rs/unicode-segmentation/1.7.1/unicode_segmentation/trait.UnicodeSegmentation.html#tymethod.graphemes

I couldn't find anything that would let you get back a &str from the grapheme indices. I wonder if it might be possible to use the graphemes() call itself and count up the lengths of the &str that came back

https://docs.rs/unicode-segmentation/1.7.1/unicode_segmentation/trait.UnicodeSegmentation.html#tymethod.graphemes

Something like (untested)

let i: usize = string.graphemes(true).take(n as usize).map(|s| s.len()).sum();
&string[..i]

Or something like that. I am not sure how important this is
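
To make the idea concrete without pulling in unicode-segmentation, here is a std-only sketch of the same approach. It sums the byte lengths of the first `n` `char`s rather than grapheme clusters, which is an assumption made to keep it dependency-free, and the name `prefix_by_len` is hypothetical:

```rust
/// Returns the first `n` characters of `string` without `from_utf8`.
fn prefix_by_len(string: &str, n: usize) -> &str {
    // Sum the UTF-8 byte lengths of the first `n` characters. The total is
    // the byte offset just past the n-th character, so it is always a valid
    // slice boundary. If the string is shorter, the sum is the full length.
    let i: usize = string.chars().take(n).map(|c| c.len_utf8()).sum();
    &string[..i]
}
```

Slicing the original `&str` directly avoids both the `unwrap()` and the round trip through bytes.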

Let me have a look and see if we can do this.

@alamb I have refactored just the left function to demonstrate how it could be done:

arrow/rust/datafusion/src/physical_plan/unicode_expressions.rs

Line 98 in 7e40d25

let result = string_array

I think this is nicer so if you agree I can apply to the rest of the code?

I like it a lot 👍 Thanks @seddonm1

I am not sure why this clone is needed, but that is a minor point and can be cleaned up later

let len = graphemes.clone().count() as i64;

The borrow checker is identifying that the second use of graphemes is using a moved value without the clone. If you can help resolve that it would help me learn.

use of moved value: `graphemes` value used here after move

I see -- I took a closer look and I see now that graphemes is an iterator -- so we are clone()ing an iterator, which I guess feels right to me -- the code has to effectively scan the input once to figure out how many graphemes long it is and then scan it again to create the output.

Sorry for my confusion!
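
For anyone following along, a small std-only sketch of why the clone is needed. It uses `chars()` in place of `graphemes(true)` (an assumption to avoid the external crate) and is not the code from the PR:

```rust
/// Postgres-style `left`: the first `n` characters, or, when `n` is
/// negative, everything except the last `|n|` characters.
fn left(string: &str, n: i64) -> String {
    let chars = string.chars();
    if n >= 0 {
        chars.take(n as usize).collect()
    } else {
        // `count()` takes the iterator by value and consumes it, so calling
        // it on `chars` directly would move `chars` and make the `take`
        // below a "use of moved value". Counting a clone scans the input
        // once for the length and leaves `chars` free for the second scan.
        let len = chars.clone().count() as i64;
        let keep = (len + n).max(0) as usize;
        chars.take(keep).collect()
    }
}
```

Cloning a `Chars` iterator only copies its position in the string, not the string data, so the cost is the extra scan rather than an extra allocation.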

Dandandan · 2021-03-08T21:37:24Z

rust/datafusion/src/physical_plan/regex_expressions.rs

+
+/// Replaces substring(s) matching a POSIX regular expression
+/// regexp_replace('Thomas', '.[mN]a.', 'M') = 'ThM'
+pub fn regexp_replace<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<ArrayRef> {


I think at some point this could be in arrow (and more parts of your string contributions).

Yeah -- it might be worth a discussion on the mailing list of what functions belong in datafusion and what are more "core" and broadly applicable to bring them into the core arrow kernels

Dandandan · 2021-03-08T21:52:52Z

Truly looks great @seddonm1 I added some comments

alamb · 2021-03-08T22:16:15Z

Integration test failure looked like https://issues.apache.org/jira/browse/ARROW-11908 so I retriggered the CI checks

alamb

I think this looks great, as always @seddonm1. 💯 🥇 for the test cases and feature flags.

I think @Dandandan and I had some style / improvement suggestions, but they could be done as follow-on PRs as well. I'll wait another day or so before merging in case you want to respond; otherwise I'll plan to merge tomorrow.

👍

alamb · 2021-03-08T22:20:57Z

rust/datafusion/src/physical_plan/regex_expressions.rs

+
+/// Replaces substring(s) matching a POSIX regular expression
+/// regexp_replace('Thomas', '.[mN]a.', 'M') = 'ThM'
+pub fn regexp_replace<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<ArrayRef> {


Yeah -- it might be worth a discussion on the mailing list of what functions belong in datafusion and what are more "core" and broadly applicable to bring them into the core arrow kernels

alamb · 2021-03-08T22:42:57Z

rust/datafusion/src/physical_plan/unicode_expressions.rs

+                        .grapheme_indices(true)
+                        .nth(n as usize)
+                        .map_or(string, |(i, _)| {
+                            &from_utf8(&string.as_bytes()[..i]).unwrap()


I spent some time poking around in https://docs.rs/unicode-segmentation/1.7.1/unicode_segmentation/trait.UnicodeSegmentation.html#tymethod.graphemes

I couldn't find anything that would let you get back a &str from the grapheme indices. I wonder if it might be possible to use the graphemes() call itself and count up the lengths of the &str that came back

https://docs.rs/unicode-segmentation/1.7.1/unicode_segmentation/trait.UnicodeSegmentation.html#tymethod.graphemes

Something like (untested)

let i: usize = string.graphemes(true).take(n as usize).map(|s| s.len()).sum();
&string[..i]

Or something like that. I am not sure how important this is

alamb

This is looking great @seddonm1 -- I think it is mergeable as soon as you are ready. Just let me know

seddonm1 · 2021-03-09T21:04:25Z

Thanks @alamb. I will make some more tweaks to these unicode functions to try to remove std::str::from_utf8 then let you know.

seddonm1 · 2021-03-09T21:34:56Z

@alamb I have removed the std::str::from_utf8. We should be good to go. 🥳

seddonm1 · 2021-03-10T21:48:57Z

@alamb hitting some CICD issues - nothing to do with this change.

Testing file auth:basic_proto
==========================================================
################# FAILURES #################
FAILED TEST: auth:basic_proto Rust producing,  C++ consuming

1 failures
Traceback (most recent call last):
  File "/arrow/dev/archery/archery/integration/util.py", line 139, in run_cmd
    output = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
  File "/opt/conda/envs/arrow/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/opt/conda/envs/arrow/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/build/cpp/debug/flight-test-integration-client', '-host', 'localhost', '-port=43311', '-scenario', 'auth:basic_proto']' died with <Signals.SIGABRT: 6>.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/arrow/dev/archery/archery/integration/runner.py", line 308, in _run_flight_test_case
    consumer.flight_request(port, **client_args)
  File "/arrow/dev/archery/archery/integration/tester_cpp.py", line 116, in flight_request
    run_cmd(cmd)
  File "/arrow/dev/archery/archery/integration/util.py", line 148, in run_cmd
    raise RuntimeError(sio.getvalue())
RuntimeError: Command failed: /build/cpp/debug/flight-test-integration-client -host localhost -port=43311 -scenario auth:basic_proto
With output:
--------------
-- Arrow Fatal Error --
Invalid: Expected UNAUTHENTICATED but got Unavailable

alamb · 2021-03-10T23:27:55Z

The integration test failure looks like https://issues.apache.org/jira/browse/ARROW-11908 so merging this one in. Thanks again @seddonm1 !

remaining string functions

a84f714

github-actions bot added Component: Rust - DataFusion Component: Rust labels Mar 7, 2021