Conversation

@seddonm1
Contributor

@seddonm1 seddonm1 commented Mar 7, 2021

@alamb the last one (for now).

This PR does a few things:

  • adds regexp_replace, replace, split_part, starts_with, strpos and translate.
  • adds feature flag unicode_expressions and moves anything that depends on unicode-segmentation crate into it.
  • adds feature flag regex_expressions and adds regex and lazy_static crates to it.
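For readers unfamiliar with these functions, here is a minimal scalar sketch of the semantics of `split_part`, `strpos` and `translate`, assuming they follow the PostgreSQL functions of the same names. It is not the PR's code: the real implementations operate on Arrow arrays, and this sketch uses only the standard library, so it counts `char`s rather than grapheme clusters. Edge cases such as an empty delimiter or a zero field index are not handled the way PostgreSQL handles them.

```rust
/// Returns the `n`-th (1-based) field of `string` split on `delimiter`,
/// or an empty string when there is no such field.
/// Assumption: `n == 0` also yields "" here; PostgreSQL raises an error.
fn split_part<'a>(string: &'a str, delimiter: &str, n: usize) -> &'a str {
    n.checked_sub(1)
        .and_then(|i| string.split(delimiter).nth(i))
        .unwrap_or("")
}

/// Returns the 1-based character position of `substring` in `string`,
/// or 0 when it does not occur.
fn strpos(string: &str, substring: &str) -> usize {
    string
        .find(substring)
        // `find` returns a byte offset; convert it to a character position.
        .map(|byte_idx| string[..byte_idx].chars().count() + 1)
        .unwrap_or(0)
}

/// Replaces each character found in `from` with the character at the same
/// position in `to`; characters of `from` without a counterpart are removed.
fn translate(string: &str, from: &str, to: &str) -> String {
    let from: Vec<char> = from.chars().collect();
    let to: Vec<char> = to.chars().collect();
    string
        .chars()
        .filter_map(|c| match from.iter().position(|f| *f == c) {
            Some(i) => to.get(i).copied(), // `None` drops the character
            None => Some(c),
        })
        .collect()
}
```

For example, `split_part("abc~@~def~@~ghi", "~@~", 2)` gives `"def"`, `strpos("high", "ig")` gives `2`, and `translate("12345", "143", "ax")` gives `"a2x5"`.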

@github-actions

github-actions bot commented Mar 7, 2021

@alamb
Contributor

alamb commented Mar 8, 2021

Thanks @seddonm1 -- I'll plan to look at this later today or tomorrow

let length = length as usize;

if length == 0 {
Some("".to_string())
Contributor

Is String necessary or could we use &str in those iterators? (to avoid an extra allocation per item)

Contributor Author

I'm hitting issues with the borrow checker so I think that the String allocation is required.

.grapheme_indices(true)
.nth(n as usize)
.map_or(string, |(i, _)| {
&from_utf8(&string.as_bytes()[..i]).unwrap()
Contributor

For any unwrap() I think it makes sense to put a comment why the unwrap is ok, and/or use .expect
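A minimal sketch of that suggestion, assuming a hypothetical helper `prefix_chars` and using `char_indices` from the standard library as a stand-in for `grapheme_indices(true)` so that the example has no external dependencies:

```rust
use std::str::from_utf8;

/// Returns the first `n` characters of `string`
/// (or all of it when it has `n` characters or fewer).
fn prefix_chars(string: &str, n: usize) -> &str {
    string.char_indices().nth(n).map_or(string, |(i, _)| {
        // `i` was produced by `char_indices`, so it lies on a character
        // boundary and the bytes before it are valid UTF-8.
        from_utf8(&string.as_bytes()[..i])
            .expect("byte index from char_indices is a char boundary")
    })
}
```

The comment records why the call cannot fail, and the `expect` message makes the broken invariant visible if that assumption ever stops holding.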

Contributor

I spent some time poking around in https://docs.rs/unicode-segmentation/1.7.1/unicode_segmentation/trait.UnicodeSegmentation.html#tymethod.graphemes

I couldn't find anything that would let you get back a &str from the grapheme indices. I wonder if it might be possible to use the graphemes() call itself and count up the lengths of the &str that came back

Something like (untested)

let i: usize = string
  .graphemes(true)
  .take(n as usize)
  .map(|s| s.len())
  .sum();

&string[..i]

Or something like that. I am not sure how important this is
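The idea of summing the byte lengths of the leading pieces can be tried with the standard library alone. The sketch below is an assumption-laden stand-in: it iterates over `char`s instead of grapheme clusters, and `prefix_by_len` is a hypothetical name. With the unicode-segmentation crate, the same shape would presumably use `graphemes(true)` and `str::len` in place of `chars()` and `char::len_utf8`.

```rust
/// Returns the first `n` characters of `string` as a borrowed slice,
/// without going through `from_utf8` or allocating a new `String`.
fn prefix_by_len(string: &str, n: usize) -> &str {
    // Sum the UTF-8 byte lengths of the first `n` characters; the total is
    // a byte offset that ends exactly on a character boundary.
    let end: usize = string.chars().take(n).map(char::len_utf8).sum();
    &string[..end]
}
```

Because the offset is built from whole characters, the slice never splits a character, so no `unwrap()` is needed.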

Contributor Author

Let me have a look and see if we can do this.

Contributor Author

@alamb I have refactored just the left function to demonstrate how it could be done:

I think this is nicer so if you agree I can apply to the rest of the code?

Contributor

I like it a lot 👍 Thanks @seddonm1

I am not sure why this clone is needed, but that is a minor point and can be cleaned up later

                    let len = graphemes.clone().count() as i64;

Contributor Author

The borrow checker is identifying that the second use of graphemes is using a moved value without the clone. If you can help resolve that it would help me learn.

use of moved value: `graphemes` value used here after move

Contributor

I see -- I took a closer look and now see that graphemes is an iterator, so we are clone()-ing an iterator, which feels right to me: the code effectively has to scan the input once to count how many graphemes it contains and then scan it again to create the output.

Sorry for my confusion!
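To illustrate why the clone is needed, here is a minimal sketch of a `left`-style function, assuming PostgreSQL semantics (a negative `n` means "all but the last |n| characters"). It is not the refactored code from this PR: it uses `chars()` from the standard library as a stand-in for the `graphemes` iterator.

```rust
/// Returns the first `n` characters of `string`; when `n` is negative,
/// returns all but the last `|n|` characters.
fn left(string: &str, n: i64) -> String {
    let chars = string.chars();
    if n >= 0 {
        chars.take(n as usize).collect()
    } else {
        // `count()` consumes the iterator it is called on, so count a clone
        // and keep the original for the second pass. Cloning copies only the
        // iterator state, not the underlying string data.
        let len = chars.clone().count() as i64;
        let keep = (len + n).max(0) as usize;
        chars.take(keep).collect()
    }
}
```

Without the `clone()`, the call to `count()` would move `chars`, and the later `take` would fail to compile with the "use of moved value" error quoted above.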


/// Replaces substring(s) matching a POSIX regular expression
/// regexp_replace('Thomas', '.[mN]a.', 'M') = 'ThM'
pub fn regexp_replace<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<ArrayRef> {
Contributor

I think at some point this could be in arrow (and more parts of your string contributions).

Contributor

Yeah -- it might be worth a discussion on the mailing list of what functions belong in datafusion and what are more "core" and broadly applicable to bring them into the core arrow kernels

@Dandandan
Contributor

Truly looks great @seddonm1 I added some comments

@alamb
Contributor

alamb commented Mar 8, 2021

Integration test failure looked like https://issues.apache.org/jira/browse/ARROW-11908 so I retriggered the CI checks

Contributor

@alamb alamb left a comment

I think this looks great, as always, @seddonm1. 💯 🥇 for the test cases and feature flags.

@Dandandan and I had some style / improvement suggestions, but I think they could be done as follow-on PRs as well. I'll wait another day or so before merging this PR in case you want to respond; otherwise I'll plan to merge tomorrow.

👍


Contributor

@alamb alamb left a comment

This is looking great @seddonm1 -- I think it is mergeable as soon as you are ready. Just let me know

@seddonm1
Contributor Author

seddonm1 commented Mar 9, 2021

Thanks @alamb. I will make some more tweaks to these unicode functions to try to remove std::str::from_utf8 then let you know.

@seddonm1
Contributor Author

seddonm1 commented Mar 9, 2021

@alamb I have removed the std::str::from_utf8. We should be good to go. 🥳

@seddonm1
Contributor Author

@alamb I'm hitting some CI/CD issues that have nothing to do with this change.

Testing file auth:basic_proto
==========================================================
################# FAILURES #################
FAILED TEST: auth:basic_proto Rust producing,  C++ consuming

1 failures
Traceback (most recent call last):
  File "/arrow/dev/archery/archery/integration/util.py", line 139, in run_cmd
    output = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
  File "/opt/conda/envs/arrow/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/opt/conda/envs/arrow/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/build/cpp/debug/flight-test-integration-client', '-host', 'localhost', '-port=43311', '-scenario', 'auth:basic_proto']' died with <Signals.SIGABRT: 6>.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/arrow/dev/archery/archery/integration/runner.py", line 308, in _run_flight_test_case
    consumer.flight_request(port, **client_args)
  File "/arrow/dev/archery/archery/integration/tester_cpp.py", line 116, in flight_request
    run_cmd(cmd)
  File "/arrow/dev/archery/archery/integration/util.py", line 148, in run_cmd
    raise RuntimeError(sio.getvalue())
RuntimeError: Command failed: /build/cpp/debug/flight-test-integration-client -host localhost -port=43311 -scenario auth:basic_proto
With output:
--------------
-- Arrow Fatal Error --
Invalid: Expected UNAUTHENTICATED but got Unavailable

@alamb
Contributor

alamb commented Mar 10, 2021

The integration test failure looks like https://issues.apache.org/jira/browse/ARROW-11908 so I am merging this one in. Thanks again @seddonm1!
