Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
38 commits
Select commit Hold shift + click to select a range
4710e2e
Add split_pattern function
thisisnic Apr 28, 2021
33a4447
Remove references to split_pattern arguments in favour of strsplit ar…
thisisnic Apr 28, 2021
a370b9a
Remove reference to split_pattern from pkgdown
thisisnic Apr 28, 2021
4677f31
Update tests for strsplit
thisisnic Apr 28, 2021
e1aa64f
Remove test for strsplit
thisisnic Apr 30, 2021
a3f0738
Remove s3 methods for strsplit
thisisnic Apr 30, 2021
43120a6
Remove strsplit from compute.R
thisisnic Apr 30, 2021
0ecbab4
Implement strsplit
thisisnic Apr 30, 2021
5615486
Fix typo
thisisnic Apr 30, 2021
9ceff3c
Add extra args to split_pattern
thisisnic Apr 30, 2021
d140155
Whitespace fix
thisisnic Apr 30, 2021
8f5dcc6
Add in 'any' to regex check
thisisnic Apr 30, 2021
6a7b756
Assert split length and shotern regex metachar search
thisisnic Apr 30, 2021
983cb9c
Ignore Perl
thisisnic Apr 30, 2021
48e8917
Add bdingin for str_split, and add tests
thisisnic May 4, 2021
45f36c4
Update test to use mutate
thisisnic May 4, 2021
abeef7c
Update r/R/dplyr.R
thisisnic May 4, 2021
1f77d66
use is.string instead of length == 1
thisisnic May 4, 2021
2448450
Capitalise
thisisnic May 4, 2021
14cf037
Update r/R/dplyr.R
thisisnic May 4, 2021
fc743e7
Update spacing
thisisnic May 4, 2021
22b9256
Add spacing
thisisnic May 4, 2021
51d250f
Use get_stringr_pattern
thisisnic May 6, 2021
a521405
Document get_stringr_pattern_options function, change default value o…
thisisnic May 6, 2021
b758b55
Fix minor style issues, add extra tests
thisisnic May 6, 2021
3d4f9ce
Reorganise warning/error generating tests into one section for consis…
thisisnic May 6, 2021
7e160ca
Merge branch 'master' into ARROW-11515-strsplit
ianmcook May 6, 2021
f8cfff5
Fix missing }
ianmcook May 6, 2021
3b66f4f
Re-lint
ianmcook May 6, 2021
245c99f
Make reverse optional in split_pattern options
ianmcook May 6, 2021
b3884f2
Add handling for utf8_split_whitespace, ascii_split_whitespace options
ianmcook May 6, 2021
8dfc213
Tweak str_split() args for consistency with stringr
ianmcook May 6, 2021
cbc0e78
Add and reorganize tests
ianmcook May 6, 2021
4b96d07
Add missing call. = FALSE
ianmcook May 6, 2021
a567ce1
Tweak get_stringr_pattern_options() docs
ianmcook May 6, 2021
0a30a69
Build new internal functions docs
ianmcook May 6, 2021
25ec2f8
Add unrelated note about 0-based indices in call_function() docs
ianmcook May 6, 2021
1504200
Test arrow_ascii_split_whitespace, arrow_utf8_split_whitespace
ianmcook May 7, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions r/R/compute.R
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,8 @@
#' `RecordBatch`, or `Table`.
#' @param args list arguments as an alternative to specifying in `...`
#' @param options named list of C++ function options.
#' @details When passing indices in `...`, `args`, or `options`, express them as
#' 0-based integers (consistent with C++).
#' @return An `Array`, `ChunkedArray`, `Scalar`, `RecordBatch`, or `Table`, whatever the compute function results in.
#' @seealso [Arrow C++ documentation](https://arrow.apache.org/docs/cpp/compute.html) for the functions and their respective options.
#' @examples
Expand Down
66 changes: 62 additions & 4 deletions r/R/dplyr.R
Original file line number Diff line number Diff line change
Expand Up @@ -476,6 +476,8 @@ build_function_list <- function(FUN) {
gsub = arrow_r_string_replace_function(FUN, -1L),
str_replace = arrow_stringr_string_replace_function(FUN, 1L),
str_replace_all = arrow_stringr_string_replace_function(FUN, -1L),
strsplit = arrow_r_string_split_function(FUN),
str_split = arrow_stringr_string_split_function(FUN),
between = function(x, left, right) {
x >= left & x <= right
},
Expand Down Expand Up @@ -539,6 +541,44 @@ arrow_stringr_string_replace_function <- function(FUN, max_replacements) {
}
}

arrow_r_string_split_function <- function(FUN, reverse = FALSE, max_splits = -1) {
function(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE) {

assert_that(is.string(split))

# if !fixed but no regex metachars in split pattern, allow to proceed as split isn't regex
if (!fixed && contains_regex(split)) {
stop("Regular expression matching not supported in strsplit for Arrow", call. = FALSE)
}
if (fixed && perl) {
warning("Argument 'perl = TRUE' will be ignored", call. = FALSE)
}
FUN("split_pattern", x, options = list(pattern = split, reverse = reverse, max_splits = max_splits))
}
}

arrow_stringr_string_split_function <- function(FUN, reverse = FALSE) {
function(string, pattern, n = Inf, simplify = FALSE) {
opts <- get_stringr_pattern_options(enexpr(pattern))
if (!opts$fixed && contains_regex(opts$pattern)) {
stop("Regular expression matching not supported in str_split() for Arrow", call. = FALSE)
}
if (opts$ignore_case) {
stop("Case-insensitive string splitting not supported in Arrow", call. = FALSE)
}
if (n == 0) {
stop("Splitting strings into zero parts not supported in Arrow" , call. = FALSE)
}
if (identical(n, Inf)) {
n <- 0L
}
if (simplify) {
warning("Argument 'simplify = TRUE' will be ignored", call. = FALSE)
}
FUN("split_pattern", string, options = list(pattern = opts$pattern, reverse = reverse, max_splits = n - 1L))
}
}

# format `pattern` as needed for case insensitivity and literal matching by RE2
format_string_pattern <- function(pattern, ignore.case, fixed) {
# Arrow lacks native support for case-insensitive literal string matching and
Expand Down Expand Up @@ -571,9 +611,18 @@ format_string_replacement <- function(replacement, ignore.case, fixed) {
replacement
}

# this function assigns definitions for the stringr pattern modifier functions
# (fixed, regex, etc.) in itself, and uses them to evaluate the quoted
# expression `pattern`
#' Get `stringr` pattern options
#'
#' This function assigns definitions for the `stringr` pattern modifier
#' functions (`fixed()`, `regex()`, etc.) inside itself, and uses them to
#' evaluate the quoted expression `pattern`, returning a list that is used
#' to control pattern matching behavior in internal `arrow` functions.
#'
#' @param pattern Unevaluated expression containing a call to a `stringr`
#' pattern modifier function
#'
#' @return List containing elements `pattern`, `fixed`, and `ignore_case`
#' @keywords internal
get_stringr_pattern_options <- function(pattern) {
fixed <- function(pattern, ignore_case = FALSE, ...) {
check_dots(...)
Expand Down Expand Up @@ -605,7 +654,7 @@ get_stringr_pattern_options <- function(pattern) {
}
ensure_opts <- function(opts) {
if (is.character(opts)) {
opts <- list(pattern = opts, fixed = TRUE, ignore_case = FALSE)
opts <- list(pattern = opts, fixed = FALSE, ignore_case = FALSE)
}
opts
}
Expand Down Expand Up @@ -1097,3 +1146,12 @@ not_implemented_for_dataset <- function(method) {
call. = FALSE
)
}

#' Does this string contain regex metacharacters?
#'
#' @param string String to be tested
#' @keywords internal
#' @return Logical: does `string` contain regex metacharacters?
contains_regex <- function(string) {
grepl("[.\\|()[{^$*+?]", string)
}
4 changes: 4 additions & 0 deletions r/man/call_function.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

18 changes: 18 additions & 0 deletions r/man/contains_regex.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

22 changes: 22 additions & 0 deletions r/man/get_stringr_pattern_options.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

27 changes: 27 additions & 0 deletions r/src/compute.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -233,6 +233,33 @@ std::shared_ptr<arrow::compute::FunctionOptions> make_compute_options(
max_replacements);
}

if (func_name == "split_pattern") {
using Options = arrow::compute::SplitPatternOptions;
int64_t max_splits = -1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than setting defaults manually, can you do auto out = std::make_shared<Options>(Options::Defaults()); and then just set them if they're provided? (We do this above for other types.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC this didn't work when I tried it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can confirm that there is no Defaults() method in arrow::compute::SplitPatternOptions. Apparently some of the options classes just don't have one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They probably should be added in the C++ library, can you do that and/or make another JIRA?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thisisnic could you make a Jira for this please? Thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member Author

@thisisnic thisisnic May 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this, @nealrichardson and @ianmcook , it may make sense for arrow::compute::SplitPatternOptions not to have defaults, given that the pattern is one of the options - it might not make sense to pre-specify a pattern to split the string on? Neither of the R functions strsplit nor stringr::str_split have defaults for this argument.

Could one perhaps make a case for pattern moving from being an option to instead being a second input, similarly to how arrow::compute::filter and arrow::compute::take operate? At that point, having defaults for SplitPatternOptions feels more natural.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That all makes good sense to me but I'm not up to speed on what design conventions we follow in the C++ compute functions. Maybe it'd be best to start a Zulip thread to ask folks about this (Joris, Antonie, Maarten Breddels) to get a better idea of the best change to suggest (if any).

if (!Rf_isNull(options["max_splits"])) {
max_splits = cpp11::as_cpp<int64_t>(options["max_splits"]);
}
bool reverse = false;
if (!Rf_isNull(options["reverse"])) {
reverse = cpp11::as_cpp<bool>(options["reverse"]);
}
return std::make_shared<Options>(cpp11::as_cpp<std::string>(options["pattern"]),
max_splits, reverse);
}

if (func_name == "utf8_split_whitespace" || func_name == "ascii_split_whitespace") {
using Options = arrow::compute::SplitOptions;
int64_t max_splits = -1;
if (!Rf_isNull(options["max_splits"])) {
max_splits = cpp11::as_cpp<int64_t>(options["max_splits"]);
}
bool reverse = false;
if (!Rf_isNull(options["reverse"])) {
reverse = cpp11::as_cpp<bool>(options["reverse"]);
}
return std::make_shared<Options>(max_splits, reverse);
}

if (func_name == "variance" || func_name == "stddev") {
using Options = arrow::compute::VarianceOptions;
return std::make_shared<Options>(cpp11::as_cpp<int64_t>(options["ddof"]));
Expand Down
Loading