feat: support Spark expression regexp_extract #4146
Open
andygrove wants to merge 3 commits into apache:main
Conversation
Implement `regexp_extract` using the Rust `regex` crate. The expression is marked Incompatible because the Rust regex engine differs from the Java engine that Spark uses; users must opt in via `spark.comet.expression.RegExpExtract.allowIncompatible=true`.
Audit follow-ups:

- Align Rust error messages with Spark's `INVALID_PARAMETER_VALUE` templates so `expect_error` substrings can match both engines.
- Override `getUnsupportedReasons` in `CometRegExpExtract` so the non-literal pattern and non-literal idx reasons are picked up by the Compatibility Guide generator.
- Add Comet SQL test cases for: NULL pattern and NULL idx, idx=0 with no capture groups, multibyte / Unicode subjects, idx out of range, pattern with no groups + idx>=1, negative idx, invalid regex syntax, and a Java-only lookahead that Rust regex rejects (marked `ignore`).
- Add fallback test cases for non-literal pattern and non-literal idx.
- Mark the expression supported in `spark_expressions_support.md` with per-version audit notes.
Address review feedback:

- Make `extract_array` build a `GenericStringBuilder<O>` matching the input offset size so a `LargeUtf8` subject no longer silently outputs `Utf8` (avoids potential i32-offset overflow on >2GB inputs).
- Inline group extraction so the per-row `String` allocation is gone; the only remaining `to_string` is on the rare scalar code path.
- Replace the manual append-null loop in `null_result` with `StringArray::new_null(n)`.
- Borrow the pattern as `&str` instead of cloning it before calling `Regex::new`.
- Pass `failOnError = false` to the proto, matching `CometStringSplit`. The Rust UDF does not branch on this flag, so `true` was misleading.
Which issue does this PR close?
N/A
Rationale for this change
`regexp_extract` is a common Spark SQL string function used to pull substrings out of input strings via regex capture groups. Adding native support lets queries that use it stay in Comet instead of falling back to Spark.

What changes are included in this PR?
- `spark_regexp_extract` in `native/spark-expr/src/string_funcs/regexp_extract.rs`, backed by the `regex` crate. Handles Utf8 and LargeUtf8 inputs (array and scalar); idx defaults to 1, idx=0 returns the whole match, no match returns the empty string, an unmatched optional group returns the empty string, null input returns null, and an out-of-range idx returns an execution error.
- `regexp_extract` registered in `comet_scalar_funcs.rs`.
- `CometRegExpExtract` serde mapping `RegExpExtract` to the native UDF. Reported as `Incompatible` because the Rust regex engine has different semantics from Java's regex engine (POSIX classes, look-around, possessive quantifiers, etc.). Users opt in via `spark.comet.expression.RegExpExtract.allowIncompatible=true`. Falls back when the pattern or `idx` is non-literal.

How are these changes tested?
- Unit tests in `regexp_extract.rs` covering basic group extraction, idx=0/default idx, null subject, null pattern, unmatched optional group, out-of-range idx, negative idx, and invalid regex.
- `regexp_extract.sql` verifies that the expression falls back to Spark by default.
- `regexp_extract_enabled.sql` exercises the happy path with `allowIncompatible=true`, including default and explicit idx, idx=0, no-match, null input, optional unmatched groups, anchors, and an all-literal expression. Run under a `ConfigMatrix` for both dictionary-encoded and plain Parquet input.