Skip to content

perf: Apply logical regexp optimizations to Utf8View and LargeUtf8 inputs#20581

Merged
alamb merged 11 commits intoapache:mainfrom
petern48:regexp_optim_utf8view
Mar 3, 2026
Merged

perf: Apply logical regexp optimizations to Utf8View and LargeUtf8 inputs#20581
alamb merged 11 commits intoapache:mainfrom
petern48:regexp_optim_utf8view

Conversation

@petern48
Copy link
Copy Markdown
Member

@petern48 petern48 commented Feb 26, 2026

Which issue does this PR close?

Rationale for this change

I ran into a bug that prevented some regexp optimizations from working that were introduced in #15299. After #16290, some SQL types were updated to utf8view. As part of that PR, some expected query plans in sqllogictest were updated to expect the unoptimized version.

I need this fixed to avoid additional test failures while implementing a new regexp optimization for #20579.

What changes are included in this PR?

  • Add support for Utf8View and LargeUtf8 in regex.rs.
  • Properly return Transformed::no() on cases when the plan isn't modified (previously, it was always returning Transformed::yes()
  • Updates the tests back to expect the optimized query plans

Are these changes tested?

Fixed existing tests that previously weren't working. Now they reflect the optimization being reflected properly.

Are there any user-facing changes?

No. Just applying the optimizations to more cases.

@github-actions github-actions Bot added optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Feb 26, 2026
Comment thread datafusion/optimizer/src/simplify_expressions/regex.rs Outdated
@petern48 petern48 changed the title bug: Apply logical regexp optimizations to utf8view inputs bug: Apply logical regexp optimizations to Utf8View and LargeUtf8 inputs Feb 26, 2026
@petern48 petern48 force-pushed the regexp_optim_utf8view branch from 1c8836b to e1f661b Compare February 26, 2026 21:41
@petern48 petern48 changed the title bug: Apply logical regexp optimizations to Utf8View and LargeUtf8 inputs perf: Apply logical regexp optimizations to Utf8View and LargeUtf8 inputs Feb 26, 2026
@petern48 petern48 marked this pull request as ready for review February 26, 2026 22:16
@alamb alamb added the performance Make DataFusion faster label Feb 27, 2026
Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @petern48 -- this looks like an improvement to me

I do think we should add documentation for the newly added pub API (or I have some ideas on how to improve it too)

----
logical_plan
01)Projection: test.column1_utf8view ~ Utf8View("an") AS c1
01)Projection: test.column1_utf8view LIKE Utf8View("%an%") AS c1
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is certainly a better plan

}

fn as_string_scalar(expr: &Expr) -> Option<(DataType, &Option<String>)> {
pub fn as_string_scalar(expr: &Expr) -> Option<(DataType, &Option<String>)> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are going to make this a pub API, we should at least document them, and maybe we can clean it up bit?

Maybe the simplest thing would be to just make it pub (crate) rather than pub

What do you think about something like

pub enum StringScalar<'a> {
  Utf8(&'a ScalarValue, &'str),
  LargeUtf8(&'a ScalarValue, &'str),
  Utf8View(&'a ScalarValue, &'str),
}

impl StringScalar {
  fn try_from_scalar(scalar: &ScalarValue) -> Self { 
...
  }
 
  fn to_scalar(&self, val: &str) -> Expr {
}

That would:

  1. Avoid creating DataTypes
  2. Put some better documentation in place
  3. Encapsulate the logic a bit more

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! I ended up with a slightly different interface. The main differences between what you proposed and what I ended up with are:

  • Utf8(&'a ScalarValue, &'str) -> Utf8(&'a ScalarValue) We can retrieve the str from the ScalarValue
  • Renamed to_scalar to to_expr since it returns an Expr
  • Added as_str() for convenience
  • Added try_from_expr(). I found that using this was cleaner because both call sites provide Exprs, and handle the None cases in the same way. It mainly just collapsed some redundant error handling.
    • I still added try_from_scalar for possible future use, but I left it as private since it isn't used anywhere atm (other than try_from_expr calling it).
impl<'a> StringScalar<'a> {
    pub(crate) fn try_from_expr(expr: &'a Expr) -> Option<Self> { (calls try_from_scalar) }

      // (not actually used anywhere)
    fn try_from_scalar(scalar: &'a ScalarValue) -> Option<Self> { ... }

    pub(crate) fn as_str(&self) -> Option<&'a str> { ... }

    pub(crate) fn to_expr(&self, val: &str) -> Expr { ... }
}

Let me know what you think

Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love it -- thank you @petern48

petern48 added a commit to petern48/datafusion that referenced this pull request Mar 1, 2026
@alamb alamb added this pull request to the merge queue Mar 3, 2026
Merged via the queue into apache:main with commit d1a3058 Mar 3, 2026
32 checks passed
@petern48 petern48 deleted the regexp_optim_utf8view branch March 3, 2026 15:21
de-bgunter pushed a commit to de-bgunter/datafusion that referenced this pull request Mar 24, 2026
…puts (apache#20581)

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes apache#123` indicates that this PR will close issue apache#123.
-->

- Closes apache#20580

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

I ran into a bug that prevented some regexp optimizations from working
that were introduced in apache#15299.
After apache#16290, some SQL types
were updated to `utf8view`. As part of that PR, some expected query
plans in sqllogictest were updated to expect the unoptimized version.

I need this fixed to avoid additional test failures while implementing a
new regexp optimization for
apache#20579.

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->
- Add support for Utf8View and LargeUtf8 in `regex.rs`.
- Properly return `Transformed::no()` on cases when the plan isn't
modified (previously, it was always returning `Transformed::yes()`
- Updates the tests back to expect the optimized query plans

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

Fixed existing tests that previously weren't working. Now they reflect
the optimization being reflected properly.

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
No. Just applying the optimizations to more cases.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

optimizer Optimizer rules performance Make DataFusion faster sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: regexp simplify optimizations don't work on utf8view

2 participants