Rewrite simple regex expressions #4370

@crepererum

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In InfluxDB IOx, some users query data with simple regex expressions that don't actually need a regex; I guess regexes are used for convenience or for technical reasons (e.g. auto-generated expressions). For "regex match" and "regex not match", we see the following cases:

| Case | Example | Description | Logical Rewrite (for "match") |
|---|---|---|---|
| Empty | `''` | Match all | `col IS NOT NULL` |
| OR-chain | `'foo\|bar\|baz'` | Any of | `(col = 'foo') OR (col = 'bar') OR (col = 'baz')`<br>`col IN ('foo', 'bar', 'baz')` |
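To make the two cases concrete, here is a minimal sketch of what such a rewrite could look like. It is not DataFusion code: `SimpleExpr` and `rewrite_regex_match` are hypothetical names, and the expression type is a tiny stand-in for a real logical expression. It also assumes, as the rewrites listed above do, that the regex is matched against the whole value; with unanchored (substring) matching, `'foo'` would also match `'foobar'` and a plain equality would not be equivalent.

```rust
/// Hypothetical, simplified logical expression (not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
pub enum SimpleExpr {
    /// `col IS NOT NULL`
    IsNotNull(String),
    /// `col = 'value'`
    Eq(String, String),
    /// `col IN ('a', 'b', ...)`
    InList(String, Vec<String>),
}

/// Characters that make a pattern "not simple" for this sketch.
/// `|` is handled separately as the OR-chain separator.
const META: &[char] = &[
    '\\', '.', '+', '*', '?', '(', ')', '[', ']', '{', '}', '^', '$',
];

/// Try to rewrite `column ~ pattern` into an equivalent simple expression.
/// Returns `None` when the pattern is not one of the simple cases, so the
/// caller can keep the original regex predicate.
/// ASSUMPTION: the regex match is anchored to the whole value.
pub fn rewrite_regex_match(column: &str, pattern: &str) -> Option<SimpleExpr> {
    // Case "Empty": matches every non-NULL value.
    if pattern.is_empty() {
        return Some(SimpleExpr::IsNotNull(column.to_string()));
    }

    // Anything containing a regex metacharacter is left alone.
    if pattern.chars().any(|c| META.contains(&c)) {
        return None;
    }

    // Case "OR-chain": split on `|` into literal alternatives.
    let alternatives: Vec<String> = pattern.split('|').map(String::from).collect();

    // An empty alternative (e.g. `foo|`) matches everything; do not rewrite.
    if alternatives.iter().any(|a| a.is_empty()) {
        return None;
    }

    if alternatives.len() == 1 {
        Some(SimpleExpr::Eq(column.to_string(), alternatives[0].clone()))
    } else {
        Some(SimpleExpr::InList(column.to_string(), alternatives))
    }
}
```

For "regex not match", the same function could be reused and the result negated (`IS NULL`, `!=`, `NOT IN`), though NULL handling would need care in a real implementation.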

Expressing these predicates as regexes instead of their simple rewritten form has several performance consequences. Regex predicates are NOT considered for pruning (how would you prune on an arbitrary regex?):

https://github.com/apache/arrow-datafusion/blob/e1204a5bf72c119123404463befb716adbdcff25/datafusion/core/src/physical_optimizer/pruning.rs#L818-L871

Finally they are NOT pushed down into ParquetExec.
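To illustrate why the rewritten form helps with pruning, here is a rough sketch of min/max-based pruning for equality and IN-list predicates. This is not the logic in `pruning.rs`; `ColumnStats`, `may_contain` and `may_contain_any` are hypothetical names, and the statistics are assumed to be plain string min/max values per container (e.g. a file or row group).

```rust
/// Hypothetical per-container statistics for a string column.
pub struct ColumnStats {
    pub min: String,
    pub max: String,
}

/// `col = value` can only be true somewhere in the container if
/// `min <= value <= max`. If this returns false, the container can be skipped.
pub fn may_contain(stats: &ColumnStats, value: &str) -> bool {
    stats.min.as_str() <= value && value <= stats.max.as_str()
}

/// `col IN (values)` can only be true if at least one value falls
/// inside the container's [min, max] range.
pub fn may_contain_any(stats: &ColumnStats, values: &[&str]) -> bool {
    values.iter().any(|v| may_contain(stats, v))
}
```

No comparable range check exists for an arbitrary regex, which is why the regex form has to be evaluated against every container.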

Describe the solution you'd like
Transform simple regex expressions into their equivalent logical expression.

Describe alternatives you've considered
Extend the pruning expression framework and ParquetExec to handle regexes. However, this seems unnecessarily complex and maybe even counterproductive, since regexes can be really expensive and complex to evaluate.

Additional context
-
