Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In InfluxDB IOx, we have some users that query the data with simple regex expressions that don't really need a regex but (I guess) regexes are used for convenience or technical reasons (e.g. auto-generated expressions). For "regex match" and "regex not match", we have the following cases:
| Case | Example | Description | Logical Rewrite (for "match") |
|---|---|---|---|
| Empty | '' | Match all | col IS NOT NULL |
| OR-chain | 'foo\|bar\|baz' | Any of | (col = 'foo') OR (col = 'bar') OR (col = 'baz') or col IN ('foo', 'bar', 'baz') |
Expressing these predicates as regexes instead of the simple rewritten form has several performance consequences. First, these regex predicates are NOT considered for pruning (because how would you prune an arbitrary regex?).
Finally they are NOT pushed down into ParquetExec.
Describe the solution you'd like
Transform simple regex expressions into their equivalent logical expression.
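As a rough illustration of the two cases in the table, here is a minimal Python sketch that turns a pattern into an equivalent SQL predicate string, or gives up when the pattern is not obviously simple. It is not DataFusion code: the function name `rewrite_regex_match`, the metacharacter set, and the string-based output are all assumptions made for illustration. It also assumes, as the table does, that each alternative has to equal the whole column value; if the regex dialect matches substrings unless anchored, the rewrite would need to account for that.

```python
from typing import Optional

# Characters treated as regex syntax in this sketch (an assumed, conservative
# set). A branch containing any of them is not considered a plain literal.
_REGEX_META = set(r".^$*+?()[]{}\|")


def rewrite_regex_match(col: str, pattern: str) -> Optional[str]:
    """Return a SQL predicate equivalent to `col` matching `pattern`,
    or None if the pattern is not one of the simple cases handled here.

    Hypothetical helper; only the "match" direction from the table is covered.
    """
    # Case "Empty": the empty pattern matches every non-NULL value.
    if pattern == "":
        return f"{col} IS NOT NULL"

    # Case "OR-chain": split on '|' and require every branch to be a
    # non-empty literal without regex metacharacters.
    branches = pattern.split("|")
    if any(b == "" or (set(b) & _REGEX_META) for b in branches):
        return None  # too complex (or ambiguous): keep the regex as-is

    # Quote each literal for SQL, doubling embedded single quotes.
    literals = ", ".join("'" + b.replace("'", "''") + "'" for b in branches)
    return f"{col} IN ({literals})"


print(rewrite_regex_match("col", ""))             # col IS NOT NULL
print(rewrite_regex_match("col", "foo|bar|baz"))  # col IN ('foo', 'bar', 'baz')
print(rewrite_regex_match("col", "fo+|bar"))      # None
```

Returning `None` for anything not recognized keeps the rewrite conservative: a pattern that is left as a regex is merely slower, while a wrong rewrite would change query results.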
Describe alternatives you've considered
Extend the pruning expression framework and ParquetExec to handle regexes. However, this seems unnecessarily complex and maybe even counterproductive, since regexes per se can be really expensive and complex to evaluate.
Additional context
-