
Improve worst-case performance of LIKE filters by 20x #16153

Merged
gianm merged 5 commits into apache:master from twilliamson:timw/like-this
Apr 24, 2024

@twilliamson
Contributor

@twilliamson twilliamson commented Mar 18, 2024

Description

LikeDimFilter was compiling the LIKE clause down to a java.util.regex.Pattern. Unfortunately, even seemingly simple regexes can lead to catastrophic backtracking. In particular, something as simple as a few % wildcards can end up exploding the time complexity. This MR implements a simple greedy algorithm that avoids the catastrophic backtracking by converting the LIKE pattern into a list of java.util.regex.Pattern, splitting on the % wildcard. The resulting sub-patterns do no backtracking, and a simple greedy loop uses Matcher.find() to progress through the string.
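To make the failure mode concrete, here is a minimal sketch of the kind of single-regex translation described above. It is not the Druid code: the class and method names are hypothetical, escape characters are ignored, and the use of DOTALL is an assumption. It only shows how four adjacent % wildcards turn into four adjacent `.*` groups, which is the shape that backtracks badly when nothing matches.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not the LikeDimFilter implementation.
final class SingleRegexLikeSketch {
    /** Translate a LIKE pattern into one regex: % becomes .*, _ becomes ., the rest is quoted. */
    static Pattern toSingleRegex(String like) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (char c : like.toCharArray()) {
            if (c == '%' || c == '_') {
                flushLiteral(literal, regex);
                regex.append(c == '%' ? ".*" : ".");
            } else {
                literal.append(c);
            }
        }
        flushLiteral(literal, regex);
        // Assumption: DOTALL so that wildcards also match line terminators.
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    /** Append the pending literal run as a quoted regex fragment, then clear it. */
    private static void flushLiteral(StringBuilder literal, StringBuilder regex) {
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
            literal.setLength(0);
        }
    }

    public static void main(String[] args) {
        // Prints .*.*.*.*\Qx\E : four greedy groups that must all be retried on a miss.
        System.out.println(toSingleRegex("%%%%x").pattern());
        // The whole input must match, so matches() is used rather than find().
        System.out.println(toSingleRegex("%ba_").matcher("foo bar baz").matches());
    }
}
```

With this translation the result is still correct; the cost is only in how many ways the engine can split a non-matching string among the `.*` groups.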

Running an updated version of the LikeFilterBenchmark with Java 11 on a t2.xlarge instance showed at least a 1.15x speed up for a simple "contains" query (%50%), and more than a 20x speed up for a "killer" query with four wildcards but no matches (%%%%x). The benchmark uses short strings: cases with longer strings should benefit more.

Note that the REGEX operator still suffers from the same potentially-catastrophic runtimes. Using a better library than the built-in java.util.regex.Pattern (e.g., joni) would be a good idea to avoid accidental — or intentional — DoSing.

Benchmark                                      (cardinality)  Mode  Cnt  Before Score       Error      After Score     Error  Units  Before/After
LikeFilterBenchmark.matchBoundPrefix                    1000  avgt   10         5.410 ±     0.010          5.582 ±     0.004  us/op         0.97x
LikeFilterBenchmark.matchBoundPrefix                  100000  avgt   10       140.920 ±     0.306        141.082 ±     0.391  us/op         1.00x
LikeFilterBenchmark.matchBoundPrefix                 1000000  avgt   10      1082.762 ±     1.070       1171.407 ±     1.628  us/op         0.92x
LikeFilterBenchmark.matchLikeComplexContains            1000  avgt   10       221.572 ±     0.228        183.742 ±     0.210  us/op         1.21x
LikeFilterBenchmark.matchLikeComplexContains          100000  avgt   10     25461.362 ±    21.481      17373.828 ±    42.577  us/op         1.47x
LikeFilterBenchmark.matchLikeComplexContains         1000000  avgt   10    221075.917 ±   919.238     177454.683 ±   506.420  us/op         1.25x
LikeFilterBenchmark.matchLikeContains                   1000  avgt   10       283.015 ±     0.219        218.835 ±     3.126  us/op         1.29x
LikeFilterBenchmark.matchLikeContains                 100000  avgt   10     30202.910 ±    32.697      26713.488 ±    49.525  us/op         1.13x
LikeFilterBenchmark.matchLikeContains                1000000  avgt   10    284661.411 ±   130.324     243381.857 ±   540.143  us/op         1.17x
LikeFilterBenchmark.matchLikeEquals                     1000  avgt   10         0.386 ±     0.001          0.380 ±     0.001  us/op         1.02x
LikeFilterBenchmark.matchLikeEquals                   100000  avgt   10         0.670 ±     0.001          0.705 ±     0.002  us/op         0.95x
LikeFilterBenchmark.matchLikeEquals                  1000000  avgt   10         0.839 ±     0.001          0.796 ±     0.001  us/op         1.05x
LikeFilterBenchmark.matchLikeKiller                     1000  avgt   10      4882.099 ±     7.953        170.142 ±     0.494  us/op        28.69x
LikeFilterBenchmark.matchLikeKiller                   100000  avgt   10    524122.010 ±   390.170      19461.637 ±   117.090  us/op        26.93x
LikeFilterBenchmark.matchLikeKiller                  1000000  avgt   10   5121795.377 ±  4176.052     181162.978 ±   368.443  us/op        28.27x
LikeFilterBenchmark.matchLikePrefix                     1000  avgt   10         5.708 ±     0.005          5.677 ±     0.011  us/op         1.01x
LikeFilterBenchmark.matchLikePrefix                   100000  avgt   10       141.853 ±     0.554        108.313 ±     0.330  us/op         1.31x
LikeFilterBenchmark.matchLikePrefix                  1000000  avgt   10      1199.148 ±     1.298       1153.297 ±     1.575  us/op         1.04x
LikeFilterBenchmark.matchLikeSuffix                     1000  avgt   10       256.020 ±     0.283        196.339 ±     0.564  us/op         1.30x
LikeFilterBenchmark.matchLikeSuffix                   100000  avgt   10     29917.931 ±    28.218      21450.997 ±    20.341  us/op         1.39x
LikeFilterBenchmark.matchLikeSuffix                  1000000  avgt   10    241225.193 ±   465.824     194034.292 ±   362.312  us/op         1.24x
LikeFilterBenchmark.matchRegexComplexContains           1000  avgt   10       119.597 ±     0.635        135.550 ±     0.697  us/op         0.88x
LikeFilterBenchmark.matchRegexComplexContains         100000  avgt   10     13089.670 ±    13.738      13766.712 ±    12.802  us/op         0.95x
LikeFilterBenchmark.matchRegexComplexContains        1000000  avgt   10    130822.830 ±  1624.048     131076.029 ±  1636.811  us/op         1.00x
LikeFilterBenchmark.matchRegexContains                  1000  avgt   10       573.273 ±     0.421        615.399 ±     0.633  us/op         0.93x
LikeFilterBenchmark.matchRegexContains                100000  avgt   10     57259.313 ±   162.747      62900.380 ±    44.746  us/op         0.91x
LikeFilterBenchmark.matchRegexContains               1000000  avgt   10    571335.768 ±  2822.776     542536.982 ±   780.290  us/op         1.05x
LikeFilterBenchmark.matchRegexKiller                    1000  avgt   10     11525.499 ±     8.741      11061.791 ±    21.746  us/op         1.04x
LikeFilterBenchmark.matchRegexKiller                  100000  avgt   10   1170414.723 ±   766.160    1144437.291 ±   886.263  us/op         1.02x
LikeFilterBenchmark.matchRegexKiller                 1000000  avgt   10  11507668.302 ± 11318.176   10381620.014 ± 10707.974  us/op         1.11x
LikeFilterBenchmark.matchRegexPrefix                    1000  avgt   10       156.460 ±     0.097        155.217 ±     0.431  us/op         1.01x
LikeFilterBenchmark.matchRegexPrefix                  100000  avgt   10     15056.491 ±    23.906      15508.965 ±   763.976  us/op         0.97x
LikeFilterBenchmark.matchRegexPrefix                 1000000  avgt   10    154416.563 ±   473.108     153737.912 ±   273.347  us/op         1.00x
LikeFilterBenchmark.matchRegexSuffix                    1000  avgt   10       610.684 ±     0.462        590.352 ±     0.334  us/op         1.03x
LikeFilterBenchmark.matchRegexSuffix                  100000  avgt   10     53196.517 ±    78.155      59460.261 ±    56.934  us/op         0.89x
LikeFilterBenchmark.matchRegexSuffix                 1000000  avgt   10    536100.944 ±   440.353     550098.917 ±   740.464  us/op         0.97x
LikeFilterBenchmark.matchSelectorEquals                 1000  avgt   10         0.390 ±     0.001          0.366 ±     0.001  us/op         1.07x
LikeFilterBenchmark.matchSelectorEquals               100000  avgt   10         0.724 ±     0.001          0.714 ±     0.001  us/op         1.01x
LikeFilterBenchmark.matchSelectorEquals              1000000  avgt   10         0.826 ±     0.001          0.847 ±     0.001  us/op         0.98x

Release note

Improved: LIKE filtering performance with multiple wildcards improved 1.1x (common cases) to 20x (edge cases) by avoiding using java.util.regex.Pattern to match %.


Key changed/added classes in this PR
  • LikeDimFilter.from() is the most-complicated change: parsing to a list of patterns.
  • LikeDimFilter.matches() is updated to loop through the list of patterns.

This PR has:

  • been self-reviewed.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

`LikeDimFilter` was compiling the `LIKE` clause down to a `java.util.regex.Pattern`. Unfortunately, even seemingly simple regexes can lead to [catastrophic backtracking](https://www.regular-expressions.info/catastrophic.html). In particular, something as simple as a few `%` wildcards can end up [exploding the time complexity](https://www.rexegg.com/regex-explosive-quantifiers.html#remote). This MR implements a simple greedy algorithm that avoids backtracking.

Technically, the algorithm runs in `O(nm)`, where `n` is the length of the string to match and `m` is the length of the pattern. In practice, it should run in linear time: essentially as fast as `String.indexOf()` can search for the next match. Running an updated version of the `LikeFilterBenchmark` with Java 11 on a `t2.xlarge` instance showed at least a 1.7x speed up for a simple "contains" query (`%50%`), and more than a 20x speed up for a "killer" query with four wildcards but no matches (`%%%%x`). The benchmark uses short strings: cases with longer strings should benefit more.

Note that the `REGEX` operator still suffers from the same potentially-catastrophic runtimes. Using a better library than the built-in `java.util.regex.Pattern` (e.g., [joni](https://github.com/jruby/joni)) would be a good idea to avoid accidental — or intentional — DoSing.

```
Benchmark                                (cardinality)  Mode  Cnt  Before Score       Error  After Score       Error  Units  Before / After
LikeFilterBenchmark.matchBoundPrefix              1000  avgt   10         6.686 ±     0.026        6.765 ±     0.087  us/op           0.99x
LikeFilterBenchmark.matchBoundPrefix            100000  avgt   10       163.936 ±     1.589      140.014 ±     0.563  us/op           1.17x
LikeFilterBenchmark.matchBoundPrefix           1000000  avgt   10      1235.259 ±     7.318     1165.330 ±     9.300  us/op           1.06x
LikeFilterBenchmark.matchLikeContains             1000  avgt   10       255.074 ±     1.530      130.212 ±     3.314  us/op           1.96x
LikeFilterBenchmark.matchLikeContains           100000  avgt   10     34789.639 ±   210.219    18563.644 ±   100.030  us/op           1.87x
LikeFilterBenchmark.matchLikeContains          1000000  avgt   10    287265.302 ±  1790.957   164684.778 ±   317.698  us/op           1.74x
LikeFilterBenchmark.matchLikeEquals               1000  avgt   10         0.410 ±     0.003        0.399 ±     0.001  us/op           1.03x
LikeFilterBenchmark.matchLikeEquals             100000  avgt   10         0.793 ±     0.005        0.719 ±     0.003  us/op           1.10x
LikeFilterBenchmark.matchLikeEquals            1000000  avgt   10         0.864 ±     0.004        0.839 ±     0.005  us/op           1.03x
LikeFilterBenchmark.matchLikeKiller               1000  avgt   10      3077.629 ±     7.928      103.714 ±     2.417  us/op          29.67x
LikeFilterBenchmark.matchLikeKiller             100000  avgt   10    311048.049 ± 13466.911    14777.567 ±    70.242  us/op          21.05x
LikeFilterBenchmark.matchLikeKiller            1000000  avgt   10   3055855.099 ± 18387.839    92476.621 ±  1198.255  us/op          33.04x
LikeFilterBenchmark.matchLikePrefix               1000  avgt   10         6.711 ±     0.035        6.653 ±     0.046  us/op           1.01x
LikeFilterBenchmark.matchLikePrefix             100000  avgt   10       161.535 ±     0.574      163.740 ±     0.833  us/op           0.99x
LikeFilterBenchmark.matchLikePrefix            1000000  avgt   10      1255.696 ±     5.207     1201.378 ±     3.466  us/op           1.05x
LikeFilterBenchmark.matchRegexContains            1000  avgt   10       467.736 ±     2.546      481.431 ±     5.647  us/op           0.97x
LikeFilterBenchmark.matchRegexContains          100000  avgt   10     64871.766 ±   223.341    65483.992 ±   391.249  us/op           0.99x
LikeFilterBenchmark.matchRegexContains         1000000  avgt   10    482906.004 ±  2003.583   477195.835 ±  3094.605  us/op           1.01x
LikeFilterBenchmark.matchRegexKiller              1000  avgt   10      8071.881 ±    18.026     8052.322 ±    17.336  us/op           1.00x
LikeFilterBenchmark.matchRegexKiller            100000  avgt   10   1120094.520 ±  2428.172   808321.542 ±  2411.032  us/op           1.39x
LikeFilterBenchmark.matchRegexKiller           1000000  avgt   10   8096745.012 ± 40782.747  8114114.896 ± 43250.204  us/op           1.00x
LikeFilterBenchmark.matchRegexPrefix              1000  avgt   10       170.843 ±     1.095      175.924 ±     1.144  us/op           0.97x
LikeFilterBenchmark.matchRegexPrefix            100000  avgt   10     17785.280 ±   116.813    18708.888 ±    61.857  us/op           0.95x
LikeFilterBenchmark.matchRegexPrefix           1000000  avgt   10    174415.586 ±  1827.478   173190.799 ±   949.224  us/op           1.01x
LikeFilterBenchmark.matchSelectorEquals           1000  avgt   10         0.411 ±     0.003        0.416 ±     0.002  us/op           0.99x
LikeFilterBenchmark.matchSelectorEquals         100000  avgt   10         0.728 ±     0.003        0.739 ±     0.003  us/op           0.99x
LikeFilterBenchmark.matchSelectorEquals        1000000  avgt   10         0.842 ±     0.002        0.879 ±     0.007  us/op           0.96x
```
@twilliamson twilliamson changed the title from "Expected-linear-time LIKE" to "Improve worst-case performance of LIKE filters by 20x" Mar 18, 2024
@abhishekagarwal87
Contributor

https://github.com/spring-projects/spring-framework/blob/main/spring-expression/src/main/java/org/springframework/expression/spel/ast/OperatorMatches.java#L128 seems like a better way to protect against such killer regex. It wouldn't improve the performance but would avoid bad regex causing a significant issue.
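One way to get this kind of protection is to hand the matcher a CharSequence wrapper that counts character accesses and aborts once a budget is exceeded. The sketch below illustrates that general idea only; the class names, exception type, and budget values are hypothetical and are not taken from Spring or Druid.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of an access-counting guard around java.util.regex.
final class BoundedRegexSketch {
    /** Wraps the input and throws once charAt() has been called more often than allowed. */
    static final class CountingInput implements CharSequence {
        private final CharSequence value;
        private final long[] remaining; // single-element counter shared with subSequence views

        CountingInput(CharSequence value, long[] remaining) {
            this.value = value;
            this.remaining = remaining;
        }

        @Override
        public char charAt(int index) {
            if (--remaining[0] < 0) {
                throw new IllegalStateException("regex exceeded its character-access budget");
            }
            return value.charAt(index);
        }

        @Override
        public int length() {
            return value.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new CountingInput(value.subSequence(start, end), remaining);
        }

        @Override
        public String toString() {
            return value.toString();
        }
    }

    /** Full-string match that aborts with IllegalStateException when the budget runs out. */
    static boolean matchesWithin(Pattern pattern, String input, long maxCharAccesses) {
        return pattern.matcher(new CountingInput(input, new long[] {maxCharAccesses})).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesWithin(Pattern.compile("fo+"), "foo", 1_000));
        String longInput = new String(new char[200]).replace('\0', 'a');
        try {
            matchesWithin(Pattern.compile(".*.*.*.*x"), longInput, 10_000);
        } catch (IllegalStateException e) {
            System.out.println("aborted: " + e.getMessage());
        }
    }
}
```

As noted in the comment, this does not make the match any faster; it turns an unbounded runtime into an error the caller can report.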

@gianm
Contributor

gianm commented Apr 1, 2024

For LIKE it seems reasonable to have an optimized matcher specifically geared towards the needs of LIKE. I am disappointed that the JDK regexp library doesn't handle these relatively simple cases well. I would have hoped that it could realize that .*.* is equivalent to .* and not blow up runtime with naive matching.

For REGEXP_LIKE, the possibility of runtime blowup has always been there, as you point out. I'm not aware of a regexp library for Java that is equivalent to java.util.regex.Pattern but does not have any possibility of runtime blowup. I am not even sure if it's possible. As I understand it, regexp libraries that guarantee no runtime blowup tend to achieve that by limiting the pattern syntax somewhat. I would be excited to be wrong though. (Does joni guarantee no runtime blowup, btw? I don't see a note about that on its page.)

Assuming such a library doesn't exist, probably good approaches would be optionally using an alternative library (either as a cluster-wide setting or with a different set of regexp functions), or @abhishekagarwal87's suggestion: detecting runtime blowup and throwing an error. But that's out of scope for this particular PR anyway, so let's focus on the LIKE situation.

@abhishekagarwal87
Contributor

@gianm - https://github.com/google/re2j seems to fit the bill. We could add a cluster-wide setting that switches the regex-based filters to use this library instead of the standard Java library functions. I suggest that both the regexp_like and like filters switch to the new library when there is a need to use Pattern.

@twilliamson
Contributor Author

This info is from 5 or 6 years back while working on stream processing systems at Facebook, but my recollection is that re2j had issues with UTF-8 multi-byte sequences. Not sure if that's still the case, but I remember it not working as a drop-in replacement. We checked out what the Trino folks were doing at the time, and that's what led to us using Joni, which we were able to switch to without any of our pipeline owners noticing. From what I can remember, while it doesn't make hard runtime guarantees, in practice we didn't see it run into the same pathological behavior, but would still sometimes see exceptions for certain inputs (maybe StackOverflowError? but you also get those with java.util.regex…).

Just checked, and it looks like Trino has since updated to a custom LIKE implementation based on DFAs. (It looks quite complicated — I'm tempted to submit a MR to Trino with the same approach as in this MR…) Trino appears to still be using Joni as the default for regexp_* functions, with an option to use re2j.

@imply-cheddar
Contributor

@abhishekagarwal87 @gianm I was just looking at this PR, not sure what you are suggesting for how to make forward progress?

@gianm
Contributor

gianm commented Apr 9, 2024

> @abhishekagarwal87 @gianm I was just looking at this PR, not sure what you are suggesting for how to make forward progress?

Oh, I don't think the discussion about regex libraries and stuff needs to block anything for this particular patch. It was just interesting.

I had previously skimmed this patch and the general approach looks OK to me. I think all that was required for making forward progress was a committer to take more than a skim level look. I didn't have bandwidth to do more than that until today, but I can take a look today.

@gianm
Contributor

gianm commented Apr 10, 2024

Just looked more deeply, I think there's a bug in the logic around the greediness of the matching. This test fails on the patch but passes on master:

    assertMatch("%ba_", "foo bar baz", DruidPredicateMatch.TRUE);

The %ba matches "foo ba", then the _ matches "r", and then, because there's some string left over, the overall match fails.

Making the first %ba greedy wouldn't be a good fix, because then this test (which passes on the branch now) would probably start to fail:

    assertMatch("%ba_____", "foo bar baz", DruidPredicateMatch.TRUE);

It seems like some kind of backtracking is necessary, like trying the minimal-length match for %ba first, and then if the overall match fails, trying a longer match.

@twilliamson what do you think about mitigating the original issue by sticking with java.util.regex.Pattern, but using your logic for pattern normalization to simplify the regexps? Or, alternatively, enhancing the hand-rolled logic in your patch to have some backtracking? That would surely help for some patterns, like %%%%x. Although, I'm not sure if it will open the door back up for "catastrophe". I haven't thought about the problem enough to have a sense of that.
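As a rough illustration of the normalization idea, the sketch below collapses runs of consecutive % into a single %, so %%%%x becomes %x before any regex is built. The class and method names are hypothetical, and escape characters are ignored for brevity; it is not the normalization code from the patch.

```java
// Hypothetical sketch: collapse adjacent % wildcards in a LIKE pattern.
final class LikeNormalizeSketch {
    /** Returns the pattern with every run of consecutive '%' reduced to a single '%'. */
    static String collapseWildcards(String like) {
        StringBuilder out = new StringBuilder(like.length());
        boolean previousWasPercent = false;
        for (char c : like.toCharArray()) {
            boolean isPercent = (c == '%');
            // Skip a '%' that directly follows another '%': it cannot change the result.
            if (!(isPercent && previousWasPercent)) {
                out.append(c);
            }
            previousWasPercent = isPercent;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(collapseWildcards("%%%%x")); // %x
    }
}
```

This only removes redundant wildcards. A pattern whose % wildcards are separated by literals keeps all of them, so normalization alone would not bound the backtracking in those cases.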

@twilliamson
Contributor Author

Backtracking isn't needed – the bug in my current PR is in the suffix-handling logic. ☹️

At a high level, there are three parts to the match:

  • Must start with (the prefix)
  • Must end with (the suffix)
  • Must contain (list of clauses)

The special thing about the "starts with" and "ends with" parts is they have to match a particular part of the string (either anchored at the start or anchored at the end, respectively). The list of "contains" clauses can eagerly match the first occurrence. That is, if I'm looking for a%j%z, then the a has to match the first a and the z has to match the last z, but the j can match any j that's not the first or last.

The current PR is splitting clauses on either % or _ and assuming the last clause is the suffix, but the suffix is actually everything after the last %. I'll fix the suffix handling and update the PR.
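The three-part structure can be sketched for the simplest case, a pattern made of literals and % only. This is an illustration under stated assumptions, not the code in the PR: the names are hypothetical, `_` is treated as an ordinary character, and escapes are not handled.

```java
// Hypothetical sketch of prefix / suffix / contains matching for %-only LIKE patterns.
final class PercentOnlyLikeSketch {
    static boolean like(String pattern, String s) {
        // Limit -1 keeps leading and trailing empty pieces, e.g. "%a%" -> ["", "a", ""].
        String[] parts = pattern.split("%", -1);
        if (parts.length == 1) {
            return s.equals(pattern); // no wildcard at all: exact match
        }
        String prefix = parts[0];                // text before the first %
        String suffix = parts[parts.length - 1]; // text after the last %
        if (s.length() < prefix.length() + suffix.length()) {
            return false; // prefix and suffix may not overlap
        }
        if (!s.startsWith(prefix) || !s.endsWith(suffix)) {
            return false;
        }
        int offset = prefix.length();
        int limit = s.length() - suffix.length();
        // Each "contains" clause takes the first occurrence after the previous one.
        for (int i = 1; i < parts.length - 1; i++) {
            int found = s.indexOf(parts[i], offset);
            if (found < 0 || found + parts[i].length() > limit) {
                return false;
            }
            offset = found + parts[i].length();
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(like("a%j%z", "a j z")); // true
        System.out.println(like("a%j%z", "az"));    // false
    }
}
```

Taking the earliest occurrence of each middle clause is safe here because a later clause can only benefit from having more of the string left to search.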

@gianm
Contributor

gianm commented Apr 10, 2024

OK, thanks, I'll keep an eye out for the update.

@gianm
Contributor

gianm commented Apr 11, 2024

This test has an issue on the latest patch:

    assertMatch("%1 _ 5%6", "1 2 3 1 4 5 6", DruidPredicateMatch.TRUE);

The matcher strips off the "6" to get "1 2 3 1 4 5 ", then eagerly matches the first "1 " using the "%1 ", then can't match "_ 5" at the start of "2 3 1 4 5 " and returns FALSE.

I'm not sure how to fix it, so am definitely interested in your thoughts. I'm happy to take a look at any fixes you propose. Another possible option is to include Joni or re2j, and use it here as well as optionally use it for the REGEXP_* functions. Or adapt the DFA approach that Trino is using.

@twilliamson
Contributor Author

You're right: I'm making the same mistake as with the suffix, and now worrying I left the same bug in the Facebook stream processing system… 😛 Sequences of literal characters and _ without any % between them have to either all match or all fail. That's certainly doable to code up, and still has better theoretical bounds than java.util.regex, but whether it's faster than the joni/re2 option (or better enough to warrant the additional complexity) is a good question. I'll try out the different options…

Thanks so much for taking a look at this MR — and extra, extra thanks for spotting the flaws! 😄 I was grateful to finally feel like I could contribute to some of the awesomeness of Druid, and it would have been horrifying to end up introducing bugs into LIKE. 🙀

@twilliamson
Contributor Author

I got the custom implementation to work, but it's pretty complicated, and I'm a bit worried about there being more bugs. Similarly, switching the regex library to joni or re2 introduces a fairly big new dependency (though that probably needs to be done at some point for the REGEXP_* methods).

Fortunately, there's a nice middle ground: use the list-of-patterns approach for handling the % matching (and thereby avoid catastrophic backtracking), but continue to use java.util.regex for handling the parts containing only _ and literals (using ^ and $ to handle prefix/suffix anchoring), e.g., a%b_c%d becomes [^a, b.c, d$].

This ends up boiling down to changing the from() parsing logic, then updating:

    private static DruidPredicateMatch matches(@Nullable final String s, Pattern pattern)
    {
      String val = NullHandling.nullToEmptyIfNeeded(s);
      if (val == null) {
        return DruidPredicateMatch.UNKNOWN;
      }
      return DruidPredicateMatch.of(pattern.matcher(val).matches());
    }

to:

    private static DruidPredicateMatch matches(@Nullable final String s, List<Pattern> pattern)
    {
      String val = NullHandling.nullToEmptyIfNeeded(s);
      if (val == null) {
        return DruidPredicateMatch.UNKNOWN;
      }

      if (pattern.size() == 1) {
        // The common case is a single pattern: a% => ^a, %z => z$, %m% => m
        return DruidPredicateMatch.of(pattern.get(0).matcher(val).find());
      }

      int offset = 0;

      for (Pattern part : pattern) {
        Matcher matcher = part.matcher(val);

        if (!matcher.find(offset)) {
          return DruidPredicateMatch.FALSE;
        }

        offset = matcher.end();
      }

      return DruidPredicateMatch.TRUE;
    }
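The from() parsing step is not shown in this comment. A minimal sketch of how a LIKE pattern might be split into such a list follows, together with the same greedy loop so it can be run on its own. It is an illustration, not the merged code: the names are hypothetical, escape characters are ignored, DOTALL is an assumption, and it anchors the suffix with \z rather than $ (an assumption made here so that a trailing line terminator is not skipped).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the list-of-patterns approach; not the LikeDimFilter code.
final class LikePatternListSketch {
    /** Split on %, e.g. a%b_c%d -> [^a, b.c, d\z] (literals are quoted in the real regex). */
    static List<Pattern> split(String like) {
        String[] pieces = like.split("%", -1); // -1 keeps leading/trailing empty pieces
        int last = pieces.length - 1;
        List<Pattern> patterns = new ArrayList<>();
        for (int i = 0; i <= last; i++) {
            // With at least one %, an empty piece adds no constraint and is dropped.
            if (pieces[i].isEmpty() && last > 0) {
                continue;
            }
            String regex = (i == 0 ? "^" : "") + toRegex(pieces[i]) + (i == last ? "\\z" : "");
            patterns.add(Pattern.compile(regex, Pattern.DOTALL));
        }
        return patterns;
    }

    /** Translate a %-free piece: _ becomes ., runs of other characters are quoted. */
    private static String toRegex(String piece) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (char c : piece.toCharArray()) {
            if (c == '_') {
                if (literal.length() > 0) {
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append('.');
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return regex.toString();
    }

    /** Greedy loop: each sub-pattern must be found at or after the end of the previous match. */
    static boolean matches(List<Pattern> patterns, String s) {
        int offset = 0;
        for (Pattern part : patterns) {
            Matcher matcher = part.matcher(s);
            if (!matcher.find(offset)) {
                return false;
            }
            offset = matcher.end();
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matches(split("%1 _ 5%6"), "1 2 3 1 4 5 6")); // true
        System.out.println(matches(split("%%%%x"), "aaaa"));             // false
    }
}
```

Because each piece contains only literals and single-character wildcards, it has a fixed length, so finding its earliest occurrence never requires the engine to retry different splits. The two test cases from the review (`%ba_` and `%1 _ 5%6`) pass with this sketch because literals and `_` between two % are matched as one unit.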

The full benchmark suite is still running on a dedicated box, but local laptop testing confirms that the "killer regex" is no longer killer, and suggests performance is the same or better across the board (though not as much as with the fully-custom implementation, e.g., ~1.3x faster contains/suffix vs ~2x faster).

The list-of-patterns approach seems like a sweet spot of bounded worst-case and slightly-better average performance, and a simple-enough implementation to avoid introducing bugs and not increase on-going maintenance overhead. I'm not sure whether it's better to update this PR with that approach (i.e., add another commit that rolls back the custom code), or create a new PR? And should I keep the new tests and benchmarks?

@gianm What are your thoughts? How would you like me to proceed?

@twilliamson
Contributor Author

twilliamson commented Apr 16, 2024

The overnight dedicated-box perf tests finished, and they look good. I've gone ahead and updated this PR with the list-of-java.util.regex.Pattern approach. Hopefully that's a good balance of protecting from catastrophic backtracking while minimizing risk.

@gianm
Contributor

gianm commented Apr 17, 2024

@twilliamson thanks! I will have another look at the latest patch. Hopefully should have some time for that tomorrow. Updating this PR works for me.

@gianm
Contributor

gianm commented Apr 24, 2024

I just had a look at the latest patch. The list-of-patterns approach makes more intuitive sense to me: the LIKE pattern is split on % into fixed-length subpatterns, then each fixed-length subpattern is matched in the string as early as possible. Certainly sounds correct.

I'll go ahead and merge this. Thank you for the contribution!

@gianm gianm merged commit 4bdc189 into apache:master Apr 24, 2024
@gianm gianm added this to the 30.0.0 milestone Apr 24, 2024
@twilliamson
Contributor Author

You just made my week! 🎉 😄 Thank you so much for taking the time to work through my buggy iterations of this — I really appreciate it!

@gianm
Contributor

gianm commented Apr 25, 2024

Of course! Thank you for contributing.
