Improve worst-case performance of LIKE filters by 20x by twilliamson · Pull Request #16153 · apache/druid

twilliamson · 2024-03-18T18:23:51Z

Description

LikeDimFilter was compiling the LIKE clause down to a java.util.regex.Pattern. Unfortunately, even seemingly simply regexes can lead to catastrophic backtracking. In particular, something as simple as a few % wildcards can end up in exploding the time complexity. This MR implements a simple greedy algorithm that avoids the catastrophic backtracking, converting the LIKE pattern into a list of java.util.regex.Pattern by splitting on the % wildcard. The resulting sub-patterns do no backtracking, and a simple greedy loop using Matcher.find() to progress through the string is used.

Running an updated version of the LikeFilterBenchmark with Java 11 on a t2.xlarge instance showed at least a 1.15x speed up for a simple "contains" query (%50%), and more than a 20x speed up for a "killer" query with four wildcards but no matches (%%%%x). The benchmark uses short strings: cases with longer strings should benefit more.

Note that the REGEX operator still suffers from the same potentially-catastrophic runtimes. Using a better library than the built-in java.util.regex.Pattern (e.g., joni) would be a good idea to avoid accidental — or intentional — DoSing.

Benchmark                                      (cardinality)  Mode  Cnt  Before Score       Error      After Score     Error  Units  Before/After
LikeFilterBenchmark.matchBoundPrefix                    1000  avgt   10         5.410 ±     0.010          5.582 ±     0.004  us/op         0.97x
LikeFilterBenchmark.matchBoundPrefix                  100000  avgt   10       140.920 ±     0.306        141.082 ±     0.391  us/op         1.00x
LikeFilterBenchmark.matchBoundPrefix                 1000000  avgt   10      1082.762 ±     1.070       1171.407 ±     1.628  us/op         0.92x
LikeFilterBenchmark.matchLikeComplexContains            1000  avgt   10       221.572 ±     0.228        183.742 ±     0.210  us/op         1.21x
LikeFilterBenchmark.matchLikeComplexContains          100000  avgt   10     25461.362 ±    21.481      17373.828 ±    42.577  us/op         1.47x
LikeFilterBenchmark.matchLikeComplexContains         1000000  avgt   10    221075.917 ±   919.238     177454.683 ±   506.420  us/op         1.25x
LikeFilterBenchmark.matchLikeContains                   1000  avgt   10       283.015 ±     0.219        218.835 ±     3.126  us/op         1.29x
LikeFilterBenchmark.matchLikeContains                 100000  avgt   10     30202.910 ±    32.697      26713.488 ±    49.525  us/op         1.13x
LikeFilterBenchmark.matchLikeContains                1000000  avgt   10    284661.411 ±   130.324     243381.857 ±   540.143  us/op         1.17x
LikeFilterBenchmark.matchLikeEquals                     1000  avgt   10         0.386 ±     0.001          0.380 ±     0.001  us/op         1.02x
LikeFilterBenchmark.matchLikeEquals                   100000  avgt   10         0.670 ±     0.001          0.705 ±     0.002  us/op         0.95x
LikeFilterBenchmark.matchLikeEquals                  1000000  avgt   10         0.839 ±     0.001          0.796 ±     0.001  us/op         1.05x
LikeFilterBenchmark.matchLikeKiller                     1000  avgt   10      4882.099 ±     7.953        170.142 ±     0.494  us/op        28.69x
LikeFilterBenchmark.matchLikeKiller                   100000  avgt   10    524122.010 ±   390.170      19461.637 ±   117.090  us/op        26.93x
LikeFilterBenchmark.matchLikeKiller                  1000000  avgt   10   5121795.377 ±  4176.052     181162.978 ±   368.443  us/op        28.27x
LikeFilterBenchmark.matchLikePrefix                     1000  avgt   10         5.708 ±     0.005          5.677 ±     0.011  us/op         1.01x
LikeFilterBenchmark.matchLikePrefix                   100000  avgt   10       141.853 ±     0.554        108.313 ±     0.330  us/op         1.31x
LikeFilterBenchmark.matchLikePrefix                  1000000  avgt   10      1199.148 ±     1.298       1153.297 ±     1.575  us/op         1.04x
LikeFilterBenchmark.matchLikeSuffix                     1000  avgt   10       256.020 ±     0.283        196.339 ±     0.564  us/op         1.30x
LikeFilterBenchmark.matchLikeSuffix                   100000  avgt   10     29917.931 ±    28.218      21450.997 ±    20.341  us/op         1.39x
LikeFilterBenchmark.matchLikeSuffix                  1000000  avgt   10    241225.193 ±   465.824     194034.292 ±   362.312  us/op         1.24x
LikeFilterBenchmark.matchRegexComplexContains           1000  avgt   10       119.597 ±     0.635        135.550 ±     0.697  us/op         0.88x
LikeFilterBenchmark.matchRegexComplexContains         100000  avgt   10     13089.670 ±    13.738      13766.712 ±    12.802  us/op         0.95x
LikeFilterBenchmark.matchRegexComplexContains        1000000  avgt   10    130822.830 ±  1624.048     131076.029 ±  1636.811  us/op         1.00x
LikeFilterBenchmark.matchRegexContains                  1000  avgt   10       573.273 ±     0.421        615.399 ±     0.633  us/op         0.93x
LikeFilterBenchmark.matchRegexContains                100000  avgt   10     57259.313 ±   162.747      62900.380 ±    44.746  us/op         0.91x
LikeFilterBenchmark.matchRegexContains               1000000  avgt   10    571335.768 ±  2822.776     542536.982 ±   780.290  us/op         1.05x
LikeFilterBenchmark.matchRegexKiller                    1000  avgt   10     11525.499 ±     8.741      11061.791 ±    21.746  us/op         1.04x
LikeFilterBenchmark.matchRegexKiller                  100000  avgt   10   1170414.723 ±   766.160    1144437.291 ±   886.263  us/op         1.02x
LikeFilterBenchmark.matchRegexKiller                 1000000  avgt   10  11507668.302 ± 11318.176  110381620.014 ± 10707.974  us/op         1.11x
LikeFilterBenchmark.matchRegexPrefix                    1000  avgt   10       156.460 ±     0.097        155.217 ±     0.431  us/op         1.01x
LikeFilterBenchmark.matchRegexPrefix                  100000  avgt   10     15056.491 ±    23.906      15508.965 ±   763.976  us/op         0.97x
LikeFilterBenchmark.matchRegexPrefix                 1000000  avgt   10    154416.563 ±   473.108     153737.912 ±   273.347  us/op         1.00x
LikeFilterBenchmark.matchRegexSuffix                    1000  avgt   10       610.684 ±     0.462        590.352 ±     0.334  us/op         1.03x
LikeFilterBenchmark.matchRegexSuffix                  100000  avgt   10     53196.517 ±    78.155      59460.261 ±    56.934  us/op         0.89x
LikeFilterBenchmark.matchRegexSuffix                 1000000  avgt   10    536100.944 ±   440.353     550098.917 ±   740.464  us/op         0.97x
LikeFilterBenchmark.matchSelectorEquals                 1000  avgt   10         0.390 ±     0.001          0.366 ±     0.001  us/op         1.07x
LikeFilterBenchmark.matchSelectorEquals               100000  avgt   10         0.724 ±     0.001          0.714 ±     0.001  us/op         1.01x
LikeFilterBenchmark.matchSelectorEquals              1000000  avgt   10         0.826 ±     0.001          0.847 ±     0.001  us/op         0.98x

Release note

Improved: LIKE filtering performance with multiple wildcards improved 1.1x (common cases) to 20x (edge cases) by avoiding using java.util.regex.Pattern to match %.

Key changed/added classes in this PR

LikeDimFilter.from() is the most-complicated change: parsing to a list of patterns.
LikeDimFilter.matches() is updated to loop through the list of patterns.

This PR has:

been self-reviewed.
a release note entry in the PR description.
added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
added integration tests.
been tested in a test Druid cluster.

`LikeDimFilter` was compiling the `LIKE` clause down to a `java.util.regex.Pattern`. Unfortunately, even seemingly simply regexes can lead to [catastrophic backtracking](https://www.regular-expressions.info/catastrophic.html). In particular, something as simple as a few `%` wildcards can end up in [exploding the time complexity](https://www.rexegg.com/regex-explosive-quantifiers.html#remote). This MR implements a simple greedy algorithm that avoids backtracking. Technically, the algorithm runs in `O(nm)`, where `n` is the length of the string to match and `m` is the length of the pattern. In practice, it should run in linear time: essentially as fast as `String.indexOf()` can search for the next match. Running an updated version of the `LikeFilterBenchmark` with Java 11 on a `t2.xlarge` instance showed at least a 1.7x speed up for a simple "contains" query (`%50%`), and more than a 20x speed up for a "killer" query with four wildcards but no matches (`%%%%x`). The benchmark uses short strings: cases with longer strings should benefit more. Note that the `REGEX` operator still suffers from the same potentially-catastrophic runtimes. Using a better library than the built-in `java.util.regex.Pattern` (e.g., [joni](https://github.com/jruby/joni)) would be a good idea to avoid accidental — or intentional — DoSing. ``` Benchmark (cardinality) Mode Cnt Before Score Error After Score Error Units Before / After LikeFilterBenchmark.matchBoundPrefix 1000 avgt 10 6.686 ± 0.026 6.765 ± 0.087 us/op 0.99x LikeFilterBenchmark.matchBoundPrefix 100000 avgt 10 163.936 ± 1.589 140.014 ± 0.563 us/op 1.17x LikeFilterBenchmark.matchBoundPrefix 1000000 avgt 10 1235.259 ± 7.318 1165.330 ± 9.300 us/op 1.06x LikeFilterBenchmark.matchLikeContains 1000 avgt 10 255.074 ± 1.530 130.212 ± 3.314 us/op 1.96x LikeFilterBenchmark.matchLikeContains 100000 avgt 10 34789.639 ± 210.219 18563.644 ± 100.030 us/op 1.87x LikeFilterBenchmark.matchLikeContains 1000000 avgt 10 287265.302 ± 1790.957 164684.778 ± 317.698 us/op 1.74x LikeFilterBenchmark.matchLikeEquals 1000 avgt 10 0.410 ± 0.003 0.399 ± 0.001 us/op 1.03x LikeFilterBenchmark.matchLikeEquals 100000 avgt 10 0.793 ± 0.005 0.719 ± 0.003 us/op 1.10x LikeFilterBenchmark.matchLikeEquals 1000000 avgt 10 0.864 ± 0.004 0.839 ± 0.005 us/op 1.03x LikeFilterBenchmark.matchLikeKiller 1000 avgt 10 3077.629 ± 7.928 103.714 ± 2.417 us/op 29.67x LikeFilterBenchmark.matchLikeKiller 100000 avgt 10 311048.049 ± 13466.911 14777.567 ± 70.242 us/op 21.05x LikeFilterBenchmark.matchLikeKiller 1000000 avgt 10 3055855.099 ± 18387.839 92476.621 ± 1198.255 us/op 33.04x LikeFilterBenchmark.matchLikePrefix 1000 avgt 10 6.711 ± 0.035 6.653 ± 0.046 us/op 1.01x LikeFilterBenchmark.matchLikePrefix 100000 avgt 10 161.535 ± 0.574 163.740 ± 0.833 us/op 0.99x LikeFilterBenchmark.matchLikePrefix 1000000 avgt 10 1255.696 ± 5.207 1201.378 ± 3.466 us/op 1.05x LikeFilterBenchmark.matchRegexContains 1000 avgt 10 467.736 ± 2.546 481.431 ± 5.647 us/op 0.97x LikeFilterBenchmark.matchRegexContains 100000 avgt 10 64871.766 ± 223.341 65483.992 ± 391.249 us/op 0.99x LikeFilterBenchmark.matchRegexContains 1000000 avgt 10 482906.004 ± 2003.583 477195.835 ± 3094.605 us/op 1.01x LikeFilterBenchmark.matchRegexKiller 1000 avgt 10 8071.881 ± 18.026 8052.322 ± 17.336 us/op 1.00x LikeFilterBenchmark.matchRegexKiller 100000 avgt 10 1120094.520 ± 2428.172 808321.542 ± 2411.032 us/op 1.39x LikeFilterBenchmark.matchRegexKiller 1000000 avgt 10 8096745.012 ± 40782.747 8114114.896 ± 43250.204 us/op 1.00x LikeFilterBenchmark.matchRegexPrefix 1000 avgt 10 170.843 ± 1.095 175.924 ± 1.144 us/op 0.97x LikeFilterBenchmark.matchRegexPrefix 100000 avgt 10 17785.280 ± 116.813 18708.888 ± 61.857 us/op 0.95x LikeFilterBenchmark.matchRegexPrefix 1000000 avgt 10 174415.586 ± 1827.478 173190.799 ± 949.224 us/op 1.01x LikeFilterBenchmark.matchSelectorEquals 1000 avgt 10 0.411 ± 0.003 0.416 ± 0.002 us/op 0.99x LikeFilterBenchmark.matchSelectorEquals 100000 avgt 10 0.728 ± 0.003 0.739 ± 0.003 us/op 0.99x LikeFilterBenchmark.matchSelectorEquals 1000000 avgt 10 0.842 ± 0.002 0.879 ± 0.007 us/op 0.96x ```

… in LikeDimFilterTest assertions.

abhishekagarwal87 · 2024-03-20T08:31:08Z

https://github.com/spring-projects/spring-framework/blob/main/spring-expression/src/main/java/org/springframework/expression/spel/ast/OperatorMatches.java#L128 seems like a better way to protect against such killer regex. It wouldn't improve the performance but would avoid bad regex causing a significant issue.

gianm · 2024-04-01T21:03:39Z

For LIKE it seems reasonable to have an optimized matcher specifically geared towards the needs of LIKE. I am disappointed that the JDK regexp library doesn't handle these relatively simple cases well. I would have hoped that it could realize that .*.* is equivalent to .* and not blow up runtime with naive matching.

For REGEXP_LIKE, the possibility of runtime blowup has always been there, as you point out. I'm not aware of a regexp library for Java that is equivalent to java.util.Pattern but does not have any possibility of runtime blowup. I am not even sure if it's possible. As I understand it, regexp libraries that guarantee no runtime blowup tend to achieve that by limiting the pattern syntax somewhat. I would be excited to be wrong through. (Does joni guarantee no runtime blowup, btw? I don't see a note about that on its page.)

Assuming such a library doesn't exist, probably good approaches would be optionally using an alternative library (either as a cluster-wide setting or with a different set of regexp functions), or @abhishekagarwal87's suggestion: detecting runtime blowup and throwing an error. But that's out of scope for this particular PR anyway, so let's focus on the LIKE situation.

abhishekagarwal87 · 2024-04-05T09:24:10Z

@gianm - https://github.com/google/re2j seems to fit the bill. A cluster-wide setting that switches the regex-based filters to use this library instead of the standard Java library functions. I suggest both regexp_like and like filter switch to the new library when there is a need to use Pattern.

twilliamson · 2024-04-05T22:35:09Z

This info is from 5 or 6 years back while working on stream processing systems at Facebook, but my recollection is that re2j had issues with UTF-8 multi-byte sequences. Not sure if that's still the case, but I remember it not working as a drop-in replacement. We checked out what the Trino folks were doing at the time, and that's what led to us using Joni, which we were able to switch to without any of our pipeline owners noticing. From what I can remember, while it doesn't make hard runtime guarantees, in practice we didn't see it run into the same pathological behavior, but would still sometimes see exceptions for certain inputs (maybe StackOverflowException? but you also get those with java.util.regex…).

Just checked, and it looks like Trino has since updated to a custom LIKE implementation based on DFAs. (It looks quite complicated — I'm tempted to submit a MR to Trino with the same approach as in this MR…) Trino appears to still be using Joni as the default for regexp_* functions, with an option to use re2j.

imply-cheddar · 2024-04-09T05:00:00Z

@abhishekagarwal87 @gianm I was just looking at this PR, not sure what you are suggesting for how to make forward progress?

gianm · 2024-04-09T22:30:53Z

@abhishekagarwal87 @gianm I was just looking at this PR, not sure what you are suggesting for how to make forward progress?

Oh, I don't think the discussion about regex libraries and stuff needs to block anything for this particular patch. It was just interesting.

I had previously skimmed this patch and the general approach looks OK to me. I think all that was required for making forward progress was a committer to take more than a skim level look. I didn't have bandwidth to do more than that until today, but I can take a look today.

gianm · 2024-04-10T00:01:46Z

Just looked more deeply, I think there's a bug in the logic around the greediness of the matching. This test fails on the patch but passes on master:

assertMatch("%ba_", "foo bar baz", DruidPredicateMatch.TRUE);

The %ba matches foo ba, then the _ matches r, then because there's some string left over the overall match fails.

Making the first %ba greedy wouldn't be a good fix, because then this test (which passes on the branch now) would probably start to fail:

assertMatch("%ba_____", "foo bar baz", DruidPredicateMatch.TRUE);

It seems like some kind of backtracking is necessary, like trying the minimal-length match for %ba first, and then if the overall match fails, trying a longer match.

@twilliamson what do you think about mitigating the original issue by sticking with java.util.Pattern, but using your logic for pattern normalization to simplify the regexps? Or, alternatively, enhancing the hand-rolled logic in your patch to have some backtracking? That would surely help for some patterns, like %%%%x. Although, I'm not sure if it will open the door back up for "catastrophe". I haven't thought about the problem enough to have a sense of that.

twilliamson · 2024-04-10T00:37:46Z

Backtracking isn't needed – the bug in my current PR is in the suffix-handling logic. ☹️

At a high level, there are three parts to the match:

Must start with (the prefix)
Must end with (the suffix)
Must contain (list of clauses)

The special thing about the "starts with" and "ends with" parts is they have to match a particular part of the string (either anchored at the start or anchored at the end, respectively). The list of "contains" clauses can eagerly match the first occurrence. That is, if I'm looking for a%j%z, then the a has to match the first a and the z has to match the last z, but the j can match any j that's not the first or last.

The current PR is splitting clauses on either % or _ and assuming the last clause is the suffix, but the suffix is actually everything after the last %. I'll fix the suffix handling and update the PR.

gianm · 2024-04-10T04:55:09Z

OK, thanks, I'll keep an eye out for the update.

gianm · 2024-04-11T06:30:03Z

This test has an issue on the latest patch:

assertMatch("%1 _ 5%6", "1 2 3 1 4 5 6", DruidPredicateMatch.TRUE);

The matcher strips off the 6 to get 1 2 3 1 4 5 , then eagerly matches the first 1 using the %1 , then can't match _ 5 on the start of 2 3 1 4 5 and returns FALSE.

I'm not sure how to fix it, so am definitely interested in your thoughts. I'm happy to take a look at any fixes you propose. Another possible option is to include Joni or re2j, and use it here as well as optionally use it for the REGEXP_* functions. Or adapt the DFA approach that Trino is using.

twilliamson · 2024-04-11T16:34:11Z

You're right: I'm making the same mistake as with the suffix — and now worrying I left the same bug in the Facebook stream processing system… 😛 Sequences of literal characters and _ without any % between them have to either all match or all fail. That's certainly doable to code up, and still has better theortical bounds than java.util.regex, but whether it's faster than the joni/re2 option (or better enough to warrant the additional complexity) is a good question. I'll try out the different options…

Thanks so much for taking a look at this MR — and extra, extra thanks for spotting the flaws! 😄 I was grateful to finally feel like I could contribute to some of the awesomeness of Druid, and it would have been horrifying to end up introducing bugs into LIKE. 🙀

twilliamson · 2024-04-15T18:12:50Z

I got the custom implementation to work, but it's pretty complicated, and I'm bit worried about there being more bugs. Similarly, switching the regex library to joni or re2 introduces a fairly big new dependency (though that probably needs to be done at some point for the REGEXP_* methods).

Fortunately, there's a nice middle ground: use the list-of-patterns approach for handling the % matching (and thereby avoid catastrophic backtracking), but continue to use java.util.regex for handling the parts containing only _ and literals (using ^ and $ to handle prefix/suffix anchoring), e.g., a%b_c%d becomes [^a, b.c, d$].

This ends up boiling down to changing the from() parsing logic, then updating:

    private static DruidPredicateMatch matches(@Nullable final String s, Pattern pattern)
    {
      String val = NullHandling.nullToEmptyIfNeeded(s);
      if (val == null) {
        return DruidPredicateMatch.UNKNOWN;
      }
      return DruidPredicateMatch.of(pattern.matcher(val).matches());
    }

to:

    private static DruidPredicateMatch matches(@Nullable final String s, List<Pattern> pattern)
    {
      String val = NullHandling.nullToEmptyIfNeeded(s);
      if (val == null) {
        return DruidPredicateMatch.UNKNOWN;
      }

      if (pattern.size() == 1) {
        // The common case is a single pattern: a% => ^a, %z => z$, %m% => m
        return DruidPredicateMatch.of(pattern.get(0).matcher(val).find());
      }

      int offset = 0;

      for (Pattern part : pattern) {
        Matcher matcher = part.matcher(val);

        if (!matcher.find(offset)) {
          return DruidPredicateMatch.FALSE;
        }

        offset = matcher.end();
      }

      return DruidPredicateMatch.TRUE;
    }

The full benchmark suite is still running on a dedicated box, but local laptop testing confirms that the "killer regex" is no longer killer, and suggests performance is the same or better across the board (though not as much as with the fully-custom implementation, e.g., ~1.3x faster contains/suffix vs ~2x faster).

The list-of-patterns approach seems like a sweet spot of bounded worst-case and slightly-better average performance, and a simple-enough implementation to avoid introducing bugs and not increase on-going maintenance overhead. I'm not sure whether it's better to update this PR with that approach (i.e., add another commit that rolls back the custom code), or create a new PR? And should I keep the new tests and benchmarks?

@gianm What are your thoughts? How would you like me to proceed?

`LikeDimFilter` was compiling the `LIKE` clause down to a `java.util.regex.Pattern`. Unfortunately, even seemingly simply regexes can lead to [catastrophic backtracking](https://www.regular-expressions.info/catastrophic.html). In particular, something as simple as a few `%` wildcards can end up in [exploding the time complexity](https://www.rexegg.com/regex-explosive-quantifiers.html#remote). This MR implements a simple greedy algorithm that avoids the catastrophic backtracking, converting the `LIKE` pattern into a list of `java.util.regex.Pattern` by splitting on the `%` wildcard. The resulting sub-patterns do no backtracking, and a simple greedy loop using `Matcher.find()` to progress through the string is used. Running an updated version of the `LikeFilterBenchmark` with Java 11 on a `t2.xlarge` instance showed at least a 1.15x speed up for a simple "contains" query (`%50%`), and more than a 20x speed up for a "killer" query with four wildcards but no matches (`%%%%x`). The benchmark uses short strings: cases with longer strings should benefit more. Note that the `REGEX` operator still suffers from the same potentially-catastrophic runtimes. Using a better library than the built-in `java.util.regex.Pattern` (e.g., [joni](https://github.com/jruby/joni)) would be a good idea to avoid accidental — or intentional — DoSing. ``` Benchmark (cardinality) Mode Cnt Before Score Error After Score Error Units Before/After LikeFilterBenchmark.matchBoundPrefix 1000 avgt 10 5.410 ± 0.010 5.582 ± 0.004 us/op 0.97x LikeFilterBenchmark.matchBoundPrefix 100000 avgt 10 140.920 ± 0.306 141.082 ± 0.391 us/op 1.00x LikeFilterBenchmark.matchBoundPrefix 1000000 avgt 10 1082.762 ± 1.070 1171.407 ± 1.628 us/op 0.92x LikeFilterBenchmark.matchLikeComplexContains 1000 avgt 10 221.572 ± 0.228 183.742 ± 0.210 us/op 1.21x LikeFilterBenchmark.matchLikeComplexContains 100000 avgt 10 25461.362 ± 21.481 17373.828 ± 42.577 us/op 1.47x LikeFilterBenchmark.matchLikeComplexContains 1000000 avgt 10 221075.917 ± 919.238 177454.683 ± 506.420 us/op 1.25x LikeFilterBenchmark.matchLikeContains 1000 avgt 10 283.015 ± 0.219 218.835 ± 3.126 us/op 1.29x LikeFilterBenchmark.matchLikeContains 100000 avgt 10 30202.910 ± 32.697 26713.488 ± 49.525 us/op 1.13x LikeFilterBenchmark.matchLikeContains 1000000 avgt 10 284661.411 ± 130.324 243381.857 ± 540.143 us/op 1.17x LikeFilterBenchmark.matchLikeEquals 1000 avgt 10 0.386 ± 0.001 0.380 ± 0.001 us/op 1.02x LikeFilterBenchmark.matchLikeEquals 100000 avgt 10 0.670 ± 0.001 0.705 ± 0.002 us/op 0.95x LikeFilterBenchmark.matchLikeEquals 1000000 avgt 10 0.839 ± 0.001 0.796 ± 0.001 us/op 1.05x LikeFilterBenchmark.matchLikeKiller 1000 avgt 10 4882.099 ± 7.953 170.142 ± 0.494 us/op 28.69x LikeFilterBenchmark.matchLikeKiller 100000 avgt 10 524122.010 ± 390.170 19461.637 ± 117.090 us/op 26.93x LikeFilterBenchmark.matchLikeKiller 1000000 avgt 10 5121795.377 ± 4176.052 181162.978 ± 368.443 us/op 28.27x LikeFilterBenchmark.matchLikePrefix 1000 avgt 10 5.708 ± 0.005 5.677 ± 0.011 us/op 1.01x LikeFilterBenchmark.matchLikePrefix 100000 avgt 10 141.853 ± 0.554 108.313 ± 0.330 us/op 1.31x LikeFilterBenchmark.matchLikePrefix 1000000 avgt 10 1199.148 ± 1.298 1153.297 ± 1.575 us/op 1.04x LikeFilterBenchmark.matchLikeSuffix 1000 avgt 10 256.020 ± 0.283 196.339 ± 0.564 us/op 1.30x LikeFilterBenchmark.matchLikeSuffix 100000 avgt 10 29917.931 ± 28.218 21450.997 ± 20.341 us/op 1.39x LikeFilterBenchmark.matchLikeSuffix 1000000 avgt 10 241225.193 ± 465.824 194034.292 ± 362.312 us/op 1.24x LikeFilterBenchmark.matchRegexComplexContains 1000 avgt 10 119.597 ± 0.635 135.550 ± 0.697 us/op 0.88x LikeFilterBenchmark.matchRegexComplexContains 100000 avgt 10 13089.670 ± 13.738 13766.712 ± 12.802 us/op 0.95x LikeFilterBenchmark.matchRegexComplexContains 1000000 avgt 10 130822.830 ± 1624.048 131076.029 ± 1636.811 us/op 1.00x LikeFilterBenchmark.matchRegexContains 1000 avgt 10 573.273 ± 0.421 615.399 ± 0.633 us/op 0.93x LikeFilterBenchmark.matchRegexContains 100000 avgt 10 57259.313 ± 162.747 62900.380 ± 44.746 us/op 0.91x LikeFilterBenchmark.matchRegexContains 1000000 avgt 10 571335.768 ± 2822.776 542536.982 ± 780.290 us/op 1.05x LikeFilterBenchmark.matchRegexKiller 1000 avgt 10 11525.499 ± 8.741 11061.791 ± 21.746 us/op 1.04x LikeFilterBenchmark.matchRegexKiller 100000 avgt 10 1170414.723 ± 766.160 1144437.291 ± 886.263 us/op 1.02x LikeFilterBenchmark.matchRegexKiller 1000000 avgt 10 11507668.302 ± 11318.176 110381620.014 ± 10707.974 us/op 1.11x LikeFilterBenchmark.matchRegexPrefix 1000 avgt 10 156.460 ± 0.097 155.217 ± 0.431 us/op 1.01x LikeFilterBenchmark.matchRegexPrefix 100000 avgt 10 15056.491 ± 23.906 15508.965 ± 763.976 us/op 0.97x LikeFilterBenchmark.matchRegexPrefix 1000000 avgt 10 154416.563 ± 473.108 153737.912 ± 273.347 us/op 1.00x LikeFilterBenchmark.matchRegexSuffix 1000 avgt 10 610.684 ± 0.462 590.352 ± 0.334 us/op 1.03x LikeFilterBenchmark.matchRegexSuffix 100000 avgt 10 53196.517 ± 78.155 59460.261 ± 56.934 us/op 0.89x LikeFilterBenchmark.matchRegexSuffix 1000000 avgt 10 536100.944 ± 440.353 550098.917 ± 740.464 us/op 0.97x LikeFilterBenchmark.matchSelectorEquals 1000 avgt 10 0.390 ± 0.001 0.366 ± 0.001 us/op 1.07x LikeFilterBenchmark.matchSelectorEquals 100000 avgt 10 0.724 ± 0.001 0.714 ± 0.001 us/op 1.01x LikeFilterBenchmark.matchSelectorEquals 1000000 avgt 10 0.826 ± 0.001 0.847 ± 0.001 us/op 0.98x ```

twilliamson · 2024-04-16T18:37:38Z

The overnight dedicated-box perf tests finished, and they look good. I've gone ahead and updated this PR with the list-of-java.util.regex.Pattern approach. Hopefully that's a good balance of protecting from catastrophic backtracking while minimizing risk.

gianm · 2024-04-17T05:09:14Z

@twilliamson thanks! I will have another look at the latest patch. Hopefully should have some time for that tomorrow. Updating this PR works for me.

gianm · 2024-04-24T05:45:10Z

I just had a look at the latest patch. The list-of-patterns approach makes more intuitive sense to me: the LIKE pattern is split on % into fixed-length subpatterns, then each fixed-length subpattern is matched in the string as early as possible. Certainly sounds correct.

I'll go ahead and merge this. Thank you for the contribution!

twilliamson · 2024-04-24T18:47:13Z

You just made my week! 🎉 😄 Thank you so much for taking the time to work through my buggy iterations of this — I really appreciate it!

gianm · 2024-04-25T15:28:11Z

Of course! Thank you for contributing.

github-advanced-security AI found potential problems Mar 18, 2024

View reviewed changes

Comment thread processing/src/main/java/org/apache/druid/query/filter/LikeDimFilter.java Fixed

twilliamson added 2 commits March 18, 2024 14:15

Take into account whether druid.generic.useDefaultValueForNull is set…

efc35b7

… in LikeDimFilterTest assertions.

Attempt to placate CodeQL.

2b69437

twilliamson changed the title ~~Expected-linear-time LIKE~~ Improve worst-case performance of LIKE filters by 20x Mar 18, 2024

Fix handling of multi-pattern suffixes.

67666ee

gianm approved these changes Apr 24, 2024

View reviewed changes

gianm merged commit 4bdc189 into apache:master Apr 24, 2024

gianm added this to the 30.0.0 milestone Apr 24, 2024

adarshsanjeev mentioned this pull request May 28, 2024

[DRAFT] 30.0.0 release notes #16505

Closed

Conversation

twilliamson commented Mar 18, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Release note

Key changed/added classes in this PR

Uh oh!

Uh oh!

abhishekagarwal87 commented Mar 20, 2024

Uh oh!

gianm commented Apr 1, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

abhishekagarwal87 commented Apr 5, 2024

Uh oh!

twilliamson commented Apr 5, 2024

Uh oh!

imply-cheddar commented Apr 9, 2024

Uh oh!

gianm commented Apr 9, 2024

Uh oh!

gianm commented Apr 10, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

twilliamson commented Apr 10, 2024

Uh oh!

gianm commented Apr 10, 2024

Uh oh!

gianm commented Apr 11, 2024

Uh oh!

twilliamson commented Apr 11, 2024

Uh oh!

twilliamson commented Apr 15, 2024

Uh oh!

twilliamson commented Apr 16, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gianm commented Apr 17, 2024

Uh oh!

gianm commented Apr 24, 2024

Uh oh!

twilliamson commented Apr 24, 2024

Uh oh!

gianm commented Apr 25, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

twilliamson commented Mar 18, 2024 •

edited

Loading

gianm commented Apr 1, 2024 •

edited

Loading

gianm commented Apr 10, 2024 •

edited

Loading

twilliamson commented Apr 16, 2024 •

edited

Loading