Skip to content

Enable custom regex engine use.#7

Merged
ekelly-1898 merged 4 commits intorapid7:masterfrom
censys:hudclark/custom-regex-engine
Jul 20, 2021
Merged

Enable custom regex engine use.#7
ekelly-1898 merged 4 commits intorapid7:masterfrom
censys:hudclark/custom-regex-engine

Conversation

@hudclark
Copy link
Copy Markdown
Contributor

@hudclark hudclark commented Apr 27, 2021

Enable custom regex engine use.

Allow library users to customize the regular expression engine used
during fingerprint matching. Custom regex engines can be configured
by implementing the RecogPatternMatcher interface.

This feature is desirable as Java's regex engine is susceptible to
catastrophic backtracking.

Description

  • Add the RecogPatternMatcher interface.
  • Update RecogMatcher to use an underlying RecogPatternMatcher instance, rather than java.regex.* classes directly
  • Implement a JavaRegexRecogMatcher to use as a default an provide backwards compatability.

How Has This Been Tested?

  • JavaRegexRecogMatcher has been exercised via existing unit tests. Since it is the default, it is used in all existing tests.
  • CustomPatternMatcherTest has been added to validate custom RecogPatternMatchers may be used.

Types of changes

  • New feature (non-breaking change which adds functionality)

Checklist:

  • I have updated the documentation accordingly (or changes are not required).
  • I have added tests to cover my changes (or new tests are not required).
  • All new and existing tests passed.

Allow library users to customize the regular expression engine used
during fingerprint matching. Custom regex engines can be configured
by implementing the RecogPatternMatcher interface.

This feature is desirable as Java's regex engine is susceptible to
catastrophic backtracking.
@hudclark
Copy link
Copy Markdown
Contributor Author

For reference, here's an example of how a user would use the linear-time regex engine Rej2

Rej2PatternMatcher.java
package com.rapid7.recog.pattern;

import com.google.re2j.Matcher;
import com.google.re2j.Pattern;

public class Re2jPatternMatcher implements RecogPatternMatcher {

  private static class Re2jPatternMatchResult implements RecogPatternMatchResult {
    private final Matcher matcher;

    Re2jPatternMatchResult(Matcher matcher) {
      this.matcher = matcher;
    }

    @Override
    public int groupCount() {
      return matcher.groupCount();
    }

    @Override
    public String group(int group) {
      return matcher.group(group);
    }

    @Override
    public String group(String group) {
      return matcher.group(group);
    }
  }

  private final Pattern pattern;

  public Re2jPatternMatcher(String pattern, int flags) {
    this.pattern = Pattern.compile(pattern, flags);
  }

  @Override
  public String getPattern() {
    return pattern.pattern();
  }

  @Override
  public int getFlags() {
    return pattern.flags();
  }

  @Override
  public boolean matches(String input) {
    return input != null && pattern.matcher(input).find();
  }

  @Override
  public RecogPatternMatchResult match(String input) {
    if (input == null) {
      return null;
    }
    Matcher matcher = pattern.matcher(input);
    return matcher.find() ? new Re2jPatternMatchResult(matcher) : null;
  }
}

@hudclark
Copy link
Copy Markdown
Contributor Author

For a little context around why this is needed, in http_server.xml, the following fingerprint can be found. link

  <fingerprint pattern="^(?:(?:\d+.){3}\d+):\d{1,4}$">
    <description>A banner consisting of an IP address and port -- assert nothing.</description>
    <example>192.168.0.4:9999</example>
  </fingerprint>

When running a Server header value found in the wild (pasted below), java's regex implementation will not complete and will pin a cpu until cancelled.

Details Offending Input
363738393061626364653132333435363738393061626364653132333435363738393061626364653132333435363738393061626364653132333435363738393061626364653132333435363738393061626364653132333435363738393061626364653132333435363738393061626364653132333435363738393061626364653132333435363738393061626364653132333435363738363738393061626364653132333435363738393061626364653132333435363738393061626364653132333435363738393061626364653132333435363738393061626364653132333435363738393061626364653132333435363738393061626364653132333435363738393061626364653132333435363738393061626364653132333435363738393061626364653132333435363738363738393061626364653132333435363738393061626364653132333435363738393061626364653132333435363738393061626364653132333435363738393061626364653132333435363738393061626364653132333435363738393061626364653132333435363738393061626364653132333435363738393061626364653132333435363738393061626364653132333435363738363738393061626364653132333435363738393061626364653132333435363738393061626364653132333435363738393061626646531323334353637383930616263646531323334353637383930616263646531323334353637383930616263646531323334353637383930616263646531323334353637383930616263646531323334353637383930616263646531323334353637383637383930616263646531323334353637383930616263646531323334353637383930616263646531323343536373839306162636465313233343536373839306162636465313233343536373839306162636465313233343536373839306162636465313233343536373893061626364653132333435363738393061626364653132333435363738393061626364653132333435363738

While patching offending fingerprints so that they're not susceptible to backtracking is an option, my team as opted to instead use a regex engine that can guarantee linear-time execution.

Tagging @gschneider-r7, @tsellers-r7, @dabdine-r7 to get the ball rolling here 😄

@tsellers-r7
Copy link
Copy Markdown

Thanks for the PR @hudclark I'll have to defer to @gschneider-r7 on this project. That being said, that regex explosion is a bit rough. Please let us know anytime you see something like that because it likely impacts other engines.

@gschneider-r7
Copy link
Copy Markdown
Contributor

This seems fine to me at a glance, though I haven't worked on this project in a while so I have to defer to those who would be impacted by changes to recog-java.

@ihorbatiuk-r7 @rkirk-r7 @ekelly-rapid7

Note that this repo isn't building on travis-ci anymore so I don't know if it is currently releasable to maven central or if someone has already addressed that internally at R7.

hudclark added 2 commits May 4, 2021 09:16
Update the README to include documentation about the default regular
expression engine used in recog-java and provide an example for how
users can override this behavior.
Previously, this interface was package-private and didn't support
library users supplying their own factories.
Copy link
Copy Markdown

@ekelly-1898 ekelly-1898 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a good change to me.

Thanks very much for the contribution @hudclark. Apologies for the delay in reviewing.

I'll see about getting this landed and released.

@ekelly-1898
Copy link
Copy Markdown

Hi @hudclark apologies again for the delay. We've got the CI side sorted on our side. Could you please merge latest master to your branch and I'll be able to land and release this.

@hudclark
Copy link
Copy Markdown
Contributor Author

Hi @hudclark apologies again for the delay. We've got the CI side sorted on our side. Could you please merge latest master to your branch and I'll be able to land and release this.

Thanks @ekelly-rapid7! I've just merged master into this.

@ekelly-1898 ekelly-1898 merged commit 0557833 into rapid7:master Jul 20, 2021
@ekelly-1898
Copy link
Copy Markdown

@hudclark this has been released and should be available through maven.
Many thanks for the contribution and apologies for the time it took us to get it landed and released, thanks for your patience.

@tsellers-r7
Copy link
Copy Markdown

Just an FYI, as part of working on rapid7/recog#367 I used @hudclark 's example catastrophic use case ( #7 (comment) ) against every regex currently in recog to ensure that nothing else would choke on it. I've identified a few other regexes that could probably use some performance tuning but none of them are likely to be noticeable in normal use.

While patching offending fingerprints so that they're not susceptible to backtracking is an option, my team as opted to instead use a regex engine that can guarantee linear-time execution.

I also checked to ensure that every regex will compile with re2 which doesn't support many PCRE features (negative lookaheads, etc) that are known to cause performance issues in certain situations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants