Skip to content

Add IPv4 druid expressions#8197

Merged
gianm merged 5 commits intoapache:masterfrom
ccaominh:ipv4-druid-expressions
Aug 1, 2019
Merged

Add IPv4 druid expressions#8197
gianm merged 5 commits intoapache:masterfrom
ccaominh:ipv4-druid-expressions

Conversation

@ccaominh
Copy link
Copy Markdown
Contributor

@ccaominh ccaominh commented Jul 30, 2019

Description

New druid expressions for filtering IPv4 addresses:

  • ipv4address_match: Check if IP address belongs to a subnet
  • ipv4address_parse: Convert string IP address to long
  • ipv4address_stringify: Convert long IP address to string

These expressions operate on IP addresses represented as either strings
or longs. The filtering is more efficient when operating on IP addresses
as longs. In other words, the intended use case is:

  1. Use ipv4address_parse to convert to long at ingestion time
  2. Use ipv4address_match to filter (on longs) at query time
  3. Use ipv4adress_stringify to convert to (readable) string at query
    time

A follow-up PR will add the equivalent SQL functions.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths.

ccaominh added 2 commits July 30, 2019 11:03
New druid expressions for filtering IPv4 addresses:
- ipv4address_match: Check if IP address belongs to a subnet
- ipv4address_parse: Convert string IP address to long
- ipv4address_stringify: Convert long IP address to string

These expressions operate on IP addresses represented as either strings
or longs, so that they can be applied to dimensions with mixed
representation of IP addresses. The filtering is more efficient when
operating on IP addresses as longs. In other words, the intended use
case is:

1) Use ipv4address_parse to convert to long at ingestion time
2) Use ipv4address_match to filter (on longs) at query time
3) Use ipv4adress_stringify to convert to (readable) string at query
time


## IP Address Functions
| function | description |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason not to use shorter names like ipv4_match, etc?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it's more convenient to type and it's reasonable that users will understand "ipv4" to mean an "IP address", I'm ok with using the shorter names you suggested.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was thinking about balancing between clarity and keystrokes. I think the current names are great, thanks.

Comment thread docs/content/misc/math-expr.md Outdated
## IP Address Functions
| function | description |
| --- | --- |
| ipv4address_match(address, subnet [,inclusive]) | Returns 1 if the string or long IP `address` belongs to the `subnet` CIDR notation literal string, else 0. The optional literal boolean `inclusive` parameter indicates whether the network and broadcast IP addresses should be included in the subnet and defaults to `false`. This function is more efficient when operating on IP addresses as longs instead of strings.|
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default of false for inclusive means doing a filter of 1.2.3.4/32 will match nothing (instead of matching the single address 1.2.3.4). It seems a bit weird to me. I also don't see an obvious reason why you'd want to exclude the network and broadcast addresses by default, even for a bigger subnet. So for these reasons I'd consider making inclusive false by default (or naming it exclusive and making it false by default).

If I'm missing something feel free to let me know.

Copy link
Copy Markdown
Contributor Author

@ccaominh ccaominh Jul 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The network and broadcast addresses aren't usable IP addresses in the subnet, which is why I excluded them by default. However, since they do match the subnet, I'll change the default to include them.

I'm also ok with removing the option entirely to simplify the code/tests/documentation.

Comment thread docs/content/misc/math-expr.md Outdated
| function | description |
| --- | --- |
| ipv4address_match(address, subnet [,inclusive]) | Returns 1 if the string or long IP `address` belongs to the `subnet` CIDR notation literal string, else 0. The optional literal boolean `inclusive` parameter indicates whether the network and broadcast IP addresses should be included in the subnet and defaults to `false`. This function is more efficient when operating on IP addresses as longs instead of strings.|
| ipv4address_parse(address) | Parses string or long into an IPv4 address as a long. Returns null if `address` cannot be represented as an IPv4 address. |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In "parses string or long", would longs just be passed through unchanged? If so I'd suggest the docs say that. Please also include what formats are supported for strings. (Dotted decimal only or are there others?)


private String createErrMsg(String msg)
{
String prefix = "Function[" + name() + "] ";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This kind of code is used frequently across Exprs, since you're making some effort here to factor it out into separate methods, how about putting it into a helper class? ExprUtils would be a good choice, I think.


private void checkLiteralArgument(Expr arg, String name)
{
Preconditions.checkArgument(arg.isLiteral(), createErrMsg(name + " arg must be a literal"));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here about ExprUtils.

{
boolean match;
try {
long longValue = Long.parseLong(stringValue);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GuavaUtils.tryParseLong might be speedier on non-longs, since it doesn't throw exceptions with stack traces filled in.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, didn't know about that method

// 2) convert string to string

try {
long longValue = Long.parseLong(stringValue);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment: GuavaUtils.tryParseLong might be speedier on non-longs, since it doesn't throw exceptions with stack traces filled in.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this method is sort of duplicating the functionality already in StringExprEval.asLong, and should maybe just use that instead?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After thinking more about the use cases, I've decided to simplify the expressions so that the only valid string format is dotted-decimal (previously, it'd also accept strings that were parseable as longs).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds reasonable to me. I don't think it's likely people will have strings that are parseable as longs. If they do, they could always cast the arg to long before feeding it into IPv4 functions.

{
if (string != null) {
try {
InetAddress address = InetAddresses.forString(string);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment "Do not use java.lang.InetAddress#getByName() as it may do DNS lookups" on another call to InetAddresses.forString in IPv4AddressParseExprMacro is useful and would be nice here too.

Although, I'd also consider changing the signature of this method to be @Nullable InetAddress extractIPv4Address(String) so it could be used at that call site too, so we wouldn't need two copies of the same comment. Then to make the method easier to use at the original call sites I'd also consider having other methods in here:

  • @Nullable String inet4AddressToString(InetAddress)
  • @Nullable int inet4AddressToInt(InetAddress)

In the hopes that it makes the ExprMacro impls simpler.

If it doesn't, don't worry about it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the methods you suggested and I think it makes the ExprMacros impls simpler.

if (ipv4 == null) {
match = isLongMatch(stringValue);
} else {
match = subnetInfo.isInRange(ipv4);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like SubnetInfo.isInRange(String) converts the String address into an int anyway. We could have fed it an int directly. May be nicer to do something like

subnetInfo.isInRange(
  IPv4AddressExprUtils.inet4AddressToInt(
    IPv4AddressExprUtils.extractIPv4Address(stringValue)
  )
)

(With the appropriate null checks sprinkled in)

* @see IPv4AddressParseExprMacro
* @see IPv4AddressStringifyExprMacro
*/
public class IPv4AddressMatchExprMacro implements ExprMacroTable.ExprMacro
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason these functions defined via ExprMacro in druid-processing instead of in Function.java in druid-core? As far as I can tell this doesn't really rely on anything in the processing module, and is not an extension, so it seems like adding it there instead it would be a much lighter touch approach with a lot less boilerplate code and test surface area.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made match an ExprMacro so that it could pre-compute the subnet matcher at parse time. parse and stringify don't do any precomputation, but since I wanted to colocate related expressions, I made both of those ExprMacros too.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IsSubnetInfo expensive to compute? Also, is it best that it only takes literal arguments and not be able to write an expression to compute the subnet argument?

There is probably a sort of janky way that stuff could be pre-computed in Function involving using validateArguments since that is called at parse time, but maybe it would be nice to plumb a more official way to do such things in the future.

Side note, please don't mark comments as resolved, leave that for the reviewer who made the comment to do.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Building SubnetInfo requires validating if the subnet argument has a valid (CIDR-notation) format, which only needs to be done once. What kinds of expressions are you thinking of for the subnet argument?

I've undone all the comment threads that I marked as resolved (was using that to keep track of the ones I've addressed in my branch).

Copy link
Copy Markdown
Contributor

@gianm gianm Jul 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is probably a sort of janky way that stuff could be pre-computed in Function involving using validateArguments since that is called at parse time, but maybe it would be nice to plumb a more official way to do such things in the future.

The purpose of ExprMacro is to provide a way to precompute certain things before doing the row-by-row evaluation, so IMO it does make more sense to use ExprMacro (which is designed for this) rather than hook into parse-time Function methods.

Any reason these functions defined via ExprMacro in druid-processing instead of in Function.java in druid-core?

It's probably an accident that Function and ExprMacro are in different packages. I added ExprMacro in #4365 and I don't recall having a good reason for having them be in different places. Fwiw, I don't have a strong separation in my mind between the responsibility of processing and core, anyway, largely agreeing with this comment: #4312 (comment)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of ExprMacro is to provide a way to precompute certain things before doing the row-by-row evaluation, so IMO it does make more sense to use ExprMacro (which is designed for this) rather than hook into parse-time Function methods.

Thanks for the explanation, 👍 I had stored away in my head for whatever reason that the main purpose of ExprMacro was so extensions could define expressions. I wonder if there is a way we could cut down the amount of code/test that needs to be added for ExprMacro that live in main Druid, because it does seem to take a lot more to add an ExprMacro compared to a Function, but I don't have any good ideas at the moment, nor am I sure that allowing Parser.getFunction to also take the list of args and allow initializing a Function when creating the FunctionExpr would be the appropriate.

// 2) convert string to string

try {
long longValue = Long.parseLong(stringValue);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this method is sort of duplicating the functionality already in StringExprEval.asLong, and should maybe just use that instead?

}
}
catch (IllegalArgumentException ignored) {
// fall through (Invalid IPv4 adddress string)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naively this feels like it could be very expensive to eat an exception every row we evaluate an expression for the case where all values in the column will fall through to parseAsLong. Is there a less explodey way to check if the string is chill?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One option is to compare against a regular expression to validate the format before trying to parse.

Copy link
Copy Markdown
Contributor Author

@ccaominh ccaominh Jul 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option is InetAddresses.isInetAddress(), although in this case I want something that only returns true for ipv4 and not ipv6.

private static Long parseAsLong(String stringValue)
{
try {
Long value = Long.valueOf(stringValue);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another place for GuavaUtils.tryParseLong (returns null instead of throwing an exception, so try/catch could be dropped)

@ccaominh
Copy link
Copy Markdown
Contributor Author

@gianm, @clintropolis: Ready for re-review

Copy link
Copy Markdown
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

latest changes lgtm, thanks 👍

* @see IPv4AddressParseExprMacro
* @see IPv4AddressStringifyExprMacro
*/
public class IPv4AddressMatchExprMacro implements ExprMacroTable.ExprMacro
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of ExprMacro is to provide a way to precompute certain things before doing the row-by-row evaluation, so IMO it does make more sense to use ExprMacro (which is designed for this) rather than hook into parse-time Function methods.

Thanks for the explanation, 👍 I had stored away in my head for whatever reason that the main purpose of ExprMacro was so extensions could define expressions. I wonder if there is a way we could cut down the amount of code/test that needs to be added for ExprMacro that live in main Druid, because it does seem to take a lot more to add an ExprMacro compared to a Function, but I don't have any good ideas at the moment, nor am I sure that allowing Parser.getFunction to also take the list of args and allow initializing a Function when creating the FunctionExpr would be the appropriate.

Copy link
Copy Markdown
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I think this version looks a lot cleaner, just had a couple remaining comments.



## IP Address Functions
| function | description |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was thinking about balancing between clarity and keystrokes. I think the current names are great, thanks.

import java.util.List;

/**
* <pre>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reason to include these tags, since we generally write javadocs with reading in source form in mind. (Nobody is looking at the generated docs AFAIK)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tags also preserve the formatting when invoking "Quick Documentation" in IntelliJ.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, well, good point, in that case feel free to leave them.

*/
static boolean isValidAddress(@Nullable String string)
{
return string != null && IPV4_PATTERN.matcher(string).matches();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This'll accept some invalid IP-like strings, such as 300.300.300.300. It seems like it's mostly being used to avoid exception-throwing, so how about changing the name to isPossiblyValidAddress and emphasizing the might-be-false-positive nature of the check?

Or alternatively, making the check exact and keeping the name the same.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! I'll change the regex to be tighter and add more test cases.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Copy link
Copy Markdown
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@gianm gianm added this to the 0.16.0 milestone Aug 1, 2019
@gianm gianm merged commit 7783b31 into apache:master Aug 1, 2019
@ccaominh ccaominh deleted the ipv4-druid-expressions branch August 1, 2019 19:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants