Skip to content

Comparing dimensions to each other using a select filter#3928

Merged
jon-wei merged 1 commit intoapache:masterfrom
erikdubbelboer:dimension-compare
Mar 24, 2017
Merged

Comparing dimensions to each other using a select filter#3928
jon-wei merged 1 commit intoapache:masterfrom
erikdubbelboer:dimension-compare

Conversation

@erikdubbelboer
Copy link
Copy Markdown
Contributor

See #3840 for a discussion on the design choices.

Most of the pull request is adding the new argument to SelectorDimFilter. Since SelectorDimFilter has an @JsonCreator constructor we can not add an extra constructor with the extra argument (only one @JsonCreator allowed).

Since all current valueMatcher code is build around matching a fixed value against one column, I had to add a new method to
ValueMatcherColumnSelectorStrategy: ValueGetter makeValueGetter(ValueSelectorType selector). This is then used in the new DimensionsFilter to get all values and compare them to each other.

@erikdubbelboer
Copy link
Copy Markdown
Contributor Author

I'm thinking now that I could add the old constructor to SelectorDimFilter but just not annotate it with @JsonCreator. That way the diff would be a lot smaller as all current invocations could use this old constructor. Would this work and would this be better?

Is there a test somewhere that tests this JSON parsing?

@erikdubbelboer erikdubbelboer force-pushed the dimension-compare branch 2 times, most recently from 81e1c1f to 3b59aa0 Compare February 14, 2017 12:44
@erikdubbelboer
Copy link
Copy Markdown
Contributor Author

I cleaned up the changes a lot by adding the extra constructor to SelectorDimFilter.

Old changes can still be found at: erikdubbelboer@81e1c1f

@gianm gianm requested review from gianm and jon-wei February 25, 2017 02:59
Copy link
Copy Markdown
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't fully review, but have some partial comments for now. Will pick back up again.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please convert whitespace to spaces instead of tabs.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just have this call this(dimension, value, extractionFn, null) so the checks/translations in the constructor only have to be in one place.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check another two things:

  1. value should be null if dimensions is set (someone that sets both is probably confused about usage)
  2. dimension and dimensions should not both be set

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EqualColumnsFilter is a better name

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

throw UnsupportedOperationException

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This String[], and the ValueGetter itself, could both be extracted to constants.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's too bad (for performance) that this means that comparing two integer columns against each other will involve casting both of them to string. I'm ok with developing a way to deal with that in a future patch, but please leave a comment that this is less than ideal.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After this pull requests is merged it's a good idea to fix this. But right now it would result in a lot more code making the pull requests even harder to comprehend.

I think the only way to do this is in makeMatcher to see if all columns are integer or float. And then call a different makeValueMatcher that compares integers/floats instead of strings. This would mean the code of makeValueMatcher and overlap needs to be duplicated twice for integers and floats. That's another 106 lines of code in this file.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do similar things in other areas where we need specializations for performance reasons, so I think that's just a cost of doing business in a language like java. For now the comment is good enough for me.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

List<String> is more typical although I guess it doesn't really matter

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it really matters as it's a fixed list. If you want me to change it let me know.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nah it's fine the way it is.

@erikdubbelboer erikdubbelboer force-pushed the dimension-compare branch 2 times, most recently from 7ef4f48 to 399db85 Compare February 26, 2017 05:53
@erikdubbelboer
Copy link
Copy Markdown
Contributor Author

I'm pretty sure the test timeout has nothing to do with my code.

@jon-wei jon-wei closed this Feb 28, 2017
@jon-wei jon-wei reopened this Feb 28, 2017
@jon-wei
Copy link
Copy Markdown
Contributor

jon-wei commented Feb 28, 2017

@erikdubbelboer it's an issue with the tests that comes up sometimes, I just closed/re-opened to re-run the tests

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about changing this filter such that it accepts a list of DimensionSpec instead, which would allow the user to specify a separate extraction function for each dimension in the comparison?

I can see various use cases where that functionality would be helpful (e.g., extracting and comparing a substring from two columns, but where the two column's base values have totally different formats causing a shared extractionFn to be insufficient)

The DimensionSpecs could also be directly passed to DimensionHandlerUtils.createColumnSelectorPlus, so that this filter doesn't need to manage the extraction functions itself in the matcher.


Related to that, I think it makes sense to have EqualColumnsFilter be a separate type of filter instead of being created from SelectorDimFilter, given that the parameters on the filter are being used differently.

The equal columns use case cares about the list of dimensions and ignores the selector value and single dimension, while the original constant value comparison case uses the value/single dimension and ignores the list of dimensions.

I think it'd be cleaner with fewer rules about what needs to be specified if the filters are totally split since the only common point now is the singular extractionFn.

Separation would also help remove the restriction on only having one extraction function for the equal columns case.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jon-wei, I think you're right that this bears little resemblance to the rest of "selector" and makes more sense as a different DimFilter class. I agree with that.

On using DimensionSpecs instead of dimension names + an extractionFn, I could take or leave that one, so I would be ok with this patch either way. I'm ok with leaving that until someone actually has the need for it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for making a separate class.

Copy link
Copy Markdown
Contributor Author

@erikdubbelboer erikdubbelboer Mar 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should read the rest before I post :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any other filter that allows a DimensionSpec for the dimension?

I don't see how this would work with the extractionFn seeing as this needs to be applied in the ValueMatcher?

Copy link
Copy Markdown
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, it's not necessary to generalize to having a different extractionFn for each column in this patch, but it does make sense to split this out from the "selector" filter.

Comment thread docs/content/querying/filters.md Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with @jon-wei that this has drifted far enough from "selector" that it makes more sense as a separate filter. How about calling it "columnComparison" (& calling the implementation classes ColumnComparisonDimFilter, ColumnComparisonFilter)?

I think "equalColumns" isn't the right name, since what it does on multi-value columns isn't really an "equals" relationship. For one thing it's not transitive, since with the current impl we'd have ["a","b"] == ["a"] and ["a","b"] == ["b"] but ["a"] != ["b"].

Please also document what the filter does on multi-value columns.

@gianm
Copy link
Copy Markdown
Contributor

gianm commented Mar 1, 2017

@erikdubbelboer, I'm going to target this for 0.10.1, assuming you have time to finish working on it. I think this should be good to go if the filter is split out and has docs added for multi-value behavior.

@gianm gianm added this to the 0.10.1 milestone Mar 1, 2017
@gianm gianm added the Feature label Mar 1, 2017
@jon-wei
Copy link
Copy Markdown
Contributor

jon-wei commented Mar 2, 2017

I'm okay with this PR going with the shared extractionFn for now, I'll review again after splitting out the filter

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: redundant null check

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, but it's desired to use CacheKeyBuilder. It will be much easier to use and understand.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for making a separate class.

@erikdubbelboer
Copy link
Copy Markdown
Contributor Author

I think we should decide on the extractionFn. If I add a shared one now but in the future you want to switch to DimensionSpec instead we would have to deprecate the shared one again. Better to decide on a final solution now?

@gianm
Copy link
Copy Markdown
Contributor

gianm commented Mar 11, 2017

There isn't any other filter that does take DimensionSpec and IIRC, the reason is that DimensionSpecs do things that we don't necessarily need or want in most filters (like decorating). The decorating doesn't work with indexes, and most other filters use indexes, so they would have problems accepting DimensionSpecs.

But this filter doesn't use indexes. So it could use DimensionSpecs and still get its job done.

I don't see how this would work with the extractionFn seeing as this needs to be applied in the ValueMatcher?

If I understand what you're talking about correctly, then DimensionHandlerUtils can handle this. You can just pass in the DimensionSpec and don't worry about it -- the value that comes out of the returned selector will be extractionFn'd appropriately.

@gianm
Copy link
Copy Markdown
Contributor

gianm commented Mar 11, 2017

Given that this filter wouldn't use indexes I think it makes sense to use DimensionSpecs.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's better if I split these into multiple lines, or how do you guys like this formatted?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can go with:

    ColumnComparisonDimFilter columnComparisonDimFilter = new ColumnComparisonDimFilter(ImmutableList.<DimensionSpec>of(
        DefaultDimensionSpec.of("abc"),
        DefaultDimensionSpec.of("d")
    ));

I just autoformatted that with our style standards:
https://github.com/druid-io/druid/raw/master/druid_intellij_formatting.xml
https://github.com/druid-io/druid/raw/master/eclipse_formatting.xml

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried the eclipse formatting but that results in a completely different style than all other sources. It turns this line into:

ColumnComparisonDimFilter columnComparisonDimFilter = new ColumnComparisonDimFilter(
    ImmutableList.<DimensionSpec> of(DefaultDimensionSpec.of("abc"), DefaultDimensionSpec.of("d")));

Is the eclipse config up to date or is everyone using intellij?

@erikdubbelboer
Copy link
Copy Markdown
Contributor Author

I changed it to a new filter now and added DimensionSpecs as inputs. Any input on what should change or maybe which other tests should exist?

Copy link
Copy Markdown
Contributor

@jon-wei jon-wei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good generally, had a few comments on formatting

For additional tests, can you add tests that use the ColumnComparisonFilter on long and float columns?

You can look at LongFilteringTest and FloatFilteringTest for reference or add cases there

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EqualColumnsFilter -> ColumnComparisonFilter

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can go with:

    ColumnComparisonDimFilter columnComparisonDimFilter = new ColumnComparisonDimFilter(ImmutableList.<DimensionSpec>of(
        DefaultDimensionSpec.of("abc"),
        DefaultDimensionSpec.of("d")
    ));

I just autoformatted that with our style standards:
https://github.com/druid-io/druid/raw/master/druid_intellij_formatting.xml
https://github.com/druid-io/druid/raw/master/eclipse_formatting.xml

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe rename this test to something else, since dim2 is a multi-value string column

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

formatting, new lines would be easier to read

@erikdubbelboer erikdubbelboer force-pushed the dimension-compare branch 2 times, most recently from 0e972d6 to 910e868 Compare March 17, 2017 06:01
@erikdubbelboer
Copy link
Copy Markdown
Contributor Author

I made all changes. I didn't really make a new test for long and float but instead added a long and float value to the current tests.

@gianm
Copy link
Copy Markdown
Contributor

gianm commented Mar 17, 2017

Thanks @erikdubbelboer, will take another look today.

Copy link
Copy Markdown
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM subject to the cache key id being changed. thanks @erikdubbelboer!

Comment thread docs/content/querying/filters.md Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the trailing ### is ignored, so this is probably harmless, but it's also not necessary.

Copy link
Copy Markdown
Contributor

@gianm gianm Mar 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This filter should have a new cache id, like DimFilterUtils.COLUMN_COMPARISON_CACHE_ID.

@erikdubbelboer
Copy link
Copy Markdown
Contributor Author

I changed the cache key.

Copy link
Copy Markdown
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@erikdubbelboer
Copy link
Copy Markdown
Contributor Author

The failed test has nothing to do with this pull request. I get timeouts like that when I test locally as well. Maybe the test timeouts should be increased a bit.

@gianm
Copy link
Copy Markdown
Contributor

gianm commented Mar 21, 2017

I restarted the tester.

That test has a history of being a little elusive, see some attempts to fix it at: https://github.com/druid-io/druid/search?q=druidcoordinatortest&type=Issues&utf8=%E2%9C%93.

@erikdubbelboer
Copy link
Copy Markdown
Contributor Author

For me increasing the 60_000L test timeout to 120_000L seems to work. But I guess that could also just be luck.

@jon-wei
Copy link
Copy Markdown
Contributor

jon-wei commented Mar 24, 2017

Looks good, thanks for the contribution!

@jon-wei jon-wei changed the title [WIP] Comparing dimensions to each other using a select filter Comparing dimensions to each other using a select filter Mar 24, 2017
@jon-wei jon-wei merged commit 2cbc476 into apache:master Mar 24, 2017
erikdubbelboer added a commit to atomx/pydruid that referenced this pull request Mar 26, 2017
erikdubbelboer added a commit to atomx/druid that referenced this pull request Mar 26, 2017
gianm pushed a commit to druid-io/pydruid that referenced this pull request May 13, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants