Comparing dimensions to each other using a select filter #3928
jon-wei merged 1 commit into apache:master from [...]
Conversation
I'm thinking now that I could add the old constructor to [...] Is there a test somewhere that tests this JSON parsing?
81e1c1f to 3b59aa0
I cleaned up the changes a lot by adding the extra constructor to [...] Old changes can still be found at: erikdubbelboer@81e1c1f
gianm left a comment
Didn't fully review, but have some partial comments for now. Will pick back up again.
Please convert whitespace to spaces instead of tabs.
Just have this call this(dimension, value, extractionFn, null) so the checks/translations in the constructor only have to be in one place.
Please check another two things:
- value should be null if dimensions is set (someone that sets both is probably confused about usage)
- dimension and dimensions should not both be set
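The two checks above, together with the earlier suggestion to have the old constructor delegate to the new one, could look roughly like the sketch below. SelectorArgsSketch is a hypothetical stand-in rather than the real filter class, the extractionFn argument is left out for brevity, and the third check (rejecting a filter with neither field set) is an assumption that was not requested in the review.

```java
import java.util.List;

// Hypothetical stand-in for the filter under review; not the actual Druid class.
final class SelectorArgsSketch {
    final String dimension;
    final String value;
    final List<String> dimensions;

    // The older constructor delegates, so the checks live in one place.
    SelectorArgsSketch(String dimension, String value) {
        this(dimension, value, null);
    }

    SelectorArgsSketch(String dimension, String value, List<String> dimensions) {
        if (dimension != null && dimensions != null) {
            throw new IllegalArgumentException("dimension and dimensions must not both be set");
        }
        if (dimensions != null && value != null) {
            throw new IllegalArgumentException("value must be null when dimensions is set");
        }
        if (dimension == null && dimensions == null) {
            // Assumption: a filter with neither field set is also rejected.
            throw new IllegalArgumentException("either dimension or dimensions is required");
        }
        this.dimension = dimension;
        this.value = value;
        this.dimensions = dimensions;
    }
}
```

Keeping every check in the widest constructor means a later change to the rules only has to be made once.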
EqualColumnsFilter is a better name
throw UnsupportedOperationException
This String[], and the ValueGetter itself, could both be extracted to constants.
It's too bad (for performance) that this means that comparing two integer columns against each other will involve casting both of them to string. I'm ok with developing a way to deal with that in a future patch, but please leave a comment that this is less than ideal.
After this pull request is merged it's a good idea to fix this. But right now it would result in a lot more code, making the pull request even harder to comprehend.
I think the only way to do this is for makeMatcher to check whether all columns are integer or float, and then call a different makeValueMatcher that compares integers/floats instead of strings. This would mean the code of makeValueMatcher and overlap needs to be duplicated twice, once for integers and once for floats. That's another 106 lines of code in this file.
We do similar things in other areas where we need specializations for performance reasons, so I think that's just a cost of doing business in a language like java. For now the comment is good enough for me.
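To illustrate the kind of specialization being discussed, here is a minimal sketch with one generic string path and one primitive long path. It is not the patch's code: the class and method names are hypothetical, columns are modelled as plain suppliers, only single-valued columns are handled, and at least one column is assumed.

```java
import java.util.List;
import java.util.Objects;
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical sketch; columns are modelled as suppliers of the current row's value.
final class TypedComparisonSketch {
    // Generic path: every column value is compared as a string.
    static BooleanSupplier stringMatcher(List<Supplier<String>> columns) {
        return () -> {
            String first = columns.get(0).get();
            for (int i = 1; i < columns.size(); i++) {
                if (!Objects.equals(first, columns.get(i).get())) {
                    return false;
                }
            }
            return true;
        };
    }

    // Specialized path: compares primitive longs, with no string conversion.
    static BooleanSupplier longMatcher(List<LongSupplier> columns) {
        return () -> {
            long first = columns.get(0).getAsLong();
            for (int i = 1; i < columns.size(); i++) {
                if (first != columns.get(i).getAsLong()) {
                    return false;
                }
            }
            return true;
        };
    }
}
```

The duplication is visible even at this size: the loop is the same, only the value type differs, which is the cost mentioned above.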
List<String> is more typical although I guess it doesn't really matter
I don't think it really matters as it's a fixed list. If you want me to change it let me know.
7ef4f48 to 399db85
I'm pretty sure the test timeout has nothing to do with my code.
@erikdubbelboer it's an issue with the tests that comes up sometimes; I just closed/re-opened to re-run the tests.
What do you think about changing this filter such that it accepts a list of DimensionSpec instead, which would allow the user to specify a separate extraction function for each dimension in the comparison?
I can see various use cases where that functionality would be helpful (e.g., extracting and comparing a substring from two columns whose base values have totally different formats, so that a shared extractionFn would be insufficient).
The DimensionSpecs could also be directly passed to DimensionHandlerUtils.createColumnSelectorPlus, so that this filter doesn't need to manage the extraction functions itself in the matcher.
Related to that, I think it makes sense to have EqualColumnsFilter be a separate type of filter instead of being created from SelectorDimFilter, given that the parameters on the filter are being used differently.
The equal columns use case cares about the list of dimensions and ignores the selector value and single dimension, while the original constant value comparison case uses the value/single dimension and ignores the list of dimensions.
I think it'd be cleaner with fewer rules about what needs to be specified if the filters are totally split since the only common point now is the singular extractionFn.
Separation would also help remove the restriction on only having one extraction function for the equal columns case.
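The substring use case mentioned above can be sketched as follows. This is only an illustration under stated assumptions: ColumnSpec is a hypothetical pairing of a column name with its own extraction function (it is not Druid's DimensionSpec), rows are modelled as a plain map, and only single-valued columns are covered.

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.function.Function;

// Hypothetical sketch of comparing columns that each need their own extraction.
final class PerColumnExtractionSketch {
    // Hypothetical pairing of a column name with its own extraction function.
    static final class ColumnSpec {
        final String column;
        final Function<String, String> extraction;

        ColumnSpec(String column, Function<String, String> extraction) {
            this.column = column;
            this.extraction = extraction;
        }

        // Reads the raw value from the row and applies this column's extraction.
        String read(Map<String, String> row) {
            String raw = row.get(column);
            return raw == null ? null : extraction.apply(raw);
        }
    }

    // True when all extracted values in the row are equal.
    static boolean allEqual(Map<String, String> row, List<ColumnSpec> specs) {
        String first = specs.get(0).read(row);
        for (int i = 1; i < specs.size(); i++) {
            if (!Objects.equals(first, specs.get(i).read(row))) {
                return false;
            }
        }
        return true;
    }
}
```

For example, one column holding "US-1234" and another holding "1234/US" could be compared on the country part by giving the first spec a prefix extraction and the second a suffix extraction, which a single shared function could not do.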
@jon-wei, I think you're right that this bears little resemblance to the rest of "selector" and makes more sense as a different DimFilter class. I agree with that.
On using DimensionSpecs instead of dimension names + an extractionFn, I could take or leave that one, so I would be ok with this patch either way. I'm ok with leaving that until someone actually has the need for it.
+1 for making a separate class.
I should read the rest before I post :)
Is there any other filter that allows a DimensionSpec for the dimension?
I don't see how this would work with the extractionFn seeing as this needs to be applied in the ValueMatcher?
gianm left a comment
IMO, it's not necessary to generalize to having a different extractionFn for each column in this patch, but it does make sense to split this out from the "selector" filter.
Agree with @jon-wei that this has drifted far enough from "selector" that it makes more sense as a separate filter. How about calling it "columnComparison" (& calling the implementation classes ColumnComparisonDimFilter, ColumnComparisonFilter)?
I think "equalColumns" isn't the right name, since what it does on multi-value columns isn't really an "equals" relationship. For one thing it's not transitive, since with the current impl we'd have ["a","b"] == ["a"] and ["a","b"] == ["b"] but ["a"] != ["b"].
Please also document what the filter does on multi-value columns.
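The non-transitivity point can be checked with a small sketch. It assumes, as the comment above describes, that two multi-value rows "match" when they share at least one value; the class and method names are hypothetical, and empty or null rows are not considered.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the multi-value matching rule described above.
final class MultiValueOverlapSketch {
    // True when the two value lists share at least one element.
    static boolean overlap(List<String> a, List<String> b) {
        Set<String> seen = new HashSet<>(a);
        for (String value : b) {
            if (seen.contains(value)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> ab = Arrays.asList("a", "b");
        List<String> onlyA = Arrays.asList("a");
        List<String> onlyB = Arrays.asList("b");
        System.out.println(overlap(ab, onlyA));    // true
        System.out.println(overlap(ab, onlyB));    // true
        System.out.println(overlap(onlyA, onlyB)); // false
    }
}
```

Because ["a"] and ["b"] both match ["a","b"] but not each other, the relation is not an equality, which is the argument for a name like "columnComparison".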
@erikdubbelboer, I'm going to target this for 0.10.1, assuming you have time to finish working on it. I think this should be good to go if the filter is split out and has docs added for multi-value behavior.
I'm okay with this PR going with the shared extractionFn for now. I'll review again after the filter is split out.
nit: redundant null check
This looks good, but it would be preferable to use CacheKeyBuilder. It will be much easier to use and understand.
+1 for making a separate class.
I think we should decide on the [...]
There isn't any other filter that does take [...] But this filter doesn't use indexes, so it could use DimensionSpecs and still get its job done.
If I understand what you're talking about correctly, then DimensionHandlerUtils can handle this. You can just pass in the DimensionSpec and not worry about it -- the value that comes out of the returned selector will be extractionFn'd appropriately.
Given that this filter wouldn't use indexes I think it makes sense to use DimensionSpecs.
399db85 to a9f04d8
I guess it's better if I split these into multiple lines, or how do you guys like this formatted?
You can go with:
ColumnComparisonDimFilter columnComparisonDimFilter = new ColumnComparisonDimFilter(ImmutableList.<DimensionSpec>of(
    DefaultDimensionSpec.of("abc"),
    DefaultDimensionSpec.of("d")
));
I just autoformatted that with our style standards:
https://github.com/druid-io/druid/raw/master/druid_intellij_formatting.xml
https://github.com/druid-io/druid/raw/master/eclipse_formatting.xml
I tried the eclipse formatting but that results in a completely different style than all other sources. It turns this line into:
ColumnComparisonDimFilter columnComparisonDimFilter = new ColumnComparisonDimFilter(
    ImmutableList.<DimensionSpec> of(DefaultDimensionSpec.of("abc"), DefaultDimensionSpec.of("d")));
Is the eclipse config up to date or is everyone using intellij?
I changed it to a new filter now and added DimensionSpecs as inputs. Any input on what should change or maybe which other tests should exist?
a9f04d8 to 651462e
jon-wei left a comment
Looks good generally, had a few comments on formatting
For additional tests, can you add tests that use the ColumnComparisonFilter on long and float columns?
You can look at LongFilteringTest and FloatFilteringTest for reference or add cases there
EqualColumnsFilter -> ColumnComparisonFilter
maybe rename this test to something else, since dim2 is a multi-value string column
formatting, new lines would be easier to read
0e972d6 to 910e868
I made all the changes. I didn't add a separate test for long and float columns; instead I added a long and a float value to the current tests.
Thanks @erikdubbelboer, will take another look today.
gianm left a comment
LGTM subject to the cache key id being changed. thanks @erikdubbelboer!
I think the trailing ### is ignored, so this is probably harmless, but it's also not necessary.
This filter should have a new cache id, like DimFilterUtils.COLUMN_COMPARISON_CACHE_ID.
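As a rough illustration of why a distinct cache id matters, the sketch below prefixes each key with a per-filter-type byte so that two filter types over the same dimension names cannot produce the same key. It does not use Druid's CacheKeyBuilder or DimFilterUtils; the class, the id values, and the separator byte are all hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical sketch; not Druid's CacheKeyBuilder, and the id values are made up.
final class CacheKeySketch {
    static final byte SELECTOR_ID = 0x00;          // hypothetical id
    static final byte COLUMN_COMPARISON_ID = 0x14; // hypothetical id
    // 0xFF never occurs in valid UTF-8, so it cannot be confused with string content.
    static final byte SEPARATOR = (byte) 0xFF;

    // Builds a key of the form: typeId, then each part followed by a separator.
    static byte[] cacheKey(byte typeId, List<String> parts) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(typeId);
        for (String part : parts) {
            byte[] bytes = part.getBytes(StandardCharsets.UTF_8);
            out.write(bytes, 0, bytes.length);
            out.write(SEPARATOR);
        }
        return out.toByteArray();
    }
}
```

With a shared id, a selector filter and a column comparison filter built from the same strings would serialize identically and could return each other's cached results.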
910e868 to fb82202
I changed the cache key.
fb82202 to 57f79a8
The failed test has nothing to do with this pull request. I get timeouts like that when I test locally as well. Maybe the test timeouts should be increased a bit.
I restarted the tester. That test has a history of being a little elusive, see some attempts to fix it at: https://github.com/druid-io/druid/search?q=druidcoordinatortest&type=Issues&utf8=%E2%9C%93.
For me increasing the [...]
Looks good, thanks for the contribution!
See #3840 for a discussion on the design choices.
Most of the pull request is adding the new argument to SelectorDimFilter. Since SelectorDimFilter has an @JsonCreator constructor, we cannot add an extra constructor with the extra argument (only one @JsonCreator is allowed).
Since all current valueMatcher code is built around matching a fixed value against one column, I had to add a new method to ValueMatcherColumnSelectorStrategy: ValueGetter makeValueGetter(ValueSelectorType selector). This is then used in the new DimensionsFilter to get all values and compare them to each other.
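A minimal sketch of how such a getter could be used to compare columns follows. The shape of ValueGetter here (a no-argument get() returning a String[] for the current row) and the pairwise overlap rule are assumptions made for illustration; the class name is hypothetical and this is not the code from the patch.

```java
import java.util.List;
import java.util.Objects;

// Hypothetical sketch of a matcher built on per-column value getters.
final class ValueGetterMatcherSketch {
    // Assumed shape: returns the current row's values for one column.
    interface ValueGetter {
        String[] get();
    }

    // True when the two arrays share at least one value.
    static boolean overlap(String[] a, String[] b) {
        for (String x : a) {
            for (String y : b) {
                if (Objects.equals(x, y)) {
                    return true;
                }
            }
        }
        return false;
    }

    // Assumption: the row matches only if every pair of columns overlaps.
    static boolean matches(List<ValueGetter> getters) {
        for (int i = 0; i < getters.size(); i++) {
            String[] left = getters.get(i).get();
            for (int j = i + 1; j < getters.size(); j++) {
                if (!overlap(left, getters.get(j).get())) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

Unlike a value matcher bound to one fixed value, the getter only exposes the row's values, which leaves the comparison logic to the filter.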