Feature to "fix" filtering on multi-valued dimensions #2130
Merged
---
layout: doc_page
---

Druid supports "multi-valued" dimensions. See the section on multi-valued columns in [segments](../design/segments.html) for details of the internal representation. This document describes the behavior of groupBy queries (topN behaves similarly) when a multi-valued dimension is used as the dimension being grouped on.
Suppose you have a dataSource with a segment that contains the following rows, with a multi-valued dimension called `tags`.

```
2011-01-12T00:00:00.000Z,["t1","t2","t3"]  #row1
2011-01-13T00:00:00.000Z,["t3","t4","t5"]  #row2
2011-01-14T00:00:00.000Z,["t5","t6","t7"]  #row3
```
### Group-By query with no filtering

See [GroupBy querying](groupbyquery.html) for details.

```json
{
  "queryType": "groupBy",
  "dataSource": "test",
  "intervals": [
    "1970-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z"
  ],
  "granularity": {
    "type": "all"
  },
  "dimensions": [
    {
      "type": "default",
      "dimension": "tags",
      "outputName": "tags"
    }
  ],
  "aggregations": [
    {
      "type": "count",
      "name": "count"
    }
  ]
}
```
returns the following result:

```json
[
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 1,
      "tags": "t1"
    }
  },
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 1,
      "tags": "t2"
    }
  },
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 2,
      "tags": "t3"
    }
  },
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 1,
      "tags": "t4"
    }
  },
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 2,
      "tags": "t5"
    }
  },
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 1,
      "tags": "t6"
    }
  },
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 1,
      "tags": "t7"
    }
  }
]
```

Notice how the original rows are "exploded" into multiple rows and then merged.
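The explode-and-merge behavior can be sketched in a few lines of Python. This is a simulation of the query semantics, not Druid's actual implementation:

```python
from collections import Counter

# The three example rows, each with a multi-valued "tags" dimension.
rows = [
    ["t1", "t2", "t3"],  # row1
    ["t3", "t4", "t5"],  # row2
    ["t5", "t6", "t7"],  # row3
]

# Each row is "exploded" into one row per tag value, then exploded rows
# with the same value are merged by summing their counts.
counts = Counter(tag for row in rows for tag in row)

print(dict(sorted(counts.items())))
# {'t1': 1, 't2': 1, 't3': 2, 't4': 1, 't5': 2, 't6': 1, 't7': 1}
```

The output matches the groupBy result above: "t3" and "t5" each appear in two rows, so they merge to a count of 2.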
### Group-By query with a selector query filter

See [query filters](filters.html) for details of the selector query filter.

```json
{
  "queryType": "groupBy",
  "dataSource": "test",
  "intervals": [
    "1970-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z"
  ],
  "filter": {
    "type": "selector",
    "dimension": "tags",
    "value": "t3"
  },
  "granularity": {
    "type": "all"
  },
  "dimensions": [
    {
      "type": "default",
      "dimension": "tags",
      "outputName": "tags"
    }
  ],
  "aggregations": [
    {
      "type": "count",
      "name": "count"
    }
  ]
}
```
returns the following result:

```json
[
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 1,
      "tags": "t1"
    }
  },
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 1,
      "tags": "t2"
    }
  },
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 2,
      "tags": "t3"
    }
  },
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 1,
      "tags": "t4"
    }
  },
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 1,
      "tags": "t5"
    }
  }
]
```

You might be surprised to see "t1", "t2", "t4", and "t5" included in the results. This happens because the query filter is applied to each row before it is exploded: for multi-valued dimensions, a selector filter for "t3" matches row1 and row2 in their entirety, and the explosion happens afterwards. In general, a query filter matches a row if any individual value among the row's multiple values matches the filter.
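To make the filtering order concrete, here is a small Python simulation of the semantics (again, an illustration rather than Druid code): the selector filter is evaluated against whole rows first, and only the surviving rows are then exploded:

```python
from collections import Counter

rows = [
    ["t1", "t2", "t3"],  # row1
    ["t3", "t4", "t5"],  # row2
    ["t5", "t6", "t7"],  # row3
]

# Step 1: the selector filter matches a row if ANY of its values matches.
matching = [row for row in rows if "t3" in row]  # keeps row1 and row2

# Step 2: only the matching rows are exploded and merged.
counts = Counter(tag for row in matching for tag in row)

print(dict(sorted(counts.items())))
# {'t1': 1, 't2': 1, 't3': 2, 't4': 1, 't5': 1}
```

Note that "t5" now has a count of 1 rather than 2, because row3 (which also contains "t5") did not match the filter, exactly as in the query result above.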
### Group-By query with a selector query filter and additional filter in "dimensions" attribute

To solve the problem above and have only rows for "t3" returned, you would have to use a "filtered dimension spec", as in the query below.

See the section on filtered dimensionSpecs in [dimensionSpecs](dimensionspecs.html) for details.

```json
{
  "queryType": "groupBy",
  "dataSource": "test",
  "intervals": [
    "1970-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z"
  ],
  "filter": {
    "type": "selector",
    "dimension": "tags",
    "value": "t3"
  },
  "granularity": {
    "type": "all"
  },
  "dimensions": [
    {
      "type": "listFiltered",
      "delegate": {
        "type": "default",
        "dimension": "tags",
        "outputName": "tags"
      },
      "values": ["t3"]
    }
  ],
  "aggregations": [
    {
      "type": "count",
      "name": "count"
    }
  ]
}
```
returns the following result:

```json
[
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 2,
      "tags": "t3"
    }
  }
]
```

Note that, for groupBy queries, you could get a similar result with a [having spec](having.html), but using a filtered dimensionSpec is much more efficient: the dimensionSpec filter is applied at the lowest level of the query processing pipeline, while a having spec is applied at the highest level of groupBy query processing.
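For comparison, the less efficient groupBy alternative would attach a having clause to the selector-filtered query instead of wrapping the dimension. The fragment below is a sketch, assuming a `dimSelector` having type that matches dimension values; consult the [having spec](having.html) documentation for the exact types your version supports:

```json
"having": {
  "type": "dimSelector",
  "dimension": "tags",
  "value": "t3"
}
```

Because having specs run after all rows have been grouped and merged, the unwanted rows are still processed first and only discarded at the end, which is why the filtered dimensionSpec is preferred.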
**processing/src/main/java/io/druid/query/dimension/BaseFilteredDimensionSpec.java** (68 additions, 0 deletions)
```java
/*
 * Licensed to Metamarkets Group Inc. (Metamarkets) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. Metamarkets licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */

package io.druid.query.dimension;

import com.fasterxml.jackson.annotation.JsonProperty;
import com.google.common.base.Preconditions;
import io.druid.query.extraction.ExtractionFn;

/**
 */
public abstract class BaseFilteredDimensionSpec implements DimensionSpec
{
  protected final DimensionSpec delegate;

  public BaseFilteredDimensionSpec(
      @JsonProperty("delegate") DimensionSpec delegate
  )
  {
    this.delegate = Preconditions.checkNotNull(delegate, "delegate must not be null");
  }

  @JsonProperty
  public DimensionSpec getDelegate()
  {
    return delegate;
  }

  @Override
  public String getDimension()
  {
    return delegate.getDimension();
  }

  @Override
  public String getOutputName()
  {
    return delegate.getOutputName();
  }

  @Override
  public ExtractionFn getExtractionFn()
  {
    return delegate.getExtractionFn();
  }

  @Override
  public boolean preservesOrdering()
  {
    return delegate.preservesOrdering();
  }
}
```
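The delegate pattern above can be illustrated with a short Python sketch. The names and methods here are hypothetical stand-ins, not the actual Druid classes: a base wrapper forwards everything to its delegate, and a list-filtered subclass additionally restricts which values survive:

```python
class DimensionSpec:
    """Minimal stand-in for Druid's DimensionSpec interface."""
    def __init__(self, dimension, output_name):
        self.dimension = dimension
        self.output_name = output_name


class BaseFilteredDimensionSpec:
    """Forwards everything to a delegate, mirroring the Java class above."""
    def __init__(self, delegate):
        if delegate is None:
            raise ValueError("delegate must not be null")
        self.delegate = delegate

    @property
    def dimension(self):
        return self.delegate.dimension

    @property
    def output_name(self):
        return self.delegate.output_name


class ListFilteredDimensionSpec(BaseFilteredDimensionSpec):
    """Hypothetical sketch of a list-filtered decorator: keeps only
    values that appear in an allow-list."""
    def __init__(self, delegate, values):
        super().__init__(delegate)
        self.values = set(values)

    def filter_values(self, row_values):
        return [v for v in row_values if v in self.values]


spec = ListFilteredDimensionSpec(DimensionSpec("tags", "tags"), ["t3"])
print(spec.dimension)                          # tags (forwarded to delegate)
print(spec.filter_values(["t1", "t2", "t3"]))  # ['t3']
```

The design choice mirrors the Java code: the abstract base carries only the delegation boilerplate, so concrete filtered specs need to implement just the filtering behavior itself.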
Why not just make these having filters? Unless I am mistaken, these are basically restricted HAVING filters. At the very least there should be a link in the HAVING filter spec doc to this doc.
+1
Just to expand on this: I filed #1984 some time ago, and it looks like #2043 fixed it (I am not 100% sure, need to try it out).

So doing a having filter with `{ "type": "dimSelector", "dimension": "<dimension>", "value": "<dimension_value>" }` should work. (I have not tested that yet, but the PR was merged.)

What would be the difference between doing `listFiltered` and `dimSelector` on the relevant dimensions? I understand that `listFiltered` would work for topNs, but ideally `topN` should support `having` filters just like `groupBy`.
One of my teammates added the new having specs that work on dimension values to solve this problem. However, having specs are only applied at the broker after all the processing is done, so historicals will process and merge all the unwanted rows and pass them to the broker, where the broker will further merge them; only at the end will the having filter discard them. That would cause a lot of unnecessary memory and CPU consumption across the cluster. The filters in this PR get applied at the lowest possible level in the pipeline.

That said, I would add a line in the doc saying similar results can be obtained via having filters.
Also, this will work for both topN and groupBy.