Merged
2 changes: 1 addition & 1 deletion docs/content/design/segments.md
@@ -163,7 +163,7 @@ Each column is stored as two parts:
1. A Jackson-serialized ColumnDescriptor
2. The rest of the binary for the column

A ColumnDescriptor is essentially an object that allows us to use jackson’s polymorphic deserialization to add new and interesting methods of serialization with minimal impact to the code. It consists of some metadata about the column (what type is it, is it multi-valued, etc.) and then a list of serde logic that can deserialize the rest of the binary.
A ColumnDescriptor is essentially an object that allows us to use Jackson's polymorphic deserialization to add new and interesting methods of serialization with minimal impact to the code. It consists of some metadata about the column (what type it is, whether it is multi-value, etc.) and then a list of serde logic that can deserialize the rest of the binary.

Sharding Data to Create Segments
--------------------------------
12 changes: 6 additions & 6 deletions docs/content/querying/dimensionspecs.md
@@ -351,14 +351,14 @@ Returns the dimension value formatted according to the given format string.

For example if you want to concat "[" and "]" before and after the actual dimension value, you need to specify "[%s]" as format string.

### Filtering DimensionSpecs
### Filtered DimensionSpecs

These are only valid for multi-valued dimensions. If you have a row in druid that has a multi-valued dimension with values ["v1", "v2", "v3"] and you send a groupBy/topN query grouping by that dimension with [query filter](filter.html) for value "v1". In the response you will get 3 rows containing "v1", "v2" and "v3". This behavior might be unintuitive for some use cases.
These are only valid for multi-value dimensions. If you have a row in Druid that has a multi-value dimension with values ["v1", "v2", "v3"] and you send a groupBy/topN query grouping by that dimension with a [query filter](filter.html) for value "v1", the response will contain 3 rows, for "v1", "v2" and "v3". This behavior might be unintuitive for some use cases.

It happens because `query filter` is internally used on the bitmaps and only used to match the row to be included in the query result processing. With multivalued dimensions, "query filter" behaves like a contains check, which will match the row with dimension value ["v1", "v2", "v3"]. Please see the section on "Multi-value columns" in [segment](../design/segments.html) for more details.
Then groupBy/topN processing pipeline "explodes" all multi-valued dimensions resulting 3 rows for "v1", "v2" and "v3" each.
This happens because the "query filter" is applied internally to the bitmaps and is only used to decide which rows are included in query result processing. With multi-value dimensions, the "query filter" behaves like a contains check, so it matches the row with dimension value ["v1", "v2", "v3"]. Please see the section on "Multi-value columns" in [segment](../design/segments.html) for more details.
The groupBy/topN processing pipeline then "explodes" all multi-value dimensions, resulting in 3 rows, one each for "v1", "v2" and "v3".

In addition to "query filter" which efficiently selects the rows to be processed, you can use the filtering dimension spec to filter for specific values within the values of a multi-valued dimension. These dimensionSpecs take a delegate DimensionSpec and a filtering criteria. From the "exploded" rows, only rows matching the given filtering criteria are returned in the query result.
In addition to the "query filter", which efficiently selects the rows to be processed, you can use the filtered dimension spec to filter for specific values within a multi-value dimension. These dimensionSpecs take a delegate DimensionSpec and a filtering criterion. From the "exploded" rows, only rows matching the given filtering criterion are returned in the query result.

The following filtered dimension spec acts as a whitelist or blacklist for values as per the "isWhitelist" attribute value.

@@ -372,7 +372,7 @@ Following filtered dimension spec retains only the values matching regex. Note t
{ "type" : "regexFiltered", "delegate" : <dimensionSpec>, "pattern": <java regex pattern> }
```

For more details and examples, see [multi-valued dimensions](multi-valued-dimensions.html).
For more details and examples, see [multi-value dimensions](multi-value-dimensions.html).

### Upper and Lower extraction functions.

11 changes: 11 additions & 0 deletions docs/content/querying/groupbyquery.md
@@ -95,3 +95,14 @@ To pull it all together, the above query would return *n\*m* data points, up to
...
]
```

### Behavior on multi-value dimensions

groupBy queries can group on multi-value dimensions. When grouping on a multi-value dimension, _all_ values
from matching rows will be used to generate one group per value. It's possible for a query to return more groups than
there are rows. For example, a groupBy on the dimension `tags` with filter `"t1" AND "t3"` would match only row1, and
generate a result with three groups: `t1`, `t2`, and `t3`. If you only need to include values that match
your filter, you can use a [filtered dimensionSpec](dimensionspecs.html#filtered-dimensionspecs). This can also
improve performance.

See [Multi-value dimensions](multi-value-dimensions.html) for more details.
@@ -1,22 +1,38 @@
---
layout: doc_page
---
# Multi-value dimensions

This document contains additional query optimizations for certain types of queries.
Druid supports "multi-value" string dimensions. These are generated when an input field contains an array of values
instead of a single value (e.g. JSON arrays, or a TSV field containing one or more `listDelimiter` characters).

# Multi-value Dimensions
This document describes the behavior of groupBy (topN has similar behavior) queries on multi-value dimensions when they
are used as a dimension being grouped by. See the section on multi-value columns in
[segments](../design/segments.html#multi-value-columns) for internal representation details.

Druid supports "multi-valued" dimensions. See the section on multi-valued columns in [segments](../design/segments.html) for internal representation details. This document describes the behavior of groupBy(topN has similar behavior) queries on multi-valued dimensions when they are used as a dimension being grouped by.
## Querying multi-value dimensions

Suppose, you have a dataSource with a segment that contains following rows with a multi-valued dimension called tags.
Suppose you have a dataSource with a segment that contains the following rows, with a multi-value dimension
called `tags`.

```
2011-01-12T00:00:00.000Z,["t1","t2","t3"], #row1
2011-01-13T00:00:00.000Z,["t3","t4","t5"], #row2
2011-01-14T00:00:00.000Z,["t5","t6","t7"] #row3
{"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t1","t2","t3"]} #row1
{"timestamp": "2011-01-13T00:00:00.000Z", "tags": ["t3","t4","t5"]} #row2
{"timestamp": "2011-01-14T00:00:00.000Z", "tags": ["t5","t6","t7"]} #row3
```

### Group-By query with no filtering
All query types can filter on multi-value dimensions. Filters operate independently on each value of a multi-value
dimension. For example, a `"t1" OR "t3"` filter would match row1 and row2 but not row3. A `"t1" AND "t3"` filter
would only match row1.
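
The matching rule above can be sketched as a small in-memory simulation. This is an illustration only, not Druid code: the class `MultiValueFilterSketch` and the `selector` helper are hypothetical names, and each row is modeled as a plain list holding the values of `tags`.

```java
import java.util.List;
import java.util.function.Predicate;

public class MultiValueFilterSketch {
    // The three example rows; each list holds the values of the `tags` dimension.
    static final List<String> ROW1 = List.of("t1", "t2", "t3");
    static final List<String> ROW2 = List.of("t3", "t4", "t5");
    static final List<String> ROW3 = List.of("t5", "t6", "t7");

    // Hypothetical stand-in for a selector filter:
    // a row matches if any one of its values equals `value`.
    static Predicate<List<String>> selector(String value) {
        return row -> row.contains(value);
    }

    public static void main(String[] args) {
        Predicate<List<String>> t1OrT3 = selector("t1").or(selector("t3"));
        Predicate<List<String>> t1AndT3 = selector("t1").and(selector("t3"));
        for (List<String> row : List.of(ROW1, ROW2, ROW3)) {
            System.out.println(row + " OR=" + t1OrT3.test(row) + " AND=" + t1AndT3.test(row));
        }
    }
}
```

Note that the AND filter matches row1 even though no single value equals both "t1" and "t3": each sub-filter is evaluated against the whole row.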

topN and groupBy queries can group on multi-value dimensions. When grouping on a multi-value dimension, _all_ values
from matching rows will be used to generate one group per value. It's possible for a query to return more groups than
there are rows. For example, a topN on the dimension `tags` with filter `"t1" AND "t3"` would match only row1, and
generate a result with three groups: `t1`, `t2`, and `t3`. If you only need to include values that match
your filter, you can use a [filtered dimensionSpec](dimensionspecs.html#filtered-dimensionspecs). This can also
improve performance.
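
A minimal sketch of the grouping behavior, assuming the three example rows and a simple count aggregation. The class `MultiValueGroupingSketch` and the method `groupCounts` are hypothetical; the code only mimics the "filter the row, then emit one group per value" order and is not how Druid implements it.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

public class MultiValueGroupingSketch {
    static final List<List<String>> ROWS = List.of(
        List.of("t1", "t2", "t3"),   // row1
        List.of("t3", "t4", "t5"),   // row2
        List.of("t5", "t6", "t7"));  // row3

    // Keep the rows that pass the filter, then count one group per value of each kept row.
    static Map<String, Long> groupCounts(Predicate<List<String>> rowFilter) {
        Map<String, Long> counts = new TreeMap<>();
        for (List<String> row : ROWS) {
            if (!rowFilter.test(row)) {
                continue; // the filter decides on whole rows, not on individual values
            }
            for (String value : row) {
                counts.merge(value, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // "t1" AND "t3" matches only row1, yet three groups come back.
        System.out.println(groupCounts(row -> row.contains("t1") && row.contains("t3")));
        // prints {t1=1, t2=1, t3=1}

        // A selector filter on "t3" matches row1 and row2, giving five groups.
        System.out.println(groupCounts(row -> row.contains("t3")));
        // prints {t1=1, t2=1, t3=2, t4=1, t5=1}
    }
}
```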

### Example: GroupBy query with no filtering

See [GroupBy querying](groupbyquery.html) for details.

@@ -104,7 +120,7 @@ returns following result.

Notice how the original rows are "exploded" into multiple rows and merged.

### Group-By query with a selector query filter
### Example: GroupBy query with a selector query filter

See [query filters](filters.html) for details of selector query filter.

@@ -181,13 +197,13 @@ returns following result.
]
```

You might be surprised to see inclusion of "t1", "t2", "t4" and "t5" in the results. It happens because query filter is applied on the row before explosion. For multi-valued dimensions, selector filter for "t3" would match row1 and row2, after which exploding is done. For multi-valued dimensions, query filter matches a row if any individual value inside the multiple values matches the query filter.
You might be surprised to see "t1", "t2", "t4" and "t5" included in the results. This happens because the query filter is applied to the row before explosion. For multi-value dimensions, a query filter matches a row if any individual value in the row matches the filter, so a selector filter for "t3" matches row1 and row2, after which the explosion is done.

### Group-By query with a selector query filter and additional filter in "dimensions" attributes
### Example: GroupBy query with a selector query filter and additional filter in "dimensions" attributes

To solve the problem above and to get only rows for "t3" returned, you would have to use a "filtered dimension spec" as in the query below.

See section on filtered dimensionSpecs in [dimensionSpecs](dimensionspecs.html) for details.
See section on filtered dimensionSpecs in [dimensionSpecs](dimensionspecs.html#filtered-dimensionspecs) for details.

```json
{
@@ -224,7 +240,7 @@ See section on filtered dimensionSpecs in [dimensionSpecs](dimensionspecs.html)
}
```

returns following result.
returns the following result.

```json
[
@@ -238,5 +254,4 @@ returns following result.
]
```

Note that, for groupBy queries, you could get similar result with a [having spec](having.html) but using a filtered dimensionSpec would be much more efficient because that gets applied at the lowest level in the query processing pipeline while having spec is applied at the highest level of groupBy query processing.

Note that, for groupBy queries, you could get a similar result with a [having spec](having.html), but using a filtered dimensionSpec is much more efficient because it gets applied at the lowest level in the query processing pipeline. Having specs are applied at the outermost level of groupBy query processing.
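
The difference in where the two approaches discard values can be sketched as follows. This is a simplified in-memory model, not Druid's pipeline: `FilteredDimensionSketch`, `filteredDimension` and `havingStyle` are hypothetical names, and a count is the only aggregation.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class FilteredDimensionSketch {
    static final List<List<String>> ROWS = List.of(
        List.of("t1", "t2", "t3"),   // row1
        List.of("t3", "t4", "t5"),   // row2
        List.of("t5", "t6", "t7"));  // row3

    // Filtered-dimensionSpec style: unwanted values are dropped while the row is
    // exploded, so no group is ever created for them.
    static Map<String, Long> filteredDimension(String filterValue, Set<String> keep) {
        Map<String, Long> counts = new TreeMap<>();
        for (List<String> row : ROWS) {
            if (!row.contains(filterValue)) {
                continue; // query filter on the whole row
            }
            for (String value : row) {
                if (keep.contains(value)) {
                    counts.merge(value, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    // Having style: every group is built first, and unwanted groups are removed at the end.
    static Map<String, Long> havingStyle(String filterValue, Set<String> keep) {
        Map<String, Long> counts = new TreeMap<>();
        for (List<String> row : ROWS) {
            if (!row.contains(filterValue)) {
                continue;
            }
            for (String value : row) {
                counts.merge(value, 1L, Long::sum);
            }
        }
        counts.keySet().retainAll(keep); // discard groups only after aggregation
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(filteredDimension("t3", Set.of("t3")));
        System.out.println(havingStyle("t3", Set.of("t3")));
    }
}
```

In this model both methods return the same map, but `havingStyle` aggregates five groups before throwing four of them away, while `filteredDimension` only ever touches the group for "t3".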
13 changes: 13 additions & 0 deletions docs/content/querying/topnquery.md
@@ -128,7 +128,20 @@ The format of the results would look like so:
}
]
```

### Behavior on multi-value dimensions

topN queries can group on multi-value dimensions. When grouping on a multi-value dimension, _all_ values
from matching rows will be used to generate one group per value. It's possible for a query to return more groups than
there are rows. For example, a topN on the dimension `tags` with filter `"t1" AND "t3"` would match only row1, and
generate a result with three groups: `t1`, `t2`, and `t3`. If you only need to include values that match
your filter, you can use a [filtered dimensionSpec](dimensionspecs.html#filtered-dimensionspecs). This can also
improve performance.

See [Multi-value dimensions](multi-value-dimensions.html) for more details.

### Aliasing

The current TopN algorithm is an approximate algorithm. The top 1000 local results from each segment are returned for merging to determine the global topN. As such, the topN algorithm is approximate in both rank and results. Approximate results *ONLY APPLY WHEN THERE ARE MORE THAN 1000 DIM VALUES*. A topN over a dimension with fewer than 1000 unique dimension values can be considered accurate in rank and accurate in aggregates.

The threshold can be modified from its default of 1000 via the server parameter `druid.query.topN.minTopNThreshold`, which requires a server restart to take effect, or by setting `minTopNThreshold` in the query context, which takes effect per query.
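
To see why merging per-segment local results can be approximate, here is a toy sketch with the threshold shrunk from 1000 to 1. Everything in it is an assumption made for illustration: the class `TopNMergeSketch`, its methods, and the two tiny "segments" are hypothetical, and ranking is by a single summed count.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TopNMergeSketch {
    // Two toy segments mapping a dimension value to its local count.
    static final List<Map<String, Long>> SEGMENTS = List.of(
        Map.of("a", 10L, "b", 9L),
        Map.of("c", 11L, "b", 9L));

    // Return the `limit` entries with the largest counts.
    static Map<String, Long> top(Map<String, Long> counts, int limit) {
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
            .limit(limit)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    // Sum the counts per value across several partial results.
    static Map<String, Long> merge(List<Map<String, Long>> parts) {
        Map<String, Long> merged = new HashMap<>();
        for (Map<String, Long> part : parts) {
            part.forEach((value, count) -> merged.merge(value, count, Long::sum));
        }
        return merged;
    }

    // Approximate topN: each segment contributes only its local top `threshold` values.
    static Map<String, Long> approximateTop(List<Map<String, Long>> segments, int threshold, int n) {
        List<Map<String, Long>> partials = segments.stream()
            .map(segment -> top(segment, threshold))
            .collect(Collectors.toList());
        return top(merge(partials), n);
    }

    public static void main(String[] args) {
        System.out.println("exact top 1:        " + top(merge(SEGMENTS), 1));
        System.out.println("threshold 1, top 1: " + approximateTop(SEGMENTS, 1, 1));
        System.out.println("threshold 2, top 1: " + approximateTop(SEGMENTS, 2, 1));
    }
}
```

With a threshold of 1, value "b" is never the local winner in either segment, so it is dropped before the merge even though it has the largest total. Once the threshold is at least the number of distinct values per segment, the merged result equals the exact one, which mirrors the statement above about dimensions with fewer than 1000 unique values.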
2 changes: 1 addition & 1 deletion docs/content/toc.md
@@ -40,9 +40,9 @@
* [Granularities](../querying/granularities.html)
* [DimensionSpecs](../querying/dimensionspecs.html)
* [Context](../querying/query-context.html)
* [Multi-value dimensions](../querying/multi-value-dimensions.html)
* [SQL](../querying/sql.html)
* [Joins](../querying/joins.html)
* [Optimizations](../querying/optimizations.html)
* [Multitenancy](../querying/multitenancy.html)
* [Caching](../querying/caching.html)

@@ -601,7 +601,7 @@ protected void innerReduce(

// Respect poisoning
if (!currentDimSkip && dvc.numRows < 0) {
log.info("Cannot partition on multi-valued dimension: %s", dvc.dim);
log.info("Cannot partition on multi-value dimension: %s", dvc.dim);
currentDimSkip = true;
}

@@ -614,7 +614,7 @@ public ObjectColumnSelector makeObjectColumnSelector(String column)

if (columnVals.hasMultipleValues()) {
throw new UnsupportedOperationException(
"makeObjectColumnSelector does not support multivalued GenericColumns"
"makeObjectColumnSelector does not support multi-value GenericColumns"
);
}
