Consistent empty multi-value dimension behavior for groupBy / topN #5897

Currently, groupBy and topN handle rows whose multi-value dimension is empty differently. groupBy treats the empty value list like a null (i.e. an empty string), so those rows are grouped under the empty key. topN ignores them, so they don't contribute to the results at all.

Consider a dataset called `tweets`, with one row per tweet and a multi-value dimension `hashtags` listing the hashtags found in that tweet. The dimension can be empty (a tweet with no hashtags) or it can hold several values, as for a tweet like this one: https://twitter.com/sullcrom/status/1006208351095676929.

```sql
select hashtags, count(*)
from tweets
where floor(__time to hour) = timestamp '2018-06-22 00:00:00'
group by 1
order by 2 desc
limit 5
```

The groupBy engine returns the following (note the first row, whose `hashtags` key is empty):

```
hashtags,EXPR$1
,118719
KCAMexico,646
NBADraft,518
เป๊กผลิตโชค,249
BrasilGanha,199
```

And topN returns the following, with no row for the empty key:

```
hashtags,EXPR$1
KCAMexico,646
NBADraft,518
เป๊กผลิตโชค,249
BrasilGanha,199
GOT7,159
```
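To make the difference concrete, here is a minimal Python sketch of the two behaviors on a toy dataset. It is only a model of the semantics described above, not Druid's actual implementation; the `tweets` rows, the `count_by_hashtag` helper, and the `empty_as_null` flag are all made up for illustration.

```python
from collections import Counter

# Hypothetical toy rows: each entry is one tweet's `hashtags` list, possibly empty.
tweets = [
    ["KCAMexico"],
    ["KCAMexico", "NBADraft"],
    [],
    [],
    ["NBADraft"],
    [],
]

def count_by_hashtag(rows, empty_as_null):
    """Count rows per hashtag, expanding multi-value rows into one entry per value.

    empty_as_null=True models the groupBy behavior described above: an empty
    list is treated like a list holding a single null (shown here as "").
    empty_as_null=False models the topN behavior: empty lists contribute nothing.
    """
    counts = Counter()
    for hashtags in rows:
        if not hashtags and empty_as_null:
            values = [""]
        else:
            values = hashtags
        counts.update(values)
    return counts

groupby_like = count_by_hashtag(tweets, empty_as_null=True)
topn_like = count_by_hashtag(tweets, empty_as_null=False)

print(groupby_like.most_common())
print(topn_like.most_common())
```

In this model the only difference between the two results is the extra `""` key in `groupby_like`, which mirrors the leading empty row in the groupBy output above.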

The Druid docs don't seem to specify which behavior is correct for grouping: http://druid.io/docs/latest/querying/multi-value-dimensions.html. We should define one of these behaviors as correct and make the two engines consistent. Aesthetically, I prefer how topN works (why should an empty list be treated like a list containing a null?), but I am not sure how to reconcile that with the possibility that a groupBy could group by a multi-value dimension _and_ a single-value dimension. What if you group by `hashtags` and `username`, and some user never uses any hashtags? Should the empty `hashtags` make that user's rows implode, so that the user never shows up in the results?
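The two-dimension concern can be sketched the same way. The Python below is a toy model under assumed data, not Druid code: the `rows` list, the usernames, and the `group_by_hashtag_and_user` helper are hypothetical. It shows what would happen to a user with no hashtags if topN-style semantics were applied to a groupBy on `hashtags` and `username`.

```python
from collections import Counter

# Hypothetical rows: "bob" never uses hashtags.
rows = [
    {"username": "alice", "hashtags": ["GOT7", "NBADraft"]},
    {"username": "bob", "hashtags": []},
    {"username": "bob", "hashtags": []},
]

def group_by_hashtag_and_user(rows, empty_as_null):
    """Count rows per (hashtag, username) pair.

    Each row expands into one pair per hashtag value. With empty_as_null=True
    an empty list is treated like a single null (shown here as ""); with
    empty_as_null=False an empty list yields no pairs, so the row vanishes.
    """
    counts = Counter()
    for row in rows:
        values = row["hashtags"]
        if not values and empty_as_null:
            values = [""]
        for tag in values:
            counts[(tag, row["username"])] += 1
    return counts

null_style = group_by_hashtag_and_user(rows, empty_as_null=True)
ignore_style = group_by_hashtag_and_user(rows, empty_as_null=False)

users_null_style = {user for _, user in null_style}
users_ignore_style = {user for _, user in ignore_style}
print(users_null_style)
print(users_ignore_style)
```

Under the null-style expansion `bob` appears under the `("", "bob")` key; under the ignore-style expansion `bob` drops out of the result entirely, which is the "implode" outcome the question above is asking about.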
