Consistent empty multi-value dimension behavior for groupBy / topN #5897

@gianm

Currently, groupBy and topN treat empty multi-value dimensions differently. groupBy treats an empty value list like a null (i.e. an empty string), while topN ignores such rows (they don't contribute to the results at all).

Consider a datasource called tweets, where each row is a tweet and the multi-value dimension hashtags lists the hashtags found in it. The list can be empty (a tweet with no hashtags) or hold several values, as for a tweet like this one: https://twitter.com/sullcrom/status/1006208351095676929.

select hashtags, count(*)
from tweets
where floor(__time to hour) = timestamp '2018-06-22 00:00:00'
group by 1
order by 2 desc
limit 5

The groupBy engine returns:

hashtags,EXPR$1
,118719
KCAMexico,646
NBADraft,518
เป๊กผลิตโชค,249
BrasilGanha,199

And topN returns:

hashtags,EXPR$1
KCAMexico,646
NBADraft,518
เป๊กผลิตโชค,249
BrasilGanha,199
GOT7,159

The Druid docs don't seem to specify which behavior is correct for grouping: http://druid.io/docs/latest/querying/multi-value-dimensions.html. We should define one of these behaviors as correct and make the two engines consistent. Aesthetically, I think I prefer how topN works (why should an empty list be treated like a list containing a null?), but I am not sure how to reconcile that with the possibility that a groupBy could group by a multi-value dimension and a single-value dimension together. What if you group by hashtags and username, and some user never uses any hashtags? Should the empty hashtags list make those rows vanish, so that the user never shows up in the results?
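To make the two candidate semantics concrete, here is a minimal Python sketch. It is not Druid code: the rows, the `expand` and `group_count` helpers, and the `empty_as_null` flag are all hypothetical names made up for illustration. It models grouping as a cross product of each row's dimension values, which is only an assumption about how the engines behave.

```python
from collections import Counter
from itertools import product

# Hypothetical toy rows; not taken from the real tweets datasource.
ROWS = [
    {"username": "alice", "hashtags": ["NBADraft", "GOT7"]},
    {"username": "bob", "hashtags": []},  # a user who never uses hashtags
    {"username": "carol", "hashtags": ["NBADraft"]},
]


def expand(values, empty_as_null):
    """Return the grouping values one row contributes for a multi-value dimension."""
    if values:
        return list(values)
    # groupBy-like: an empty list acts as [None]; topN-like: it contributes nothing.
    return [None] if empty_as_null else []


def group_count(rows, dims, empty_as_null):
    """Count rows per grouping key, expanding multi-value dimensions."""
    counts = Counter()
    for row in rows:
        per_dim = []
        for dim in dims:
            value = row[dim]
            if isinstance(value, list):
                per_dim.append(expand(value, empty_as_null))
            else:
                per_dim.append([value])
        # The cross product is empty if any dimension contributes no values,
        # so the whole row disappears under the topN-like rule.
        for key in product(*per_dim):
            counts[key] += 1
    return dict(counts)


groupby_like = group_count(ROWS, ["username", "hashtags"], empty_as_null=True)
topn_like = group_count(ROWS, ["username", "hashtags"], empty_as_null=False)
```

Under this model, `groupby_like` contains a `("bob", None)` key, while `topn_like` has no key for bob at all, which is exactly the case the question above is about.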
