Currently, groupBy and topN treat empty multi-value dimensions differently. groupBy treats them like nulls (i.e. empty strings) and topN ignores them (they don't contribute to the results at all).
Consider a dataset called tweets with one row per tweet and a multi-value dimension hashtags listing the hashtags found in that tweet. The dimension can be empty (a tweet with no hashtags) or it can hold multiple values, as for a tweet like this one: https://twitter.com/sullcrom/status/1006208351095676929.
Take this query:
select hashtags, count(*)
from tweets
where floor(__time to hour) = timestamp '2018-06-22 00:00:00'
group by 1
order by 2 desc
limit 5
The groupBy engine returns:
hashtags,EXPR$1
,118719
KCAMexico,646
NBADraft,518
เป๊กผลิตโชค,249
BrasilGanha,199
And topN returns:
hashtags,EXPR$1
KCAMexico,646
NBADraft,518
เป๊กผลิตโชค,249
BrasilGanha,199
GOT7,159
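To make the difference concrete, here is a minimal Python sketch of the two behaviors observed above. It is only a toy model of the results, not how either Druid engine is implemented; the sample rows and the function names count_empty_as_null and count_skip_empty are made up for illustration.

```python
from collections import Counter

# Toy rows (hypothetical data): each tweet carries a possibly empty list of hashtags.
tweets = [
    {"hashtags": []},
    {"hashtags": ["NBADraft"]},
    {"hashtags": ["NBADraft", "KCAMexico"]},
    {"hashtags": []},
]


def count_empty_as_null(rows):
    """Model of the groupBy result: an empty list behaves like a list holding one null."""
    counts = Counter()
    for row in rows:
        # Substitute [None] for an empty list so the row still lands in a bucket.
        for tag in row["hashtags"] or [None]:
            counts[tag] += 1
    return counts


def count_skip_empty(rows):
    """Model of the topN result: an empty list contributes nothing to any bucket."""
    counts = Counter()
    for row in rows:
        for tag in row["hashtags"]:
            counts[tag] += 1
    return counts


as_null = count_empty_as_null(tweets)
skipped = count_skip_empty(tweets)
```

In this model the two hashtag-less tweets end up in a null bucket in as_null, while skipped has no such bucket, which mirrors the extra first row in the groupBy output.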
The Druid docs don't seem to specify which behavior is correct for grouping: http://druid.io/docs/latest/querying/multi-value-dimensions.html. We should define one of these behaviors as correct and make the two engines consistent. Aesthetically, I prefer how topN works (why should an empty list be treated like a list containing a null?), but I am not sure how to reconcile that with a groupBy that groups by both a multi-value dimension and a single-value dimension. What if you group by hashtags and username, and some user never uses any hashtags? Should the empty hashtags value make those rows disappear, so that the user never shows up in the results?
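The two-dimension concern can be illustrated with a small Python sketch. This is a hypothetical model, not Druid code: the sample rows, the function group_by_hashtag_and_user, and its empty_as_null flag are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical rows: bob tweets but never uses hashtags.
rows = [
    {"username": "alice", "hashtags": ["NBADraft", "GOT7"]},
    {"username": "bob", "hashtags": []},
    {"username": "bob", "hashtags": []},
]


def group_by_hashtag_and_user(rows, empty_as_null):
    """Count rows per (hashtag, username) pair under either empty-list policy."""
    counts = Counter()
    for row in rows:
        tags = row["hashtags"]
        if not tags and empty_as_null:
            # groupBy-like policy: keep the row under a null hashtag.
            tags = [None]
        # With the topN-like policy an empty list yields no pairs at all.
        for tag in tags:
            counts[(tag, row["username"])] += 1
    return counts


null_style = group_by_hashtag_and_user(rows, empty_as_null=True)
skip_style = group_by_hashtag_and_user(rows, empty_as_null=False)

# Which users survive in the grouped output under each policy.
users_null = {user for _, user in null_style}
users_skip = {user for _, user in skip_style}
```

Under the skip policy bob is absent from users_skip even though he has rows in the input, which is exactly the case the question above is asking about.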