TopN query on multi-value dimension ignores null values #4915

@NicolasBielza

Using Druid 0.10.1, when I run a topN query on a multi-valued dimension, Druid only returns the aggregation results for non-null values.

The following TopN query:

{
  "queryType": "topN",
  "dataSource": "audience-breakdown",
  "intervals": "2017-09-07T00Z/2017-09-08T00Z",
  "granularity": "all",
  "dimension": {
    "type": "default",
    "dimension": "listenerDma",
    "outputName": "DMA"
  },
  "aggregations": [
    {
      "name": "TLS",
      "type": "longSum",
      "fieldName": "tls"
    }
  ],
  "metric": "TLS",
  "threshold": 3
}

Produces this response:

[
    {
        "timestamp": "2017-09-07T00:00:00.000Z",
        "result": [
            {
                "DMA": "803",
                "TLS": 2798577
            },
            {
                "DMA": "501",
                "TLS": 2509147
            },
            {
                "DMA": "602",
                "TLS": 1779172
            }
        ]
    }
]

Compare this with a similar groupBy query:

{
  "queryType": "groupBy",
  "dataSource": "audience-breakdown",
  "intervals": "2017-09-07T00Z/2017-09-08T00Z",
  "granularity": "all",
  "dimensions": [
    {
      "type": "default",
      "dimension": "listenerDma",
      "outputName": "DMA"
    }
  ],
  "aggregations": [
    {
      "name": "TLS",
      "type": "longSum",
      "fieldName": "tls"
    }
  ],
  "limitSpec": {
    "type": "default",
    "columns": [
      {
        "dimension": "TLS",
        "direction": "descending"
      }
    ],
    "limit": 3
  }
}

The groupBy response shows that the largest group is actually the one with null in the listenerDma dimension:

[
    {
        "version": "v1",
        "timestamp": "2017-09-07T00:00:00.000Z",
        "event": {
            "DMA": null,
            "TLS": 8770694
        }
    },
    {
        "version": "v1",
        "timestamp": "2017-09-07T00:00:00.000Z",
        "event": {
            "DMA": "803",
            "TLS": 2798577
        }
    },
    {
        "version": "v1",
        "timestamp": "2017-09-07T00:00:00.000Z",
        "event": {
            "DMA": "501",
            "TLS": 2509147
        }
    }
]

Is this the expected behavior of topN? If so, why does it correctly return the null group when I run the query on a single-valued dimension?

For instance, if I use the city dimension (which is single-valued):

{
  "queryType": "topN",
  "dataSource": "audience-breakdown",
  "intervals": "2017-09-07T00Z/2017-09-08T00Z",
  "granularity": "all",
  "dimension": {
    "type": "default",
    "dimension": "city",
    "outputName": "City"
  },
  "aggregations": [
    {
      "name": "TLS",
      "type": "longSum",
      "fieldName": "tls"
    }
  ],
  "metric": "TLS",
  "threshold": 3
}

Druid does return an aggregated value for null:

[
    {
        "timestamp": "2017-09-07T00:00:00.000Z",
        "result": [
            {
                "TLS": 2412533,
                "City": null
            },
            {
                "TLS": 634690,
                "City": "Los Angeles"
            },
            {
                "TLS": 609145,
                "City": "Houston"
            }
        ]
    }
]
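
To make the discrepancy easy to check mechanically, here is a minimal Python sketch that inspects the three responses shown above and reports whether each contains a null group. The response data is copied from this issue (with the JSON `null` written as Python `None`); the helper names `topn_keys`, `groupby_keys`, and `has_null_group` are hypothetical and not part of Druid or any client library.

```python
# Responses copied from the issue; JSON null becomes Python None.
TS = "2017-09-07T00:00:00.000Z"

topn_multi = [{"timestamp": TS, "result": [
    {"DMA": "803", "TLS": 2798577},
    {"DMA": "501", "TLS": 2509147},
    {"DMA": "602", "TLS": 1779172},
]}]

groupby_multi = [
    {"version": "v1", "timestamp": TS, "event": {"DMA": None, "TLS": 8770694}},
    {"version": "v1", "timestamp": TS, "event": {"DMA": "803", "TLS": 2798577}},
    {"version": "v1", "timestamp": TS, "event": {"DMA": "501", "TLS": 2509147}},
]

topn_single = [{"timestamp": TS, "result": [
    {"TLS": 2412533, "City": None},
    {"TLS": 634690, "City": "Los Angeles"},
    {"TLS": 609145, "City": "Houston"},
]}]


def topn_keys(response, output_name):
    """Collect the dimension values from a topN response (hypothetical helper)."""
    return [row[output_name] for bucket in response for row in bucket["result"]]


def groupby_keys(response, output_name):
    """Collect the dimension values from a groupBy v1 response (hypothetical helper)."""
    return [row["event"][output_name] for row in response]


def has_null_group(keys):
    """True if any returned group has a null dimension value."""
    return None in keys


print("topN, multi-valued listenerDma:", has_null_group(topn_keys(topn_multi, "DMA")))
print("groupBy, multi-valued listenerDma:", has_null_group(groupby_keys(groupby_multi, "DMA")))
print("topN, single-valued city:", has_null_group(topn_keys(topn_single, "City")))
```

On the data above, only the topN query on the multi-valued dimension lacks a null group, even though the groupBy result shows that group has the largest TLS.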

Thanks,
Nicolas
