Affected Version
0.16.0 and 0.17.0
Description
We often use the combination of NULLIF() and LOOKUP() functions in combination. This works fine for normal dimension columns, but when using a multi-valued dimension, then the following combination of NULLIF and LOOKUP leads to extremely high segment scan times of 30 seconds or even up to several minutes:
SELECT
NULLIF(LOOKUP(adTypeIdPubRequest, 'adtype_by_adtypeid'), 'fallback') AS adFormat,
SUM("count") AS cnt
FROM "supply-activities"
WHERE (__time >= timestamp'2020-01-20 00:00:00' AND __time < timestamp '2020-01-20 01:00:00')
GROUP BY 1
ORDER BY 2
LIMIT 10
In contrast, the following query which is merely dropping the surrounding NULLIF() function is executing at normal speed:
SELECT
LOOKUP(adTypeIdPubRequest, 'adtype_by_adtypeid') AS adFormat,
SUM("count") AS cnt
FROM "supply-activities"
WHERE (__time >= timestamp'2020-01-20 00:00:00' AND __time < timestamp '2020-01-20 01:00:00')
GROUP BY 1
ORDER BY 2
Using NULLIF in combination with LOOKUP on normal dinensions does NOT lead to a performance issue.
Prior to Druid 0.16.0 we did not notice any performance issue but cannot guarentee that it was introduced with 0.16 because we might have simply overlooked it in prior releases.
With 0.16.0 and also 0.17.0 the performance issue exists. We tested with SQL compatible null handling turned on and off. Performance is bad in both cases. As mentioned, the segment scan times are off the charts.
--
I could not demonstrate the issue using the wikipedia dataset because it doesn't seem to contain any multi-valued dimensions. Perhaps it would be good to have a normative dataset which showcases all field types such that it can be used for problem reports that can be reproduced?
Affected Version
0.16.0 and 0.17.0
Description
We often use the combination of NULLIF() and LOOKUP() functions in combination. This works fine for normal dimension columns, but when using a multi-valued dimension, then the following combination of NULLIF and LOOKUP leads to extremely high segment scan times of 30 seconds or even up to several minutes:
In contrast, the following query which is merely dropping the surrounding NULLIF() function is executing at normal speed:
Using NULLIF in combination with LOOKUP on normal dinensions does NOT lead to a performance issue.
Prior to Druid 0.16.0 we did not notice any performance issue but cannot guarentee that it was introduced with 0.16 because we might have simply overlooked it in prior releases.
With 0.16.0 and also 0.17.0 the performance issue exists. We tested with SQL compatible null handling turned on and off. Performance is bad in both cases. As mentioned, the segment scan times are off the charts.
--
I could not demonstrate the issue using the wikipedia dataset because it doesn't seem to contain any multi-valued dimensions. Perhaps it would be good to have a normative dataset which showcases all field types such that it can be used for problem reports that can be reproduced?