Fix ExpressionVirtualColumn capabilities; fix groupBy's improper uses of StorageAdapter#getColumnCapabilities. #8013
1) A usage in "isArrayAggregateApplicable" that could incorrectly use array-based aggregation on a virtual column that shadows a real column.
2) A usage in "process" that could use the more expensive multi-value aggregation path on a singly-valued virtual column. (No correctness issue, but a performance issue.)
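To illustrate the shadowing problem in item 1, here is a minimal sketch in plain Java. It is not Druid code: `Caps`, `resolve`, and the two maps are hypothetical stand-ins, and the assumption that the aggregation strategy keys off a flag like `dictionaryEncoded` is a simplification for illustration only. The point is that asking only the physical storage layer about a column name returns the real column's capabilities, even when a virtual column with the same name is what the query will actually read.

```java
import java.util.HashMap;
import java.util.Map;

class CapabilityLookupSketch {
    /** Hypothetical, simplified stand-in for column capabilities. */
    static final class Caps {
        final boolean dictionaryEncoded;
        final boolean multiValued;

        Caps(boolean dictionaryEncoded, boolean multiValued) {
            this.dictionaryEncoded = dictionaryEncoded;
            this.multiValued = multiValued;
        }
    }

    /**
     * Resolve capabilities for a column name, letting a virtual column
     * shadow a physical column of the same name. Returns null if the
     * name is unknown to both maps.
     */
    static Caps resolve(Map<String, Caps> virtualColumns,
                        Map<String, Caps> physicalColumns,
                        String name) {
        Caps virtual = virtualColumns.get(name);
        return virtual != null ? virtual : physicalColumns.get(name);
    }

    public static void main(String[] args) {
        // Physical column "dim": dictionary-encoded, singly-valued.
        Map<String, Caps> physical = new HashMap<>();
        physical.put("dim", new Caps(true, false));

        // Virtual column with the same name: not dictionary-encoded.
        Map<String, Caps> virtual = new HashMap<>();
        virtual.put("dim", new Caps(false, true));

        // Asking only the physical layer describes the wrong column.
        Caps naive = physical.get("dim");
        // Checking virtual columns first describes what the query reads.
        Caps resolved = resolve(virtual, physical, "dim");

        System.out.println("naive dictionaryEncoded = " + naive.dictionaryEncoded);
        System.out.println("resolved dictionaryEncoded = " + resolved.dictionaryEncoded);
    }
}
```

In this toy model, a strategy check based on `naive` would pick an optimization that is only valid for the physical column, which is the kind of mistake item 1 describes.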
jihoonson
Thanks for fixing this. +1 after CI.
@gianm I think this is failing unit tests.
It looks like the tests were failing because ExpressionVirtualColumn always set its capabilities to be singly-valued. That is a bug: since #7588, expression virtual columns might be multi-valued. The bug probably went undetected because it was masked by the one this PR fixes, which prevented groupBy from using its all-singly-valued-dimension optimization when some of the columns involved are virtual columns.

I pushed a fix for the ExpressionVirtualColumn issue and updated the top comment. In this fix I just set the multi-value flag to always be "true". This isn't ideal, since it means singly-valued optimizations won't work on top of it, but I didn't see an easy way for the ExpressionVirtualColumn to determine upfront whether it will be singly-valued. I think this should be possible in the future as we add more upfront type info to the expression system, so I added a comment saying as much. /cc @clintropolis
👍 I think this makes sense for now. I will follow up with a fix that lets us determine when a single input column will produce a scalar or an array output, so we can have this optimization again where possible.
… of StorageAdapter#getColumnCapabilities. (apache#8013)

* GroupBy: Fix improper uses of StorageAdapter#getColumnCapabilities.

  1) A usage in "isArrayAggregateApplicable" that would potentially incorrectly use array-based aggregation on a virtual column that shadows a real column.
  2) A usage in "process" that would potentially use the more expensive multi-value aggregation path on a singly-valued virtual column. (No correctness issue, but a performance issue.)

* Add addl javadoc.

* ExpressionVirtualColumn: Set multi-value flag.
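1) A usage in "isArrayAggregateApplicable" that would potentially incorrectly use
array-based aggregation on a virtual column that shadows a real column.
2) A usage in "process" that would potentially use the more expensive multi-value
aggregation path on a singly-valued virtual column. (No correctness issue, but
a performance issue.)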
Also makes ExpressionVirtualColumn always report that it is multi-valued. Previously,
it always set its capabilities to be singly-valued, which has been a bug since #7588, because
it might actually be multi-valued.
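The trade-off behind always reporting multi-valued can be sketched as follows. This is a hypothetical, simplified model rather than the groupBy engine's actual logic: `canUseSingleValuePath` and the use of a nullable `Boolean` per dimension (with `null` meaning "capabilities unknown") are assumptions made for illustration. The idea is that the cheaper singly-valued path is only safe when every grouping dimension is known to be singly-valued, so a column that conservatively reports multi-valued gives up the optimization but never produces wrong results.

```java
import java.util.Arrays;
import java.util.List;

class MultiValuePathSketch {
    /**
     * Returns true only if every dimension is known to be singly-valued.
     * Each entry is the dimension's "has multiple values" flag;
     * null means the capabilities are unknown, which is treated
     * conservatively as "might be multi-valued".
     */
    static boolean canUseSingleValuePath(List<Boolean> multiValuedFlags) {
        for (Boolean flag : multiValuedFlags) {
            if (flag == null || flag) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Two physical, singly-valued dimensions: fast path is safe.
        System.out.println(canUseSingleValuePath(Arrays.asList(false, false)));

        // One dimension is an expression column that always reports
        // multi-valued: fall back to the general (slower) path.
        System.out.println(canUseSingleValuePath(Arrays.asList(false, true)));
    }
}
```

Under this model, reporting singly-valued for a column that can actually produce multiple values would be the unsafe direction, which is why the conservative flag is the simpler fix until upfront type information is available.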