add single input string expression dimension vector selector and better expression planning#11213
Conversation
…er expression planning
…o be less aggressive about vectorizing
| * the the string input. | ||
| */ | ||
| @Override | ||
| public boolean useDictionaryEncodedSelector(ColumnCapabilities capabilities) |
There was a problem hiding this comment.
Please add @Nullable for capabilities.
There was a problem hiding this comment.
hmm, it should never be null for this method though (nor any of the other VectorColumnProcessorFactory methods), ColumnProcessors will return a nil vector selector if the capabilities are null, since null capabilities in the vectorized engine means that the column doesn't exist.
| * We do this even for things like virtual columns that have a single string input, because it allows deferring | ||
| * accessing any of the actual string values, which involves at minimum reading utf8 byte values and converting | ||
| * them to string form (if not already cached), and in the case of expressions, computing the expression output for | ||
| * the the string input. |
There was a problem hiding this comment.
| * the the string input. | |
| * the string input. |
There was a problem hiding this comment.
quite a few 'the the', fixed all I could find.
There was a problem hiding this comment.
That's a lot! Thanks for fixing them.
| * deal with the actual string value in exchange for the increased complexity of dealing with dictionary encoded | ||
| * selectors. | ||
| */ | ||
| default boolean useDictionaryEncodedSelector(ColumnCapabilities capabilities) |
There was a problem hiding this comment.
Please add @Nullable for capabilities.
| .setDictionaryValuesUnique( | ||
| unique && dimensionSpec.getExtractionFn().getExtractionType() == ExtractionFn.ExtractionType.ONE_TO_ONE | ||
| ) | ||
| .setHasMultipleValues(dimensionSpec.mustDecorate() || mayBeMultiValue(columnCapabilities)); |
There was a problem hiding this comment.
nit: dimensionSpec.mustDecorate() always returns false here.
| * If no output type was able to be inferred during planning, returns null | ||
| */ | ||
| @Nullable | ||
| public ColumnCapabilities inferColumnCapabilities(@Nullable ValueType hint) |
There was a problem hiding this comment.
| public ColumnCapabilities inferColumnCapabilities(@Nullable ValueType hint) | |
| public ColumnCapabilities inferColumnCapabilities(@Nullable ValueType outputTypeHint) |
| return ColumnCapabilitiesImpl.createSimpleSingleValueStringColumnCapabilities(); | ||
| } | ||
| // we don't know what we don't know | ||
| return null; |
There was a problem hiding this comment.
Can we still infer in some cases if a non-null hint is provided, such as when outputType is null but hint is ValueType.DOUBLE? I'm not sure when this case can happen though.
There was a problem hiding this comment.
The contract of this method is that it only returns column capabilities if it could infer an output type, and it factors the hinted type into things (where the hint is from SQL planner or user who sent JSON query).
The caller of this method, ExpressionVirtualColumn.capabilities, will construct ColumnCapabilities using the hint if this inferColumnCapabilities method returns null.
| return ConstantVectorSelectors.singleValueDimensionVectorSelector(factory.getReadableVectorInspector(), constant); | ||
| } | ||
| throw new IllegalStateException("Only constant expressions currently support dimension selectors"); | ||
| if (plan.is(ExpressionPlan.Trait.SINGLE_INPUT_SCALAR) && ExprType.STRING == plan.getOutputType()) { |
There was a problem hiding this comment.
Can we also use SingleStringInputDeferredEvaluationExpressionDimensionVectorSelector when the plan is SINGLE_INPUT_MAPPABLE?
There was a problem hiding this comment.
Not at this time, at least for expressions that aren't also SINGLE_INPUT_SCALAR. The vector engine uses different selectors for single and multi value string dimensions, and all SINGLE_INPUT_MAPPABLE that have a single input and single output will also have the trait SINGLE_INPUT_SCALAR. This means any SINGLE_INPUT_MAPPABLE that aren't caught by this if will be on multi-valued columns so need to use the multi-valued dimension vector selector, which isn't implemented for vectorized expression processing yet.
I guess we could hypothetically unroll multi-valued rows into a vector if we had some sort of input binding which could handle that mapping and still processing them as single valued dims, but I'm not certain if the complexity would be worth it over just using the multi-valued selector. I also can imagine reworking array processing to be much closer to scalar vector processing, so I need to think a bit more about the best way to tackle this.
| * construct the appropriate capabilities for virtual columns, while the base inspector directly supplies the | ||
| * capabilities for non-virtual columns. | ||
| */ | ||
| public class VirtualizedColumnInspector implements ColumnInspector |
| { | ||
| Preconditions.checkArgument(capabilities != null, "Capabilities must not be null"); | ||
| Preconditions.checkArgument(capabilities.getType() == ValueType.STRING, "Must only be called on a STRING column"); | ||
| return capabilities.isDictionaryEncoded().isTrue(); |
There was a problem hiding this comment.
Just for recording (and helping myself to remember later in the future), this means that the groupBy vector engine will use the dictionary IDs to compute per-segment results, and decode them when merging those results, which is what non-vectorized engine does today. When the column is dictionary encoded but not unique, this optimization might not be always good because there could be some sort of tradeoff depending on the column cardinality post expression evaluation. Even though I think this optimization is likely good in most cases, it could worth investigating further later to understand the tradeoff better.
| @Override | ||
| public int getMaxVectorSize() | ||
| { | ||
| return 1; |
There was a problem hiding this comment.
It would worth mentioning why the vector size is 1. Can you add some comment about it?
There was a problem hiding this comment.
added javadocs to the bindings class to detail its usage
|
thanks for review @jihoonson 🤘 |
…er expression planning (apache#11213) * add single input string expression dimension vector selector and better expression planning * better * fixes * oops * rework how vector processor factories choose string processors, fix to be less aggressive about vectorizing * oops * javadocs, renaming * more javadocs * benchmarks * use string expression vector processor with vector size 1 instead of expr.eval * better logging * javadocs, surprising number of the the * more * simplify
Description
This PR adds some improvements and fixes to the query processing system, mostly related to expression planning and execution.
ExpressionPlanhas been reworked to now produce theColumnCapabilitiesdirectly instead ofExpressionVirtualColumndoing this from the plan, which keeps everything a bit tighter together and much actual logic out of being inExpressionVirtualColumn. TheExpressionPlannerwas relatively well covered by tests prior to this PR, but it was almost entirely indirect coverage from the set of queries we were running, and had no direct tests. I've changed this added quite a few tests to model various scenarios that can occur when planning expressions against an underlyingColumnInspectorso hopefully things will be better defined.Part of the reason for reworking this stuff is so that single input strings can correctly report themselves as dictionary encoded to allow for deferring expression evaluation until the time
lookupNameis called, an optimization that had not yet made its way to the vectorized expression processing, but is available now throughSingleStringDeferredEvaluationExpressionDimensionVectorSelector. This deferred evaluation offers quite a performance improvement, allowing using the native string grouping strategy and lazily executing expression on value lookup, instead of eagerly evaluating them and using the dictionary building grouping strategy.Given the queries:
non-vectorized:
eager vector object selector/dictionary building grouping strategy (unoptimized previous behavior which ended up being slower than the optimized non-vectorized engine) :
new lazy evaluated expression grouping optimization for parity with non-vectorized engine using
Expr.eval:new lazy evaluated expression grouping optimization for parity with non-vectorized engine using
ExprVectorProcessor.evalVectorwith a vector size of 1:The last one measured slightly better than using
Expr.eval, so this PR went with usingExprVectorProcessoreven though the vector size is only ever 1. I suspect this will often be the case since the vectorized expression processors are stronger typed and so can eliminate most branching required to process individual rows since it knows the types of everything ahead of time.This optimization can only be used by single input strings, because it can delegate dictionary ids to use for grouping to the underlying column. String expressions with multiple input columns will still use the default vector object selector path.
Vector selector construction in
ColumnProcessorshas been adjusted to push the decision on whether or not to use dictionary encoded selectors or object selectors into the processor factory, and group by engine now takes a less aggressive stance that any dictionary encoded column should use a dictionary encoded selector, which means single input string expressions will now use theSingleStringDeferredEvaluationExpressionDimensionVectorSelectorand group on the underlying dictionary ids, while aggregators or filters on virtual columns will still use the object selector, which should still be more performant in non-grouping cases where evaluation cannot be deferred.This PR also fixes an unrelated issue with vector query engines not having access to virtual column capabilities when trying to determine when if they can vectorize, and solves it by introducing a
VirtualizedColumnInspectorwhich wraps anotherColumnInspectorand is now used in these engines instead of the direct segment adapter, fixing the issue described in this comment #11188 (comment)This PR has: