allow optimizing sql expressions and virtual columns#12241
allow optimizing sql expressions and virtual columns#12241clintropolis merged 8 commits intoapache:masterfrom
Conversation
…g coercion bug and associated tests
| final boolean inputMatches; | ||
| final VirtualColumn virtualInput = | ||
| virtualColumnRegistry.findVirtualColumns(factory.requiredFields()) | ||
| final DruidExpression virtualInput = |
There was a problem hiding this comment.
this pattern is done in a few places, and i think could be done more efficiently, but I avoided changing the logic at this point to reduce the number of moving parts since this PR has become a bit large
|
Nice improvement! The new format should be easier to work with in code. The format is only partly an AST. It still creates named arguments, which adds more structure than is really necessary. Did you consider something that is an actual expression tree? "virtualColumns":[
{
"type":"fn",
"fn":"array_length",
"outputType":"LONG",
"args":[
{
"type:":"fn",
"fn:":"filter",
"args":[
{
"type":"lambda",
"output":"dim3",
"input_args":[
"x"
],
"args":[ {
"type":"fn",
"fn":"array",
"args":[ {
"type":"col-ref",
"name":"b"
}, {
"type":"arg-ref",
"name":"x"
} ] } ] ] ] ] }
} ],
The gist is, rather than have a node for every operation (a large number), have nodes for the expression elements: functions, literals, column references, etc. One ends up with a handful of node types rather than many dozens. Calcite should produce a tree something like this, so the translation from Calcite is pretty simple. |
the string fragments generated here do later turn into It is probably worth considering doing this at some point, but for now I focused on changing as little as possible to allow the transformations I needed, which is just to swap complete operations. Additionally, |
imply-cheddar
left a comment
There was a problem hiding this comment.
This generally looks good to me assuming tests pass. I called out a few places that might introduce backwards incompatibilities, please make sure to verify those before actually merging.
| new ExpressionPostAggregator( | ||
| "p3", | ||
| "(p2 + 1000)", | ||
| "(\"p2\" + 1000)", |
There was a problem hiding this comment.
Naive question, but is this test change indicative of a change in what SQL is required to be written? I.e. might some things that didn't require quotes previously suddenly require quotes?
There was a problem hiding this comment.
no, I think it was potentially a bug that the identifiers were not always being escaped in expression post-aggs, that has been fixed by the changes in how we build the native string expressions. humans are still allowed to reference identifiers without the double quotes, just the expressions we generate from SQL will always be escaped with them
| import java.util.function.Function; | ||
|
|
||
| /** | ||
| * todo(clint): rewrite javadocs |
There was a problem hiding this comment.
nit: you've got a todo todo
|
|
||
| // Must be sorted | ||
| private static final char[] SAFE_CHARS = " ,._-;:(){}[]<>!@#$%^&*`~?/".toCharArray(); | ||
| private static final VirtualColumnBuilder DEFAULT_VIRTUAL_COLUMN_BUILDER = new ExpressionVirtualColumnBuilder(); |
There was a problem hiding this comment.
I'm scared of static buildery thingies. Is there a reason that this needs to be static instead of just creating a new one each time it's used?
There was a problem hiding this comment.
no reason, just seemed wasteful because is stateless
There was a problem hiding this comment.
Seems more like a function than a builder, but we can change the name later if needed.
| return escaped.toString(); | ||
| } | ||
|
|
||
| private DruidExpression(@Nullable final SimpleExtraction simpleExtraction, final String expression, @Nullable final TrinaryFn<String, ColumnType, ExprMacroTable, VirtualColumn> virtualColumnFn) |
There was a problem hiding this comment.
The diff is making it hard to really see this, but I think you got rid of this 3-argument constructor. That's going to make anything built on the old code stop working. Is it possible to have the 3-argument constructor continue to exist, but be deprecated and delegate to the new constructor kinda like you did with the other methods?
There was a problem hiding this comment.
i think since it is private it should be ok? the static methods are what everything should be using to make these things afaik
There was a problem hiding this comment.
right.. it's private. of course, uhhhh, the diff made it hard for me to see that :P
| this.virtualColumnsByName = virtualColumnsByName; | ||
| } | ||
|
|
||
| public static VirtualColumnRegistry create(final RowSignature rowSignature) |
There was a problem hiding this comment.
This single-argument create was not resurrected. I haven't double checked if that might be a compatibility problem or not, but just wanted to highlight it.
There was a problem hiding this comment.
hmm, i thought about this, but only a DruidRel of some sort should be creating a virtual column registry, since they are scoped to a native query, so this so it didn't seem like a huge deal unless someone is like super super super deep into custom SQL planning for custom query types (and not fully sure it is possible without modifying druid-sql itself at this point...)
I can't really think of a clean way to add it back; I moved it here because the expression macro table should not be changing over the course of planning a query (or even planning multiple queries), so it seemed pointless that we were passing it in via planner context when we were eagerly creating the virtual columns (creating expression virtual columns requires a reference to the macro table), so we can't count on it being set in the constructor we would have to have an alternative way even for the new methods, which I guess would mean passing it in when creating an expression and storing a reference to it in each DruidExpression in the tree so that they could build their virtual columns.
I think maybe we should just document this on the unlikely chance someone was doing something incredibly crazy with this, because in my mind the registry is a facility that the planner provides to operator conversions/filters/aggregators etc.
There was a problem hiding this comment.
Agree that it's pretty in the weeds and someone who actually depends on the implementation to this level is probably in deep enough to deal with the change. Documenting seems reasonable.
jihoonson
left a comment
There was a problem hiding this comment.
LGTM. +1 after CI. Left a couple of nit comments that can be done as follow-ups.
|
|
||
| // Must be sorted | ||
| private static final char[] SAFE_CHARS = " ,._-;:(){}[]<>!@#$%^&*`~?/".toCharArray(); | ||
| private static final VirtualColumnBuilder DEFAULT_VIRTUAL_COLUMN_BUILDER = new ExpressionVirtualColumnBuilder(); |
There was a problem hiding this comment.
Seems more like a function than a builder, but we can change the name later if needed.
| private static class ExpressionAndTypeHint | ||
| { | ||
| private final DruidExpression expression; | ||
| private final ColumnType typeHint; |
There was a problem hiding this comment.
nit: typeHint seems worth to document. Can you please add some javadoc for it? It's not a blocker and you can do it as a follow-up.
| // the fallback expression would actually produce 3 rows, 2 nulls, 2 0's, and 2 1s | ||
| // instead of 4 nulls and two 1's we get when using the 'native' list filtered virtual column | ||
| // this is because of slight differences between filter and the native | ||
| // selector, which treats a 0 length array as null instead of an empty array like is produced by filter |
There was a problem hiding this comment.
This seems worth the release notes. Can you add some release note section in the PR description, so that we won't forget later?
|
thanks for review @cheddar and @jihoonson. CI failure seems unrelated (and travis logs look pretty messed up), so I'm going to go ahead and merge |
Description
This PR reworks how the SQL planner constructs
DruidExpressionand how it translates them intoVirtualColumnwhen constructing the native query.Prior to the changes in this PR,
DruidExpressionwere flattened string expressions, eagerly constructed into aVirtualColumnvia theVirtualColumnRegistry, whenever required for use in projections, aggregations, filters, etc. This meant that basically all of the expression structure was lost after converting to the string expression, making it impossible to modify them to try to optimize the query prior to conversion the the native JSON query.Now,
DruidExpressionretains the expression tree structure, each containing aList<DruidExpression>that contains any sub-expressions which at the time of virtual column creation is built into the string expression, using a newly added interfacewhich is added to the
DruidExpressionwhen it is created. Another new interface has been added to aid in rewritingDruidExpressiontrees before planning into a nativeQuery(along with a
visitmethod onDruidExpressionto accept the shuttle).DruidExpressionare also now annotated, with one of 4 roles,LITERAL,IDENTIFIER,EXPRESSION, andSPECIALIZED, the last of which, coupled with the shuttle visitor, is the primary motivator of this PR.Using
MV_FILTER_ONLY/MV_FILTER_NONEas an example, which has a "native" specializedVirtualColumnimplementation, any expression which supplies such a specialized implementation will now always use it.Prior to this PR, when used in some composition, e.g.
this function would use a fall-back expression
but now plans to this instead
In the process, I uncovered additional bugs with multi-value string array expressions, similar to the bugs fixed #12078 (and controlled by the flag #12210)
Noteworthy: in addition to the behavior change of fixing the remaining bugs of
MV_functions, specificallyMV_FILTER_ONLYandMV_FILTER_NONEmight have subtle behavior differences due to the new specialization code compared to the fall-back expression. thefilterfunction used as the fall-back expression would return empty arrays whereListFilteredVirtualColumnwould returnnull, so something likeMV_LENGTH(MV_FILTER_ONLY(...))would report0as the length for rows which contained no values. However, usingMV_FILTER_ONLY(..)directly would returnnullfor these rows since it did previously plan to theListFilteredVirtualColumn. This is fixed now, soMV_LENGTHwill benullfor these rows without values now.We might want to consider adjusting the fallback
filterexpression ofMV_FILTER_*to coerce empty arrays to nulls to behave consistently withListFilteredVirtualColumn, but I did not do that in this PR.Key changed/added classes in this PR
DruidExpressionVirtualColumnRegistryDruidQueryOperatorConversionsExpressionSelectors(bug fix)This PR has: