Skip to content

allow optimizing sql expressions and virtual columns#12241

Merged
clintropolis merged 8 commits intoapache:masterfrom
clintropolis:sql-expresions-and-virtual-columns
Feb 9, 2022
Merged

allow optimizing sql expressions and virtual columns#12241
clintropolis merged 8 commits intoapache:masterfrom
clintropolis:sql-expresions-and-virtual-columns

Conversation

@clintropolis
Copy link
Copy Markdown
Member

@clintropolis clintropolis commented Feb 8, 2022

Description

This PR reworks how the SQL planner constructs DruidExpression and how it translates them into VirtualColumn when constructing the native query.

Prior to the changes in this PR, DruidExpression were flattened string expressions, eagerly constructed into a VirtualColumn via the VirtualColumnRegistry, whenever required for use in projections, aggregations, filters, etc. This meant that basically all of the expression structure was lost after converting to the string expression, making it impossible to modify them to try to optimize the query prior to conversion the the native JSON query.

Now, DruidExpression retains the expression tree structure, each containing a List<DruidExpression> that contains any sub-expressions which at the time of virtual column creation is built into the string expression, using a newly added interface

  public interface ExpressionBuilder
  {
    String buildExpression(List<DruidExpression> arguments);
  }

which is added to the DruidExpression when it is created. Another new interface has been added to aid in rewriting DruidExpression trees before planning into a native Query

  public interface DruidExpressionShuttle
  {
    DruidExpression visit(DruidExpression expression);

    default List<DruidExpression> visitAll(List<DruidExpression> expressions)
    {
      List<DruidExpression> list = new ArrayList<>(expressions.size());
      for (DruidExpression expr : expressions) {
        list.add(visit(expr));
      }
      return list;
    }
  }

(along with a visit method on DruidExpression to accept the shuttle).

DruidExpression are also now annotated, with one of 4 roles, LITERAL, IDENTIFIER, EXPRESSION, and SPECIALIZED, the last of which, coupled with the shuttle visitor, is the primary motivator of this PR.

Using MV_FILTER_ONLY/MV_FILTER_NONE as an example, which has a "native" specialized VirtualColumn implementation, any expression which supplies such a specialized implementation will now always use it.

Prior to this PR, when used in some composition, e.g.

SELECT MV_LENGTH(MV_FILTER_ONLY(dim3, ARRAY['b'])), SUM(cnt) FROM druid.numfoo GROUP BY 1 ORDER BY 2 DESC

this function would use a fall-back expression

"virtualColumns" : [ {
    "type" : "expression",
    "name" : "v0",
    "expression" : "array_length(filter((x) -> array_contains(array('b'), x), "dim3"))",
    "outputType" : "LONG"
  } ],

but now plans to this instead

"virtualColumns" : [ {
    "type" : "expression",
    "name" : "v0",
    "expression" : "array_length(\"v1\")",
    "outputType" : "LONG"
  }, {
    "type" : "mv-filtered",
    "name" : "v1",
    "delegate" : {
      "type" : "default",
      "dimension" : "dim3",
      "outputName" : "dim3",
      "outputType" : "STRING"
    },
    "values" : [ "b" ],
    "isAllowList" : true
  } ],

In the process, I uncovered additional bugs with multi-value string array expressions, similar to the bugs fixed #12078 (and controlled by the flag #12210)

Noteworthy: in addition to the behavior change of fixing the remaining bugs of MV_ functions, specifically MV_FILTER_ONLY and MV_FILTER_NONE might have subtle behavior differences due to the new specialization code compared to the fall-back expression. the filter function used as the fall-back expression would return empty arrays where ListFilteredVirtualColumn would return null, so something like MV_LENGTH(MV_FILTER_ONLY(...)) would report 0 as the length for rows which contained no values. However, using MV_FILTER_ONLY(..) directly would return null for these rows since it did previously plan to the ListFilteredVirtualColumn. This is fixed now, so MV_LENGTH will be null for these rows without values now.

We might want to consider adjusting the fallback filter expression of MV_FILTER_* to coerce empty arrays to nulls to behave consistently with ListFilteredVirtualColumn, but I did not do that in this PR.

Key changed/added classes in this PR
  • DruidExpression
  • VirtualColumnRegistry
  • DruidQuery
  • OperatorConversions
  • ExpressionSelectors (bug fix)

This PR has:

  • been self-reviewed.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

@clintropolis clintropolis changed the title sql expressions and virtual columns rework sql expressions and virtual columns Feb 8, 2022
final boolean inputMatches;
final VirtualColumn virtualInput =
virtualColumnRegistry.findVirtualColumns(factory.requiredFields())
final DruidExpression virtualInput =
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this pattern is done in a few places, and i think could be done more efficiently, but I avoided changing the logic at this point to reduce the number of moving parts since this PR has become a bit large

@paul-rogers
Copy link
Copy Markdown
Contributor

Nice improvement! The new format should be easier to work with in code. The format is only partly an AST. It still creates named arguments, which adds more structure than is really necessary. Did you consider something that is an actual expression tree?

"virtualColumns":[
   {
      "type":"fn",
      "fn":"array_length",
      "outputType":"LONG",
      "args":[
         {
            "type:":"fn",
            "fn:":"filter",
            "args":[
               {
                  "type":"lambda",
                  "output":"dim3",
                  "input_args":[
                     "x"
                  ],
                  "args":[ {
                        "type":"fn",
                        "fn":"array",
                        "args":[ {
                              "type":"col-ref",
                              "name":"b"
                           }, {
                           "type":"arg-ref",
                           "name":"x"
                        } ] } ] ] ] ] }
} ],

The gist is, rather than have a node for every operation (a large number), have nodes for the expression elements: functions, literals, column references, etc. One ends up with a handful of node types rather than many dozens.

Calcite should produce a tree something like this, so the translation from Calcite is pretty simple.

@clintropolis
Copy link
Copy Markdown
Member Author

Did you consider something that is an actual expression tree?

the string fragments generated here do later turn into Expr, which is the AST of native expressions, however that is a bit more work than I was trying to bite off at this time, and we lack JSON serialization and such to be able to send these expressions parsed over the wire.

It is probably worth considering doing this at some point, but for now I focused on changing as little as possible to allow the transformations I needed, which is just to swap complete operations. Additionally, Expr is missing a lot of information that we might want to decorate it with if we were using it in this manner, so we would need some additional adjustments to use like this.

Copy link
Copy Markdown
Contributor

@imply-cheddar imply-cheddar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This generally looks good to me assuming tests pass. I called out a few places that might introduce backwards incompatibilities, please make sure to verify those before actually merging.

new ExpressionPostAggregator(
"p3",
"(p2 + 1000)",
"(\"p2\" + 1000)",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naive question, but is this test change indicative of a change in what SQL is required to be written? I.e. might some things that didn't require quotes previously suddenly require quotes?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, I think it was potentially a bug that the identifiers were not always being escaped in expression post-aggs, that has been fixed by the changes in how we build the native string expressions. humans are still allowed to reference identifiers without the double quotes, just the expressions we generate from SQL will always be escaped with them

import java.util.function.Function;

/**
* todo(clint): rewrite javadocs
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you've got a todo todo

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops 👍


// Must be sorted
private static final char[] SAFE_CHARS = " ,._-;:(){}[]<>!@#$%^&*`~?/".toCharArray();
private static final VirtualColumnBuilder DEFAULT_VIRTUAL_COLUMN_BUILDER = new ExpressionVirtualColumnBuilder();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm scared of static buildery thingies. Is there a reason that this needs to be static instead of just creating a new one each time it's used?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no reason, just seemed wasteful because is stateless

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems more like a function than a builder, but we can change the name later if needed.

return escaped.toString();
}

private DruidExpression(@Nullable final SimpleExtraction simpleExtraction, final String expression, @Nullable final TrinaryFn<String, ColumnType, ExprMacroTable, VirtualColumn> virtualColumnFn)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The diff is making it hard to really see this, but I think you got rid of this 3-argument constructor. That's going to make anything built on the old code stop working. Is it possible to have the 3-argument constructor continue to exist, but be deprecated and delegate to the new constructor kinda like you did with the other methods?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think since it is private it should be ok? the static methods are what everything should be using to make these things afaik

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right.. it's private. of course, uhhhh, the diff made it hard for me to see that :P

this.virtualColumnsByName = virtualColumnsByName;
}

public static VirtualColumnRegistry create(final RowSignature rowSignature)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This single-argument create was not resurrected. I haven't double checked if that might be a compatibility problem or not, but just wanted to highlight it.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, i thought about this, but only a DruidRel of some sort should be creating a virtual column registry, since they are scoped to a native query, so this so it didn't seem like a huge deal unless someone is like super super super deep into custom SQL planning for custom query types (and not fully sure it is possible without modifying druid-sql itself at this point...)

I can't really think of a clean way to add it back; I moved it here because the expression macro table should not be changing over the course of planning a query (or even planning multiple queries), so it seemed pointless that we were passing it in via planner context when we were eagerly creating the virtual columns (creating expression virtual columns requires a reference to the macro table), so we can't count on it being set in the constructor we would have to have an alternative way even for the new methods, which I guess would mean passing it in when creating an expression and storing a reference to it in each DruidExpression in the tree so that they could build their virtual columns.

I think maybe we should just document this on the unlikely chance someone was doing something incredibly crazy with this, because in my mind the registry is a facility that the planner provides to operator conversions/filters/aggregators etc.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that it's pretty in the weeds and someone who actually depends on the implementation to this level is probably in deep enough to deal with the change. Documenting seems reasonable.

Copy link
Copy Markdown
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. +1 after CI. Left a couple of nit comments that can be done as follow-ups.


// Must be sorted
private static final char[] SAFE_CHARS = " ,._-;:(){}[]<>!@#$%^&*`~?/".toCharArray();
private static final VirtualColumnBuilder DEFAULT_VIRTUAL_COLUMN_BUILDER = new ExpressionVirtualColumnBuilder();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems more like a function than a builder, but we can change the name later if needed.

private static class ExpressionAndTypeHint
{
private final DruidExpression expression;
private final ColumnType typeHint;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: typeHint seems worth to document. Can you please add some javadoc for it? It's not a blocker and you can do it as a follow-up.

Comment on lines +1158 to +1161
// the fallback expression would actually produce 3 rows, 2 nulls, 2 0's, and 2 1s
// instead of 4 nulls and two 1's we get when using the 'native' list filtered virtual column
// this is because of slight differences between filter and the native
// selector, which treats a 0 length array as null instead of an empty array like is produced by filter
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems worth the release notes. Can you add some release note section in the PR description, so that we won't forget later?

@clintropolis clintropolis changed the title rework sql expressions and virtual columns allow optimizing sql expressions and virtual columns Feb 9, 2022
@clintropolis
Copy link
Copy Markdown
Member Author

thanks for review @cheddar and @jihoonson. CI failure seems unrelated (and travis logs look pretty messed up), so I'm going to go ahead and merge

@clintropolis clintropolis merged commit 3ee66bb into apache:master Feb 9, 2022
@clintropolis clintropolis deleted the sql-expresions-and-virtual-columns branch February 10, 2022 01:47
@abhishekagarwal87 abhishekagarwal87 added this to the 0.23.0 milestone May 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants