
Druid Multi-Value String Columns, Expressions, and SQL #7525

@clintropolis

Description


Motivation

Druid SQL is the likely entry point for new users interacting with Druid, but despite many advancements since its introduction it is still not as expressive in some regards as native Druid queries. One of the largest remaining gaps in functionality is that Druid SQL only partially supports multi-value dimensions, which is captured in #4638.

There are a number of non-trivial challenges in rectifying this situation, however, with the largest barrier being the behavior of Druid multi-value dimensions themselves. The native behavior of these columns is not directly compatible with standard SQL array or multi-set types because they were never designed or equipped to be array or multi-set types. Every string dimension is opportunistically a multi-value dimension, as they share the same underlying string dictionary column type. From segment to segment, whether or not a column has multiple values may vary, with just a column capabilities flag to distinguish them at processing time.

This is not itself a problem for Druid, because natively they are just string dimensions. Druid attaches no real significance to the array of values which make up a row: when aggregated with a top-n or group-by query, aggregation is done by the individual string values in the array, not the array itself; when filtering, any individual value in the array which satisfies the filter counts as a match, and the entire array of values is included in the result set. The original array can only be retrieved with a scan or select query, which largely diminishes its utility.

On top of this, multi-value dimensions are something of a second-class citizen in Druid, as they are not supported by the expression language used in virtual columns and expression filters, which is also used heavily by the SQL planner.

Proposed changes

This proposal lays out a way in which multi-value string dimensions can be operated on either as complete arrays or as individual values, near invisibly, at the user's discretion. This serves a number of purposes, but perhaps the primary one is to avoid breaking backwards compatibility for SQL queries that happen to work today. Allowing operations on multi-value dimensions with single-value semantics is likely most intuitive to existing users of these dimensions who are using native Druid queries or have constructed a working SQL query. Likewise, allowing array operations is perhaps more naturally intuitive with how a user might expect these dimensions to behave: an array is ingested and an array is stored, therefore it should be possible to operate on arrays on a per-row basis, and to preserve the structure as it was ingested during aggregations and filtering. Additionally, providing this array functionality vastly increases the power of existing multi-value string dimensions, allowing expressiveness that is currently just not possible.

The changes will be focused in two areas: the Druid expression language used for virtual columns and expression filters, and wiring the expanded functionality up to the Druid SQL planner.

Druid Expressions

Druid SQL relies heavily on expression virtual columns and expression filters, so introducing expression support for multi-value string dimensions will be necessary. Currently the Druid expression language has no real support for multi-value dimensions, with null being the value binding for rows which have multiple values. The objective of expanding the expression language is both to support expressions against existing multi-value dimensions, and to provide the machinery for any fully SQL compatible array types we might wish to introduce in future work.

To this end, this proposal introduces array types for doubles, longs, and strings to the Druid expression language, as well as a handful of function expressions to operate on them. Array constants will parse into DoubleArrayExpr, LongArrayExpr, and StringArrayExpr, which extend ConstantArrayExpr, which itself extends ConstantExpr. A new type of function expression, ApplyFunctionExpr, will also be defined. It takes a newly defined LambdaExpr fragment and one or more Expr arguments which are applied to the LambdaExpr, enabling definition of some of the functions necessary to fully support all aspects of this proposal, as well as increasing the expressiveness of what is possible with these new array types.

New top level grammar:


expr : 'null'                                         # null
     | ('-'|'!') expr                                 # unaryOpExpr
     |<assoc=right> expr '^' expr                     # powOpExpr
     | expr ('*'|'/'|'%') expr                        # mulDivModuloExpr
     | expr ('+'|'-') expr                            # addSubExpr
     | expr ('<'|'<='|'>'|'>='|'=='|'!=') expr        # logicalOpExpr
     | expr ('&&'|'||') expr                          # logicalAndOrExpr
     | '(' expr ')'                                   # nestedExpr
     | IDENTIFIER '(' lambda ',' fnArgs ')'           # applyFunctionExpr
     | IDENTIFIER '(' fnArgs? ')'                     # functionExpr
     | IDENTIFIER                                     # identifierExpr
     | DOUBLE                                         # doubleExpr
     | LONG                                           # longExpr
     | STRING                                         # string
     | '[' DOUBLE  (','? DOUBLE)* ']'                 # doubleArray
     | '[' LONG (','? LONG)* ']'                      # longArray
     | '[' STRING (','? STRING)* ']'                  # stringArray
     ;

lambda : (IDENTIFIER | '(' IDENTIFIER (','? IDENTIFIER)* ')') '->' expr
       ;

...
'Apply' Functions

The ApplyFunctionExpr is added, not directly for SQL compatibility, but rather to lay the groundwork to facilitate 'magical' transformations to seamlessly allow SQL interactions with multi-value dimensions with either single valued or array semantics.

function description
map(lambda,arr) applies a transform specified by a single argument lambda expression to all elements of arr, returning a new array
cartesian_map(lambda,arr1,arr2,...) applies a transform specified by a multi argument lambda expression to all elements of the cartesian product of all input arrays, returning a new array; the number of lambda arguments and array inputs must be the same
filter(lambda,arr) filters arr by a single argument lambda, returning a new array with all matching elements, or null if no elements match
fold(lambda,arr) folds a 2 argument lambda across arr. The first argument of the lambda is the array element and the second the accumulator, returning a single accumulated value.
cartesian_fold(lambda,arr1,arr2,...) folds a multi argument lambda across the cartesian product of all input arrays. The first arguments of the lambda are the array elements and the last is the accumulator, returning a single accumulated value.
any(lambda,arr) returns true if any element in the array matches the lambda expression
all(lambda,arr) returns true if all elements in the array match the lambda expression
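The semantics of these apply functions can be sketched in Python as follows. This is an illustrative model only, with hypothetical Python names; the real implementations operate on Expr and LambdaExpr objects inside the expression system, and the initial accumulator for fold is shown as an explicit argument here.

```python
from functools import reduce
from itertools import product

def map_(fn, arr):
    # map(lambda, arr): transform each element, returning a new array
    return [fn(x) for x in arr]

def cartesian_map(fn, *arrs):
    # cartesian_map(lambda, arr1, arr2, ...): apply a multi-argument lambda
    # to every tuple in the cartesian product of the input arrays
    return [fn(*t) for t in product(*arrs)]

def filter_(fn, arr):
    # filter(lambda, arr): keep matching elements, null (None) if none match
    out = [x for x in arr if fn(x)]
    return out if out else None

def fold(fn, arr, acc):
    # fold(lambda, arr): the lambda takes (element, accumulator)
    return reduce(lambda a, x: fn(x, a), arr, acc)

def any_(fn, arr):
    # any(lambda, arr): true if any element matches
    return any(fn(x) for x in arr)

def all_(fn, arr):
    # all(lambda, arr): true if all elements match
    return all(fn(x) for x in arr)
```

For example, `map_(lambda x: x + "-foo", ["a", "b"])` models the expression `map((x) -> concat(x, '-foo'), col)`.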
Array Functions

To unlock the potential of multi-value dimensions as array types, as well as support future array types, a number of array specific functions will also be defined.

function description
array_length(arr) returns length of array expression
array_offset(arr,long) returns the array element at the 0 based index supplied, or null for an out of range index
array_ordinal(arr,long) returns the array element at the 1 based index supplied, or null for an out of range index
array_contains(arr,expr) returns true if the array contains the element specified by expr, or contains all elements specified by expr if expr is an array
array_overlap(arr1,arr2) returns true if arr1 and arr2 have any elements in common
array_offset_of(arr,expr) returns the 0 based index of the first occurrence of expr in the array, or null if no matching elements exist in the array.
array_ordinal_of(arr,expr) returns the 1 based index of the first occurrence of expr in the array, or null if no matching elements exist in the array.
array_append(arr,expr) appends expr to arr
array_concat(arr1,arr2) concatenates 2 arrays
array_to_string(arr,str) joins all elements of arr by the delimiter specified by str
string_to_array(str1,str2) splits str1 into an array on the delimiter specified by str2
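The intended semantics of the array functions can likewise be sketched in Python. Again, this is an illustrative model under the descriptions above, not the actual implementation:

```python
def array_offset(arr, i):
    # 0-based index, null (None) for an out of range index
    return arr[i] if 0 <= i < len(arr) else None

def array_ordinal(arr, i):
    # 1-based index, null (None) for an out of range index
    return array_offset(arr, i - 1)

def array_contains(arr, expr):
    # scalar expr: membership test; array expr: all elements must be present
    if isinstance(expr, list):
        return all(e in arr for e in expr)
    return expr in arr

def array_overlap(arr1, arr2):
    # true if the arrays share any element
    return any(e in arr2 for e in arr1)

def array_offset_of(arr, expr):
    # 0-based index of the first occurrence, null (None) if absent
    return arr.index(expr) if expr in arr else None

def array_to_string(arr, delim):
    # join all elements with the delimiter
    return delim.join(str(x) for x in arr)

def string_to_array(s, delim):
    # split the string on the delimiter
    return s.split(delim)
```

Note that `array_to_string` and `string_to_array` are inverses when the delimiter does not occur inside any element.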
Automatic Expression Transformations and Expression Value Selector Behavior

One of the less straightforward problems to solve is how the expression language can cope with the fact that, when operating on an individual segment, it is impossible to tell if a dimension is multi-valued when that specific segment does not have multiple values but other segments in the query do. Likewise, since native Druid facilities for handling multi-valued dimensions do not approach them as arrays, it might be more natural to allow users to write expressions against them as though they are single valued, since that is how they aggregate. Since multi-valued dimensions natively have this opportunistic nature, so too will the expression language handling of these dimensions.

When creating the expression virtual column selector, the expression is currently examined to collect the required column identifiers whose values will form the bindings for the expression to operate on. An additional bit of logic will be included to check whether an identifier takes part in an array function, to detect which columns the expression expects to have multiple values. Likewise, the underlying columns that take part in these bindings will be examined to detect if multiple values are present.

  • Expressions which have array functions, but with single valued columns as input to those functions, will see those values implicitly cast to single element arrays containing the value.
  • Expressions which do not have array functions, but have multi valued columns as input, will be implicitly wrapped in a map expression, created using the column identifier as the lambda identifier, the actual expression as the body, and the column as the array input; e.g. an Expr with a multi-value identifier x becomes map((x) -> Expr, x). Expression filters which do not contain array functions but have multi-value columns as input will use the any function instead of map, to mimic native Druid filter behavior.
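These implicit rewrites can be pictured with a small Python sketch. The names here are hypothetical; the real rewrite happens on Expr trees at binding analysis time, not on values:

```python
def apply_scalar_expr(expr, binding):
    # Non-filter context: a scalar expression over a multi-value binding
    # behaves as if wrapped in map((x) -> expr, x); over a single value
    # it applies directly. A single value fed to an array function would
    # conversely be cast to a one-element array.
    if isinstance(binding, list):
        return [expr(v) for v in binding]   # map((x) -> expr, x)
    return expr(binding)

def apply_filter_expr(expr, binding):
    # Filter context: wrap in any((x) -> expr, x) so that a row matches
    # when any individual value matches, mimicking native filter behavior.
    if isinstance(binding, list):
        return any(expr(v) for v in binding)
    return expr(binding)
```

So a virtual column `upper(x)` over the row `["a", "b"]` yields `["A", "B"]`, while an expression filter `x == 'b'` matches that row because one of its values matches.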

These translations resolve how to handle the input, allowing multi-value dimensions to be operated on either as individual values or as the complete array, but the introduction of array types as the potential output of expressions does present a bit of a problem. Since no new ValueType is being introduced to correspond to the array expression types, the expression system will also need to automatically coerce these output types back to String and fabricate a multi-value selector, to make use of all of the existing native infrastructure for processing multi-value dimensions.
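The output coercion can be pictured as follows; this is a hypothetical sketch of the intent, not the selector implementation:

```python
def coerce_expression_output(result):
    # With no array ValueType, an expression's array output is coerced
    # element-wise back to String and presented through a fabricated
    # multi-value selector, so downstream native machinery sees an
    # ordinary multi-value string dimension.
    if isinstance(result, list):
        return [None if v is None else str(v) for v in result]
    return None if result is None else str(result)
```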

SQL Support

Druid SQL will continue to represent multi-value string dimensions as the VARCHAR type as far as Calcite is concerned, saving true SQL array types for future work. The Druid INFORMATION_SCHEMA COLUMNS table should be expanded to include a new column, IS_MULTIVALUE, to indicate that a string dimension has multiple values, connecting to the segment metadata query data on the column.

Array Functions

All of the expression language array functions will be mirrored in SQL. Future work will add a NEST function, the opposite of the UNNEST that many SQL dialects provide, to allow Druid multi-value string columns to be treated as ARRAY typed in SQL.

Filtering

Druid SQL will still have some limitations when filtering on multi-value string dimensions compared to its native counterpart, namely the 'contradictory' filter problem illustrated in #4638: a = 'x' AND a = 'y'. Calcite will correctly realize that this is effectively false for a legitimate VARCHAR, and the query will be a no-op. This behavior will continue since we are still going to represent multi-value string dimensions as VARCHAR, so to express this type of filter the user must use the array_contains or array_overlap functions. When simple enough, these SQL expressions can be translated directly into native Druid selector filters, falling back to Druid expressions if the planner is unable to perform the optimization.
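To see why a = 'x' AND a = 'y' is satisfiable natively even though Calcite simplifies it to false for a VARCHAR, consider this Python model of native selector filter semantics (illustrative names, not Druid code):

```python
def selector_matches(row_values, target):
    # A native selector filter matches a multi-value row if ANY of the
    # row's values equals the target.
    return any(v == target for v in row_values)

def native_and_filter(row_values, targets):
    # An AND of selector filters: each conjunct is evaluated independently
    # against the whole array, so a row containing both 'x' and 'y'
    # satisfies a = 'x' AND a = 'y'.
    return all(selector_matches(row_values, t) for t in targets)
```

This is the behavior that array_contains(a, ['x', 'y']) will expose through SQL.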

Odds and Ends

There are potentially some discrepancies in the native handling of multi-value dimensions if issues #4195 and #5897 are still valid; they indicate that during aggregation top-n ignores null values while group-by does not, producing different results, which is likely not expected. If this is still an issue it should be addressed to make multi-value dimensions more self-consistent, especially since in SQL a group-by can become a top-n if a limit is added and approximate results are allowed.

Rationale

While it might have been possible to put machinery in place to make multi-value dimensions masquerade solely as true SQL compatible array types and represent them as such, since they are already partially supported by Druid SQL this does not seem a good option, as it would force a break in backwards compatibility. Furthermore, having wildly different default behavior between native queries and SQL would not be intuitive.

Instead, this proposal argues that the best path forward is for multi-value dimensions to continue to behave as string dimensions in SQL, but also introduce a handful of mechanisms to Druid SQL and the virtual column expression language to interact with them as array types as well. The resulting enhancements will make both native and SQL queries against multi-value dimensions much more powerful than they currently are.

Future work

NEST Function

The opposite of the UNNEST that many SQL dialects support, allowing a multi-value dimension to be aggregated as a single combined value rather than by its individual values as in the default Druid behavior.

function description
nest(arr,[delimiter]) converts a multi-value string dimension into a single, combined value for the purposes of aggregation with group-by or top-n queries. Internally this will aggregate as an expression virtual column which is joined into a string with array_to_string and transformed back into an array prior to returning the results. delimiter is optional, with a default of ','
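The round trip described above can be sketched in Python; the function names are hypothetical, standing in for the internal join-then-split that nest would perform via array_to_string:

```python
def nest(arr, delimiter=","):
    # Join the array into a single string so it aggregates as one
    # grouping key in a group-by or top-n query.
    return delimiter.join(arr)

def unnest_result(value, delimiter=","):
    # Transform the combined value back into an array before
    # returning results to the user.
    return value.split(delimiter)
```

This only works cleanly when the delimiter does not appear inside array elements, which is presumably why the delimiter is configurable.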

SQL Sub-Query Support for Multi-Value String Dimensions

Absent from the SQL portion of this proposal is an analog to map, filter, fold, and the similar 'lambda' functions, because SQL doesn't have this class of functions. However, subqueries against a multi-value column, which a handful of implementations support, are an analog of sorts to the behavior that map and filter provide to the expression language, so providing support in SQL to map array column subqueries to these functions will be needed for complete parity between SQL queries and native queries.

Native Array Typed Columns

The introduction of array types and multi-value support for the expression language will lay the foundation to introduce true SQL array types in the future should we choose. Much of the existing multi-value string dimension machinery could be re-used and expanded to add native array type columns. Many of the functions that the improved multi-value string dimensions will support could likely be optimized greatly if explicit array typed columns are created, and native representations of long/float/double arrays would likely be much more space efficient than hijacking multi-value string dimensions for this purpose.
