
Druid Multi-Value String Columns, Expressions, and SQL #7525

@clintropolis

Description


Motivation

Druid SQL is the likely entry point for new users interacting with Druid, but despite many advancements since its introduction it is still not as expressive in some regards as native Druid queries. One of the largest remaining gaps in functionality is that Druid SQL only partially supports multi-value dimensions, which is captured in #4638.

There are a number of non-trivial challenges in rectifying this situation, however, with the largest barrier being the behavior of Druid multi-value dimensions themselves. The native behavior of these columns is not directly compatible with standard SQL array or multi-set types because they were never designed or equipped to be array or multi-set types. Every string dimension is opportunistically a multi-value dimension, as they share the same underlying string dictionary column type. From segment to segment, whether or not a column has multiple values may vary, with just a column capabilities flag to distinguish them at processing time.

This is not itself a problem for Druid, because natively they are just string dimensions. Druid attaches no real significance to the array of values which make up a row: when aggregated with a top-n or group-by query, aggregation is done by the individual string values in the array, not the array itself; when filtering, any individual value in the array which satisfies the filter counts as a match, and the entire array of values is included in the result set. The original array can only be retrieved with a scan or select query, which largely diminishes its utility.

On top of this, multi-value dimensions are something of a second-class citizen in Druid, as they are not supported by the expression language used in virtual columns and expression filters, which is also used heavily by the SQL planner.

Proposed changes

This proposal lays out a way in which multi-value string dimensions can be operated on either as complete arrays or as individual values, near invisibly, at the user's discretion. This serves a number of purposes, but perhaps the primary one is to avoid breaking backwards compatibility for SQL queries that happen to work today. Allowing operations on multi-value dimensions with single-value semantics is likely most intuitive to existing users of these dimensions who are using native Druid queries or have constructed a working SQL query. Likewise, allowing array operations is perhaps more naturally intuitive with how a user might expect these dimensions to behave: an array is ingested and an array is stored, therefore it should be possible to operate on arrays on a per-row basis, and to preserve the structure as it was ingested during aggregations and filtering. Additionally, providing this array functionality vastly increases the power of existing multi-value string dimensions, allowing expressiveness that is currently just not possible.

The changes will be focused in two areas: the Druid expression language used for virtual columns and expression filters, and wiring the expanded functionality up to the Druid SQL planner.

Druid Expressions

Druid SQL relies heavily on expression virtual columns and expression filters, so introducing expression support for multi-value string dimensions will be necessary. Currently the Druid expression language has no real support for multi-value dimensions, with null being the value binding for rows which have multiple values. The objective of expanding the expression language is both to support expressions against existing multi-value dimensions, and to provide the machinery for any fully SQL compatible array types we might wish to introduce in future work.

To this end, this proposal introduces array types for doubles, longs, and strings to the Druid expression language, as well as a handful of function expressions to operate on them. Array constants will parse into DoubleArrayExpr, LongArrayExpr, and StringArrayExpr, which extend ConstantArrayExpr, which itself extends ConstantExpr. A new type of function expression, ApplyFunctionExpr, will also be defined. It takes a newly defined LambdaExpr fragment and one or more Expr arguments which are applied to the LambdaExpr, enabling definition of some of the functions necessary to fully support all aspects of this proposal, as well as increasing the expressiveness of what is possible with these new array types.

New top level grammar:


expr : 'null'                                         # null
     | ('-'|'!') expr                                 # unaryOpExpr
     |<assoc=right> expr '^' expr                     # powOpExpr
     | expr ('*'|'/'|'%') expr                        # mulDivModuloExpr
     | expr ('+'|'-') expr                            # addSubExpr
     | expr ('<'|'<='|'>'|'>='|'=='|'!=') expr        # logicalOpExpr
     | expr ('&&'|'||') expr                          # logicalAndOrExpr
     | '(' expr ')'                                   # nestedExpr
     | IDENTIFIER '(' lambda ',' fnArgs ')'           # applyFunctionExpr
     | IDENTIFIER '(' fnArgs? ')'                     # functionExpr
     | IDENTIFIER                                     # identifierExpr
     | DOUBLE                                         # doubleExpr
     | LONG                                           # longExpr
     | STRING                                         # string
     | '[' DOUBLE  (','? DOUBLE)* ']'                 # doubleArray
     | '[' LONG (','? LONG)* ']'                      # longArray
     | '[' STRING (','? STRING)* ']'                  # stringArray
     ;

lambda : (IDENTIFIER | '(' IDENTIFIER (','? IDENTIFIER)* ')') '->' expr
       ;

...
'Apply' Functions

The ApplyFunctionExpr is added, not directly for SQL compatibility, but rather to lay the groundwork to facilitate 'magical' transformations to seamlessly allow SQL interactions with multi-value dimensions with either single valued or array semantics.

function description
map(lambda,arr) applies a transform specified by a single argument lambda expression to all elements of arr, returning a new array
cartesian_map(lambda,arr1,arr2,...) applies a transform specified by a multi argument lambda expression to all elements of the cartesian product of all input arrays, returning a new array; the number of lambda arguments and array inputs must be the same
filter(lambda,arr) filters arr by a single argument lambda, returning a new array with all matching elements, or null if no elements match
fold(lambda,arr) folds a 2 argument lambda across arr. The first argument of the lambda is the array element and the second the accumulator, returning a single accumulated value.
cartesian_fold(lambda,arr1,arr2,...) folds a multi argument lambda across the cartesian product of all input arrays. The first arguments of the lambda are the array elements and the last is the accumulator, returning a single accumulated value.
any(lambda,arr) returns true if any element in the array matches the lambda expression
all(lambda,arr) returns true if all elements in the array match the lambda expression
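The semantics of these apply functions can be sketched in Python as follows. This is an illustrative model only, with hypothetical Python names; the real implementations operate on Expr and LambdaExpr objects inside the expression system, and the initial accumulator for fold is shown as an explicit argument here.

```python
from functools import reduce
from itertools import product

def map_(fn, arr):
    # map(lambda, arr): transform each element, returning a new array
    return [fn(x) for x in arr]

def cartesian_map(fn, *arrs):
    # cartesian_map(lambda, arr1, arr2, ...): apply a multi-argument lambda
    # to every tuple in the cartesian product of the input arrays
    return [fn(*t) for t in product(*arrs)]

def filter_(fn, arr):
    # filter(lambda, arr): keep matching elements, null (None) if none match
    out = [x for x in arr if fn(x)]
    return out if out else None

def fold(fn, arr, acc):
    # fold(lambda, arr): the lambda takes (element, accumulator)
    return reduce(lambda a, x: fn(x, a), arr, acc)

def any_(fn, arr):
    # any(lambda, arr): true if any element matches
    return any(fn(x) for x in arr)

def all_(fn, arr):
    # all(lambda, arr): true if all elements match
    return all(fn(x) for x in arr)
```

For example, `map_(lambda x: x + "-foo", ["a", "b"])` models the expression `map((x) -> concat(x, '-foo'), col)`.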
Array Functions

To unlock the potential of multi-value dimensions as array types, as well as support future array types, a number of array specific functions will also be defined.

function description
array_length(arr) returns length of array expression
array_offset(arr,long) returns the array element at the 0 based index supplied, or null for an out of range index
array_ordinal(arr,long) returns the array element at the 1 based index supplied, or null for an out of range index
array_contains(arr,expr) returns true if the array contains the element specified by expr, or contains all elements specified by expr if expr is an array
array_overlap(arr1,arr2) returns true if arr1 and arr2 have any elements in common
array_offset_of(arr,expr) returns the 0 based index of the first occurrence of expr in the array, or null if no matching elements exist in the array.
array_ordinal_of(arr,expr) returns the 1 based index of the first occurrence of expr in the array, or null if no matching elements exist in the array.
array_append(arr,expr) appends expr to arr
array_concat(arr1,arr2) concatenates 2 arrays
array_to_string(arr,str) joins all elements of arr by the delimiter specified by str
string_to_array(str1,str2) splits str1 into an array on the delimiter specified by str2
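The intended semantics of the array functions can likewise be sketched in Python. Again, this is an illustrative model under the descriptions above, not the actual implementation:

```python
def array_offset(arr, i):
    # 0-based index, null (None) for an out of range index
    return arr[i] if 0 <= i < len(arr) else None

def array_ordinal(arr, i):
    # 1-based index, null (None) for an out of range index
    return array_offset(arr, i - 1)

def array_contains(arr, expr):
    # scalar expr: membership test; array expr: all elements must be present
    if isinstance(expr, list):
        return all(e in arr for e in expr)
    return expr in arr

def array_overlap(arr1, arr2):
    # true if the arrays share any element
    return any(e in arr2 for e in arr1)

def array_offset_of(arr, expr):
    # 0-based index of the first occurrence, null (None) if absent
    return arr.index(expr) if expr in arr else None

def array_to_string(arr, delim):
    # join all elements with the delimiter
    return delim.join(str(x) for x in arr)

def string_to_array(s, delim):
    # split the string on the delimiter
    return s.split(delim)
```

Note that `array_to_string` and `string_to_array` are inverses when the delimiter does not occur inside any element.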
Automatic Expression Transformations and Expression Value Selector Behavior

One of the less straightforward problems to solve is how the expression language can cope with the fact that, when operating on an individual segment, it is impossible to tell if a dimension is multi-valued when that specific segment does not have multiple values but other segments in the query do. Likewise, since native Druid facilities for handling multi-valued dimensions do not approach them as arrays, it might be more natural to allow users to write expressions against them as though they are single valued, since that is how they aggregate. Since multi-valued dimensions natively have this opportunistic nature, so too will the expression language handling of these dimensions.

When creating the expression virtual column selector, the expression is currently examined to collect the required column identifiers whose values will form the bindings for the expression to operate on. An additional bit of logic will be included to check whether an identifier takes part in an array function, to detect which columns the expression expects to have multiple values. Likewise, the underlying columns that take part in these bindings will be examined to detect if multiple values are present.

  • Expressions which have array functions, but with single valued columns as input to those functions, will see those values implicitly cast to single element arrays containing the value.
  • Expressions which do not have array functions, but have multi valued columns as input, will be implicitly wrapped in a map expression, created using the column identifier as the lambda identifier, the actual expression as the body, and the column as the array input; e.g. an Expr with a multi-value identifier x becomes map((x) -> Expr, x). Expression filters which do not contain array functions but have multi-value columns as input will use the any function instead of map, to mimic native Druid filter behavior.
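These implicit rewrites can be pictured with a small Python sketch. The names here are hypothetical; the real rewrite happens on Expr trees at binding analysis time, not on values:

```python
def apply_scalar_expr(expr, binding):
    # Non-filter context: a scalar expression over a multi-value binding
    # behaves as if wrapped in map((x) -> expr, x); over a single value
    # it applies directly. A single value fed to an array function would
    # conversely be cast to a one-element array.
    if isinstance(binding, list):
        return [expr(v) for v in binding]   # map((x) -> expr, x)
    return expr(binding)

def apply_filter_expr(expr, binding):
    # Filter context: wrap in any((x) -> expr, x) so that a row matches
    # when any individual value matches, mimicking native filter behavior.
    if isinstance(binding, list):
        return any(expr(v) for v in binding)
    return expr(binding)
```

So a virtual column `upper(x)` over the row `["a", "b"]` yields `["A", "B"]`, while an expression filter `x == 'b'` matches that row because one of its values matches.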

These translations resolve how to handle the input, allowing multi-value dimensions to be operated on either as individual values or as the complete array, but the introduction of array types as the potential output of expressions does present a bit of a problem. Since no new ValueType is being introduced to correspond to the array expression types, the expression system will also need to automatically coerce these output types back to String and fabricate a multi-value selector, to make use of all of the existing native infrastructure for processing multi-value dimensions.
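The output coercion can be pictured as follows; this is a hypothetical sketch of the intent, not the selector implementation:

```python
def coerce_expression_output(result):
    # With no array ValueType, an expression's array output is coerced
    # element-wise back to String and presented through a fabricated
    # multi-value selector, so downstream native machinery sees an
    # ordinary multi-value string dimension.
    if isinstance(result, list):
        return [None if v is None else str(v) for v in result]
    return None if result is None else str(result)
```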

SQL Support

Druid SQL will continue to represent multi-value string dimensions as the VARCHAR type as far as Calcite is concerned, saving true SQL array types for future work. The Druid INFORMATION_SCHEMA COLUMNS table should be expanded to include a new column, IS_MULTIVALUE, to indicate that a string dimension has multiple values, connecting to the segment metadata query data on the column.

Array Functions

All of the expression language array functions will be mirrored in SQL. Future work will add a NEST function, the opposite of the UNNEST that many SQL dialects provide, to allow Druid multi-value string columns to be treated as ARRAY typed in SQL.

Filtering

Druid SQL will still have some limitations when filtering on multi-value string dimensions compared to its native counterpart, namely the 'contradictory' filter problem illustrated in #4638: a = 'x' AND a = 'y'. Calcite will correctly realize that this is effectively false for a legitimate VARCHAR, and the query will be a no-op. This behavior will continue since we are still going to represent multi-value string dimensions as VARCHAR, so to express this type of filter the user must use the array_contains or array_overlap functions. When simple enough, these SQL expressions can be translated directly into native Druid selector filters, falling back to Druid expressions if the planner is unable to perform the optimization.
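To see why a = 'x' AND a = 'y' is satisfiable natively even though Calcite simplifies it to false for a VARCHAR, consider this Python model of native selector filter semantics (illustrative names, not Druid code):

```python
def selector_matches(row_values, target):
    # A native selector filter matches a multi-value row if ANY of the
    # row's values equals the target.
    return any(v == target for v in row_values)

def native_and_filter(row_values, targets):
    # An AND of selector filters: each conjunct is evaluated independently
    # against the whole array, so a row containing both 'x' and 'y'
    # satisfies a = 'x' AND a = 'y'.
    return all(selector_matches(row_values, t) for t in targets)
```

This is the behavior that array_contains(a, ['x', 'y']) will expose through SQL.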

Odds and Ends

There are potentially some discrepancies in the native handling of multi-value dimensions if issues #4195 and #5897 are still valid; they indicate that during aggregation top-n ignores null values while group-by does not, producing different results, which is likely not expected. If this is still an issue it should be addressed to make multi-value dimensions more self-consistent, especially since in SQL a group-by can become a top-n if a limit is added and approximate results are allowed.

Rationale

While it might have been possible to put machinery in place to make multi-value dimensions masquerade solely as true SQL compatible array types and represent them as such, since they are already partially supported by Druid SQL this does not seem a good option, as it would force a break in backwards compatibility. Furthermore, having wildly different default behavior between native queries and SQL would not be intuitive.

Instead, this proposal argues that the best path forward is for multi-value dimensions to continue to behave as string dimensions in SQL, but also introduce a handful of mechanisms to Druid SQL and the virtual column expression language to interact with them as array types as well. The resulting enhancements will make both native and SQL queries against multi-value dimensions much more powerful than they currently are.

Future work

NEST Function

The opposite of the UNNEST that many SQL dialects support, allowing a multi-value dimension to be aggregated as a single combined value rather than by its individual values as in the default Druid behavior.

function description
nest(arr,[delimiter]) converts a multi-value string dimension into a single, combined value for the purposes of aggregation with group-by or top-n queries. Internally this will aggregate as an expression virtual column which is joined into a string with array_to_string and transformed back into an array prior to returning the results. delimiter is optional, with a default of ','
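The round trip described above can be sketched in Python; the function names are hypothetical, standing in for the internal join-then-split that nest would perform via array_to_string:

```python
def nest(arr, delimiter=","):
    # Join the array into a single string so it aggregates as one
    # grouping key in a group-by or top-n query.
    return delimiter.join(arr)

def unnest_result(value, delimiter=","):
    # Transform the combined value back into an array before
    # returning results to the user.
    return value.split(delimiter)
```

This only works cleanly when the delimiter does not appear inside array elements, which is presumably why the delimiter is configurable.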

SQL Sub-Query Support for Multi-Value String Dimensions

Absent from the SQL portion of this proposal is an analog to map, filter, fold, and the similar 'lambda' functions, because SQL doesn't have this class of functions. However, subqueries against a multi-value column, which a handful of implementations support, are an analog of sorts to the behavior that map and filter provide to the expression language, so providing support in SQL to map array column subqueries to these functions will be needed for complete parity between SQL queries and native queries.

Native Array Typed Columns

The introduction of array types and multi-value support for the expression language will lay the foundation to introduce true SQL array types in the future should we choose. Much of the existing multi-value string dimension machinery could be re-used and expanded to add native array type columns. Many of the functions that the improved multi-value string dimensions will support could likely be optimized greatly if explicit array typed columns are created, and native representations of long/float/double arrays would likely be much more space efficient than hijacking multi-value string dimensions for this purpose.
