diff --git a/docs/misc/math-expr.md b/docs/misc/math-expr.md
index 5060594d7f5d..94167800c429 100644
--- a/docs/misc/math-expr.md
+++ b/docs/misc/math-expr.md
@@ -170,7 +170,6 @@ See javadoc of java.lang.Math for detailed explanation for each function.
 |toradians|toradians(x) converts an angle measured in degrees to an approximately equivalent angle measured in radians|
 |ulp|ulp(x) returns the size of an ulp of the argument x|
-
 ## Array functions
 | function | description |
@@ -227,6 +226,34 @@ map((x) -> x + 1, x)
 ```
 in this case, the `x` when evaluating `x + 1` is the lambda argument, thus an element of the multi-valued column `x`, rather than the column `x` itself.
+
+## JSON functions
+JSON functions provide facilities to extract, transform, and create `COMPLEX` values.
+
+| function | description |
+|---|---|
+| json_value(expr, path) | Extract a Druid literal (`STRING`, `LONG`, `DOUBLE`) value from `expr` using JSONPath syntax of `path` |
+| json_query(expr, path) | Extract a `COMPLEX` value from `expr` using JSONPath syntax of `path` |
+| json_object(expr1, expr2[, expr3, expr4 ...]) | Construct a `COMPLEX` with alternating 'key' and 'value' arguments|
+| parse_json(expr) | Deserialize a JSON `STRING` into a `COMPLEX`. If the input is not a `STRING` or it is invalid JSON, this function will result in an error.|
+| try_parse_json(expr) | Deserialize a JSON `STRING` into a `COMPLEX`. If the input is not a `STRING` or it is invalid JSON, this function will result in a `NULL` value. |
+| to_json_string(expr) | Convert `expr` into a JSON `STRING` value |
+| json_keys(expr, path) | Get array of field names from `expr` at the specified JSONPath `path`, or null if the data does not exist or have any fields |
+| json_paths(expr) | Get array of all JSONPath paths available from `expr` |
+
+### JSONPath syntax
+
+Druid supports a small, simplified subset of the [JSONPath syntax](https://github.com/json-path/JsonPath/blob/master/README.md) operators, primarily limited to extracting individual values from nested data structures.
+
+|Operator|Description|
+| --- | --- |
+|`$`| Root element. All JSONPath expressions start with this operator. |
+|`.`| Child element in dot notation. |
+|`['']`| Child element in bracket notation. |
+|`[]`| Array index. |
+
+See [SQL JSON documentation](../querying/sql-json-functions.md#jsonpath-syntax) for examples.
+
 ## Reduction functions
 Reduction functions operate on zero or more expressions and return a single expression. If no expressions are passed as
diff --git a/docs/querying/sql-data-types.md b/docs/querying/sql-data-types.md
index 693a6b660408..bd056dae0288 100644
--- a/docs/querying/sql-data-types.md
+++ b/docs/querying/sql-data-types.md
@@ -33,7 +33,7 @@ Columns in Druid are associated with a specific data type. This topic describes
 Druid natively supports five basic column types: "long" (64 bit signed int), "float" (32 bit float), "double" (64 bit float) "string" (UTF-8 encoded strings and string arrays), and "complex" (catch-all for more exotic data types like
-hyperUnique and approxHistogram columns).
+json, hyperUnique, and approxHistogram columns).
 Timestamps (including the `__time` column) are treated by Druid as longs, with the value being the number of milliseconds since 1970-01-01 00:00:00 UTC, not counting leap seconds. Therefore, timestamps in Druid do not carry any
@@ -112,3 +112,17 @@ When `druid.expressions.useStrictBooleans = false` (the default mode), Druid use
 When `druid.expressions.useStrictBooleans = true`, Druid uses three-valued logic for [expressions](../misc/math-expr.md) evaluation, such as `expression` virtual columns or `expression` filters. However, even in this mode, Druid uses two-valued logic for filter types other than `expression`.
+
+## Nested columns
+Druid supports storing nested data structures in segments using the native `COMPLEX` type. You can interact
+with this data using [JSON functions](sql-json-functions.md), which can extract nested values, parse from string,
+serialize to string, and create new `COMPLEX` structures.
+
+`COMPLEX` types in general currently have limited functionality outside of the use of the specialized functions which
+understand them, and so have undefined behavior when:
+* grouping on complex values
+* filtered directly on complex values, such as `WHERE json is NULL`
+* used as inputs to aggregators without specialized handling for a specific complex type
+
+In many cases, functions are provided to translate `COMPLEX` value types to `STRING`, which serves as a workaround
+solution until `COMPLEX` type functionality can be improved.
\ No newline at end of file
diff --git a/docs/querying/sql-functions.md b/docs/querying/sql-functions.md
index 410180efa98c..90083dcd77bd 100644
--- a/docs/querying/sql-functions.md
+++ b/docs/querying/sql-functions.md
@@ -647,6 +647,46 @@ Parses `address` into an IPv4 address stored as an integer.
 Converts `address` into an IPv4 address in dot-decimal notation.
+## JSON_KEYS
+
+**Function type:** [JSON](sql-json-functions.md)
+
+`JSON_KEYS(expr, path)`
+
+Returns an array of field names from `expr` at the specified `path`.
+
+## JSON_OBJECT
+
+**Function type:** [JSON](sql-json-functions.md)
+
+`JSON_OBJECT(KEY expr1 VALUE expr2[, KEY expr3 VALUE expr4, ...])`
+
+Constructs a new `COMPLEX` object. The `KEY` expressions must evaluate to string types. The `VALUE` expressions can be composed of any input type, including other `COMPLEX` values. `JSON_OBJECT` can accept colon-separated key-value pairs. The following syntax is equivalent: `JSON_OBJECT(expr1:expr2[, expr3:expr4, ...])`.
+
+## JSON_PATHS
+
+**Function type:** [JSON](sql-json-functions.md)
+
+`JSON_PATHS(expr)`
+
+Returns an array of all paths which refer to literal values in `expr` in JSONPath format.
+
+## JSON_QUERY
+
+**Function type:** [JSON](sql-json-functions.md)
+
+`JSON_QUERY(expr, path)`
+
+Extracts a `COMPLEX` value from `expr`, at the specified `path`.
+
+## JSON_VALUE
+
+**Function type:** [JSON](sql-json-functions.md)
+
+`JSON_VALUE(expr, path [RETURNING sqlType])`
+
+Extracts a literal value from `expr` at the specified `path`. If you specify `RETURNING` and an SQL type name (such as `VARCHAR`, `BIGINT`, `DOUBLE`, etc) the function plans the query using the suggested type. Otherwise, it attempts to infer the type based on the context. If it can't infer the type, it defaults to `VARCHAR`.
+
 ## LATEST
 `LATEST(expr)`
@@ -899,6 +939,14 @@ Returns NULL if two values are equal, else returns the first value.
 Returns `e2` if `e1` is null, else returns `e1`.
+## PARSE_JSON
+
+**Function type:** [JSON](sql-json-functions.md)
+
+`PARSE_JSON(expr)`
+
+Parses `expr` into a `COMPLEX` object. This operator deserializes JSON values when processing them, translating stringified JSON into a nested structure. If the input is not a `VARCHAR` or it is invalid JSON, this function will result in an error.
+
 ## PARSE_LONG
 `PARSE_LONG(, [])`
@@ -1267,6 +1315,15 @@ Adds a certain amount of time to a given timestamp.
 Takes the difference between two timestamps, returning the results in the given units.
+## TO_JSON_STRING
+
+**Function type:** [JSON](sql-json-functions.md)
+
+`TO_JSON_STRING(expr)`
+
+Serializes `expr` into a JSON string.
+
+
 ## TRIM
 `TRIM([BOTH|LEADING|TRAILING] [ FROM] expr)`
@@ -1291,6 +1348,16 @@ Alias for [`TRUNCATE`](#truncate).
 Truncates a numerical expression to a specific number of decimal digits.
+
+## TRY_PARSE_JSON
+
+**Function type:** [JSON](sql-json-functions.md)
+
+`TRY_PARSE_JSON(expr)`
+
+Parses `expr` into a `COMPLEX` object. This operator deserializes JSON values when processing them, translating stringified JSON into a nested structure. If the input is not a `VARCHAR` or it is invalid JSON, this function will result in a `NULL` value.
+
+
 ## UPPER
 `UPPER(expr)`
diff --git a/docs/querying/sql-json-functions.md b/docs/querying/sql-json-functions.md
new file mode 100644
index 000000000000..ee0114598838
--- /dev/null
+++ b/docs/querying/sql-json-functions.md
@@ -0,0 +1,71 @@
+---
+id: sql-json-functions
+title: "SQL JSON functions"
+sidebar_label: "JSON functions"
+---
+
+
+
+
+
+Druid supports nested columns, which provide optimized storage and indexes for nested data structures. Use
+the following JSON functions to extract, transform, and create `COMPLEX` values.
+
+| Function | Notes |
+| --- | --- |
+|`JSON_KEYS(expr, path)`| Returns an array of field names from `expr` at the specified `path`.|
+|`JSON_OBJECT(KEY expr1 VALUE expr2[, KEY expr3 VALUE expr4, ...])` | Constructs a new `COMPLEX` object. The `KEY` expressions must evaluate to string types. The `VALUE` expressions can be composed of any input type, including other `COMPLEX` values. `JSON_OBJECT` can accept colon-separated key-value pairs. The following syntax is equivalent: `JSON_OBJECT(expr1:expr2[, expr3:expr4, ...])`.|
+|`JSON_PATHS(expr)`| Returns an array of all paths which refer to literal values in `expr` in JSONPath format. |
+|`JSON_QUERY(expr, path)`| Extracts a `COMPLEX` value from `expr`, at the specified `path`. |
+|`JSON_VALUE(expr, path [RETURNING sqlType])`| Extracts a literal value from `expr` at the specified `path`. If you specify `RETURNING` and an SQL type name (such as `VARCHAR`, `BIGINT`, `DOUBLE`, etc) the function plans the query using the suggested type. Otherwise, it attempts to infer the type based on the context. If it can't infer the type, it defaults to `VARCHAR`.|
+|`PARSE_JSON(expr)`|Parses `expr` into a `COMPLEX` object. This operator deserializes JSON values when processing them, translating stringified JSON into a nested structure. If the input is not a `VARCHAR` or it is invalid JSON, this function will result in an error.|
+|`TRY_PARSE_JSON(expr)`|Parses `expr` into a `COMPLEX` object. This operator deserializes JSON values when processing them, translating stringified JSON into a nested structure. If the input is not a `VARCHAR` or it is invalid JSON, this function will result in a `NULL` value.|
+|`TO_JSON_STRING(expr)`|Serializes `expr` into a JSON string.|
+
+### JSONPath syntax
+
+Druid supports a subset of the [JSONPath syntax](https://github.com/json-path/JsonPath/blob/master/README.md) operators, primarily limited to extracting individual values from nested data structures.
+
+|Operator|Description|
+| --- | --- |
+|`$`| Root element. All JSONPath expressions start with this operator. |
+|`.`| Child element in dot notation. |
+|`['']`| Child element in bracket notation. |
+|`[]`| Array index. |
+
+Consider the following example input JSON:
+
+```json
+{"x":1, "y":[1, 2, 3]}
+```
+
+- To return the entire JSON object:
+  `$` -> `{"x":1, "y":[1, 2, 3]}`
+- To return the value of the key "x":
+  `$.x` -> `1`
+- For a key that contains an array, to return the entire array:
+  `$['y']` -> `[1, 2, 3]`
+- For a key that contains an array, to return an item in the array:
+  `$.y[1]` -> `2`
\ No newline at end of file
diff --git a/docs/querying/virtual-columns.md b/docs/querying/virtual-columns.md
index 53e64546269b..d2611582e651 100644
--- a/docs/querying/virtual-columns.md
+++ b/docs/querying/virtual-columns.md
@@ -31,7 +31,7 @@ Virtual columns are queryable column "views" created from a set of columns durin
 A virtual column can potentially draw from multiple underlying columns, although a virtual column always presents itself as a single column.
-Virtual columns can be used as dimensions or as inputs to aggregators.
+Virtual columns can be referenced by their output names to be used as [dimensions](./dimensionspecs.md) or as inputs to [filters](./filters.md) and [aggregators](./aggregations.md).
 Each Apache Druid query can accept a list of virtual columns as a parameter. The following scan query is provided as an example:
@@ -64,6 +64,8 @@ Each Apache Druid query can accept a list of virtual columns as a parameter. The
 ## Virtual column types
 ### Expression virtual column
+Expression virtual columns use Druid's native [expression](../misc/math-expr.md) system to allow defining query time
+transforms of inputs from one or more columns.
 The expression virtual column has the following syntax:
@@ -78,6 +80,111 @@ The expression virtual column has the following syntax:
 |property|description|required?|
 |--------|-----------|---------|
+|type|Must be `"expression"` to indicate that this is an expression virtual column.|yes|
 |name|The name of the virtual column.|yes|
 |expression|An [expression](../misc/math-expr.md) that takes a row as input and outputs a value for the virtual column.|yes|
-|outputType|The expression's output will be coerced to this type. Can be LONG, FLOAT, DOUBLE, or STRING.|no, default is FLOAT|
+|outputType|The expression's output will be coerced to this type. Can be LONG, FLOAT, DOUBLE, STRING, ARRAY types, or COMPLEX types.|no, default is FLOAT|
+
+
+### Nested field virtual column
+
+The nested field virtual column is an optimized virtual column that can provide direct access into various paths of
+a `COMPLEX` column, including using their indexes.
+
+This virtual column is used for the SQL operators `JSON_VALUE` (if `processFromRaw` is set to false) or `JSON_QUERY`
+(if `processFromRaw` is true), and accepts 'JSONPath' or 'jq' syntax string representations of paths, or a parsed
+list of "path parts" in order to determine what should be selected from the column.
+
+You can define a nested field virtual column with any of the following equivalent syntaxes. The examples all produce
+the same output value, with each example showing a different way to specify how to access the nested value. The first
+is using JSONPath syntax `path`, the second with a jq `path`, and the third uses `pathParts`.
+
+```json
+  {
+    "type": "nested-field",
+    "columnName": "shipTo",
+    "outputName": "v0",
+    "expectedType": "STRING",
+    "path": "$.phoneNumbers[1].number"
+  }
+```
+
+```json
+  {
+    "type": "nested-field",
+    "columnName": "shipTo",
+    "outputName": "v1",
+    "expectedType": "STRING",
+    "path": ".phoneNumbers[1].number",
+    "useJqSyntax": true
+  }
+```
+
+```json
+  {
+    "type": "nested-field",
+    "columnName": "shipTo",
+    "outputName": "v2",
+    "expectedType": "STRING",
+    "pathParts": [
+      {
+        "type": "field",
+        "field": "phoneNumbers"
+      },
+      {
+        "type": "arrayElement",
+        "index": 1
+      },
+      {
+        "type": "field",
+        "field": "number"
+      }
+    ]
+  }
+```
+
+|property|description|required?|
+|--------|-----------|---------|
+|type|Must be `"nested-field"` to indicate that this is a nested field virtual column.|yes|
+|columnName|The name of the `COMPLEX` input column.|yes|
+|outputName|The name of the virtual column.|yes|
+|expectedType|The native Druid output type of the column, Druid will coerce output to this type if it does not match the underlying data. This can be `STRING`, `LONG`, `FLOAT`, `DOUBLE`, or `COMPLEX`. Extracting `ARRAY` types is not yet supported.|no, default `STRING`|
+|pathParts|The parsed path parts used to locate the nested values. `path` will be translated into `pathParts` internally. One of `path` or `pathParts` must be set|no, if `path` is defined|
+|processFromRaw|If set to true, the virtual column will process the "raw" JSON data to extract values rather than using an optimized "literal" value selector. This option allows extracting non-literal values (such as nested JSON objects or arrays) as a `COMPLEX` at the cost of much slower performance.|no, default false|
+|path|'JSONPath' (or 'jq') syntax path. One of `path` or `pathParts` must be set. |no, if `pathParts` is defined|
+|useJqSyntax|If true, parse `path` using 'jq' syntax instead of 'JSONPath'.|no, default is false|
+
+#### Nested path part
+Specify `pathParts` as an array of objects that describe each component of the path to traverse. Each object can take the following properties:
+
+|property|description|required?|
+|--------|-----------|---------|
+|type|Must be 'field' or 'arrayElement'. Use `field` when accessing a specific field in a nested structure. Use `arrayElement` when accessing a specific integer position of an array (zero based).|yes|
+|field|The name of the 'field' in a 'field' `type` path part|yes, if `type` is 'field'|
+|index|The array element index if `type` is `arrayElement`|yes, if `type` is 'arrayElement'|
+
+
+
+### List filtered virtual column
+This virtual column provides an alternative way to use
+['list filtered' dimension spec](./dimensionspecs.md#filtered-dimensionspecs) as a virtual column. It has optimized
+access to the underlying column value indexes that can provide a small performance improvement in some cases.
+
+
+```json
+  {
+    "type": "mv-filtered",
+    "name": "filteredDim3",
+    "delegate": "dim3",
+    "values": ["hello", "world"],
+    "isAllowList": true
+  }
+```
+
+|property|description|required?|
+|--------|-----------|---------|
+|type|Must be `"mv-filtered"` to indicate that this is a list filtered virtual column.|yes|
+|name|The output name of the virtual column|yes|
+|delegate|The name of the multi-value STRING input column to filter|yes|
+|values|Set of STRING values to allow or deny|yes|
+|isAllowList|If true, the output of the virtual column will be limited to the set specified by `values`, else it will provide all values _except_ those specified.|No, default true|
diff --git a/web-console/script/create-sql-docs.js b/web-console/script/create-sql-docs.js
index f727634b01a0..57fba5b81fb7 100755
--- a/web-console/script/create-sql-docs.js
+++ b/web-console/script/create-sql-docs.js
@@ -23,7 +23,7 @@ const snarkdown = require('snarkdown');
 const writefile = 'lib/sql-docs.js';
-const MINIMUM_EXPECTED_NUMBER_OF_FUNCTIONS = 150;
+const MINIMUM_EXPECTED_NUMBER_OF_FUNCTIONS = 158;
 const MINIMUM_EXPECTED_NUMBER_OF_DATA_TYPES = 14;
 function hasHtmlTags(str) {
@@ -63,6 +63,7 @@ const readDoc = async () => {
     await fs.readFile('../docs/querying/sql-scalar.md', 'utf-8'),
     await fs.readFile('../docs/querying/sql-aggregations.md', 'utf-8'),
     await fs.readFile('../docs/querying/sql-multivalue-string-functions.md', 'utf-8'),
+    await fs.readFile('../docs/querying/sql-json-functions.md', 'utf-8'),
     await fs.readFile('../docs/querying/sql-operators.md', 'utf-8'),
   ].join('\n');
diff --git a/website/.spelling b/website/.spelling
index f37e44cc8ca5..09fcd9bfa685 100644
--- a/website/.spelling
+++ b/website/.spelling
@@ -125,6 +125,7 @@ JRE
 JS
 JSON
 JsonPath
+JSONPath
 JSSE
 JVM
 JVMs
@@ -209,6 +210,7 @@ aggregator
 aggregators
 ambari
 analytics
+arrayElement
 assumeRoleArn
 assumeRoleExternalId
 async
@@ -225,6 +227,7 @@ backfills
 backpressure
 base64
 big-endian
+bigint
 blobstore
 boolean
 breakpoint
@@ -261,6 +264,7 @@ dequeued
 deserialization
 deserialize
 deserialized
+deserializes
 downtimes
 druid
 druid–kubernetes-extensions
@@ -269,6 +273,7 @@ encodings
 endian
 endpointConfig
 enum
+expectedType
 expr
 failover
 featureSpec
@@ -301,9 +306,15 @@ injective
 inlined
 inSubQueryThreshold
 interruptible
+isAllowList
 jackson-jq
 javadoc
 joinable
+json_keys
+json_object
+json_paths
+json_query
+json_value
 kerberos
 keystore
 keytool
@@ -343,10 +354,12 @@ noop
 numerics
 numShards
 parameterized
+parse_json
 parseable
 partitioner
 partitionFunction
 partitionsSpec
+pathParts
 performant
 plaintext
 pluggable
@@ -377,6 +390,7 @@ prepopulated
 preprocessing
 priori
 procs
+processFromRaw
 programmatically
 proto
 proxied
@@ -432,12 +446,15 @@ subtask
 subtasks
 supervisorTaskId
 symlink
+syntaxes
 tiering
 timeseries
 timestamp
 timestamps
+to_json_string
 tradeoffs
 transformSpec
+try_parse_json
 tsv
 ulimit
 unannounce
@@ -456,6 +473,7 @@ unparsed
 unsetting
 untrusted
 useFilterCNF
+useJqSyntax
 useSSL
 uptime
 uris
@@ -465,6 +483,7 @@ v1
 v2
 vCPUs
 validator
+varchar
 vectorizable
 vectorize
 vectorizeVirtualColumns
@@ -1331,6 +1350,8 @@ expm1
 expr
 expr1
 expr2
+expr3
+expr4
 fromIndex
 getExponent
 hypot
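
The `virtual-columns.md` changes in this patch state that a JSONPath `path` is translated into `pathParts` internally, and `sql-json-functions.md` lists the four supported operators. The following is a minimal JavaScript sketch of that relationship, not Druid's implementation: `parseJsonPath` and `extract` are hypothetical names introduced only for illustration, and the sketch assumes the child-name and index operators carry a field name or a non-negative integer, as in the patch's own examples (`$.x`, `$['y']`, `$.y[1]`).

```javascript
// Hypothetical helper, not Druid's code: turn a path written in the documented
// JSONPath subset into a list of path parts shaped like the `pathParts` property.
function parseJsonPath(path) {
  if (path[0] !== '$') {
    throw new Error('JSONPath expressions must start with $');
  }
  const parts = [];
  // Alternatives: .name | ['name'] | [index]. The sticky flag (y) forces each
  // match to begin exactly at lastIndex, so unsupported syntax is rejected.
  const token = /\.([^.\[\]]+)|\['([^']*)'\]|\[(\d+)\]/y;
  token.lastIndex = 1; // skip the root operator
  while (token.lastIndex < path.length) {
    const start = token.lastIndex;
    const match = token.exec(path);
    if (match === null) {
      throw new Error(`Unsupported JSONPath syntax at position ${start}: ${path}`);
    }
    if (match[3] !== undefined) {
      parts.push({ type: 'arrayElement', index: Number(match[3]) });
    } else {
      parts.push({ type: 'field', field: match[1] !== undefined ? match[1] : match[2] });
    }
  }
  return parts;
}

// Hypothetical helper: walk a parsed JSON value along the path parts.
// Returns undefined when the path does not exist in the data.
function extract(value, parts) {
  let current = value;
  for (const part of parts) {
    if (current === null || typeof current !== 'object') {
      return undefined;
    }
    if (part.type === 'arrayElement') {
      if (!Array.isArray(current)) return undefined;
      current = current[part.index];
    } else {
      if (Array.isArray(current)) return undefined;
      current = current[part.field];
    }
  }
  return current;
}

// The example input from the JSONPath syntax section of the patch.
const example = { x: 1, y: [1, 2, 3] };
console.log(extract(example, parseJsonPath('$.x')));     // 1
console.log(extract(example, parseJsonPath("$['y']")));  // [ 1, 2, 3 ]
console.log(extract(example, parseJsonPath('$.y[1]')));  // 2
console.log(JSON.stringify(parseJsonPath('$.phoneNumbers[1].number')));
```

Under these assumptions, parsing `$.phoneNumbers[1].number` yields the same three parts as the `pathParts` example in the patch (a `field`, an `arrayElement` with index 1, and another `field`), which is why the `path` and `pathParts` definitions are described as equivalent.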