From fc77f4e5a884303005bf84e1dd21ffc08b75cba3 Mon Sep 17 00:00:00 2001 From: Clint Wylie Date: Thu, 18 Aug 2022 14:03:20 -0700 Subject: [PATCH 1/7] basic docs for nested column query functions --- docs/misc/math-expr.md | 29 ++++++- docs/querying/sql-data-types.md | 9 ++- docs/querying/sql-functions.md | 67 ++++++++++++++++ docs/querying/sql-json-functions.md | 71 +++++++++++++++++ docs/querying/virtual-columns.md | 106 +++++++++++++++++++++++++- web-console/script/create-sql-docs.js | 1 + website/.spelling | 22 +++++- 7 files changed, 301 insertions(+), 4 deletions(-) create mode 100644 docs/querying/sql-json-functions.md diff --git a/docs/misc/math-expr.md b/docs/misc/math-expr.md index 5060594d7f5d..5f38ed2e29c1 100644 --- a/docs/misc/math-expr.md +++ b/docs/misc/math-expr.md @@ -170,7 +170,6 @@ See javadoc of java.lang.Math for detailed explanation for each function. |toradians|toradians(x) converts an angle measured in degrees to an approximately equivalent angle measured in radians| |ulp|ulp(x) returns the size of an ulp of the argument x| - ## Array functions | function | description | @@ -227,6 +226,34 @@ map((x) -> x + 1, x) ``` in this case, the `x` when evaluating `x + 1` is the lambda argument, thus an element of the multi-valued column `x`, rather than the column `x` itself. + +## JSON functions +JSON functions provide facilities to extract, transform, and create `COMPLEX` values. + +| function | description | +|---|---| +| json_value(expr, path) | Extract a Druid literal (`STRING`, `LONG`, `DOUBLE`) value from a `COMPLEX` column or input `expr` using JSONPath syntax of `path` | +| json_query(expr, path) | Extract a `COMPLEX` value from a `COMPLEX` column or input `expr` using JSONPath syntax of `path` | +| json_object(expr1, expr2[, expr3, expr4 ...]) | Construct a `COMPLEX` with alternating 'key' and 'value' arguments| +| parse_json(expr) | Deserialize a JSON `STRING` into a `COMPLEX` to be used with expressions which operate on `COMPLEX` inputs. 
Non-`STRING` input or invalid JSON will result in an error. | +| try_parse_json(expr) | Deserialize a JSON `STRING` into a `COMPLEX` to be used with expressions which operate on `COMPLEX` inputs. Non-`STRING` input or invalid JSON will result in a `NULL` value. | +| to_json_string(expr) | Convert a `COMPLEX` input into a JSON `STRING` value | +| json_keys(expr, path) | get array of field names in `expr` at the specified JSONPath `path`, or null if the data does not exist or have any fields | +| json_paths(expr) | get array of all JSONPath paths available in `expr` | + +### JSONPath syntax + +Druid supports a small, simplified subset of the [JSONPath syntax](https://github.com/json-path/JsonPath/blob/master/README.md) operators, primarily limited to extracting individual values from nested data structures. + +|Operator|Description| +| --- | --- | +|`$`| Root element. All JSONPath expressions start with this operator. | +|`.`| Child element in dot notation. | +|`['']`| Child element in bracket notation. | +|`[]`| Array index. | + +See [SQL JSON documentation](../querying/sql-json-functions.md#jsonpath-syntax) for examples. + ## Reduction functions Reduction functions operate on zero or more expressions and return a single expression. If no expressions are passed as diff --git a/docs/querying/sql-data-types.md b/docs/querying/sql-data-types.md index 693a6b660408..3371fe0d58d3 100644 --- a/docs/querying/sql-data-types.md +++ b/docs/querying/sql-data-types.md @@ -33,7 +33,7 @@ Columns in Druid are associated with a specific data type. This topic describes Druid natively supports five basic column types: "long" (64 bit signed int), "float" (32 bit float), "double" (64 bit float) "string" (UTF-8 encoded strings and string arrays), and "complex" (catch-all for more exotic data types like -hyperUnique and approxHistogram columns). +json, hyperUnique, and approxHistogram columns). 
Timestamps (including the `__time` column) are treated by Druid as longs, with the value being the number of milliseconds since 1970-01-01 00:00:00 UTC, not counting leap seconds. Therefore, timestamps in Druid do not carry any @@ -112,3 +112,10 @@ When `druid.expressions.useStrictBooleans = false` (the default mode), Druid use When `druid.expressions.useStrictBooleans = true`, Druid uses three-valued logic for [expressions](../misc/math-expr.md) evaluation, such as `expression` virtual columns or `expression` filters. However, even in this mode, Druid uses two-valued logic for filter types other than `expression`. + +## Nested columns +Druid `COMPLEX` types can be interacted with using [JSON functions](sql-json-functions.md), which can perform +nested value extraction, transforms, and create new `COMPLEX` structures. `COMPLEX` types currently have +limited functionality outside of the use of these specialized functions, and so cannot be grouped on, filtered directly +on, or used as inputs to many types of aggregations. These values can be translated into a `STRING` as workaround +solution until `COMPLEX` types are fully integrated into the general engine. \ No newline at end of file diff --git a/docs/querying/sql-functions.md b/docs/querying/sql-functions.md index 410180efa98c..cea24c2e59c4 100644 --- a/docs/querying/sql-functions.md +++ b/docs/querying/sql-functions.md @@ -647,6 +647,46 @@ Parses `address` into an IPv4 address stored as an integer. Converts `address` into an IPv4 address in dot-decimal notation. +## JSON_KEYS + +**Function type:** [JSON](sql-json-functions.md) + +`JSON_KEYS(expr, path)` + +Returns an array of field names in a `COMPLEX` typed `expr`, at the specified `path`. + +## JSON_OBJECT + +**Function type:** [JSON](sql-json-functions.md) + +`JSON_OBJECT(KEY expr1 VALUE expr2[, KEY expr3 VALUE expr4, ...])` + +Constructs a new `COMPLEX` object. 
The `KEY` expressions must evaluate to string types, but the `VALUE` expressions can be composed of any input type, including other `COMPLEX` values. + +## JSON_PATHS + +**Function type:** [JSON](sql-json-functions.md) + +`JSON_PATHS(expr)` + +Returns an array of all paths which refer to literal values in a `COMPLEX` typed `expr`, in JSONPath format. + +## JSON_QUERY + +**Function type:** [JSON](sql-json-functions.md) + +`JSON_QUERY(expr, path)` + +Extracts a `COMPLEX` value from a `COMPLEX` typed `expr`, at the specified `path`. + +## JSON_VALUE + +**Function type:** [JSON](sql-json-functions.md) + +`JSON_VALUE(expr, path [RETURNING sqlType])` + +Extracts a literal value from a `COMPLEX` typed `expr`, at the specified `path`. If you specify `RETURNING` and an SQL type name (such as varchar, bigint, decimal, or double) the function plans the query using the suggested type. Otherwise it attempts to infer the type based on the context. If it can't infer the type, it defaults to varchar. + ## LATEST `LATEST(expr)` @@ -899,6 +939,14 @@ Returns NULL if two values are equal, else returns the first value. Returns `e2` if `e1` is null, else returns `e1`. +## PARSE_JSON + +**Function type:** [JSON](sql-json-functions.md) + +`PARSE_JSON(expr)` + +Parses a string type `expr` into a `COMPLEX` object. This operator deserializes JSON values when processing them, translating stringified JSON into a nested structure. Non-`STRING` input or invalid JSON will result in an error. + ## PARSE_LONG `PARSE_LONG(, [])` @@ -1267,6 +1315,15 @@ Adds a certain amount of time to a given timestamp. Takes the difference between two timestamps, returning the results in the given units. +## TO_JSON_STRING + +**Function type:** [JSON](sql-json-functions.md) + +`TO_JSON_STRING(expr)` + +Casts an `expr` of any type into a `COMPLEX` object, then serializes the value into a JSON string. + + ## TRIM `TRIM([BOTH|LEADING|TRAILING] [ FROM] expr)` @@ -1291,6 +1348,16 @@ Alias for [`TRUNCATE`](#truncate). 
Truncates a numerical expression to a specific number of decimal digits. + +## TRY_PARSE_JSON + +**Function type:** [JSON](sql-json-functions.md) + +`TRY_PARSE_JSON(expr)` + +Parses a string type `expr` into a `COMPLEX` object. This operator deserializes JSON values when processing them, translating stringified JSON into a nested structure. Non-`STRING` input or invalid JSON will result in a `NULL` value. + + ## UPPER `UPPER(expr)` diff --git a/docs/querying/sql-json-functions.md b/docs/querying/sql-json-functions.md new file mode 100644 index 000000000000..1c9aa494327a --- /dev/null +++ b/docs/querying/sql-json-functions.md @@ -0,0 +1,71 @@ +--- +id: sql-json-functions +title: "SQL JSON functions" +sidebar_label: "JSON functions" +--- + + + + + +Druid supports nested columns, which provide optimized storage and indexes for nested data structures. These JSON +functions provide facilities to extract, transform, and create `COMPLEX` values. + +| function | notes | +| --- | --- | +|`JSON_KEYS(expr, path)`| Returns an array of field names in a `COMPLEX` typed `expr`, at the specified `path`.| +|`JSON_OBJECT(KEY expr1 VALUE expr2[, KEY expr3 VALUE expr4, ...])` | Constructs a new `COMPLEX` object. The `KEY` expressions must evaluate to string types, but the `VALUE` expressions can be composed of any input type, including other `COMPLEX` values.| +|`JSON_PATHS(expr)`| Returns an array of all paths which refer to literal values in a `COMPLEX` typed `expr`, in JSONPath format. | +|`JSON_QUERY(expr, path)`| Extracts a `COMPLEX` value from a `COMPLEX` typed `expr`, at the specified `path`. | +|`JSON_VALUE(expr, path [RETURNING sqlType])`| Extracts a literal value from a `COMPLEX` typed `expr`, at the specified `path`. If you specify `RETURNING` and an SQL type name (such as varchar, bigint, decimal, or double) the function plans the query using the suggested type. Otherwise it attempts to infer the type based on the context. 
If it can't infer the type, it defaults to varchar.| +|`PARSE_JSON(expr)`|Parses a string type `expr` into a `COMPLEX` object. This operator deserializes JSON values when processing them, translating stringified JSON into a nested structure. Non-`STRING` input or invalid JSON will result in an error.| +|`TRY_PARSE_JSON(expr)`|Parses a string type `expr` into a `COMPLEX` object. This operator deserializes JSON values when processing them, translating stringified JSON into a nested structure. Non-`STRING` input or invalid JSON will result in a `NULL` value.| +|`TO_JSON_STRING(expr)`|Casts an `expr` of any type into a `COMPLEX` object, then serializes the value into a JSON string.| + +### JSONPath syntax + +Druid supports a small, simplified subset of the [JSONPath syntax](https://github.com/json-path/JsonPath/blob/master/README.md) operators, primarily limited to extracting individual values from nested data structures. + +|Operator|Description| +| --- | --- | +|`$`| Root element. All JSONPath expressions start with this operator. | +|`.`| Child element in dot notation. | +|`['']`| Child element in bracket notation. | +|`[]`| Array index. | + +Consider the following example input JSON: + +```json +{"x":1, "y":[1, 2, 3]} +``` + +- To return the JSON object:
+ `$` -> `{"x":1, "y":[1, 2, 3]}` +- To return the value of a key "x":
+ `$.x` -> `1` +- For a key that contains an array, to return the entire array:
+ `$['y']` -> `[1, 2, 3]` +- For a key that contains an array, to return an item in the array:
+ `$.y[1]` -> `2` \ No newline at end of file diff --git a/docs/querying/virtual-columns.md b/docs/querying/virtual-columns.md index 53e64546269b..9c8038c79e24 100644 --- a/docs/querying/virtual-columns.md +++ b/docs/querying/virtual-columns.md @@ -64,6 +64,8 @@ Each Apache Druid query can accept a list of virtual columns as a parameter. The ## Virtual column types ### Expression virtual column +Expression virtual columns use Druid's native [expression](../misc/math-expr.md) system to allow defining query time +transforms of inputs from one or more columns. The expression virtual column has the following syntax: @@ -80,4 +82,106 @@ The expression virtual column has the following syntax: |--------|-----------|---------| |name|The name of the virtual column.|yes| |expression|An [expression](../misc/math-expr.md) that takes a row as input and outputs a value for the virtual column.|yes| -|outputType|The expression's output will be coerced to this type. Can be LONG, FLOAT, DOUBLE, or STRING.|no, default is FLOAT| +|outputType|The expression's output will be coerced to this type. Can be LONG, FLOAT, DOUBLE, STRING, ARRAY types, or COMPLEX types.|no, default is FLOAT| + + +### Nested field virtual column + +The nested field virtual column is an optimized virtual column that can provide direct access into various paths of +a `COMPLEX` column, including using their indexes. 
+ +Syntax (all 3 of these virtual columns produce the same output): +```json + { + "type": "nested-field", + "columnName": "shipTo", + "outputName": "v0", + "expectedType": "STRING", + "path": "$.phoneNumbers[1].number" + } +``` +```json + { + "type": "nested-field", + "columnName": "shipTo", + "outputName": "v1", + "expectedType": "STRING", + "path": ".phoneNumbers[1].number", + "useJqSyntax": true + } +``` + +```json + { + "type": "nested-field", + "columnName": "shipTo", + "outputName": "v2", + "expectedType": "STRING", + "pathParts": [ + { + "type": "field", + "field": "phoneNumbers" + }, + { + "type": "arrayElement", + "index": 1 + }, + { + "type": "field", + "field": "number" + } + ] + } +``` + +|property|description|required?| +|--------|-----------|---------| +|columnName|The name of the virtual column.|yes| +|outputName|The name of the virtual column.|yes| +|expectedType|The name of the virtual column.|yes| +|pathParts|The name of the virtual column.|yes| +|processFromRaw|If set to true, the virtual column will process the "raw" JSON data to extract values rather than using an optimized "literal" value selector. This option allows extracting non-literal values (such as nested JSON objects or arrays) as a `COMPLEX` at the cost of much slower performance.|No, default false| +|path|'JSONPath' or 'jq' syntax path. 
One of `path` or `pathParts` must be set|no, if `pathParts` is defined| +|useJqSyntax||no, default is false| + +#### Nested path part +|property|description|required?| +|--------|-----------|---------| +|type|Must be 'field' or 'arrayElement'|yes| +|field|The name of the 'field' in a 'field' `type` path part|yes, if `type` is 'field'| +|index|The array element index if `type` is `arrayElement`|yes, if `type` is 'arrayElement'| + +This virtual column is used for the SQL operators `JSON_VALUE` (if `processFromRaw` is set to false) or `JSON_QUERY` +(if it is true), and accepts 'JSONPath' or 'jq' syntax string representations of paths, or a parsed +list of "path parts" in order to determine what should be selected from the column. + +Type information for nested fields is absent at higher levels (it is contained within the segment, but not to segment +metadata queries or the SQL planner), so `expectedType` provides the context for how something is being used, e.g. an +aggregators default type or an explicit cast, or, if using the 'RETURNING' syntax which explicitly specifies type. +This might not be the same as if it had actual type information, so the results will be "best effort" cast to the +expected type if the column is not natively the expected type so that this column can fulfill the contract of the type +of selector that is likely to be created to read this column. + + +### List filtered virtual column +This virtual column provides an alternative way to use +['list filtered' dimension spec](./dimensionspecs.md#filtered-dimensionspecs) as a virtual column. It has optimized +access to the underlying column value indexes that can provide a small performance improvement in some cases. 
+ + +```json + { + "type": "mv-filtered", + "name": "filteredDim3", + "delegate": "dim3", + "values": ["hello", "world"], + "isAllowList": true + } +``` + +|property|description|required?| +|--------|-----------|---------| +|name|The output name of the virtual column|yes| +|delegate|The name of the multi-value STRING input column to filter|yes| +|values|Set of STRING values to allow or deny|yes| +|isAllowList|If true, the output of the virtual column will be limited to the set specified by `values`, else it will provide all values _except_ those specified.|No, default true| diff --git a/web-console/script/create-sql-docs.js b/web-console/script/create-sql-docs.js index f727634b01a0..6d99227c599a 100755 --- a/web-console/script/create-sql-docs.js +++ b/web-console/script/create-sql-docs.js @@ -63,6 +63,7 @@ const readDoc = async () => { await fs.readFile('../docs/querying/sql-scalar.md', 'utf-8'), await fs.readFile('../docs/querying/sql-aggregations.md', 'utf-8'), await fs.readFile('../docs/querying/sql-multivalue-string-functions.md', 'utf-8'), + await fs.readFile('../docs/querying/sql-json-functions.md', 'utf-8'), await fs.readFile('../docs/querying/sql-operators.md', 'utf-8'), ].join('\n'); diff --git a/website/.spelling b/website/.spelling index ad1ccc318b50..a92d3f411117 100644 --- a/website/.spelling +++ b/website/.spelling @@ -125,6 +125,7 @@ JRE JS JSON JsonPath +JSONPath JSSE JVM JVMs @@ -209,6 +210,7 @@ aggregator aggregators ambari analytics +arrayElement assumeRoleArn assumeRoleExternalId async @@ -225,6 +227,7 @@ backfills backpressure base64 big-endian +bigint blobstore boolean breakpoint @@ -261,6 +264,7 @@ dequeued deserialization deserialize deserialized +deserializes downtimes druid druid–kubernetes-extensions @@ -269,6 +273,7 @@ encodings endian endpointConfig enum +expectedType expr failover featureSpec @@ -301,9 +306,15 @@ injective inlined inSubQueryThreshold interruptible +isAllowList jackson-jq javadoc joinable +json_keys +json_object 
+json_paths +json_query +json_value kerberos keystore keytool @@ -343,10 +354,12 @@ noop numerics numShards parameterized +parse_json parseable partitioner partitionFunction partitionsSpec +pathParts performant plaintext pluggable @@ -377,6 +390,7 @@ prepopulated preprocessing priori procs +processFromRaw programmatically proto proxied @@ -436,8 +450,10 @@ tiering timeseries timestamp timestamps +to_json_string tradeoffs transformSpec +try_parse_json tsv ulimit unannounce @@ -456,6 +472,7 @@ unparsed unsetting untrusted useFilterCNF +useJqSyntax useSSL uptime uris @@ -464,7 +481,8 @@ useFieldDiscovery v1 v2 vCPUs -validator +validatcdor +varchar vectorizable vectorize vectorizeVirtualColumns @@ -1330,6 +1348,8 @@ expm1 expr expr1 expr2 +expr3 +expr4 fromIndex getExponent hypot From 742de2a256ad86de44c3cf578f220ef89132f937 Mon Sep 17 00:00:00 2001 From: Clint Wylie Date: Thu, 18 Aug 2022 16:31:24 -0700 Subject: [PATCH 2/7] fix terminal input accidentally typed in editor --- website/.spelling | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/.spelling b/website/.spelling index a92d3f411117..2a2e356f301c 100644 --- a/website/.spelling +++ b/website/.spelling @@ -481,7 +481,7 @@ useFieldDiscovery v1 v2 vCPUs -validatcdor +validator varchar vectorizable vectorize From 758ed621fd7b2cfd715ab2095b0a0e4a82ac7bda Mon Sep 17 00:00:00 2001 From: Clint Wylie Date: Thu, 18 Aug 2022 23:41:17 -0700 Subject: [PATCH 3/7] review adjustments --- docs/querying/sql-data-types.md | 17 ++++++++++---- docs/querying/sql-functions.md | 16 ++++++------- docs/querying/sql-json-functions.md | 28 +++++++++++------------ docs/querying/virtual-columns.md | 33 +++++++++++++-------------- web-console/script/create-sql-docs.js | 2 +- 5 files changed, 51 insertions(+), 45 deletions(-) diff --git a/docs/querying/sql-data-types.md b/docs/querying/sql-data-types.md index 3371fe0d58d3..e8a76adda238 100644 --- a/docs/querying/sql-data-types.md +++ b/docs/querying/sql-data-types.md 
@@ -114,8 +114,15 @@ When `druid.expressions.useStrictBooleans = true`, Druid uses three-valued logic However, even in this mode, Druid uses two-valued logic for filter types other than `expression`. ## Nested columns -Druid `COMPLEX` types can be interacted with using [JSON functions](sql-json-functions.md), which can perform -nested value extraction, transforms, and create new `COMPLEX` structures. `COMPLEX` types currently have -limited functionality outside of the use of these specialized functions, and so cannot be grouped on, filtered directly -on, or used as inputs to many types of aggregations. These values can be translated into a `STRING` as workaround -solution until `COMPLEX` types are fully integrated into the general engine. \ No newline at end of file +Druid supports storing nested data structures in segments using the native `COMPLEX` type. This data can be +interacted with using [JSON functions](sql-json-functions.md), which can extract nested values, parse from string, +serialize to string, and to create new `COMPLEX` structures. + +`COMPLEX` types in general currently have limited functionality outside of the use of the specialized functions which +understand them, and so have undefined behavior when: +* grouping on complex values +* filtered directly on complex values, e.g. `WHERE json is NULL` +* used as inputs to aggregators without specialized handling for a specific complex type +q +In many cases, these functions are provided for translating these `COMPLEX` value types a `STRING`, which serves as +workaround solution until `COMPLEX` types functionality can be improved. \ No newline at end of file diff --git a/docs/querying/sql-functions.md b/docs/querying/sql-functions.md index cea24c2e59c4..4e0d3bf4edc9 100644 --- a/docs/querying/sql-functions.md +++ b/docs/querying/sql-functions.md @@ -653,7 +653,7 @@ Converts `address` into an IPv4 address in dot-decimal notation. 
`JSON_KEYS(expr, path)` -Returns an array of field names in a `COMPLEX` typed `expr`, at the specified `path`. +Returns an array of field names from `expr` at the specified `path`. ## JSON_OBJECT @@ -661,7 +661,7 @@ Returns an array of field names in a `COMPLEX` typed `expr`, at the specif `JSON_OBJECT(KEY expr1 VALUE expr2[, KEY expr3 VALUE expr4, ...])` -Constructs a new `COMPLEX` object. The `KEY` expressions must evaluate to string types, but the `VALUE` expressions can be composed of any input type, including other `COMPLEX` values. +Constructs a new `COMPLEX` object. The `KEY` expressions must evaluate to string types. The `VALUE` expressions can be composed of any input type, including other `COMPLEX` values. `JSON_OBJECT` can accept alternating key-value pairs separated by colons. The following syntax is equivalent: `JSON_OBJECT(expr1:expr2[, expr3:expr4, ...])`. ## JSON_PATHS @@ -669,7 +669,7 @@ Constructs a new `COMPLEX` object. The `KEY` expressions must evaluate to `JSON_PATHS(expr)` -Returns an array of all paths which refer to literal values in a `COMPLEX` typed `expr`, in JSONPath format. +Returns an array of all paths which refer to literal values in `expr` in JSONPath format. ## JSON_QUERY @@ -677,7 +677,7 @@ Returns an array of all paths which refer to literal values in a `COMPLEX` `JSON_QUERY(expr, path)` -Extracts a `COMPLEX` value from a `COMPLEX` typed `expr`, at the specified `path`. +Extracts a `COMPLEX` value from `expr`, at the specified `path`. ## JSON_VALUE @@ -685,7 +685,7 @@ Extracts a `COMPLEX` value from a `COMPLEX` typed `expr`, at the spe `JSON_VALUE(expr, path [RETURNING sqlType])` -Extracts a literal value from a `COMPLEX` typed `expr`, at the specified `path`. If you specify `RETURNING` and an SQL type name (such as varchar, bigint, decimal, or double) the function plans the query using the suggested type. Otherwise it attempts to infer the type based on the context. If it can't infer the type, it defaults to varchar. 
+Extracts a literal value from `expr` at the specified `path`. If you specify `RETURNING` and an SQL type name (such as `VARCHAR`, `BIGINT`, `DOUBLE`, etc) the function plans the query using the suggested type. Otherwise, it attempts to infer the type based on the context. If it can't infer the type, it defaults to `VARCHAR`. ## LATEST @@ -945,7 +945,7 @@ Returns `e2` if `e1` is null, else returns `e1`. `PARSE_JSON(expr)` -Parses a string type `expr` into a `COMPLEX` object. This operator deserializes JSON values when processing them, translating stringified JSON into a nested structure. Non-`STRING` input or invalid JSON will result in an error. +Parses `expr` into a `COMPLEX` object. This operator deserializes JSON values when processing them, translating stringified JSON into a nested structure. If the input is not a `VARCHAR` or it is invalid JSON, this function will result in an error. ## PARSE_LONG @@ -1321,7 +1321,7 @@ Takes the difference between two timestamps, returning the results in the given `TO_JSON_STRING(expr)` -Casts an `expr` of any type into a `COMPLEX` object, then serializes the value into a JSON string. +Serializes `expr` into a JSON string. ## TRIM @@ -1355,7 +1355,7 @@ Truncates a numerical expression to a specific number of decimal digits. `TRY_PARSE_JSON(expr)` -Parses a string type `expr` into a `COMPLEX` object. This operator deserializes JSON values when processing them, translating stringified JSON into a nested structure. Non-`STRING` input or invalid JSON will result in a `NULL` value. +Parses `expr` into a `COMPLEX` object. This operator deserializes JSON values when processing them, translating stringified JSON into a nested structure. If the input is not a `VARCHAR` or it is invalid JSON, this function will result in a `NULL` value. 
## UPPER diff --git a/docs/querying/sql-json-functions.md b/docs/querying/sql-json-functions.md index 1c9aa494327a..40264ab8c342 100644 --- a/docs/querying/sql-json-functions.md +++ b/docs/querying/sql-json-functions.md @@ -30,23 +30,23 @@ sidebar_label: "JSON functions" patterns in this markdown file and parse it to TypeScript file for web console --> -Druid supports nested columns, which provide optimized storage and indexes for nested data structures. These JSON -functions provide facilities to extract, transform, and create `COMPLEX` values. +Druid supports nested columns, which provide optimized storage and indexes for nested data structures. Use +the following JSON functions to extract, transform, and create `COMPLEX` values. -| function | notes | +| Function | Notes | | --- | --- | -|`JSON_KEYS(expr, path)`| Returns an array of field names in a `COMPLEX` typed `expr`, at the specified `path`.| -|`JSON_OBJECT(KEY expr1 VALUE expr2[, KEY expr3 VALUE expr4, ...])` | Constructs a new `COMPLEX` object. The `KEY` expressions must evaluate to string types, but the `VALUE` expressions can be composed of any input type, including other `COMPLEX` values.| -|`JSON_PATHS(expr)`| Returns an array of all paths which refer to literal values in a `COMPLEX` typed `expr`, in JSONPath format. | -|`JSON_QUERY(expr, path)`| Extracts a `COMPLEX` value from a `COMPLEX` typed `expr`, at the specified `path`. | -|`JSON_VALUE(expr, path [RETURNING sqlType])`| Extracts a literal value from a `COMPLEX` typed `expr`, at the specified `path`. If you specify `RETURNING` and an SQL type name (such as varchar, bigint, decimal, or double) the function plans the query using the suggested type. Otherwise it attempts to infer the type based on the context. If it can't infer the type, it defaults to varchar.| -|`PARSE_JSON(expr)`|Parses a string type `expr` into a `COMPLEX` object. This operator deserializes JSON values when processing them, translating stringified JSON into a nested structure. 
Non-`STRING` input or invalid JSON will result in an error.| -|`TRY_PARSE_JSON(expr)`|Parses a string type `expr` into a `COMPLEX` object. This operator deserializes JSON values when processing them, translating stringified JSON into a nested structure. Non-`STRING` input or invalid JSON will result in a `NULL` value.| -|`TO_JSON_STRING(expr)`|Casts an `expr` of any type into a `COMPLEX` object, then serializes the value into a JSON string.| +|`JSON_KEYS(expr, path)`| Returns an array of field names from `expr` at the specified `path`.| +|`JSON_OBJECT(KEY expr1 VALUE expr2[, KEY expr3 VALUE expr4, ...])` | Constructs a new `COMPLEX` object. The `KEY` expressions must evaluate to string types. The `VALUE` expressions can be composed of any input type, including other `COMPLEX` values. `JSON_OBJECT` can accept alternating key-value pairs separated by colons. The following syntax is equivalent: `JSON_OBJECT(expr1:expr2[, expr3:expr4, ...])`.| +|`JSON_PATHS(expr)`| Returns an array of all paths which refer to literal values in `expr` in JSONPath format. | +|`JSON_QUERY(expr, path)`| Extracts a `COMPLEX` value from `expr`, at the specified `path`. | +|`JSON_VALUE(expr, path [RETURNING sqlType])`| Extracts a literal value from `expr` at the specified `path`. If you specify `RETURNING` and an SQL type name (such as `VARCHAR`, `BIGINT`, `DOUBLE`, etc) the function plans the query using the suggested type. Otherwise, it attempts to infer the type based on the context. If it can't infer the type, it defaults to `VARCHAR`.| +|`PARSE_JSON(expr)`|Parses `expr` into a `COMPLEX` object. This operator deserializes JSON values when processing them, translating stringified JSON into a nested structure. If the input is not a `VARCHAR` or it is invalid JSON, this function will result in an error.| +|`TRY_PARSE_JSON(expr)`|Parses `expr` into a `COMPLEX` object. This operator deserializes JSON values when processing them, translating stringified JSON into a nested structure. 
If the input is not a `VARCHAR` or it is invalid JSON, this function will result in a `NULL` value.| +|`TO_JSON_STRING(expr)`|Serializes `expr` into a JSON string.| ### JSONPath syntax -Druid supports a small, simplified subset of the [JSONPath syntax](https://github.com/json-path/JsonPath/blob/master/README.md) operators, primarily limited to extracting individual values from nested data structures. +Druid supports a subset of the [JSONPath syntax](https://github.com/json-path/JsonPath/blob/master/README.md) operators, primarily limited to extracting individual values from nested data structures. |Operator|Description| | --- | --- | @@ -61,9 +61,9 @@ Consider the following example input JSON: {"x":1, "y":[1, 2, 3]} ``` -- To return the JSON object:
+- To return the entire JSON object:
`$` -> `{"x":1, "y":[1, 2, 3]}` -- To return the value of a key "x":
+- To return the value of the key "x":
`$.x` -> `1` - For a key that contains an array, to return the entire array:
`$['y']` -> `[1, 2, 3]` diff --git a/docs/querying/virtual-columns.md b/docs/querying/virtual-columns.md index 9c8038c79e24..33d7135d9ed3 100644 --- a/docs/querying/virtual-columns.md +++ b/docs/querying/virtual-columns.md @@ -80,6 +80,7 @@ The expression virtual column has the following syntax: |property|description|required?| |--------|-----------|---------| +|type|Must be `"expression"` to indicate that this is an expression virtual column.|yes| |name|The name of the virtual column.|yes| |expression|An [expression](../misc/math-expr.md) that takes a row as input and outputs a value for the virtual column.|yes| |outputType|The expression's output will be coerced to this type. Can be LONG, FLOAT, DOUBLE, STRING, ARRAY types, or COMPLEX types.|no, default is FLOAT| @@ -90,6 +91,10 @@ The expression virtual column has the following syntax: The nested field virtual column is an optimized virtual column that can provide direct access into various paths of a `COMPLEX` column, including using their indexes. +This virtual column is used for the SQL operators `JSON_VALUE` (if `processFromRaw` is set to false) or `JSON_QUERY` +(if `processFromRaw` is true), and accepts 'JSONPath' or 'jq' syntax string representations of paths, or a parsed +list of "path parts" in order to determine what should be selected from the column. 
+ Syntax (all 3 of these virtual columns produce the same output): ```json { @@ -136,31 +141,24 @@ Syntax (all 3 of these virtual columns produce the same output): |property|description|required?| |--------|-----------|---------| -|columnName|The name of the virtual column.|yes| +|type|Must be `"nested-field"` to indicate that this is a nested field virtual column.|yes| +|columnName|The name of the `COMPLEX` input column.|yes| |outputName|The name of the virtual column.|yes| -|expectedType|The name of the virtual column.|yes| -|pathParts|The name of the virtual column.|yes| -|processFromRaw|If set to true, the virtual column will process the "raw" JSON data to extract values rather than using an optimized "literal" value selector. This option allows extracting non-literal values (such as nested JSON objects or arrays) as a `COMPLEX` at the cost of much slower performance.|No, default false| -|path|'JSONPath' or 'jq' syntax path. One of `path` or `pathParts` must be set|no, if `pathParts` is defined| -|useJqSyntax||no, default is false| +|expectedType|The native Druid output type of the column, Druid will coerce output to this type if it does not match the underlying data. This can be `STRING`, `LONG`, `FLOAT`, `DOUBLE`, or `COMPLEX`. Extracting `ARRAY` types is not yet supported.|no, default `STRING`| +|pathParts|The parsed path parts used to locate the nested values. `path` will be translated into `pathParts` internally. One of `path` or `pathParts` must be set|no, if `path` is defined| +|processFromRaw|If set to true, the virtual column will process the "raw" JSON data to extract values rather than using an optimized "literal" value selector. This option allows extracting non-literal values (such as nested JSON objects or arrays) as a `COMPLEX` at the cost of much slower performance.|no, default false| +|path|'JSONPath' (or 'jq') syntax path. One of `path` or `pathParts` must be set. 
|no, if `pathParts` is defined| +|useJqSyntax|If true, parse `path` using 'jq' syntax instead of 'JSONPath'.|no, default is false| #### Nested path part +Specify `pathParts` as an array of objects that describe each component of the path to traverse. Each object can take the following properties: + |property|description|required?| |--------|-----------|---------| -|type|Must be 'field' or 'arrayElement'|yes| +|type|Must be 'field' or 'arrayElement'. Use `field` when accessing a specific field in a nested structure. Use `arrayElement` when accessing a specific integer position of an array (zero based).|yes| |field|The name of the 'field' in a 'field' `type` path part|yes, if `type` is 'field'| |index|The array element index if `type` is `arrayElement`|yes, if `type` is 'arrayElement'| -This virtual column is used for the SQL operators `JSON_VALUE` (if `processFromRaw` is set to false) or `JSON_QUERY` -(if it is true), and accepts 'JSONPath' or 'jq' syntax string representations of paths, or a parsed -list of "path parts" in order to determine what should be selected from the column. - -Type information for nested fields is absent at higher levels (it is contained within the segment, but not to segment -metadata queries or the SQL planner), so `expectedType` provides the context for how something is being used, e.g. an -aggregators default type or an explicit cast, or, if using the 'RETURNING' syntax which explicitly specifies type. -This might not be the same as if it had actual type information, so the results will be "best effort" cast to the -expected type if the column is not natively the expected type so that this column can fulfill the contract of the type -of selector that is likely to be created to read this column. 
### List filtered virtual column @@ -181,6 +179,7 @@ access to the underlying column value indexes that can provide a small performan |property|description|required?| |--------|-----------|---------| +|type|Must be `"mv-filtered"` to indicate that this is a list filtered virtual column.|yes| |name|The output name of the virtual column|yes| |delegate|The name of the multi-value STRING input column to filter|yes| |values|Set of STRING values to allow or deny|yes| diff --git a/web-console/script/create-sql-docs.js b/web-console/script/create-sql-docs.js index 6d99227c599a..57fba5b81fb7 100755 --- a/web-console/script/create-sql-docs.js +++ b/web-console/script/create-sql-docs.js @@ -23,7 +23,7 @@ const snarkdown = require('snarkdown'); const writefile = 'lib/sql-docs.js'; -const MINIMUM_EXPECTED_NUMBER_OF_FUNCTIONS = 150; +const MINIMUM_EXPECTED_NUMBER_OF_FUNCTIONS = 158; const MINIMUM_EXPECTED_NUMBER_OF_DATA_TYPES = 14; function hasHtmlTags(str) { From fc98b838c09df7467b1f79c27c7968acde05f495 Mon Sep 17 00:00:00 2001 From: Clint Wylie Date: Thu, 18 Aug 2022 23:41:38 -0700 Subject: [PATCH 4/7] meh --- docs/querying/sql-data-types.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/querying/sql-data-types.md b/docs/querying/sql-data-types.md index e8a76adda238..616ae2e47086 100644 --- a/docs/querying/sql-data-types.md +++ b/docs/querying/sql-data-types.md @@ -123,6 +123,6 @@ understand them, and so have undefined behavior when: * grouping on complex values * filtered directly on complex values, e.g. `WHERE json is NULL` * used as inputs to aggregators without specialized handling for a specific complex type -q + In many cases, these functions are provided for translating these `COMPLEX` value types a `STRING`, which serves as workaround solution until `COMPLEX` types functionality can be improved. 
\ No newline at end of file From a84166c72d50267dc54c2a97e8e1122fdc90b797 Mon Sep 17 00:00:00 2001 From: Clint Wylie Date: Thu, 18 Aug 2022 23:47:30 -0700 Subject: [PATCH 5/7] more better --- docs/misc/math-expr.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/misc/math-expr.md b/docs/misc/math-expr.md index 5f38ed2e29c1..fa8955f47a85 100644 --- a/docs/misc/math-expr.md +++ b/docs/misc/math-expr.md @@ -232,14 +232,14 @@ JSON functions provide facilities to extract, transform, and create `COMPLEX` | function | description | |---|---| -| json_value(expr, path) | Extract a Druid literal (`STRING`, `LONG`, `DOUBLE`) value from a `COMPLEX` column or input `expr` using JSONPath syntax of `path` | -| json_query(expr, path) | Extract a `COMPLEX` value from a `COMPLEX` column or input `expr` using JSONPath syntax of `path` | +| json_value(expr, path) | Extract a Druid literal (`STRING`, `LONG`, `DOUBLE`) value from `expr` using JSONPath syntax of `path` | +| json_query(expr, path) | Extract a `COMPLEX` value from `expr` using JSONPath syntax of `path` | | json_object(expr1, expr2[, expr3, expr4 ...]) | Construct a `COMPLEX` with alternating 'key' and 'value' arguments| -| parse_json(expr) | Deserialize a JSON `STRING` into a `COMPLEX` to be used with expressions which operate on `COMPLEX` inputs. Non-`STRING` input or invalid JSON will result in an error. | -| try_parse_json(expr) | Deserialize a JSON `STRING` into a `COMPLEX` to be used with expressions which operate on `COMPLEX` inputs. Non-`STRING` input or invalid JSON will result in a `NULL` value. | -| to_json_string(expr) | Convert a `COMPLEX` input into a JSON `STRING` value | -| json_keys(expr, path) | get array of field names in `expr` at the specified JSONPath `path`, or null if the data does not exist or have any fields | -| json_paths(expr) | get array of all JSONPath paths available in `expr` | +| parse_json(expr) | Deserialize a JSON `STRING` into a `COMPLEX`.
If the input is not a `STRING` or it is invalid JSON, this function will result in an error.| +| try_parse_json(expr) | Deserialize a JSON `STRING` into a `COMPLEX`. If the input is not a `STRING` or it is invalid JSON, this function will result in a `NULL` value. | +| to_json_string(expr) | Convert `expr` into a JSON `STRING` value | +| json_keys(expr, path) | get array of field names from `expr` at the specified JSONPath `path`, or null if the data does not exist or have any fields | +| json_paths(expr) | get array of all JSONPath paths available from `expr` | ### JSONPath syntax From bf3f2b6cb5ae5f01f0aff50e8d9be5f0e75de549 Mon Sep 17 00:00:00 2001 From: Clint Wylie Date: Fri, 19 Aug 2022 15:56:09 -0700 Subject: [PATCH 6/7] more better --- docs/misc/math-expr.md | 4 ++-- docs/querying/sql-data-types.md | 12 ++++++------ docs/querying/sql-functions.md | 2 +- docs/querying/sql-json-functions.md | 2 +- docs/querying/virtual-columns.md | 8 ++++++-- 5 files changed, 16 insertions(+), 12 deletions(-) diff --git a/docs/misc/math-expr.md b/docs/misc/math-expr.md index fa8955f47a85..94167800c429 100644 --- a/docs/misc/math-expr.md +++ b/docs/misc/math-expr.md @@ -238,8 +238,8 @@ JSON functions provide facilities to extract, transform, and create `COMPLEX` | parse_json(expr) | Deserialize a JSON `STRING` into a `COMPLEX`. If the input is not a `STRING` or it is invalid JSON, this function will result in an error.| | try_parse_json(expr) | Deserialize a JSON `STRING` into a `COMPLEX`. If the input is not a `STRING` or it is invalid JSON, this function will result in a `NULL` value.
| | to_json_string(expr) | Convert `expr` into a JSON `STRING` value | -| json_keys(expr, path) | get array of field names from `expr` at the specified JSONPath `path`, or null if the data does not exist or have any fields | -| json_paths(expr) | get array of all JSONPath paths available from `expr` | +| json_keys(expr, path) | Get an array of field names from `expr` at the specified JSONPath `path`, or null if the data does not exist or has no fields | +| json_paths(expr) | Get an array of all JSONPath paths available from `expr` | ### JSONPath syntax diff --git a/docs/querying/sql-data-types.md b/docs/querying/sql-data-types.md index 616ae2e47086..bd056dae0288 100644 --- a/docs/querying/sql-data-types.md +++ b/docs/querying/sql-data-types.md @@ -114,15 +114,15 @@ When `druid.expressions.useStrictBooleans = true`, Druid uses three-valued logic However, even in this mode, Druid uses two-valued logic for filter types other than `expression`. ## Nested columns -Druid supports storing nested data structures in segments using the native `COMPLEX` type. This data can be -interacted with using [JSON functions](sql-json-functions.md), which can extract nested values, parse from string, -serialize to string, and to create new `COMPLEX` structures. +Druid supports storing nested data structures in segments using the native `COMPLEX` type. You can interact +with this data using [JSON functions](sql-json-functions.md), which can extract nested values, parse from string, +serialize to string, and create new `COMPLEX` structures. `COMPLEX` types in general currently have limited functionality outside of the use of the specialized functions which understand them, and so have undefined behavior when: * grouping on complex values -* filtered directly on complex values, e.g.
`WHERE json is NULL` +* filtered directly on complex values, such as `WHERE json is NULL` * used as inputs to aggregators without specialized handling for a specific complex type -In many cases, these functions are provided for translating these `COMPLEX` value types a `STRING`, which serves as -workaround solution until `COMPLEX` types functionality can be improved. \ No newline at end of file +In many cases, functions are provided to translate `COMPLEX` value types to `STRING`, which serves as a workaround +solution until `COMPLEX` type functionality can be improved. \ No newline at end of file diff --git a/docs/querying/sql-functions.md b/docs/querying/sql-functions.md index 4e0d3bf4edc9..90083dcd77bd 100644 --- a/docs/querying/sql-functions.md +++ b/docs/querying/sql-functions.md @@ -661,7 +661,7 @@ Returns an array of field names from `expr` at the specified `path`. `JSON_OBJECT(KEY expr1 VALUE expr2[, KEY expr3 VALUE expr4, ...])` -Constructs a new `COMPLEX` object. The `KEY` expressions must evaluate to string types. The `VALUE` expressions can be composed of any input type, including other `COMPLEX` values. `JSON_OBJECT` can accept alternating key-value pairs separated by colons. The following syntax is equivalent: `JSON_OBJECT(expr1:expr2[, expr3:expr4, ...])`. +Constructs a new `COMPLEX` object. The `KEY` expressions must evaluate to string types. The `VALUE` expressions can be composed of any input type, including other `COMPLEX` values. `JSON_OBJECT` can accept colon-separated key-value pairs. The following syntax is equivalent: `JSON_OBJECT(expr1:expr2[, expr3:expr4, ...])`. 
## JSON_PATHS diff --git a/docs/querying/sql-json-functions.md b/docs/querying/sql-json-functions.md index 40264ab8c342..ee0114598838 100644 --- a/docs/querying/sql-json-functions.md +++ b/docs/querying/sql-json-functions.md @@ -36,7 +36,7 @@ the following JSON functions to extract, transform, and create `COMPLEX` v | Function | Notes | | --- | --- | |`JSON_KEYS(expr, path)`| Returns an array of field names from `expr` at the specified `path`.| -|`JSON_OBJECT(KEY expr1 VALUE expr2[, KEY expr3 VALUE expr4, ...])` | Constructs a new `COMPLEX` object. The `KEY` expressions must evaluate to string types. The `VALUE` expressions can be composed of any input type, including other `COMPLEX` values. `JSON_OBJECT` can accept alternating key-value pairs separated by colons. The following syntax is equivalent: `JSON_OBJECT(expr1:expr2[, expr3:expr4, ...])`.| +|`JSON_OBJECT(KEY expr1 VALUE expr2[, KEY expr3 VALUE expr4, ...])` | Constructs a new `COMPLEX` object. The `KEY` expressions must evaluate to string types. The `VALUE` expressions can be composed of any input type, including other `COMPLEX` values. `JSON_OBJECT` can accept colon-separated key-value pairs. The following syntax is equivalent: `JSON_OBJECT(expr1:expr2[, expr3:expr4, ...])`.| |`JSON_PATHS(expr)`| Returns an array of all paths which refer to literal values in `expr` in JSONPath format. | |`JSON_QUERY(expr, path)`| Extracts a `COMPLEX` value from `expr`, at the specified `path`. | |`JSON_VALUE(expr, path [RETURNING sqlType])`| Extracts a literal value from `expr` at the specified `path`. If you specify `RETURNING` and an SQL type name (such as `VARCHAR`, `BIGINT`, `DOUBLE`, etc) the function plans the query using the suggested type. Otherwise, it attempts to infer the type based on the context. 
If it can't infer the type, it defaults to `VARCHAR`.| diff --git a/docs/querying/virtual-columns.md b/docs/querying/virtual-columns.md index 33d7135d9ed3..d2611582e651 100644 --- a/docs/querying/virtual-columns.md +++ b/docs/querying/virtual-columns.md @@ -31,7 +31,7 @@ Virtual columns are queryable column "views" created from a set of columns durin A virtual column can potentially draw from multiple underlying columns, although a virtual column always presents itself as a single column. -Virtual columns can be used as dimensions or as inputs to aggregators. +Virtual columns can be referenced by their output names to be used as [dimensions](./dimensionspecs.md) or as inputs to [filters](./filters.md) and [aggregators](./aggregations.md). Each Apache Druid query can accept a list of virtual columns as a parameter. The following scan query is provided as an example: @@ -95,7 +95,10 @@ This virtual column is used for the SQL operators `JSON_VALUE` (if `processFromR (if `processFromRaw` is true), and accepts 'JSONPath' or 'jq' syntax string representations of paths, or a parsed list of "path parts" in order to determine what should be selected from the column. -Syntax (all 3 of these virtual columns produce the same output): +You can define a nested field virtual column with any of the following equivalent syntaxes. The examples all produce +the same output value, with each example showing a different way to specify how to access the nested value. The first +uses a JSONPath `path`, the second uses a jq `path`, and the third uses `pathParts`.
+ ```json { "type": "nested-field", @@ -105,6 +108,7 @@ Syntax (all 3 of these virtual columns produce the same output): "path": "$.phoneNumbers[1].number" } ``` + ```json { "type": "nested-field", From 393512f69c984c7829f612264f9f14623c763fff Mon Sep 17 00:00:00 2001 From: Clint Wylie Date: Fri, 19 Aug 2022 16:21:55 -0700 Subject: [PATCH 7/7] spelling --- website/.spelling | 1 + 1 file changed, 1 insertion(+) diff --git a/website/.spelling b/website/.spelling index db5fbbdb60fe..09fcd9bfa685 100644 --- a/website/.spelling +++ b/website/.spelling @@ -446,6 +446,7 @@ subtask subtasks supervisorTaskId symlink +syntaxes tiering timeseries timestamp
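The nested field virtual column docs in this patch note that `path` is translated into `pathParts` internally. As a rough illustration of that mapping, the following Python sketch converts the simplified JSONPath subset listed in the docs (`$`, dot notation, bracket notation, and array index) into a list of path part objects. The helper name `jsonpath_to_path_parts` and the regular expression are assumptions made for this example. They are not Druid's implementation, which may treat quoting and escaping differently.

```python
import re

# Hypothetical helper for illustration only; this is not Druid code.
# Each alternative recognizes one operator of the simplified JSONPath subset.
_TOKEN = re.compile(
    r"""
        \.(?P<dot>[^.\[\]]+)        # .name    (child element, dot notation)
      | \['(?P<bracket>[^']*)'\]    # ['name'] (child element, bracket notation)
      | \[(?P<index>\d+)\]          # [0]      (array index)
    """,
    re.VERBOSE,
)


def jsonpath_to_path_parts(path):
    """Translate a simple JSONPath string into a list of path part dicts."""
    if not path.startswith("$"):
        raise ValueError("JSONPath must start with '$': %r" % path)
    parts = []
    pos = 1  # skip the root operator
    while pos < len(path):
        match = _TOKEN.match(path, pos)
        if match is None:
            raise ValueError(
                "unsupported JSONPath syntax at offset %d: %r" % (pos, path)
            )
        if match.group("index") is not None:
            # Array positions become 'arrayElement' parts with an integer index.
            parts.append({"type": "arrayElement", "index": int(match.group("index"))})
        else:
            # Dot and bracket notation both become 'field' parts.
            name = match.group("dot")
            if name is None:
                name = match.group("bracket")
            parts.append({"type": "field", "field": name})
        pos = match.end()
    return parts


parts = jsonpath_to_path_parts("$.phoneNumbers[1].number")
```

Under these assumptions, `$.phoneNumbers[1].number` maps to a `field` part for `phoneNumbers`, an `arrayElement` part with index 1, and a `field` part for `number`, which is the shape the `pathParts` property describes.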