From 53269d223dfbac1cb7dc4371b16ff5a42980a03d Mon Sep 17 00:00:00 2001 From: Clint Wylie Date: Wed, 9 Aug 2023 13:35:39 -0700 Subject: [PATCH 1/6] enable sql compatible null handling mode by default --- docs/design/segments.md | 5 ++++- docs/ingestion/schema-design.md | 2 +- docs/querying/math-expr.md | 2 +- docs/querying/sql-data-types.md | 10 +++++----- .../druid/common/config/NullValueHandlingConfig.java | 2 +- 5 files changed, 12 insertions(+), 9 deletions(-) diff --git a/docs/design/segments.md b/docs/design/segments.md index 5dbc8ba97b38..d5b9fad021c2 100644 --- a/docs/design/segments.md +++ b/docs/design/segments.md @@ -82,10 +82,13 @@ For each row in the list of column data, there is only a single bitmap that has ## Handling null values -By default, Druid string dimension columns use the values `''` and `null` interchangeably. Numeric and metric columns cannot represent `null` but use nulls to mean `0`. However, Druid provides a SQL compatible null handling mode, which you can enable at the system level through `druid.generic.useDefaultValueForNull`. This setting, when set to `false`, allows Druid to create segments _at ingestion time_ in which the following occurs: +By default, Druid runs in a SQL compatible null handling mode, which allows Druid to create segments _at ingestion time_ in which the following occurs: + * String columns can distinguish `''` from `null`, * Numeric columns can represent `null` valued rows instead of `0`. +Druid also has a legacy null handling mode which was the default prior to Druid 28.0.0. In this mode string dimension columns use the values `''` and `null` interchangeably. Numeric and metric columns cannot represent `null` but use nulls to mean `0`. You can enable this classic behavior at the system level through `druid.generic.useDefaultValueForNull` and setting to `true`. + String dimension columns contain no additional column structures in SQL compatible null handling mode. 
Instead, they reserve an additional dictionary entry for the `null` value. Numeric columns are stored in the segment with an additional bitmap in which the set bits indicate `null`-valued rows. In addition to slightly increased segment sizes, SQL compatible null handling can incur a performance cost at query time, due to the need to check the null bitmap. This performance cost only occurs for columns that actually contain null values. diff --git a/docs/ingestion/schema-design.md b/docs/ingestion/schema-design.md index 7fd29c1d0ea2..cfd09f99d3c6 100644 --- a/docs/ingestion/schema-design.md +++ b/docs/ingestion/schema-design.md @@ -261,7 +261,7 @@ native boolean types, Druid ingests these values as strings if `druid.expression the [array functions](../querying/sql-array-functions.md) or [UNNEST](../querying/sql-functions.md#unnest). Nested columns can be queried with the [JSON functions](../querying/sql-json-functions.md). -We also highly recommend setting `druid.generic.useDefaultValueForNull=false` when using these columns since it also enables out of the box `ARRAY` type filtering. If not set to `false`, setting `sqlUseBoundsAndSelectors` to `false` on the [SQL query context](../querying/sql-query-context.md) can enable `ARRAY` filtering instead. +We also highly recommend setting `druid.generic.useDefaultValueForNull=false` (the default) when using these columns since it also enables out of the box `ARRAY` type filtering. If not set to `false`, setting `sqlUseBoundsAndSelectors` to `false` on the [SQL query context](../querying/sql-query-context.md) can enable `ARRAY` filtering instead. Mixed type columns are stored in the _least_ restrictive type that can represent all values in the column. 
For example: diff --git a/docs/querying/math-expr.md b/docs/querying/math-expr.md index 8d558f4ceb9f..af2301daa10e 100644 --- a/docs/querying/math-expr.md +++ b/docs/querying/math-expr.md @@ -307,7 +307,7 @@ Supported features: * other: `parse_long` is supported for numeric and string types ## Logical operator modes -Prior to the 0.23 release of Apache Druid, boolean function expressions have inconsistent handling of true and false values, and the logical 'and' and 'or' operators behave in a manner that is incompatible with SQL, even if SQL compatible null handling mode (`druid.generic.useDefaultValueForNull=false`) is enabled. Logical operators also pass through their input values similar to many scripting languages, and treat `null` as false, which can result in some rather strange behavior. Other boolean operations, such as comparisons and equality, retain their input types (e.g. `DOUBLE` comparison would produce `1.0` for true and `0.0` for false), while many other boolean functions strictly produce `LONG` typed values of `1` for true and `0` for false. +Prior to the 0.23 release of Apache Druid, boolean function expressions have inconsistent handling of true and false values, and the logical 'and' and 'or' operators behave in a manner that is incompatible with SQL, even if SQL compatible null handling mode (`druid.generic.useDefaultValueForNull=false`, the default) is enabled. Logical operators also pass through their input values similar to many scripting languages, and treat `null` as false, which can result in some rather strange behavior. Other boolean operations, such as comparisons and equality, retain their input types (e.g. `DOUBLE` comparison would produce `1.0` for true and `0.0` for false), while many other boolean functions strictly produce `LONG` typed values of `1` for true and `0` for false. 
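The SQL compatible mode described above follows SQL three-valued logic, where `null` means "unknown" rather than false. A minimal sketch of those semantics (illustrative only, not Druid code; the class and method names are hypothetical):

```java
// Illustrative sketch of SQL three-valued (Kleene) logic, where null means "unknown".
// This is NOT Druid's implementation; it contrasts the SQL compatible semantics
// with the legacy behavior of simply treating null as false.
public class ThreeValuedLogic
{
  static Boolean and(Boolean a, Boolean b)
  {
    // false dominates AND; otherwise any unknown (null) input makes the result unknown
    if (Boolean.FALSE.equals(a) || Boolean.FALSE.equals(b)) {
      return false;
    }
    if (a == null || b == null) {
      return null;
    }
    return true;
  }

  static Boolean or(Boolean a, Boolean b)
  {
    // true dominates OR; otherwise any unknown (null) input makes the result unknown
    if (Boolean.TRUE.equals(a) || Boolean.TRUE.equals(b)) {
      return true;
    }
    if (a == null || b == null) {
      return null;
    }
    return false;
  }

  public static void main(String[] args)
  {
    System.out.println(and(null, false)); // false: unknown AND false is false in SQL
    System.out.println(and(null, true));  // null: unknown AND true stays unknown
    System.out.println(or(null, true));   // true: unknown OR true is true in SQL
    System.out.println(or(null, false));  // null: unknown OR false stays unknown
  }
}
```

Under the legacy behavior, all four of the calls above would instead treat the `null` operand as false.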
After 0.23, while the inconsistent legacy behavior is still the default, it can optionally be changed by setting `druid.expressions.useStrictBooleans=true`, so that these operations will allow correctly treating `null` values as "unknown" for SQL compatible behavior, and _all boolean output functions_ will output 'homogeneous' `LONG` typed boolean values of `1` for `true` and `0` for `false`. Additionally, diff --git a/docs/querying/sql-data-types.md b/docs/querying/sql-data-types.md index 8427a8dd7372..fb4eea5b5d3b 100644 --- a/docs/querying/sql-data-types.md +++ b/docs/querying/sql-data-types.md @@ -71,8 +71,8 @@ Casts between two SQL types with the same Druid runtime type have no effect othe Casts between two SQL types that have different Druid runtime types generate a runtime cast in Druid. If a value cannot be cast to the target type, as in `CAST('foo' AS BIGINT)`, Druid either substitutes a default -value (when `druid.generic.useDefaultValueForNull = true`, the default mode), or substitutes [NULL](#null-values) (when -`druid.generic.useDefaultValueForNull = false`). NULL values cast to non-nullable types are also substituted with a default value. For example, if `druid.generic.useDefaultValueForNull = true`, a null VARCHAR cast to BIGINT is converted to a zero. +value (when `druid.generic.useDefaultValueForNull = true`), or substitutes [NULL](#null-values) (when +`druid.generic.useDefaultValueForNull = false`, the default mode). NULL values cast to non-nullable types are also substituted with a default value. For example, if `druid.generic.useDefaultValueForNull = true`, a null VARCHAR cast to BIGINT is converted to a zero. ## Multi-value strings @@ -129,15 +129,15 @@ VARCHAR. ARRAY typed results will be serialized into stringified JSON arrays if ## NULL values The [`druid.generic.useDefaultValueForNull`](../configuration/index.md#sql-compatible-null-handling) -runtime property controls Druid's NULL handling mode.
For the most SQL compliant behavior, set this to `false`. +runtime property controls Druid's NULL handling mode. For the most SQL compliant behavior, set this to `false` (the default). -When `druid.generic.useDefaultValueForNull = true` (the default mode), Druid treats NULLs and empty strings +When `druid.generic.useDefaultValueForNull = true`, Druid treats NULLs and empty strings interchangeably, rather than according to the SQL standard. In this mode Druid SQL only has partial support for NULLs. For example, the expressions `col IS NULL` and `col = ''` are equivalent, and both evaluate to true if `col` contains an empty string. Similarly, the expression `COALESCE(col1, col2)` returns `col2` if `col1` is an empty string. While the `COUNT(*)` aggregator counts all rows, the `COUNT(expr)` aggregator counts the number of rows where `expr` is neither null nor the empty string. Numeric columns in this mode are not nullable; any null or missing -values are treated as zeroes. +values are treated as zeroes. This was the default prior to Druid 28.0.0. When `druid.generic.useDefaultValueForNull = false`, NULLs are treated more closely to the SQL standard. In this mode, numeric NULL is permitted, and NULLs and empty strings are no longer treated as interchangeable. 
This property diff --git a/processing/src/main/java/org/apache/druid/common/config/NullValueHandlingConfig.java b/processing/src/main/java/org/apache/druid/common/config/NullValueHandlingConfig.java index fbdc852105d8..fdd13d6a570f 100644 --- a/processing/src/main/java/org/apache/druid/common/config/NullValueHandlingConfig.java +++ b/processing/src/main/java/org/apache/druid/common/config/NullValueHandlingConfig.java @@ -45,7 +45,7 @@ public NullValueHandlingConfig( ) { if (useDefaultValuesForNull == null) { - this.useDefaultValuesForNull = Boolean.valueOf(System.getProperty(NULL_HANDLING_CONFIG_STRING, "true")); + this.useDefaultValuesForNull = Boolean.valueOf(System.getProperty(NULL_HANDLING_CONFIG_STRING, "false")); } else { this.useDefaultValuesForNull = useDefaultValuesForNull; } From bf782cdf097d9b1dbeb49ef7ffa3c905bcc3171d Mon Sep 17 00:00:00 2001 From: Clint Wylie Date: Wed, 9 Aug 2023 17:43:27 -0700 Subject: [PATCH 2/6] fixes --- .../wikipedia_msq_select_query_ha.json | 12 ++++++------ integration-tests/docker/environment-configs/common | 9 +++++++++ .../druid/testing/utils/AbstractTestQueryHelper.java | 3 ++- .../queries/wikipedia_editstream_queries.json | 2 +- 4 files changed, 18 insertions(+), 8 deletions(-) diff --git a/integration-tests-ex/cases/src/test/resources/multi-stage-query/wikipedia_msq_select_query_ha.json b/integration-tests-ex/cases/src/test/resources/multi-stage-query/wikipedia_msq_select_query_ha.json index 58c38250722d..992eda01a26a 100644 --- a/integration-tests-ex/cases/src/test/resources/multi-stage-query/wikipedia_msq_select_query_ha.json +++ b/integration-tests-ex/cases/src/test/resources/multi-stage-query/wikipedia_msq_select_query_ha.json @@ -4,7 +4,7 @@ "expectedResults": [ { "__time": 1377910953000, - "isRobot": "", + "isRobot": null, "added": 57, "delta": -143, "deleted": 200, @@ -12,7 +12,7 @@ }, { "__time": 1377910953000, - "isRobot": "", + "isRobot": null, "added": 57, "delta": -143, "deleted": 200, @@ -20,7 +20,7 @@ }, { 
"__time": 1377919965000, - "isRobot": "", + "isRobot": null, "added": 459, "delta": 330, "deleted": 129, @@ -28,7 +28,7 @@ }, { "__time": 1377919965000, - "isRobot": "", + "isRobot": null, "added": 459, "delta": 330, "deleted": 129, @@ -36,7 +36,7 @@ }, { "__time": 1377933081000, - "isRobot": "", + "isRobot": null, "added": 123, "delta": 111, "deleted": 12, @@ -44,7 +44,7 @@ }, { "__time": 1377933081000, - "isRobot": "", + "isRobot": null, "added": 123, "delta": 111, "deleted": 12, diff --git a/integration-tests/docker/environment-configs/common b/integration-tests/docker/environment-configs/common index 7ace1dc25d30..0f6400b49f54 100644 --- a/integration-tests/docker/environment-configs/common +++ b/integration-tests/docker/environment-configs/common @@ -81,3 +81,12 @@ druid_sql_planner_authorizeSystemTablesDirectly=true # Testing the legacy config from https://github.com/apache/druid/pull/10267 # Can remove this when the flag is no longer needed druid_indexer_task_ignoreTimestampSpecForDruidInputSource=true + +AWS_REGION=us-east-1 + +# If you are making a change in load list below, make the necessary changes in github actions too +druid_extensions_loadList=["mysql-metadata-storage","druid-s3-extensions","druid-basic-security","simple-client-sslcontext","druid-testing-tools","druid-lookups-cached-global","druid-histogram","druid-datasketches","druid-integration-tests"] + +# Setting s3 credentials and region to use pre-populated data for testing. 
+druid_s3_accessKey=AKIAT2GGLKKJQCMG64V4 +druid_s3_secretKey=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 \ No newline at end of file diff --git a/integration-tests/src/main/java/org/apache/druid/testing/utils/AbstractTestQueryHelper.java b/integration-tests/src/main/java/org/apache/druid/testing/utils/AbstractTestQueryHelper.java index 7f2773a89836..f680ad909a63 100644 --- a/integration-tests/src/main/java/org/apache/druid/testing/utils/AbstractTestQueryHelper.java +++ b/integration-tests/src/main/java/org/apache/druid/testing/utils/AbstractTestQueryHelper.java @@ -186,7 +186,8 @@ public int countRows(String dataSource, Interval interval, Function map = (Map) results.get(0).get("result"); - return (Integer) map.get("rows"); + Integer rowCount = (Integer) map.get("rows"); + return rowCount == null ? 0 : rowCount; } } } diff --git a/integration-tests/src/test/resources/queries/wikipedia_editstream_queries.json b/integration-tests/src/test/resources/queries/wikipedia_editstream_queries.json index 59a5c6ca70b8..0d0290d23243 100644 --- a/integration-tests/src/test/resources/queries/wikipedia_editstream_queries.json +++ b/integration-tests/src/test/resources/queries/wikipedia_editstream_queries.json @@ -1410,7 +1410,7 @@ "minValue":"", "maxValue":"mmx._unknown", "errorMessage":null, - "hasNulls":true + "hasNulls":false }, "language":{ "typeSignature": "STRING", From 8a31bbe415f49ff236465dd55c3472d10c38793c Mon Sep 17 00:00:00 2001 From: Clint Wylie Date: Thu, 10 Aug 2023 02:42:34 -0700 Subject: [PATCH 3/6] fix tests --- .../wikipedia_merge_index_queries.json | 4 ++-- .../wikipedia_msq_select_query1.json | 6 +++--- ...wikipedia_msq_select_query_sequential_test.json | 2 +- .../docker/environment-configs/common | 9 --------- .../coordinator/duty/ITAutoCompactionTest.java | 14 +++++++------- 5 files changed, 13 insertions(+), 22 deletions(-) diff --git a/integration-tests-ex/cases/src/test/resources/multi-stage-query/wikipedia_merge_index_queries.json 
b/integration-tests-ex/cases/src/test/resources/multi-stage-query/wikipedia_merge_index_queries.json index 0439b5fdca14..9413bbbec01a 100644 --- a/integration-tests-ex/cases/src/test/resources/multi-stage-query/wikipedia_merge_index_queries.json +++ b/integration-tests-ex/cases/src/test/resources/multi-stage-query/wikipedia_merge_index_queries.json @@ -34,8 +34,8 @@ "timestamp" : "2013-08-31T00:00:00.000Z", "event" : { "continent":"Asia", - "earliest_user":"masterYi", - "latest_user":"stringer" + "earliest_user":null, + "latest_user":null } } ] } diff --git a/integration-tests-ex/cases/src/test/resources/multi-stage-query/wikipedia_msq_select_query1.json b/integration-tests-ex/cases/src/test/resources/multi-stage-query/wikipedia_msq_select_query1.json index 151fb54aaff0..32c3d592d6e5 100644 --- a/integration-tests-ex/cases/src/test/resources/multi-stage-query/wikipedia_msq_select_query1.json +++ b/integration-tests-ex/cases/src/test/resources/multi-stage-query/wikipedia_msq_select_query1.json @@ -4,7 +4,7 @@ "expectedResults": [ { "__time": 1377910953000, - "isRobot": "", + "isRobot": null, "added": 57, "delta": -143, "deleted": 200, @@ -12,7 +12,7 @@ }, { "__time": 1377919965000, - "isRobot": "", + "isRobot": null, "added": 459, "delta": 330, "deleted": 129, @@ -20,7 +20,7 @@ }, { "__time": 1377933081000, - "isRobot": "", + "isRobot": null, "added": 123, "delta": 111, "deleted": 12, diff --git a/integration-tests-ex/cases/src/test/resources/multi-stage-query/wikipedia_msq_select_query_sequential_test.json b/integration-tests-ex/cases/src/test/resources/multi-stage-query/wikipedia_msq_select_query_sequential_test.json index c50ea09ad26a..6987f8cdb813 100644 --- a/integration-tests-ex/cases/src/test/resources/multi-stage-query/wikipedia_msq_select_query_sequential_test.json +++ b/integration-tests-ex/cases/src/test/resources/multi-stage-query/wikipedia_msq_select_query_sequential_test.json @@ -4,7 +4,7 @@ "expectedResults": [ { "__time": 1377933081000, - "isRobot": 
"", + "isRobot": null, "added": 123, "delta": 111, "deleted": 12, diff --git a/integration-tests/docker/environment-configs/common b/integration-tests/docker/environment-configs/common index 0f6400b49f54..7ace1dc25d30 100644 --- a/integration-tests/docker/environment-configs/common +++ b/integration-tests/docker/environment-configs/common @@ -81,12 +81,3 @@ druid_sql_planner_authorizeSystemTablesDirectly=true # Testing the legacy config from https://github.com/apache/druid/pull/10267 # Can remove this when the flag is no longer needed druid_indexer_task_ignoreTimestampSpecForDruidInputSource=true - -AWS_REGION=us-east-1 - -# If you are making a change in load list below, make the necessary changes in github actions too -druid_extensions_loadList=["mysql-metadata-storage","druid-s3-extensions","druid-basic-security","simple-client-sslcontext","druid-testing-tools","druid-lookups-cached-global","druid-histogram","druid-datasketches","druid-integration-tests"] - -# Setting s3 credentials and region to use pre-populated data for testing. 
-druid_s3_accessKey=AKIAT2GGLKKJQCMG64V4 -druid_s3_secretKey=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 \ No newline at end of file diff --git a/integration-tests/src/test/java/org/apache/druid/tests/coordinator/duty/ITAutoCompactionTest.java b/integration-tests/src/test/java/org/apache/druid/tests/coordinator/duty/ITAutoCompactionTest.java index 3c40affa7834..26df03e0d81f 100644 --- a/integration-tests/src/test/java/org/apache/druid/tests/coordinator/duty/ITAutoCompactionTest.java +++ b/integration-tests/src/test/java/org/apache/druid/tests/coordinator/duty/ITAutoCompactionTest.java @@ -466,8 +466,8 @@ public void testAutoCompactionDutySubmitAndVerifyCompaction() throws Exception fullDatasourceName, AutoCompactionSnapshot.AutoCompactionScheduleStatus.RUNNING, 0, - 13702, - 13701, + 14166, + 14165, 0, 2, 2, @@ -484,7 +484,7 @@ public void testAutoCompactionDutySubmitAndVerifyCompaction() throws Exception fullDatasourceName, AutoCompactionSnapshot.AutoCompactionScheduleStatus.RUNNING, 0, - 21566, + 22262, 0, 0, 3, @@ -600,8 +600,8 @@ public void testAutoCompactionDutyCanUpdateTaskSlots() throws Exception getAndAssertCompactionStatus( fullDatasourceName, AutoCompactionSnapshot.AutoCompactionScheduleStatus.RUNNING, - 13702, - 13701, + 14166, + 14165, 0, 2, 2, @@ -609,7 +609,7 @@ public void testAutoCompactionDutyCanUpdateTaskSlots() throws Exception 1, 1, 0); - Assert.assertEquals(compactionResource.getCompactionProgress(fullDatasourceName).get("remainingSegmentSize"), "13702"); + Assert.assertEquals(compactionResource.getCompactionProgress(fullDatasourceName).get("remainingSegmentSize"), "14166"); // Run compaction again to compact the remaining day // Remaining day compacted (1 new segment). 
Now both days compacted (2 total) forceTriggerAutoCompaction(2); @@ -620,7 +620,7 @@ public void testAutoCompactionDutyCanUpdateTaskSlots() throws Exception fullDatasourceName, AutoCompactionSnapshot.AutoCompactionScheduleStatus.RUNNING, 0, - 21566, + 22262, 0, 0, 3, From 9fb545158d9f3e7a79f378b30d8c866df0439511 Mon Sep 17 00:00:00 2001 From: Clint Wylie Date: Tue, 15 Aug 2023 00:52:08 -0700 Subject: [PATCH 4/6] fix bug with string first/last aggs when druid.generic.useDefaultValueForNull=false --- .../multi-stage-query/wikipedia_merge_index_queries.json | 4 ++-- .../query/aggregation/first/StringFirstAggregator.java | 6 +++--- .../aggregation/first/StringFirstBufferAggregator.java | 6 +++--- .../druid/query/aggregation/first/StringFirstLastUtils.java | 3 +++ .../druid/query/aggregation/last/StringLastAggregator.java | 6 +++--- .../query/aggregation/last/StringLastBufferAggregator.java | 6 +++--- 6 files changed, 17 insertions(+), 14 deletions(-) diff --git a/integration-tests-ex/cases/src/test/resources/multi-stage-query/wikipedia_merge_index_queries.json b/integration-tests-ex/cases/src/test/resources/multi-stage-query/wikipedia_merge_index_queries.json index 9413bbbec01a..0439b5fdca14 100644 --- a/integration-tests-ex/cases/src/test/resources/multi-stage-query/wikipedia_merge_index_queries.json +++ b/integration-tests-ex/cases/src/test/resources/multi-stage-query/wikipedia_merge_index_queries.json @@ -34,8 +34,8 @@ "timestamp" : "2013-08-31T00:00:00.000Z", "event" : { "continent":"Asia", - "earliest_user":null, - "latest_user":null + "earliest_user":"masterYi", + "latest_user":"stringer" } } ] } diff --git a/processing/src/main/java/org/apache/druid/query/aggregation/first/StringFirstAggregator.java b/processing/src/main/java/org/apache/druid/query/aggregation/first/StringFirstAggregator.java index 8a6654fbfdff..0d05833378c6 100644 --- a/processing/src/main/java/org/apache/druid/query/aggregation/first/StringFirstAggregator.java +++ 
b/processing/src/main/java/org/apache/druid/query/aggregation/first/StringFirstAggregator.java @@ -56,9 +56,6 @@ public StringFirstAggregator( @Override public void aggregate() { - if (timeSelector.isNull()) { - return; - } if (needsFoldCheck) { // Less efficient code path when folding is a possibility (we must read the value selector first just in case // it's a foldable object). @@ -72,6 +69,9 @@ public void aggregate() firstValue = StringUtils.fastLooseChop(inPair.rhs, maxStringBytes); } } else { + if (timeSelector.isNull()) { + return; + } final long time = timeSelector.getLong(); if (time < firstTime) { diff --git a/processing/src/main/java/org/apache/druid/query/aggregation/first/StringFirstBufferAggregator.java b/processing/src/main/java/org/apache/druid/query/aggregation/first/StringFirstBufferAggregator.java index fbf2a4156c56..563455c9eefa 100644 --- a/processing/src/main/java/org/apache/druid/query/aggregation/first/StringFirstBufferAggregator.java +++ b/processing/src/main/java/org/apache/druid/query/aggregation/first/StringFirstBufferAggregator.java @@ -63,9 +63,6 @@ public void init(ByteBuffer buf, int position) @Override public void aggregate(ByteBuffer buf, int position) { - if (timeSelector.isNull()) { - return; - } if (needsFoldCheck) { // Less efficient code path when folding is a possibility (we must read the value selector first just in case // it's a foldable object). 
@@ -86,6 +83,9 @@ public void aggregate(ByteBuffer buf, int position) } } } else { + if (timeSelector.isNull()) { + return; + } final long time = timeSelector.getLong(); final long firstTime = buf.getLong(position); diff --git a/processing/src/main/java/org/apache/druid/query/aggregation/first/StringFirstLastUtils.java b/processing/src/main/java/org/apache/druid/query/aggregation/first/StringFirstLastUtils.java index 3a9b8818cd0b..14538fe4712e 100644 --- a/processing/src/main/java/org/apache/druid/query/aggregation/first/StringFirstLastUtils.java +++ b/processing/src/main/java/org/apache/druid/query/aggregation/first/StringFirstLastUtils.java @@ -120,6 +120,9 @@ public static SerializablePairLongString readPairFromSelectors( time = pair.lhs; string = pair.rhs; } else if (object != null) { + if (timeSelector.isNull()) { + return null; + } time = timeSelector.getLong(); string = DimensionHandlerUtils.convertObjectToString(object); } else { diff --git a/processing/src/main/java/org/apache/druid/query/aggregation/last/StringLastAggregator.java b/processing/src/main/java/org/apache/druid/query/aggregation/last/StringLastAggregator.java index a7c33c8ad23e..f1dbab60938b 100644 --- a/processing/src/main/java/org/apache/druid/query/aggregation/last/StringLastAggregator.java +++ b/processing/src/main/java/org/apache/druid/query/aggregation/last/StringLastAggregator.java @@ -57,9 +57,6 @@ public StringLastAggregator( @Override public void aggregate() { - if (timeSelector.isNull()) { - return; - } if (needsFoldCheck) { // Less efficient code path when folding is a possibility (we must read the value selector first just in case // it's a foldable object). 
@@ -73,6 +70,9 @@ public void aggregate() lastValue = StringUtils.fastLooseChop(inPair.rhs, maxStringBytes); } } else { + if (timeSelector.isNull()) { + return; + } final long time = timeSelector.getLong(); if (time >= lastTime) { diff --git a/processing/src/main/java/org/apache/druid/query/aggregation/last/StringLastBufferAggregator.java b/processing/src/main/java/org/apache/druid/query/aggregation/last/StringLastBufferAggregator.java index 8611ef72365a..3f78745f5fad 100644 --- a/processing/src/main/java/org/apache/druid/query/aggregation/last/StringLastBufferAggregator.java +++ b/processing/src/main/java/org/apache/druid/query/aggregation/last/StringLastBufferAggregator.java @@ -64,9 +64,6 @@ public void init(ByteBuffer buf, int position) @Override public void aggregate(ByteBuffer buf, int position) { - if (timeSelector.isNull()) { - return; - } if (needsFoldCheck) { // Less efficient code path when folding is a possibility (we must read the value selector first just in case // it's a foldable object). 
@@ -87,6 +84,9 @@ public void aggregate(ByteBuffer buf, int position) } } } else { + if (timeSelector.isNull()) { + return; + } final long time = timeSelector.getLong(); final long lastTime = buf.getLong(position); From 8b866e044a29d99030a2075450460abe5ae843f5 Mon Sep 17 00:00:00 2001 From: Clint Wylie Date: Fri, 18 Aug 2023 15:12:57 -0700 Subject: [PATCH 5/6] adjust docs --- docs/configuration/index.md | 2 +- docs/design/segments.md | 12 ++--- docs/ingestion/schema-design.md | 2 - docs/querying/math-expr.md | 8 +-- docs/querying/sql-aggregations.md | 50 +++++++++---------- docs/querying/sql-array-functions.md | 4 +- docs/querying/sql-data-types.md | 27 +++++----- docs/querying/sql-functions.md | 4 +- docs/querying/sql-metadata-tables.md | 2 +- .../sql-multivalue-string-functions.md | 4 +- docs/querying/sql-query-context.md | 2 +- 11 files changed, 57 insertions(+), 60 deletions(-) diff --git a/docs/configuration/index.md b/docs/configuration/index.md index deb1e7c54183..d863d47d8896 100644 --- a/docs/configuration/index.md +++ b/docs/configuration/index.md @@ -798,7 +798,7 @@ Prior to version 0.13.0, Druid string columns treated `''` and `null` values as |Property|Description|Default| |---|---|---| -|`druid.generic.useDefaultValueForNull`|When set to `true`, `null` values will be stored as `''` for string columns and `0` for numeric columns. Set to `false` to store and query data in SQL compatible mode.|`true`| +|`druid.generic.useDefaultValueForNull`|Set to `false` to store and query data in SQL compatible mode. When set to `true` (legacy mode), `null` values will be stored as `''` for string columns and `0` for numeric columns.|`false`| |`druid.generic.ignoreNullsForStringCardinality`|When set to `true`, `null` values will be ignored for the built-in cardinality aggregator over string columns. Set to `false` to include `null` values while estimating cardinality of only string columns using the built-in cardinality aggregator. 
This setting takes effect only when `druid.generic.useDefaultValueForNull` is set to `true` and is ignored in SQL compatibility mode. Additionally, empty strings (equivalent to null) are not counted when this is set to `true`. |`false`| This mode does have a storage size and query performance cost, see [segment documentation](../design/segments.md#handling-null-values) for more details. diff --git a/docs/design/segments.md b/docs/design/segments.md index d5b9fad021c2..f3ce94018583 100644 --- a/docs/design/segments.md +++ b/docs/design/segments.md @@ -82,16 +82,16 @@ For each row in the list of column data, there is only a single bitmap that has ## Handling null values -By default, Druid runs in a SQL compatible null handling mode, which allows Druid to create segments _at ingestion time_ in which the following occurs: +By default, Druid stores segments in a SQL compatible null handling mode. String columns always store the null value as id 0, the first position in the value dictionary, with an associated entry in the bitmap value indexes used to filter null values. Numeric columns also store a null value bitmap index to indicate the null valued rows, which is used for null checks in aggregations and for matching null values in filters. -* String columns can distinguish `''` from `null`, -* Numeric columns can represent `null` valued rows instead of `0`. +Druid also has a legacy mode that uses default values instead of nulls, which was the default prior to Druid 28.0.0. This legacy mode can be enabled by setting `druid.generic.useDefaultValueForNull=true`. -Druid also has a legacy null handling mode which was the default prior to Druid 28.0.0. In this mode string dimension columns use the values `''` and `null` interchangeably. Numeric and metric columns cannot represent `null` but use nulls to mean `0`. You can enable this classic behavior at the system level through `druid.generic.useDefaultValueForNull` and setting to `true`.
+In legacy mode, Druid segments created _at ingestion time_ have the following characteristics: -String dimension columns contain no additional column structures in SQL compatible null handling mode. Instead, they reserve an additional dictionary entry for the `null` value. Numeric columns are stored in the segment with an additional bitmap in which the set bits indicate `null`-valued rows. +* String columns cannot distinguish `''` from `null`; they are treated interchangeably as the same value +* Numeric columns cannot represent `null` valued rows, and instead store a `0`. -In addition to slightly increased segment sizes, SQL compatible null handling can incur a performance cost at query time, due to the need to check the null bitmap. This performance cost only occurs for columns that actually contain null values. +In legacy mode, numeric columns do not have the null value bitmap, so segment sizes can be slightly smaller, and queries involving numeric columns can perform slightly better in some cases since there is no need to check the null value bitmap. ## Segments with different schemas diff --git a/docs/ingestion/schema-design.md b/docs/ingestion/schema-design.md index 5c3e8eff2ffb..556cdc41a4b3 100644 --- a/docs/ingestion/schema-design.md +++ b/docs/ingestion/schema-design.md @@ -263,8 +263,6 @@ native boolean types, Druid ingests these values as strings if `druid.expression the [array functions](../querying/sql-array-functions.md) or [UNNEST](../querying/sql-functions.md#unnest). Nested columns can be queried with the [JSON functions](../querying/sql-json-functions.md). -We also highly recommend setting `druid.generic.useDefaultValueForNull=false` (the default) when using these columns since it also enables out of the box `ARRAY` type filtering. If not set to `false`, setting `sqlUseBoundsAndSelectors` to `false` on the [SQL query context](../querying/sql-query-context.md) can enable `ARRAY` filtering instead.
- Mixed type columns are stored in the _least_ restrictive type that can represent all values in the column. For example: - Mixed numeric columns are `DOUBLE` diff --git a/docs/querying/math-expr.md b/docs/querying/math-expr.md index 0a60604bbad2..3da1fd398199 100644 --- a/docs/querying/math-expr.md +++ b/docs/querying/math-expr.md @@ -161,7 +161,7 @@ See javadoc of java.lang.Math for detailed explanation for each function. |remainder|remainder(x, y) returns the remainder operation on two arguments as prescribed by the IEEE 754 standard| |rint|rint(x) returns value that is closest in value to x and is equal to a mathematical integer| |round|round(x, y) returns the value of the x rounded to the y decimal places. While x can be an integer or floating-point number, y must be an integer. The type of the return value is specified by that of x. y defaults to 0 if omitted. When y is negative, x is rounded on the left side of the y decimal points. If x is `NaN`, x returns 0. If x is infinity, x will be converted to the nearest finite double. | -|safe_divide|safe_divide(x,y) returns the division of x by y if y is not equal to 0. In case y is 0 it returns 0 or `null` if `druid.generic.useDefaultValueForNull=false` | +|safe_divide|safe_divide(x,y) returns the division of x by y if y is not equal to 0. In case y is 0 it returns `null` (or 0 in legacy mode, `druid.generic.useDefaultValueForNull=true`) | |scalb|scalb(d, sf) returns d * 2^sf rounded as if performed by a single correctly rounded floating-point multiply to a member of the double value set| |signum|signum(x) returns the signum function of the argument x| |sin|sin(x) returns the trigonometric sine of an angle x| @@ -183,8 +183,8 @@ See javadoc of java.lang.Math for detailed explanation for each function.
| array_ordinal(arr,long) | returns the array element at the 1 based index supplied, or null for an out of range index |
| array_contains(arr,expr) | returns 1 if the array contains the element specified by expr, or contains all elements specified by expr if expr is an array, else 0 |
| array_overlap(arr1,arr2) | returns 1 if arr1 and arr2 have any elements in common, else 0 |
-| array_offset_of(arr,expr) | returns the 0 based index of the first occurrence of expr in the array, or `-1` or `null` if `druid.generic.useDefaultValueForNull=false`if no matching elements exist in the array. |
-| array_ordinal_of(arr,expr) | returns the 1 based index of the first occurrence of expr in the array, or `-1` or `null` if `druid.generic.useDefaultValueForNull=false` if no matching elements exist in the array. |
+| array_offset_of(arr,expr) | returns the 0 based index of the first occurrence of expr in the array; if no matching elements exist in the array, returns `null`, or `-1` if `druid.generic.useDefaultValueForNull=true` (legacy mode) |
+| array_ordinal_of(arr,expr) | returns the 1 based index of the first occurrence of expr in the array; if no matching elements exist in the array, returns `null`, or `-1` if `druid.generic.useDefaultValueForNull=true` (legacy mode) 
| | array_prepend(expr,arr) | adds expr to arr at the beginning, the resulting array type determined by the type of the array | | array_append(arr,expr) | appends expr to arr, the resulting array type determined by the type of the first array | | array_concat(arr1,arr2) | concatenates 2 arrays, the resulting array type determined by the type of the first array | @@ -309,7 +309,7 @@ Supported features: * other: `parse_long` is supported for numeric and string types ## Logical operator modes -Prior to the 0.23 release of Apache Druid, boolean function expressions have inconsistent handling of true and false values, and the logical 'and' and 'or' operators behave in a manner that is incompatible with SQL, even if SQL compatible null handling mode (`druid.generic.useDefaultValueForNull=false`, the default) is enabled. Logical operators also pass through their input values similar to many scripting languages, and treat `null` as false, which can result in some rather strange behavior. Other boolean operations, such as comparisons and equality, retain their input types (e.g. `DOUBLE` comparison would produce `1.0` for true and `0.0` for false), while many other boolean functions strictly produce `LONG` typed values of `1` for true and `0` for false. +Prior to the 0.23 release of Apache Druid, boolean function expressions have inconsistent handling of true and false values, and the logical 'and' and 'or' operators behave in a manner that is incompatible with SQL, even if SQL compatible null handling mode (`druid.generic.useDefaultValueForNull=false`) is enabled. Logical operators also pass through their input values similar to many scripting languages, and treat `null` as false, which can result in some rather strange behavior. Other boolean operations, such as comparisons and equality, retain their input types (e.g. 
`DOUBLE` comparison would produce `1.0` for true and `0.0` for false), while many other boolean functions strictly produce `LONG` typed values of `1` for true and `0` for false. After 0.23, while the inconsistent legacy behavior is still the default, it can be optionally be changed by setting `druid.expressions.useStrictBooleans=true`, so that these operations will allow correctly treating `null` values as "unknown" for SQL compatible behavior, and _all boolean output functions_ will output 'homogeneous' `LONG` typed boolean values of `1` for `true` and `0` for `false`. Additionally, diff --git a/docs/querying/sql-aggregations.md b/docs/querying/sql-aggregations.md index 4cb30cd193b3..f9233d40f704 100644 --- a/docs/querying/sql-aggregations.md +++ b/docs/querying/sql-aggregations.md @@ -71,41 +71,41 @@ In the aggregation functions supported by Druid, only `COUNT`, `ARRAY_AGG`, and |--------|-----|-------| |`COUNT(*)`|Counts the number of rows.|`0`| |`COUNT(DISTINCT expr)`|Counts distinct values of `expr`.

When `useApproximateCountDistinct` is set to "true" (the default), this is an alias for `APPROX_COUNT_DISTINCT`. The specific algorithm depends on the value of [`druid.sql.approxCountDistinct.function`](../configuration/index.md#sql). In this mode, you can use strings, numbers, or prebuilt sketches. If counting prebuilt sketches, the prebuilt sketch type must match the selected algorithm.

When `useApproximateCountDistinct` is set to "false", the computation will be exact. In this case, `expr` must be string or numeric, since exact counts are not possible using prebuilt sketches. In exact mode, only one distinct count per query is permitted unless `useGroupingSetForExactDistinct` is enabled.

Counts each distinct value in a [`multi-value`](../querying/multi-value-dimensions.md)-row separately.|`0`| -|`SUM(expr)`|Sums numbers.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`| -|`MIN(expr)`|Takes the minimum of numbers.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `9223372036854775807` (maximum LONG value)| -|`MAX(expr)`|Takes the maximum of numbers.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `-9223372036854775808` (minimum LONG value)| -|`AVG(expr)`|Averages numbers.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`| +|`SUM(expr)`|Sums numbers.|`null` or `0` if `druid.generic.useDefaultValueForNull=true` (legacy mode)| +|`MIN(expr)`|Takes the minimum of numbers.|`null` or `9223372036854775807` (maximum LONG value) if `druid.generic.useDefaultValueForNull=true` (legacy mode)| +|`MAX(expr)`|Takes the maximum of numbers.|`null` or `-9223372036854775808` (minimum LONG value) if `druid.generic.useDefaultValueForNull=true` (legacy mode)| +|`AVG(expr)`|Averages numbers.|`null` or `0` if `druid.generic.useDefaultValueForNull=true` (legacy mode)| |`APPROX_COUNT_DISTINCT(expr)`|Counts distinct values of `expr` using an approximate algorithm. The `expr` can be a regular column or a prebuilt sketch column.

The specific algorithm depends on the value of [`druid.sql.approxCountDistinct.function`](../configuration/index.md#sql). By default, this is `APPROX_COUNT_DISTINCT_BUILTIN`. If the [DataSketches extension](../development/extensions-core/datasketches-extension.md) is loaded, you can set it to `APPROX_COUNT_DISTINCT_DS_HLL` or `APPROX_COUNT_DISTINCT_DS_THETA`.

When run on prebuilt sketch columns, the sketch column type must match the implementation of this function. For example: when `druid.sql.approxCountDistinct.function` is set to `APPROX_COUNT_DISTINCT_BUILTIN`, this function runs on prebuilt hyperUnique columns, but not on prebuilt HLLSketchBuild columns.| |`APPROX_COUNT_DISTINCT_BUILTIN(expr)`|_Usage note:_ consider using `APPROX_COUNT_DISTINCT_DS_HLL` instead, which offers better accuracy in many cases.

Counts distinct values of `expr` using Druid's built-in "cardinality" or "hyperUnique" aggregators, which implement a variant of [HyperLogLog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf). The `expr` can be a string, a number, or a prebuilt hyperUnique column. Results are always approximate, regardless of the value of `useApproximateCountDistinct`.| |`APPROX_QUANTILE(expr, probability, [resolution])`|_Deprecated._ Use `APPROX_QUANTILE_DS` instead, which provides a superior distribution-independent algorithm with formal error guarantees.

Computes approximate quantiles on numeric or [approxHistogram](../development/extensions-core/approximate-histograms.md#approximate-histogram-aggregator) expressions. `probability` should be between 0 and 1, exclusive. `resolution` is the number of centroids to use for the computation. Higher resolutions will give more precise results but also have higher overhead. If not provided, the default resolution is 50. Load the [approximate histogram extension](../development/extensions-core/approximate-histograms.md) to use this function.|`NaN`| |`APPROX_QUANTILE_FIXED_BUCKETS(expr, probability, numBuckets, lowerLimit, upperLimit, [outlierHandlingMode])`|Computes approximate quantiles on numeric or [fixed buckets histogram](../development/extensions-core/approximate-histograms.md#fixed-buckets-histogram) expressions. `probability` should be between 0 and 1, exclusive. The `numBuckets`, `lowerLimit`, `upperLimit`, and `outlierHandlingMode` parameters are described in the fixed buckets histogram documentation. Load the [approximate histogram extension](../development/extensions-core/approximate-histograms.md) to use this function.|`0.0`| |`BLOOM_FILTER(expr, numEntries)`|Computes a bloom filter from values produced by `expr`, with `numEntries` maximum number of distinct values before false positive rate increases. See [bloom filter extension](../development/extensions-core/bloom-filter.md) documentation for additional details.|Empty base64 encoded bloom filter STRING| -|`VAR_POP(expr)`|Computes variance population of `expr`. See [stats extension](../development/extensions-core/stats.md) documentation for additional details.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`| -|`VAR_SAMP(expr)`|Computes variance sample of `expr`. See [stats extension](../development/extensions-core/stats.md) documentation for additional details.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`| -|`VARIANCE(expr)`|Computes variance sample of `expr`. 
See [stats extension](../development/extensions-core/stats.md) documentation for additional details.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`| -|`STDDEV_POP(expr)`|Computes standard deviation population of `expr`. See [stats extension](../development/extensions-core/stats.md) documentation for additional details.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`| -|`STDDEV_SAMP(expr)`|Computes standard deviation sample of `expr`. See [stats extension](../development/extensions-core/stats.md) documentation for additional details.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`| -|`STDDEV(expr)`|Computes standard deviation sample of `expr`. See [stats extension](../development/extensions-core/stats.md) documentation for additional details.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`| -|`EARLIEST(expr)`|Returns the earliest value of `expr`, which must be numeric. If `expr` comes from a relation with a timestamp column (like `__time` in a Druid datasource), the "earliest" is taken from the row with the overall earliest non-null value of the timestamp column. If the earliest non-null value of the timestamp column appears in multiple rows, the `expr` may be taken from any of those rows. If `expr` does not come from a relation with a timestamp, then it is simply the first value encountered.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`| -|`EARLIEST(expr, maxBytesPerString)`|Like `EARLIEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit are truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `''`| -|`EARLIEST_BY(expr, timestampExpr)`|Returns the earliest value of `expr`, which must be numeric. 
The earliest value of `expr` is taken from the row with the overall earliest non-null value of `timestampExpr`. If the earliest non-null value of `timestampExpr` appears in multiple rows, the `expr` may be taken from any of those rows.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`| -|`EARLIEST_BY(expr, timestampExpr, maxBytesPerString)`| Like `EARLIEST_BY(expr, timestampExpr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit are truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `''`| -|`LATEST(expr)`|Returns the latest value of `expr`, which must be numeric. The `expr` must come from a relation with a timestamp column (like `__time` in a Druid datasource) and the "latest" is taken from the row with the overall latest non-null value of the timestamp column. If the latest non-null value of the timestamp column appears in multiple rows, the `expr` may be taken from any of those rows. |`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`| -|`LATEST(expr, maxBytesPerString)`|Like `LATEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit are truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `''`| -|`LATEST_BY(expr, timestampExpr)`|Returns the latest value of `expr`, which must be numeric. The latest value of `expr` is taken from the row with the overall latest non-null value of `timestampExpr`. 
If the overall latest non-null value of `timestampExpr` appears in multiple rows, the `expr` may be taken from any of those rows.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`| -|`LATEST_BY(expr, timestampExpr, maxBytesPerString)`|Like `LATEST_BY(expr, timestampExpr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit are truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `''`| -|`ANY_VALUE(expr)`|Returns any value of `expr` including null. `expr` must be numeric. This aggregator can simplify and optimize the performance by returning the first encountered value (including null)|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`| -|`ANY_VALUE(expr, maxBytesPerString)`|Like `ANY_VALUE(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit are truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `''`| +|`VAR_POP(expr)`|Computes variance population of `expr`. See [stats extension](../development/extensions-core/stats.md) documentation for additional details.|`null` or `0` if `druid.generic.useDefaultValueForNull=true` (legacy mode)| +|`VAR_SAMP(expr)`|Computes variance sample of `expr`. See [stats extension](../development/extensions-core/stats.md) documentation for additional details.|`null` or `0` if `druid.generic.useDefaultValueForNull=true` (legacy mode)| +|`VARIANCE(expr)`|Computes variance sample of `expr`. 
See [stats extension](../development/extensions-core/stats.md) documentation for additional details.|`null` or `0` if `druid.generic.useDefaultValueForNull=true` (legacy mode)| +|`STDDEV_POP(expr)`|Computes standard deviation population of `expr`. See [stats extension](../development/extensions-core/stats.md) documentation for additional details.|`null` or `0` if `druid.generic.useDefaultValueForNull=true` (legacy mode)| +|`STDDEV_SAMP(expr)`|Computes standard deviation sample of `expr`. See [stats extension](../development/extensions-core/stats.md) documentation for additional details.|`null` or `0` if `druid.generic.useDefaultValueForNull=true` (legacy mode)| +|`STDDEV(expr)`|Computes standard deviation sample of `expr`. See [stats extension](../development/extensions-core/stats.md) documentation for additional details.|`null` or `0` if `druid.generic.useDefaultValueForNull=true` (legacy mode)| +|`EARLIEST(expr)`|Returns the earliest value of `expr`, which must be numeric. If `expr` comes from a relation with a timestamp column (like `__time` in a Druid datasource), the "earliest" is taken from the row with the overall earliest non-null value of the timestamp column. If the earliest non-null value of the timestamp column appears in multiple rows, the `expr` may be taken from any of those rows. If `expr` does not come from a relation with a timestamp, then it is simply the first value encountered.|`null` or `0` if `druid.generic.useDefaultValueForNull=true` (legacy mode)| +|`EARLIEST(expr, maxBytesPerString)`|Like `EARLIEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit are truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|`null` or `''` if `druid.generic.useDefaultValueForNull=true` (legacy mode)| +|`EARLIEST_BY(expr, timestampExpr)`|Returns the earliest value of `expr`, which must be numeric. 
The earliest value of `expr` is taken from the row with the overall earliest non-null value of `timestampExpr`. If the earliest non-null value of `timestampExpr` appears in multiple rows, the `expr` may be taken from any of those rows.|`null` or `0` if `druid.generic.useDefaultValueForNull=true` (legacy mode)|
+|`EARLIEST_BY(expr, timestampExpr, maxBytesPerString)`| Like `EARLIEST_BY(expr, timestampExpr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit are truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|`null` or `''` if `druid.generic.useDefaultValueForNull=true` (legacy mode)|
+|`LATEST(expr)`|Returns the latest value of `expr`, which must be numeric. The `expr` must come from a relation with a timestamp column (like `__time` in a Druid datasource) and the "latest" is taken from the row with the overall latest non-null value of the timestamp column. If the latest non-null value of the timestamp column appears in multiple rows, the `expr` may be taken from any of those rows. |`null` or `0` if `druid.generic.useDefaultValueForNull=true` (legacy mode)|
+|`LATEST(expr, maxBytesPerString)`|Like `LATEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit are truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|`null` or `''` if `druid.generic.useDefaultValueForNull=true` (legacy mode)|
+|`LATEST_BY(expr, timestampExpr)`|Returns the latest value of `expr`, which must be numeric. The latest value of `expr` is taken from the row with the overall latest non-null value of `timestampExpr`. 
If the overall latest non-null value of `timestampExpr` appears in multiple rows, the `expr` may be taken from any of those rows.|`null` or `0` if `druid.generic.useDefaultValueForNull=true` (legacy mode)| +|`LATEST_BY(expr, timestampExpr, maxBytesPerString)`|Like `LATEST_BY(expr, timestampExpr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit are truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|`null` or `''` if `druid.generic.useDefaultValueForNull=true` (legacy mode)| +|`ANY_VALUE(expr)`|Returns any value of `expr` including null. `expr` must be numeric. This aggregator can simplify and optimize the performance by returning the first encountered value (including null)|`null` or `0` if `druid.generic.useDefaultValueForNull=true` (legacy mode)| +|`ANY_VALUE(expr, maxBytesPerString)`|Like `ANY_VALUE(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit are truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|`null` or `''` if `druid.generic.useDefaultValueForNull=true` (legacy mode)| |`GROUPING(expr, expr...)`|Returns a number to indicate which groupBy dimension is included in a row, when using `GROUPING SETS`. Refer to [additional documentation](aggregations.md#grouping-aggregator) on how to infer this number.|N/A| |`ARRAY_AGG(expr, [size])`|Collects all values of `expr` into an ARRAY, including null values, with `size` in bytes limit on aggregation size (default of 1024 bytes). If the aggregated array grows larger than the maximum size in bytes, the query will fail. 
Use of `ORDER BY` within the `ARRAY_AGG` expression is not currently supported, and the ordering of results within the output array may vary depending on processing order.|`null`| |`ARRAY_AGG(DISTINCT expr, [size])`|Collects all distinct values of `expr` into an ARRAY, including null values, with `size` in bytes limit on aggregation size (default of 1024 bytes) per aggregate. If the aggregated array grows larger than the maximum size in bytes, the query will fail. Use of `ORDER BY` within the `ARRAY_AGG` expression is not currently supported, and the ordering of results will be based on the default for the element type.|`null`| |`ARRAY_CONCAT_AGG(expr, [size])`|Concatenates all array `expr` into a single ARRAY, with `size` in bytes limit on aggregation size (default of 1024 bytes). Input `expr` _must_ be an array. Null `expr` will be ignored, but any null values within an `expr` _will_ be included in the resulting array. If the aggregated array grows larger than the maximum size in bytes, the query will fail. Use of `ORDER BY` within the `ARRAY_CONCAT_AGG` expression is not currently supported, and the ordering of results within the output array may vary depending on processing order.|`null`| |`ARRAY_CONCAT_AGG(DISTINCT expr, [size])`|Concatenates all distinct values of all array `expr` into a single ARRAY, with `size` in bytes limit on aggregation size (default of 1024 bytes) per aggregate. Input `expr` _must_ be an array. Null `expr` will be ignored, but any null values within an `expr` _will_ be included in the resulting array. If the aggregated array grows larger than the maximum size in bytes, the query will fail. Use of `ORDER BY` within the `ARRAY_CONCAT_AGG` expression is not currently supported, and the ordering of results will be based on the default for the element type.|`null`| -|`STRING_AGG([DISTINCT] expr, [separator, [size]])`|Collects all values (or all distinct values) of `expr` into a single STRING, ignoring null values. 
Each value is joined by an optional `separator`, which must be a literal STRING. If the `separator` is not provided, strings are concatenated without a separator.

An optional `size` in bytes can be supplied to limit aggregation size (default of 1024 bytes). If the aggregated string grows larger than the maximum size in bytes, the query will fail. Use of `ORDER BY` within the `STRING_AGG` expression is not currently supported, and the ordering of results within the output string may vary depending on processing order.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `''`| -|`LISTAGG([DISTINCT] expr, [separator, [size]])`|Synonym for `STRING_AGG`.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `''`| -|`BIT_AND(expr)`|Performs a bitwise AND operation on all input values.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`| -|`BIT_OR(expr)`|Performs a bitwise OR operation on all input values.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`| -|`BIT_XOR(expr)`|Performs a bitwise XOR operation on all input values.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`| +|`STRING_AGG([DISTINCT] expr, [separator, [size]])`|Collects all values (or all distinct values) of `expr` into a single STRING, ignoring null values. Each value is joined by an optional `separator`, which must be a literal STRING. If the `separator` is not provided, strings are concatenated without a separator.

An optional `size` in bytes can be supplied to limit aggregation size (default of 1024 bytes). If the aggregated string grows larger than the maximum size in bytes, the query will fail. Use of `ORDER BY` within the `STRING_AGG` expression is not currently supported, and the ordering of results within the output string may vary depending on processing order.|`null` or `''` if `druid.generic.useDefaultValueForNull=true` (legacy mode)| +|`LISTAGG([DISTINCT] expr, [separator, [size]])`|Synonym for `STRING_AGG`.|`null` or `''` if `druid.generic.useDefaultValueForNull=true` (legacy mode)| +|`BIT_AND(expr)`|Performs a bitwise AND operation on all input values.|`null` or `0` if `druid.generic.useDefaultValueForNull=true` (legacy mode)| +|`BIT_OR(expr)`|Performs a bitwise OR operation on all input values.|`null` or `0` if `druid.generic.useDefaultValueForNull=true` (legacy mode)| +|`BIT_XOR(expr)`|Performs a bitwise XOR operation on all input values.|`null` or `0` if `druid.generic.useDefaultValueForNull=true` (legacy mode)| ## Sketch functions diff --git a/docs/querying/sql-array-functions.md b/docs/querying/sql-array-functions.md index 460a0868bb62..b39c5d526bc3 100644 --- a/docs/querying/sql-array-functions.md +++ b/docs/querying/sql-array-functions.md @@ -54,8 +54,8 @@ The following table describes array functions. To learn more about array aggrega |`ARRAY_ORDINAL(arr, long)`|Returns the array element at the 1-based index supplied, or null for an out of range index.| |`ARRAY_CONTAINS(arr, expr)`|If `expr` is a scalar type, returns 1 if `arr` contains `expr`. If `expr` is an array, returns 1 if `arr` contains all elements of `expr`. Otherwise returns 0.| |`ARRAY_OVERLAP(arr1, arr2)`|Returns 1 if `arr1` and `arr2` have any elements in common, else 0.| -|`ARRAY_OFFSET_OF(arr, expr)`|Returns the 0-based index of the first occurrence of `expr` in the array. 
If no matching elements exist in the array, returns `-1` or `null` if `druid.generic.useDefaultValueForNull=false`.|
-|`ARRAY_ORDINAL_OF(arr, expr)`|Returns the 1-based index of the first occurrence of `expr` in the array. If no matching elements exist in the array, returns `-1` or `null` if `druid.generic.useDefaultValueForNull=false`.|
+|`ARRAY_OFFSET_OF(arr, expr)`|Returns the 0-based index of the first occurrence of `expr` in the array. If no matching elements exist in the array, returns `null`, or `-1` if `druid.generic.useDefaultValueForNull=true` (legacy mode).|
+|`ARRAY_ORDINAL_OF(arr, expr)`|Returns the 1-based index of the first occurrence of `expr` in the array. If no matching elements exist in the array, returns `null`, or `-1` if `druid.generic.useDefaultValueForNull=true` (legacy mode).|
 |`ARRAY_PREPEND(expr, arr)`|Prepends `expr` to `arr` at the beginning, the resulting array type determined by the type of `arr`.|
 |`ARRAY_APPEND(arr1, expr)`|Appends `expr` to `arr`, the resulting array type determined by the type of `arr1`.|
 |`ARRAY_CONCAT(arr1, arr2)`|Concatenates `arr2` to `arr1`. The resulting array type is determined by the type of `arr1`.|
diff --git a/docs/querying/sql-data-types.md b/docs/querying/sql-data-types.md
index 274c5bab461d..6fb5cc0764e5 100644
--- a/docs/querying/sql-data-types.md
+++ b/docs/querying/sql-data-types.md
@@ -66,15 +66,14 @@ The following table describes how Druid maps SQL types onto native types when ru
 |ARRAY|ARRAY|`NULL`|Druid native array types work as SQL arrays, and multi-value strings can be converted to arrays. See [Arrays](#arrays) for more information.|
 |OTHER|COMPLEX|none|May represent various Druid column types such as hyperUnique, approxHistogram, etc.|
 
-* Default value applies if `druid.generic.useDefaultValueForNull = true` (the default mode). Otherwise, the default value is `NULL` for all types. 
+* The default value is `NULL` for all types, except in legacy mode (`druid.generic.useDefaultValueForNull = true`), in which the default values listed in the table apply.
 
 Casts between two SQL types with the same Druid runtime type have no effect other than the exceptions noted in the table.
 
 Casts between two SQL types that have different Druid runtime types generate a runtime cast in Druid.
 
-If a value cannot be cast to the target type, as in `CAST('foo' AS BIGINT)`, Druid either substitutes a default
-value (when `druid.generic.useDefaultValueForNull = true`), or substitutes [NULL](#null-values) (when
-`druid.generic.useDefaultValueForNull = false`, the default mode). NULL values cast to non-nullable types are also substituted with a default value. For example, if `druid.generic.useDefaultValueForNull = true`, a null VARCHAR cast to BIGINT is converted to a zero.
+If a value cannot be cast to the target type, as in `CAST('foo' AS BIGINT)`, Druid substitutes [NULL](#null-values).
+When `druid.generic.useDefaultValueForNull = true` (legacy mode), Druid instead substitutes a default value, including when NULL values are cast to non-nullable types. For example, if `druid.generic.useDefaultValueForNull = true`, a null VARCHAR cast to BIGINT is converted to a zero.
 
 ## Multi-value strings
 
@@ -137,7 +136,13 @@ VARCHAR. ARRAY typed results will be serialized into stringified JSON arrays if
 
 The [`druid.generic.useDefaultValueForNull`](../configuration/index.md#sql-compatible-null-handling) runtime property
 controls Druid's NULL handling mode. For the most SQL compliant behavior, set this to `false` (the default).
 
-When `druid.generic.useDefaultValueForNull = true`, Druid treats NULLs and empty strings
+When `druid.generic.useDefaultValueForNull = false` (the default), NULLs are treated more closely to the SQL standard. In this mode,
+numeric NULL is permitted, and NULLs and empty strings are no longer treated as interchangeable. 
This property +affects both storage and querying, and must be set on all Druid service types to be available at both ingestion time +and query time. There is some overhead associated with the ability to handle NULLs; see +the [segment internals](../design/segments.md#handling-null-values) documentation for more details. + +When `druid.generic.useDefaultValueForNull = true` (legacy mode), Druid treats NULLs and empty strings interchangeably, rather than according to the SQL standard. In this mode Druid SQL only has partial support for NULLs. For example, the expressions `col IS NULL` and `col = ''` are equivalent, and both evaluate to true if `col` contains an empty string. Similarly, the expression `COALESCE(col1, col2)` returns `col2` if `col1` is an empty @@ -145,23 +150,17 @@ string. While the `COUNT(*)` aggregator counts all rows, the `COUNT(expr)` aggre where `expr` is neither null nor the empty string. Numeric columns in this mode are not nullable; any null or missing values are treated as zeroes. This was the default prior to Druid 28.0.0. -When `druid.generic.useDefaultValueForNull = false`, NULLs are treated more closely to the SQL standard. In this mode, -numeric NULL is permitted, and NULLs and empty strings are no longer treated as interchangeable. This property -affects both storage and querying, and must be set on all Druid service types to be available at both ingestion time -and query time. There is some overhead associated with the ability to handle NULLs; see -the [segment internals](../design/segments.md#handling-null-values) documentation for more details. - ## Boolean logic The [`druid.expressions.useStrictBooleans`](../configuration/index.md#expression-processing-configurations) runtime property controls Druid's boolean logic mode. For the most SQL compliant behavior, set this to `true`. -When `druid.expressions.useStrictBooleans = false` (the default mode), Druid uses two-valued logic. 
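Reviewer note, not part of the patch: the three-valued logic that strict booleans enable can be sketched in Python, treating `None` as SQL NULL ("unknown"). Illustrative only, not Druid code:

```python
# Sketch of SQL three-valued logic for AND/OR, where None stands for NULL
# ("unknown"). With strict booleans, Druid expressions follow these truth
# tables instead of treating null as false. Illustrative only.

def tv_and(a, b):
    if a is False or b is False:
        return False   # FALSE AND anything = FALSE
    if a is None or b is None:
        return None    # otherwise unknown propagates
    return True

def tv_or(a, b):
    if a is True or b is True:
        return True    # TRUE OR anything = TRUE
    if a is None or b is None:
        return None
    return False

assert tv_and(None, False) is False   # NULL AND FALSE = FALSE
assert tv_and(None, True) is None     # NULL AND TRUE = NULL
assert tv_or(None, True) is True      # NULL OR TRUE = TRUE
assert tv_or(None, False) is None     # NULL OR FALSE = NULL
```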
- 
-When `druid.expressions.useStrictBooleans = true`, Druid uses three-valued logic for
+When `druid.expressions.useStrictBooleans = true` (the default mode), Druid uses three-valued logic for
 [expressions](math-expr.md) evaluation, such as `expression` virtual columns or `expression` filters. However, even in this mode, Druid uses two-valued logic for filter types other than `expression`.
 
+When `druid.expressions.useStrictBooleans = false` (legacy mode), Druid uses two-valued logic.
+
 ## Nested columns
 
 Druid supports storing nested data structures in segments using the native `COMPLEX` type. See [Nested columns](./nested-columns.md) for more information.
diff --git a/docs/querying/sql-functions.md b/docs/querying/sql-functions.md
index 8821c642f43f..f936610e1630 100644
--- a/docs/querying/sql-functions.md
+++ b/docs/querying/sql-functions.md
@@ -185,7 +185,7 @@ Returns the array element at the 0-based index supplied, or null for an out of r
 
 **Function type:** [Array](./sql-array-functions.md)
 
-Returns the 0-based index of the first occurrence of `expr` in the array. If no matching elements exist in the array, returns `-1` or `null` if `druid.generic.useDefaultValueForNull=false`.
+Returns the 0-based index of the first occurrence of `expr` in the array. If no matching elements exist in the array, returns `null`, or `-1` if `druid.generic.useDefaultValueForNull=true` (legacy mode).
 
 ## ARRAY_ORDINAL
 
@@ -200,7 +200,7 @@ Returns the array element at the 1-based index supplied, or null for an out of r
 
 **Function type:** [Array](./sql-array-functions.md)
 
-Returns the 1-based index of the first occurrence of `expr` in the array. If no matching elements exist in the array, returns `-1` or `null` if `druid.generic.useDefaultValueForNull=false`.|
+Returns the 1-based index of the first occurrence of `expr` in the array. 
If no matching elements exist in the array, returns `null`, or `-1` if `druid.generic.useDefaultValueForNull=true` (legacy mode).
 
 ## ARRAY_OVERLAP
diff --git a/docs/querying/sql-metadata-tables.md b/docs/querying/sql-metadata-tables.md
index 23700e60a8d6..8e9bce9fad95 100644
--- a/docs/querying/sql-metadata-tables.md
+++ b/docs/querying/sql-metadata-tables.md
@@ -234,7 +234,7 @@ Servers table lists all discovered servers in the cluster.
 |tier|VARCHAR|Distribution tier see [druid.server.tier](../configuration/index.md#historical-general-configuration). Only valid for HISTORICAL type, for other types it's null|
 |current_size|BIGINT|Current size of segments in bytes on this server. Only valid for HISTORICAL type, for other types it's 0|
 |max_size|BIGINT|Max size in bytes this server recommends to assign to segments see [druid.server.maxSize](../configuration/index.md#historical-general-configuration). Only valid for HISTORICAL type, for other types it's 0|
-|is_leader|BIGINT|1 if the server is currently the 'leader' (for services which have the concept of leadership), otherwise 0 if the server is not the leader, or the default long value (0 or null depending on `druid.generic.useDefaultValueForNull`) if the server type does not have the concept of leadership|
+|is_leader|BIGINT|1 if the server is currently the 'leader' (for services which have the concept of leadership), otherwise 0 if the server is not the leader, or the default long value (null or zero depending on `druid.generic.useDefaultValueForNull`) if the server type does not have the concept of leadership|
 |start_time|STRING|Timestamp in ISO8601 format when the server was announced in the cluster|
 
 To retrieve information about all servers, use the query:
diff --git a/docs/querying/sql-multivalue-string-functions.md b/docs/querying/sql-multivalue-string-functions.md
index 86c22abd83c3..9688ca083f3e 100644
--- a/docs/querying/sql-multivalue-string-functions.md
+++ 
b/docs/querying/sql-multivalue-string-functions.md
@@ -55,8 +55,8 @@ All array references in the multi-value string function documentation can refer
 |`MV_ORDINAL(arr, long)`|Returns the array element at the 1-based index supplied, or null for an out of range index.|
 |`MV_CONTAINS(arr, expr)`|If `expr` is a scalar type, returns 1 if `arr` contains `expr`. If `expr` is an array, returns 1 if `arr` contains all elements of `expr`. Otherwise returns 0.|
 |`MV_OVERLAP(arr1, arr2)`|Returns 1 if `arr1` and `arr2` have any elements in common, else 0.|
-|`MV_OFFSET_OF(arr, expr)`|Returns the 0-based index of the first occurrence of `expr` in the array. If no matching elements exist in the array, returns `-1` or `null` if `druid.generic.useDefaultValueForNull=false`.|
-|`MV_ORDINAL_OF(arr, expr)`|Returns the 1-based index of the first occurrence of `expr` in the array. If no matching elements exist in the array, returns `-1` or `null` if `druid.generic.useDefaultValueForNull=false`.|
+|`MV_OFFSET_OF(arr, expr)`|Returns the 0-based index of the first occurrence of `expr` in the array. If no matching elements exist in the array, returns `null`, or `-1` if `druid.generic.useDefaultValueForNull=true` (legacy mode).|
+|`MV_ORDINAL_OF(arr, expr)`|Returns the 1-based index of the first occurrence of `expr` in the array. If no matching elements exist in the array, returns `null`, or `-1` if `druid.generic.useDefaultValueForNull=true` (legacy mode).|
 |`MV_PREPEND(expr, arr)`|Adds `expr` to `arr` at the beginning, the resulting array type determined by the type of the array.|
 |`MV_APPEND(arr1, expr)`|Appends `expr` to `arr`, the resulting array type determined by the type of the first array.|
 |`MV_CONCAT(arr1, arr2)`|Concatenates `arr2` to `arr1`. 
The resulting array type is determined by the type of `arr1`.| diff --git a/docs/querying/sql-query-context.md b/docs/querying/sql-query-context.md index f9438363add4..dc192db17183 100644 --- a/docs/querying/sql-query-context.md +++ b/docs/querying/sql-query-context.md @@ -46,7 +46,7 @@ Configure Druid SQL query planning using the parameters in the table below. |`enableTimeBoundaryPlanning`|If true, SQL queries will get converted to TimeBoundary queries wherever possible. TimeBoundary queries are very efficient for min-max calculation on `__time` column in a datasource |`druid.query.default.context.enableTimeBoundaryPlanning` on the Broker (default: false)| |`useNativeQueryExplain`|If true, `EXPLAIN PLAN FOR` will return the explain plan as a JSON representation of equivalent native query(s), else it will return the original version of explain plan generated by Calcite.

This property is provided for backwards compatibility. It is not recommended to use this parameter unless you were depending on the older behavior.|`druid.sql.planner.useNativeQueryExplain` on the Broker (default: true)| |`sqlFinalizeOuterSketches`|If false (default behavior in Druid 25.0.0 and later), `DS_HLL`, `DS_THETA`, and `DS_QUANTILES_SKETCH` return sketches in query results, as documented. If true (default behavior in Druid 24.0.1 and earlier), sketches from these functions are finalized when they appear in query results.

This property is provided for backwards compatibility with behavior in Druid 24.0.1 and earlier. It is not recommended to use this parameter unless you were depending on the older behavior. Instead, use a function that does not return a sketch, such as `APPROX_COUNT_DISTINCT_DS_HLL`, `APPROX_COUNT_DISTINCT_DS_THETA`, `APPROX_QUANTILE_DS`, `DS_THETA_ESTIMATE`, or `DS_GET_QUANTILE`.|`druid.query.default.context.sqlFinalizeOuterSketches` on the Broker (default: false)| -|`sqlUseBoundAndSelectors`|If false (default behavior if `druid.generic.useDefaultValueForNull=false` in Druid 27.0.0 and later), the SQL planner will use [equality](./filters.md#equality-filter), [null](./filters.md#null-filter), and [range](./filters.md#range-filter) filters instead of [selector](./filters.md#selector-filter) and [bounds](./filters.md#bound-filter). This value must be set to `false` for correct behavior for filtering `ARRAY` typed values. | Defaults to same value as `druid.generic.useDefaultValueForNull` | +|`sqlUseBoundAndSelectors`|If false (default behavior if `druid.generic.useDefaultValueForNull=false` in Druid 27.0.0 and later), the SQL planner will use [equality](./filters.md#equality-filter), [null](./filters.md#null-filter), and [range](./filters.md#range-filter) filters instead of [selector](./filters.md#selector-filter) and [bounds](./filters.md#bound-filter). This value must be set to `false` for correct behavior for filtering `ARRAY` typed values. | Defaults to same value as `druid.generic.useDefaultValueForNull`, which is `false`| ## Setting the query context The query context parameters can be specified as a "context" object in the [JSON API](../api-reference/sql-api.md) or as a [JDBC connection properties object](../api-reference/sql-jdbc.md). 
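For illustration, a context object as it might appear in a SQL API request body; the query, table, and column names here are invented for this sketch, not taken from the patch:

```python
import json

# Hypothetical SQL API payload that pins ARRAY-compatible filtering explicitly
# via the sqlUseBoundAndSelectors context parameter described above.
payload = {
    "query": "SELECT * FROM tbl WHERE arr_col = ARRAY['a', 'b']",
    "context": {"sqlUseBoundAndSelectors": False},
}

# Serialize as it would be sent to the SQL endpoint; Python False becomes JSON false.
body = json.dumps(payload)
assert '"sqlUseBoundAndSelectors": false' in body
```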
From ac2e3c489fe7bf2886ab0d0b65ba62e312454585 Mon Sep 17 00:00:00 2001
From: Clint Wylie
Date: Fri, 18 Aug 2023 15:19:58 -0700
Subject: [PATCH 6/6] oops

---
 docs/design/segments.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/design/segments.md b/docs/design/segments.md
index f3ce94018583..194520045aa3 100644
--- a/docs/design/segments.md
+++ b/docs/design/segments.md
@@ -88,7 +88,7 @@ Druid also has a legacy mode which uses default values instead of nulls, which w
 
 In legacy mode, Druid segments created _at ingestion time_ have the following characteristics:
 
-* String columns can not distinguish `''` from `null`, they are treated interchangebly as the same value
+* String columns can not distinguish `''` from `null`; they are treated interchangeably as the same value
 * Numeric columns can not represent `null` valued rows, and instead store a `0`.
 
 In legacy mode, numeric columns do not have the null value bitmap, and so can have slightly decreased segment sizes, and queries involving numeric columns can have slightly increased performance in some cases since there is no need to check the null value bitmap.
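Taken together, the behaviors this series documents (SQL-compatible null handling by default, legacy mode via `druid.generic.useDefaultValueForNull=true`, and three-valued boolean logic) can be sketched with a small model. This is plain illustrative Python, not Druid code; `coalesce`, `array_offset_of`, and `and3` are invented stand-ins for the SQL functions and expression logic:

```python
# Illustrative model of the two null-handling modes. None stands in for SQL
# NULL; the `legacy` flag stands in for druid.generic.useDefaultValueForNull=true.

def coalesce(a, b, legacy=False):
    # COALESCE(a, b): in legacy mode an empty string is treated like NULL.
    if a is None or (legacy and a == ''):
        return b
    return a

def array_offset_of(arr, expr, legacy=False):
    # ARRAY_OFFSET_OF / MV_OFFSET_OF: 0-based index of the first match;
    # no match yields NULL (None) by default, -1 in legacy mode.
    for i, v in enumerate(arr):
        if v == expr:
            return i
    return -1 if legacy else None

def and3(a, b):
    # Three-valued AND for expression evaluation in the default mode:
    # UNKNOWN (None) propagates unless the other operand is False.
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

# Default, SQL-compatible mode: '' and NULL are distinct; no match is NULL.
assert coalesce('', 'fallback') == ''
assert array_offset_of(['a', 'b'], 'c') is None

# Legacy mode: '' behaves like NULL; no match is -1.
assert coalesce('', 'fallback', legacy=True) == 'fallback'
assert array_offset_of(['a', 'b'], 'c', legacy=True) == -1

# Three-valued logic: NULL AND FALSE is FALSE, but NULL AND TRUE is NULL.
assert and3(None, False) is False
assert and3(None, True) is None
```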