diff --git a/docs/multi-stage-query/concepts.md b/docs/multi-stage-query/concepts.md index 7100e14d01cf..d9385061d79f 100644 --- a/docs/multi-stage-query/concepts.md +++ b/docs/multi-stage-query/concepts.md @@ -200,8 +200,8 @@ rollup-related metadata into the generated segments. Other applications can then queries](../querying/segmentmetadataquery.md) to retrieve rollup-related information. The following [aggregation functions](../querying/sql-aggregations.md) are supported for rollup at ingestion time: -`COUNT` (but switch to `SUM` at query time), `SUM`, `MIN`, `MAX`, `EARLIEST` and `EARLIEST_BY` ([string only](known-issues.md#select-statement)), -`LATEST` and `LATEST_BY` ([string only](known-issues.md#select-statement)), `APPROX_COUNT_DISTINCT`, `APPROX_COUNT_DISTINCT_BUILTIN`, +`COUNT` (but switch to `SUM` at query time), `SUM`, `MIN`, `MAX`, `EARLIEST` and `EARLIEST_BY`, +`LATEST` and `LATEST_BY`, `APPROX_COUNT_DISTINCT`, `APPROX_COUNT_DISTINCT_BUILTIN`, `APPROX_COUNT_DISTINCT_DS_HLL`, `APPROX_COUNT_DISTINCT_DS_THETA`, and `DS_QUANTILES_SKETCH` (but switch to `APPROX_QUANTILE_DS` at query time). Do not use `AVG`; instead, use `SUM` and `COUNT` at ingest time and compute the quotient at query time. diff --git a/docs/multi-stage-query/known-issues.md b/docs/multi-stage-query/known-issues.md index f4e97dc23dad..2a67dafb0f6a 100644 --- a/docs/multi-stage-query/known-issues.md +++ b/docs/multi-stage-query/known-issues.md @@ -42,11 +42,6 @@ an [UnknownError](./reference.md#error_UnknownError) with a message including "N - `GROUPING SETS` are not implemented. Queries using these features return a [QueryNotSupported](reference.md#error_QueryNotSupported) error. -- The numeric varieties of the `EARLIEST` and `LATEST` aggregators do not work properly. Attempting to use the numeric - varieties of these aggregators lead to an error like - `java.lang.ClassCastException: class java.lang.Double cannot be cast to class org.apache.druid.collections.SerializablePair`. 
- The string varieties, however, do work properly. - ## `INSERT` and `REPLACE` Statements - The `INSERT` and `REPLACE` statements with column lists, like `INSERT INTO tbl (a, b, c) SELECT ...`, is not implemented. diff --git a/docs/querying/aggregations.md b/docs/querying/aggregations.md index c7f798011973..8ef8287a9822 100644 --- a/docs/querying/aggregations.md +++ b/docs/querying/aggregations.md @@ -177,10 +177,9 @@ Example: The first and last aggregators determine the metric values that respectively correspond to the earliest and latest values of a time column. -Do not use first and last aggregators for the double, float, and long types in an ingestion spec. They are only supported for queries. -The string-typed aggregators, `stringFirst` and `stringLast`, are supported for both ingestion and querying. - -Queries with first or last aggregators on a segment created with rollup return the rolled up value, not the first or last value from the raw ingested data. +Queries with first or last aggregators on a segment created with rollup return the rolled up value, not the first or last value from the +raw ingested data. The `timeColumn` will get ignored in such cases, and the aggregation will use the original value of the time column +stored at the time the segment was created. #### Numeric first and last aggregators diff --git a/docs/querying/sql-aggregations.md b/docs/querying/sql-aggregations.md index b2df640a68f0..5124b75c7798 100644 --- a/docs/querying/sql-aggregations.md +++ b/docs/querying/sql-aggregations.md @@ -87,9 +87,9 @@ In the aggregation functions supported by Druid, only `COUNT`, `ARRAY_AGG`, and |`STDDEV_SAMP(expr)`|Computes standard deviation sample of `expr`. See [stats extension](../development/extensions-core/stats.md) documentation for additional details.|`null` or `0` if `druid.generic.useDefaultValueForNull=true` (legacy mode)| |`STDDEV(expr)`|Computes standard deviation sample of `expr`. 
See [stats extension](../development/extensions-core/stats.md) documentation for additional details.|`null` or `0` if `druid.generic.useDefaultValueForNull=true` (legacy mode)| |`EARLIEST(expr, [maxBytesPerValue])`|Returns the earliest value of `expr`.
If `expr` comes from a relation with a timestamp column (like `__time` in a Druid datasource), the "earliest" is taken from the row with the overall earliest non-null value of the timestamp column.
If the earliest non-null value of the timestamp column appears in multiple rows, the `expr` may be taken from any of those rows. If `expr` does not come from a relation with a timestamp, then it is simply the first value encountered.

If `expr` is a string or complex type `maxBytesPerValue` amount of space is allocated for the aggregation. Strings longer than this limit are truncated. The `maxBytesPerValue` parameter should be set as low as possible, since high values will lead to wasted memory.
If `maxBytesPerValue`is omitted; it defaults to `1024`. |`null` or `0`/`''` if `druid.generic.useDefaultValueForNull=true` (legacy mode)| -|`EARLIEST_BY(expr, timestampExpr, [maxBytesPerValue])`|Returns the earliest value of `expr`.
The earliest value of `expr` is taken from the row with the overall earliest non-null value of `timestampExpr`.
If the earliest non-null value of `timestampExpr` appears in multiple rows, the `expr` may be taken from any of those rows.

If `expr` is a string or complex type `maxBytesPerValue` amount of space is allocated for the aggregation. Strings longer than this limit are truncated. The `maxBytesPerValue` parameter should be set as low as possible, since high values will lead to wasted memory.
If `maxBytesPerValue`is omitted; it defaults to `1024`. |`null` or `0`/`''` if `druid.generic.useDefaultValueForNull=true` (legacy mode)| +|`EARLIEST_BY(expr, timestampExpr, [maxBytesPerValue])`|Returns the earliest value of `expr`.
The earliest value of `expr` is taken from the row with the overall earliest non-null value of `timestampExpr`.
If the earliest non-null value of `timestampExpr` appears in multiple rows, the `expr` may be taken from any of those rows.

If `expr` is a string or complex type, `maxBytesPerValue` amount of space is allocated for the aggregation. Strings longer than this limit are truncated. The `maxBytesPerValue` parameter should be set as low as possible, since high values will lead to wasted memory.
If `maxBytesPerValue` is omitted, it defaults to `1024`.

Use `EARLIEST` instead of `EARLIEST_BY` on a table that has rollup enabled and was created with any variant of `EARLIEST`, `LATEST`, `EARLIEST_BY`, or `LATEST_BY`. In these cases, the intermediate type already stores the timestamp, and Druid ignores the value passed in `timestampExpr`. |`null` or `0`/`''` if `druid.generic.useDefaultValueForNull=true` (legacy mode)| |`LATEST(expr, [maxBytesPerValue])`|Returns the latest value of `expr`
The `expr` must come from a relation with a timestamp column (like `__time` in a Druid datasource) and the "latest" is taken from the row with the overall latest non-null value of the timestamp column.
If the latest non-null value of the timestamp column appears in multiple rows, the `expr` may be taken from any of those rows.

If `expr` is a string or complex type `maxBytesPerValue` amount of space is allocated for the aggregation. Strings longer than this limit are truncated. The `maxBytesPerValue` parameter should be set as low as possible, since high values will lead to wasted memory.
If `maxBytesPerValue`is omitted; it defaults to `1024`. |`null` or `0`/`''` if `druid.generic.useDefaultValueForNull=true` (legacy mode)| -|`LATEST_BY(expr, timestampExpr, [maxBytesPerValue])`|Returns the latest value of `expr`.
The latest value of `expr` is taken from the row with the overall latest non-null value of `timestampExpr`.
If the overall latest non-null value of `timestampExpr` appears in multiple rows, the `expr` may be taken from any of those rows.

If `expr` is a string or complex type `maxBytesPerValue` amount of space is allocated for the aggregation. Strings longer than this limit are truncated. The `maxBytesPerValue` parameter should be set as low as possible, since high values will lead to wasted memory.
If `maxBytesPerValue`is omitted; it defaults to `1024`. |`null` or `0`/`''` if `druid.generic.useDefaultValueForNull=true` (legacy mode)| +|`LATEST_BY(expr, timestampExpr, [maxBytesPerValue])`|Returns the latest value of `expr`.
The latest value of `expr` is taken from the row with the overall latest non-null value of `timestampExpr`.
If the overall latest non-null value of `timestampExpr` appears in multiple rows, the `expr` may be taken from any of those rows.

If `expr` is a string or complex type, `maxBytesPerValue` amount of space is allocated for the aggregation. Strings longer than this limit are truncated. The `maxBytesPerValue` parameter should be set as low as possible, since high values will lead to wasted memory.
If `maxBytesPerValue` is omitted, it defaults to `1024`.

Use `LATEST` instead of `LATEST_BY` on a table that has rollup enabled and was created with any variant of `EARLIEST`, `LATEST`, `EARLIEST_BY`, or `LATEST_BY`. In these cases, the intermediate type already stores the timestamp, and Druid ignores the value passed in `timestampExpr`. |`null` or `0`/`''` if `druid.generic.useDefaultValueForNull=true` (legacy mode)| |`ANY_VALUE(expr, [maxBytesPerValue, [aggregateMultipleValues]])`|Returns any value of `expr` including null. This aggregator can simplify and optimize the performance by returning the first encountered value (including `null`).

If `expr` is a string or complex type `maxBytesPerValue` amount of space is allocated for the aggregation. Strings longer than this limit are truncated. The `maxBytesPerValue` parameter should be set as low as possible, since high values will lead to wasted memory.
If `maxBytesPerValue` is omitted; it defaults to `1024`. `aggregateMultipleValues` is an optional boolean flag controls the behavior of aggregating a [multi-value dimension](./multi-value-dimensions.md). `aggregateMultipleValues` is set as true by default and returns the stringified array in case of a multi-value dimension. By setting it to false, function will return first value instead. |`null` or `0`/`''` if `druid.generic.useDefaultValueForNull=true` (legacy mode)| |`GROUPING(expr, expr...)`|Returns a number to indicate which groupBy dimension is included in a row, when using `GROUPING SETS`. Refer to [additional documentation](aggregations.md#grouping-aggregator) on how to infer this number.|N/A| |`ARRAY_AGG(expr, [size])`|Collects all values of `expr` into an ARRAY, including null values, with `size` in bytes limit on aggregation size (default of 1024 bytes). If the aggregated array grows larger than the maximum size in bytes, the query will fail. Use of `ORDER BY` within the `ARRAY_AGG` expression is not currently supported, and the ordering of results within the output array may vary depending on processing order.|`null`| diff --git a/sql/src/main/java/org/apache/druid/sql/calcite/aggregation/builtin/EarliestLatestAnySqlAggregator.java b/sql/src/main/java/org/apache/druid/sql/calcite/aggregation/builtin/EarliestLatestAnySqlAggregator.java index 66bbdf8a49bf..b47985ea95a9 100644 --- a/sql/src/main/java/org/apache/druid/sql/calcite/aggregation/builtin/EarliestLatestAnySqlAggregator.java +++ b/sql/src/main/java/org/apache/druid/sql/calcite/aggregation/builtin/EarliestLatestAnySqlAggregator.java @@ -229,7 +229,7 @@ public Aggregation toDruidAggregation( ); } - final String fieldName = getColumnName(plannerContext, virtualColumnRegistry, args.get(0), rexNodes.get(0)); + final String fieldName = getColumnName(virtualColumnRegistry, args.get(0), rexNodes.get(0)); if (!inputAccessor.getInputRowSignature().contains(ColumnHolder.TIME_COLUMN_NAME) && (aggregatorType 
== AggregatorType.LATEST || aggregatorType == AggregatorType.EARLIEST)) { @@ -291,7 +291,6 @@ public Aggregation toDruidAggregation( } static String getColumnName( - PlannerContext plannerContext, VirtualColumnRegistry virtualColumnRegistry, DruidExpression arg, RexNode rexNode @@ -360,7 +359,9 @@ public TimeColIdentifer() @Override public R accept(SqlVisitor visitor) { - + // We override the "accept()" method, because the __time column's presence is determined when Calcite is converting + // the identifiers to the fully qualified column names with prefixes. This is where the validation exception can + // trigger try { return super.accept(visitor); } diff --git a/sql/src/main/java/org/apache/druid/sql/calcite/aggregation/builtin/EarliestLatestBySqlAggregator.java b/sql/src/main/java/org/apache/druid/sql/calcite/aggregation/builtin/EarliestLatestBySqlAggregator.java index fac88d853e11..c72ad3150ad5 100644 --- a/sql/src/main/java/org/apache/druid/sql/calcite/aggregation/builtin/EarliestLatestBySqlAggregator.java +++ b/sql/src/main/java/org/apache/druid/sql/calcite/aggregation/builtin/EarliestLatestBySqlAggregator.java @@ -100,7 +100,6 @@ public Aggregation toDruidAggregation( } final String fieldName = EarliestLatestAnySqlAggregator.getColumnName( - plannerContext, virtualColumnRegistry, args.get(0), rexNodes.get(0) @@ -113,7 +112,6 @@ public Aggregation toDruidAggregation( aggregatorName, fieldName, EarliestLatestAnySqlAggregator.getColumnName( - plannerContext, virtualColumnRegistry, args.get(1), rexNodes.get(1) @@ -140,7 +138,6 @@ public Aggregation toDruidAggregation( aggregatorName, fieldName, EarliestLatestAnySqlAggregator.getColumnName( - plannerContext, virtualColumnRegistry, args.get(1), rexNodes.get(1) diff --git a/sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java b/sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java index b0e607b61a00..a617a9461943 100644 ---
a/sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java +++ b/sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java @@ -638,8 +638,6 @@ public void testGroupBySingleColumnDescendingNoTopN() @Test public void testEarliestAggregators() { - msqIncompatible(); - testQuery( "SELECT " + "EARLIEST(cnt), EARLIEST(m1), EARLIEST(dim1, 10), EARLIEST(dim1, CAST(10 AS INTEGER)), " @@ -1200,8 +1198,6 @@ public void testStringLatestByGroupByWithAlwaysFalseCondition() @Test public void testPrimitiveEarliestInSubquery() { - msqIncompatible(); - testQuery( "SELECT SUM(val1), SUM(val2), SUM(val3) FROM (SELECT dim2, EARLIEST(m1) AS val1, EARLIEST(cnt) AS val2, EARLIEST(m2) AS val3 FROM foo GROUP BY dim2)", ImmutableList.of( @@ -1408,7 +1404,6 @@ public void testPrimitiveAnyInSubquery() @Test public void testStringEarliestSingleStringDim() { - msqIncompatible(); testQuery( "SELECT dim2, EARLIEST(dim1,10) AS val FROM foo GROUP BY dim2", ImmutableList.of( @@ -1524,8 +1519,6 @@ public void testStringAnyInSubquery() @Test public void testEarliestAggregatorsNumericNulls() { - msqIncompatible(); - testQuery( "SELECT EARLIEST(l1), EARLIEST(d1), EARLIEST(f1) FROM druid.numfoo", ImmutableList.of( @@ -1583,8 +1576,6 @@ public void testLatestAggregatorsNumericNull() @Test public void testFirstLatestAggregatorsSkipNulls() { - msqIncompatible(); - final DimFilter filter; if (useDefault) { filter = notNull("dim1"); @@ -1697,8 +1688,6 @@ public void testAnyAggregatorsSkipNullsWithFilter() @Test public void testOrderByEarliestFloat() { - msqIncompatible(); - List expected; if (NullHandling.replaceWithDefault()) { expected = ImmutableList.of( @@ -1744,8 +1733,6 @@ public void testOrderByEarliestFloat() @Test public void testOrderByEarliestDouble() { - msqIncompatible(); - List expected; if (NullHandling.replaceWithDefault()) { expected = ImmutableList.of( @@ -1791,8 +1778,6 @@ public void testOrderByEarliestDouble() @Test public void testOrderByEarliestLong() { - 
msqIncompatible(); - List expected; if (NullHandling.replaceWithDefault()) { expected = ImmutableList.of( @@ -9660,7 +9645,9 @@ public void testTimeseriesEmptyResultsAggregatorDefaultValues() @Test public void testTimeseriesEmptyResultsAggregatorDefaultValuesNonVectorized() { + // Empty-dataset aggregation queries in MSQ return an empty row, rather than a single row as SQL requires. msqIncompatible(); + cannotVectorize(); skipVectorize(); // timeseries with all granularity have a single group, so should return default results for given aggregators @@ -9976,7 +9963,6 @@ public void testGroupByAggregatorDefaultValues() @Test public void testGroupByAggregatorDefaultValuesNonVectorized() { - msqIncompatible(); cannotVectorize(); skipVectorize(); testQuery(