Backend
VL (Velox)
Bug description
`GlutenSQLQueryTestSuite` is a replacement for `SQLQueryTestSuite` in Spark. It runs the sample SQL queries provided by Spark under the path spark/sql/core/src/test/resources/sql-tests/inputs and verifies the results. However, `spark.sql.optimizer.excludedRules` is not set in the test suite to prevent static evaluation of literal inputs, so many SQL expressions with literal inputs are evaluated by Spark before being offloaded to Gluten.
To ensure that the expressions are effectively offloaded, we should set:
```
spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation,org.apache.spark.sql.catalyst.optimizer.ConstantFolding,org.apache.spark.sql.catalyst.optimizer.NullPropagation
```
in `GlutenSQLQueryTestSuite`. With this setting, many queries fail due to result mismatches and need to be fixed to ensure consistency with Spark. For queries that produce result mismatches, we record and track them in this issue (see the list below) and temporarily overwrite the sample file by removing the failed SQL queries until all failures in that file are resolved.
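For reproducing a single case outside the suite, the same exclusions can be applied at the session level; a minimal sketch (the query is just one example taken from the list below):

```sql
-- Disable the rules that would pre-evaluate literal expressions on the driver.
SET spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation,org.apache.spark.sql.catalyst.optimizer.ConstantFolding,org.apache.spark.sql.catalyst.optimizer.NullPropagation;

-- With ConstantFolding excluded, this literal expression stays in the plan
-- and is offloaded to the backend instead of being folded by the optimizer.
SELECT to_binary('GG');
```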
Additionally, there are three other types of failures that are not considered result mismatches:
- Exception Message Mismatches: Gluten may throw different exceptions and messages for invalid SQL queries. This is not a result-mismatch issue and should be addressed by overwriting the original files with the updated SQL output.
- Queries That Also Fail in Vanilla Spark: If a query fails in vanilla Spark when the optimizer rules `org.apache.spark.sql.catalyst.optimizer.ConstantFolding` or `org.apache.spark.sql.catalyst.optimizer.NullPropagation` are excluded, we should overwrite the original files and ignore these queries. (Vanilla Spark does not need to set `spark.sql.optimizer.excludedRules` in `SQLQueryTestSuite`.)
- Legacy Date/Timestamp Function Behavior: For some date/timestamp functions, Spark uses the configuration `spark.sql.legacy.timeParserPolicy=LEGACY` to control the output for invalid inputs. When `spark.sql.legacy.timeParserPolicy=LEGACY` is set, queries return NULL for invalid inputs; otherwise, an exception is thrown. Gluten's behavior is mostly consistent with `spark.sql.legacy.timeParserPolicy=LEGACY`. We need to update the documentation to explain the scope of the supported date/timestamp functions.
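As a sketch of the difference (following the behavior described above; the exact error class varies by Spark version):

```sql
-- Default policy: an invalid input, such as Feb 29 with no year, raises an error.
SELECT to_timestamp('02-29', 'MM-dd');

-- LEGACY policy: the same invalid input yields NULL, which is the behavior
-- Gluten is mostly consistent with.
SET spark.sql.legacy.timeParserPolicy=LEGACY;
SELECT to_timestamp('02-29', 'MM-dd');
```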
Below are the failing sample files and their SQL queries:
- string-functions.sql
  - select to_binary('GG')
  - select to_binary('01 AF', 'hex')
  - select to_binary(' ab cdef= = ', replace('HEX0', '0'))
- date.sql
  - select to_date("16", "dd")
  - select to_date("02-29", "MM-dd")
- datetime-parsing.sql / datetime-parsing-legacy.sql
  - select to_timestamp('009999', 'y')
  - select to_timestamp('99', 'yy')
  - select to_timestamp('009999', 'yyy')
  - select to_timestamp('9999', 'yyyy')
  - select to_timestamp('09999', 'yyyyy')
  - select to_timestamp('009999', 'yyyyyy')
  - select to_timestamp('9', 'D')
  - select to_timestamp('300', 'D')
  - select to_timestamp('09', 'DD')
  - select to_timestamp('99', 'DD')
  - select to_timestamp('009', 'DDD')
  - select to_timestamp('365', 'DDD')
  - select to_timestamp('31-365', 'dd-DDD')
  - select to_timestamp('12-365', 'MM-DDD')
  - select to_timestamp('12-31-365', 'MM-dd-DDD')
- datetime-parsing-invalid.sql
  - select to_timestamp('1', 'yy')
  - select to_timestamp('-12', 'yy')
  - select to_timestamp('123', 'yy')
  - select to_timestamp('1', 'yyy')
  - select to_timestamp('366', 'D')
  - select to_timestamp('9', 'DD')
  - select to_timestamp('9', 'DDD')
  - select to_timestamp('99', 'DDD')
  - select to_timestamp('30-365', 'dd-DDD')
  - select to_timestamp('11-365', 'MM-DDD')
  - select to_timestamp('12-30-365', 'MM-dd-DDD')
  - select to_timestamp('2020-01-365', 'yyyy-dd-DDD')
  - select to_timestamp('2020-10-350', 'yyyy-MM-DDD')
- datetime-legacy.sql
  - select to_timestamp("2019-10-06S10:11:12.12345", "yyyy-MM-dd'S'HH:mm:ss.SSSSSS")
  - select to_timestamp("12.12342019-10-06S10:11", "ss.SSSSyyyy-MM-dd'S'HH:mm")
  - select to_timestamp("12.1232019-10-06S10:11", "ss.SSSSyyyy-MM-dd'S'HH:mm")
  - select to_timestamp("12.1232019-10-06S10:11", "ss.SSSSyy-MM-dd'S'HH:mm")
  - select to_timestamp("12.1234019-10-06S10:11", "ss.SSSSy-MM-dd'S'HH:mm")
  - select to_timestamp("2019-10-06T10:11:12'12", "yyyy-MM-dd'T'HH:mm:ss''SSSS")
  - select to_timestamp("16", "dd")
  - select to_timestamp("02-29", "MM-dd")
  - select to_timestamp('22 05 2020 Friday', 'dd MM yyyy EEEEEE')
  - select to_timestamp('22 05 2020 Friday', 'dd MM yyyy EEEEE')
  - select unix_timestamp('22 05 2020 Friday', 'dd MM yyyy EEEEE')
- datetime-formatting.sql / datetime-formatting-legacy.sql
  - All queries fail. Exclude from the supported list.
- try_datetime_functions.sql
  - select try_to_timestamp("02-29", "MM-dd")
  - select try_to_timestamp('22 05 2020 Friday', 'dd MM yyyy EEEEEE')
- try_cast.sql
  - SELECT TRY_CAST('1.23' AS int)
  - SELECT TRY_CAST('1.23' AS long)
  - SELECT TRY_CAST('-4.56' AS int)
  - SELECT TRY_CAST('-4.56' AS long)
- try_arithmetic.sql
  - SELECT try_add(1, (2147483647 + 1))
  - SELECT try_add(1L, (9223372036854775807L + 1L))
- regexp-functions.sql
  - SELECT regexp_extract_all('1a 2b 14m', '(\d+)?', 1)
  - SELECT regexp_extract_all('a 2b 14m', '(\d+)?', 1)
  - SELECT regexp_extract_all('a 2b 14m', '(\d+)?([a-z]+)', 1)
- window.sql
  - SELECT val_date, cate, avg(val_timestamp) OVER(PARTITION BY cate ORDER BY val_date RANGE BETWEEN CURRENT ROW AND interval '1-1' year to month FOLLOWING) FROM testData ORDER BY cate, val_date;
Spark version
Spark-3.5.x
Spark configurations
No response
System information
No response
Relevant logs