
[VL] Track result mismatch cases in GlutenSQLQueryTestSuite #8738

@marin-ma

Description


Backend

VL (Velox)

Bug description

GlutenSQLQueryTestSuite is a replacement for SQLQueryTestSuite in Spark. It runs the sample SQL queries that Spark provides under spark/sql/core/src/test/resources/sql-tests/inputs and verifies the results. However, spark.sql.optimizer.excludedRules is not set in the test suite to prevent static evaluation of literal inputs, so many SQL expressions with literal inputs are evaluated by Spark before they can be offloaded to Gluten.

To ensure that the expressions are effectively offloaded, we should set:

spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation,org.apache.spark.sql.catalyst.optimizer.ConstantFolding,org.apache.spark.sql.catalyst.optimizer.NullPropagation

in GlutenSQLQueryTestSuite. With this setting, many queries fail due to result mismatches and need to be fixed to ensure consistency with Spark. For queries that produce result mismatches, we record and track them in this issue (see the list below) and temporarily overwrite the sample file by removing the failed SQL queries until all failures in that file are resolved.
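To make concrete what a rule like ConstantFolding does (this is an illustrative toy, not Spark's optimizer; the names `Lit`, `Call`, and `fold` are invented here): any expression whose operands are all literals is evaluated during optimization, so the execution engine, Gluten in our case, never sees the original expression.

```python
# Toy constant-folding pass over a minimal expression tree.
# NOT Spark's implementation; it only illustrates why literal-input
# expressions disappear from the plan before reaching the backend.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Lit:
    value: Any

@dataclass
class Call:
    fn: Callable
    args: list  # elements are Lit or Call

def fold(expr):
    """Replace any Call whose arguments are all literals with its result."""
    if isinstance(expr, Lit):
        return expr
    args = [fold(a) for a in expr.args]
    if all(isinstance(a, Lit) for a in args):
        # Evaluated by the "optimizer" here, so the "engine" never sees it.
        return Lit(expr.fn(*[a.value for a in args]))
    return Call(expr.fn, args)

# upper('gg') has only literal inputs, so it folds to a plain literal:
print(fold(Call(str.upper, [Lit("gg")])))  # Lit(value='GG')
```

Excluding the rule keeps such expressions in the plan, so they reach the backend and are actually exercised by the test suite.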

Additionally, there are three other types of failures that are not considered result mismatches:

  1. Exception Message Mismatches: Gluten may throw different exceptions and messages for invalid SQL queries. This is not a result mismatch issue and should be addressed by overwriting the original files with the updated SQL output.
  2. Queries That Also Fail in Vanilla Spark: If a query fails in vanilla Spark when the optimizer rules org.apache.spark.sql.catalyst.optimizer.ConstantFolding or org.apache.spark.sql.catalyst.optimizer.NullPropagation are excluded, we should overwrite the original files and ignore these queries. (Vanilla Spark's SQLQueryTestSuite does not set spark.sql.optimizer.excludedRules, so these failures only appear once the rules are excluded.)
  3. Legacy Date/Timestamp Function Behavior: For some date/timestamp functions, Spark uses the configuration spark.sql.legacy.timeParserPolicy to control the output for invalid inputs. When spark.sql.legacy.timeParserPolicy=LEGACY is set, queries return NULL for invalid inputs; otherwise, an exception is thrown. Gluten's behavior is mostly consistent with spark.sql.legacy.timeParserPolicy=LEGACY. We need to update the documentation to explain the scope of the supported date/timestamp functions.
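The LEGACY-versus-strict distinction in item 3 can be sketched in plain Python (a hypothetical helper using strptime patterns, not Spark's datetime formatter):

```python
from datetime import datetime

def to_timestamp(s, fmt, legacy):
    """Sketch of spark.sql.legacy.timeParserPolicy semantics.
    Hypothetical helper; uses Python strptime patterns, not Spark's."""
    try:
        return datetime.strptime(s, fmt)
    except ValueError:
        if legacy:
            return None   # LEGACY policy: invalid input maps to NULL
        raise             # strict policy: invalid input raises

# "02-29" is invalid here because strptime's default year (1900) is not a
# leap year, loosely mirroring the failing query `select to_date("02-29", "MM-dd")`:
print(to_timestamp("02-29", "%m-%d", legacy=True))   # None
print(to_timestamp("16", "%d", legacy=True))         # a valid datetime
```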

Below are the failed sample files and the affected SQL queries:

  • string-functions.sql

    • select to_binary('GG')
    • select to_binary('01 AF', 'hex')
    • select to_binary(' ab cdef= = ', replace('HEX0', '0'))
  • date.sql

    • select to_date("16", "dd")
    • select to_date("02-29", "MM-dd")
  • datetime-parsing.sql / datetime-parsing-legacy.sql

    • select to_timestamp('009999', 'y')
    • select to_timestamp('99', 'yy')
    • select to_timestamp('009999', 'yyy')
    • select to_timestamp('9999', 'yyyy')
    • select to_timestamp('09999', 'yyyyy')
    • select to_timestamp('009999', 'yyyyyy')
    • select to_timestamp('9', 'D')
    • select to_timestamp('300', 'D')
    • select to_timestamp('09', 'DD')
    • select to_timestamp('99', 'DD')
    • select to_timestamp('009', 'DDD')
    • select to_timestamp('365', 'DDD')
    • select to_timestamp('31-365', 'dd-DDD')
    • select to_timestamp('12-365', 'MM-DDD')
    • select to_timestamp('12-31-365', 'MM-dd-DDD')
  • datetime-parsing-invalid.sql

    • select to_timestamp('1', 'yy')
    • select to_timestamp('-12', 'yy')
    • select to_timestamp('123', 'yy')
    • select to_timestamp('1', 'yyy')
    • select to_timestamp('366', 'D')
    • select to_timestamp('9', 'DD')
    • select to_timestamp('9', 'DDD')
    • select to_timestamp('99', 'DDD')
    • select to_timestamp('30-365', 'dd-DDD')
    • select to_timestamp('11-365', 'MM-DDD')
    • select to_timestamp('12-30-365', 'MM-dd-DDD')
    • select to_timestamp('2020-01-365', 'yyyy-dd-DDD')
    • select to_timestamp('2020-10-350', 'yyyy-MM-DDD')
  • datetime-legacy.sql

    • select to_timestamp("2019-10-06S10:11:12.12345", "yyyy-MM-dd'S'HH:mm:ss.SSSSSS")
    • select to_timestamp("12.12342019-10-06S10:11", "ss.SSSSyyyy-MM-dd'S'HH:mm")
    • select to_timestamp("12.1232019-10-06S10:11", "ss.SSSSyyyy-MM-dd'S'HH:mm")
    • select to_timestamp("12.1232019-10-06S10:11", "ss.SSSSyy-MM-dd'S'HH:mm")
    • select to_timestamp("12.1234019-10-06S10:11", "ss.SSSSy-MM-dd'S'HH:mm")
    • select to_timestamp("2019-10-06T10:11:12'12", "yyyy-MM-dd'T'HH:mm:ss''SSSS")
    • select to_timestamp("16", "dd")
    • select to_timestamp("02-29", "MM-dd")
    • select to_timestamp('22 05 2020 Friday', 'dd MM yyyy EEEEEE')
    • select to_timestamp('22 05 2020 Friday', 'dd MM yyyy EEEEE')
    • select unix_timestamp('22 05 2020 Friday', 'dd MM yyyy EEEEE')
  • datetime-formatting.sql / datetime-formatting-legacy.sql

    • All queries fail; exclude these files from the supported list.
  • try_datetime_functions.sql

    • select try_to_timestamp("02-29", "MM-dd")
    • select try_to_timestamp('22 05 2020 Friday', 'dd MM yyyy EEEEEE')
  • try_cast.sql

    • SELECT TRY_CAST('1.23' AS int)
    • SELECT TRY_CAST('1.23' AS long)
    • SELECT TRY_CAST('-4.56' AS int)
    • SELECT TRY_CAST('-4.56' AS long)
  • try_arithmetic.sql

    • SELECT try_add(1, (2147483647 + 1))
    • SELECT try_add(1L, (9223372036854775807L + 1L))
  • regexp-functions.sql

    • SELECT regexp_extract_all('1a 2b 14m', '(\d+)?', 1)
    • SELECT regexp_extract_all('a 2b 14m', '(\d+)?', 1)
    • SELECT regexp_extract_all('a 2b 14m', '(\d+)?([a-z]+)', 1)
  • window.sql

    • SELECT val_date, cate, avg(val_timestamp) OVER(PARTITION BY cate ORDER BY val_date
      RANGE BETWEEN CURRENT ROW AND interval '1-1' year to month FOLLOWING) FROM testData
      ORDER BY cate, val_date;
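For the regexp-functions.sql cases above, note that `(\d+)?` is an optional capture group that can match the empty string at every position, so "extract all" semantics must decide how empty matches are reported; this is a plausible source of divergence between regex engines. A neutral illustration with Python's re module (not Spark's Java regex or Velox's RE2):

```python
import re

# The optional group matches the empty string wherever no digits occur, so
# the result interleaves the real captures with empty matches.
print(re.findall(r"(\d+)?", "1a 2b 14m"))
# ['1', '', '', '2', '', '', '14', '', '']
```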

Spark version

Spark-3.5.x

