From 8f0635eba78e5fcef2d175cc5e013f52a78b29e5 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Thu, 29 Jan 2026 11:15:58 -0700 Subject: [PATCH 1/3] Add contributor guide page for SQL file tests Document the CometSqlFileTestSuite test framework including file format, directives (Config, ConfigMatrix, MinSparkVersion), query assertion modes, how to add new tests, and how to handle failures. Co-Authored-By: Claude Opus 4.5 --- docs/source/contributor-guide/index.md | 1 + .../contributor-guide/sql-file-tests.md | 242 ++++++++++++++++++ 2 files changed, 243 insertions(+) create mode 100644 docs/source/contributor-guide/sql-file-tests.md diff --git a/docs/source/contributor-guide/index.md b/docs/source/contributor-guide/index.md index db3270b6af..f5ebc43ed3 100644 --- a/docs/source/contributor-guide/index.md +++ b/docs/source/contributor-guide/index.md @@ -37,6 +37,7 @@ Adding a New Expression Tracing Profiling Native Code Spark SQL Tests +SQL File Tests Roadmap Github and Issue Tracker ``` diff --git a/docs/source/contributor-guide/sql-file-tests.md b/docs/source/contributor-guide/sql-file-tests.md new file mode 100644 index 0000000000..71d8234e04 --- /dev/null +++ b/docs/source/contributor-guide/sql-file-tests.md @@ -0,0 +1,242 @@ + + +# SQL File Tests + +`CometSqlFileTestSuite` is a test suite that automatically discovers `.sql` test files and +runs each query through both Spark and Comet, comparing results. This provides a lightweight +way to add expression and operator test coverage without writing Scala test code. + +## Running the tests + +```shell +mvn test -pl spark -Dsuites="org.apache.comet.CometSqlFileTestSuite" -Dtest=none +``` + +## Test file location + +SQL test files live under: + +``` +spark/src/test/resources/sql-tests/expressions/ +``` + +Files are organized into category subdirectories: + +``` +expressions/ + aggregate/ -- avg, sum, count, min_max, ... + array/ -- array_contains, array_append, get_array_item, ... 
+ bitwise/ + cast/ + conditional/ -- case_when, coalesce, if_expr, ... + datetime/ -- date_add, date_diff, unix_timestamp, ... + decimal/ + hash/ + map/ -- get_map_value, map_keys, map_values, ... + math/ -- abs, ceil, floor, round, sqrt, ... + misc/ -- width_bucket, scalar_subquery, ... + string/ -- concat, like, substring, lower, upper, ... + struct/ -- create_named_struct, get_struct_field, ... +``` + +The test suite recursively discovers all `.sql` files in these directories. Each file becomes +one or more ScalaTest test cases. + +## File format + +A test file consists of SQL comments, directives, statements, and queries separated by blank +lines. Here is a minimal example: + +```sql +-- ConfigMatrix: parquet.enable.dictionary=false,true + +statement +CREATE TABLE test_abs(v double) USING parquet + +statement +INSERT INTO test_abs VALUES (1.5), (-2.5), (0.0), (NULL) + +query +SELECT abs(v) FROM test_abs +``` + +### Directives + +Directives are SQL comments at the top of the file that configure how the test runs. + +#### `Config` + +Sets a Spark SQL config for all queries in the file. + +```sql +-- Config: spark.sql.ansi.enabled=true +``` + +#### `ConfigMatrix` + +Runs the entire file once per combination of values. Multiple `ConfigMatrix` lines produce a +cross product of all combinations. + +```sql +-- ConfigMatrix: parquet.enable.dictionary=false,true +``` + +This generates two test cases: + +``` +sql-file: expressions/cast/cast.sql [parquet.enable.dictionary=false] +sql-file: expressions/cast/cast.sql [parquet.enable.dictionary=true] +``` + +#### `MinSparkVersion` + +Skips the file when running on a Spark version older than the specified version. + +```sql +-- MinSparkVersion: 3.5 +``` + +### Statements + +A `statement` block executes DDL or DML and does not check results. Use this for `CREATE TABLE` +and `INSERT` commands. Table names are automatically extracted for cleanup after the test. 

```sql
statement
CREATE TABLE my_table(x int, y double) USING parquet

statement
INSERT INTO my_table VALUES (1, 2.0), (3, 4.0), (NULL, NULL)
```

### Queries

A `query` block executes a SELECT and compares results between Spark and Comet. The query
mode controls how results are validated.

#### `query` (default mode)

Checks that the query runs natively on Comet (not falling back to Spark) and that results
match Spark exactly.

```sql
query
SELECT abs(v) FROM test_abs
```

#### `query spark_answer_only`

Only checks that Comet results match Spark. Does not assert that the query runs natively.
Use this for expressions that Comet may not fully support yet but should still produce
correct results.

```sql
query spark_answer_only
SELECT some_expression(v) FROM test_table
```

#### `query tolerance=<value>`

Checks results with a numeric tolerance. Useful for floating-point functions where small
differences are acceptable.

```sql
query tolerance=0.0001
SELECT cos(v) FROM test_trig
```

#### `query expect_fallback(<reason>)`

Asserts that the query falls back to Spark and verifies that the fallback reason contains
the given string.

```sql
query expect_fallback(unsupported expression)
SELECT unsupported_func(v) FROM test_table
```

#### `query ignore(<reason>)`

Skips the query entirely. Use this for queries that hit known bugs. The reason should be a
link to the tracking GitHub issue.

```sql
-- Comet bug: space(-1) causes native crash
query ignore(https://github.com/apache/datafusion-comet/issues/3326)
SELECT space(n) FROM test_space WHERE n < 0
```

## Adding a new test

1. Create a `.sql` file under the appropriate subdirectory in
   `spark/src/test/resources/sql-tests/expressions/`. Create a new subdirectory if no
   existing category fits.

2. Add the Apache license header as a SQL comment.

3. Add a `ConfigMatrix` directive if the test should run with multiple Parquet configurations.
+ Most expression tests use: + + ```sql + -- ConfigMatrix: parquet.enable.dictionary=false,true + ``` + +4. Create tables and insert test data using `statement` blocks. Include edge cases such as + `NULL`, boundary values, and negative numbers. + +5. Add `query` blocks for each expression or behavior to test. Use the default `query` mode + when you expect Comet to run the expression natively. Use `query spark_answer_only` when + native execution is not yet expected. + +6. Run the tests to verify: + + ```shell + mvn test -pl spark -Dsuites="org.apache.comet.CometSqlFileTestSuite" -Dtest=none + ``` + +## Handling test failures + +When a query fails due to a known Comet bug: + +1. File a GitHub issue describing the problem. +2. Change the query mode to `ignore(...)` with a link to the issue. +3. Optionally add a SQL comment above the query explaining the problem. + +```sql +-- GetArrayItem returns incorrect results with dynamic index +query ignore(https://github.com/apache/datafusion-comet/issues/3332) +SELECT arr[idx] FROM test_get_array_item +``` + +When the bug is fixed, remove the `ignore(...)` and restore the original query mode. + +## Architecture + +The test infrastructure consists of two Scala files: + +- **`SqlFileTestParser`** (`spark/src/test/scala/org/apache/comet/SqlFileTestParser.scala`) -- + Parses `.sql` files into a `SqlTestFile` data structure containing directives, statements, + and queries. + +- **`CometSqlFileTestSuite`** (`spark/src/test/scala/org/apache/comet/CometSqlFileTestSuite.scala`) -- + Discovers test files at suite initialization time, generates ScalaTest test cases for each + file and config combination, and executes them using `CometTestBase` assertion methods. + +Tables created in test files are automatically cleaned up after each test. 
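
As an illustration of the cross product that multiple `ConfigMatrix` directives produce at
suite initialization, consider this hypothetical file header (the `spark.sql.ansi.enabled`
matrix here is an invented example, not taken from an existing test file):

```sql
-- Each ConfigMatrix directive multiplies the number of runs of this file.
-- Two directives with two values each yield 2 x 2 = 4 test cases.
-- ConfigMatrix: parquet.enable.dictionary=false,true
-- ConfigMatrix: spark.sql.ansi.enabled=false,true
```

Each of the four generated test cases runs every statement and query in the file under its
own combination of config values.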
From d1fb171df1e88d74754fae79406172950353eaf9 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Thu, 29 Jan 2026 11:18:57 -0700 Subject: [PATCH 2/3] Link to SQL file tests guide from development guide Co-Authored-By: Claude Opus 4.5 --- docs/source/contributor-guide/development.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/docs/source/contributor-guide/development.md b/docs/source/contributor-guide/development.md index ec62316170..b8a800b14f 100644 --- a/docs/source/contributor-guide/development.md +++ b/docs/source/contributor-guide/development.md @@ -157,6 +157,9 @@ in their name in `org.apache.comet.CometCastSuite` you can use Other options for selecting specific suites are described in the [ScalaTest Maven Plugin documentation](https://www.scalatest.org/user_guide/using_the_scalatest_maven_plugin) +For adding expression and operator tests using SQL files rather than Scala code, see the +[SQL File Tests](sql-file-tests) guide. + ## Plan Stability Testing Comet has a plan stability testing framework that can be used to test the stability of the query plans generated by Comet. From d15b07a4367441db7da1181ddab3f256a2403d64 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Sat, 31 Jan 2026 10:58:38 -0700 Subject: [PATCH 3/3] docs: Add tips for writing thorough SQL file tests Add guidance on covering all combinations of literal and column arguments, edge cases, and using agentic coding tools. Move SQL file tests section higher in the development guide as the recommended approach for expression testing. 
Co-Authored-By: Claude Opus 4.5 --- docs/source/contributor-guide/development.md | 13 ++- .../contributor-guide/sql-file-tests.md | 95 +++++++++++++++++++ 2 files changed, 105 insertions(+), 3 deletions(-) diff --git a/docs/source/contributor-guide/development.md b/docs/source/contributor-guide/development.md index b8a800b14f..52519e7c60 100644 --- a/docs/source/contributor-guide/development.md +++ b/docs/source/contributor-guide/development.md @@ -140,6 +140,16 @@ size limit to 16 MB. First make sure to install the Rust plugin in CLion or you can use the dedicated Rust IDE: RustRover. After that you can open the project in CLion. The IDE should automatically detect the project structure and import as a Cargo project. +### SQL file tests (recommended for expressions) + +For testing expressions and operators, prefer using SQL file tests over writing Scala test +code. SQL file tests are plain `.sql` files that are automatically discovered and executed -- +no Scala code to write, and no recompilation needed when tests change. This makes it easy to +iterate quickly and to get good coverage of edge cases and argument combinations. + +See the [SQL File Tests](sql-file-tests) guide for the full documentation on how to write +and run these tests. + ### Running Tests in IDEA Like other Maven projects, you can run tests in IntelliJ IDEA by right-clicking on the test class or test method and selecting "Run" or "Debug". @@ -157,9 +167,6 @@ in their name in `org.apache.comet.CometCastSuite` you can use Other options for selecting specific suites are described in the [ScalaTest Maven Plugin documentation](https://www.scalatest.org/user_guide/using_the_scalatest_maven_plugin) -For adding expression and operator tests using SQL files rather than Scala code, see the -[SQL File Tests](sql-file-tests) guide. - ## Plan Stability Testing Comet has a plan stability testing framework that can be used to test the stability of the query plans generated by Comet. 
diff --git a/docs/source/contributor-guide/sql-file-tests.md b/docs/source/contributor-guide/sql-file-tests.md index 71d8234e04..f521cb4adf 100644 --- a/docs/source/contributor-guide/sql-file-tests.md +++ b/docs/source/contributor-guide/sql-file-tests.md @@ -211,6 +211,101 @@ SELECT space(n) FROM test_space WHERE n < 0 mvn test -pl spark -Dsuites="org.apache.comet.CometSqlFileTestSuite" -Dtest=none ``` +### Tips for writing thorough tests + +#### Cover all combinations of literal and column arguments + +Comet often uses different code paths for literal values versus column references. Tests +should exercise both. For a function with multiple arguments, test every useful combination. + +For a single-argument function, test both a column reference and a literal: + +```sql +-- column argument (reads from Parquet, goes through columnar evaluation) +query +SELECT ascii(s) FROM test_ascii + +-- literal arguments +query +SELECT ascii('A'), ascii(''), ascii(NULL) +``` + +For a multi-argument function like `concat_ws(sep, str1, str2, ...)`, test with the separator +as a column versus a literal, and similarly for the other arguments: + +```sql +-- all columns +query +SELECT concat_ws(sep, a, b) FROM test_table + +-- literal separator, column values +query +SELECT concat_ws(',', a, b) FROM test_table + +-- all literals +query +SELECT concat_ws(',', 'hello', 'world') +``` + +**Note on constant folding:** Normally Spark constant-folds all-literal expressions during +planning, so Comet would never see them. However, `CometSqlFileTestSuite` automatically +disables constant folding (by excluding `ConstantFolding` from the optimizer rules), so +all-literal queries are evaluated by Comet's native engine. This means you can use the +default `query` mode for all-literal cases and they will be tested natively just like +column-based queries. + +#### Cover edge cases + +Include edge-case values in your test data. 
The exact cases depend on the function, but +common ones include: + +- **NULL values** -- every test should include NULLs +- **Empty strings** -- for string functions +- **Zero, negative, and very large numbers** -- for numeric functions +- **Boundary values** -- `INT_MIN`, `INT_MAX`, `NaN`, `Infinity`, `-Infinity` for numeric + types +- **Special characters and multibyte UTF-8** -- for string functions (e.g. `'é'`, `'中文'`, + `'\t'`) +- **Empty arrays/maps** -- for collection functions +- **Single-element and multi-element collections** -- for aggregate and collection functions + +#### One file per expression + +Keep each `.sql` file focused on a single expression or a small group of closely related +expressions. This makes failures easy to locate and keeps files readable. + +#### Use comments to label sections + +Add SQL comments before query blocks to describe what aspect of the expression is being +tested. This helps reviewers and future maintainers understand the intent: + +```sql +-- literal separator with NULL values +query +SELECT concat_ws(',', NULL, 'b', 'c') + +-- empty separator +query +SELECT concat_ws('', a, b, c) FROM test_table +``` + +### Using agentic coding tools + +Writing thorough SQL test files is a task well suited to agentic coding tools such as +Claude Code. You can point the tool at an existing test file as an example, describe the +expression you want to test, and ask it to generate a complete `.sql` file covering all +argument combinations and edge cases. This is significantly faster than writing the +combinatorial test cases by hand and helps ensure nothing is missed. + +For example: + +``` +Read the test file spark/src/test/resources/sql-tests/expressions/string/ascii.sql +and the documentation in docs/source/contributor-guide/sql-file-tests.md. +Then write a similar test file for the `reverse` function, covering column arguments, +literal arguments, NULLs, empty strings, and multibyte characters. 
+``` + ## Handling test failures When a query fails due to a known Comet bug:
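
Putting together the directives, statements, and query modes documented above, a complete
test file might look like the following sketch (the table, data, and queries are
illustrative, and the Apache license header is abbreviated):

```sql
-- Licensed to the Apache Software Foundation (ASF) ... (full license header goes here)

-- ConfigMatrix: parquet.enable.dictionary=false,true

statement
CREATE TABLE test_abs(v double) USING parquet

statement
INSERT INTO test_abs VALUES (1.5), (-2.5), (0.0), (NULL)

-- column argument
query
SELECT abs(v) FROM test_abs

-- literal arguments (the suite disables constant folding, so these also run natively)
query
SELECT abs(-1.5), abs(0.0), abs(NULL)

-- floating-point result checked with a tolerance
query tolerance=0.0001
SELECT sqrt(abs(v)) FROM test_abs
```

Because of the `ConfigMatrix` directive, this file would generate two test cases, one per
dictionary-encoding setting.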