From 9cd15667ca15656ed7bcab93ddf4ae265eaa1e1a Mon Sep 17 00:00:00 2001
From: Andy Grove
Date: Thu, 30 Apr 2026 06:17:35 -0600
Subject: [PATCH 1/3] docs: add implement-comet-expression Claude skill

---
 .../implement-comet-expression/SKILL.md | 81 +++++++++++++++++++
 1 file changed, 81 insertions(+)
 create mode 100644 .claude/skills/implement-comet-expression/SKILL.md

diff --git a/.claude/skills/implement-comet-expression/SKILL.md b/.claude/skills/implement-comet-expression/SKILL.md
new file mode 100644
index 0000000000..d2eaa96065
--- /dev/null
+++ b/.claude/skills/implement-comet-expression/SKILL.md
@@ -0,0 +1,81 @@
+---
+name: implement-comet-expression
+description: Use when implementing a new Spark expression in DataFusion Comet. Walks through cloning latest Spark master to study the canonical implementation, building the initial Comet serde and Rust function from the contributor guide, then running audit-comet-expression to drive a test-coverage iteration loop.
+argument-hint: <expression name>
+---

Implement Comet support for the `$ARGUMENTS` Spark expression.

## Background reading

The contributor guide is the canonical reference. Read these before writing code:

- `docs/source/contributor-guide/adding_a_new_expression.md` covers the Scala serde, protobuf, Rust scalar function flow, support levels, shims, and tests.
- `docs/source/contributor-guide/sql-file-tests.md` describes the Comet SQL file test format.
- `docs/source/contributor-guide/spark_expressions_support.md` lists the coverage status for every expression.

## Workflow

### 1. Study the Spark master implementation first

Always start from the latest Spark `master`. Shallow clone it if not already present:

```bash
if [ ! -d /tmp/spark-master ]; then
  git clone --depth 1 https://github.com/apache/spark.git /tmp/spark-master
fi
```

Find the expression class and tests:

```bash
find /tmp/spark-master/sql -name "*.scala" | \
  xargs grep -l "case class $ARGUMENTS\b\|object $ARGUMENTS\b" 2>/dev/null

find /tmp/spark-master/sql -name "*.scala" -path "*/test/*" | \
  xargs grep -l "$ARGUMENTS" 2>/dev/null
```

Read the source. Note `inputTypes`, `dataType`, `eval` / `nullSafeEval`, ANSI mode branches, and any `require` guards. These define the contract Comet must match.

### 2. Implement the initial version

Follow `adding_a_new_expression.md`:

1. Add a `CometExpressionSerde[T]` in the appropriate file under `spark/src/main/scala/org/apache/comet/serde/` (an existing serde makes a good template; see the sketch after the smoke test below).
2. Register it in the matching map in `QueryPlanSerde.scala`.
3. If the function name collides with a DataFusion built-in that has a different signature, use `scalarFunctionExprToProtoWithReturnType` (see "When to set the return type explicitly").
4. For a new scalar function, add a match case in `native/spark-expr/src/comet_scalar_funcs.rs::create_comet_physical_fun` and implement the function under `native/spark-expr/src/`.
5. Add at least one Comet SQL file test at `spark/src/test/resources/sql-tests/expressions/<category>/$ARGUMENTS.sql` exercising column references, literals, and `NULL`.

Build and smoke-test:

```bash
make
./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite $ARGUMENTS" -Dtest=none
```
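
If any of the five steps above is unclear, an expression Comet already supports makes a good template. One way to trace an existing expression end to end (`Hex` is only an illustrative pick, not a required starting point):

```bash
# Follow one supported expression through the codebase: its serde class,
# its registration in QueryPlanSerde, and its native match case.
grep -rn "Hex" spark/src/main/scala/org/apache/comet/serde/ | head
grep -n "Hex" spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala
grep -in "hex" native/spark-expr/src/comet_scalar_funcs.rs
```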

### 3. Run the audit skill

Once the initial implementation passes its smoke test, run the `audit-comet-expression` skill on `$ARGUMENTS`. It compares the implementation and tests against Spark 3.4.3, 3.5.8, and 4.0.1 and produces a prioritized list of gaps.

### 4. Implement audit-recommended tests and iterate

Add the missing test cases the audit recommends, then re-run the targeted suite:

```bash
./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite $ARGUMENTS" -Dtest=none
```

Surface the findings to the user and ask whether the coverage is sufficient. Continue iterating (adding tests, fixing bugs, refining `getSupportLevel` / `getIncompatibleReasons` / `getUnsupportedReasons`) until the user confirms they are happy.

### 5. Final checks

Before opening a PR:

```bash
make format
cd native && cargo clippy --all-targets --workspace -- -D warnings
```

Use the repo's PR template and fill in every section.

From 953cb86d9f875af1ccb9195d2ab3613f0f27bd92 Mon Sep 17 00:00:00 2001
From: Andy Grove
Date: Thu, 30 Apr 2026 06:19:30 -0600
Subject: [PATCH 2/3] docs: reference PR template and add skill-acknowledgement note

---
 .claude/skills/implement-comet-expression/SKILL.md | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/.claude/skills/implement-comet-expression/SKILL.md b/.claude/skills/implement-comet-expression/SKILL.md
index d2eaa96065..cf3fcf95de 100644
--- a/.claude/skills/implement-comet-expression/SKILL.md
+++ b/.claude/skills/implement-comet-expression/SKILL.md
@@ -78,4 +78,8 @@
make format
cd native && cargo clippy --all-targets --workspace -- -D warnings
```
-Use the repo's PR template and fill in every section.
+### 6. Open the PR
+
+Use the repo's PR template at `.github/pull_request_template.md` and fill in every section: "Which issue does this PR close?", "Rationale for this change", "What changes are included in this PR?", and "How are these changes tested?". Do not add a separate test plan section.
+
+In the "What changes are included in this PR?" section, add a brief note that the `implement-comet-expression` skill was used to scaffold the implementation, so reviewers know which workflow produced the change.

From 422d2b36e3e21b8e08a19454e38b7c730f7b3902 Mon Sep 17 00:00:00 2001
From: Andy Grove
Date: Thu, 30 Apr 2026 06:28:12 -0600
Subject: [PATCH 3/3] docs: check datafusion-spark crate before writing native code

---
 .../implement-comet-expression/SKILL.md | 26 ++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/.claude/skills/implement-comet-expression/SKILL.md b/.claude/skills/implement-comet-expression/SKILL.md
index cf3fcf95de..528d3cdb10 100644
--- a/.claude/skills/implement-comet-expression/SKILL.md
+++ b/.claude/skills/implement-comet-expression/SKILL.md
@@ -1,6 +1,6 @@
---
name: implement-comet-expression
-description: Use when implementing a new Spark expression in DataFusion Comet. Walks through cloning latest Spark master to study the canonical implementation, building the initial Comet serde and Rust function from the contributor guide, then running audit-comet-expression to drive a test-coverage iteration loop.
+description: Use when implementing a new Spark expression in DataFusion Comet. Walks through cloning latest Spark master to study the canonical implementation, checking the upstream datafusion-spark crate before writing native code, building the Comet serde and Rust wire-up from the contributor guide, then running audit-comet-expression to drive a test-coverage iteration loop.
argument-hint: <expression name>
---

Implement Comet support for the `$ARGUMENTS` Spark expression.

## Background reading

The contributor guide is the canonical reference. Read these before writing code:

- `docs/source/contributor-guide/adding_a_new_expression.md` covers the Scala serde, protobuf, Rust scalar function flow, support levels, shims, and tests.
- `docs/source/contributor-guide/sql-file-tests.md` describes the Comet SQL file test format.
- `docs/source/contributor-guide/spark_expressions_support.md` lists the coverage status for every expression.

## Workflow

### 1. Study the Spark master implementation first

Always start from the latest Spark `master`. Shallow clone it if not already present:

```bash
if [ ! -d /tmp/spark-master ]; then
  git clone --depth 1 https://github.com/apache/spark.git /tmp/spark-master
fi
```

Find the expression class and tests:

```bash
find /tmp/spark-master/sql -name "*.scala" | \
  xargs grep -l "case class $ARGUMENTS\b\|object $ARGUMENTS\b" 2>/dev/null

find /tmp/spark-master/sql -name "*.scala" -path "*/test/*" | \
  xargs grep -l "$ARGUMENTS" 2>/dev/null
```

Read the source. Note `inputTypes`, `dataType`, `eval` / `nullSafeEval`, ANSI mode branches, and any `require` guards. These define the contract Comet must match.

-### 2. Implement the initial version
+### 2. Check for an upstream `datafusion-spark` implementation

Before writing a Comet-specific native function, check whether the expression is already available in the upstream `datafusion-spark` crate. It is a Spark-compatible function library maintained alongside DataFusion, so its semantics are usually a closer match to Spark than a generic `datafusion-functions` built-in.

```bash
grep -rn "fn name\|SparkFunctionName" ~/.cargo/registry/src/*/datafusion-spark-*/src/function/ 2>/dev/null | grep -i "$ARGUMENTS"
```

Functions are organized as `datafusion_spark::function::<category>::<name>::Spark<Name>`. Existing wire-ups can be found in `native/core/src/execution/planner.rs` (e.g. `SparkDateAdd`, `SparkDateSub`, `SparkCollectSet`).

When the upstream implementation matches Spark's semantics, prefer it: register the `ScalarUDF` from `datafusion-spark` rather than re-implementing. This keeps the maintenance burden upstream. If the upstream version is missing, incomplete, or diverges from Spark, fall through to step 3 and write the function locally.
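
If the crate is not yet in the local cargo registry, its sources can also be browsed from a shallow clone of DataFusion itself. A sketch, assuming the crate still lives under `datafusion/spark` in the apache/datafusion repo:

```bash
if [ ! -d /tmp/datafusion ]; then
  git clone --depth 1 https://github.com/apache/datafusion.git /tmp/datafusion
fi
# Spark-compatible functions are grouped into per-category directories
# (path assumed from the current apache/datafusion layout).
ls /tmp/datafusion/datafusion/spark/src/function/
grep -rin "$ARGUMENTS" /tmp/datafusion/datafusion/spark/src/function/ | head
```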

### 3. Implement the initial version

Follow `adding_a_new_expression.md`:

1. Add a `CometExpressionSerde[T]` in the appropriate file under `spark/src/main/scala/org/apache/comet/serde/` (an existing serde makes a good template; see the sketch after the smoke test below).
2. Register it in the matching map in `QueryPlanSerde.scala`.
3. If the function name collides with a DataFusion built-in that has a different signature, use `scalarFunctionExprToProtoWithReturnType` (see "When to set the return type explicitly").
4. For a new scalar function, add a match case in `native/spark-expr/src/comet_scalar_funcs.rs::create_comet_physical_fun`. If step 2 found an upstream implementation, wire that in. Otherwise implement the function under `native/spark-expr/src/`.
5. Add at least one Comet SQL file test at `spark/src/test/resources/sql-tests/expressions/<category>/$ARGUMENTS.sql` exercising column references, literals, and `NULL`.

Build and smoke-test:

```bash
make
./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite $ARGUMENTS" -Dtest=none
```

### 4. Run the audit skill

Once the initial implementation passes its smoke test, run the `audit-comet-expression` skill on `$ARGUMENTS`. It compares the implementation and tests against Spark 3.4.3, 3.5.8, and 4.0.1 and produces a prioritized list of gaps.

### 5. Implement audit-recommended tests and iterate

Add the missing test cases the audit recommends, then re-run the targeted suite:

```bash
./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite $ARGUMENTS" -Dtest=none
```

Surface the findings to the user and ask whether the coverage is sufficient. Continue iterating (adding tests, fixing bugs, refining `getSupportLevel` / `getIncompatibleReasons` / `getUnsupportedReasons`) until the user confirms they are happy.

### 6. Final checks

Before opening a PR:

```bash
make format
cd native && cargo clippy --all-targets --workspace -- -D warnings
```

### 7. Open the PR

Use the repo's PR template at `.github/pull_request_template.md` and fill in every section: "Which issue does this PR close?", "Rationale for this change", "What changes are included in this PR?", and "How are these changes tested?". Do not add a separate test plan section.
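
If the GitHub CLI is available, the PR can also be opened from the terminal. A minimal sketch, assuming an authenticated `gh` and an already-pushed branch; the title text is illustrative only:

```bash
# Seed the PR body from the repo's template, fill in each section, then submit.
cp .github/pull_request_template.md /tmp/pr-body.md
"${EDITOR:-vi}" /tmp/pr-body.md
gh pr create --title "feat: add Comet support for $ARGUMENTS" --body-file /tmp/pr-body.md
```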