
Feat: Support Spark 4.0.0 part1 #1830

Merged
andygrove merged 30 commits into apache:main from huaxingao:4.0shim
Jul 1, 2025

Conversation

@huaxingao
Contributor

@huaxingao huaxingao commented Jun 2, 2025

Which issue does this PR close?

Part of #1637
Closes #1846

Rationale for this change

Adding shim files for Spark 4.0.0 support

What changes are included in this PR?

  • Remove 4.0.0-preview1.diff
  • Add 4.0.0.diff
  • Update/skip some tests

Follow-up issues:

How are these changes tested?

@andygrove
Member

I ran into compilation issues locally when building for Spark 3.4. I think these can be resolved simply by moving the new spark-3.5 shims into the spark-3.x shim folder instead.

Also, some jobs fail with scalastyle errors, which can be fixed by running make format.

Comment thread common/src/main/java/org/apache/comet/parquet/TypeUtil.java Outdated
@codecov-commenter

codecov-commenter commented Jun 5, 2025

Codecov Report

❌ Patch coverage is 13.33333% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.11%. Comparing base (f09f8af) to head (e8a2f40).
⚠️ Report is 392 commits behind head on main.

Files with missing lines Patch % Lines
...c/main/java/org/apache/comet/parquet/TypeUtil.java 0.00% 5 Missing and 2 partials ⚠️
...ffle/comet/CometBoundedShuffleMemoryAllocator.java 0.00% 6 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1830      +/-   ##
============================================
+ Coverage     56.12%   58.11%   +1.98%     
- Complexity      976     1154     +178     
============================================
  Files           119      132      +13     
  Lines         11743    13022    +1279     
  Branches       2251     2417     +166     
============================================
+ Hits           6591     7568     +977     
- Misses         4012     4225     +213     
- Partials       1140     1229      +89     

☔ View full report in Codecov by Sentry.


@andygrove
Member

andygrove commented Jun 5, 2025

The Spark version will also need to be updated in .github/workflows/spark_sql_test_ansi.yml. It currently has:

spark-version: [{short: '4.0', full: '4.0.0-preview1'}]
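The matrix entry would presumably change to reference the final release artifact, along these lines (a sketch; the exact version string is an assumption based on this PR's title):

```yaml
# .github/workflows/spark_sql_test_ansi.yml (sketch; assumed final version string)
spark-version: [{short: '4.0', full: '4.0.0'}]
```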

Contributor

@parthchandra parthchandra left a comment


Just some minor comments.

Comment thread common/src/main/java/org/apache/comet/parquet/TypeUtil.java
Comment thread spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala
Contributor

@YanivKunda YanivKunda left a comment


A few dependency alignments are needed as part of the Spark 4.0.0 release.

Comment thread pom.xml Outdated
Comment thread pom.xml Outdated
Comment thread pom.xml Outdated
@huaxingao huaxingao force-pushed the 4.0shim branch 2 times, most recently from db3254f to 381c953 Compare June 9, 2025 06:18
Contributor

@YanivKunda YanivKunda left a comment


Found some files that might have been left over from local operations -
should probably be deleted.

Comment thread dev/.DS_Store Outdated
Comment thread dev/diffs/.DS_Store Outdated
Comment thread dev/diffs/4.0.0-diff.patch
Comment thread .DS_Store Outdated
@andygrove
Member

andygrove commented Jun 27, 2025

I'm seeing some resource issues causing Comet test suites to abort:

Warning: [551.091s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (EAGAIN) for attributes: stacksize: 4096k, guardsize: 16k, detached.
Warning: [551.091s][warning][os,thread] Failed to start the native thread for java.lang.Thread "shuffle-exchange-3799"
*** RUN ABORTED *** (9 minutes, 5 seconds)
  java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
  at java.base/java.lang.Thread.start0(Native Method)
  at java.base/java.lang.Thread.start(Thread.java:809)
  at java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:945)
  at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1353)
  at scala.concurrent.impl.ExecutionContextImpl.execute(ExecutionContextImpl.scala:21)
  at java.base/java.util.concurrent.CompletableFuture.asyncSupplyStage(CompletableFuture.java:1782)
  at java.base/java.util.concurrent.CompletableFuture.supplyAsync(CompletableFuture.java:2005)
  at org.apache.spark.sql.execution.SQLExecution$.withThreadLocalCaptured(SQLExecution.scala:329)
  at org.apache.spark.sql.execution.exchange.ShuffleExchangeLike.org$apache$spark$sql$execution$exchange$ShuffleExchangeLike$$triggerFuture(ShuffleExchangeExec.scala:87)

edit: Related issue reported in Spark: https://issues.apache.org/jira/browse/SPARK-47115
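A pthread_create failure with EAGAIN usually indicates an OS limit on threads or memory mappings rather than Java heap exhaustion. A minimal sketch for inspecting the relevant limits on a Linux runner (not Comet-specific; the paths assume Linux):

```shell
# EAGAIN from pthread_create usually means an OS thread/process limit,
# not Java heap exhaustion. Inspect the relevant Linux limits:
ulimit -u                          # max user processes (threads count against this)
cat /proc/sys/kernel/threads-max   # system-wide thread cap
cat /proc/sys/vm/max_map_count     # each thread stack consumes memory mappings
# To raise the soft process limit for the current shell before a test run:
# ulimit -u 8192
```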

@andygrove
Member

The majority of the remaining Spark SQL test failures will be resolved once the 4.0.0 diff is updated to reflect changes already made to the other Spark versions over recent weeks. I will ignore the remaining failing tests and file follow-up issues. Hopefully, this PR will be ready for review tomorrow.

@andygrove
Member

> Found some files that might have been left over from local operations - should probably be deleted.

Thanks @YanivKunda. These issues should be resolved now.

Comment thread dev/diffs/4.0.0.diff
Comment on lines +197 to +204
+-- TODO fix Comet for this query
+-- SELECT lower(listagg(DISTINCT c1 COLLATE utf8_lcase) WITHIN GROUP (ORDER BY c1 COLLATE utf8_lcase)) FROM (VALUES ('a'), ('B'), ('b'), ('A')) AS t(c1);
Member


I filed #1947 to fix this test later

COMET_PARQUET_SCAN_IMPL: ${{ inputs.scan_impl }}
run: |
MAVEN_OPTS="-XX:+UnlockDiagnosticVMOptions -XX:+ShowMessageBoxOnError -XX:+HeapDumpOnOutOfMemoryError -XX:ErrorFile=./hs_err_pid%p.log" SPARK_HOME=`pwd` ./mvnw -B clean install ${{ inputs.maven_opts }}
MAVEN_OPTS="-Xmx4G -Xms2G -XX:+UnlockDiagnosticVMOptions -XX:+ShowMessageBoxOnError -XX:+HeapDumpOnOutOfMemoryError -XX:ErrorFile=./hs_err_pid%p.log" SPARK_HOME=`pwd` ./mvnw -B clean install ${{ inputs.maven_opts }}
Member


This memory change did not help, but also did no harm

Contributor


Didn't help, meaning the test still fails with OOM?

Member


The Comet test suites fail with OOM when running on macOS. I tried this change to specify more memory, but it did not make any difference. The macOS workflow is commented out in this PR, and I filed a follow-up issue: #1949

Comment on lines +60 to +64
# TODO fails with OOM
# https://github.com/apache/datafusion-comet/issues/1949
# - name: "Spark 4.0, JDK 17, Scala 2.13"
# java_version: "17"
# maven_opts: "-Pspark-4.0 -Pscala-2.13"
Member


The tests fail on macOS specifically, but still run on Linux

@andygrove
Member

@kazuyukitanimura @parthchandra @comphead This PR is ready for review now

Contributor

@kazuyukitanimura kazuyukitanimura left a comment


Thanks @huaxingao
For the disabled test, let's put a link to a tracking GitHub issue so that we can easily search for it later.


assertNull(indexReader.readColumnIndex(footer.getBlocks().get(2).getColumns().get(0)));
if (!isSpark40Plus()) {
assertNull(indexReader.readColumnIndex(footer.getBlocks().get(2).getColumns().get(0)));
Contributor


Is this because of ANSI mode?

Contributor Author


In L539, it has

// Creating huge stats so the column index will reach the limit and won't be written

This line

assertNull(indexReader.readColumnIndex(footer.getBlocks().get(2).getColumns().get(0)));

is trying to read the column index metadata for the first column of the third row group, and verify it's null.

I am not sure why this failed for 4.0. My guess is that the column index implementation changed in the newer Parquet version, but I didn't find a corresponding change.

Contributor


Do you mind tracking this in a ticket please?

Contributor


Did we log an issue specifically for this? The ColumnIndex implementation is part of Comet code so if a test is failing we need to fix it in Comet.

Member


The code here has a TODO linking to #1948. I added a comment in this issue referring to this file.

Comment thread dev/diffs/4.0.0.diff Outdated
Comment thread dev/diffs/4.0.0.diff Outdated
Comment thread dev/diffs/4.0.0.diff
Comment on lines +169 to +156
+-- TODO: Disabled due to one of the test failed for Spark4.0
+-- select /*+ COALESCE(1) */ id, a+b, a-b, a*b, a/b from decimals_test order by id
+--SET spark.comet.enabled = false
Contributor


Could you file a github issue and add a TODO link? E.g.
+-- TODO: https://github.com/apache/datafusion-comet/issues/551

@andygrove
Member

The 4.0.0 diff will now need to be updated to reflect the changes made to 4.0.0-preview1 in #1936

I suggest that we don't merge any more PRs that update 4.0.0-preview1.diff

Comment thread dev/diffs/4.0.0.diff Outdated
Comment thread dev/diffs/4.0.0.diff Outdated
Member

@andygrove andygrove left a comment


LGTM. Thanks @huaxingao

Comment thread pom.xml
Comment on lines +640 to +642
<scala.version>2.13.16</scala.version>
<scala.binary.version>2.13</scala.binary.version>
<semanticdb.version>4.9.5</semanticdb.version>
<semanticdb.version>4.13.6</semanticdb.version>
Contributor


Since we already updated the spark-4.0 profile, do we need this change?

Contributor Author


I think this is needed. 4.9.5 is not compatible with Scala 2.13.16.

Comment thread pom.xml
Comment on lines +1081 to +1083
<ignoreClass>com.google.thirdparty.publicsuffix.TrieParser</ignoreClass>
<ignoreClass>com.google.thirdparty.publicsuffix.PublicSuffixPatterns</ignoreClass>
<ignoreClass>com.google.thirdparty.publicsuffix.PublicSuffixType</ignoreClass>
Contributor


Is this a new conflict?

Contributor Author


I ignored these classes to get around the following errors.

[ERROR] Rule 2: org.codehaus.mojo.extraenforcer.dependencies.BanDuplicateClasses failed with message:
[ERROR] Duplicate classes found:
[ERROR] 
[ERROR]   Found in:
[ERROR]     org.apache.spark:spark-network-common_2.13:jar:4.0.0:provided
[ERROR]     com.google.guava:guava:jar:33.2.1-jre:compile
[ERROR]   Duplicate classes:
[ERROR]     com/google/thirdparty/publicsuffix/TrieParser.class
[ERROR]     com/google/thirdparty/publicsuffix/PublicSuffixPatterns.class
[ERROR]     com/google/thirdparty/publicsuffix/PublicSuffixType.class
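The exclusions quoted earlier from pom.xml plug into the BanDuplicateClasses rule roughly as follows (a sketch of the relevant fragment; the surrounding maven-enforcer-plugin and extra-enforcer-rules configuration is elided):

```xml
<banDuplicateClasses>
  <ignoreClasses>
    <!-- Guava's publicsuffix classes are also bundled in spark-network-common -->
    <ignoreClass>com.google.thirdparty.publicsuffix.TrieParser</ignoreClass>
    <ignoreClass>com.google.thirdparty.publicsuffix.PublicSuffixPatterns</ignoreClass>
    <ignoreClass>com.google.thirdparty.publicsuffix.PublicSuffixType</ignoreClass>
  </ignoreClasses>
</banDuplicateClasses>
```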


Comment thread dev/diffs/4.0.0.diff
-SELECT lower(listagg(DISTINCT c1 COLLATE utf8_lcase) WITHIN GROUP (ORDER BY c1 COLLATE utf8_lcase)) FROM (VALUES ('a'), ('B'), ('b'), ('A')) AS t(c1);
+-- TODO https://github.com/apache/datafusion-comet/issues/1947
+-- TODO fix Comet for this query
+-- SELECT lower(listagg(DISTINCT c1 COLLATE utf8_lcase) WITHIN GROUP (ORDER BY c1 COLLATE utf8_lcase)) FROM (VALUES ('a'), ('B'), ('b'), ('A')) AS t(c1);
Contributor


Should we disable Comet instead of commenting the query out? Due to this, the expected result in listagg-collations.sql.out is also removed.

Member


Could we make that change as part of the follow-up issue #1947 that is linked to here (or maybe we can just fix the test instead). The sooner we can get this PR merged, the easier it will be to start addressing the test failures in separate (and smaller) PRs.

I would also like to ship the experimental 4.0.0 support in Comet 0.9.0 so that we can start getting feedback from the community. It seems like all the skipped tests have links to issues now.

Comment thread dev/diffs/4.0.0.diff Outdated

- test("SPARK-17091: Convert IN predicate to Parquet filter push-down") {
+ test("SPARK-17091: Convert IN predicate to Parquet filter push-down",
+ IgnoreComet("IN predicate is not yet supported in Comet, see issue #36")) {
Contributor


For other Spark versions, it looks like we are using IgnoreCometNativeScan("Comet has different push-down behavior")

Comment thread dev/diffs/4.0.0.diff
@@ -3021,7 +3338,6 @@ index ed2e309fa07..a1fb4abe681 100644
+ conf
+ .set("spark.sql.extensions", "org.apache.comet.CometSparkSessionExtensions")
+ .set("spark.comet.enabled", "true")
+ .set("spark.comet.parquet.respectFilterPushdown", "true")
Contributor


Should we keep this? .set("spark.comet.parquet.respectFilterPushdown", "true")

@andygrove
Member

Thanks @huaxingao and @kazuyukitanimura. I will merge this now so that I can backport to branch-0.9 and include it in the 0.9.0 release candidate today.

@andygrove andygrove merged commit 547eb9c into apache:main Jul 1, 2025
94 of 96 checks passed
andygrove pushed a commit to andygrove/datafusion-comet that referenced this pull request Jul 1, 2025
@huaxingao
Contributor Author

Thanks @andygrove @kazuyukitanimura

@huaxingao huaxingao deleted the 4.0shim branch July 1, 2025 15:49
coderfender pushed a commit to coderfender/datafusion-comet that referenced this pull request Dec 13, 2025


Development

Successfully merging this pull request may close these issues.

Update 4.0.0.diff to reflect recent improvements in 3.5.5.diff

7 participants