
Feat: Support Spark 4.0.0 part1 #1830

Merged
andygrove merged 30 commits into apache:main from huaxingao:4.0shim
Jul 1, 2025

Conversation

@huaxingao
Contributor

@huaxingao huaxingao commented Jun 2, 2025

Which issue does this PR close?

Part of #1637
Closes #1846

Rationale for this change

Adding shim files for Spark 4.0.0 support

What changes are included in this PR?

  • Remove 4.0.0-preview1.diff
  • Add 4.0.0.diff
  • Update/skip some tests

Follow-up issues:

How are these changes tested?

@andygrove
Member

I ran into compilation issues locally when building for Spark 3.4. I think these can be resolved simply by moving the new spark-3.5 shims into the spark-3.x shim folder instead.

Also, some jobs fail with scalastyle errors, which can be fixed by running make format.

Comment thread common/src/main/java/org/apache/comet/parquet/TypeUtil.java Outdated
@codecov-commenter

codecov-commenter commented Jun 5, 2025

Codecov Report

❌ Patch coverage is 13.33333% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.11%. Comparing base (f09f8af) to head (e8a2f40).
⚠️ Report is 392 commits behind head on main.

Files with missing lines Patch % Lines
...c/main/java/org/apache/comet/parquet/TypeUtil.java 0.00% 5 Missing and 2 partials ⚠️
...ffle/comet/CometBoundedShuffleMemoryAllocator.java 0.00% 6 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1830      +/-   ##
============================================
+ Coverage     56.12%   58.11%   +1.98%     
- Complexity      976     1154     +178     
============================================
  Files           119      132      +13     
  Lines         11743    13022    +1279     
  Branches       2251     2417     +166     
============================================
+ Hits           6591     7568     +977     
- Misses         4012     4225     +213     
- Partials       1140     1229      +89     

☔ View full report in Codecov by Sentry.


@andygrove
Member

andygrove commented Jun 5, 2025

The Spark version will also need to be updated in .github/workflows/spark_sql_test_ansi.yml. It currently has:

spark-version: [{short: '4.0', full: '4.0.0-preview1'}]
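The matrix entry would presumably change to reference the final release artifact, along these lines (a sketch; the exact version string is an assumption based on this PR's title):

```yaml
# .github/workflows/spark_sql_test_ansi.yml (sketch; assumed final version string)
spark-version: [{short: '4.0', full: '4.0.0'}]
```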

Contributor

@parthchandra parthchandra left a comment


Just some minor comments.

Comment thread common/src/main/java/org/apache/comet/parquet/TypeUtil.java
Comment thread spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala
Contributor

@YanivKunda YanivKunda left a comment


A few dependency alignments are needed as part of the Spark 4.0.0 release.

Comment thread pom.xml Outdated
Comment thread pom.xml Outdated
Comment thread pom.xml Outdated
@huaxingao huaxingao force-pushed the 4.0shim branch 2 times, most recently from db3254f to 381c953 Compare June 9, 2025 06:18
Contributor

@YanivKunda YanivKunda left a comment


Found some files that might have been left over from local operations -
should probably be deleted.

Comment thread dev/.DS_Store Outdated
Comment thread dev/diffs/.DS_Store Outdated
Comment thread dev/diffs/4.0.0-diff.patch
Comment thread .DS_Store Outdated
@andygrove
Member

andygrove commented Jun 27, 2025

I'm seeing some resource issues causing Comet test suites to abort:

Warning: [551.091s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (EAGAIN) for attributes: stacksize: 4096k, guardsize: 16k, detached.
Warning: [551.091s][warning][os,thread] Failed to start the native thread for java.lang.Thread "shuffle-exchange-3799"
*** RUN ABORTED *** (9 minutes, 5 seconds)
  java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
  at java.base/java.lang.Thread.start0(Native Method)
  at java.base/java.lang.Thread.start(Thread.java:809)
  at java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:945)
  at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1353)
  at scala.concurrent.impl.ExecutionContextImpl.execute(ExecutionContextImpl.scala:21)
  at java.base/java.util.concurrent.CompletableFuture.asyncSupplyStage(CompletableFuture.java:1782)
  at java.base/java.util.concurrent.CompletableFuture.supplyAsync(CompletableFuture.java:2005)
  at org.apache.spark.sql.execution.SQLExecution$.withThreadLocalCaptured(SQLExecution.scala:329)
  at org.apache.spark.sql.execution.exchange.ShuffleExchangeLike.org$apache$spark$sql$execution$exchange$ShuffleExchangeLike$$triggerFuture(ShuffleExchangeExec.scala:87)

edit: Related issue reported in Spark: https://issues.apache.org/jira/browse/SPARK-47115
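A pthread_create failure with EAGAIN usually indicates an OS limit on threads or memory mappings rather than Java heap exhaustion. A minimal sketch for inspecting the relevant limits on a Linux runner (not Comet-specific; the paths assume Linux):

```shell
# EAGAIN from pthread_create usually means an OS thread/process limit,
# not Java heap exhaustion. Inspect the relevant Linux limits:
ulimit -u                          # max user processes (threads count against this)
cat /proc/sys/kernel/threads-max   # system-wide thread cap
cat /proc/sys/vm/max_map_count     # each thread stack consumes memory mappings
# To raise the soft process limit for the current shell before a test run:
# ulimit -u 8192
```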

@andygrove
Member

The majority of the remaining Spark SQL test failures will be resolved once the 4.0.0 diff is updated to reflect changes already made to the other Spark versions over recent weeks. I will ignore the remaining failing tests and file follow-up issues. Hopefully, this PR will be ready for review tomorrow.

@andygrove
Member

> Found some files that might have been left over from local operations - should probably be deleted.

Thanks @YanivKunda. These issues should be resolved now.

Comment thread dev/diffs/4.0.0.diff
Comment on lines +197 to +204
+-- TODO fix Comet for this query
+-- SELECT lower(listagg(DISTINCT c1 COLLATE utf8_lcase) WITHIN GROUP (ORDER BY c1 COLLATE utf8_lcase)) FROM (VALUES ('a'), ('B'), ('b'), ('A')) AS t(c1);
Member


I filed #1947 to fix this test later

COMET_PARQUET_SCAN_IMPL: ${{ inputs.scan_impl }}
run: |
MAVEN_OPTS="-XX:+UnlockDiagnosticVMOptions -XX:+ShowMessageBoxOnError -XX:+HeapDumpOnOutOfMemoryError -XX:ErrorFile=./hs_err_pid%p.log" SPARK_HOME=`pwd` ./mvnw -B clean install ${{ inputs.maven_opts }}
MAVEN_OPTS="-Xmx4G -Xms2G -XX:+UnlockDiagnosticVMOptions -XX:+ShowMessageBoxOnError -XX:+HeapDumpOnOutOfMemoryError -XX:ErrorFile=./hs_err_pid%p.log" SPARK_HOME=`pwd` ./mvnw -B clean install ${{ inputs.maven_opts }}
Member


This memory change did not help, but also did no harm

Contributor


Didn't help, meaning the test still fails with OOM?

Member


The Comet test suites fail with OOM when running on macOS. I tried this change to specify more memory, but it did not make any difference. The macOS workflow is commented out in this PR, and I filed a follow-up issue: #1949

Comment on lines +60 to +64
# TODO fails with OOM
# https://github.com/apache/datafusion-comet/issues/1949
# - name: "Spark 4.0, JDK 17, Scala 2.13"
# java_version: "17"
# maven_opts: "-Pspark-4.0 -Pscala-2.13"
Member


The tests fail on macOS specifically, but still run on Linux

@andygrove
Member

@kazuyukitanimura @parthchandra @comphead This PR is ready for review now

Contributor

@kazuyukitanimura kazuyukitanimura left a comment


Thanks @huaxingao
For the disabled test, let's put a link to a tracking GitHub issue so that we can easily search for it later.


assertNull(indexReader.readColumnIndex(footer.getBlocks().get(2).getColumns().get(0)));
if (!isSpark40Plus()) {
assertNull(indexReader.readColumnIndex(footer.getBlocks().get(2).getColumns().get(0)));
Contributor


Is this because of ANSI mode?

Contributor Author


In L539, it has

// Creating huge stats so the column index will reach the limit and won't be written

This line

assertNull(indexReader.readColumnIndex(footer.getBlocks().get(2).getColumns().get(0)));

is trying to read the column index metadata for the first column of the third row group, and verify it's null.

I am not sure why this failed for 4.0. My guess is that the column index implementation changed in the newer Parquet version, but I didn't find a corresponding change.

Contributor


Do you mind tracking this in a ticket please?

Contributor


Did we log an issue specifically for this? The ColumnIndex implementation is part of Comet code so if a test is failing we need to fix it in Comet.

Member


The code here has a TODO linking to #1948. I added a comment in this issue referring to this file.

Comment thread dev/diffs/4.0.0.diff Outdated
Comment thread dev/diffs/4.0.0.diff Outdated
Comment thread dev/diffs/4.0.0.diff
Comment on lines +169 to +156
+-- TODO: Disabled due to one of the test failed for Spark4.0
+-- select /*+ COALESCE(1) */ id, a+b, a-b, a*b, a/b from decimals_test order by id
+--SET spark.comet.enabled = false
Contributor


Could you file a github issue and add a TODO link? E.g.
+-- TODO: https://github.com/apache/datafusion-comet/issues/551

@andygrove
Member

The 4.0.0 diff will now need to be updated to reflect the changes made to 4.0.0-preview1 in #1936

I suggest that we don't merge any more PRs that update 4.0.0-preview1.diff

Comment thread dev/diffs/4.0.0.diff Outdated
Comment thread dev/diffs/4.0.0.diff Outdated
Member

@andygrove andygrove left a comment


LGTM. Thanks @huaxingao

Comment thread pom.xml
Comment on lines +640 to +642
<scala.version>2.13.16</scala.version>
<scala.binary.version>2.13</scala.binary.version>
<semanticdb.version>4.9.5</semanticdb.version>
<semanticdb.version>4.13.6</semanticdb.version>
Contributor


Since we already updated the spark-4.0 profile, do we need this change?

Contributor Author


I think this is needed. 4.9.5 is not compatible with Scala 2.13.16.

Comment thread pom.xml
Comment on lines +1081 to +1083
<ignoreClass>com.google.thirdparty.publicsuffix.TrieParser</ignoreClass>
<ignoreClass>com.google.thirdparty.publicsuffix.PublicSuffixPatterns</ignoreClass>
<ignoreClass>com.google.thirdparty.publicsuffix.PublicSuffixType</ignoreClass>
Contributor


Is this a new conflict?

Contributor Author


I ignored these classes to get around the following errors.

[ERROR] Rule 2: org.codehaus.mojo.extraenforcer.dependencies.BanDuplicateClasses failed with message:
[ERROR] Duplicate classes found:
[ERROR] 
[ERROR]   Found in:
[ERROR]     org.apache.spark:spark-network-common_2.13:jar:4.0.0:provided
[ERROR]     com.google.guava:guava:jar:33.2.1-jre:compile
[ERROR]   Duplicate classes:
[ERROR]     com/google/thirdparty/publicsuffix/TrieParser.class
[ERROR]     com/google/thirdparty/publicsuffix/PublicSuffixPatterns.class
[ERROR]     com/google/thirdparty/publicsuffix/PublicSuffixType.class
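The exclusions quoted earlier from pom.xml plug into the BanDuplicateClasses rule roughly as follows (a sketch of the relevant fragment; the surrounding maven-enforcer-plugin and extra-enforcer-rules configuration is elided):

```xml
<banDuplicateClasses>
  <ignoreClasses>
    <!-- Guava's publicsuffix classes are also bundled in spark-network-common -->
    <ignoreClass>com.google.thirdparty.publicsuffix.TrieParser</ignoreClass>
    <ignoreClass>com.google.thirdparty.publicsuffix.PublicSuffixPatterns</ignoreClass>
    <ignoreClass>com.google.thirdparty.publicsuffix.PublicSuffixType</ignoreClass>
  </ignoreClasses>
</banDuplicateClasses>
```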


Comment thread dev/diffs/4.0.0.diff
-SELECT lower(listagg(DISTINCT c1 COLLATE utf8_lcase) WITHIN GROUP (ORDER BY c1 COLLATE utf8_lcase)) FROM (VALUES ('a'), ('B'), ('b'), ('A')) AS t(c1);
+-- TODO https://github.com/apache/datafusion-comet/issues/1947
+-- TODO fix Comet for this query
+-- SELECT lower(listagg(DISTINCT c1 COLLATE utf8_lcase) WITHIN GROUP (ORDER BY c1 COLLATE utf8_lcase)) FROM (VALUES ('a'), ('B'), ('b'), ('A')) AS t(c1);
Contributor


Should we disable Comet instead of commenting the query out? Due to this, the expected result in listagg-collations.sql.out is also removed.

Member


Could we make that change as part of the follow-up issue #1947 that is linked to here (or maybe we can just fix the test instead). The sooner we can get this PR merged, the easier it will be to start addressing the test failures in separate (and smaller) PRs.

I would also like to ship the experimental 4.0.0 support in Comet 0.9.0 so that we can start getting feedback from the community. It seems like all the skipped tests have links to issues now.

Comment thread dev/diffs/4.0.0.diff Outdated

- test("SPARK-17091: Convert IN predicate to Parquet filter push-down") {
+ test("SPARK-17091: Convert IN predicate to Parquet filter push-down",
+ IgnoreComet("IN predicate is not yet supported in Comet, see issue #36")) {
Contributor


For other Spark versions, it looks like we are using IgnoreCometNativeScan("Comet has different push-down behavior")

Comment thread dev/diffs/4.0.0.diff
@@ -3021,7 +3338,6 @@ index ed2e309fa07..a1fb4abe681 100644
+ conf
+ .set("spark.sql.extensions", "org.apache.comet.CometSparkSessionExtensions")
+ .set("spark.comet.enabled", "true")
+ .set("spark.comet.parquet.respectFilterPushdown", "true")
Contributor


Should we keep this? .set("spark.comet.parquet.respectFilterPushdown", "true")

@andygrove
Member

Thanks @huaxingao and @kazuyukitanimura. I will merge this now so that I can backport to branch-0.9 and include it in the 0.9.0 release candidate today.

@andygrove andygrove merged commit 547eb9c into apache:main Jul 1, 2025
94 of 96 checks passed
andygrove pushed a commit to andygrove/datafusion-comet that referenced this pull request Jul 1, 2025
@huaxingao
Contributor Author

Thanks @andygrove @kazuyukitanimura

@huaxingao huaxingao deleted the 4.0shim branch July 1, 2025 15:49
coderfender pushed a commit to coderfender/datafusion-comet that referenced this pull request Dec 13, 2025


Development

Successfully merging this pull request may close these issues.

Update 4.0.0.diff to reflect recent improvements in 3.5.5.diff

7 participants