Feat: Support Spark 4.0.0 part1 #1830
Conversation
I ran into compilation issues locally when building for Spark 3.4. I think these can be resolved simply by moving the new … Also, some jobs fail with scalastyle errors, which can be fixed by running …
Codecov Report ❌ Patch coverage is …

Additional details and impacted files

@@              Coverage Diff              @@
##               main    #1830      +/-   ##
============================================
+ Coverage     56.12%   58.11%    +1.98%
- Complexity      976     1154      +178
============================================
  Files           119      132       +13
  Lines         11743    13022     +1279
  Branches       2251     2417      +166
============================================
+ Hits           6591     7568      +977
- Misses         4012     4225      +213
- Partials       1140     1229       +89

View full report in Codecov by Sentry.
The Spark version will also need to be updated in …
parthchandra left a comment:
Just some minor comments.
YanivKunda left a comment:
A few dependency alignments are needed as part of the Spark 4.0.0 release.
(force-pushed from db3254f to 381c953)
YanivKunda left a comment:
Found some files that might have been left over from local operations; these should probably be deleted.
I'm seeing some resource issues causing Comet test suites to abort.

Edit: related issue reported in Spark: https://issues.apache.org/jira/browse/SPARK-47115
The majority of the remaining Spark SQL test failures will be resolved once the 4.0.0 diff is updated to reflect changes already made to the other Spark versions over recent weeks. I will ignore the remaining failing tests and file follow-up issues. Hopefully, this PR will be ready for review tomorrow.
Thanks @YanivKunda. These issues should be resolved now.
+-- TODO fix Comet for this query
+-- SELECT lower(listagg(DISTINCT c1 COLLATE utf8_lcase) WITHIN GROUP (ORDER BY c1 COLLATE utf8_lcase)) FROM (VALUES ('a'), ('B'), ('b'), ('A')) AS t(c1);
These issues are now resolved. Thanks.
COMET_PARQUET_SCAN_IMPL: ${{ inputs.scan_impl }}
run: |
MAVEN_OPTS="-XX:+UnlockDiagnosticVMOptions -XX:+ShowMessageBoxOnError -XX:+HeapDumpOnOutOfMemoryError -XX:ErrorFile=./hs_err_pid%p.log" SPARK_HOME=`pwd` ./mvnw -B clean install ${{ inputs.maven_opts }}
MAVEN_OPTS="-Xmx4G -Xms2G -XX:+UnlockDiagnosticVMOptions -XX:+ShowMessageBoxOnError -XX:+HeapDumpOnOutOfMemoryError -XX:ErrorFile=./hs_err_pid%p.log" SPARK_HOME=`pwd` ./mvnw -B clean install ${{ inputs.maven_opts }}
This memory change did not help, but it also did no harm.
When you say it didn't help, do you mean the test fails with OOM?
The Comet test suites fail with OOM when running on macOS. I tried this change to specify more memory, but it did not make any difference. The macOS workflow is commented out in this PR, and I filed a follow-up issue: #1949
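For anyone reproducing this locally, the memory settings from the workflow diff above can be exported outside CI. This is a minimal sketch; the profile flags and use of `./mvnw` are assumptions based on the snippets quoted in this PR, not the exact workflow contents.

```shell
# Sketch of reproducing the CI memory settings locally (profile names assumed).
# -XX:+HeapDumpOnOutOfMemoryError writes a heap dump when the JVM runs out of
# memory, and -XX:ErrorFile captures a crash log per PID for diagnosis.
MAVEN_OPTS="-Xmx4G -Xms2G -XX:+HeapDumpOnOutOfMemoryError -XX:ErrorFile=./hs_err_pid%p.log"
export MAVEN_OPTS
echo "$MAVEN_OPTS"
# The actual build step would then look something like:
#   SPARK_HOME=$(pwd) ./mvnw -B clean install -Pspark-4.0 -Pscala-2.13
```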
# TODO fails with OOM
# https://github.com/apache/datafusion-comet/issues/1949
# - name: "Spark 4.0, JDK 17, Scala 2.13"
#   java_version: "17"
#   maven_opts: "-Pspark-4.0 -Pscala-2.13"
The tests fail on macOS specifically, but still run on Linux.
@kazuyukitanimura @parthchandra @comphead This PR is ready for review now.
kazuyukitanimura left a comment:
Thanks @huaxingao
For the disabled tests, let's put a link to a tracking GitHub issue so that we can easily search later.
assertNull(indexReader.readColumnIndex(footer.getBlocks().get(2).getColumns().get(0)));
if (!isSpark40Plus()) {
  assertNull(indexReader.readColumnIndex(footer.getBlocks().get(2).getColumns().get(0)));
Is this because of ANSI mode?
In L539, it has
// Creating huge stats so the column index will reach the limit and won't be written
This line
assertNull(indexReader.readColumnIndex(footer.getBlocks().get(2).getColumns().get(0)));
tries to read the column index metadata for the first column of the third row group and verifies that it is null.
I am not sure why this failed for 4.0. My guess is that the column index implementation changed in the new Parquet version, but I didn't find the corresponding change for this.
Do you mind tracking this in a ticket please?
Did we log an issue specifically for this? The ColumnIndex implementation is part of Comet code, so if a test is failing we need to fix it in Comet.
The code here has a TODO linking to #1948. I added a comment in that issue referring to this file.
+-- TODO: Disabled due to one of the test failed for Spark4.0
+-- select /*+ COALESCE(1) */ id, a+b, a-b, a*b, a/b from decimals_test order by id
+--SET spark.comet.enabled = false
Could you file a GitHub issue and add a TODO link? E.g.
+-- TODO: https://github.com/apache/datafusion-comet/issues/551
The 4.0.0 diff will now need to be updated to reflect the changes made to 4.0.0-preview1 in #1936. I suggest that we don't merge any more PRs that update …
<scala.version>2.13.16</scala.version>
<scala.binary.version>2.13</scala.binary.version>
<semanticdb.version>4.9.5</semanticdb.version>
<semanticdb.version>4.13.6</semanticdb.version>
Since we already updated the spark-4.0 profile, do we need this change?
I think this is needed. 4.9.5 is not compatible with Scala 2.13.16.
<ignoreClass>com.google.thirdparty.publicsuffix.TrieParser</ignoreClass>
<ignoreClass>com.google.thirdparty.publicsuffix.PublicSuffixPatterns</ignoreClass>
<ignoreClass>com.google.thirdparty.publicsuffix.PublicSuffixType</ignoreClass>
Is this a new conflict?
I ignored these classes to get around the following errors:
[ERROR] Rule 2: org.codehaus.mojo.extraenforcer.dependencies.BanDuplicateClasses failed with message:
[ERROR] Duplicate classes found:
[ERROR]
[ERROR] Found in:
[ERROR] org.apache.spark:spark-network-common_2.13:jar:4.0.0:provided
[ERROR] com.google.guava:guava:jar:33.2.1-jre:compile
[ERROR] Duplicate classes:
[ERROR] com/google/thirdparty/publicsuffix/TrieParser.class
[ERROR] com/google/thirdparty/publicsuffix/PublicSuffixPatterns.class
[ERROR] com/google/thirdparty/publicsuffix/PublicSuffixType.class
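The workaround above corresponds to an `ignoreClasses` entry in the Extra Enforcer Rules `banDuplicateClasses` rule. As a hedged sketch only, a pom.xml fragment for this might look like the following; the plugin wiring and the `extra-enforcer-rules` version are assumptions for illustration, not copied from this PR:

```xml
<!-- Sketch only: banDuplicateClasses with the publicsuffix classes ignored,
     since spark-network-common and guava both ship copies of them. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-enforcer-plugin</artifactId>
  <dependencies>
    <dependency>
      <groupId>org.codehaus.mojo</groupId>
      <artifactId>extra-enforcer-rules</artifactId>
      <version>1.8.0</version> <!-- assumed version -->
    </dependency>
  </dependencies>
  <executions>
    <execution>
      <id>enforce-no-duplicate-classes</id>
      <goals><goal>enforce</goal></goals>
      <configuration>
        <rules>
          <banDuplicateClasses>
            <ignoreClasses>
              <ignoreClass>com.google.thirdparty.publicsuffix.TrieParser</ignoreClass>
              <ignoreClass>com.google.thirdparty.publicsuffix.PublicSuffixPatterns</ignoreClass>
              <ignoreClass>com.google.thirdparty.publicsuffix.PublicSuffixType</ignoreClass>
            </ignoreClasses>
            <findAllDuplicates>true</findAllDuplicates>
          </banDuplicateClasses>
        </rules>
      </configuration>
    </execution>
  </executions>
</plugin>
```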
-SELECT lower(listagg(DISTINCT c1 COLLATE utf8_lcase) WITHIN GROUP (ORDER BY c1 COLLATE utf8_lcase)) FROM (VALUES ('a'), ('B'), ('b'), ('A')) AS t(c1);
+-- TODO https://github.com/apache/datafusion-comet/issues/1947
+-- TODO fix Comet for this query
+-- SELECT lower(listagg(DISTINCT c1 COLLATE utf8_lcase) WITHIN GROUP (ORDER BY c1 COLLATE utf8_lcase)) FROM (VALUES ('a'), ('B'), ('b'), ('A')) AS t(c1);
Should we disable Comet instead of commenting out? Due to this, the result in listagg-collations.sql.out is also removed.
Could we make that change as part of the follow-up issue #1947 that is linked here (or maybe we can just fix the test instead)? The sooner we can get this PR merged, the easier it will be to start addressing the test failures in separate (and smaller) PRs.
I would also like to ship the experimental 4.0.0 support in Comet 0.9.0 so that we can start getting feedback from the community. It seems like all the skipped tests have links to issues now.
- test("SPARK-17091: Convert IN predicate to Parquet filter push-down") {
+ test("SPARK-17091: Convert IN predicate to Parquet filter push-down",
+   IgnoreComet("IN predicate is not yet supported in Comet, see issue #36")) {
For other Spark versions, it looks like we are using IgnoreCometNativeScan("Comet has different push-down behavior").
@@ -3021,7 +3338,6 @@ index ed2e309fa07..a1fb4abe681 100644
+ conf
+   .set("spark.sql.extensions", "org.apache.comet.CometSparkSessionExtensions")
+   .set("spark.comet.enabled", "true")
+   .set("spark.comet.parquet.respectFilterPushdown", "true")
Should we keep this? .set("spark.comet.parquet.respectFilterPushdown", "true")
Thanks @huaxingao and @kazuyukitanimura. I will merge this now so that I can backport to branch-0.9 and include it in the 0.9.0 release candidate today.
Thanks @andygrove @kazuyukitanimura |
Which issue does this PR close?
Part of #1637
Closes #1846
Rationale for this change
Adding shim files for Spark 4.0.0 support
What changes are included in this PR?
Follow-up issues:
- MapSort expression in Spark 4.0.0 #1941

How are these changes tested?