Merged

Changes from all commits (30 commits)
ac42642 Feat: Support Spark 4.0.0 part1 (huaxingao, Jun 2, 2025)
f2b76f4 remove unnecessary shim (huaxingao, Jun 3, 2025)
ada6a24 address comments (huaxingao, Jun 5, 2025)
9db8fda fix (huaxingao, Jun 5, 2025)
e94f7b9 fix (huaxingao, Jun 5, 2025)
6011512 update spark version in spark_sql_test_ansi.yml (huaxingao, Jun 5, 2025)
d7eff03 update diff (huaxingao, Jun 5, 2025)
4365733 fix (huaxingao, Jun 7, 2025)
ba43e24 fix (huaxingao, Jun 7, 2025)
600d415 address comments (huaxingao, Jun 9, 2025)
ef058e4 Expected column index is not null for spark4 (huaxingao, Jun 16, 2025)
695b193 update diff to disable a couple of sql tests (huaxingao, Jun 16, 2025)
a2c1f3a disable columnarShuffleOnMapTest for spark4.0 (huaxingao, Jun 17, 2025)
6fa18cd fix style (huaxingao, Jun 17, 2025)
ea77900 skip some tests due to unsupported MapSort expression (andygrove, Jun 26, 2025)
9ac9f1b skip another test (andygrove, Jun 27, 2025)
1285408 Remove .DS_Store (andygrove, Jun 27, 2025)
2c5c755 specify mvn memory (andygrove, Jun 27, 2025)
b537642 update expected plans (andygrove, Jun 27, 2025)
0d73622 diff (andygrove, Jun 27, 2025)
ee9fe2c diff (andygrove, Jun 27, 2025)
a97680d Scalastyle (andygrove, Jun 27, 2025)
d3cf777 skip macOS PR build tests due to OOM (andygrove, Jun 27, 2025)
b1bbbc7 fix (andygrove, Jun 27, 2025)
b03b1d9 update 4.0.0.diff (huaxingao, Jun 30, 2025)
072e439 use 11 digits hash (huaxingao, Jun 30, 2025)
5860482 remove println from diff file (huaxingao, Jun 30, 2025)
d431fe6 add .set(spark.comet.parquet.respectFilterPushdown, true) in diff (huaxingao, Jun 30, 2025)
ffa68a9 update diff (huaxingao, Jun 30, 2025)
e8a2f40 fix (huaxingao, Jun 30, 2025)
4 changes: 2 additions & 2 deletions .github/actions/java-test/action.yaml
@@ -68,7 +68,7 @@ runs:
env:
COMET_PARQUET_SCAN_IMPL: ${{ inputs.scan_impl }}
run: |
-          MAVEN_OPTS="-XX:+UnlockDiagnosticVMOptions -XX:+ShowMessageBoxOnError -XX:+HeapDumpOnOutOfMemoryError -XX:ErrorFile=./hs_err_pid%p.log" SPARK_HOME=`pwd` ./mvnw -B clean install ${{ inputs.maven_opts }}
+          MAVEN_OPTS="-Xmx4G -Xms2G -XX:+UnlockDiagnosticVMOptions -XX:+ShowMessageBoxOnError -XX:+HeapDumpOnOutOfMemoryError -XX:ErrorFile=./hs_err_pid%p.log" SPARK_HOME=`pwd` ./mvnw -B clean install ${{ inputs.maven_opts }}
Member: This memory change did not help, but also did no harm.
Contributor: Didn't help, do you mean the test fails on OOM?
Member: The Comet test suites fail with OOM when running on macOS. I tried this change to specify more memory, but it did not make any difference. The macOS workflow is commented out in this PR, and I filed a follow-up issue #1949.
- name: Run specified tests
shell: bash
if: ${{ inputs.suites != '' }}
@@ -77,7 +77,7 @@ runs:
run: |
MAVEN_SUITES="$(echo "${{ inputs.suites }}" | paste -sd, -)"
echo "Running with MAVEN_SUITES=$MAVEN_SUITES"
-          MAVEN_OPTS="-DwildcardSuites=$MAVEN_SUITES -XX:+UnlockDiagnosticVMOptions -XX:+ShowMessageBoxOnError -XX:+HeapDumpOnOutOfMemoryError -XX:ErrorFile=./hs_err_pid%p.log" SPARK_HOME=`pwd` ./mvnw -B clean install ${{ inputs.maven_opts }}
+          MAVEN_OPTS="-Xmx4G -Xms2G -DwildcardSuites=$MAVEN_SUITES -XX:+UnlockDiagnosticVMOptions -XX:+ShowMessageBoxOnError -XX:+HeapDumpOnOutOfMemoryError -XX:ErrorFile=./hs_err_pid%p.log" SPARK_HOME=`pwd` ./mvnw -B clean install ${{ inputs.maven_opts }}
- name: Upload crash logs
if: failure()
uses: actions/upload-artifact@v4
3 changes: 3 additions & 0 deletions .github/workflows/pr_build_linux.yml
@@ -149,6 +149,9 @@ jobs:
runs-on: ${{ matrix.os }}
container:
image: amd64/rust
+    env:
+      JAVA_TOOL_OPTIONS: ${{ matrix.profile.java_version == '17' && '--add-exports=java.base/sun.nio.ch=ALL-UNNAMED --add-exports=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.lang=ALL-UNNAMED' || '' }}

steps:
- uses: actions/checkout@v4
- name: Setup Rust & Java toolchain
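The `--add-exports`/`--add-opens` options injected via JAVA_TOOL_OPTIONS relax JDK 17's strong module encapsulation, which Spark relies on for reflective access to `java.base` internals. A standalone illustration (not Spark or Comet code) of the failure mode these flags avoid:

```java
import java.lang.reflect.Field;
import java.lang.reflect.InaccessibleObjectException;

public class StrongEncapsulationDemo {
    public static void main(String[] args) throws Exception {
        try {
            // Reflectively reach into an internal field of a java.base class.
            Field value = String.class.getDeclaredField("value");
            // On JDK 16+ this throws unless the JVM was started with
            // --add-opens java.base/java.lang=ALL-UNNAMED
            value.setAccessible(true);
            System.out.println("opened");
        } catch (InaccessibleObjectException e) {
            System.out.println("blocked");
        }
    }
}
```

Running this with and without `--add-opens java.base/java.lang=ALL-UNNAMED` shows the two outcomes; Spark hits the same wall in several places, hence the list of flags in the workflow.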
8 changes: 5 additions & 3 deletions .github/workflows/pr_build_macos.yml
@@ -57,9 +57,11 @@ jobs:
java_version: "17"
maven_opts: "-Pspark-3.5 -Pscala-2.13"

-        - name: "Spark 4.0, JDK 17, Scala 2.13"
-          java_version: "17"
-          maven_opts: "-Pspark-4.0 -Pscala-2.13"
+        # TODO fails with OOM
+        # https://github.com/apache/datafusion-comet/issues/1949
+        # - name: "Spark 4.0, JDK 17, Scala 2.13"
+        #   java_version: "17"
+        #   maven_opts: "-Pspark-4.0 -Pscala-2.13"
Member (on lines +60 to +64): The tests fail on macOS specifically, but still run on Linux.

suite:
- name: "fuzz"
2 changes: 1 addition & 1 deletion .github/workflows/spark_sql_test_ansi.yml
@@ -43,7 +43,7 @@ jobs:
matrix:
os: [ubuntu-24.04]
java-version: [17]
-        spark-version: [{short: '4.0', full: '4.0.0-preview1'}]
+        spark-version: [{short: '4.0', full: '4.0.0'}]
module:
- {name: "catalyst", args1: "catalyst/test", args2: ""}
- {name: "sql/core-1", args1: "", args2: sql/testOnly * -- -l org.apache.spark.tags.ExtendedSQLTest -l org.apache.spark.tags.SlowSQLTest}
12 changes: 10 additions & 2 deletions common/src/main/java/org/apache/comet/parquet/TypeUtil.java
@@ -74,7 +74,8 @@ public static ColumnDescriptor convertToParquet(StructField field) {
builder = Types.primitive(PrimitiveType.PrimitiveTypeName.INT64, repetition);
} else if (type == DataTypes.BinaryType) {
builder = Types.primitive(PrimitiveType.PrimitiveTypeName.BINARY, repetition);
-    } else if (type == DataTypes.StringType) {
+    } else if (type == DataTypes.StringType
+        || (type.sameType(DataTypes.StringType) && isSpark40Plus())) {
(parthchandra marked this conversation as resolved.)
builder =
Types.primitive(PrimitiveType.PrimitiveTypeName.BINARY, repetition)
.as(LogicalTypeAnnotation.stringType());
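The new `sameType` branch is needed because `StringType` values seen in Spark 4.0 are not always the same object as `DataTypes.StringType` (Spark 4.0 made `StringType` collation-aware, so structurally equal types can be distinct instances), and `sameType` compares structure rather than identity. A toy sketch of the distinction, using a hypothetical class rather than Spark's real `DataType`:

```java
// Hypothetical stand-in for Spark's StringType, not actual Spark classes.
public final class ToyStringType {
    static final ToyStringType DEFAULT = new ToyStringType(0);

    final int collationId;

    ToyStringType(int collationId) { this.collationId = collationId; }

    // Structural comparison, loosely mirroring DataType.sameType.
    boolean sameType(ToyStringType other) {
        return other != null && other.collationId == this.collationId;
    }

    public static void main(String[] args) {
        // A freshly constructed default string type: same structure,
        // different object identity.
        ToyStringType fresh = new ToyStringType(0);
        System.out.println(fresh == ToyStringType.DEFAULT);        // false
        System.out.println(fresh.sameType(ToyStringType.DEFAULT)); // true
    }
}
```

A reference-equality check (`==`) alone would silently reject such instances, which is why the guard falls back to `sameType` only on Spark 4.0+.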
@@ -199,6 +200,13 @@ && isUnsignedIntTypeMatched(logicalTypeAnnotation, 64)) {
|| canReadAsBinaryDecimal(descriptor, sparkType)) {
return;
}

+        if (sparkType.sameType(DataTypes.StringType) && isSpark40Plus()) {
+          LogicalTypeAnnotation lta = descriptor.getPrimitiveType().getLogicalTypeAnnotation();
+          if (lta instanceof LogicalTypeAnnotation.StringLogicalTypeAnnotation) {
+            return;
+          }
+        }
break;
case FIXED_LEN_BYTE_ARRAY:
if (canReadAsIntDecimal(descriptor, sparkType)
@@ -314,7 +322,7 @@ private static boolean isUnsignedIntTypeMatched(
&& ((IntLogicalTypeAnnotation) logicalTypeAnnotation).getBitWidth() == bitWidth;
}

-  private static boolean isSpark40Plus() {
+  static boolean isSpark40Plus() {
return package$.MODULE$.SPARK_VERSION().compareTo("4.0") >= 0;
}
}
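`isSpark40Plus()` gates on a plain lexicographic string comparison against `"4.0"`. A standalone sketch of the check, with the version passed in instead of read from Spark's `SPARK_VERSION`:

```java
public class SparkVersionGate {
    // Mirrors the lexicographic check in TypeUtil.isSpark40Plus().
    static boolean isSpark40Plus(String sparkVersion) {
        return sparkVersion.compareTo("4.0") >= 0;
    }

    public static void main(String[] args) {
        System.out.println(isSpark40Plus("3.5.6"));          // false
        System.out.println(isSpark40Plus("4.0.0"));          // true
        System.out.println(isSpark40Plus("4.0.0-preview1")); // true
    }
}
```

Lexicographic comparison is adequate for the versions in play, though it would misorder a hypothetical "10.0" against "4.0"; a numeric version parse would be needed if Spark major versions ever reach two digits.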
@@ -25,5 +25,5 @@ import org.apache.spark.util.AccumulatorV2
object ShimTaskMetrics {

def getTaskAccumulator(taskMetrics: TaskMetrics): Option[AccumulatorV2[_, _]] =
-    taskMetrics.externalAccums.lastOption
+    taskMetrics._externalAccums.lastOption
}
@@ -74,6 +74,8 @@
import static org.junit.Assert.*;
import static org.junit.Assert.assertEquals;

+import static org.apache.comet.parquet.TypeUtil.isSpark40Plus;

@SuppressWarnings("deprecation")
public class TestFileReader {
private static final MessageType SCHEMA =
@@ -609,7 +611,9 @@ public void testColumnIndexReadWrite() throws Exception {
assertEquals(1, offsetIndex.getFirstRowIndex(1));
assertEquals(3, offsetIndex.getFirstRowIndex(2));

-    assertNull(indexReader.readColumnIndex(footer.getBlocks().get(2).getColumns().get(0)));
+    if (!isSpark40Plus()) { // TODO: https://github.com/apache/datafusion-comet/issues/1948
+      assertNull(indexReader.readColumnIndex(footer.getBlocks().get(2).getColumns().get(0)));
Contributor: Is this because of ANSI mode?

Contributor (Author): In L539, it has

// Creating huge stats so the column index will reach the limit and won't be written

This line

assertNull(indexReader.readColumnIndex(footer.getBlocks().get(2).getColumns().get(0)));

is trying to read the column index metadata for the first column of the third row group and verify that it's null. I am not sure why this failed for 4.0. My guess is that the column index implementation changed in the new Parquet version, but I didn't find the corresponding change.

Contributor: Do you mind tracking this in a ticket please?

Contributor: Did we log an issue specifically for this? The ColumnIndex implementation is part of Comet code, so if a test is failing we need to fix it in Comet.

Member: The code here has a TODO linking to #1948. I added a comment in that issue referring to this file.

+    }
}
}
