# Native DataFusion Scan Test Analysis (Spark 3.5.7)
## Overview

This analysis covers tests that were previously ignored for the `native_datafusion` scan mode via `IgnoreCometNativeScan` or `IgnoreCometNativeDataFusion` tags in the Spark 3.5.7 diff. Each test was run with `spark.comet.scan.impl=native_datafusion` to determine whether the ignore directive is still necessary.
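For reference, the scan implementation under test is selected per session. A minimal PySpark sketch (assuming a Spark build with the Comet plugin on the classpath; treat the exact plugin wiring as illustrative):

```python
from pyspark.sql import SparkSession

# Illustrative sketch: enable Comet and select the scan mode under test.
spark = (
    SparkSession.builder
    .config("spark.plugins", "org.apache.spark.CometPlugin")
    .config("spark.comet.enabled", "true")
    .config("spark.comet.scan.impl", "native_datafusion")  # scan mode under test
    .getOrCreate()
)
```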
## Summary

- Total tests with ignore directives removed: 8 (across 3 test files)
- Tests now passing: 3 (`ParquetEncryptionSuite`)
- Tests still failing: 5 (`ParquetV1FilterSuite`: 4, `ParquetV1QuerySuite`: 1)
- Diff updated: yes; `IgnoreCometNativeScan` was removed from the 3 passing encryption tests
## Tests Now Passing (Ignore Removed)

### ParquetEncryptionSuite (sql/hive)

All three encryption tests now pass with `native_datafusion`:
| Test | Previous Ignore Reason |
| --- | --- |
| SPARK-34990: Write and read an encrypted parquet | no encryption support yet |
| SPARK-37117: Can't read files in Parquet encryption external key material mode | no encryption support yet |
| SPARK-42114: Test of uniform parquet encryption | no encryption support yet |
These tests verify that Spark can write and read encrypted Parquet files. The native
DataFusion scan now handles encrypted Parquet correctly, so the ignore directives were
removed from the diff.
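The kind of round trip these tests exercise can be sketched in PySpark. The property names below come from Parquet Modular Encryption and `InMemoryKMS` is the test-only mock KMS; this is an illustrative config fragment (assuming an active `spark` session), not the suites' actual code:

```python
# Hedged sketch of an encrypted Parquet round trip.
spark.conf.set(
    "spark.hadoop.parquet.crypto.factory.class",
    "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
spark.conf.set(
    "spark.hadoop.parquet.encryption.kms.client.class",
    "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
spark.conf.set(
    "spark.hadoop.parquet.encryption.key.list",
    "key1:AAECAwQFBgcICQoLDA0ODw==, key2:AAECAAECAAECAAECAAECAA==")

df = spark.range(10).withColumnRenamed("id", "value")
(df.write.mode("overwrite")
   .option("parquet.encryption.column.keys", "key1:value")
   .option("parquet.encryption.footer.key", "key2")
   .parquet("/tmp/encrypted_parquet"))

# With native_datafusion enabled, this scan now reads the encrypted file.
spark.read.parquet("/tmp/encrypted_parquet").count()
```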
## Tests Still Failing (Ignore Retained)

### ParquetV1FilterSuite (sql/core)

All four tests fail only in the V1 source path (`ParquetV1FilterSuite`). The corresponding V2 tests (`ParquetV2FilterSuite`) pass because V2 sources don't use Comet's native scan.
#### 1. Filters should be pushed down for vectorized Parquet reader at row group level

- Ignore reason: Native scans do not support the tested accumulator
- Failure type: `TestFailedException` (assertion failure)
- Details: The test checks that Parquet filter pushdown works at the row group level by examining a custom accumulator that counts row groups. The native DataFusion scan does not support Spark's accumulator mechanism for tracking pushed-down filter statistics, so the assertion on the accumulator value fails.
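The row-group-level check can be illustrated with a pure-Python analogy (hypothetical, not Comet or Spark code): a counter plays the role of the accumulator the test asserts on, and min/max statistics prune groups before they are read.

```python
# Hypothetical analogy of row-group pruning; the counter stands in for
# Spark's pushed-down-filter accumulator that the test inspects.
row_groups = [
    {"min": 0,  "max": 9,  "rows": [0, 3, 7]},
    {"min": 10, "max": 19, "rows": [12, 15]},
    {"min": 20, "max": 29, "rows": [21, 28]},
]

groups_read = 0  # plays the role of the accumulator

def scan_greater_equal(lo):
    """Read only row groups whose [min, max] range can contain a match."""
    global groups_read
    out = []
    for rg in row_groups:
        if rg["max"] < lo:       # statistics prove no row in this group matches
            continue             # group skipped: counter untouched
        groups_read += 1
        out.extend(r for r in rg["rows"] if r >= lo)
    return out

result = scan_greater_equal(15)  # first group is pruned by its max statistic
```

A test like the one above asserts on `groups_read`; if the scan implementation never updates the accumulator, the assertion fails even when the query result is correct.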
#### 2. filter pushdown - StringPredicate

- Failure type: `TestFailedException` (assertion failure)
- Details: Tests that `StartsWith`, `EndsWith`, and `Contains` string predicates are pushed down into the Parquet reader. The native DataFusion scan does not push these string predicates down in the same way Spark's built-in reader does, causing the assertions on pushed filter counts to fail.
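To see why engines can legitimately differ here, consider how a `StartsWith` predicate can prune a row group from min/max string statistics alone (a hypothetical sketch, not either engine's code):

```python
# Hypothetical sketch: a group can be skipped for StartsWith(prefix) when
# min/max statistics prove no contained string can begin with the prefix.
def can_skip_startswith(prefix, g_min, g_max):
    # Every string starting with `prefix` sorts >= prefix, and its first
    # len(prefix) characters equal the prefix. So the group is prunable if
    # all its values sort before the prefix, or strictly after it.
    return g_max < prefix or g_min[:len(prefix)] > prefix

can_skip_startswith("b", "ca", "cz")  # all values sort after any "b..." string
can_skip_startswith("b", "aa", "bz")  # range may contain a "b..." string
```

`Contains` and `EndsWith` admit no such simple statistics-based rule, which is one reason different readers expose different pushed-filter counts for string predicates.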
#### 3. SPARK-17091: Convert IN predicate to Parquet filter push-down

- Ignore reason: Comet has different push-down behavior
- Failure type: `CometRuntimeException: CometNativeExec should not be executed directly without a serialized plan`
- Details: The test constructs a DataFrame with specific filters and directly executes it in a way that triggers `CometNativeScan` without going through the proper native execution plan serialization. This is a fundamental incompatibility with how the native DataFusion scan handles standalone execution outside of a full native plan.
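The failure mode reduces to an operator that only runs as part of a serialized native plan. A hypothetical Python analogy (not Comet source code) of that contract:

```python
# Hypothetical analogy: an operator that refuses standalone execution.
class NativeExec:
    def __init__(self, serialized_plan=None):
        # In the real system the serialized plan is produced by the query
        # planner; tests that execute the scan directly never attach one.
        self.serialized_plan = serialized_plan

    def execute(self):
        if self.serialized_plan is None:
            raise RuntimeError(
                "CometNativeExec should not be executed directly "
                "without a serialized plan")
        return f"running plan of {len(self.serialized_plan)} bytes"
```

Executing `NativeExec()` directly raises, mirroring the `CometRuntimeException` these tests hit, while `NativeExec(b"...")` (a plan attached by the planner) runs normally.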
#### 4. SPARK-34562: Bloom filter push down

- Ignore reason: Native scans do not support the tested accumulator
- Failure type: `TestFailedException` (assertion failure)
- Details: Similar to test #1, this test relies on a custom accumulator to verify that bloom filter push-down is working. The native DataFusion scan does not integrate with Spark's accumulator framework for this purpose.
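For context, a Bloom filter answers "definitely not present" or "maybe present"; a negative answer lets a reader skip a whole row group without decoding it. A minimal illustrative sketch (not Parquet's actual implementation):

```python
# Minimal Bloom filter sketch: k hash positions set in an m-bit field.
class BloomSketch:
    def __init__(self, m=64, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        # Toy hashing scheme for illustration only.
        return [hash((i, item)) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        # False => item is definitely absent, so the row group can be skipped.
        # True  => item may be present (false positives are possible).
        return all(self.bits >> p & 1 for p in self._positions(item))

bf = BloomSketch()
for v in ("a", "b", "c"):
    bf.add(v)
```

The Spark test verifies this skipping happened by reading an accumulator; as in test #1, the native scan never updates it.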
### ParquetV1QuerySuite (sql/core)
#### 5. SPARK-26677: negated null-safe equality comparison should not filter matched row groups

- Ignore reason: Native scans had the filter pushed into DF operator, cannot strip
- Failure type: `CometRuntimeException: CometNativeExec should not be executed directly without a serialized plan`
- Details: The test verifies that a negated null-safe equality filter (`NOT (value <=> 'A')`) does not incorrectly filter out row groups. With the native DataFusion scan, the filter gets pushed into the DataFusion operator rather than being handled at the Spark level. When the test tries to execute the scan directly, it hits the same serialization issue as SPARK-17091 above.
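The semantics being tested can be shown in plain Python (an illustrative model of Spark's `<=>` operator, not Spark code): null-safe equality treats two NULLs as equal, so its negation must keep NULL rows rather than dropping them the way `!=` would.

```python
# Model of Spark's null-safe equality operator <=> (None stands in for NULL).
def null_safe_eq(a, b):
    if a is None and b is None:
        return True          # NULL <=> NULL is true
    if a is None or b is None:
        return False         # NULL <=> x is false
    return a == b

rows = ["A", None, "B"]
# NOT (value <=> 'A') must keep the NULL row as well as "B".
kept = [v for v in rows if not null_safe_eq(v, "A")]
```

A row group containing only NULLs therefore matches `NOT (value <=> 'A')` and must not be pruned; the test checks exactly that.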
## Root Causes

The 5 still-failing tests fall into two categories:

1. Accumulator incompatibility (tests #1, #2, #4): The native DataFusion scan bypasses Spark's internal accumulator mechanism used to track filter pushdown statistics. Tests that assert on these accumulator values will fail.
2. Direct execution without serialized plan (tests #3, #5): The native DataFusion scan requires execution through a serialized native plan. When tests construct and execute scans directly (outside of the normal query planning flow), they hit a `CometRuntimeException` because `CometNativeScan` cannot be executed standalone.
## Note on V2 Tests

The `ParquetV2FilterSuite` and `ParquetV2QuerySuite` variants of these tests all pass because they use `USE_V1_SOURCE_LIST = ""`, which means Spark uses V2 data sources instead of V1. Comet's native scan only intercepts V1 Parquet sources, so V2 tests effectively run without Comet's native scan and pass trivially.
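The V1/V2 switch corresponds to a public Spark conf (the suites set it via the internal `SQLConf` constant; shown here as a config fragment, assuming an active `spark` session):

```python
# Empty list: no sources are forced onto the V1 path, so Parquet reads go
# through the DataSource V2 code path, which Comet's native scan does not
# intercept.
spark.conf.set("spark.sql.sources.useV1SourceList", "")
```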