chore: add assertion that not using comet scan but using native scan #1793

rluvaton wants to merge 2 commits into apache:main from
Conversation
There are a lot of failures. I think I found the error and am pushing a fix; maybe this is the reason for:
```diff
 plan match {
-  case w: CometScanWrapper => containsNativeCometScan(w.originalPlan)
-  case scan: CometScanExec => scan.scanImpl == CometConf.SCAN_NATIVE_COMET
+  case _: CometScanWrapper => true
```
Changed it to return true, as the native scan exec is not wrapped in CometScanWrapper.
And when the scan is not native_datafusion it will not use the DataFusion filter, which also covers the native_iceberg_compat case.
```diff
+  case scan: CometScanExec => scan.scanImpl != CometConf.SCAN_NATIVE_DATAFUSION
-  case _: CometNativeScanExec => false
-  case _ => plan.children.exists(containsNativeCometScan)
+  case _ => plan.children.isEmpty || plan.children.exists(containsNonDataFusionScan)
```
`exists` returns false for empty children, as that means this is the data source, right? And if so it is using the ScanExec of rust and not ScanExec of DataFusion.
> ScanExec of rust and not ScanExec of DataFusion

What does this mean?
We must start from a scan, right? Either `ScanExec` or DataFusion's `DataSourceExec`. So if there are no children, it means we are at the source, for example `ExistingRDD`, and therefore we must be using `ScanExec`, as otherwise it would have been converted to `CometNativeScanExec`.
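Putting the thread together, a minimal sketch of the resulting predicate, assembled from the diff hunks above (the exact signature and surrounding imports are assumed):

```scala
// Sketch assembled from the diff above: returns true when the plan is,
// or contains, a scan that is NOT DataFusion's native scan.
def containsNonDataFusionScan(plan: SparkPlan): Boolean = plan match {
  // A native scan is never wrapped in CometScanWrapper, so a wrapper
  // always indicates a non-DataFusion scan.
  case _: CometScanWrapper => true
  // native_comet and native_iceberg_compat both feed batches from the JVM;
  // only native_datafusion scans natively.
  case scan: CometScanExec =>
    scan.scanImpl != CometConf.SCAN_NATIVE_DATAFUSION
  // A leaf with no children (e.g. ExistingRDD) must be served by the Rust
  // ScanExec, otherwise it would have been converted to CometNativeScanExec.
  case _ =>
    plan.children.isEmpty || plan.children.exists(containsNonDataFusionScan)
}
```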
```rust
assert_eq!(
    scans.len(),
    0,
    "Must not use ScanExec as it may require a copy"
);
```
My understanding is that we only reuse buffers when using native_comet to read the Parquet files. If native_iceberg_compat is used then we will still have a ScanExec reading from the JVM, but it is safe to use DataFusion's FilterExec in this case.
@parthchandra would you agree with this?
That is correct. native_iceberg_compat does not reuse buffers.
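To summarize the distinction in this exchange, a hedged sketch (SCAN_NATIVE_COMET and SCAN_NATIVE_DATAFUSION appear in the diff above; `SCAN_NATIVE_ICEBERG_COMPAT` and the exact semantics are assumptions based on this thread):

```scala
// Per the discussion: only native_comet reuses buffers. native_iceberg_compat
// still scans on the JVM side but hands over fresh buffers, and
// native_datafusion never produces batches on the JVM side at all.
def reusesBuffers(scanImpl: String): Boolean = scanImpl match {
  case CometConf.SCAN_NATIVE_COMET          => true  // JVM scan, buffers reused
  case CometConf.SCAN_NATIVE_ICEBERG_COMPAT => false // JVM scan, fresh buffers
  case CometConf.SCAN_NATIVE_DATAFUSION     => false // native scan, no JVM batches
  case _                                    => false
}
```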
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main    #1793      +/-   ##
============================================
+ Coverage     56.12%   59.46%   +3.33%
- Complexity      976     1140     +164
============================================
  Files           119      128       +9
  Lines         11743    12528     +785
  Branches       2251     2356     +105
============================================
+ Hits           6591     7450     +859
+ Misses         4012     3887     -125
- Partials       1140     1191      +51
```

View full report in Codecov by Sentry.
Thanks @rluvaton but this PR does not appear to help with the CI issue (the tests are still failing - see https://github.com/apache/datafusion-comet/actions/runs/15278587551/job/42973846520?pr=1793). The scans in Comet are pretty confusing. In the native plan we either have DataFusion's `DataSourceExec` or Comet's `ScanExec`, which reads batches from the JVM.
I thought everything that came from the JVM reuses buffers. If that's not the case, then the copy should not always be added when there is a ScanExec, no? (Looking at the wrap-in-CopyExec code.)
That's a good point. We currently add CopyExec for all ScanExecs, regardless of the source. This could be optimized.
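For illustration, a hypothetical sketch of that optimization; this is not Comet's actual planner code, and `wrapInCopyExec` is an invented placeholder for whatever inserts the copy:

```scala
import org.apache.comet.CometConf
import org.apache.spark.sql.execution.SparkPlan

// Hypothetical: only add the defensive copy when the producing scan
// reuses its buffers (per the thread above, only native_comet does).
def maybeWrapInCopy(
    plan: SparkPlan,
    scanImpl: String,
    wrapInCopyExec: SparkPlan => SparkPlan): SparkPlan =
  if (scanImpl == CometConf.SCAN_NATIVE_COMET) wrapInCopyExec(plan)
  else plan // other impls hand over freshly allocated batches
```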
I still think there is a bug here. For this test (when running on main):

```scala
test("debug datafusion native filter") {
  val schema = StructType(
    Seq(
      StructField("row_idx", IntegerType, nullable = false),
      StructField("int", IntegerType, nullable = false)))
  val data = DataGenerator.DEFAULT.generateRows(1000, schema)
  withSQLConf(
    CometConf.COMET_EXPLAIN_VERBOSE_ENABLED.key -> "true",
    CometConf.COMET_EXPLAIN_NATIVE_ENABLED.key -> "true",
    CometConf.COMET_SPARK_TO_ARROW_SUPPORTED_OPERATOR_LIST.key -> "RDDScan") {
    val df = spark
      .createDataFrame(spark.sparkContext.parallelize(data, 1), schema)
      .where(col("row_idx") < 10000 || col("row_idx") > 10010)
    df.explain(true)
    df.show()
  }
}
```

The Spark plan is: [plan screenshot omitted] and the DataFusion plan is: [plan screenshot omitted]

It is using the DataFusion filter and not CometFilter, while it should use the Comet filter as there is reuse, no?
In this example, Spark (not Comet) is performing the scan. Comet is then performing the row-to-columnar conversion. The batches produced by that conversion do not reuse buffers, so it is safe to use DataFusion's FilterExec.
Now I'm really confused: where in the code is there reuse of buffers?
Also, isn't reuse of buffers only part of the reason to do the copy? Like, if the Spark side calls the Arrow release function in the FFI while the native side is still using the buffers?
The key structs are …
Which issue does this PR close?
N/A
Rationale for this change
Just adds an assertion that the DataFusion scan is used when the DataFusion filter is used, making sure the code from #1746 is always correct.
What changes are included in this PR?
Just an assertion.
How are these changes tested?
Existing tests