feat: AQE DPP for native Parquet scans with broadcast reuse #4112
Open
mbutrovich wants to merge 45 commits into apache:main from
Conversation
…t exchangeReuseEnabled and onlyInBroadcast, create aggregate SubqueryExec for case 3.
# Conflicts: # spark/src/main/spark-4.1/org/apache/comet/shims/ShimSubqueryBroadcast.scala
andygrove (Member) reviewed Apr 30, 2026 and left a comment:
LGTM. I will run benchmarks today to confirm no regressions. Thanks @mbutrovich!
Which issue does this PR close?
Partially addresses #3510. Closes #4045. Related PRs: #4011 (non-AQE DPP), #4053 (scalar subquery pushdown + CometReuseSubquery), #4037 (non-AQE DPP edge case tests), #4033 (AQE DPP for Iceberg, draft).
Rationale for this change
Under AQE (the default), Spark creates SubqueryAdaptiveBroadcastExec (SAB) for DPP. Spark's PlanAdaptiveDynamicPruningFilters converts these by finding the BroadcastHashJoinExec in the plan. After Comet replaces that join with CometBroadcastHashJoinExec, Spark's rule can't find a match and replaces the DPP filter with Literal.TrueLiteral, disabling partition pruning. Previously, the isAqeDynamicPruningFilter rejection caused the scan to fall back to Spark entirely, losing native acceleration for all DPP queries under AQE.
What changes are included in this PR?
Spark 3.5+: two-phase SAB conversion
Spark's PlanAdaptiveDynamicPruningFilters runs before custom queryStageOptimizerRules and converts SABs to TrueLiteral. We work around this in two phases (a minimal sketch follows the list):

- CometExecRule (queryStagePreparationRules, before Spark's rule): wraps SABs in CometSubqueryAdaptiveBroadcastExec so Spark's pattern match doesn't recognize them. Wraps all SABs regardless of scan type, so CometPlanAdaptiveDynamicPruningFilters can convert them for both Comet native scans and non-Comet scans (e.g., V2 BatchScan).
- CometPlanAdaptiveDynamicPruningFilters (queryStageOptimizerRule, after Spark's rule): converts following Spark's decision tree:
  - exchangeReuseEnabled + matching broadcast join: CometSubqueryBroadcastExec wired to BroadcastQueryStageExec for broadcast reuse.
  - onlyInBroadcast = true: Literal.TrueLiteral.
  - onlyInBroadcast = false: aggregate SubqueryExec (matching PlanAdaptiveDynamicPruningFilters.scala:68-79).
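A minimal, self-contained sketch of the two-phase flow. SAB and CometSAB below are stand-ins for SubqueryAdaptiveBroadcastExec and CometSubqueryAdaptiveBroadcastExec; the real rules operate on SparkPlan/Expression trees and return plan nodes, not strings.

```scala
object TwoPhaseSabSketch extends App {
  sealed trait Subq
  final case class SAB(onlyInBroadcast: Boolean) extends Subq
  final case class CometSAB(inner: SAB) extends Subq

  // Phase 1 (CometExecRule, queryStagePreparationRules): wrap every SAB so that Spark's
  // PlanAdaptiveDynamicPruningFilters, which runs next and matches only bare SABs, leaves it alone.
  def phase1(s: Subq): Subq = s match {
    case sab: SAB => CometSAB(sab)
    case other    => other
  }

  // Phase 2 (CometPlanAdaptiveDynamicPruningFilters, queryStageOptimizerRules): unwrap and
  // convert following Spark's decision tree.
  def phase2(s: Subq, exchangeReuseEnabled: Boolean, matchingBroadcastFound: Boolean): String =
    s match {
      case CometSAB(_) if exchangeReuseEnabled && matchingBroadcastFound =>
        "CometSubqueryBroadcastExec wired to BroadcastQueryStageExec"
      case CometSAB(sab) if sab.onlyInBroadcast => "Literal.TrueLiteral"
      case CometSAB(_)                          => "aggregate SubqueryExec"
      case other                                => other.toString
    }

  println(phase2(phase1(SAB(onlyInBroadcast = false)),
    exchangeReuseEnabled = true, matchingBroadcastFound = true))
}
```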
Cross-stage broadcast search
Spark's rule is constructed with rootPlan = this (each ASPE's own instance). Custom queryStageOptimizerRules are shared across all ASPEs without a per-ASPE rootPlan. We approximate with two searches; one is over the current plan (the plan arg to apply()), covering same-stage joins and scalar subqueries where the scan and join sit under one exchange. When the broadcast is not yet materialized (cross-stage case), we follow Spark's pattern (lines 44-64): construct a new broadcast exchange, wrap it in a new ASPE, and let AQE's stageCache canonicalization ensure the broadcast runs once (illustrated below).
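A self-contained illustration (stand-in types, not the real AQE code) of why the extra broadcast exchange is safe: the sketch assumes only what the paragraph states, namely that the stage cache is keyed on the canonicalized exchange, so the DPP subquery's newly constructed broadcast and the join's broadcast collapse to one materialized stage.

```scala
object StageCacheDedupSketch extends App {
  import scala.collection.mutable

  // `canonicalized` drops per-instance details, much like Spark canonicalization strips expr IDs.
  final case class ExchangeStandIn(id: Long, buildSide: String) {
    def canonicalized: String = buildSide
  }

  // Stand-in for AQE's stageCache: one materialized stage per canonical form.
  private val stageCache = mutable.Map.empty[String, String]

  def materialize(e: ExchangeStandIn): String =
    stageCache.getOrElseUpdate(e.canonicalized, s"BroadcastQueryStage(${e.buildSide})")

  val joinBroadcast = ExchangeStandIn(id = 1, buildSide = "broadcast(filtered dim)")
  val dppBroadcast  = ExchangeStandIn(id = 2, buildSide = "broadcast(filtered dim)")
  assert(materialize(joinBroadcast) == materialize(dppBroadcast)) // one stage, run once
}
```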
Subquery deduplication via shared cache
Our rule runs after Spark's ReuseAdaptiveSubquery (which can't see our subqueries because they don't exist yet). We register DPP subqueries directly in AdaptiveExecutionContext.subqueryCache, matching ReuseAdaptiveSubquery's behavior for cross-plan reuse (e.g., main query and scalar subquery with identical DPP).
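A hedged sketch of the registration idea, assuming only what the paragraph above states (a shared cache keyed on the canonicalized subquery plan); the map below stands in for AdaptiveExecutionContext.subqueryCache and strings stand in for planned subqueries.

```scala
import scala.collection.concurrent.TrieMap

object SubqueryDedupSketch extends App {
  // Stand-in for the shared cache: canonicalized plan -> planned subquery.
  val subqueryCache = TrieMap.empty[String, String]

  // Register our freshly built DPP subquery, or reuse one that an earlier plan already
  // registered, mirroring what ReuseAdaptiveSubquery does for the subqueries it can see.
  def registerOrReuse(canonicalKey: String, freshSubquery: String): String = {
    val cached = subqueryCache.getOrElseUpdate(canonicalKey, freshSubquery)
    if (cached ne freshSubquery) s"ReusedSubqueryExec($cached)" else freshSubquery
  }

  val first  = registerOrReuse("dpp[dim.id in broadcast]", "CometSubqueryBroadcastExec#1")
  val second = registerOrReuse("dpp[dim.id in broadcast]", "CometSubqueryBroadcastExec#2")
  assert(first == "CometSubqueryBroadcastExec#1" && second == s"ReusedSubqueryExec($first)")
}
```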
Dual-filter resolution
CometNativeScanExec.partitionFilters and CometScanExec.partitionFilters contain separate InSubqueryExec instances. CometExecRule only wraps the outer filters (the inner CometScanExec is @transient). CometPlanAdaptiveDynamicPruningFilters converts both.
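A minimal, self-contained sketch of the dual conversion, using stand-in case classes in place of the real scan nodes (the actual rule rewrites InSubqueryExec expressions, not strings):

```scala
object DualFilterSketch extends App {
  // Stand-ins: the native scan carries its own partition filters and also wraps the original
  // (transient) scan, which carries a separate copy of the same DPP filters.
  final case class WrappedScanStandIn(partitionFilters: Seq[String])
  final case class NativeScanStandIn(partitionFilters: Seq[String], wrapped: WrappedScanStandIn)

  def convertDpp(filter: String): String =
    filter.replace("SubqueryAdaptiveBroadcastExec", "CometSubqueryBroadcastExec")

  // Converting only the outer filters would leave the inner scan's stale SAB behind,
  // so both filter lists are rewritten.
  def convertBoth(scan: NativeScanStandIn): NativeScanStandIn =
    scan.copy(
      partitionFilters = scan.partitionFilters.map(convertDpp),
      wrapped = scan.wrapped.copy(partitionFilters = scan.wrapped.partitionFilters.map(convertDpp)))
}
```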
Spark 3.4: narrow-tagging fallback (CometSpark34AqeDppFallbackRule)
injectQueryStageOptimizerRule is unavailable on 3.4 (SPARK-45785 added it in 3.5), so CometPlanAdaptiveDynamicPruningFilters can't run. Rewriting the SAB at queryStagePrepRule time also doesn't work: AQE rebuilds plan nodes between prep and execution in ways that drop the @transient inner scan needed for the dual-filter update.
Instead, on 3.4 we arrange for Spark's PlanAdaptiveDynamicPruningFilters to succeed on its own by tagging specific nodes to stay Spark-native. The rule only writes skip-tags; it never rewrites expressions or plan structure. Tags are honored by CometScanRule and CometExecRule, and survive AQE per-stage re-entry (same contract as the existing SKIP_COMET_SHUFFLE_TAG from #4010). Four cases (the tag mechanism is sketched after the list):

- Tag the matching BHJ's build-side BroadcastExchangeExec with SKIP_COMET_BROADCAST_TAG. Comet's BHJ conversion then fails its all-Comet-children guard and the BHJ stays Spark; Spark's rule matches via sameResult and creates SubqueryBroadcastExec.
- V1 scans: CometScanRule.transformV1Scan already rejects the V1 fact scan; the cascade keeps the BHJ and its BroadcastExchangeExec Spark-native. No tagging needed. V1 BHJ queries (e.g. TPC-DS Q7) behave exactly as today on 3.4 main, including Comet acceleration on dim scans below the Spark broadcast.
- Broadcast joins disabled (AUTO_BROADCASTJOIN_THRESHOLD = -1): tag peer scans and their shuffles so both self-join branches end up Spark-native with matching canonical forms. Spark's rule replaces the SAB with TrueLiteral; FileSourceScanExec.doCanonicalize strips it, restoring shuffle exchange reuse.
- SubqueryBroadcastExec-bearing scans (AQE re-optimize): on re-optimize, ASPE.preprocessingRules (PlanAdaptiveSubqueries) fills the DPP slot with the already-materialized SubqueryBroadcastExec rather than the original SAB, and the freshly-planned main-BHJ build BroadcastExchangeExec is a new instance with no tag. The rule also scans for SubqueryBroadcastExec (descending into QueryStageExec via AdaptiveSparkPlanHelper), extracts its buildKeys, and tags the matching BHJ's build BE so AQE's stageCache can dedupe it with the DPP subquery's broadcast.
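A hedged sketch of the tag plumbing, assuming only Spark's standard TreeNodeTag API; the tag's string name and value type below are illustrative, and the real definition lives alongside the existing Comet tags.

```scala
import org.apache.spark.sql.catalyst.trees.TreeNodeTag
import org.apache.spark.sql.execution.SparkPlan

object Spark34DppTagSketch {
  // Illustrative definition; the PR's actual tag is defined next to SKIP_COMET_SHUFFLE_TAG.
  val SKIP_COMET_BROADCAST_TAG: TreeNodeTag[Boolean] = TreeNodeTag[Boolean]("skip.comet.broadcast")

  // The 3.4 fallback rule only writes tags on the nodes it wants to keep Spark-native...
  def markStaySpark(plan: SparkPlan): Unit = plan.setTagValue(SKIP_COMET_BROADCAST_TAG, true)

  // ...and the conversion rules consult them before replacing a BroadcastExchangeExec,
  // so the tagged exchange (and therefore the BHJ above it) stays Spark-native.
  def shouldSkip(plan: SparkPlan): Boolean =
    plan.getTagValue(SKIP_COMET_BROADCAST_TAG).contains(true)
}
```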
Registered via a new injectPreSpark35QueryStagePrepRuleShim (3.4 only; no-op on 3.5+). The rule asserts !isSpark35Plus at entry.
Known limitation on 3.4: cross-plan scalar-subquery DPP (the same limitation Spark's own rule has on 3.4). At prep-rule time each ASPE sees only its own plan, so an SAB in a scalar subquery can't see a matching BHJ in the main query. Produces correct results via Spark's rule falling through to TrueLiteral / aggregate SubqueryExec; only broadcast reuse is lost in that edge case.
Broadcast fallback cases (3.5+)

- If the matching join is a Spark-native BroadcastHashJoinExec, creates SubqueryBroadcastExec via shim.
- Otherwise, Literal.TrueLiteral or aggregate SubqueryExec depending on onlyInBroadcast.
- BroadcastQueryStageExec.plan may be ReusedExchangeExec when AQE reuses exchanges across plans. The rule unwraps it to verify the underlying exchange type (see the sketch below).
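A hedged sketch of that unwrapping check, simplified relative to the real rule (the helper name is ours, and the set of exchange types the rule accepts may be broader):

```scala
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.adaptive.BroadcastQueryStageExec
import org.apache.spark.sql.execution.exchange.{BroadcastExchangeExec, ReusedExchangeExec}

object BroadcastStageUnwrapSketch {
  // A BroadcastQueryStageExec built by AQE may hold the exchange directly, or a
  // ReusedExchangeExec pointing at an exchange materialized for another plan.
  def underlyingBroadcastExchange(stage: BroadcastQueryStageExec): Option[SparkPlan] =
    stage.plan match {
      case ReusedExchangeExec(_, e: BroadcastExchangeExec) => Some(e)
      case e: BroadcastExchangeExec                        => Some(e)
      case _                                               => None
    }
}
```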
Other changes

- CometBroadcastExchangeExec: handles non-Comet children (e.g., LocalTableScan after AQE re-optimization of empty broadcasts) by wrapping in CometSparkToColumnarExec.
- CometNativeScanExec.doCanonicalize: strips DPP filters from originalPlan to prevent stale SABs from blocking exchange reuse.
- CometShuffleExchangeExec.doCanonicalize: excludes originalPlan from the canonical form (matches CometBroadcastExchangeExec).
- CometScanUtils.filterUnusedDynamicPruningExpressions: strips unconverted SABs in addition to TrueLiteral, matching Spark's FileSourceScanExec.filterUnusedDynamicPruningExpressions (sketched below).
- ShimPrepareExecutedPlan: new shim for QueryExecution.prepareExecutedPlan (3-arg on 3.x/4.0, 2-arg on 4.1+).
- Existing fallback suites (CometDppFallbackRepro3949Suite, CometShuffleFallbackStickinessSuite) updated to disable native scan to preserve the stageContainsDPPScan stickiness code path.
- Removes the IgnoreComet(#4045) tags from Spark's DynamicPartitionPruningSuite diffs for SPARK-32509 and SPARK-34637. Tests ported to CometExecSuite with version-specific assertions.
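A hedged sketch of the stripping described in the CometScanUtils bullet; this mirrors the shape of Spark's helper but is not the exact Comet implementation.

```scala
import org.apache.spark.sql.catalyst.expressions.{DynamicPruningExpression, Expression, Literal}
import org.apache.spark.sql.execution.{InSubqueryExec, SubqueryAdaptiveBroadcastExec}

object DppFilterStrippingSketch {
  // Drop DPP predicates that can no longer prune anything: the TrueLiteral placeholder
  // (as in Spark's FileSourceScanExec) and, additionally, filters whose subquery is still an
  // unconverted SubqueryAdaptiveBroadcastExec.
  def filterUnusedDynamicPruningExpressions(predicates: Seq[Expression]): Seq[Expression] =
    predicates.filterNot {
      case DynamicPruningExpression(Literal.TrueLiteral) => true
      case DynamicPruningExpression(in: InSubqueryExec) =>
        in.plan.isInstanceOf[SubqueryAdaptiveBroadcastExec]
      case _ => false
    }
}
```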
How are these changes tested?
16 new AQE DPP tests in CometExecSuite covering BHJ / SMJ / empty broadcast / dual filters / exchange reuse / non-atomic types / cross-stage search / scalar subquery deduplication / SPARK-32509 / SPARK-34637 / SPARK-39447. The SPARK-32509 and SPARK-34637 ports are un-gated: SPARK-32509 asserts 1 ReusedExchangeExec on all versions; SPARK-34637 asserts CometSubqueryBroadcastExec on 3.5-4.0 and Spark-native SubqueryBroadcastExec on 3.4 and 4.1+. The V2 BatchScan variant runs on 3.4 with an explicit hasReuse check mirroring Spark's checkPartitionPruningPredicate, exercising case 4 above. Existing non-AQE DPP tests renamed to the consistent "[non-AQE|AQE] DPP: <scenario>" format.
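For illustration, a hedged sketch of the assertion style such tests rely on (simplified; not a verbatim excerpt from CometExecSuite, and the trait and method names here are ours):

```scala
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper
import org.apache.spark.sql.execution.exchange.ReusedExchangeExec

// AdaptiveSparkPlanHelper.collect descends into AQE query stages, so the count reflects
// the final adaptive plan rather than the initially planned one.
trait DppPlanAssertions extends AdaptiveSparkPlanHelper {
  def assertSingleReusedExchange(plan: SparkPlan): Unit = {
    val reused = collect(plan) { case r: ReusedExchangeExec => r }
    assert(reused.size == 1, s"expected exactly one ReusedExchangeExec, found ${reused.size}")
  }
}
```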