feat!(datafusion): enable parallel file scanning with eager task bucketing#2298
toutane wants to merge 16 commits into apache:main
Conversation
timsaucer
left a comment
I'm no expert on Iceberg, but I've worked a lot on DataFusion, particularly on table providers. I recently wrote a blog post on the DataFusion site (published after you first put this PR up), in case it's in any way useful: https://datafusion.apache.org/blog/2026/03/31/writing-table-providers/
Overall I think the approach here is definitely reasonable. My comments are mostly around opportunities to squeeze out a little more performance based on having done something similar at my work.
```rust
    self: Arc<Self>,
    _children: Vec<Arc<dyn ExecutionPlan>>,
) -> DFResult<Arc<dyn ExecutionPlan>> {
    Ok(self)
```
Since this doesn't support children, I'd recommend returning an error if `_children` is not empty. Not a blocker for merge.
Yes, you're right thanks! Pushed a fix that returns a DataFusionError::Internal, matching the pattern used in IcebergCommitExec::with_new_children.
Side note: IcebergTableScan::with_new_children has the same issue. This could be the subject of another PR.
```rust
    &self,
    filters: &[&Expr],
) -> DFResult<Vec<TableProviderFilterPushDown>> {
    Ok(vec![TableProviderFilterPushDown::Inexact; filters.len()])
```
Can we do better than this? If we have a partitioned scan and the filter is on the partition columns, I would expect to be able to get an exact pushdown. That would entirely remove a filter operation for the cases where it matches, and I think that's a big win and a common use case I've seen in other work.
Yes, you're right there's something to do here, I agree.
I'd prefer to tackle this in a follow-up PR. Doing it correctly requires a per-filter conversion API (currently convert_filters_to_predicate collapses everything into a single combined predicate and silently drops non-convertible filters), and in a partition-spec-aware check only Identity-transformed partition columns can be safely marked Exact; bucket, truncate, year/month/etc. are lossy and must stay Inexact to avoid incorrect results.
Happy to open a tracking issue. However, if you think it's simple enough, I can go ahead and make the changes directly in the PR.
```rust
    .map_err(to_datafusion_error)?
    .try_collect::<Vec<_>>()
    .await
    .map_err(to_datafusion_error)?;
```
It looks like the number of output partitions will be the number of files, right? I'm wondering if there's an opportunity to do better than that. We're specifying that the output partitioning in the exec is unknown, but don't we have information about the partitioning we could utilize?
By better I mean could we be more performant if we were to go ahead and get the target partitions from the session and output in those number of partitions already with hashing?
Thanks for raising this, please push back if any of the below is off.
For context, the long-term direction for this is tracked in the EPIC #1604 (row-group-based parallel scan with a GroupPruner that can split/merge FileScanTask below the file grain). What I was hoping to land with this PR is a more immediate, scoped optimization that stays within the current file-grain contract, so we don't preempt the design choices in #1604. The file-grouping step you're pointing at is essentially what #2220 describes as the intermediate improvement on the path toward #1604.
If you think it's appropriate, I'd be happy to pick up a short-term follow-up along these lines:
- Switch `IcebergPartitionedScan` from `tasks: Vec<FileScanTask>` to `file_groups: Vec<Vec<FileScanTask>>`, following the convention used by DataFusion's own `FileScanConfig`: each group = one DataFusion partition that streams its files sequentially through `ArrowReaderBuilder::read`.
- In `IcebergPartitionedTableProvider::scan`, read `state.config().target_partitions()` and group tasks into `min(n_files, target_partitions)` buckets.
- When `n_files < target_partitions`, parallelism is still capped at `n_files`. I think that's inherent to the file grain, but let me know if I'm missing something.
I'm happy to open the follow-up issue/PR myself, or defer to you if you'd rather frame it, whatever works best.
I suppose I'd need to understand those conversations. I think I mentioned this in one of the other comments on this PR, but I found the whole discussion difficult to track. Maybe I can find some time this weekend to look through that size-based partitioning they mention.
I wrote this PR targeting your branch. Let me know what you think!
The one issue I have is that I do not personally have access to any iceberg catalogs that I could use for benchmarking. My ability to test it is very limited right now.
Hey Tim, thanks a lot for the proposal. It is really clean and smart.
I created an issue for the redundant FilterExec you were mentioning (#2363), so it's nice that you've addressed it here.
For the benchmark, we can do it in our infra by shadowing real traffic (our ultimate goal is to distribute execution across multiple workers, based on the output partitioning). It will not be a standard benchmark, but at least it will show whether things are improving on real-world queries.
What do you think, then, of merging this new provider/scan with the current one so that we only maintain one path, as you suggested? If I understand correctly, the current path is reachable by setting target_partitions to 1.
Last thing: I'll try to support partitioning based on Iceberg's bucket transform. The tricky part is that DataFusion and Iceberg don't use the same hash function, making the bucket hash incompatible with RepartitionExec.
Personally, I strongly believe you should update the existing table provider instead of creating a new one. I think it's just more work in the long run to keep two nearly identical bits of code.
I don't think you'll be able to use iceberg bucket transforms for the datafusion hashing output.
Thanks for the PR, @toutane! One thing I noticed:
More broadly, is adding a second path really the best answer? It seems like it will increase your maintenance load. Is there any reason not to have a single path, with the fallback being a partitioned scan of N=1? I am going to spend a little more time trying to understand the issues. It's difficult because some of them are marked as unplanned or stale, and some of the links do not have good descriptions. I suppose I'll need to look at the Java source to get a better idea of what the long-term goal is.
Hey Tim, I think you're absolutely right about consolidating everything into a single provider. The only reason I kept separate paths was to avoid introducing breaking changes. I am going to explore a design where the partitioned file scan becomes the default behavior, with the current provider's logic as a fallback, as you suggested. On a related note, it could be worth thinking about the next step: exposing
I understand a desire not to introduce breaking changes. Is the concern that the API is changing, or do you have implementation concerns? If it's just the API change, then good upgrade documentation is often sufficient, especially since it looks like the change would be fairly straightforward for a downstream consumer. Please correct me if that's not right. If it's a concern about the implementation, then I think the real solution is to make sure there's robust testing, both in the repo and against some real-life workloads, to verify performance at different scales and partitioning structures.

With respect to the question about output partitioning, I think any time you can do that, you should. Any time we can give more information about these kinds of things, we're going to see performance gains, and sometimes significant gains.
…itionedScan for parallel file scanning
Co-authored-by: Tim Saucer <timsaucer@gmail.com>
…:with_new_children
…identity-hash partitioning

Replace the one-task-per-partition layout in IcebergPartitionedScan with N buckets sized from the session's target_partitions. When the table's default spec exposes identity-transform columns and every task carries the corresponding partition values, tasks are bucketed by hashing those values via DataFusion's REPARTITION_RANDOM_STATE, so the resulting partitioning matches what RepartitionExec would produce. The scan then declares Partitioning::Hash(exprs, N), letting downstream joins and aggregates skip an extra repartition.

The Hash declaration is conservative and only stands when:

- the table has a single partition spec (no spec evolution)
- every identity source column is present in the output projection
- every column type is supported by literal_to_array
- every task supplied a full identity key

Any miss collapses to UnknownPartitioning(N), while bucketing falls back to a hash of data_file_path so partitions still distribute.

IcebergPartitionedScan now stores Vec<Vec<FileScanTask>>, and execute(i) streams every task in buckets[i] through to_arrow_with_tasks. The bucket count is capped at min(target_partitions, num_files), and an empty table still yields zero partitions to avoid out-of-bounds execute calls.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`IcebergPartitionedTableProvider::supports_filters_pushdown` previously
returned `Inexact` for every filter, forcing DataFusion to re-evaluate
even filters that Iceberg's manifest-level pruning has fully resolved.
Per-filter the provider now returns `Exact` when both:
- the iceberg conversion can represent the filter, so manifest pruning
will remove every row that fails it, and
- every leaf is a comparison or null check against an identity-
partition column with a literal RHS.
Identity-partitioned column names are cached at `try_new` from the
table's default spec; tables with spec evolution (>1 historical specs)
fall back to an empty set so all filters stay `Inexact`. Supported
shapes: =, !=, <, <=, >, >=, IS NULL, IS NOT NULL, IN/NOT IN, plus
AND/OR/NOT compositions of the above. Every other shape is `Inexact`.
`convert_filter_to_predicate` is promoted to `pub(crate)` so the
provider can probe convertibility per filter without rebuilding the
whole AND-collapsed predicate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…column intersection

Previously identity_partition_col_names returned an empty set whenever the table had more than one historical partition spec, forcing every filter back to Inexact under spec evolution. This was overly conservative: Iceberg evaluates partition predicates against each manifest's own spec, so a column that is identity-partitioned in every spec is fully prunable across the entire table, regardless of which spec a given file was written under.

Replace the multi-spec gate with an intersection across every spec's identity-source set. A column survives only if every spec includes it with Transform::Identity; columns that appear with non-identity transforms in some spec, or are missing from a spec entirely, are dropped. The result remains an honest set of columns for which Exact pushdown is provably safe across all surviving files.

Hash bucketing (compute_identity_cols) keeps its single-spec gate because slot-order alignment with the table's default spec depends on each task carrying its own spec id, which the native plan flow does not yet do.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ergTableProvider

IcebergPartitionedTableProvider and IcebergPartitionedScan were introduced to enable parallel file scanning by bucketing FileScanTasks across DataFusion partitions. However, maintaining two TableProvider implementations is redundant: the new provider is strictly more capable, and its degenerate case (target_partitions=1) reproduces the old single-partition behavior exactly. This commit folds the partitioned provider into IcebergTableProvider and the partitioned scan into IcebergTableScan, eliminating the parallel types.

Changes:

- IcebergTableProvider::scan() now eagerly calls plan_files() and distributes FileScanTasks into buckets using the same identity-hash strategy (REPARTITION_RANDOM_STATE + create_hashes) that was in IcebergPartitionedTableProvider, enabling Partitioning::Hash declarations that align with DataFusion's RepartitionExec.
- IcebergTableScan gains a new_with_tasks() constructor that accepts pre-planned buckets and a caller-supplied Partitioning. execute(i) streams the tasks in buckets[i] via TableScan::to_arrow_with_tasks, rebuilding the TableScan per partition to avoid serializing PlanContext Arc-shared caches across workers.
- The original new() constructor and the to_arrow() lazy path are kept unchanged for IcebergStaticTableProvider, which does not pre-plan tasks.
- Limit slicing (try_filter_map truncation) from the old IcebergTableScan is preserved in both execution paths.
- Bucketing helpers (IdentityCol, compute_identity_cols, bucket_tasks, identity_hash, fallback_hash, literal_to_array, is_supported_dtype) are moved verbatim into a new private table/bucketing.rs module.
- Unit tests from partitioned.rs are migrated to table/mod.rs and updated to use IcebergTableProvider and IcebergTableScan.
- integration_datafusion_test.rs: fix test_provider_plan_stream_schema to call execute(0) instead of execute(1). The old call worked only because the previous IcebergTableScan silently ignored the partition index.
(cherry picked from commit d2e5e04)
Review pass over the partitioned-scan branch ahead of upstream contribution.

- Rename `TableScan::to_arrow_with_tasks` to `to_arrow_from_tasks` — `from` better signals that the tasks are the input source rather than a builder-style modifier.
- Restructure the doc with a `# Correctness` section that calls out the projection/filter contract while clarifying that reader-side configuration (concurrency, batch size, row-group filtering, row selection) is taken from `self`.
- Make `IcebergTableScan::new` and `new_with_tasks` `pub` (were `pub(crate)`) so external users can construct the node directly, matching the public visibility of the struct itself.
- Drop the `convert_filters_to_predicate` re-export from `physical_plan/mod.rs`: it was unused outside the module.
- Extract a private `new_inner` constructor on `IcebergTableScan` so `new` and `new_with_tasks` share a single source of truth for the `PlanProperties` / projection / predicate setup.
- Split `IcebergTableScan::execute` into a linear pipeline backed by three helpers: `build_table_scan` (synchronous scan-builder plumbing), `build_record_batch_stream` (async stream construction for the lazy/eager modes), and `apply_limit`.
- Trim the `IcebergTableScan` struct doc and field comments to match the rest of the file's style; drop the verbose `to_arrow_with_tasks` rationale (the `# Correctness` doc carries the load-bearing info).
- Tighten `DisplayAs::fmt_as`: remove the file-path enumeration (the file count alone is enough for `EXPLAIN`) and factor out the common prefix.
- Trim several narrating comments in `table/mod.rs` and the module doc that duplicated information already evident from the code.
- Add `test_identity_partitioned_declares_hash`: verifies the happy path where an identity-partitioned table with the partition column in the projection produces `Partitioning::Hash` referencing that column. This was the main missing coverage for the bucketing logic.
- Add `test_projection_without_partition_col_falls_back_to_unknown`: verifies the `compute_identity_cols → None` branch when the projection omits the partition source column.
- Add helpers (`make_partitioned_catalog_and_table_for_bucketing`, `append_partitioned_fake_data_files`) to build identity-partitioned fixtures without writing real Parquet files.

(cherry picked from commit b1f2d66)
IcebergTableProvider::scan now plans files eagerly and buckets them across DataFusion partitions before returning the ExecutionPlan. As a result, IcebergTableScan's DisplayAs output always includes `buckets:[N] file_count:[M]` - even for unpartitioned tables where N = 1. Update the four .slt files whose EXPLAIN snapshots were missing this suffix, and fix the like_predicate_pushdown snapshots that also had a stale input_partitions count on RepartitionExec (the table now has multiple files across multiple buckets). (cherry picked from commit 6ae4a71)
Which issue does this PR close?
What changes are included in this PR?
Approach
Rather than introducing new types (`IcebergPartitionedScan`, `IcebergPartitionedTableProvider` as originally proposed), this PR extends the existing `IcebergTableProvider`/`IcebergTableScan` with an eager mode where file scan tasks are planned at `scan()` time and distributed into buckets, one bucket per DataFusion partition.

The main motivation is to let DataFusion schedule file reads concurrently. Previously all files streamed through a single partition (`UnknownPartitioning(1)`); now `IcebergTableProvider::scan` distributes tasks across `min(target_partitions, n_files)` partitions and declares `Partitioning::Hash` when the data is identity-partitioned.

Key changes
- `TableScan::to_arrow_from_tasks` - New public method on `TableScan` that accepts a pre-collected `FileScanTaskStream` instead of calling `plan_files()` internally. This is the hook used by `IcebergTableScan::execute(i)` to replay each bucket through the Arrow reader while preserving all reader-side configuration (concurrency limit, row-group filtering, batch size). Tasks must come from a `TableScan` with the same projection and filters as `self` - predicates are baked into each task at planning time and are not re-applied by the reader. The doc comment makes this contract explicit.
- `IcebergTableScan` is now `pub` - Previously `pub(crate)`. Made public so that downstream integrations that need to inspect or wrap the physical plan can do so without going through the table provider.
- `with_new_children` now returns an error - `IcebergTableScan` is a leaf node and does not support children. Previously the implementation silently dropped any children passed to it; it now returns `DataFusionError::Internal` when `children` is non-empty, matching the contract of `IcebergCommitExec`.
- Eager task planning in `IcebergTableProvider::scan` - `plan_files()` is now called at planning time (inside `TableProvider::scan`) rather than at execution time (inside `ExecutionPlan::execute`). The collected tasks are distributed into `min(target_partitions, n_files)` buckets by `bucketing::bucket_tasks` and stored in the scan. Each `execute(i)` call then fetches its pre-assigned bucket and streams it through `to_arrow_from_tasks` - no redundant metadata reads per partition.
- `bucketing` module - Handles bucket assignment and the `Partitioning` declaration. For tables with a single partition spec using only identity transforms, tasks are hashed on their partition values using DataFusion's `create_hashes` + `REPARTITION_RANDOM_STATE`, and the scan declares `Partitioning::Hash`. This lets DataFusion recognize that the output is already hash-partitioned and skip a downstream `RepartitionExec`. Non-identity transforms (`bucket`, `truncate`, `year`/`month`/`day`/`hour`) are lossy: the partition value in task metadata does not match what DataFusion would compute by hashing the actual column values, so those cases fall back to `UnknownPartitioning`. Any task that cannot be fully hashed with the identity key (unsupported literal type, null partition value) also falls back.

Credit: This bucketing solution was proposed by @timsaucer.
Design choices - planning at `scan()` time vs. at `execute()` time

Planning eagerly at `scan()` time is a deliberate trade-off: `execute(i)` is pure I/O with no catalog round-trips, but `TableProvider::scan` now does network I/O (catalog + metadata reads), which is unusual for a planning-phase method. An alternative design - planning lazily at execute time - would keep `scan()` cheap but would require one `plan_files()` call per partition (redundant). A future extension could expose this as an option for use cases where snapshot staleness matters more than plan reproducibility.

Known limitations
- Limited type support for `Partitioning::Hash` - `literal_to_array` supports seven primitive Arrow types (`Bool`, `Int32`, `Int64`, `Float32`, `Float64`, `Utf8`, `Date32`). Timestamps, `Decimal128`, `LargeUtf8`, etc. are not yet covered; any unsupported type forces a fallback to `UnknownPartitioning`.
- Spec evolution disables `Partitioning::Hash` - If the table has more than one historical partition spec, the bucketing module conservatively returns `UnknownPartitioning` to avoid mismatches between old and new partition tuple layouts.
Unit tests in
table/mod.rscovering the new bucketed scan path:test_empty_table_single_empty_bucket- Empty table produces one empty bucket, guarding against out-of-bounds panic onexecute(0).test_unpartitioned_falls_back_to_unknown- Unpartitioned table declaresUnknownPartitioning.test_bucket_count_capped_at_file_count- Whentarget_partitions > n_files, bucket count is capped atn_files.test_single_target_partition_single_bucket-target_partitions=1produces a single bucket regardless of file count, reproducing the original single-threaded behavior.test_identity_partitioned_declares_hash- Identity-partitioned table declaresPartitioning::Hashreferencing the partition column.test_projection_without_partition_col_falls_back_to_unknown- Projecting away an identity column falls back toUnknownPartitioning.Additional tests are added for
IcebergTableProviderto cover limit pushdown, insert behavior, and schema consistency, ensuring the refactor introduces no regressions on existing functionality.SQL logic tests -
EXPLAINsnapshots are updated to reflect the newbuckets:[N] file_count:[M]display format and the correctinput_partitionscounts.Production validation - We plan to test these changes in our infrastructure by shadowing real-world queries.
Follow-up work
`FilterExec` - @timsaucer reports that `supports_filters_pushdown` returns `Inexact` for all filters, causing DataFusion to insert a `FilterExec` above `IcebergTableScan` even though the Arrow reader already applies the predicate via `ArrowPredicateFn`. Returning `Exact` for losslessly converted filters would eliminate this redundant re-evaluation. He proposed a solution in earlier commits, but those changes have been reverted as out of scope for this PR. This is tracked in #2363 ("IcebergTableProvider::supports_filters_pushdown marks every filter as Inexact, causing a redundant FilterExec above IcebergTableScan").
Note
`IcebergStaticTableProvider` is unchanged - it still uses `IcebergTableScan::new` (lazy, single-partition). Static snapshots do not benefit from eager planning because the task list is fixed by construction.