
feat!(datafusion): enable parallel file scanning with eager task bucketing#2298

Open
toutane wants to merge 16 commits into apache:main from toutane:draft/partitioned-file-scanning-contribution

Conversation

@toutane
Contributor

@toutane toutane commented Mar 31, 2026

Which issue does this PR close?

What changes are included in this PR?

Approach

Rather than introducing new types (IcebergPartitionedScan and IcebergPartitionedTableProvider, as originally proposed), this PR extends the existing IcebergTableProvider / IcebergTableScan with an eager mode in which file scan tasks are planned at scan() time and distributed into buckets, one bucket per DataFusion partition.

The main motivation is to let DataFusion schedule file reads concurrently. Previously all files streamed through a single partition (UnknownPartitioning(1)); now IcebergTableProvider::scan distributes tasks across min(target_partitions, n_files) partitions, and declares Partitioning::Hash when the data is identity-partitioned.

Key changes

  • TableScan::to_arrow_from_tasks - New public method on TableScan that accepts a pre-collected FileScanTaskStream instead of calling plan_files() internally. This is the hook used by IcebergTableScan::execute(i) to replay each bucket through the Arrow reader while preserving all reader-side configuration (concurrency limit, row-group filtering, batch size). Tasks must come from a TableScan with the same projection and filters as self - predicates are baked into each task at planning time and are not re-applied by the reader. The doc comment makes this contract explicit. (A sketch of this replay path follows the design-choices section below.)

  • IcebergTableScan is now pub - Previously pub(crate). Made public so that downstream integrations that need to inspect or wrap the physical plan can do so without going through the table provider.

  • with_new_children now returns an error - IcebergTableScan is a leaf node and does not support children. Previously the implementation silently dropped any children passed to it; it now returns DataFusionError::Internal when children is non-empty, matching the contract of IcebergCommitExec.

  • Eager task planning in IcebergTableProvider::scan - plan_files() is now called at planning time (inside TableProvider::scan) rather than at execution time (inside ExecutionPlan::execute). The collected tasks are distributed into min(target_partitions, n_files) buckets by bucketing::bucket_tasks and stored in the scan. Each execute(i) call then fetches its pre-assigned bucket and streams it through to_arrow_from_tasks - no redundant metadata reads per partition.

  • bucketing module - Handles bucket assignment and Partitioning declaration. For tables with a single partition spec using only identity transforms, tasks are hashed on their partition values using DataFusion's create_hashes + REPARTITION_RANDOM_STATE, and the scan declares Partitioning::Hash. This lets DataFusion recognize that the output is already hash-partitioned and skip a downstream RepartitionExec. Non-identity transforms (bucket, truncate, year/month/day/hour) are lossy: the partition value in task metadata does not match what DataFusion would compute by hashing the actual column values, so those cases fall back to UnknownPartitioning. Any task that cannot be fully hashed with the identity key (unsupported literal type, null partition value) also falls back. (A simplified sketch follows this list.)
    Credit: This bucketing solution was proposed by @timsaucer.
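
For concreteness, here is a minimal, self-contained sketch of the hash-bucketing idea. The function name and the reduction of each task to a single Int32 partition value are illustrative, and the fixed seeds are an assumption standing in for DataFusion's repartition random state; the real logic lives in the private bucketing module:

```rust
use std::sync::Arc;

use ahash::RandomState; // same ahash version DataFusion uses (assumption)
use datafusion::arrow::array::{ArrayRef, Int32Array};
use datafusion::common::hash_utils::create_hashes;

// Assign each task (reduced here to its identity partition value) to one of
// `n_buckets` buckets, mirroring RepartitionExec's `hash % n` rule.
fn bucket_by_identity_value(values: &Int32Array, n_buckets: usize) -> Vec<Vec<usize>> {
    // Hashing with the same RandomState RepartitionExec uses is what makes
    // the scan's Partitioning::Hash declaration honest.
    let random_state = RandomState::with_seeds(0, 0, 0, 0);
    let arrays: Vec<ArrayRef> = vec![Arc::new(values.clone())];
    let mut hashes = vec![0u64; values.len()];
    create_hashes(&arrays, &random_state, &mut hashes).unwrap();

    let mut buckets: Vec<Vec<usize>> = vec![Vec::new(); n_buckets];
    for (task_idx, hash) in hashes.iter().enumerate() {
        buckets[(*hash as usize) % n_buckets].push(task_idx);
    }
    buckets
}

fn main() {
    // Four files with identity partition values 1, 2, 1, 3: files sharing a
    // partition value always land in the same bucket.
    let values = Int32Array::from(vec![1, 2, 1, 3]);
    println!("{:?}", bucket_by_identity_value(&values, 2));
}
```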

Design choices - planning at scan() time vs. at execute() time

Planning eagerly at scan() time is a deliberate trade-off:

  • Pro: Tasks are computed once and shared across all partitions; the plan is reproducible; execute(i) is pure I/O with no catalog round-trips.
  • Con: TableProvider::scan now does network I/O (catalog + metadata reads), which is unusual for a planning-phase method. An alternative design - planning lazily at execute time - would keep scan() cheap but requires one plan_files() call per partition (redundant). A future extension could expose this as an option for use cases where snapshot staleness matters more than plan reproducibility.
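
Concretely, in eager mode execute(i) reduces to something like the sketch below. to_arrow_from_tasks is the method this PR adds; the surrounding names, signatures, and module paths are approximate:

```rust
use futures::StreamExt;
use iceberg::scan::{ArrowRecordBatchStream, FileScanTask, TableScan};

// One DataFusion partition replays its pre-assigned bucket through the
// Arrow reader: pure I/O, no plan_files(), no catalog round-trips.
async fn stream_bucket(
    scan: &TableScan,               // built with the same projection/filters
    buckets: &[Vec<FileScanTask>],  // computed once, at scan() time
    partition: usize,
) -> iceberg::Result<ArrowRecordBatchStream> {
    // Wrap the bucket's tasks back into the stream type the reader expects.
    let tasks = futures::stream::iter(
        buckets[partition]
            .clone()
            .into_iter()
            .map(Ok::<FileScanTask, iceberg::Error>),
    )
    .boxed();
    scan.to_arrow_from_tasks(tasks).await
}
```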

Known limitations

  • Limited type support for Partitioning::Hash - literal_to_array supports seven primitive Arrow types (Bool, Int32, Int64, Float32, Float64, Utf8, Date32). Timestamps, Decimal128, LargeUtf8, etc. are not yet covered; any unsupported type forces a fallback to UnknownPartitioning. (See the sketch after this list.)

  • Spec evolution disables Partitioning::Hash - If the table has more than one historical partition spec, the bucketing module conservatively returns UnknownPartitioning to avoid mismatches between old and new partition tuple layouts.
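
To make the fallback concrete, here is a hypothetical reconstruction of the literal_to_array shape, using DataFusion's ScalarValue as a stand-in for the actual input type:

```rust
use std::sync::Arc;

use datafusion::arrow::array::{
    ArrayRef, BooleanArray, Date32Array, Float32Array, Float64Array, Int32Array,
    Int64Array, StringArray,
};
use datafusion::scalar::ScalarValue;

// Any value outside the seven supported types returns None, which the
// caller turns into a fallback to UnknownPartitioning.
fn literal_to_array(value: &ScalarValue) -> Option<ArrayRef> {
    Some(match value {
        ScalarValue::Boolean(v) => Arc::new(BooleanArray::from(vec![*v])) as ArrayRef,
        ScalarValue::Int32(v) => Arc::new(Int32Array::from(vec![*v])),
        ScalarValue::Int64(v) => Arc::new(Int64Array::from(vec![*v])),
        ScalarValue::Float32(v) => Arc::new(Float32Array::from(vec![*v])),
        ScalarValue::Float64(v) => Arc::new(Float64Array::from(vec![*v])),
        ScalarValue::Utf8(v) => Arc::new(StringArray::from(vec![v.clone()])),
        ScalarValue::Date32(v) => Arc::new(Date32Array::from(vec![*v])),
        // Timestamps, Decimal128, LargeUtf8, ... are not covered yet.
        _ => return None,
    })
}
```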

Are these changes tested?

Unit tests in table/mod.rs covering the new bucketed scan path:

  • test_empty_table_single_empty_bucket - Empty table produces one empty bucket, guarding against out-of-bounds panic on execute(0).
  • test_unpartitioned_falls_back_to_unknown - Unpartitioned table declares UnknownPartitioning.
  • test_bucket_count_capped_at_file_count - When target_partitions > n_files, bucket count is capped at n_files.
  • test_single_target_partition_single_bucket - target_partitions=1 produces a single bucket regardless of file count, reproducing the original single-threaded behavior.
  • test_identity_partitioned_declares_hash - Identity-partitioned table declares Partitioning::Hash referencing the partition column.
  • test_projection_without_partition_col_falls_back_to_unknown - Projecting away an identity column falls back to UnknownPartitioning.

Additional tests are added for IcebergTableProvider to cover limit pushdown, insert behavior, and schema consistency, ensuring the refactor introduces no regressions on existing functionality.

SQL logic tests - EXPLAIN snapshots are updated to reflect the new buckets:[N] file_count:[M] display format and the correct input_partitions counts.
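
For illustration, a snapshot line with the new suffix looks like this (the plan text is made up; only the buckets/file_count suffix is the format introduced here):

```
IcebergTableScan: ... buckets:[2] file_count:[3]
```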

Production validation - We plan to test these changes in our infrastructure by shadowing real-world queries.

Follow-up work

Note

IcebergStaticTableProvider is unchanged - it still uses IcebergTableScan::new (lazy, single-partition). Static snapshots do not benefit from eager planning because the task list is fixed by construction.

Member

@timsaucer timsaucer left a comment


I'm no expert on Iceberg, but I've worked a lot on DataFusion, particularly table providers. I recently wrote a blog post on the DataFusion site, though it went up after you first opened this PR. In case it's in any way useful: https://datafusion.apache.org/blog/2026/03/31/writing-table-providers/

Overall I think the approach here is definitely reasonable. My comments are mostly around opportunities to squeeze out a little more performance based on having done something similar at my work.

self: Arc<Self>,
_children: Vec<Arc<dyn ExecutionPlan>>,
) -> DFResult<Arc<dyn ExecutionPlan>> {
Ok(self)
Member


Since this doesn't support children, I'd recommend an error if _children is not empty. Not a blocker for merge.

Contributor Author


Yes, you're right thanks! Pushed a fix that returns a DataFusionError::Internal, matching the pattern used in IcebergCommitExec::with_new_children.

Side note: IcebergTableScan::with_new_children has the same issue. This could be the subject of another PR.
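
For reference, the fix has roughly this shape (the ExecutionPlan trait method; the error message wording here is illustrative):

```rust
fn with_new_children(
    self: Arc<Self>,
    children: Vec<Arc<dyn ExecutionPlan>>,
) -> DFResult<Arc<dyn ExecutionPlan>> {
    if !children.is_empty() {
        // IcebergTableScan is a leaf node: reject rather than silently
        // drop any children handed to us.
        return Err(DataFusionError::Internal(
            "IcebergTableScan does not accept children".to_string(),
        ));
    }
    Ok(self)
}
```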

Comment thread on crates/integrations/datafusion/src/table/mod.rs (outdated)
&self,
filters: &[&Expr],
) -> DFResult<Vec<TableProviderFilterPushDown>> {
Ok(vec![TableProviderFilterPushDown::Inexact; filters.len()])
Member


Can we do better than this? If we have a partitioned scan and the filter is on the partition columns, I would expect to be able to get an exact pushdown. That would entirely remove a filter operation for cases where it matches, and I think that's a big win and a common use case I've seen in other work.

Contributor Author

@toutane toutane Apr 20, 2026


Yes, you're right, I agree there's something to do here.

I'd prefer to tackle this in a follow-up PR: doing it correctly requires a per-filter conversion API (currently convert_filters_to_predicate collapses everything into a single combined predicate and silently drops non-convertible filters) plus a partition-spec-aware check, since only Identity-transformed partition columns can be safely marked Exact; bucket, truncate, year/month/etc. are lossy and must stay Inexact to avoid incorrect results.

Happy to open a tracking issue. However, if you think it's simple enough, I can go ahead and make the changes directly in the PR.
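
For illustration, the per-filter version would look roughly like this; both helper predicates are hypothetical names, not existing functions in the crate:

```rust
fn supports_filters_pushdown(
    &self,
    filters: &[&Expr],
) -> DFResult<Vec<TableProviderFilterPushDown>> {
    Ok(filters
        .iter()
        .map(|f| {
            // Exact is safe only when Iceberg's manifest pruning fully
            // evaluates the predicate AND every referenced column is an
            // identity-transformed partition column.
            if converts_to_iceberg_predicate(f) && references_only_identity_cols(f) {
                TableProviderFilterPushDown::Exact
            } else {
                TableProviderFilterPushDown::Inexact
            }
        })
        .collect())
}
```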

.map_err(to_datafusion_error)?
.try_collect::<Vec<_>>()
.await
.map_err(to_datafusion_error)?;
Member


It looks like the number of output partitions will be the number of files, right? I'm wondering if there's an opportunity to do better than that. We're specifying that the output partitioning in the exec is unknown, but don't we have information about the partitioning we could utilize?

Member


By better I mean: could we be more performant if we went ahead and got the target partitions from the session, then output that number of partitions, already hashed?

Contributor Author


Thanks for raising this, please push back if any of the below is off.

For context, the long-term direction for this is tracked in the EPIC #1604 (row-group-based parallel scan with a GroupPruner that can split/merge FileScanTask below the file grain). What I was hoping to land with this PR is a more immediate, scoped optimization that stays within the current file-grain contract, so we don't preempt the design choices in #1604. The file-grouping step you're pointing at is essentially what #2220 describes as the intermediate improvement on the path toward #1604.

If you think it's appropriate, I'd be happy to pick up a short-term follow-up along these lines:

  1. Switch IcebergPartitionedScan from tasks: Vec<FileScanTask> to file_groups: Vec<Vec<FileScanTask>>, to follow the convention used by DataFusion's own FileScanConfig, each group = one DataFusion partition that streams its files sequentially through ArrowReaderBuilder::read.
  2. In IcebergPartitionedTableProvider::scan, read state.config().target_partitions() and group tasks into min(n_files, target_partitions) buckets (a rough sketch follows this list).
  3. When n_files < target_partitions, parallelism is still capped at n_files. I think that's inherent to the file grain, but let me know if I'm missing something.
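
A rough sketch of steps 1-2, assuming FileScanTask is iceberg-rust's task type (grouping strategy and names are illustrative):

```rust
use iceberg::scan::FileScanTask;

// Chunk tasks into at most `target_partitions` groups; each group becomes
// one DataFusion partition that streams its files sequentially.
fn group_tasks(tasks: Vec<FileScanTask>, target_partitions: usize) -> Vec<Vec<FileScanTask>> {
    let n_groups = target_partitions.min(tasks.len()).max(1);
    let mut groups: Vec<Vec<FileScanTask>> = (0..n_groups).map(|_| Vec::new()).collect();
    // Round-robin keeps groups balanced when per-file sizes are unknown.
    for (i, task) in tasks.into_iter().enumerate() {
        groups[i % n_groups].push(task);
    }
    groups
}
```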

I'm happy to open the follow-up issue/PR myself, or defer to you if you'd rather frame it, whatever works best.

Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose I'd need to understand those conversations. I think I mentioned it in one of the other comments on this PR, but I found the whole discussion difficult to track. Maybe I can find some time this weekend to look through that size-based partitioning they mention.

Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote this PR targeting your branch. Let me know what you think!

toutane#1

The one issue I have is that I do not personally have access to any iceberg catalogs that I could use for benchmarking. My ability to test it is very limited right now.

Contributor Author

@toutane toutane Apr 27, 2026


Hey Tim, thanks a lot for the proposal. It is really clean and smart.

I created an issue for the redundant FilterExec you were mentioning (#2363), so it's nice that you've addressed it here.

For the benchmark, we can do it in our infra by shadowing real traffic (our ultimate goal is to distribute execution across multiple workers, based on the output partitioning). It will not be a standard benchmark, but at least it will show whether things are improving on real-world queries.

In the end, what do you think of merging this new provider/scan with the current one so that we only maintain one path, as you suggested? If I understand correctly, the current path is reachable by setting target_partitions to 1.

Last thing: I'll try to support partitioning based on Iceberg's bucket transform, the tricky part being that DataFusion and Iceberg don't use the same hash function, which makes Iceberg's bucket hash incompatible with RepartitionExec.

Member


Personally, I strongly believe you should update the existing table provider instead of creating a new one. I think it's just more work in the long run to keep two nearly identical bits of code.

I don't think you'll be able to use iceberg bucket transforms for the datafusion hashing output.

@mbutrovich
Collaborator

Thanks for the PR, @toutane! One thing I noticed: IcebergPartitionedScan::execute() creates a bare ArrowReaderBuilder::new(file_io).build() with no configuration. The existing path through IcebergTableScan wires through row group filtering, row selection, concurrency limits, and batch size. Might be worth plumbing those through here too so users don't silently lose those optimizations when switching to the partitioned scan.

@toutane toutane marked this pull request as draft April 21, 2026 09:35
@toutane toutane force-pushed the draft/partitioned-file-scanning-contribution branch from 0a7af45 to fde61f6 Compare April 21, 2026 09:56
@timsaucer
Member

More broadly, is adding a second path really the best answer? It seems like you're going to increase your maintenance load. Is there any reason not to have a single path, with the fallback being a partitioned scan of N=1?

I am going to spend a little more time trying to understand the issues. It's difficult because some of them are marked as unplanned or stale and some of the links do not have good descriptions. I suppose I'll need to look at the java source to get a better idea of what the long term goal is.

@toutane
Contributor Author

toutane commented Apr 22, 2026

Hey Tim, I think you're absolutely right about consolidating everything into a single TableProvider long term.

The only reason I kept separate paths was to avoid introducing breaking changes. I am going to explore a design where partitioned file scan becomes the default behavior, with the current provider's logic as a fallback as you suggested.

On a related note, it could be worth thinking about the next step: exposing Partitioning::Hash as output-partitioned when the Iceberg data uses bucket partitioning. Do you think that fits naturally in the same path, or would a separate provider be a better fit?

@timsaucer
Member


I understand the desire to not introduce breaking changes. Is the concern that the API is changing, or do you have implementation concerns? If it's just the API change, then I think good upgrade documentation is often sufficient, especially since the change looks fairly straightforward for a downstream consumer. Please correct me if that's not the case.

If it's concern about the implementation, then I think the real solution is to make sure there's robust testing both in the repo and against some real life workloads to verify performance at different scales and partitioning structures.

With respect to the question about output partitioning, I think any time you can do that you should. Any time we can give more information about these kinds of things we're going to see performance gains, and sometimes significant gains.

toutane and others added 13 commits April 29, 2026 16:39
Co-authored-by: Tim Saucer <timsaucer@gmail.com>
…identity-hash partitioning

Replace the one-task-per-partition layout in IcebergPartitionedScan with
N buckets sized from the session's target_partitions. When the table's
default spec exposes identity-transform columns and every task carries
the corresponding partition values, tasks are bucketed by hashing those
values via DataFusion's REPARTITION_RANDOM_STATE so the resulting
partitioning matches what RepartitionExec would produce. The scan then
declares Partitioning::Hash(exprs, N), letting downstream joins and
aggregates skip an extra repartition.

Hash declaration is conservative and only stands when:
  - the table has a single partition spec (no spec evolution)
  - every identity source column is present in the output projection
  - every column type is supported by literal_to_array
  - every task supplied a full identity key
Any miss collapses to UnknownPartitioning(N) while bucketing falls
back to a hash of data_file_path so partitions still distribute.

IcebergPartitionedScan now stores Vec<Vec<FileScanTask>> and execute(i)
streams every task in buckets[i] through to_arrow_with_tasks. Bucket
count is capped at min(target_partitions, num_files), and an empty
table still yields zero partitions to avoid out-of-bounds execute calls.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`IcebergPartitionedTableProvider::supports_filters_pushdown` previously
returned `Inexact` for every filter, forcing DataFusion to re-evaluate
even filters that Iceberg's manifest-level pruning has fully resolved.
Per-filter the provider now returns `Exact` when both:
  - the iceberg conversion can represent the filter, so manifest pruning
    will remove every row that fails it, and
  - every leaf is a comparison or null check against an identity-
    partition column with a literal RHS.

Identity-partitioned column names are cached at `try_new` from the
table's default spec; tables with spec evolution (>1 historical specs)
fall back to an empty set so all filters stay `Inexact`. Supported
shapes: =, !=, <, <=, >, >=, IS NULL, IS NOT NULL, IN/NOT IN, plus
AND/OR/NOT compositions of the above. Every other shape is `Inexact`.

`convert_filter_to_predicate` is promoted to `pub(crate)` so the
provider can probe convertibility per filter without rebuilding the
whole AND-collapsed predicate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…column intersection

Previously identity_partition_col_names returned an empty set whenever
the table had more than one historical partition spec, forcing every
filter back to Inexact under spec evolution. This was overly
conservative: Iceberg evaluates partition predicates against each
manifest's own spec, so a column that is identity-partitioned in every
spec is fully prunable across the entire table regardless of which spec
a given file was written under.

Replace the multi-spec gate with an intersection across every spec's
identity-source set. A column survives only if every spec includes it
with Transform::Identity; columns that appear with non-identity
transforms in some spec, or are missing from a spec entirely, are
dropped. The result remains an honest set of columns for which Exact
pushdown is provably safe across all surviving files.

Hash bucketing (compute_identity_cols) keeps its single-spec gate
because slot-order alignment with the table's default spec depends on
each task carrying its own spec id, which the native plan flow does
not yet do.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…via per-column intersection"

This reverts commit b2613e3.

(cherry picked from commit 826f054)
…shdown"

This reverts commit 6d0ed4c.

(cherry picked from commit 4381f00)
…ergTableProvider

IcebergPartitionedTableProvider and IcebergPartitionedScan were introduced
to enable parallel file scanning by bucketing FileScanTasks across DataFusion
partitions. However, maintaining two TableProvider implementations is
redundant: the new provider is strictly more capable, and its degenerate case
(target_partitions=1) reproduces the old single-partition behavior exactly.

This commit folds the partitioned provider into IcebergTableProvider and
the partitioned scan into IcebergTableScan, eliminating the parallel types.

Changes:
- IcebergTableProvider::scan() now eagerly calls plan_files() and distributes
FileScanTasks into buckets using the same identity-hash strategy
(REPARTITION_RANDOM_STATE + create_hashes) that was in
IcebergPartitionedTableProvider, enabling Partitioning::Hash declarations
that align with DataFusion's RepartitionExec.
- IcebergTableScan gains a new_with_tasks() constructor that accepts
pre-planned buckets and a caller-supplied Partitioning. execute(i) streams
the tasks in buckets[i] via TableScan::to_arrow_with_tasks, rebuilding
the TableScan per-partition to avoid serializing PlanContext Arc-shared
caches across workers.
- The original new() constructor and the to_arrow() lazy path are kept
unchanged for IcebergStaticTableProvider, which does not pre-plan tasks.
- Limit slicing (try_filter_map truncation) from the old IcebergTableScan
is preserved in both execution paths.
- Bucketing helpers (IdentityCol, compute_identity_cols, bucket_tasks,
identity_hash, fallback_hash, literal_to_array, is_supported_dtype) are
moved verbatim into a new private table/bucketing.rs module.
- Unit tests from partitioned.rs are migrated to table/mod.rs and updated
to use IcebergTableProvider and IcebergTableScan.
- integration_datafusion_test.rs: fix test_provider_plan_stream_schema to
call execute(0) instead of execute(1). The old call worked only because
the previous IcebergTableScan silently ignored the partition index.

(cherry picked from commit d2e5e04)
Review pass over the partitioned-scan branch ahead of upstream
contribution.

- Rename `TableScan::to_arrow_with_tasks` to `to_arrow_from_tasks` —
  `from` better signals that the tasks are the input source rather
  than a builder-style modifier.
- Restructure the doc with a `# Correctness` section that calls out
  the projection/filter contract while clarifying that reader-side
  configuration (concurrency, batch size, row-group filtering, row
  selection) is taken from `self`.
- Make `IcebergTableScan::new` and `new_with_tasks` `pub` (were
  `pub(crate)`) so external users can construct the node directly,
  matching the public visibility of the struct itself.
- Drop the `convert_filters_to_predicate` re-export from
  `physical_plan/mod.rs`: it was unused outside the module.

- Extract a private `new_inner` constructor on `IcebergTableScan` so
  `new` and `new_with_tasks` share a single source of truth for the
  `PlanProperties` / projection / predicate setup.
- Split `IcebergTableScan::execute` into a linear pipeline backed by
  three helpers: `build_table_scan` (synchronous scan-builder
  plumbing), `build_record_batch_stream` (async stream construction
  for the lazy/eager modes), and `apply_limit`.
- Trim the `IcebergTableScan` struct doc and field comments to match
  the rest of the file's style; drop the verbose `to_arrow_with_tasks`
  rationale (the `# Correctness` doc carries the load-bearing info).
- Tighten `DisplayAs::fmt_as`: remove the file-path enumeration (file
  count alone is enough for `EXPLAIN`) and factor the common prefix.
- Trim several narrating comments in `table/mod.rs` and the module
  doc that duplicated information already evident from the code.

- Add `test_identity_partitioned_declares_hash`: verifies the happy
  path where an identity-partitioned table with the partition column
  in the projection produces `Partitioning::Hash` referencing that
  column. This was the main missing coverage for the bucketing logic.
- Add `test_projection_without_partition_col_falls_back_to_unknown`:
  verifies the `compute_identity_cols → None` branch when the
  projection omits the partition source column.
- Add helpers (`make_partitioned_catalog_and_table_for_bucketing`,
  `append_partitioned_fake_data_files`) to build identity-partitioned
  fixtures without writing real Parquet files.

(cherry picked from commit b1f2d66)
@toutane toutane force-pushed the draft/partitioned-file-scanning-contribution branch from 0a2ba62 to 70bc487 Compare April 29, 2026 14:48
@toutane toutane changed the title feat(datafusion): enable parallel file-level scanning via one partition per file feat!(datafusion): enable parallel file scanning with eager task bucketing Apr 29, 2026
toutane added 3 commits April 30, 2026 11:09
IcebergTableProvider::scan now plans files eagerly and buckets them
across DataFusion partitions before returning the ExecutionPlan.
As a result, IcebergTableScan's DisplayAs output always includes
`buckets:[N] file_count:[M]` - even for unpartitioned tables where
N = 1.

Update the four .slt files whose EXPLAIN snapshots were missing this
suffix, and fix the like_predicate_pushdown snapshots that also had
a stale input_partitions count on RepartitionExec (the table now has
multiple files across multiple buckets).

(cherry picked from commit 6ae4a71)
@toutane toutane force-pushed the draft/partitioned-file-scanning-contribution branch from 1f87e13 to 7ff1f6d Compare April 30, 2026 09:12
@toutane toutane marked this pull request as ready for review April 30, 2026 09:44
@toutane toutane requested a review from timsaucer April 30, 2026 09:44


Development

Successfully merging this pull request may close these issues.

Enable parallel file-level scanning for IcebergTableScan Datafusion Integration

3 participants