
True parallelism across full and incremental scan #11

Merged: vustef merged 13 commits into main from vs-parallelism, Nov 19, 2025


Conversation


@vustef (Collaborator) commented Nov 18, 2025

Which issue does this PR close?

https://relationalai.atlassian.net/browse/RAI-44217

What changes are included in this PR?

This PR adds true parallelism. By default, the full scan (in reader.rs) has only concurrency, not parallelism. For the negative impact of that, see the description of this closed draft: apache#1684. Upstream, the design intent is that parallelism is baked into the upper layers, but since we diverge at the moment, we can use this hack to enable parallelism for ourselves right away.
What are the issues?

When the FileScanTaskStream is processed, we process it concurrently, but without spawning there is no parallelism. The impact of not spawning per stream item is minimal, though, because those operations are IO-heavy and concurrency is nearly enough. However, the output of processing each file in the file stream is a record batch stream, and processing a record batch stream is a CPU-heavy operation. Right now we process that concurrently as well (with try_flatten_unordered(N)), but that is not enough: for CPU-heavy work we definitely need parallelism.
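The concurrency-vs-parallelism distinction can be sketched with plain threads. This is a toy example, not the iceberg-rust code, and `decode` is a hypothetical stand-in for CPU-heavy batch decoding: processing everything on the calling thread interleaves work on one core, whereas spawning lets each unit of work run on its own core.

```rust
use std::thread;

// Hypothetical stand-in for CPU-heavy record-batch decoding.
fn decode(x: u64) -> u64 {
    (0..1_000).fold(x, |acc, _| acc.wrapping_mul(31).wrapping_add(1))
}

// Concurrency only: everything runs on the calling thread, one core.
fn decode_sequential(inputs: &[u64]) -> Vec<u64> {
    inputs.iter().map(|&x| decode(x)).collect()
}

// Parallelism: one thread per input, so decodes can run on multiple cores.
fn decode_parallel(inputs: &[u64]) -> Vec<u64> {
    let handles: Vec<_> = inputs
        .iter()
        .map(|&x| thread::spawn(move || decode(x)))
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let inputs = [1u64, 2, 3, 4];
    // Same results either way; only the CPU utilization differs.
    assert_eq!(decode_sequential(&inputs), decode_parallel(&inputs));
    println!("ok");
}
```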

So what can we do?
First, we can create a channel, spawn a task, and return the receiver side of the channel. The spawned task populates the transmitter side. Here we have two options:

  1. Spawn for each file.
  2. Don't spawn, since these are IO-bound operations.

I chose to spawn, to squeeze out parallelism. In some cases it will add latency, though, and we may make this an option (or decide on a different default in this PR).

Then, for each file, we need to process the batches in the record_batch_stream. Since this already happens in the spawned task (if we choose option 2 above, we should at least spawn around processing the record_batch_stream), the CPU-heavy work is parallelized across files. But if we only have one file, processing its batches won't be parallelized. For that we'd need to poll the record batch stream in parallel, which is not possible at this level (we would have to use a lower-level parquet API, which is out of scope for now).
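The channel-and-spawn shape described above can be sketched with std::thread and std::sync::mpsc instead of tokio. This is a minimal analogue of the pattern, not the actual implementation; process_file is a hypothetical stand-in for opening a file and decoding its record batches.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for decoding one file's record batches (CPU-heavy).
fn process_file(file_id: usize) -> Vec<u64> {
    (0..4).map(|batch| (file_id * 10 + batch) as u64).collect()
}

// Returns the receiver side of a channel. A coordinator thread (the "outer
// spawn") spawns one worker per file and each worker sends its decoded
// batches into the shared channel.
fn scan_files(files: Vec<usize>) -> mpsc::Receiver<u64> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let workers: Vec<_> = files
            .into_iter()
            .map(|f| {
                let tx = tx.clone();
                thread::spawn(move || {
                    for batch in process_file(f) {
                        // Ignore send errors if the consumer hung up.
                        let _ = tx.send(batch);
                    }
                })
            })
            .collect();
        for w in workers {
            let _ = w.join();
        }
        // All senders drop here, which terminates the receiver's iterator.
    });
    rx
}

fn main() {
    let mut batches: Vec<u64> = scan_files(vec![1, 2, 3]).into_iter().collect();
    batches.sort();
    println!("{}", batches.len()); // 3 files x 4 batches = 12
}
```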

For incremental streams, we already spawn for each file, so I'm just refactoring a bit to reuse code.

In addition, the per-batch work is spawned with spawn_blocking.
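tokio::task::spawn_blocking runs a closure on a dedicated blocking-thread pool, so CPU-heavy per-batch work does not stall the async worker threads. A rough std-threads analogue of that shape (decode_batch is a hypothetical stand-in, not the real per-batch code):

```rust
use std::thread;

// Hypothetical stand-in for CPU-heavy per-batch work (e.g. decoding).
fn decode_batch(batch: Vec<u8>) -> usize {
    batch.iter().map(|&b| b as usize).sum()
}

// std analogue of spawn_blocking: run the closure on its own thread and
// hand back a join handle the caller can wait on later (tokio returns a
// future instead).
fn spawn_blocking_like<T: Send + 'static>(
    f: impl FnOnce() -> T + Send + 'static,
) -> thread::JoinHandle<T> {
    thread::spawn(f)
}

fn main() {
    let batch = vec![1u8, 2, 3];
    // The caller is free to keep doing other work while the decode runs.
    let handle = spawn_blocking_like(move || decode_batch(batch));
    let sum = handle.join().unwrap();
    println!("{sum}"); // 6
}
```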

Both implementations now provide two levels of parallelism:

  1. Outer spawn: background coordination
  2. File-level parallelism: N files processed in parallel

Batch-level parallelism (within a single file) is not implemented.

Are these changes tested?

Existing tests go through these code paths. I haven't tested performance yet, as that is a manual process on EC2 instances.

@vustef vustef requested a review from gbrgr November 18, 2025 14:51
@vustef vustef marked this pull request as ready for review November 18, 2025 19:20
@vustef vustef enabled auto-merge (squash) November 19, 2025 10:36

@vustef (Collaborator, Author) commented Nov 19, 2025

In the future, we may parallelize this per row group (at least the decoding, not the IO) using the next_row_group API on ParquetRecordBatchStream.

@vustef vustef merged commit ae83309 into main Nov 19, 2025
18 checks passed
@vustef vustef deleted the vs-parallelism branch November 19, 2025 10:41