True parallelism across full and incremental scan #11
Merged

Conversation

gbrgr reviewed Nov 19, 2025
gbrgr approved these changes Nov 19, 2025

Collaborator (Author):

In future, we may parallelize this per row group (at least when it comes to decoding, not IO) with …
Which issue does this PR close?
https://relationalai.atlassian.net/browse/RAI-44217
What changes are included in this PR?
Adding true parallelism here. By default, the full scan (in reader.rs) has only concurrency, not parallelism. For the negative impact of that, see the description of this closed draft: apache#1684. Upstream, the design is that parallelism should be baked into the upper layers, but since we diverge at the moment, we can use this hack to enable parallelism for us right away.

What are the issues?
When the FileScanTaskStream is processed, we process it concurrently, but without spawning there is no parallelism. The impact of not spawning for each stream item here is minimal, though, as these operations are IO-heavy and concurrency is nearly enough. However, the output of processing each file in the file stream is a record batch stream, and processing a record batch stream is a CPU-heavy operation. Right now we process that concurrently as well (with try_flatten_unordered(N)), but that is not enough: for CPU-heavy work we definitely need parallelism.

So what can we do?
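As a bridging illustration of the concurrency-versus-parallelism distinction above: concurrency interleaves CPU-heavy work on one thread, while spawning (threads here; tasks on a multi-threaded runtime in the real async code) runs it on multiple cores. This is a minimal sketch with std threads; `decode_batch` is a hypothetical stand-in for real batch decoding, not code from this PR.

```rust
use std::thread;

// Hypothetical stand-in for CPU-heavy batch decoding.
fn decode_batch(x: u64) -> u64 {
    (0..1_000u64).fold(x, |acc, i| acc + i)
}

// Spawning one thread per batch gives parallelism, not just concurrency:
// each decode can run on its own core.
fn decode_all_parallel(batches: Vec<u64>) -> Vec<u64> {
    let handles: Vec<_> = batches
        .into_iter()
        .map(|b| thread::spawn(move || decode_batch(b)))
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    // Each result is input + sum(0..1000) = input + 499_500.
    let out = decode_all_parallel(vec![0, 1, 2, 3]);
    println!("{:?}", out);
}
```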
First, we can create a channel, spawn, and return the receiver side of the channel. In the spawned task, we populate the transmitter side. Here we have two options:
I chose to spawn, with the idea of squeezing out parallelism. In some cases it will add more latency, though, and we may make this an option (or decide on a different default here in the PR).
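The channel pattern described above can be sketched as follows. In the actual async code this would be a tokio mpsc channel with tokio::spawn and a receiver-backed stream; here std threads and std::sync::mpsc illustrate the same shape, and `spawned_stream` is a hypothetical name.

```rust
use std::sync::mpsc;
use std::thread;

// Spawn a producer task and hand back the receiver side of the channel.
// The consumer polls the receiver while the producer runs in parallel.
fn spawned_stream(items: Vec<i32>) -> mpsc::Receiver<i32> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        for item in items {
            // CPU-heavy processing happens here, off the consumer's thread.
            // A send error means the receiver was dropped; stop producing.
            if tx.send(item * 2).is_err() {
                break;
            }
        }
    });
    rx
}

fn main() {
    let rx = spawned_stream(vec![1, 2, 3]);
    let out: Vec<i32> = rx.iter().collect();
    println!("{:?}", out); // [2, 4, 6]
}
```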
Then, for each file, we need to process the batches in the record_batch_stream. Since this already happens in the spawned task (if we choose option 2 above, we should at least spawn around processing the record_batch_stream), the CPU-heavy operation will be parallelized. But if we only have one file, processing its batches won't be parallelized. For that we would need to poll from the record batch stream in parallel, which is not possible (we would have to use the lower-level parquet API, which is out of scope for now).

For incremental streams, we already spawn for each file, so I'm just refactoring a bit to reuse code.
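The per-file layout described above can be sketched like this: one spawned worker per file, each draining its own batch stream into a shared channel, so files are processed in parallel while batches within a single file stay sequential. This is a hedged std-thread sketch (the real code spawns tokio tasks over record batch streams); `scan_files_parallel` is a hypothetical name.

```rust
use std::sync::mpsc;
use std::thread;

// One worker per "file"; each file's batches are drained sequentially,
// but different files proceed in parallel.
fn scan_files_parallel(files: Vec<Vec<u32>>) -> Vec<u32> {
    let (tx, rx) = mpsc::channel();
    let handles: Vec<_> = files
        .into_iter()
        .map(|batches| {
            let tx = tx.clone();
            thread::spawn(move || {
                for b in batches {
                    // Batch order within a file is preserved; global order is not.
                    tx.send(b).unwrap();
                }
            })
        })
        .collect();
    drop(tx); // close the channel once all workers finish
    for h in handles {
        h.join().unwrap();
    }
    let mut out: Vec<u32> = rx.iter().collect();
    out.sort(); // sort for a deterministic view of the unordered output
    out
}

fn main() {
    println!("{:?}", scan_files_parallel(vec![vec![1, 2], vec![3, 4], vec![5]]));
}
```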
In addition, the per-batch spawn is a spawn_blocking.

Both implementations now provide two-level parallelism:
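The idea behind using spawn_blocking is to move CPU-heavy work off the async executor's worker threads so they stay free to drive IO. A minimal std-thread analogue of that handoff (the real code uses tokio::task::spawn_blocking; `run_blocking` is a hypothetical name):

```rust
use std::sync::mpsc;
use std::thread;

// Run a CPU-heavy closure on a separate thread and return a handle
// (the receiver) for the result, so the caller's thread is not blocked.
fn run_blocking<T: Send + 'static>(f: impl FnOnce() -> T + Send + 'static) -> mpsc::Receiver<T> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let _ = tx.send(f()); // ignore the error if the caller gave up waiting
    });
    rx
}

fn main() {
    let rx = run_blocking(|| (1..=100u64).sum::<u64>());
    println!("{}", rx.recv().unwrap()); // 5050
}
```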
Batch-level parallelism is not implemented.
Are these changes tested?
Existing tests go through these code paths. I haven't tested performance yet, as that is a manual process on EC2 instances.