lidavidm (Member) commented Jun 8, 2021

This adds an OptionalParallelForAsync which lets us have per-row-group parallelism without nested parallelism in the async Parquet reader. This also uses TransferAlways, taking care of ARROW-12916. enable_parallel_column_conversion is kept as it still affects the threaded scanner.
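To illustrate the idea of an "optional" parallel-for that hands back futures instead of blocking, here is a minimal standalone sketch. It is not Arrow's actual `OptionalParallelForAsync`: the name `OptionalParallelForSketch`, its signature, and the use of `std::async` (one thread per task rather than a shared thread pool) are all assumptions made for illustration, and the callable is assumed to return a non-void value.

```cpp
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical helper for illustration only; NOT Arrow's OptionalParallelForAsync.
// Calls func(i) for every i in [0, num_tasks) and returns one future per call.
// With use_threads == true each call is launched on its own thread via std::async
// (a real implementation would more likely submit to a shared executor).
// With use_threads == false each call runs inline on the calling thread and the
// returned future is already ready.
// Assumption: func(i) returns a non-void value.
template <typename Func>
auto OptionalParallelForSketch(bool use_threads, std::size_t num_tasks, Func func)
    -> std::vector<std::future<decltype(func(std::size_t{0}))>> {
  using Result = decltype(func(std::size_t{0}));
  std::vector<std::future<Result>> futures;
  futures.reserve(num_tasks);
  for (std::size_t i = 0; i < num_tasks; ++i) {
    if (use_threads) {
      // Parallel path: the caller gets a future right away and is not blocked.
      futures.push_back(std::async(std::launch::async, func, i));
    } else {
      // Serial path: compute now, then wrap the value in a ready future.
      std::promise<Result> done;
      done.set_value(func(i));
      futures.push_back(done.get_future());
    }
  }
  return futures;
}
```

A caller could, for example, treat each index as one unit of work (such as a column within a row group), collect the returned futures, and call `get()` on them once all work has been submitted. Because both paths return the same type, the caller does not need separate code for the threaded and non-threaded cases.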

github-actions bot commented Jun 8, 2021

lidavidm (Member, Author) commented Jun 8, 2021

[Figure: S3 Median Scan Time (s)]

The benchmark shows little difference overall; the most pronounced change is when files << cores (this was a 4-vCPU machine), which I think makes sense, since with many files, file-level parallelism takes over.

pitrou self-requested a review June 15, 2021 13:53

pitrou (Member) left a comment

+1, thank you very much!

pitrou (Member) commented Jun 15, 2021

Rebased, can merge if green.

lidavidm closed this in b73bcf0 Jun 15, 2021
lidavidm deleted the arrow-12597 branch June 15, 2021 15:22
sjperkins pushed a commit to sjperkins/arrow that referenced this pull request Jun 23, 2021
…reader

This adds an OptionalParallelForAsync which lets us have per-row-group parallelism without nested parallelism in the async Parquet reader. This also uses TransferAlways, taking care of ARROW-12916. `enable_parallel_column_conversion` is kept as it still affects the threaded scanner.

Closes apache#10482 from lidavidm/arrow-12597

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>