Parallel Xarray record batch reading via blocks partition factory. #108
Merged
Yep, looking at the docs I think this makes sense: this creates a `TableProvider` with all your partitions, which implements `scan`, which produces an `ExecutionPlan` (without you explicitly building the `ExecutionPlan`).

One thing though: the code isn't in this PR, but in your `PartitionStream` implementation for `PyArrowStreamPartition`, inside the `try_stream!`, you are calling `Python::attach`, which basically acquires the GIL, and the GIL can only be held by one thread at a time. I think it's essentially acting as a mutex guard. So I think (and I'm not sure about this, but you could probably test on a decent-sized Zarr store and check the core usage) that while you are indeed creating partitions that will run in parallel, in practice you might only have one partition reading data at a time, with the others waiting.

Now, even if that is the case, I think what you have here, with partitions, would still have a big performance advantage over a single partition, because if you create multiple partitions from the start, downstream execution plans can leverage that. For example, say you have 4 partitions: partition 1 reads some data and streams it, then partition 2 reads data and streams it, sequentially; but while partition 2 is reading, whatever operations you have downstream can already start working on the data streamed from partition 1. So you might have a "slow" start, but as data is read, the query can start leveraging the partitions (assuming you have more operations after just reading the data).
On that note though, looking at the `PartitionStream` implementation, I think right now one partition would acquire the GIL, read all its batches, and only then let another partition acquire the GIL. It would probably be better to allow partitions to be "interleaved", i.e. partition 1 reads one batch, partition 2 one batch, ..., back to partition 1 for its second batch, and so on. I think you can accomplish that by using `allow_threads`, https://pyo3.rs/v0.9.2/parallelism. Even if just a little bit of Rust code runs within an `allow_threads`, I think it might allow a different partition to acquire the GIL.

Okay, done with my overly long ramblings! I'm not 100% sure about all of the above, but if you have some decent-sized data to test this on easily, it might be worth just trying with `allow_threads` and seeing if it impacts performance (I think you'd have to include some compute-heavy operations that can run on individual batches to see the difference, though).
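Here is a runnable sketch of that coarse-vs-fine locking contrast, again using a toy `Mutex` model rather than PyO3 itself (in the real code, the fine-grained variant would roughly correspond to doing the non-Python work inside `allow_threads` so the GIL is released between batches). Holding the lock for a partition's whole stream means its batches arrive contiguously; reacquiring it per batch lets partitions interleave.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

// Illustrative model, not the PR's code: "coarse" holds the GIL-like
// Mutex for a partition's entire stream; "fine" releases it after every
// batch so other partitions can slip in.
fn stream(n_parts: usize, n_batches: usize, coarse: bool) -> Vec<usize> {
    let gil = Arc::new(Mutex::new(()));
    let (tx, rx) = mpsc::channel::<usize>();
    let mut handles = Vec::new();
    for p in 0..n_parts {
        let gil = Arc::clone(&gil);
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            if coarse {
                // Lock held across all batches: other partitions wait
                // until this one has streamed everything.
                let _g = gil.lock().unwrap();
                for _ in 0..n_batches {
                    tx.send(p).unwrap();
                }
            } else {
                // Lock reacquired per batch: partitions can interleave.
                for _ in 0..n_batches {
                    let _g = gil.lock().unwrap();
                    tx.send(p).unwrap();
                }
            }
        }));
    }
    drop(tx);
    // Record which partition produced each batch, in arrival order.
    let order: Vec<usize> = rx.iter().collect();
    for h in handles {
        h.join().unwrap();
    }
    order
}

fn main() {
    let order = stream(3, 4, true);
    // Coarse locking: each partition's batches arrive contiguously, so a
    // run of 3 partitions has exactly 2 partition-to-partition switches.
    let switches = order.windows(2).filter(|w| w[0] != w[1]).count();
    assert_eq!(switches, 2);
    // Fine locking: same 12 batches arrive, but the order may interleave
    // depending on the scheduler, so we only check the count.
    assert_eq!(stream(3, 4, false).len(), 12);
    println!("coarse order: {:?}", order);
}
```

Whether the fine-grained variant actually interleaves (and helps) depends on the scheduler and on how long each read holds the GIL, which is why testing on a real Zarr store, as suggested above, is the only way to know.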
I think I'll look into your GIL/parallelism enhancement in a future PR! Thank you so much for the suggestion!
I will add an acceptance test to see if the GIL acquisition is really slowing us down. From my experience so far, we acquire the GIL just to schedule the partitions, but since they are Arrow streams (`Send`-like), the processing happens in parallel in DataFusion.
Hmm, yeah, I'm not sure anymore then... in any case, let me know what you find out if you test it!