RFC P2P pass arrow tables directly to buffers #7992
Closed
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
I don't have any proper measurements but this is an attempt of fixing #7990 by not converting anything to bytes prematurely and passing pyarrow tables around as far as possible. Haven't tested this with string data or anything like this so I don't know if the event loop survives this.
The most troubling aspect of this is that
pa.Table.nbytesactually takes a surprising amount of time and this is currently where I loose performance (also note: the daks.sizeof implementation for pa.Table is misleading for this use case. The sizeof basically returns pa.Table.get_total_buffer_size which is vastly overestimating the size of our slices since the slices are just views of the larger tables. This seems to be the bottleneck of this implementation atmcc @hendrikmakait