-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-24833 [JS] Implement IPC RecordBatch body buffer compression #46493
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format? or See also: |
|
|
|
I like that this doesn't increase the bundle size if someone does not need compression. |
|
@Djjanks We're moving the JS implementation to https://github.com/apache/arrow-js . Could you open a PR in https://github.com/apache/arrow-js instead of apache/arrow? We can review this after we prepare CI in https://github.com/apache/arrow-js : @domoritz Could you review apache/arrow-js#12 ? This is a blocker of other issues including the above CI related issues because |
|
@kou Without any problem. Should I close current PR? |
|
Yes. But could you do it after we move to apache/arrow-js? FYI: We'll transfer the JavaScript related open issues in apache/arrow to apache/arrow-js: apache/arrow-js#13 |
|
Sorry but I don't understand what I have to do after moving to apache/arrow-js? Open new PR in apache/arrow-js or close PR in apache/arrow? |
|
|
|
Please open new PR in apache/arrow-js AND THEN close PR in apache/arrow. |
|
Okay, I'm waiting for the move to apache/arrow-js to be done. |
|
|
|
|
Rationale for this change
This change introduces support for reading compressed Arrow IPC streams in JavaScript. The primary motivation is the need to read Arrow IPC Stream in the browser when they are transmitted over the network in a compressed format to reduce network load.
Several reasons support this enhancement:
What changes are included in this PR?
Additional notes:
Not all JavaScript LZ4 libraries are compatible with the Arrow IPC format. For example:
This can result in silent or cryptic errors. To improve developer experience, we could:
After decompressing the buffers, new BufferRegion entries are calculated to match the uncompressed data layout. A new metadata.RecordBatch is constructed with the updated buffer regions and passed into _loadVectors().
This introduces a mutation-like pattern that may break assumptions in the current design. However, it's necessary because:
When reconstructing the metadata, the compression field is explicitly set to null, since the data is already decompressed in memory.
This decision is somewhat debatable — feedback is welcome on whether it's better to retain the original compression metadata or to reflect the current state of the buffer (uncompressed). The current implementation assumes the latter.
Are these changes tested?
Are there any user-facing changes?
Yes, Arrow JS users can now read compressed IPC stream, assuming they register an appropriate codec using compressionRegistry.set().
Example:
This change does not affect writing or serialization.
This PR includes breaking changes to public APIs.
No. The change adds functionality but does not modify any existing API behavior.
This PR contains a "Critical Fix".
No. This is a new feature, not a critical fix.
Checklist
yarn test)yarn build)