
Conversation


@Djjanks Djjanks commented May 18, 2025

Rationale for this change

This change introduces support for reading compressed Arrow IPC streams in JavaScript. The primary motivation is the need to read Arrow IPC streams in the browser when they are transmitted over the network in a compressed format to reduce network load.

Several reasons support this enhancement:

  • A personal need in another project to read compressed Arrow IPC streams.
  • Community demand, as seen in Issue apache/arrow-js#109.
  • A similar implementation was attempted in PR apache/arrow#13076 but was never merged; I am very grateful to @kylebarron for that work.
  • Other language implementations (e.g., C++, Python, Rust) already support IPC compression.

What changes are included in this PR?

  • Support for decoding compressed RecordBatch buffers during reading.
  • Each buffer is decompressed individually, offsets are recalculated with 8-byte alignment, and a new metadata.RecordBatch is constructed before loading vectors.
  • Only decompression is implemented; compression (writing) is not supported yet.
  • Currently tested with the lz4 codec using the lz4js library. lz4-wasm was evaluated but rejected due to incompatibility with LZ4 Frame format.
  • The decompression logic is isolated to _loadRecordBatch() in the RecordBatchReaderImpl class.
  • A codec.decode function is retrieved from the compressionRegistry and applied per buffer, so users can choose a suitable library.
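
The per-buffer flow described above can be sketched roughly as follows. The names (`decompressBuffers`, `BufferRegion`) are illustrative, not the PR's exact internals; the buffer layout follows the Arrow IPC spec, where each compressed buffer is prefixed with its uncompressed length as a little-endian int64, and -1 signals the body is stored uncompressed:

```typescript
// Hedged sketch of per-buffer decompression with 8-byte-aligned offset
// recalculation. `BufferRegion` mirrors the Arrow metadata shape.
interface BufferRegion { offset: number; length: number; }

const LENGTH_NO_COMPRESSED_DATA = -1n; // sentinel: body stored uncompressed

function decompressBuffers(
    body: Uint8Array,
    regions: BufferRegion[],
    decode: (data: Uint8Array) => Uint8Array
): { body: Uint8Array; regions: BufferRegion[] } {
    const parts: Uint8Array[] = [];
    const newRegions: BufferRegion[] = [];
    let offset = 0;
    for (const { offset: start, length } of regions) {
        let data: Uint8Array;
        if (length === 0) {
            data = new Uint8Array(0);
        } else {
            // First 8 bytes: uncompressed length as a little-endian int64.
            const view = new DataView(body.buffer, body.byteOffset + start, 8);
            const uncompressedLength = view.getBigInt64(0, true);
            const rest = body.subarray(start + 8, start + length);
            data = uncompressedLength === LENGTH_NO_COMPRESSED_DATA
                ? rest          // -1 sentinel: data was left uncompressed
                : decode(rest); // otherwise run the registered codec
        }
        newRegions.push({ offset, length: data.length });
        parts.push(data);
        offset += (data.length + 7) & ~7; // pad to 8-byte alignment
    }
    const out = new Uint8Array(offset);
    let pos = 0;
    for (const p of parts) { out.set(p, pos); pos += (p.length + 7) & ~7; }
    return { body: out, regions: newRegions };
}
```

The recalculated regions are what the reconstructed metadata.RecordBatch (note 2 below) is built from.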

Additional notes:

  1. Codec compatibility caveats
    Not all JavaScript LZ4 libraries are compatible with the Arrow IPC format. For example:
  • lz4js works correctly as it supports the LZ4 Frame Format.
  • lz4-wasm is not compatible, as it expects raw LZ4 blocks and fails to decompress LZ4 frame data.
    This can result in silent or cryptic errors. To improve developer experience, we could:
  • Wrap codec.decode calls in try/catch and surface a clearer error message if decompression fails.
  • Add an optional check in compressionRegistry.set() to validate that the codec supports LZ4 Frame Format. One way would be to compress dummy data and inspect the first 4 bytes for the expected LZ4 Frame magic header (0x04 0x22 0x4D 0x18).
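
The magic-header validation idea could look roughly like this (a sketch only; `looksLikeLz4Frame` is a hypothetical helper, not part of the PR):

```typescript
// Validate a candidate codec by compressing dummy data and checking
// for the LZ4 Frame magic number (0x04 0x22 0x4D 0x18).
const LZ4_FRAME_MAGIC = [0x04, 0x22, 0x4d, 0x18];

function looksLikeLz4Frame(encode: (data: Uint8Array) => Uint8Array): boolean {
    const compressed = encode(new Uint8Array([1, 2, 3, 4]));
    return compressed.length >= 4 &&
        LZ4_FRAME_MAGIC.every((byte, i) => compressed[i] === byte);
}
```

A raw-block codec such as lz4-wasm would fail this check, which would surface the incompatibility at registration time rather than as a cryptic decompression error later.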
  2. Reconstruction of metadata.RecordBatch
    After decompressing the buffers, new BufferRegion entries are calculated to match the uncompressed data layout. A new metadata.RecordBatch is constructed with the updated buffer regions and passed into _loadVectors().
    This introduces a mutation-like pattern that may break assumptions in the current design. However, it's necessary because:
  • _loadVectors() depends strictly on the offsets in header.buffers, which no longer match the decompressed buffer layout.
  • Without changing either _loadVectors() or metadata.RecordBatch, the current approach is the least intrusive.
  3. Setting compression = null in new RecordBatch
    When reconstructing the metadata, the compression field is explicitly set to null, since the data is already decompressed in memory.
    This decision is somewhat debatable — feedback is welcome on whether it's better to retain the original compression metadata or to reflect the current state of the buffer (uncompressed). The current implementation assumes the latter.

Are these changes tested?

  • The changes were tested in my own project using LZ4-compressed Arrow streams.
  • Tested with uncompressed, compressed, and pseudo-compressed (uncompressed data length = -1) data.
  • No unit tests are included in this PR yet.
  • The decompression was verified with real-world data and the lz4js codec (lz4-wasm is not compatible).
  • No issues were observed with alignment, vector loading, or decompression integrity.
  • Exception handling is not yet added around codec.decode. This may be useful for catching codec incompatibility and providing better user feedback.
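
The suggested exception handling could look roughly like this (a sketch; `safeDecode` is a hypothetical helper name, not code from the PR):

```typescript
// Wrap codec.decode so a codec incompatibility (e.g. a raw-block LZ4
// library fed LZ4 Frame data) surfaces as a clear, actionable error.
function safeDecode(
    decode: (data: Uint8Array) => Uint8Array,
    data: Uint8Array
): Uint8Array {
    try {
        return decode(data);
    } catch (e) {
        throw new Error(
            'Failed to decompress Arrow IPC buffer. Check that the registered ' +
            'codec supports the LZ4 Frame format (e.g. lz4js rather than a ' +
            'raw-block implementation like lz4-wasm). Original error: ' +
            (e instanceof Error ? e.message : String(e))
        );
    }
}
```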

Are there any user-facing changes?

Yes. Arrow JS users can now read compressed IPC streams, assuming they register an appropriate codec using compressionRegistry.set().

Example:

import { Codec, compressionRegistry, CompressionType } from 'apache-arrow';
import * as lz4js from 'lz4js';

const lz4Codec: Codec = {
    encode(data: Uint8Array): Uint8Array { return lz4js.compress(data); },
    decode(data: Uint8Array): Uint8Array { return lz4js.decompress(data); }
};

compressionRegistry.set(CompressionType.LZ4_FRAME, lz4Codec);

This change does not affect writing or serialization.

This PR includes breaking changes to public APIs.
No. The change adds functionality but does not modify any existing API behavior.

This PR contains a "Critical Fix".
No. This is a new feature, not a critical fix.

Checklist

  • All tests pass (yarn test)
  • Build completes (yarn build)
  • I have added a new test for compressed batches

@Djjanks Djjanks requested review from domoritz and trxcllnt as code owners May 18, 2025 15:40
@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}


@Djjanks Djjanks changed the title Implement IPC RecordBatch body buffer compression GH-24833 [JS] Implement IPC RecordBatch body buffer compression May 18, 2025
@github-actions

⚠️ GitHub issue apache/arrow-js#109 has been automatically assigned in GitHub to PR creator.

@Djjanks Djjanks closed this May 18, 2025
@Djjanks Djjanks reopened this May 18, 2025
@domoritz
Member

I like that this doesn't increase the bundle size if someone does not need compression.

@kou
Member

kou commented May 19, 2025

@Djjanks We're moving the JS implementation to https://github.com/apache/arrow-js . Could you open a PR in https://github.com/apache/arrow-js instead of apache/arrow?

We can review this after we prepare CI in https://github.com/apache/arrow-js :

@domoritz Could you review apache/arrow-js#12 ? This is a blocker of other issues including the above CI related issues because LICENSE.txt is needed for yarn build.

@Djjanks
Author

Djjanks commented May 19, 2025

@kou No problem at all. Should I close the current PR?

@kou
Member

kou commented May 19, 2025

Yes. But could you do it after we move to apache/arrow-js?

FYI: We'll transfer the JavaScript related open issues in apache/arrow to apache/arrow-js: apache/arrow-js#13

@Djjanks
Author

Djjanks commented May 19, 2025

Sorry, but I don't understand what I should do after the move to apache/arrow-js: open a new PR in apache/arrow-js, or close the PR in apache/arrow?


@kou
Member

kou commented May 19, 2025

Please open new PR in apache/arrow-js AND THEN close PR in apache/arrow.

@Djjanks Djjanks closed this May 19, 2025
@Djjanks
Author

Djjanks commented May 19, 2025

Okay, I'm waiting for the move to apache/arrow-js to be done.

@Djjanks Djjanks reopened this May 19, 2025
@Djjanks Djjanks closed this May 20, 2025
@github-actions

⚠️ GitHub issue #109 has been automatically assigned in GitHub to PR creator.

@github-actions

⚠️ GitHub issue #24833 has no components, please add labels for components.
