feat: support general compression zstd/lz4 in blocks#4900

Merged
westonpace merged 1 commit into lance-format:main from lyang24:general_block_compression
Oct 14, 2025
Conversation

@lyang24
Contributor

@lyang24 lyang24 commented Oct 5, 2025

touches #4896

@github-actions github-actions Bot added the enhancement New feature or request label Oct 5, 2025
@codecov-commenter

codecov-commenter commented Oct 5, 2025

Codecov Report

❌ Patch coverage is 42.34234% with 64 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.64%. Comparing base (7e65e8b) to head (f789218).

Files with missing lines Patch % Lines
rust/lance-encoding/src/compression.rs 46.55% 28 Missing and 3 partials ⚠️
...ust/lance-encoding/src/encodings/physical/block.rs 0.00% 20 Missing ⚠️
.../lance-encoding/src/encodings/logical/primitive.rs 60.60% 12 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4900      +/-   ##
==========================================
- Coverage   81.67%   81.64%   -0.04%     
==========================================
  Files         334      334              
  Lines      132492   132595     +103     
  Branches   132492   132595     +103     
==========================================
+ Hits       108215   108257      +42     
- Misses      20640    20698      +58     
- Partials     3637     3640       +3     
Flag Coverage Δ
unittests 81.64% <42.34%> (-0.04%) ⬇️


Member

@westonpace westonpace left a comment


This looks good!

I think there are two general issues we will need to tackle before we merge.

First, we need to add a check that the version is >= 2.2. Now that 2.1 is stable we can't make forwards-compatibility breaking changes. In other words, we can't have a 2.1 writer start writing with general block compression, because the oldest 2.1 readers won't be able to read it.

To do this we can add a version: LanceFileVersion field to DefaultCompressionStrategy.
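As a rough sketch of that version gate (the names below are simplified stand-ins; the real LanceFileVersion enum and DefaultCompressionStrategy struct in the lance crates have more variants and fields):

```rust
// Hypothetical sketch of the >= 2.2 version gate described above.
// Fieldless enums derive PartialOrd in declaration order, so V2_0 < V2_1 < V2_2.
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
enum LanceFileVersion {
    V2_0,
    V2_1,
    V2_2,
}

struct DefaultCompressionStrategy {
    version: LanceFileVersion,
}

impl DefaultCompressionStrategy {
    // General block compression breaks forwards compatibility, so only
    // allow it when writing version 2.2 or newer.
    fn general_compression_allowed(&self) -> bool {
        self.version >= LanceFileVersion::V2_2
    }
}

fn main() {
    let v21 = DefaultCompressionStrategy { version: LanceFileVersion::V2_1 };
    let v22 = DefaultCompressionStrategy { version: LanceFileVersion::V2_2 };
    assert!(!v21.general_compression_allowed());
    assert!(v22.general_compression_allowed());
    println!("v2.1 allowed: {}, v2.2 allowed: {}",
        v21.general_compression_allowed(),
        v22.general_compression_allowed());
}
```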

Second, and a bit more subtle, is that we can't only rely on field metadata to decide we should use general compression. It's a bit subtle but we don't actually use block compression at the field level. This is because it is unsuitable for random access. Instead, we use block compression for individual parts that we will need to decompress in their entirety.

There are two parts we compress with block compression today:

  1. RepDef levels for mini-block encoding. We can see that here

Note that we create a dummy field (and so it will never have field configuration). This particular usage will probably never be a good candidate for general compression anyways (if we apply bitpacking and RLE to these levels that is almost certainly good enough).

  2. Dictionaries. We can see that here

Here I think we do want to apply block compression, since dictionaries can potentially be large and, being strings, can also potentially be good candidates for general compression. Unfortunately, as you can see from that link, we once again use a dummy field, and so there will never be any field_params set on the field.

I think the easiest thing to do would be to put a size criterion in here. If the data block is large (32KiB or more) then we apply general compression, even if there are no field_params.

So something like...

// User-requested compression (unused today but perhaps still used
// in the future someday)
if let Some(compression) = field_params.compression... {
  ...
}
// Automatic compression because data is large enough to justify it
if data.data_size() > 32 * 1024 {
  // Note: there is no CompressionScheme::default but there is a
  // CompressedBufferEncoder::default (see earlier comment)
  let scheme = CompressionScheme::default();
  ...
}
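A runnable version of that pseudocode might look like the following. The enum and function names are illustrative stand-ins, not the real lance-encoding API, and Zstd is used here as a placeholder for whatever scheme CompressedBufferEncoder::default() would pick:

```rust
// Hypothetical sketch of the size-based fallback: honor user-requested
// compression first, otherwise enable general compression automatically
// once the block exceeds 32 KiB.
#[derive(Debug, Clone, Copy, PartialEq)]
enum CompressionScheme {
    None,
    Lz4,
    Zstd,
}

fn choose_scheme(
    field_params_compression: Option<CompressionScheme>,
    data_size: u64,
) -> CompressionScheme {
    // User-requested compression (unused today but perhaps used in the future).
    if let Some(scheme) = field_params_compression {
        return scheme;
    }
    // Automatic compression because the data is large enough to justify it.
    if data_size > 32 * 1024 {
        // Stand-in for the scheme CompressedBufferEncoder::default() would use.
        return CompressionScheme::Zstd;
    }
    CompressionScheme::None
}

fn main() {
    assert_eq!(choose_scheme(None, 1_000), CompressionScheme::None);
    assert_eq!(choose_scheme(None, 64 * 1024), CompressionScheme::Zstd);
    assert_eq!(
        choose_scheme(Some(CompressionScheme::Lz4), 1_000),
        CompressionScheme::Lz4
    );
    println!("ok");
}
```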

Once we do that then we can create an end-to-end test by creating some dummy data that is large but still eligible for dictionary encoding.

Comment thread rust/lance-encoding/src/compression.rs Outdated
Comment on lines +137 to +183
/// A block compressor that wraps another block compressor and applies
/// general-purpose compression (LZ4, Zstd) to the resulting buffer.
#[derive(Debug)]
pub struct GeneralBlockCompressor {
    inner: Box<dyn BlockCompressor>,
    compression: CompressionConfig,
}

impl GeneralBlockCompressor {
    pub fn new(inner: Box<dyn BlockCompressor>, compression: CompressionConfig) -> Self {
        Self { inner, compression }
    }
}

impl BlockCompressor for GeneralBlockCompressor {
    fn compress(&self, data: DataBlock) -> Result<LanceBuffer> {
        let compressed = self.inner.compress(data)?;

        let compressor = GeneralBufferCompressor::get_compressor(self.compression)?;
        let mut buf = vec![];
        compressor.compress(&compressed, &mut buf)?;

        Ok(LanceBuffer::from(buf))
    }
}

#[derive(Debug)]
pub struct GeneralBlockDecompressor {
    inner: Box<dyn BlockDecompressor>,
    compression: CompressionConfig,
}

impl GeneralBlockDecompressor {
    pub fn new(inner: Box<dyn BlockDecompressor>, compression: CompressionConfig) -> Self {
        Self { inner, compression }
    }
}

impl BlockDecompressor for GeneralBlockDecompressor {
    fn decompress(&self, data: LanceBuffer, num_values: u64) -> Result<DataBlock> {
        let decompressor = GeneralBufferCompressor::get_compressor(self.compression)?;
        let mut buf = vec![];
        decompressor.decompress(&data, &mut buf)?;

        self.inner.decompress(LanceBuffer::from(buf), num_values)
    }
}
Member


Minor nit: I could go either way on this but this could probably be a single GeneralBlock struct that impls both traits.

Or perhaps, even better, we could just add these traits to CompressedBufferEncoder instead of making any new structs?
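That suggestion, one type implementing both block traits instead of two wrapper structs, can be sketched as follows. The traits here are simplified stand-ins for lance-encoding's BlockCompressor / BlockDecompressor, and byte reversal is a toy transform so the roundtrip is observable without pulling in real lz4/zstd codecs:

```rust
// Hypothetical sketch: a single CompressedBufferEncoder implementing both
// a compressor trait and a decompressor trait.
trait BlockCompressor {
    fn compress(&self, data: Vec<u8>) -> Vec<u8>;
}

trait BlockDecompressor {
    fn decompress(&self, data: Vec<u8>) -> Vec<u8>;
}

struct CompressedBufferEncoder;

impl BlockCompressor for CompressedBufferEncoder {
    fn compress(&self, mut data: Vec<u8>) -> Vec<u8> {
        data.reverse(); // placeholder for the real compression step
        data
    }
}

impl BlockDecompressor for CompressedBufferEncoder {
    fn decompress(&self, mut data: Vec<u8>) -> Vec<u8> {
        data.reverse(); // placeholder for the real decompression step
        data
    }
}

fn main() {
    let enc = CompressedBufferEncoder;
    let original = b"hello world".to_vec();
    // Compress then decompress through the same value; both traits resolve
    // on the one struct.
    let roundtrip = enc.decompress(enc.compress(original.clone()));
    assert_eq!(roundtrip, original);
    println!("roundtrip ok");
}
```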

Contributor Author


Yep, here is my attempt at implementing both traits on CompressedBufferEncoder.
For variable-length compression it binds the ValueEncoder to encode;
for decompression it uses BinaryBlockDecompressor to decode.


let compressed = compressor.compress(block.clone()).unwrap();

// Test decompression - FIXED: Use BinaryBlockDecompressor instead of VariableDecoder
Member


What does FIXED mean here?

Contributor Author


It's a mistake; I wrote the test with Cursor and I think it responded to my adjustment prompt :( I wrote a new e2e test in rust/lance-encoding/src/previous/encodings/logical/primitive.rs

@lyang24 lyang24 force-pushed the general_block_compression branch 2 times, most recently from 7c10ac6 to a9ee0af on October 9, 2025 06:20
Signed-off-by: lyang24 <lanqingy93@gmail.com>
@lyang24 lyang24 force-pushed the general_block_compression branch from a9ee0af to f789218 on October 12, 2025 06:09
@lyang24 lyang24 requested a review from westonpace October 12, 2025 18:50
Member

@westonpace westonpace left a comment


Nice work, thanks for following up on this!

@westonpace westonpace merged commit 911bf20 into lance-format:main Oct 14, 2025
28 of 29 checks passed
jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026

Labels

enhancement New feature or request

3 participants