feat: support general compression zstd/lz4 in blocks #4900
westonpace merged 1 commit into lance-format:main
Conversation
Codecov Report

❌ Patch coverage details and impacted files:

```
@@ Coverage Diff @@
##             main    #4900      +/-   ##
==========================================
- Coverage   81.67%   81.64%   -0.04%
==========================================
  Files         334      334
  Lines      132492   132595     +103
==========================================
+ Hits       108215   108257      +42
- Misses      20640    20698      +58
- Partials     3637     3640       +3
```
This looks good!
I think there are two general issues we will need to tackle before we merge.
First, we need to add a check that the version is >= 2.2. Now that 2.1 is stable we can't make forwards-compatibility breaking changes. In other words, we can't have a 2.1 writer start writing with general block compression because the oldest 2.1 readers won't be able to read it.
To do this we can add version: LanceFileVersion to DefaultCompressionStrategy.
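As a minimal sketch of that version gate, here is one way the check could look. The variant names, the `version` field, and `general_compression_allowed` are illustrative stand-ins for the real `LanceFileVersion` and `DefaultCompressionStrategy` types, not the actual Lance API.

```rust
// Hypothetical sketch: gate general block compression on the file version.
// Deriving Ord on the enum gives declaration-order comparison, so V2_2 > V2_1.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum LanceFileVersion {
    V2_0,
    V2_1,
    V2_2,
}

struct DefaultCompressionStrategy {
    version: LanceFileVersion,
}

impl DefaultCompressionStrategy {
    /// General block compression is only forwards-compatible from 2.2 on,
    /// so older writers must never emit it.
    fn general_compression_allowed(&self) -> bool {
        self.version >= LanceFileVersion::V2_2
    }
}

fn main() {
    let v21 = DefaultCompressionStrategy { version: LanceFileVersion::V2_1 };
    let v22 = DefaultCompressionStrategy { version: LanceFileVersion::V2_2 };
    assert!(!v21.general_compression_allowed());
    assert!(v22.general_compression_allowed());
    println!("version gate ok");
}
```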
Second, and a bit more subtle: we can't rely only on field metadata to decide whether to use general compression, because we don't actually use block compression at the field level. Block compression is unsuitable for random access, so instead we use it for individual parts that we will need to decompress in their entirety.
There are two parts we compress with block compression today:
- RepDef levels for mini-block encoding. We can see that here
Note that we create a dummy field (and so it will never have field configuration). This particular usage will probably never be a good candidate for general compression anyways (if we apply bitpacking and RLE to these levels that is almost certainly good enough).
- Dictionaries. We can see that here
Here I think we do want to apply block compression, since dictionaries can potentially be large and, being strings, can also potentially be good candidates for general compression. Unfortunately, as you can see from that link, we once again use a dummy field, and so there will never be any field_params set on the field.
I think the easiest thing to do would be to put a size criterion in here. If the data block is large (32KiB or more) then we apply general compression, even if there are no field_params.
So something like...

```rust
// User-requested compression (unused today but perhaps still used
// in the future someday)
if let Some(compression) = field_params.compression... {
    ...
}
// Automatic compression because data is large enough to justify it
if data.data_size() > 32 * 1024 {
    // Note: there is no CompressionScheme::default but there is a
    // CompressedBufferEncoder::default (see earlier comment)
    let scheme = CompressionScheme::default();
    ...
}
```
Once we do that then we can create an end-to-end test by creating some dummy data that is large but still eligible for dictionary encoding.
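To make the selection logic above concrete, here is a standalone, runnable sketch of the size criterion. The `CompressionScheme` enum and `choose_scheme` function are hypothetical names; since (as noted above) the real code has no `CompressionScheme::default`, this stand-in defines its own `Default` purely so the sketch compiles.

```rust
// Stand-in for the compression scheme enum discussed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CompressionScheme {
    None,
    Lz4,
    Zstd,
}

impl Default for CompressionScheme {
    fn default() -> Self {
        CompressionScheme::Lz4
    }
}

/// Pick a scheme: honor an explicit user request first, otherwise apply
/// the default scheme only when the block is large enough (> 32 KiB) to
/// justify general compression, even with no field_params set.
fn choose_scheme(requested: Option<CompressionScheme>, data_size: usize) -> CompressionScheme {
    if let Some(scheme) = requested {
        return scheme;
    }
    if data_size > 32 * 1024 {
        return CompressionScheme::default();
    }
    CompressionScheme::None
}

fn main() {
    // Explicit request always wins, regardless of size.
    assert_eq!(choose_scheme(Some(CompressionScheme::Zstd), 100), CompressionScheme::Zstd);
    // Large block with no field_params: automatic compression kicks in.
    assert_eq!(choose_scheme(None, 64 * 1024), CompressionScheme::Lz4);
    // Small block with no field_params: leave it uncompressed.
    assert_eq!(choose_scheme(None, 1024), CompressionScheme::None);
    println!("scheme selection ok");
}
```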
```rust
/// A block compressor that wraps another block compressor and applies
/// general-purpose compression (LZ4, Zstd) to the resulting buffer.
#[derive(Debug)]
pub struct GeneralBlockCompressor {
    inner: Box<dyn BlockCompressor>,
    compression: CompressionConfig,
}

impl GeneralBlockCompressor {
    pub fn new(inner: Box<dyn BlockCompressor>, compression: CompressionConfig) -> Self {
        Self { inner, compression }
    }
}

impl BlockCompressor for GeneralBlockCompressor {
    fn compress(&self, data: DataBlock) -> Result<LanceBuffer> {
        let compressed = self.inner.compress(data)?;

        let compressor = GeneralBufferCompressor::get_compressor(self.compression)?;
        let mut buf = vec![];
        compressor.compress(&compressed, &mut buf)?;

        Ok(LanceBuffer::from(buf))
    }
}

#[derive(Debug)]
pub struct GeneralBlockDecompressor {
    inner: Box<dyn BlockDecompressor>,
    compression: CompressionConfig,
}

impl GeneralBlockDecompressor {
    pub fn new(inner: Box<dyn BlockDecompressor>, compression: CompressionConfig) -> Self {
        Self { inner, compression }
    }
}

impl BlockDecompressor for GeneralBlockDecompressor {
    fn decompress(&self, data: LanceBuffer, num_values: u64) -> Result<DataBlock> {
        let decompressor = GeneralBufferCompressor::get_compressor(self.compression)?;
        let mut buf = vec![];
        decompressor.decompress(&data, &mut buf)?;

        self.inner.decompress(LanceBuffer::from(buf), num_values)
    }
}
```
Minor nit: I could go either way on this but this could probably be a single GeneralBlock struct that impls both traits.
Or perhaps, even better, we could just add these traits to CompressedBufferEncoder instead of making any new structs?
Yep, I gave it an attempt: I implemented both traits on CompressedBufferEncoder.
For variable-length compression it binds the value encoder to encode, and for decompression it uses BinaryBlockDecompressor to decode.
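The "one type, both traits" pattern being discussed can be illustrated with a self-contained toy. The `Compress`/`Decompress` traits and the run-length codec below are stand-ins, not the real Lance `BlockCompressor`/`BlockDecompressor` definitions or `CompressedBufferEncoder`; the point is only that a single struct can implement both sides of the codec.

```rust
trait Compress {
    fn compress(&self, data: &[u8]) -> Vec<u8>;
}

trait Decompress {
    fn decompress(&self, data: &[u8]) -> Vec<u8>;
}

/// Toy run-length codec standing in for a real buffer codec;
/// one struct implements both traits.
struct RleCodec;

impl Compress for RleCodec {
    fn compress(&self, data: &[u8]) -> Vec<u8> {
        // Emit (run_length, byte) pairs.
        let mut out = Vec::new();
        let mut i = 0;
        while i < data.len() {
            let byte = data[i];
            let mut run: u8 = 1;
            while i + (run as usize) < data.len()
                && data[i + run as usize] == byte
                && run < u8::MAX
            {
                run += 1;
            }
            out.push(run);
            out.push(byte);
            i += run as usize;
        }
        out
    }
}

impl Decompress for RleCodec {
    fn decompress(&self, data: &[u8]) -> Vec<u8> {
        // Expand each (run_length, byte) pair back out.
        data.chunks(2)
            .flat_map(|pair| std::iter::repeat(pair[1]).take(pair[0] as usize))
            .collect()
    }
}

fn main() {
    let codec = RleCodec;
    let original = b"aaabbbbcc".to_vec();
    let roundtrip = codec.decompress(&codec.compress(&original));
    assert_eq!(roundtrip, original);
    println!("round trip ok");
}
```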
```rust
let compressed = compressor.compress(block.clone()).unwrap();

// Test decompression - FIXED: Use BinaryBlockDecompressor instead of VariableDecoder
```
That was a mistake: I wrote the test with Cursor and I think it responded to my adjustment prompt :( I have written a new e2e test in rust/lance-encoding/src/previous/encodings/logical/primitive.rs
Force-pushed from 7c10ac6 to a9ee0af.
Signed-off-by: lyang24 <lanqingy93@gmail.com>
Force-pushed from a9ee0af to f789218.
westonpace left a comment:
Nice work, thanks for following up on this!
touches lance-format#4896
Signed-off-by: lyang24 <lanqingy93@gmail.com>