feat: support general compression zstd/lz4 in blocks #4900
westonpace merged 1 commit into lance-format:main
Conversation
Codecov Report

❌ Patch coverage details and impacted files:

```
@@ Coverage Diff @@
##             main    #4900      +/-   ##
==========================================
- Coverage   81.67%   81.64%   -0.04%
==========================================
  Files         334      334
  Lines      132492   132595     +103
==========================================
+ Hits       108215   108257      +42
- Misses      20640    20698      +58
- Partials     3637     3640       +3
```
This looks good!
I think there are two general issues we will need to tackle before we merge.
First, we need to add a check that the version is >= 2.2. Now that 2.1 is stable we can't make forwards-compatibility breaking changes. In other words, we can't have a 2.1 writer start writing with general block compression because the oldest 2.1 readers won't be able to read it.
To do this we can add version: LanceFileVersion to DefaultCompressionStrategy.
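As a minimal sketch of that version gate, here is one way the check could look. The variant names, the `version` field, and `general_compression_allowed` are illustrative stand-ins for the real `LanceFileVersion` and `DefaultCompressionStrategy` types, not the actual Lance API.

```rust
// Hypothetical sketch: gate general block compression on the file version.
// Deriving Ord on the enum gives declaration-order comparison, so V2_2 > V2_1.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum LanceFileVersion {
    V2_0,
    V2_1,
    V2_2,
}

struct DefaultCompressionStrategy {
    version: LanceFileVersion,
}

impl DefaultCompressionStrategy {
    /// General block compression is only forwards-compatible from 2.2 on,
    /// so older writers must never emit it.
    fn general_compression_allowed(&self) -> bool {
        self.version >= LanceFileVersion::V2_2
    }
}

fn main() {
    let v21 = DefaultCompressionStrategy { version: LanceFileVersion::V2_1 };
    let v22 = DefaultCompressionStrategy { version: LanceFileVersion::V2_2 };
    assert!(!v21.general_compression_allowed());
    assert!(v22.general_compression_allowed());
    println!("version gate ok");
}
```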
Second, and a bit more subtle: we can't rely only on field metadata to decide whether to use general compression, because we don't actually use block compression at the field level. Block compression is unsuitable for random access, so instead we use it for individual parts that we will need to decompress in their entirety.
There are two parts we compress with block compression today:
- RepDef levels for mini-block encoding. We can see that here
Note that we create a dummy field (and so it will never have field configuration). This particular usage will probably never be a good candidate for general compression anyways (if we apply bitpacking and RLE to these levels that is almost certainly good enough).
- Dictionaries. We can see that here
Here I think we do want to apply block compression, since dictionaries can potentially be large and, being strings, can also potentially be good candidates for general compression. Unfortunately, as you can see from that link, we once again use a dummy field, and so there will never be any field_params set on the field.
I think the easiest thing to do would be to put a size criterion in here. If the data block is large (32KiB or more) then we apply general compression, even if there are no field_params.
So something like...

```rust
// User-requested compression (unused today but perhaps still used
// in the future someday)
if let Some(compression) = field_params.compression... {
    ...
}
// Automatic compression because data is large enough to justify it
if data.data_size() > 32 * 1024 {
    // Note: there is no CompressionScheme::default but there is a
    // CompressedBufferEncoder::default (see earlier comment)
    let scheme = CompressionScheme::default();
    ...
}
```
Once we do that then we can create an end-to-end test by creating some dummy data that is large but still eligible for dictionary encoding.
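To make the selection logic above concrete, here is a standalone, runnable sketch of the size criterion. The `CompressionScheme` enum and `choose_scheme` function are hypothetical names; since (as noted above) the real code has no `CompressionScheme::default`, this stand-in defines its own `Default` purely so the sketch compiles.

```rust
// Stand-in for the compression scheme enum discussed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CompressionScheme {
    None,
    Lz4,
    Zstd,
}

impl Default for CompressionScheme {
    fn default() -> Self {
        CompressionScheme::Lz4
    }
}

/// Pick a scheme: honor an explicit user request first, otherwise apply
/// the default scheme only when the block is large enough (> 32 KiB) to
/// justify general compression, even with no field_params set.
fn choose_scheme(requested: Option<CompressionScheme>, data_size: usize) -> CompressionScheme {
    if let Some(scheme) = requested {
        return scheme;
    }
    if data_size > 32 * 1024 {
        return CompressionScheme::default();
    }
    CompressionScheme::None
}

fn main() {
    // Explicit request always wins, regardless of size.
    assert_eq!(choose_scheme(Some(CompressionScheme::Zstd), 100), CompressionScheme::Zstd);
    // Large block with no field_params: automatic compression kicks in.
    assert_eq!(choose_scheme(None, 64 * 1024), CompressionScheme::Lz4);
    // Small block with no field_params: leave it uncompressed.
    assert_eq!(choose_scheme(None, 1024), CompressionScheme::None);
    println!("scheme selection ok");
}
```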
```rust
/// A block compressor that wraps another block compressor and applies
/// general-purpose compression (LZ4, Zstd) to the resulting buffer.
#[derive(Debug)]
pub struct GeneralBlockCompressor {
    inner: Box<dyn BlockCompressor>,
    compression: CompressionConfig,
}

impl GeneralBlockCompressor {
    pub fn new(inner: Box<dyn BlockCompressor>, compression: CompressionConfig) -> Self {
        Self { inner, compression }
    }
}

impl BlockCompressor for GeneralBlockCompressor {
    fn compress(&self, data: DataBlock) -> Result<LanceBuffer> {
        let compressed = self.inner.compress(data)?;

        let compressor = GeneralBufferCompressor::get_compressor(self.compression)?;
        let mut buf = vec![];
        compressor.compress(&compressed, &mut buf)?;

        Ok(LanceBuffer::from(buf))
    }
}

#[derive(Debug)]
pub struct GeneralBlockDecompressor {
    inner: Box<dyn BlockDecompressor>,
    compression: CompressionConfig,
}

impl GeneralBlockDecompressor {
    pub fn new(inner: Box<dyn BlockDecompressor>, compression: CompressionConfig) -> Self {
        Self { inner, compression }
    }
}

impl BlockDecompressor for GeneralBlockDecompressor {
    fn decompress(&self, data: LanceBuffer, num_values: u64) -> Result<DataBlock> {
        let decompressor = GeneralBufferCompressor::get_compressor(self.compression)?;
        let mut buf = vec![];
        decompressor.decompress(&data, &mut buf)?;

        self.inner.decompress(LanceBuffer::from(buf), num_values)
    }
}
```
Minor nit: I could go either way on this but this could probably be a single GeneralBlock struct that impls both traits.
Or perhaps, even better, we could just add these traits to CompressedBufferEncoder instead of making any new structs?
Yep, I gave it an attempt: I implemented both traits on CompressedBufferEncoder.
For variable-length compression it binds the value encoder to encode, and for decompression it uses BinaryBlockDecompressor to decode.
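The "one type, both traits" pattern being discussed can be illustrated with a self-contained toy. The `Compress`/`Decompress` traits and the run-length codec below are stand-ins, not the real Lance `BlockCompressor`/`BlockDecompressor` definitions or `CompressedBufferEncoder`; the point is only that a single struct can implement both sides of the codec.

```rust
trait Compress {
    fn compress(&self, data: &[u8]) -> Vec<u8>;
}

trait Decompress {
    fn decompress(&self, data: &[u8]) -> Vec<u8>;
}

/// Toy run-length codec standing in for a real buffer codec;
/// one struct implements both traits.
struct RleCodec;

impl Compress for RleCodec {
    fn compress(&self, data: &[u8]) -> Vec<u8> {
        // Emit (run_length, byte) pairs.
        let mut out = Vec::new();
        let mut i = 0;
        while i < data.len() {
            let byte = data[i];
            let mut run: u8 = 1;
            while i + (run as usize) < data.len()
                && data[i + run as usize] == byte
                && run < u8::MAX
            {
                run += 1;
            }
            out.push(run);
            out.push(byte);
            i += run as usize;
        }
        out
    }
}

impl Decompress for RleCodec {
    fn decompress(&self, data: &[u8]) -> Vec<u8> {
        // Expand each (run_length, byte) pair back out.
        data.chunks(2)
            .flat_map(|pair| std::iter::repeat(pair[1]).take(pair[0] as usize))
            .collect()
    }
}

fn main() {
    let codec = RleCodec;
    let original = b"aaabbbbcc".to_vec();
    let roundtrip = codec.decompress(&codec.compress(&original));
    assert_eq!(roundtrip, original);
    println!("round trip ok");
}
```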
```rust
let compressed = compressor.compress(block.clone()).unwrap();

// Test decompression - FIXED: Use BinaryBlockDecompressor instead of VariableDecoder
```
That was a mistake: I wrote the test with Cursor and I think it responded to my adjustment prompt :( I have written a new e2e test in rust/lance-encoding/src/previous/encodings/logical/primitive.rs
Force-pushed from 7c10ac6 to a9ee0af.
Signed-off-by: lyang24 <lanqingy93@gmail.com>
Force-pushed from a9ee0af to f789218.
westonpace left a comment:
Nice work, thanks for following up on this!
touches lance-format#4896
Signed-off-by: lyang24 <lanqingy93@gmail.com>