feat: packed struct encoding#3186
Conversation
|
the current main(lance 2.0 packed-struct encoding) doesn't work with the benchmark script in |
ff7ddfd to
72d6920
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3186 +/- ##
==========================================
+ Coverage 78.45% 78.51% +0.06%
==========================================
Files 244 245 +1
Lines 84554 84818 +264
Branches 84554 84818 +264
==========================================
+ Hits 66333 66595 +262
+ Misses 15418 15406 -12
- Partials 2803 2817 +14
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
|
packed-struct with FixedSizeList is not supported for now. |
of struct is fixed-width data block when using packed-struct encoding.
westonpace
left a comment
There was a problem hiding this comment.
A few suggestions but this approach seems like a good starting point overall.
| /// the projection will be invalid and the read will fail. | ||
| /// If the field is a `struct datatype` with `packed` set to true in the field metadata, | ||
| /// the whole struct has one column index. | ||
| /// To support nested `packed-struct encoding`, this method need to be further adjusted. |
There was a problem hiding this comment.
| /// To support nested `packed-struct encoding`, this method need to be further adjusted. |
This sentence seems redundant
There was a problem hiding this comment.
this sentence is to document nested packed struct encoding is not supported at the moment.
| { | ||
| column_indices.push(curr_column_idx); | ||
| curr_column_idx += 1; | ||
| packed_struct_fields_num = field.children.len(); |
There was a problem hiding this comment.
This won't be correct on nested structs. For example:
A-+
| |
| |
B D-+
| | |
| | |
C E F
If A is packed then field.children.len() will be 2 but there are 5 elements that need to be skipped. I think it would be easiest to just do our own DFS here.
pub fn from_whole_schema(schema: &Schema, version: LanceFileVersion) -> Self {
let schema = Arc::new(schema.clone());
let is_structural = version >= LanceFileVersion::V2_1;
let mut counter = 0;
let mut column_indices = Vec::new();
for field in schema.fields {
from_whole_field(field, &mut counter, &mut column_indices);
}
Self {
schema,
column_indices,
}
}
fn from_whole_field(field: &Field, counter: &mut u32, column_indices: &mut Vec<u32>) {
if is_packed_struct(field) {
add column to column indices
return early
}
if is_primitive or !is_structural {
add column to column indices
}
for child in field.children {
recurse
}
}
There was a problem hiding this comment.
nested packed struct is not supported for now, in fact, the packed struct encoding will panic if encounter nested packed struct so currently in 2.1 nested packed struct data can not be written.
inside PrimitiveStructuralEncoder::do_flush:
// if the `data_block` is a `StructDataBlock`, then this is a struct with packed struct encoding.
if let DataBlock::Struct(ref struct_data_block) = data_block {
if struct_data_block
.children
.iter()
.any(|child| !matches!(child, DataBlock::FixedWidth(_)))
{
panic!("packed struct encoding currently only supports fixed-width fields.")
}
}
| } | ||
| if field_metadata | ||
| .get("packed") | ||
| .map(|v| v == "true") |
There was a problem hiding this comment.
Let's make this a case insensitive comparison
| fn get_stat(&self, stat: Stat) -> Option<Arc<dyn Array>> { | ||
| match stat { | ||
| Stat::MaxLength => { | ||
| let max_len = self.dimension * self.child.expect_single_stat::<UInt64Type>(stat); |
There was a problem hiding this comment.
Does every block have MaxLength? Is it possible the FSL's child doesn't have MaxLength?
There was a problem hiding this comment.
sorry, this is vestige, statistics for fixedSizeList data block is not supported in this PR.
| Some(Arc::new(UInt64Array::from(vec![max_len]))) | ||
| } | ||
| Stat::DataSize => self.child.get_stat(stat), | ||
| _ => None, |
There was a problem hiding this comment.
removed this code completely as fixed size list statistics is not supported in this PR.
| let field_metadata = &field.metadata; | ||
| if field_metadata | ||
| .get("packed") | ||
| .map(|v| v == "true") |
There was a problem hiding this comment.
Let's make this case insensitive.
Also, this check repeats in a few places. We could maybe add is_packed_struct to FieldExt in lance-arrow/src/schema.rs
There was a problem hiding this comment.
thanks, fixed.
I added two is_packed_struct, one in lance-arrow/src/schema.rs, one in lance-core/src/datatypes.field.rs, not sure how to do only one.
| let mut children = vec![]; | ||
|
|
||
| let bytes_per_row: u32 = bits_per_values.iter().sum::<u32>() / 8; | ||
| let bytes_per_row = bytes_per_row as u64; |
There was a problem hiding this comment.
Let's add a debug assert to verify our assumption that bits_per_values.iter().all(|bpv| bpv % 8 == 0)
There was a problem hiding this comment.
fixed, thanks.
| if field_metadata | ||
| .get("packed") | ||
| .map(|v| v == "true") | ||
| .unwrap_or(false) |
There was a problem hiding this comment.
Also case insensitive (or replace with call to helper method)







This PR tries to add packed struct encoding.
During encoding, it packs a struct with fixed width fields, producing a row oriented
FixedWidthDataBlock, then useValueCompressorto compressor to aMiniBlock Layout.during decoding, it first uses
ValueDecompressorto get the row-orientedFixedWidthDataBlock, then construct aStructDataBlockfor output.#3173 #2601