Summary
When reading fragments where fields are stored out of order (scrambled fields array in DataFile metadata), the project_by_schema function fails to reorder fields inside List<Struct> columns. This causes Arrow validation errors when constructing the final StructArray.
Error Message
Invalid argument error: Incorrect datatype for StructArray field "list",
  expected List(Field { name: "item", data_type: Struct([
    Field { name: "field_a", ... },
    Field { name: "field_b", ... }
  ]), ... })
  got List(Field { name: "item", data_type: Struct([
    Field { name: "field_b", ... }, // <-- order is swapped
    Field { name: "field_a", ... }
  ]), ... })
Root Cause
The project function in rust/lance-arrow/src/lib.rs:798-827 recursively handles Struct fields but not List<Struct>:
fn project(struct_array: &StructArray, fields: &Fields) -> Result<StructArray> {
    for field in fields.iter() {
        if let Some(col) = struct_array.column_by_name(field.name()) {
            match field.data_type() {
                // TODO handle list-of-struct <-- acknowledged but not implemented
                DataType::Struct(subfields) => {
                    let projected = project(col.as_struct(), subfields)?;
                    columns.push(Arc::new(projected));
                }
                _ => {
                    columns.push(col.clone()); // List<Struct> falls through here
                }
            }
        }
    }
    // ...
}
Conditions to Trigger
The bug requires all of the following:
- Out-of-order field storage: A fragment where DataFile.fields is not in sequential order (e.g., [2, 8, 1, 5, ...] instead of [0, 1, 2, 3, ...])
- Schema with List<Struct>: A column with nested structure like struct<list: list<struct<...>>>
- Schema evolution (optional but common): Missing fields that require null-filling, triggering the merge + project code path
How It Happens
- Fragment is written with fields stored in non-sequential order (this can happen legitimately)
- When reading, the file reader returns data with inner struct fields in file order
- project_by_schema is called to reorder columns to match the output schema
- Top-level and direct Struct fields are reordered correctly
- Fields inside List<Struct> are NOT reordered (bug)
- Arrow's StructArray::new() validation fails due to field/column order mismatch
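The last step can be illustrated with a toy type model. This is a minimal sketch, not arrow-rs code: the Ty enum and the list_of_struct helper are hypothetical stand-ins for Arrow's DataType. It only shows that a struct type is an ordered list of children, so two list-of-struct types with the same children in a different order compare as unequal, which is the kind of mismatch the error message above reports.

```rust
// Simplified stand-in for Arrow's DataType (hypothetical, for illustration only).
#[derive(Debug, Clone, PartialEq)]
enum Ty {
    Int,
    // Ordered (name, type) pairs: order is part of the type's identity.
    Struct(Vec<(String, Ty)>),
    List(Box<Ty>),
}

// Hypothetical helper: build list<struct<...>> whose children appear in the
// given order, each child being a plain Int.
fn list_of_struct(names: &[&str]) -> Ty {
    let children = names.iter().map(|n| (n.to_string(), Ty::Int)).collect();
    Ty::List(Box::new(Ty::Struct(children)))
}
```

Comparing list_of_struct(&["field_a", "field_b"]) with list_of_struct(&["field_b", "field_a"]) yields inequality in this model, even though both contain the same children.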
Reproduction
Fragment metadata showing the issue:
# Good fragment - fields in order
>>> frags[0].metadata.files[0]
DataFile(fields=[0, 1, 2, 3, 4, ...], column_indices=[0, 1, 2, 3, 4, ...], ...)
# Bad fragment - fields out of order
>>> frags[3].metadata.files[0]
DataFile(fields=[2, 8, 29, 1, 5, 7, ...], column_indices=[0, 1, 2, 3, 4, 5, ...], ...)
The scrambled fields array means field ID 2 is stored in column 0, field ID 8 in column 1, etc. This is valid Lance format, but the reader fails to properly reorder nested List<Struct> fields when reconstructing the output.
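The pairing described above can be sketched as a lookup from field ID to column index. This is an illustrative sketch only: field_to_column is a hypothetical helper, not a Lance API, and the i32 element type is an assumption. It relies solely on the positional pairing stated above, where fields[i] is stored in column_indices[i].

```rust
use std::collections::HashMap;

// Hypothetical helper: map each field ID to the column that stores it,
// assuming fields[i] lives in column_indices[i].
fn field_to_column(fields: &[i32], column_indices: &[i32]) -> HashMap<i32, i32> {
    fields
        .iter()
        .copied()
        .zip(column_indices.iter().copied())
        .collect()
}
```

With the bad fragment's values, field_to_column(&[2, 8, 29, 1, 5, 7], &[0, 1, 2, 3, 4, 5]) maps field ID 2 to column 0 and field ID 8 to column 1, so a reader has to look columns up by field ID rather than assume positional order.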
Suggested Fix
Extend the project function to handle List, LargeList, and FixedSizeList types recursively:
fn project(struct_array: &StructArray, fields: &Fields) -> Result<StructArray> {
    for field in fields.iter() {
        if let Some(col) = struct_array.column_by_name(field.name()) {
            match field.data_type() {
                DataType::Struct(subfields) => {
                    let projected = project(col.as_struct(), subfields)?;
                    columns.push(Arc::new(projected));
                }
                DataType::List(inner_field) => {
                    let list_arr = col.as_list::<i32>();
                    let projected_values = project_list_values(list_arr.values(), inner_field)?;
                    let projected_list = ListArray::new(
                        inner_field.clone(),
                        list_arr.offsets().clone(),
                        projected_values,
                        list_arr.nulls().cloned(),
                    );
                    columns.push(Arc::new(projected_list));
                }
                // Similar for LargeList, FixedSizeList
                _ => {
                    columns.push(col.clone());
                }
            }
        }
    }
    // ...
}
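The recursion the fix needs can be modeled without Arrow. The sketch below is a simplified, std-only illustration under stated assumptions: Ty and Value are hypothetical toy types (row-oriented, unlike Arrow's columnar arrays), and this project function is not the lance-arrow one. It shows only the shape of the logic: match struct children by name, and recurse through list items as well as nested structs.

```rust
// Toy stand-ins for Arrow data types and arrays (hypothetical, not arrow-rs).
#[derive(Debug, Clone, PartialEq)]
enum Ty {
    Int,
    Struct(Vec<(String, Ty)>),
    List(Box<Ty>),
}

#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Struct(Vec<(String, Value)>),
    List(Vec<Value>),
}

// Reorder `value` to match `ty`. Struct children are looked up by name and
// emitted in schema order; list items are projected against the item type.
fn project(value: &Value, ty: &Ty) -> Value {
    match (value, ty) {
        (Value::Struct(cols), Ty::Struct(fields)) => Value::Struct(
            fields
                .iter()
                .filter_map(|(name, child_ty)| {
                    cols.iter()
                        .find(|(n, _)| n == name)
                        .map(|(_, v)| (name.clone(), project(v, child_ty)))
                })
                .collect(),
        ),
        // The case the original code was missing: descend into list items.
        (Value::List(items), Ty::List(item_ty)) => {
            Value::List(items.iter().map(|v| project(v, item_ty)).collect())
        }
        // Leaves (and any shape mismatch) are passed through unchanged.
        _ => value.clone(),
    }
}
```

In this model, a struct holding a list of structs stored as (field_b, field_a) comes back as (field_a, field_b) when projected against a schema that declares that order. In Arrow itself a list has a single child values array, so the real fix would project that array once rather than item by item.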
Environment
- Lance version: 1.0.0-beta.8 (commit 1329bf4)
- File version: 2.0