project_by_schema does not reorder fields inside List<Struct> types #5702

@wjones127

Summary

When reading fragments where fields are stored out of order (scrambled fields array in DataFile metadata), the project_by_schema function fails to reorder fields inside List<Struct> columns. This causes Arrow validation errors when constructing the final StructArray.

Error Message

Invalid argument error: Incorrect datatype for StructArray field "list",
expected List(Field { name: "item", data_type: Struct([
  Field { name: "field_a", ... },
  Field { name: "field_b", ... }
]), ... })
got List(Field { name: "item", data_type: Struct([
  Field { name: "field_b", ... },  // <-- order is swapped
  Field { name: "field_a", ... }
]), ... })

Root Cause

The project function in rust/lance-arrow/src/lib.rs:798-827 recursively handles Struct fields but not List<Struct>:

fn project(struct_array: &StructArray, fields: &Fields) -> Result<StructArray> {
    for field in fields.iter() {
        if let Some(col) = struct_array.column_by_name(field.name()) {
            match field.data_type() {
                // TODO handle list-of-struct   <-- acknowledged but not implemented
                DataType::Struct(subfields) => {
                    let projected = project(col.as_struct(), subfields)?;
                    columns.push(Arc::new(projected));
                }
                _ => {
                    columns.push(col.clone());  // List<Struct> falls through here
                }
            }
        }
    }
    // ...
}

Conditions to Trigger

The bug requires the first two conditions below; the third is optional but commonly present:

  1. Out-of-order field storage: A fragment where DataFile.fields is not in sequential order (e.g., [2, 8, 1, 5, ...] instead of [0, 1, 2, 3, ...])

  2. Schema with List<Struct>: A column with nested structure like struct<list: list<struct<...>>>

  3. Schema evolution (optional but common): Missing fields that require null-filling, triggering the merge + project code path

How It Happens

  1. Fragment is written with fields stored in non-sequential order (this can happen legitimately)
  2. When reading, the file reader returns data with inner struct fields in file order
  3. project_by_schema is called to reorder columns to match the output schema
  4. Top-level and direct Struct fields are reordered correctly
  5. Fields inside List<Struct> are NOT reordered (bug)
  6. Arrow's StructArray::new() validation fails because the List column's item struct type lists the same fields in a different order than the declared field type
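
Step 6 can be illustrated with a small std-only model. This is a sketch, not arrow-rs: Type, validate, and struct_of are hypothetical names invented here, and the only assumption carried over from Arrow is that struct type equality is sensitive to child field order, which is what the error message above shows.

```rust
// Toy stand-in for Arrow's DataType (hypothetical; not the arrow-rs API).
#[derive(Debug, Clone, PartialEq)]
enum Type {
    Int32,
    Utf8,
    // Child fields as (name, type) pairs. Derived equality compares the
    // vectors element by element, so child order is significant.
    Struct(Vec<(String, Type)>),
    List(Box<Type>),
}

// Mimics the per-field check a struct constructor performs: each column's
// type must equal the declared field type exactly.
fn validate(fields: &[(String, Type)], columns: &[Type]) -> Result<(), String> {
    if fields.len() != columns.len() {
        return Err(format!(
            "expected {} columns, got {}",
            fields.len(),
            columns.len()
        ));
    }
    for ((name, expected), actual) in fields.iter().zip(columns) {
        if expected != actual {
            return Err(format!(
                "Incorrect datatype for field {:?}, expected {:?} got {:?}",
                name, expected, actual
            ));
        }
    }
    Ok(())
}

// Convenience constructor for a struct type from string literals.
fn struct_of(children: &[(&str, Type)]) -> Type {
    Type::Struct(
        children
            .iter()
            .map(|(n, t)| (n.to_string(), t.clone()))
            .collect(),
    )
}
```

In this model, a column typed List(Struct[field_b, field_a]) is rejected against a field declared as List(Struct[field_a, field_b]), even though both contain the same children.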

Reproduction

Fragment metadata showing the issue:

# Good fragment - fields in order
>>> frags[0].metadata.files[0]
DataFile(fields=[0, 1, 2, 3, 4, ...], column_indices=[0, 1, 2, 3, 4, ...], ...)

# Bad fragment - fields out of order
>>> frags[3].metadata.files[0]
DataFile(fields=[2, 8, 29, 1, 5, 7, ...], column_indices=[0, 1, 2, 3, 4, 5, ...], ...)

The scrambled fields array means field ID 2 is stored in column 0, field ID 8 in column 1, etc. This is valid Lance format, but the reader fails to properly reorder nested List<Struct> fields when reconstructing the output.
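
The pairing between fields and column_indices can be sketched as follows. DataFileMeta and its methods are hypothetical names for illustration, not Lance's actual types; the sketch assumes only what the output above shows, namely that the two arrays are position-aligned. "In order" is taken here to mean strictly increasing field IDs, which is an assumption.

```rust
use std::collections::HashMap;

// Hypothetical mirror of the two DataFile arrays shown above.
struct DataFileMeta {
    // Field IDs, in storage order.
    fields: Vec<i32>,
    // Column holding each field, position-aligned with `fields`.
    column_indices: Vec<i32>,
}

impl DataFileMeta {
    // Map each field ID to the column index that stores it.
    fn column_by_field_id(&self) -> HashMap<i32, i32> {
        self.fields
            .iter()
            .copied()
            .zip(self.column_indices.iter().copied())
            .collect()
    }

    // True when field IDs are stored in strictly increasing order
    // (the interpretation of "in order" assumed by this sketch).
    fn is_in_order(&self) -> bool {
        self.fields.windows(2).all(|w| w[0] < w[1])
    }
}
```

Using only the visible prefix of the bad fragment (fields 2, 8, 29, 1, 5, 7 in columns 0 through 5), the map resolves field ID 2 to column 0 and field ID 1 to column 3, and is_in_order returns false. A reader has to go through such a mapping, and then reorder at every nesting level, to present data in schema order.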

Suggested Fix

Extend the project function to handle List, LargeList, and FixedSizeList types recursively:

fn project(struct_array: &StructArray, fields: &Fields) -> Result<StructArray> {
    for field in fields.iter() {
        if let Some(col) = struct_array.column_by_name(field.name()) {
            match field.data_type() {
                DataType::Struct(subfields) => {
                    let projected = project(col.as_struct(), subfields)?;
                    columns.push(Arc::new(projected));
                }
                DataType::List(inner_field) => {
                    let list_arr = col.as_list::<i32>();
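                    // NOTE: project_list_values is not defined in this snippet; it would
                    // need to project the inner struct values against inner_field.
                    let projected_values = project_list_values(list_arr.values(), inner_field)?;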
                    let projected_list = ListArray::new(
                        inner_field.clone(),
                        list_arr.offsets().clone(),
                        projected_values,
                        list_arr.nulls().cloned(),
                    );
                    columns.push(Arc::new(projected_list));
                }
                // Similar for LargeList, FixedSizeList
                _ => {
                    columns.push(col.clone());
                }
            }
        }
    }
    // ...
}
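
The shape of the recursion can be checked on a toy columnar model before touching arrow-rs. Everything below (Type, Array, project) is a hypothetical stand-in written for this sketch, with no nulls and a single integer leaf type. It shows the idea only: keep the list offsets untouched and project the flattened values against the item type.

```rust
// Toy columnar model; illustrative stand-ins, not arrow-rs types.
#[derive(Debug, Clone, PartialEq)]
enum Type {
    Int,
    Struct(Vec<(String, Type)>),
    List(Box<Type>),
}

#[derive(Debug, Clone, PartialEq)]
enum Array {
    Int(Vec<i64>),
    // Child columns by name, in storage order.
    Struct(Vec<(String, Array)>),
    // Offsets delimit each list; `values` holds the flattened items.
    List { offsets: Vec<usize>, values: Box<Array> },
}

impl Array {
    // Derive the type from the data, preserving child storage order.
    fn data_type(&self) -> Type {
        match self {
            Array::Int(_) => Type::Int,
            Array::Struct(children) => Type::Struct(
                children
                    .iter()
                    .map(|(n, a)| (n.clone(), a.data_type()))
                    .collect(),
            ),
            Array::List { values, .. } => Type::List(Box::new(values.data_type())),
        }
    }
}

// Reorder `array` to match `target`, recursing through structs and lists.
fn project(array: &Array, target: &Type) -> Array {
    match (array, target) {
        (Array::Struct(children), Type::Struct(fields)) => {
            let mut out = Vec::with_capacity(fields.len());
            for (name, ty) in fields {
                // Fields absent from the data are skipped, mirroring the
                // `if let Some(col)` in the snippet above.
                if let Some((_, child)) = children.iter().find(|(n, _)| n == name) {
                    out.push((name.clone(), project(child, ty)));
                }
            }
            Array::Struct(out)
        }
        // The list arm: offsets are reused as-is, only values are projected.
        (Array::List { offsets, values }, Type::List(item)) => Array::List {
            offsets: offsets.clone(),
            values: Box::new(project(values, item)),
        },
        // Leaves, or shapes that do not match the target, pass through.
        _ => array.clone(),
    }
}
```

Projecting a stored struct<list: list<struct<field_b, field_a>>> against a target with field_a first yields an array whose derived type equals the target. Removing the list arm reproduces the bug in the model: the column falls through to the catch-all arm and keeps file order.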

Environment

  • Lance version: 1.0.0-beta.8 (commit 1329bf4)
  • File version: 2.0

Labels

bug
