feat(schema evolution): support adding nested struct columns #5361

Closed
zhangyue19921010 wants to merge 3 commits into lance-format:main from zhangyue19921010:feat_support_add_nested_schema_column_internal

@zhangyue19921010 zhangyue19921010 commented Nov 27, 2025

This PR reuses the existing merge-column capability and changes only the name-check logic. The check is relaxed from comparing top-level field names to comparing leaf field names, so new child fields can be added to a struct column that already exists.
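The difference between the two checks can be sketched in plain Python. This is an illustrative model only, not the Rust code in schema_evolution.rs: schemas are represented as nested dicts, and the function names `leaf_paths`, `top_level_conflicts`, and `leaf_conflicts` are hypothetical.

```python
def leaf_paths(schema, prefix=()):
    """Return the dotted paths of all leaf (non-struct) fields in a schema.

    Assumption for this sketch: a schema is a dict of field name -> type,
    where a struct type is itself a dict and anything else is a leaf type.
    """
    paths = set()
    for name, dtype in schema.items():
        path = prefix + (name,)
        if isinstance(dtype, dict):
            # Recurse into struct fields and collect their leaves.
            paths |= leaf_paths(dtype, path)
        else:
            paths.add(".".join(path))
    return paths


def top_level_conflicts(existing, new):
    """Old-style check: any shared top-level name counts as a conflict."""
    return set(existing) & set(new)


def leaf_conflicts(existing, new):
    """Relaxed check: only a shared leaf path counts as a conflict."""
    return leaf_paths(existing) & leaf_paths(new)


existing = {"id": "int64", "struct_fields": {"a": "int64", "b": "string"}}
new = {"struct_fields": {"c": "string", "d": "string"}}

print(top_level_conflicts(existing, new))  # {'struct_fields'}
print(leaf_conflicts(existing, new))       # set()
```

Under the top-level check, `struct_fields` is rejected because the name already exists. Under the leaf check, `struct_fields.c` and `struct_fields.d` are new paths, so the addition is allowed, while re-adding `struct_fields.a` would still be reported.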

Demo

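import lance
import pyarrow as pa

if __name__ == '__main__':
    path = "../table/alice_and_bob_add_column_only.lance"
    # init() is a helper from the author's demo project; its definition is
    # not included in this PR description.
    init(path)

    # Create the dataset with a top-level id column and a struct column (a, b).
    table = pa.table({"id": pa.array([1, 2, 3]),
                      "struct_fields": pa.array([
                          {"a": 1, "b": "foo"},
                          {"a": 2, "b": "bar"},
                          {"a": 3, "b": "baz"}
                      ])
                      })
    ds = lance.write_dataset(table, path)

    # Add a single all-null top-level column from a pyarrow field.
    new_field = pa.field("embedding", pa.list_(pa.float32(), 128))
    ds.add_columns(new_field)

    # Add several all-null top-level columns from a pyarrow schema.
    new_schema = pa.schema([
        ("label", pa.string()),
        ("score", pa.float32()),
    ])
    ds.add_columns(new_schema)

    # Add child fields c and d to the existing struct column.
    # This is the call that failed before this PR.
    new_schema_2 = pa.schema([
        ("struct_fields", pa.struct([
            ("c", pa.string()),
            ("d", pa.string()),
        ])),
    ])

    ds.add_columns(new_schema_2)

    # Append rows that populate the new nested fields.
    table = pa.table({"id": pa.array([11, 22, 33]),
                      "struct_fields": pa.array([
                          {"a": 11, "b": "foo11", "c": "bar11", "d": "baz11"},
                          {"a": 22, "b": "ba22r", "c": "bar22", "d": "baz22"},
                          {"a": 33, "b": "baz33", "c": "bar33", "d": "baz33"}])
                      })
    ds.insert(table)
    print(ds.schema)
    print(ds.to_table().to_pandas())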

Before this PR

Traceback (most recent call last):
  File "/Users/xxx/PycharmProjects/LanceDemo/data_evolution/add_struct_columns_schema_only.py", line 51, in <module>
    ds.add_columns(new_schema_2)
  File "/Users/xxx/PycharmProjects/LanceDemo/.venv/lib/python3.9/site-packages/lance/dataset.py", line 1816, in add_columns
    self._ds.add_columns_with_schema(transforms)
OSError: Invalid user input: Column struct_fields already exists in the dataset, /Users/runner/work/lance/lance/rust/lance/src/dataset/schema_evolution.rs:150:21

After this PR

id: int64
struct_fields: struct<a: int64, b: string, c: string, d: string>
  child 0, a: int64
  child 1, b: string
  child 2, c: string
  child 3, d: string
embedding: fixed_size_list<item: float>[128]
  child 0, item: float
label: string
score: float
   id                                      struct_fields embedding label  score
0   1         {'a': 1, 'b': 'foo', 'c': None, 'd': None}      None  None    NaN
1   2         {'a': 2, 'b': 'bar', 'c': None, 'd': None}      None  None    NaN
2   3         {'a': 3, 'b': 'baz', 'c': None, 'd': None}      None  None    NaN
3  11  {'a': 11, 'b': 'foo11', 'c': 'bar11', 'd': 'ba...      None  None    NaN
4  22  {'a': 22, 'b': 'ba22r', 'c': 'bar22', 'd': 'ba...      None  None    NaN
5  33  {'a': 33, 'b': 'baz33', 'c': 'bar33', 'd': 'ba...      None  None    NaN
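The output above shows that rows written before the schema change read back with `None` for the new child fields `c` and `d`. A minimal sketch of that behavior, assuming schemas and rows are modeled as nested dicts (the helpers `merge_schema` and `read_row` are hypothetical and do not correspond to Lance APIs):

```python
def merge_schema(existing, new):
    """Return a copy of `existing` with the fields of `new` merged in.

    Struct types (modeled as dicts) are merged recursively, so new child
    fields are appended to a struct that already exists.
    """
    merged = dict(existing)
    for name, dtype in new.items():
        if isinstance(dtype, dict) and isinstance(merged.get(name), dict):
            merged[name] = merge_schema(merged[name], dtype)
        else:
            merged[name] = dtype
    return merged


def read_row(row, schema):
    """Project a stored row onto `schema`, filling missing fields with None."""
    out = {}
    for name, dtype in schema.items():
        value = row.get(name)
        if isinstance(dtype, dict) and value is not None:
            # Recurse so that missing child fields are also filled with None.
            out[name] = read_row(value, dtype)
        else:
            out[name] = value
    return out


schema = {"id": "int64", "struct_fields": {"a": "int64", "b": "string"}}
schema = merge_schema(schema, {"struct_fields": {"c": "string", "d": "string"}})

old_row = {"id": 1, "struct_fields": {"a": 1, "b": "foo"}}
print(read_row(old_row, schema))
# {'id': 1, 'struct_fields': {'a': 1, 'b': 'foo', 'c': None, 'd': None}}
```

Only the schema changes in this model; the stored row is untouched and the missing fields are filled at read time, which mirrors the metadata-only nature of the column addition shown in the demo.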

github-actions bot added the enhancement (New feature or request) and python labels on Nov 27, 2025

codecov Bot commented Nov 27, 2025

Codecov Report

❌ Patch coverage is 92.10526% with 12 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
rust/lance/src/dataset/schema_evolution.rs | 90.76% | 4 Missing and 8 partials ⚠️

zhangyue19921010 (Contributor, Author) commented

In principle, this change supports adding nested fields at the metadata level for all file format versions. For support for backfilling data while adding columns, see #5126.
