Bug(rust): dataset.update() fails with schema mismatch on datasets containing pa.json_() columns #6329

@erandagan

Bug description

dataset.update() fails on any dataset that has a pa.json_() column, even when the update targets a completely unrelated column. The update path in write/update.rs (line ~274) performs a strict schema equality check between the scanned stream schema and the dataset schema. During the scan, lance.json (LargeBinary) is decoded back to arrow.json (Utf8), but the dataset schema still expects lance.json, so the check fails.

This is similar to the issues fixed in #5928 and #5936, but the write/update.rs path was not covered by those fixes.
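
To make the failure mode concrete, here is a minimal, standard-library-only model of a strict schema equality check. It is a sketch, not Lance's actual code: the Field class and strict_schema_check function are hypothetical names invented for illustration, and the type strings simply mirror the ones in the error message.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


# Hypothetical stand-in for an Arrow field: name, storage type, extension name.
@dataclass(frozen=True)
class Field:
    name: str
    data_type: str
    extension_name: Optional[str] = None


def strict_schema_check(expected: Tuple[Field, ...], actual: Tuple[Field, ...]) -> None:
    """Raise if the two schemas differ in any field (illustrative only)."""
    if expected != actual:
        raise RuntimeError(f"Expected schema {expected} but got {actual}")


# What the dataset stores: the JSON column is LargeBinary tagged lance.json.
dataset_schema = (
    Field("id", "Int64"),
    Field("a", "Utf8"),
    Field("b", "LargeBinary", "lance.json"),
)

# What the scan yields: the JSON column decoded to Utf8 tagged arrow.json.
scanned_schema = (
    Field("id", "Int64"),
    Field("a", "Utf8"),
    Field("b", "Utf8", "arrow.json"),
)

# Only "b" differs, yet a whole-schema comparison rejects the stream
# regardless of which column the update actually touches.
mismatched = [e.name for e, a in zip(dataset_schema, scanned_schema) if e != a]

try:
    strict_schema_check(dataset_schema, scanned_schema)
    check_passed = True
except RuntimeError:
    check_passed = False
```

In this model the update to column "a" never gets a chance to run, because the comparison covers every field, including the untouched JSON column.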

Minimal reproduction

import tempfile
import lance
import pyarrow as pa

with tempfile.TemporaryDirectory() as tmp:
    schema = pa.schema([
        pa.field("id", pa.int64()),
        pa.field("a", pa.utf8()),
        pa.field("b", pa.json_(), nullable=True),
    ])
    data = pa.table(
        {"id": [1, 2, 3], "a": ["x", "y", "z"], "b": ['{"k":1}', None, '{"k":3}']},
        schema=schema,
    )
    ds = lance.write_dataset(data, f"{tmp}/test.lance")

    # Fails: schema mismatch on column "b", although only "a" is updated
    ds.update({"a": "'updated'"}, where="id = 2")

Error:

RuntimeError: Encountered internal error.
Expected schema Schema { fields: [...,
  Field { name: "b", data_type: LargeBinary, metadata: {"ARROW:extension:name": "lance.json"} }
]} but got Schema { fields: [...,
  Field { name: "b", data_type: Utf8, metadata: {"ARROW:extension:name": "arrow.json"} }
]}, .../rust/lance/src/dataset/write/update.rs:274:24

Versions

  • pylance==4.0.0-beta.12
  • pyarrow==23.0.1
