Bug description
dataset.update() fails on any dataset that has a pa.json_() column, even when the update targets a completely unrelated column. The update path in write/update.rs (line ~274) performs a strict schema equality check between the scanned stream schema and the dataset schema. During the scan, lance.json (LargeBinary) is decoded back to arrow.json (Utf8), but the dataset schema still expects lance.json, so the equality check fails.
This is similar to the issues fixed in #5928 and #5936, but the write/update.rs path was not covered by those fixes.
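To make the mismatch concrete, here is a small stdlib-only model of the check. It is a sketch, not Lance code: `Field`, `normalize`, `strict_match`, and `json_aware_match` are hypothetical names made up for illustration, and the type/extension pairs are copied from the error message below. It only shows why strict equality rejects the scanned schema while a comparison that treats the two JSON representations as equivalent would accept it. It does not assume anything about how #5928 or #5936 actually resolved the problem.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Field:
    """Toy stand-in for an Arrow field (hypothetical, not a Lance type)."""
    name: str
    data_type: str
    extension: Optional[str] = None  # value of "ARROW:extension:name", if any


# The two representations of a JSON column reported in the error message.
STORED_JSON = ("LargeBinary", "lance.json")   # what the dataset schema expects
DECODED_JSON = ("Utf8", "arrow.json")         # what the scan produces


def normalize(field: Field) -> Field:
    """Map the decoded JSON form back to the stored form; leave others as-is."""
    if (field.data_type, field.extension) == DECODED_JSON:
        return Field(field.name, *STORED_JSON)
    return field


def strict_match(expected: List[Field], actual: List[Field]) -> bool:
    """Strict equality, analogous to the check that rejects the update."""
    return expected == actual


def json_aware_match(expected: List[Field], actual: List[Field]) -> bool:
    """Equality after treating arrow.json and lance.json as equivalent."""
    return [normalize(f) for f in expected] == [normalize(f) for f in actual]


dataset_schema = [
    Field("id", "Int64"),
    Field("a", "Utf8"),
    Field("b", "LargeBinary", "lance.json"),
]
scanned_schema = [
    Field("id", "Int64"),
    Field("a", "Utf8"),
    Field("b", "Utf8", "arrow.json"),
]

print(strict_match(dataset_schema, scanned_schema))      # False
print(json_aware_match(dataset_schema, scanned_schema))  # True
```

Note that the plain Utf8 column "a" is untouched by `normalize`, since only a field carrying the arrow.json extension is rewritten.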
Minimal reproduction
import tempfile

import lance
import pyarrow as pa

with tempfile.TemporaryDirectory() as tmp:
    schema = pa.schema([
        pa.field("id", pa.int64()),
        pa.field("a", pa.utf8()),
        pa.field("b", pa.json_(), nullable=True),
    ])
    data = pa.table(
        {"id": [1, 2, 3], "a": ["x", "y", "z"], "b": ['{"k":1}', None, '{"k":3}']},
        schema=schema,
    )
    ds = lance.write_dataset(data, f"{tmp}/test.lance")
    # Fails
    ds.update({"a": "'updated'"}, where="id = 2")
Error:
RuntimeError: Encountered internal error.
Expected schema Schema { fields: [...,
Field { name: "b", data_type: LargeBinary, metadata: {"ARROW:extension:name": "lance.json"} }
]} but got Schema { fields: [...,
Field { name: "b", data_type: Utf8, metadata: {"ARROW:extension:name": "arrow.json"} }
]}, .../rust/lance/src/dataset/write/update.rs:274:24
Versions
pylance==4.0.0-beta.12
pyarrow==23.0.1