Problem
LanceDataset.commit(Overwrite) creates a dataset without the stable row IDs feature flag in the manifest, even when the fragments were written with enable_stable_row_ids=True. This means the only way to create a dataset with stable row IDs is through write_dataset(), which is not designed for distributed coordination.
Reproduction
import pyarrow as pa, tempfile
from lance.fragment import write_fragments
from lance.dataset import LanceDataset, LanceOperation, write_dataset
schema = pa.schema([pa.field("x", pa.int32())])
# Path A: write_dataset — stable row IDs work
uri_a = tempfile.mkdtemp() + "/a.lance"
ds_a = write_dataset(schema.empty_table(), uri_a, mode="create", enable_stable_row_ids=True)
for i in range(3):
    frags = write_fragments(pa.table({"x": pa.array(range(i*3, (i+1)*3), type=pa.int32())}), uri_a, mode="append")
    ds_a = LanceDataset.commit(uri_a, LanceOperation.Append(frags), read_version=ds_a.version)
print(ds_a.to_table(columns=[], with_row_id=True).column("_rowid").to_pylist())
# [0, 1, 2, 3, 4, 5, 6, 7, 8] ✓
# Path B: commit(Overwrite) — stable row IDs NOT set
uri_b = tempfile.mkdtemp() + "/b.lance"
ds_b = LanceDataset.commit(uri_b, LanceOperation.Overwrite(schema, []), max_retries=0)
for i in range(3):
    frags = write_fragments(pa.table({"x": pa.array(range(i*3, (i+1)*3), type=pa.int32())}), uri_b, mode="append")
    ds_b = LanceDataset.commit(uri_b, LanceOperation.Append(frags), read_version=ds_b.version)
print(ds_b.to_table(columns=[], with_row_id=True).column("_rowid").to_pylist())
# [0, 1, 2, 4294967296, 4294967297, 4294967298, 8589934592, 8589934593, 8589934594] ✗
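The Path B values look like row addresses rather than stable row IDs. Assuming the usual Lance layout, where a row address packs the fragment ID into the upper 32 bits and the row's offset within that fragment into the lower 32 bits, the list can be decoded as below. This is a minimal sketch: decode_row_address is a hypothetical helper written for this report, not a Lance API.

```python
def decode_row_address(row_id: int) -> tuple[int, int]:
    """Split a 64-bit row address into (fragment_id, row_offset).

    Assumes the address layout (fragment_id << 32) | row_offset.
    """
    return row_id >> 32, row_id & 0xFFFFFFFF


# _rowid values printed by Path B in the reproduction above
path_b_ids = [
    0, 1, 2,
    4294967296, 4294967297, 4294967298,
    8589934592, 8589934593, 8589934594,
]

decoded = [decode_row_address(r) for r in path_b_ids]
print(decoded)
# [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2)]
```

Each of the three appended fragments restarts at offset 0 under its own fragment ID, which is what is expected when the stable row IDs flag is absent from the manifest.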
Root cause
enable_stable_row_ids sets manifest feature flags (protobuf fields 9/10, value=2). This is only done in the _write_dataset Rust path. The _Dataset.commit() Rust path does not accept or set these flags.
Why this matters
LanceDataset.commit() is the API designed for distributed environments (atomic commit with max_retries), but it cannot create a dataset with stable row IDs. Users who need both atomicity and stable row IDs have no Lance-native solution and must implement external coordination themselves (e.g., an object-store conditional put to elect a single writer that creates the dataset).
Suggestion
Add enable_stable_row_ids (or a more general feature_flags / config) parameter to LanceDataset.commit(), so that distributed dataset creation can set manifest-level properties.
Alternatively, consider making stable row IDs the default in Lance v2, as discussed in #3694.
Environment
- pylance 2.0.0-rc.4
- Python 3.12
- Linux