LanceDataset.commit(Overwrite) cannot set enable_stable_row_ids feature flag #5906

@fecet

Problem

LanceDataset.commit(Overwrite) creates a dataset without the stable row IDs feature flag in the manifest, even when the fragments were written with enable_stable_row_ids=True. This means the only way to create a dataset with stable row IDs is through write_dataset(), which is not designed for distributed coordination.

Reproduction

import tempfile

import pyarrow as pa
from lance.fragment import write_fragments
from lance.dataset import LanceDataset, LanceOperation, write_dataset

schema = pa.schema([pa.field("x", pa.int32())])

# Path A: write_dataset — stable row IDs work
uri_a = tempfile.mkdtemp() + "/a.lance"
ds_a = write_dataset(schema.empty_table(), uri_a, mode="create", enable_stable_row_ids=True)
for i in range(3):
    frags = write_fragments(pa.table({"x": pa.array(range(i*3, (i+1)*3), type=pa.int32())}), uri_a, mode="append")
    ds_a = LanceDataset.commit(uri_a, LanceOperation.Append(frags), read_version=ds_a.version)
print(ds_a.to_table(columns=[], with_row_id=True).column("_rowid").to_pylist())
# [0, 1, 2, 3, 4, 5, 6, 7, 8] ✓

# Path B: commit(Overwrite) — stable row IDs NOT set
uri_b = tempfile.mkdtemp() + "/b.lance"
ds_b = LanceDataset.commit(uri_b, LanceOperation.Overwrite(schema, []), max_retries=0)
for i in range(3):
    frags = write_fragments(pa.table({"x": pa.array(range(i*3, (i+1)*3), type=pa.int32())}), uri_b, mode="append")
    ds_b = LanceDataset.commit(uri_b, LanceOperation.Append(frags), read_version=ds_b.version)
print(ds_b.to_table(columns=[], with_row_id=True).column("_rowid").to_pylist())
# [0, 1, 2, 4294967296, 4294967297, 4294967298, 8589934592, 8589934593, 8589934594] ✗
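
The Path B values are easier to read once split into two halves. The sketch below assumes that, without stable row IDs, `_rowid` is a row address with the fragment ID in the upper 32 bits and the row offset within the fragment in the lower 32 bits. The helper name `decode_row_address` is hypothetical and not part of the Lance API.

```python
def decode_row_address(row_id: int) -> tuple[int, int]:
    """Split a 64-bit value into (fragment_id, row_offset).

    Assumption: upper 32 bits = fragment ID, lower 32 bits = offset in fragment.
    """
    return row_id >> 32, row_id & 0xFFFFFFFF


# Values printed by Path B in the reproduction above.
path_b_ids = [
    0, 1, 2,
    4294967296, 4294967297, 4294967298,
    8589934592, 8589934593, 8589934594,
]

decoded = [decode_row_address(r) for r in path_b_ids]
print(decoded)
# [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2)]
```

Under that assumption, each appended fragment restarts at offset 0 with the next fragment ID, which matches the jumps of 2^32 in the Path B output and differs from the sequential IDs in Path A.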

Root cause

enable_stable_row_ids sets manifest feature flags (protobuf fields 9/10, value=2). This is only done in the _write_dataset Rust path. The _Dataset.commit() Rust path does not accept or set these flags.

Why this matters

LanceDataset.commit() is the API designed for distributed environments (atomic commit with max_retries). But it cannot create a dataset with stable row IDs. Users who need both atomicity and stable row IDs have no Lance-native solution and must implement external coordination (e.g., object store conditional put for election).
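
For illustration, the external coordination mentioned above could look roughly like the following first-writer-wins election. This is only a sketch: it uses an exclusive file create on the local filesystem as a stand-in for an object store conditional put, and the helper `try_claim_creator` and the marker file name are hypothetical.

```python
import os
import tempfile


def try_claim_creator(marker_path: str) -> bool:
    """Return True if this caller won the election and should create the dataset.

    Stand-in for a conditional put: O_CREAT | O_EXCL fails if the marker
    already exists, so only the first caller succeeds.
    """
    try:
        fd = os.open(marker_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


# Simulate three workers racing to create the same dataset.
marker = os.path.join(tempfile.mkdtemp(), "create.marker")
results = [try_claim_creator(marker) for _ in range(3)]
print(results)  # [True, False, False]
```

The winner would then create the dataset via write_dataset(..., mode="create", enable_stable_row_ids=True) as in Path A, while the other workers wait for it to exist and only commit Append operations. Having to build this outside Lance is the gap this issue describes.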

Suggestion

Add enable_stable_row_ids (or a more general feature_flags / config) parameter to LanceDataset.commit(), so that distributed dataset creation can set manifest-level properties.

Alternatively, consider making stable row IDs the default in Lance v2, as discussed in #3694.

Environment

  • pylance 2.0.0-rc.4
  • Python 3.12
  • Linux
