Skip to content

feat: add LanceWriteParams and lance_dataset_write_with_params#20

Draft
LuciferYang wants to merge 3 commits intolance-format:mainfrom
LuciferYang:feat/issue-15-write-params
Draft

feat: add LanceWriteParams and lance_dataset_write_with_params#20
LuciferYang wants to merge 3 commits intolance-format:mainfrom
LuciferYang:feat/issue-15-write-params

Conversation

@LuciferYang
Copy link
Copy Markdown
Contributor

Summary

  • Adds LanceWriteParams with five tunable fields: max_rows_per_file, max_rows_per_group, max_bytes_per_file, data_storage_version, enable_stable_row_ids
  • Adds lance_dataset_write_with_params, which takes the struct as an extra argument; pass NULL to use upstream defaults
  • lance_dataset_write is now a thin delegator to _with_params with params = NULL — its signature and ABI are unchanged
  • C++ gets a lance::WriteParams struct and an overload of lance::Dataset::write(...) that accepts it

Motivation

lance_dataset_write uses upstream's defaults for everything. Production callers need to tune output file/group sizes and the Lance file format version; stable row ids are also a common opt-in. This exposes those knobs without touching the simple-path API.

Notes

  • data_storage_version takes a string ("2.0", "2.1", "stable", "legacy", etc.) parsed via LanceFileVersion::from_str. Invalid or empty strings return LANCE_ERR_INVALID_ARGUMENT.
  • Numeric fields use 0 as a "keep default" sentinel — no realistic call would want 0 rows per file.
  • Lance's WriteParams has more advanced fields (commit handlers, progress callbacks, blob/base internals) that are intentionally not exposed — they belong in Rust, not a thin C FFI.

Test plan

  • cargo test — 8 new unit tests: NULL params behaves like plain write, max_rows_per_file splits fragments (100 rows / 20 → ≥5), max_rows_per_group plumbed, max_bytes_per_file plumbed, known versions accepted ("2.0" / "2.1" / "stable"), empty version rejected, invalid version rejected, enable_stable_row_ids toggle accepted
  • cargo clippy --all-targets -- -D warnings and cargo fmt --check clean
  • RUSTDOCFLAGS="-D warnings" cargo doc --no-deps clean
  • cargo test --test compile_and_run_test -- --ignored — C and C++ integration tests pass

Closes #15.

…rrayStream

Writes an ArrowArrayStream into a Lance dataset with a committed manifest.
A mode enum (CREATE / APPEND / OVERWRITE) and an optional out_dataset that
returns the open dataset at the new version (so callers don't need to
reopen). Structure follows fragment_writer.rs: schema fail-fast, storage
options pass-through, thread-local errors. C++ gets a scoped WriteMode
enum and a lance::Dataset::write() static method that reads the stream's
schema automatically.

Two FFI details:
  - mode is received as i32 and validated via LanceWriteMode::from_raw;
    accepting the enum directly would be UB for out-of-range values.
  - The stream is consumed via ArrowArrayStreamReader::from_raw right
    after the NULL check so the "consumed on any return" contract holds.

Tests: 11 new Rust unit tests (CREATE/APPEND/OVERWRITE happy paths,
OVERWRITE-on-missing, CREATE-on-existing, schema mismatches, empty
stream, NULL args, invalid mode, out_dataset propagation). The ignored
C and C++ integration tests now do a scan->write round-trip.

Closes lance-format#14.
The LanceWriteMode doc referenced `LanceWriteMode::from_raw`, which is
private. rustdoc -D warnings (the Rustdoc CI job) flags this as
`private_intra_doc_links`. Rewording to describe the validation behavior
without naming the private function.

No API changes.
Extends the write surface with a LanceWriteParams struct covering the most
useful fields from upstream's WriteParams: max_rows_per_file,
max_rows_per_group, max_bytes_per_file, data_storage_version, and
enable_stable_row_ids. lance_dataset_write is now a thin delegator that
forwards params = NULL to lance_dataset_write_with_params — both remain
part of the ABI, no deprecations.

data_storage_version takes a string ("2.0", "2.1", "stable", "legacy",
etc.) parsed via lance_file::version::LanceFileVersion::from_str; invalid
values are rejected with LANCE_ERR_INVALID_ARGUMENT. Adds lance-file as a
main dependency (it was previously dev-only).

C++ gets a lance::WriteParams struct and a new Dataset::write overload
accepting it; the three-argument form delegates to the new one with
default params.

Tests: 5 new Rust unit tests — NULL-params-behaves-like-plain,
max_rows_per_file splits fragments (100 rows / 20 → at least 5), known
storage versions accepted ("2.0"/"2.1"/"stable"), invalid storage version
rejected, stable_row_ids toggle accepted.

Closes lance-format#15.
@LuciferYang LuciferYang marked this pull request as draft April 24, 2026 11:08
@LuciferYang
Copy link
Copy Markdown
Contributor Author

will rebase after #16 merged

Copy link
Copy Markdown
Collaborator

@jja725 jja725 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice clean layering — lance_dataset_write becomes a thin delegator and the new _with_params variant exposes only the knobs that make sense at the C boundary. A few inline notes; one carries forward from the discussion on #16.

Comment thread src/writer.rs
storage_opts: *const *const c_char,
out_dataset: *mut *mut LanceDataset,
) -> Result<i32> {
if uri.is_null() || schema.is_null() || stream.is_null() {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same stream-consumption ordering issue as discussed on #16 — when uri or schema is NULL with a valid stream, this returns before from_raw is called, so the stream's release callback is never invoked. The C caller, trusting the documented "stream is consumed regardless of return code" contract, leaks it.

Whichever PR lands first should pull stream.is_null() out and consume the stream before validating the rest. Flagging here too in case #20 lands first.

Comment thread src/writer.rs
})?;
target.data_storage_version = Some(version);
}
target.enable_stable_row_ids = params.enable_stable_row_ids;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Subtle: every other field uses 0 / NULL as a "keep upstream default" sentinel, but enable_stable_row_ids is unconditionally assigned. With bool there's no available sentinel, but the consequence is that any caller that zero-initializes LanceWriteParams (the documented way to keep defaults) will explicitly set enable_stable_row_ids = false. That happens to match upstream's current default, so it's silent today — but if upstream ever flips the default to true, every C caller silently regresses.

Options:

  • Add an enable_stable_row_ids_set companion flag (clunky).
  • Use a int8_t with -1 = default, 0 = false, 1 = true (more verbose but matches the rest of the struct's pattern).
  • Document explicitly that this field is "strictly an override, no default sentinel" so future readers don't assume parity.

My preference is the doc-only fix unless upstream's default is actually expected to change.

Comment thread src/writer.rs
/// fields are no-ops so upstream defaults flow through.
unsafe fn apply_write_params(target: &mut WriteParams, params: &LanceWriteParams) -> Result<()> {
if params.max_rows_per_file > 0 {
target.max_rows_per_file = params.max_rows_per_file as usize;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: params.max_rows_per_file as usize truncates on 32-bit targets if the caller passes a value > u32::MAX. Realistic use cases (rows-per-file in the millions, bytes-per-file ~90 GB default) wouldn't hit this, but a try_into().map_err(...) would surface the truncation cleanly instead of silently wrapping. Same applies to max_rows_per_group and max_bytes_per_file below.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Expose WriteParams (rows/file size, storage version, stable row ids) via LanceWriteParams

2 participants