feat: add LanceWriteParams and lance_dataset_write_with_params#20
feat: add LanceWriteParams and lance_dataset_write_with_params#20LuciferYang wants to merge 3 commits intolance-format:mainfrom
Conversation
…rrayStream
Writes an ArrowArrayStream into a Lance dataset with a committed manifest.
A mode enum (CREATE / APPEND / OVERWRITE) and an optional out_dataset that
returns the open dataset at the new version (so callers don't need to
reopen). Structure follows fragment_writer.rs: schema fail-fast, storage
options pass-through, thread-local errors. C++ gets a scoped WriteMode
enum and a lance::Dataset::write() static method that reads the stream's
schema automatically.
Two FFI details:
- mode is received as i32 and validated via LanceWriteMode::from_raw;
accepting the enum directly would be UB for out-of-range values.
- The stream is consumed via ArrowArrayStreamReader::from_raw right
after the NULL check so the "consumed on any return" contract holds.
Tests: 11 new Rust unit tests (CREATE/APPEND/OVERWRITE happy paths,
OVERWRITE-on-missing, CREATE-on-existing, schema mismatches, empty
stream, NULL args, invalid mode, out_dataset propagation). The ignored
C and C++ integration tests now do a scan->write round-trip.
Closes lance-format#14.
The LanceWriteMode doc referenced `LanceWriteMode::from_raw`, which is private. rustdoc -D warnings (the Rustdoc CI job) flags this as `private_intra_doc_links`. Rewording to describe the validation behavior without naming the private function. No API changes.
Extends the write surface with a LanceWriteParams struct covering the most
useful fields from upstream's WriteParams: max_rows_per_file,
max_rows_per_group, max_bytes_per_file, data_storage_version, and
enable_stable_row_ids. lance_dataset_write is now a thin delegator that
forwards params = NULL to lance_dataset_write_with_params — both remain
part of the ABI, no deprecations.
data_storage_version takes a string ("2.0", "2.1", "stable", "legacy",
etc.) parsed via lance_file::version::LanceFileVersion::from_str; invalid
values are rejected with LANCE_ERR_INVALID_ARGUMENT. Adds lance-file as a
main dependency (it was previously dev-only).
C++ gets a lance::WriteParams struct and a new Dataset::write overload
accepting it; the three-argument form delegates to the new one with
default params.
Tests: 5 new Rust unit tests — NULL-params-behaves-like-plain,
max_rows_per_file splits fragments (100 rows / 20 → at least 5), known
storage versions accepted ("2.0"/"2.1"/"stable"), invalid storage version
rejected, stable_row_ids toggle accepted.
Closes lance-format#15.
|
will rebase after #16 merged |
jja725
left a comment
There was a problem hiding this comment.
Nice clean layering — lance_dataset_write becomes a thin delegator and the new _with_params variant exposes only the knobs that make sense at the C boundary. A few inline notes; one carries forward from the discussion on #16.
| storage_opts: *const *const c_char, | ||
| out_dataset: *mut *mut LanceDataset, | ||
| ) -> Result<i32> { | ||
| if uri.is_null() || schema.is_null() || stream.is_null() { |
There was a problem hiding this comment.
Same stream-consumption ordering issue as discussed on #16 — when uri or schema is NULL with a valid stream, this returns before from_raw is called, so the stream's release callback is never invoked. The C caller, trusting the documented "stream is consumed regardless of return code" contract, leaks it.
Whichever PR lands first should pull stream.is_null() out and consume the stream before validating the rest. Flagging here too in case #20 lands first.
| })?; | ||
| target.data_storage_version = Some(version); | ||
| } | ||
| target.enable_stable_row_ids = params.enable_stable_row_ids; |
There was a problem hiding this comment.
Subtle: every other field uses 0 / NULL as a "keep upstream default" sentinel, but enable_stable_row_ids is unconditionally assigned. With bool there's no available sentinel, but the consequence is that any caller that zero-initializes LanceWriteParams (the documented way to keep defaults) will explicitly set enable_stable_row_ids = false. That happens to match upstream's current default, so it's silent today — but if upstream ever flips the default to true, every C caller silently regresses.
Options:
- Add an
enable_stable_row_ids_setcompanion flag (clunky). - Use a
int8_twith-1 = default, 0 = false, 1 = true(more verbose but matches the rest of the struct's pattern). - Document explicitly that this field is "strictly an override, no default sentinel" so future readers don't assume parity.
My preference is the doc-only fix unless upstream's default is actually expected to change.
| /// fields are no-ops so upstream defaults flow through. | ||
| unsafe fn apply_write_params(target: &mut WriteParams, params: &LanceWriteParams) -> Result<()> { | ||
| if params.max_rows_per_file > 0 { | ||
| target.max_rows_per_file = params.max_rows_per_file as usize; |
There was a problem hiding this comment.
Minor: params.max_rows_per_file as usize truncates on 32-bit targets if the caller passes a value > u32::MAX. Realistic use cases (rows-per-file in the millions, bytes-per-file ~90 GB default) wouldn't hit this, but a try_into().map_err(...) would surface the truncation cleanly instead of silently wrapping. Same applies to max_rows_per_group and max_bytes_per_file below.
Summary
LanceWriteParamswith five tunable fields:max_rows_per_file,max_rows_per_group,max_bytes_per_file,data_storage_version,enable_stable_row_idslance_dataset_write_with_params, which takes the struct as an extra argument; pass NULL to use upstream defaultslance_dataset_writeis now a thin delegator to_with_paramswithparams = NULL— its signature and ABI are unchangedlance::WriteParamsstruct and an overload oflance::Dataset::write(...)that accepts itMotivation
lance_dataset_writeuses upstream's defaults for everything. Production callers need to tune output file/group sizes and the Lance file format version; stable row ids are also a common opt-in. This exposes those knobs without touching the simple-path API.Notes
data_storage_versiontakes a string ("2.0","2.1","stable","legacy", etc.) parsed viaLanceFileVersion::from_str. Invalid or empty strings returnLANCE_ERR_INVALID_ARGUMENT.0as a "keep default" sentinel — no realistic call would want 0 rows per file.WriteParamshas more advanced fields (commit handlers, progress callbacks, blob/base internals) that are intentionally not exposed — they belong in Rust, not a thin C FFI.Test plan
cargo test— 8 new unit tests: NULL params behaves like plainwrite,max_rows_per_filesplits fragments (100 rows / 20 → ≥5),max_rows_per_groupplumbed,max_bytes_per_fileplumbed, known versions accepted ("2.0"/"2.1"/"stable"), empty version rejected, invalid version rejected,enable_stable_row_idstoggle acceptedcargo clippy --all-targets -- -D warningsandcargo fmt --checkcleanRUSTDOCFLAGS="-D warnings" cargo doc --no-depscleancargo test --test compile_and_run_test -- --ignored— C and C++ integration tests passCloses #15.