feat: add lance_dataset_write for create/append/overwrite from ArrowArrayStream #16

Open
LuciferYang wants to merge 2 commits into lance-format:main from LuciferYang:feat/issue-14-dataset-write

@LuciferYang
Contributor

Summary

  • Adds lance_dataset_write(uri, schema, stream, mode, storage_opts, out_dataset) — writes an ArrowArrayStream into a Lance dataset with a committed manifest
  • LanceWriteMode covers CREATE / APPEND / OVERWRITE
  • Optional out_dataset hands back an open LanceDataset* at the new version so callers don't need to reopen
  • Matching lance::Dataset::write(...) static method in lance.hpp

Motivation

Until now the C/C++ path only produced uncommitted fragment files (#5). lance_dataset_write closes the primary write path and unblocks the rest of Phase 3 (delete, update, merge-insert, schema evolution), which all need a way to create a dataset first.

Notes

  • mode is received as int32_t and validated via LanceWriteMode::from_raw. Accepting the enum directly would be UB for out-of-range values from C.
  • The stream is consumed via ArrowArrayStreamReader::from_raw immediately after the NULL check, so the "consumed on any return" contract holds on every error path.
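The two notes above can be sketched together using only the standard library. This is a minimal illustration, not the PR's code: `MockStream`, `OwnedStream`, `sketch_write`, `count_release`, and the discriminant values 0/1/2 are all hypothetical stand-ins. `MockStream` imitates only the release callback of an ArrowArrayStream, and `OwnedStream::from_raw` imitates the take-ownership step that the PR performs with `ArrowArrayStreamReader::from_raw`.

```rust
use std::ffi::c_void;

/// Write modes. The discriminant values are assumptions made for this sketch.
#[repr(i32)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WriteMode {
    Create = 0,
    Append = 1,
    Overwrite = 2,
}

impl WriteMode {
    /// Map a raw integer from C onto a variant and reject everything else,
    /// so an out-of-range value never becomes an invalid enum.
    fn from_raw(raw: i32) -> Option<Self> {
        match raw {
            0 => Some(Self::Create),
            1 => Some(Self::Append),
            2 => Some(Self::Overwrite),
            _ => None,
        }
    }
}

/// Hypothetical stand-in for an Arrow C stream: a release callback plus a
/// private-data pointer, which is all the ownership pattern needs.
#[repr(C)]
pub struct MockStream {
    pub release: Option<unsafe extern "C" fn(*mut MockStream)>,
    pub private_data: *mut c_void,
}

/// Owning wrapper: dropping it invokes the release callback if one is set.
struct OwnedStream(MockStream);

impl OwnedStream {
    /// Move the stream out of the caller's struct, leaving behind a shell
    /// whose `release` is `None` (the "already released" state).
    unsafe fn from_raw(raw: *mut MockStream) -> Self {
        let empty = MockStream { release: None, private_data: std::ptr::null_mut() };
        OwnedStream(unsafe { std::ptr::replace(raw, empty) })
    }
}

impl Drop for OwnedStream {
    fn drop(&mut self) {
        if let Some(release) = self.0.release.take() {
            unsafe { release(&mut self.0) };
        }
    }
}

/// Hypothetical entry point. Returns 0 on success and -1 on error.
pub unsafe extern "C" fn sketch_write(stream: *mut MockStream, mode: i32) -> i32 {
    if stream.is_null() {
        return -1;
    }
    // Take ownership before any further validation, so that every later
    // return path drops the wrapper and therefore releases the stream.
    let _owned = unsafe { OwnedStream::from_raw(stream) };
    let Some(_mode) = WriteMode::from_raw(mode) else {
        return -1; // invalid mode: `_owned` is dropped here
    };
    // A real implementation would read batches and commit a manifest here.
    0
}

/// Helper release callback: increments a `u32` counter behind `private_data`.
pub unsafe extern "C" fn count_release(stream: *mut MockStream) {
    unsafe {
        *((*stream).private_data as *mut u32) += 1;
        (*stream).release = None;
    }
}
```

Because ownership moves into the wrapper before the mode check, the invalid-mode path needs no explicit cleanup: `Drop` handles the release on early returns and on success alike.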

Test plan

  • cargo test — 55 integration tests, 11 new (CREATE/APPEND/OVERWRITE happy paths, OVERWRITE on a missing path, CREATE on an existing path, schema mismatches, empty stream, NULL args, invalid mode, out_dataset propagation)
  • cargo clippy --all-targets -- -D warnings clean
  • cargo fmt --check clean
  • cargo test --test compile_and_run_test -- --ignored — C and C++ scan→write round-trips pass

Closes #14.

…rrayStream

Writes an ArrowArrayStream into a Lance dataset with a committed manifest.
Adds a mode enum (CREATE / APPEND / OVERWRITE) and an optional out_dataset
that returns the open dataset at the new version (so callers don't need to
reopen). Structure follows fragment_writer.rs: schema fail-fast, storage
options pass-through, thread-local errors. C++ gets a scoped WriteMode
enum and a lance::Dataset::write() static method that reads the stream's
schema automatically.

Two FFI details:
  - mode is received as i32 and validated via LanceWriteMode::from_raw;
    accepting the enum directly would be UB for out-of-range values.
  - The stream is consumed via ArrowArrayStreamReader::from_raw right
    after the NULL check so the "consumed on any return" contract holds.

Tests: 11 new Rust unit tests (CREATE/APPEND/OVERWRITE happy paths,
OVERWRITE-on-missing, CREATE-on-existing, schema mismatches, empty
stream, NULL args, invalid mode, out_dataset propagation). The ignored
C and C++ integration tests now do a scan->write round-trip.

Closes lance-format#14.
@LuciferYang LuciferYang force-pushed the feat/issue-14-dataset-write branch from b3e813c to 8a9325e Compare April 24, 2026 08:02
The LanceWriteMode doc referenced `LanceWriteMode::from_raw`, which is
private. rustdoc -D warnings (the Rustdoc CI job) flags this as
`private_intra_doc_links`. Rewording to describe the validation behavior
without naming the private function.

No API changes.
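
The doc-comment fix described in this commit can be illustrated roughly as follows. The wording of the doc comment and the discriminant values are assumptions made for illustration; only the names `LanceWriteMode` and `from_raw`, and the fact that `from_raw` is private, come from the PR text.

```rust
/// Write mode for a dataset write.
///
/// Raw integer values arriving from C are validated before use, and values
/// outside the known variants are rejected.
// An earlier wording that named the private helper through an intra-doc
// link, for example
//     /// Validated via [`LanceWriteMode::from_raw`].
// makes rustdoc report `rustdoc::private_intra_doc_links`, which becomes an
// error when rustdoc runs with `-D warnings`.
#[repr(i32)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LanceWriteMode {
    Create = 0,
    Append = 1,
    Overwrite = 2,
}

impl LanceWriteMode {
    // Private helper: public docs describe its behavior without linking to it.
    fn from_raw(raw: i32) -> Option<Self> {
        match raw {
            0 => Some(Self::Create),
            1 => Some(Self::Append),
            2 => Some(Self::Overwrite),
            _ => None,
        }
    }
}
```

Describing the behavior in prose keeps the public documentation accurate without exposing the helper or widening its visibility.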
@LuciferYang
Contributor Author

cc @jja725: #15 depends on this one.


Development

Successfully merging this pull request may close these issues.

Add lance_dataset_write() for create/append/overwrite from ArrowArrayStream
