feat(rust)!: compile-time dataset lifetime safety for all index types#1869
feat(rust)!: compile-time dataset lifetime safety for all index types#1869zbennett10 wants to merge 1 commit intorapidsai:mainfrom
Conversation
|
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
|
@benfred This does introduce a breaking change.. which I"m not super happy about it.. I mean we could do something like this where we have a backward-compatible approach.. basically keep // Keep the old build() signature — it becomes the "owned" path
impl Index<'static> {
pub fn build<T: Into<ManagedTensor>>(
res: &Resources,
params: &IndexParams,
dataset: T,
) -> Result<Index<'static>> {
let dataset: ManagedTensor = dataset.into();
let inner = Self::create_handle()?;
unsafe { check_cuvs(ffi::cuvsCagraBuild(res.0, params.0, dataset.as_ptr(), inner))?; }
Ok(Index { inner, _data: DatasetOwnership::Owned(dataset) })
}
}
// New safe borrowed path alongside it
impl<'a> Index<'a> {
pub fn build_borrowed(
res: &Resources,
params: &IndexParams,
dataset: &'a ManagedTensor,
) -> Result<Index<'a>> { ... }
}Existing code like But — the struct still has |
6a66208 to
e006925
Compare
|
/ok to test e006925 |
|
/ok to test 994f408 |
…ypes Cherry-picked from upstream PR rapidsai#1869 (rapidsai/cuvs). Introduces Index<'a> with DatasetOwnership enum (Borrowed/Owned) across brute_force, cagra, ivf_flat, and ivf_pq. Adds build() for borrowed datasets (compiler enforces lifetime) and build_owned() for self-contained indexes. Fixes use-after-free when C library dereferences freed dataset. Original author: zbennett10 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… indexes Cherry-picks CAGRA serialization from upstream PR rapidsai#1840 (zbennett10), adapted for the Index<'a> lifetime API from PR rapidsai#1869. Adds new Nuvai-authored serialize/deserialize methods for IVF-PQ and IVF-Flat, following the same pattern (CString filename, FFI call to cuvsIvfPqSerialize/Deserialize and cuvsIvfFlatSerialize/Deserialize). Includes round-trip tests for CAGRA and IVF-PQ serialization, plus hnswlib export test for CAGRA. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cherry-picked from upstream PR rapidsai#1478, adapted for Index<'a> lifetime API from PR rapidsai#1869. Adds Filter trait and implementations (NoFilter, Bitset, Bitmap) in Rust, updates search() signatures across CAGRA, IVF-PQ, and IVF-Flat to accept Optional<&dyn Filter>. Includes C API filter plumbing, C tests for filtered search, and Go/Python bindings updates. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ok to test 563762c |
|
Hey Zachary, thank you for your contribution — the current state of the Rust bindings is indeed unsound for certain index types and this is a very important area of improvement. I've spent some time investigating the C++ internals to understand exactly which index types need lifetime enforcement, and here's what I found:
Our Rust bindings should retain the same ownership semantics as the underlying library. For this reason I feel that I'd need to push back on the proposed design. Besides, the way it is implemented in this PR is too fragile. The Instead, I propose to simplify this PR as follows:
|
Simplify the dataset-lifetime-safety work per review feedback on rapidsai#1869: - IVF-Flat and IVF-PQ always copy their input in the C++ library, so their Rust bindings do not need a lifetime parameter and have been reverted to the upstream signatures. - CAGRA's `make_aligned_dataset()` stores a non-owning view when the input is device-accessible, row-major and 16-byte aligned, so the Rust `Index` now carries a lifetime tied to the `ManagedTensor` it was built from. The borrow checker then prevents use-after-free regardless of which runtime path the C++ side takes. - Brute Force always stores a non-owning view and gets the same lifetime treatment. - Dropped `DatasetOwnership` and `build_owned` — the `ManagedTensor` type itself doesn't express ownership in the type system, so the `Owned` variant was fragile. `Index<'a>` now holds a `PhantomData<&'a ()>` directly, which matches the semantics of the underlying C++ library exactly. - Test suites for both index types have been reorganised so the dataset + self-neighbor search boilerplate is factored into shared helpers, matching the pattern being adopted in the CAGRA serialization PR. This is still a breaking change for `cagra::Index` and `brute_force::Index::build`, which now take `&ManagedTensor` instead of `impl Into<ManagedTensor>`. IVF-Flat and IVF-PQ are unchanged. Signed-off-by: Zach Bennett <zach@worldflowai.com>
563762c to
3c090d6
Compare
|
Appreciate you @yan-zaretskiy — thanks a lot for the great feedback The public API change is narrower now — just |
|
Looks good to me otherwise! |
Simplify the dataset-lifetime-safety work per review feedback on rapidsai#1869: - IVF-Flat and IVF-PQ always copy their input in the C++ library, so their Rust bindings do not need a lifetime parameter and have been reverted to the upstream signatures. - CAGRA's `make_aligned_dataset()` stores a non-owning view when the input is device-accessible, row-major and 16-byte aligned, so the Rust `Index` now carries a lifetime tied to the `ManagedTensor` it was built from. The borrow checker then prevents use-after-free regardless of which runtime path the C++ side takes. - Brute Force always stores a non-owning view and gets the same lifetime treatment. - Dropped `DatasetOwnership` and `build_owned` — the `ManagedTensor` type itself doesn't express ownership in the type system, so the `Owned` variant was fragile. `Index<'a>` now holds a `PhantomData<&'a ()>` directly, which matches the semantics of the underlying C++ library exactly. - Test suites for both index types have been reorganised so the dataset + self-neighbor search boilerplate is factored into shared helpers, matching the pattern being adopted in the CAGRA serialization PR. This is still a breaking change for `cagra::Index` and `brute_force::Index::build`, which now take `&ManagedTensor` instead of `impl Into<ManagedTensor>`. IVF-Flat and IVF-PQ are unchanged. Signed-off-by: Zach Bennett <zach@worldflowai.com>
3c090d6 to
a6c5e46
Compare
|
@zbennett10 Could you resolve the branch conflict so that we can merge this PR? |
Simplify the dataset-lifetime-safety work per review feedback on rapidsai#1869: - IVF-Flat and IVF-PQ always copy their input in the C++ library, so their Rust bindings do not need a lifetime parameter and have been left on their upstream signatures. - CAGRA's `make_aligned_dataset()` stores a non-owning view when the input is device-accessible, row-major and 16-byte aligned, so the Rust `Index` now carries a lifetime tied to the `ManagedTensor` it was built from. The borrow checker then prevents use-after-free regardless of which runtime path the C++ side takes. - Brute Force always stores a non-owning view and gets the same lifetime treatment. - `DatasetOwnership` and `build_owned` are dropped — the `ManagedTensor` type itself doesn't express ownership in the type system, so the `Owned` variant was fragile. `Index<'a>` now holds a `PhantomData<&'a ()>` directly, which matches the semantics of the underlying C++ library exactly. - `cagra::Index::deserialize` now returns `Index<'static>`; the deserialized index owns its data and is not tied to any input `ManagedTensor`. `serialize` and `serialize_to_hnswlib` live on `impl<'a> Index<'a>` so they work against both borrowed and deserialized indices. - `pub fn new()` is gone from both `cagra::Index` and `brute_force::Index`. An empty index has no real dataset lifetime and the builders call the private `create_handle` helper directly. - Test suites for both index types have been reorganised so the dataset + self-neighbor search boilerplate is factored into shared helpers, including a negative test that confirms a path containing an interior NUL byte surfaces as `Error::InvalidArgument`. Breaking change: `cagra::Index::build` and `brute_force::Index::build` now take `&ManagedTensor` instead of `impl Into<ManagedTensor>`, and the public `new()` constructors have been removed. IVF-Flat and IVF-PQ are unchanged. Signed-off-by: Zach Bennett <zach@worldflowai.com>
a6c5e46 to
defcc60
Compare
|
This is follow-on from #1839 based on comments from @benfred
Introduces compile-time dataset lifetime enforcement across all 4 Rust index types (
brute_force,cagra,ivf_flat,ivf_pq)... assuming we want to do it for all of them! It addresses the use-after-free risk whenManagedTensor::from(&ndarray)creates a non-owning view and the C library stores a pointer to the dataset.. and who knows what shenanigans may go on "down there".. 😄Main Changes
DatasetOwnership<'a>enum indlpack.rs— tracks whether an index borrows or owns its datasetIndex<'a>with dual constructors for all 4 index types:build(&'a ManagedTensor)— borrowed path, compiler enforces dataset outlives indexbuild_owned(ManagedTensor)— owned path ('static), index is self-containedcreate_handle()helper — avoids double-free by separating FFI handle creation fromIndexstruct constructionDesign
Breaking Change
Index::build()now takes&ManagedTensorinstead ofimpl Into<ManagedTensor>. Users should:build(&tensor)with an explicitManagedTensorreference (borrowed)build_owned(tensor)for the old move semantics (owned)