[REVIEW] New Dataset API Clarifying Ownership#1846

Open
HowardHuang1 wants to merge 40 commits into rapidsai:main from HowardHuang1:HH-Dataset-API

Conversation

@HowardHuang1

Addressing #1574 and #1571.

Replaced the strided_dataset class with a padded_dataset class. Added support all the way up through the CAGRA code. The strided_dataset code is left in for backwards compatibility but can be deprecated later on.

@copy-pr-bot

copy-pr-bot bot commented Feb 24, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@aamijar aamijar added the non-breaking (Introduces a non-breaking change) and feature request (New feature or request) labels Feb 25, 2026
@aamijar aamijar moved this to In Progress in Unstructured Data Processing Feb 25, 2026
@aamijar
Member

aamijar commented Feb 25, 2026

/ok to test 5447a4c

@aamijar
Member

aamijar commented Feb 25, 2026

/ok to test 17ab09d

@achirkin achirkin added the breaking (Introduces a breaking change) label and removed the non-breaking (Introduces a non-breaking change) label Feb 25, 2026
@achirkin
Contributor

NB: I updated the label to breaking, since the description implies the removal of a publicly visible class, strided_dataset.

Comment on lines +1152 to +1175

/**
 * @brief Build the index from a device padded dataset view (non-owning).
 *
 * The index stores a copy of the view; the caller must keep the dataset memory alive.
 * See build(res, params, device_matrix_view) for full documentation.
 */
template <typename T, typename IdxT = uint32_t>
auto build(raft::resources const& res,
           const cuvs::neighbors::cagra::index_params& params,
           cuvs::neighbors::device_padded_dataset_view<T, int64_t> const& dataset)
  -> cuvs::neighbors::cagra::index<T, IdxT>;

/**
 * @brief Build the index from a device padded dataset (owning; takes ownership).
 *
 * See build(res, params, device_matrix_view) for full documentation.
 */
template <typename T, typename IdxT = uint32_t>
auto build(raft::resources const& res,
           const cuvs::neighbors::cagra::index_params& params,
           cuvs::neighbors::device_padded_dataset<T, int64_t>&& dataset)
  -> cuvs::neighbors::cagra::index<T, IdxT>;

Contributor

I've likely missed some related discussions, but could you please reiterate for my understanding: what's the purpose of using the explicitly typed device_padded_dataset_view / device_padded_dataset here vs. using mdspans/mdarrays with the padded layout? Does the dataset(_view) type bring anything on top of mdarray/mdspan in that case?

I would understand the logic if we were passing an abstract dataset/dataset_view object and employed different code paths based on dynamic_cast results. But here it looks like we're adding one more type that mirrors the functionality of the other.

Author

We want a single dataset pointer that can represent all types of datasets (owning/non-owning and quantized/non-quantized), which makes it easier for the rest of CAGRA (index, search, merge, serialize, etc.) to interact with. Instead of interacting with individual mdarrays and mdspans, CAGRA can just interact with this single pointer.

dataset(_view) adds functionality on top of mdarray and mdspan in the case of compressed datasets (pq_dataset and vpq_dataset), which require extra information, like codebooks, that is not present in mdarray/mdspan.

So yes, in reality, the dataset class is just a wrapper for mdspan and mdarray, since device_matrix and device_matrix_view (which are used in the padded_dataset classes) are just type aliases for mdarray and mdspan, but it offers two benefits:
(1) a centralized way for CAGRA to interact with various types of datasets via a pointer + object-oriented design
(2) the additional information needed for compressed datasets

Contributor

Thanks for your answers @HowardHuang1 and @cjnolet, I agree with everything you say. But my question here was about these two build functions, because they don't make use of the feature we're talking about: why do they take the specific dataset/view instantiation types rather than the top-level abstract dataset/view types (which would properly hide implementation details like the datasets being padded)?

Also @cjnolet, do we relax the cuVS API to allow passing owning structures (dataset&&) to the build functions (I'm not against it, just asking; this will probably allow us to avoid copies in certain cases)?

Author

I am changing it right now, but based on the discussion here I believe we want to alter build so that in the future it only takes a view and never takes owning structures (dataset&&).

Member

Also @cjnolet, do we relax the cuVS API to allow passing owning structures (dataset&&) to the build functions?

Nope, the whole point of this design is to avoid passing ownership or copies. The user will just be passing views around.

Contributor

I see the move-semantics owning dataset overloads are gone from the current version and we consistently pass the views.

Still, one question on my side remains. You pass the specific device_padded_dataset_view, which is equivalent to passing device_padded_mdspan with no clear advantage over it. Why don't you pass the abstract dataset_view at the API level and let the implementation dynamic_cast it to the appropriate code path in the detail namespace? You could also implement the visitor pattern to take full advantage of the new hierarchy.

Author

That makes sense, and I'm assuming this applies not only to build() but to index() and update_dataset() as well. However, is there any case where a user would want to pass in anything other than a device_padded_dataset_view for build()?

For build():

  • The only dataset type I can think of that we want to accept is device_padded_dataset_view. This is because we are deprecating strided_dataset and getting rid of any instances where index/build/update take owning datasets like the old device_padded_dataset. Additionally, in all cases, we expect the user to provide a non-compressed dataset as input, right? Compression is meant to happen internally only within our build() function, and users can enable it by setting params.compression.has_value() == true. This means users should never pass a compressed dataset directly into build().

TL;DR: users will only be expected to pass device_padded_dataset_view to build().

For index() and update_dataset():

  • For these two functions I agree. Since index() and update_dataset() should support device_padded_dataset_view, vpq_dataset_view, and pq_dataset_view, the abstract class with dynamic casting to route to the correct type handling can be implemented here.

Member

Correct: all public API functions should accept views. There are some very specific APIs that do not accept views, but we code those APIs very carefully and make sure the ownership model is still very clear and transparent to the user.

The APIs which accept sparse matrices, for example, need to accept owning structures, but the ownership is still always left with the user (e.g., no other class or struct will ever copy or take ownership of the underlying data, so they are still treated as views).

@cjnolet
Member

cjnolet commented Feb 25, 2026

Does the dataset(_view) type bring anything on top of mdarray/mdspan in that case?

@achirkin The problem w/ using mdspan/mdarray for this is that it doesn't carry the proper information along to either the algorithms or the user (which is why we created this specialized class in the first place!).

Two immediate reasons why this API is necessary:

  1. The user should not have to know that they need to pad a dataset in order to use CAGRA without the additional copy. They should not need to know how any of these algorithms work internally. They should, however, know that CAGRA expects a padded dataset, and they should have an API to construct one so that they can own the dataset class and not have CAGRA creating one under the hood.
  2. APIs, especially the graph-based APIs, should be able to accept as input data which has been quantized using a method like PQ, which carries additional information with it. In the case of PQ, the codebooks are needed to compute the distances. This again decouples the quantization from the algorithm (CAGRA-Q does not need to do its own quantization; it should just accept the quantized vectors). We're being asked for the same behavior with Vamana.

This new API solves both of these problems while leaving control over the memory ownership entirely in the user's hands. We've discussed this for a long time. We've known this is needed for a long time. It's time to prioritize this and get it done. I agree that an abstract class might make more sense, but ultimately we should not be moving any ownership over to the algorithm (the user should maintain ownership of the class and the underlying memory the entire time).

Comment thread on cpp/CMakeLists.txt:

  CUVS_BUILD_SPECTRAL_EMBEDDING
  "Build spectral embedding and cluster spectral (requires RAFT with lanczos_compute_eigenpairs)"
  OFF
)
Member

Are all these cmake changes intentional?

Author

Yes, it seems like I can't build the cuVS codebase without disabling spectral embedding. I tried matching the RAFT version to the version cuVS was expecting but still got the error that the RAFT version I'm using doesn't provide lanczos_compute_eigenpairs. I was eventually going to circle back and ask about this. Do you know a workaround for this, or what the last stable RAFT version is that provides that function?

Member

Oh oops, missed this. You will want to pull the latest RAFT version locally to fix these types of build issues. Are you building cuVS with a local version of RAFT?

…tion between make host/device padded dataset in factory
@HowardHuang1 HowardHuang1 requested a review from a team as a code owner February 28, 2026 03:34
… of dataset + create build_result struct which returns both index and vpq_dataset to prevent automatic out of scope destruction of dataset for vpq case
…rt for cases where we DO need to own the dataset (in order to keep view alive for index). All cases where we build() from dataset already on device --> we don't need to own. Merge + All cases when data is on host --> we DO need to own the device copy we create. This includes within ACE build and C API build from host and from_args with host dataset
@HowardHuang1
Author

HowardHuang1 commented Mar 4, 2026

The doc that outlines some of the API design choices can be found in Slack. Let me know if there are any parts of the design that can be altered to better suit our users' needs.

The following files are test case files I've added and can be ignored for now. They will be removed before the final merge with the upstream repo:

  • cagra_build_view_only.cu
  • cagra_padded_dataset.cu
  • cagra_vpq_build_result.cu
  • dataset_compression.cu
  • dataset_types.cu

…g make_padded_dataset/view to get correct stride prior to call to build
…API where using stride(0) from strided view does not match what the reduction kernel expects
…er of dimensions in index. Use small buffer to copy over each batch without padding
…queries.extent(1) != idx.dim() that was caused by index's dataset_view pointed incorrectly at old br.vpq location rather than correct vpq_keep
…e stride for padded dataset causing cudaErrorMisalignedAddress. Fix is use stride for copy matrix instead of dim
…aset is set to zeros rather than initialized to corrupted random values
…bstract base class. New dataset<> and dataset_view<> empty marker layers which are also abstract classes extend polymorphic_dataset<>. Real concrete classes extend either the owning or non-owning empty marker layers.
…nstead of concrete device_padded_dataset_view only. Implemented a dispatcher for index() and a function converting all dataset_view types to device_padded_datatset_view for build()
@aamijar
Member

aamijar commented Apr 15, 2026

/ok to test b0cc113

@cjnolet
Member

cjnolet commented Apr 15, 2026

/ok to test 4098a44

@HowardHuang1
Author

/ok to test b130341


Labels

breaking (Introduces a breaking change), feature request (New feature or request)

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

5 participants