[REVIEW] New Dataset API Clarifying Ownership#1846

Open
HowardHuang1 wants to merge 40 commits into rapidsai:main from HowardHuang1:HH-Dataset-API

Conversation

@HowardHuang1

Addressing #1574 and #1571.

Replaced the strided_dataset class with a padded_dataset class. Added support all the way up through the CAGRA code. The strided_dataset code is left in for backwards compatibility but can be deprecated later on.

@copy-pr-bot

copy-pr-bot bot commented Feb 24, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@aamijar aamijar added the non-breaking (Introduces a non-breaking change) and feature request (New feature or request) labels Feb 25, 2026
@aamijar aamijar moved this to In Progress in Unstructured Data Processing Feb 25, 2026
@aamijar
Member

aamijar commented Feb 25, 2026

/ok to test 5447a4c

@aamijar
Member

aamijar commented Feb 25, 2026

/ok to test 17ab09d

@achirkin achirkin added the breaking (Introduces a breaking change) label and removed the non-breaking (Introduces a non-breaking change) label Feb 25, 2026
@achirkin
Contributor

NB: I updated the label to breaking, since the description implies the removal of a publicly visible class, strided_dataset.

Comment on lines +1152 to +1175

/**
 * @brief Build the index from a device padded dataset view (non-owning).
 *
 * The index stores a copy of the view; the caller must keep the dataset memory alive.
 * See build(res, params, device_matrix_view) for full documentation.
 */
template <typename T, typename IdxT = uint32_t>
auto build(raft::resources const& res,
           const cuvs::neighbors::cagra::index_params& params,
           cuvs::neighbors::device_padded_dataset_view<T, int64_t> const& dataset)
  -> cuvs::neighbors::cagra::index<T, IdxT>;

/**
 * @brief Build the index from a device padded dataset (owning; takes ownership).
 *
 * See build(res, params, device_matrix_view) for full documentation.
 */
template <typename T, typename IdxT = uint32_t>
auto build(raft::resources const& res,
           const cuvs::neighbors::cagra::index_params& params,
           cuvs::neighbors::device_padded_dataset<T, int64_t>&& dataset)
  -> cuvs::neighbors::cagra::index<T, IdxT>;

Contributor

I've likely missed some related discussions, but could you please reiterate for my understanding: what's the purpose of using the explicitly typed device_padded_dataset_view / device_padded_dataset here vs. using mdspans/mdarrays with the padded layout? Does the dataset(_view) type bring anything on top of mdarray/mdspan in that case?

I would understand the logic if we were passing an abstract dataset/dataset_view object and employed different code paths based on dynamic_cast results. But here it looks like we're adding one more type that mirrors the functionality of the other.

Author

We want a single dataset pointer that can represent all types of datasets (owning/non-owning and quantized/non-quantized), which makes it easier for the rest of CAGRA (index, search, merge, serialize, etc.) to interact with. Instead of interacting with individual mdarrays and mdspans, CAGRA can just interact with this single pointer.

dataset(_view) adds functionality on top of mdarray and mdspan in the case of compressed datasets (pq_dataset and vpq_dataset), which require extra information, like codebooks, that is not present in mdarray/mdspan.

So yes, in reality, the dataset class is just a wrapper for mdspan and mdarray, since device_matrix and device_matrix_view (which are used in the padded_dataset classes) are just type aliases for mdarray and mdspan, but it offers two benefits:
(1) a centralized way for CAGRA to interact with various types of datasets via a pointer + object-oriented design
(2) the additional information needed for compressed datasets

Contributor

Thanks for your answers @HowardHuang1 and @cjnolet, I agree with everything you say. But my question here was about these two build functions, because they don't make use of the feature we're talking about: why do they take the specific dataset/view instantiation types rather than the top-level abstract dataset/view types (which would properly hide implementation details like the datasets being padded)?

Also @cjnolet, do we relax the cuVS API to allow passing owning structures (dataset&&) to the build functions (I'm not against it, just asking; this will probably allow us to avoid copies in certain cases)?

Author

I am changing it right now, but based on the discussion here I believe we want to alter build so that in the future it only takes a view and never takes owning structures (dataset&&).

Member

Also @cjnolet, do we relax the cuVS API to allow passing owning structures (dataset&&) to the build functions?

Nope, the whole point of this design is to avoid passing ownership or copies. The user will just be passing views around.

Contributor

I see the move-semantics owning dataset overloads are gone from the current version and we consistently pass the views.

Still, one question on my side remains. You pass the specific device_padded_dataset_view, which is equivalent to passing device_padded_mdspan with no clear advantage over it. Why don't you pass the abstract dataset_view at the API level and let the implementation dynamic_cast it to the appropriate code path in the detail namespace? You could also implement the visitor pattern to take full advantage of the new hierarchy.

Author

That makes sense, and I'm assuming this applies not only to build() but to index() and update_dataset() as well. However, is there any case where a user would want to pass in anything other than a device_padded_dataset_view for build()?

For build():

  • The only dataset type I can think of that we want to accept is device_padded_dataset_view. This is because we are deprecating strided_dataset and getting rid of any instances where index/build/update take owning datasets like the old device_padded_dataset. Additionally, in all cases, we expect the user to provide a non-compressed dataset as input, right? Compression is meant to happen internally only within our build() function, and users can enable it by setting params.compression.has_value() == true. This means users should never pass a compressed dataset directly into build().

TL;DR: users will only be expected to pass device_padded_dataset_view to build().

For index() and update_dataset():

  • For these two functions I agree. Since index() and update_dataset() should support device_padded_dataset_view, vpq_dataset_view, and pq_dataset_view, the abstract class with dynamic casting to route to the correct type handling can be implemented here.

Member

Correct: all public API functions should accept views. There are some very specific APIs that do not accept views, but we code those APIs very carefully and make sure the ownership model is still very clear and transparent to the user.

The APIs which accept sparse matrices, for example, need to accept owning structures, but the ownership is still always left with the user (e.g., no other class or struct will ever copy or take ownership of the underlying data, so they are still treated as views).

@cjnolet
Member

cjnolet commented Feb 25, 2026

Does the dataset(_view) type bring anything on top of mdarray/mdspan in that case?

@achirkin The problem w/ using mdspan/mdarray for this is that it doesn't carry the proper information along to either the algorithms or the user (which is why we created this specialized class in the first place!).

Two immediate reasons why this API is necessary:

  1. The user should not have to know that they need to pad a dataset in order to use CAGRA without the additional copy. They should not need to know how any of these algorithms work internally. They should, however, know that CAGRA expects a padded dataset, and they should have an API to construct one so that they can own the dataset class and not have CAGRA creating one under the hood.
  2. APIs, especially the graph-based APIs, should be able to accept as input data which has been quantized using a method like PQ, which carries additional information with it. In the case of PQ, the codebooks are needed to compute the distances. This again decouples the quantization from the algorithm (CAGRA-Q does not need to do its own quantization; it should just accept the quantized vectors). We're being asked for the same behavior with Vamana.

This new API solves both of these problems while leaving control over the memory ownership entirely in the user's hands. We've discussed this for a long time. We've known this is needed for a long time. It's time to prioritize this and get it done. I agree that an abstract class might make more sense, but ultimately we should not be moving any ownership over to the algorithm (the user should maintain ownership of the class and the underlying memory the entire time).

Comment thread on cpp/CMakeLists.txt:

  CUVS_BUILD_SPECTRAL_EMBEDDING
  "Build spectral embedding and cluster spectral (requires RAFT with lanczos_compute_eigenpairs)"
  OFF
)
Member

Are all these cmake changes intentional?

Author

Yes, it seems like I can't build the cuVS codebase without disabling spectral embedding. I tried matching the RAFT version to the version cuVS was expecting but still got the error that the RAFT version I'm using doesn't provide lanczos_compute_eigenpairs. I was eventually going to circle back and ask about this. Do you know a workaround for this, or what the last stable RAFT version is that provides that function?

Member

Oh oops, missed this. You will want to pull the latest RAFT version locally to fix these types of build issues. Are you building cuVS with a local version of RAFT?

…tion between make host/device padded dataset in factory
@HowardHuang1 HowardHuang1 requested a review from a team as a code owner February 28, 2026 03:34
… of dataset + create build_result struct which returns both index and vpq_dataset to prevent automatic out of scope destruction of dataset for vpq case
…rt for cases where we DO need to own the dataset (in order to keep view alive for index). All cases where we build() from dataset already on device --> we don't need to own. Merge + All cases when data is on host --> we DO need to own the device copy we create. This includes within ACE build and C API build from host and from_args with host dataset
@HowardHuang1
Author

HowardHuang1 commented Mar 4, 2026

The doc that outlines some of the API design choices can be found in Slack. Let me know if there are any parts of the design that can be altered to better suit our users' needs.

The following files are test case files I've added and can be ignored for now. They will be removed before the final merge with the upstream repo:

  • cagra_build_view_only.cu
  • cagra_padded_dataset.cu
  • cagra_vpq_build_result.cu
  • dataset_compression.cu
  • dataset_types.cu

…g make_padded_dataset/view to get correct stride prior to call to build
…API where using stride(0) from strided view does not match what the reduction kernel expects
…er of dimensions in index. Use small buffer to copy over each batch without padding
…queries.extent(1) != idx.dim() that was caused by index's dataset_view pointed incorrectly at old br.vpq location rather than correct vpq_keep
…e stride for padded dataset causing cudaErrorMisalignedAddress. Fix is use stride for copy matrix instead of dim
…aset is set to zeros rather than initialized to corrupted random values
…bstract base class. New dataset<> and dataset_view<> empty marker layers which are also abstract classes extend polymorphic_dataset<>. Real concrete classes extend either the owning or non-owning empty marker layers.
…nstead of concrete device_padded_dataset_view only. Implemented a dispatcher for index() and a function converting all dataset_view types to device_padded_datatset_view for build()
@aamijar
Member

aamijar commented Apr 15, 2026

/ok to test b0cc113

@cjnolet
Member

cjnolet commented Apr 15, 2026

/ok to test 4098a44

@HowardHuang1
Author

/ok to test b130341


Labels

breaking (Introduces a breaking change), feature request (New feature or request)

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

5 participants