[REVIEW] New Dataset API Clarifying Ownership#1846
[REVIEW] New Dataset API Clarifying Ownership#1846HowardHuang1 wants to merge 40 commits intorapidsai:mainfrom
Conversation
|
/ok to test 5447a4c |
|
/ok to test 17ab09d |
|
NB: I updated the label to |
|
|
||
| /** | ||
| * @brief Build the index from a device padded dataset view (non-owning). | ||
| * | ||
| * The index stores a copy of the view; the caller must keep the dataset memory alive. | ||
| * See build(res, params, device_matrix_view) for full documentation. | ||
| */ | ||
| template <typename T, typename IdxT = uint32_t> | ||
| auto build(raft::resources const& res, | ||
| const cuvs::neighbors::cagra::index_params& params, | ||
| cuvs::neighbors::device_padded_dataset_view<T, int64_t> const& dataset) | ||
| -> cuvs::neighbors::cagra::index<T, IdxT>; | ||
|
|
||
| /** | ||
| * @brief Build the index from a device padded dataset (owning; takes ownership). | ||
| * | ||
| * See build(res, params, device_matrix_view) for full documentation. | ||
| */ | ||
| template <typename T, typename IdxT = uint32_t> | ||
| auto build(raft::resources const& res, | ||
| const cuvs::neighbors::cagra::index_params& params, | ||
| cuvs::neighbors::device_padded_dataset<T, int64_t>&& dataset) | ||
| -> cuvs::neighbors::cagra::index<T, IdxT>; | ||
|
|
There was a problem hiding this comment.
I've likely missed some related discussions, but could you please reiterate for my understanding: what's the purpose of using the explicitly typed device_padded_dataset_view/ device_padded_dataset here vs using mdspans/mdarrays with the padded layout? Does the dataset(_view) type bring anything on top of mdarray/mdspan in that case?
I would understand the logic if we were passing an abstract dataset/dataset_view object and employed different code-path based on dynamic_casting results. But here it looks like we're adding one more type that mirrors the functionality of the other.
There was a problem hiding this comment.
We want a single dataset pointer that can represent all types of datasets (owning/non-owning and quantized/non-quantized) which makes it easier for rest of CAGRA (index, search, merge, serialize, etc.) to interact with. Instead of interacting with individual mdarrays and mdspans, CAGRA can just interact with this single pointer.
dataset (_view) adds additional functionality on top of mdarray and mdspan in the case of compressed datasets (pq_dataset & vpq_dataset) which require more info like codebooks that are not present in mdarray/mdspan.
So yes, in reality, the dataset class is just a wrapper for mdspan and mdarray since device_matrix and device_matrix_view which are used in the padded_dataset classes are just type aliases for mdarray and mdspan but it offers 2 benefits:
(1) centralized way for CAGRA to interact with various types of datasets via ptr + object oriented design
(2) additional info needed for compressed datasets
There was a problem hiding this comment.
Thanks for your answers @HowardHuang1 and @cjnolet , I agree with everything you say. But my question here was about these two build functions because they don't make use of the feature we're talking about: why do they take the specific dataset/view instantiation types rather than the top-level abstract dataset/view types (which would properly hide the implementation details like the datasets being padded)?
Also @cjnolet , do we relax cuVS api to allow passing owning structures (dataset&&) to the build functions(I'm not against it, just asking - this will probably allow us avoid copies in certain cases)?
There was a problem hiding this comment.
I am changing it right now but I believe based on the discussion here we want to alter build so that it only takes a view in the future and will never take owning structures (dataset&&).
There was a problem hiding this comment.
Also @cjnolet , do we relax cuVS api to allow passing owning structures (dataset&&) to the build functions(I'm not against it, just asking - this will probably allow us avoid copies in certain cases)?
Nope, the whole point with this design is to avoid passing ownership or copies. User will just be passing views around.
There was a problem hiding this comment.
I see the move-semantics onwing dataset overloads are gone from the current version and we consistently pass the views.
Still one question on my side remains. You pass the specific device_padded_dataset_view, which is equivalent to passing device_padded_mdspan with no clear advantage over it. Why don't you pass the abstract dataset_view at the API level and let the implementation dynamic-cast it to an appropriate code path in detail namespace? You could also implement the visitor pattern to take the full advantage of the new hierarchy.
There was a problem hiding this comment.
That makes sense, and I'm assuming this applies not only to build() but to index() and update_dataset() as well. However, is there any case a user would want to pass in a non- device_padded_dataset_view for build()?
For build():
- The only dataset type I can think of that we want to accept is device_padded_dataset_view. This is because we are deprecating strided_dataset and getting rid of any instances where index/build/update take owning datasets like the old device_padded_dataset. Additionally, in all cases, we expect the user to provide a non-compressed dataset as input right? Compression is meant to happen internally only within our build() function, and users can do this by setting
params.compression.has_value() == True. This means users should never pass in a compressed dataset directly into build().
TLDR: users will only be expected to pass in device_padded_dataset_view to build().
For index() and update_dataset():
- For these two functions I agree. Since index() and update_dataset() should support device_padded_dataset_view, vpq_dataset_view, and pq_dataset_view, the abstract class with dynamic casting to route to the correct type handling can be implemented here.
There was a problem hiding this comment.
Correct - all public api functions should except views. There are some very specific APIs that do not accept views, but we code those APIs very carefully and make sure the ownership model is still very clear and transparent to the user.
The APIs which accept sparse matrices, for example, need to accept owning structures, but the ownership is still always left with the user (for eg. No other class or struct will ever copy or take ownership of the underlying data from them, so they are still treated as views).
@achirkin The problem w/ using mdspan/mdarray for this is that it's not carrying along the proper information to either the algorithms nor the user (which is why we created this specialized class for this in the first place!). Two immediate reasons why this API is necessary:
This new API solves both of these problems while leaving the control over the memory ownership entirely in the user's hands. We've discussed this for a long time. We've known this is needed for a long time. it's time to prioritize this and get it done. I agree that an anstract class might make more sense, but ultimately we should not be moving any owneship over to the algorithm (the user should maintain ownership over the class and underlying memory the entire time). |
| CUVS_BUILD_SPECTRAL_EMBEDDING | ||
| "Build spectral embedding and cluster spectral (requires RAFT with lanczos_compute_eigenpairs)" | ||
| OFF | ||
| ) |
There was a problem hiding this comment.
Are all these cmake changes intentional?
There was a problem hiding this comment.
Yes seems like I can't build the cuvs codebase without disabling spectral embedding. I tried matching the RAFT version to the version cuvs was expecting but was still getting the error: RAFT version I'm using doesn't provide lanczos_compute_eigenpairs. Was eventually going to circle back and ask about this. Do you know a workaround for this or what the last stable raft version is that provides that function?
There was a problem hiding this comment.
Oh oops, missed this. You will want to pull the latest raft version locally to fix these type of build issues. Are you building cuvs with a local version of raft?
…tion between make host/device padded dataset in factory
… of dataset + create build_result struct which returns both index and vpq_dataset to prevent automatic out of scope destruction of dataset for vpq case
…rt for cases where we DO need to own the dataset (in order to keep view alive for index). All cases where we build() from dataset already on device --> we don't need to own. Merge + All cases when data is on host --> we DO need to own the device copy we create. This includes within ACE build and C API build from host and from_args with host dataset
|
The doc that outlines some of the API design choices can be found in slack. Let me know if there are any parts of the design that can be altered to better suit our users' needs. The following files are test case files I've added and can be ignored for now. They will be removed before the final merge with upstream repo:
|
…g make_padded_dataset/view to get correct stride prior to call to build
…API where using stride(0) from strided view does not match what the reduction kernel expects
…er of dimensions in index. Use small buffer to copy over each batch without padding
…queries.extent(1) != idx.dim() that was caused by index's dataset_view pointed incorrectly at old br.vpq location rather than correct vpq_keep
…added_dataset_view
…e stride for padded dataset causing cudaErrorMisalignedAddress. Fix is use stride for copy matrix instead of dim
…aset is set to zeros rather than initialized to corrupted random values
…bstract base class. New dataset<> and dataset_view<> empty marker layers which are also abstract classes extend polymorphic_dataset<>. Real concrete classes extend either the owning or non-owning empty marker layers.
… tree into 2: owning v.s. view
…nstead of concrete device_padded_dataset_view only. Implemented a dispatcher for index() and a function converting all dataset_view types to device_padded_datatset_view for build()
|
/ok to test b0cc113 |
|
/ok to test 4098a44 |
|
/ok to test b130341 |
This reverts commit b130341.
Addressing #1574 and #1571.
Replaced strided_dataset with padded_dataset class. Added support all the way up to CAGRA code. Strided_dataset code left in for backwards compatibility but can be deprecated later on.