[Feat] Add Support for Index merge in CAGRA #618
rapids-bot[bot] merged 22 commits into rapidsai:branch-25.02
Conversation
```cpp
auto merged_index =
  cagra::build(handle, params, raft::make_const_mdspan(device_updated_dataset_view));

if (static_cast<std::size_t>(stride) == dim) {
```
Hi @cjnolet @achirkin, I know this code looks odd, but without it, datasets get modified after calling cagra::detail::search_main_core, which causes the test failure. I don't understand how the dataset format, matrix ownership, and cagra::search interact behind the scenes. Could you comment here? Many thanks!
To me, it looks like no one owns the `host_updated_dataset` or `device_updated_dataset` beyond the scope of this function, so the data gets destroyed unless the owning `update_dataset` is called under the `if` branch here.
Hence, I think, you should call `update_dataset` unconditionally here.
> To me, it looks like no one owns the `host_updated_dataset` or `device_updated_dataset` beyond the scope of this function, so the data gets destroyed unless the owning `update_dataset` is called under the `if` branch here. Hence, I think, you should call `update_dataset` unconditionally here.
Thank you, very helpful!
```cpp
}

// Allocate the new dataset on device
auto device_updated_dataset =
```
I found a great API that fits the situation where device memory is not enough: `cuvs::neighbors::nn_descent::has_enough_device_memory`. I will use it in the next commit.
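As a rough sketch of that fallback logic: the memory check below is a stub standing in for `cuvs::neighbors::nn_descent::has_enough_device_memory` (whose real signature differs), and the byte arithmetic and `placement` type are my assumptions, not cuVS code.

```cpp
#include <cassert>
#include <cstddef>

// Stub for the device-memory check; the real cuVS helper queries the GPU.
bool has_enough_device_memory(std::size_t required_bytes, std::size_t free_bytes) {
  return required_bytes <= free_bytes;
}

// Where to materialize the merged dataset before building the output index.
enum class placement { device, host };

placement choose_placement(std::size_t n_rows, std::size_t dim,
                           std::size_t elem_size, std::size_t free_bytes) {
  const std::size_t required = n_rows * dim * elem_size;
  return has_enough_device_memory(required, free_bytes) ? placement::device
                                                        : placement::host;
}
```

The idea is simply: build the merged dataset on device when it fits, and fall back to a host buffer (with an owning `update_dataset` afterwards) when it does not.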
```cpp
 *
 * @return A new CAGRA index containing the merged indices, graph, and dataset.
 */
auto merge(raft::resources const& res,
```
Hey @chatman, I'm working on cagra::merge. Could you review the API design when you have a moment? Any suggestions would be greatly appreciated. Thanks!
On a more general note, I wonder whether the merging may be problematic on the user side due to the absence of index (vector id) remapping in CAGRA? The new index ordering depends on the order in which one puts the merged indices, so it may be difficult to map these back if the need arises.
Hi @achirkin, Thank you for pointing these out! Let me explain a bit:
Sorry for missing the last concern. Hi @cjnolet, could you confirm whether the problem @achirkin mentioned applies: do we need to take care of the order of indices when merging them?
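To make the remapping concern concrete, here is a hedged, self-contained sketch (toy code, not part of cuVS): if a physical merge simply concatenates the datasets in the order the indices were passed, then recovering the originating sub-index and local id from a merged id requires keeping the per-index sizes around on the user side.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Map a merged vector id back to (sub-index position, local id), assuming the
// merged dataset is the concatenation of the inputs in the order they were
// passed to merge. Returns {sizes.size(), 0} if the id is out of range.
std::pair<std::size_t, std::uint32_t>
unmerge_id(std::uint32_t merged_id, const std::vector<std::uint32_t>& sizes) {
  std::uint32_t offset = 0;
  for (std::size_t i = 0; i < sizes.size(); ++i) {
    if (merged_id < offset + sizes[i]) return {i, merged_id - offset};
    offset += sizes[i];
  }
  return {sizes.size(), 0};
}
```

This is exactly the bookkeeping the user would have to do themselves today, which is why the ordering of the input indices matters.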
```diff
 # detect when package size grows significantly
-max_allowed_size_compressed = '1.1G'
+max_allowed_size_compressed = '1.2G'
```
Hi @jameslamb, to be conservative, I increased it to 1.2 GB. Thanks!
/ok to test
Hi @achirkin, I thought about the 1st issue, and I realized the coupling with
```diff
 }

-/** Dimensionality of the data. */
+/** dimension of the data. */
```
This change seems a little unnecessary. Dimensionality seems like the right word (and case).
```cpp
explicit merge_params(const cagra::index_params& params) : output_index_params(params) {}

// Parameters for creating the output index
cagra::index_params output_index_params;
```
From an algorithmic perspective, this could be really really challenging. For example, depending upon the merge method used, I'm not sure if these can always be used. I don't think we should hold up the PR over this, but can you create a GitHub issue just to put more thought into how we might be able to utilize this efficiently (if at all) with different merge strategies?
There are at least 3 different merge strategies that I can think of off the top of my head:
- Logical - simply wraps a new index structure around existing CAGRA graphs and broadcasts the query to each of the existing CAGRA graphs. This will be a fast merge but take a small hit in search latency. (This might be preferred for fewer, larger CAGRA graphs.)
- Physical - builds a new CAGRA graph from the union of dataset points in existing CAGRA graphs. This will be expensive to build but not impact search latency/quality. This might be preferred for many smaller CAGRA graphs.
- Smart - overlaps dataset vectors across CAGRA graphs and merges the graphs into a single graph. This might be preferred for many larger CAGRA graphs.
Maybe you could create the "MergeKind" enum now and just add "Physical" as the only option (and document accordingly). We will next need to implement the logical merge.
Can you create a GitHub issue to capture the other merge strategies? For the logical merge, we will also need a composite_index or logically_merged_index that can act like a CAGRA (or other) index but is really broadcasting the queries to the inner indexes.
Done! (Naming is MergeStrategy)
> Can you create a GitHub issue to capture the other merge strategies? For the logical merge, we will also need a `composite_index` or `logically_merged_index` that can act like a CAGRA (or other) index but is really broadcasting the queries to the inner indexes.
Sorry for missing this. I guess the composite index could be a feature of the search API rather than of merging?
-- The issue was created: #663
Essentially, the merge would return a "composite_index" instead of a typical cagra::index (though a "composite_index" would implement cagra::index) so the user can still interact with the index in the same way they would a typical cagra::index, but when they perform search, it'll automatically broadcast the query vector to all the "logically merged" subindexes. Does that make sense? It'd be a similar API experience to our single node multi-GPU "indexes" where the user has a handle to an index and they don't care what kind of index it is; they just know they can call the same functions on it and it'll act appropriately according to its type.
That's a great idea! Sounds like
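A minimal, self-contained sketch of that broadcasting behavior (toy brute-force sub-indexes stand in for CAGRA here; all names are hypothetical): the composite index fans the query out to each sub-index, concatenates the per-index candidates, and keeps the global top-k by distance.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

struct hit { float dist; int id; };

// Toy sub-index over 1-d points; id_offset makes ids globally unique.
struct sub_index {
  std::vector<float> points;
  int id_offset = 0;

  std::vector<hit> search(float query, int k) const {
    std::vector<hit> hits;
    for (int i = 0; i < static_cast<int>(points.size()); ++i)
      hits.push_back({std::fabs(points[i] - query), id_offset + i});
    std::sort(hits.begin(), hits.end(),
              [](const hit& a, const hit& b) { return a.dist < b.dist; });
    if (static_cast<int>(hits.size()) > k) hits.resize(k);
    return hits;
  }
};

// "Logically merged" index: same search signature, broadcasts internally.
struct composite_index {
  std::vector<sub_index> subs;

  std::vector<hit> search(float query, int k) const {
    std::vector<hit> all;
    for (const auto& s : subs) {  // broadcast the query to every sub-index
      auto part = s.search(query, k);
      all.insert(all.end(), part.begin(), part.end());
    }
    // Merge the per-index top-k into a global top-k.
    std::sort(all.begin(), all.end(),
              [](const hit& a, const hit& b) { return a.dist < b.dist; });
    if (static_cast<int>(all.size()) > k) all.resize(k);
    return all;
  }
};
```

The design point is that `composite_index::search` has the same shape as a plain index's search, so callers don't need to know they are hitting several graphs.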
```cpp
 */
auto merge(raft::resources const& res,
           const cuvs::neighbors::cagra::merge_params& params,
           std::vector<cuvs::neighbors::cagra::index<int8_t, uint32_t>*>& indices)
```
I do agree with Artem that the vector of pointers is not the prettiest thing, but I don't think variadic templates are the way to fix that (and they overall make things very challenging to work with). I think we can stick with pointers for now and update the API later if needed. Initially, this will be needed for Lucene, which will use it through our Java API, so at least this public API is localized at the moment.
Pointers are fine from the perspective of the Java API. We can work best with memory addresses, since we'll be mmap'ing the index data from files on disk.
`vector<index>` is fine from Java's perspective.
/merge