[Feat] Add Support for Index merge in CAGRA #618
rapids-bot[bot] merged 22 commits into rapidsai:branch-25.02
Conversation
```cpp
auto merged_index =
  cagra::build(handle, params, raft::make_const_mdspan(device_updated_dataset_view));

if (static_cast<std::size_t>(stride) == dim) {
```
Hi @cjnolet @achirkin, I know this code looks odd, but without it, datasets get modified after calling cagra::detail::search_main_core, which causes the test failure. I don't understand how the dataset format, matrix ownership, and cagra::search interact behind the scenes. Could you comment here? Many thanks!
To me, it looks like no one owns the `host_updated_dataset` or `device_updated_dataset` beyond the scope of this function, so the data gets destroyed unless the owning `update_dataset` is called under the `if` branch here.
Hence, I think, you should call `update_dataset` unconditionally here.
> To me, it looks like no one owns the `host_updated_dataset` or `device_updated_dataset` beyond the scope of this function, so the data gets destroyed unless the owning `update_dataset` is called under the `if` branch here. Hence, I think, you should call `update_dataset` unconditionally here.
Thank you, very helpful!
```cpp
}

// Allocate the new dataset on device
auto device_updated_dataset =
```
I found a great API that fits the situation where device memory is not enough: `cuvs::neighbors::nn_descent::has_enough_device_memory`. I will use it in the next commit.
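As a rough sketch of that fallback logic: the memory check below is a stub standing in for `cuvs::neighbors::nn_descent::has_enough_device_memory` (whose real signature differs), and the byte arithmetic and `placement` type are my assumptions, not cuVS code.

```cpp
#include <cassert>
#include <cstddef>

// Stub for the device-memory check; the real cuVS helper queries the GPU.
bool has_enough_device_memory(std::size_t required_bytes, std::size_t free_bytes) {
  return required_bytes <= free_bytes;
}

// Where to materialize the merged dataset before building the output index.
enum class placement { device, host };

placement choose_placement(std::size_t n_rows, std::size_t dim,
                           std::size_t elem_size, std::size_t free_bytes) {
  const std::size_t required = n_rows * dim * elem_size;
  return has_enough_device_memory(required, free_bytes) ? placement::device
                                                        : placement::host;
}
```

The idea is simply: build the merged dataset on device when it fits, and fall back to a host buffer (with an owning `update_dataset` afterwards) when it does not.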
```cpp
 *
 * @return A new CAGRA index containing the merged indices, graph, and dataset.
 */
auto merge(raft::resources const& res,
```
Hey @chatman, I'm working on cagra::merge. Could you review the API design when you have a moment? Any suggestions would be greatly appreciated. Thanks!
On a more general note, I wonder whether the merging may be problematic on the user side due to the absence of index (vector id) remapping in CAGRA? The new index ordering depends on the order in which one puts the merged indices, so it may be difficult to map these back if the need arises.
Hi @achirkin, Thank you for pointing these out! Let me explain a bit:
Sorry for missing the last concern. Hi @cjnolet, could you confirm whether the problem @achirkin mentioned applies: do we need to take care of the order of indices when merging them?
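To make the remapping concern concrete, here is a hedged, self-contained sketch (toy code, not part of cuVS): if a physical merge simply concatenates the datasets in the order the indices were passed, then recovering the originating sub-index and local id from a merged id requires keeping the per-index sizes around on the user side.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Map a merged vector id back to (sub-index position, local id), assuming the
// merged dataset is the concatenation of the inputs in the order they were
// passed to merge. Returns {sizes.size(), 0} if the id is out of range.
std::pair<std::size_t, std::uint32_t>
unmerge_id(std::uint32_t merged_id, const std::vector<std::uint32_t>& sizes) {
  std::uint32_t offset = 0;
  for (std::size_t i = 0; i < sizes.size(); ++i) {
    if (merged_id < offset + sizes[i]) return {i, merged_id - offset};
    offset += sizes[i];
  }
  return {sizes.size(), 0};
}
```

This is exactly the bookkeeping the user would have to do themselves today, which is why the ordering of the input indices matters.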
```diff
 # detect when package size grows significantly
-max_allowed_size_compressed = '1.1G'
+max_allowed_size_compressed = '1.2G'
```
Hi @jameslamb, to be conservative, I increased it to 1.2 GB. Thanks!
/ok to test
Hi @achirkin, I thought about the 1st issue, and I realized the coupling with
```diff
 }

-/** Dimensionality of the data. */
+/** dimension of the data. */
```
This change seems a little unnecessary. Dimensionality seems like the right word (and case).
```cpp
explicit merge_params(const cagra::index_params& params) : output_index_params(params) {}

// Parameters for creating the output index
cagra::index_params output_index_params;
```
From an algorithmic perspective, this could be really really challenging. For example, depending upon the merge method used, I'm not sure if these can always be used. I don't think we should hold up the PR over this, but can you create a GitHub issue just to put more thought into how we might be able to utilize this efficiently (if at all) with different merge strategies?
There are at least 3 different merge strategies that I can think of off the top of my head:
- Logical - simply wraps a new index structure around existing CAGRA graphs and broadcasts the query to each of the existing CAGRA graphs. This will be a fast merge but take a small hit in search latency. (This might be preferred for fewer, larger CAGRA graphs.)
- Physical - builds a new CAGRA graph from the union of dataset points in existing CAGRA graphs. This will be expensive to build but not impact search latency/quality. This might be preferred for many smaller CAGRA graphs.
- Smart - overlaps dataset vectors across CAGRA graphs and merges the graphs into a single graph. This might be preferred for many larger CAGRA graphs.
Maybe you could create the "MergeKind" enum now and just add "Physical" as the only option (and document accordingly). We will next need to implement the logical merge.
Can you create a GitHub issue to capture the other merge strategies? For the logical merge, we will also need a composite_index or logically_merged_index that can act like a CAGRA (or other) index but is really broadcasting the queries to the inner indexes.
Done! (Naming is MergeStrategy)
> Can you create a GitHub issue to capture the other merge strategies? For the logical merge, we will also need a `composite_index` or `logically_merged_index` that can act like a CAGRA (or other) index but is really broadcasting the queries to the inner indexes.
Sorry for missing this. I guess the composite index could be a feature of the search API rather than of merging?
-- The issue was created: #663
Essentially, the merge would return a "composite_index" instead of a typical cagra::index (though a "composite_index" would implement cagra::index) so the user can still interact with the index in the same way they would a typical cagra::index, but when they perform search, it'll automatically broadcast the query vector to all the "logically merged" subindexes. Does that make sense? It'd be a similar API experience to our single node multi-GPU "indexes" where the user has a handle to an index and they don't care what kind of index it is; they just know they can call the same functions on it and it'll act appropriately according to its type.
That's a great idea! Sounds like
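A minimal, self-contained sketch of that broadcasting behavior (toy brute-force sub-indexes stand in for CAGRA here; all names are hypothetical): the composite index fans the query out to each sub-index, concatenates the per-index candidates, and keeps the global top-k by distance.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

struct hit { float dist; int id; };

// Toy sub-index over 1-d points; id_offset makes ids globally unique.
struct sub_index {
  std::vector<float> points;
  int id_offset = 0;

  std::vector<hit> search(float query, int k) const {
    std::vector<hit> hits;
    for (int i = 0; i < static_cast<int>(points.size()); ++i)
      hits.push_back({std::fabs(points[i] - query), id_offset + i});
    std::sort(hits.begin(), hits.end(),
              [](const hit& a, const hit& b) { return a.dist < b.dist; });
    if (static_cast<int>(hits.size()) > k) hits.resize(k);
    return hits;
  }
};

// "Logically merged" index: same search signature, broadcasts internally.
struct composite_index {
  std::vector<sub_index> subs;

  std::vector<hit> search(float query, int k) const {
    std::vector<hit> all;
    for (const auto& s : subs) {  // broadcast the query to every sub-index
      auto part = s.search(query, k);
      all.insert(all.end(), part.begin(), part.end());
    }
    // Merge the per-index top-k into a global top-k.
    std::sort(all.begin(), all.end(),
              [](const hit& a, const hit& b) { return a.dist < b.dist; });
    if (static_cast<int>(all.size()) > k) all.resize(k);
    return all;
  }
};
```

The design point is that `composite_index::search` has the same shape as a plain index's search, so callers don't need to know they are hitting several graphs.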
```cpp
 */
auto merge(raft::resources const& res,
           const cuvs::neighbors::cagra::merge_params& params,
           std::vector<cuvs::neighbors::cagra::index<int8_t, uint32_t>*>& indices)
```
I do agree with Artem that the vector of pointers is not the prettiest thing, but I don't think variadic templates are the way to fix that (and they overall make things very challenging to work with). I think we can stick with pointers for now and update the API later if needed. Initially, this will be needed for Lucene, which will use it through our Java API, so at least this public API is localized at the moment.
Pointers are fine from the perspective of the Java API. We can work best with memory addresses, since we'll be mmap'ing the index data from files on disk.
`vector<index>` is fine from Java's perspective.
/merge