[WIP] IP Vector Normalization to avoid all vectors clumped into single cluster in IVF-PQ #1892
HowardHuang1 wants to merge 21 commits into rapidsai:main
Conversation
… cases due to reduced precision 8-bit values stored in LUT
/ok to test 78e47f1

Hi @HowardHuang1, are we able to document an example in the description of the PR where vectors were clumped in a single cluster, and then verify that the fix (this PR) solves that case? Also, you can run
Yes, will do. I'll upload a before-and-after example that illustrates the fixed cluster distributions and the timing improvements for search.

/ok to test e822460
…reviously for debug only
/ok to test da32073
…ange to only apply normalization to kmeans step
….s. raw doesn't affect ordering of clusters for nearest cluster assignment
Hi @HowardHuang1, is this targeting release/26.04? If so, please retarget from main to release/26.04.

Hi @HowardHuang1, you'll have to rebase on
…put command parsing
jinsolp left a comment
Thanks @HowardHuang1! I have a few questions as I try to understand the fix in this code.
- So we normalize the data and train kmeans on this normalized data. However, instead of saving the resulting centroids (which come from the normalized space), we re-compute the proper centroids on the raw vectors (which is what `calc_centers_and_sizes` does)?
- What are the changes in the cuvs bench for? I'm not so familiar with this part of the code, so this is out of curiosity.
- If I'm reading things right in the screenshot, we get a lower recall with the normalized vectors compared to using the raw vectors? And the search also takes longer?
```cpp
  handle, kmeans_params, trainset_kmeans_const_view, centers_const_view, labels_view);
// Recompute centers in original space (mean of unnormalized trainset per cluster), overwrites centers_view
rmm::device_uvector<uint32_t> cluster_sizes(impl->n_lists(), stream, device_memory);
cuvs::cluster::kmeans::detail::calc_centers_and_sizes<float, float, internal_extents_t, uint32_t, uint32_t>(
```
Can we use the `calc_centers_and_sizes` in the public namespace if possible?
Hey @jinsolp!
- Yes, exactly. If we kept the normalized centroids, the computation would no longer be Inner Product; it would degenerate to Cosine, which is not what we want. To prevent this, we use `calc_centers_and_sizes` to re-compute the proper centroids on the raw vectors. In essence, we only want normalization to happen in the kmeans cluster-assignment step rather than in the whole pipeline (normalizing the whole pipeline changes the metric to Cosine).
- The changes in cuvs bench are to resolve a linker issue. I will remove these when merging; they can be ignored for now.
- The screenshot is a bit outdated; I will replace it soon with a new one. Recall should be the same and search should be faster.
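To make the "degenerates to Cosine" point concrete, here is a minimal NumPy sketch (hypothetical values, not from the PR): taking the argmax of inner products against L2-normalized centroids is exactly a cosine-similarity selection, and it can pick a different centroid than the raw inner product does.

```python
import numpy as np

# Illustrative example: one query and two centroids with very different norms.
x = np.array([1.0, 0.0])                 # query vector
c = np.array([[3.0, 4.0],                # centroid with a large norm
              [0.9, 0.0]])               # centroid aligned with x, small norm

# Raw inner product: scores are [3.0, 0.9], so centroid 0 wins.
ip_pick = int(np.argmax(c @ x))

# If centroids were stored normalized, the same argmax becomes a cosine
# ranking: scores are [0.6, 1.0], so centroid 1 wins instead.
c_unit = c / np.linalg.norm(c, axis=1, keepdims=True)
cos_pick = int(np.argmax(c_unit @ x))
```

Re-computing raw-space centroids after the normalized assignment is what keeps the stored centroids meaningful for the inner-product metric.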
```cpp
if (impl->metric() == distance::DistanceType::InnerProduct) {
  // Normalization only for k-means: use a copy so trainset stays in original space; metric
  // remains inner product for the rest of the pipeline.
  auto trainset_kmeans = raft::make_device_mdarray<float>(
```
Wait, we seem to be doing a copy of the trainset? Why? Just do the normalization in the distance kernels.

This is not the way we do things: we don't do unnecessary copies just to avoid changing kernels. This becomes frustrating for users because they are almost always memory-limited, and every GB counts.
…ulting in 2x memory spike
…ng fused cosine implementation
…ld is called because only ivf_pq should have normalization for IP, other index types should not --> this version fixes low recall issue on gist and fixes Oracle bug
Addresses #1875.
Current implementation: make the Inner Product code path match the Cosine code path.
Within kmeans, the E-step of cluster assignment operates on normalized vectors, while the M-step recomputes centers from the raw vectors. Search still happens on raw vectors.
In each EM iteration, before `predict()` we L2 row-normalize the centroid matrix (`cluster_centroids`) for IP. Within the E-step's call to `predict()`, we compute the norm of every vector in the dataset, and `dataset_norm` is passed into `predict_core()` and used in the `fusedDistanceNNMinReduce<>` kernel. This makes the comparison a norm-to-norm one, between normalized vectors and normalized centroids.
Then in the M-step, `calc_centers_and_sizes()` recalculates the centroids from the raw dataset and writes the raw centroids back into `cluster_centers`.
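The E-step/M-step split described above can be sketched host-side in NumPy (a simplified illustration with hypothetical first-k initialization and no balancing, unlike the actual `kmeans_balanced` implementation):

```python
import numpy as np

def kmeans_ip(data, k, n_iters=10):
    """Simplified sketch of the PR's scheme: normalized E-step, raw M-step."""
    centers = data[:k].astype(float).copy()  # hypothetical naive init
    labels = np.zeros(len(data), dtype=int)
    eps = 1e-12
    for _ in range(n_iters):
        # E-step: assign each vector to the centroid with the highest inner
        # product in *normalized* space (i.e., a cosine-style assignment).
        d_unit = data / (np.linalg.norm(data, axis=1, keepdims=True) + eps)
        c_unit = centers / (np.linalg.norm(centers, axis=1, keepdims=True) + eps)
        labels = np.argmax(d_unit @ c_unit.T, axis=1)
        # M-step: recompute centroids as means of the *raw* member vectors
        # (the role of calc_centers_and_sizes), so the stored centroids stay
        # in the original inner-product space rather than on the unit sphere.
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, labels
```

Note that the returned centroids are unnormalized; only the assignment step ever sees unit vectors, which mirrors the split described above.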
For the code review, the bulk of the changes are in this file: `cpp/src/cluster/detail/kmeans_balanced.cuh`.
The rest have to do with linking and can be ignored; those files will be removed prior to merging.
So far, we've attempted:
Pareto plots for approach #3 (working):
IVF-PQ only:
CAGRA end-to-end (showing no harm to search performance):
Showing the fix for Oracle's issue (in CAGRA):