[REVIEW] cuVS bench: Fix cudaFuncSetAttribute not being called when CAGRA search switches kernel variants#1851

Merged
rapids-bot[bot] merged 1 commit into rapidsai:release/26.04 from irina-resh-nvda:cuvsbench_smem_size_bug
Mar 25, 2026

@irina-resh-nvda
Contributor

Fix a bug in safely_launch_kernel_with_smem_size where cudaFuncSetAttribute was skipped for kernels that needed it. The function tracked the max shared memory in a single static variable per KernelT type, but cudaFuncSetAttribute applies per function pointer value — and the single-CTA CAGRA search dispatches multiple kernel instantiations that share the same pointer type. When one kernel bumped the tracked max, a different kernel whose smem fell between its own previous max and the global max would skip cudaFuncSetAttribute, causing cudaErrorInvalidValue. The fix tracks the kernel pointer identity alongside a monotonically growing smem high-water mark: when the pointer changes, the new kernel is brought up to the high-water mark; when smem exceeds it, the mark is grown.
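The failure mode described above can be reproduced without a GPU. The following is a host-only C++17 sketch, not the cuVS code: `fake_set_attribute` is a hypothetical stand-in for cudaFuncSetAttribute that records a limit per function pointer, `launch_buggy` and `launch_fixed` are illustrative names, sizes are arbitrary units, and the default limit is treated as 0 for simplicity.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

using kernel_t = int (*)(int);

// Two instantiations that differ only in a non-type parameter share kernel_t.
template <int Variant>
int kernel(int x) { return x + Variant; }

// Hypothetical stand-in for the runtime: the limit is stored per pointer value.
inline std::map<kernel_t, std::uint32_t>& limits()
{
  static std::map<kernel_t, std::uint32_t> m;
  return m;
}
inline void fake_set_attribute(kernel_t k, std::uint32_t smem) { limits()[k] = smem; }

// A launch succeeds only if the request fits the limit recorded for this pointer.
inline bool launch_ok(kernel_t k, std::uint32_t smem)
{
  auto it = limits().find(k);
  return smem <= (it == limits().end() ? 0u : it->second);
}

// Old behaviour: a single tracked maximum, regardless of which pointer is used.
inline bool launch_buggy(kernel_t k, std::uint32_t smem)
{
  static std::uint32_t max_smem = 0;
  if (smem > max_smem) {
    max_smem = smem;
    fake_set_attribute(k, smem);
  }
  return launch_ok(k, smem);
}

// Behaviour described in this PR: remember the last pointer and a high-water mark.
inline bool launch_fixed(kernel_t k, std::uint32_t smem)
{
  static std::uint32_t high_water = 0;
  static kernel_t last = nullptr;
  if (smem > high_water || k != last) {
    if (smem > high_water) { high_water = smem; }
    last = k;
    fake_set_attribute(k, high_water);
  }
  return launch_ok(k, smem);
}

inline void demo()
{
  limits().clear();
  assert(launch_buggy(&kernel<1>, 100));   // raises the shared maximum to 100
  assert(!launch_buggy(&kernel<2>, 60));   // 60 < 100: the call is skipped, the launch fails
  limits().clear();
  assert(launch_fixed(&kernel<1>, 100));
  assert(launch_fixed(&kernel<2>, 60));    // pointer changed: kernel<2> is raised to 100
}
```

Calling `demo()` from `main` walks through the sequence: the second kernel requests less than the tracked maximum, so the old logic never sets its limit.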

Error in question

$ CUVS_CAGRA_ANN_BENCH --search --data_prefix='<DATA_DIR>/' --benchmark_out_format=csv --benchmark_out=res_search_iter_cagra.csv --benchmark_counters_tabular=true --override_kv=dataset_memory_type:\"device\" <CONFIG_DIR>/laion_1M_cagra_iterative.json
[I] [12:28:52.095261] Using the query file '<DATA_DIR>/laion_1M/queries.fbin'
[I] [12:28:52.096141] Using the ground truth file '<DATA_DIR>/laion_1M/groundtruth.1M.neighbors.ibin'
2026-02-25T12:28:52+00:00
Running CUVS_CAGRA_ANN_BENCH
Run on (224 X 800 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x112)
  L1 Instruction 32 KiB (x112)
  L2 Unified 2048 KiB (x112)
  L3 Unified 307200 KiB (x2)
Load Average: 0.70, 0.44, 0.28
dataset: laion_1M
dim: 768
distance: euclidean
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations        GPU    Latency     Recall end_to_end items_per_second      itopk          k max_iterations  n_queries refine_ratio search_width total_queries
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
cuvs_cagra_iterative/0/process_time/real_time        5.70 ms         5.70 ms          121   5.68808m   5.69994m    0.96424   0.689692       1.75441M/s         64         10              8        10k            1            2         1.21M dataset_memory_type="device"
cuvs_cagra_iterative/1/process_time/real_time        5.70 ms         5.70 ms          121    5.6863m   5.69879m    0.96424   0.689553       1.75477M/s         64         10              8        10k            1            2         1.21M dataset_memory_type="device"
cuvs_cagra_iterative/2/process_time/real_time        4.92 ms         4.92 ms          140   4.90351m   4.91567m    0.96046   0.688193       2.03432M/s        128         10             12        10k            1            1          1.4M dataset_memory_type="device"
cuvs_cagra_iterative/3/process_time/real_time        5.99 ms         5.99 ms          115   5.97476m   5.98617m    0.97519   0.688409       1.67052M/s        128         10             16        10k            1            1         1.15M dataset_memory_type="device"
cuvs_cagra_iterative/4/process_time/real_time        6.97 ms         6.97 ms           99   6.95873m    6.9703m    0.98129   0.690059       1.43466M/s        256         10             16        10k            1            1          990k dataset_memory_type="device"
cuvs_cagra_iterative/5/process_time/real_time        10.5 ms         10.5 ms           66   0.010479  0.0104908    0.98548   0.692391       953.222k/s        512         10             10        10k            1            2          660k dataset_memory_type="device"
-----------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations
-----------------------------------------------------------------------------------------
cuvs_cagra_iterative/6/process_time/real_time  ERROR OCCURRED: 'Benchmark loop: CUDA error encountered at: file=cpp/src/neighbors/detail/cagra/search_single_cta_kernel-inl.cuh line=2348: call='cudaPeekAtLastError()', Reason=cudaErrorInvalidValue:invalid argument
Obtained 19 stack frames
#1 in CUVS_CAGRA_ANN_BENCH: raft::cuda_error::cuda_error(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
#2 in libcuvs.so: void cuvs::neighbors::cagra::detail::single_cta_search::select_and_run<float, unsigned int, float, unsigned int, cuvs::neighbors::filtering::none_sample_filter>(...)
#3 in libcuvs.so: cuvs::neighbors::cagra::detail::single_cta_search::search<float, unsigned int, float, cuvs::neighbors::filtering::none_sample_filter, unsigned int, long>::operator()(...)
#4 in libcuvs.so(+0x18fd0f1)
#5 in libcuvs.so: void cuvs::neighbors::cagra::search<float, unsigned int, long>(...)
#6-#19 in CUVS_CAGRA_ANN_BENCH / libc.so.6
'
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations        GPU    Latency     Recall end_to_end items_per_second      itopk          k max_iterations  n_queries refine_ratio search_width total_queries
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
cuvs_cagra_iterative/7/process_time/real_time        10.5 ms         10.5 ms           66  0.0105088  0.0105202    0.98663   0.694332       950.555k/s         32         10             32        10k            1            1          660k dataset_memory_type="device"
cuvs_cagra_iterative/8/process_time/real_time        12.8 ms         12.8 ms           54   0.012796  0.0128079    0.98807   0.691628       780.768k/s         32         10             64        10k            1            1          540k dataset_memory_type="device"
-----------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations
-----------------------------------------------------------------------------------------
cuvs_cagra_iterative/9/process_time/real_time  ERROR OCCURRED: 'Benchmark loop: CUDA error encountered at: file=cpp/src/neighbors/detail/cagra/search_single_cta_kernel-inl.cuh line=2348: call='cudaPeekAtLastError()', Reason=cudaErrorInvalidValue:invalid argument
[same stack trace as above]
'
cuvs_cagra_iterative/10/process_time/real_time ERROR OCCURRED: 'Benchmark loop: CUDA error encountered at: file=cpp/src/neighbors/detail/cagra/search_single_cta_kernel-inl.cuh line=2348: call='cudaPeekAtLastError()', Reason=cudaErrorInvalidValue:invalid argument
[same stack trace as above]
'
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations        GPU    Latency     Recall end_to_end items_per_second      itopk          k max_iterations  n_queries refine_ratio search_width total_queries
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
cuvs_cagra_iterative/11/process_time/real_time       46.1 ms         46.2 ms           15  0.0461323  0.0461439    0.99131   0.692158       216.714k/s        256         10             10        10k            1           16          150k dataset_memory_type="device"
cuvs_cagra_iterative/12/process_time/real_time        142 ms          142 ms            5   0.141713   0.141725    0.99198   0.708627       70.5591k/s        512         10             32        10k            1           16           50k dataset_memory_type="device"

Config

{
  "dataset": {
    "name": "laion_1M",
    "base_file": "laion_1M/base.1M.fbin",
    "subset_size": 1000000,
    "query_file": "laion_1M/queries.fbin",
    "groundtruth_neighbors_file": "laion_1M/groundtruth.1M.neighbors.ibin",
    "distance": "euclidean"
  },
  "search_basic_param": {
    "batch_size": 10000,
    "k": 10
  },
  "index": [
  
    {
      "name": "cuvs_cagra_iterative",
      "algo": "cuvs_cagra",
      "build_param": { 
        "graph_degree": 64,
        "intermediate_graph_degree": 128,
        "search_width": 1
      },
      "file": "laion_1M/cagra/q_coarse_iterative.ibin",
      "search_params": [
        {"itopk": 64, "search_width": 2, "max_iterations": 8, "refine_ratio": 1},
        {"itopk": 64, "search_width": 2, "max_iterations": 8, "refine_ratio": 1},
        {"itopk": 128, "search_width": 1, "max_iterations": 12, "refine_ratio": 1},
        {"itopk": 128, "search_width": 1, "max_iterations": 16, "refine_ratio": 1},
        {"itopk": 256, "search_width": 1, "max_iterations": 16, "refine_ratio": 1},
        {"itopk": 512, "search_width": 2, "max_iterations": 10, "refine_ratio": 1},
        {"itopk": 256, "search_width": 2, "max_iterations": 12, "refine_ratio": 1},
        {"itopk": 32, "search_width": 1, "max_iterations": 32, "refine_ratio": 1},
        {"itopk": 32, "search_width": 1, "max_iterations": 64, "refine_ratio": 1},
        {"itopk": 192, "search_width": 4, "max_iterations": 12, "refine_ratio": 1},
        {"itopk": 256, "search_width": 4, "max_iterations": 12, "refine_ratio": 1},
        {"itopk": 256, "search_width": 16, "max_iterations": 10, "refine_ratio": 1},
        {"itopk": 512, "search_width": 16, "max_iterations": 32, "refine_ratio": 1}
      ]
    }
  ]
}

@copy-pr-bot

copy-pr-bot bot commented Feb 25, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@irina-resh-nvda irina-resh-nvda self-assigned this Feb 25, 2026
@irina-resh-nvda irina-resh-nvda added the bug and non-breaking labels Feb 25, 2026
@irina-resh-nvda irina-resh-nvda marked this pull request as ready for review February 25, 2026 15:37
@irina-resh-nvda irina-resh-nvda requested a review from a team as a code owner February 25, 2026 15:37
@irina-resh-nvda irina-resh-nvda changed the title [REVIEW] Fix cudaFuncSetAttribute not being called when CAGRA search switches kernel variants [REVIEW] cuVS bench: Fix cudaFuncSetAttribute not being called when CAGRA search switches kernel variants Feb 25, 2026
@divyegala divyegala mentioned this pull request Feb 25, 2026
8 tasks
Contributor

@achirkin achirkin left a comment

Oh, this was indeed an oversight in the original design. Thanks for working on this!

Comment thread cpp/src/neighbors/detail/smem_utils.cuh Outdated
}
// current_smem_size is a monotonically growing high-water mark across all kernel pointers.
// current_kernel tracks which kernel pointer was last used.
static uint32_t current_smem_size{0};
Contributor

Would it be possible to retain the atomic-fast-path semantics (perhaps stronger memory order and two atomic variables)?

Contributor Author

@irina-resh-nvda irina-resh-nvda Mar 2, 2026

In this case, since we are only tracking the watermark, there is no danger in reading an inconsistent state with two atomics, but what would be the benefit of doing it this way versus a mutex?

Contributor Author

I withdraw my question, given that smem_utils is performance-critical functionality.
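For reference, the lock-free part of an "atomic fast path" for a monotonically growing value is usually a compare-exchange loop. This is a minimal sketch under that assumption; `raise_high_water` is a hypothetical helper name, and it covers only the high-water mark, not the per-pointer attribute call discussed in this thread.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Raise `mark` to at least `value`. Returns true if this call raised it.
inline bool raise_high_water(std::atomic<std::uint32_t>& mark, std::uint32_t value)
{
  std::uint32_t seen = mark.load(std::memory_order_relaxed);
  while (value > seen) {
    // On failure, compare_exchange_weak writes the current value into `seen`,
    // so the loop condition is re-evaluated against fresh data.
    if (mark.compare_exchange_weak(seen, value, std::memory_order_relaxed)) { return true; }
  }
  return false;  // someone else already holds a mark >= value
}

inline void demo()
{
  std::atomic<std::uint32_t> mark{0};
  assert(raise_high_water(mark, 10));
  assert(!raise_high_water(mark, 5));  // the mark never decreases
  assert(mark.load() == 10);
}
```

Relaxed ordering is enough for the counter in isolation. Whether it is enough once a second atomic (the kernel pointer) is read alongside it is exactly the question raised above about a stronger memory order.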

Comment thread cpp/src/neighbors/detail/smem_utils.cuh Outdated
Comment on lines +50 to +52
// When the kernel function pointer changes, bring the new kernel up to the global high-water
// mark. This is necessary because cudaFuncSetAttribute applies to a specific function pointer,
// not to the pointer type — different template instantiations may share the same KernelT.
Contributor

👏 Great catch.

I'm feeling a little silly for not having thought of this, actually.

Contributor

I thought we had exactly one pointer per type, but apparently we don't (non-type template parameters).
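The point about non-type template parameters can be checked in plain C++17. In this sketch `search_kernel` and `calls_per_type` are hypothetical names chosen for illustration; the template parameter does not appear in the function signature, so both instantiations have the same pointer type while their addresses differ.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical kernel-like template: TeamSize is not part of the signature.
template <int TeamSize>
void search_kernel(float* out) { *out = static_cast<float>(TeamSize); }

using kernel_a = decltype(&search_kernel<8>);
using kernel_b = decltype(&search_kernel<32>);

// Same pointer type for both instantiations.
static_assert(std::is_same_v<kernel_a, kernel_b>);

// A function template keyed on KernelT therefore has one static for both of them.
template <typename KernelT>
int& calls_per_type(KernelT)
{
  static int counter = 0;
  return counter;
}

inline void demo()
{
  assert(&search_kernel<8> != &search_kernel<32>);  // distinct pointer values
  ++calls_per_type(&search_kernel<8>);
  ++calls_per_type(&search_kernel<32>);
  assert(calls_per_type(&search_kernel<8>) == 2);   // both calls hit the same counter
}
```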

Comment thread cpp/src/neighbors/detail/smem_utils.cuh Outdated
Comment on lines +53 to +59
if (kernel != last_kernel) {
  current_kernel = kernel;
  auto launch_status =
    cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, last_smem_size);
  RAFT_EXPECTS(launch_status == cudaSuccess,
               "Failed to set max dynamic shared memory size to %u bytes",
               last_smem_size);
Contributor

I've a silly question: Why aren't these two conditions combined into one block?

    if (smem_size > last_smem_size || kernel != last_kernel) {
      // 1. Record high-watermark, current kernel.
      // 2. Call cudaFuncSetAttribute().
    }

Contributor

Come to think of it, we should probably have put this in a double-checked lock, no?

// For the first check, no mutex.
if (smem_size > current_smem_size || kernel != current_kernel) {
  // Something's changed.  Grab the mutex, and examine.
  auto guard = std::lock_guard<std::mutex>{mutex};
  auto call_set_attribute = false;
  if (smem_size > current_smem_size) {
    current_smem_size = smem_size;
    call_set_attribute = true;
  }
  if (kernel != current_kernel) {
    current_kernel = kernel;
    call_set_attribute = true;
  }
  if (call_set_attribute) {
    auto launch_status =
      cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size);
    RAFT_EXPECTS(launch_status == cudaSuccess,
                 "Failed to set max dynamic shared memory size to %u bytes",
                 smem_size);
  }
}

Contributor

Apologies if this is too naive, or I'm missing something.

Contributor Author

No, you are right, this can be trimmed.

Comment thread cpp/src/neighbors/detail/smem_utils.cuh Outdated
Comment on lines +47 to +48
auto last_kernel = current_kernel;
auto last_smem_size = current_smem_size;
Contributor

Sorry, why is it necessary to make copies of the current high-watermark and the current_kernel? Why not just use current_kernel directly? We're holding the lock_guard when these are modified, so it should be safe.

What am I missing?

Contributor Author

You are right, it's an artefact from when these were two atomics =)

Member

@divyegala divyegala left a comment

Can you account for the case where KernelT is just a cudaKernel_t or cudaFunc_t?

@divyegala divyegala dismissed their stale review February 26, 2026 22:06

Found fix.

@mythrocks
Contributor

Actually, I rather like @divyegala's approach of tracking mem-sizes per kernel, via a std::unordered_map. But @achirkin might know best about whether we want to persist the current smem_max across all kernels evenly, or track them separately. (In that case, we might consider a std::unordered_set instead.)

The map/set version will likely work for both function pointers and cudaKernel_t alike, so we might not even need a template specialization for the latter.
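A rough sketch of the map-based variant, assuming a single mutex and a key type that std::hash supports (plain pointers, function pointers, and other pointer-like handles). `smem_tracker` and `needs_update` are hypothetical names, and the actual attribute call is left to the caller.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <unordered_map>

template <int V>
int variant(int x) { return x + V; }

// Hypothetical tracker: remembers the largest size applied to each kernel handle.
template <typename Key>
class smem_tracker {
 public:
  // Returns true when the caller should (re)apply the attribute for `kernel`.
  bool needs_update(Key kernel, std::uint32_t smem_size)
  {
    std::lock_guard<std::mutex> guard{mutex_};
    auto [it, inserted] = applied_.try_emplace(kernel, smem_size);
    if (inserted) { return true; }  // first time this handle is seen
    if (smem_size > it->second) {
      it->second = smem_size;       // grow this handle's own maximum
      return true;
    }
    return false;
  }

 private:
  std::mutex mutex_;
  std::unordered_map<Key, std::uint32_t> applied_;
};

inline void demo()
{
  smem_tracker<int (*)(int)> tracker;
  assert(tracker.needs_update(&variant<1>, 100));
  assert(!tracker.needs_update(&variant<1>, 60));  // already covered for this kernel
  assert(tracker.needs_update(&variant<2>, 60));   // different kernel, tracked separately
}
```

This version tracks sizes separately per kernel. Keeping one shared maximum across all kernels would instead need a set of already-raised handles plus a single high-water mark.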

@achirkin
Contributor

achirkin commented Mar 2, 2026

I'm wondering whether it's still possible to maintain a compile-time dictionary of the kernels and smem sizes rather than a run-time one. What if we just propagate/add the template parameters from the outer scope to ensure there's always one template per kernel instantiation? These host functions are small, so we won't be blowing up the binary size while also avoiding the runtime costs for the locks and dictionaries.
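One way to read this suggestion, sketched in host-only C++17: make the kernel itself a non-type template parameter so that every kernel value gets its own instantiation and therefore its own static. This is an assumption about the intent, not the approach taken in the PR, and `needs_update` is a hypothetical name.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

template <int V>
int variant(int x) { return x + V; }

// One instantiation, and so one static, per kernel *value* rather than per pointer type.
template <auto Kernel>
bool needs_update(std::uint32_t smem_size)
{
  static std::atomic<std::uint32_t> applied{0};
  std::uint32_t seen = applied.load(std::memory_order_relaxed);
  while (smem_size > seen) {
    // On failure `seen` is refreshed and the loop condition is checked again.
    if (applied.compare_exchange_weak(seen, smem_size, std::memory_order_relaxed)) { return true; }
  }
  return false;
}

inline void demo()
{
  assert(needs_update<&variant<1>>(100));
  assert(!needs_update<&variant<1>>(60));  // covered by this kernel's own maximum
  assert(needs_update<&variant<2>>(60));   // separate static, so it still needs the call
}
```

The trade-off is that the kernel must be known at compile time at the call site, which is why the comment talks about propagating template parameters from the outer scope.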

@irina-resh-nvda
Contributor Author

I'm wondering whether it's still possible to maintain a compile-time dictionary of the kernels and smem sizes rather than a run-time one. What if we just propagate/add the template parameters from the outer scope to ensure there's always one template per kernel instantiation? These host functions are small, so we won't be blowing up the binary size while also avoiding the runtime costs for the locks and dictionaries.

This is the benchmark launcher functionality, not a performance-critical algorithmic part. Do you think it's worth it to try and optimise out the run-time overhead?

@irina-resh-nvda
Contributor Author

Can you account for the case where KernelT is just a cudaKernel_t or cudaFunc_t?

Is this still relevant for you?
It's unclear to me how KernelLauncherT will behave if given cudaKernel_t or cudaFunc_t.

@achirkin
Contributor

achirkin commented Mar 2, 2026

This is the benchmark launcher functionality, not a performance-critical algorithmic part. Do you think it's worth it to try and optimise out the run-time overhead?

No, the smem helper in cpp/src/neighbors/detail/smem_utils.cuh is in a performance-critical path, it's invoked during search. It's critical for the case of launching many concurrent small-batch searches.

@irina-resh-nvda
Contributor Author

This is the benchmark launcher functionality, not a performance-critical algorithmic part. Do you think it's worth it to try and optimise out the run-time overhead?

No, the smem helper in cpp/src/neighbors/detail/smem_utils.cuh is in a performance-critical path, it's invoked during search. It's critical for the case of launching many concurrent small-batch searches.

Oh, I completely missed that; I thought I was fixing a cuvs bench bug. Then for sure.

@irina-resh-nvda
Contributor Author

irina-resh-nvda commented Mar 2, 2026

I updated the implementation to use two atomics (memory_order_relaxed, because smem_size only grows monotonically).
However, this approach looks a little slower in some cases when running cuvs bench.
One-mutex approach:

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                 Time             CPU   Iterations        GPU    Latency     Recall end_to_end items_per_second      itopk          k max_iterations  n_queries refine_ratio search_width total_queries
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
cuvs_cagra_q_iterative/0/process_time/real_time        4.27 ms         4.27 ms          164   4.26053m   4.27194m    0.84972   0.700598       2.34086M/s         64         10              8        10k            1            2         1.64M dataset_memory_type="device"
cuvs_cagra_q_iterative/1/process_time/real_time        4.27 ms         4.27 ms          164   4.25854m   4.26998m    0.84972   0.700277       2.34194M/s         64         10              8        10k            1            2         1.64M dataset_memory_type="device"
cuvs_cagra_q_iterative/2/process_time/real_time        3.57 ms         3.57 ms          196   3.55504m   3.56633m    0.84494      0.699       2.80401M/s        128         10             12        10k            1            1         1.96M dataset_memory_type="device"
cuvs_cagra_q_iterative/3/process_time/real_time        4.47 ms         4.47 ms          156   4.45523m   4.46646m    0.85445   0.696768       2.23891M/s        128         10             16        10k            1            1         1.56M dataset_memory_type="device"
cuvs_cagra_q_iterative/4/process_time/real_time        4.92 ms         4.92 ms          140   4.90958m   4.92073m    0.85754   0.688902       2.03222M/s        256         10             16        10k            1            1          1.4M dataset_memory_type="device"
cuvs_cagra_q_iterative/5/process_time/real_time        7.47 ms         7.47 ms           93   7.45551m   7.46701m    0.85994   0.694432       1.33923M/s        512         10             10        10k            1            2          930k dataset_memory_type="device"
cuvs_cagra_q_iterative/6/process_time/real_time        6.52 ms         6.52 ms          106   6.51184m   6.52313m    0.86124   0.691451       1.53301M/s        256         10             12        10k            1            2         1060k dataset_memory_type="device"
cuvs_cagra_q_iterative/7/process_time/real_time        8.45 ms         8.45 ms           82     8.437m   8.44948m    0.85983   0.692857       1.18351M/s         32         10             32        10k            1            1          820k dataset_memory_type="device"
cuvs_cagra_q_iterative/8/process_time/real_time        10.6 ms         10.6 ms           65  0.0106346  0.0106464    0.86101   0.692016        939.29k/s         32         10             64        10k            1            1          650k dataset_memory_type="device"
cuvs_cagra_q_iterative/9/process_time/real_time        11.3 ms         11.3 ms           61  0.0112994  0.0113108    0.86274   0.689959       884.112k/s        192         10             12        10k            1            4          610k dataset_memory_type="device"
cuvs_cagra_q_iterative/10/process_time/real_time       11.6 ms         11.6 ms           60  0.0115922  0.0116036    0.86268   0.696217       861.802k/s        256         10             12        10k            1            4          600k dataset_memory_type="device"
cuvs_cagra_q_iterative/11/process_time/real_time       36.9 ms         36.9 ms           19  0.0368664  0.0368782    0.86319   0.700685       271.164k/s        256         10             10        10k            1           16          190k dataset_memory_type="device"
cuvs_cagra_q_iterative/12/process_time/real_time        117 ms          117 ms            6   0.116596   0.116613    0.86334   0.699677       85.7542k/s        512         10             32        10k            1           16           60k dataset_memory_type="device"

two atomics + mutex approach:

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                 Time             CPU   Iterations        GPU    Latency     Recall end_to_end items_per_second      itopk          k max_iterations  n_queries refine_ratio search_width total_queries
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
cuvs_cagra_q_iterative/0/process_time/real_time        4.28 ms         4.27 ms          164   4.26642m   4.27792m    0.85078   0.701578        2.3376M/s         64         10              8        10k            1            2         1.64M dataset_memory_type="device"
cuvs_cagra_q_iterative/1/process_time/real_time        4.27 ms         4.27 ms          164   4.26101m    4.2725m    0.85078   0.700691       2.34056M/s         64         10              8        10k            1            2         1.64M dataset_memory_type="device"
cuvs_cagra_q_iterative/2/process_time/real_time        3.58 ms         3.57 ms          189   3.56572m   3.57723m    0.84674   0.676096       2.79547M/s        128         10             12        10k            1            1         1.89M dataset_memory_type="device"
cuvs_cagra_q_iterative/3/process_time/real_time        4.46 ms         4.47 ms          156   4.45332m   4.46465m     0.8556   0.696485       2.23982M/s        128         10             16        10k            1            1         1.56M dataset_memory_type="device"
cuvs_cagra_q_iterative/4/process_time/real_time        5.02 ms         4.93 ms          139   5.01347m   5.02475m    0.85859   0.698441       1.99015M/s        256         10             16        10k            1            1         1.39M dataset_memory_type="device"
cuvs_cagra_q_iterative/5/process_time/real_time        7.48 ms         7.47 ms           83   7.46776m   7.47937m    0.86108   0.620787       1.33702M/s        512         10             10        10k            1            2          830k dataset_memory_type="device"
cuvs_cagra_q_iterative/6/process_time/real_time        6.52 ms         6.52 ms          106   6.50945m   6.52083m    0.86156   0.691208       1.53355M/s        256         10             12        10k            1            2         1060k dataset_memory_type="device"
cuvs_cagra_q_iterative/7/process_time/real_time        8.43 ms         8.43 ms           82   8.41571m   8.42704m    0.86001   0.691018       1.18666M/s         32         10             32        10k            1            1          820k dataset_memory_type="device"
cuvs_cagra_q_iterative/8/process_time/real_time        10.1 ms         10.0 ms           61  0.0100925  0.0101043    0.86112   0.616365       989.678k/s         32         10             64        10k            1            1          610k dataset_memory_type="device"
cuvs_cagra_q_iterative/9/process_time/real_time        11.3 ms         11.3 ms           60   0.011294  0.0113053     0.8634    0.67832       884.541k/s        192         10             12        10k            1            4          600k dataset_memory_type="device"
cuvs_cagra_q_iterative/10/process_time/real_time       11.6 ms         11.6 ms           60  0.0115935   0.011605    0.86336   0.696301       861.698k/s        256         10             12        10k            1            4          600k dataset_memory_type="device"
cuvs_cagra_q_iterative/11/process_time/real_time       37.0 ms         36.9 ms           19  0.0369513   0.036963    0.86332   0.702297       270.541k/s        256         10             10        10k            1           16          190k dataset_memory_type="device"
cuvs_cagra_q_iterative/12/process_time/real_time        118 ms          117 ms            6   0.117519   0.117536    0.86364   0.705215       85.0806k/s        512         10             32        10k            1           16           60k dataset_memory_type="device"

@divyegala
Member

Is this still relevant for you?

Yes. But I'll fix it on my own if your PR does not account for that case, although I do prefer the solution to be more generic.

Contributor

@achirkin achirkin left a comment

Thanks for exploring the less-locking approach!
Could you please expand your benchmarks to also test the throughput mode (--mode=throughput --threads=1:1024) and increase the benchmark case time for more stable results (--benchmark_min_time=3s)?

Comment thread cpp/src/neighbors/detail/smem_utils.cuh Outdated
Comment on lines +65 to +69
auto launch_status =
  cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, cur_smem_size);
RAFT_EXPECTS(launch_status == cudaSuccess,
             "Failed to set max dynamic shared memory size to %u bytes",
             cur_smem_size);
Contributor

There are a couple of issues here:

  • By the time the mutex is locked, another thread may have already called cudaFuncSetAttribute, so the update is no longer needed and the work is done twice. To avoid this, you'd need to repeat the atomic check after taking the lock.
  • By the time smem_size > cur_smem_size is checked, another thread may have already increased last_smem_size and changed last_kernel, so update_needed may be incorrectly set to false. To fix this, you'd need to reorder the checks, introduce a loop for checking both atomics, or expand the locked section.
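Both points can be illustrated with a host-only sketch of a double-checked pattern. This is one possible arrangement under stated assumptions, not the code merged in this PR: `smem_state` and `ensure_smem` are hypothetical names, `apply_calls` stands in for the cudaFuncSetAttribute call, and all atomics use the default sequentially consistent ordering to keep the reasoning simple.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>

using kernel_t = int (*)(int);

template <int V>
int variant(int x) { return x + V; }

struct smem_state {
  std::atomic<kernel_t> kernel{nullptr};
  std::atomic<std::uint32_t> high_water{0};
  std::mutex mutex;
  int apply_calls{0};  // stands in for cudaFuncSetAttribute calls; guarded by mutex
};

inline void ensure_smem(smem_state& s, kernel_t kernel, std::uint32_t smem_size)
{
  // Fast path. The reader loads the mark first and the kernel second; the writer
  // below stores them in the opposite order, so a reader never pairs a newer mark
  // with an older kernel pointer.
  std::uint32_t mark = s.high_water.load();
  if (smem_size <= mark && s.kernel.load() == kernel) { return; }

  std::lock_guard<std::mutex> guard{s.mutex};
  // Re-check under the lock: another thread may already have done the work.
  mark = s.high_water.load();
  if (smem_size <= mark && s.kernel.load() == kernel) { return; }

  if (smem_size > mark) { mark = smem_size; }
  ++s.apply_calls;           // real code would set the attribute for `kernel` to `mark` here
  s.kernel.store(kernel);    // publish the kernel first ...
  s.high_water.store(mark);  // ... then the (possibly raised) mark
}

inline void demo()
{
  smem_state s;
  ensure_smem(s, &variant<1>, 100);
  ensure_smem(s, &variant<1>, 60);  // fast path, no extra call
  assert(s.apply_calls == 1);
  ensure_smem(s, &variant<2>, 60);  // pointer changed: raised to the mark of 100
  assert(s.apply_calls == 2 && s.high_water.load() == 100);
}
```

The re-check after locking addresses the first bullet, and the fixed load/store order is one of the "reorder the checks" options from the second. Weaker memory orders may be possible but would need separate justification.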

Ludu-nuvai added a commit to Nuvai/cuvs that referenced this pull request Mar 23, 2026
Cherry-picked from upstream PR rapidsai#1851.
Tracks kernel function pointer changes and re-applies shared memory
attribute when CAGRA search switches between kernel variants, preventing
silent performance degradation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@irina-resh-nvda
Contributor Author

@achirkin
New benchmarks (using the newest commit) with the flags you requested

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                              Time             CPU   Iterations        GPU    Latency     Recall end_to_end items_per_second      itopk          k max_iterations  n_queries refine_ratio search_width total_queries
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
cuvs_cagra_q_iterative/0/process_time/real_time/threads:1           4.26 ms         4.27 ms          983   4.25264m   4.26404m    0.84747    4.19155        2.3452M/s         64         10              8        10k            1            2         9.83M dataset_memory_type="device"
cuvs_cagra_q_iterative/0/process_time/real_time/threads:2           3.76 ms         7.52 ms         1114   7.51068m   7.52681m    0.84747     4.1925       2.65835M/s         64         10              8        10k            1            2        11.14M dataset_memory_type="device"
cuvs_cagra_q_iterative/0/process_time/real_time/threads:4           3.76 ms         15.0 ms         1116  0.0150265  0.0150596    0.84747    4.20166       2.65966M/s         64         10              8        10k            1            2        11.16M dataset_memory_type="device"
cuvs_cagra_q_iterative/0/process_time/real_time/threads:8           3.75 ms         29.8 ms         1120  0.0300058   0.030161    0.84747    4.22251       2.66495M/s         64         10              8        10k            1            2         11.2M dataset_memory_type="device"
cuvs_cagra_q_iterative/0/process_time/real_time/threads:16          3.74 ms         59.2 ms         1264  0.0304797  0.0604478    0.84747    4.77536       2.67167M/s         64         10              8        10k            1            2        12.64M dataset_memory_type="device"
cuvs_cagra_q_iterative/0/process_time/real_time/threads:32          3.69 ms          101 ms         1120  0.0325192   0.120937    0.84747    4.23278       2.71309M/s         64         10              8        10k            1            2         11.2M dataset_memory_type="device"
cuvs_cagra_q_iterative/0/process_time/real_time/threads:64          3.74 ms          100 ms         1152  0.0462351   0.250309    0.84747    4.50583       2.67661M/s         64         10              8        10k            1            2        11.52M dataset_memory_type="device"
cuvs_cagra_q_iterative/0/process_time/real_time/threads:128         3.70 ms         96.8 ms         1408   0.093142   0.511545    0.84747    5.62735       2.70053M/s         64         10              8        10k            1            2        14.08M dataset_memory_type="device"
cuvs_cagra_q_iterative/0/process_time/real_time/threads:256         3.11 ms         72.6 ms         1024   0.415988    1.07977    0.84747    4.31994       3.21259M/s         64         10              8        10k            1            2        10.24M dataset_memory_type="device"
cuvs_cagra_q_iterative/0/process_time/real_time/threads:512         3.39 ms         59.5 ms         1024    1.42318    2.54438    0.84747     5.0871       2.94647M/s         64         10              8        10k            1            2        10.24M dataset_memory_type="device"
cuvs_cagra_q_iterative/0/process_time/real_time/threads:1024        4.26 ms         52.7 ms         1024    4.36568    6.30396    0.84747    6.30608       2.34491M/s         64         10              8        10k            1            2        10.24M dataset_memory_type="device"
cuvs_cagra_q_iterative/1/process_time/real_time/threads:1           4.27 ms         4.27 ms          982   4.25389m    4.2655m    0.84747    4.18873       2.34439M/s         64         10              8        10k            1            2         9.82M dataset_memory_type="device"
cuvs_cagra_q_iterative/1/process_time/real_time/threads:2           3.76 ms         7.52 ms         1114   7.50801m   7.52447m    0.84747    4.19121       2.65918M/s         64         10              8        10k            1            2        11.14M dataset_memory_type="device"
cuvs_cagra_q_iterative/1/process_time/real_time/threads:4           3.76 ms         15.0 ms         1120   0.015034  0.0150745    0.84747    4.22091       2.65826M/s         64         10              8        10k            1            2         11.2M dataset_memory_type="device"
cuvs_cagra_q_iterative/1/process_time/real_time/threads:8           3.75 ms         29.9 ms         1120  0.0299985  0.0301534    0.84747    4.22145       2.66559M/s         64         10              8        10k            1            2         11.2M dataset_memory_type="device"
cuvs_cagra_q_iterative/1/process_time/real_time/threads:16          3.75 ms         59.2 ms         1264  0.0305015  0.0604605    0.84747    4.77631        2.6698M/s         64         10              8        10k            1            2        12.64M dataset_memory_type="device"
cuvs_cagra_q_iterative/1/process_time/real_time/threads:32          3.52 ms         94.2 ms         1248  0.0328032   0.122089    0.84747     4.7613       2.83944M/s         64         10              8        10k            1            2        12.48M dataset_memory_type="device"
cuvs_cagra_q_iterative/1/process_time/real_time/threads:64          3.60 ms         93.2 ms         1152  0.0528436    0.24902    0.84747    4.48214       2.77583M/s         64         10              8        10k            1            2        11.52M dataset_memory_type="device"
cuvs_cagra_q_iterative/1/process_time/real_time/threads:128         3.69 ms         92.6 ms          896   0.127664    0.53028    0.84747    3.71223       2.70972M/s         64         10              8        10k            1            2         8.96M dataset_memory_type="device"
cuvs_cagra_q_iterative/1/process_time/real_time/threads:256         3.30 ms         78.3 ms         1024   0.399409     1.0932    0.84747    4.37352       3.03257M/s         64         10              8        10k            1            2        10.24M dataset_memory_type="device"
cuvs_cagra_q_iterative/1/process_time/real_time/threads:512         2.94 ms         57.9 ms         1024    1.23099    2.36729    0.84747    4.73089       3.39818M/s         64         10              8        10k            1            2        10.24M dataset_memory_type="device"
cuvs_cagra_q_iterative/1/process_time/real_time/threads:1024        4.17 ms         54.5 ms         1024    4.24469    6.28381    0.84747    6.28045       2.39902M/s         64         10              8        10k            1            2        10.24M dataset_memory_type="device"
cuvs_cagra_q_iterative/2/process_time/real_time/threads:1           3.56 ms         3.56 ms         1177   3.54751m   3.55908m    0.84298    4.18904       2.80972M/s        128         10             12        10k            1            1        11.77M dataset_memory_type="device"
cuvs_cagra_q_iterative/2/process_time/real_time/threads:2           3.12 ms         6.25 ms         1340    6.2352m   6.25054m    0.84298    4.18787       3.20089M/s        128         10             12        10k            1            1         13.4M dataset_memory_type="device"
cuvs_cagra_q_iterative/2/process_time/real_time/threads:4           3.12 ms         12.5 ms         1344  0.0124838  0.0125113    0.84298    4.20376       3.20072M/s        128         10             12        10k            1            1        13.44M dataset_memory_type="device"
cuvs_cagra_q_iterative/2/process_time/real_time/threads:8           3.12 ms         24.8 ms         1352  0.0249172  0.0250403    0.84298    4.23178       3.20885M/s        128         10             12        10k            1            1        13.52M dataset_memory_type="device"
cuvs_cagra_q_iterative/2/process_time/real_time/threads:16          3.12 ms         49.2 ms         1504  0.0253534  0.0502439    0.84298    4.72287       3.20994M/s        128         10             12        10k            1            1        15.04M dataset_memory_type="device"
cuvs_cagra_q_iterative/2/process_time/real_time/threads:32          2.98 ms         81.4 ms         1408  0.0268179   0.100126    0.84298    4.40549        3.3534M/s        128         10             12        10k            1            1        14.08M dataset_memory_type="device"
cuvs_cagra_q_iterative/2/process_time/real_time/threads:64          3.11 ms         84.4 ms         1536  0.0350969   0.205721    0.84298    4.93739       3.21547M/s        128         10             12        10k            1            1        15.36M dataset_memory_type="device"
cuvs_cagra_q_iterative/2/process_time/real_time/threads:128         2.71 ms         69.3 ms         1152   0.077869   0.403269    0.84298    3.62965       3.68494M/s        128         10             12        10k            1            1        11.52M dataset_memory_type="device"
cuvs_cagra_q_iterative/2/process_time/real_time/threads:256         2.60 ms         64.6 ms         1280   0.305413   0.876562    0.84298    4.38333       3.84575M/s        128         10             12        10k            1            1         12.8M dataset_memory_type="device"
cuvs_cagra_q_iterative/2/process_time/real_time/threads:512         2.08 ms         51.0 ms         1536   0.755542    1.70121    0.84298    5.10463       4.81926M/s        128         10             12        10k            1            1        15.36M dataset_memory_type="device"
cuvs_cagra_q_iterative/2/process_time/real_time/threads:1024        3.88 ms         46.7 ms         1024    3.93342    5.72016    0.84298    5.72221       2.58026M/s        128         10             12        10k            1            1        10.24M dataset_memory_type="device"
cuvs_cagra_q_iterative/3/process_time/real_time/threads:1           4.46 ms         4.46 ms          940   4.44947m   4.46149m    0.85236     4.1938        2.2414M/s        128         10             16        10k            1            1          9.4M dataset_memory_type="device"
cuvs_cagra_q_iterative/3/process_time/real_time/threads:2           3.91 ms         7.82 ms         1070    7.8121m   7.83973m    0.85236    4.19936       2.55537M/s        128         10             16        10k            1            1         10.7M dataset_memory_type="device"
cuvs_cagra_q_iterative/3/process_time/real_time/threads:4           3.91 ms         15.6 ms         1072   0.015634  0.0156694    0.85236     4.1993       2.55633M/s        128         10             16        10k            1            1        10.72M dataset_memory_type="device"
cuvs_cagra_q_iterative/3/process_time/real_time/threads:8           3.90 ms         31.1 ms         1120  0.0312072  0.0313691    0.85236    4.39151       2.56232M/s        128         10             16        10k            1            1         11.2M dataset_memory_type="device"
cuvs_cagra_q_iterative/3/process_time/real_time/threads:16          3.54 ms         52.5 ms         1104  0.0313415  0.0626231    0.85236    4.32099       2.82166M/s        128         10             16        10k            1            1        11.04M dataset_memory_type="device"
cuvs_cagra_q_iterative/3/process_time/real_time/threads:32          3.75 ms          102 ms         1088  0.0340829   0.126965    0.85236    4.31684       2.66808M/s        128         10             16        10k            1            1        10.88M dataset_memory_type="device"
cuvs_cagra_q_iterative/3/process_time/real_time/threads:64          3.65 ms         98.6 ms         1088  0.0452013   0.251998    0.85236    4.28387       2.74043M/s        128         10             16        10k            1            1        10.88M dataset_memory_type="device"
cuvs_cagra_q_iterative/3/process_time/real_time/threads:128         3.48 ms         94.1 ms          896   0.110273   0.515823    0.85236    3.61101       2.87137M/s        128         10             16        10k            1            1         8.96M dataset_memory_type="device"
cuvs_cagra_q_iterative/3/process_time/real_time/threads:256         3.44 ms         87.1 ms         1792   0.196212    1.03583    0.85236    7.25145       2.90419M/s        128         10             16        10k            1            1        17.92M dataset_memory_type="device"
cuvs_cagra_q_iterative/3/process_time/real_time/threads:512         3.16 ms         66.8 ms         1536    1.05771    2.34923    0.85236    7.04905       3.16891M/s        128         10             16        10k            1            1        15.36M dataset_memory_type="device"
cuvs_cagra_q_iterative/3/process_time/real_time/threads:1024        4.57 ms         55.0 ms         1024    4.68035    6.71024    0.85236    6.70689       2.18642M/s        128         10             16        10k            1            1        10.24M dataset_memory_type="device"
cuvs_cagra_q_iterative/4/process_time/real_time/threads:1           4.94 ms         4.92 ms          849   4.92898m   4.94083m    0.85682    4.19477       2.02395M/s        256         10             16        10k            1            1         8.49M dataset_memory_type="device"
cuvs_cagra_q_iterative/4/process_time/real_time/threads:2           4.61 ms         9.22 ms          910   9.21324m   9.23171m    0.85682     4.2005       2.16766M/s        256         10             16        10k            1            1          9.1M dataset_memory_type="device"
cuvs_cagra_q_iterative/4/process_time/real_time/threads:4           4.61 ms         18.4 ms          912  0.0184098  0.0184539    0.85682     4.2075       2.17117M/s        256         10             16        10k            1            1         9.12M dataset_memory_type="device"
cuvs_cagra_q_iterative/4/process_time/real_time/threads:8           4.59 ms         36.6 ms          952  0.0367164  0.0369396    0.85682    4.39582       2.17804M/s        256         10             16        10k            1            1         9.52M dataset_memory_type="device"
cuvs_cagra_q_iterative/4/process_time/real_time/threads:16          4.55 ms         71.7 ms          992  0.0371738  0.0738172    0.85682    4.57662        2.1999M/s        256         10             16        10k            1            1         9.92M dataset_memory_type="device"
cuvs_cagra_q_iterative/4/process_time/real_time/threads:32          4.42 ms          120 ms          992  0.0399174   0.147912    0.85682    4.58531       2.26497M/s        256         10             16        10k            1            1         9.92M dataset_memory_type="device"
cuvs_cagra_q_iterative/4/process_time/real_time/threads:64          4.38 ms          120 ms          896  0.0552238    0.29989    0.85682    4.19823       2.28228M/s        256         10             16        10k            1            1         8.96M dataset_memory_type="device"
cuvs_cagra_q_iterative/4/process_time/real_time/threads:128         3.72 ms          101 ms          896   0.165797   0.601902    0.85682    4.21346       2.68946M/s        256         10             16        10k            1            1         8.96M dataset_memory_type="device"
cuvs_cagra_q_iterative/4/process_time/real_time/threads:256         3.59 ms         82.7 ms         1024   0.380164    1.27415    0.85682    5.09742       2.78569M/s        256         10             16        10k            1            1        10.24M dataset_memory_type="device"
cuvs_cagra_q_iterative/4/process_time/real_time/threads:512         3.83 ms         75.1 ms         1024    1.44992    2.89466    0.85682    5.79061       2.61049M/s        256         10             16        10k            1            1        10.24M dataset_memory_type="device"
cuvs_cagra_q_iterative/4/process_time/real_time/threads:1024        5.01 ms         71.1 ms         1024    4.85979    7.23408    0.85682    7.23652       1.99689M/s        256         10             16        10k            1            1        10.24M dataset_memory_type="device"
cuvs_cagra_q_iterative/5/process_time/real_time/threads:1           7.46 ms         7.46 ms          562   7.44789m   7.45966m    0.85861    4.19233       1.34055M/s        512         10             10        10k            1            2         5.62M dataset_memory_type="device"
cuvs_cagra_q_iterative/5/process_time/real_time/threads:2           7.23 ms         14.5 ms          580  0.0144516  0.0144779   0.858745    4.19865       1.38263M/s        512         10             10        10k            1            2          5.8M dataset_memory_type="device"
cuvs_cagra_q_iterative/5/process_time/real_time/threads:4           7.23 ms         28.8 ms          580  0.0288943  0.0290363    0.85886     4.2128       1.38368M/s        512         10             10        10k            1            2          5.8M dataset_memory_type="device"
cuvs_cagra_q_iterative/5/process_time/real_time/threads:8           7.19 ms         57.1 ms          600   0.057466  0.0580009    0.85878    4.34999       1.39178M/s        512         10             10        10k            1            2            6M dataset_memory_type="device"
cuvs_cagra_q_iterative/5/process_time/real_time/threads:16          7.17 ms          113 ms          640  0.0586955   0.116338   0.858862    4.65344       1.39531M/s        512         10             10        10k            1            2          6.4M dataset_memory_type="device"
cuvs_cagra_q_iterative/5/process_time/real_time/threads:32          6.82 ms          185 ms          608  0.0640125   0.232856   0.858876    4.42427       1.46523M/s        512         10             10        10k            1            2         6.08M dataset_memory_type="device"
cuvs_cagra_q_iterative/5/process_time/real_time/threads:64          6.76 ms          186 ms          704  0.0889401   0.468263   0.858906      5.151       1.47852M/s        512         10             10        10k            1            2         7.04M dataset_memory_type="device"
cuvs_cagra_q_iterative/5/process_time/real_time/threads:128         6.18 ms          170 ms          896   0.190995   0.936132   0.858862    6.55317         1.617M/s        512         10             10        10k            1            2         8.96M dataset_memory_type="device"
cuvs_cagra_q_iterative/5/process_time/real_time/threads:256         6.19 ms          148 ms          768   0.753871    2.07761   0.858888    6.23345       1.61559M/s        512         10             10        10k            1            2         7.68M dataset_memory_type="device"
cuvs_cagra_q_iterative/5/process_time/real_time/threads:512         6.03 ms         99.4 ms          512    3.08049    4.94092   0.858895    4.94255       1.65903M/s        512         10             10        10k            1            2         5.12M dataset_memory_type="device"
cuvs_cagra_q_iterative/5/process_time/real_time/threads:1024        5.80 ms          102 ms         1024    5.81449    9.57967   0.858864    9.57797       1.72475M/s        512         10             10        10k            1            2        10.24M dataset_memory_type="device"
cuvs_cagra_q_iterative/6/process_time/real_time/threads:1           6.52 ms         6.52 ms          643   6.50481m   6.51671m    0.86104    4.19024       1.53452M/s        256         10             12        10k            1            2         6.43M dataset_memory_type="device"
cuvs_cagra_q_iterative/6/process_time/real_time/threads:2           6.13 ms         12.2 ms          686  0.0122414  0.0122641    0.86098    4.20653       1.63198M/s        256         10             12        10k            1            2         6.86M dataset_memory_type="device"
cuvs_cagra_q_iterative/6/process_time/real_time/threads:4           6.12 ms         24.4 ms          688  0.0244572  0.0245248    0.86099    4.21827       1.63459M/s        256         10             12        10k            1            2         6.88M dataset_memory_type="device"
cuvs_cagra_q_iterative/6/process_time/real_time/threads:8           6.11 ms         48.7 ms          720  0.0488587  0.0491128   0.860976    4.41995        1.6369M/s        256         10             12        10k            1            2          7.2M dataset_memory_type="device"
cuvs_cagra_q_iterative/6/process_time/real_time/threads:16          6.08 ms         95.9 ms          704  0.0498208  0.0986288   0.860988    4.33962        1.6438M/s        256         10             12        10k            1            2         7.04M dataset_memory_type="device"
cuvs_cagra_q_iterative/6/process_time/real_time/threads:32          5.93 ms          160 ms          736   0.054321   0.199185    0.86098    4.58153       1.68618M/s        256         10             12        10k            1            2         7.36M dataset_memory_type="device"
cuvs_cagra_q_iterative/6/process_time/real_time/threads:64          5.77 ms          158 ms          768  0.0736453   0.398476   0.860983    4.78181       1.73448M/s        256         10             12        10k            1            2         7.68M dataset_memory_type="device"
cuvs_cagra_q_iterative/6/process_time/real_time/threads:128         5.88 ms          152 ms          768   0.185618   0.840867   0.860976    5.04553       1.70199M/s        256         10             12        10k            1            2         7.68M dataset_memory_type="device"
cuvs_cagra_q_iterative/6/process_time/real_time/threads:256         4.72 ms          129 ms         1024   0.446326    1.59564   0.860982    6.38193       2.11695M/s        256         10             12        10k            1            2        10.24M dataset_memory_type="device"
cuvs_cagra_q_iterative/6/process_time/real_time/threads:512         5.34 ms          115 ms         1024    1.67939    3.71042   0.860981    7.42286       1.87246M/s        256         10             12        10k            1            2        10.24M dataset_memory_type="device"
cuvs_cagra_q_iterative/6/process_time/real_time/threads:1024        5.18 ms         85.5 ms         1024    5.24696    8.49105   0.860982    8.48756       1.93035M/s        256         10             12        10k            1            2        10.24M dataset_memory_type="device"
cuvs_cagra_q_iterative/7/process_time/real_time/threads:1           8.42 ms         8.42 ms          498    8.4087m   8.42075m    0.85952    4.19353       1.18754M/s         32         10             32        10k            1            1         4.98M dataset_memory_type="device"
cuvs_cagra_q_iterative/7/process_time/real_time/threads:2           7.48 ms         15.0 ms          560  0.0149531  0.0149797    0.85952    4.19426        1.3363M/s         32         10             32        10k            1            1          5.6M dataset_memory_type="device"
cuvs_cagra_q_iterative/7/process_time/real_time/threads:4           7.48 ms         29.8 ms          564  0.0298902  0.0299831    0.85952    4.22761       1.33761M/s         32         10             32        10k            1            1         5.64M dataset_memory_type="device"
cuvs_cagra_q_iterative/7/process_time/real_time/threads:8           7.45 ms         59.3 ms          576  0.0596019  0.0599793    0.85952     4.3185       1.34192M/s         32         10             32        10k            1            1         5.76M dataset_memory_type="device"
cuvs_cagra_q_iterative/7/process_time/real_time/threads:16          7.41 ms          117 ms          592  0.0607554   0.120359    0.85952    4.45325       1.35018M/s         32         10             32        10k            1            1         5.92M dataset_memory_type="device"
cuvs_cagra_q_iterative/7/process_time/real_time/threads:32          7.36 ms          201 ms          576   0.067084   0.242665    0.85952    4.36792       1.35932M/s         32         10             32        10k            1            1         5.76M dataset_memory_type="device"
cuvs_cagra_q_iterative/7/process_time/real_time/threads:64          7.28 ms          196 ms          640  0.0964646   0.494021    0.85952     4.9403        1.3743M/s         32         10             32        10k            1            1          6.4M dataset_memory_type="device"
cuvs_cagra_q_iterative/7/process_time/real_time/threads:128         6.85 ms          179 ms          640   0.210187   0.999995    0.85952    5.00005       1.45947M/s         32         10             32        10k            1            1          6.4M dataset_memory_type="device"
cuvs_cagra_q_iterative/7/process_time/real_time/threads:256         6.69 ms          168 ms         1024     0.5987    2.09405    0.85952    8.37688       1.49451M/s         32         10             32        10k            1            1        10.24M dataset_memory_type="device"
cuvs_cagra_q_iterative/7/process_time/real_time/threads:512         6.19 ms          105 ms          512    3.15013    5.02772    0.85952    5.02654       1.61673M/s         32         10             32        10k            1            1         5.12M dataset_memory_type="device"
cuvs_cagra_q_iterative/7/process_time/real_time/threads:1024        6.09 ms          109 ms         1024    6.07929     10.054    0.85952    10.0565        1.6432M/s         32         10             32        10k            1            1        10.24M dataset_memory_type="device"
cuvs_cagra_q_iterative/8/process_time/real_time/threads:1           9.98 ms         9.98 ms          420    9.9673m   9.97944m    0.86074    4.19136       1002.06k/s         32         10             64        10k            1            1          4.2M dataset_memory_type="device"
cuvs_cagra_q_iterative/8/process_time/real_time/threads:2           8.59 ms         17.2 ms          490  0.0171592  0.0171897    0.86074    4.21143       1.16462M/s         32         10             64        10k            1            1          4.9M dataset_memory_type="device"
cuvs_cagra_q_iterative/8/process_time/real_time/threads:4           8.57 ms         34.2 ms          496   0.034272  0.0343877    0.86074    4.26413       1.16666M/s         32         10             64        10k            1            1         4.96M dataset_memory_type="device"
cuvs_cagra_q_iterative/8/process_time/real_time/threads:8           8.54 ms         68.0 ms          504  0.0683426  0.0688308    0.86074    4.33631       1.17032M/s         32         10             64        10k            1            1         5.04M dataset_memory_type="device"
cuvs_cagra_q_iterative/8/process_time/real_time/threads:16          8.08 ms          122 ms          512   0.069489   0.138153    0.86074    4.42081       1.23816M/s         32         10             64        10k            1            1         5.12M dataset_memory_type="device"
cuvs_cagra_q_iterative/8/process_time/real_time/threads:32          8.10 ms          220 ms          512  0.0766018   0.276059    0.86074    4.41675         1.234M/s         32         10             64        10k            1            1         5.12M dataset_memory_type="device"
cuvs_cagra_q_iterative/8/process_time/real_time/threads:64          8.02 ms          217 ms          640   0.103561   0.553866    0.86074    5.53882       1.24647M/s         32         10             64        10k            1            1          6.4M dataset_memory_type="device"
cuvs_cagra_q_iterative/8/process_time/real_time/threads:128         7.80 ms          205 ms          640   0.228515     1.1366    0.86074    5.68335       1.28231M/s         32         10             64        10k            1            1          6.4M dataset_memory_type="device"
cuvs_cagra_q_iterative/8/process_time/real_time/threads:256         6.31 ms          173 ms          768   0.700406    2.23698    0.86074     6.7114        1.5852M/s         32         10             64        10k            1            1         7.68M dataset_memory_type="device"
cuvs_cagra_q_iterative/8/process_time/real_time/threads:512         7.31 ms          173 ms         1024    1.82101    4.88175    0.86074    9.76496        1.3681M/s         32         10             64        10k            1            1        10.24M dataset_memory_type="device"
cuvs_cagra_q_iterative/8/process_time/real_time/threads:1024        7.06 ms          130 ms         1024    6.70106    11.1416    0.86074    11.1378       1.41653M/s         32         10             64        10k            1            1        10.24M dataset_memory_type="device"
cuvs_cagra_q_iterative/9/process_time/real_time/threads:1           11.3 ms         11.3 ms          371  0.0112863  0.0112984    0.86233    4.19169       885.085k/s        192         10             12        10k            1            4         3.71M dataset_memory_type="device"
cuvs_cagra_q_iterative/9/process_time/real_time/threads:2           10.6 ms         21.3 ms          394  0.0212824  0.0213243    0.86233     4.2009       939.104k/s        192         10             12        10k            1            4         3.94M dataset_memory_type="device"
cuvs_cagra_q_iterative/9/process_time/real_time/threads:4           10.6 ms         42.4 ms          404  0.0424963  0.0426695   0.862405    4.30958       940.934k/s        192         10             12        10k            1            4         4.04M dataset_memory_type="device"
cuvs_cagra_q_iterative/9/process_time/real_time/threads:8           10.6 ms         84.2 ms          416   0.084674  0.0854072   0.862399    4.44098       944.633k/s        192         10             12        10k            1            4         4.16M dataset_memory_type="device"
cuvs_cagra_q_iterative/9/process_time/real_time/threads:16          9.78 ms          149 ms          416  0.0844253   0.171457   0.862386    4.45782       1022.16k/s        192         10             12        10k            1            4         4.16M dataset_memory_type="device"
cuvs_cagra_q_iterative/9/process_time/real_time/threads:32          9.87 ms          262 ms          320   0.101333   0.348161   0.862402     3.4819       1013.54k/s        192         10             12        10k            1            4          3.2M dataset_memory_type="device"
cuvs_cagra_q_iterative/9/process_time/real_time/threads:64          9.85 ms          269 ms          512   0.134169   0.686461   0.862388     5.4914       1014.98k/s        192         10             12        10k            1            4         5.12M dataset_memory_type="device"
cuvs_cagra_q_iterative/9/process_time/real_time/threads:128         9.39 ms          251 ms          640   0.276207    1.38894   0.862392    6.94418       1064.68k/s        192         10             12        10k            1            4          6.4M dataset_memory_type="device"
cuvs_cagra_q_iterative/9/process_time/real_time/threads:256         8.51 ms          226 ms          768   0.664331    2.74991   0.862395    8.25052       1.17532M/s        192         10             12        10k            1            4         7.68M dataset_memory_type="device"
cuvs_cagra_q_iterative/9/process_time/real_time/threads:512         8.26 ms          163 ms          512    4.09056    6.62565   0.862392    6.62668        1.2111M/s        192         10             12        10k            1            4         5.12M dataset_memory_type="device"
cuvs_cagra_q_iterative/9/process_time/real_time/threads:1024        8.25 ms          165 ms         1024    7.79676    13.2496   0.862393    13.2492        1.2124M/s        192         10             12        10k            1            4        10.24M dataset_memory_type="device"
cuvs_cagra_q_iterative/10/process_time/real_time/threads:1          11.6 ms         11.6 ms          362  0.0115814  0.0115936    0.86233    4.19689       862.544k/s        256         10             12        10k            1            4         3.62M dataset_memory_type="device"
cuvs_cagra_q_iterative/10/process_time/real_time/threads:2          10.9 ms         21.8 ms          384   0.021822  0.0218655    0.86231    4.19811       915.875k/s        256         10             12        10k            1            4         3.84M dataset_memory_type="device"
cuvs_cagra_q_iterative/10/process_time/real_time/threads:4          10.9 ms         43.5 ms          396   0.043599  0.0437801   0.862337    4.33426       917.138k/s        256         10             12        10k            1            4         3.96M dataset_memory_type="device"
cuvs_cagra_q_iterative/10/process_time/real_time/threads:8          10.8 ms         85.7 ms          400  0.0863382  0.0874482   0.862325    4.37236       926.423k/s        256         10             12        10k            1            4            4M dataset_memory_type="device"
cuvs_cagra_q_iterative/10/process_time/real_time/threads:16         10.6 ms          167 ms          400  0.0883175   0.174934   0.862319    4.37312       939.601k/s        256         10             12        10k            1            4            4M dataset_memory_type="device"
cuvs_cagra_q_iterative/10/process_time/real_time/threads:32         10.5 ms          283 ms          320   0.104093   0.356726   0.862339    3.56713       949.441k/s        256         10             12        10k            1            4          3.2M dataset_memory_type="device"
cuvs_cagra_q_iterative/10/process_time/real_time/threads:64         10.2 ms          280 ms          576   0.129415   0.703408   0.862337    6.33077        982.85k/s        256         10             12        10k            1            4         5.76M dataset_memory_type="device"
cuvs_cagra_q_iterative/10/process_time/real_time/threads:128        8.80 ms          233 ms          512   0.343449    1.40981   0.862322    5.63972       1.13646M/s        256         10             12        10k            1            4         5.12M dataset_memory_type="device"
cuvs_cagra_q_iterative/10/process_time/real_time/threads:256        7.12 ms          192 ms          512    1.15827    2.86344   0.862328    5.72749       1.40387M/s        256         10             12        10k            1            4         5.12M dataset_memory_type="device"
cuvs_cagra_q_iterative/10/process_time/real_time/threads:512        8.79 ms          172 ms          512    4.17641    6.97207   0.862327      6.973       1.13779M/s        256         10             12        10k            1            4         5.12M dataset_memory_type="device"
cuvs_cagra_q_iterative/10/process_time/real_time/threads:1024       6.62 ms          152 ms         1024    6.57826    12.2257   0.862327    12.2229        1.5111M/s        256         10             12        10k            1            4        10.24M dataset_memory_type="device"
cuvs_cagra_q_iterative/11/process_time/real_time/threads:1          36.9 ms         36.9 ms          114  0.0368472  0.0368597    0.86287    4.20201       271.299k/s        256         10             10        10k            1           16         1.14M dataset_memory_type="device"
cuvs_cagra_q_iterative/11/process_time/real_time/threads:2          36.2 ms         72.1 ms          118  0.0722891  0.0726126    0.86286    4.28421       276.606k/s        256         10             10        10k            1           16         1.18M dataset_memory_type="device"
cuvs_cagra_q_iterative/11/process_time/real_time/threads:4          35.9 ms          143 ms          120   0.143505   0.145338    0.86302    4.36023       278.704k/s        256         10             10        10k            1           16          1.2M dataset_memory_type="device"
cuvs_cagra_q_iterative/11/process_time/real_time/threads:8          35.3 ms          278 ms          120   0.282388   0.290878   0.862964    4.36313        283.28k/s        256         10             10        10k            1           16          1.2M dataset_memory_type="device"
cuvs_cagra_q_iterative/11/process_time/real_time/threads:16         35.0 ms          550 ms          208   0.290946   0.580669   0.862944    7.54849       285.853k/s        256         10             10        10k            1           16         2.08M dataset_memory_type="device"
cuvs_cagra_q_iterative/11/process_time/real_time/threads:32         33.5 ms          904 ms          192   0.349906    1.17199   0.862968    7.03196       298.529k/s        256         10             10        10k            1           16         1.92M dataset_memory_type="device"
cuvs_cagra_q_iterative/11/process_time/real_time/threads:64         29.9 ms          814 ms          192   0.622259    2.33434   0.862967    7.00307       334.476k/s        256         10             10        10k            1           16         1.92M dataset_memory_type="device"
cuvs_cagra_q_iterative/11/process_time/real_time/threads:128        26.4 ms          724 ms          256    1.50934    4.68885    0.86296    9.37717       379.142k/s        256         10             10        10k            1           16         2.56M dataset_memory_type="device"
cuvs_cagra_q_iterative/11/process_time/real_time/threads:256        18.6 ms          484 ms          256    4.73442    9.36847   0.862968    9.36776        539.03k/s        256         10             10        10k            1           16         2.56M dataset_memory_type="device"
cuvs_cagra_q_iterative/11/process_time/real_time/threads:512        19.2 ms          511 ms          512    9.78025    19.0511   0.862962    19.0503       520.948k/s        256         10             10        10k            1           16         5.12M dataset_memory_type="device"
cuvs_cagra_q_iterative/11/process_time/real_time/threads:1024       19.2 ms          516 ms         1024    18.9658    37.4673   0.862964     37.465        519.95k/s        256         10             10        10k            1           16        10.24M dataset_memory_type="device"
cuvs_cagra_q_iterative/12/process_time/real_time/threads:1           117 ms          117 ms           36   0.116497    0.11651    0.86331    4.19438       85.8293k/s        512         10             32        10k            1           16          360k dataset_memory_type="device"
cuvs_cagra_q_iterative/12/process_time/real_time/threads:2           113 ms          225 ms           38   0.226888   0.229925   0.863095    4.36864       88.1424k/s        512         10             32        10k            1           16          380k dataset_memory_type="device"
cuvs_cagra_q_iterative/12/process_time/real_time/threads:4           111 ms          435 ms           40   0.442771   0.460008   0.863245    4.60008       90.3363k/s        512         10             32        10k            1           16          400k dataset_memory_type="device"
cuvs_cagra_q_iterative/12/process_time/real_time/threads:8           109 ms          849 ms           64   0.869759   0.920014   0.863194    7.36009       91.9771k/s        512         10             32        10k            1           16          640k dataset_memory_type="device"
cuvs_cagra_q_iterative/12/process_time/real_time/threads:16          102 ms         1543 ms           64   0.927517    1.84272   0.863165    7.37078       98.3341k/s        512         10             32        10k            1           16          640k dataset_memory_type="device"
cuvs_cagra_q_iterative/12/process_time/real_time/threads:32         87.5 ms         2213 ms           64    1.42289    3.70456    0.86317    7.40899       114.339k/s        512         10             32        10k            1           16          640k dataset_memory_type="device"
cuvs_cagra_q_iterative/12/process_time/real_time/threads:64         60.8 ms         1527 ms           64    3.88789     7.5046   0.863181    7.50444       164.586k/s        512         10             32        10k            1           16          640k dataset_memory_type="device"
cuvs_cagra_q_iterative/12/process_time/real_time/threads:128        58.8 ms         1594 ms          128    7.52457    14.8276   0.863174    14.8279       170.075k/s        512         10             32        10k            1           16         1.28M dataset_memory_type="device"
cuvs_cagra_q_iterative/12/process_time/real_time/threads:256        59.3 ms         1608 ms          256    15.1861    29.8326   0.863175    29.8238       168.527k/s        512         10             32        10k            1           16         2.56M dataset_memory_type="device"
cuvs_cagra_q_iterative/12/process_time/real_time/threads:512        58.7 ms         1619 ms          512    29.9228    59.2306   0.863167     59.232       170.282k/s        512         10             32        10k            1           16         5.12M dataset_memory_type="device"
cuvs_cagra_q_iterative/12/process_time/real_time/threads:1024       65.8 ms         1772 ms         1024    66.6036    118.757   0.863165    118.754       151.918k/s        512         10             32        10k            1           16        10.24M dataset_memory_type="device"

@achirkin achirkin changed the base branch from main to release/26.04 March 24, 2026 08:17
@achirkin achirkin requested a review from a team as a code owner March 24, 2026 08:17
@achirkin achirkin requested review from a team as code owners March 24, 2026 08:17
@achirkin achirkin requested a review from KyleFromNVIDIA March 24, 2026 08:17
Contributor

@robertmaynard robertmaynard left a comment

This work needs to be rebased on release/26.04, and the changes pulled in from main (now at 26.06) need to be removed.

…ze by moving state checks and updates inside the mutex with acquire/release ordering on the lock-free fast path.
@irina-resh-nvda irina-resh-nvda force-pushed the cuvsbench_smem_size_bug branch from c8de24a to 962c3b4 Compare March 25, 2026 11:35
@achirkin achirkin requested review from robertmaynard and removed request for a team, KyleFromNVIDIA and robertmaynard March 25, 2026 12:44
Contributor

@achirkin achirkin left a comment

Thanks for the update and especially for the code comments! The atomic+mutex logic looks good to me.
Also, regarding the benchmarks: how does the atomic+mutex variant compare with the mutex-only variant now?

@irina-resh-nvda
Contributor Author

@achirkin
The mutex+atomics+acquire/release variant is on average 0.07% faster (essentially no change). However, my benchmarks don't really measure this change, since not many divergent kernel signatures get swapped.

@achirkin
Contributor

/merge

@achirkin achirkin dismissed robertmaynard’s stale review March 25, 2026 15:25

Rebased successfully

@rapids-bot rapids-bot bot merged commit dbd29a6 into rapidsai:release/26.04 Mar 25, 2026
80 checks passed
jrbourbeau pushed a commit to jrbourbeau/cuvs that referenced this pull request Mar 25, 2026
…AGRA search switches kernel variants (rapidsai#1851)

Fix a bug in `safely_launch_kernel_with_smem_size` where `cudaFuncSetAttribute` was skipped for kernels that needed it. The function tracked the max shared memory in a single static variable per KernelT type, but `cudaFuncSetAttribute` applies per function pointer value — and the single-CTA CAGRA [search](https://github.com/rapidsai/cuvs/blob/d7a28aa1cb7648fa61037ed0459df0ec0e9db841/cpp/src/neighbors/detail/cagra/search_single_cta_kernel-inl.cuh#L1373C4-L1375C78) dispatches multiple kernel instantiations that share the same pointer type. When one kernel bumped the tracked max, a different kernel whose smem fell between its own previous max and the global max would skip `cudaFuncSetAttribute`, causing `cudaErrorInvalidValue`. The fix tracks the kernel pointer identity alongside a monotonically growing smem high-water mark: when the pointer changes, the new kernel is brought up to the high-water mark; when smem exceeds it, the mark is grown.
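
The failure mode and the fix described above can be sketched on the host without a GPU. This is a minimal illustration, not the code from the PR: `mock_runtime`, `buggy_launcher`, `fixed_launcher`, `kernel_a`, `kernel_b` and `demo` are hypothetical names, and the mock only imitates the per-function-pointer limit that `cudaFuncSetAttribute` sets for `cudaFuncAttributeMaxDynamicSharedMemorySize`. Locking is reduced to a single mutex here.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>

// Hypothetical stand-ins for two kernel instantiations that share one pointer type.
using kernel_t = int (*)(int);
inline int kernel_a(int x) { return x + 1; }
inline int kernel_b(int x) { return x + 2; }

// Without an opt-in, CUDA limits dynamic shared memory to 48 KiB per block.
constexpr std::size_t kDefaultSmemLimit = 48 * 1024;

// Mock of the runtime state that cudaFuncSetAttribute changes: one limit per function pointer.
struct mock_runtime {
  std::map<kernel_t, std::size_t> limit;
  void set_max_dynamic_smem(kernel_t k, std::size_t bytes) { limit[k] = bytes; }
  // Returns false where a real launch would report cudaErrorInvalidValue.
  bool launch(kernel_t k, std::size_t smem) const {
    auto it = limit.find(k);
    std::size_t allowed = kDefaultSmemLimit;
    if (it != limit.end()) { allowed = it->second; }
    return smem <= allowed;
  }
};

// Old behaviour: a single tracked maximum for the whole pointer type.
struct buggy_launcher {
  std::size_t tracked_max = 0;
  bool launch(mock_runtime& rt, kernel_t k, std::size_t smem) {
    if (smem > tracked_max) {
      rt.set_max_dynamic_smem(k, smem);
      tracked_max = smem;
    }
    return rt.launch(k, smem);
  }
};

// Fixed behaviour: remember the last configured pointer plus a growing high-water mark.
struct fixed_launcher {
  kernel_t current = nullptr;
  std::size_t high_water = 0;
  std::mutex mtx;  // in this sketch the mutex guards both the state and the mock call
  bool launch(mock_runtime& rt, kernel_t k, std::size_t smem) {
    std::lock_guard<std::mutex> lock(mtx);
    const bool grew = smem > high_water;
    if (grew) { high_water = smem; }
    if (grew || k != current) {
      rt.set_max_dynamic_smem(k, high_water);  // bring this pointer up to the mark
      current = k;
    }
    return rt.launch(k, smem);
  }
};

// Replays the failing pattern: a large request on A, then a smaller one on B.
inline void demo() {
  mock_runtime rt_old, rt_new;
  buggy_launcher old_impl;
  fixed_launcher new_impl;
  assert(old_impl.launch(rt_old, kernel_a, 64 * 1024));
  assert(!old_impl.launch(rt_old, kernel_b, 56 * 1024));  // B was never configured
  assert(new_impl.launch(rt_new, kernel_a, 64 * 1024));
  assert(new_impl.launch(rt_new, kernel_b, 56 * 1024));   // B is raised to the mark
}
```

In the old scheme the 56 KiB request on `kernel_b` is below the tracked 64 KiB maximum, so the attribute is never set for that pointer and the launch is rejected. In the fixed scheme the pointer change alone triggers the attribute call.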

## Error in question
```text
$ CUVS_CAGRA_ANN_BENCH --search --data_prefix='<DATA_DIR>/' --benchmark_out_format=csv --benchmark_out=res_search_iter_cagra.csv --benchmark_counters_tabular=true --override_kv=dataset_memory_type:\"device\" <CONFIG_DIR>/laion_1M_cagra_iterative.json
[I] [12:28:52.095261] Using the query file '<DATA_DIR>/laion_1M/queries.fbin'
[I] [12:28:52.096141] Using the ground truth file '<DATA_DIR>/laion_1M/groundtruth.1M.neighbors.ibin'
2026-02-25T12:28:52+00:00
Running CUVS_CAGRA_ANN_BENCH
Run on (224 X 800 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x112)
  L1 Instruction 32 KiB (x112)
  L2 Unified 2048 KiB (x112)
  L3 Unified 307200 KiB (x2)
Load Average: 0.70, 0.44, 0.28
dataset: laion_1M
dim: 768
distance: euclidean
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations        GPU    Latency     Recall end_to_end items_per_second      itopk          k max_iterations  n_queries refine_ratio search_width total_queries
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
cuvs_cagra_iterative/0/process_time/real_time        5.70 ms         5.70 ms          121   5.68808m   5.69994m    0.96424   0.689692       1.75441M/s         64         10              8        10k            1            2         1.21M dataset_memory_type="device"
cuvs_cagra_iterative/1/process_time/real_time        5.70 ms         5.70 ms          121    5.6863m   5.69879m    0.96424   0.689553       1.75477M/s         64         10              8        10k            1            2         1.21M dataset_memory_type="device"
cuvs_cagra_iterative/2/process_time/real_time        4.92 ms         4.92 ms          140   4.90351m   4.91567m    0.96046   0.688193       2.03432M/s        128         10             12        10k            1            1          1.4M dataset_memory_type="device"
cuvs_cagra_iterative/3/process_time/real_time        5.99 ms         5.99 ms          115   5.97476m   5.98617m    0.97519   0.688409       1.67052M/s        128         10             16        10k            1            1         1.15M dataset_memory_type="device"
cuvs_cagra_iterative/4/process_time/real_time        6.97 ms         6.97 ms           99   6.95873m    6.9703m    0.98129   0.690059       1.43466M/s        256         10             16        10k            1            1          990k dataset_memory_type="device"
cuvs_cagra_iterative/5/process_time/real_time        10.5 ms         10.5 ms           66   0.010479  0.0104908    0.98548   0.692391       953.222k/s        512         10             10        10k            1            2          660k dataset_memory_type="device"
-----------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations
-----------------------------------------------------------------------------------------
cuvs_cagra_iterative/6/process_time/real_time  ERROR OCCURRED: 'Benchmark loop: CUDA error encountered at: file=cpp/src/neighbors/detail/cagra/search_single_cta_kernel-inl.cuh line=2348: call='cudaPeekAtLastError()', Reason=cudaErrorInvalidValue:invalid argument
Obtained 19 stack frames
#1 in CUVS_CAGRA_ANN_BENCH: raft::cuda_error::cuda_error(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
#2 in libcuvs.so: void cuvs::neighbors::cagra::detail::single_cta_search::select_and_run<float, unsigned int, float, unsigned int, cuvs::neighbors::filtering::none_sample_filter>(...)
#3 in libcuvs.so: cuvs::neighbors::cagra::detail::single_cta_search::search<float, unsigned int, float, cuvs::neighbors::filtering::none_sample_filter, unsigned int, long>::operator()(...)
#4 in libcuvs.so(+0x18fd0f1)
#5 in libcuvs.so: void cuvs::neighbors::cagra::search<float, unsigned int, long>(...)
#6-#19 in CUVS_CAGRA_ANN_BENCH / libc.so.6
'
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations        GPU    Latency     Recall end_to_end items_per_second      itopk          k max_iterations  n_queries refine_ratio search_width total_queries
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
cuvs_cagra_iterative/7/process_time/real_time        10.5 ms         10.5 ms           66  0.0105088  0.0105202    0.98663   0.694332       950.555k/s         32         10             32        10k            1            1          660k dataset_memory_type="device"
cuvs_cagra_iterative/8/process_time/real_time        12.8 ms         12.8 ms           54   0.012796  0.0128079    0.98807   0.691628       780.768k/s         32         10             64        10k            1            1          540k dataset_memory_type="device"
-----------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations
-----------------------------------------------------------------------------------------
cuvs_cagra_iterative/9/process_time/real_time  ERROR OCCURRED: 'Benchmark loop: CUDA error encountered at: file=cpp/src/neighbors/detail/cagra/search_single_cta_kernel-inl.cuh line=2348: call='cudaPeekAtLastError()', Reason=cudaErrorInvalidValue:invalid argument
[same stack trace as above]
'
cuvs_cagra_iterative/10/process_time/real_time ERROR OCCURRED: 'Benchmark loop: CUDA error encountered at: file=cpp/src/neighbors/detail/cagra/search_single_cta_kernel-inl.cuh line=2348: call='cudaPeekAtLastError()', Reason=cudaErrorInvalidValue:invalid argument
[same stack trace as above]
'
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations        GPU    Latency     Recall end_to_end items_per_second      itopk          k max_iterations  n_queries refine_ratio search_width total_queries
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
cuvs_cagra_iterative/11/process_time/real_time       46.1 ms         46.2 ms           15  0.0461323  0.0461439    0.99131   0.692158       216.714k/s        256         10             10        10k            1           16          150k dataset_memory_type="device"
cuvs_cagra_iterative/12/process_time/real_time        142 ms          142 ms            5   0.141713   0.141725    0.99198   0.708627       70.5591k/s        512         10             32        10k            1           16           50k dataset_memory_type="device"
``` 

## Config
```json
{
  "dataset": {
    "name": "laion_1M",
    "base_file": "laion_1M/base.1M.fbin",
    "subset_size": 1000000,
    "query_file": "laion_1M/queries.fbin",
    "groundtruth_neighbors_file": "laion_1M/groundtruth.1M.neighbors.ibin",
    "distance": "euclidean"
  },
  "search_basic_param": {
    "batch_size": 10000,
    "k": 10
  },
  "index": [
  
    {
      "name": "cuvs_cagra_iterative",
      "algo": "cuvs_cagra",
      "build_param": { 
        "graph_degree": 64,
        "intermediate_graph_degree": 128,
        "search_width": 1
      },
      "file": "laion_1M/cagra/q_coarse_iterative.ibin",
      "search_params": [
        {"itopk": 64, "search_width": 2, "max_iterations": 8, "refine_ratio": 1},
        {"itopk": 64, "search_width": 2, "max_iterations": 8, "refine_ratio": 1},
        {"itopk": 128, "search_width": 1, "max_iterations": 12, "refine_ratio": 1},
        {"itopk": 128, "search_width": 1, "max_iterations": 16, "refine_ratio": 1},
        {"itopk": 256, "search_width": 1, "max_iterations": 16, "refine_ratio": 1},
        {"itopk": 512, "search_width": 2, "max_iterations": 10, "refine_ratio": 1},
        {"itopk": 256, "search_width": 2, "max_iterations": 12, "refine_ratio": 1},
        {"itopk": 32, "search_width": 1, "max_iterations": 32, "refine_ratio": 1},
        {"itopk": 32, "search_width": 1, "max_iterations": 64, "refine_ratio": 1},
        {"itopk": 192, "search_width": 4, "max_iterations": 12, "refine_ratio": 1},
        {"itopk": 256, "search_width": 4, "max_iterations": 12, "refine_ratio": 1},
        {"itopk": 256, "search_width": 16, "max_iterations": 10, "refine_ratio": 1},
        {"itopk": 512, "search_width": 16, "max_iterations": 32, "refine_ratio": 1}
      ]
    }
  ]
}

```

Authors:
  - https://github.com/irina-resh-nvda

Approvers:
  - Artem M. Chirkin (https://github.com/achirkin)

URL: rapidsai#1851
Labels: bug (Something isn't working), non-breaking (Introduces a non-breaking change)

Projects

Status: Done

5 participants