Skip to content

Fix race in graph capture when set_simplex_solution() is called#1017

Merged
rapids-bot[bot] merged 5 commits intoNVIDIA:release/26.04from
aliceb-nv:fix-set-simplex-solution-race
Apr 2, 2026
Merged

Fix race in graph capture when set_simplex_solution() is called#1017
rapids-bot[bot] merged 5 commits intoNVIDIA:release/26.04from
aliceb-nv:fix-set-simplex-solution-race

Conversation

@aliceb-nv
Copy link
Copy Markdown
Contributor

@aliceb-nv aliceb-nv commented Apr 1, 2026

set_simplex_solution (called when the B&B thread finishes solving the root via dual simplex) in diversity_manager_t triggers a stream synchronization, which could conflict with PDLP graph capture during the GPU side root solve. This was causing crashes in CI when the race was triggered. This PR fixes this issue by pushing the dual simplex solution to a staging buffer than is only read when diversity_manager_t finishes the PDLP solve.

Another bug was caused by barrier's raft::handle_t destructor causing a device-wide sync, which could race with PDLP's graph capture. The handle destruction is now deferred after the concurrent solve threads are joined.

Python tests were also updated to accept incumbents with a nonfinite solution bound, which can occur when early heuristics find a solution before the root solve even begins.

Closes #967
Closes #919

Description

Issue

Checklist

  • I am familiar with the Contributing Guidelines.
  • Testing
    • New or existing tests cover these changes
    • Added tests
    • Created an issue to follow-up
    • NA
  • Documentation
    • The documentation is up to date with these changes
    • Added new documentation
    • NA

@aliceb-nv aliceb-nv requested a review from a team as a code owner April 1, 2026 13:58
@aliceb-nv aliceb-nv added bug Something isn't working non-breaking Introduces a non-breaking change labels Apr 1, 2026
@aliceb-nv aliceb-nv requested review from hlinsen and nguidotti April 1, 2026 13:58
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot bot commented Apr 1, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@aliceb-nv
Copy link
Copy Markdown
Contributor Author

/ok to test b67b0c3

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 1, 2026

📝 Walkthrough

Walkthrough

Adds CPU-side staging for simplex results and a new consume_staged_simplex_solution(lp_state_t<...>&) method; set_simplex_solution(...) now stages data. run_solver() is restructured to optionally consume staged simplex solutions. Separately, PDLP barrier handle lifetime is refactored; docs/tests updated to allow incumbent bound == None.

Changes

Cohort / File(s) Summary
Diversity manager header
cpp/src/mip_heuristics/diversity/diversity_manager.cuh
Declared consume_staged_simplex_solution(lp_state_t<i_t,f_t>&) and added CPU staging members: staged_simplex_solution, staged_simplex_dual_solution, staged_simplex_objective.
Diversity manager implementation
cpp/src/mip_heuristics/diversity/diversity_manager.cu
Added consume_staged_simplex_solution() to atomically read staged CPU data, validate sizes/objective, copy/clamp into device lp_optimal_solution/lp_dual_optimal_solution, update lp_state.prev_primal, sync streams, and set user bound. Modified set_simplex_solution() to stage data only. Restructured run_solver() to use a use_staged_simplex_solution flag and adjust where bounds/termination handling and PDLP relaxed-solution callback are applied.
PDLP concurrency
cpp/src/pdlp/solve.cu
Refactored barrier worker handle lifetime: create std::unique_ptr<raft::handle_t> on main thread, pass raw pointer to barrier thread, and destroy handle on main thread after join() to ensure destructor runs on main thread.
Documentation
docs/cuopt/source/cuopt-server/examples/milp-examples.rst, docs/.../incumbent_callback_example.py
Clarified incumbent callback docs: bound may be None when a finite global bound is not yet available.
Python tests & webserver docs
python/cuopt_server/.../tests/test_incumbents.py, python/cuopt_server/cuopt_server/webserver.py
Test updated to reuse response payload and accept bound being None. OpenAPI/docstring text updated to indicate bound can be float or None. No runtime/serialization changes.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 22.22% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Fix race in graph capture when set_simplex_solution() is called' clearly and specifically describes the main purpose of the changeset: addressing a race condition related to graph capture and the set_simplex_solution() method, which aligns with the core changes in diversity_manager.cu/cuh and pdlp/solve.cu that restructure simplex solution handling and barrier thread synchronization.
Description check ✅ Passed The PR description adequately explains the race conditions being fixed: set_simplex_solution() stream synchronization conflicting with PDLP graph capture, and barrier handle destruction causing device-wide sync conflicts. It also describes the solutions and mentions Python test updates.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
cpp/src/mip_heuristics/diversity/diversity_manager.cu (2)

111-151: Consider runtime validation for size mismatches in release builds.

The assertions at lines 126-129 validate critical size invariants (assignment.size() vs staged_simplex_solution_local.size(), n_constraints vs staged_simplex_dual_solution_local.size()). If these assertions are compiled out in release builds and a size mismatch occurs, subsequent raft::copy operations could cause buffer overruns or undefined behavior.

Consider adding explicit checks that throw or return early in release builds for these invariants:

🛡️ Suggested defensive check
  solution_t<i_t, f_t> new_sol(*problem_ptr);
- cuopt_assert(new_sol.assignment.size() == staged_simplex_solution_local.size(),
-              "Assignment size mismatch");
- cuopt_assert(problem_ptr->n_constraints == staged_simplex_dual_solution_local.size(),
-              "Dual assignment size mismatch");
+ if (new_sol.assignment.size() != staged_simplex_solution_local.size() ||
+     problem_ptr->n_constraints != staged_simplex_dual_solution_local.size()) {
+   CUOPT_LOG_ERROR("Staged simplex solution size mismatch: primal %zu vs %zu, dual %zu vs %zu",
+                   new_sol.assignment.size(), staged_simplex_solution_local.size(),
+                   static_cast<size_t>(problem_ptr->n_constraints),
+                   staged_simplex_dual_solution_local.size());
+   return;  // Or throw an appropriate exception
+ }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/mip_heuristics/diversity/diversity_manager.cu` around lines 111 -
151, The assertions in
diversity_manager_t<i_t,f_t>::consume_staged_simplex_solution enforce critical
size invariants but can be compiled out; add explicit runtime checks before
proceeding: verify new_sol.assignment.size() ==
staged_simplex_solution_local.size() and problem_ptr->n_constraints ==
staged_simplex_dual_solution_local.size(), and if either fails log an error (or
use process logger), set an appropriate error state, and return/throw to avoid
the subsequent raft::copy calls; ensure the checks appear before the raft::copy
to lp_optimal_solution and before any GPU copies so buffer overruns cannot
occur.

904-918: Good use of memory ordering; same assertion concern for size validation.

The std::memory_order_release at line 916 correctly ensures the staged data writes are visible before the flag is observed as true by consuming threads.

However, similar to consume_staged_simplex_solution, the size validations at lines 911-912 use assertions that may be compiled out in release builds. If the caller passes mismatched sizes, the staging vectors would be assigned incorrect data silently.

🛡️ Suggested defensive check
  std::lock_guard<std::mutex> lock(relaxed_solution_mutex);
  global_concurrent_halt = 1;
- cuopt_assert(lp_optimal_solution.size() == solution.size(), "Assignment size mismatch");
- cuopt_assert(problem_ptr->n_constraints == dual_solution.size(), "Dual assignment size mismatch");
+ if (lp_optimal_solution.size() != solution.size() ||
+     static_cast<size_t>(problem_ptr->n_constraints) != dual_solution.size()) {
+   CUOPT_LOG_ERROR("set_simplex_solution size mismatch: primal %zu vs %zu, dual %zu vs %zu",
+                   lp_optimal_solution.size(), solution.size(),
+                   static_cast<size_t>(problem_ptr->n_constraints), dual_solution.size());
+   return;  // Or throw an appropriate exception
+ }
  staged_simplex_solution      = solution;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/mip_heuristics/diversity/diversity_manager.cu` around lines 904 -
918, The size checks in set_simplex_solution use cuopt_assert which can be
compiled out; change them to runtime defensive checks that validate
lp_optimal_solution.size() == solution.size() and problem_ptr->n_constraints ==
dual_solution.size() before assigning staged_simplex_solution and
staged_simplex_dual_solution; on mismatch, log an error with CUOPT_LOG_ERROR
(including expected/actual sizes) and return (or throw a std::runtime_error) to
avoid writing corrupt staged data, while leaving the
simplex_solution_exists.store(..., std::memory_order_release) behavior unchanged
so the flag is only set after successful validation and staging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@cpp/src/mip_heuristics/diversity/diversity_manager.cu`:
- Around line 111-151: The assertions in
diversity_manager_t<i_t,f_t>::consume_staged_simplex_solution enforce critical
size invariants but can be compiled out; add explicit runtime checks before
proceeding: verify new_sol.assignment.size() ==
staged_simplex_solution_local.size() and problem_ptr->n_constraints ==
staged_simplex_dual_solution_local.size(), and if either fails log an error (or
use process logger), set an appropriate error state, and return/throw to avoid
the subsequent raft::copy calls; ensure the checks appear before the raft::copy
to lp_optimal_solution and before any GPU copies so buffer overruns cannot
occur.
- Around line 904-918: The size checks in set_simplex_solution use cuopt_assert
which can be compiled out; change them to runtime defensive checks that validate
lp_optimal_solution.size() == solution.size() and problem_ptr->n_constraints ==
dual_solution.size() before assigning staged_simplex_solution and
staged_simplex_dual_solution; on mismatch, log an error with CUOPT_LOG_ERROR
(including expected/actual sizes) and return (or throw a std::runtime_error) to
avoid writing corrupt staged data, while leaving the
simplex_solution_exists.store(..., std::memory_order_release) behavior unchanged
so the flag is only set after successful validation and staging.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4c35ce6e-fd96-40eb-a6ee-baa9b50116eb

📥 Commits

Reviewing files that changed from the base of the PR and between e91c66f and b67b0c3.

📒 Files selected for processing (2)
  • cpp/src/mip_heuristics/diversity/diversity_manager.cu
  • cpp/src/mip_heuristics/diversity/diversity_manager.cuh

@aliceb-nv aliceb-nv requested a review from a team as a code owner April 1, 2026 15:57
@aliceb-nv aliceb-nv requested a review from tmckayus April 1, 2026 15:57
@aliceb-nv
Copy link
Copy Markdown
Contributor Author

/ok to test 9099a1b

@aliceb-nv
Copy link
Copy Markdown
Contributor Author

/ok to test 25c9c58

@aliceb-nv
Copy link
Copy Markdown
Contributor Author

/ok to test bb4755e

@aliceb-nv aliceb-nv force-pushed the fix-set-simplex-solution-race branch from bb4755e to 19c4b7c Compare April 2, 2026 10:53
@aliceb-nv
Copy link
Copy Markdown
Contributor Author

/ok to test 19c4b7c

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
cpp/src/pdlp/solve.cu (1)

1188-1194: Pre-existing issue: Exception path doesn't join threads before rethrowing.

While not introduced by this PR, the exception handling at lines 1190-1194 rethrows immediately without joining barrier_thread or dual_simplex_thread. The comment at lines 1183-1184 correctly notes that "destroying a joinable std::thread calls std::terminate()," but the code doesn't follow this guidance. Consider wrapping the joins in a scope guard or RAII helper to ensure thread cleanup on all exit paths.

♻️ Suggested approach using a scope guard pattern
+  // RAII guard to ensure threads are joined on all exit paths
+  auto join_threads = [&]() {
+    if (!settings.inside_mip && dual_simplex_thread.joinable()) { 
+      dual_simplex_thread.join(); 
+    }
+    if (barrier_thread.joinable()) { 
+      barrier_thread.join(); 
+    }
+  };
+  
   std::exception_ptr pdlp_exception;
   optimization_problem_solution_t<i_t, f_t> sol_pdlp{pdlp_termination_status_t::NumericalError,
                                                      problem.handle_ptr->get_stream()};
   try {
     sol_pdlp = run_pdlp(problem, settings_pdlp, timer, is_batch_mode);
   } catch (...) {
     pdlp_exception                 = std::current_exception();
     *settings_pdlp.concurrent_halt = 1;
+    join_threads();
+    barrier_handle_ptr.reset();
     std::rethrow_exception(pdlp_exception);
   }
 
   // Wait for dual simplex thread to finish
-  if (!settings.inside_mip) { dual_simplex_thread.join(); }
-
-  barrier_thread.join();
+  join_threads();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/pdlp/solve.cu` around lines 1188 - 1194, The exception handler in the
run_pdlp invocation rethrows without joining barrier_thread and
dual_simplex_thread, risking std::terminate; update the catch(...) block (or
better: introduce an RAII scope guard like ThreadJoiner) to ensure both
barrier_thread and dual_simplex_thread are joined if joinable before setting
*settings_pdlp.concurrent_halt and rethrowing the stored pdlp_exception.
Specifically, either add join-if-joinable calls for barrier_thread and
dual_simplex_thread inside the catch(...) (suppress/handle any join exceptions)
or replace the local threads with a small helper class whose destructor joins
them, and then rethrow via std::rethrow_exception(pdlp_exception).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@cpp/src/pdlp/solve.cu`:
- Around line 1188-1194: The exception handler in the run_pdlp invocation
rethrows without joining barrier_thread and dual_simplex_thread, risking
std::terminate; update the catch(...) block (or better: introduce an RAII scope
guard like ThreadJoiner) to ensure both barrier_thread and dual_simplex_thread
are joined if joinable before setting *settings_pdlp.concurrent_halt and
rethrowing the stored pdlp_exception. Specifically, either add join-if-joinable
calls for barrier_thread and dual_simplex_thread inside the catch(...)
(suppress/handle any join exceptions) or replace the local threads with a small
helper class whose destructor joins them, and then rethrow via
std::rethrow_exception(pdlp_exception).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 86c9bb69-45b6-4fd8-b07e-e8e4fa2ff95f

📥 Commits

Reviewing files that changed from the base of the PR and between 9099a1b and 19c4b7c.

📒 Files selected for processing (1)
  • cpp/src/pdlp/solve.cu

@aliceb-nv aliceb-nv changed the title [DO NOT MERGE, TESTING] Fix race in graph capture when set_simplex_solution() is called Fix race in graph capture when set_simplex_solution() is called Apr 2, 2026
@aliceb-nv aliceb-nv modified the milestone: 26.04 Apr 2, 2026
Comment thread cpp/src/pdlp/solve.cu
barrier_thread.join();
// At this point, it is safe to destroy the barrier context since we're outside of any PDLP graph
// capture.
barrier_handle_ptr.reset();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should do the trick yes, nice one!

Copy link
Copy Markdown
Contributor

@Kh4ster Kh4ster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the PDLP / IPM interaction and issue while stream capture this PR should fix the issue. Good catch @aliceb-nv !

@aliceb-nv
Copy link
Copy Markdown
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit e1469d5 into NVIDIA:release/26.04 Apr 2, 2026
206 of 209 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working non-breaking Introduces a non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants