Fix race in graph capture when set_simplex_solution() is called#1017
Conversation
|
/ok to test b67b0c3 |
📝 WalkthroughWalkthroughAdds CPU-side staging for simplex results and a new consume_staged_simplex_solution(lp_state_t<...>&) method; set_simplex_solution(...) now stages data. run_solver() is restructured to optionally consume staged simplex solutions. Separately, PDLP barrier handle lifetime is refactored; docs/tests updated to allow incumbent bound == None. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
cpp/src/mip_heuristics/diversity/diversity_manager.cu (2)
111-151: Consider runtime validation for size mismatches in release builds.The assertions at lines 126-129 validate critical size invariants (
assignment.size()vsstaged_simplex_solution_local.size(),n_constraintsvsstaged_simplex_dual_solution_local.size()). If these assertions are compiled out in release builds and a size mismatch occurs, subsequentraft::copyoperations could cause buffer overruns or undefined behavior.Consider adding explicit checks that throw or return early in release builds for these invariants:
🛡️ Suggested defensive check
solution_t<i_t, f_t> new_sol(*problem_ptr); - cuopt_assert(new_sol.assignment.size() == staged_simplex_solution_local.size(), - "Assignment size mismatch"); - cuopt_assert(problem_ptr->n_constraints == staged_simplex_dual_solution_local.size(), - "Dual assignment size mismatch"); + if (new_sol.assignment.size() != staged_simplex_solution_local.size() || + problem_ptr->n_constraints != staged_simplex_dual_solution_local.size()) { + CUOPT_LOG_ERROR("Staged simplex solution size mismatch: primal %zu vs %zu, dual %zu vs %zu", + new_sol.assignment.size(), staged_simplex_solution_local.size(), + static_cast<size_t>(problem_ptr->n_constraints), + staged_simplex_dual_solution_local.size()); + return; // Or throw an appropriate exception + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/mip_heuristics/diversity/diversity_manager.cu` around lines 111 - 151, The assertions in diversity_manager_t<i_t,f_t>::consume_staged_simplex_solution enforce critical size invariants but can be compiled out; add explicit runtime checks before proceeding: verify new_sol.assignment.size() == staged_simplex_solution_local.size() and problem_ptr->n_constraints == staged_simplex_dual_solution_local.size(), and if either fails log an error (or use process logger), set an appropriate error state, and return/throw to avoid the subsequent raft::copy calls; ensure the checks appear before the raft::copy to lp_optimal_solution and before any GPU copies so buffer overruns cannot occur.
904-918: Good use of memory ordering; same assertion concern for size validation.The
std::memory_order_releaseat line 916 correctly ensures the staged data writes are visible before the flag is observed as true by consuming threads.However, similar to
consume_staged_simplex_solution, the size validations at lines 911-912 use assertions that may be compiled out in release builds. If the caller passes mismatched sizes, the staging vectors would be assigned incorrect data silently.🛡️ Suggested defensive check
std::lock_guard<std::mutex> lock(relaxed_solution_mutex); global_concurrent_halt = 1; - cuopt_assert(lp_optimal_solution.size() == solution.size(), "Assignment size mismatch"); - cuopt_assert(problem_ptr->n_constraints == dual_solution.size(), "Dual assignment size mismatch"); + if (lp_optimal_solution.size() != solution.size() || + static_cast<size_t>(problem_ptr->n_constraints) != dual_solution.size()) { + CUOPT_LOG_ERROR("set_simplex_solution size mismatch: primal %zu vs %zu, dual %zu vs %zu", + lp_optimal_solution.size(), solution.size(), + static_cast<size_t>(problem_ptr->n_constraints), dual_solution.size()); + return; // Or throw an appropriate exception + } staged_simplex_solution = solution;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/mip_heuristics/diversity/diversity_manager.cu` around lines 904 - 918, The size checks in set_simplex_solution use cuopt_assert which can be compiled out; change them to runtime defensive checks that validate lp_optimal_solution.size() == solution.size() and problem_ptr->n_constraints == dual_solution.size() before assigning staged_simplex_solution and staged_simplex_dual_solution; on mismatch, log an error with CUOPT_LOG_ERROR (including expected/actual sizes) and return (or throw a std::runtime_error) to avoid writing corrupt staged data, while leaving the simplex_solution_exists.store(..., std::memory_order_release) behavior unchanged so the flag is only set after successful validation and staging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@cpp/src/mip_heuristics/diversity/diversity_manager.cu`:
- Around line 111-151: The assertions in
diversity_manager_t<i_t,f_t>::consume_staged_simplex_solution enforce critical
size invariants but can be compiled out; add explicit runtime checks before
proceeding: verify new_sol.assignment.size() ==
staged_simplex_solution_local.size() and problem_ptr->n_constraints ==
staged_simplex_dual_solution_local.size(), and if either fails log an error (or
use process logger), set an appropriate error state, and return/throw to avoid
the subsequent raft::copy calls; ensure the checks appear before the raft::copy
to lp_optimal_solution and before any GPU copies so buffer overruns cannot
occur.
- Around line 904-918: The size checks in set_simplex_solution use cuopt_assert
which can be compiled out; change them to runtime defensive checks that validate
lp_optimal_solution.size() == solution.size() and problem_ptr->n_constraints ==
dual_solution.size() before assigning staged_simplex_solution and
staged_simplex_dual_solution; on mismatch, log an error with CUOPT_LOG_ERROR
(including expected/actual sizes) and return (or throw a std::runtime_error) to
avoid writing corrupt staged data, while leaving the
simplex_solution_exists.store(..., std::memory_order_release) behavior unchanged
so the flag is only set after successful validation and staging.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 4c35ce6e-fd96-40eb-a6ee-baa9b50116eb
📒 Files selected for processing (2)
cpp/src/mip_heuristics/diversity/diversity_manager.cucpp/src/mip_heuristics/diversity/diversity_manager.cuh
|
/ok to test 9099a1b |
|
/ok to test 25c9c58 |
|
/ok to test bb4755e |
bb4755e to
19c4b7c
Compare
|
/ok to test 19c4b7c |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
cpp/src/pdlp/solve.cu (1)
1188-1194: Pre-existing issue: Exception path doesn't join threads before rethrowing.While not introduced by this PR, the exception handling at lines 1190-1194 rethrows immediately without joining
barrier_threadordual_simplex_thread. The comment at lines 1183-1184 correctly notes that "destroying a joinable std::thread calls std::terminate()," but the code doesn't follow this guidance. Consider wrapping the joins in a scope guard or RAII helper to ensure thread cleanup on all exit paths.♻️ Suggested approach using a scope guard pattern
+ // RAII guard to ensure threads are joined on all exit paths + auto join_threads = [&]() { + if (!settings.inside_mip && dual_simplex_thread.joinable()) { + dual_simplex_thread.join(); + } + if (barrier_thread.joinable()) { + barrier_thread.join(); + } + }; + std::exception_ptr pdlp_exception; optimization_problem_solution_t<i_t, f_t> sol_pdlp{pdlp_termination_status_t::NumericalError, problem.handle_ptr->get_stream()}; try { sol_pdlp = run_pdlp(problem, settings_pdlp, timer, is_batch_mode); } catch (...) { pdlp_exception = std::current_exception(); *settings_pdlp.concurrent_halt = 1; + join_threads(); + barrier_handle_ptr.reset(); std::rethrow_exception(pdlp_exception); } // Wait for dual simplex thread to finish - if (!settings.inside_mip) { dual_simplex_thread.join(); } - - barrier_thread.join(); + join_threads();🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/pdlp/solve.cu` around lines 1188 - 1194, The exception handler in the run_pdlp invocation rethrows without joining barrier_thread and dual_simplex_thread, risking std::terminate; update the catch(...) block (or better: introduce an RAII scope guard like ThreadJoiner) to ensure both barrier_thread and dual_simplex_thread are joined if joinable before setting *settings_pdlp.concurrent_halt and rethrowing the stored pdlp_exception. Specifically, either add join-if-joinable calls for barrier_thread and dual_simplex_thread inside the catch(...) (suppress/handle any join exceptions) or replace the local threads with a small helper class whose destructor joins them, and then rethrow via std::rethrow_exception(pdlp_exception).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@cpp/src/pdlp/solve.cu`:
- Around line 1188-1194: The exception handler in the run_pdlp invocation
rethrows without joining barrier_thread and dual_simplex_thread, risking
std::terminate; update the catch(...) block (or better: introduce an RAII scope
guard like ThreadJoiner) to ensure both barrier_thread and dual_simplex_thread
are joined if joinable before setting *settings_pdlp.concurrent_halt and
rethrowing the stored pdlp_exception. Specifically, either add join-if-joinable
calls for barrier_thread and dual_simplex_thread inside the catch(...)
(suppress/handle any join exceptions) or replace the local threads with a small
helper class whose destructor joins them, and then rethrow via
std::rethrow_exception(pdlp_exception).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 86c9bb69-45b6-4fd8-b07e-e8e4fa2ff95f
📒 Files selected for processing (1)
cpp/src/pdlp/solve.cu
| barrier_thread.join(); | ||
| // At this point, it is safe to destroy the barrier context since we're outside of any PDLP graph | ||
| // capture. | ||
| barrier_handle_ptr.reset(); |
There was a problem hiding this comment.
That should do the trick yes, nice one!
Kh4ster
left a comment
There was a problem hiding this comment.
Regarding the PDLP / IPM interaction and issue while stream capture this PR should fix the issue. Good catch @aliceb-nv !
|
/merge |
set_simplex_solution(called when the B&B thread finishes solving the root via dual simplex) indiversity_manager_ttriggers a stream synchronization, which could conflict with PDLP graph capture during the GPU side root solve. This was causing crashes in CI when the race was triggered. This PR fixes this issue by pushing the dual simplex solution to a staging buffer than is only read whendiversity_manager_tfinishes the PDLP solve.Another bug was caused by barrier's raft::handle_t destructor causing a device-wide sync, which could race with PDLP's graph capture. The handle destruction is now deferred after the concurrent solve threads are joined.
Python tests were also updated to accept incumbents with a nonfinite solution bound, which can occur when early heuristics find a solution before the root solve even begins.
Closes #967
Closes #919
Description
Issue
Checklist