Fix batch pdlp and python api support #893
Conversation
No actionable comments were generated in the recent review. 🎉

📝 Walkthrough

Replaces integer presolve with a string-based presolver option; adds ping-pong CUDA graph implementation; migrates many thrust reductions to cub::DeviceReduce; tightens numeric NaN/Inf/runtime checks; consolidates stream handling to raw CUDA streams; exposes a Python setting and tests for batch PDLP strong branching.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 6
🧹 Nitpick comments (1)
python/cuopt_server/cuopt_server/utils/linear_programming/data_definition.py (1)
453-457: Prefer `Optional[bool]` for consistency with other binary flags in `SolverConfig`.

The description documents strictly 0 (disable) / 1 (enable) semantics, which is identical to the `Optional[bool]` fields already in this class (`mip_scaling`, `mip_heuristics_only`, `crossover`, `eliminate_dense_columns`, etc.). Using `Optional[int]` instead silently accepts arbitrary integers (e.g. `2`, `-1`) and diverges from the established convention.

♻️ Proposed refactor

```diff
-    mip_batch_pdlp_strong_branching: Optional[int] = Field(
-        default=0,
+    mip_batch_pdlp_strong_branching: Optional[bool] = Field(
+        default=False,
         description="Set 1 to enable batch PDLP strong branching "
         "in the MIP solver, 0 to disable.",
     )
```

If the backing C++ parameter expects an integer, the conversion layer can cast `bool → int` at call time, keeping the Python API type-safe without any burden on the caller.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuopt_server/cuopt_server/utils/linear_programming/data_definition.py` around lines 453 - 457, The field mip_batch_pdlp_strong_branching is typed as Optional[int] but should follow the class convention of binary flags by using Optional[bool]; change its annotation to Optional[bool] and update the Field default from 0 to False (or None if other similar flags use None) so callers get a boolean API, and ensure the call site that passes this to the C++ layer casts the bool to int (bool → int) when invoking the underlying solver; locate the field definition on the SolverConfig (mip_batch_pdlp_strong_branching) and make this replacement to keep behavior consistent with mip_scaling, mip_heuristics_only, crossover, eliminate_dense_columns, etc.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/src/pdlp/pdlp.cu`:
- Around line 1429-1447: The current strict cuopt_assert on (movement +
computed_interaction) can fail due to tiny negative rounding errors; change the
check to compute f_t sum = movement + computed_interaction, use a small
tolerance epsilon (e.g., f_t(-1e-12) or similar relative epsilon), clamp sum to
f_t(0.0) if it is negative but within the epsilon, and only call
cuopt_assert(sum >= f_t(0.0)) for values below that tolerance; then assign
*fixed_point_error = cuda::std::sqrt(sum). Update the code around the symbols
movement, computed_interaction, fixed_point_error, and the existing cuopt_assert
to implement this tolerant/clamp behavior.
- Around line 1191-1226: The CUB reductions in compute_stats ignore returned
cudaError_t and will misbehave for empty inputs; wrap every
cub::DeviceReduce::Reduce call with RAFT_CUDA_TRY and add an early empty-vector
guard (check vec.size() == 0) before creating rmm::device_scalar<> objects and
invoking reductions; for the empty case set smallest, largest, and avg to
sensible defaults (or return early) to avoid running transforms (min_nonzero,
abs_iter) and to prevent the avg = d_sum.value(...) / vec.size() divide-by-zero
using the d_smallest, d_largest, d_sum, min_nonzero, abs_iter, and
cub::DeviceReduce::Reduce symbols to locate the changes.
In `@cpp/src/pdlp/termination_strategy/convergence_information.cu`:
- Around line 402-419: The CUB reduction calls using cub::DeviceReduce::Max (the
two invocations that pass transform_iter, linf_primal_residual_.data(),
primal_residual_.size(), stream_view_ and the analogous dual-residual calls
later) are not being checked for CUDA errors; wrap each cub::DeviceReduce::Max
call (both the temp storage query and the actual reduction) with RAFT_CUDA_TRY
so returned RAFT/CUDA errors are propagated consistently (apply the same
RAFT_CUDA_TRY wrapping to the dual residual block around the calls that use
sibling variables like dual_transform_iter, linf_dual_residual_, dual_residual_
and their temp_buf/stream_view_).
In `@cpp/src/pdlp/utilities/ping_pong_graph.cu`:
- Around line 56-118: The code currently uses RAFT_CUDA_TRY_NO_THROW but updates
flags unconditionally; change start_capture, end_capture, launch, and
is_initialized so that you capture return statuses from the CUDA calls
(cudaStreamBeginCapture, cudaStreamEndCapture, cudaGraphInstantiate,
cudaGraphDestroy, etc.) instead of blindly relying on RAFT_CUDA_TRY_NO_THROW,
and only set capture_even_active_/capture_odd_active_ and
even_initialized/odd_initialized when the respective CUDA call succeeds; on
failure, roll back any partial state (e.g., reset capture_*_active_, destroy any
created graph if instantiate failed, and ensure even_instance/odd_instance
remain unset) so launch() and is_initialized() only report graph mode when
even_instance/odd_instance are valid; locate changes in start_capture,
end_capture, launch, and is_initialized and use the symbols
RAFT_CUDA_TRY_NO_THROW, capture_even_active_, capture_odd_active_,
even_initialized, odd_initialized, even_instance, odd_instance, even_graph, and
odd_graph to implement the gating and cleanup.
- Around line 24-51: The cancel_active_capture() destructor-invoked cleanup
currently uses RAFT_CUDA_TRY (which can throw) for cudaStreamEndCapture; replace
those RAFT_CUDA_TRY calls with RAFT_CUDA_TRY_NO_THROW so exceptions are not
thrown from cancel_active_capture() (which may be called from the noexcept
destructor); ensure the two calls referencing even_graph and odd_graph use
RAFT_CUDA_TRY_NO_THROW and keep subsequent
RAFT_CUDA_TRY_NO_THROW(cudaGraphDestroy(...)) and
capture_even_active_/capture_odd_active_ handling unchanged.
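The gating both ping_pong_graph.cu comments ask for (only update flags when the CUDA call succeeds, and roll back on failure) can be sketched with stand-in types. `status_t`, `graph_state_t`, and this `end_capture` are hypothetical stand-ins for `cudaError_t`, the member flags, and the real `cudaStreamEndCapture`/`cudaGraphInstantiate` sequence:

```cpp
#include <cassert>

// Stand-in for cudaError_t.
enum class status_t { success, error };

// Stand-in for the per-graph member state (capture_*_active_, *_initialized,
// *_instance) of the real class.
struct graph_state_t {
  bool capture_active = false;
  bool initialized    = false;
  void* instance      = nullptr;  // stand-in for cudaGraphExec_t
};

// Only flip the success flags when the underlying calls succeed; on failure,
// leave the state so launch()/is_initialized() never report graph mode.
bool end_capture(graph_state_t& s,
                 status_t end_capture_status,
                 status_t instantiate_status)
{
  if (!s.capture_active) return false;
  s.capture_active = false;  // the capture is over either way
  if (end_capture_status != status_t::success) return false;
  if (instantiate_status != status_t::success) {
    // Roll back: instance stays unset so no half-built graph is launched.
    s.instance = nullptr;
    return false;
  }
  static int dummy;  // stand-in for a real, valid cudaGraphExec_t handle
  s.instance    = &dummy;
  s.initialized = true;
  return true;
}
```

In the real code, the same idea means capturing the `cudaError_t` from each call instead of discarding it through `RAFT_CUDA_TRY_NO_THROW`, and destroying any partially created graph when instantiation fails.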
In `@cpp/src/pdlp/utils.cuh`:
- Around line 607-618: The two calls to cub::DeviceReduce::Max in
pdlp::utils.cuh are not checking CUDA errors; wrap both invocations with
RAFT_CUDA_TRY so any cudaError_t returned is checked. Locate the block using
handle_ptr->get_stream(), abs_iter, and temp_bytes (the first sizing call) and
the second call that passes temp_buf.data(), and prefix each
cub::DeviceReduce::Max(...) call with RAFT_CUDA_TRY(...). Ensure you include the
same arguments inside RAFT_CUDA_TRY and keep temp_buf allocation and stream
usage unchanged.
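Several of the inline comments above reduce to one pattern: wrap a status-returning call in a checking macro so the error cannot be silently dropped. A minimal host-side sketch, where `CHECK_STATUS` and `fake_reduce` are hypothetical stand-ins for `RAFT_CUDA_TRY` and a `cub::DeviceReduce` call:

```cpp
#include <cassert>
#include <stdexcept>

// Stand-in for cudaError_t.
enum class status_t { success, invalid_value };

// Stand-in for RAFT_CUDA_TRY: turn a non-success status into an exception
// instead of discarding the return value.
#define CHECK_STATUS(call)                                     \
  do {                                                         \
    status_t s_ = (call);                                      \
    if (s_ != status_t::success)                               \
      throw std::runtime_error("status check failed: " #call); \
  } while (0)

// Stand-in for a CUB reduction that reports failure via its return value
// rather than by throwing.
status_t fake_reduce(int n)
{
  return n >= 0 ? status_t::success : status_t::invalid_value;
}

bool reduce_or_report(int n)
{
  try {
    CHECK_STATUS(fake_reduce(n));  // the error can no longer be ignored
    return true;
  } catch (const std::runtime_error&) {
    return false;
  }
}
```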
---
Nitpick comments:
In
`@python/cuopt_server/cuopt_server/utils/linear_programming/data_definition.py`:
- Around line 453-457: The field mip_batch_pdlp_strong_branching is typed as
Optional[int] but should follow the class convention of binary flags by using
Optional[bool]; change its annotation to Optional[bool] and update the Field
default from 0 to False (or None if other similar flags use None) so callers get
a boolean API, and ensure the call site that passes this to the C++ layer casts
the bool to int (bool → int) when invoking the underlying solver; locate the
field definition on the SolverConfig (mip_batch_pdlp_strong_branching) and make
this replacement to keep behavior consistent with mip_scaling,
mip_heuristics_only, crossover, eliminate_dense_columns, etc.
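If the field does become `Optional[bool]`, the bool-to-int lowering at the conversion layer is a one-liner. A sketch (the helper name `to_solver_flag` and the default are hypothetical), assuming the C++ solver parameter takes a 0/1 int:

```cpp
#include <optional>

// Hypothetical conversion-layer helper: the Python/pydantic side keeps a
// type-safe Optional[bool], and the value is lowered to the 0/1 int the
// underlying solver parameter expects at call time.
int to_solver_flag(std::optional<bool> enabled, int default_value = 0)
{
  return enabled.has_value() ? static_cast<int>(*enabled) : default_value;
}
```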
```cpp
// We need to make sure both dot products happens after previous operations (next_primal/dual)
// Thus, we add another node in the main stream before starting the SpMVs

if (!batch_mode_) deltas_are_done_.record(stream_view_);
```
Is removing those deltas_are_done_ lines intentional?
Yes. It's actually illegal to issue work on the same cuBLAS handle while using different streams, which I was doing here. The error would extremely rarely trigger, but it can result in an incorrect result for the dot product.
```cpp
if (!is_legacy_batch_mode_) {
  // This should not happen, but in case a graph was capturing while destroying the object
  if (capture_even_active_ || capture_odd_active_) {
    cancel_active_capture();
```
I think this might hide real errors. If this is destroyed while its own stream is capturing, it is a logic issue. If this is destroyed while another stream/thread is capturing, we will lose the graph on the other stream/thread and cudaGraphLaunch will launch an incomplete/empty graph.
Yes, I agree. This was to create a failsafe in case the PDLP object is destroyed while a capture is happening, which is unintended behavior. Hence the log error above. Would you still kill the application?
This behavior has been changed now
/ok to test
@Kh4ster, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/
/ok to test fe31e7e
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
cpp/src/pdlp/utilities/ping_pong_graph.cuh (1)
25-43: ⚠️ Potential issue | 🟠 Major

Add `= delete` for copy constructor and copy assignment operator to prevent double-free of CUDA handles.

`cudaGraph_t` and `cudaGraphExec_t` are opaque handles (effectively non-owning pointers). The compiler-generated copy constructor and copy assignment operator will perform a shallow copy of these handles. If two `ping_pong_graph_t` instances end up holding the same handle (e.g., via a stray copy or STL container resize), the destructor of each will independently call `cudaGraphDestroy`/`cudaGraphExecDestroy` on the same resource, causing a double-free crash or silent corruption.

🛡️ Proposed fix: explicitly delete copy and move semantics

```diff
 public:
   ping_pong_graph_t(rmm::cuda_stream_view stream_view, bool is_legacy_batch_mode = false);
   ~ping_pong_graph_t();
+
+  // Non-copyable, non-movable: owns CUDA graph handles
+  ping_pong_graph_t(const ping_pong_graph_t&)            = delete;
+  ping_pong_graph_t& operator=(const ping_pong_graph_t&) = delete;
+  ping_pong_graph_t(ping_pong_graph_t&&)                 = delete;
+  ping_pong_graph_t& operator=(ping_pong_graph_t&&)      = delete;
```

If move semantics are desired in the future (e.g., for storing in a container), a proper move constructor that nulls out the source handles after transfer should be implemented explicitly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/pdlp/utilities/ping_pong_graph.cuh` around lines 25 - 43, Declare the class copy constructor and copy assignment operator for ping_pong_graph_t as deleted to prevent shallow copies of the CUDA handles (i.e., add ping_pong_graph_t(const ping_pong_graph_t&) = delete; and ping_pong_graph_t& operator=(const ping_pong_graph_t&) = delete; in the class declaration), leaving move semantics unimplemented for now (or explicitly delete them too if you want fully non-movable instances) so that even_graph, odd_graph, even_instance and odd_instance cannot be copied and risk double-free in the destructor.

cpp/src/pdlp/step_size_strategy/adaptive_step_size_strategy.cu (1)
418-497: ⚠️ Potential issue | 🟠 Major

Wrap CUB calls with error checking to match codebase patterns.

The CUB `DeviceTransform::Transform` and `DeviceSegmentedReduce::Sum` calls at lines 418–497 lack error checking. This is inconsistent with error-checked versions of the same functions earlier in the file (lines 68, 80, 90). Wrap all four calls with `RAFT_CUDA_TRY` to align with coding guidelines: "Every CUDA kernel launch and memory operation must have error checking with CUDA_CHECK or equivalent verification."

✅ Suggested fix

```diff
-  cub::DeviceTransform::Transform(
+  RAFT_CUDA_TRY(cub::DeviceTransform::Transform(
     cuda::std::make_tuple(current_saddle_point_state.get_next_AtY().data(),
                           current_saddle_point_state.get_current_AtY().data()),
     tmp_primal.data(),
     tmp_primal.size(),
     cuda::std::minus<>{},
-    stream_view_.value());
+    stream_view_.value()));
 ...
-  cub::DeviceSegmentedReduce::Sum(
+  RAFT_CUDA_TRY(cub::DeviceSegmentedReduce::Sum(
     dot_product_storage.data(),
     dot_product_bytes,
     thrust::make_transform_iterator(
       thrust::make_zip_iterator(tmp_primal.data(),
                                 current_saddle_point_state.get_delta_primal().data()),
       tuple_multiplies<f_t>{}),
     interaction_.data(),
     climber_strategies_.size(),
     primal_size_,
-    stream_view_.value());
+    stream_view_.value()));
-  cub::DeviceSegmentedReduce::Sum(
+  RAFT_CUDA_TRY(cub::DeviceSegmentedReduce::Sum(
     dot_product_storage.data(),
     dot_product_bytes,
     thrust::make_transform_iterator(current_saddle_point_state.get_delta_primal().data(),
                                     power_two_func_t<f_t>{}),
     norm_squared_delta_primal_.data(),
     climber_strategies_.size(),
     primal_size_,
-    stream_view_.value());
+    stream_view_.value()));
-  cub::DeviceSegmentedReduce::Sum(
+  RAFT_CUDA_TRY(cub::DeviceSegmentedReduce::Sum(
     dot_product_storage.data(),
     dot_product_bytes,
     thrust::make_transform_iterator(current_saddle_point_state.get_delta_dual().data(),
                                     power_two_func_t<f_t>{}),
     norm_squared_delta_dual_.data(),
     climber_strategies_.size(),
     dual_size_,
-    stream_view_.value());
+    stream_view_.value()));
```
Verify each finding against the current code and only fix it if needed. In `@cpp/src/pdlp/step_size_strategy/adaptive_step_size_strategy.cu` around lines 418 - 497, The four CUB calls (cub::DeviceTransform::Transform and the three cub::DeviceSegmentedReduce::Sum invocations that use tmp_primal, tuple_multiplies<f_t>, and power_two_func_t<f_t>) are not wrapped with RAFT_CUDA_TRY; wrap each of those calls with RAFT_CUDA_TRY(...) exactly as other CUB/CUDA calls in this file so CUDA errors are checked (preserve the original call arguments including stream_view_.value()).
🧹 Nitpick comments (6)
benchmarks/linear_programming/cuopt/run_pdlp.cu (1)
81-88: Line 87 fallback return is unreachable dead code.

Since argparse's `.choices()` already rejects any string outside `{"None", "Papilo", "PSLP", "Default"}`, control can never reach line 87. The silent `return Default` hides potential future bugs (e.g. a new enum value added to `presolver_t` without updating this mapping). Consider replacing it with an assertion to make the invariant explicit:

🔧 Suggested change

```diff
 static cuopt::linear_programming::presolver_t string_to_presolver(const std::string& presolver)
 {
   if (presolver == "None") return cuopt::linear_programming::presolver_t::None;
   if (presolver == "Papilo") return cuopt::linear_programming::presolver_t::Papilo;
   if (presolver == "PSLP") return cuopt::linear_programming::presolver_t::PSLP;
   if (presolver == "Default") return cuopt::linear_programming::presolver_t::Default;
-  return cuopt::linear_programming::presolver_t::Default;
+  throw std::invalid_argument("Unknown presolver: " + presolver);
 }
```

This mirrors how `string_to_pdlp_solver_mode` would ideally behave and surfaces mapping gaps immediately at compile/test time if new enum values are added.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/linear_programming/cuopt/run_pdlp.cu` around lines 81 - 88, Replace the silent fallback in string_to_presolver so an unexpected string cannot quietly map to Default: remove the unreachable final "return Default" and instead assert or throw (e.g., use assert(false) or throw std::invalid_argument) to make the invariant explicit for cuopt::linear_programming::presolver_t mapping; reference string_to_presolver and presolver_t (and mirror the behavior used by string_to_pdlp_solver_mode) so any missing mapping surfaces immediately during tests/compilation.

cpp/src/pdlp/utilities/ping_pong_graph.cuh (1)
34-37: Initialize CUDA handles to `nullptr` to prevent accidental use of garbage handle values.

`even_graph`, `odd_graph`, `even_instance`, and `odd_instance` are uninitialized. While `even_initialized`/`odd_initialized` guard their use in normal paths, any future code path that bypasses those flags — or any static analysis/sanitizer run — will observe undefined values. Explicit `nullptr` initialization is zero-cost and aligns with the guideline for tracking CUDA resource lifecycle.

♻️ Proposed fix: value-initialize CUDA handles

```diff
 private:
-  cudaGraph_t even_graph;
-  cudaGraph_t odd_graph;
-  cudaGraphExec_t even_instance;
-  cudaGraphExec_t odd_instance;
+  cudaGraph_t even_graph{nullptr};
+  cudaGraph_t odd_graph{nullptr};
+  cudaGraphExec_t even_instance{nullptr};
+  cudaGraphExec_t odd_instance{nullptr};
```

As per coding guidelines, tracking GPU resource lifecycle and ensuring cleanup of CUDA handles requires proper initialization state.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/pdlp/utilities/ping_pong_graph.cuh` around lines 34 - 37, The CUDA handle variables even_graph, odd_graph, even_instance, and odd_instance are uninitialized; initialize them to nullptr (or equivalent zero-initialization) where they are declared so they start in a known state and won't contain garbage values, then keep the existing even_initialized/odd_initialized guards and cleanup logic intact to manage lifecycle; update the declarations of even_graph, odd_graph, even_instance, and odd_instance to value-initialize them (e.g., = nullptr) so static analysis/sanitizers and future code paths see a safe default.

docs/cuopt/source/lp-qp-milp-settings.rst (1)
520-521: Wording ambiguity: "in parallel" appears to modify Dual Simplex.

The phrase "rather than solving them in parallel using Dual Simplex" grammatically attributes parallelism to Dual Simplex, which is a sequential solver. The intended contrast is batch PDLP (parallel/simultaneous) vs. Dual Simplex (sequential). Consider rephrasing, e.g.:
"…rather than solving them sequentially using Dual Simplex."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/cuopt/source/lp-qp-milp-settings.rst` around lines 520 - 521, The wording in the description for CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING ambiguously attributes parallelism to Dual Simplex; update the sentence that currently reads "rather than solving them in parallel using Dual Simplex" to clearly contrast batched PDLP (simultaneous/parallel evaluation) with Dual Simplex by rephrasing it to something like "rather than solving them sequentially using Dual Simplex" so the intended comparison is unambiguous; update the docstring for CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING accordingly.

python/cuopt/cuopt/tests/linear_programming/test_python_API.py (2)
1003-1017: Problem setup duplicates `test_cuts` verbatim — extract a shared helper.

The variable declarations, constraints, and objective (lines 1009–1017) are identical to `test_cuts` (lines 971–979). Consider a shared factory, e.g.:

♻️ Proposed refactor

```diff
+def _make_knapsack_problem():
+    """Shared 3-variable 0-1 MIP used in multiple tests."""
+    problem = Problem()
+    y1 = problem.addVariable(lb=0, ub=1, vtype=INTEGER, name="y1")
+    y2 = problem.addVariable(lb=0, ub=1, vtype=INTEGER, name="y2")
+    y3 = problem.addVariable(lb=0, ub=1, vtype=INTEGER, name="y3")
+    problem.addConstraint(774 * y1 + 76 * y2 + 42 * y3 <= 875)
+    problem.addConstraint(67 * y1 + 27 * y2 + 53 * y3 <= 875)
+    problem.setObjective(-86 * y1 - 4 * y2 - 40 * y3)
+    return problem

 def test_cuts():
-    problem = Problem()
-    y1 = problem.addVariable(lb=0, ub=1, vtype=INTEGER, name="y1")
-    y2 = problem.addVariable(lb=0, ub=1, vtype=INTEGER, name="y2")
-    y3 = problem.addVariable(lb=0, ub=1, vtype=INTEGER, name="y3")
-    problem.addConstraint(774 * y1 + 76 * y2 + 42 * y3 <= 875)
-    problem.addConstraint(67 * y1 + 27 * y2 + 53 * y3 <= 875)
-    problem.setObjective(-86 * y1 - 4 * y2 - 40 * y3)
+    problem = _make_knapsack_problem()
     ...

 def test_batch_pdlp_strong_branching():
-    problem = Problem()
-    y1 = problem.addVariable(lb=0, ub=1, vtype=INTEGER, name="y1")
-    ...
+    problem = _make_knapsack_problem()
     ...
```
Verify each finding against the current code and only fix it if needed. In `@python/cuopt/cuopt/tests/linear_programming/test_python_API.py` around lines 1003 - 1017, The test duplicates the problem construction in test_batch_pdlp_strong_branching and test_cuts; factor out the repeated setup into a shared helper (e.g., a function like build_y1y2y3_problem or make_binary_three_var_problem) that creates the Problem, adds variables y1,y2,y3, adds the two constraints and sets the objective, and returns the Problem (and variables if tests need them); then replace the inline block in both test_batch_pdlp_strong_branching and test_cuts to call that helper (use the existing symbols Problem, addVariable, addConstraint, setObjective, and variable names y1,y2,y3 to find and replace the duplicated code).
1003-1032: Missing coverage: parameter should be ignored on LP problems (per docs).

The documentation states "This setting is ignored if the problem is not a MIP problem." The test only exercises MIP scenarios. Per the project's test-coverage guidelines, edge behavior (silently ignoring a flag on non-qualifying problems) should be tested to prevent future regressions where the parameter accidentally affects LP solves.

Consider adding a brief LP variant that sets `CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING=1` and asserts `Status == Optimal` and a correct objective.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuopt/cuopt/tests/linear_programming/test_python_API.py` around lines 1003 - 1032, Add a small LP variant to the test to verify the flag is ignored for non-MIP problems: create a new Problem (or reuse by recreating variables) using continuous variables (omit vtype or use CONTINUOUS) with the same constraints/objective as in test_batch_pdlp_strong_branching, call settings.set_parameter(CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING, 1) and solve; then assert problem.Status.name == "Optimal" and problem.ObjValue == pytest.approx(-126, abs=1e-3) to ensure the parameter does not affect LP solves. Use the same symbols (Problem, settings.set_parameter, CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING, solve, Status, ObjValue, test_batch_pdlp_strong_branching) so the new assertions align with the existing test structure.

cpp/src/pdlp/pdlp.cu (1)
1927-1927: `cublasSetStream` correctly ensures stream binding before `cublasDgeam`; minor inconsistency with `.value()`.

The explicit stream binding is the correct fix for the handle-sharing issue described in the PR. `stream_view_` (of type `rmm::cuda_stream_view`) implicitly converts to `cudaStream_t` here, but all other cuSPARSE/cuBLAS call sites in this PR (lines 2744, 2757, 2766) use explicit `.value()`. Consider using `stream_view_.value()` here and at line 2004 for consistency.

♻️ Proposed consistency fix

```diff
-  RAFT_CUBLAS_TRY(cublasSetStream(handle_ptr_->get_cublas_handle(), stream_view_));
+  RAFT_CUBLAS_TRY(cublasSetStream(handle_ptr_->get_cublas_handle(), stream_view_.value()));
```

Apply the same change at line 2004.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/pdlp/pdlp.cu` at line 1927, The cublas stream is correctly set via RAFT_CUBLAS_TRY(cublasSetStream(handle_ptr_->get_cublas_handle(), stream_view_)), but for consistency with other cuBLAS/cuSPARSE call sites use the explicit rmm::cuda_stream_view::value(); change the second argument to stream_view_.value() in the RAFT_CUBLAS_TRY(cublasSetStream(...)) call (and make the same replacement for the other occurrence around the cublasDgeam usage referenced at line ~2004) so all calls consistently pass a cudaStream_t via stream_view_.value().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/src/pdlp/pdlp.cu`:
- Around line 1861-1864: The cudaPeekAtLastError() call is placed after
RAFT_CUDA_TRY(cudaStreamSynchronize(...)) making it redundant; move the
RAFT_CUDA_TRY(cudaPeekAtLastError()) to immediately follow the kernel launch
(before RAFT_CUDA_TRY(cudaStreamSynchronize(stream_view_))) so that
launch/configuration errors are caught early, then keep the stream sync
(RAFT_CUDA_TRY(cudaStreamSynchronize(stream_view_))) to detect execution errors
and ensure device-to-host writes complete.
In `@python/cuopt/cuopt/tests/linear_programming/test_python_API.py`:
- Around line 1024-1032: The second solve may return a cached solution because
only settings were changed; force a true re-solve or validate the solver ran
under CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING by marking the model dirty or
recreating it before the second problem.solve(settings). For example, call a
method that updates the model state (e.g., problem.setObjective(...) or a
problem.reset()/rebuild/recreate of the Problem object) or assert a runtime stat
changed (e.g., SolutionStats.num_nodes increased) to ensure the second solve
actually exercised the CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING=1 path instead of
returning a cached result.
---
Outside diff comments:
In `@cpp/src/pdlp/step_size_strategy/adaptive_step_size_strategy.cu`:
- Around line 418-497: The four CUB calls (cub::DeviceTransform::Transform and
the three cub::DeviceSegmentedReduce::Sum invocations that use tmp_primal,
tuple_multiplies<f_t>, and power_two_func_t<f_t>) are not wrapped with
RAFT_CUDA_TRY; wrap each of those calls with RAFT_CUDA_TRY(...) exactly as other
CUB/CUDA calls in this file so CUDA errors are checked (preserve the original
call arguments including stream_view_.value()).
In `@cpp/src/pdlp/utilities/ping_pong_graph.cuh`:
- Around line 25-43: Declare the class copy constructor and copy assignment
operator for ping_pong_graph_t as deleted to prevent shallow copies of the CUDA
handles (i.e., add ping_pong_graph_t(const ping_pong_graph_t&) = delete; and
ping_pong_graph_t& operator=(const ping_pong_graph_t&) = delete; in the class
declaration), leaving move semantics unimplemented for now (or explicitly delete
them too if you want fully non-movable instances) so that even_graph, odd_graph,
even_instance and odd_instance cannot be copied and risk double-free in the
destructor.
---
Duplicate comments:
In `@cpp/src/pdlp/pdlp.cu`:
- Around line 1456-1460: The assert cuopt_assert(movement + computed_interaction
>= f_t(0.0), ...) is too strict for floating-point rounding; replace it with an
epsilon-tolerant check that allows tiny negative rounding errors before
clamping. Compute an epsilon relative to the magnitude of operands (e.g., using
std::numeric_limits<f_t>::epsilon() times max(abs(movement),
abs(computed_interaction), f_t(1))) and assert movement + computed_interaction
>= -epsilon, then continue to use cuda::std::max(...) and the existing sqrt into
*fixed_point_error; keep the original message but note the tolerance in the
condition.
- Line 1194: cuopt_assert(vec.size() > 0, "Vector must not be empty") is not
sufficient because asserts can be compiled out; add an explicit runtime check
for vec.size() == 0 before computing avg (the line using avg =
d_sum.value(stream) / vec.size()) and perform the appropriate early return or
empty-result handling instead of relying on cuopt_assert; update the function
containing cuopt_assert, the avg variable calculation, and the
d_sum.value(stream) usage so the divide-by-zero cannot occur in release builds.
In `@cpp/src/pdlp/termination_strategy/convergence_information.cu`:
- Around line 461-479: The two cub::DeviceReduce::Max calls using
transform_iter, linf_dual_residual_.data(), temp_storage_bytes, and stream_view_
need CUDA error checking: wrap each cub::DeviceReduce::Max invocation with
RAFT_CUDA_TRY (or the project’s CUDA_CHECK macro) so failures are caught (first
call that computes temp_storage_bytes and the second call that writes into
temp_buf.data()); ensure both calls use the same error-check wrapper and keep
existing variables transform_iter, temp_buf, linf_dual_residual_.data(), and
stream_view_ unchanged.
In `@cpp/src/pdlp/utils.cuh`:
- Around line 607-615: The two cub::DeviceReduce::Max calls in the block that
computes the max absolute value (using abs_iter, result, n, stream) are missing
CUDA error checks; wrap both calls with RAFT_CUDA_TRY (or the project equivalent
CUDA_CHECK) so failures are surfaced — i.e., call
RAFT_CUDA_TRY(cub::DeviceReduce::Max(d_temp, temp_bytes, abs_iter, result, n,
stream)); and RAFT_CUDA_TRY(cub::DeviceReduce::Max(temp_buf.data(), temp_bytes,
abs_iter, result, n, stream)); ensuring you keep the existing variables (d_temp,
temp_bytes, temp_buf, abs_iter, result, n, stream) unchanged.
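The explicit runtime guard the pdlp.cu duplicate comment asks for (a real check rather than a `cuopt_assert` that release builds compile out) can be sketched host-side; `compute_avg` and the empty-case default here are hypothetical stand-ins for the device-side stats computation:

```cpp
#include <vector>

// Host-side sketch: return early with a sensible default for empty input
// instead of relying on an assert that release builds may strip, so the
// sum / size() division can never divide by zero.
double compute_avg(const std::vector<double>& vec)
{
  if (vec.empty()) return 0.0;  // explicit runtime guard, not an assert
  double sum = 0.0;
  for (double v : vec) sum += v;
  return sum / static_cast<double>(vec.size());  // safe: size() > 0 here
}
```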
---
Nitpick comments:
In `@benchmarks/linear_programming/cuopt/run_pdlp.cu`:
- Around line 81-88: Replace the silent fallback in string_to_presolver so an
unexpected string cannot quietly map to Default: remove the unreachable final
"return Default" and instead assert or throw (e.g., use assert(false) or throw
std::invalid_argument) to make the invariant explicit for
cuopt::linear_programming::presolver_t mapping; reference string_to_presolver
and presolver_t (and mirror the behavior used by string_to_pdlp_solver_mode) so
any missing mapping surfaces immediately during tests/compilation.
In `@cpp/src/pdlp/pdlp.cu`:
- Line 1927: The cublas stream is correctly set via
RAFT_CUBLAS_TRY(cublasSetStream(handle_ptr_->get_cublas_handle(),
stream_view_)), but for consistency with other cuBLAS/cuSPARSE call sites use
the explicit rmm::cuda_stream_view::value(); change the second argument to
stream_view_.value() in the RAFT_CUBLAS_TRY(cublasSetStream(...)) call (and make
the same replacement for the other occurrence around the cublasDgeam usage
referenced at line ~2004) so all calls consistently pass a cudaStream_t via
stream_view_.value().
In `@cpp/src/pdlp/utilities/ping_pong_graph.cuh`:
- Around line 34-37: The CUDA handle variables even_graph, odd_graph,
even_instance, and odd_instance are uninitialized; initialize them to nullptr
(or equivalent zero-initialization) where they are declared so they start in a
known state and won't contain garbage values, then keep the existing
even_initialized/odd_initialized guards and cleanup logic intact to manage
lifecycle; update the declarations of even_graph, odd_graph, even_instance, and
odd_instance to value-initialize them (e.g., = nullptr) so static
analysis/sanitizers and future code paths see a safe default.
In `@docs/cuopt/source/lp-qp-milp-settings.rst`:
- Around line 520-521: The wording in the description for
CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING ambiguously attributes parallelism to Dual
Simplex; update the sentence that currently reads "rather than solving them in
parallel using Dual Simplex" to clearly contrast batched PDLP
(simultaneous/parallel evaluation) with Dual Simplex by rephrasing it to
something like "rather than solving them sequentially using Dual Simplex" so the
intended comparison is unambiguous; update the docstring for
CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING accordingly.
In `@python/cuopt/cuopt/tests/linear_programming/test_python_API.py`:
- Around line 1003-1017: The test duplicates the problem construction in
test_batch_pdlp_strong_branching and test_cuts; factor out the repeated setup
into a shared helper (e.g., a function like build_y1y2y3_problem or
make_binary_three_var_problem) that creates the Problem, adds variables
y1,y2,y3, adds the two constraints and sets the objective, and returns the
Problem (and variables if tests need them); then replace the inline block in
both test_batch_pdlp_strong_branching and test_cuts to call that helper (use the
existing symbols Problem, addVariable, addConstraint, setObjective, and variable
names y1,y2,y3 to find and replace the duplicated code).
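The refactor described above can be sketched with a stub modeling class. The `Problem` class below is a stand-in for cuopt's API, and the constraint and objective expressions are placeholders (the real ones live in test_python_API.py); only the shared-helper pattern is the point:

```python
class Problem:
    """Tiny stand-in for cuopt's Problem; real tests use the cuopt API."""

    def __init__(self):
        self.variables = []
        self.constraints = []
        self.objective = None

    def addVariable(self, name):
        self.variables.append(name)
        return name

    def addConstraint(self, expr):
        self.constraints.append(expr)

    def setObjective(self, expr):
        self.objective = expr


def build_y1y2y3_problem():
    """Shared setup for both tests: three variables, two constraints,
    one objective. Expressions below are illustrative placeholders."""
    problem = Problem()
    y1 = problem.addVariable("y1")
    y2 = problem.addVariable("y2")
    y3 = problem.addVariable("y3")
    problem.addConstraint(f"{y1} + {y2} + {y3} <= 10")  # placeholder
    problem.addConstraint(f"{y1} - {y2} >= 0")          # placeholder
    problem.setObjective(f"-{y1} - {y2} - {y3}")        # placeholder
    return problem, (y1, y2, y3)


# Both tests then reduce to a one-line setup:
def test_batch_pdlp_strong_branching_setup():
    problem, _ = build_y1y2y3_problem()
    assert len(problem.variables) == 3


def test_cuts_setup():
    problem, _ = build_y1y2y3_problem()
    assert len(problem.constraints) == 2
```

Each test then only configures its own solver settings and assertions, and a change to the shared model needs to be made in one place.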
- Around line 1003-1032: Add a small LP variant to the test to verify the flag
is ignored for non-MIP problems: create a new Problem (or reuse by recreating
variables) using continuous variables (omit vtype or use CONTINUOUS) with the
same constraints/objective as in test_batch_pdlp_strong_branching, call
settings.set_parameter(CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING, 1) and solve; then
assert problem.Status.name == "Optimal" and problem.ObjValue ==
pytest.approx(-126, abs=1e-3) to ensure the parameter does not affect LP solves.
Use the same symbols (Problem, settings.set_parameter,
CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING, solve, Status, ObjValue,
test_batch_pdlp_strong_branching) so the new assertions align with the existing
test structure.
RAFT_CUDA_TRY(cudaStreamSynchronize(
  stream_view_));  // To make sure all the data is written from device to host
RAFT_CUDA_TRY(cudaPeekAtLastError());
cudaPeekAtLastError() is placed after cudaStreamSynchronize() — it is redundant and should precede the sync.
RAFT_CUDA_TRY(cudaStreamSynchronize(...)) already surfaces both launch-configuration errors (via the sticky error bit) and kernel execution errors. The cudaPeekAtLastError() call that follows can never observe a new error not already caught by the sync. To properly detect kernel launch errors (e.g., invalid grid dimensions) before blocking, cudaPeekAtLastError() must come immediately after the kernel launch.
🛠️ Proposed fix
kernel_compute_fixed_error<f_t><<<grid_size, block_size, 0, stream_view_>>>(
make_span(step_size_strategy_.get_norm_squared_delta_primal()),
make_span(step_size_strategy_.get_norm_squared_delta_dual()),
make_span(primal_weight_),
make_span(step_size_),
make_span(step_size_strategy_.get_interaction()),
make_span(restart_strategy_.fixed_point_error_));
+ RAFT_CUDA_TRY(cudaPeekAtLastError());
RAFT_CUDA_TRY(cudaStreamSynchronize(
stream_view_)); // To make sure all the data is written from device to host
- RAFT_CUDA_TRY(cudaPeekAtLastError());
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/pdlp/pdlp.cu` around lines 1861 - 1864, The cudaPeekAtLastError()
call is placed after RAFT_CUDA_TRY(cudaStreamSynchronize(...)) making it
redundant; move the RAFT_CUDA_TRY(cudaPeekAtLastError()) to immediately follow
the kernel launch (before RAFT_CUDA_TRY(cudaStreamSynchronize(stream_view_))) so
that launch/configuration errors are caught early, then keep the stream sync
(RAFT_CUDA_TRY(cudaStreamSynchronize(stream_view_))) to detect execution errors
and ensure device-to-host writes complete.
problem.solve(settings)
assert problem.Status.name == "Optimal"
assert problem.ObjValue == pytest.approx(-126, abs=1e-3)

settings.set_parameter(CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING, 1)

problem.solve(settings)
assert problem.Status.name == "Optimal"
assert problem.ObjValue == pytest.approx(-126, abs=1e-3)
Second solve lacks an explicit problem reset — test may not exercise the batch PDLP=1 path.
Between the two problem.solve(settings) calls (lines 1024 and 1030), only a solver setting is changed; no setObjective, addConstraint, or updateConstraint is called to mark the problem dirty. Depending on how the solver determines whether a re-solve is needed, the second call may return the cached solution without running under CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING=1, causing the second assert to pass trivially.
Compare test_cuts (lines 983–999): it also calls problem.solve twice, but the second call is guarded by a changed setting value that clearly directs a different code path (different cut-pass count). The existing test pattern in test_model calls setObjective between solves to force a fresh state.
Adding an explicit re-construction of the problem object or at minimum verifying a state indicator (e.g., SolutionStats.num_nodes) would make the second assertion meaningful.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@python/cuopt/cuopt/tests/linear_programming/test_python_API.py` around lines
1024 - 1032, The second solve may return a cached solution because only settings
were changed; force a true re-solve or validate the solver ran under
CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING by marking the model dirty or recreating
it before the second problem.solve(settings). For example, call a method that
updates the model state (e.g., problem.setObjective(...) or a
problem.reset()/rebuild/recreate of the Problem object) or assert a runtime stat
changed (e.g., SolutionStats.num_nodes increased) to ensure the second solve
actually exercised the CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING=1 path instead of
returning a cached result.
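The caching concern can be sketched with a toy solver. The `Problem` class and its `dirty` flag below are hypothetical (whether cuopt actually caches in this situation is exactly what the review questions), but they show why touching the model between solves, as `test_model` does with `setObjective`, makes the second assertion meaningful:

```python
class Problem:
    """Toy solver stand-in (assumption: solves may be cached when the
    model has not changed since the last solve)."""

    def __init__(self):
        self.dirty = True        # model starts unsolved
        self.solve_count = 0
        self.cached = None

    def setObjective(self, expr):
        self.dirty = True        # any model edit forces a fresh solve

    def solve(self, settings):
        if not self.dirty and self.cached is not None:
            return self.cached   # cached result: the solver did not run
        self.solve_count += 1
        self.dirty = False
        self.cached = "Optimal"
        return self.cached


settings = {"CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING": 0}
problem = Problem()
problem.solve(settings)

# Changing only a solver setting does not mark the model dirty...
settings["CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING"] = 1
problem.solve(settings)
assert problem.solve_count == 1  # second call returned the cached result

# ...so touch the model (or rebuild it) to force a true re-solve.
problem.setObjective("-y1 - y2 - y3")
problem.solve(settings)
assert problem.solve_count == 2  # this solve actually ran under the new setting
```

Asserting on a runtime statistic such as `SolutionStats.num_nodes`, as the prompt suggests, is the equivalent check when the model cannot easily be rebuilt.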
/ok to test d89af96

/ok to test c7e3e22

/merge
This PR fixes a few issues that caused batch PDLP to crash occasionally:
Create a separate .cu file for the CUDA graph
Add a fail-safe in the CUDA graph to avoid destruction while capturing
Avoid throwing on error, which could trigger destruction while capturing
This PR also fixes a few compilation issues related to PDLP.
This PR also allows batch PDLP to be toggled on from the Python side.
Summary by CodeRabbit
New Features
Bug Fixes / Stability
Documentation
Tests