
Fix batch pdlp and python api support#893

Merged
rapids-bot[bot] merged 15 commits into main from fix_batch_pdlp
Feb 19, 2026

Conversation

Contributor

@Kh4ster Kh4ster commented Feb 18, 2026

This PR fixes several issues that caused batch PDLP to crash occasionally.

  • Refactored the CUDA graph code
    Moved the CUDA graph logic into a separate .cu file
    Added a fail-safe to avoid destroying the CUDA graph while it is capturing
    Avoided throwing on error, since an exception could trigger destruction while capturing
  • Stopped using different streams in the adaptive step size strategy to avoid potential issues with cuBLAS handle sharing
  • Changed all thrust::transform_reduce calls to cub::DeviceReduce to avoid unintentionally creating new streams and triggering CUDA mallocs
  • Used .value() on streams for all CUB calls to avoid stream issues
  • Added more asserts on NaN
  • Correctly called set_stream before triggering a cublasDgeam
  • Used a separate handle and stream when starting batch PDLP

Additionally, this PR fixes a few compilation issues linked to PDLP.

This PR also allows batch PDLP to be toggled on from the Python side.

Summary by CodeRabbit

  • New Features

    • New string-based presolver option for LP solvers (multiple modes).
    • Added Batch PDLP strong branching toggle for MIP root behavior.
  • Bug Fixes / Stability

    • Improved runtime validation and NaN/Inf checks across solver paths.
    • Stream and reduction changes for more robust GPU execution; CUDA graph capture added for performance.
  • Documentation

    • Settings docs updated to include the new options.
  • Tests

    • Added tests for batch PDLP strong branching.


copy-pr-bot bot commented Feb 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Kh4ster Kh4ster self-assigned this Feb 18, 2026
@Kh4ster Kh4ster added bug Something isn't working pdlp mip labels Feb 18, 2026
@Kh4ster Kh4ster changed the title Fix batch pdlp Fix batch pdlp and python support Feb 18, 2026
@Kh4ster Kh4ster added the non-breaking Introduces a non-breaking change label Feb 18, 2026
@Kh4ster Kh4ster changed the title Fix batch pdlp and python support Fix batch pdlp and python api support Feb 18, 2026

coderabbitai bot commented Feb 18, 2026

No actionable comments were generated in the recent review. 🎉


📝 Walkthrough

Replaces integer presolve with a string-based presolver option; adds ping-pong CUDA graph implementation; migrates many thrust reductions to cub::DeviceReduce; tightens numeric NaN/Inf/runtime checks; consolidates stream handling to raw CUDA streams; exposes a Python setting and tests for batch PDLP strong branching.

Changes

Cohort / File(s) Summary
Argument parser & presolver
benchmarks/linear_programming/cuopt/run_pdlp.cu
Replaces integer presolve with string --presolver accepting None/Papilo/PSLP/Default; adds string_to_presolver() and maps settings to presolver_t (member changed from presolve int to presolver enum).
CUDA graph ping-pong impl & header
cpp/src/pdlp/utilities/ping_pong_graph.cu, cpp/src/pdlp/utilities/ping_pong_graph.cuh
Adds full ping_pong_graph_t<i_t> implementation and declarations, manages even/odd capture/instantiation/launch, adds capture state flags, and provides explicit instantiation for int.
Build updates
cpp/src/pdlp/CMakeLists.txt
Adds utilities/ping_pong_graph.cu to core LP sources.
Stream binding / stream API changes
cpp/src/pdlp/initial_scaling_strategy/initial_scaling.cu, cpp/src/pdlp/step_size_strategy/adaptive_step_size_strategy.cu, cpp/src/pdlp/step_size_strategy/adaptive_step_size_strategy.hpp, cpp/src/pdlp/pdlp.cu, cpp/src/pdlp/pdhg.cu
Replaces Raft stream wrapper usage with explicit stream_view_.value() (raw cudaStream_t), sets cuBLAS stream via cublasSetStream(), removes parallel stream-pool members and sync, and updates kernel/BLAS/cuSPARSE calls accordingly.
Numerical validation & safety checks
cpp/src/pdlp/pdhg.cu, cpp/src/pdlp/pdlp.cu
Adds runtime NaN/Inf/positivity assertions across projection, bulk ops, fixed-error computations, per-climber checks, restart invariants, and more explicit stream sync points.
Reductions: thrust -> cub
cpp/src/pdlp/pdlp.cu, cpp/src/pdlp/termination_strategy/convergence_information.cu, cpp/src/pdlp/utils.cuh
Replaces thrust transform_reduce/device_ptr patterns with cub::DeviceReduce using transform iterators and explicit temporary device buffers for max/norm reductions.
Ping-pong util additions (impl details)
cpp/src/pdlp/utilities/ping_pong_graph.cu
Implements lifecycle methods, capture/instantiate/destroy logic, and debug-mode guards for CUDA graph caching and launch.
Branch & Bound handle change
cpp/src/branch_and_bound/pseudo_costs.cpp
Creates local raft::handle_t batch_pdlp_handle and passes its address into batch_pdlp_solve() instead of original problem handle pointer.
Python config, docs & tests
python/cuopt_server/cuopt_server/utils/linear_programming/data_definition.py, docs/cuopt/source/lp-qp-milp-settings.rst, python/cuopt/cuopt/tests/linear_programming/test_python_API.py
Adds mip_batch_pdlp_strong_branching SolverConfig field, documents CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING, exposes constant in Python API, and adds test(s) for toggling batch PDLP strong branching (duplicate test appears added twice).

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Fix batch pdlp and python api support' directly addresses the core issues resolved in this PR: batch PDLP crash fixes and Python API enablement for batch PDLP toggling.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (1)
python/cuopt_server/cuopt_server/utils/linear_programming/data_definition.py (1)

453-457: Prefer Optional[bool] for consistency with other binary flags in SolverConfig.

The description documents strictly 0 (disable) / 1 (enable) semantics, which is identical to the Optional[bool] fields already in this class (mip_scaling, mip_heuristics_only, crossover, eliminate_dense_columns, etc.). Using Optional[int] instead silently accepts arbitrary integers (e.g. 2, -1) and diverges from the established convention.

♻️ Proposed refactor
-    mip_batch_pdlp_strong_branching: Optional[int] = Field(
-        default=0,
+    mip_batch_pdlp_strong_branching: Optional[bool] = Field(
+        default=False,
         description="Set 1 to enable batch PDLP strong branching "
         "in the MIP solver, 0 to disable.",
     )

If the backing C++ parameter expects an integer, the conversion layer can cast bool → int at call time, keeping the Python API type-safe without any burden on the caller.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuopt_server/cuopt_server/utils/linear_programming/data_definition.py`
around lines 453 - 457, The field mip_batch_pdlp_strong_branching is typed as
Optional[int] but should follow the class convention of binary flags by using
Optional[bool]; change its annotation to Optional[bool] and update the Field
default from 0 to False (or None if other similar flags use None) so callers get
a boolean API, and ensure the call site that passes this to the C++ layer casts
the bool to int (bool → int) when invoking the underlying solver; locate the
field definition on the SolverConfig (mip_batch_pdlp_strong_branching) and make
this replacement to keep behavior consistent with mip_scaling,
mip_heuristics_only, crossover, eliminate_dense_columns, etc.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/src/pdlp/pdlp.cu`:
- Around line 1429-1447: The current strict cuopt_assert on (movement +
computed_interaction) can fail due to tiny negative rounding errors; change the
check to compute f_t sum = movement + computed_interaction, use a small
tolerance epsilon (e.g., f_t(-1e-12) or similar relative epsilon), clamp sum to
f_t(0.0) if it is negative but within the epsilon, and only call
cuopt_assert(sum >= f_t(0.0)) for values below that tolerance; then assign
*fixed_point_error = cuda::std::sqrt(sum). Update the code around the symbols
movement, computed_interaction, fixed_point_error, and the existing cuopt_assert
to implement this tolerant/clamp behavior.
- Around line 1191-1226: The CUB reductions in compute_stats ignore returned
cudaError_t and will misbehave for empty inputs; wrap every
cub::DeviceReduce::Reduce call with RAFT_CUDA_TRY and add an early empty-vector
guard (check vec.size() == 0) before creating rmm::device_scalar<> objects and
invoking reductions; for the empty case set smallest, largest, and avg to
sensible defaults (or return early) to avoid running transforms (min_nonzero,
abs_iter) and to prevent the avg = d_sum.value(...) / vec.size() divide-by-zero
using the d_smallest, d_largest, d_sum, min_nonzero, abs_iter, and
cub::DeviceReduce::Reduce symbols to locate the changes.

In `@cpp/src/pdlp/termination_strategy/convergence_information.cu`:
- Around line 402-419: The CUB reduction calls using cub::DeviceReduce::Max (the
two invocations that pass transform_iter, linf_primal_residual_.data(),
primal_residual_.size(), stream_view_ and the analogous dual-residual calls
later) are not being checked for CUDA errors; wrap each cub::DeviceReduce::Max
call (both the temp storage query and the actual reduction) with RAFT_CUDA_TRY
so returned RAFT/CUDA errors are propagated consistently (apply the same
RAFT_CUDA_TRY wrapping to the dual residual block around the calls that use
sibling variables like dual_transform_iter, linf_dual_residual_, dual_residual_
and their temp_buf/stream_view_).

In `@cpp/src/pdlp/utilities/ping_pong_graph.cu`:
- Around line 56-118: The code currently uses RAFT_CUDA_TRY_NO_THROW but updates
flags unconditionally; change start_capture, end_capture, launch, and
is_initialized so that you capture return statuses from the CUDA calls
(cudaStreamBeginCapture, cudaStreamEndCapture, cudaGraphInstantiate,
cudaGraphDestroy, etc.) instead of blindly relying on RAFT_CUDA_TRY_NO_THROW,
and only set capture_even_active_/capture_odd_active_ and
even_initialized/odd_initialized when the respective CUDA call succeeds; on
failure, roll back any partial state (e.g., reset capture_*_active_, destroy any
created graph if instantiate failed, and ensure even_instance/odd_instance
remain unset) so launch() and is_initialized() only report graph mode when
even_instance/odd_instance are valid; locate changes in start_capture,
end_capture, launch, and is_initialized and use the symbols
RAFT_CUDA_TRY_NO_THROW, capture_even_active_, capture_odd_active_,
even_initialized, odd_initialized, even_instance, odd_instance, even_graph, and
odd_graph to implement the gating and cleanup.
- Around line 24-51: The cancel_active_capture() destructor-invoked cleanup
currently uses RAFT_CUDA_TRY (which can throw) for cudaStreamEndCapture; replace
those RAFT_CUDA_TRY calls with RAFT_CUDA_TRY_NO_THROW so exceptions are not
thrown from cancel_active_capture() (which may be called from the noexcept
destructor); ensure the two calls referencing even_graph and odd_graph use
RAFT_CUDA_TRY_NO_THROW and keep subsequent
RAFT_CUDA_TRY_NO_THROW(cudaGraphDestroy(...)) and
capture_even_active_/capture_odd_active_ handling unchanged.

In `@cpp/src/pdlp/utils.cuh`:
- Around line 607-618: The two calls to cub::DeviceReduce::Max in
pdlp::utils.cuh are not checking CUDA errors; wrap both invocations with
RAFT_CUDA_TRY so any cudaError_t returned is checked. Locate the block using
handle_ptr->get_stream(), abs_iter, and temp_bytes (the first sizing call) and
the second call that passes temp_buf.data(), and prefix each
cub::DeviceReduce::Max(...) call with RAFT_CUDA_TRY(...). Ensure you include the
same arguments inside RAFT_CUDA_TRY and keep temp_buf allocation and stream
usage unchanged.

---

Nitpick comments:
In
`@python/cuopt_server/cuopt_server/utils/linear_programming/data_definition.py`:
- Around line 453-457: The field mip_batch_pdlp_strong_branching is typed as
Optional[int] but should follow the class convention of binary flags by using
Optional[bool]; change its annotation to Optional[bool] and update the Field
default from 0 to False (or None if other similar flags use None) so callers get
a boolean API, and ensure the call site that passes this to the C++ layer casts
the bool to int (bool → int) when invoking the underlying solver; locate the
field definition on the SolverConfig (mip_batch_pdlp_strong_branching) and make
this replacement to keep behavior consistent with mip_scaling,
mip_heuristics_only, crossover, eliminate_dense_columns, etc.

Comment thread cpp/src/pdlp/pdlp.cu Outdated
Comment thread cpp/src/pdlp/pdlp.cu Outdated
Comment thread cpp/src/pdlp/termination_strategy/convergence_information.cu Outdated
Comment thread cpp/src/pdlp/utilities/ping_pong_graph.cu Outdated
Comment thread cpp/src/pdlp/utilities/ping_pong_graph.cu
Comment thread cpp/src/pdlp/utils.cuh
// We need to make sure both dot products happens after previous operations (next_primal/dual)
// Thus, we add another node in the main stream before starting the SpMVs

if (!batch_mode_) deltas_are_done_.record(stream_view_);
Contributor


Is removing those deltas_are_done_ lines intentional?

Contributor Author


Yes. It's actually illegal to use the same cuBLAS handle across different streams, which is what I was doing here. The error would trigger extremely rarely, but it could result in an incorrect dot product.

if (!is_legacy_batch_mode_) {
// This should not happen, but in case a graph was capturing while destroying the object
if (capture_even_active_ || capture_odd_active_) {
cancel_active_capture();
Contributor


I think this might hide real errors. If this is destroyed while its own stream is capturing, it is a logic issue. If this is destroyed while another stream/thread is capturing, we will lose the graph on the other stream/thread and cudaGraphLaunch will launch an incomplete/empty graph.

Contributor Author


Yes, I agree. This was meant as a fail-safe in case the PDLP object is destroyed while a capture is happening, which is unintended behavior. Hence the log error above. Would you still kill the application?

Contributor Author


This behavior has been changed now

Comment thread cpp/src/pdlp/pdlp.cu Outdated
Contributor Author

Kh4ster commented Feb 19, 2026

/ok to test


copy-pr-bot bot commented Feb 19, 2026

/ok to test

@Kh4ster, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

Contributor Author

Kh4ster commented Feb 19, 2026

/ok to test fe31e7e


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
cpp/src/pdlp/utilities/ping_pong_graph.cuh (1)

25-43: ⚠️ Potential issue | 🟠 Major

Add = delete for copy constructor and copy assignment operator to prevent double-free of CUDA handles.

cudaGraph_t and cudaGraphExec_t are opaque handles (effectively non-owning pointers). The compiler-generated copy constructor and copy assignment operator will perform a shallow copy of these handles. If two ping_pong_graph_t instances end up holding the same handle (e.g., via a stray copy or STL container resize), the destructor of each will independently call cudaGraphDestroy/cudaGraphExecDestroy on the same resource, causing a double-free crash or silent corruption.

🛡️ Proposed fix: explicitly delete copy and move semantics
 public:
   ping_pong_graph_t(rmm::cuda_stream_view stream_view, bool is_legacy_batch_mode = false);
   ~ping_pong_graph_t();
+
+  // Non-copyable, non-movable: owns CUDA graph handles
+  ping_pong_graph_t(const ping_pong_graph_t&)            = delete;
+  ping_pong_graph_t& operator=(const ping_pong_graph_t&) = delete;
+  ping_pong_graph_t(ping_pong_graph_t&&)                 = delete;
+  ping_pong_graph_t& operator=(ping_pong_graph_t&&)      = delete;

If move semantics are desired in the future (e.g., for storing in a container), a proper move constructor that nulls out the source handles after transfer should be implemented explicitly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/pdlp/utilities/ping_pong_graph.cuh` around lines 25 - 43, Declare the
class copy constructor and copy assignment operator for ping_pong_graph_t as
deleted to prevent shallow copies of the CUDA handles (i.e., add
ping_pong_graph_t(const ping_pong_graph_t&) = delete; and ping_pong_graph_t&
operator=(const ping_pong_graph_t&) = delete; in the class declaration), leaving
move semantics unimplemented for now (or explicitly delete them too if you want
fully non-movable instances) so that even_graph, odd_graph, even_instance and
odd_instance cannot be copied and risk double-free in the destructor.
cpp/src/pdlp/step_size_strategy/adaptive_step_size_strategy.cu (1)

418-497: ⚠️ Potential issue | 🟠 Major

Wrap CUB calls with error checking to match codebase patterns.

The CUB DeviceTransform::Transform and DeviceSegmentedReduce::Sum calls at lines 418–497 lack error checking. This is inconsistent with error-checked versions of the same functions earlier in the file (lines 68, 80, 90). Wrap all four calls with RAFT_CUDA_TRY to align with coding guidelines: "Every CUDA kernel launch and memory operation must have error checking with CUDA_CHECK or equivalent verification."

✅ Suggested fix
-  cub::DeviceTransform::Transform(
+  RAFT_CUDA_TRY(cub::DeviceTransform::Transform(
     cuda::std::make_tuple(current_saddle_point_state.get_next_AtY().data(),
                           current_saddle_point_state.get_current_AtY().data()),
     tmp_primal.data(),
     tmp_primal.size(),
     cuda::std::minus<>{},
-    stream_view_.value());
+    stream_view_.value()));

...
-    cub::DeviceSegmentedReduce::Sum(
+    RAFT_CUDA_TRY(cub::DeviceSegmentedReduce::Sum(
       dot_product_storage.data(),
       dot_product_bytes,
       thrust::make_transform_iterator(
         thrust::make_zip_iterator(tmp_primal.data(),
                                   current_saddle_point_state.get_delta_primal().data()),
         tuple_multiplies<f_t>{}),
       interaction_.data(),
       climber_strategies_.size(),
       primal_size_,
-      stream_view_.value());
+      stream_view_.value()));

-    cub::DeviceSegmentedReduce::Sum(
+    RAFT_CUDA_TRY(cub::DeviceSegmentedReduce::Sum(
       dot_product_storage.data(),
       dot_product_bytes,
       thrust::make_transform_iterator(current_saddle_point_state.get_delta_primal().data(),
                                       power_two_func_t<f_t>{}),
       norm_squared_delta_primal_.data(),
       climber_strategies_.size(),
       primal_size_,
-      stream_view_.value());
+      stream_view_.value()));

-    cub::DeviceSegmentedReduce::Sum(
+    RAFT_CUDA_TRY(cub::DeviceSegmentedReduce::Sum(
       dot_product_storage.data(),
       dot_product_bytes,
       thrust::make_transform_iterator(current_saddle_point_state.get_delta_dual().data(),
                                       power_two_func_t<f_t>{}),
       norm_squared_delta_dual_.data(),
       climber_strategies_.size(),
       dual_size_,
-      stream_view_.value());
+      stream_view_.value()));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/pdlp/step_size_strategy/adaptive_step_size_strategy.cu` around lines
418 - 497, The four CUB calls (cub::DeviceTransform::Transform and the three
cub::DeviceSegmentedReduce::Sum invocations that use tmp_primal,
tuple_multiplies<f_t>, and power_two_func_t<f_t>) are not wrapped with
RAFT_CUDA_TRY; wrap each of those calls with RAFT_CUDA_TRY(...) exactly as other
CUB/CUDA calls in this file so CUDA errors are checked (preserve the original
call arguments including stream_view_.value()).
🧹 Nitpick comments (6)
benchmarks/linear_programming/cuopt/run_pdlp.cu (1)

81-88: Line 87 fallback return is unreachable dead code.

Since argparse's .choices() already rejects any string outside {"None", "Papilo", "PSLP", "Default"}, control can never reach line 87. The silent return Default hides potential future bugs (e.g. a new enum value added to presolver_t without updating this mapping). Consider replacing it with an assertion to make the invariant explicit:

🔧 Suggested change
 static cuopt::linear_programming::presolver_t string_to_presolver(const std::string& presolver)
 {
   if (presolver == "None") return cuopt::linear_programming::presolver_t::None;
   if (presolver == "Papilo") return cuopt::linear_programming::presolver_t::Papilo;
   if (presolver == "PSLP") return cuopt::linear_programming::presolver_t::PSLP;
   if (presolver == "Default") return cuopt::linear_programming::presolver_t::Default;
-  return cuopt::linear_programming::presolver_t::Default;
+  throw std::invalid_argument("Unknown presolver: " + presolver);
 }

This mirrors how string_to_pdlp_solver_mode would ideally behave and surfaces mapping gaps immediately at compile/test time if new enum values are added.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/linear_programming/cuopt/run_pdlp.cu` around lines 81 - 88,
Replace the silent fallback in string_to_presolver so an unexpected string
cannot quietly map to Default: remove the unreachable final "return Default" and
instead assert or throw (e.g., use assert(false) or throw std::invalid_argument)
to make the invariant explicit for cuopt::linear_programming::presolver_t
mapping; reference string_to_presolver and presolver_t (and mirror the behavior
used by string_to_pdlp_solver_mode) so any missing mapping surfaces immediately
during tests/compilation.
cpp/src/pdlp/utilities/ping_pong_graph.cuh (1)

34-37: Initialize CUDA handles to nullptr to prevent accidental use of garbage handle values.

even_graph, odd_graph, even_instance, and odd_instance are uninitialized. While even_initialized/odd_initialized guard their use in normal paths, any future code path that bypasses those flags — or any static analysis/sanitizer run — will observe undefined values. Explicit nullptr initialization is zero-cost and aligns with the guideline for tracking CUDA resource lifecycle.

♻️ Proposed fix: value-initialize CUDA handles
  private:
-  cudaGraph_t even_graph;
-  cudaGraph_t odd_graph;
-  cudaGraphExec_t even_instance;
-  cudaGraphExec_t odd_instance;
+  cudaGraph_t even_graph{nullptr};
+  cudaGraph_t odd_graph{nullptr};
+  cudaGraphExec_t even_instance{nullptr};
+  cudaGraphExec_t odd_instance{nullptr};

As per coding guidelines, tracking GPU resource lifecycle and ensuring cleanup of CUDA handles requires proper initialization state.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/pdlp/utilities/ping_pong_graph.cuh` around lines 34 - 37, The CUDA
handle variables even_graph, odd_graph, even_instance, and odd_instance are
uninitialized; initialize them to nullptr (or equivalent zero-initialization)
where they are declared so they start in a known state and won't contain garbage
values, then keep the existing even_initialized/odd_initialized guards and
cleanup logic intact to manage lifecycle; update the declarations of even_graph,
odd_graph, even_instance, and odd_instance to value-initialize them (e.g., =
nullptr) so static analysis/sanitizers and future code paths see a safe default.
docs/cuopt/source/lp-qp-milp-settings.rst (1)

520-521: Wording ambiguity: "in parallel" appears to modify Dual Simplex.

The phrase "rather than solving them in parallel using Dual Simplex" grammatically attributes parallelism to Dual Simplex, which is a sequential solver. The intended contrast is batch PDLP (parallel/simultaneous) vs. Dual Simplex (sequential). Consider rephrasing, e.g.:

"…rather than solving them sequentially using Dual Simplex."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/cuopt/source/lp-qp-milp-settings.rst` around lines 520 - 521, The
wording in the description for CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING ambiguously
attributes parallelism to Dual Simplex; update the sentence that currently reads
"rather than solving them in parallel using Dual Simplex" to clearly contrast
batched PDLP (simultaneous/parallel evaluation) with Dual Simplex by rephrasing
it to something like "rather than solving them sequentially using Dual Simplex"
so the intended comparison is unambiguous; update the docstring for
CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING accordingly.
python/cuopt/cuopt/tests/linear_programming/test_python_API.py (2)

1003-1017: Problem setup duplicates test_cuts verbatim — extract a shared helper.

The variable declarations, constraints, and objective (lines 1009–1017) are identical to test_cuts (lines 971–979). Consider a shared factory, e.g.:

♻️ Proposed refactor
+def _make_knapsack_problem():
+    """Shared 3-variable 0-1 MIP used in multiple tests."""
+    problem = Problem()
+    y1 = problem.addVariable(lb=0, ub=1, vtype=INTEGER, name="y1")
+    y2 = problem.addVariable(lb=0, ub=1, vtype=INTEGER, name="y2")
+    y3 = problem.addVariable(lb=0, ub=1, vtype=INTEGER, name="y3")
+    problem.addConstraint(774 * y1 + 76 * y2 + 42 * y3 <= 875)
+    problem.addConstraint(67 * y1 + 27 * y2 + 53 * y3 <= 875)
+    problem.setObjective(-86 * y1 - 4 * y2 - 40 * y3)
+    return problem

 def test_cuts():
-    problem = Problem()
-    y1 = problem.addVariable(lb=0, ub=1, vtype=INTEGER, name="y1")
-    y2 = problem.addVariable(lb=0, ub=1, vtype=INTEGER, name="y2")
-    y3 = problem.addVariable(lb=0, ub=1, vtype=INTEGER, name="y3")
-    problem.addConstraint(774 * y1 + 76 * y2 + 42 * y3 <= 875)
-    problem.addConstraint(67 * y1 + 27 * y2 + 53 * y3 <= 875)
-    problem.setObjective(-86 * y1 - 4 * y2 - 40 * y3)
+    problem = _make_knapsack_problem()
     ...

 def test_batch_pdlp_strong_branching():
-    problem = Problem()
-    y1 = problem.addVariable(lb=0, ub=1, vtype=INTEGER, name="y1")
-    ...
+    problem = _make_knapsack_problem()
     ...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuopt/cuopt/tests/linear_programming/test_python_API.py` around lines
1003 - 1017, The test duplicates the problem construction in
test_batch_pdlp_strong_branching and test_cuts; factor out the repeated setup
into a shared helper (e.g., a function like build_y1y2y3_problem or
make_binary_three_var_problem) that creates the Problem, adds variables
y1,y2,y3, adds the two constraints and sets the objective, and returns the
Problem (and variables if tests need them); then replace the inline block in
both test_batch_pdlp_strong_branching and test_cuts to call that helper (use the
existing symbols Problem, addVariable, addConstraint, setObjective, and variable
names y1,y2,y3 to find and replace the duplicated code).

1003-1032: Missing coverage: parameter should be ignored on LP problems (per docs).

The documentation states "This setting is ignored if the problem is not a MIP problem." The test only exercises MIP scenarios. Per the project's test-coverage guidelines, edge behavior (silently ignoring a flag on non-qualifying problems) should be tested to prevent future regressions where the parameter accidentally affects LP solves.

Consider adding a brief LP variant that sets CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING=1 and asserts Status == Optimal and a correct objective.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuopt/cuopt/tests/linear_programming/test_python_API.py` around lines
1003 - 1032, Add a small LP variant to the test to verify the flag is ignored
for non-MIP problems: create a new Problem (or reuse by recreating variables)
using continuous variables (omit vtype or use CONTINUOUS) with the same
constraints/objective as in test_batch_pdlp_strong_branching, call
settings.set_parameter(CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING, 1) and solve; then
assert problem.Status.name == "Optimal" and problem.ObjValue ==
pytest.approx(-126, abs=1e-3) to ensure the parameter does not affect LP solves.
Use the same symbols (Problem, settings.set_parameter,
CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING, solve, Status, ObjValue,
test_batch_pdlp_strong_branching) so the new assertions align with the existing
test structure.
cpp/src/pdlp/pdlp.cu (1)

1927-1927: cublasSetStream correctly ensures stream binding before cublasDgeam; minor inconsistency with .value().

The explicit stream binding is the correct fix for the handle-sharing issue described in the PR. stream_view_ (of type rmm::cuda_stream_view) implicitly converts to cudaStream_t here, but all other cuSPARSE/cuBLAS call sites in this PR (lines 2744, 2757, 2766) use explicit .value(). Consider using stream_view_.value() here and at line 2004 for consistency.

♻️ Proposed consistency fix
-  RAFT_CUBLAS_TRY(cublasSetStream(handle_ptr_->get_cublas_handle(), stream_view_));
+  RAFT_CUBLAS_TRY(cublasSetStream(handle_ptr_->get_cublas_handle(), stream_view_.value()));

Apply the same change at line 2004.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/pdlp/pdlp.cu` at line 1927, The cublas stream is correctly set via
RAFT_CUBLAS_TRY(cublasSetStream(handle_ptr_->get_cublas_handle(),
stream_view_)), but for consistency with other cuBLAS/cuSPARSE call sites use
the explicit rmm::cuda_stream_view::value(); change the second argument to
stream_view_.value() in the RAFT_CUBLAS_TRY(cublasSetStream(...)) call (and make
the same replacement for the other occurrence around the cublasDgeam usage
referenced at line ~2004) so all calls consistently pass a cudaStream_t via
stream_view_.value().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/src/pdlp/pdlp.cu`:
- Around line 1861-1864: The cudaPeekAtLastError() call is placed after
RAFT_CUDA_TRY(cudaStreamSynchronize(...)) making it redundant; move the
RAFT_CUDA_TRY(cudaPeekAtLastError()) to immediately follow the kernel launch
(before RAFT_CUDA_TRY(cudaStreamSynchronize(stream_view_))) so that
launch/configuration errors are caught early, then keep the stream sync
(RAFT_CUDA_TRY(cudaStreamSynchronize(stream_view_))) to detect execution errors
and ensure device-to-host writes complete.

In `@python/cuopt/cuopt/tests/linear_programming/test_python_API.py`:
- Around line 1024-1032: The second solve may return a cached solution because
only settings were changed; force a true re-solve or validate the solver ran
under CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING by marking the model dirty or
recreating it before the second problem.solve(settings). For example, call a
method that updates the model state (e.g., problem.setObjective(...) or a
problem.reset()/rebuild/recreate of the Problem object) or assert a runtime stat
changed (e.g., SolutionStats.num_nodes increased) to ensure the second solve
actually exercised the CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING=1 path instead of
returning a cached result.

---

Outside diff comments:
In `@cpp/src/pdlp/step_size_strategy/adaptive_step_size_strategy.cu`:
- Around line 418-497: The four CUB calls (cub::DeviceTransform::Transform and
the three cub::DeviceSegmentedReduce::Sum invocations that use tmp_primal,
tuple_multiplies<f_t>, and power_two_func_t<f_t>) are not wrapped with
RAFT_CUDA_TRY; wrap each of those calls with RAFT_CUDA_TRY(...) exactly as other
CUB/CUDA calls in this file so CUDA errors are checked (preserve the original
call arguments including stream_view_.value()).

In `@cpp/src/pdlp/utilities/ping_pong_graph.cuh`:
- Around line 25-43: Declare the class copy constructor and copy assignment
operator for ping_pong_graph_t as deleted to prevent shallow copies of the CUDA
handles (i.e., add ping_pong_graph_t(const ping_pong_graph_t&) = delete; and
ping_pong_graph_t& operator=(const ping_pong_graph_t&) = delete; in the class
declaration), leaving move semantics unimplemented for now (or explicitly delete
them too if you want fully non-movable instances) so that even_graph, odd_graph,
even_instance and odd_instance cannot be copied and risk double-free in the
destructor.
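
The non-copyable pattern suggested above can be sketched with a plain C++ stand-in; the `ping_pong_graph_sketch` name and `fake_handle` type are illustrative only, not the actual cuOpt declarations, and the sketch also folds in the nullptr value-initialization raised in the nitpick below:

```cpp
#include <cassert>
#include <type_traits>

// Stand-in for an opaque CUDA handle such as cudaGraph_t / cudaGraphExec_t.
struct fake_handle;

// Hypothetical sketch mirroring the suggestion: copy operations deleted so two
// objects can never own the same raw handles and double-free them, and every
// handle value-initialized to nullptr so the destructor never sees garbage.
struct ping_pong_graph_sketch {
  ping_pong_graph_sketch() = default;

  ping_pong_graph_sketch(const ping_pong_graph_sketch&)            = delete;
  ping_pong_graph_sketch& operator=(const ping_pong_graph_sketch&) = delete;

  ~ping_pong_graph_sketch()
  {
    // Real code would destroy even_graph/odd_graph here, guarded by the
    // even_initialized/odd_initialized flags.
  }

  fake_handle* even_graph{nullptr};
  fake_handle* odd_graph{nullptr};
  fake_handle* even_instance{nullptr};
  fake_handle* odd_instance{nullptr};
  bool even_initialized{false};
  bool odd_initialized{false};
};

// Declaring (even deleted) copy operations also suppresses the implicit moves,
// so the type ends up fully non-copyable and non-movable, as the comment allows.
static_assert(!std::is_copy_constructible_v<ping_pong_graph_sketch>);
static_assert(!std::is_move_constructible_v<ping_pong_graph_sketch>);
```

Attempting to copy such an object is then rejected at compile time rather than producing a double-free at runtime.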

---

Duplicate comments:
In `@cpp/src/pdlp/pdlp.cu`:
- Around line 1456-1460: The assert cuopt_assert(movement + computed_interaction
>= f_t(0.0), ...) is too strict for floating-point rounding; replace it with an
epsilon-tolerant check that allows tiny negative rounding errors before
clamping. Compute an epsilon relative to the magnitude of operands (e.g., using
std::numeric_limits<f_t>::epsilon() times max(abs(movement),
abs(computed_interaction), f_t(1))) and assert movement + computed_interaction
>= -epsilon, then continue to use cuda::std::max(...) and the existing sqrt into
*fixed_point_error; keep the original message but note the tolerance in the
condition.
- Line 1194: cuopt_assert(vec.size() > 0, "Vector must not be empty") is not
sufficient because asserts can be compiled out; add an explicit runtime check
for vec.size() == 0 before computing avg (the line using avg =
d_sum.value(stream) / vec.size()) and perform the appropriate early return or
empty-result handling instead of relying on cuopt_assert; update the function
containing cuopt_assert, the avg variable calculation, and the
d_sum.value(stream) usage so the divide-by-zero cannot occur in release builds.
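
As a host-side illustration of the epsilon-tolerant check suggested above, assuming `f_t` is a floating-point type (this is a sketch of the proposed condition, not the cuOpt code):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>

// Tolerance scaled to the operands' magnitude (floored at 1), so tiny negative
// sums caused purely by rounding pass, while genuinely negative sums fail.
template <typename f_t>
bool non_negative_within_rounding(f_t movement, f_t computed_interaction)
{
  const f_t scale =
    std::max({std::abs(movement), std::abs(computed_interaction), f_t(1)});
  const f_t eps = std::numeric_limits<f_t>::epsilon() * scale;
  return movement + computed_interaction >= -eps;
}
```

After the assert passes, the sum would still be clamped to zero (the existing `cuda::std::max(...)`) before taking the square root into `*fixed_point_error`.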

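The explicit empty-vector guard proposed above can be illustrated with a host-side analog (a hypothetical `safe_average` helper; the real code divides a device-side sum by `vec.size()`):

```cpp
#include <cassert>
#include <numeric>
#include <optional>
#include <vector>

// Returns nullopt for an empty input instead of dividing by zero; a plain
// assert would be compiled out in release builds, so the size check must be
// a real branch.
template <typename f_t>
std::optional<f_t> safe_average(const std::vector<f_t>& vec)
{
  if (vec.empty()) { return std::nullopt; }
  const f_t sum = std::accumulate(vec.begin(), vec.end(), f_t(0));
  return sum / static_cast<f_t>(vec.size());
}
```
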
In `@cpp/src/pdlp/termination_strategy/convergence_information.cu`:
- Around line 461-479: The two cub::DeviceReduce::Max calls using
transform_iter, linf_dual_residual_.data(), temp_storage_bytes, and stream_view_
need CUDA error checking: wrap each cub::DeviceReduce::Max invocation with
RAFT_CUDA_TRY (or the project’s CUDA_CHECK macro) so failures are caught (first
call that computes temp_storage_bytes and the second call that writes into
temp_buf.data()); ensure both calls use the same error-check wrapper and keep
existing variables transform_iter, temp_buf, linf_dual_residual_.data(), and
stream_view_ unchanged.

In `@cpp/src/pdlp/utils.cuh`:
- Around line 607-615: The two cub::DeviceReduce::Max calls in the block that
computes the max absolute value (using abs_iter, result, n, stream) are missing
CUDA error checks; wrap both calls with RAFT_CUDA_TRY (or the project equivalent
CUDA_CHECK) so failures are surfaced — i.e., call
RAFT_CUDA_TRY(cub::DeviceReduce::Max(d_temp, temp_bytes, abs_iter, result, n,
stream)); and RAFT_CUDA_TRY(cub::DeviceReduce::Max(temp_buf.data(), temp_bytes,
abs_iter, result, n, stream)); ensuring you keep the existing variables (d_temp,
temp_bytes, temp_buf, abs_iter, result, n, stream) unchanged.

---

Nitpick comments:
In `@benchmarks/linear_programming/cuopt/run_pdlp.cu`:
- Around line 81-88: Replace the silent fallback in string_to_presolver so an
unexpected string cannot quietly map to Default: remove the unreachable final
"return Default" and instead assert or throw (e.g., use assert(false) or throw
std::invalid_argument) to make the invariant explicit for
cuopt::linear_programming::presolver_t mapping; reference string_to_presolver
and presolver_t (and mirror the behavior used by string_to_pdlp_solver_mode) so
any missing mapping surfaces immediately during tests/compilation.


In `@cpp/src/pdlp/utilities/ping_pong_graph.cuh`:
- Around line 34-37: The CUDA handle variables even_graph, odd_graph,
even_instance, and odd_instance are uninitialized; initialize them to nullptr
(or equivalent zero-initialization) where they are declared so they start in a
known state and won't contain garbage values, then keep the existing
even_initialized/odd_initialized guards and cleanup logic intact to manage
lifecycle; update the declarations of even_graph, odd_graph, even_instance, and
odd_instance to value-initialize them (e.g., = nullptr) so static
analysis/sanitizers and future code paths see a safe default.

In `@docs/cuopt/source/lp-qp-milp-settings.rst`:
- Around line 520-521: The wording in the description for
CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING ambiguously attributes parallelism to Dual
Simplex; update the sentence that currently reads "rather than solving them in
parallel using Dual Simplex" to clearly contrast batched PDLP
(simultaneous/parallel evaluation) with Dual Simplex by rephrasing it to
something like "rather than solving them sequentially using Dual Simplex" so the
intended comparison is unambiguous; update the docstring for
CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING accordingly.

In `@python/cuopt/cuopt/tests/linear_programming/test_python_API.py`:
- Around line 1003-1017: The test duplicates the problem construction in
test_batch_pdlp_strong_branching and test_cuts; factor out the repeated setup
into a shared helper (e.g., a function like build_y1y2y3_problem or
make_binary_three_var_problem) that creates the Problem, adds variables
y1,y2,y3, adds the two constraints and sets the objective, and returns the
Problem (and variables if tests need them); then replace the inline block in
both test_batch_pdlp_strong_branching and test_cuts to call that helper (use the
existing symbols Problem, addVariable, addConstraint, setObjective, and variable
names y1,y2,y3 to find and replace the duplicated code).

Comment thread cpp/src/pdlp/pdlp.cu
Comment on lines 1861 to +1864
RAFT_CUDA_TRY(cudaStreamSynchronize(
stream_view_)); // To make sure all the data is written from device to host
RAFT_CUDA_TRY(cudaPeekAtLastError());

⚠️ Potential issue | 🟡 Minor

cudaPeekAtLastError() is placed after cudaStreamSynchronize() — it is redundant and should precede the sync.

RAFT_CUDA_TRY(cudaStreamSynchronize(...)) already surfaces both launch-configuration errors (via the sticky error bit) and kernel execution errors. The cudaPeekAtLastError() call that follows can never observe a new error not already caught by the sync. To properly detect kernel launch errors (e.g., invalid grid dimensions) before blocking, cudaPeekAtLastError() must come immediately after the kernel launch.

🛠️ Proposed fix
     kernel_compute_fixed_error<f_t><<<grid_size, block_size, 0, stream_view_>>>(
       make_span(step_size_strategy_.get_norm_squared_delta_primal()),
       make_span(step_size_strategy_.get_norm_squared_delta_dual()),
       make_span(primal_weight_),
       make_span(step_size_),
       make_span(step_size_strategy_.get_interaction()),
       make_span(restart_strategy_.fixed_point_error_));
+    RAFT_CUDA_TRY(cudaPeekAtLastError());
     RAFT_CUDA_TRY(cudaStreamSynchronize(
       stream_view_));  // To make sure all the data is written from device to host
-    RAFT_CUDA_TRY(cudaPeekAtLastError());

Comment on lines +1024 to +1032
problem.solve(settings)
assert problem.Status.name == "Optimal"
assert problem.ObjValue == pytest.approx(-126, abs=1e-3)

settings.set_parameter(CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING, 1)

problem.solve(settings)
assert problem.Status.name == "Optimal"
assert problem.ObjValue == pytest.approx(-126, abs=1e-3)
⚠️ Potential issue | 🟡 Minor

Second solve lacks an explicit problem reset — test may not exercise the batch PDLP=1 path.

Between the two problem.solve(settings) calls (lines 1024 and 1030), only a solver setting is changed; no setObjective, addConstraint, or updateConstraint is called to mark the problem dirty. Depending on how the solver determines whether a re-solve is needed, the second call may return the cached solution without running under CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING=1, causing the second assert to pass trivially.

Compare test_cuts (lines 983–999): it also calls problem.solve twice, but the second call is guarded by a changed setting value that clearly directs a different code path (different cut-pass count). The existing test pattern in test_model calls setObjective between solves to force a fresh state.

Adding an explicit re-construction of the problem object or at minimum verifying a state indicator (e.g., SolutionStats.num_nodes) would make the second assertion meaningful.


@Kh4ster
Contributor Author

Kh4ster commented Feb 19, 2026

/ok to test d89af96

@Kh4ster
Contributor Author

Kh4ster commented Feb 19, 2026

/ok to test c7e3e22

@Kh4ster
Contributor Author

Kh4ster commented Feb 19, 2026

/merge

@rapids-bot rapids-bot bot merged commit 9ad8b45 into main Feb 19, 2026
94 checks passed
@Kh4ster Kh4ster deleted the fix_batch_pdlp branch February 19, 2026 15:36

Labels: bug (Something isn't working), mip, non-breaking (Introduces a non-breaking change), pdlp

Projects: None yet

4 participants