
Fix batch pdlp and python api support#893

Merged
rapids-bot[bot] merged 15 commits into main from fix_batch_pdlp
Feb 19, 2026

Conversation

Contributor

@Kh4ster Kh4ster commented Feb 18, 2026

This PR fixes several issues that caused batch PDLP to crash occasionally.

  • Refactored the CUDA graph code
    Moved the CUDA graph logic into a separate .cu file
    Added a fail-safe to avoid destroying the CUDA graph while it is capturing
    Avoided throwing on error, since an exception could trigger destruction while capturing
  • Stopped using different streams in the adaptive step size strategy to avoid potential issues with cuBLAS handle sharing
  • Changed all thrust::transform_reduce calls to cub::DeviceReduce to avoid unintentionally creating new streams and triggering CUDA mallocs
  • Used .value() on streams for all CUB calls to avoid stream issues
  • Added more asserts on NaN
  • Correctly called set_stream before triggering a cublasDgeam
  • Used a separate handle and stream when starting batch PDLP

Additionally, this PR fixes a few compilation issues linked to PDLP.

This PR also allows batch PDLP to be toggled on from the Python side.

Summary by CodeRabbit

  • New Features

    • New string-based presolver option for LP solvers (multiple modes).
    • Added Batch PDLP strong branching toggle for MIP root behavior.
  • Bug Fixes / Stability

    • Improved runtime validation and NaN/Inf checks across solver paths.
    • Stream and reduction changes for more robust GPU execution; CUDA graph capture added for performance.
  • Documentation

    • Settings docs updated to include the new options.
  • Tests

    • Added tests for batch PDLP strong branching.


copy-pr-bot bot commented Feb 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Kh4ster Kh4ster self-assigned this Feb 18, 2026
@Kh4ster Kh4ster added bug Something isn't working pdlp mip labels Feb 18, 2026
@Kh4ster Kh4ster changed the title Fix batch pdlp Fix batch pdlp and python support Feb 18, 2026
@Kh4ster Kh4ster added the non-breaking Introduces a non-breaking change label Feb 18, 2026
@Kh4ster Kh4ster changed the title Fix batch pdlp and python support Fix batch pdlp and python api support Feb 18, 2026

coderabbitai bot commented Feb 18, 2026

No actionable comments were generated in the recent review. 🎉


📝 Walkthrough

Replaces integer presolve with a string-based presolver option; adds ping-pong CUDA graph implementation; migrates many thrust reductions to cub::DeviceReduce; tightens numeric NaN/Inf/runtime checks; consolidates stream handling to raw CUDA streams; exposes a Python setting and tests for batch PDLP strong branching.

Changes

Cohort / File(s) Summary
Argument parser & presolver
benchmarks/linear_programming/cuopt/run_pdlp.cu
Replaces integer presolve with string --presolver accepting None/Papilo/PSLP/Default; adds string_to_presolver() and maps settings to presolver_t (member changed from presolve int to presolver enum).
CUDA graph ping-pong impl & header
cpp/src/pdlp/utilities/ping_pong_graph.cu, cpp/src/pdlp/utilities/ping_pong_graph.cuh
Adds full ping_pong_graph_t<i_t> implementation and declarations, manages even/odd capture/instantiation/launch, adds capture state flags, and provides explicit instantiation for int.
Build updates
cpp/src/pdlp/CMakeLists.txt
Adds utilities/ping_pong_graph.cu to core LP sources.
Stream binding / stream API changes
cpp/src/pdlp/initial_scaling_strategy/initial_scaling.cu, cpp/src/pdlp/step_size_strategy/adaptive_step_size_strategy.cu, cpp/src/pdlp/step_size_strategy/adaptive_step_size_strategy.hpp, cpp/src/pdlp/pdlp.cu, cpp/src/pdlp/pdhg.cu
Replaces Raft stream wrapper usage with explicit stream_view_.value() (raw cudaStream_t), sets cuBLAS stream via cublasSetStream(), removes parallel stream-pool members and sync, and updates kernel/BLAS/cuSPARSE calls accordingly.
Numerical validation & safety checks
cpp/src/pdlp/pdhg.cu, cpp/src/pdlp/pdlp.cu
Adds runtime NaN/Inf/positivity assertions across projection, bulk ops, fixed-error computations, per-climber checks, restart invariants, and more explicit stream sync points.
Reductions: thrust -> cub
cpp/src/pdlp/pdlp.cu, cpp/src/pdlp/termination_strategy/convergence_information.cu, cpp/src/pdlp/utils.cuh
Replaces thrust transform_reduce/device_ptr patterns with cub::DeviceReduce using transform iterators and explicit temporary device buffers for max/norm reductions.
Ping-pong util additions (impl details)
cpp/src/pdlp/utilities/ping_pong_graph.cu
Implements lifecycle methods, capture/instantiate/destroy logic, and debug-mode guards for CUDA graph caching and launch.
Branch & Bound handle change
cpp/src/branch_and_bound/pseudo_costs.cpp
Creates local raft::handle_t batch_pdlp_handle and passes its address into batch_pdlp_solve() instead of original problem handle pointer.
Python config, docs & tests
python/cuopt_server/cuopt_server/utils/linear_programming/data_definition.py, docs/cuopt/source/lp-qp-milp-settings.rst, python/cuopt/cuopt/tests/linear_programming/test_python_API.py
Adds mip_batch_pdlp_strong_branching SolverConfig field, documents CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING, exposes constant in Python API, and adds test(s) for toggling batch PDLP strong branching (duplicate test appears added twice).

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Fix batch pdlp and python api support' directly addresses the core issues resolved in this PR: batch PDLP crash fixes and Python API enablement for batch PDLP toggling.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (1)
python/cuopt_server/cuopt_server/utils/linear_programming/data_definition.py (1)

453-457: Prefer Optional[bool] for consistency with other binary flags in SolverConfig.

The description documents strictly 0 (disable) / 1 (enable) semantics, which is identical to the Optional[bool] fields already in this class (mip_scaling, mip_heuristics_only, crossover, eliminate_dense_columns, etc.). Using Optional[int] instead silently accepts arbitrary integers (e.g. 2, -1) and diverges from the established convention.

♻️ Proposed refactor
-    mip_batch_pdlp_strong_branching: Optional[int] = Field(
-        default=0,
+    mip_batch_pdlp_strong_branching: Optional[bool] = Field(
+        default=False,
         description="Set 1 to enable batch PDLP strong branching "
         "in the MIP solver, 0 to disable.",
     )

If the backing C++ parameter expects an integer, the conversion layer can cast bool → int at call time, keeping the Python API type-safe without any burden on the caller.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuopt_server/cuopt_server/utils/linear_programming/data_definition.py`
around lines 453 - 457, The field mip_batch_pdlp_strong_branching is typed as
Optional[int] but should follow the class convention of binary flags by using
Optional[bool]; change its annotation to Optional[bool] and update the Field
default from 0 to False (or None if other similar flags use None) so callers get
a boolean API, and ensure the call site that passes this to the C++ layer casts
the bool to int (bool → int) when invoking the underlying solver; locate the
field definition on the SolverConfig (mip_batch_pdlp_strong_branching) and make
this replacement to keep behavior consistent with mip_scaling,
mip_heuristics_only, crossover, eliminate_dense_columns, etc.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/src/pdlp/pdlp.cu`:
- Around line 1429-1447: The current strict cuopt_assert on (movement +
computed_interaction) can fail due to tiny negative rounding errors; change the
check to compute f_t sum = movement + computed_interaction, use a small
tolerance epsilon (e.g., f_t(-1e-12) or similar relative epsilon), clamp sum to
f_t(0.0) if it is negative but within the epsilon, and only call
cuopt_assert(sum >= f_t(0.0)) for values below that tolerance; then assign
*fixed_point_error = cuda::std::sqrt(sum). Update the code around the symbols
movement, computed_interaction, fixed_point_error, and the existing cuopt_assert
to implement this tolerant/clamp behavior.
- Around line 1191-1226: The CUB reductions in compute_stats ignore returned
cudaError_t and will misbehave for empty inputs; wrap every
cub::DeviceReduce::Reduce call with RAFT_CUDA_TRY and add an early empty-vector
guard (check vec.size() == 0) before creating rmm::device_scalar<> objects and
invoking reductions; for the empty case set smallest, largest, and avg to
sensible defaults (or return early) to avoid running transforms (min_nonzero,
abs_iter) and to prevent the avg = d_sum.value(...) / vec.size() divide-by-zero
using the d_smallest, d_largest, d_sum, min_nonzero, abs_iter, and
cub::DeviceReduce::Reduce symbols to locate the changes.

In `@cpp/src/pdlp/termination_strategy/convergence_information.cu`:
- Around line 402-419: The CUB reduction calls using cub::DeviceReduce::Max (the
two invocations that pass transform_iter, linf_primal_residual_.data(),
primal_residual_.size(), stream_view_ and the analogous dual-residual calls
later) are not being checked for CUDA errors; wrap each cub::DeviceReduce::Max
call (both the temp storage query and the actual reduction) with RAFT_CUDA_TRY
so returned RAFT/CUDA errors are propagated consistently (apply the same
RAFT_CUDA_TRY wrapping to the dual residual block around the calls that use
sibling variables like dual_transform_iter, linf_dual_residual_, dual_residual_
and their temp_buf/stream_view_).

In `@cpp/src/pdlp/utilities/ping_pong_graph.cu`:
- Around line 56-118: The code currently uses RAFT_CUDA_TRY_NO_THROW but updates
flags unconditionally; change start_capture, end_capture, launch, and
is_initialized so that you capture return statuses from the CUDA calls
(cudaStreamBeginCapture, cudaStreamEndCapture, cudaGraphInstantiate,
cudaGraphDestroy, etc.) instead of blindly relying on RAFT_CUDA_TRY_NO_THROW,
and only set capture_even_active_/capture_odd_active_ and
even_initialized/odd_initialized when the respective CUDA call succeeds; on
failure, roll back any partial state (e.g., reset capture_*_active_, destroy any
created graph if instantiate failed, and ensure even_instance/odd_instance
remain unset) so launch() and is_initialized() only report graph mode when
even_instance/odd_instance are valid; locate changes in start_capture,
end_capture, launch, and is_initialized and use the symbols
RAFT_CUDA_TRY_NO_THROW, capture_even_active_, capture_odd_active_,
even_initialized, odd_initialized, even_instance, odd_instance, even_graph, and
odd_graph to implement the gating and cleanup.
- Around line 24-51: The cancel_active_capture() destructor-invoked cleanup
currently uses RAFT_CUDA_TRY (which can throw) for cudaStreamEndCapture; replace
those RAFT_CUDA_TRY calls with RAFT_CUDA_TRY_NO_THROW so exceptions are not
thrown from cancel_active_capture() (which may be called from the noexcept
destructor); ensure the two calls referencing even_graph and odd_graph use
RAFT_CUDA_TRY_NO_THROW and keep subsequent
RAFT_CUDA_TRY_NO_THROW(cudaGraphDestroy(...)) and
capture_even_active_/capture_odd_active_ handling unchanged.

In `@cpp/src/pdlp/utils.cuh`:
- Around line 607-618: The two calls to cub::DeviceReduce::Max in
pdlp::utils.cuh are not checking CUDA errors; wrap both invocations with
RAFT_CUDA_TRY so any cudaError_t returned is checked. Locate the block using
handle_ptr->get_stream(), abs_iter, and temp_bytes (the first sizing call) and
the second call that passes temp_buf.data(), and prefix each
cub::DeviceReduce::Max(...) call with RAFT_CUDA_TRY(...). Ensure you include the
same arguments inside RAFT_CUDA_TRY and keep temp_buf allocation and stream
usage unchanged.

---

Nitpick comments:
In
`@python/cuopt_server/cuopt_server/utils/linear_programming/data_definition.py`:
- Around line 453-457: The field mip_batch_pdlp_strong_branching is typed as
Optional[int] but should follow the class convention of binary flags by using
Optional[bool]; change its annotation to Optional[bool] and update the Field
default from 0 to False (or None if other similar flags use None) so callers get
a boolean API, and ensure the call site that passes this to the C++ layer casts
the bool to int (bool → int) when invoking the underlying solver; locate the
field definition on the SolverConfig (mip_batch_pdlp_strong_branching) and make
this replacement to keep behavior consistent with mip_scaling,
mip_heuristics_only, crossover, eliminate_dense_columns, etc.

Comment thread cpp/src/pdlp/pdlp.cu Outdated
Comment thread cpp/src/pdlp/pdlp.cu Outdated
Comment thread cpp/src/pdlp/termination_strategy/convergence_information.cu Outdated
Comment thread cpp/src/pdlp/utilities/ping_pong_graph.cu Outdated
Comment thread cpp/src/pdlp/utilities/ping_pong_graph.cu
Comment thread cpp/src/pdlp/utils.cuh
// We need to make sure both dot products happens after previous operations (next_primal/dual)
// Thus, we add another node in the main stream before starting the SpMVs

if (!batch_mode_) deltas_are_done_.record(stream_view_);
Contributor


Is removing those deltas_are_done_ lines intentional?

Contributor Author


Yes. It's actually illegal to use the same cuBLAS handle across different streams, which is what I was doing here. The error would trigger extremely rarely, but it could result in an incorrect dot product.

if (!is_legacy_batch_mode_) {
// This should not happen, but in case a graph was capturing while destroying the object
if (capture_even_active_ || capture_odd_active_) {
cancel_active_capture();
Contributor


I think this might hide real errors. If this is destroyed while its own stream is capturing, it is a logic issue. If this is destroyed while another stream/thread is capturing, we will lose the graph on the other stream/thread and cudaGraphLaunch will launch an incomplete/empty graph.

Contributor Author


Yes, I agree. This was meant as a fail-safe in case the PDLP object is destroyed while a capture is happening, which is unintended behavior. Hence the log error above. Would you still kill the application?

Contributor Author


This behavior has been changed now

Comment thread cpp/src/pdlp/pdlp.cu Outdated
Contributor Author

Kh4ster commented Feb 19, 2026

/ok to test


copy-pr-bot bot commented Feb 19, 2026

/ok to test

@Kh4ster, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

Contributor Author

Kh4ster commented Feb 19, 2026

/ok to test fe31e7e


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
cpp/src/pdlp/utilities/ping_pong_graph.cuh (1)

25-43: ⚠️ Potential issue | 🟠 Major

Add = delete for copy constructor and copy assignment operator to prevent double-free of CUDA handles.

cudaGraph_t and cudaGraphExec_t are opaque handles (effectively non-owning pointers). The compiler-generated copy constructor and copy assignment operator will perform a shallow copy of these handles. If two ping_pong_graph_t instances end up holding the same handle (e.g., via a stray copy or STL container resize), the destructor of each will independently call cudaGraphDestroy/cudaGraphExecDestroy on the same resource, causing a double-free crash or silent corruption.

🛡️ Proposed fix: explicitly delete copy and move semantics
 public:
   ping_pong_graph_t(rmm::cuda_stream_view stream_view, bool is_legacy_batch_mode = false);
   ~ping_pong_graph_t();
+
+  // Non-copyable, non-movable: owns CUDA graph handles
+  ping_pong_graph_t(const ping_pong_graph_t&)            = delete;
+  ping_pong_graph_t& operator=(const ping_pong_graph_t&) = delete;
+  ping_pong_graph_t(ping_pong_graph_t&&)                 = delete;
+  ping_pong_graph_t& operator=(ping_pong_graph_t&&)      = delete;

If move semantics are desired in the future (e.g., for storing in a container), a proper move constructor that nulls out the source handles after transfer should be implemented explicitly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/pdlp/utilities/ping_pong_graph.cuh` around lines 25 - 43, Declare the
class copy constructor and copy assignment operator for ping_pong_graph_t as
deleted to prevent shallow copies of the CUDA handles (i.e., add
ping_pong_graph_t(const ping_pong_graph_t&) = delete; and ping_pong_graph_t&
operator=(const ping_pong_graph_t&) = delete; in the class declaration), leaving
move semantics unimplemented for now (or explicitly delete them too if you want
fully non-movable instances) so that even_graph, odd_graph, even_instance and
odd_instance cannot be copied and risk double-free in the destructor.
cpp/src/pdlp/step_size_strategy/adaptive_step_size_strategy.cu (1)

418-497: ⚠️ Potential issue | 🟠 Major

Wrap CUB calls with error checking to match codebase patterns.

The CUB DeviceTransform::Transform and DeviceSegmentedReduce::Sum calls at lines 418–497 lack error checking. This is inconsistent with error-checked versions of the same functions earlier in the file (lines 68, 80, 90). Wrap all four calls with RAFT_CUDA_TRY to align with coding guidelines: "Every CUDA kernel launch and memory operation must have error checking with CUDA_CHECK or equivalent verification."

✅ Suggested fix
-  cub::DeviceTransform::Transform(
+  RAFT_CUDA_TRY(cub::DeviceTransform::Transform(
     cuda::std::make_tuple(current_saddle_point_state.get_next_AtY().data(),
                           current_saddle_point_state.get_current_AtY().data()),
     tmp_primal.data(),
     tmp_primal.size(),
     cuda::std::minus<>{},
-    stream_view_.value());
+    stream_view_.value()));

...
-    cub::DeviceSegmentedReduce::Sum(
+    RAFT_CUDA_TRY(cub::DeviceSegmentedReduce::Sum(
       dot_product_storage.data(),
       dot_product_bytes,
       thrust::make_transform_iterator(
         thrust::make_zip_iterator(tmp_primal.data(),
                                   current_saddle_point_state.get_delta_primal().data()),
         tuple_multiplies<f_t>{}),
       interaction_.data(),
       climber_strategies_.size(),
       primal_size_,
-      stream_view_.value());
+      stream_view_.value()));

-    cub::DeviceSegmentedReduce::Sum(
+    RAFT_CUDA_TRY(cub::DeviceSegmentedReduce::Sum(
       dot_product_storage.data(),
       dot_product_bytes,
       thrust::make_transform_iterator(current_saddle_point_state.get_delta_primal().data(),
                                       power_two_func_t<f_t>{}),
       norm_squared_delta_primal_.data(),
       climber_strategies_.size(),
       primal_size_,
-      stream_view_.value());
+      stream_view_.value()));

-    cub::DeviceSegmentedReduce::Sum(
+    RAFT_CUDA_TRY(cub::DeviceSegmentedReduce::Sum(
       dot_product_storage.data(),
       dot_product_bytes,
       thrust::make_transform_iterator(current_saddle_point_state.get_delta_dual().data(),
                                       power_two_func_t<f_t>{}),
       norm_squared_delta_dual_.data(),
       climber_strategies_.size(),
       dual_size_,
-      stream_view_.value());
+      stream_view_.value()));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/pdlp/step_size_strategy/adaptive_step_size_strategy.cu` around lines
418 - 497, The four CUB calls (cub::DeviceTransform::Transform and the three
cub::DeviceSegmentedReduce::Sum invocations that use tmp_primal,
tuple_multiplies<f_t>, and power_two_func_t<f_t>) are not wrapped with
RAFT_CUDA_TRY; wrap each of those calls with RAFT_CUDA_TRY(...) exactly as other
CUB/CUDA calls in this file so CUDA errors are checked (preserve the original
call arguments including stream_view_.value()).
🧹 Nitpick comments (6)
benchmarks/linear_programming/cuopt/run_pdlp.cu (1)

81-88: Line 87 fallback return is unreachable dead code.

Since argparse's .choices() already rejects any string outside {"None", "Papilo", "PSLP", "Default"}, control can never reach line 87. The silent return Default hides potential future bugs (e.g. a new enum value added to presolver_t without updating this mapping). Consider replacing it with an assertion to make the invariant explicit:

🔧 Suggested change
 static cuopt::linear_programming::presolver_t string_to_presolver(const std::string& presolver)
 {
   if (presolver == "None") return cuopt::linear_programming::presolver_t::None;
   if (presolver == "Papilo") return cuopt::linear_programming::presolver_t::Papilo;
   if (presolver == "PSLP") return cuopt::linear_programming::presolver_t::PSLP;
   if (presolver == "Default") return cuopt::linear_programming::presolver_t::Default;
-  return cuopt::linear_programming::presolver_t::Default;
+  throw std::invalid_argument("Unknown presolver: " + presolver);
 }

This mirrors how string_to_pdlp_solver_mode would ideally behave and surfaces mapping gaps immediately at compile/test time if new enum values are added.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/linear_programming/cuopt/run_pdlp.cu` around lines 81 - 88,
Replace the silent fallback in string_to_presolver so an unexpected string
cannot quietly map to Default: remove the unreachable final "return Default" and
instead assert or throw (e.g., use assert(false) or throw std::invalid_argument)
to make the invariant explicit for cuopt::linear_programming::presolver_t
mapping; reference string_to_presolver and presolver_t (and mirror the behavior
used by string_to_pdlp_solver_mode) so any missing mapping surfaces immediately
during tests/compilation.
cpp/src/pdlp/utilities/ping_pong_graph.cuh (1)

34-37: Initialize CUDA handles to nullptr to prevent accidental use of garbage handle values.

even_graph, odd_graph, even_instance, and odd_instance are uninitialized. While even_initialized/odd_initialized guard their use in normal paths, any future code path that bypasses those flags — or any static analysis/sanitizer run — will observe undefined values. Explicit nullptr initialization is zero-cost and aligns with the guideline for tracking CUDA resource lifecycle.

♻️ Proposed fix: value-initialize CUDA handles
  private:
-  cudaGraph_t even_graph;
-  cudaGraph_t odd_graph;
-  cudaGraphExec_t even_instance;
-  cudaGraphExec_t odd_instance;
+  cudaGraph_t even_graph{nullptr};
+  cudaGraph_t odd_graph{nullptr};
+  cudaGraphExec_t even_instance{nullptr};
+  cudaGraphExec_t odd_instance{nullptr};

As per coding guidelines, tracking GPU resource lifecycle and ensuring cleanup of CUDA handles requires proper initialization state.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/pdlp/utilities/ping_pong_graph.cuh` around lines 34 - 37, The CUDA
handle variables even_graph, odd_graph, even_instance, and odd_instance are
uninitialized; initialize them to nullptr (or equivalent zero-initialization)
where they are declared so they start in a known state and won't contain garbage
values, then keep the existing even_initialized/odd_initialized guards and
cleanup logic intact to manage lifecycle; update the declarations of even_graph,
odd_graph, even_instance, and odd_instance to value-initialize them (e.g., =
nullptr) so static analysis/sanitizers and future code paths see a safe default.
docs/cuopt/source/lp-qp-milp-settings.rst (1)

520-521: Wording ambiguity: "in parallel" appears to modify Dual Simplex.

The phrase "rather than solving them in parallel using Dual Simplex" grammatically attributes parallelism to Dual Simplex, which is a sequential solver. The intended contrast is batch PDLP (parallel/simultaneous) vs. Dual Simplex (sequential). Consider rephrasing, e.g.:

"…rather than solving them sequentially using Dual Simplex."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/cuopt/source/lp-qp-milp-settings.rst` around lines 520 - 521, The
wording in the description for CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING ambiguously
attributes parallelism to Dual Simplex; update the sentence that currently reads
"rather than solving them in parallel using Dual Simplex" to clearly contrast
batched PDLP (simultaneous/parallel evaluation) with Dual Simplex by rephrasing
it to something like "rather than solving them sequentially using Dual Simplex"
so the intended comparison is unambiguous; update the docstring for
CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING accordingly.
python/cuopt/cuopt/tests/linear_programming/test_python_API.py (2)

1003-1017: Problem setup duplicates test_cuts verbatim — extract a shared helper.

The variable declarations, constraints, and objective (lines 1009–1017) are identical to test_cuts (lines 971–979). Consider a shared factory, e.g.:

♻️ Proposed refactor
+def _make_knapsack_problem():
+    """Shared 3-variable 0-1 MIP used in multiple tests."""
+    problem = Problem()
+    y1 = problem.addVariable(lb=0, ub=1, vtype=INTEGER, name="y1")
+    y2 = problem.addVariable(lb=0, ub=1, vtype=INTEGER, name="y2")
+    y3 = problem.addVariable(lb=0, ub=1, vtype=INTEGER, name="y3")
+    problem.addConstraint(774 * y1 + 76 * y2 + 42 * y3 <= 875)
+    problem.addConstraint(67 * y1 + 27 * y2 + 53 * y3 <= 875)
+    problem.setObjective(-86 * y1 - 4 * y2 - 40 * y3)
+    return problem

 def test_cuts():
-    problem = Problem()
-    y1 = problem.addVariable(lb=0, ub=1, vtype=INTEGER, name="y1")
-    y2 = problem.addVariable(lb=0, ub=1, vtype=INTEGER, name="y2")
-    y3 = problem.addVariable(lb=0, ub=1, vtype=INTEGER, name="y3")
-    problem.addConstraint(774 * y1 + 76 * y2 + 42 * y3 <= 875)
-    problem.addConstraint(67 * y1 + 27 * y2 + 53 * y3 <= 875)
-    problem.setObjective(-86 * y1 - 4 * y2 - 40 * y3)
+    problem = _make_knapsack_problem()
     ...

 def test_batch_pdlp_strong_branching():
-    problem = Problem()
-    y1 = problem.addVariable(lb=0, ub=1, vtype=INTEGER, name="y1")
-    ...
+    problem = _make_knapsack_problem()
     ...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuopt/cuopt/tests/linear_programming/test_python_API.py` around lines
1003 - 1017, The test duplicates the problem construction in
test_batch_pdlp_strong_branching and test_cuts; factor out the repeated setup
into a shared helper (e.g., a function like build_y1y2y3_problem or
make_binary_three_var_problem) that creates the Problem, adds variables
y1,y2,y3, adds the two constraints and sets the objective, and returns the
Problem (and variables if tests need them); then replace the inline block in
both test_batch_pdlp_strong_branching and test_cuts to call that helper (use the
existing symbols Problem, addVariable, addConstraint, setObjective, and variable
names y1,y2,y3 to find and replace the duplicated code).

1003-1032: Missing coverage: parameter should be ignored on LP problems (per docs).

The documentation states "This setting is ignored if the problem is not a MIP problem." The test only exercises MIP scenarios. Per the project's test-coverage guidelines, edge behavior (silently ignoring a flag on non-qualifying problems) should be tested to prevent future regressions where the parameter accidentally affects LP solves.

Consider adding a brief LP variant that sets CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING=1 and asserts Status == Optimal and a correct objective.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuopt/cuopt/tests/linear_programming/test_python_API.py` around lines
1003 - 1032, Add a small LP variant to the test to verify the flag is ignored
for non-MIP problems: create a new Problem (or reuse by recreating variables)
using continuous variables (omit vtype or use CONTINUOUS) with the same
constraints/objective as in test_batch_pdlp_strong_branching, call
settings.set_parameter(CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING, 1) and solve; then
assert problem.Status.name == "Optimal" and problem.ObjValue ==
pytest.approx(-126, abs=1e-3) to ensure the parameter does not affect LP solves.
Use the same symbols (Problem, settings.set_parameter,
CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING, solve, Status, ObjValue,
test_batch_pdlp_strong_branching) so the new assertions align with the existing
test structure.
cpp/src/pdlp/pdlp.cu (1)

1927-1927: cublasSetStream correctly ensures stream binding before cublasDgeam; minor inconsistency with .value().

The explicit stream binding is the correct fix for the handle-sharing issue described in the PR. stream_view_ (of type rmm::cuda_stream_view) implicitly converts to cudaStream_t here, but all other cuSPARSE/cuBLAS call sites in this PR (lines 2744, 2757, 2766) use explicit .value(). Consider using stream_view_.value() here and at line 2004 for consistency.

♻️ Proposed consistency fix
-  RAFT_CUBLAS_TRY(cublasSetStream(handle_ptr_->get_cublas_handle(), stream_view_));
+  RAFT_CUBLAS_TRY(cublasSetStream(handle_ptr_->get_cublas_handle(), stream_view_.value()));

Apply the same change at line 2004.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/pdlp/pdlp.cu` at line 1927, The cublas stream is correctly set via
RAFT_CUBLAS_TRY(cublasSetStream(handle_ptr_->get_cublas_handle(),
stream_view_)), but for consistency with other cuBLAS/cuSPARSE call sites use
the explicit rmm::cuda_stream_view::value(); change the second argument to
stream_view_.value() in the RAFT_CUBLAS_TRY(cublasSetStream(...)) call (and make
the same replacement for the other occurrence around the cublasDgeam usage
referenced at line ~2004) so all calls consistently pass a cudaStream_t via
stream_view_.value().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/src/pdlp/pdlp.cu`:
- Around line 1861-1864: The cudaPeekAtLastError() call is placed after
RAFT_CUDA_TRY(cudaStreamSynchronize(...)) making it redundant; move the
RAFT_CUDA_TRY(cudaPeekAtLastError()) to immediately follow the kernel launch
(before RAFT_CUDA_TRY(cudaStreamSynchronize(stream_view_))) so that
launch/configuration errors are caught early, then keep the stream sync
(RAFT_CUDA_TRY(cudaStreamSynchronize(stream_view_))) to detect execution errors
and ensure device-to-host writes complete.

In `@python/cuopt/cuopt/tests/linear_programming/test_python_API.py`:
- Around line 1024-1032: The second solve may return a cached solution because
only settings were changed; force a true re-solve or validate the solver ran
under CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING by marking the model dirty or
recreating it before the second problem.solve(settings). For example, call a
method that updates the model state (e.g., problem.setObjective(...) or a
problem.reset()/rebuild/recreate of the Problem object) or assert a runtime stat
changed (e.g., SolutionStats.num_nodes increased) to ensure the second solve
actually exercised the CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING=1 path instead of
returning a cached result.

---

Outside diff comments:
In `@cpp/src/pdlp/step_size_strategy/adaptive_step_size_strategy.cu`:
- Around line 418-497: The four CUB calls (cub::DeviceTransform::Transform and
the three cub::DeviceSegmentedReduce::Sum invocations that use tmp_primal,
tuple_multiplies<f_t>, and power_two_func_t<f_t>) are not wrapped with
RAFT_CUDA_TRY; wrap each of those calls with RAFT_CUDA_TRY(...) exactly as other
CUB/CUDA calls in this file so CUDA errors are checked (preserve the original
call arguments including stream_view_.value()).

In `@cpp/src/pdlp/utilities/ping_pong_graph.cuh`:
- Around line 25-43: Declare the class copy constructor and copy assignment
operator for ping_pong_graph_t as deleted to prevent shallow copies of the CUDA
handles (i.e., add ping_pong_graph_t(const ping_pong_graph_t&) = delete; and
ping_pong_graph_t& operator=(const ping_pong_graph_t&) = delete; in the class
declaration), leaving move semantics unimplemented for now (or explicitly delete
them too if you want fully non-movable instances) so that even_graph, odd_graph,
even_instance and odd_instance cannot be copied and risk double-free in the
destructor.
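
The non-copyable pattern suggested above can be sketched with a plain C++ stand-in; the `ping_pong_graph_sketch` name and `fake_handle` type are illustrative only, not the actual cuOpt declarations, and the sketch also folds in the nullptr value-initialization raised in the nitpick below:

```cpp
#include <cassert>
#include <type_traits>

// Stand-in for an opaque CUDA handle such as cudaGraph_t / cudaGraphExec_t.
struct fake_handle;

// Hypothetical sketch mirroring the suggestion: copy operations deleted so two
// objects can never own the same raw handles and double-free them, and every
// handle value-initialized to nullptr so the destructor never sees garbage.
struct ping_pong_graph_sketch {
  ping_pong_graph_sketch() = default;

  ping_pong_graph_sketch(const ping_pong_graph_sketch&)            = delete;
  ping_pong_graph_sketch& operator=(const ping_pong_graph_sketch&) = delete;

  ~ping_pong_graph_sketch()
  {
    // Real code would destroy even_graph/odd_graph here, guarded by the
    // even_initialized/odd_initialized flags.
  }

  fake_handle* even_graph{nullptr};
  fake_handle* odd_graph{nullptr};
  fake_handle* even_instance{nullptr};
  fake_handle* odd_instance{nullptr};
  bool even_initialized{false};
  bool odd_initialized{false};
};

// Declaring (even deleted) copy operations also suppresses the implicit moves,
// so the type ends up fully non-copyable and non-movable, as the comment allows.
static_assert(!std::is_copy_constructible_v<ping_pong_graph_sketch>);
static_assert(!std::is_move_constructible_v<ping_pong_graph_sketch>);
```

Attempting to copy such an object is then rejected at compile time rather than producing a double-free at runtime.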

---

Duplicate comments:
In `@cpp/src/pdlp/pdlp.cu`:
- Around line 1456-1460: The assert cuopt_assert(movement + computed_interaction
>= f_t(0.0), ...) is too strict for floating-point rounding; replace it with an
epsilon-tolerant check that allows tiny negative rounding errors before
clamping. Compute an epsilon relative to the magnitude of operands (e.g., using
std::numeric_limits<f_t>::epsilon() times max(abs(movement),
abs(computed_interaction), f_t(1))) and assert movement + computed_interaction
>= -epsilon, then continue to use cuda::std::max(...) and the existing sqrt into
*fixed_point_error; keep the original message but note the tolerance in the
condition.
- Line 1194: cuopt_assert(vec.size() > 0, "Vector must not be empty") is not
sufficient because asserts can be compiled out; add an explicit runtime check
for vec.size() == 0 before computing avg (the line using avg =
d_sum.value(stream) / vec.size()) and perform the appropriate early return or
empty-result handling instead of relying on cuopt_assert; update the function
containing cuopt_assert, the avg variable calculation, and the
d_sum.value(stream) usage so the divide-by-zero cannot occur in release builds.
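
As a host-side illustration of the epsilon-tolerant check suggested above, assuming `f_t` is a floating-point type (this is a sketch of the proposed condition, not the cuOpt code):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>

// Tolerance scaled to the operands' magnitude (floored at 1), so tiny negative
// sums caused purely by rounding pass, while genuinely negative sums fail.
template <typename f_t>
bool non_negative_within_rounding(f_t movement, f_t computed_interaction)
{
  const f_t scale =
    std::max({std::abs(movement), std::abs(computed_interaction), f_t(1)});
  const f_t eps = std::numeric_limits<f_t>::epsilon() * scale;
  return movement + computed_interaction >= -eps;
}
```

After the assert passes, the sum would still be clamped to zero (the existing `cuda::std::max(...)`) before taking the square root into `*fixed_point_error`.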

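The explicit empty-vector guard proposed above can be illustrated with a host-side analog (a hypothetical `safe_average` helper; the real code divides a device-side sum by `vec.size()`):

```cpp
#include <cassert>
#include <numeric>
#include <optional>
#include <vector>

// Returns nullopt for an empty input instead of dividing by zero; a plain
// assert would be compiled out in release builds, so the size check must be
// a real branch.
template <typename f_t>
std::optional<f_t> safe_average(const std::vector<f_t>& vec)
{
  if (vec.empty()) { return std::nullopt; }
  const f_t sum = std::accumulate(vec.begin(), vec.end(), f_t(0));
  return sum / static_cast<f_t>(vec.size());
}
```
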
In `@cpp/src/pdlp/termination_strategy/convergence_information.cu`:
- Around line 461-479: The two cub::DeviceReduce::Max calls using
transform_iter, linf_dual_residual_.data(), temp_storage_bytes, and stream_view_
need CUDA error checking: wrap each cub::DeviceReduce::Max invocation with
RAFT_CUDA_TRY (or the project’s CUDA_CHECK macro) so failures are caught (first
call that computes temp_storage_bytes and the second call that writes into
temp_buf.data()); ensure both calls use the same error-check wrapper and keep
existing variables transform_iter, temp_buf, linf_dual_residual_.data(), and
stream_view_ unchanged.

In `@cpp/src/pdlp/utils.cuh`:
- Around line 607-615: The two cub::DeviceReduce::Max calls in the block that
computes the max absolute value (using abs_iter, result, n, stream) are missing
CUDA error checks; wrap both calls with RAFT_CUDA_TRY (or the project equivalent
CUDA_CHECK) so failures are surfaced — i.e., call
RAFT_CUDA_TRY(cub::DeviceReduce::Max(d_temp, temp_bytes, abs_iter, result, n,
stream)); and RAFT_CUDA_TRY(cub::DeviceReduce::Max(temp_buf.data(), temp_bytes,
abs_iter, result, n, stream)); ensuring you keep the existing variables (d_temp,
temp_bytes, temp_buf, abs_iter, result, n, stream) unchanged.

---

Nitpick comments:
In `@benchmarks/linear_programming/cuopt/run_pdlp.cu`:
- Around line 81-88: Replace the silent fallback in string_to_presolver so an
unexpected string cannot quietly map to Default: remove the unreachable final
"return Default" and instead assert or throw (e.g., use assert(false) or throw
std::invalid_argument) to make the invariant explicit for
cuopt::linear_programming::presolver_t mapping; reference string_to_presolver
and presolver_t (and mirror the behavior used by string_to_pdlp_solver_mode) so
any missing mapping surfaces immediately during tests/compilation.


In `@cpp/src/pdlp/utilities/ping_pong_graph.cuh`:
- Around line 34-37: The CUDA handle variables even_graph, odd_graph,
even_instance, and odd_instance are uninitialized; initialize them to nullptr
(or equivalent zero-initialization) where they are declared so they start in a
known state and won't contain garbage values, then keep the existing
even_initialized/odd_initialized guards and cleanup logic intact to manage
lifecycle; update the declarations of even_graph, odd_graph, even_instance, and
odd_instance to value-initialize them (e.g., = nullptr) so static
analysis/sanitizers and future code paths see a safe default.

In `@docs/cuopt/source/lp-qp-milp-settings.rst`:
- Around line 520-521: The wording in the description for
CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING ambiguously attributes parallelism to Dual
Simplex; update the sentence that currently reads "rather than solving them in
parallel using Dual Simplex" to clearly contrast batched PDLP
(simultaneous/parallel evaluation) with Dual Simplex by rephrasing it to
something like "rather than solving them sequentially using Dual Simplex" so the
intended comparison is unambiguous; update the docstring for
CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING accordingly.

In `@python/cuopt/cuopt/tests/linear_programming/test_python_API.py`:
- Around line 1003-1017: The test duplicates the problem construction in
test_batch_pdlp_strong_branching and test_cuts; factor out the repeated setup
into a shared helper (e.g., a function like build_y1y2y3_problem or
make_binary_three_var_problem) that creates the Problem, adds variables
y1,y2,y3, adds the two constraints and sets the objective, and returns the
Problem (and variables if tests need them); then replace the inline block in
both test_batch_pdlp_strong_branching and test_cuts to call that helper (use the
existing symbols Problem, addVariable, addConstraint, setObjective, and variable
names y1,y2,y3 to find and replace the duplicated code).

Comment thread cpp/src/pdlp/pdlp.cu
Comment on lines 1861 to +1864
RAFT_CUDA_TRY(cudaStreamSynchronize(
stream_view_)); // To make sure all the data is written from device to host
RAFT_CUDA_TRY(cudaPeekAtLastError());

⚠️ Potential issue | 🟡 Minor

cudaPeekAtLastError() is placed after cudaStreamSynchronize() — it is redundant and should precede the sync.

RAFT_CUDA_TRY(cudaStreamSynchronize(...)) already surfaces both launch-configuration errors (via the sticky error bit) and kernel execution errors. The cudaPeekAtLastError() call that follows can never observe a new error not already caught by the sync. To properly detect kernel launch errors (e.g., invalid grid dimensions) before blocking, cudaPeekAtLastError() must come immediately after the kernel launch.

🛠️ Proposed fix
     kernel_compute_fixed_error<f_t><<<grid_size, block_size, 0, stream_view_>>>(
       make_span(step_size_strategy_.get_norm_squared_delta_primal()),
       make_span(step_size_strategy_.get_norm_squared_delta_dual()),
       make_span(primal_weight_),
       make_span(step_size_),
       make_span(step_size_strategy_.get_interaction()),
       make_span(restart_strategy_.fixed_point_error_));
+    RAFT_CUDA_TRY(cudaPeekAtLastError());
     RAFT_CUDA_TRY(cudaStreamSynchronize(
       stream_view_));  // To make sure all the data is written from device to host
-    RAFT_CUDA_TRY(cudaPeekAtLastError());

Comment on lines +1024 to +1032
problem.solve(settings)
assert problem.Status.name == "Optimal"
assert problem.ObjValue == pytest.approx(-126, abs=1e-3)

settings.set_parameter(CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING, 1)

problem.solve(settings)
assert problem.Status.name == "Optimal"
assert problem.ObjValue == pytest.approx(-126, abs=1e-3)
⚠️ Potential issue | 🟡 Minor

Second solve lacks an explicit problem reset — test may not exercise the batch PDLP=1 path.

Between the two problem.solve(settings) calls (lines 1024 and 1030), only a solver setting is changed; no setObjective, addConstraint, or updateConstraint is called to mark the problem dirty. Depending on how the solver determines whether a re-solve is needed, the second call may return the cached solution without running under CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING=1, causing the second assert to pass trivially.

Compare test_cuts (lines 983–999): it also calls problem.solve twice, but the second call is guarded by a changed setting value that clearly directs a different code path (different cut-pass count). The existing test pattern in test_model calls setObjective between solves to force a fresh state.

Adding an explicit re-construction of the problem object or at minimum verifying a state indicator (e.g., SolutionStats.num_nodes) would make the second assertion meaningful.


@Kh4ster
Contributor Author

Kh4ster commented Feb 19, 2026

/ok to test d89af96

@Kh4ster
Contributor Author

Kh4ster commented Feb 19, 2026

/ok to test c7e3e22

@Kh4ster
Contributor Author

Kh4ster commented Feb 19, 2026

/merge

@rapids-bot rapids-bot bot merged commit 9ad8b45 into main Feb 19, 2026
94 checks passed
@Kh4ster Kh4ster deleted the fix_batch_pdlp branch February 19, 2026 15:36

Labels: bug (Something isn't working), mip, non-breaking (Introduces a non-breaking change), pdlp

Projects: None yet

4 participants