Explicitly use cudaStream_t for cub::DeviceTransform and use non-blocking stream #771
Conversation
📝 Walkthrough

Replaced many uses of …

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ❌ Failed checks (1 warning) | ✅ Passed checks (4 passed)

📜 Recent review details
Configuration used: Path: .coderabbit.yaml | Review profile: CHILL | Plan: Pro
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
cpp/src/linear_programming/termination_strategy/infeasibility_information.cu (1)
388-393: Inconsistent stream parameter: this Transform call should also use `stream_view_.value()`.

This `cub::DeviceTransform::Transform` call still passes `stream_view_` directly (line 393), while the earlier Transform call at line 361 was updated to use `stream_view_.value()`. For consistency, and to fully address the CCCL 3.2 API change, this call should be updated as well.

🔧 Proposed fix

```diff
 cub::DeviceTransform::Transform(
   cuda::std::make_tuple(reduced_cost_.data(), problem_ptr->variable_bounds.data()),
   bound_value_.data(),
   primal_size_h_,
   bound_value_reduced_cost_product<f_t, f_t2>(),
-  stream_view_);
+  stream_view_.value());
```

cpp/src/dual_simplex/barrier.cu (1)

3138-3150: Update `cub::DeviceTransform::Transform` calls for consistency with other stream parameter usage in the file.

Lines 3143 and 3150 use `stream_view_` directly, while most other `cub::DeviceTransform::Transform` calls in the file use `stream_view_.value()`. Additionally, `compute_final_direction` (lines 3213, 3223, 3230) and `compute_next_iterate` (lines 3292, 3302, 3309) have the same inconsistency. For consistency with the predominant pattern across the file, either all `cub::DeviceTransform::Transform` calls should use `.value()`, or all should use the bare `stream_view_` with implicit conversion. Choose one approach and apply it uniformly.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
cpp/src/dual_simplex/barrier.cu
cpp/src/linear_programming/initial_scaling_strategy/initial_scaling.cu
cpp/src/linear_programming/pdhg.cu
cpp/src/linear_programming/pdlp.cu
cpp/src/linear_programming/restart_strategy/pdlp_restart_strategy.cu
cpp/src/linear_programming/restart_strategy/weighted_average_solution.cu
cpp/src/linear_programming/step_size_strategy/adaptive_step_size_strategy.cu
cpp/src/linear_programming/termination_strategy/convergence_information.cu
cpp/src/linear_programming/termination_strategy/infeasibility_information.cu
cpp/src/linear_programming/utilities/cython_solve.cu
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cu,cuh}
📄 CodeRabbit inference engine (.github/.coderabbit_review_guide.md)
**/*.{cu,cuh}: Every CUDA kernel launch and memory operation must have error checking with CUDA_CHECK or equivalent verification
Avoid reinventing functionality already available in Thrust, CCCL, or RMM libraries; prefer standard library utilities over custom implementations
Files:
cpp/src/linear_programming/termination_strategy/infeasibility_information.cu
cpp/src/linear_programming/utilities/cython_solve.cu
cpp/src/dual_simplex/barrier.cu
cpp/src/linear_programming/pdlp.cu
cpp/src/linear_programming/initial_scaling_strategy/initial_scaling.cu
cpp/src/linear_programming/restart_strategy/pdlp_restart_strategy.cu
cpp/src/linear_programming/restart_strategy/weighted_average_solution.cu
cpp/src/linear_programming/pdhg.cu
cpp/src/linear_programming/termination_strategy/convergence_information.cu
cpp/src/linear_programming/step_size_strategy/adaptive_step_size_strategy.cu
**/*.cu
📄 CodeRabbit inference engine (.github/.coderabbit_review_guide.md)
**/*.cu: Verify race conditions and correctness of GPU kernel shared memory, atomics, and warp-level operations
Detect inefficient GPU kernel launches with low occupancy or poor memory access patterns; optimize for coalesced memory access and minimize warp divergence in hot paths
Files:
(the same 10 files listed above)
**/*.{cu,cuh,cpp,hpp,h}
📄 CodeRabbit inference engine (.github/.coderabbit_review_guide.md)
**/*.{cu,cuh,cpp,hpp,h}: Track GPU device memory allocations and deallocations to prevent memory leaks; ensure cudaMalloc/cudaFree balance and cleanup of streams/events
Validate algorithm correctness in optimization logic: simplex pivots, branch-and-bound decisions, routing heuristics, and constraint/objective handling must produce correct results
Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero, and use epsilon comparisons for floating-point equality checks
Validate correct initialization of variable bounds, constraint coefficients, and algorithm state before solving; ensure reset when transitioning between algorithm phases (presolve, simplex, diving, crossover)
Ensure variables and constraints are accessed from the correct problem context (original vs presolve vs folded vs postsolve); verify index mapping consistency across problem transformations
For concurrent CUDA operations (barriers, async operations), explicitly create and manage dedicated streams instead of reusing the default stream; document stream lifecycle
Eliminate unnecessary host-device synchronization (cudaDeviceSynchronize) in hot paths that blocks GPU pipeline; use streams and events for async execution
Assess algorithmic complexity for large-scale problems (millions of variables/constraints); ensure O(n log n) or better complexity, not O(n²) or worse
Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems
Identify assertions with overly strict numerical tolerances that fail on legitimate degenerate/edge cases (near-zero pivots, singular matrices, empty problems)
Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state
Refactor code duplication in solver components (3+ occurrences) into shared utilities; for GPU kernels, use templated device functions to avoid duplication
Check that hard-coded GPU de...
Files:
(the same 10 files listed above)
**/*.{cu,cpp,hpp,h}
📄 CodeRabbit inference engine (.github/.coderabbit_review_guide.md)
Avoid inappropriate use of exceptions in performance-critical GPU operation paths; prefer error codes or CUDA error checking for latency-sensitive code
Files:
(the same 10 files listed above)
🧠 Learnings (18)

📓 Learnings from .github/.coderabbit_review_guide.md (repo: NVIDIA/cuopt), each applied to the files in this PR; scope is **/*.{cu,cuh,cpp,hpp,h} unless noted:

- For concurrent CUDA operations (barriers, async operations), explicitly create and manage dedicated streams instead of reusing the default stream; document stream lifecycle
- Eliminate unnecessary host-device synchronization (cudaDeviceSynchronize) in hot paths that blocks the GPU pipeline; use streams and events for async execution
- Refactor code duplication in solver components (3+ occurrences) into shared utilities; for GPU kernels, use templated device functions to avoid duplication
- Check that hard-coded GPU device IDs and resource limits are made configurable; abstract multi-backend support for different CUDA versions
- Track GPU device memory allocations and deallocations to prevent memory leaks; ensure cudaMalloc/cudaFree balance and cleanup of streams/events
- Validate correct initialization of variable bounds, constraint coefficients, and algorithm state before solving; ensure reset when transitioning between algorithm phases (presolve, simplex, diving, crossover)
- Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state
- Verify race conditions and correctness of GPU kernel shared memory, atomics, and warp-level operations (**/*.cu)
- Verify error propagation from CUDA to user-facing APIs is complete; ensure CUDA errors are caught and mapped to meaningful user error codes
- Avoid inappropriate use of exceptions in performance-critical GPU operation paths; prefer error codes or CUDA error checking for latency-sensitive code (**/*.{cu,cpp,hpp,h})
- Ensure test isolation: prevent GPU state, cached memory, and global variables from leaking between test cases; verify each test independently initializes its environment (**/*test*.{cpp,cu,py})
- Avoid reinventing functionality already available in Thrust, CCCL, or RMM libraries; prefer standard library utilities over custom implementations (**/*.{cu,cuh})
- Ensure variables and constraints are accessed from the correct problem context (original vs presolve vs folded vs postsolve); verify index mapping consistency across problem transformations
- Detect inefficient GPU kernel launches with low occupancy or poor memory access patterns; optimize for coalesced memory access and minimize warp divergence in hot paths (**/*.cu)
- Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems
- Validate algorithm correctness in optimization logic: simplex pivots, branch-and-bound decisions, routing heuristics, and constraint/objective handling must produce correct results

📚 Learning from chris-maes (NVIDIA/cuopt PR 602, cpp/src/linear_programming/solve.cu, 2025-12-04): in cpp/src/linear_programming/solve.cu, the barrier solver does not currently return INFEASIBLE or UNBOUNDED status. It only returns OPTIMAL, TIME_LIMIT, NUMERICAL_ISSUES, or CONCURRENT_LIMIT.
🧬 Code graph analysis (1)
cpp/src/dual_simplex/barrier.cu (1)
cpp/src/linear_programming/pdlp.cu (1)
`d_z` (1486-1486)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: wheel-build-cuopt-sh-client / 13.1.0, 3.10, amd64, rockylinux8
- GitHub Check: wheel-build-cuopt-mps-parser / 13.1.0, 3.13, arm64, rockylinux8
- GitHub Check: wheel-build-cuopt-mps-parser / 13.1.0, 3.11, arm64, rockylinux8
- GitHub Check: wheel-build-cuopt-mps-parser / 13.1.0, 3.12, arm64, rockylinux8
- GitHub Check: wheel-build-cuopt-mps-parser / 13.1.0, 3.11, amd64, rockylinux8
- GitHub Check: wheel-build-cuopt-mps-parser / 13.1.0, 3.10, arm64, rockylinux8
- GitHub Check: wheel-build-cuopt-mps-parser / 13.1.0, 3.12, amd64, rockylinux8
- GitHub Check: wheel-build-cuopt-mps-parser / 13.1.0, 3.10, amd64, rockylinux8
- GitHub Check: wheel-build-cuopt-mps-parser / 13.1.0, 3.13, amd64, rockylinux8
- GitHub Check: checks / check-style
🔇 Additional comments (31)
cpp/src/linear_programming/restart_strategy/weighted_average_solution.cu (1)
82-89: LGTM: Stream parameter fix for CCCL 3.2 compatibility.

The change from `stream_view_` to `stream_view_.value()` correctly extracts the raw `cudaStream_t` for `cub::DeviceTransform::Transform`, addressing the API change in CCCL 3.2. Both Transform calls are updated consistently.

cpp/src/linear_programming/pdhg.cu (4)
129-138: LGTM: Consistent stream parameter update for dual projection.

The `stream_view_.value()` change correctly provides the raw `cudaStream_t` to `cub::DeviceTransform::Transform` for CCCL 3.2 compatibility.

187-197: LGTM: Stream parameter correctly updated for primal projection.

337-364: LGTM: Both reflected projection Transform calls updated consistently.

The `should_major` branch correctly uses `stream_view_.value()` for both primal and dual reflected projections in `compute_next_primal_dual_solution_reflected`.

381-405: LGTM: Non-major iteration branch also updated consistently.

Both Transform calls in the else branch are correctly updated to use `stream_view_.value()`.

cpp/src/linear_programming/restart_strategy/pdlp_restart_strategy.cu (2)
1680-1685: LGTM: Stream parameter updated for bounds extraction transform.

The change correctly uses `stream_view_.value()` for the `cub::DeviceTransform::Transform` call that extracts variable bounds.

1863-1869: LGTM: Stream parameter updated for clamping transform.

The `stream_view_.value()` change for the clamp transform is consistent with the fix pattern.

cpp/src/linear_programming/step_size_strategy/adaptive_step_size_strategy.cu (1)
301-307: LGTM: Stream parameter fix for interaction computation.

The `stream_view_.value()` change for the subtraction transform in `compute_interaction_and_movement` correctly addresses CCCL 3.2 compatibility.

cpp/src/linear_programming/termination_strategy/infeasibility_information.cu (1)

356-361: LGTM: Stream parameter updated for bound value gradient transform.

The change to `stream_view_.value()` correctly addresses CCCL 3.2 compatibility.

cpp/src/linear_programming/initial_scaling_strategy/initial_scaling.cu (1)
433-489: LGTM! Correct fix for CCCL 3.2 API compatibility.

The changes correctly extract the underlying `cudaStream_t` via `.value()` for the `cub::DeviceTransform::Transform` calls. This aligns with the CCCL 3.2 API change, which now explicitly requires `cudaStream_t` rather than accepting stream wrapper objects. The other CUDA operations (kernel launches, `cudaMemsetAsync`, thrust, raft operations) appropriately continue using `stream_view_` directly, since they accept `rmm::cuda_stream_view`.

cpp/src/linear_programming/utilities/cython_solve.cu (1)

232-233: LGTM! Proper non-blocking stream initialization.

The change correctly creates a non-blocking CUDA stream and passes it to the raft handle. This is important for:

- Avoiding implicit synchronization with the default stream in concurrent scenarios (batch solve at line 307)
- Proper stream lifecycle management: the `rmm::cuda_stream` owns the underlying `cudaStream_t` and will destroy it when going out of scope, after the handle no longer uses it

As per coding guidelines, explicitly creating and managing dedicated streams for concurrent operations is the preferred approach.
cpp/src/linear_programming/pdlp.cu (4)
1032-1043: LGTM! Delta computation stream parameters fixed.

The `cub::DeviceTransform::Transform` calls for computing delta primal and delta dual correctly use `stream_view_.value()` to extract the underlying `cudaStream_t`, addressing the CCCL 3.2 API change.

1137-1150: LGTM! Initial primal projection stream parameters fixed.

The clamp operations for projecting initial primal solutions correctly pass the underlying stream handle.

1393-1420: LGTM! Halpern update stream parameters fixed.

Both primal and dual update transforms in the Halpern iteration correctly use `stream_view_.value()`.

1525-1574: LGTM! Initial step size computation stream parameters fixed.

The normalization and residual computation transforms correctly extract the underlying stream handle.
cpp/src/linear_programming/termination_strategy/convergence_information.cu (4)
268-281: LGTM! Primal residual computation stream parameter fixed.

The `cub::DeviceTransform::Transform` call for computing primal residual and primal slack correctly uses `stream_view_.value()`.

359-364: LGTM! Dual residual computation stream parameter fixed.

The transform for computing dual residual correctly passes the underlying stream handle.

457-462: LGTM! Reduced cost computation stream parameter fixed.

The bound value gradient transform correctly uses `stream_view_.value()`.

490-503: Both `DeviceTransform::Transform` and `DeviceReduce::Sum` correctly accept `stream_view_` due to implicit conversion from `rmm::cuda_stream_view` to `cudaStream_t`. The inconsistency in style (one using `.value()` explicitly, the other passing the wrapper directly) is not a functional issue; both patterns are correct and work interchangeably.

cpp/src/dual_simplex/barrier.cu (12)
48-60: LGTM!

The `pairwise_multiply` helper functions correctly use `stream.value()` to extract the raw `cudaStream_t` for `cub::DeviceTransform::Transform`, addressing the CCCL 3.2 API change.

62-87: LGTM!

The `axpy` helper functions correctly use `stream.value()` for the `cub::DeviceTransform::Transform` calls.

1369-1377: LGTM!

The `gpu_adat_multiply` method correctly uses `stream_view_.value()` for the element-wise multiply transform.

2050-2103: LGTM!

All `cub::DeviceTransform::Transform` calls in `gpu_compute_residuals` are consistently updated to use `stream_view_.value()`.

2343-2379: LGTM!

The diagonal, inverse diagonal, and sqrt inverse diagonal formation correctly uses `stream_view_.value()` for all `cub::DeviceTransform::Transform` calls.

2430-2457: LGTM!

The GPU compute H block correctly uses `stream_view_.value()` for the transform operations.

2786-2804: LGTM!

The dz formation correctly uses `stream_view_.value()`.

2828-2847: LGTM!

The dv formation correctly uses `stream_view_.value()`.

2914-2950: LGTM!

The dw formation and residual computations correctly use `stream_view_.value()`.

2998-3021: LGTM!

The `compute_affine_rhs` method correctly uses `stream_view_.value()` for negating the complementarity residuals.
3205-3230: Inconsistent stream parameter usage for `cub::DeviceTransform::Transform`.

These three `cub::DeviceTransform::Transform` calls at lines 3213, 3223, and 3230 use `stream_view_` without `.value()`, which is inconsistent with the other updated calls in this PR.

🔧 Suggested fix for consistency

```diff
 cub::DeviceTransform::Transform(
   cuda::std::make_tuple(
     data.d_dw_aff_.data(), data.d_dv_aff_.data(), data.d_dw_.data(), data.d_dv_.data()),
   thrust::make_zip_iterator(data.d_dw_.data(), data.d_dv_.data()),
   data.d_dw_.size(),
   [] HD(f_t dw_aff, f_t dv_aff, f_t dw, f_t dv) -> thrust::tuple<f_t, f_t> {
     return {dw + dw_aff, dv + dv_aff};
   },
-  stream_view_);
+  stream_view_.value());

 cub::DeviceTransform::Transform(
   cuda::std::make_tuple(
     data.d_dx_aff_.data(), data.d_dz_aff_.data(), data.d_dx_.data(), data.d_dz_.data()),
   thrust::make_zip_iterator(data.d_dx_.data(), data.d_dz_.data()),
   data.d_dx_.size(),
   [] HD(f_t dx_aff, f_t dz_aff, f_t dx, f_t dz) -> thrust::tuple<f_t, f_t> {
     return {dx + dx_aff, dz + dz_aff};
   },
-  stream_view_);
+  stream_view_.value());

 cub::DeviceTransform::Transform(
   cuda::std::make_tuple(data.d_dy_aff_.data(), data.d_dy_.data()),
   data.d_dy_.data(),
   data.d_dy_.size(),
   [] HD(f_t dy_aff, f_t dy) { return dy + dy_aff; },
-  stream_view_);
+  stream_view_.value());
```
3284-3309: Inconsistent stream parameter usage for `cub::DeviceTransform::Transform`.

These three `cub::DeviceTransform::Transform` calls at lines 3292, 3302, and 3309 use `stream_view_` without `.value()`, which is inconsistent with the other updated calls in this PR.

🔧 Suggested fix for consistency

```diff
 cub::DeviceTransform::Transform(
   cuda::std::make_tuple(
     data.d_w_.data(), data.d_v_.data(), data.d_dw_.data(), data.d_dv_.data()),
   thrust::make_zip_iterator(data.d_w_.data(), data.d_v_.data()),
   data.d_dw_.size(),
   [step_primal, step_dual] HD(f_t w, f_t v, f_t dw, f_t dv) -> thrust::tuple<f_t, f_t> {
     return {w + step_primal * dw, v + step_dual * dv};
   },
-  stream_view_);
+  stream_view_.value());

 cub::DeviceTransform::Transform(
   cuda::std::make_tuple(
     data.d_x_.data(), data.d_z_.data(), data.d_dx_.data(), data.d_dz_.data()),
   thrust::make_zip_iterator(data.d_x_.data(), data.d_z_.data()),
   data.d_dx_.size(),
   [step_primal, step_dual] HD(f_t x, f_t z, f_t dx, f_t dz) -> thrust::tuple<f_t, f_t> {
     return {x + step_primal * dx, z + step_dual * dz};
   },
-  stream_view_);
+  stream_view_.value());

 cub::DeviceTransform::Transform(
   cuda::std::make_tuple(data.d_y_.data(), data.d_dy_.data()),
   data.d_y_.data(),
   data.d_y_.size(),
   [step_dual] HD(f_t y, f_t dy) { return y + step_dual * dy; },
-  stream_view_);
+  stream_view_.value());
```
bff475c to 2f88100
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/src/linear_programming/utilities/cython_solve.cu (1)
232-247: CUDA stream resource leak: stream created but never destroyed. The `cudaStream_t` created at line 233 is never destroyed with `cudaStreamDestroy()`. When `raft::handle_t` is constructed with an external stream, it does not take ownership; the caller remains responsible for cleanup. This leaks a stream on every call to `call_solve`, which is especially problematic in batch mode, where multiple parallel calls accumulate leaked streams.

Based on learnings, ensure cleanup of streams/events.
🔧 Proposed fix: Add stream synchronization and cleanup
```diff
 std::unique_ptr<solver_ret_t> call_solve(
   cuopt::mps_parser::data_model_view_t<int, double>* data_model,
   cuopt::linear_programming::solver_settings_t<int, double>* solver_settings,
   unsigned int flags,
   bool is_batch_mode)
 {
   raft::common::nvtx::range fun_scope("Call Solve");
   cudaStream_t stream;
   RAFT_CUDA_TRY(cudaStreamCreateWithFlags(&stream, flags));
   const raft::handle_t handle_{stream};
   auto op_problem = data_model_to_optimization_problem(data_model, solver_settings, &handle_);
   solver_ret_t response;
   if (op_problem.get_problem_category() == linear_programming::problem_category_t::LP) {
     response.lp_ret =
       call_solve_lp(op_problem, solver_settings->get_pdlp_settings(), is_batch_mode);
     response.problem_type = linear_programming::problem_category_t::LP;
   } else {
     response.mip_ret = call_solve_mip(op_problem, solver_settings->get_mip_settings());
     response.problem_type = linear_programming::problem_category_t::MIP;
   }
+  RAFT_CUDA_TRY(cudaStreamSynchronize(stream));
+  RAFT_CUDA_TRY(cudaStreamDestroy(stream));
+
   return std::make_unique<solver_ret_t>(std::move(response));
 }
```
Alternatively, consider using a RAII wrapper for automatic cleanup to ensure exception safety.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cpp/src/linear_programming/utilities/cython_solve.cu
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cu,cuh}
📄 CodeRabbit inference engine (.github/.coderabbit_review_guide.md)
**/*.{cu,cuh}: Every CUDA kernel launch and memory operation must have error checking with CUDA_CHECK or equivalent verification
Avoid reinventing functionality already available in Thrust, CCCL, or RMM libraries; prefer standard library utilities over custom implementations
Files:
cpp/src/linear_programming/utilities/cython_solve.cu
**/*.cu
📄 CodeRabbit inference engine (.github/.coderabbit_review_guide.md)
**/*.cu: Verify race conditions and correctness of GPU kernel shared memory, atomics, and warp-level operations
Detect inefficient GPU kernel launches with low occupancy or poor memory access patterns; optimize for coalesced memory access and minimize warp divergence in hot paths
Files:
cpp/src/linear_programming/utilities/cython_solve.cu
**/*.{cu,cuh,cpp,hpp,h}
📄 CodeRabbit inference engine (.github/.coderabbit_review_guide.md)
**/*.{cu,cuh,cpp,hpp,h}: Track GPU device memory allocations and deallocations to prevent memory leaks; ensure cudaMalloc/cudaFree balance and cleanup of streams/events
Validate algorithm correctness in optimization logic: simplex pivots, branch-and-bound decisions, routing heuristics, and constraint/objective handling must produce correct results
Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero, and use epsilon comparisons for floating-point equality checks
Validate correct initialization of variable bounds, constraint coefficients, and algorithm state before solving; ensure reset when transitioning between algorithm phases (presolve, simplex, diving, crossover)
Ensure variables and constraints are accessed from the correct problem context (original vs presolve vs folded vs postsolve); verify index mapping consistency across problem transformations
For concurrent CUDA operations (barriers, async operations), explicitly create and manage dedicated streams instead of reusing the default stream; document stream lifecycle
Eliminate unnecessary host-device synchronization (cudaDeviceSynchronize) in hot paths that blocks GPU pipeline; use streams and events for async execution
Assess algorithmic complexity for large-scale problems (millions of variables/constraints); ensure O(n log n) or better complexity, not O(n²) or worse
Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems
Identify assertions with overly strict numerical tolerances that fail on legitimate degenerate/edge cases (near-zero pivots, singular matrices, empty problems)
Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state
Refactor code duplication in solver components (3+ occurrences) into shared utilities; for GPU kernels, use templated device functions to avoid duplication
Check that hard-coded GPU de...
Files:
cpp/src/linear_programming/utilities/cython_solve.cu
**/*.{cu,cpp,hpp,h}
📄 CodeRabbit inference engine (.github/.coderabbit_review_guide.md)
Avoid inappropriate use of exceptions in performance-critical GPU operation paths; prefer error codes or CUDA error checking for latency-sensitive code
Files:
cpp/src/linear_programming/utilities/cython_solve.cu
🧠 Learnings (8)
📓 Common learnings
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : For concurrent CUDA operations (barriers, async operations), explicitly create and manage dedicated streams instead of reusing the default stream; document stream lifecycle
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Eliminate unnecessary host-device synchronization (cudaDeviceSynchronize) in hot paths that blocks GPU pipeline; use streams and events for async execution
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Check that hard-coded GPU device IDs and resource limits are made configurable; abstract multi-backend support for different CUDA versions
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Refactor code duplication in solver components (3+ occurrences) into shared utilities; for GPU kernels, use templated device functions to avoid duplication
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Track GPU device memory allocations and deallocations to prevent memory leaks; ensure cudaMalloc/cudaFree balance and cleanup of streams/events
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : For concurrent CUDA operations (barriers, async operations), explicitly create and manage dedicated streams instead of reusing the default stream; document stream lifecycle
Applied to files:
cpp/src/linear_programming/utilities/cython_solve.cu
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Eliminate unnecessary host-device synchronization (cudaDeviceSynchronize) in hot paths that blocks GPU pipeline; use streams and events for async execution
Applied to files:
cpp/src/linear_programming/utilities/cython_solve.cu
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Refactor code duplication in solver components (3+ occurrences) into shared utilities; for GPU kernels, use templated device functions to avoid duplication
Applied to files:
cpp/src/linear_programming/utilities/cython_solve.cu
📚 Learning: 2026-01-14T00:38:33.700Z
Learnt from: chris-maes
Repo: NVIDIA/cuopt PR: 746
File: cpp/src/dual_simplex/barrier.cu:549-552
Timestamp: 2026-01-14T00:38:33.700Z
Learning: In cpp/src/dual_simplex/barrier.cu's form_adat() method within the barrier solver, the raft::copy from d_original_A_values to device_AD.x is necessary and must not be removed. This copy restores the original matrix A values before they are scaled by the diagonal matrix D (via d_inv_diag_prime). Since D changes between iterations and the scaling mutates device_AD.x, copying the original values each iteration is required for correctness.
Applied to files:
cpp/src/linear_programming/utilities/cython_solve.cu
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Ensure variables and constraints are accessed from the correct problem context (original vs presolve vs folded vs postsolve); verify index mapping consistency across problem transformations
Applied to files:
cpp/src/linear_programming/utilities/cython_solve.cu
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh} : Avoid reinventing functionality already available in Thrust, CCCL, or RMM libraries; prefer standard library utilities over custom implementations
Applied to files:
cpp/src/linear_programming/utilities/cython_solve.cu
📚 Learning: 2025-12-04T20:09:09.264Z
Learnt from: chris-maes
Repo: NVIDIA/cuopt PR: 602
File: cpp/src/linear_programming/solve.cu:732-742
Timestamp: 2025-12-04T20:09:09.264Z
Learning: In cpp/src/linear_programming/solve.cu, the barrier solver does not currently return INFEASIBLE or UNBOUNDED status. It only returns OPTIMAL, TIME_LIMIT, NUMERICAL_ISSUES, or CONCURRENT_LIMIT.
Applied to files:
cpp/src/linear_programming/utilities/cython_solve.cu
🧬 Code graph analysis (1)
cpp/src/linear_programming/utilities/cython_solve.cu (1)
cpp/tests/distance_engine/waypoint_matrix_test.cpp (4)
`stream` (49-87), `stream` (89-97), `stream` (136-161), `stream` (197-214)
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
efa2ac0 to 9293aed
```cpp
response.lp_ret.primal_solution_->set_stream(rmm::cuda_stream_per_thread);
response.lp_ret.dual_solution_->set_stream(rmm::cuda_stream_per_thread);
response.lp_ret.reduced_cost_->set_stream(rmm::cuda_stream_per_thread);
response.lp_ret.current_primal_solution_->set_stream(rmm::cuda_stream_per_thread);
response.lp_ret.current_dual_solution_->set_stream(rmm::cuda_stream_per_thread);
response.lp_ret.initial_primal_average_->set_stream(rmm::cuda_stream_per_thread);
response.lp_ret.initial_dual_average_->set_stream(rmm::cuda_stream_per_thread);
response.lp_ret.current_ATY_->set_stream(rmm::cuda_stream_per_thread);
response.lp_ret.sum_primal_solutions_->set_stream(rmm::cuda_stream_per_thread);
response.lp_ret.sum_dual_solutions_->set_stream(rmm::cuda_stream_per_thread);
response.lp_ret.last_restart_duality_gap_primal_solution_->set_stream(
  rmm::cuda_stream_per_thread);
response.lp_ret.last_restart_duality_gap_dual_solution_->set_stream(
  rmm::cuda_stream_per_thread);
```
I think we'll have to be careful not to forget to set the streams here in cython_solve, if we made any additions to the PDLP structures. However I cannot think of a way to do this programmatically and cleanly.... Maybe coderabbit can remember this as a rule somehow and help us there
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
cpp/src/dual_simplex/barrier.cu (3)
3140-3152: Missing `.value()` calls in `compute_cc_rhs`, inconsistent with the PR objective. These `cub::DeviceTransform::Transform` calls still use `stream_view_` directly instead of `stream_view_.value()`, which is inconsistent with all other changes in this PR. If CCCL 3.2 requires the raw `cudaStream_t`, these calls will exhibit the same regression.

🐛 Proposed fix
```diff
 cub::DeviceTransform::Transform(
   cuda::std::make_tuple(data.d_dx_aff_.data(), data.d_dz_aff_.data()),
   data.d_complementarity_xz_rhs_.data(),
   data.d_complementarity_xz_rhs_.size(),
   [new_mu] HD(f_t dx_aff, f_t dz_aff) { return -(dx_aff * dz_aff) + new_mu; },
-  stream_view_);
+  stream_view_.value());
 cub::DeviceTransform::Transform(
   cuda::std::make_tuple(data.d_dw_aff_.data(), data.d_dv_aff_.data()),
   data.d_complementarity_wv_rhs_.data(),
   data.d_complementarity_wv_rhs_.size(),
   [new_mu] HD(f_t dw_aff, f_t dv_aff) { return -(dw_aff * dv_aff) + new_mu; },
-  stream_view_);
+  stream_view_.value());
```
3207-3232: Missing `.value()` calls in `compute_final_direction`, inconsistent with the PR objective. These three `cub::DeviceTransform::Transform` calls still pass `stream_view_` directly instead of `stream_view_.value()`.

🐛 Proposed fix
```diff
 cub::DeviceTransform::Transform(
   cuda::std::make_tuple(
     data.d_dw_aff_.data(), data.d_dv_aff_.data(), data.d_dw_.data(), data.d_dv_.data()),
   thrust::make_zip_iterator(data.d_dw_.data(), data.d_dv_.data()),
   data.d_dw_.size(),
   [] HD(f_t dw_aff, f_t dv_aff, f_t dw, f_t dv) -> thrust::tuple<f_t, f_t> {
     return {dw + dw_aff, dv + dv_aff};
   },
-  stream_view_);
+  stream_view_.value());
 cub::DeviceTransform::Transform(
   cuda::std::make_tuple(
     data.d_dx_aff_.data(), data.d_dz_aff_.data(), data.d_dx_.data(), data.d_dz_.data()),
   thrust::make_zip_iterator(data.d_dx_.data(), data.d_dz_.data()),
   data.d_dx_.size(),
   [] HD(f_t dx_aff, f_t dz_aff, f_t dx, f_t dz) -> thrust::tuple<f_t, f_t> {
     return {dx + dx_aff, dz + dz_aff};
   },
-  stream_view_);
+  stream_view_.value());
 cub::DeviceTransform::Transform(
   cuda::std::make_tuple(data.d_dy_aff_.data(), data.d_dy_.data()),
   data.d_dy_.data(),
   data.d_dy_.size(),
   [] HD(f_t dy_aff, f_t dy) { return dy + dy_aff; },
-  stream_view_);
+  stream_view_.value());
```
3286-3311: Missing `.value()` calls in `compute_next_iterate`, inconsistent with the PR objective. These three `cub::DeviceTransform::Transform` calls still pass `stream_view_` directly instead of `stream_view_.value()`.

🐛 Proposed fix
```diff
 cub::DeviceTransform::Transform(
   cuda::std::make_tuple(
     data.d_w_.data(), data.d_v_.data(), data.d_dw_.data(), data.d_dv_.data()),
   thrust::make_zip_iterator(data.d_w_.data(), data.d_v_.data()),
   data.d_dw_.size(),
   [step_primal, step_dual] HD(f_t w, f_t v, f_t dw, f_t dv) -> thrust::tuple<f_t, f_t> {
     return {w + step_primal * dw, v + step_dual * dv};
   },
-  stream_view_);
+  stream_view_.value());
 cub::DeviceTransform::Transform(
   cuda::std::make_tuple(
     data.d_x_.data(), data.d_z_.data(), data.d_dx_.data(), data.d_dz_.data()),
   thrust::make_zip_iterator(data.d_x_.data(), data.d_z_.data()),
   data.d_dx_.size(),
   [step_primal, step_dual] HD(f_t x, f_t z, f_t dx, f_t dz) -> thrust::tuple<f_t, f_t> {
     return {x + step_primal * dx, z + step_dual * dz};
   },
-  stream_view_);
+  stream_view_.value());
 cub::DeviceTransform::Transform(
   cuda::std::make_tuple(data.d_y_.data(), data.d_dy_.data()),
   data.d_y_.data(),
   data.d_y_.size(),
   [step_dual] HD(f_t y, f_t dy) { return y + step_dual * dy; },
-  stream_view_);
+  stream_view_.value());
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cpp/src/dual_simplex/barrier.cu
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cu,cuh}
📄 CodeRabbit inference engine (.github/.coderabbit_review_guide.md)
**/*.{cu,cuh}: Every CUDA kernel launch and memory operation must have error checking with CUDA_CHECK or equivalent verification
Avoid reinventing functionality already available in Thrust, CCCL, or RMM libraries; prefer standard library utilities over custom implementations
Files:
cpp/src/dual_simplex/barrier.cu
**/*.cu
📄 CodeRabbit inference engine (.github/.coderabbit_review_guide.md)
**/*.cu: Verify race conditions and correctness of GPU kernel shared memory, atomics, and warp-level operations
Detect inefficient GPU kernel launches with low occupancy or poor memory access patterns; optimize for coalesced memory access and minimize warp divergence in hot paths
Files:
cpp/src/dual_simplex/barrier.cu
**/*.{cu,cuh,cpp,hpp,h}
📄 CodeRabbit inference engine (.github/.coderabbit_review_guide.md)
**/*.{cu,cuh,cpp,hpp,h}: Track GPU device memory allocations and deallocations to prevent memory leaks; ensure cudaMalloc/cudaFree balance and cleanup of streams/events
Validate algorithm correctness in optimization logic: simplex pivots, branch-and-bound decisions, routing heuristics, and constraint/objective handling must produce correct results
Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero, and use epsilon comparisons for floating-point equality checks
Validate correct initialization of variable bounds, constraint coefficients, and algorithm state before solving; ensure reset when transitioning between algorithm phases (presolve, simplex, diving, crossover)
Ensure variables and constraints are accessed from the correct problem context (original vs presolve vs folded vs postsolve); verify index mapping consistency across problem transformations
For concurrent CUDA operations (barriers, async operations), explicitly create and manage dedicated streams instead of reusing the default stream; document stream lifecycle
Eliminate unnecessary host-device synchronization (cudaDeviceSynchronize) in hot paths that blocks GPU pipeline; use streams and events for async execution
Assess algorithmic complexity for large-scale problems (millions of variables/constraints); ensure O(n log n) or better complexity, not O(n²) or worse
Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems
Identify assertions with overly strict numerical tolerances that fail on legitimate degenerate/edge cases (near-zero pivots, singular matrices, empty problems)
Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state
Refactor code duplication in solver components (3+ occurrences) into shared utilities; for GPU kernels, use templated device functions to avoid duplication
Check that hard-coded GPU de...
Files:
cpp/src/dual_simplex/barrier.cu
**/*.{cu,cpp,hpp,h}
📄 CodeRabbit inference engine (.github/.coderabbit_review_guide.md)
Avoid inappropriate use of exceptions in performance-critical GPU operation paths; prefer error codes or CUDA error checking for latency-sensitive code
Files:
cpp/src/dual_simplex/barrier.cu
🧠 Learnings (11)
📓 Common learnings
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : For concurrent CUDA operations (barriers, async operations), explicitly create and manage dedicated streams instead of reusing the default stream; document stream lifecycle
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Eliminate unnecessary host-device synchronization (cudaDeviceSynchronize) in hot paths that blocks GPU pipeline; use streams and events for async execution
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Refactor code duplication in solver components (3+ occurrences) into shared utilities; for GPU kernels, use templated device functions to avoid duplication
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Check that hard-coded GPU device IDs and resource limits are made configurable; abstract multi-backend support for different CUDA versions
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Track GPU device memory allocations and deallocations to prevent memory leaks; ensure cudaMalloc/cudaFree balance and cleanup of streams/events
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh} : Avoid reinventing functionality already available in Thrust, CCCL, or RMM libraries; prefer standard library utilities over custom implementations
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Ensure variables and constraints are accessed from the correct problem context (original vs presolve vs folded vs postsolve); verify index mapping consistency across problem transformations
Learnt from: chris-maes
Repo: NVIDIA/cuopt PR: 746
File: cpp/src/dual_simplex/barrier.cu:549-552
Timestamp: 2026-01-14T00:38:38.038Z
Learning: In cpp/src/dual_simplex/barrier.cu's form_adat() method within the barrier solver, the raft::copy from d_original_A_values to device_AD.x is necessary and must not be removed. This copy restores the original matrix A values before they are scaled by the diagonal matrix D (via d_inv_diag_prime). Since D changes between iterations and the scaling mutates device_AD.x, copying the original values each iteration is required for correctness.
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Validate correct initialization of variable bounds, constraint coefficients, and algorithm state before solving; ensure reset when transitioning between algorithm phases (presolve, simplex, diving, crossover)
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : For concurrent CUDA operations (barriers, async operations), explicitly create and manage dedicated streams instead of reusing the default stream; document stream lifecycle
Applied to files:
cpp/src/dual_simplex/barrier.cu
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Eliminate unnecessary host-device synchronization (cudaDeviceSynchronize) in hot paths that blocks GPU pipeline; use streams and events for async execution
Applied to files:
cpp/src/dual_simplex/barrier.cu
📚 Learning: 2026-01-14T00:38:38.038Z
Learnt from: chris-maes
Repo: NVIDIA/cuopt PR: 746
File: cpp/src/dual_simplex/barrier.cu:549-552
Timestamp: 2026-01-14T00:38:38.038Z
Learning: In cpp/src/dual_simplex/barrier.cu's form_adat() method within the barrier solver, the raft::copy from d_original_A_values to device_AD.x is necessary and must not be removed. This copy restores the original matrix A values before they are scaled by the diagonal matrix D (via d_inv_diag_prime). Since D changes between iterations and the scaling mutates device_AD.x, copying the original values each iteration is required for correctness.
Applied to files:
cpp/src/dual_simplex/barrier.cu
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Refactor code duplication in solver components (3+ occurrences) into shared utilities; for GPU kernels, use templated device functions to avoid duplication
Applied to files:
cpp/src/dual_simplex/barrier.cu
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh} : Avoid reinventing functionality already available in Thrust, CCCL, or RMM libraries; prefer standard library utilities over custom implementations
Applied to files:
cpp/src/dual_simplex/barrier.cu
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Track GPU device memory allocations and deallocations to prevent memory leaks; ensure cudaMalloc/cudaFree balance and cleanup of streams/events
Applied to files:
cpp/src/dual_simplex/barrier.cu
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Check that hard-coded GPU device IDs and resource limits are made configurable; abstract multi-backend support for different CUDA versions
Applied to files:
cpp/src/dual_simplex/barrier.cu
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Verify error propagation from CUDA to user-facing APIs is complete; ensure CUDA errors are caught and mapped to meaningful user error codes
Applied to files:
cpp/src/dual_simplex/barrier.cu
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.cu : Detect inefficient GPU kernel launches with low occupancy or poor memory access patterns; optimize for coalesced memory access and minimize warp divergence in hot paths
Applied to files:
cpp/src/dual_simplex/barrier.cu
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state
Applied to files:
cpp/src/dual_simplex/barrier.cu
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: wheel-build-cuopt-mps-parser / 13.1.0, 3.12, amd64, rockylinux8
- GitHub Check: wheel-build-cuopt-mps-parser / 13.1.0, 3.13, amd64, rockylinux8
- GitHub Check: wheel-build-cuopt-mps-parser / 13.1.0, 3.12, arm64, rockylinux8
- GitHub Check: wheel-build-cuopt-mps-parser / 13.1.0, 3.13, arm64, rockylinux8
- GitHub Check: wheel-build-cuopt-mps-parser / 13.1.0, 3.10, amd64, rockylinux8
- GitHub Check: wheel-build-cuopt-mps-parser / 13.1.0, 3.11, arm64, rockylinux8
- GitHub Check: wheel-build-cuopt-mps-parser / 13.1.0, 3.11, amd64, rockylinux8
- GitHub Check: wheel-build-cuopt-mps-parser / 13.1.0, 3.10, arm64, rockylinux8
🔇 Additional comments (5)
cpp/src/dual_simplex/barrier.cu (5)
48-87: LGTM: Consistent use of `stream.value()` for wrapper functions. These non-template wrappers correctly pass the raw `cudaStream_t` via `.value()` to `cub::DeviceTransform::Transform`, addressing the CCCL 3.2 API compatibility issue.
1371-1376: LGTM: Correctly updated stream argument.
2052-2106: LGTM: All transform calls consistently updated in `gpu_compute_residuals`.
2345-2382: LGTM: Diagonal computation transforms correctly updated.
3001-3015: LGTM: Affine RHS computation transforms correctly updated.
/merge
This PR reverts the change to a non-blocking stream and fixes a silent regression following an API change in CCCL 3.2
Closes: #741
Summary by CodeRabbit
Refactor
Chores
✏️ Tip: You can customize this high-level summary in your review settings.