Add support for solving uniform batches with CUDSS #78
Conversation
Pull request overview
This PR adds support for solving uniform batches of quadratic programming problems using CUDSS (NVIDIA's cuDSS direct sparse solver library). The implementation enables solving multiple QP problems with identical sparsity patterns in parallel on the GPU, which can provide significant performance benefits for applications like Model Predictive Control (a usage sketch follows the list below).
Key changes:
- Refactored the solver initialization and main loop into modular functions to support both single and batch solving modes
- Implemented a new UniformBatch extension providing batch-specific KKT system handling and solver coordination
- Added custom broadcasting infrastructure for efficient iteration over active solvers in the batch
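For a flavor of the intended usage, here is a hedged sketch; the constructor call and the `make_qp` helper are assumptions based on the file list below, not the actual API:

```julia
using MadIPM, CUDA

# Hypothetical usage sketch: solve 64 QPs sharing one sparsity pattern with a
# single batched CUDSS factorization instead of 64 sequential GPU solves.
qps   = [make_qp(i) for i in 1:64]   # `make_qp` is a placeholder problem builder
batch = UniformBatchSolver(qps)      # shared batch KKT system + per-sample MPCSolvers
MadIPM.solve!(batch)                 # interior-point iterations run in lockstep
```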
Reviewed changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| test/test_gpu.jl | Added comprehensive test for CUDSS uniform batch solving, verifying results match individual solver runs |
| test/runtests.jl | Modified simple_lp helper to accept configurable Avals parameter and changed exact equality to approximate equality for robustness |
| src/solver.jl | Refactored initialization into pre_initialize!, init_starting_point_solve!, and post_initialize! for batch compatibility; extracted helper functions for key operations |
| src/linear_solver.jl | Added build_kkt! wrapper and extracted post_solve! logic from solve_system! for better modularity |
| ext/MadIPMCUDAExt/UniformBatch/structure.jl | Implemented UniformBatchSolver structure to manage multiple solver instances with shared batch KKT system |
| ext/MadIPMCUDAExt/UniformBatch/solver.jl | Implemented batch versions of initialization, factorization, and MPC algorithm with selective solver activation |
| ext/MadIPMCUDAExt/UniformBatch/kkt.jl | Implemented UniformBatchKKTSystem for managing batched matrix factorizations and solves with CUDSS |
| ext/MadIPMCUDAExt/UniformBatch/broadcast.jl | Added custom broadcasting to efficiently iterate over active solvers in batch |
| ext/MadIPMCUDAExt/UniformBatch/UniformBatch.jl | Module entry point defining helper functions for batch solving with reduced KKT systems |
| ext/MadIPMCUDAExt/MadIPMCUDAExt.jl | Added include statement for UniformBatch module |
| Project.toml | Relaxed MadNLPGPU version constraint from 0.7.15 to 0.7 |
Co-authored-by: Andrew Rosemberg <[email protected]>
LGTM. I can recheck after the FIXME and TODO are done, but these seem straightforward. A DCOPF test case will be nice for confirming on a larger case.
Here are the benchmark results from running https://github.com/klamike/MadIPM.jl/blob/mk/batch_profile/test.jl (note it is …):
Starting 89_pegase
[ Info: Loading matpower file
┌ Info: 89_pegase x 4 -- Batch is 1.68x faster
│ t_loop = 0.292890551
│ t_batch = 0.174158794
│ t_loop - t_batch = 0.118731757
│ t_loop / batch_size = 0.07322263775
└ t_batch / batch_size = 0.0435396985
┌ Info: 89_pegase x 16 -- Batch is 1.41x faster
│ t_loop = 1.065516498
│ t_batch = 0.755904324
│ t_loop - t_batch = 0.3096121740000001
│ t_loop / batch_size = 0.066594781125
└ t_batch / batch_size = 0.04724402025
┌ Info: 89_pegase x 64 -- Batch is 1.78x faster
│ t_loop = 4.670513256
│ t_batch = 2.619187511
│ t_loop - t_batch = 2.0513257449999998
│ t_loop / batch_size = 0.072976769625
└ t_batch / batch_size = 0.040924804859375
Starting 1354_pegase
[ Info: Loading matpower file
┌ Info: 1354_pegase x 4 -- Batch is 1.93x faster
│ t_loop = 0.619273734
│ t_batch = 0.320906251
│ t_loop - t_batch = 0.298367483
│ t_loop / batch_size = 0.1548184335
└ t_batch / batch_size = 0.08022656275
┌ Info: 1354_pegase x 16 -- Batch is 2.32x faster
│ t_loop = 2.469379521
│ t_batch = 1.063210762
│ t_loop - t_batch = 1.406168759
│ t_loop / batch_size = 0.1543362200625
└ t_batch / batch_size = 0.066450672625
┌ Info: 1354_pegase x 64 -- Batch is 2.16x faster
│ t_loop = 10.421725742
│ t_batch = 4.823384208
│ t_loop - t_batch = 5.598341533999999
│ t_loop / batch_size = 0.16283946471875
└ t_batch / batch_size = 0.07536537825
Starting 2869_pegase
[ Info: Loading matpower file
┌ Info: 2869_pegase x 4 -- Batch is 2.25x faster
│ t_loop = 0.958210774
│ t_batch = 0.425290291
│ t_loop - t_batch = 0.5329204830000001
│ t_loop / batch_size = 0.2395526935
└ t_batch / batch_size = 0.10632257275
┌ Info: 2869_pegase x 16 -- Batch is 3.06x faster
│ t_loop = 3.95141398
│ t_batch = 1.29147746
│ t_loop - t_batch = 2.6599365199999996
│ t_loop / batch_size = 0.24696337375
└ t_batch / batch_size = 0.08071734125
┌ Info: 2869_pegase x 64 -- Batch is 3.15x faster
│ t_loop = 16.175443389
│ t_batch = 5.13312229
│ t_loop - t_batch = 11.042321099000002
│ t_loop / batch_size = 0.252741302953125
└ t_batch / batch_size = 0.08020503578125
Starting 6470_rte
[ Info: Loading matpower file
┌ Info: 6470_rte x 4 -- Batch is 1.7x faster
│ t_loop = 1.65961581
│ t_batch = 0.978801008
│ t_loop - t_batch = 0.680814802
│ t_loop / batch_size = 0.4149039525
└ t_batch / batch_size = 0.244700252
┌ Info: 6470_rte x 16 -- Batch is 4.13x faster
│ t_loop = 6.732655854
│ t_batch = 1.629377134
│ t_loop - t_batch = 5.10327872
│ t_loop / batch_size = 0.420790990875
└ t_batch / batch_size = 0.101836070875
┌ Info: 6470_rte x 64 -- Batch is 4.91x faster
│ t_loop = 27.903837811
│ t_batch = 5.677465911
│ t_loop - t_batch = 22.2263719
│ t_loop / batch_size = 0.435997465796875
└ t_batch / batch_size = 0.088710404859375
Starting 9241_pegase
[ Info: Loading matpower file
┌ Info: 9241_pegase x 4 -- Batch is 1.06x faster
│ t_loop = 2.934963509
│ t_batch = 2.772024485
│ t_loop - t_batch = 0.16293902399999993
│ t_loop / batch_size = 0.73374087725
└ t_batch / batch_size = 0.69300612125
┌ Info: 9241_pegase x 16 -- Batch is 4.73x faster
│ t_loop = 11.660498002
│ t_batch = 2.465172773
│ t_loop - t_batch = 9.195325229000002
│ t_loop / batch_size = 0.728781125125
└ t_batch / batch_size = 0.1540732983125
┌ Info: 9241_pegase x 64 -- Batch is 5.1x faster
│ t_loop = 48.443547009
│ t_batch = 9.497881699
│ t_loop - t_batch = 38.945665309999995
│ t_loop / batch_size = 0.756930422015625
└ t_batch / batch_size = 0.148404401546875
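For quick reference, the speedups reported above (batched solve vs. a loop of individual solves):

| Case | ×4 | ×16 | ×64 |
|---|---|---|---|
| 89_pegase | 1.68x | 1.41x | 1.78x |
| 1354_pegase | 1.93x | 2.32x | 2.16x |
| 2869_pegase | 2.25x | 3.06x | 3.15x |
| 6470_rte | 1.70x | 4.13x | 4.91x |
| 9241_pegase | 1.06x | 4.73x | 5.10x |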
frapac left a comment:
This looks good to me! The implementation of the batch solver is very nice, and it does not touch much of the internals of MadIPM. I like the direction this project is taking. In the long term, it would be interesting to know how much we lose by not storing our vectors contiguously in memory across the different batches.
I only have minor comments so far. This PR can be merged as soon as they are addressed, so we can move on.
    while true
        # Check termination criteria
        for_active(batch_solver,
            MadNLP.print_iter,
Do we print the iter for all batches? Isn't the output a bit messy as a result?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is quite messy. I just didn't want to think about it yet 😉
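Judging from the call site above, a minimal sketch of what `for_active` could look like (the actual implementation is in ext/MadIPMCUDAExt/UniformBatch/broadcast.jl; the field names below are assumptions):

```julia
# Hypothetical sketch, not the actual implementation: apply `f(solver, args...)`
# to every solver in the batch that has not yet terminated.
# `solvers` and `active` are assumed field names of the batch solver.
function for_active(batch_solver, f, args...)
    for (solver, active) in zip(batch_solver.solvers, batch_solver.active)
        active || continue   # skip samples that already converged
        f(solver, args...)
    end
end
```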
    # Print remaining options (unsupported)
    if !isempty(remaining_options)
    -    MadNLP.print_ignored_options(logger, remaining_options)
    +    # MadNLP.print_ignored_options(logger, remaining_options)
Dead comment? We can remove this line if needed
This is due to NoLinearSolver not considering the args that are meant for the batch solver. I do think reporting ignored options is a useful feature; it would be better to refactor how options are handled in the BatchSolver constructor so we can detect what is meant for the batch solver and what is meant for the individual MPCSolver (a rough sketch of that split follows).
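As a rough illustration of that refactor, a hedged sketch; the option names and the splitting helper here are assumptions, not the actual API:

```julia
# Hypothetical sketch: split user kwargs between batch-level options and
# options forwarded to each individual MPCSolver, so genuinely unknown
# options can still be reported instead of silently dropped.
const BATCH_OPTION_NAMES = (:batch_size, :sync_iterations)   # assumed names

function split_options(; kwargs...)
    opts        = Dict{Symbol,Any}(kwargs)
    batch_opts  = Dict(k => v for (k, v) in opts if k in BATCH_OPTION_NAMES)
    solver_opts = Dict(k => v for (k, v) in opts if !(k in BATCH_OPTION_NAMES))
    return batch_opts, solver_opts
end
```

The BatchSolver constructor could then consume `batch_opts`, forward `solver_opts` to each MPCSolver, and pass whatever neither recognizes to `MadNLP.print_ignored_options`.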
Merging this into the …
    -    solver::MadNLP.AbstractMadNLPSolver;
    -    kwargs...
    -)
    +function solve!(solver::MadNLP.AbstractMadNLPSolver)
@klamike Why did you drop the support for kwargs... and options?
For example, it could be nice to use iterative refinement on the fly.
I think there was a bug here; at least some of these options don't actually get set. I need to go back and check the details more closely.
Note it is not "on the fly" since we re-initialize in solve!. But it would still be nice to allow re-solving with updated options.
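For reference, a hedged sketch of the kind of kwargs forwarding being discussed; the PR currently removes it (per the diff above), and the option-setting mechanism below is an assumption:

```julia
using MadNLP

# Hypothetical sketch: re-applying options on each call would let a re-solve
# pick up new settings, e.g. enabling iterative refinement between two solves.
function solve!(solver::MadNLP.AbstractMadNLPSolver; kwargs...)
    for (name, value) in kwargs
        setproperty!(solver.opt, name, value)   # assumes a mutable options struct at `solver.opt`
    end
    # ... re-initialize and run the interior-point loop as before ...
end
```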
amontoison left a comment:
Good work @klamike 👍
I only have one comment, concerning the keyword arguments of the function solve!.
@frapac @amontoison @andrewrosemberg
It is still using the latest release of MadNLP, but it will be easy to rebase on #76 once the next release is out.
The main idea is:
- Each sample's `aug_com.nzVal` points to a slice of the batch solver's `tril.nzVal`. This lets us re-use all the KKT building machinery.
- Each sample's RHS (`primal_dual(::UnreducedKKTVector)`) points to a slice of a dedicated buffer. After the batch solve, copy the results back to each sample's `UnreducedKKTVector`.
- … the `nzVal`/RHS pointers when samples terminate.
- … a `Vector{MPCSolver}` that shares buffers …
- … `AbstractSparseKKTSystem` …
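A minimal sketch of the slicing idea in the first two bullets; the sizes and names below are illustrative, not the actual fields:

```julia
using CUDA

# Illustrative sketch of the aliasing idea: one contiguous buffer holds the
# KKT nonzeros of every sample, and each sample's `nzVal` is a view into it,
# so the existing single-problem KKT assembly writes directly into the
# batched storage that CUDSS factorizes.
nnz_kkt    = 1_000                     # assumed nonzeros per KKT matrix
batch_size = 4
tril_nzVal = CUDA.zeros(Float64, nnz_kkt * batch_size)   # batch buffer

# The slice that sample i's `aug_com.nzVal` would alias:
sample_nzval(i) = view(tril_nzVal, (i - 1) * nnz_kkt + 1 : i * nnz_kkt)
```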
aug_com.nzValpointing to a slice of the batch solver'stril.nzVal. This lets us re-use all the KKT building machinery.primal_dual(::UnreducedKKTVector)) to a slice of a dedicated buffer. After the batch solve, copy the results back to each sample'sUnreducedKKTVector.nzVal/RHS pointers when samples terminate.Vector{MPCSolver}that shares buffersAbstractSparseKKTSystem