Sparse solver on GPU #600

Merged: 10 commits merged into develop on Mar 14, 2023

Conversation

@nychiang (Collaborator) commented Mar 2, 2023

This PR shows that the sparse solver can run in GPU mode, i.e., with mem_space = device and compute_mode = gpu.
I created a new example, NlpSparseRajaEx2, in this PR.

We should remove the class hiopLinSolverSymSparseCUSOLVERGPU once we have a linear solver that can work with data on the device.
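
For context, the GPU mode is selected through HiOp's option system. Below is a minimal driver sketch; the problem class name SparseRajaEx2 and the constructor arguments are illustrative assumptions, while the option names follow the ones quoted above.

#include "hiopInterface.hpp"
#include "hiopNlpFormulation.hpp"
#include "hiopAlgFilterIPM.hpp"

int main()
{
  // Hypothetical problem instance built from the new NlpSparseRajaEx2 example.
  SparseRajaEx2 nlp_interface(/* problem sizes, memory space, ... */);
  hiop::hiopNlpSparse nlp(nlp_interface);

  // Keep vectors/matrices resident on the device and run the kernels on the GPU.
  nlp.options->SetStringValue("mem_space", "device");
  nlp.options->SetStringValue("compute_mode", "gpu");

  hiop::hiopAlgFilterIPMNewton solver(&nlp);
  hiop::hiopSolveStatus status = solver.run();
  return (status == hiop::Solve_Success) ? 0 : 1;
}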

@nychiang (Collaborator, Author) commented Mar 2, 2023

Example NlpSparseRajaEx2 (GPU sparse solver) works fine on newell and lassen; I am not sure why it fails on marianas.
In addition, I am not sure why one Spack build failed; I receive the same failure in PR #589.
@cameronrutherford @pelesh, any idea?

@cnpetra (Collaborator) commented Mar 2, 2023

@cameronrutherford: there is a Spack build failure for all the PRs/branches; see #601.

@nychiang (Collaborator, Author) commented Mar 3, 2023

@cnpetra it seems we are running out of space on PNNL platforms.

@nychiang marked this pull request as ready for review on March 4, 2023, 01:24
@pelesh (Collaborator) left a comment

This looks very good and is a sound way to enable GPU-resident optimization with sparse linear algebra.

I am not sure if using inheritance CUSOLVERGPU -> CUSOLVER is the best approach here, even if this is just a temporary solution as @nychiang suggests. It will lead to a lot of duplicate code, which we will need to remove later on.

The current implementation of the cusolver-lu module already works with the data on the device. It requires minimal modification to enable receiving data in the HiOp-specified memory space. I suggest we work directly with the hiopLinSolverSymSparseCUSOLVER class. Going around with a derived class seems unnecessary.

@cnpetra (Collaborator) commented Mar 6, 2023

> This looks very good and is a sound way to enable GPU-resident optimization with sparse linear algebra.
>
> I am not sure if using inheritance CUSOLVERGPU -> CUSOLVER is the best approach here, even if this is just a temporary solution as @nychiang suggests. It will lead to a lot of duplicate code, which we will need to remove later on.
>
> The current implementation of the cusolver-lu module already works with the data on the device. It requires minimal modification to enable receiving data in the HiOp-specified memory space. I suggest we work directly with the hiopLinSolverSymSparseCUSOLVER class. Going around with a derived class seems unnecessary.

This PR is supposed to only take care of GPU porting up to the linear solver(s). cuSOLVER was chosen somewhat arbitrarily, just to ensure the rest of the kernels run fine. The porting of the linear solvers is a bit more involved and is better done later on.

@cnpetra (Collaborator) commented Mar 7, 2023

This PR looks great. How do you want to proceed re: PR #589? Which one should be merged first?

@cnpetra (Collaborator) commented Mar 7, 2023

Maybe also merge develop into this.

Also, PNNL CI is not passing.

@nychiang (Collaborator, Author) commented Mar 7, 2023

> Maybe also merge develop into this.
>
> Also, PNNL CI is not passing.

I tried to remove all the member objects from RAJA, but it still fails on marianas. I am not sure which part causes the failure.
Since I haven't figured out this problem, I think we can merge #589 first.

Comment on lines +1688 to +1690
hiopMatrixSparse* swap_ptr = M_;
M_ = M_host_;
M_host_ = swap_ptr;
Collaborator review comment:

If the sparse matrix is implemented correctly, you really should not have to do this. The matrix class needs to provide both the device data and the host mirror, as well as methods to sync the two.
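
A rough sketch of the interface shape being described; only M() and M_host() are accessors mentioned in this thread, while the class and sync method names below are hypothetical:

class SparseMatrixWithHostMirror
{
public:
  double* M();            // nonzero values resident on the device
  double* M_host();       // host mirror of the nonzero values
  void sync_to_host();    // hypothetical: copy device values into the host mirror
  void sync_to_device();  // hypothetical: copy host mirror values to the device
};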

@nychiang (Collaborator, Author) replied:

hiopMatrixSparse has only one member function, M(), to access its data, while hiopMatrixRajaSparseTriplet has the accessors M() and M_host() for the data on device and on host (see here).

However, the problem is that previously the sparse solver only worked in cpu and hybrid modes. It assumes the KKT matrix M_ is always on the host and has the type hiopMatrixSparseTriplet. Therefore, in the existing implementation of hiopLinSolverSparseCUSOLVER.cpp, there are many places that use the host data `M_->M()` and then transfer it to the device (for hybrid usage); for example, see here. The current implementation cannot directly work with a matrix on the device.

If we want to use the existing hiopLinSolverSparseCUSOLVER without adding a new wrapper class, I need to dynamic_cast M_ to the type hiopMatrixRajaSparseTriplet first, sync the data, and change M_->M() to M_->M_host() in a couple of places. This seems like more work than adding a new wrapper class.

As a result, we think creating a new wrapper class minimizes the changes to your code and makes it easier to remove once the linear solver supports data on the device.
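
For reference, a rough sketch of the alternative described above; the dynamic_cast target follows the class named in this thread, while the sync method name is a hypothetical assumption:

// Cast the KKT matrix to the RAJA triplet type, sync device data to the host
// mirror, and let the existing cuSOLVER code read the host pointer.
auto* raja_M = dynamic_cast<hiop::hiopMatrixRajaSparseTriplet*>(M_);
if(raja_M) {
  raja_M->copyFromDev();                 // hypothetical device-to-host sync
  double* vals_host = raja_M->M_host();  // host mirror accessor mentioned above
  // ... use vals_host in the places of hiopLinSolverSparseCUSOLVER.cpp that
  //     previously assumed M_->M() returned host data ...
}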

@pelesh (Collaborator) commented Mar 8, 2023

> This PR is supposed to only take care of GPU porting up to the linear solver(s). cuSOLVER was chosen somewhat arbitrarily, just to ensure the rest of the kernels run fine. The porting of the linear solvers is a bit more involved and is better done later on.

The linear solver is already ported to GPU, so I am not sure what else needs to be done there. Regardless of that, creating a CUSOLVERGPU class is not helping what you are trying to achieve in this PR. It is really hard to see the rationale behind adding that class.

@nychiang (Collaborator, Author) commented Mar 8, 2023

@pelesh As @cnpetra mentioned, right now I am working on GPU porting up to the linear solver(s).
This PR is to create a working example to test the whole workflow.
Soon I will have another PR to port the initialization part onto the device; right now we preprocess some data on the cpu and then transfer it to the gpu.

@nychiang (Collaborator, Author) commented:

@cnpetra see #605. A host array is used for the var/con type, and it is used in the MDS_RAJA example as well. I will file another PR to fix this issue.

@pelesh (Collaborator) commented Mar 12, 2023

> We should remove the class hiopLinSolverSymSparseCUSOLVERGPU once we have a linear solver that can work with data on the device.

Note that GPU sparse linear solvers need to set up and run the first factorization on the host before moving all data to the GPU. This is what the sparse HiOp module needs to account for as well. This is different from MDS, where you can move data to the GPU after the initial setup.

This applies not only to the sparse linear solver: if you have a sparse matrix-matrix product, you also want to run the first iteration on the host to compute the sparsity pattern of the product there, and then reuse it for subsequent iterations on the GPU.
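
In other words, the flow being described looks roughly like the following control-flow sketch; all class, member, and method names here are hypothetical, not existing HiOp code:

void SparseKktSolver::factorize()
{
  if(!patterns_computed_) {
    // First call: assemble on the host and compute the sparsity patterns of the
    // matrix-matrix products and of the factors (symbolic factorization).
    compute_sparsity_patterns_on_host();
    patterns_computed_ = true;
    // Only after this point is it safe to keep all data resident on the GPU.
    move_data_to_device();
  }
  // Subsequent calls reuse the precomputed patterns and factorize numerically
  // on the device.
  numeric_factorization_on_device();
}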

Comment on lines +403 to +407
MJacS[0] = 4.0;
MJacS[1] = 2.0;
// --- constraint 2 body ---> 2*x_1 + x_3
MJacS[2] = 2.0;
MJacS[3] = 1.0;
Collaborator review comment:

I am not sure this makes sense. Consider rewriting as:

Suggested change:
if(itrow == 0) {
  MJacS[0] = 4.0;
  MJacS[1] = 2.0;
  // --- constraint 2 body ---> 2*x_1 + x_3
  MJacS[2] = 2.0;
  MJacS[3] = 1.0;
}

Collaborator replied:

Actually, this is better added on the host and then copied to the GPU, since these are just constant Jacobian elements that stay the same throughout the computation.

@nychiang (Collaborator, Author) replied:

The if condition is not required here, since the RAJA::RangeSegment only contains one valid index.
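
For reference, the kind of kernel being discussed looks roughly like this; a sketch that assumes a CUDA execution policy and that MJacS is a device pointer, so the actual example may differ in policy and indexing:

#include "RAJA/RAJA.hpp"

// A one-element RangeSegment: the lambda body runs exactly once, so the
// if(itrow == 0) guard suggested above would be redundant.
RAJA::forall<RAJA::cuda_exec<128>>(
  RAJA::RangeSegment(0, 1),
  [=] RAJA_DEVICE (RAJA::Index_type itrow)
  {
    // constant Jacobian entries of the first constraint
    MJacS[0] = 4.0;
    MJacS[1] = 2.0;
    // --- constraint 2 body ---> 2*x_1 + x_3
    MJacS[2] = 2.0;
    MJacS[3] = 1.0;
  });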

@pelesh (Collaborator) left a comment

One missing piece in this PR is a flag that would inform HiOp when to move data to the device. Unlike in the dense linear algebra case, data is not moved to the GPU after the initial setup, but after the sparsity patterns for the matrix-matrix products and the matrix factors are computed. The flag should be set to true once the HiOp sparse module and the linear solver have computed the sparsity patterns.

@pelesh (Collaborator) commented Mar 13, 2023

> @pelesh As @cnpetra mentioned, right now I am working on GPU porting up to the linear solver(s). This PR is to create a working example to test the whole workflow. Soon I will have another PR to port the initialization part onto the device; right now we preprocess some data on the cpu and then transfer it to the gpu.

This is pretty much how it should work in the long term. I am not sure that porting initialization to the device will buy you much; it will likely be a performance loss. Most of the setup operations are not SIMD and perform far better on the CPU.

@cnpetra (Collaborator) commented Mar 13, 2023

> We should remove the class hiopLinSolverSymSparseCUSOLVERGPU once we have a linear solver that can work with data on the device.
>
> Note that GPU sparse linear solvers need to set up and run the first factorization on the host before moving all data to the GPU. This is what the sparse HiOp module needs to account for as well. This is different from MDS, where you can move data to the GPU after the initial setup.
>
> This applies not only to the sparse linear solver: if you have a sparse matrix-matrix product, you also want to run the first iteration on the host to compute the sparsity pattern of the product there, and then reuse it for subsequent iterations on the GPU.

When mem_space is gpu, the linear solver class can/should create data copies on the cpu if it needs to. See, for example: https://github.com/LLNL/hiop/blob/develop/src/LinAlg/hiopLinSolverCholCuSparse.cpp#L228
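
A minimal sketch of that pattern using Umpire, which HiOp uses for memory management; the variable names are illustrative assumptions and this is not the code at the linked line:

#include "umpire/ResourceManager.hpp"
#include "umpire/Allocator.hpp"

// Allocate a host copy of device-resident values and sync it, as a linear
// solver could do internally when mem_space is "device". `nnz` and
// `vals_device` are assumed to be provided by the caller.
auto& rm = umpire::ResourceManager::getInstance();
umpire::Allocator host_alloc = rm.getAllocator("HOST");

double* vals_host = static_cast<double*>(host_alloc.allocate(nnz * sizeof(double)));
rm.copy(vals_host, vals_device, nnz * sizeof(double));  // device -> host copy

// ... do the host-side work (e.g., compute a fill-reducing ordering) ...

host_alloc.deallocate(vals_host);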

@cnpetra (Collaborator) commented Mar 13, 2023

> One missing piece in this PR is a flag that would inform HiOp when to move data to the device. Unlike in the dense linear algebra case, data is not moved to the GPU after the initial setup, but after the sparsity patterns for the matrix-matrix products and the matrix factors are computed. The flag should be set to true once the HiOp sparse module and the linear solver have computed the sparsity patterns.

Such data movements should be handled by the linear solver class. There are sparse linear solvers currently in HiOp that do work fully on the GPU (but under different options they may do cpu work, like computing a minimum fill-in ordering of the factors in the Cholesky class; see the comment above).

@cnpetra merged commit d682ec8 into develop on Mar 14, 2023