NaN Residuals with CUDA GMRES+ParILUT #1486
Comments
Part 1 of the matrix |
I have been debugging the issue and tracked it down to : line 131, where lu_cur_val is NaN or Inf. I am using a slightly modified version of ilu-preconditioned-solver-example.cpp. These are the parameters that I am using:
Please let me know whether you can reproduce the issue. |
Hi @iontcheva, the |
I think what's likely happening here is that, while we guard against NaNs/Infs in our asynchronous sweep, they may still come up in the other operations, e.g. from an overflowing value in SpGEMM (which is where the lu_val comes from). Without looking at the specific problem, I'm not sure we can do much about this; the preconditioner may just not work on certain problems. |
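To illustrate the overflow mechanism described above, here is a minimal host-side sketch. It is not Ginkgo or cuSPARSE code: `dot` and `is_usable` are hypothetical helpers that mimic one row-times-column accumulation of a sparse product. In single precision a single large product already overflows to Inf, while the same inputs stay finite in double precision. Once an Inf is present, later operations such as Inf - Inf turn it into NaN.

```cpp
#include <cmath>

// Hypothetical helper (not Ginkgo code): one row-times-column accumulation,
// the basic operation inside a sparse matrix-matrix product.
template <typename T>
T dot(const T* a, const T* b, int n)
{
    T sum = T{0};
    for (int i = 0; i < n; ++i) {
        sum += a[i] * b[i];  // a large product can overflow to +/-Inf here
    }
    return sum;
}

// True if v is neither Inf nor NaN. std::isfinite already rejects NaN,
// so no separate NaN test is needed.
template <typename T>
bool is_usable(T v)
{
    return std::isfinite(v);
}
```

For example, with entries of magnitude 1e30, `dot` returns a non-finite value for `float` but a finite one for `double`. This assumes the code is compiled without `-ffast-math`, which may optimize away finiteness checks.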
Hi MarcelKoch, upsj, I tried many values for .with_iterations, including 20, which is quite high, but the issue is not resolved. I think there is a bug in the CUDA implementation. The OMP and Reference versions work perfectly fine, as I have shown above, so I do not think the ParILUT algorithm itself is the problem.

Regarding the comment from upsj: I have a check if (!is_finite(lu_val) || is_nan(lu_val)) in tri_spgeam_init and it does not seem to get triggered, which I think means that the values computed in the cuSPARSE SpGEMM are fine. What does get triggered is a similar check on the value lu_cur_val after line 131.

Did you manage to assemble the matrix that I sent and try the specific example? I do not think one can say much without looking at the specific example anyway. If you can reproduce the issue on your side and resolve it, I think it would make Ginkgo CUDA much more useful for many applications. Right now one cannot solve anything harder (and all the cases in real applications are of that kind) with a simple ILU(0)-type preconditioner like ParILU; one needs more advanced preconditioners like ParILUT.

I am not sure whether this is relevant, but as an observation: after you made the fix with the atomic load_relaxed and store_relaxed in the sweep a few weeks ago, I was able to get some of my smaller examples to work with Ginkgo CUDA GMRES+ParILUT. That was not the case before, when I was getting NaNs on all of my examples. |
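A check like the one described above can also be run on the host over a whole array of values, for instance after copying factor values back from the device, to find where the first bad entry appears. The following is only a sketch under that assumption: `first_non_finite` is a hypothetical helper, not a Ginkgo function, and it uses a plain `std::vector` rather than Ginkgo's array types.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical diagnostic (not part of Ginkgo): returns the index of the
// first Inf or NaN entry, or values.size() if every entry is finite.
template <typename T>
std::size_t first_non_finite(const std::vector<T>& values)
{
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (!std::isfinite(values[i])) {
            return i;
        }
    }
    return values.size();
}
```

Running such a scan after each stage of the factorization (SpGEMM, sweep, threshold selection) would narrow down which stage first produces a non-finite value.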
When using the Ginkgo static library, it crashes with: terminate called after throwing an instance of 'gko::CusparseError' |
@uboats Are you also running this on an H100? I think our cuSPARSE exceptions might be a bit outdated. In any case, I think this error might be due to insufficient GPU memory for the factorization. I had the same issue on my old personal GPU. |
@MarcelKoch yes, H100 (80GB) |
I will try alg 3 for spgemm and see |
Tried alg2 and it works, so it is a GPU memory issue: alg1 needs too much memory.
For ParILUT, could we have one more parameter to choose the SpGEMM algorithm?
Hello,
I have been testing ParILUT with GMRES on linear systems extracted from my application.
I am seeing NaN residuals with the CUDA executor on an H100. The exact same code run with the OMP and Reference executors converges in 14 GMRES iterations without issues.
The matrix of the linear system is in the 500K range, and because of the size limit on uploaded files I have split it into 7 parts.
cis.mtx.gz.part-aa
cis.mtx.gz.part-ab
cis.mtx.gz.part-ac
cis.mtx.gz.part-ad
cis.mtx.gz.part-ae
cis.mtx.gz.part-af
cis.mtx.gz.part-ag
To merge these files into the actual matrix you can use:
cat cis.mtx.gz.part* > cis.mtx.gz
The rhs file is relatively small:
cis_rhs.mtx.gz
I will upload the files in a few submissions that follow.
See the snapshots showing the NaN with the CUDA exec and the runs with omp and reference exec.