prov/cxi: OFI poll failed during MPI calls on LUMI/Adastra #10072
Comments
Hi @etiennemlb, I am suffering from the same problem now, with MUMPS on a large model. I didn't find any solutions here. Could you please tell me how you fixed it? Thank you very much.
Abort(1687183) on node 84 (rank 84 in comm 0): Fatal error in internal_Iprobe: Other MPI error, error stack:
Actually, this is not entirely fixed. Some people are saying that SHS 2.2.0, with the introduction of libfabric 1.20.1, fixed the issue. That is not my experience TBH. But for all the machines that are not going to upgrade to such a recent SHS, there is no fix reported by HPE nor on this repo. To reproduce the issue I need 20 bi-socket Genoa nodes, and the issue seems to appear only on that system, without certainty of reproducibility. The lack of a reproducer is a big problem (without even taking into account the closed-source MUMPS code). If you can shrink a reproducer, you may be able to forward it to your site's support, which in turn could forward it to HPE.
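For illustration only, a minimal reproducer skeleton might look like the sketch below. Since the crash surfaces in internal_Iprobe, it just keeps pressure on MPI_Iprobe across all ranks; the ring communication pattern, iteration count, and message size are assumptions and not taken from the thread or from MUMPS itself.

```c
/* Hypothetical reproducer sketch: stress MPI_Iprobe-driven receives.
 * Parameters (iterations, payload size, ring pattern) are assumptions. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int iterations = 100000;  /* assumption: enough pressure to trigger */
    const int payload    = 1 << 16; /* assumption: 64 KiB messages */
    char *buf = malloc(payload);

    for (int it = 0; it < iterations; ++it) {
        int peer = (rank + 1) % size;
        MPI_Request req;
        MPI_Isend(buf, payload, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req);

        /* Poll with MPI_Iprobe, mimicking message-driven receives. */
        int flag = 0;
        MPI_Status status;
        while (!flag)
            MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);

        int count;
        MPI_Get_count(&status, MPI_CHAR, &count);
        MPI_Recv(buf, count, MPI_CHAR, status.MPI_SOURCE, status.MPI_TAG,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```

Whether something this small can actually hit the failure is unclear, given that the issue only shows up at around 20 bi-socket Genoa nodes and not deterministically, but a skeleton like this is the kind of thing a site's support could forward to HPE.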
I'm far from decently understanding libfabric's cxi provider, so I would like to ask for details about an observed error. I run the MUMPS solver https://mumps-solver.org/index.php on Adastra (a machine similar to LUMI and Frontier, but running SS 2.2 and libfabric/prov/cxi 1.20), and we observe, in a predictable manner, the following crash:
I'm wondering what is triggering this issue and what could be done to fix it.
On LUMI, the issue has also been seen, and more backtraces are given in:
https://hackmd.io/@mxKVWCKbQd6NvRm0h72YpQ/Sk2JJKnS6
https://hackmd.io/@mxKVWCKbQd6NvRm0h72YpQ/SyjVLT3ra
Thanks.