-
I see you compiled both serial and MPI versions. I don't know which one you used, but the plugin has to be compiled the same way as LAMMPS itself, either with the same MPI library or without MPI, so a single plugin build will not work for both versions.
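As an illustration, a minimal sketch of keeping the two builds consistent (LAMMPS's `BUILD_MPI` option is real; the `cc`/`CC` wrapper names are an assumption about the Cray toolchain on LUMI):

```bash
# Build LAMMPS without MPI to match a serial plugin build...
cmake -D BUILD_MPI=no ../cmake
# ...or build both LAMMPS and the plugin with the same MPI compiler wrappers:
CC=cc CXX=CC cmake -D BUILD_MPI=yes ../cmake
```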
-
Okay, then I will focus on the serial version and skip that last MPI step. The compiler says:
So it is compiled with serial C++. I load
Still, I get the same error:
For reference, the following modules are active:
-
Maybe they have different `_GLIBCXX_USE_CXX11_ABI` macros. You can use
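For illustration, one way to check which value a compiler defines by default (a minimal sketch, assuming GCC with libstdc++):

```bash
# Dump the preprocessor macros of a translation unit that pulls in libstdc++,
# then look for the ABI macro; TensorFlow and the plugin must agree on it.
echo '#include <string>' | g++ -x c++ -E -dM - | grep _GLIBCXX_USE_CXX11_ABI
```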
-
Thank you, but I tried it out (using your fork), and the exact same error reappeared. I have attached the logs from my compilations; maybe they can help diagnose the problem. cmake_deepmd_err.log
-
Thanks, I tried again, and now we get a slightly more informative error:
cmake_deepmd_err.log
-
I'm so sorry for the long silence. I think the new version of DeePMD-kit resolved this specific issue, as I can now load the plugin without problems. TensorFlow with ROCm:
DeePMD-kit:
Horovod:
LAMMPS with DeePMD-kit:
Create environment:
Tests
Model training:
Running LAMMPS:
Then I get:
To what extent this is an issue with Slurm or with TensorFlow, I am unsure. I have tried TensorFlow 2.11 and 2.9, but the error remained. Any tips?
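For reference, a plausible Slurm launch for LAMMPS on one 8-GCD LUMI node might look like the sketch below (purely an assumption about the job setup, not the script actually used here):

```bash
# One MPI task per GCD so each TensorFlow instance sees a single device
srun --nodes=1 --ntasks-per-node=8 --gpus-per-node=8 lmp -in in.lammps
```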
-
Thank you, that worked! I am doing some performance testing, trying to replicate Table III for se_e2_a_64c; the MI250X GPU of LUMI should perform similarly to the MI250 GPU used in the paper. For 1 GPU, which I assume is how the benchmark was done, I get 7.1 microseconds/step/atom, about four times slower than the 1.74 reported for the compressed version in the paper. Am I doing something sub-optimal when running LAMMPS? The output for 64 GPUs is provided here: output. I am also a little worried about the leveling-off of performance. Is this system too small to scale linearly with the number of GPUs?
For a single GPU:
For 8 GPUs:
For 32 GPUs:
For 64 GPUs:
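As a side note on the numbers above, a small helper like the one below (hypothetical, assuming the standard `Loop time of T on P procs for S steps with N atoms` summary line in the LAMMPS log) converts the timing into the paper's microseconds/step/atom metric:

```bash
# Extract wall time ($4), steps ($9), and atoms ($12) from the LAMMPS summary
# line and print microseconds per step per atom.
awk '/^Loop time of/ {printf "%.2f us/step/atom\n", $4 * 1e6 / ($9 * $12)}' log.lammps
```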
By the way, for people who are interested in how I got it installed: TensorFlow with ROCm:
DeePMD-kit:
Horovod:
LAMMPS with DeePMD-kit:
Create environment:
-
Try adjusting
-
Hi! This is great information for building DeePMD + LAMMPS on an AMD GPU system. However, I ran into a problem when compiling the DeePMD-kit libraries using
Which results in this error on linking:
Any ideas on how to solve this?
-
OK, I solved this problem. It was caused by a conflict between the ROCm libtinfo and the libtinfo from the conda environment. But now I have encountered a new problem related to the 'GeluCustom' op. I tried this:
/Daniel
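For anyone hitting the same libtinfo clash, a sketch of one common workaround (assuming the conda copy is the one shadowing ROCm's; the path and soname are hypothetical):

```bash
# Move the conda-provided libtinfo out of the way so the ROCm/system copy
# is resolved at link and run time instead.
mv "$CONDA_PREFIX/lib/libtinfo.so.6" "$CONDA_PREFIX/lib/libtinfo.so.6.bak"
```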
-
Hi @njzjz, I removed the lines in the gelu_multi_device.cc file as you suggested. However, after recompiling, it still doesn't work :( Now, when I try to run LAMMPS with a DeePMD potential that I previously trained (using the CUDA version on another system), I get the following error:
I also tried to train a new potential using the ROCm version, but I run into this error when using
and this error when using
Any ideas how to solve this?
-
Dear @njzjz, I'm having trouble following the instructions given by @sigbjobo to properly install DeePMD-kit on our HPE Cray EX (similar to LUMI). I have not gone through the full set of steps yet, but I have followed these so far:
So far so good. The steps above do not seem to cause any problems. But problems appear in the DeePMD-kit installation:
A first problem with this is that the compilation of the
For some reason, somewhere in the process a semicolon (`;`) ends up in the compilation command. So, the first question here is:
Anyway, I separately tried to execute the compilation command by hand (obviously removing the nasty semicolon manually), and the compiler still complained about the
(Anyway, question 1 still applies, even if unsetting HIP_HIPCC_FLAGS bypasses the problem.) But there is still a second problem in the compilation command, which is the use of
So, the second question here is:
Thanks. So far these are the questions. I will try to move forward and will come back if any problems appear that I cannot solve.
Regards,
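For completeness, a sketch of the bypass mentioned above (assuming `HIP_HIPCC_FLAGS` is picked up from the environment or from the CMake cache):

```bash
# Clear the variable before configuring...
unset HIP_HIPCC_FLAGS
# ...or drop it from an existing CMake cache in the build directory:
cmake -U HIP_HIPCC_FLAGS .
```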
-
Hi, after LUMI was updated, the old installation stopped working. The problem is that the installation points to /opt/rocm, as sketched in this issue, which has changed from rocm5 to rocm6. @njzjz, do you know how to ensure that DeePMD points to the correct version of ROCm? I am okay with a workaround, as here
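One generic way to pin the build to a specific ROCm instead of the moving /opt/rocm symlink (a sketch; the version directory is an assumption, not LUMI's actual path):

```bash
# Point hipcc and the runtime loader at an explicit ROCm installation.
export ROCM_PATH=/opt/rocm-6.0.3          # hypothetical version directory
export HIP_PATH="$ROCM_PATH"
export PATH="$ROCM_PATH/bin:$PATH"
export LD_LIBRARY_PATH="$ROCM_PATH/lib:$LD_LIBRARY_PATH"
```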
-
For anyone wondering, this is a working installation for LUMI after the update to ROCm 6:
-
I am trying to install DeePMD on LUMI, which is an AMD-based system. I have managed to install the DeePMD-kit Python interface, but I fail at the installation of LAMMPS. I am making this post in the hope of arriving at a fully working installation on LUMI. These are the steps so far (see the sketch after this list):
TensorFlow with ROCm:
DeePMD-kit:
Horovod:
DeePMD-kit libraries:
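Since the exact commands were trimmed above, here is a rough sketch of what such a stack can look like (the PyPI package names and Horovod build flags are real; the overall recipe is an assumption, not the steps actually used here):

```bash
# TensorFlow built for ROCm (AMD's community wheel on PyPI)
pip install tensorflow-rocm

# DeePMD-kit Python interface on top of that TensorFlow
pip install deepmd-kit

# Horovod compiled against the same TensorFlow, with ROCm GPU operations
HOROVOD_GPU=ROCM HOROVOD_WITH_TENSORFLOW=1 pip install --no-cache-dir horovod
```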
However, when I try to use the plugin mode I get:
Any help is much appreciated!