NaN Values During DeePMD Training #4461

VenkatanarayananMridhula · 2024-12-07T13:55:35Z


Hello,

I am currently using DeePMD to develop a model for predicting the energies of a single molecule (C₂H₄) using the se_e2_a descriptor. My dataset consists of over 5000 frames sampled from various configurations, which I have split into training, validation, and test sets in a 70:20:10 ratio. The model is trained for 1 million steps. During training, the training and validation RMSE both decrease steadily and overlap closely, as illustrated in the attached plot. However, I encounter NaN values almost at the end of the training, specifically around step 944,500. The training parameters were taken from the supplementary material of the paper "End-to-End Symmetry Preserving Inter-Atomic Potential Energy Model for Molecular Dynamics Simulations" for single-molecule systems. I have attached the following files for reference:

input.json: The input configuration file used for training.
lcurve.txt: The log file showing the progression of training and validation RMSE values.
Graph: A plot of rmse_train and rmse_val vs. steps, highlighting the model's behavior during training.

Although the model shows a promising and stable trend throughout most of the training, NaN values appear near the very end, which is perplexing. I would greatly appreciate any insights into what might be causing this issue and how it can be resolved.
Thank you for your time and assistance!

njzjz · 2024-12-09T03:08:29Z

Maintainer

Just to confirm: did you use the latest version?

4 replies

VenkatanarayananMridhula Dec 9, 2024
Author

I am using DeePMD-kit v2.2.11 for my calculations. Around step 944,000, I encountered NaN values in the training logs, as shown in the attached image. I restarted the calculation using the last saved checkpoint just before the NaN values appeared (step 944,000). After the restart, the NaN values disappeared, and the training resumed normally.

However, I noticed that the RMSE values at the restarted step (944,000) are not identical to the values recorded before the restart. Is this discrepancy acceptable, or does it indicate an issue in the training process?

(Image: training log showing the NaN values.)

Thank you.

njzjz Dec 9, 2024
Maintainer

> I noticed that the RMSE values at the restarted step (944,000) are not identical to the values recorded before the restart.

There is randomness in which data points are picked for training and validation, so the values will not be identical.
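To illustrate the point about random sampling, here is a minimal sketch in plain NumPy. It does not use DeePMD-kit itself, and the per-frame errors, batch size, and seeds are all made-up values chosen for illustration. It assumes the reported RMSE is computed on a randomly drawn batch of frames, so the same model can report different RMSE values depending on which frames happen to be drawn.

```python
import numpy as np


def batch_rmse(errors, batch_size, seed):
    """RMSE over one randomly drawn batch of per-frame errors."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(errors.size, size=batch_size, replace=False)
    return float(np.sqrt(np.mean(errors[idx] ** 2)))


# Hypothetical per-frame errors for one fixed model (synthetic, not from the thread).
frame_errors = np.random.default_rng(0).normal(scale=0.01, size=1000)

# RMSE over every frame: the value a batch estimate fluctuates around.
full_rmse = float(np.sqrt(np.mean(frame_errors ** 2)))

# Two different random draws, standing in for "before" and "after" a restart.
before_restart = batch_rmse(frame_errors, batch_size=8, seed=1)
after_restart = batch_rmse(frame_errors, batch_size=8, seed=2)

print(f"full: {full_rmse:.5f}")
print(f"before restart: {before_restart:.5f}")
print(f"after restart:  {after_restart:.5f}")
```

The model (here, the fixed array of errors) is identical in both calls; only the sampled frames differ, which is enough to change the reported number.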

VenkatanarayananMridhula Dec 9, 2024
Author

Thank you so much. But what causes the NaN errors, and why do they disappear when the training is restarted in this case?

njzjz Dec 9, 2024
Maintainer

This is hard to say. It might come from your GPU or from a bug in our program, TensorFlow, or CUDA. Besides, we haven't yet located the cause of an existing NaN issue, #3103.
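When narrowing down a case like this, it can help to find the first step at which NaN appears in lcurve.txt, so that training can be restarted from a checkpoint saved before it. The sketch below is a hypothetical helper, not part of DeePMD-kit. It assumes the learning-curve file is whitespace-separated numeric text with `#` header lines and the step number in the first column. The demo file it writes is synthetic.

```python
import os
import tempfile

import numpy as np


def first_nan_step(path):
    """Return the step of the first row containing NaN, or None if there is none."""
    # Assumption: '#' marks header lines and column 0 holds the step number.
    data = np.loadtxt(path, comments="#", ndmin=2)
    bad_rows = np.isnan(data).any(axis=1)
    if not bad_rows.any():
        return None
    # argmax on a boolean array gives the index of the first True entry.
    return int(data[np.argmax(bad_rows), 0])


# Synthetic learning curve for demonstration only (not the file from this thread).
demo_text = (
    "#  step  rmse_val  rmse_trn  lr\n"
    "100  1.0e-02  1.1e-02  1.0e-03\n"
    "200  9.0e-03  9.5e-03  9.0e-04\n"
    "300  nan      nan      8.0e-04\n"
    "400  nan      nan      7.0e-04\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "lcurve_demo.txt")
    with open(demo_path, "w") as handle:
        handle.write(demo_text)
    demo_step = first_nan_step(demo_path)

print(demo_step)  # prints 300
```

On a real run, the returned step would indicate which saved checkpoint still predates the NaN values.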
