"nan" loss on training #16
Comments
I've seen NaN loss with this code before, but there are a lot of things that can cause it to happen. Sometimes it's an uninitialized variable that is set to …
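Since several things can push a loss to NaN, here is a minimal sketch of the kind of guard one might drop into a generic PyTorch training step to catch the problem as early as possible. This is not UDify's actual trainer; the model interface (a dict with a "loss" key, in the AllenNLP style) is an assumption for illustration.

```python
# Minimal NaN guard inside a generic PyTorch training step (illustrative only).
# Assumes the model returns {"loss": <scalar tensor>, ...}, AllenNLP-style.
import torch

def training_step(model, batch, optimizer):
    optimizer.zero_grad()
    output = model(**batch)
    loss = output["loss"]
    if torch.isnan(loss).any():
        raise RuntimeError(
            "Loss became NaN; check inputs, learning rate, and parameter initialization"
        )
    loss.backward()
    optimizer.step()
    return loss.item()

# PyTorch's anomaly detection can also help locate the op that first produced a NaN:
# torch.autograd.set_detect_anomaly(True)
```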
Thanks for the quick reply. Also, to make sure it wasn't the machine(s) we were using, I started an Amazon instance (p2.xlarge, deep learning AMI) and ran from scratch. I still encountered the same error:
Hi @neubig, @Hyperparticle, I thought this might have something to do with some unexpected behaviour introduced by the recent PR #13, but I cloned a version of udify (c277ade) from before that PR was merged and I still get a nan loss almost immediately after training starts. It is just a hunch, but when …
I have some local environments where udify can train successfully, and I am able to train in a fresh environment if I install those requirements instead:
This is now not producing nan losses for me, so it might be something to do with the versions of the packages installed in requirements.txt. Oddly enough, I found installing these requirements works too. The requirements I used are here: …
@jbrry Thanks for looking into it. So it appears to be due to a recent PyTorch update. I updated the …
Our preliminary tests indicate that this did indeed fix the nan errors. I'll re-open this if we encounter it during a full training run, but I think this is likely fixed, so thanks a bunch!
I took the reqs.txt file given by @jbrry and did a …
It looks like the …
Oh, I forgot to mention, but that didn't work either; it's something else that's the issue. Since I got it working with that file, I didn't investigate further. You could consider distributing a Docker image or a Dockerfile with all the dependencies in it. A lot of people run training in a container anyway, so it would be helpful for them and less of a hassle for you when dealing with dependency issues.
I am getting the nan error with PyTorch 1.4.0 and allennlp==0.9.0. Is there any fix for this?
Fixed my problem with the requirements specified by you. The conllu package must remain at 1.3.1.
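For anyone hitting this, a quick way to confirm which versions are actually installed in the active environment is a few lines of Python (assuming Python 3.8+ for importlib.metadata). The only pin taken from this thread is conllu==1.3.1; torch and allennlp are printed purely for reference, since the failing combination reported above was torch 1.4.0 with allennlp 0.9.0.

```python
# Print the installed versions of the packages discussed in this thread and
# check the one pin the thread converged on (conllu==1.3.1).
from importlib.metadata import version, PackageNotFoundError

def installed(pkg):
    try:
        return version(pkg)
    except PackageNotFoundError:
        return "not installed"

for pkg in ("torch", "allennlp", "conllu"):
    print(f"{pkg}: {installed(pkg)}")

if installed("conllu") != "1.3.1":
    print("warning: this thread suggests pinning conllu to 1.3.1")
```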
Thanks for looking into it, @dfvalio.
I tried with the … I ran a diff on the packages which are installed by …
After changing this package version, the training script runs fine. Can you confirm this works for you as well, @Hyperparticle @dfvalio? To reproduce:
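The original reproduction steps are cut off in this preview, so they are not repeated here. As a generic sketch of the "diff the installed packages" step described above, assuming each environment has been dumped with `pip freeze` to the (hypothetical) files named below:

```python
# Compare two frozen package lists to spot version differences.
# The file names are hypothetical; each is assumed to contain `pip freeze`
# output from one environment (working vs. failing).
import difflib
from pathlib import Path

working = sorted(Path("reqs-working.txt").read_text().splitlines())
failing = sorted(Path("reqs-failing.txt").read_text().splitlines())

for line in difflib.unified_diff(working, failing,
                                 fromfile="working", tofile="failing", lineterm=""):
    print(line)
```

Per the comments above, the version that mattered in this case appears to have been conllu.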
Hi!
Thanks for releasing the library. I'm encountering "nan" loss on training with the following commit, which I think is the most recent version:
60f35edc52862109555f4acf66236becc29705ad
Here are instructions to reproduce:
The end of the training log looks like this:
I've attached the full log below as well:
udify-log.txt
My pip environment is also here:
pip-list.txt
Do you have an idea what the issue is? I'd be happy to help debug further (cc: @antonisa and @LeYonan)