"nan" loss on training #16

Open

neubig opened this issue Jun 10, 2020 · 11 comments

@neubig commented Jun 10, 2020

Hi!

Thanks for releasing the library. I'm encountering "nan" loss on training with the following commit, which I think is the most recent version: 60f35edc52862109555f4acf66236becc29705ad

Here are instructions to reproduce:

pip install -r ./requirements.txt
bash ./scripts/download_ud_data.sh
python train.py --config config/ud/en/udify_bert_train_en_ewt.json --name en_ewt --dataset_dir data/ud-treebanks-v2.3/

The end of the training log looks like this:

2020-06-10 16:23:38,177 - INFO - allennlp.training.trainer - Training
  0%|          | 0/392 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 110, in <module>
    train_model(train_params, serialization_dir, recover=bool(args.resume))
  File "/home/gneubig/anaconda3/envs/python3/lib/python3.7/site-packages/allennlp/commands/train.py", line 252, in train_model
    metrics = trainer.train()
  File "/home/gneubig/anaconda3/envs/python3/lib/python3.7/site-packages/allennlp/training/trainer.py", line 478, in train
    train_metrics = self._train_epoch(epoch)
  File "/home/gneubig/anaconda3/envs/python3/lib/python3.7/site-packages/allennlp/training/trainer.py", line 323, in _train_epoch
    raise ValueError("nan loss encountered")
ValueError: nan loss encountered

I've attached the full log below as well:
udify-log.txt

My pip environment is also here:
pip-list.txt

Do you have an idea what the issue is? I'd be happy to help debug further (cc: @antonisa and @LeYonan)

@Hyperparticle (Owner)

I've seen NaN loss with this code before, but there are a lot of things that can cause it to happen. Sometimes it's an uninitialized variable that is set to None. Can you try the previous commit afe5d4734e179155852ea3a2f80c353f58c8b6ec to see if that changes anything?
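
In the meantime, a generic way to localize where the NaNs first appear (a standard PyTorch debugging sketch, not anything udify-specific; model and batch below are placeholders) is to enable autograd anomaly detection and check the loss before backprop:

import torch

torch.autograd.set_detect_anomaly(True)  # raise at the op that first produces a NaN/Inf

def training_step(model, batch):
    # AllenNLP models return a dict containing a "loss" tensor
    loss = model(**batch)["loss"]
    if torch.isnan(loss).any():
        raise ValueError("nan loss encountered on this batch")
    loss.backward()
    return loss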

@neubig (Author) commented Jun 11, 2020

Thanks for the quick reply. afe5d4734e179155852ea3a2f80c353f58c8b6ec still encountered the same problem.

Also, to rule out the machine(s) we were using, I started a fresh Amazon instance (p2.xlarge, deep learning AMI) and ran from scratch. I still encountered the same error:

conda create --name=python3 python=3
conda init bash
source ~/.bashrc 
conda activate python3
git clone https://github.com/Hyperparticle/udify.git
cd udify/
pip install -r ./requirements.txt
bash ./scripts/download_ud_data.sh
python train.py --config config/ud/en/udify_bert_finetune_en_ewt.json --name en_ewt --dataset_dir data/ud-treebanks-v2.3/

@jbrry (Contributor) commented Jun 13, 2020

Hi @neubig, @Hyperparticle,

I thought this might have something to do with some unexpected behaviour introduced by the recent PR #13, but I cloned a version of udify (c277ade) from before that PR was merged and I still get a nan loss almost immediately after training starts.

It's just a hunch, but when allennlp==0.9.0 is installed it requires torch>=1.2.0, which currently resolves to version 1.5.0; that may be too recent.

Collecting torch>=1.2.0
  Using cached torch-1.5.0-cp37-cp37m-manylinux1_x86_64.whl (752.0 MB)

I have some local environments where udify trains successfully, and I can also train in a fresh environment if I install those environments' requirements instead:

# put versions of libraries from a working environment into `reqs.txt`
pip freeze > reqs.txt

conda create -n udify_alternative_reqs python=3.7
conda activate udify_alternative_reqs
pip install -r reqs.txt

python train.py --config config/ud/en/udify_bert_finetune_en_ewt.json --name en_ewt --dataset_dir data/ud-treebanks-v2.3/

This no longer produces nan losses for me, so it may be down to the specific package versions installed by requirements.txt. Oddly enough, I found installing these requirements to work too.

The requirements I used are here:
reqs.txt

@Hyperparticle (Owner)

@jbrry Thanks for looking into it. So it appears to be due to a recent PyTorch update. I've updated requirements.txt to pin torch to 1.4.0 explicitly. Hope that fixes it for everyone.
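
For anyone verifying that the pin took effect, a quick sanity check like this (a hypothetical snippet, not part of the repo) should print the expected versions:

import pkg_resources
import torch

print(torch.__version__)                                   # expect 1.4.0
print(pkg_resources.get_distribution("allennlp").version)  # expect 0.9.0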

@neubig (Author) commented Jun 14, 2020

Our preliminary tests indicate that this did indeed fix the nan errors. I'll re-open this if we encounter it during a full training run, but I think this is likely fixed, so thanks a bunch!

@neubig closed this as completed Jun 14, 2020
@Genius1237

I have torch==1.4.0 and allennlp==0.9.0 installed, and I am still getting this error. My guess is that the cause lies somewhere else.

I took the reqs.txt file given by @jbrry and did a pip install --no-warn-conflicts -r reqs.txt (I had to remove certifi from it). That got everything working.

@Hyperparticle reopened this Jun 29, 2020
@Hyperparticle (Owner)

It looks like the reqs.txt you used has an earlier version of torch (1.3.1). Would it work if you change the requirements.txt to torch==1.3.1?

@Genius1237 commented Jun 29, 2020

Oh, I forgot to mention: that didn't work either, so the issue is something else. Since I got it working with that file, I didn't investigate further.

You could consider distributing a Docker image or a Dockerfile with all the dependencies in it. A lot of people run training in a container anyway, so it would be helpful for them and less of a hassle for you when dealing with dependency issues.
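
For illustration, a minimal Dockerfile could look something like this (an untested sketch; the base image and layout are my assumptions, not an official file from the repo):

FROM python:3.7-slim

WORKDIR /udify
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
# Same training command as elsewhere in this thread; the UD data is
# expected to be downloaded or mounted into data/ first.
CMD python train.py --config config/ud/en/udify_bert_finetune_en_ewt.json --name en_ewt --dataset_dir data/ud-treebanks-v2.3/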

@dfvalio commented Feb 18, 2021

I am getting the nan error with torch==1.4.0 and allennlp==0.9.0. Any fix for this?

@dfvalio commented Feb 18, 2021

(Quoting @jbrry's comment of Jun 13, 2020 above in full.)

Fixed my problem with the requirements you specified. conllu must remain at 1.3.1.

@jbrry (Contributor) commented Feb 18, 2021

Thanks for looking into it @dfvalio.

Fixed my problem with the requirements you specified. conllu must remain at 1.3.1.

I tried the conllu version that gets installed with allennlp 0.9.0 (conllu==1.3.1), as well as conllu==2.3.2 and even the current version, conllu==4.4, and training runs with all three versions for me.

I ran a diff between the packages installed by requirements.txt and a working environment I had, then changed packages to the working versions one by one until I could launch the training command. The first time it worked was when I changed the gevent package:

pip install gevent==1.4.0

After changing this package version, the training script runs fine. Can you confirm this works for you as well, @Hyperparticle @dfvalio? To reproduce:

conda create -n udify_install python=3.7
conda activate udify_install
pip install -r requirements.txt 

# BREAKS
python train.py --config config/ud/en/udify_bert_finetune_en_ewt.json --name en_ewt --dataset_dir data/ud-treebanks-v2.3/ # doesn't work

pip install gevent==1.4.0

# WORKS
python train.py --config config/ud/en/udify_bert_finetune_en_ewt.json --name en_ewt --dataset_dir data/ud-treebanks-v2.3/
