NAN loss value #13

Open
srini1948 opened this issue Apr 24, 2019 · 5 comments

Comments

@srini1948

Training starts without problems, but from epoch 18 onward the loss becomes NaN and the models saved after that point are unusable.

Is there a way to save all models, so that the ones saved while the loss was not NaN can still be used for generation?

Can the system be trained to imitate a single person's handwriting?

Thanks.

@ganji15

ganji15 commented May 25, 2019

@srini1948 I ran into the same problem. Have you solved it? I am using TensorFlow 1.12.0. The error output is as follows:

[  904/ 1000] loss = -4.090383529663086
2019-05-24 13:48:54.633530: E tensorflow/core/kernels/check_numerics_op.cc:185] abnormal_detected_host @0x1020dc06000 = {1, 0} Found Inf or NaN global norm.
Traceback (most recent call last):
  File "/home/ganji/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/home/ganji/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/ganji/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had NaN values
         [[{{node model/training/VerifyFinite/CheckNumerics}} = CheckNumerics[T=DT_FLOAT, message="Found Inf or NaN global norm.", _device="/job:localhost/replica:0/task:0/device:GPU:0"](model/training/global_norm/global_norm)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 287, in <module>
    main()
  File "train.py", line 278, in main
    vs.sequence: seq})
  File "/home/ganji/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/ganji/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/ganji/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/ganji/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had NaN values
         [[node model/training/VerifyFinite/CheckNumerics (defined at train.py:222)  = CheckNumerics[T=DT_FLOAT, message="Found Inf or NaN global norm.", _device="/job:localhost/replica:0/task:0/device:GPU:0"](model/training/global_norm/global_norm)]]

Caused by op 'model/training/VerifyFinite/CheckNumerics', defined at:
  File "train.py", line 287, in <module>
    main()
  File "train.py", line 252, in main
    output_mixtures=args.output_mixtures)
  File "train.py", line 235, in create_graph
    train_model = create_model(generate=None)
  File "train.py", line 222, in create_model
    grad, _ = tf.clip_by_global_norm(grad, 3.)
  File "/home/ganji/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/clip_ops.py", line 265, in clip_by_global_norm
    "Found Inf or NaN global norm.")
  File "/home/ganji/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/numerics.py", line 47, in verify_tensor_all_finite
    verify_input = array_ops.check_numerics(t, message=msg)
  File "/home/ganji/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 817, in check_numerics
    "CheckNumerics", tensor=tensor, message=message, name=name)
  File "/home/ganji/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/ganji/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/ganji/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/home/ganji/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Found Inf or NaN global norm. : Tensor had NaN values
         [[node model/training/VerifyFinite/CheckNumerics (defined at train.py:222)  = CheckNumerics[T=DT_FLOAT, message="Found Inf or NaN global norm.", _device="/job:localhost/replica:0/task:0/device:GPU:0"](model/training/global_norm/global_norm)]]
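For readers following the traceback: the op that fails is the finiteness check inside `tf.clip_by_global_norm` (called at `train.py` line 222), not the loss itself. The global norm is a single scalar built from every gradient, so one NaN or Inf anywhere makes the whole norm non-finite. The sketch below mimics that behaviour in plain Python to illustrate the mechanism. It is not TensorFlow's implementation, and the function names are only illustrative.

```python
import math


def global_norm(grads):
    """Square root of the sum of squares over every element of every gradient."""
    return math.sqrt(sum(g * g for grad in grads for g in grad))


def clip_by_global_norm(grads, clip_norm):
    """Rescale all gradients so that their joint norm is at most clip_norm.

    Raises ValueError if the norm is NaN or Inf, analogous to the
    CheckNumerics failure shown in the traceback.
    """
    norm = global_norm(grads)
    if not math.isfinite(norm):
        raise ValueError("Found Inf or NaN global norm.")
    # Gradients are left unchanged when norm <= clip_norm (scale == 1).
    scale = clip_norm / max(norm, clip_norm)
    return [[g * scale for g in grad] for grad in grads], norm


if __name__ == "__main__":
    clipped, norm = clip_by_global_norm([[3.0], [4.0]], 3.0)
    print(norm, clipped)
    try:
        clip_by_global_norm([[1.0], [float("nan")]], 3.0)
    except ValueError as err:
        print("error:", err)
```

The practical consequence is that the error message only says that some gradient was non-finite. Finding which one requires inspecting the loss terms or the individual gradients.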

@srini1948
Author

srini1948 commented May 28, 2019 via email

@ganji15

ganji15 commented May 31, 2019

@srini1948 Thanks. No error occurred when I ran the code again.

@Grzego
Owner

Grzego commented Jun 2, 2019

@ganji15 @srini1948 Thanks for pointing that out, I will try to look into it. I suspect that some computation in the loss (such as a division by zero or the log of a negative number) is causing this behaviour.
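To illustrate the kind of failure suspected here, the following is a minimal sketch using a one-dimensional Gaussian mixture. It is a toy example and not the loss used in this repository: the function names and the 1-D simplification are assumptions made for illustration. The naive version works in probability space, where the likelihood can underflow to zero and the logarithm then fails. The second version stays in log space using the log-sum-exp trick.

```python
import math


def naive_mixture_nll(x, weights, means, sigmas):
    """Negative log-likelihood computed directly in probability space."""
    likelihood = 0.0
    for w, mu, s in zip(weights, means, sigmas):
        z = (x - mu) / s
        likelihood += w * math.exp(-0.5 * z * z) / (s * math.sqrt(2.0 * math.pi))
    # If every term underflows, likelihood is exactly 0.0 and math.log raises
    # ValueError. A framework log op would typically return -inf here instead.
    return -math.log(likelihood)


def stable_mixture_nll(x, weights, means, sigmas):
    """Same quantity computed in log space with the log-sum-exp trick.

    Assumes all weights and sigmas are strictly positive.
    """
    log_terms = []
    for w, mu, s in zip(weights, means, sigmas):
        z = (x - mu) / s
        log_terms.append(
            math.log(w) - 0.5 * z * z - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        )
    m = max(log_terms)  # subtract the largest term so at least one exp() is 1
    return -(m + math.log(sum(math.exp(t - m) for t in log_terms)))


if __name__ == "__main__":
    params = ([0.5, 0.5], [0.0, 1.0], [1.0, 1.0])
    print(stable_mixture_nll(100.0, *params))  # finite
    try:
        naive_mixture_nll(100.0, *params)
    except ValueError as err:
        print("naive version failed:", err)
```

For inputs close to the mixture components both functions agree. They differ only when the point is far from every component, which is the situation in which a probability-space loss can turn into Inf or NaN during training.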

As a workaround, you can keep more saved models by raising the max_to_keep parameter to something like 10 or more. Models saved before the first NaN should still be usable.
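A complementary guard is to stop saving as soon as the loss becomes non-finite, so that later checkpoints never push the good ones out. The sketch below shows the idea in plain Python. The `save_checkpoint` callback and the list of losses are hypothetical stand-ins for the real training step and for whatever call writes the checkpoint in `train.py`.

```python
import math


def train_with_nan_guard(loss_per_epoch, save_checkpoint):
    """Save a checkpoint after each epoch until the loss stops being finite.

    loss_per_epoch: iterable of loss values, one per epoch (stand-in for
        running the real training step).
    save_checkpoint: hypothetical callback that receives the epoch number.
    Returns the list of epochs for which a checkpoint was saved.
    """
    saved = []
    for epoch, loss in enumerate(loss_per_epoch, start=1):
        if not math.isfinite(loss):
            print(f"epoch {epoch}: loss is {loss}, stopping without saving")
            break
        save_checkpoint(epoch)
        saved.append(epoch)
    return saved


if __name__ == "__main__":
    written = []
    losses = [1.0, 0.5, float("nan"), 0.1]
    print(train_with_nan_guard(losses, written.append))
```

Stopping at the first NaN is a design choice for this sketch. Once the weights contain NaN values, further epochs are unlikely to recover, so continuing mostly wastes time and checkpoint slots.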

@ganji15

ganji15 commented Jun 3, 2019

@Grzego Thanks. This implementation is really interesting and impressive.
