NAN loss value #13
Comments
@srini1948 I ran into the same problem. Have you solved it? My TensorFlow version is '1.12.0'. The errors are as follows:
I could not solve the problem. I stopped training early, at epoch 15, so that I could still recover the model.
On Friday, May 24, 2019 at 10:44 PM, ganji wrote:
@srini1948 I ran into the same problem. Have you solved it? My TensorFlow version is '1.12.0'. The errors are as follows:
[ 904/ 1000] loss = -4.090383529663086
2019-05-24 13:48:54.633530: E tensorflow/core/kernels/check_numerics_op.cc:185] abnormal_detected_host @0x1020dc06000 = {1, 0} Found Inf or NaN global norm.
Traceback (most recent call last):
File "/home/ganji/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/home/ganji/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/ganji/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had NaN values
[[{{node model/training/VerifyFinite/CheckNumerics}} = CheckNumerics[T=DT_FLOAT, message="Found Inf or NaN global norm.", _device="/job:localhost/replica:0/task:0/device:GPU:0"](model/training/global_norm/global_norm)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train.py", line 287, in <module>
main()
File "train.py", line 278, in main
vs.sequence: seq})
File "/home/ganji/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/home/ganji/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/home/ganji/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/home/ganji/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had NaN values
[[node model/training/VerifyFinite/CheckNumerics (defined at train.py:222) = CheckNumerics[T=DT_FLOAT, message="Found Inf or NaN global norm.", _device="/job:localhost/replica:0/task:0/device:GPU:0"](model/training/global_norm/global_norm)]]
Caused by op 'model/training/VerifyFinite/CheckNumerics', defined at:
File "train.py", line 287, in <module>
main()
File "train.py", line 252, in main
output_mixtures=args.output_mixtures)
File "train.py", line 235, in create_graph
train_model = create_model(generate=None)
File "train.py", line 222, in create_model
grad, _ = tf.clip_by_global_norm(grad, 3.)
File "/home/ganji/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/clip_ops.py", line 265, in clip_by_global_norm
"Found Inf or NaN global norm.")
File "/home/ganji/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/numerics.py", line 47, in verify_tensor_all_finite
verify_input = array_ops.check_numerics(t, message=msg)
File "/home/ganji/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 817, in check_numerics
"CheckNumerics", tensor=tensor, message=message, name=name)
File "/home/ganji/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/ganji/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/home/ganji/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/home/ganji/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()
InvalidArgumentError (see above for traceback): Found Inf or NaN global norm. : Tensor had NaN values
[[node model/training/VerifyFinite/CheckNumerics (defined at train.py:222) = CheckNumerics[T=DT_FLOAT, message="Found Inf or NaN global norm.", _device="/job:localhost/replica:0/task:0/device:GPU:0"](model/training/global_norm/global_norm)]]
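The traceback points at `tf.clip_by_global_norm(grad, 3.)` in `train.py`, which first checks that the global norm of the gradients is finite. The following is a minimal pure-Python sketch of that idea, not TensorFlow's actual implementation; the helper names and the toy gradient values are assumptions made for illustration. It shows why a single NaN gradient entry is enough to make the whole training step fail.

```python
import math

def global_norm(grads):
    """Square root of the sum of squares over every gradient entry."""
    return math.sqrt(sum(g * g for grad in grads for g in grad))

def clip_by_global_norm(grads, clip_norm):
    """Rescale all gradients together so their global norm is at most clip_norm.

    Raises when the norm is not finite, similar to the check in the traceback.
    """
    norm = global_norm(grads)
    if not math.isfinite(norm):
        raise ValueError("Found Inf or NaN global norm.")
    scale = clip_norm / max(norm, clip_norm)
    return [[g * scale for g in grad] for grad in grads], norm

# Finite gradients are rescaled: a global norm of 5.0 is clipped down to 3.0.
clipped, norm = clip_by_global_norm([[3.0], [4.0]], 3.0)

# One NaN entry anywhere makes the whole norm NaN, so the step is rejected.
try:
    clip_by_global_norm([[3.0], [float("nan")]], 3.0)
    nan_detected = False
except ValueError:
    nan_detected = True
```

In other words, the clipping op only reports the problem; the NaN itself is produced earlier, somewhere in the loss or its gradients.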
@srini1948 Thanks. No error occurred when I ran the code again.
@ganji15 @srini1948 Thanks for pointing that out, I will try to look into it. I suspect that some computation in the loss (such as a division by zero or the log of a non-positive number) is causing this behaviour. If you want to store more models (as a sort of workaround), you can change the parameter
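To illustrate the kind of failure suspected here, the sketch below compares a naive mixture negative log-likelihood with a log-sum-exp version. It is an assumption-laden toy: it uses a 1-D Gaussian mixture and plain Python, the function names are made up for this example, and the repository's actual loss is not shown in this thread. In TensorFlow the same underflow would typically surface as Inf or NaN rather than as an exception, and `tf.reduce_logsumexp` could play the role of the manual log-sum-exp.

```python
import math

def log_gaussian(x, mu, sigma):
    """Log density of a 1-D Gaussian, computed without exponentiating."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def naive_nll(x, weights, mus, sigmas):
    """-log(sum_k w_k * N(x; mu_k, sigma_k)); breaks when the sum underflows to 0."""
    total = sum(w * math.exp(log_gaussian(x, m, s))
                for w, m, s in zip(weights, mus, sigmas))
    return -math.log(total)  # math.log(0.0) raises ValueError

def stable_nll(x, weights, mus, sigmas):
    """Same quantity via log-sum-exp; assumes every weight is strictly positive."""
    terms = [math.log(w) + log_gaussian(x, m, s)
             for w, m, s in zip(weights, mus, sigmas)]
    top = max(terms)
    return -(top + math.log(sum(math.exp(t - top) for t in terms)))

weights, mus, sigmas = [0.5, 0.5], [0.0, 1.0], [0.1, 0.1]

# A point far from both components: every density underflows to exactly 0.0.
try:
    naive_nll(10.0, weights, mus, sigmas)
    naive_failed = False
except ValueError:
    naive_failed = True

far_loss = stable_nll(10.0, weights, mus, sigmas)  # large but finite
```

For inputs close to the component means both functions agree, so switching to the log-domain form changes only the behaviour in the extreme cases.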
@Grzego Thanks. This implementation is really interesting and impressive.
After successfully getting the training to run, I find that starting at epoch 18 the loss value becomes NaN and all the models created are of no use.
Is there any way to save all models, so that the ones where the loss is not NaN can be used for generation?
Can the system be trained to imitate a SINGLE PERSON's handwriting?
Thanks.
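One way to approach the checkpoint question is to save only after epochs whose loss is finite and to stop as soon as it is not. The sketch below is a hypothetical stand-in, not code from `train.py`: `train_with_guard` and `save_fn` are names invented for this example, and the list of losses replaces a real training loop. If the training script uses `tf.train.Saver`, constructing it with `max_to_keep=None` should also prevent older checkpoints from being deleted, but whether the script is set up that way is an assumption.

```python
import math

def train_with_guard(epoch_losses, save_fn):
    """Save a checkpoint after every epoch whose loss is finite.

    epoch_losses: iterable of per-epoch loss values (stand-in for real training).
    save_fn: hypothetical callback that writes a checkpoint for a given epoch.
    Returns the last epoch that was saved, or None if none was.
    """
    last_good = None
    for epoch, loss in enumerate(epoch_losses, start=1):
        if not math.isfinite(loss):
            # Weights are likely corrupted from here on; stop and keep earlier saves.
            break
        save_fn(epoch)
        last_good = epoch
    return last_good

saved = []
last_good = train_with_guard([-1.2, -2.5, -3.1, float("nan"), -3.0], saved.append)
```

Here the first three epochs are saved and training stops at the fourth, so the most recent usable model is the one from epoch 3.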