Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue with saving training onto model checkpoint #84

Open
ImNotOssy opened this issue May 23, 2024 · 6 comments
Open

Issue with saving training onto model checkpoint #84

ImNotOssy opened this issue May 23, 2024 · 6 comments

Comments

@ImNotOssy
Copy link

After training the model for hours on my own data. it seems to break since it can't save the training into a file that doesn't exist. I was using google colab for training,

restoring model from checkpoints/model-800
INFO:tensorflow:Restoring parameters from checkpoints/model-800
Restoring parameters from checkpoints/model-800
2024-05-23 11:58:55.097859: W tensorflow/core/framework/op_kernel.cc:1202] OP_REQUIRES failed at save_restore_tensor.cc:170 : Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for checkpoints/model-800
Traceback (most recent call last):
File "/usr/local/envs/py364/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1361, in _do_call
return fn(*args)
File "/usr/local/envs/py364/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1340, in _run_fn
target_list, status, run_metadata)
File "/usr/local/envs/py364/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in exit
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for checkpoints/model-800
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_INT32, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

@ImNotOssy
Copy link
Author

fixed it, I migrated from colab to google cloud VM Instance after running out of compute units, and realizing i have $300 in credits to spend. i went all out and got myself a nice VM with blazing training speeds. I was able to get it to work after realizing the minimum step to save was way higher than my total training step. I got it to work but i feel as if my data is way way to small for the model to learn anything. i tried to run it using demo.py but i get an error of a mismatched shape. The error message indicates that the parameter shape is [1, 20, 2] and the flat indices being accessed are [0, 20], which is out of bounds. I'm not sure this is a low data issue used to train the model or something else. No one else seems interested in this project but me, really hard to keep advancing.

@monickverma
Copy link

hey, im not able to run the project since the links in the READ.md are down, what do i do

@ImNotOssy
Copy link
Author

hey, im not able to run the project since the links in the READ.md are down, what do i do

I don't understand. What links?

@letsgocodego
Copy link

letsgocodego commented Aug 13, 2024

After training the model for hours on my own data. it seems to break since it can't save the training into a file that doesn't exist. I was using google colab for training,

restoring model from checkpoints/model-800 INFO:tensorflow:Restoring parameters from checkpoints/model-800 Restoring parameters from checkpoints/model-800 2024-05-23 11:58:55.097859: W tensorflow/core/framework/op_kernel.cc:1202] OP_REQUIRES failed at save_restore_tensor.cc:170 : Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for checkpoints/model-800 Traceback (most recent call last): File "/usr/local/envs/py364/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1361, in _do_call return fn(*args) File "/usr/local/envs/py364/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1340, in _run_fn target_list, status, run_metadata) File "/usr/local/envs/py364/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in exit c_api.TF_GetCode(self.status.status)) tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for checkpoints/model-800 [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_INT32, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

I'm sure this question will come off as excessively elementary, but, how did you use your own data? Did you just use some data from the IAM On-Line Handwriting Database? Did you devise a method that enables you to create a file that looks like this: https://fki.tic.heia-fr.ch/static/iamondb/strokesz.xml but with your own writing sample?

Thank you!

@ImNotOssy
Copy link
Author

After training the model for hours on my own data. it seems to break since it can't save the training into a file that doesn't exist. I was using google colab for training,
restoring model from checkpoints/model-800 INFO:tensorflow:Restoring parameters from checkpoints/model-800 Restoring parameters from checkpoints/model-800 2024-05-23 11:58:55.097859: W tensorflow/core/framework/op_kernel.cc:1202] OP_REQUIRES failed at save_restore_tensor.cc:170 : Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for checkpoints/model-800 Traceback (most recent call last): File "/usr/local/envs/py364/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1361, in _do_call return fn(*args) File "/usr/local/envs/py364/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1340, in _run_fn target_list, status, run_metadata) File "/usr/local/envs/py364/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in exit c_api.TF_GetCode(self.status.status)) tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for checkpoints/model-800 [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_INT32, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

I'm sure this question will come off as excessively elementary, but, how did you use your own data? Did you just use some data from the IAM On-Line Handwriting Database? Did you devise a method that enables you to create a file that looks like this: https://fki.tic.heia-fr.ch/static/iamondb/strokesz.xml but with your own writing sample?

Thank you!
I used this it worked great
https://github.com/acmattson3/handwriting-data

@npulsipher4
Copy link

fixed it, I migrated from colab to google cloud VM Instance after running out of compute units, and realizing i have $300 in credits to spend. i went all out and got myself a nice VM with blazing training speeds. I was able to get it to work after realizing the minimum step to save was way higher than my total training step. I got it to work but i feel as if my data is way way to small for the model to learn anything. i tried to run it using demo.py but i get an error of a mismatched shape. The error message indicates that the parameter shape is [1, 20, 2] and the flat indices being accessed are [0, 20], which is out of bounds. I'm not sure this is a low data issue used to train the model or something else. No one else seems interested in this project but me, really hard to keep advancing.

I'm getting the same error but I'm running it on docker locally. I got to about 2200 steps and then it stopped. Do you think I just need more data?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants