NaN training issue #391

Closed
happyqiu opened this issue May 31, 2022 · 24 comments

Comments

@happyqiu

Hello, I'm running APT on Windows with MATLAB 2022a and DLC selected, but training always fails (N. iterations: NaN/10000). The log file shows the following:

Job 1:

C:\Users\yuechen\Documents\.apt\tp754f9c10_94f3_440a_925f_8b8a4ca620b7\YQ2_20220517_side\20220531T120619view0_20220531T120636_new.log

C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
Using TensorFlow backend.
Traceback (most recent call last):
  File "C:\Users\yuechen\Documents\MATLAB\APT-develop\deepnet\APT_interface.py", line 35, in <module>
    from deeplabcut.pose_estimation_tensorflow.train import train as deepcut_train
  File "C:\Users\yuechen\Documents\MATLAB\APT-develop\deepnet\deeplabcut\__init__.py", line 51, in <module>
    from deeplabcut.pose_estimation_tensorflow import train_network, return_train_network_path
  File "C:\Users\yuechen\Documents\MATLAB\APT-develop\deepnet\deeplabcut\pose_estimation_tensorflow\__init__.py", line 18, in <module>
    from deeplabcut.pose_estimation_tensorflow.evaluate import *
  File "C:\Users\yuechen\Documents\MATLAB\APT-develop\deepnet\deeplabcut\pose_estimation_tensorflow\evaluate.py", line 18, in <module>
    import pandas as pd
ModuleNotFoundError: No module named 'pandas'

After I installed this package, the error moved on to other modules that could not be found. Is there anything else I need to install besides the packages mentioned in the APT documentation?
Thanks a lot!
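
Rather than discovering missing packages one failed import at a time, it may help to probe several module names in one pass. The sketch below is illustrative only: `find_missing` is a hypothetical helper (not part of APT), and the candidate list is an assumption, since only `pandas` and `tensorflow` are actually named in the log above. It should be run with the Python interpreter of the activated APT conda environment.

```python
import importlib.util


def find_missing(module_names):
    """Return the top-level module names that cannot be located in this environment."""
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumed list for illustration; extend it with whatever the log complains about.
    candidates = ["tensorflow", "pandas", "numpy"]
    print("Missing modules:", find_missing(candidates))
```

An empty list only means the modules can be found, not that their versions are compatible with each other.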

@allenleetc
Collaborator

Hi happyqiu,

Thanks for your report! Yes, I am encountering similar issues, and our Conda environment probably needs an update.

One question -- in setting up your Conda environment, did you follow the wiki and use the environment file located at <APT>\condaenv\env.yml? Or did you follow the instructions in the doc?

I ask just for context, as the error you encountered differs slightly from mine. There's no need to try out the second option, as we probably need to do a quick update. Will be back; thanks again!

@happyqiu
Author

happyqiu commented Jun 1, 2022 via email

@kristinbranson
Owner

I updated the instructions here: http://kristinbranson.github.io/APT/LocalBackEnd.html for setting up Conda and Windows. I've only tested so far with the multianimal branch, which is the branch everyone in our lab is working on.

@happyqiu
Author

happyqiu commented Jun 5, 2022

Thanks for the update! Now a different error appears:
......
"C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 36, in <module>
from tensorflow.python.profiler import trace
ImportError: cannot import name 'trace' from 'tensorflow.python.profiler'

When I run the command 'from tensorflow.python.profiler import trace' directly in Spyder, it seems to work. Do you have any idea what might cause this?

Thanks!

@kristinbranson
Owner

kristinbranson commented Jun 5, 2022 via email

@happyqiu
Author

happyqiu commented Jun 5, 2022 via email

@kristinbranson
Owner

kristinbranson commented Jun 5, 2022 via email

@happyqiu
Author

happyqiu commented Jun 6, 2022 via email

@happyqiu
Author

happyqiu commented Jun 8, 2022 via email

@allenleetc
Collaborator

allenleetc commented Jun 8, 2022

@happyqiu I'm using CUDA 11.6 and I believe the conda environment in the multianimal branch is working on my machine. I seem to be hitting a different bug, maybe about not having Git installed on my Windows machine; one step at a time though.

If you activate your APT conda environment, and do a conda list, what versions are shown for tensorflow and tensorflow-estimator?

To confirm, after changing to the multianimal branch, did you re-create your conda environment using the updated environment file at <APTRoot>\install\apt_conda_environment.yml? That is, did you run this:

conda env create -f c:\path\to\APT\install\apt_conda_environment.yml --force
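
As an alternative to reading the `conda list` output, the two versions asked about above can be queried from Python inside the activated APT environment. This is a minimal sketch: `installed_version` is a hypothetical helper, and the distribution names are assumptions (a GPU build may be installed under a different name such as `tensorflow-gpu`).

```python
def installed_version(package):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
    except ImportError:
        # Older interpreters: fall back to setuptools' pkg_resources.
        import pkg_resources
        try:
            return pkg_resources.get_distribution(package).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names assumed for illustration.
    for pkg in ("tensorflow", "tensorflow-estimator"):
        print(pkg, installed_version(pkg))
```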

allenleetc added a commit that referenced this issue Jun 8, 2022
Catch exception when Git not installed/found. rm old/obsolete conda environment yml
@allenleetc
Collaborator

@happyqiu just FYI I pushed a minor fix and DLC Training is running successfully on my Windows machine. After we sort out your environment, it may be helpful to pull the latest before training.

@allenleetc
Collaborator

@mkabra Training finished but the number of iterations on disk (7000) far exceeds my setting in the Tracking Parameters (1000). Tracking still proceeds but looks odd etc. Flagging maybe some kind of edge case for this net?

@happyqiu
Author

happyqiu commented Jun 8, 2022 via email

@allenleetc
Collaborator

OK, we may have made progress! I think this was the bug I fixed; did/can you pull the latest code in the multianimal branch? git pull

@happyqiu
Author

happyqiu commented Jun 9, 2022 via email

@allenleetc
Collaborator

Great, glad to hear it! Kristin did the hard part: she updated the conda environment!

@happyqiu
Author

happyqiu commented Jun 9, 2022 via email

@allenleetc
Collaborator

allenleetc commented Jun 9, 2022

I think this is not unusual as the training is often CPU-bound by the data pipeline (data read, augmentation+transformation etc).

If the training is proceeding at any kind of normal pace -- the Training Monitor is updating, etc -- then I think you should be utilizing your GPU.

That said @mkabra knows best, maybe he has thoughts. I also don't know how accurate the instrumentation is (eg task manager).

@allenleetc
Collaborator

allenleetc commented Jun 10, 2022

@happyqiu OK, I may have spoken too soon. Just FYI, we are still chasing some things down here. Thanks for your patience while we debug.

@allenleetc
Collaborator

allenleetc commented Jun 14, 2022

@mkabra We are hitting open-mmlab/mmcv#1543 while building mmcv-full on Win10. Do you need mmcv==1.3.3 or can we bump to mmcv==1.4.0 or higher?
Alternatively, pytorch==1.8.0 or higher I guess.

allenleetc added a commit that referenced this issue Jun 15, 2022
@allenleetc
Collaborator

@happyqiu Thanks for your patience! We have an interim fix for you while we iron out the conda environment in the multianimal branch.

The interim fix is in the develop branch. In your APT repo can you please go back to develop and pull the latest?

git checkout develop
git pull

You should now have an updated conda environment yml file in <APT>\install. Create the APT conda environment as before:

conda env create -f c:\path\to\APT\install\apt_conda_environment.yml --force

I tested training/tracking with MDN and DLC. Just FYI I am testing with CUDA 11.7.

You were right that our last conda environment was not using the GPU! That was my mistake. During Training, you can confirm GPU usage by selecting "Show log files" in the Training Monitor and pressing "Go". For instance my logs for a recent train contain lines like the following:

2022-06-15 15:46:58.396255: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2022-06-15 15:46:58.396272: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2022-06-15 15:46:58.781641: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-06-15 15:46:58.781660: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2022-06-15 15:46:58.781665: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2022-06-15 15:46:58.781761: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6703 MB memory) -> physical GPU (device: 0, name: Quadro RTX 4000, pci bus id: 0000:01:00.0, compute capability: 7.5)
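
The same check can be scripted once a training log has been saved to disk. The sketch below scans log text for "Created TensorFlow device" lines in the format shown in the excerpt above; `gpu_devices_in_log` is a hypothetical helper, and the pattern is an assumption that may not match the log format of other TensorFlow versions.

```python
import re

# Pattern modeled on the "Created TensorFlow device" line in the excerpt above.
GPU_LINE = re.compile(
    r"Created TensorFlow device \((?P<device>\S*device:GPU:\d+) "
    r"with (?P<mem>\d+) MB memory\).*name: (?P<name>[^,]+),"
)


def gpu_devices_in_log(log_text):
    """Return (device, gpu_name, memory_mb) for each GPU device line in the log text."""
    found = []
    for line in log_text.splitlines():
        match = GPU_LINE.search(line)
        if match:
            found.append(
                (match.group("device"), match.group("name").strip(), int(match.group("mem")))
            )
    return found


if __name__ == "__main__":
    sample = (
        "2022-06-15 15:46:58.781761: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] "
        "Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6703 MB memory) "
        "-> physical GPU (device: 0, name: Quadro RTX 4000, pci bus id: 0000:01:00.0, "
        "compute capability: 7.5)"
    )
    print(gpu_devices_in_log(sample))
```

An empty result would suggest, but not prove, that training is running on the CPU.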

Please let us know if you can get going with this update!

@happyqiu
Author

happyqiu commented Jun 16, 2022 via email

@happyqiu
Author

happyqiu commented Jul 11, 2022 via email

@allenleetc
Collaborator

Hey @happyqiu,

How long is it taking for training to start up? Can you share your project (.lbl) file?

For the 'dist' plot, I can't find the attachment, could you please re-attach? (Or if I am missing it please let me know!)

Thanks,
Allen
