NaN training issue #391

happyqiu · 2022-05-31T16:32:37Z

Hello, I'm using Windows and MATLAB 2022a to run APT with DLC selected, but training always fails (N. iterations: NaN/10000). The log file shows this:

Job 1:

C:\Users\yuechen\Documents.apt\tp754f9c10_94f3_440a_925f_8b8a4ca620b7\YQ2_20220517_side\20220531T120619view0_20220531T120636_new.log

C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
Using TensorFlow backend.
Traceback (most recent call last):
File "C:\Users\yuechen\Documents\MATLAB\APT-develop\deepnet\APT_interface.py", line 35, in <module>
from deeplabcut.pose_estimation_tensorflow.train import train as deepcut_train
File "C:\Users\yuechen\Documents\MATLAB\APT-develop\deepnet\deeplabcut\__init__.py", line 51, in <module>
from deeplabcut.pose_estimation_tensorflow import train_network, return_train_network_path
File "C:\Users\yuechen\Documents\MATLAB\APT-develop\deepnet\deeplabcut\pose_estimation_tensorflow\__init__.py", line 18, in <module>
from deeplabcut.pose_estimation_tensorflow.evaluate import *
File "C:\Users\yuechen\Documents\MATLAB\APT-develop\deepnet\deeplabcut\pose_estimation_tensorflow\evaluate.py", line 18, in <module>
import pandas as pd
ModuleNotFoundError: No module named 'pandas'

After I installed this package, the error moved on to other missing modules. Is there anything else I need to install besides the packages mentioned in the APT documentation?
Thanks a lot!
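
When installing one missing package only exposes the next, it can save time to list every missing import in a single pass. The following is a minimal sketch to run inside the activated APT conda environment. The module list is an assumption for illustration (only pandas is confirmed missing by the log above) and is not APT's official dependency list.

```python
import importlib.util

# Hypothetical list for illustration; only 'pandas' is confirmed by the log above.
CANDIDATE_MODULES = ["pandas", "numpy", "scipy", "tensorflow"]


def find_missing(modules):
    """Return the names in `modules` that the import system cannot locate."""
    missing = []
    for name in modules:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Raised for dotted names with a missing parent, or a broken module spec.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("Missing modules:", find_missing(CANDIDATE_MODULES))
```

This only checks that a module can be found, not that it imports cleanly, so a version conflict inside a package would still show up later.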

allenleetc · 2022-06-01T01:01:21Z

Hi happyqiu,

Thanks for your report! Yes I am encountering similar issues and our Conda environment probably needs an update.

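One question -- in setting up your Conda environment, did you follow the wiki (https://github.com/kristinbranson/APT/wiki/Windows-&-Conda-Setup) and use the environment file located at <APT>\condaenv\env.yml? Or did you follow the instructions in the doc (http://kristinbranson.github.io/APT/LocalBackEnd.html)?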

I ask just for context, as the error you encountered differs slightly from mine. There's no need to try out the second option, as we probably need to do a quick update. Will be back, thanks again!

happyqiu · 2022-06-01T12:17:20Z

Hi Allen, Thanks for your quick response! I followed the doc to set up the environment.


kristinbranson · 2022-06-05T11:39:57Z

I updated the instructions here: http://kristinbranson.github.io/APT/LocalBackEnd.html for setting up Conda and Windows. I've only tested so far with the multianimal branch, which is the branch everyone in our lab is working on.

happyqiu · 2022-06-05T22:08:50Z

Thanks for your update! But another error pops up which shows
'......
"C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 36, in <module>
from tensorflow.python.profiler import trace
ImportError: cannot import name 'trace' from 'tensorflow.python.profiler'

When I simply run 'from tensorflow.python.profiler import trace' in Spyder, it seems to work. Do you have any idea what is going on?

Thanks!
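
An import that succeeds in Spyder but fails when APT launches training can mean the two are running different Python interpreters or environments. That is a guess here, not something confirmed in this thread. A small sketch to run in both places and compare the output (the function name `describe_import` is made up for this example):

```python
import importlib
import sys


def describe_import(module_name):
    """Report which interpreter is running and where `module_name` loads from."""
    info = {"python": sys.executable, "prefix": sys.prefix}
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        # Covers both a missing module and a failing import inside the package.
        info["error"] = str(exc)
        return info
    info["file"] = getattr(module, "__file__", None)
    info["version"] = getattr(module, "__version__", None)
    return info


if __name__ == "__main__":
    for name in ("tensorflow", "tensorflow_estimator"):
        print(name, describe_import(name))
```

If the "python" or "file" entries differ between Spyder and the APT conda environment, the two are not using the same installation.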

kristinbranson · 2022-06-05T22:24:46Z

Do you know which git branch of APT you are using? I only tested on the multianimal branch. Did you run the gpu/backend check?


happyqiu · 2022-06-05T22:32:22Z

I downloaded the APT-develop one. (https://github.com/kristinbranson/APT) I also checked the backend configuration. Both APT activation and GPU tests passed.


kristinbranson · 2022-06-05T22:37:37Z

You can switch to the multianimal branch with the command

git checkout multianimal


happyqiu · 2022-06-06T14:40:15Z

Thanks for your suggestion, but I still got the same error message with the multianimal one.

Traceback (most recent call last):
File "C:\Users\yuechen\Documents\MATLAB\APT-multianimal\deepnet\APT_interface.py", line 44, in <module>
import PoseUNet_dataset as PoseUNet
File "C:\Users\yuechen\Documents\MATLAB\APT-multianimal\deepnet\PoseUNet_dataset.py", line 1, in <module>
from PoseCommon_dataset import PoseCommon, PoseCommonMulti, PoseCommonRNN, PoseCommonTime, conv_relu3, conv_shortcut
File "C:\Users\yuechen\Documents\MATLAB\APT-multianimal\deepnet\PoseCommon_dataset.py", line 26, in <module>
from tensorflow.contrib.layers import batch_norm
File "C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow\__init__.py", line 50, in __getattr__
module = self._load()
File "C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow\__init__.py", line 44, in _load
module = _importlib.import_module(self.__name__)
File "C:\Users\yuechen\anaconda3\envs\APT\lib\importlib\__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow_core\contrib\__init__.py", line 39, in <module>
from tensorflow.contrib import compiler
File "C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow_core\contrib\compiler\__init__.py", line 21, in <module>
from tensorflow.contrib.compiler import jit
File "C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow_core\contrib\compiler\__init__.py", line 22, in <module>
from tensorflow.contrib.compiler import xla
File "C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow_core\contrib\compiler\xla.py", line 22, in <module>
from tensorflow.python.estimator import model_fn as model_fn_lib
File "C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow_core\python\estimator\model_fn.py", line 26, in <module>
from tensorflow_estimator.python.estimator import model_fn
File "C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow_estimator\__init__.py", line 10, in <module>
from tensorflow_estimator._api.v1 import estimator
File "C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow_estimator\_api\v1\estimator\__init__.py", line 10, in <module>
from tensorflow_estimator._api.v1.estimator import experimental
File "C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow_estimator\_api\v1\estimator\experimental\__init__.py", line 10, in <module>
from tensorflow_estimator.python.estimator.canned.dnn import dnn_logit_fn_builder
File "C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow_estimator\python\estimator\canned\dnn.py", line 27, in <module>
from tensorflow_estimator.python.estimator import estimator
File "C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 36, in <module>
from tensorflow.python.profiler import trace
ImportError: cannot import name 'trace' from 'tensorflow.python.profiler' (C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow_core\python\profiler\__init__.py)


happyqiu · 2022-06-08T18:20:36Z

Hi, I'm thinking it may be a compatibility issue. I'm using CUDA 11.7. Which version of tensorflow-gpu do you think I should use?


allenleetc · 2022-06-08T19:44:01Z

@happyqiu I'm using CUDA 11.6 and I believe the conda environment in the multianimal branch is working on my machine. I seem to be hitting a different bug, maybe about not having Git installed on my Windows machine; one step at a time though.

If you activate your APT conda environment, and do a conda list, what versions are shown for tensorflow and tensorflow-estimator?

To confirm, after changing to the multianimal branch, did you re-create your conda environment using the updated environment file at <APTRoot>\install\apt_conda_environment.yml? E.g., did you run this:

conda env create -f c:\path\to\APT\install\apt_conda_environment.yml --force
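
To compare the tensorflow and tensorflow-estimator versions without reading the whole `conda list` printout, the output can be pasted into a small parser. This is a sketch only: the sample text is hypothetical (not taken from this thread), and treating a differing major.minor between the two packages as suspicious is a heuristic, not a rule stated here.

```python
def parse_conda_list(text):
    """Map package name -> version from the text printed by `conda list`."""
    versions = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header comments
        fields = line.split()
        if len(fields) >= 2:
            versions[fields[0]] = fields[1]
    return versions


def same_minor_series(a, b):
    """True if two version strings share major.minor (e.g. 1.15.0 and 1.15.1)."""
    return a.split(".")[:2] == b.split(".")[:2]


# Hypothetical sample output for illustration, not from the thread.
SAMPLE = """\
# packages in environment at C:\\path\\to\\envs\\APT:
#
# Name                    Version                   Build  Channel
tensorflow                1.15.0                   pypi_0    pypi
tensorflow-estimator      2.6.0                    pypi_0    pypi
"""

if __name__ == "__main__":
    found = parse_conda_list(SAMPLE)
    tf, est = found["tensorflow"], found["tensorflow-estimator"]
    print(tf, est, "consistent" if same_minor_series(tf, est) else "MISMATCH")
```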

Catch exception when Git not installed/found. rm old/obsolete conda environment yml

allenleetc · 2022-06-08T20:04:15Z

@happyqiu just FYI I pushed a minor fix and DLC Training is running successfully on my Windows machine. After we sort out your environment, it may be helpful to pull the latest before training.

allenleetc · 2022-06-08T20:19:03Z

@mkabra Training finished but the number of iterations on disk (7000) far exceeds my setting in the Tracking Parameters (1000). Tracking still proceeds but looks odd etc. Flagging maybe some kind of edge case for this net?

happyqiu · 2022-06-08T21:04:11Z

Hi Allen, the version of both tensorflow and tensorflow-estimator is 1.14.0. I reinstalled CUDA 10.0 and recreated the environment, and a different bug came up. The file does exist.

Traceback (most recent call last):
File "C:\Users\yuechen\Documents\MATLAB\APT-multianimal\deepnet\APT_interface.py", line 4805, in <module>
main(sys.argv[1:])
File "C:\Users\yuechen\Documents\MATLAB\APT-multianimal\deepnet\APT_interface.py", line 4772, in main
repo_info = PoseTools.get_git_commit()
File "C:\Users\yuechen\Documents\MATLAB\APT-multianimal\deepnet\PoseTools.py", line 1670, in get_git_commit
label = subprocess.check_output(cmd).strip()
File "C:\Users\yuechen\anaconda3\envs\APT\lib\subprocess.py", line 411, in check_output
**kwargs).stdout
File "C:\Users\yuechen\anaconda3\envs\APT\lib\subprocess.py", line 488, in run
with Popen(*popenargs, **kwargs) as process:
File "C:\Users\yuechen\anaconda3\envs\APT\lib\subprocess.py", line 800, in __init__
restore_signals, start_new_session)
File "C:\Users\yuechen\anaconda3\envs\APT\lib\subprocess.py", line 1207, in _execute_child
startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified

allenleetc · 2022-06-08T22:02:02Z

OK, we may have made progress! I think this was the bug I fixed; did/can you pull the latest code in the multianimal branch?

git pull
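
For context, the FileNotFoundError in the traceback above is what `subprocess.check_output` raises when the executable it is asked to run (here presumably git) is not on the PATH. The general pattern for tolerating that looks roughly like the sketch below. This is not APT's actual fix: the function name `safe_git_commit`, the `git rev-parse HEAD` command, and the "unknown" fallback are all assumptions made for illustration.

```python
import subprocess


def safe_git_commit(cmd=("git", "rev-parse", "HEAD"), fallback="unknown"):
    """Return the output of `cmd`, or `fallback` if git is missing or the call fails."""
    try:
        out = subprocess.check_output(list(cmd), stderr=subprocess.DEVNULL)
    except (OSError, subprocess.CalledProcessError):
        # OSError includes FileNotFoundError ([WinError 2]) when the executable
        # is not on PATH; CalledProcessError covers a non-zero exit status,
        # e.g. when the directory is not a git repository.
        return fallback
    return out.decode("utf-8", errors="replace").strip()


if __name__ == "__main__":
    print("commit:", safe_git_commit())
```

With this shape, a machine without Git installed records a placeholder commit label instead of aborting training at startup.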

happyqiu · 2022-06-09T13:46:25Z

Hi Allen, looks like it starts training now! Thank you so much for your help!!!


allenleetc · 2022-06-09T14:54:54Z

Great, glad to hear it! Kristin did the hard part: she updated the conda environment!

happyqiu · 2022-06-09T15:35:57Z

Thank you all! One more question, I found that during the training, only 1% of the GPU is used but 70% of CPU is used. Is that common? I thought the GPU was supposed to be used. Would that be because of the tensorflow version in the environment?


allenleetc · 2022-06-09T18:48:24Z

I think this is not unusual as the training is often CPU-bound by the data pipeline (data read, augmentation+transformation etc).

If the training is proceeding at any kind of normal pace -- the Training Monitor is updating, etc -- then I think you should be utilizing your GPU.

That said @mkabra knows best, maybe he has thoughts. I also don't know how accurate the instrumentation is (eg task manager).

allenleetc · 2022-06-10T15:47:25Z

@happyqiu OK I may have spoken too soon; just FYI, we are chasing some things down here. Thanks for your patience while we debug.

allenleetc · 2022-06-14T21:41:08Z

@mkabra We are hitting open-mmlab/mmcv#1543 while building mmcv-full on Win10. Do you need mmcv==1.3.3 or can we bump to mmcv==1.4.0 or higher?
Alternatively, pytorch==1.8.0 or higher I guess.

allenleetc · 2022-06-15T20:05:04Z

@happyqiu Thanks for your patience! We have an interim fix for you while we iron out the conda environment in the multianimal branch.

The interim fix is in the develop branch. In your APT repo can you please go back to develop and pull the latest?

git checkout develop
git pull

You should now have an updated conda environment yml file in <APT>\install. Create the APT conda environment as before:

conda env create -f c:\path\to\APT\install\apt_conda_environment.yml --force

I tested training/tracking with MDN and DLC. Just FYI I am testing with CUDA 11.7.

You were right that our last conda environment was not using the GPU! That was my mistake. During Training, you can confirm GPU usage by selecting "Show log files" in the Training Monitor and pressing "Go". For instance my logs for a recent train contain lines like the following:

2022-06-15 15:46:58.396255: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2022-06-15 15:46:58.396272: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2022-06-15 15:46:58.781641: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-06-15 15:46:58.781660: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2022-06-15 15:46:58.781665: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2022-06-15 15:46:58.781761: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6703 MB memory) -> physical GPU (device: 0, name: Quadro RTX 4000, pci bus id: 0000:01:00.0, compute capability: 7.5)

Please let us know if you can get going with this update!
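
The log check described above can also be scripted. The sketch below scans log text for the "Created TensorFlow device" line in the format shown in the excerpt; other TensorFlow versions may word this line differently, so an empty result is a hint to read the log by hand rather than proof that no GPU was used. The function name `gpus_in_log` is made up for this example.

```python
import re

# Matches the "Created TensorFlow device" line format shown in the log excerpt above.
GPU_LINE = re.compile(
    r"Created TensorFlow device \((?P<device>\S*GPU:\d+) with (?P<mem>\d+) MB memory\)"
    r".*name: (?P<name>[^,]+),"
)


def gpus_in_log(log_text):
    """Return (device, gpu_name, memory_mb) for each GPU device the log reports."""
    return [
        (m.group("device"), m.group("name").strip(), int(m.group("mem")))
        for m in GPU_LINE.finditer(log_text)
    ]


if __name__ == "__main__":
    line = (
        "2022-06-15 15:46:58.781761: I tensorflow/core/common_runtime/gpu/"
        "gpu_device.cc:1304] Created TensorFlow device "
        "(/job:localhost/replica:0/task:0/device:GPU:0 with 6703 MB memory) -> "
        "physical GPU (device: 0, name: Quadro RTX 4000, pci bus id: 0000:01:00.0, "
        "compute capability: 7.5)"
    )
    print(gpus_in_log(line))
```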

happyqiu · 2022-06-16T15:40:11Z

Hi Allen, Thanks for your update! I tried the latest APT_develop, and it worked! I got a similar log file as yours, and the GPU was pretty much used. Really appreciate your help!

…

On Wed, Jun 15, 2022 at 4:05 PM Allen Lee ***@***.***> wrote: @happyqiu <https://github.com/happyqiu> Thanks for your patience! We have an interim fix for you while we iron out the conda environment in the multianimal branch. The interim fix is in the develop branch. In your APT repo can you please go back to develop and pull the latest? git checkout develop git pull You should now have an updated conda environment yml file in <APT>\install. Create the APT conda environment as before: conda env create -f c:\path\to\APT\install\apt_conda_environment.yml --force I tested training/tracking with MDN and DLC. Just FYI I am testing with CUDA 11.7. You were right that our last conda environment was not using the GPU! That was my mistake. During Training, you can confirm GPU usage by selecting "Show log files" in the Training Monitor and pressing "Go". For instance my logs for a recent train contain lines like the following: 2022-06-15 15:46:58.396255: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll 2022-06-15 15:46:58.396272: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0 2022-06-15 15:46:58.781641: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix: 2022-06-15 15:46:58.781660: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0 2022-06-15 15:46:58.781665: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N 2022-06-15 15:46:58.781761: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6703 MB memory) -> physical GPU (device: 0, name: Quadro RTX 4000, pci bus id: 0000:01:00.0, compute capability: 7.5) Please let us know if you can get going with this update! 
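The GPU check described in the quoted message (looking for a "Created TensorFlow device" line in the training log) can also be done with a short script. This is only a minimal sketch: `find_gpu_devices` and `log_shows_gpu` are hypothetical helper names and are not part of APT, and the pattern assumes the log format matches the TensorFlow lines quoted in this thread.

```python
import re

# Matches the TensorFlow log line quoted in this thread that reports a created
# GPU device. The exact wording is assumed from that quoted log and may differ
# in other TensorFlow versions.
GPU_DEVICE_RE = re.compile(
    r"Created TensorFlow device \((?P<device>\S*device:GPU:\d+) "
    r"with (?P<mem_mb>\d+) MB memory\)"
    r" -> physical GPU \(device: \d+, name: (?P<name>[^,]+),"
)


def find_gpu_devices(log_text):
    """Return (device, memory_mb, gpu_name) for each GPU device line in log_text."""
    return [
        (m.group("device"), int(m.group("mem_mb")), m.group("name").strip())
        for m in GPU_DEVICE_RE.finditer(log_text)
    ]


def log_shows_gpu(log_path):
    """Hypothetical helper: True if the log file at log_path reports a GPU device."""
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        return bool(find_gpu_devices(fh.read()))
```

Pointing `log_shows_gpu` at the training log shown by "Show log files" should return `True` when TensorFlow created a GPU device. An empty result from `find_gpu_devices` suggests the environment is running on the CPU only.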

happyqiu · 2022-07-11T14:30:47Z

Hi Allen, Sorry to trouble you again, but I've run into some new problems with the latest APT develop branch.

1. When I start training, building the training dataset takes much longer than it used to.
2. During training, the 'dist' plot in the monitor looks very strange (snapshot attached).

Have you seen either of these before?


allenleetc · 2022-07-12T19:11:54Z

Hey @happyqiu,

How long is it taking for training to start up? Can you share your project (.lbl) file?

For the 'dist' plot, I can't find the attachment. Could you please re-attach it? (Or if I am missing it, please let me know!)

Thanks,
Allen

allenleetc added a commit that referenced this issue Jun 8, 2022

re #391

c868823

Catch exception when Git not installed/found. rm old/obsolete conda environment yml

allenleetc added a commit that referenced this issue Jun 15, 2022

re #391 updated conda env for develop branch

78719ad

allenleetc added a commit that referenced this issue Jun 15, 2022

re #391 mv. re #392 lower hardcoded GPU mem limit in interim.

9fa1c4a

allenleetc added a commit that referenced this issue Jun 15, 2022

re #391 mv

5c8099c

happyqiu closed this as completed Jun 16, 2022

