NaN training issue #391
Hi happyqiu, Thanks for your report! Yes, I am encountering similar issues, and our Conda environment probably needs an update.
One question: in setting up your Conda environment, did you follow the wiki (https://github.com/kristinbranson/APT/wiki/Windows-&-Conda-Setup) and use the environment file located at <APT>\condaenv\env.yml? Or did you follow the instructions in the doc (http://kristinbranson.github.io/APT/LocalBackEnd.html)?
I ask just for context, as the error you encountered differs slightly from mine. There's no need to try out the second option, as we probably need to do a quick update. Will be back; thanks again! |
Hi Allen,
Thanks for your quick response!
I followed the doc to set up the environment.
|
I updated the instructions here: http://kristinbranson.github.io/APT/LocalBackEnd.html for setting up Conda and Windows. I've only tested so far with the multianimal branch, which is the branch everyone in our lab is working on. |
Thanks for your update! But another error pops up, which shows:
'......
"C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 36, in <module>
from tensorflow.python.profiler import trace
ImportError: cannot import name 'trace' from 'tensorflow.python.profiler'
When I simply typed the command 'from tensorflow.python.profiler import trace' in Spyder, it seemed to work well. Do you have any idea why?
Thanks! |
Do you know which git branch of APT you are using? I only tested on the multianimal branch. Did you run the gpu/backend check?
For less delayed replies, please email ***@***.***. I check my GMail account much more frequently than my HHMI mail.
|
I downloaded the APT-develop one. (https://github.com/kristinbranson/APT)
I also checked the backend configuration. Both APT activation and GPU tests
passed.
|
You can switch to the multianimal branch with the command
git checkout multianimal
|
Thanks for your suggestion, but I still got the same error message with
multianimal one.
Traceback (most recent call last):
  File "C:\Users\yuechen\Documents\MATLAB\APT-multianimal\deepnet\APT_interface.py", line 44, in <module>
    import PoseUNet_dataset as PoseUNet
  File "C:\Users\yuechen\Documents\MATLAB\APT-multianimal\deepnet\PoseUNet_dataset.py", line 1, in <module>
    from PoseCommon_dataset import PoseCommon, PoseCommonMulti, PoseCommonRNN, PoseCommonTime, conv_relu3, conv_shortcut
  File "C:\Users\yuechen\Documents\MATLAB\APT-multianimal\deepnet\PoseCommon_dataset.py", line 26, in <module>
    from tensorflow.contrib.layers import batch_norm
  File "C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow\__init__.py", line 50, in __getattr__
    module = self._load()
  File "C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow\__init__.py", line 44, in _load
    module = _importlib.import_module(self.__name__)
  File "C:\Users\yuechen\anaconda3\envs\APT\lib\importlib\__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow_core\contrib\__init__.py", line 39, in <module>
    from tensorflow.contrib import compiler
  File "C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow_core\contrib\compiler\__init__.py", line 21, in <module>
    from tensorflow.contrib.compiler import jit
  File "C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow_core\contrib\compiler\__init__.py", line 22, in <module>
    from tensorflow.contrib.compiler import xla
  File "C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow_core\contrib\compiler\xla.py", line 22, in <module>
    from tensorflow.python.estimator import model_fn as model_fn_lib
  File "C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow_core\python\estimator\model_fn.py", line 26, in <module>
    from tensorflow_estimator.python.estimator import model_fn
  File "C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow_estimator\__init__.py", line 10, in <module>
    from tensorflow_estimator._api.v1 import estimator
  File "C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow_estimator\_api\v1\estimator\__init__.py", line 10, in <module>
    from tensorflow_estimator._api.v1.estimator import experimental
  File "C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow_estimator\_api\v1\estimator\experimental\__init__.py", line 10, in <module>
    from tensorflow_estimator.python.estimator.canned.dnn import dnn_logit_fn_builder
  File "C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow_estimator\python\estimator\canned\dnn.py", line 27, in <module>
    from tensorflow_estimator.python.estimator import estimator
  File "C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 36, in <module>
    from tensorflow.python.profiler import trace
ImportError: cannot import name 'trace' from 'tensorflow.python.profiler' (C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow_core\python\profiler\__init__.py)
|
Hi, I'm thinking it may be a compatibility issue. I'm using CUDA 11.7.
Which version of tensorflow-gpu do you think I should use?
|
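One possible cause of the ImportError in the traceback above (an assumption, not something confirmed in this thread) is that the installed tensorflow and tensorflow-estimator packages come from different release series. The sketch below compares the two installed versions. The helper names are hypothetical, and `importlib.metadata` needs Python 3.8 or later; on an older interpreter, `conda list` in the activated environment gives the same information.

```python
import re
from importlib import metadata  # Python 3.8+


def major_minor(version):
    """Return (major, minor) from a version string such as '1.15.0'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def same_release_series(tf_version, estimator_version):
    """True when both packages share the same major.minor series."""
    return major_minor(tf_version) == major_minor(estimator_version)


def installed_version(package):
    """Installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # The GPU build may be registered as 'tensorflow-gpu' rather than 'tensorflow'.
    tf = installed_version("tensorflow") or installed_version("tensorflow-gpu")
    est = installed_version("tensorflow-estimator")
    print("tensorflow:", tf, "| tensorflow-estimator:", est)
    if tf and est and not same_release_series(tf, est):
        print("The two packages come from different release series.")
```

Run it with the Python interpreter of the APT conda environment, so that it inspects the same site-packages directory that appears in the traceback.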
@happyqiu I'm using CUDA 11.6 and I believe the conda environment is working on my machine. I seem to be hitting a different bug, maybe about not having Git installed on my Windows machine; one step at a time though.
If you activate your APT conda environment and do a conda list, what versions are shown for tensorflow and tensorflow-estimator?
To confirm, after changing to the multianimal branch, did you re-create your conda environment using the updated environment file at <APTRoot>\install\apt_conda_environment.yml? E.g., did you run this:
conda env create -f c:\path\to\APT\install\apt_conda_environment.yml --force
|
@happyqiu just FYI I pushed a minor fix and DLC Training is running successfully on my Windows machine. After we sort out your environment, it may be helpful to pull the latest before training. |
@mkabra Training finished but the number of iterations on disk (7000) far exceeds my setting in the Tracking Parameters (1000). Tracking still proceeds but looks odd etc. Flagging maybe some kind of edge case for this net? |
Hi Allen, The version for both tensorflow and tensorflow-estimator is 1.14.0.
I reinstalled CUDA 10.0 and recreated the environment, and a different bug came up. The file does exist.
Traceback (most recent call last):
  File "C:\Users\yuechen\Documents\MATLAB\APT-multianimal\deepnet\APT_interface.py", line 4805, in <module>
    main(sys.argv[1:])
  File "C:\Users\yuechen\Documents\MATLAB\APT-multianimal\deepnet\APT_interface.py", line 4772, in main
    repo_info = PoseTools.get_git_commit()
  File "C:\Users\yuechen\Documents\MATLAB\APT-multianimal\deepnet\PoseTools.py", line 1670, in get_git_commit
    label = subprocess.check_output(cmd).strip()
  File "C:\Users\yuechen\anaconda3\envs\APT\lib\subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "C:\Users\yuechen\anaconda3\envs\APT\lib\subprocess.py", line 488, in run
    with Popen(*popenargs, **kwargs) as process:
  File "C:\Users\yuechen\anaconda3\envs\APT\lib\subprocess.py", line 800, in __init__
    restore_signals, start_new_session)
  File "C:\Users\yuechen\anaconda3\envs\APT\lib\subprocess.py", line 1207, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified
|
OK, we may have made progress! I think this was the bug I fixed; did/can you pull the latest code in the multianimal branch (git pull)? |
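The FileNotFoundError in the traceback above is raised by subprocess when the program it is asked to run cannot be found, which fits the earlier guess that Git is not installed or not on PATH. The following is a minimal sketch of a guarded lookup, not the fix that was actually pushed to APT. The function name `git_commit_label`, its parameters, and the `git describe --always` command are assumptions for illustration; the command used by `PoseTools.get_git_commit` is not shown in this thread.

```python
import shutil
import subprocess


def git_commit_label(repo_dir, git_exe="git", fallback="unknown"):
    """Return `git describe --always` for repo_dir, or `fallback` if unavailable."""
    # shutil.which returns None when the executable is not on PATH, the
    # situation in which subprocess raises FileNotFoundError.
    if shutil.which(git_exe) is None:
        return fallback
    try:
        out = subprocess.check_output(
            [git_exe, "-C", repo_dir, "describe", "--always"],
            stderr=subprocess.DEVNULL,
        )
    except (subprocess.CalledProcessError, OSError):
        # Not a git checkout (for example a downloaded zip), or git failed to run.
        return fallback
    return out.decode().strip()


if __name__ == "__main__":
    print(git_commit_label("."))
```

Returning a fallback label keeps a missing Git installation from aborting a training run that does not otherwise depend on Git.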
Hi Allen, looks like it starts training now! Thank you so much for your
help!!!
|
Great, glad to hear it! Kristin did the hard part: she updated the conda environment! |
Thank you all!
One more question, I found that during the training, only 1% of the GPU is
used but 70% of CPU is used. Is that common? I thought the GPU was supposed
to be used. Would that be because of the tensorflow version in the
environment?
|
I think this is not unusual as the training is often CPU-bound by the data pipeline (data read, augmentation+transformation etc). If the training is proceeding at any kind of normal pace -- the Training Monitor is updating, etc -- then I think you should be utilizing your GPU. That said @mkabra knows best, maybe he has thoughts. I also don't know how accurate the instrumentation is (eg task manager). |
@happyqiu OK, I may have spoken too soon. Just FYI, we are still chasing some things down here. Thanks for your patience while we debug. |
@mkabra We are hitting open-mmlab/mmcv#1543 while building mmcv-full on Win10. Do you need mmcv==1.3.3 or can we bump to mmcv==1.4.0 or higher? |
@happyqiu Thanks for your patience! We have an interim fix for you while we iron out the conda environment in the multianimal branch.
The interim fix is in the develop branch. In your APT repo, can you please go back to develop and pull the latest?
git checkout develop
git pull
You should now have an updated conda environment yml file in <APT>\install. Create the APT conda environment as before:
conda env create -f c:\path\to\APT\install\apt_conda_environment.yml --force
I tested training/tracking with MDN and DLC. Just FYI I am testing with CUDA 11.7.
You were right that our last conda environment was not using the GPU! That was my mistake. During Training, you can confirm GPU usage by selecting "Show log files" in the Training Monitor and pressing "Go". For instance my logs for a recent train contain lines like the following:
2022-06-15 15:46:58.396255: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2022-06-15 15:46:58.396272: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2022-06-15 15:46:58.781641: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-06-15 15:46:58.781660: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2022-06-15 15:46:58.781665: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2022-06-15 15:46:58.781761: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6703 MB memory) -> physical GPU (device: 0, name: Quadro RTX 4000, pci bus id: 0000:01:00.0, compute capability: 7.5)
Please let us know if you can get going with this update! |
Hi Allen,
Thanks for your update!
I tried the latest APT_develop, and it worked! I got a similar log file as yours, and the GPU was pretty much used.
Really appreciate your help!
|
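To check GPU usage without reading the log by eye, a training log can be scanned for the "Created TensorFlow device" line quoted in this thread. This is a rough sketch: the regular expression is modeled only on that single quoted line, the function name is hypothetical, and other TensorFlow versions may word the message differently.

```python
import re

# Pattern modeled on the "Created TensorFlow device" log line quoted in this thread.
GPU_DEVICE_RE = re.compile(
    r"Created TensorFlow device \(.*?device:GPU:(\d+) with (\d+) MB memory\)"
    r".*?name: ([^,]+),"
)


def gpu_devices_in_log(log_text):
    """Return (index, memory_mb, name) for each GPU device found in the log text."""
    return [
        (int(idx), int(mem), name.strip())
        for idx, mem, name in GPU_DEVICE_RE.findall(log_text)
    ]


if __name__ == "__main__":
    sample = (
        "2022-06-15 15:46:58.781761: I tensorflow/core/common_runtime/gpu/"
        "gpu_device.cc:1304] Created TensorFlow device "
        "(/job:localhost/replica:0/task:0/device:GPU:0 with 6703 MB memory) "
        "-> physical GPU (device: 0, name: Quadro RTX 4000, "
        "pci bus id: 0000:01:00.0, compute capability: 7.5)"
    )
    print(gpu_devices_in_log(sample))
```

An empty result for a finished training log would suggest that TensorFlow ran on the CPU, which matches the low GPU utilization reported earlier in the thread.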
Hi Allen,
Sorry to trouble you again, but I got some new problems with the latest
APT-develop.
1. It takes much much more time to build the training dataset than it used
to be when I start training.
2. During training, the 'dist' plot in the monitor looks very weird
(snapshot attached).
Have you ever seen these before?
|
Hey @happyqiu, How long is it taking for training to start up? Can you share your project (.lbl) file? For the 'dist' plot, I can't find the attachment, could you please re-attach? (Or if I am missing it please let me know!) Thanks, |
Hello, I'm using Windows and MATLAB 2022a to run APT with DLC, but training always fails (N. iterations: NaN/10000). The log file shows this:
Job 1:
C:\Users\yuechen\Documents.apt\tp754f9c10_94f3_440a_925f_8b8a4ca620b7\YQ2_20220517_side\20220531T120619view0_20220531T120636_new.log
C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
Using TensorFlow backend.
Traceback (most recent call last):
  File "C:\Users\yuechen\Documents\MATLAB\APT-develop\deepnet\APT_interface.py", line 35, in <module>
    from deeplabcut.pose_estimation_tensorflow.train import train as deepcut_train
  File "C:\Users\yuechen\Documents\MATLAB\APT-develop\deepnet\deeplabcut\__init__.py", line 51, in <module>
    from deeplabcut.pose_estimation_tensorflow import train_network, return_train_network_path
  File "C:\Users\yuechen\Documents\MATLAB\APT-develop\deepnet\deeplabcut\pose_estimation_tensorflow\__init__.py", line 18, in <module>
    from deeplabcut.pose_estimation_tensorflow.evaluate import *
  File "C:\Users\yuechen\Documents\MATLAB\APT-develop\deepnet\deeplabcut\pose_estimation_tensorflow\evaluate.py", line 18, in <module>
    import pandas as pd
ModuleNotFoundError: No module named 'pandas'
After I installed this package, the error moved on to other missing modules. Is there anything else I need to install besides the packages mentioned in the APT documentation?
Thanks a lot!
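When missing modules surface one at a time like this, it can be quicker to list all of them in a single pass. The sketch below does that with the standard library. The function name is hypothetical, and apart from pandas (taken from the traceback above) the candidate names are illustrative guesses rather than APT's real dependency list, which lives in its conda environment file.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # 'pandas' comes from the traceback; the other names are assumptions
    # chosen only to illustrate the check.
    candidates = ["pandas", "numpy", "scipy", "matplotlib"]
    print("missing:", missing_modules(candidates))
```

Run it inside the activated APT environment. A long list of missing modules usually means the environment was not created from the project's environment file.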