
Exception handling for CUDA/OpenCL errors #97

Open
bentsherman opened this issue Aug 27, 2019 · 5 comments

@bentsherman
Member

A lot of times when KINC crashes I get a long stack trace like this one:

terminate called after throwing an instance of 'EException*'
[node0181:14603] *** Process received signal ***
[node0181:14603] Signal: Aborted (6)
[node0181:14603] Signal code:  (-6)
[node0181:14603] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x1525f65555d0]
[node0181:14603] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x1525f56f92c7]
[node0181:14603] [ 2] /lib64/libc.so.6(abort+0x148)[0x1525f56fa9b8]
[node0181:14603] [ 3] /software/gcc/5.4.0/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x15d)[0x1525f625997d]
[node0181:14603] [ 4] /software/gcc/5.4.0/lib64/libstdc++.so.6(+0x8c9f6)[0x1525f62579f6]
[node0181:14603] [ 5] /software/gcc/5.4.0/lib64/libstdc++.so.6(+0x8ca41)[0x1525f6257a41]
[node0181:14603] [ 6] /software/gcc/5.4.0/lib64/libstdc++.so.6(+0x8cc59)[0x1525f6257c59]
[node0181:14603] [ 7] /home/btsheal/software/ACE/develop/lib/libacecore.so.0(_ZN4CUDA10throwErrorEP10EException14cudaError_enum+0x18f)[0x152603121e9f]
[node0181:14603] [ 8] /home/btsheal/software/ACE/develop/lib/libacecore.so.0(_ZN4CUDA6Kernel7executeERKNS_6StreamE+0x143)[0x152603124be3]
[node0181:14603] [ 9] kinc[0x462562]
[node0181:14603] [10] kinc[0x4582a2]
[node0181:14603] [11] /home/btsheal/software/ACE/develop/lib/libacecore.so.0(_ZN3Ace8Analytic13CUDARunThread3runEv+0x66)[0x152603112f46]
[node0181:14603] [12] /software/Qt/5.9.2/lib/libQt5Core.so.5(+0xac06d)[0x1525f6a9a06d]
[node0181:14603] [13] /lib64/libpthread.so.0(+0x7dd5)[0x1525f654ddd5]
[node0181:14603] [14] /lib64/libc.so.6(clone+0x6d)[0x1525f57c102d]
[node0181:14603] *** End of error message ***
/pscratch/scratch4/btsheal/benchmark-nf/work/e8/7afb360a11dbef9eae776a2eae4f90/.command.sh: line 7: 14603 Aborted                 (core dumped) taskset -c 0-1 kinc run similarity --input Yeast.emx --ccm Yeast.ccm --cmx Yeast.cmx --clusmethod gmm --corrmethod spearman --preout true --postout true --bsize 32768 --gsize 4096 --lsize 1024

I can usually pick out where it's coming from (in this case I think a CUDA kernel failed to launch), but I really need to see the error message. I'm going to try to fix this by inserting try / catch statements into KINC, but we may need to make some changes in ACE too.

@bentsherman
Member Author

So I followed this stack trace and I'm looking at a few functions in particular in ACE:

CUDA::Kernel::execute()
CUDARunThread::run()
EApplication::notify()

And from these functions I can tell that ACE is supposed to catch any ACE-specific exceptions from worker threads and re-throw them in the main thread so that they are properly handled. So it seems like if CUDA::Kernel::execute() threw an exception then it should have been handled properly, but it wasn't.

So this might actually be an ACE issue. @4ctrl-alt-del any thoughts on this? The command line that I used is in the log but I can try to come up with a more reproducible test case.
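For reference, the catch-and-re-throw pattern described above can be sketched in standard C++ roughly as follows. This is only an illustration of the general technique, not ACE's actual code: `doWork`, `runOnWorker`, and `reportFromMainThread` are hypothetical names, and `std::thread` with `std::exception_ptr` stands in for ACE's Qt-based threads and its `EException` type.

```cpp
#include <exception>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical unit of work that may throw (standing in for a kernel launch).
void doWork(bool fail)
{
   if (fail)
   {
      throw std::runtime_error("kernel launch failed");
   }
}

// Runs doWork() on a worker thread. Any exception is captured inside the
// worker, because an exception that escapes a thread function terminates
// the whole process.
std::exception_ptr runOnWorker(bool fail)
{
   std::exception_ptr captured;
   std::thread worker([&captured, fail]()
   {
      try
      {
         doWork(fail);
      }
      catch (...)
      {
         captured = std::current_exception();
      }
   });
   worker.join();
   return captured;
}

// Runs on the calling (main) thread: re-throws whatever the worker captured
// so it can be handled in one place, and returns the message to report.
std::string reportFromMainThread(bool fail)
{
   try
   {
      if (std::exception_ptr captured = runOnWorker(fail))
      {
         std::rethrow_exception(captured);
      }
      return "ok";
   }
   catch (const std::exception& error)
   {
      return error.what();
   }
}
```

The `catch (...)` in the worker is the important part of this sketch: it captures whatever is thrown, regardless of its type, so nothing can escape the thread function.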

@4ctrl-alt-del
Member

After closely inspecting the code you added for CUDA in ACE, I did notice an error that could cause this. If you look at ace_analytic_cudarunthread.cpp:139, you are calling the set current method, which can throw exceptions, OUTSIDE of the try statement. This is a bug regardless and could cause the error you are getting.

@bentsherman
Member Author

@4ctrl-alt-del Good catch, I have wrapped that call in a separate try/catch statement and I will make a PR for it later. However, I am still getting the same error as before. See this line in particular:

[node0181:14603] [ 8] /home/btsheal/software/ACE/develop/lib/libacecore.so.0(_ZN4CUDA6Kernel7executeERKNS_6StreamE+0x143)[0x152603124be3]

This is how I know the exception is coming from CUDA::Kernel::execute(). But I don't see why it isn't being handled properly. I'm looking at this code:

void CUDARunThread::run()
{
   // ...
         try
         {
            _result = _worker->execute(_work).release();
            _result->moveToThread(thread());
            _result->setParent(this);
         }
         catch (EException e)
         {
            _exception = new EException(e);
         }
   // ...
}

So any exception from a CUDA kernel should be caught, right? Would the moveToThread() call affect anything?
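One thing that may be worth checking here: the terminate message in the original trace names the thrown type as 'EException*', and frame 7 demangles to CUDA::throwError(EException*, cudaError_enum). That suggests, though it does not prove, that a pointer is being thrown rather than a value. In C++ a handler written as catch (EException e) does not match a thrown EException*. The sketch below shows that language rule in isolation; `DemoException` and `whichHandler` are hypothetical stand-ins, not ACE code.

```cpp
#include <string>

// Hypothetical stand-in for ACE's EException, for illustration only.
struct DemoException
{
   std::string message;
};

// Throws either a pointer or a value and reports which handler matched.
std::string whichHandler(bool throwPointer)
{
   try
   {
      if (throwPointer)
      {
         throw new DemoException{"thrown as pointer"};
      }
      throw DemoException{"thrown as value"};
   }
   catch (DemoException e)
   {
      // Matches only objects thrown by value; a thrown pointer skips this.
      return "by value: " + e.message;
   }
   catch (DemoException* e)
   {
      // Matches a thrown pointer. Without this handler the pointer would
      // propagate out of the function uncaught.
      std::string result = "by pointer: " + e->message;
      delete e;
      return result;
   }
}
```

If the second handler were removed, calling whichHandler(true) would let the pointer escape, and in a thread with no outer handler that ends in std::terminate, which is consistent with the abort in the trace.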

@4ctrl-alt-del
Member

4ctrl-alt-del commented Sep 9, 2019

moveToThread() simply moves the thread ownership of a Qt object; it has nothing to do with thread creation or destruction.

It should be caught in any thread that is created by ACE. My guess is somehow a thread is being made outside of ACE that is not protected by a try and catch.

@bentsherman
Member Author

bentsherman commented Feb 25, 2020

Note to self -- this error was caused by setting the CUDA block size too high (1024), and I was able to reproduce a similar error with the ACE example. So ACE/KINC should definitely exit more gracefully. I'm still trying to figure out whether the issue lies with ACE or with KINC.
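As a rough idea of what exiting more gracefully could look like, the requested block size could be validated before the launch so the user gets a readable message instead of an abort. The sketch below is an assumption, not existing ACE or KINC code: `validateBlockSize` is a hypothetical helper, and the limit is passed in as a plain integer. In a real build that limit would have to be queried from CUDA, for example with cuFuncGetAttribute and CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK, since the per-kernel limit can be lower than the device-wide maximum.

```cpp
#include <string>

// Hypothetical pre-launch check. Returns an empty string when the requested
// block size is usable, otherwise a message suitable for showing to the user.
// maxThreadsPerBlock is assumed to have been queried from CUDA beforehand.
std::string validateBlockSize(int requested, int maxThreadsPerBlock)
{
   if (requested <= 0)
   {
      return "block size must be positive";
   }
   if (requested > maxThreadsPerBlock)
   {
      return "block size " + std::to_string(requested)
         + " exceeds the kernel limit of " + std::to_string(maxThreadsPerBlock);
   }
   return "";
}
```

For example, validateBlockSize(1024, 768) returns "block size 1024 exceeds the kernel limit of 768", where 768 is a made-up limit used only for illustration.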
