Exception handling for CUDA/OpenCL errors #97
Comments
So I followed this stack trace and I'm looking at a few functions in particular in ACE:
And from these functions I can tell that ACE is supposed to catch any ACE-specific exceptions from worker threads and re-throw them in the main thread so that they are properly handled. So it seems like if So this might actually be an ACE issue. @4ctrl-alt-del any thoughts on this? The command line that I used is in the log, but I can try to come up with a more reproducible test case.
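For reference, the catch-in-worker, re-throw-in-main pattern described above can be sketched in plain C++ with `std::exception_ptr`. This is only an illustration and not ACE's actual code: `doWork`, `runOnWorker`, and `runAndReport` are hypothetical names, and `std::thread` stands in for whatever threading ACE uses.

```cpp
#include <exception>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical worker body: stands in for any code that may throw on a
// worker thread (for example, a failed kernel launch).
void doWork(bool fail)
{
    if (fail) throw std::runtime_error("kernel launch failed");
}

// Runs doWork on a worker thread, captures any exception inside that
// thread, and re-throws it on the calling thread after join().
void runOnWorker(bool fail)
{
    std::exception_ptr error;
    std::thread worker([&error, fail]() {
        try {
            doWork(fail);
        } catch (...) {
            // Store the exception instead of letting it escape the thread,
            // which would call std::terminate.
            error = std::current_exception();
        }
    });
    worker.join();
    if (error) std::rethrow_exception(error);
}

// Returns the error message seen by the calling thread, or an empty
// string if the worker finished without throwing.
std::string runAndReport(bool fail)
{
    try {
        runOnWorker(fail);
    } catch (const std::exception& e) {
        return e.what();
    }
    return "";
}
```

Calling `runAndReport(true)` from `main` would hand back the worker's error message instead of aborting the process. Any throwing code that runs in the thread but outside the `try` would bypass this mechanism entirely.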
After closely inspecting the code you added for CUDA in ACE, I noticed an error that could cause this. If you look at ace_analytic_cudarunthread.cpp:139, you are calling the set current method, which can throw exceptions, OUTSIDE of the try statement. This is a bug regardless and could cause the error you are getting.
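To make the point concrete, here is a minimal sketch of the fix, assuming a thread body shaped roughly like the one being discussed. `setCurrent` and `runThreadBody` are hypothetical stand-ins, not the real ACE functions. The only thing it shows is that the throwing setup call sits inside the `try`.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a device "set current" call that may throw.
void setCurrent(bool deviceOk)
{
    if (!deviceOk) throw std::runtime_error("failed to set current device");
}

// Sketch of a thread body. Returns the error message if anything in the
// body throws, or an empty string on success.
std::string runThreadBody(bool deviceOk)
{
    try {
        // The setup call is inside the try, so a failure here is caught
        // like any other error raised by the body.
        setCurrent(deviceOk);
        // ... the rest of the thread's work would go here ...
    } catch (const std::exception& e) {
        return e.what();
    }
    return {};
}
```

If `setCurrent` were called before the `try`, its exception would leave the thread function uncaught, and the process would most likely terminate without a readable error message.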
@4ctrl-alt-del Good catch, I have wrapped that call in a separate try/catch statement and I will make a PR for it later. However, I am still getting the same error as before. See this line in particular:
This is how I know the exception is coming from
So any exception from a CUDA kernel should be caught, right? Would the
moveToThread() simply changes which thread owns a Qt object; it does nothing with regard to thread creation or destruction. The exception should be caught in any thread that is created by ACE. My guess is that somehow a thread is being created outside of ACE that is not protected by a try/catch.
Note to self -- this error was caused by setting the CUDA block size too high (1024), and I was able to reproduce a similar error with the ACE example. So ACE/KINC should definitely exit more gracefully; I'm just still trying to figure out if the issue lies with ACE or with KINC.
A lot of times when KINC crashes I get a long stack trace like this one:
I can usually pick out where it's coming from (in this case I think a CUDA kernel failed to launch), but I really need to see the error message. I'm going to try to fix this by inserting try/catch statements into KINC, but we may need to make some changes in ACE too. We'll see.