SIGSEGV in JVM Runtime #30
Comments
Can you please try with |
After add
|
We tried a smaller dataset (11 GB, about 4x smaller): the result is correct and there is no SIGSEGV. However, every time we run with the larger dataset (44 GB), the problem reproduces reliably. This suggests that a buffer overflows somewhere. |
Can you please send |
Does it work with the same parameters and not using ucx? Does it work with |
Yes. I just verified that it works fine without SparkUCX.
No. Adding "-XX:+UseParallelGC" to both the driver and the executor does not help; the problem still exists. |
Does it happen at the beginning of the job, at map phase or reduce phase or at the end? Do you see something in |
Some tasks do complete: for example, in the map stage 447 of 448 tasks succeeded and the last one failed. The crash can occur in either phase, sometimes in the map phase and sometimes in the reduce phase.
Rarely, we also observe the following on the driver side, triggered by SparkUCX:
The problem still exists.
Update
We found an interesting phenomenon: when the number of reducers is increased from 224 to 448, the word count produces the correct result. Moreover, the 224-reducer configuration always ends with SIGSEGV, while the 448-reducer configuration always produces the correct result. Going from 224 to 448 partitions reduces the size of the message sent from each mapper to each reducer. We suspect that a fixed-size buffer overflows somewhere. We hope this information is useful. |
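To make the message-size argument concrete, here is a rough back-of-the-envelope sketch. It assumes the shuffled volume equals the 44 GB input size and that data is spread evenly across partitions; neither is stated in the report, and the real shuffle volume may well differ. The helper name `avg_block_bytes` is hypothetical. Only the ratio between the two configurations is independent of those assumptions.

```python
GIB = 1024 ** 3


def avg_block_bytes(shuffle_bytes: int, mappers: int, reducers: int) -> float:
    """Average size of one mapper-to-reducer block, assuming uniform partitioning."""
    return shuffle_bytes / (mappers * reducers)


# Assumption: the shuffled volume is taken to be the 44 GB input size.
# The actual shuffle volume is not given in the report.
shuffle_bytes = 44 * GIB
mappers = 448  # number of map tasks reported in the issue

block_224 = avg_block_bytes(shuffle_bytes, mappers, 224)
block_448 = avg_block_bytes(shuffle_bytes, mappers, 448)

print(f"224 reducers: {block_224 / 1024:.0f} KiB per block")
print(f"448 reducers: {block_448 / 1024:.0f} KiB per block")
print(f"ratio: {block_224 / block_448:.1f}")
```

Under these assumptions, doubling the reducer count halves the average block each mapper sends to each reducer, which is consistent with a size threshold lying somewhere between the two values.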
Configuration
./contrib/configure-release --with-java
Spark launch commandline
Scala application:
Phenomena
In the first stage, 447 of the 448 tasks finish. After that, the Java runtime is terminated by SIGSEGV as follows:
With the hs_err_pid3253764.log: