What's the best way to debug segfaults in Lambda? #231
Can you run your service in a standard environment for a while and collect troubleshooting information that way? Or does this issue only show up on Lambda?
The frequency of occurrence is about 0.1%, so we've only ever seen it in production, and we need Lambda there for security reasons. I'll try to provoke it without Lambda, but that will take a while, if it happens at all.
It seems like there are techniques to get a corefile or backtrace from Lambda. Have you tried, for instance, this: https://stackoverflow.com/questions/53644056/aws-lambda-r-runtime-segmentation-fault
Unfortunately, the technique described there only works if you provide your own node.js in a custom runtime. It may come down to that if I can't repro any other way.
Just to follow up, I was eventually pointed to https://github.com/ddopson/node-segfault-handler which gives us backtraces in stderr, and I'm hooking that up to our code. I hope I'll have a usable backtrace the next time we see a segfault. Separately, I made some changes that should have had no effect on the segfaults, but our incidence rate fell from tens per day to a couple per week. I don't understand why that happened, and probably never will.
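The hookup described above looks roughly like this (a sketch only; `segfault-handler` is the npm package linked above, and the require is guarded so the snippet degrades gracefully when the native module isn't installed):

```javascript
// Sketch: register node-segfault-handler before creating any isolated-vm
// isolate, so a SIGSEGV dumps a native backtrace to stderr (and a log file).
// The require is wrapped because the native addon may not be installed.
let SegfaultHandler = null;
try {
  SegfaultHandler = require('segfault-handler');
} catch (err) {
  console.error('segfault-handler not installed; running without it');
}

if (SegfaultHandler) {
  // registerHandler takes an optional log-file path for the backtrace.
  SegfaultHandler.registerHandler('crash.log');
}
```

The key point from the thread is ordering: register the handler before any isolate is created, so it is in place for crashes that originate inside isolated-vm.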
I don't recommend using node-segfault-handler. It actually causes segfaults under isolated-vm, and I believe under worker_threads as well. See: ddopson/node-segfault-handler#49. You can just set …
In our use case we have exactly one isolate, and it appears to work in testing. Is that a reasonable scenario?
Well, you have two isolates: the nodejs one and the isolated-vm one. If a segfault occurs within isolated-vm, then segfault-handler will segfault itself and the debug information will be lost. A corefile is far more reliable and includes a complete snapshot of the program's state, which helps tremendously with troubleshooting.
For what it's worth, I tried out segfault-handler anyway, specifically registering the handler before starting the isolated-vm isolate (and I also set up a test by passing a reference to the …). On the other hand, my production segfault looks like this (it's a null pointer dereference): …
which looks like something related to GC?
Hmm, that's not a lot to go on. I'd recommend enabling …
@philipl-airtable are you still running isolated-vm inside Lambda? All good with segfaults? |
Yes. I don't have a great story for what happened. We had been running with Node 12 and after upgrading to Node 16, the segfaults stopped. So it may have been a Node GC bug that got fixed. |
@philipl did you end up finding a fix? My current team has also been looking at using isolated-vm |
No. As I said, once we updated to Node 16, it never happened again. As for what exactly was going on, I narrowed it down to a particular import statement where we imported one of our own files, which I think supports the theory that it was GC-related. I ended up adding some fairly elaborate logic (including forking a child node process) so that we could detect when the segfault happened, and as long as it occurred before any non-idempotent code execution, we would just retry immediately. And although the segfault was non-deterministic, when it did happen it always happened in the same place. So we just moved on at that point. (I don't work there anymore)
Hi.
We've been trying to use isolated-vm inside a Lambda node environment, and we've been getting periodic segfaults and even occasional SIGABRTs, without any clear rhyme or reason. An unchanged script can segfault one day and run dozens of times around it with no trouble. Traditionally, one would look at the backtrace from where node crashed to try to narrow down the problem, but Lambda doesn't provide that backtrace as far as I can tell (it's not in the Lambda CloudWatch logs). I could well imagine they suppress it for their own security reasons, to prevent people from using that information to build Lambda breakouts.
So, are there any things we can do to try and debug these occurrences? They are too infrequent and unpredictable for us to have been able to build a repeatable test case.
Thanks!