
What's the best way to debug segfaults in Lambda? #231

philipl-airtable opened this issue Feb 12, 2021 · 14 comments

@philipl-airtable

Hi.

We've been trying to use isolated-vm inside a Lambda node environment, and we've been getting periodic segfaults and even occasional SIGABRTs, without any clear rhyme or reason. An unchanged script can segfault one day despite running dozens of times before and after with no trouble. Traditionally, one would look at the backtrace from where node crashed to try to narrow down the problem, but Lambda doesn't provide that backtrace as far as I can tell (it's not in the Lambda CloudWatch logs) - and I could well imagine they suppress it for their own security reasons, to prevent people from using that info to build Lambda break-outs.

So, are there any things we can do to try and debug these occurrences? They are too infrequent and unpredictable for us to have been able to build a repeatable test case.

Thanks!

@laverdet
Owner

Can you run your service in a standard environment for a while and collect troubleshooting information that way? Or does this issue only show up on Lambda?

@philipl-airtable
Author

The frequency of occurrence is about 0.1%, so we've never seen it except in production, and we need Lambda for security there. I will try to provoke it without Lambda, but it will take a while, if it happens at all.

@laverdet
Owner

It seems like there are techniques to get a corefile or backtrace from Lambda. Have you tried, for instance, this: https://stackoverflow.com/questions/53644056/aws-lambda-r-runtime-segmentation-fault

@philipl-airtable
Author

Unfortunately, the technique described there only works if you provide your own node.js in a custom runtime. It may come down to that if I can't repro any other way.

@philipl-airtable
Author

Just to follow up, I was eventually pointed to https://github.com/ddopson/node-segfault-handler, which gives us backtraces on stderr, and I'm hooking that up to our code. I hope I'll have a usable backtrace the next time we see a segfault.
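
For reference, the wiring is just a one-time registration at process startup. A minimal sketch, following the node-segfault-handler README (the log file name is illustrative; per the README the backtrace is also written to stderr, which is what ends up in the CloudWatch logs):

// Register as early as possible, before isolated-vm is loaded or any isolate is created.
const SegfaultHandler = require('segfault-handler');

// On SIGSEGV this writes the native backtrace to the named file and to stderr.
SegfaultHandler.registerHandler('crash.log');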

Separately, I made some changes that should have had no effect on the segfaults, but our incidence rate fell from tens per day to a couple per week. I don't understand why that happened, and probably never will.

@laverdet
Owner

I don't recommend using node-segfault-handler. It actually causes segfaults under isolated-vm and I believe worker_threads as well. See: ddopson/node-segfault-handler#49

You can just set ulimit -c unlimited and pull the stack trace from the corefile using gdb.
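
In a plain environment, or a custom runtime where you control how node is launched, that could look something like the sketch below; worker.js is a stand-in for the real entry point, and after a crash the corefile can be opened with gdb $(which node) core.<pid> followed by bt:

// Rough sketch of a launcher that raises the core file size limit before
// starting the real worker, so a crash in either isolate leaves a corefile.
const { spawnSync } = require('child_process');

const result = spawnSync(
  'sh',
  ['-c', 'ulimit -c unlimited && exec node worker.js'],
  { stdio: 'inherit' }
);

// The corefile location depends on /proc/sys/kernel/core_pattern
// (often "core" or "core.<pid>" in the working directory).
process.exit(result.status === null ? 1 : result.status);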

@philipl-airtable
Author

In our use case we have exactly one isolate, and segfault-handler appears to work in testing. Is that a reasonable scenario?

@laverdet
Owner

Well, you have two isolates: the nodejs one and the isolated-vm one. If a segfault occurs within isolated-vm, then segfault-handler will segfault itself and the debug information will be lost. A corefile is way more reliable and includes a complete snapshot of the program's state, which helps tremendously with troubleshooting.

@philipl-airtable
Author

philipl-airtable commented Mar 19, 2021

For what it's worth, I tried out segfault-handler anyway, registering the handler before starting the isolated-vm isolate. (I also set up a test by passing a reference to the causeSegfault() function into the isolated-vm isolate; in that test, I got a meaningful backtrace.)
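
A sketch of what that test setup looks like, assuming the isolated-vm v4 API from its README (the memory limit and names here are arbitrary):

const ivm = require('isolated-vm');
const SegfaultHandler = require('segfault-handler');

// Register the handler in the node isolate before any isolated-vm isolate exists.
SegfaultHandler.registerHandler('crash.log');

const isolate = new ivm.Isolate({ memoryLimit: 32 });
const context = isolate.createContextSync();

// Hand the host's causeSegfault() into the guest as a Reference.
context.global.setSync('crash', new ivm.Reference(SegfaultHandler.causeSegfault));

// Guest code calls back into the node isolate, which is where the deliberate
// segfault (and therefore the handler's backtrace) ends up.
context.evalSync('crash.applySync(undefined, [])');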

On the other hand, my production segfault looks like this (it's a null pointer dereference):

/opt/nodejs/node_modules/segfault-handler/build/Release/segfault-handler.node(+0x2d66)[0x7fc4e8160d66]
/lib64/libpthread.so.0(+0x117e0)[0x7fc4eaf247e0]
/var/lang/bin/node(_ZN2v88internal4Heap31MonotonicallyIncreasingTimeInMsEv+0x1c)[0x559e860bc5fc]
/var/lang/bin/node(_ZN2v88internal8GCTracer15BackgroundScopeD1Ev+0x2c)[0x559e860a985c]
/var/lang/bin/node(+0xaee118)[0x559e86083118]
/var/lang/bin/node(_ZThn32_N2v88internal14CancelableTask3RunEv+0x2b9)[0x559e85fe5919]
/var/lang/bin/node(+0x80148c)[0x559e85d9648c]

which looks like something related to GC?

@laverdet
Owner

Hmm that's not a lot to go on. I'd recommend enabling ulimit -c unlimited and taking a peek at the corefile.

@aalimovs

@philipl-airtable are you still running isolated-vm inside Lambda? All good with segfaults?

@philipl

philipl commented Aug 20, 2023

Yes. I don't have a great story for what happened. We had been running with Node 12, and after upgrading to Node 16 the segfaults stopped. So it may have been a Node GC bug that got fixed.

@DecathectZero

@philipl did you end up finding a fix? My current team has also been looking at using isolated-vm. I've built it inside the official ECR Lambda base image, but it's still encountering problems.

@philipl

philipl commented Aug 1, 2024

No. As I said, once we updated to Node 16, it never happened again. In terms of what exactly was going on, I also narrowed it down to a particular import statement where we imported one of our own files, which I think could support the theory that it was GC-related.

I ended up adding some fairly elaborate logic (including forking a child node process) so that we could detect when the segfault happened; as long as it happened before any non-idempotent code execution, we would just retry immediately. And although the segfault was non-deterministic, when it did happen it was always in the same place. So we just moved on at that point.
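
A stripped-down sketch of that kind of wrapper (worker.js and the 'side-effects-started' message are made-up names; the real logic was considerably more involved):

const { fork } = require('child_process');

function runWithRetry(attemptsLeft) {
  const child = fork('./worker.js');
  let sideEffectsStarted = false;

  // The worker reports just before it runs anything non-idempotent.
  child.on('message', (msg) => {
    if (msg === 'side-effects-started') sideEffectsStarted = true;
  });

  child.on('exit', (code, signal) => {
    if (signal === 'SIGSEGV' && !sideEffectsStarted && attemptsLeft > 0) {
      console.warn('worker segfaulted before any side effects; retrying');
      runWithRetry(attemptsLeft - 1);
    }
  });
}

runWithRetry(3);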

(I don't work there anymore)
