
What's the best way to debug segfaults in Lambda? #231

philipl-airtable opened this issue Feb 12, 2021 · 14 comments

@philipl-airtable

Hi.

We've been trying to use isolated-vm inside a Lambda node environment, and we've been getting periodic segfaults and even occasional SIGABRTs, without any clear rhyme or reason. An unchanged script can segfault one day despite running dozens of times before and after with no trouble. Traditionally, one would look at the backtrace from where node crashed to try to narrow down the problem, but Lambda doesn't provide that backtrace as far as I can tell (it's not in the Lambda CloudWatch logs) - and I could well imagine they suppress it for their own security reasons, to prevent people from using that info to build Lambda break-outs.

So, are there any things we can do to try and debug these occurrences? They are too infrequent and unpredictable for us to have been able to build a repeatable test case.

Thanks!

@laverdet
Owner

Can you run your service in a standard environment for a while and collect troubleshooting information that way? Or does this issue only show up on Lambda?

@philipl-airtable
Author

The frequency of occurrence is about 0.1%, so we've never seen it except in production, and we need Lambda for security there. I will try to provoke it without Lambda, but it will take a while, if it happens at all.

@laverdet
Owner

It seems like there are techniques to get a corefile or backtrace from Lambda. Have you tried, for instance, this: https://stackoverflow.com/questions/53644056/aws-lambda-r-runtime-segmentation-fault

@philipl-airtable
Author

Unfortunately, the technique described there only works if you provide your own node.js in a custom runtime. It may come down to that if I can't repro any other way.

@philipl-airtable
Author

Just to follow up, I was eventually pointed to https://github.com/ddopson/node-segfault-handler, which gives us backtraces on stderr, and I'm hooking that up to our code. I hope I'll have a usable backtrace the next time we see a segfault.
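
For reference, the wiring is just a one-time registration at process startup. A minimal sketch, following the node-segfault-handler README (the log file name is illustrative; per the README the backtrace is also written to stderr, which is what ends up in the CloudWatch logs):

// Register as early as possible, before isolated-vm is loaded or any isolate is created.
const SegfaultHandler = require('segfault-handler');

// On SIGSEGV this writes the native backtrace to the named file and to stderr.
SegfaultHandler.registerHandler('crash.log');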

Separately, I made some changes that should have had no effect on the segfaults, but our incidence rate fell from tens per day to a couple per week. I don't understand why that happened, and probably never will.

@laverdet
Owner

I don't recommend using node-segfault-handler. It actually causes segfaults under isolated-vm and I believe worker_threads as well. See: ddopson/node-segfault-handler#49

You can just set ulimit -c unlimited and pull the stack trace from the corefile using gdb.
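
In a plain environment, or a custom runtime where you control how node is launched, that could look something like the sketch below; worker.js is a stand-in for the real entry point, and after a crash the corefile can be opened with gdb $(which node) core.<pid> followed by bt:

// Rough sketch of a launcher that raises the core file size limit before
// starting the real worker, so a crash in either isolate leaves a corefile.
const { spawnSync } = require('child_process');

const result = spawnSync(
  'sh',
  ['-c', 'ulimit -c unlimited && exec node worker.js'],
  { stdio: 'inherit' }
);

// The corefile location depends on /proc/sys/kernel/core_pattern
// (often "core" or "core.<pid>" in the working directory).
process.exit(result.status === null ? 1 : result.status);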

@philipl-airtable
Author

In our use case we have exactly one isolate, and segfault-handler appears to work in testing. Is that a reasonable scenario?

@laverdet
Owner

Well, you have two isolates: the nodejs one and the isolated-vm one. If a segfault occurs within isolated-vm, then segfault-handler will segfault itself and the debug information will be lost. A corefile is way more reliable and includes a complete snapshot of the program's state, which helps tremendously with troubleshooting.

@philipl-airtable
Author

philipl-airtable commented Mar 19, 2021

For what it's worth, I tried out segfault-handler anyway, registering the handler before starting the isolated-vm isolate. (I also set up a test by passing a reference to the causeSegfault() function into the isolated-vm isolate; in that test, I got a meaningful backtrace.)
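
A sketch of what that test setup looks like, assuming the isolated-vm v4 API from its README (the memory limit and names here are arbitrary):

const ivm = require('isolated-vm');
const SegfaultHandler = require('segfault-handler');

// Register the handler in the node isolate before any isolated-vm isolate exists.
SegfaultHandler.registerHandler('crash.log');

const isolate = new ivm.Isolate({ memoryLimit: 32 });
const context = isolate.createContextSync();

// Hand the host's causeSegfault() into the guest as a Reference.
context.global.setSync('crash', new ivm.Reference(SegfaultHandler.causeSegfault));

// Guest code calls back into the node isolate, which is where the deliberate
// segfault (and therefore the handler's backtrace) ends up.
context.evalSync('crash.applySync(undefined, [])');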

On the other hand, my production segfault looks like this (it's a null pointer dereference):

/opt/nodejs/node_modules/segfault-handler/build/Release/segfault-handler.node(+0x2d66)[0x7fc4e8160d66]
/lib64/libpthread.so.0(+0x117e0)[0x7fc4eaf247e0]
/var/lang/bin/node(_ZN2v88internal4Heap31MonotonicallyIncreasingTimeInMsEv+0x1c)[0x559e860bc5fc]
/var/lang/bin/node(_ZN2v88internal8GCTracer15BackgroundScopeD1Ev+0x2c)[0x559e860a985c]
/var/lang/bin/node(+0xaee118)[0x559e86083118]
/var/lang/bin/node(_ZThn32_N2v88internal14CancelableTask3RunEv+0x2b9)[0x559e85fe5919]
/var/lang/bin/node(+0x80148c)[0x559e85d9648c]

which looks like something related to GC?

@laverdet
Owner

Hmm that's not a lot to go on. I'd recommend enabling ulimit -c unlimited and taking a peek at the corefile.

@aalimovs

@philipl-airtable are you still running isolated-vm inside Lambda? All good with segfaults?

@philipl

philipl commented Aug 20, 2023

Yes. I don't have a great story for what happened. We had been running with Node 12, and after upgrading to Node 16 the segfaults stopped. So it may have been a Node GC bug that got fixed.

@DecathectZero

@philipl did you end up finding a fix? My current team has also been looking at using isolated-vm. I've built it inside the official ECR Lambda base image, but it's still encountering problems.

@philipl

philipl commented Aug 1, 2024

No. As I said, once we updated to Node 16, it never happened again. In terms of what exactly was going on, I also narrowed it down to a particular import statement where we imported one of our own files, which I think could support the theory that it was GC-related.

I ended up adding some fairly elaborate logic (including forking a child node process) so that we could detect when the segfault happened; as long as it happened before any non-idempotent code execution, we would just retry immediately. And although the segfault was non-deterministic, when it did happen it was always in the same place. So we just moved on at that point.
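
A stripped-down sketch of that kind of wrapper (worker.js and the 'side-effects-started' message are made-up names; the real logic was considerably more involved):

const { fork } = require('child_process');

function runWithRetry(attemptsLeft) {
  const child = fork('./worker.js');
  let sideEffectsStarted = false;

  // The worker reports just before it runs anything non-idempotent.
  child.on('message', (msg) => {
    if (msg === 'side-effects-started') sideEffectsStarted = true;
  });

  child.on('exit', (code, signal) => {
    if (signal === 'SIGSEGV' && !sideEffectsStarted && attemptsLeft > 0) {
      console.warn('worker segfaulted before any side effects; retrying');
      runWithRetry(attemptsLeft - 1);
    }
  });
}

runWithRetry(3);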

(I don't work there anymore)
