
Memory Leak with KafkaJS Causing Application Crash #2332

Closed

ffc-gwakefield opened this issue Jul 8, 2024 · 9 comments

@ffc-gwakefield

Description

The New Relic Node.js agent fills up the heap and crashes the application when paired with KafkaJS in an Express service. This is due to faulty code in KafkaJS's checkPendingRequests loop, which causes New Relic to shim setTimeout inside a transaction and create a TraceSegment object for each call. Because of the runaway loop in KafkaJS, thousands of these objects are created and quickly exhaust the Node heap on the server. The transaction doesn't start recording segments until the Express HTTP endpoint is hit.
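To illustrate the failure shape, here is a minimal sketch (not the agent's or kafkajs's actual code): a check that re-arms setTimeout immediately and never drains. If an APM agent wraps setTimeout so that every scheduled callback becomes a trace segment attached to the open transaction, a loop like this allocates segments far faster than they can be released.

```js
// Minimal sketch of the runaway pattern described above (illustrative only;
// not the real checkPendingRequests implementation or the agent's shim).
function checkPendingRequestsLoop(pendingRequests) {
  if (pendingRequests.length === 0) return; // normal exit, never reached in the bug

  // The downstream never responds, so the list never drains and the check
  // re-schedules itself forever with no delay or backoff. Under the agent,
  // each of these setTimeout calls produces another trace segment.
  setTimeout(() => checkPendingRequestsLoop(pendingRequests), 0);
}

// One unacknowledged request is enough to keep the loop alive indefinitely.
checkPendingRequestsLoop([{ correlationId: 1, sentAt: Date.now() }]);
```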

Expected Behavior

New Relic should be able to handle runaway loops like the one from KafkaJS without maxing out the heap, or offer a feature/config option to limit this behavior.

Troubleshooting or NR Diag results

Running a Node heap profile or CPU profile for a couple of seconds shows that New Relic is consuming nearly all of the CPU and memory in the trace segment calls. I captured several profiles and got the same results. Interestingly, with the latest package there was a large increase in CPU time spent garbage collecting, but it did not help; the heap still filled up and the app eventually crashed.

[screenshots: CPU and heap profiles showing trace segment calls]

Usually, trying to capture an entire heap snapshot crashes the app, since it's already using most of the machine's memory. But I managed to get one before the memory completely bloated, with enough swap space:

[screenshot: heap snapshot]

Steps to Reproduce

  1. Requires an Express service with KafkaJS (from tulios).
  2. Hit the Express endpoint, which sends a Kafka message to a downstream service that never responds.
  3. Observe the memory leak (a minimal sketch of this setup follows below).
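A reproduction skeleton along these lines might look like the sketch below. The broker address, port, topic, and endpoint names are placeholders, not the reporter's actual setup, and require('newrelic') has to be the first require so the agent can instrument the other modules.

```js
// Hypothetical reproduction sketch: Express + kafkajs producer, with the
// New Relic agent loaded first. Broker, topic, and port are placeholders.
require('newrelic');
const express = require('express');
const { Kafka } = require('kafkajs');

const kafka = new Kafka({ clientId: 'repro-app', brokers: ['localhost:9092'] });
const producer = kafka.producer();
const app = express();

app.get('/send', async (req, res) => {
  // Producing to a topic whose downstream consumer never responds is the
  // condition described above that keeps kafkajs checking pending requests.
  await producer.connect();
  await producer.send({ topic: 'unresponsive-topic', messages: [{ value: 'ping' }] });
  res.sendStatus(202);
});

app.listen(3000, () => console.log('repro listening on port 3000'));
```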

Your Environment

RHEL 9 Linux L2
Node v18.20.2
@nestjs/axios ^3.0.0
@nestjs/core ^10.3.7
kafkajs ^2.2.0

Additional context

I realize the main problem is kafkajs, which now appears to be an unmaintained library, but it is causing New Relic to spin out of control. Is there any way to mitigate this for badly behaving software like kafkajs?


@bizob2828
Member

bizob2828 commented Jul 8, 2024

@ffc-gwakefield Can you please provide a reproduction application exhibiting this behavior? We might be able to fix it, but based on what you're describing it sounds like a kafkajs issue that we may be compounding. Also, can you link the kafkajs issue?

@bizob2828 bizob2828 self-assigned this Jul 8, 2024
@ffc-gwakefield
Author

ffc-gwakefield commented Jul 8, 2024

Here is the link to the kafkajs issue: tulios/kafkajs#1704

I'll get back with the team and see if we can create a reproduction application for you to use.

In the meantime, the kafkajs issue has a link to a fork with a failing test, along with the fix for kafkajs, if that's useful.

@bizob2828
Member

Thanks @ffc-gwakefield. In the meantime I recommend you just disable the kafkajs feature flag within the agent: feature_flag.kafka_instrumentation = false
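For anyone else hitting this before a fix lands, here is a sketch of how that flag might look in a newrelic.js agent config file; only the feature_flag entry comes from the comment above, and the app name is a placeholder.

```js
'use strict';

exports.config = {
  app_name: ['my-express-service'], // placeholder app name
  feature_flag: {
    // Disables the agent's kafkajs instrumentation as a workaround
    kafka_instrumentation: false
  }
};
```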

@bizob2828
Member

@ffc-gwakefield a reproduction application would really help here. I can't seem to reproduce it.

@ffc-gwakefield
Author

Thank you, @bizob2828, for your prompt replies and for looking into this. It's unfortunate that you couldn't replicate it, because our teams are quite busy since our projects are behind due to this issue. But hopefully we can build something when things settle down here.

@bizob2828
Member

bizob2828 commented Jul 9, 2024

I'm sorry you're all busy but I can't do much without a repro case. Seems like the better angle may be to get the kafkajs folks to fix the issue. We're just compounding the problem as we instrument setTimeout.

@ffc-gwakefield
Author

ffc-gwakefield commented Jul 9, 2024

I'm not on the dev team, but since everyone is busy I tried to replicate the issue myself, and so far I could not. I see the same behavior of repeated calls to Kafka, but I'm not seeing the heap bloat.
[screenshot: console output with call counts]
I wrote this output to capture how many times New Relic's wrapper is called and how many times the Kafka pending-request check is made. The bottom is a producer sending a message to Kafka when I hit an endpoint (this triggers the loop).
My local environment is different, though, so when I get time I'm going to gradually introduce similarities between my workspace and the developers' setup.
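The counting code isn't shown in the thread, but a rough sketch of one way to collect numbers like the ones in the screenshot (names and interval are illustrative, not the reporter's actual instrumentation) could be:

```js
// Hypothetical counter: wrap global.setTimeout after the agent has shimmed it
// and periodically log how often timers are scheduled. This approximates how
// often the agent's wrapper runs during the kafkajs pending-request loop.
const originalSetTimeout = global.setTimeout;
let timerCount = 0;

global.setTimeout = function countingSetTimeout(...args) {
  timerCount += 1;
  return originalSetTimeout.apply(this, args);
};

setInterval(() => {
  console.log(`setTimeout scheduled ${timerCount} times in the last 5s`);
  timerCount = 0;
}, 5000).unref(); // unref so this reporting interval doesn't keep the process alive
```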

@bizob2828 bizob2828 removed their assignment Jul 9, 2024
@bizob2828 bizob2828 moved this to Triage Needed: Unprioritized Features in Node.js Engineering Board Jul 15, 2024
@bizob2828
Member

Closing due to lack of a repro case. If you can provide a repro, please feel free to reopen.

@github-project-automation github-project-automation bot moved this from Triage Needed: Unprioritized Features to Done: Issues recently completed in Node.js Engineering Board Aug 1, 2024