(Microservices) gRPC transport potential memory leak #14296

Open · 3 of 15 tasks
malsatin opened this issue Dec 9, 2024 · 7 comments
Labels: needs triage (This issue has not been looked into)

Comments

malsatin commented Dec 9, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Current behavior

We are using Amazon EKS to run our services. In October we updated it from version 1.30 to 1.31 and then noticed that our NestJS applications with gRPC health checks started leaking memory and being OOM-killed.
[screenshot: memory consumption graph]

The application code wasn't changed and no rebuilds were made, so it is clearly some change in the Kubernetes gRPC probing that nestjs / grpc-node does not handle well.

Minimum reproduction code

https://stackblitz.com/edit/nestjs-typescript-starter-fuk8m3pw?file=src%2Fmain.ts

Steps to reproduce

  1. Set up an EKS 1.31 cluster
  2. Create a NestJS application with a gRPC liveness probe at a 10-second interval (see the sketch below)
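
For reference, a minimal sketch of the bootstrap from step 2, assuming a hybrid NestJS app that exposes the standard grpc.health.v1 Health service for the Kubernetes native gRPC liveness probe (package name, proto path, and ports are placeholders; the actual reproduction is in the StackBlitz link below):

// main.ts - illustrative sketch, not the exact reproduction code
import { NestFactory } from '@nestjs/core';
import { MicroserviceOptions, Transport } from '@nestjs/microservices';
import { join } from 'path';
import { AppModule } from './app.module';

async function bootstrap() {
  const app = await NestFactory.create(AppModule);

  // Expose the grpc.health.v1 Health service so the Kubernetes native
  // gRPC liveness probe (periodSeconds: 10) has something to call.
  app.connectMicroservice<MicroserviceOptions>({
    transport: Transport.GRPC,
    options: {
      package: 'grpc.health.v1',                   // assumed package name
      protoPath: join(__dirname, 'health.proto'),  // assumed proto location
      url: '0.0.0.0:5000',                         // assumed port
    },
  });

  await app.startAllMicroservices();
  await app.listen(3000);
}

bootstrap();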

Expected behavior

Memory consumption remains flat, as it was before.

Package

  • I don't know. Or some 3rd-party package
  • @nestjs/common
  • @nestjs/core
  • @nestjs/microservices
  • @nestjs/platform-express
  • @nestjs/platform-fastify
  • @nestjs/platform-socket.io
  • @nestjs/platform-ws
  • @nestjs/testing
  • @nestjs/websockets
  • Other (see below)

Other package

@grpc/grpc-js

NestJS version

10.4.13

Packages versions

"dependencies": {
    "@grpc/grpc-js": "^1.12.4",
    "@grpc/proto-loader": "^0.7.13",
    "@isaacs/ttlcache": "^1.4.1",
    "@nestjs/axios": "^3.1.3",
    "@nestjs/common": "^10.4.13",
    "@nestjs/config": "^3.3.0",
    "@nestjs/core": "^10.4.13",
    "@nestjs/microservices": "^10.4.13",
    "@nestjs/schedule": "^4.1.1",
    "@nestjs/terminus": "^10.2.3",
    "@songkeys/nestjs-redis": "10.0.0",
    "class-transformer": "^0.5.1",
    "class-validator": "^0.14.1",
    "ioredis": "^5.4.1",
    "launchdarkly-node-server-sdk": "^7.0.4",
    "lodash": "^4.17.21",
    "nestjs-pino": "^4.1.0",
    "reflect-metadata": "^0.2.2",
    "rxjs": "^7.8.1"
  },

Node.js version

22.12.0-alpine3.20

In which operating systems have you tested?

  • macOS
  • Windows
  • Linux

Other

The only related issue I managed to find: grpc/grpc-node#2629.

We use terminus/redis probes inside the health check endpoint. I tried making the endpoint always return ServingStatus.SERVING and the leak was still present, so the issue is not in the terminus module (sketch of that test below).
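
A rough sketch of that "always SERVING" test (the decorator arguments assume the standard grpc.health.v1 Health service definition; normally this handler runs the terminus/redis checks):

// health.controller.ts - illustrative sketch of the hardcoded handler
import { Controller } from '@nestjs/common';
import { GrpcMethod } from '@nestjs/microservices';

enum ServingStatus {
  UNKNOWN = 0,
  SERVING = 1,
  NOT_SERVING = 2,
}

@Controller()
export class HealthController {
  @GrpcMethod('Health', 'Check')
  check(): { status: ServingStatus } {
    // Always report SERVING: the leak was still present with this handler,
    // which rules out the terminus/redis checks as the cause.
    return { status: ServingStatus.SERVING };
  }
}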

I tried changing the runtime from Node.js to Bun (oven/bun:1.1.38-alpine): the leak was gone without any changes to the application code.
I tried using the latest Node.js version (node:23.3.0-alpine3.20): the leak was still present.

Then I tried adding the following channelOptions to the microservice options parameter:

{
    // If a client is idle for {max_connection_idle_ms}, send a GOAWAY
    'grpc.max_connection_idle_ms': 5_000,
    // If any connection is alive for more than {max_connection_age_ms}, send a GOAWAY
    'grpc.max_connection_age_ms': 10_000,
    // Allow {max_connection_age_grace_ms} for pending RPCs to complete before forcibly closing connections
    'grpc.max_connection_age_grace_ms': 1_000,
    // Ping the client every {keepalive_time_ms} to ensure the connection is still active
    'grpc.keepalive_time_ms': 2_500,
    // Wait {keepalive_timeout_ms} for the ping ack before assuming the connection is dead
    'grpc.keepalive_timeout_ms': 1_000,
    // Maximum number of concurrent incoming streams to allow on a http2 connection.
    'grpc.max_concurrent_streams': 10,
    // Http/2 server is allowed to consume {max_session_memory} MB of memory
    'grpc-node.max_session_memory': 2,
  }

And the leak was gone!
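
For clarity, this is roughly where those options are passed in our setup; as far as I understand, the channelOptions key on the gRPC transport options is handed to the underlying @grpc/grpc-js server (package, protoPath, and url below are placeholders):

// grpc-options.ts - illustrative sketch of the microservice options
import { GrpcOptions, Transport } from '@nestjs/microservices';
import { join } from 'path';

export const grpcHealthOptions: GrpcOptions = {
  transport: Transport.GRPC,
  options: {
    package: 'grpc.health.v1',
    protoPath: join(__dirname, 'health.proto'),
    url: '0.0.0.0:5000',
    channelOptions: {
      'grpc.max_connection_idle_ms': 5_000,
      'grpc.max_connection_age_ms': 10_000,
      'grpc.max_connection_age_grace_ms': 1_000,
      'grpc.keepalive_time_ms': 2_500,
      'grpc.keepalive_timeout_ms': 1_000,
      'grpc.max_concurrent_streams': 10,
      'grpc-node.max_session_memory': 2,
    },
  },
};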

After some experimentation with the config I discovered that:
1.

'grpc.max_connection_idle_ms': 5_000,

on its own reduces the leak, but does not eliminate it.
2.

'grpc.max_connection_age_ms': 10_000,
'grpc.max_connection_age_grace_ms': 1_000,

on their own also reduce the leak, but do not eliminate it.
3.

'grpc.max_connection_idle_ms': 5_000,
'grpc.max_connection_age_ms': 10_000,
'grpc.max_connection_age_grace_ms': 1_000,

all three together almost stop the leak.
4.

'grpc.keepalive_time_ms': 2_500,
'grpc.keepalive_timeout_ms': 1_000,

on their own almost stop the leak (see the sketch after this list).
5.

'grpc.max_concurrent_streams': 10,
'grpc-node.max_session_memory': 2,

do not affect the leak at all (it is still present).
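
So if someone only wants the smallest standalone mitigation, the keepalive pair from point 4 seems to be enough on its own (a sketch with the values tested above; my understanding is that the server pings each client and drops connections whose pings go unanswered):

// Minimal standalone mitigation from point 4
const keepaliveOnlyChannelOptions = {
  // Ping each client every 2.5 s to check the connection is still alive
  'grpc.keepalive_time_ms': 2_500,
  // Drop the connection if the ping ack does not arrive within 1 s
  'grpc.keepalive_timeout_ms': 1_000,
};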

I understand that the bug is most probably not in NestJS/Microservices or gRPC-Node, but I think this caveat is at least worth mentioning in the NestJS/Microservices documentation.

@malsatin added the needs triage label Dec 9, 2024
@kamilmysliwiec
Member

Could you try downgrading to v10.4.5 (@nestjs/microservices)? I wonder if there's any chance this leak might be related to our recent gRPC streaming bug fixes.

@malsatin
Author

malsatin commented Dec 9, 2024

@kamilmysliwiec that is most probably not the case, as when the leak occurred the application had the following dependencies:

  "dependencies": {
    "@grpc/grpc-js": "^1.10.3",
    "@grpc/proto-loader": "^0.7.10",
    "@isaacs/ttlcache": "^1.4.1",
    "@nestjs/axios": "^3.0.2",
    "@nestjs/common": "^10.3.4",
    "@nestjs/config": "^3.2.0",
    "@nestjs/core": "^10.3.4",
    "@nestjs/microservices": "^10.3.4",
    "@nestjs/schedule": "^4.0.1",
    "@nestjs/terminus": "^10.2.3",
    "@songkeys/nestjs-redis": "10.0.0",
    "class-transformer": "^0.5.1",
    "class-validator": "^0.14.1",
    "ioredis": "^5.3.2",
    "launchdarkly-node-server-sdk": "^7.0.4",
    "lodash": "^4.17.21",
    "nestjs-pino": "^4.0.0",
    "reflect-metadata": "^0.2.1",
    "rxjs": "^7.8.1"
  },

My immediate reaction was to update the dependencies, as they were pretty outdated.

If you wish, I can still try downgrading to @nestjs/microservices "10.4.5"; just let me know.

@kamilmysliwiec
Member

> My immediate reaction was to update the dependencies, as they were pretty outdated.

Because of the ^ in front of the package version, we can’t be certain about the exact version that was installed without checking the lockfile.

> If you wish, I can still try downgrading to @nestjs/microservices "10.4.5"; just let me know.

If you can do that, it would be greatly appreciated!

@malsatin
Author

malsatin commented Dec 9, 2024

The leak is still present on @nestjs/microservices "10.4.5":

  "dependencies": {
    "@grpc/grpc-js": "^1.12.4",
    "@grpc/proto-loader": "^0.7.13",
    "@isaacs/ttlcache": "^1.4.1",
    "@nestjs/axios": "^3.1.3",
    "@nestjs/common": "10.4.5",
    "@nestjs/config": "^3.3.0",
    "@nestjs/core": "10.4.5",
    "@nestjs/microservices": "10.4.5",
    "@nestjs/schedule": "^4.1.1",
    "@nestjs/terminus": "^10.2.3",
    "@songkeys/nestjs-redis": "10.0.0",
    "class-transformer": "^0.5.1",
    "class-validator": "^0.14.1",
    "ioredis": "^5.4.1",
    "launchdarkly-node-server-sdk": "^7.0.4",
    "lodash": "^4.17.21",
    "nestjs-pino": "^4.1.0",
    "reflect-metadata": "^0.2.2",
    "rxjs": "^7.8.1"
  },

[screenshot: memory consumption graph]

@kamilmysliwiec
Member

Thanks @malsatin. That confirms that this issue is very likely unrelated to NestJS itself.

Would you like to create an issue in the grpc repository?

@malsatin
Author

I would like to see some community confirmation of the issue, as I'm not entirely sure that the problem is not in our setup.

@kamilmysliwiec
Member

Have you tried asking on our Discord channel? (Join here and send a new post in the #🐈 nestjs-help forum. Make sure to include a link to this issue so you don't need to write it all again.)
