Memory Leak Causing Pod Shutdowns #115

Open · MichaelPluessErni opened this issue Mar 14, 2024 · 11 comments
Labels: bug (Something isn't working)

Comments

@MichaelPluessErni

Error description:

Since introducing session serialization, we have been experiencing frequent pod kills in our environment.
The Vaadin application runs in production, and its pods regularly hit their 3 GB memory limit and are shut down, sometimes within as little as 4 hours.

This has only happened since we started using session serialization with the Kubernetes Kit. Before that, memory usage would drop during the afternoon and evening, when we have less traffic.

It looks as though there is a memory leak somewhere, as memory is not being released properly.

We tried using a different garbage collector. This helped a bit, but did not solve the problem.

Expected behaviour:

Memory usage should not be permanently affected by session serialization.
While memory usage will clearly be higher during serialization itself, serialization should not cause lasting memory leaks.
Errors during serialization or deserialization should not cause memory leaks.

Details:

A comparison of our pods' memory usage before and after the introduction of session serialization (introduced on 3 March):

[Screenshot: pod memory usage before vs. after 3 March]

The logs show that the frequency with which pods are killed has increased drastically:

[Screenshot: pod kill frequency from the logs]

The memory leaks seem to happen in "jumps".

[Screenshot: memory usage increasing in jumps]

This pod will be killed after one more memory leak event. As expected, more memory leak events occur during times of higher usage (in our case, during working hours).

[Screenshot: pod approaching its memory limit]

The new garbage collector does not solve the problem:

[Screenshot: memory usage with the new garbage collector]

@heruan
Member

heruan commented Mar 19, 2024

Thanks for reporting! We are investigating this. Can you provide an estimate of the session size when serialization happens, or a project replicating the issue?

@heruan added the bug label Mar 19, 2024
@anderslauri

anderslauri commented Mar 19, 2024

> Thanks for reporting! We are investigating this. Can you provide an estimate of the session size when serialization happens, or a project replicating the issue?

Hi,

I work at the same client as @MichaelPluessErni - let me add some details here. The first image is a time series of jvm_classes_unloaded_classes_total; the difference between running with Kubernetes Kit and before is clearly visible. The second image below shows jvm_classes_loaded_classes. My assumption is that on serialization to Redis classes are discarded, and on deserialization from Redis new classes are loaded, which grows the old generation and makes the GC work harder. We have stabilized memory with G1 after some tuning; however, this data suggests something is not optimal. Perhaps a class pool could be used to dampen these numbers and the GC load.

[Chart: jvm_classes_unloaded_classes_total over time]
[Chart: jvm_classes_loaded_classes over time]
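
A minimal way to watch these counters directly on a pod, independent of the metrics pipeline, is the JVM's own ClassLoadingMXBean (the same counters that back the jvm_classes_* metrics). This is only a sketch; the one-minute interval and plain stdout logging are arbitrary choices:

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Periodically logs class loading/unloading counters to correlate with serialization activity. */
public final class ClassChurnLogger {

    private ClassChurnLogger() {
    }

    public static void start() {
        ClassLoadingMXBean classLoading = ManagementFactory.getClassLoadingMXBean();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(
                () -> System.out.printf("currentlyLoaded=%d unloadedTotal=%d loadedTotal=%d%n",
                        classLoading.getLoadedClassCount(),
                        classLoading.getUnloadedClassCount(),
                        classLoading.getTotalLoadedClassCount()),
                0, 1, TimeUnit.MINUTES);
    }
}
```

If currentlyLoaded and unloadedTotal climb in step with session serialization and deserialization, that would support the class-churn hypothesis above.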

@MichaelPluessErni
Author

@heruan Based on my rough estimate, a single VaadinSession seems to be about 7 KB in size.
Sadly it is difficult to provide a sample project with this error: our application is rather large and I do not know what is causing the issue, so I cannot reproduce it in a sample project. However, if this is a general issue, it should appear in any project that uses Redis + session replication.
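
For context, an estimate like this can be obtained by serializing the session graph into a byte buffer. A minimal sketch, assuming plain Java serialization (the Kubernetes Kit's own serializer may produce a somewhat different payload, so treat the number as an order-of-magnitude hint):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

/** Rough estimate of the serialized size of an object graph, e.g. a VaadinSession. */
public final class SerializedSizeEstimator {

    private SerializedSizeEstimator() {
    }

    public static int estimateBytes(Object sessionGraph) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            // Walks the whole reachable graph, just like HTTP session replication would.
            out.writeObject(sessionGraph);
        } catch (IOException e) {
            throw new IllegalStateException("Object graph is not fully serializable", e);
        }
        return buffer.size();
    }
}
```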

@mcollovati
Contributor

@MichaelPluessErni @anderslauri would you be able to take a couple of heap dumps and compare them, to check which objects are actually making the memory usage grow? This would help a lot in the investigation.
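
For reference, a dump can be taken on the pod with jcmd <pid> GC.heap_dump <file>.hprof, or programmatically via the HotSpot diagnostic MBean. A sketch, assuming a HotSpot JVM; the output path is up to you:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;

/** Writes an .hprof heap dump; live = true restricts the dump to reachable objects. */
public final class HeapDumper {

    private HeapDumper() {
    }

    public static void dump(String outputFile, boolean live) throws IOException {
        HotSpotDiagnosticMXBean diagnostics =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        diagnostics.dumpHeap(outputFile, live);
    }
}
```

Comparing a dump from a freshly started pod with one taken shortly before the limit is reached should make the growing object graphs stand out.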

@anderslauri

> @MichaelPluessErni @anderslauri would you be able to take a couple of heap dumps and compare them, to check which objects are actually making the memory usage grow? This would help a lot in the investigation.

Yes, this is possible. Let us do this.

@mcollovati
Contributor

mcollovati commented Mar 26, 2024

@MichaelPluessErni @anderslauri an additional question: which driver are you using to connect to Redis, Lettuce or Jedis? Did you perhaps try changing the driver to verify that the leak is independent of it?

EDIT: looking at the other issues, it seems Lettuce is in use.

@MichaelPluessErni
Author

@mcollovati We're using:
redis.clients:jedis 5.0.2
io.lettuce:lettuce-core 6.3.1.RELEASE

It is not easy to test other versions, as the bug only appears on the production system.
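
Assuming the kit talks to Redis through Spring Data Redis, the driver could at least be pinned explicitly so it is unambiguous which one is in use when both dependencies are on the classpath. A sketch only; the configuration class, host and port are placeholders, not the project's actual setup:

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.connection.RedisConnectionFactory;
import org.springframework.data.redis.connection.RedisStandaloneConfiguration;
import org.springframework.data.redis.connection.jedis.JedisConnectionFactory;

/** Forces the Jedis driver even though lettuce-core is also on the classpath. */
@Configuration
public class RedisDriverConfig {

    @Bean
    public RedisConnectionFactory redisConnectionFactory() {
        // Swap in LettuceConnectionFactory here to test the other driver.
        return new JedisConnectionFactory(
                new RedisStandaloneConfiguration("redis.example.internal", 6379));
    }
}
```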

@MichaelPluessErni
Author

MichaelPluessErni commented Mar 27, 2024

@mcollovati I'm now able to produce heap dumps and analyze them.

A heap dump from one of our production pods: 477 MB (hprof file size).

Summary:
[Screenshot: heap dump summary]

Classes by size of instances:
[Screenshot: classes by size of instances]

Dominators by retained size:
[Screenshot: dominators by retained size]

I hope this already helps. Otherwise I'm available for more specific analyses of the heap dumps.

@mcollovati
Contributor

@MichaelPluessErni thank you very much!

Was this dump taken after the memory had already leaked? If so, I would also take a dump before the memory grows, so the two can be compared.

Eclipse Memory Analyzer (MAT) is a great tool for inspecting heap dumps. It also provides a Leak Suspects report that may help in the investigation (although I don't remember whether the report can be exported).

Otherwise, if you can privately share the dump with me, I can do further analysis.

@MichaelPluessErni
Author

@mcollovati this dump is pre-leak, meaning from a "healthy" pod.

@MichaelPluessErni
Author

MichaelPluessErni commented Mar 28, 2024

@mcollovati Using MAT is proving difficult, as I'm not able to download it on the company laptop. We're investigating whether it is possible to send you the dump.
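
If MAT stays out of reach, a rough comparison is still possible without extra tooling: jcmd <pid> GC.class_histogram prints per-class instance counts and shallow sizes, and diffing two histograms taken a few hours apart usually points at the growing types. The same output can also be pulled from inside the application via the HotSpot DiagnosticCommand MBean; a sketch, assuming a HotSpot JVM:

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

/** Returns the same output as "jcmd <pid> GC.class_histogram" (HotSpot-specific). */
public final class ClassHistogram {

    private ClassHistogram() {
    }

    public static String capture() throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName diagnostics = new ObjectName("com.sun.management:type=DiagnosticCommand");
        // DiagnosticCommand operations take a String[] of additional dcmd arguments.
        return (String) server.invoke(
                diagnostics,
                "gcClassHistogram",
                new Object[] {new String[0]},
                new String[] {String[].class.getName()});
    }
}
```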

Meanwhile, we've found a GC configuration that helps mitigate the memory leak:

// These settings define the following:
// InitiatingHeapOccupancyPercent  = 30     (default 45). Once old-gen occupancy exceeds 30% of the heap, G1 starts the concurrent marking cycle.
// G1MixedGCLiveThresholdPercent   = 85     (default 85). Only old-gen regions with live occupancy below this value are collected in the space-reclamation phase.
// G1OldCSetRegionThresholdPercent = 25     (default 10). At most this percentage of the heap's regions is added to the old-gen collection set per mixed GC cycle.
"application.JDK_JAVA_OPTIONS": "\"-XX:+UnlockExperimentalVMOptions -XX:InitiatingHeapOccupancyPercent=30 -XX:G1MixedGCLiveThresholdPercent=85 -XX:G1OldCSetRegionThresholdPercent=25\"",
// Reduce MAX_RAM_PERCENTAGE from 80% to 60%. Given 3100 MB this corresponds to 1860 MB, which should be more than enough.
"application.MAX_RAM_PERCENTAGE": "60",
"application.INITIAL_RAM_PERCENTAGE": "60",

[Chart: memory usage after the GC tuning]
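
For completeness: assuming MAX_RAM_PERCENTAGE and INITIAL_RAM_PERCENTAGE are passed through to the standard JVM flags of the same name (that mapping depends on the base image and is an assumption here), the effective JVM options amount to roughly:

```
-XX:+UnlockExperimentalVMOptions
-XX:InitiatingHeapOccupancyPercent=30
-XX:G1MixedGCLiveThresholdPercent=85
-XX:G1OldCSetRegionThresholdPercent=25
-XX:MaxRAMPercentage=60
-XX:InitialRAMPercentage=60
```

With a 3100 MB container limit, 60% caps the heap at roughly 1860 MB and leaves the remaining ~40% as headroom for non-heap memory (metaspace, thread stacks, direct buffers).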
