
Mattermost suddenly goes out of memory (OOM) and reboots #20625

Open
DummyThatMatters opened this issue Jul 11, 2022 · 22 comments · May be fixed by mattermost/docker#138
Labels
Bug Report/Open Bug report/issue

Comments

@DummyThatMatters

DummyThatMatters commented Jul 11, 2022

Summary

Mattermost restarts unexpectedly every 1-2 working days (approx.) due to a sudden increase in memory consumption.

Steps to reproduce

Mattermost 7.0.1 Team Edition, deployed in a pod on OpenShift (we tried allocating from 2.6GB to 5.5GB of RAM with the same result). Postgres 14.2 as the DB.
Around ~4000 users, ~1200 of them active.
A load of ~26000 messages per day.

Expected behavior

Mattermost works stably without reboots.

Observed behavior (that appears unintentional)

Mattermost restarts every 1-2 working days. The cause of the restart is OOM. Here is an example of the memory consumption:
[screenshots: memory usage graphs]

The same increase in load can be observed on the CPU side as well:
[screenshot: CPU usage graph]

As you can see, there is a sudden growth of resource utilisation out of nowhere. The logs are relatively clean, and the log rate didn't show any increase in the number of operations or in user activity.

We have done a small investigation of our own, and we think it may be caused by improper functioning of the getPostsForChannel method. We made that assumption by inspecting the Mattermost Go profile.
Here is an example heap tree made with the pprof tool:
[pprof heap graph: profile003]

Please help with investigating; we can provide additional info if needed (as long as it can be collected with our tools and does not contain corporate data).

@agnivade
Member

Hi @DummyThatMatters - It would be awesome if you could capture a heap profile during the memory spike. They don't contain any user data and should be safe to share.
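
For reference, a hedged example of capturing and inspecting a heap profile with Go's standard pprof tooling; whether the /debug/pprof endpoint is exposed, and on which host and port, depends on the deployment and server settings, so the URL below is an assumption:

# look at the top allocators directly
go tool pprof -top http://localhost:8067/debug/pprof/heap

# or save the raw profile so it can be attached to this issue
curl -s -o heap.prof http://localhost:8067/debug/pprof/heap

# render the heap graph as an image (requires Graphviz)
go tool pprof -png heap.prof > heap.png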

@DummyThatMatters
Author

DummyThatMatters commented Jul 11, 2022

@agnivade, sure. Here is an example heap profile from just before MM goes OOM:
heap.zip

@DummyThatMatters
Author

DummyThatMatters commented Jul 18, 2022

Hello! We found out that the issue is caused by calling the API method api/v4/channels/{channel_id}/posts?since={timestamp}&skipFetchThreads=false&collapsedThreads=true&collapsedThreadsExtended=false

Most likely the server fails at api4/posts.go, getPostsForChannel (line 249):

	if err := clientPostList.EncodeJSON(w); err != nil {
		mlog.Warn("Error while writing response", mlog.Err(err))
	}
}

We have rewritten the Mattermost server a bit and deployed the modified version in order to find out what's going wrong.

I assume that function is called when a user searches for something in a channel. And as far as I understand, there is no limit on the number of messages fetched, so calling it on a channel with a high number of messages and heavy content will cause lots of trouble for the whole server. Can it be fixed somehow?
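
To make the suggestion above concrete, here is a minimal sketch of the idea of bounding such a query; it is not the actual Mattermost code, and every name in it (Post, PostStore, getPostsForChannelSince, maxPostsPerSinceRequest) is hypothetical:

package example

import (
	"encoding/json"
	"log"
	"net/http"
)

// Post stands in for the real model.Post type.
type Post struct {
	ID       string `json:"id"`
	CreateAt int64  `json:"create_at"`
	Message  string `json:"message"`
}

// PostStore stands in for the real store interface; the limit parameter is
// the hypothetical addition this sketch is about.
type PostStore interface {
	GetPostsSince(channelID string, since int64, limit int) ([]*Post, error)
}

// maxPostsPerSinceRequest is an assumed cap, not a real Mattermost setting.
const maxPostsPerSinceRequest = 1000

// getPostsForChannelSince serves a bounded "since" query so a single request
// cannot materialize an arbitrarily large channel history in memory while it
// is being encoded to JSON.
func getPostsForChannelSince(w http.ResponseWriter, store PostStore, channelID string, since int64) {
	posts, err := store.GetPostsSince(channelID, since, maxPostsPerSinceRequest)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	// Stream the bounded result set straight to the response writer.
	if err := json.NewEncoder(w).Encode(posts); err != nil {
		log.Printf("error while writing response: %v", err)
	}
}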

@DummyThatMatters
Author

DummyThatMatters commented Jul 18, 2022

Can someone please check the info we have provided and mark this as a bug/issue to be fixed?

@agnivade
Member

Thank you @DummyThatMatters. Yes, your profile matches what you are seeing. We are looking into it.

@DummyThatMatters
Author

@agnivade, OK, thanks! Let me know if you need more info; we will try to provide what we can.

@amyblais amyblais added the Bug Report/Open Bug report/issue label Jul 18, 2022
@agnivade
Member

@DummyThatMatters - As an update, we have triaged it and we are tracking this internally. I'll give you an update when this is fixed. Thank you for reporting it!

@Madjew

Madjew commented Sep 1, 2022

As a temporary solution, you can enable Bleve indexing in the System Console; after turning it on, the server stops crashing on OOM, but search for bot messages stops working.
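
For reference, a hedged sketch of the equivalent config.json fragment, assuming the standard BleveSettings section (the index directory path is only an example; verify the key names against your own configuration):

"BleveSettings": {
    "IndexDir": "/opt/mattermost/bleve-indexes",
    "EnableIndexing": true,
    "EnableSearching": true,
    "EnableAutocomplete": true
}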

@mjnaderi

We are facing the same issue. The Mattermost server is killed by the OOM killer several times a day. Is there any progress on this?

@anx-ag
Contributor

anx-ag commented Jan 31, 2023

Hi @mjnaderi ,

What Mattermost server version are you currently running?

@mjnaderi

Mattermost Version: 7.7.1
Database Schema Version: 100
Database: postgres

The problem started when we upgraded to Mattermost 7.5.2, and upgrading to 7.7.1 didn't help. I don't remember which version was installed before 7.5.2.

[screenshot]

@anx-ag
Contributor

anx-ag commented Jan 31, 2023

Thanks, just wanted to confirm that it still happens with 7.7.1.

@matclab

matclab commented Jan 31, 2023

Increasing pids_limit in our docker-compose.yml file for both mattermost and postgresql greatly improved the situation on our side (not a single occurrence since the modification was made 5 days ago).

@matclab

matclab commented Jan 31, 2023

Increasing pids_limit in our docker-compose.yml file for both mattermost and postgresql greatly improved the situation on our side (not a single occurrence since the modification was made 5 days ago).

We did this after seeing messages like kernel: cgroup: fork rejected by pids controller in /docker/d0be5bd32fc127ce5cea8b781ac6f1dc0eb10b5a903851d4baf6ef58b52ac852 in our logs.

@anx-ag
Contributor

anx-ag commented Feb 1, 2023

We stumbled upon this issue ourselves and found out that this is related to a cgroup memory leak, which seems to be fixed already in the kernel:
https://lore.kernel.org/all/[email protected]/

If this is also the issue on your end, you can try to add the kernel command line option cgroup_disable=memory:

GRUB_CMDLINE_LINUX_DEFAULT="quiet consoleblank=0 cgroup_disable=memory"
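
If you go this route, a hedged example of applying it on a typical setup (the exact regeneration command depends on the distribution):

# after adding cgroup_disable=memory to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub
sudo update-grub                                  # Debian/Ubuntu
# sudo grub2-mkconfig -o /boot/grub2/grub.cfg     # RHEL/Fedora-style systems
sudo reboot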

@mjnaderi

mjnaderi commented Feb 2, 2023

@matclab I think in our case pids_limit is not the problem, because the number of processes is not too high even during the crash, and that message about fork does not appear in our logs.

@anx-ag Thanks. I added cgroup_disable=memory. I will wait a few days and share the result.

@Fidoshnik

@anx-ag Thanks. I added cgroup_disable=memory. I will wait a few days and share the result.

Greetings!
We also get an OOM error and couldn't find anything in the logs. Tell me, did adding this line to GRUB help you?

@mjnaderi

Adding cgroup_disable=memory did not help, but increasing pids_limit as @matclab suggested fixed the problem for us.

I doubled the value of pids_limit in docker-compose.yml (for the postgres service, changed from 100 to 200, and for the mattermost service, changed from 200 to 400). The server has not crashed since then.
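
For anyone else trying this, a minimal sketch of that change in a mattermost/docker-style docker-compose.yml; the service names and the rest of each service definition are assumptions, and pids_limit is the only relevant line:

services:
  postgres:
    # ...image, environment, volumes as in your existing file...
    pids_limit: 200   # doubled from 100
  mattermost:
    # ...image, environment, volumes as in your existing file...
    pids_limit: 400   # doubled from 200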

@amyblais
Member

amyblais commented Aug 3, 2023

As an update, we have triaged it and we are tracking this internally. I'll give you an update when this is fixed. Thank you for reporting it!

@agnivade Was this fixed, or do we keep this open for now?

@mjnaderi mjnaderi linked a pull request Sep 28, 2023 that will close this issue
@agnivade
Member

agnivade commented Dec 5, 2023

Apologies. Somehow I missed this.

So it seems like various users with different problems are commenting on this issue. It's not clear what the real root cause is. For some users, bumping up pids_limit seems to resolve it, but I don't understand how raising the number of allowed processes prevents an OOM crash from happening.

The original issue reported by @DummyThatMatters was an API related problem, and there's been a lot of changes to MM since 7.0.1. I'd like to know if it happens on a later 9.x version.

@Kamisato-Yuna

Kamisato-Yuna commented Dec 5, 2023

The original issue reported by @DummyThatMatters was an API related problem, and there's been a lot of changes to MM since 7.0.1. I'd like to know if it happens on a later 9.x version.

Don't worry, we also encountered OOM issues due to pids_limit before. After increasing the pids_limit, we have not experienced any OOM issues since then, up to the present (MM 9.2.2).

:D

@agnivade
Member

agnivade commented Dec 5, 2023

I get that. But I'd like an explanation of how bumping up pids_limit solves the issue.
