
Mattermost suddenly goes out of memory (OOM) and reboots #20625

Open
DummyThatMatters opened this issue Jul 11, 2022 · 22 comments · May be fixed by mattermost/docker#138
Labels
Bug Report/Open Bug report/issue

Comments

@DummyThatMatters

DummyThatMatters commented Jul 11, 2022

Summary

Mattermost restarts unexpectedly every 1-2 working days (approx.) due to a sudden increase in memory consumption.

Steps to reproduce

Mattermost 7.0.1 Team Edition, deployed in a pod on OpenShift (we tried allocating from 2.6GB to 5.5GB of RAM with the same result). Postgres 14.2 as the DB.
Around ~4000 users, ~1200 of them active.
A load of ~26000 messages per day.

Expected behavior

Mattermost works stably without reboots.

Observed behavior (that appears unintentional)

Mattermost restarts every 1-2 working days. The cause of the restart is OOM. Here is an example of the memory consumption:
[screenshots: memory usage graphs]

The same increase in load can be observed on the CPU side as well:
[screenshot: CPU usage graph]

As you can see, there is a sudden growth of resource utilisation out of nowhere. The logs are relatively clean, and the log rate didn't show any increase in the number of operations or in user activity.

We have done a small investigation of our own, and we think it may be caused by improper functioning of the getPostsForChannel method. We made that assumption by inspecting the Mattermost Go profile.
Here is an example heap tree made with the pprof tool:
[pprof heap graph: profile003]

Please help with investigating; we can provide additional info if needed (as long as it can be collected with our tools and does not contain corporate data).

@agnivade
Member

Hi @DummyThatMatters - It would be awesome if you could capture a heap profile during the memory spike. They don't contain any user data and should be safe to share.
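
For reference, a hedged example of capturing and inspecting a heap profile with Go's standard pprof tooling; whether the /debug/pprof endpoint is exposed, and on which host and port, depends on the deployment and server settings, so the URL below is an assumption:

# look at the top allocators directly
go tool pprof -top http://localhost:8067/debug/pprof/heap

# or save the raw profile so it can be attached to this issue
curl -s -o heap.prof http://localhost:8067/debug/pprof/heap

# render the heap graph as an image (requires Graphviz)
go tool pprof -png heap.prof > heap.png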

@DummyThatMatters
Author

DummyThatMatters commented Jul 11, 2022

@agnivade, sure. Here is an example heap profile from just before MM goes OOM:
heap.zip

@DummyThatMatters
Author

DummyThatMatters commented Jul 18, 2022

Hello! We found out that the issue is caused by calling the API method api/v4/channels/{channel_id}/posts?since={timestamp}&skipFetchThreads=false&collapsedThreads=true&collapsedThreadsExtended=false

Most likely the server fails at api4/posts.go, getPostsForChannel (line 249):

	if err := clientPostList.EncodeJSON(w); err != nil {
		mlog.Warn("Error while writing response", mlog.Err(err))
	}
}

We have rewritten the Mattermost server a bit and deployed the modified version in order to find out what's going wrong.

I assume that function is called when a user searches for something in a channel. And as far as I understand, there is no limit on the number of messages fetched, so calling it on a channel with a high number of messages and heavy content will cause lots of trouble for the whole server. Can it be fixed somehow?
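
To make the suggestion above concrete, here is a minimal sketch of the idea of bounding such a query; it is not the actual Mattermost code, and every name in it (Post, PostStore, getPostsForChannelSince, maxPostsPerSinceRequest) is hypothetical:

package example

import (
	"encoding/json"
	"log"
	"net/http"
)

// Post stands in for the real model.Post type.
type Post struct {
	ID       string `json:"id"`
	CreateAt int64  `json:"create_at"`
	Message  string `json:"message"`
}

// PostStore stands in for the real store interface; the limit parameter is
// the hypothetical addition this sketch is about.
type PostStore interface {
	GetPostsSince(channelID string, since int64, limit int) ([]*Post, error)
}

// maxPostsPerSinceRequest is an assumed cap, not a real Mattermost setting.
const maxPostsPerSinceRequest = 1000

// getPostsForChannelSince serves a bounded "since" query so a single request
// cannot materialize an arbitrarily large channel history in memory while it
// is being encoded to JSON.
func getPostsForChannelSince(w http.ResponseWriter, store PostStore, channelID string, since int64) {
	posts, err := store.GetPostsSince(channelID, since, maxPostsPerSinceRequest)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	// Stream the bounded result set straight to the response writer.
	if err := json.NewEncoder(w).Encode(posts); err != nil {
		log.Printf("error while writing response: %v", err)
	}
}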

@DummyThatMatters
Author

DummyThatMatters commented Jul 18, 2022

Can someone please check the info we have provided and mark this as a bug/issue to be fixed?

@agnivade
Member

Thank you @DummyThatMatters. Yes, your profile matches what you are seeing. We are looking into it.

@DummyThatMatters
Author

@agnivade, OK, thanks! Let me know if you need more info; we will try to provide what we can.

@amyblais amyblais added the Bug Report/Open Bug report/issue label Jul 18, 2022
@agnivade
Member

@DummyThatMatters - As an update, we have triaged it and we are tracking this internally. I'll give you an update when this is fixed. Thank you for reporting it!

@Madjew

Madjew commented Sep 1, 2022

As a temporary solution, you can enable Bleve indexing in the System Console; after turning it on, the server stops crashing on OOM, but search for bot messages stops working.
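
For reference, a hedged sketch of the equivalent config.json fragment, assuming the standard BleveSettings section (the index directory path is only an example; verify the key names against your own configuration):

"BleveSettings": {
    "IndexDir": "/opt/mattermost/bleve-indexes",
    "EnableIndexing": true,
    "EnableSearching": true,
    "EnableAutocomplete": true
}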

@mjnaderi

We are facing the same issue. The Mattermost server is killed by the OOM killer several times a day. Is there any progress on this?

@anx-ag
Contributor

anx-ag commented Jan 31, 2023

Hi @mjnaderi ,

What Mattermost server version are you currently running?

@mjnaderi

Mattermost Version: 7.7.1
Database Schema Version: 100
Database: postgres

The problem started when we upgraded to Mattermost 7.5.2, and upgrading to 7.7.1 didn't help. I don't remember which version was installed before 7.5.2.

[screenshot]

@anx-ag
Contributor

anx-ag commented Jan 31, 2023

Thanks, just wanted to confirm that it still happens with 7.7.1.

@matclab

matclab commented Jan 31, 2023

Increasing pids_limit in our docker-compose.yml file for both mattermost and postgresql greatly improved the situation on our side (not a single occurrence since the modification was made 5 days ago).

@matclab

matclab commented Jan 31, 2023

Increasing pids_limit in our docker-compose.yml file for both mattermost and postgresql greatly improved the situation on our side (not a single occurrence since the modification was made 5 days ago).

We did this after seeing messages like kernel: cgroup: fork rejected by pids controller in /docker/d0be5bd32fc127ce5cea8b781ac6f1dc0eb10b5a903851d4baf6ef58b52ac852 in our logs.

@anx-ag
Contributor

anx-ag commented Feb 1, 2023

We stumbled upon this issue ourselves and found out that this is related to a cgroup memory leak, which seems to be fixed already in the kernel:
https://lore.kernel.org/all/[email protected]/

If this is also the issue on your end, you can try to add the kernel command line option cgroup_disable=memory:

GRUB_CMDLINE_LINUX_DEFAULT="quiet consoleblank=0 cgroup_disable=memory"
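
If you go this route, a hedged example of applying it on a typical setup (the exact regeneration command depends on the distribution):

# after adding cgroup_disable=memory to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub
sudo update-grub                                  # Debian/Ubuntu
# sudo grub2-mkconfig -o /boot/grub2/grub.cfg     # RHEL/Fedora-style systems
sudo reboot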

@mjnaderi

mjnaderi commented Feb 2, 2023

@matclab I think in our case pids_limit is not the problem, because the number of processes is not too high even during the crash, and that message about fork does not appear in our logs.

@anx-ag Thanks. I added cgroup_disable=memory. I will wait a few days and share the result.

@Fidoshnik

@anx-ag Thanks. I added cgroup_disable=memory. I will wait a few days and share the result.

Greetings!
We also get an OOM error and couldn't find anything in the logs. Tell me, did adding this line to GRUB help you?

@mjnaderi

Adding cgroup_disable=memory did not help, but increasing pids_limit as @matclab suggested fixed the problem for us.

I doubled the value of pids_limit in docker-compose.yml (for the postgres service, changed from 100 to 200, and for the mattermost service, changed from 200 to 400). The server has not crashed since then.
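
For anyone else trying this, a minimal sketch of that change in a mattermost/docker-style docker-compose.yml; the service names and the rest of each service definition are assumptions, and pids_limit is the only relevant line:

services:
  postgres:
    # ...image, environment, volumes as in your existing file...
    pids_limit: 200   # doubled from 100
  mattermost:
    # ...image, environment, volumes as in your existing file...
    pids_limit: 400   # doubled from 200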

@amyblais
Member

amyblais commented Aug 3, 2023

As an update, we have triaged it and we are tracking this internally. I'll give you an update when this is fixed. Thank you for reporting it!

@agnivade Was this fixed, or do we keep this open for now?

@mjnaderi mjnaderi linked a pull request Sep 28, 2023 that will close this issue
@agnivade
Member

agnivade commented Dec 5, 2023

Apologies. Somehow I missed this.

So it seems like various users with different problems are commenting on this issue. It's not clear what the real root cause is. For some users, bumping up pids_limit seems to resolve it, but I don't understand how raising the number of allowed processes prevents an OOM crash from happening.

The original issue reported by @DummyThatMatters was an API related problem, and there's been a lot of changes to MM since 7.0.1. I'd like to know if it happens on a later 9.x version.

@Kamisato-Yuna

Kamisato-Yuna commented Dec 5, 2023

The original issue reported by @DummyThatMatters was an API related problem, and there's been a lot of changes to MM since 7.0.1. I'd like to know if it happens on a later 9.x version.

Don't worry, we also encountered OOM issues due to pids_limit before. After increasing the pids_limit, we have not experienced any OOM issues since then, up to the present (MM 9.2.2).

:D

@agnivade
Member

agnivade commented Dec 5, 2023

I get that. But I'd like an explanation of how bumping up pids_limit solves the issue.
