Unbounded Memory Consumption #3698
@agadient could you share how you were running Tracee? Which events were being filtered, and what was the size of the machine you tested on? Providing the command line you used would help (so we know, for example, whether event caching was enabled). I'll look into this, but do you have a meminfo from when this is happening (a meminfo and slabinfo would both be useful)? If you can't get them, that's fine, I'll try to get them myself soon. I want to differentiate whether the memory consumption comes from kmalloc (slub) or is caused by the runtime itself.
I see from the code:
That, because of arg filtering, one could (as we already knew) flood the pipeline with events and exhaust internal state (or the maps keeping state between the eBPF and Go logic). @AlonZivony I haven't read the code yet, but at first glance it is related to the fact that we're arg filtering too late (so if the currently running policy filtering is too broad, one could do that). The same applies to the entire pipeline concept, actually: if one stresses the amount of events going through the perf buffer, we could lose a detection. How much the pipeline can be stressed depends on which events are enabled by default (and whether they are filtered in kernel, like scopes, or in userland, like the current arg filtering). The fix for this type of thing is to have in-kernel filtering for arguments (as we discussed recently). It only needs prioritization.
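To illustrate the distinction, here is a sketch contrasting the two filter stages. The flag syntax is an assumption based on Tracee's documented --scope/--events filters and may differ between versions; treat it as an illustration, not a verified command line.

```sh
# Scope filter: evaluated in kernel, so events from other comms never
# reach the perf buffer or the Go pipeline at all.
sudo tracee --scope comm=bash --events openat

# Argument filter: openat events still cross the perf buffer and are
# only matched against the pathname in userland, so a flood of
# non-matching events still consumes pipeline resources.
sudo tracee --events 'openat.args.pathname=/etc/*'
```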
@rafaeldtinoco One counterpoint to filtering being the solution is that any agent aware of Tracee could easily bypass the filters; in fact, this program, which randomizes the filename, does exactly that, unless we ignore /tmp entirely. Kernel filtering is important for cases where the admin can control what runs on the cluster and tune Tracee accordingly, which is of course very important, but it's not the whole story IMO.
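For context, a minimal sketch of the kind of workload being described. This is not the actual exhaust program from the linked repository, just an illustration of issuing many successful open calls on randomized filenames so a path-based filter can't match them:

```sh
# Keep making successful open(2) calls on files whose names are
# randomized, defeating any filter keyed on a specific pathname.
while true; do
  name="/tmp/$(head -c 8 /dev/urandom | od -An -tx1 | tr -d ' \n')"
  : > "$name"              # creates/opens the file
  cat "$name" > /dev/null  # another successful open
  rm -f "$name"
done
```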
I wrote this too fast =) and did not mention it's not a full fix, nor would it get rid of the problem; maybe a 'helper' only.
I have not thought otherwise, and you are correct. By having in-kernel filtering we would at least guarantee we are not spammed with things we don't want in userland, but it wouldn't be "an answer". Event type quotas, de-prioritization of events, etc., could all be answers, but there is always room for mixing real events with fake ones in an attack. I think the answer will be having signatures to detect attacks against Tracee =). This way, we could miss the real attack, but the attempt to taint Tracee would be picked up (and that could be 'good enough' for the end user).
@rafaeldtinoco this is the command I used to run Tracee:
@agadient we have tested this issue with different versions of the exhaust program you've shared, changing the counter_value, and so far didn't get any OOM kill. Tracee's memory consumption never goes above 6% of the total available. Do you have any other clarifications/insights to help us reproduce it?
Hi @trvll! Did you try running it with |
yes, sure... I've also tried with an empty /tmp and with existing files as well... anything else we should try?
I just reproduced the issue following these steps. I attached a screenshot of top:
It appears that the behavior you're experiencing is due to a misconfiguration related to how Tracee handles its default settings when no specific arguments are provided.

Understanding Default Behavior

By design, when Tracee is invoked without any arguments, it initializes with a predefined set of default arguments to ensure a base level of functionality. This is intended to make the tool immediately useful for typical use cases without requiring initial configuration by the user. These defaults come from the Docker image's entrypoint.sh.
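For reference, a sketch of the kind of default invocation the entrypoint applies. The exact contents of entrypoint.sh are not reproduced here; the 512 MB cache value is the one mentioned below, and everything else is an assumption for illustration:

```sh
# Hypothetical default invocation applied by the container entrypoint
# when no arguments are passed (only the 512 MB cache size is taken
# from the discussion; other flags are illustrative).
tracee \
  --cache cache-type=mem \
  --cache mem-cache-size=512 \
  --output json
```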
As noted, the default configuration includes setting the memory cache size for events to 512 MB.

Addressing Memory Constraints in VM Environments

Given that your VM is configured with only 2 GB of memory, allocating 512 MB for Tracee's event caching could lead to resource contention, affecting both Tracee's performance and that of other processes running on the VM. To mitigate this, consider specifying a smaller cache size that better fits your VM's memory constraints.

Suggested Solution

You can override the default cache size by specifying the --cache cache-type=mem and --cache mem-cache-size flags [1] when running Tracee. For instance, to reduce the memory cache size to 128 MB, you could use the following command:
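A sketch of such an invocation, assuming Tracee is run via the Docker image as in the original report. The image name, mounts, and privilege flags are assumptions; only the cache flags and the 128 MB value come from the text above:

```sh
# Run Tracee with the in-memory event cache capped at 128 MB instead
# of the 512 MB default (container image and mounts are illustrative).
docker run --name tracee -it --rm \
  --pid=host --cgroupns=host --privileged \
  -v /etc/os-release:/etc/os-release-host:ro \
  aquasec/tracee:latest \
  --cache cache-type=mem --cache mem-cache-size=128
```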
@trvll I tested this configuration and confirmed that Tracee is no longer killed by the OOM killer. It would be nice if Tracee automatically checked the system's memory and adjusted its own limits accordingly. Perhaps this feature is something your team can consider at some point in the future.
Agree. I opened an issue to track that: #3947. Going to close this one now.
Description
A program that makes a large number of successful open system calls causes Tracee's memory usage to grow to the point that it may be killed by the OOM killer. An example program that triggers this issue is provided here: https://github.com/Vali-Cyber/ebpf-attacks/tree/main/exhaust
Output of tracee version:
Tracee version: "v0.19.0"
Output of uname -a:
Linux hamden 5.15.0-87-generic #97-Ubuntu SMP Mon Oct 2 21:09:21 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Additional details
Contents of /etc/os-release
PRETTY_NAME="Ubuntu 22.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.1 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy