
Unbounded Memory Consumption #3698

Closed

agadient opened this issue Nov 17, 2023 · 12 comments
Labels
kind/bug security issues that could taint tracee

Comments

@agadient

Description

A program that makes a large number of successful open system calls causes tracee's memory usage to grow to the point that it may be killed by the OOM killer. An example program that triggers this issue is provided here: https://github.com/Vali-Cyber/ebpf-attacks/tree/main/exhaust

Output of tracee version:

Tracee version: "v0.19.0"

Output of uname -a:

Linux hamden 5.15.0-87-generic #97-Ubuntu SMP Mon Oct 2 21:09:21 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Additional details

Contents of /etc/os-release

PRETTY_NAME="Ubuntu 22.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.1 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

@rafaeldtinoco
Contributor

@agadient could you share how you were running tracee? Which events were being filtered, and what was the size of the machine you tested this on? If you could provide the command line you ran it with, that would be good (to know, for example, whether it was caching events or not).

I'll get into this, but do you have a meminfo from when this is happening? (A meminfo and a slabinfo would both be useful.) If you can't get them, that's ok, I'll try to get them myself soon. I want to differentiate whether the memory consumption comes from kmalloc (slub) or is caused by the runtime itself.
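
For reference, a rough way to capture those snapshots on the host while the exhaust program is running could look like the loop below; the log file names, the 5-second interval, and the tracee process name are assumptions, not anything tracee ships with:

# Periodically dump kernel and process memory state while reproducing.
# Run on the host; /proc/slabinfo is only readable as root, and the
# "tracee" process name is an assumption.
while true; do
    { date; cat /proc/meminfo; } >> meminfo.log
    sudo cat /proc/slabinfo >> slabinfo.log
    ps -o pid,rss,vsz,comm -C tracee >> tracee-rss.log || true
    sleep 5
done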

@rafaeldtinoco rafaeldtinoco self-assigned this Nov 17, 2023
@rafaeldtinoco
Contributor

I see from the code:

// The function run by worker processes. It sets its CPU affinity, increments
// a counter, and opens a file that doesn't exist in an infinite loop.
void exhaust(uint64_t *counters, uint64_t counter_index) {
    // Set the CPU affinity. We launch one worker per CPU
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(counter_index, &set);
    sched_setaffinity(getpid(), sizeof(cpu_set_t), &set);
    // Create a random filename in the tmp directory
    uint64_t filename_size = 16;
    if (TARGET_IS_TRACEE) {
        filename_size = NAME_MAX;
    }
    // ... (rest of the function omitted)

That, because of arg filtering, one could (as we already knew) flood the pipeline with events and exhaust internal state (or the maps that keep state between the eBPF and Go logic).

@AlonZivony I haven't read the code yet, but at first look it is related to the fact that we're arg filtering too late (so if the currently running policy's filtering is too broad, one could do that).

The same actually applies to the entire pipeline concept. If one stresses the amount of events going through the perf buffer, we could lose a detection. How much the pipeline can be stressed depends on which events are enabled by default (and whether they are filtered in kernel, like scopes, or in userland, like the current arg filtering).

The fix for this type of thing is to have in-kernel filtering for arguments (as we've discussed recently). It only needs prioritization.

@NDStrahilevitz
Collaborator

@rafaeldtinoco One counterpoint to filtering being the solution is that any agent aware of tracee could easily bypass the filters; in fact this program, which randomizes the filename, does exactly that unless we ignore /tmp entirely. Kernel filtering is important for cases where the admin can control what runs on the cluster and tune tracee accordingly, which is of course very important, but it's not the whole story IMO.

@rafaeldtinoco
Contributor

The fix for this type of thing is to have in-kernel filtering for arguments (as we've discussed recently). It only needs prioritization.

I wrote this too fast =) and did not mention that it's not a full fix, nor would it get rid of the problem; it's maybe only a 'helper'.

Kernel filtering is important for cases where the admin can control what runs on the cluster and tune tracee accordingly, which is of course very important, but it's not the whole story IMO.

I never thought otherwise, and you are correct. By having in-kernel filtering we would at least guarantee we are not spammed in userland with things we don't want, but it wouldn't be 'an answer'.

Event type quotas, de-prioritization of events, etc., could all be answers, but there is always room for mixing real events with fake ones in an attack. I think the answer will be having signatures that detect attacks against tracee =). That way we could miss the real attack, but the attempt to taint tracee would be picked up (and that could be 'good enough' for the end user).

@agadient
Author

@rafaeldtinoco this is the command I used to run Tracee: docker run -it --pid=host --cgroupns=host --privileged -v /etc/os-release:/etc/os-release-host:ro -v /boot:/boot:ro aquasec/tracee:0.19.0

@rafaeldtinoco rafaeldtinoco added the security issues that could taint tracee label Nov 30, 2023
@rafaeldtinoco rafaeldtinoco removed their assignment Nov 30, 2023
@trvll

trvll commented Mar 13, 2024

@agadient we have tested this issue with different versions of the exhaust program you shared, changing counter_value, and so far haven't gotten any OOM kill. Tracee's memory consumption never goes above 6% of the total available. Do you have any other clarifications/insights that would help us reproduce it?

@agadient
Author

Hi @trvll! Did you try running it with ./exhaust -tracee? The flag is important because I only saw this behavior with tracee when the files being created and deleted existed. Also, try removing all files from the /tmp directory before you run the program.

@trvll

trvll commented Mar 13, 2024

Hi @trvll! Did you try running it with ./exhaust -tracee?

Yes, sure... I also tried with an empty /tmp and with existing files as well. Anything else we should try?

@agadient
Author

I just reproduced the issue following these steps (see the memory-watching sketch after the screenshot). I attached a screenshot of top:

  1. Download Ubuntu 22.04.4 LTS ISO from here: https://ubuntu.com/download/server.
  2. Set up a VM with 2CPUs, 2GB of RAM, 20 GB of disk.
  3. When installing Ubuntu, make sure to install SSH.
  4. SSH into the VM after the install is complete.
  5. Install git and g++. sudo apt install git g++.
  6. Install the latest docker following these instructions: https://docs.docker.com/engine/install/ubuntu/
  7. Clone the repo: git clone https://github.com/Vali-Cyber/ebpf-attacks.git
  8. Compile the exhaust.cpp program: cd ebpf-attacks/exhaust && ./build.sh
  9. Run tracee: docker run --name tracee --rm -it --pid=host --cgroupns=host --privileged -v /etc/os-release:/etc/os-release-host:ro -v /boot:/boot:ro aquasec/tracee:latest. Give it some time to initialize.
  10. Run exhaust: ./exhaust -tracee
(Screenshot of top attached, 2024-03-14 9:05 AM)
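
For anyone re-running these steps, one way to watch the container's memory while ./exhaust -tracee runs could be a loop like this (it assumes the container name tracee from step 9; the 1-second interval is arbitrary):

# Print the tracee container's memory usage once per second
# (container name from step 9; adjust if you named it differently).
while true; do
    docker stats --no-stream --format '{{.Name}} {{.MemUsage}}' tracee
    sleep 1
done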

@trvll

trvll commented Mar 28, 2024

It appears that the behavior you're experiencing is due to a misconfiguration related to how Tracee handles its default settings when no specific arguments are provided.

Understanding Default Behavior

By design, when Tracee is invoked without any arguments, it initializes with a predefined set of default arguments to ensure a base level of functionality. This is intended to make the tool immediately useful for typical use cases without requiring initial configuration by the user.

From the docker image's entrypoint.sh:

run_tracee() {
    mkdir -p $TRACEE_OUT

    if [ $# -ne 0 ]; then
        # no default arguments, just given ones
        $TRACEE_EXE "$@"
    else
        # default arguments
        $TRACEE_EXE \
        --metrics \
        --cache cache-type=mem \
        --cache mem-cache-size=512 \
        --capabilities bypass=$CAPABILITIES_BYPASS \
        --capabilities add=$CAPABILITIES_ADD \
        --capabilities drop=$CAPABILITIES_DROP \
        --output=json \
        --output=option:parse-arguments \
        --output=option:relative-time \
        --events signatures,container_create,container_remove
    fi

    tracee_ret=$?
}

As noted, the default configuration includes setting the memory cache size for events at 512 MB.

Addressing Memory Constraints in VM Environments

Given that your VM is configured with only 2GB of memory, allocating 512 MB for Tracee's event caching could lead to resource contention, affecting both Tracee's performance and that of other processes running on the VM. To mitigate this, consider specifying a smaller cache size that better fits your VM's memory constraints.

Suggested Solution

You can override the default cache size by specifying the --cache cache-type=mem and --cache mem-cache-size flags [1] when running Tracee. For instance, to reduce the memory cache size to 128 MB, you could use the following command:

docker run --name tracee --rm -it --pid=host --cgroupns=host --privileged -v /etc/os-release:/etc/os-release-host:ro -v /boot:/boot:ro aquasec/tracee:latest --metrics --cache cache-type=mem --cache mem-cache-size=128 --capabilities bypass=0 --capabilities add= --capabilities drop= --output=json --output=option:parse-arguments --output=option:relative-time --events signatures,container_create,container_remove

@agadient
Copy link
Author

agadient commented Apr 2, 2024

@trvll I tested this configuration and confirmed that Tracee is no longer killed by the OOM killer. It would be nice if Tracee automatically checked the system's memory and adjusted its own limits accordingly. Perhaps this feature is something your team can consider at some point in the future.
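
Until something like that exists, a small wrapper sketch along these lines could approximate it; the one-eighth fraction and the 64 MB floor are arbitrary illustration values, not tracee defaults:

#!/bin/sh
# Derive the in-memory event cache size from the host's total RAM:
# roughly 1/8 of MemTotal (converted kB -> MB), with a 64 MB floor.
# Both numbers are arbitrary choices for this sketch.
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
cache_mb=$(( total_kb / 1024 / 8 ))
[ "$cache_mb" -lt 64 ] && cache_mb=64

docker run --name tracee --rm -it --pid=host --cgroupns=host --privileged \
    -v /etc/os-release:/etc/os-release-host:ro -v /boot:/boot:ro \
    aquasec/tracee:latest \
    --cache cache-type=mem --cache mem-cache-size="$cache_mb" \
    --output=json --events signatures,container_create,container_remove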

@yanivagman
Collaborator

@trvll I tested this configuration and confirmed that Tracee is no longer killed by the OOM killer. It would be nice if Tracee automatically checked the system's memory and adjusted its own limits accordingly. Perhaps this feature is something your team can consider at some point in the future.

Agree. I opened an issue to track that: #3947

Going to close this one now.

@yanivagman yanivagman closed this as not planned Apr 2, 2024