Add zombie memcgroup observation tool. #5201

Open
wants to merge 1 commit into
base: master
1 change: 1 addition & 0 deletions README.md
@@ -124,6 +124,7 @@ pair of .c and .py files, and some are directories of files.
- tools/[rdmaucma](tools/rdmaucma.py): Trace RDMA Userspace Connection Manager Access events. [Examples](tools/rdmaucma_example.txt).
- tools/[shmsnoop](tools/shmsnoop.py): Trace System V shared memory syscalls. [Examples](tools/shmsnoop_example.txt).
- tools/[slabratetop](tools/slabratetop.py): Kernel SLAB/SLUB memory cache allocation rate top. [Examples](tools/slabratetop_example.txt).
- tools/[zombiememcgstat](tools/zombiememcgstat.py): Display zombie memory cgroups. [Examples](tools/zombiememcgstat_example.txt).

##### Performance and Time Tools

103 changes: 103 additions & 0 deletions man/man8/zombiememcgstat.8
@@ -0,0 +1,103 @@
.TH zombiememcgstat 8 "2025-01-23" "USER COMMANDS"
.SH NAME
zombiememcgstat \- Show zombie memcgroups on a system, along with creator and offline duration.
.SH SYNOPSIS
.B zombiememcgstat [\-h] [\-p PID] [\-c COMM] [\-o OLDER] [interval] [count]
.SH DESCRIPTION
zombiememcgstat traces the creation, offlining and freeing of memory cgroup (memcg)
objects in the kernel. Using this information, it tracks and reports zombie memcgs,
along with their creator and how long they have been offline.

This tool uses in-kernel eBPF maps for storing timestamps and other information, for efficiency.

This tool traces mem_cgroup_css_online, mem_cgroup_css_offline and mem_cgroup_css_free
with kprobes to record when a memory cgroup comes online, when it goes offline and
when its kernel objects are freed. Between going offline and being freed, a memcg
exists as a zombie. At each report, the tool subtracts the offline timestamp from
the current time to show how long the memcg has been a zombie so far. Only memcgs
whose creation was observed while the tool was running are tracked.

Please note that this tool has been tested with kernel 5.4 and later versions.

Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-h
Print usage message.
.TP
\-p PID
Report zombie memcgs created by this pid only.
.TP
\-c COMM
Report zombie memcgs created by this comm only.
.TP
\-o OLDER
Report zombie memcgs that have been offline for at least this many seconds (default 60).
.TP
interval
Output interval, in seconds (default 30).
.TP
count
Number of outputs.
.SH EXAMPLES
.TP
List all zombie memcgs at 30 secs interval:
#
.B zombiememcgstat
.TP
List all zombie memcgs at 5 secs interval:
#
.B zombiememcgstat 5
.TP
List all zombie memcgs at 30 secs interval, 10 times:
#
.B zombiememcgstat 30 10
.TP
List zombie memcgs created by pid 100:
#
.B zombiememcgstat \-p 100
.TP
List zombie memcgs created by a task with comm "foo":
#
.B zombiememcgstat \-c foo
.TP
List zombie memcgs that have been offline for more than 100 secs:
#
.B zombiememcgstat \-o 100
.SH FIELDS
.TP
MEMCG
pointer to mem_cgroup object
.TP
NAME
name of memory cgroup
.TP
COMM
comm of task that created this memory cgroup
.TP
PID
pid of task that created this memory cgroup
.TP
AGE(secs)
time, in seconds, for which this memcg has been offline
.SH OVERHEAD
This traces the memcg online, offline and free functions, which are typically
called too infrequently to cause any noticeable overhead. It also keeps its state
in in-kernel maps for efficiency, so the overall overhead of this tool is expected
to be negligible.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Imran Khan
.SH SEE ALSO
memleak(8)
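
Review note: the lifecycle described in the man page (online, then offline, then free) can be modelled in plain Python, which may help when reasoning about what AGE(secs) means. This is only a sketch under stated assumptions: the `MemcgRecord` and `MemcgTracker` names are hypothetical and not part of this PR, and timestamps are passed in explicitly instead of being read from the kernel.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

NSEC_PER_SEC = 1_000_000_000


@dataclass
class MemcgRecord:
    # Hypothetical user-space mirror of one per-memcg map entry.
    name: str
    online_ts: int
    offline_ts: Optional[int] = None  # None while the memcg is still online


class MemcgTracker:
    """Toy model of the online -> offline -> free lifecycle."""

    def __init__(self) -> None:
        self.records: Dict[int, MemcgRecord] = {}

    def on_online(self, ptr: int, name: str, now_ns: int) -> None:
        self.records[ptr] = MemcgRecord(name=name, online_ts=now_ns)

    def on_offline(self, ptr: int, now_ns: int) -> None:
        rec = self.records.get(ptr)
        if rec is not None:  # creation was not observed: nothing to update
            rec.offline_ts = now_ns

    def on_free(self, ptr: int) -> None:
        # Once the kernel objects are freed, the memcg is no longer a zombie.
        self.records.pop(ptr, None)

    def zombies(self, now_ns: int) -> List[Tuple[int, str, int]]:
        """Return (ptr, name, age_secs) for offline-but-not-freed memcgs."""
        out = []
        for ptr, rec in self.records.items():
            if rec.offline_ts is not None:
                age = (now_ns - rec.offline_ts) // NSEC_PER_SEC
                out.append((ptr, rec.name, age))
        return out


tracker = MemcgTracker()
tracker.on_online(0x1000, "job-a", now_ns=0)
tracker.on_online(0x2000, "job-b", now_ns=0)
tracker.on_offline(0x1000, now_ns=5 * NSEC_PER_SEC)
zombie_list = tracker.zombies(now_ns=65 * NSEC_PER_SEC)
print(zombie_list)
```

In this model, "job-b" never appears because it is still online, and an offline event for an untracked pointer is ignored, mirroring the lookup-miss path in the offline probe.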

4 changes: 4 additions & 0 deletions tests/python/test_tools_smoke.py
@@ -428,6 +428,10 @@ def test_wakeuptime(self):
    def test_wqlat(self):
        self.run_with_int("wqlat.py 1 1", allow_early=True)

    @skipUnless(kernel_version_ge(5,4), "requires kernel >= 5.4")
    def test_zombiememcgstat(self):
        self.run_with_duration("zombiememcgstat.py 1 1")

    def test_xfsdist(self):
        # Doesn't work on build bot because xfs functions not present in the
        # kernel image.
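
Review note: the smoke test passes `1 1`, that is, one report after a 1-second interval. The interval/count semantics of the tool's output loop can be sketched as below. The `run_intervals` helper and its injectable `sleep_fn` and `emit` parameters are hypothetical, introduced here only so the loop can be exercised without sleeping or loading BPF.

```python
from typing import Callable, List


def run_intervals(interval: int, count: int,
                  sleep_fn: Callable[[int], None],
                  emit: Callable[[], None]) -> int:
    """Sleep `interval` seconds, call emit(), and repeat `count` times.

    Returns the number of reports emitted. A KeyboardInterrupt raised
    during the sleep still produces one final report before returning.
    """
    emitted = 0
    remaining = count
    while True:
        exiting = False
        try:
            sleep_fn(interval)
        except KeyboardInterrupt:
            exiting = True
        emit()
        emitted += 1
        remaining -= 1
        if exiting or remaining == 0:
            return emitted


sleeps: List[int] = []
reports: List[str] = []
n = run_intervals(1, 1, sleeps.append, lambda: reports.append("report"))
print(n, sleeps)
```

Because `-o` defaults to 60 seconds, a `1 1` run is not expected to print any rows on a freshly traced system; the test mainly checks that the probes attach and the loop exits cleanly.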
202 changes: 202 additions & 0 deletions tools/zombiememcgstat.py
@@ -0,0 +1,202 @@
#!/usr/bin/env python
# @lint-avoid-python-3-compatibility-imports
#
# zombiememcgstat Dump info about zombie memcgroups
# For Linux, uses BCC, eBPF.
#
# USAGE: zombiememcgstat [-h] [-p PID] [-c COMM] [-o OLDER] [interval] [count]
#
# Copyright 2025 Oracle and/or its affiliates.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 01-Jan-2025 Imran Khan Created this.


from __future__ import absolute_import
from __future__ import division
from __future__ import unicode_literals
from __future__ import print_function
from bcc import BPF
from time import sleep
import argparse
import sys

# arguments
examples = """examples:
    ./zombiememcgstat              # list all zombie memcgs at 30 secs interval
    ./zombiememcgstat 5            # list all zombie memcgs at 5 secs interval
    ./zombiememcgstat 30 10        # list all zombie memcgs at 30 secs interval, 10 times
    ./zombiememcgstat -p 1         # list zombie memcgs created by pid 1
    ./zombiememcgstat -c systemd   # list zombie memcgs created by systemd
    ./zombiememcgstat -o 600       # list zombie memcgs older than 600 secs
"""

parser = argparse.ArgumentParser(
    description="""
Zombie memory cgroups (memcgs) are cgroups that have been removed
from user space but still exist in kernel space due to non-zero refcounts.
List such zombie memcgs on a system along with their creator and
offline duration.
""",
    formatter_class=argparse.RawDescriptionHelpFormatter,
    epilog=examples)
parser.add_argument("-p", "--pid", type=int,
    help="show zombie memcgs created by specified pid")
parser.add_argument("-c", "--comm", type=str,
    help="show zombie memcgs created by specified comm")
parser.add_argument("-o", "--older", default=60, type=int,
    help="show zombie memcgs that have been offline for at least "
         "this many secs (default 60)")
parser.add_argument("interval", nargs="?", default=30, type=int,
    help="output interval, in seconds")
parser.add_argument("count", nargs="?", default=99999999, type=int,
    help="number of outputs")
parser.add_argument("--ebpf", action="store_true",
    help=argparse.SUPPRESS)
args = parser.parse_args()
countdown = int(args.count)
min_offline_time = args.older
debug = 0

# "args.pid" alone is falsy for 0, so test against None to reject pid 0 too
if args.pid is not None and args.pid <= 0:
    print("specified task pid should be greater than 0.")
    sys.exit(1)

# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>        /* For TASK_COMM_LEN */
#include <linux/memcontrol.h>   /* For mem_cgroup_from_css */

#define MAX_NAME_LEN (256)

typedef struct memcg_info {
    u64 online_ts;
    u64 offline_ts;
    u32 pid;
    u8 offline;
    char comm[TASK_COMM_LEN];
    char name[MAX_NAME_LEN];
    u64 memcg_ptr;
} memcg_info_t;

BPF_HASH(offline_memcg_info, u64, memcg_info_t);

static int cmp_comms(const char *comm1, const char *comm2, int len)
{
    unsigned char n1, n2;
    while (len-- > 0) {
        n1 = *comm1++;
        n2 = *comm2++;
        if (n1 != n2)
            return n1 - n2;
        if (!n1)
            break;
    }
    return 0;
}

int mem_cgroup_css_online_probe(struct pt_regs *ctx,
                                struct cgroup_subsys_state *css)
{
    struct kernfs_node *kn;
    struct mem_cgroup *memcg_ptr = (struct mem_cgroup *)PT_REGS_PARM1(ctx);
    memcg_info_t memcg_val = {};
    memcg_val.pid = bpf_get_current_pid_tgid() >> 32;
    FILTER_PID
    bpf_get_current_comm(&memcg_val.comm, sizeof(memcg_val.comm));
    FILTER_COMM
    kn = memcg_ptr->css.cgroup->kn;
    bpf_probe_read_kernel_str(&memcg_val.name,
                              sizeof(memcg_val.name), kn->name);
    memcg_val.offline = 0;
    memcg_val.memcg_ptr = (u64)memcg_ptr;
    memcg_val.online_ts = bpf_ktime_get_ns();
    offline_memcg_info.update(&memcg_val.memcg_ptr, &memcg_val);
    return 0;
}

int mem_cgroup_css_offline_probe(struct pt_regs *ctx,
                                 struct cgroup_subsys_state *css)
{
    u64 memcg_ptr = (u64)PT_REGS_PARM1(ctx);
    memcg_info_t *memcg_val_p = offline_memcg_info.lookup(&memcg_ptr);
    if (memcg_val_p == 0) {
        return 0;   // data absent
    }
    memcg_val_p->offline = 1;
    memcg_val_p->offline_ts = bpf_ktime_get_ns();
    return 0;
}

int mem_cgroup_css_free_probe(struct pt_regs *ctx,
                              struct cgroup_subsys_state *css)
{
    u64 memcg_ptr = (u64)PT_REGS_PARM1(ctx);
    offline_memcg_info.delete(&memcg_ptr);
    return 0;
}
"""

# code substitutions
if args.pid:
    filter_pid_text = """
    if (memcg_val.pid != %d) {
        return 0;
    }
    """ % (args.pid)
    bpf_text = bpf_text.replace('FILTER_PID', filter_pid_text)
else:
    bpf_text = bpf_text.replace('FILTER_PID', '')

if args.comm:
    filter_comm_text = """
    if (cmp_comms(memcg_val.comm, "%s", TASK_COMM_LEN)) {
        return 0;
    }
    """ % (args.comm)
    bpf_text = bpf_text.replace('FILTER_COMM', filter_comm_text)
else:
    bpf_text = bpf_text.replace('FILTER_COMM', '')

# --ebpf and debug were defined above but never acted on; dump the
# generated program as other bcc tools do
if debug or args.ebpf:
    print(bpf_text)
    if args.ebpf:
        exit()

# load BPF program
b = BPF(text=bpf_text)
b.attach_kprobe(event="mem_cgroup_css_online",
                fn_name="mem_cgroup_css_online_probe")
b.attach_kprobe(event="mem_cgroup_css_offline",
                fn_name="mem_cgroup_css_offline_probe")
b.attach_kprobe(event="mem_cgroup_css_free",
                fn_name="mem_cgroup_css_free_probe")

print("Show zombie memcgroups at specified intervals... Hit Ctrl-C to end.")

# header (%-formatting instead of an f-string, matching the __future__
# imports above that keep the file parseable by Python 2)
print("%-20s %-22s %-16s %-8s %-8s" %
      ("MEMCG", "NAME", "COMM", "PID", "AGE(secs)"))
# output
exiting = 0 if args.interval else 1
memcgs = b["offline_memcg_info"]
while (1):
    try:
        sleep(int(args.interval))
    except KeyboardInterrupt:
        exiting = 1

    print()
    # ascending offline_ts, so the longest-offline memcgs are printed first
    for address, info in sorted(memcgs.items(),
                                key=lambda kv: kv[1].offline_ts):
        try:
            if not info.offline:
                continue
            curr_time = BPF.monotonic_time()
            offline_age = (curr_time - info.offline_ts) // 1000000000
            if offline_age < min_offline_time:
                continue
            print("0x%-18x %-22s %-16s %-8d %-8d" %
                  (address.value,
                   info.name.decode('utf-8', 'replace')[:22],
                   info.comm.decode('utf-8', 'replace')[:16], info.pid,
                   offline_age))
        except KeyboardInterrupt:
            exiting = 1

    countdown -= 1
    if exiting or countdown == 0:
        exit()
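
Review note: the filtering, ordering and row formatting in the output loop can be checked without a kernel by pulling them into a pure function. The sketch below is one possible shape for that, not code from this PR: `Entry` is a hypothetical stand-in for a (key, value) pair of the BPF hash map, and `format_rows` is an illustrative name.

```python
from typing import Iterable, List, NamedTuple

NSEC_PER_SEC = 1_000_000_000


class Entry(NamedTuple):
    # Hypothetical stand-in for one (key, value) pair of the BPF hash map.
    ptr: int
    name: bytes
    comm: bytes
    pid: int
    offline: bool
    offline_ts: int


def format_rows(entries: Iterable[Entry], now_ns: int,
                min_age_secs: int) -> List[str]:
    """Return formatted rows for zombies at least `min_age_secs` old,
    ordered by ascending offline timestamp (longest-offline first)."""
    rows = []
    for e in sorted(entries, key=lambda x: x.offline_ts):
        if not e.offline:
            continue  # still online, so not a zombie
        age = (now_ns - e.offline_ts) // NSEC_PER_SEC
        if age < min_age_secs:
            continue
        rows.append("0x%-18x %-22s %-16s %-8d %-8d" % (
            e.ptr,
            e.name.decode("utf-8", "replace")[:22],
            e.comm.decode("utf-8", "replace")[:16],
            e.pid, age))
    return rows


sample = [
    Entry(0x2, b"young", b"systemd", 1, True, 50 * NSEC_PER_SEC),
    Entry(0x1, b"old", b"systemd", 1, True, 10 * NSEC_PER_SEC),
    Entry(0x3, b"live", b"systemd", 1, False, 0),
]
rows = format_rows(sample, now_ns=100 * NSEC_PER_SEC, min_age_secs=60)
for row in rows:
    print(row)
```

Decoding with the `replace` error handler is a defensive choice in this sketch so that a cgroup name containing bytes that are not valid UTF-8 does not abort the report.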