Making memray third-party allocator-aware #577

pitrou · 2024-04-10T08:39:42Z

Is there an existing proposal for this?

I have searched the existing proposals

Is your feature request related to a problem?

It seems that memray currently reports the different "kinds" of allocations based on which libc function was called (malloc, mmap...). (*) However, third-party allocators such as mimalloc and jemalloc are growing in use because of their desirable performance characteristics. When those are used instead of the system allocator, allocations which are logically malloc-like are reported as mmap calls with very large allocation sizes.

There is an example in this issue report where a bunch of 64MiB blocks are reported by memray as allocated (one per thread, roughly), resulting in a large reported footprint of more than 1GiB, while those are the page reservations by mimalloc and the corresponding allocations on the application side are tiny (1kiB each).
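
To put rough numbers on that mismatch, here is a small back-of-the-envelope calculation. The thread count of 17 is a made-up value chosen only so that the total exceeds 1 GiB; the 64 MiB and 1 KiB figures come from the report above.

```python
MIB = 1024 * 1024
KIB = 1024

threads = 17  # hypothetical thread count, not taken from the report

# What an mmap-level view sees: one 64 MiB reservation per thread.
reserved = threads * 64 * MIB
# What the application actually asked for: one 1 KiB block per thread.
requested = threads * 1 * KIB

print(f"reserved:  {reserved / MIB:.0f} MiB")
print(f"requested: {requested / KIB:.0f} KiB")
print(f"ratio:     {reserved // requested}x")
```

Under these assumptions the reported footprint overstates the application-level allocations by a factor of 65536, whatever the thread count.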

This is bound to produce many user reports of memory leaks or overconsumption, while the program is actually operating normally.

(*) I may be wrong in this interpretation of mine, in which case please do correct me.

Describe the solution you'd like

Ideally, memray would also detect calls to third-party allocator routines and report a mi_malloc(1024) as allocating 1024 bytes, not 64 MiB :-)

Several technical solutions can be considered and I'm not an expert in the field. Here are two that come to mind:

Hard-code support for the most popular 3rd-party allocators, by looking at their respective API names. This seems conceptually easy but will have limited benefits, because those allocators are often privately vendored and sometimes their symbols are mangled to avoid symbol clashes. Also, this means that less popular allocators will not get any coverage.
Devise some sort of runtime protocol where the allocators themselves may tag API functions (how? I have no idea :-)) as being malloc-like, realloc-like, etc. This is obviously more complex technically and requires cooperation to come up with a suitable protocol, but would work better in the long term.

Alternatives you considered

No response


pablogsal · 2024-04-10T10:24:27Z

Thanks @pitrou for bringing this to us. This is a very interesting problem indeed.

The key to either method is that we need:

A specific name for a symbol. We could mangle it ourselves according to the Itanium ABI, but we need something to override that's constant and predictable.
We need the symbol to have a PLT/GOT entry. This basically means that the symbol is in the dynamic table of the executable or shared library. I assume this will usually be the case, but not always. For instance, if someone (looking at CPython) compiles mimalloc statically, the symbols won't be exposed and there is no way for us to properly override them. This has an easy fix for CPython because we can ensure these have semantic interposition for this purpose, but anything else has the same problem.

If we have these two things, we could offer a way to either override automatically by having constant symbol names or to offer some kind of dynamic naming via some configuration.

I suppose that the next step is for us to investigate how some of the applications/libraries out there are interacting with these allocators. Do you think you can give us an example with pyarrow that uses mimalloc or jemalloc?

pitrou · 2024-04-10T11:32:35Z

Here is a quick REPL example:

>>> import pyarrow as pa

# mimalloc
>>> pool = pa.mimalloc_memory_pool()
>>> a = pa.array([0]*1_000_000, memory_pool=pool)
>>> pool.bytes_allocated()
8000000

# jemalloc
>>> pool = pa.jemalloc_memory_pool()
>>> a = pa.array([0]*1_000_000, memory_pool=pool)
>>> pool.bytes_allocated()
8000000

Note that mimalloc_memory_pool and jemalloc_memory_pool return singleton instances.

You'll find the corresponding C++ code here:

Note that jemalloc symbols are mangled to avoid polluting the standard libc namespace (malloc etc.) so it's probably easier to look at mimalloc first.

> We need the symbol to have a PLT/GOT entry. This basically means that the symbol is in the dynamic table of the executable or shared library.

Ah, interesting. So it must appear in nm --dynamic otherwise memray wouldn't find it?
To avoid potential name clashes, we un-expose most third-party symbols from libarrow.so.

For example:

$ nm libarrow.so.1500 | rg -w mi_malloc
0000000001d3c210 t mi_malloc
$ nm --dynamic libarrow.so.1500 | rg -w mi_malloc
$

$ nm libarrow.so.1500 | rg "je_arrow_" | head -n 4
0000000001cc7bb0 t je_arrow_aligned_alloc
0000000001cc8180 t je_arrow_calloc
0000000001ccd070 t je_arrow_dallocx
0000000001cc9a30 t je_arrow_free
$ nm --dynamic libarrow.so.1500 | rg "je_arrow_"
$
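
For readers less used to `nm` output, the check above can be sketched as a small parser. This is only an illustration: the helper names `nm_symbols` and `is_local` are made up here, and the input strings are copied from the output above rather than produced by running `nm`. It relies on the `nm` convention that a lowercase type letter usually marks a local symbol (with a few lowercase exceptions such as `u`, `v` and `w`).

```python
def nm_symbols(nm_output):
    """Parse `nm` output lines into a {name: type_letter} mapping."""
    symbols = {}
    for line in nm_output.splitlines():
        parts = line.split()
        if len(parts) == 3:      # defined symbol: address, type, name
            _, kind, name = parts
        elif len(parts) == 2:    # undefined symbol: type, name
            kind, name = parts
        else:
            continue
        symbols[name] = kind
    return symbols


def is_local(kind):
    # Lowercase usually means local; u, v and w are lowercase letters
    # that nm uses for some special global or weak symbols.
    return kind.islower() and kind not in "uvw"


NM_ALL = """\
0000000001d3c210 t mi_malloc
0000000001cc7bb0 t je_arrow_aligned_alloc
0000000001cc8180 t je_arrow_calloc
"""
NM_DYNAMIC = ""  # `nm --dynamic` printed nothing for these names

all_syms = nm_symbols(NM_ALL)
dyn_syms = nm_symbols(NM_DYNAMIC)

for name, kind in all_syms.items():
    hidden = is_local(kind) and name not in dyn_syms
    print(name, "-> local, not in the dynamic table" if hidden
          else "-> candidate for interposition")
```

All three names come out as local and absent from the dynamic table, which matches the empty `nm --dynamic` results.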

pablogsal · 2024-04-10T11:55:53Z

> Ah, interesting. So it must appear in nm --dynamic otherwise memray wouldn't find it?

That is a sufficient condition, but not a necessary one. The other option is that it should have a symbol called mi_malloc@plt or similar (in the normal symbol table). Otherwise it seems that you may be statically compiling against mimalloc (all the allocator code is within the shared lib), and in that case all bets are off because we cannot relocate the symbol (it could even be inlined, for what it's worth).

pitrou · 2024-04-10T11:57:49Z

> The other option is that it should have a symbol called mi_malloc@plt or similar (in the normal symbol table).

Hmm. How would you do that using gcc or clang? Is there a function attribute (preferably) or perhaps compiler/linker flag?

Also, yes, we are statically compiling mimalloc and jemalloc.

pablogsal · 2024-04-10T11:59:10Z

> Hmm. How would you do that using gcc or clang? Is there a function attribute (preferably) or perhaps compiler/linker flag?

I think you can do it with __attribute__((visibility("default"))) but that has other effects (like exporting the symbol).

pitrou · 2024-04-10T12:00:04Z

Hmm, actually, a function attribute wouldn't work, because we would have to patch the mimalloc source code for that...

(also, we use -fno-semantic-interposition and I'm unsure how it influences __attribute__((visibility("default"))))

pablogsal · 2024-04-10T12:01:47Z

Another way to look at this problem is that a library loaded with LD_PRELOAD should be able to interpose the symbol. We do the same thing, but by reimplementing what the linker does.

> (also, we use -fno-semantic-interposition and I'm unsure how it influences __attribute__((visibility("default"))))

That deactivates PLT entries for intra-library calls in the shared library. This means that if the definition of the symbol is inside the executable/shared lib, there won't be a PLT entry, which is faster and maybe inlinable, but it means the call cannot be interposed.

pablogsal · 2024-04-10T12:03:39Z

It looks like if you statically compile the allocator and use -fno-semantic-interposition, you are preventing any memory profiler from interposing calls to the allocator. (This also includes LD_PRELOAD-based ones like https://github.com/KDE/heaptrack/.) This is because interposing the call is impossible without rewriting the machine code, and sometimes even that won't be enough because the call may be inlined.

I am afraid this is the classic compromise between performance and observability.

pitrou · 2024-04-10T12:06:02Z

> I am afraid this is the classic compromise between performance and observability.

I agree. We could definitely make an exception for mimalloc and jemalloc calls, however, it's just that I don't know how to do that without affecting other symbols.

Also, a radical solution might be to first try dlsym-ing the symbols, and then fall back on the local symbol.
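
A minimal sketch of that "dlsym first, local symbol second" idea, written in Python with ctypes purely for illustration (the real thing would live in Arrow's C++ code). The names `resolve` and `local_mi_malloc` are hypothetical, and `ctypes.CDLL(None)` is assumed to be running on a POSIX system, where it gives access to the symbols already visible in the process.

```python
import ctypes


def resolve(name, fallback):
    """Return the process-wide symbol `name` if one is visible, else `fallback`."""
    try:
        # On POSIX, CDLL(None) wraps dlopen(NULL): the global symbol namespace.
        process = ctypes.CDLL(None)
        # Attribute access performs the dlsym-style lookup.
        return getattr(process, name)
    except (AttributeError, OSError, TypeError):
        return fallback


def local_mi_malloc(size):
    # Stand-in for the statically linked, non-exported allocator entry point.
    return bytearray(size)


# Resolved once, up front: an interposed definition wins if present,
# otherwise the local one is used.
alloc = resolve("mi_malloc", local_mi_malloc)
print("using local fallback:", alloc is local_mi_malloc)
```

Because the lookup happens once at startup, anything that wants to interpose has to be visible at that moment.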

pablogsal · 2024-04-10T12:09:41Z

> however, it's just that I don't know how to do that without affecting other symbols.

I think trying to use a __attribute__((visibility("default"))) or marking the symbol as weak (__attribute__((weak))) may be worth a try.

pablogsal · 2024-04-10T12:10:09Z

A quick check you can do when trying things out is to load a library with the same definition via LD_PRELOAD and check whether it's interposed or not.

pitrou · 2024-04-10T12:14:24Z

> I think trying to use a __attribute__((visibility("default"))) or marking the symbol as weak (__attribute__((weak))) may be worth a try.

I thought so, but I realized it required patching the mimalloc or jemalloc source, something we'd like to avoid if possible (also, it could be pre-compiled and we would be linking against an existing libmimalloc.a).

That said, the dlsym route would probably be ok for us. I might give it a quick try.

pablogsal · 2024-04-10T12:17:51Z

Some interesting info: apparently the way Qt does this is to use -Bsymbolic-functions and:

--dynamic-list=dynamic-list-file
Specify the name of a dynamic list file to the linker. This is typically used when creating shared libraries to specify a list of global symbols whose references shouldn’t be bound to the definition within the shared library, or creating dynamically linked executables to specify a list of symbols which should be added to the symbol table in the executable. This option is only meaningful on ELF platforms which support shared libraries.

The format of the dynamic list is the same as the version node without scope and node name. See [VERSION Command](https://sourceware.org/binutils/docs/ld/VERSION.html) for more information.

Example: https://github.com/qt/qtbase/blob/aa896ca9f51252b6d01766e19a03e41bd49857f3/src/gui/CMakeLists.txt#L324

pablogsal · 2024-04-10T12:20:12Z

> Also, a radical solution might be to first try dlsym-ing the symbols, and then fall back on the local symbol.

I think that won't work for profilers that attach or that don't use LD_PRELOAD, because the interposition will happen at arbitrarily late points (after the initial relocation has been made).

pablogsal · 2024-04-10T12:21:55Z

Maybe you can wrap the allocator in some call that's exported and use that internally and mark that wrapper as __attribute__((visibility("default"))). We could override the wrapper.

pitrou · 2024-04-10T12:23:51Z

> I think that won't work for profilers that attach or that don't use LD_PRELOAD, because the interposition will happen at arbitrarily late points (after the initial relocation has been made).

I might be misunderstanding how relocation works, but do these profilers patch all call sites at runtime?

pablogsal · 2024-04-10T12:30:43Z

> I might be misunderstanding how relocation works, but do these profilers patch all call sites at runtime?

No, they patch the Global Offset Table at runtime. All call sites point to a PLT entry. For calls that have a PLT/GOT pair, the code normally trampolines through a small piece of assembly that grabs an address from the Global Offset Table and jumps to it. Call sites point to the trampoline, and the trampoline reads the address on every call. At first, the address in the GOT points to the linker's resolution routine; once the linker finds the real address (lazy binding), the GOT is updated.

Profilers like memray and heaptrack work by locating the GOT and overwriting the address with that of their own functions. This can be done at runtime, so it allows attaching and activating/deactivating.

LD_PRELOAD works the same way, except that it interposes the symbol when the linker resolves it, so the interposed address is what lands in the first GOT update. It has several disadvantages, though (for example, it cannot be deactivated and attaching won't work).

The mechanism needs your function to have a PLT/GOT pair.
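
The mechanism described above can be mimicked with a toy model. This is not how memray is implemented, only a Python analogy: the `Process` class, its `got` dictionary and `call_via_plt` are invented names standing in for a binary, its Global Offset Table and the PLT trampoline.

```python
class Process:
    """Toy model of one binary: a GOT mapping symbol names to targets."""

    def __init__(self, definitions):
        self._definitions = definitions  # what the "linker" can find
        self.got = {}                    # filled lazily, on first call

    def _resolve(self, name):
        # Stand-in for the linker's lazy resolution routine.
        self.got[name] = self._definitions[name]
        return self.got[name]

    def call_via_plt(self, name, *args):
        # The trampoline reads the GOT slot on every call.
        target = self.got.get(name) or self._resolve(name)
        return target(*args)


def real_malloc(size):
    return f"block[{size}]"


proc = Process({"malloc": real_malloc})
proc.call_via_plt("malloc", 16)          # first call triggers resolution

seen = []

def traced_malloc(size):
    seen.append(size)                    # the profiler records the request
    return real_malloc(size)             # then forwards to the real function

original = proc.got["malloc"]
proc.got["malloc"] = traced_malloc       # profiler attaches by patching the slot
proc.call_via_plt("malloc", 1024)
proc.got["malloc"] = original            # profiler detaches
proc.call_via_plt("malloc", 32)

print(seen)  # [1024]
```

Only the call made while the slot was patched is recorded, which is what makes attaching and deactivating at runtime possible.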

pablogsal · 2024-04-10T12:32:12Z

With this explanation you can see the cost: PLT trampolines require an extra read from the GOT and an extra jump, which makes every call a bit more inefficient.

pablogsal · 2024-04-10T12:34:50Z

-fno-semantic-interposition deactivates this mechanism for intra-library calls. For example, malloc in libc needs to be exposed so that other libraries can call it. Libraries linking against malloc need a PLT/GOT entry because they don't know where malloc lives, so they must let the linker resolve the address at load time. (The linker could patch every call site instead of trampolining, but that requires as many relocations as there are call sites, which is very inefficient. With the indirection, the linker relocates once and every call site reads from that single slot.) libc itself doesn't really need this mechanism, because malloc lives inside it. You could still use PLT jumps to allow interposing malloc inside libc (so profilers and debuggers work), or you could use -fno-semantic-interposition so that internal malloc calls skip the indirection, but then profilers won't see those calls.
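
The difference between the two kinds of call can be shown with another toy analogy in Python. Everything here is hypothetical: `got` stands in for the library's Global Offset Table, `lib_alloc_interposable` for a call compiled with the indirection, and `lib_alloc_direct` for a call bound straight to the definition, as with -fno-semantic-interposition.

```python
recorded = []


def mi_malloc(size):
    return bytearray(size)


# Only calls routed through this table can be redirected later.
got = {"mi_malloc": mi_malloc}


def lib_alloc_interposable(size):
    # Looks the target up on every call, like a PLT/GOT call.
    return got["mi_malloc"](size)


def lib_alloc_direct(size, _target=mi_malloc):
    # Target fixed when the function is defined, like a direct call.
    return _target(size)


def traced(size):
    recorded.append(size)
    return mi_malloc(size)


got["mi_malloc"] = traced        # the "profiler" patches the table

lib_alloc_interposable(1024)     # seen by the profiler
lib_alloc_direct(2048)           # bypasses the table, invisible

print(recorded)  # [1024]
```

The direct call is cheaper because it skips the lookup, and for the same reason the profiler never learns about the 2048-byte request.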

pitrou · 2024-04-10T12:46:04Z

Ok, so --dynamic-list doesn't work for a statically linked mimalloc:

ld.gold: warning: Cannot export local symbol 'mi_malloc'

I think this might work, though it would be worse performance-wise:

> Maybe you can wrap the allocator in some call that's exported and use that internally and mark that wrapper as __attribute__((visibility("default"))). We could override the wrapper.

pablogsal · 2024-04-10T12:48:47Z

> ld.gold: warning: Cannot export local symbol 'mi_malloc'

You may need to mark it as __attribute__((visibility("default"))) I am afraid :(

pitrou · 2024-04-10T15:59:22Z

Ok, I've got a PR which creates such interposable wrappers in Arrow. I've checked that they can be interposed using LD_PRELOAD:
apache/arrow#41128

pablogsal · 2024-04-10T16:16:12Z

Ok, I will discuss with @godlygeek what's the best way to support something like this soon.

pitrou · 2024-04-10T16:31:08Z

Also note you can download prebuilt wheels from the aforementioned PR using these links. Click on one of the green "Crossbow" badges, then click on the "Summary" link on the Github Actions page, then download the artifact at the bottom of the summary page.

pitrou added the enhancement New feature or request label Apr 10, 2024

pitrou mentioned this issue Apr 10, 2024

[Python] Only convert in parallel for the ConsolidatedBlockCreator class for large data apache/arrow#40301

Closed

pitrou mentioned this issue Apr 10, 2024

EXPERIMENT: [C++] Access mimalloc through dynamically-resolved symbols apache/arrow#41128

Draft
