"Buddy Allocator" in DART: Status #220

Closed
devreal opened this issue Dec 22, 2016 · 11 comments

@devreal (Member) commented Dec 22, 2016

Inside DART, we use a buddy allocator that was taken from https://github.com/cloudwu/buddy. However, there is no license shipped with it. I assume that we cannot just use it without explicit permission?

There is a refactoring TODO, but so far no one seems to have touched it. Do we actually need our own allocator here? Can we show that malloc/free is significantly slower? As it stands right now, this allocator is not thread-safe, and the implementation in general does not suit even our (non-existent) code quality standards.

Any reason to keep it?

@fuchsto (Member) commented Dec 26, 2016

@devreal We got in touch with the author of the original buddy allocator code, and there is no problem with respect to copyright. The code does not match our requirements anyhow, so it's an ancient refactoring ToDo.

If you see a way to replace it without putting too much effort into it, I will help as best as I can.
If reworking the local allocator does not conveniently fit into your task model research, leave it alone and let us (@fuerlinger and @fuchsto) delegate this to student work.
There are a couple of students in my course alone who would gladly experiment with this problem.

@devreal (Member, Author) commented Jan 9, 2017

Thanks for the clarification. I think the permission should be documented somewhere.

The question still remains: do we really need a custom allocator implementation? The buddy allocator is only used in dart_memalloc, which seems to be a replacement for malloc, returning local memory wrapped in a global pointer. I'm not convinced that the added complexity (and the effort to get it right) brings any significant benefit over using malloc/free, especially since dart_memalloc is not likely to be on a critical path. We use malloc/free in many other places as well.

An additional drawback of using a custom allocator is the loss of memory-checking facilities: double-free checks in libc, Valgrind, you name it.
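To make this concrete, a rough sketch of what a malloc-backed dart_memalloc could look like; the gptr field names (unitid, segid, addr_or_offs) and the exact signatures are assumptions based on my reading of dart_globmem.h, not the actual implementation:

```c
#include <stdlib.h>
#include <dash/dart/if/dart_globmem.h>  /* dart_gptr_t; include path assumed */

/* Rough sketch only: allocate local memory with malloc and wrap the raw
 * pointer in a global pointer.  Field names follow my reading of
 * dart_globmem.h and may not match exactly. */
dart_ret_t dart_memalloc(size_t nbytes, dart_gptr_t *gptr)
{
  void *addr = malloc(nbytes);
  if (addr == NULL) {
    return DART_ERR_OTHER;
  }
  dart_myid(&gptr->unitid);         /* the calling unit owns the memory     */
  gptr->segid             = 0;      /* non-collective (local) allocation    */
  gptr->flags             = 0;
  gptr->addr_or_offs.addr = addr;   /* store the raw pointer, not an offset */
  return DART_OK;
}

/* dart_memfree would then reduce to free() on the wrapped pointer. */
dart_ret_t dart_memfree(dart_gptr_t gptr)
{
  free(gptr.addr_or_offs.addr);
  return DART_OK;
}
```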

@rkowalewski

I agree with @devreal; however, local memory allocation is not a collective team allocation. If we do not use a buddy allocator, we still need a data structure to internally track all locally allocated memory segments and free them, either at runtime when they are no longer needed or at dart_finalize. That is where the buddy allocator comes in handy.

@devreal (Member, Author) commented Jan 9, 2017

Good point. But we do not need an internal data structure to free the memory at runtime, i.e., through dart_memfree, because calling free on the pointer contained in the gptr is sufficient. It is debatable whether DART should be responsible for freeing locally allocated memory upon dart_finalize or whether this is the user's responsibility. I would vote for the latter, and it is safe to do so.

On a more general point: what is the point of having dart_memalloc at all? Just to be able to wrap local memory in a global pointer? Why not have the user call dart_gptr_setaddr on user-malloc'ed memory (or even assign to the fields of the public struct)? The documentation states that dart_memalloc

Allocates memory [...] in the global address space of the calling unit [...]

which it apparently does not, because the buddy allocator only uses local memory (I put a \todo in dart_globmem.h earlier). The memory is never registered with the underlying transport, and no window is associated with it.
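For illustration, the user-side variant could look roughly like this; dart_gptr_setaddr and DART_GPTR_NULL are used as I understand them from dart_globmem.h, so treat the details as assumptions:

```c
#include <stdlib.h>
#include <dash/dart/if/dart_globmem.h>  /* include path assumed */

/* Sketch: wrap plain malloc'ed memory in a global pointer through the
 * public DART interface instead of calling dart_memalloc. */
static dart_gptr_t wrap_local_memory(size_t nbytes)
{
  dart_gptr_t gptr = DART_GPTR_NULL;
  void *buf = malloc(nbytes);
  dart_myid(&gptr.unitid);        /* the calling unit owns the memory    */
  dart_gptr_setaddr(&gptr, buf);  /* store the local address in the gptr */
  return gptr;                    /* freeing is then just free(buf)      */
}
```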

@fuchsto (Member) commented Jan 10, 2017

Hmyaa, the wording is specific. Memory is allocated in the unit's global memory space (because it is accessible via global pointer), but not in the unit's shared global memory space.

If I remember correctly, I use dart_memalloc to allocate memory before it is attached (published in shared global memory space) in dynamic global allocators.

But I'm not religious about this part of the DART interface in particular.

@rkowalewski

Actually there are two things to clarify. At the moment we have two windows:

  • One for collective team allocation, in which we allocate memory related to all containers of a specific team (dart_team_memalloc or dart_team_memregister). Each memory segment is dynamically allocated (i.e., via malloc) and then attached.
  • Another one for local memory allocation, in which we use the buddy allocator. The buddy allocator is initialized at dart_initialize with one huge memory segment (16 MB), which is attached to a separate window for local memory allocation. Whenever we need a local memory allocation, we get a memory segment of the requested size from the buddy allocator and wrap it in a global pointer.

The actual question is: what does LOCAL memory allocation mean? And does it, in turn, require a buddy allocator? It is currently used in only a few places. One use case is, for example, dash::shared, where only one unit (i.e., the owner) allocates LOCAL memory and broadcasts the global pointer to all other units. IMHO this is the same as a global allocation in a team in which all units allocate 0 bytes except the owner. Then no broadcast is required.

None of the examined examples really needs a buddy allocator. We can do it the way described by @devreal, i.e., call malloc, attach the memory, and finally free it when it is no longer needed.
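At the MPI level, the pattern described above (all units allocating 0 bytes except the owner) could look roughly like the sketch below; this is plain MPI rather than the DART interface, and the function name and sizes are made up for illustration:

```c
#include <mpi.h>

/* Sketch: collective allocation in which only the owner contributes memory;
 * all other units pass a size of 0.  Every unit gets the same window handle,
 * so no broadcast of a global pointer is needed -- remote accesses simply
 * target (owner, offset 0). */
void shared_value_alloc(MPI_Comm comm, int owner, MPI_Aint nbytes, MPI_Win *win)
{
  int   rank;
  void *base = NULL;
  MPI_Comm_rank(comm, &rank);
  MPI_Win_allocate(rank == owner ? nbytes : 0,   /* only the owner allocates */
                   1, MPI_INFO_NULL, comm, &base, win);
}
```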

@fuerlinger (Contributor)

The original idea was to have two different kinds of memory allocation in DART

(A) A collective operation: every unit in a team participates and contributes an equal amount of memory
(B) A non-collective operation: only one unit (the others are not involved at all) allocates memory that is, however, meant to be accessible by the other units; "local" might be the wrong word to use in this context.

Since window creation in MPI is always a collective operation, (B) can only work in DART-MPI by pre-allocating a chunk of memory and managing that chunk manually. This is where the buddy allocator comes in. It was originally used for a similar purpose in the DART-CUDA implementation and was at some point incorporated into the DART-MPI code by Huan.

It's true that this feature (the non-collective allocation) is not currently used much in DASH, but I would retain it.
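To spell out why (B) requires the pre-allocated chunk: window creation is collective, so the window has to be set up once (e.g., at dart_initialize), and non-collective allocations are then carved out of it purely locally. A minimal sketch of that idea, with a trivial bump allocator standing in for the buddy allocator; names and sizes are illustrative only:

```c
#include <mpi.h>
#include <stddef.h>

#define LOCAL_POOL_SIZE (16 * 1024 * 1024)   /* 16 MB, as used in DART-MPI */

static MPI_Win   local_win;     /* window created collectively at init      */
static char     *local_pool;    /* this unit's pre-allocated chunk          */
static MPI_Aint  next_offset;   /* trivial bump allocator, for illustration */

/* Collective: done once, e.g. inside dart_initialize. */
void local_pool_init(MPI_Comm comm)
{
  MPI_Win_allocate(LOCAL_POOL_SIZE, 1, MPI_INFO_NULL, comm,
                   &local_pool, &local_win);
  next_offset = 0;
}

/* Non-collective: each unit hands out offsets from its own chunk.
 * A real implementation also needs to free segments -- hence the buddy
 * allocator instead of this bump pointer. */
MPI_Aint local_pool_alloc(MPI_Aint nbytes)
{
  if (next_offset + nbytes > LOCAL_POOL_SIZE) {
    return (MPI_Aint)-1;                  /* pool exhausted */
  }
  MPI_Aint off = next_offset;
  next_offset += nbytes;
  return off;   /* remote units access (my_rank, off) in local_win */
}
```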

@devreal (Member, Author) commented Jan 10, 2017

Ahh, thanks for the clarifications. I was not aware of the subtle difference between global and shared global memory... I understand that there is a need for non-collective allocations, and I can see the use case. Two points to make here:

  1. It might be more efficient to use MPI_Alloc_mem instead of a custom allocator to allocate memory that is accessible from other units, although the standard does not mandate the use of this function for memory allocation.
  2. Copying a DART gptr is difficult right now. While it is legal to copy a gptr referring to non-collectively allocated memory, it can become problematic for collectively allocated memory. The reason is the segment ID in the gptr: in the first case, the segment ID is always zero, hence no problem (although it is rather a coincidence that this works). In the latter case, the segment ID is arbitrary and not synchronized between units. See also "Segment IDs and dynamic global data structures" #221. We need to clarify whether copying a gptr to other units is supported or not.

Given the explanations above, I would make the case for keeping dart_memalloc but changing the memory allocation to MPI_Alloc_mem. The second point I made still has to be dealt with, but that is independent of the discussion on the buddy allocator.
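For reference, the MPI_Alloc_mem variant essentially swaps malloc/free for the MPI allocation calls; a minimal sketch (the surrounding gptr handling is omitted):

```c
#include <mpi.h>

/* Sketch: allocate memory with MPI_Alloc_mem instead of malloc.  Memory
 * obtained this way may be faster for RMA on some implementations, but the
 * standard does not require it for attachment to a dynamic window. */
void *alloc_rma_memory(MPI_Aint nbytes)
{
  void *buf = NULL;
  MPI_Alloc_mem(nbytes, MPI_INFO_NULL, &buf);
  return buf;          /* release later with MPI_Free_mem(buf) */
}
```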

@devreal (Member, Author) commented Jan 11, 2017

Argh, it's only after I started to shuffle code around to get rid of the buddy allocator that I understood what the buddy allocator is used for. Good documentation ftw! It turns out the allocator does not actually allocate memory from the OS but returns an index into some externally allocated chunk of memory. To add to the confusion, the variable holding the "allocator" instance in DART is called dart_localpool 🤦‍♂️

Google did a good job of translating the Chinese blog post, which is literally the only documentation there is:

Today, when I think of dinner, I need a custom memory allocator. [...]

Anyway, I realized that in dart_memalloc this index is used as an offset into a shared-memory window and is stored as the offset in the gptr. This is what makes the shared-memory window optimization possible. However, we cannot mix dynamic windows and shared-memory windows, so we need a pre-allocated shared-memory window and cannot use malloc/free or MPI_Alloc_mem with subsequent attachment to a dynamic window in dart_memalloc. Oh boy... Then let's keep the buddy allocator, fix it, and just make it thread-safe using a mutex (unless someone wants to come up with a new implementation or adapt one of the existing allocators, e.g., jemalloc).
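A minimal sketch of the mutex idea, assuming the cloudwu/buddy-style API in which buddy_alloc returns an offset into the pre-allocated pool and buddy_free takes that offset back; plain pthreads are used here purely for illustration:

```c
#include <pthread.h>
#include "buddy.h"   /* cloudwu/buddy: buddy_alloc / buddy_free (API assumed) */

/* Sketch: serialize access to the shared buddy-allocator instance so that
 * concurrent dart_memalloc / dart_memfree calls are safe. */
static struct buddy    *dart_localpool;   /* the pre-allocated 16 MB pool */
static pthread_mutex_t  pool_mtx = PTHREAD_MUTEX_INITIALIZER;

static int locked_buddy_alloc(int size)
{
  pthread_mutex_lock(&pool_mtx);
  int offset = buddy_alloc(dart_localpool, size);  /* offset into the pool */
  pthread_mutex_unlock(&pool_mtx);
  return offset;
}

static void locked_buddy_free(int offset)
{
  pthread_mutex_lock(&pool_mtx);
  buddy_free(dart_localpool, offset);
  pthread_mutex_unlock(&pool_mtx);
}
```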

Fun fact I discovered: the way we currently use it (16 MB with 24 levels in the binary tree, i.e., a smallest chunk size of 1 byte), the allocator uses 32 MB of metadata to manage these 16 MB. Going down to 21 levels (smallest chunk 8 bytes) would probably be sufficient and would only require 4 MB of metadata.
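For the record, these numbers are consistent with the allocator keeping roughly one byte of tree state per node of a full binary tree, i.e., about 2^(levels+1) bytes of metadata; that is my reading of the code, so treat it as an assumption:

```c
#include <stddef.h>

/* Back-of-the-envelope check, assuming ~2^(levels+1) bytes of tree metadata:
 *   24 levels: 2^25 B = 32 MB of metadata for the 2^24 B = 16 MB pool
 *   21 levels: 2^22 B =  4 MB of metadata, smallest chunk 2^(24-21) = 8 B   */
size_t buddy_metadata_bytes(unsigned levels) { return (size_t)1 << (levels + 1); }
```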

I apologize for my lack of understanding earlier and the proposals made based on it.

@fuchsto (Member) commented Feb 18, 2017

@devreal Is this issue still relevant, or does #118 cover it?

@devreal (Member, Author) commented Feb 19, 2017

I've been meaning to close this as we will not get rid of the buddy allocator anytime soon (unless someone wants to get his hands dirty...). Closing it as mostly fixed in #118.

@devreal closed this as completed Feb 19, 2017