Out of memory in rc_recv_desc memory pool #9902
@kevinsala can you try forcing synchronization between the sender and receiver by using MPI_Issend (instead of MPI_Isend) for the last message in every window (msg == NMessages-1)?
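A minimal sketch of the suggested pattern, assuming the sender's inner loop posts `NMessages` nonblocking sends per window (`NMessages`, `msgsize`, `dst`, `buf`, and `reqs` are taken from the benchmark description later in this issue; the exact program is not shown here):

```c
/* Sketch of the suggested workaround: make the last send of each window
 * synchronous, so the sender cannot finish the window until the receiver
 * has matched that message. */
for (int msg = 0; msg < NMessages; ++msg) {
    char *p = buf + (size_t)msg * msgsize;
    if (msg == NMessages - 1)
        /* An MPI_Issend request completes only once the matching receive
         * has been matched on the other side. */
        MPI_Issend(p, msgsize, MPI_CHAR, dst, msg, MPI_COMM_WORLD, &reqs[msg]);
    else
        MPI_Isend(p, msgsize, MPI_CHAR, dst, msg, MPI_COMM_WORLD, &reqs[msg]);
}
MPI_Waitall(NMessages, reqs, MPI_STATUSES_IGNORE);
```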
Using synchronous sends for the last message in every iteration avoids the out-of-memory error; the memory usage stays around 180 MB per process. Is there any throttling mechanism inside UCX to avoid this issue (without a workaround on the application side)? We observe this problem in a task-based MPI+OpenMP application that does not use …
Currently there is no throttling mechanism in UCX for unexpected tags, though it would be a good feature to add. Is it possible to add a blocking MPI_Ssend once in a while to create such synchronization?
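One possible form of that suggestion, sketched under the same assumptions as above (`SYNC_INTERVAL` is a hypothetical tuning knob, and `iter` is the current iteration index):

```c
/* Post the first NMessages-1 sends with MPI_Isend as before; then, every
 * SYNC_INTERVAL iterations, make the final send a blocking MPI_Ssend so the
 * sender stalls until the receiver has matched it. */
int last = NMessages - 1;
char *p = buf + (size_t)last * msgsize;
if (iter % SYNC_INTERVAL == 0) {
    MPI_Ssend(p, msgsize, MPI_CHAR, dst, last, MPI_COMM_WORLD);
    MPI_Waitall(last, reqs, MPI_STATUSES_IGNORE);   /* only the isends */
} else {
    MPI_Isend(p, msgsize, MPI_CHAR, dst, last, MPI_COMM_WORLD, &reqs[last]);
    MPI_Waitall(NMessages, reqs, MPI_STATUSES_IGNORE);
}
```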
Sorry for the late response. We tried combining MPI_Isend with some MPI_Issend calls throughout the execution to force some synchronization, but the results didn't improve. We have been doing further experiments with this benchmark and have seen that using multiple VCIs with MPICH solves (or hides) this performance issue. The installation that has been giving us trouble is an MPICH configured with a global critical section and one VCI (the default):

This installation runs around 3x slower in our benchmark than an OpenMPI built over the same UCX. If we instead use an MPICH configured with a per-VCI critical section, multiple VCIs (16), and the implicit VCI selection method, we match the performance of OpenMPI. Although we compile for a maximum of 64 VCIs, we use only 16 at run time (…).

Could it be that a single UCX communication context is not enough for this case (even though the application only communicates from a single thread in each process)?
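For reference, a hypothetical sketch of one way to exploit multiple VCIs from a single thread, assuming that MPICH's implicit VCI selection can map different communicators to different VCIs; the build and runtime settings in the comment follow MPICH 4.x option names and should be checked against your version (the issue's exact configure lines were not captured here):

```c
/* Assumed build: ./configure --enable-thread-cs=per-vci --with-ch4-max-vcis=64
 * Assumed run:   MPIR_CVAR_CH4_NUM_VCIS=16 mpiexec ...
 * With implicit VCI selection, traffic on distinct communicators may be
 * spread across distinct VCIs, so we duplicate MPI_COMM_WORLD and
 * round-robin the message stream over the duplicates (NCOMMS is a
 * hypothetical knob). */
#define NCOMMS 16
MPI_Comm comms[NCOMMS];
for (int i = 0; i < NCOMMS; ++i)
    MPI_Comm_dup(MPI_COMM_WORLD, &comms[i]);

for (int msg = 0; msg < NMessages; ++msg)
    MPI_Isend(buf + (size_t)msg * msgsize, msgsize, MPI_CHAR, dst, msg,
              comms[msg % NCOMMS], &reqs[msg]);
```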
The MPI program below gets an out-of-memory error because UCX tries to allocate too many descriptors from the `rc_recv_desc` memory pool. The program performs thousands of iterations, where each iteration passes data from the first process to the last one: process 0 sends to process 1, process 1 sends to process 2, and so on. The first process only sends data, the last one only receives, and the rest both send and receive. In each iteration, the data is exchanged in multiple messages (4096 messages of 4096 bytes each).
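The program itself was not captured in this copy of the issue; the following is a minimal reconstruction from the description above (the iteration count is a placeholder), not the reporter's exact code:

```c
/* Pipeline benchmark sketch: each rank isends a window of messages to the
 * next rank and irecvs a window from the previous rank, then waits on all. */
#include <mpi.h>
#include <stdlib.h>

#define NITERS    10000   /* placeholder; the real count is "thousands" */
#define NMESSAGES 4096
#define MSGSIZE   4096

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *sbuf = malloc((size_t)NMESSAGES * MSGSIZE);
    char *rbuf = malloc((size_t)NMESSAGES * MSGSIZE);
    MPI_Request reqs[2 * NMESSAGES];

    for (int iter = 0; iter < NITERS; ++iter) {
        int nreqs = 0;
        /* Every rank but the last sends a window to the next rank. */
        if (rank < size - 1)
            for (int msg = 0; msg < NMESSAGES; ++msg)
                MPI_Isend(sbuf + (size_t)msg * MSGSIZE, MSGSIZE, MPI_CHAR,
                          rank + 1, msg, MPI_COMM_WORLD, &reqs[nreqs++]);
        /* Every rank but the first receives a window from the previous rank. */
        if (rank > 0)
            for (int msg = 0; msg < NMESSAGES; ++msg)
                MPI_Irecv(rbuf + (size_t)msg * MSGSIZE, MSGSIZE, MPI_CHAR,
                          rank - 1, msg, MPI_COMM_WORLD, &reqs[nreqs++]);
        /* Rank 0 waits only on sends, which can complete eagerly, so it can
         * run ahead of the receivers -- consistent with the behavior below. */
        MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);
    }

    free(sbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}
```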
The out-of-memory error always occurs in process 1. It seems that process 0 executes iterations significantly ahead of the other processes (I guess because the `MPI_Waitall` in process 0 does not synchronize with the receives of process 1). For instance, when the application crashes, process 0 has just executed iteration 7152 while the rest have just processed iteration 3743:

I'm attaching a PDF with the heap profile of process 1 obtained with gperftools: memory.pdf. The profile shows the memory consumed in the last moments before the out-of-memory error (around 110 GB).
Most of the memory is allocated by the `ucp_worker_progress` call inside `MPI_Waitall`.

In the last moments of the execution, the debug information printed by process 1 with UCX (`UCX_LOG_LEVEL=debug`) is the following:

Environment
The executions use MPICH 4.2.1 over UCX 1.16.0, but I've observed the same error with previous UCX releases and also with OpenMPI 4.1.6 over UCX.
Commands to reproduce
I can reproduce this error running on four processes across four nodes: