after using mmap, there's a small chance that the file read operation will get stuck for nearly 2 minutes #18584

wwq2333 · 2024-04-19T03:59:40Z

Alluxio Version:
2.9.3

Describe the bug
This is not really a bug; I'm just asking for advice.

We use Alluxio to accelerate dataset access in AI training. The training pod reads files through alluxio-fuse (it mmaps each file and reads at non-fixed positions). During training, however, we found that GPU utilization sometimes dropped to 0.

According to the logs, the data read was stuck for nearly 2 minutes. Because mmap is used, strace cannot see a read syscall, but the process issues no other syscalls while it is stuck (other interfering factors have been ruled out).
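Since mmap reads surface as page faults rather than read syscalls, one way to observe the stall from user space is to time each access directly. The following is a minimal sketch, not our training code: the function name `time_random_mmap_reads`, the chunk size, and the threshold are all assumptions chosen for illustration.

```python
import mmap
import os
import random
import time


def time_random_mmap_reads(path, n_reads=1000, chunk=4096, threshold_s=1.0, seed=0):
    """Read `chunk` bytes at random offsets of an mmap'd file and time each access.

    Returns (max_latency_seconds, slow_reads), where slow_reads lists
    (offset, latency_seconds) for accesses that took at least threshold_s.
    """
    rng = random.Random(seed)
    size = os.path.getsize(path)
    if size == 0:
        raise ValueError("cannot mmap an empty file")
    max_latency = 0.0
    slow_reads = []
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for _ in range(n_reads):
            # Pick an offset so that the whole chunk lies inside the file when possible.
            offset = rng.randrange(0, max(1, size - chunk + 1))
            start = time.monotonic()
            _ = mm[offset:offset + chunk]  # slicing copies the bytes, forcing any page fault
            elapsed = time.monotonic() - start
            max_latency = max(max_latency, elapsed)
            if elapsed >= threshold_s:
                slow_reads.append((offset, elapsed))
    return max_latency, slow_reads
```

Run against a file on the alluxio-fuse mount, any entry in `slow_reads` would mark an access that blocked inside the page-fault path. Pages already in the page cache return immediately, so a long run over a large file is more likely to hit the stall.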

Meanwhile, we enabled the FUSE debug log and found that FUSE did not receive any requests while the read was stuck. Both before and after the stall, the time between the entry and exit of each FUSE request was very short (at the millisecond level).

Furthermore, we used strace to trace the FUSE process's reads and writes on /dev/fuse:

strace -f -tt -q -T -P /dev/fuse -x -y -p "${FUSE_PID}"

It appears that no request had particularly high latency.
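The latency check can be automated by scanning the strace output for the `<seconds>` suffix that `-T` appends to each completed call. This is a rough sketch under the assumption that the log uses the usual strace line layout; the function name `slow_syscalls` and the 0.1 s threshold are illustrative only.

```python
import re

# strace -T appends the time spent in the call as "<seconds>" at the end of the line.
DURATION_RE = re.compile(r"<(\d+\.\d+)>\s*$")
# With -f, an interrupted call is completed on a "<... name resumed>" line.
RESUMED_RE = re.compile(r"<\.\.\. (\w+) resumed>")
# Otherwise the syscall name directly precedes the opening parenthesis.
SYSCALL_RE = re.compile(r"\b([a-z_0-9]+)\(")


def slow_syscalls(lines, threshold_s=0.1):
    """Return (syscall_name, duration_seconds, line) for calls at or above threshold_s."""
    result = []
    for line in lines:
        line = line.rstrip("\n")
        duration = DURATION_RE.search(line)
        if not duration:
            continue  # "<unfinished ...>" and signal lines carry no duration
        name = RESUMED_RE.search(line) or SYSCALL_RE.search(line)
        if not name:
            continue
        seconds = float(duration.group(1))
        if seconds >= threshold_s:
            result.append((name.group(1), seconds, line))
    return result
```

Note that a blocking read on /dev/fuse includes the time the daemon thread sat idle waiting for the next request, so a long read there indicates that no request arrived, not that a request was served slowly.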

Under these circumstances, could the kernel be causing this, either by not dispatching requests to FUSE quickly enough or by not returning responses to the application promptly enough?
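One way to probe the kernel side might be the FUSE control filesystem. On Linux, when fusectl is mounted (typically at /sys/fs/fuse/connections), each connection directory exposes a `waiting` file with the number of requests waiting to be transferred to userspace or being processed by the daemon. The sketch below polls those files; the function name `fuse_waiting_counts` is hypothetical, and reading the files may require sufficient privileges on the host.

```python
import os


def fuse_waiting_counts(root="/sys/fs/fuse/connections"):
    """Map each FUSE connection id to the integer in its 'waiting' file."""
    counts = {}
    if not os.path.isdir(root):
        return counts  # fusectl is not mounted, or FUSE is not in use
    for conn in sorted(os.listdir(root)):
        path = os.path.join(root, conn, "waiting")
        try:
            with open(path) as f:
                counts[conn] = int(f.read().strip())
        except (OSError, ValueError):
            continue  # connection vanished, permission denied, or unexpected content
    return counts
```

Sampling this once per second during a stall would show whether requests are queued in the kernel (a nonzero count) or whether none were issued at all (a count of zero), which could help narrow down where the two minutes are spent.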

wwq2333 · 2024-04-19T04:28:48Z

Between 13:37:22 and 13:39:20 (+8), the number of requests received from /dev/fuse decreased significantly.
fuse-strace.log
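The drop in request rate can be quantified from a log like this one. The following sketch counts read calls per second and finds the longest pause between them, assuming the `-tt` timestamp format (HH:MM:SS.microseconds) and assuming that the daemon receives requests through `read(`; both helper names are illustrative, and timestamps that wrap past midnight are not handled.

```python
import re
from collections import Counter

# strace -tt prints wall-clock time with microseconds, e.g. 13:37:22.123456
TIME_RE = re.compile(r"\b(\d{2}):(\d{2}):(\d{2})\.(\d{6})\b")


def reads_per_second(lines, needle="read("):
    """Count lines containing `needle`, keyed by their HH:MM:SS timestamp."""
    counts = Counter()
    for line in lines:
        m = TIME_RE.search(line)
        if m and needle in line:
            counts[m.group(0)[:8]] += 1
    return counts


def longest_gap(lines, needle="read("):
    """Return (gap_seconds, start_stamp, end_stamp) for the longest pause, or None."""
    stamps = []
    for line in lines:
        m = TIME_RE.search(line)
        if m and needle in line:
            h, mi, s, us = map(int, m.groups())
            stamps.append((h * 3600 + mi * 60 + s + us / 1e6, m.group(0)))
    best = None
    # Compare each matching line with the next one in file order.
    for (t0, s0), (t1, s1) in zip(stamps, stamps[1:]):
        if best is None or t1 - t0 > best[0]:
            best = (t1 - t0, s0, s1)
    return best
```

For example, `longest_gap(open("fuse-strace.log"))` would report the widest interval without a read on /dev/fuse. Because `-tt` records when each call starts, the result is an approximation of when requests arrived rather than an exact measurement.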

GPU_UTIL

The FUSE debug log was rolled over and overwritten. However, judging from earlier FUSE debug logs, the behavior appears to be similar: only periodic statfs requests arrive while training is stuck.

jja725 · 2024-04-19T17:56:13Z

@LuQQiu @jiacheliu3 Do you have any ideas about this issue?

wwq2333 · 2024-04-24T01:25:38Z

Any suggestions? @LuQQiu @jiacheliu3

LuQQiu · 2024-04-24T20:28:16Z

The request path is kernel -> JNI C++ -> Java.
The C++ side attaches threads to Java and matches C++ objects to Java objects, and this matching needs to be done in a very case-sensitive way.
I'm not quite sure whether mmap breaks some of the previous assumptions,
e.g. about how data is transferred, how objects are managed, or how threads are used.

wwq2333 added the type-bug (This issue is about a bug) label on Apr 19, 2024
