Describe the bug
it's not really a bug, just asking for advice;
we use Alluxio to accelerate dataset access in AI training; the training pod reads files via alluxio-fuse (the file is mmapped and read at non-fixed positions); however, during training we found that GPU usage sometimes dropped to 0;
according to the log position, reading data was stuck (for nearly 2 minutes); because mmap is used, strace cannot see a read syscall, but there were no other syscalls while it was stuck (other interfering factors have been ruled out);
meanwhile, we enabled the debug log for FUSE and found that during the period when it was stuck, it did not receive any requests. The time taken between the entry and exit of FUSE requests was very short (at the millisecond level) both before and after it got stuck;
furthermore, we used strace to trace the read and write behavior of the FUSE process on /dev/fuse.
it appears that there were no requests with particularly high latency;
under such circumstances, could it be that the kernel is causing this, either by not dispatching requests to FUSE quickly enough, or by not returning the responses to the upper-level application promptly enough?
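To help narrow this down outside the training job, the access pattern described above (mmap a file, then read at random positions) can be reproduced with a small probe that times each read and records the slow ones. This is only a minimal sketch: the function name `probe_mmap_reads`, its parameters, and the default 1 second threshold are illustrative assumptions, not part of Alluxio or the training code.

```python
import mmap
import os
import random
import time


def probe_mmap_reads(path, num_reads=1000, read_size=4096,
                     stall_threshold_s=1.0, seed=0):
    """Read random offsets of an mmapped file; return (offset, seconds) for slow reads.

    All names and defaults here are hypothetical, chosen for illustration.
    """
    rng = random.Random(seed)
    stalls = []
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        if size == 0:
            return stalls  # mmap cannot map an empty file
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for _ in range(num_reads):
                # Pick an offset so the read stays inside the file when possible.
                offset = rng.randrange(0, max(1, size - read_size + 1))
                start = time.monotonic()
                _ = mm[offset:offset + read_size]  # slicing copies bytes, faulting pages in
                elapsed = time.monotonic() - start
                if elapsed >= stall_threshold_s:
                    stalls.append((offset, elapsed))
    return stalls
```

Running this against a file under the alluxio-fuse mount point while also capturing the FUSE debug log would show whether a slow page fault lines up with a quiet period on the FUSE side. Pages that are already in the page cache will not generate FUSE requests, so a large file or a cold cache is needed for the probe to be meaningful.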
Between 13:37:22 and 13:39:20 (+8), the number of requests received from /dev/fuse decreased significantly (fuse-strace.log).
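A quiet window like this can be located mechanically by scanning the strace output for long gaps between consecutive `read(` calls. The sketch below assumes the trace was captured with `strace -tt` (so each line carries an `HH:MM:SS.ffffff` timestamp) and that it only covers the /dev/fuse descriptor; the actual format of fuse-strace.log is not shown here, so the regular expression and the helper name `find_request_gaps` are assumptions.

```python
import re

# Assumed line shape (strace -tt): "<pid> 13:37:22.123456 read(7, ..., 135168) = 64"
_TS_READ = re.compile(r"(\d{2}):(\d{2}):(\d{2})(?:\.(\d{1,6}))?\s+read\(")


def find_request_gaps(lines, min_gap_s=5.0):
    """Return (prev_time, next_time) pairs, in seconds since midnight,
    where consecutive read() calls are at least min_gap_s apart.

    Hypothetical helper; it does not handle traces that cross midnight.
    """
    gaps = []
    prev = None
    for line in lines:
        m = _TS_READ.search(line)
        if not m:
            continue  # skip writes, resumed calls, and unrelated lines
        h, mi, s, frac = m.groups()
        t = int(h) * 3600 + int(mi) * 60 + int(s)
        t += int(frac.ljust(6, "0")) / 1e6 if frac else 0.0
        if prev is not None and t - prev >= min_gap_s:
            gaps.append((prev, t))
        prev = t
    return gaps
```

For example, `find_request_gaps(open("fuse-strace.log"), min_gap_s=30.0)` would list every window of 30 seconds or more without a new read on the traced descriptor. Because a read on /dev/fuse blocks until the kernel has a request to hand over, the timestamps mark when the FUSE thread started waiting, so the gaps are an approximation of request arrival times.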
[Image: GPU_UTIL]
the FUSE debug log was rolled over and overwritten; however, judging from earlier FUSE debug logs, the behavior appears to be similar: there are only periodic statfs requests while training is stuck.
The request path is kernel -> JNI C++ -> Java.
The C++ side attaches a thread to Java and matches C++ objects to Java objects; this matching needs to be done in a very case-sensitive way.
Not quite sure whether mmap breaks some of the previous assumptions,
e.g. the way data is transferred, the way objects are managed, or the threading model.
Alluxio Version:
2.9.3