Replies: 2 comments
-
One other detail: I would hope this isn't perturbing read performance, but the setup that I'm testing begins by creating files and fallocating a large amount of space. Read performance is then measured by sequentially reading data out of those fallocated files. I tried to avoid the usual "write/drop caches/read" sequence because I don't have sufficient privileges to drop caches on some of the machines I would like to test on. To be fair, I don't precisely know what fallocate() is doing under the covers, though.
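For reference, a minimal sketch of that benchmark setup (file name, size, and chunk size are assumptions, not the real test parameters): preallocate with fallocate(), then read back sequentially. With mode 0, fallocate() reserves blocks as unwritten extents without writing data, so reads from those extents generally return zeroes.

```c
/*
 * Hypothetical benchmark setup: fallocate() a file, then read it back
 * sequentially in 1 MiB chunks. Names and sizes are illustrative only.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    const off_t file_size = 1024L * 1024 * 1024;   /* 1 GiB, arbitrary */
    const size_t chunk = 1024 * 1024;              /* 1 MiB reads */

    int fd = open("testfile.dat", O_CREAT | O_RDWR, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Reserve space without writing any data. */
    if (fallocate(fd, 0, 0, file_size) < 0) { perror("fallocate"); return 1; }

    char *buf = malloc(chunk);
    if (!buf) return 1;

    /* Sequentially read the preallocated range back out. */
    for (off_t off = 0; off < file_size; ) {
        ssize_t n = pread(fd, buf, chunk, off);
        if (n <= 0) break;
        off += n;
    }

    free(buf);
    close(fd);
    return 0;
}
```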
-
We can close this. I've done some extensive testing since posting this question, and to make a long story short, I can only reproduce the "concurrent large read operations are overly sensitive to tuning parameters" behavior on systems running relatively old Linux kernels. Things are much more stable on newer kernels. Digging in further with an analysis of the response time of individual operations shows that there is something unnatural in the behavior of the problematic platforms, as if some fixed delay is putting a floor on response time independent of the drive speed. At any rate, carry on, nothing to see here.
-
The use case is a little hard to distill, but imagine a daemon that needs to read and write a high volume of data concurrently to a local file system on one or more NVMe drives. I have full control over the file layout (i.e. I don't have to maintain any canonical order; something else does that), so I can split data across files, write in log-structured order, page/block align, and so on.
On systems with good NVMe drives, the winning strategy is usually to align everything, use direct I/O, issue concurrent writes to multiple files, and maybe fallocate space ahead of time (I say maybe on the last one because that optimization exacts a heavy toll in management complexity).

In the past, the I/O has been issued as follows: one thread submits a read/write operation through an API abstraction layer. That operation gets queued and relayed to a pthread pool that services it with pread() or pwrite() and then signals completion. It's a little fancier than that in practice, but that's what it looks like to the kernel. I'm now experimenting with an alternative implementation based on liburing: rather than queuing things up for a thread pool, the abstraction layer immediately submits uring operations while another pthread sits in io_uring_wait_cqe() and signals completion.
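Roughly, the liburing side looks like the sketch below. The struct op, submit_read(), completion_thread(), and setup() names are hypothetical stand-ins for the abstraction layer; error handling and shutdown are elided.

```c
/*
 * Sketch of the liburing split described above: the caller submits read
 * SQEs directly, while a dedicated completion thread blocks in
 * io_uring_wait_cqe() and marks each operation done.
 */
#include <liburing.h>
#include <pthread.h>
#include <stdatomic.h>

struct op {                       /* per-operation state (hypothetical) */
    atomic_int done;
    int result;
};

static struct io_uring ring;

/* Submission path: called from the API abstraction layer. */
static int submit_read(int fd, void *buf, unsigned len, off_t offset,
                       struct op *op)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    if (!sqe)
        return -1;                /* SQ ring full; real code would retry */

    io_uring_prep_read(sqe, fd, buf, len, offset);
    io_uring_sqe_set_data(sqe, op);
    /* sqe->flags |= IOSQE_ASYNC;   <- the knob discussed below */
    return io_uring_submit(&ring);
}

/* Completion thread: waits for CQEs and signals the waiters. */
static void *completion_thread(void *arg)
{
    (void)arg;
    for (;;) {
        struct io_uring_cqe *cqe;
        if (io_uring_wait_cqe(&ring, &cqe) < 0)
            break;
        struct op *op = io_uring_cqe_get_data(cqe);
        op->result = cqe->res;
        atomic_store(&op->done, 1);
        io_uring_cqe_seen(&ring, cqe);
    }
    return NULL;
}

int setup(void)
{
    if (io_uring_queue_init(256, &ring, 0) < 0)
        return -1;
    pthread_t tid;
    return pthread_create(&tid, NULL, completion_thread, NULL);
}
```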
Concurrent small (4 KiB-ish) reads and writes are fine, as are large (1 MiB-ish) writes. Liburing is a nice performance win in some parts of the parameter space. It's also aesthetically pleasing not to have so many threads lying around.
I'm having a hard time understanding large (1 MiB-ish) read performance, though.
With threads+pread: performance peaks with something like 4 threads issuing concurrent 1 MiB reads. Going to 16 threads, it falls off a cliff, losing something like 75% of the bandwidth (I actually didn't realize this was going on until recently, oops).
With liburing: same shape as above, except the cliff hits even sooner and the drop is even steeper. I can make it track threads+pread performance by setting IOSQE_ASYNC; in that case it's no worse than the original implementation, even though it's not great.
Based on those large read results, it would appear that I should throttle concurrency and set IOSQE_ASYNC to get the most out of my storage devices. However, both of those are de-optimizations for writes and smaller read operations :) In those cases I'm getting more performance by issuing as fast as I can and leaving the flags clear.
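In concrete terms, the "different strategies per operation type" idea might look like the small sketch below, built on the hypothetical submit_read() above: only large reads get IOSQE_ASYNC (which hints that the request should be punted straight to the kernel's io-wq worker pool), while small reads and all writes are submitted with flags clear. The 1 MiB threshold is an assumption, not a measured cutoff.

```c
#include <liburing.h>

/* Hypothetical per-operation policy: force only large reads async. */
static void apply_io_policy(struct io_uring_sqe *sqe, int is_read,
                            unsigned len)
{
    if (is_read && len >= 1024 * 1024)
        sqe->flags |= IOSQE_ASYNC;   /* punt large reads to io-wq */
}
```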
I could adopt different strategies for different kinds of I/O operations to some degree, but I wanted to ask here first: am I just missing some trick that's needed to make large concurrent read performance more stable?
Thanks for having the patience to read such a long question :)