
io_uring to improve throughput in the background thread #549

Open
tarun-t opened this issue Sep 3, 2024 · 8 comments · May be fixed by #558
Labels: enhancement (New feature or request)


tarun-t commented Sep 3, 2024

We could potentially improve the throughput of the logger by implementing io_uring in the background thread.

Proposed changes:

  • Implement io_uring setup in the background thread initialization
  • Modify the write operations to use io_uring submission and completion queues (a rough sketch follows at the end of this comment)

Expected benefits:

  • Reduced system call overhead
  • Improved I/O performance, especially for high-frequency logging
  • Better scalability under high load

Considerations:

  • This would be Linux-specific and would require fallback mechanisms for other platforms
  • Careful benchmarking would be needed to quantify the improvements
  • May require adjusting the minimum supported Linux kernel version

Do you think this is something worth pursuing?
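For illustration, a minimal sketch of the proposed write path using liburing (UringWriter and queue_write are placeholder names, not Quill internals):

#include <liburing.h>
#include <cstddef>
#include <stdexcept>

// Illustrative only: the backend thread owns one ring, queues writes, and
// reaps completions when flushing.
class UringWriter
{
public:
  explicit UringWriter(unsigned queue_depth = 128)
  {
    if (::io_uring_queue_init(queue_depth, &_ring, 0) != 0)
      throw std::runtime_error{"io_uring_queue_init failed"};
  }

  ~UringWriter() { ::io_uring_queue_exit(&_ring); }

  // Note: `data` must stay valid until the completion is reaped - one of the
  // complications compared to a plain ::write() call.
  void queue_write(int fd, void const* data, std::size_t size)
  {
    ::io_uring_sqe* sqe = ::io_uring_get_sqe(&_ring);
    if (sqe == nullptr)
    {
      // submission queue ring is full; flush pending submissions and retry
      ::io_uring_submit(&_ring);
      sqe = ::io_uring_get_sqe(&_ring);
    }
    ::io_uring_prep_write(sqe, fd, data, static_cast<unsigned>(size), _offset);
    _offset += size;
    ::io_uring_submit(&_ring);
    ++_pending;
  }

  // Called on flush: block until all queued writes have completed.
  void wait_all()
  {
    while (_pending != 0)
    {
      ::io_uring_cqe* cqe = nullptr;
      if (::io_uring_wait_cqe(&_ring, &cqe) == 0)
      {
        // cqe->res holds the write(2)-style result; error handling omitted
        ::io_uring_cqe_seen(&_ring, cqe);
        --_pending;
      }
    }
  }

private:
  ::io_uring _ring{};
  unsigned long long _offset{0}; // assumes this writer is the file's only writer
  unsigned _pending{0};
};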

odygrd (Owner) commented Sep 3, 2024

Hey thanks for the ideas!

It can be implemented fairly easily as a standalone sink.

It's something I was considering adding 1-2 months ago. I put it on the back burner after reading this blog post: https://notes.eatonphil.com/2023-10-19-write-file-to-disk-with-io_uring.html

I will try to implement it at some point and see if it actually makes any difference.

odygrd added the enhancement label on Sep 3, 2024
tarun-t (Author) commented Sep 4, 2024

The results in the post were surprising, so I ran a benchmark of my own using fio-3.28.

System Config

  • intel i9-14900K 5GHz
  • NVMe SSD over PCIe4
  • Ubuntu 22.04
  • Linux Kernel 6.8.0-40-generic

I have not applied additional optimizations such as CPU isolation, disabling I/O scheduling, or setting IRQ affinity.

Benchmark Config

[global]
direct=1
time_based
runtime=30
group_reporting
filename=/tmp/test_file
size=4G
iodepth=32
numjobs=1
stonewall

[blocking-write]
rw=write
bs=128k
ioengine=sync

[blocking-read]
rw=read
bs=128k
ioengine=sync

[iouring-write]
rw=write
bs=128k
ioengine=io_uring
sqthread_poll=0

[iouring-read]
rw=read
bs=128k
ioengine=io_uring
sqthread_poll=0
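
(For reference: the four jobs above live in a single fio job file. Because stonewall is set in the [global] section, each job waits for the previous one to finish, so the tests run sequentially rather than concurrently.)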

Results

blocking-write: (groupid=0, jobs=1): err= 0: pid=614577: Wed Sep  4 09:18:45 2024
  write: IOPS=27.8k, BW=3475MiB/s (3644MB/s)(102GiB/30001msec); 0 zone resets
    clat (usec): min=29, max=5119, avg=32.07, stdev=56.37
     lat (usec): min=31, max=5122, avg=35.84, stdev=56.39
    clat percentiles (usec):
     |  1.00th=[   31],  5.00th=[   31], 10.00th=[   31], 20.00th=[   31],
     | 30.00th=[   31], 40.00th=[   31], 50.00th=[   31], 60.00th=[   31],
     | 70.00th=[   31], 80.00th=[   32], 90.00th=[   32], 95.00th=[   34],
     | 99.00th=[   51], 99.50th=[   51], 99.90th=[   57], 99.95th=[   96],
     | 99.99th=[ 4686]   
   bw (  MiB/s): min= 2245, max= 3550, per=100.00%, avg=3476.61, stdev=212.04, samples=59
   iops        : min=17967, max=28406, avg=27812.78, stdev=1696.37, samples=59
  cpu          : usr=11.75%, sys=21.99%, ctx=834239, majf=0, minf=14
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,834123,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

blocking-read: (groupid=1, jobs=1): err= 0: pid=614592: Wed Sep  4 09:18:45 2024
  read: IOPS=20.4k, BW=2550MiB/s (2674MB/s)(74.7GiB/30001msec)
    clat (usec): min=34, max=533, avg=48.87, stdev= 7.06
     lat (usec): min=34, max=533, avg=48.88, stdev= 7.06
    clat percentiles (usec):
     |  1.00th=[   43],  5.00th=[   45], 10.00th=[   46], 20.00th=[   47],
     | 30.00th=[   48], 40.00th=[   48], 50.00th=[   49], 60.00th=[   49],
     | 70.00th=[   50], 80.00th=[   50], 90.00th=[   51], 95.00th=[   53],
     | 99.00th=[   68], 99.50th=[   72], 99.90th=[  117], 99.95th=[  169],
     | 99.99th=[  338]   
   bw (  MiB/s): min= 1880, max= 2657, per=100.00%, avg=2550.76, stdev=133.96, samples=59
   iops        : min=15040, max=21258, avg=20406.05, stdev=1071.70, samples=59
  cpu          : usr=0.61%, sys=16.63%, ctx=612061, majf=0, minf=45
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=612016,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

iouring-write: (groupid=2, jobs=1): err= 0: pid=614607: Wed Sep  4 09:18:45 2024
  write: IOPS=49.5k, BW=6188MiB/s (6489MB/s)(181GiB/30001msec); 0 zone resets
    slat (usec): min=7, max=537, avg= 9.40, stdev= 1.57
    clat (usec): min=23, max=50729, avg=636.82, stdev=417.91
     lat (usec): min=31, max=50742, avg=646.23, stdev=417.92
    clat percentiles (usec):
     |  1.00th=[  482],  5.00th=[  594], 10.00th=[  594], 20.00th=[  594],
     | 30.00th=[  594], 40.00th=[  594], 50.00th=[  594], 60.00th=[  594],
     | 70.00th=[  611], 80.00th=[  627], 90.00th=[  652], 95.00th=[  660],
     | 99.00th=[  848], 99.50th=[ 4228], 99.90th=[ 5604], 99.95th=[ 5669],
     | 99.99th=[ 7832]   
   bw (  MiB/s): min= 5918, max= 6396, per=100.00%, avg=6188.22, stdev=113.91, samples=59
   iops        : min=47344, max=51172, avg=49505.75, stdev=911.37, samples=59
  cpu          : usr=14.57%, sys=36.07%, ctx=1445832, majf=0, minf=16
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1485253,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

iouring-read: (groupid=3, jobs=1): err= 0: pid=614660: Wed Sep  4 09:18:45 2024
  read: IOPS=54.9k, BW=6864MiB/s (7197MB/s)(201GiB/30001msec)
    slat (usec): min=5, max=117, avg= 7.39, stdev= 1.16
    clat (usec): min=142, max=4837, avg=575.20, stdev=94.67
     lat (usec): min=156, max=4843, avg=582.61, stdev=94.67
    clat percentiles (usec):
     |  1.00th=[  326],  5.00th=[  412], 10.00th=[  461], 20.00th=[  510],
     | 30.00th=[  537], 40.00th=[  553], 50.00th=[  578], 60.00th=[  594],
     | 70.00th=[  619], 80.00th=[  652], 90.00th=[  693], 95.00th=[  725],
     | 99.00th=[  791], 99.50th=[  816], 99.90th=[  873], 99.95th=[  947],
     | 99.99th=[ 1483]   
   bw (  MiB/s): min= 6823, max= 6884, per=100.00%, avg=6867.48, stdev=12.01, samples=59
   iops        : min=54590, max=55074, avg=54939.83, stdev=96.06, samples=59
  cpu          : usr=1.59%, sys=44.58%, ctx=928257, majf=0, minf=1037
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=1647384,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=3475MiB/s (3644MB/s), 3475MiB/s-3475MiB/s (3644MB/s-3644MB/s), io=102GiB (109GB), run=30001-30001msec

Run status group 1 (all jobs):
   READ: bw=2550MiB/s (2674MB/s), 2550MiB/s-2550MiB/s (2674MB/s-2674MB/s), io=74.7GiB (80.2GB), run=30001-30001msec

Run status group 2 (all jobs):
  WRITE: bw=6188MiB/s (6489MB/s), 6188MiB/s-6188MiB/s (6489MB/s-6489MB/s), io=181GiB (195GB), run=30001-30001msec

Run status group 3 (all jobs):
   READ: bw=6864MiB/s (7197MB/s), 6864MiB/s-6864MiB/s (7197MB/s-7197MB/s), io=201GiB (216GB), run=30001-30001msec

Disk stats (read/write):
  nvme0n1: ios=2257637/2320708, merge=0/493, ticks=976455/961054, in_queue=1937549, util=81.63%

Summary

Test             IOPS    Bandwidth (MiB/s)   Avg Latency (μs)   CPU Usage (usr/sys)
Blocking Write   27.8k   3475                35.84              11.75% / 21.99%
Blocking Read    20.4k   2550                48.88              0.61% / 16.63%
io_uring Write   49.5k   6188                646.23             14.57% / 36.07%
io_uring Read    54.9k   6864                582.61             1.59% / 44.58%

These numbers show that io_uring significantly outperforms blocking I/O in both read and write operations. I have run this benchmark with random writes and reads as well and saw similar performance gains.

odygrd (Owner) commented Sep 10, 2024

I've opened a draft PR; feel free to take a look and let me know your thoughts.

Other things I tried:

  • Tried different block sizes and chunk sizes
  • Using io_uring_peek_cqe() in a loop instead of io_uring_wait_cqe() significantly slowed down throughput, dropping to around 0.50 million msgs/sec (see the sketch after this list). I'm unsure of the reason, and the slowdown still occurs even when pinning the backend thread to CPU 1, the frontend thread to CPU 5, and the OS to CPU 0.
  • IORING_SETUP_SQPOLL caused a slowdown similar to io_uring_peek_cqe():
    struct io_uring_params params;
    memset(&params, 0, sizeof(params));
    params.flags |= IORING_SETUP_SQPOLL; // Enable kernel-side polling
    if (::io_uring_queue_init_params(default_queue_size, &_ring, &params) != 0)
  • O_DIRECT would be ideal, but it's quite complex to implement: we would need to submit writes in 512-byte blocks, and each block must be fully filled; otherwise we see NUL characters in the log file.
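For reference, the two completion-reaping approaches compared above look roughly like this (a sketch, not the PR's actual code):

#include <liburing.h>
#include <cerrno>

// Blocking variant: sleeps in the kernel until a completion arrives.
void reap_blocking(io_uring& ring)
{
  io_uring_cqe* cqe = nullptr;
  if (::io_uring_wait_cqe(&ring, &cqe) == 0)
  {
    ::io_uring_cqe_seen(&ring, cqe);
  }
}

// Busy-poll variant: spins in user space checking the completion queue.
// This is the approach that dropped throughput to around 0.50 million msgs/sec.
void reap_spinning(io_uring& ring)
{
  io_uring_cqe* cqe = nullptr;
  int ret;
  do
  {
    ret = ::io_uring_peek_cqe(&ring, &cqe); // returns -EAGAIN while nothing is ready
  } while (ret == -EAGAIN);                 // burns the core while waiting

  if (ret == 0)
  {
    ::io_uring_cqe_seen(&ring, cqe);
  }
}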

Results on my RHEL9 box with the current PR:

Without io_uring

[odygrd@binary-blaze:~]$ ./git/quill/cmake-build-release/benchmarks/backend_throughput/BENCHMARK_quill_backend_throughput
Throughput is 4.44 million msgs/sec average, total time elapsed: 901 ms for 4000000 log messages

With io_uring

[odygrd@binary-blaze:~]$ ./git/quill/cmake-build-release/benchmarks/backend_throughput/BENCHMARK_quill_backend_iouring_throughput
Throughput is 4.86 million msgs/sec average, total time elapsed: 822 ms for 4000000 log messages

odygrd linked a pull request on Sep 10, 2024 that will close this issue
tarun-t (Author) commented Sep 16, 2024

Thank you for submitting this pull request.

I've conducted testing on Ubuntu 24.04 LTS to evaluate the performance. The io_uring-based implementation shows approximately 6% lower throughput compared to the standard version. I'm currently investigating the cause of this performance discrepancy and will share my findings.

Regarding the code structure, I suggest implementing an option in FileSink to enable or disable io_uring support. This modification would allow classes like RotatingFileSink to inherit io_uring functionality, enhancing flexibility and maintainability.
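Something along these lines, purely as an illustration (the option name and plumbing are hypothetical, not actual Quill or PR code):

// Hypothetical sketch: an opt-in flag on the sink configuration, so FileSink
// and anything derived from it (e.g. RotatingFileSink) could select the
// io_uring write path at runtime.
class FileSinkConfig
{
public:
  void set_use_io_uring(bool value) { _use_io_uring = value; }
  bool use_io_uring() const noexcept { return _use_io_uring; }

private:
  bool _use_io_uring{false}; // default: existing blocking write path
};

// Inside the sink's write function the choice would then be roughly:
//   if (config.use_io_uring()) { queue the buffer via io_uring }
//   else                       { existing write path }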

Please let me know if you have any thoughts about this.

odygrd (Owner) commented Sep 16, 2024

The benchmark numbers above were from before the latest commits on master. With the recent improvements I introduced in the backend worker, it seems that sync I/O performance has now caught up with the io_uring sink.

I created this PR as a proof of concept to demonstrate how it could be implemented and to evaluate if it's worth adding.

One reason I’m hesitant to include it directly in FileSink is that we’d need to check for io_uring’s existence for all existing users. By keeping it in a separate header, we avoid affecting users who aren’t including it. The plan is to consider integration once we have solid proof that it brings throughput improvements.
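In other words, roughly (an illustrative guard with a hypothetical macro name, not the PR's actual code): if io_uring lived inside FileSink itself, every existing user of FileSink would end up needing a feature check such as

#if defined(__linux__) && __has_include(<liburing.h>)
  #define QUILL_HAS_IO_URING 1   // hypothetical macro name
  #include <liburing.h>
#else
  #define QUILL_HAS_IO_URING 0
#endif

whereas a separate, opt-in header confines the liburing dependency to the users who include it.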

odygrd (Owner) commented Sep 17, 2024

These are the numbers I am currently getting after syncing this PR with the master branch.

Without io_uring

$ /home/odygrd/git/quill/cmake-build-release/benchmarks/backend_throughput/BENCHMARK_quill_backend_throughput

Throughput is 5.50 million msgs/sec average, total time elapsed: 727 ms for 4000000 log messages

With io_uring

$ /home/odygrd/git/quill/cmake-build-release/benchmarks/backend_throughput/BENCHMARK_quill_backend_iouring_throughput

Throughput is 5.60 million msgs/sec average, total time elapsed: 713 ms for 4000000 log messages

I have also made the benchmark more accurate: the backend thread spends a few seconds on the first log message to calibrate rdtsc, and that time was previously included in the measurement. That's why you see different numbers.

tarun-t (Author) commented Sep 17, 2024

Here are my results with the latest code:

Setup

  • Ubuntu 24.04
  • liburing 2.5
  • Main thread bound to CPU 6
  • Backend thread bound to CPU 5

With io_uring:

~/quill/build/benchmarks/backend_throughput$ ./BENCHMARK_quill_backend_iouring_throughput
Throughput is 6.33 million msgs/sec average, total time elapsed: 631 ms for 4000000 log messages

Without io_uring:

~/quill/build/benchmarks/backend_throughput$ ./BENCHMARK_quill_backend_throughput
Throughput is 6.33 million msgs/sec average, total time elapsed: 632 ms for 4000000 log messages

Given the identical throughput, I investigated further with perf.

With io_uring:

perf stat ./BENCHMARK_quill_backend_iouring_throughput

 Performance counter stats for './BENCHMARK_quill_backend_iouring_throughput':

          1,639.29 msec task-clock                       #    1.792 CPUs utilized
            22,113      context-switches                 #   13.489 K/sec
                 1      cpu-migrations                   #    0.610 /sec
          1,41,639      page-faults                      #   86.403 K/sec
     <not counted>      cpu_atom/cycles/                                                        (0.00%)
    9,29,52,55,672      cpu_core/cycles/                 #    5.670 GHz
     <not counted>      cpu_atom/instructions/                                                  (0.00%)
   29,37,24,01,532      cpu_core/instructions/
     <not counted>      cpu_atom/branches/                                                      (0.00%)
    5,13,21,48,550      cpu_core/branches/               #    3.131 G/sec
     <not counted>      cpu_atom/branch-misses/                                                 (0.00%)
         10,19,512      cpu_core/branch-misses/
             TopdownL1 (cpu_core)                 #     28.1 %  tma_backend_bound
                                                  #      1.2 %  tma_bad_speculation
                                                  #     15.1 %  tma_frontend_bound
                                                  #     55.5 %  tma_retiring

       0.914614828 seconds time elapsed

       0.802307000 seconds user
       0.837321000 seconds sys

Without io_uring

perf stat ./BENCHMARK_quill_backend_throughput


 Performance counter stats for './BENCHMARK_quill_backend_throughput':

          1,628.63 msec task-clock                       #    1.849 CPUs utilized
               249      context-switches                 #  152.889 /sec
                 3      cpu-migrations                   #    1.842 /sec
          1,41,638      page-faults                      #   86.967 K/sec
    3,47,43,10,232      cpu_atom/cycles/                 #    2.133 GHz                         (0.07%)
    9,25,56,65,847      cpu_core/cycles/                 #    5.683 GHz                         (99.96%)
    3,67,63,65,363      cpu_atom/instructions/           #    1.06  insn per cycle              (0.07%)
   29,27,49,38,942      cpu_core/instructions/           #    8.43  insn per cycle              (99.96%)
      65,15,46,014      cpu_atom/branches/               #  400.057 M/sec                       (0.07%)
    5,14,74,98,224      cpu_core/branches/               #    3.161 G/sec                       (99.96%)
       2,01,73,868      cpu_atom/branch-misses/          #    3.10% of all branches             (0.07%)
          6,45,229      cpu_core/branch-misses/          #    0.10% of all branches             (99.96%)
             TopdownL1 (cpu_core)                 #     27.7 %  tma_backend_bound
                                                  #      1.2 %  tma_bad_speculation
                                                  #     15.7 %  tma_frontend_bound
                                                  #     55.4 %  tma_retiring             (99.96%)
             TopdownL1 (cpu_atom)                 #     24.6 %  tma_bad_speculation
                                                  #     24.9 %  tma_retiring             (0.07%)
                                                  #     29.4 %  tma_backend_bound
                                                  #     29.4 %  tma_backend_bound_aux
                                                  #     21.1 %  tma_frontend_bound       (0.07%)

       0.880604791 seconds time elapsed

       0.856803000 seconds user
       0.771724000 seconds sys

This suggests that the additional context switches are preventing io_uring from achieving higher throughput.

(Screenshot: thread list showing an additional io_uring worker thread)

The screenshot above shows that the io_uring version spawns an additional worker thread that inherits the parent's affinity. perf record -e context_switches showed that this thread is the source of the extra context switches.

I tried to locate this thread's creation by putting a catchpoint on clone and clone3, but couldn't trace it.
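For what it's worth, the extra thread is most likely one of io_uring's io-wq workers, which the kernel spawns to service buffered writes; it is created inside the kernel rather than through a userspace clone/clone3 call, which would explain why the catchpoint never fires. If that is the case, kernels 5.14+ allow restricting the workers' affinity explicitly. A minimal sketch, assuming liburing >= 2.2 (which exposes io_uring_register_iowq_aff) and a hypothetical choice of CPU 0:

#include <liburing.h>
#include <sched.h>

// Keep io_uring's worker threads off the isolated cores by pinning them to CPU 0.
void pin_iowq_workers(io_uring& ring)
{
  cpu_set_t mask;
  CPU_ZERO(&mask);
  CPU_SET(0, &mask); // hypothetical: the core left to the OS in this setup
  int const ret = ::io_uring_register_iowq_aff(&ring, sizeof(mask), &mask);
  if (ret != 0)
  {
    // e.g. -EINVAL on older kernels; fall back to the default affinity
  }
}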

odygrd (Owner) commented Sep 17, 2024

Thanks for pointing this out. I hadn't noticed the additional io_uring thread; I had assumed io_uring would do its work in an OS thread that stays off the isolated cores. This also explains why the busy-wait polling in io_uring was making things slower.
