
io_uring to improve throughput in the background thread #549

Open
tarun-t opened this issue Sep 3, 2024 · 8 comments · May be fixed by #558
Labels: enhancement (New feature or request)


tarun-t commented Sep 3, 2024

We could potentially improve the throughput of the logger by implementing io_uring in the background thread.

Proposed changes:

  • Implement io_uring setup in the background thread initialization
  • Modify the write operations to use io_uring submission and completion queues (a rough sketch follows at the end of this comment)

Expected benefits:

  • Reduced system call overhead
  • Improved I/O performance, especially for high-frequency logging
  • Better scalability under high load

Considerations:

  • This would be Linux-specific and would require fallback mechanisms for other platforms
  • Careful benchmarking would be needed to quantify the improvements
  • May require adjusting the minimum supported Linux kernel version

Do you think this is something worth pursuing?
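For illustration, a minimal sketch of the proposed write path using liburing (UringWriter and queue_write are placeholder names, not Quill internals):

#include <liburing.h>
#include <cstddef>
#include <stdexcept>

// Illustrative only: the backend thread owns one ring, queues writes, and
// reaps completions when flushing.
class UringWriter
{
public:
  explicit UringWriter(unsigned queue_depth = 128)
  {
    if (::io_uring_queue_init(queue_depth, &_ring, 0) != 0)
      throw std::runtime_error{"io_uring_queue_init failed"};
  }

  ~UringWriter() { ::io_uring_queue_exit(&_ring); }

  // Note: `data` must stay valid until the completion is reaped - one of the
  // complications compared to a plain ::write() call.
  void queue_write(int fd, void const* data, std::size_t size)
  {
    ::io_uring_sqe* sqe = ::io_uring_get_sqe(&_ring);
    if (sqe == nullptr)
    {
      // submission queue ring is full; flush pending submissions and retry
      ::io_uring_submit(&_ring);
      sqe = ::io_uring_get_sqe(&_ring);
    }
    ::io_uring_prep_write(sqe, fd, data, static_cast<unsigned>(size), _offset);
    _offset += size;
    ::io_uring_submit(&_ring);
    ++_pending;
  }

  // Called on flush: block until all queued writes have completed.
  void wait_all()
  {
    while (_pending != 0)
    {
      ::io_uring_cqe* cqe = nullptr;
      if (::io_uring_wait_cqe(&_ring, &cqe) == 0)
      {
        // cqe->res holds the write(2)-style result; error handling omitted
        ::io_uring_cqe_seen(&_ring, cqe);
        --_pending;
      }
    }
  }

private:
  ::io_uring _ring{};
  unsigned long long _offset{0}; // assumes this writer is the file's only writer
  unsigned _pending{0};
};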

odygrd (Owner) commented Sep 3, 2024

Hey thanks for the ideas!

It can be implemented fairly easily as a standalone sink.

It's something I was considering adding 1-2 months ago. I put it on the back burner after reading this blog post: https://notes.eatonphil.com/2023-10-19-write-file-to-disk-with-io_uring.html

I will try to implement it at some point and see if it actually makes any difference.

odygrd added the enhancement label on Sep 3, 2024
tarun-t (Author) commented Sep 4, 2024

The results in the post were surprising, so I ran a benchmark of my own using fio-3.28.

System Config

  • intel i9-14900K 5GHz
  • NVMe SSD over PCIe4
  • Ubuntu 22.04
  • Linux Kernel 6.8.0-40-generic

I have not applied additional optimizations such as CPU isolation, disabling I/O scheduling, or setting IRQ affinity.

Benchmark Config

[global]
direct=1
time_based
runtime=30
group_reporting
filename=/tmp/test_file
size=4G
iodepth=32
numjobs=1
stonewall

[blocking-write]
rw=write
bs=128k
ioengine=sync

[blocking-read]
rw=read
bs=128k
ioengine=sync

[iouring-write]
rw=write
bs=128k
ioengine=io_uring
sqthread_poll=0

[iouring-read]
rw=read
bs=128k
ioengine=io_uring
sqthread_poll=0
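
(For reference: the four jobs above live in a single fio job file. Because stonewall is set in the [global] section, each job waits for the previous one to finish, so the tests run sequentially rather than concurrently.)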

Results

blocking-write: (groupid=0, jobs=1): err= 0: pid=614577: Wed Sep  4 09:18:45 2024
  write: IOPS=27.8k, BW=3475MiB/s (3644MB/s)(102GiB/30001msec); 0 zone resets
    clat (usec): min=29, max=5119, avg=32.07, stdev=56.37
     lat (usec): min=31, max=5122, avg=35.84, stdev=56.39
    clat percentiles (usec):
     |  1.00th=[   31],  5.00th=[   31], 10.00th=[   31], 20.00th=[   31],
     | 30.00th=[   31], 40.00th=[   31], 50.00th=[   31], 60.00th=[   31],
     | 70.00th=[   31], 80.00th=[   32], 90.00th=[   32], 95.00th=[   34],
     | 99.00th=[   51], 99.50th=[   51], 99.90th=[   57], 99.95th=[   96],
     | 99.99th=[ 4686]   
   bw (  MiB/s): min= 2245, max= 3550, per=100.00%, avg=3476.61, stdev=212.04, samples=59
   iops        : min=17967, max=28406, avg=27812.78, stdev=1696.37, samples=59
  cpu          : usr=11.75%, sys=21.99%, ctx=834239, majf=0, minf=14
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,834123,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

blocking-read: (groupid=1, jobs=1): err= 0: pid=614592: Wed Sep  4 09:18:45 2024
  read: IOPS=20.4k, BW=2550MiB/s (2674MB/s)(74.7GiB/30001msec)
    clat (usec): min=34, max=533, avg=48.87, stdev= 7.06
     lat (usec): min=34, max=533, avg=48.88, stdev= 7.06
    clat percentiles (usec):
     |  1.00th=[   43],  5.00th=[   45], 10.00th=[   46], 20.00th=[   47],
     | 30.00th=[   48], 40.00th=[   48], 50.00th=[   49], 60.00th=[   49],
     | 70.00th=[   50], 80.00th=[   50], 90.00th=[   51], 95.00th=[   53],
     | 99.00th=[   68], 99.50th=[   72], 99.90th=[  117], 99.95th=[  169],
     | 99.99th=[  338]   
   bw (  MiB/s): min= 1880, max= 2657, per=100.00%, avg=2550.76, stdev=133.96, samples=59
   iops        : min=15040, max=21258, avg=20406.05, stdev=1071.70, samples=59
  cpu          : usr=0.61%, sys=16.63%, ctx=612061, majf=0, minf=45
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=612016,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

iouring-write: (groupid=2, jobs=1): err= 0: pid=614607: Wed Sep  4 09:18:45 2024
  write: IOPS=49.5k, BW=6188MiB/s (6489MB/s)(181GiB/30001msec); 0 zone resets
    slat (usec): min=7, max=537, avg= 9.40, stdev= 1.57
    clat (usec): min=23, max=50729, avg=636.82, stdev=417.91
     lat (usec): min=31, max=50742, avg=646.23, stdev=417.92
    clat percentiles (usec):
     |  1.00th=[  482],  5.00th=[  594], 10.00th=[  594], 20.00th=[  594],
     | 30.00th=[  594], 40.00th=[  594], 50.00th=[  594], 60.00th=[  594],
     | 70.00th=[  611], 80.00th=[  627], 90.00th=[  652], 95.00th=[  660],
     | 99.00th=[  848], 99.50th=[ 4228], 99.90th=[ 5604], 99.95th=[ 5669],
     | 99.99th=[ 7832]   
   bw (  MiB/s): min= 5918, max= 6396, per=100.00%, avg=6188.22, stdev=113.91, samples=59
   iops        : min=47344, max=51172, avg=49505.75, stdev=911.37, samples=59
  cpu          : usr=14.57%, sys=36.07%, ctx=1445832, majf=0, minf=16
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1485253,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

iouring-read: (groupid=3, jobs=1): err= 0: pid=614660: Wed Sep  4 09:18:45 2024
  read: IOPS=54.9k, BW=6864MiB/s (7197MB/s)(201GiB/30001msec)
    slat (usec): min=5, max=117, avg= 7.39, stdev= 1.16
    clat (usec): min=142, max=4837, avg=575.20, stdev=94.67
     lat (usec): min=156, max=4843, avg=582.61, stdev=94.67
    clat percentiles (usec):
     |  1.00th=[  326],  5.00th=[  412], 10.00th=[  461], 20.00th=[  510],
     | 30.00th=[  537], 40.00th=[  553], 50.00th=[  578], 60.00th=[  594],
     | 70.00th=[  619], 80.00th=[  652], 90.00th=[  693], 95.00th=[  725],
     | 99.00th=[  791], 99.50th=[  816], 99.90th=[  873], 99.95th=[  947],
     | 99.99th=[ 1483]   
   bw (  MiB/s): min= 6823, max= 6884, per=100.00%, avg=6867.48, stdev=12.01, samples=59
   iops        : min=54590, max=55074, avg=54939.83, stdev=96.06, samples=59
  cpu          : usr=1.59%, sys=44.58%, ctx=928257, majf=0, minf=1037
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=1647384,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=3475MiB/s (3644MB/s), 3475MiB/s-3475MiB/s (3644MB/s-3644MB/s), io=102GiB (109GB), run=30001-30001msec

Run status group 1 (all jobs):
   READ: bw=2550MiB/s (2674MB/s), 2550MiB/s-2550MiB/s (2674MB/s-2674MB/s), io=74.7GiB (80.2GB), run=30001-30001msec

Run status group 2 (all jobs):
  WRITE: bw=6188MiB/s (6489MB/s), 6188MiB/s-6188MiB/s (6489MB/s-6489MB/s), io=181GiB (195GB), run=30001-30001msec

Run status group 3 (all jobs):
   READ: bw=6864MiB/s (7197MB/s), 6864MiB/s-6864MiB/s (7197MB/s-7197MB/s), io=201GiB (216GB), run=30001-30001msec

Disk stats (read/write):
  nvme0n1: ios=2257637/2320708, merge=0/493, ticks=976455/961054, in_queue=1937549, util=81.63%

Summary

Test             IOPS    Bandwidth (MiB/s)   Avg Latency (μs)   CPU Usage (usr/sys)
Blocking Write   27.8k   3475                35.84              11.75% / 21.99%
Blocking Read    20.4k   2550                48.88              0.61% / 16.63%
io_uring Write   49.5k   6188                646.23             14.57% / 36.07%
io_uring Read    54.9k   6864                582.61             1.59% / 44.58%

These numbers show that io_uring significantly outperforms blocking I/O in both read and write operations. I have run this benchmark with random writes and reads as well and saw similar performance gains.

odygrd (Owner) commented Sep 10, 2024

I've opened a draft PR; feel free to take a look and let me know your thoughts.

Other things I tried:

  • Tried different block sizes and chunk sizes
  • Using io_uring_peek_cqe() in a loop instead of io_uring_wait_cqe() significantly slowed down throughput, dropping to around 0.50 million msgs/sec (see the sketch after this list). I'm unsure of the reason, and the slowdown still occurs even when pinning the backend thread to CPU 1, the frontend thread to CPU 5, and the OS to CPU 0.
  • IORING_SETUP_SQPOLL caused a slowdown similar to io_uring_peek_cqe():
    struct io_uring_params params;
    memset(&params, 0, sizeof(params));
    params.flags |= IORING_SETUP_SQPOLL; // Enable kernel-side polling
    if (::io_uring_queue_init_params(default_queue_size, &_ring, &params) != 0)
  • O_DIRECT would be ideal, but it's quite complex to implement: we would need to submit writes in 512-byte blocks, and each block must be fully filled; otherwise we see NUL characters in the log file.
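For reference, the two completion-reaping approaches compared above look roughly like this (a sketch, not the PR's actual code):

#include <liburing.h>
#include <cerrno>

// Blocking variant: sleeps in the kernel until a completion arrives.
void reap_blocking(io_uring& ring)
{
  io_uring_cqe* cqe = nullptr;
  if (::io_uring_wait_cqe(&ring, &cqe) == 0)
  {
    ::io_uring_cqe_seen(&ring, cqe);
  }
}

// Busy-poll variant: spins in user space checking the completion queue.
// This is the approach that dropped throughput to around 0.50 million msgs/sec.
void reap_spinning(io_uring& ring)
{
  io_uring_cqe* cqe = nullptr;
  int ret;
  do
  {
    ret = ::io_uring_peek_cqe(&ring, &cqe); // returns -EAGAIN while nothing is ready
  } while (ret == -EAGAIN);                 // burns the core while waiting

  if (ret == 0)
  {
    ::io_uring_cqe_seen(&ring, cqe);
  }
}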

Results on my RHEL9 box with the current PR:

Without io_uring

[odygrd@binary-blaze:~]$ ./git/quill/cmake-build-release/benchmarks/backend_throughput/BENCHMARK_quill_backend_throughput
Throughput is 4.44 million msgs/sec average, total time elapsed: 901 ms for 4000000 log messages

With io_uring

[odygrd@binary-blaze:~]$ ./git/quill/cmake-build-release/benchmarks/backend_throughput/BENCHMARK_quill_backend_iouring_throughput
Throughput is 4.86 million msgs/sec average, total time elapsed: 822 ms for 4000000 log messages

odygrd linked a pull request on Sep 10, 2024 that will close this issue
tarun-t (Author) commented Sep 16, 2024

Thank you for submitting this pull request.

I've conducted testing on Ubuntu 24.04 LTS to evaluate the performance. The io_uring-based implementation shows approximately 6% lower throughput compared to the standard version. I'm currently investigating the cause of this performance discrepancy and will share my findings.

Regarding the code structure, I suggest implementing an option in FileSink to enable or disable io_uring support. This modification would allow classes like RotatingFileSink to inherit io_uring functionality, enhancing flexibility and maintainability.
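Something along these lines, purely as an illustration (the option name and plumbing are hypothetical, not actual Quill or PR code):

// Hypothetical sketch: an opt-in flag on the sink configuration, so FileSink
// and anything derived from it (e.g. RotatingFileSink) could select the
// io_uring write path at runtime.
class FileSinkConfig
{
public:
  void set_use_io_uring(bool value) { _use_io_uring = value; }
  bool use_io_uring() const noexcept { return _use_io_uring; }

private:
  bool _use_io_uring{false}; // default: existing blocking write path
};

// Inside the sink's write function the choice would then be roughly:
//   if (config.use_io_uring()) { queue the buffer via io_uring }
//   else                       { existing write path }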

Please let me know if you have any thoughts about this.

odygrd (Owner) commented Sep 16, 2024

The benchmark numbers above were from before the latest commits on master. With the recent improvements I introduced in the backend worker, it seems that sync I/O performance has now caught up with the io_uring sink.

I created this PR as a proof of concept to demonstrate how it could be implemented and to evaluate if it's worth adding.

One reason I’m hesitant to include it directly in FileSink is that we’d need to check for io_uring’s existence for all existing users. By keeping it in a separate header, we avoid affecting users who aren’t including it. The plan is to consider integration once we have solid proof that it brings throughput improvements.
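In other words, roughly (an illustrative guard with a hypothetical macro name, not the PR's actual code): if io_uring lived inside FileSink itself, every existing user of FileSink would end up needing a feature check such as

#if defined(__linux__) && __has_include(<liburing.h>)
  #define QUILL_HAS_IO_URING 1   // hypothetical macro name
  #include <liburing.h>
#else
  #define QUILL_HAS_IO_URING 0
#endif

whereas a separate, opt-in header confines the liburing dependency to the users who include it.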

odygrd (Owner) commented Sep 17, 2024

These are the numbers I am currently getting after syncing this PR with the master branch.

Without io_uring

$ /home/odygrd/git/quill/cmake-build-release/benchmarks/backend_throughput/BENCHMARK_quill_backend_throughput

Throughput is 5.50 million msgs/sec average, total time elapsed: 727 ms for 4000000 log messages

With io_uring

$ /home/odygrd/git/quill/cmake-build-release/benchmarks/backend_throughput/BENCHMARK_quill_backend_iouring_throughput

Throughput is 5.60 million msgs/sec average, total time elapsed: 713 ms for 4000000 log messages

I have also made the benchmark more accurate: the backend thread spends a few seconds on the first log message to calibrate rdtsc, and that time was previously included in the measurement. That's why you see different numbers.

tarun-t (Author) commented Sep 17, 2024

Here are my results with the latest code:

Setup

  • Ubuntu 24.04
  • liburing 2.5
  • Main thread bound to CPU 6
  • Backend thread bound to CPU 5

With io_uring:

~/quill/build/benchmarks/backend_throughput$ ./BENCHMARK_quill_backend_iouring_throughput
Throughput is 6.33 million msgs/sec average, total time elapsed: 631 ms for 4000000 log messages

Without io_uring:

~/quill/build/benchmarks/backend_throughput$ ./BENCHMARK_quill_backend_throughput
Throughput is 6.33 million msgs/sec average, total time elapsed: 632 ms for 4000000 log messages

Given the identical throughput, I investigated further with perf.

With io_uring:

perf stat ./BENCHMARK_quill_backend_iouring_throughput

 Performance counter stats for './BENCHMARK_quill_backend_iouring_throughput':

          1,639.29 msec task-clock                       #    1.792 CPUs utilized
            22,113      context-switches                 #   13.489 K/sec
                 1      cpu-migrations                   #    0.610 /sec
          1,41,639      page-faults                      #   86.403 K/sec
     <not counted>      cpu_atom/cycles/                                                        (0.00%)
    9,29,52,55,672      cpu_core/cycles/                 #    5.670 GHz
     <not counted>      cpu_atom/instructions/                                                  (0.00%)
   29,37,24,01,532      cpu_core/instructions/
     <not counted>      cpu_atom/branches/                                                      (0.00%)
    5,13,21,48,550      cpu_core/branches/               #    3.131 G/sec
     <not counted>      cpu_atom/branch-misses/                                                 (0.00%)
         10,19,512      cpu_core/branch-misses/
             TopdownL1 (cpu_core)                 #     28.1 %  tma_backend_bound
                                                  #      1.2 %  tma_bad_speculation
                                                  #     15.1 %  tma_frontend_bound
                                                  #     55.5 %  tma_retiring

       0.914614828 seconds time elapsed

       0.802307000 seconds user
       0.837321000 seconds sys

Without io_uring

perf stat ./BENCHMARK_quill_backend_throughput


 Performance counter stats for './BENCHMARK_quill_backend_throughput':

          1,628.63 msec task-clock                       #    1.849 CPUs utilized
               249      context-switches                 #  152.889 /sec
                 3      cpu-migrations                   #    1.842 /sec
          1,41,638      page-faults                      #   86.967 K/sec
    3,47,43,10,232      cpu_atom/cycles/                 #    2.133 GHz                         (0.07%)
    9,25,56,65,847      cpu_core/cycles/                 #    5.683 GHz                         (99.96%)
    3,67,63,65,363      cpu_atom/instructions/           #    1.06  insn per cycle              (0.07%)
   29,27,49,38,942      cpu_core/instructions/           #    8.43  insn per cycle              (99.96%)
      65,15,46,014      cpu_atom/branches/               #  400.057 M/sec                       (0.07%)
    5,14,74,98,224      cpu_core/branches/               #    3.161 G/sec                       (99.96%)
       2,01,73,868      cpu_atom/branch-misses/          #    3.10% of all branches             (0.07%)
          6,45,229      cpu_core/branch-misses/          #    0.10% of all branches             (99.96%)
             TopdownL1 (cpu_core)                 #     27.7 %  tma_backend_bound
                                                  #      1.2 %  tma_bad_speculation
                                                  #     15.7 %  tma_frontend_bound
                                                  #     55.4 %  tma_retiring             (99.96%)
             TopdownL1 (cpu_atom)                 #     24.6 %  tma_bad_speculation
                                                  #     24.9 %  tma_retiring             (0.07%)
                                                  #     29.4 %  tma_backend_bound
                                                  #     29.4 %  tma_backend_bound_aux
                                                  #     21.1 %  tma_frontend_bound       (0.07%)

       0.880604791 seconds time elapsed

       0.856803000 seconds user
       0.771724000 seconds sys

This suggests that the additional context switches are preventing io_uring from achieving higher throughput.

(Screenshot: thread list showing an additional io_uring worker thread)

The screenshot above shows that the io_uring version spawns an additional worker thread that inherits the parent's affinity. perf record -e context_switches showed that this thread is the source of the extra context switches.

I tried to locate this thread's creation by putting a catchpoint on clone and clone3, but couldn't trace it.
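For what it's worth, the extra thread is most likely one of io_uring's io-wq workers, which the kernel spawns to service buffered writes; it is created inside the kernel rather than through a userspace clone/clone3 call, which would explain why the catchpoint never fires. If that is the case, kernels 5.14+ allow restricting the workers' affinity explicitly. A minimal sketch, assuming liburing >= 2.2 (which exposes io_uring_register_iowq_aff) and a hypothetical choice of CPU 0:

#include <liburing.h>
#include <sched.h>

// Keep io_uring's worker threads off the isolated cores by pinning them to CPU 0.
void pin_iowq_workers(io_uring& ring)
{
  cpu_set_t mask;
  CPU_ZERO(&mask);
  CPU_SET(0, &mask); // hypothetical: the core left to the OS in this setup
  int const ret = ::io_uring_register_iowq_aff(&ring, sizeof(mask), &mask);
  if (ret != 0)
  {
    // e.g. -EINVAL on older kernels; fall back to the default affinity
  }
}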

odygrd (Owner) commented Sep 17, 2024

Thanks for pointing this out. I hadn't noticed the additional io_uring thread; I had assumed io_uring would do its work in an OS thread that stays off the isolated cores. This also explains why the busy-wait polling in io_uring was making things slower.
