[Support]: Interrupts seem to be delivered only to the AF_XDP core rather than the core specified by IRQ affinity #334

YangZhou1997 opened this issue Jan 5, 2025 · 8 comments

@YangZhou1997

YangZhou1997 commented Jan 5, 2025

Preliminary Actions

Driver Type

Linux kernel driver for Elastic Network Adapter (ENA)

Driver Tag/Commit

ena_linux_2.13.0

Custom Code

No

OS Platform and Distribution

Ubuntu 22.04.5 LTS (GNU/Linux 6.8.0-1015-aws x86_64)

Support request

Hi AWS driver maintainer,

I am using the AF_XDP zero-copy support in the AWS ENA driver to send and receive 100Gbps traffic. I configure multiple NIC queues to receive interrupts and bind each queue's IRQ to a different core (through /proc/irq/$IRQ/smp_affinity_list). I also use a separate set of cores to run the user-space applications (which receive and send packets using the AF_XDP APIs).
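For concreteness, this is roughly how I pin the IRQs (the interface name ens6 and the core list here are just placeholders, not my exact setup):

# Pin each of the NIC's Tx/Rx queue IRQs to one of the dedicated IRQ cores.
NIC=ens6
IRQ_CORES=(8 9 10 11)

i=0
for irq in $(grep "$NIC" /proc/interrupts | awk -F: '{print $1}' | tr -d ' '); do
    core=${IRQ_CORES[$((i % ${#IRQ_CORES[@]}))]}
    echo "$core" | sudo tee /proc/irq/$irq/smp_affinity_list
    i=$((i + 1))
done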

However, from htop, I find that the IRQ cores (ie, the cores specified by the IRQ affinity) consume almost no CPU time, while the application cores spend around half of their CPU time in the kernel (ie, the red bars in htop). From perf, the application cores spend significant time in syscalls like __libc_sendto and also in net_rx_action. So it seems that the NIC interrupts are handled by the application cores.
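For reference, the kind of perf sampling I mean (the core list 0-3 is just a placeholder for the application cores):

# Sample the application cores for ~10 seconds, then look at where the kernel time goes.
sudo perf record -C 0-3 -g -- sleep 10
sudo perf report --stdio | head -n 50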

I am worried that this frequent context switching between user and kernel space causes poor networking performance. For example, when using a Mellanox ConnectX-5 100G NIC with the mlx5 driver, my AF_XDP application requires only 4 application cores and 4 IRQ cores to saturate bidirectional 200G traffic; on these 4 application cores, almost no CPU time is spent in the kernel, while the 4 IRQ cores spend significant time in the kernel handling interrupts. When using c5n.18xlarge with the AWS 100G NIC and the ENA driver, I need 12 application cores and 12 IRQ cores (ie, 12 NIC queues) to saturate only 150G traffic.

Another short question: how do I determine the NUMA affinity of the ENA NIC? c5n.18xlarge has two NUMA nodes, but I get -1 from both /sys/bus/pci/devices/<PCI_device_ID>/numa_node and /sys/class/net/<nic_dev>/device/numa_node.
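For completeness, this is what I ran (interface name and PCI address are placeholders):

# Both of these print -1 on my instance.
cat /sys/class/net/ens6/device/numa_node
cat /sys/bus/pci/devices/0000:00:06.0/numa_node
# The instance itself reports two NUMA nodes.
lscpu | grep -i numa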

Best,
Yang

Contact Details

No response

@davidarinzon
Contributor

Hi @YangZhou1997

Thank you for raising this issue, we will look into it and provide feedback soon.

@YangZhou1997
Author

Thank you, David!

Another note: on the c5n.18xlarge instance, iperf with 32 connections (ie, -P 32 --dualtest) achieves around 190 Gbps bandwidth under a 9k MTU; with a 3.5k MTU (ie, the maximum MTU for AF_XDP), iperf achieves around 183 Gbps. In comparison, AF_XDP with a 3.5k MTU can only achieve 150 Gbps. I suspect there might be some driver issue in the AF_XDP support.
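Roughly the commands I used (the peer IP and interface name are placeholders; the peer runs iperf -s):

# Bump the MTU on both ends first (9001 is the usual AWS jumbo-frame MTU).
sudo ip link set dev ens6 mtu 9001
# Dual (bidirectional) test with 32 parallel connections against the peer.
iperf -c 172.31.0.10 --dualtest -P 32 -t 30 -i 1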

Btw, I disable interrupt coalescing with sudo ethtool -C ${NIC} adaptive-rx off rx-usecs 0 tx-usecs 0, as I found it typically does not help much.
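The resulting settings can be double-checked with:

# Show the current coalescing settings to confirm they were applied.
ethtool -c ${NIC}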

Best,
Yang

@YangZhou1997
Author

I am digging into this a bit, and I realize that the ENA driver might run the softirq for TX and RX inside the send/recv syscall, while the mlx5 driver runs the softirq as part of the NIC hardware interrupt processing. If so, is there any way to optimize or configure the ENA driver to behave in the mlx5 manner?

@ShayAgros
Contributor

Hi,
This is probably just my first comment as I need more time to dig deeper into this (as well as prepare test programs which simulate your setup).

I'll start with what I can answer right now.

Another short question: how do I determine the NUMA affinity of the ENA NIC? c5n.18xlarge has two NUMA nodes, but I get -1 from both /sys/bus/pci/devices/<PCI_device_ID>/numa_node and /sys/class/net/<nic_dev>/device/numa_node.

That depends on the HW generation. On newer multi-NUMA instances you can use the approach above to determine the NUMA node of the device.
On c5n.18xlarge, however, this is not supported.
I can tell you that on the current c5n.18xlarge HW provided by AWS the NUMA node of the device is 0 (this might change in the future, though; hopefully by that time you'd be able to query the correct NUMA node).

However, from htop, I find that the IRQ cores (ie, the cores specified by the IRQ affinity) consume almost no CPU time, while the application cores spend around half of their CPU time in the kernel (ie, the red bars in htop).

Yes, this is also observed in my tests. The difference from mlx5 stems from the implementation of the wakeup command sent to the driver.
When the application asks the driver to poll for new packets (e.g. via the sendto syscall), the mlx5 driver sends a command to the underlying device to invoke an interrupt. This allows it to retain the existing IRQ affinity.

ENA, on the other hand, doesn't currently have the ability to invoke an interrupt, so it schedules the NAPI handler directly. At least as I see it, this has the benefit of better performance, since waiting for an interrupt from the device just adds an extra step before invoking NAPI. The benefit of the mlx5 approach, of course, is that it respects the IRQ affinity.
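For reference, one quick way to see which cores actually run the NET_RX/NET_TX softirqs:

# Watch the per-CPU softirq counters; the columns that keep growing
# are the cores doing the NAPI work.
watch -n 1 "grep -E 'CPU|NET_RX|NET_TX' /proc/softirqs"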

The issue can be relieved in one of the following ways:

  • One can pin an application thread to the same CPU as the IRQ, so that the thread responsible for waking the driver to start polling runs on the IRQ core (see the sketch right after this list).
  • The busy-poll mechanism + IRQ deferral can be used instead of manually asking for a poll. For example, using these unoptimized settings (I didn't fine-tune the values):
echo 50 | sudo tee /proc/sys/net/core/busy_poll
echo 50 | sudo tee /proc/sys/net/core/busy_read
echo 2 | sudo tee /sys/class/net/ens6/napi_defer_hard_irqs
echo 200000 | sudo tee /sys/class/net/ens6/gro_flush_timeout
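For the first option, a minimal sketch (the IRQ number, core, and binary name below are placeholders):

# Keep a queue's IRQ and the app thread that issues the wakeup on the same core (core 8 here).
echo 8 | sudo tee /proc/irq/130/smp_affinity_list
taskset -c 8 ./afxdp_app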

I tested the second approach, and it made the application thread run almost exclusively in userspace.
A solution similar to mlx5's might be adopted in the future, but it is not currently planned.

Btw, I disable interrupt coalescing with sudo ethtool -C ${NIC} adaptive-rx off rx-usecs 0 tx-usecs 0, as I found it typically does not help much.

These settings reduce the number of interrupts while retaining the same BW. In your use case I don't think it matters much, as the IRQ cores are pretty idle already.

Another note: on the c5n.18xlarge instance, iperf with 32 connections (ie, -P 32 --dualtest) achieves around 190 Gbps bandwidth under a 9k MTU; with a 3.5k MTU (ie, the maximum MTU for AF_XDP), iperf achieves around 183 Gbps. In comparison, AF_XDP with a 3.5k MTU can only achieve 150 Gbps. I suspect there might be some driver issue in the AF_XDP support.

I still owe you an answer for it, but it'd require me to write a new test application (unless you have one I can use (: )

@YangZhou1997
Author

YangZhou1997 commented Jan 7, 2025

@ShayAgros Thank you for the thorough response; that indeed helps a lot! I now simply map the IRQs to the app cores, and it does not impact performance. I also tried the "busy poll mechanism + IRQ deferring", but it gives very poor performance (~7Gbps per core).

I will polish and open source my code soon, and will get back to you once I get a version for easy testing.

@YangZhou1997
Author

@ShayAgros I have cleaned up the AF_XDP repo I used to test network BW on c5n.18xlarge: https://github.com/uccl-project/uccl.

The README also has a thorough guide on how to run my code. To reproduce the 150Gbps BW, one needs to:

  • spawn two c5n.18xlarge instances (instead of the g4dn.8xlarge in the README)
  • run up to step 3 in Getting Started
  • use python setup_all.py --target aws_c5_afxdp to build

I would appreciate any help with debugging the performance issue.

@ShayAgros
Contributor

Just an update: I followed the README and am currently stuck at the phase of initializing the tests:

Run UCCL transport tests on VM1:
cd /opt/uccl && git pull
Edit nodes.txt to only include the two public IPs of the VMs
Build UCCL:

python setup_all.py --target aws_g4_afxdp
Keep setup_all.py running

The python setup_all.py command keeps throwing an exception (more specifically, paramiko complains that it's not able to connect to the other host), which is also reproducible by running paramiko manually:

>>> client.connect("172.31.105.125")
Unknown exception: public_blob
Traceback (most recent call last):
  File "/opt/conda/lib/python3.12/site-packages/paramiko/transport.py", line 2262, in run
    handler(m)
  File "/opt/conda/lib/python3.12/site-packages/paramiko/auth_handler.py", line 394, in _parse_service_accept
    key_type, bits = self._get_key_type_and_bits(self.private_key)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/paramiko/auth_handler.py", line 218, in _get_key_type_and_bits
    if key.public_blob:
       ^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/paramiko/agent.py", line 476, in __getattr__
    raise AttributeError(name)
AttributeError: public_blob

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.12/site-packages/paramiko/client.py", line 485, in connect
    self._auth(
  File "/opt/conda/lib/python3.12/site-packages/paramiko/client.py", line 754, in _auth
    self._transport.auth_publickey(username, key)
  File "/opt/conda/lib/python3.12/site-packages/paramiko/transport.py", line 1709, in auth_publickey
    return self.auth_handler.wait_for_response(my_event)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/paramiko/auth_handler.py", line 248, in wait_for_response
    raise e
  File "/opt/conda/lib/python3.12/site-packages/paramiko/transport.py", line 2262, in run
    handler(m)
  File "/opt/conda/lib/python3.12/site-packages/paramiko/auth_handler.py", line 394, in _parse_service_accept
    key_type, bits = self._get_key_type_and_bits(self.private_key)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/paramiko/auth_handler.py", line 218, in _get_key_type_and_bits
    if key.public_blob:
       ^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/paramiko/agent.py", line 476, in __getattr__
    raise AttributeError(name)
AttributeError: public_blob

I'm not familiar enough with Paramiko, so it'd take me some time to understand what it wants (invoking ssh from the terminal just works without any additional parameters).

I'll probably switch to creating a similar setup of several queues receiving traffic and redirecting it to a third server, to see whether I can reproduce the bandwidth bottleneck mentioned in this ticket.

@YangZhou1997
Author

Thank you @ShayAgros! I am trying to remove the paramiko dependency to make the code easier to run.

The setup you mention would be very helpful for debugging, as it strips things down to the most basic part (my uccl code implements a bunch of reliable-transport components such as loss recovery). I appreciate your efforts!

Just one note: AWS has a per-flow rate limit of 5Gbps, so you may need to use dozens of flows (eg, with different UDP ports) to get around this limit.
