Variability of BBR throughput #1758

Open · huitema opened this issue Sep 30, 2024 · 14 comments

huitema (Collaborator) commented Sep 30, 2024

A simple test of throughput can be done using picoquicdemo on local loopback on a Windows PC. (The same test on Unix devices is less interesting, because the loopback path on Linux is very different from what happens with actual network sockets.) The test works by starting a server in one terminal window using picoquicdemo.exe -p 4433 and running a client from another window as picoquicdemo.exe -D -G {bbr|cubic} -n test ::1 4433 /1000000000. After running 5 tests using BBR and another 5 using Cubic on a Dell laptop (Dell XPS 16, Intel(R) Core(TM) Ultra 9 185H 2.50 GHz, Windows 11 23H2), we get the following results:

CC (on PC)   Throughput (Gbps)   Packets/train   Loss rate   CWIN (packets)   Pacing rate (Gbps)
cubic        3.2                 16              0.11%       878              226.5
cubic        3.1                 17              0.05%       4477             645.9
cubic        2.9                 16              0.09%       466              415.5
cubic        2.6                 17              0.17%       359              750.9
cubic        3.1                 17              0.11%       345              802.2
bbr          1.5                 2               0.08%       137              2.3
bbr          2.4                 6               0.05%       279              3.4
bbr          1.1                 2               0.11%       269              1.8
bbr          1.8                 3               0.08%       182              1.7
bbr          1.7                 3               0.20%       273              2.7

For each connection, we listed:

  • The congestion control algorithm
  • The average throughput measured when sending 1 GB of data
  • The number of packets sent in a single sendmsg call, which we call packets per train
  • The packet loss rate, defined as the number of packets retransmitted divided by the total number of packets sent
  • The congestion window, expressed in packets, observed at the end of the session
  • The pacing rate, observed at the end of the session

The obvious conclusion is that Cubic is much faster in this specific test, with an average data rate of about 3 Gbps, versus about 1.7 Gbps for BBR. We observe that much of the slowdown in the BBR tests is due to the conservative pacing rate, leading to a small number of packets per "sendmsg" call. The pacing rate is the main control parameter in BBR, and it does not evolve to reflect the capacity of the loopback path.

We expect some variability between tests, for example because other programs and services running on the laptop may cause variations in available CPU. We do see some of that variability when using Cubic, with observed data rates between 2.6 and 3.2 Gbps, but we see much higher variability using BBR, with data rates between 1.1 and 2.4 Gbps. This confirms other observations that small changes in the environment can produce large variations in BBR performance. We need to investigate the cause of these variations.

huitema (Collaborator, Author) commented Oct 1, 2024

Investigating BBR requires logging in a way that does not impact performance. The current options, qlog or even the binary log, are not adequate: with the binary log option turned on for a 1 GB transfer, the data rate drops to 400 Mbps while producing a 200 MB log file. Instead, we should log the key variables in memory, and save them to a CSV file at the end of the connection.
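
One possible shape for that, as a rough sketch rather than working picoquic code (the sample fields and function names below are illustrative assumptions, not existing picoquic structures): keep a fixed-size array of samples in memory during the connection, and write the CSV only once, when the connection closes.

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Illustrative sample record; fields chosen to match the variables
 * discussed in this issue, not actual picoquic state. */
typedef struct st_cc_sample_t {
    uint64_t time_us;         /* sample timestamp, microseconds */
    uint64_t pacing_bps;      /* current pacing rate */
    uint64_t cwnd_bytes;      /* current congestion window */
    uint64_t bytes_in_flight;
    uint64_t smoothed_rtt_us;
} cc_sample_t;

#define CC_LOG_MAX_SAMPLES 65536

static cc_sample_t cc_log[CC_LOG_MAX_SAMPLES];
static size_t cc_log_count = 0;

/* Called from the congestion controller. Drops samples once the buffer
 * is full, so the hot path never blocks on I/O. */
void cc_log_sample(uint64_t time_us, uint64_t pacing_bps, uint64_t cwnd_bytes,
                   uint64_t bytes_in_flight, uint64_t smoothed_rtt_us)
{
    if (cc_log_count < CC_LOG_MAX_SAMPLES) {
        cc_sample_t* s = &cc_log[cc_log_count++];
        s->time_us = time_us;
        s->pacing_bps = pacing_bps;
        s->cwnd_bytes = cwnd_bytes;
        s->bytes_in_flight = bytes_in_flight;
        s->smoothed_rtt_us = smoothed_rtt_us;
    }
}

/* Called once, when the connection closes. */
int cc_log_flush_csv(const char* path)
{
    FILE* f = fopen(path, "w");
    if (f == NULL) {
        return -1;
    }
    fprintf(f, "time_us,pacing_bps,cwnd_bytes,bytes_in_flight,smoothed_rtt_us\n");
    for (size_t i = 0; i < cc_log_count; i++) {
        fprintf(f, "%llu,%llu,%llu,%llu,%llu\n",
            (unsigned long long)cc_log[i].time_us,
            (unsigned long long)cc_log[i].pacing_bps,
            (unsigned long long)cc_log[i].cwnd_bytes,
            (unsigned long long)cc_log[i].bytes_in_flight,
            (unsigned long long)cc_log[i].smoothed_rtt_us);
    }
    return fclose(f);
}
```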

alexrabi (Collaborator) commented Dec 7, 2024

I am not sure whether this is related to this problem in particular or not, but I have also noticed some strange issues with BBR, likely related to the pacing rate.

In my scenario, I am using MP-QUIC together with QUIC datagrams in a VPN-like setup, where traffic is tunneled across an MP-QUIC connection. If I let the traffic going into the tunnel be split across all paths, everything seems to work well. However, splitting the traffic is not necessarily a good idea for all types of flows, because heterogeneous path characteristics may lead to excessive reordering, which in turn may lead to issues such as spurious retransmissions and unnecessary CWND reductions if the underlying traffic going into the tunnel is itself congestion controlled. It may therefore be preferable to limit the number of MP-QUIC paths used to just one in certain scenarios. However, limiting the use of paths (i.e. using the picoquic_mark_datagram_ready_path API to explicitly tell picoquic which path to send the datagrams on, while keeping the other paths "up" but inactive) seems to have some strange effects on the congestion control.

In one scenario, where I am sending UDP-based traffic at a constant bit rate (not congestion controlled) across the tunnel, I have noticed a massive amount of packet loss and very high latency for the few packets that actually made it through (~333ms RTT, when the base RTT is ~15ms), despite the fact that the data rate of the UDP flow is much lower than the capacity of the network. Again, I do not see the same behavior if I split the traffic across all paths, only when limiting the sending of the datagrams to one path. Setting up a connection with only a single path available (while still negotiating and agreeing to use MP-QUIC) also does not display the same issue. Curiously, the problem also only seems to happen on the server side. Using cubic on the server instead of BBR solves the issue, so I am fairly certain that it is not a problem with the tunnel framework, but rather that something is off with BBR; likely the pacing rate.

It could just be a coincidence of course, but the ~333ms RTT that I have been seeing seems oddly specific, as if some default value is being used for the pacing rate. Is the pacing rate calculated for the "inactive" paths being applied to ALL paths? And why does it only seem to occur on the server side?

huitema (Collaborator, Author) commented Dec 7, 2024

That's a bizarre result. The pacing rate is definitely per path. We would need some traces to understand what is happening in your scenario. Just to check, did you try setting the CC to "cubic" to see whether the issue persists?

alexrabi (Collaborator) commented Dec 8, 2024

I will try to produce some traces to help you figure out what is going on. As I mentioned, setting CC to cubic does not show the same issue, so it is definitely something related to BBR.

alexrabi (Collaborator) commented

Alright, so I have done some more testing, and it does seem to be reproducible in a few different environments. You can find logs here: logs.zip

Some interesting observations:

  • The problem occurs for BBRv1 and BBRv3 on the server side, but I have not encountered it when using any other CCA.
  • The choice of CCA on the client side does not appear to matter.
  • The problem only occurs when the path is application limited. However, if the path has been under load (i.e. the path capacity has been saturated at some point prior, e.g. by running an iperf test through the tunnel), everything seems to be working as expected even when running application limited traffic.
  • Looking at the logs, it seems like the client side is correctly identifying that it is being application limited, but that does not appear to be the case for the server side.

This may hint at some issue with the BBR startup phase, coupled with the server incorrectly assuming that it is not application limited?

huitema (Collaborator, Author) commented Dec 16, 2024

The "application limited" test with BBR is indeed more difficult than with Cubic or Reno. Cubic and Reno have a single control variable, the congestion window, so a simple test of bytes_in_flight < cwnd will return whether the traffic is app limited. BBR has two control variables. The main control variable is the pacing rate, but BBR also uses the congestion window, either as a safety to not send too much traffic in the absence of feedback, or as a short term limiter if packet losses were detected. Testing on the congestion window is imprecise, because in normal scenarios pacing dominates and the bytes in flight remain below the congestion window. Testing on the pacing rate is also imprecise, because pacing is implemented with a leaky bucket. There will be brief period in which the leaky bucket will be full, letting packets go without pacing, only limiting the last packet out of a train -- but that can happen whether the application is "limiting" or not. For example, an application that sends a large frame periodically will experience pacing, even though the average traffic is well below capacity.

That's a bit of a mess. Suggestions are welcome.
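
To make the leaky-bucket point concrete, here is a toy pacer sketch (illustrative only, not picoquic's actual pacer): after any idle period the bucket is full, so a whole train leaves back-to-back and pacing only delays the packet that exhausts the credit, which looks the same whether or not the application is the limiting factor.

```c
#include <stdint.h>

/* Toy leaky-bucket pacer, only to illustrate the point above. */
typedef struct st_toy_pacer_t {
    uint64_t credit_bytes;       /* accumulated sending credit */
    uint64_t bucket_max_bytes;   /* cap on accumulated credit */
    uint64_t rate_bytes_per_sec; /* pacing rate */
    uint64_t last_update_us;
} toy_pacer_t;

/* Refill credit based on elapsed time; idle time tops the bucket up. */
void toy_pacer_update(toy_pacer_t* p, uint64_t now_us)
{
    uint64_t elapsed_us = now_us - p->last_update_us;
    uint64_t earned = (p->rate_bytes_per_sec * elapsed_us) / 1000000;
    p->credit_bytes += earned;
    if (p->credit_bytes > p->bucket_max_bytes) {
        p->credit_bytes = p->bucket_max_bytes;
    }
    p->last_update_us = now_us;
}

/* Returns 1 if a packet of this size may leave now, 0 if it must wait.
 * After an idle period the bucket is full, so an entire train is released
 * without any pacing delay -- whether or not the application is "limited". */
int toy_pacer_may_send(toy_pacer_t* p, uint64_t packet_bytes)
{
    if (p->credit_bytes >= packet_bytes) {
        p->credit_bytes -= packet_bytes;
        return 1;
    }
    return 0;
}
```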

alexrabi (Collaborator) commented

How about using the RTT as a test? BBR is different from Cubic or Reno in that it actively probes for the "non-congested" round-trip propagation time ("min_rtt") by attempting to drain the queue at the bottleneck. If the sender is application limited, it should not contribute to any queue build-up at the bottleneck, and therefore the SRTT should not differ significantly from the min_rtt. This is not foolproof, of course, as the RTT probing phase is only done periodically. Changes to path characteristics, such as increased latency due to mobility or the addition of competing flows at the bottleneck, could therefore be misinterpreted as no longer being application limited. It might be worth combining a few different tests using different metrics to determine whether a path is application limited or not; i.e., only if both the pacing rate method and the RTT method described above indicate that we are not application limited should we consider the path no longer application limited.
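
A minimal sketch of that combination, with illustrative names and an arbitrary margin (this is not existing picoquic logic):

```c
#include <stdint.h>

/* Illustrative inputs; not picoquic structures. */
typedef struct st_app_limited_inputs_t {
    uint64_t smoothed_rtt_us;
    uint64_t min_rtt_us;
    int pacing_test_says_app_limited; /* outcome of a pacing/cwnd based test */
} app_limited_inputs_t;

/* Following the suggestion above, the path is declared "no longer app
 * limited" only if BOTH signals agree:
 *  - the pacing/cwnd based test reports that the sender is network limited, and
 *  - the smoothed RTT has risen clearly above min_rtt (queue building up). */
int path_is_network_limited(const app_limited_inputs_t* in)
{
    /* "Clearly above" here means SRTT more than 25% over min_rtt;
     * the margin is arbitrary, chosen only for illustration. */
    int rtt_says_network_limited =
        in->smoothed_rtt_us > in->min_rtt_us + in->min_rtt_us / 4;

    return rtt_says_network_limited && !in->pacing_test_says_app_limited;
}
```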

huitema (Collaborator, Author) commented Dec 17, 2024

The problem is time span. Suppose at application that sends 5Mbps of traffic on a 20 Mbps path -- kind of the definition of "app limited". Suppose now that the 5 Mbps of traffic consists of periodic video frames, i.e., one large message 30 times per second. That message will be sent in 8 or 9 ms -- 1/4th of the link capability. There will be no long term queue, but there will be an 8 or 9 ms queue built up for each frame. We can have a transient measurement that indicates saturation when the long term measurement does not.
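
Spelling out the arithmetic in that example (assuming each frame is sent at the full 20 Mbps line rate):

$$\frac{5\ \text{Mb/s}}{30\ \text{frames/s}} \approx 167\ \text{kb} \approx 21\ \text{kB per frame}, \qquad \frac{167\ \text{kb}}{20\ \text{Mb/s}} \approx 8.3\ \text{ms per frame}$$

so each frame keeps the link busy for about 8.3 ms out of every 33.3 ms, i.e. roughly a quarter of the link capacity, even though the average rate is only 5 Mbps.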

huitema (Collaborator, Author) commented Dec 17, 2024

But the handling of app limited is really a different issue. The issue here was opened after tests of transfers on a loopback interface, when the application was trying to send 1 GB or 10 GB of data -- definitely not an app limited scenario. The issue is analyzed in detail in the blog "Loopback ACK delays cause Cubic slowdowns." The Cubic issue was traced to bad behavior of the "ACK Frequency" implementation when running over very low latency paths, such as the loopback path. Fixing that fixed a lot of the BBR issues, but some remain. BBR is very complex, and thus hard to tune.

alexrabi (Collaborator) commented

Indeed, these seem to be different issues entirely. I will create a separate issue for the application limited scenarios.

Speaking of the issue at hand, you say that it is less interesting to run the test on Linux, since the loopback path there is very different from what happens with actual network sockets. Perhaps the issue could be replicated on Linux using network namespaces instead (e.g. Docker containers, Mininet), since that should actually make the packets go through the kernel "normally"?

huitema (Collaborator, Author) commented Dec 18, 2024 via email

joergdeutschmann-i7 commented

Very interesting discussion here. I just wanted to chime in regarding the following:

> Perhaps the issue could be replicated on Linux using network namespaces instead (e.g. Docker containers, Mininet), since that should actually make the packets go through the kernel "normally"?

I think lightweight virtualization may also impact your timings. Example (pdf).
NetEm can be really fun, too. (A colleague of mine recently fixed some NetEm issues.)
Out of helplessness, I sometimes do weird cabling setups in order to have more "realistic" topologies. Not sure if this is really a good approach. Also, my colleague created (yet another?) testbed framework based on full operating system virtualization, but we have not yet evaluated detailed packet timings with that framework.

alexrabi (Collaborator) commented

> I think lightweight virtualization may also impact your timings. Example (pdf).
> NetEm can be really fun, too. (A colleague of mine recently fixed some NetEm issues.)

This seems to be an issue with the netem qdisc rather than Mininet itself, no? To emulate a "loopback" interface this way we would not need to impose any artificial delays or artificial bandwidth limits (i.e. netem would not be required), as the CPU should be the main bottleneck, so I am not sure how applicable this problem is to this particular scenario. Good to keep in mind though!

joergdeutschmann-i7 commented

Mininet relies on NetEm, according to the first reference and to my knowledge. You're right that if you don't need artificial link emulation this might all not matter (to me it is then more a test of implementation performance than of congestion control performance), but maybe you also want to test the congestion control over different types of paths.
