Variability of BBR throughput #1758
Investigating BBR requires logging in a way that does not impact performance. The current options, qlog or even the binary log, are not adequate: turning the binary log on for a 1 GB transfer drops the data rate to 400 Mbps while producing a 200 MB file. Instead, we should log the key variables in memory and save them to a CSV file at the end of the connection.
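A minimal sketch of that in-memory approach, assuming a preallocated sample array; the struct fields and function names here are hypothetical, not the existing picoquic logging API. The point is that recording samples is O(1) with no I/O on the hot path, and the file is written once when the connection closes.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    uint64_t time_us;         /* sample timestamp */
    uint64_t pacing_rate;     /* bytes per second */
    uint64_t cwnd;            /* congestion window, bytes */
    uint64_t rtt_us;          /* smoothed RTT */
    uint64_t bytes_in_flight;
} bbr_sample_t;

typedef struct {
    bbr_sample_t* samples;
    size_t count;
    size_t capacity;
} bbr_trace_t;

static int bbr_trace_init(bbr_trace_t* t, size_t capacity)
{
    t->samples = (bbr_sample_t*)malloc(capacity * sizeof(bbr_sample_t));
    t->count = 0;
    t->capacity = capacity;
    return (t->samples == NULL) ? -1 : 0;
}

/* Called from the congestion controller on each sample; no I/O. */
static void bbr_trace_record(bbr_trace_t* t, bbr_sample_t s)
{
    if (t->count < t->capacity) {
        t->samples[t->count++] = s;
    }
}

/* Called once when the connection closes. */
static int bbr_trace_save_csv(const bbr_trace_t* t, const char* path)
{
    FILE* f = fopen(path, "w");
    if (f == NULL) {
        return -1;
    }
    fprintf(f, "time_us,pacing_rate,cwnd,rtt_us,bytes_in_flight\n");
    for (size_t i = 0; i < t->count; i++) {
        const bbr_sample_t* s = &t->samples[i];
        fprintf(f, "%llu,%llu,%llu,%llu,%llu\n",
            (unsigned long long)s->time_us,
            (unsigned long long)s->pacing_rate,
            (unsigned long long)s->cwnd,
            (unsigned long long)s->rtt_us,
            (unsigned long long)s->bytes_in_flight);
    }
    fclose(f);
    return 0;
}
```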
I am not sure whether this is related to this problem in particular, but I have also noticed some strange issues with BBR, likely related to the pacing rate. In my scenario, I am using MP-QUIC together with QUIC datagrams in a VPN-like setup, where traffic is tunneled across an MP-QUIC connection. If I let the traffic going into the tunnel be split across all paths, everything seems to work well. However, splitting the traffic is not necessarily a good idea for all types of flows: heterogeneous path characteristics may lead to excessive reordering, which in turn may cause issues such as spurious retransmissions and unnecessary CWND reductions if the traffic going into the tunnel is itself congestion controlled. It may therefore be preferable to limit the number of MP-QUIC paths used to just one in certain scenarios. However, limiting the use of paths in this way seems to trigger problems.

In one scenario, where I am sending UDP-based traffic with a constant bit rate (not congestion controlled) across the tunnel, I have noticed a massive amount of packet loss and very high latency for the few packets that actually made it through (~333 ms RTT, when the base RTT is ~15 ms), despite the data rate of the UDP flow being much lower than the capacity of the network. Again, I do not see the same behavior if I split the traffic across all paths, only when limiting the sending of the datagrams to one path. Setting up a connection with only a single path available (while still negotiating and agreeing to use MP-QUIC) also doesn't display the same issue. Curiously, this problem only seems to happen on the server side. Using Cubic on the server instead of BBR solves the issue, so I am fairly certain that it is not a problem with the tunnel framework, but rather that something is off with BBR, likely the pacing rate.

It could just be a coincidence of course, but the ~333 ms RTT I have been seeing seems oddly specific, as if some default value is being used for the pacing rate. Is the pacing rate calculated for the "inactive" paths being applied to ALL paths? And why does it only seem to occur on the server side?
That's a bizarre result. The pacing rate is definitely per path. We would need some traces to understand what is happening in your scenario. Just to check, did you try setting the CC to "cubic" to see whether the issue persists?
I will try to produce some traces to help you figure out what is going on. As I mentioned, setting CC to cubic does not show the same issue, so it is definitely something related to BBR.
Alright, so I have done some more testing, and it does seem to be reproducible in a few different environments. You can find logs here: logs.zip. Some interesting observations:
This may hint at some issue with the BBR startup phase, coupled with the server incorrectly assuming that it is not application limited?
The "application limited" test with BBR is indeed more difficult than with Cubic or Reno. Cubic and Reno have a single control variable, the congestion window, so a simple test of whether the bytes in flight fill the congestion window is enough. BBR is driven mainly by the pacing rate, which makes the equivalent test harder to get right. That's a bit of a mess. Suggestions are welcome.
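For illustration, a rough sketch of the two kinds of test; the names, signatures, and thresholds are assumptions for this sketch, not the actual picoquic internals.

```c
#include <stdbool.h>
#include <stdint.h>

/* Cubic/Reno: a single control variable, so the test is simple.
 * If the sender does not fill the congestion window, it is app limited. */
static bool app_limited_window_based(uint64_t bytes_in_flight, uint64_t cwnd)
{
    return bytes_in_flight < cwnd;
}

/* BBR: the main control variable is the pacing rate, so one possible test is
 * whether the application actually kept up with that rate over the last
 * measurement interval. */
static bool app_limited_rate_based(uint64_t bytes_sent_in_interval,
    uint64_t interval_us, uint64_t pacing_rate_bps, double fill_threshold)
{
    /* Bytes the pacer would have allowed during the interval. */
    uint64_t pacing_budget = (pacing_rate_bps * interval_us) / (8 * 1000000);
    return (double)bytes_sent_in_interval <
        fill_threshold * (double)pacing_budget;
}
```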
How about using the RTT as a test? BBR is different from Cubic or Reno in that it actively probes for the "non-congested" round-trip propagation time ("min_rtt") by attempting to drain the queue at the bottleneck. If the sender is application limited, it should not be contributing to any queue build-up at the bottleneck, and therefore the SRTT should not differ significantly from the min_rtt.

This is not foolproof, of course, as the RTT probing phase is only done periodically. Changes to path characteristics, such as increased latency due to mobility or the addition of competing flows at the bottleneck, could therefore be misinterpreted as no longer being application limited. It might be worth combining a few different tests using different metrics to determine whether a path is application limited or not, i.e. only if both the pacing rate method and the method described above indicate that we are not application limited should we consider ourselves no longer application limited.
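A possible sketch of that combined check, with illustrative names and an assumed 1.25 × min_rtt threshold (both are assumptions, not existing picoquic code): only declare the path "not app limited" when the pacing-rate test and the RTT test agree.

```c
#include <stdbool.h>
#include <stdint.h>

/* If SRTT is well above min_rtt, the sender is contributing to (or at least
 * observing) a standing queue at the bottleneck. */
static bool rtt_suggests_queue_buildup(uint64_t smoothed_rtt_us,
    uint64_t min_rtt_us)
{
    return smoothed_rtt_us > min_rtt_us + min_rtt_us / 4; /* > 1.25 * min_rtt */
}

/* Conservative combination: both signals must indicate saturation before the
 * path stops being treated as application limited. */
static bool consider_not_app_limited(bool pacing_rate_is_filled,
    uint64_t smoothed_rtt_us, uint64_t min_rtt_us)
{
    return pacing_rate_is_filled &&
        rtt_suggests_queue_buildup(smoothed_rtt_us, min_rtt_us);
}
```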
The problem is time span. Suppose an application that sends 5 Mbps of traffic on a 20 Mbps path -- kind of the definition of "app limited". Suppose now that the 5 Mbps of traffic consists of periodic video frames, i.e., one large message 30 times per second. Each message will be sent in 8 or 9 ms, occupying the link for about 1/4 of each frame interval. There will be no long-term queue, but there will be an 8 or 9 ms queue built up for each frame. We can have a transient measurement that indicates saturation when the long-term measurement does not.
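For concreteness, the arithmetic behind those numbers (using the 5 Mbps, 20 Mbps, and 30 frames per second figures above):

```math
\frac{5\,\text{Mbit/s}}{30\,\text{frames/s}} \approx 167\,\text{kbit/frame},
\qquad
\frac{167\,\text{kbit}}{20\,\text{Mbit/s}} \approx 8.3\,\text{ms},
\qquad
\frac{8.3\,\text{ms}}{33.3\,\text{ms}} \approx 25\%.
```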
But the handling of app limited is really a different issue. The issue here was opened after tests of transfers on a loopback interface, when the application was trying to send 1 GB or 10 GB of data -- definitely not an app-limited scenario. The issue is analyzed in detail in the blog post "Loopback ACK delays cause Cubic slowdowns." The Cubic issue was traced to bad behavior of the "ACK Frequency" implementation when running over very low latency paths, such as the loopback path. Fixing that fixed a lot of the BBR issues, but some remain. BBR is very complex, and thus hard to tune.
Indeed, these seem to be different issues entirely. I will create a separate issue for the application-limited scenarios. Speaking of the issue at hand, you say that the test is less interesting to run on Linux since the loopback interface is very different from what happens with actual network sockets. Perhaps the issue could be replicated on Linux using network namespaces instead (e.g. Docker containers, Mininet), since that should make the packets go through the kernel "normally"?
Maybe. I have no experience with that, but if you want to step in, you are welcome!

-- Christian Huitema
Very interesting discussion here. I just wanted to chime in regarding the suggestion to replicate the issue with network namespaces or Mininet: I think lightweight virtualization may also impact your timings. Example (pdf).
This seems to be an issue with the netem qdisc rather than Mininet itself, no? To emulate a "loopback" interface this way we would not need to impose any artificial delays or bandwidth limits (i.e. netem would not be required), as the CPU should be the main bottleneck, so I am not sure how applicable this problem is to this particular scenario. Good to keep in mind though!
Mininet relies on netem, according to the first reference and to my knowledge. You're right that if you don't need artificial link emulation this might not matter (to me that is then more a test of implementation performance than of congestion control performance), but maybe you also want to test the congestion control over different types of paths.
A simple test of throughput can be done using `picoquicdemo` on local loopback on a Windows PC. (The same test on Unix devices is less interesting, because the loopback path on Linux is very different from what happens with actual network sockets.) The test works by starting a server in one terminal window using `picoquicdemo.exe -p 4433` and running a client from another window as `picoquicdemo.exe -D -G {bbr|cubic} -n test ::1 4433 /1000000000`. After running 5 tests using BBR and another 5 using Cubic on a Dell laptop (Dell XPS 16, Intel(R) Core(TM) Ultra 9 185H 2.50 GHz, Windows 11 23H2), we get the following results for each connection:
The obvious conclusion is that Cubic is much faster in that specific test, with an average data rate of about 3 Gbps, versus about 1.7 Gbps for BBR. We observe that much of the slowdown in the BBR tests is due to the conservative pacing rate, which leads to a small number of packets per `sendmsg` call. The pacing rate is the main control parameter in BBR, and it does not evolve to reflect the capacity of the loopback path.
We expect some variability between tests, for example because other programs and services running on the laptop may cause variations in available CPU. We do see some of that variability when using Cubic, with observed data rates between 2.6 and 3.2 Gbps, but we see a much higher variability using BBR, with data rates between 1.1 Gbps and 2.4 Gbps. This confirms other observations that small changes in the environment can produce big variations in BBR performance. We need to investigate the cause of these variations.