-
Notifications
You must be signed in to change notification settings - Fork 289
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MTU discovery improvements #61
Conversation
This cleans up the PMTU probing function a little bit. It moves the low-level sending of packets to a separate function, so that the code reads naturally instead of using a weird for loop with "special indexes". In addition, comments are moved inside the body of the function for additional context. This shouldn't introduce any change of behavior, except for local discovery which has some minor logic fixes and which now always uses small packets (16 bytes) because there's no need for a full-length probe just to try the local network.
Currently, the TX path (starting from send_packet()) in tinc has three responsabilities: - Making sure packets can be sent (e.g. fetching SPTPS keys); - Making sure they can be sent optimally (e.g. fetching non-SPTPS keys so that UDP can be used); - Sending the actual packet, if feasible. The first two are closely related; the third one, however, can be cleanly separated from the other two - meaning, we can loosen code coupling between sending packets and "optimizing" the way packets are sent. This will become increasingly important as future commits will move more tunnel establishment and maintenance code into the TX path, so we will benefit from a cleaner separation of concerns. This is especially relevant because of the dual nature of the TX path (SPTPS versus non-SPTPS), which can make things really complicated when trying to share low-level code between both. In this commit, code related to establishing or improving tunnels is moved away from the core TX path by introducing the "try_*()" family of function, of which try_sptps() already existed before this commit. This is a pure refactoring; this commit shouldn't introduce any change in behavior.
This moves related functions together. try_tx() is at the right place since its only caller is send_packet(). This is a pure cut-and-paste change. The reason it was not done in the previous commit is because it would have made the diff harder to review.
This adds a new mechanism by which tinc can determine if a node is reachable via UDP. The new mechanism is currently redundant with the PMTU discovery mechanism - that will be fixed in a future commit. Conceptually, the UDP discovery mechanism works similarly to PMTU discovery: it sends UDP probes (of minmtu size, to make sure the tunnel is fully usable), and assumes UDP is usable if it gets replies. It assumes UDP is broken if too much time has passed since the last reply. The big difference with the current PMTU discovery mechanism, however, is that UDP discovery probes are only triggered as part of the packet TX path (through try_tx()). This is quite interesting, because it means tinc will never send UDP pings more often than normal packets, and most importantly, it will automatically stop sending pings as soon as packets stop flowing, thereby nicely reducing network chatter. Of course, there are small drawbacks in some edge cases: for example, if a node only sends one packet every minute to another node, these packets will only be sent over TCP, because the interval between packets is too long for tinc to maintain the UDP tunnel. I consider this a feature, not a bug: I believe it is appropriate to use TCP in scenarios where traffic is negligible, so that we don't pollute the network with pings just to maintain a UDP tunnel that's seeing negligible usage.
Since UDP discovery is the place where UDP feasibility is checked, it makes sense to test for local connectivity as well. This was previously done as part of PMTU discovery.
This is a rewrite of the send_mtu_probe_handler() function to make it focus on the actual discovery of PMTU. In particular, the PMTU discovery code doesn't care about tunnel state anymore - it only cares about doing the initial PMTU discovery, and once that's done, making sure PMTU did not increase by checking it from time to time. All other duties have already been rewritten in the UDP discovery code. As a result, the send_mtu_probe_handler(), which previously implemented a nightmarish state machine which was very difficult to follow and understand, has been massively simplified. We moved from four persistent states to only two - initial discovery and steady state. Furthermore, a side effect is that network chatter is reduced: instead of sending bursts of three minmtu-sized packets in the steady state, there is only one such packet that's sent from the UDP discovery code. However, that introduces a slight regression in the bandwidth estimation code, which relies on three-packet bursts in order to function. Considering that this estimation is extremely unreliable (in my experience) and isn't relied on by anything, this seems like an acceptable regression.
Currently, the PMTU discovery code is run by a timeout callback, independently of tunnel activity. This commit moves it into the TX path, meaning that send_mtu_probe_handler() is only called if a packet is about to be sent. Consequently, it has been renamed to try_mtu() for consistency with try_tx(), try_udp() and try_sptps(). Running PMTU discovery code only as part of the TX path prevents PMTU discovery from generating unreasonable amounts of traffic when the "real" traffic is negligible. One extreme example is sending one real packet and then going silent: in the current code this one little packet will result in the entire PMTU discovery algorithm being run from start to finish, resulting in absurd write traffic amplification. With this patch, PMTU discovery stops as soon as "real" packets stop flowing, and will be no more aggressive than the underlying traffic. Furthermore, try_mtu() only runs if there is confirmed UDP connectivity as per the UDP discovery mechanism. This prevents unnecessary network chatter - previously, the PMTU discovery code would send bursts of (potentially large) probe packets every second even if there was nothing on the other side. With this patch, the PMTU code only does that if something replied to the lightweight UDP discovery pings. These inefficiencies were made even worse when the node is not a direct neighbour, as tinc will use PMTU discovery both on the destination node *and* the relay. UDP discovery is more lightweight for this purpose. As a bonus, this code simplifies overall code somewhat - state is easier to manage when code is run in predictable contexts as opposed to "surprise callbacks". In addition, there is no need to call PMTU discovery code outside of net_packet.c anymore, thereby simplifying module boundaries.
This moves related functions together, and is a pure cut-and-paste change. The reason it was not done in the previous commit is because it would have made the diff harder to review.
If a probe reply is received that makes minmtu equal to maxmtu, we have to wait until try_mtu() runs to realize that. Since try_mtu() runs after a packet is sent, this means there is at least one packet (possibly more, depending on timing) that won't benefit from the fixed MTU. This also happens when maxmtu is updated from the send() path. This commit fixes that by making sure we check whether the MTU can be fixed every time minmtu or maxmtu is touched.
This is a minor cosmetic nit to emphasise the distinction between the initial MTU discovery phase, and the post-initial phase (i.e. maxmtu checking). Furthermore, this is an improvement with regard to the DRY (Don't Repeat Yourself) principle, as the maximum mtuprobes value is only written once.
Currently, tinc sends MTU probes in batches of three every second. This commit changes that to send one packet every 333 milliseconds instead. This change brings two benefits: - It makes MTU probing faster, because MTU probe lengths are calculated based on minmtu, and minmtu is adjusted based on the replies. When sending batches of three packets, all three packets are based on the same minmtu estimation; in contrast, by sending one packet more frequently, each subsequent packet can benefit from the replies that have been received since the last packet was sent. As a result, MTU discovery converges much faster (2-3 times as fast, typically). - It reduces network spikiness - it's more network-friendly to send one packet from time to time as opposed to sending bursts.
tinc bandwidth estimation has always been quite unreliable (at least in my experience), but there's no chance of it working anymore since the last changes to MTU discovery code, because packets are not sent in batches of three anymore. This commit removes the dead code - fortunately, nothing depends on this estimation (it's not even shown in node info). We probably need be smarter about this if we do want this estimation back.
Currently, tinc uses a naive algorithm for choosing MTU discovery probe sizes, picking a size at random between minmtu and maxmtu. This is of course suboptimal - since the behavior of probes is deterministic (assuming no packet loss), it seems likely that using a non-deterministic discovery algorithm will not yield the best results. Furthermore, the randomness introduces a lot of variation in convergence times. The random solution also suffers from pathological cases - since it's using a uniform distribution, it doesn't take into account the fact that it's often more interesting to send small probes rather than large ones, because getting replies is the only way we can make progress (assuming the worst case scenario in which the OS doesn't know anything, therefore keeping maxmtu constant). This can lead to absurd situations where the discovery algorithm is close to the real MTU, but can't get to it because the random number generator keeps generating numbers that are past it. The algorithm implemented in this patch aims to improve on the naive random algorithm. It is organized around "cycles" of 8 probes; the sizes of the probes decrease as we go through the cycle, thus making sure the algorithm can cover lots of ground quickly (in case we're far from actual MTU), but also examining the local area (in case we're close to actual MTU). Using cycles ensures that the algorithm will "go back" to large probes to better cover the new interval and to protect against packet loss. For the probe size itself, various mathematical models were simulated in an attempt to find the one that converges the fastest; it has been determined that using an exponential based on the size of the remaining interval was the most effective option. The exponential is adjusted with a magic multiplier fine-tuned to make tinc jump to the "most interesting" (i.e. 1400+) section as soon as discovery starts. Simulations indicate that assuming no packet loss and no help from the OS (i.e. maxmtu stays constant), this algorithm will typically converge to the *exact* MTU value in less than 10 probes, and will get within 8 bytes in less than 5 probes, for actual MTUs between 1417 and ~1450 (which is the range the algorithm is fine-tuned for). In contrast, the previous algorithm gives results all over the place, sometimes taking 30+ probes to get in the ballpark. Because of the issues with the distribution, the previous algorithm sometimes never gets to the precise MTU value within any reasonable amount of time - in contrast, the new algorithm will always get to the precise value in less than 30 probes, even if the actual MTU is completely outside the optimized range.
The recently introduced new MTU discovery algorithm converges much faster than the previous one, which allows us to reduce the number of probes required before we can confidently fix the MTU. This commit reduces the number of initial discovery probes from 90 to 20. With the new algorithm this is more than enough to get to the precise (byte-level accuracy) MTU value; in cases of packet loss or weird MTU values for which the algorithm is not optimized, we should get close to the actual value, and then we rely on MTU increase detection (steady state probes) to fine-tune it later if the need arises. This patch also triggers MTU increase detection even if the MTU we have is off by only one byte. Previously we only did that if it was off by at least 8 bytes. Considering that (1) this should happen less often, (2) restarting MTU discovery is cheaper than before and (3) having MTUs that are subtly off from their intended values by just a few bytes sounds like trouble, this sounds like a good idea.
If MTU discovery comes up with an MTU smaller than 512 bytes (e.g. due to massive packet loss), it's pretty much guaranteed to be wrong. Even if it's not, most Internet applications assume the MTU will be at least 512, so fixing the MTU to a small value is likely to cause trouble anyway. This also makes the discovery algorithm converge even faster, since the interval it has to consider is smaller.
Linux provides a getsockopt() option, IP_MTU, to get the kernel's best guess at a connection MTU. In practice, it seems to return the MTU of the physical interface the socket is using. This patch uses this option to initialize maxmtu to a better value when MTU discovery starts. Unfortunately, this is not supported on Windows. Winsock has options such as SO_MAX_MSG_SIZE, SO_MAXDG and SO_MAXPATHDG but they seem useless as they always return absurdly large values (typically, 65507), as confirmed by http://support.microsoft.com/kb/822061/
The original multiplier constant for the MTU discovery algorithm, 0.97, assumes a somewhat pessmistic scenario where we don't get any help from the OS - i.e. maxmtu never changes. This can happen if IP_MTU is not usable and the OS doesn't reject overly large packets. However, in most systems the OS will, in fact, contribute to the MTU discovery process. In these situations, an actual MTU equal to maxmtu is quite likely (as opposed to the maxmtu = 1518 case where that is highly unlikely, unless the physical network supports jumbo frames). It therefore makes sense to use a multiplier of 1 - that will make the first probe length equal to maxmtu. The best results are obtained if the OS supports the getsockopt(IP_MTU) call, and its result is accurate. In that case, tinc will typically fix the MTU after one single probe(!), like so: Using system-provided maximum tinc MTU for foobar (1.2.3.4 port 655): 1442 Sending UDP probe length 1442 to foobar (1.2.3.4 port 655) Got type 2 UDP probe reply 1442 from foobar (1.2.3.4 port 655) Fixing MTU of foobar (1.2.3.4 port 655) to 1442 after 1 probes
Currently, if a MTU probe is sent and gets rejected by the system because it is too large (i.e. send() returns EMSGSIZE), the MTU discovery algorithm is not aware of it and still behaves as if the probe was actually sent. This patch makes the MTU discovery algorithm recalculate and send a new probe when this happens, so that the probe "slot" does not go to waste.
On Fri, Jan 02, 2015 at 02:46:24AM -0800, Etienne Dechamps wrote:
I completely agree.
Well, in my mind it's one and the same thing; it's just trying to
I'm not against this :)
That's indeed a waste of bandwidth.
That's an excellent solution to the problem!
Great.
There is one issue here, and that is UDP hole punching. If you have
Yes, however the burst bandwidth estimation code then indeed does not
I'm very sceptical about this. It's very easy to optimize the algorithm
Well, before tinc usually fixed the MTU in 7 probes if ICMP Packet Too
I tried it out, but there are some issues. Running it between two nodes After the first UDP probe is succesfully received, MTU probing starts, Using system-provided maximum tinc MTU for xar (10.1.1.1 port 28540): This is because choose_initial_maxmtu() doesn't apply the overhead of Apart from the algorithm for choosing packet sizes, I think the MTU Do you want to change anything before I pull your changes? I don't mind Met vriendelijke groet / with kind regards, |
Oh, right, I always wondered why tinc viewed direct neighbors as "special snowflakes" in that regard, that answers it. As you probably already noticed, this is not an issue with SPTPS because it will spontaneously relay in that case, so indirect UDP tunnels will always get established if necessary. I understand that the non-SPTPS path doesn't do that (it doesn't know how to fall back to indirect UDP), hence the problem. To be honest, I'm not sure how "well" you want 1.0.x-1.1.x cohabitation to work. I always thought this was simply to ease the transition path, and that no-one would keep an heterogeneous graph in the long term, and that is was therefore acceptable to have less-than-optimal performance characteristics between 1.0.x and 1.1.x nodes, as long as communication is still possible (e.g. via TCP). That said this looks like an easy fix, so we might as well do that.
That's collateral damage I'm afraid. I strongly believe it's worth it, but it's your call of course :) That said, you are quite lucky to at least get the correct order of magnitude. In my experience, if the link is just a little unpredictable (jitter, congestion), that thing returns completely absurd numbers.
Well, the algorithm I wrote is optimized for a MTU in the 1400-1450 range (which is typical of tinc running over typical physical networks), with no help from the system (i.e. no
Oh yeah, I spent a lot of time doing tons of simulations as well. Yes, I agree, it's surprisingly hard to beat the random algorithm. None of the formulas I tried managed to beat it, except the one I implemented in this pull request. It was the only one to show significantly better results compared to the random one. Here's the spreadsheet that I used to determine that. I believe you should be able to copy it to your own Drive and then put in new numbers from there. My main problem with the random algorithm is, well, it's random. In my opinion, that alone is a disadvantage, as by definition is leads to somewhat unpredictable results. Having an MTU that depends on the phase of the moon is not great. Of course there is always some nondeterminism in the form of packet loss, but that's not as bad. In addition, my algorithm pretty much always guarantee it will get to byte-level accuracy in a reasonable amount of time (typically < 20 probes). With the random algorithm there are pathological cases where that will take a very long time (consider minmtu 1418, maxmtu 1518 and actual MTU 1419, each probe has a 1% chance of landing on 1419). That cannot happen with the one I implemented.
All the numbers I mentioned assume a worst case scenario where these responses are filtered.
That's weird. I think I've tested broadcast packets and they seemed to work correctly. I'll look into it.
Yes. I had no idea how to calculate overhead for 1.0.x, so I cheated and applied my "we don't care that much about 1.0" philosophy instead :) For SPTPS packets the overhead calculation is accurate. If you know how to calculate that for 1.0, then by all means, be my guest.
Maybe. I don't really see that as a problem, but I'm fine either way.
I guess that makes sense, probes sent while udp_confirmed is false are very lightweight anyway (16 bytes payload). I'll change the pull request.
That's related to the hole punching issue with 1.0.x that you mentioned above, right? In that case I'm fine with you making the change, you seem to care more about 1.0.x than I do :)
I'll try to investigate the problem with broadcasts and fix it. I'll also make the change to fine-tune the UDP discovery interval. After that, feel free to merge and do anything you want with regard to the legacy stuff and the handling of maxmtu changes in |
On Fri, Jan 02, 2015 at 10:07:23AM -0800, Etienne Dechamps wrote:
I don't really see how this is different for SPTPS... the issue is that
I hope things are better at your work, but at my last job as a UNIX "On paper: legacy system will be phased out in three months -> In Any network with >10 devices that are not under the control of a single
I would like to keep performance of mixed networks as optimal as
Ok. Well it just gave the raw values from a single burst, if it were
I have several drives, but they are all in my own computers ;) In any case, I have put your algorithm into my own simulation, and there Your algorithm has some weird behavior. It is quite sensitive to the Your algorithm is slightly more sensitive to packet loss, but this only I'll try to see if I can combine the best of both algorithms :)
That's indeed true, and probably that is why your algorithm converges so It's too late now, but tomorrow I'll send you the simulation code.
Sure, I'll add it :)
Ok :)
Ok. Met vriendelijke groet / with kind regards, |
Ah, yes, of course. That makes sense. Actually I knew about this at some point in the past, because I remember noticing the potential race condition issues related to piggybacking on this message: if Well, the next thing I was planning to do after this was to implement UDP info messages, which is designed specifically to address this problem. It will make timing irrelevant, because these messages will be sent periodically so that UDP address information can be updated in real-time as intermediate nodes establish tunnels (which they always do because SPTPS relaying fallback). Of course that will only work if every node in the path is able to understand these messages, so that excludes 1.0.x. Is that okay with you? From your message I get the feeling you had something similar in mind.
Well, I fine-tuned it for the worst case scenario (no
Setting it to 1 is a bad idea, because it means the first probe is always set to maxmtu ( The reason why the multiplier is very sensitive is because it is directly multiplying an exponent. Exponentials are known to be quite sensitive :)
You might want to try using your random algorithm but with a non-uniform distribution biased towards small probes. The problem is, you're unlikely to get the nice behavior of trying large probes first (which is what my algorithm does) if you do it that way.
I thought about it, but I was trying to refrain from adding more complexity and state to the algorithm, so I did not investigate that option. But if you're fine with it, by all means, be my guest. |
This introduces a new configuration option, UDPDiscoveryKeepaliveInterval, which is used as the UDP discovery interval once the UDP tunnel is established. The pre-existing option, UDPDiscoveryInterval, is therefore only used before UDP connectivity is established. The defaults are set so that tinc sends UDP pings more aggressively if the tunnel is not established yet. This is appropriate since the size of probes in that scenario is very small (16 bytes).
I've added a commit for the discovery timeout change. I've set it to 2 seconds, that seems more appropriate than 0.3 second which sounds very aggressive to me. Tunnel establishment getting delayed by two or four seconds because of lost packets doesn't seem like a big deal to me. Feel free to readjust if you disagree. I tried to reproduce your issue with broadcasts, but was unable to. I've set up a testbed where I only send broadcasts and nothing else, and everything works normally. The alternative would have been very surprising because broadcasts use the same I'm okay with merging now. |
On Sat, Jan 03, 2015 at 01:26:40AM -0800, Etienne Dechamps wrote:
Yes. But between 1.1 and 1.0 nodes it should still be possible to just
That might be interesting. Anyway, I put a tarball with the simulation http://tinc-vpn.org/temp/simpmtu.tar.gz I ran the simulations with both 1% and 10% packet loss. With 1% packet Met vriendelijke groet / with kind regards, |
On Sat, Jan 03, 2015 at 02:24:57AM -0800, Etienne Dechamps wrote:
I didn't think so... but maybe I made a mistake in my setup, I'll try it
Well, since those probes are only sent when there is other traffic
Great! Met vriendelijke groet / with kind regards, |
Well, it hurts because it increases network chatter, though how much chatter is acceptable is a pure judgement call, so I'm fine with anything you think is best. That said, I would urge you to consider the worst case scenario of sustained traffic from a node behind a TCP-only firewall: in that case 4 probes will be sent per interval (1 for the destination node, 1 for the relay, and both are in duplicate because of local discovery). If you do that every 0.3 second, you end up with 13 pings per second, which seem a little extreme to me. |
On Sat, Jan 03, 2015 at 07:25:49AM -0800, Etienne Dechamps wrote:
Ah, maybe I forgot to mention, but I don't intend to do probes every 0.3
The rationale is to have quick (re-)establishment of UDP tunnels, then Met vriendelijke groet / with kind regards, |
I don't think this is worth the extra complexity, but that's just my opinion. Furthermore, if you're going to do something like this you might as well go all the way and implement exponential backoff, since that's best practice for this sort of thing as far as I know. |
As promised, here is the result of my work on tinc's MTU discovery code.
This massive pull request has three major goals, which are described below.
Code simplification
The current MTU discovery code is quite difficult to understand. The crux of the problem is the
send_mtu_probe_handler()
function which implements a complicated and hard to read state machine.I believe the main reason for the complexity if because the MTU discovery code is trying to do too many things at once: it's trying to estimate the MTU for the link, while making sure UDP connectivity is still up and running. These are two separate concerns intermingled in the same code, which increases state machine complexity.
In this pull request, I separate these two concerns by introducing a new concept, "UDP discovery". UDP discovery concentrates on keeping the UDP tunnel alive and checking its status, while MTU discovery concentrates on improving and fixing tunnel MTU as soon as UDP discovery establishes a working tunnel. This massively simplifies the MTU discovery code, which as a result only has two states to care about: initial discovery and "steady-state" (i.e. maxmtu checking). Separating these two concerns also makes it easier to implement the next improvements described below.
Less network chatter
Something that's been grinding my gears for quite some time is tinc's eagerness to send probes when the original reason for sending them is gone. Consider a scenario in which a node makes a DNS request to another node over the VPN, and then the link between these two goes silent. From the high-level VPN point of view, that's one single packet going over the wire in each direction. In response to this, tinc will start MTU discovery, which in some cases can result in 90 packets (~126KB) going over the wire. Then, it will continue maintaining the tunnel basically forever, sending 3 packets (~4KB) every few seconds. This is in response to sending one single small VPN packet. The resulting write amplification is somewhat absurd.
While this is not a critical problem per se, it results in lots of network pollution, and makes tinc look "chatty" and "verbose". The inefficiency is detrimental to mobile devices (battery time) and constrained network links. More importantly, it does not scale: if a tinc node sends one packet to 1000 other nodes and then goes silent (e.g. P2P application), the resulting chatter will look truly awful.
The solution implemented in this pull request is simple: if you're not sending any real VPN packets, then you don't get to send any probes. Probes can only be sent from the
send_packet()
TX path. No packets, no probes. This results in the following behavior:UDPDiscoveryTimeout
), tinc can't assume the UDP link is usable anymore. Therefore it will reset UDP discovery and MTU discovery information. If the link gets used again, the first packet will go over TCP (which is fine), while tinc re-establishes a UDP tunnel and rediscovers the MTU. The latter operation is not really that expensive thanks to the next improvement below.The keyword here is adaptive. tinc adapts to the real usage of the tunnel between two nodes, and will send probes more or less aggressively depending on how much the tunnel is used.
Improved MTU discovery algorithm
This pull request also contains two changes to the core of the MTU discovery algorithm itself, yielding massive improvements with regard to convergence time and accuracy.
The first change makes the MTU probing code send one packet at a time instead of three. The frequency is adjusted from 1 second to 0.3 second to compensate, so that the number of packets per second stays the same. The rationale is that it leads to more efficient feedback, as the size of each probe is based on the previous replies; therefore, allowing replies to come in between probes decreases the total number of probe packets required to converge.
The second change moves from a naive random probe size model to a more sophisticated mathematical model with the goal of more efficiently covering the remaining MTU interval. This leads to massive improvements in convergence time and accuracy.
With these two improvements, tinc is typically able to converge on the precise (byte-level accuracy) MTU in less than 10 probe packets (3 seconds!). In contrast, the current random algorithm gives highly variable results, sometimes using 6 times more probe packets to even get remotely close to the actual MTU.
This pull request supersedes #45.