-
Notifications
You must be signed in to change notification settings - Fork 841
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Next Hop-based routing with fallback to flooding #2856
base: master
Are you sure you want to change the base?
Conversation
I have started looking. May take me some to to finish. I am not a statistician so I asked ChatGPT about the probability that the last byte of two nodes match. It came up with the result of 16.31%. Considering that this is a non breaking change I would think that the risk of two nodes broadcasting is still worth it. |
🤖 Pull request artifacts
|
It depends on the amount of nodes that can hear the packet (so the amount of immediate neighbors). But indeed, the chance that then only the intended next hop relays is a bit different than I presented before. I've updated the description with two calculations. |
I pushed a new commit to resolve @loodydo’s comments. Also added the hop limit setting of the original transmitter to the header flags (using 3 of the 4 currently unused bits), such that we can determine the amount of hops the packet has already traveled. This comes in handy to set the next hop for immediate neighbors, and I think it’s also useful for displaying in the apps. Currently it will always show the SNR/RSSI of nodes in the list, but this is actually the SNR/RSSI to the last relayer. In cases that a node is not an immediate neighbor, we could display the hops towards that node. I also changed the way how the |
I think this PR is ready to merge, except that it’s a breaking change unfortunately. |
The NodeNum is only 32 bits and based on the MAC address of the device. Where does your 0.4% probability refer to? |
UINT32 modulo 251 = 0..250 |
I'm sorry but I don't get why that is better than using the last byte, which is 0..255. It's not a checksum, it's to uniquely identify a node. If the NodeNum is already uniformly distributed, taking the last byte should not have any bias. |
About the routing itself ... Each node could regularly publish a broadcast of its direct neighbors and communicate the reception quality of RSSI and SNR. The list is sorted by reception quality and only the best 32 nodes are published. 1 byte or 8 bits are used per node: Together with the node DB and the GPS coordinates, either the node itself or the app (smartphone, web browser, CLI) can then create a network topology. Only 7 bytes would be required for a route with 7 hops |
That's what the neighborinfo module does. |
Ok, now I understand the benefits of the Neighbor Info Module. Should the node do the routing itself later? Does the node DB then store all the neighbors of each node? How much memory would this require? I once read somewhere that the node DB would only store up to 80 nodes. The documentation on the Neighbor Info Module does not really encourage activation. The module would be needed for routing and ever larger mesh networks. |
I do not understand the "wait for 3.0", most likely I don't understand what deploying 3.0 means... I would assume that nodes in a mesh will be upgraded slowly, especially big meshes, like SoCalMesh with hundreds of nodes, some in pretty inaccessible locations. This means that whether it's called 3.0 or not the next hop routing will have to co-exist with 2.x. Unless you count on all the 2.X having been upgraded to a version that sets next_hop to zero by then... I'm wondering whether it would be possible to use the NeighborInfo to gate next-hop routing. If the NeighborInfo can have a flag that says "I do next-hop routing" then on the outgoing side a relayer knows whether populating next-hop makes sense or not. On the incoming side it's a bit trickier in that a relayer would look up the NeighborInfo based on relay_node. If relay_node matches a NeighborInfo that says it does next-hop routing then the relayer would obey next_hop, although of course the match may be by chance. As a special case, if hop_start is set and equal to hop_limit then IIUC the message is received from the original sender and thus NeighborInfo can be looked up using the from address. In any case, it seems worth it to figure out some heuristics to make 3.0 interoperable with pre-2.3, which introduced the explicit next_hop field as far as I can tell and not rely on "it's going to take so long for 3.0 to become available that everyone will have upgraded past 2.3 by then"... Apologies if I'm completely off-base and don't understand some key fact here. |
Shouldn't the NeighborInfo module be enabled by default? |
Version 3.0 will most likely have a different default modem preset (people have advocated for MediumFast) or another sync word to deliberately make nodes incompatible. I'm not really sure half-working interoperable solutions are the way to go here, especially if it requires the NeighborInfo module to be enabled. NeighborInfo creates quite a lot of additional traffic and is not necessary for NextHop routing, so I don't think it should be enabled by default. |
@GUVWAF I'd like to chime in here and say that I think this part has some major downsides. This threatens to split deployed infrastructure between old and new and drop coverage levels for both old and new until everything, including all client devices, have been changed over. For anyone who has gone further with deployment than just tinkering, this could be a major undertaking just due to the number of devices. And for anyone with infrastructure on mountaintops or other places which are difficult to access, these devices can be a pain in the neck to try to update. The more people using the same settings, the better it is for everyone's coverage, and the experience for new users would be better if they already have an existing network to connect to without having to build it all out themselves. Changing the default modem preset should not be done lightly. And I agree with @tve's comments about version interoperability. A thoughtful and gradual upgrade path would make it far more likely for people to actually upgrade. Ideally you want people to upgrade to the new version so that you can continue to innovate moving forward. If the changes are too drastic, you risk having people stuck with "good enough, works for me" instead, or even potentially having someone maintain an old compatible version as an LTS release. Or looking at the project/protocol as so unstable that it's not worth continuing to deploy. I love the idea of this, but a gradual rollout would make it a much easier pill to swallow. |
Ofcourse the decision will not be taken lightly and people have been saying it will be happing at least a year from now. I'm not entirely following why you're advocating for gradual changes, as precisely that leads to "good enough, works for me", instead of doing all breaking changes at once, where you have to upgrade to continue to talk to others that have. |
This pull is not merged because it is a breaking change that needs to wait until 3.0, and there is not enough other stuff to make a breaking change worth it. There is no way to do this gradually. Meshtastic is just not set up for multiple versions, when version 3 gets promoted version 2 is going to be deprecated. |
I'm not convinced, that's why I made suggestions above. Yes, there's no way to make the changes without compromises, but I don't know that the compromises have been explored or discussed. Quite possibly just my ignorance.
This is one of the biggest issues I have with the proposal: it uses up all the unused bits for its purposes and doesn't fix the versioning issue which causes the incompatibility nightmare. This kind'a guarantees that no more non-breaking changes can be made going forward which I'd rate as "not good". At a high level I'm wondering what the plan is for wide-area meshes, which is where the routing improvements have the most impact I believe. I'm part of the southern california mesh with hundreds of nodes spread over hundreds of square km, many mountain ranges dividing the geography, and many nodes in difficult to access/update places. My overall assessment is that the mesh as a whole is pretty unusable other than to toy around and experiment (which is cool). If the goal is to make this type of mesh usable over longer distances (like sending messages from Santa Barbara to San Diego -- approx 350km) some serious thought needs to go into routing. Maybe the answer is that this should be done via MQTT (internet), maybe the answer is that there should be an overlay backbone network to bridge long distances, maybe the answer lies in just improving the current type of routing. I don't know and it also depends a lot on what the project goals are. I'm not sure what the best place for this type of discussion is, but I believe it's relevant here because if there's going to be a breaking change then at least make the most out of it. I'm also not trying to piss on this PR: I actually think it's great, I'm just wondering about the bigger picture. |
But this is really beyond the scope of this PR. Currently these two bytes are useless anyway, because before 2.3 they could take any value depending on what was in that memory location. This is just one of the breaking changes currently proposed for 3.0 (fixing text compression is another one), I believe a change to the header to make it future-proof deserves its own PR. Edit: I now see @KyleMaas already opened a discussion for this: #3932
What kind of issues are you having and what makes you believe this has to do with routing? For broadcasts the current managed flooding approach is really not too bad. I'm not convinced there even exists a protocol that's guaranteed to be better in all the different use cases for Meshtastic (especially fixed or mobile nodes, or a mix), that doesn't require an internet connection and can run on microprocessors with limited RAM. For "smarter" routing you need periodic control messages, but those eat up the same bandwidth as regular traffic. Other problems with wireless mesh networks, e.g. the hidden node problem or unstable/changing RF conditions are not solved with better routing. |
@GUVWAF Yeah, starting that discussion to ask about forward compatibility is what led me into joining the discussion on this issue. I have several places in mind to add nodes that would vastly increase coverage in my area, but they are inaccessible enough that once they're up there, I'm probably realistically only going to be able to update the firmware maybe every 12 months or so. So I need to wait for a version of the firmware that has enough interoperability, or have all of the proposed protocol changes interoperable enough with older versions, that I can actually make these nodes available to the community. All of the current nodes I run are in areas which are at least reasonably accessible. But there's no sense in me taking on the time, effort, and expense of building out the expansions if it's going to be completely obsolete in a month - it makes more sense to wait for stability and not recommend Meshtastic to anyone who isn't already in our local coverage zones. So protocol stability and interoperability is extremely important to me. |
3.0 is like 18 months away best case. Honestly not really enough there yet to even consider it. |
@GUVWAF "Version 3.0 will most likely have a different default modem preset (people have advocated for MediumFast)" Where are such far-reaching decisions discussed and decided? What exactly is the background? The minimal increase in bandwidth is "bought" with a loss of range and all 2.x routers & repeaters. Alternatively, 3.x clients switch to Long / Fast or both networks run in parallel and apart. |
After looking at this again, I realized I made a big assumption that’s likely often not valid. I set the Another thing I didn’t consider is that only the original transmitter will fall back to flooding when it doesn’t hear anyone rebroadcast, but it might also be that intermediate hops loose their At least before the first issue is resolved, I’m not confident on merging this yet. Happy to hear any suggestions on this. |
@GUVWAF I got curious about this and dug around the git history a bit. I think maybe a way to determine if you can depend on these bytes being initialized is to see if Since they were both publicly released in the same official release, I think you could use I think the only way this doesn't hold true is if someone compiled a firmware themselves between those two commits. Unfortunately these really are the last free bits in the header. I dug around a bit see if there was a backwards-compatible way to add a trailing struct at the end of the packet for expansion but I don't see a way that would be possible with the way the rest of the packet is immediately memcpy'ed into a encrypted meshpacket variant. But - I wonder if it would be possible to use the bytes to indicate protocol version? The benefit would be a way to accomplish mostly backwards compatible extensibility in the transport layer. For instance, the first byte could indicate what version of the packet header is used ( for instance:
This of course would only be possible for non-broadcast messages currently but eventually if most clients upgrade to a version that support the extended header - we could deprecate the old handling of packet headers and use the extended header broadcast messages as well. I think this also allows room for V3 without totally breaking interoperability with V2. I don't know enough about Lora's MTU to know if it's even possible to increase the total packet size that way and this is probably the wrong place to discuss it anyway but I figured I'd throw that out here. |
@daramos Indeed using hopStart to determine whether the bytes are valid is something that was discussed on Discord, and would be doable. While I’m open to other ideas for those bytes, I’m not sure if a protocol version and extended header is worthwhile. First, you will still need a breaking change to introduce this the first time, because old firmware wouldn’t know about an extended header. |
Makes sense, thanks! |
So I was thinking... Is there any reason NeighborInfo packets must honor the same number of hops as other packet types? For NextHopRouting this could default to zero hops (only transmitted to immediately-connected nodes) as an example, severely reducing overhead on a high-saturation mesh. Any thoughts? I figure if you want telemetry from a node you can manually query NeighborInfo with a DM to get that data "phoned home" if you will, and otherwise neighbors are only sent to their adjacent nodes for supporting NextHop operation. |
I like this a lot! I exactly like to find a solution like this including telemetry. Would be great to have different Hop Settings for all messages sent by intervall with an additional option to only send them via MQTT. (for example for home node telemetry, where i don't want to disable TX on LoRa) |
@KodinLanewave By default, if you don't have the NeighborInfo Module enabled, you'll also not rebroadcast those packets, so essentially they are already 0-hop if others are not interested in them. I agree it would be nice to have different intervals for LoRa, PhoneAPI and MQTT (or disable one of them), but it also complicates things quite a lot. Maybe something for Meshtastic 3.0. |
This pull request has been mentioned on Meshtastic. There might be relevant details there: https://meshtastic.discourse.group/t/swiss-meshtastic-user/9688/153 |
This adds a NextHopRouter which only relays if it is the next hop for a packet. The next hop is set by the current relayer of a packet, which bases this on information from either the NeighborInfoModule, or a previous successful delivery via flooding. Right now, it is only used for DMs and not used for broadcasts. Using the NeighborInfoModule, it can derive the next hop of neighbors and that of neighbors of neighbors. For others, it has no information in the beginning, which results into falling back to the FloodingRouter. Upon successful delivery via flooding, it updates the next hop of the recipient to the node that last relayed the ACK to us. When the ReliableRouter is doing retransmissions because it didn't get an ACK, for the last retry, it will reset the next hop, in order to fall back to the FloodingRouter.
Some implementation details: the
next_hop
andcurrent_relayer
are added to the unencrypted PacketHeader. Since we only used 14 bytes out of the 16-byte header (it got this size due to memory alignment), I used the 2 unused bytes to save the last byte of the NodeNum of the next hop and current relayer. This means that there is a chance that the last byte of two nodes match, and then they will both relay. However, that’s not a big issue as that would be similar to flooding. The chance that only the intended next hop relays depends on the amount of nodes that can hear the packet. If that are 10 nodes (that would be a lot), the chance that only the next hop relays is 83.7% ((255/256*254/256*253/256 ... 247/256*)*100
), for 5 nodes the chance is 96.1% ((255/256*254/256*253/256*252/256)*100
).This means that there is actually no additional overhead, except that per node, the
next_hop
has to be stored in the NodeDB (only 4 bytes).With this, you get rid of the unnecessary rebroadcasts caused by flooding like the one from node 0 as shown below. (Note that an arrow to a node means that it received it (so there might be multiple arrows for one packet), but not necessarily that it was addressed.)
With the NextHopRouter, node 0 doesn't try to relay:
Furthermore, it solves this issue of the current implementation where the wrong node (1, because it has the lowest SNR) is relaying:
With the NextHopRouter, because 0 knew 2 is a neighbor of 3, 2 was set as next hop and the DM to 3 correctly arrives, as well as the real ACK.
I'm hoping to get some feedback, and maybe new ideas on assigning the next hop from this. Also, I would love to get ideas on how to improve broadcasts as well.
Edit: There are at least two remaining issues, see: #2856 (comment)