multi: fix inconsistent state in gossip syncer #9424
Conversation
This commit adds more logs around the ChannelUpdate->edge policy process flow.
From staticcheck: QF1002 - Convert untagged switch to tagged switch.
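For context, the shape of that lint suggestion is shown in the self-contained sketch below; the state names are made up for the example and are not the actual lnd diff. An untagged `switch` whose cases all compare the same variable becomes a tagged `switch` on that variable, with identical behavior.

```go
package main

import "fmt"

type syncerState int

const (
	syncingChans syncerState = iota
	chansSynced
)

func describe(state syncerState) string {
	// Before (untagged): every case repeats the same comparison.
	//
	//	switch {
	//	case state == syncingChans:
	//		...
	//	case state == chansSynced:
	//		...
	//	}
	//
	// After (tagged): switch directly on the value.
	switch state {
	case syncingChans:
		return "performing initial channel sync"
	case chansSynced:
		return "channels synced"
	default:
		return "unknown state"
	}
}

func main() {
	fmt.Println(describe(syncingChans))
}
```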
This is a pure refactor to add a dedicated handler when the gossiper is in state syncingChans.
This commit fixes the following race:
1. syncer(state=syncingChans) sends QueryChannelRange
2. remote peer replies ReplyChannelRange
3. ProcessQueryMsg fails to process the remote peer's msg as its state is neither waitingQueryChanReply nor waitingQueryRangeReply
4. syncer marks its new state waitingQueryChanReply, but too late

The historical sync will now fail, and the syncer will be stuck in this state. What's worse is that it cannot forward channel announcements to other connected peers now, as it will skip the broadcasting during the initial graph sync. This is now fixed by making sure the following two steps are atomic:
1. syncer(state=syncingChans) sends QueryChannelRange
2. syncer marks its new state waitingQueryChanReply
The mocked peer used here blocks on `sendToPeer`, which is not the behavior of `SendMessageLazy` on `lnpeer.Peer`. To reflect reality, we now make sure `sendToPeer` is non-blocking in the tests.
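For illustration only, here is a minimal, self-contained sketch of one way a mocked send can be made non-blocking, assuming a buffered channel that the test drains; the type and function names are invented for the example and are not the actual test code.

```go
package main

import "fmt"

// msg stands in for lnwire.Message in this illustrative sketch.
type msg string

// newNonBlockingSend returns a send function that never blocks the caller:
// messages land in a buffered channel that the test drains at its own pace,
// which is closer to the fire-and-forget behavior of SendMessageLazy on a
// real lnpeer.Peer than a blocking send.
func newNonBlockingSend(buf int) (func(...msg) error, <-chan msg) {
	ch := make(chan msg, buf)

	send := func(msgs ...msg) error {
		for _, m := range msgs {
			select {
			case ch <- m:
			default:
				// A real test would likely fail loudly here rather
				// than silently dropping the message.
				return fmt.Errorf("test message buffer full")
			}
		}
		return nil
	}

	return send, ch
}

func main() {
	send, out := newNonBlockingSend(8)
	_ = send("QueryChannelRange")
	fmt.Println(<-out)
}
```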
LGTM, nice catch
```diff
@@ -832,9 +832,13 @@ func (d *AuthenticatedGossiper) ProcessRemoteAnnouncement(msg lnwire.Message,
 
 		// If we've found the message target, then we'll dispatch the
 		// message directly to it.
-		syncer.ProcessQueryMsg(m, peer.QuitSignal())
+		err := syncer.ProcessQueryMsg(m, peer.QuitSignal())
```
Could you explain why the chainSyncer remaining in the `waitingQueryChanReply` state will cause the chan_announcements received directly from the peer to not be relayed?
yeah it's very intertwined...so we skip broadcasting channel anns from our peers here,
Lines 1574 to 1580 in e0a920a
```go
if newAnns != nil && shouldBroadcast {
	// TODO(roasbeef): exclude peer that sent.
	deDuped.AddMsgs(newAnns...)
} else if newAnns != nil {
	log.Trace("Skipping broadcast of announcements received " +
		"during initial graph sync")
}
```
and the `shouldBroadcast` is determined here,
Lines 1533 to 1535 in e0a920a
```go
// We should only broadcast this message forward if it originated from
// us or it wasn't received as part of our initial historical sync.
shouldBroadcast := !nMsg.isRemote || d.syncMgr.IsGraphSynced()
```
which relies on this method,
Lines 806 to 811 in e0a920a
```go
// IsGraphSynced determines whether we've completed our initial historical sync.
// The initial historical sync is done to ensure we've ingested as much of the
// public graph as possible.
func (m *SyncManager) IsGraphSynced() bool {
	return atomic.LoadInt32(&m.initialHistoricalSyncCompleted) == 1
}
```
and the var `initialHistoricalSyncCompleted` is marked via,
Lines 800 to 804 in e0a920a
```go
// markGraphSynced allows us to report that the initial historical sync has
// completed.
func (m *SyncManager) markGraphSynced() {
	atomic.StoreInt32(&m.initialHistoricalSyncCompleted, 1)
}
```
and the above method is called inside `processChanRangeReply` here,
Lines 951 to 954 in e0a920a
```go
// Ensure that the sync manager becomes aware that the
// historical sync completed so synced_to_graph is updated over
// rpc.
g.cfg.markGraphSynced()
```
and `processChanRangeReply` is called here,
Lines 528 to 537 in e0a920a
```go
select {
case msg := <-g.gossipMsgs:
	// The remote peer is sending a response to our
	// initial query, we'll collate this response,
	// and see if it's the final one in the series.
	// If so, we can then transition to querying
	// for the new channels.
	queryReply, ok := msg.(*lnwire.ReplyChannelRange)
	if ok {
		err := g.processChanRangeReply(queryReply)
```
Note that it will be blocked on `case msg := <-g.gossipMsgs:`, as the msg is never sent to the channel here,
Lines 1515 to 1532 in e0a920a
```go
// Reply messages should only be expected in states where we're waiting
// for a reply.
case *lnwire.ReplyChannelRange, *lnwire.ReplyShortChanIDsEnd:
	syncState := g.syncState()
	if syncState != waitingQueryRangeReply &&
		syncState != waitingQueryChanReply {

		return fmt.Errorf("received unexpected query reply "+
			"message %T", msg)
	}
	msgChan = g.gossipMsgs

default:
	msgChan = g.gossipMsgs
}

select {
case msgChan <- msg:
```
The above `case *lnwire.ReplyChannelRange, *lnwire.ReplyShortChanIDsEnd:` branch errors out because the state of the syncer has not been updated to `waitingQueryRangeReply` yet.
Thanks for the detailed explanation, agreed, it's really nested.
very nice investigation! 🔥 thanks for the detailed description
```diff
@@ -2430,7 +2433,8 @@ func (d *AuthenticatedGossiper) handleNodeAnnouncement(nMsg *networkMsg,
 	// TODO(roasbeef): get rid of the above
 
+	log.Debugf("Processed NodeAnnouncement: peer=%v, timestamp=%v, "+
```
I'll try to get around to updating the gossiper to use structured logging. That way we only need to add all of this to the context once, and then we can just `log.DebugS(ctx, "Processed NodeAnnouncement")`.
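To illustrate the idea, here is a small sketch using Go's standard log/slog (not lnd's actual logging package, and with placeholder values): the repeated attributes are attached once, and every later call only carries the short message.

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	base := slog.New(slog.NewTextHandler(os.Stdout, &slog.HandlerOptions{
		Level: slog.LevelDebug,
	}))

	// Attach the per-announcement fields once (values are placeholders)...
	logger := base.With(
		slog.String("peer", "<peer pubkey>"),
		slog.Int64("timestamp", 1736900000),
	)

	// ...so each subsequent call only needs the short message.
	logger.Debug("Processed NodeAnnouncement")
	logger.Debug("Broadcasting NodeAnnouncement")
}
```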
Fixed a long-standing issue, which is also sometimes found in our itest.
The Issue
When we receive a query response from the remote peer, we require the syncer to be in a waiting-for-reply state,
lnd/discovery/syncer.go
Lines 1515 to 1524 in e0a920a
And this state is only set once we finish sending the msg to the peer,
lnd/discovery/syncer.go
Lines 509 to 518 in e0a920a
However, we may get a reply from the remote peer before we finish setting the state above. The consequence is we never get out of the initial historical sync stage, causing the node to refuse to propagate channel announcements to its peers.
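To make the window concrete, here is a deliberately simplified toy model (made-up names, not lnd code) that replays the events in the order the race produces them.

```go
package main

import "fmt"

type syncerState int

const (
	syncingChans syncerState = iota
	waitingQueryRangeReply
)

// processReply mirrors the check in ProcessQueryMsg: replies are only
// accepted while the syncer is in a waiting state.
func processReply(cur syncerState) error {
	if cur != waitingQueryRangeReply {
		return fmt.Errorf("received unexpected query reply")
	}
	return nil
}

func main() {
	cur := syncingChans

	// 1. QueryChannelRange goes out while the state is still syncingChans.
	// 2. The remote peer's ReplyChannelRange is processed before the state
	//    transition below has run, so it is rejected:
	fmt.Println(processReply(cur))

	// 3. Only now does the syncer record the new state; the reply that
	//    would have completed the historical sync is already lost.
	cur = waitingQueryRangeReply
	_ = cur
}
```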
We barely notice this issue as the error is ignored here (and the error returned from `ProcessRemoteAnnouncement` is not processed anyway... more future PRs),
lnd/discovery/gossiper.go
Lines 831 to 836 in e0a920a
The fix
Quite a few debug logs have been added to help catch the bug, and the fix is to lock the state changes when sending msgs to the remote peer, so `send msg -> update state` is now atomic.
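As a rough picture of the shape of such a fix, here is a minimal sketch assuming a single mutex guards both the send and the state transition; the types and helpers are illustrative, not the PR's actual code.

```go
package main

import (
	"fmt"
	"sync"
)

type syncerState int

const (
	syncingChans syncerState = iota
	waitingQueryRangeReply
)

type toySyncer struct {
	mu    sync.Mutex
	state syncerState
	out   chan string // stand-in for the connection to the remote peer
}

// sendChanRangeQuery sends the query and records the new state while holding
// the lock, so a concurrently processed reply can never observe the old state
// after the query has already gone out.
func (s *toySyncer) sendChanRangeQuery() {
	s.mu.Lock()
	defer s.mu.Unlock()

	s.out <- "QueryChannelRange"
	s.state = waitingQueryRangeReply
}

// handleReply takes the same lock before checking the state, mirroring the
// check that previously raced with the state transition.
func (s *toySyncer) handleReply(msg string) error {
	s.mu.Lock()
	defer s.mu.Unlock()

	if s.state != waitingQueryRangeReply {
		return fmt.Errorf("received unexpected query reply %q", msg)
	}
	return nil
}

func main() {
	s := &toySyncer{out: make(chan string, 1)}
	s.sendChanRangeQuery()

	// However the goroutines interleave, a reply that arrives after the
	// query was sent now always sees the updated state.
	fmt.Println(s.handleReply("ReplyChannelRange")) // <nil>
}
```

Holding the lock across the send trades a little latency for the guarantee that a reply processed afterwards always observes the updated state.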