Fix deadlock and backfill funding amounts#105
Fix deadlock and backfill funding amounts#105TheBlueMatt merged 6 commits intolightningdevkit:mainfrom
Conversation
|
👋 Thanks for assigning @joostjager as a reviewer! |
c0b5607 to
e57ab25
Compare
joostjager
left a comment
There was a problem hiding this comment.
Is this PR a response to a user report?
| funding_amount_sats, \ | ||
| announcement_signed \ | ||
| ) VALUES ($1, $2, $3) ON CONFLICT (short_channel_id) DO NOTHING", &[ | ||
| ) VALUES ($1, $2, $3) ON CONFLICT (short_channel_id) DO UPDATE SET funding_amount_sats = $2", &[ |
There was a problem hiding this comment.
Does this guarantee that everything will be backfilled and the described unwrap cannot fail anymore?
There was a problem hiding this comment.
No, not fully, but it should make it much more robust. Surprisingly I'd never seen it, but other folks have 🤷♂️
There was a problem hiding this comment.
Not fully, but it is safe still? I was looking for the mentioned unwrap, but wasn't completely sure which one it was.
There was a problem hiding this comment.
Its here -
rapid-gossip-sync-server/src/lookup.rs
Line 163 in b4e8434
There was a problem hiding this comment.
Should this be a safe get then for the final 1% and maybe just skip the row?
There was a problem hiding this comment.
I'm kinda confused why its coming up now to begin with - the commit was landed months ago and I assume most users have updated, so this should just get the stragglers working again. We can revisit it if there's still issues.
The DB persistence loop is responsible for taking new gossip off of the persistence queue and writing it to Postgres. If it gets stalled, the logic pushing gossip onto the event queue might also stall. In the next commit, we'll move the gossip persistence logic into a separate tokio runtime to avoid deadlocks, but in order for that to work it has to be the case that the gossip persistence task can never block waiting on a task on the main runtime, which would be possible via the `NetworkGraph` mutexes. Thus, here, we move the graph writing to its own task in the main runtime.
When we have many peers but relatively few threads, and postgres writes fall behind enough to block the async stream of gossip, we can end up deadlocking on the `PeerManager` `peers` thread. Specifically, because `PeerManager::process_events` holds a `peers` (read) lock while fetching messages (during which time we `block_in_place` trying to push gossip to the async stream), any other tasks which try to touch the `peers` lock in write mode (e.g. connecting to a peer or processing a disconnection) we'll hang until some postgres writes finish. If we get enough threads blocked or we block the tokio reactor and the postgres writes will stall, fully deadlocking. To avoid this, we here move the postgres writes to their own tokio runtime, ensuring that that reactor cannot be blocked.
In c02f2ec we added `funding_amount_sats` to `channel_announcements`, making it `NOT NULL` for new DBs, but `NULL` for old ones. When querying, however, we (indirectly) `unwrap`ed the values. Here, we try to backfill by looking at the local network graph on startup and updating the existing entries, rather than ignoring conflicting writes.
Yes, see discord. |
736dc32 to
406cf7f
Compare
5c6eb95 to
b9204d3
Compare
joostjager
left a comment
There was a problem hiding this comment.
Left few small non-blocking remarks.
| GossipPersister::new(self.logger.clone()).await; | ||
| log_info!(self.logger, "Starting gossip db persistence listener"); | ||
| tokio::spawn(async move { persister.persist_gossip().await; }); | ||
| let runtime = Builder::new_multi_thread() |
There was a problem hiding this comment.
It disappointing that quite a few issues around async seem to require another runtime 😞 Maybe copy the explanation in the commit message to a code comment here. Don't think it'll be obvious otherwise why this is needed.
There was a problem hiding this comment.
Yea, async is really such a strongly-colored function :(
| funding_amount_sats, \ | ||
| announcement_signed \ | ||
| ) VALUES ($1, $2, $3) ON CONFLICT (short_channel_id) DO NOTHING", &[ | ||
| ) VALUES ($1, $2, $3) ON CONFLICT (short_channel_id) DO UPDATE SET funding_amount_sats = $2", &[ |
There was a problem hiding this comment.
Should this be a safe get then for the final 1% and maybe just skip the row?
|
Pushed an additional comment. |
No description provided.