perf: make m_peer_mutex a SharedMutex #6468
Conversation
Guix Automation has begun to build this PR tagged as v22.1.0-devpr6468.fba3a04a. A new comment will be made when the image is pushed.
Guix Automation has completed; a release should be present here: https://github.com/dashpay/dash-dev-branches/releases/tag/v22.1.0-devpr6468.fba3a04a. The image should be on dockerhub soon.
fba3a04 to f044507 (force-push)
✅ No Merge Conflicts Detected: This PR currently has no conflicts with other open PRs.
Guix Automation has begun to build this PR tagged as v23.0.0-devpr6468.f0445074. A new comment will be made when the image is pushed.
Below is the aggregated contention data I collected on masternode-1 on testnet during 2 minutes of ~8 TPS, so lots of tx locking was happening. As you can see: [contention data not extracted]

Also, when running a local functional test for testing islock latency, I saw this contention (contention detected 563 times): [output not extracted]. After this change, there is no contention over the lock.
**Walkthrough**

This change replaces the exclusive `Mutex` guarding `m_peer_map` with a `SharedMutex`: readers now take shared locks and can run concurrently, while writers (peer insert/remove) keep taking the exclusive lock.

**Sequence diagram(s)** (mermaid, summarized): in the old flow, Reader1 and Reader2 serialize on the exclusive lock; in the new flow, Reader1, Reader2, and the Writer interleave, with shared locks for readers and exclusive locks for writes (colored blocks: shared reads in green, exclusive writes in orange).

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
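For orientation, here is a minimal self-contained C++ sketch of the pattern the walkthrough describes (shared locks for readers, an exclusive lock for writers). The names (`PeerManagerSketch`, plain `std::shared_mutex`) are illustrative stand-ins, not the actual Dash wrappers:

```cpp
#include <map>
#include <memory>
#include <mutex>
#include <shared_mutex>

struct Peer { /* per-peer state */ };
using PeerRef = std::shared_ptr<Peer>;

class PeerManagerSketch
{
    mutable std::shared_mutex m_peer_mutex;
    std::map<int, PeerRef> m_peer_map; // guarded by m_peer_mutex

public:
    // Readers take a shared lock: many threads can look up peers concurrently.
    PeerRef GetPeerRef(int id) const
    {
        std::shared_lock lock(m_peer_mutex);
        auto it = m_peer_map.find(id);
        return it != m_peer_map.end() ? it->second : nullptr;
    }

    // Writers (peer add/remove) still take the exclusive lock.
    void RemovePeer(int id)
    {
        std::unique_lock lock(m_peer_mutex);
        m_peer_map.erase(id);
    }
};
```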
**Pre-merge checks and finishing touches**

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/net_processing.cpp (1)
**2515-2532: Fix lifetime hazard: raw `Peer*` used after releasing m_peer_mutex.**

You store `Peer*` in `best[...]` under a shared lock, release the lock, then call `PushAddress(*best[i].second, …)`. If a peer is removed concurrently, this can become a dangling pointer (UAF). Capture `PeerRef` instead to keep the object alive post-unlock.

Apply this diff:

```diff
- std::array<std::pair<uint64_t, Peer*>, 2> best{{{0, nullptr}, {0, nullptr}}};
+ std::array<std::pair<uint64_t, PeerRef>, 2> best{{{0, {}}, {0, {}}}};
@@
- best[i] = std::make_pair(hashKey, peer.get());
+ best[i] = std::make_pair(hashKey, peer);
@@
- for (unsigned int i = 0; i < nRelayNodes && best[i].first != 0; i++) {
-     PushAddress(*best[i].second, addr, insecure_rand);
- }
+ for (unsigned int i = 0; i < nRelayNodes && best[i].first != 0; i++) {
+     PushAddress(*best[i].second, addr, insecure_rand);
+ }
```

This mirrors the safe pattern used elsewhere (e.g., collecting `PeerRef`s in AskPeersForTransaction) and removes the UAF risk.
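To illustrate the lifetime argument in isolation, here is a hypothetical, self-contained sketch (not the PR's code) showing why a raw pointer acquired under a shared lock can dangle once the lock is released, while a `shared_ptr` such as `PeerRef` keeps the object alive:

```cpp
#include <memory>
#include <shared_mutex>
#include <vector>

struct Peer { int id; };
using PeerRef = std::shared_ptr<Peer>;

std::shared_mutex g_mutex;
std::vector<PeerRef> g_peers; // guarded by g_mutex

// Unsafe variant: the raw pointer may dangle after return, because another
// thread can erase the peer from g_peers and drop the last PeerRef.
Peer* PickRaw()
{
    std::shared_lock lock(g_mutex);
    return g_peers.empty() ? nullptr : g_peers.front().get();
} // lock released here; the pointee's lifetime is no longer guaranteed

// Safe variant: the returned shared_ptr keeps the Peer alive even if it is
// removed from g_peers after the lock is released.
PeerRef PickRef()
{
    std::shared_lock lock(g_mutex);
    return g_peers.empty() ? nullptr : g_peers.front();
}
```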
🧹 Nitpick comments (1)
src/net_processing.cpp (1)
**2370-2375: Shrink critical section to reduce contention (optional).**

Current loops hold `m_peer_mutex` (shared) while acquiring per-peer locks and doing work. Consider:

- Collect `std::vector<PeerRef>` under `READ_LOCK`, then release `m_peer_mutex` and process each peer (call `GetInvRelay()`, `PushInv()`, etc.). A generic sketch of this pattern follows this list.
- This reduces the time readers hold `m_peer_mutex` and lets writers (insert/remove) proceed sooner.

Also applies to: 2463-2471, 2476-2482
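A generic sketch of the collect-then-process pattern referenced above, with invented names (`RelayToAll`, `Peer::Push`); the real helpers live in net_processing.cpp:

```cpp
#include <map>
#include <memory>
#include <shared_mutex>
#include <vector>

struct Peer {
    void Push(int inv) { (void)inv; /* queue the inventory item for this peer */ }
};
using PeerRef = std::shared_ptr<Peer>;

void RelayToAll(std::shared_mutex& peer_mutex,
                const std::map<int, PeerRef>& peer_map, int inv)
{
    // Phase 1: snapshot the peers while holding only the shared lock.
    std::vector<PeerRef> peers;
    {
        std::shared_lock lock(peer_mutex);
        peers.reserve(peer_map.size());
        for (const auto& [_, peer] : peer_map) peers.push_back(peer);
    } // peer_mutex released here

    // Phase 2: do the per-peer work without holding peer_mutex, so writers
    // (peer insert/remove) are not blocked in the meantime.
    for (const auto& peer : peers) peer->Push(inv);
}
```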
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between e5ce1f0 and f0445074a18e4961f3be8880dbc8a7268bcd1f4e.
📒 Files selected for processing (1)
src/net_processing.cpp (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
src/**/*.{cpp,h,cc,cxx,hpp}
📄 CodeRabbit inference engine (CLAUDE.md)
src/**/*.{cpp,h,cc,cxx,hpp}: Dash Core C++ codebase must be written in C++20 and require at least Clang 16 or GCC 11.1
Dash uses unordered_lru_cache for efficient caching with LRU eviction
Files:
src/net_processing.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: win64-build / Build source
- GitHub Check: Lint / Run linters
- GitHub Check: arm-linux-gnueabihf / Build depends
- GitHub Check: x86_64-pc-linux-gnu / Build depends
🔇 Additional comments (4)
src/net_processing.cpp (4)
**1731-1734: GetPeerRef under shared lock — LGTM.** Read-only map lookup with `READ_LOCK(m_peer_mutex)` is correct; it returns a `PeerRef` to extend the peer's lifetime post-unlock.

**2304-2306: SendPings: shared lock + atomic write — LGTM.** Iterating under `READ_LOCK` and setting `std::atomic<bool> m_ping_queued` is safe.

**2314-2324: AskPeersForTransaction: shared lock to collect PeerRefs — LGTM.** Collecting `PeerRef`s under `READ_LOCK` and then acting under `cs_main` avoids holding `m_peer_mutex` across downstream locks.

**2382-2395: RelayDSQ: shared lock usage — LGTM.** Read-only iteration and deferring sends via `nodes_send_all` after unlocking is sound.
```diff
@@ -819,7 +819,7 @@ class PeerManagerImpl final : public PeerManager
     /** Protects m_peer_map. This mutex must not be locked while holding a lock
      * on any of the mutexes inside a Peer object. */
-    mutable Mutex m_peer_mutex;
+    mutable SharedMutex m_peer_mutex;
```
🧩 Analysis chain

**SharedMutex swap looks good; confirm macro support/fairness.**

The change to SharedMutex is appropriate for read-heavy access. Please confirm:

- `READ_LOCK(...)` is defined by our sync wrappers and mapped to a shared-lock type.
- There are no writer-starvation policy surprises on our platforms (clang/gcc toolchains).
🏁 Script executed:

```bash
#!/bin/bash
# Ensure shared-lock macros/types exist and are used consistently
rg -nP '(^|[^A-Z_])READ_LOCK\s*\(' --type=cpp
rg -nP 'class|using\s+SharedMutex|struct\s+SharedMutex' --type=cpp
rg -nP 'LOCK_SHARED\s*\(' --type=cpp
```

Length of output: 205545
I can confirm the macro support is in place. Now I need to verify the actual code at the flagged lines to assess the lifetime hazard mentioned in the scratchpad.
🏁 Script executed:

```bash
#!/bin/bash
# Inspect RelayAddress and _RelayTransaction code context for potential UAF issues
echo "=== RelayAddress context (line 2515-2528) ==="
sed -n '2510,2535p' src/net_processing.cpp
echo ""
echo "=== _RelayTransaction context (line 2463-2471) ==="
sed -n '2458,2480p' src/net_processing.cpp
echo ""
echo "=== Check if PeerRef is defined/used ==="
rg -n "struct PeerRef|class PeerRef|using PeerRef" src/net_processing.cpp
```

Length of output: 1797
🏁 Script executed:

```bash
#!/bin/bash
# Inspect RelayAddress and _RelayTransaction code context for potential UAF issues
echo "=== RelayAddress context (line 2515-2528) ==="
sed -n '2510,2535p' src/net_processing.cpp
echo ""
echo "=== _RelayTransaction context (line 2463-2471) ==="
sed -n '2458,2480p' src/net_processing.cpp
echo ""
echo "=== Check if PeerRef is used and defined ==="
rg -n "PeerRef|using.*Ref" src/net_processing.cpp | head -20
```

Length of output: 2935
**Fix the lifetime hazard in RelayAddress where raw peer pointers are used after releasing the lock.**

The `READ_LOCK(...)` macro is properly supported (defined at src/sync.h:343 and mapped to `ReadLock<decltype(cs)>` on `std::shared_mutex`). However, RelayAddress stores raw `Peer*` pointers in the `best` array at line 2522, then uses them after the lock scope ends at line 2532. If a peer is removed from `m_peer_map` between lock release and the `PushAddress` call, this becomes a use-after-free.

Lines 2515–2528 (RelayAddress): capture `PeerRef` (a `shared_ptr`, line 439) instead of raw pointers to extend peer lifetime beyond the lock. Replace `peer.get()` with the `shared_ptr` directly, or collect into `std::vector<PeerRef>` like AskPeersForTransaction does (line 2310).

The other `READ_LOCK` sites (_RelayTransaction, RelayRecoveredSig) are safe because all peer access occurs within the lock scope.
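For readers unfamiliar with the sync wrappers, here is a simplified approximation of how a `READ_LOCK`-style macro can sit on top of `std::shared_mutex`. This is a sketch only; the real definitions in src/sync.h carry Clang thread-safety annotations and other machinery omitted here:

```cpp
#include <mutex>
#include <shared_mutex>

// Stand-in for Dash's annotated SharedMutex (annotations omitted).
using SharedMutex = std::shared_mutex;

template <typename Mutex>
class ReadLock
{
    std::shared_lock<Mutex> m_lock;
public:
    explicit ReadLock(Mutex& mutex) : m_lock(mutex) {}
};

// Shared (reader) lock for the rest of the enclosing scope.
#define READ_LOCK(cs) ReadLock<decltype(cs)> read_lock(cs)

SharedMutex g_mutex;
int g_value = 0; // guarded by g_mutex

int ReadValue()
{
    READ_LOCK(g_mutex); // many readers may hold this concurrently
    return g_value;
}

void WriteValue(int v)
{
    std::unique_lock lock(g_mutex); // writers still lock exclusively
    g_value = v;
}
```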
🤖 Prompt for AI Agents
src/net_processing.cpp lines 2515-2528: RelayAddress currently collects raw
Peer* pointers into `best` while holding the READ_LOCK, then uses them after the
lock is released causing a potential use-after-free; change the collection to
capture PeerRef (the shared_ptr alias used for peers) instead of raw pointers
(e.g., gather a std::vector<PeerRef> while inside the READ_LOCK), replace later
uses of `peer.get()` with the stored PeerRef, and then call PushAddress/other
post-lock APIs using the preserved PeerRef so peers remain alive after the lock
is released.
Please look at this again;
For the first case, it appears to me the READ_LOCK is held through the destruction of the best array. I don't see a UAF.
Oops, something went wrong! Please try again later. 🐰 💔
Please look at this again;
For the first case, it appears to me the READ_LOCK is held through the destruction of the best array. I don't see a UAF.
@PastaPastaPasta You’re right — in RelayAddress the READ_LOCK(m_peer_mutex) is acquired in the function scope and remains held through both loops, including the PushAddress calls. With the shared lock held, m_peer_map cannot be modified by writers, so the Peer* captured into best stays valid during its use. No UAF there. My earlier concern was unfounded — thanks for the nudge.
Optional: if you ever want to reduce the time m_peer_mutex is held, you could collect PeerRef entries under the read lock, release it, then call PushAddress on those refs. Not necessary for correctness here, just a potential micro-optimization.
Guix Automation has completed; a release should be present here: https://github.com/dashpay/dash-dev-branches/releases/tag/v23.0.0-devpr6468.f0445074. The image should be on dockerhub soon.
kwvg left a comment
utACK f0445074a18e4961f3be8880dbc8a7268bcd1f4e
This pull request has conflicts, please rebase.
utACK f0445074a18e4961f3be8880dbc8a7268bcd1f4e (need to rebase)
f044507 to 43fd60f (force-push)
knst left a comment
re-utACK 43fd60f
Actionable comments posted: 0
🧹 Nitpick comments (6)
src/net_processing.cpp (6)
**2293-2298: SendPings: read-lock + atomic write is safe; consider shortening lock scope.**

Optional: collect `PeerRef` under the read lock, then set `m_ping_queued` outside to minimize time holding `m_peer_mutex`.

```diff
- READ_LOCK(m_peer_mutex);
- for(auto& it : m_peer_map) it.second->m_ping_queued = true;
+ std::vector<PeerRef> peers;
+ {
+     READ_LOCK(m_peer_mutex);
+     peers.reserve(m_peer_map.size());
+     for (const auto& [_, peer] : m_peer_map) peers.emplace_back(peer);
+ }
+ for (const auto& peer : peers) {
+     peer->m_ping_queued = true;
+ }
```
**2361-2366: RelayInv: avoid holding m_peer_mutex while pushing to peers.**

Optional: mirror the AskPeersForTransaction pattern to shorten lock time.

```diff
- READ_LOCK(m_peer_mutex);
- for (const auto& [_, peer] : m_peer_map) {
-     if (!peer->GetInvRelay()) continue;
-     PushInv(*peer, inv);
- }
+ std::vector<PeerRef> peers;
+ {
+     READ_LOCK(m_peer_mutex);
+     peers.reserve(m_peer_map.size());
+     for (const auto& [_, peer] : m_peer_map) peers.emplace_back(peer);
+ }
+ for (const auto& peer : peers) {
+     if (!peer->GetInvRelay()) continue;
+     PushInv(*peer, inv);
+ }
```
**2370-2393: RelayDSQ: consistent with shared lock; optionally move INV path out of lock.**

You already defer DSQ::ALL sends via a NodeId vector. For DSQ::INV, consider collecting `PeerRef` under `READ_LOCK`, then `PushInv` after releasing it to further reduce contention; a hypothetical sketch follows.
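A sketch of what the suggested DSQ::INV change could look like; `WantsDSQInv()` is a hypothetical predicate standing in for the real INV-path condition, and the surrounding RelayDSQ structure is assumed:

```cpp
// Hypothetical shape of the suggestion, not the actual RelayDSQ body.
std::vector<PeerRef> inv_peers;
{
    READ_LOCK(m_peer_mutex);
    for (const auto& [_, peer] : m_peer_map) {
        if (peer->WantsDSQInv()) { // invented stand-in predicate
            inv_peers.push_back(peer);
        }
    }
} // m_peer_mutex released before any send work
for (const auto& peer : inv_peers) {
    PushInv(*peer, inv);
}
```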
**2453-2464: _RelayTransaction: shorten lock scope while broadcasting.**

Optional: collect `PeerRef`, then `PushInv` outside the `READ_LOCK`.

```diff
- READ_LOCK(m_peer_mutex);
- for(auto& it : m_peer_map) {
-     Peer& peer = *it.second;
-     auto tx_relay = peer.GetTxRelay();
-     if (!tx_relay) continue;
-     PushInv(peer, inv);
- };
+ std::vector<PeerRef> peers;
+ {
+     READ_LOCK(m_peer_mutex);
+     peers.reserve(m_peer_map.size());
+     for (const auto& [_, peer] : m_peer_map) peers.emplace_back(peer);
+ }
+ for (const auto& peer : peers) {
+     if (auto tx_relay = peer->GetTxRelay()) {
+         PushInv(*peer, inv);
+     }
+ }
```
**2468-2475: RelayRecoveredSig: same optional lock-narrowing.**

Collect `PeerRef` under `READ_LOCK`, then `PushInv` after releasing it.

```diff
- READ_LOCK(m_peer_mutex);
- for (const auto& [_, peer] : m_peer_map) {
-     if (peer->m_wants_recsigs) {
-         PushInv(*peer, inv);
-     }
- }
+ std::vector<PeerRef> peers;
+ {
+     READ_LOCK(m_peer_mutex);
+     peers.reserve(m_peer_map.size());
+     for (const auto& [_, peer] : m_peer_map) peers.emplace_back(peer);
+ }
+ for (const auto& peer : peers) {
+     if (peer->m_wants_recsigs) {
+         PushInv(*peer, inv);
+     }
+ }
```
**2504-2525: RelayAddress: current approach is safe; optional lifetime-friendly tweak.**

With `READ_LOCK` held across both loops, raw `Peer*` remains valid — no UAF. If you want to release `m_peer_mutex` before `PushAddress`, store `PeerRef` instead of raw pointers.

```diff
- std::array<std::pair<uint64_t, Peer*>, 2> best{{{0, nullptr}, {0, nullptr}}};
+ std::array<std::pair<uint64_t, PeerRef>, 2> best{{{0, {}}, {0, {}}}};
@@
- READ_LOCK(m_peer_mutex);
- for (auto& [id, peer] : m_peer_map) {
+ {
+     READ_LOCK(m_peer_mutex);
+     for (const auto& [id, peer] : m_peer_map) {
@@
-     best[i] = std::make_pair(hashKey, peer.get());
+     best[i] = std::make_pair(hashKey, peer);
@@
- };
+ } // release m_peer_mutex
@@
- PushAddress(*best[i].second, addr, insecure_rand);
+ PushAddress(*best[i].second, addr, insecure_rand);
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between f0445074a18e4961f3be8880dbc8a7268bcd1f4e and 43fd60f.
📒 Files selected for processing (1)
src/net_processing.cpp (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
src/**/*.{cpp,h,cc,cxx,hpp}
📄 CodeRabbit inference engine (CLAUDE.md)
src/**/*.{cpp,h,cc,cxx,hpp}: Dash Core C++ codebase must be written in C++20 and require at least Clang 16 or GCC 11.1
Dash uses unordered_lru_cache for efficient caching with LRU eviction
Files:
src/net_processing.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: linux64-build / Build source
- GitHub Check: arm-linux-build / Build source
- GitHub Check: linux64_nowallet-build / Build source
- GitHub Check: x86_64-apple-darwin / Build depends
- GitHub Check: x86_64-pc-linux-gnu_multiprocess / Build depends
- GitHub Check: Lint / Run linters
- GitHub Check: x86_64-w64-mingw32 / Build depends
- GitHub Check: x86_64-pc-linux-gnu / Build depends
- GitHub Check: Lint / Run linters
🔇 Additional comments (3)
src/net_processing.cpp (3)
**815-824: SharedMutex swap is correct for read-dominant access.** `m_peer_map` still uses exclusive `LOCK`s on mutation (InitializeNode/RemovePeer). An appropriate upgrade.

**1724-1729: GetPeerRef now uses READ_LOCK — safe lifetime.** Returning `PeerRef` under a shared lock, then releasing, is correct per the comment that a `shared_ptr` keeps the peer alive.

**2304-2315: AskPeersForTransaction: good pattern.** Collecting `PeerRef` under `READ_LOCK` and using them after releasing the lock is exactly the right approach here.
kwvg left a comment
utACK 43fd60f
UdjinM6 left a comment
utACK 43fd60f
This PR was later referenced in the merge of a follow-up PR, *instantsend: cache heights and reduce cs_main contention*:

e918729 refactor: add CacheTipHeight, improve cache API safety (UdjinM6)
493d077 refactor: simplify optional usage (pasta)
0e1f114 instantsend: cache heights and reduce cs_main contention (pasta)

Pull request description:

## Issue being fixed or feature implemented

Today's testing on testnet showed that, under load, cs_main was a major contender (however, note the major contenders from before #6912 / #6468 are now significantly reduced):

```
========================================================================
TOP 40 LOCATIONS BY TOTAL CONTENTION TIME (AGGREGATE)
========================================================================
Lock Name                        Location                          Count  Total(μs)   Avg(μs)  Max(μs)  Nodes
--------------------------------------------------------------------------------------------------------------
cs                               llmq/signing_shares.cpp:507       11952  81357095    6807.0   299734   8
::cs_main                        instantsend/signing.cpp:210       75241  62985844     837.1    93240   8
cs_main                          index/base.cpp:340                29584  27040349     914.0  1613365   8
cs_main                          net_processing.cpp:5462           69291  19784473     285.5  1329191   8
cs                               llmq/signing_shares.cpp:1732       2355  17466035    7416.6   127471   8
::cs_main                        llmq/quorums.cpp:1214             38975  15478543     397.1   883884   8
m_nodes_mutex                    net.cpp:2043                      33457  14736059     440.4    31622   8
::cs_main                        llmq/quorums.cpp:526              10587  10577503     999.1  2083501   8
pnode->cs_vSend                  net.cpp:2435                      68528   9671050     141.1    37265   8
cs_main                          net_processing.cpp:4196            4945   8608157    1740.8  1326146   8
::cs_main                        instantsend/instantsend.cpp:272    5602   8416269    1502.4  1260997   8
cs                               txmempool.cpp:1319                 7942   8059684    1014.8   356199   8
::cs_main                        validation.cpp:3747                 309   6468144   20932.5  1225274   8
::cs_main                        validation.cpp:6009               21855   3393152     155.3   228195   8
pnode->cs_vSend                  net.cpp:4709                       7532   2299813     305.3    41543   8
m_nodes_mutex                    ./net.h:1374                       6229   1998506     320.8    15593   8
inv_relay->m_tx_inventory_mutex  net_processing.cpp:1169            7687   1871859     243.5    16292   8
cs_db                            instantsend/db.cpp:239             4323   1527297     353.3    20779   8
cs_cache                         spork.cpp:244                     18083   1331472      73.6    27722   8
::cs_main                        chainlock/signing.cpp:58             98   1312496   13392.8  1273691   8
cs_db                            ./instantsend/db.h:139             4381   1305738     298.0    29777   8
mutexMsgProc                     net.cpp:2604                       4877   1243737     255.0    24641   8
```

- Add unordered_lru_cache for block heights in CInstantSendManager
- Provide GetTipHeight() (non-optional) and GetBlockHeight() helpers
- Use cache in InstantSendSigner::CheckCanLock
- Cache-first for cycleHash height in ISLOCK message path and batch verify
- Cache-first mined-height for HasChainLock and WriteInstantSendLockMined

## What was done?

## How Has This Been Tested?

## Breaking Changes

_Please describe any breaking changes your code introduces_

## Checklist:

_Go over all the following points, and put an `x` in all the boxes that apply._

- [ ] I have performed a self-review of my own code
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have added or updated relevant unit/integration/functional/e2e tests
- [ ] I have made corresponding changes to the documentation
- [ ] I have assigned this pull request to a milestone _(for repository code-owners and collaborators only)_

ACKs for top commit:
UdjinM6: utACK e918729

Tree-SHA512: 1ea45e92c717dbefe5625bba3b27612b9ffbb21ea106d11005d78836fbe6ce00503e3f3fdb927fb405b09fd795f50137ca2beb07351bf94488d2af5694e543ba
## Issue being fixed or feature implemented
m_peer_mutex is a great use case for a shared mutex: it's private, has a well-contained scope, and 99% of usages are reads. The only time we hold an exclusive lock is when adding or removing a peer, which happens rarely compared to all other usages.

On the latest nightly, this mutex represents only about 2% of contention (all in `PeerRef PeerManagerImpl::GetPeerRef`); however, I intend to move mnauth / verified protx related stuff into PeerManager, and I expect that will result in significantly more contention over m_peer_mutex, making this change more valuable.

I expect this will reduce contention over this mutex to ~0. A rough standalone illustration of why a shared mutex helps a read-heavy lock follows.
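The sketch below is not a Dash benchmark (timings vary by platform); it only shows that reader threads which would serialize on a plain `std::mutex` can overlap on a `std::shared_mutex`:

```cpp
#include <chrono>
#include <iostream>
#include <shared_mutex>
#include <thread>
#include <vector>

int main()
{
    std::shared_mutex mutex;
    auto reader = [&] {
        for (int i = 0; i < 1000; ++i) {
            std::shared_lock lock(mutex); // readers hold the lock concurrently
            std::this_thread::sleep_for(std::chrono::microseconds(50));
        }
    };
    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (int t = 0; t < 8; ++t) threads.emplace_back(reader);
    for (auto& th : threads) th.join();
    const auto elapsed = std::chrono::steady_clock::now() - start;
    // With an exclusive std::mutex the 8 readers would serialize (roughly 8x
    // the single-thread time); with shared_mutex they mostly run in parallel.
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count()
              << " ms\n";
}
```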
## What was done?

## How Has This Been Tested?

On testnet

## Breaking Changes

## Checklist:

Go over all the following points, and put an `x` in all the boxes that apply.