binance: Retry keep alive. #2958
Conversation
Force-pushed from f436428 to 7b64763.
Let's do something like this so that we can test.
Also, please look into the LIST_SUBSCRIPTIONS websocket message to see if we can leverage that in a periodic check to catch discrepancies.
Looking into querying the subscriptions.
Sorry, I have not tested this on testnet/mainnet yet, but I should be able to use it on testnet, correct?
Maybe? I haven't found testing mm on testnet very useful.
Maybe a better technique would be: if an orderbook has not received an update within a certain period, say 30 seconds, resync the orderbook. Currently we will only know that an orderbook is stale when checkSubs is run, so it could remain stale for 30 minutes.
This would result in tons of unnecessary reloads on low-volume markets.
I've been using this diff to test on mainnet. I've turned off my internet connection for 20 mins, then turned it back on, and the orderbooks immediately resync without hitting the new code here. The code looks good otherwise and works fine with the simnet test. Would be nice to be able to reproduce the issue though.
Force-pushed from 56ed9ba to e3cf2a8.
@martonp does the order book desync look ok? https://github.com/decred/dcrdex/compare/1a223db8b8ae83b3e4da70b6ddd1adbf421ef027..e3cf2a8358ff0ac4a6e6e61507439a2b1478f56b
When I disconnect for > 3 minutes, then reconnect, there seems to be a deadlock when calling VWAP. I'm looking into it.
client/mm/libxc/binance.go
Outdated
@@ -288,6 +291,11 @@ func (b *binanceOrderBook) Connect(ctx context.Context) (*sync.WaitGroup, error
			if retry != nil { // don't hammer
				continue
			}
		case <-b.disconnectedChan:
It looks like if for some reason the Websocket has not reconnected, but the snapshot request successfully completes, the orderbook will appear synced, but no messages will be coming through.
I think the order book snapshot doesn't happen over websocket? It's a separate secure HTTP request? So we should stop the books from looking synced until the websocket is back up?
How about the last change https://github.com/decred/dcrdex/compare/7833026d42ae23acc577fd07548c97c2158e580f..adef014da6fd9c56e84197ce13a1a5e3a24d0147
So, not allowing sync again until the websocket tells us it is up. If the connect messages came out of order, or we missed one, that would be bad though; unsure if that can happen.
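A toy model of that gating, with invented names (not the actual binance.go code): the book cannot be marked synced while the websocket layer reports down, and a disconnect clears the synced flag, so a snapshot request that completes during an outage cannot make the book look synced on its own.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// syncGate is an illustrative model of gating book sync on websocket state.
type syncGate struct {
	wsUp   atomic.Bool
	synced atomic.Bool
}

func (g *syncGate) wsConnected() { g.wsUp.Store(true) }

// wsDisconnected desyncs the book, so a completed snapshot alone cannot
// leave it looking synced while the socket is down.
func (g *syncGate) wsDisconnected() {
	g.wsUp.Store(false)
	g.synced.Store(false)
}

// markSynced only succeeds while the websocket is up.
func (g *syncGate) markSynced() bool {
	if !g.wsUp.Load() {
		return false
	}
	g.synced.Store(true)
	return true
}

func main() {
	g := &syncGate{}
	fmt.Println(g.markSynced()) // false: socket never came up
	g.wsConnected()
	fmt.Println(g.markSynced()) // true
	g.wsDisconnected()
	fmt.Println(g.synced.Load()) // false: disconnect desyncs
}
```

The sketch assumes the up/down notifications arrive in order; if they can be reordered or dropped, as the comment worries, the flag can end up wrong.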
Force-pushed from 7833026 to adef014.
I don't think this test failure is due to my changes.
client/mm/libxc/binance.go
Outdated
	// will not place new orders.
	connected := cs == comms.Connected
	bnc.booksMtx.RLock()
	defer bnc.booksMtx.RLock()
Suggested change:
-	defer bnc.booksMtx.RLock()
+	defer bnc.booksMtx.RUnlock()
whoops
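For context on why that one-character slip matters: `defer bnc.booksMtx.RLock()` acquires a second read lock on return and never releases either, so each call leaks two read locks and any later writer `Lock()` blocks forever, which is consistent with the VWAP hang reported earlier. A generic sketch of the correct pairing:

```go
package main

import (
	"fmt"
	"sync"
)

var booksMtx sync.RWMutex

// readBooks shows the correct pattern: the deferred call must be the
// matching RUnlock, not another RLock.
func readBooks(f func()) {
	booksMtx.RLock()
	defer booksMtx.RUnlock() // release, don't re-acquire
	f()
}

func main() {
	readBooks(func() { fmt.Println("reading under RLock") })
	// With the buggy defer, this writer Lock would block forever.
	booksMtx.Lock()
	booksMtx.Unlock()
	fmt.Println("writer lock acquired: no leaked read locks")
}
```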
client/mm/libxc/binance.go
Outdated
	select {
	case reconnectC <- struct{}{}:
	default:
	}
What's the reason for this?
I'm not sure. I think this is a batch of changes I got from @buck54321. A natural reconnect shouldn't need a new connection, I guess.
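For reference, the `select`/`default` send above is the standard non-blocking signal pattern in Go: with a buffered channel of capacity one, repeated triggers while a reconnect is already pending coalesce into a single queued signal instead of blocking the caller. A standalone sketch:

```go
package main

import "fmt"

// signalReconnect queues a reconnect request without blocking. It returns
// true if the signal was queued, false if one was already pending (the
// triggers coalesce into one).
func signalReconnect(reconnectC chan struct{}) bool {
	select {
	case reconnectC <- struct{}{}:
		return true
	default:
		return false
	}
}

func main() {
	c := make(chan struct{}, 1)
	fmt.Println(signalReconnect(c)) // true: queued
	fmt.Println(signalReconnect(c)) // false: coalesced with the pending one
	<-c                             // consumer handles the reconnect
	fmt.Println(signalReconnect(c)) // true: can queue again after drain
}
```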
There were a few issues I fixed.. consider these changes: 68008b3 You can test using the I think another good change would be to trigger a reconnect if the list subscriptions request fails, because this probably means that the connection is broken. |
Force-pushed from d2dc7fb to b97d591.
@martonp I added your changes.
The changes in wsconn.go: why not just add the reconnect timer to keepAlive? As it is now, we have to wait for a read or a write for the scheduled reconnect to happen, so it will happen after we want it to, or not at all if we never have another read or write.
client/mm/libxc/binance.go
Outdated
	PingWait:      time.Minute * 4,
	EchoPingData:  true,
	ReconnectSync: func() {
		bnc.log.Debugf("Binance reconnected")
OK not to trigger a full reconnect here, but maybe a call to checkSubs would be prudent.
client/comms/wsconn.go
Outdated
type WsCfg struct {
	// URL is the websocket endpoint URL.
	URL string
Why does this need to be a separate argument to the constructor?
I wanted the URL to only be stored in one place and be able to be updated, and I didn't want to add a mutex for the config or a field in the config. If we don't update the URL, then when there is a read error and it reconnects, it will only reconnect to the first market that was subscribed to.
I wanted the URL to only be stored in one place
I don't really see this as a compelling reason to change the function signature for every consumer. You can still have an unexported field that you update. When we have structs for config settings, we generally try to keep everything contained to the struct.
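Under that convention, the endpoint could stay an unexported field on the connection, updated under its own small mutex, without touching the constructor signature. A hypothetical shape, not the actual WsConn:

```go
package main

import (
	"fmt"
	"sync"
)

// wsConn keeps the endpoint in one place as an unexported field; SetURL
// lets resubscribes repoint it without a new constructor argument.
type wsConn struct {
	urlMtx sync.Mutex
	url    string
}

// SetURL updates the endpoint, e.g. when the set of subscribed markets changes.
func (c *wsConn) SetURL(u string) {
	c.urlMtx.Lock()
	defer c.urlMtx.Unlock()
	c.url = u
}

// currentURL is what the reconnect path would dial, so reconnects follow
// URL updates instead of always redialing the first subscribed market.
func (c *wsConn) currentURL() string {
	c.urlMtx.Lock()
	defer c.urlMtx.Unlock()
	return c.url
}

func main() {
	c := &wsConn{url: "wss://stream.example.com/btcusdt"}
	c.SetURL("wss://stream.example.com/btcusdt/ethusdt")
	fmt.Println(c.currentURL())
}
```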
I made some updates to that commit: a15e541
You're right, but I don't think the timer should be in
Force-pushed from b97d591 to a15e541.
Thanks, those changes look good to me. Added them. They replaced the last commit.
client/comms/wsconn.go
Outdated
We're making a lot of changes to shoehorn the auto-reconnect into the read/readRaw loops. Can you tell me what, functionally, is the difference between the proposed changes for this file (+112/-63) and this alternative take (+51/-6)?
In that change, won't there be an additional read loop running after each reconnect? When we reconnect from handleReadError, the read/readRaw loops will return, but that won't happen during a reconnect. It would work if there was a new context created each time read/readRaw is called, and it is cancelled before sending a struct to reconnectChan.
Ah. Instead of using conn.reconnectCh <- struct{}{}, we could:
conn.wsMtx.Lock()
if conn.ws != nil {
conn.ws.Close()
}
conn.wsMtx.Unlock()
I think, right?
Sending to reconnectCh already calls connect, which closes the old connection, so I guess the read loop would error out already.
This latest change doesn't work. After the first AutoReconnect, it just starts reconnecting over and over, every second.
How about this one: martonp@bd18aaa
There is a generic read function to avoid duplication.
With this commit you can use TestVWAP to test: martonp@f6029d7
This latest change doesn't work.. after the first AutoReconnect it just starts reconnecting over and over every second.
Oh yeah it does, hmm
Added marton changes.
There was nothing wrong with the read loops and we don't need to spawn a new goroutine for every message. Why are we making that change?
The reason for the read loop change is to be able to end the read loop whenever we do an auto reconnect. The reason for the new goroutine is to be able to end the read loop without having to wait for the ReadMessage or ReadJSON call to return.
If we just send a message on reconnectCh, the old read loop will still be running, and it will attempt to reconnect again whenever it encounters an error. We could solve this by creating new contexts for each call to read, but I think it's simplest to only have one read loop running at a time, and only send a message on reconnectCh when the read loop is being ended, either due to an error or an auto reconnect.
Force-pushed from cd816d9 to d089412.
testbinance panic:
Force-pushed from d089412 to 1ea349e.
Removed the wsconn changes for now. They should maybe be a separate PR. Attempting to fix the other issues pointed out by @martonp https://github.com/decred/dcrdex/compare/d089412d7cd1f74acae443b4728d581dad35efb3..1ea349eed8446916cd75718efdf265400db9925d
* handle ws reconnect signal
* binance: Retry keep alive.
* testbinance: Add flappy websocket.
* binance: Check market depth subs.
* binance: Desync books on disconnect.

Co-authored-by: Brian Stafford <[email protected]>
Untested. Attempting to solve a problem with stuck books.