Description
We ran into an issue recently where we had a MongoDB node fail. During this failure, it looks like a ShareDB op was deleted, even though its snapshot was committed.
I can't confirm for sure, but my suspicion is that this chain of events (or something very similar) happened:
- `writeOp()` succeeds
- `sharedb-mongo` attempts to `writeSnapshot()`
- MongoDB commits the snapshot to disk
- MongoDB falls over before sending the ack to the client
- The client disconnects because of node outage; presumably assumes the write has failed(?)
- `sharedb-mongo` attempts to "tidy up" the op from the supposedly failed commit, even though the snapshot write actually succeeded
- The result is a committed snapshot with a missing op
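To make the suspected sequence concrete, here is a minimal in-memory sketch. It does not use the real `sharedb-mongo` code or a real MongoDB connection: `createFakeDb`, `commit`, the `crashBeforeAck` flag and the `latestOp` field are all hypothetical names that only model the steps listed above.

```javascript
// Hypothetical in-memory stand-ins for the op and snapshot collections.
function createFakeDb() {
  return { ops: new Map(), snapshots: new Map() };
}

function writeOp(db, opId, op) {
  db.ops.set(opId, op);
}

// Persists the snapshot first, then optionally simulates the node dying
// before the acknowledgement reaches the client.
function writeSnapshot(db, docId, snapshot, crashBeforeAck) {
  db.snapshots.set(docId, snapshot);
  if (crashBeforeAck) {
    throw new Error('connection lost before ack');
  }
}

// Models the suspected flow: any error from the snapshot write is treated
// as a failed commit, so the op is tidied up.
function commit(db, docId, opId, op, snapshot, crashBeforeAck) {
  writeOp(db, opId, op);
  try {
    writeSnapshot(db, docId, snapshot, crashBeforeAck);
    return true;
  } catch (err) {
    db.ops.delete(opId); // the "tidy up" step
    return false;
  }
}

const demoDb = createFakeDb();
const committed = commit(demoDb, 'doc1', 'op1', { v: 1 }, { v: 2, latestOp: 'op1' }, true);
console.log(committed, demoDb.snapshots.has('doc1'), demoDb.ops.has('op1'));
```

In this model the client sees a failed commit, yet the snapshot is stored and the op it refers to is gone, which matches the state we observed.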
I'm not entirely sure what my recommendation is. At first, I thought we should just delete the code that tidies these ops, but I do worry that it will result in bloat of the op collection during periods of high concurrency on a document.
We could move to transactions, although I worry about the performance implications (and I can't see much online, apart from guidance to use them sparingly, which wouldn't be the case here...).
We could add an extra DB call before the deletion, which double-checks the op is non-canonical before deleting. Would have to check both `o_collection` and `collection` to see if there's another op in the chain that references this op, or if the current snapshot references it. This requires 2 extra fetches, which isn't super nice, but I guess it would only happen in the tidy-up case, and it avoids the general use of transactions.
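A rough sketch of what that double-check might look like, again against an in-memory model rather than real MongoDB queries. The `ops` map stands in for `o_collection` and `snapshots` for `collection`; the field names `docId`, `prevOp` and `latestOp`, and the helpers `isOpCanonical` and `tidyUpOp`, are assumptions for illustration and not the actual `sharedb-mongo` schema or API.

```javascript
// Hypothetical in-memory model of the two collections.
const store = {
  ops: new Map([
    ['op1', { docId: 'doc1', prevOp: null }],
    ['op2', { docId: 'doc1', prevOp: 'op1' }],
    ['op3', { docId: 'doc1', prevOp: 'op2' }],   // the snapshot points here
    ['stale', { docId: 'doc1', prevOp: 'op2' }], // lost the race to op3
  ]),
  snapshots: new Map([['doc1', { latestOp: 'op3' }]]),
};

// An op counts as canonical if the current snapshot references it, or if
// another op for the same document chains from it.
function isOpCanonical(store, docId, opId) {
  // Fetch 1: check the current snapshot.
  const snapshot = store.snapshots.get(docId);
  if (snapshot && snapshot.latestOp === opId) return true;
  // Fetch 2: look for a later op that references this one.
  for (const op of store.ops.values()) {
    if (op.docId === docId && op.prevOp === opId) return true;
  }
  return false;
}

// Deletes the op only when neither check finds a reference to it.
// Returns true if the op was deleted.
function tidyUpOp(store, docId, opId) {
  if (isOpCanonical(store, docId, opId)) return false;
  return store.ops.delete(opId);
}
```

With this guard, calling `tidyUpOp(store, 'doc1', 'op3')` leaves the op in place because the snapshot references it, while `tidyUpOp(store, 'doc1', 'stale')` still removes the op that genuinely lost the race. In a real adapter the two checks would be separate queries, so they would not be atomic with the delete.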
Other...?