
Ops incorrectly deleted during MongoDB node-out #164

@alecgibson


We ran into an issue recently where we had a MongoDB node fail. During this failure, it looks like a ShareDB op was deleted, even though its snapshot was committed.

I can't confirm for sure, but my suspicion is that this chain of events (or something very similar) happened:

  1. writeOp() succeeds
  2. sharedb-mongo attempts to writeSnapshot()
  3. MongoDB commits the snapshot to disk
  4. MongoDB falls over before sending the ack to the client
  5. The client disconnects because of the node outage, and presumably assumes the write has failed(?)
  6. sharedb-mongo attempts to "tidy up" the op from the supposedly failed commit, even though the snapshot write succeeded
  7. The result is a committed snapshot with a missing op
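
To make the suspected sequence concrete, here is a minimal in-memory sketch. It does not use sharedb-mongo or MongoDB at all: `FakeStore`, `commit()`, the `dropAck` flag and the `opLink` field are all hypothetical names invented for illustration, and the calls are synchronous for brevity. It only shows how a tidy-up that trusts a missing ack could leave a committed snapshot pointing at a deleted op.

```javascript
// Hypothetical in-memory stand-in for the op and snapshot collections.
class FakeStore {
  constructor() {
    this.ops = new Map();       // opId -> op document
    this.snapshots = new Map(); // docId -> snapshot document
    this.nextOpId = 1;
  }

  writeOp(docId, op) {
    const opId = this.nextOpId++;
    this.ops.set(opId, { docId, v: op.v });
    return opId;
  }

  // Commits the snapshot, then optionally simulates the node dying
  // before the acknowledgement reaches the client.
  writeSnapshot(docId, snapshot, opId, dropAck) {
    this.snapshots.set(docId, { v: snapshot.v, opLink: opId });
    if (dropAck) throw new Error('connection closed before ack');
  }

  deleteOp(opId) {
    this.ops.delete(opId);
  }
}

// Mirrors the suspected flow: write the op, write the snapshot, and
// delete the op if the snapshot write appears to have failed.
function commit(store, docId, op, snapshot, dropAck) {
  const opId = store.writeOp(docId, op);
  try {
    store.writeSnapshot(docId, snapshot, opId, dropAck);
    return true;
  } catch (err) {
    store.deleteOp(opId); // tidy-up based on a possibly false assumption
    return false;
  }
}

const store = new FakeStore();
const acknowledged = commit(store, 'doc1', { v: 0 }, { v: 1 }, true);
const committed = store.snapshots.get('doc1');

console.log(acknowledged);                    // false: the client saw a failure
console.log(committed.v);                     // 1: the snapshot was committed anyway
console.log(store.ops.has(committed.opLink)); // false: the op it links to is gone
```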

I'm not entirely sure what my recommendation is. At first, I thought we should just delete the code that tidies these ops, but I do worry that it will result in bloat of the op collection during periods of high concurrency on a document.

We could move to transactions, although I worry about the performance implications (and I can't see much online, apart from guidance to use them sparingly, which wouldn't be the case here...).

We could add an extra DB call before the deletion, which double-checks that the op is non-canonical before deleting it. We would have to check both o_collection and collection to see whether another op in the chain references this op, or whether the current snapshot references it. This requires 2 extra fetches, which isn't super nice, but I guess it would only happen in the tidy-up case, and it avoids the general use of transactions.
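
A rough sketch of what that double-check could look like, again against in-memory `Map`s rather than real collections. The field names are assumptions: `_o` is taken to be the snapshot's link to its latest op and `o` an op's link to the previous op, and `isCanonicalOp()` and `tidyUpOp()` are hypothetical helpers, not existing sharedb-mongo functions. The two lookups stand in for the 2 extra fetches.

```javascript
// An op is treated as canonical if the current snapshot links to it,
// or if a later op in the chain links back to it.
function isCanonicalOp(opId, snapshot, referencingOp) {
  const linkedFromSnapshot = Boolean(snapshot) && snapshot._o === opId;
  const linkedFromLaterOp = Boolean(referencingOp) && referencingOp.o === opId;
  return linkedFromSnapshot || linkedFromLaterOp;
}

// Deletes the op only when neither lookup finds a reference to it.
// Returns true if the op was deleted, false if it was kept.
function tidyUpOp(store, docId, opId) {
  // Fetch 1: the current snapshot (stands in for a query on `collection`).
  const snapshot = store.snapshots.get(docId);
  // Fetch 2: any op that points at this one (stands in for a query on `o_collection`).
  const referencingOp = [...store.ops.values()].find((op) => op.o === opId);

  if (isCanonicalOp(opId, snapshot, referencingOp)) return false;
  store.ops.delete(opId);
  return true;
}

// Example: ops 1 and 2 form the chain and the snapshot links to op 2.
// Op 3 lost a race at the same version and is referenced by nothing.
const store = {
  snapshots: new Map([['doc1', { _v: 2, _o: 2 }]]),
  ops: new Map([
    [1, { d: 'doc1', v: 0, o: null }],
    [2, { d: 'doc1', v: 1, o: 1 }],
    [3, { d: 'doc1', v: 1, o: 1 }],
  ]),
};

const deletedLatest = tidyUpOp(store, 'doc1', 2);  // kept: snapshot links to it
const deletedEarlier = tidyUpOp(store, 'doc1', 1); // kept: a later op links to it
const deletedOrphan = tidyUpOp(store, 'doc1', 3);  // deleted: nothing references it
```

Note that in a real database the check and the delete would not be atomic, so this narrows the window rather than closing it completely.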

Other...?
