etcd Go & Java client SDK's retry mechanism may break Serializable #18424

Open

ahrtr opened this issue Aug 8, 2024 · 11 comments

Comments

@ahrtr
Member

ahrtr commented Aug 8, 2024

Background

The Jepsen team raised issue #14890, stating that etcd may cause lost updates and cyclic information flow. There is a long discussion in that issue.

First, there was strong evidence indicating that it isn't an etcdserver issue and that a key was written twice by the client; refer to #14890 (comment). So we thought it might be a jetcd or Jepsen issue.

Eventually it turned out to be caused by the client's retry mechanism. Refer to

Note that Jepsen uses jetcd (the Java client), but I believe the etcd Go client SDK also has this issue.

Breaks Serializable

When a database system processes multiple concurrent transactions, it must produce the same effect as some serial execution of those transactions. This is what Serializable means.

But the etcd client SDK's retry mechanism (both Go and Java) may break Serializable.

Let's work with an example. Assume there are two concurrent transactions:

  • transaction 1 (txn1): read k1, and write k2: 20
  • transaction 2 (txn2): read k2, and write k1: 10

Based on the definition of Serializable, the final result must be the same as executing the two transactions in some serial order. There are only two possibilities:

  • execute txn1 first, then txn2
    • txn1 read nothing for k1, and write k2 = 20
    • txn2 read 20 for k2, and write k1 = 10
  • execute txn2 first, then txn1
    • txn2 read nothing for k2, and write k1 = 10
    • txn1 read 10 for k1, and write k2 = 20

But the client's retry may lead to a third possibility; see the example workflow below (and the code sketch after it):

  • execute txn1 first: read nothing for k1, and write k2 = 20.
    • But the client side somehow gets an error response for whatever reason (e.g. a temporary network issue);
  • execute txn2: read 20 for k2, and write k1 = 10
  • client retries txn1: read 10 for k1, and write k2 = 20

So the retry finally leads to a cyclic information flow, which breaks Serializable:

  • txn1 reads k1/10, which was written by txn2
  • txn2 reads k2/20, which was written by txn1
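
To make this concrete, here is a minimal sketch of the two transactions using the Go client (go.etcd.io/etcd/client/v3). The endpoint, key names, and values are placeholders, error handling is elided, and each transaction reads one key and writes the other in a single atomic Txn; this is only an illustration of the workflow above, not the Jepsen workload itself.

package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	// txn1: read k1 and write k2 = 20 in one atomic Txn. If the response is
	// lost and the SDK transparently retries this request after txn2 has
	// committed, the retried txn1 observes txn2's write to k1, producing the
	// cyclic information flow described above.
	txn1, err := cli.Txn(ctx).
		Then(clientv3.OpGet("k1"), clientv3.OpPut("k2", "20")).
		Commit()
	fmt.Println(txn1, err)

	// txn2: read k2 and write k1 = 10 in one atomic Txn.
	txn2, err := cli.Txn(ctx).
		Then(clientv3.OpGet("k2"), clientv3.OpPut("k1", "10")).
		Commit()
	fmt.Println(txn2, err)
}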

Breaks Read Committed

Let's work with an example workflow:

  • client 1 sends a request to write k/v 277/1;
  • client 2 sends a request to write k/v 277/4;
  • client 2 receives a success response, which means 277/4 was successfully persisted;
  • the etcdserver is killed and restarted;
  • the etcd client SDK retries the write of k/v 277/1, so it is also successfully persisted.
    • But it's a problem if client 1 didn't get a success response for whatever reason (e.g. a timeout), which is what triggers the retry.
  • client 3 reads key 277, but gets 277/1 instead of 277/4.

Obviously, from the client's perspective, it should read 277/4 in this case, because that value was confirmed committed. So it breaks Read Committed (see the sketch after the note below).

  • Note that breaking Read Committed usually means the client sees uncommitted data, i.e. a dirty read.
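
A minimal sketch of that workflow with the Go client (assuming the same imports as the sketch above, and that cli1, cli2, and cli3 are three separate clients of the same cluster):

// Sketch only: the key, values, and restart mirror the workflow above.
func readCommittedWorkflow(ctx context.Context, cli1, cli2, cli3 *clientv3.Client) {
	_, err1 := cli1.Put(ctx, "277", "1") // client 1: the response is lost, so its SDK may retry later
	_, err2 := cli2.Put(ctx, "277", "4") // client 2: success, so 277/4 is confirmed committed
	fmt.Println(err1, err2)

	// ... etcdserver is killed and restarted, and client 1's SDK replays Put(277, 1) ...

	resp, err := cli3.Get(ctx, "277")
	if err == nil && len(resp.Kvs) > 0 {
		// Client 3 expects "4" (the confirmed-committed value) but sees "1" after the retry.
		fmt.Println(string(resp.Kvs[0].Value))
	}
}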

EDIT: even without the client's retry, users can still run into this "issue", because a user may get a failure response while etcdserver has actually already processed the request successfully. We know it's a little confusing to users, but it isn't an issue from etcd's perspective. The Proposal (see below) can mitigate it, but can't completely resolve it.

What did you expect to happen?

etcd should never break Serializable or Read Committed.

How can we reproduce it (as minimally and precisely as possible)?

See the workflows mentioned above. We need to create two e2e test cases to reproduce this issue.

We can leverage a gofail failpoint to reproduce the Serializable issue. When etcdserver receives the two transaction requests, it intentionally returns a failure response for the first transaction (only once); when etcdserver receives the retried transaction, it should return success.

We can also leverage a sleep failpoint to interleave the execution of the two transactions.
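
As a rough sketch of how a test could arm such a failpoint: gofail exposes an HTTP endpoint when the etcd process is started with the GOFAIL_HTTP environment variable, and a failpoint can then be set with an HTTP PUT. The failpoint name below is hypothetical (the real test needs one placed where etcdserver returns the txn response, or an existing one), and the exact failpoint term may need adjusting.

package main

import (
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Intention: fail the matching code path exactly once, then behave
	// normally, so that the client's retried txn succeeds.
	body := strings.NewReader(`1*return("injected error")`)
	req, err := http.NewRequest(http.MethodPut,
		"http://127.0.0.1:22381/hypotheticalBeforeTxnResponse", body)
	if err != nil {
		panic(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("failpoint armed:", resp.Status)
}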

Proposal

  • We should guarantee that the client never retries when the previous operation may already have been successful.
    • One valid case to retry is when the client receives an auth failure.
  • We should expose an API to let users enable/disable the retry (a hypothetical sketch follows below).

see also #14890 (comment)
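
As a purely hypothetical illustration of the second proposal item (the Go client has no such option today; the field name below is invented), the configuration could look something like this:

// Hypothetical sketch only; clientv3.Config has no DisableAutoRetry field today.
cli, err := clientv3.New(clientv3.Config{
	Endpoints: []string{"127.0.0.1:2379"},
	// DisableAutoRetry: true, // hypothetical: never transparently retry a
	// request whose previous attempt may already have been applied.
})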

Action

  • Create an e2e test case to reproduce the "Serializable" issue.
  • Follow the proposal above to resolve the issue.

Reference

@ahrtr ahrtr added the type/bug label Aug 8, 2024
@ahrtr ahrtr pinned this issue Aug 8, 2024
@ahrtr ahrtr added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Aug 8, 2024
@ahrtr
Member Author

ahrtr commented Aug 8, 2024

@lavacat are you interested in working on this issue as discussed in the community meeting?

@ahrtr ahrtr changed the title etcd Go & Java client SDK's retry mechanism may break Serializable and Read committed etcd Go & Java client SDK's retry mechanism may break Serializable Aug 9, 2024
@ahrtr ahrtr removed the area/raft label Aug 12, 2024
@parthlaw

Hi, I'd like to work on this issue and would appreciate some guidance. Could we discuss the details here or on Slack, if that's more convenient?

@ahrtr
Member Author

ahrtr commented Aug 14, 2024

Hi, I'd like to work on this issue and would appreciate some guidance. Could we discuss the details here or on Slack, if that's more convenient?

Thank you. Unfortunately, this issue isn't a good first issue, nor even an intermediate one; I think it is hard (at least it's hard to create the e2e test).

@lavacat are you interested in working on this issue as discussed in the community meeting?

Let me know if you are working on this. I will work on it if I do not see a response by the end of next week.

@lavacat

lavacat commented Aug 14, 2024

@ahrtr, yes, will try to find time this week. Please assign to me.

@ahrtr
Member Author

ahrtr commented Aug 14, 2024

/assign @lavacat

@eliben

eliben commented Sep 14, 2024

Potential duplication of non-idempotent requests is a known problem with Raft-based systems. There's half a section dedicated to it in the extended Raft paper, in section 8:

However, as described so far Raft can execute a command multiple times: for example, if the leader crashes after committing the log entry but before responding to the client, the client will retry the command with a new leader, causing it to be executed a second time. The solution is for clients to assign unique serial numbers to every command. Then, the state machine tracks the latest serial number processed for each client, along with the associated response. If it receives a command whose serial number has already been executed, it responds immediately without re-executing the request.

The proposed solution is somewhat heavy-handed and requires deep changes - including in the server (because the de-duplication logic should be part of the replicated state machine), but it seems difficult to solve the problem without it.
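
For illustration only, a toy sketch of that idea (this is not etcd code; all names are made up): the replicated state machine remembers the last serial number applied for each client, together with the response, and replays the cached response instead of re-executing a duplicate command.

package main

// Command is a client request carrying a per-client serial number.
type Command struct {
	ClientID string
	Serial   uint64
	Apply    func(kv map[string]string) string // the actual mutation
}

type session struct {
	lastSerial uint64
	lastResp   string
}

// StateMachine is a toy replicated state machine with request de-duplication.
type StateMachine struct {
	kv       map[string]string
	sessions map[string]session
}

func (sm *StateMachine) ApplyCommand(c Command) string {
	if s, ok := sm.sessions[c.ClientID]; ok && c.Serial <= s.lastSerial {
		return s.lastResp // duplicate (e.g. a client retry): reply without re-executing
	}
	resp := c.Apply(sm.kv)
	sm.sessions[c.ClientID] = session{lastSerial: c.Serial, lastResp: resp}
	return resp
}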

If you simply tell the client "don't retry non-idempotent operations on failure", it adds a significant burden on the user to handle such failures.

@ahrtr
Member Author

ahrtr commented Sep 14, 2024

Your comment is not exactly the same as this issue, but the two are somewhat related.

In a distributed system, a client may somehow get an error response due to some temporary issue (e.g. network jitter) while the server side has actually already processed the request successfully. It's rare, but it happens.

From the client's perspective, if it gets a successful response, it can trust that response. But if it gets an error response, that doesn't mean the server side actually failed. Note that the rest of this comment focuses on the case where the client gets an error response.

etcd has two kinds of data: key-space (key/value) data and non-key-space data (i.e. membership data).

Key space

For key-space data, etcd supports MVCC (multi-version concurrency control); refer to here. When the client gets an error response for a request against the key space, it has two choices.

  • The first choice is to simply retry.
    • If the previous request indeed failed on the server side, then there is no issue with retrying.
    • If the previous request was actually successful on the server side, then there will be (at least) two revisions for the same key after the retry (assuming the retry is successful). It's a problem in that case, but not a big one, because etcd guarantees linearizability, which means clients always read the latest data (revision). The other (minor) problem is that watch clients may get two events for the same key.
  • The second choice is to avoid duplication using a TXN (refer to here and here). (Note that a TXN guarantees atomicity.)
    • Whether it's the first execution or a retry, the client always executes the same TXN: check the revision of the key and perform different operations based on the check result. For example,
      • If the key doesn't exist beforehand,
        If createRevision == 0 {
            write the k/v
        } else {
            read the k/v
        }
        
      • If the key already exists beforehand
        read the initial revision of the key  # This isn't part of the TXN
      
        If createRevision == initialRevision {
            write the k/v
        } else {
            read the k/v
        }
      

Obviously the first choice is much simpler, but has the minor problems noted above. The second doesn't have those problems, but is more complex.
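
For the case where the key does not exist beforehand, a minimal sketch of the second choice with the Go client (same imports as the earlier sketches; key and value are placeholders) could look like this:

// Idempotent "write once" Txn: the first attempt and any retry run exactly the
// same transaction, so at most one revision is ever created for the key.
func putOnce(ctx context.Context, cli *clientv3.Client, key, val string) (*clientv3.TxnResponse, error) {
	return cli.Txn(ctx).
		// Write only if the key has never been created (CreateRevision == 0)...
		If(clientv3.Compare(clientv3.CreateRevision(key), "=", 0)).
		Then(clientv3.OpPut(key, val)).
		// ...otherwise just read it back; a retry after a "hidden" success
		// lands here and does not create a second revision.
		Else(clientv3.OpGet(key)).
		Commit()
}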

Non-key space

For non-key-space data (i.e. membership data), etcd doesn't support MVCC. The client can still follow a similar pattern (check before operation), but it can't use a TXN. Please refer to an example in kubernetes/kubeadm#3111.
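
For instance, a sketch of the check-before-operation pattern for membership data, using the existing MemberList/MemberAdd APIs (peerURL is a placeholder; note that without a TXN the check and the add are not atomic, so this narrows the window rather than closing it):

// Add a member only if its peer URL is not already part of the cluster, so a
// retry after a "hidden" success does not add the same peer twice.
func addMemberIfAbsent(ctx context.Context, cli *clientv3.Client, peerURL string) error {
	list, err := cli.MemberList(ctx)
	if err != nil {
		return err
	}
	for _, m := range list.Members {
		for _, u := range m.PeerURLs {
			if u == peerURL {
				return nil // already added, e.g. by an earlier attempt that appeared to fail
			}
		}
	}
	_, err = cli.MemberAdd(ctx, []string{peerURL})
	return err
}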

@ahrtr
Member Author

ahrtr commented Sep 23, 2024

any progress? @lavacat

@lavacat

lavacat commented Sep 24, 2024

I'm focusing on reproducing interleaving transactions with the default Go client. We have 3 possibilities for retry:

  1. isContextError check
  2. shouldRefreshToken check
  3. isSafeRetry check

We'll ignore 2 for now (but auth is checked before proposal enters raft).
For 3, in the default case we have the nonRepeatable retryPolicy for Txn, and it will only retry in this case:

return desc == "there is no address available" || desc == "there is no connection available"

I think in this case there is no chance the first attempt got to raft.

We're left only with 1, and it becomes interesting. I'd expect that if the server side generates a Cancel or Deadline error, the client should retry. But that's not the case for Txn. parseProposeCtxErr converts context.Canceled to errors.ErrCanceled, and context.DeadlineExceeded to errors.ErrTimeoutDueToConnectionLost or errors.ErrTimeout. The last 2 map to code Unavailable. Only ErrGRPCCanceled maps to code Canceled, which would be retried. But toGRPCErrorMap is missing the errors.ErrCanceled: rpctypes.ErrGRPCCanceled entry (maybe a bug).

So, it's not possible for Txn to retry.

For the Java client the logic is different, and I think that's the reason the Jepsen test failed.

@ahrtr
Member Author

ahrtr commented Sep 24, 2024

But toGRPCErrorMap is missing errors.ErrCanceled: rpctypes.ErrGRPCCanceled entry (maybe a bug).

Indeed, it seems like a bug. But let's not change it until there is a clear summary of the existing errors (see also #18493 (comment)).

You can intentionally inject an error in processInternalRaftRequestOnce in the case x := <-ch to return a context.Canceled directly.

@ahrtr
Member Author

ahrtr commented Oct 2, 2024

It might not be safe to automatically retry when seeing a context error, because the server side may have already successfully processed the request.

// In the client's retry interceptor, the request may be retried when the
// previous error is a context error:
if isContextError(lastErr) {
	// ... retry ...
}

func isContextError(err error) bool {
	return status.Code(err) == codes.DeadlineExceeded || status.Code(err) == codes.Canceled
}

Based on the investigation that we have done so far, this issue can't be reproduced with the etcd Go SDK. So I will unpin the issue and de-escalate the priority.

Proposed followup actions:

@ahrtr ahrtr unpinned this issue Oct 2, 2024
@ahrtr ahrtr removed the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Oct 2, 2024