etcd Go & Java client SDK's retry mechanism may break Serializable
#18424
Comments
@lavacat are you interested in working on this issue as discussed in the community meeting?
Hi, I'd like to work on this issue and would appreciate some guidance. Could we discuss the details here or on Slack, if that's more convenient?
Thank you. Unfortunately, this issue isn't
Let me know if you are working on this. I will work on it if I do not see a response by the end of next week.
@ahrtr, yes, will try to find time this week. Please assign to me.
/assign @lavacat
Potential duplication of non-idempotent requests is a known problem with Raft-based systems. There's half a section dedicated to it in the extended Raft paper, in section 8:
The proposed solution is somewhat heavy-handed and requires deep changes, including in the server (because the de-duplication logic should be part of the replicated state machine), but it seems difficult to solve the problem without it. If you simply tell the client "don't retry non-idempotent operations on failure", it adds significant burden on the user for handling such failures.
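For readers who don't have the paper at hand, here is a minimal sketch (not etcd code, all names are illustrative) of the section 8 idea: the replicated state machine tracks, per client session, the highest request sequence number already applied and its cached response, so a retried duplicate is answered from the cache instead of being executed twice.

```go
package main

import "fmt"

// Command is a client request carried through the Raft log. ClientID and Seq
// identify the request for de-duplication, as in section 8 of the Raft paper.
type Command struct {
	ClientID string
	Seq      uint64
	Op       func(state map[string]string) string
}

// session remembers the last request applied for one client session.
type session struct {
	lastSeq  uint64
	lastResp string
}

// StateMachine applies committed commands at most once per (ClientID, Seq)
// and answers duplicates from the cached response.
type StateMachine struct {
	state    map[string]string
	sessions map[string]*session
}

func NewStateMachine() *StateMachine {
	return &StateMachine{state: map[string]string{}, sessions: map[string]*session{}}
}

// Apply is called for every committed log entry, in log order, on every replica.
func (sm *StateMachine) Apply(cmd Command) string {
	if s, ok := sm.sessions[cmd.ClientID]; ok && cmd.Seq <= s.lastSeq {
		// Duplicate delivery (e.g. a client retry after a lost response):
		// return the cached result instead of executing the operation again.
		return s.lastResp
	}
	resp := cmd.Op(sm.state)
	sm.sessions[cmd.ClientID] = &session{lastSeq: cmd.Seq, lastResp: resp}
	return resp
}

func main() {
	sm := NewStateMachine()
	put := Command{ClientID: "c1", Seq: 1, Op: func(st map[string]string) string {
		st["k"] += "x" // a non-idempotent append
		return st["k"]
	}}
	fmt.Println(sm.Apply(put)) // "x"
	fmt.Println(sm.Apply(put)) // still "x": the retried command is not re-applied
}
```

The cost is the deep change noted above: the session table is part of the replicated state, so every replica must maintain it identically.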
Your comment is not exactly the same as this issue, but the two are somewhat related. For a distributed system, it's possible that a client gets an error response due to some temporary issue (e.g. network jitter) while the server side has actually already successfully processed the request. It's rare, but it happens. From the client's perspective, if it gets a successful response, then it can trust the response. But if it gets an error response, that doesn't mean the server side indeed failed. Note that the following comment focuses on the case where the client gets an error response. etcd has two kinds of data: key-space (key/value) data and non-key-space data (i.e. membership data).

Key space

For key-space data, etcd supports MVCC (multi-version concurrency control), refer to here. When the client gets an error response for a request against the key space, it has two choices.
Obviously the first choice is much simpler, but has minor problems. The second doesn't have those problems, but is more complex; a sketch of it is shown below.

Non-key space

For non-key-space data (i.e. membership data), etcd doesn't support MVCC. The client can still follow a similar pattern (check before operation), but it can't use TXN. Please refer to an example in kubernetes/kubeadm#3111.
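A minimal sketch of the "check before operation" pattern for key-space data, using the Go client (clientv3); the endpoint, key, and value are placeholders, not taken from this issue. The write is guarded by a compare on the revision observed in a prior read, so re-issuing the same transaction after an ambiguous error cannot apply the update twice.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx := context.Background()
	key, newVal := "demo-key", "demo-value" // placeholder key/value

	// 1. Check: read the current state and remember its mod revision
	//    (0 if the key does not exist yet).
	getResp, err := cli.Get(ctx, key)
	if err != nil {
		log.Fatal(err)
	}
	var rev int64
	if len(getResp.Kvs) > 0 {
		rev = getResp.Kvs[0].ModRevision
	}

	// 2. Operate: only write if the key is still at the observed revision.
	//    If an earlier attempt that looked like a failure actually succeeded
	//    on the server, this compare fails instead of applying the update twice.
	txnResp, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.ModRevision(key), "=", rev)).
		Then(clientv3.OpPut(key, newVal)).
		Commit()
	if err != nil {
		// Ambiguous outcome (e.g. timeout): safe to re-run the whole
		// check-then-txn sequence rather than blindly re-sending the Put.
		log.Fatal(err)
	}
	fmt.Println("applied:", txnResp.Succeeded)
}
```

The extra complexity mentioned above is the additional round trip and the need to decide what to do when the compare fails (re-read and re-derive the update, or give up).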
Any progress? @lavacat
I'm focusing on reproducing interleaving transactions with the default Go client. We have 3 possibilities for retry:
We'll ignore 2 for now (but auth is checked before proposal enters raft).
I think in this case there is no chance the first attempt got to raft. We're left only with 1, which becomes interesting. I'd expect that if the server side generates a Cancel or Deadline error, the client should retry. But that's not the case for Txn: parseProposeCtxErr converts the context error, so it's not possible for Txn to retry. For the Java client the logic is different, and I think that's the reason the Jepsen test failed.
Indeed it seems like a bug. But let's not change it until there is a clear summary of the existing errors (see also #18493 (comment)). You can intentionally inject an error in
It might not be safe to automatically retry when seeing a context error, because the server side may have successfully processed the request. See etcd/client/v3/retry_interceptor.go, line 72 in c79c7d5, and lines 348 to 350 in c79c7d5.
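One way a caller can treat such an error as an unknown outcome is to verify rather than retry: re-read the key and check whether the first attempt was applied. This is a hedged sketch with placeholder names, and it only works when the written value is distinguishable from older values.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// putWithVerify issues a Put and, on a context error (outcome unknown),
// re-reads the key to learn whether the write actually landed.
func putWithVerify(cli *clientv3.Client, key, val string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	_, err := cli.Put(ctx, key, val)
	if err == nil {
		return nil
	}
	if errors.Is(err, context.DeadlineExceeded) || errors.Is(err, context.Canceled) {
		// The server may or may not have applied the Put. Verify instead of
		// blindly re-sending a non-idempotent request.
		vctx, vcancel := context.WithTimeout(context.Background(), 2*time.Second)
		defer vcancel()
		get, gerr := cli.Get(vctx, key)
		if gerr != nil {
			return gerr
		}
		if len(get.Kvs) > 0 && string(get.Kvs[0].Value) == val {
			return nil // the first attempt was applied after all
		}
	}
	return err
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	fmt.Println(putWithVerify(cli, "demo-key", "demo-value"))
}
```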
Based on the investigation that we have done so far, this issue can't be reproduced with the etcd Go SDK, so I will unpin the issue and de-escalate the priority. Proposed follow-up actions:
Background
The Jepsen team raised issue #14890, stating that etcd may cause lost updates and cyclic information flow. There is a long discussion there.
Firstly, there was strong evidence indicating that it isn't an etcdserver issue and that a key was written twice by the client; refer to #14890 (comment). So we thought it might be a jetcd or Jepsen issue.
Eventually it turned out to be caused by the client's retry mechanism. Refer to etcd/client/v3/client.go, line 273 in 9f58999.
Note that Jepsen uses jetcd (the Java client), but I believe the etcd Go client SDK also has this issue.
Breaks Serializable

When a database system processes multiple concurrent transactions, it must produce the same effect as some serial execution of those transactions. This is what Serializable means. But the etcd client SDK's retry mechanism (both Go and Java) may break Serializable.

Let's work with an example, assuming there are two concurrent transactions. Based on the definition of Serializable, the final result must be the same as executing the two transactions in some serial order, so there are only two possibilities. But the client's retry may lead to a third possibility; see an example workflow below.

So finally it leads to cyclic information flow, which breaks Serializable.
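The concrete keys and values of the original workflow are not reproduced here; as a hedged illustration of the shape involved, below is a Jepsen-style pair of transactions written with the Go client, each one a single Txn that reads both keys and writes one of them. If T1's first attempt is applied but reported as an error, T2 then commits, and the SDK transparently re-sends T1, the second execution of T1 observes T2's write while T2 already observed the first T1, which matches neither serial order.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// t1 reads both keys and writes x; t2 reads both keys and writes y.
// Each is a single unconditional Txn, which the retry logic may re-send.
func t1(ctx context.Context, cli *clientv3.Client) (*clientv3.TxnResponse, error) {
	return cli.Txn(ctx).Then(
		clientv3.OpGet("x"),
		clientv3.OpGet("y"),
		clientv3.OpPut("x", "written-by-T1"),
	).Commit()
}

func t2(ctx context.Context, cli *clientv3.Client) (*clientv3.TxnResponse, error) {
	return cli.Txn(ctx).Then(
		clientv3.OpGet("x"),
		clientv3.OpGet("y"),
		clientv3.OpPut("y", "written-by-T2"),
	).Commit()
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx := context.Background()
	// In the problematic interleaving, t1 and t2 run from two concurrent
	// clients; t1's first attempt is applied but reported as failed, t2
	// commits, and then t1 is transparently re-sent and applied again.
	go func() {
		if _, err := t1(ctx, cli); err != nil {
			log.Println("t1:", err)
		}
	}()
	if resp, err := t2(ctx, cli); err == nil {
		fmt.Println("t2 committed at revision", resp.Header.Revision)
	}
	time.Sleep(time.Second) // let the goroutine finish (demo only)
}
```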
Break Read Committed

Let's work with an example/workflow.

Obviously, from the client's perspective, it should read 277/4 in such a case, because it's confirmed committed. So it breaks Read Committed. Breaking Read Committed means the client sees uncommitted data, i.e. a dirty read.

EDIT: even without the client's retry, it's also possible for users to run into this "issue", because a user may get a failure response even though etcdserver has already successfully processed the request. We know it's a little confusing to users, but it isn't an issue from etcd's perspective. The Proposal (see below) can mitigate it, but can't completely resolve it.
What did you expect to happen?
etcd should never break Serializable, nor Read Committed.
How can we reproduce it (as minimally and precisely as possible)?
See the workflow mentioned above. We need to create two e2e test cases to reproduce this issue.
We can leverage a gofail failpoint to reproduce the Serializable issue: when etcdserver receives two transaction requests, it intentionally returns a failure response for the first transaction, but only once; when etcdserver receives the retried transaction, it should return success.
We can also leverage a sleep failpoint to interleave the execution of the two transactions.

Proposal
See also #14890 (comment).
Action
Reference