Write txn shouldn't End() on a failure #18679

shyamjvs · 2024-10-05T01:31:14Z

What happened?

When two nodes both execute the same write txn (i.e., the validation step has already passed on both), but execution fails on one and succeeds on the other, we shouldn't let the failed node increment its consistent index (CI) and move on. It should instead fail explicitly and crash at the old CI, because we never expect a write txn to fail. Otherwise it would cause asymmetric side effects across nodes and lead to data inconsistency.

Here's where I think our code may be violating that:

etcd/server/etcdserver/txn/txn.go

Lines 307 to 314 in c1976a6

    
           _, err := executeTxn(ctx, lg, txnWrite, rt, txnPath, txnResp) 
        
           if err != nil { 
        
           	if isWrite { 
        
           		// end txn to release locks before panic 
        
           		txnWrite.End() 
        
           		// When txn with write operations starts it has to be successful 
        
           		// We don't have a way to recover state in case of write failure 
        
           		lg.Panic("unexpected error during txn with writes", zap.Error(err))

If I'm reading that correctly, when a txn fails (possibly after some of its writes have been partially executed), the txn is ended before the server panic is triggered, which means the KV revision and CI could be bumped with some incorrect changes applied. So if/when the server restarts, it will assume it's caught up (through that badly applied txn) and move on to the next record, leading to inconsistent state. Reproducing such a failure is going to be hard iiuc, but it seems like we should panic without calling txn.End here?
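To make the concern concrete, here is a minimal toy model of the hazard. It is only a sketch, not etcd's code: `store`, `writeTxn`, `op`, `newStore`, and `applyTxn` are hypothetical names, and the model simply assumes (as described above) that ending a write txn publishes its buffered writes and bumps the revision and CI. The `recover` stands in for a process crash, so the store contents afterwards represent what a restarted node would see.

```go
package main

import "fmt"

// store is a toy stand-in for the KV backend. It is NOT etcd's mvcc store.
type store struct {
	data            map[string]string
	rev             int64
	consistentIndex uint64
}

// op is a single put inside a txn (hypothetical type for this sketch).
type op struct{ key, val string }

// writeTxn buffers writes until End is called.
type writeTxn struct {
	s       *store
	pending []op
}

func newStore() *store { return &store{data: map[string]string{}} }

func (s *store) Write() *writeTxn { return &writeTxn{s: s} }

func (t *writeTxn) Put(o op) { t.pending = append(t.pending, o) }

// End publishes whatever has been buffered so far and, if anything was
// written, bumps the revision and the consistent index. This is the
// assumption the sketch is built on.
func (t *writeTxn) End() {
	if len(t.pending) == 0 {
		return
	}
	for _, o := range t.pending {
		t.s.data[o.key] = o.val
	}
	t.s.rev++
	t.s.consistentIndex++
}

// applyTxn applies ops in order and simulates an execution failure just
// before the op at index failAt (use -1 for no failure). endOnFailure
// selects between the two behaviors under discussion: End() before the
// panic, or panic without End(). The recover models the process crashing.
func applyTxn(s *store, ops []op, failAt int, endOnFailure bool) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("crashed: %v", r)
		}
	}()
	txn := s.Write()
	for i, o := range ops {
		if i == failAt {
			if endOnFailure {
				txn.End() // publishes the partial writes before crashing
			}
			panic("unexpected error during txn with writes")
		}
		txn.Put(o)
	}
	txn.End()
	return nil
}
```

In this model, failing on the second of two puts with `endOnFailure` set leaves the first key published and the CI advanced, so a restarted node would treat the entry as applied. With `endOnFailure` unset, the store is untouched and the entry would be applied again from the old CI.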

What did you expect to happen?

The etcd node on which a write txn's execution fails should crash without trying to commit the failed transaction (or producing other side effects such as incrementing the KV revision or CI).

Anything else we need to know?

I have no observed failure or confirmed repro for this issue, but I'm bringing it up as a potential risk in our current code.

Please share any thoughts about making the above change/trying to repro or if you think this is a non-issue.
/cc @serathius @ahrtr

serathius · 2024-10-05T10:30:30Z

Context on why this was introduced: #14149

shyamjvs · 2024-10-07T20:39:41Z

As I read through that change, it seems the goal was to keep timeouts of read-only txns from causing a panic. But the change also has the undesirable side effect of letting the partial write end before the panic.

shyamjvs · 2024-10-07T20:50:46Z

This comment from @ptabor was quite relevant actually (the miss seems to be to not panic right away):

I have small preference towards apply workflow being extremely strict and deterministic and any unexpected error causing premature exit from function (even in RO) code is something that should lead to investigation and fixing.

@ahrtr - please share any add'l context you might have from that PR - or I can make a change to remove txn.End and test

ahrtr · 2024-10-08T13:20:59Z

Thanks @shyamjvs for the good catch. Right, we need to remove the txnWrite.End(), otherwise the data might be partially committed into the backend storage (bbolt) before panicking. cc @lavacat

etcd/server/etcdserver/txn/txn.go

Lines 310 to 311 in f1aefa5

    
           // end txn to release locks before panic 
        
           txnWrite.End()

serathius · 2024-10-08T13:25:04Z

Looks like a good place for a failpoint.
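As a rough illustration of what such a check could look like, here is a minimal sketch with a hand-rolled injection hook. It deliberately does not use etcd's actual failpoint tooling, and every name in it (`injectExecErr`, `errInjected`, `recordingTxn`, `execute`, `applyWrite`, `panicked`) is hypothetical. The idea is only to force execution to fail and then verify that the apply path panics without ever ending the txn.

```go
package main

import "errors"

// errInjected is the error a test injects (hypothetical).
var errInjected = errors.New("injected failure")

// injectExecErr is a hand-rolled stand-in for a failpoint: when non-nil,
// execute returns whatever error it produces.
var injectExecErr func() error

// recordingTxn records whether End was called, so a test can assert on it.
type recordingTxn struct{ ended bool }

func (t *recordingTxn) End() { t.ended = true }

// execute pretends to run the txn operations and honors the injection hook.
func execute(t *recordingTxn) error {
	if injectExecErr != nil {
		return injectExecErr()
	}
	return nil
}

// applyWrite models the proposed behavior: on a write failure it panics
// without calling End, so nothing partial is published.
func applyWrite(t *recordingTxn) {
	if err := execute(t); err != nil {
		panic("unexpected error during txn with writes: " + err.Error())
	}
	t.End()
}

// panicked runs f and reports whether it panicked.
func panicked(f func()) (p bool) {
	defer func() {
		if recover() != nil {
			p = true
		}
	}()
	f()
	return false
}
```

A test would set `injectExecErr` to return `errInjected`, call `applyWrite` through `panicked`, and assert both that a panic happened and that the txn's `ended` flag is still false.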

shyamjvs · 2024-10-17T23:05:23Z

Sorry this took a while, I sent out a fix above.

@serathius I'm happy to add failpoints too, but had a few questions first around how to make them effective. Could we discuss over a call (maybe next SIG sync)?

serathius · 2024-10-18T08:52:50Z

See example in #17555

ahrtr · 2024-10-26T18:00:18Z

Reopening to track the backport effort: #18749 (comment)

shyamjvs added the type/bug label Oct 5, 2024

shyamjvs mentioned this issue Oct 5, 2024

Specifying a revision for a range request in a transaction may cause data inconsistency #18667

Open

shyamjvs changed the title ~~Write txn shouldn't be ended on a failure~~ Write txn shouldn't End() on a failure Oct 5, 2024

shyamjvs mentioned this issue Oct 17, 2024

Fix risk of a partial write txn being applied #18749

Merged

serathius closed this as completed in #18749 Oct 24, 2024

ahrtr reopened this Oct 26, 2024
