socketpair

@k8s-ci-robot

Hi @socketpair. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@serathius
Member

serathius commented Apr 15, 2025

Don't know if this is something that we want users to depend on.

@spzala
Member

spzala commented Apr 15, 2025

/ok-to-test

@socketpair
Author

socketpair commented Apr 15, 2025

@serathius This is exceptionally important. For example, TXNs are used to ensure the uniqueness of something: if someone wants to change a record, they might issue a single TXN that erases one record and adds another. If a watcher sees the addition first and the erasure only afterwards, it may hit a uniqueness violation while updating a cache inside an app.

Actually, I faced exactly this issue in my app.
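
A minimal sketch of the failure mode described above, using only the Go standard library. The `Event` and `UniqueCache` types are hypothetical stand-ins for a watch event and an application cache with a uniqueness rule; they are not part of the etcd client API. The sketch assumes the cache applies every event immediately, one at a time.

```go
package main

import (
	"errors"
	"fmt"
)

// Event is a hypothetical, simplified stand-in for a watch event.
type Event struct {
	Delete bool
	Key    string
	Value  string
}

// UniqueCache maps key -> value and rejects two keys holding the same value.
type UniqueCache struct {
	byKey   map[string]string
	byValue map[string]string
}

func NewUniqueCache() *UniqueCache {
	return &UniqueCache{byKey: map[string]string{}, byValue: map[string]string{}}
}

// ErrDuplicate is returned when a put would break the uniqueness rule.
var ErrDuplicate = errors.New("value already owned by another key")

// Apply applies one event immediately, enforcing uniqueness of values.
func (c *UniqueCache) Apply(ev Event) error {
	if ev.Delete {
		if old, ok := c.byKey[ev.Key]; ok {
			delete(c.byValue, old)
			delete(c.byKey, ev.Key)
		}
		return nil
	}
	if owner, ok := c.byValue[ev.Value]; ok && owner != ev.Key {
		return ErrDuplicate
	}
	if old, ok := c.byKey[ev.Key]; ok {
		delete(c.byValue, old)
	}
	c.byKey[ev.Key] = ev.Value
	c.byValue[ev.Value] = ev.Key
	return nil
}

// ApplyAll applies events one by one and stops at the first error.
func ApplyAll(c *UniqueCache, evs []Event) error {
	for _, ev := range evs {
		if err := c.Apply(ev); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// The TXN moves the value "alice" from key "old" to key "new".
	del := Event{Delete: true, Key: "old"}
	put := Event{Key: "new", Value: "alice"}

	c1 := NewUniqueCache()
	_ = c1.Apply(Event{Key: "old", Value: "alice"})
	fmt.Println("delete then put:", ApplyAll(c1, []Event{del, put}))

	c2 := NewUniqueCache()
	_ = c2.Apply(Event{Key: "old", Value: "alice"})
	fmt.Println("put then delete:", ApplyAll(c2, []Event{put, del}))
}
```

In this model the same two events succeed in one order and fail in the other, which is why the observed order matters to a cache that applies events individually.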

@siyuanfoundation
Contributor

If we want the user to depend on this, we should probably add test coverage in the robustness tests.

@socketpair
Author

@siyuanfoundation I don't know how to write a test for this case. We need to check that etcd DOES NOT reorder, but since a test writer doesn't know WHEN it might reorder, a reordering is difficult to trigger. AFAIK etcd currently has subrevisions that guarantee the order, but there is no public API to access them.

A simple test that issues a TXN and watches it, comparing the order of events, is possible, but to me it is almost useless.

Finally, I think the doc change should be merged, and a separate issue should be created for the test case.

@socketpair
Author

I just fixed DCO

@serathius
Member

> So, if a watcher first sees adding and only then erasing,

A watcher should apply a single revision atomically. Exposing partial state in the cache is the problem, not the etcd documentation.
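
One way to read "apply a single revision atomically" is sketched below with the Go standard library only. The `Event` type and both helper functions are hypothetical; the sketch assumes the stream is already ordered by revision and that all events of one TXN share the same `ModRevision`. Applying deletes before puts inside a batch is one possible policy for making the result independent of the order of events within a revision.

```go
package main

import (
	"fmt"
	"sort"
)

// Event is a hypothetical, simplified watch event. ModRevision plays the role
// of the main revision shared by every event of one TXN.
type Event struct {
	ModRevision int64
	Delete      bool
	Key         string
	Value       string
}

// GroupByRevision splits a stream that is already ordered by revision into
// consecutive batches that share the same ModRevision.
func GroupByRevision(evs []Event) [][]Event {
	var out [][]Event
	for _, ev := range evs {
		n := len(out)
		if n > 0 && out[n-1][0].ModRevision == ev.ModRevision {
			out[n-1] = append(out[n-1], ev)
			continue
		}
		out = append(out, []Event{ev})
	}
	return out
}

// ApplyBatch applies one revision as a unit. Deletes are applied before puts,
// so the result does not depend on the order of events inside the batch.
func ApplyBatch(cache map[string]string, batch []Event) {
	sorted := append([]Event(nil), batch...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Delete && !sorted[j].Delete
	})
	for _, ev := range sorted {
		if ev.Delete {
			delete(cache, ev.Key)
		} else {
			cache[ev.Key] = ev.Value
		}
	}
}

func main() {
	stream := []Event{
		{ModRevision: 5, Key: "new", Value: "alice"}, // put arrives first
		{ModRevision: 5, Delete: true, Key: "old"},
		{ModRevision: 6, Key: "other", Value: "bob"},
	}
	cache := map[string]string{"old": "alice"}
	for _, batch := range GroupByRevision(stream) {
		ApplyBatch(cache, batch)
	}
	fmt.Println(cache)
}
```

In a real application the body of `ApplyBatch` would typically run under a lock or inside a single database transaction, so that readers never observe a half-applied revision.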

@socketpair
Author

socketpair commented Apr 15, 2025

@serathius It does. But in order to accomplish this, the watcher typically has to sort events while applying them. For example, suppose the cache is an SQL database. Yes, the events are applied in a single SQL transaction, but the order inside it matters because of the integrity rules in the DB.

If etcd guarantees the order, the writer can issue correctly ordered TXNs that do not require sorting in watchers.

Next, since subrevisions exist, what was the intention behind them?

Also, if the order is not guaranteed by etcd here, we need to state this. If it is, we should state that it is preserved.

I think the ordering guarantees should be stated either way.

@socketpair
Author

@serathius @spzala @siyuanfoundation
What is the next step?

@socketpair
Author

@serathius @spzala @siyuanfoundation

Ping. Who is supposed to decide whether the order should be documented? Again, I strongly think it should be documented.

@socketpair
Author

@serathius @spzala @siyuanfoundation ping again

@jberkus
Contributor

jberkus commented Jun 5, 2025

@socketpair I'm going to discuss this at an upcoming community meeting.

@jberkus
Contributor

jberkus commented Jun 12, 2025

We discussed this in the Etcd community meeting. The consensus was that the order of events inside a transaction is NOT guaranteed (we already do some skipping of duplicate operations) and might be even less guaranteed in the future (e.g., consider an implementation of efficient batch writes).

@serathius any comments?

@serathius
Member

> @serathius any comments?

No, I agree. Thanks for clearing it up.

@socketpair
Author

Okay.

  1. What is skipping of duplicate operations? I'm just curious.
  2. I will change the PR so that it adds information that the order may not be preserved. Are you OK with that?

@jberkus
Contributor

jberkus commented Jun 12, 2025

> What is skipping of duplicate operations? I'm just curious

Imagine that inside a transaction you add a key, then delete the same key, and then add it again. Benjamin was saying we do some skipping for that, and we might do more in the future (because it's good for performance). @ahrtr ?

> I will change PR, so it will add information that the order may not be preserved. Are you OK with that ?

What would the PR consist of, then? Pretty much the change you made here is to say that we will guarantee it.

@socketpair
Author

Regarding duplicates: @jberkus, that is impossible, because the keys in a TXN action list cannot intersect or be non-unique.
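
The "keys cannot intersect" rule can be illustrated with a small standard-library sketch. This is a deliberate simplification and not etcd's actual validation code: the `Op` type is hypothetical, every write is treated alike, and a naive pairwise comparison is used. It assumes a write targets either a single key or a half-open key range `[Key, RangeEnd)`.

```go
package main

import "fmt"

// Op is a hypothetical, simplified write operation in a TXN.
// An empty RangeEnd means the op targets the single key Key;
// otherwise it targets the half-open range [Key, RangeEnd).
type Op struct {
	Key      string
	RangeEnd string
}

// end returns the exclusive upper bound of the keys touched by the op.
func (o Op) end() string {
	if o.RangeEnd == "" {
		return o.Key + "\x00" // smallest key strictly greater than Key
	}
	return o.RangeEnd
}

// Overlaps reports whether two ops touch at least one common key.
func Overlaps(a, b Op) bool {
	return a.Key < b.end() && b.Key < a.end()
}

// FirstConflict returns the indexes of the first overlapping pair of ops.
// The boolean result is false when no two ops overlap.
func FirstConflict(ops []Op) (int, int, bool) {
	for i := range ops {
		for j := i + 1; j < len(ops); j++ {
			if Overlaps(ops[i], ops[j]) {
				return i, j, true
			}
		}
	}
	return 0, 0, false
}

func main() {
	ops := []Op{
		{Key: "a"},
		{Key: "b", RangeEnd: "d"}, // touches every key from "b" up to, not including, "d"
		{Key: "c"},
	}
	if i, j, ok := FirstConflict(ops); ok {
		fmt.Printf("ops %d and %d touch a common key\n", i, j)
	}
}
```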

Regarding the doc: I want to state in the doc that watchers should not rely on the order of operations in TXN action blocks.

P.S.
I did not see the conversation at the etcd community meeting, and no one invited me there. So I'm still unsure whether there are any reasons why the order might not be the same (read above about subrevisions). If the order is not expected to be preserved, could etcd possibly be simplified by removing subrevisions?

Something is still not very clear.

@ahrtr
Member

ahrtr commented Jun 13, 2025

I probably did not say it clearly in yesterday's community meeting; I also wasn't aware of this conversation at that time.

  • We guarantee the atomicity of each TXN: either all operations in the TXN succeed or all of them fail.
  • We don't allow modifying the same key multiple times in the same TXN (e.g. putting a k/v and then deleting it later, or putting the same key multiple times). But please be aware that there is a known issue with this check for some nested TXNs (refer to checkIntervals will not validate correctly in certain nested txn scenarios etcd#16380).
  • etcd executes the operations in the same TXN in order. Otherwise, you couldn't guarantee that a TXN such as "put k1/v1; get k1" always generates a stable response. I think we can also clarify this in the documentation.
  • We guarantee that the watch events are ordered by revision, as documented in https://etcd.io/docs/v3.6/learning/api_guarantees/#watch-apis
    • I think this also applies to the case where a TXN contains multiple write (both put and delete) operations. When there are multiple write operations in a TXN, they have the same Revision.Main but different Revision.Sub values. It is still guaranteed that watch events are ordered by revision (including both Main and Sub).
    • However, it's a little complicated if there are nested TXNs. Currently etcd executes all nested TXNs in depth-first order. We're not 100% sure whether we'll change the algorithm in the future. Users shouldn't depend on this, so we can't guarantee it. I think we can only guarantee that the operations in the same non-nested TXN will always generate the same order of watch events.

Example: a TXN contains three put operations

$ ./etcd-dump-db iterate-bucket ../../default.etcd/member/snap/db key --decode
rev={Revision:{Main:2 Sub:2} tombstone:false}, value=[key "k3" | val "v3" | created 2 | mod 2 | ver 1]
rev={Revision:{Main:2 Sub:1} tombstone:false}, value=[key "k2" | val "v2" | created 2 | mod 2 | ver 1]
rev={Revision:{Main:2 Sub:0} tombstone:false}, value=[key "k1" | val "v1" | created 2 | mod 2 | ver 1]
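
The (Main, Sub) ordering described above can be sketched as a plain comparison. The `Revision` and `Record` types below are local, illustrative types that mirror the two fields shown in the dump; they are not imported from etcd, and the sketch makes no claim about how etcd stores or compares revisions internally.

```go
package main

import (
	"fmt"
	"sort"
)

// Revision mirrors the two fields shown in the dump. It is a local,
// illustrative type, not imported from etcd.
type Revision struct {
	Main int64
	Sub  int64
}

// Less orders revisions by Main first and by Sub second.
func (a Revision) Less(b Revision) bool {
	if a.Main != b.Main {
		return a.Main < b.Main
	}
	return a.Sub < b.Sub
}

// Record pairs a revision with the key written at that revision.
type Record struct {
	Rev Revision
	Key string
}

// SortByRevision sorts records in ascending (Main, Sub) order.
func SortByRevision(recs []Record) {
	sort.SliceStable(recs, func(i, j int) bool {
		return recs[i].Rev.Less(recs[j].Rev)
	})
}

func main() {
	// Same records as in the dump, listed in the order the dump printed them.
	recs := []Record{
		{Rev: Revision{Main: 2, Sub: 2}, Key: "k3"},
		{Rev: Revision{Main: 2, Sub: 1}, Key: "k2"},
		{Rev: Revision{Main: 2, Sub: 0}, Key: "k1"},
	}
	SortByRevision(recs)
	for _, r := range recs {
		fmt.Printf("main=%d sub=%d key=%s\n", r.Rev.Main, r.Rev.Sub, r.Key)
	}
}
```

Sorted this way, the three puts come out as k1, k2, k3, matching the order in which the operations were listed in the TXN.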

@socketpair
Author

@ahrtr Huge thanks for the explanations. What do you think about these fixes:

  1. Clarify that, if no nested TXN is used, the order of events is always the same as the order of operations in the TXN.
  2. Clarify that the order is not guaranteed when nested TXNs are present.
  3. Clarify the stability of reads inside a TXN.

If you agree, I will make the fixes.

@ahrtr
Member

ahrtr commented Jun 13, 2025

> @ahrtr huge thanks for explanations. What you think about these fixes:
>
>   1. Clarify the order of events if no nested TXN used is always the same as in txn
>   2. Clarify that order is not guaranteed when nested TXNs present.
>   3. Clarify stability of reads inside TXN
>
> If yes, I would fix

Basically YES. For "3. Clarify stability of reads inside TXN", we also need to clarify that it isn't guaranteed in the nested TXN case.

also cc @serathius

@socketpair
Author

@ahrtr @jberkus @serathius @spzala @siyuanfoundation

please review

Comment on lines 237 to 241
For transactions without nested TXNs, the order of execution of operations
is guaranteed to be the same as in its list of operations, which means stable
GET responses within the transaction and the same order of watch events.
For transactions with nested TXNs, the order of execution is not specified.

Member

I think it's better to clarify all guarantees in one place (the api_guarantees.md doc). This comment applies to all other versions as well (3.5 and 3.4).

Suggested change
For transactions without nested TXNs, the order of execution of operations
is guaranteed to be the same as in its list of operations, which means stable
GET responses within the transaction and the same order of watch events.
For transactions with nested TXNs, the order of execution is not specified.

Author

fixed

@ahrtr
Member

ahrtr commented Jun 16, 2025

Please note that only 3.4, 3.5 and 3.6 are supported versions. I suggest not updating older versions (< 3.4).

@socketpair
Author

@ahrtr fixed. please review

Member

@ahrtr ahrtr left a comment

LGTM & thx

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr, socketpair

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@socketpair
Author

@ahrtr Sorry, during the changes I forgot about stable GETs. Fixed. Please take another look.

@socketpair
Author

@ahrtr Done. I also rebased and added the change to v3.7 as well. Please review.

has already been posted.
has already been posted. For transactions without nested TXNs, the order of
generated events is guaranteed to be the same as in its list of operations.
For transactions with nested TXNs, the order of generated events is not
Member

@serathius serathius Jul 2, 2025

Again, I'm against over-specifying this. Users should apply a single revision atomically and not depend on the order of operations within a TXN.

Member

To be more specific, I don't care enough about this to argue. But don't expect me to help maintain a guarantee that is useless, has zero testing, and might impede development/optimization of etcd in the future.

Member

> that is useless

Why do you think it's useless? Read the 3rd item in #984 (comment). Also, FYI, in the broader database industry, the order of commands in one transaction is guaranteed by MVCC/visibility checks.

> has zero testing

I agree it's a gap that we need to resolve.

Member

@serathius serathius Jul 2, 2025

> Also FYI in a broader database industry, the order of commands in one transaction is guaranteed by MVCC/visibility check.

We should not compare an etcd TXN with database transactions; it's closer to a single UPDATE statement with a WHERE condition. When I execute an UPDATE request on an SQL database, I neither know nor care in which order the rows were updated.

This is because in both cases such an operation should not touch a single row/key twice.

Member

> When I execute a UPDATE request on SQL database I don't know nor care in which order the rows were updated.
>
> This is because in both cases such operation should not touch a single row/key twice.

It seems that we are not talking about the same thing. We are discussing the execution order of commands/operations in one transaction, not the data/rows the commands are updating.

Author

@socketpair socketpair Jul 2, 2025

@serathius

> We should not compare etcd TXN with database transactions

SQL databases typically don't have modification notifications. It would be nice if etcd could send ONE BIG event for a whole TXN. But it doesn't! Instead, it sends a sequence of events. Yes, with revisions, but a sequence (!). That's why I want the order to be clarified.

> it's closer to single UPDATE statement with WHERE condition

Yes, but we have GET requests in a TXN's list of operations.

@socketpair
Author

@serathius I can write something like "It's strongly advised to interpret all events generated by a single TXN as a whole, not depending on the order of events". Are you OK with this?

@ahrtr
Member

ahrtr commented Jul 2, 2025

For reference, the official etcd documentation already clarifies this:

https://etcd.io/docs/v3.6/dev-guide/api_reference_v3/

> compare is a list of predicates representing a conjunction of terms. If the comparisons succeed,
> then the success requests will be processed in order, and the response will contain their respective
> responses in order. If the comparisons fail, then the failure requests will be processed in order, and
> the response will contain their respective responses in order.
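
The documented behaviour quoted above can be modelled with a tiny in-memory evaluator. This is a sketch under strong assumptions, not etcd client code: the `Op` and `Txn` types are hypothetical, the compare step is reduced to a single equality check on one key (a missing key reads as the empty string), and rollback on failure is not modelled at all.

```go
package main

import "fmt"

// Op is a hypothetical, simplified request inside a TXN.
// Kind is one of "put", "get" or "delete".
type Op struct {
	Kind  string
	Key   string
	Value string
}

// Txn models compare/success/failure with a single equality predicate on one
// key, which is far narrower than the real API.
type Txn struct {
	CompareKey   string
	CompareValue string
	Success      []Op
	Failure      []Op
}

// Run evaluates the predicate, then processes the chosen branch in list order
// and returns one response per request, in the same order.
func Run(store map[string]string, t Txn) (bool, []string) {
	ok := store[t.CompareKey] == t.CompareValue
	ops := t.Failure
	if ok {
		ops = t.Success
	}
	resp := make([]string, 0, len(ops))
	for _, op := range ops {
		switch op.Kind {
		case "put":
			store[op.Key] = op.Value
			resp = append(resp, "put "+op.Key)
		case "delete":
			delete(store, op.Key)
			resp = append(resp, "delete "+op.Key)
		case "get":
			resp = append(resp, "get "+op.Key+"="+store[op.Key])
		}
	}
	return ok, resp
}

func main() {
	store := map[string]string{}
	txn := Txn{
		CompareKey:   "lock",
		CompareValue: "", // succeeds while "lock" is absent
		Success: []Op{
			{Kind: "put", Key: "k1", Value: "v1"},
			{Kind: "get", Key: "k1"},
		},
		Failure: []Op{{Kind: "get", Key: "lock"}},
	}
	ok, resp := Run(store, txn)
	fmt.Println(ok, resp)
}
```

Because the requests are processed in list order, the get that follows the put observes `v1`; swapping the two requests in this model would make the get observe an empty value instead.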



@ahrtr
Member

ahrtr commented Jul 2, 2025

@jberkus
Contributor

jberkus commented Jul 8, 2025

Feels like we need to discuss this in a community meeting again. This is the regular etcd community meeting, or triage meeting.

@jberkus
Contributor

jberkus commented Jul 10, 2025

Added to agenda for today's community meeting

@ahrtr
Member

ahrtr commented Jul 10, 2025

To summarize my thoughts/points:

  • The discussion has nothing to do with atomicity. Atomicity only means that all operations in one TXN either all succeed or all fail. We are talking about the execution order of operations in one TXN, and about the order of watch events being consistent with the order of requests in the TXN. I am confused about why atomicity is being argued.

  • Regarding the execution order of operations in one TXN,

  • Regarding the order of watch events (correspond to the multiple operations in a TXN),

    • As mentioned in the community meeting, it has already been implicitly guaranteed, because:
      • We already guarantee the execution order of the operations in one TXN.
      • Each operation generates a different revision (same Main, but different Sub). The existing watch API guarantees that users will get the watch events in order.
    • The only concern is that there are NO test cases to cover it.
    • @socketpair can you elaborate on your use case and why you depend on this behaviour/guarantee?

@siyuanfoundation
Contributor

It is hard to imagine putting a key k1 and reading it back in one TXN being a real use case.
I think the focus should be on whether the user should depend on the order when there are 2 write ops to different keys in the same TXN. Suppose there is a TXN of put(key1, val1) then put(key2, val2): if the user has an application that works when it observes put(key1, val1) before put(key2, val2) but fails when it observes put(key2, val2) before put(key1, val1), then that application is problematic, because the 2 ops should happen atomically, so the order should be undefined in the db state.
If we explicitly guarantee the order in the documentation, that would give users the false impression that they can have a hard dependency on the order.

@jberkus jberkus removed the approved label Jul 15, 2025