[NONEVM-984][solana] - Reorg Detection + lighter rpc call #951

Open
wants to merge 94 commits into
base: develop
from

Farber98 (Contributor) commented Nov 28, 2024

Description

  • Track transaction status on a per-signature basis to identify which of a transaction's signatures was included, so re-orgs can be detected specifically for that signature
  • Update the confirmation logic to detect re-orgs by identifying a regression in a signature's transaction status, i.e. when no status or a Processed status is received
    • A transaction can regress from Confirmed to Processed or to no status
    • A transaction can regress from Processed to no status
  • If a re-org is detected
    • from Confirmed, restart the retry/bump loop
    • from Processed, do nothing; expiration-based rebroadcast handles it later if needed
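The regression rules above can be sketched as a small Go function. This is a minimal illustration, not the PR's code: the TxState ordering and the RegressionType constants other than FromConfirmed are assumptions, loosely modelled on the isStatusRegression call and the state comments that appear later in the diff.

```go
package main

import "fmt"

// TxState is a hypothetical ordering of the statuses a signature can report,
// from least to most progressed. Illustrative only, not the PR's enum.
type TxState int

const (
	NotFound TxState = iota
	Broadcasted
	Processed
	Confirmed
)

// RegressionType records which state a regression started from.
type RegressionType int

const (
	NoRegression RegressionType = iota
	FromProcessed
	FromConfirmed
)

// isStatusRegression reports whether moving from prev to curr is a re-org
// signal under the rules in the description:
//   Confirmed -> Processed | Broadcasted | NotFound
//   Processed -> Broadcasted | NotFound
func isStatusRegression(prev, curr TxState) (RegressionType, bool) {
	switch prev {
	case Confirmed:
		if curr < Confirmed {
			return FromConfirmed, true
		}
	case Processed:
		if curr < Processed {
			return FromProcessed, true
		}
	}
	return NoRegression, false
}

func main() {
	rt, ok := isStatusRegression(Confirmed, Processed)
	fmt.Println(rt == FromConfirmed, ok) // true true
	_, ok = isStatusRegression(Processed, Confirmed)
	fmt.Println(ok) // false
}
```

Per the description, a FromConfirmed result would restart the retry/bump loop, while FromProcessed would be left to expiration-based rebroadcast.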

Tickets

Soak Testing

Farber98 marked this pull request as ready for review January 2, 2025 19:40
Farber98 requested a review from a team as a code owner January 2, 2025 19:40
@@ -112,11 +130,8 @@ func (c *pendingTxContext) New(tx pendingTx, sig solana.Signature, cancel contex
return err
}

// upgrade to write lock if sig or id do not exist
// upgrade to write lock if id do not exist
Contributor

nit:

Suggested change
// upgrade to write lock if id do not exist
// upgrade to write lock if id does not exist

return err
}

var pTx pendingTx
Contributor

nit: I think we can move this into the write-lock block, since we don't really need it outside.

return
}

txm.lggr.Warnw("re-org detected for transaction", "txID", txInfo.id, "signature", sig, "previousStatus", txInfo.state, "currentStatus", currentTxState)
Contributor

I think we can consider re-orgs part of normal TXM operation, since they're expected to happen eventually. Warn might be too alarming here; Info or even Debug might be a better log level.

Contributor

Actually, let's use Debug to keep this consistent with our expiration log.

}

// For regressions from "Confirmed", we'll need to rebroadcast the tx.
if regressionType == FromConfirmed {
Contributor

IIRC, you mentioned that if we detect a re-org, we'd have to replace the blockhash in all cases, since the re-org implies the blockhash we were using is invalid, so rebroadcasting with the same blockhash would fail anyway. If that's the case, I think we can ditch this regressionType and always assign a new blockhash. We could then also skip the Confirmed-specific re-org log below.

Contributor

Even if that's not an issue, we could still consider keeping the same logic for both cases if it simplifies things. What do you think?
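A minimal sketch of the proposed simplification, assuming a hypothetical BlockhashSource interface in place of the real RPC client: every regression, whether from Processed or Confirmed, takes the same path of fetching a fresh blockhash, so no regressionType branch is needed. Signing and broadcasting are omitted, and the Tx type is a placeholder rather than solana.Transaction.

```go
package main

import "fmt"

// Tx is a hypothetical placeholder holding only what this sketch needs.
type Tx struct {
	ID              string
	RecentBlockhash string
}

// BlockhashSource stands in for whatever RPC call returns a recent blockhash.
type BlockhashSource interface {
	LatestBlockhash() (string, error)
}

// handleReorg applies the same handling to any detected regression:
// replace the blockhash instead of rebroadcasting the old transaction.
func handleReorg(tx Tx, src BlockhashSource) (Tx, error) {
	bh, err := src.LatestBlockhash()
	if err != nil {
		return Tx{}, fmt.Errorf("fetch blockhash for %s: %w", tx.ID, err)
	}
	tx.RecentBlockhash = bh // re-signing and rebroadcast would follow in a real TXM
	return tx, nil
}

// staticSource is a test double that always returns the same blockhash.
type staticSource string

func (s staticSource) LatestBlockhash() (string, error) { return string(s), nil }

func main() {
	tx := Tx{ID: "tx-1", RecentBlockhash: "old-hash"}
	updated, err := handleReorg(tx, staticSource("new-hash"))
	fmt.Println(updated.RecentBlockhash, err) // new-hash <nil>
}
```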

Contributor Author

Hey! Great catch. This approach simplifies the design a lot. Let me test it first to make sure we don't unintentionally introduce races for Processed transactions, though I have a good feeling about it.

To back up our thoughts, here are some relevant quotes from the Solana Docs:

What is a blockhash?

A blockhash refers to the last Proof of History (PoH) hash for a slot. Since Solana uses PoH as a trusted clock, a transaction's recent blockhash can be thought of as a timestamp.

Proof of History refresher

Solana's Proof of History mechanism uses a very long chain of recursive SHA-256 hashes to build a trusted clock. The “history” part of the name comes from the fact that block producers hash transaction id's into the stream to record which transactions were processed in their block.

PoH can be used as a trusted clock because each hash must be produced sequentially. Each produced block contains a blockhash and a list of hash checkpoints called “ticks” so that validators can verify the full chain of hashes in parallel and prove that some amount of time has actually passed.

How does transaction expiration work?

Each transaction includes a “recent blockhash” which is used as a PoH clock timestamp and expires when that blockhash is no longer “recent enough”.

As each block is finalized, the final hash of the block is added to the BlockhashQueue which stores a maximum of the 300 most recent blockhashes. During transaction processing, Solana Validators will check if each transaction's recent blockhash is recorded within the most recent 151 stored hashes (aka "max processing age"). If the transaction's recent blockhash is older than this max processing age, the transaction is not processed.

In conclusion, I have a good feeling this approach is viable. If a block gets re-orged, my understanding is that its blockhash will no longer be present in the BlockhashQueue described above. Therefore, any retry using the previous blockhash would not be accepted, mitigating the risk of duplicate transactions. The only transaction that should go through is the new one created with a fresh (and hopefully valid) blockhash.

// - Confirmed -> Processed || Broadcasted || Not Found
// - Processed -> Broadcasted || Not Found
currentTxState := convertStatus(status)
if regressionType, isRegressed := isStatusRegression(txInfo.state, currentTxState); isRegressed {
Contributor

If we don't need regressionType anymore based on my other comment, I think we can just use TxHasReorg for this and eliminate GetSignatureInfo and UpdateSignatureStatus to minimize the number of storage methods we use.

  1. We could rework TxHasReorg to take a sig and its current state and return the tx id and a bool. That way the isStatusRegression check happens on the storage side and we can drop GetSignatureInfo.
  2. We could also update TxHasReorg to look up the id in the sig map and use it to fetch the tx. Since we'd have access to the current sig and its new state, I don't think we'd need to call UpdateSignatureStatus; we'd be dropping all old sigs anyway.
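Option 2 could look roughly like this on the storage side. Everything here is an assumption for illustration: the storage struct, its maps, and the TxHasReorg(sig, state) (id, bool) signature follow the suggestion above rather than the PR's actual code, and forward status transitions are assumed to be recorded by other methods.

```go
package main

import (
	"fmt"
	"sync"
)

// TxState is a hypothetical status ordering, least to most progressed.
type TxState int

const (
	NotFound TxState = iota
	Broadcasted
	Processed
	Confirmed
)

// isStatusRegression applies the Confirmed/Processed regression rules.
func isStatusRegression(prev, curr TxState) bool {
	return (prev == Confirmed && curr < Confirmed) ||
		(prev == Processed && curr < Processed)
}

// storage is a hypothetical in-memory model: sig -> tx id, sig -> last state.
type storage struct {
	mu       sync.RWMutex
	sigToID  map[string]string
	sigState map[string]TxState
}

// TxHasReorg looks up the tx id owning sig and reports whether curr is a
// regression from the stored state. It only reads: a re-orged tx is rebuilt
// and its old sigs dropped, so no status write-back happens here.
func (s *storage) TxHasReorg(sig string, curr TxState) (string, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	id, ok := s.sigToID[sig]
	if !ok {
		return "", false
	}
	return id, isStatusRegression(s.sigState[sig], curr)
}

func main() {
	s := &storage{
		sigToID:  map[string]string{"sig-a": "tx-1"},
		sigState: map[string]TxState{"sig-a": Confirmed},
	}
	id, reorged := s.TxHasReorg("sig-a", NotFound)
	fmt.Println(id, reorged) // tx-1 true
}
```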

Contributor Author

I like this

status, errGetStatus = txmInstance.GetTransactionStatus(ctx, txID)
require.NoError(t, errGetStatus)
require.Equal(t, types.Finalized, status, "tx should be finalized after reorg")
})
Contributor

Could you also include a test case where we detect a re-org from Processed state?

Contributor Author

Farber98 commented Jan 3, 2025

I couldn’t reproduce this because the local live validator transitions to Confirmed instantly. I can try spamming a large number of transactions to overwhelm it, which might delay the transition and give us a window to stop it.
