Add new catchup mode to use transaction results to skip failed transaction and signature verification #4536

ThomasBrady · 2024-11-06T00:23:29Z

Description

Resolves #X

Adds a new config option, CATCHUP_SKIP_KNOWN_RESULTS. When this config option is enabled, transaction results are downloaded from history archives for the catchup range. Failed transactions are not applied, and signatures are not verified.
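
A minimal sketch of the decision this option introduces, assuming the archived result for a transaction is available as an optional value. `ResultCode`, `KnownResult`, `ApplyAction`, and `chooseAction` are hypothetical names used for illustration only; they are not stellar-core types.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-ins for illustration; not the stellar-core types.
enum class ResultCode { Success, Failed };
struct KnownResult { ResultCode code; };
enum class ApplyAction { FullApply, ApplyWithoutSigCheck, SkipApply };

// Decide how to treat one transaction during catchup. With the option off,
// or with no archived result available, fall back to full application.
ApplyAction
chooseAction(bool skipKnownResults, std::optional<KnownResult> const& known)
{
    if (!skipKnownResults || !known)
    {
        return ApplyAction::FullApply;
    }
    return known->code == ResultCode::Failed
               ? ApplyAction::SkipApply
               : ApplyAction::ApplyWithoutSigCheck;
}

// Small usage check of the three outcomes.
void
exampleUsage()
{
    assert(chooseAction(false, KnownResult{ResultCode::Failed}) ==
           ApplyAction::FullApply);
    assert(chooseAction(true, KnownResult{ResultCode::Failed}) ==
           ApplyAction::SkipApply);
    assert(chooseAction(true, KnownResult{ResultCode::Success}) ==
           ApplyAction::ApplyWithoutSigCheck);
}
```

The fallback branch mirrors the behavior described in the description: without results, catchup applies transactions as usual.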

Preliminary perf testing locally running catchup on 1000 ledgers:

| Mode | user (s) | system (s) | total (s) | Speedup over baseline (user / system / total) |
|---|---|---|---|---|
| Baseline (no skipping) | 429 | 115 | 138 | |
| Skip Failed | 373 | 99 | 114 | 1.14x / 1.16x / 1.21x |
| Skip Failed + verification | 334 | 88 | 95 | 1.28x / 1.30x / 1.45x |

Remaining work:

- Run catchup in supercluster to get a more representative idea of the performance characteristics of the new mode
- Add more test coverage to test correct meta generation in this mode

Checklist

- Reviewed the contributing document
- Rebased on top of master (no merge commits)
- Ran clang-format v8.0.0 (via make format or the Visual Studio extension)
- Compiles
- Ran all tests
- If change impacts performance, include supporting evidence per the performance document

marta-lokhova · 2024-11-13T19:08:29Z

src/transactions/TransactionFrame.h

@@ -70,6 +70,10 @@ class TransactionFrame : public TransactionFrameBase
    mutable Hash mFullHash;     // the hash of the contents and the sig.

    std::vector<std::shared_ptr<OperationFrame const>> mOperations;
+    mutable std::optional<TransactionResult> mReplaySuccessfulTransactionResult{


Please avoid adding mutable state to TransactionFrame - @SirTyson recently did a lot of work to make TransactionFrame immutable, so we should keep it that way. Reason is that tx frame is used across the codebase for overlay flooding as well as ledger application, so any incorrectly set mutable state leads to awful failure modes like state divergence. If you want to override results, could we modify MutableTransactionResult class instead?

Ah I see, that makes sense. I'll change this to store the replay result in MutableTransactionResult.
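
A minimal sketch of the separation discussed in this thread, assuming heavily simplified types. The class names match those mentioned above, but the members and the `setReplayResult` / `hasReplayResult` methods are hypothetical and are not taken from the stellar-core sources.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Hypothetical, simplified types; the real classes carry far more state.
struct TransactionResult { int code = 0; };

// The frame is immutable after construction, so sharing one instance
// between flooding and ledger application cannot leak per-apply state.
class TransactionFrame
{
    std::string const mEnvelope;

  public:
    explicit TransactionFrame(std::string envelope)
        : mEnvelope(std::move(envelope))
    {
    }
    std::string const& getEnvelope() const { return mEnvelope; }
};

// Per-application state, including the optional replay result, lives here.
class MutableTransactionResult
{
    TransactionResult mResult;
    std::optional<TransactionResult> mReplayResult;

  public:
    void
    setReplayResult(TransactionResult const& r)
    {
        assert(!mReplayResult); // set at most once per application
        mReplayResult = r;
    }
    bool hasReplayResult() const { return mReplayResult.has_value(); }
    // The replay result, when present, overrides the computed one.
    TransactionResult const&
    getResult() const
    {
        return mReplayResult ? *mReplayResult : mResult;
    }
};
```

Because the override lives in an object created for each application, a stale replay result cannot follow the frame into another code path.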

marta-lokhova · 2024-11-13T19:12:13Z

src/transactions/TransactionFrame.cpp

@@ -1785,8 +1816,19 @@ TransactionFrame::apply(AppConnector& app, AbstractLedgerTxn& ltx,
    {
        mCachedAccountPreProtocol8.reset();
        uint32_t ledgerVersion = ltx.loadHeader().current().ledgerVersion;
-        SignatureChecker signatureChecker{ledgerVersion, getContentsHash(),
-                                          getSignatures(mEnvelope)};
+        auto skipMode = mReplayFailingTransactionResult.has_value() ||


I'm not sure this is correct: don't we still need to do signature verification for failed transactions?

(A good sanity check for this change would be running full parallel catchup)

We need to update one-time signers (which are removed whether the signature verification succeeds or fails), but I don't see why we need to verify signatures for failed transactions, as it doesn't affect the ledger state.
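
A minimal sketch of the point made in the reply, assuming a simplified account model in which one-time signers are marked with a flag. `Signer`, `Account`, `removeOneTimeSigners`, and `processSignatures` are hypothetical; the real logic in stellar-core is more involved.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical simplified model: a signer is either permanent or one-time.
struct Signer { std::string key; bool oneTime; };
struct Account { std::vector<Signer> signers; };

// Drop every one-time signer from the account.
void
removeOneTimeSigners(Account& acc)
{
    auto& s = acc.signers;
    s.erase(std::remove_if(s.begin(), s.end(),
                           [](Signer const& x) { return x.oneTime; }),
            s.end());
}

// Verification is optional, but the signer cleanup always runs, so the
// resulting account state is the same whether or not signatures are checked.
bool
processSignatures(Account& acc, bool skipVerification,
                  std::function<bool(Account const&)> const& verify)
{
    bool const ok = skipVerification || verify(acc);
    removeOneTimeSigners(acc);
    return ok;
}
```

In this model the only ledger-visible effect is the signer removal, which is why skipping the check itself leaves the state unchanged.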

marta-lokhova · 2024-11-13T19:14:25Z

src/transactions/SignatureChecker.h

+    virtual bool checkSignature(std::vector<Signer> const&, int32_t) = 0;
+    virtual bool checkAllSignaturesUsed() const = 0;
+};
+class SignatureCheckerImpl : public SignatureChecker


Request for some renaming for clarity:

SignatureChecker -> AbstractSignatureChecker

SignatureCheckerImpl -> SignatureChecker
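
A minimal sketch of the hierarchy after the requested rename. The two virtual methods come from the diff above; the `Signer` struct, the weight-summing body, and the `AlwaysValidSignatureChecker` name are assumptions made for illustration (the real checker verifies cryptographic signatures).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical simplified signer: a weight plus whether it signed the tx.
struct Signer { int32_t weight; bool signedTx; };

class AbstractSignatureChecker
{
  public:
    virtual ~AbstractSignatureChecker() = default;
    virtual bool checkSignature(std::vector<Signer> const&, int32_t) = 0;
    virtual bool checkAllSignaturesUsed() const = 0;
};

// Stand-in for the real checker: sums the weights of signers that signed.
class SignatureChecker : public AbstractSignatureChecker
{
  public:
    bool
    checkSignature(std::vector<Signer> const& signers,
                   int32_t neededWeight) override
    {
        int32_t total = 0;
        for (auto const& s : signers)
        {
            if (s.signedTx)
            {
                total += s.weight;
            }
        }
        return total >= neededWeight;
    }
    // Placeholder: this sketch does not track per-signature usage.
    bool checkAllSignaturesUsed() const override { return true; }
};

// Hypothetical variant for replay: accepts everything without checking.
class AlwaysValidSignatureChecker : public AbstractSignatureChecker
{
  public:
    bool
    checkSignature(std::vector<Signer> const&, int32_t) override
    {
        return true;
    }
    bool checkAllSignaturesUsed() const override { return true; }
};
```

Callers that hold an `AbstractSignatureChecker&` do not need to know which variant they were given.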

marta-lokhova · 2024-11-13T19:17:35Z

src/catchup/ApplyCheckpointWork.h

    TransactionHistoryEntry mTxHistoryEntry;
+    TransactionHistoryResultEntry mTxHistoryResultEntry;


To avoid footguns, results should be optional
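
A minimal sketch of why an optional member is safer here, assuming simplified entry types with only a ledger sequence number. `ApplyCheckpointState` and `resultsForLedger` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical simplified entries; the real XDR types hold much more.
struct TransactionHistoryEntry { uint32_t ledgerSeq = 0; };
struct TransactionHistoryResultEntry { uint32_t ledgerSeq = 0; };

struct ApplyCheckpointState
{
    TransactionHistoryEntry mTxHistoryEntry;
    // Empty until a result entry has actually been read. A plain member
    // would be default-constructed and indistinguishable from real data.
    std::optional<TransactionHistoryResultEntry> mTxHistoryResultEntry;
};

// Return the result entry only if one is present and matches the ledger.
std::optional<TransactionHistoryResultEntry>
resultsForLedger(ApplyCheckpointState const& s, uint32_t ledgerSeq)
{
    if (s.mTxHistoryResultEntry &&
        s.mTxHistoryResultEntry->ledgerSeq == ledgerSeq)
    {
        return s.mTxHistoryResultEntry;
    }
    return std::nullopt;
}
```

With a non-optional member, a lookup for ledger 0 on a freshly constructed state would appear to succeed; with `std::optional` it cannot.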

marta-lokhova · 2024-11-13T19:18:13Z

src/catchup/CatchupConfiguration.h

-// to mark catchup as complete and node as synced. In OFFLINE mode node is not
-// connected to network, so new ledgers are not being externalized. Only
-// buckets and transactions from history archives are applied.
+// Catchup can be done in two modes - ONLINE and OFFLINE. In ONLINE mode, the


Why did the formatting change?

I reworded it a bit, which changed the line lengths and formatting.

marta-lokhova · 2024-11-13T20:38:18Z

src/catchup/DownloadApplyTxsWork.cpp

-                        mCheckpointToQueue);
-    auto getAndUnzip =
-        std::make_shared<GetAndUnzipRemoteFileWork>(mApp, ft, mArchive);
+    OnFailureCallback cb = [archive = mArchive, filesToTransfer]() {


Please keep the implementation of OnFailureCallback the way it was before: mArchive can be null, in which case GetAndUnzipRemoteFileWork will pick a random archive. With the current change, core would crash de-referencing a null pointer.
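
A minimal sketch of a failure callback that tolerates a null archive pointer. This is not the previous implementation referred to above; `HistoryArchive`, `makeOnFailure`, and the returned message are hypothetical and only illustrate the null guard.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>

// Hypothetical stand-in for the archive type.
struct HistoryArchive { std::string name; };

// Build a failure callback. The archive pointer may be null when the caller
// leaves the choice of archive to the download work, so it is checked
// before being dereferenced.
std::function<std::string()>
makeOnFailure(std::shared_ptr<HistoryArchive> archive, std::string file)
{
    return [archive, file]() {
        std::string const name =
            archive ? archive->name : "<unspecified archive>";
        return "failed to download " + file + " from " + name;
    };
}
```

Capturing the `shared_ptr` by value keeps the archive alive for as long as the callback exists, and the explicit check avoids the crash described in the comment.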

marta-lokhova · 2024-11-13T20:41:16Z

src/catchup/DownloadApplyTxsWork.cpp

@@ -98,8 +94,10 @@ DownloadApplyTxsWork::yieldMoreWork()
    {
        auto prev = mLastYieldedWork;
        bool pqFellBehind = false;
+        auto applyName = apply->getName();


marta-lokhova · 2024-11-13T20:42:21Z

src/catchup/DownloadApplyTxsWork.cpp

-                std::filesystem::remove(
-                    std::filesystem::path(ft.localPath_nogz()));
-                CLOG_DEBUG(History, "Deleted transactions {}",
+                CLOG_DEBUG(History, "Deleting transactions {}",


nit: the log is misleading, since we're deleting transactions and results

marta-lokhova · 2024-11-13T20:43:32Z

src/ledger/LedgerManagerImpl.cpp

@@ -1520,6 +1522,17 @@ LedgerManagerImpl::applyTransactions(

    prefetchTransactionData(txs);

+    std::optional<std::vector<TransactionResultPair>::const_iterator>
+        expectedResultsIter = std::nullopt;
+    if (mApp.getConfig().CATCHUP_SKIP_KNOWN_RESULTS && expectedResults)


Is it a valid scenario when CATCHUP_SKIP_KNOWN_RESULTS=true but expectedResults is not set?

If CATCHUP_SKIP_KNOWN_RESULTS=true but we do not have expectedResults (e.g. because they aren't in the archive), catchup will proceed but no transactions will be skipped. I should probably add a warning to inform users that this is the case, but I'm hesitant to report an error since it's not fatal. What do you think?

I see, this makes me realize: with the current implementation of catchup, won't it fail if results couldn't be downloaded? If so, I think we should make it less strict, so it'll try to download results, but if it can't, catchup should still proceed normally. In this case, I agree the logic here makes sense and we can add a warning (probably not at individual ledger level though as it'll get spammy. We can warn per checkpoint).

I changed WorkSequence to take a vector of "optional" works, which, if they fail, do not block the execution of subsequent works. This step will now log a warning once per checkpoint if downloading results fails, but catchup will proceed.
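
A minimal sketch of a sequence in which optional steps may fail without blocking later steps. It does not use the real `WorkSequence` interface; `Step`, `SequenceOutcome`, and `runSequence` are hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical step: a name, an action returning success, and a flag
// saying whether a failure may be tolerated.
struct Step
{
    std::string name;
    std::function<bool()> run;
    bool optional;
};

struct SequenceOutcome
{
    bool success;
    std::vector<std::string> warnings;
};

// Run steps in order. A failed optional step records a warning and the
// sequence continues; a failed required step stops the sequence.
SequenceOutcome
runSequence(std::vector<Step> const& steps)
{
    SequenceOutcome out{true, {}};
    for (auto const& s : steps)
    {
        if (s.run())
        {
            continue;
        }
        if (s.optional)
        {
            out.warnings.push_back("optional step failed: " + s.name);
            continue;
        }
        out.success = false;
        break;
    }
    return out;
}
```

Running one such sequence per checkpoint yields at most one warning per checkpoint for a failed results download, which matches the logging granularity agreed on above.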

marta-lokhova · 2024-11-13T20:47:49Z

src/ledger/LedgerManagerImpl.cpp

+        {
+            releaseAssert(*expectedResultsIter !=
+                          expectedResults->results.end());
+            while ((*expectedResultsIter)->transactionHash !=


This looks a bit suspicious: the order of results should match the order of transaction application exactly. I think you can do expectedResults->at(i) instead of the loop.

Correct, the order should match, but the loop accounts for the case when there are gaps in the results stored in the archive, which I believe can be the case when there are ledgers with empty txn sets (?)

If you have ledgers with empty sets, then txs should be empty as well, so we should use the same index logic for both results and txs.
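
A minimal sketch of the index-based pairing suggested in this thread, assuming simplified transaction and result types keyed by a string hash. `Tx` and `expectedCodes` are hypothetical; `TransactionResultPair` here is a reduced stand-in for the real type.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical simplified types.
struct Tx { std::string hash; };
struct TransactionResultPair { std::string transactionHash; int code; };

// Pair each transaction with the result at the same index. Any mismatch in
// count or hash is reported as nullopt instead of scanning ahead for a match.
std::optional<std::vector<int>>
expectedCodes(std::vector<Tx> const& txs,
              std::vector<TransactionResultPair> const& results)
{
    if (txs.size() != results.size())
    {
        return std::nullopt;
    }
    std::vector<int> codes;
    codes.reserve(txs.size());
    for (std::size_t i = 0; i < txs.size(); ++i)
    {
        if (results[i].transactionHash != txs[i].hash)
        {
            return std::nullopt;
        }
        codes.push_back(results[i].code);
    }
    return codes;
}
```

An empty ledger is handled naturally: both vectors are empty and the function returns an empty list of codes.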

sisuresh · 2024-11-19T23:59:54Z

src/ledger/LedgerManagerImpl.cpp

+        // here.
+        std::optional<std::vector<TransactionResultPair>::const_iterator>
+            expectedResultsIter = std::nullopt;
+        if (expectedResults)


I think we should add an invariant either here or in TransactionFrame that makes sure we're actually doing catchup when we see an expected result. What do you think?

That makes sense, something like releaseAssert(mApp.getCatchupManager().isCatchupInitialized()); ?

Is it possible for CatchupManagerImpl::catchupWorkIsDone() to be true along with a non-null mCatchupWork here? I don't recall how the catchup state transitions work. Maybe @marta-lokhova has some input on a safe way to do this check.

Since this change targets a very specific use case (parallel catchup for testing), I'd recommend putting all this functionality behind BUILD_TESTS. The prod paths should remain unchanged.

Ah I thought this was meant for running catchup in general. Is it just for parallel catchup testing?

Meta, events, and metrics are not equivalent to live execution when this mode is enabled, making it unsuitable for the general catchup use case. It could be used in all scenarios where one doesn't care about them. It's not limited to parallel catchup, but applies generally to development/testing scenarios, so I agree that it should be behind an ifdef BUILD_TESTS.

Yes, we don't win much by skipping work in prod paths. Different application flows increase the risk of divergence. Parallel catchup seems like a nice middle ground given the performance benefits + low risk.

I agree. In that case, a lot more of this PR should be behind the BUILD_TESTS ifdef.
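
A minimal sketch of gating the replay path behind `BUILD_TESTS` and asserting that catchup is in progress, as proposed in this thread. `ApplyContext` and `replayResultFor` are hypothetical, and the standard `assert` stands in for `releaseAssert`.

```cpp
#include <cassert>
#include <optional>

// Hypothetical simplified types.
struct TransactionResult { int code; };
struct ApplyContext { bool catchingUp; };

// Return the expected result to replay, or nullopt when replay is not
// available. Outside test builds the expected result is always ignored,
// so production paths behave exactly as before.
std::optional<TransactionResult>
replayResultFor(ApplyContext const& ctx,
                std::optional<TransactionResult> const& expected)
{
#ifdef BUILD_TESTS
    if (expected)
    {
        // Invariant: expected results are only supplied during catchup.
        assert(ctx.catchingUp);
        return expected;
    }
#else
    (void)ctx;
    (void)expected;
#endif
    return std::nullopt;
}
```

Keeping the whole branch inside the `#ifdef` means a production binary does not even compile the skip logic.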

ThomasBrady force-pushed the lightweight-catchup branch 3 times, most recently from 1116893 to b03fc48 Compare November 8, 2024 23:52

ThomasBrady requested a review from marta-lokhova November 13, 2024 19:00

marta-lokhova requested changes Nov 13, 2024

anupsdf requested a review from sisuresh November 15, 2024 19:06

ThomasBrady added 5 commits November 15, 2024 16:43

WIP

222f2c0

Yolo test new mode without supercluster config

a8764e7

Revert default to skip mode

0f80cc8

refactoring

40701f8

update p22 soroban rust

c2bc090

ThomasBrady force-pushed the lightweight-catchup branch from ff8f71a to c2bc090 Compare November 16, 2024 00:43

ThomasBrady changed the title ~~WIP: New mode to use transaction results to skip failed transaction and signature verification in catchup~~ Add new catchup mode to use transaction results to skip failed transaction and signature verification Nov 16, 2024

ThomasBrady added 3 commits November 18, 2024 13:42

CATCHUP_SKIP_KNOWN_RESULTS=true

92090fe

Downloading results is optional, failure is not fatal. Missing result…

d6041b4

…s is not fatal

Handle resplay for feebumptransactionframe

42242c0

sisuresh reviewed Nov 19, 2024

sisuresh reviewed Nov 20, 2024

ThomasBrady added 3 commits November 19, 2024 17:49

Set result of mutable fee bump txn result

e7e7b96

remove feebump special case

a30ae35

Just one variable for replay result

6d2fa0b

ThomasBrady requested review from marta-lokhova and sisuresh November 20, 2024 18:59

ThomasBrady added 2 commits November 20, 2024 11:02

format

3c2e52b

only access results() of tx results with code txFAILED, move skip mod…

c451463

…e to ifdef BUILDTEST

ThomasBrady force-pushed the lightweight-catchup branch from e2ad648 to c451463 Compare November 22, 2024 03:44

ThomasBrady added 4 commits November 21, 2024 20:26

ifdef

f02ab87

format

8d8b6a6

ifdefs, moving results plumbing to one function'

b23ffdc

refactor, extra checks around file access, remove optional work param…

fb725ab

… from workseq
