This repository has been archived by the owner on Feb 5, 2025. It is now read-only.

Fetch leaves recursively by hash, verifying chaining #726

Merged: 11 commits into main, Dec 5, 2024

Conversation

jbearer
Member

@jbearer jbearer commented Oct 31, 2024

This changes leaf fetching to require already having the next leaf. This tells us what the hash should be for the leaf being fetched.

This addresses two unrelated issues:

  1. Leaf fetching is trusted: we were not previously verifying that the leaf returned by a peer is valid in any way. Now, we can verify the fetched leaf against the expected hash (and similarly for the fetched QC). We exploit the chaining property of HotShot leaves to avoid having to run any kind of consensus light client to verify fetched leaves.
  2. Fetching leaves by hash instead of (or in addition to) by height allows us to implement more providers. For example, we can now implement a provider that pulls leaves from undecided consensus storage, by hash, which allows us to fetch a leaf from our own storage even if we missed the corresponding decide event.
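
The chaining check described in point 1 can be sketched as follows. This is a minimal, hypothetical model: `Leaf`, `commit`, and `verify_fetched` are simplified stand-ins for the real HotShot types, and `DefaultHasher` replaces the real commitment scheme; the actual logic lives in `active_fetch` in `src/data_source/fetching/leaf.rs`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical, simplified leaf: the real HotShot leaf also carries a QC
// and payload commitment, verified the same way.
#[derive(Hash)]
struct Leaf {
    height: u64,
    parent_commitment: u64,
}

// Stand-in for the real commitment function.
fn commit(leaf: &Leaf) -> u64 {
    let mut h = DefaultHasher::new();
    leaf.hash(&mut h);
    h.finish()
}

// Verify a fetched leaf against the hash its child (the "next" leaf) expects.
fn verify_fetched(fetched: &Leaf, next: &Leaf) -> bool {
    commit(fetched) == next.parent_commitment
}

fn main() {
    let parent = Leaf { height: 9, parent_commitment: 0 };
    let child = Leaf { height: 10, parent_commitment: commit(&parent) };
    assert!(verify_fetched(&parent, &child));

    // A peer returning any other leaf fails verification.
    let forged = Leaf { height: 9, parent_commitment: 42 };
    assert!(!verify_fetched(&forged, &child));
    println!("ok");
}
```

Because each leaf commits to its parent, holding one trusted leaf transitively authenticates the entire chain behind it, with no light client needed.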

This PR:

  • Changes leaf fetching to require the next leaf, and fetch/authenticate by hash
  • Proactive scanning now runs backwards, since leaf fetching is now inherently backwards. This was a change we wanted anyway (closes Sync from newest to oldest #620)
  • To facilitate that, we have added reverse streams for all the resource types, which may come in handy for the explorer
  • Receiving a leaf (either by fetching or via a decide event) now triggers us to fetch its parent if the parent is missing. This supplements the proactive fetcher and can often recover missing data very quickly (e.g. if we just missed one decide)
  • NoStorage is gone. The purpose of NoStorage was to stress-test fetching, which it did early on, but it has been ages since we found a real bug this way; we now have plenty of other adversarial fetching tests, and NoStorage is becoming more trouble than it's worth to maintain. In particular, effective fetching now depends on having somewhat reliable storage (a reasonable assumption!), because we assume that if you fetch a leaf, you can look it up shortly thereafter and thus fetch its parent. Thus, with this change, the NoStorage tests were failing or running very slowly due to many failed requests.
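
The parent-trigger behavior can be modeled roughly like this. This is an illustrative sketch only: the names `Fetcher`, `Store`, and the use of `0` as a genesis sentinel are assumptions, and the real implementation (`trigger_fetch_for_parent`) is asynchronous and backed by persistent storage rather than a `HashMap`.

```rust
use std::collections::{HashMap, VecDeque};

#[derive(Clone)]
struct Leaf {
    hash: u64,
    parent_hash: u64,
}

struct Fetcher {
    store: HashMap<u64, Leaf>, // leaves we already have, keyed by hash
    pending: VecDeque<u64>,    // parent hashes queued for fetching
}

impl Fetcher {
    // On receiving a leaf (via fetch or decide event), store it and,
    // if its parent is missing, queue a fetch for the parent by hash.
    fn receive(&mut self, leaf: Leaf) {
        let parent = leaf.parent_hash;
        self.store.insert(leaf.hash, leaf);
        // 0 stands in for "genesis has no parent" in this sketch.
        if parent != 0 && !self.store.contains_key(&parent) {
            self.pending.push_back(parent);
        }
    }
}

fn main() {
    let mut f = Fetcher { store: HashMap::new(), pending: VecDeque::new() };
    // Receive a leaf whose parent (hash 20) we do not have yet:
    f.receive(Leaf { hash: 30, parent_hash: 20 });
    assert_eq!(f.pending.pop_front(), Some(20));
    // When that parent arrives, its own missing parent is queued in turn,
    // so one received leaf can backfill a whole missing chain.
    f.receive(Leaf { hash: 20, parent_hash: 10 });
    assert_eq!(f.pending.pop_front(), Some(10));
    println!("ok");
}
```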

Key places to review:

  • New leaf fetching logic: active_fetch in src/data_source/fetching/leaf.rs
  • Triggering fetch of the parent leaf: trigger_fetch_for_parent in src/data_source/fetching/leaf.rs
  • Reverse streams: src/data_source/fetching.rs
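
As a toy illustration of the reverse direction (purely illustrative; the real reverse streams in src/data_source/fetching.rs are async streams over stored resources, not a plain height iterator):

```rust
// Hypothetical sketch: scan heights from newest to oldest, the order the
// new proactive scanner and reverse streams work in.
fn rev_heights(latest: u64) -> impl Iterator<Item = u64> {
    (0..=latest).rev()
}

fn main() {
    let heights: Vec<u64> = rev_heights(3).collect();
    assert_eq!(heights, vec![3, 2, 1, 0]);
    println!("ok");
}
```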

@jbearer jbearer force-pushed the jb/leaf-fetching branch 3 times, most recently from 3820561 to 924cefa on November 14, 2024 19:18
@jbearer jbearer marked this pull request as ready for review November 14, 2024 19:19
@jbearer jbearer force-pushed the jb/leaf-fetching branch 2 times, most recently from 2d6e36f to 6bdab3b on November 19, 2024 04:31
If we do not have a leaf, and we also do not have the next leaf
(so we cannot directly fetch the required leaf), we can at least
find the first leaf that we do have and trigger a fetch of it,
which in turn will trigger a fetch of its parent, and so on.

This brings the behavior of leaf fetching closer to what it was
before: if a leaf is requested and we don't have it, we immediately
start working to get it.
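
The fallback can be sketched as a forward scan for the first leaf we hold; fetching that leaf's parent then walks the chain back down to the one we wanted. A minimal sketch, where `first_available` and the `HashMap` store are hypothetical stand-ins for the real storage lookup:

```rust
use std::collections::HashMap;

// Scan upward from the missing height to the first leaf we do have.
// Triggering a fetch of that leaf then backfills its ancestors.
fn first_available(store: &HashMap<u64, &str>, from: u64, latest: u64) -> Option<u64> {
    (from..=latest).find(|h| store.contains_key(h))
}

fn main() {
    let mut store = HashMap::new();
    store.insert(5u64, "leaf5");
    // Heights 2..=4 are missing; the first leaf we hold at or above 2 is 5.
    assert_eq!(first_available(&store, 2, 10), Some(5));
    println!("ok");
}
```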
Before switching to nextest, we only ran 2 tests at a time. This
prevented us from seeing any weirdness due to conflicting ports and
the like. Now, I am seeing some flakiness in tests that use Postgres,
which I believe is due to the increased concurrency allowed by nextest.

This change puts all the tests using Postgres into a group, and restricts
that group to 2 tests at a time. Other tests can run with unrestricted
concurrency.
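
With nextest, a grouping along these lines in `.config/nextest.toml` would do it (illustrative sketch; the group name and the filter expression are assumptions about how this repo names its Postgres tests):

```toml
# Limit Postgres-backed tests to 2 at a time; everything else is unrestricted.
[test-groups]
postgres = { max-threads = 2 }

[[profile.default.overrides]]
# Assumed filter: match tests with "postgres" in the name.
filter = 'test(postgres)'
test-group = 'postgres'
```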
* Ensure proper return code from PayloadMetadata when missing from
  DB
* No more retries for upsert; they accomplish nothing as these
  errors are almost always unrecoverable without restarting the
  whole transaction, but they cost a ton of time
* Avoid inserting stale data when pruner is racing with fetcher
* Avoid long delay in `test_archive_recovery` due to competing write
  transactions, by tweaking backoff parameters
* In aggregates test, wait for aggregator task to update statistics
  before checking
* Don't fetch parent leaves that have already been pruned
Contributor

@imabdulbasit imabdulbasit left a comment


Looks very good! I like how we are using parent_commitment() to verify the fetched leaf.

@jbearer jbearer merged commit fec6377 into main Dec 5, 2024
10 checks passed
@jbearer jbearer deleted the jb/leaf-fetching branch December 5, 2024 16:39