This repository has been archived by the owner on Feb 5, 2025. It is now read-only.

Fetch leaves recursively by hash, verifying chaining #726

Merged: 11 commits into main, Dec 5, 2024

Conversation

jbearer
Member

@jbearer jbearer commented Oct 31, 2024

This changes leaf fetching to require already having the next leaf. This tells us what the hash should be for the leaf being fetched.

This addresses two unrelated issues:

  1. Leaf fetching is trusted: we were not previously verifying that the leaf returned by a peer is valid in any way. Now, we can verify the fetched leaf against the expected hash (and similarly for the fetched QC). We exploit the chaining property of HotShot leaves to avoid having to run any kind of consensus light client to verify fetched leaves.
  2. Fetching leaves by hash instead of (or in addition to) by height allows us to implement more providers. For example, we can now implement a provider that pulls leaves from undecided consensus storage, by hash, which allows us to fetch a leaf from our own storage even if we missed the corresponding decide event.
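
The chaining check described in point 1 can be sketched as follows. This is a minimal, hypothetical model: `Leaf`, `commit`, and `verify_fetched` are simplified stand-ins for the real HotShot types, and `DefaultHasher` replaces the real commitment scheme; the actual logic lives in `active_fetch` in `src/data_source/fetching/leaf.rs`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical, simplified leaf: the real HotShot leaf also carries a QC
// and payload commitment, verified the same way.
#[derive(Hash)]
struct Leaf {
    height: u64,
    parent_commitment: u64,
}

// Stand-in for the real commitment function.
fn commit(leaf: &Leaf) -> u64 {
    let mut h = DefaultHasher::new();
    leaf.hash(&mut h);
    h.finish()
}

// Verify a fetched leaf against the hash its child (the "next" leaf) expects.
fn verify_fetched(fetched: &Leaf, next: &Leaf) -> bool {
    commit(fetched) == next.parent_commitment
}

fn main() {
    let parent = Leaf { height: 9, parent_commitment: 0 };
    let child = Leaf { height: 10, parent_commitment: commit(&parent) };
    assert!(verify_fetched(&parent, &child));

    // A peer returning any other leaf fails verification.
    let forged = Leaf { height: 9, parent_commitment: 42 };
    assert!(!verify_fetched(&forged, &child));
    println!("ok");
}
```

Because each leaf commits to its parent, holding one trusted leaf transitively authenticates the entire chain behind it, with no light client needed.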

This PR:

  • Changes leaf fetching to require the next leaf, and fetch/authenticate by hash
  • Proactive scanning now runs backwards, since leaf fetching is now inherently backwards. This was a change we wanted anyway (closes Sync from newest to oldest #620)
  • To facilitate that, we have added reverse streams for all the resource types, which may come in handy for the explorer
  • Receiving a leaf (either by fetching or via a decide event) now triggers us to fetch its parent if the parent is missing. This supplements the proactive fetcher and can often recover missing data very quickly (e.g. if we just missed one decide)
  • NoStorage is gone. The purpose of NoStorage was to stress-test fetching, which it did early on, but it has been ages since we found a real bug this way; we now have plenty of other adversarial fetching tests, and NoStorage is becoming more trouble than it's worth to maintain. In particular, effective fetching now depends on having somewhat reliable storage (a reasonable assumption!), because we assume that if you fetch a leaf, you can look it up shortly thereafter and thus fetch its parent. Thus, with this change, the NoStorage tests were failing or running very slowly due to many failed requests.
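
The parent-trigger behavior can be modeled roughly like this. This is an illustrative sketch only: the names `Fetcher`, `Store`, and the use of `0` as a genesis sentinel are assumptions, and the real implementation (`trigger_fetch_for_parent`) is asynchronous and backed by persistent storage rather than a `HashMap`.

```rust
use std::collections::{HashMap, VecDeque};

#[derive(Clone)]
struct Leaf {
    hash: u64,
    parent_hash: u64,
}

struct Fetcher {
    store: HashMap<u64, Leaf>, // leaves we already have, keyed by hash
    pending: VecDeque<u64>,    // parent hashes queued for fetching
}

impl Fetcher {
    // On receiving a leaf (via fetch or decide event), store it and,
    // if its parent is missing, queue a fetch for the parent by hash.
    fn receive(&mut self, leaf: Leaf) {
        let parent = leaf.parent_hash;
        self.store.insert(leaf.hash, leaf);
        // 0 stands in for "genesis has no parent" in this sketch.
        if parent != 0 && !self.store.contains_key(&parent) {
            self.pending.push_back(parent);
        }
    }
}

fn main() {
    let mut f = Fetcher { store: HashMap::new(), pending: VecDeque::new() };
    // Receive a leaf whose parent (hash 20) we do not have yet:
    f.receive(Leaf { hash: 30, parent_hash: 20 });
    assert_eq!(f.pending.pop_front(), Some(20));
    // When that parent arrives, its own missing parent is queued in turn,
    // so one received leaf can backfill a whole missing chain.
    f.receive(Leaf { hash: 20, parent_hash: 10 });
    assert_eq!(f.pending.pop_front(), Some(10));
    println!("ok");
}
```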

Key places to review:

  • New leaf fetching logic: active_fetch in src/data_source/fetching/leaf.rs
  • Triggering fetch of the parent leaf: trigger_fetch_for_parent in src/data_source/fetching/leaf.rs
  • Reverse streams: src/data_source/fetching.rs
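
As a toy illustration of the reverse direction (purely illustrative; the real reverse streams in src/data_source/fetching.rs are async streams over stored resources, not a plain height iterator):

```rust
// Hypothetical sketch: scan heights from newest to oldest, the order the
// new proactive scanner and reverse streams work in.
fn rev_heights(latest: u64) -> impl Iterator<Item = u64> {
    (0..=latest).rev()
}

fn main() {
    let heights: Vec<u64> = rev_heights(3).collect();
    assert_eq!(heights, vec![3, 2, 1, 0]);
    println!("ok");
}
```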

@jbearer jbearer force-pushed the jb/leaf-fetching branch 3 times, most recently from 3820561 to 924cefa on November 14, 2024 19:18
@jbearer jbearer marked this pull request as ready for review November 14, 2024 19:19
@jbearer jbearer force-pushed the jb/leaf-fetching branch 2 times, most recently from 2d6e36f to 6bdab3b on November 19, 2024 04:31
If we do not have a leaf, and we also do not have the next leaf
(so we cannot directly fetch the required leaf), we can at least
find the first leaf that we do have and trigger a fetch of it,
which in turn will trigger a fetch of its parent, and so on.

This brings the behavior of leaf fetching closer to what it was
before: if a leaf is requested and we don't have it, we immediately
start working to get it.
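
The fallback can be sketched as a forward scan for the first leaf we hold; fetching that leaf's parent then walks the chain back down to the one we wanted. A minimal sketch, where `first_available` and the `HashMap` store are hypothetical stand-ins for the real storage lookup:

```rust
use std::collections::HashMap;

// Scan upward from the missing height to the first leaf we do have.
// Triggering a fetch of that leaf then backfills its ancestors.
fn first_available(store: &HashMap<u64, &str>, from: u64, latest: u64) -> Option<u64> {
    (from..=latest).find(|h| store.contains_key(h))
}

fn main() {
    let mut store = HashMap::new();
    store.insert(5u64, "leaf5");
    // Heights 2..=4 are missing; the first leaf we hold at or above 2 is 5.
    assert_eq!(first_available(&store, 2, 10), Some(5));
    println!("ok");
}
```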
Before switching to nextest, we only ran 2 tests at a time. This
prevented us from seeing any weirdness due to conflicting ports and
the like. Now, I am seeing some flakiness in tests that use Postgres,
which I believe is due to the increased concurrency allowed by nextest.

This change puts all the tests using Postgres into a group, and restricts
that group to 2 tests at a time. Other tests can run with unrestricted
concurrency.
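
With nextest, a grouping along these lines in `.config/nextest.toml` would do it (illustrative sketch; the group name and the filter expression are assumptions about how this repo names its Postgres tests):

```toml
# Limit Postgres-backed tests to 2 at a time; everything else is unrestricted.
[test-groups]
postgres = { max-threads = 2 }

[[profile.default.overrides]]
# Assumed filter: match tests with "postgres" in the name.
filter = 'test(postgres)'
test-group = 'postgres'
```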
* Ensure proper return code from PayloadMetadata when missing from
  DB
* No more retries for upsert; they accomplish nothing as these
  errors are almost always unrecoverable without restarting the
  whole transaction, but they cost a ton of time
* Avoid inserting stale data when pruner is racing with fetcher
* Avoid long delay in `test_archive_recovery` due to competing write
  transactions, by tweaking backoff parameters
* In aggregates test, wait for aggregator task to update statistics
  before checking
* Don't fetch parent leaves that have already been pruned
Contributor

@imabdulbasit imabdulbasit left a comment


Looks very good! I like how we are using parent_commitment() to verify the fetched leaf.

@jbearer jbearer merged commit fec6377 into main Dec 5, 2024
10 checks passed
@jbearer jbearer deleted the jb/leaf-fetching branch December 5, 2024 16:39