Which blocks should the subscribeRepos blocks field contain?
#2165
Replies: 1 comment
This is a great question, and I just added an entry to #2128 to nail this down and specify it better. As some context, we have run into a couple of bugs around this and tweaked what we include in recent months. Here is an informal, non-binding summary as I remember things:
You're right on with your points 1 and 2. If the Relay has the full previous tree (and records) from the just-previous commit, the event should contain all the blocks needed to have the full new tree (and records). Tangentially, note that the Relay also has enough state and info that it doesn't really need the op list, or could verify that the op list is correct and complete.
We also want a less-stateful but still signature-verifying consumer to be able to verify proofs, even if they don't have any previous MST tree state (but do presumably have a cache or ability to resolve identity info, including the repo signing key from the DID document). The tricky case here is deletes, where the consumer is kind of trusting the op list when it comes to deletions. The consumer can verify deletions in the op list against the blocks, but the op list could be bogus (either indicating deletion of records which never existed, or not mentioning some additional records which were also deleted). I believe these are the current semantics and the proof blocks are included for deletes (but I could be forgetting).
There is another edge case, which is non-unique records (or MST nodes). If you add a record, then delete it, the tree goes back to a previous state, and all those intermediate nodes have already been sent before, so the PDS might skip sending them. Or, if you include the same exact record, either as a duplicate in the current tree or by updating back to a previous state (eg, a profile record edited back to a previous state), then the PDS might think those are not "new" blocks and omit them. But then a stateless consumer doesn't have the blocks. I believe we have updated to include the blocks now, to make things easier on downstream.
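The non-unique block edge case can be sketched with a toy model. This is only an illustration under stated assumptions, not how any PDS actually tracks blocks: a repo state is modelled as a plain set of block CIDs, the signed commit object is left out, and the CID strings and function names are made up.

```python
# Toy model (an assumption for illustration): a repo state is the set of block
# CIDs reachable from its MST root. The signed commit object is left out, and
# the CID strings below are made up.


def diff_against_previous(prev_state: set[str], new_state: set[str]) -> set[str]:
    """Blocks in the new tree that the just-previous tree did not contain."""
    return new_state - prev_state


def diff_against_everything_seen(seen: set[str], new_state: set[str]) -> set[str]:
    """Blocks in the new tree that were never stored or sent before."""
    return new_state - seen


# A profile record is edited, then edited back to its earlier value.
state_0 = {"mst-root-a", "profile-v1"}
state_1 = {"mst-root-b", "profile-v2"}
state_2 = {"mst-root-a", "profile-v1"}  # identical content, so identical CIDs

# Everything that existed at some point before the third commit.
seen = state_0 | state_1

per_commit = diff_against_previous(state_1, state_2)
deduplicated = diff_against_everything_seen(seen, state_2)

print(sorted(per_commit))    # both blocks of the reverted tree
print(sorted(deduplicated))  # empty: nothing looks "new"
```

A sender that deduplicates against everything it has ever stored would emit no blocks for the revert, which is exactly the situation that leaves a stateless consumer without the data it needs.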
In general, right now we are leaning towards including some extra blocks if they make things easier for downstream. This usually isn't much extra overhead. When we formalize this better we'll probably say it is ok to include some extra blocks... I guess we don't want folks doing a soft resource amplification lamer attack, stuffing unused blocks in the event to waste bandwidth? Or other abuse I can imagine but won't describe here now? But we probably want to roll that up with overall bandwidth limits and generic abuse policies.
While we are on this topic, two related things: we have some informal guidelines on general binary data limits that I'll post soon (eg, max record size, max websocket event frame size, things like that). And we are aware that the current repo event stream is not very ergonomic in the common case: most consumers aren't verifying signatures (so don't need the MST nodes, which are much more significant overhead on the firehose than for full-repo exports/snapshots), and most consumers aren't keeping full repo copies, so things like "is this a duplicate like" or "who was unfollowed when this follow record was deleted" are hard. We want to do an "app-specific" stream/firehose from the bsky appview which is WebSocket+JSON and helps with these ergonomics, but that might be a minute.
The lexicon describes the field as "CAR file containing relevant blocks."
(atproto/lexicons/com/atproto/sync/subscribeRepos.json, lines 56 to 60 in d0be052)
I think "relevant" is underspecified, and I'm not sure it's documented more explicitly anywhere else yet.
A repo's state can be encoded as the set of blocks it contains (both MST nodes and the records themselves), along with a reference to the MST root node. For a commit that transforms a repo from state "A" to state "B", there are at least two ways to define the set of blocks that are "relevant":
1. All blocks that are present in state B, but not in state A. (i.e. just the "new" blocks)
2. The union of blocks required to prove each individual operation within a commit. (all inclusion-proofs for record creates and updates, and exclusion-proofs for record deletions, plus any new record values)
3. Something in between (or something else entirely...)
For commits that only contain a single operation, I believe there's no difference between 1 and 2. But, for a commit that contains multiple operations, some MST blocks (or even records) may be ephemeral, existing for some intermediate state of the MST, but not in either of the initial or final states. Thus, the set of blocks included by 2 is a superset of 1.
Method 1 should be all that's relevant for consumption by a service that knows the previous state of the MST (e.g. a Relay).
But, for a more "stateless" service like a feed generator, you ideally want to be able to verify each operation as a standalone event, without needing to know the previous state of the repo/MST, hence method 2 makes more sense there (maybe - it could make batched writes unnecessarily expensive in terms of the number of blocks included).
So I have two questions:
What are the current PDS and Relay behaviours? (I have a feeling it's 1, but I'm not sure)
What should the correct behaviour be? (and does that depend on the roles of the service(s) talking to each other?)
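To make the difference between definitions 1 and 2 concrete, here is a rough sketch using a toy model. It rests on several assumptions: repo states are plain sets of made-up CID strings, the per-operation proofs are written out by hand rather than computed from a real MST, and each operation is taken to be proven against the tree as it stood right after that operation (my reading of definition 2, not something the lexicon states).

```python
# Toy model (assumption): each repo state is a set of made-up block CIDs, and
# the proof for an operation is given by hand, not computed from a real MST.


def new_blocks(state_a: set[str], state_b: set[str]) -> set[str]:
    """Definition 1: blocks present in state B but not in state A."""
    return state_b - state_a


def union_of_proofs(per_op_proofs: list[set[str]]) -> set[str]:
    """Definition 2: union of the blocks that prove each operation."""
    result: set[str] = set()
    for proof in per_op_proofs:
        result |= proof
    return result


# One commit with two operations: create rec-2, then create rec-3.
state_a = {"root-0", "node-0", "rec-1"}
state_b = {"root-2", "node-2", "rec-1", "rec-2", "rec-3"}

per_op_proofs = [
    {"root-1", "node-1", "rec-2"},  # op 1, proven in the intermediate tree
    {"root-2", "node-2", "rec-3"},  # op 2, proven in the final tree
]

method_1 = new_blocks(state_a, state_b)
method_2 = union_of_proofs(per_op_proofs)

# Blocks that existed only in the intermediate state between the two ops.
ephemeral = method_2 - state_a - state_b

print(sorted(method_1))
print(sorted(method_2))
print(sorted(ephemeral))
```

In this example every block from definition 1 also appears in definition 2, and the two extra blocks are the root and node of the intermediate tree, which is where the additional cost of batched writes would come from.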
I know dholms has answered one of my queries related to this on bluesky, but I can't find the post now and things have probably changed since then anyway - hopefully any answers here will be both up-to-date and more easily findable in the future!