-
Interesting idea :) My current understanding is that the topology of the graph is already factored into the probability distribution of path lengths: assuming some number of nodes and some average degree, one (not me, but I know it's doable) can compute or estimate this distribution. It's a simplification, but it seems to me it describes reasonably well what we care about, at least in the Praos (and Peras) cases: given that some node produces a block, what's the 95th percentile of the delay distribution for this block across the given distribution of possible paths. Interestingly, we have been able to compare the distribution that DeltaQ gives us with the actual 1-hop distribution observed across a significant part of the network (that's in the first tech report for Peras), and it seems pretty accurate. Perhaps this is an artefact and more analysis might be required, but to me this is a solid hint that, within the error margin of a coarse-grained simulation, a simple Delta-Q expression can model real-world behaviour accurately.

I therefore don't understand why you would want to go to that level of detail with DeltaQ in the case of Leios. We should first try to model Leios using the existing combinators, possibly factoring in congestion by including in each hop's delay a random variable modelling the additional delay according to some queueing model.
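For what it's worth, here is a rough sketch of what "a queueing model's congestion folded into each hop's delay" could look like, assuming an M/M/1 queue on top of a fixed propagation latency; every parameter value below is illustrative, not measured.

```python
import math

def hop_delay_cdf(t: float, base_latency: float,
                  arrival_rate: float, service_rate: float) -> float:
    """P(hop delay <= t): a fixed propagation latency plus an M/M/1 sojourn
    time, which is exponentially distributed with rate service_rate - arrival_rate."""
    if t < base_latency or arrival_rate >= service_rate:
        return 0.0
    return 1.0 - math.exp(-(service_rate - arrival_rate) * (t - base_latency))

# Illustrative numbers only: 69 ms propagation, a 100 msg/s link at 30% load.
for t_ms in (70, 100, 150, 300):
    p = hop_delay_cdf(t_ms / 1000, base_latency=0.069,
                      arrival_rate=30.0, service_rate=100.0)
    print(f"P(hop delay <= {t_ms} ms) = {p:.3f}")
```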
-
You're right about the elegance and usefulness of the ΔQ approach when it comes to timeliness, and what I propose above isn't needed to improve that. My impetus is to tackle the load analysis. As far as I can see, the beauty of ΔQ lies in just abstracting away exactly those details that would be needed for such an analysis: notice how the amplification of messages at each hop is absent in the expression, hence it isn't possible to judge the bandwidth usage of the sending node.
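To make the amplification point concrete, here is a back-of-the-envelope comparison (every number is an invented placeholder) of a node's egress if it pushed the full body to every downstream peer versus announcing headers and serving the body only to whoever actually fetches it from this node.

```python
# All numbers are invented placeholders, just to show the shape of the calculation.
BODY_BYTES = 90_000        # hypothetical block body size
HEADER_BYTES = 1_000       # hypothetical header size
BLOCKS_PER_SECOND = 0.05   # e.g. one block every 20 seconds
DOWNSTREAM_PEERS = 10      # fan-out of this node

# Naive flooding: push the full body to every downstream peer.
flood_egress = BODY_BYTES * DOWNSTREAM_PEERS * BLOCKS_PER_SECOND

# Header/body split: announce the header to everyone, but assume on average
# only one downstream peer fetches the body from this particular node.
split_egress = (HEADER_BYTES * DOWNSTREAM_PEERS + BODY_BYTES) * BLOCKS_PER_SECOND

print(f"flooding    : {flood_egress / 1e3:.1f} kB/s egress")
print(f"header/body : {split_egress / 1e3:.1f} kB/s egress")
```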
-
So far, ΔQ expressions have been used to model the timeliness of block diffusion in Cardano using an approach like the following:
The probabilities for needing one hop, two hops, etc. have been taken from measurements or simulations of a peer-to-peer network shaped like the Cardano network.
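To illustrate that structure with a toy sketch (placeholder numbers, not the measured hop-count probabilities): each additional hop is a sequential composition, i.e. a convolution of the single-hop delay distribution with itself, and the hop counts are combined by probabilistic choice.

```python
# Toy sketch of the "choice over hop counts" structure; all numbers are placeholders.
from itertools import product

def convolve(a: dict, b: dict) -> dict:
    """Sequential composition: delays add, probabilities multiply."""
    out = {}
    for (da, pa), (db, pb) in product(a.items(), b.items()):
        key = round(da + db, 6)
        out[key] = out.get(key, 0.0) + pa * pb
    return out

def choice(weighted: list) -> dict:
    """Probabilistic choice: mixture of distributions with the given weights."""
    out = {}
    for weight, dist in weighted:
        for d, p in dist.items():
            out[d] = out.get(d, 0.0) + weight * p
    return out

def cdf(dist: dict) -> list:
    """Render as the CDF[(x, p), ...] notation used below."""
    acc, steps = 0.0, []
    for d in sorted(dist):
        acc += dist[d]
        steps.append((round(d, 3), round(acc, 3)))
    return steps

# Hypothetical single-hop delay: 69 ms or 268 ms, equally likely.
one_hop = {0.069: 0.5, 0.268: 0.5}
two_hops = convolve(one_hop, one_hop)
three_hops = convolve(two_hops, one_hop)

# Hypothetical hop-count probabilities (in reality taken from measurements).
diffusion = choice([(0.6, one_hop), (0.3, two_hops), (0.1, three_hops)])
print("CDF", cdf(diffusion))
```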
This approach is not satisfactory for any analysis that aims to take network load or buffers into account, because it has no access to the branching nature of the messages traveling through the network, nor to the crucial deduplication functionality: when information reaches a node where it is already known, it won't be replicated further. With the separation into headers and block data this becomes even more of an inaccuracy, as a node will only fetch the heavy block data from one peer.
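A tiny simulation makes visible exactly the quantities the hop-count model hides: on a random graph (parameters invented), headers fan out to all peers of every informed node, but each node downloads the heavy body only once.

```python
# Minimal gossip simulation with deduplication; all parameters are invented.
import random
from collections import defaultdict

random.seed(1)
S, D = 200, 8   # hypothetical network size and per-node fan-out
peers = {n: random.sample([m for m in range(S) if m != n], D) for n in range(S)}

have_body = {0}                  # node 0 produced the block
frontier = [0]
header_sends = defaultdict(int)  # per-node count of header announcements sent
body_sends = defaultdict(int)    # per-node count of body uploads

while frontier:
    nxt = []
    for node in frontier:
        for p in peers[node]:
            header_sends[node] += 1      # headers are pushed to every peer...
            if p not in have_body:
                body_sends[node] += 1    # ...but the body is fetched only once
                have_body.add(p)
                nxt.append(p)
    frontier = nxt

print("nodes reached:", len(have_body), "of", S)
print("max header sends by one node:", max(header_sends.values()))
print("max body uploads by one node:", max(body_sends.values()))
print("total body transfers:", sum(body_sends.values()),
      "(= nodes reached - 1, thanks to deduplication)")
```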
Notation: CDF[(0.1, 0.3), (0.5, 0.6), (0.8, 1)] describes a cumulative distribution function starting at (0,0) and having steps at 0.1, 0.5, and 0.8, rising to the values 0.3, 0.6, and 1, respectively.
It seems to me that we need to capture more information about the network in order to solve this problem. The first step is to recognise that diffusion is hierarchical: peers in the same location will very quickly get the data, much quicker than diffusion takes across locations (cf. the latency of 12ms vs. 69ms or 268ms cited in the technical report). The second step is to quantify the graph in terms of size S and degree D. Let L1{small,large} be the short local latency and L2{small,large} be the latency between locations — the system can be extended to more tiers later. Let R=D-1 be the replication factor during diffusion steps.
Ignoring co-located (L1) peers for now and assuming a random graph between locations, the process may be approximated like so:
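For concreteness, here is my guess at the shape of that recursion, tracking only expected node counts per step instead of the full probability distributions (all numbers are invented); the cutoff plays the role of the pruning described next.

```python
# Expected-value sketch of the diffusion recursion; all parameters are invented.
S = 200          # nodes in the (far-away) graph
D = 8            # degree, so replication factor R = D - 1
R = D - 1
L2 = 0.069       # latency per far-away hop, in seconds
CUTOFF = 0.5     # stop when fewer than this many new nodes are expected

reached, frontier, step = 1.0, 1.0, 0
while frontier >= CUTOFF and reached < S:
    contacts = frontier * R
    # Occupancy estimate: each not-yet-reached node is hit by at least one of
    # the `contacts` uniformly random contacts with this probability.
    p_hit = 1.0 - (1.0 - 1.0 / S) ** contacts
    new = (S - reached) * p_hit
    reached += new
    frontier = new
    step += 1
    print(f"after {step} steps (~{step * L2 * 1000:.0f} ms): "
          f"expected {reached:.1f} of {S} nodes reached")
```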
This process continues as long as new nodes can be reached. Since theoretically lots of duplicates may happen, the probability distribution won’t reach certainty about reaching all S nodes soon, but we can apply a suitable cutoff for pruning the distribution functions that appear in the process, leading to steady progress of the lower bound on how many nodes should have been reached at a given step. This will end the recursion without having to specify other limits than the graph degree and size.
@dcoutts Is this roughly the kind of fix point you had in mind?
Now, coming back to the hierarchy, the above process may be added as a ΔQ operator, something like a gossip combinator, where `light` is the CDF for doing the part that allows deduplication (i.e. header transfer) and `heavy` is the CDF for performing the rest of the transfer (i.e. the block body).

The hierarchical approach then allows us to decompose the diffusion process into the far-away part followed by the co-located part; the latter starts at each location as it is reached and otherwise proceeds in parallel, thanks to how convolution works.

I don't have sufficient information on the actual network, but perhaps S_far=200 and S_near=12. From the design discussion it seems that the choice of "half random, half closeby" for peer selection results in D_far and D_near being roughly the same (because random is overwhelmingly far-away unless there are some really large locations). It should be obvious that the second Gossip step involves a much shorter timeframe than the first, so even if we pessimistically made S_near larger the result wouldn't be dramatically different.
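For concreteness, a sketch of what such an operator and the far/near composition might look like. The name gossip, the parameter values, and the use of single durations for `light` and `heavy` (instead of full delay distributions) are all my simplifications, not a proposed API.

```python
# Hypothetical gossip operator and its hierarchical composition; all names
# and numbers are placeholders of my own invention.

def gossip(light: float, heavy: float, size: int, degree: int, cutoff: float = 0.5):
    """Return a step CDF [(time, fraction_of_nodes_reached), ...] for one
    gossip tier, using the expected-value recursion sketched earlier."""
    R = degree - 1
    reached, frontier, t, steps = 1.0, 1.0, 0.0, []
    while frontier >= cutoff and reached < size:
        contacts = frontier * R
        new = (size - reached) * (1.0 - (1.0 - 1.0 / size) ** contacts)
        reached += new
        frontier = new
        t += light + heavy
        steps.append((t, reached / size))
    return steps

def seq(first, second):
    """Sequential composition of two step CDFs (their convolution)."""
    out, prev = {}, 0.0
    for t1, c1 in first:
        p1 = c1 - prev          # probability mass added at time t1
        prev = c1
        prev2 = 0.0
        for t2, c2 in second:
            p2 = c2 - prev2
            prev2 = c2
            key = round(t1 + t2, 6)
            out[key] = out.get(key, 0.0) + p1 * p2
    acc, cdf = 0.0, []
    for t in sorted(out):
        acc += out[t]
        cdf.append((round(t, 3), round(acc, 3)))
    return cdf

# Placeholder parameters: 200 far-away locations, 12 co-located peers each.
far = gossip(light=0.069, heavy=0.100, size=200, degree=8)
near = gossip(light=0.012, heavy=0.020, size=12, degree=8)
print("far followed by near:", seq(far, near))
```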
If any of this makes any sense, the hard part is finding the right formulae for the probability distributions; I'm a bit rusty with that right now.