
Streaming Merkle root computation in the Renter-Host Protocol #29

lukechampine opened this issue Feb 6, 2020 · 8 comments

@lukechampine
Owner

Currently, we encrypt all messages in the RHP with ChaCha20-Poly1305. This means that sector data is encrypted twice. The encryption itself adds some overhead, but not enough to worry about. However, treating the sector as an opaque message introduces a bottleneck in the protocol: we can't start computing the Merkle root of the sector until we've received and verified the entire message!

This affects both the renter and the host: when uploading, the host needs to wait for the entire sector to arrive before it can compute its Merkle root; when downloading, the same applies to the renter. Imagine you have 1 Gbps of bandwidth, and computing a Merkle root takes 20ms. The total transfer time for a sector is now (4 MiB / 1 Gbps = 33ms) + 20ms = 53 ms, which means your throughput is 4 MiB / 53ms = 633 Mbps. But if the Merkle root could be computed in streaming fashion as the sector arrived, the total transfer time would be much closer to 33ms. So we're looking at a potential 58% speed-up here!
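To sanity-check that arithmetic, here's a quick back-of-the-envelope calculation (the 1 Gbps link and 20ms hash time are the assumed figures from above, not measurements):

package main

import "fmt"

func main() {
    const (
        sectorBits = 4 * (1 << 20) * 8 // 4 MiB sector, in bits
        linkBps    = 1e9               // assumed 1 Gbps link
        hashSecs   = 0.020             // assumed 20ms Merkle root computation
    )
    transfer := sectorBits / linkBps // ≈ 33.6ms on the wire
    // Sequential: receive the whole sector, then hash it.
    fmt.Printf("sequential: %.0f Mbps\n", sectorBits/(transfer+hashSecs)/1e6) // ≈ 627 Mbps
    // Streaming: hash while the sector arrives, so only the transfer time counts.
    fmt.Printf("streaming:  %.0f Mbps\n", sectorBits/transfer/1e6) // ≈ 1000 Mbps
}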

But -- how do we accomplish this? At first I thought we would need to add a new RPC that sends the sector unencrypted. But on further thought, I think we can actually pull it off without changing the protocol at all! Here's why. ChaCha20-Poly1305 just means encrypting with ChaCha20 (a stream cipher) and appending a Poly1305 tag. You don't need to verify the tag before decrypting the data; you're just supposed to. In our case, we're not doing anything dangerous with the plaintext; we're just computing its Merkle root. So it's fine if we do a streaming decryption, pipe that into our Merkle root function, and verify the tag in parallel. (Actually, I think verifying the tag may be unnecessary too, since the root itself serves as a tag.)
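To illustrate the idea (just a sketch under assumptions, not the actual RHP code: rootHasher stands in for whatever Merkle-root accumulator we end up with, and tag verification is left to the caller), the receiving side could look roughly like this:

import (
    "crypto/cipher"
    "io"

    "golang.org/x/crypto/chacha20"
)

// streamSectorRoot stream-decrypts an incoming sector with ChaCha20 and feeds
// the plaintext into rootHasher as it arrives, rather than buffering and
// verifying the whole message first. The Poly1305 tag (or the root itself,
// acting as the tag) gets checked after the copy completes.
func streamSectorRoot(conn io.Reader, key, nonce []byte, rootHasher io.Writer) error {
    c, err := chacha20.NewUnauthenticatedCipher(key, nonce)
    if err != nil {
        return err
    }
    c.SetCounter(1) // keystream block 0 is reserved for the Poly1305 key
    dec := cipher.StreamReader{S: c, R: io.LimitReader(conn, 1<<22)} // one 4 MiB sector
    _, err = io.Copy(rootHasher, dec)
    return err
}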

@lukechampine lukechampine self-assigned this Feb 6, 2020
@MeijeSibbel

That's a novel idea and definitely worth exploring. Any speed-up during transfers is very welcome.

@jkawamoto
Contributor

Looks nice!

I also agree that we don't need to verify the Poly1305 tag as long as verifying the Merkle root succeeds.

@lukechampine
Owner Author

lukechampine commented Feb 8, 2020

Some initial results on this:

I implemented "sector streaming" for downloads, since it was slightly easier than uploads. To my dismay, benchmarks showed no significant increase in throughput: when downloading from a single host on the same machine, downloads run at about 800 Mbps with or without streaming.

I dug a little deeper, and realized that, since the benchmark was conducted locally, the effective network bandwidth was 6 Gbps! At that speed, computing the Merkle root in parallel will only save a few milliseconds. So, to more accurately simulate real network conditions, I introduced artificial slowdowns in the connection. I tested artificial speeds of 100 Mbps, 500 Mbps, 1 Gbps, and 2 Gbps. Results are summarized below:

                100 Mbps   500 Mbps   1 Gbps     2 Gbps     6 Gbps
No streaming    64 Mbps    240 Mbps   384 Mbps   448 Mbps   800 Mbps
Streaming       72 Mbps    344 Mbps   672 Mbps   752 Mbps   800 Mbps

This table is much more gratifying. We can see that sector streaming is about 1.5x faster than non-streaming, except at the extremes. This makes sense: if the network is very slow, then the Merkle root overhead (~20ms) does not add much to the total transfer time. Conversely, if the network is very fast, then the Merkle root begins to dominate the total transfer time, making it irrelevant whether we stream or not.
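(For reference, the artificial slowdown can be simulated with a throttled reader along these lines; this is only a sketch, not the harness I actually used.)

import (
    "io"
    "time"
)

// throttledReader caps an io.Reader at roughly bps bits per second by
// sleeping after each Read so that n bytes take at least n*8/bps seconds.
type throttledReader struct {
    r   io.Reader
    bps float64
}

func (t *throttledReader) Read(p []byte) (int, error) {
    start := time.Now()
    n, err := t.r.Read(p)
    want := time.Duration(float64(n*8) / t.bps * float64(time.Second))
    if elapsed := time.Since(start); elapsed < want {
        time.Sleep(want - elapsed)
    }
    return n, err
}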

Next, I will implement sector streaming for uploads. I expect to achieve similar results there; currently, on a 1 Gbps connection, the maximum upload speed is about 330 Mbps (as evidenced by local testing with ghost), so I estimate that we could increase that to about 600-700 Mbps with sector streaming.

@lukechampine
Owner Author

It occurred to me that, with this change, hashing is the last remaining computational bottleneck. With a faster hash function, we should be able to dramatically increase maximum throughput. So, for fun, I swapped out our BLAKE2b Merkle root calculation for calls to XChaCha20, which should be about the same speed as SIMD-optimized BLAKE3. With that change, I was able to achieve download speeds of 2400 Mbps. 🔥

Unfortunately, we can't switch to BLAKE3 without breaking compatibility, and even if we wanted to, there's no SIMD-optimized implementation for Go. It might, however, be possible to further optimize BLAKE2b for our purposes. It won't be nearly as fast, but I imagine we could get a 2x-4x speedup in hashing, which should be enough to get us over 1 Gbps per host.

@MeijeSibbel

1 Gbps per host sounds amazing and is probably plenty, since most hosts don't have home connections faster than 1 Gbps. That said, what has to change on the host side to reach that speed? Currently the host code limits throughput significantly (~40 Mbps). I'm also wondering what other latency effects might severely impact this optimization.

@lukechampine
Owner Author

Hosts are currently limited by their own non-streaming Merkle root computations, as well as the 500ms sync interval that we discovered previously. It will likely be easier to achieve 1 Gbps when downloading, as no fsyncs are required and (if the renter is requesting a full sector) no Merkle proof needs to be computed. I'll try benchmarking it later.

@lukechampine
Owner Author

Update: I decided to try my hand at writing optimized assembly to compute Merkle roots. After much wailing and gnashing of teeth, I have emerged from the darkness with an AVX2 implementation that can compute the Merkle root of a sector in just 2 milliseconds -- 10x faster than Sia today.

There's just one catch...my implementation computes the BLAKE3 Merkle root, not the BLAKE2b Merkle root. Sorry. BLAKE3, for various reasons, is easier to write assembly for than BLAKE2b. However, BLAKE3 and BLAKE2b have very similar designs (big surprise), so I now feel more confident that I could write an optimized BLAKE2b Merkle root function. BLAKE2b is slower than BLAKE3, so I probably won't get it down to 2 milliseconds, but it'll be closer to 2 than 20.

@lukechampine
Owner Author

Success!! I now have an AVX2 BLAKE2b Merkle root implementation that runs in 5ms, a full 4x speedup over the status quo. As expected, not quite 2ms, but close enough. 🔥 🔥 🔥

The crazy thing is that this function actually runs faster than the following code:

blake2b.Sum256(sector[:])
blake2b.Sum256(sector[:])

i.e. running the stdlib blake2b function (which is also AVX2-optimized) on the same amount of data that the Merkle root function has to process (the tree hashes roughly 2x the sector's data, once at the leaves and again across the interior nodes). This is possible because when computing Merkle roots, we know that 63 bytes of each 128-byte block will be 0, so we can skip them entirely.
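Concretely, that zero padding comes from the node layout (a sketch; the 0x01 prefix below follows Sia's interior-node convention, with 0x00 used for leaves): every hash input, leaf or interior, is 65 bytes, a 1-byte prefix plus either 64 bytes of leaf data or two 32-byte children, so each 128-byte BLAKE2b block ends in 63 zero bytes that the optimized routine never has to load.

import "golang.org/x/crypto/blake2b"

// Sketch of the interior-node hash. The 65-byte input occupies a single
// 128-byte BLAKE2b block, leaving 63 bytes of zero padding per compression.
func nodeHash(left, right [32]byte) [32]byte {
    buf := make([]byte, 0, 65)
    buf = append(buf, 0x01)        // interior-node prefix (0x00 for leaves)
    buf = append(buf, left[:]...)  // left child
    buf = append(buf, right[:]...) // right child
    return blake2b.Sum256(buf)     // one compression of one block
}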

Another nice thing is that my implementation is written to scale with the size of the SIMD vectors. So it would not be too arduous to write an AVX-512 version that yields yet another 2x speedup (i.e. ~2.5ms).

The only bummer is that this is platform-specific code, so it will only benefit users running amd64. People running Sia on ARM or other archs won't see any speedup, at least until someone ports the SIMD code to their platform.

I'll open a PR to merge this sometime next week. 🎉
