Benchmark compression algorithms with L1 and L2 example data #233
Replies: 7 comments
-
Preliminary benchmarking from the 22k corpus
-
More bench results
-
My summary of the compression benchmarking is as follows: zlib is pure Go, which should make compiling to MIPS easier. The other contender is Brotli; I believe the option I selected as the default is tuned for a higher compression ratio, and it appears to be much slower (though it's running a Go implementation because I could not get the cgo one to work). Using a dictionary has a small benefit (there's a benefit even with a small dictionary), but it's not as large as I expected. The remaining questions are about compression/decompression speed and how to meter gas (I believe it will be hard to measure each transaction's effect on the total compressed size).
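For reference, a minimal sketch of the kind of ratio measurement discussed here, using only the stdlib zlib writer. The input path and the choice of `BestCompression` are illustrative assumptions, not the thread's actual benchmark tool:

```go
// Minimal compression-ratio measurement sketch (inputs are illustrative).
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"log"
	"os"
)

func main() {
	// Hypothetical file of example batch data.
	data, err := os.ReadFile("batch.bin")
	if err != nil {
		log.Fatal(err)
	}

	var buf bytes.Buffer
	w, err := zlib.NewWriterLevel(&buf, zlib.BestCompression)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := w.Write(data); err != nil {
		log.Fatal(err)
	}
	if err := w.Close(); err != nil {
		log.Fatal(err)
	}

	fmt.Printf("raw=%d compressed=%d ratio=%.3f\n",
		len(data), buf.Len(), float64(buf.Len())/float64(len(data)))
}
```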
-
What is your worry here? That the implementations are not actually correct with respect to the Go semantics, but work because they're compiled to x64 and not to MIPS? Or a compiler bug giving slightly different results?
Very true. But what about trying to compress in the current state and reporting that as the cost? The pitfall here is that if their block lands in a later batch, the cost might be higher than the reported one (but that's a general risk even with basefee and such). Another remark/question: you've trained the algorithms on a subset (2.5k) and the full (22k) corpus. But training on the whole corpus and then running the compression on that same corpus will produce an optimal, overfitted outcome. What about training on half the corpus and then trying the compression on the other half? That gives you a ~10k corpus and no overfitting.
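For illustration, a sketch of that split evaluation in Go; `trainDictionary` and `compressWithDict` are hypothetical stand-ins for whatever trainer and dictionary-aware compressor are being benchmarked (e.g. zstd's):

```go
// Train a dictionary on one half of the corpus, measure compression
// ratio on the other half, so the dictionary cannot overfit the test set.
package compbench

import "math/rand"

func evaluate(corpus [][]byte,
	trainDictionary func([][]byte) []byte,
	compressWithDict func(data, dict []byte) []byte) float64 {

	// Shuffle so the split is not biased by corpus ordering.
	rand.Shuffle(len(corpus), func(i, j int) {
		corpus[i], corpus[j] = corpus[j], corpus[i]
	})

	half := len(corpus) / 2
	train, test := corpus[:half], corpus[half:]

	dict := trainDictionary(train)

	var raw, compressed int
	for _, tx := range test {
		raw += len(tx)
		compressed += len(compressWithDict(tx, dict))
	}
	return float64(compressed) / float64(raw)
}
```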
-
It's just the effort of cross compiling, plus more of a concern about platform-dependent code (zstd actually has a test suite to make sure the result is the same across a bunch of platforms, including MIPS).
It depends on the API of the compression algorithm, but generally flushing the in-flight data multiple times, rather than waiting to flush until the end, reduces the compression efficacy (the algorithm does best when it operates over the full data). I have some ideas on how to estimate the impact, but nothing that works for online processing — see the sketch below.
Good idea. One note is that the subset actually did better than the full corpus, but figuring out how to do this properly is worth doing; I was going for something quick and dirty to understand the ballpark benefit. Also note that zstd recommends 100s-1000s of files to train the dictionary on.
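To make the flushing point concrete, a small sketch with the stdlib zlib writer; the transaction payloads are made up. `Flush` performs a sync flush, so `buf.Len()` after each flush gives a running per-transaction size, at the cost of extra block overhead:

```go
// Compare per-transaction flushes against a single flush at the end.
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
)

func compressedSize(txs [][]byte, flushEach bool) int {
	var buf bytes.Buffer
	w := zlib.NewWriter(&buf)
	for _, tx := range txs {
		w.Write(tx)
		if flushEach {
			// Sync flush: the pending block is emitted, so buf.Len()
			// now reflects this tx's marginal contribution.
			w.Flush()
		}
	}
	w.Close()
	return buf.Len()
}

func main() {
	txs := [][]byte{
		bytes.Repeat([]byte("transfer"), 100),
		bytes.Repeat([]byte("transfer"), 100),
	}
	fmt.Println("flush per tx:", compressedSize(txs, true))
	fmt.Println("flush at end:", compressedSize(txs, false))
}
```

Running both modes on the same input shows the overhead directly; the gap grows with the number of flush points.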
-
Copied from other thread
-
Wow, that's interesting!
-
There is prior discussion here: #10
Corpus Preparation
Compression Benchmark tool
Compression Algorithms
The Go standard library provides pure-Go DEFLATE-based implementations (compress/flate, compress/zlib, compress/gzip).
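A hypothetical shape for such a benchmark tool, using the testing package against stdlib flate; the corpus here is stand-in data, not the real L1/L2 examples:

```go
// Benchmark stdlib flate at different levels over a fixed corpus.
package bench

import (
	"bytes"
	"compress/flate"
	"testing"
)

// Stand-in data; a real run would load the example batch corpus.
var corpus = bytes.Repeat([]byte("example calldata "), 1<<12)

func benchLevel(b *testing.B, level int) {
	b.SetBytes(int64(len(corpus)))
	for i := 0; i < b.N; i++ {
		var buf bytes.Buffer
		w, _ := flate.NewWriter(&buf, level)
		w.Write(corpus)
		w.Close()
	}
}

func BenchmarkFlateDefault(b *testing.B) { benchLevel(b, flate.DefaultCompression) }
func BenchmarkFlateBest(b *testing.B)    { benchLevel(b, flate.BestCompression) }
```

Run with `go test -bench . -benchmem` to get throughput alongside allocation counts.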