This repo contains Rust implementations of Entropy coders , namely a Huffman Entropy coder and a Table Based Asymmetric numerical system.
Both of these are compression algorithms that work by using less bits to represent frequently occurring characters and more bits to represent less frequently occurring characters.
The library is very much in beta and some files may not work(like siliesa/mozilla
)
Do note that the Huffman encoder and decoder implemented here is not the best compression wise, you can achieve better compression ratio with Huffman encoders.
This is just meant to be a simple(and a fast) illustration of Huffman encoding and decoding.
Table Based Asymmetric Numerical systems stem from Jarek Duda's ANS and Yann Collet's work on Finite State Entropy.
It features better compression ratio(Closer to the Shannon Limit) as compared to Huffman, and it really shines with skewed distributions.
It does have an asymptomatically slower speeds than Huffman as it does more work per symbol than a Huffman Encoder/Decoder, this is visible in really aggressive out-of-order machines(see benchmarks) but is still a good replacement for it.
For running benchmarks of this library there is another library here
I'm going to compare it to Yann Collet's FSE library since they both do the same things extremely well and to also point out a few gotchas.
-
Model name: AMD Ryzen 5 4500U with Radeon Graphics
-
CPU family: 23
-
Model: 96
-
Thread(s) per core: 1
-
Core(s) per socket: 6
-
L1d: 192 KiB (6 instances)
-
L1i: 192 KiB (6 instances)
-
L2: 3 MiB (6 instances)
-
L3: 8 MiB (2 instances)
File | Library | Encoder | Decoder | Compression ratio |
---|---|---|---|---|
enwiki8 | ||||
Huffman | 940.97 MB/s | 1310.28 MB/s | 65.00% | |
TANS | 602.56 MB/s | 1022.93 MB/s | 63.90% | |
FSE/Huff0 | 924.6 MB/s | 1049.5 MB/s | 63.26% | |
FSE/TANS | 457.7 MB/s | 572.8 MB/s | 63.12% |
Note that FSE/* beats this library by compression ratio on both Huffman and TANS, it employs better and more sophisticated algorithms while I'm using the simplest ones I could come up with.
The TANS implementation is faster since this library employs more interleaving which pays off well
This couldn't have been made possible without the great works of
-
Jarek Duda - The inventor of ANS based algorithms, and collector, most of his work and others can be found here
-
Eric Biggers - xpack (lz77+tANS).
-
Yann Collet - FiniteStateEntropy,
-
Yann Collet - FSE algorithm blog posts
-
Charles Bloom - Understanding TANS blog series,
-
Fabian Geisen - Reading bits in far too many ways
-
Fabian Geisen - Interleaving rANS