FAQ
TerarkDB implements an SSTable with Terark's algorithms and data structures, and achieves much better random read performance and a much better compression ratio.
Yes, TerarkDB changes a small amount of RocksDB code:
- It adds a two-pass scan capability to reduce disk/SSD usage during Terark SSTable building (this does not impact existing SSTables).
- It adds TerarkDB configuration via environment variables (users can replace the official librocksdb.so with TerarkDB's librocksdb.so and set environment variables to enable TerarkDB's SSTable).
TerarkDB does not change any user API of RocksDB; from the user's perspective, TerarkDB is 100% compatible with RocksDB. This compatibility is at the ABI level, so user applications do not even need to be recompiled.
TerarkDB is compatible with MongoDB and MySQL through MyRocks and MongoRocks.
5. Since the core algorithm of TerarkDB is proprietary, how is it compliant with MongoDB's AGPL license?
- All Terark code related to MongoDB is open source (MongoDB itself and MongoRocks); none of it uses any TerarkDB code or API.
- The core of TerarkDB is a plug-in for RocksDB, so it is not applied directly to MongoDB -- it is loaded as a dynamic library by librocksdb.so, and TerarkDB is compliant with RocksDB's license.
- For keys, the peak memory during compression is about total_key_length * 1.3.
- For values, the memory usage during compression is about 18% of the input size, and never exceeds 12 GB.
To give an example, compressing 2 TB of data in which the keys total 30 GB:
- TerarkDB needs ~39 GB of RAM to compress the keys (into the index).
- TerarkDB needs ~12 GB of RAM to compress the values.
- Key compression and value compression run in parallel if memory is sufficient, and serially if it is not (see the sketch below).
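A minimal sketch of this rule of thumb; the 1.3 factor, the 18% ratio, and the 12 GB cap are the figures quoted above, while the function name and the parallel-case assumption are illustrative:

```cpp
#include <algorithm>
#include <cstdio>

constexpr double kGB = 1024.0 * 1024.0 * 1024.0;

// Peak memory estimate when keys and values are compressed in parallel.
double EstimatePeakMemoryGB(double total_key_bytes, double total_value_bytes) {
  double key_mem   = total_key_bytes * 1.3;                          // index building
  double value_mem = std::min(total_value_bytes * 0.18, 12.0 * kGB); // capped at 12 GB
  return (key_mem + value_mem) / kGB;
}

int main() {
  // The example above: 2 TB of data, of which keys are 30 GB.
  double keys = 30.0 * kGB;
  double values = 2048.0 * kGB - keys;
  std::printf("peak memory ~= %.0f GB\n", EstimatePeakMemoryGB(keys, values)); // ~51 GB
}
```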
Level compaction has larger write amplification, resulting in slower compression speed and a shorter SSD lifetime.
We also improved universal compaction compared to RocksDB's native version.
If TerarkDB uses level compaction, everything still works well.
8. TerarkDB is faster in random read and smaller in compressed data size, but what is the trade-off?
The compression speed is slower, and more memory is used during compression. This does not impact ongoing writes. SINGLE-thread compression speed is about 15~50 MB/s, so multiple threads are used for compression.
TerarkDB has nothing to do with locking; an SSTable in RocksDB is itself read-only after compaction and write-only during compaction, so we did not make any changes to locking.
RocksDB's SSTables will not be that large.
TerarkDB supports data of such huge size, but each SSTable is not necessarily that large; a single SSTable is hundreds of GB at most.
It depends. Usually the max number of keys in an SSTable cannot exceed 1.5 billion, while the total length of all keys can be much larger (practically, 30 GB should be safe).
For values, the total length of compressed values cannot exceed 128 PB (practically, this limit will never be hit), and the number of values is the same as the number of keys.
Users are unlikely to generate such a large single SSTable and hit the limit; if they do, TerarkDB finishes the current SSTable and starts a new one.
It depends on the data set and memory limit.
On TPC-H lineitem data with unlimited memory (row len 615 bytes, key len 23 bytes, text field length 512 bytes), for a SINGLE thread:
- For keys, ~10 MB/s on a Xeon 2630 v3, up to ~80 MB/s on other data sets.
- For values, ~500 MB/s on a Xeon 2630 v3, up to ~7 GB/s on other data sets.
As a general rule, the tables below show the relative magnitudes of read speed:
- When memory is slightly limited (all data fits in memory for TerarkDB, but not for other DBs)

| DB | Read mode | Speed | Description |
|---|---|---|---|
| TerarkDB | sequential | 8000 | all data exactly in memory, no page cache miss, CPU cache & TLB miss is medium |
| TerarkDB | random | 4000 | CPU cache & TLB miss is heavy |
| Other DB | sequential | 10000 | hot data always fits in cache |
| Other DB | random | 10 | no hot data, heavy cache miss |
- When memory is extremely limited

| DB | Read mode | Speed | Description |
|---|---|---|---|
| TerarkDB | sequential | 5000 | hot data always fits in cache |
| TerarkDB | random | 100 | no hot data, heavy cache miss |
| Other DB | sequential | 10000 | hot data always fits in cache |
| Other DB | random | 1 | no hot data, very heavy cache miss |
It depends on the data set.
On TPC-H lineitem data (row len 615 bytes, key len 23 bytes, text field length 512 bytes), for a SINGLE thread:
- For keys, ~9 MB/s on a Xeon 2630 v3, up to ~90 MB/s on other data sets.
- For values, ~40 MB/s on a Xeon 2630 v3, up to ~120 MB/s on other data sets.
It depends on the data set.
On Wikipedia data (all English text, ~109 GB), the data is compressed into ~23 GB.
On TPC-H lineitem data of ~550 GB, the keys total ~22 GB and the values ~528 GB. The average row length is 615 bytes, of which the key is 23 bytes and the value is 592 bytes (the configurable text field being 512 bytes):
- All keys are compressed into ~5.3 GB.
- All values are compressed into ~24 GB (so the SSTable size is ~29.3 GB).
- For keys, < 80 bytes fits best, though > 80 bytes still works.
- For values, between 50 bytes and 30 KB fits best, though other lengths also work.
The high-level pseudo code:

```cpp
int id = indexStore.find(key); // indexStore (CO-Index) is mmap'ed
if (id >= 0) {
    // metadata is mmap'ed; content is read via mmap, pread, or a user cache
    value = valueStore.get(id); // valueStore is PA-Zip
}
```
17. Have you run any benchmarks comparing just the PA-Zip algorithm with other standard compression algorithms (particularly LZ77-based ones) for parameters such as compression ratio and compression/decompression performance?
If we ignore the Point Accessible feature and only compare compression capability: the metrics vary across data sets, but here is a rough comparison (on Amazon movie data):

| Algorithm | Compression ratio | Compression speed | Decompression speed | Point Accessible |
|---|---|---|---|---|
| PA-Zip | 5x | 3x | 50x | yes |
| bzip2 | 5x | 1x | 1x | no |
| gzip | 3x | 3x | 3x | no |
| snappy | 2x | 15x | 25x | no |
| zstd | 3x | 10x | 15x | no |
18. How do the compression ratio and performance compare to (for example) using something like Zstandard, training it on a large portion of a given table, and using it to compress each row?
We have benchmarked zstd in training mode; the trained dictionary is too small and cannot be made large enough. zstd's compression in this mode (compressing each record with the pre-trained dictionary) is much slower than PA-Zip, and the compression ratio is worse.
Almost all traditional compression algorithms support this kind of "training mode", and RocksDB also supports it (it just samples fixed-size fragments as the dictionary): set `rocksdb::CompressionOptions::max_dict_bytes` (in `options.h`) to a nonzero value indicating the maximum per-file dictionary size.
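A minimal sketch of that RocksDB setting, assuming the zstd compression type (the 16 KB size is illustrative, not a recommendation):

```cpp
#include <rocksdb/options.h>

rocksdb::Options MakeOptionsWithDictCompression() {
  rocksdb::Options options;
  options.compression = rocksdb::kZSTD;
  // Maximum per-file dictionary size, sampled from the file's data blocks.
  options.compression_opts.max_dict_bytes = 16 * 1024;
  return options;
}
```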
For decompression (Point Access), zstd is also much slower than PA-Zip.
PA-Zip is dedicated to Point Access; we haven't found any other algorithm with this feature.
Yes, there are features PA-Zip does not support. For example:

| Question | Answer |
|---|---|
| Does it support regex search? (I don't mean searching all records one by one!) | PA-Zip does not support regex search. CO-Index supports regex search! |
| Is it only addressable ("seekable extract functionality")? | Yes! It just does what it does well: Point Access, i.e. extracting a record by an integer id. |
20. In your estimate, how much of the improved performance and compression ratio is due to CO-index and how much is due to PA-Zip?
It depends: if the values are relatively small, CO-Index contributes more; if the values are relatively large, PA-Zip contributes more. As a general perspective (single thread):
- Point Search on CO-Index: ~20 MB/s; with an average key length of 20 bytes, that is ~1M QPS.
- Point Access on PA-Zip: ~500 MB/s; with an average value length of 500 bytes, that is ~1M QPS.
- Combining the CO-Index search and the PA-Zip access above, the overall throughput is about 500K QPS (see the sketch below).
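The arithmetic behind those numbers, as a small sketch; the throughput figures are the ones quoted above, and the harmonic combination assumes each lookup performs both steps back to back:

```cpp
#include <cstdio>

int main() {
  // Figures quoted above, single thread.
  double index_mbps = 20.0,  avg_key_bytes   = 20.0;   // CO-Index point search
  double store_mbps = 500.0, avg_value_bytes = 500.0;  // PA-Zip point access

  double index_qps = index_mbps * 1e6 / avg_key_bytes;    // ~1M QPS
  double store_qps = store_mbps * 1e6 / avg_value_bytes;  // ~1M QPS

  // A full point lookup does both steps in sequence, so the rates combine harmonically.
  double combined_qps = 1.0 / (1.0 / index_qps + 1.0 / store_qps);  // ~500K QPS
  std::printf("index: %.0f, store: %.0f, combined: %.0f QPS\n",
              index_qps, store_qps, combined_qps);
}
```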
Technically, running many DB instances on a single machine is a very bad design; no DB can handle such cases efficiently. This is especially true for TerarkDB: it is highly optimized for big database instances, and is likely worse than other DBs on very small instances (e.g. smaller than 1 GB).
The real requirement behind using many DB instances is to create namespaces, and a good solution is to use a `PrefixID`: databases based on RocksDB (such as MyRocks and MongoRocks) use a `PrefixID` to emulate DB namespaces, and the `PrefixID` schema can handle millions of namespaces.
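A minimal sketch of the `PrefixID` idea; the helper name and the 4-byte big-endian encoding are assumptions for illustration (MyRocks and MongoRocks use their own encodings):

```cpp
#include <cstdint>
#include <string>

// Emulate a DB namespace by prefixing every key with a fixed-width namespace id.
// Big-endian encoding keeps all keys of one namespace contiguous in sort order.
std::string MakePrefixedKey(uint32_t namespace_id, const std::string& user_key) {
  std::string key;
  key.reserve(4 + user_key.size());
  key.push_back(static_cast<char>((namespace_id >> 24) & 0xFF));
  key.push_back(static_cast<char>((namespace_id >> 16) & 0xFF));
  key.push_back(static_cast<char>((namespace_id >> 8) & 0xFF));
  key.push_back(static_cast<char>(namespace_id & 0xFF));
  key.append(user_key);
  return key;  // store this as the RocksDB key; all namespaces share one DB instance
}
```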