Logblock is a software block device built on the concept of log-structured merge-trees (LSMs).
Logblock is inspired by how the Flash Translation Layer (FTLs) of an SSD works. But instead of NAND devices, we have blob storage systems, which are conceptually similar at the API level. By pushing the FTL up the stack, we can build block devices on top of abstract distributed storage systems, decoupling the block device from the underlying storage hardware. This is more CPU/memory instensive than simply mapping block ranges on a storage device to a logical block device, which is largely how traditional distributed block device are implemented. But decoupling from the storage hardware create a large amount of flexibility in terms of data allocation, replication, encoding, etc.
Currently, logblock is very basic and only supports storing data on the
local filesystem. The logblock binary is located in cmd/logblock
and can be
run as:
% ./logblock -size 64G /dev/nbd0 /path/to/data/storage/directory
% sudo mke2fs /dev/nbd0
% sudo mount /dev/nbd0 /mnt/test
Conceptually, a block device can be thought of as a key-value store (a very limited one). The key is a fixed 8-byte integer, which is the block index (or LBA), and the value is simply the block data. Therefore, one can trivially build a block device on top of their favourite key-value database.
This is not particularly useful for building a locally-attached block device (maybe except for simulating a very large device for testing). However, one can use a distributed KV-store to build scalable, fault-tolerent block devices for VMs and other applications.
Logblock goes one step further, and instead of building on top of a database, it leverages the properties of a block device (i.e. fixed-size keys and values) to build a more efficient solution.
As an LSM, writes are written to a write-ahead log. Once the log reaches a certain size, it is "compacted" into a sparse block format. A block mapping keeps track of which file contains which block. The compaction step also frees up space which has been overwritten by more recently written blocks.
Because it's a "Log"-structured "Block" device. Also, naming is hard.
Because I have an original logblock git repo, which is >5 years old. It has multiple iterations of the write-ahead log and sparse block format, various block mapping data structures, a "graveyard" of unused code, and a git history which is starting to look like https://xkcd.com/1296/. It was probably best to start a new repo as I open-source this project.
Honestly, it isn't. For one, this isn't a complete project, but rather a technical demo. It has too many missing features to be useful, not the least of which is integration with a blob/log storage backend.
- TRIM support
- Re-designed metadata model
- Storage backends (i.e. S3, HDFS, etc)
- Performance improvements
- More tests
- Alternate compaction strategies
Logblock is released under a BSD 3-Clause License.