This mostly houses decisions/thoughts on details not organised under another document.
This is just a checksum of the element's identifier (a u64 encoded to bytes as big-endian) followed by its data. The current checksum algorithm is BLAKE2b configured for 256 bits output.
Note: the checksum output size can easily be adjusted (BYTES
in
src/detail/sum.rs
— other values should work but are not often tested),
but doing so requries building new binaries and renders old data files
incompatible. 256 bits appears to be a common choice and a good compromise
between size and security.
A state sum is a checksum on a state (a commit or snapshot) of a partition. This serves both as an identifier and a means to securely identify the partition's data (the set of elements) and metadata (parent state(s), author, timestamp, moved elements, maybe more).
A checksum of the metadata is computed from the following stream of data:
- the partition identifier (u64 but lower 24 bits are zero, big-endian)
- the bytes
CNUM
- the commit number (big-endian u32)
- the commit's timestamp (UNIX time as big-endian i64)
- each parent's statesum, as ordered by the commit
- if present, the extra metadata byte-stream (without padding)
The state sum is the metadata checksum and element sums combined via bit-wise exclusive-or (XOR) operator (in any order).
A new partition should be started with a blank state: no parents, no elements, commit number 0; however the timestamp should be set to the time of initialisation and extra metadata can be included (e.g. comments or author).
Printing state sums and commit checksums: use base 16 (0-9, A-F) — hexadecimal. Optionally put a space between pairs of characters.
Like git, accept abbreviated sums as identifiers so long as these are unique within known history.
Element identifiers are a unique u64
(unsigned 64-bit number) and also serve
to identify the element's partition.
The first 40 bits (high part) of the number is allocated to a partition identifier, and the low 24 bits (more than 16 million numbers) are allocated to an element identifier within the partition.
(Example: a partition might use 123 as the high part. A new element can be assigned any 32-bit number unique within the partition, e.g. 5. The element identifier would then be 123 × (2^40) + 5.)
The low 24 bits can be assigned in any way such that they are unique within the partition. A good strategy might be to randomly sample a number distributed evenly throughout the range, then increment by one until a unique number is found.
Each partition is assigned a range of partition identifiers; the initial "partition" in a repository gets the whole available range.
The partition uses the smallest number within its range when assigning new element identifiers.
When a partition is partitioned, the range is first adjusted to exclude the smallest number (which has already been used), then is divided equally between the new partitions, which get (roughly) equally-sized non-overlapping ranges. The old partition will not be used anymore (unless possibly the new partitions are merged back together).
Should a partition run out of new numbers for partitioning, another strategy is possible: find a little-used partition with more numbers than it needs, and steal some of its range. The details for this are yet to be defined.
The range (0, 0) is a special case used in the library when loading data. This range should never be used otherwise.
See identifiers section under partitions, below.
A repository must have a name specified when created (UTF-8; not more than 16 bytes long). This serves (1) as an identifier to check that partitions come from the same repo, and (2) as a user description of the repo. It is visible near the top of each repository file in plain text (assuming no compression).
Support cloning by straight copy of data files? Ideally yes, but it means there is no opportunity to introduce a fresh "clone" identifier.
Pulling in commits from a remote copy (or possibly even committing locally) can result in multiple tip states which then require merging. Unlike git, such states are found without needing to keep track of branches. The compromise is that a partition cannot be used until merging is complete (not even retrieving elements).
This operation reduces the size of the historical log by reducing essentially to a few snapshots in history. Elements not mentioned in these snapshots will be forgotten completely. The purpose is to allow user-controlled partial deletion of history.