Technical design of sync-sets

This mostly houses decisions/thoughts on details not organised under another document.

Checksums

Element sum

This is just a checksum of the element's identifier (a u64 encoded to bytes as big-endian) followed by its data. The current checksum algorithm is BLAKE2b configured for 256 bits output.

Note: the checksum output size can easily be adjusted (BYTES in src/detail/sum.rs — other values should work but are not often tested), but doing so requries building new binaries and renders old data files incompatible. 256 bits appears to be a common choice and a good compromise between size and security.

State sums

A state sum is a checksum on a state (a commit or snapshot) of a partition. This serves both as an identifier and a means to securely identify the partition's data (the set of elements) and metadata (parent state(s), author, timestamp, moved elements, maybe more).

A checksum of the metadata is computed from the following stream of data:

the partition identifier (u64 but lower 24 bits are zero, big-endian)
the bytes CNUM
the commit number (big-endian u32)
the commit's timestamp (UNIX time as big-endian i64)
each parent's statesum, as ordered by the commit
if present, the extra metadata byte-stream (without padding)

The state sum is the metadata checksum and element sums combined via bit-wise exclusive-or (XOR) operator (in any order).

A new partition should be started with a blank state: no parents, no elements, commit number 0; however the timestamp should be set to the time of initialisation and extra metadata can be included (e.g. comments or author).

Displaying checksum identifiers

Printing state sums and commit checksums: use base 16 (0-9, A-F) — hexadecimal. Optionally put a space between pairs of characters.

Like git, accept abbreviated sums as identifiers so long as these are unique within known history.

Element and partition identifiers

Element identifiers are a unique u64 (unsigned 64-bit number) and also serve to identify the element's partition.

The first 40 bits (high part) of the number is allocated to a partition identifier, and the low 24 bits (more than 16 million numbers) are allocated to an element identifier within the partition.

(Example: a partition might use 123 as the high part. A new element can be assigned any 32-bit number unique within the partition, e.g. 5. The element identifier would then be 123 × (2^40) + 5.)

Element identifier assignment

The low 24 bits can be assigned in any way such that they are unique within the partition. A good strategy might be to randomly sample a number distributed evenly throughout the range, then increment by one until a unique number is found.

Partition identifier assignment

Each partition is assigned a range of partition identifiers; the initial "partition" in a repository gets the whole available range.

The partition uses the smallest number within its range when assigning new element identifiers.

When a partition is partitioned, the range is first adjusted to exclude the smallest number (which has already been used), then is divided equally between the new partitions, which get (roughly) equally-sized non-overlapping ranges. The old partition will not be used anymore (unless possibly the new partitions are merged back together).

Should a partition run out of new numbers for partitioning, another strategy is possible: find a little-used partition with more numbers than it needs, and steal some of its range. The details for this are yet to be defined.

The range (0, 0) is a special case used in the library when loading data. This range should never be used otherwise.

Snapshot and commit log file names

See identifiers section under partitions, below.

Repository name

A repository must have a name specified when created (UTF-8; not more than 16 bytes long). This serves (1) as an identifier to check that partitions come from the same repo, and (2) as a user description of the repo. It is visible near the top of each repository file in plain text (assuming no compression).

Cloning

Support cloning by straight copy of data files? Ideally yes, but it means there is no opportunity to introduce a fresh "clone" identifier.

Branching?

Pulling in commits from a remote copy (or possibly even committing locally) can result in multiple tip states which then require merging. Unlike git, such states are found without needing to keep track of branches. The compromise is that a partition cannot be used until merging is complete (not even retrieving elements).

Compaction

This operation reduces the size of the historical log by reducing essentially to a few snapshots in history. Elements not mentioned in these snapshots will be forgotten completely. The purpose is to allow user-controlled partial deletion of history.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

technical-decisions.md

technical-decisions.md

Technical design of sync-sets

Checksums

Element sum

State sums

Displaying checksum identifiers

Element and partition identifiers

Element identifier assignment

Partition identifier assignment

Snapshot and commit log file names

Repository name

Cloning

Branching?

Compaction

Files

technical-decisions.md

Latest commit

History

technical-decisions.md

File metadata and controls

Technical design of sync-sets

Checksums

Element sum

State sums

Displaying checksum identifiers

Element and partition identifiers

Element identifier assignment

Partition identifier assignment

Snapshot and commit log file names

Repository name

Cloning

Branching?

Compaction