Please clarify hyperdrive metadata "children"/"trie" encoding #92

bnewbold · 2017-10-30T02:48:26Z

I've been working on a from-scratch implementation of dat/hyperstuff/sleep following the whitepaper/spec. One place i've run in to trouble is understanding directory structures are supposed to be encoded in the metadata register (the "Filename Resolution" section of the whitepaper).

A first small question about indexing conventions: in the whitepaper examples are given pointing to metadata entry index [0], but by convention this is always the Header/Index message (containing the public key of the paired content register). I assume this means entry numbers should be translated on lookup (aka, add one to the "hyperdrive index" to get the "hypercore/SLEEP index"), which is the convention i'll use below. I'll note that the dat log output does not do this translation though (starts with 1 index).

To clarify questions around the children/trie encoding, here's an example directory tree:

.
├── Animalia
│   └── Chordata
│       └── Mammalia
│           └── Carnivora
│               ├── Caniformia
│               │   └── Mustelidae
│               │       └── Lutrinae
│               │           └── Enhydra
│               │               └── E_lutris.txt
│               └── Feliformia
│                   └── Felidae
│                       └── Felis
│                           └── F_silvestris.txt
├── datapackage.json
├── Fungi
│   └── Basidiomycota
│       └── Agaricormycetes
│           └── Cantharellales
│               └── Cantharellaceae
│                   └── Cantharellus
│                       ├── C_appalachiensis.txt
│                       └── C_cibarius.txt
└── README.md

Assume these were added in order "README.md", "E_lutris.txt", "F_silvestris.txt", "C_cibarius.txt", "C_appelachiensis.txt". What should the entries look like?

0: name=/README.md
    children=[]
    stat=<protobuf>
1: name=/Animalia/Chordata/Mammalia/Carnivora/Caniformia/Mustelidae/Lutrinae/Enhydra/E_lutris.txt
    children=[[0]]
    stat=<protobuf>
2: name=/Animalia/Chordata/Mammalia/Carnivora/Feliformia/Felidae/Felis/F_silvestris.txt
    children=[[0]]
    stat=<protobuf>
3: name=/Fungi/Basidiomycota/Agaricormycetes/Cantharellales/Cantharellaceae/Cantharellus/C_cibarius.txt
    children=[[0]]
    stat=<protobuf>
4: name=/Fungi/Basidiomycota/Agaricormycetes/Cantharellales/Cantharellaceae/Cantharellus/C_appalachiensis.txt
    children=[[0], [], [], [], [], [], [3]]
    stat=<protobuf>

This clearly can't be correct, because there's no way to go from [4] to find any of the Animalia entries. Should [4]'s children look like [[0,1,2], [], [], [], [], [], [3]]? The paper says:

The children field represents a shorthand way of declaring which other files at every level of the directory hierarchy exist alongside the file being added at that revision.

Does that mean that every single file in the whole drive/archive is pointed to from every appended entry? Or is that the case if there are, eg, 100k files in the root directory, in which case an appended entry would be at least ~100 KB, even with the compressed encoding?

Later "F_silvestris.txt" is edited and updated, then "datapackage.json" gets added. What does it's node look like? Given this [6] node, how do you look up the most recent version of "F_silvestris.txt" (node [5]), without getting confused with [2]? Also, what about deletions?

The current state of the whitepaper looks like it has an early version of multi-writer protocol stuff added, but isn't complete. An older revision seemed more complete and is what I followed in trying to be compatible with stable dat (i'm running 13.8.1).

I took a look at the implementation in https://github.com/mafintosh/append-tree/blob/master/index.js, but there are no comments and i'm not familiar enough with javascript for it to be enlightening.

The text was updated successfully, but these errors were encountered:

bnewbold · 2017-11-05T02:05:21Z

I did some reverse engineering, and now speculate that:

the children sub-arrays for a given entry includes entries for all files and subdirectories for every parent directory of the path; they do not necessarily contain entries for all files in the drive.
sub (child) directories at a given path are included, with the newest file at any depth of the directory (which could include this entry itself). So, for example, every new file at any depth will have as it's first subarray the list of all files in the root directory, and one file for each subdirectory of the root directory. this clarifies how recursive lookup works.
there is a preceding header varint, which currently has just a single bit, representing "endsWithSeq". If this flag is set, the current entry index is presumed to be appended to every sub-array. This flag seems to always be set, even for deletions.
unlike in the whitepaper examples, filenames themselves count as path entries. So, eg, "/README.md" has two path subarrays (one for "/", one for "README.md").
the convention is to refer to entries by their register index. The first register entry is the encoded content register public key, so the first drive/directory entry has index 1.

bnewbold · 2018-01-03T09:45:38Z

In the current protocol, endsWithSeq is always set for additions, and never set for deletions.

bnewbold mentioned this issue Mar 30, 2018

document metadata.ogd #114

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Please clarify hyperdrive metadata "children"/"trie" encoding #92

Please clarify hyperdrive metadata "children"/"trie" encoding #92

bnewbold commented Oct 30, 2017 •

edited

Loading

bnewbold commented Nov 5, 2017

bnewbold commented Jan 3, 2018

Please clarify hyperdrive metadata "children"/"trie" encoding #92

Please clarify hyperdrive metadata "children"/"trie" encoding #92

Comments

bnewbold commented Oct 30, 2017 • edited Loading

bnewbold commented Nov 5, 2017

bnewbold commented Jan 3, 2018

bnewbold commented Oct 30, 2017 •

edited

Loading