
Universal container format based on progressive specialization #23

Open
rotemdan opened this issue Sep 27, 2018 · 4 comments

@rotemdan

rotemdan commented Sep 27, 2018

[This is a work-in-progress draft design which has been heavily edited since it was first published]

This is an attempt at designing a highly flexible, yet compact, multipurpose container format that can function as a content/entity identifier, a file header, part of a protocol message, or even as a self-contained carrier of both metadata and data.

Basically, there's a very simple underlying concept here: successive type enumerations can be used to progressively "namespace" into more and more specialized contexts describing more fine-grained information. Note that these type enumerations don't have to be limited to built-in fields (like entity domain or schema version) -- they can be dynamically inferred from fields whose semantics are progressively refined by the schema itself (somewhat like a state machine).

(This is mostly an illustrative example of how such a format could be designed, but I did put a lot of thought into it, so I think it's a worthwhile read.)

It starts with a message encoding identifier (1 character), which can be any one of raw-binary, base64, base32 etc:

<message encoding [1 char]>

Now that we're in binary, a version number for the container format (varint):

<container version [varint]>
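Since everything after the encoding character is varint-based, here is a minimal sketch of unsigned varint helpers in TypeScript. The draft doesn't pin down the exact varint flavor, so LEB128-style encoding (7 bits per byte, high bit as continuation flag) and the function names are assumptions:

```ts
// Minimal unsigned varint (LEB128-style) helpers -- one plausible choice,
// not something mandated by the draft.
function encodeVarint(value: number): Uint8Array {
  const bytes: number[] = [];
  do {
    let b = value & 0x7f;
    value = Math.floor(value / 128);
    if (value > 0) b |= 0x80; // set continuation bit
    bytes.push(b);
  } while (value > 0);
  return Uint8Array.from(bytes);
}

function decodeVarint(buf: Uint8Array, offset = 0): { value: number; bytesRead: number } {
  let value = 0;
  let shift = 0;
  let bytesRead = 0;
  for (;;) {
    const b = buf[offset + bytesRead++];
    value += (b & 0x7f) * 2 ** shift; // multiply instead of << to avoid 32-bit overflow
    shift += 7;
    if ((b & 0x80) === 0) break;
  }
  return { value, bytesRead };
}
```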

Now a varint for an entity domain identifier (e.g. file, ipfs, ipns, https, bitcoin, ethereum etc.):

<entity domain [varint]>

And now a varint for the version number of the domain's schema (each domain independently maintains its own schema versioning):

<domain-specific schema version [varint]>

Now the base payload (AKA required fields), whose schema is specialized for the particular domain and version number (the total length is included to allow a client to segment it even if it is unfamiliar with the particular combination):

<base payload length [varint]>
<base payload [arbitrary binary layout - can be variable length]>

And now field data (AKA optional fields), in a simplified, protocol-buffer-like encoding (roughly described below):

<field data [unspecified total length]>

That's all really. It's not bound to contain a hash of any sort, or to be associated with a particular category within a set of predefined codec types.

Example: say we want to encode [raw-binary, container version 2, IPFS, schema version 1]. The first required field would be the resource type (say it's a UnixFS file), which in turn refines the schema further to expect <dag hash type [varint]> and <dag hash [binary string]> as the following fields.

The base document would look something like:

<encoding: "b" [1 character]>
<container version: 2 [1 byte]>
<entity domain: IPFS [1 byte]>
<domain-specific schema version: 1 [1 byte]>
<base payload length: 34 [1 byte]>
<resource type: UnixFS file [1 byte]>
<dag hash type: sha-256 [1 byte]>
<dag hash [32 bytes]>

(Total length: 1 char + 38 bytes)
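For concreteness, here is a sketch of assembling that example container. All numeric codes (entity domain IPFS = 1, resource type UnixFS file = 1, hash type sha-256 = 0x12) are illustrative placeholders rather than an agreed-upon table, and each happens to fit in a single varint byte:

```ts
// Assemble the example container: 1-char encoding identifier + 38 bytes.
function buildExampleContainer(dagHash: Uint8Array): Uint8Array {
  if (dagHash.length !== 32) throw new Error("expected a 32-byte sha-256 digest");

  // Base payload: resource type + dag hash type + dag hash (1 + 1 + 32 = 34 bytes).
  const basePayload = concat([Uint8Array.of(1), Uint8Array.of(0x12), dagHash]);

  return concat([
    new TextEncoder().encode("b"),     // message encoding identifier (1 char)
    Uint8Array.of(2),                  // container version (single-byte varint)
    Uint8Array.of(1),                  // entity domain: IPFS (placeholder code)
    Uint8Array.of(1),                  // domain-specific schema version
    Uint8Array.of(basePayload.length), // base payload length: 34
    basePayload,
  ]);
}

function concat(parts: Uint8Array[]): Uint8Array {
  const out = new Uint8Array(parts.reduce((n, p) => n + p.length, 0));
  let offset = 0;
  for (const p of parts) { out.set(p, offset); offset += p.length; }
  return out;
}
```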

Optional fields:

Each optional field is structured as:

<data type and field identifier [varint]>
<field payload>

Where the first bit of the data type and field identifier varint represents the type, and the remaining bits represent the field identifier (specific to the particular schema), which can grow indefinitely since it's a varint (a key that fits into a single byte leaves 6 bits for the identifier, which supports up to 64 different field IDs).

Data type can be:

0: varint 
1: length prepended binary string (where length is a varint)

(I'm not sure if there's a need for anything else, since booleans can be contained in bitfields and floats can be stored in binary strings)
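Here is a small sketch of the field-key encoding, assuming the low bit of the key varint carries the data type and the remaining bits carry the field identifier (the draft doesn't fix the bit order). It reuses encodeVarint from the varint sketch and concat from the container sketch above:

```ts
const TYPE_VARINT = 0; // field payload is a varint
const TYPE_BINARY = 1; // field payload is a length-prepended binary string

function encodeFieldKey(fieldId: number, dataType: 0 | 1): Uint8Array {
  return encodeVarint(fieldId * 2 + dataType);
}

function encodeBinaryField(fieldId: number, payload: Uint8Array): Uint8Array {
  // Key, then payload length as a varint, then the payload bytes.
  return concat([encodeFieldKey(fieldId, TYPE_BINARY), encodeVarint(payload.length), payload]);
}

// Field IDs 0..63 fit, together with the type bit, into a single key byte:
encodeFieldKey(63, TYPE_BINARY).length;   // 1
// The application-reserved example below (#4096) needs a two-byte key:
encodeFieldKey(4096, TYPE_BINARY).length; // 2
```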

So let's say, for the example, we wanted to add file size, chunking algorithm and max chunk size as optional fields to the base CID:

<data type: 0, field id: file size (#0) [1 byte]>
<field payload [6 bytes]>
<data type: 0, field id: chunking algorithm (#1) [1 byte]>
<field payload [1 byte]>
<data type: 0, field id: max chunk size (#2) [1 byte]>
<field payload [3 bytes]>

Totals: file size 7 bytes, chunking algorithm 2 bytes, max chunk size 4 bytes. Of course, if the information cannot be represented here (say, chunking is variable), it may simply not be included at all.
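As a sketch, the three optional fields could be encoded like this, with placeholder values picked so the payload widths match the byte counts above (reusing the helpers from the earlier sketches):

```ts
const optionalFields = concat([
  encodeFieldKey(0, TYPE_VARINT), encodeVarint(1_500_000_000_000), // file size: 1 + 6 bytes
  encodeFieldKey(1, TYPE_VARINT), encodeVarint(3),                 // chunking algorithm: 1 + 1 bytes
  encodeFieldKey(2, TYPE_VARINT), encodeVarint(262_144),           // max chunk size: 1 + 3 bytes
]);
// optionalFields.length === 13   (7 + 2 + 4)
```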

Now let's say the user also wants to add a signature for the hash, and that is not supported in the base schema, so they would need to use their own application-specific field identifier in a reserved range (for this example, say 4096+ is reserved; 4096 is roughly midway within the range available for 2-byte identifiers).

<data type: 1, field id: hmac-sha-256 hash signature (#4096) [2 bytes]>
<field payload [1 for length + 32 bytes for data]>

Even if the client doesn't understand this field, it can safely ignore and skip it since all the length information is available through the encoding itself.

Note that it's possible to standardize identifiers within the range 4096+ as application reserved globally for all domains. This would mean that application-specific fields could be added to a document even if its schema is not understood by the client.
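A decoder for the field-data section could look roughly like the sketch below: because only the two wire types are needed to find each field's end, unknown field IDs (including anything in an application-reserved range like 4096+) can be skipped safely. It reuses decodeVarint and TYPE_VARINT from the earlier sketches:

```ts
// Walk the field-data section, yielding each field; callers ignore field IDs
// they don't recognize.
function* iterateFields(buf: Uint8Array, offset = 0) {
  while (offset < buf.length) {
    const key = decodeVarint(buf, offset);
    offset += key.bytesRead;
    const fieldId = Math.floor(key.value / 2);

    if ((key.value & 1) === TYPE_VARINT) {
      const v = decodeVarint(buf, offset);
      offset += v.bytesRead;
      yield { fieldId, value: v.value };
    } else {
      const len = decodeVarint(buf, offset);
      offset += len.bytesRead;
      yield { fieldId, value: buf.subarray(offset, offset + len.value) };
      offset += len.value; // skip past the binary payload
    }
  }
}
```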

@rotemdan rotemdan changed the title Compact self-describing content descriptor scheme based on progressive specialization Flexible self-describing content descriptor scheme based on progressive specialization Sep 27, 2018
@rotemdan rotemdan changed the title Flexible self-describing content descriptor scheme based on progressive specialization Flexible content descriptor scheme based on progressive specialization Sep 27, 2018
@rotemdan rotemdan changed the title Flexible content descriptor scheme based on progressive specialization Universal metadata container based on progressive specialization Oct 1, 2018
@rotemdan

rotemdan commented Oct 1, 2018

I've made some major changes, especially to generalize the terminology:

  1. Flexible content descriptor -> Universal container format
  2. CID -> Container
  3. Protocol -> Entity domain
  4. Required fields -> Base payload
  5. Optional fields -> Field data

and removed resource type as a built-in field, since not all domains/protocols would need it.

It turned out to be a significant challenge to describe this in a clear manner, so I might come back to polish it a little more. Since I think it has reached a reasonably stable form, I would be interested in getting some feedback. Any questions? Suggestions for improvement? Clarifications?

@rotemdan rotemdan changed the title Universal metadata container based on progressive specialization Universal container format based on progressive specialization Oct 1, 2018
@geoah

geoah commented Nov 19, 2018

Is this effort part of cid/ipfs/ipld/multiformats or something different/new?

@eikeon

eikeon commented Jan 9, 2021

(quoting rotemdan's comment from Oct 1, 2018, above)

@rotemdan, It's been a couple of years. Curious if you've continued down this path?

@rotemdan

rotemdan commented Jan 9, 2021

This is an idea I suggested several years ago with the purpose of potentially unifying content identifiers and IPLD documents.

Basically, the idea was to have one highly flexible data format that could describe anything, and that would be compact enough to be transmitted as a link (albeit possibly a long one).

This means that reasonably simple/small files would not require fetching an additional metadata (IPLD) file from the network. The link would contain all the hashing information and the extra metadata required to safely retrieve and verify the data (not just from IPFS, but also from http, bittorrent or potentially any other protocol). As long as you have the link, you'd still have a chance of safely acquiring the file from somewhere. This is in contrast to IPFS, where once the IPLD document becomes unavailable, the associated data cannot be retrieved or verified.

In a sense, it presents a vision that's quite different from the way IPFS was initially designed. It's not bound to a set of predetermined protocols, and is not "locked" to the IPFS ecosystem.

Since I never got any comment about this idea from IPFS team members, I'm assuming it either doesn't fit their business model or is too much of a departure from the founders' original design of the network, to the point where they may feel that going in this particular direction would diminish their sense of "ownership" of their own product.
