This repository has been archived by the owner on Dec 6, 2022. It is now read-only.

Reproducible File Imports #15

Closed

Stebalien opened this issue Sep 26, 2018 · 14 comments

Comments

@Stebalien

I'd like to be able to encode the chunking algorithm/add options used inside a file's metadata. This would make it easier to reproducibly add files to unixfs (and verify them).

Use-cases:

  • Archival.
  • Verifying responses from a gateway. See: Reproducible CID multiformats/cid#22
  • "Convergent file adding"(?). Basically, if I already have some large files contained in an IPFS directory tree, I'd like to be able to add them myself (locally) instead of having to download them from a peer (just to get the correct chunking/add options).

This shouldn't add too much additional metadata and will add no additional metadata for files under 256KiB (they'll fit in a single object).
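A minimal sketch (in Go, with purely hypothetical field names; nothing here is specified anywhere) of the kind of record this would add to a file's metadata:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ImportParams is an illustrative record of the options used when a file
// was added. None of these field names are part of any spec.
type ImportParams struct {
	Chunker   string `json:"chunker"`             // e.g. "fixed" or "rabin"
	ChunkSize int    `json:"chunkSize,omitempty"` // target chunk size in bytes
	Layout    string `json:"layout"`              // e.g. "balanced" or "trickle"
	MaxLinks  int    `json:"maxLinks"`            // fanout of intermediate nodes
	RawLeaves bool   `json:"rawLeaves"`
	HashFunc  string `json:"hashFunc"`            // e.g. "sha2-256"
}

func main() {
	p := ImportParams{
		Chunker:   "fixed",
		ChunkSize: 262144,
		Layout:    "balanced",
		MaxLinks:  174,
		RawLeaves: true,
		HashFunc:  "sha2-256",
	}
	b, _ := json.Marshal(p)
	fmt.Println(string(b)) // this blob would live in the file's metadata section
}
```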

@mikeal
Contributor

mikeal commented Sep 26, 2018

Are existing definitions of chunkers standardized enough that we could rely on them, or are we going to have to specify them ourselves if we want to ensure implementations interoperate?

For instance, we know that when we chunk media files we should respect the keyframe/header boundaries for better seeking/range requests but I'm not 100% certain that there's a standardized way of describing that chunker.

Similarly, how reliable is Rabin across different implementations of the algorithm?
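For context on what would need specifying: even the simplest chunker is a deterministic function of the input plus its parameters, so two importers only converge on the same blocks if they agree on every parameter. A trivial fixed-size sketch in Go (Rabin just adds more parameters: polynomial, window, min/avg/max sizes):

```go
package main

import "fmt"

// chunkFixed splits data into fixed-size chunks; the resulting block
// boundaries (and therefore the hashes) depend entirely on `size`.
func chunkFixed(data []byte, size int) [][]byte {
	var chunks [][]byte
	for len(data) > 0 {
		n := size
		if len(data) < n {
			n = len(data)
		}
		chunks = append(chunks, data[:n])
		data = data[n:]
	}
	return chunks
}

func main() {
	data := make([]byte, 1000)
	fmt.Println(len(chunkFixed(data, 256))) // 4 chunks
	fmt.Println(len(chunkFixed(data, 512))) // 2 chunks: different boundaries, different hashes
}
```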

@Stebalien
Author

We'll have to come up with a naming scheme, unfortunately. This falls under the DEX project (ipfs/notes#204). Ideally we'd just point at a webasm program but that's probably not going to happen for a while...

Similarly, how reliable is Rabin across different implementations of the algorithm?

Uh... no idea?

@mikeal
Contributor

mikeal commented Sep 27, 2018

Another thing to think about is how sharded file data is represented when it's too large for a single node. https://github.com/ipfs/unixfs-v2/pull/13/files#diff-916b3e1e005fd96e3a0546715235477dR38

If one implementation limits the data array to 100 chunks per node and compacts links starting at the first element, while another limits it to 1000 chunks and compacts into the end of the array backwards (it sounds awkward, but it's probably more efficient to make access to the earliest parts of the file faster), then we also have a reproducibility problem.

We can either specify in great detail how this MUST be done in the data spec or we can try to define all the different methods people might want to use and add it to the metadata we're talking about here.
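To make the fanout point concrete, here's a small sketch (arbitrary numbers, stand-in link type): the same leaves grouped under different per-node limits yield a different intermediate structure and therefore a different root, even when every leaf block is identical.

```go
package main

import "fmt"

// groupLinks packs leaf links into intermediate nodes of at most maxLinks
// each, front to back. The fanout limit (and the packing direction) is
// itself a reproducibility parameter.
func groupLinks(leaves []string, maxLinks int) [][]string {
	var nodes [][]string
	for len(leaves) > 0 {
		n := maxLinks
		if len(leaves) < n {
			n = len(leaves)
		}
		nodes = append(nodes, leaves[:n])
		leaves = leaves[n:]
	}
	return nodes
}

func main() {
	leaves := make([]string, 1000)            // 1000 identical leaf chunks (illustrative)
	fmt.Println(len(groupLinks(leaves, 100))) // 10 intermediate nodes
	fmt.Println(len(groupLinks(leaves, 174))) // 6 intermediate nodes, different shape, different root
}
```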

@Stebalien
Author

Yeah, I don't think we'll be able to come up with a clean, complete language for describing all possible chunking algorithms. I'd just like to record this information when possible.

Really, we'll probably record it in the file's general "metadata" section so we can figure out how exactly we want to do this later. We should just keep it in mind.

@mikeal
Contributor

mikeal commented Sep 27, 2018

Really, we'll probably record it in the file's general "metadata" section

Any thoughts on whether we want a top-level property specific to how the data field is constructed, or whether we should just standardize a property inside the existing top-level metadata property?

I'm just wondering if we want to create a separation between metadata that unixfs implementations themselves are adding vs. metadata added by higher-level actors. We have threads elsewhere pointing out that some people would like to store Content-Type metadata, and I wonder if we're worried about conflicts in the same metadata namespace.

@Stebalien
Author

So, HTTP tried this with X- headers and, well, we all know how that went down. Basically, I'm not convinced we need to separate canonical from non-canonical.

Note: I'd still have a separate "metadata" section (separate from critical file information like size, etc.). However, I feel like forcing a canonical location and an "extra fields" location will lead to metadata duplication for compatibility.
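Roughly the shape I have in mind, sketched in Go with illustrative names only: fixed fields for the critical information, one open metadata map for everything else, and no canonical-vs-extra split.

```go
package main

import "fmt"

// File is an illustrative shape only: critical information in fixed fields,
// everything else sharing one open metadata namespace (no X- style prefixes,
// no separate "canonical" vs "extra" sections).
type File struct {
	Size     uint64                 // critical: always present
	Data     []byte                 // or links to chunks, for large files
	Metadata map[string]interface{} // open namespace for everything else
}

func main() {
	f := File{
		Size: 1 << 20,
		Metadata: map[string]interface{}{
			"mime":    "video/mp4",    // e.g. Content-Type style info
			"chunker": "rabin-262144", // e.g. recorded import parameters (illustrative encoding)
		},
	}
	fmt.Println(f.Metadata["chunker"])
}
```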

@Kubuxu

Kubuxu commented Sep 27, 2018

For unixfs we will want to have extended attributes; we could store the chunking information as one. Then, if a file is downloaded using ipfs get, the chunking information can be saved to the filesystem if it supports extended attributes.
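A sketch of that last step, assuming a Linux filesystem with xattr support and using golang.org/x/sys/unix; the attribute name here is made up for illustration, not anything ipfs get writes today.

```go
package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

// saveChunkerXattr stores a description of the chunking parameters as an
// extended attribute on the written-out file. "user.ipfs.chunker" is a
// hypothetical attribute name.
func saveChunkerXattr(path, chunkerDesc string) error {
	return unix.Setxattr(path, "user.ipfs.chunker", []byte(chunkerDesc), 0)
}

func main() {
	path := "/tmp/example.bin"
	if err := os.WriteFile(path, []byte("file contents"), 0o644); err != nil {
		log.Fatal(err)
	}
	if err := saveChunkerXattr(path, "rabin-min128k-avg256k-max384k"); err != nil {
		log.Fatal(err) // fails on filesystems without xattr support
	}
}
```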

@mikeal
Contributor

mikeal commented Sep 27, 2018

So, HTTP tried this with X- headers and, well, we all know how that went down. Basically, I'm not convinced we need to separate canonical from non-canonical.

Good point. Dropping the prefixing helps the migration from de-facto standard to actual standard.

@warpfork

warpfork commented Oct 7, 2018

Has anyone tried spec'ing this out from the opposite direction -- what user stories do we have that really demand having parameterized chunkers?

Have we spent enough time considering our options for staking out a position on the reductionist/simplicity side here?

I don't understand why we parameterize chunkers.

I think we shouldn't.

Ecosystemically, using various parameterizations of chunkers is an antifeature: it adds complexity, and using more than one parameterization is a net loss for the entire system because it both breaks deduplication and raises reproducibility questions like this one.

I'm not aware of any other systems deployed in the wild which have significant usage and support parameterized chunking. Git doesn't. Venti didn't. I'm pretty sure from the Backblaze (etc.) blogs that they don't. Casync does, oddly, but it's new at this (and arguably not designed for global pools, which changes things: you'd still never want to point casync commands with different chunking parameters at the same storage pool, and iirc that's fairly loudly documented).

At most, changing major parameters like chunking algorithm should be treated as a migration, and handled extremely cautiously, because the cost of having more than one value active in the system is massive.

I can understand parameterized chunkers as a library; I can't as a user-facing tool, because no reasonable UX should foist the choice of chunker on a user who A) doesn't care, and B) can only possibly make wrong choices, per "using more than one value is a net loss for the system". If someone wants to use our libraries to build new tools with different values, that's fine. But our ecosystem should be an ecosystem: part of things working well together involves picking concrete values for these things so users don't have to.

For unixfs we will want to have extended attributes, we could store it as one.

Please no.

Xattrs are already one of the swampiest string-string bags in linux. Let's not add to them. Do we really seriously even want to consider fragmenting our already-fragmented-by-variable-chunking files by putting the chunking parameters in another header that makes even more hashes tend towards not converging?

Similarly, how reliable is Rabin across different implementations of the algorithm?

Extremely, and if it's not, it's a critical bug. A Rabin fingerprint is supposed to be not far off from the complexity of a CRC. There should be test vectors. There is no 'close'; there is correct and not correct. Non-identical behavior of a Rabin fingerprinting implementation is exactly as wrong as non-identical behavior of a function that calls itself "sha1": there is no acceptable amount of mismatch.
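The kind of test-vector check that implies, sketched with a stand-in fixed-cut chunker (a real suite would pin down the actual Rabin polynomial, window, and offsets):

```go
package main

import "fmt"

// cutPoints stands in for any chunker; here it cuts every 256 bytes.
// Real test vectors would cover the actual Rabin parameters.
func cutPoints(data []byte) []int {
	var cuts []int
	for off := 256; off < len(data); off += 256 {
		cuts = append(cuts, off)
	}
	return cuts
}

func main() {
	data := make([]byte, 1024)
	for i := range data {
		data[i] = byte(i * 31) // deterministic test input
	}
	expected := []int{256, 512, 768} // the "test vector"
	got := cutPoints(data)
	// There is no "close enough": the offsets must match exactly.
	fmt.Println(fmt.Sprint(got) == fmt.Sprint(expected))
}
```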

@warpfork

warpfork commented Oct 7, 2018

Tl;dr: Logging a bunch of meta info does not give reproducibility/convergence when multiple uncoordinated users upload the same content. The "uncoordinated" part is important.

(My perspective on this is shaped a lot by working on Repeatr, which cares deeply about this kind of convergence, because we want to use the hashes of filesystems to check equality, and if that property isn't available without {unreasonably vast amounts of additional configuration}, then we've got... well, we've got something that doesn't work.)

@mikeal
Contributor

mikeal commented Oct 7, 2018

Has anyone tried spec'ing this out from the opposite direction -- what user stories do we have that really demands having parameterized chunkers?

There's a broad collection of use cases where files mutate and the changes need to be synced to a client who has the old version of the graph.

In these cases it's important for the party doing the mutation to know how the data was chunked, so that when it chunks the new data it follows the same parameters. If it doesn't, the new representation will have far more new blocks than it would have if it had followed the chunking settings of the prior chunker.
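A toy illustration of that effect (stand-in chunker and hash, arbitrary sizes): re-chunking an appended file with the recorded chunk size keeps the unchanged blocks identical, while a different size changes every block.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// chunkHashes fixed-size-chunks data and returns a short hash per chunk.
// Both the chunker and the hash truncation are purely illustrative.
func chunkHashes(data []byte, size int) []string {
	var hs []string
	for len(data) > 0 {
		n := size
		if len(data) < n {
			n = len(data)
		}
		sum := sha256.Sum256(data[:n])
		hs = append(hs, fmt.Sprintf("%x", sum[:4]))
		data = data[n:]
	}
	return hs
}

func main() {
	old := make([]byte, 1000)
	updated := append(append([]byte{}, old...), []byte("appended data")...)

	oldHashes := chunkHashes(old, 256)
	sameParams := chunkHashes(updated, 256)
	otherParams := chunkHashes(updated, 300)

	fmt.Println(sameParams[0] == oldHashes[0])  // true: leading blocks reused
	fmt.Println(otherParams[0] == oldHashes[0]) // false: every block differs
}
```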

I'm not aware of any other systems deployed in the wild which have significant usage and support parameterized chunking. Git doesn't. Venti didn't. I'm pretty sure from Backblaze, etc, blogs that they don't. Casync does

All of these tools are more specific to a single use case than we are. They have chunking algorithms specific to that use case and can easily assume that their other clients will follow the same logic.

Since we want to support use cases that would place conflicting requirements on the chunker, we need a way to state which method was used at encode time in order to be more interoperable.

It's worth noting that these settings only tell another client how the data was initially encoded; they don't force it to use the same settings. Another client may not even support the algorithm used by the original encoder of the data and may decide to use something else entirely.

Tl;dr: Logging a bunch of meta info does not give reproducibility/convergence when multiple uncoordinated users upload the same content.

Correct, it is mostly useful when one actor uploads content and another modifies it.

For unixfs we will want to have extended attributes, we could store it as one.

Please no.

I think we might be mixing up "extended attributes" with the need for some sort of "meta" property that we allow clients to populate with whatever arbitrary information they want. "extended attributes" implies that these attributes would be serialized into a filesystem that supported extended attributes and I don't believe that is the goal here.

@jbenet

jbenet commented Oct 14, 2018

Hey-- @warpfork mentioned this issue to me.

Heads up that:

@mikeal
Contributor

mikeal commented Oct 15, 2018

In the interest of getting the spec ready for use this quarter I'm going to table this.

The spec will include a meta field we can use to store this information. Based on usage we'll standardize something along the lines of fmtstr in the future.

@rvagg
Member

rvagg commented Dec 6, 2022

closing for archival

@rvagg rvagg closed this as completed Dec 6, 2022